## Concept1. Tokenization with nltk

**Splitting a string into a list of words is known as tokenization.**

In [1]:
from nltk.tokenize import word_tokenize 
from nltk.tokenize import wordpunct_tokenize


In [2]:
s = "hi, how are you doing today?"

word_split = s.split()
print(word_split)

['hi,', 'how', 'are', 'you', 'doing', 'today?']


In [3]:
# str.split() leaves punctuation (the comma and the question mark) attached to the words, even though
# it adds little to the meaning of the sentence. So, we're using NLTK's word_tokenize instead.

s = "hi, how are you doing today?"
word_tokens = word_tokenize(s)
print(word_tokens)

['hi', ',', 'how', 'are', 'you', 'doing', 'today', '?']


In [4]:
word_tokens_punc = wordpunct_tokenize(s)
print(word_tokens_punc)

['hi', ',', 'how', 'are', 'you', 'doing', 'today', '?']
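
For this sentence both tokenizers agree, but they can differ on contractions. As a minimal sketch, `wordpunct_tokenize` behaves like the regular expression `\w+|[^\w\s]+` applied with `re.findall`, so it can be approximated with the standard library alone. The helper name `regex_tokenize` is made up for this illustration.

```python
import re

# Same pattern that NLTK's WordPunctTokenizer is built on:
# runs of word characters, or runs of non-space punctuation.
PATTERN = re.compile(r"\w+|[^\w\s]+")

def regex_tokenize(text):
    """Hypothetical helper: split text into word and punctuation tokens."""
    return PATTERN.findall(text)

print(regex_tokenize("hi, how are you doing today?"))
# ['hi', ',', 'how', 'are', 'you', 'doing', 'today', '?']

print(regex_tokenize("let's see if this works!"))
# ['let', "'", 's', 'see', 'if', 'this', 'works', '!']
```

`word_tokenize`, by contrast, keeps the contraction suffix together as `'s`, which is why `"'s"` shows up as a single token when it is plugged into CountVectorizer later on.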


## Concept1b. Tokenization & word indexing using tf-keras

### Tokenizer API

**Tokenizer can be fit on raw text or integer encoded text documents.**

**Once fit, the Tokenizer provides 4 attributes that you can use to learn about your documents:**

- `word_counts`: A dictionary of words and their counts.
- `word_docs`: A dictionary of words and how many documents each appeared in.
- `word_index`: A dictionary of words and their uniquely assigned integers.
- `document_count`: An integer count of the total number of documents that were used to fit the Tokenizer.

In [22]:
from tensorflow.keras.preprocessing.text import Tokenizer
docs = ['Well done!',
'Good work',
'Great effort',
'nice work',
'Excellent!']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)

print("Four attributes:")

# A dictionary of words and their counts.
print(f"\nword counts: {tokenizer.word_counts}")

# A dictionary of words and how many documents each appeared in.
print(f"\nword docs: {tokenizer.word_docs}")

# A dictionary of words and their uniquely assigned integers.
print(f"\nword index: {tokenizer.word_index}")

# An integer count of the total number of documents that were used to fit the Tokenizer.
print(f"\ndocument count: {tokenizer.document_count}")

Four attributes:

word counts: OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 2), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)])

word docs: defaultdict(<class 'int'>, {'well': 1, 'done': 1, 'work': 2, 'good': 1, 'great': 1, 'effort': 1, 'nice': 1, 'excellent': 1})

word index: {'work': 1, 'well': 2, 'done': 3, 'good': 4, 'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}

document count: 5
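
To make the four attributes concrete, here is a minimal pure-Python sketch of the bookkeeping that `fit_on_texts` performs, assuming the default settings (lowercasing, punctuation filtered out, split on whitespace). The helper names `simple_tokens` and `fit_counts` are hypothetical, and this is an illustration rather than the Keras implementation.

```python
from collections import Counter

# Characters the sketch treats as punctuation (assumed to mirror the Keras default filter).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
TABLE = str.maketrans({ch: " " for ch in FILTERS})

def simple_tokens(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return text.lower().translate(TABLE).split()

def fit_counts(docs):
    word_counts = Counter()   # word -> total occurrences
    word_docs = Counter()     # word -> number of documents containing it
    for doc in docs:
        tokens = simple_tokens(doc)
        word_counts.update(tokens)
        word_docs.update(set(tokens))
    # Most frequent word gets index 1; ties keep first-seen order (stable sort).
    ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
    word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}
    return word_counts, word_docs, word_index, len(docs)

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']
word_counts, word_docs, word_index, document_count = fit_counts(docs)
print(word_index)
print(document_count)
```

On the five example documents this yields the same `word_index` as the output above, with `work` at index 1 because it is the only word that occurs twice.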


In [15]:
from tensorflow.keras.preprocessing import text
corpus = [
"hello, how are you?",
"im getting bored at home. And you? What do you think?",
"did you know about counts",
"let's see if this works!",
"YES!!!!"
]

tokenizer = text.Tokenizer(num_words=100)
tokenizer.fit_on_texts(corpus)

corpus_sequences = tokenizer.texts_to_sequences(corpus)
print(len(tokenizer.word_index))
print(tokenizer.word_index)
print("\nCorpus_sequences assigning an integer to each token:", corpus_sequences)

23
{'you': 1, 'hello': 2, 'how': 3, 'are': 4, 'im': 5, 'getting': 6, 'bored': 7, 'at': 8, 'home': 9, 'and': 10, 'what': 11, 'do': 12, 'think': 13, 'did': 14, 'know': 15, 'about': 16, 'counts': 17, "let's": 18, 'see': 19, 'if': 20, 'this': 21, 'works': 22, 'yes': 23}

Corpus_sequences assigning an integer to each token: [[2, 3, 4, 1], [5, 6, 7, 8, 9, 10, 1, 11, 12, 1, 13], [14, 1, 15, 16, 17], [18, 19, 20, 21, 22], [23]]
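
Conceptually, `texts_to_sequences` is a dictionary lookup per token. The sketch below rebuilds the index in plain Python and uses hypothetical helpers (`simple_tokens`, `build_word_index`, `to_sequences`). It assumes that `num_words` only keeps indices strictly below `num_words` during conversion, while `word_index` itself still lists every word.

```python
from collections import Counter

FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'  # assumed default punctuation filter
TABLE = str.maketrans({ch: " " for ch in FILTERS})

def simple_tokens(text):
    return text.lower().translate(TABLE).split()

def build_word_index(texts):
    counts = Counter(tok for t in texts for tok in simple_tokens(t))
    # Stable sort: most frequent first, ties keep first-seen order.
    ranked = sorted(counts, key=counts.get, reverse=True)
    return {word: i for i, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words=None):
    sequences = []
    for t in texts:
        ids = [word_index[tok] for tok in simple_tokens(t) if tok in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]  # drop the rarer words
        sequences.append(ids)
    return sequences

corpus = [
    "hello, how are you?",
    "im getting bored at home. And you? What do you think?",
    "did you know about counts",
    "let's see if this works!",
    "YES!!!!",
]
word_index = build_word_index(corpus)
print(to_sequences(corpus, word_index, num_words=100))
print(to_sequences(corpus, word_index, num_words=3))
# [[2, 1], [1, 1], [1], [], []]
```

With `num_words=100` nothing is dropped because the vocabulary has only 23 words; with `num_words=3` only the two most frequent words survive.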


### text_to_word_sequence & one_hot

**Keras provides the `text_to_word_sequence()` function that you can use to split text into a list of words. By
default, this function automatically does 3 things:**

- Splits words by space.
- Filters out punctuation.
- Converts text to lowercase (`lower=True`).

**Also, the integers that `one_hot()` assigns to the words can change every time the kernel is restarted, because by default it relies on Python's built-in `hash`, which is randomized per interpreter session for strings.**

In [1]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from tensorflow.keras.preprocessing.text import one_hot
text = 'The quick brown fox jumped over the lazy dog.'
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(f"Unique words in the sentence: {vocab_size}")

## Despite its name, one_hot does not one-hot encode; it hashes each word to an integer between 1 and n - 1.
result = one_hot(text, round(vocab_size*1.3))
print(result)

Unique words in the sentence: 8
[7, 4, 8, 2, 2, 4, 7, 3, 4]


### Hash Encoding with hashing trick

**It's like `one_hot()`, but it also provides more flexibility, allowing you to specify
the hash function as either `hash` (the default) or other hash functions such as the built-in `md5`
function or your own function.**

**Unlike the default `hash`, a different hash function such as `md5` gives consistent integers for words across sessions,
though different ones from those produced by the `one_hot()` function.**

In [2]:
from tensorflow.keras.preprocessing import text
from tensorflow.keras.preprocessing.text import hashing_trick
data = 'The quick brown fox jumped over the lazy dog.'
tokens = set(text.text_to_word_sequence(data))
vocab_size = len(tokens)

print(f"Unique words in the sentence: {vocab_size}")
results = hashing_trick(data, round(vocab_size*1.4), hash_function="md5")
print(results)

Unique words in the sentence: 8
[10, 5, 8, 9, 10, 8, 10, 3, 6]
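
As a rough sketch of what happens under the hood, the hashing trick maps each word to `hash(word) % (n - 1) + 1`, so index 0 stays free and every value falls in `1..n-1`. The version below uses only `hashlib` and is an illustration under that assumption, not the Keras source; `md5_index` and `hash_words` are hypothetical names.

```python
import hashlib

def md5_index(word, n):
    """Map a word to a stable integer in 1..n-1 using md5."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

def hash_words(words, n):
    return [md5_index(w, n) for w in words]

words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
n = round(len(set(words)) * 1.4)   # 8 unique words -> 11
encoded = hash_words(words, n)
print(encoded)

# Both occurrences of 'the' get the same integer, on every run and every machine.
print(encoded[0] == encoded[6])
```

Because the range of integers is small, different words can collide on the same value, as `the` and `jumped` both do in the output above (10). This is the price of not having to store a vocabulary.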


## Concept2. Bag of words

**In bag of words, we create a huge sparse matrix that stores counts of all the words in our corpus (corpus = all the documents = all the sentences).**

**For this, we will use CountVectorizer from scikit-learn.**
The way CountVectorizer works is it first tokenizes the sentence and then assigns a
value to each token. So, each token is represented by a unique index. These unique
indices are the columns that we see.

In [41]:
from sklearn.feature_extraction.text import CountVectorizer

# create a corpus of sentences
corpus = [
"hello, how are you?",
"im getting bored at home. And you? What do you think?",
"did you know about counts",
"let's see if this works!",
"YES!!!!"
]

ctv = CountVectorizer()
ctv.fit(corpus)

corpus_transformed = ctv.transform(corpus)  ## <class 'scipy.sparse.csr.csr_matrix'> i.e. sparse_matrix
print(ctv.vocabulary_)
print("\n Stopwords----", ctv.stop_words_)

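print("Features i.e. words from a given sentences")
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out().
print(ctv.get_feature_names())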

{'hello': 9, 'how': 11, 'are': 2, 'you': 22, 'im': 13, 'getting': 8, 'bored': 4, 'at': 3, 'home': 10, 'and': 1, 'what': 19, 'do': 7, 'think': 17, 'did': 6, 'know': 14, 'about': 0, 'counts': 5, 'let': 15, 'see': 16, 'if': 12, 'this': 18, 'works': 20, 'yes': 21}

 Stopwords---- set()
Features i.e. words from a given sentences
['about', 'and', 'are', 'at', 'bored', 'counts', 'did', 'do', 'getting', 'hello', 'home', 'how', 'if', 'im', 'know', 'let', 'see', 'think', 'this', 'what', 'works', 'yes', 'you']


In [32]:
## Each entry of the sparse output is (sentence index, vocabulary index) followed by the count.
## e.g. (0, 2) -- 0 means the 1st sentence & 2 is the index of "are" in the vocabulary.
## We see that index 22 belongs to "you", and in the second sentence we have used
## "you" twice. Thus, the count is 2.

In [42]:
print(type(corpus_transformed))
print(corpus_transformed)  

<class 'scipy.sparse.csr.csr_matrix'>
  (0, 2)	1
  (0, 9)	1
  (0, 11)	1
  (0, 22)	1
  (1, 1)	1
  (1, 3)	1
  (1, 4)	1
  (1, 7)	1
  (1, 8)	1
  (1, 10)	1
  (1, 13)	1
  (1, 17)	1
  (1, 19)	1
  (1, 22)	2
  (2, 0)	1
  (2, 5)	1
  (2, 6)	1
  (2, 14)	1
  (2, 22)	1
  (3, 12)	1
  (3, 15)	1
  (3, 16)	1
  (3, 18)	1
  (3, 20)	1
  (4, 21)	1
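
The counts above can be reproduced by hand. The following is a minimal sketch, assuming CountVectorizer's default behaviour of lowercasing, keeping tokens of two or more word characters, and sorting the vocabulary alphabetically. The function name `bag_of_words` is hypothetical, and the result is a dense list of lists rather than a sparse matrix.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # tokens of 2+ word characters

def bag_of_words(corpus):
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in corpus]
    # Alphabetically sorted vocabulary: word -> column index.
    unique_words = sorted({tok for doc in tokenized for tok in doc})
    vocabulary = {word: i for i, word in enumerate(unique_words)}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One column per vocabulary word, in index order.
        rows.append([counts.get(word, 0) for word in vocabulary])
    return vocabulary, rows

corpus = [
    "hello, how are you?",
    "im getting bored at home. And you? What do you think?",
    "did you know about counts",
    "let's see if this works!",
    "YES!!!!",
]
vocabulary, rows = bag_of_words(corpus)
print(vocabulary["you"])   # 22
print(rows[1][22])         # 2, because "you" appears twice in the second sentence
```

Most entries in each row are zero, which is why scikit-learn stores the result as a sparse matrix and prints only the non-zero cells.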


#### Note:
Above, special characters were missing. Let’s integrate word_tokenize from NLTK into CountVectorizer and see what happens.

In [35]:
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# create a corpus of sentences
corpus = [
"hello, how are you?",
"im getting bored at home. And you? What do you think?",
"did you know about counts",
"let's see if this works!",
"YES!!!!"
]

ctv = CountVectorizer(tokenizer = word_tokenize)
ctv.fit(corpus)

corpus_transformed_wt = ctv.transform(corpus)

# This changes our vocabulary; punctuation marks are now included:
print(ctv.vocabulary_)

print(ctv.stop_words_)

{'hello': 14, ',': 2, 'how': 16, 'are': 7, 'you': 27, '?': 4, 'im': 18, 'getting': 13, 'bored': 9, 'at': 8, 'home': 15, '.': 3, 'and': 6, 'what': 24, 'do': 12, 'think': 22, 'did': 11, 'know': 19, 'about': 5, 'counts': 10, 'let': 20, "'s": 1, 'see': 21, 'if': 17, 'this': 23, 'works': 25, '!': 0, 'yes': 26}
set()


In [36]:
print(corpus_transformed_wt)

  (0, 2)	1
  (0, 4)	1
  (0, 7)	1
  (0, 14)	1
  (0, 16)	1
  (0, 27)	1
  (1, 3)	1
  (1, 4)	2
  (1, 6)	1
  (1, 8)	1
  (1, 9)	1
  (1, 12)	1
  (1, 13)	1
  (1, 15)	1
  (1, 18)	1
  (1, 22)	1
  (1, 24)	1
  (1, 27)	2
  (2, 5)	1
  (2, 10)	1
  (2, 11)	1
  (2, 19)	1
  (2, 27)	1
  (3, 0)	1
  (3, 1)	1
  (3, 17)	1
  (3, 20)	1
  (3, 21)	1
  (3, 23)	1
  (3, 25)	1
  (4, 0)	4
  (4, 26)	1


## Concept3. TF-IDF

**Here we get a float for each word, whereas CountVectorizer gives the raw count of each word in a sentence.
The drawbacks of CountVectorizer are:**

- Different words can have the same count: they sit at different indices, but the counts alone cannot tell them apart.
- Words with a greater count have more influence, even when they are common words that carry little meaning.

**TF-IDF scales each count by how rare the word is across the documents, so it usually works better than CountVectorizer. Even so, larger values still have some influence in this approach.**


#### The state-of-the-art approach is word embeddings.


In [37]:
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# create a corpus of sentences
corpus = [
"hello, how are you?",
"im getting bored at home. And you? What do you think?",
"did you know about counts",
"let's see if this works!",
"YES!!!!"
]

tfv = TfidfVectorizer(tokenizer=word_tokenize, token_pattern=None)
print(tfv)
tfv.fit(corpus)

corpus_transformed= tfv.transform(corpus)

print(tfv.vocabulary_)
print("\nFeatures names:--")
print(tfv.get_feature_names())

TfidfVectorizer(token_pattern=None,
                tokenizer=<function word_tokenize at 0x0000023E72C53EA0>)
{'hello': 14, ',': 2, 'how': 16, 'are': 7, 'you': 27, '?': 4, 'im': 18, 'getting': 13, 'bored': 9, 'at': 8, 'home': 15, '.': 3, 'and': 6, 'what': 24, 'do': 12, 'think': 22, 'did': 11, 'know': 19, 'about': 5, 'counts': 10, 'let': 20, "'s": 1, 'see': 21, 'if': 17, 'this': 23, 'works': 25, '!': 0, 'yes': 26}

Features names:--
['!', "'s", ',', '.', '?', 'about', 'and', 'are', 'at', 'bored', 'counts', 'did', 'do', 'getting', 'hello', 'home', 'how', 'if', 'im', 'know', 'let', 'see', 'think', 'this', 'what', 'works', 'yes', 'you']


#### Note: ***We can see that instead of integer values, this time we get floats.***

In [65]:
print(corpus_transformed)

  (0, 27)	0.2965698850220162
  (0, 16)	0.4428321995085722
  (0, 14)	0.4428321995085722
  (0, 7)	0.4428321995085722
  (0, 4)	0.35727423026525224
  (0, 2)	0.4428321995085722
  (1, 27)	0.35299699146792735
  (1, 24)	0.2635440111190765
  (1, 22)	0.2635440111190765
  (1, 18)	0.2635440111190765
  (1, 15)	0.2635440111190765
  (1, 13)	0.2635440111190765
  (1, 12)	0.2635440111190765
  (1, 9)	0.2635440111190765
  (1, 8)	0.2635440111190765
  (1, 6)	0.2635440111190765
  (1, 4)	0.42525129752567803
  (1, 3)	0.2635440111190765
  (2, 27)	0.31752680284846835
  (2, 19)	0.4741246485558491
  (2, 11)	0.4741246485558491
  (2, 10)	0.4741246485558491
  (2, 5)	0.4741246485558491
  (3, 25)	0.38775666010579296
  (3, 23)	0.38775666010579296
  (3, 21)	0.38775666010579296
  (3, 20)	0.38775666010579296
  (3, 17)	0.38775666010579296
  (3, 1)	0.38775666010579296
  (3, 0)	0.3128396318588854
  (4, 26)	0.2959842226518677
  (4, 0)	0.9551928286692534
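
The floats above can be reproduced by hand. This is a minimal sketch, assuming scikit-learn's default TfidfVectorizer settings (`smooth_idf=True`, `norm='l2'`, raw term counts), which give `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row. The token lists are hard-coded to mirror what `word_tokenize` yields for the lowercased corpus, and the helper name `tfidf` is hypothetical.

```python
import math
from collections import Counter

# Pre-tokenized corpus (assumed to match word_tokenize on the lowercased sentences).
docs = [
    ['hello', ',', 'how', 'are', 'you', '?'],
    ['im', 'getting', 'bored', 'at', 'home', '.', 'and', 'you', '?',
     'what', 'do', 'you', 'think', '?'],
    ['did', 'you', 'know', 'about', 'counts'],
    ['let', "'s", 'see', 'if', 'this', 'works', '!'],
    ['yes', '!', '!', '!', '!'],
]

n_docs = len(docs)
# Document frequency: in how many documents each token appears.
doc_freq = Counter(tok for doc in docs for tok in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    # Raw term count times the smoothed inverse document frequency.
    raw = {
        tok: count * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
        for tok, count in counts.items()
    }
    norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 norm of the row
    return {tok: v / norm for tok, v in raw.items()}

weights = tfidf(docs[4])
print(round(weights['!'], 4), round(weights['yes'], 4))   # 0.9552 0.296
```

In the last sentence `!` occurs four times, so it still dominates the row (0.955 against 0.296 for `yes`) even though its idf is lower, which illustrates the remaining influence of large counts.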


## Concept4. n-grams

In [38]:
from nltk import word_tokenize
from nltk import ngrams

In [39]:
N=3
sentence = "Hello, how are you?"

tokens = word_tokenize(sentence)
print(f"word_tokens--- {tokens}")

n_grams = list(ngrams(tokens, N))
print()
print(f"n_grams for the given sentence with N={N} combinations:")
n_grams

word_tokens--- ['Hello', ',', 'how', 'are', 'you', '?']

n_grams for the given sentence with N=3 combinations:


[('Hello', ',', 'how'),
 (',', 'how', 'are'),
 ('how', 'are', 'you'),
 ('are', 'you', '?')]
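
An n-gram is simply a window of `n` consecutive tokens, so the same result can be sketched without NLTK by zipping the token list against shifted copies of itself. The helper name `make_ngrams` is made up for this example.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['Hello', ',', 'how', 'are', 'you', '?']

print(make_ngrams(tokens, 3))
# [('Hello', ',', 'how'), (',', 'how', 'are'), ('how', 'are', 'you'), ('are', 'you', '?')]

print(len(make_ngrams(tokens, 2)))  # 5
```

A list of `L` tokens always produces `L - n + 1` n-grams, which is why six tokens give four trigrams and five bigrams.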

### Note: 
**Both the CountVectorizer and TfidfVectorizer implementations in scikit-learn offer n-grams
through the `ngram_range` parameter, which takes a minimum and a maximum n.**
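
As a small illustration of `ngram_range`, the sketch below fits CountVectorizer on a single sentence with `ngram_range=(1, 2)`, so the vocabulary holds both single words and pairs of adjacent words. The default tokenizer is used here, which drops punctuation.

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps unigrams and bigrams.
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(["hello, how are you?"])

# The vocabulary keys are the n-grams; sort them for a stable printout.
features = sorted(vectorizer.vocabulary_)
print(features)
# ['are', 'are you', 'hello', 'hello how', 'how', 'how are', 'you']
```

Widening the range grows the vocabulary quickly, so on a large corpus it is worth starting with small values.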

In [13]:
import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import linear_model
from sklearn import decomposition

In [20]:
corpus = pd.read_csv("../input/IMDB_Dataset-folds.csv", nrows=10000)
corpus = corpus.review.values
len(corpus)

tfv = TfidfVectorizer(tokenizer=word_tokenize, token_pattern=None)

corpus_transformed = tfv.fit_transform(corpus)

svd = decomposition.TruncatedSVD(n_components=10)

# fit() returns the fitted TruncatedSVD estimator itself, not the reduced matrix
corpus_svd = svd.fit(corpus_transformed)

In [15]:
print(type(corpus_transformed))

<class 'scipy.sparse.csr.csr_matrix'>


In [24]:
print(type(svd))
print(svd)

<class 'sklearn.decomposition._truncated_svd.TruncatedSVD'>
TruncatedSVD(n_components=10)


In [19]:
type(corpus_svd)

sklearn.decomposition._truncated_svd.TruncatedSVD
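
Because `fit` returns the estimator, `corpus_svd` is the same object as `svd`; the reduced document vectors come from `transform` or `fit_transform`. Here is a small sketch on the toy corpus used earlier, with `n_components=2` and `random_state=0` chosen only for illustration and the default tokenizer used to keep it short.

```python
from sklearn import decomposition
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "hello, how are you?",
    "im getting bored at home. And you? What do you think?",
    "did you know about counts",
    "let's see if this works!",
    "YES!!!!",
]

tfv = TfidfVectorizer()
X = tfv.fit_transform(corpus)          # sparse matrix: 5 documents x 23 features

svd = decomposition.TruncatedSVD(n_components=2, random_state=0)
fitted = svd.fit(X)                    # fit() returns the estimator itself
reduced = svd.transform(X)             # dense array, one row per document

print(fitted is svd)                   # True
print(reduced.shape)                   # (5, 2)
print(svd.components_.shape)           # (2, 23)
```

Each row of `components_` holds one weight per vocabulary word, which is what makes it possible to pair the weights with feature names.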

In [41]:
## get the words (features) present in the corpus
print(tfv.get_feature_names()[0:92])
print(type(tfv.get_feature_names))

['\x08\x08\x08\x08a', '!', '#', '$', '%', '&', "'", "''", "''and", "''the", "'00s", "'01", "'03", "'04", "'05", "'06", "'07", "'08", "'10", "'10.5", "'12", "'15", "'20", "'20th", "'24", "'28", "'30", "'30s", "'30s-early", "'30s/'40s", "'32", "'34", "'39", "'40", "'40s", "'42", "'43", "'46", "'48", "'50", "'50s", "'51", "'53", "'54", "'55", "'56", "'59", "'60", "'60s", "'60s.", "'60´s", "'62", "'64", "'66", "'70", "'70's-style", "'70's.", "'70s", "'71", "'73", "'77", "'79", "'80", "'80s", "'80s/early", "'81", "'84", "'86", "'87", "'88", "'90", "'90s", "'92", "'93", "'94-'95", "'95", "'96", "'97", "'99", "'aaaaagh", "'aasmaan", "'about", "'absorbed", "'ace", "'ack", "'act", "'acting", "'action", "'actor", "'actors", "'actual", "'addiction"]
<class 'method'>


In [50]:
len(tfv.get_feature_names())

69838

In [39]:
corpus_svd.components_.shape

(10, 69838)

In [49]:
sample_index = 0

# map each feature (word) to its weight in the selected SVD component
feature_scores = dict(
    zip(tfv.get_feature_names(), corpus_svd.components_[sample_index])
)
