### 1. Tokenizing text into sentences

In [2]:
from nltk.tokenize import sent_tokenize

# paragraph of text
text = "Hello World. It's good to see you. Thanks for buying this book."

sentences = sent_tokenize(text)
sentences

['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

----------------------------------------------------------------------------------------------------------------------
The sent_tokenize function uses an instance of PunktSentenceTokenizer from the
nltk.tokenize.punkt module. This instance has already been trained and works well for
many European languages. So it knows what punctuation and characters mark the end of a
sentence and the beginning of a new sentence.

The instance used in sent_tokenize() is actually loaded on demand from a pickle
file. So if you're going to be tokenizing a lot of sentences, it's more efficient to load the
tokenizer instance once, and call its tokenize() method instead.

----------------------------------------------------------------------------------------------------------------------

In [4]:
import nltk.data

# paragraph of text
text = "Hello World. It's good to see you. Thanks for buying this book."

tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')
tokenizer.tokenize(text)

['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

----------------------------------------------------------------------------------------------------------------------
Note :<br>
    1> For other languages<br>
       >>> spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')<br>
       >>> spanish_tokenizer.tokenize('Hola amigo. Estoy bien.')<br>
           ['Hola amigo.', 'Estoy bien.']<br>

### 2. Tokenizing Sentences into Words

In [5]:
from nltk.tokenize import word_tokenize

word_tokenize('Hi how are you!!!')

['Hi', 'how', 'are', 'you', '!', '!', '!']

----------------------------------------------------------------------------------------------------------------------
The word_tokenize() function is a wrapper function that calls tokenize() on an
instance of the TreebankWordTokenizer class (the exact tokenizer class used internally
may differ between NLTK versions). It's roughly equivalent to the following code:

----------------------------------------------------------------------------------------------------------------------

In [6]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize('Hi how are you!!!')

['Hi', 'how', 'are', 'you', '!', '!', '!']

### 3. Tokenizing Sentences Using Regular Expressions

Regular expressions can be used if you want complete control over how to tokenize text.<br>

First you need to decide how you want to tokenize a piece of text as this will determine how<br>
you construct your regular expression. The choices are:<br>
1> Match on the tokens<br>
2> Match on the separators or gaps<br>

In [7]:
from nltk.tokenize import regexp_tokenize

regexp_tokenize("Can't be done this.", r"[\w']+")

["Can't", 'be', 'done', 'this']

In [9]:
regexp_tokenize("Can't be done this.", r"\s+", gaps=True)

["Can't", 'be', 'done', 'this.']
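The two strategies can also be tried with Python's built-in `re` module. This is only a minimal sketch and not how `regexp_tokenize` is implemented internally: `re.findall` matches the tokens, while `re.split` matches the gaps. The variable names are illustrative.

```python
import re

sentence = "Can't be done this."

# Strategy 1: match the tokens (runs of word characters and apostrophes).
tokens_by_match = re.findall(r"[\w']+", sentence)

# Strategy 2: match the gaps (runs of whitespace) and keep what lies between.
# Empty strings are dropped in case the text starts or ends with whitespace.
tokens_by_gap = [tok for tok in re.split(r"\s+", sentence) if tok]

print(tokens_by_match)
print(tokens_by_gap)
```

Matching on tokens drops the trailing period, whereas matching on gaps keeps it attached to the last word.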

### 4. Training a Sentence Tokenizer

NLTK's default sentence tokenizer is general purpose, and usually works quite well. But
sometimes it is not the best choice for your text. Perhaps your text uses nonstandard
punctuation, or is formatted in a unique way. In such cases, training your own sentence
tokenizer can result in much more accurate sentence tokenization.

In [12]:
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext
text = webtext.raw('overheard.txt')
sent_tokenizer = PunktSentenceTokenizer(text)

In [18]:
sents1 = sent_tokenizer.tokenize(text)
print(sents1[0])
print('\n')
print(sents1[678])

White guy: So, do you have any plans for this evening?


Girl: But you already have a Big Mac...


In [19]:
# Let's compare the results to the default sentence tokenizer
from nltk.tokenize import sent_tokenize
sents2 = sent_tokenize(text)
print(sents2[0])
print('\n')
print(sents2[678])

White guy: So, do you have any plans for this evening?


Girl: But you already have a Big Mac...
Hobo: Oh, this is all theatrical.


----------------------------------------------------------------------------------------------------------------------
While the first sentence is the same, we can see that the tokenizers disagree on how to
tokenize sentence 679 (this is the first sentence where the tokenizers diverge). The default
tokenizer includes the next line of dialog, while our custom tokenizer correctly treats
the next line as a separate sentence. This difference is a good demonstration of why it can
be useful to train your own sentence tokenizer, especially when your text isn't in the typical
paragraph-sentence structure.

----------------------------------------------------------------------------------------------------------------------

### 5. Filtering Stopwords in a Tokenized Sentence

Stopwords are common words that generally do not contribute to the meaning of a sentence,
at least for the purposes of information retrieval and natural language processing. These are
words such as "the" and "a". Most search engines will filter out stopwords from search queries
and documents in order to save space in their index.<br>

NLTK comes with a stopwords corpus that contains word lists for many languages.

In [21]:
# Create a set of all English stopwords, then use it to filter stopwords from a sentence
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = ["Can't", 'be', 'done', 'this']
[word for word in words if word not in stop_words]

["Can't", 'done']

In [22]:
# There are also stopword lists for many other languages
stopwords.fileids()

['arabic',
 'azerbaijani',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

In [24]:
# First 5 stopwords of dutch language
stopwords.words('dutch')[0:5]

['de', 'en', 'van', 'ik', 'te']

### 6. Looking Up Synsets (Synonyms) for a Word in WordNet

WordNet is a lexical database for the English language. In other words, it's a dictionary
designed specifically for natural language processing.<br>

NLTK comes with a simple interface to look up words in WordNet. What you get is a list of
Synset instances, which are groupings of synonymous words that express the same concept.

In [37]:
from nltk.corpus import wordnet
syn = wordnet.synsets('book')[0]
print(syn.name())
print('\n')
print('Definition :\n',syn.definition())
print('\n')
print('Example : \n',syn.examples())
print('\n')
print('parts of speech :\n', syn.pos())

book.n.01


Definition :
 a written work or composition that has been published (printed on pages bound together)


Example : 
 ['I am reading a good book on economics']


parts of speech :
 n


### 7. Looking up Lemmas and Synonyms in WordNet

A lemma (in linguistics) is the canonical form, or dictionary form, of a word.<br>

We can also look up lemmas in WordNet to find synonyms of a word.

In [43]:
from nltk.corpus import wordnet

syn = wordnet.synsets('cookbook')[0]
lemmas = syn.lemmas()
print(lemmas)
print('\n')
print(len(lemmas))

[Lemma('cookbook.n.01.cookbook'), Lemma('cookbook.n.01.cookery_book')]


2


In [44]:
print(lemmas[0].name())
print(lemmas[1].name())

cookbook
cookery_book


In [45]:
lemmas[0].synset() == lemmas[1].synset()

True

### 8. Stemming Words

Stemming is a technique to remove affixes from a word, ending up with the stem. For
example, the stem of cooking is cook, and a good stemming algorithm knows that the ing
suffix can be removed.<br>

Stemming is most commonly used by search engines for indexing
words. Instead of storing all forms of a word, a search engine can store only the stems, greatly
reducing the size of the index while increasing retrieval accuracy.<br>

One of the most common stemming algorithms is the Porter stemming algorithm. It is designed to remove and replace well-known suffixes of English words.

In [47]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('cooking'))
print(stemmer.stem('cookery'))

cook
cookeri


### 9. Lemmatizing Words with WordNet

Lemmatization is very similar to stemming, but is more akin to synonym replacement.
A lemma is a root word, as opposed to the root stem. So unlike stemming, you are always
left with a valid word that means the same thing.<br>

In [48]:
# We will use the WordNetLemmatizer class to find lemmas
from nltk.stem import WordNetLemmatizer

lemmatizer =  WordNetLemmatizer()
print(lemmatizer.lemmatize('cooking'))
print(lemmatizer.lemmatize('cooking', pos='v'))
print(lemmatizer.lemmatize('cookbooks'))

cooking
cook
cookbook


### 10. Replacing Words Matching Regular Expressions

We need to define a number of replacement patterns. This will be a list of tuple
pairs, where the first element is the pattern to match with and the second element is
the replacement.<br>

Next, we will create a RegexReplacer class that will compile the patterns and provide
a replace() method to substitute all the found patterns with their replacements.

In [56]:
import re

# Raw strings keep the regex escapes (\w, \g<1>) intact
replacement_patterns = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"i'm", "i am"),
    (r"ain't", "is not"),
    (r"(\w+)'ll", r"\g<1> will"),
    (r"(\w+)n't", r"\g<1> not"),
    (r"(\w+)'ve", r"\g<1> have"),
    (r"(\w+)'s", r"\g<1> is"),
    (r"(\w+)'re", r"\g<1> are"),
    (r"(\w+)'d", r"\g<1> would"),
]

In [57]:
class RegexReplacer:
    def __init__(self, patterns=replacement_patterns):
        # Compile each pattern once so replace() can reuse it
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        # Apply every (pattern, replacement) pair in order
        s = text
        for (pattern, repl) in self.patterns:
            s = pattern.sub(repl, s)
        return s

In [60]:
replacer = RegexReplacer()
replacer.replace("can't is a contraction")

'cannot is a contraction'

In [61]:
replacer.replace("I should've done that thing I didn't do")

'I should have done that thing I did not do'

### 11. NLP Tasks

<b>Preprocessing :</b>
* Removing special characters
* Removing numbers
* Handling letter case
* Abbreviations
* Standardization of words
* Removing sparse words
* Lemmatization
* Stemming
* Spelling corrections
* Handling other symbols (emojis)

<b>Converting text to numeric vector</b>
* Count
* TF-IDF
* Hash
* Word Embeddings (word2vec, glove)

<b>Features</b>
* Words
* N-grams
* Characters
* Derived features (no. of letters, no. of sentences, no. of upper/lower case letters, POS-related counts such as nouns/verbs/adjectives, etc.)

<b>Exploratory Analysis</b>

* Word associations
* Word counts (word cloud, bar charts)
* Similarity between documents, words/associations
* Topic mining / clustering

### 12. Text mining applications

<b>Many to one :</b> Sentiment analysis, text recommendation<br>
<b>One to many :</b> Image caption(series of words)<br>
<b>Many to many :</b> Chatbot(conversation), Machine translations<br>

### 13. NLP, NLU and NLG

<b>NLP :</b> Processing text<br>
<b>NLU :</b> Exploratory analysis, predictive analysis<br>
<b>NLG :</b> Text generation (chatbot, voicebot, machine translation, text recommendation, caption generation, subtitle generation)

### 14. Word Embeddings

* A technique to turn words into numbers
* Captures relationships between words
* Context-based vectorization
* Word embeddings are learnt from the data
* They are dense compared to one-hot encoding
* The idea is to embed words into a lower-dimensional space
* Dimensions of this space are typically defined by word context, i.e. semantically similar words are embedded near each other
* Popular algorithms are : Word2Vec (skip-gram, CBOW), GloVe, FastText, PMI
* We can use pre-trained Word2vec models
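
To make "semantically similar words are embedded near each other" concrete, closeness between vectors is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors; real embeddings are learnt from data and have far more dimensions, so the words and values here are purely illustrative assumptions.

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "pasta": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_related = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_unrelated = cosine_similarity(toy_embeddings["king"], toy_embeddings["pasta"])
print(sim_related, sim_unrelated)
```

With these toy values the related pair scores much closer to 1 than the unrelated pair.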

<b>How to implement?</b>

1. Pre-trained models : trained on Google News articles, Wikipedia articles, news agency corpora (e.g. Reuters)
2. Build it from scratch : using Gensim
3. While building a deep learning model, use an embedding layer

### 15. Bag of Words(BOW)

r1 : This pasta is very tasty and affordable<br>
r2 : This pasta is not tasty and is affordable<br>
r3 : This pasta is delicious and cheap<br>
r4 : Pasta is tasty and pasta tastes good<br>

r1 -> Review / Text document<br>
n  -> No. of reviews / documents<br>
corpus -> Collection of all documents

1. Construct a set/dictionary of all words in all reviews
2. For each review, create a vector
3. Each vector has one entry for every word in the dictionary
4. Each word is a different dimension
5. Each cell in the vector is the no. of times that word occurs in that review
6. The dimension of the vector is very large
7. The vector is sparse: most of its elements are zero
8. Similar documents have vectors that are closer together
9. BOW amounts to counting common words
10. It doesn't consider the semantic meaning of words
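
The steps above can be sketched in plain Python without any library. This is a minimal illustration under simple assumptions (lowercasing and whitespace splitting as the tokenizer); the helper name `bow_vector` is made up for this example.

```python
from collections import Counter

# The four reviews r1..r4 from the example (r4 uses "pasta" twice).
reviews = [
    "This pasta is very tasty and affordable",
    "This pasta is not tasty and is affordable",
    "This pasta is delicious and cheap",
    "Pasta is tasty and pasta tastes good",
]

# Lowercase and split on whitespace (a deliberately simple tokenizer).
tokenized = [review.lower().split() for review in reviews]

# Step 1: the vocabulary is the sorted set of all words in all reviews.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Steps 2-5: one vector per review, one dimension per vocabulary word,
# each cell holding how often that word occurs in the review.
def bow_vector(tokens):
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

bow_vectors = [bow_vector(tokens) for tokens in tokenized]

print(vocabulary)
for vector in bow_vectors:
    print(vector)
```

Even on four short reviews most cells are zero, which is the sparsity mentioned in the list.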

<b>Binary Bag of Words</b>

* The vector holds a boolean value for the occurrence of each word instead of a count
* It is also called boolean Bag of Words
* Value is '1' if the word occurs at least once in the document
* Value is '0' if the word doesn't occur in the document
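
A binary vector can be sketched the same way as a count vector, replacing each count with a 0/1 presence flag. The snippet below is a plain-Python illustration on two of the example reviews; `binary_bow_vector` is a hypothetical helper name.

```python
# Two of the example reviews, tokenized by lowercasing and splitting on spaces.
reviews = [
    "This pasta is not tasty and is affordable",
    "Pasta is tasty and pasta tastes good",
]
tokenized = [review.lower().split() for review in reviews]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

def binary_bow_vector(tokens):
    """1 if the vocabulary word occurs at least once in the review, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

binary_vectors = [binary_bow_vector(tokens) for tokens in tokenized]
print(vocabulary)
print(binary_vectors)
```

Note that "is" occurs twice in the first review and "pasta" twice in the second, yet both are recorded as 1.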

In [None]:
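from sklearn.feature_extraction.text import CountVectorizer

# `final` is assumed to be a pandas DataFrame (defined elsewhere) whose
# 'CleanedText' column holds the preprocessed review text
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(final['CleanedText'].values)
print("the type of count vectorizer output ", type(final_counts))
print("the shape of our text BOW vectorizer ", final_counts.get_shape())
print("the number of unique words ", final_counts.get_shape()[1])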

### 15. Stop Words

* Removing some words from a document doesn't affect its meaning. Such words are stopwords
* Ex :- the, this, and, is, at etc.
* By removing stop words, the BOW becomes smaller and more meaningful

In [None]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop = set(stopwords.words('english')) #set of stopwords

* We can edit this set of stopwords according to our needs. Ex :- removing 'not' from the set for the problem of review polarity.
* We need to remove stopwords carefully, so that we do not lose information.
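
Editing the stopword set can be sketched as follows. To stay self-contained, the snippet uses a small hypothetical stopword set in place of NLTK's English list; `remove_stopwords` is an illustrative helper, not an NLTK function.

```python
# A small hypothetical stopword set standing in for NLTK's English list.
stop = {"the", "this", "and", "is", "at", "not", "a"}

# For review polarity, negation carries meaning, so keep "not" in the text.
polarity_stop = stop - {"not"}

def remove_stopwords(text, stopword_set):
    """Drop every token that appears in stopword_set (case-insensitive)."""
    return [word for word in text.split() if word.lower() not in stopword_set]

review = "This pasta is not tasty"
print(remove_stopwords(review, stop))
print(remove_stopwords(review, polarity_stop))
```

With the unedited set the review collapses to "pasta tasty", which reverses its polarity; the edited set keeps "not".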

### 16. Stemming

* Stemming is a technique used to extract the base form of the words by removing affixes from them. It is just like cutting down the branches of a tree to its stems. For example, the stem of the words eating, eats, eaten is eat.
* Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word.
* This indiscriminate cutting can be successful on some occasions, but not always.
* Ex :-

| FORM | SUFFIX | STEM |
|---|---|---|
| studies | -es | studi |
| studying | -ing | study |

* There are different algorithms that can be used in the stemming process, such as the Porter stemmer, the Snowball stemmer and the Lancaster stemmer; the most common in English is the Porter stemmer.
* Of stemming and lemmatization, stemming is the simpler approach: words are reduced to their word stems.
* Search engines use stemming for indexing words. Rather than storing all forms of a word, a search engine can store only the stems. In this way, stemming reduces the size of the index and increases retrieval accuracy.
* Stemming is the process of converting the words of a sentence to their non-changing portions. For example, given the words amusing, amusement, and amused, the stem would be amus.
* Lemmatization is the process of converting the words of a sentence to their dictionary form. For the same words amusement, amusing, and amused, the lemma of each would be amuse.
* Stemming is a kind of normalization for words. Normalization converts the words of a sentence into a standard form to shorten lookups: words which have the same meaning but vary according to the context or sentence are mapped to the same form.
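
The idea of "cutting off the end of the word" can be illustrated with a toy suffix stripper. This is only a sketch under a made-up suffix list; it is not the Porter algorithm or any other real stemmer, and `naive_stem` is a hypothetical name.

```python
# Suffixes to strip, longest first so "ing" is tried before "s".
# This list is an illustrative assumption, not the rule set of any real stemmer.
SUFFIXES = ["ing", "es", "ed", "s"]

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

for form in ["studies", "studying", "eating", "eats"]:
    print(form, "-->", naive_stem(form))
```

The sketch reproduces the studi/study mismatch for "studies" and "studying", which shows why indiscriminate cutting does not always yield a shared stem.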

<b>Porter Stemmer :</b>

In [4]:
# Import only the Porter stemmer class rather than everything in the module
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
words = ['run','runner','running','ran','runs','easily','fairly']
for word in words:
    print(word+' --> '+p_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fairli


Note that the stemmer leaves “runner” unchanged instead of reducing it to “run”, and the irregular form “ran” is not stemmed either. Also, the adverbs “easily” and “fairly” are stemmed to the unusual roots “easili” and “fairli”.

<b>Snowball Stemmer :</b>

In [7]:
from nltk.stem.snowball import SnowballStemmer

# The Snowball Stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='english')
words = ['run','runner','running','ran','runs','easily','fairly']
for word in words:
    print(word+' --> '+s_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair


<b>Lancaster Stemmer :</b>

In [None]:
import nltk
from nltk.stem import LancasterStemmer
Lanc_stemmer = LancasterStemmer()
Lanc_stemmer.stem('eats')

### 17. Lemmatization

* Takes into consideration the morphological analysis of the words
* Lemmatization looks beyond word reduction and considers a language’s full vocabulary to apply a morphological analysis to words.
* To do so, it is necessary to have detailed dictionaries which the algorithm can look through to link the form back to its lemma.
* Ex :-

| FORM | LEMMA |
|---|---|
| studies | study |
| studying | study |

* To extract the proper lemma, it is necessary to look at the morphological analysis of each word. This requires having dictionaries for every language to provide that kind of analysis.
* The lemma of ‘was’ is ‘be’ and the lemma of ‘mice’ is ‘mouse’.
* Lemmatization is typically seen as much more informative than simple stemming, which is why spaCy has opted to provide only lemmatization and not stemming.
* Lemmatization looks at the surrounding text to determine a given word’s part of speech; it does not categorize phrases.
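
The role of the dictionary and of the part of speech can be sketched with a tiny lookup table. The table below is hand-written for illustration only and `toy_lemmatize` is a hypothetical helper; real lemmatizers rely on full dictionaries and a tagger rather than a handful of entries.

```python
# A tiny hand-written lookup table keyed by (word form, part of speech).
# Real lemmatizers rely on full dictionaries; these entries are illustrative.
LEMMA_TABLE = {
    ("studies", "NOUN"): "study",
    ("studies", "VERB"): "study",
    ("studying", "VERB"): "study",
    ("was", "VERB"): "be",
    ("mice", "NOUN"): "mouse",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}

def toy_lemmatize(word, pos):
    """Return the dictionary form for (word, pos); fall back to the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_lemmatize("mice", "NOUN"))
print(toy_lemmatize("saw", "VERB"))
print(toy_lemmatize("saw", "NOUN"))
```

The same surface form "saw" maps to different lemmas depending on its part of speech, which is why the tag has to be determined first.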

In [None]:
# Perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')
def show_lemmas(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

In [None]:
doc = nlp("I saw eighteen mice today!")

show_lemmas(doc)

In [None]:
#output
I            PRON   561228191312463089     -PRON-
saw          VERB   11925638236994514241   see
eighteen     NUM    9609336664675087640    eighteen
mice         NOUN   1384165645700560590    mouse
today        NOUN   11042482332948150395   today
!            PUNCT  17494803046312582752   !

### 17. Conclusion Stemming vs Lemmatization

* It is harder to create a lemmatizer for a new language than a stemming algorithm, because a lemmatizer requires much more knowledge about the structure of the language.
* Stemming and lemmatization both produce a base form of inflected words; the difference is that a stem may not be an actual word, whereas a lemma is always an actual word of the language.
* Stemming applies a fixed sequence of rule-based steps to each word, which makes it fast. Lemmatization also looks words up in a dictionary or corpus to supply the lemma, which makes it slower than stemming. You may also have to specify the part of speech to get the correct lemma.
* If speed is the priority, stemming should be used, since lemmatizers spend time and processing on lookups. Whether to use a stemmer or a lemmatizer depends on the problem you’re working on.

### 18. Uni gram, Bi gram and n gram

* <b>Uni gram :</b> We use single word for each dimension.
* <b>Bi gram :</b> We use 2 words for each dimension
* <b>n gram :</b> We use 'n' words for each dimension
* Uni grams ignore sequence information
* Bi grams and n grams keep some sequence information, though not all of it
* In a corpus, typically no. of distinct tri grams >= no. of distinct bi grams >= no. of distinct uni grams

<b>Example : -</b><br>
Sentence - "I have a lovely dog"<br>
Uni grams - "I", "have", "a" , "lovely" , "dog"<br>
Bi grams - "I have" , "have a" , "a lovely" , "lovely dog"<br>
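
Generating n grams from a tokenized sentence takes only a sliding window. The helper below is a minimal sketch (`ngrams` here is a locally defined function, not a library call) applied to the example sentence.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I have a lovely dog".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(unigrams)
print(bigrams)
print(trigrams)
```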

In [None]:
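from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1,2) keeps both unigrams and bigrams as features;
# `final` is assumed to be a DataFrame with a 'CleanedText' column defined elsewhere
count_vect = CountVectorizer(ngram_range=(1,2))
final_bigram_counts = count_vect.fit_transform(final['CleanedText'].values)
print("the type of count vectorizer output ", type(final_bigram_counts))
print("the shape of our text BOW vectorizer ", final_bigram_counts.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_bigram_counts.get_shape()[1])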

### 19. TF-IDF

* Term Frequency - Inverse Document Frequency
* TF(wi, rj) = (# of times wi occurs in rj) / (total # of words in rj)
* 0 <= TF(wi, rj) <= 1
* TF can be read as the probability of finding the word wi in the document rj
* TF measures how often wi occurs in rj
* IDF(wi, Dc) = log(# of documents / # of documents containing wi) = log(N/ni)
* IDF(wi, Dc) >= 0, since ni <= N
* If wi is more common in the corpus, its IDF value is lower
* If wi is less common in the corpus, its IDF value is higher
* When converting text to a numeric vector, each word's cell holds its TF\*IDF value instead of its raw count
* TF-IDF gives more importance to words that are rare in the corpus and frequent within a document
* <b>TF-IDF does not consider the semantic meaning of the text</b>

r1 : w1, w2, w3, w2, w5<br>
r2 : w1, w3, w4, w5, w6, w2<br>
r3 : ...............<br>
r4 : ...............<br>
Dc = {r1, r2, r3.....}<br>
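
The TF and IDF formulas can be checked by hand on the toy documents. Only r1 and r2 are spelled out in the example, so the sketch below assumes a corpus containing just those two documents; the function names are illustrative and the natural logarithm is used.

```python
import math

# Only r1 and r2 are spelled out in the example, so the corpus here is
# assumed to contain just these two documents.
corpus = {
    "r1": ["w1", "w2", "w3", "w2", "w5"],
    "r2": ["w1", "w3", "w4", "w5", "w6", "w2"],
}
documents = list(corpus.values())

def tf(word, document):
    """Occurrences of word in document divided by the document length."""
    return document.count(word) / len(document)

def idf(word, docs):
    """log(N / n_i); assumes word occurs in at least one document."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc_name):
    """TF * IDF of word for the named document of the toy corpus."""
    return tf(word, corpus[doc_name]) * idf(word, documents)

print(tf("w2", corpus["r1"]))  # w2 occurs 2 times among the 5 words of r1
print(idf("w4", documents))    # w4 occurs in 1 of the 2 documents
print(tf_idf("w1", "r1"))      # w1 occurs in every document, so its IDF is 0
```

A word that appears in every document, such as w1 here, ends up with a TF-IDF of zero regardless of how often it occurs.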

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# `final` is assumed to be a DataFrame with a 'CleanedText' column defined elsewhere
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
final_tf_idf = tf_idf_vect.fit_transform(final['CleanedText'].values)
print("the type of TFIDF vectorizer output ", type(final_tf_idf))
print("the shape of our text TFIDF vectorizer ", final_tf_idf.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_tf_idf.get_shape()[1])

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
features = tf_idf_vect.get_feature_names_out()
print("some sample features (unique words in the corpus)", features[100000:100010])

### 20. Word2Vec

* Word2vec is a group of related models that are used to produce word embeddings.
* These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words.
* Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. 
* Each word is represented as a dense vector, typically with 50, 100, 200 or 300 dimensions
* Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in the space
* Word2Vec is one of the most popular techniques for learning word embeddings using a shallow neural network.
* Word2Vec embeddings can be obtained using two methods (both involving neural networks): Skip Gram and Continuous Bag of Words (CBOW)
* Skip Gram works well with a small amount of data and is found to represent rare words well.
* On the other hand, CBOW is faster and has better representations for more frequent words.
* Google has released a pre-trained word2vec model with 300-dimensional vectors
* Word2vec considers the neighbourhood of a word during training. If two words have similar neighbourhoods, their vectors should be close

<b>Average Word2Vec :</b><br>
If r1 : w1, w2, w3, w4, w5<br>
v1 = 1/n\*(w2v(w1)+w2v(w2)+w2v(w3)+w2v(w4)+w2v(w5)), where n is the number of words in r1 (here n = 5)<br>
<br>
This simple method works well in some cases.
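
The averaging can be sketched with plain Python lists. The 3-dimensional vectors below are hypothetical stand-ins for w2v(w1)..w2v(w5), and skipping words that have no vector is a design choice of this sketch rather than part of the formula.

```python
# Hypothetical 3-dimensional word vectors standing in for w2v(w1)..w2v(w5).
w2v = {
    "w1": [1.0, 0.0, 2.0],
    "w2": [0.0, 1.0, 2.0],
    "w3": [2.0, 2.0, 2.0],
    "w4": [1.0, 1.0, 0.0],
    "w5": [1.0, 1.0, 4.0],
}

def average_word2vec(words, vectors):
    """Element-wise mean of the vectors of the words found in `vectors`."""
    known = [vectors[word] for word in words if word in vectors]
    if not known:
        return None  # no word of the review has a vector
    n = len(known)
    return [sum(component) / n for component in zip(*known)]

r1 = ["w1", "w2", "w3", "w4", "w5"]
v1 = average_word2vec(r1, w2v)
print(v1)
```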

<b>TF-IDF weighted Word2vec :</b><br>
If r1 : w1, w2, w3, w4, w5<br>
v1 = ((t1\*w2v(w1))+(t2\*w2v(w2))+(t3\*w2v(w3))+(t4\*w2v(w4))+(t5\*w2v(w5))) / (t1+t2+t3+t4+t5), where ti is the TF-IDF value of wi in r1<br>
v1 = Sigma(ti\*w2v(wi))/Sigma(ti)<br>
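
The weighted version follows the same pattern, with each word vector scaled by its TF-IDF value before summing. The vectors and weights below are invented for illustration (three words instead of five, for brevity), and `tfidf_weighted_word2vec` is a hypothetical helper name.

```python
# Hypothetical word vectors and TF-IDF weights for a three-word review.
w2v = {
    "w1": [1.0, 0.0],
    "w2": [0.0, 1.0],
    "w3": [4.0, 4.0],
}
tfidf_weights = {"w1": 1.0, "w2": 1.0, "w3": 2.0}

def tfidf_weighted_word2vec(words, vectors, weights):
    """Sigma(ti * w2v(wi)) / Sigma(ti) over words having a vector and a weight."""
    dims = len(next(iter(vectors.values())))
    weighted_sum = [0.0] * dims
    weight_total = 0.0
    for word in words:
        if word in vectors and word in weights:
            t = weights[word]
            weighted_sum = [acc + t * x for acc, x in zip(weighted_sum, vectors[word])]
            weight_total += t
    if weight_total == 0:
        return None  # nothing to average
    return [acc / weight_total for acc in weighted_sum]

v1 = tfidf_weighted_word2vec(["w1", "w2", "w3"], w2v, tfidf_weights)
print(v1)
```

Because w3 carries twice the weight of the other words, the result is pulled toward its vector compared with a plain average.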

In [None]:
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec

# Train a small model on gensim's bundled toy corpus.
# Note: gensim >= 4.0 names this parameter vector_size (it was size in 3.x)
path = get_tmpfile("word2vec.model")
model = Word2Vec(common_texts, vector_size=100, window=5, min_count=1, workers=4)
model.save(path)

# Reload the saved model and continue training on new sentences
model = Word2Vec.load(path)
model.train([["hello", "world"]], total_examples=1, epochs=1)
vector = model.wv['computer']  # numpy vector of a word

In [None]:
# Using Google News Word2Vec vectors
import os
import pickle
from gensim.models import KeyedVectors

# In this project we use a model pre-trained by Google.
# It's a 3.3G file; once you load it into memory
# it occupies ~9GB, so please do this step only if you have >12G of RAM.
# We will provide a pickle file which contains a dict
# with all our corpus words as keys and model[word] as values.
# To use this code snippet, download "GoogleNews-vectors-negative300.bin"
# from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
# it's 1.9GB in size.


# http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.W17SRFAzZPY
# You can comment out this whole cell
# or change these variables according to your needs
is_your_ram_gt_16g = False
want_to_read_sub_set_of_google_w2v = True
want_to_read_whole_google_w2v = True
if not is_your_ram_gt_16g:
    if want_to_read_sub_set_of_google_w2v and os.path.isfile('google_w2v_for_amazon.pkl'):
        with open('google_w2v_for_amazon.pkl', 'rb') as f:
            # model is a dict object, you can directly access any word vector using model[word]
            model = pickle.load(f)
else:
    if want_to_read_whole_google_w2v and os.path.isfile('GoogleNews-vectors-negative300.bin'):
        model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# print("the vector representation of word 'computer'",model.wv['computer'])
# print("the similarity between the words 'woman' and 'man'",model.wv.similarity('woman', 'man'))
# print("the most similar words to the word 'woman'",model.wv.most_similar('woman'))
# this will raise an error
# model.wv.most_similar('tasti')  # "tasti" is the stemmed word for tasty, tasteful
