# Basics
### 1. Tokenisation

### 2. Lexicographic corrections
#### a. Lemmatization b. Stemming

### 3. Bag of words and TF-IDF

### 4. Word Embeddings:
#### a. Word2Vec b. GloVe

## Tokenisation
Tokenisation means splitting text into units, such as words or sentences, as the task requires. It is a basic step needed to build the vocabulary for your model.
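
As a rough illustration of what tokenisation produces, the sketch below splits a short text with plain regular expressions. The two patterns are simplifying assumptions for this example and are not what NLTK does internally; a naive split like this would, for instance, break after abbreviations such as "Dr.".

```python
import re

text = "I have three visions for India. We have not conquered anyone. Why?"

# Naive sentence split: break after ., ! or ? when followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Naive word split: runs of letters/digits, or single punctuation marks.
words = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

print(len(sentences), sentences)
print(len(words), words)
```

The NLTK calls in the cells that follow do the same job more robustly.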

In [5]:
!pip install nltk --upgrade
import nltk
nltk.download()

Requirement already up-to-date: nltk in c:\users\samriddh\anaconda3\lib\site-packages (3.5)
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [6]:
paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom. I believe that India got its first vision of 
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India 
               stands up to the world, no one will respect us. Only strength respects strength. We must be 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""

In [58]:
#tokens of sentences
sentences = nltk.sent_tokenize(paragraph)
#Tokens of words from the para
words = nltk.word_tokenize(paragraph)

In [59]:
len(words) #we have 399 words in the words variable

399

In [60]:
len(sentences) #we have 31 sentences in the sentences variable

31

## Lexicographic corrections:
This step converts the different inflected forms of a word into a single normalized form (for lemmatization, this form is called the lemma).
The most common lexicon normalization practices are:
#### Stemming: 
    Stemming is a rudimentary rule-based process of stripping suffixes ("ing", "ly", "es", "s", etc.) from a word to approximate its base form. It is typically used in analytical applications where the output is not read by people.
    Eg: playing, played, plays --> play
        history --> histori, people --> peopl, invaded --> invad

The problem is that a stem may not be a real word. Lemmatization addresses this by returning forms that belong to our human vocabulary and so convey a sensible meaning.
#### Lemmatization:
    Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form (lemma) of a word. It is generally used where the output will be read by people.
   **It makes use of a vocabulary (the dictionary form of words) and morphological analysis (word structure and grammatical relations).**
   
    The outcome is a reduced set of features, keeping only the ones useful for the model.
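
To make the contrast concrete, here is a minimal sketch of the two ideas. Both functions are toy stand-ins written for this example: `toy_stem` uses a hypothetical suffix list (it is not Porter's algorithm) and `toy_lemmatize` uses a tiny hand-written dictionary in place of WordNet.

```python
# Toy suffix-stripping "stemmer": the rules are illustrative, not Porter's algorithm.
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy "lemmatizer": a hand-written dictionary standing in for WordNet.
LEMMAS = {"studies": "study", "playing": "play", "mice": "mouse"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

for w in ["playing", "studies", "mice"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The rule-based version turns "studies" into the non-word "studi" and cannot handle "mice", while the dictionary lookup returns real words but only for entries it knows.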
    
#### Stopwords
Stopwords are words that contribute little to the model. Examples: the, this, is, of, in, from, and, our, etc. A general characteristic is that they occur very frequently.

Remember: sometimes you will not want to remove helping verbs and similar words, for example when the tense of a sentence has to be identified.

In [None]:
# To see the list of stopwords in English. You can check other languages as well.
from nltk.corpus import stopwords
stopwords.words("english")

In [52]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

stemmer = PorterStemmer()

# Build the stopword set once instead of on every iteration
stop_words = set(stopwords.words("english"))

# Stemming
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # List comprehension: drop the word if it is a stopword, otherwise stem it
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = " ".join(words)

In [53]:
# Stopwords are removed and the remaining words are reduced to stems, e.g. histori, peopl, invad.
sentences

len(sentences)

31

In [70]:
from nltk.stem import WordNetLemmatizer

lemmer = WordNetLemmatizer()

# Build the stopword set once instead of on every iteration
stop_words = set(stopwords.words('english'))

# List comprehension using lemmatization
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [lemmer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = " ".join(words)

In [71]:
sentences
len(sentences)

31

### Vocabulary creation

#### (1) Bag of words model
###### a. Binary
###### b. Non-Binary(Stores the frequency as well.)

A bag-of-words is a representation of text that describes the occurrence of words within a document. It consists of a vocabulary of known words and a measure of their presence, such as their frequency.
Because the vocabulary has n words, each document can be represented by a fixed-length vector of size n, with one position to score each word. The simplest scoring method marks the presence of a word as a boolean value: 0 for absent, 1 for present (the binary model); the non-binary model stores the word count instead. The vectors x derived from the text in this way reflect various linguistic properties of the text.
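
The binary and non-binary variants can be sketched in a few lines of plain Python. The three documents and the helper name `bow_vector` are made up for this example.

```python
from collections import Counter

docs = ["good boy", "good girl", "boy girl good good"]

# Vocabulary: sorted unique words across all documents.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, binary=False):
    """Return one score per vocabulary word: presence (binary) or count."""
    counts = Counter(doc.split())
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocab]
    return [counts[word] for word in vocab]

print(vocab)
for doc in docs:
    print(doc, "| binary:", bow_vector(doc, binary=True), "| counts:", bow_vector(doc))
```

In scikit-learn, `CountVectorizer` produces the count variant by default and the binary variant with `binary=True`.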

**Disadvantages:**

* The vectors become very large and sparse when the vocabulary is huge.
    Solution: Use Word2Vec
* The semantics of the words are not preserved. In "He is an intelligent boy.", the adjective and the noun receive the same importance, so we cannot derive much information.
    Solution: Use TF-IDF


In [83]:
#Text preprocessing before the BoW

import re # regular expression library
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer


paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom. I believe that India got its first vision of 
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India 
               stands up to the world, no one will respect us. Only strength respects strength. We must be 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""
paragraph = paragraph.lower()

ps = PorterStemmer()
wordnet = WordNetLemmatizer()
sentences = nltk.sent_tokenize(paragraph)

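corpus = []
# Build the stopword set once instead of on every iteration
stop_words = set(stopwords.words("english"))
for i in range(len(sentences)):
    # Replace everything except letters with a space
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])
    # split creates a list of words
    review = review.split()
    # List comprehension using lemmatization (ps.stem could be used here instead for stemming)
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    review = " ".join(review)
    corpus.append(review)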
    


from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()

In [86]:
X.shape

(31, 114)

#### (2) TF-IDF (Term Frequency and Inverse Document Frequency)



**TF = Number of repetitions of the word in sentence[i] / Total number of words in sentence[i]**


**IDF = log(Total number of sentences(n)/Number of sentences containing the word)**

Examples:

Sentence[0] = good boy

Sentence[1] = good girl

Sentence[2] = boy girl good 

**TF**

| words | Sentence[0] | Sentence[1] | Sentence[2] |
|-------|-------------|-------------|-------------|
| good  | 1/2         | 1/2         | 1/3         |
| boy   | 1/2         | 0           | 1/3         |
| girl  | 0           | 1/2         | 1/3         |

**IDF**

| Words | IDF          |
|-------|--------------|
| good  | log(3/3) = 0 |
| boy   | log(3/2)     |
| girl  | log(3/2)     |


Now let us see the magic of analysing texts with mathematics: 

**TF x IDF**

We multiply the two tables: the TF of each word in a sentence is multiplied by the IDF of that word.

| sentence[i] | f1 (good) | f2 (boy)       | f3 (girl)      |
|-------------|-----------|----------------|----------------|
| sentence[0] | 0         | 1/2 * log(3/2) | 0              |
| sentence[1] | 0         | 0              | 1/2 * log(3/2) |
| sentence[2] | 0         | 1/3 * log(3/2) | 1/3 * log(3/2) |

Each sentence will also have a label, which is the dependent variable predicted from the three features above. For a larger vocabulary this works well, as it emphasises words that are frequent in a sentence but rare across the other sentences.
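
The TF and IDF formulas can be applied directly to the three example sentences. This is a minimal sketch using the natural logarithm; the helper names `tf` and `idf` are chosen for this example.

```python
import math

sentences = ["good boy", "good girl", "boy girl good"]
tokenized = [s.split() for s in sentences]
vocab = ["good", "boy", "girl"]
n = len(tokenized)

def tf(word, tokens):
    """Occurrences of the word in the sentence / total words in the sentence."""
    return tokens.count(word) / len(tokens)

def idf(word):
    """log(total sentences / sentences containing the word)."""
    containing = sum(1 for tokens in tokenized if word in tokens)
    return math.log(n / containing)

tfidf = [[tf(w, tokens) * idf(w) for w in vocab] for tokens in tokenized]
for s, row in zip(sentences, tfidf):
    print(s, [round(v, 3) for v in row])
```

Because "good" occurs in every sentence, its IDF is log(3/3) = 0 and it contributes nothing to any row. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalises each row by default, so its numbers differ from this hand calculation.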

**Disadvantages**

Both BoW and TF-IDF apply simple mathematical formulae, but they share some problems:
1. Semantic information is not stored well in the two models.
2. They are a highly inefficient form of word representation and feature generation, as the result is a sparse matrix whose width is the size of the vocabulary.
Say you have a dictionary of 10k words and "car" is at the 2500th position: its vector has 10k-1 zeros and a single one. Such vectors cannot relate words to each other, so semantic information is not preserved.

In [89]:
import re # regular expression library
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

wordnet = WordNetLemmatizer()

paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom. I believe that India got its first vision of 
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India 
               stands up to the world, no one will respect us. Only strength respects strength. We must be 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""

paragraph = paragraph.lower()
sentences = nltk.sent_tokenize(paragraph)

corpus = []
# Build the stopword set once instead of on every iteration
stop_words = set(stopwords.words("english"))
for i in range(len(sentences)):
    # Replace everything except letters with a space
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])
    # split creates a list of words
    review = review.split()
    # List comprehension using lemmatization
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    review = " ".join(review)
    corpus.append(review)

In [105]:

display(corpus)
display(sentences)
display(len(sentences))

['three vision india',
 'year history people world come invaded u captured land conquered mind',
 'alexander onwards greek turk mogul portuguese british french dutch came looted u took',
 'yet done nation',
 'conquered anyone',
 'grabbed land culture history tried enforce way life',
 '',
 'respect freedom others first vision freedom',
 'believe india got first vision started war independence',
 'freedom must protect nurture build',
 'free one respect u',
 'second vision india development',
 'fifty year developing nation',
 'time see developed nation',
 'among top nation world term gdp',
 'percent growth rate area',
 'poverty level falling',
 'achievement globally recognised today',
 'yet lack self confidence see developed nation self reliant self assured',
 'incorrect',
 'third vision',
 'india must stand world',
 'believe unless india stand world one respect u',
 'strength respect strength',
 'must strong military power also economic power',
 'must go hand hand',
 'good fortune worked

['i have three visions for india.',
 'in 3000 years of our history, people from all over \n               the world have come and invaded us, captured our lands, conquered our minds.',
 'from alexander onwards, the greeks, the turks, the moguls, the portuguese, the british,\n               the french, the dutch, all of them came and looted us, took over what was ours.',
 'yet we have not done this to any other nation.',
 'we have not conquered anyone.',
 'we have not grabbed their land, their culture, \n               their history and tried to enforce our way of life on them.',
 'why?',
 'because we respect the freedom of others.that is why my \n               first vision is that of freedom.',
 'i believe that india got its first vision of \n               this in 1857, when we started the war of independence.',
 'it is this freedom that\n               we must protect and nurture and build on.',
 'if we are not free, no one will respect us.',
 'my second vision for india’s developme

31

In [103]:
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
X = cv.fit_transform(corpus).toarray()
X

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.25883507, 0.30512561,
        0.        ],
       [0.        , 0.28867513, 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

## Word Embeddings:

Word embeddings are mathematical (vector) descriptions of words that let the computer capture analogies such as "man is to woman as boy is to girl", ideally without encoding unwanted biases.
How should the algorithm know that orange and juice, or apple and pie, are related, and that orange and apple are similar but juice and pie are not?

### Featurised Representation
We pick some number k of features, and each word is scored against every feature with a weight from -1 to 1 that indicates how strongly the word expresses that feature.
Below, 4 features are used to represent 3 words as featurised vectors.

| Feature | Man  | Queen | Orange |
|---------|------|-------|--------|
| Gender  | 1    | -0.99 | 0      |
| Royal   | 0.02 | 1     | 0      |
| Colour  | 0    | 0     | 0.95   |
| Fruit   | 0    | 0     | 0.99   |

Word embeddings are also beneficial for very large vocabularies, as they group similar words together and preserve semantics.

_Named Entity Recognition Example_

Eg: Sally Johnson is an orange farmer.
Our model should be able to identify the name Sally Johnson as a person and not an organisation, since a farmer is a person. This can be carried out by a Bidirectional Recurrent Neural Network.

Transfer Learning and Word Embedding steps:
1. Learn word embeddings from a large text corpus (or use pre-trained embeddings).
2. Transfer the embeddings to a new task with a smaller training set.
3. Fine-tune the word embeddings with the new data.

Also, if you know about dealing with images in a NN, an encoding is very similar to an embedding; the two terms are sometimes used interchangeably.

**Properties of word embeddings** 

V[Man] - V[Woman] ≈ V[King] - V[Queen]

Both differences are about 2 in the gender feature and close to 0 elsewhere. The vectors of similar words correlate in their properties, and so do their differences, which can be used to generate analogies in an n-dimensional space (n = number of features).


Visualise the two difference vectors as a pair of almost parallel vectors in an n-dimensional space. This also helps in understanding **Cosine Similarity**: if two vectors are parallel, the cosine of the angle between them is 1.

That is, if we ask the question:

V[Man] - V[Woman] ≈ V[King] - V[?]

the output is highly likely to be Queen.
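
The analogy search can be sketched with hand-made vectors. Everything in `E` below is a hypothetical toy embedding over the features gender, royal, colour and fruit; real embeddings are learned and their dimensions are not this interpretable.

```python
import math

# Hypothetical feature order: [gender, royal, colour, fruit]
E = {
    "man":    [1.00, 0.02, 0.00, 0.00],
    "woman":  [-1.00, 0.01, 0.00, 0.00],
    "king":   [0.95, 0.93, 0.00, 0.00],
    "queen":  [-0.99, 1.00, 0.00, 0.00],
    "orange": [0.00, 0.00, 0.95, 0.99],
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Solve a : b :: c : ? by maximising cosine similarity to e_c - e_a + e_b."""
    target = [ec - ea + eb for ea, eb, ec in zip(E[a], E[b], E[c])]
    candidates = [w for w in E if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(E[w], target))

print(analogy("man", "woman", "king"))
```

Rearranging V[Man] - V[Woman] ≈ V[King] - V[?] gives V[?] ≈ V[King] - V[Man] + V[Woman], which is the `target` vector the function compares every remaining word against.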

Rupee:India::Pound:[?]

<img src="C:/Users/Samriddh/python jupyter/NLP Learn/tds.png">




**Embedding Matrix E**

An n_feature x n_vocab_size matrix E is created, which stores for every word its weighted features according to the meaning of that word.

If we multiply this matrix E by the one-hot column vector O_i of a specific word, we get an (n_feature, 1) vector e_i containing the features of that word, that is: E.O_i = e_i.

In practice, since the one-hot vector is almost entirely zeros, this multiplication is wasteful; we use the built-in Embedding layer of Keras, which looks up the corresponding column directly.
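
A small sketch shows why a lookup is enough. The 3 x 4 matrix `E` and the four-word vocabulary are made-up values for illustration.

```python
vocab = ["a", "glass", "orange", "juice"]

# Hypothetical 3 x 4 embedding matrix E: one column per word.
E = [
    [0.1, 0.4, 0.7, 0.2],
    [0.0, 0.5, 0.9, 0.8],
    [0.3, 0.6, 0.1, 0.5],
]

def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]

def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

i = vocab.index("orange")
e_by_product = matvec(E, one_hot(i, len(vocab)))  # E . O_i
e_by_lookup = [row[i] for row in E]               # read column i directly

print(e_by_product, e_by_lookup)
```

Both routes give the same vector, but the lookup touches n_feature numbers instead of n_feature x n_vocab_size.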



**Learning Embeddings** 

Take the list of words "I want a glass of orange" and start with the first word, I. Its one-hot vector O has a one at the position of I and is n_vocab_size-dimensional (let's say 10k). We then have a matrix of parameters E and take E times O to get the embedding vector e[I]. We do the same for all of the other words, so each word gets an n_feature-dimensional (let's say 300) embedding vector.

We feed all of these embeddings into a neural network layer, which in turn feeds a softmax. The softmax classifies among the 10,000 possible outputs in the vocabulary for the final word we are trying to predict. If the training set contains the word juice after this phrase, the target for the softmax is juice. The hidden layer has its own parameters, W1 and b1, and the softmax has its own parameters, W2 and b2. With 300-dimensional word embeddings and six words, the input is a 6 x 300 = 1,800-dimensional vector obtained by stacking the six embedding vectors.

What is more commonly done is to use a fixed historical window. For example, you might always predict the next word given the previous four words, where four is a hyperparameter of the algorithm. The network then takes a 4 x 300 = 1,200-dimensional input, passes it through the hidden layer and the softmax, and predicts the output. Using a fixed history means that you can deal with arbitrarily long or short sentences, because the input size is always the same.

It turns out that this algorithm learns pretty decent word embeddings. Recall the orange juice / apple juice example: the algorithm has an incentive to learn similar embeddings for orange and apple, because it sees "orange juice" sometimes and "apple juice" sometimes, and with only a 300-dimensional feature vector to represent each word, similar embeddings let it fit the training set better.
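
The fixed four-word history can be sketched as a function that turns a sentence into (context, target) training pairs. The name `window_pairs` is chosen for this example.

```python
def window_pairs(tokens, history=4):
    """Return (context, target) pairs: `history` previous words -> next word."""
    return [
        (tokens[i - history:i], tokens[i])
        for i in range(history, len(tokens))
    ]

tokens = "I want a glass of orange juice".split()
pairs = window_pairs(tokens, history=4)
for context, target in pairs:
    print(context, "->", target)

n_features = 300  # embedding size used in the notes
print("input size:", 4 * n_features)
```

Every pair has exactly four context words, so the stacked input always has 4 x 300 = 1,200 entries regardless of sentence length.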

_Generalisation:_ 

We can choose the context from which the prediction is made, for example the previous 4 words, the next 4 words, or 4 words on the left and 4 on the right, to predict a single word or a longer sequence.

Credits: Andrew Ng, Lec 5.2.1

**Benefits**
1. A denser matrix that is far more compact and stores semantics and relations, and that can be visualised.
2. Lower dimensionality, which speeds up the math, and less sparseness.

#### 1. Word2Vec 

Word2Vec is a simple word-embedding model built on the ideas just discussed.
It comes in two variants:

**Skip-Gram Model and CBoW Model**


We take a context word from a sentence and use it to predict a target word that lies within some window around it, skipping a number of words in between; hence the name skip-gram.
Let us take an example:
I want a glass of orange juice to go along with my cereal.
We pick a context word (say, orange) and a target word near it (say, juice), and learn a mapping from the context to the target.
Step 1. O_c --> E --> e_c = E.O_c
Step 2. e_c --> Softmax --> y_hat


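Softmax:

$$p(t \mid c) = \frac{\exp(\theta_t^\top e_c)}{\sum_{j=1}^{n} \exp(\theta_j^\top e_c)}$$

Here $\theta_t$ is the output-layer parameter vector associated with word t, $e_c$ is the embedding of the context word, and n is the vocabulary size.

The softmax output is normalised to sum to 1, and the loss is the cross-entropy between y_hat and the one-hot target y: $L(\hat{y}, y) = -\sum_{i=1}^{n} y_i \log \hat{y}_i$.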
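
Here is a minimal numeric sketch of the softmax step and its loss. The per-word output vectors of the softmax are called `theta` here, and the 2-dimensional values are made up for illustration.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-dimensional context embedding and per-word output vectors.
e_c = [0.5, -0.2]
theta = {
    "juice": [2.0, 0.1],
    "king":  [-1.0, 0.3],
    "book":  [0.2, 0.2],
}

words = list(theta)
scores = [sum(t * e for t, e in zip(theta[w], e_c)) for w in words]
probs = dict(zip(words, softmax(scores)))
loss = -math.log(probs["juice"])  # cross-entropy when the true target is "juice"
print(probs, loss)
```

With a one-hot target, the cross-entropy sum reduces to the negative log-probability of the true word, which is what `loss` computes.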

This skip-gram model is very slow to train, because the softmax sums over the whole vocabulary. We can use a hierarchical softmax, a tree of classifiers that narrows down where to look for a word instead of searching through the whole vocabulary.

How to sample the context C?

1. Removal of stopwords.
2. Balance the less common words against the more common ones.
3. **Negative Sampling:**

An efficient version of skip-gram:
Problem: given a pair of words, like orange and juice, we predict whether it is a context-target pair.

x = context-target pair
y = label (1 or 0)
**Orange**-Juice = 1
Orange-King = 0  
Orange-book = 0
Orange-of = 0


This is how the training set is built: we pick a context word like orange together with a true target word (label 1), then pick k random words from the dictionary and label them 0. The k words are sampled according to the empirical frequency of the words or, more commonly, according to frequency^0.75 normalised to sum to 1.

The benefit is that instead of one large softmax problem we have a set of binary classification problems.
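
Building one positive pair plus k negatives can be sketched as follows. The word counts are hypothetical, and excluding the context and target words from the negative candidates is a simplification made for this example.

```python
import random

# Hypothetical word counts from a corpus.
counts = {"of": 50, "the": 80, "king": 5, "book": 8, "juice": 3, "orange": 4}

# Sampling distribution proportional to frequency ** 0.75.
weights = {w: c ** 0.75 for w, c in counts.items()}
total = sum(weights.values())
probs = {w: v / total for w, v in weights.items()}

def make_examples(context, target, k, rng):
    """One positive pair plus k sampled negatives for the same context."""
    examples = [((context, target), 1)]
    candidates = [w for w in probs if w not in (context, target)]
    cand_weights = [probs[w] for w in candidates]
    for word in rng.choices(candidates, weights=cand_weights, k=k):
        examples.append(((context, word), 0))
    return examples

rng = random.Random(0)  # fixed seed so the sketch is repeatable
examples = make_examples("orange", "juice", k=4, rng=rng)
for pair, label in examples:
    print(pair, label)
```

Raising counts to the power 0.75 flattens the distribution: "the" is 16 times more frequent than "king" in the toy counts, but only 8 times more likely to be sampled.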

**Difference between SkipGram and CBoW**

The CBOW model learns to predict a target word from all the words in its neighborhood. The sum of the context vectors is used to predict the target word. The neighboring words taken into consideration are determined by a pre-defined window size surrounding the target word.

The SkipGram model, on the other hand, learns to predict a word based on a neighboring word. To put it simply, given a word, it learns to predict another word in its context.
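
The difference shows up clearly in the training pairs each model sees. This is a sketch with a window of 2 words on each side; the function names are chosen for this example.

```python
def context_of(tokens, i, window):
    """Words within `window` positions to the left and right of position i."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_pairs(tokens, window=2):
    """(all context words) -> target word."""
    return [(context_of(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_pairs(tokens, window=2):
    """(one input word) -> one context word, for each word in the window."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_of(tokens, i, window)
    ]

tokens = "I want a glass of orange juice".split()
print(cbow_pairs(tokens)[5])
print([p for p in skipgram_pairs(tokens) if p[0] == "orange"])
```

CBOW yields one example per position with the whole window as input, while skip-gram yields one example per (word, neighbour) combination.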


[Research paper to refer](https://www.aclweb.org/anthology/D17-1056.pdf)

[Difference between above two models](https://stackoverflow.com/questions/38287772/cbow-v-s-skip-gram-why-invert-context-and-target-words)



#### 2. GloVe- Global Vectors for word representations
Not used as widely as Word2Vec, but it is very simple and works well.

Previously we sampled pairs of context and target words; now we make the global co-occurrence statistics explicit.


X_ij = number of times i appears in the context of j

Depending on how the context is defined, it may happen that X_ij == X_ji (for example with a symmetric window around each word). Such a symmetric count acts as a measure of how often two words occur close together.

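Model:
Minimise this:

$$\sum_{i=1}^{n} \sum_{j=1}^{n} f(X_{ij}) \left(\theta_i^\top e_j + b_i + b_j' - \log X_{ij}\right)^2$$

Here $\theta_i$ and $e_j$ are the vectors for the target word i and the context word j, $b_i$ and $b_j'$ are bias terms, and $f(X_{ij})$ is a weighting factor with $f(0) = 0$, so pairs with $X_{ij} = 0$ are skipped.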
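
The co-occurrence counts and the weighting factor can be sketched as below. The toy sentence and the function names are made up for this example; the default values `x_max=100` and `alpha=0.75` follow the weighting function proposed in the GloVe paper and are not stated in these notes.

```python
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """X[i][j] = number of times word i appears within `window` words of word j."""
    X = defaultdict(lambda: defaultdict(int))
    for pos, word in enumerate(tokens):
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for other_pos in range(lo, hi):
            if other_pos != pos:
                X[tokens[other_pos]][word] += 1
    return X

def weight(x, x_max=100, alpha=0.75):
    """Weighting f(X_ij): 0 for unseen pairs, capped at 1 for very frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

tokens = "orange juice and apple juice and orange pie".split()
X = cooccurrence(tokens)
print(X["orange"]["juice"], X["juice"]["orange"])
print(weight(0), weight(10), weight(500))
```

With a symmetric window the counts satisfy X_ij == X_ji, and because the weight of a zero count is 0, pairs that never co-occur drop out of the sum without ever evaluating log 0.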




#### Word2Vec simple one liner code
from gensim.models import Word2Vec
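model = Word2Vec(sentences, min_count=1)  # sentences: a list of tokenised sentences (lists of words)
words = model.wv.vocab  # gensim 3.x API; gensim 4 exposes model.wv.key_to_index instead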
vector = model.wv['war']
similar = model.wv.most_similar('vikram')

In [108]:
!pip install gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/0b/66/04faeedb98bfa5f241d0399d0102456886179cabac0355475f23a2978847/gensim-3.8.3-cp37-cp37m-win_amd64.whl (24.2MB)
Collecting smart-open>=1.8.1 (from gensim)
  Downloading https://files.pythonhosted.org/packages/18/9c/a16951b5a66c86f0ea8ff5aca8d5c700138e708a76412ee7a2ec7fbd4b44/smart_open-4.1.0.tar.gz (116kB)
Collecting Cython==0.29.14 (from gensim)
  Downloading https://files.pythonhosted.org/packages/1f/be/b14be5c3ad1ff73096b518be1538282f053ec34faaca60a8753d975d7e93/Cython-0.29.14-cp37-cp37m-win_amd64.whl (1.7MB)
Building wheels for collected packages: smart-open
  Building wheel for smart-open (setup.py): started
  Building wheel for smart-open (setup.py): finished with status 'done'
  Created wheel for smart-open: filename=smart_open-4.1.0-cp37-none-any.whl size=106210 sha256=a648b4ac6eeccf2c2c42e67cafa5d93929c0bb44cfaa4224d1003aa7c7772403
  Stored in directory: C:\Users\Samriddh\AppData\Local\pip\Cache\wheels\e

In [126]:
import nltk

from gensim.models import Word2Vec
from nltk.corpus import stopwords

import re

paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom. I believe that India got its first vision of 
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India 
               stands up to the world, no one will respect us. Only strength respects strength. We must be 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""

# Preprocessing the data
text = re.sub(r'\[[0-9]*\]',' ',paragraph)
text = re.sub(r'\s+',' ',text)
text = text.lower()
text = re.sub(r'\d',' ',text)
text = re.sub(r'\s+',' ',text)

# Preparing the dataset
sentences = nltk.sent_tokenize(text)

sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# Build the stopword set once instead of on every iteration
stop_words = set(stopwords.words('english'))
for i in range(len(sentences)):
    sentences[i] = [word for word in sentences[i] if word not in stop_words]
    
    
# Training the Word2Vec model
model = Word2Vec(sentences, min_count=1)


# Vocabulary learned by the model (gensim 3.x API; in gensim 4 use model.wv.key_to_index)
words = model.wv.vocab

# Finding Word Vectors
vector = model.wv['war']

# Most similar words
similar = model.wv.most_similar('freedom')

In [129]:
similar
#to see the list of correlated words

[('three', 0.2606293559074402),
 ('dutch', 0.19812467694282532),
 ('first', 0.19498443603515625),
 ('moguls', 0.18788421154022217),
 ('nuclear', 0.1849081963300705),
 ('invaded', 0.18165282905101776),
 ('others.that', 0.17736905813217163),
 ('people', 0.16865241527557373),
 ('life', 0.1643061339855194),
 ('dept', 0.16008752584457397)]

# Neural Networks and NLP

I have studied Sequence Models from Andrew Ng and completed the assignments as well, so I will upload the DL application notebook separately.