# Modern Data Science
**(Module 07: Natural Language Processing)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.
- If you find an issue or bug in this document, please submit it at [tulip-lab/mds](https://github.com/tulip-lab/mds/issues)

Prepared by and for 
**Student Members** |
2006-2019 [TULIP Lab](http://www.tulip.org.au)

---

# Session D - Word2vec
- Note: This code is written in Python 3 (+Gensim 2.3.0)

## Contents

1 [Import training dataset](#Import)

2 [Preprocess data](#Preprocess)

3 [Create and train model](#Create)

4 [Save and load model](#Save)

5 [Similarity calculation](#Similarity)


In [None]:
import re
import numpy as np

from gensim.models import Word2Vec
from nltk.corpus import gutenberg
from multiprocessing import Pool
from scipy import spatial

<a id = "Import"></a>
### <span style="color:#0b486b">Import training dataset</span>
- Import Shakespeare's Hamlet from the Gutenberg corpus in the NLTK library

In [None]:
sentences = list(gutenberg.sents('shakespeare-hamlet.txt'))   # import the corpus and convert into a list

In [None]:
print('Type of corpus: ', type(sentences))
print('Length of corpus: ', len(sentences))

In [None]:
print(sentences[0])    # title, author, and year
print(sentences[1])
print(sentences[10])

<a id = "Preprocess"></a>
### <span style="color:#0b486b">Preprocess data</span>

- Use the `re` module to preprocess the data
- Convert all letters to lowercase
- Remove punctuation, numbers, etc.
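The effect of this filter is easiest to see on a small hand-made token list. The sketch below is illustrative only: the helper name `clean_tokens`, its `strict` flag, and the sample tokens are assumptions, not part of the notebook. It also shows that `re.match` only anchors at the start of a token, so a token such as `a1` passes the filter, whereas `re.fullmatch` would keep purely alphabetic tokens only.

```python
import re

def clean_tokens(tokens, strict=False):
    """Lowercase tokens and keep those that start with a letter.

    With strict=True, keep only tokens made up entirely of letters.
    """
    pattern = re.compile(r'[a-zA-Z]+')
    matcher = pattern.fullmatch if strict else pattern.match
    return [tok.lower() for tok in tokens if matcher(tok)]

# Hypothetical tokens, for illustration only
sample = ['[', 'The', 'Tragedie', 'of', 'Hamlet', '1599', ']', 'a1']

print(clean_tokens(sample))               # 'a1' is kept
print(clean_tokens(sample, strict=True))  # 'a1' is dropped
```
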

In [None]:
# lowercase every token and keep only the tokens that start with a letter
for i in range(len(sentences)):
    sentences[i] = [word.lower() for word in sentences[i] if re.match('^[a-zA-Z]+', word)]

In [None]:
print(sentences[0])    # title and author (the year has been filtered out)
print(sentences[1])
print(sentences[10])

<a id = "Create"></a>
### <span style="color:#0b486b">Create and train model</span>

- Create a word2vec model and train it on the Hamlet corpus
- Key parameters (see https://radimrehurek.com/gensim/models/word2vec.html)
    - **sentences**: training data (an iterable of tokenized sentences, e.g. a list of lists of words)
    - **size**: dimensionality of the embedding vectors (renamed `vector_size` in Gensim 4)
    - **sg**: training algorithm, CBOW if 0, skip-gram if 1
    - **window**: maximum distance between the target word and a context word (with a window of 3, up to 3 words to the left and 3 words to the right are considered)
    - **min_count**: minimum number of occurrences a word needs to be included in the vocabulary
    - **iter**: number of training passes over the corpus (renamed `epochs` in Gensim 4)
    - **workers**: number of worker threads used for training
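To make the **window** parameter concrete, here is a minimal sketch of how (target, context) pairs could be enumerated for a single sentence. The function name `context_pairs` and the sample sentence are assumptions for illustration. The library's own training loop adds further details (for example, it randomly shrinks the window for each target word), which this sketch leaves out.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs using up to `window` words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)               # clip at the sentence start
        stop = min(len(tokens), i + window + 1)  # clip at the sentence end
        for j in range(start, stop):
            if j != i:                           # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical sentence, for illustration only
tokens = ['to', 'be', 'or', 'not', 'to', 'be']
for pair in context_pairs(tokens, window=1):
    print(pair)
```

A larger window produces more pairs per word and tends to capture broader, more topical associations, while a smaller window emphasizes words that appear right next to each other.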

In [None]:
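from multiprocessing import cpu_count

# use one worker thread per CPU core (avoids creating a process pool just to count cores)
model = Word2Vec(sentences = sentences, size = 100, sg = 1, window = 3, min_count = 1, iter = 10, workers = cpu_count())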

In [None]:
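# L2-normalize the word vectors in place to save memory; the model cannot be trained further afterwards
model.init_sims(replace = True)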

<a id = "Save"></a>
### <span style="color:#0b486b">Save and load model</span>

- A trained word2vec model can be saved to disk and loaded back later
- Doing so avoids having to train the model again

In [None]:
model.save('word2vec_model')

In [None]:
model = Word2Vec.load('word2vec_model')

<a id = "Similarity"></a>

### <span style="color:#0b486b">Similarity calculation</span>


- Similarity between embedded words (i.e., vectors) can be computed using metrics such as cosine similarity
- For other metrics and comparisons between them, refer to: https://github.com/taki0112/Vector_Similarity
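As a worked example, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. The sketch below computes it in plain Python on small hand-picked vectors; the function name `cosine` and the inputs are assumptions for illustration. `scipy.spatial.distance.cosine` returns the cosine *distance*, which is 1 minus this value, and that is why the notebook's helper subtracts it from 1.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))    # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))    # orthogonal vectors
print(cosine([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```
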

In [None]:
model.wv.most_similar('hamlet')    # query the word vectors through model.wv

In [None]:
v1 = model.wv['king']
v2 = model.wv['queen']

In [None]:
# define a function that computes the cosine similarity between two word vectors
def cosine_similarity(v1, v2):
    return 1 - spatial.distance.cosine(v1, v2)

In [None]:
cosine_similarity(v1, v2)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

words = "hamlet horatio deere queene noble king queen oh".split()
for w1 in words:
    for w2 in words:
        print(w1, w2, model.wv.similarity(w1, w2))

In [None]:
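# build the pairwise similarity matrix
M = np.zeros((len(words), len(words)))
for i, w1 in enumerate(words):
    for j, w2 in enumerate(words):
        M[i, j] = model.wv.similarity(w1, w2)

plt.imshow(M, interpolation='nearest')
plt.colorbar()

# place one tick per word so that the labels line up with the matrix cells
ax = plt.gca()
ax.set_xticks(range(len(words)))
ax.set_yticks(range(len(words)))
ax.set_xticklabels(words, rotation=45)
ax.set_yticklabels(words)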