## NLP - Word2vec using Gensim

**Word embeddings** are numeric representations of text: each word in a vocabulary (dictionary) is mapped to a vector of real numbers.

Word embeddings can be broadly classified into two categories.
**Frequency-based Embedding**   
    - Count Vector
    - TF-IDF Vector
    - Co-Occurrence Vector
**Prediction-based Embedding**  
    **- CBOW (Continuous Bag of Words)** 
      It predicts the probability of a word given its context. A context may be a single word or a group of words. For simplicity, I will take a single context word and try to predict a single target word.    
    **- Skip-Gram model** 
      It flips CBOW's architecture on its head: the aim of Skip-Gram is to predict the context given a word. It is often reported to represent rare words better than CBOW, at the cost of slower training.
      
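We set out to use the Skip-Gram model on text data. Note that gensim's `Word2Vec` trains CBOW by default (`sg=0`); Skip-Gram is selected with `sg=1`, which the model call in this notebook does not pass.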
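
To make the difference between the two prediction-based models concrete, here is a minimal sketch of the training pairs each one would see. The function names `cbow_pairs` and `skipgram_pairs` are made up for illustration, and the fixed window is a simplification of what a real implementation does.

```python
def cbow_pairs(tokens, window=1):
    """Yield (context_words, target): CBOW predicts the target from its context."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            yield context, target


def skipgram_pairs(tokens, window=1):
    """Yield (target, context_word): Skip-Gram predicts each context word from the target."""
    for context, target in cbow_pairs(tokens, window):
        for word in context:
            yield target, word


tokens = "i saw this film".split()
cbow = list(cbow_pairs(tokens))
skip = list(skipgram_pairs(tokens))
print(cbow)
print(skip)
```

With a window of 1, CBOW produces one example per word, while Skip-Gram produces one example per (word, neighbour) combination, which is one reason it tends to train more slowly.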

In [21]:
# Install gensim
!pip install gensim --quiet



In [1]:
import pandas as pd
import re, string
import gensim
import logging



### Load Movie Reviews Text Data

Download data from Kaggle -> https://www.kaggle.com/c/word2vec-nlp-tutorial/data.

Filename: unlabeledTrainData.tsv.zip

In [2]:
your_local_path="C:/Users/s.mudalapuram/Documents/PythonMe/data/"

In [3]:
df = pd.read_csv(your_local_path+'unlabeledTrainData.zip', header=0, delimiter="\t", quoting=3)

In [4]:
print(df.shape)
df.head()

(50000, 2)


|   | id | review |
|---|---|---|
| 0 | "9999_0" | "Watching Time Chasers, it obvious that it was... |
| 1 | "45057_0" | "I saw this film about 20 years ago and rememb... |
| 2 | "15561_0" | `"Minor Spoilers<br /><br />In New York, Joan B...` |
| 3 | "7161_0" | "I went to see this film with a great deal of ... |
| 4 | "43971_0" | "Yes, I agree with everyone on this site this ... |
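
The `id` and `review` values above still carry their double quotes because `quoting=3` is the numeric value of `csv.QUOTE_NONE`, which tells the parser to treat quotes as ordinary characters. The sketch below shows the effect with the standard-library `csv` module on a shortened, hand-typed row; it illustrates the quoting mode and is not a substitute for `pd.read_csv`.

```python
import csv
import io

# One shortened row in the same tab-separated layout as the review file.
sample = 'id\treview\n"9999_0"\t"Watching Time Chasers"\n'

# QUOTE_NONE (numeric value 3): quote characters are kept as part of the field.
raw_rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE))

# Default quoting: the surrounding quotes are treated as delimiters and stripped.
default_rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

print(csv.QUOTE_NONE)
print(raw_rows[1])
print(default_rows[1])
```

Keeping the quotes is harmless here because the cleaning step later replaces every non-letter character with a space.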


## Function to Clean up data

In [5]:
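def clean_str(text):
  """
  String cleaning before vectorization
  """
  try:
    # Drop lines that start with an http(s):// prefix followed by a literal "<>"
    text = re.sub(r'^https?:\/\/<>.*[\r\n]*', '', text, flags=re.MULTILINE)
    # Replace every non-letter (digits, punctuation, HTML brackets) with a space
    text = re.sub(r"[^A-Za-z]", " ", text)
    # Lowercase and collapse runs of whitespace into single spaces
    words = text.strip().lower().split()
    return " ".join(words)
  except TypeError:
    # Non-string input (for example a missing value) becomes an empty string
    return ""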

Clean the data using the function above.

In [6]:
df['clean_review'] = df['review'].apply(clean_str)
df.head()

|   | id | review | clean_review |
|---|---|---|---|
| 0 | "9999_0" | "Watching Time Chasers, it obvious that it was... | watching time chasers it obvious that it was m... |
| 1 | "45057_0" | "I saw this film about 20 years ago and rememb... | i saw this film about years ago and remember i... |
| 2 | "15561_0" | `"Minor Spoilers<br /><br />In New York, Joan B...` | minor spoilers br br in new york joan barnard ... |
| 3 | "7161_0" | "I went to see this film with a great deal of ... | i went to see this film with a great deal of e... |
| 4 | "43971_0" | "Yes, I agree with everyone on this site this ... | yes i agree with everyone on this site this mo... |
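
Row 2 above shows `br` tokens left behind by `<br />` tags, because the letter-only substitution removes the angle brackets but keeps the tag name. A hedged sketch of one way to avoid that follows; `strip_tags_then_clean` is a hypothetical helper, not part of the notebook, and using it would change the vocabulary and the numbers reported later.

```python
import re


def clean_letters_only(text):
    """Letter-only normalisation, mirroring the core of clean_str (URL step omitted)."""
    text = re.sub(r"[^A-Za-z]", " ", text)
    return " ".join(text.strip().lower().split())


def strip_tags_then_clean(text):
    """Hypothetical variant: replace HTML tags such as <br /> with a space first."""
    text = re.sub(r"<[^>]+>", " ", text)
    return clean_letters_only(text)


sample = "Minor Spoilers<br /><br />In New York"
print(clean_letters_only(sample))     # minor spoilers br br in new york
print(strip_tags_then_clean(sample))  # minor spoilers in new york
```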


In [7]:
df.describe()

## Convert Each Review to a Word List
gensim's `Word2Vec` expects an iterable of tokenized sentences, so each cleaned review is split into a list of words.

In [8]:
documents = []

for doc in df['clean_review']:
    documents.append(doc.split(' '))
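
One detail worth knowing about the loop above: `split(' ')` and `split()` behave differently on an empty string. The sketch below uses a made-up two-review list to show it; the second entry stands in for a review that cleaned down to nothing.

```python
# Toy stand-in for df['clean_review']; the empty string mimics a review with no letters.
clean_reviews = ["i saw this film", ""]

# split(' ') keeps a single empty-string token for an empty review.
docs_space = [doc.split(' ') for doc in clean_reviews]

# split() with no argument returns an empty list instead.
docs_default = [doc.split() for doc in clean_reviews]

print(docs_space)    # [['i', 'saw', 'this', 'film'], ['']]
print(docs_default)  # [['i', 'saw', 'this', 'film'], []]
```

With `min_count=10` a stray empty token is unlikely to survive vocabulary pruning, so this matters mostly for tidiness.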

## Build the Model

In [9]:
#Logging for training
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

#Build the model (parameter names as in gensim 3.x)
model = gensim.models.Word2Vec(documents, #Word list
                               min_count=10, #Ignore all words with total frequency lower than this
                               workers=4, #Number of worker threads
                               size=50,  #Embedding size (renamed vector_size in gensim 4.x)
                               window=5, #Maximum distance between current and predicted word
                               iter=10   #Number of iterations over the text corpus (renamed epochs in gensim 4.x)
                              )

2018-10-06 12:26:35,142 : INFO : collecting all words and their counts
2018-10-06 12:26:35,143 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-10-06 12:26:38,543 : INFO : PROGRESS: at sentence #10000, processed 2399440 words, keeping 51654 word types
2018-10-06 12:26:42,128 : INFO : PROGRESS: at sentence #20000, processed 4835846 words, keeping 69077 word types
2018-10-06 12:26:45,678 : INFO : PROGRESS: at sentence #30000, processed 7267977 words, keeping 81515 word types
2018-10-06 12:26:49,196 : INFO : PROGRESS: at sentence #40000, processed 9669772 words, keeping 91685 word types
2018-10-06 12:26:52,570 : INFO : collected 100479 word types from a corpus of 12084660 raw words and 50000 sentences
2018-10-06 12:26:52,575 : INFO : Loading a fresh vocabulary
2018-10-06 12:26:54,809 : INFO : effective_min_count=10 retains 28322 unique words (28% of original 100479, drops 72157)
2018-10-06 12:26:54,814 : INFO : effective_min_count=10 leaves 11910457 word cor

2018-10-06 12:37:34,327 : INFO : training on a 120846600 raw words (88171431 effective words) took 635.3s, 138786 effective words/s
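
The log line `effective_min_count=10 retains 28322 unique words (28% of original 100479, drops 72157)` reflects vocabulary pruning: words seen fewer than `min_count` times are discarded before training. A minimal sketch of that idea on toy data, where `prune_vocab` is an illustrative name and not a gensim function:

```python
from collections import Counter


def prune_vocab(documents, min_count):
    """Count word frequencies and keep only words seen at least min_count times."""
    counts = Counter(word for doc in documents for word in doc)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    return counts, kept


toy_docs = [["good", "film"], ["good", "plot"], ["bad", "film"], ["good", "acting"]]
counts, kept = prune_vocab(toy_docs, min_count=2)

print(sorted(kept))                                # ['film', 'good']
print(len(kept), "of", len(counts), "word types")  # 2 of 5 word types
print(sum(kept.values()), "of", sum(counts.values()), "tokens")  # 5 of 8 tokens
```

As in the real log, a small share of word types can still cover most of the running text, because the dropped words are rare by definition.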


## Exploring the model

Check how many words the model holds and how many features each word vector has.

In [12]:
model.wv.vectors.shape

(28322, 50)

In [13]:
len(model.wv.vocab)  # gensim 4.x: len(model.wv.key_to_index)

28322

Get an embedding for a word

In [14]:
model.wv['flower']

array([-8.9273852e-01, -1.2788788e+00, -6.2977916e-01,  7.1243864e-01,
        4.2914748e-02,  1.2252207e+00,  5.7384908e-01, -2.1514966e-01,
       -2.4799021e-01, -2.8029731e-01, -1.4586948e-01,  1.3179365e-04,
       -7.5423993e-02, -1.3328782e+00, -1.2185906e+00,  3.5757071e-01,
       -5.2048624e-01, -5.4220170e-01,  4.8002150e-02, -3.0896321e-01,
        1.6975659e+00, -7.1009678e-01,  7.6559538e-01,  4.1087532e-01,
       -1.3164710e+00,  3.7278375e-01,  5.9689718e-01,  7.0897996e-01,
       -5.8611524e-01, -5.6818128e-01,  9.0191489e-01,  4.6683010e-01,
       -1.1454289e-01, -4.3652499e-01, -5.3469896e-01,  1.8673523e-01,
        6.8095994e-01, -5.2233654e-01, -4.3626478e-01, -7.8139061e-01,
       -1.4305691e-02, -1.7421122e-01, -1.3110901e+00,  3.7752640e-01,
       -2.4973139e-01, -3.1677586e-01,  1.1957284e-01, -5.0293458e-01,
        2.9535639e-01, -7.3778003e-01], dtype=float32)

Saving the model

In [15]:
model.save('word2vec-movie-50')

2018-10-06 18:04:34,267 : INFO : saving Word2Vec object under word2vec-movie-50, separately None
2018-10-06 18:04:34,277 : INFO : not storing attribute vectors_norm
2018-10-06 18:04:34,308 : INFO : not storing attribute cum_table
2018-10-06 18:04:35,441 : INFO : saved word2vec-movie-50


Finding words with a similar meaning

In [16]:
model.wv.most_similar('great')

2018-10-06 18:04:40,552 : INFO : precomputing L2-norms of word weight vectors


[('fantastic', 0.8808208703994751),
 ('wonderful', 0.8690458536148071),
 ('terrific', 0.8667196035385132),
 ('fine', 0.846993088722229),
 ('good', 0.823563814163208),
 ('brilliant', 0.8093618154525757),
 ('superb', 0.7903881072998047),
 ('perfect', 0.7598206996917725),
 ('nice', 0.7592012882232666),
 ('marvelous', 0.7441679239273071)]
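
The scores above are cosine similarities between word vectors. The sketch below reimplements the ranking on three hand-made 2-D vectors to show the idea; the vectors and the helper names `cosine` and `rank_similar` are invented for illustration and have nothing to do with the trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hand-made toy vectors, not taken from the trained model.
toy = {"great": [1.0, 0.9], "fantastic": [0.9, 1.0], "kitchen": [-1.0, 0.2]}
ranked = rank_similar("great", toy)
print(ranked)
```

Because cosine similarity ignores vector length, two words score highly when their vectors point in a similar direction, regardless of magnitude.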

Find the word that does not belong with the others

In [17]:
model.wv.doesnt_match("man woman child kitchen".split())



'kitchen'
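
As far as I understand it, `doesnt_match` averages the length-normalized vectors of the given words and returns the word whose vector is least similar to that average. The sketch below follows that recipe on hand-made 2-D vectors; `odd_one_out` and the toy values are assumptions for illustration only.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    units = [unit(vectors[w]) for w in words]
    dims = len(units[0])
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dims)])
    scores = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[scores.index(min(scores))]


# Hand-made toy vectors: three words point one way, "kitchen" points another.
toy = {"man": [1.0, 0.1], "woman": [0.9, 0.2], "child": [0.8, 0.3], "kitchen": [-0.2, 1.0]}
result = odd_one_out("man woman child kitchen".split(), toy)
print(result)  # kitchen
```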

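Vector arithmetic: solve king + man = queen + ?, that is, find the words closest to king + man - queen. The matches below are weak; this movie-review corpus may not contain enough data for the analogy. (The analogy usually quoted for word2vec is king - man + woman ≈ queen.)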

In [18]:
model.wv.most_similar(positive=['king','man'], negative=['queen'])



[('scientist', 0.5587218999862671),
 ('marine', 0.5246001482009888),
 ('soldier', 0.5158500075340271),
 ('nemesis', 0.5096298456192017),
 ('master', 0.5050163269042969),
 ('joker', 0.4986332952976227),
 ('cassavetes', 0.4934775233268738),
 ('genius', 0.49158549308776855),
 ('vigilante', 0.48155850172042847),
 ('mastermind', 0.48094528913497925)]
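
To see what a `positive`/`negative` query does, here is a rough sketch on hand-made 2-D vectors: add the unit vectors of the positive words, subtract those of the negative words, and return the closest remaining word. The `analogy` helper and the toy vectors (axis 0 loosely "royalty", axis 1 loosely "gender") are assumptions for illustration, not gensim's code or the trained embeddings.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(positive, negative, vectors):
    """Combine unit vectors (+ for positive, - for negative) and return the nearest other word."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # input words are never returned
    scores = {w: sum(a * b for a, b in zip(query, unit(v)))
              for w, v in vectors.items() if w not in exclude}
    return max(scores, key=scores.get)


toy = {"king": [1.0, 1.0], "queen": [1.0, -1.0], "man": [0.0, 1.0],
       "woman": [0.0, -1.0], "film": [-1.0, 0.0]}

classic = analogy(positive=["king", "woman"], negative=["man"], vectors=toy)
as_above = analogy(positive=["king", "man"], negative=["queen"], vectors=toy)
print(classic)   # queen
print(as_above)  # film
```

On these toy vectors the classic form king - man + woman lands on "queen", while king + man - queen points along the "gender" axis alone and finds no natural partner. That may be part of why the real results above look unrelated, in addition to the limited data.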

Loading a model from disk

In [19]:
model = gensim.models.Word2Vec.load('word2vec-movie-50')

2018-10-06 18:04:53,250 : INFO : loading Word2Vec object from word2vec-movie-50
2018-10-06 18:04:53,628 : INFO : loading wv recursively from word2vec-movie-50.wv.* with mmap=None
2018-10-06 18:04:53,632 : INFO : setting ignored attribute vectors_norm to None
2018-10-06 18:04:53,633 : INFO : loading vocabulary recursively from word2vec-movie-50.vocabulary.* with mmap=None
2018-10-06 18:04:53,638 : INFO : loading trainables recursively from word2vec-movie-50.trainables.* with mmap=None
2018-10-06 18:04:53,643 : INFO : setting ignored attribute cum_table to None
2018-10-06 18:04:53,648 : INFO : loaded word2vec-movie-50


In [20]:
model.wv.most_similar('educate')

2018-10-06 18:04:54,618 : INFO : precomputing L2-norms of word weight vectors


[('manipulate', 0.7469743490219116),
 ('impose', 0.7460099458694458),
 ('expose', 0.7449526786804199),
 ('cater', 0.7389202117919922),
 ('teach', 0.7382681369781494),
 ('convert', 0.7328129410743713),
 ('conduct', 0.7318152189254761),
 ('exploit', 0.7308415174484253),
 ('respond', 0.7161504030227661),
 ('entice', 0.7155665755271912)]