# PG AI - Natural Language Processing and Speech Recognition
# Project: Word Analogies

DESCRIPTION

In this project, I will show you how to train and evaluate Word2Vec models on your business data using NLP.<br>
Word2Vec is widely featured as a member of the “new wave” of machine learning algorithms based on neural networks, commonly referred to as "deep learning" (though word2vec itself is rather shallow). Using large amounts of unannotated plain text, word2vec learns relationships between words automatically. The output is a set of vectors, one per word, with remarkable linear relationships that allow us to do things like vec(“king”) – vec(“man”) + vec(“woman”) ≈ vec(“queen”), or vec(“Montreal Canadiens”) – vec(“Montreal”) + vec(“Toronto”), which resembles the vector for “Toronto Maple Leafs”.<br>
Word2Vec is very useful in automatic text tagging, recommender systems, and machine translation. Here, I will show you a Python program that generates word vectors using Word2Vec.<br>

By Edson Teixeira<br>
teixeiraedson252@gmail.com <br>
December 29th 2021

In [1]:
# Step 1: Imports
import warnings
warnings.filterwarnings('ignore')
from nltk.tokenize import sent_tokenize, word_tokenize 
import gensim 
from gensim.models import Word2Vec

In [2]:
# Step 2: Dataset
# Read the whole corpus; the context manager closes the file afterwards
with open("word_analogy.txt", "r", encoding='cp1252') as sample:
    s = sample.read()

In [3]:
# Step 3: Replace newline characters with spaces
f = s.replace("\n", " ")
data = []

In [4]:
# Step 4: Iterate over each sentence in the file
for i in sent_tokenize(f):
    temp = []

    # Tokenize the sentence into lowercase words
    for j in word_tokenize(i):
        temp.append(j.lower())

    data.append(temp)

In [5]:
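# Step 5: Create CBOW (Continuous Bag of Words) Model
# sg defaults to 0, which selects CBOW. gensim 4.x names the dimension
# parameter vector_size (it was size in gensim 3.x).
model1 = gensim.models.Word2Vec(data, min_count=1,
                                vector_size=100, window=5)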
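
To make the `window` parameter more concrete, here is a minimal sketch of the training examples that CBOW and skip-gram each draw from one tokenized sentence. It is plain Python and does not reproduce gensim's internals; the helper names (`context_window`, `cbow_examples`, `skipgram_examples`) and the sample sentence are hypothetical, chosen only for illustration.

```python
def context_window(tokens, index, window):
    """Return the tokens within `window` positions of tokens[index], excluding it."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """CBOW: the surrounding words jointly predict the center word."""
    return [(context_window(tokens, i, window), target)
            for i, target in enumerate(tokens)]


def skipgram_examples(tokens, window):
    """Skip-gram: the center word predicts each surrounding word separately."""
    return [(center, context)
            for i, center in enumerate(tokens)
            for context in context_window(tokens, i, window)]


# Hypothetical sample sentence, already tokenized and lowercased
sentence = ["alice", "was", "beginning", "to", "get", "very", "tired"]

# One CBOW example: (context words, target word)
print(cbow_examples(sentence, window=2)[2])

# First few skip-gram examples: (center word, one context word)
print(skipgram_examples(sentence, window=2)[:3])
```

CBOW yields one example per word position, while skip-gram yields one example per (center, context) pair, so skip-gram sees more training examples from the same text.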

In [6]:
# Step 6: Print results
# Similarity queries live on the model's KeyedVectors (model1.wv)
print("Cosine similarity between 'alice' and 'wonderland' - CBOW : ",
      model1.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' and 'machines' - CBOW : ",
      model1.wv.similarity('alice', 'machines'))

Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.99951744
Cosine similarity between 'alice' and 'machines' - CBOW :  0.9920805
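
The scores above are cosine similarities: the dot product of two word vectors divided by the product of their lengths. A minimal sketch in plain Python, using made-up three-dimensional vectors rather than vectors from the trained model (the function name `cosine_similarity` and the values are illustrative assumptions):

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, not taken from the trained model
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [3.0, 0.0, -1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.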


In [7]:
# Step 7: Create a Skip Gram Model and print results
# sg=1 selects skip-gram; vector_size is the gensim 4.x name for size
model2 = gensim.models.Word2Vec(data, min_count=1, vector_size=100, window=5, sg=1)

print("Cosine similarity between 'alice' and 'wonderland' - Skip Gram : ",
      model2.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' and 'machines' - Skip Gram : ",
      model2.wv.similarity('alice', 'machines'))

Cosine similarity between 'alice' and 'wonderland' - Skip Gram :  0.8932715
Cosine similarity between 'alice' and 'machines' - Skip Gram :  0.8536392
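
The analogy arithmetic from the description, vec(“king”) – vec(“man”) + vec(“woman”), can be evaluated by ranking vocabulary words by cosine similarity to the resulting vector. The sketch below shows the mechanics in plain Python on hand-made two-dimensional vectors; the vocabulary, the values, and the helper name `solve_analogy` are assumptions for illustration and are not output of the models above. With gensim, the `most_similar` method on `model.wv` (called with `positive` and `negative` word lists) serves the same purpose.

```python
import math

# Hypothetical toy embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def solve_analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(solve_analogy("king", "man", "woman", toy_vectors))  # prints: queen
```

The three query words are excluded from the candidates because the nearest vector to the result is often one of the inputs themselves.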
