# PG AI - Natural Language Processing and Speech Recognition
# Project: Word2vec Model Creation

DESCRIPTION

In this project, I will show you how to train a Word2Vec model in Python with Gensim, an NLP library.

The idea behind Word2Vec is pretty simple: we assume that the meaning of a word can be inferred from the company it keeps. This is similar to the saying, “Show me your company and I will tell you who you are”.<br>

If two words have very similar neighbors (that is, the contexts in which they are used are about the same), then these words are probably similar in meaning or at least related. For example, the words shocked, appalled, and astonished are usually used in similar contexts.<br>
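
To make the "similar neighbors" intuition concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea, not how Word2Vec itself works: it collects the words seen within a small window around each word and compares two words by the overlap of those neighbor sets. The function names, the Jaccard overlap measure, and the toy sentences are all assumptions made up for this example.

```python
from collections import defaultdict

def context_sets(sentences, window=2):
    """Map each word to the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # neighbors to the left and to the right, excluding the word itself
            contexts[word].update(tokens[lo:i] + tokens[i + 1:hi])
    return contexts

def jaccard(a, b):
    """Overlap between two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy, made-up sentences (not taken from the hotel reviews dataset)
toy_sentences = [
    "i was shocked by the dirty room".split(),
    "i was appalled by the dirty room".split(),
    "the breakfast buffet was tasty".split(),
]

contexts = context_sets(toy_sentences)
print(jaccard(contexts["shocked"], contexts["appalled"]))   # identical neighbors
print(jaccard(contexts["shocked"], contexts["breakfast"]))  # only partial overlap
```

In this toy data, "shocked" and "appalled" share exactly the same neighbors, while "shocked" and "breakfast" share only a few common function words. Word2Vec learns dense vectors rather than comparing neighbor sets, but the signal it learns from is the same kind of context information.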

By Edson Teixeira<br>
teixeiraedson252@gmail.com <br>
December 29th, 2021

In [1]:
# Imports
import gzip
import gensim

In [2]:
# Dataset
data_file="reviews_data.txt.gz"

# Print the first raw line (one review, as bytes) to inspect the format
with gzip.open(data_file, 'rb') as f:
    for line in f:
        print(line)
        break

b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night. As this is a fairly new place some of the taxi drivers did not know where it was and/or did not want to drive there. Once I have eventually arrived at the hotel, I was very pleasantly surprised with the decor of the lobby/ground floor area. It was very stylish and modern. I found the reception's staff geeting me with 'Aloha' a bit out of place, but I guess they are briefed to say that to keep up the coroporate image.As I have a Starwood Preferred Guest member, I was given a small gift upon-check in. It was only a couple of fridge magnets in a gift box, but nevertheless a nice gesture.My room was nice and roomy, there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by 'bliss'.The location is not great. It is at the last metro stop and you then need to take a taxi, but if you are not planning on going to see the historic sites in Be

In [3]:
# Read the reviews one at a time and yield each as a list of tokens
def read_input(input_file):
    with gzip.open(input_file, 'rb') as f:
        for line in f:
            # do some pre-processing and yield a list of words for each review text
            yield gensim.utils.simple_preprocess(line)

# read the tokenized reviews into a list
# each review item becomes a series of words
# so this becomes a list of lists
documents = list(read_input(data_file))
print(documents[0])

['oct', 'nice', 'trendy', 'hotel', 'location', 'not', 'too', 'bad', 'stayed', 'in', 'this', 'hotel', 'for', 'one', 'night', 'as', 'this', 'is', 'fairly', 'new', 'place', 'some', 'of', 'the', 'taxi', 'drivers', 'did', 'not', 'know', 'where', 'it', 'was', 'and', 'or', 'did', 'not', 'want', 'to', 'drive', 'there', 'once', 'have', 'eventually', 'arrived', 'at', 'the', 'hotel', 'was', 'very', 'pleasantly', 'surprised', 'with', 'the', 'decor', 'of', 'the', 'lobby', 'ground', 'floor', 'area', 'it', 'was', 'very', 'stylish', 'and', 'modern', 'found', 'the', 'reception', 'staff', 'geeting', 'me', 'with', 'aloha', 'bit', 'out', 'of', 'place', 'but', 'guess', 'they', 'are', 'briefed', 'to', 'say', 'that', 'to', 'keep', 'up', 'the', 'coroporate', 'image', 'as', 'have', 'starwood', 'preferred', 'guest', 'member', 'was', 'given', 'small', 'gift', 'upon', 'check', 'in', 'it', 'was', 'only', 'couple', 'of', 'fridge', 'magnets', 'in', 'gift', 'box', 'but', 'nevertheless', 'nice', 'gesture', 'my', 'room
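
The output above shows that numbers, punctuation, and one-letter words such as "I" and "a" have disappeared, and everything is lowercase. The sketch below is a rough standard-library approximation of that behavior, written only to illustrate it. It is not Gensim's actual code: the name `rough_preprocess` and the regular expression are assumptions, and the length limits of 2 and 15 characters mirror what I understand to be the defaults of `simple_preprocess`.

```python
import re

# Runs of word characters that contain no digits.
# This pattern is an approximation chosen for the sketch, not Gensim's source.
TOKEN_PATTERN = re.compile(r"(?:(?!\d)\w)+")

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase the text, split it into alphabetic tokens, and drop
    tokens that are shorter than min_len or longer than max_len."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

# The start of the first review shown earlier
sample = "Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night."
print(rough_preprocess(sample))
```

On this sample the sketch produces the same leading tokens as the real output: the date digits are dropped, "I" is removed for being too short, and the remaining words are lowercased.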

In [4]:
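# Passing `documents` to the constructor builds the vocabulary and runs an initial training pass.
# Note: `size` is the gensim 3.x parameter name; gensim 4.0+ renamed it to `vector_size`.
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)
# Continue training for 10 more epochs over the same reviews
model.train(documents, total_examples=len(documents), epochs=10)
print("done")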

done
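
One of the parameters above, `min_count=2`, tells the model to ignore words whose total frequency in the corpus is below 2, which removes many one-off typos and rare names from the vocabulary. The following is a small sketch of that filtering step in plain Python. The helper `build_vocab` and the toy documents are hypothetical and exist only for illustration; Gensim's own vocabulary building does more than this.

```python
from collections import Counter

def build_vocab(tokenized_docs, min_count=2):
    """Return the set of words that occur at least `min_count` times
    across all documents (counting repeats within a document)."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return {word for word, n in counts.items() if n >= min_count}

# Toy, made-up tokenized reviews
toy_docs = [
    ["nice", "hotel", "nice", "staff"],
    ["hotel", "was", "dirty"],
    ["staff", "was", "rude"],
]

vocab = build_vocab(toy_docs, min_count=2)
print(sorted(vocab))
```

Here "dirty" and "rude" each appear only once, so they would not receive a vector; the other four words appear twice and are kept.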


In [5]:
w1 = "happy"
model.wv.most_similar(positive=w1)

[('pleased', 0.8118133544921875),
 ('satisfied', 0.7390106320381165),
 ('delighted', 0.6555253267288208),
 ('impressed', 0.6475741863250732),
 ('thrilled', 0.6246829032897949),
 ('disappointed', 0.5821818709373474),
 ('grateful', 0.5685665607452393),
 ('dissapointed', 0.5578709840774536),
 ('displeased', 0.5250615477561951),
 ('dissappointed', 0.5181708335876465)]

In [6]:
print(model.wv.most_similar(positive=w1))

[('pleased', 0.8118133544921875), ('satisfied', 0.7390106320381165), ('delighted', 0.6555253267288208), ('impressed', 0.6475741863250732), ('thrilled', 0.6246829032897949), ('disappointed', 0.5821818709373474), ('grateful', 0.5685665607452393), ('dissapointed', 0.5578709840774536), ('displeased', 0.5250615477561951), ('dissappointed', 0.5181708335876465)]


In [7]:
w1 = "dirty"
model.wv.most_similar(positive=w1)

[('filthy', 0.8620666861534119),
 ('stained', 0.7946945428848267),
 ('unclean', 0.7829104661941528),
 ('dusty', 0.7526943683624268),
 ('smelly', 0.7459073662757874),
 ('grubby', 0.7402029037475586),
 ('grimy', 0.7381408214569092),
 ('disgusting', 0.7219486832618713),
 ('dingy', 0.7214626669883728),
 ('gross', 0.7195972800254822)]
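
The scores returned by `most_similar` are cosine similarities between word vectors: values close to 1 mean the two vectors point in nearly the same direction. Because the vectors are learned from context, words that appear in similar contexts can score high even when their meanings are opposite, which is probably why "disappointed" shows up among the neighbors of "happy". Below is a minimal sketch of the ranking in plain Python. The 3-dimensional vectors are invented for the example (the trained model uses 150 dimensions), and `most_similar` here is a hypothetical stand-in, not the Gensim method.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, not values from the trained model
toy_vectors = {
    "dirty": [0.9, 0.1, 0.0],
    "filthy": [0.8, 0.2, 0.1],
    "breakfast": [0.0, 0.2, 0.9],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("dirty", toy_vectors))
```

With these toy vectors, "filthy" ranks far above "breakfast" for the query "dirty", mirroring the shape of the real output above.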