# Training Word2Vec on Yelp reviews
Full dataset (8M) tested on `16 vCPUs, 128 GB RAM`, 1M sample tested on `8 vCPUs, 30 GB RAM`

Using Texthero + NLTK for preprocessing.

In case you missed the buzz, word2vec is widely featured as a member of the “new wave” of machine learning algorithms based on neural networks, commonly referred to as "deep learning" (though word2vec itself is rather shallow). Using large amounts of unannotated plain text, word2vec learns relationships between words automatically. The output is a set of vectors, one per word, with remarkable linear relationships that allow us to do things like vec(“king”) – vec(“man”) + vec(“woman”) =~ vec(“queen”), or vec(“Montreal Canadiens”) – vec(“Montreal”) + vec(“Toronto”) resembles the vector for “Toronto Maple Leafs”.

Word2vec is very useful in [automatic text tagging](https://github.com/RaRe-Technologies/movie-plots-by-genre), recommender systems and machine translation.

Check out an [online word2vec demo](http://radimrehurek.com/2014/02/word2vec-tutorial/#app) where you can try this vector algebra for yourself. That demo runs `word2vec` on the Google News dataset, of **about 100 billion words**.

## This tutorial

In this tutorial you will learn how to train and evaluate word2vec models on your business data.  
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb


## Preparing the Input
Starting from the beginning, gensim’s `word2vec` expects a sequence of sentences as its input. Each sentence is a list of words (utf8 strings).
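This notebook later builds the whole corpus as an in-memory list of token lists, which is fine when RAM is plentiful. As a minimal sketch of an alternative, assuming a plain-text file with one pre-cleaned sentence per line (the class name `SentenceStream` and the file layout are hypothetical, not part of gensim), a restartable iterable can yield sentences one at a time:

```python
import os
import tempfile


class SentenceStream:
    """Restartable iterable that yields one tokenized sentence per line of a file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated several times.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


# Tiny demo corpus written to a temporary file (illustrative data only).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("great coffee friendly staff\n\nwould visit again\n")
    demo_path = tmp.name

stream = SentenceStream(demo_path)
first_pass = list(stream)
second_pass = list(stream)  # a one-shot generator would be empty on the second pass
os.remove(demo_path)
print(first_pass)
```

Word2Vec reads its input more than once (a vocabulary pass followed by the training epochs), which is why the sketch uses a class with `__iter__` rather than a plain generator.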

In [1]:
#Ensure gensim and Cython are installed
# !pip list | grep Cython

In [2]:
# import modules & set up logging
import os
import pandas as pd
from pandarallel import pandarallel

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

from nltk.tokenize import sent_tokenize
# from textblob import TextBlob
import gensim

#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_colwidth', 200)

In [3]:
import multiprocessing

num_processors = multiprocessing.cpu_count()
print(f'Available CPUs: {num_processors}')

Available CPUs: 16


In [4]:
# !pip install texthero

In [5]:
# !python -m spacy download en_core_web_sm
# !python -m spacy download en_core_web_md
# !python -m spacy download en_core_web_lg
# !python -m spacy download en_core_web_trf

In [6]:
# import spacy 
# spacy.prefer_gpu()

In [7]:
import texthero as hero

#### Copy files to local FS from GCP bucket

In [8]:
!mkdir -p /home/jupyter/data/wordvec

In [9]:
# !gsutil -m cp -n 'gs://msca-bdp-data-open/wordvec/questions-words.txt' '/home/jupyter/data/wordvec/'

### Training Word2vec on Yelp Reviews

In [10]:
%%time

# path_read = 'https://storage.googleapis.com/msca-bdp-data-open/yelp/yelp_academic_dataset_review.json'
# yelp_df = pd.read_json(path_read, orient='records', lines=True, nrows=1000000)
# yelp_df = yelp_df[['text']]

path_read = 'https://storage.googleapis.com/msca-bdp-data-open/yelp/yelp_academic_dataset_review.parquet'
yelp_df = pd.read_parquet(path_read, engine='pyarrow', columns=['text']).head(1_000_000)
# yelp_df = pd.read_parquet(path_read, engine='pyarrow', columns=['text'])


print(f'Memory used by DF {yelp_df.memory_usage().sum()}')
print(f'Read rows: {yelp_df.shape[0]}, columns: {yelp_df.shape[1]}')

Memory used by DF 8000128
Read rows: 1000000, columns: 1
CPU times: user 21.7 s, sys: 26.7 s, total: 48.4 s
Wall time: 1min 3s


In [11]:
yelp_df.head(5)

Unnamed: 0,text
0,"As someone who has worked with many museums, I was eager to visit this gallery on my most recent trip to Las Vegas. When I saw they would be showing infamous eggs of the House of Faberge from the ..."
1,I am actually horrified this place is still in business. My 3 year old son needed a haircut this past summer and the lure of the $7 kids cut signs got me in the door. We had to wait a few minutes ...
2,"I love Deagan's. I do. I really do. The atmosphere is cozy and festive. The shrimp tacos and house fries are my standbys. The fries are sometimes good and sometimes great, and the spicy dipping sa..."
3,"Dismal, lukewarm, defrosted-tasting ""TexMex"" glop;\n\nMumbly, unengaged waiter;\n\nClueless manager, who seeing us with barely nibbled entrees\non plates shoved forward for pickup, thanked us\nper..."
4,"Oh happy day, finally have a Canes near my casa. Yes just as others are griping about the Drive thru is packed just like most of the other canes in the area but I like to go sit down to enjoy my c..."


#### Tokenize text into Sentences

In [12]:
pandarallel.initialize(nb_workers=num_processors-1, use_memory_fs=False)

INFO: Pandarallel will run on 15 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


#### SpaCy produces more accurate sentence segmentation, but runs a lot slower

In [13]:
# %%time

# nlp = spacy.load("en_core_web_sm")

# # Convert text into sentences
# yelp_df["sentences"] = yelp_df['text'].parallel_apply(lambda x: [sent.text for sent in nlp(x).sents])
# yelp_df = yelp_df[["sentences"]]

# # Create a dataframe with one sentence per row
# yelp_df = yelp_df.explode("sentences", ignore_index=True)

# print(f'Rows: {yelp_df.shape[0]}, Columns: {yelp_df.shape[1]}')

#### NLTK runs very fast, but is less accurate at sentence segmentation

In [14]:
%%time

# Convert text into sentences

# yelp_df["sentences"] = yelp_df.apply(lambda x: sent_tokenize(x['text']), axis=1)
yelp_df["sentences"] = yelp_df.parallel_apply(lambda x: sent_tokenize(x['text']), axis=1)
yelp_df = yelp_df[["sentences"]]

# Create a dataframe with one sentence per row
yelp_df = yelp_df.explode("sentences", ignore_index=True)

print(f'Rows: {yelp_df.shape[0]}, Columns: {yelp_df.shape[1]}')

Rows: 7940623, Columns: 1
CPU times: user 6.56 s, sys: 4.66 s, total: 11.2 s
Wall time: 45.1 s



In [15]:
yelp_df.head(5)

Unnamed: 0,sentences
0,"As someone who has worked with many museums, I was eager to visit this gallery on my most recent trip to Las Vegas."
1,"When I saw they would be showing infamous eggs of the House of Faberge from the Virginia Museum of Fine Arts (VMFA), I knew I had to go!"
2,"Tucked away near the gelateria and the garden, the Gallery is pretty much hidden from view."
3,"It's what real estate agents would call ""cozy"" or ""charming"" - basically any euphemism for small."
4,"That being said, you can still see wonderful art at a gallery of any size, so why the two *s you ask?"


#### Clean-up text

In [16]:
%%time

custom_pipeline = [hero.preprocessing.fillna,
                   hero.preprocessing.remove_whitespace,
                   hero.preprocessing.remove_digits,
                   hero.preprocessing.remove_punctuation,
                   hero.preprocessing.remove_stopwords]

yelp_df['clean_sentences'] = hero.clean(yelp_df['sentences'], custom_pipeline)
yelp_df = yelp_df[['clean_sentences']]

CPU times: user 5min 27s, sys: 11 s, total: 5min 38s
Wall time: 5min 38s


In [17]:
yelp_df.head(5)

Unnamed: 0,clean_sentences
0,As someone worked many museums I eager visit gallery recent trip Las Vegas
1,When I saw would showing infamous eggs House Faberge Virginia Museum Fine Arts VMFA I knew I go
2,Tucked away near gelateria garden Gallery pretty much hidden view
3,It real estate agents would call cozy charming basically euphemism small
4,That said still see wonderful art gallery size two ask


In [18]:
sentences = [row.split() for row in yelp_df['clean_sentences']]
sentences[:2]

[['As',
  'someone',
  'worked',
  'many',
  'museums',
  'I',
  'eager',
  'visit',
  'gallery',
  'recent',
  'trip',
  'Las',
  'Vegas'],
 ['When',
  'I',
  'saw',
  'would',
  'showing',
  'infamous',
  'eggs',
  'House',
  'Faberge',
  'Virginia',
  'Museum',
  'Fine',
  'Arts',
  'VMFA',
  'I',
  'knew',
  'I',
  'go']]

In [19]:
del yelp_df

## Training
`Word2Vec` accepts several parameters that affect both training speed and quality.

### min_count
`min_count` is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there’s not enough data to learn anything meaningful about those words, so it’s best to ignore them.
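The effect of the threshold can be illustrated on a toy corpus. This is a sketch of the idea only, not gensim's internal vocabulary code; the helper name `prune_vocabulary` and the sentences are made up for illustration:

```python
from collections import Counter

# Toy corpus; "staf" is a deliberate typo that occurs only once.
toy_sentences = [
    ["great", "coffee", "great", "staff"],
    ["coffee", "was", "great"],
    ["staf", "was", "rude"],
]


def prune_vocabulary(sentences, min_count):
    """Return the words that occur at least `min_count` times across all sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


kept = prune_vocabulary(toy_sentences, min_count=2)
print(sorted(kept))
```

With `min_count=2`, the typo and the other one-off words drop out of the vocabulary, so no vector would be learned for them.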

### size
`size` is the number of dimensions (N) of the N-dimensional space that gensim Word2Vec maps the words onto.

Bigger size values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.

### workers
`workers`, the last of the major parameters (full list [here](http://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)) is for training parallelization, to speed up training:

In [20]:
%%time

# default value of min_count=5
# default value of size=100
# default value of workers=3
# sg ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW.
workers = num_processors-1

model = gensim.models.Word2Vec(sentences, min_count=10, size=100, compute_loss=True, sg=0, workers=workers)

CPU times: user 16min 47s, sys: 12.4 s, total: 16min 59s
Wall time: 6min 20s


The `workers` parameter only has an effect if you have [Cython](http://cython.org/) installed. Without Cython, you’ll only be able to use one core because of the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock) (and `word2vec` training will be [miserably slow](http://rare-technologies.com/word2vec-in-python-part-two-optimizing/)).

## Training Loss Computation

The parameter `compute_loss` can be used to toggle computation of loss while training the Word2Vec model. The computed loss is stored in the model attribute `running_training_loss` and can be retrieved using the function `get_latest_training_loss` as follows:

In [21]:
%%time

# getting the training loss value
training_loss = model.get_latest_training_loss()
print(training_loss)

25948194.0
CPU times: user 151 µs, sys: 6 µs, total: 157 µs
Wall time: 125 µs


## Memory
At its core, `word2vec` model parameters are stored as matrices (NumPy arrays). Each array is **#vocabulary** (controlled by min_count parameter) times **#size** (size parameter) of floats (single precision aka 4 bytes).

Three such matrices are held in RAM (work is underway to reduce that number to two, or even one). So if your input contains 100,000 unique words, and you asked for layer `size=200`, the model will require approx. `100,000*200*4*3 bytes = ~229MB`.

There’s a little extra memory needed for storing the vocabulary tree (100,000 words would take a few megabytes), but unless your words are extremely loooong strings, memory footprint will be dominated by the three matrices above.
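The arithmetic above can be wrapped in a small helper. This is a back-of-the-envelope sketch; the function name `estimate_matrix_bytes` is hypothetical, and the defaults (three matrices, 4-byte floats) simply mirror the figures quoted in this section:

```python
def estimate_matrix_bytes(vocab_size, vector_size, n_matrices=3, bytes_per_float=4):
    """Rough RAM needed for the model matrices, ignoring the vocabulary tree."""
    return vocab_size * vector_size * bytes_per_float * n_matrices


total_bytes = estimate_matrix_bytes(100_000, 200)
# Dividing by 2**20 gives mebibytes, which is where the ~229 figure comes from.
print(f"{total_bytes:,} bytes = {total_bytes / 2**20:.0f} MiB")
```

Plugging in the vocabulary size that survives `min_count` and the chosen `size` gives a quick check of whether a model will fit in memory before training starts.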

## Evaluating
`Word2Vec` training is an unsupervised task, so there’s no good way to objectively evaluate the result. Evaluation depends on your end application.

Google has released their testing set of about 20,000 syntactic and semantic test examples, following the “A is to B as C is to D” task. It is provided in the 'datasets' folder.

For example, a syntactic analogy of the comparative type is bad:worse;good:?. There are a total of 9 types of syntactic comparisons in the dataset, such as plural nouns and nouns of opposite meaning.

The semantic questions contain five types of semantic analogies, such as capital cities (Paris:France;Tokyo:?) or family members (brother:sister;dad:?). 
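To make the file layout concrete, here is a minimal sketch that parses a few lines in the `questions-words.txt` style: a line starting with `:` opens a section, and every other line holds the four words A B C D. The sample lines and the helper name `parse_analogies` are illustrative assumptions, and this is not gensim's own parser:

```python
from collections import defaultdict

# A few lines in the style of questions-words.txt (illustrative sample).
sample = """\
: capital-common-countries
Athens Greece Baghdad Iraq
: gram3-comparative
bad worse big bigger
bad worse good better
"""


def parse_analogies(text):
    """Group 'A B C D' analogy lines under their ': section' headers."""
    sections = defaultdict(list)
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
        else:
            words = line.split()
            if current is None or len(words) != 4:
                raise ValueError(f"unexpected line: {line!r}")
            sections[current].append(tuple(words))
    return dict(sections)


analogies = parse_analogies(sample)
print(analogies["gram3-comparative"])
```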

Gensim supports the same evaluation set, in exactly the same format:

In [22]:
directory_w2v = '/home/jupyter/data/wordvec/'

In [23]:
%%time

evaluation = model.wv.evaluate_word_analogies(directory_w2v + 'questions-words.txt')

CPU times: user 2min 21s, sys: 4min 9s, total: 6min 30s
Wall time: 24.6 s


In [24]:
evaluation[0]

0.33162530024019216

In [25]:
evaluation[1][7]

{'section': 'gram3-comparative',
 'correct': [('BAD', 'WORSE', 'BIG', 'BIGGER'),
  ('BAD', 'WORSE', 'BRIGHT', 'BRIGHTER'),
  ('BAD', 'WORSE', 'CHEAP', 'CHEAPER'),
  ('BAD', 'WORSE', 'EASY', 'EASIER'),
  ('BAD', 'WORSE', 'FAST', 'FASTER'),
  ('BAD', 'WORSE', 'GOOD', 'BETTER'),
  ('BAD', 'WORSE', 'HARD', 'HARDER'),
  ('BAD', 'WORSE', 'HIGH', 'HIGHER'),
  ('BAD', 'WORSE', 'LARGE', 'LARGER'),
  ('BAD', 'WORSE', 'LONG', 'LONGER'),
  ('BAD', 'WORSE', 'LOUD', 'LOUDER'),
  ('BAD', 'WORSE', 'LOW', 'LOWER'),
  ('BAD', 'WORSE', 'QUICK', 'QUICKER'),
  ('BAD', 'WORSE', 'SAFE', 'SAFER'),
  ('BAD', 'WORSE', 'SLOW', 'SLOWER'),
  ('BAD', 'WORSE', 'SMALL', 'SMALLER'),
  ('BAD', 'WORSE', 'SMART', 'SMARTER'),
  ('BAD', 'WORSE', 'STRONG', 'STRONGER'),
  ('BAD', 'WORSE', 'TALL', 'TALLER'),
  ('BAD', 'WORSE', 'TIGHT', 'TIGHTER'),
  ('BAD', 'WORSE', 'TOUGH', 'TOUGHER'),
  ('BAD', 'WORSE', 'WIDE', 'WIDER'),
  ('BIG', 'BIGGER', 'BRIGHT', 'BRIGHTER'),
  ('BIG', 'BIGGER', 'CHEAP', 'CHEAPER'),
  ('BIG', 'BIGGER', 

`evaluate_word_analogies` takes an [optional parameter](http://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.accuracy) `restrict_vocab`, which limits which test examples are considered.
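The second element of the returned tuple is a list of per-section dictionaries like the one printed above. As a sketch, assuming each dictionary also carries an `incorrect` list alongside `section` and `correct` (the toy data and the helper name `section_accuracy` are made up for illustration), per-section accuracy could be tabulated like this:

```python
def section_accuracy(sections):
    """Map section name -> fraction of correctly answered analogies."""
    accuracy = {}
    for section in sections:
        n_correct = len(section["correct"])
        n_total = n_correct + len(section["incorrect"])
        if n_total:  # skip sections where no example was evaluated
            accuracy[section["section"]] = n_correct / n_total
    return accuracy


# Toy stand-in for evaluation[1]; assumes an 'incorrect' key exists per section.
toy_sections = [
    {"section": "gram3-comparative",
     "correct": [("BAD", "WORSE", "BIG", "BIGGER"), ("BAD", "WORSE", "GOOD", "BETTER")],
     "incorrect": [("BAD", "WORSE", "COOL", "COOLER"), ("BAD", "WORSE", "DEEP", "DEEPER")]},
    {"section": "family", "correct": [], "incorrect": []},
]
result = section_accuracy(toy_sections)
print(result)
```

In this notebook the call would be `section_accuracy(evaluation[1])`, which helps show which analogy types drive the overall score.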



Newer releases of Gensim add a better way to evaluate semantic similarity.

By default it uses the academic dataset WS-353, but you can create a dataset specific to your business based on it. It contains word pairs together with human-assigned similarity judgments. It measures the relatedness or co-occurrence of two words. For example, 'coast' and 'shore' are very similar as they appear in the same context. At the same time, 'clothes' and 'closet' are less similar because they are related but not interchangeable.

In [26]:
# Locate the test data directory bundled with the gensim package
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data') + os.sep

In [27]:
model.wv.evaluate_word_pairs(test_data_dir + 'wordsim353.tsv')

((0.5198206451616163, 7.994675448150986e-24),
 SpearmanrResult(correlation=0.5147744445833151, pvalue=2.5431340515579797e-23),
 8.21529745042493)

Once again, **good performance on Google's or WS-353 test set doesn’t mean word2vec will work well in your application, or vice versa**. It’s always best to evaluate directly on your intended task. For an example of how to use word2vec in a classifier pipeline, see this [tutorial](https://github.com/RaRe-Technologies/movie-plots-by-genre).

## Storing and loading models
You can store/load models using the standard gensim methods:

In [28]:
save_w2v_model = '/home/jupyter/data/wordvec/yelp_model/yelp.model'
save_w2v_model

'/home/jupyter/data/wordvec/yelp_model/yelp.model'

In [29]:
model.save("yelp.model")  # save the model

In [30]:
!mkdir -p /home/jupyter/data/wordvec/yelp_model

In [31]:
!mv yelp.model* '/home/jupyter/data/wordvec/yelp_model/'

In [32]:
new_model = gensim.models.Word2Vec.load(save_w2v_model)  # open the model

Saving and loading use pickle internally, optionally `mmap`’ing the model’s internal large NumPy matrices into virtual memory directly from disk files, for inter-process memory sharing.

In addition, you can load models created by the original C tool, both using its text and binary formats:
```
  model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)
  # using gzipped/bz2 input works too, no need to unzip:
  model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)
```

If you resume training a loaded model, you may need to tweak the `total_words` parameter to `train()`, depending on what learning rate decay you want to simulate.

Note that it’s not possible to resume training with models generated by the C tool and loaded via `KeyedVectors.load_word2vec_format()`. You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there.

## Using the model
`Word2Vec` supports several word similarity tasks out of the box:

In [33]:
if 'tea' in model.wv:
    print(model.wv['tea'].shape)
else:
    print('{0} is an out of dictionary word'.format('tea'))

(100,)


In [34]:
model.wv.most_similar(positive=['tea', 'coffee'], negative=['sugar'], topn=10)

[('teas', 0.6905073523521423),
 ('coffees', 0.6865881681442261),
 ('cappuccino', 0.5914865136146545),
 ('americano', 0.5750434398651123),
 ('Coffee', 0.5590242147445679),
 ('latte', 0.5485410094261169),
 ('lattes', 0.5474990606307983),
 ('Tea', 0.5354260802268982),
 ('toddy', 0.5314257740974426),
 ('brew', 0.529987096786499)]
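The `positive`/`negative` query above follows the additive vector arithmetic from the introduction. Below is a simplified sketch of that scheme on made-up 3-dimensional vectors: add the unit vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The function and the toy vectors are illustrative assumptions, not gensim's implementation:

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # query words are never returned
    scored = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors: dimensions loosely stand for (royalty, male, female).
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "prince": [1.0, 0.8, 0.0],
}
best = most_similar(toy_vectors, positive=["king", "woman"], negative=["man"])
print(best[0][0])
```

On these toy vectors the top-ranked word is "queen", mirroring the vec(“king”) – vec(“man”) + vec(“woman”) example.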

In [35]:
print(model.wv.similarity('tea', 'coffee'))
print(model.wv.similarity('bread', 'butter'))

0.71593136
0.6441566
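The scores printed by `similarity` are cosine similarities between the two word vectors. A minimal sketch of that computation in plain Python, using made-up 3-dimensional vectors rather than the trained model:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Illustrative vectors only; real word vectors here have 100 dimensions.
tea = [1.0, 2.0, 0.0]
coffee = [2.0, 4.0, 0.0]   # same direction as tea, different length
parking = [0.0, 0.0, 3.0]  # orthogonal to tea

same_direction = cosine_similarity(tea, coffee)
orthogonal = cosine_similarity(tea, parking)
print(round(same_direction, 6), round(orthogonal, 6))
```

Because only the angle matters, vector length has no effect on the score: vectors pointing the same way score close to 1, and unrelated (orthogonal) vectors score 0.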


You can get the probability distribution for the center word given the context words as input:

In [36]:
print(model.predict_output_word(['bread', 'butter', 'toast']))

[('french', 0.66018516), ('pudding', 0.055286277), ('French', 0.04826879), ('sourdough', 0.04570451), ('toasted', 0.04054609), ('garlic', 0.024658933), ('rye', 0.024489727), ('raisin', 0.011971361), ('butter', 0.010254043), ('loaf', 0.008183831)]


Predictions like these are only meaningful when the training corpus is large enough. On a very small corpus the results don't look good; to get meaningful results one needs to train on 500k+ words.

If you need the raw output vectors in your application, you can access these either on a word-by-word basis:

In [37]:
model.wv['bread']  # raw NumPy vector of a word

array([-2.5011024 ,  5.63185   , -2.4020245 , -0.95929575,  0.21978381,
       -1.3435128 ,  4.182844  , -0.22348645, -2.6355624 , -2.9018672 ,
       -0.6990518 ,  2.655727  ,  1.8324082 , -1.6864316 , -1.7130361 ,
        1.1478089 , -1.7601429 , -1.3889798 , -0.41824335, -0.30924654,
       -0.6155924 ,  3.8892884 , -3.728811  , -0.26188573, -1.9043893 ,
       -3.7732265 , -1.3325291 ,  1.3014245 ,  0.7662635 ,  1.9060816 ,
       -1.3467876 , -2.6118793 ,  0.41354725, -1.132492  , -2.2325377 ,
       -0.20009795, -0.4402575 ,  2.217988  , -0.6374768 ,  2.5066001 ,
        1.5218092 ,  0.3055557 ,  2.3131783 ,  0.68956023,  1.7644087 ,
       -0.10261958, -1.8795648 ,  0.5619815 , -3.2625968 ,  3.4835687 ,
        1.4512566 , -1.731525  , -2.3105912 ,  1.537147  ,  1.4920149 ,
       -4.0347395 ,  0.3883149 ,  1.6400788 , -2.7375712 , -0.9508272 ,
       -1.1099461 ,  2.3770325 ,  2.1726363 , -0.39731446,  2.468307  ,
        0.13749535, -2.0067146 ,  2.820077  , -0.32342052, -1.27

…or en masse as a 2D NumPy matrix from `model.wv.syn0` (named `model.wv.vectors` in newer gensim releases).

## Conclusion

In this tutorial we learned how to train word2vec models on your custom data and how to evaluate them. We hope that you too will find this popular tool useful in your machine learning tasks!

## Links


Full `word2vec` API docs [here](http://radimrehurek.com/gensim/models/word2vec.html); get [gensim](http://radimrehurek.com/gensim/) here. Original C toolkit and `word2vec` papers by Google [here](https://code.google.com/archive/p/word2vec/).

In [38]:
import datetime
import pytz

datetime.datetime.now(pytz.timezone('US/Central')).strftime("%a, %d %B %Y %H:%M:%S")

'Mon, 15 November 2021 22:37:28'