# NLP Seminar 3: static word embeddings (word2vec, fasttext, GloVe)

In this seminar, we will use the `gensim` package, which provides unified, easy-to-use implementations of word2vec and fasttext, along with pretrained word2vec, fasttext, and GloVe models.

In [None]:
#!pip install --upgrade gensim

In [1]:
import numpy as np
import pandas as pd

In [2]:
import multiprocessing
cores = multiprocessing.cpu_count()
cores

8

In [3]:
from nltk.tokenize import word_tokenize

# Data preparation

Data source: https://www.kaggle.com/datasets/prashant111/the-simpsons-dataset

In [4]:
simpsons = pd.read_csv("data/simpsons_script_lines.csv",
                       usecols=["raw_character_text", "raw_location_text", "spoken_words", "normalized_text"],
                       dtype={'raw_character_text':'string', 'raw_location_text':'string',
                              'spoken_words':'string', 'normalized_text':'string'})
simpsons.head()

Unnamed: 0,raw_character_text,raw_location_text,spoken_words,normalized_text
0,Miss Hoover,Springfield Elementary School,"No, actually, it was a little of both. Sometim...",no actually it was a little of both sometimes ...
1,Lisa Simpson,Springfield Elementary School,Where's Mr. Bergstrom?,wheres mr bergstrom
2,Miss Hoover,Springfield Elementary School,I don't know. Although I'd sure like to talk t...,i dont know although id sure like to talk to h...
3,Lisa Simpson,Springfield Elementary School,That life is worth living.,that life is worth living
4,Edna Krabappel-Flanders,Springfield Elementary School,The polls will be open from now until the end ...,the polls will be open from now until the end ...


In [5]:
simpsons.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158271 entries, 0 to 158270
Data columns (total 4 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   raw_character_text  140749 non-null  string
 1   raw_location_text   157863 non-null  string
 2   spoken_words        132112 non-null  string
 3   normalized_text     132087 non-null  string
dtypes: string(4)
memory usage: 4.8 MB


In [6]:
simpsons = simpsons.dropna().drop_duplicates().reset_index(drop=True)

In [9]:
corpus_tok = simpsons['normalized_text'].str.split(' ').to_list()
corpus_tok[1]

['wheres', 'mr', 'bergstrom']

In [None]:
# If you don't know the Simpsons tv show, 
# you can e.g. use the wikipedia subset corpus instead,
# (and try different words when evaluating the vectors and similarities in next sections):

#import gensim.downloader as gensim_api
#gensim_api.info('text8')['description']
#corpus_tok = gensim_api.load('text8')

## Phraser

https://radimrehurek.com/gensim/models/phrases.html

In [28]:
from gensim.models.phrases import Phrases, Phraser
phrases = Phrases(corpus_tok, min_count=30)
phraser = Phraser(phrases)
del phrases  # the frozen Phraser is enough for transforming sentences

In [25]:
phraser[["homer", "simpson", "eats", "chocolate"]]

['homer_simpson', 'eats', 'chocolate']

In [26]:
corpus_phrased = phraser[corpus_tok]

In [27]:
corpus_phrased[1]

['wheres', 'mr', 'bergstrom']
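
To give an intuition of how a pair such as `homer simpson` ends up merged, here is a minimal pure-Python sketch of the default scoring formula described in the gensim `Phrases` documentation: `(count(a b) - min_count) / (count(a) * count(b)) * vocab_size`. The names `bigram_scores` and `toy_corpus` are hypothetical, and `vocab_size` counts unigrams only here, which is a simplification of what gensim does internally. A pair is merged when its score exceeds the `threshold` argument.

```python
from collections import Counter

def bigram_scores(sentences, min_count=1):
    """Score every adjacent word pair; higher means 'more phrase-like'."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)  # simplification: unigram vocabulary only
    return {
        (a, b): (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        for (a, b), n_ab in bigrams.items()
    }

# Made-up toy corpus, only for illustration
toy_corpus = [
    ["homer", "simpson", "eats", "donuts"],
    ["homer", "simpson", "sleeps"],
    ["bart", "eats", "donuts"],
    ["homer", "simpson", "eats", "chocolate"],
]
scores = bigram_scores(toy_corpus, min_count=1)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))  # ('homer', 'simpson') 1.556
```

Pairs of frequent words that rarely occur next to each other get a low score, which is why `eats` is not glued to its neighbours.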

# Word2vec

Word2vec has two sub-methods for training the word embeddings: continuous bag of words (CBOW) and skip-gram.
In both cases, a shallow neural network is trained to predict either

- a word given a context (CBOW), or
- a context given a word (skip-gram).

The context is defined as the other surrounding words in a given window. The word embedding vectors are then obtained from the two trained weight matrices for each word in the vocabulary.
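
As an illustration of the two prediction tasks, the following sketch builds the training examples that each variant would see for one sentence. The function names are hypothetical, and the window is kept fixed for simplicity (actual implementations may randomly shrink it during training).

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: skip-gram predicts each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_examples(tokens, window=2):
    """(context_words, center) examples: CBOW predicts the center word from its context."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

sentence = ["bart", "is", "grounded", "again"]
sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)
print(sg)
print(cb)
```

With `window=1`, the word `is` yields the skip-gram pairs `('is', 'bart')` and `('is', 'grounded')`, and the single CBOW example `(['bart', 'grounded'], 'is')`.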


Official website: https://code.google.com/archive/p/word2vec/

Original papers: http://arxiv.org/abs/1301.3781 and http://arxiv.org/abs/1310.4546

### Training word2vec on the Simpson scripts

In [29]:
from gensim.models import Word2Vec, KeyedVectors

In [30]:
w2v_s = Word2Vec(corpus_phrased, vector_size=150, window=3, min_count=2, sg=0, negative=5, ns_exponent=0.75,
                 alpha=0.025, min_alpha=0.0001, workers=cores-1, epochs=30)
#1st line: Method's hyperparameters
#2nd line: Optimization (gradient descent) hyperparameters

Can also be done in separate steps:

    w2v_s = Word2Vec(vector_size=150, window=3, min_count=2, sg=0, negative=5, ns_exponent=0.75,
                     alpha=0.025, min_alpha=0.0001, workers=cores-1)
    w2v_s.build_vocab(corpus_phrased, progress_per=10000)
    w2v_s.train(corpus_phrased, total_examples=w2v_s.corpus_count, epochs=30, report_delay=1)
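
The `negative` and `ns_exponent` arguments used above control negative sampling: for each training example, `negative` "noise" words are drawn from a distribution proportional to the word frequency raised to the power `ns_exponent`. The sketch below only illustrates that distribution on made-up counts (the function name and the counts are hypothetical). It is not gensim's internal sampling code.

```python
from collections import Counter

def negative_sampling_probs(token_counts, ns_exponent=0.75):
    """Probability of drawing each word as a negative sample."""
    weights = {tok: cnt ** ns_exponent for tok, cnt in token_counts.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

# Made-up frequencies: one very frequent, one common, one rare word
counts = Counter({"the": 1000, "homer": 100, "bergstrom": 1})
for exponent in (1.0, 0.75, 0.0):
    probs = negative_sampling_probs(counts, ns_exponent=exponent)
    print(exponent, {tok: round(p, 4) for tok, p in probs.items()})
```

An exponent of 1.0 samples in proportion to the raw frequencies, 0.0 samples all words equally, and 0.75 sits in between, giving rare words a somewhat higher chance of being drawn.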

Word embedding vectors can then be obtained from the trained model, for each word in the training vocabulary.

In [32]:
homer_vector = w2v_s.wv.get_vector("homer", norm=True) # Father of the Simpsons
homer_vector

array([-2.61049289e-02,  1.16754053e-02,  4.14576307e-02, -3.42069268e-02,
        1.30050406e-01, -9.81555060e-02,  4.27696370e-02,  3.90677335e-04,
        7.67412856e-02, -3.18297967e-02, -2.23855227e-01,  6.22445457e-02,
        2.95470506e-02, -6.00912683e-02,  1.74143090e-04, -9.63065997e-02,
       -3.22594284e-03, -1.07110972e-02, -1.21698245e-01,  3.19398683e-03,
        1.92827471e-02, -2.36378796e-02,  9.73904952e-02,  3.53702530e-02,
       -2.03763053e-01, -6.04840107e-02,  7.69039765e-02,  6.79884478e-02,
       -1.43225715e-01,  1.61224619e-01, -7.66186565e-02,  2.99185198e-02,
       -4.58554961e-02,  1.33616626e-01, -1.38652027e-01,  6.64678738e-02,
       -5.30304164e-02,  1.88559201e-02, -7.94080496e-02,  4.37537115e-03,
       -6.06294349e-02, -8.22855681e-02,  8.64137858e-02,  1.00588866e-01,
        1.08145639e-01, -1.00142181e-01, -6.77912086e-02, -8.14181045e-02,
       -2.70792772e-03, -1.27574295e-01,  1.34241477e-01,  1.57823578e-01,
       -2.79842759e-03,  

For a given word or vector, one can query the other most similar word vectors, in terms of cosine similarity.

In [33]:
w2v_s.wv.most_similar("homer") # Marge is Homer's wife

[('marge', 0.6593793630599976),
 ('homie', 0.6120056509971619),
 ('dad', 0.586644172668457),
 ('moe', 0.5705775618553162),
 ('bart', 0.5647057890892029),
 ('son', 0.5607966780662537),
 ('but', 0.5501118302345276),
 ('honey', 0.5330860018730164),
 ('ned', 0.5255712866783142),
 ('well', 0.5097299814224243)]

In [34]:
w2v_s.wv.most_similar(homer_vector)

[('homer', 0.9999998807907104),
 ('marge', 0.6593793630599976),
 ('homie', 0.6120056509971619),
 ('dad', 0.586644172668457),
 ('moe', 0.5705775618553162),
 ('bart', 0.5647057890892029),
 ('son', 0.5607966780662537),
 ('but', 0.5501118302345276),
 ('honey', 0.5330860018730164),
 ('ned', 0.5255712866783142)]

In [35]:
w2v_s.wv.most_similar("homer_simpson") # name bigram

[('robert', 0.49677714705467224),
 ('montgomery_burns', 0.4636591970920563),
 ('kent_brockman', 0.4602481722831726),
 ('ned_flanders', 0.4517003893852234),
 ('manager', 0.44521698355674744),
 ('mr_burns', 0.4346499741077423),
 ('sideshow_bob', 0.4296829402446747),
 ('bart_simpson', 0.4286467134952545),
 ('rabbi', 0.4273083508014679),
 ('hutz', 0.4256899356842041)]

In [36]:
w2v_s.wv.most_similar("bart") # Bart is the son, Lisa his sister and Milhouse his best friend

[('lisa', 0.6480810046195984),
 ('milhouse', 0.6399539113044739),
 ('dad', 0.6044421792030334),
 ('homer', 0.5647057890892029),
 ('mom', 0.5577839612960815),
 ('your_father', 0.5532432198524475),
 ('son', 0.5313892960548401),
 ('honey', 0.5028467774391174),
 ('maggie', 0.486422598361969),
 ('well', 0.4840283989906311)]

One can also compute the cosine similarity between two word vectors:

In [37]:
w2v_s.wv.similarity('bart', 'lisa')

0.648081

In [38]:
w2v_s.wv.similarity('bart', 'bart')

1.0
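
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. A minimal sketch on made-up 3-dimensional vectors (the function name and the vectors are illustrative, not taken from the model):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]  # same direction as u, twice as long
w = [0.0, 0.0, 3.0]  # orthogonal to u

print(cosine_similarity(u, v))  # close to 1: the length of the vectors does not matter
print(cosine_similarity(u, w))  # 0.0: orthogonal vectors
```

Because only the direction matters, comparing unit-normalized vectors (`norm=True`) gives the same similarities as comparing the raw ones.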

Odd-one-out identification:

In [39]:
w2v_s.wv.doesnt_match(['homer', 'patty', 'selma']) # Patty and Selma are Marge's twin sisters

'homer'
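
One way to picture the odd-one-out query: normalize the vectors, average them, and return the word that is least similar to that average. The sketch below follows this idea on made-up 2-dimensional vectors. The helper names are hypothetical, and this is an approximation of the approach rather than gensim's exact code.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """vectors: dict word -> list of floats. Returns the word least similar to the mean."""
    normed = {w: unit(v) for w, v in vectors.items()}
    dim = len(next(iter(normed.values())))
    mean = unit([sum(v[i] for v in normed.values()) / len(normed) for i in range(dim)])
    return min(normed, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))

# Made-up vectors: 'patty' and 'selma' point in nearly the same direction
toy = {"patty": [0.9, 0.1], "selma": [0.8, 0.2], "homer": [0.1, 0.9]}
print(odd_one_out(toy))  # homer
```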

Word analogies: how well do embedding vectors capture intuitive semantic and syntactic analogy questions?

In [40]:
# " Homer - man + woman = ? " - i.e. " man:Homer :: woman:? "
w2v_s.wv.most_similar(positive=["homer", "woman"], negative=["man"], topn=3) # Marge is Homer's wife

[('marge', 0.5385385155677795),
 ('lisa', 0.535293698310852),
 ('homie', 0.5107763409614563)]

In [41]:
# " woman - Marge + Homer = ? " - i.e. " Marge:Homer :: woman:? " 
w2v_s.wv.most_similar(positive=["woman", "homer"], negative=["marge"], topn=3)

[('man', 0.49348142743110657),
 ('child', 0.4717772305011749),
 ('friend', 0.4468388855457306)]

### Sentence embedding

In [42]:
def document2vec(tokens, embedding_wv, phraser=None, normalize=True):
    """Returns the embedding of a sentence or document as the mean of its tokens/words embeddings."""
    if phraser:
        tokens = phraser[tokens]
    sent_mean = np.array([embedding_wv.get_vector(tok, norm=normalize) for tok in tokens]).mean(axis=0)
    return sent_mean

In [43]:
document2vec(["bart", "is", "grounded"], w2v_s.wv, phraser=phraser)

array([ 0.02951671,  0.05051985, -0.01147764,  0.00250905,  0.02797181,
       -0.04177618, -0.01815402,  0.06632936, -0.03103207,  0.05728322,
       -0.03415288,  0.06185882,  0.01379502, -0.012237  ,  0.0600308 ,
       -0.07371085,  0.05660002, -0.04321506, -0.00117968,  0.00423469,
        0.03577783, -0.03476532, -0.09413234,  0.02200813, -0.0224283 ,
       -0.08386368,  0.02829584,  0.00567001,  0.051918  , -0.00260505,
       -0.07733845, -0.01166051,  0.01938361, -0.00876098,  0.03038546,
        0.09039616,  0.01135792,  0.00505491, -0.01039881,  0.01160451,
        0.03023958, -0.00198646,  0.03735946,  0.06403868, -0.03261832,
       -0.04678274,  0.00698215, -0.01436172, -0.04199128, -0.03298742,
        0.02217922,  0.07795637,  0.01725201, -0.04941735,  0.06661747,
       -0.05657046,  0.02422118, -0.03922039,  0.09386795,  0.05391004,
        0.00784591, -0.0943244 , -0.09749899, -0.03016669, -0.00357802,
        0.00299996, -0.01701612,  0.06620205, -0.00322717,  0.02
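
Note that averaging fails with a `KeyError` as soon as one token is missing from a word2vec vocabulary. A possible variant, sketched here under the assumption that silently skipping unknown tokens is acceptable for the task, averages only the known tokens. The function name is hypothetical. It works with any mapping supporting `in` and `[]`, which gensim's `KeyedVectors` also does (with raw, non-normalized vectors).

```python
def mean_embedding(tokens, vectors):
    """Average the vectors of the known tokens; return None if no token is known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Made-up 2-dimensional vectors; 'is' and 'unige' are deliberately missing
toy_vectors = {"bart": [1.0, 0.0], "grounded": [0.0, 1.0]}
print(mean_embedding(["bart", "is", "grounded"], toy_vectors))  # [0.5, 0.5]
print(mean_embedding(["unige"], toy_vectors))                   # None
```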

### Pretrained word2vec vectors

https://github.com/RaRe-Technologies/gensim-data#models

In [44]:
import gensim.downloader as gensim_api

In [45]:
w2v_pret = gensim_api.load('word2vec-google-news-300')

In [None]:
# Or from downloaded source (e.g. https://code.google.com/archive/p/word2vec/):
#w2v_pret = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [46]:
w2v_pret

<gensim.models.keyedvectors.KeyedVectors at 0x198c21a1970>

In [47]:
w2v_pret.most_similar(positive=["eat"])

[('eating', 0.7529403567314148),
 ('ate', 0.7013993859291077),
 ('eaten', 0.6724975109100342),
 ('eats', 0.6589087843894958),
 ('munch', 0.6417747735977173),
 ('eat_healthfully', 0.6315395832061768),
 ('eat_fatty_foods', 0.6280142068862915),
 ('consume', 0.6184970140457153),
 ('Nutritionists_recommend', 0.6183844804763794),
 ('overeaten', 0.6109130382537842)]

In [48]:
w2v_pret.similarity("eat", 'consume')

0.6184971

In [49]:
w2v_pret.doesnt_match(["eat", 'dance', 'drink'])

'dance'

In [50]:
w2v_pret.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car'])

'car'

In [51]:
w2v_pret.most_similar(positive=["king", "woman"], negative=["man"], topn=3)

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951)]

In [52]:
vectcalc = w2v_pret.get_vector("king", norm=True) - w2v_pret.get_vector("man", norm=True) + w2v_pret.get_vector("woman", norm=True)
w2v_pret.most_similar(vectcalc)

[('king', 0.7992597818374634),
 ('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.518113374710083),
 ('sultan', 0.5098593831062317)]

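Note that the raw vector arithmetic `king - man + woman` returns `king` itself as the nearest neighbor, with `queen` only second. `most_similar(positive=..., negative=...)` returns `queen` first because it leaves the input words out of the results.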
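
The difference between the two previous results can be reproduced on a toy example. In the made-up vectors below, `man` and `woman` are almost identical, so `king - man + woman` barely moves away from `king`. The answer only becomes `queen` once the query words are excluded from the candidates. The function names and the vectors are purely illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, positive, negative, exclude_inputs=True):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    skipped = set(positive) | set(negative) if exclude_inputs else set()
    candidates = [w for w in vectors if w not in skipped]
    return sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)

# Made-up 3-dimensional vectors, chosen so that 'man' and 'woman' are very close
toy = {
    "king": [1.0, 0.1, 0.0],
    "queen": [0.7, -0.1, 0.6],
    "man": [0.0, 0.1, 1.0],
    "woman": [0.0, -0.1, 1.0],
}
raw_best = analogy(toy, ["king", "woman"], ["man"], exclude_inputs=False)[0]
filtered_best = analogy(toy, ["king", "woman"], ["man"], exclude_inputs=True)[0]
print(raw_best)       # king
print(filtered_best)  # queen
```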

In [75]:
w2v_pret.most_similar(positive=["bart"])

[('joseph', 0.620844841003418),
 ('boston_globe', 0.6205891370773315),
 ('liz', 0.6189237833023071),
 ('gerald', 0.6109275817871094),
 ('jta', 0.6039943099021912),
 ('jg', 0.6032029986381531),
 ('meyers', 0.6026628613471985),
 ('ellis', 0.5999051928520203),
 ('christine', 0.5992782115936279),
 ('becky', 0.5988665223121643)]

In [77]:
w2v_s.wv.most_similar("bart")

[('lisa', 0.6480810046195984),
 ('milhouse', 0.6399539113044739),
 ('dad', 0.6044421792030334),
 ('homer', 0.5647057890892029),
 ('mom', 0.5577839612960815),
 ('your_father', 0.5532432198524475),
 ('son', 0.5313892960548401),
 ('honey', 0.5028467774391174),
 ('maggie', 0.486422598361969),
 ('well', 0.4840283989906311)]

# Fasttext

"university" , "univ", "niver", ... , "sity"

Fasttext is a static word embedding method that is very similar to word2vec. The main difference is that it uses character-level n-gram vectors together with the word vectors.
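
To make the character n-grams concrete, here is a small sketch that extracts them the way the fasttext paper describes: the word is wrapped in the boundary markers `<` and `>`, then all substrings with lengths between `min_n` and `max_n` are collected. The function name is hypothetical, and the actual implementations additionally hash the n-grams into a fixed number of buckets, which is not shown here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """All character n-grams of '<word>' with min_n <= n <= max_n."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Two related words share several n-grams, hence part of their representation
shared = set(char_ngrams("university")) & set(char_ngrams("universal"))
print(sorted(shared))
```

The boundary markers let the model distinguish a prefix or suffix from the same characters occurring inside a word.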

Advantages:
+ Out of training vocabulary embeddings are obtainable.
+ Better representation for rare words (that are morphologically similar to others).
+ Tends to perform better for syntactic tasks.
+ Is more useful in morphologically rich languages (such as German, Arabic and Russian) compared to English (German example: 'table tennis' -> 'Tischtennis'), but it heavily depends on the task.
+ Might work better for small datasets.
+ The "official" implementation is quite efficient, and allows training the embedding and a classifier at once (see the official `fasttext` package documentation: https://fasttext.cc/docs/en/python-module.html).

Disadvantages:
- Can overfit more easily, and is a bit harder to tune because of the additional character n-gram hyperparameters.
- Tends to perform more poorly for semantic tasks.
- May favor morphologically close words over semantically closer synonyms.
- Can be heavier to train.

However, the differences between fasttext and word2vec tend to decrease as the size of the training corpus increases.

Official website: https://fasttext.cc/

Original paper: https://arxiv.org/abs/1607.04606

### Training fasttext on the Simpson scripts

In [53]:
from gensim.models import FastText

In [54]:
fst_s = FastText(corpus_phrased, vector_size=150, window=3, min_count=5, sg=0, negative=5, ns_exponent=0.75,
                 min_n=3, max_n=6,  # Additional fasttext hyperparameters
                 alpha=0.025, min_alpha=0.0001, workers=cores-1, epochs=30)

Can also be performed in separate steps:

    fst_s = FastText(vector_size=150, window=3, min_count=5, sg=0, negative=5, ns_exponent=0.75,
                     min_n = 1, max_n = 4,
                     alpha=0.025, min_alpha=0.0001, workers=cores-1)
    fst_s.build_vocab(corpus_phrased)
    print(len(fst_s.wv.key_to_index))  # vocabulary size (gensim 4 API)
    fst_s.train(corpus_phrased, total_examples=fst_s.corpus_count, epochs=100)

In [60]:
"unige" in fst_s.wv.index_to_key, "unige" in w2v_s.wv.index_to_key

(False, False)

In [61]:
try:
    print(w2v_s.wv.get_vector("unige", norm=True))
except KeyError:
    print("KeyError: the given token is not present in the vocabulary.")

KeyError: the given token is not present in the vocabulary.


In [62]:
try:
    print(fst_s.wv.get_vector("unige", norm=True))
except KeyError:
    print("KeyError: the given token is not present in the vocabulary.")

[-2.32658256e-02  1.54256532e-02  1.02248400e-01  3.65314335e-02
  2.44426336e-02  2.30771720e-01 -1.05185851e-01  1.25458771e-02
  1.01180095e-02  7.86468089e-02 -4.43905145e-02 -1.22101642e-02
  2.64566615e-02  1.10361949e-01  1.82546936e-02 -8.55090022e-02
 -4.48813327e-02 -2.48250179e-02  1.09720200e-01  1.76316015e-02
 -1.21377856e-02 -1.01584643e-01  6.26003295e-02  1.60837501e-01
  1.28992632e-01  1.66536812e-02 -2.00505257e-01  6.09177910e-02
  6.00020774e-02 -1.29563481e-01 -8.92450511e-02 -2.23424267e-02
  2.59135365e-02 -8.79552215e-02  1.66614115e-01  1.44762427e-01
  8.79511461e-02 -9.38010067e-02 -1.18069828e-01  8.03272575e-02
 -1.60505429e-01  1.40563279e-01 -1.00491950e-02 -4.79830131e-02
 -1.38284331e-02  3.78725417e-02 -1.32014439e-01 -1.01249419e-01
 -1.79909602e-01  1.01562396e-01 -4.72330190e-02  3.71297598e-02
  6.80269897e-02  2.32637161e-03  1.23015074e-02 -7.23210722e-02
  5.05620390e-02  1.35850720e-02 -4.44205254e-02 -1.12646095e-01
  3.80120017e-02 -9.97520

In [63]:
fst_s.wv.most_similar("unige")

[('unit', 0.8582980632781982),
 ('unite', 0.8408799767494202),
 ('university', 0.8266153931617737),
 ('unique', 0.8156555891036987),
 ('union', 0.7794871926307678),
 ('universe', 0.7756810188293457),
 ('un', 0.7631798982620239),
 ('united_states', 0.7625829577445984),
 ('village', 0.760526180267334),
 ('universal', 0.7568336129188538)]
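
The neighbours of the out-of-vocabulary token `unige` all share its leading characters, which is consistent with how such a vector is built: roughly, as the average of the vectors of the token's character n-grams. The sketch below illustrates the idea with a made-up table of n-gram vectors. The names are hypothetical, and unlike the real implementation, which hashes every n-gram into a bucket, this toy version simply skips n-grams it has never seen.

```python
def char_ngrams(word, n=3):
    """Character n-grams of '<word>' for a single length n."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

def oov_vector(word, ngram_vectors, n=3):
    """Average the vectors of the word's known n-grams; None if there are none."""
    known = [ngram_vectors[g] for g in char_ngrams(word, n) if g in ngram_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Made-up n-gram vectors, as if learned from words such as 'unit' or 'union'
ngram_vectors = {"<un": [1.0, 0.0], "uni": [0.0, 1.0], "nit": [1.0, 1.0]}
print(char_ngrams("unige"))                # ['<un', 'uni', 'nig', 'ige', 'ge>']
print(oov_vector("unige", ngram_vectors))  # [0.5, 0.5]
```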

In [64]:
fst_s.wv.most_similar("homer", topn = 10)

[('homey', 0.8053441643714905),
 ('homer_j', 0.771196186542511),
 ('homeboy', 0.7320456504821777),
 ('homie', 0.7198623418807983),
 ('homers', 0.7179117798805237),
 ('knockahomer', 0.7139254808425903),
 ('homemaker', 0.7117170095443726),
 ('marge', 0.7071914076805115),
 ('homer_simpson', 0.6997377872467041),
 ('bart', 0.6730086803436279)]

In [65]:
fst_s.wv.most_similar("marge", topn = 10)

[('maaarge', 0.8417245745658875),
 ('margie', 0.8175097703933716),
 ('marge-a-rine', 0.7889267206192017),
 ('marges', 0.775191605091095),
 ('sarge', 0.7706901431083679),
 ('margaret', 0.7593051195144653),
 ('margarita', 0.7477173209190369),
 ('homer', 0.7071914672851562),
 ('marie', 0.6962291598320007),
 ('marjorie', 0.6722584366798401)]

In [66]:
w2v_s.wv.most_similar("marge", topn = 10)

[('homer', 0.6593793630599976),
 ('abe', 0.5592204928398132),
 ('homie', 0.5501308441162109),
 ('honey', 0.5463864207267761),
 ('lisa', 0.5315233469009399),
 ('maggie', 0.5128048062324524),
 ('but', 0.4928845465183258),
 ('family', 0.49145132303237915),
 ('moe', 0.47630375623703003),
 ('mom', 0.4672625958919525)]

In [67]:
fst_s.wv.most_similar("eat", topn = 10)

[('teat', 0.7125701308250427),
 ('neat', 0.6677605509757996),
 ('beat', 0.6515675783157349),
 ('earn', 0.6420717835426331),
 ('sweat', 0.6323785185813904),
 ('eatin', 0.6129869818687439),
 ('meat', 0.6103528738021851),
 ('drink', 0.6020302176475525),
 ('eats', 0.6009734869003296),
 ('cheat', 0.5747995376586914)]

In [68]:
w2v_s.wv.most_similar("eat", topn = 10)

[('drink', 0.5764316916465759),
 ('buy', 0.5259267687797546),
 ('steal', 0.5198866724967957),
 ('feed', 0.5180990099906921),
 ('wear', 0.5007943511009216),
 ('suck', 0.49226200580596924),
 ('lose', 0.491549015045166),
 ('sell', 0.4849689304828644),
 ('throw', 0.4725513756275177),
 ('scrape', 0.4724678695201874)]

### Pretrained fasttext vectors

In [69]:
fst_pret = gensim_api.load('fasttext-wiki-news-subwords-300')

In [None]:
# Or from a downloaded source in fastText's native binary format
# (e.g. https://fasttext.cc/docs/en/english-vectors.html):
# from gensim.models.fasttext import load_facebook_vectors
# fst_pret = load_facebook_vectors('fasttext_file.bin')

In [70]:
fst_pret.most_similar("eat", topn = 10)

[('eate', 0.80073082447052),
 ('eat-', 0.7924754619598389),
 ('eatin', 0.7710843682289124),
 ('eating', 0.7475292682647705),
 ('ate', 0.739423930644989),
 ('eat.', 0.7381978631019592),
 ('eats', 0.7242209315299988),
 ('consume', 0.7190532088279724),
 ('eaten', 0.7079055905342102),
 ('eatable', 0.7070158123970032)]

In [71]:
w2v_pret.most_similar("eat", topn = 10)

[('eating', 0.7529403567314148),
 ('ate', 0.7013993859291077),
 ('eaten', 0.6724975109100342),
 ('eats', 0.6589087843894958),
 ('munch', 0.6417747735977173),
 ('eat_healthfully', 0.6315395832061768),
 ('eat_fatty_foods', 0.6280142068862915),
 ('consume', 0.6184970140457153),
 ('Nutritionists_recommend', 0.6183844804763794),
 ('overeaten', 0.6109130382537842)]

In [72]:
fst_pret.most_similar("consume", topn = 10)

[('consumes', 0.7680187821388245),
 ('over-consume', 0.7483881711959839),
 ('consumed', 0.7472653388977051),
 ('consuming', 0.7410412430763245),
 ('eat', 0.7190532088279724),
 ('overconsume', 0.6897545456886292),
 ('devour', 0.6756592392921448),
 ('ingest', 0.65708327293396),
 ('Consume', 0.6442354917526245),
 ('consumption', 0.6395013928413391)]

In [73]:
w2v_pret.most_similar("consume", topn = 10)

[('consumed', 0.698164701461792),
 ('consumes', 0.6695359349250793),
 ('eat', 0.6184970736503601),
 ('consumption', 0.6043094992637634),
 ('guzzle', 0.5901384949684143),
 ('ingest', 0.5877813696861267),
 ('consuming', 0.5756837725639343),
 ('overconsume', 0.5606318712234497),
 ('devour', 0.5577657222747803),
 ('Consuming', 0.5523046851158142)]

# GloVe

Contrary to word2vec and fasttext, GloVe doesn't use skipgram or CBOW networks. GloVe relies on word-context co-occurrence matrix factorization to obtain the embedded word vectors.

- GloVe can take longer to train on larger corpora, compared to word2vec.
- It has fewer hyperparameters, so it is much easier to tune, but it cannot be fine-tuned as closely for a specific task.
- word2vec and fasttext are, in comparison, much more sensitive to the choice of hyperparameters, so their results can vary much more.
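
As a rough illustration of the starting point of GloVe, the sketch below builds a symmetric word-context co-occurrence table in which a pair of words that are `d` positions apart contributes `1/d`, and implements the weighting function `f(x) = (x / x_max)^alpha` (capped at 1) that the GloVe paper uses to limit the influence of very frequent pairs. The function names and the toy corpus are hypothetical, and the actual fitting of the vectors is not shown.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Symmetric co-occurrence counts, weighted by 1/distance."""
    counts = defaultdict(float)
    for sent in sentences:
        for i, word in enumerate(sent):
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, sent[j])] += weight
                counts[(sent[j], word)] += weight
    return dict(counts)

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight of a co-occurrence count x in the GloVe least-squares objective."""
    return (x / x_max) ** alpha if x < x_max else 1.0

# Made-up toy corpus
toy_corpus = [["bart", "eats", "donuts"], ["homer", "eats", "donuts"]]
X = cooccurrence_counts(toy_corpus, window=2)
print(X[("eats", "donuts")])  # 2.0: adjacent in both sentences
print(X[("bart", "donuts")])  # 0.5: two positions apart, once
print(glove_weight(X[("eats", "donuts")]))
```

GloVe then learns word and context vectors whose dot products (plus bias terms) approximate the logarithm of these counts, with each pair weighted by `glove_weight`.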

Official website: https://nlp.stanford.edu/projects/glove/

Original paper: https://nlp.stanford.edu/pubs/glove.pdf

In [78]:
glv_pret = gensim_api.load("glove-wiki-gigaword-200")

# Are already available in gensim:
#'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300',
#'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200'

From a downloaded source (e.g. https://nlp.stanford.edu/projects/glove/), one can load the vectors directly with gensim 4.0 or later, where `no_header=True` handles the missing header line of the GloVe text files (older gensim versions need the `gensim.scripts.glove2word2vec` conversion script first):

    from gensim.models import KeyedVectors
    glove_file = 'DOWNLOADED_GLOVE_VECTORS.txt'  # path to the downloaded file
    model = KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True)

In [79]:
glv_pret.most_similar(positive=["better", "fast"], negative=["good"], topn=3)

[('faster', 0.7852828502655029),
 ('slow', 0.6931735873222351),
 ('slower', 0.6770131587982178)]

In [80]:
glv_pret.most_similar("eat", topn = 10)

[('eating', 0.7841552495956421),
 ('ate', 0.7657052874565125),
 ('eaten', 0.7538666129112244),
 ('meal', 0.6805590987205505),
 ('consume', 0.6571250557899475),
 ('eats', 0.6406125426292419),
 ('food', 0.6227813363075256),
 ('meat', 0.6211603879928589),
 ('drink', 0.6211259961128235),
 ('vegetables', 0.6168597340583801)]

In [81]:
glv_pret.most_similar("consume", topn = 10)

[('consumed', 0.7226211428642273),
 ('consumes', 0.6638200283050537),
 ('eat', 0.657124936580658),
 ('consuming', 0.6452376842498779),
 ('devour', 0.5809320211410522),
 ('consumption', 0.5808643698692322),
 ('feed', 0.5677404999732971),
 ('ingest', 0.5554704666137695),
 ('calories', 0.5538193583488464),
 ('drink', 0.5402347445487976)]

### Remark: training GloVe

Training GloVe vectors is not possible with gensim. If interested, one can use the [official GloVe code](https://nlp.stanford.edu/projects/glove/) (command line interface).

For a Python interface, see for example the [`glove_python`](https://github.com/maciejkula/glove-python) package (a "toy implementation"):

    !pip install glove_python

### See also other vector embeddings...

https://github.com/RaRe-Technologies/gensim-data#models

In [84]:
gensim_api.info('conceptnet-numberbatch-17-06-300')['description']
#conceptnet = gensim_api.load("conceptnet-numberbatch-17-06-300")

'ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning. ConceptNet Numberbatch is part of the ConceptNet open data project. ConceptNet provides lots of ways to compute with word meanings, one of which is word embeddings. ConceptNet Numberbatch is a snapshot of just the word embeddings. It is built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting.'

# Saving gensim models and word vectors

One can save either the entire model (if further training is expected).

In [None]:
w2v_s.save('word2vec_simpson_model')
w2v_s = Word2Vec.load('word2vec_simpson_model')

Or only the word vectors (the `KeyedVectors`-type attribute) if the vectors are final. They are much more memory-efficient to save.

In [82]:
w2v_s.wv.save('word2vec_simpson_word_vectors')
w2v_s_wv = KeyedVectors.load('word2vec_simpson_word_vectors')

In [83]:
w2v_s_wv.most_similar("bart")

[('lisa', 0.6480810046195984),
 ('milhouse', 0.6399539113044739),
 ('dad', 0.6044421792030334),
 ('homer', 0.5647057890892029),
 ('mom', 0.5577839612960815),
 ('your_father', 0.5532432198524475),
 ('son', 0.5313892960548401),
 ('honey', 0.5028467774391174),
 ('maggie', 0.486422598361969),
 ('well', 0.4840283989906311)]

# Exercise: ML classification using advanced static embeddings

Compare the performance of the logistic regression classifier on the 20 newsgroups dataset using word2vec, GloVe, or fasttext embeddings with the performance achieved in the previous seminar using TF-IDF.