# NLP Seminar 3: Static word embeddings (word2vec, fasttext, GloVe)

In this seminar, we will use the `gensim` package, as it offers unified, easy-to-use implementations of word2vec and fasttext, as well as pretrained word2vec, fasttext, and GloVe vectors.

In [1]:
#!pip install --upgrade gensim

In [2]:
import numpy as np
import pandas as pd

In [3]:
import multiprocessing
cores = multiprocessing.cpu_count()
cores

12

In [4]:
from nltk.tokenize import word_tokenize

# Data preparation

Data source: https://www.kaggle.com/datasets/prashant111/the-simpsons-dataset

In [5]:
simpsons = pd.read_csv("../data/simpsons_script_lines.csv",
                       usecols=["raw_character_text", "raw_location_text", "spoken_words", "normalized_text"],
                       dtype={'raw_character_text':'string', 'raw_location_text':'string',
                              'spoken_words':'string', 'normalized_text':'string'})
simpsons.head()

|   | raw_character_text | raw_location_text | spoken_words | normalized_text |
|---|---|---|---|---|
| 0 | Miss Hoover | Springfield Elementary School | No, actually, it was a little of both. Sometim... | no actually it was a little of both sometimes ... |
| 1 | Lisa Simpson | Springfield Elementary School | Where's Mr. Bergstrom? | wheres mr bergstrom |
| 2 | Miss Hoover | Springfield Elementary School | I don't know. Although I'd sure like to talk t... | i dont know although id sure like to talk to h... |
| 3 | Lisa Simpson | Springfield Elementary School | That life is worth living. | that life is worth living |
| 4 | Edna Krabappel-Flanders | Springfield Elementary School | The polls will be open from now until the end ... | the polls will be open from now until the end ... |


In [6]:
simpsons.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158271 entries, 0 to 158270
Data columns (total 4 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   raw_character_text  140749 non-null  string
 1   raw_location_text   157863 non-null  string
 2   spoken_words        132112 non-null  string
 3   normalized_text     132087 non-null  string
dtypes: string(4)
memory usage: 4.8 MB


In [7]:
simpsons = simpsons.dropna().drop_duplicates().reset_index(drop=True)

In [8]:
corpus_tok = simpsons['normalized_text'].str.split(' ').to_list()
corpus_tok[1]

['wheres', 'mr', 'bergstrom']

In [9]:
# If you don't know the Simpsons TV show,
# you can use e.g. the Wikipedia subset corpus (text8) instead
# (and try different words when evaluating the vectors and similarities in the next sections):

#import gensim.downloader as gensim_api
#gensim_api.info('text8')['description']
#corpus_tok = gensim_api.load('text8')

## Phraser

Using a "phraser" is a common step before training static word embeddings.
It merges tokens that frequently appear together and carry a meaning of their own as a unit into a single token, so that one embedding is learned for the whole expression.
Such "common phrases" are also called multi-word expressions, collocations, or word n-grams.

For more implementation details: https://radimrehurek.com/gensim/models/phrases.html
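
To make the merging criterion concrete, here is a minimal pure-Python sketch of the kind of bigram score used for phrase detection. It assumes the default scoring described in the gensim documentation (the formula of Mikolov et al., 2013, scaled by the vocabulary size); the function name `score_bigrams` and the toy corpus are made up for illustration. In `Phrases`, pairs whose score exceeds the `threshold` parameter are merged.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score each adjacent token pair (a, b) as
    (count(a, b) - min_count) * vocab_size / (count(a) * count(b))."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
    }

# Hypothetical toy corpus (not the Simpsons data)
toy_corpus = [
    ["homer", "simpson", "eats", "donuts"],
    ["homer", "simpson", "sleeps"],
    ["bart", "eats", "donuts"],
    ["homer", "simpson", "works"],
]

scores = score_bigrams(toy_corpus, min_count=1)
for pair, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(score, 3))
```

Pairs that occur only `min_count` times get a score of zero, while pairs that co-occur often relative to the frequency of their parts score highest.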

In [10]:
from gensim.models.phrases import Phrases, FrozenPhrases, ENGLISH_CONNECTOR_WORDS

phrases = Phrases(corpus_tok, min_count=30, connector_words=ENGLISH_CONNECTOR_WORDS)

# To save space:
phraser = FrozenPhrases(phrases) # or Phraser(phrases) or phrases.freeze()
del phrases

In [11]:
phraser[["homer", "simpson", "eats", "donuts"]]

['homer_simpson', 'eats', 'donuts']

In [12]:
corpus_phrased = phraser[corpus_tok]

In [13]:
corpus_phrased[1]

['wheres', 'mr', 'bergstrom']

# Word2vec

Word2vec has two sub-methods for training the word embeddings: continuous bag of words (CBOW) and skip-gram.
In both cases, a shallow neural network is trained to predict either

- a word given its context (CBOW), or
- the context of a given word (skip-gram).

The context is defined as the other surrounding words in a given window. The word embedding vectors are then obtained from the two trained weight matrices for each word in the vocabulary.
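
The difference between the two training setups can be illustrated by listing the training examples they extract from a sentence. This is only a sketch of the windowing idea: the function name `context_windows` is hypothetical, and actual implementations typically add refinements (such as randomly shrinking the window or subsampling frequent words) that are left out here.

```python
def context_windows(tokens, window=2):
    """Yield (center, context) pairs, where context lists the words
    at most `window` positions away from the center word."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

sentence = ["bart", "is", "grounded", "again"]

# CBOW: one example per position, predicting the center word from its context
cbow_examples = [(context, center) for center, context in context_windows(sentence, window=1)]

# Skip-gram: one example per (center, context word) pair
skipgram_examples = [(center, ctx)
                     for center, context in context_windows(sentence, window=1)
                     for ctx in context]

print(cbow_examples)
print(skipgram_examples)
```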


Official website: https://code.google.com/archive/p/word2vec/

Original papers: http://arxiv.org/abs/1301.3781 and http://arxiv.org/abs/1310.4546

### Training word2vec on the Simpsons scripts

In [14]:
from gensim.models import Word2Vec, KeyedVectors

In [None]:
w2v_s = Word2Vec(corpus_phrased, vector_size=150, window=3, min_count=2, sg=0, negative=5, ns_exponent=0.75,
                 alpha=0.025, min_alpha=0.0001, workers=cores-1, epochs=30)
# 1st line: Method's hyperparameters
# 2nd line: Optimization (gradient descent) hyperparameters

Can also be done in separate steps:
```python
w2v_s = Word2Vec(vector_size=150, window=3, min_count=2, sg=0, negative=5, ns_exponent=0.75,
                 alpha=0.025, min_alpha=0.0001, workers=cores-1)
# Scan the corpus once to build the vocabulary
w2v_s.build_vocab(corpus_phrased, progress_per=10000)
# Train on the same corpus; corpus_count was set by build_vocab
w2v_s.train(corpus_phrased, total_examples=w2v_s.corpus_count, epochs=30, report_delay=1)
```

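*Note:* For fully reproducible results, one must set `workers=1` and fix a `seed=...`. Between interpreter launches, Python's hash randomization must also be controlled with the `PYTHONHASHSEED` environment variable.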

Word embedding vectors can then be obtained from the trained model for each word in the training vocabulary. These `KeyedVectors` are stored in the `.wv` attribute and have convenient methods to access and combine them.

In [16]:
w2v_s.wv

<gensim.models.keyedvectors.KeyedVectors at 0x1b5597f89b0>

In [16]:
homer_vector = w2v_s.wv.get_vector("homer", norm=True) # Father of the Simpsons
homer_vector

array([-1.97504416e-01,  6.43460974e-02,  4.45451364e-02, -2.24762127e-01,
        2.18845621e-01, -4.93652187e-02,  3.65614407e-02,  5.74157014e-02,
        1.12886772e-01, -1.30046293e-01,  1.45423878e-02, -1.66172273e-02,
        5.73144294e-02, -4.79874685e-02,  1.00222528e-01,  1.35900602e-01,
       -7.18876347e-02,  5.77266812e-02,  6.15007803e-03, -7.56959245e-02,
       -1.40379772e-01, -5.59713803e-02,  8.03122595e-02,  6.45850878e-03,
       -1.28210813e-01,  7.50770047e-02,  3.26105542e-02, -8.92766714e-02,
       -2.27091219e-02,  6.46338016e-02, -1.46641741e-02,  9.02169049e-02,
        1.21256644e-02,  8.66212547e-02, -7.10247606e-02,  7.43433684e-02,
       -3.08084220e-01, -3.93914953e-02,  6.38459921e-02, -8.82648379e-02,
       -7.35132098e-02,  1.51740210e-02,  1.23751201e-01,  4.52707103e-03,
        2.82497145e-02, -1.00846723e-01, -1.15346827e-01, -3.87128443e-02,
       -1.58500969e-01,  4.52705510e-02, -4.78413515e-02, -1.30678609e-01,
       -4.90224250e-02, -

For a given word or vector, one can query the other most similar word vectors, in terms of cosine similarity.

In [17]:
w2v_s.wv.most_similar("homer") # Marge is Homer's wife and Bart is his son. Homer is the "dad" of the family.

[('marge', 0.6317975521087646),
 ('homie', 0.5999241471290588),
 ('bart', 0.5930087566375732),
 ('dad', 0.5614809393882751),
 ('son', 0.5609906911849976),
 ('moe', 0.549460232257843),
 ('ned', 0.5298517942428589),
 ('grampa', 0.5063692927360535),
 ('mom', 0.5041996836662292),
 ('honey', 0.4845598340034485)]

In [18]:
w2v_s.wv.most_similar(homer_vector)

[('homer', 1.0),
 ('marge', 0.6317975521087646),
 ('homie', 0.5999241471290588),
 ('bart', 0.5930087566375732),
 ('dad', 0.5614809393882751),
 ('son', 0.5609906911849976),
 ('moe', 0.549460232257843),
 ('ned', 0.5298517942428589),
 ('grampa', 0.5063692927360535),
 ('mom', 0.5041996836662292)]

In [19]:
w2v_s.wv.most_similar("homer_simpson") # name bigram

[('manager', 0.4670054316520691),
 ('ned_flanders', 0.45402565598487854),
 ('montgomery_burns', 0.4459589719772339),
 ('abraham', 0.43679898977279663),
 ('bart_simpson', 0.43056923151016235),
 ('robert', 0.4298505485057831),
 ('chester', 0.42625194787979126),
 ('kent_brockman', 0.4187189042568207),
 ('johnson', 0.4120151698589325),
 ('mayor', 0.40999096632003784)]

In [20]:
w2v_s.wv.most_similar("bart") # Bart is the son, Lisa his sister and Milhouse his best friend

[('lisa', 0.6728383898735046),
 ('milhouse', 0.6387649774551392),
 ('dad', 0.5937414765357971),
 ('homer', 0.5930086970329285),
 ('son', 0.5514666438102722),
 ('mom', 0.5388515591621399),
 ('your_father', 0.5289860963821411),
 ('honey', 0.5228714346885681),
 ('principal_skinner', 0.4911167025566101),
 ('homie', 0.4858196973800659)]

One can also compute the cosine similarity between two word vectors:

In [21]:
w2v_s.wv.similarity('bart', 'lisa')

0.6728384

In [22]:
w2v_s.wv.similarity('bart', 'bart')

0.99999994

In [None]:
w2v_s.wv.n_similarity(["marge", "homer"], ["wife", "husband"])

0.46287888
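
As a reminder of what these numbers mean, the following sketch computes cosine similarity from scratch, along with the set-to-set variant (cosine similarity between the mean vectors of two word sets, which is how `n_similarity` is understood to work). The 2-D vectors are hypothetical, hand-picked values, not trained embeddings. Floating-point rounding is also why the similarity of a vector with itself can come out as `0.99999994` rather than exactly 1.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def set_similarity(vectors_a, vectors_b):
    """Cosine similarity between the mean vectors of two sets."""
    return cosine(mean_vector(vectors_a), mean_vector(vectors_b))

# Hypothetical 2-D vectors, chosen by hand for illustration only
toy = {"marge": [1.0, 0.2], "homer": [0.8, 0.6], "wife": [0.9, 0.1], "husband": [0.7, 0.7]}

print(cosine(toy["marge"], toy["homer"]))
print(set_similarity([toy["marge"], toy["homer"]], [toy["wife"], toy["husband"]]))
```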

Odd-one-out identification:

In [23]:
w2v_s.wv.doesnt_match(['homer', 'patty', 'selma']) # Patty and Selma are Marge's twin sisters

'homer'
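
One plausible way to implement odd-one-out detection, sketched here under the assumption that it mirrors what `doesnt_match` does, is to normalize all vectors, average them, and return the word that is least similar to that average. The function names and the 2-D vectors are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(word_vectors):
    """Return the word whose unit vector has the lowest cosine
    similarity to the mean of all unit vectors."""
    units = {word: unit(vec) for word, vec in word_vectors.items()}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    return min(units, key=lambda word: sum(a * b for a, b in zip(units[word], mean)))

# Hypothetical 2-D vectors: the two sisters point in a similar direction
toy_vectors = {"patty": [0.9, 0.1], "selma": [0.8, 0.2], "homer": [0.1, 0.9]}
print(odd_one_out(toy_vectors))
```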

Word analogies: how well do the embedding vectors answer intuitive semantic and syntactic analogy questions?

In [24]:
# " Homer - man + woman = ? " - i.e. " man:Homer :: woman:? "
w2v_s.wv.most_similar(positive=["homer", "woman"], negative=["man"], topn=3) # Marge is Homer's wife

[('marge', 0.5275154709815979),
 ('mom', 0.5146074891090393),
 ('homie', 0.5048220753669739)]

In [25]:
# " woman - Marge + Homer = ? " - i.e. " Marge:Homer :: woman:? " 
w2v_s.wv.most_similar(positive=["woman", "homer"], negative=["marge"], topn=3)

[('man', 0.47464245557785034),
 ('friend', 0.4618160128593445),
 ('child', 0.438487708568573)]

### Sentence embedding

A sentence or document embedding can, for example, be obtained by averaging its tokens' embeddings.

In [None]:
def document2vec(tokens, embedding_wv, phraser=None, normalize=True):
    """Returns the embedding of a sentence or document as the mean of its token embeddings."""
    if phraser:
        tokens = phraser[tokens]
    # Equivalent when all tokens are in the vocabulary:
    # sent_mean = np.array([embedding_wv.get_vector(tok, norm=normalize) for tok in tokens]).mean(axis=0)
    # get_mean_vector skips out-of-vocabulary tokens (ignore_missing=True)
    sent_mean = embedding_wv.get_mean_vector(keys=tokens, pre_normalize=normalize, ignore_missing=True)
    return sent_mean

In [40]:
document2vec(["bart", "is", "grounded"], w2v_s.wv, phraser=phraser)

array([-0.009877  ,  0.01109686, -0.00165891, -0.017362  ,  0.05698981,
       -0.02443988, -0.03313335,  0.05180913, -0.01051246,  0.00570679,
        0.00610514, -0.01178046,  0.00710768,  0.05084235,  0.00538733,
        0.04277755, -0.05651882,  0.00355378, -0.04752327,  0.04393594,
       -0.04326501, -0.01511771,  0.00783788,  0.00466334, -0.01279658,
       -0.04956766, -0.02699601, -0.0048399 ,  0.04675311,  0.00818683,
       -0.01231978, -0.03479793, -0.00063789,  0.04933694, -0.00058257,
        0.09272969,  0.02091086,  0.04867287,  0.04694089,  0.023596  ,
       -0.00434448,  0.03485893,  0.03434121,  0.01915275,  0.02651068,
        0.02080218, -0.04208039, -0.01016911,  0.03299143,  0.00735796,
        0.0140503 , -0.01229267,  0.09844785, -0.05196367,  0.09528995,
        0.05697097, -0.05021961, -0.02764725,  0.06670328,  0.02159331,
        0.04906306, -0.03101779,  0.00488023, -0.04220416,  0.01963334,
       -0.05249187, -0.03727938,  0.04126221,  0.03961276,  0.00

### Pretrained word2vec vectors

https://github.com/RaRe-Technologies/gensim-data#models

In [32]:
import gensim.downloader as gensim_api

In [33]:
w2v_pret = gensim_api.load('word2vec-google-news-300')

In [None]:
# Or from downloaded source (e.g. https://code.google.com/archive/p/word2vec/):
#w2v_pret = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [41]:
w2v_pret

<gensim.models.keyedvectors.KeyedVectors at 0x2b7809449e0>

In [42]:
w2v_pret.most_similar(positive=["eat"])

[('eating', 0.7529403567314148),
 ('ate', 0.7013993859291077),
 ('eaten', 0.6724975109100342),
 ('eats', 0.6589087843894958),
 ('munch', 0.6417747735977173),
 ('eat_healthfully', 0.6315395832061768),
 ('eat_fatty_foods', 0.6280142068862915),
 ('consume', 0.6184970140457153),
 ('Nutritionists_recommend', 0.6183844804763794),
 ('overeaten', 0.6109130382537842)]

In [43]:
w2v_pret.similarity("eat", 'consume')

0.6184971

In [44]:
w2v_pret.doesnt_match(["eat", 'dance', 'drink'])

'dance'

In [45]:
w2v_pret.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car'])

'car'

In [46]:
w2v_pret.most_similar(positive=["king", "woman"], negative=["man"], topn=3)

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951)]

In [47]:
vectcalc = w2v_pret.get_vector("king", norm=True) - w2v_pret.get_vector("man", norm=True) + w2v_pret.get_vector("woman", norm=True)
w2v_pret.most_similar(vectcalc)

[('king', 0.7992597818374634),
 ('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.518113374710083),
 ('sultan', 0.5098593831062317)]

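So is `king - man + woman = king`? When the query words are not excluded, the vector closest to the result is `king` itself. `most_similar(positive=..., negative=...)` removes the input words from its results, which is why `queen` came first in the previous cell, whereas querying with a raw vector applies no such filter.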
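
The effect of excluding the input words can be reproduced with a small sketch. The 3-D vectors below are hypothetical and hand-picked so that `man` and `woman` are close to each other (as they typically are in trained embeddings); the function `analogy` is an illustrative stand-in, not gensim's implementation.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=3, exclude_inputs=True):
    """Rank words by cosine similarity to sum(positive) - sum(negative),
    computed on unit-normalized vectors."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    inputs = set(positive) | set(negative)
    scored = [
        (word, sum(a * b for a, b in zip(unit(vec), query)))
        for word, vec in vectors.items()
        if not (exclude_inputs and word in inputs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors (illustration only)
toy = {
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 1.0, 0.3],
    "king": [1.0, 1.0, 0.0],
    "queen": [0.8, 0.5, 0.6],
    "apple": [0.1, 0.0, 1.0],
}

with_filter = analogy(toy, positive=["king", "woman"], negative=["man"])
without_filter = analogy(toy, positive=["king", "woman"], negative=["man"], exclude_inputs=False)
print(with_filter[0][0], without_filter[0][0])
```

With these toy values, `queen` is ranked first only once the input words are filtered out; without the filter, `king` stays closest to the query vector.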

In [48]:
w2v_pret.most_similar(positive=["bart"])

[('joseph', 0.620844841003418),
 ('boston_globe', 0.6205891370773315),
 ('liz', 0.6189237833023071),
 ('gerald', 0.6109275817871094),
 ('jta', 0.6039943099021912),
 ('jg', 0.6032029986381531),
 ('meyers', 0.6026628613471985),
 ('ellis', 0.5999051928520203),
 ('christine', 0.5992782115936279),
 ('becky', 0.5988665223121643)]

In [49]:
w2v_s.wv.most_similar("bart")

[('lisa', 0.7069635987281799),
 ('milhouse', 0.6307767629623413),
 ('dad', 0.5766920447349548),
 ('homer', 0.5670747756958008),
 ('mom', 0.5558942556381226),
 ('son', 0.5200268030166626),
 ('your_father', 0.5010845065116882),
 ('honey', 0.49810099601745605),
 ('you', 0.4977051317691803),
 ('maggie', 0.4751102328300476)]

# Fasttext

Fasttext is a static word embedding method that is very similar to word2vec. The main difference is that it learns character-level n-gram vectors together with the word vectors.

Advantages:
+ Embeddings can be obtained for words outside the training vocabulary.
+ Better representation of rare words (that are morphologically similar to others).
+ Tends to perform better on syntactic tasks.
+ Is more useful for morphologically rich languages (such as German, Arabic, and Russian) than for English (e.g. German compounds: 'table tennis' is the single word 'Tischtennis'), although this depends heavily on the task.
+ Might work better for small datasets.
+ The "official" implementation is quite efficient, and allows training the embedding and a classifier at once (see the official `fasttext` package documentation https://fasttext.cc/docs/en/python-module.html).

Disadvantages:
- Can overfit more easily, and is a bit harder to fine-tune because of the additional character n-gram hyperparameters.
- Tends to perform worse on semantic tasks.
- May favor morphologically close words over semantically closer synonyms.
- Can be heavier to train.

However, the differences between fasttext and word2vec tend to decrease as the size of the training corpus increases.

Official website: https://fasttext.cc/

Original paper: https://arxiv.org/abs/1607.04606

E.g., for a character n-gram length of n=4, the input representation of "university" is the average of the vectors for the whole word "university" and for its character n-grams "univ", "nive", ... , "sity".
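
The n-gram extraction itself can be sketched in a few lines. Following the original paper, the word is first wrapped in the boundary symbols `<` and `>`, so that prefixes and suffixes are distinguished from word-internal n-grams. The function name `char_ngrams` is hypothetical; actual implementations additionally hash the n-grams into a fixed number of buckets, which is omitted here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List all character n-grams of the word wrapped in '<' and '>',
    for every n between min_n and max_n (inclusive)."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
print(char_ngrams("university", min_n=4, max_n=4))

# Words sharing n-grams share part of their representation,
# even if one of them was never seen during training:
shared = set(char_ngrams("unige")) & set(char_ngrams("unit"))
print(sorted(shared))
```

Since the vector of an unseen word is built from its n-gram vectors, its nearest neighbours tend to be words that share many of those n-grams.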

### Training fasttext on the Simpsons scripts

In [15]:
from gensim.models import FastText

In [None]:
fst_s = FastText(corpus_phrased, vector_size=150, window=3, min_count=5, sg=0, negative=5, ns_exponent=0.75,
                 min_n=3, max_n=6, # Additional fasttext hyperparameters
                 alpha=0.025, min_alpha=0.0001, workers=cores-1, epochs=30)

Can also be performed in separate steps:
```python
fst_s = FastText(vector_size=150, window=3, min_count=5, sg=0, negative=5, ns_exponent=0.75,
                 min_n=3, max_n=6,
                 alpha=0.025, min_alpha=0.0001, workers=cores-1)
# Scan the corpus once to build the vocabulary
fst_s.build_vocab(corpus_phrased)
print(len(fst_s.wv.key_to_index))  # vocabulary size (gensim 4 replaced `wv.vocab`)
fst_s.train(corpus_phrased, total_examples=fst_s.corpus_count, epochs=30)
```

In [None]:
print(fst_s.wv.index_to_key[:100]) # First 100 (most frequent) words in the model vocabulary

['the', 'you', 'i', 'a', 'to', 'and', 'of', 'it', 'that', 'in', 'is', 'my', 'this', 'for', 'your', 'me', 'on', 'oh', 'we', 'im', 'have', 'but', 'what', 'no', 'well', 'its', 'just', 'with', 'do', 'are', 'now', 'not', 'be', 'was', 'all', 'so', 'get', 'can', 'youre', 'dont', 'like', 'one', 'at', 'thats', 'hey', 'here', 'out', 'if', 'know', 'up', 'he', 'homer', 'were', '--', 'our', 'go', 'from', 'bart', 'they', 'ill', 'yeah', 'there', 'about', 'think', 'how', 'want', 'an', 'right', 'as', 'look', 'see', 'marge', 'good', 'got', 'uh', 'dad', 'okay', 'him', 'when', 'back', 'will', 'some', 'cant', 'little', 'us', 'man', 'could', 'time', 'come', 'who', 'did', 'his', 'say', 'why', 'would', 'hes', 'or', 'take', 'by', 'make']


In [50]:
"unige" in fst_s.wv.index_to_key, "unige" in w2v_s.wv.index_to_key

(False, False)

In [53]:
try:
    print(w2v_s.wv.get_vector("unige", norm=True))
except KeyError:
    print("KeyError: the given token is not present in the vocabulary.")

KeyError: the given token is not present in the vocabulary.


In [54]:
try:
    print(fst_s.wv.get_vector("unige", norm=True))
except KeyError:
    print("KeyError: the given token is not present in the vocabulary.")

[-0.01608711  0.04110799  0.17988564 -0.01597484  0.09143241  0.0618559
 -0.10572785  0.04773679  0.06128706  0.08347788 -0.12178781  0.00537748
  0.01148988  0.05912064 -0.14781815 -0.01172447  0.03322294 -0.06930365
  0.14204437 -0.0088919  -0.01566534 -0.09653875  0.05829462  0.14226532
  0.04213895  0.17472796  0.06464078 -0.03843708  0.1057763  -0.11764871
 -0.0761048  -0.10572392 -0.0178628  -0.10062825  0.22040248  0.11360672
 -0.07555386 -0.02123816 -0.07728361 -0.05543618 -0.14223951  0.07038567
  0.01215003 -0.08780414  0.0427446   0.03708046 -0.12724902 -0.13829008
 -0.09344315  0.12672943 -0.05068863  0.06170019 -0.06100393  0.1073439
 -0.16181694 -0.03976557 -0.00716341  0.09881999  0.0078082  -0.04061393
 -0.06061785 -0.08024641 -0.03772819  0.01696859  0.10869474 -0.00998171
  0.02471528  0.07926501  0.03131536  0.07140689  0.03026498 -0.0235145
  0.00374927  0.04195894  0.05051354 -0.07675543 -0.00371963  0.05776955
 -0.0427718  -0.06232553  0.01374752 -0.05637484 -0.04

In [55]:
fst_s.wv.most_similar("unige")

[('unit', 0.8625775575637817),
 ('unite', 0.8335436582565308),
 ('university', 0.8153852820396423),
 ('unique', 0.8051110506057739),
 ('universe', 0.796246349811554),
 ('union', 0.7779499888420105),
 ('universal', 0.7628664970397949),
 ('un', 0.7502937316894531),
 ('uniform', 0.7458596229553223),
 ('unemployment', 0.7442814707756042)]

In [56]:
fst_s.wv.most_similar("homer", topn = 10)

[('homey', 0.8086385726928711),
 ('knockahomer', 0.7412236928939819),
 ('homeboy', 0.7314950823783875),
 ('homers', 0.7251979112625122),
 ('homer_simpson', 0.7221113443374634),
 ('homemaker', 0.7107552886009216),
 ('homie', 0.6970348954200745),
 ('marge', 0.6948388814926147),
 ('bart', 0.6739369630813599),
 ('customer', 0.6527592539787292)]

In [61]:
fst_s.wv.most_similar("marge", topn = 10)

[('maaarge', 0.8323073387145996),
 ('margie', 0.809052050113678),
 ('sarge', 0.7575308680534363),
 ('marge-a-rine', 0.7560793161392212),
 ('margaret', 0.7486603260040283),
 ('marges', 0.748385488986969),
 ('margarita', 0.7326395511627197),
 ('marie', 0.695029079914093),
 ('homer', 0.69483882188797),
 ('marjorie', 0.6896591186523438)]

In [62]:
w2v_s.wv.most_similar("marge", topn = 10)

[('abe', 0.5942814946174622),
 ('homer', 0.5923839211463928),
 ('homie', 0.5613185167312622),
 ('honey', 0.5219800472259521),
 ('lisa', 0.51099693775177),
 ('maggie', 0.5003818869590759),
 ('but', 0.48739856481552124),
 ('mom', 0.48072031140327454),
 ('you', 0.4766426682472229),
 ('well', 0.4755488932132721)]

In [63]:
fst_s.wv.most_similar("eat", topn = 10)

[('teat', 0.7086275219917297),
 ('neat', 0.6601749062538147),
 ('beat', 0.6542387008666992),
 ('earn', 0.6485193967819214),
 ('meat', 0.6215378642082214),
 ('drink', 0.6123024821281433),
 ('eats', 0.6090422868728638),
 ('sweat', 0.6041677594184875),
 ('eatin', 0.6033998131752014),
 ('ear', 0.5854046940803528)]

In [64]:
w2v_s.wv.most_similar("eat", topn = 10)

[('drink', 0.5787950158119202),
 ('feed', 0.5256111025810242),
 ('buy', 0.5093466639518738),
 ('steal', 0.49575188755989075),
 ('die', 0.4946196675300598),
 ('throw', 0.48809781670570374),
 ('sell', 0.486087828874588),
 ('lose', 0.4823223054409027),
 ('wear', 0.4689778983592987),
 ('scrape', 0.46692678332328796)]

### Pretrained fasttext vectors

In [36]:
fst_pret = gensim_api.load('fasttext-wiki-news-subwords-300')

In [None]:
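# Or from a downloaded source (e.g. https://fasttext.cc/docs/en/english-vectors.html):
# from gensim.models.fasttext import load_facebook_vectors
# fst_pret = load_facebook_vectors('fasttext_file.bin')  # placeholder path to a native fastText .bin file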

In [65]:
fst_pret.most_similar("eat", topn = 10)

[('eate', 0.80073082447052),
 ('eat-', 0.7924754619598389),
 ('eatin', 0.7710843682289124),
 ('eating', 0.7475292682647705),
 ('ate', 0.739423930644989),
 ('eat.', 0.7381978631019592),
 ('eats', 0.7242209315299988),
 ('consume', 0.7190532088279724),
 ('eaten', 0.7079055905342102),
 ('eatable', 0.7070158123970032)]

In [66]:
w2v_pret.most_similar("eat", topn = 10)

[('eating', 0.7529403567314148),
 ('ate', 0.7013993859291077),
 ('eaten', 0.6724975109100342),
 ('eats', 0.6589087843894958),
 ('munch', 0.6417747735977173),
 ('eat_healthfully', 0.6315395832061768),
 ('eat_fatty_foods', 0.6280142068862915),
 ('consume', 0.6184970140457153),
 ('Nutritionists_recommend', 0.6183844804763794),
 ('overeaten', 0.6109130382537842)]

In [67]:
fst_pret.most_similar("consume", topn = 10)

[('consumes', 0.7680187821388245),
 ('over-consume', 0.7483881711959839),
 ('consumed', 0.7472653388977051),
 ('consuming', 0.7410412430763245),
 ('eat', 0.7190532088279724),
 ('overconsume', 0.6897545456886292),
 ('devour', 0.6756592392921448),
 ('ingest', 0.65708327293396),
 ('Consume', 0.6442354917526245),
 ('consumption', 0.6395013928413391)]

In [68]:
w2v_pret.most_similar("consume", topn = 10)

[('consumed', 0.698164701461792),
 ('consumes', 0.6695359349250793),
 ('eat', 0.6184970736503601),
 ('consumption', 0.6043094992637634),
 ('guzzle', 0.5901384949684143),
 ('ingest', 0.5877813696861267),
 ('consuming', 0.5756837725639343),
 ('overconsume', 0.5606318712234497),
 ('devour', 0.5577657222747803),
 ('Consuming', 0.5523046851158142)]

# GloVe

Unlike word2vec and fasttext, GloVe does not train a skip-gram or CBOW network. Instead, it fits the word vectors to the global word-context co-occurrence matrix.

- GloVe can take longer to train on large corpora than word2vec.
- It has fewer hyperparameters, so it is much easier to tune, but this leaves less room for adapting it to a specific task.
- In comparison, word2vec and fasttext are much more sensitive to the choice of hyperparameters, so their results can vary much more.
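
To illustrate what the word-context co-occurrence matrix looks like, the sketch below counts co-occurrences within a symmetric window, weighting each pair by the inverse of the distance between the two words (as in the original GloVe paper). The function name `cooccurrence_counts` and the toy corpus are made up for illustration. GloVe then fits the vectors so that the dot product of a word vector and a context vector (plus two bias terms) approximates the logarithm of the corresponding count.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Symmetric co-occurrence counts X[(word, context)], where a pair
    of words d positions apart contributes 1/d to the count."""
    counts = defaultdict(float)
    for sent in sentences:
        for i, word in enumerate(sent):
            # Look back at the preceding words within the window
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, sent[j])] += weight
                counts[(sent[j], word)] += weight
    return dict(counts)

# Hypothetical toy corpus
toy_corpus = [["homer", "eats", "donuts"], ["bart", "eats", "donuts"]]
X = cooccurrence_counts(toy_corpus, window=2)
for pair, count in sorted(X.items()):
    print(pair, count)
```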

Official website: https://nlp.stanford.edu/projects/glove/

Original paper: https://nlp.stanford.edu/pubs/glove.pdf

In [37]:
glv_pret = gensim_api.load("glove-wiki-gigaword-200")

# Are already available in gensim:
#'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300',
#'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200'

From a downloaded source (e.g. https://nlp.stanford.edu/projects/glove/), one can load the vectors directly with gensim 4.0 or later, since the GloVe text format is the word2vec text format without its header line:
```python
from gensim.models import KeyedVectors
# 'DOWNLOADED_GLOVE_VECTORS.txt' is a placeholder for the path of the downloaded file
glove_file = 'DOWNLOADED_GLOVE_VECTORS.txt'
# no_header=True replaces the older glove2word2vec conversion step
model = KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True)
```

In [69]:
glv_pret.most_similar(positive=["better", "fast"], negative=["good"], topn=3)

[('faster', 0.7852828502655029),
 ('slow', 0.6931735873222351),
 ('slower', 0.6770131587982178)]

In [70]:
glv_pret.most_similar("eat", topn = 10)

[('eating', 0.7841552495956421),
 ('ate', 0.7657052874565125),
 ('eaten', 0.7538666129112244),
 ('meal', 0.6805590987205505),
 ('consume', 0.6571250557899475),
 ('eats', 0.6406125426292419),
 ('food', 0.6227813363075256),
 ('meat', 0.6211603879928589),
 ('drink', 0.6211259961128235),
 ('vegetables', 0.6168597340583801)]

In [71]:
glv_pret.most_similar("consume", topn = 10)

[('consumed', 0.7226211428642273),
 ('consumes', 0.6638200283050537),
 ('eat', 0.657124936580658),
 ('consuming', 0.6452376842498779),
 ('devour', 0.5809320211410522),
 ('consumption', 0.5808643698692322),
 ('feed', 0.5677404999732971),
 ('ingest', 0.5554704666137695),
 ('calories', 0.5538193583488464),
 ('drink', 0.5402347445487976)]

### Remark: training GloVe

Training GloVe vectors is not possible with gensim. If interested, one can use the [official GloVe code](https://nlp.stanford.edu/projects/glove/) (command line interface).

For a Python interface, see for example the ("toy implementation") [`glove_python`](https://github.com/maciejkula/glove-python) package:

    !pip install glove_python

### See also other vector embeddings...

https://github.com/RaRe-Technologies/gensim-data#models

In [72]:
gensim_api.info('conceptnet-numberbatch-17-06-300')['description']
#conceptnet = gensim_api.load("conceptnet-numberbatch-17-06-300")

'ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning. ConceptNet Numberbatch is part of the ConceptNet open data project. ConceptNet provides lots of ways to compute with word meanings, one of which is word embeddings. ConceptNet Numberbatch is a snapshot of just the word embeddings. It is built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting.'

# Saving gensim models and word vectors

One can either save the entire model (if further training is expected):

In [None]:
w2v_s.save('./word2vec_simpson_model')
w2v_s = Word2Vec.load('./word2vec_simpson_model')

Or save only the word vectors (the `KeyedVectors`-type attribute `.wv`) if the vectors are final. They take up much less memory and disk space than the full model.

In [73]:
w2v_s.wv.save('word2vec_simpson_word_vectors')
w2v_s_wv = KeyedVectors.load('word2vec_simpson_word_vectors')

In [74]:
w2v_s_wv.most_similar("bart")

[('lisa', 0.7069635987281799),
 ('milhouse', 0.6307767629623413),
 ('dad', 0.5766920447349548),
 ('homer', 0.5670747756958008),
 ('mom', 0.5558942556381226),
 ('son', 0.5200268030166626),
 ('your_father', 0.5010845065116882),
 ('honey', 0.49810099601745605),
 ('you', 0.4977051317691803),
 ('maggie', 0.4751102328300476)]

# Exercise: ML classification using advanced static embeddings

Compare the performance of a logistic regression classifier on the 20 Newsgroups dataset using word2vec, GloVe, or fasttext embeddings with the performance achieved in the previous seminar using TF-IDF.