# GloVe (Global Vectors for Word Representation) - pretrained vectors


Project page: https://nlp.stanford.edu/projects/glove/

The pre-trained vectors were downloaded from the project website and placed in `./data/`:

- `glove.840B.300d.zip`
- `glove.6B.zip`

In [3]:
import numpy as np
import pandas as pd

import zipfile
import pickle

## Preprocessing: storing for local usage

The pretrained vectors are distributed as plain text files, one word per line followed by its vector components. To load them faster in the other notebooks, we convert each text file to a pickled dictionary that maps every word to a numpy array (looking up a vector by key in a dictionary is faster than selecting a row from a dataframe):
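
As an alternative to the pandas route used in this notebook, the text format can also be parsed line by line straight into a dictionary. The following is a minimal sketch, not the method used here: `load_glove_text` and its `dim` parameter are hypothetical names, and taking the last `dim` fields as the vector is a defensive choice in case a token itself contains a space.

```python
import numpy as np


def load_glove_text(lines, dim):
    """Parse GloVe-style text lines into a {word: vector} dictionary.

    `lines` is any iterable of strings; `dim` is the vector length.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip("\r\n").split(" ")
        if len(parts) <= dim:
            continue  # skip blank or malformed lines
        # Everything before the last `dim` fields is the token itself.
        word = " ".join(parts[:-dim])
        vectors[word] = np.asarray(parts[-dim:], dtype=np.float32)
    return vectors


# Tiny in-memory demo; the token "nan" stays an ordinary string key.
sample = ["the 0.1 0.2 0.3\n", "nan 1.0 2.0 3.0\n"]
demo = load_glove_text(sample, dim=3)

# Reading from the zip archive would look roughly like this (not run here):
# import io, zipfile
# with zipfile.ZipFile("data/glove.6B.zip") as z:
#     with io.TextIOWrapper(z.open("glove.6B.50d.txt"), encoding="utf-8") as f:
#         glove = load_glove_text(f, dim=50)
```

Because the words are never passed through a CSV parser, no token is reinterpreted as a missing value, and the vectors are stored as `float32`, which halves the memory compared with `float64`.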

In [4]:
z = zipfile.ZipFile("/Users/sherryruan/data/glove/glove.840B.300d.zip")

In [5]:
%time glove = pd.read_csv(z.open('glove.840B.300d.txt'), sep=" ", quoting=3, header=None, index_col=0)

CPU times: user 2min 22s, sys: 21 s, total: 2min 43s
Wall time: 2min 54s


In [6]:
glove.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2196017 entries, , to zulchzulu
Columns: 300 entries, 1 to 300
dtypes: float64(300)
memory usage: 4.9+ GB


In [7]:
%time glove2 = {key: val.values for key, val in glove.T.items()}

CPU times: user 45.1 s, sys: 490 ms, total: 45.6 s
Wall time: 45.8 s


In [8]:
with open('/Users/sherryruan/data/glove/glove.840B.300d.pkl', 'wb') as output:
    pickle.dump(glove2, output)

OSError: [Errno 22] Invalid argument
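
The `OSError` above is consistent with a known limitation of some Python versions on macOS, where a single write of more than roughly 2 GB fails; that this is the cause here is an assumption, since the notebook does not show the full traceback. A common workaround is to serialize to bytes first and write them in smaller chunks. The sketch below uses the hypothetical helper names `dump_in_chunks` and `load_in_chunks`.

```python
import pickle


def dump_in_chunks(obj, path, chunk_size=2**30):
    """Pickle `obj` and write it to `path` in pieces of at most `chunk_size` bytes."""
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    view = memoryview(data)  # slicing a memoryview avoids copying each chunk
    with open(path, "wb") as f:
        for start in range(0, len(view), chunk_size):
            f.write(view[start:start + chunk_size])


def load_in_chunks(path, chunk_size=2**30):
    """Read a pickle back in pieces of at most `chunk_size` bytes."""
    buf = bytearray()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf.extend(chunk)
    return pickle.loads(bytes(buf))
```

Note that `pickle.dumps` holds the whole serialized object in memory next to the original dictionary, so for the 840B vectors this needs several extra gigabytes of RAM.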

The same conversion for a smaller set of pretrained vectors (for testing on a laptop):

In [2]:
z = zipfile.ZipFile("data/glove.6B.zip")

In [3]:
glove = pd.read_csv(z.open('glove.6B.50d.txt'), sep=" ", quoting=3, header=None, index_col=0)

In [4]:
len(glove)

400001

In [5]:
glove2 = {key: val.values for key, val in glove.T.items()}

In [6]:
with open('data/glove.6B.50d.pkl', 'wb') as output:
    pickle.dump(glove2, output)

In [7]:
glove = pd.read_csv(z.open('glove.6B.300d.txt'), sep=" ", quoting=3, header=None, index_col=0)

In [8]:
glove2 = {key: val.values for key, val in glove.T.items()}

In [9]:
with open('data/glove.6B.300d.pkl', 'wb') as output:
    pickle.dump(glove2, output)

## "Most similar" examples

A typical example: looking up the words most similar to a query and doing arithmetic with the vectors:

In [2]:
with open('data/glove.6B.50d.pkl', 'rb') as pkl:
    glove = pickle.load(pkl)

In [3]:
len(glove)

400001

In [4]:
words = np.array(list(glove.keys()), dtype=object)

In [5]:
words

array([nan, 'h2', 'ukrainka', ..., 'mccauley', 'hochstein', 'formula_16'], dtype=object)
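
The first key in the array above is `nan` rather than a string. The most likely explanation (an assumption, since the notebook does not check it) is that a token such as "nan" or "null" in the text file was interpreted by `pd.read_csv` as a missing value. A minimal sketch of how `keep_default_na=False` keeps such tokens as strings, using a made-up three-line sample instead of the real file:

```python
import io

import pandas as pd

# Made-up sample in the same space-separated layout as the GloVe text files.
sample = "the 0.1 0.2\nnan 0.3 0.4\nnull 0.5 0.6\n"

# Default behaviour: strings such as "nan" and "null" are treated as missing.
default = pd.read_csv(io.StringIO(sample), sep=" ", quoting=3,
                      header=None, index_col=0)

# With keep_default_na=False and no na_values, no string is parsed as NaN.
strict = pd.read_csv(io.StringIO(sample), sep=" ", quoting=3,
                     header=None, index_col=0, keep_default_na=False)

print(default.index.tolist())
print(strict.index.tolist())
```

With the second call every index entry is a string, so a word like "nan" can be looked up in the resulting dictionary.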

In [6]:
W = np.array(list(glove.values()))

In [7]:
W.shape

(400001, 50)

In [8]:
def most_similar(positive, negative, topn=10, freq_threshold=5):
    # Note: freq_threshold is accepted but not used.
    # Average the positive vectors and the negated negative vectors,
    # then scale the result to unit length.
    mean_vecs = []
    for word in positive:
        mean_vecs.append(np.array(glove[word]))
    for word in negative:
        mean_vecs.append(-1 * np.array(glove[word]))

    mean = np.array(mean_vecs).mean(axis=0)
    mean /= np.linalg.norm(mean)

    # Dot product of every row of W with the unit-length query vector.
    # The rows of W are not normalized, so these scores are not cosine
    # similarities and can exceed 1.
    dists = np.dot(W, mean)

    # Take extra candidates so that topn remain after dropping the query words.
    best = np.argsort(dists)[::-1][:topn + len(positive) + len(negative) + 100]
    result = [(words[i], dists[i]) for i in best if (words[i] not in positive
                                                     and words[i] not in negative)]
    return result[:topn]

In [9]:
most_similar(['king', 'woman'], ['man'], topn=10)

[('emperor', 4.5522317437387363),
 ('queen', 4.4707483412266873),
 ('daughter', 4.4056688956373744),
 ('throne', 4.3990343836225838),
 ('princess', 4.2912411625741882),
 ('mother', 4.1046861120937779),
 ('son', 4.0783652863046713),
 ('wife', 4.0341628529250606),
 ('father', 3.8841932219905333),
 ('prince', 3.8836605925009895)]
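
The scores above are larger than 1, so they are dot products with unnormalized vectors rather than cosine similarities, and words with long vectors are favoured. The following sketch shows a cosine-based variant for comparison. It is not the function used in this notebook: `most_similar_cosine` is a hypothetical name, it takes the word-to-vector dictionary as an argument, and it normalizes every vector before combining them, so its rankings may differ from the ones shown here.

```python
import numpy as np


def most_similar_cosine(vectors, positive, negative=(), topn=10):
    """Rank words by cosine similarity to (sum of positive - sum of negative)."""
    words = list(vectors.keys())
    W = np.array([vectors[w] for w in words], dtype=float)

    # Scale every row to unit length; guard against all-zero rows.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    W_unit = W / norms

    # Build the query from unit vectors and normalize it as well.
    index = {w: i for i, w in enumerate(words)}
    query = np.zeros(W.shape[1])
    for w in positive:
        query += W_unit[index[w]]
    for w in negative:
        query -= W_unit[index[w]]
    query /= np.linalg.norm(query)

    # Both sides are unit length, so the dot product is the cosine similarity.
    sims = W_unit @ query
    exclude = set(positive) | set(negative)
    order = np.argsort(sims)[::-1]
    result = [(words[i], float(sims[i])) for i in order if words[i] not in exclude]
    return result[:topn]


# Toy two-dimensional vectors, made up for illustration only.
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
toy_result = most_similar_cosine(toy, ["a"], topn=2)
```

With the real data the call would be `most_similar_cosine(glove, ['king', 'woman'], ['man'])`, and all scores would lie between -1 and 1.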

In [10]:
most_similar(['brought', 'seek'], ['bring'], topn=10)

[('government', 4.4060056808470422),
 ('court', 4.2230486058381436),
 ('authorities', 4.1728568687095517),
 ('officials', 4.0757379978136026),
 ('seeking', 4.028460594566452),
 ('federal', 3.9511732544821281),
 ('legal', 3.9461690939479515),
 ('law', 3.8573565404382761),
 ('sought', 3.813629103812473),
 ('lawyers', 3.8025097790503763)]

In [11]:
most_similar(['frog'], [], topn=25)

[('leptodactylidae', 4.6448479174033874),
 ('species', 4.524305824885948),
 ('ranidae', 4.4521810203063836),
 ('snails', 4.305010980198257),
 ('colubrid', 4.169195834533201),
 ('hylidae', 4.0862506971969976),
 ('snake', 4.0245443051038361),
 ('genus', 4.0109517209291301),
 ('cichlid', 4.0078101032802813),
 ('rhacophoridae', 3.972399387075912),
 ('bulbophyllum', 3.9643366428646245),
 ('eleutherodactylus', 3.9527980532154907),
 ('salticidae', 3.9363336863806349),
 ('nonvenomous', 3.9025431608540417),
 ('swallowtail', 3.8977048290428926),
 ('larvae', 3.8944229951597307),
 ('spiny', 3.8917372539864701),
 ('shrub', 3.8694668851746679),
 ('microhylidae', 3.8545271995546564),
 ('endemic', 3.8507680036214382),
 ('frogs', 3.8420260919459257),
 ('litoria', 3.8346105522739773),
 ('conita', 3.8340106472154991),
 ('orchid', 3.8120856931806566),
 ('deciduous', 3.790041215037518)]

The same, but with the longer vectors (300D instead of 50D):

In [12]:
with open('data/glove.6B.300d.pkl', 'rb') as pkl:
    glove = pickle.load(pkl)

In [13]:
words = np.array(list(glove.keys()), dtype=object)

In [14]:
words

array([nan, 'h2', 'ukrainka', ..., 'mccauley', 'hochstein', 'formula_16'], dtype=object)

In [15]:
W = np.array(list(glove.values()))

In [16]:
most_similar(['king', 'woman'], ['man'], topn=10)

[('queen', 4.7798604491546293),
 ('throne', 4.1617390290010059),
 ('princess', 4.0176014384787102),
 ('monarch', 3.5503961933354162),
 ('prince', 3.5282495697922043),
 ('emperor', 3.4229433212252323),
 ('bhumibol', 3.39467904195855),
 ('daughter', 3.3614950090666631),
 ('royal', 3.3260806738609885),
 ('kingdom', 3.2824227939558357)]

In [17]:
most_similar(['brought', 'seek'], ['bring'], topn=10)

[('seeking', 3.8770598487739543),
 ('sought', 3.5124546878821725),
 ('asylum', 3.3548816905035657),
 ('court', 3.2465193778798045),
 ('appeals', 3.148861767209775),
 ('extradition', 3.1360908305538659),
 ('legal', 3.0855724113011336),
 ('filed', 3.0321240339064923),
 ('immediate', 3.0124572034007948),
 ('appeal', 3.0000189683126619)]

In [18]:
most_similar(['frog'], [], topn=10)

[('toad', 4.3813890523860888),
 ('frogs', 4.3293474979923019),
 ('genus', 4.0009066283615216),
 ('species', 3.9308852243381445),
 ('moth', 3.5720826464332598),
 ('ranidae', 3.4867772155465415),
 ('salticidae', 3.4715610081939543),
 ('hylidae', 3.3220798456072944),
 ('snake', 3.2789408441515406),
 ('toads', 3.229218005366838)]

In [19]:
most_similar(['paris', 'germany'], ['france'], topn=10)

[('berlin', 5.3698714145161528),
 ('frankfurt', 5.1199776702533448),
 ('munich', 4.7197065599932646),
 ('german', 4.4663966237105521),
 ('cologne', 4.1446456615296841),
 ('vienna', 4.1214259539953755),
 ('bonn', 4.0708874256000485),
 ('hamburg', 4.0663950993484459),
 ('stuttgart', 3.9669588674404164),
 ('leipzig', 3.8246614146305467)]