# Exploring Word Vectors - GloVe

By [Akshaj Verma](https://akshajverma.com)



## Introduction

Import the libraries.

In [1]:
from pprint import pprint

import torch
import torchtext.vocab


In the call below, the first argument, `name = '6B'`, selects vectors that were trained on a corpus of 6 billion tokens. The second argument, `dim = 100`, is the dimension, which means that each word is represented by a vector of length 100.

42B and 840B word vectors are also available, but only at a dimension of 300.

In [2]:
glove = torchtext.vocab.GloVe(name = '6B', dim = 100)
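
To make the roles of `name` and `dim` concrete, here is a minimal sketch of how the two arguments pick out one pretrained file. It is not part of torchtext: `GLOVE_DIMS` and `glove_filename` are hypothetical names, and the table lists the combinations published for GloVe to the best of our knowledge, so treat it as an assumption and check the GloVe download page before relying on it.

```python
# Hypothetical lookup table of GloVe corpus names and the dimensions
# published for each (an assumption, not read from torchtext itself).
GLOVE_DIMS = {
    "6B": (50, 100, 200, 300),
    "twitter.27B": (25, 50, 100, 200),
    "42B": (300,),
    "840B": (300,),
}


def glove_filename(name, dim):
    """Return the expected text-file name for a (name, dim) pair."""
    if name not in GLOVE_DIMS:
        raise ValueError(f"Unknown GloVe corpus: {name!r}")
    if dim not in GLOVE_DIMS[name]:
        raise ValueError(f"{name} is not available at dimension {dim}")
    return f"glove.{name}.{dim}d.txt"


print(glove_filename("6B", 100))
```

Asking for an unsupported pair such as `("42B", 100)` raises a `ValueError` in this sketch, which mirrors the restriction described above.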

Let's see the total number of words in `glove`.   
All the words are in lowercase.

In [4]:
print(f"Total number of words in glove = {len(glove.itos)}")

Total number of words in glove = 400000


There are two lookups that we use to convert a string to an integer and an integer to a string. The integer here is the index of the word in the vocabulary, which is also the row of its vector in `glove.vectors`. Each vector is of length 100 here.  


`glove.itos[n]` is used to convert an index to a word.  
`glove.stoi[word]` is used to convert a word to an index.  

In [12]:
print(f"The word at index 10 in the glove vocabulary is: '{glove.itos[10]}'")
print(f"The index of word 'for' is: {glove.stoi['for']}")

The word at index 10 in the glove vocabulary is: 'for'
The index of word 'for' is: 10
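
The relationship between the two lookups can be seen on a toy vocabulary. This is a minimal sketch with a made-up word list (`toy_itos` and `toy_stoi` are illustrative names, not torchtext attributes): `itos` behaves like a list of words, and `stoi` like the dictionary we get by inverting that list.

```python
# Toy vocabulary; the words and their order are made up for illustration.
toy_itos = ["the", "of", "for", "python"]

# Invert the list: each word maps to its position in the list.
toy_stoi = {word: index for index, word in enumerate(toy_itos)}

index = toy_stoi["for"]
print(index)            # position of 'for' in the toy list
print(toy_itos[index])  # round trip back to the word
```

Going from word to index and back always returns the original word, which is the same round trip shown for `'for'` above.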


Each vector should have shape `[100]`.

We can obtain the vector representation of a word by first converting the word into its integer index and then using that index to look up the vector.
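
The two-step lookup can be sketched on toy data. Everything here is an assumption made for illustration: the words, the 4-dimensional values, and the `lookup` helper are made up, and plain lists stand in for the torch tensors. The sketch also guards against words that are missing from the vocabulary, since a plain dictionary lookup raises `KeyError` for an unknown key.

```python
# Toy data: three made-up words with 4-dimensional vectors. Real GloVe
# vectors are torch tensors; plain lists stand in for them here.
toy_stoi = {"python": 0, "snake": 1, "code": 2}
toy_vectors = [
    [0.2, 0.7, -0.1, 1.3],
    [0.3, 0.6, -0.2, 1.1],
    [-0.9, 0.1, 0.8, 0.0],
]


def lookup(word):
    """Return the vector for `word`, or None if it is out of vocabulary."""
    index = toy_stoi.get(word)   # step 1: word -> index
    if index is None:
        return None
    return toy_vectors[index]    # step 2: index -> vector


print(lookup("python"))
print(lookup("pyhton"))  # misspelled, so not in the toy vocabulary
```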

In [29]:
print(f"Shape of the vector is = {glove.vectors[glove.stoi['python']].shape}\n") 
glove.vectors[glove.stoi['python']]

Shape of the vector is = torch.Size([100])



tensor([ 0.2493,  0.6832, -0.0447, -1.3842, -0.0073,  0.6510, -0.3396, -0.1979,
        -0.3392,  0.2669, -0.0331,  0.1592,  0.8955,  0.5400, -0.5582,  0.4624,
         0.3672,  0.1889,  0.8319,  0.8142, -0.1183, -0.5346,  0.2416, -0.0389,
         1.1907,  0.7935, -0.1231,  0.6642, -0.7762, -0.4571, -1.0540, -0.2056,
        -0.1330,  0.1224,  0.8846,  1.0240,  0.3229,  0.8210, -0.0694,  0.0242,
        -0.5142,  0.8727,  0.2576,  0.9153, -0.6422,  0.0412, -0.6021,  0.5463,
         0.6608,  0.1980, -1.1393,  0.7951,  0.4597, -0.1846, -0.6413, -0.2493,
        -0.4019, -0.5079,  0.8058,  0.5336,  0.5273,  0.3925, -0.2988,  0.0096,
         0.9995, -0.0613,  0.7194,  0.3290, -0.0528,  0.6714, -0.8025, -0.2579,
         0.4961,  0.4808, -0.6840, -0.0122,  0.0482,  0.2946,  0.2061,  0.3356,
        -0.6417, -0.6471,  0.1338, -0.1257, -0.4638,  1.3878,  0.9564, -0.0679,
        -0.0017,  0.5296,  0.4567,  0.6104, -0.1151,  0.4263,  0.1734, -0.7995,
        -0.2450, -0.6089, -0.3847, -0.47

## Vector Arithmetic

### Words Used In Similar Context

In [36]:
def word_2_vec(embeddings, word):
    # Look up the word's index, then fetch the vector stored at that index.
    i = embeddings.stoi[word]
    v = embeddings.vectors[i]
    
    return v

In [43]:
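def similar_words(embeddings, vector, n = 5):
    # Euclidean distance (torch.dist defaults to p = 2) from `vector` to every word in the vocabulary.
    distances = [(word, torch.dist(vector, word_2_vec(embeddings, word)).item()) for word in embeddings.itos]    
    # Keep the n words with the smallest distances.
    shortest_distances = sorted(distances, key = lambda w: w[1])[:n]
    
    return shortest_distances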

In [44]:
vector_plane = word_2_vec(glove, "plane")
similar_words(glove, vector_plane)

[('plane', 0.0),
 ('airplane', 3.212670087814331),
 ('jet', 3.7022786140441895),
 ('flight', 3.788144588470459),
 ('crashed', 3.8278510570526123)]
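
`torch.dist` with its default `p = 2` is the Euclidean distance, so `similar_words` is a brute-force nearest-neighbour search over the whole vocabulary. Here is a minimal sketch of the same idea on made-up 2-dimensional vectors, using only the standard library. The names `toy_embeddings`, `euclidean`, and `nearest` are illustrative, and the values are chosen only so that the ordering is easy to check by hand.

```python
import math

# Made-up 2-D vectors, chosen only so the ordering is easy to verify by hand.
toy_embeddings = {
    "plane": [1.0, 1.0],
    "airplane": [1.0, 2.0],
    "jet": [3.0, 1.0],
    "banana": [-4.0, -3.0],
}


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(embeddings, vector, n=3):
    """Return the n (word, distance) pairs closest to `vector`."""
    distances = [(word, euclidean(vector, v)) for word, v in embeddings.items()]
    return sorted(distances, key=lambda pair: pair[1])[:n]


print(nearest(toy_embeddings, toy_embeddings["plane"]))
```

As in the GloVe output above, the query word itself comes first with a distance of 0.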

### Analogies

We'll try the famous **king - man + woman = queen**.

In [62]:
def get_analogy(embeddings, w1, w2, w3, n = 5):
    
    # Convert word to vector
    v1 = word_2_vec(embeddings, w1)
    v2 = word_2_vec(embeddings, w2)
    v3 = word_2_vec(embeddings, w3)
    
    # Perform vector arithmetic
    analogy_vec = v1 - v2 + v3
    
    # Get closest word
    closest_words = similar_words(embeddings, analogy_vec, n)
    
    # Remove the input words w1, w2, and w3 (the result may then have fewer than n entries).
    possible_words = [(word, dist) for (word, dist) in closest_words if word not in [w1, w2, w3]]
    
    return possible_words    

In [64]:
a_word = get_analogy(glove, 'king', 'man', 'woman')
pprint(a_word)
print(f"\nking - man + woman = {a_word[0][0]}")

[('queen', 4.08107852935791),
 ('monarch', 4.642907619476318),
 ('throne', 4.905500888824463),
 ('elizabeth', 4.921558380126953)]

king - man + woman = queen


In [65]:
a_word = get_analogy(glove, 'actor', 'man', 'woman')
pprint(a_word)
print(f"\nactor - man + woman = {a_word[0][0]}")

[('actress', 2.8133397102355957),
 ('comedian', 5.003941535949707),
 ('actresses', 5.139926433563232),
 ('starred', 5.277286052703857)]

actor - man + woman = actress


In [73]:
a_word = get_analogy(glove, 'london', 'britain', 'france')
pprint(a_word)
print(f"\nbritain:london = france:{a_word[0][0]}")

[('paris', 2.9362051486968994),
 ('amsterdam', 5.051050186157227),
 ('lyon', 5.251315116882324)]

britain:london = france:paris
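
Note that `get_analogy` removes the input words after taking the `n` closest candidates, which is why the outputs above contain fewer than five entries. One possible variant, sketched here on made-up 2-dimensional vectors with the standard library only, drops the input words first and only then keeps the top `n`. The names `toy_embeddings` and `toy_analogy` are illustrative, and the vectors are constructed so that the arithmetic works out exactly.

```python
import math

# Made-up 2-D vectors: the first axis loosely encodes "royalty",
# the second loosely encodes "gender". Purely illustrative values.
toy_embeddings = {
    "king": [5.0, 1.0],
    "queen": [5.0, 3.0],
    "man": [1.0, 1.0],
    "woman": [1.0, 3.0],
    "apple": [-3.0, -4.0],
}


def toy_analogy(embeddings, w1, w2, w3, n=2):
    """Rank words by distance to w1 - w2 + w3, excluding the inputs."""
    target = [
        a - b + c
        for a, b, c in zip(embeddings[w1], embeddings[w2], embeddings[w3])
    ]
    candidates = [
        (word, math.dist(target, vec))  # math.dist needs Python 3.8+
        for word, vec in embeddings.items()
        if word not in (w1, w2, w3)     # filter the inputs first ...
    ]
    return sorted(candidates, key=lambda pair: pair[1])[:n]  # ... then truncate


print(toy_analogy(toy_embeddings, "king", "man", "woman"))
```

With these toy values, `king - man + woman` lands exactly on the vector for `queen`, so it is returned first with a distance of 0, and the result always has `n` entries as long as the vocabulary is large enough.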
