## Word to Vector using pre-trained GloVe word vectors

GloVe stands for `Global Vectors for Word Representation`.

In this notebook we will explore how to:

1. Load GloVe vectors
2. Convert a word to a vector
3. Convert a sentence to a vector
4. Compute the cosine similarity between two vectors/sentences

### Setup

Below are the installation instructions for the libraries required for this notebook:

> * !pip install -U scikit-learn
> * !pip install gensim
> * !pip install numpy

You also need to download the pre-trained [GloVe vectors](http://nlp.stanford.edu/data/glove.6B.zip).

Once the download is complete, unzip the archive. This notebook uses the file named `glove.6B.300d.txt`.

In [49]:
# Required Imports
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models.keyedvectors import KeyedVectors

### Load GloVe vectors

In [48]:
# File path for `glove.6B.300d.txt`
glove_file_path = "D:/Code/Resources/Models/glove.6B/glove.6B.300d.txt"

In [50]:
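# Convert the GloVe txt file into the word2vec text format that gensim can read.
# This may take a couple of minutes, depending on your system configuration.
# Note: in gensim 4.0 and later this conversion step is optional, because
# KeyedVectors.load_word2vec_format accepts no_header=True for GloVe files.

glove2word2vec(glove_input_file=glove_file_path, word2vec_output_file="gensim_glove_vectors.txt")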

(400001, 300)
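
The conversion above mostly amounts to prepending a header line. A GloVe text file has one word per line followed by its vector components, while the word2vec text format expects the same lines preceded by a header giving the number of vectors and their dimensionality, which appears to be what the tuple printed above reports. A minimal sketch of that idea, using a tiny made-up file and a hypothetical helper `add_word2vec_header` (not part of gensim):

```python
import os
import tempfile


def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file to output_path, prepending a 'count dim' header."""
    # First pass: count the vectors and read the dimensionality from line one.
    num_vectors = 0
    dim = 0
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if num_vectors == 0:
                # Every token after the word itself is one vector component.
                dim = len(line.rstrip("\n").split(" ")) - 1
            num_vectors += 1
    # Second pass: write the header, then the original lines unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{num_vectors} {dim}\n")
        for line in src:
            dst.write(line)
    return num_vectors, dim


# Tiny made-up file standing in for glove.6B.300d.txt (values are not real).
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy_glove.txt")
toy_w2v = os.path.join(tmp_dir, "toy_w2v.txt")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("king 0.1 0.2 0.3\n")
    f.write("queen 0.2 0.1 0.4\n")

shape = add_word2vec_header(toy_glove, toy_w2v)
print(shape)  # (2, 3)
```

For real work, `glove2word2vec` remains the safer choice; the sketch is only meant to show why the output file differs from the input by a single line.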

In [51]:
# Load the word2vec-format file created in the step above.
# This may take a couple of minutes, depending on your system configuration.

glove_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt")

### Word to Vector

In [70]:
# Let's look at the vector for a word; all words should be in lowercase
glove_model.get_vector("king")

array([ 0.0033901, -0.34614  ,  0.28144  ,  0.48382  ,  0.59469  ,
        0.012965 ,  0.53982  ,  0.48233  ,  0.21463  , -1.0249   ,
       -0.34788  , -0.79001  , -0.15084  ,  0.61374  ,  0.042811 ,
        0.19323  ,  0.25462  ,  0.32528  ,  0.05698  ,  0.063253 ,
       -0.49439  ,  0.47337  , -0.16761  ,  0.045594 ,  0.30451  ,
       -0.35416  , -0.34583  , -0.20118  ,  0.25511  ,  0.091111 ,
        0.014651 , -0.017541 , -0.23854  ,  0.48215  , -0.9145   ,
       -0.36235  ,  0.34736  ,  0.028639 , -0.027065 , -0.036481 ,
       -0.067391 , -0.23452  , -0.13772  ,  0.33951  ,  0.13415  ,
       -0.1342   ,  0.47856  , -0.1842   ,  0.10705  , -0.45834  ,
       -0.36085  , -0.22595  ,  0.32881  , -0.13643  ,  0.23128  ,
        0.34269  ,  0.42344  ,  0.47057  ,  0.479    ,  0.074639 ,
        0.3344   ,  0.10714  , -0.13289  ,  0.58734  ,  0.38616  ,
       -0.52238  , -0.22028  , -0.072322 ,  0.32269  ,  0.44226  ,
       -0.037382 ,  0.18324  ,  0.058082 ,  0.26938  ,  0.3620

In [72]:
# Get the most similar words from the vector space
glove_model.similar_by_word("office")

[('offices', 0.7270073294639587),
 ('headquarters', 0.5834001302719116),
 ('administration', 0.5706813335418701),
 ('department', 0.5504103899002075),
 ('government', 0.5341647267341614),
 ('building', 0.5243759155273438),
 ('house', 0.5168567895889282),
 ('official', 0.512373685836792),
 ('agency', 0.499081552028656),
 ('officials', 0.4982209801673889)]
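
The scores in the list above are cosine similarities between the query word's vector and every other vector in the vocabulary, sorted in descending order. A rough sketch of that ranking, assuming a small dictionary of made-up vectors (`toy_vectors`) and a hypothetical helper `most_similar` that stands in for the gensim call:

```python
import numpy as np

# Made-up 3-dimensional vectors, for illustration only (not real GloVe values).
toy_vectors = {
    "office": np.array([1.0, 0.0, 0.0]),
    "offices": np.array([0.9, 0.1, 0.0]),
    "building": np.array([0.5, 0.5, 0.0]),
    "banana": np.array([0.0, 0.0, 1.0]),
}


def most_similar(word, vectors, topn=2):
    """Return the topn (word, score) pairs ranked by cosine similarity to `word`."""
    query = vectors[word] / np.linalg.norm(vectors[word])
    scores = []
    for other, vec in vectors.items():
        if other == word:
            continue  # skip the query word itself
        # Dot product of two unit vectors equals their cosine similarity.
        scores.append((other, float(query @ (vec / np.linalg.norm(vec)))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


neighbours = most_similar("office", toy_vectors)
print(neighbours)  # "offices" ranks first, then "building"
```

A real implementation would normalise the whole embedding matrix once and use a single matrix product instead of a Python loop; the loop here is kept only for readability.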

### Sentence to Vector

In [96]:
sentence_1 = "I want to buy a Shirt"
sentence_2 = 'Lets purchase Sandals'
sentence_3 = 'I want to go US'
sentence_4 = 'Lets travel to Germany'

In [99]:
# Sentence vector: the mean of the vectors of the words in the sentence

np.mean([glove_model.get_vector(w) for w in sentence_1.lower().split()],axis=0)

array([-1.40374497e-01,  1.41578345e-02, -2.90735334e-01, -1.79177001e-01,
       -8.35621655e-02, -1.26622736e-01, -6.88918447e-03, -5.38648367e-02,
        1.32374331e-01, -1.61352336e+00,  9.17003378e-02, -2.78061479e-01,
        6.13557212e-02,  1.71973631e-01,  3.01549975e-02,  9.00465176e-02,
       -1.41399905e-01, -9.85701755e-03,  2.60451715e-02, -2.32639983e-01,
       -6.51486740e-02,  1.73771665e-01,  1.92217484e-01,  3.70683372e-02,
       -4.03613329e-01, -3.36246639e-01,  4.15169708e-02,  5.54849952e-02,
        9.46778283e-02, -1.35774836e-01, -8.92418325e-02,  1.21702336e-01,
       -1.37903333e-01, -5.61618321e-02, -1.19466662e+00,  3.10603350e-01,
       -1.41031504e-01,  2.93279290e-02, -9.81973037e-02,  1.82833429e-02,
       -1.40799955e-03, -3.21630508e-01, -1.93498302e-02,  3.60874981e-02,
        1.85119510e-01, -7.44343325e-02,  3.18585008e-01, -1.47339165e-01,
       -1.80896483e-02, -5.20383306e-02,  3.77085060e-02, -2.76311010e-01,
        5.80017269e-03, -
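
Averaging as above fails with a `KeyError` if any word in the sentence is missing from the vocabulary. One possible way to handle that is to skip unknown words before averaging. The sketch below assumes a plain dictionary of made-up vectors (`toy_vectors`) and a hypothetical helper `sentence_to_vector`; falling back to a zero vector when no word is known is just one choice among several.

```python
import numpy as np

# Made-up 2-dimensional vectors, for illustration only (not real GloVe values).
toy_vectors = {
    "i": np.array([1.0, 0.0]),
    "want": np.array([0.0, 1.0]),
    "shirt": np.array([1.0, 1.0]),
}


def sentence_to_vector(sentence, vectors, dim):
    """Average the vectors of the known words in a sentence."""
    tokens = sentence.lower().split()
    # Keep only the tokens that have a vector; unknown words are ignored.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known words at all: return a zero vector (an assumed fallback).
        return np.zeros(dim)
    return np.mean(known, axis=0)


vec = sentence_to_vector("I want xyzzy Shirt", toy_vectors, dim=2)
print(vec)  # mean of the vectors for "i", "want" and "shirt"
```

With the gensim model, the same membership check (`w in glove_model`) should work in place of the dictionary lookup, though that is worth verifying against the installed gensim version.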

In [100]:
# Cosine similarity between different pairs of sentences
from itertools import combinations

sentences = [sentence_1, sentence_2, sentence_3, sentence_4]

# One vector per sentence: the mean of its word vectors
sent_vectors = [
    np.mean([glove_model.get_vector(w) for w in s.lower().split()], axis=0)
    for s in sentences
]

print("Similarities :")
# Compare every unordered pair of sentences
for (s_a, v_a), (s_b, v_b) in combinations(zip(sentences, sent_vectors), 2):
    print("\n1: ", s_a, "\n2: ", s_b, "\nCosine: ", cosine_similarity([v_a], [v_b]))

Similarities :

1:  I want to buy a Shirt 
2:  Lets purchase Sandals 
Cosine:  [[0.5360273]]

1:  I want to buy a Shirt 
2:  I want to go US 
Cosine:  [[0.88368845]]

1:  I want to buy a Shirt 
2:  Lets travel to Germany 
Cosine:  [[0.6558078]]

1:  Lets purchase Sandals 
2:  I want to go US 
Cosine:  [[0.3950703]]

1:  Lets purchase Sandals 
2:  Lets travel to Germany 
Cosine:  [[0.4297342]]

1:  I want to go US 
2:  Lets travel to Germany 
Cosine:  [[0.7169389]]
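
Each score above is the dot product of two sentence vectors divided by the product of their lengths, so it depends only on the angle between them. A minimal NumPy sketch of that formula, using made-up 2-dimensional vectors and a hypothetical helper `cosine`:

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Made-up vectors, for illustration only.
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([-1.0, 0.0])

print(round(cosine(a, a), 4))  # 1.0     (same direction)
print(round(cosine(a, b), 4))  # 0.7071  (45 degrees apart)
print(round(cosine(a, c), 4))  # -1.0    (opposite direction)
```

`cosine_similarity` from scikit-learn works on 2-D arrays and returns a matrix of pairwise scores, which is why each result above is printed as `[[...]]`; indexing with `[0][0]` extracts the plain number.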


##### Reference: https://nlp.stanford.edu/projects/glove/