## A simple example of keyword matching using GloVe embedding vectors trained on Wikipedia

### Step 1: Install the following packages for this sample (001-requirements.txt is included)
 - numpy
 - scipy
 - matplotlib
 - scikit-learn

In [1]:
!pip install -U -r 001-requirements.txt



### Step 2: Download and unzip the GloVe embedding vectors for words in Wikipedia

Head over to https://nlp.stanford.edu/projects/glove/.

Then, under “Download pre-trained word vectors,” choose any of the four options, which differ in size and training dataset.
I have chosen the Wikipedia 2014 + Gigaword 5 vectors. You can download those exact vectors at http://nlp.stanford.edu/data/glove.6B.zip (WARNING: THIS IS AN 822 MB DOWNLOAD).

Note: The Stanford site was having an outage as of writing this article (requests were redirected to https://cs.stanford.edu/srcf_404), so I retrieved the file from Kaggle and compressed it instead.

In [2]:
%%bash

# A bash assignment must not have spaces around "="
L_GLOVE_ZIPFILE='data/glove/glove.6B.zip'

# Download the archive only if it is not already present
if [[ ! -f "${L_GLOVE_ZIPFILE}" ]]; then
    mkdir -p "$(dirname "${L_GLOVE_ZIPFILE}")"
    wget https://nlp.stanford.edu/data/glove.6B.zip -O "${L_GLOVE_ZIPFILE}"
fi

--2023-06-28 12:18:10--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cs.stanford.edu/srcf_404 [following]
--2023-06-28 12:18:11--  https://cs.stanford.edu/srcf_404
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘data/glove/glove.6B.zip’

     0K .......... .......... .......... .........              460K=0.09s

2023-06-28 12:18:11 (460 KB/s) - ‘data/glove/glove.6B.zip’ saved [40360]
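The cell above only downloads the archive, while Step 4 reads `data/glove/glove.6B.50d.txt.gz`. As a minimal sketch of the "unzip" part (not necessarily how the file used in this notebook was prepared), the following copies the 50-dimensional file out of the zip into a gzip-compressed file. The helper name `extract_and_gzip` is hypothetical, and the paths are assumptions chosen to match the ones used elsewhere in this notebook.

```python
import gzip
import shutil
import zipfile
from pathlib import Path


def extract_and_gzip(zip_path, member, out_path):
    """Copy one member of a zip archive into a gzip-compressed file."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # Stream the member so the full text file is never held in memory.
        with archive.open(member) as src, gzip.open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return out_path


# Assumed paths: adjust them if your archive lives elsewhere.
zip_file = Path("data/glove/glove.6B.zip")
if zip_file.is_file() and zipfile.is_zipfile(zip_file):
    extract_and_gzip(zip_file, "glove.6B.50d.txt", "data/glove/glove.6B.50d.txt.gz")
else:
    # A failed download can leave an HTML error page saved under the zip name.
    print(f"{zip_file} is missing or is not a valid zip archive")
```

The `zipfile.is_zipfile` check is there because a redirected download, like the one recorded above, saves a small HTML page rather than a real archive.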



### Step 3: Import required libraries

In [3]:
import numpy as np
from scipy import spatial
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import gzip

### Step 4: Load the embedding vectors

The vector file that we downloaded is a text file in which each line contains a word followed by N numbers.
The N numbers are the components of the word's vector, that is, its position in the embedding space. N depends on which file from the zip you use.
I am using glove.6B.50d, so each line has 50 numbers. The sample line below has 100 numbers, as in the 100-dimensional file glove.6B.100d.

```
python 0.24934 0.68318 -0.044711 -1.3842 -0.0073079 0.651 -0.33958 -0.19785 -0.33925 0.26691 -0.033062 0.15915 0.89547 0.53999 -0.55817 0.46245 0.36722 0.1889 0.83189 0.81421 -0.11835 -0.53463 0.24158 -0.038864 1.1907 0.79353 -0.12308 0.6642 -0.77619 -0.45713 -1.054 -0.20557 -0.13296 0.12239 0.88458 1.024 0.32288 0.82105 -0.069367 0.024211 -0.51418 0.8727 0.25759 0.91526 -0.64221 0.041159 -0.60208 0.54631 0.66076 0.19796 -1.1393 0.79514 0.45966 -0.18463 -0.64131 -0.24929 -0.40194 -0.50786 0.80579 0.53365 0.52732 0.39247 -0.29884 0.009585 0.99953 -0.061279 0.71936 0.32901 -0.052772 0.67135 -0.80251 -0.25789 0.49615 0.48081 -0.68403 -0.012239 0.048201 0.29461 0.20614 0.33556 -0.64167 -0.64708 0.13377 -0.12574 -0.46382 1.3878 0.95636 -0.067869 -0.0017411 0.52965 0.45668 0.61041 -0.11514 0.42627 0.17342 -0.7995 -0.24502 -0.60886 -0.38469 -0.4797
```
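To make the line layout concrete, here is a small sketch that splits a single line into its word and vector. The function name `parse_glove_line` is hypothetical, and the input line is a made-up 4-dimensional example rather than real GloVe data.

```python
import numpy as np


def parse_glove_line(line):
    """Split one line of a GloVe text file into (word, vector)."""
    values = line.split()
    word = values[0]                                  # first token is the word
    vector = np.asarray(values[1:], dtype="float32")  # the rest are the N numbers
    return word, vector


# A made-up 4-dimensional line, only to show the layout.
word, vector = parse_glove_line("example 0.1 -0.2 0.3 0.4")
print(word, vector.shape)
```

The length of `vector` is the N described above, so checking `vector.shape` is a quick way to confirm which file you loaded.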


In [4]:
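# Map each word to its embedding vector
embeddings_dict = dict()

# The file is gzip-compressed text: one word and its vector components per line
with gzip.open("data/glove/glove.6B.50d.txt.gz", 'rt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]                            # first token is the word
        vector = np.asarray(values[1:], "float32")  # remaining tokens are the vector
        embeddings_dict[word] = vector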

#print(list(embeddings_dict.items())[:10])

### Step 5: Define a function that will find the closest words to a provided keyword

The built-in `sorted` function takes an iterable and sorts it by the provided key. We can get the iterable from the dictionary of embeddings by using the *keys()* method of dict.

The Euclidean distance function in scipy's spatial package is used to calculate the distance between each word embedding and the given embedding.


In [5]:
def find_closest_embeddings(embedding):
    # Sort every known word by its Euclidean distance to the given embedding (closest first)
    return sorted(embeddings_dict.keys(), key=lambda word: spatial.distance.euclidean(embeddings_dict[word], embedding))
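
Calling the distance function once per word in Python can be slow on a large vocabulary. As an optional variation, and not the approach used in this notebook, the same ranking can be sketched with a single NumPy computation over a matrix of all vectors. The names `find_closest_words` and `top_n` are hypothetical, and the dictionary below is toy data.

```python
import numpy as np


def find_closest_words(embeddings, query, top_n=10):
    """Return the top_n words whose vectors are nearest to query (Euclidean)."""
    words = list(embeddings.keys())
    # Stack all vectors into one (vocabulary size, N) matrix.
    matrix = np.stack([embeddings[w] for w in words])
    # Euclidean distance from every row to the query in one vectorized step.
    distances = np.linalg.norm(matrix - query, axis=1)
    order = np.argsort(distances)[:top_n]
    return [words[i] for i in order]


# Toy 2-dimensional embeddings, only for illustration.
toy = {
    "a": np.array([0.0, 0.0], dtype="float32"),
    "b": np.array([1.0, 0.0], dtype="float32"),
    "c": np.array([5.0, 5.0], dtype="float32"),
}
print(find_closest_words(toy, toy["a"], top_n=2))
```

This sketch rebuilds the matrix on every call to stay short; for repeated queries it would make sense to build `words` and `matrix` once and reuse them.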


### Step 6: Find the words most closely related to a keyword

Since we now have a function that measures the distance between a given embedding and every known word, we can call it to find the words that are closest to the keyword.

```
find_closest_embeddings(embeddings_dict["python"])
```
Since this returns every word in the embeddings, sorted by distance, I am limiting this to the top 10 matches.

In [6]:
find_closest_embeddings(embeddings_dict['math'])[:10]

['math',
 'maths',
 'instruction',
 'curriculum',
 'graders',
 'graduates',
 'lesson',
 'learning',
 'curricula',
 'undergraduate']

As you will notice, the first match is always the keyword itself. It can be skipped by slicing the results as follows, which leaves nine matches:

In [7]:
find_closest_embeddings(embeddings_dict['math'])[1:10]

['maths',
 'instruction',
 'curriculum',
 'graders',
 'graduates',
 'lesson',
 'learning',
 'curricula',
 'undergraduate']
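
Slicing from index 1 relies on the keyword sorting first. A slightly more explicit sketch filters the keyword out by name before ranking. The function `closest_words_excluding` and its `top_n` parameter are hypothetical, and the vectors are toy values rather than real GloVe data.

```python
import numpy as np
from scipy import spatial


def closest_words_excluding(embeddings, keyword, top_n=9):
    """Rank all words except the keyword by Euclidean distance to it."""
    target = embeddings[keyword]
    ranked = sorted(
        (w for w in embeddings if w != keyword),  # drop the keyword itself
        key=lambda w: spatial.distance.euclidean(embeddings[w], target),
    )
    return ranked[:top_n]


# Toy 2-dimensional embeddings, only for illustration.
toy_embeddings = {
    "math": np.array([0.0, 0.0]),
    "maths": np.array([0.1, 0.0]),
    "lesson": np.array([1.0, 1.0]),
    "cargo": np.array([5.0, 5.0]),
}
print(closest_words_excluding(toy_embeddings, "math", top_n=2))
```

With the real `embeddings_dict`, the call would look like `closest_words_excluding(embeddings_dict, 'math')`.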

### Math with keywords
Now that we have created a function that finds the closest keywords, we can use the same function to do arithmetic on words, just like we would with numbers.

In [8]:
equation = embeddings_dict['bulky'] + embeddings_dict['luggage'] + embeddings_dict['airplane']

closest_matches = find_closest_embeddings(equation)[:5]

print(closest_matches)

['luggage', 'baggage', 'cargo', 'bags', 'airplane']


This returns luggage and baggage as its top results, which seems logical.
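
To show what the `+` does here: adding NumPy arrays is elementwise, so the result is one vector of the same length that can be searched like any single word's embedding. The sketch below uses made-up 3-dimensional vectors and a hypothetical helper named `nearest`; none of these numbers come from GloVe.

```python
import numpy as np

# Made-up 3-dimensional vectors, only for illustration.
toy_vectors = {
    "bulky": np.array([1.0, 0.0, 0.0]),
    "luggage": np.array([0.0, 1.0, 0.0]),
    "airplane": np.array([0.0, 0.0, 1.0]),
    "baggage": np.array([0.2, 0.9, 0.1]),
}

# Elementwise sum: each component is added separately.
combined = toy_vectors["bulky"] + toy_vectors["luggage"] + toy_vectors["airplane"]
print(combined, combined.shape)


def nearest(embeddings, query):
    """Return the word whose vector has the smallest Euclidean distance to query."""
    return min(embeddings, key=lambda w: np.linalg.norm(embeddings[w] - query))


print(nearest(toy_vectors, combined))
```

Subtraction works the same way, so expressions that mix `+` and `-` also produce a vector that can be passed to the search function.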

### Visualization