> Duplicate this Colab before you start working on it, using File > Save a copy in Drive.


### Word2Vec

In this notebook we're going to learn about and play around with the word2vec embeddings that are packaged with spaCy. We'll try to build intuition for how they work and what we can do with them.

Install all the required dependencies for the project.

In [None]:
%%capture
!pip install spacy==2.2.4 --quiet
!python -m spacy download en_core_web_md
!apt install libopenblas-base libomp-dev
!pip install faiss-gpu

Import all the necessary libraries.

In [None]:
from collections import defaultdict
import en_core_web_md
import numpy as np
import spacy
import time
import faiss

Now let's load the spaCy model, which comes with pre-trained embeddings. This process is expensive, so only do it once.



In [None]:
spacyModel = en_core_web_md.load()

First, let's play with some basic similarity functions.

In [None]:
banana = spacyModel("banana")
fruit = spacyModel("fruit")
table = spacyModel("table")
print(banana.similarity(fruit))
print(banana.similarity(table))

0.671483588786149
0.22562773991991913
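
To build intuition for these scores: spaCy's `similarity` is, by default, the cosine similarity between the two vectors. Below is a minimal sketch of that calculation in NumPy. The helper name `cosine_similarity` and the three-dimensional `toy_*` vectors are made up for illustration and are not real embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors (hypothetical values, not real embeddings)
toy_banana = [0.9, 0.8, 0.1]
toy_fruit = [0.8, 0.9, 0.2]
toy_table = [0.1, 0.2, 0.9]

# Vectors pointing in similar directions score close to 1
print(cosine_similarity(toy_banana, toy_fruit))
# Vectors pointing in different directions score closer to 0
print(cosine_similarity(toy_banana, toy_table))
```

Because cosine similarity depends only on direction, dividing each vector by its length first reduces the calculation to a plain dot product.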


As expected, `Banana` is a lot more similar to `Fruit` than to `Table`. Now let's iterate over the entire vocabulary and build a search index using **Faiss**. This makes finding similar words much faster than looping over the entire vocabulary each time.

Feel free to skip the details of **Faiss** for now, as we'll dive deeper into it in Week 3. At a high level, it is a very efficient library for finding similar vectors in a large collection.
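
As a rough mental model, an exact inner-product index such as `faiss.IndexFlatIP` returns the same answer as the brute-force NumPy search sketched below, only much faster. The function `brute_force_search` and the random `toy_vectors` are assumptions made for illustration; they are not part of Faiss or of this notebook's index.

```python
import numpy as np

def brute_force_search(query, matrix, top_k=3):
    """Return (scores, indices) of the top_k rows of `matrix`
    ranked by inner product with `query`, highest first."""
    scores = matrix @ query              # one inner product per row
    top = np.argsort(-scores)[:top_k]    # indices of the largest scores
    return scores[top], top

# 1000 random 8-dimensional vectors, L2-normalized so that the
# inner product equals cosine similarity
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(1000, 8)).astype(np.float32)
toy_vectors /= np.linalg.norm(toy_vectors, axis=1, keepdims=True)

# Searching with a vector that is already in the matrix should
# return that vector first, with a score of about 1.0
toy_scores, toy_indices = brute_force_search(toy_vectors[42], toy_vectors)
print(toy_indices, toy_scores)
```

This is also why the index-building cell divides every vector by its L2 norm before adding it: on unit-length vectors, inner-product scores can be read as cosine similarities.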

Note: This next cell will take a fair bit of time to run.

In [None]:
def load_index():
  """Expensive method - call only once!!
  """
  word_to_id = {}
  id_to_word = {}
  vectors = []
  vector_dimension = 300
  id = 0

  # Iterate over the entire vocabulary
  for i, tok in enumerate(spacyModel.vocab):
    vector = tok.vector
    l2_norm = np.linalg.norm(vector)

    # Ignore zero vectors and vectors containing NaN values
    if (np.isclose(l2_norm, 0.0) or 
        np.isnan(l2_norm) or 
        np.any(np.isnan(vector))):
      continue
    else:
      vectors.append(np.divide(vector, l2_norm))

    # Add to the output variables
    word_to_id[tok.text.lower()] = id
    id_to_word[id] = tok.text.lower()
    id += 1

  
  # Faiss expects a 2-D float32 matrix
  vectors = np.array(vectors, dtype=np.float32)
  index = faiss.IndexFlatIP(vector_dimension)
  index.add(vectors)
  return word_to_id, id_to_word, vectors, index

word_to_id, id_to_word, vectors, index = load_index()
vector_size = len(vectors)
print("We created a search index of %d vectors" % vector_size)

We created a search index of 684754 vectors


Now we're going to add a helper function that finds the `top_k` words in the index most similar to an input vector.

In [None]:
def search_vector(word_vector, top_k=100, print_time_taken=False):
  word_vector = np.array([word_vector])
  start_time = time.time()
  scores, indices = index.search(word_vector, top_k)
  if print_time_taken:
    print("Time taken to search amongst {} words is {:.3}s".format(
        vector_size, (time.time() - start_time))
    )
  results = []
  words = set()
  for i, query_index in enumerate(indices):
      # Matches for the i'th one 
      for inn_idx, word_index in enumerate(query_index):
          if word_index < 0:
              continue
          word = id_to_word[word_index]
          if word in words:
            continue
          words.add(word)
          results.append((word, float(scores[i][inn_idx])))
  return sorted(results, key=lambda tup: -tup[1])

Let's do an empirical test by searching for words similar to a few terms.

### Search

In [None]:
def search(word, top_k=100, print_time_taken=False):
  word = word.lower()
  if word not in word_to_id:
    print("Oops, the word {} is not in loaded dictionary".format(word))
    return
  id = word_to_id[word]
  word_vector = vectors[id]
  search_results = search_vector(word_vector, top_k, print_time_taken)
  print(f"The top similar words to {word} are - ")
  for i in range(len(search_results)):
    print(f"Word = {search_results[i][0]} and similarity = {search_results[i][1]}")
  return search_results

In [None]:
output = search("happy")

The top similar words to happy are - 
Word = happy and similarity = 1.0
Word = glad and similarity = 0.7701864838600159
Word = hope and similarity = 0.7318376898765564
Word = everyone and similarity = 0.7277779579162598
Word = thankful and similarity = 0.6912474632263184
Word = excited and similarity = 0.6901209354400635
Word = love and similarity = 0.6869319081306458
Word = wish and similarity = 0.6853764653205872
Word = appreciative and similarity = 0.6847968101501465
Word = greatful and similarity = 0.6847968101501465
Word = grateful and similarity = 0.6847968101501465
Word = gratefully and similarity = 0.6847968101501465
Word = always and similarity = 0.679514467716217
Word = lucky and similarity = 0.6788510084152222
Word = feel and similarity = 0.6768772006034851
Word = friends and similarity = 0.6759893894195557
Word = freinds and similarity = 0.6759893894195557
Word = thank and similarity = 0.6728549599647522
Word = really and similarity = 0.6657895445823669
Word = sure and simi

In [None]:
output = search("baseball", 10)

The top similar words to baseball are - 
Word = fastpitch and similarity = 1.0
Word = softball and similarity = 1.0
Word = baseballs and similarity = 1.0
Word = scorebook and similarity = 1.0
Word = sandlot and similarity = 1.0
Word = baseball and similarity = 1.0


In [None]:
output = search("cheese", 25)

The top similar words to cheese are - 
Word = cheese and similarity = 0.9999999403953552
Word = mozzarella and similarity = 0.9999999403953552
Word = fromage and similarity = 0.9999999403953552
Word = cheeses and similarity = 0.9999999403953552
Word = chevre and similarity = 0.8228567242622375
Word = velveeta and similarity = 0.8228567242622375
Word = crouton and similarity = 0.8228567242622375
Word = chedder and similarity = 0.8228567242622375
Word = emmental and similarity = 0.8228567242622375
Word = velveta and similarity = 0.8228567242622375
Word = cheddar and similarity = 0.8228567242622375
Word = chèvre and similarity = 0.8228567242622375
Word = cheeseboard and similarity = 0.8228567242622375
Word = part-skim and similarity = 0.8228567242622375


Now why don't you try out a few different words that come to mind and see where the model performs well and where it struggles!

### Analogies

In [None]:
def analogy(word1, word2, word3):
  word1 = word1.lower()
  word2 = word2.lower()
  word3 = word3.lower()
  if word1 not in word_to_id or word2 not in word_to_id or word3 not in word_to_id:
    print("word not present in dictionary, try something else")
    # Stop here; otherwise the lookups below raise a KeyError
    return
  vector1 = vectors[word_to_id[word1]]
  vector2 = vectors[word_to_id[word2]]
  vector3 = vectors[word_to_id[word3]]
  analogy_results = search_vector(np.add(np.subtract(vector1, vector2), vector3), 10)
  print(f"The top similar item for ({word1} - {word2} + {word3}) = {analogy_results[0][0]}")
  print(f"The top similar words to ({word1} - {word2} + {word3}) are - ")
  for i in range(len(analogy_results)):
    print(f"Word = {analogy_results[i][0]} and similarity = {analogy_results[i][1]}")
  return analogy_results
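
The vector arithmetic behind `analogy` can be seen on a tiny hand-made vocabulary. This is a sketch only: the four-dimensional `toy` vectors, their axis meanings, and the helper `toy_analogy` are all made up for illustration. Unlike `analogy` above, the sketch renormalizes the combined vector, so its scores are true cosine similarities.

```python
import numpy as np

# Hypothetical 4-dimensional vectors; axes loosely mean
# (royal, male, female, fruit)
toy = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}
toy_words = list(toy)
toy_matrix = np.array([toy[w] for w in toy_words], dtype=np.float32)
# L2-normalize each row, as the index-building cell does
toy_matrix /= np.linalg.norm(toy_matrix, axis=1, keepdims=True)

def toy_analogy(word1, word2, word3):
    """Rank toy words by cosine similarity to (word1 - word2 + word3)."""
    v1, v2, v3 = (toy_matrix[toy_words.index(w)] for w in (word1, word2, word3))
    query = v1 - v2 + v3
    query /= np.linalg.norm(query)  # renormalize so scores are cosines
    scores = toy_matrix @ query
    order = np.argsort(-scores)     # best match first
    return [(toy_words[i], float(scores[i])) for i in order]

print(toy_analogy("king", "man", "woman"))
```

Subtracting `man` removes most of the "male" component of `king`, and adding `woman` supplies a "female" component, so the result points closest to `queen`.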

In [None]:
output = analogy("king", "man", "woman")

The top similar item for (king - man + woman) = queen
The top similar words to (king - man + woman) are - 
Word = queen and similarity = 0.8607760071754456
Word = king and similarity = 0.8567199110984802
Word = highness and similarity = 0.679933488368988
Word = prince and similarity = 0.679933488368988
Word = commoner and similarity = 0.679933488368988


In [None]:
output = analogy("smallest", "small", "short")

The top similar item for (smallest - small + short) = shortest
The top similar words to (smallest - small + short) are - 
Word = shortest and similarity = 0.8045327663421631
Word = straightest and similarity = 0.8045327663421631
Word = second-longest and similarity = 0.7076637148857117
Word = longest-ever and similarity = 0.7076637148857117
Word = 200-mile and similarity = 0.7076637148857117
Word = longest-running and similarity = 0.7076637148857117


Now why don't you try out a few different examples and see what comes out :)