# Word2Vec - Similarity between 2 words
Download the pretrained Google News vectors, load them with `gensim`, and query the resulting word-to-embedding mapping (each word maps to a 300-dimensional vector).

In [3]:
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2020-12-26 02:39:26--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.207.5
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.207.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘/root/input/GoogleNews-vectors-negative300.bin.gz’


2020-12-26 02:39:45 (80.0 MB/s) - ‘/root/input/GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [5]:
!pip install gensim
!pip install scikit-learn



In [6]:
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity
print('Imported Successfully!')

Imported Successfully!


In [7]:
# Load the pretrained model. It was trained on the Google News dataset and produces 300-dimensional vectors
word_vectors = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
# Some queries
v_banana = word_vectors['banana']
v_mango = word_vectors['mango']
cosine_similarity([v_banana], [v_mango])

array([[0.63652116]], dtype=float32)
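The score above is the cosine of the angle between the two embeddings. As a minimal sketch of what that measures, the helper below (`cosine_sim` is a hypothetical name, not part of gensim or scikit-learn) computes it directly with NumPy on small made-up vectors:

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors for illustration only (not real embeddings)
same_direction = cosine_sim([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])                # perpendicular vectors
print(same_direction, orthogonal)
```

Parallel vectors score close to 1 and perpendicular vectors score close to 0, regardless of their lengths.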

## Project 1: Odd Word Picker
- Pick the word that does not belong in a set of words

In [14]:
import numpy as np

# Takes a list of strings plus the word embeddings and returns the odd one out
def odd_one_out(words, word_vectors):
  all_word_vectors = [word_vectors[w] for w in words] # embeddings of the given words
  avg_vector = np.mean(all_word_vectors, axis=0) # per-dimension average of those embeddings

  # Pick the word whose embedding is least similar to the average
  odd_word = None
  min_sim = float('inf')

  for w in words:
    # cosine_similarity returns a 1x1 array, so extract the scalar
    sim = cosine_similarity([word_vectors[w]], [avg_vector])[0][0]
    if sim < min_sim:
      min_sim = sim
      odd_word = w

  return odd_word
 
print('function created')


function created


In [16]:
odd_word = odd_one_out(['apple', 'mango', 'party', 'coconut', 'juice', 'orange'], word_vectors)
print('odd one out is ' + odd_word)

odd one out is party
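The same idea can be tried without downloading the full model. The sketch below is an assumption-laden illustration: `toy_vectors` holds made-up 3-D vectors (not real embeddings) and `odd_one_out_vectorized` is a hypothetical helper that scores all words against the average in one matrix product instead of a Python loop.

```python
import numpy as np

# Made-up 3-D "embeddings" for illustration only
toy_vectors = {
    'apple':  np.array([0.90, 0.10, 0.00]),
    'mango':  np.array([0.80, 0.20, 0.10]),
    'orange': np.array([0.85, 0.15, 0.05]),
    'party':  np.array([0.00, 0.20, 0.90]),
}

def odd_one_out_vectorized(words, vectors):
    """Return the word whose vector is least similar to the average vector."""
    matrix = np.array([vectors[w] for w in words], dtype=float)
    # Normalize each row so a dot product equals cosine similarity
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    mean = matrix.mean(axis=0)
    mean_unit = mean / np.linalg.norm(mean)
    sims = unit @ mean_unit  # one cosine similarity per word
    return words[int(np.argmin(sims))]

toy_result = odd_one_out_vectorized(['apple', 'mango', 'orange', 'party'], toy_vectors)
print('odd one out is ' + toy_result)
```

With these toy vectors the three fruit words point in nearly the same direction, so the word pointing elsewhere has the lowest similarity to the average.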


## Project 2: Word Analogy System
- Complete the sentence **A is to B as C is to __**
- We want the word D whose vector relates to C's vector the way B's vector relates to A's (B - A ≈ D - C)

Therefore D ≈ B - A + C
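To make the vector arithmetic concrete, here is a small worked example. The vectors are made up for illustration, with hand-picked dimensions (male, female, royal); real Word2Vec dimensions are not interpretable like this, and the relation only holds approximately.

```python
import numpy as np

# Made-up vectors with hypothetical dimensions: (male, female, royal)
man   = np.array([1.0, 0.0, 0.0])
woman = np.array([0.0, 1.0, 0.0])
king  = np.array([1.0, 0.0, 1.0])
queen = np.array([0.0, 1.0, 1.0])

# A = man, B = woman, C = king, so D = B - A + C
target = woman - man + king
print(target)

# The two difference vectors match: B - A equals D - C
difference_matches = np.allclose(woman - man, queen - king)
print(difference_matches)
```

In this toy setup the computed target lands exactly on the `queen` vector. With real embeddings it lands near it, which is why the search looks for the most similar word rather than an exact match.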

In [22]:
# Takes 3 input words (A, B, C) and the word embeddings
def word_analogies(A,B,C,word_vectors):
  A,B,C = A.lower(), B.lower(), C.lower()

  # Loop over every word in the vocab and keep the word D that maximizes
  # the cosine similarity between B-A and D-C (most similar difference vector)
  # O(vocab size)
  max_sim = -100
  D = None
  # gensim >= 4.0 exposes the vocabulary as key_to_index; older versions use vocab
  words = word_vectors.key_to_index.keys() if hasattr(word_vectors, 'key_to_index') else word_vectors.vocab.keys()
  WA, WB, WC = word_vectors[A], word_vectors[B], word_vectors[C]

  for w in words:
    if w in [A, B, C]:
      continue
    w_vector = word_vectors[w]
    # cosine_similarity returns a 1x1 array, so extract the scalar
    sim = cosine_similarity( [WB-WA], [w_vector - WC] )[0][0]
    if sim > max_sim:
      max_sim = sim
      D = w
  return D

print('function created')

function created


In [23]:
D = word_analogies('Man', 'Woman', 'King', word_vectors)
print('A is to B as C is to ' + D)

A is to B as C is to queen


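The loop above makes one `cosine_similarity` call per vocabulary word, which is slow in Python. Gensim's **most_similar(positive, negative)** function still scores every word in the vocab, but it does so in a single vectorized matrix operation, so it runs much faster.
Since D = B-A+C, B and C are supplied as positive and A is supplied as negative.
We ask for the single word in the vocab that is most similar to the resulting vector (topn = 1).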

In [25]:
result = word_vectors.most_similar(positive = ['woman', 'king'], negative = ['man'], topn = 1)
print(result)

[('queen', 0.7118192911148071)]
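As a rough sketch of how a vectorized analogy search can work, the example below scores a whole toy vocabulary with one matrix product. It is an approximation in the spirit of `most_similar`, not Gensim's actual implementation; the vocabulary, the 4-D vectors, and the helper name `analogy_vectorized` are all made up for illustration.

```python
import numpy as np

# Made-up vocabulary and 4-D vectors for illustration only
vocab = ['man', 'woman', 'king', 'queen', 'apple']
vectors = np.array([
    [1.0, 0.0, 0.0, 0.0],  # man
    [0.0, 1.0, 0.0, 0.0],  # woman
    [1.0, 0.0, 1.0, 0.0],  # king
    [0.0, 1.0, 1.0, 0.0],  # queen
    [0.0, 0.0, 0.0, 1.0],  # apple
])

def analogy_vectorized(a, b, c, vocab, vectors):
    """Return the word most similar to b - a + c, excluding the input words."""
    index = {word: i for i, word in enumerate(vocab)}
    # Unit-normalize every row so dot products rank words by cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = unit[index[b]] - unit[index[a]] + unit[index[c]]
    scores = unit @ query  # one score per vocabulary word, no Python loop
    for word in (a, b, c):
        scores[index[word]] = -np.inf  # never return one of the inputs
    return vocab[int(np.argmax(scores))]

toy_answer = analogy_vectorized('man', 'woman', 'king', vocab, vectors)
print(toy_answer)
```

The work is still proportional to the vocabulary size, but it happens inside one NumPy call rather than in one Python iteration per word.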
