# Setup


You only need to run this code cell once.
<p>
  Google provides pre-trained word vectors that you can download and use. They were trained on the Google News dataset (about 100 billion words) and cover 3 million words and phrases, each represented by a 300-dimensional vector.
<p>
Loading all three million word vectors would take too long. To change how many word vectors are loaded and used, edit the number on line 1 of the following code cell.
  <p>
The downloaded word vector file is 1.6 GB compressed and 3.6 GB when expanded. On a fast internet connection, the download should take about 3 minutes.
  <p>
 If you run the cell again after the download has finished, it will ask "GoogleNews-vectors-negative300.bin already exists; do you wish to overwrite (y or n)?" Enter 'n' in the text box and press the Return key to keep using the file you have already downloaded.
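
One way to avoid the overwrite prompt is to check for the expanded file before downloading. The following is a minimal sketch, assuming the file name quoted in the prompt above; `needs_download` is a hypothetical helper and is not part of the setup cell.

```python
import os

# File name taken from the overwrite prompt quoted above.
VECTORS_FILE = "GoogleNews-vectors-negative300.bin"


def needs_download(path=VECTORS_FILE):
    """Return True when the expanded vector file is not on disk yet (hypothetical helper)."""
    return not os.path.exists(path)


if needs_download():
    print("Vector file not found; run the download and gunzip steps.")
else:
    print("Vector file already present; the download can be skipped.")
```

In the setup cell, the download and `gunzip` lines could be placed under a check like this one so that re-running the cell reuses the existing file.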



In [0]:
max_number_of_vectors = 40000 # 10k - 50k is recommended

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

file_id = '0B7XkCwpI5KDYNlNUTTlSS21pQmM'
downloaded = drive.CreateFile({'id':file_id})
downloaded.FetchMetadata(fetch_all=True)
downloaded.GetContentFile(downloaded.metadata['title'])
print('Vectors downloaded, now unzipping.')
!gunzip GoogleNews-vectors-negative300.bin.gz
print('Word vectors are almost ready to use.')


def print_similarities(word_to_find, num_of_similar_words):
  if word_to_find in word_vectors.vocab:
    print("Word\tSimilarity Score (1.0 is closest)")
    for w,s in word_vectors.most_similar(positive=[word_to_find], topn=num_of_similar_words):
      print("{}\t{:.2f}".format(w,s))
  else:
    print('No vector for word:',word_to_find)
  print()

  
def print_similarities_pos_neg(words_positive_list, words_negative_list, num_of_similar_words):
  # Every input word needs a vector before the query can run.
  missing_words = [w for w in words_positive_list + words_negative_list if w not in word_vectors.vocab]
  if not missing_words:
    print("Word\tSimilarity Score (1.0 is best)")
    for w,s in word_vectors.most_similar(positive=words_positive_list, negative=words_negative_list, topn=num_of_similar_words):
      print("{}\t{:.2f}".format(w,s))
  else:
    print('No vector for word(s):', ', '.join(missing_words))
  print()
  
  
# A is to B as C is to ___
def print_analogy_words(word_A, word_B, word_C, num_of_similar_words):
  print("{} is to {} as {} is to:".format(word_A, word_B, word_C))
  words_negative_list = [word_A]
  words_positive_list = [word_B, word_C]
  print_similarities_pos_neg(words_positive_list, words_negative_list, num_of_similar_words)


from gensim.models.keyedvectors import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, limit=max_number_of_vectors)
print('Word vectors are ready to use.')

# Visualizing large numbers of word vectors

Visualizing the word vectors in some manner can help us understand our data better. This plot displays words with similar usages/meanings near each other. 
<p>
 To choose how many word vectors to display, change the first line in the code cell below. If this number is too high, the graph becomes too cluttered to interpret easily; 200 to 400 is a useful range. You can also change the next two lines, graph_width and graph_height, to resize the graph. A larger graph can show tightly clustered points more clearly, although it will be bigger than your monitor's area.
<p>
If you right-click on the plot and choose 'show image in new tab', you can probably navigate around the plot more easily.

In [0]:
number_of_vectors_to_show = 300
graph_width = 40
graph_height = 40

first_word = 0
last_word = first_word + number_of_vectors_to_show

from sklearn.manifold import TSNE
from matplotlib import pylab


words = [word for word in word_vectors.index2word[first_word:last_word]]
embeddings = [word_vectors[word] for word in words]
words_embedded = TSNE(n_components=2).fit_transform(embeddings)

pylab.figure(figsize=(graph_width, graph_height))
for i, label in enumerate(words):
  x, y = words_embedded[i, :]
  pylab.scatter(x, y)
  pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                 ha='right', va='bottom', fontsize=15)
pylab.show()

# Visualizing specific sets of word vectors

The first line of the code cell below defines a list of words that you can change as desired. For each word in this list, a set of the most similar words is chosen, and all of the words are plotted. Words from the original list appear in red on the plot, and the similar words appear in black.
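
A word from the list can also turn up among another word's neighbors, in which case it would be plotted twice. The following is a minimal sketch of collecting the words without repeats, assuming made-up neighbor lists in place of real `most_similar` results; `collect_words_to_plot` and `toy_neighbors` are hypothetical names used only for illustration.

```python
def collect_words_to_plot(seed_words, neighbors_of, topn):
    """Return each seed word plus up to topn of its neighbors, in order, without repeats."""
    collected = []
    seen = set()
    for seed in seed_words:
        for word in [seed] + list(neighbors_of(seed))[:topn]:
            if word not in seen:
                seen.add(word)
                collected.append(word)
    return collected


# Made-up neighbor lists standing in for word_vectors.most_similar output.
toy_neighbors = {
    "clean": ["tidy", "dirty", "spotless"],
    "dirty": ["filthy", "clean", "grimy"],
}

words = collect_words_to_plot(["clean", "dirty"], lambda w: toy_neighbors[w], topn=3)
print(words)
```

With real vectors, the lookup function would wrap `word_vectors.most_similar` and return only the words, not the scores.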

In [0]:
words_of_interest = ["room",
                "clean",
                "pool",
                "service",
                "staff",
                "dirty"]


number_of_similar_words  = 10
words_to_plot = []

for woi in words_of_interest:
  words_to_plot.append(woi) 
  for w,s in word_vectors.most_similar(positive=[woi], topn=number_of_similar_words):
    words_to_plot.append(w) 

# Import before use so that this cell also runs on its own.
from sklearn.manifold import TSNE
from matplotlib import pylab

embeddings = [word_vectors[word] for word in words_to_plot]
words_embedded = TSNE(n_components=2).fit_transform(embeddings)


pylab.figure(figsize=(20, 20))
for i, label in enumerate(words_to_plot):
  x, y = words_embedded[i, :]
  pylab.scatter(x, y)
  color = 'black'
  if label in words_of_interest:
    color = 'red'
  pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                 ha='right', va='bottom', fontsize=12, color=color)
pylab.show()

# Using word vectors


## List similar words

This code cell lists words that are similar to a target word.<p> Change the first line below to test different target words. The second line specifies how many similar words to list.
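
The similarity score that gensim's `most_similar` reports is a cosine similarity: 1.0 means two vectors point in the same direction. The following is a minimal sketch of that calculation in plain Python, using made-up 3-dimensional vectors in place of the real 300-dimensional ones; `cosine_similarity`, `rank_neighbors`, and `toy_vectors` are hypothetical names for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_neighbors(target, vectors):
    """Return (word, score) pairs for every other word, most similar first."""
    scores = [
        (word, cosine_similarity(vectors[target], vec))
        for word, vec in vectors.items()
        if word != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors, invented for this example only.
toy_vectors = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "pool": [0.0, 0.1, 0.9],
}

ranked = rank_neighbors("good", toy_vectors)
for word, score in ranked:
    print("{}\t{:.2f}".format(word, score))
```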

In [0]:
word_to_find = "good"
number_of_similar_words = 30

print_similarities(word_to_find, number_of_similar_words)

## List analogies

Change the three words in the last line of this code cell to test analogies. The order of the words specifies the analogy: 


> WORD_1 is to WORD_2 as WORD_3 is to: _______



In [0]:
# man is to king as woman is to ???
number_of_similar_words = 10

print_analogy_words('man', 'king', 'woman', number_of_similar_words)
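
The analogy lookup above treats word B and word C as positive and word A as negative, which amounts to searching near the point B - A + C. The following is a minimal sketch of that arithmetic in plain Python, assuming made-up 3-dimensional vectors; `analogy` and `toy_vectors` are hypothetical, and the real lookup is done by `word_vectors.most_similar`.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def analogy(word_a, word_b, word_c, vectors):
    """A is to B as C is to ?: return the word closest to B - A + C."""
    va, vb, vc = (normalize(vectors[w]) for w in (word_a, word_b, word_c))
    query = [b - a + c for a, b, c in zip(va, vb, vc)]
    # The three input words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (word_a, word_b, word_c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


# Toy vectors, invented for this example only.
toy_vectors = {
    "man": [1.0, 0.0, 0.1],
    "woman": [0.0, 1.0, 0.1],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.5, 0.5, -1.0],
}

answer = analogy("man", "king", "woman", toy_vectors)
print("man is to king as woman is to:", answer)
```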
