## Text Similarity using Word Embeddings

In this notebook we're going to play around with pre-built word embeddings and do some fun calculations.

In [None]:
import sys
sys.path

In [None]:
# 06/17/23 NOTE: haven't run this in 4y 7m: the below comments were when run 
# on a macbook pro. Now on acer. 

# WS 02/13/19 adding the correct path for the env automatically by including the following path:
# /home/smithw/.virtualenvs/osinga/lib/python3.6/site-packages
# in the file 
# _virtualenv_path_extensions.pth in the env's /site-packages area; no need for the following line now
#sys.path.insert(0,'/home/smithw/.virtualenvs/osinga/lib/python3.6/site-packages')
#sys.path

In [None]:
%matplotlib inline

import os
from keras.utils import get_file
import gensim
import subprocess
import numpy as np
import matplotlib.pyplot as plt
from IPython.core.pylabtools import figsize
figsize(10, 10)

from sklearn.manifold import TSNE
import json
from collections import Counter
from itertools import chain

We'll start by downloading a pretrained model trained on Google News. The download is gzipped; we use `zcat` to unzip it, so make sure you have that installed or substitute another tool.

In [None]:
MODEL = 'GoogleNews-vectors-negative300.bin'
# WS 06/17/23: this get_file() location is OBE: downloaded directly from
# https://drive.google.com/u/0/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&export=download

#path = get_file(MODEL + '.gz', 'https://s3.amazonaws.com/dl4j-distribution/%s.gz' % MODEL)

In [None]:
#if not os.path.isdir('generated'):
#    os.mkdir('generated')

In [None]:
data_loc = '/home/smithw/Downloads/deep_learning' # WS: files not backed up here
zipped   = os.path.join(data_loc, MODEL + '.gz')  # WS mod
unzipped = os.path.join(data_loc, MODEL)  # WS

In [None]:
zipped, unzipped
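
If only the compressed download is on disk, the `zcat` step can be replaced by Python's standard library. A minimal sketch, assuming the `zipped` and `unzipped` paths defined above; `gunzip_file` is a hypothetical helper name, not part of the original notebook:

```python
import gzip
import os
import shutil


def gunzip_file(src, dst, chunk_size=1024 * 1024):
    """Decompress the gzip file `src` to `dst` unless `dst` already exists."""
    if os.path.isfile(dst):
        return dst  # already decompressed, nothing to do
    with gzip.open(src, 'rb') as f_in, open(dst, 'wb') as f_out:
        # Stream in chunks so the large model file is never held in memory at once.
        shutil.copyfileobj(f_in, f_out, chunk_size)
    return dst
```

Calling `gunzip_file(zipped, unzipped)` before the load step would then leave the uncompressed `.bin` file where `load_word2vec_format` expects it.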

In [None]:
model = gensim.models.KeyedVectors.load_word2vec_format(unzipped, binary=True)

In [None]:
# WS 11/10/18 this model took about 10m to load on macbook pro
# WS 06/17/23 took 35s to load on the acer

Let's take this model for a spin by looking at which words are most similar to espresso. As expected, coffee-like items show up:

In [None]:
model.most_similar(positive=['espresso'])
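
Similarity here means cosine similarity between embedding vectors, which is the score `most_similar` ranks by. A toy sketch of that ranking in plain Python; the three-dimensional vectors and the names `toy_vectors` and `rank_by_cosine` are made up for illustration and are not taken from the Google News model:

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-d embeddings, chosen by hand so that the drinks point the same way.
toy_vectors = {
    'espresso': [0.9, 0.1, 0.0],
    'latte':    [0.8, 0.2, 0.1],
    'tea':      [0.6, 0.5, 0.0],
    'hockey':   [0.0, 0.1, 0.9],
}


def rank_by_cosine(word, vectors):
    """Return (other_word, score) pairs sorted from most to least similar."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_cosine('espresso', toy_vectors)
print([name for name, _ in ranking])
```

As in the real model, the query word itself is left out of the results.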

Now for the famous equation: what is to woman as king is to man? We create a quick function to do these calculations here:

In [None]:
# simple function to listify inputs, WS comment
a, b, c = 'hi', ['list'], 8
a, b, c = map(lambda x: x if isinstance(x, list) else [x], (a, b, c))
a, b, c

In [None]:
def A_is_to_B_as_C_is_to(a, b, c, topn=1, score=False):  # WS added score output
    a, b, c = map(lambda x: x if isinstance(x, list) else [x], (a, b, c))  # WS listify inputs
    res = model.most_similar(positive=b + c, negative=a, topn=topn)
    if len(res):
        if topn == 1:
            if score: return res[0]
            else:     return res[0][0]
        if score: return res
        else:     return [x[0] for x in res]
    return None

In [None]:
A_is_to_B_as_C_is_to('man', 'woman', 'king', topn=5, score=True)

In [None]:
A_is_to_B_as_C_is_to('man', 'king', 'woman', topn=5, score=True) # identical results

In [None]:
# WS comment 06/17/23:
# In embedding space, let x = king - man (i.e., positive minus negative inputs).
# Rearranged, king = man + x, so x points from man to king.
# The query (woman + king) - man rearranges to woman + (king - man) = woman + x.
# If x points from man to king, it presumably also points approximately from
# woman to queen, provided relationships are preserved; the results bear this out.
# Swapping woman and king gives identical results because vector addition commutes.
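
The offset argument in the comment above can be checked on hand-made vectors. A toy sketch with NumPy, assuming a hypothetical two-dimensional embedding where one axis encodes royalty and the other gender; the numbers are invented, not read from the model:

```python
import numpy as np

# Hypothetical 2-d embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    'man':   np.array([0.0, 1.0]),
    'woman': np.array([0.0, -1.0]),
    'king':  np.array([1.0, 1.0]),
    'queen': np.array([1.0, -1.0]),
    'boy':   np.array([-0.5, 1.0]),
}

x = toy['king'] - toy['man']   # offset pointing from man to king
target = toy['woman'] + x      # same as (woman + king) - man


def cosine(u, v):
    """Cosine similarity between two NumPy vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Leave the three input words out of the candidates, as most_similar does.
candidates = {w: v for w, v in toy.items() if w not in ('man', 'woman', 'king')}
best = max(candidates, key=lambda w: cosine(candidates[w], target))
print(best)
```

With these vectors the target lands exactly on `queen`. The real `most_similar` also normalizes each input vector before combining them, a step this sketch skips.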

In [None]:
A_is_to_B_as_C_is_to('man', 'woman', 'boy', topn=5, score=True)

In [None]:
#model.most_similar?

In [None]:
model.most_similar(positive=['woman', 'king'], negative=['man'])

In [None]:
model.most_similar(positive=['king'], negative=['man'])

In [None]:
model.most_similar(positive=['man'])

In [None]:
model.most_similar(negative=['man'])

We can use this equation to accurately predict the capitals of countries by asking what has the same relationship to each selected country as Berlin has to Germany:

In [None]:
for country in 'Italy', 'France', 'India', 'China', 'America', 'USA': # WS added last two: it fails
    print('%s is the capital of %s' % 
          (A_is_to_B_as_C_is_to('Germany', 'Berlin', country, score=True), country))

Or we can do the same for important products of given companies. Here we seed the products equation with two examples, the iPhone for Apple and Starbucks_coffee for Starbucks. Note that numbers are replaced by `#` in the embedding model:

In [None]:
for company in 'Google', 'IBM', 'Boeing', 'Microsoft', 'Samsung':
    products = A_is_to_B_as_C_is_to(
        ['Starbucks', 'Apple'], 
        ['Starbucks_coffee', 'iPhone'], 
        company, topn=3)
    print('%s -> %s' % 
          (company, ', '.join(products)))
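
Because digits are stored as `#`, a query containing numbers has to be masked the same way before lookup. A minimal sketch, assuming every digit maps to exactly one `#` (the notebook does not state the exact rule used when the model was built); `mask_digits` is a hypothetical helper:

```python
import re


def mask_digits(token):
    """Replace each ASCII digit in `token` with '#'."""
    return re.sub(r'[0-9]', '#', token)


print(mask_digits('Boeing_747'))
print(mask_digits('iPhone'))
```

Whether a masked token actually exists in the vocabulary still needs an `in model` check.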

Let's do some clustering by picking several categories of items, starting with drinks, countries and sports:

In [None]:
beverages = ['espresso', 'beer', 'vodka', 'wine', 'cola', 'tea']
countries = ['Italy', 'Germany', 'Russia', 'France', 'USA', 'India']
sports    = ['soccer', 'handball', 'hockey', 'cycling', 'basketball', 'cricket']
food      = ['hamburger', 'pizza', 'soup', 'steak', 'chicken']  # WS added
vehicles  = ['airplane', 'locomotive', 'automobile', 'ship', 'submarine', 'rocket'] # WS added
tools     = ['hammer', 'screwdriver', 'drill', 'ruler', 'pencil', 'level'] # WS added
misc      = ['man', 'woman', 'king', 'queen']

items = beverages + countries + sports + food + vehicles + tools + misc
len(items)

And looking up their vectors:

In [None]:
item_vectors = [(item, model[item]) for item in items if item in model]
len(item_vectors)

In [None]:
item_vectors[0][1].shape  # 300 dimensions in embedding space

Now normalize the vectors and use t-SNE to project them into two dimensions, where clusters become visible:

In [None]:
vectors      = np.asarray([x[1] for x in item_vectors])
lengths      = np.linalg.norm(vectors, axis=1)
norm_vectors = (vectors.T / lengths).T

In [None]:
norm_vectors.shape

In [None]:
norm_vectors[0][:10]
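
The transpose trick above divides each row by its own length. An equivalent formulation passes `keepdims=True` so that broadcasting handles the division directly. A small sketch on made-up data (the `demo` array is illustrative only):

```python
import numpy as np

demo = np.array([[3.0, 4.0],
                 [0.0, 2.0],
                 [1.0, 1.0]])

# Row lengths with shape (3, 1), so each row is divided by its own norm.
lengths = np.linalg.norm(demo, axis=1, keepdims=True)
unit_rows = demo / lengths

# Same result as the transpose form used in the notebook.
transposed_form = (demo.T / np.linalg.norm(demo, axis=1)).T

print(unit_rows)
print(np.allclose(unit_rows, transposed_form))
```

For unit-length rows, Euclidean distance and cosine similarity order neighbours the same way, which is why normalizing before t-SNE is a reasonable step.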

In [None]:
# WS original: perplexity 10 (20 made it a worse clustering, 5 better), verbose=2
tsne = TSNE(n_components=2, perplexity=8, verbose=1).fit_transform(norm_vectors)

And matplotlib to show the results:

In [None]:
x = tsne[:, 0]
y = tsne[:, 1]
fig, ax = plt.subplots()
ax.scatter(x, y)
for item, x1, y1 in zip(item_vectors, x, y):
    ax.annotate(item[0], (x1, y1), size=14)
plt.grid() # WS
plt.show()

In [None]:
for k in item_vectors:
    print(k[0])

In [None]:
TSNE?