# Cosine similarity and finding closest neighbors

The cells below build up to a function cosine(), which returns the cosine similarity of two vectors. Cosine similarity compares the direction of two vectors rather than the straight-line distance between them (which is what distance() computes), and it is often better suited to high-dimensional spaces such as word vectors.

In [None]:
import math

def distance(coord1, coord2):
    # Euclidean (straight-line) distance between two points
    # note, this is VERY SLOW, don't use for actual code
    return math.sqrt(sum([(i - j)**2 for i, j in zip(coord1, coord2)]))
distance([10, 1], [5, 2])

In [1]:
import spacy
import numpy as np
from numpy import dot
from numpy.linalg import norm

In [2]:
nlp = spacy.load('en_core_web_md')


In [4]:
def vec(s):
    return nlp.vocab[s].vector

In [5]:
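# cosine similarity: dot product divided by the product of the vector lengths
def cosine(v1, v2):
    if norm(v1) > 0 and norm(v2) > 0:
        return dot(v1, v2) / (norm(v1) * norm(v2))
    else:
        # a zero vector has no direction, so return 0.0 instead of dividing by zero
        return 0.0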
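
To make the formula concrete, here is a small standard-library sketch of the same calculation on hand-picked 2-D vectors. The name `cosine_plain` and the example vectors are illustrative assumptions, not part of the notebook. The point to notice is that the result depends on the angle between the vectors, not on their lengths.

```python
import math

def cosine_plain(v1, v2):
    """Cosine similarity using only the standard library (illustrative helper)."""
    dot_product = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # same convention as cosine(): zero vectors score 0.0
    return dot_product / (norm1 * norm2)

same_direction = cosine_plain([1, 2], [2, 4])   # parallel vectors, different lengths
perpendicular = cosine_plain([1, 0], [0, 1])    # orthogonal vectors
opposite = cosine_plain([1, 0], [-1, 0])        # opposite directions
print(same_direction, perpendicular, opposite)
```

Parallel vectors score (approximately) 1, perpendicular vectors score 0, and opposite vectors score -1.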

The following cell shows that the cosine similarity between "dog" and "puppy" is larger than the similarity between "trousers" and "octopus", which is what we would expect if the vectors capture similarity in meaning:

In [6]:
v1 = vec('dog')
v2 = vec('puppy')

s1 = vec('trousers')
s2 = vec('octopus')

cosine(v1, v2) > cosine(s1, s2)

True

The following cell defines a function that iterates through a list of tokens and returns the n tokens (10 by default) whose vectors are most similar to a given vector, most similar first.

In [8]:
def spacy_closest(token_list, vec_to_check, n = 10):
    return sorted(token_list, 
                 key = lambda x: cosine(vec_to_check, vec(x)), 
                 reverse = True)[:n]
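
spacy_closest sorts the entire token list and then keeps the first n entries. When only a few neighbors are needed, `heapq.nlargest` from the standard library makes the same top-n selection without a full sort. The sketch below uses a made-up three-dimensional `toy_vectors` table in place of spaCy vectors; `closest` and `toy_cosine` are hypothetical names introduced only for this example.

```python
import heapq
import math

# Made-up 3-D vectors standing in for nlp.vocab[word].vector (assumption).
toy_vectors = {
    "dog":     [1.0, 0.9, 0.0],
    "puppy":   [0.9, 1.0, 0.1],
    "cat":     [0.8, 0.6, 0.3],
    "octopus": [0.1, 0.2, 1.0],
}

def toy_cosine(v1, v2):
    dot_product = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot_product / norms if norms > 0 else 0.0

def closest(token_list, vec_to_check, n=10):
    # Selects the n highest-scoring tokens, best first, without sorting everything.
    return heapq.nlargest(n, token_list,
                          key=lambda t: toy_cosine(vec_to_check, toy_vectors[t]))

top_two = closest(list(toy_vectors), toy_vectors["dog"], n=2)
print(top_two)  # ['dog', 'puppy']
```

As with spacy_closest, the query word itself comes back first when it is in the token list, because a vector always has cosine similarity 1 with itself.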

In [29]:
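# NOTE: deleting '\n' (rather than replacing it with a space) fuses words
# across line breaks, e.g. 'toconduct' in the token list below.
file = open("pg345.txt").read()
file = file.replace('  ', ' ')
file = file.replace('\n', '')
file = file[:70000]  # keep only the first 70,000 characters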

In [30]:
doc = nlp(file)

In [31]:
# the unique alphabetic tokens in the (truncated) text
tokens = list(set([w.text for w in doc if w.is_alpha]))
tokens

['though',
 'rifts',
 'PROJECT',
 'too',
 'burned',
 'field',
 'Yes',
 'salient',
 'gold',
 'sworn',
 'toconduct',
 'eve',
 'notgaiety',
 'neither',
 'atmy',
 'IIIJONATHAN',
 'shelter',
 'blanket',
 'tribe',
 'Hawkins',
 'rattling',
 'by',
 'known',
 'thoroughly',
 'shoulder',
 'flungout',
 'itself',
 'bigwhip',
 'thestreets',
 'Voivode',
 'ahorrible',
 'lookdid',
 'hard',
 'vice',
 'wereclaimed',
 'confidence',
 'havebeen',
 'minded',
 'sheered',
 'wastrickling',
 'peasant',
 'moredangerous',
 'loading',
 'however',
 'anyone',
 'wintersnows',
 'self',
 'file',
 'bridges',
 'Copyright',
 'politicaleconomy',
 'rathercoarse',
 'massive',
 'earlier',
 'twenty',
 'atmospheres',
 'Tokay',
 'outto',
 'worse',
 'content',
 'jumped',
 'becamequite',
 'refused',
 'expecting',
 'foot',
 'see',
 'Mem',
 'Iinscribe',
 'allperiods',
 'pleasant',
 'deficiencies',
 'law',
 'them',
 'helping',
 'Onmy',
 'True',
 'nearly',
 'curve',
 'His',
 'especially',
 'of',
 'exploring',
 'follow',
 'thenwhat',
 ...]
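
Tokens such as 'toconduct' and 'havebeen' in the list above are two words fused together: deleting each line break outright joins the last word on one line to the first word on the next. One way to avoid this, sketched here on a made-up two-line string rather than the actual file, is to collapse every run of whitespace (including line breaks) into a single space. `normalize_whitespace` is a hypothetical helper name.

```python
def normalize_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces with ' ' turns newlines into single spaces.
    return " ".join(text.split())

sample = "I was sworn to\nconduct  the visitor"  # made-up sample text
fused = sample.replace("  ", " ").replace("\n", "")
cleaned = normalize_whitespace(sample)
print(fused)    # I was sworn toconduct the visitor
print(cleaned)  # I was sworn to conduct the visitor
```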

Using spacy_closest(), we can get a list of the words closest in meaning (or distribution, depending on how you look at it) to any word that has a vector in spaCy's vocabulary. The results are related words rather than strict synonyms. In the following example, we're finding the words in Dracula closest to "basketball":

In [32]:
# what's the closest equivalent of basketball?
spacy_closest(tokens, vec("basketball"))

['coach',
 'Team',
 'guard',
 'Court',
 'streak',
 'bench',
 'victorious',
 'triumph',
 'beat',
 'history']

### Fun with spaCy, Dracula, and vector arithmetic

Now we can start doing vector arithmetic and finding the closest words to the resulting vectors. For example, what word is closest to the halfway point between "day" and "night"? The meanv function below computes that midpoint by averaging a list of vectors.

In [35]:
def meanv(coords):
    # assumes every item in coords has same length as item 0
    sumv = [0] * len(coords[0])
    for item in coords:
        for i in range(len(item)):
            sumv[i] += item[i]
    mean = [0] * len(sumv)
    for i in range(len(sumv)):
        mean[i] = float(sumv[i]) / len(coords)
    return mean
meanv([[0, 1], [2, 2], [4, 3]])

[2.0, 2.0]
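
The same column-wise average can be written more compactly by transposing the vectors with zip. This is a sketch of an equivalent formulation, not something used elsewhere in the notebook; `meanv_zip` is a hypothetical name, and like meanv it assumes every vector has the same length.

```python
def meanv_zip(coords):
    # zip(*coords) groups the i-th component of every vector together,
    # so each column can be averaged directly.
    return [sum(column) / len(coords) for column in zip(*coords)]

result = meanv_zip([[0, 1], [2, 2], [4, 3]])
print(result)  # [2.0, 2.0]
```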

In [37]:
# halfway between day and night
spacy_closest(tokens, meanv([vec("day"), vec("night")]))

['night',
 'day',
 'Day',
 'evening',
 'morning',
 'afternoon',
 'Nights',
 'last',
 'Last',
 'days']

In [38]:
spacy_closest(tokens, vec("wine"))

['wine',
 'bottle',
 'drink',
 'fruit',
 'coffee',
 'draught',
 'dinner',
 'supper',
 'glass',
 'meal']

The subtractv function subtracts one vector from another:

In [39]:
def subtractv(coord1, coord2):
    return [c1 - c2 for c1, c2 in zip(coord1, coord2)]
subtractv([10, 1], [5, 2])

[5, -1]

In [40]:
spacy_closest(tokens, subtractv(vec("wine"), vec("alcohol")))

['wine',
 'graceful',
 'fabulous',
 'magnificent',
 'splendid',
 'dinner',
 'supper',
 'dining',
 'charming',
 'salad']

In [41]:
def addv(coord1, coord2):
    return [c1 + c2 for c1, c2 in zip(coord1, coord2)]
addv([10, 1], [5, 2])

[15, 3]

In [42]:
spacy_closest(tokens, vec("water"))

['water',
 'waters',
 'pond',
 'sea',
 'lake',
 'cold',
 'river',
 'air',
 'rivers',
 'clean']

But if you add "frozen" to "water," words like "snowy" and "snow" enter the top ten:

In [43]:
spacy_closest(tokens, addv(vec("water"), vec("frozen")))

['water',
 'cold',
 'waters',
 'pond',
 'sea',
 'ground',
 'lake',
 'drink',
 'snowy',
 'snow']

You can even do analogies! As a baseline, here are the words most similar to "grass":

In [44]:
spacy_closest(tokens, vec("grass"))

['grassy',
 'grass',
 'foliage',
 'treetops',
 'trees',
 'GARDEN',
 'soil',
 'green',
 'ground',
 'leaves']

In [45]:
# analogy: blue is to sky as X is to grass
blue_to_sky = subtractv(vec("blue"), vec("sky"))
spacy_closest(tokens, addv(blue_to_sky, vec("grass")))

['grassy',
 'grass',
 'green',
 'red',
 'Red',
 'purple',
 'pink',
 'blue',
 'Blue',
 'orange']
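
In the output above, 'grassy' and 'grass' rank ahead of 'green' because the result vector stays close to the input word itself. A common refinement is to wrap the arithmetic in a helper and drop the query words from the candidates. The sketch below illustrates this with hand-made 3-D vectors chosen so the arithmetic is easy to follow; `toy_vectors`, `toy_cosine`, and `analogy` are hypothetical names, not part of spaCy or the notebook.

```python
import math

# Hand-made 3-D vectors (assumption); real spaCy vectors have hundreds of dimensions.
toy_vectors = {
    "blue":  [1.0, 0.0, 1.0],
    "sky":   [1.0, 0.0, 0.0],
    "grass": [0.0, 3.0, 0.0],
    "green": [0.0, 1.0, 1.0],
    "red":   [0.2, 0.0, 1.0],
}

def toy_cosine(v1, v2):
    dot_product = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot_product / norms if norms > 0 else 0.0

def analogy(a, b, c, n=10):
    """Rank words by similarity to vec(a) - vec(b) + vec(c), skipping a, b and c."""
    target = [x - y + z for x, y, z in
              zip(toy_vectors[a], toy_vectors[b], toy_vectors[c])]
    candidates = [w for w in toy_vectors if w not in (a, b, c)]
    return sorted(candidates,
                  key=lambda w: toy_cosine(target, toy_vectors[w]),
                  reverse=True)[:n]

# blue is to sky as X is to grass
best = analogy("blue", "sky", "grass", n=2)
print(best)  # ['green', 'red']
```

With these toy vectors, 'grass' would outrank 'green' if it were left among the candidates; excluding the three input words lets the intended answer surface first.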

### Sentence similarity

To get the vector for a sentence, we simply average its component vectors, like so:

In [46]:
def sentvec(s):
    sent = nlp(s)
    return meanv([w.vector for w in sent])

Let's find the sentence in our text file that is closest in "meaning" to an arbitrary input sentence. First, we'll get the list of sentences:

In [47]:
sentences = list(doc.sents)

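The following function takes a list of sentences from a spaCy parse, averages the word vectors of each one, and compares the result to the vector of an input sentence. It returns the n sentences (10 by default) with the highest cosine similarity, most similar first.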

In [48]:
def spacy_closest_sent(space, input_str, n=10):
    input_vec = sentvec(input_str)
    return sorted(space,
                  key=lambda x: cosine(np.mean([w.vector for w in x], axis=0), input_vec),
                  reverse=True)[:n]
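
The same pipeline (average the word vectors, then rank by cosine similarity) can be traced by hand on a tiny example. The sketch below is a standard-library illustration with made-up 2-D word vectors and a naive whitespace tokenizer; `toy_vectors`, `toy_sentvec`, `toy_cosine`, and `closest_sentences` are hypothetical names and do not use spaCy.

```python
import math

# Made-up 2-D word vectors (assumption): axis 0 ~ "food", axis 1 ~ "travel".
toy_vectors = {
    "i": [0.1, 0.1], "ate": [0.9, 0.1], "bread": [1.0, 0.0],
    "rode": [0.1, 0.9], "the": [0.1, 0.1], "train": [0.0, 1.0],
    "like": [0.3, 0.3], "cheese": [1.0, 0.1],
}

def toy_sentvec(sentence):
    # Average the vectors of the words we know; unknown words are ignored.
    vectors = [toy_vectors[w] for w in sentence.lower().split() if w in toy_vectors]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def toy_cosine(v1, v2):
    dot_product = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot_product / norms if norms > 0 else 0.0

def closest_sentences(sentences, query, n=10):
    query_vec = toy_sentvec(query)
    return sorted(sentences,
                  key=lambda s: toy_cosine(toy_sentvec(s), query_vec),
                  reverse=True)[:n]

corpus = ["I rode the train", "I ate bread"]
ranked = closest_sentences(corpus, "I like cheese")
print(ranked)  # ['I ate bread', 'I rode the train']
```

Because the sentence vector is a plain average, word order is ignored: two sentences built from the same words get the same vector.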

Here are the sentences in Dracula closest in meaning to "My favorite food is strawberry ice cream." (Some words run together, as in "orrather" and "wasvery", because line breaks were deleted rather than replaced with spaces when we originally read in the source text.)

In [50]:
for sent in spacy_closest_sent(sentences, "My favorite food is strawberry ice cream."):
    print(sent.text)
    print("---")

I had for breakfast more paprika, and a sort of porridge of maize flourwhich
---
I had for dinner, orrather supper, a chicken done up some way with red pepper, which wasvery good but thirsty.
---
There was everywhere a bewildering mass of fruit blossom--apple,plum, pear, cherry; and as we drove by I could see the green grass
---
This, with some cheeseand a salad and a bottle of old Tokay, of which I had two glasses, wasmy supper.
---
I dined on what theycalled "robber steak"--bits of bacon, onion, and beef, seasoned with redpepper, and strung on sticks and roasted over the fire, in the simplestyle of the London cat's meat!
---
There is not even a toilet glass on mytable, and I had to get the little shaving glass from my bag before Icould either shave or brush my hair.
---
When I had dressed myself I went into the room where we hadsupped, and found a cold breakfast laid out, with coffee kept hot by thepot being placed on the hearth.
---
This wasemphasised by the fact that the snowy moun