# Using Word Vectors

## Imports and English model

In [1]:
import spacy
import numpy as np

This notebook uses spaCy's default English model, which is relatively small (see https://spacy.io/models/en for details):

In [2]:
nlp = spacy.load('en')

## Word vectors

Here is a set of words related to animals:

In [3]:
animals = 'horse cat kitten puppy dog mouse pony'
tokens = nlp(animals)

`tokens` is a spaCy `Doc`, a sequence of tokens that supports indexing. Let's grab the horse token:

In [4]:
horse = tokens[0]

Here is the actual word the token points to:

In [5]:
str(horse)

'horse'

The main idea of word vectors is that for each word, there is a corresponding 1d vector (a NumPy array). These word vectors encode information about the meaning of the word and its function in sentences. Here are the first 10 elements of the word vector for *horse*:

In [6]:
horse.vector[0:10]

array([-1.19629192,  3.07423615, -1.23166108,  0.56064409, -0.5370236 ,
        2.51031804, -3.66074538,  1.92708302,  1.98932076,  2.73977804], dtype=float32)

Create a Python `dict` named `wordvecs` where the keys are the string names of the animals and the values are the `np.ndarray` word vectors:

In [7]:
animals = 'horse cat kitten puppy dog mouse pony'
tokens = nlp(animals)
wordvecs = {}
for token in tokens:
    wordvecs[str(token)] = token.vector

In [8]:
assert len(wordvecs)==7
assert set(animals.split(' '))==set(wordvecs.keys())
assert all(isinstance(v, np.ndarray) for v in wordvecs.values())

## Similarity

Write a function `vector_norm` that computes the $L_2$ vector norm:

$$
\left|\vec{v}\right| = \sqrt{ \sum_{i=0}^{n-1} v_i^2 }
$$

of a NumPy array. Don't use any `for` loops and don't use anything from `np.linalg`.

In [9]:
array_test = np.array([1,2,3,4,5])
array_test ** 2

array([ 1,  4,  9, 16, 25])

In [10]:
def vector_norm(v):
    """Compute the L2 norm of a 1d NumPy array v."""
    # YOUR CODE HERE
    l2 = np.sum(v ** 2) ** .5
    return l2

Your `vector_norm` function should pass these tests:

In [11]:
a = np.linspace(0,10.0,10)
assert np.allclose(vector_norm(a), np.linalg.norm(a))
assert np.allclose(vector_norm(np.array([1,0])), 1.0)
assert np.allclose(vector_norm(np.array([1,1])), np.sqrt(2.0))
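As a quick check on the formula, the sketch below applies the same square, sum, and square-root steps to a 3-4-5 right triangle, and shows that scaling a vector scales its norm by the same factor. The name `l2_norm` is a stand-in chosen here so the block runs on its own; it is not part of the notebook.

```python
import numpy as np

def l2_norm(v):
    """L2 norm via elementwise squaring, summing, and a square root."""
    v = np.asarray(v, dtype=float)
    return np.sqrt(np.sum(v ** 2))

v = np.array([3.0, 4.0])
print(l2_norm(v))      # 5.0
print(l2_norm(2 * v))  # 10.0
```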

Write a function `inner_product` that computes the dot or inner product:

$$
\vec{v} \cdot \vec{w} = \sum_{i=0}^{n-1} v_i w_i
$$

of two NumPy arrays. Don't use any `for` loops and don't use `np.dot`.

In [12]:
def inner_product(v, w):
    """Compute the inner/dot product of two 1d NumPy arrays v, w."""
    # YOUR CODE HERE
    inner = np.multiply(v,w)
    v_dot_w = np.sum(inner)
    return v_dot_w

Your `inner_product` function should pass these tests:

In [13]:
assert np.allclose(inner_product(np.array([0,1]),np.array([1,0])), 0.0)
assert np.allclose(inner_product(np.array([1,0]),np.array([1,0])), 1.0)
assert np.allclose(inner_product(np.array([1,2]),np.array([3,4])), 11)
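Two properties of the inner product are worth keeping in mind for the similarity measure that follows: orthogonal vectors have an inner product of zero, and the inner product of a vector with itself is its squared norm. Here is a minimal sketch with made-up vectors; the helper name `dot` is an assumption used only to keep the block self-contained.

```python
import numpy as np

def dot(v, w):
    """Inner product as an elementwise product followed by a sum."""
    return np.sum(np.asarray(v) * np.asarray(w))

v = np.array([1.0, 2.0, 2.0])
w = np.array([2.0, -1.0, 0.0])

print(dot(v, w))  # 0.0, so v and w are orthogonal
print(dot(v, v))  # 9.0, the squared norm of v (its norm is 3.0)
```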

Once you have represented different entities, such as words, as vectors, it is often useful to have a measure of how similar two vectors are to each other. For example, from a language and meaning perspective, we would expect *dog* and *puppy* to be more similar than *shark* and *kitten*. One common way of measuring this type of similarity is the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity), which is defined as follows:

$$
S(\vec{v}, \vec{w}) = \frac{\vec{v}\cdot\vec{w}}{\left|\vec{v}\right|\left|\vec{w}\right|}
$$

Write a function `similarity` that computes the cosine similarity of two vectors that are 1d NumPy arrays. Use the above `inner_product` and `vector_norm` functions.

In [14]:
def similarity(v, w):
    """Compute the cosine similarity of two 1d NumPy arrays v, w."""
    # YOUR CODE HERE
    cos_sim = inner_product(v,w) / (vector_norm(v) * vector_norm(w))
    return cos_sim

We can now compute the similarity between different pairs of words:

In [15]:
similarity(wordvecs['horse'], wordvecs['pony'])

0.36603757321214497

In [16]:
similarity(wordvecs['kitten'], wordvecs['puppy'])

0.37314380206073866

These results are a bit confusing: you might expect *horse* and *pony* to be more similar than *kitten* and *puppy*. This particular set of word vectors comes from a rather small model and, as a result, is not particularly accurate. More accurate sets of word vectors are available with spaCy, but they take more memory.

Notice that words should have a self-similarity of `1.0`:

In [17]:
similarity(wordvecs['cat'], wordvecs['cat'])

1.0
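Cosine similarity depends only on direction, not length, and ranges from `-1.0` (opposite directions) through `0.0` (orthogonal) to `1.0` (same direction). The sketch below illustrates this with toy vectors rather than word vectors; `cosine_similarity` is a hypothetical stand-alone helper, and it assumes both inputs are nonzero.

```python
import numpy as np

def cosine_similarity(v, w):
    """Cosine of the angle between v and w (both assumed nonzero)."""
    v = np.asarray(v, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.sum(v * w) / (np.sqrt(np.sum(v ** 2)) * np.sqrt(np.sum(w ** 2)))

v = np.array([1.0, 2.0])
print(np.isclose(cosine_similarity(v, 3 * v), 1.0))        # True: same direction
print(np.isclose(cosine_similarity(v, -v), -1.0))          # True: opposite direction
print(np.isclose(cosine_similarity([1, 0], [0, 1]), 0.0))  # True: orthogonal
```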

Your `similarity` function should pass the following tests:

In [18]:
for k, v in wordvecs.items():
    assert np.allclose(similarity(v,v), 1.0)
assert np.allclose(similarity(wordvecs['horse'], wordvecs['pony']),0.36603758)
assert np.allclose(similarity(wordvecs['kitten'], wordvecs['puppy']),0.37314379)

## Finding similar words

Write a function `most_similar` that takes a single word (like `puppy`) and the `dict` of word vectors (whose keys are words and whose values are the corresponding word vectors) and returns the word in the word vector set that is most cosine-similar to `word` (other than the word itself). Return a tuple of the matched word and its cosine similarity.

In [19]:
def most_similar(word, wordvecs):
    """Find the most similar word in wordvecs to the input word.
    
    Parameters
    ==========
    word : str
        A single input word.
    wordvecs : dict
        A dict whose keys are words and values are word vectors.
    
    Return
    ======
    (word, similarity)
    """
    # YOUR CODE HERE
    # Start below any possible cosine similarity so negative values can still win.
    biggest_value = -np.inf
    biggest_word = None
    for key in wordvecs:
        if key == word:
            continue
        cos_sim = similarity(wordvecs[word], wordvecs[key])
        if cos_sim > biggest_value:
            biggest_value = cos_sim
            biggest_word = key
    return (biggest_word, biggest_value)

Here we see that *mouse* is the most similar word to *horse* (which is silly, but expected given this small dataset).

In [20]:
most_similar('horse', wordvecs)

('mouse', 0.51169699114944167)
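The same search can also be written with Python's built-in `max`, which avoids choosing an initial value for the running maximum. That matters because cosine similarities can be negative: a running maximum initialized at 0 would never be updated if every candidate scored below zero. The sketch below uses toy 2d vectors in which all similarities to `'a'` are negative; `closest_word`, `cosine_similarity`, and `toy` are names invented for this illustration, and the function assumes the dict holds at least two words.

```python
import numpy as np

def cosine_similarity(v, w):
    """Cosine similarity of two nonzero 1d arrays."""
    return np.sum(v * w) / (np.sqrt(np.sum(v ** 2)) * np.sqrt(np.sum(w ** 2)))

def closest_word(word, vecs):
    """Return (other_word, similarity) for the entry most similar to word."""
    candidates = ((other, cosine_similarity(vecs[word], vecs[other]))
                  for other in vecs if other != word)
    # max compares on the similarity, the second element of each pair.
    return max(candidates, key=lambda pair: pair[1])

toy = {
    'a': np.array([1.0, 0.0]),
    'b': np.array([-1.0, 0.1]),
    'c': np.array([-1.0, -1.0]),
}
best, score = closest_word('a', toy)
print(best)  # c
```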

Build a list containing the most similar word for each word in the word vector set, in the same order as the keys of `wordvecs`. Save your list in a variable named `matches`:

In [21]:
# YOUR CODE HERE
matches = [most_similar(key, wordvecs)[0] for key in wordvecs]

Your `most_similar` function and `matches` list should pass the following tests:

In [22]:
match = most_similar('horse', wordvecs)
assert match[0]=='mouse'
assert np.allclose(match[1], 0.51169699)
assert matches==['mouse', 'mouse', 'mouse', 'dog', 'mouse', 'dog', 'mouse']
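Calling `most_similar` once per word recomputes each pair of similarities twice. For a larger vocabulary, one alternative is to stack the vectors into a matrix, normalize each row to unit length, and get every pairwise cosine similarity from a single matrix product. This is a sketch on made-up 2d vectors, not the approach the exercise asks for; `words`, `vectors`, and `sims` are illustrative names.

```python
import numpy as np

words = ['a', 'b', 'c']
vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
])

# Normalize each row to unit length; the matrix product then holds all cosines.
unit = vectors / np.sqrt(np.sum(vectors ** 2, axis=1, keepdims=True))
sims = unit @ unit.T

# Exclude self-matches before taking the row-wise argmax.
np.fill_diagonal(sims, -np.inf)
best = [words[j] for j in np.argmax(sims, axis=1)]
print(best)  # ['b', 'a', 'b']
```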