<a href="https://colab.research.google.com/github/superkisa/MaGaML/blob/main/MathRefresher/sem_Words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Word vectors**


In the previous exercise we observed that colors that we think of as similar are 'closer' to each other in RGB vector space. Is it possible to create a vector space for all English words that has this same 'closer in space is closer in meaning' property?

The answer is yes! Luckily, you don't need to create those vectors from scratch. Many researchers have made downloadable databases of pre-trained vectors. One such project is [Stanford's Global Vectors for Word Representation (GloVe)](https://nlp.stanford.edu/projects/glove/). 

These $300$-dimensional vectors are included with $\texttt{spaCy}$, and they're the vectors we'll be using in this exercise.

![cosine similarity: picture](https://d33wubrfki0l68.cloudfront.net/d2742976a92aa4d6c39f19c747ec5f56ed1cec30/3803f/images/guide-to-word-vectors-with-gensim-and-keras_files/word2vec-king-queen-vectors.png)

In [1]:
# The following will download the language model.
# Resart the runtime (Runtime -> Restart runtime) after running this cell
# (and don't run it for the second time).
!python -m spacy download en_core_web_lg

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-lg==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.0/en_core_web_lg-3.4.0-py3-none-any.whl (587.7 MB)
[K     |████████████████████████████████| 587.7 MB 9.8 kB/s 
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.4.0
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_lg')


Let's load the model now:

In [1]:
import spacy

nlp = spacy.load('en_core_web_lg')

## **Word vectors: the first glance**

You can see the vector of any word in $\texttt{spaCy}$' s vocabulary using the $\texttt{vector}$ attribute:

In [2]:
# A 300-dimensional vector
len(nlp('dog').vector)

300

In [3]:
nlp('dog').vector

array([ 1.2330e+00,  4.2963e+00, -7.9738e+00, -1.0121e+01,  1.8207e+00,
        1.4098e+00, -4.5180e+00, -5.2261e+00, -2.9157e-01,  9.5234e-01,
        6.9880e+00,  5.0637e+00, -5.5726e-03,  3.3395e+00,  6.4596e+00,
       -6.3742e+00,  3.9045e-02, -3.9855e+00,  1.2085e+00, -1.3186e+00,
       -4.8886e+00,  3.7066e+00, -2.8281e+00, -3.5447e+00,  7.6888e-01,
        1.5016e+00, -4.3632e+00,  8.6480e+00, -5.9286e+00, -1.3055e+00,
        8.3870e-01,  9.0137e-01, -1.7843e+00, -1.0148e+00,  2.7300e+00,
       -6.9039e+00,  8.0413e-01,  7.4880e+00,  6.1078e+00, -4.2130e+00,
       -1.5384e-01, -5.4995e+00,  1.0896e+01,  3.9278e+00, -1.3601e-01,
        7.7732e-02,  3.2218e+00, -5.8777e+00,  6.1359e-01, -2.4287e+00,
        6.2820e+00,  1.3461e+01,  4.3236e+00,  2.4266e+00, -2.6512e+00,
        1.1577e+00,  5.0848e+00, -1.7058e+00,  3.3824e+00,  3.2850e+00,
        1.0969e+00, -8.3711e+00, -1.5554e+00,  2.0296e+00, -2.6796e+00,
       -6.9195e+00, -2.3386e+00, -1.9916e+00, -3.0450e+00,  2.48

## **Cosine similarity**

**Cosine similarity** is a common way of assessing similarity between words in NLP. It is essentially defined as the cosine of the angle between the vectors representing the words of interest.

Recall that the angle $\phi$ between two non-zero vectors $u$ and $v$ can be computed as follows:

$cos(\phi) = \frac{(u,v)}{||u||\cdot||v||}$

![](https://miro.medium.com/max/1394/1*_Bf9goaALQrS_0XkBozEiQ.png)



Define a function computing cosine similarity between two vectors.

In [4]:
import numpy as np
from math import sqrt

def cosine(v1, v2):
  # Your code here
  return np.sum(v1*v2)/(sqrt(np.sum(v1*v1))*sqrt(np.sum(v2*v2)))

Test your function by computing similarities of some random pairs of words, e.g. $dog$ and $puppy$ vs. $dog$ and $kitten$. 

In [5]:
# Your code here
print(cosine(nlp('dog').vector, nlp('puppy').vector))
print(cosine(nlp('dog').vector, nlp('kitten').vector))

0.8107667523600081
0.6515031235181183


## **Loading the text**

Let's load the full text of *Alice in Wonderland*. It will serve us as a corpus of English words.

In [17]:
import requests

# Alice in Wonderland
response = requests.get('https://www.gutenberg.org/files/11/11-0.txt')

# If you prefer Dracula, load this instead:
#response = requests.get('https://www.gutenberg.org/cache/epub/345/pg345.txt')

# Extracting separate words from the text
doc = nlp(response.text)
tokens = list(set([w.text.lower() for w in doc if w.is_alpha]))

Check out the content of $\texttt{tokens}$ now.

In [19]:
tokens[:7]

['grow', 'ii', 'tight', 'rippling', 'exempt', 'older', 'hiss']

Define a function that takes a word and lists the $n$ most similar words in our corpus.

In [20]:
import pandas as pd
def spacy_closest(tokens, new_vec, n=10):
  # Your code here
    d = {'tokens': tokens,}
    df = pd.DataFrame(data=d)
    df['tokens'] = tokens
    df['cos_token'] = [cosine(new_vec, nlp(token_i).vector) for token_i in tokens]
    df['abs_cos_token'] = [abs(cos_i) for cos_i in df['cos_token']]
    return df.sort_values(by="abs_cos_token", ascending=False)[:n]

Try to find words similar to some random words, e.g. $good$.

In [9]:
spacy_closest(tokens, nlp('good').vector)

  


Unnamed: 0,tokens,cos_token,abs_cos_token
1112,good,1.0,1.0
234,great,0.75513,0.75513
2885,bad,0.739189,0.739189
1975,excellent,0.69052,0.69052
292,nice,0.671777,0.671777
1691,better,0.661587,0.661587
1442,wonderful,0.640145,0.640145
240,pleasant,0.5885,0.5885
1391,wise,0.582179,0.582179
2354,happy,0.576959,0.576959


You can also get creative and search for combinations of words. For example, what is similar to $king - man + woman$? 

In [21]:
# Your code here
spacy_closest(tokens, (nlp('king').vector - nlp('man').vector + nlp('woman').vector))

  


Unnamed: 0,tokens,cos_token,abs_cos_token
2218,king,0.848954,0.848954
839,kings,0.718906,0.718906
2045,queen,0.617801,0.617801
2104,throne,0.608184,0.608184
1138,courtiers,0.596625,0.596625
1898,royal,0.589522,0.589522
2140,crown,0.520517,0.520517
660,conquest,0.518965,0.518965
1682,conqueror,0.51093,0.51093
1884,father,0.48368,0.48368


## **Sentence vectors**

We can also construct a vector representation for the whole sentence. For example, we can define it as an *average* of the   vectors representing the words in it.

Let's take a random sentence *My favorite food is strawberry ice cream* and construct its vector representation.

In [22]:
sent = nlp('My favorite food is strawberry ice cream.')
sentv = np.sum(sent[:-1].vector)/(len(sent) - 1)
sentv

# Your code here
# sentv ...

-6.100927625383649

Let's also extract sentences (as opposed to individual words) from our corpus:

In [23]:
sents = list(doc.sents)
sents

[ï»¿The Project Gutenberg eBook of Aliceâs Adventures in Wonderland, by Lewis Carroll
 
 This eBook is for the use of anyone anywhere in the United States and
 most other parts of the world at no cost and with almost no restrictions
 whatsoever., You may copy it, give it away or re-use it under the terms
 of the Project Gutenberg License included with this eBook or online at
 www.gutenberg.org., If you are not located in the United States, you
 will have to check the laws of the country where you are located before
 using this eBook.
 , Title: Aliceâs Adventures in Wonderland
 
 Author: Lewis Carroll
 
 Release Date: January, 1991, [eBook #11], [Most recently updated: October 12, 2020]
 , Language: English
 
 Character set encoding: UTF-8
 
 Produced by: Arthur DiBianca and David Widger
 
 *** START OF THE PROJECT GUTENBERG EBOOK, ALICEâS ADVENTURES IN WONDERLAND, *, **
 
 [Illustration]
 
 
 
 
 Aliceâs Adventures in Wonderland
 
 by Lewis Carroll
 
 THE MILLENNIUM FULCRUM EDI

Define a function that takes a random sentence and lists $n$ most similar sentences from our corpus.

In [24]:
def spacy_closest_sent(sentences, input_vec, n=10):
  # Your code here
  d = {'sentences': sentences,}
  df = pd.DataFrame(data=d)
  df['sentences'] = sentences
  df['cos_sentence'] = [cosine(input_vec, sentence_i.vector) for sentence_i in sentences]
  df['abs_cos_sentence'] = [abs(cos_i) for cos_i in df['cos_sentence']]
  return df.sort_values(by="abs_cos_sentence", ascending=False)[:n]

Let's try it out!

In [25]:
for s in spacy_closest_sent(sents, sentv, n=10):
  print(s)
  print('\n---')

sentences

---
cos_sentence

---
abs_cos_sentence

---


  


## **References**

This notebook is inspired by a [tutorial by Allison Parrish](https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469).