So far, we've only studied word embeddings, where each word is represented by a vector of numbers. For instance, the word `cat` might be represented as:

```python
cat = [0.23, 0.10, -0.23, -0.01, 0.91, 1.2, 1.01, -0.92]
```

But how would you represent a **sentence**? There are many ways to represent sentences, but the simplest, and often a very effective one, is to **take the average of the word embeddings of all the words in that sentence**.
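
To make the averaging idea concrete before bringing in a real model, here is a minimal sketch with made-up numbers. The `toy_embeddings` dictionary and its 4-dimensional vectors are invented purely for illustration; real embeddings come from a trained model and are much longer.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, made up for illustration only
# (real models use hundreds of dimensions).
toy_embeddings = {
    "I": np.array([0.2, 0.4, -0.5, 0.1]),
    "like": np.array([0.0, -0.2, 0.3, 0.5]),
    "cats": np.array([0.4, 0.1, 0.2, -0.3]),
}

sentence = ["I", "like", "cats"]

# Stack the word vectors into a (3, 4) matrix, then average over the words (axis 0).
word_vectors = np.stack([toy_embeddings[word] for word in sentence])
sentence_vector = word_vectors.mean(axis=0)

print(sentence_vector)  # one vector with the same length as each word vector
```

The sentence vector has the same number of dimensions as each word vector, no matter how many words the sentence contains, which is what lets us compare sentences of different lengths.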

### Important Note:

 Before you start this next portion, download `en_core_web_md` for spaCy. The `en_core_web_sm` model we used in class is a good starter introduction to word embeddings, but it won't give you results that are as good in the long run:
 
>*To make them compact and fast, spaCy’s small models (all packages that end in `sm`) don’t ship with **true word vectors**, and only include context-sensitive tensors. This means you can still use the `similarity()` methods to compare documents, spans and tokens – but the result won’t be as good, and individual tokens won’t have any vectors assigned. So in order to use real word vectors, you need to download a larger model*.

You can download the larger model from the command line with `python -m spacy download en_core_web_md`. In a Jupyter notebook cell, you can instead run `!{sys.executable} -m spacy download en_core_web_md` (after `import sys`).

In [20]:
# load in spacy and the helpers used in the cells below
import en_core_web_md
import numpy as np
import spacy
from scipy.spatial.distance import cosine
nlp = en_core_web_md.load()

In [23]:
sentenceA = "I watched a movie with my friend."

sentenceA_tokens = nlp(sentenceA)
print("\nSentence A:")
# Only the first 6 values of each word embedding are shown, but remember that
# the embedding itself is usually 50, 100, 300, or 500 elements long
# (300 for spaCy's en_core_web_md).
for token in sentenceA_tokens:
    print(f"{token}'s word embedding: {token.vector[:6]}'")
print("\nSentence B:")
# sentenceB must already be defined (assumed to come from an earlier cell)
for token in nlp(sentenceB):
    print(f"{token}'s word embedding: {token.vector[:6]}'")


Sentence A:
I's word embedding: [ 0.18733   0.40595  -0.51174  -0.55482   0.039716  0.12887 ]'
watched's word embedding: [ 0.08763  -0.41748  -0.33357  -0.080973 -0.089307  0.12784 ]'
a's word embedding: [ 0.043798  0.024779 -0.20937   0.49745   0.36019  -0.37503 ]'
movie's word embedding: [ 0.2071  -0.47656  0.15479 -0.38965  0.48447  0.59815]'
with's word embedding: [-0.099534  0.028202 -0.23189   0.094477  0.12191  -0.18962 ]'
my's word embedding: [ 0.08649  0.14503 -0.4902   0.34224  0.36343  0.10046]'
friend's word embedding: [ 0.07781   0.17561  -0.59164   0.25467   0.35536  -0.012292]'
.'s word embedding: [ 0.012001  0.20751  -0.12578  -0.59325   0.12525   0.15975 ]'

Sentence B:
I's word embedding: [ 0.18733   0.40595  -0.51174  -0.55482   0.039716  0.12887 ]'
saw's word embedding: [ 0.028726  0.15006  -0.19278  -0.13624   0.21288   0.085543]'
a's word embedding: [ 0.043798  0.024779 -0.20937   0.49745   0.36019  -0.37503 ]'
film's word embedding: [ 0.30708 -0.30603  0.43486  

Note that if you had used `en_core_web_sm`, spaCy would generate your word embeddings on the fly, so the same word, like `I`, might have slightly different embedding values from one context to the next! With `en_core_web_md`, spaCy uses pre-trained embeddings that are fixed and more accurate.

To find the sentence vector for sentence A, sum the word embeddings of all the tokens in sentence A, then divide by the number of tokens:

In [34]:
# how to find the sentence embedding of sentence A
doc = nlp(sentenceA)  # parse the sentence once and reuse the result
# start from a length-300 vector of zeros (spacy's en_core_web_md model uses 300-dimensional word embeddings)
running_total = np.zeros(300)
for token in doc:
    running_total += token.vector # add the word embeddings to the running total

# divide by the total number of tokens in the sentence (punctuation included) to get the "average embedding"
sentence_embedding = running_total / len(doc)

In [33]:
# these are the first 10 values of the 300-dimensional sentence embedding for sentence A (using en_core_web_md)

sentence_embedding[:10]

array([ 0.07532812,  0.01163013, -0.292425  , -0.053732  ,  0.22012738,
        0.067266  ,  0.06414513, -0.40898649,  0.07971475,  2.07558748])

There's actually an even easier way to do this in spaCy: the `.vector` attribute of a parsed sentence is already the average of its token vectors.

In [37]:
tokens = nlp(sentenceA)
tokens.vector[:10] # the same as the above, when we got the sentence embeddings ourselves!

array([ 0.07532812,  0.01163013, -0.29242504, -0.05373199,  0.22012737,
        0.067266  ,  0.06414513, -0.40898648,  0.07971475,  2.0755873 ],
      dtype=float32)

In [38]:
sentenceA_embedding = nlp(sentenceA).vector
sentenceB_embedding = nlp(sentenceB).vector

In [42]:
similarity = 1 - cosine(sentenceA_embedding, sentenceB_embedding)
print(f"The similarity between sentence A and sentence B is {similarity}")

The similarity between sentence A and sentence B is 0.9586231708526611
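
It may help to see what `1 - cosine(...)` is computing. `scipy.spatial.distance.cosine` returns a cosine *distance*, which is one minus the cosine of the angle between the two vectors, so subtracting it from 1 gives the cosine similarity. Below is a rough NumPy-only sketch of the same quantity; the function name `cosine_similarity` and the example vectors are made up for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors standing in for sentence embeddings.
same_direction = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
perpendicular = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
opposite = cosine_similarity(np.array([1.0, 0.0]), np.array([-2.0, 0.0]))

print(same_direction, perpendicular, opposite)
```

Vectors pointing in the same direction score close to 1, perpendicular vectors score close to 0, and opposite vectors score close to -1. Vector length does not matter, only direction.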


In [48]:
sentenceC = "I drank a watermelon with my dog."
# Structurally, this is extremely similar to sentences A and B.
# Semantically, however, it is extremely different! Let's check whether word embeddings can tell
# that sentence C is not as similar to A and B as they are to each other.

sentenceC_embedding = nlp(sentenceC).vector
similarity = 1 - cosine(sentenceC_embedding, sentenceA_embedding)
print(f"The similarity between sentence C and sentence A is {similarity}")
similarity = 1 - cosine(sentenceC_embedding, sentenceB_embedding)
print(f"The similarity between sentence C and sentence B is {similarity}")

The similarity between sentence C and sentence A is 0.8648006319999695
The similarity between sentence C and sentence B is 0.8382346630096436


What happens if we substitute in `pal` for `dog`? Our word count models would not have picked up on any real difference, since `pal` is just another word to be counted. However, semantically, `pal` is an informal name for a friend, so substituting in this new word should increase our similarity.

In [47]:
sentenceC = "I drank a watermelon with my pal."

sentenceC_embedding = nlp(sentenceC).vector
similarity = 1 - cosine(sentenceC_embedding, sentenceA_embedding)
print(f"The similarity between sentence C and sentence A is {similarity}")
similarity = 1 - cosine(sentenceC_embedding, sentenceB_embedding)
print(f"The similarity between sentence C and sentence B is {similarity}")

The similarity between sentence C and sentence A is 0.8804790377616882
The similarity between sentence C and sentence B is 0.8583351373672485
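
For contrast, here is a small sketch of the claim that a word count model cannot see the difference between `dog` and `pal`. The helper `count_cosine` is hypothetical: it assumes a very simple count model (lowercase, strip the final period, split on whitespace), which may differ from the count models used earlier in the course.

```python
import math
from collections import Counter

def count_cosine(sentence_1, sentence_2):
    """Cosine similarity between the raw word-count vectors of two sentences."""
    # Assumption: naive tokenization (lowercase, drop trailing period, split on spaces).
    counts_1 = Counter(sentence_1.lower().rstrip(".").split())
    counts_2 = Counter(sentence_2.lower().rstrip(".").split())
    # Only words that appear in both sentences contribute to the dot product.
    shared = set(counts_1) & set(counts_2)
    dot = sum(counts_1[w] * counts_2[w] for w in shared)
    norm_1 = math.sqrt(sum(c * c for c in counts_1.values()))
    norm_2 = math.sqrt(sum(c * c for c in counts_2.values()))
    return dot / (norm_1 * norm_2)

sentence_a = "I watched a movie with my friend."
with_dog = "I drank a watermelon with my dog."
with_pal = "I drank a watermelon with my pal."

dog_score = count_cosine(sentence_a, with_dog)
pal_score = count_cosine(sentence_a, with_pal)
print(dog_score, pal_score)
```

Under this count model the two scores are identical: neither `dog` nor `pal` appears in sentence A, so each version shares the same four words with it (`i`, `a`, `with`, `my`). The embedding-based similarity, by contrast, moves when `dog` becomes `pal`.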


In [49]:
sentenceC = "I saw a watermelon with my pal."

sentenceC_embedding = nlp(sentenceC).vector
similarity = 1 - cosine(sentenceC_embedding, sentenceA_embedding)
print(f"The similarity between sentence C and sentence A is {similarity}")
similarity = 1 - cosine(sentenceC_embedding, sentenceB_embedding)
print(f"The similarity between sentence C and sentence B is {similarity}")
# Notice the even higher similarity after I substitute in "saw", a synonym for watched.

The similarity between sentence C and sentence A is 0.9236767292022705
The similarity between sentence C and sentence B is 0.9158568978309631
