# Word embedding 


In [24]:
sentences = [
    'Molly ate a donut',
    'Molly ate a fish',
    'Jen consumed a carp',
    'Lenny fears the lions'
]

print('\n'.join(sentences))

Molly ate a donut
Molly ate a fish
Jen consumed a carp
Lenny fears the lions


**Which pair do you think is the most similar?** Two of them are Molly eating something, while two of them are women eating fish. Lenny is... definitely an outlier. While you weigh the options, let's get analyzing!

## Word counting

In [13]:
count_vectorizer = ??

Unnamed: 0,ate,carp,consumed,donut,fears,fish,jen,lenny,lions,molly,the
Molly ate a donut,1,0,0,1,0,0,0,0,0,1,0
Molly ate a fish,1,0,0,0,0,1,0,0,0,1,0
Jen consumed a carp,0,1,1,0,0,0,1,0,0,0,0
Lenny fears the lions,0,0,0,0,1,0,0,1,1,0,1


In [15]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Compute the similarities using the word counts
similarities = cosine_similarity(matrix)

# Make a fancy colored dataframe about it
pd.DataFrame(similarities,
             index=sentences,
             columns=sentences) \
            .style \
            .background_gradient(axis=None)

Unnamed: 0,Molly ate a donut,Molly ate a fish,Jen consumed a carp,Lenny fears the lions
Molly ate a donut,1.0,0.554205,0.0,0.0
Molly ate a fish,0.554205,1.0,0.0,0.0
Jen consumed a carp,0.0,0.0,1.0,0.0
Lenny fears the lions,0.0,0.0,0.0,1.0


## TF-IDF

In [27]:
tf_idf_vectorizer =  ??

Unnamed: 0,ate,carp,consumed,donut,fears,fish,jen,lenny,lions,molly,the
Molly ate a donut,0.526405,0.0,0.0,0.667679,0.0,0.0,0.0,0.0,0.0,0.526405,0.0
Molly ate a fish,0.526405,0.0,0.0,0.0,0.0,0.667679,0.0,0.0,0.0,0.526405,0.0
Jen consumed a carp,0.0,0.57735,0.57735,0.0,0.0,0.0,0.57735,0.0,0.0,0.0,0.0
Lenny fears the lions,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.5,0.5,0.0,0.5


We'll be measuring similarity via [cosine similarity](https://www.machinelearningplus.com/nlp/cosine-similarity/), a standard measure of similarity in natural language processing. It's similar to how we might look at a graph with points at `(0,0)` and `(2,3)` and measure the distance between them - just a bit more complicated: instead of distance, it measures how closely two vectors point in the same direction.
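
To make the arithmetic less mysterious: cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below applies that by hand to the TF-IDF rows for the two Molly sentences, keeping only the four columns where either row is non-zero. The `cosine` helper is a hypothetical stand-in written for illustration, not scikit-learn's own function.

```python
import numpy as np

# TF-IDF weights copied from the table above
# (columns: ate, donut, fish, molly)
donut_sentence = np.array([0.526405, 0.667679, 0.0, 0.526405])
fish_sentence = np.array([0.526405, 0.0, 0.667679, 0.526405])

def cosine(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarity = cosine(donut_sentence, fish_sentence)
print(round(similarity, 4))
```

Only the shared words (`ate` and `molly`) contribute to the dot product, which is why two sentences with no words in common always score zero.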

In [28]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute the similarities using the TF-IDF weights
similarities = cosine_similarity(matrix2)

# Make a fancy colored dataframe about it
pd.DataFrame(similarities,
             index=sentences,
             columns=sentences) \
            .style \
            .background_gradient(axis=None)

Unnamed: 0,Molly ate a donut,Molly ate a fish,Jen consumed a carp,Lenny fears the lions
Molly ate a donut,1.0,0.554205,0.0,0.0
Molly ate a fish,0.554205,1.0,0.0,0.0
Jen consumed a carp,0.0,0.0,1.0,0.0
Lenny fears the lions,0.0,0.0,0.0,1.0


Because word counts and TF-IDF weights are never negative, document similarity here is on a scale of zero to one, with zero being completely dissimilar and one being an exact match. Each sentence has a `1` when compared to itself - they're totally equal!

* "Molly ate a donut" and "Molly ate a fish" are pretty similar to each other - over half - since there's only one word that's different between the two.
* "Jen consumed a carp" only has the nigh-useless "a" in common with them, and one-letter words like "a" never made it into the columns above, so it has a similarity score of 0 to both of the others.
* Lenny's sentence also has no shared words with anything else. It has no topics in common either, although that doesn't matter right now.
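
If you're curious why "a" never shows up as a column, you can ask a vectorizer how it splits a sentence. This sketch assumes a `CountVectorizer` with default settings, whose token pattern only keeps words of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function the vectorizer uses
# to turn raw text into a list of tokens
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer('Jen consumed a carp')
print(tokens)
```
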

In our brains, though, _consumed_ means just about the same thing as _ate_. And a carp is a kind of fish, right? **If only there were some way of teaching a computer the meaning behind words!**

## Word embeddings

Word embeddings are a step up from just counting words. Word embeddings give words _meaning_ to computers, teaching them that puppies are kind of like kittens, kittens are like cats, and shoes are very very different from all of those animals.

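We're going to be using the [spaCy word embeddings](https://spacy.io/usage/vectors-similarity/). In spaCy's medium and large English models, each word comes with a **300-dimension vector** that expresses things like how catlike the word is, whether you can wear it, or if it's something people do during a basketball game (not those exactly, but the same idea). Think of it like 300 different scores for each word, all in different categories. The small `en_core_web_sm` model loaded below is more compact: it doesn't ship static word vectors, so `.vector` returns 96 context-sensitive numbers instead.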
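
To get a feel for the "scores in different categories" idea, here is a toy sketch with only three categories. Every number in it is made up for illustration (real embeddings are learned from text, and their dimensions don't have tidy labels), and `toy_vectors` and `cosine` are hypothetical names.

```python
import numpy as np

# Made-up scores in three imaginary categories:
# (how catlike, can you wear it, basketball-related)
toy_vectors = {
    'cat':    np.array([0.9, 0.1, 0.0]),
    'kitten': np.array([0.8, 0.05, 0.0]),
    'shoe':   np.array([0.0, 0.9, 0.3]),
}

def cosine(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat_kitten = cosine(toy_vectors['cat'], toy_vectors['kitten'])
cat_shoe = cosine(toy_vectors['cat'], toy_vectors['shoe'])

print("cat vs kitten:", round(cat_kitten, 3))
print("cat vs shoe:", round(cat_shoe, 3))
```

Words with similar scores point in nearly the same direction, so `cat` and `kitten` come out far more similar than `cat` and `shoe`, even though none of the three words share any letters or counts.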


In [6]:
import spacy

nlp = spacy.load("en_core_web_sm")

For example, let's check out the dimensions of facts and feelings that spaCy knows about the word `cat` (96 of them with the small model).

In [7]:
nlp('cat').vector

array([ 1.0514209 ,  0.34937912, -0.6064172 , -0.03143853, -0.25396183,
        0.25881147, -0.9598677 , -0.3540864 , -1.3083378 ,  0.8775649 ,
       -1.0778348 ,  0.36540714,  1.1036501 ,  0.22029386,  0.2998408 ,
        0.58526474,  0.08069035, -0.27865058,  0.5414338 ,  0.3210649 ,
        0.10975   , -0.63044214,  0.33942896, -1.057714  ,  0.80867803,
        0.70948964, -1.399974  , -1.0233183 ,  0.0383337 , -0.91863513,
        1.761926  , -1.2162635 ,  0.12481023,  0.25336656,  0.35139418,
       -1.8781887 , -0.32565773, -0.0695347 , -1.23301   ,  1.0268142 ,
       -0.74936724, -0.3522537 ,  0.6714414 ,  0.95270234, -0.8937878 ,
       -0.11669561, -0.20789865, -0.41127247,  1.1127558 , -0.02335785,
       -0.613902  ,  0.19115907, -0.48181045, -0.31004375, -0.01746552,
        0.513177  , -0.5450033 ,  0.25294054,  0.44701692,  0.18349233,
       -0.2334654 ,  0.29693076, -0.00564367, -1.1410292 ,  0.20844807,
       -0.08538185, -0.19233677,  0.77854466,  0.4607009 , -0.50

In [8]:
nlp('Some people have never eaten a taco').vector

array([-0.21850875, -0.182483  ,  0.39083776, -0.1519959 ,  0.38807243,
        0.0996739 , -0.56210274,  0.05712489,  0.46235996,  0.27945736,
       -0.03844428, -0.32904378, -0.3669289 , -0.37070793,  0.49637154,
        0.21949227,  0.59886414, -0.41467664,  0.01052142, -0.16370185,
       -0.28307548, -0.04534994, -0.35103294,  0.21465214, -0.24565716,
       -0.23907171, -0.31155214, -0.4792877 , -0.20790543, -0.4090588 ,
        0.12671374,  0.14905728,  0.22306387, -0.09149681,  0.0970765 ,
       -0.8155504 , -0.11731967, -0.06842994,  0.04897319, -0.23037854,
       -0.1394358 ,  0.01630883,  0.74178326, -0.2942308 ,  0.3700109 ,
       -0.6087653 ,  0.54290015,  0.01858242,  0.32218716, -0.18664262,
        0.4828271 , -0.34867483,  0.16468588, -0.41301322, -0.21643408,
        0.36265677,  0.4309161 ,  0.24136345,  0.19377467,  0.01774977,
       -0.20550406,  0.42326966, -0.1071355 , -0.17926626, -0.16798452,
       -0.08023013, -0.05092272, -0.379553  , -0.26245686, -0.57

In order to find the similarity of each of our sentences, we'll need to convert them each into vectors.

In [9]:
# We aren't printing this because it's 4 * 96 = 384 numbers
vectors = [nlp(sentence).vector for sentence in sentences]

# Print out some notes about it
print("We have", len(vectors), "different vectors")
print("And the first one has", len(vectors[0]), "measurements")
print("And the second one has", len(vectors[1]), "measurements")
print("And the third one has", len(vectors[2]), "measurements")
print("And the fourth one has", len(vectors[3]), "measurements")

We have 4 different vectors
And the first one has 96 measurements
And the second one has 96 measurements
And the third one has 96 measurements
And the fourth one has 96 measurements
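
How does a whole sentence end up with a single vector? spaCy documents `Doc.vector` as defaulting to an average of the per-token vectors. The sketch below mimics that idea with plain NumPy. The four-dimension vectors are made up, and `token_vectors` and `sentence_vector` are hypothetical names, not part of spaCy.

```python
import numpy as np

# Made-up 4-dimension vectors standing in for real token vectors
token_vectors = {
    'molly': np.array([0.2, 0.1, 0.0, 0.7]),
    'ate':   np.array([0.0, 0.9, 0.1, 0.0]),
    'a':     np.array([0.1, 0.1, 0.1, 0.1]),
    'donut': np.array([0.5, 0.3, 0.8, 0.0]),
}

def sentence_vector(sentence):
    """Average the vectors of the lowercased, whitespace-separated words."""
    words = sentence.lower().split()
    return np.mean([token_vectors[word] for word in words], axis=0)

vector = sentence_vector('Molly ate a donut')
print(vector.shape)
```

However many words go in, the result has the same number of measurements as a single word, which is why every sentence above ends up with the same length.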


It might be useful to compare these 96 measurements-per-sentence to what we were doing before. If we look back to when we were doing similarity by counting words, we only had **eleven measurements for each sentence: one count for every unique word.**

In [10]:
counts

Unnamed: 0,ate,carp,consumed,donut,fears,fish,jen,lenny,lions,molly,the
Molly ate a donut,1,0,0,1,0,0,0,0,0,1,0
Molly ate a fish,1,0,0,0,0,1,0,0,0,1,0
Jen consumed a carp,0,1,1,0,0,0,1,0,0,0,0
Lenny fears the lions,0,0,0,0,1,0,0,1,1,0,1


In [11]:
# Compute similarities
similarities = cosine_similarity(vectors)

# Turn into a dataframe
pd.DataFrame(similarities,
            index=sentences,
            columns=sentences) \
            .style \
            .background_gradient(axis=None)

Unnamed: 0,Molly ate a donut,Molly ate a fish,Jen consumed a carp,Lenny fears the lions
Molly ate a donut,1.0,0.906917,0.680188,0.541068
Molly ate a fish,0.906917,1.0,0.686074,0.64727
Jen consumed a carp,0.680188,0.686074,1.0,0.429788
Lenny fears the lions,0.541068,0.64727,0.429788,1.0
