# Exploring spaCy with Embeddings!

There are plenty of explanations of embeddings out there, so I will not try to explain them here. Instead, I am going to use embeddings to find words that are similar to each other.
This similarity means that the words are used in the same contexts. We are going to test this in the following example.

In [1]:
# pip install spacy
# run the line below in the Anaconda Prompt, with the correct environment selected
# python -m spacy download en_core_web_lg

In [2]:
import spacy
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
nlp = spacy.load('en_core_web_lg')

## Embeddings on similar context

In [3]:
# what an embedding looks like
king = nlp('King').vector
man = nlp('Man').vector
woman = nlp('Woman').vector
king

array([  0.19343 ,  -4.1969  ,   4.8175  ,  -0.72863 ,   2.3177  ,
        -1.4221  ,   0.28923 ,   0.062839,   3.6781  ,   2.6208  ,
         1.8116  ,   0.42054 ,   3.3034  ,  -1.165   ,  -1.8362  ,
        -2.4683  ,   4.2381  ,   1.2929  ,  -0.37599 ,   3.2744  ,
        -2.8982  ,  -5.9219  ,  -1.8752  ,   3.8131  ,   6.583   ,
        -0.16072 ,  -1.1781  ,  -2.7252  ,  -3.3267  ,  -0.16564 ,
         1.4311  ,  -0.51942 ,   0.87652 ,   0.51414 ,   1.4174  ,
        -1.4736  ,   1.8717  ,  -0.99453 ,  -6.5019  ,   1.6999  ,
        -3.0466  ,  -2.4686  ,  -4.4889  ,   6.5907  ,   1.375   ,
        -3.0183  ,  -4.4784  ,   2.7568  ,   4.5392  ,  -2.9311  ,
        -3.6852  ,  -1.7053  ,   2.422   ,   3.9895  ,   5.0674  ,
         1.3144  ,   1.0707  ,  -9.2608  ,   0.62933 ,   5.3289  ,
        -3.6329  ,  -5.5805  ,   5.4988  ,   0.62285 ,   1.4319  ,
         2.2446  ,  -1.9759  ,  -1.7883  ,   5.6889  ,  -6.1173  ,
         0.40993 ,   1.436   ,  -6.6111  ,  -4.7627  ,  -1.945

In [4]:
print(f"King embedding shape: {king.shape}")
print(f"Man embedding shape: {man.shape}") 
print(f"Woman embedding shape: {woman.shape}")
# notice all have the same length!

King embedding shape: (300,)
Man embedding shape: (300,)
Woman embedding shape: (300,)


In [5]:
# hopefully this arithmetic gives us the queen!
queen = (king - man) + woman

In [6]:
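def similar_words(vector):
    # most_similar expects a 2-D array of query vectors (one row per query)
    queries = np.asarray([vector])
    # returns a (keys, best_rows, scores) tuple
    ms = nlp.vocab.vectors.most_similar(queries, n=10)
    # map the hash keys of the first (and only) query back to strings
    words = [nlp.vocab.strings[w] for w in ms[0][0]]
    # these are cosine similarity scores (higher means closer), not distances
    scores = ms[2]
    print(words)
    print(scores)
    return words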

In [7]:
words = similar_words(queen)

['King', '-King', 'KastKing', 'R.M.King', 'Kingi', 'King-', 'Kingz', 'Queen', 'Kingu', 'king']
[[0.6952 0.6542 0.5835 0.5786 0.5623 0.5603 0.5439 0.5421 0.5272 0.5198]]


In [8]:
words = similar_words(man)

['Man', 'Brak', 'Woman', 'Womad', 'Girl', 'Fist', '-Girl', 'Womanly', 'Boy', 'Fisto']
[[1.     0.7302 0.6429 0.6036 0.5658 0.5608 0.5598 0.5513 0.542  0.5261]]


## What are we seeing?

Notice that we tried to get Queen by subtracting the vector for man from king and adding woman.
But apparently the vectors for king and queen are already close together, so after all the operations the most similar word was still King, and Queen only shows up in eighth place. Notice that Man and Woman are also close together!
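
In analogy experiments it is common to leave the input words out of the candidate list before ranking, precisely because `king - man + woman` tends to stay closest to `king` itself. The sketch below shows the idea with NumPy on tiny 3-dimensional vectors. The `toy_vectors` values are invented to reproduce the effect and are not spaCy output, and the `cosine` and `analogy` helpers are hypothetical names, not part of spaCy.

```python
import numpy as np

# Made-up 3-d vectors, chosen only to reproduce the effect (not spaCy values).
toy_vectors = {
    "king":   np.array([5.0, 1.0, 0.0]),
    "queen":  np.array([3.0, 0.0, 2.0]),
    "man":    np.array([0.0, 1.0, 0.0]),
    "woman":  np.array([0.0, 0.0, 1.0]),
    "banana": np.array([0.0, -1.0, 0.5]),
}

def cosine(a, b):
    """Cosine similarity between two 1-d vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors, exclude_inputs=True):
    """Rank words by cosine similarity to vectors[a] - vectors[b] + vectors[c]."""
    target = vectors[a] - vectors[b] + vectors[c]
    # Optionally drop the three query words from the candidates.
    skipped = {a, b, c} if exclude_inputs else set()
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in skipped]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Without the filter the input word wins; with it, the analogy answer surfaces.
print(analogy("king", "man", "woman", toy_vectors, exclude_inputs=False)[0][0])  # king
print(analogy("king", "man", "woman", toy_vectors)[0][0])                        # queen
```

With the real model, the same filtering could be applied to the list returned by `similar_words` by skipping any word whose lowercase form is one of the inputs, though whether Queen then comes first depends on the model's vectors.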

## Comparing different embedding vectors

In [9]:
# embeddings for synonyms
dictionary = {}
dictionary['happy'] = nlp('Happy').vector
dictionary['cheerful'] = nlp('Cheerful').vector
dictionary['delighted'] = nlp('Delighted').vector
# antonym
dictionary['sad'] = nlp('Sad').vector
# unrelated word
dictionary['banana'] = nlp('Banana').vector
    

In [10]:
words = similar_words(dictionary['happy'])

['Happy', '-Happy', 'Yappy', 'Happo', 'Happytime', 'appy', 'Dappy', 'Happn', 'Slappy', 'Sappy']
[[1.     0.9015 0.7374 0.7176 0.7136 0.698  0.6947 0.6864 0.6795 0.6792]]


In [11]:
for elem in dictionary.keys():
    print(f'we are comparing happy embedding with {elem}')
    print(cosine_similarity(dictionary['happy'].reshape(1, -1),dictionary[elem].reshape(1, -1)).flatten()[0])


we are comparing happy embedding with happy
1.0
we are comparing happy embedding with cheerful
0.50845826
we are comparing happy embedding with delighted
0.33883482
we are comparing happy embedding with sad
0.36764246
we are comparing happy embedding with banana
0.28835577


## What are we seeing?
Notice that embeddings do not always match our intuition. Sad is more similar to happy than delighted is, even though the latter is a synonym. Maybe that is because happy and sad are simpler words that often appear together, for instance in children's books. So it is best **not to expect that words with similar meanings have similar embeddings**.
This is similar to what happens with t-SNE: we think we understand the topic, but our intuition fails because we expect similar ideas or points to be magically grouped together, and they are not. So we need to use these concepts for what they are and what they truly provide.
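
For reference, the number that `cosine_similarity` prints is the dot product of the two vectors divided by the product of their lengths, so it measures direction only and ignores magnitude. Here is a minimal NumPy sketch of that formula. The `cosine` helper and the sample vectors are made up for illustration.

```python
import numpy as np

def cosine(a, b):
    """Dot product divided by the product of the two vector lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v = [1.0, 2.0, 3.0]
same_direction = cosine(v, [2.0, 4.0, 6.0])   # scaled copy of v: close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])   # nothing in common: 0
opposite = cosine(v, [-1.0, -2.0, -3.0])      # v flipped: close to -1
print(same_direction, orthogonal, opposite)
```

A score such as 0.37 for sad versus 0.34 for delighted therefore only says that the two directions are about equally far from happy, not that one word is a better synonym.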


# Tokens

Even though I said I would not explain embeddings, I will just say that embeddings are numerical representations of words or groups of words. Tokens, on the other hand, are the building blocks of NLP. A token is usually, but not always, a word; it can also be a period, a parenthesis, or another symbol used in writing.
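
To make the idea concrete, here is a toy tokenizer in plain Python that separates words from punctuation with a regular expression. It is only a sketch: `toy_tokenize` is a hypothetical helper, not spaCy's tokenizer, which uses language-specific rules and exceptions and would, for example, handle contractions differently.

```python
import re

def toy_tokenize(text):
    # \w+ grabs a run of letters or digits; [^\w\s] grabs one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("Please do not forget your wand!")
print(tokens)  # ['Please', 'do', 'not', 'forget', 'your', 'wand', '!']
```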

spaCy works with documents, which in short hold all the metadata that we may need to analyze text. Here is an easy example!

In [12]:
text = 'I hope you have a happy life forever in the USA, I will miss you Harry Potter. Please do not forget your wand!'
doc = nlp(text)

So let us examine what the model considers a token

In [13]:
for token in doc:
    print(token)

I
hope
you
have
a
happy
life
forever
in
the
USA
,
I
will
miss
you
Harry
Potter
.
Please
do
not
forget
your
wand
!


So the spaCy library creates this metadata for the text, starting with the tokens, and it does more: each token has some very interesting attributes that we are going to explore here!

In [23]:
token1 = doc[10]
token1.text

'USA'

In [24]:
token1.ent_type_

'GPE'

In [25]:
token2 = doc[3]
token2.text

'have'

In [26]:
token2.morph

Mood=Ind|Tense=Pres|VerbForm=Fin
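
The `morph` output is a `|`-separated list of `Feature=Value` pairs. As a minimal sketch in plain Python, the printed string can be split into a dictionary like this (the `parse_morph` helper is hypothetical; on a real token, `token.morph.to_dict()` should give the same kind of mapping directly):

```python
def parse_morph(feature_string):
    """Split 'A=x|B=y' into {'A': 'x', 'B': 'y'}; an empty string gives {}."""
    if not feature_string:
        return {}
    return dict(pair.split("=", 1) for pair in feature_string.split("|"))

features = parse_morph("Mood=Ind|Tense=Pres|VerbForm=Fin")
print(features["Tense"])  # Pres
```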

In [27]:
from spacy import displacy
displacy.render(doc,style='ent')

In [28]:
displacy.render(doc,style='dep')

**We will continue exploring this subject, stay tuned!**