# Week 2: Embeddings Concept Book

# Introduction
<hr style="border:2px solid gray"> </hr>


In the previous notebook, we learned about categorical variables and the challenge of feeding them into AI models. Here, we work with human language, which is made up of tens of thousands of distinct tokens. The problem looks insurmountable, because we are no longer dealing with a small number of categories.

Here, as we did with the latent factors in recommender systems, we represent each word as a large vector. Not only does this solve the encoding problem, it also produces meaningful coordinates that let us measure the similarity between words, find homonyms and synonyms, and much more!

### Goals and Objectives
<hr style="border:2px solid gray"> </hr>


* Appreciate how words can be represented as vectors
* Grasp the concept behind the Distribution Hypothesis
* See the potential of embeddings

### Key Ideas
<hr style="border:2px solid gray"> </hr>

* Vector Encoding
* Feature Space
* Distribution Hypothesis
* Dimensionality Reduction

In [None]:
is_kaggle = True   # True if you are on Kaggle, False for local Windows, Linux or Mac environments.

In [None]:
# libraries installation
if is_kaggle:
    !pip install -Uqqq spaCy 
    !python -m spacy download en_core_web_lg
    from IPython.display import clear_output
    clear_output()

In [None]:
from scipy import spatial # to compute the cosine distance between word vectors
import spacy # to load pretrained word embeddings

In [None]:
nlp = spacy.load('en_core_web_lg')

## The Distribution Hypothesis
<hr style="border:2px solid gray"> </hr>

Word embeddings are built on top of something called the <em>Distribution Hypothesis</em> (often referred to as the distributional hypothesis).

The Distribution Hypothesis, in plain English, is the idea that in large bodies of text, related words tend to appear close to each other, and in similar contexts, more often than unrelated words do.
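
To make the idea concrete, here is a minimal sketch on a made-up four-sentence corpus. The corpus, the window size of 2, and the helper name `context_counts` are illustrative assumptions; this is not how `en_core_web_lg` was trained. It counts which words appear near a target word, and shows that words used in similar contexts share many of the same neighbours.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
]

def context_counts(target, sentences, window=2):
    """Count the words found within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            # Add every neighbour in the window, skipping the target itself
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

cat_context = context_counts("cat", corpus)
dog_context = context_counts("dog", corpus)
milk_context = context_counts("milk", corpus)

# Neighbours shared by each pair of words
shared_cat_dog = sorted(set(cat_context) & set(dog_context))
shared_cat_milk = sorted(set(cat_context) & set(milk_context))
print("cat / dog :", shared_cat_dog)
print("cat / milk:", shared_cat_milk)
```

In this toy corpus, "cat" and "dog" share three neighbours while "cat" and "milk" share only one. Embedding models exploit this kind of signal at a much larger scale.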

Embeddings can be trained on data such as Wikipedia, news articles, or question-and-answer collections, and the results can be surprisingly good.

How close two of these vectors are can be measured with cosine similarity, which is the cosine of the angle between them.
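
As a rough illustration of what cosine similarity computes, the sketch below implements the formula by hand in plain Python. The name `cosine_similarity_manual` is hypothetical, chosen so that it does not clash with the SciPy-based helper this notebook defines later.

```python
import math

def cosine_similarity_manual(x, y):
    """cos(theta) = (x . y) / (|x| * |y|) for two equal-length sequences."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

same_direction = cosine_similarity_manual([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity_manual([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine_similarity_manual([1.0, 2.0], [-1.0, -2.0])       # opposite directions
print(same_direction, orthogonal, opposite)
```

The result is close to 1 for vectors pointing the same way, 0 for perpendicular vectors, and close to -1 for opposite ones. Only the direction matters, not the length of the vectors.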

Let's say that we have four word vectors: A1, A2, B1 and B2.
The relationship between A1 and A2 is almost the same as the one between B1 and B2 (for instance, A1 = man and A2 = woman, B1 = king and B2 = queen).

Then, we have approximately
A1 - A2 ≈ B1 - B2

So, an approximation B2' of B2 is given by

B2' = B1 - (A1 - A2)

B2' = B1 - A1 + A2
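
Before trying this on real embeddings, the arithmetic can be checked on hand-made vectors. The sketch below assumes a made-up 2D space where one axis stands for royalty and the other for gender. The values are invented so that the analogy holds exactly, which real learned embeddings only do approximately.

```python
import numpy as np

# Hand-made 2-D "embeddings": axis 0 ~ royalty, axis 1 ~ gender.
# These values are invented for illustration; real embeddings are learned.
toy = {
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

a1, a2, b1 = toy["man"], toy["woman"], toy["king"]
b2_approx = b1 - a1 + a2   # B2' = B1 - A1 + A2

def cos(x, y):
    """Cosine similarity between two NumPy vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Rank the remaining words by similarity, leaving out the three inputs
candidates = {w: v for w, v in toy.items() if w not in ("man", "woman", "king")}
closest = max(candidates, key=lambda w: cos(b2_approx, candidates[w]))
print(closest)
```

Leaving the input words out of the ranking is a common choice, because the result vector often stays very close to one of them.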

In [None]:
cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

In [None]:
A1 = nlp.vocab['female'].vector
A2 = nlp.vocab['male'].vector

B1 = nlp.vocab['queen'].vector

# We now need to find the closest vector in the vocabulary to the result of "queen" - "female" + "male"
B2_approximation = B1 - A1 + A2
# approximation of king = queen - female + male

computed_similarities = []
 
for word in nlp.vocab:
    # Ignore words without vectors
    if not word.has_vector:
        continue
 
    similarity = cosine_similarity(B2_approximation, word.vector)
    computed_similarities.append((word, similarity))
 
computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])
print('Closest words in the vector space: ')
print([w[0].text for w in computed_similarities[:9]])

Here are the words most similar to the resulting vector. It works surprisingly well, and even includes slang: "cuz" is short for "cousin", and is an informal way to address a friend, much like calling someone "dude" or "bro".

### Dimensionality Reduction
<hr style="border:2px solid gray"> </hr>


If we go back to the encoding examples, we can see how a different representation can expand the feature space tremendously. Embeddings, once trained, are surprisingly compact compared to a one-hot dictionary encoding.
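
The contrast can be sketched on a toy vocabulary. The five-word vocabulary and the `one_hot` helper below are assumptions made for illustration. The point is that a one-hot vector needs one dimension per vocabulary word and carries no notion of similarity, whereas an embedding keeps a fixed size however large the vocabulary grows.

```python
import numpy as np

vocab = ["king", "queen", "man", "woman", "apple"]   # toy vocabulary (assumption)
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector of length len(vocab) with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

king, queen = one_hot("king"), one_hot("queen")
print(king.shape)            # one dimension per vocabulary word
print(float(king @ queen))   # two different one-hot vectors are always orthogonal

# A one-hot vector grows with the vocabulary; an embedding keeps a fixed size
embedding_dim = 300          # size of the vectors used in this notebook
for vocab_size in (len(vocab), 50_000):
    print(f"{vocab_size} words: one-hot length {vocab_size}, embedding length {embedding_dim}")
```

With only five words the one-hot vector is the shorter of the two. The saving appears once the vocabulary reaches tens of thousands of tokens.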

In [None]:
A1.shape

The embedding for a word, in this model, is a one-dimensional vector of size 300.
But we can also think of it as a point in a 300-dimensional space (it has 300 coordinates), and we can use Principal Component Analysis (PCA) to represent it in a 2D or 3D space.
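
A minimal sketch of that projection, assuming PCA is computed with NumPy's SVD rather than a dedicated library. The `pca_project` helper is hypothetical, and random vectors stand in for real word embeddings so that the example runs without the spaCy model.

```python
import numpy as np

def pca_project(vectors, n_components=2):
    """Project row vectors onto their first principal components."""
    centered = vectors - vectors.mean(axis=0)   # PCA works on mean-centred data
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Stand-in data: 10 random 300-D vectors instead of real word embeddings
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(10, 300))

points_2d = pca_project(fake_embeddings, n_components=2)
print(points_2d.shape)   # one 2-D point per input vector
```

In this notebook, the random matrix could be replaced by a stack of real vectors, for example `np.stack([nlp.vocab[w].vector for w in words])` for a list `words` of your choice, and the resulting 2D points plotted as a scatter chart.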