<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 2 Lesson 4*

# Word Embeddings

![](https://d33wubrfki0l68.cloudfront.net/b2e6528737d6e532e7f39a50f7f1c3ffae9b6559/93c1e/blog/img/sense2vec.jpg)
*Image credit: [Explosion.ai](https://explosion.ai/blog/sense2vec-with-spacy)*

## Unstructured -> Structured

Processing text data to prepare it for machine learning models often means translating the information in documents into a numerical format. Bag-of-Words approaches (sometimes referred to as frequency-based word embeddings) accomplish this by "vectorizing" tokenized documents: each document becomes a row in a dataframe, and each unique word in the corpus (the group of documents) becomes a column. Each cell then holds either the raw count of how many times that word appears in that document (CountVectorizer) or that word's TF-IDF score (TfidfVectorizer).

## BoW discards textual context

One limitation of Bag-of-Words approaches is that any information about the textual context surrounding a word is lost. As a result, often the only tools we have for identifying words with similar usage or meaning, and consolidating them into a single column, are stemming and lemmatization. Both are quite limited: they tend to merge two words only when the words are very close in spelling or share a root form.

## Word2Vec approaches preserve more textual context

Word2Vec is a popular word embedding technique. Like Bag-of-Words, it produces a real-valued vector representation for a predefined, fixed-size vocabulary generated from a corpus of text. In contrast to BoW, however, Word2Vec approaches are much more capable of accounting for textual context, and are better at discovering words with similar meanings or usages (semantic or syntactic similarity).

# CountVectorizer

### Corpus:

1) "the cat and dog sat"

2) "the dog and cat sat"

3) "the cat sat and sat"

4) "the cat killed the dog"

### Vocabulary:

{"the": 1, "cat": 2, "sat": 3, "dog": 4, "and": 5, "killed": 6}

### Vectorization

| document | the | cat | sat | dog | and | killed |
|:----|:-----|:-----|:-----|:-----|:-----|:--------|
| d1 | 1   | 1   | 1   | 1   | 1   | 0      |
| d2 | 1   | 1   | 1   | 1   | 1   | 0      |
| d3 | 1   | 1   | 2   | 0   | 1   | 0      |
| d4 | 2   | 1   | 0   | 1   | 0   | 1      |
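
As a minimal sketch of what this table encodes, the counts can be reproduced with the standard library alone. The helper name `count_vectorize` is hypothetical and only for illustration; in practice scikit-learn's `CountVectorizer` does this job.

```python
from collections import Counter

corpus = [
    "the cat and dog sat",
    "the dog and cat sat",
    "the cat sat and sat",
    "the cat killed the dog",
]
# Column order follows the vocabulary listed above.
vocabulary = ["the", "cat", "sat", "dog", "and", "killed"]


def count_vectorize(document, vocabulary):
    """Return the raw count of each vocabulary word in one document."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]


count_matrix = [count_vectorize(doc, vocabulary) for doc in corpus]
for name, row in zip(["d1", "d2", "d3", "d4"], count_matrix):
    print(name, row)
```

Notice that d1 and d2 end up with identical vectors even though their word order differs. That is exactly the loss of context described earlier.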


# TF-IDF

### Corpus:

1) "the cat and dog sat"

2) "the dog and cat sat"

3) "the cat sat and sat"

4) "the cat killed the dog"

### Vocabulary:

{"the": 1, "cat": 2, "sat": 3, "dog": 4, "and": 5, "killed": 6}

### Vectorization

| document   | the | cat | sat | dog | and | killed |
|----|-----|-----|-----|-----|-----|--------|
| d1 | .25   | .25   | .33   | .33   | .33   | 0      |
| d2 | .25   | .25   | .33   | .33   | .33   | 0      |
| d3 | .25   | .25   | .67   | 0   | .33   | 0      |
| d4 | .5   | .25   | 0   | .33   | 0   | 1.00      |
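
To make the idea of a TF-IDF score concrete, here is a rough sketch using one common textbook definition: term frequency (count divided by document length) times `log(N / document frequency)`. This is an assumption chosen for illustration, and the function name `tf_idf` is hypothetical. The numbers it produces will not match the illustrative table above, and scikit-learn's `TfidfVectorizer` gives different numbers again because it smooths the IDF term and normalizes each row.

```python
import math
from collections import Counter

corpus = [
    "the cat and dog sat",
    "the dog and cat sat",
    "the cat sat and sat",
    "the cat killed the dog",
]
vocabulary = ["the", "cat", "sat", "dog", "and", "killed"]
tokenized = [doc.split() for doc in corpus]


def tf_idf(tokenized_docs, vocabulary):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(tokenized_docs)
    # Number of documents in which each vocabulary word appears at least once.
    doc_freq = {w: sum(w in doc for doc in tokenized_docs) for w in vocabulary}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        rows.append([
            (counts[w] / len(doc)) * math.log(n_docs / doc_freq[w])
            for w in vocabulary
        ])
    return rows


scores = tf_idf(tokenized, vocabulary)
for name, row in zip(["d1", "d2", "d3", "d4"], scores):
    print(name, [round(value, 3) for value in row])
```

Under this definition, words that occur in every document (here "the" and "cat") score zero everywhere, while a word unique to one document ("killed") gets the highest weight in that document.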

# Word2Vec Intuition

## The Distributional Hypothesis

In order to understand how Word2Vec preserves textual context, we have to understand what's called the Distributional Hypothesis (reference: [Distributional semantics](https://en.wikipedia.org/wiki/Distributional_semantics)). The Distributional Hypothesis assumes that words that appear in similar contexts have similar meanings. Practically speaking, this means that if two words tend to have similar words both to their left and to their right throughout the corpus, then those words share a context and are assumed to have similar meanings.

> "You shall know a word by the company it keeps" - John Firth

This means that we let the usage of a word define its meaning and its "similarity" to other words. In the following example, which words would you say have a similar meaning? 

**Sentence 1**: Traffic was light today

**Sentence 2**: Traffic was heavy yesterday

**Sentence 3**: Prediction is that traffic will be smooth-flowing tomorrow since it is a national holiday

What words in the above sentences seem to have a similar meaning if all you knew about them was the context in which they appeared above? 

Let's take a look at how this might work in action. The following example is simplified, but it will give you an idea of the intuition behind the approach.

### Corpus:

1) "It was the sunniest of days."

2) "It was the rainiest of days."

### Vocabulary:

{"it": 1, "was": 2, "the": 3, "of": 4, "days": 5, "sunniest": 6, "rainiest": 7}

### Vectorization

| word     | START_was | it_the | was_sunniest | the_of | sunniest_days | of_it | days_was | it_the | was_rainiest | rainiest_days | of_END |
|----------|-----------|--------|--------------|--------|---------------|-------|----------|--------|--------------|---------------|--------|
| it       | 1         | 0      | 0            | 0      | 0             | 0     | 1        | 0      | 0            | 0             | 0      |
| was      | 0         | 1      | 0            | 0      | 0             | 0     | 0        | 1      | 0            | 0             | 0      |
| the      | 0         | 0      | 1            | 0      | 0             | 0     | 0        | 0      | 1            | 0             | 0      |
| sunniest | 0         | 0      | 0            | 1      | 0             | 0     | 0        | 0      | 0            | 0             | 0      |
| of       | 0         | 0      | 0            | 0      | 1             | 0     | 0        | 0      | 0            | 1             | 0      |
| days     | 0         | 0      | 0            | 0      | 0             | 1     | 0        | 0      | 0            | 0             | 1      |
| rainiest | 0         | 0      | 0            | 1      | 0             | 0     | 0        | 0      | 0            | 0             | 0      |

Each column represents one context, in this case defined by the words immediately to the left and right of the center word. The two sentences are read as one running text, which is why "days" at the end of the first sentence is followed by "it". How far we look to the left and right of a given word is referred to as our "window of context." Each row vector represents the different usages of a given word. Word2Vec can consider a larger context than only the words immediately to the left and right of a given word, but we're going to keep our window of context small for this example. What's most important is that this vectorization has translated our documents from a text representation to a numeric one in a way that preserves information about the underlying context.

We can see that words that have a similar context will have similar row-vector representations, but before looking at that in more depth, let's simplify our vectorization slightly. You'll notice that the column "it_the" appears twice. Let's combine those two columns into a single one by adding them element-wise.

| word     | START_was | it_the | was_sunniest | the_of | sunniest_days | of_it | days_was | was_rainiest | rainiest_days | of_END |
|----------|-----------|--------|--------------|--------|---------------|-------|----------|--------------|---------------|--------|
| it       | 1         | 0      | 0            | 0      | 0             | 0     | 1        | 0            | 0             | 0      |
| was      | 0         | 2      | 0            | 0      | 0             | 0     | 0        | 0            | 0             | 0      |
| the      | 0         | 0      | 1            | 0      | 0             | 0     | 0        | 1            | 0             | 0      |
| sunniest | 0         | 0      | 0            | 1      | 0             | 0     | 0        | 0            | 0             | 0      |
| of       | 0         | 0      | 0            | 0      | 1             | 0     | 0        | 0            | 1             | 0      |
| days     | 0         | 0      | 0            | 0      | 0             | 1     | 0        | 0            | 0             | 1      |
| rainiest | 0         | 0      | 0            | 1      | 0             | 0     | 0        | 0            | 0             | 0      |

Now, can you spot which words have a similar row-vector representation? Hint: look for values that are repeated in a given column. Each column represents a context that a word was found in. If multiple words share a context, then those words are understood to be closer in meaning to each other than to the other words in the text.

Let's look specifically at the words sunniest and rainiest. You'll notice that these two words have exactly the same 10-dimensional vector representation. Based on this very small corpus, we would conclude that these two words have the same meaning because they share the same usage. Is this a good assumption? Well, they both refer to the weather outside, so that's better than nothing. You could imagine that as our corpus grows larger, we will be exposed to a greater number of contexts and the Distributional Hypothesis will hold better.
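
The context table can be rebuilt in a few lines. The sketch below is only an illustration of the counting idea, not how Word2Vec is actually trained. It assumes the two sentences are read as one running text padded with hypothetical `START` and `END` markers, and it spells the second adjective "rainiest".

```python
from collections import Counter, defaultdict

sentences = ["it was the sunniest of days", "it was the rainiest of days"]

# Read both sentences as one running text, padded with boundary markers.
tokens = ["START"] + " ".join(sentences).split() + ["END"]

# For every real word, count each (left neighbor, right neighbor) context.
contexts = defaultdict(Counter)
for i in range(1, len(tokens) - 1):
    contexts[tokens[i]][f"{tokens[i - 1]}_{tokens[i + 1]}"] += 1

# One column per distinct context, one row vector per word.
columns = sorted({ctx for counter in contexts.values() for ctx in counter})
vectors = {
    word: [counter[ctx] for ctx in columns]
    for word, counter in contexts.items()
}

print(len(columns))
print(vectors["sunniest"] == vectors["rainiest"])
```

The two adjectives come out with identical row vectors because the only context either of them ever appears in is `the_of`.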

# Word2Vec Variants

## Skip-Gram

The Skip-Gram method predicts the neighbors of a word given a center word. To train the model, we take each center word together with the context (neighboring) words inside a window around it, and learn to predict those context words, out to the chosen window size, from the center word.

This notion of “context” or “neighboring” words is best described by considering a center word and a window of words around it. 

For example, if we consider the sentence **“The speedy Porsche drove past the elegant Rolls-Royce”** and a window size of 2, we’d have the following pairs for the skip-gram model:

**Text:**
**The**	speedy	Porsche	drove	past	the	elegant	Rolls-Royce

*Training Sample with window of 2*: (the, speedy), (the, Porsche)

**Text:**
The	**speedy**	Porsche	drove	past	the	elegant	Rolls-Royce

*Training Sample with window of 2*: (speedy, the), (speedy, Porsche), (speedy, drove)

**Text:**
The	speedy	**Porsche**	drove	past	the	elegant	Rolls-Royce

*Training Sample with window of 2*: (Porsche, the), (Porsche, speedy), (Porsche, drove), (Porsche, past)

**Text:**
The	speedy	Porsche	**drove**	past	the	elegant	Rolls-Royce

*Training Sample with window of 2*: (drove, speedy), (drove, Porsche), (drove, past), (drove, the)
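
The training samples above can be generated mechanically. Here is a minimal sketch; the function name `skipgram_pairs` is hypothetical, and real implementations add refinements such as subsampling frequent words that are not shown here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "The speedy Porsche drove past the elegant Rolls-Royce".split()
pairs = skipgram_pairs(tokens, window=2)

# Show the samples for one center word.
print([pair for pair in pairs if pair[0] == "drove"])
```

Words near the start or end of the sentence simply produce fewer pairs, because part of their window falls outside the text.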

The **Skip-gram model** outputs a probability distribution: the probability of each word appearing in the context of a given center word. Training adjusts the word vectors to maximize the probability of the context words that were actually observed.

With CountVectorizer and TF-IDF the best we could do for context was to look at common bi-grams and tri-grams (n-grams). Well, skip-grams go far beyond that and give our model much stronger contextual information.


![alt text](https://www.dropbox.com/s/c7mwy6dk9k99bgh/Image%202%20-%20SkipGrams.jpg?raw=1)

## Continuous Bag of Words

This model takes the opposite approach from the skip-gram model: it tries to predict a center word from its neighboring words. In the CBOW model, we input the context words within the window (such as “the”, “Porsche”, “drove”) and aim to predict the target or center word “speedy”. The input to the prediction pipeline is reversed compared to the Skip-Gram model.

The diagrams of the input-to-output prediction pipeline for the two variants (Skip-Gram above, CBOW below) help crystallize the difference between Skip-Grams and Continuous Bag of Words.

![alt text](https://www.dropbox.com/s/k3ddmbtd52wq2li/Image%203%20-%20CBOW%20Model.jpg?raw=1)
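
The CBOW pairing described above can be sketched the same way as the skip-gram samples, with the direction reversed: the whole window becomes the input and the center word becomes the target. The function name `cbow_examples` is hypothetical and the sketch only builds the training examples, not the model.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, center_word) training examples."""
    examples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + window + 1]
        examples.append((left + right, center))
    return examples


tokens = "The speedy Porsche drove past the elegant Rolls-Royce".split()
examples = cbow_examples(tokens, window=2)

# The example whose target is "speedy".
print(examples[1])
```

Each center word yields one CBOW example with several inputs, whereas skip-gram yields several examples with one input each.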

## Notable Differences between Word Embedding methods:

1) W2V focuses less on document-level topic modeling. You'll notice that the vectorizations don't retain much information about the original document that the words came from. At least not in our examples.

2) W2V vectors are dense and learned by a neural network rather than counted directly. Word2Vec itself is a shallow network, and we will train small models from scratch below, but good general-purpose embeddings need a very large corpus, so we can also use helpful pretrained embeddings (thank you Google) to do really cool things!

# Let's give it a go!

In [1]:
!pip install -U gensim
import gensim

Requirement already up-to-date: gensim in /Users/jonathansokoll/anaconda3/lib/python3.7/site-packages (3.7.2)


## Let's just download all of nltk like a madman.

![](https://media.giphy.com/media/kYkQYXkO3XyRa/giphy.gif)

In [2]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/jonathansokoll/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/jonathansokoll/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /Users/jonathansokoll/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/jonathansokoll/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /Users/jonathansokoll/nltk_data...
[nltk_data]    |   Unzipping corpora/brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     /Users/jonathansokoll/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_cat.zip.
[nltk_dat

True

### Tokenize some documents. You know the drill.

In [4]:
# Step 1
raw_content = ['The dog ran up the steps and entered the owner\'s room to check if the owner was in the room.',
               'My name is Aaron Gallant, commander of the Machine Learning program at Lambda School.',
               'I am creating the curriculum for the Machine Learning program and will be teaching the full-time Machine Learning program.',
               'Machine Learning is one of my favorite subjects.',
               'I am excited about taking the Machine Learning class at the Lambda school starting in April.',
               'When does the Machine Learning program kick-off at Lambda school?',
               'The batter hit the ball out off AT&T park into the pacific ocean.',
               'The pitcher threw the ball into the dug-out.']

from nltk.tokenize import word_tokenize
sentences = [word_tokenize(text) for text in raw_content]
print(sentences)

[['The', 'dog', 'ran', 'up', 'the', 'steps', 'and', 'entered', 'the', 'owner', "'s", 'room', 'to', 'check', 'if', 'the', 'owner', 'was', 'in', 'the', 'room', '.'], ['My', 'name', 'is', 'Aaron', 'Gallant', ',', 'commander', 'of', 'the', 'Machine', 'Learning', 'program', 'at', 'Lambda', 'School', '.'], ['I', 'am', 'creating', 'the', 'curriculum', 'for', 'the', 'Machine', 'Learning', 'program', 'and', 'will', 'be', 'teaching', 'the', 'full-time', 'Machine', 'Learning', 'program', '.'], ['Machine', 'Learning', 'is', 'one', 'of', 'my', 'favorite', 'subjects', '.'], ['I', 'am', 'excited', 'about', 'taking', 'the', 'Machine', 'Learning', 'class', 'at', 'the', 'Lambda', 'school', 'starting', 'in', 'April', '.'], ['When', 'does', 'the', 'Machine', 'Learning', 'program', 'kick-off', 'at', 'Lambda', 'school', '?'], ['The', 'batter', 'hit', 'the', 'ball', 'out', 'off', 'AT', '&', 'T', 'park', 'into', 'the', 'pacific', 'ocean', '.'], ['The', 'pitcher', 'threw', 'the', 'ball', 'into', 'the', 'dug-ou

### Train the Word2vec model with tokenized content 

We set the size of the word vectors to 5 and `min_count=1`, so every word that shows up at least once in the raw content is kept in the vocabulary.

In [5]:
# Step 2
from gensim.models.word2vec import Word2Vec

help(Word2Vec)

Help on class Word2Vec in module gensim.models.word2vec:

class Word2Vec(gensim.models.base_any2vec.BaseWordEmbeddingsModel)
 |  Word2Vec(sentences=None, corpus_file=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=(), max_final_vocab=None)
 |  
 |  Train, use and evaluate neural networks described in https://code.google.com/p/word2vec/.
 |  
 |  Once you're finished training a model (=no more updates, only querying)
 |  store and use only the :class:`~gensim.models.keyedvectors.KeyedVectors` instance in `self.wv` to reduce memory.
 |  
 |  The model can be stored/loaded via its :meth:`~gensim.models.word2vec.Word2Vec.save` and
 |  :meth:`~gensim.models.word2vec.Word2Vec.load` methods.
 |  
 |  The trained word vectors can a

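### Let's take a look at our model

In [7]:
# gensim 3.x API (matches the version installed above); gensim >= 4.0 renamed `size` to `vector_size`
model = Word2Vec(sentences, min_count=1, size=5)
dir(model)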

['__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_adapt_by_suffix',
 '_check_input_data_sanity',
 '_check_training_sanity',
 '_clear_post_train',
 '_do_train_epoch',
 '_do_train_job',
 '_get_job_params',
 '_get_thread_working_mem',
 '_job_producer',
 '_load_specials',
 '_log_epoch_end',
 '_log_epoch_progress',
 '_log_progress',
 '_log_train_end',
 '_minimize_model',
 '_raw_word_count',
 '_save_specials',
 '_set_train_params',
 '_smart_save',
 '_train_epoch',
 '_train_epoch_corpusfile',
 '_update_job_params',
 '_worker_loop',
 '_worker_loop_corpusfile',
 'accuracy',
 'alpha',
 'batch_words',
 'build_vocab',
 'build_vocab_from_freq',
 'ca

In [8]:
print(model)
# gensim 3.x API; gensim >= 4.0 replaced `wv.vocab` with `wv.key_to_index`
print(list(model.wv.vocab))
print(len(model.wv.vocab))

Word2Vec(vocab=70, size=5, alpha=0.025)
['The', 'dog', 'ran', 'up', 'the', 'steps', 'and', 'entered', 'owner', "'s", 'room', 'to', 'check', 'if', 'was', 'in', '.', 'My', 'name', 'is', 'Aaron', 'Gallant', ',', 'commander', 'of', 'Machine', 'Learning', 'program', 'at', 'Lambda', 'School', 'I', 'am', 'creating', 'curriculum', 'for', 'will', 'be', 'teaching', 'full-time', 'one', 'my', 'favorite', 'subjects', 'excited', 'about', 'taking', 'class', 'school', 'starting', 'April', 'When', 'does', 'kick-off', '?', 'batter', 'hit', 'ball', 'out', 'off', 'AT', '&', 'T', 'park', 'into', 'pacific', 'ocean', 'pitcher', 'threw', 'dug-out']
70


### Output the vector of words

Let's look at the vectors for the following tokens: a) curriculum, b) ocean, and c) pitcher

In [9]:
# Step 4
print(model.wv['curriculum', 'ocean', 'pitcher'])

[[-0.09642836  0.07874887  0.03452344 -0.0228136   0.07907513]
 [ 0.04031094 -0.01060158 -0.06604789 -0.07084186 -0.00400378]
 [ 0.05306373  0.05876096  0.04561632 -0.00241437 -0.00500072]]


In [10]:
model.wv.most_similar('Machine')

[('School', 0.92108553647995),
 ('commander', 0.9050584435462952),
 ('taking', 0.8464348316192627),
 ('.', 0.8354135751724243),
 ('if', 0.8092228770256042),
 ('dug-out', 0.7784776091575623),
 ('at', 0.7723556756973267),
 ('Learning', 0.753736138343811),
 ('and', 0.7488025426864624),
 ('up', 0.7221638560295105)]
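
The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that measure, here it is applied to the three vectors printed earlier for "curriculum", "ocean", and "pitcher". The helper `cosine_similarity` is written out by hand for illustration and is not gensim's own implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors copied from the model output above.
curriculum = [-0.09642836, 0.07874887, 0.03452344, -0.0228136, 0.07907513]
ocean = [0.04031094, -0.01060158, -0.06604789, -0.07084186, -0.00400378]
pitcher = [0.05306373, 0.05876096, 0.04561632, -0.00241437, -0.00500072]

print(cosine_similarity(curriculum, ocean))
print(cosine_similarity(curriculum, pitcher))
print(cosine_similarity(ocean, pitcher))
```

A score near 1 means two vectors point in nearly the same direction, and a score near -1 means they point in opposite directions. With only eight training sentences and 5-dimensional vectors, the rankings are likely to be mostly noise, which is why the neighbors of "Machine" above look so arbitrary.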

![Got a Fever](http://i.imgur.com/VV53EPb.jpg)

### MORE DATA!

Now we are going to train the model on more data: a larger corpus, the 20 newsgroups text dataset. We fetch the data from the training subset.

*Reference*: http://scikit-learn.org/stable/datasets/index.html

In [11]:
from sklearn.datasets import fetch_20newsgroups
text_from_corpus = fetch_20newsgroups(subset='train')

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


### What did I even just import?

Output the metadata for the data that is fetched (investigate the object and what you can do with it)

In [12]:
# Step 6
print(dir(text_from_corpus))
print(text_from_corpus.DESCR)

['DESCR', 'data', 'filenames', 'target', 'target_names']
.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            188

![](https://memegenerator.net/img/instances/37841437/how-much-are-we-talking-about.jpg)

### How much data are we talkin' bout? 

In [13]:
len(text_from_corpus.data)

11314

### Tokenize it!

![I heard you like tokens](https://www.drupal.org/files/project-images/53298506.jpg)

In [14]:
import string

def process_text(text):
  """Remove punctuation, lowercase, and tokenize text."""
  # TODO: check for special cases like "I'll"
  text = "".join([char.lower() for char in text
                  if char not in string.punctuation])
  return word_tokenize(text)

sentences = [process_text(document) for document in text_from_corpus.data]

print(sentences[:10])
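
To see what the TODO about "I'll" refers to, here is a small sketch of the same cleaning step. It uses plain `str.split()` instead of `word_tokenize` so that it runs without NLTK, which is a simplification, and the function name `strip_and_split` is hypothetical.

```python
import string


def strip_and_split(text):
    """Lowercase, drop ASCII punctuation, then split on whitespace."""
    cleaned = "".join(
        char.lower() for char in text if char not in string.punctuation
    )
    return cleaned.split()


tokens = strip_and_split("I'll meet you at AT&T park, near the dug-out.")
print(tokens)
```

Because the apostrophe, ampersand, and hyphen are all in `string.punctuation`, "I'll" collapses to "ill", "AT&T" to "att", and "dug-out" to "dugout". That is acceptable for a quick experiment, but it merges some distinct words.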



### Train the model
Train the Word2Vec model: words must appear at least 3 times in the corpus to be kept (`min_count=3`), each word vector has 200 dimensions (`size=200`), and the context window reaches up to 2 words on each side (`window=2`).

Reference: scroll down to the section "A closer look at the parameter settings" to review the parameters that can be set.

In [15]:
news_model = Word2Vec(sentences, min_count=3, size=200, window=2)

### Generate the Vocabulary - or at least look at how big it is

![Vocabulary Words](https://www.fluentu.com/blog/english/wp-content/uploads/sites/4/2015/04/dAeF4PO.png)

In [16]:
# Step 10
print(len(news_model.wv.vocab))

43312


![Wow](https://media1.tenor.com/images/c2a921072f98952c52042d6e28c72854/tenor.gif?itemid=9987719)

### Examine word similarity to the word "Christ" (find other words most similar to it)

In [17]:
# Step 11
news_model.wv.most_similar('christ')

[('jesus', 0.918633759021759),
 ('spirit', 0.8739626407623291),
 ('satan', 0.8596612215042114),
 ('himself', 0.8482229113578796),
 ('lord', 0.8443716764450073),
 ('resurrection', 0.8276926279067993),
 ('father', 0.8269971013069153),
 ('god', 0.8263020515441895),
 ('messiah', 0.8168060779571533),
 ('son', 0.810607373714447)]

![](https://memegenerator.net/img/instances/73402168/i-know-some-of-these-words.jpg)



### What other words should we try?

In [23]:
### Try some things


[('richard', 0.9503076076507568),
 ('jason', 0.9434071779251099),
 ('robert', 0.9414317607879639),
 ('ken', 0.9404526948928833),
 ('stephen', 0.9341440200805664),
 ('phil', 0.9331650733947754),
 ('daniel', 0.9313520193099976),
 ('lee', 0.9293191432952881),
 ('scott', 0.9288771152496338),
 ('craig', 0.9272451400756836)]

# Let's try it with a different dataset:

![The Simpsons](https://media1.tenor.com/images/0273468e8e2921a39d75aa1f2ca461a2/tenor.gif?itemid=3865850)

<https://www.kaggle.com/pierremegret/dialogue-lines-of-the-simpsons>

In [None]:
##### Let's do it! #####
