In a nutshell, the word2vec algorithm takes a corpus of text as input and generates a vector of several hundred dimensions for each word in the corpus. It doesn't rely on counting how often each word occurs. Instead, it follows a more involved procedure that considers the words surrounding each word in its sentence.

Before moving further, note that you're still in the feature-engineering step as shown below:

![Feature Engineering](assets/feature_engineering.png)



# What is word2vec?

Word2vec is one of the most common feature-generation approaches for NLP tasks. It trains a shallow neural network in an unsupervised manner to convert words to vectors. At the highest level of abstraction, word2vec starts by assigning a vector of random values to each word. For a word *W*, it looks at the words that are near *W* in the sentence. It then shifts the values in the word vectors so that the vectors for words near *W* move closer to the *W* vector, and the vectors for words not near *W* move farther away from it. With a large enough corpus, this eventually results in words that often appear together having vectors that are near one another, and words that rarely or never appear together having vectors that are far apart.

This may sound quite similar to the latent semantic analysis approach that you learned about in the previous checkpoint. The conceptual difference is that LSA creates vector representations of sentences based on the words in them, while word2vec creates representations of individual words, based on the words around them.

## What is it good for?

Word2vec is strong at capturing the meanings of the words, so it's also good at detecting words that have similar meanings. The challenge with human communication is that there are many different ways to communicate the same concept. It's easy for humans to know that `the silverware` and `the utensils` can refer to the same thing. But computers can't do that unless you teach them, and this can be a real choke point for human-computer interactions. If you've ever played a text adventure game like *Colossal Cave Adventure* or *Zork*, you may have encountered the following scenario:

    GAME: You are on a forest path north of the field. A cave leads into a granite butte to the north.
    A thick hedge blocks the way to the west.
    A hefty stick lies on the ground.

    YOU: pick up stick  

    GAME: You don't know how to do that.  

    YOU: lift stick  

    GAME: You don't know how to do that.  

    YOU: take stick  

    GAME: You don't know how to do that.  

    YOU: grab stick  

    GAME: You grab the stick from the ground and put it in your bag.  

And your brain explodes from frustration. A text adventure game that incorporates a properly trained word2vec model would have vectors for `pick up`, `lift`, and `take` that are close to the vector for `grab`. Therefore, it could accept those other verbs as synonyms so that you could move ahead faster. In more practical applications, word2vec and other similar algorithms are what help a search engine return the best results for your query, not just the results that contain the exact words that you used. In fact, a search is a better example. Not only does the search engine need to understand your request, it also needs to match it to web pages that were also written by humans and therefore also use idiosyncratic language.

Next, look very briefly at the word2vec algorithm and examine how it comes up with vector representations of words that capture semantics.

## Generating vectors: Multiple algorithms

In considering the relationship between a word and its surrounding words, word2vec has two options that are the inverse of one another:

 * **Continuous bag of words (CBOW):** The identity of a word is predicted using the words near it in a sentence.
 * **Skip-gram:** The identities of the nearby words are predicted from the word that they surround. Skip-gram is slower to train than CBOW but tends to represent infrequent words better.

Now, consider the following sentence:
    
    "Terry Gilliam is a better comedian than a director" 

Focus on the word `comedian` here. CBOW will try to predict `comedian` using `is`, `a`, `better`, `than`, `a`, and `director`. Skip-gram will try to predict `is`, `a`, `better`, `than`, `a`, and `director` using the word `comedian`. In practice, for CBOW, the vector for `comedian` will be pulled closer to the other words. But for skip-gram, the vectors for the other words will be pulled closer to `comedian`.  
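To make the two setups concrete, here is a minimal sketch that builds the training examples each variant would see for the sentence above. It assumes a window of three words on either side and simple lowercase whitespace tokenization; the helper names (`context_window`, `cbow_examples`, `skipgram_examples`) are made up for this illustration and aren't part of any library.

```python
def context_window(tokens, index, window):
    """Return the tokens within `window` positions of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """CBOW: one example per position, (context words, target word)."""
    return [(context_window(tokens, i, window), target)
            for i, target in enumerate(tokens)]


def skipgram_examples(tokens, window):
    """Skip-gram: one example per (target word, context word) pair."""
    return [(target, context)
            for i, target in enumerate(tokens)
            for context in context_window(tokens, i, window)]


tokens = "Terry Gilliam is a better comedian than a director".lower().split()
position = tokens.index("comedian")

cbow = cbow_examples(tokens, window=3)
skipgram = skipgram_examples(tokens, window=3)

# CBOW: the six neighbors jointly predict `comedian`
print(cbow[position])

# Skip-gram: `comedian` predicts each neighbor separately
print([pair for pair in skipgram if pair[0] == "comedian"])
```

Note that CBOW produces one example per word position, while skip-gram produces one example per neighbor, so skip-gram generates more training examples from the same text.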

In addition to moving the vectors for nearby words closer together, each time a word is processed, some vectors are moved farther away. Word2vec has two approaches to pushing vectors apart:
 
 * **Negative sampling:** Like it sounds, each time that a word is pulled toward some neighbors, the vectors for a randomly chosen small set of other words are pushed away.
 * **Hierarchical softmax:** Every neighboring word is pulled closer or farther from a subset of words chosen based on a tree of probabilities.
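The pull-and-push idea behind negative sampling can be sketched in a few lines of plain Python. This is a simplified illustration of a single skip-gram update with one negative sample, not Gensim's actual implementation. The two-dimensional starting vectors, the learning rate, and the function name `sgns_step` are all assumptions chosen for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgns_step(center, context, negatives, lr=0.1):
    """One skip-gram negative-sampling update, applied in place.

    center:    vector of the word being processed
    context:   vector of a word that really appeared nearby (label 1)
    negatives: vectors of randomly drawn other words (label 0)
    """
    center_grad = [0.0] * len(center)
    for vec, label in [(context, 1.0)] + [(neg, 0.0) for neg in negatives]:
        # Negative error: the score should rise. Positive error: it should fall.
        error = sigmoid(dot(center, vec)) - label
        for k in range(len(center)):
            center_grad[k] += error * vec[k]
            vec[k] -= lr * error * center[k]
    for k in range(len(center)):
        center[k] -= lr * center_grad[k]


# Hand-picked toy vectors (made up for illustration)
stick = [0.1, 0.1]   # center word
grab = [0.5, 0.0]    # a true neighbor
hedge = [0.0, 0.5]   # a negative sample

before = (dot(stick, grab), dot(stick, hedge))
for _ in range(50):
    sgns_step(stick, grab, [hedge])
after = (dot(stick, grab), dot(stick, hedge))

print("score with true neighbor:  ", before[0], "->", after[0])
print("score with negative sample:", before[1], "->", after[1])
```

After the updates, the score between the center word and its true neighbor goes up, while the score with the negative sample goes down. Real training repeats this over every word in the corpus, with a fresh random draw of negatives each time.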

## What is similarity? Strengths and weaknesses of word2vec

Keep in mind that word2vec operates on the assumption that frequent proximity indicates similarity, but words can be similar in various ways. They may be conceptually similar (`royal`, `king`, and `throne`), but they may also be functionally similar (`tremendous` and `negligible` are both common modifiers of `size`). Here is a more [detailed exploration, with examples](https://quomodocumque.wordpress.com/2016/01/15/messing-around-with-word2vec/), of what similarity means in word2vec.

One cool thing about word2vec is that it can identify similarities between words that never occur near one another in the corpus. For example, consider these sentences:

    "The dog played with an elastic ball."
    "Babies prefer the ball that is bouncy."
    "I wanted to find a ball that's elastic."
    "Tracy threw a bouncy ball."

`Elastic` and `bouncy` are similar in meaning in the text but don't appear in the same sentence. However, both appear near `ball`. In the process of nudging the vectors around so that `elastic` and `bouncy` are both near the vector for `ball`, the words also become nearer to one another and their similarity can be detected.
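You can check the premise of this example with a simple count-based sketch. Note that this is not word2vec: it only counts which words share a sentence with `elastic` and with `bouncy`, and compares the two count profiles. It assumes lowercase whitespace tokenization with the punctuation stripped by hand, and the helper names are made up for illustration.

```python
import math
from collections import Counter

toy_sentences = [
    "the dog played with an elastic ball",
    "babies prefer the ball that is bouncy",
    "i wanted to find a ball that's elastic",
    "tracy threw a bouncy ball",
]


def context_counts(word, sentences):
    """Count the other words that share a sentence with `word`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    numerator = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return numerator / (norm_a * norm_b) if norm_a and norm_b else 0.0


elastic = context_counts("elastic", toy_sentences)
bouncy = context_counts("bouncy", toy_sentences)

print("Times the two words share a sentence:", elastic["bouncy"])
print("Context words they have in common:", sorted(set(elastic) & set(bouncy)))
print("Similarity of their context profiles:", round(cosine(elastic, bouncy), 3))
```

The two words never co-occur, yet their context profiles overlap because both appear with `ball`. Word2vec picks up on the same kind of signal, but it learns dense vectors rather than comparing raw counts.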

For a while, after it was introduced, [no one was really sure why word2vec worked as well as it did](https://arxiv.org/pdf/1402.3722v1.pdf) (see the last paragraph of the linked paper). A few years later, some additional math was developed to explain word2vec and similar models. If you are comfortable with both math and academic writing, have a lot of time on your hands, and want to take a deep dive into the inner workings of word2vec, [check out this paper](https://arxiv.org/pdf/1502.03520v7.pdf) from 2016.

One of the draws of word2vec when it first came out was that the vectors could be used to convert analogies (`king` is to `queen` as `man` is to `woman`, for example) into mathematical expressions (`king` + `woman` - `man` = ?) and solve for the missing element (`queen`). This is kind of nifty.
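Here is a toy sketch of that vector arithmetic. The two-dimensional vectors are invented purely for illustration (real word2vec vectors have hundreds of dimensions and are learned, not hand-written), and `solve_analogy` is a hypothetical helper, not a Gensim function. For simplicity, it picks the nearest word by Euclidean distance, whereas Gensim ranks candidates by cosine similarity.

```python
# Hand-made vectors: the first axis loosely encodes "royalty",
# the second loosely encodes "gender". These values are made up.
toy_vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.5, 0.5],
}


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def solve_analogy(a, b, c, vectors):
    """Return the word whose vector is closest to a - b + c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Leave out the query words themselves, as analogy solvers usually do
    candidates = [word for word in vectors if word not in (a, b, c)]
    return min(candidates, key=lambda word: squared_distance(vectors[word], target))


# king - man + woman = ?
print(solve_analogy("king", "man", "woman", toy_vectors))
```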

A drawback of word2vec is that it works best with a corpus that is at least several billion words long. Even though the word2vec algorithm is speedy, that is a lot of data, and training on it takes a long time! In the following examples, your dataset is very small. This allows you to run the code in the notebook without overwhelming the kernel, but it probably won't give great results. Still, you'll explore how you can implement word2vec using the Gensim library.

## Implementing word2vec

Now, you can start to use word2vec representations of the words to feed into machine-learning models. There are a few word2vec implementations in Python, but the general consensus is that the easiest one to use is [Gensim](https://radimrehurek.com/gensim/models/word2vec.html). Now is a good time to install this library if you don't have it yet. Install it as follows:

```bash
pip install gensim
```

In the following examples, you'll use the Gensim library along with others. As you did in the previous checkpoints, you'll be working on Jane Austen's *Persuasion* and Lewis Carroll's *Alice's Adventures in Wonderland*.

You have two options when working with word2vec vectors in Gensim. The first is to train your own word2vec model on your own corpus. This will be your first approach in the following process. However, for word2vec to perform well, you need a much larger corpus than you have here. So, the second option is to load pretrained word2vec vectors that were trained on a very large corpus. After you train your own word2vec representations, you'll also load pretrained ones.

Now, start with importing the libraries that you'll use:

In [2]:
import numpy as np
import pandas as pd
import sklearn
import spacy
import re
import nltk
from nltk.corpus import gutenberg
import gensim
import warnings
warnings.filterwarnings("ignore")

nltk.download('gutenberg')
!python -m spacy download en_core_web_sm

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/emetozar/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


Before moving on to vectorizing the text, you need to clean your data. You can use the same cleaning code as in the previous checkpoints, because you're using the same documents.

In [3]:
# Utility function for standard text cleaning
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation that spaCy doesn't
    # recognize: the double dash --. Better get rid of it now!
    text = re.sub(r'--', ' ', text)
    # Remove any text enclosed in square brackets
    text = re.sub(r"\[.*?\]", "", text)
    # Remove standalone numbers (integers and decimals)
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)
    # Collapse runs of whitespace into single spaces
    text = ' '.join(text.split())
    return text

In [4]:
# Load and clean the data
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# The chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
    
alice = text_cleaner(alice)
persuasion = text_cleaner(persuasion)

In [5]:
# Parse the cleaned novels. This can take some time.
nlp = spacy.load('en_core_web_sm')
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

In [6]:
# Group into sentences
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# Combine the sentences from the two novels into one DataFrame
sentences = pd.DataFrame(alice_sents + persuasion_sents, columns = ["text", "author"])
sentences.head()

|   | text | author |
|---|------|--------|
| 0 | (Alice, was, beginning, to, get, very, tired, ... | Carroll |
| 1 | (So, she, was, considering, in, her, own, mind... | Carroll |
| 2 | (There, was, nothing, so, VERY, remarkable, in... | Carroll |
| 3 | (Oh, dear, !) | Carroll |
| 4 | (Oh, dear, !) | Carroll |


In [7]:
# Get rid of stop words and punctuation,
# and lemmatize the tokens
sentences["text"] = sentences["text"].apply(
    lambda sent: [token.lemma_ for token in sent if not token.is_punct and not token.is_stop]
)

Now, you're ready to vectorize your words using word2vec. For this purpose, use the `Word2Vec` class from Gensim's `models` module. It has several parameters; set the following:

* `workers=4`: Train with four parallel worker threads (this helps only if your machine has multiple cores).
* `min_count=1`: Keep every word that appears at least once in the corpus.
* `window=6`: Consider up to six words on either side of the target word.
* `sg=0`: Use CBOW rather than skip-gram (`sg=1`).
* `sample=1e-3`: Randomly downsample very frequent words, using this value as the threshold.
* `vector_size=100`: Set the word vector length to 100. (Before Gensim 4.0, this parameter was named `size`.)
* `hs=1`: Use hierarchical softmax.

In [8]:
# Train word2vec on the sentences
model = gensim.models.Word2Vec(
    sentences["text"],
    workers=4,
    min_count=1,
    window=6,
    sg=0,
    sample=1e-3,
    vector_size=100,  # this parameter was named `size` before Gensim 4.0
    hs=1
)
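
To get a feel for what the `sample` threshold does, the sketch below uses the subsampling rule from the paper that introduced it (Mikolov et al., 2013), in which each occurrence of a word is discarded with probability 1 - sqrt(t / f). Here, f is the word's share of all tokens and t is the threshold. Treat this as an approximation: actual implementations, Gensim's included, may use a somewhat different formula, and `discard_probability` is a hypothetical helper written for this illustration.

```python
import math


def discard_probability(word_frequency, threshold=1e-3):
    """Chance of skipping one occurrence of a word during training.

    word_frequency: the word's share of all tokens in the corpus (0 to 1)
    threshold:      plays the role of the `sample` parameter
    """
    if word_frequency <= threshold:
        # Words at or below the threshold are never discarded
        return 0.0
    return 1.0 - math.sqrt(threshold / word_frequency)


# Very frequent words are dropped often; rare words are always kept
for share in (0.05, 0.01, 0.001, 0.0001):
    print(share, round(discard_probability(share), 3))
```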

Before jumping into the machine-learning model for prediction, play with the word2vec word representation that you just trained. Specifically, look into the following:

* The five words closest to the result of `lady` + `man` - `woman`
* The word that doesn't fit in this list: `dad`, `dinner`, `mom`, `aunt`, `uncle`
* The similarity score of `woman` and `man`
* The similarity score of `horse` and `cat`

Note that all of the above calculations are based on the word2vec representations of the words that you just trained above.

In [None]:
# Query the trained word vectors through the model's `wv` attribute
print(model.wv.most_similar(positive=['lady', 'man'], negative=['woman'], topn=5))
print(model.wv.doesnt_match("dad dinner mom aunt uncle".split()))
print(model.wv.similarity('woman', 'man'))
print(model.wv.similarity('horse', 'cat'))

[('want', 0.9981542825698853), ('assure', 0.9980653524398804), ('conduct', 0.9980344772338867), ('have', 0.997972309589386), ('small', 0.9979379177093506)]
dinner
0.9992296
0.9922296


The odd-one-out answer (`dinner`) is sensible, but the nearest neighbors look arbitrary, and every similarity score is close to 1. In other words, your vectors barely distinguish between words. This is because your corpus is small. To get more meaningful results, you need to train word2vec representations on much larger corpora.

Now, create your numerical features using the word2vec representations of the words. For each sentence, look up the word2vec vector of every word and average those vectors, dimension by dimension. Because your vectors have 100 dimensions, the result is a single 100-dimensional vector per sentence. You can then use each dimension as a separate feature, which means that you'll have 100 numerical features in your final dataset.

In [None]:
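# One row per sentence, one column per word2vec dimension
word2vec_arr = np.zeros((sentences.shape[0], 100))

for i, sentence in enumerate(sentences["text"]):
    # Average the word vectors; a sentence with no tokens left yields NaN
    word2vec_arr[i, :] = np.mean([model.wv[lemma] for lemma in sentence], axis=0)

word2vec_arr = pd.DataFrame(word2vec_arr)
sentences = pd.concat([sentences[["author", "text"]], word2vec_arr], axis=1)
# Drop the NaN rows and renumber so that row positions and index labels agree
sentences = sentences.dropna().reset_index(drop=True)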

sentences.head()

|   | author | text | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|--------|------|---|---|---|---|---|---|---|---|-----|----|----|----|----|----|----|----|----|----|----|
| 0 | Carroll | [Alice, begin, tired, sit, sister, bank, have,... | 0.105171 | 0.029203 | 0.106189 | -0.233402 | 0.454079 | -0.172252 | -0.41767 | -0.443123 | ... | 0.318644 | -0.169716 | 0.242794 | -0.158143 | -0.208328 | 0.137009 | 0.190339 | -0.32745 | -0.189454 | 0.313232 |
| 1 | Carroll | [consider, mind, hot, day, feel, sleepy, stupi... | 0.093776 | 0.021333 | 0.078646 | -0.176176 | 0.361667 | -0.138862 | -0.339069 | -0.361669 | ... | 0.272195 | -0.124776 | 0.194985 | -0.134077 | -0.180503 | 0.123825 | 0.152447 | -0.266867 | -0.141993 | 0.262484 |
| 2 | Carroll | [remarkable, Alice, think, way, hear, Rabbit] | 0.133105 | 0.012398 | 0.13744 | -0.294095 | 0.558005 | -0.226012 | -0.52631 | -0.528327 | ... | 0.401526 | -0.212107 | 0.294243 | -0.198852 | -0.230249 | 0.167386 | 0.248123 | -0.417506 | -0.231474 | 0.384248 |
| 3 | Carroll | [oh, dear] | 0.096552 | 0.037364 | 0.095361 | -0.230718 | 0.464172 | -0.167147 | -0.438785 | -0.448484 | ... | 0.322805 | -0.166143 | 0.26097 | -0.179004 | -0.235276 | 0.134661 | 0.203214 | -0.327309 | -0.201553 | 0.320577 |
| 4 | Carroll | [oh, dear] | 0.096552 | 0.037364 | 0.095361 | -0.230718 | 0.464172 | -0.167147 | -0.438785 | -0.448484 | ... | 0.322805 | -0.166143 | 0.26097 | -0.179004 | -0.235276 | 0.134661 | 0.203214 | -0.327309 | -0.201553 | 0.320577 |


This is a good dataset format. Now, you're ready to jump into the modeling step with your features. The diagram below shows where you are in the data science pipeline:

![Modeling](assets/modeling.png)

## Word2vec in action

Notice that you now have a dataset where the columns named from *0* to *99* are the features that you'll use in the following models. Use the same models that you built in the previous checkpoints to predict the author of a sentence.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

Y = sentences['author']
X = np.array(sentences.drop(columns=['text', 'author']))

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.8088563236699586

Test set score: 0.7975167144221585
----------------------Random Forest Scores----------------------
Training set score: 0.9827970691302963

Test set score: 0.8285577841451767
----------------------Gradient Boosting Scores----------------------
Training set score: 0.9104810449187639

Test set score: 0.8347659980897804


The scores aren't great compared to the scores of the previous checkpoints. The main reason is the small size of your corpus.

So, use word2vec vectors that were trained on a very large corpus instead. Google released a large set of pretrained word2vec vectors, trained on around 100 billion words from the Google News dataset. The vocabulary contains 3 million words and phrases, and each vector has 300 dimensions.

Download the pretrained vectors from this address: https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz. Note that the download and the following code take some time to run, so it's recommended to run the following cells in Google Colab.

In [None]:
# Load Google's pretrained word2vec model.
model_pretrained = gensim.models.KeyedVectors.load_word2vec_format(
    'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz', binary=True)

Now you have the pretrained vectors in a variable called `model_pretrained`. Next, look up the vector representations of the words in your corpus. For simplicity, if any word in a sentence can't be found in the vocabulary of these pretrained vectors, just drop that sentence from your dataset. You could follow alternative approaches if you like.
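One possible alternative is to average only the words that are in the vocabulary, rather than discarding the whole sentence. The sketch below shows the idea in plain Python, with an ordinary dictionary of made-up three-dimensional vectors standing in for the pretrained model. The helper name `average_known_vectors` is hypothetical, and this approach is not the one used in the rest of this checkpoint.

```python
def average_known_vectors(lemmas, vectors):
    """Average the vectors of the lemmas that appear in `vectors`.

    Returns None when no lemma is in the vocabulary, so that the caller
    can decide whether to drop the sentence.
    """
    known = [vectors[lemma] for lemma in lemmas if lemma in vectors]
    if not known:
        return None
    # zip(*known) groups the values dimension by dimension
    return [sum(components) / len(known) for components in zip(*known)]


# Stand-in for the pretrained vectors (values are made up)
toy_vectors = {
    "oh": [1.0, 0.0, 2.0],
    "dear": [3.0, 2.0, 0.0],
}

print(average_known_vectors(["oh", "dear"], toy_vectors))
# The unknown word is skipped instead of invalidating the sentence
print(average_known_vectors(["oh", "jabberwock", "dear"], toy_vectors))
# Nothing to average: the caller gets None
print(average_known_vectors(["jabberwock"], toy_vectors))
```

The trade-off is that a sentence made up mostly of unknown words ends up represented by only one or two vectors, which may be a poor summary of it.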

In [None]:
word2vec_arr = np.zeros((sentences.shape[0], 300))

for i, sentence in enumerate(sentences["text"]):
    try:
        word2vec_arr[i, :] = np.mean([model_pretrained[lemma] for lemma in sentence], axis=0)
    except KeyError:
        # At least one lemma is missing from the pretrained vocabulary
        word2vec_arr[i, :] = np.nan

# Reuse the index of `sentences` so that the rows line up in the concat below
word2vec_arr = pd.DataFrame(word2vec_arr, index=sentences.index)
sentences = pd.concat([sentences[["author", "text"]], word2vec_arr], axis=1)
sentences.dropna(inplace=True)

print("Shape of the dataset: {}".format(sentences.shape))
sentences.head()

As a result, you have a dataset of 4,114 rows and 300 features (excluding the *text* and *author* columns). Now, you can run your classifiers using this dataset.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

Y = sentences['author']
X = np.array(sentences.drop(columns=['text', 'author']))

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

The scores are much better than the ones that you got with the word2vec vectors that you trained on your own corpus. But there is still a lot of room for improvement. You can also use the pretrained vectors above as features in other models or try to gain insights from the vector compositions themselves.

# Example word2vec applications

Here are some neat things that people have done with word2vec:

 * [Visualizing word embeddings in Jane Austen's *Pride and Prejudice*](http://blogger.ghostweather.com/2014/11/visualizing-word-embeddings-in-pride.html). Skip to the bottom to see a truly honest account of this data scientist's process.

 * [Tracking changes in Dutch newspapers' associations](https://www.slideshare.net/MelvinWevers/concepts-through-time-tracing-concepts-in-dutch-newspaper-discourse-using-sequential-word-vector-spaces) with words like `propaganda` and `alien`, from 1950 to 1990

 * [Helping customers find clothing items](http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/) that are similar to a given item but differ on one or more characteristics
 
Before finishing this checkpoint, there's one last vectorization method to briefly cover: [GloVe](https://nlp.stanford.edu/projects/glove/). GloVe is another popular vectorization method; it is similar to word2vec and was developed by researchers at Stanford University. Gensim also supports working with GloVe vectors. If you want, you can play with this method using Gensim.
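
If you'd like to peek inside a GloVe file before loading it with a library, note that the pretrained GloVe vectors are distributed as plain text, with one word per line followed by its vector components, separated by spaces. The sketch below parses that format in plain Python. The two sample entries are made up, and `read_glove` is a hypothetical helper, not part of Gensim.

```python
import io


def read_glove(lines):
    """Parse GloVe's text format: a word, then its components, on each line."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Two made-up 4-D entries standing in for a real file such as glove.6B.100d.txt
sample = io.StringIO("alice 0.1 -0.2 0.3 0.4\nrabbit 0.0 0.5 -0.5 1.0\n")
glove = read_glove(sample)

print(len(glove), "words loaded")
print(glove["rabbit"])
```

To read a real file, you could pass an open file handle instead of the `StringIO` object, for example `with open(path, encoding="utf-8") as f: glove = read_glove(f)`, where `path` points to the downloaded text file.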