## Train your own word2vec representations, as you did in the first example in this checkpoint. However, you need to experiment with the hyperparameters of the vectorization step. Modify the hyperparameters and run the classification models again. Can you wrangle any improvements?

In [1]:
import numpy as np
import pandas as pd
import sklearn
import spacy
import re
import nltk
from nltk.corpus import gutenberg
import gensim
import warnings
warnings.filterwarnings("ignore")

# nltk.download('gutenberg')
# !python -m spacy download en_core_web_sm

Before vectorizing the text, you need to clean your data. You can reuse the cleaning code from the previous checkpoints, because you're working with the same documents.

In [2]:
# Utility function for standard text cleaning
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation that spaCy doesn't
    # recognize: the double dash --. Better get rid of it now!
    text = re.sub(r'--', ' ', text)
    # Remove anything enclosed in square brackets, such as [Illustration]
    text = re.sub(r"\[.*?\]", "", text)
    # Remove standalone numbers (integers and decimals)
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)
    # Collapse runs of whitespace into single spaces
    text = ' '.join(text.split())
    return text
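
To see what each cleaning step does, it can help to run the function on a short string. The sketch below repeats the cleaner so that it runs on its own; the `sample` sentence is made up for illustration and does not come from either novel.

```python
import re

def text_cleaner(text):
    text = re.sub(r'--', ' ', text)                 # double dash -> space
    text = re.sub(r"\[.*?\]", "", text)             # drop [bracketed] notes
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)  # drop numbers
    return ' '.join(text.split())                   # collapse whitespace

# Hypothetical input that exercises every rule
sample = "[Illustration] Alice--tired of 3 books--sat   down."
print(text_cleaner(sample))  # Alice tired of books sat down.
```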

In [3]:
# Load and clean the data
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# The chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
    
alice = text_cleaner(alice)
persuasion = text_cleaner(persuasion)

In [4]:
# Parse the cleaned novels. This can take some time.
nlp = spacy.load('en_core_web_sm')
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

In [5]:
# Group into sentences
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# Combine the sentences from the two novels into one DataFrame
sentences = pd.DataFrame(alice_sents + persuasion_sents, columns = ["text", "author"])
sentences.head()

| | text | author |
|---|---|---|
| 0 | (Alice, was, beginning, to, get, very, tired, ... | Carroll |
| 1 | (So, she, was, considering, in, her, own, mind... | Carroll |
| 2 | (There, was, nothing, so, VERY, remarkable, in... | Carroll |
| 3 | (Oh, dear, !) | Carroll |
| 4 | (Oh, dear, !) | Carroll |


In [6]:
# Get rid of stop words and punctuation,
# and lemmatize the tokens
sentences["text"] = sentences["text"].apply(
    lambda sent: [token.lemma_ for token in sent
                  if not token.is_punct and not token.is_stop]
)

Now, you're ready to vectorize your words using word2vec. For this purpose, use `Word2Vec` from Gensim's `models` module. The `Word2Vec` class has several parameters. Set the following parameters:

* `workers=4`: Train with four parallel threads (useful only if your machine has at least four cores available).
* `min_count=1`: Keep every word, including words that appear only once.
* `window=12`: Consider up to 12 words on either side of the target word. This is the value being experimented with here, in place of 6.
* `sg=0`: Use CBOW, which is fast to train on a small corpus. Setting `sg=1` switches to skip-gram.
* `sample=1e-3`: Randomly downsample very frequent words, using a threshold of 1e-3.
* `size=100`: Set the word vector length to 100. (Gensim 4.0 and later call this parameter `vector_size`.)
* `hs=1`: Use hierarchical softmax.
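
To make the `window` parameter concrete, the sketch below lists the (target, context) pairs that a given window produces for one short sentence. It is a simplification: the real training procedure, to my understanding, also samples a smaller effective window for each target word, and CBOW combines the context words rather than treating each pair separately. The function name `context_pairs` and the token list are hypothetical.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical lemmatized sentence with stop words already removed
tokens = ["alice", "begin", "tired", "sit", "sister", "bank"]

print(len(context_pairs(tokens, window=1)))   # 10
print(len(context_pairs(tokens, window=6)))   # 30
print(len(context_pairs(tokens, window=12)))  # 30
```

Once the window is wider than the sentence, widening it further adds no new pairs. Because stop-word removal leaves many short sentences in this corpus, a larger `window` may change less than you expect.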

In [14]:
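# Train word2vec on the sentences
model = gensim.models.Word2Vec(
    sentences["text"],
    workers=4,
    min_count=1,   # keep every word, even those that appear only once
    window=12,     # context words considered on each side of the target
    sg=0,          # CBOW
    sample=0.001,  # downsampling threshold for frequent words
    size=100,      # vector length; named vector_size in Gensim 4.0 and later
    hs=1           # hierarchical softmax
)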
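
Because the exercise asks you to try several hyperparameter settings, one way to organize the experiment is to enumerate the combinations up front and train one model per combination. The sketch below only builds the list of settings; the names `param_grid` and `iter_configs` and the candidate values are illustrative assumptions, not recommendations.

```python
from itertools import product

# Hypothetical candidate values for three of the Word2Vec hyperparameters
param_grid = {
    "window": [4, 6, 12],
    "sg": [0, 1],
    "hs": [0, 1],
}

def iter_configs(grid):
    """Yield one dict of keyword arguments per combination in the grid."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(param_grid))
print(len(configs))  # 12
print(configs[0])    # {'window': 4, 'sg': 0, 'hs': 0}

# Each config could then be unpacked into the training call, for example:
# gensim.models.Word2Vec(sentences["text"], workers=4, min_count=1, **config)
```

Recording the test score of each classifier next to its configuration makes it easier to tell a real improvement from run-to-run noise.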

Before jumping into the machine-learning model for prediction, explore the word2vec representations that you just trained. Specifically, look into the following:

* The five words closest to the analogy `lady` + `man` - `woman`
* The word that doesn't fit in this list: `dad`, `dinner`, `mom`, `aunt`, `uncle`
* The similarity score of `woman` and `man`
* The similarity score of `horse` and `cat`

All of these calculations use the word2vec representations trained above.

In [15]:
# Query the trained word vectors through model.wv
print(model.wv.most_similar(positive=['lady', 'man'], negative=['woman'], topn=5))
print(model.wv.doesnt_match("dad dinner mom aunt uncle".split()))
print(model.wv.similarity('woman', 'man'))
print(model.wv.similarity('horse', 'cat'))

[('life', 0.9990490078926086), ('young', 0.9986709952354431), ('understand', 0.9986592531204224), ('board', 0.998489499092102), ('concern', 0.9983574151992798)]
dinner
0.99772704
0.9253813


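The results are only partly sensible. `doesnt_match` correctly singles out `dinner`, but nearly every similarity score is close to 1, and the top matches returned by `most_similar` (`life`, `young`, `understand`) have little to do with the query. This suggests that the vectors are barely differentiated from one another, which is what you'd expect from a corpus this small. To get more meaningful results, you need to train word2vec representations on much larger corpora.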
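
The similarity scores above are cosine similarities: the cosine of the angle between two word vectors, which ignores their lengths. The following sketch computes it by hand on made-up two-dimensional vectors; the function name `cosine_similarity` is hypothetical and is not the Gensim implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Perpendicular vectors share no direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0

# Vectors pointing the same way score approximately 1, whatever their lengths
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

When almost every pair of words scores near 1, the vectors all point in roughly the same direction, so the scores carry little information about meaning.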

Now, create numerical features from the word2vec representations. For each sentence, look up the vector of every word and average those vectors element-wise. Because each word vector has 100 dimensions, each sentence ends up represented by a single 100-dimensional vector. Each dimension then becomes a separate column, which means that you'll have 100 numerical features in your final dataset.

In [16]:
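# One row per sentence, one column per embedding dimension
word2vec_arr = np.zeros((sentences.shape[0], 100))

for i, sentence in enumerate(sentences["text"]):
    # Average the word vectors of the sentence. A sentence with no tokens left
    # after cleaning averages to NaN and is dropped below.
    word2vec_arr[i, :] = np.mean([model.wv[lemma] for lemma in sentence], axis=0)

word2vec_arr = pd.DataFrame(word2vec_arr)
sentences = pd.concat([sentences[["author", "text"]], word2vec_arr], axis=1)
sentences.dropna(inplace=True)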

sentences.head()

| | author | text | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Carroll | [Alice, begin, tired, sit, sister, bank, have,... | -0.28633 | 0.189602 | 0.040008 | 0.50431 | 0.122309 | -0.015341 | -0.251262 | -0.18305 | ... | 0.10808 | 0.218905 | -0.378288 | -0.071109 | -0.051211 | -0.158687 | 0.090778 | 0.460301 | -0.224636 | 0.050887 |
| 1 | Carroll | [consider, mind, hot, day, feel, sleepy, stupi... | -0.246307 | 0.180035 | 0.012813 | 0.439538 | 0.096807 | -0.022329 | -0.224655 | -0.147564 | ... | 0.113968 | 0.20387 | -0.314791 | -0.076602 | -0.03528 | -0.114029 | 0.101395 | 0.379345 | -0.168127 | 0.027302 |
| 2 | Carroll | [remarkable, Alice, think, way, hear, Rabbit] | -0.347402 | 0.232404 | 0.048664 | 0.612269 | 0.152939 | -0.036372 | -0.301442 | -0.22948 | ... | 0.12982 | 0.250894 | -0.446079 | -0.086098 | -0.082392 | -0.183035 | 0.096149 | 0.539733 | -0.27301 | 0.060775 |
| 3 | Carroll | [oh, dear] | -0.304499 | 0.233723 | 0.049377 | 0.521712 | 0.138025 | -0.032585 | -0.274533 | -0.22662 | ... | 0.132914 | 0.198088 | -0.32866 | -0.073778 | -0.035742 | -0.124326 | 0.092076 | 0.438679 | -0.25618 | 0.018199 |
| 4 | Carroll | [oh, dear] | -0.304499 | 0.233723 | 0.049377 | 0.521712 | 0.138025 | -0.032585 | -0.274533 | -0.22662 | ... | 0.132914 | 0.198088 | -0.32866 | -0.073778 | -0.035742 | -0.124326 | 0.092076 | 0.438679 | -0.25618 | 0.018199 |
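
The averaging step is easier to follow on toy data. The sketch below uses made-up three-dimensional vectors in place of the trained ones and shows why some rows need to be dropped: a sentence with no usable words has nothing to average. The helper `sentence_vector` and the `toy_vectors` dictionary are hypothetical, and skipping unknown words is an added assumption that the cell above doesn't need, because `min_count` keeps every word.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" standing in for the trained vectors
toy_vectors = {
    "oh": np.array([1.0, 0.0, 2.0]),
    "dear": np.array([3.0, 2.0, 0.0]),
}

def sentence_vector(lemmas, vectors, dim):
    """Average the vectors of the known lemmas; return NaNs if there are none."""
    known = [vectors[lemma] for lemma in lemmas if lemma in vectors]
    if not known:
        return np.full(dim, np.nan)
    return np.mean(known, axis=0)

print(sentence_vector(["oh", "dear"], toy_vectors, dim=3))  # [2. 1. 1.]
print(sentence_vector([], toy_vectors, dim=3))              # [nan nan nan]
```

Two sentences with the same lemmas, such as the repeated `[oh, dear]` rows, always receive identical feature vectors, because averaging ignores word order.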


## Word2vec in action

Notice that you now have a dataset where the columns named from *0* to *99* are the features that you'll use in the following models. Use the same models that you built in the previous checkpoints to predict the author of a sentence.

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

Y = sentences['author']
X = np.array(sentences.drop(['text', 'author'], axis=1))

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.8113207547169812

Test set score: 0.8073170731707318
----------------------Random Forest Scores----------------------
Training set score: 0.9934938191281718

Test set score: 0.8312195121951219
----------------------Gradient Boosting Scores----------------------
Training set score: 0.8845152895250488

Test set score: 0.8346341463414634
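
To judge whether these scores are an improvement, it helps to compare them with a majority-class baseline: the accuracy you'd get by always predicting the more frequent author. If one author contributes far more sentences than the other, that baseline can already be high. The sketch below uses a made-up label list; `majority_baseline` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Hypothetical labels; in the notebook you would pass y_test instead
labels = ["Austen"] * 7 + ["Carroll"] * 3
label, accuracy = majority_baseline(labels)
print(label, accuracy)  # Austen 0.7
```

Calling `majority_baseline(y_test)` gives the number that each test set score needs to beat before the word2vec features can be credited with any predictive value.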
