In [1]:
import pandas as pd  
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [3]:
csv = 'clean_tweet.csv'
my_df = pd.read_csv(csv,index_col=0)
my_df.head()

Unnamed: 0,text,target
0,awww that bummer you shoulda got david carr of...,0
1,is upset that he can not update his facebook b...,0
2,dived many times for the ball managed to save ...,0
3,my whole body feels itchy and like its on fire,0
4,no it not behaving at all mad why am here beca...,0


In [4]:
my_df.dropna(inplace=True)
my_df.reset_index(drop=True,inplace=True)
my_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1596019 entries, 0 to 1596018
Data columns (total 2 columns):
text      1596019 non-null object
target    1596019 non-null int64
dtypes: int64(1), object(1)
memory usage: 24.4+ MB


In [5]:
x = my_df.text
y = my_df.target

In [7]:
from sklearn.model_selection import train_test_split
SEED = 2000
x_train, x_validation_and_test, y_train, y_validation_and_test = train_test_split(x, y, test_size=.02, random_state=SEED)
x_validation, x_test, y_validation, y_test = train_test_split(x_validation_and_test, y_validation_and_test, test_size=.5, random_state=SEED)
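
The two calls above amount to roughly a 98/1/1 split. As a quick sanity check, here is a minimal sketch of the resulting set sizes. It assumes scikit-learn rounds the held-out portion up to a whole row when `test_size` is a float, which is my understanding of `train_test_split` and worth verifying against your installed version. `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_rows, holdout_fraction=0.02):
    """Approximate the train / validation / test sizes of the two-step split.

    Assumption: the held-out portion is rounded up (ceil) and the remainder
    stays on the other side of the split.
    """
    n_holdout = math.ceil(n_rows * holdout_fraction)   # validation + test together
    n_train = n_rows - n_holdout
    n_test = math.ceil(n_holdout * 0.5)                # second split, test_size=.5
    n_validation = n_holdout - n_test
    return n_train, n_validation, n_test

# 1596019 is the row count reported by my_df.info() after dropping nulls.
n_train, n_validation, n_test = split_sizes(1596019)
print(n_train, n_validation, n_test)
```

With only about 1% of the data in each held-out set, a change of 0.1 percentage points in validation accuracy corresponds to around 16 tweets.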

In [10]:
from sklearn.linear_model import LogisticRegression

## Doc2Vec

Before we jump into doc2vec, it helps to start with word2vec. "Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words."

Word2vec is not a single algorithm but consists of two techniques: CBOW (continuous bag of words) and skip-gram. Both learn weights that act as word vector representations. Given a corpus, the CBOW model predicts the current word from a window of surrounding context words, while the skip-gram model predicts the surrounding context words given the current word. In the Gensim package, you choose between the two by passing the argument "sg" to Word2Vec. By default (sg=0), CBOW is used; with sg=1, skip-gram is used.

For example, take the sentence "I love dogs". The CBOW model tries to predict the word "love" given "I" and "dogs" as inputs, whereas the skip-gram model tries to predict "I" and "dogs" given the word "love" as input.
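
The "I love dogs" example can be spelled out as training examples. This is a minimal sketch in plain Python, not Gensim's implementation; `cbow_examples` and `skipgram_examples` are hypothetical helper names, and a window of one word on each side is assumed.

```python
def cbow_examples(tokens, window=1):
    """Return (context_words, target_word) pairs: CBOW predicts the target from its context."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples

def skipgram_examples(tokens, window=1):
    """Return (input_word, context_word) pairs: skip-gram predicts each context word from the input."""
    pairs = []
    for context, target in cbow_examples(tokens, window):
        pairs.extend((target, c) for c in context)
    return pairs

tokens = "I love dogs".split()
cbow_for_love = cbow_examples(tokens)[1]
skipgram_for_love = [p for p in skipgram_examples(tokens) if p[0] == "love"]
print(cbow_for_love)        # context -> target
print(skipgram_for_love)    # input -> each context word
```

The two models see the same windows; only the direction of prediction differs.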

The picture below shows more formally how these two models work.

![title](img/w2v.png)

What is actually used as word vectors, however, is not the predictions of these models but the weights of the trained models. Once extracted, each such weight vector comes to represent, in some abstract way, the ‘meaning’ of a word.
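
To make the "weights as word vectors" idea concrete, here is a small NumPy sketch. The weight matrix `W` is random rather than trained, and the three-word vocabulary and 4-dimensional size are toy assumptions. It only illustrates that feeding a one-hot word into the first layer selects one row of the weight matrix.

```python
import numpy as np

vocab = ["i", "love", "dogs"]
embedding_dim = 4                      # toy size; the models in this notebook use 100

rng = np.random.default_rng(0)
W = rng.normal(size=(len(vocab), embedding_dim))   # stand-in for trained input weights

def one_hot(word):
    """Encode a word as a vector with a single 1 at its vocabulary position."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

# Multiplying a one-hot input by the weight matrix just picks out one row,
# so the "word vector" for a word is the corresponding row of W.
love_vector = one_hot("love") @ W
print(np.allclose(love_vector, W[vocab.index("love")]))
```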

Then what is doc2vec? Doc2vec uses the same logic as word2vec but applies it at the document level. According to Le and Mikolov (2014), "every paragraph is mapped to a unique vector, represented by a column in matrix D and every word is also mapped to a unique vector, represented by a column in matrix W. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context...The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context – or the topic of the paragraph." https://cs.stanford.edu/~quocle/paragraph_vector.pdf

![title](img/d2v01.png)

DM:
This is the Doc2Vec model analogous to the CBOW model in Word2Vec. The paragraph vectors are obtained by training a neural network to infer a centre word from its context words and a context paragraph.

DBOW:
This is the Doc2Vec model analogous to the skip-gram model in Word2Vec. The paragraph vectors are obtained by training a neural network to predict words randomly sampled from a paragraph, given only that paragraph's vector.

I implemented the Doc2Vec models with the Python library Gensim. For the DM model, I implemented both averaging and concatenation, following Le and Mikolov (2014), who implemented DM in two ways: averaging the paragraph vector with the word vectors, and concatenating them. Gensim's tutorial shows both variants as well.
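
The difference between averaging and concatenating shows up in the size of the vector handed to the prediction layer. The sketch below uses random stand-in vectors, not trained ones, and reflects my reading of the paper rather than Gensim's internals. It assumes 100-dimensional vectors and a window of 2 words on each side.

```python
import numpy as np

dim = 100        # vector size used in this notebook
window = 2       # context words on each side (assumed for illustration)

rng = np.random.default_rng(0)
paragraph_vec = rng.normal(size=dim)                 # one column of D
context_vecs = rng.normal(size=(2 * window, dim))    # columns of W for the context words

stacked = np.vstack([paragraph_vec, context_vecs])   # shape (1 + 2*window, dim)

dm_mean_input = stacked.mean(axis=0)     # averaging keeps the input at `dim` values
dm_concat_input = stacked.reshape(-1)    # concatenation grows with the window size

print(dm_mean_input.shape, dm_concat_input.shape)
```

Because the concatenated input grows linearly with the window, a concatenating DM model has many more parameters in its prediction layer than an averaging one with the same window.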

Below are the methods I used to get the vectors for each tweet.

1. DBOW (Distributed Bag of Words)
2. DMC (Distributed Memory Concatenated)
3. DMM (Distributed Memory Mean)
4. DBOW + DMC
5. DBOW + DMM

With the above vectors, I fit a simple logistic regression model and evaluated the result on the validation set.

In [175]:
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
import multiprocessing
from sklearn import utils

In [29]:
def labelize_tweets_ug(tweets,label):
    # Tag each tweet as '<label>_<index>' so its vector can be looked up later
    result = []
    prefix = label
    for i, t in zip(tweets.index, tweets):
        result.append(TaggedDocument(t.split(), [prefix + '_%s' % i]))
    return result
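
To show what the tagging function above produces without needing Gensim installed, here is a minimal sketch. `Tagged` is a hypothetical stand-in for Gensim's tagged-document type, and the row indices 1042 and 7 are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for Gensim's tagged-document type, so the sketch runs without Gensim.
Tagged = namedtuple("Tagged", ["words", "tags"])

def labelize(texts_by_index, label):
    """Pair each tokenised text with a tag built from its original row index."""
    return [Tagged(text.split(), ["%s_%s" % (label, i)])
            for i, text in texts_by_index.items()]

# Row indices survive shuffling and splitting, so they are not contiguous here.
sample = {1042: "my whole body feels itchy", 7: "is upset that he can not update"}
docs = labelize(sample, "all")
print(docs[0].tags, docs[0].words[:3])
```

Because each tag is built from the original DataFrame index, the same 'all_<index>' string can be used later to fetch a tweet's vector regardless of which split the tweet ended up in.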

For training Doc2Vec, I used the whole data set. The rationale is that doc2vec training is completely unsupervised: it never sees the sentiment labels, so there is no need to hold out any data. This follows Lau and Baldwin (2016) in their research paper "An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation" https://arxiv.org/pdf/1607.05368.pdf

The same rationale is applied in Gensim's Doc2Vec tutorial. In the IMDB tutorial, the vectors are trained on all documents in the data set, including the train, test, and dev sets. https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb

In [30]:
all_x = pd.concat([x_train,x_validation,x_test])
all_x_w2v = labelize_tweets_ug(all_x, 'all')

In [31]:
len(all_x_w2v)

1596019

## DBOW

In [76]:
cores = multiprocessing.cpu_count()  # number of worker threads for training
model_ug_dbow = Doc2Vec(dm=0, size=100, negative=5, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
model_ug_dbow.build_vocab([x for x in tqdm(all_x_w2v)])

100%|██████████| 1596019/1596019 [00:01<00:00, 1103371.28it/s]


According to Radim Řehůřek, the developer who created Gensim,
"One caveat of the way this algorithm runs is that, since the learning rate decrease over the course of iterating over the data, labels which are only seen in a single LabeledSentence during training will only be trained with a fixed learning rate. This frequently produces less than optimal results."

The loop below implements the explicit multiple-pass, alpha-reduction approach with added shuffling, as presented in Gensim's IMDB tutorial.

In [77]:
%%time
for epoch in range(30):
    model_ug_dbow.train(utils.shuffle([x for x in tqdm(all_x_w2v)]), total_examples=len(all_x_w2v), epochs=1)
    model_ug_dbow.alpha -= 0.002
    model_ug_dbow.min_alpha = model_ug_dbow.alpha

100%|██████████| 1596019/1596019 [00:01<00:00, 1219082.43it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1261002.16it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1288925.92it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1305433.30it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1272539.49it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1326664.83it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1303369.44it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1311243.19it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1374487.58it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1344628.55it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1342260.54it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1312454.04it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1316521.68it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1298431.37it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1310747.92it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1286284.

CPU times: user 37min 59s, sys: 17min 26s, total: 55min 25s
Wall time: 34min 9s
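
The manual schedule in the training loop above is easy to tabulate. This small sketch only does the arithmetic for a starting alpha of 0.065, a step of 0.002, and 30 passes. Values are rounded to three decimals to hide floating-point noise.

```python
start_alpha = 0.065
step = 0.002
epochs = 30

# Learning rate actually used in each pass of the manual loop.
alphas = [round(start_alpha - step * epoch, 3) for epoch in range(epochs)]

# Value left on the model after the final decrement.
final_alpha = round(start_alpha - step * epochs, 3)

print(alphas[0], alphas[-1], final_alpha)
```

The learning rate stays positive throughout, which matters: with a larger step or more passes the same loop would eventually train with a negative alpha.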


In [16]:
def get_vectors(model, corpus, size):
    vecs = np.zeros((len(corpus), size))
    n = 0
    for i in corpus.index:
        prefix = 'all_' + str(i)
        vecs[n] = model.docvecs[prefix]
        n += 1
    return vecs

In [78]:
train_vecs_dbow = get_vectors(model_ug_dbow, x_train, 100)
validation_vecs_dbow = get_vectors(model_ug_dbow, x_validation, 100)

In [79]:
clf = LogisticRegression()
clf.fit(train_vecs_dbow, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [80]:
clf.score(validation_vecs_dbow, y_validation)

0.73890977443609018

Even though the DBOW model doesn't learn the meaning of individual words, it seems to do its job as a source of features for a classifier.

But the result doesn't outperform the count vectorizer or the Tfidf vectorizer. It might not be a direct comparison, since both the count vectorizer and the Tfidf vectorizer use a large number of features to represent a tweet, whereas here the vector for each tweet has only 100 dimensions.

In [81]:
model_ug_dbow.save('d2v_model_ug_dbow.doc2vec')
model_ug_dbow = Doc2Vec.load('d2v_model_ug_dbow.doc2vec')

In [82]:
model_ug_dbow.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

## Distributed Memory (concatenated)

In [90]:
cores = multiprocessing.cpu_count()
model_ug_dmc = Doc2Vec(dm=1, dm_concat=1, size=100, window=2, negative=5, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
model_ug_dmc.build_vocab([x for x in tqdm(all_x_w2v)])

100%|██████████| 1596019/1596019 [00:01<00:00, 953468.23it/s]


In [91]:
%%time
for epoch in range(30):
    model_ug_dmc.train(utils.shuffle([x for x in tqdm(all_x_w2v)]), total_examples=len(all_x_w2v), epochs=1)
    model_ug_dmc.alpha -= 0.002
    model_ug_dmc.min_alpha = model_ug_dmc.alpha

100%|██████████| 1596019/1596019 [00:01<00:00, 1306549.28it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1419834.51it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1287296.73it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1090579.91it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1395998.90it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1403537.78it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1344315.05it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1393606.22it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1345848.57it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1368462.81it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1289542.44it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1407111.38it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1384729.94it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1358161.42it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1376371.99it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1252529.

CPU times: user 47min 4s, sys: 16min 55s, total: 1h 3min 59s
Wall time: 35min


In [176]:
model_ug_dmc = Doc2Vec.load('d2v_model_ug_dmc.doc2vec')

What's nice about Doc2Vec is that after training you can retrieve not only document vectors but also individual word vectors. Note, however, that a pure DBOW model doesn't learn semantic word vectors, so the word vectors you retrieve from it are just the randomly initialized vectors, with no meaning.
With the DM model, on the other hand, you can see the semantic relationships between words. Let's see what word vectors it has learned through training.

In [92]:
model_ug_dmc.most_similar('good')

[('goood', 0.7454031705856323),
 ('gud', 0.7452770471572876),
 ('gd', 0.7434319853782654),
 ('gooood', 0.7358574271202087),
 ('great', 0.7102019786834717),
 ('goooood', 0.6563930511474609),
 ('guud', 0.6441871523857117),
 ('gooooood', 0.6416404843330383),
 ('gooooooood', 0.6410443782806396),
 ('cnceled', 0.6380959153175354)]

In [93]:
model_ug_dmc.most_similar('happy')

[('hapy', 0.7785520553588867),
 ('hapi', 0.7260264158248901),
 ('happpy', 0.7140897512435913),
 ('happpppy', 0.6873939037322998),
 ('pleased', 0.6722116470336914),
 ('hppy', 0.6686583161354065),
 ('happpppppy', 0.6357202529907227),
 ('teuni', 0.6338286399841309),
 ('haaappy', 0.6285831928253174),
 ('unhappy', 0.6153473854064941)]

What's interesting about the DMC model is that it has somehow learned the misspelled versions of a word, as you can see above.

In [178]:
model_ug_dmc.most_similar('facebook')

[('myspace', 0.8975507020950317),
 ('fb', 0.8213573694229126),
 ('youtube', 0.7994770407676697),
 ('msn', 0.7978468537330627),
 ('ym', 0.7916175127029419),
 ('bebo', 0.7702169418334961),
 ('weebly', 0.764090359210968),
 ('yahoo', 0.7522760033607483),
 ('flickr', 0.7478793263435364),
 ('gmail', 0.7433972954750061)]

In [179]:
model_ug_dmc.most_similar(positive=['bigger', 'small'], negative=['big'])

[('smaller', 0.6462364792823792),
 ('larger', 0.6360152959823608),
 ('confections', 0.5971038341522217),
 ('stricter', 0.5868656039237976),
 ('braver', 0.5825048685073853),
 ('chillier', 0.5745378732681274),
 ('sharper', 0.5676980018615723),
 ('colorfull', 0.567488431930542),
 ('scarier', 0.5673887729644775),
 ('slower', 0.5586768388748169)]

The model successfully finds the comparative form of "small" when given the pair "big" and "bigger". The line of code above asks the model to add the vectors for "bigger" and "small" and subtract the vector for "big"; the word closest to the result is the top hit, "smaller".
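
The vector arithmetic behind such an analogy query can be sketched with hand-made toy vectors. The 2-dimensional vectors below are hypothetical, not learned, and the `analogy` helper only illustrates the idea of ranking by cosine similarity; Gensim's own computation differs in details such as normalisation.

```python
import numpy as np

# Hand-made toy vectors (hypothetical): axis 0 ~ size, axis 1 ~ comparative form.
toy = {
    "big":     np.array([1.0, 0.0]),
    "bigger":  np.array([1.0, 1.0]),
    "small":   np.array([-1.0, 0.0]),
    "smaller": np.array([-1.0, 1.0]),
    "happy":   np.array([0.2, -0.5]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative), skipping the query words."""
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    candidates = [w for w in vectors if w not in positive + negative]
    return sorted(candidates, key=lambda w: cosine(vectors[w], query), reverse=True)

ranked = analogy(["bigger", "small"], ["big"], toy)
print(ranked)
```

In this toy space, "bigger" minus "big" isolates the comparative direction, and adding it to "small" lands exactly on "smaller".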

In [94]:
train_vecs_dmc = get_vectors(model_ug_dmc, x_train, 100)
validation_vecs_dmc = get_vectors(model_ug_dmc, x_validation, 100)

In [95]:
clf = LogisticRegression()
clf.fit(train_vecs_dmc, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [96]:
clf.score(validation_vecs_dmc, y_validation)

0.6646616541353384

In [97]:
model_ug_dmc.save('d2v_model_ug_dmc.doc2vec')
model_ug_dmc = Doc2Vec.load('d2v_model_ug_dmc.doc2vec')
model_ug_dmc.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

## Distributed Memory (mean)

In [98]:
cores = multiprocessing.cpu_count()
model_ug_dmm = Doc2Vec(dm=1, dm_mean=1, size=100, window=4, negative=5, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
model_ug_dmm.build_vocab([x for x in tqdm(all_x_w2v)])

100%|██████████| 1596019/1596019 [00:01<00:00, 1236915.22it/s]


In [99]:
%%time
for epoch in range(30):
    model_ug_dmm.train(utils.shuffle([x for x in tqdm(all_x_w2v)]), total_examples=len(all_x_w2v), epochs=1)
    model_ug_dmm.alpha -= 0.002
    model_ug_dmm.min_alpha = model_ug_dmm.alpha

100%|██████████| 1596019/1596019 [00:01<00:00, 1098305.56it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1300438.43it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1344649.08it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1342917.83it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1381563.17it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1263600.00it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1390100.80it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1354605.11it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1332752.43it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1299577.54it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1303878.45it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1347850.31it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1285677.08it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1314748.43it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1504333.07it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1492316.

CPU times: user 50min 41s, sys: 22min 21s, total: 1h 13min 3s
Wall time: 44min 14s


In [100]:
model_ug_dmm.most_similar('good')

[('great', 0.9227250814437866),
 ('bad', 0.883027195930481),
 ('nice', 0.8792778849601746),
 ('this', 0.8624216914176941),
 ('you', 0.8607751727104187),
 ('busy', 0.8561026453971863),
 ('sad', 0.8500779867172241),
 ('better', 0.8498481512069702),
 ('long', 0.8439643979072571),
 ('not', 0.8438945412635803)]

In [102]:
train_vecs_dmm = get_vectors(model_ug_dmm, x_train, 100)
validation_vecs_dmm = get_vectors(model_ug_dmm, x_validation, 100)

In [103]:
clf = LogisticRegression()
clf.fit(train_vecs_dmm, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [104]:
clf.score(validation_vecs_dmm, y_validation)

0.72556390977443608

In [105]:
model_ug_dmm.save('d2v_model_ug_dmm.doc2vec')
model_ug_dmm = Doc2Vec.load('d2v_model_ug_dmm.doc2vec')
model_ug_dmm.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

Since I have the document vectors from three different models, I can now concatenate them in combination to see how that affects performance. Below I define a simple function to concatenate document vectors from different models.

In [15]:
def get_concat_vectors(model1,model2, corpus, size):
    vecs = np.zeros((len(corpus), size))
    n = 0
    for i in corpus.index:
        prefix = 'all_' + str(i)
        vecs[n] = np.append(model1.docvecs[prefix],model2.docvecs[prefix])
        n += 1
    return vecs
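
The concatenation function above builds the feature matrix one row at a time. Once the per-model vectors are already available as matrices, the same result could be obtained in one step. This is a sketch on random stand-in matrices, not on the trained models.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, dim = 5, 100

# Stand-ins for the per-model document-vector matrices (rows = tweets).
vecs_a = rng.normal(size=(n_docs, dim))
vecs_b = rng.normal(size=(n_docs, dim))

# Row-by-row np.append, as in the function above ...
looped = np.zeros((n_docs, 2 * dim))
for n in range(n_docs):
    looped[n] = np.append(vecs_a[n], vecs_b[n])

# ... gives the same matrix as stacking the two side by side.
stacked = np.hstack([vecs_a, vecs_b])

print(stacked.shape, np.array_equal(looped, stacked))
```

Either way, each tweet ends up with a 200-dimensional feature vector: the first 100 values from one model and the last 100 from the other.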

In [110]:
train_vecs_dbow_dmc = get_concat_vectors(model_ug_dbow,model_ug_dmc, x_train, 200)
validation_vecs_dbow_dmc = get_concat_vectors(model_ug_dbow,model_ug_dmc, x_validation, 200)

In [111]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_dbow_dmc, y_train)

CPU times: user 2min 22s, sys: 14min 10s, total: 16min 32s
Wall time: 36min 27s


In [112]:
clf.score(validation_vecs_dbow_dmc, y_validation)

0.74580200501253135

In [113]:
train_vecs_dbow_dmm = get_concat_vectors(model_ug_dbow,model_ug_dmm, x_train, 200)
validation_vecs_dbow_dmm = get_concat_vectors(model_ug_dbow,model_ug_dmm, x_validation, 200)

In [114]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_dbow_dmm, y_train)

CPU times: user 1min 48s, sys: 7min 46s, total: 9min 34s
Wall time: 20min 20s


In [115]:
clf.score(validation_vecs_dbow_dmm, y_validation)

0.75513784461152877

For unigrams, concatenating document vectors from different models boosted performance. The best validation accuracy from a single model was 73.89%, from DBOW. With concatenated vectors, the highest validation accuracy was 75.51%, from the DBOW + DMM combination.
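
For reference, the validation accuracies reported in this section can be collected and ranked in a few lines. The numbers are copied from the outputs above and rounded to four decimals.

```python
# Validation accuracies reported above, rounded to four decimals.
validation_accuracy = {
    "DBOW": 0.7389,
    "DMC": 0.6647,
    "DMM": 0.7256,
    "DBOW + DMC": 0.7458,
    "DBOW + DMM": 0.7551,
}

# Print the models from best to worst.
for name, acc in sorted(validation_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print("%-11s %.2f%%" % (name, acc * 100))

best = max(validation_accuracy, key=validation_accuracy.get)
gain_over_dbow = validation_accuracy[best] - validation_accuracy["DBOW"]
```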