In [1]:
import pandas as pd  
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [3]:
csv = 'clean_tweet.csv'
my_df = pd.read_csv(csv,index_col=0)
my_df.head()

Unnamed: 0,text,target
0,awww that bummer you shoulda got david carr of...,0
1,is upset that he can not update his facebook b...,0
2,dived many times for the ball managed to save ...,0
3,my whole body feels itchy and like its on fire,0
4,no it not behaving at all mad why am here beca...,0


In [4]:
my_df.dropna(inplace=True)
my_df.reset_index(drop=True,inplace=True)
my_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1596019 entries, 0 to 1596018
Data columns (total 2 columns):
text      1596019 non-null object
target    1596019 non-null int64
dtypes: int64(1), object(1)
memory usage: 24.4+ MB


In [5]:
x = my_df.text
y = my_df.target

In [7]:
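# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides the same function
from sklearn.model_selection import train_test_split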
SEED = 2000
x_train, x_validation_and_test, y_train, y_validation_and_test = train_test_split(x, y, test_size=.02, random_state=SEED)
x_validation, x_test, y_validation, y_test = train_test_split(x_validation_and_test, y_validation_and_test, test_size=.5, random_state=SEED)

In [8]:
# print() with a single argument runs under both Python 2 and Python 3
print("Train set has total {0} entries with {1:.2f}% negative, {2:.2f}% positive".format(len(x_train),
      (len(x_train[y_train == 0]) / (len(x_train)*1.))*100,
      (len(x_train[y_train == 1]) / (len(x_train)*1.))*100))
print("Validation set has total {0} entries with {1:.2f}% negative, {2:.2f}% positive".format(len(x_validation),
      (len(x_validation[y_validation == 0]) / (len(x_validation)*1.))*100,
      (len(x_validation[y_validation == 1]) / (len(x_validation)*1.))*100))
print("Test set has total {0} entries with {1:.2f}% negative, {2:.2f}% positive".format(len(x_test),
      (len(x_test[y_test == 0]) / (len(x_test)*1.))*100,
      (len(x_test[y_test == 1]) / (len(x_test)*1.))*100))

Train set has total 1564098 entries with 50.00% negative, 50.00% positive
Validation set has total 15960 entries with 50.40% negative, 49.60% positive
Test set has total 15961 entries with 50.26% negative, 49.74% positive
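
The set sizes above follow directly from the two split fractions. As a quick check of the arithmetic, here is a minimal sketch; `split_sizes` is a hypothetical helper, and it assumes the held-out share is rounded up with the remainder going to the other part, which is consistent with the counts printed above.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out share is rounded up and the rest is kept.
    """
    n_held_out = int(math.ceil(test_size * n_samples))
    return n_samples - n_held_out, n_held_out

# First split: 2% of all tweets are held out for validation and test
n_train, n_holdout = split_sizes(1596019, 0.02)
# Second split: the held-out tweets are divided in half
n_validation, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_validation, n_test)
```

So roughly 98% of the data is used for training, and validation and test each get about 1%.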


# Neural Networks with Doc2Vec

Before I jump into neural network modelling with the vectors I got from Doc2Vec, I would like to give you some background on how I got these document vectors. I implemented Doc2Vec using the Gensim library in the 6th part of this series.

I used three different methods to train Doc2Vec: Distributed Bag of Words (DBOW), Distributed Memory with mean (DMM), and Distributed Memory with concatenation (DMC). Each model was trained on 1.5 million tweets for 30 epochs, and each outputs a 100-dimensional vector per tweet. After getting document vectors from each model, I tried concatenating them in pairs (DBOW + DMM, DBOW + DMC), so that the concatenated document vectors have 200 dimensions, and saw an improvement in performance compared with models using a single method. Combining vectors from different training methods to improve performance was already demonstrated by Le and Mikolov (2014) in their research paper.
https://cs.stanford.edu/~quocle/paragraph_vector.pdf
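
To make the concatenation concrete, here is a minimal sketch with NumPy. The arrays `dbow_vecs` and `dmm_vecs` are hypothetical random stand-ins for the per-tweet vectors of two trained models, not the actual Doc2Vec output.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical stand-ins: 5 tweets, one 100-dimensional vector per tweet and model
dbow_vecs = rng.rand(5, 100)
dmm_vecs = rng.rand(5, 100)

# Place the two vectors of each tweet side by side: 100 + 100 = 200 dimensions
concat_vecs = np.hstack([dbow_vecs, dmm_vecs])

print(concat_vecs.shape)
```

The first 100 columns of each row come from one model and the last 100 from the other, so a downstream classifier sees both representations at once.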

Finally, I applied phrase modelling to detect bigram and trigram phrases as a pre-step of Doc2Vec training, and tried different combinations across n-grams. When tested with a logistic regression model, I got the best result from the 'unigram DBOW + trigram DMM' document vectors.

I will start by loading Gensim's Doc2Vec and defining a function to extract document vectors, then load the Doc2Vec models I trained.

In [9]:
from gensim.models import Doc2Vec

def get_concat_vectors(model1,model2, corpus, size):
    vecs = np.zeros((len(corpus), size))
    n = 0
    for i in corpus.index:
        prefix = 'all_' + str(i)
        vecs[n] = np.append(model1.docvecs[prefix],model2.docvecs[prefix])
        n += 1
    return vecs

In [11]:
model_ug_dbow = Doc2Vec.load('d2v_model_ug_dbow.doc2vec')
model_tg_dmm = Doc2Vec.load('d2v_model_tg_dmm.doc2vec')
model_ug_dbow.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)
model_tg_dmm.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

In [12]:
train_vecs_ugdbow_tgdmm = get_concat_vectors(model_ug_dbow,model_tg_dmm, x_train, 200)
validation_vecs_ugdbow_tgdmm = get_concat_vectors(model_ug_dbow,model_tg_dmm, x_validation, 200)

In [13]:
%%time
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(train_vecs_ugdbow_tgdmm, y_train)

CPU times: user 58.9 s, sys: 35 s, total: 1min 33s
Wall time: 2min 3s


In [14]:
%%time
clf.score(train_vecs_ugdbow_tgdmm, y_train)

CPU times: user 1.04 s, sys: 4.62 s, total: 5.66 s
Wall time: 8.79 s


0.7590662477670836

In [15]:
%%time
clf.score(validation_vecs_ugdbow_tgdmm, y_validation)

CPU times: user 11.8 ms, sys: 47.9 ms, total: 59.6 ms
Wall time: 90.1 ms


0.7576441102756892

When fed to a simple logistic regression, the concatenated document vectors (unigram DBOW + trigram DMM) yield 75.91% training set accuracy and 75.76% validation set accuracy.

I will try different numbers of hidden layers and hidden nodes to compare performance. In the code block below, I first define the seed as 7 without actually setting it; "np.random.seed()" is called at the start of each model instead. This keeps the results from the different model structures reproducible.

*Side Note (reproducibility):* To be honest, this took me a while to figure out. I first tried setting the random seed once before importing Keras, and then ran one model after another. However, when I defined the same model structure again after it had already run, I couldn't get the same result. I also realised that if I restart the kernel and re-run the code blocks from the start, I get the same result as in the previous kernel session. So I figured that the state of the random number generator changes after a model has run, and that this is why I cannot get the same result from the same structure when I run it twice in a row in the same kernel. That is why I set the random seed every time I try a different model. For your information, I am running Keras with the Theano backend, and only using CPU, not GPU. If you are on the same setting, this should work. I explicitly specified Theano as the backend by launching Jupyter Notebook from the command line as follows: "KERAS_BACKEND=theano jupyter notebook"
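
The behaviour described in the side note can be reproduced with NumPy alone. This is a minimal sketch: `draw_initial_weights` is a hypothetical stand-in for whatever random draws a model makes when it is built and trained, and it assumes those draws come from NumPy's global generator (which matches what I observed with Theano on CPU, but may not hold for other backends).

```python
import numpy as np

seed = 7

def draw_initial_weights(n=3):
    # Hypothetical stand-in for the random weight initialisation of a model
    return np.random.uniform(-1, 1, size=n)

np.random.seed(seed)
first = draw_initial_weights()
second = draw_initial_weights()   # the generator state has advanced, so this differs

np.random.seed(seed)              # reset the state before the "next model"
third = draw_initial_weights()

print(np.array_equal(first, second))
print(np.array_equal(first, third))
```

Only the draw made right after re-seeding matches the first one, which is why the seed is set again before every model.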

Please note that not all of the dependencies loaded in the cell below are used in this post; some are imported for later use.

In [12]:
seed = 7

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence

Using Theano backend.


In [16]:
%%time
np.random.seed(seed)
model_d2v_01 = Sequential()
model_d2v_01.add(Dense(64, activation='relu', input_dim=200))
model_d2v_01.add(Dense(1, activation='sigmoid'))
model_d2v_01.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model_d2v_01.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)

Train on 1564098 samples, validate on 15960 samples
Epoch 1/10
 - 30s - loss: 0.4791 - acc: 0.7749 - val_loss: 0.4661 - val_acc: 0.7787
Epoch 2/10
 - 30s - loss: 0.4637 - acc: 0.7829 - val_loss: 0.4717 - val_acc: 0.7816
Epoch 3/10
 - 30s - loss: 0.4593 - acc: 0.7852 - val_loss: 0.4614 - val_acc: 0.7838
Epoch 4/10
 - 30s - loss: 0.4567 - acc: 0.7867 - val_loss: 0.4607 - val_acc: 0.7837
Epoch 5/10
 - 29s - loss: 0.4552 - acc: 0.7878 - val_loss: 0.4586 - val_acc: 0.7862
Epoch 6/10
 - 29s - loss: 0.4537 - acc: 0.7883 - val_loss: 0.4579 - val_acc: 0.7853
Epoch 7/10
 - 30s - loss: 0.4527 - acc: 0.7887 - val_loss: 0.4576 - val_acc: 0.7863
Epoch 8/10
 - 30s - loss: 0.4519 - acc: 0.7891 - val_loss: 0.4566 - val_acc: 0.7866
Epoch 9/10
 - 30s - loss: 0.4512 - acc: 0.7896 - val_loss: 0.4573 - val_acc: 0.7877
Epoch 10/10
 - 30s - loss: 0.4507 - acc: 0.7898 - val_loss: 0.4585 - val_acc: 0.7856
CPU times: user 4min 53s, sys: 16.1 s, total: 5min 9s
Wall time: 4min 58s


In [19]:
np.random.seed(seed)
model_d2v_02 = Sequential()
model_d2v_02.add(Dense(64, activation='relu', input_dim=200))
model_d2v_02.add(Dense(64, activation='relu'))
model_d2v_02.add(Dense(1, activation='sigmoid'))
model_d2v_02.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model_d2v_02.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)

Train on 1564098 samples, validate on 15960 samples
Epoch 1/10
 - 39s - loss: 0.4661 - acc: 0.7777 - val_loss: 0.4541 - val_acc: 0.7826
Epoch 2/10
 - 39s - loss: 0.4483 - acc: 0.7878 - val_loss: 0.4504 - val_acc: 0.7901
Epoch 3/10
 - 41s - loss: 0.4431 - acc: 0.7906 - val_loss: 0.4491 - val_acc: 0.7898
Epoch 4/10
 - 43s - loss: 0.4403 - acc: 0.7923 - val_loss: 0.4472 - val_acc: 0.7927
Epoch 5/10
 - 41s - loss: 0.4382 - acc: 0.7933 - val_loss: 0.4472 - val_acc: 0.7942
Epoch 6/10
 - 40s - loss: 0.4369 - acc: 0.7940 - val_loss: 0.4441 - val_acc: 0.7912
Epoch 7/10
 - 40s - loss: 0.4359 - acc: 0.7946 - val_loss: 0.4465 - val_acc: 0.7910
Epoch 8/10
 - 40s - loss: 0.4348 - acc: 0.7951 - val_loss: 0.4495 - val_acc: 0.7955
Epoch 9/10
 - 41s - loss: 0.4341 - acc: 0.7956 - val_loss: 0.4511 - val_acc: 0.7900
Epoch 10/10
 - 40s - loss: 0.4336 - acc: 0.7961 - val_loss: 0.4457 - val_acc: 0.7928


<keras.callbacks.History at 0x146c17c10>

In [22]:
np.random.seed(seed)
model_d2v_03 = Sequential()
model_d2v_03.add(Dense(64, activation='relu', input_dim=200))
model_d2v_03.add(Dense(64, activation='relu'))
model_d2v_03.add(Dense(64, activation='relu'))
model_d2v_03.add(Dense(1, activation='sigmoid'))
model_d2v_03.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model_d2v_03.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)

Train on 1564098 samples, validate on 15960 samples
Epoch 1/10
 - 44s - loss: 0.4655 - acc: 0.7778 - val_loss: 0.4548 - val_acc: 0.7853
Epoch 2/10
 - 49s - loss: 0.4479 - acc: 0.7883 - val_loss: 0.4484 - val_acc: 0.7905
Epoch 3/10
 - 52s - loss: 0.4426 - acc: 0.7912 - val_loss: 0.4487 - val_acc: 0.7902
Epoch 4/10
 - 55s - loss: 0.4395 - acc: 0.7925 - val_loss: 0.4465 - val_acc: 0.7916
Epoch 5/10
 - 57s - loss: 0.4372 - acc: 0.7939 - val_loss: 0.4459 - val_acc: 0.7925
Epoch 6/10
 - 58s - loss: 0.4358 - acc: 0.7948 - val_loss: 0.4432 - val_acc: 0.7919
Epoch 7/10
 - 58s - loss: 0.4345 - acc: 0.7955 - val_loss: 0.4435 - val_acc: 0.7937
Epoch 8/10
 - 60s - loss: 0.4336 - acc: 0.7960 - val_loss: 0.4433 - val_acc: 0.7934
Epoch 9/10
 - 59s - loss: 0.4328 - acc: 0.7966 - val_loss: 0.4500 - val_acc: 0.7914
Epoch 10/10
 - 60s - loss: 0.4322 - acc: 0.7967 - val_loss: 0.4421 - val_acc: 0.7912


<keras.callbacks.History at 0x15c5a8e10>

In [27]:
np.random.seed(seed)
model_d2v_04 = Sequential()
model_d2v_04.add(Dense(128, activation='relu', input_dim=200))
model_d2v_04.add(Dense(1, activation='sigmoid'))
model_d2v_04.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model_d2v_04.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)

Train on 1564098 samples, validate on 15960 samples
Epoch 1/10
 - 37s - loss: 0.4762 - acc: 0.7762 - val_loss: 0.4624 - val_acc: 0.7830
Epoch 2/10
 - 39s - loss: 0.4592 - acc: 0.7851 - val_loss: 0.4640 - val_acc: 0.7843
Epoch 3/10
 - 36s - loss: 0.4533 - acc: 0.7883 - val_loss: 0.4576 - val_acc: 0.7868
Epoch 4/10
 - 36s - loss: 0.4497 - acc: 0.7903 - val_loss: 0.4561 - val_acc: 0.7883
Epoch 5/10
 - 36s - loss: 0.4473 - acc: 0.7915 - val_loss: 0.4555 - val_acc: 0.7865
Epoch 6/10
 - 38s - loss: 0.4455 - acc: 0.7928 - val_loss: 0.4538 - val_acc: 0.7882
Epoch 7/10
 - 38s - loss: 0.4440 - acc: 0.7934 - val_loss: 0.4523 - val_acc: 0.7896
Epoch 8/10
 - 36s - loss: 0.4428 - acc: 0.7940 - val_loss: 0.4537 - val_acc: 0.7911
Epoch 9/10
 - 36s - loss: 0.4420 - acc: 0.7947 - val_loss: 0.4539 - val_acc: 0.7851
Epoch 10/10
 - 36s - loss: 0.4412 - acc: 0.7946 - val_loss: 0.4533 - val_acc: 0.7914


<keras.callbacks.History at 0x146ac9fd0>

In [28]:
np.random.seed(seed)
model_d2v_05 = Sequential()
model_d2v_05.add(Dense(128, activation='relu', input_dim=200))
model_d2v_05.add(Dense(128, activation='relu'))
model_d2v_05.add(Dense(1, activation='sigmoid'))
model_d2v_05.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model_d2v_05.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)

Train on 1564098 samples, validate on 15960 samples
Epoch 1/10
 - 69s - loss: 0.4609 - acc: 0.7807 - val_loss: 0.4496 - val_acc: 0.7887
Epoch 2/10
 - 77s - loss: 0.4420 - acc: 0.7912 - val_loss: 0.4427 - val_acc: 0.7928
Epoch 3/10
 - 88s - loss: 0.4358 - acc: 0.7948 - val_loss: 0.4433 - val_acc: 0.7902
Epoch 4/10
 - 91s - loss: 0.4319 - acc: 0.7968 - val_loss: 0.4386 - val_acc: 0.7960
Epoch 5/10
 - 93s - loss: 0.4290 - acc: 0.7983 - val_loss: 0.4398 - val_acc: 0.7950
Epoch 6/10
 - 94s - loss: 0.4267 - acc: 0.7995 - val_loss: 0.4379 - val_acc: 0.7955
Epoch 7/10
 - 95s - loss: 0.4251 - acc: 0.8003 - val_loss: 0.4383 - val_acc: 0.7942
Epoch 8/10
 - 96s - loss: 0.4235 - acc: 0.8013 - val_loss: 0.4416 - val_acc: 0.7944
Epoch 9/10
 - 96s - loss: 0.4223 - acc: 0.8019 - val_loss: 0.4445 - val_acc: 0.7926
Epoch 10/10
 - 94s - loss: 0.4213 - acc: 0.8024 - val_loss: 0.4388 - val_acc: 0.7950


<keras.callbacks.History at 0x154e557d0>

In [29]:
np.random.seed(seed)
model_d2v_06 = Sequential()
model_d2v_06.add(Dense(128, activation='relu', input_dim=200))
model_d2v_06.add(Dense(128, activation='relu'))
model_d2v_06.add(Dense(128, activation='relu'))
model_d2v_06.add(Dense(1, activation='sigmoid'))
model_d2v_06.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model_d2v_06.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)

Train on 1564098 samples, validate on 15960 samples
Epoch 1/10
 - 124s - loss: 0.4613 - acc: 0.7806 - val_loss: 0.4481 - val_acc: 0.7869
Epoch 2/10
 - 146s - loss: 0.4420 - acc: 0.7914 - val_loss: 0.4426 - val_acc: 0.7923
Epoch 3/10
 - 175s - loss: 0.4352 - acc: 0.7951 - val_loss: 0.4425 - val_acc: 0.7931
Epoch 4/10
 - 184s - loss: 0.4312 - acc: 0.7971 - val_loss: 0.4381 - val_acc: 0.7953
Epoch 5/10
 - 188s - loss: 0.4281 - acc: 0.7989 - val_loss: 0.4373 - val_acc: 0.7946
Epoch 6/10
 - 189s - loss: 0.4259 - acc: 0.8000 - val_loss: 0.4370 - val_acc: 0.7975
Epoch 7/10
 - 196s - loss: 0.4243 - acc: 0.8010 - val_loss: 0.4395 - val_acc: 0.7969
Epoch 8/10
 - 224s - loss: 0.4228 - acc: 0.8017 - val_loss: 0.4398 - val_acc: 0.7942
Epoch 9/10
 - 210s - loss: 0.4220 - acc: 0.8022 - val_loss: 0.4455 - val_acc: 0.7948
Epoch 10/10
 - 221s - loss: 0.4209 - acc: 0.8029 - val_loss: 0.4373 - val_acc: 0.7947


<keras.callbacks.History at 0x155e35910>

In [30]:
np.random.seed(seed)
model_d2v_07 = Sequential()
model_d2v_07.add(Dense(256, activation='relu', input_dim=200))
model_d2v_07.add(Dense(1, activation='sigmoid'))
model_d2v_07.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model_d2v_07.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)

Train on 1564098 samples, validate on 15960 samples
Epoch 1/10
 - 57s - loss: 0.4746 - acc: 0.7770 - val_loss: 0.4627 - val_acc: 0.7821
Epoch 2/10
 - 73s - loss: 0.4572 - acc: 0.7861 - val_loss: 0.4639 - val_acc: 0.7840
Epoch 3/10
 - 72s - loss: 0.4505 - acc: 0.7897 - val_loss: 0.4568 - val_acc: 0.7857
Epoch 4/10
 - 71s - loss: 0.4458 - acc: 0.7923 - val_loss: 0.4541 - val_acc: 0.7860
Epoch 5/10
 - 67s - loss: 0.4423 - acc: 0.7947 - val_loss: 0.4547 - val_acc: 0.7891
Epoch 6/10
 - 68s - loss: 0.4395 - acc: 0.7962 - val_loss: 0.4526 - val_acc: 0.7870
Epoch 7/10
 - 67s - loss: 0.4370 - acc: 0.7978 - val_loss: 0.4516 - val_acc: 0.7912
Epoch 8/10
 - 67s - loss: 0.4349 - acc: 0.7988 - val_loss: 0.4548 - val_acc: 0.7904
Epoch 9/10
 - 67s - loss: 0.4332 - acc: 0.7999 - val_loss: 0.4571 - val_acc: 0.7890
Epoch 10/10
 - 67s - loss: 0.4318 - acc: 0.8007 - val_loss: 0.4580 - val_acc: 0.7895


<keras.callbacks.History at 0x15ac06f50>

In [31]:
np.random.seed(seed)
model_d2v_08 = Sequential()
model_d2v_08.add(Dense(256, activation='relu', input_dim=200))
model_d2v_08.add(Dense(256, activation='relu'))
model_d2v_08.add(Dense(1, activation='sigmoid'))
model_d2v_08.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model_d2v_08.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)

Train on 1564098 samples, validate on 15960 samples
Epoch 1/10
 - 172s - loss: 0.4581 - acc: 0.7824 - val_loss: 0.4465 - val_acc: 0.7866
Epoch 2/10
 - 224s - loss: 0.4383 - acc: 0.7936 - val_loss: 0.4416 - val_acc: 0.7939
Epoch 3/10
 - 283s - loss: 0.4304 - acc: 0.7979 - val_loss: 0.4431 - val_acc: 0.7927
Epoch 4/10
 - 308s - loss: 0.4251 - acc: 0.8007 - val_loss: 0.4409 - val_acc: 0.7904
Epoch 5/10
 - 323s - loss: 0.4209 - acc: 0.8029 - val_loss: 0.4400 - val_acc: 0.7908
Epoch 6/10
 - 334s - loss: 0.4177 - acc: 0.8047 - val_loss: 0.4386 - val_acc: 0.7937
Epoch 7/10
 - 341s - loss: 0.4150 - acc: 0.8062 - val_loss: 0.4427 - val_acc: 0.7948
Epoch 8/10
 - 347s - loss: 0.4126 - acc: 0.8074 - val_loss: 0.4471 - val_acc: 0.7949
Epoch 9/10
 - 354s - loss: 0.4105 - acc: 0.8083 - val_loss: 0.4449 - val_acc: 0.7926
Epoch 10/10
 - 358s - loss: 0.4089 - acc: 0.8091 - val_loss: 0.4438 - val_acc: 0.7951


<keras.callbacks.History at 0x1618ba590>

In [32]:
np.random.seed(seed)
model_d2v_09 = Sequential()
model_d2v_09.add(Dense(256, activation='relu', input_dim=200))
model_d2v_09.add(Dense(256, activation='relu'))
model_d2v_09.add(Dense(256, activation='relu'))
model_d2v_09.add(Dense(1, activation='sigmoid'))
model_d2v_09.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model_d2v_09.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)

Train on 1564098 samples, validate on 15960 samples
Epoch 1/10
 - 349s - loss: 0.4579 - acc: 0.7827 - val_loss: 0.4448 - val_acc: 0.7904
Epoch 2/10
 - 500s - loss: 0.4371 - acc: 0.7944 - val_loss: 0.4401 - val_acc: 0.7967
Epoch 3/10
 - 649s - loss: 0.4287 - acc: 0.7987 - val_loss: 0.4396 - val_acc: 0.7948
Epoch 4/10
 - 672s - loss: 0.4229 - acc: 0.8019 - val_loss: 0.4369 - val_acc: 0.7957
Epoch 5/10
 - 664s - loss: 0.4182 - acc: 0.8046 - val_loss: 0.4353 - val_acc: 0.7953
Epoch 6/10
 - 664s - loss: 0.4146 - acc: 0.8063 - val_loss: 0.4363 - val_acc: 0.7974
Epoch 7/10
 - 670s - loss: 0.4115 - acc: 0.8079 - val_loss: 0.4403 - val_acc: 0.7993
Epoch 8/10
 - 670s - loss: 0.4087 - acc: 0.8094 - val_loss: 0.4437 - val_acc: 0.7964
Epoch 9/10
 - 672s - loss: 0.4061 - acc: 0.8107 - val_loss: 0.4435 - val_acc: 0.7926
Epoch 10/10
 - 672s - loss: 0.4037 - acc: 0.8118 - val_loss: 0.4411 - val_acc: 0.7952


<keras.callbacks.History at 0x1637e7890>

In [33]:
np.random.seed(seed)
model_d2v_10 = Sequential()
model_d2v_10.add(Dense(512, activation='relu', input_dim=200))
model_d2v_10.add(Dense(1, activation='sigmoid'))
model_d2v_10.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model_d2v_10.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)

Train on 1564098 samples, validate on 15960 samples
Epoch 1/10
 - 89s - loss: 0.4739 - acc: 0.7773 - val_loss: 0.4615 - val_acc: 0.7823
Epoch 2/10
 - 129s - loss: 0.4556 - acc: 0.7872 - val_loss: 0.4603 - val_acc: 0.7864
Epoch 3/10
 - 142s - loss: 0.4480 - acc: 0.7914 - val_loss: 0.4570 - val_acc: 0.7874
Epoch 4/10
 - 154s - loss: 0.4418 - acc: 0.7948 - val_loss: 0.4522 - val_acc: 0.7865
Epoch 5/10
 - 157s - loss: 0.4367 - acc: 0.7981 - val_loss: 0.4567 - val_acc: 0.7865
Epoch 6/10
 - 159s - loss: 0.4319 - acc: 0.8009 - val_loss: 0.4577 - val_acc: 0.7872
Epoch 7/10
 - 156s - loss: 0.4276 - acc: 0.8032 - val_loss: 0.4586 - val_acc: 0.7904
Epoch 8/10
 - 157s - loss: 0.4237 - acc: 0.8058 - val_loss: 0.4602 - val_acc: 0.7873
Epoch 9/10
 - 154s - loss: 0.4208 - acc: 0.8073 - val_loss: 0.4645 - val_acc: 0.7857
Epoch 10/10
 - 154s - loss: 0.4179 - acc: 0.8091 - val_loss: 0.4719 - val_acc: 0.7835


<keras.callbacks.History at 0x164363f10>

In [34]:
np.random.seed(seed)
model_d2v_11 = Sequential()
model_d2v_11.add(Dense(512, activation='relu', input_dim=200))
model_d2v_11.add(Dense(512, activation='relu'))
model_d2v_11.add(Dense(1, activation='sigmoid'))
model_d2v_11.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model_d2v_11.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)

Train on 1564098 samples, validate on 15960 samples
Epoch 1/10
 - 623s - loss: 0.4564 - acc: 0.7835 - val_loss: 0.4439 - val_acc: 0.7911
Epoch 2/10
 - 955s - loss: 0.4355 - acc: 0.7952 - val_loss: 0.4398 - val_acc: 0.7945
Epoch 3/10
 - 1239s - loss: 0.4264 - acc: 0.8003 - val_loss: 0.4431 - val_acc: 0.7978
Epoch 4/10
 - 1270s - loss: 0.4190 - acc: 0.8044 - val_loss: 0.4404 - val_acc: 0.7964
Epoch 5/10
 - 1329s - loss: 0.4131 - acc: 0.8070 - val_loss: 0.4452 - val_acc: 0.7954
Epoch 6/10
 - 1503s - loss: 0.4080 - acc: 0.8093 - val_loss: 0.4429 - val_acc: 0.7937
Epoch 7/10
 - 1516s - loss: 0.4034 - acc: 0.8116 - val_loss: 0.4433 - val_acc: 0.7964
Epoch 8/10
 - 1401s - loss: 0.3995 - acc: 0.8137 - val_loss: 0.4583 - val_acc: 0.7937
Epoch 9/10
 - 1445s - loss: 0.3961 - acc: 0.8153 - val_loss: 0.4540 - val_acc: 0.7934
Epoch 10/10
 - 1530s - loss: 0.3930 - acc: 0.8166 - val_loss: 0.4583 - val_acc: 0.7957


<keras.callbacks.History at 0x1655bf3d0>

In [35]:
np.random.seed(seed)
model_d2v_12 = Sequential()
model_d2v_12.add(Dense(512, activation='relu', input_dim=200))
model_d2v_12.add(Dense(512, activation='relu'))
model_d2v_12.add(Dense(512, activation='relu'))
model_d2v_12.add(Dense(1, activation='sigmoid'))
model_d2v_12.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model_d2v_12.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), epochs=10, batch_size=32, verbose=2)

Train on 1564098 samples, validate on 15960 samples
Epoch 1/10
 - 1384s - loss: 0.4558 - acc: 0.7840 - val_loss: 0.4426 - val_acc: 0.7876
Epoch 2/10
 - 2148s - loss: 0.4332 - acc: 0.7966 - val_loss: 0.4344 - val_acc: 0.7965
Epoch 3/10
 - 2682s - loss: 0.4224 - acc: 0.8027 - val_loss: 0.4385 - val_acc: 0.7951
Epoch 4/10
 - 2612s - loss: 0.4140 - acc: 0.8070 - val_loss: 0.4338 - val_acc: 0.7977
Epoch 5/10
 - 2596s - loss: 0.4068 - acc: 0.8109 - val_loss: 0.4341 - val_acc: 0.7962
Epoch 6/10
 - 2618s - loss: 0.4006 - acc: 0.8144 - val_loss: 0.4362 - val_acc: 0.7956
Epoch 7/10
 - 2624s - loss: 0.3948 - acc: 0.8171 - val_loss: 0.4364 - val_acc: 0.7983
Epoch 8/10
 - 2709s - loss: 0.3895 - acc: 0.8201 - val_loss: 0.4442 - val_acc: 0.7965
Epoch 9/10
 - 2730s - loss: 0.3847 - acc: 0.8226 - val_loss: 0.4431 - val_acc: 0.7948
Epoch 10/10
 - 2708s - loss: 0.3804 - acc: 0.8251 - val_loss: 0.4458 - val_acc: 0.7917


<keras.callbacks.History at 0x16f72ae50>

After trying 12 different models with 1 to 3 hidden layers and 64, 128, 256, or 512 hidden nodes per hidden layer, below is the result I got. The best validation accuracy (79.93%) comes from "model_d2v_09" at epoch 7, which has 3 hidden layers of 256 hidden nodes each.

| model | input layer (nodes) | hidden layers (nodes per layer) | output layer (nodes) | best validation accuracy | epoch of best validation accuracy |
|-------|--------------|--------------|------------------|--------|--------|
| model_d2v_01 | 1 (200)  | 1 (64) relu  |  1 (1) sigmoid   | 78.77% | epoch 9 |
| model_d2v_02 | 1 (200)  | 2 (64) relu  |  1 (1) sigmoid   | 79.55% | epoch 8 |
| model_d2v_03 | 1 (200)  | 3 (64) relu  |  1 (1) sigmoid   | 79.37% | epoch 7 |
| model_d2v_04 | 1 (200)  | 1 (128) relu  |  1 (1) sigmoid   | 79.14% | epoch 10  |
| model_d2v_05 | 1 (200)  | 2 (128) relu  |  1 (1) sigmoid   | 79.60% | epoch 4  |
| model_d2v_06 | 1 (200)  | 3 (128) relu  |  1 (1) sigmoid   | 79.75% | epoch 6  |
| model_d2v_07 | 1 (200)  | 1 (256) relu  |  1 (1) sigmoid   | 79.12% | epoch 7  |
| model_d2v_08 | 1 (200)  | 2 (256) relu  |  1 (1) sigmoid   | 79.51% | epoch 10  |
| model_d2v_09 | 1 (200)  | 3 (256) relu |  1 (1) sigmoid   | 79.93% | epoch 7  |
| model_d2v_10 | 1 (200)  | 1 (512) relu |  1 (1) sigmoid   | 79.04% | epoch 7  |
| model_d2v_11 | 1 (200)  | 2 (512) relu |  1 (1) sigmoid   | 79.78% | epoch 3  |
| model_d2v_12 | 1 (200)  | 3 (512) relu |  1 (1) sigmoid   | 79.83% | epoch 7  |
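
Each row of the table was read off the training log by hand. The same lookup can be sketched in a few lines; the list below holds the per-epoch validation accuracies printed for "model_d2v_09" above, and `best_epoch` is a hypothetical helper. If the object returned by `fit` is kept, a list like this should also be available from its `history` dictionary (assumed key: `'val_acc'`).

```python
# Validation accuracy per epoch, copied from the model_d2v_09 log above
val_acc_09 = [0.7904, 0.7967, 0.7948, 0.7957, 0.7953,
              0.7974, 0.7993, 0.7964, 0.7926, 0.7952]

def best_epoch(val_acc):
    """Return (1-based epoch, accuracy) of the highest validation accuracy."""
    best_index = max(range(len(val_acc)), key=lambda i: val_acc[i])
    return best_index + 1, val_acc[best_index]

epoch, acc = best_epoch(val_acc_09)
print("best validation accuracy {:.2f}% at epoch {}".format(acc * 100, epoch))
```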

Now that I know which model gives me the best result, I will run the final "model_d2v_09" again, but this time with callback functions in Keras. I was not familiar with callback functions in Keras until I received a comment on my previous post. After that comment, I did some digging and found the useful functions in Keras callbacks. Thanks to @rcshubha for the comment. For my final Doc2Vec model below, I used "checkpoint" and "early_stop". With the parameter settings below, "checkpoint" saves the best-performing model so far, and saves a new file only when an epoch outperforms the saved model. I defined "early_stop" to monitor validation accuracy: if it does not improve on the best validation accuracy so far for 5 epochs, training stops.
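
The interplay of the two callbacks can be sketched in plain Python. This is a simplified simulation of the logic described above, not the Keras implementation; `simulate_callbacks` is a hypothetical helper, and the accuracy list is copied from the earlier "model_d2v_09" run.

```python
def simulate_callbacks(val_acc, patience):
    """Return (epochs at which a checkpoint would be saved, epoch at which training stops).

    The stop epoch is None if the patience never runs out within the given epochs.
    """
    best = float("-inf")
    wait = 0                      # epochs since the last improvement
    saved_epochs = []
    for epoch, acc in enumerate(val_acc, start=1):
        if acc > best:            # new best: save a checkpoint and reset the counter
            best = acc
            wait = 0
            saved_epochs.append(epoch)
        else:
            wait += 1
            if wait >= patience:  # no improvement for `patience` epochs: stop
                return saved_epochs, epoch
    return saved_epochs, None

# Validation accuracy per epoch from the earlier model_d2v_09 run
val_acc_09 = [0.7904, 0.7967, 0.7948, 0.7957, 0.7953,
              0.7974, 0.7993, 0.7964, 0.7926, 0.7952]

saved, stopped = simulate_callbacks(val_acc_09, patience=5)
print(saved, stopped)
```

Over these ten epochs, checkpoints would be written at epochs 1, 2, 6, and 7, and the patience of 5 would not yet have run out.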

In [36]:
from keras.callbacks import ModelCheckpoint, EarlyStopping

filepath="d2v_09_best_weights.{epoch:02d}-{val_acc:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
early_stop = EarlyStopping(monitor='val_acc', patience=5, mode='max') 
callbacks_list = [checkpoint, early_stop]
np.random.seed(seed)
model_d2v_09_es = Sequential()
model_d2v_09_es.add(Dense(256, activation='relu', input_dim=200))
model_d2v_09_es.add(Dense(256, activation='relu'))
model_d2v_09_es.add(Dense(256, activation='relu'))
model_d2v_09_es.add(Dense(1, activation='sigmoid'))
model_d2v_09_es.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model_d2v_09_es.fit(train_vecs_ugdbow_tgdmm, y_train, validation_data=(validation_vecs_ugdbow_tgdmm, y_validation), 
                    epochs=100, batch_size=32, verbose=2, callbacks=callbacks_list)

Train on 1564098 samples, validate on 15960 samples
Epoch 1/100
Epoch 00001: val_acc improved from -inf to 0.79041, saving model to d2v_09_best_weights.01-0.7904.hdf5
 - 354s - loss: 0.4579 - acc: 0.7827 - val_loss: 0.4448 - val_acc: 0.7904
Epoch 2/100
Epoch 00002: val_acc improved from 0.79041 to 0.79674, saving model to d2v_09_best_weights.02-0.7967.hdf5
 - 494s - loss: 0.4371 - acc: 0.7944 - val_loss: 0.4401 - val_acc: 0.7967
Epoch 3/100
Epoch 00003: val_acc did not improve
 - 635s - loss: 0.4287 - acc: 0.7987 - val_loss: 0.4396 - val_acc: 0.7948
Epoch 4/100
Epoch 00004: val_acc did not improve
 - 656s - loss: 0.4229 - acc: 0.8019 - val_loss: 0.4369 - val_acc: 0.7957
Epoch 5/100
Epoch 00005: val_acc did not improve
 - 665s - loss: 0.4182 - acc: 0.8046 - val_loss: 0.4353 - val_acc: 0.7953
Epoch 6/100
Epoch 00006: val_acc improved from 0.79674 to 0.79743, saving model to d2v_09_best_weights.06-0.7974.hdf5
 - 670s - loss: 0.4146 - acc: 0.8063 - val_loss: 0.4363 - val_acc: 0.7974
Epoch 

<keras.callbacks.History at 0x171067910>

If I evaluate the model I just ran, it gives me the same result as the last epoch.

In [37]:
model_d2v_09_es.evaluate(x=validation_vecs_ugdbow_tgdmm, y=y_validation)



[0.4493457753556713, 0.787719298275491]

But if I load the model saved at the best epoch, it gives me the result from that epoch.

In [38]:
from keras.models import load_model
loaded_model = load_model('d2v_09_best_weights.07-0.7993.hdf5')

In [39]:
loaded_model.evaluate(x=validation_vecs_ugdbow_tgdmm, y=y_validation)



[0.4402723977739052, 0.7993107769722329]
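
The name of the file loaded above comes from the "filepath" template passed to the checkpoint. As a minimal sketch, assuming the template is filled in with ordinary string formatting and a 1-based epoch number, and using the accuracy reported by `evaluate` above as a stand-in for the logged value:

```python
filepath = "d2v_09_best_weights.{epoch:02d}-{val_acc:.4f}.hdf5"

# {epoch:02d} pads the epoch to two digits, {val_acc:.4f} keeps four decimals
name = filepath.format(epoch=7, val_acc=0.7993107769722329)
print(name)
```

This is why the best epoch and its validation accuracy can be read directly from the saved file names.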

If you remember the validation accuracy of the logistic regression model on the same vector representation of the tweets (75.76%), you can see that feeding the same information to a neural network yields a significantly better result. It's amazing to see how a neural network can boost the performance of dense vectors, but the best validation accuracy is still lower than that of the Tfidf vectors + logistic regression model, which gave me 82.92% validation accuracy.

If you have read my posts on Doc2Vec, or are familiar with Doc2Vec, you might know that you can also extract a vector for each word from a trained Doc2Vec model. I will move on to Word2Vec and try different methods to see if any of them can outperform the Doc2Vec result (79.93%), and ultimately the Tfidf + logistic regression model (82.92%).

# Word2Vec

To make use of word vectors extracted from the Doc2Vec models, I can no longer use the concatenated vectors of different n-grams, since the models do not share the same vocabulary. Thus below, I load the unigram DMM model and concatenate its word vectors with those of unigram DBOW, which gives a 200-dimensional vector for each word in the vocabulary.

Before I try neural networks with document representations computed from word vectors, I will fit a logistic regression on each method of document representation. With the one that gives me the best validation accuracy, I will then define neural network models.

I will also summarise the results from all the different word vectors fit with logistic regression in a table.

## Word vectors extracted from Doc2Vec models (Average/Sum)

There are a number of ways to build a document vector from individual word vectors. One obvious choice is to average them. For every word in a tweet, check whether the trained Doc2Vec model has a vector for that word; if so, add it to a running sum and count the word. Dividing the summed vector by the count gives the averaged word vector for the whole document, which has the same dimension (200 in this case) as the individual word vectors.

Another method is to sum the word vectors without averaging them. This might distort the vector representation of the document, because the magnitude of the sum depends on how many of a tweet's words are in the Doc2Vec vocabulary: some tweets have only a few such words, and some have most of their words covered. But I will try both summing and averaging and compare the results.
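
The difference between the two methods is easy to see on toy data. The sketch below uses a hypothetical dictionary of 4-dimensional vectors in place of the trained models (the real word vectors have 200 dimensions), and `tweet_vector` is an illustrative helper, not one of the functions used in this post.

```python
import numpy as np

# Hypothetical toy word vectors standing in for the trained models
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0, 0.0]),
    "day": np.array([0.0, 1.0, 0.0, 2.0]),
}

def tweet_vector(tweet, vectors, size, average=True):
    """Sum the vectors of in-vocabulary words; optionally divide by their count."""
    total = np.zeros(size)
    count = 0
    for word in tweet.split():
        if word in vectors:       # words without a vector are skipped
            total += vectors[word]
            count += 1
    if average and count:
        total /= count
    return total

short_tweet = "good day"
long_tweet = "good day good day good day"

print(tweet_vector(short_tweet, toy_vectors, 4), tweet_vector(long_tweet, toy_vectors, 4))
print(tweet_vector(short_tweet, toy_vectors, 4, average=False),
      tweet_vector(long_tweet, toy_vectors, 4, average=False))
```

Both tweets get the same averaged vector, while the summed vector of the longer tweet is three times as large, which is the distortion mentioned above.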

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale

In [14]:
model_ug_dmm = Doc2Vec.load('d2v_model_ug_dmm.doc2vec')
model_ug_dmm.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

In [14]:
def get_w2v_ugdbowdmm(tweet, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tweet.split():
        try:
            vec += np.append(model_ug_dbow[word],model_ug_dmm[word]).reshape((1, size))
            count += 1.
        except KeyError:
            continue
    if count != 0:
        vec /= count
    return vec

In [15]:
def get_w2v_ugdbowdmm_sum(tweet, size):
    vec = np.zeros(size).reshape((1, size))
    for word in tweet.split():
        try:
            vec += np.append(model_ug_dbow[word],model_ug_dmm[word]).reshape((1, size))
        except KeyError:
            continue
    return vec

In [24]:
train_vecs_w2v_dbowdmm = np.concatenate([get_w2v_ugdbowdmm(z, 200) for z in x_train])
validation_vecs_w2v_dbowdmm = np.concatenate([get_w2v_ugdbowdmm(z, 200) for z in x_validation])

In [17]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_w2v_dbowdmm, y_train)

CPU times: user 6min 13s, sys: 2min 34s, total: 8min 48s
Wall time: 10min 58s


In [18]:
clf.score(validation_vecs_w2v_dbowdmm, y_validation)

0.7173558897243107

The validation accuracy with averaged word vectors of unigram DBOW + unigram DMM is 71.74%, which is significantly lower than with document vectors extracted from unigram DBOW + trigram DMM (75.76%). It is also lower than the 75.51% validation accuracy that, as I know from the 6th part of this series, document vectors extracted from unigram DBOW + unigram DMM give me.

I also tried scaling the vectors using scikit-learn's scale function, and saw a significant improvement in computation time and a slight improvement in accuracy.

In [25]:
train_vecs_w2v_dbowdmm_s = scale(train_vecs_w2v_dbowdmm)
validation_vecs_w2v_dbowdmm_s = scale(validation_vecs_w2v_dbowdmm)

In [26]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_w2v_dbowdmm_s, y_train)

CPU times: user 1min 11s, sys: 34.6 s, total: 1min 46s
Wall time: 2min 29s


In [27]:
clf.score(validation_vecs_w2v_dbowdmm_s, y_validation)

0.7241854636591478

Let's see how summed word vectors perform compared to their averaged counterpart.

In [16]:
train_vecs_w2v_dbowdmm_sum = np.concatenate([get_w2v_ugdbowdmm_sum(z, 200) for z in x_train])
validation_vecs_w2v_dbowdmm_sum = np.concatenate([get_w2v_ugdbowdmm_sum(z, 200) for z in x_validation])

In [17]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_w2v_dbowdmm_sum, y_train)

CPU times: user 22min 13s, sys: 1h 29min 43s, total: 1h 51min 57s
Wall time: 3h 28min 17s


In [18]:
clf.score(validation_vecs_w2v_dbowdmm_sum, y_validation)

0.7251253132832081

The summation method gave me higher accuracy without scaling compared to the average method. But the simple logistic regression with the summed vectors took more than 3 hours to run. So again I tried scaling these vectors.

In [19]:
train_vecs_w2v_dbowdmm_sum_s = scale(train_vecs_w2v_dbowdmm_sum)
validation_vecs_w2v_dbowdmm_sum_s = scale(validation_vecs_w2v_dbowdmm_sum)

In [22]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_w2v_dbowdmm_sum_s, y_train)

CPU times: user 2min 2s, sys: 48.4 s, total: 2min 51s
Wall time: 3min 41s


In [23]:
clf.score(validation_vecs_w2v_dbowdmm_sum_s, y_validation)

0.725250626566416

Surprising! With scaling, logistic regression fitting only took 3 minutes! That's quite a difference.
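For reference, scikit-learn's `scale` with default settings standardises each feature column to zero mean and unit variance. Standardised features usually make the optimisation problem better conditioned, which is a plausible reason for the faster fit. The sketch below reproduces the same column-wise calculation with NumPy on made-up data. It also shows a variation that is not what the cells above do: reusing the training mean and standard deviation for the validation set (scikit-learn's `StandardScaler` offers this fit-then-transform pattern), instead of scaling each set independently.

```python
import numpy as np

rng = np.random.RandomState(0)
# Stand-ins for the document vectors (assumption: random data, 5 features).
train = rng.normal(loc=3.0, scale=10.0, size=(1000, 5))
validation = rng.normal(loc=3.0, scale=10.0, size=(200, 5))

# Column-wise standardisation: subtract the mean, divide by the std.
mu = train.mean(axis=0)
sigma = train.std(axis=0)
train_s = (train - mu) / sigma

# Variation: apply the training statistics to the validation set,
# so both sets go through exactly the same transformation.
validation_s = (validation - mu) / sigma
```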

## Word vectors extracted from Doc2Vec models with TFIDF weighting (Average/Sum)

In the 5th part of this series, I explained what TF-IDF is. TF-IDF is a way of weighting each word by calculating the product of relative term frequency and inverse document frequency. Since it gives one scalar value for each word in the vocabulary, it can also be used as a weighting factor for each word vector. Correa Jr. et al. (2017) implemented this Tf-idf weighting in their paper "NILC-USP at SemEval-2017 Task 4: A Multi-view Ensemble for Twitter Sentiment Analysis". http://www.aclweb.org/anthology/S17-2100

In order to get a Tfidf weight for each word, I first fit TfidfVectorizer on the training set and create a dictionary of "word", "weight" pairs. Note that the values come from the vectorizer's idf_ attribute, so the weight stored for each word is its inverse document frequency.

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(min_df=2)
tvec.fit_transform(x_train)
tfidf = dict(zip(tvec.get_feature_names(), tvec.idf_))
print 'vocab size :', len(tfidf)

vocab size : 103691
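The per-word weights can be reproduced by hand. With its default settings (`smooth_idf=True`), TfidfVectorizer computes the inverse document frequency of a term as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The sketch below applies that formula to a made-up three-tweet corpus; the corpus and variable names are illustrative only.

```python
import math

# Assumption: a tiny made-up corpus, split on whitespace.
docs = ["good day", "good morning", "bad day today"]

n_docs = len(docs)
doc_freq = {}
for doc in docs:
    for word in set(doc.split()):  # count each word once per document
        doc_freq[word] = doc_freq.get(word, 0) + 1

# Smoothed IDF: rarer words get larger weights.
idf = {w: math.log((1.0 + n_docs) / (1.0 + df)) + 1.0
       for w, df in doc_freq.items()}
```

Here "today" appears in one tweet and "good" in two, so "today" receives the larger weight. Multiplying a word vector by such a weight is what the weighting step further down does for every word.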


In [33]:
len(set(model_ug_dbow.wv.vocab.keys()) & set(tvec.get_feature_names()))

103691

In [1]:
def get_w2v_general(tweet, size, vectors, aggregation='mean'):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tweet.split():
        try:
            vec += vectors[word].reshape((1, size))
            count += 1.
        except KeyError:
            continue
    if aggregation == 'mean':
        if count != 0:
            vec /= count
        return vec
    elif aggregation == 'sum':
        return vec

The code below could also be implemented inside the word vector averaging or summing function, but it seemed to take quite a long time, so I separated it out and built a dictionary of word vectors weighted by Tfidf values. To be honest, I am still not sure why it took so long to compute the Tfidf weighting of the word vectors, but after 5 hours it finally finished. You can also see later that I tried another method of weighting, and that took less than 10 seconds. If you have an answer to this, any insight would be appreciated.

In [22]:
%%time
w2v_tfidf = {}
for w in model_ug_dbow.wv.vocab.keys():
    if w in tvec.get_feature_names():
        w2v_tfidf[w] = np.append(model_ug_dbow[w],model_ug_dmm[w]) * tfidf[w]

CPU times: user 4h 53min 1s, sys: 6min 1s, total: 4h 59min 2s
Wall time: 4h 58min 17s
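One likely explanation for the long run time is the membership test in the loop. In the scikit-learn version used here, `tvec.get_feature_names()` builds and returns a plain Python list, and the loop calls it again for every word and then scans that list linearly with `in`. With around 100,000 features and a model vocabulary of at least that size, this adds up to billions of string comparisons. The custom weighting loop later in this post tests against `pos_hmean.keys()`, which is a pandas Index and uses a hash lookup instead. The sketch below illustrates the two patterns on made-up data; the names `vocab`, `feature_names` and `weights` are stand-ins, not objects from this notebook.

```python
# Assumption: small made-up stand-ins for the model vocabulary and
# the vectorizer's feature names.
vocab = ["word%d" % i for i in range(2000)]
feature_names = ["word%d" % i for i in range(0, 2000, 2)]
weights = {w: 1.0 for w in feature_names}

# Slow pattern: every `in` scans the list from the start.
slow = [w for w in vocab if w in feature_names]

# Fast pattern: membership in a dict (or set) is a hash lookup.
fast = [w for w in vocab if w in weights]
```

Both patterns select the same words. In the cell above, the `tfidf` dictionary is already keyed by feature name, so testing `if w in tfidf:` would be the equivalent hash-based check.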


In [25]:
import cPickle as pickle
with open('w2v_tfidf.p', 'wb') as fp:
    pickle.dump(w2v_tfidf, fp, protocol=pickle.HIGHEST_PROTOCOL)

In [33]:
import cPickle as pickle
with open('w2v_tfidf.p', 'rb') as fp:
    w2v_tfidf = pickle.load(fp)

In [37]:
%%time
train_vecs_w2v_tfidf_mean = scale(np.concatenate([get_w2v_general(z, 200, w2v_tfidf, 'mean') for z in x_train]))
validation_vecs_w2v_tfidf_mean = scale(np.concatenate([get_w2v_general(z, 200, w2v_tfidf, 'mean') for z in x_validation]))

CPU times: user 1min 18s, sys: 22.4 s, total: 1min 40s
Wall time: 1min 58s


In [38]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_w2v_tfidf_mean, y_train)

CPU times: user 52.4 s, sys: 28.7 s, total: 1min 21s
Wall time: 1min 52s


In [39]:
clf.score(validation_vecs_w2v_tfidf_mean, y_validation)

0.7057017543859649

In [40]:
%%time
train_vecs_w2v_tfidf_sum = scale(np.concatenate([get_w2v_general(z, 200, w2v_tfidf, 'sum') for z in x_train]))
validation_vecs_w2v_tfidf_sum = scale(np.concatenate([get_w2v_general(z, 200, w2v_tfidf, 'sum') for z in x_validation]))

CPU times: user 1min 13s, sys: 20.8 s, total: 1min 34s
Wall time: 1min 52s


In [41]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_w2v_tfidf_sum, y_train)

CPU times: user 1min 16s, sys: 29.7 s, total: 1min 46s
Wall time: 2min 18s


In [42]:
clf.score(validation_vecs_w2v_tfidf_sum, y_validation)

0.7031954887218045

The result is not what I expected, especially after 5 hours of waiting. Weighting the word vectors with Tfidf values dropped the validation accuracy by around 2 percentage points for both averaging and summing.

## Word vectors extracted from Doc2Vec models with custom weighting (Average/Sum)

In the 3rd part of this series, I defined a custom metric called "pos_normcdf_hmean", which is borrowed from the presentation by Jason Kessler at PyData Seattle 2017. If you want to know more about the calculation, you can either check my previous post or watch Jason Kessler's presentation. To give you a high-level intuition: by calculating the harmonic mean of the CDF (Cumulative Distribution Function) transformed values of a term's frequency rate within the whole corpus and its frequency within a class, you get a meaningful metric which shows how strongly each word is related to a certain class.

I used this metric to visualise tokens in the 3rd part of the series, and again to create a custom lexicon for classification in the 5th part. I will use it once more as a weighting factor for the word vectors and see how it affects the performance.
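Written out as formulas, the calculation in the cells below looks as follows. The symbols n₊(w) and n₋(w), for the number of times word w occurs in positive and negative training tweets, are my own shorthand, and Φ denotes the normal CDF evaluated with the mean and standard deviation of the respective column across the vocabulary.

$$
\mathrm{pos\_rate}(w) = \frac{n_{+}(w)}{n_{+}(w) + n_{-}(w)}, \qquad
\mathrm{pos\_freq\_pct}(w) = \frac{n_{+}(w)}{\sum_{v} n_{+}(v)}
$$

$$
\mathrm{pos\_normcdf\_hmean}(w) = \frac{2\,a\,b}{a + b}, \qquad
a = \Phi\big(\mathrm{pos\_rate}(w)\big), \quad b = \Phi\big(\mathrm{pos\_freq\_pct}(w)\big)
$$

As a check against the output table below, the word "welcome" has a = 0.912972 and b = 0.999474, so

$$
\frac{2 \times 0.912972 \times 0.999474}{0.912972 + 0.999474} \approx 0.954267
$$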

In [53]:
from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer(max_features=100000)
cvec.fit(x_train)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=100000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [54]:
neg_train = x_train[y_train == 0]
pos_train = x_train[y_train == 1]
neg_doc_matrix = cvec.transform(neg_train)
pos_doc_matrix = cvec.transform(pos_train)
neg_tf = np.sum(neg_doc_matrix,axis=0)
pos_tf = np.sum(pos_doc_matrix,axis=0)

In [55]:
from scipy.stats import hmean
from scipy.stats import norm
def normcdf(x):
    return norm.cdf(x, x.mean(), x.std())

neg = np.squeeze(np.asarray(neg_tf))
pos = np.squeeze(np.asarray(pos_tf))
term_freq_df2 = pd.DataFrame([neg,pos],columns=cvec.get_feature_names()).transpose()
term_freq_df2.columns = ['negative', 'positive']
term_freq_df2['total'] = term_freq_df2['negative'] + term_freq_df2['positive']
term_freq_df2['pos_rate'] = term_freq_df2['positive'] * 1./term_freq_df2['total']
term_freq_df2['pos_freq_pct'] = term_freq_df2['positive'] * 1./term_freq_df2['positive'].sum()
term_freq_df2['pos_rate_normcdf'] = normcdf(term_freq_df2['pos_rate'])
term_freq_df2['pos_freq_pct_normcdf'] = normcdf(term_freq_df2['pos_freq_pct'])
term_freq_df2['pos_normcdf_hmean'] = hmean([term_freq_df2['pos_rate_normcdf'], term_freq_df2['pos_freq_pct_normcdf']])
term_freq_df2.sort_values(by='pos_normcdf_hmean', ascending=False).iloc[:10]

Unnamed: 0,negative,positive,total,pos_rate,pos_freq_pct,pos_rate_normcdf,pos_freq_pct_normcdf,pos_normcdf_hmean
welcome,610,6565,7175,0.914983,0.000752,0.912972,0.999474,0.954267
thank,2234,15428,17662,0.873514,0.001768,0.888181,1.0,0.940779
thanks,5646,33697,39343,0.856493,0.003862,0.876664,1.0,0.934279
congrats,451,3254,3705,0.878273,0.000373,0.891258,0.945374,0.917519
followfriday,167,2665,2832,0.941031,0.000305,0.926292,0.903828,0.914922
awesome,3735,14189,17924,0.79162,0.001626,0.825297,1.0,0.904288
hello,1104,4425,5529,0.800326,0.000507,0.832885,0.985875,0.902945
hehe,960,3966,4926,0.805116,0.000454,0.836969,0.975098,0.900769
glad,2225,8086,10311,0.784211,0.000927,0.818668,0.999974,0.900284
follow,2498,8977,11475,0.782309,0.001029,0.816942,0.999997,0.899248


In [56]:
pos_hmean = term_freq_df2.pos_normcdf_hmean

In [53]:
%%time
w2v_pos_hmean = {}
for w in model_ug_dbow.wv.vocab.keys():
    if w in pos_hmean.keys():
        w2v_pos_hmean[w] = np.append(model_ug_dbow[w],model_ug_dmm[w]) * pos_hmean[w]

CPU times: user 4.81 s, sys: 1.93 s, total: 6.75 s
Wall time: 9.51 s


In [58]:
with open('w2v_hmean.p', 'wb') as fp:
    pickle.dump(w2v_pos_hmean, fp, protocol=pickle.HIGHEST_PROTOCOL)

In [43]:
import cPickle as pickle
with open('w2v_hmean.p', 'rb') as fp:
    w2v_pos_hmean = pickle.load(fp)

In [44]:
train_vecs_w2v_poshmean_mean = scale(np.concatenate([get_w2v_general(z, 200, w2v_pos_hmean, 'mean') for z in x_train]))
validation_vecs_w2v_poshmean_mean = scale(np.concatenate([get_w2v_general(z, 200, w2v_pos_hmean, 'mean') for z in x_validation]))

In [45]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_w2v_poshmean_mean, y_train)

CPU times: user 1min 40s, sys: 1min 15s, total: 2min 55s
Wall time: 4min 20s


In [46]:
clf.score(validation_vecs_w2v_poshmean_mean, y_validation)

0.7327067669172932

In [47]:
train_vecs_w2v_poshmean_sum = scale(np.concatenate([get_w2v_general(z, 200, w2v_pos_hmean, 'sum') for z in x_train]))
validation_vecs_w2v_poshmean_sum = scale(np.concatenate([get_w2v_general(z, 200, w2v_pos_hmean, 'sum') for z in x_validation]))

In [48]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_w2v_poshmean_sum, y_train)

CPU times: user 3min 23s, sys: 3min 51s, total: 7min 14s
Wall time: 11min 46s


In [49]:
clf.score(validation_vecs_w2v_poshmean_sum, y_validation)

0.7093984962406015

Unlike Tfidf weighting, the custom weighting actually gave me a performance boost when used with the averaging method (73.27% against 72.42% without weighting). But with summing, this weighting performed worse than the word vectors without weighting (70.94% against 72.51%).

## Word vectors extracted from pre-trained GloVe (Average/Sum)

GloVe is another kind of vector representation of words, proposed by Pennington et al. (2014) from the Stanford NLP Group. https://nlp.stanford.edu/pubs/glove.pdf

The difference between Word2Vec and GloVe is how the two models compute the word vectors. In Word2Vec, the word vectors are a kind of by-product of a shallow neural network that tries to predict either the centre word given the surrounding words, or vice versa. With GloVe, the word vectors are the direct objective of the model, which computes them from a term co-occurrence matrix using dimensionality reduction.
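As a rough illustration of the raw material GloVe starts from, the sketch below counts how often pairs of words appear within a fixed window of each other. It only builds plain, unweighted co-occurrence counts and does not implement GloVe's training. The function name `cooccurrence_counts`, the toy tweets and the window size are my own choices for illustration.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count symmetric word co-occurrences within `window` tokens."""
    counts = defaultdict(float)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            # Look back at up to `window` preceding tokens.
            for j in range(max(0, i - window), i):
                counts[(word, tokens[j])] += 1.0
                counts[(tokens[j], word)] += 1.0
    return counts

# Assumption: two made-up tweets.
toy_tweets = ["i love this song", "i love this movie"]
counts = cooccurrence_counts(toy_tweets, window=2)
```

In this toy corpus "love" and "this" co-occur twice, while "i" and "song" never fall within the same window of two tokens.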

The good news is that you can now easily load and use the pre-trained GloVe vectors from Gensim thanks to its latest update (Gensim 3.2.0). In addition to some pre-trained word vectors, new datasets have also been added, and these can be easily downloaded using the downloader API. If you want to know more about this, please check this blog post by RaRe Technologies. https://rare-technologies.com/new-download-api-for-pretrained-nlp-models-and-datasets-in-gensim/

The Stanford NLP Group has made their pre-trained GloVe vectors publicly available, and among them there are GloVe vectors trained specifically on Tweets. This sounds like something definitely worth trying. They have four different versions of Tweet vectors, each with a different dimension (25, 50, 100, 200), trained on 2 billion Tweets. You can find more detail on their website. https://nlp.stanford.edu/projects/glove/

For this post, I will use the 200-dimensional pre-trained GloVe vectors.

In [16]:
import gensim.downloader as api
glove_twitter = api.load("glove-twitter-200")

In [17]:
train_vecs_glove_mean = scale(np.concatenate([get_w2v_general(z, 200, glove_twitter,'mean') for z in x_train]))
validation_vecs_glove_mean = scale(np.concatenate([get_w2v_general(z, 200, glove_twitter,'mean') for z in x_validation]))

In [18]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_glove_mean, y_train)

CPU times: user 1min 39s, sys: 34 s, total: 2min 13s
Wall time: 2min 45s


In [19]:
clf.score(validation_vecs_glove_mean, y_validation)

0.76265664160401

In [20]:
train_vecs_glove_sum = scale(np.concatenate([get_w2v_general(z, 200, glove_twitter,'sum') for z in x_train]))
validation_vecs_glove_sum = scale(np.concatenate([get_w2v_general(z, 200, glove_twitter,'sum') for z in x_validation]))

In [21]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_glove_sum, y_train)

CPU times: user 2min 49s, sys: 43.3 s, total: 3min 32s
Wall time: 4min 14s


In [22]:
clf.score(validation_vecs_glove_sum, y_validation)

0.7659774436090225

By using pre-trained GloVe vectors, I can see that the validation accuracy improved significantly. So far the best validation accuracy was 73.27%, from the averaged word vectors with custom weighting. Compared to this, GloVe vectors yield 76.27% and 76.60% for average and sum respectively.

## Word vectors extracted from pre-trained Google News Word2Vec (Average/Sum)

With the newly updated Gensim, I can also load the famous pre-trained Google News word vectors. These word vectors were trained using the Word2Vec model on the Google News dataset (about 100 billion words) and published by Google. The model contains 300-dimensional vectors for 3 million words and phrases. You can find more detail in the Google project archive. https://code.google.com/archive/p/word2vec/

In [16]:
import gensim.downloader as api
googlenews = api.load("word2vec-google-news-300")

In [17]:
train_vecs_googlenews_mean = scale(np.concatenate([get_w2v_general(z, 300, googlenews,'mean') for z in x_train]))
validation_vecs_googlenews_mean = scale(np.concatenate([get_w2v_general(z, 300, googlenews,'mean') for z in x_validation]))

In [18]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_googlenews_mean, y_train)

CPU times: user 5min 55s, sys: 41min 5s, total: 47min
Wall time: 1h 23min 5s


In [19]:
clf.score(validation_vecs_googlenews_mean, y_validation)

0.749561403508772

In [20]:
train_vecs_googlenews_sum = scale(np.concatenate([get_w2v_general(z, 300, googlenews,'sum') for z in x_train]))
validation_vecs_googlenews_sum = scale(np.concatenate([get_w2v_general(z, 300, googlenews,'sum') for z in x_validation]))

In [21]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_googlenews_sum, y_train)

CPU times: user 5min 45s, sys: 39min 17s, total: 45min 2s
Wall time: 1h 19min 51s


In [22]:
clf.score(validation_vecs_googlenews_sum, y_validation)

0.7491854636591478

Even though it gives me a better result than the word vectors extracted from custom-trained Doc2Vec models, it fails to outperform the GloVe vectors, despite the Google News word vectors having a larger dimension.

But these vectors were trained on Google News, while the GloVe vectors I used were trained specifically on Tweets, so it is hard to compare the two directly. What if Word2Vec is trained specifically on Tweets?

## Separately trained Word2Vec (Average/Sum)

I know I have already tried word vectors I extracted from Doc2Vec models, but what if I train separate Word2Vec models? Even though the Doc2Vec models gave good representational vectors at the document level, would it be more efficient at learning word vectors if I train pure Word2Vec?

In order to answer my own questions, I trained two Word2Vec models, one using CBOW (Continuous Bag Of Words) and one using Skip Gram. For the parameters, I used the same settings as for Doc2Vec.

- size of vectors: 100 dimensions
- negative sampling: 5
- window: 2
- minimum word count: 2
- alpha: 0.065 (decrease alpha by 0.002 per epoch)
- number of epochs: 30

With the above settings, I defined the CBOW model by passing "sg=0", and the Skip Gram model by passing "sg=1".
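The learning rate schedule implied by the settings above can be written out explicitly. In the training loop further down, `min_alpha` is set equal to `alpha` at every step, so the rate stays constant within an epoch and drops by 0.002 between epochs. The variable names in this small sketch are illustrative.

```python
initial_alpha = 0.065
decay_per_epoch = 0.002
n_epochs = 30

# Learning rate used in each of the 30 epochs (epoch 0 is the first).
alphas = [initial_alpha - decay_per_epoch * epoch for epoch in range(n_epochs)]
```

The first epoch trains with 0.065 and the last one with roughly 0.007, so the rate stays positive for all 30 epochs.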

Once I have the two models, I concatenate their vectors for each word, so that each word has a 200-dimensional representation.

Please note that in the 6th part, where I trained Doc2Vec, I used "LabeledSentence" imported from Gensim. This has now been deprecated, so for this post I used "TaggedDocument" instead. The usage is the same.

In [26]:
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")
import gensim
from gensim.models.word2vec import Word2Vec
from gensim.models.doc2vec import TaggedDocument
import multiprocessing
from sklearn import utils

In [27]:
def labelize_tweets_ug(tweets,label):
    result = []
    prefix = label
    for i, t in zip(tweets.index, tweets):
        result.append(TaggedDocument(t.split(), [prefix + '_%s' % i]))
    return result

In [28]:
all_x = pd.concat([x_train,x_validation,x_test])
all_x_w2v = labelize_tweets_ug(all_x, 'all')

In [32]:
cores = multiprocessing.cpu_count()
model_ug_cbow = Word2Vec(sg=0, size=100, negative=5, window=2, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
model_ug_cbow.build_vocab([x.words for x in tqdm(all_x_w2v)])

100%|██████████| 1596019/1596019 [00:01<00:00, 974931.80it/s]


In [33]:
%%time
for epoch in range(30):
    model_ug_cbow.train(utils.shuffle([x.words for x in tqdm(all_x_w2v)]), total_examples=len(all_x_w2v), epochs=1)
    model_ug_cbow.alpha -= 0.002
    model_ug_cbow.min_alpha = model_ug_cbow.alpha

100%|██████████| 1596019/1596019 [00:01<00:00, 896751.73it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1132476.16it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1098657.06it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1087776.38it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1093001.25it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1157588.98it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1100648.79it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1119397.57it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1039300.87it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1139894.69it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1115533.23it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1056060.47it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1110581.51it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1120059.66it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1124520.95it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1012078.2

CPU times: user 22min 1s, sys: 1min 3s, total: 23min 5s
Wall time: 9min 37s


In [35]:
train_vecs_cbow_mean = scale(np.concatenate([get_w2v_general(z, 100, model_ug_cbow,'mean') for z in x_train]))
validation_vecs_cbow_mean = scale(np.concatenate([get_w2v_general(z, 100, model_ug_cbow,'mean') for z in x_validation]))

  


In [36]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_cbow_mean, y_train)

CPU times: user 40.8 s, sys: 5.19 s, total: 46 s
Wall time: 48.7 s


In [37]:
clf.score(validation_vecs_cbow_mean, y_validation)

0.7600250626566416

In [38]:
model_ug_sg = Word2Vec(sg=1, size=100, negative=5, window=2, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
model_ug_sg.build_vocab([x.words for x in tqdm(all_x_w2v)])

100%|██████████| 1596019/1596019 [00:02<00:00, 533098.47it/s]


In [39]:
%%time
for epoch in range(30):
    model_ug_sg.train(utils.shuffle([x.words for x in tqdm(all_x_w2v)]), total_examples=len(all_x_w2v), epochs=1)
    model_ug_sg.alpha -= 0.002
    model_ug_sg.min_alpha = model_ug_sg.alpha

100%|██████████| 1596019/1596019 [00:01<00:00, 923343.66it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1071407.58it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1084559.36it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1085515.92it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1098921.10it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1084263.01it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1137634.20it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1063158.62it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1095510.09it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1124627.88it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1097201.52it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 994637.94it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1164331.50it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1117291.03it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1121341.67it/s]
100%|██████████| 1596019/1596019 [00:01<00:00, 1069449.79

CPU times: user 41min 35s, sys: 31.7 s, total: 42min 7s
Wall time: 12min 35s


In [40]:
train_vecs_sg_mean = scale(np.concatenate([get_w2v_general(z, 100, model_ug_sg,'mean') for z in x_train]))
validation_vecs_sg_mean = scale(np.concatenate([get_w2v_general(z, 100, model_ug_sg,'mean') for z in x_validation]))

  


In [41]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_sg_mean, y_train)

CPU times: user 23.3 s, sys: 4.34 s, total: 27.7 s
Wall time: 29.6 s


In [42]:
clf.score(validation_vecs_sg_mean, y_validation)

0.7604010025062656

In [43]:
def get_w2v_mean(tweet, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tweet.split():
        try:
            vec += np.append(model_ug_cbow[word],model_ug_sg[word]).reshape((1, size))
            count += 1.
        except KeyError:
            continue
    if count != 0:
        vec /= count
    return vec

In [44]:
train_vecs_cbowsg_mean = scale(np.concatenate([get_w2v_mean(z, 200) for z in x_train]))
validation_vecs_cbowsg_mean = scale(np.concatenate([get_w2v_mean(z, 200) for z in x_validation]))

  


In [45]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_cbowsg_mean, y_train)

CPU times: user 6min 17s, sys: 27min 45s, total: 34min 2s
Wall time: 1h 13min 2s


In [46]:
clf.score(validation_vecs_cbowsg_mean, y_validation)

0.7650375939849624

In [47]:
def get_w2v_sum(tweet, size):
    vec = np.zeros(size).reshape((1, size))
    for word in tweet.split():
        try:
            vec += np.append(model_ug_cbow[word],model_ug_sg[word]).reshape((1, size))
        except KeyError:
            continue
    return vec

In [48]:
train_vecs_cbowsg_sum = scale(np.concatenate([get_w2v_sum(z, 200) for z in x_train]))
validation_vecs_cbowsg_sum = scale(np.concatenate([get_w2v_sum(z, 200) for z in x_validation]))

In [49]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_cbowsg_sum, y_train)

CPU times: user 7min 7s, sys: 28min 32s, total: 35min 40s
Wall time: 1h 16min 21s


In [50]:
clf.score(validation_vecs_cbowsg_sum, y_validation)

0.7675438596491229

The concatenated vectors of the unigram CBOW and unigram Skip Gram models yielded 76.50% and 76.75% validation accuracy with the mean and sum methods respectively. These results are even higher than the results I got from GloVe vectors.

But please do not take this as a general statement. This is an empirical finding in this particular setting.

## Separately trained Word2Vec with custom weighting (Average/Sum)

As a final step, I will apply the custom weighting I have implemented above and see if this affects the performance.

In [58]:
%%time
w2v_pos_hmean_01 = {}
for w in model_ug_cbow.wv.vocab.keys():
    if w in pos_hmean.keys():
        w2v_pos_hmean_01[w] = np.append(model_ug_cbow[w],model_ug_sg[w]) * pos_hmean[w]

CPU times: user 4.92 s, sys: 631 ms, total: 5.55 s
Wall time: 6.19 s


In [59]:
train_vecs_w2v_poshmean_mean_01 = scale(np.concatenate([get_w2v_general(z, 200, w2v_pos_hmean_01, 'mean') for z in x_train]))
validation_vecs_w2v_poshmean_mean_01 = scale(np.concatenate([get_w2v_general(z, 200, w2v_pos_hmean_01, 'mean') for z in x_validation]))

In [60]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_w2v_poshmean_mean_01, y_train)

CPU times: user 8min 3s, sys: 32min 13s, total: 40min 17s
Wall time: 1h 23min 58s


In [61]:
clf.score(validation_vecs_w2v_poshmean_mean_01, y_validation)

0.7797619047619048

In [62]:
train_vecs_w2v_poshmean_sum_01 = scale(np.concatenate([get_w2v_general(z, 200, w2v_pos_hmean_01, 'sum') for z in x_train]))
validation_vecs_w2v_poshmean_sum_01 = scale(np.concatenate([get_w2v_general(z, 200, w2v_pos_hmean_01, 'sum') for z in x_validation]))

In [63]:
%%time
clf = LogisticRegression()
clf.fit(train_vecs_w2v_poshmean_sum_01, y_train)

CPU times: user 7min 48s, sys: 28min 59s, total: 36min 48s
Wall time: 1h 11min 4s


In [64]:
clf.score(validation_vecs_w2v_poshmean_sum_01, y_validation)

0.7451754385964913

Finally I have the best performing word vectors. Averaged word vectors (from separately trained Word2Vec models) weighted with the custom metric yielded the best validation accuracy of 77.98%! Below is a table of all the results from the experiments above.

| Word vectors extracted from | Vector dimensions | Weighting | Validation accuracy with mean | Validation accuracy with sum |
|---|---|---|---|---|
| Doc2Vec (unigram DBOW + unigram DMM) | 200 | N/A | 72.42% | 72.51% |
| Doc2Vec (unigram DBOW + unigram DMM) | 200 | TF-IDF | 70.57% | 70.32% |
| Doc2Vec (unigram DBOW + unigram DMM) | 200 | custom | 73.27% | 70.94% |
| pre-trained GloVe (Tweets) | 200 | N/A | 76.27% | 76.60% |
| pre-trained Word2Vec (Google News) | 300 | N/A | 74.96% | 74.92% |
| Word2Vec (unigram CBOW + unigram SG) | 200 | N/A | 76.50% | 76.75% |
| Word2Vec (unigram CBOW + unigram SG) | 200 | custom | 77.98% | 74.52% |
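The percentages in the table are the raw scores from the cell outputs above, rounded to two decimal places. The short sketch below shows the conversion for three of the mean-based results and the gain from custom weighting on the separately trained Word2Vec vectors. The dictionary keys are just labels I chose.

```python
# Raw validation scores copied from the outputs above.
scores = {
    "doc2vec_mean": 0.7241854636591478,
    "w2v_cbowsg_mean": 0.7650375939849624,
    "w2v_cbowsg_custom_mean": 0.7797619047619048,
}

# Format each score as a percentage with two decimals.
as_percent = {name: "{:.2f}%".format(score * 100)
              for name, score in scores.items()}

# Gain from custom weighting, in percentage points.
gain = (scores["w2v_cbowsg_custom_mean"] - scores["w2v_cbowsg_mean"]) * 100
```

The custom weighting adds roughly 1.5 percentage points on top of the unweighted CBOW + Skip Gram average.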

# Neural Network with Word2Vec

The best performing word vectors with logistic regression were chosen to feed to a neural network model. This time I did not try various different architectures. Based on what I observed during trials of different architectures with Doc2Vec document vectors, the best performing architecture was one with 3 hidden layers with 256 hidden nodes in each hidden layer.

Finally, I will fit a neural network with early stopping and checkpointing so that I can save the weights that perform best on validation accuracy.
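To show roughly what the two callbacks do together, here is a minimal sketch in plain Python. It is a simplification, not the Keras implementation: it tracks the best validation accuracy (where a checkpoint would be saved) and stops once the accuracy has failed to improve for `patience` consecutive epochs. The function name and the accuracy values are made up for illustration.

```python
def best_epoch_with_early_stopping(val_accuracies, patience=5):
    """Return (best_epoch, best_accuracy, stopped_epoch); epochs are 1-indexed."""
    best_acc = float("-inf")
    best_epoch = 0
    wait = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            # Improvement: this is where the weights would be checkpointed.
            best_acc, best_epoch, wait = acc, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, best_acc, epoch
    return best_epoch, best_acc, len(val_accuracies)

# Assumption: made-up validation accuracies, one per epoch.
history = [0.790, 0.795, 0.801, 0.800, 0.799, 0.798, 0.800, 0.797, 0.796, 0.795]
result = best_epoch_with_early_stopping(history, patience=5)
```

With these numbers the best epoch is the third, and training stops after the eighth, five epochs later without improvement.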

In [65]:
train_w2v_final = train_vecs_w2v_poshmean_mean_01
validation_w2v_final = validation_vecs_w2v_poshmean_mean_01

In [66]:
from keras.callbacks import ModelCheckpoint, EarlyStopping

filepath="w2v_01_best_weights.{epoch:02d}-{val_acc:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
early_stop = EarlyStopping(monitor='val_acc', patience=5, mode='max') 
callbacks_list = [checkpoint, early_stop]
np.random.seed(seed)
model_w2v_01 = Sequential()
model_w2v_01.add(Dense(256, activation='relu', input_dim=200))
model_w2v_01.add(Dense(256, activation='relu'))
model_w2v_01.add(Dense(256, activation='relu'))
model_w2v_01.add(Dense(1, activation='sigmoid'))
model_w2v_01.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model_w2v_01.fit(train_w2v_final, y_train, validation_data=(validation_w2v_final, y_validation), 
                 epochs=100, batch_size=32, verbose=2, callbacks=callbacks_list)

Train on 1564098 samples, validate on 15960 samples
Epoch 1/100
Epoch 00001: val_acc improved from -inf to 0.79530, saving model to w2v_01_best_weights.01-0.7953.hdf5
 - 500s - loss: 0.4294 - acc: 0.8017 - val_loss: 0.4392 - val_acc: 0.7953
Epoch 2/100
Epoch 00002: val_acc improved from 0.79530 to 0.79762, saving model to w2v_01_best_weights.02-0.7976.hdf5
 - 575s - loss: 0.4144 - acc: 0.8090 - val_loss: 0.4353 - val_acc: 0.7976
Epoch 3/100
Epoch 00003: val_acc improved from 0.79762 to 0.80056, saving model to w2v_01_best_weights.03-0.8006.hdf5
 - 753s - loss: 0.4086 - acc: 0.8118 - val_loss: 0.4319 - val_acc: 0.8006
Epoch 4/100
Epoch 00004: val_acc improved from 0.80056 to 0.80182, saving model to w2v_01_best_weights.04-0.8018.hdf5
 - 768s - loss: 0.4046 - acc: 0.8138 - val_loss: 0.4331 - val_acc: 0.8018
Epoch 5/100
Epoch 00005: val_acc did not improve
 - 787s - loss: 0.4016 - acc: 0.8155 - val_loss: 0.4300 - val_acc: 0.8010
Epoch 6/100
Epoch 00006: val_acc improved from 0.80182 to 0.

<keras.callbacks.History at 0x60a841290>

In [67]:
from keras.models import load_model
loaded_w2v_model = load_model('w2v_01_best_weights.10-0.8048.hdf5')

In [68]:
loaded_w2v_model.evaluate(x=validation_w2v_final, y=y_validation)



[0.4244666022615026, 0.8047619047619048]

The best validation accuracy is 80.48%. Surprisingly, this is even higher than the best accuracy I got by feeding document vectors to neural network models above.

It took quite some time for me to try different settings and different calculations, but I learned some valuable lessons through all the trial and error. Specifically trained Word2Vec with carefully engineered weighting can even outperform Doc2Vec in a classification task.

In the next post, I will try a more sophisticated neural network model, the Convolutional Neural Network. Again, I hope this will give me some boost in performance.