# LSTM + word2vec

#### Preprocessing

NLTK tokenization (`TweetTokenizer`) followed by the Keras tokenizer, which filters punctuation.

#### Model

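LSTM baseline with one recurrent layer (the code below uses a bidirectional LSTM with 128 hidden units and dropout of 0.2), with one of two embedding options:
* 1) word2vec trained on our own texts
* 2) pretrained word2vec (the variant run in this notebook)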

---

In [124]:
MODEL_NAME = "lstm_word2vec"
TRAIN_DATASETS = ["data/test_imdb.csv", "data/train_imdb.csv", "data/test_rt_en.csv", "data/train_rt_en.csv"]

TOKENIZER_NAME = "lstm_word2vec_tokenizer"

---

In [66]:
import nltk
import numpy
import pickle
import re
numpy.random.seed(42)

from nltk.corpus import stopwords
import pandas as pd
import numpy as np
from itertools import chain
from collections import Counter

from sklearn.model_selection import train_test_split


from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

Switch on full-text display mode:

In [10]:
pd.options.display.max_colwidth = -1

## Load data

In [11]:
datasets = []
for dataset in TRAIN_DATASETS:    
    datasets.append(pd.read_csv(dataset, sep="|"))

In [64]:
whole_data = pd.concat(datasets)

In [13]:
for data in datasets:
    print(data.shape)

(10000, 3)
(40000, 3)
(19798, 3)
(79190, 3)


In [70]:
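# train_test_split returns (train part, test part). With train_size=0.1 the
# first, 10% part is deliberately used as the test set and the rest for training.
test_data, train_data = train_test_split(whole_data, train_size=0.1)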

In [110]:
test_data.shape

(14898, 5)

In [111]:
train_data.shape

(134090, 4)

---

## Preprocessing

In [73]:
MAX_NB_WORDS = 20000
MAX_SEQUENCE_LENGTH = 70

---

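Lowercase the text, replace `<br />` tags with spaces, tokenize with NLTK's `TweetTokenizer`, and rewrite negative contractions as single marker tokens. Punctuation is filtered later by the Keras tokenizer; the code below does not remove stopwords or numbers.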

In [74]:
from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer()

In [75]:
negatives = {
    "didn't": "didn_`_t",
    "couldn't": "couldn_`_t",
    "don't": "don_`_t",
    "wouldn't": "wouldn_`_t",
    "doesn't": "doesn_`_t",
    "wasn't": "wasn_`_t",
    "weren't": "weren_`_t",
    "shouldn't":"shouldn_`_t",
    "isn't": "isn_`_t",
    "aren't": "aren_`_t",
}

In [76]:
def preprocess(text):
    text = text.lower()
    text = text.replace('<br />', ' ')
    text = ' '.join(tweet_tokenizer.tokenize(text))
    for k, v in negatives.items():
        text = text.replace(k, v)
    return text
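
To see what the `negatives` substitution does on its own, here is a minimal sketch. It assumes a shortened copy of the mapping and skips the `TweetTokenizer` step; both simplifications are for illustration only, and `NEGATIVES_DEMO` and `mark_negations` are hypothetical names.

```python
# Shortened copy of the `negatives` mapping, for illustration only.
NEGATIVES_DEMO = {
    "didn't": "didn_`_t",
    "don't": "don_`_t",
    "isn't": "isn_`_t",
}

def mark_negations(text, mapping=NEGATIVES_DEMO):
    """Lowercase the text and rewrite each listed contraction as one marker token."""
    text = text.lower()
    for contraction, marker in mapping.items():
        text = text.replace(contraction, marker)
    return text

print(mark_negations("I don't think the movie is good"))
# i don_`_t think the movie is good
```

Contractions that are not in the mapping (for example "can't") pass through unchanged.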

In [77]:
preprocess("""A sentimental crowd-pleaser, well-directed by Le McCarey, this tale about a priest (Bing Crosby) assigned to a problematic parish was so popular that Paramount reteamed the same players for The Bells of St. Mary's.""")

"a sentimental crowd-pleaser , well-directed by le mccarey , this tale about a priest ( bing crosby ) assigned to a problematic parish was so popular that paramount reteamed the same players for the bells of st . mary's ."

In [78]:
train_data['prep_text'] = train_data['text'].map(preprocess)
test_data['prep_text'] = test_data['text'].map(preprocess)



In [80]:
test_data.head()

Unnamed: 0.1,Unnamed: 0,label,text,prep_text
13680,58770,1,Let's face it -- nothing short of a gorilla suit can make Peter MacNicol not look like a weenie.,let's face it - - nothing short of a gorilla suit can make peter macnicol not look like a weenie .
71716,22099,1,"A modest and accessible Iranian film, softer than most but still intriguing.","a modest and accessible iranian film , softer than most but still intriguing ."
2273,46279,1,"I saw the movie and really could not stop my tears. Its tragedy that India has no such leaders after freedom, who dare to do justice with their own children, when they don't behave properly.. In current generation, politicians bring their children's into politics without measuring their caliber and skills.. I remember the dialogue from Gandhi 'What kind of society we want to create/make with such people (about Harilal)?' No wonder that it will be a dream that India will hardly have such leader in this or next generation.. Einstein was right when he said about Gandhi that 'After 50 years one would hardly believe that such person with body, soul and mind (Mahatma Gandhi) had ever lived on this earth.' I sincerely want to THANKS a LOT to Anil kapoor, Feroze khan and all film actors/actresses for this wonderful movie about great person and relationship with his son. All father and son should watch this movie once and take some lessons for both roles.","i saw the movie and really could not stop my tears . its tragedy that india has no such leaders after freedom , who dare to do justice with their own children , when they don_`_t behave properly .. in current generation , politicians bring their children's into politics without measuring their caliber and skills .. i remember the dialogue from gandhi ' what kind of society we want to create / make with such people ( about harilal ) ? ' no wonder that it will be a dream that india will hardly have such leader in this or next generation .. einstein was right when he said about gandhi that ' after 50 years one would hardly believe that such person with body , soul and mind ( mahatma gandhi ) had ever lived on this earth . ' i sincerely want to thanks a lot to anil kapoor , feroze khan and all film actors / actresses for this wonderful movie about great person and relationship with his son . all father and son should watch this movie once and take some lessons for both roles ."
25816,60370,0,"Despite a rich premise, Just Friends ultimately fails to live up to its early potential, settling into broad comedy that shortchanges the romantic possibilities.","despite a rich premise , just friends ultimately fails to live up to its early potential , settling into broad comedy that shortchanges the romantic possibilities ."
22921,9272,1,"Shot in astonishingly elaborate long takes, this is the kind of film that finds the most brilliant poetry in the slightest movement of the camera -- a paradigm of cinematic expression.","shot in astonishingly elaborate long takes , this is the kind of film that finds the most brilliant poetry in the slightest movement of the camera - - a paradigm of cinematic expression ."


### Padding data

The Keras [Embedding layer](https://keras.io/layers/embeddings/) turns positive integers (indexes) into dense vectors of fixed size.

* 1) First, convert words to indexes
* 2) Then pad the index sequences to a fixed length

['lawyers exist', 'i need lawyers'] -> [[0, 4, 10], [2, 3, 4]] -> [[0.25, 0.1], [0.6, -0.2]]
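
The two steps above can be sketched without Keras. This is an illustration only: `build_index`, `to_padded_ids` and the toy sentences are hypothetical, while the notebook itself uses `Tokenizer` and `pad_sequences`, and the real indexes depend on word frequencies in the corpus. Like the `pad_sequences` defaults, the sketch pads and truncates on the left and reserves index 0 for padding.

```python
from collections import Counter

def build_index(texts):
    """Map each word to a positive integer; more frequent words get smaller indexes."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}  # 0 is reserved for padding

def to_padded_ids(text, word_index, maxlen):
    """Convert a text to indexes, drop unknown words, then left-truncate and left-pad."""
    ids = [word_index[w] for w in text.split() if w in word_index]
    ids = ids[-maxlen:]                      # keep the last `maxlen` tokens
    return [0] * (maxlen - len(ids)) + ids   # pad with zeros on the left

texts = ["lawyers exist", "i need lawyers"]
word_index = build_index(texts)
print(word_index)
print([to_padded_ids(t, word_index, maxlen=3) for t in texts])
```

The shorter sentence receives a leading 0, so every row has the same length and can be stacked into one integer matrix for the Embedding layer.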

#### Loading google word2vec

In [49]:
%%time
from gensim.models import Word2Vec

word2vec_google = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
print("Loaded")

Loaded
CPU times: user 2min 33s, sys: 5.44 s, total: 2min 38s
Wall time: 2min 40s


In [50]:
%%time
# Normalizing word2vec vectors.
word2vec_google.init_sims(replace=True)  # Normalizes the vectors in the word2vec class.

CPU times: user 1min 6s, sys: 14.8 s, total: 1min 21s
Wall time: 1min 51s


In [81]:
len(word2vec_google.index2word)

3000000

#### 1. String -> Int vector

In [82]:
tokenizer = Tokenizer(nb_words=MAX_NB_WORDS, filters='"#$%&()*+-/:;<=>@[\\]^{|}~\t\n,.')
tokenizer.fit_on_texts(train_data['prep_text'])
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 117811 unique tokens.


In [83]:
# Save the fitted tokenizer; the with-block closes the file automatically.
with open(TOKENIZER_NAME, 'wb') as ofile:
    pickle.dump(tokenizer, ofile)

#### Add embeddings

In [84]:
def get_embedding(word2vec_model, word):
    try:
        return word2vec_model[word]
    except KeyError:
        return np.zeros(word2vec_model.vector_size)

In [85]:
nb_words = min(MAX_NB_WORDS, len(word_index))

# Row i holds the vector for the word with index i; row 0 (padding) stays zero.
embedding_weights_google = np.zeros((nb_words, word2vec_google.vector_size))
for word, i in word_index.items():
    if i >= nb_words:  # skip words outside the kept vocabulary
        continue
    embedding_weights_google[i] = get_embedding(word2vec_google, word)
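
To make the layout of the weight matrix concrete, here is a toy version of the same loop in plain Python. The three-dimensional vectors, the `toy_vectors` dictionary and the `build_embedding_rows` helper are invented for illustration; the dictionary stands in for the word2vec model. Row `i` holds the vector of the word with index `i`, while row 0 (padding) and out-of-vocabulary words stay all zeros.

```python
def build_embedding_rows(word_index, vectors, max_words, dim):
    """Return a list of vectors in which row i belongs to the word with index i."""
    n_rows = min(max_words, len(word_index))
    rows = [[0.0] * dim for _ in range(n_rows)]
    for word, i in word_index.items():
        if i >= n_rows:
            continue  # word is outside the kept vocabulary
        # Words missing from `vectors` keep the zero vector.
        rows[i] = list(vectors.get(word, [0.0] * dim))
    return rows

toy_vectors = {"good": [0.2, 0.1, 0.7], "bad": [-0.3, 0.4, 0.1]}  # invented values
toy_index = {"good": 1, "movie": 2, "bad": 3, "actors": 4}

rows = build_embedding_rows(toy_index, toy_vectors, max_words=4, dim=3)
for i, row in enumerate(rows):
    print(i, row)
```

Because unknown words map to zero vectors and the layer is frozen (`trainable=False`), the model receives no signal from any word that is missing in the pretrained vocabulary.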

#### 2. Padding

In [86]:
sequences_train = tokenizer.texts_to_sequences(train_data['prep_text'])
sequences_test = tokenizer.texts_to_sequences(test_data['prep_text'])

padded_sequences_train = pad_sequences(sequences_train, maxlen=MAX_SEQUENCE_LENGTH)
padded_sequences_test = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)

In [87]:
def padding(text):
    return pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=MAX_SEQUENCE_LENGTH)

In [88]:
x_train = padded_sequences_train
x_test = padded_sequences_test
y_train = train_data['label']
y_test = test_data['label']

---

## Model

In [93]:
LSTM_DIM = 128
EMBEDDING_DIM = 300
DROPOUT_U = 0.2
DROPOUT_W = 0.2
DROPOUT_AFTER_LSTM = 0.2

---

In [94]:
from keras.models import Sequential
from keras.layers import Dense
from keras.models import model_from_json
from keras.layers import Flatten, Dropout
from keras.layers import LSTM, Bidirectional
from keras.layers.embeddings import Embedding
from keras.callbacks import ModelCheckpoint, TensorBoard, EarlyStopping

In [95]:
labels = to_categorical(np.asarray(train_data['label']))
print('Shape of data tensor:', padded_sequences_train.shape)
print('Shape of label tensor:', labels.shape)

Shape of data tensor: (134090, 70)
Shape of label tensor: (134090, 2)


In [96]:
def create_model(pretrained_embedding_weights = None):
    model = Sequential()
    if pretrained_embedding_weights is not None:
        model.add(Embedding(nb_words,
                            EMBEDDING_DIM, 
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False,
                            weights=[pretrained_embedding_weights]))
    else:
        model.add(Embedding(nb_words, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
    #model.add(LSTM(LSTM_DIM, dropout_U=DROPOUT_U, dropout_W=DROPOUT_W))
    model.add(Bidirectional(LSTM(LSTM_DIM, dropout_U=DROPOUT_U, dropout_W=DROPOUT_W)))
    model.add(Dropout(DROPOUT_AFTER_LSTM))
    model.add(Dense(1, activation='sigmoid'))
    metrics=['accuracy', 'fmeasure', 'precision', 'recall']
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=metrics)
    return model

In [98]:
lstm_word2vec_google = create_model(embedding_weights_google)

In [99]:
lstm_word2vec_google.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_1 (Embedding)          (None, 70, 300)       6000000     embedding_input_1[0][0]          
____________________________________________________________________________________________________
bidirectional_1 (Bidirectional)  (None, 256)           439296      embedding_1[0][0]                
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 256)           0           bidirectional_1[0][0]            
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 1)             257         dropout_1[0][0]                  
Total params: 6,439,553
Trainable params: 439,553
Non-trainable params: 6,000,000
_________
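
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights and a bias) and that `Bidirectional` holds two independent copies of the LSTM; the helper names are made up for this check.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output unit."""
    return input_dim * units + units

embedding = 20000 * 300              # nb_words * EMBEDDING_DIM, frozen
bilstm = 2 * lstm_params(300, 128)   # forward and backward LSTM
dense = dense_params(2 * 128, 1)     # concatenated LSTM outputs -> 1 sigmoid unit

print(embedding, bilstm, dense)      # 6000000 439296 257
print(embedding + bilstm + dense)    # 6439553 in total
print(bilstm + dense)                # 439553 trainable
```

The numbers match the summary: only the recurrent and dense layers are trained, while the 6,000,000 embedding weights stay fixed.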

In [100]:
tensor_board = TensorBoard(log_dir='./logs', histogram_freq=0, write_graph=False, write_images=False)
early_stopping = EarlyStopping(monitor='val_acc', min_delta=0, patience=3, verbose=0, mode='auto')
model_checkpoint = ModelCheckpoint("models/%s.hdf5" % MODEL_NAME, monitor='val_acc', save_best_only=True, verbose=1)
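
As a rough illustration of what `EarlyStopping(monitor='val_acc', patience=3)` does, the sketch below stops once the monitored value has failed to improve for `patience` consecutive epochs. It is a simplified model of the idea, not Keras's implementation (the exact epoch at which Keras stops may differ between versions). The first three accuracy values are taken from the training log of this notebook; the remaining ones are invented.

```python
def epochs_until_stop(val_accs, patience=3, min_delta=0.0):
    """Return the number of epochs run before patience-based early stopping."""
    best = float("-inf")
    waited = 0
    for epoch, acc in enumerate(val_accs, start=1):
        if acc - min_delta > best:
            best, waited = acc, 0   # improvement: remember it and reset the counter
        else:
            waited += 1             # no improvement this epoch
            if waited >= patience:
                return epoch
    return len(val_accs)            # never triggered: all epochs were run

# First three values come from the training log; the rest are invented.
history = [0.7708, 0.7887, 0.7963, 0.7950, 0.7961, 0.7940, 0.7990]
print(epochs_until_stop(history))
```

With `save_best_only=True`, the checkpoint on disk would still be the one from the best epoch, not from the epoch at which training stopped.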

In [101]:
%%time
lstm_word2vec_google.fit(x_train, y_train, 
                         nb_epoch=50,
                         batch_size=128,
                         verbose=2,
                         validation_data=(x_test, y_test),
                         callbacks=[tensor_board, early_stopping, model_checkpoint])

Train on 134090 samples, validate on 14898 samples
Epoch 1/50
Epoch 00000: val_acc improved from -inf to 0.77077, saving model to models/rt_lstm_word2vec.hdf5
1377s - loss: 0.5319 - acc: 0.7284 - fmeasure: 0.7764 - precision: 0.7535 - recall: 0.8088 - val_loss: 0.4735 - val_acc: 0.7708 - val_fmeasure: 0.7985 - val_precision: 0.8322 - val_recall: 0.7693
Epoch 2/50
Epoch 00001: val_acc improved from 0.77077 to 0.78870, saving model to models/rt_lstm_word2vec.hdf5
1063s - loss: 0.4761 - acc: 0.7658 - fmeasure: 0.8040 - precision: 0.7869 - recall: 0.8260 - val_loss: 0.4393 - val_acc: 0.7887 - val_fmeasure: 0.8217 - val_precision: 0.8200 - val_recall: 0.8253
Epoch 3/50
Epoch 00002: val_acc improved from 0.78870 to 0.79628, saving model to models/rt_lstm_word2vec.hdf5
982s - loss: 0.4461 - acc: 0.7854 - fmeasure: 0.8195 - precision: 0.8054 - recall: 0.8372 - val_loss: 0.4258 - val_acc: 0.7963 - val_fmeasure: 0.8273 - val_preci

<keras.callbacks.History at 0x291c33dd8>

## Analyze results

In [102]:
predictions = lstm_word2vec_google.predict(padded_sequences_test)

In [126]:
test_data['p'] = [p[0] for p in predictions]



In [127]:
test_data

Unnamed: 0.1,Unnamed: 0,label,text,prep_text,p
13680,58770,1,Let's face it -- nothing short of a gorilla suit can make Peter MacNicol not look like a weenie.,let's face it - - nothing short of a gorilla suit can make peter macnicol not look like a weenie .,0.269437
71716,22099,1,"A modest and accessible Iranian film, softer than most but still intriguing.","a modest and accessible iranian film , softer than most but still intriguing .",0.996948
2273,46279,1,"I saw the movie and really could not stop my tears. Its tragedy that India has no such leaders after freedom, who dare to do justice with their own children, when they don't behave properly.. In current generation, politicians bring their children's into politics without measuring their caliber and skills.. I remember the dialogue from Gandhi 'What kind of society we want to create/make with such people (about Harilal)?' No wonder that it will be a dream that India will hardly have such leader in this or next generation.. Einstein was right when he said about Gandhi that 'After 50 years one would hardly believe that such person with body, soul and mind (Mahatma Gandhi) had ever lived on this earth.' I sincerely want to THANKS a LOT to Anil kapoor, Feroze khan and all film actors/actresses for this wonderful movie about great person and relationship with his son. All father and son should watch this movie once and take some lessons for both roles.","i saw the movie and really could not stop my tears . its tragedy that india has no such leaders after freedom , who dare to do justice with their own children , when they don_`_t behave properly .. in current generation , politicians bring their children's into politics without measuring their caliber and skills .. i remember the dialogue from gandhi ' what kind of society we want to create / make with such people ( about harilal ) ? ' no wonder that it will be a dream that india will hardly have such leader in this or next generation .. einstein was right when he said about gandhi that ' after 50 years one would hardly believe that such person with body , soul and mind ( mahatma gandhi ) had ever lived on this earth . ' i sincerely want to thanks a lot to anil kapoor , feroze khan and all film actors / actresses for this wonderful movie about great person and relationship with his son . all father and son should watch this movie once and take some lessons for both roles .",0.999608
25816,60370,0,"Despite a rich premise, Just Friends ultimately fails to live up to its early potential, settling into broad comedy that shortchanges the romantic possibilities.","despite a rich premise , just friends ultimately fails to live up to its early potential , settling into broad comedy that shortchanges the romantic possibilities .",0.007091
22921,9272,1,"Shot in astonishingly elaborate long takes, this is the kind of film that finds the most brilliant poetry in the slightest movement of the camera -- a paradigm of cinematic expression.","shot in astonishingly elaborate long takes , this is the kind of film that finds the most brilliant poetry in the slightest movement of the camera - - a paradigm of cinematic expression .",0.998813
13606,17958,0,"Illustrate[s], once and for all, that it's possible for a film to be soooo slow and ponderous it can appear, at times, to come to a dead stop.","illustrate [ s ] , once and for all , that it's possible for a film to be soooo slow and ponderous it can appear , at times , to come to a dead stop .",0.091740
27908,32765,1,One of the most accessible films to come from Bertolucci and shows the filmmakers deep understanding of his craft.,one of the most accessible films to come from bertolucci and shows the filmmakers deep understanding of his craft .,0.995561
23505,12179,0,"As was mentioned by others, could there be any other reason to see this film other than to see former ""Wayne's World"" star Mike Myers play a serious role? The story line is interesting but lacks development and is sabotaged by loose ends and bad characters. If there was any good scenic shots of Ireland then it would give it another reason to see it. But instead it focuses on a little normal village that is obviously surrounded by the 'green pastures' of the Emerald-Isle that are often shown in Irish films. If there was any cultural 'spice' to admire the ""Irish personality"" it would be worth seeing, however this could have almost been shot in England. Too bad for Myers, but this one fails to please or satisfy the heart of anybody who ever wanted to visit the land of Guiness.","as was mentioned by others , could there be any other reason to see this film other than to see former "" wayne's world "" star mike myers play a serious role ? the story line is interesting but lacks development and is sabotaged by loose ends and bad characters . if there was any good scenic shots of ireland then it would give it another reason to see it . but instead it focuses on a little normal village that is obviously surrounded by the ' green pastures ' of the emerald-isle that are often shown in irish films . if there was any cultural ' spice ' to admire the "" irish personality "" it would be worth seeing , however this could have almost been shot in england . too bad for myers , but this one fails to please or satisfy the heart of anybody who ever wanted to visit the land of guiness .",0.026754
62997,58429,1,Great fun if not quite the great teen pic some believe it to be.,great fun if not quite the great teen pic some believe it to be .,0.971040
74478,14139,1,An astonishing achievement for writer-actor François Bégaudeau who adapted his own book for the screenplay and also stars in this absorbing film about the challenges of teaching in a public school.,an astonishing achievement for writer-actor françois bégaudeau who adapted his own book for the screenplay and also stars in this absorbing film about the challenges of teaching in a public school .,0.998460


In [None]:
# Most confident false negatives: positive label, predicted probability below 0.1.
test_data[(test_data['p'] < 0.1) & (test_data['label'] == 1)][['label', 'p', 'text']].sort_values('p')[:10]

In [None]:
# Most confident false positives: negative label, predicted probability above 0.9.
test_data[(test_data['p'] > 0.9) & (test_data['label'] == 0)][['label', 'p', 'text']].sort_values('p', ascending=False)[:10]
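
The two queries above list the most confident false negatives and false positives. To summarise such errors in numbers, the predictions can be thresholded and counted. This is a minimal sketch in plain Python, assuming a 0.5 decision threshold (the notebook does not state one); `binary_metrics` is a hypothetical helper, and the labels and probabilities are the first six rows of the prediction table above.

```python
def binary_metrics(labels, probs, threshold=0.5):
    """Accuracy, precision, recall and F1 for the positive class."""
    preds = [1 if p >= threshold else 0 for p in probs]
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, yhat in pairs if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in pairs if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in pairs if y == 1 and yhat == 0)
    tn = sum(1 for y, yhat in pairs if y == 0 and yhat == 0)
    # Guard against empty denominators.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

labels = [1, 1, 1, 0, 1, 0]
probs = [0.269437, 0.996948, 0.999608, 0.007091, 0.998813, 0.091740]
print(binary_metrics(labels, probs))
```

On these six rows the only error is the first review, a positive label scored at 0.27, which lowers recall but not precision.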

#### Model sensitivity

In [112]:
def lstm_predict(text):
    return lstm_word2vec_google.predict(padding(preprocess(text)))[0][0]

In [113]:
lstm_predict('good actors, but bad movie ')

0.042715129

In [114]:
lstm_predict('really good actors, but bad movie')

0.092992492

In [115]:
lstm_predict('good actors, but really bad movie')

0.042997032

### Dependency

In [116]:
lstm_predict("i think the movie is good")

0.86298966

In [122]:
lstm_predict("i don't think the movie good")

0.20842142

In [123]:
lstm_predict("i doubt the movie is good")  # bad: "doubt" signals a negative opinion, but the score stays high

0.82579565

## Save model

In [125]:
# serialize model to JSON
lstm_word2vec_google_json = lstm_word2vec_google.to_json()
with open(MODEL_NAME + ".json", "w") as json_file:
    json_file.write(lstm_word2vec_google_json)