### Using Word2Vec Embeddings with a CNN

10 Jan 2018

***
### Summary

This model scored 0.056 on the leaderboard. That is far from fantastic, but the main purpose of this kernel is to describe a process for creating word2vec embeddings and then using those embeddings to train a Keras CNN.

Lessons like this are easy to forget so I'm publishing this here mainly for my own future reference. Other beginners may also find this kernel helpful.

The training time for this model was approx. 90 minutes using a GPU.

***

#### References:

- Kaggle Word2Vec tutorial by Angela Chapman:<br>
This tutorial is excellent.<br>
https://www.kaggle.com/c/word2vec-nlp-tutorial#part-2-word-vectors <br>

- Blog post by Dr Jason Brownlee:<br>
https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/ <br>

- My previous kernel that used pre-trained GloVe embeddings:<br>
https://www.kaggle.com/vbookshelf/keras-cnn-glove-early-stopping-0-048-lb <br>

- Other helpful info:<br>
https://radimrehurek.com/gensim/models/word2vec.html<br>
https://radimrehurek.com/gensim/models/keyedvectors.html<br>


***


In [1]:
import pandas as pd
import numpy as np

import math
from sklearn.model_selection import train_test_split

import nltk

from gensim.models import word2vec

from numpy import asarray
from numpy import zeros
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding, Dropout
from keras.layers import BatchNormalization, Conv1D, MaxPooling1D
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

# Don't Show Warning Messages
import warnings
warnings.filterwarnings('ignore')

In [2]:
df_train = pd.read_csv('../input/train.csv')
df_test = pd.read_csv('../input/test.csv')

df_train.fillna(value='none',inplace=True)
df_test.fillna(value='none',inplace=True)

print(df_train.shape)
print(df_test.shape)

In [3]:
# combine the train and test sets for encoding and padding
train_len = len(df_train)
df_combined =  pd.concat(objs=[df_train, df_test], axis=0).reset_index(drop=True)

print(df_combined.shape)

# make a copy of df_combined
# (plain assignment would only create a second name for the same dataframe)
df_combined_copy = df_combined.copy()

### Format word2vec input

This is the input format that Word2vec wants:

[ [Hello, how, are, you.], [I, am, fine, thanks.] ]

Word2Vec expects single sentences. Each sentence is a list of words. In other words, the input format is a list of lists.
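
As a minimal sketch of that format, the snippet below builds a list of lists from two toy comments. The regex-based splitter and the helper name `to_sentences` are assumptions made for illustration. They stand in for the NLTK punkt tokenizer used in the next step and will not handle abbreviations or other edge cases.

```python
import re

def to_sentences(comment):
    """Split one comment into sentences, then each sentence into words.

    Hypothetical helper: a naive regex stands in for a real sentence tokenizer.
    """
    # split after '.', '!' or '?' when followed by whitespace
    raw_sentences = re.split(r'(?<=[.!?])\s+', comment.strip())
    return [s.split() for s in raw_sentences if s]

comments = ["Hello, how are you. I am fine thanks.", "Nice to meet you!"]

# one flat list that holds every sentence from every comment
corpus = []
for comment in comments:
    corpus.extend(to_sentences(comment))

print(corpus)
```

Note that punctuation stays attached to the words, just as it does when a sentence is split on whitespace.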

### 1. Extract the sentences from each comment

In [4]:

# initialize the tokenizer for extracting sentences
tok = nltk.data.load('tokenizers/punkt/english.pickle')

output_list = []

def sentence_to_list(x):
    """
    1. Input: The text of a single comment (apply() calls this once per comment).
    2. Each sentence is converted into a list of words and appended to the
    global output_list e.g.
    output_list = [[hello,how,are,you],[i,am,fine,thanks]]
    3. After running on every row, output_list contains all sentences from
    every train and test comment.
    
    """
    sentence_list= tok.tokenize(x)
    
    for sentence in sentence_list:
        # convert the sentence into a list of words
        word_list = sentence.split()
        # add the sentence to the list of sentences
        output_list.append(word_list)
        
    return output_list


# Run the function
# apply() is used here only for its side effect: each call appends the
# sentences of one comment to the global output_list. The Series that
# apply() returns is not needed, so it is discarded.
# Okay, this is not the most pythonic way of doing things but apply() runs fast.
df_combined_copy['comment_text'].apply(sentence_to_list)

print(len(output_list))

### 2. Create the word2vec embedding

In [None]:

# Set values for various parameters
num_features = 300    # Word vector dimensionality
min_word_count = 4    # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model
# Note: this uses the gensim API available when the kernel was written.
# gensim 4.0+ renamed the `size` argument to `vector_size`.
w2v_model = word2vec.Word2Vec(output_list, workers=num_workers,
            size=num_features, min_count=min_word_count,
            window=context, sample=downsampling)

# L2-normalize the vectors in place to save memory
# (the model cannot be trained further after this call)
w2v_model.init_sims(replace=True)

# save the model
# model_name = "300features_4minwords_10context"
# w2v_model.save(model_name)

print('Training completed.')
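
To make the `min_word_count` setting concrete: words that occur fewer than that many times in the corpus are dropped before training, so they never receive a vector. The sketch below imitates that filtering with `collections.Counter` on a toy corpus. It is not gensim's implementation, and `toy_corpus` and `min_count` are illustrative values.

```python
from collections import Counter

# toy corpus in the list-of-lists format that Word2Vec expects
toy_corpus = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'sat'],
    ['the', 'cat', 'ran'],
]
min_count = 2  # illustrative value; this kernel uses 4

# count every word across all sentences
counts = Counter(word for sentence in toy_corpus for word in sentence)

# keep only the words that appear at least min_count times
kept_vocab = sorted(w for w, c in counts.items() if c >= min_count)

print(kept_vocab)
```

Here 'dog' and 'ran' occur only once, so they fall out of the vocabulary.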

In [None]:
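# save the word vectors
# (gensim writes its own binary format here, not CSV, despite the file extension)

#word_vectors = w2v_model.wv
#word_vectors.save('word2vec_toxic_vectors.csv')

# load the saved word vectors
#from gensim.models import KeyedVectors
#word_vectors = KeyedVectors.load('word2vec_toxic_vectors.csv')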


In [None]:
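# get the shape of the word2vec embedding matrix
# (syn1neg holds the output-layer weights used for negative sampling; it has
# the same shape as the word-vector matrix, which lives in w2v_model.wv)
w2v_model.syn1neg.shape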

In [None]:
# Which words are most similar to the word 'man'?
w2v_model.wv.most_similar("man")

In [None]:
# This is how to access the embedding vector for a given word
w2v_model.wv['hello']
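
`most_similar` ranks words by the cosine similarity of their vectors. As a rough illustration of that measure, here is a plain-Python sketch on made-up 3-dimensional vectors. The `toy_vectors` values are invented, not taken from the trained model, and `cosine_similarity` is a hypothetical helper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# invented vectors, for illustration only
toy_vectors = {
    'man':   [1.0, 0.0, 1.0],
    'boy':   [1.0, 0.1, 0.9],
    'table': [0.0, 1.0, 0.0],
}

query = 'man'

# score every other word against the query, highest similarity first
ranked = sorted(
    ((word, cosine_similarity(toy_vectors[query], vec))
     for word, vec in toy_vectors.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)

print(ranked)
```

With these toy values 'boy' ranks first, because its vector points in nearly the same direction as the vector for 'man'.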

### Now that we have converted our corpus into a word2vec embedding, how do we actually use it in a machine learning model?

Good question. One way to do this is to load an embedding matrix as the weights of a neural network's embedding layer. Here's how I did it.

### 1. Process the train and test comments again:
1. Each word is assigned a unique integer.
2. Each training example (comment) is transformed into a sequence of these unique integers. This is a vector. (A vector is simply a list of numbers.)
3. Make all vectors the same length by padding them with zeros (if too short) or truncating them (if too long). Here we set the vector length (max_length) to 500.
4. These padded vectors will be our model inputs: X and X_test
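
The four steps above can be sketched in plain Python. This is a simplified stand-in for what the Keras `Tokenizer` and `pad_sequences` do, not their actual implementation: the helper names `build_word_index`, `encode` and `pad_post` are hypothetical, and a toy `max_length` of 5 replaces the 500 used in this kernel.

```python
from collections import Counter

def build_word_index(texts):
    """Assign each word a unique integer, starting at 1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most frequent words get the smallest integers; ties are broken alphabetically
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}

def encode(text, word_index):
    """Turn a comment into a sequence of integers."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

def pad_post(seq, max_length):
    """Keep the first max_length integers, then pad with zeros at the end.

    Simplification: Keras pad_sequences drops tokens from the front by default
    when a sequence is too long; this sketch drops them from the end.
    """
    seq = seq[:max_length]
    return seq + [0] * (max_length - len(seq))

texts = ["you are nice", "you are not nice at all you know"]
max_length = 5  # toy value

word_index = build_word_index(texts)
padded = [pad_post(encode(text, word_index), max_length) for text in texts]

print(word_index)
print(padded)
```

The short comment is filled up with zeros and the long one is cut off, so both end up with exactly `max_length` entries.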

In [None]:
# create the padded vectors

docs_combined = df_combined['comment_text'].astype(str)

# The tokenizer builds a word index: a dictionary that maps each word to a unique integer
t = Tokenizer()
t.fit_on_texts(docs_combined)
vocab_size = len(t.word_index) + 1  # +1 because index 0 is reserved for padding

# integer encode the documents
# replace each word with its unique integer
encoded_docs = t.texts_to_sequences(docs_combined)

# pad (or truncate) documents to a length of 500 words
max_length = 500
padded_docs_combined = pad_sequences(encoded_docs, maxlen=max_length, padding='post')


In [None]:
# separate the train and test sets

df_train_padded = padded_docs_combined[:train_len]
df_test_padded = padded_docs_combined[train_len:]

print(df_train_padded.shape)
print(df_test_padded.shape)

In [None]:
# create an embedding matrix for words that are in our combined train and test dataframes
# rows for words without a word2vec vector stay all zeros

embedding_matrix = zeros((vocab_size, 300))

for word, i in t.word_index.items():
    # check if the word is in the word2vec vocab
    if word in w2v_model.wv:
        embedding_matrix[i] = w2v_model.wv[word]

#### What is the above code doing?

First, recall that we access a word's embedding like this:<br>

w2v_model.wv['some_word']

If a word appears in the train or test comments and has a word2vec embedding, we insert that embedding into the matching row of our new embedding_matrix. Words without an embedding keep a row of zeros. Later, this embedding_matrix is passed to the CNN's embedding layer as its weights.
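
The same lookup can be shown on a toy example. Here plain Python lists stand in for the NumPy array, and `toy_word_index` and `toy_vectors` are invented stand-ins for `t.word_index` and `w2v_model.wv`. The sketch also counts how many words received a vector, which may be a useful check because any word without a word2vec vector is left with an all-zero row.

```python
embedding_dim = 3  # toy value; this kernel uses 300

# invented stand-ins for t.word_index and w2v_model.wv
toy_word_index = {'you': 1, 'are': 2, 'nice': 3, 'zzz': 4}
toy_vectors = {
    'you':  [0.1, 0.2, 0.3],
    'are':  [0.4, 0.5, 0.6],
    'nice': [0.7, 0.8, 0.9],
}

# row 0 is reserved for padding, hence the + 1
vocab_size = len(toy_word_index) + 1
embedding_matrix = [[0.0] * embedding_dim for _ in range(vocab_size)]

found = 0
for word, i in toy_word_index.items():
    # copy the vector only if the word has one
    if word in toy_vectors:
        embedding_matrix[i] = list(toy_vectors[word])
        found += 1

coverage = found / len(toy_word_index)

print(embedding_matrix)
print('words with a vector: {:.0%}'.format(coverage))
```

A low coverage figure on real data would suggest that the two tokenization steps produce different word forms, for example through differences in case or punctuation handling.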

In [None]:
# check the shape of the new embedding matrix
embedding_matrix.shape

### 2. CNN Model
Finally, we arrive at the CNN model...

We run the model separately for each of the 6 targets - toxic, severe_toxic, obscene, threat, insult, identity_hate.

In [None]:
X = df_train_padded
X_test = df_test_padded

# target columns
y_toxic = df_train['toxic']
y_severe_toxic = df_train['severe_toxic']
y_obscene = df_train['obscene']
y_threat = df_train['threat']
y_insult = df_train['insult']
y_identity_hate = df_train['identity_hate']

In [None]:
# target columns for each of the 6 models
target_cols = [y_toxic,y_severe_toxic,y_obscene,y_threat,y_insult,y_identity_hate]

preds = []

for col in target_cols:
    
    # set the target for this model
    y = col
    
    X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.25, shuffle=True,
                                                        random_state=5, stratify=y)

    # define model
    model = Sequential()
    e = Embedding(vocab_size, 300, weights=[embedding_matrix], input_length=500, trainable=False)
    model.add(e)
    model.add(Conv1D(128, 3, activation='relu'))
    model.add(MaxPooling1D(pool_size=3, strides=2))
    model.add(Dropout(0.2))
    model.add(Conv1D(64, 3, activation='relu'))
    model.add(MaxPooling1D(pool_size=3, strides=2))
    model.add(Dropout(0.2))
    model.add(Conv1D(64, 3, activation='relu'))
    model.add(MaxPooling1D(pool_size=3, strides=2))
    model.add(Dropout(0.2))
    model.add(Flatten())
    model.add(Dense(1, activation='sigmoid'))


    # compile the model
    Adam_new = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
    model.compile(optimizer=Adam_new, loss='binary_crossentropy', metrics=['acc'])

    early_stopping = EarlyStopping(monitor='val_loss', patience=5, mode='min')
   
    save_best = ModelCheckpoint('toxic.hdf', save_best_only=True, monitor='val_loss', 
                               mode='min')

    history = model.fit(X_train, y_train, validation_data=(X_eval, y_eval),epochs=100, verbose=1,
                   callbacks=[early_stopping,save_best])


    model.load_weights(filepath = 'toxic.hdf')
    
    # make a prediction
    predictions = model.predict(X_test)

    y_preds = predictions[:,0]
    
    preds.append(y_preds)
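
To see what the `Flatten` layer in the model above receives, the sequence length can be traced through the three Conv1D/MaxPooling1D pairs. The sketch below assumes the Keras defaults of `'valid'` padding and a stride of 1 for `Conv1D`, and `'valid'` padding for `MaxPooling1D`. The helper names `conv1d_out` and `pool1d_out` are hypothetical.

```python
def conv1d_out(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

def pool1d_out(length, pool_size, stride):
    """Output length of 1D max pooling with 'valid' padding."""
    return (length - pool_size) // stride + 1

length = 500  # input_length of the embedding layer
lengths = []

# three blocks of Conv1D(kernel 3) followed by MaxPooling1D(pool 3, stride 2)
for filters in (128, 64, 64):
    length = conv1d_out(length, kernel_size=3)
    length = pool1d_out(length, pool_size=3, stride=2)
    lengths.append(length)

# the last convolution has 64 filters, so Flatten sees length * 64 values
flattened = length * 64

print(lengths)
print(flattened)
```

Under these assumptions the 500-step input shrinks to 59 steps, so the final `Dense` layer is fed 59 * 64 = 3776 values.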


In [None]:
# put the results into a dataframe

df_results = pd.DataFrame({'id':df_test.id,
                            'toxic':preds[0],
                           'severe_toxic':preds[1],
                           'obscene':preds[2],
                           'threat':preds[3],
                           'insult':preds[4],
                           'identity_hate':preds[5]}).set_index('id')

# Older versions of pandas sort the columns alphabetically when a DataFrame is built from a dict.
# Re-ordering the columns here guarantees that they match the sample submission file.
df_results = df_results[['toxic','severe_toxic','obscene','threat','insult','identity_hate']]

# create a submission csv file
#df_results.to_csv('word2vec_with_cnn.csv', columns=['toxic','severe_toxic','obscene','threat','insult','identity_hate']) 


***

Thank you for reading.