<a href="https://colab.research.google.com/github/victor-roris/mediumseries/blob/master/NLP/NLPModel_MultiClass_Keras_Word2VecVectorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Model with Word2Vec and Keras

In this notebook we use a Keras model to predict the category of a text (here, the sentiment polarity of IMDb movie reviews). The text is vectorized with Word2Vec.

The Word2Vec embedding is trained on our own dataset.

Notebook adapted from:

https://towardsdatascience.com/machine-learning-word-embedding-sentiment-classification-using-keras-b83c28087456

In [0]:
import warnings
warnings.filterwarnings('ignore')

## DataSet

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('https://raw.githubusercontent.com/SrinidhiRaghavan/AI-Sentiment-Analysis-on-IMDB-Dataset/master/imdb_tr.csv',  encoding = "ISO-8859-1")

print(f'Number of examples : {len(df)}')
df.head()

Number of examples : 25000


Unnamed: 0,row_Number,text,polarity
0,2148,"first think another Disney movie, might good, ...",1
1,23577,"Put aside Dr. House repeat missed, Desperate H...",0
2,1319,"big fan Stephen King's work, film made even gr...",1
3,13358,watched horrid thing TV. Needless say one movi...,0
4,9495,truly enjoyed film. acting terrific plot. Jeff...,1


## Word2Vec Embedding

In this implementation we apply the Word2Vec technique to learn our own embedding from the current dataset.

First we preprocess the text: we split it into word tokens, lowercase them, remove punctuation and non-alphabetic tokens, and filter out stop words.

In [4]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [5]:
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

review_lines = list()
lines = df['text'].values.tolist()

# Build the punctuation table and the stop-word set once, outside the loop
table = str.maketrans('', '', string.punctuation)
stop_words = set(stopwords.words('english'))

for line in lines:
  tokens = word_tokenize(line)

  #convert to lower case
  tokens = [w.lower() for w in tokens]

  #remove punctuation from each word
  stripped = [w.translate(table) for w in tokens]

  #remove remaining tokens that are not alphabetic
  words = [word for word in stripped if word.isalpha()]

  #filter out stop words
  words = [w for w in words if w not in stop_words]

  review_lines.append(words)

print(f'Example of tokenize text : {review_lines[0]}')

Example of tokenize text : ['first', 'think', 'another', 'disney', 'movie', 'might', 'good', 'kids', 'movie', 'watch', 'ca', 'nt', 'help', 'enjoy', 'ages', 'love', 'movie', 'first', 'saw', 'movie', 'years', 'later', 'still', 'love', 'danny', 'glover', 'superb', 'could', 'play', 'part', 'better', 'christopher', 'lloyd', 'hilarious', 'perfect', 'part', 'tony', 'danza', 'believable', 'mel', 'clark', 'ca', 'nt', 'help', 'enjoy', 'movie', 'give']
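To see the effect of each cleaning step in isolation, here is a minimal sketch on a single sentence. It uses plain `str.split` and a tiny hand-written stop-word set; both are stand-ins chosen for illustration (not the NLTK tokenizer and stop-word list used above), so the result differs from NLTK's in detail.

```python
import string

# Hypothetical mini stop-word list, for illustration only
TOY_STOP_WORDS = {"i", "the", "a", "is", "it", "this"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean(text):
    """Lowercase, strip punctuation, keep alphabetic tokens that are not stop words."""
    tokens = text.lower().split()  # stand-in for word_tokenize
    stripped = [t.translate(PUNCT_TABLE) for t in tokens]
    return [t for t in stripped if t.isalpha() and t not in TOY_STOP_WORDS]

print(clean("I can't believe it: this is a GREAT movie, 10/10!"))
```

With whitespace splitting, `can't` becomes `cant`. NLTK's `word_tokenize` instead splits it into `ca` and `n't`, which is why `ca` and `nt` appear in the tokenized review above.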


We will use the Gensim implementation of Word2Vec. The word2vec algorithm processes documents sentence by sentence; here each tokenized review is passed as one sentence.

Gensim’s `Word2Vec` constructor takes several parameters:
 - `sentences` – List of tokenized sentences; here we pass the list of tokenized reviews.
 - `size` – The number of dimensions of each word vector (renamed `vector_size` in Gensim 4.0 and later).
 - `min_count` – Words that occur fewer than `min_count` times are ignored. Usually, the bigger and more extensive your text, the higher this number can be.
 - `window` – Maximum distance, within a sentence, between a term and the neighboring terms associated with it during training. The usual value is 4 or 5.
 - `workers` – Number of threads used to parallelize training and speed it up.
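The role of `window` can be sketched in a few lines of plain Python. The helper `context_pairs` below is hypothetical and only lists which (term, neighbor) pairs fall inside a fixed window. It is a conceptual illustration, not Gensim's training code, which differs in details (for example, it may shrink the window randomly for each term).

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

toy_sentence = ["truly", "enjoyed", "film", "acting", "terrific"]
for pair in context_pairs(toy_sentence, window=1):
    print(pair)
```

A larger `window` pairs each term with more distant neighbors, which tends to capture broader topical similarity at the cost of more training pairs.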

In [6]:
import gensim

EMBEDDING_DIM = 100

#train word2vec model
model = gensim.models.Word2Vec(sentences=review_lines, size=EMBEDDING_DIM, window=5, workers=4, min_count=1)

#vocab size
words = list(model.wv.vocab)
print(f'Vocabulary size: {len(words)}')

Vocabulary size: 93058


In [7]:
print('Word2Vec similar words for "horrible"')
model.wv.most_similar('horrible')

Word2Vec similar words for "horrible"


[('terrible', 0.9547101259231567),
 ('awful', 0.9289554357528687),
 ('sucks', 0.8766037225723267),
 ('atrocious', 0.843876838684082),
 ('sucked', 0.8312643766403198),
 ('crappy', 0.8228499889373779),
 ('horrendous', 0.8225404024124146),
 ('dreadful', 0.8206788897514343),
 ('laughable', 0.8203979730606079),
 ('stupid', 0.8178586959838867)]
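The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch, the same ranking can be computed by hand on a few hypothetical 3-dimensional vectors (the numbers below are made up for illustration and are not taken from the trained model).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors
toy_vectors = {
    "horrible": [0.9, 0.1, 0.0],
    "terrible": [0.8, 0.2, 0.1],
    "superb": [-0.7, 0.3, 0.2],
}

query = "horrible"
ranked = sorted(
    ((w, cosine(toy_vectors[query], v)) for w, v in toy_vectors.items() if w != query),
    key=lambda item: item[1],
    reverse=True,
)
print(ranked)
```

Vectors pointing in nearly the same direction score close to 1, while vectors pointing in opposite directions score below 0.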

In [8]:
#Check a word analogy: which word relates to 'woman' as 'brother' relates to 'man'? (brother - man + woman)

print(f'Word2Vec result of semantically reasonable word vectors (brother - man + woman) :')
model.wv.most_similar_cosmul(positive=['brother', 'woman'], negative=['man'])

Word2Vec result of semantically reasonable word vectors (brother - man + woman) :


[('sister', 1.1465065479278564),
 ('daughter', 1.1121941804885864),
 ('boyfriend', 1.0967283248901367),
 ('son', 1.0912766456604004),
 ('wife', 1.0800729990005493),
 ('father', 1.076292634010315),
 ('marie', 1.0708884000778198),
 ('girlfriend', 1.0669641494750977),
 ('mother', 1.0638612508773804),
 ('husband', 1.0607103109359741)]
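Note that the scores above exceed 1. `most_similar_cosmul` ranks candidates with a multiplicative combination of similarities rather than a plain cosine, so its scores are not bounded by 1. The sketch below assumes the usual form of that objective: each cosine is shifted into the range 0 to 1, the positive terms are multiplied, and the product is divided by the negative terms plus a small constant. The 2-dimensional vectors are hypothetical, and this is an illustration of the idea rather than Gensim's code.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosmul_score(candidate, positive, negative, eps=1e-6):
    """Multiplicative analogy score (assumed form; eps avoids division by zero)."""
    pos, neg = 1.0, 1.0
    for p in positive:
        pos *= (1 + cosine(candidate, p)) / 2
    for n in negative:
        neg *= (1 + cosine(candidate, n)) / 2
    return pos / (neg + eps)

# Hypothetical toy vectors: axis 0 encodes gender, axis 1 encodes family role
toy = {
    "man": [1.0, 0.0], "woman": [-1.0, 0.0],
    "brother": [1.0, 1.0], "sister": [-1.0, 1.0], "father": [1.0, -1.0],
}

scores = {
    w: cosmul_score(toy[w], positive=[toy["brother"], toy["woman"]], negative=[toy["man"]])
    for w in ("sister", "father")
}
print(scores)
```

In this toy setup `sister` scores well above 1 and far above `father`, mirroring the ranking the trained model produces.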

In [9]:

print(f'Word2Vec result of find the odd word in "actor", "director", "actress" and "foot" : ')
model.wv.doesnt_match("actor director actress foot".split())

Word2Vec result of find the odd word in "actor", "director", "actress" and "foot" : 


'foot'

You can save the trained Word2Vec vectors to a text file for later use.

In [0]:
# save model
filename = "imdb_embedding_word2vec.txt"
model.wv.save_word2vec_format(filename, binary=False)
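The text format written by `save_word2vec_format` starts with a header line holding the vocabulary size and the vector dimension, followed by one line per word with its coefficients. A minimal sketch of reading that layout, using a hypothetical two-word file held in memory:

```python
import io

# Hypothetical file content: header "<vocab_size> <dim>", then one word per line
sample_file = io.StringIO("2 3\nmovie 0.1 0.2 0.3\nfilm 0.4 0.5 0.6\n")

# The first line is metadata, not a word vector
vocab_size, dim = (int(x) for x in sample_file.readline().split())

toy_embeddings = {}
for line in sample_file:
    word, *coefs = line.split()
    toy_embeddings[word] = [float(c) for c in coefs]

print(vocab_size, dim, toy_embeddings["film"])
```

Any hand-written loader for this format needs to skip or parse the header line, otherwise it is read as if it were a word.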

## Train the model

Since we have already trained a Word2Vec model on the IMDb dataset, the word embeddings are ready to use.

In [0]:
# Load the saved W2V model

import os

embeddings_index = {}
with open(os.path.join('', 'imdb_embedding_word2vec.txt'), encoding='utf-8') as f:
  next(f)  # skip the header line (vocabulary size and vector dimension)
  for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')  # parse the coefficients as floats
    embeddings_index[word] = coefs

The next step is to convert each tokenized review into a sequence of integer word indices.

In [12]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# vectorize the text samples into a 2D integer tensor

# Train the tokenizer (vectorizer) with the tokenized word text
tokenizer_obj = Tokenizer()
tokenizer_obj.fit_on_texts(review_lines)

# Convert to sequence the text
sequences = tokenizer_obj.texts_to_sequences(review_lines)

print(f'Vectorized {len(sequences)} sequences of text')

print('Example of vectorized text :')
print(f'Tokenized text ({len(review_lines[0])} entries) : {review_lines[0]}')
print(f'Vectorized text ({len(sequences[0])} entries) : {sequences[0]}')

Vectorized 25000 sequences of text
Example of vectorized text :
Tokenized text (47 entries) : ['first', 'think', 'another', 'disney', 'movie', 'might', 'good', 'kids', 'movie', 'watch', 'ca', 'nt', 'help', 'enjoy', 'ages', 'love', 'movie', 'first', 'saw', 'movie', 'years', 'later', 'still', 'love', 'danny', 'glover', 'superb', 'could', 'play', 'part', 'better', 'christopher', 'lloyd', 'hilarious', 'perfect', 'part', 'tony', 'danza', 'believable', 'mel', 'clark', 'ca', 'nt', 'help', 'enjoy', 'movie', 'give']
Vectorized text (47 entries) : [22, 27, 61, 673, 2, 124, 7, 227, 2, 30, 87, 4, 214, 232, 1868, 40, 2, 22, 106, 2, 59, 183, 47, 40, 1432, 2989, 741, 16, 187, 76, 45, 1151, 3042, 491, 278, 76, 932, 10915, 701, 3466, 2260, 87, 4, 214, 232, 2, 93]
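The Keras `Tokenizer` numbers words by descending frequency, starting at 1 (index 0 is reserved for padding). A minimal pure-Python sketch of that mapping, with hypothetical helper names and toy documents, looks like this:

```python
from collections import Counter

def build_word_index(tokenized_texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(tokenized_texts, index):
    """Replace every known word by its integer index."""
    return [[index[t] for t in text if t in index] for text in tokenized_texts]

toy_docs = [["love", "movie"], ["movie", "hilarious", "movie"]]
toy_index = build_word_index(toy_docs)
print(toy_index)
print(to_sequences(toy_docs, toy_index))
```

This is why frequent words such as `movie` get small integers in the vectorized review above, while rare words get large ones.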


In [13]:
# vocabulary built by the tokenizer (word -> integer index)
word_index = tokenizer_obj.word_index
print(f'Found {len(word_index)} unique tokens.')

Found 93058 unique tokens.


In [14]:
max_length = max([len(s) for s in review_lines]) # Length (in tokens) of the longest review

print(f'Length of the longest review : {max_length}')

# Pad all the sequences to the same length (short sequences are filled with 0, longer ones would be cut)
review_pad = pad_sequences(sequences, maxlen=max_length)

print(f'The padded sequences form an array of {review_pad.shape[0]} entries (examples) where each entry is a list of {review_pad.shape[1]} tokens (shorter sequences are padded with 0)')

Length of the longest review : 1440
The padded sequences form an array of 25000 entries (examples) where each entry is a list of 1440 tokens (shorter sequences are padded with 0)
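By default `pad_sequences` pads and truncates at the start of each sequence (`padding='pre'`, `truncating='pre'`). The hypothetical helper below sketches that default behavior in plain Python for a single sequence:

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad `seq` with `value`, or keep only its last `maxlen` items."""
    if len(seq) >= maxlen:
        return seq[len(seq) - maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([5, 6], maxlen=4))
print(pad_pre([1, 2, 3, 4, 5], maxlen=3))
```

Because `maxlen` is set to the length of the longest review here, nothing is truncated, but most reviews receive a long run of leading zeros.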


In [0]:
labels = df['polarity'].values

In [16]:
print(f'Shape of review tensor : { review_pad.shape }')
print(f'Shape of sentiment tensor: {labels.shape}')

Shape of review tensor : (25000, 1440)
Shape of sentiment tensor: (25000,)


Now we map the embeddings from the loaded Word2Vec model to the words in the `tokenizer_obj.word_index` vocabulary and create a matrix of word vectors, with one row per word index.

In [17]:
num_words = len(word_index) + 1
embedding_matrix = np.zeros( (num_words, EMBEDDING_DIM) )

print(f"We create an embedding matrix of shape ({embedding_matrix.shape[0]} words) x ({embedding_matrix.shape[1]} features)")

We create an embedding matrix of shape (93059 words) x (100 features)


In [0]:
for word, i in word_index.items():
  if i >= num_words:
    continue
  embedding_vector = embeddings_index.get(word)
  # words not found in the embedding index keep their all-zeros row
  if embedding_vector is not None:
    embedding_matrix[i] = embedding_vector

The pretrained embedding matrix is now ready to be used directly in the embedding layer. We set `trainable=False` so that the word vectors stay fixed during training.
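Conceptually, an embedding layer with fixed weights is a row lookup: each integer in a padded sequence selects one row of the matrix. The sketch below shows this with plain lists and hypothetical toy values (a 3-dimensional embedding and a three-word vocabulary); it illustrates the idea and is not how Keras implements the layer.

```python
TOY_DIM = 3

# Hypothetical pretrained vectors and tokenizer index
toy_pretrained = {"movie": [0.1, 0.2, 0.3], "good": [0.4, 0.5, 0.6]}
toy_word_index = {"movie": 1, "good": 2, "zzz": 3}  # "zzz" has no pretrained vector

# Row 0 is reserved for padding; words without a vector keep an all-zero row
toy_matrix = [[0.0] * TOY_DIM for _ in range(len(toy_word_index) + 1)]
for word, i in toy_word_index.items():
    if word in toy_pretrained:
        toy_matrix[i] = toy_pretrained[word]

toy_padded = [0, 2, 1]  # one padded sequence: padding, "good", "movie"
toy_embedded = [toy_matrix[i] for i in toy_padded]  # row lookup per token
print(toy_embedded)
```

Each review therefore enters the recurrent layer as a sequence of `max_length` vectors of size `EMBEDDING_DIM`.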

In [19]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, GRU
from keras.initializers import Constant

#define model
model = Sequential()
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=max_length,
                            trainable=False)
model.add(embedding_layer)
model.add(GRU(units=32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])







To train the sentiment classification model we hold out 20% of the examples for validation (`VALIDATION_SPLIT = 0.2`); you can vary this value to see its effect on the accuracy of the model.

In [0]:
# Randomize the examples
indices = np.arange(review_pad.shape[0])
np.random.shuffle(indices)

review_pad = review_pad[indices]
labels = labels[indices]

In [27]:
# split the data into a training set and validation set
VALIDATION_SPLIT = 0.2
num_validation_samples = int(VALIDATION_SPLIT * review_pad.shape[0])

print(num_validation_samples)

5000


In [30]:
X_train_pad = review_pad[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
X_test_pad = review_pad[-num_validation_samples:]
y_test = labels[-num_validation_samples:]

print(f'Number of training examples : {len(y_train)} ')
print(f'Number of test examples : {len(y_test)} ')
print(f'Shape of training examples : {X_train_pad.shape}')
print(f'Shape of test examples : {X_test_pad.shape}')

Number of training examples : 20000 
Number of test examples : 5000 
Shape of training examples : (20000, 1440)
Shape of test examples : (5000, 1440)
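The shuffle-then-split logic can be sketched without NumPy. The key point is that features and labels are reordered with the same permutation, so each example keeps its label. The function name and the fixed `seed` are assumptions added here for illustration and reproducibility; the cells above shuffle without a seed.

```python
import random

def shuffle_and_split(features, targets, validation_split, seed=0):
    """Shuffle features and targets together, then hold out the last fraction."""
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)
    n_val = int(validation_split * len(features))
    cut = len(order) - n_val
    train_idx, val_idx = order[:cut], order[cut:]
    x_train = [features[i] for i in train_idx]
    y_train = [targets[i] for i in train_idx]
    x_val = [features[i] for i in val_idx]
    y_val = [targets[i] for i in val_idx]
    return x_train, y_train, x_val, y_val

toy_x = [[i, i] for i in range(10)]
toy_y = [i % 2 for i in range(10)]
x_tr, y_tr, x_va, y_va = shuffle_and_split(toy_x, toy_y, validation_split=0.2)
print(len(x_tr), len(x_va))
```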


In [0]:
print('Train ...')
model.fit(X_train_pad, y_train, batch_size=128, epochs=25, validation_data=(X_test_pad, y_test), verbose=2)

Train ...
Train on 20000 samples, validate on 5000 samples
Epoch 1/25
 - 255s - loss: 0.6010 - acc: 0.6642 - val_loss: 0.4375 - val_acc: 0.8018
Epoch 2/25
 - 253s - loss: 0.4476 - acc: 0.7963 - val_loss: 0.3513 - val_acc: 0.8502
Epoch 3/25
 - 253s - loss: 0.3951 - acc: 0.8266 - val_loss: 0.3339 - val_acc: 0.8566
Epoch 4/25
 - 253s - loss: 0.3730 - acc: 0.8418 - val_loss: 0.3238 - val_acc: 0.8636
Epoch 5/25
 - 254s - loss: 0.3570 - acc: 0.8470 - val_loss: 0.3136 - val_acc: 0.8664
Epoch 6/25
 - 254s - loss: 0.3455 - acc: 0.8524 - val_loss: 0.3073 - val_acc: 0.8706
Epoch 7/25
 - 252s - loss: 0.3415 - acc: 0.8548 - val_loss: 0.3012 - val_acc: 0.8714
Epoch 8/25
 - 252s - loss: 0.3310 - acc: 0.8599 - val_loss: 0.3002 - val_acc: 0.8744
Epoch 9/25
 - 255s - loss: 0.3253 - acc: 0.8609 - val_loss: 0.2974 - val_acc: 0.8738
Epoch 10/25
 - 252s - loss: 0.3232 - acc: 0.8648 - val_loss: 0.2939 - val_acc: 0.8754
Epoch 11/25


## Model evaluation

In [0]:
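%%time
from sklearn.metrics import accuracy_score, classification_report

# Class names for polarity 0 and 1 (the sample rows above suggest 0 = negative, 1 = positive)
categories = ['negative', 'positive']

# The sigmoid output is a probability; threshold it at 0.5 to get class labels
y_pred = (model.predict(X_test_pad) > 0.5).astype('int32').ravel()

print('accuracy %s' % accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=categories))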
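The network defined earlier ends in a single sigmoid unit, so its predictions are probabilities that have to be turned into class labels before any classification metric is computed. A minimal sketch of that step, using hypothetical probabilities and a 0.5 threshold (the threshold is an assumption, not something fixed by the notebook):

```python
def to_labels(probabilities, threshold=0.5):
    """Convert sigmoid outputs to 0/1 class labels."""
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(y_true, y_hat):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

toy_probs = [0.91, 0.12, 0.55, 0.40]  # hypothetical sigmoid outputs
toy_true = [1, 0, 0, 1]
toy_pred = to_labels(toy_probs)
print(toy_pred, accuracy(toy_true, toy_pred))
```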

In [0]:
import itertools
import matplotlib.pyplot as plt

# This utility function is adapted from the sklearn docs: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function plots the confusion matrix, normalized by row
    so that each cell shows a fraction of its true class.
    """

    cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=30)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45, fontsize=22)
    plt.yticks(tick_marks, classes, fontsize=22)

    fmt = '.2f'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label', fontsize=25)
    plt.xlabel('Predicted label', fontsize=25)

In [0]:
from sklearn.metrics import confusion_matrix

cnf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(24,13))
plot_confusion_matrix(cnf_matrix, classes=categories, title="Confusion matrix")
plt.show()
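The plotting function divides each row of the confusion matrix by its sum, so every cell shows the fraction of a true class that received each predicted label. A small sketch with hypothetical counts, assuming every class appears at least once (otherwise a row sum would be zero):

```python
def normalize_rows(cm):
    """Divide each row of a confusion matrix by its row total."""
    return [[count / sum(row) for count in row] for row in cm]

toy_cm = [[8, 2],   # hypothetical counts: true class 0
          [1, 9]]   # hypothetical counts: true class 1
print(normalize_rows(toy_cm))
```

After normalization the diagonal holds the per-class recall, which makes classes of different sizes directly comparable.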