# Conv. neural network sentence classifier notebook
In this notebook we will attempt to build a sentence classifier model based on https://www.tensorflow.org/tutorials/text/text_classification_rnn

## 00. Packages setup

We start by installing the required packages. We need to override some existing package installations (e.g. fairing 0.5).

In [1]:
import os
import logging
import site
from pathlib import Path
import sys

In [2]:
home = str(Path.home())
local_py_path = os.path.join(home, ".local/lib/python3.6/site-packages")
if local_py_path not in sys.path:
    logging.info("Adding %s to python path", local_py_path)
    sys.path.insert(0, local_py_path)
site.getsitepackages()    

['/usr/local/lib/python3.6/dist-packages',
 '/usr/lib/python3/dist-packages',
 '/usr/lib/python3.6/dist-packages']

In [3]:
if not os.getenv("GOOGLE_APPLICATION_CREDENTIALS"):
    raise ValueError("Notebook is missing google application credentials")
else:
    print('GCP Credentials OK')

GCP Credentials OK


In [4]:
!pip install --user --upgrade pip
!pip install --user pandas
!pip install --user tensorflow
!pip install --user keras
!pip install --user numpy
!pip install --user gcsfs
!pip install --user google-cloud-storage
!pip install --user gensim
!pip install --user kubeflow


We install a recent fairing commit because we hit a couple of bugs with the released version: https://github.com/kubeflow/kubeflow/issues/3643

In [None]:
!pip install --user git+https://github.com/kubeflow/fairing.git@dc61c4c88f233edaf22b13bbfb184ded0ed877a4

## 01. Data preparation

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
import gensim.models.keyedvectors as word2vec
from gensim.models import Word2Vec

We will start by loading the train and test data files from Google Cloud Storage. We will use the Wikipedia Movie Plots dataset (https://www.kaggle.com/jrobischon/wikipedia-movie-plots), which features plot summaries scraped from Wikipedia.

In [5]:
train_data_path = 'data/wiki_movie_plots_deduped.csv'
test_data_path = 'data/wiki_movie_plots_deduped_test.csv'
gcp_bucket = 'velascoluis-test'
column_target_value = 'Genre'
column_text_value = 'Plot'

We drop rows with missing values and remove rows with duplicate plots.

In [6]:
train_data_load = pd.read_csv("gs://" + gcp_bucket + "/" + train_data_path, sep=',')
test_data_load = pd.read_csv("gs://" + gcp_bucket + "/" + test_data_path, sep=',')
train_data = train_data_load.dropna().drop_duplicates(subset=column_text_value, keep='first', inplace=False)
test_data = test_data_load.dropna().drop_duplicates(subset=column_text_value, keep='first', inplace=False)
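
To make the effect of the cleaning step concrete, here is a minimal sketch on a hypothetical toy frame (the rows below are made up for illustration and are not taken from the dataset). It assumes the same `Plot` and `Genre` column names as the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the movie plots data.
toy = pd.DataFrame({
    "Plot": ["A heist goes wrong.", "A heist goes wrong.", None, "Two friends travel."],
    "Genre": ["action", "thriller", "drama", "comedy"],
})

# dropna() removes the row with the missing plot; drop_duplicates() then
# keeps only the first row for each distinct plot text.
clean = toy.dropna().drop_duplicates(subset="Plot", keep="first")
print(clean)
```

Note that deduplicating on the plot text alone keeps whichever genre label appears first for a repeated plot.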

Let's take a quick look at the data.

In [None]:
train_data.head()

We will focus on two columns: Plot and Genre. The goal of the algorithm is to infer the movie genre from the plot.
The next step is to drop rows with an unknown genre.

In [None]:
train_data = train_data[train_data.Genre != 'unknown']

In [None]:
train_data.head()

We will explore the distribution of genres with a histogram.

In [None]:
plt.hist(train_data[column_target_value], color = 'blue', edgecolor = 'black')
plt.title('Histogram of movies by genre')
plt.xlabel('Genre')
plt.ylabel('Movies')

The data is severely skewed, and we have a long tail of genres.

In [26]:
train_data[column_target_value].value_counts()

drama                                                               5750
unknown                                                             5105
comedy                                                              4304
horror                                                              1103
action                                                              1051
                                                                    ... 
drama / war / comedy / action                                          1
raghava lawrence, prabhu deva, raja, charmme, kamalini mukherjee       1
mythology (fiction)                                                    1
war, drama, historical                                                 1
animation, musical, comedy                                             1
Name: Genre, Length: 2141, dtype: int64

We will keep only the genres with more than 900 observations.

In [7]:
train_data = train_data.groupby(column_target_value).filter(lambda x : len(x)>900)

In [9]:
train_data[column_target_value].value_counts()

drama       5750
unknown     5105
comedy      4304
horror      1103
action      1051
thriller     936
Name: Genre, dtype: int64
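
The group-size filter can be illustrated on a small example. This is a sketch on hypothetical toy data with a made-up threshold of 3 (the notebook uses 900); it shows that the strict comparison drops a group whose size equals the threshold.

```python
import pandas as pd

# Hypothetical toy data: 4 drama rows, 3 comedy rows, 2 western rows.
toy = pd.DataFrame({"Genre": ["drama"] * 4 + ["comedy"] * 3 + ["western"] * 2})

min_rows = 3  # hypothetical threshold for illustration (900 in the notebook)

# filter() keeps every row of a group for which the function returns True.
kept = toy.groupby("Genre").filter(lambda g: len(g) > min_rows)
counts = kept["Genre"].value_counts().to_dict()
print(counts)
```

Here "comedy" has exactly 3 rows, so it is removed together with "western"; only "drama" survives.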

In order to balance the data, we will randomly trim data from the drama and comedy genres.

In [8]:
train_data = train_data.drop(((train_data[train_data[column_target_value] == 'drama' ]).sample(frac=.8,random_state=100).index))
train_data = train_data.drop(((train_data[train_data[column_target_value] == 'comedy' ]).sample(frac=.75,random_state=100).index))

In [29]:
train_data[column_target_value].value_counts()

unknown     5105
drama       1150
horror      1103
comedy      1076
action      1051
thriller     936
Name: Genre, dtype: int64
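
The downsampling pattern used above (sample a fraction of one class, then drop those rows by index) can be sketched on hypothetical toy data. The numbers are chosen so the arithmetic is easy to follow: dropping 80% of 10 rows leaves 2, just as dropping 80% of the 5750 drama rows leaves 1150.

```python
import pandas as pd

# Hypothetical toy data: 10 drama rows and 2 horror rows.
toy = pd.DataFrame({"Genre": ["drama"] * 10 + ["horror"] * 2})

# Pick 80% of the drama rows at random (reproducibly) and drop them.
drama_rows = toy[toy["Genre"] == "drama"]
to_drop = drama_rows.sample(frac=0.8, random_state=100).index
balanced = toy.drop(to_drop)

n_drama = int((balanced["Genre"] == "drama").sum())
n_horror = int((balanced["Genre"] == "horror").sum())
print(n_drama, n_horror)
```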

In [None]:
plt.hist(train_data[column_target_value], color = 'blue', edgecolor = 'black')
plt.title('Histogram of movies by genre')
plt.xlabel('Genre')
plt.ylabel('Movies')

In [9]:
classifier_values = train_data[column_target_value].unique()
print(classifier_values)

['unknown' 'comedy' 'drama' 'horror' 'thriller' 'action']


As a next step, we will generate numerical labels for the genres

In [10]:
# Map each genre name to an integer id (0 .. num_classes - 1).
dic = {class_value: i for i, class_value in enumerate(classifier_values)}
labels = train_data[column_target_value].map(dic)
num_classes = len(dic)
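
Since the model will predict integer class ids, it is handy to keep the inverse mapping around as well. The sketch below uses the genre order printed above; `id_to_label` and `decode` are hypothetical helper names introduced here for illustration.

```python
# Genre names in the order printed by the notebook.
classifier_values = ["unknown", "comedy", "drama", "horror", "thriller", "action"]

# Forward mapping (genre -> id) and its inverse (id -> genre).
label_to_id = {name: i for i, name in enumerate(classifier_values)}
id_to_label = {i: name for name, i in label_to_id.items()}

def decode(class_id):
    """Return the genre name for a predicted integer class id."""
    return id_to_label[class_id]

print(decode(3))
```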

We also split the data between training and validation

In [11]:
val_data_pct = 0.2
val_data = train_data.sample(frac=val_data_pct, random_state=200)
train_data = train_data.drop(val_data.index)
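
The split above samples rows uniformly, so the genre proportions in the validation set can differ slightly from those in the training set. As an alternative (not what this notebook does), a stratified split samples the same fraction within each genre. The sketch below runs on hypothetical toy data and assumes pandas 1.1 or newer, where `DataFrameGroupBy.sample` is available.

```python
import pandas as pd

# Hypothetical toy data: 10 "drama" rows and 5 "horror" rows.
toy = pd.DataFrame({
    "Genre": ["drama"] * 10 + ["horror"] * 5,
    "Plot": ["plot {}".format(i) for i in range(15)],
})

# Sample 20% of the rows within each genre, then use the rest for training.
val = toy.groupby("Genre").sample(frac=0.2, random_state=200)
train = toy.drop(val.index)

val_counts = val["Genre"].value_counts().to_dict()
train_counts = train["Genre"].value_counts().to_dict()
print(val_counts, train_counts)
```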

Next, we generate numerical representations of the sentences to classify: we build a vocabulary index based on word frequency and then transform each text into a sequence of integer word indices.

In [12]:
num_words = 10000
texts = train_data[column_text_value]
tokenizer = Tokenizer(num_words=num_words, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n\'', lower=True)
tokenizer.fit_on_texts(texts)
sequences_train = tokenizer.texts_to_sequences(texts)
sequences_valid = tokenizer.texts_to_sequences(val_data[column_text_value])

In [None]:
print(sequences_train[0])
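
To show what the frequency-based vocabulary index does, here is a simplified pure-Python sketch. It is not the Keras implementation: it only lowercases and splits on whitespace (no punctuation filtering), and `build_word_index` and `to_sequence` are hypothetical helper names. Like the Keras `Tokenizer`, it assigns index 1 to the most frequent word and keeps only words whose index is below `num_words`.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    """Replace each known word by its index, dropping rare or unseen words."""
    return [word_index[w] for w in text.lower().split()
            if w in word_index and word_index[w] < num_words]

texts = ["the cat sat", "the cat ran", "the dog"]
word_index = build_word_index(texts)
sequence = to_sequence("the dog sat", word_index, num_words=4)
print(word_index, sequence)
```

With `num_words=4`, only the three most frequent words are kept, so "dog" (index 5) is dropped from the sequence.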

Now we pad the sequences so that all of them have the same length. For the labels, we create categorical (one-hot) vectors.

In [13]:
x_train = pad_sequences(sequences_train)
sequence_length = x_train.shape[1]
x_val = pad_sequences(sequences_valid, maxlen=sequence_length)
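# Pass num_classes explicitly so both label matrices have the same width,
# even if a class is missing from one of the splits.
y_train = to_categorical(np.asarray(labels[train_data.index]), num_classes=num_classes)
y_val = to_categorical(np.asarray(labels[val_data.index]), num_classes=num_classes)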

In [None]:
print(x_val[2])

In [None]:
print(y_train[1])
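
The padding and one-hot steps can be sketched in plain NumPy. This is an illustration, not the Keras code: `pad_pre` and `one_hot` are hypothetical helpers that mimic the default behaviour of `pad_sequences` (zeros added at the front, long sequences truncated from the front) and of `to_categorical`.

```python
import numpy as np

def pad_pre(sequences, maxlen=None):
    """Left-pad with zeros; keep only the last maxlen tokens of long sequences."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]
        if trimmed:
            out[row, -len(trimmed):] = trimmed
    return out

def one_hot(labels, num_classes):
    """Turn integer class ids into one-hot rows."""
    return np.eye(num_classes, dtype="float32")[labels]

padded = pad_pre([[1, 2, 3], [4]])
truncated = pad_pre([[1, 2, 3, 4]], maxlen=2)
encoded = one_hot([0, 2], num_classes=3)
print(padded, truncated, encoded, sep="\n")
```

This also explains why the validation set is padded with `maxlen=sequence_length`: both matrices must have the same number of columns as the training data.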

Now we generate the word embeddings. We use transfer learning and reuse pretrained word vectors, in this case the 100-dimensional GloVe vectors (https://nlp.stanford.edu/projects/glove/). We had to transform the GloVe representation to the word2vec format using the glove2word2vec utility.

In [None]:
embedding_dim = 100
w2v_model_path = 'model/word2vec100d.txt'
w2v_model = word2vec.KeyedVectors.load_word2vec_format("gs://" + gcp_bucket + "/" + w2v_model_path)
# load_word2vec_format already returns KeyedVectors; the .wv attribute was removed in gensim 4.
word_vectors = w2v_model
word_index = tokenizer.word_index
vocabulary_size = min(len(tokenizer.word_index) + 1, num_words)
embedding_matrix = np.zeros((vocabulary_size, embedding_dim))
for word, i in word_index.items():
    if i >= num_words:
        continue
    try:
        embedding_vector = word_vectors[word]
        embedding_matrix[i] = embedding_vector
    except KeyError:
        embedding_matrix[i] = np.random.normal(0, np.sqrt(0.25), embedding_dim)
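
For reference, the GloVe-to-word2vec conversion mentioned above is small: the word2vec text format is the GloVe text format preceded by a header line holding the vocabulary size and the vector dimension. The sketch below shows that idea on in-memory lines; `glove_to_word2vec_lines` is a hypothetical helper, not the gensim utility itself.

```python
def glove_to_word2vec_lines(glove_lines):
    """Prepend the '<vocab_size> <dim>' header expected by the word2vec text format."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    # Each row is a word followed by its vector components.
    dim = len(rows[0].split(" ")) - 1
    return ["{} {}".format(len(rows), dim)] + rows

# Hypothetical three-dimensional vectors for illustration.
glove = ["the 0.1 0.2 0.3\n", "cat 0.4 0.5 0.6\n"]
converted = glove_to_word2vec_lines(glove)
print(converted)
```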

In [38]:
w2v_model.most_similar('summer')

[('winter', 0.8896949291229248),
 ('spring', 0.8580389022827148),
 ('autumn', 0.7742397785186768),
 ('weekend', 0.7385302782058716),
 ('year', 0.7348464131355286),
 ('days', 0.725011944770813),
 ('beginning', 0.7218300104141235),
 ('during', 0.7205086946487427),
 ('season', 0.7031365633010864),
 ('day', 0.7015056610107422)]
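
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal NumPy sketch of the same ranking, on hypothetical three-dimensional vectors made up for illustration (`most_similar` below is a local helper, not the gensim method):

```python
import numpy as np

def most_similar(word, vectors, topn=2):
    """Rank the other words by cosine similarity to the query word."""
    query = vectors[word] / np.linalg.norm(vectors[word])
    scores = {
        other: float(np.dot(query, vec / np.linalg.norm(vec)))
        for other, vec in vectors.items()
        if other != word
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# Hypothetical toy vectors; real GloVe vectors have 100 dimensions here.
toy_vectors = {
    "summer": np.array([1.0, 0.9, 0.0]),
    "winter": np.array([0.9, 1.0, 0.1]),
    "car": np.array([0.0, 0.1, 1.0]),
}
ranking = most_similar("summer", toy_vectors)
print(ranking)
```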

## 02. Model generation and training - RNN version

In [50]:
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.layers import Input, Dense, Embedding, Conv2D, MaxPooling2D, Dropout, concatenate, Reshape, \
    Flatten, LSTM
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers
from tensorflow.keras.losses import BinaryCrossentropy
import datetime
import os

In [40]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Bidirectional

In [None]:
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_dim, weights=[embedding_matrix], trainable=True))
model.add(Bidirectional(LSTM(16)))
model.add(Dense(16, activation='relu'))
model.add(Dense(units=num_classes, activation='softmax', kernel_regularizer=regularizers.l2(0.01)))

In [52]:
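# The output layer applies softmax over one-hot labels, so we use categorical
# cross-entropy on probabilities (not BinaryCrossentropy with from_logits=True).
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(1e-3),
              metrics=['accuracy'])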
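
For a softmax output with one-hot labels, the loss usually paired with it is categorical cross-entropy: the negative log of the probability assigned to the true class, averaged over the batch. Because softmax already yields probabilities, the loss should not treat them as logits. A minimal NumPy sketch with made-up probabilities (`categorical_cross_entropy` is a hypothetical helper, not the Keras function):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean over the batch of -sum(y_true * log(y_prob))."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=1)))

# Two hypothetical samples with three classes.
y_true = np.array([[0, 0, 1], [1, 0, 0]], dtype=float)
y_prob = np.array([[0.1, 0.2, 0.7], [0.5, 0.3, 0.2]])
loss = categorical_cross_entropy(y_true, y_prob)
print(loss)
```

The result is the average of -ln(0.7) and -ln(0.5), the probabilities assigned to the two true classes.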

In [45]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 100)         1000000   
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               84480     
_________________________________________________________________
dense (Dense)                (None, 64)                8256      
_________________________________________________________________
dense_1 (Dense)              (None, 6)                 390       
=================================================================
Total params: 1,093,126
Trainable params: 1,093,126
Non-trainable params: 0
_________________________________________________________________
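
The parameter counts in the summary can be checked by hand. The sketch below assumes the sizes implied by the printed output shapes: a 10000 x 100 embedding, 64 LSTM units per direction (128 = 2 x 64), a 64-unit hidden layer and 6 output classes. The helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    """One vector of size dim per vocabulary entry."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

embedding = embedding_params(10000, 100)
bilstm = 2 * lstm_params(100, 64)  # forward and backward LSTM
hidden = dense_params(128, 64)
output = dense_params(64, 6)
total = embedding + bilstm + hidden + output
print(embedding, bilstm, hidden, output, total)
```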


In [None]:
epochs = 20
batch_size = 500
now = datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
root_logdir = "model/tf_logs"
# makedirs also creates the parent "model" directory if it does not exist yet.
os.makedirs(root_logdir, exist_ok=True)
log_dir = "{}/run-{}/".format(root_logdir, now)
callback_tensorboard = TensorBoard(log_dir=log_dir, histogram_freq=1)
callback_earlystopping = EarlyStopping(monitor='val_loss')

model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_data=(x_val, y_val),
                  callbacks=[callback_earlystopping, callback_tensorboard])
# Note: this evaluates on the training data; the loss is not a percentage.
loss, acc = model.evaluate(x_train, y_train, verbose=2)
print("Training accuracy = {:5.2f}%".format(100 * acc))
print("Training loss = {:.4f}".format(loss))

Train on 8337 samples, validate on 2084 samples
Epoch 1/20
