## Assignment 3:

The goal is to compare several NLP models on the Reuters newswire data, a multi-class classification problem.
 
First, do EDA to understand how many topics (classes) there are in the data and how many documents
fall into each class. You may want to reduce the number of topics to the 10 or so most frequent ones,
based on their frequencies.  

In all the experiments, we hold some parameters constant (truncation of the documents to 128 tokens,
a batch size of 100, 10 epochs, the same optimizer, and the same cross-entropy loss function) so that
the comparisons are fair.
 
* EXPERIMENT 1: Fully connected dense neural network
* EXPERIMENT 2: Simple RNN
* EXPERIMENT 3: LSTM RNN
* **EXPERIMENT 4: 1D CNN**

`Result`:  Create a table with the accuracy and loss for train/test/validation & process time for all the 4 models.

`Note`: You can tweak several parameters such as dropout, embedding etc. to get more insights.

In [1]:
import tensorflow as tf
from tensorflow import keras
keras.__version__

'2.2.4-tf'

## The Reuters dataset


We will be working with the `Reuters dataset`, a set of short newswires and their topics, published by Reuters in 1986. It's a very simple, 
widely used toy dataset for text classification. There are 46 different topics; some topics are more represented than others, but each 
topic has at least 10 examples in the training set.

Like IMDB and MNIST, the Reuters dataset comes packaged as part of Keras. Let's take a look right away:

In [2]:
# https://keras.io/datasets/#reuters-newswire-topics-classification
from tensorflow.keras.datasets import reuters

(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=128)

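The argument `num_words=128` restricts the vocabulary to the 128 most frequently occurring words found in the data; rarer words are replaced by an out-of-vocabulary marker. Note that this limits the vocabulary, not the length of each document.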
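To make the distinction between capping the vocabulary and truncating a document concrete, here is a minimal pure-Python sketch. It only approximates what the Keras loader does and assumes the documented default `oov_char=2`; the helper names `cap_vocabulary` and `truncate` and the toy sequence are made up for illustration.

```python
# Approximation of how `num_words` is applied by the Keras loader.
# Assumption: the documented default out-of-vocabulary index, oov_char=2.
OOV_CHAR = 2

def cap_vocabulary(sequence, num_words, oov_char=OOV_CHAR):
    """Replace every word index >= num_words with the OOV marker (length unchanged)."""
    return [w if w < num_words else oov_char for w in sequence]

def truncate(sequence, max_tokens):
    """Keep at most max_tokens tokens (vocabulary unchanged)."""
    return sequence[:max_tokens]

toy = [1, 45, 300, 12, 9000, 127, 128]   # hypothetical word indices
capped = cap_vocabulary(toy, num_words=128)

print(capped)            # [1, 45, 2, 12, 2, 127, 2]
print(len(capped))       # 7
print(truncate(toy, 3))  # [1, 45, 300]
```

Capping keeps all seven tokens but hides the rare ones, while truncation shortens the document and leaves the remaining indices untouched.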

We have 8,982 training examples and 2,246 test examples:

In [3]:
len(train_data), len(train_labels), len(test_data), len(test_labels)

(8982, 8982, 2246, 2246)

## Reducing the number of topics

In [4]:
from collections import Counter
topics_train_tpl, _ = zip(*Counter(list(train_labels)).most_common(9))
topics_train_tpl

(3, 4, 19, 16, 1, 11, 20, 13, 8)

In [5]:
topics_test_tpl, _ = zip(*Counter(list(test_labels)).most_common(9))
topics_test_tpl

(3, 4, 19, 1, 16, 11, 20, 8, 13)

In [6]:
train_data_sm, train_labels_sm = zip(*((x,y) for x,y in zip(train_data,train_labels) if y in topics_train_tpl))

In [7]:
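import numpy as np
# The documents have different lengths, so store them as an object array;
# recent NumPy versions refuse to build ragged arrays without dtype=object.
train_data_sm, train_labels_sm = np.array(train_data_sm, dtype=object), np.array(train_labels_sm)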

In [8]:
len(train_data_sm), len(train_labels_sm)  # number of training examples in the 9 most frequent topics

(7503, 7503)

In [9]:
test_data_sm, test_labels_sm = zip(*((x,y) for x,y in zip(test_data,test_labels) if y in topics_test_tpl))

In [10]:
test_data_sm, test_labels_sm = np.array(test_data_sm, dtype=object), np.array(test_labels_sm)  # ragged, hence dtype=object

In [11]:
len(test_data_sm), len(test_labels_sm) # number of test examples in the 9 most frequent topics

(1852, 1852)

In [12]:
Counter(train_labels_sm) # another sanity check on the new, smaller set of training labels

Counter({3: 3159,
         4: 1949,
         16: 444,
         19: 549,
         8: 139,
         11: 390,
         1: 432,
         13: 172,
         20: 269})

In [13]:
Counter(test_labels_sm) # another sanity check on the new, smaller set of test labels

Counter({3: 813,
         1: 105,
         4: 474,
         11: 83,
         19: 133,
         8: 38,
         20: 70,
         16: 99,
         13: 37})
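The cells above pick the most frequent topics separately for the training and test sets; here both happen to contain the same nine topics, only in a different order. A slightly safer variant, sketched below with made-up toy data and hypothetical helper names (`top_k_labels`, `keep_classes`), selects the topics from the training labels only and applies that same set to both splits.

```python
from collections import Counter

def top_k_labels(labels, k):
    """Return the set of the k most frequent labels."""
    return {label for label, _ in Counter(labels).most_common(k)}

def keep_classes(data, labels, kept):
    """Keep only the (document, label) pairs whose label is in `kept`."""
    pairs = [(x, y) for x, y in zip(data, labels) if y in kept]
    return [x for x, _ in pairs], [y for _, y in pairs]

# Toy stand-ins for the Reuters splits (not real data)
toy_train_x = [[1, 5], [1, 7, 8], [1, 2], [1, 9], [1, 3, 3], [1, 4]]
toy_train_y = [3, 3, 4, 4, 19, 3]
toy_test_x = [[1, 6], [1, 2, 2], [1, 8]]
toy_test_y = [19, 3, 4]

kept = top_k_labels(toy_train_y, k=2)            # chosen from training labels only
train_x, train_y = keep_classes(toy_train_x, toy_train_y, kept)
test_x, test_y = keep_classes(toy_test_x, toy_test_y, kept)

print(sorted(kept))  # [3, 4]
print(train_y)       # [3, 3, 4, 4, 3]
print(test_y)        # [3, 4]
```

Choosing the classes from the training split alone guarantees that the test set never contains a topic the model was not trained on.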

## EXPERIMENT 4: 1D CNN

In [14]:
import numpy as np
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

## Preparing the data

We need to vectorize the sequences into fixed-shape numeric tensors that the neural network can work with.

In [15]:
train_data_sm.shape, test_data_sm.shape

((7503,), (1852,))

In [16]:
# import numpy as np

# def vectorize_sequences(sequences, dimension=10000):
#     results = np.zeros((len(sequences), dimension))
#     for i, sequence in enumerate(sequences):
#         results[i, sequence] = 1.
#     return results

# # Our vectorized training data
# train_data_smv = vectorize_sequences(train_data_sm)
# # Our vectorized test data
# test_data_smv = vectorize_sequences(test_data_sm)

In [17]:
# Alternate processing for RNN: pad/truncate each sequence to a fixed length
# https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/6.1-using-word-embeddings.ipynb

from tensorflow.keras import preprocessing

# NOTE: maxlen=30 keeps only 30 tokens per document; the assignment text specifies 128.
train_data_sm_rnn = preprocessing.sequence.pad_sequences(train_data_sm, maxlen=30)
test_data_sm_rnn = preprocessing.sequence.pad_sequences(test_data_sm, maxlen=30)
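As a rough illustration of what `pad_sequences` does with its default settings (padding and truncating at the front, as documented for Keras), the pure-Python sketch below mimics that behaviour on toy sequences. The helper `pad_pre` is a hypothetical stand-in, not the library function.

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic the default 'pre' behaviour: keep the last `maxlen` tokens,
    then left-pad with `value` up to `maxlen`."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

long_doc = [1, 2, 3, 4, 5]
short_doc = [7, 8]

print(pad_pre(long_doc, 3))   # [3, 4, 5]
print(pad_pre(short_doc, 4))  # [0, 0, 7, 8]
```

With `maxlen=30`, a long newswire therefore contributes only its final 30 tokens, and short ones are filled with zeros at the start.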

In [18]:
from tensorflow.keras.utils import to_categorical

one_hot_train_labels_sm = to_categorical(train_labels_sm)
one_hot_test_labels_sm = to_categorical(test_labels_sm)

In [19]:
train_labels_sm.shape, one_hot_train_labels_sm.shape

((7503,), (7503, 21))

In [20]:
test_data_sm_rnn.shape, train_data_sm_rnn.shape, one_hot_train_labels_sm.shape, one_hot_test_labels_sm.shape

((1852, 30), (7503, 30), (7503, 21), (1852, 21))
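The one-hot labels have 21 columns even though only 9 topics remain, because the width is determined by the largest label value (20) plus one. One possible way to get a compact 9-column encoding is to remap the kept topic ids to contiguous indices first. The sketch below is pure Python and uses hypothetical helpers (`build_label_index`, `one_hot`); it is not what the notebook does.

```python
def build_label_index(labels):
    """Map each distinct label to a contiguous index 0..k-1 (sorted order)."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}

def one_hot(indices, num_classes):
    """Plain-Python one-hot encoding."""
    return [[1 if j == i else 0 for j in range(num_classes)] for i in indices]

kept_topics = [3, 4, 19, 16, 1, 11, 20, 13, 8]   # the 9 topics kept above
label_index = build_label_index(kept_topics)

sample_labels = [3, 20, 1]
remapped = [label_index[y] for y in sample_labels]
encoded = one_hot(remapped, len(label_index))

print(max(kept_topics) + 1)  # 21
print(len(label_index))      # 9
print(remapped)              # [1, 8, 0]
print(len(encoded[0]))       # 9
```

If the labels were remapped this way, the final `Dense` layer would need 9 units instead of 21, and the same mapping would have to be applied to the test labels.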

## Building our network

In [21]:
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Embedding, LSTM
# from tensorflow.keras.layers import Dense

# model = Sequential()
# model.add(Embedding(10000, 64))
# model.add(LSTM(32))
# model.add(Dense(21, activation='sigmoid'))

In [22]:
# https://github.com/jsrpy/NLP_Sentiment_Analysis/blob/master/Reuters_news_topic_classify_A.ipynb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Embedding, Dropout
from tensorflow.keras.layers import Conv1D, MaxPooling1D
from tensorflow.keras.optimizers import RMSprop

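model = Sequential()
model.add(Embedding(10000, 64, input_length=30))  # 64-dimensional vector for each of the 30 tokens
model.add(Conv1D(32, 3, activation='relu'))       # 32 filters, kernel size 3
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))                           # drop half of the activations during training
model.add(Dense(21, activation='softmax'))        # 21 = width of the one-hot labels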

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 30, 64)            640000    
_________________________________________________________________
conv1d (Conv1D)              (None, 28, 32)            6176      
_________________________________________________________________
flatten (Flatten)            (None, 896)               0         
_________________________________________________________________
dense (Dense)                (None, 512)               459264    
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 21)                10773     
Total params: 1,116,213
Trainable params: 1,116,213
Non-trainable params: 0
______________________________________________
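The shapes and parameter counts in the summary can be reproduced by hand. The following sketch recomputes them with the standard formulas for embedding, 1D convolution (no padding, stride 1) and dense layers; the helper function names are illustrative only.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def conv1d_output_length(input_length, kernel_size):
    # no padding, stride 1
    return input_length - kernel_size + 1

def conv1d_params(kernel_size, in_channels, filters):
    return kernel_size * in_channels * filters + filters   # weights + biases

def dense_params(inputs, units):
    return inputs * units + units                           # weights + biases

seq_len, vocab_size, emb_dim = 30, 10000, 64
filters, kernel_size = 32, 3

conv_len = conv1d_output_length(seq_len, kernel_size)       # 28
flat = conv_len * filters                                   # 896

counts = [
    embedding_params(vocab_size, emb_dim),                  # 640000
    conv1d_params(kernel_size, emb_dim, filters),           # 6176
    dense_params(flat, 512),                                # 459264
    dense_params(512, 21),                                  # 10773
]
print(conv_len, flat)   # 28 896
print(counts)           # [640000, 6176, 459264, 10773]
print(sum(counts))      # 1116213
```

More than half of the parameters sit in the embedding table, which is sized for 10,000 words even though the data was loaded with `num_words=128`.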

In [23]:
# keras.backend.clear_session()

# from tensorflow.keras.models import Sequential
# from tensorflow.keras import layers
# from tensorflow.keras.optimizers import RMSprop

# model = Sequential()
# model.add(layers.Embedding(10000, 128, input_length=100))
# model.add(layers.Conv1D(32, 7, activation='relu'))
# model.add(layers.MaxPooling1D(5))
# model.add(layers.Conv1D(32, 7, activation='relu'))
# model.add(layers.GlobalMaxPooling1D())
# model.add(layers.Dense(21))

# model.summary()




In [24]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
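For intuition about the `categorical_crossentropy` loss used here, the sketch below computes it by hand for a single example: the loss is the negative log of the probability assigned to the true class. The `eps` guard and the toy probabilities are illustrative assumptions, not values taken from this model.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs, eps=1e-7):
    """Cross-entropy for one example; eps is an illustrative guard against log(0)."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(one_hot_target, predicted_probs))

target = [0, 1, 0]               # true class is index 1
confident = [0.05, 0.90, 0.05]   # toy softmax outputs
unsure = [0.40, 0.30, 0.30]

loss_confident = categorical_crossentropy(target, confident)
loss_unsure = categorical_crossentropy(target, unsure)

print(round(loss_confident, 4))  # 0.1054
print(round(loss_unsure, 4))     # 1.204
```

The loss shrinks towards zero as the predicted probability of the correct class approaches one.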

In [25]:
# model.compile(optimizer=RMSprop(lr=1e-4),
#               loss='categorical_crossentropy',
#               metrics=['accuracy'])

## Validating our approach

In the commented code below, we set apart 1,000 samples of the training data to use as a validation set. Instead, we set `validation_split=0.15` when training the model. As an alternative, you can uncomment that code and pass `validation_data=(val_data_sm_rnn, one_hot_val_labels_sm)` to `fit` in place of `validation_split`.

In [26]:
# val_data_sm_rnn = train_data_sm_rnn[:1000]
# train_data_sm_rnn = train_data_sm_rnn[1000:]

# one_hot_val_labels_sm = one_hot_train_labels_sm[:1000]
# one_hot_train_labels_sm = one_hot_train_labels_sm[1000:]

In [27]:
# test_data_sm_rnn.shape, train_data_sm_rnn.shape, one_hot_train_labels_sm.shape, one_hot_test_labels_sm.shape

## Training the model

To get the total training time, we use a custom callback that records the time at which training begins and ends.

In [28]:
# Define callback to get total training time
import datetime

class TrainRuntimeCallback(keras.callbacks.Callback):

  def on_train_begin(self, logs=None):  # logs=None avoids a mutable default argument
    self.start = datetime.datetime.now()

  def on_train_end(self, logs=None):
    self.process_time = (datetime.datetime.now() - self.start).total_seconds()
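If a timing is needed outside of Keras, for example around preprocessing, a framework-independent timer can serve the same purpose. The sketch below is one possible approach using only the standard library; `stopwatch` and the `timings` dictionary are hypothetical names. It uses `time.perf_counter`, which is monotonic and so unaffected by system clock adjustments.

```python
import time
from contextlib import contextmanager

@contextmanager
def stopwatch(results, key):
    """Store the elapsed wall-clock seconds of the `with` body in results[key]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[key] = time.perf_counter() - start

timings = {}
with stopwatch(timings, "toy work"):
    total = sum(i * i for i in range(100_000))   # stand-in for real work

print(f"{timings['toy work']:.4f} seconds")
```

The same pattern could wrap a call to `model.fit` or `model.evaluate`, although the callback approach has the advantage of excluding any setup done before training actually starts.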

Now let's train our network for 10 epochs:

In [29]:
train_rt = TrainRuntimeCallback()
history = model.fit(train_data_sm_rnn, one_hot_train_labels_sm,
                    callbacks = [train_rt],
                    epochs=10,
                    batch_size=100,
                    validation_split=0.15)

Train on 6377 samples, validate on 1126 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
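The sample counts in the log follow from how `validation_split` works: according to the Keras documentation, the validation data is taken from the last samples of the arrays passed to `fit`, before any shuffling. The arithmetic can be sketched as follows; `split_sizes` is a hypothetical helper that assumes the split index is rounded down.

```python
def split_sizes(num_samples, validation_split):
    """Return (train_size, val_size), assuming the split index is rounded down."""
    split_at = int(num_samples * (1.0 - validation_split))
    return split_at, num_samples - split_at

train_size, val_size = split_sizes(7503, 0.15)
print(train_size, val_size)   # 6377 1126

# The validation samples are the tail of the data, e.g. on a toy list:
samples = list(range(20))
split_at, _ = split_sizes(len(samples), 0.15)
print(samples[split_at:])     # [17, 18, 19]
```

Because the tail is used as-is, it can be worth shuffling the training data beforehand if its order is not random.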


In [30]:
# train_rt = TrainRuntimeCallback()
# history = model.fit(train_data_sm_rnn,
#                     one_hot_train_labels_sm,
#                     callbacks = [train_rt],
#                     epochs=10,
#                     batch_size=100,
# #                   validation_data=(val_data_sm_rnn, one_hot_val_labels_sm))
#                     validation_split = 0.15)   # comment out if setting validation_data value.

In [31]:
# Get the training time
train_time = train_rt.process_time
train_time # in seconds

17.91159

## Testing the model

Test the model and get its runtime using callbacks.

In [32]:
# Define callback to get total test time
import datetime

class TestRuntimeCallback(keras.callbacks.Callback):

  def on_test_begin(self, logs=None):  # logs=None avoids a mutable default argument
    self.start = datetime.datetime.now()

  def on_test_end(self, logs=None):
    self.process_time = (datetime.datetime.now() - self.start).total_seconds()

In [33]:
test_rt = TestRuntimeCallback()
# test_loss, test_acc = model.evaluate(test_data_sm_rnn, one_hot_test_labels_sm, callbacks=[test_rt])

test_loss, test_accuracy = model.evaluate(test_data_sm_rnn, one_hot_test_labels_sm, callbacks=[test_rt])



In [34]:
# Get the test time
test_time = test_rt.process_time
test_time # in seconds

0.23003

In [35]:
history_dict = history.history
history_dict['train_accuracy'] = history_dict.pop('accuracy') # rename the key to 'train_accuracy'
history_dict.keys()

dict_keys(['loss', 'val_loss', 'val_accuracy', 'train_accuracy'])

In [36]:
import pandas as pd
history_df=pd.DataFrame(history_dict)
history_df.tail()

| | loss | val_loss | val_accuracy | train_accuracy |
|---|---|---|---|---|
| 5 | 0.981142 | 1.067737 | 0.639432 | 0.671632 |
| 6 | 0.916715 | 1.066873 | 0.647425 | 0.692175 |
| 7 | 0.853888 | 1.066898 | 0.645648 | 0.720558 |
| 8 | 0.791087 | 1.106907 | 0.64032 | 0.741258 |
| 9 | 0.736178 | 1.096277 | 0.64476 | 0.756312 |


## Saving the performance to a DataFrame

Let us now create a DataFrame with the summary statistics, which we append to the DataFrame saved by the previous experiment. Note that we only need the last row of `history_df`.

In [37]:
results_df = history_df.iloc[-1:].copy()
results_df.insert(0,'model','1D CNN') # we want the model name to appear first
results_df['test_accuracy'] = test_accuracy
results_df['training time (sec)'] = train_time      # appended as a new column on the right
results_df['testing time (sec)'] = test_time      # we are okay with testing time appearing last
results_df

| | model | loss | val_loss | val_accuracy | train_accuracy | test_accuracy | training time (sec) | testing time (sec) |
|---|---|---|---|---|---|---|---|---|
| 9 | 1D CNN | 0.736178 | 1.096277 | 0.64476 | 0.756312 | 0.644168 | 17.91159 | 0.23003 |


In [38]:
prev_results_df = pd.read_pickle('results3.pkl')
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
results_df = pd.concat([prev_results_df, results_df], ignore_index=True)
results_df

FileNotFoundError: [Errno 2] No such file or directory: 'results3.pkl'
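The error above occurs because the pickle from the previous experiment is not present in the working directory. One way to make this cell robust, sketched here under the assumption that starting a fresh table is acceptable when no earlier results exist, is to check for the file first. The helper `combine_with_previous` and the placeholder row are hypothetical and only serve the demonstration.

```python
from pathlib import Path
import tempfile

import pandas as pd

def combine_with_previous(new_rows, previous_path):
    """Append new_rows to earlier results if the pickle exists, else start fresh."""
    previous_path = Path(previous_path)
    if previous_path.exists():
        previous = pd.read_pickle(previous_path)
        return pd.concat([previous, new_rows], ignore_index=True)
    return new_rows.reset_index(drop=True)

new_rows = pd.DataFrame({"model": ["1D CNN"], "test_accuracy": [0.644168]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "results3.pkl"
    fresh = combine_with_previous(new_rows, path)      # file missing: one row

    # Placeholder row standing in for earlier results (made-up value)
    pd.DataFrame({"model": ["placeholder"], "test_accuracy": [0.5]}).to_pickle(path)
    combined = combine_with_previous(new_rows, path)   # file present: two rows

print(len(fresh), len(combined))   # 1 2
print(list(combined["model"]))     # ['placeholder', '1D CNN']
```

With such a guard, the notebook for this experiment can be run on its own, and the comparison table simply grows when earlier results are available.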

In [None]:
results_df.to_pickle("results4.pkl") # save the DataFrame for use in later parts

## Plotting the performance 

In [None]:
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

In [None]:
plt.clf()   # clear figure

acc = history.history['train_accuracy']
val_acc = history.history['val_accuracy']

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

Our approach reaches a test accuracy of about 64%. With a balanced binary classification problem, the accuracy reached by a purely random classifier 
would be 50%. In our case, randomly shuffling the test labels gives roughly 27% in expectation, and always predicting the most frequent topic (class 3, 813 of the 1,852 test samples) would give about 44%, so the model beats both simple baselines:

In [None]:
import copy

test_labels_sm_copy = copy.copy(test_labels_sm)
np.random.shuffle(test_labels_sm_copy)
float(np.sum(np.array(test_labels_sm) == np.array(test_labels_sm_copy))) / len(test_labels_sm)
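The shuffled-label accuracy varies from run to run, but its expected value can be computed directly from the class counts: it is the sum of the squared class frequencies. The sketch below uses the test-set counts shown earlier in this notebook and also computes the majority-class baseline for comparison.

```python
# Test-set class counts, copied from the Counter output above
test_counts = {3: 813, 1: 105, 4: 474, 11: 83, 19: 133,
               8: 38, 20: 70, 16: 99, 13: 37}

total = sum(test_counts.values())

# Expected accuracy when the true labels are randomly permuted
shuffle_baseline = sum((c / total) ** 2 for c in test_counts.values())

# Accuracy of always predicting the most frequent class
majority_baseline = max(test_counts.values()) / total

print(total)                        # 1852
print(round(shuffle_baseline, 3))   # 0.274
print(round(majority_baseline, 3))  # 0.439
```

The majority-class figure is the stricter of the two baselines, so it is the more informative number to compare the model's test accuracy against.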

## Saving the DataFrame to disk

Save the DataFrame.

In [None]:
results_df.to_pickle("results4.pkl") 