
**Challenge 8 (NLP with Keras-TF2)**

 

Implement the following on the built-in Reuters dataset in Keras:

 

1. Vectorization and Normalization

2. One-hot encoding for labels using the built-in to_categorical util

3. Use these layers to build the model - Sequential, Dense, Dropout

4. Achieve a model accuracy of around 82%

In [36]:
# Import the TensorFlow/Keras and scikit-learn pieces used below
from sklearn.model_selection import train_test_split
from tensorflow.keras import backend, layers, models, preprocessing
# Alias tf.keras.utils as np_utils so np_utils.to_categorical does not depend on standalone Keras
from tensorflow.keras import utils as np_utils
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.datasets import reuters
from tensorflow.keras.preprocessing.text import Tokenizer

In [6]:
max_words = 5000

In [25]:
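# Load Reuters, keeping the max_words most frequent words and holding out 10% as the test set
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=max_words, test_split=0.1)

# Hold out 10% of the remaining training data as the validation set
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.1)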

#word_index = reuters.get_word_index(path="reuters_word_index.json")
print(x_train[0])

[1, 53, 996, 26, 14, 924, 26, 39, 19, 2, 18, 14, 19, 3302, 18, 86, 187, 63, 11, 14, 160, 59, 11, 17, 12]


In [27]:
tokenizer = Tokenizer(num_words=max_words)
x_train_tkn = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test_tkn = tokenizer.sequences_to_matrix(x_test, mode='binary')
x_val_tkn = tokenizer.sequences_to_matrix(x_val, mode='binary')
print(x_train_tkn[0])

[0. 1. 1. ... 0. 0. 0.]
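
To make the binary vectorization concrete, here is a minimal plain-Python sketch of what mode='binary' produces: one row per document, with a 1 at every word index that occurs in it. The helper name `multi_hot` and the toy sequences are made up for illustration, and skipping out-of-range indices is a choice of this sketch rather than a statement about the Keras internals.

```python
def multi_hot(sequences, num_words):
    """Return one 0/1 row of length num_words per sequence of word indices."""
    matrix = []
    for seq in sequences:
        row = [0.0] * num_words
        for idx in seq:
            # Indices outside the vocabulary are skipped (assumption of this sketch).
            if 0 <= idx < num_words:
                row[idx] = 1.0
        matrix.append(row)
    return matrix


# Toy input: two "documents" over a hypothetical 8-word vocabulary.
toy_sequences = [[1, 2, 5, 2], [1, 7]]
toy_matrix = multi_hot(toy_sequences, num_words=8)
for row in toy_matrix:
    print(row)
```

Repeated indices only set the same position once, so word counts are discarded. The printed row above starts with `0. 1. 1.` because the first training sequence contains the indices 1 and 2. Since every value is already 0 or 1, no further scaling is needed for the normalization step.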


Convert the integer labels to one-hot vectors, so that each position in a vector corresponds to one category

In [28]:
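# Fix the number of classes up front so every split is encoded with the same width,
# even if one split happens to lack the highest label
categories = int(max(y_train.max(), y_val.max(), y_test.max())) + 1

y_train_cat = np_utils.to_categorical(y_train, num_classes=categories)
y_test_cat = np_utils.to_categorical(y_test, num_classes=categories)
y_val_cat = np_utils.to_categorical(y_val, num_classes=categories)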

print('Total categories = {}'.format(categories))

Total categories = 46
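
As a rough illustration of the one-hot step, the sketch below encodes a few toy labels by hand. The helper `one_hot` and the labels are hypothetical; the point is only that the row width comes from the number of classes, not from the labels that happen to be present.

```python
def one_hot(labels, num_classes):
    """Encode integer class labels as rows with a single 1.0 each."""
    rows = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} is outside 0..{num_classes - 1}")
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


# Toy labels for a hypothetical 4-class problem.
toy_labels = [3, 0, 2]
toy_encoded = one_hot(toy_labels, num_classes=4)
for row in toy_encoded:
    print(row)

# A split that never contains class 3 still gets rows of width 4.
small_split = one_hot([0, 1], num_classes=4)
```

Passing the class count explicitly keeps the train, validation and test label matrices the same width, which the final Dense layer relies on.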


**Pad Sequences**

* Keras expects every input in a batch to have the same length. Variable-length integer sequences are therefore filled with '0' until all records share one length.
* To achieve this, we use pad_sequences(). Note: padding='post' ensures the 0 padding is done at the end of each record.
* The matrices produced by sequences_to_matrix() above already have exactly max_words columns, so the calls below leave their shape unchanged and only cast the values to integers.

In [29]:
x_train_pad = preprocessing.sequence.pad_sequences(x_train_tkn, maxlen=max_words, padding='post')
x_test_pad = preprocessing.sequence.pad_sequences(x_test_tkn, maxlen=max_words, padding='post')
x_val_pad = preprocessing.sequence.pad_sequences(x_val_tkn, maxlen=max_words, padding='post')
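
The effect of post-padding is easiest to see on short variable-length lists. The following is a minimal sketch in plain Python, not the Keras implementation; `pad_post` is a hypothetical helper, and keeping the last `maxlen` items of an over-long sequence is an assumption made for this illustration.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to exactly maxlen items."""
    padded = []
    for seq in sequences:
        # Over-long sequences keep their last maxlen items (assumption of this sketch).
        kept = list(seq)[-maxlen:]
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


toy_padded = pad_post([[1, 2, 3], [4]], maxlen=5)
print(toy_padded)

# A row that already has maxlen items comes back unchanged.
unchanged = pad_post([[7, 8, 9]], maxlen=3)
print(unchanged)
```

The second call shows why padding rows that already have the target length does not alter them.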

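**Model Building:**
* Train the model on the training set and validate it on the validation set after each epoch.
* Dropout Layer: Dropout is a technique where randomly selected neurons are ignored ("dropped out") during training, which helps reduce overfitting. With Dropout(0.4), each of the 256 hidden units is zeroed with probability 0.4 at every training step; the layer is inactive at evaluation time.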
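
For intuition, here is a small sketch of dropout at training time written with the standard library only. It is a simplified illustration, not the Keras layer: the function name, the seed and the toy activations are made up. Kept units are scaled by 1 / (1 - rate) so the expected activation stays the same.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    keep_prob = 1.0 - rate
    out = []
    for a in activations:
        if rng.random() < rate:
            out.append(0.0)            # unit is dropped for this step
        else:
            out.append(a / keep_prob)  # unit is kept and scaled up
    return out


rng = random.Random(0)      # fixed seed so the sketch is repeatable
activations = [1.0] * 10    # toy activations, all equal to 1
dropped = dropout(activations, rate=0.4, rng=rng)
print(dropped)
```

Each call draws a fresh random mask, so a different subset of units is silenced at every training step.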


In [30]:
backend.clear_session()

model = models.Sequential()
model.add(layers.Dense(256, activation='relu', input_shape=(max_words,)))
model.add(layers.Dropout(0.4))
model.add(layers.Dense(categories, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()

# Stop once val_loss has failed to improve for 2 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', min_delta=0, patience=2, verbose=1,
                           mode='auto', baseline=None, restore_best_weights=False)
history = model.fit(x_train_pad, y_train_cat,
                    epochs=10,
                    batch_size=64,
                    validation_data=(x_val_pad, y_val_cat), callbacks=[early_stop])

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 256)               1280256   
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 46)                11822     
Total params: 1,292,078
Trainable params: 1,292,078
Non-trainable params: 0
_________________________________________________________________
Train on 9094 samples, validate on 1011 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 00005: early stopping
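
The stop at epoch 5 is consistent with patience=2: the validation loss would have last improved at epoch 3 and then failed to improve twice. The sketch below reproduces that counting logic in plain Python. It is a simplified model of the patience rule, not the Keras callback, and the loss values are hypothetical because the log above does not show per-epoch metrics.

```python
def stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: best at epoch 3, worse at epochs 4 and 5.
hypothetical_val_losses = [1.30, 1.05, 0.95, 0.97, 0.99, 0.90]
stopped_at = stopping_epoch(hypothetical_val_losses, patience=2)
print(stopped_at)
```

Because restore_best_weights=False in the cell above, the model keeps the weights from the last epoch it ran rather than those from the best epoch.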


Evaluate the model on the test data

In [34]:
results = model.evaluate(x_test_pad, y_test_cat, batch_size=64)
print("test loss = {}\ntest accuracy = {}".format(results[0], results[1]))

test loss = 0.9978815156863613
test accuracy = 0.8076580762863159
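
To show what the two reported numbers measure, the sketch below computes categorical cross-entropy and accuracy by hand for a toy batch. The helper names, the labels and the predicted probabilities are invented for illustration and are unrelated to the Reuters results above.

```python
import math


def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over samples of -sum(true * log(predicted probability))."""
    total = 0.0
    for true_row, prob_row in zip(y_true, y_prob):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(true_row, prob_row))
    return total / len(y_true)


def accuracy(y_true, y_prob):
    """Fraction of samples whose most probable class matches the true class."""
    hits = 0
    for true_row, prob_row in zip(y_true, y_prob):
        if true_row.index(max(true_row)) == prob_row.index(max(prob_row)):
            hits += 1
    return hits / len(y_true)


# Toy one-hot labels and softmax-style outputs for 4 samples and 3 classes.
toy_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
toy_prob = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.5, 0.2], [0.2, 0.6, 0.2]]

toy_loss = categorical_crossentropy(toy_true, toy_prob)
toy_acc = accuracy(toy_true, toy_prob)  # 3 of 4 samples are correct: 0.75
print(toy_loss, toy_acc)
```

The third sample is misclassified and also contributes the largest loss term, since the true class received a probability of only 0.2.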
