
**Challenge 8 (NLP with Keras-TF2)**

 

Implement the following on the built-in Reuters dataset in Keras:

 

1. Vectorization and Normalization

2. One-hot encoding for labels using the built-in to_categorical utility

3. Build the model with Sequential, Dense, and Dropout

4. Achieve a model accuracy of around 82%

In [111]:
import keras
from keras import backend, layers, models, preprocessing
from keras.datasets import reuters
from keras.layers import BatchNormalization, Dense, Flatten
from keras.layers.embeddings import Embedding
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.regularizers import l1, l2, l1_l2
from keras.utils import np_utils
from tensorflow.keras.callbacks import EarlyStopping

from matplotlib import pyplot
from sklearn.preprocessing import MinMaxScaler

In [112]:
max_words = 5000

In [113]:
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=max_words, test_split=0.1)
#word_index = reuters.get_word_index(path="reuters_word_index.json")
print(x_train[0])

[1, 2, 2, 8, 43, 10, 447, 5, 25, 207, 270, 5, 3095, 111, 16, 369, 186, 90, 67, 7, 89, 5, 19, 102, 6, 19, 124, 15, 90, 67, 84, 22, 482, 26, 7, 48, 4, 49, 8, 864, 39, 209, 154, 6, 151, 6, 83, 11, 15, 22, 155, 11, 15, 7, 48, 9, 4579, 1005, 504, 6, 258, 6, 272, 11, 15, 22, 134, 44, 11, 15, 16, 8, 197, 1245, 90, 67, 52, 29, 209, 30, 32, 132, 6, 109, 15, 17, 12]


In [114]:
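# The dataset is already integer-encoded, so the tokenizer is not fitted on text;
# it is only used to turn each index sequence into a fixed-width vector.
tokenizer = Tokenizer(num_words=max_words)
# mode='binary' sets column j to 1 when word index j occurs in the record
x_train_tkn = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test_tkn = tokenizer.sequences_to_matrix(x_test, mode='binary')
print(x_train_tkn[0])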

[0. 1. 1. ... 0. 0. 0.]
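
To make the binary vectorization concrete, here is a minimal plain-Python sketch of what mode='binary' is understood to do. The helper name `multi_hot` is hypothetical and not part of Keras; it only illustrates the idea of marking which word indices occur in a record.

```python
def multi_hot(sequences, num_words):
    """Return one row per sequence; column j is 1.0 if index j occurs in it."""
    matrix = []
    for seq in sequences:
        row = [0.0] * num_words
        for idx in seq:
            # Indices outside the vocabulary are skipped in this sketch.
            if 0 <= idx < num_words:
                row[idx] = 1.0
        matrix.append(row)
    return matrix


rows = multi_hot([[1, 2, 2, 8], [3]], num_words=10)
print(rows[0])  # [0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
```

Repeated indices (the two 2s above) still produce a single 1, which is why the first training record starts with `[0. 1. 1. ...]`.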


Convert the integer labels to one-hot vectors, so that each position in a vector corresponds to one category

In [115]:
y_train_cat = np_utils.to_categorical(y_train)
y_test_cat = np_utils.to_categorical(y_test)

categories = len(y_train_cat[0])

print('Total categories = {}'.format(categories))

Total categories = 46
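
As a rough illustration of the one-hot step, the following plain-Python sketch mimics the behaviour described for to_categorical. The function `one_hot` is a hypothetical stand-in, not the Keras implementation.

```python
def one_hot(labels, num_classes=None):
    """Encode integer labels as rows with a single 1.0 at the label index."""
    if num_classes is None:
        # Infer the width from the largest label seen (an assumption of this sketch).
        num_classes = max(labels) + 1
    encoded = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        encoded.append(row)
    return encoded


encoded = one_hot([3, 0, 2])
print(encoded)
```

Because the width is inferred from the labels, encoding the train and test labels separately could give different widths if one split happened to lack the highest class. Passing the class count explicitly (to_categorical accepts a num_classes argument) avoids that.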


**Pad Sequences**

* Keras expects vectorized inputs that all have the same length, so shorter records are filled with 0 until every vector has the same length.
* To achieve this, we use pad_sequences().
Note: padding='post' adds the 0 padding at the end of each record. The rows produced by sequences_to_matrix() above already have length max_words, so with maxlen=max_words this call leaves their shape unchanged.
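
The post-padding idea can be sketched in a few lines of plain Python. The helper `pad_post` is hypothetical; it assumes a positive maxlen and, for records that are too long, keeps the last maxlen items, which is how the Keras default truncation is understood to behave.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end with `value` up to `maxlen` items."""
    padded = []
    for seq in sequences:
        # Too-long records keep only their last `maxlen` items.
        seq = list(seq)[-maxlen:]
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


padded = pad_post([[1, 2, 3], [4], [5, 6, 7, 8, 9]], maxlen=4)
print(padded)  # [[1, 2, 3, 0], [4, 0, 0, 0], [6, 7, 8, 9]]
```

Padding of this kind matters for the raw index sequences in x_train, whose lengths vary from record to record.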

In [116]:
x_train_pad = preprocessing.sequence.pad_sequences(x_train_tkn, maxlen=max_words,  padding='post')
x_test_pad = preprocessing.sequence.pad_sequences(x_test_tkn, maxlen=max_words,  padding='post')

**Model Building:**
* Dropout layer: dropout randomly ignores (“drops out”) a fraction of neurons on each training step, which helps reduce overfitting. Dropout(0.4) in the model below drops 40% of the hidden units during training.
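
A minimal sketch of dropout on a list of activations may help. The function `dropout` here is hypothetical plain Python, not the Keras layer; it assumes the common "inverted dropout" scheme, in which the kept values are scaled by 1 / (1 - rate) during training and the input passes through unchanged at inference.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - rate)  # keeps the expected sum unchanged
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]


rng = random.Random(0)  # seeded only to make the sketch repeatable
out = dropout([1.0] * 10, rate=0.4, rng=rng)
same = dropout([1.0, 2.0], rate=0.4, training=False)
print(out)
print(same)
```

Each entry of `out` is either 0.0 or 1 / 0.6, while `same` is returned untouched because dropout is only active during training.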


In [118]:
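# Clear any previous Keras session state before building the model
backend.clear_session()

# One hidden layer over the max_words-wide binary vectors,
# followed by a softmax over the output categories
model = models.Sequential()
model.add(layers.Dense(256, activation='relu', input_shape=(max_words,)))
model.add(layers.Dropout(0.4))
model.add(layers.Dense(categories, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()

# Stop once val_loss has not improved for 3 consecutive epochs.
# Note: with epochs=3 below, this callback cannot trigger before training ends.
early_stop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=1, mode='auto', baseline=None, restore_best_weights=False)
# Note: the held-out test split is used here as validation data
history = model.fit(x_train_pad, y_train_cat,
                    epochs=3,
                    batch_size=64,
                    validation_data=(x_test_pad, y_test_cat), callbacks=[early_stop])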

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 256)               1280256   
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 46)                11822     
Total params: 1,292,078
Trainable params: 1,292,078
Non-trainable params: 0
_________________________________________________________________
Train on 10105 samples, validate on 1123 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
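
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this arithmetic check.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units


hidden = dense_params(5000, 256)  # max_words inputs -> 256 hidden units
output = dense_params(256, 46)    # 256 hidden units -> 46 categories
total = hidden + output
print(hidden, output, total)  # 1280256 11822 1292078
```

The Dropout layer contributes no parameters, so the total matches the 1,292,078 reported above.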
