# Assignment 10 - New Topic Identification
## Author - Salinee Kingbaisomboon
### UW NetID: 1950831

For this project you will use the Keras Reuters newswire topics classification dataset, which has the following properties:
1. The dataset contains 11,228 newswires from Reuters, labeled with 46 topics.
2. Each wire is encoded as a sequence of word indexes. For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
3. As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.
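
The frequency-based filtering described above can be sketched in plain Python. This is a minimal illustration, not the Keras implementation: `filter_by_rank` is a hypothetical helper, and it assumes the convention stated above that `0` marks an unknown word. In Keras itself, the `num_words` and `skip_top` arguments of `reuters.load_data` serve this purpose.

```python
def filter_by_rank(sequence, num_words=10000, skip_top=20, oov=0):
    """Keep word indexes in [skip_top, num_words); replace all others with `oov`.

    Hypothetical helper for illustration only.
    """
    return [i if skip_top <= i < num_words else oov for i in sequence]


# A made-up wire: two very frequent words, two ordinary words, one rare word
wire = [3, 15, 250, 9999, 12000]
print(filter_by_rank(wire))  # [0, 0, 250, 9999, 0]
```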

## Instructions
1. Read the Reuters dataset into training and testing sets
2. Prepare the dataset
3. Build and compile 3 different models using Keras LSTM, ideally improving the model at each iteration
4. Describe and explain your findings

In [1]:
# Load necessary libraries
import pandas as pd
import numpy as np

import tensorflow
from tensorflow.keras.datasets import reuters
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Activation
from tensorflow.keras.preprocessing.text import Tokenizer

import matplotlib.pyplot as plt

import warnings

warnings.filterwarnings("ignore") # To suppress warning

%matplotlib inline

pd.options.display.max_rows = None
pd.options.display.max_columns = None

# Declare function used in this assignment

In [2]:
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

# Read data

In [3]:
# Read Reuters dataset from Keras into training and testing
print('Loading data...')

# Hyperparameters
embedding_vecor_length = 32
batch_size = 32
epochs = 3
max_words = 1000
num_of_words=10000
(data_train, y_train), (data_test, y_test) = reuters.load_data(num_words=num_of_words,
                                                         test_split=0.2)
print('Train: X=%s, Y=%s' % (data_train.shape, y_train.shape))
print('Test: X=%s, Y=%s' % (data_test.shape, y_test.shape))

Loading data...
Train: X=(8982,), Y=(8982,)
Test: X=(2246,), Y=(2246,)


In [4]:
# Check how many topics this data set has
num_classes = np.max(y_train) + 1
print('This data set has' , num_classes, 'classes')

This data set has 46 classes


In [5]:
# A dictionary mapping words to an integer index
word_index = reuters.get_word_index(path="reuters_word_index.json")

In [6]:
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

In [8]:
# Decode the training data
# Check the first newswire
decode_review(data_train[0])

'the of of mln loss for plc said at only ended said commonwealth could 1 traders now april 0 a after said from 1985 and from foreign 000 april 0 prices its account year a but in this mln home an states earlier and rise and revs vs 000 its 16 vs 000 a but 3 psbr oils several and shareholders and dividend vs 000 its all 4 vs 000 1 mln agreed largely april 0 are 2 states will billion total and against 000 pct dlrs'
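
The decoded text above does not read like a sentence. A likely cause, assuming the default arguments of `reuters.load_data`, is that the loaded indexes are shifted by `index_from=3` so that indexes 0 to 2 can serve as special markers (padding, sequence start, unknown word), while `get_word_index` returns unshifted ranks. The sketch below shows a decoder that undoes the shift; `decode_newswire` and `toy_word_index` are hypothetical names used only for illustration.

```python
def decode_newswire(sequence, word_index, index_from=3):
    """Map shifted word indexes back to words; unknown or reserved ids become '?'."""
    reverse = {idx: word for word, idx in word_index.items()}
    return ' '.join(reverse.get(i - index_from, '?') for i in sequence)


# Toy vocabulary standing in for reuters.get_word_index()
toy_word_index = {'the': 1, 'of': 2, 'said': 3}
encoded = [1, 4, 5, 6]  # start marker, then 'the', 'of', 'said' shifted by 3
print(decode_newswire(encoded, toy_word_index))  # ? the of said
```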

# Build 3 different TensorFlow models

## 1. Simple LSTM for Sequence Classification using Binary Crossentropy as a loss function

In [9]:
# Alter the input sequences so that they all have the same length for modeling
max_review_length = 400
x_train = tensorflow.keras.preprocessing.sequence.pad_sequences(data_train, maxlen=max_review_length)
x_test = tensorflow.keras.preprocessing.sequence.pad_sequences(data_test, maxlen=max_review_length)
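
As a rough picture of what `pad_sequences` does with its default settings (pad and truncate at the front, fill with 0), here is a plain-Python sketch. `pad_pre` is a hypothetical stand-in, not the Keras function, and it returns nested lists rather than a NumPy array.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


print(pad_pre([[5, 6], [1, 2, 3, 4, 5]], maxlen=4))
# [[0, 0, 5, 6], [2, 3, 4, 5]]
```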

In [10]:
# Construct the model
model1 = Sequential()
model1.add(Embedding(num_of_words, embedding_vecor_length, input_length=max_review_length))
model1.add(LSTM(100))
model1.add(Dense(1, activation='sigmoid'))

# View inside the network
print(model1.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 400, 32)           320000    
_________________________________________________________________
lstm (LSTM)                  (None, 100)               53200     
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
Total params: 373,301
Trainable params: 373,301
Non-trainable params: 0
_________________________________________________________________
None
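
The parameter counts in the summary above can be reproduced by hand. The helper functions below are illustrative names, not Keras APIs; they use the standard formulas for an embedding table, an LSTM with four gates, and a dense layer with bias.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary entry
    return vocab_size * dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units


total = embedding_params(10000, 32) + lstm_params(32, 100) + dense_params(100, 1)
print(embedding_params(10000, 32), lstm_params(32, 100), dense_params(100, 1))
# 320000 53200 101
print(total)  # 373301
```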


In [11]:
model1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit model
model1.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3, batch_size=64)

Train on 8982 samples, validate on 2246 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x18298b11088>

In [13]:
# Evaluate model
scores = model1.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 4.67%
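
One plausible reading of this low score, offered as an assumption rather than something the notebook verifies: a single sigmoid unit emits one value between 0 and 1, so after thresholding the prediction can only be 0 or 1, and it can match the true label only for samples whose topic id happens to be 0 or 1. The sketch below uses made-up labels and a hypothetical `binary_accuracy` helper to show the resulting ceiling.

```python
def binary_accuracy(labels, probabilities, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the integer label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    hits = sum(1 for y, p in zip(labels, predictions) if y == p)
    return hits / len(labels)


labels = [3, 4, 3, 1, 19, 4, 3, 16, 3, 1]  # made-up topic ids, not the real data
always_one = [0.99] * len(labels)          # a model that saturates its sigmoid
print(binary_accuracy(labels, always_one))  # 0.2
```

Under this reading, the accuracy can never exceed the share of samples labeled 0 or 1, regardless of how the layers before the output are arranged.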


## 2. LSTM and Convolutional Neural Network For Sequence Classification using Binary Crossentropy as a loss function

In [14]:
# Construct the model
model2 = Sequential()
model2.add(Embedding(num_of_words, embedding_vecor_length, input_length=max_review_length))
model2.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model2.add(MaxPooling1D(pool_size=2))
model2.add(LSTM(100))
model2.add(Dense(1, activation='sigmoid'))

# View inside the network
print(model2.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 400, 32)           320000    
_________________________________________________________________
conv1d (Conv1D)              (None, 400, 32)           3104      
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 200, 32)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 376,405
Trainable params: 376,405
Non-trainable params: 0
_________________________________________________________________
None
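
The two new rows in this summary can be checked with simple arithmetic. The functions below are illustrative helpers, not Keras APIs; `pooled_length` assumes the `MaxPooling1D` defaults (stride equal to the pool size, no padding).

```python
def conv1d_params(in_channels, filters, kernel_size):
    # Each filter spans kernel_size steps across all input channels, plus one bias
    return kernel_size * in_channels * filters + filters


def pooled_length(length, pool_size):
    # Assumes stride == pool_size and 'valid' padding
    return length // pool_size


print(conv1d_params(32, 32, 3))  # 3104
print(pooled_length(400, 2))     # 200
```

Halving the sequence length from 400 to 200 means the LSTM processes half as many time steps, which is the usual motivation for placing convolution and pooling in front of it.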


In [15]:
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit model
model2.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3, batch_size=64)

Train on 8982 samples, validate on 2246 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x18298c3c7c8>

In [16]:
# Evaluate model
scores = model2.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 4.67%


## 3. Simple LSTM for Sequence Classification using Categorical Crossentropy as a loss function

In [17]:
print('Vectorizing sequence data...')
tokenizer = Tokenizer(num_words=max_words)
x_train = tokenizer.sequences_to_matrix(data_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(data_test, mode='binary')
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

Vectorizing sequence data...
x_train shape: (8982, 1000)
x_test shape: (2246, 1000)
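
In `binary` mode, `sequences_to_matrix` turns each newswire into a fixed-length vector of 0s and 1s that marks which word indexes occur. The sketch below mimics that behavior in plain Python; `to_multi_hot` is a hypothetical helper, not the Keras method.

```python
def to_multi_hot(sequence, num_words):
    """Return a vector with 1.0 at every word index present in `sequence`."""
    row = [0.0] * num_words
    for i in sequence:
        if i < num_words:  # indexes outside the vocabulary are skipped
            row[i] = 1.0
    return row


print(to_multi_hot([1, 4, 4, 9, 2], num_words=6))
# [0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
```

Word order and repeat counts are lost in this representation. Each row holds only the values 0 and 1, so an `Embedding` layer applied to it looks up just two distinct ids, which is worth bearing in mind when interpreting Model 3.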


In [18]:
print('Convert class vector to binary class matrix '
      '(for use with categorical_crossentropy)')
y_train = tensorflow.keras.utils.to_categorical(y_train, num_classes)
y_test = tensorflow.keras.utils.to_categorical(y_test, num_classes)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

Convert class vector to binary class matrix (for use with categorical_crossentropy)
y_train shape: (8982, 46)
y_test shape: (2246, 46)
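
The conversion performed by `to_categorical` is one-hot encoding: each integer label becomes a vector with a single 1 at the label's position. A minimal plain-Python sketch, with `one_hot` as a hypothetical helper name:

```python
def one_hot(label, num_classes):
    """Return a vector of zeros with a single 1.0 at position `label`."""
    row = [0.0] * num_classes
    row[label] = 1.0
    return row


print(one_hot(2, 4))  # [0.0, 0.0, 1.0, 0.0]
```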


In [19]:
# Construct the model
model3 = Sequential()
model3.add(Embedding(num_of_words, embedding_vecor_length, input_length=max_words))
model3.add(Dropout(0.5))
model3.add(LSTM(100))
model3.add(Dense(num_classes))
model3.add(Activation('softmax'))

# View inside the network
print(model3.summary())

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 1000, 32)          320000    
_________________________________________________________________
dropout (Dropout)            (None, 1000, 32)          0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_2 (Dense)              (None, 46)                4646      
_________________________________________________________________
activation (Activation)      (None, 46)                0         
Total params: 377,846
Trainable params: 377,846
Non-trainable params: 0
_________________________________________________________________
None


In [20]:
model3.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
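
To make the softmax and categorical cross-entropy pairing concrete, here is a small worked example in plain Python. It is a sketch of the textbook formulas with made-up logits, not the Keras implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, probabilities) if t > 0)


probs = softmax([2.0, 1.0, 0.1])  # made-up scores for three classes
loss = categorical_crossentropy([1.0, 0.0, 0.0], probs)
print([round(p, 3) for p in probs])
print(round(loss, 3))
```

The loss shrinks toward 0 as the probability of the true class approaches 1, which gives the network one output per topic to optimize instead of a single shared value.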

In [22]:
model3.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_split=0.1)
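
With `validation_split=0.1`, Keras holds out the last portion of the training arrays for validation. The sketch below reproduces the resulting sizes under the assumption that the split point is the truncated product of the sample count and `1 - validation_split`; `split_sizes` is a hypothetical helper.

```python
def split_sizes(num_samples, validation_split):
    """Return (training size, validation size) for a tail hold-out split."""
    split_at = int(num_samples * (1.0 - validation_split))
    return split_at, num_samples - split_at


print(split_sizes(8982, 0.1))  # (8083, 899)
```

These sizes agree with the 8083 training and 899 validation samples reported in the training log for this cell.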

Train on 8083 samples, validate on 899 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x182a12654c8>

In [23]:
# Evaluate model
scores = model3.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 38.65%


***
**Summary:**
1. **Simple LSTM model using binary_crossentropy as a loss function** yields an accuracy of 4.67% on test data.
2. **LSTM and Convolutional Neural Network model using binary_crossentropy as a loss function** yields an accuracy of 4.67% on test data.
3. **Simple LSTM model using categorical_crossentropy as a loss function** yields an accuracy of 38.65% on test data.

Models **1** and **2** show that even though we tried to improve the model by adding a **CNN**, the model does not improve at all if the **loss function** is not appropriate.

Model **3** achieves a much better accuracy even though it does not use the convolutional layer of model **2**.

This assignment shows the importance of picking the right **loss function**, which affects the model significantly.
***