## Introduction

This notebook covers the creation, training and evaluation of a bidirectional Long Short-Term Memory (LSTM) recurrent neural network model.
The steps followed in this notebook are as follows:
1. Load the preprocessed data into a dataframe from Google Drive (this notebook runs on Google Colab).
2. Tokenize the news articles using the Keras Tokenizer and then pad the resulting integer sequences to a fixed length.
3. Create the model and fit it on the training set.
4. Predict on the validation and test sets and compute the accuracy score for both sets of predictions.
5. Save the model for future use in the model evaluation notebook.

In [1]:
# Importing required packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from wordcloud import WordCloud, STOPWORDS
import nltk
nltk.download('punkt')
import re

from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_curve, auc
from sklearn.decomposition import PCA
import pickle

# import keras
from tensorflow.keras.preprocessing.text import one_hot, Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Embedding, Input, LSTM, Conv1D, MaxPool1D, Bidirectional
from tensorflow.keras.models import Model
#from jupyterthemes import jtplot
#jtplot.style(theme='monokai', context='notebook', ticks=True, grid=False)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
# Mounting Google Drive in Google Colaboratory
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


Before modelling, let's load the preprocessed news data into a dataframe and split it into train, validation and test sets.

In [3]:
# Read the CSV file from the preprocessing step
news_df = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/train_processed.csv', encoding='UTF-8')

# Split the data into train, validation, and test sets
#train_df, temp_df = train_test_split(news_df, test_size=0.2, random_state=42)
#val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

In [4]:
print("News df Shape:", news_df.shape)

News df Shape: (20758, 6)


In [5]:
# Split the data into train, validation, and test sets
X_trainval, X_test, y_trainval, y_test = train_test_split(news_df['clean_joined'], news_df['label'], test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2, random_state=42)

In [6]:
# Print the shape of the dataframes
print("Train Data Shape:", X_train.shape)
print("Validation Data Shape:", X_val.shape)
print("Test Data Shape:", X_test.shape)

Train Data Shape: (14945,)
Validation Data Shape: (3737,)
Test Data Shape: (2076,)
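
The shapes above follow from the two-stage split: 10% of all rows are held out for testing, then 20% of the remainder is held out for validation. The sketch below reproduces the arithmetic, assuming scikit-learn's rule of rounding the held-out size up and giving the rest to the training side. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

n_total = 20758  # rows in news_df, taken from the shape printed above

n_trainval, n_test = split_sizes(n_total, 0.1)   # first split: hold out the test set
n_train, n_val = split_sizes(n_trainval, 0.2)    # second split: hold out validation

print(n_train, n_val, n_test)  # 14945 3737 2076

# Effective share of the full dataset in each set
for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(name, round(n / n_total, 2))
```

In effect the data is divided roughly 72% / 18% / 10% between training, validation and test.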


The next step is to find the total number of words in the entire news_df dataframe and the maximum number of words in a single news article. These figures are used to size the tokenizer vocabulary and to choose the padding length.

In [7]:
# Obtain the total words present in the dataset
list_of_words = []
for i in news_df['clean_joined']:
    for j in i.split():
        list_of_words.append(j)

In [8]:
# Total number of words in the news dataframe.
len(list_of_words)

7466427

In [9]:
# The length of the longest document is needed to choose the padding length for the sequences
maxlen = -1
for doc in news_df.clean_joined:
    tokens = nltk.word_tokenize(doc)
    if(maxlen<len(tokens)):
        maxlen = len(tokens)
print("The maximum number of words in any document is =", maxlen)

The maximum number of words in any document is = 13775


In [10]:
# Obtain the total number of unique words
total_words = len(set(list_of_words))
total_words

170784
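
The three statistics computed above (total words, unique words, longest document) can also be gathered in a single pass over the corpus. The following is a minimal sketch on a toy corpus. `docs` is a hypothetical stand-in for `news_df['clean_joined']`, and it splits on whitespace rather than calling `nltk.word_tokenize`, so counts could differ slightly on text that still contains punctuation.

```python
from collections import Counter

# Hypothetical stand-in for news_df['clean_joined']
docs = [
    "house aide comey letter",
    "comey letter jason chaffetz tweeted",
    "house democratic aide",
]

word_counts = Counter()
max_len = 0
for doc in docs:
    tokens = doc.split()                  # whitespace tokenization
    word_counts.update(tokens)            # accumulate per-word frequencies
    max_len = max(max_len, len(tokens))   # track the longest document

total_tokens = sum(word_counts.values())  # all words, repeats included
vocab_size = len(word_counts)             # distinct words
mean_len = total_tokens / len(docs)       # average words per document

print(total_tokens, vocab_size, max_len, mean_len)
```

Keeping the mean length alongside the maximum is useful later, when a padding length has to be chosen.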

The Keras Tokenizer is fitted on the training articles to map each word to an integer index, and the train, validation and test sequences are created from that mapping.

In [11]:
# Create a tokenizer to tokenize the words and create sequences of tokenized words
tokenizer = Tokenizer(num_words = total_words)
tokenizer.fit_on_texts(X_train)
train_sequences = tokenizer.texts_to_sequences(X_train)
val_sequences = tokenizer.texts_to_sequences(X_val)
test_sequences = tokenizer.texts_to_sequences(X_test)
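
As a rough illustration of what the fitted tokenizer does: each word seen in the training texts gets an integer index, with more frequent words receiving smaller indices and index 0 left unused. Words that never appeared in the training texts are dropped when no out-of-vocabulary token is configured. The pure-Python sketch below mimics that behaviour on a toy corpus. `fit_word_index` and `texts_to_ids` are hypothetical helpers, not Keras functions, and they skip the lowercasing and punctuation filtering that the real `Tokenizer` applies.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer rank, where 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() orders by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    """Encode texts as lists of indices, silently dropping unknown words."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

train_texts = ["comey letter comey", "letter jason"]
word_index = fit_word_index(train_texts)

print(word_index)                                          # {'comey': 1, 'letter': 2, 'jason': 3}
print(texts_to_ids(["jason comey unseen"], word_index))    # [[3, 1]]
```

Because the mapping is fitted on the training texts only, validation and test articles are encoded with the training vocabulary, which avoids leaking information from the held-out sets.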

In [12]:
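# Caution: news_df.clean_joined[0] is the first row of the full dataframe, while train_sequences[0]
# encodes X_train.iloc[0]. After the shuffled split these are generally different documents.
print("The encoding for document\n",news_df.clean_joined[0],"\n is : ",train_sequences[0])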

The encoding for document
 house aide comey letter jason chaffetz tweeted house aide comey letter jason chaffetz tweeted darrell lucus october subscribe jason chaffetz stump american fork utah image courtesy michael jolley available creative commons license apologies keith olbermann doubt worst person world week director james comey according house democratic aide looks like know second worst person turns comey sent infamous letter announcing looking emails related hillary clinton email server ranking democrats relevant committees hear comey tweet republican committee chairmen know comey notified republican chairmen democratic ranking members house intelligence judiciary oversight committees agency reviewing emails recently discovered order contained classified information long letter went oversight committee chairman jason chaffetz political world ablaze tweet informed learned existence emails appear pertinent investigation case reopened jason chaffetz jasoninthehouse october course k

In [13]:
# The longest article has 13775 tokens, so padding could go up to that length. We cap sequences at
# maxlen = 4000, well below that maximum. Going by the counts above (7,466,427 words over 20,758
# articles) the mean is roughly 360 words per article, so only unusually long articles are truncated.
padded_train = pad_sequences(train_sequences, maxlen = 4000, padding = 'post', truncating = 'post')
padded_val = pad_sequences(val_sequences, maxlen = 4000, padding = 'post', truncating = 'post')
padded_test = pad_sequences(test_sequences, maxlen = 4000, padding = 'post', truncating = 'post')
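
To make the effect of `padding = 'post'` and `truncating = 'post'` concrete, here is a minimal pure-Python sketch of what happens to a single sequence. `pad_post` is a hypothetical helper written for illustration; it is not how Keras implements `pad_sequences`, which works on whole batches and defaults to `'pre'` for both options.

```python
def pad_post(seq, maxlen, value=0):
    """Truncate seq at the end to maxlen, then right-pad it with `value`."""
    kept = seq[:maxlen]                              # truncating='post' drops the tail
    return kept + [value] * (maxlen - len(kept))     # padding='post' appends zeros

print(pad_post([5, 3, 8], maxlen=5))                 # [5, 3, 8, 0, 0]
print(pad_post([5, 3, 8, 2, 9, 4, 7], maxlen=5))     # [5, 3, 8, 2, 9]
```

With these settings every article keeps its opening words, and anything beyond the cap is discarded.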

In the next step, a bidirectional LSTM model is built and fitted to the training data.
The steps are as follows:
1. Creating a Sequential model from Keras.
2. Adding an Embedding layer to the model. This layer is typically used for
words or other inputs that have a very large number of categories and can be
represented as dense vectors. 'output_dim = 128' sets the size of
the vector space in which words are embedded, i.e. the length of the
output vector this layer produces for each word.
3. Adding a Bidirectional wrapper around an LSTM (Long Short-Term Memory) layer, which is a
type of Recurrent Neural Network (RNN). The LSTM reads the embedded word sequence
and summarises it in a fixed-size vector that the later layers use for
classification. The Bidirectional wrapper runs a second LSTM over the sequence in
reverse order, which in theory gives the model additional context. Each direction
has 128 units or "cells", so the concatenated output has 256 features.
4. Adding a Dense layer (fully connected layer) where every node in the layer is
connected to every node in the preceding layer. The number 128 indicates how
many neurons are in this layer. The 'relu' activation function is used.
5. Adding another Dense layer with 1 neuron, as is usual for binary
classification problems. Its 'sigmoid' activation function outputs a value
between 0 and 1, which can be treated as the probability of
the positive class.
6. The compile method is used to configure the learning process before training the
model. It receives three arguments: an optimizer ('adam' in this case), a loss
function ('binary_crossentropy', which is suitable for binary classification),
and a list of metrics ('acc' stands for accuracy).


In [14]:
# Sequential Model
model = Sequential()

# Embedding layer
model.add(Embedding(total_words, output_dim = 128))


# Bi-Directional RNN and LSTM
model.add(Bidirectional(LSTM(128)))

# Dense layers
model.add(Dense(128, activation = 'relu'))
model.add(Dense(1,activation= 'sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 128)         21860352  
                                                                 
 bidirectional (Bidirectiona  (None, 256)              263168    
 l)                                                              
                                                                 
 dense (Dense)               (None, 128)               32896     
                                                                 
 dense_1 (Dense)             (None, 1)                 129       
                                                                 
Total params: 22,156,545
Trainable params: 22,156,545
Non-trainable params: 0
_________________________________________________________________
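
The parameter counts in the summary above can be checked by hand. The sketch below redoes the arithmetic using the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias. The variable names are chosen for illustration.

```python
vocab_size = 170784   # total_words, the Embedding input dimension
embed_dim = 128       # Embedding output_dim
lstm_units = 128      # units per LSTM direction
dense_units = 128     # neurons in the hidden Dense layer

# One embedding vector of length embed_dim per vocabulary entry
embedding_params = vocab_size * embed_dim

# Four gates, each with input weights, recurrent weights and a bias vector
lstm_one_direction = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bilstm_params = 2 * lstm_one_direction   # forward and backward LSTM

# The Dense layer sees the concatenated forward and backward outputs
dense_params = (2 * lstm_units) * dense_units + dense_units
output_params = dense_units * 1 + 1      # single sigmoid neuron plus bias

total = embedding_params + bilstm_params + dense_params + output_params
print(embedding_params, bilstm_params, dense_params, output_params, total)
# 21860352 263168 32896 129 22156545
```

Almost all of the roughly 22 million parameters sit in the embedding layer, so the vocabulary size dominates the model size.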


In [15]:
# Converting a y_train series to an array.
y_train = np.asarray(y_train)

In [16]:
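# Train the model. validation_split = 0.1 holds out the last 10% of padded_train for monitoring
# during training; this is separate from the X_val set created earlier.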
model.fit(padded_train, y_train, batch_size = 64, validation_split = 0.1, epochs = 2)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7883f7743520>

In [17]:
# make prediction
pred_test = model.predict(padded_test)
pred_val = model.predict(padded_val)



In [19]:
# Threshold the sigmoid output at 0.5: values above 0.5 become class 1, the rest class 0
# (which class means "fake" depends on the label encoding of the dataset)
prediction_test = [1 if p.item() > 0.5 else 0 for p in pred_test]

In [20]:
# Accuracy on the test set (accuracy_score was already imported in the first cell)

accuracy = accuracy_score(list(y_test), prediction_test)

print("Model Accuracy : ", accuracy)

Model Accuracy :  0.9590558766859345


In [21]:
# Apply the same 0.5 threshold to the validation predictions
prediction_val = [1 if p.item() > 0.5 else 0 for p in pred_val]

In [22]:
# Accuracy on the validation set
accuracy = accuracy_score(list(y_val), prediction_val)

print("Model Accuracy : ", accuracy)

Model Accuracy :  0.9644099545089644
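
The thresholding and accuracy steps used for both sets can be summarised in a few lines of plain Python. This is a minimal sketch on made-up numbers; `threshold` and `accuracy` are hypothetical helpers that mirror what the cells above do with `accuracy_score`, and the probabilities and labels are invented for illustration only.

```python
def threshold(probs, cutoff=0.5):
    """Convert probabilities to 0/1 labels: 1 only if strictly above the cutoff."""
    return [1 if p > cutoff else 0 for p in probs]

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical model outputs and true labels, for illustration only
probs = [0.91, 0.12, 0.50, 0.77, 0.30]
y_true = [1, 0, 1, 1, 1]

y_pred = threshold(probs)
print(y_pred)                      # [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))    # 0.6
```

Note that a probability of exactly 0.5 falls into class 0, because the comparison is strict.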


In [23]:
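# Save the RNN model with pickle for use in the model evaluation notebook.
# Keras also provides model.save() and tensorflow.keras.models.load_model() for persisting models.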

with open('rnn_news_classification.pkl', 'wb') as file:
    pickle.dump(model, file)