# Part 2: Natural Language Processing
Code is executed from the `submit` folder

## Setup

Import the required libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import (
    Input, Embedding, LSTM, BatchNormalization, Dropout, Dense
)
from tensorflow.keras.callbacks import EarlyStopping
from pathlib import Path
from zipfile import ZipFile
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.test.utils import get_tmpfile
import pickle
from typing import List
import joblib
import time



Download the stopwords from the `nltk` library (skipping this step after a fresh `nltk` install causes an error later on)

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\thefo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Define directories and create temporary directory `tmp` for storing intermediate objects (e.g. trained models)

In [3]:
CURR_DIR = Path.cwd()
DATA_PATH = CURR_DIR.parent / 'fake-and-real-news-dataset.zip'
TMP_DIR = CURR_DIR.parent / 'tmp'

if not TMP_DIR.exists():
    TMP_DIR.mkdir()

## Data Exploration and Preparation

Read in data files and merge datasets together

In [4]:
# Read in data from zip file
zip_file = ZipFile(DATA_PATH)
fake_df, real_df = [
    pd.read_csv(zip_file.open(text_file.filename))
    for text_file in zip_file.infolist()
    if text_file.filename.endswith('.csv')
]

# Give label of 1 for fake news and 0 for real news
fake_df['label'] = 1
real_df['label'] = 0

# Combine DataFrames in a single one
input_df = pd.concat([fake_df, real_df], axis=0, ignore_index=True)

Examine top 5 rows of input `DataFrame`

In [5]:
input_df.head()

Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",1
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",1
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",1
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",1
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",1


Check data type for every column and also for missing values (no missing values found)

In [6]:
input_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   label    44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 1.7+ MB


Randomly split dataset into training and test sets, keeping proportion of positive and negative labels constant

In [7]:
train_df, test_df = train_test_split(input_df, test_size=0.25, stratify=input_df['label'], random_state=0)
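
To illustrate what `stratify` is doing, here is a minimal standard-library sketch of a stratified split. It is not scikit-learn's implementation; the helper name `stratified_split` and the toy labels are assumptions made for illustration only.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], test_size: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), splitting each label group separately."""
    rng = random.Random(seed)
    by_label: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx: List[int] = []
    test_idx: List[int] = []
    for indices in by_label.values():
        # Shuffle within the label group, then carve off the test share
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical): 8 fake (1) and 4 real (0) headlines
toy_labels = [1] * 8 + [0] * 4
train_idx, test_idx = stratified_split(toy_labels, test_size=0.25)
test_fake_share = sum(toy_labels[i] for i in test_idx) / len(test_idx)
```

Because each label group is split separately, the share of fake headlines in the toy test set equals the share in the full toy dataset.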

Functions to process text and build a corpus from the news titles. Each corpus is saved to a serialized file to reduce runtime during repeated executions

In [8]:
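porter_stemmer = PorterStemmer()
# Build the stop-word set once, so the list is not rebuilt for every word
# and membership checks are fast
STOP_WORDS = set(stopwords.words('english'))


def process_sentence(title: str) -> str:
    # Procedure from https://www.kaggle.com/dheerajchaudhary/simple-nlp-model-using-word2vec
    # Keep letters only, then lowercase
    review = re.sub('[^a-zA-Z]', ' ', title)
    review = review.lower()

    # Drop stop words and stem the remaining words
    review = [
        porter_stemmer.stem(word)
        for word in review.split()
        if word not in STOP_WORDS
    ]
    return ' '.join(review)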


def get_corpus(corpus_path: Path, titles: List[str]) -> List[str]:
    if corpus_path.exists():
        with open(corpus_path, 'rb') as pkl_file:
            corpus = pickle.load(pkl_file)
    else:
        corpus = [process_sentence(title) for title in titles]
        with open(corpus_path, 'wb') as pkl_file:
            pickle.dump(corpus, pkl_file)
    return corpus
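
As a rough illustration of the cleaning steps, the sketch below applies the same regular expression, lowercasing and stop-word filtering to a single title using only the standard library. The tiny `TOY_STOPWORDS` set and the example title are assumptions for illustration, and stemming is left out, so the result differs from what `process_sentence` returns.

```python
import re
from typing import List, Set

# Hypothetical, tiny stop-word set for illustration only; the notebook
# uses nltk's full English list.
TOY_STOPWORDS: Set[str] = {'a', 'an', 'the', 'is', 'to', 'of', 'in', 'out'}


def clean_title(title: str, stop_words: Set[str] = TOY_STOPWORDS) -> str:
    # Replace everything except ASCII letters with spaces, then lowercase
    letters_only = re.sub('[^a-zA-Z]', ' ', title).lower()
    # Drop stop words; stemming is deliberately omitted in this sketch
    kept: List[str] = [w for w in letters_only.split() if w not in stop_words]
    return ' '.join(kept)


cleaned = clean_title("Trump Sends Out Embarrassing New Year's Message in 2017!")
print(cleaned)  # trump sends embarrassing new year s message
```

Note that digits and punctuation are removed entirely, and the apostrophe leaves a stray `s` token behind.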

Get corresponding corpuses from training and test sets

In [9]:
train_corpus: List[str] = []
test_corpus: List[str] = []
TRAIN_CORPUS_PATH = TMP_DIR / 'train_corpus.pkl'
TEST_CORPUS_PATH = TMP_DIR / 'test_corpus.pkl'

train_corpus = get_corpus(TRAIN_CORPUS_PATH, train_df['title'].tolist())
test_corpus = get_corpus(TEST_CORPUS_PATH, test_df['title'].tolist())
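
The load-if-present, otherwise compute-and-save pattern recurs throughout this notebook. A generic version might look like the sketch below; `load_or_compute` and `expensive` are hypothetical names, not part of the notebook.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, List


def load_or_compute(cache_path: Path, compute: Callable[[], Any]) -> Any:
    """Load a pickled object if the file exists, else compute and cache it."""
    if cache_path.exists():
        with open(cache_path, 'rb') as pkl_file:
            return pickle.load(pkl_file)
    result = compute()
    with open(cache_path, 'wb') as pkl_file:
        pickle.dump(result, pkl_file)
    return result


calls: List[int] = []


def expensive() -> List[str]:
    # Stand-in for a slow step such as text processing
    calls.append(1)
    return ['token a', 'token b']


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'corpus.pkl'
    first = load_or_compute(path, expensive)   # computes and writes the file
    second = load_or_compute(path, expensive)  # reads the file instead
```

One caveat of this pattern: the cached file is not refreshed when the inputs or the processing code change, so the `tmp` directory should be cleared after such changes.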

TF-IDF transform on training and test corpuses (for model 1). TF-IDF Vectorizer is saved in serialized file to reduce runtime during repeated execution

In [10]:
VECTORIZER_PATH = TMP_DIR / 'tfidf_vectorizer.pkl'

vectorizer: TfidfVectorizer

if VECTORIZER_PATH.exists():
    vectorizer = joblib.load(VECTORIZER_PATH)
else:
    vectorizer = TfidfVectorizer()
    vectorizer.fit(train_corpus)
    joblib.dump(vectorizer, VECTORIZER_PATH)


X_train_tfidf = vectorizer.transform(train_corpus).toarray()
X_test_tfidf = vectorizer.transform(test_corpus).toarray()

print(f'Dimensions of X_train_tfidf: {X_train_tfidf.shape}')
print(f'Dimensions of X_test_tfidf: {X_test_tfidf.shape}')

Dimensions of X_train_tfidf: (33673, 12211)
Dimensions of X_test_tfidf: (11225, 12211)
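
For intuition about the values in these matrices, here is a pure-Python sketch of TF-IDF weighting. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row, which is my understanding of the `TfidfVectorizer` defaults. Tokenisation is a plain `split()`, which is a simplification, and the toy corpus is made up.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(corpus: List[str]) -> List[Dict[str, float]]:
    docs = [doc.split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {
        term: math.log((1 + n_docs) / (1 + count)) + 1
        for term, count in df.items()
    }

    rows: List[Dict[str, float]] = []
    for doc in docs:
        # Raw term count times idf
        weights = {term: tf * idf[term] for term, tf in Counter(doc).items()}
        # L2-normalise the row (an empty document stays empty)
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


toy_corpus = ['trump send messag', 'trump call pope', 'pope send messag']
rows = tfidf(toy_corpus)
```

In the second toy document, `call` appears in only one document and therefore gets a larger weight than `trump`, which appears in two. The 12211 columns above correspond to the vocabulary fitted on the training corpus.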


Doc2Vec transform on training and test corpuses (for model 2). The trained Doc2Vec model is saved to file and loaded when needed, to reduce runtime during repeated executions

In [11]:
# Processing steps from
# https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html
DOC2VEC_PATH = TMP_DIR / 'doc2vec.model'
VEC_SIZE = 32
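# Use the path under TMP_DIR directly; get_tmpfile() is meant for file names
# inside the system temp directory and is not needed for an absolute path
fname = str(DOC2VEC_PATH)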

if DOC2VEC_PATH.exists():
    doc2vec = Doc2Vec.load(fname)
else:
    train_documents = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(train_corpus)]
    doc2vec = Doc2Vec(vector_size=VEC_SIZE, min_count=0)
    doc2vec.build_vocab(train_documents)
    doc2vec.train(train_documents, total_examples=doc2vec.corpus_count, epochs=doc2vec.epochs)
    doc2vec.save(fname)

X_train_doc2vec = np.array([
    doc2vec.infer_vector(doc.split())
    for doc in train_corpus
])
X_test_doc2vec = np.array([
    doc2vec.infer_vector(doc.split())
    for doc in test_corpus
])

print(f'Dimensions of X_train_doc2vec: {X_train_doc2vec.shape}')
print(f'Dimensions of X_test_doc2vec: {X_test_doc2vec.shape}')

Dimensions of X_train_doc2vec: (33673, 32)
Dimensions of X_test_doc2vec: (11225, 32)


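Integer encoding for model 3, to be fed to an embedding layer later. Despite its name, Keras `one_hot` hashes each word to an integer index below `VOCAB_SIZE` rather than producing one-hot vectors. The encoded sequences are padded (or truncated) so that all documents in the corpuses have equal length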

In [12]:
# Processing steps from
# https://www.kaggle.com/dheerajchaudhary/simple-nlp-model-using-word2vec
VOCAB_SIZE = 10000
SENTENCE_LEN = 20

X_train_ohe = pad_sequences(
    [one_hot(doc, VOCAB_SIZE) for doc in train_corpus],
    padding='pre',
    maxlen=SENTENCE_LEN
)
X_test_ohe = pad_sequences(
    [one_hot(doc, VOCAB_SIZE) for doc in test_corpus],
    padding='pre',
    maxlen=SENTENCE_LEN
)

print(f'Dimensions of X_train_ohe: {X_train_ohe.shape}')
print(f'Dimensions of X_test_ohe: {X_test_ohe.shape}')

Dimensions of X_train_ohe: (33673, 20)
Dimensions of X_test_ohe: (11225, 20)
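
The two steps above (hashing words to indices, then padding at the front) can be sketched with the standard library as follows. This is only an approximation of the Keras behaviour: it uses MD5 as a stable hash for reproducibility, and `hash_encode` and `pad_pre` are hypothetical helper names.

```python
import hashlib
from typing import List


def hash_encode(doc: str, vocab_size: int) -> List[int]:
    """Map each word to an index in [1, vocab_size - 1]; 0 is left for padding."""
    return [
        int(hashlib.md5(word.encode('utf-8')).hexdigest(), 16) % (vocab_size - 1) + 1
        for word in doc.split()
    ]


def pad_pre(seq: List[int], maxlen: int) -> List[int]:
    """Left-pad with zeros, or keep only the last `maxlen` items."""
    trimmed = seq[-maxlen:]
    return [0] * (maxlen - len(trimmed)) + trimmed


encoded = hash_encode('trump send messag', vocab_size=10000)
padded = pad_pre(encoded, maxlen=5)                   # two zeros, then the indices
truncated = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)  # keeps the last five items
```

Since hashing maps words into a fixed number of buckets, different words can share an index. The TF-IDF vocabulary above has 12211 terms while `VOCAB_SIZE` is 10000, so some collisions are unavoidable here.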


Get target labels from training and test sets

In [13]:
y_train = train_df['label']
y_test = test_df['label']

print(f'Length of y_train: {len(y_train)}')
print(f'Length of y_test: {len(y_test)}')

Length of y_train: 33673
Length of y_test: 11225


## Model Training and Evaluation

### Model 1: Naive Bayes
Each feature is assumed to follow a Gaussian distribution within each class

Train Naive Bayes model. Model is saved after training to reduce runtime during repeated executions

In [14]:
MODEL_NB_PATH = TMP_DIR / 'model_nb.pkl'

if MODEL_NB_PATH.exists():
    model_nb = joblib.load(MODEL_NB_PATH)
else:
    model_nb = GaussianNB()
    model_nb.fit(X_train_tfidf, y_train)
    joblib.dump(model_nb, MODEL_NB_PATH)

Evaluate accuracy and F1 score on test set

In [15]:
y_pred = model_nb.predict(X_test_tfidf)
print(f'Accuracy on test set: {accuracy_score(y_test, y_pred):.3f}')
print(f'F1 score on test set: {f1_score(y_test, y_pred):.3f}')

Accuracy on test set: 0.791
F1 score on test set: 0.766
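
As a reminder of how the two metrics relate to the confusion counts, here is a small worked sketch in plain Python, with label 1 (fake news) as the positive class. The toy labels are made up and `accuracy_and_f1` is a hypothetical helper, not the scikit-learn implementation.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fake predicted fake
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # real predicted fake
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fake predicted real
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # real predicted real

    accuracy = (tp + tn) / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 0, 0, 1]
toy_accuracy, toy_f1 = accuracy_and_f1(toy_true, toy_pred)
print(toy_accuracy, toy_f1)  # 0.75 0.75
```

Accuracy counts both classes equally, whereas F1 ignores true negatives, which is why the two can diverge as they do for the models in this notebook.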


### Model 2: Gradient Boosted Trees

Train Gradient Boosted Trees model. Model is saved after training to reduce runtime during repeated executions

In [16]:
MODEL_GBM_PATH = TMP_DIR / 'model_gbm.pkl'

if MODEL_GBM_PATH.exists():
    model_gbm = joblib.load(MODEL_GBM_PATH)
else:
    model_gbm = GradientBoostingClassifier(n_estimators=100, max_depth=3)
    model_gbm.fit(X_train_doc2vec, y_train)
    joblib.dump(model_gbm, MODEL_GBM_PATH)

Evaluate accuracy and F1 score on test set

In [17]:
y_pred = model_gbm.predict(X_test_doc2vec)
print(f'Accuracy on test set: {accuracy_score(y_test, y_pred):.3f}')
print(f'F1 score on test set: {f1_score(y_test, y_pred):.3f}')

Accuracy on test set: 0.830
F1 score on test set: 0.836
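
The `time` module is imported in the setup, but the evaluation cells do not measure inference time. One possible way to do so is sketched below; `timed_predict` and `toy_predict` are hypothetical stand-ins, and in the notebook the callable would be something like `model_gbm.predict`.

```python
import time
from typing import Callable, List, Sequence, Tuple


def timed_predict(
    predict: Callable[[Sequence[float]], List[int]],
    features: Sequence[float],
) -> Tuple[List[int], float]:
    """Run `predict` once and return its output with the elapsed seconds."""
    start = time.perf_counter()
    predictions = predict(features)
    elapsed = time.perf_counter() - start
    return predictions, elapsed


def toy_predict(features: Sequence[float]) -> List[int]:
    # Stand-in for a fitted model's predict method
    return [1 if x > 0.5 else 0 for x in features]


preds, seconds = timed_predict(toy_predict, [0.2, 0.9, 0.7])
print(f'Inference time: {seconds:.6f} s')
```

A single measurement can be noisy, so averaging over several runs would likely give a fairer comparison between models.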


### Model 3: Deep Learning
Contains Embedding, LSTM, Batch Normalization, Dropout and Dense layers

Define deep learning model architecture and compile

In [18]:
# Deep learning architecture from
# https://www.kaggle.com/dheerajchaudhary/simple-nlp-model-using-word2vec
NUM_EMBEDDING = 60
inp = Input(shape=(SENTENCE_LEN,), name='input')
out = Embedding(VOCAB_SIZE, NUM_EMBEDDING, name='embedding')(inp)
out = LSTM(64, return_sequences=True, name='lstm_0')(out)
out = BatchNormalization(name='batchnorm_0')(out)
out = Dropout(0.5, name='dropout_0')(out)
out = LSTM(32, name='lstm_1')(out)
out = BatchNormalization(name='batchnorm_1')(out)
out = Dropout(0.5, name='dropout_1')(out)
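# Note: this single-unit ReLU layer is a very narrow bottleneck before the
# sigmoid output and may limit what the model can learn
out = Dense(1, activation='relu', name='dense_0')(out)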
out = Dense(1, activation='sigmoid', name='dense_1')(out)
model_dl = Model(inputs=inp, outputs=out, name='functional_model')
model_dl.compile(
    loss='binary_crossentropy',
    optimizer='adam'
)

Train deep learning model using LSTM layers. Model is saved after training to reduce runtime during repeated executions

In [19]:
MODEL_DL_PATH = TMP_DIR / 'model_dl.h5'

if MODEL_DL_PATH.exists():
    model_dl = load_model(MODEL_DL_PATH)
else:
    # Define early stopping callback to stop training process when
    # there is no more significant improvement in training score
    early_stopping = EarlyStopping(
        monitor='loss',
        patience=5,
        min_delta=0.001,
        restore_best_weights=True,
    )

    model_dl.fit(X_train_ohe, y_train, epochs=20, batch_size=128, callbacks=[early_stopping])
    model_dl.save(MODEL_DL_PATH)
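
The stopping rule configured above can be approximated in a few lines of plain Python. This is a simplified sketch of the patience and `min_delta` logic, not the Keras implementation; `early_stop_epoch` and the toy loss values are assumptions for illustration.

```python
from typing import List, Tuple


def early_stop_epoch(
    losses: List[float], patience: int, min_delta: float
) -> Tuple[int, int]:
    """Return (stop_epoch, best_epoch), both zero-based."""
    best = float('inf')
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            # Improvement larger than min_delta: remember it and reset the counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Never triggered: training runs through the last epoch
    return len(losses) - 1, best_epoch


toy_losses = [0.60, 0.40, 0.30, 0.2998, 0.2997, 0.2996, 0.2995, 0.2994, 0.2993]
stop_epoch, best_epoch = early_stop_epoch(toy_losses, patience=5, min_delta=0.001)
print(stop_epoch, best_epoch)  # 7 2
```

With `restore_best_weights=True`, the weights from the best epoch would be the ones kept. Since the monitored quantity here is the training loss, this rule detects a plateau in training rather than overfitting; monitoring a validation loss would require holding out validation data.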

Evaluate accuracy and F1 score on test set

In [20]:
THRESHOLD = 0.5
# Apply threshold to outputted prediction, to transform probability values
# to binary labels
y_pred = np.where(model_dl.predict(X_test_ohe).flatten() > THRESHOLD, 1, 0)
print(f'Accuracy on test set: {accuracy_score(y_test, y_pred):.3f}')
print(f'F1 score on test set: {f1_score(y_test, y_pred):.3f}')

Accuracy on test set: 0.635
F1 score on test set: 0.687


## Summary

Reasons for model choice:
- Naive Bayes: The idea was to choose a relatively simple model to start off as a baseline, and try to improve from there.
- Gradient Boosted Trees: Among non-deep-learning models, Gradient Boosted Trees are known for strong predictive performance, so the model is worth trying.
- Deep Learning: Deep learning is known to perform well on unstructured data (e.g. images, text). LSTM layers are suited to sequences such as words and sentences, and can learn dependencies across long sequences of values.

Comparison between models:

Model|Accuracy|F1 Score
---|---|---
Naive Bayes|0.791|0.766
Gradient Boosted Trees|0.830|0.836
Deep Learning|0.635|0.687

Based on accuracy and F1 score, the Gradient Boosted Trees model is preferred. Note, however, that each model uses a different text representation (TF-IDF, Doc2Vec and hashed word indices), so the comparison reflects the features as well as the classifiers. Further feature engineering and hyperparameter tuning could also change these metrics, so they should not be treated as the endpoint of model training.