In [1]:
## if working in jupyter notebook
# %load_ext nb_black
## if working in jupyter lab
#%load_ext lab_black
#%load_ext tensorboard

# 1. Download Kaggle dataset
Dataset source: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset/
- The following cells use the Kaggle API to download the dataset. If you have not already done so, follow these [instructions](https://github.com/Kaggle/kaggle-api#api-credentials) to create your Kaggle API token.
- After downloading your `kaggle.json`, edit the `KAGGLE_JSON` variable so that it points to the file.

In [2]:
# check that our .gzip files are present
data_path = "../data"
!ls -la $data_path

total 42336
drwxrwxr-x 3 evan evan     4096 Mar 25 17:25 .
drwxrwxr-x 8 evan evan     4096 Mar 27 21:58 ..
-rw-rw-r-- 1 evan evan   338769 Mar 20 12:40 df_cases_200906.gzip
-rw-rw-r-- 1 evan evan    19689 Mar 20 12:40 df_label_200906.gzip
-rw-rw-r-- 1 evan evan 42975911 Mar 25 17:17 fake-and-real-news-dataset.zip
-rw-rw-r-- 1 evan evan        0 Mar 20 12:40 .gitkeep
drwxrwxr-x 2 evan evan     4096 Mar 27 21:29 kaggle


In [3]:
import os
import json

HOME = os.path.expanduser("~")
KAGGLE_JSON = HOME + "/.kaggle/kaggle.json"

# read the credentials; the context manager closes the file afterwards
with open(KAGGLE_JSON) as f:
    data = json.load(f)

USER = data.get("username")
KEY = data.get("key")

In [4]:
%env KAGGLE_CONFIG_DIR=$KAGGLE_JSON
!chmod 600 $KAGGLE_CONFIG_DIR
%env KAGGLE_USERNAME=$USER
%env KAGGLE_KEY=$KEY
!ls $KAGGLE_CONFIG_DIR

env: KAGGLE_CONFIG_DIR=/home/evan/.kaggle/kaggle.json
env: KAGGLE_USERNAME=evantancy
env: KAGGLE_KEY=8d05a9f599eddbe903759ebe9b8bb56f
/home/evan/.kaggle/kaggle.json


In [5]:
!kaggle datasets download -d clmentbisaillon/fake-and-real-news-dataset -p $data_path

fake-and-real-news-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


In [6]:
# -f keeps rm quiet if the files do not exist yet (e.g. on a first run)
!rm -f $data_path/kaggle/True.csv
!rm -f $data_path/kaggle/Fake.csv
!unzip $data_path/fake-and-real-news-dataset.zip -d $data_path/kaggle

Archive:  ../data/fake-and-real-news-dataset.zip
  inflating: ../data/kaggle/Fake.csv  
  inflating: ../data/kaggle/True.csv  
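
The `rm` and `unzip` shell calls above can also be expressed with Python's standard `zipfile` module, which overwrites existing files on extraction, so no prior `rm` is needed. This is only a minimal sketch: the helper name `extract_archive` is hypothetical, and the paths in the commented usage line simply mirror the ones used in this notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)  # files with the same name are overwritten
        return archive.namelist()


# Example call with this notebook's paths (commented out so the block
# has no side effects when run on its own):
# extract_archive("../data/fake-and-real-news-dataset.zip", "../data/kaggle")
```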


# 2. Load Kaggle dataset into DataFrames

In [7]:
# all of our imported libraries
import datetime
import string
import re
import pandas as pd
import numpy as np
import seaborn as sns
import tensorflow as tf
import tensorflow.keras.preprocessing.text as text
import tensorflow.keras.preprocessing.sequence as sequence
from keras.models import Sequential
from keras.layers import (
    Dense,
    Embedding,
    Bidirectional,
    LSTM,
    Dropout,
    Conv1D,
    Flatten,
    GlobalMaxPooling1D,
)
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import nltk

nltk.download("stopwords")

Using TensorFlow backend.
[nltk_data] Downloading package stopwords to /home/evan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
# real and fake news DataFrames
df_real = pd.read_csv("../data/kaggle/True.csv")
df_fake = pd.read_csv("../data/kaggle/Fake.csv")

# 3. Exploring our data

## 3.1 Explore real news data

In [9]:
# check for any NaN values
df_real.isna().sum()

title      0
text       0
subject    0
date       0
dtype: int64

In [10]:
df_real.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [11]:
df_real["subject"].value_counts()

politicsNews    11272
worldnews       10145
Name: subject, dtype: int64

In [12]:
# let's add an is_fake column as a label
df_real["is_fake"] = 0

## 3.2 Explore fake news data

In [13]:
df_fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [14]:
df_fake["subject"].value_counts()

News               9050
politics           6841
left-news          4459
Government News    1570
US_News             783
Middle-east         778
Name: subject, dtype: int64

In [15]:
# let's add an is_fake column as a label
df_fake["is_fake"] = 1

In [16]:
# merging both DataFrames (vertical concatenation)
df_all_news = pd.concat([df_real, df_fake], ignore_index=True)
# shuffle and reindex the new DataFrame
df_all_news = df_all_news.sample(frac=1).reset_index(drop=True)
# drop the subject, date, and title columns: the subject categories differ
# between the real and fake sets, and only the article text is used as input
df_all_news = df_all_news.drop(["subject", "date", "title"], axis=1)

In [17]:
df_all_news.head(10)

Unnamed: 0,text,is_fake
0,UNITED NATIONS (Reuters) - The United States a...,0
1,SYDNEY (Reuters) - Thousands of people rallied...,0
2,TOKYO (Reuters) - Japan s Coast Guard found th...,0
3,Our good friend Brian Pannebecker is a tireles...,1
4,"BERLIN (Reuters) - The German military, buoyed...",0
5,Now that Donald Trump is president and Jeff Se...,1
6,Perhaps the Green Party isn t so useless after...,1
7,The left is truly becoming unhinged! The tensi...,1
8,If you ve been watching the Republican Nationa...,1
9,WASHINGTON (Reuters) - The U.S. House of Repre...,0


# 4. Filtering data
Removing stopwords and punctuation from text

In [18]:
# sets are faster than lists when checking for "not in"
stopword_set = set(nltk.corpus.stopwords.words("english"))
punctuation_set = set(string.punctuation)
# exclude_set = stopword_set.union(punctuation_set)

In [19]:
# removing the words from text
def remove_stopwords(text: str, stop_set):
    # use the stop_set argument (not the global set) and
    # join the kept words with single spaces
    return " ".join(word for word in text.split() if word not in stop_set)


# removing punctuation from text
def remove_punctuation(text: str, punc_set):
    return "".join(char for char in text if char not in punc_set)


def filter_text(text: str, stopwords, punctuation):
    text = text.lower()
    text = remove_punctuation(text, punctuation)
    text = remove_stopwords(text, stopwords)
    return text

In [20]:
# apply the text filtering to the text column
df_all_news["text"] = df_all_news["text"].apply(
    filter_text, args=(stopword_set, punctuation_set)
)
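
To see what the filtering step does, here is a self-contained sketch run on a single sentence. It uses a tiny hand-picked stopword set (an assumption for illustration, not the NLTK list) and `str.translate`, a common alternative to a character loop, for punctuation removal. The names `filter_text_demo`, `DEMO_STOPWORDS`, and `PUNCT_TABLE` are hypothetical.

```python
import string

# Tiny illustrative stopword set; the notebook uses NLTK's English list instead.
DEMO_STOPWORDS = {"the", "a", "is", "of", "on"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def filter_text_demo(text, stop_set=DEMO_STOPWORDS):
    """Lowercase, strip punctuation, then drop stopwords."""
    text = text.lower().translate(PUNCT_TABLE)
    return " ".join(word for word in text.split() if word not in stop_set)


print(filter_text_demo("The head of a committee is on TV!"))
# prints: head committee tv
```

Note that `string.punctuation` covers ASCII characters only, so typographic quotes such as `’` pass through unchanged.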

# 5. Split dataset into train, validation, and test dataset
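Here we use a 60/20/20 train/validation/test split, obtained by calling `train_test_split` twice: first to hold out the test set, then to carve the validation set out of the remaining data.
(This solution is a bit hacky.)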

In [21]:
# first an 80:20 split into (train + validation) and test
random_num = "96"
random_num = int(random_num[::-1])

# train = 0.8, test = 0.2
x_train, x_test, y_train, y_test = train_test_split(
    df_all_news["text"],
    df_all_news["is_fake"],
    test_size=0.2,
    random_state=random_num,
)

# val = 0.25 * 0.8 = 0.2,
# train = 0.75 * 0.8 = 0.6
# test = 0.2 (unchanged)
x_train, x_val, y_train, y_val = train_test_split(
    x_train,
    y_train,
    test_size=0.25,
    random_state=random_num,
)

# aliasing for our variables
train_labels = y_train
test_labels = y_test
val_labels = y_val
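
The 60/20/20 arithmetic can be checked without running the split. The sketch below assumes each held-out portion is rounded up, which is how scikit-learn's `train_test_split` is understood to treat a float `test_size`; the function name `split_sizes` is hypothetical. The totals come from the `value_counts` outputs earlier in this notebook.

```python
import math


def split_sizes(n_samples, test_frac=0.2, val_frac_of_rest=0.25):
    """Return (n_train, n_val, n_test) for two successive hold-out splits.

    Assumption: each held-out size is rounded up.
    """
    n_test = math.ceil(test_frac * n_samples)
    n_rest = n_samples - n_test
    n_val = math.ceil(val_frac_of_rest * n_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test


# 21417 real + 23481 fake articles, summed from the subject counts above
n_total = 21417 + 23481
print(split_sizes(n_total))
# prints: (26938, 8980, 8980)
```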

In [22]:
y_train.value_counts()

1    14134
0    12804
Name: is_fake, dtype: int64

In [23]:
y_test.value_counts()

1    4706
0    4274
Name: is_fake, dtype: int64

From the two value counts above, we can see that the data is well balanced: about 52% of the articles are fake and 48% are real in both the training and test sets.

In [24]:
# tokenizing parameters
vocab_size = 10000
max_len = 300
padding_type = "post"
trunc_type = "post"

# create tokenizer
tokenizer = text.Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(x_train)

# tokenize train, test, and validation data
train_seq = tokenizer.texts_to_sequences(x_train)
test_seq = tokenizer.texts_to_sequences(x_test)
val_seq = tokenizer.texts_to_sequences(x_val)

# pad our sequences
train_seq_padded = sequence.pad_sequences(
    train_seq, maxlen=max_len, padding=padding_type, truncating=trunc_type
)
test_seq_padded = sequence.pad_sequences(
    test_seq, maxlen=max_len, padding=padding_type, truncating=trunc_type
)
val_seq_padded = sequence.pad_sequences(
    val_seq, maxlen=max_len, padding=padding_type, truncating=trunc_type
)
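
The effect of `padding="post"` and `truncating="post"` can be illustrated in plain Python. This is a sketch of the behaviour for a single sequence, not the Keras implementation; the helper name `pad_post` is hypothetical.

```python
def pad_post(seq, maxlen, value=0):
    """Truncate seq to maxlen from the end, then right-pad with value."""
    kept = list(seq[:maxlen])  # truncating="post" keeps the first maxlen tokens
    return kept + [value] * (maxlen - len(kept))  # padding="post" appends filler


print(pad_post([5, 8, 2], maxlen=5))           # prints: [5, 8, 2, 0, 0]
print(pad_post([5, 8, 2, 9, 4, 7], maxlen=5))  # prints: [5, 8, 2, 9, 4]
```

With `max_len = 300`, articles shorter than 300 tokens are filled with zeros at the end, and longer ones lose everything after the 300th token.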

# 6. Use GloVe embedding
Download the embedding archive [here](http://nlp.stanford.edu/data/glove.twitter.27B.zip) and place it into `../data/kaggle`, or run the cell below to fetch it.
The archive must then be extracted, since the code below reads `glove.twitter.27B.100d.txt`.

In [25]:
!wget --no-clobber http://nlp.stanford.edu/data/glove.twitter.27B.zip -P ../data/kaggle/

File ‘../data/kaggle/glove.twitter.27B.zip’ already there; not retrieving.



In [26]:
# relative location of our embedding file
EMBEDDING_FILE = "../data/kaggle/glove.twitter.27B.100d.txt"

In [27]:
def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype="float32")


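# the context manager closes the embedding file once it has been read
with open(EMBEDDING_FILE, encoding="utf-8") as f:
    embeddings_index = dict(get_coefs(*line.rstrip().split(" ")) for line in f)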
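
Each line of a GloVe text file holds a word followed by its vector components, separated by single spaces. The sketch below parses one such line in plain Python to show what the cell above does per line; the function name `parse_glove_line` and the example line are made up for illustration.

```python
def parse_glove_line(line):
    """Split one 'word v1 v2 ... vN' line into (word, [floats])."""
    parts = line.rstrip().split(" ")
    word, values = parts[0], parts[1:]
    return word, [float(v) for v in values]


# Made-up 4-dimensional example line; the file used here has 100 values per word.
word, vector = parse_glove_line("news 0.25 -0.5 1.0 0.125\n")
print(word, vector)
# prints: news [0.25, -0.5, 1.0, 0.125]
```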

In [28]:
# np.stack needs a sequence, so materialise the dict values as a list
all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

word_index = tokenizer.word_index
nb_words = min(vocab_size, len(word_index))

# create embedding matrix
# change below line if computing normal stats is too slow
embedding_matrix = np.random.normal(
    emb_mean, emb_std, (nb_words, embed_size)
)

for word, i in word_index.items():
    if i >= vocab_size:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector



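# 7. Define some training parameters and train models
Note that the `Embedding` layers in the models below are initialised randomly and trained from scratch; `embedding_matrix` from section 6 is not passed to them.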

In [29]:
log_dir = "./logs/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

In [30]:
# override previously set embed_size when reading GloVe embedding
embed_size = 100
num_epochs = 5  # even this low number of epochs is overkill, see val_accuracy below
batch = 128

## 7.1 Model 1: Bidirectional LSTM

In [31]:
# define our LSTM model
model_LSTM = tf.keras.Sequential(
    [
        tf.keras.layers.Embedding(vocab_size, embed_size, input_length=max_len),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ]
)
model_LSTM.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model_LSTM.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 100)          1000000   
_________________________________________________________________
bidirectional (Bidirectional (None, 300, 128)          84480     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 64)                41216     
_________________________________________________________________
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 1,129,921
Trainable params: 1,129,921
Non-trainable params: 0
_________________________________________________________________
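
The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard LSTM count (four gates, each with input weights, recurrent weights, and a bias) and doubles it for the two directions; the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Weights of one LSTM: 4 gates, each with input, recurrent, and bias terms."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weights of a fully connected layer: kernel plus bias."""
    return input_dim * units + units


embedding = 10000 * 100              # vocab_size * embed_size
bilstm_1 = 2 * lstm_params(100, 64)  # two directions, input is the embedding
bilstm_2 = 2 * lstm_params(128, 32)  # input is the 2 * 64 concatenated outputs
dense_1 = dense_params(64, 64)       # input is the 2 * 32 concatenated outputs
dense_2 = dense_params(64, 1)

print(bilstm_1, bilstm_2, dense_1, dense_2)
# prints: 84480 41216 4160 65
print(embedding + bilstm_1 + bilstm_2 + dense_1 + dense_2)
# prints: 1129921
```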


In [32]:
model_LSTM.fit(
    train_seq_padded,
    train_labels,
    epochs=num_epochs,
    batch_size=batch,
    validation_data=(val_seq_padded, val_labels),
    callbacks=[tensorboard_callback],
)

Train on 26938 samples, validate on 8980 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f5b7eb7e0b8>

## 7.2 Model 2: CNN

In [33]:
model_CNN = tf.keras.Sequential(
    [
        tf.keras.layers.Embedding(vocab_size, embed_size, input_length=max_len),
        tf.keras.layers.Conv1D(filters=128, kernel_size=5, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ]
)
model_CNN.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model_CNN.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 100)          1000000   
_________________________________________________________________
conv1d (Conv1D)              (None, 296, 128)          64128     
_________________________________________________________________
global_max_pooling1d (Global (None, 128)               0         
_________________________________________________________________
flatten (Flatten)            (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 65        
Total params: 1,072,449
Trainable params: 1,072,449
Non-trainable params: 0
____________________________________________
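
The `conv1d` row of the summary follows from two small formulas, sketched here under the assumption that the layer uses no padding and a stride of 1 (the Keras defaults, which the cell above does not override). The helper names are hypothetical.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output steps of a 1-D convolution without padding."""
    return (input_length - kernel_size) // stride + 1


def conv1d_params(in_channels, filters, kernel_size):
    """One kernel of size kernel_size * in_channels per filter, plus one bias each."""
    return kernel_size * in_channels * filters + filters


print(conv1d_output_length(300, 5))  # prints: 296
print(conv1d_params(100, 128, 5))    # prints: 64128
```

Global max pooling then keeps one value per filter, which is why the following layer already has shape `(None, 128)` and `Flatten` adds nothing.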

In [34]:
model_CNN.fit(
    train_seq_padded,
    train_labels,
    epochs=num_epochs,
    batch_size=batch,
    validation_data=(val_seq_padded, val_labels),
    callbacks=[tensorboard_callback],
)

Train on 26938 samples, validate on 8980 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f5abc5cd278>

## 7.3 Model 3: Simple DNN

In [35]:
# define our simple DNN model
model_DNN = tf.keras.Sequential(
    [
        tf.keras.layers.Embedding(vocab_size, embed_size, input_length=max_len),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="sigmoid"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ]
)
model_DNN.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model_DNN.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 300, 100)          1000000   
_________________________________________________________________
dense_4 (Dense)              (None, 300, 256)          25856     
_________________________________________________________________
flatten_1 (Flatten)          (None, 76800)             0         
_________________________________________________________________
dense_5 (Dense)              (None, 128)               9830528   
_________________________________________________________________
dense_6 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 65        
Total params: 10,864,705
Trainable params: 10,864,705
Non-trainable params: 0
__________________________________________
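
Most of this model's parameters sit in the first dense layer after `Flatten`. The short calculation below reproduces the relevant rows of the summary; the variable names are chosen for illustration only.

```python
seq_len, embed_dim = 300, 100

# Dense(256) on a (300, 100) input acts on the last axis only,
# so its weights are shared across all 300 positions.
dense_4 = embed_dim * 256 + 256
flattened = seq_len * 256        # Flatten turns (300, 256) into one vector
dense_5 = flattened * 128 + 128  # this layer dominates the model size

print(dense_4, flattened, dense_5)
# prints: 25856 76800 9830528
```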

In [36]:
model_DNN.fit(
    train_seq_padded,
    train_labels,
    epochs=num_epochs,
    batch_size=batch,
    validation_data=(val_seq_padded, val_labels),
    callbacks=[tensorboard_callback],
)

Train on 26938 samples, validate on 8980 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f5abc41cc50>

# 8. Calculating F1 scores and accuracy

## For Model 1, Bidirectional LSTM:

In [37]:
# evaluate model accuracy
_, train_acc_lstm = model_LSTM.evaluate(train_seq_padded, train_labels)
_, test_acc_lstm = model_LSTM.evaluate(test_seq_padded, test_labels)
print(f"Bidirectional LSTM Accuracy: Train = {train_acc_lstm} Test = {test_acc_lstm}")
# f1 score
label_pred_lstm = model_LSTM.predict_classes(test_seq_padded)
label_pred_lstm = label_pred_lstm.ravel()  # flatten predictions to 1-D
f1_lstm = f1_score(test_labels, label_pred_lstm)
print(f"Bidirectional LSTM F1-Score = {f1_lstm}")

Bidirectional LSTM Accuracy: Train = 0.9999257326126099 Test = 0.9979955554008484
Bidirectional LSTM F1-Score = 0.9980879541108987
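
For reference, the F1-score is the harmonic mean of precision and recall for the positive class (here `is_fake = 1`). The plain-Python sketch below shows the calculation on made-up toy values. It also shows thresholding sigmoid outputs at 0.5, which is the usual replacement for `predict_classes` in later TensorFlow releases where that method is no longer available. The helper names `to_labels` and `f1_binary` are hypothetical.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn sigmoid outputs into 0/1 labels by thresholding."""
    return [1 if p > threshold else 0 for p in probabilities]


def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up toy values, not results from this notebook.
y_true = [1, 1, 1, 0, 0]
y_pred = to_labels([0.9, 0.8, 0.3, 0.6, 0.1])
print(y_pred)  # prints: [1, 1, 0, 1, 0]
print(f1_binary(y_true, y_pred))
```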


## For Model 2, CNN:

In [43]:
# evaluate model accuracy
_, train_acc_cnn = model_CNN.evaluate(train_seq_padded, train_labels)
_, test_acc_cnn = model_CNN.evaluate(test_seq_padded, test_labels)
print(f"CNN Accuracy: Train = {train_acc_cnn} Test = {test_acc_cnn}")
# f1 score
label_pred_cnn = model_CNN.predict_classes(test_seq_padded)
label_pred_cnn = label_pred_cnn.ravel()  # flatten predictions to 1-D
f1_cnn = f1_score(test_labels, label_pred_cnn)
print(f"CNN F1-Score = {f1_cnn}")

CNN Accuracy: Train = 0.9999628663063049 Test = 0.9982182383537292
CNN F1-Score = 0.9982996811902233


## For Model 3, Simple DNN:

In [44]:
# evaluate model accuracy
_, train_acc_dnn = model_DNN.evaluate(train_seq_padded, train_labels)
_, test_acc_dnn = model_DNN.evaluate(test_seq_padded, test_labels)
print(f"DNN Accuracy: Train = {train_acc_dnn} Test = {test_acc_dnn}")
# f1 score
label_pred_dnn = model_DNN.predict_classes(test_seq_padded)
label_pred_dnn = label_pred_dnn.ravel()  # flatten predictions to 1-D
f1_dnn = f1_score(test_labels, label_pred_dnn)
print(f"DNN F1-Score = {f1_dnn}")

DNN Accuracy: Train = 0.9999628663063049 Test = 0.999109148979187
DNN F1-Score = 0.9991500212494687


In [42]:
# let's view our training results
# uncomment the line below to launch TensorBoard in this cell
# %tensorboard --logdir $log_dir

# 9. Summary
## Model Choice
Model 1 -> LSTM: [Source](https://www.youtube.com/watch?v=fNxaJsNG3-s&list=PLQY2H8rRoyvzDbLUZkbudP-MFQZwNmU4S)
Derived from Moroney's NLP Zero to Hero series on YouTube. I chose to train this because I wanted to see how a simple model from a mere ~4 minute tutorial would perform, and because LSTMs are well suited to sequential data, which makes them a good fit for this ML task. Both LSTM layers are wrapped in `Bidirectional`.

Model 2 -> CNN: Coming from a computer vision background, where CNNs are widely used, I wanted to see how this architecture would do against the LSTM model. I also chose it because CNNs learn features hierarchically, combining lower-level features into higher-level ones.

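Model 3 -> Simple DNN: I chose to train this to see how a simple DNN with only a few dense layers would fare against the more powerful Bidirectional LSTM and CNN models. Although it is the simplest architecture, it has the most parameters of the three (about 10.9M), because the flattened 76,800-value input feeds a 128-unit dense layer.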

## Model Performance
| Model | Accuracy (Test) | F1-Score |
| --- | --- | --- |
| Bidirectional LSTM | 0.9979955554008484 | 0.9980879541108987 |
| CNN | 0.9982182383537292 | 0.9982996811902233 |
| DNN | 0.999109148979187 | 0.9991500212494687 |

The accuracies and F1-scores above were measured on the test set, using a 60/20/20 train/validation/test split.

In terms of F1-score, DNN > CNN > Bidirectional LSTM. However, the difference is negligible (a maximum difference of 0.11 percentage points).

In terms of test accuracy, DNN > CNN > Bidirectional LSTM. Again, the difference is negligible (a maximum difference of 0.11 percentage points).
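
The spread between the best and worst model can be computed directly from the figures in the performance table. The snippet below is a small sketch of that arithmetic; the variable names are chosen for illustration.

```python
# Test accuracy and F1-score per model, copied from the performance table.
results = {
    "Bidirectional LSTM": (0.9979955554008484, 0.9980879541108987),
    "CNN": (0.9982182383537292, 0.9982996811902233),
    "DNN": (0.999109148979187, 0.9991500212494687),
}

accuracies = [acc for acc, _ in results.values()]
f1_scores = [f1 for _, f1 in results.values()]

# Spread between best and worst model, in percentage points.
acc_spread = (max(accuracies) - min(accuracies)) * 100
f1_spread = (max(f1_scores) - min(f1_scores)) * 100

print(f"accuracy spread: {acc_spread:.2f} points")  # prints: accuracy spread: 0.11 points
print(f"F1 spread: {f1_spread:.2f} points")         # prints: F1 spread: 0.11 points
```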

In terms of training speed, CNN > DNN > Bidirectional LSTM. 

My preferred model is the CNN, which trains quickly and performs well. However, we have to consider that the dataset is very small: the combined file size of `True.csv` and `Fake.csv` is only 111MB. If the dataset were larger, I would choose the Bidirectional LSTM model.