<a href="https://colab.research.google.com/github/utr100/fake-news-detection/blob/main/fake_news_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fake News Detection Using Gated Recurrent Units

The objective of this notebook is to demonstrate how to use a machine learning program to identify when an article might be fake news. The dataset for this challenge can be found on [Kaggle](https://www.kaggle.com/c/fake-news/data). 

To achieve this, we will use a type of neural network built from Gated Recurrent Units (GRUs), which are well suited to sequence data such as text and time series. GRUs can be seen as an improvement over simple Recurrent Neural Networks (RNNs): their gating mechanism alleviates some RNN weaknesses, most notably the very short-term memory of simple RNNs.
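
To make the gating idea concrete, here is a minimal NumPy sketch of a single GRU step. It follows one common textbook formulation (update gate `z`, reset gate `r`, candidate state); the function names, weight layout, sizes, and random initialization are illustrative assumptions, and the Keras implementation used later differs in details such as bias handling.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU time step for a single input vector x and previous state h_prev."""
    W_z, U_z, b_z = params["z"]   # update gate weights
    W_r, U_r, b_r = params["r"]   # reset gate weights
    W_h, U_h, b_h = params["h"]   # candidate state weights

    z = sigmoid(W_z @ x + U_z @ h_prev + b_z)             # how much of the old state to keep
    r = sigmoid(W_r @ x + U_r @ h_prev + b_r)             # how much of the old state feeds the candidate
    h_cand = np.tanh(W_h @ x + U_h @ (r * h_prev) + b_h)  # proposed new state
    return z * h_prev + (1.0 - z) * h_cand                # blend old state and candidate

def init_params(input_dim, units, seed=0):
    """Hypothetical helper: small random weights and zero biases for each gate."""
    rng = np.random.default_rng(seed)
    def gate():
        return (rng.normal(scale=0.1, size=(units, input_dim)),
                rng.normal(scale=0.1, size=(units, units)),
                np.zeros(units))
    return {"z": gate(), "r": gate(), "h": gate()}

# Run a toy sequence of 5 random 4-dimensional inputs through the cell
params = init_params(input_dim=4, units=3)
h = np.zeros(3)
for x in np.random.default_rng(1).normal(size=(5, 4)):
    h = gru_step(x, h, params)
print(h)
```

When `z` is close to 1 the previous state is carried forward almost unchanged, which is how a GRU can retain information over more time steps than a simple RNN. Conventions differ on whether `z` or `1 - z` multiplies the old state.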

## Importing Libraries

We will use [TensorFlow](https://www.tensorflow.org/) to train the GRU model for fake news detection. Alongside TensorFlow, we will use helper functions from [pandas](https://pandas.pydata.org/) and [scikit-learn](https://scikit-learn.org/) to prepare the data.

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

import pandas as pd
import shutil

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
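# compare version numbers numerically; a plain string comparison would misorder e.g. "10.0" and "2.0"
assert tuple(int(p) for p in tf.__version__.split(".")[:2]) >= (2, 0)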

import numpy as np

# Scikit-Learn ≥0.20 is required
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
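# compare version numbers numerically; a plain string comparison would misorder e.g. "0.9" and "0.20"
assert tuple(int(p) for p in sklearn.__version__.split(".")[:2]) >= (0, 20)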

## Getting the Data

Note: Since the training data is larger than 25MB, we cannot upload it to a GitHub repository. Instead, we will use the Kaggle API to download it directly from the source.

Instructions to set up and use the Kaggle API can be found [here](https://github.com/Kaggle/kaggle-api).

If you would prefer not to set up the Kaggle API, you can also download the data to your system, unzip it, and upload it to Colab.

**Note: Skip this section if you are uploading the data yourself. If you are uploading your own data, make sure that you upload both the train.csv and test.csv files to the /content directory.**

In [2]:
# Upload the kaggle.json file into the /content directory and then run the following code
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

In [3]:
# Download the data files
! kaggle competitions download -c fake-news

Downloading submit.csv to /content
  0% 0.00/40.6k [00:00<?, ?B/s]
100% 40.6k/40.6k [00:00<00:00, 16.1MB/s]
Downloading test.csv.zip to /content
 53% 5.00M/9.42M [00:00<00:00, 49.5MB/s]
100% 9.42M/9.42M [00:00<00:00, 60.1MB/s]
Downloading train.csv.zip to /content
 70% 26.0M/37.0M [00:00<00:00, 68.2MB/s]
100% 37.0M/37.0M [00:00<00:00, 106MB/s] 


In [4]:
# Unzip the train and test zip files
shutil.unpack_archive('train.csv.zip')
shutil.unpack_archive('test.csv.zip')

## Reading the Data and Getting Info

In [5]:
full_train_df = pd.read_csv('train.csv')

There are some null values in the dataset (see the non-null counts below). We will drop the rows whose text is null.

In [6]:
full_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


There are three columns which may be useful in determining whether a news article is fake or not - title, author, and text. In our analysis, we will use only the text column to classify the articles as this column contains the most important information.

In [7]:
full_train_df.head()

|   | id | title | author | text | label |
|---|----|-------|--------|------|-------|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


This is a very well-balanced dataset with roughly equal instances of positive and negative classes. This balance is generally favorable for Machine Learning algorithms.

In [8]:
full_train_df['label'].value_counts()

1    10413
0    10387
Name: label, dtype: int64
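
As a quick worked check of the balance claim, the counts printed above can be turned into class shares. This is only a sketch using the two numbers from the output; the variable names are illustrative.

```python
# Label counts copied from the value_counts() output above
counts = {1: 10413, 0: 10387}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"label {label}: {share:.2%}")
# label 1: 50.06%
# label 0: 49.94%
```

A practical consequence is that a classifier which always predicts the majority class would score only about 50% accuracy, so accuracy is a meaningful metric for this dataset.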

## Preparing the Data

We will use the following steps to prepare the data:
1. **Removing null values**: Since we are using the text column in determining whether a news article may be fake, we will have to remove the rows which have null values in the text column.
2. **Separating out the target column**
3. **Splitting the dataset into train, validation, and test**

There are some other preprocessing steps such as **removing stop words** and **stemming/lemmatization** that we could have applied, but in this demonstration, we are going to proceed without these steps.
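
For readers curious about the skipped steps, the sketch below shows roughly what stop-word removal and stemming do. It is not part of this notebook's pipeline: the stop-word list is a tiny illustrative assumption, and `naive_stem` is a crude suffix stripper standing in for a real stemmer (such as the Porter stemmer), not an implementation of one.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "on", "to", "and"}

def naive_stem(word):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, split into word tokens, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The senators are debating the proposed bills"))
# ['senator', 'debat', 'propos', 'bill']
```

Both steps shrink the vocabulary, at the cost of discarding some information that a sequence model might otherwise use.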


In [9]:
# Removing articles with null values in text column
full_train_df = full_train_df[full_train_df['text'].notnull()].reset_index(drop=True).copy()

In [10]:
# Separating out the target column
full_X_df = full_train_df.drop(["label"], axis=1)
full_y_df = full_train_df["label"]

In [11]:
# Splitting the dataset into train, validation and test sets

# 10% data for validation 
X_train_df, X_validation_df, y_train, y_validation = train_test_split(
    full_X_df, full_y_df, test_size=0.10, random_state=42)

# 10% of the remaining data (about 9% of the full dataset) for test
X_train_df, X_test_df, y_train, y_test = train_test_split(
    X_train_df, y_train, test_size=0.10, random_state=42)
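
Because the second split is taken from what remains after the first, the three sets are not 80/10/10. The arithmetic can be sketched as follows, assuming (as I understand `train_test_split` to behave) that a fractional `test_size` is rounded up to a whole number of rows. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (kept, held_out) row counts, rounding the held-out share up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 20761  # rows with non-null text, from the info() output above
n_rest, n_validation = split_sizes(n_total, 0.10)
n_train, n_test = split_sizes(n_rest, 0.10)

for name, n in [("train", n_train), ("validation", n_validation), ("test", n_test)]:
    print(f"{name:<10} {n:>6}  ({n / n_total:.1%})")
# train       16815  (81.0%)
# validation   2077  (10.0%)
# test         1869  (9.0%)
```

The split therefore works out to roughly 81% train, 10% validation, and 9% test.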

## Tokenizing the Text

Machine learning algorithms cannot handle text directly, so text has to be encoded in a numerical format before it can be fed into a model. For our neural network, we will use the `keras.preprocessing.text.Tokenizer` class, which is fitted on text data and then maps each token to an integer (a token is typically a word, but it can also be a character or a subword). For example, if the word **news** has been mapped to the integer **11**, then all instances of the word **news** in the data will be replaced by **11**.

After fitting the tokenizer, we can see that the vocabulary is quite large (208,355 words). We do not need all of these words to make accurate predictions, so we will cap the vocabulary at 15,000 and keep only the most frequent words. All other words are treated as unknown and mapped to the \<unk\> token (see the code below).
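
The idea of a frequency-capped vocabulary with an unknown token can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: `build_vocab` and `encode` are hypothetical helpers, and the real `Tokenizer` also filters punctuation and differs in other details.

```python
from collections import Counter

def build_vocab(texts, num_words):
    """Map the most frequent words to integers; 0 is padding, 1 is the unknown token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {"<pad>": 0, "<unk>": 1}
    # Keep only as many words as fit below num_words, most frequent first
    for word, _ in counts.most_common(num_words - len(word_index)):
        word_index[word] = len(word_index)
    return word_index

def encode(text, word_index):
    """Replace each word by its integer, falling back to <unk> for unseen or rare words."""
    unk = word_index["<unk>"]
    return [word_index.get(word, unk) for word in text.lower().split()]

corpus = ["fake news spreads fast", "real news spreads slowly", "news is news"]
vocab = build_vocab(corpus, num_words=4)
print(vocab)                               # {'<pad>': 0, '<unk>': 1, 'news': 2, 'spreads': 3}
print(encode("fake news spreads", vocab))  # [1, 2, 3]
```

Here "fake" occurs only once, so it falls outside the capped vocabulary and is encoded as the unknown token.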

In [12]:
# converting text column to numpy array
train_text = X_train_df['text'].apply(lambda x: str(x)).to_numpy()

# Initializing and fitting the tokenizer
tokenizer = keras.preprocessing.text.Tokenizer(char_level=False, oov_token='<unk>')
tokenizer.fit_on_texts(train_text)

# creating a token for padding - padding is used to make all training sentences
# the same size by adding the <pad> token to the beginning or end of shorter sentences
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'

In [13]:
# The vocabulary size is quite large
len(tokenizer.word_counts)

208355

In [14]:
# Restricting the vocabulary size
tokenizer.num_words = 15000

## Encoding and Padding the Train, Validation and Test Texts

Now we will use the tokenizer that we have fitted to encode the train, validation, and test datasets. Then we will pad the news articles to make them all of the same length.

In [15]:
# converting text column to numpy array
validation_text = X_validation_df['text'].apply(lambda x: str(x)).to_numpy()
test_text = X_test_df['text'].apply(lambda x: str(x)).to_numpy()

In [16]:
# tokenizing the texts
X_train = tokenizer.texts_to_sequences(train_text)
X_validation = tokenizer.texts_to_sequences(validation_text)
X_test = tokenizer.texts_to_sequences(test_text)

# padding the texts
X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train, padding='post')
X_validation = tf.keras.preprocessing.sequence.pad_sequences(X_validation, padding='post')
X_test = tf.keras.preprocessing.sequence.pad_sequences(X_test, padding='post')
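
What `padding='post'` does can be illustrated with a small pure-Python sketch. `pad_post` is a hypothetical helper that mimics only the default behaviour described here (pad with zeros on the right, up to the longest sequence in the input); it is not the Keras function.

```python
def pad_post(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length in the input."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

encoded = [[5, 8, 2], [7], [3, 9, 4, 6]]
padded = pad_post(encoded)
print(padded)  # [[5, 8, 2, 0], [7, 0, 0, 0], [3, 9, 4, 6]]
```

Since the target length is taken from the longest sequence in each call, the train, validation, and test arrays can come out with different widths at this stage.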

## Final Preparation for Training

As part of the final data preparation, we will perform the following operations on the data:

1. We will truncate each article to 200 words, since the first 200 words should be enough to determine if the article constitutes fake news.
2. We will convert the labels to a NumPy array.
3. We will convert the articles and the labels from NumPy arrays to TensorFlow datasets. This format can then be fed into the neural network.
4. We will batch and prefetch the datasets.

In [17]:
# Keep the first 200 words of each article
X_train = [text[:200] for text in X_train]
X_validation = [text[:200] for text in X_validation]
X_test = [text[:200] for text in X_test]

In [18]:
# converting the labels to numpy array
y_train = y_train.to_numpy().flatten()
y_validation = y_validation.to_numpy().flatten()
y_test = y_test.to_numpy().flatten()

In [25]:
# converting the data from numpy arrays to tensorflow datasets
tfds_train = tf.data.Dataset.from_tensor_slices((X_train, y_train))
tfds_validation = tf.data.Dataset.from_tensor_slices((X_validation, y_validation))
tfds_test = tf.data.Dataset.from_tensor_slices((X_test, y_test))

In [26]:
# batching and prefetching
tfds_train = tfds_train.batch(16).prefetch(1)
tfds_validation = tfds_validation.batch(16).prefetch(1)
tfds_test = tfds_test.batch(16).prefetch(1)
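
To illustrate what batching means, the sketch below slices a list into consecutive groups, which is the essence of what `batch(16)` does to the dataset (prefetching, which prepares the next batch while the current one is being processed, is not shown). The `batches` helper is hypothetical, and the figure of 16,815 training rows is an assumption derived from the row counts and split fractions above.

```python
import math

def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

toy = list(range(10))
print([len(b) for b in batches(toy, 4)])  # [4, 4, 2]  (the last batch is smaller)

n_train = 16815  # assumed number of training rows after the two splits
steps_per_epoch = math.ceil(n_train / 16)
print(steps_per_epoch)  # 1051
```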

## Creating and Training the Model

Finally, it's time to create and train the model. Below are steps to achieve this:

1. We create a sequential model consisting of an embedding layer as the input layer, followed by 2 GRU layers and finally a dense layer with 1 neuron and a sigmoid activation function (the conventional output layer for binary classification).
2. We then compile the model with binary cross-entropy as the loss (typical for binary classification tasks), the Adam optimizer (an adaptive optimizer that usually converges quickly), and accuracy as the metric.
3. Finally, we fit the model to the data and train it for 3 epochs, using the validation dataset created earlier for validation.
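
To get a feel for the size of this architecture, the number of trainable parameters can be estimated by hand. The sketch below assumes the layer sizes used in this notebook (vocabulary of 15,000, embedding size 128, 128 GRU units) and the Keras GRU default `reset_after=True`, which adds a second bias vector per gate; the helper functions are illustrative, and the result is worth checking against `model.summary()`.

```python
def embedding_params(vocab_size, embed_size):
    # One embed_size-dimensional vector per vocabulary entry
    return vocab_size * embed_size

def gru_params(input_dim, units):
    # Three weight blocks (update gate, reset gate, candidate state), each with an
    # input kernel, a recurrent kernel, and two bias vectors (reset_after=True assumed)
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

layers = {
    "embedding": embedding_params(15000, 128),
    "gru_1": gru_params(128, 128),
    "gru_2": gru_params(128, 128),
    "dense": dense_params(128, 1),
}
total = sum(layers.values())

for name, n in layers.items():
    print(f"{name:<10} {n:>10,}")
print(f"{'total':<10} {total:>10,}")
# embedding   1,920,000
# gru_1          99,072
# gru_2          99,072
# dense             129
# total       2,118,273
```

Under these assumptions, about 90% of the parameters sit in the embedding layer, which is one reason capping the vocabulary matters.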

In [21]:
vocab_size = tokenizer.num_words
embed_size = 128
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size, embed_size,
                           mask_zero=True,
                           input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(tfds_train, epochs=3, validation_data=tfds_validation)

Epoch 1/3
Epoch 2/3
Epoch 3/3


At the end of 3 epochs, the accuracy on the training data is 99.33%, whereas the accuracy on the validation data is 96.58%. The gap between the two indicates some overfitting, which could be reduced with regularization, for example by adding dropout layers to the network and training the model again.
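
As a rough illustration of what a dropout layer does during training, here is a NumPy sketch of inverted dropout. It is not the code used in this notebook: the `dropout` function and the 0.25 rate are illustrative assumptions. In Keras, the corresponding tools would be the `keras.layers.Dropout` layer or the `dropout` argument of `keras.layers.GRU`.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations  # dropout is disabled at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True for units that survive
    return activations * mask / keep_prob             # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
hidden = np.ones((4, 8))
dropped = dropout(hidden, rate=0.25, rng=rng)
print(dropped)
```

Because a different random subset of units is silenced on every training step, the network cannot rely too heavily on any single unit, which tends to narrow the gap between training and validation accuracy.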

## Evaluation on Test Data

We now evaluate our model on the test set that we held out earlier (10% of the data that remained after the validation split). The report below shows that precision, recall, and F1-score are reasonably high for both classes (about 0.97), which indicates that the classifier is performing well on the test data.

In [27]:
# make predictions using model
predictions = model.predict(tfds_test)

# function to convert the scores to binary classes (0 or 1)
def get_class(score):
  return 1 if score > 0.5 else 0

get_class_v = np.vectorize(get_class)

# get the binary class values
y_pred = get_class_v(predictions)

# get the classification report which contains precision and recall values for each class
target_names = ['reliable', 'unreliable']
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

    reliable       0.96      0.98      0.97       927
  unreliable       0.98      0.96      0.97       942

    accuracy                           0.97      1869
   macro avg       0.97      0.97      0.97      1869
weighted avg       0.97      0.97      0.97      1869
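
For reference, the three per-class metrics in the report can be computed by hand from true and predicted labels. The sketch below uses a small made-up example rather than this notebook's predictions, and `binary_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 1 = unreliable, 0 = reliable
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 0, 1]
precision, recall, f1 = binary_metrics(y_true_toy, y_pred_toy)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Precision answers "of the articles flagged as unreliable, how many really were?", while recall answers "of the unreliable articles, how many were flagged?"; F1 is their harmonic mean.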



## Creating Submission File

We now create the submission file using the data in the unlabeled "test.csv" file.

In [28]:
test_df_final = pd.read_csv('test.csv')

In [29]:
test_df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5200 entries, 0 to 5199
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      5200 non-null   int64 
 1   title   5078 non-null   object
 2   author  4697 non-null   object
 3   text    5193 non-null   object
dtypes: int64(1), object(3)
memory usage: 162.6+ KB


In [30]:
# Fill null values with empty string
test_df_final.replace(np.nan, '', inplace=True)

# converting text column to numpy array
test_text_final = test_df_final['text'].apply(lambda x: str(x)).to_numpy()

# tokenizing and padding the texts
X_test_final = tokenizer.texts_to_sequences(test_text_final)
X_test_final = tf.keras.preprocessing.sequence.pad_sequences(X_test_final, padding='post')

# Keep the first 200 words of each article
X_test_final = [text[:200] for text in X_test_final]

# converting the data from numpy arrays to tensorflow datasets, batching and prefetching
tfds_test_final = tf.data.Dataset.from_tensor_slices((X_test_final))
tfds_test_final = tfds_test_final.batch(16).prefetch(1)

# get the predictions
predictions_final = model.predict(tfds_test_final)

# get the binary class values
y_pred_final = get_class_v(predictions_final)

In [33]:
# create and save final submission df
test_df_final['label'] = y_pred_final.ravel()  # flatten the (n, 1) predictions to 1-D
test_df_final = test_df_final[['id', 'label']]
test_df_final.to_csv('submit.csv', index=False)

## Testing on Custom Text

In this section, we can try out the model on our own text.

In [47]:
#@title Fake News Detector

new_text = "Our house is burning. Literally. The Amazon rain forest - the lungs which produces 20% of our planet\u2019s oxygen - is on fire. It is an international crisis. Members of the G7 Summit, let's discuss this emergency first order in two days!" #@param {type:"string"}

new_text_processed = [new_text]

X_new = tokenizer.texts_to_sequences(new_text_processed)
X_new = tf.keras.preprocessing.sequence.pad_sequences(X_new, padding='post')

# Keep the first 200 words of each article
X_new = [text[:200] for text in X_new]

tfds_new = tf.data.Dataset.from_tensor_slices((X_new))
tfds_new = tfds_new.batch(16).prefetch(1)

# predict returns an array of shape (1, 1); extract the single score as a Python float
score = float(model.predict(tfds_new)[0, 0])
classification = "unreliable" if score > 0.5 else "reliable"

print(f"Score: {round(score, 4)}")
print(f"The news is likely {classification}")

Score: 0.9995
The news is likely unreliable


## Conclusion and Improvement Ideas

It was possible to train a fairly performant classifier with a minimum of data preprocessing and a relatively simple model, with 1 embedding layer, 2 GRU layers and a single neuron for producing the final classification score. There are a number of improvements that can be applied to this simple workflow in order to improve its performance. Listed below are some ideas for improvement:
1. Text preprocessing steps such as removing stop words and stemming/lemmatization can be applied to the data before encoding it.
2. A pre-trained embedding can be used in the embedding layer. These embeddings are usually trained on a much larger corpus and often perform better than an embedding trained from scratch.
3. The embedding and GRU architecture can be replaced with a Transformer-based architecture, the current state of the art in Natural Language Processing tasks. One example is BERT (Bidirectional Encoder Representations from Transformers), which is available in different sizes as a pre-trained model and can be fine-tuned for many tasks, including classification.
