# News: Fake vs Real

The aim of this notebook is to show how to train an NLP model to detect fake news. Fake news is a growing problem, with disinformation widely shared on social media, so a model that detects and filters out such stories could help restore trust in the news we consume. The dataset contains approximately 20,000 examples each of real and fake news stories, and every example consists of a title, text, subject and date. This notebook preprocesses the text, uses GloVe word embeddings to map each word to a numeric vector, then passes the resulting sequences through an LSTM network, which classifies each example as real or fake.

# Imports

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
import string, re
from bs4 import BeautifulSoup
from wordcloud import WordCloud
from sklearn.metrics import classification_report, confusion_matrix
from keras.preprocessing import text, sequence
from nltk.tokenize.toktok import ToktokTokenizer
import keras
from keras.models import Sequential
from keras.layers import Dense,Embedding,LSTM,Dropout
from keras.callbacks import ReduceLROnPlateau
import tensorflow as tf

# Loading Datasets

First, let's load the datasets.

In [None]:
# Set the filepaths for each dataset
fake_filepath = '../input/fake-and-real-news-dataset/Fake.csv'
true_filepath = '../input/fake-and-real-news-dataset/True.csv'

# Read the csv files from the filepaths
fake_data = pd.read_csv(fake_filepath)
true_data = pd.read_csv(true_filepath)

The first few rows of the true stories are as follows:

In [None]:
true_data.head()

While the first few rows of the false stories look like this:

In [None]:
fake_data.head()

Let's create one big dataframe containing both real and fake stories, with a column named 'fake' that is 1 for fake stories and 0 for real ones.

In [None]:
# Add a 'fake' column to each dataframe
true_data['fake'] = 0
fake_data['fake'] = 1

# concatenate the two together
data = pd.concat([true_data, fake_data], ignore_index=True)

data.info()

# Initial Data Exploration and Data Cleaning

The easiest column to explore initially is the subject column. Let's first plot the distribution of the different subject titles, and see how this relates to the truth value of the story.

In [None]:
plt.figure(figsize=(12, 6))

sns.countplot(x='subject', hue='fake', data=data)

Unfortunately, the subject almost perfectly predicts whether a story is fake, so we must remove the subject column to stop the model from learning a shortcut that is specific to this dataset.

In [None]:
del data['subject']

data.head()
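
A quick way to confirm this kind of leakage before dropping a column is to check whether each subject only ever appears with one label. The sketch below does this on a handful of made-up rows; the subject names and the helpers `labels_per_subject` and `is_leaky` are illustrative assumptions, not part of the original notebook.

```python
from collections import defaultdict

# Hypothetical toy rows: (subject, fake) pairs standing in for the real dataframe
rows = [
    ("politicsNews", 0), ("politicsNews", 0), ("worldnews", 0),
    ("News", 1), ("politics", 1), ("politics", 1),
]

def labels_per_subject(pairs):
    """Map each subject to the set of labels it appears with."""
    seen = defaultdict(set)
    for subject, label in pairs:
        seen[subject].add(label)
    return dict(seen)

def is_leaky(pairs):
    """True if every subject appears with exactly one label."""
    return all(len(labels) == 1 for labels in labels_per_subject(pairs).values())

print(labels_per_subject(rows))
print(is_leaky(rows))
```

If `is_leaky` returns `True`, a model could reach perfect accuracy from the subject alone, which is why the column is dropped here.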

Let's also remove the date column.

In [None]:
del data['date']

data.head()

It would be useful to have the title and text of the article concatenated into one 'text' column, for easier modelling.

In [None]:
data['text'] = data['title'] + " " + data['text']

del data['title']

data.head()

# Text Cleaning

In order to pass text into a neural network model, we must first perform a number of preprocessing steps. First, we remove stopwords from the text. Stopwords are words such as 'a', 'the' and 'and', which occur often in written text but add little substance. We use the English stopword list from the NLTK stopwords corpus.

In [None]:
stop = set(stopwords.words('english'))
# add punctuation to the list of stopwords
punctuation = list(string.punctuation)
stop.update(punctuation)
# these were common words at the start of almost every real article, so I chose to remove them so the model is not biased
stop.update(['washington', 'reuters', '(reuters)'])

We now strip HTML, text in square brackets and URLs from each article, then remove the stopwords and standalone punctuation tokens.

In [None]:
# Removing HTML markup
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

# Removing text between square brackets
def remove_between_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)

# Removing URLs (this needs its own name, otherwise it overwrites the function above)
def remove_urls(text):
    return re.sub(r'http\S+', '', text)

# Removing the stopwords from text
def remove_stopwords(text):
    final_text = []
    for i in text.split():
        if i.strip().lower() not in stop:
            final_text.append(i.strip())
    return " ".join(final_text)

# Removing the noisy text
def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    text = remove_urls(text)
    text = remove_stopwords(text)
    return text

# Apply the function to the text column
data['text'] = data['text'].apply(denoise_text)

data.head()
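
To make the effect of the cleaning steps concrete, here is a minimal standalone sketch of the same pipeline on one made-up sentence. It is an approximation under stated assumptions: `toy_stop` is a tiny hand-written stop set rather than NLTK's list, a simple regular expression stands in for BeautifulSoup, and `toy_denoise` and `sample` are hypothetical names.

```python
import re

# Tiny illustrative stop set; the notebook uses NLTK's full English list instead
toy_stop = {"the", "a", "and", "in", "washington", "reuters", "(reuters)"}

def toy_denoise(text):
    text = re.sub(r"<[^>]+>", "", text)     # crude HTML tag removal (stand-in for BeautifulSoup)
    text = re.sub(r"\[[^]]*\]", "", text)   # drop anything inside square brackets
    text = re.sub(r"http\S+", "", text)     # drop URLs
    kept = [w for w in text.split() if w.lower() not in toy_stop]
    return " ".join(kept)

sample = "<p>WASHINGTON (Reuters) The senator spoke [video] in a hearing https://example.com/clip</p>"
print(toy_denoise(sample))  # senator spoke hearing
```

Note that the order matters: tags are removed first so that a closing tag glued to a URL does not get swallowed by the URL pattern.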

Let's now visualise how each set of articles looks with a wordcloud.

In [None]:
# Text that is Fake
plt.figure(figsize = (20,20)) 
wc = WordCloud(max_words = 2000 , width = 1600 , height = 800 , stopwords = stop).generate(" ".join(data[data.fake == 1].text))
plt.imshow(wc , interpolation = 'bilinear')

In [None]:
# Text that is not Fake
plt.figure(figsize = (20,20)) 
wc = WordCloud(max_words = 2000 , width = 1600 , height = 800 , stopwords = stop).generate(" ".join(data[data.fake == 0].text))
plt.imshow(wc , interpolation = 'bilinear')

There is no obvious pattern in the word clouds, but we do see a slight bias towards 'Hillary Clinton' and 'Barack Obama' in the fake articles.

Now, let's split the data into training, development and testing sets with a 60-20-20 split. It is important that each set contains roughly the same proportion of fake and true stories.

In [None]:
# set random seed for reproducibility
np.random.seed(1)

def train_dev_test_split(df, target_col, proportions):
    """
    Helper function to randomly split a binary classification dataset into training, dev and test sets, 
    with proportional amounts of positive and negative examples
    """
    # compare with a tolerance, since exact float equality can fail for other proportions
    assert np.isclose(sum(proportions), 1.0)
    train_size, dev_size, test_size = proportions
    
    # separate the lists of indices for each value of the target variable
    false_index = df[df[target_col] == 1].index
    true_index = df[df[target_col] == 0].index
    
    # randomly choose train_size% of the positive and negative examples for the training set
    false_train_indices = np.random.choice(false_index, int(train_size * len(false_index)), replace=False)
    true_train_indices = np.random.choice(true_index, int(train_size * len(true_index)), replace=False)
    train_indices = list(false_train_indices) + list(true_train_indices)
    train = df.iloc[train_indices]
    
    # select the remaining data to split between dev and test sets
    rem_df = df.iloc[list(set(df.index)-set(train_indices))]
    rem_false_index = rem_df[rem_df[target_col] == 1].index
    rem_true_index = rem_df[rem_df[target_col] == 0].index
    false_dev_indices = np.random.choice(rem_false_index, int((dev_size / (dev_size + test_size)) * len(rem_false_index)), replace=False)
    true_dev_indices = np.random.choice(rem_true_index, int((dev_size / (dev_size + test_size)) * len(rem_true_index)), replace=False)
    dev_indices = list(false_dev_indices) + list(true_dev_indices)
    dev = df.iloc[dev_indices]

    # finally create the test set
    test_indices = list(set(df.index) - set(train_indices + dev_indices))
    test = df.iloc[test_indices]
    return train, dev, test

train, dev, test = train_dev_test_split(data, 'fake', (0.6, 0.2, 0.2))
X_train, X_dev, X_test = train.text, dev.text, test.text
y_train, y_dev, y_test = train.fake, dev.fake, test.fake
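
For comparison, the same kind of proportional 60-20-20 split can be sketched with scikit-learn's `train_test_split` and its `stratify` argument, applied twice. This is an alternative to the helper above, not the method the notebook uses; the toy `texts` and `labels` and the variable names are made up for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy data: 100 texts with a 50/50 label split (illustrative only)
texts = [f"story {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# First carve off 60% for training, stratifying on the label
X_tr, X_rest, y_tr, y_rest = train_test_split(
    texts, labels, test_size=0.4, stratify=labels, random_state=1)

# Then split the remaining 40% evenly into dev and test sets
X_dv, X_te, y_dv, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=1)

print(len(X_tr), len(X_dv), len(X_te))
print(sum(y_tr) / len(y_tr), sum(y_dv) / len(y_dv), sum(y_te) / len(y_te))
```

Stratifying on the label keeps the share of fake stories the same in all three sets, which is the property the helper function is designed to guarantee.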

So now the training set, dev set and test set are stored in train, dev and test respectively. As the plots below show, each has the same proportion of real and fake stories as the full dataset.

In [None]:
sns.countplot(x='fake', data=data)

In [None]:
sns.countplot(x='fake', data=train)

In [None]:
sns.countplot(x='fake', data=dev)

In [None]:
sns.countplot(x='fake', data=test)

We now tokenise the text, assigning each word a unique integer so that it can be looked up in the embedding matrix later. We cap the vocabulary at the 10,000 most frequent words. We also pad each article with zeros, or truncate it, so that every sequence is exactly 300 tokens long.

In [None]:
# Set max words and max length hyperparameters
max_features = 10000
max_len = 300

In [None]:
# Fit the tokenizer on the training data
tokenizer = text.Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(X_train)

In [None]:
# Tokenize and pad each set of texts
tokenized_train = tokenizer.texts_to_sequences(X_train)
X_train = sequence.pad_sequences(tokenized_train, maxlen=max_len)

tokenized_dev = tokenizer.texts_to_sequences(X_dev)
X_dev = sequence.pad_sequences(tokenized_dev, maxlen=max_len)

tokenized_test = tokenizer.texts_to_sequences(X_test)
X_test = sequence.pad_sequences(tokenized_test, maxlen=max_len)
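
The following pure-Python sketch illustrates what tokenising and padding do to the text. It is a simplified stand-in for the Keras `Tokenizer` and `pad_sequences` (it skips punctuation filtering, for instance), and the helper names `fit_word_index`, `to_sequence` and `pad` are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most common word gets index 1 (0 is reserved for padding)."""
    counts = Counter(word for t in texts for word in t.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    """Replace known words with their index, dropping unknown or too-rare words."""
    ids = (word_index.get(word) for word in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]

def pad(seq, maxlen):
    """Keep the last maxlen tokens and left-pad with zeros, mirroring the Keras defaults."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

toy_texts = ["trump says trump wins", "senate says no"]
toy_index = fit_word_index(toy_texts)
toy_padded = [pad(to_sequence(t, toy_index, num_words=4), maxlen=5) for t in toy_texts]
print(toy_index)   # {'trump': 1, 'says': 2, 'wins': 3, 'senate': 4, 'no': 5}
print(toy_padded)  # [[0, 1, 2, 1, 3], [0, 0, 0, 0, 2]]
```

As in the notebook, the word index is fitted on the training texts only, so words that appear only in the dev or test sets are simply dropped.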

# Designing and Training the Model

Now that all the text has been tokenised and padded, we are ready to create the model. The first step in the model is to use the GloVe embeddings to map each word to a pre-trained feature vector that captures something of its meaning, instead of an arbitrary integer token. We load the GloVe embeddings (100-dimensional vectors trained on Twitter data) from an external file.

In [None]:
EMBEDDING_FILE = '../input/glove-twitter/glove.twitter.27B.100d.txt'

In [None]:
# Create a dictionary of words and their feature vectors from the embedding file
def get_coefs(word, *arr): 
    return word, np.asarray(arr, dtype='float32')
# Each line holds a word followed by its vector components, separated by spaces
embeddings_index = dict(get_coefs(*o.rstrip().rsplit(' ')) for o in open(EMBEDDING_FILE, encoding='utf-8'))

In [None]:
all_embs = np.stack(list(embeddings_index.values()))
emb_mean,emb_std = all_embs.mean(), all_embs.std()

# Find dims of embedding matrix
embed_size = all_embs.shape[1]

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))

# Randomly initialize the embedding matrix
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

# Add each vector to the embedding matrix, corresponding to each token that we set earlier
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
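
The construction above can be seen in miniature below. This is a sketch with made-up 3-dimensional vectors in place of GloVe; `toy_vectors`, `toy_word_index` and the word "zxqv" are illustrative assumptions. It shows that words without a pre-trained vector keep their random row, and that embedding a token sequence amounts to row indexing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-dimensional "pre-trained" vectors standing in for GloVe
toy_vectors = {
    "news": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "fake": np.array([0.9, 0.8, 0.7], dtype="float32"),
}
# Tokenizer-style index: 0 is reserved for padding, "zxqv" has no pre-trained vector
toy_word_index = {"news": 1, "fake": 2, "zxqv": 3}
vocab_size, dim = 4, 3

# Start from random rows, then overwrite rows whose word has a pre-trained vector
matrix = rng.normal(0.0, 0.1, (vocab_size, dim))
for word, i in toy_word_index.items():
    vec = toy_vectors.get(word)
    if i < vocab_size and vec is not None:
        matrix[i] = vec

# Looking up a padded token sequence is just row indexing
sequence_ids = np.array([0, 2, 1])
embedded = matrix[sequence_ids]
print(embedded.shape)  # (3, 3)
```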

Now that the embedding matrix is created, we need to set some other hyperparameters before finally creating the network architecture. We train with mini-batch gradient descent, so we need to set a batch size and a number of epochs. These are parameters that can be tuned!

In [None]:
batch_size = 256
epochs = 10
embed_size = 100

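To help training converge, we also add learning rate reduction: the learning rate is halved whenever the dev accuracy has not improved for 2 epochs, down to a minimum of 0.00001.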

In [None]:
learning_rate_reduction = ReduceLROnPlateau(monitor='val_accuracy', patience = 2, verbose=1,factor=0.5, min_lr=0.00001)
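
To see what this callback does to the learning rate, here is a simplified simulation of the plateau rule in plain Python. It is a sketch of the idea only: it ignores the callback's `min_delta` and `cooldown` options, and the function name and the accuracy values are made up.

```python
def simulate_reduce_on_plateau(metric_history, lr, factor=0.5, patience=2, min_lr=1e-5):
    """Return the learning rate in effect after each epoch for a metric we want to maximise."""
    best = float("-inf")
    wait = 0
    lrs = []
    for value in metric_history:
        if value > best:
            best = value      # new best score: reset the patience counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)   # shrink the rate, but never below min_lr
                wait = 0
        lrs.append(lr)
    return lrs

toy_val_acc = [0.90, 0.95, 0.95, 0.94, 0.96, 0.96, 0.96]
print(simulate_reduce_on_plateau(toy_val_acc, lr=0.01))
# [0.01, 0.01, 0.01, 0.005, 0.005, 0.005, 0.0025]
```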

The model is designed as follows: we first have the embedding layer, which maps each token to its feature vector. Each sequence of feature vectors is then passed through two LSTM layers, of 128 and 64 units respectively, followed by a 32-unit fully connected layer and a single output unit with a sigmoid activation function.

In [None]:
#Defining Neural Network
model = Sequential()
#Non-trainable embedding layer
model.add(Embedding(max_features, output_dim=embed_size, weights=[embedding_matrix], input_length=max_len, trainable=False))
#LSTM 
model.add(LSTM(units=128 , return_sequences = True , recurrent_dropout = 0.25 , dropout = 0.25))
model.add(LSTM(units=64 , recurrent_dropout = 0.1 , dropout = 0.1))
model.add(Dense(units = 32 , activation = 'relu'))
model.add(Dense(1, activation='sigmoid'))
# `learning_rate` replaces the deprecated `lr` argument
model.compile(optimizer=keras.optimizers.Adam(learning_rate = 0.01), loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
model.summary()
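
As a sanity check on the summary, the parameter counts can be worked out by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias) and the layer sizes used above; `lstm_params` and `dense_params` are hypothetical helper names.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output unit
    return input_dim * units + units

embedding = 10000 * 100          # max_features * embed_size, frozen (not trainable)
lstm_1 = lstm_params(100, 128)   # 117248
lstm_2 = lstm_params(128, 64)    # 49408
dense_1 = dense_params(64, 32)   # 2080
dense_out = dense_params(32, 1)  # 33

trainable = lstm_1 + lstm_2 + dense_1 + dense_out
print(embedding, trainable)      # 1000000 168769
```

Almost all of the weights sit in the frozen embedding layer, so only around 170,000 parameters are actually updated during training.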

Now we sit back and relax while backpropagation works its magic!

In [None]:
history = model.fit(X_train, y_train, batch_size = batch_size , validation_data = (X_dev,y_dev) , epochs = epochs , callbacks = [learning_rate_reduction])

In [None]:
# Save the model for future use
model.save("model")

# Model Performance

Our first step is simply to check the accuracy of the model on each of the sets. 

In [None]:
print("Accuracy of the model on Training Data is - " , model.evaluate(X_train,y_train)[1]*100)
print("Accuracy of the model on Dev Data is - " , model.evaluate(X_dev,y_dev)[1]*100)
print("Accuracy of the model on Test Data is - " , model.evaluate(X_test,y_test)[1]*100)

The model performs remarkably well, achieving over 99.5% accuracy on each of the three sets. Let's now see how the loss and the accuracy evolved over training.

In [None]:
# One x-axis point per epoch actually trained, rather than a hard-coded 10
epochs = list(range(len(history.history['accuracy'])))
fig , ax = plt.subplots(1,2)
train_acc = history.history['accuracy']
train_loss = history.history['loss']
val_acc = history.history['val_accuracy']
val_loss = history.history['val_loss']
fig.set_size_inches(20,10)

ax[0].plot(epochs , train_acc , 'go-' , label = 'Training Accuracy')
ax[0].plot(epochs , val_acc , 'ro-' , label = 'Dev Accuracy')
ax[0].set_title('Training & Dev Accuracy')
ax[0].legend()
ax[0].set_xlabel("Epochs")
ax[0].set_ylabel("Accuracy")

ax[1].plot(epochs , train_loss , 'go-' , label = 'Training Loss')
ax[1].plot(epochs , val_loss , 'ro-' , label = 'Dev Loss')
ax[1].set_title('Training & Dev Loss')
ax[1].legend()
ax[1].set_xlabel("Epochs")
ax[1].set_ylabel("Loss")
plt.show()

We can see that the loss decreases on almost every epoch, which suggests that further training might improve accuracy slightly. Let's now examine the classification report and confusion matrix of the model.

In [None]:
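# predict_classes was removed from recent Keras releases, so threshold the sigmoid output instead
pred = (model.predict(X_test) > 0.5).astype("int32")

# Label 0 is 'Not Fake' and label 1 is 'Fake', so target_names must follow that order
print(classification_report(y_test, pred, target_names = ['Not Fake','Fake']))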

In [None]:
confusion_matrix(y_test,pred)
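
For reading the matrix, recall that with labels 0 and 1 scikit-learn lays it out as `[[tn, fp], [fn, tp]]`, with rows for the true label and columns for the prediction. The sketch below recomputes those four counts by hand on made-up labels, treating 'fake' (1) as the positive class; `confusion_counts` and the toy lists are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp), treating label 1 ('fake') as the positive class."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

toy_true = [0, 0, 0, 1, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(toy_true, toy_pred)
precision = tp / (tp + fp)   # share of predicted fakes that really are fake
recall = tp / (tp + fn)      # share of real fakes that were caught
print([[tn, fp], [fn, tp]])  # [[3, 1], [1, 3]]
print(precision, recall)     # 0.75 0.75
```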

The confusion matrix shows no particular bias towards false negatives or false positives, and the model performs very well in general! Now, let's try out our model on some articles taken from the web. The first is a BBC news article about the Portland protests (not fake), while the second is a well-known fake news article about Hillary Clinton. We apply the same preprocessing steps as before to the text, then feed each one into the model.

In [None]:
real_title = "Portland protests: Trump threatens to send officers to more US cities"
real_text = "President Donald Trump has threatened to send more federal law enforcement officers to major US cities to control ongoing protests. \
Mr Trump on Monday criticised a number of cities run by 'liberal Democrats', including Chicago and New York, saying their leaders were afraid to act. \
He said officers sent to Oregon had done a 'fantastic job' restoring order amid days of protests in Portland. \
Democrats accuse Mr Trump of trying to rally his Conservative base. \
President Trump, a Republican, has been trailing in opinion polls behind his Democratic rival, Joe Biden, ahead of November's election. \
Last month, Mr Trump declared himself the 'president of law and order' in the wake of widespread protests over the death in police custody of African-American man George Floyd. \
Speaking at the White House on Monday, Mr Trump reiterated his call for law and order. \
'We're sending law enforcement,' he told reporters. 'We can't let this happen to the cities.' \
He specifically named New York City, Chicago, Philadelphia, Detroit, Baltimore and Oakland in discussing problems with violence."

fake_title = "FBI Agent, Who Exposed Hillary Clinton Cover-up, Found Dead"
fake_text = "An FBI Special Agent, who was anticipated to expose the extent of Clinton and Obama malpractice and corruption in the “Operation Fast and Furious” \
cover-up before a US Federal Grand Jury, has been found dead at his home. The FBI official’s wife was also found dead at the scene with the couple both being murdered \
using the 52-year-old agent’s own gun.Special Agent David Raynor was “stabbed multiple times” and “shot twice with his own weapon,” according to local media reports. \
Raynor’s tragic death comes just one day before he was due to testify before a US Federal Grand Jury. \
He was widely expected to testify that Hillary Clinton acted illegally to protect Obama administration crimes while covering up the Fast and Furious scandal. \
Raynor’s wife, Donna Fisher, was also found dead at the scene. An autopsy will be completed to determine the exact cause of death, according to police. According to the \
Baltimore Sun:Authorities, who are offering a $215,000 reward for tips in Suiter’s killing, have struggled to understand what happened. \
The detective was shot with his own gun, which was found at the scene. Two other shots were fired from the gun, and Davis said there were signs of a brief \
struggle.Special Agent Raynor’s suspicious death is the latest in a sequence of disturbing deaths in Baltimore connected to the Clinton/Obama cover-up of \
Operation Fast and Furious.When President Trump took power, the US Justice Department opened another investigation into Operation Fast and Furious as it pertained \
to the Baltimore Police Department and impaneled a US Federal Grand Jury. One of the main witnesses was Detective Sean Suiter, an 18-year veteran of the FBI.However, \
Detective Suiter was gunned down in November, in eerily similar circumstances to Special Agent Raynor, also one day before he could testify. \
Special Agent Raynor was leading US Deputy Attorney General Rod Rosenstein’s and FBI Director Christopher Wray’s investigation into the murder of Detective Sean Suiter. \
Raynor believed Suiter was silenced before he could testify that the Obama administration was criminally complicit in allowing guns to flow into the hands of criminals on the \
Mexican border. \
These guns were involved in the murder of a US Federal Officer, among others, and is seen by investigators as the “Achilles heel of the Obama regime.”The murder of Border \
Patrol Agent Brian Terry is one of but a very few Obama administration crimes that have no statute of limitations as it involved the killing of a US Federal Officer. \
Leaked Wikileaks emails also prove Hillary Clinton was fully knowledgeable about the crime—thus making her liable to criminal charges. \
Last’s week’s bombshell Inspector General’s reports have exposed yet more Hillary Clinton and Obama Administration crimes.The report, that was released last Thursday, \
revealed that the FBI had discovered evidence that Hillary Clinton and the Clinton Foundation had committed “sexual crimes against children.” The report also shows that Obama \
lied to cover-up parts of these investigations that exposed child trafficking.However, the IG report proves that the evidence of these crimes has been covered-up and \
swept under the carpet by those acting at the highest levels."

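# Denoise and tokenise the title and article contents and add them to a list
# (named new_texts so that it does not shadow the imported keras `text` module)
new_texts = [denoise_text(real_title + " " + real_text), denoise_text(fake_title + " " + fake_text)]

tokenized_new = tokenizer.texts_to_sequences(new_texts)
X_new = sequence.pad_sequences(tokenized_new, maxlen=max_len)

# Make predictions on the new examples by thresholding the sigmoid output at 0.5
# (predict_classes was removed from recent Keras releases)
predictions = (model.predict(X_new) > 0.5).astype("int32")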

mapping = {0: 'Not fake',
           1: 'Fake'}

predictions = np.vectorize(mapping.get)(predictions)

print("BBC News article is predicted to be: {}".format(predictions[0, 0]))
print("Fake news article is predicted to be: {}".format(predictions[1, 0]))

Unsurprisingly, the model gets it spot on! A really satisfactory result on a really fun dataset to work with.

# **I hope you enjoyed reading this and found it informative. If you have any questions let me know in the comments!**