# Comparing fake and real news classifiers

This Notebook explores if and to what extent a Neural Network could perform better than a Logistic Regression classifier. It is built upon the "Fake and real news dataset" available in Kaggle [here](https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset).

The dataset consists of 2 files: one holding real news and the other holding fake ones. The general purpose of the classifier is to determine if an article is fake news or not. The work below compares success metrics achieved with a Logistic Regression Classifier and with a Neural Network. The former follows the steps and reproduces the results of "News_Classifier_98%", published at this [link](https://www.kaggle.com/shawnbalu/news-classifier-98).

All main stages are organised in separate chapters.

### Imports

In [None]:
# Import main libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
import csv
import io
import re
from IPython.display import Image

In [None]:
# Import text processing libraries 
import nltk
from nltk.corpus import stopwords
import string
from string import punctuation

In [None]:
# download "stopwords"
nltk.download('stopwords')

In [None]:
# Import scikit learn modules
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

In [None]:
# Import TensorFlow modules
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, Bidirectional, LSTM

from tensorflow.keras import backend as K

## 1. Load datasets and insert "label" column

True news are loaded and stored in a variable. A brief check shows that no feature indicates whether the text concerns a real story. Therefore, a new column "label" is added with values of "1", indicating true news.

In [None]:
true_news = pd.read_csv("../input/fake-and-real-news-dataset/True.csv")

In [None]:
true_news

In [None]:
true_news.insert(0,"label", 1)

In [None]:
true_news.head()

Similarly, fake news are loaded and stored in another variable. The same operations are applied to this dataset, except that the "label" column holds values of "0", indicating fake news.

In [None]:
fake_news = pd.read_csv("../input/fake-and-real-news-dataset/Fake.csv")

In [None]:
fake_news

In [None]:
fake_news.insert(0, "label", 0)

In [None]:
fake_news.head()

There are 21417 entries in the "true" dataset and 23481 entries in the "fake" one. Both dataframes are concatenated to form a single table.

In [None]:
true_and_fake = pd.concat([true_news, fake_news])

In [None]:
true_and_fake.shape

## 2. Prepare and preprocess data

Now, the new table has 44898 rows and 5 columns. To avoid possible distortions, duplicated news are removed.

In [None]:
true_and_fake.drop_duplicates(inplace = True)

A brief check shows that around 200 entries were duplicated.

In [None]:
true_and_fake.shape

Modelling both with a "Scikit Learn" classifier and with a Neural Network requires splitting the data into training, validation and testing sets. At this point, true and fake news are ordered one after the other. If the dataset were split as it is, the training part would hold few or no fake stories, since most of them would fall into the validation and testing sets. Therefore, the code line below shuffles all samples and stores the result in a new variable.

In [None]:
true_and_fake_dataset = true_and_fake.sample(frac = 1).reset_index(drop = True)

Let's check if the shuffle worked.

In [None]:
true_and_fake_dataset.head()

"Title", "subject", and "date" columns won't be used for classification. It would be entirely based on the words in the "text" field. Therefore, the three features are removed.

In [None]:
true_and_fake_dataset = true_and_fake_dataset.drop(["title", "subject", "date"], axis = 1)

In [None]:
true_and_fake_dataset.head()

It is more convenient to have labels in the rightmost column of the table. To that end, "label" and "text" switch places.

In [None]:
true_and_fake_dataset = true_and_fake_dataset[["text", "label"]]

In [None]:
true_and_fake_dataset.head()

Also, to avoid confusing models, all words are turned into lowercase.

In [None]:
true_and_fake_dataset["text"] = true_and_fake_dataset["text"].str.lower()

In [None]:
true_and_fake_dataset

Usually, texts contain a lot of stopwords. The latter are words which do not add much meaning to a sentence. For example, "the", "he", "have", etc. Thus, they can safely be ignored without sacrificing the meaning of the sentence. Such words are captured in `nltk`'s "corpus" module. The function below, when applied, will remove all stopwords from true and fake news.

In [None]:
def remove_stopwords(input_text):
    """
    Function: Removes stopwords from text
    
    Arguments: text
    
    Returns: text without stopwords
    """
    words = input_text.split()
    clean_words = [word for word in words if word not in stopwords.words("english")]
    clean_words = " ".join(clean_words)
    return clean_words
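The function above calls `stopwords.words("english")` once for every word, which appears to re-read the stopword list on each call and can make the operation slow on a large corpus. A minimal sketch of a faster variant is given here, assuming the stopwords are collected once into a set. `STOPWORDS_SAMPLE` and `remove_stopwords_fast` are hypothetical names, and the tiny hard-coded set only stands in for the real `nltk` list.

```python
# Hypothetical stand-in for set(stopwords.words("english")); in the Notebook
# the set would be built once from nltk after nltk.download("stopwords").
STOPWORDS_SAMPLE = frozenset({"the", "he", "have", "a", "is", "of"})


def remove_stopwords_fast(input_text, stop_set=STOPWORDS_SAMPLE):
    """Drop every whitespace-separated word that is found in stop_set."""
    # Set membership is O(1) on average and the set is built only once.
    return " ".join(word for word in input_text.split() if word not in stop_set)


print(remove_stopwords_fast("he is the president of a country"))
# prints: president country
```

The result should be the same as with the original function when the set holds the same words; only the lookup cost changes.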

In addition, texts should be cleaned of digits, special characters, links, and other symbols which do not hold meaningful information when detached from words. These are removed by the function below.

In [None]:
def custom_preprocessor(text):
    """
    Function: Make text lowercase, remove text in square brackets, remove links, remove special
              characters and remove words containing numbers
    
    Arguments: text
    
    Returns: text without special characters, links, and numbers.
    """
    text = text.lower()
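    # Raw strings keep the regex escapes intact. Note that the "\W" substitution
    # runs early, so the later link, tag and punctuation patterns have little
    # left to match; the original order is kept so results stay reproducible.
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r"\W", " ", text) # removes special characters
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)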

    return text

First, text is being cleaned from special characters, links and numbers; stopwords are removed thereafter.

In [None]:
true_and_fake_dataset["text"] = true_and_fake_dataset["text"].apply(custom_preprocessor)

In [None]:
true_and_fake_dataset["text"] = true_and_fake_dataset["text"].apply(remove_stopwords)

The code line below checks if the text has been cleaned.

In [None]:
true_and_fake_dataset.head()

Removing stopwords takes **more than half an hour**. In order to avoid repeating this operation, the final dataset is **exported as a "csv"** file **and is loaded again** (see below).

In [None]:
true_and_fake_dataset.to_csv("true_fake_news.csv", index = False) 

## 3. Classification of fake and true news with Scikit Learn

The exported file is loaded now and its values are used for classification.

In [None]:
true_and_fake_dataset = pd.read_csv("../input/true-fake-news/true_fake_news.csv")

In [None]:
true_and_fake_dataset.head()

The dataset was successfully loaded. It is important to check if there are any missing values in it.

In [None]:
true_and_fake_dataset.isna().any()

It seems some cells became empty after removing stopwords and other specific characters, links and numbers. Those rows are dropped from the DataFrame.

In [None]:
true_and_fake_dataset.dropna(inplace = True)

Now, the dataset can be split into training and testing sets. In the original Notebook, the author did not use a stratified split. Here, this option is applied. Test size and random state are left as they are.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(true_and_fake_dataset["text"],
                              true_and_fake_dataset["label"], test_size = 0.25,
                              stratify = true_and_fake_dataset["label"],
                              random_state = 100)

It is important to check if resulting datasets are in the proper shape. The code line below confirms that.

In [None]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape

The next step is to vectorize all words. This is performed with Scikit Learn's `TfidfVectorizer()`, a common tool for transforming text into a numerical representation (TF-IDF weights) on which a machine learning algorithm can be fitted.

In [None]:
vector = TfidfVectorizer()

In [None]:
x_train_vect = vector.fit_transform(x_train)

Vectorized data are stored as 33042 entries (rows) and 94854 columns (unique words).

In [None]:
x_train_vect.shape
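To make the TF-IDF representation more concrete, here is a minimal pure-Python sketch on three toy documents. It assumes the smoothed inverse document frequency and L2 row normalisation that, to my understanding, are Scikit Learn's defaults; tokenization is simplified to whitespace splitting, and `tfidf_rows` and `toy_docs` are hypothetical names that are not part of the Notebook.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return (vocabulary, rows) with L2-normalised TF-IDF weights per document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed inverse document frequency (assumed default behaviour).
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocabulary, rows


toy_docs = ["senate passes budget", "senate shocking hoax", "budget vote senate"]
vocab, matrix = tfidf_rows(toy_docs)
print(vocab)
print([round(v, 3) for v in matrix[1]])
```

Each row corresponds to a document and each column to a word in the vocabulary, which is why the real matrix has as many columns as there are unique words. Words that occur in every document (here "senate") receive a lower weight than rarer ones (here "hoax").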

These are fed to the `LogisticRegression()` by applying `fit()`, along with respective labels.

In [None]:
lg = LogisticRegression().fit(x_train_vect, y_train)

Predicted classes are computed by calling `predict` with testing texts.

In [None]:
y_pred = lg.predict(vector.transform(x_test))

In [None]:
print(y_pred)

The most common classification metrics for evaluating a model's performance are "accuracy" and "f1_score". The former shows the proportion of correct predictions among the total number of cases examined. "f1_score", on the other hand, is the harmonic mean of "precision" and "recall" (other popular classification metrics). Thus, "f1_score" takes both false positives and false negatives into account.

In [None]:
print(f"Accuracy score is: {accuracy_score(y_test, y_pred)*100}%")

In [None]:
print(f"f1 score is {f1_score(y_test, y_pred)*100}%")
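As a small worked example of how these metrics relate to each other, the sketch below derives them from the four cells of a confusion matrix. The counts are hypothetical and chosen only for illustration; `binary_metrics` is not a Scikit Learn function.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1


# Hypothetical counts, not taken from the Notebook's results.
acc, prec, rec, f1 = binary_metrics(tp=90, fp=10, fn=30, tn=70)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
# prints: accuracy=0.800 precision=0.900 recall=0.750 f1=0.818
```

The harmonic mean (0.818) is lower than the arithmetic mean of precision and recall (0.825), because it penalises the weaker of the two values.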

Both "accuracy" and "f1_score" are pretty high: over 98%. Classification report shows how accurate and precise the algorithm is for each class. It is displayed below.

In [None]:
print(classification_report(y_test, y_pred))

The figures above confirm that both fake and real news are properly classified. Only slightly more than 1% of the test articles were misclassified, as shown in the confusion matrix below. Only 74 articles were wrongly declared "fake" instead of "true" (a False Positive, or type I error, when "fake" is treated as the positive class). On the other hand, barely 66 fake publications were not recognised as "fake" (a False Negative, or type II error).

In [None]:
sns.heatmap(confusion_matrix(y_test, y_pred),
            annot = True,
            fmt = ".0f",
            cmap = "coolwarm",
            linewidths = 2, 
            linecolor = "white",
            xticklabels = lg.classes_,
            yticklabels = lg.classes_)
plt.show()

In conclusion, the Logistic Regression Classifier was trained to (almost) perfectly distinguish fake from real news. It is interesting to see if a Neural Network could perform better.

## 4. Classification of fake and true news with a Neural Network

### 4.1. Preprocessing data

Although Neural Networks are much more powerful, it might be hard to beat a Scikit Learn model having 98%-99% accuracy and f1_score. Nonetheless, it is worth trying.

The first thing to do is to define the values of the relevant hyper-parameters. In this case, these are the vocabulary size (i.e. the number of words used in training), the embedding dimension (the size of the dense representation of words and their relative meanings), the maximum length of sequences, the truncation and padding type, and how out of vocabulary words will be marked.

It is a common practice to begin with 10000 words. `TfidfVectorizer()` above, however, found 94854 words. Therefore, the example below uses 20000 words instead of 10000. The embedding dimension for such relatively simple tasks is usually set to 16 or 32. A maximum length of 300 words per article is a compromise: using fewer words might lose information, whereas using more (e.g. 400) adds lots of "white space" at the end of shorter articles. Truncation type determines where values are removed from sequences longer than the maximum length, either at the beginning or at the end. In this case, longer sequences are truncated after the 300th element. Padding type indicates where to add "white space" (0s) when the text is shorter than the maximum length; here they are added at the end. Out of vocabulary words will be denoted as `<OOV>`.

In [None]:
# Set values for hyper-parameters
vocabulary_size = 20000
embedding_dim = 32
max_length = 300
trunc_type = "post"
padding_type = "post"
oov_tokens = "<OOV>"

In the previous example, the dataset was split only into training and testing sets. The Neural Network, however, is trained and evaluated with training, validation, and testing sets, which are created with the code below. The Logistic Regression set aside more data for testing (25%) and used less for training. It is a better idea, however, to have more (and more diverse) training samples, so that the model can learn more and adjust its weights accordingly, and to evaluate performance on smaller sets. Therefore, the code below takes 1000 samples from each label for the validation set and another 1000 for the testing set; the remaining ones are left for training.

In [None]:
#Stratified split
train_data = []
val_data = []
test_data = []

for label, data in true_and_fake_dataset.groupby("label"):
    shuffled_data = data.sample(len(data))
    val_in_group = shuffled_data.iloc[:1000]
    test_in_group = shuffled_data.iloc[1000:2000]
    train_in_group = shuffled_data.iloc[2000:]
    
    train_data.append(train_in_group)
    val_data.append(val_in_group)
    test_data.append(test_in_group)

All three sets are merged and shuffled (once again) by applying the function below.

In [None]:
def merge_and_shuffle(datasets):
    result = pd.concat(datasets)
    return result.sample(len(result))

In [None]:
train_data = merge_and_shuffle(train_data)
val_data = merge_and_shuffle(val_data)
test_data = merge_and_shuffle(test_data)

Datasets' shape is checked below. 40056 training samples, 2000 for validation, and 2000 for testing. All have two features - text and labels.

In [None]:
train_data.shape, val_data.shape, test_data.shape

Neural Networks work with NumPy arrays and tensors. For this reason, the "text" column of each of the three datasets (which are DataFrames now) is converted into a NumPy array.

In [None]:
train_text = train_data["text"].to_numpy()
validation_text = val_data["text"].to_numpy()
testing_text = test_data["text"].to_numpy()

Now, sentences (values in the "text" column) can be tokenized, i.e. each word is replaced with a number. This is performed by TensorFlow's `Tokenizer()` class, which expects (at least) the number of words to keep (in this case 20000, as defined earlier), and how to denote out of vocabulary words. After initializing, the tokenizer is fitted only on the training set. It is assumed the training data are sufficient for predicting fake or real news in the validation and testing sets.

In [None]:
tokenizer = Tokenizer(num_words=vocabulary_size, oov_token=oov_tokens)

In [None]:
tokenizer.fit_on_texts(train_text)

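The tokenizer's `word_index` attribute returns the number assigned to each word in the vocabulary. Out of vocabulary words are denoted as "1", "trump" as "2", "president" as "4", etc. Note that `word_index` holds every word found in the training texts; the limit of 20000 words is applied only when texts are converted to sequences. Only the first 10 entries are displayed below.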

In [None]:
word_index = tokenizer.word_index

In [None]:
# Print first 10 tokenized words (key, value pairs)
iterator = iter(word_index.items())
for i in range(10):
    print(next(iterator))
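The numbering scheme described above can be imitated in a few lines of plain Python. The sketch below is not the Keras implementation; it only assumes the behaviour described in this section: index 1 is reserved for the out of vocabulary token, more frequent words get smaller indices, and words whose index is not below the vocabulary size are replaced with the OOV index. `build_word_index`, `to_sequence` and `toy_texts` are hypothetical names.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Map the OOV token to 1 and every word to 2, 3, ... by descending frequency."""
    counts = Counter(word for text in texts for word in text.split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index


def to_sequence(text, index, num_words, oov_token="<OOV>"):
    """Replace words by indices; unknown or too-rare words become the OOV index."""
    oov = index[oov_token]
    return [index[w] if index.get(w, num_words) < num_words else oov
            for w in text.split()]


toy_texts = ["trump said trump", "president said trump", "senate vote"]
toy_index = build_word_index(toy_texts)
print(toy_index)
print(to_sequence("trump president hoax vote", toy_index, num_words=4))
# prints: [2, 1, 1, 1]
```

In the last line, only "trump" keeps its own index: "hoax" was never seen, and "president" and "vote" are known but fall outside the 4-word limit.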

The next step is to turn the text into sequences of numbers. This is performed with the `texts_to_sequences()` method, which accepts the training data.

In [None]:
train_sequences = tokenizer.texts_to_sequences(train_text)

The first text line is displayed below. It has around 200 words but the Neural Network will expect 300. Therefore, its length is expanded by applying `pad_sequences()`. This function expects the sequences, their maximum length, and the padding and truncating types.

In [None]:
np.array(train_sequences[0])

In [None]:
train_padded = pad_sequences(train_sequences, maxlen = max_length, padding = padding_type, truncating = trunc_type)

The cell below shows what the same text looks like after padding. The first numbers are the same as above but 0s are added at the end, up to the 300th element.

In [None]:
train_padded[0]
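The effect of "post" padding and "post" truncation can be reproduced on a single short list. The helper below is a hypothetical illustration, not the Keras function, and uses a maximum length of 5 instead of 300 to keep the output readable.

```python
def pad_or_truncate(sequence, maxlen, pad_value=0):
    """Apply 'post' truncation, then 'post' padding, to one sequence."""
    clipped = list(sequence[:maxlen])  # keep only the first maxlen items
    return clipped + [pad_value] * (maxlen - len(clipped))


print(pad_or_truncate([5, 8, 2], maxlen=5))           # prints: [5, 8, 2, 0, 0]
print(pad_or_truncate([5, 8, 2, 9, 4, 7], maxlen=5))  # prints: [5, 8, 2, 9, 4]
```

Shorter sequences get zeros appended, longer ones lose their tail, and every result has exactly `maxlen` elements.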

The same operations are applied to all texts in the training set. The lengths of the second and sixth texts before and after padding are printed below.

In [None]:
print(len(train_sequences[1]))
print(len(train_padded[1]))

In [None]:
print(len(train_sequences[5]))
print(len(train_padded[5]))

To illustrate how padding and "texts_to_sequences" work, the code lines below convert sequences to text by reversing word_index.

In [None]:
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

In [None]:
def decode_sentence(text):
    return " ".join([reverse_word_index.get(i, "?") for i in text])

Let's take the last example - the sixth text in the training set. It has 551 tokens, so everything after the 300th is truncated. The decoded sentence is shown first, followed by the original one.

In [None]:
print(decode_sentence(train_padded[5]))
print(train_text[5])

Validation and testing texts undergo the same preprocessing.

In [None]:
validation_sequences = tokenizer.texts_to_sequences(validation_text)

In [None]:
validation_padded = pad_sequences(validation_sequences, maxlen = max_length, padding = padding_type, truncating = trunc_type)

In [None]:
print(len(validation_sequences))
print(validation_padded.shape)

In [None]:
testing_sequences = tokenizer.texts_to_sequences(testing_text)

In [None]:
testing_padded = pad_sequences(testing_sequences, maxlen = max_length, padding = padding_type, truncating = trunc_type)

In [None]:
print(len(testing_sequences))
print(testing_padded.shape)

Labels should also be preprocessed. They are extracted and turned into NumPy arrays.

In [None]:
train_labels = train_data["label"].to_numpy()
validation_labels = val_data["label"].to_numpy()
testing_labels = test_data["label"].to_numpy()

### 4.2. Building and training the Neural Network

To avoid clutter from existing models and layers (when the model is being fine-tuned several times), especially when memory is limited, `clear_session()` resets all prior state generated by Keras.

In [None]:
tf.keras.backend.clear_session()

The model (classifier) is a simple Neural Network with an Embedding layer, two LSTM layers (the first of which is Bidirectional, so it reads each sequence both forwards and backwards), and two Dense layers. The last layer returns the output. Its activation is "sigmoid" since the task is a binary classification, i.e. there are only two possible outcomes - either an article is fake, or not. Therefore, the layer needs only one neuron. The output of the previous Dense layer passes through a "relu" activation, which sets negative values to 0 and leaves positive ones unchanged. Several tests and trials showed that two LSTM layers with 24 and 16 neurons, respectively, return very good results. The Embedding layer expects the shape of the data, namely the vocabulary size, the embedding dimension, and the maximum length of sequences.

In [None]:
model = Sequential([
        Embedding(vocabulary_size, embedding_dim, input_length = max_length),
        Bidirectional(LSTM(24, return_sequences = True)),
        LSTM(16),
        Dense(16, activation = "relu"),
        Dense(1, activation = "sigmoid")
])

The model has 655 393 trainable parameters, most of which are in the Embedding layer.

In [None]:
model.summary()
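The total can be verified by hand. The sketch below uses the standard parameter formulas for Embedding, LSTM and Dense layers with the sizes defined above; the helper names are hypothetical and the code does not touch the Keras model itself.

```python
def embedding_params(vocab_size, dim):
    # One vector of length dim per word in the vocabulary.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output.
    return input_dim * units + units


counts = {
    "embedding": embedding_params(20000, 32),
    "bidirectional_lstm": 2 * lstm_params(32, 24),  # forward and backward copy
    "lstm": lstm_params(2 * 24, 16),                # input: 48 concatenated outputs
    "dense_relu": dense_params(16, 16),
    "dense_sigmoid": dense_params(16, 1),
}
print(counts)
print(sum(counts.values()))  # prints: 655393
```

The Embedding layer alone accounts for 640 000 of the parameters, which is why the vocabulary size and the embedding dimension dominate the model size.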

In contrast to Scikit Learn, this Notebook does not rely on a "ready to use" `f1_score` from TensorFlow; only "accuracy" is requested by name. To compare both models, however, the Neural Network should report the same metrics. Thus, the functions below (which compute "precision", "recall", and "f1_score") are passed to the `metrics` argument of the `compile` method. Keras evaluates such functions on every batch and averages the results, so the reported values approximate the scores over the whole dataset. The code is taken from [StackExchange](https://datascience.stackexchange.com/questions/45165/how-to-get-accuracy-f1-precision-and-recall-for-a-keras-model).

In [None]:
def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

In [None]:
def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

In [None]:
def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))
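Because metric functions like the ones above are evaluated on each batch and then averaged, the reported value can differ from the score computed over all samples at once. The pure-Python sketch below illustrates this with hypothetical labels and predictions split into two batches; `f1_from_labels` is an illustrative helper, not part of Keras.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1) from two equal-length 0/1 lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels and predictions, split into two batches of four.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]

batches = [(y_true[i:i + 4], y_pred[i:i + 4]) for i in range(0, len(y_true), 4)]
batch_mean_f1 = sum(f1_from_labels(t, p) for t, p in batches) / len(batches)
global_f1 = f1_from_labels(y_true, y_pred)
print(batch_mean_f1, global_f1)  # prints: 0.5 0.75
```

The gap shrinks when batches are large and predictions are almost always correct, which is the situation in this Notebook, but it is worth keeping in mind when comparing the values with Scikit Learn's `f1_score`.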

The model is compiled by passing the appropriate loss function ("binary crossentropy" in this case), an optimizer (i.e. the formula for computing gradient descent and for updating layer weights), and metrics for evaluating model's performance.

In [None]:
model.compile(loss = "binary_crossentropy",
              optimizer = "adam",
              metrics = ["accuracy", f1_m, precision_m, recall_m])
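To give an intuition for the loss function, the sketch below computes binary cross-entropy by hand for two hypothetical pairs of labels and predicted probabilities. It follows the textbook formula; the clipping constant `eps` is an assumption added only to keep the logarithm finite.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps keeps log() away from 0."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confident and correct predictions give a loss close to 0, whereas confident but wrong ones are penalised heavily. This is the quantity the optimizer tries to reduce during training.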

Tests and trials showed that the model converges in around 10 epochs (i.e. 10 full passes over the training data, each consisting of many forward and backward propagation steps).

In [None]:
num_epochs = 10

Training a Neural Network means applying `fit()` method over the model, passing training and validation data, and stating (at least) the number of epochs. The example below reached 99-100% both on "accuracy" and "f1_score", as well as on "precision" and "recall" on both datasets.

In [None]:
history = model.fit(train_padded, train_labels,
                    epochs = num_epochs,
                    validation_data = (validation_padded, validation_labels))

In [None]:
def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_'+string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_'+string])
    plt.show()

The plots below show how the model converged over 10 epochs.

In [None]:
plot_graphs(history, "loss")
plot_graphs(history, "accuracy")
plot_graphs(history, "f1_m")
plot_graphs(history, "precision_m")
plot_graphs(history, "recall_m")

### 4.3. Model exploration and evaluation

It would be interesting to see how the weights were updated after training (by default, the weights of the Embedding layer are initialized with small random values). To demonstrate this, the weights of the Embedding layer are extracted and stored in a variable.

In [None]:
e = model.layers[0]
weights = e.get_weights()[0]

In [None]:
weights

Weights matrix has 20000 rows (as the number of words in vocabulary) and 32 "features" (as the number of embedding dimensions).

In [None]:
print(weights.shape)

The plot below shows how all these 640 000 weights are distributed. Most have (as expected) values very close to 0. Those with larger absolute values are likely to matter more for deciding if an article is fake or not.

In [None]:
# Display distribution of weights in Embedding layer
plt.hist(weights.ravel())
plt.xlabel("weights of Embedding layer")
plt.ylabel("count")
plt.title("Distribution of weights in Embedding after training")
plt.show()

To compare the Neural Network's performance with that of the Logistic Regression Classifier, the former is evaluated on the testing data. Returned values for "accuracy", "f1_score", and "precision" are 99.9+%, and "recall" is 100% - all higher than those achieved by the "Scikit Learn" model. This suggests that the Neural Network is nearly flawless at distinguishing true from fake news in this dataset.

In [None]:
# `f1` is used instead of `f1_score` to avoid shadowing sklearn's f1_score imported earlier
loss, accuracy, f1, precision, recall = model.evaluate(testing_padded, testing_labels)

In [None]:
print(f"Model accuracy is {accuracy * 100}%")
print(f"Model f1 score is {f1 * 100}%")
print(f"Model precision is {precision * 100}%")
print(f"Model recall is {recall * 100}%")

A confusion matrix is not computed since it would only show an even smaller number of misclassified articles. Instead, screenshots of how words are clustered are shown below. The code lines below (taken from the DeepLearning.AI training on NLP) extract vectors and metadata, which are fed into TensorFlow's projector.

In [None]:
# Write one embedding vector (vecs.tsv) and its word (meta.tsv) per line;
# index 0 is skipped because it is reserved for padding.
with io.open('vecs.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('meta.tsv', 'w', encoding='utf-8') as out_m:
    for word_num in range(1, vocabulary_size):
        word = reverse_word_index[word_num]
        embeddings = weights[word_num]
        out_m.write(word + "\n")
        out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")

In [None]:
try:
    from google.colab import files
except ImportError:
    pass
else:
    files.download('vecs.tsv')
    files.download('meta.tsv')

The screenshots show how the words are clustered within an imaginary sphere. "trump" and "trump"-linked words (second image) tend to be grouped around one of the poles, whereas "really" and similar words are more dispersed, without being explicitly assigned to one of the two article types.

In [None]:
Image("../input/projector-images/01.jpg")

In [None]:
Image("../input/projector-images/02.jpg")

In [None]:
Image("../input/projector-images/03.jpg")

In conclusion, it could be said that, on this dataset, a Neural Network can distinguish fake from true news better (almost perfectly) than a top-performing Logistic Regression classifier.