In a world overflowing with information in all shapes and forms, it's crucial to stay vigilant about which information we consume and accept as real. Among the many types of misinformation out there, fake news articles have become increasingly easy to encounter in recent years, because they are relatively easy to forge and can generate profit (read [this fascinating article](https://www.nytimes.com/2016/11/20/business/media/how-fake-news-spreads.html) on how and why fake news can spread so quickly and easily).

Thankfully, there are simple natural language processing algorithms that can be trained to help discriminate between real and fake news. In this notebook, I describe one such text classification model and apply it to a set of news articles.

# Table of contents
1. [Importing data & exploration](#1)
2. [Data cleaning / Prepping](#2)
3. [Feature extraction](#3)
4. [Model training](#4)
5. [Further exploration](#5)
6. [Conclusion](#6)

<div id='1'></div>

# 1. Importing data & exploration

Below, I start by taking a look at the data we're dealing with.

In [None]:
# Basic set-up
import os
import numpy as np
import pandas as pd

# ML toolkits
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.utils.extmath import density
from sklearn.pipeline import make_pipeline

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.plotting import plot_confusion_matrix

In [None]:
input_path = '/kaggle/input/fake-and-real-news-dataset'
fake = pd.read_csv(os.path.join(input_path,'Fake.csv'))
real = pd.read_csv(os.path.join(input_path,'True.csv'))

In [None]:
display(fake.head())

In [None]:
display(real.head())

Looks like a simple dataset that contains four columns, namely the article's title, the actual body of text, the subject, and date. There may not even be that much data cleaning to do, given how simple the dataset is!

In [None]:
display(fake.info())
print('\n')
display(real.info())

In [None]:
display(fake.subject.value_counts())
print('\n')
display(real.subject.value_counts())

<div id='2'></div>

# 2. Data Cleaning / Prepping
Hmmm... it looks like the `subject` column is perhaps too informative -- there are clearly no overlapping "subjects" between fake and true news articles. Since I want to build a model that can differentiate fake vs. true news based on its content, I will drop this column.

Moreover, the best way to train the model on both fake and true news data will be to concatenate the two kinds and shuffle them. I should first add labels so that we know which ones are which.

In [None]:
fake['label'] = 'fake'
real['label'] = 'real'

In [None]:
data = pd.concat([fake, real], axis=0)
data = data.sample(frac=1).reset_index(drop=True)
data = data.drop('subject', axis=1)  # drop() returns a new DataFrame, so reassign it

Now I can split our dataset into training and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['label'], test_size=0.25)
display(X_train.head())
print('\n')
display(y_train.head())

print("\nThere are {} documents in the training data.".format(len(X_train)))

<div id='3'></div>

# 3. Feature extraction

Before getting into the actual feature extraction, I want to explain the method used here and why I chose it. This may be the most text-heavy section, but I believe it's crucial to be able to justify your choice of methodology, so please bear with me! If you wish, you can also just skip to the "TL;DR" below.

## Some terms to know when dealing with... terms

Common ways of extracting numerical features from text include tokenizing, counting occurrences, and tf-idf term weighting; I've chosen **tf-idf term weighting** as the feature to extract from these texts.
<br>

### (1) Term frequency
The first portion of this method, "tf", refers to the **term frequency**, which simply indicates how often terms can be found in documents. Tf's alone are often insufficient as features, however; since there are many commonly-used words such as "is", "are", "the", etc. that do not carry much information about the document, we do not want to weigh these terms as heavily as other more rare but more informative terms. These uninformative terms are actually referred to as **stop words**, and are often cleaned out during data cleansing/feature extraction steps as they do not hold much value in enhancing the model's ability to predict information.

### (2) Inverse document frequency
This is where the "idf", short for **inverse document frequency**, comes into play. Idf is used to penalize such terms that occur commonly across different contexts without adding interesting information. The exact equation for computing inverse document frequency is:

$$ idf(t) = \log \frac{1 + n}{1 + df(t)} + 1. $$

Here, $n$ represents the total number of documents, $t$ represents the term in question, $df(t)$ represents the document frequency of that term; i.e., the number of documents within the set of documents that contain that term. As one can imagine, for common terms such as "is", "are", etc., $idf(t)$ will most likely be 1, since all documents are highly likely to contain them (thus, $df(t) = n$). On the other hand, the less often a term occurs across different documents, the smaller the denominator will be, making the fraction bigger and in turn, $idf(t)$ bigger.
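To put numbers on this, here is a minimal sketch that evaluates the equation for a very common and a fairly rare term. The corpus size and document frequencies are made up for illustration, the helper name `smoothed_idf` is my own, and I assume the natural logarithm (which is what scikit-learn uses).

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed idf from the equation above: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 1000  # hypothetical corpus size

idf_common = smoothed_idf(n_docs, doc_freq=1000)  # term found in every document
idf_rare = smoothed_idf(n_docs, doc_freq=9)       # term found in only 9 documents

print(f"common term: {idf_common:.3f}")
print(f"rare term:   {idf_rare:.3f}")
```

The common term lands exactly at 1, while the rare term ends up several times larger (roughly 5.6 with these made-up numbers).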

### (3) Tf-idf
Finally, **tf-idf** is the product of term-frequency and inverse document frequency, mathematically computed as: 

$$ \text{tf-idf}(t,d) = tf(t,d) \times idf(t). $$

Here, in addition to the notation used above, $d$ represents a document. The more often a word appears in a document, the greater its tf will be, but if the word is this common across many documents, it is penalized with a small idf. On the other hand, a rarely occurring word might have a smaller tf but be boosted by a larger idf because it appears in few documents.
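To make the product concrete, here is a minimal pure-Python sketch on a made-up three-document corpus (the documents and the `tfidf` helper are my own illustration, not part of the dataset or of scikit-learn). It uses raw counts for tf together with the smoothed idf, and it skips the row normalization that `TfidfVectorizer` applies by default, so the numbers will not match the library's output exactly.

```python
import math
from collections import Counter

# Hypothetical corpus, already lower-cased and split into tokens.
docs = [
    "the senate passed a bill".split(),
    "the president signed a bill".split(),
    "the shocking truth about a senator".split(),
]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(term, doc):
    """Raw term count in `doc` multiplied by the smoothed idf of `term`."""
    tf = doc.count(term)
    idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
    return tf * idf

score_the = tfidf("the", docs[2])            # appears in every document
score_shocking = tfidf("shocking", docs[2])  # appears in this document only

print(f"the:      {score_the:.3f}")
print(f"shocking: {score_shocking:.3f}")
```

Both words occur once in the third document, yet "shocking" receives the higher score because it is absent from the other two documents.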

### TL;DR
>`Tf-idf` term weighting lets you assign importance to tokens that actually carry some information by balancing overall token frequency with its frequency across documents.

Below, I first initialize a `TfidfVectorizer` object. It takes as input the set of document strings and outputs the normalized tf-idf vectors; then, using `fit_transform` as with other transformers in scikit-learn, we can *fit* the vectorizer to the data and *transform* them. Its `max_df` parameter sets a cut-off document frequency above which a term is treated as a corpus-specific stop word and ignored. Here, I set `max_df` to 0.7, so terms that appear in more than 70% of the documents are dropped (as a proportion, the parameter can take any value between 0 and 1, with lower values filtering more aggressively). The final output of fitting & transforming the data is a sparse matrix of size `n_samples` by `n_features`, i.e., `number of documents` by `number of unique tokens` kept in the vocabulary.
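Before running the vectorizer on the real data, here is a small pure-Python sketch of the idea behind the `max_df` cut-off. The mini-corpus is made up, and this only mimics the filtering rule (drop tokens whose document fraction exceeds the threshold); it is not scikit-learn's actual implementation.

```python
# Hypothetical mini-corpus; each document is the set of its unique tokens.
docs = [
    {"said", "senate", "vote"},
    {"said", "president", "vote"},
    {"said", "image", "getty"},
    {"said", "court", "ruling"},
]
max_df = 0.7  # same cut-off as in this notebook

n_docs = len(docs)
vocabulary = set().union(*docs)

# Fraction of documents in which each token appears.
doc_fraction = {
    token: sum(token in doc for doc in docs) / n_docs
    for token in vocabulary
}

# Tokens above the cut-off are treated as corpus-specific stop words.
kept = sorted(t for t, frac in doc_fraction.items() if frac <= max_df)
dropped = sorted(t for t, frac in doc_fraction.items() if frac > max_df)

print("kept:   ", kept)
print("dropped:", dropped)
```

Only "said" appears in more than 70% of these documents, so it is the one token that gets dropped.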

In [None]:
my_tfidf = TfidfVectorizer(stop_words='english', max_df=0.7)

# fit the vectorizer and transform X_train into a tf-idf matrix,
# then use the same vectorizer to transform X_test
tfidf_train = my_tfidf.fit_transform(X_train)
tfidf_test = my_tfidf.transform(X_test)

tfidf_train

As expected, we see that there are as many rows as the number of documents, and we have extracted over a hundred thousand features, or tokens.

As a side note, the output is in a `Compressed Sparse Row` format, which refers to how the resulting tf-idf matrix is stored in memory. Sparse matrices are matrices that contain few non-zero elements -- for example, as many non-zeros as number of rows or columns (i.e., the sparse matrix might contain one non-zero element in each row or each column). Because so many elements are zero, it's wasteful to store all the zero elements in memory. The compressed sparse row format presents a solution to this problem by storing the non-zero values and their locations instead. 
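As a rough illustration of that storage scheme, the sketch below builds the three arrays of the compressed sparse row format by hand from a tiny made-up matrix. The array names follow the `data`, `indices`, and `indptr` attributes of SciPy's CSR matrices, but the construction itself is my own simplified version.

```python
# Hypothetical dense matrix: 3 documents x 5 tokens, mostly zeros.
dense = [
    [0.0, 0.8, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.0],
]

data = []     # the non-zero values, stored row by row
indices = []  # the column index of each stored value
indptr = [0]  # row i occupies data[indptr[i]:indptr[i + 1]]

for row in dense:
    for col, value in enumerate(row):
        if value != 0.0:
            data.append(value)
            indices.append(col)
    indptr.append(len(data))  # mark where this row ends

print("data:   ", data)
print("indices:", indices)
print("indptr: ", indptr)
```

Only 4 values and their positions are stored instead of all 15 entries; with around a hundred thousand columns, the savings become dramatic.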


<div id='4'></div>

# 4. Model training

The model I've chosen to use is the **Passive-Aggressive (PA) Classifier** (see original paper [here](https://jmlr.csail.mit.edu/papers/volume7/crammer06a/crammer06a.pdf)). In essence, the PA classifier only updates its weights (**"aggressive"** action) when it encounters an example that it misclassifies or classifies correctly but without enough margin, and otherwise leaves them unchanged (**"passive"** action).

The PA classifier is an *online* algorithm, meaning it updates its weights using one example at a time and then moves on; in a true streaming setting, it never sees the same example again. This is in contrast to a *batch* algorithm, which uses the same set of multiple examples to update the weights in each iteration of training. Because of this, the PA classifier is particularly useful when dealing with a dataset containing a large or rapidly increasing number of examples, like news articles or Tweets! Of course, the data I'm using in this notebook are toy static data, but you can imagine its advantages in real-life applications. Other Kagglers, like [Ayushi Mishra](https://www.kaggle.com/ayushimishra2809/fake-news-prediction), have shown that the PA classifier outperforms several other types of models as well, which gives me some confidence that it is a reasonable choice.
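To give a feel for the passive and aggressive steps, here is a minimal sketch of a single update for a binary problem, following the PA-I variant from the original paper (hinge loss, step size capped by `C`). The function name, the toy stream, and the +1/-1 label encoding are my own assumptions for illustration; this is not scikit-learn's implementation.

```python
def pa_update(w, x, y, C=1.0):
    """One PA-I step for a label y in {-1, +1}; returns the new weight list."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)       # hinge loss on this example
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return w                        # passive: confident and correct, no change
    tau = min(C, loss / sq_norm)        # aggressive: step size, capped by C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
# Hypothetical stream of (features, label) pairs: +1 = real, -1 = fake.
stream = [([1.0, 0.0], +1), ([0.0, 1.0], -1), ([2.0, 0.0], +1)]
for x, y in stream:
    w = pa_update(w, x, y)  # each example is used once, then discarded

print(w)
```

The first two examples trigger aggressive updates, while the third is already classified with a margin above 1, so the weights stay as they are.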

If you'd like to learn more about the mathematics behind the PA classifier algorithm, check out [this video](https://www.youtube.com/watch?v=uxGDwyPWNkU) by Dr. Victor Lavrenko, which walks through the steps very clearly!

Now, let's instantiate the `PassiveAggressiveClassifier` and train it with our features.

In [None]:
from sklearn.linear_model import PassiveAggressiveClassifier

pa_clf = PassiveAggressiveClassifier(max_iter=50)
pa_clf.fit(tfidf_train, y_train)

Finally, we can apply the trained model to the test dataset and see how well it performs!

In [None]:
y_pred = pa_clf.predict(tfidf_test)

conf_mat = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(conf_mat,
                      show_normed=True, colorbar=True,
                      class_names=['Fake', 'Real'])

accscore = accuracy_score(y_test, y_pred)
f1score = f1_score(y_test,y_pred,pos_label='real')

print('The accuracy of prediction is {:.2f}%.\n'.format(accscore*100))
print('The F1 score is {:.3f}.\n'.format(f1score))

Amazing! The model does a very good job predicting whether the news is real or fake, with 99% accuracy; it has a phenomenal F1 score as well, scoring 0.993. The next few things that I'm curious to find out are the following:
- What criteria did the model learn to be able to make such accurate predictions?
- Will this model generalize well to other articles that were not included in this dataset (including the test data), or was there something characteristic about this toy dataset in particular?

I'll explore these questions in the next section.

<div id='5'></div>

# 5. Further exploration

First, to see what the model's criteria ended up being, I want to investigate more into the resulting model's attributes.

In [None]:
# Dimensionality and density of features

print("Dimensionality (i.e., number of features): {:d}".format(pa_clf.coef_.shape[1]))
print("Density (i.e., fraction of non-zero elements): {:.3f}".format(density(pa_clf.coef_)))

Out of the features identified, the algorithm found that a little less than half were not useful in determining whether an article is real. What do the rest look like?

In [None]:
# Sort non-zero weights
weights_nonzero = pa_clf.coef_[pa_clf.coef_!=0]
feature_sorter_nonzero = np.argsort(weights_nonzero)
weights_nonzero_sorted =weights_nonzero[feature_sorter_nonzero]

# Plot
fig, axs = plt.subplots(1,2, figsize=(9,3))

sns.lineplot(data=weights_nonzero_sorted, ax=axs[0])
axs[0].set_ylabel('Weight')
axs[0].set_xlabel('Feature number \n (Zero-weight omitted)')

axs[1].hist(weights_nonzero_sorted,
            orientation='horizontal', bins=500,)
axs[1].set_xlabel('Count')

fig.suptitle('Weight distribution in features with non-zero weights')

plt.show()

So it appears that even among the features with non-zero weights, most had a value close to zero. This is not surprising: with roughly a hundred thousand tokens, it's very unlikely that a large majority of them would be informative for our task. This leads to the next question -- which tokens *were* actually useful?

## Extracting "Indicator" Tokens

In [None]:
# Sort features by their associated weights
tokens = my_tfidf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
tokens_nonzero = np.array(tokens)[pa_clf.coef_[0]!=0]
tokens_nonzero_sorted = np.array(tokens_nonzero)[feature_sorter_nonzero]

num_tokens = 10
fake_indicator_tokens = tokens_nonzero_sorted[:num_tokens]
real_indicator_tokens = np.flip(tokens_nonzero_sorted[-num_tokens:])

fake_indicator = pd.DataFrame({
    'Token': fake_indicator_tokens,
    'Weight': weights_nonzero_sorted[:num_tokens]
})

real_indicator = pd.DataFrame({
    'Token': real_indicator_tokens,
    'Weight': np.flip(weights_nonzero_sorted[-num_tokens:])
})

print('The top {} tokens likely to appear in fake news were the following: \n'.format(num_tokens))
display(fake_indicator)

print('\n\n...and the top {} tokens likely to appear in real news were the following: \n'.format(num_tokens))
display(real_indicator)

In [None]:
fake_contain_fake = fake.text.loc[[np.any([token in body for token in fake_indicator.Token])
                                for body in fake.text.str.lower()]]
real_contain_real = real.text.loc[[np.any([token in body for token in real_indicator.Token])
                                for body in real.text.str.lower()]]

print('Articles that contained any of the matching indicator tokens:\n')

print('FAKE: {} out of {} ({:.2f}%)'
      .format(len(fake_contain_fake), len(fake), len(fake_contain_fake)/len(fake) * 100))
print(fake_contain_fake)

print('\nREAL: {} out of {} ({:.2f}%)'
      .format(len(real_contain_real), len(real), len(real_contain_real)/len(real) * 100))
print(real_contain_real)

How interesting! Here are several of my speculations on these 20 tokens that were likely to indicate either real or fake news (hence so-called "indicator tokens"):
- Fake news seems to use Getty Images a lot! This might be because many of such fake articles do not have associated photographs taken by real journalists, and they need to outsource.
- Real news must often state the day of the week an event took place, since four of the top 10 real-news indicator tokens were `tuesday`, `wednesday`, `thursday`, and `friday`.
- Although the categories were supposedly not just limited to politics, many of the indicator terms seem relevant to it, and specifically U.S. politics (this is likely a sampling bias of sources being American): `gop`, `hillary` (assuming, and likely, it refers to Hillary Clinton), `sen` (short for senator), `republican`, `spokeswoman` are all such terms. 
- Interestingly, `gop` has been used more often in fake news, while `republican` more so in real news -- maybe this is related to the fact that GOP is a nickname for the Republican party.

Moreover, some questions remain to be answered:
- Why are the top two fake-news indicator tokens `read` and `featured`? Perhaps, the author was trying to convince the reader of the article's supposed realness by stating that it has been "read" many times and "featured" as an important story?
- Why are `nov` and `washington` the second- and third-most likely to indicate real news? Perhaps `nov` is short for "November", which happens to be the election month and `washington` may have been present in headings whenever a U.S. government/politics news was reported.
- One crystal clear conclusion is that Reuters is a reputable source of news... in case that was not clear prior to this analysis. This must have been an obvious identifier for the algorithm that the article is real, since Reuters' articles always begin with "CITY NAME (Reuters)". A necessary follow-up should be to see whether the algorithm performance holds up when this information is masked.

Since we can only speculate based on the predictiveness of each token, it would be interesting to find out how these terms were actually used within the text, but most of the investigations stemming from the comments above are beyond the scope of this project.

## Algorithm "Generalizability" (?)

One final question I would like to ask is whether removing "Reuters" will change the algorithm's performance, since it had a disproportionately strong weight (although this may have been to counterbalance the effect of the `idf`). So let's find out what happens.

In [None]:
def FakeNewsDetection(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    
    # vectorizer
    my_tfidf = TfidfVectorizer(stop_words='english', max_df=0.7)
    tfidf_train = my_tfidf.fit_transform(X_train)
    tfidf_test = my_tfidf.transform(X_test)
    
    # model
    my_pac = PassiveAggressiveClassifier(max_iter=50)
    my_pac.fit(tfidf_train, y_train)
    y_pred = my_pac.predict(tfidf_test)
    
    # metrics
    conf_mat = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(conf_mat,
                          show_normed=True, colorbar=True,
                          class_names=['Fake', 'Real'])
    
    accscore = accuracy_score(y_test, y_pred)
    f1score = f1_score(y_test,y_pred,pos_label='real')

    print('The accuracy of prediction is {:.2f}%.\n'.format(accscore*100))
    print('The F1 score is {:.3f}.\n'.format(f1score))
    
    # Sort non-zero weights
    weights_nonzero = my_pac.coef_[my_pac.coef_!=0]
    feature_sorter_nonzero = np.argsort(weights_nonzero)
    weights_nonzero_sorted =weights_nonzero[feature_sorter_nonzero]
    
    # Sort features by their associated weights
    tokens = my_tfidf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
    tokens_nonzero = np.array(tokens)[my_pac.coef_[0]!=0]
    tokens_nonzero_sorted = np.array(tokens_nonzero)[feature_sorter_nonzero]

    num_tokens = 10
    fake_indicator_tokens = tokens_nonzero_sorted[:num_tokens]
    real_indicator_tokens = np.flip(tokens_nonzero_sorted[-num_tokens:])

    fake_indicator = pd.DataFrame({
        'Token': fake_indicator_tokens,
        'Weight': weights_nonzero_sorted[:num_tokens]
    })

    real_indicator = pd.DataFrame({
        'Token': real_indicator_tokens,
        'Weight': np.flip(weights_nonzero_sorted[-num_tokens:])
    })

    print('The top {} tokens likely to appear in fake news were the following: \n'.format(num_tokens))
    display(fake_indicator)

    print('\n\n...and the top {} tokens likely to appear in real news were the following: \n'.format(num_tokens))
    display(real_indicator)

In [None]:
# Generate a copy of the "real news" dataset and remove headings

real_copy = real.copy()
for i,body in real.text.items():
    if '(reuters)' in body.lower():
        idx = body.lower().index('(reuters)') + len('(reuters) - ')
        # Assign through .loc: chained assignment may not update real_copy
        real_copy.loc[i, 'text'] = body[idx:]
        
real_copy.head()

In [None]:
# Create new data, and run the algorithm
data2 = pd.concat([fake, real_copy], axis=0)
data2 = data2.sample(frac=1).reset_index(drop=True)
data2 = data2.drop('subject', axis=1)  # drop() returns a new DataFrame, so reassign it

FakeNewsDetection(data2['text'], data2['label'])

Overall, the algorithm does slightly less well but still holds up very nicely! The list of real-news indicator tokens has changed slightly, and the weight distribution has shifted for both fake- and real-news indicator tokens. The most noticeable difference is, of course, that the token "reuters" is no longer as heavy an indicator; in addition, the one remaining weekday ("monday") and the word "reporters" have replaced "washington" and "saying" among the top 10 real-news indicator tokens. My guess is that "washington" was previously tied to the presence of "reuters" in the heading.

<div id='6'></div>

# 6. Conclusion

In this notebook, I used `TfidfVectorizer` and `PassiveAggressiveClassifier` to detect "fake news" in the dataset. If you found these interesting, I highly encourage you to do further research yourself!

Thank you for reading this far -- please upvote if you enjoyed my work and feel free to leave a comment and let me know your thoughts. All kinds of feedback & constructive criticism are appreciated!

## References

*Below are some links to guides I've referred to and gained inspiration from (some were mentioned throughout the notebook):*
- [Detecting Fake News with Python and Machine Learning](https://data-flair.training/blogs/advanced-python-project-detecting-fake-news/) by **DataFlair**: a simple set of guidelines on implementation of toolkits used here
- [Fake News Prediction](https://www.kaggle.com/ayushimishra2809/fake-news-prediction) by [**Ayushi Mishra**](https://www.kaggle.com/ayushimishra2809): a comparison of different model performances using accuracy scores and confusion matrices
- [Classification of text documents using sparse features](https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py) by **Scikit-learn**: similar to above, but benchmarking accuracy scores and computation times
- [The original paper on the Passive-Aggressive Algorithms](https://jmlr.csail.mit.edu/papers/volume7/crammer06a/crammer06a.pdf) (**Crammer et al.**, 2006)
- [A lecture on Passive-Aggressive Classifiers](https://www.youtube.com/watch?v=uxGDwyPWNkU) by **Dr. Victor Lavrenko**: a great resource to learn more about the mathematics behind the PA classifier algorithm