# Detection of Fake News

<p style="text-align:center;">
    <img src="images/fake-news.jpg" alt="fake-news" title="Fake news propaganda" width="500"><br>
    <center><i>Source: <a href="https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2017.01066.x">Brian Tarran / Significance (Royal Statistical Society)</a></i></center>
</p>

# Table of Contents

* [1 Introduction](#Introduction)
	* [1.1 Objective](#Objective)
	* [1.2 Background](#Background)
	* [1.3 Project Motivation](#Project-Motivation)
* [2 Create Virtual Environment](#Create-Virtual-Environment)
* [3 Import & Install Dependencies](#Import-&-Install-Dependencies)
* [4 Import Fake and Real Datasets](#Import-Fake-and-Real-Datasets)
* [5 Random Sampling of Dataframe](#Random-Sampling-of-Dataframe)
* [6 Cleaning the Text](#Cleaning-the-Text)
* [7 Train-test Partition](#Train-test-Partition)
* [8 Encode Text Data ( Vectorization )](#Encode-Text-Data-%28-Vectorization-%29)
* [9 Classification Models](#Classification-Models)
	* [9.1 Logistic Regression](#Logistic-Regression)
	* [9.2 Decision Tree Classification](#Decision-Tree-Classification)
	* [9.3 Gradient Boost Classifier](#Gradient-Boost-Classifier)
	* [9.4 Random Forest Classifier](#Random-Forest-Classifier)
* [10 Model Testing With Manual Entry](#Model-Testing-With-Manual-Entry)
* [11 References](#References)


## Introduction

### Objective

The main objective of this project is to detect and differentiate between real and fake news from a given dataset of news articles.

### Background

Fake news is currently one of the major buzzwords worldwide. The term 'fake news' was popularized around Donald J. Trump's election to the American presidency, thanks to his repeated use of the expression in his Tweets. However, the phenomenon predates the rise of social media platforms. One such instance is 'The Great Moon Hoax': in August 1835, The New York Sun published a series of articles about aliens living on the Moon, which sold in large numbers <a id="ref-1" href="#cite-brownfield_2020_the">(Brownfield, 2020)</a>, as seen in the illustration below.

<p style="text-align:center;">
    <img src="images/Great-Moon-Hoax-1835-New-York-Sun-lithograph-298px.jpg" alt="news-article" title="Illustration of 1835 article" width="500"><br><center><i>Source: <a href="https://www.saturdayeveningpost.com/2020/08/the-great-moon-hoax-of-1835/">Wikimedia Commons / Public Domain in the United States</a></i></center>
</p>

Although such a sensationalist approach is still used for profit, fake news has taken a more complex turn, with added political, social and ethical stakes. The phenomenon has gained momentum during the COVID-19 crisis, in which several conspiracy theories have emerged that rely on the ease of creating and sharing unverified content to convince the public. With increasing digitization, the scale of fake news keeps growing <a id="ref-2" href="#cite-bhajun_2020_fake">(Bhajun et al., 2020)</a>.

### Project Motivation

Nowadays, when we get our news from social media, we are easily bombarded with scams, rumors, conspiracy theories and misleading news. Recognizing the truth can be exceptionally difficult when it is blended in with reliable information from genuine sources. If we were able to separate fake information from real information, we would be better placed to form unbiased opinions. For this reason, various machine learning algorithms are widely used to track, visualize and perform automated fact-checking on social media in order to filter out unverified claims <a id="ref-3" href="#cite-menczer_2016_misinformation">(Menczer, 2016)</a>. That is exactly the motivation behind this project. Here, we aim to distinguish between real and fake news as a means to halt the spread of misinformation.

## Create Virtual Environment

First, we will create a virtual environment `news-env` and activate it. Then, we will add it to this notebook as a Python kernel.

In [1]:
# create virtual environment
!python -m venv news-env

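# activate environment
# (note: each `!` command runs in its own subshell, so this activation
# does not carry over to the commands that follow)
!.\news-env\Scripts\activate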

#add to notebook kernel
!ipython kernel install --user --name=news-env

Installed kernelspec news-env in C:\Users\de777\AppData\Roaming\jupyter\kernels\news-env


Next, we will list all available kernels to check that `news-env` has been registered.

In [2]:
!jupyter kernelspec list

Available kernels:
  news-env     C:\Users\de777\AppData\Roaming\jupyter\kernels\news-env
  steam-env    C:\Users\de777\AppData\Roaming\jupyter\kernels\steam-env
  yt-env       C:\Users\de777\AppData\Roaming\jupyter\kernels\yt-env
  python3      D:\Python\share\jupyter\kernels\python3


We will need to refresh/relaunch the notebook and then select the new kernel by navigating to `Toolbar Menu` $\rightarrow$ `Kernel` $\rightarrow$ `Change kernel` $\rightarrow$ `news-env`.

## Import & Install Dependencies

Now, we will import and install all the required packages. If any other libraries are required to be installed, we will do that afterwards.

In [3]:
#!pip install scikit-learn

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
# for performance metrics of classification models

import re 
import string
# 're' and 'string' are used to remove special characters or any links from the text

## Import Fake and Real Datasets

We will start by importing the 'Fake' and 'Real' news datasets that are in CSV format.

In [4]:
df_fake = pd.read_csv("data/Fake.csv")
df_real = pd.read_csv("data/Real.csv")

We will also take a quick look at both imported datasets.

In [5]:
df_fake.head(5)

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [6]:
df_real.head(5)

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


Next, we will insert a new column called "class" into the two datasets in order to categorize them with a binary label (i.e., $0$ for fake news and $1$ for real news).

In [7]:
df_fake["class"] = 0
df_real["class"] = 1

We will set aside the last 10 rows from each dataset for manual testing and remove those rows from the main datasets.

In [8]:
# check shape of the datasets
df_fake.shape, df_real.shape

((23481, 5), (21417, 5))

In [9]:
# for fake news: set the last 10 rows aside, then drop them from the main dataframe
df_fake_manual_testing = df_fake.tail(10)
df_fake.drop(df_fake_manual_testing.index, axis = 0, inplace = True)

# for real news
df_real_manual_testing = df_real.tail(10)
df_real.drop(df_real_manual_testing.index, axis = 0, inplace = True)

We will check the dimensions of the main and manual testing dataframes.

In [10]:
# check shapes of main dataframes
df_fake.shape, df_real.shape

((23471, 5), (21407, 5))

In [11]:
# check shapes of manual testing dataframes
df_fake_manual_testing.shape, df_real_manual_testing.shape

((10, 5), (10, 5))

After this, we will merge the two manual testing dataframes into a single `df_manual_testing` dataframe and save it as a CSV file in the local working directory.

In [12]:
df_manual_testing = pd.concat([df_fake_manual_testing, df_real_manual_testing], axis = 0) 
# merge/concatenate DFs

df_manual_testing.reset_index(inplace = True, drop = True) 
# reset indices

df_manual_testing.to_csv("manual_testing.csv", index=False) 
# write to CSV

Next, we will merge the main datasets, `df_fake` and `df_real`, into `df_merge` and take a look at the resulting dataframe.

In [13]:
df_merge = pd.concat([df_fake, df_real], axis = 0 )
df_merge.head(10)

Unnamed: 0,title,text,subject,date,class
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0
5,Racist Alabama Cops Brutalize Black Boy While...,The number of cases of cops brutalizing and ki...,News,"December 25, 2017",0
6,"Fresh Off The Golf Course, Trump Lashes Out A...",Donald Trump spent a good portion of his day a...,News,"December 23, 2017",0
7,Trump Said Some INSANELY Racist Stuff Inside ...,In the wake of yet another court decision that...,News,"December 23, 2017",0
8,Former CIA Director Slams Trump Over UN Bully...,Many people have raised the alarm regarding th...,News,"December 22, 2017",0
9,WATCH: Brand-New Pro-Trump Ad Features So Muc...,Just when you might have thought we d get a br...,News,"December 21, 2017",0


Since we do not require `title`, `subject` and `date` for detecting fake news, we will drop these columns.

In [14]:
df_merge.columns

Index(['title', 'text', 'subject', 'date', 'class'], dtype='object')

In [15]:
df = df_merge.drop(["title", "subject","date"], axis = 1)
df.head(10)

Unnamed: 0,text,class
0,Donald Trump just couldn t wish all Americans ...,0
1,House Intelligence Committee Chairman Devin Nu...,0
2,"On Friday, it was revealed that former Milwauk...",0
3,"On Christmas day, Donald Trump announced that ...",0
4,Pope Francis used his annual Christmas Day mes...,0
5,The number of cases of cops brutalizing and ki...,0
6,Donald Trump spent a good portion of his day a...,0
7,In the wake of yet another court decision that...,0
8,Many people have raised the alarm regarding th...,0
9,Just when you might have thought we d get a br...,0


## Random Sampling of Dataframe

Let us now shuffle the dataset in order to prevent any classification bias.

In [16]:
df = df.sample(frac = 1)
df.head()

Unnamed: 0,text,class
9212,WASHINGTON (Reuters) - U.S. House Speaker Paul...,1
1649,"WASHINGTON (Reuters) - Paul Manafort, Presiden...",1
15493,LONDON (Reuters) - A project looking at links ...,1
21612,Not a bad imitation of a black pastor from a g...,0
21373,Maybe the Queen of Incompetence isn t as pop...,0


The dataframe is now randomly sampled. We will then check if there are any blank or null values in the dataset.

In [17]:
df.isnull().sum()

text     0
class    0
dtype: int64

We can see that there are no null values present. Next, we will reset the row indices so that the dataframe becomes sequential.

In [18]:
df.reset_index(inplace = True, drop = True)
df.head()

Unnamed: 0,text,class
0,WASHINGTON (Reuters) - U.S. House Speaker Paul...,1
1,"WASHINGTON (Reuters) - Paul Manafort, Presiden...",1
2,LONDON (Reuters) - A project looking at links ...,1
3,Not a bad imitation of a black pastor from a g...,0
4,Maybe the Queen of Incompetence isn t as pop...,0
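
Because `df.sample(frac = 1)` draws a fresh permutation on every run, the row order (and any later train-test split) changes between runs. A minimal sketch of a reproducible shuffle follows, assuming a fixed seed is acceptable. The toy dataframe and the seed value `42` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the merged dataframe (hypothetical values)
toy = pd.DataFrame({"text": ["a", "b", "c", "d", "e", "f"],
                    "class": [0, 0, 0, 1, 1, 1]})

# Shuffle all rows with a fixed seed, then rebuild a sequential index
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Re-running with the same seed reproduces the same order
shuffled_again = toy.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled.equals(shuffled_again))
```

Chaining `reset_index(drop=True)` combines the shuffling and index-reset steps shown above into one expression.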


## Cleaning the Text

We will now define a function that converts the text to lowercase and strips bracketed text, special characters (e.g., dots and commas), URLs, HTML tags and tokens containing digits.

In [19]:
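def word_drop(text):
    text = text.lower()                                # lowercase
    text = re.sub(r'\[.*?\]', '', text)                # drop bracketed text
    text = re.sub(r"\W", " ", text)                    # replace non-word characters with spaces
    # note: after the step above only word characters and spaces remain,
    # so the URL, HTML-tag and newline patterns below find little to match
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>+', '', text)                 # HTML tags
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                     # newlines
    text = re.sub(r'\w*\d\w*', '', text)               # tokens containing digits
    return text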

Then, we will apply this function on our dataframe.

In [20]:
df["text"] = df["text"].apply(word_drop)

In [21]:
df.head(10)

Unnamed: 0,text,class
0,washington reuters u s house speaker paul...,1
1,washington reuters paul manafort presiden...,1
2,london reuters a project looking at links ...,1
3,not a bad imitation of a black pastor from a g...,0
4,maybe the queen of incompetence isn t as pop...,0
5,warsaw reuters poland hopes turkey will ev...,1
6,tunis reuters tunisia s navy rescued almos...,1
7,now something is definitely off with trump fa...,0
8,at the start of the fox news greg gutfeld show...,0
9,ever since we ve heard the news of donald trum...,0


We can see that all unnecessary characters have been removed and the text is now in simple lowercase.
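
One detail of `word_drop` is worth noting: the non-word substitution runs before the URL and HTML patterns, so characters such as `:`, `/` and `<` are already gone when those patterns are applied, and repeated spaces are left in place. The sketch below shows one possible reordering, assuming we want URLs and tags removed as whole units and whitespace collapsed. The function name `clean_text` and the sample sentence are hypothetical, and switching to it would change the cleaned text and possibly the scores reported later.

```python
import re
import string

def clean_text(text):
    """Hypothetical variant of word_drop: strip structured patterns first."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", "", text)                # bracketed text
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # URLs, while ':' and '/' still exist
    text = re.sub(r"<.*?>", "", text)                  # HTML tags
    text = re.sub(r"\w*\d\w*", "", text)               # tokens containing digits
    text = re.sub(r"[%s]" % re.escape(string.punctuation), " ", text)  # punctuation -> space
    text = re.sub(r"\W", " ", text)                    # any remaining non-word character
    return re.sub(r"\s+", " ", text).strip()           # collapse repeated whitespace

sample = "Read <b>more</b> at https://example.com/a1 [VIDEO] in 2017, he said..."
print(clean_text(sample))  # -> read more at in he said
```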

## Train-test Partition

After the data is clean, we will define the independent variable (`text`) and the dependent variable (`class`) as $x$ and $y$, respectively, and split the dataset into training and testing sets. Here, we will take 25% of the data as the test set.

In [22]:
x = df["text"]
y = df["class"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25)
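
Since no seed is passed, the split above produces a different partition on every run, so the scores reported later will vary slightly between runs. A minimal sketch of a repeatable, class-balanced split is shown below, assuming that is desired. The toy data and the seed value `0` are illustrative, while `random_state` and `stratify` are existing parameters of `train_test_split`.

```python
from sklearn.model_selection import train_test_split

# Toy data: 8 fake (0) and 8 real (1) documents, purely illustrative
texts = [f"doc {i}" for i in range(16)]
labels = [0] * 8 + [1] * 8

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.25,    # same 25% hold-out as above
    random_state=0,    # fixed seed (assumed value) for repeatable splits
    stratify=labels,   # keep the fake/real ratio equal in both subsets
)
print(len(x_tr), len(x_te), sum(y_te))
```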

## Encode Text Data ( Vectorization )

Since we cannot perform calculations on raw text, we will vectorize the $x$ variable, i.e., convert the text to numeric vectors. For this, we will import the `TfidfVectorizer` class from the `sklearn` package. The vectorizer is fitted on the training set only and then applied to the test set.

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorization = TfidfVectorizer()
xv_train = vectorization.fit_transform(x_train)
xv_test = vectorization.transform(x_test)
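
To make the vectorization step less of a black box, the sketch below computes TF-IDF weights by hand using the smoothed formula $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$ followed by L2 normalization of each row, which corresponds to the default settings of `TfidfVectorizer`. It is a simplification: tokens are split on whitespace only, whereas `TfidfVectorizer` uses its own token pattern, and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed TF-IDF with L2-normalised rows (whitespace tokenisation assumed)."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

docs = ["trump said news", "reuters said report", "reuters news report"]
rows = tfidf_rows(docs)
print(rows[0])
```

In the first toy document, `trump` occurs in only one document and therefore receives a larger weight than `said` or `news`, which each occur in two.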

## Classification Models

### Logistic Regression

Coming to the actual classification stage, first we will perform logistic regression on the dataset.

In [24]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
LR.fit(xv_train, y_train) # fit data

LogisticRegression()

Let us check the score of this model.

In [25]:
LR.score(xv_test, y_test)

0.9874331550802139

So, our model has an accuracy score of approximately 98.74%.

We will generate the classification report that compares the actual classes with the predicted classes.

In [26]:
pred_lr = LR.predict(xv_test) # prediction
print(classification_report(y_test, pred_lr))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      5846
           1       0.99      0.99      0.99      5374

    accuracy                           0.99     11220
   macro avg       0.99      0.99      0.99     11220
weighted avg       0.99      0.99      0.99     11220
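
For readers unfamiliar with the columns of the report, the sketch below recomputes precision, recall, F1-score and support for one class from paired true and predicted labels. The helper name `binary_report` and the eight toy labels are hypothetical and unrelated to the notebook's data.

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall, F1 and support for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = sum(1 for t in y_true if t == positive)  # number of true samples of the class
    return {"precision": precision, "recall": recall, "f1": f1, "support": support}

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]
report = binary_report(y_true, y_pred, positive=1)
print(report)  # precision, recall and f1 are all 0.75; support is 4
```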



### Decision Tree Classification

Next, we will perform Decision Tree classification on this dataset.

In [27]:
from sklearn.tree import DecisionTreeClassifier

DT = DecisionTreeClassifier()
DT.fit(xv_train, y_train) # fit data

DecisionTreeClassifier()

In [28]:
# check accuracy score
DT.score(xv_test, y_test)

0.9948306595365419

This gives a slightly higher accuracy score than before, approximately 99.48%.

In [29]:
# classification report
pred_dt = DT.predict(xv_test)
print(classification_report(y_test, pred_dt))

              precision    recall  f1-score   support

           0       0.99      1.00      1.00      5846
           1       1.00      0.99      0.99      5374

    accuracy                           0.99     11220
   macro avg       0.99      0.99      0.99     11220
weighted avg       0.99      0.99      0.99     11220



### Gradient Boost Classifier

Let us apply a Gradient Boost Classifier on this dataset now.

In [30]:
from sklearn.ensemble import GradientBoostingClassifier

GBC = GradientBoostingClassifier(random_state = 0)
GBC.fit(xv_train, y_train) # fit data

GradientBoostingClassifier(random_state=0)

In [31]:
# accuracy score
GBC.score(xv_test, y_test)

0.9950980392156863

In [32]:
# classification report
pred_gbc = GBC.predict(xv_test)
print(classification_report(y_test, pred_gbc))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00      5846
           1       0.99      1.00      0.99      5374

    accuracy                           1.00     11220
   macro avg       1.00      1.00      1.00     11220
weighted avg       1.00      1.00      1.00     11220



Using the Gradient Boost Classifier, we obtain a high accuracy score of approximately 99.51%.

### Random Forest Classifier

Finally, we run a Random Forest Classification on this dataset.

In [33]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier(random_state = 0)
RFC.fit(xv_train, y_train)

RandomForestClassifier(random_state=0)

In [34]:
# accuracy score
RFC.score(xv_test, y_test)

0.9893048128342246

In [35]:
# classification report
pred_rfc = RFC.predict(xv_test)
print(classification_report(y_test, pred_rfc))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      5846
           1       0.99      0.99      0.99      5374

    accuracy                           0.99     11220
   macro avg       0.99      0.99      0.99     11220
weighted avg       0.99      0.99      0.99     11220



We obtain an accuracy score of approximately 98.93%, slightly lower than with the Decision Tree and Gradient Boost classifiers.
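
To compare the four models side by side, a small helper can collect the accuracy of each fitted model and sort the results. The sketch below assumes only that each model exposes `.score(X, y)`, as scikit-learn classifiers do. The names `rank_by_accuracy` and `ConstantModel` are hypothetical, and the stub class exists only so that the example runs on its own.

```python
def rank_by_accuracy(models, x_test, y_test):
    """Return (name, accuracy) pairs sorted from best to worst.

    `models` maps a display name to any fitted object exposing `.score(X, y)`.
    """
    scores = {name: model.score(x_test, y_test) for name, model in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

class ConstantModel:
    """Stub classifier that always predicts the same label (demo only)."""
    def __init__(self, label):
        self.label = label

    def score(self, X, y):
        return sum(1 for target in y if target == self.label) / len(y)

demo = rank_by_accuracy(
    {"always_real": ConstantModel(1), "always_fake": ConstantModel(0)},
    x_test=[None] * 4,
    y_test=[0, 0, 0, 1],
)
print(demo)  # -> [('always_fake', 0.75), ('always_real', 0.25)]
```

In the notebook, the call would presumably look like `rank_by_accuracy({"LR": LR, "DT": DT, "GBC": GBC, "RFC": RFC}, xv_test, y_test)`. With the scores reported above, the resulting order would be GBC, DT, RFC, LR.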

## Model Testing With Manual Entry

Finally, we will test the classification models manually. We will define a new function `manual_testing` that takes in raw text from a news article and prints the prediction of each classification model.

In [36]:
def output_label(n):
    # map the numeric class back to a readable label
    if n == 0:
        return "Fake News"
    elif n == 1:
        return "Not A Fake News"

def manual_testing(news):
    # wrap the raw text in a dataframe and clean it like the training data
    testing_news = {"text":[news]}
    new_def_test = pd.DataFrame(testing_news)
    new_def_test["text"] = new_def_test["text"].apply(word_drop)
    new_x_test = new_def_test["text"]
    # reuse the vectorizer that was fitted on the training set
    new_xv_test = vectorization.transform(new_x_test)
    pred_LR = LR.predict(new_xv_test) # logistic regression
    pred_DT = DT.predict(new_xv_test) # decision tree
    pred_GBC = GBC.predict(new_xv_test) # gradient boost classifier
    pred_RFC = RFC.predict(new_xv_test) # random forest classifier

    print("\n\nLR Prediction: {} \nDT Prediction: {} \nGBC Prediction: {} \nRFC Prediction: {}".format(
        output_label(pred_LR[0]),
        output_label(pred_DT[0]),
        output_label(pred_GBC[0]),
        output_label(pred_RFC[0])))

We will call this function and test it on text from individual news articles by manually copying and pasting the text into the input field.

In [37]:
news = str(input('Enter the news text:'))
manual_testing(news)

Enter the news text: Drunk Bragging Trump Staffer Started Russian Collusion Investigation


LR Prediction: Fake News 
DT Prediction: Fake News 
GBC Prediction: Fake News 
RFC Prediction: Fake News


Alternatively, we can perform this check by selecting text from news articles in the manual testing dataset (`df_manual_testing`) that we generated earlier. 

In [38]:
df_manual_testing

Unnamed: 0,title,text,subject,date,class
0,Seven Iranians freed in the prisoner swap have...,"21st Century Wire says This week, the historic...",Middle-east,"January 20, 2016",0
1,#Hashtag Hell & The Fake Left,By Dady Chery and Gilbert MercierAll writers ...,Middle-east,"January 19, 2016",0
2,Astroturfing: Journalist Reveals Brainwashing ...,Vic Bishop Waking TimesOur reality is carefull...,Middle-east,"January 19, 2016",0
3,The New American Century: An Era of Fraud,Paul Craig RobertsIn the last years of the 20t...,Middle-east,"January 19, 2016",0
4,Hillary Clinton: ‘Israel First’ (and no peace ...,Robert Fantina CounterpunchAlthough the United...,Middle-east,"January 18, 2016",0
5,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016",0
6,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016",0
7,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016",0
8,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016",0
9,10 U.S. Navy Sailors Held by Iranian Military ...,21st Century Wire says As 21WIRE predicted in ...,Middle-east,"January 12, 2016",0


In [39]:
#1st news article
news = str(df_manual_testing.iloc[0, 1])
manual_testing(news)



LR Prediction: Fake News 
DT Prediction: Fake News 
GBC Prediction: Fake News 
RFC Prediction: Fake News


In [40]:
# 11th news article (row index 10), i.e. the first real-news row
news = str(df_manual_testing.iloc[10, 1])
manual_testing(news)



LR Prediction: Not A Fake News 
DT Prediction: Not A Fake News 
GBC Prediction: Not A Fake News 
RFC Prediction: Not A Fake News


For these two samples, the predictions of all four models match the actual `class` values in the `df_manual_testing` dataframe, which suggests that the fake news detection models work as intended.
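
Two samples are a thin basis for this check, so it may be worth sweeping over every held-out row. The sketch below counts how many rows a prediction function classifies correctly. The helper `tally_matches`, the stub predictor `toy_predict` and the three example rows are hypothetical and serve only to keep the example self-contained.

```python
def tally_matches(rows, predict_fn):
    """Count how many (text, label) rows `predict_fn` classifies correctly."""
    hits = 0
    misses = []  # positions of the rows that were misclassified
    for position, (text, label) in enumerate(rows):
        if predict_fn(text) == label:
            hits += 1
        else:
            misses.append(position)
    return hits, misses

def toy_predict(text):
    # Stub for demonstration only: treats any text mentioning "reuters" as real (1)
    return 1 if "reuters" in text.lower() else 0

rows = [
    ("LONDON (Reuters) - officials met on Monday", 1),
    ("you will not believe what happened next", 0),
    ("Reuters was mentioned in this made-up story", 0),
]
hits, misses = tally_matches(rows, toy_predict)
print(hits, misses)  # -> 2 [2]
```

In the notebook, `rows` could be built with `zip(df_manual_testing["text"], df_manual_testing["class"])`, and `predict_fn` could wrap one model, for example `lambda t: LR.predict(vectorization.transform([word_drop(t)]))[0]`.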

---

## References

<a id="cite-brownfield_2020_the"/><sup><a href=#ref-1>[^]</a></sup>Brownfield, Troy. 2020. _The Great Moon Hoax of 1835_. [URL](https://www.saturdayeveningpost.com/2020/08/the-great-moon-hoax-of-1835/)

<a id="cite-bhajun_2020_fake"/><sup><a href=#ref-2>[^]</a></sup>Bhajun, Marie-Soleil and Lebel, Karine and Saint-Mleux, Arthur. 2020. _Fake news ou un problème de société (Fake news or a social problem)_. [URL](http://mediassocionumeriques.org/medias-sociaux/fake-news-ou-un-probleme-de-societe/)

<a id="cite-menczer_2016_misinformation"/><sup><a href=#ref-3>[^]</a></sup>Menczer, Filippo. 2016. _Misinformation on social media: Can technology save us?_. [URL](https://theconversation.com/misinformation-on-social-media-can-technology-save-us-69264)



<!--bibtex

@misc{menczer_2016_misinformation,
  author = {Menczer, Filippo},
  month = {11},
  title = {Misinformation on social media: Can technology save us?},
  url = {https://theconversation.com/misinformation-on-social-media-can-technology-save-us-69264},
  urldate = {2022-01-11},
  year = {2016},
  organization = {The Conversation}
}

@misc{bhajun_2020_fake,
  author = {Bhajun, Marie-Soleil and Lebel, Karine and Saint-Mleux, Arthur},
  month = {11},
  title = {Fake news ou un problème de société (Fake news or a social problem)},
  url = {http://mediassocionumeriques.org/medias-sociaux/fake-news-ou-un-probleme-de-societe/},
  urldate = {2022-01-12},
  year = {2020},
  organization = {Médias socionumériques (Socio-digital media)}
}

@misc{brownfield_2020_the,
  author = {Brownfield, Troy},
  month = {08},
  title = {The Great Moon Hoax of 1835},
  url = {https://www.saturdayeveningpost.com/2020/08/the-great-moon-hoax-of-1835/},
  urldate = {2022-01-12},
  year = {2020},
  organization = {The Saturday Evening Post}
}

-->