1. Domain-specific area

Fake news has always been around, but the rise of technology has changed how easily it spreads. It used to travel by word of mouth, television, newspapers and similar channels. With social media now so prominent, fake news spreads like wildfire, especially since a lot of people do not cross-check their sources and end up believing it.

2. Objectives

The main objective of this project is to use text classifiers such as Random Forest Classifier, Logistic Regression and Gradient Boosting Classifier, and to measure the accuracy of each one, so that we know which classifier is better at telling fake news from real news.

The ability to separate real news from fake news would help prevent news consumers from picking up false information that clouds their judgement on certain topics and from spreading it to others.

3. Dataset

The dataset used for this project was found on Kaggle, titled "Fake News Detection". Two csv files are provided: fake.csv, consisting of fake news, and true.csv, containing real news.

The fake news csv file contains 4 columns (title, text, subject and date) with 17903 unique values. The real news csv has the same columns, with 20826 unique values. Fake.csv is 61 MB and True.csv is 52 MB. All four columns hold string values; the class label added during preprocessing is a binary integer (0 for fake, 1 for real). The data was sourced from social media and news sites, with dates ranging from 31/03/2015 to 19/02/2018.

The dataset is suitable for this project because it has no missing values, which saves time and effort on cleaning, and it provides enough data to work with.

4. Evaluation Methodology

To decide which evaluation metric to use, some tests will be conducted on the outputs.

The first evaluation metric is Accuracy, which measures the ratio of correctly labelled items to the total number of items. This metric is useful, but it requires a balanced dataset with similar amounts of real and fake news.

The second evaluation metric is Precision, which checks how many of the items predicted as a given class actually belong to that class.

The last evaluation metric is F1-score, the harmonic mean of precision and recall, where recall checks how many of the items that truly belong to a class are found. Its only downside is that a low value in either precision or recall pulls the score down.

Of these metrics, Accuracy and Precision will be used, since the dataset contains roughly equal amounts of real and fake data.
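
To make the metric definitions concrete, the sketch below computes accuracy, precision, recall and F1 from confusion-matrix counts. The helper name `binary_metrics` and the counts are hypothetical and chosen only for illustration; they are not results from this project.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for one class (hypothetical helper)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # correct predictions / all predictions
    precision = tp / (tp + fp)                   # of items predicted positive, how many are positive
    recall = tp / (tp + fn)                      # of truly positive items, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative counts only (assumed, not taken from the dataset)
metrics = binary_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of precision and recall than a plain average would.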

II. Implementation

Preprocessing

In [1]:
#Import basic operations
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import re

In [2]:
#Import machine learning libraries and others
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from nltk.corpus import stopwords
from gensim.parsing.preprocessing import STOPWORDS

**Importing Dataset**

In [3]:
try:
    df_fake = pd.read_csv('fake.csv')
    df_real = pd.read_csv('real.csv')
except FileNotFoundError as err:
    # Stop here: the later cells need both dataframes
    print(f"error reading csv files: {err}")
    raise

In [4]:
#show dataset for fake news
df_fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [5]:
#show dataset for real news
df_real.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [6]:
#Adding column to represent fake or real news
df_fake["class"] = 0
df_real["class"] = 1

In [7]:
df_fake.shape, df_real.shape

((23481, 5), (21417, 5))

**Merging both fake and real dataframes**

In [8]:
df_merge = pd.concat([df_fake, df_real], axis = 0 )
df_merge.head(5)

Unnamed: 0,title,text,subject,date,class
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


In [9]:
#Check dataset column post merging
df_merge.columns

Index(['title', 'text', 'subject', 'date', 'class'], dtype='object')

**Dropping unnecessary columns**

In [10]:
df = df_merge.drop(["title", "subject","date"], axis = 1)

In [11]:
# Check for null values in dataset
df_merge.isnull().sum()

title      0
text       0
subject    0
date       0
class      0
dtype: int64

In [12]:
df.head()

Unnamed: 0,text,class
0,Donald Trump just couldn t wish all Americans ...,0
1,House Intelligence Committee Chairman Devin Nu...,0
2,"On Friday, it was revealed that former Milwauk...",0
3,"On Christmas day, Donald Trump announced that ...",0
4,Pope Francis used his annual Christmas Day mes...,0


In [13]:
df.shape

(44898, 2)

In [14]:
df.drop_duplicates(inplace = True)

In [15]:
df.shape

(38647, 2)

In [16]:
#Stopwords
stop_words = stopwords.words('english')

In [17]:
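def textProcess(text):
    # Lowercase the text, then strip bracketed spans, non-word characters and digit-bearing tokens
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)  # drop [bracketed] spans
    text = re.sub(r'\W', ' ', text)  # replace every non-word character with a space
    # Note: once \W has been replaced, the URL, HTML-tag and newline patterns below
    # have nothing left to match; they are kept so the behaviour matches the run shown
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # removes remaining underscores
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)  # drop tokens that contain digits
    return text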

In [18]:
df["text"] = df["text"].apply(textProcess)
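
In `textProcess`, every non-word character is replaced before the URL and HTML-tag patterns run, so those patterns cannot match anything. The sketch below is one possible reordering, not the method used for the scores in this notebook; `clean_text` is a hypothetical name, and switching to it would mean re-running every model.

```python
import re

def clean_text(text):
    """Hypothetical variant of textProcess that removes URLs and tags before punctuation."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                # drop [bracketed] spans
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs while '://' and '.' still exist
    text = re.sub(r'<.*?>+', '', text)                 # drop HTML tags while '<' and '>' still exist
    text = re.sub(r'\w*\d\w*', '', text)               # drop tokens that contain digits
    text = re.sub(r'\W|_', ' ', text)                  # remaining punctuation and underscores -> space
    text = re.sub(r'\s+', ' ', text).strip()           # collapse repeated whitespace
    return text

print(clean_text("Read <b>more</b> at https://example.com/a1 [video] in 2017!"))
# prints: read more at in
```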

In [19]:
x = df["text"]
y = df["class"]

6. Baseline Performance

The baseline chosen for this project is PassiveAggressiveClassifier. It stays passive when a sample is classified correctly and updates its weights aggressively when a sample is misclassified.

In [20]:
#Split arrays or matrices into random train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
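
The split above is random on every run, so the scores can shift slightly between runs. A minimal sketch of a reproducible, class-balanced split follows, assuming toy data; `texts`, `labels` and the seed value 42 are illustrative and not part of the project.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the article texts and class labels (assumed data)
texts = [f"sample article {i}" for i in range(40)]
labels = [0] * 20 + [1] * 20

# random_state fixes the shuffle; stratify keeps the fake/real ratio equal in both subsets
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)
print(len(x_tr), len(x_te))
```

Fixing the seed would change the exact numbers reported in this notebook, so it is shown only as an option for future runs.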

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

Vectorizer = TfidfVectorizer()
DataTrain = Vectorizer.fit_transform(x_train)
DataTest = Vectorizer.transform(x_test)
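
The cells shown here do not train the PassiveAggressiveClassifier baseline itself. A minimal sketch of how it could be fitted on TF-IDF features is given below, using a tiny invented corpus; every variable name and sentence in it is an assumption made for illustration, and it produces no score for this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Invented toy corpus: 0 = fake, 1 = real (illustrative only)
toy_train = [
    "shocking secret hoax exposed",
    "unbelievable hoax shocking claim",
    "senate passes budget bill",
    "government budget vote senate",
]
toy_labels = [0, 0, 1, 1]
toy_test = ["shocking hoax", "senate budget"]

# Fit the vectorizer on training text only, then reuse it for the test text
toy_vectorizer = TfidfVectorizer()
toy_train_vec = toy_vectorizer.fit_transform(toy_train)
toy_test_vec = toy_vectorizer.transform(toy_test)

baseline = PassiveAggressiveClassifier(max_iter=1000, random_state=0)
baseline.fit(toy_train_vec, toy_labels)
toy_pred = baseline.predict(toy_test_vec)
print(list(toy_pred))
```

On the real data, the same pattern would use `DataTrain`, `y_train` and `DataTest` in place of the toy variables.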

7. Classification approach

The three classifiers used in this project are Logistic Regression, GradientBoosting Classifier and Random Forest Classifier. Of the three, GradientBoosting Classifier achieves the highest accuracy and precision, as the test results at the end of this section show.

In [22]:
from sklearn.linear_model import LogisticRegression

# Note: this rebinds the imported class name to the model instance
LogisticRegression = LogisticRegression()
LogisticRegression.fit(DataTrain,y_train)

LogisticRegression()

In [23]:
PredictionLR=LogisticRegression.predict(DataTest)

In [24]:
LogisticRegression.score(DataTest, y_test)

0.984889256882633

In [25]:
print(classification_report(y_test, PredictionLR))

              precision    recall  f1-score   support

           0       0.99      0.98      0.98      4335
           1       0.98      0.99      0.99      5327

    accuracy                           0.98      9662
   macro avg       0.99      0.98      0.98      9662
weighted avg       0.98      0.98      0.98      9662



In [26]:
from sklearn.ensemble import GradientBoostingClassifier

# Note: this rebinds the imported class name to the model instance
GradientBoostingClassifier = GradientBoostingClassifier(random_state=0)
GradientBoostingClassifier.fit(DataTrain, y_train)

GradientBoostingClassifier(random_state=0)

In [27]:
PredictionGBC = GradientBoostingClassifier.predict(DataTest)

In [28]:
GradientBoostingClassifier.score(DataTest, y_test)

0.9947215897329745

In [29]:
print(classification_report(y_test, PredictionGBC))

              precision    recall  f1-score   support

           0       1.00      0.99      0.99      4335
           1       0.99      1.00      1.00      5327

    accuracy                           0.99      9662
   macro avg       0.99      0.99      0.99      9662
weighted avg       0.99      0.99      0.99      9662



In [30]:
from sklearn.ensemble import RandomForestClassifier

# Note: this rebinds the imported class name to the model instance
RandomForestClassifier = RandomForestClassifier(random_state=0)
RandomForestClassifier.fit(DataTrain, y_train)

RandomForestClassifier(random_state=0)

In [31]:
PredictionRFC = RandomForestClassifier.predict(DataTest)

In [32]:
RandomForestClassifier.score(DataTest, y_test)

0.9832332850341544

In [33]:
print(classification_report(y_test, PredictionRFC))

              precision    recall  f1-score   support

           0       0.99      0.97      0.98      4335
           1       0.98      0.99      0.98      5327

    accuracy                           0.98      9662
   macro avg       0.98      0.98      0.98      9662
weighted avg       0.98      0.98      0.98      9662



In [34]:
def Prediction(n):
    if n == 0:
        return "Fake"
    elif n == 1:
        return "Real"
    
def DataTesting(news):
    TestNews = {"text":[news]}
    newTest = pd.DataFrame(TestNews)
    newTest["text"] = newTest["text"].apply(textProcess) 
    newTest2 = newTest["text"]
    TestVector = Vectorizer.transform(newTest2)
    PredictionGBC = GradientBoostingClassifier.predict(TestVector)

    return print("\nGBC Prediction: {}".format(Prediction(PredictionGBC[0])))

In [38]:
news = input()  # input() already returns a string
DataTesting(news)

WASHINGTON (Reuters) - The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on Sunday and urged budget restraint in 2018. In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows, speaking on CBS’ “Face the Nation,” drew a hard line on federal spending, which lawmakers are bracing to do battle over in January. When they return from the holidays on Wednesday, lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues, such as immigration policy, even as the November congressional election campaigns approach in which Republicans will seek to keep control of Congress. President Donald Trump and his Republicans want a big budget increase in military spending, while Democrats also want proportional increases for non-defense “discretionary” spending on programs that s

In [39]:
news = input()  # input() already returns a string
DataTesting(news)

21st Century Wire says As 21WIRE predicted in its new year s look ahead, we have a new  hostage  crisis underway.Today, Iranian military forces report that two small riverine U.S. Navy boats were seized in Iranian waters, and are currently being held on Iran s Farsi Island in the Persian Gulf. A total of 10 U.S. Navy personnel, nine men and one woman, have been detained by Iranian authorities. NAVY STRAYED: U.S. Navy patrol boat in the Persian Gulf (Image Source: USNI)According to the Pentagon, the initial narrative is as follows: The sailors were on a training mission around noon ET when their boat experienced mechanical difficulty and drifted into Iranian-claimed waters and were detained by the Iranian Coast Guard, officials added. The story has since been slightly revised by White House spokesman Josh Earnest to follow this narrative:The 2 boats were traveling en route from Kuwait to Bahrain, when they were stopped and detained by the Iranians.According to USNI, search and rescue te

8. Coding Style

The coding style used in this project relies on meaningful variable names to make the code easier to understand.

9. Evaluation

GradientBoosting Classifier scored 99.5% accuracy, compared with 98.5% for Logistic Regression and 98.3% for Random Forest Classifier. GradientBoosting Classifier also outperforms the baseline, Passive Aggressive Classifier. Although all of the classifiers tested performed well, GradientBoosting Classifier is selected as the text classifier since it has the highest precision score.

10. Summary and conclusions

Based on the data tested in this project, GradientBoosting Classifier is the best choice, since it has the highest accuracy and precision of the classifiers compared.



11. Additional testing with other classifiers

Naive Bayes

In [40]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [41]:
clf = MultinomialNB()
clf.fit(DataTrain, y_train)

MultinomialNB()

In [43]:
y_pred = clf.predict(DataTest)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.936038087352515


In [44]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.91      0.93      4335
           1       0.93      0.96      0.94      5327

    accuracy                           0.94      9662
   macro avg       0.94      0.93      0.94      9662
weighted avg       0.94      0.94      0.94      9662



In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Experiment with different vectorization techniques
tfidf = TfidfVectorizer(stop_words='english', max_df=0.7)
hashing = HashingVectorizer(stop_words='english')
# Note: this pipeline is not a word embedding; it applies a second tf-idf
# reweighting (TfidfTransformer) on top of TfidfVectorizer's output
double_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('tfidf_transformer', TfidfTransformer()),
])

# Compare performance using cross-validation
from sklearn.model_selection import cross_val_score
for vec in [tfidf, hashing, double_tfidf]:
    pipe = Pipeline([('vec', vec), ('clf', LogisticRegression())])
    scores = cross_val_score(pipe, x_train, y_train, cv=5, scoring='accuracy')
    print(f"Vectorization: {vec.__class__.__name__}, Mean Accuracy: {scores.mean():.3f}")


Vectorization: TfidfVectorizer, Mean Accuracy: 0.980
Vectorization: HashingVectorizer, Mean Accuracy: 0.987
Vectorization: Pipeline, Mean Accuracy: 0.970


In [48]:
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

# Experiment with different classification algorithms
for clf in [MultinomialNB(), LogisticRegression(), LinearSVC(), RandomForestClassifier()]:
    pipe = Pipeline([('tfidf', TfidfVectorizer(stop_words='english', max_df=0.7)), ('clf', clf)])
    scores = cross_val_score(pipe, x_train, y_train, cv=5, scoring='accuracy')
    print(f"Classifier: {clf.__class__.__name__}, Mean Accuracy: {scores.mean():.3f}")


Classifier: MultinomialNB, Mean Accuracy: 0.931
Classifier: LogisticRegression, Mean Accuracy: 0.980
Classifier: LinearSVC, Mean Accuracy: 0.990
Classifier: RandomForestClassifier, Mean Accuracy: 0.984


In [49]:
from sklearn.model_selection import GridSearchCV

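# Fine-tune hyperparameters using grid search
# Note: with LogisticRegression's default lbfgs solver only the 'l2' penalty can be fitted;
# the 'l1' and 'elasticnet' candidates fail to fit and are scored as NaN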
pipe = Pipeline([('tfidf', TfidfVectorizer(stop_words='english', max_df=0.7)), ('clf', LogisticRegression())])
param_grid = {
    'clf__C': [0.1, 1, 10],
    'clf__penalty': ['l1', 'l2', 'elasticnet'],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(x_train, y_train)
print(f"Best params: {grid.best_params_}, Mean Accuracy: {grid.best_score_:.3f}")

Best params: {'clf__C': 10, 'clf__penalty': 'l2'}, Mean Accuracy: 0.987
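
In the grid above, only the 'l2' candidates can actually be fitted with the default solver. The sketch below shows one way to make the 'l1' candidates count as well, by choosing the `liblinear` solver, which supports both penalties. The toy corpus, the names and `cv=2` are assumptions for illustration; this is not a result for the project's dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy corpus: 0 = fake, 1 = real (illustrative only)
toy_docs = [
    "shocking hoax secret exposed", "unbelievable hoax claim shocking",
    "secret hoax shocking claim", "exposed secret unbelievable hoax",
    "senate passes budget bill", "government budget vote senate",
    "senate vote government bill", "budget bill government passes",
]
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]

# liblinear handles both 'l1' and 'l2', so every candidate in the grid is fitted
toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])
toy_grid = GridSearchCV(
    toy_pipe,
    {"clf__C": [0.1, 1, 10], "clf__penalty": ["l1", "l2"]},
    cv=2,
    scoring="accuracy",
)
toy_grid.fit(toy_docs, toy_y)

# No candidate should be NaN once the solver and penalties are compatible
print(np.isnan(toy_grid.cv_results_["mean_test_score"]).any())
print(toy_grid.best_params_)
```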
