# Fake News Classification

Given a set of news articles in text form, the goal is to build a machine learning model that identifies whether an article is fake news. The observations are labelled as fake or real, so we'll treat this as a supervised learning problem. We'll use Natural Language Processing (NLP) techniques to create a Bag-of-Words representation of each article's text and then train a binary classification model to predict the labels.
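
As a minimal sketch of what a Bag-of-Words representation looks like, the snippet below counts words in two made-up sentences. The documents and the `bag_of_words` helper are assumptions for illustration only and are not part of the notebook's pipeline.

```python
from collections import Counter

# Two made-up documents, used purely for illustration.
docs = ["the senate passed the bill", "the bill was vetoed"]

# Vocabulary: every distinct word across the documents, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})


def bag_of_words(doc, vocabulary):
    """Return the count of each vocabulary word in `doc`, ignoring word order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


vectors = [bag_of_words(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Each article becomes a fixed-length vector with one position per vocabulary word, which is the form a classifier can consume.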

## Library Imports

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set_style('darkgrid')

In [2]:
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.linear_model import SGDClassifier

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline, Pipeline

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [3]:
import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords


In [4]:
# seed for the random number generator, so that results are reproducible
SEED = 1234

## Exploratory Data Analysis

### Basic Information

In [5]:
# load data from csv file into data frame
raw = pd.read_csv('train.csv')
raw.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [6]:
raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [7]:
raw.drop_duplicates(inplace=True)
raw.shape

(20800, 5)

In [8]:
# drop every observation that has a null value in any column
raw.dropna(inplace=True)
raw.shape

(18285, 5)

In [9]:
# class distribution of labels: 0: Real and 1: Fake news

raw.label.value_counts()

0    10361
1     7924
Name: label, dtype: int64

In [10]:
raw.label.value_counts(normalize=True)

0    0.566639
1    0.433361
Name: label, dtype: float64

The two classes are somewhat imbalanced: about 57% of the articles are real and 43% are fake.
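
The class counts also give a floor for judging accuracy later on. The sketch below, which only reuses the counts printed above, computes the accuracy of a trivial model that always predicts the majority class; the variable names are our own.

```python
# Class counts reported by raw.label.value_counts() above.
class_counts = {0: 10361, 1: 7924}  # 0: Real, 1: Fake

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)
majority_baseline = class_counts[majority_label] / total

print(f"Always predicting {majority_label} scores {majority_baseline:.3f} accuracy")
```

Any model worth keeping should score clearly above this value.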

### Text Processing

In [11]:
y = raw['label'].to_numpy()

In [12]:
corpus = raw['text'].to_numpy()
corpus.shape

(18285,)

In [13]:
type(corpus[1])

str

In [14]:
for i in range(corpus.shape[0]):
    
    # split each news article text string in the corpus into word tokens
    corpus[i] = word_tokenize(corpus[i])
    

In [15]:
for i in range(corpus.shape[0]):
    
    # convert all word tokens into lower case
    corpus[i] = [token.lower() for token in corpus[i]]
    
    # remove tokens consisting of non-alphabetic characters (ASCII symbols, punctuation, digits, emojis)
    # retain foreign language characters
    corpus[i] = [token for token in corpus[i] if token.isalpha()]
    

In [16]:
stop_words = set(stopwords.words('english'))

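for i in range(corpus.shape[0]):
    # remove English language stop-words
    # note: the set difference also drops repeated tokens and the original word order,
    # so each remaining word appears at most once per article
    corpus[i] = list(set(corpus[i]).difference(stop_words))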
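
The set difference above has a side effect worth seeing on a small example: it removes repeated tokens and word order along with the stop-words. The sketch below compares it with a list comprehension that keeps both. The stop-word set and tokens are hypothetical stand-ins for nltk's English list and a real article.

```python
# Hypothetical stop-word list; the notebook uses nltk's stopwords.words('english').
stop_words = {"the", "a", "of", "was"}

tokens = ["the", "bill", "was", "a", "bill", "of", "rights"]

# Set difference (as in the cell above): drops stop-words, but also
# collapses the repeated "bill" and returns tokens in arbitrary order.
as_set = set(tokens).difference(stop_words)

# List comprehension: drops stop-words while keeping order and repeats.
as_list = [token for token in tokens if token not in stop_words]

print(sorted(as_set))
print(as_list)
```

Whether keeping term counts and word order would change the scores reported below is not tested here; it matters most for count-based weighting and for n-grams longer than one word.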
    

In [17]:
lemmatizer = WordNetLemmatizer()

for i in range(corpus.shape[0]):
    # Lemmatize word tokens
    corpus[i] = [lemmatizer.lemmatize(token) for token in corpus[i]]
    

In [18]:
# convert each list of word tokens into a string of space-separated words

for i in range(corpus.shape[0]):
    corpus[i] = " ".join(corpus[i])
    

### Prepare Analytical Base Table

In [19]:
X_train, X_test, y_train, y_test = train_test_split(corpus, y, test_size=0.25, stratify=y, random_state=SEED)

In [20]:
print(X_train.shape, y_train.shape)

(13713,) (13713,)


In [21]:
print(X_test.shape, y_test.shape)

(4572,) (4572,)
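
The shapes printed above can be checked by hand. The sketch below assumes that, for a float `test_size`, `train_test_split` rounds the test set up and leaves the rest for training. The class counts come from the `value_counts()` output earlier and from the support column of the classification report later in the notebook.

```python
import math

n_samples = 18285   # rows left after dropna()
test_size = 0.25

# Assumption: the test set size is rounded up, the training set gets the remainder.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# With stratify=y, the share of fake articles should be almost the same
# in the test set as in the full data set.
test_support = {"Real": 2591, "Fake": 1981}
fake_share_test = test_support["Fake"] / sum(test_support.values())
fake_share_all = 7924 / n_samples
print(round(fake_share_test, 4), round(fake_share_all, 4))
```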


## Model Selection

### Get Baseline Scores for Several Models

In [22]:
# create pipeline dictionary for all models with default parameters, unless otherwise specified

pipelines = {'cnb': Pipeline([
                ('vect', TfidfVectorizer()),
                ('clf', ComplementNB())
                ]),
             
             'mnb': Pipeline([
                ('vect', TfidfVectorizer()),
                ('clf', MultinomialNB())
                ]),
             
             'sgd': Pipeline([
                ('vect', TfidfVectorizer()),
                ('clf', SGDClassifier(class_weight='balanced', random_state=SEED))
                ]),
            }
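
Every pipeline starts with `TfidfVectorizer`, which turns each article into a vector of TF-IDF weights. The sketch below reproduces that weighting by hand on two made-up documents, assuming the vectorizer's default settings of smoothed idf and L2 normalisation. The documents and the `tfidf` helper are illustrative only.

```python
import math
from collections import Counter

docs = ["fbi email email", "fbi report"]  # made-up, already-cleaned documents
vocabulary = sorted({w for d in docs for w in d.split()})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = Counter(w for d in docs for w in set(d.split()))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}


def tfidf(doc):
    """Raw term counts times idf, scaled to unit Euclidean length."""
    counts = Counter(doc.split())
    weights = [counts[w] * idf[w] for w in vocabulary]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights]


for d in docs:
    print([round(x, 3) for x in tfidf(d)])
```

A word that appears in every document, such as "fbi" here, gets the minimum idf of 1, so rarer words carry more weight.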

In [23]:
# fit the models to training data and evaluate their performance on test data

baseline_accuracy = {}

for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    
    y_pred = pipe.predict(X_test)
    # built-in round() works whether accuracy_score returns a Python or a NumPy float
    baseline_accuracy[name] = round(accuracy_score(y_test, y_pred), 3)


In [24]:
print('Baseline Test Accuracy')
print(baseline_accuracy)

Baseline Test Accuracy
{'cnb': 0.707, 'mnb': 0.682, 'sgd': 0.952}


SGDClassifier performs best, with 95.2% test accuracy, well ahead of the two Naive Bayes models (70.7% and 68.2%). We'll fine-tune its hyperparameters to try to improve performance.

## Model Training

In [29]:
parameters = {'vect__ngram_range': [(1,1), (1,3)],
              'clf__alpha': [0.001, 0.0001],
              'clf__penalty': ['l1', 'l2', 'elasticnet']}

In [30]:
sgd_cv = GridSearchCV(pipelines['sgd'], parameters, cv=10)

In [31]:
sgd_cv.fit(X_train, y_train)

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('vect', TfidfVectorizer()),
                                       ('clf',
                                        SGDClassifier(class_weight='balanced',
                                                      random_state=1234))]),
             param_grid={'clf__alpha': [0.001, 0.0001],
                         'clf__penalty': ['l1', 'l2', 'elasticnet'],
                         'vect__ngram_range': [(1, 1), (1, 3)]})
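
To get a feel for the cost of this search, the sketch below enumerates the parameter grid with the standard library and counts the model fits. It assumes the default `refit=True`, which adds one final fit on the full training set; `candidates` and `n_fits` are our own names.

```python
from itertools import product

parameters = {'vect__ngram_range': [(1, 1), (1, 3)],
              'clf__alpha': [0.001, 0.0001],
              'clf__penalty': ['l1', 'l2', 'elasticnet']}
cv_folds = 10

# Every combination of the listed values is one candidate model.
candidates = [dict(zip(parameters, values)) for values in product(*parameters.values())]

# Each candidate is fitted once per fold; with refit=True the best
# candidate is then fitted once more on the whole training set.
n_fits = len(candidates) * cv_folds + 1
print(len(candidates), n_fits)
```

Adding one more value to any parameter list multiplies the number of candidates, so the grid grows quickly.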

In [32]:
# mean cross-validated accuracy of the best parameter combination (not training accuracy)
sgd_cv.best_score_

0.9583611375153375

In [33]:
sgd_cv.best_params_

{'clf__alpha': 0.0001, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1)}

In [34]:
y_pred = sgd_cv.predict(X_test)

In [35]:
print(classification_report(y_test, y_pred, target_names=['Real', 'Fake']))

              precision    recall  f1-score   support

        Real       0.96      0.96      0.96      2591
        Fake       0.94      0.94      0.94      1981

    accuracy                           0.95      4572
   macro avg       0.95      0.95      0.95      4572
weighted avg       0.95      0.95      0.95      4572
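
For reference, the per-class columns in the report follow from true positive, false positive and false negative counts. The sketch below shows the arithmetic on hypothetical counts that are not taken from this model; `precision_recall_f1` is a helper written for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from its error counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the "Fake" class, for illustration only.
tp, fp, fn = 90, 10, 30
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```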



In [46]:
feature_weights = sgd_cv.best_estimator_.named_steps['clf'].coef_[0]
feature_weights.shape

(120210,)

In [47]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = sgd_cv.best_estimator_.named_steps['vect'].get_feature_names_out()

In [49]:
# words with the largest positive weights, i.e. the strongest indicators of class 1 (Fake)
# note: the slice [:-20:-1] returns 19 indices, not 20
indices = np.argsort(feature_weights)[:-20:-1]

for idx in indices:
    print("%s feature %d (%f)" % (feature_names[idx], idx, feature_weights[idx]))

october feature 68315 (5.957944)
november feature 67479 (4.893250)
share feature 87496 (3.586660)
clinton feature 17770 (3.251315)
source feature 90708 (3.234632)
print feature 76004 (2.966495)
hillary feature 43681 (2.863047)
article feature 5592 (2.778789)
via feature 103609 (2.736486)
comment feature 18691 (2.659843)
oct feature 68305 (2.545226)
snip feature 89896 (2.225676)
however feature 44777 (1.996091)
nov feature 67462 (1.921790)
fact feature 33094 (1.821012)
wikileaks feature 106303 (1.817574)
fbi feature 33843 (1.672069)
reprinted feature 81093 (1.670920)
min feature 62028 (1.644401)
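
The list above has 19 entries because the slice `[:-20:-1]` stops one position short of 20. The sketch below shows the slicing behaviour on hypothetical random weights and also how to pull out the most negative weights, which point towards class 0 (Real). The array here is a stand-in for illustration, not the fitted coefficients.

```python
import numpy as np

# Hypothetical weights for a 30-word vocabulary, for illustration only.
rng = np.random.default_rng(0)
feature_weights = rng.normal(size=30)

order = np.argsort(feature_weights)   # ascending: most negative weight first

top_19 = order[:-20:-1]      # walks back from the end, stops before position -20
top_20 = order[::-1][:20]    # 20 largest weights (towards class 1, Fake)
bottom_20 = order[:20]       # 20 smallest weights (towards class 0, Real)

print(len(top_19), len(top_20), len(bottom_20))
```

Looking at both ends of the ranking shows which words push an article towards each label.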
