The Internet has not only made information accessible to the masses but has also become a hotspot of misinformation and fake news. Fake news causes more harm when it is not correctly identified and tagged. The severity of its effects can be judged from the fact that riots and killings have been attributed to fake news.

Fake news can even sway people's opinions and affiliations - a fact that political parties have used (and still use) to make people vote in their favour.

As such, it has become necessary to separate the real news from the fake. But doing this manually is not feasible, given the huge amount of information that is churned out every minute on the internet.

To overcome this problem, machine learning and natural language processing can be used to automatically separate the fake news from the real.

# Overview
In this project, I've tried to classify the given news as fake or real using a Naive Bayes classifier and a Passive Aggressive Classifier. I tried using Naive Bayes with hyperparameter tuning.

The best result was given by the Passive Aggressive Classifier, with an accuracy of over 99.5%.

# Importing necessary Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
import itertools
import nltk
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_confusion_matrix
%matplotlib inline

# Dataset
The [dataset used here](https://www.kaggle.com/pnkjgpt/fake-news-dataset) consists of a train and a test file. The test file can be ignored as it doesn't contain the labels (as I'm doing this project as a part of a competition). We will work only on the train data.

The train dataset contains 7 columns - **'index', 'title', 'text', 'subject', 'date', 'class', 'Unnamed: 6'**

In [None]:
train = pd.read_csv('../input/fake-news-dataset/train.csv')
train.head()

In [None]:
train.shape

**Checking for null values**

In [None]:
train.isnull().sum()

**Checking whether the dataset is balanced or imbalanced**

In [None]:
train['class'].value_counts()

The dataset is fairly balanced with the number of Real and Fake classes almost equal. 

There seems to be another class named **'February 5, 2017'** consisting of only one data point. On further inspection, I found that the values for this data point have been shifted one column to the right. Since it is just one point, it can either be removed or have its values shifted back.

I chose to shift the values back into their correct columns.

In [None]:
train[train['class'] == 'February 5, 2017']

In [None]:
#shifting the column values in the respective places
train.iloc[504, 2] = train.iloc[504, 3]
train.iloc[504, 3] = train.iloc[504, 4]
train.iloc[504, 4] = train.iloc[504, 5]
train.iloc[504, 5] = train.iloc[504, 6]
train.iloc[504, 6] = np.nan

In [None]:
train.iloc[[504]]
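
The positional assignments above depend on the broken row sitting at position 504. As a minimal sketch of the same repair, the row can instead be located by its misplaced class value. The toy frame, its values and the `bad` mask are illustrative assumptions, not the competition data:

```python
import numpy as np
import pandas as pd

# Toy frame with the same column layout; row 1 has its values pushed one column to the right
toy = pd.DataFrame({
    'index': [0, 1],
    'title': ['Good title', 'Broken title'],
    'text': ['Good text', 'stray value'],
    'subject': ['News', 'Broken text'],
    'date': ['January 1, 2017', 'politics'],
    'class': ['Real', 'February 5, 2017'],
    'Unnamed: 6': [np.nan, 'Fake'],
})

cols = ['text', 'subject', 'date', 'class', 'Unnamed: 6']
bad = toy['class'] == 'February 5, 2017'   # rows whose class holds a date

# Copy each column from its right-hand neighbour, going left to right so that
# every source value is read before it is overwritten
for dst, src in zip(cols[:-1], cols[1:]):
    toy.loc[bad, dst] = toy.loc[bad, src]
toy.loc[bad, cols[-1]] = np.nan            # the last column is now empty
```

Selecting by mask keeps the fix valid even if rows are reordered or dropped before this step.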

The **index** and **Unnamed: 6** columns can now be removed as these are redundant and don't convey any information.

In [None]:
train.drop(['index', 'Unnamed: 6'], axis = 1, inplace=True)
train.head()

# Data description

In [None]:
train.describe(include = 'all').T

From the above table, it can be seen that out of 40,000 titles and texts, 35,075 and 34,965 are unique, respectively.

So the duplicated rows should be removed. I dropped duplicates based on **text**, as it has more non-unique values compared to **title**.

In [None]:
train.drop_duplicates(subset = ['text'], inplace=True)
train.reset_index(drop = True, inplace = True)
train.describe(include = 'all').T

As can be seen, 34,965 rows remain, of which 34,653 have unique titles. The remaining 312 (34,965 - 34,653) are repeated titles. Since this is a small number, I left them as they are.
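
To make the counting explicit, here is a small sketch on a made-up frame (the `toy` data is an assumption for illustration). It shows that dropping duplicates on one column can still leave repeats in another:

```python
import pandas as pd

# Made-up data: two rows share a text, two rows share a title
toy = pd.DataFrame({
    'title': ['A', 'A', 'B', 'C'],
    'text':  ['x', 'y', 'y', 'z'],
})

# Keep the first row for every distinct text, as in the cell above
deduped = toy.drop_duplicates(subset=['text']).reset_index(drop=True)

n_rows = len(deduped)
n_unique_titles = deduped['title'].nunique()
n_repeated_titles = n_rows - n_unique_titles   # titles that still occur more than once
```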

# Pie chart showing the type of articles

In [None]:
train['subject'].value_counts().plot.pie(figsize = (7, 7));

# Removing stopwords

In [None]:
import re
from nltk.corpus import stopwords

# Build the stopword set once; rebuilding the list for every word is very slow
stopWords = set(stopwords.words('english'))

def stopwordsRemover(document):
    corpus = []
    for doc in document:
        # keep letters only, lowercase, and split into words
        temp = re.sub('[^a-zA-Z]', ' ', doc)
        temp = temp.lower().split()
        # drop English stopwords and rejoin into a single string
        temp = [word for word in temp if word not in stopWords]
        corpus.append(' '.join(temp))
    return corpus
noStopWordTitle = stopwordsRemover(train['title'])
noStopWordText = stopwordsRemover(train['text'])

In [None]:
#first 10 titles
noStopWordTitle[:10]
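
To see what the cleaning does to a single headline, here is a minimal sketch. It uses a tiny hand-written stopword set (an assumption; NLTK's English list is much longer) and a made-up headline:

```python
import re

# Tiny illustrative stopword set, NOT the NLTK list used in the notebook
STOPWORDS = {'the', 'is', 'a', 'of', 'to', 'in', 'and'}

def clean(text, stopwords=STOPWORDS):
    # replace everything that is not a letter with a space
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    # lowercase and split on whitespace
    words = letters_only.lower().split()
    # keep only the words that are not stopwords
    return ' '.join(w for w in words if w not in stopwords)

cleaned = clean("BREAKING: The Senate votes 52-48 to pass a bill!")
print(cleaned)
```

Digits and punctuation disappear along with the stopwords, so numbers such as vote counts are lost at this step.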

I inserted **noStopWordTitle** and **noStopWordText** into the dataframe and removed the original **title** and **text** columns.

In [None]:
train.insert(0, 'noStopWordTitle', noStopWordTitle, True)
train.insert(1, 'noStopWordText', noStopWordText, True)

In [None]:
train.drop(['title', 'text'], axis = 1, inplace = True)

In [None]:
train.head()

# Top unigrams, bigrams and trigrams used in the title

Separating the Fake and Real titles

In [None]:
fakeTitles = train.noStopWordTitle[train['class'] == 'Fake']
realTitles = train.noStopWordTitle[train['class'] == 'Real']

mergedFake = ' '.join(fakeTitles)
mergedReal = ' '.join(realTitles)

In [None]:
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter

def ngramFunct(corpus, n):
    # tokenize the merged text and count every run of n consecutive tokens
    token = word_tokenize(corpus)
    ans = ngrams(token, n)
    return Counter(ans)
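
For intuition, the same counting can be sketched without NLTK. This version splits on whitespace instead of calling `word_tokenize`, which is an assumption that should be harmless here because the cleaned titles contain only lowercase letters and spaces. The function name `count_ngrams` is hypothetical:

```python
from collections import Counter

def count_ngrams(text, n):
    tokens = text.split()
    # zipping n staggered views of the token list yields consecutive n-tuples
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

bigrams = count_ngrams('trump says trump says no', 2)
top = bigrams.most_common(1)
print(top)
```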

### Unigrams

In [None]:
unigramReal = ngramFunct(mergedReal, 1)
unigramFake = ngramFunct(mergedFake, 1)
ufreqReal = (nltk.FreqDist(unigramReal))
ufreqFake = (nltk.FreqDist(unigramFake))

In [None]:
plt.title('Top 20 Unigrams in Real News')
ufreqReal.plot(20, cumulative=False, color = 'b');

plt.title('Top 20 Unigrams in Fake News')
ufreqFake.plot(20, cumulative=False, color = 'r');

### Bigrams

In [None]:
bigramReal = ngramFunct(mergedReal, 2)
bigramFake = ngramFunct(mergedFake, 2)
bfreqReal = (nltk.FreqDist(bigramReal))
bfreqFake = (nltk.FreqDist(bigramFake))

In [None]:
plt.title('Top 20 Bigrams in Real News')
bfreqReal.plot(20, cumulative=False, color = 'b');

plt.title('Top 20 Bigrams in Fake News')
bfreqFake.plot(20, cumulative=False, color = 'r');

### Trigrams

In [None]:
trigramReal = ngramFunct(mergedReal, 3)
trigramFake = ngramFunct(mergedFake, 3)
tfreqReal = (nltk.FreqDist(trigramReal))
tfreqFake = (nltk.FreqDist(trigramFake))

In [None]:
plt.title('Top 20 Trigrams in Real News')
tfreqReal.plot(20, cumulative=False, color = 'b');

plt.title('Top 20 Trigrams in Fake News')
tfreqFake.plot(20, cumulative=False, color = 'r');

# Model Building

I first tried to use the models on the titles only and then the text only and then merged the titles and text.

### Stemming the words

In [None]:
from nltk.stem.snowball import SnowballStemmer

def stem(data):
    stemmer = SnowballStemmer('english')
    stemmed = []
    for doc in data:
        # split into words first; iterating over the string itself would
        # pass single characters to the stemmer and leave the text unchanged
        words = [stemmer.stem(word) for word in doc.split()]
        stemmed.append(' '.join(words))
    return stemmed

## Applying models to the Titles

In [None]:
titleCorpus = stem(train.noStopWordTitle)
titleCorpus[:10]

### TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
X = tf.fit_transform(titleCorpus).toarray()
y = train['class']
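
To show what the TF-IDF weights mean, here is a hand-rolled sketch on three made-up documents. It follows what I understand to be the default `TfidfVectorizer` formula (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each row). It is restricted to unigrams and has no `max_features` limit, so it simplifies the cell above:

```python
import math
from collections import Counter

docs = ['trump wins vote', 'trump loses vote', 'aliens land']
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# document frequency: in how many documents each word appears
df = Counter(w for doc in tokenized for w in set(doc))

def tfidf_row(tokens):
    counts = Counter(tokens)
    # term count times smoothed inverse document frequency
    raw = [counts[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in vocab]
    # scale the row to unit Euclidean length
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in tokenized]
```

In the first document, "wins" appears in only one document and so gets a larger weight than "trump", which appears in two.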

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.33)

### Multinomial Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB
classifierMNB = MultinomialNB()
classifierMNB.fit(X_train, y_train) 
pred = classifierMNB.predict(X_val)
score = metrics.accuracy_score(y_val, pred)
print('Accuracy : %0.3f' %score)

cm = plot_confusion_matrix(classifierMNB, X_val, y_val, cmap = 'coolwarm')
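
As a rough illustration of what Multinomial Naive Bayes computes, here is a from-scratch sketch on made-up headlines. It works on raw word counts, whereas the cell above feeds TF-IDF weights to `MultinomialNB`. It uses additive smoothing with `alpha=1.0`, which matches the scikit-learn default as far as I know. The data and the `predict` helper are assumptions for illustration:

```python
import math
from collections import Counter, defaultdict

train_docs = [
    ('shocking secret exposed', 'Fake'),
    ('shocking hoax exposed', 'Fake'),
    ('senate passes budget', 'Real'),
    ('senate votes budget bill', 'Real'),
]

class_counts = Counter(label for _, label in train_docs)
word_counts = defaultdict(Counter)
for text, label in train_docs:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def predict(text, alpha=1.0):
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        # log prior of the class
        score = math.log(class_counts[label] / len(train_docs))
        for w in text.split():
            if w in vocab:  # words never seen in training are ignored
                # smoothed log likelihood of the word given the class
                score += math.log((word_counts[label][w] + alpha)
                                  / (total + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict('shocking secret'), predict('senate budget vote'))
```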

### Passive Aggressive Classifier

In [None]:
from sklearn.linear_model import PassiveAggressiveClassifier
classifierPAC = PassiveAggressiveClassifier(n_iter_no_change=50)
classifierPAC.fit(X_train, y_train)
pred = classifierPAC.predict(X_val)
score = metrics.accuracy_score(y_val, pred)
print('Accuracy : %0.3f'%score)
cm = plot_confusion_matrix(classifierPAC, X_val, y_val, cmap = 'coolwarm')
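
The name "Passive Aggressive" comes from its update rule, sketched below for a single binary example. This is the PA-I variant with hinge loss, which I believe corresponds to the scikit-learn defaults (`loss='hinge'`, `C=1.0`). The function `pa_update` and the toy vectors are hypothetical:

```python
def pa_update(w, x, y, C=1.0):
    """One PA-I step for a label y in {-1, +1}; assumes x is not all zeros."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)              # hinge loss
    if loss == 0.0:
        return w                                # passive: margin already satisfied
    # aggressive: move just far enough to fix the example, capped at C
    tau = min(C, loss / sum(xi * xi for xi in x))
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
w = pa_update(w, [1.0, 0.0], +1)   # misclassified, so the weights move
w = pa_update(w, [0.0, 2.0], -1)   # step size is 0.25 because ||x||^2 = 4
print(w)
```

A correctly classified example with enough margin leaves the weights untouched, which keeps the model stable when it sees many easy examples.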

## Applying models to the Text

In [None]:
textCorpus = stem(train.noStopWordText)
textCorpus[:10]

### TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
X = tf.fit_transform(textCorpus).toarray()
y = train['class']

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.33)

### Multinomial Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB
classifierMNB = MultinomialNB()
classifierMNB.fit(X_train, y_train) 
pred = classifierMNB.predict(X_val)
score = metrics.accuracy_score(y_val, pred)
print('Accuracy : %0.3f' %score)

cm = plot_confusion_matrix(classifierMNB, X_val, y_val, cmap = 'coolwarm')

### Passive Aggressive Classifier

In [None]:
from sklearn.linear_model import PassiveAggressiveClassifier
classifierPAC = PassiveAggressiveClassifier(n_iter_no_change=50)
classifierPAC.fit(X_train, y_train)
pred = classifierPAC.predict(X_val)
score = metrics.accuracy_score(y_val, pred)
print('Accuracy : %0.3f'%score)
cm = plot_confusion_matrix(classifierPAC, X_val, y_val, cmap = 'coolwarm')

## Combining the Title and Text and applying Models

In [None]:
train['Title and Text'] = train[['noStopWordTitle', 'noStopWordText']].apply(' '.join, axis=1)
train.head()

In [None]:
titleTextCorpus = stem(train['Title and Text'])
titleTextCorpus[:5]

### TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
X = tf.fit_transform(titleTextCorpus).toarray()
y = train['class']

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.33)

### Multinomial Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB
classifierMNB = MultinomialNB()
classifierMNB.fit(X_train, y_train) 
pred = classifierMNB.predict(X_val)
score = metrics.accuracy_score(y_val, pred)
print('Accuracy : %0.3f' %score)

cm = plot_confusion_matrix(classifierMNB, X_val, y_val, cmap = 'coolwarm')

### Passive Aggressive Classifier

In [None]:
from sklearn.linear_model import PassiveAggressiveClassifier
classifierPAC = PassiveAggressiveClassifier(n_iter_no_change=50)
classifierPAC.fit(X_train, y_train)
pred = classifierPAC.predict(X_val)
score = metrics.accuracy_score(y_val, pred)
print('Accuracy : %0.3f'%score)
cm = plot_confusion_matrix(classifierPAC, X_val, y_val, cmap = 'coolwarm')

As can be seen, the classification accuracy is lowest when the models are applied to the **titles** only, followed by the **text** only.

The accuracy is highest when the **title** and **text** are combined.

Compared with Naive Bayes, the Passive Aggressive Classifier gives better accuracy.
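
One possible variation, not what the cells above do: split the raw text first and let a scikit-learn `Pipeline` fit the TF-IDF vocabulary on the training split only, so the validation texts do not influence the features. The toy corpus, `random_state=0` and the variable names are assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy corpus (illustrative only, not the competition data)
texts = [
    'shocking secret exposed', 'you will not believe this hoax',
    'shocking hoax goes viral', 'secret cure they hide',
    'senate passes budget bill', 'court rules on tax law',
    'minister signs trade deal', 'senate votes on trade bill',
]
labels = ['Fake'] * 4 + ['Real'] * 4

# Split the raw text; stratify keeps the class ratio, random_state makes it repeatable
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# The pipeline fits TF-IDF on the training split only, then fits the classifier
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),
    PassiveAggressiveClassifier(random_state=0),
)
model.fit(X_train, y_train)
pred = model.predict(X_val)
accuracy = accuracy_score(y_val, pred)
```

The pipeline passes the sparse TF-IDF matrix straight to the classifier, so the `.toarray()` conversion is not needed in this variant.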

**Do upvote if this notebook helped you in learning something new.**

**Suggestions and discussions are welcome**