# Fake News Classification

## Introduction
The data for this analysis comes from Kaggle: [Fake and real news dataset](https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset), which consists of two files of news articles from different sources. One file contains real news and the other contains fake news.

The purpose of this analysis is to use **Python NLTK tools** and the **Sklearn kit** to perform basic natural language processing and apply machine learning methods to build a classifier that distinguishes fake news from real news.

The methods used in this analysis include:

- Word tokenizing
- Stop word and punctuation removal
- Lexical dispersion plot
- Naive Bayes classification
- Logistic regression

## Loading data

The analysis starts with loading the data and the necessary libraries.

In [None]:
import pandas as pd
import numpy as np
import math
import datetime
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import warnings
import random
from string import punctuation
import seaborn as sns

from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image

# warnings.filterwarnings('ignore')

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize


from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix,classification_report,roc_auc_score
%matplotlib inline

# Set default plot size
plt.rcParams["figure.figsize"] = (15,8)

The data contains the title of the news, the actual content of the news (which is what I will focus on), the subject of the news, and the date of the news. For the convenience of the analysis, I added one more column called **Authenticity** to differentiate fake news from real news.

In [None]:
real = pd.read_csv("../input/fake-and-real-news-dataset/True.csv")
fake = pd.read_csv("../input/fake-and-real-news-dataset/Fake.csv")
fake['Authenticity'] = 'Fake'
real['Authenticity'] = 'Real'
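# DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement
news_data = pd.concat([fake, real], ignore_index=True)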
news_data.head()

## Data Preprocessing and EDA

Before fitting the data to a classifier, the text needs to be preprocessed and explored.

First, we tokenize the text, and then remove the stop words and punctuation from it.

In [None]:
sw = stopwords.words('english')

# Typographic quotes are not in NLTK's stop-word list, so add them explicitly
new_words = ('’', '“', '”')
sw.extend(new_words)


# Convert to lower case
news_data['text'] = news_data['text'].str.lower()

# Tokenizing
news_data['tokenized_text'] = news_data['text'].apply(word_tokenize)

# Remove stopwords
news_data['filtered_text'] = news_data['tokenized_text'].apply(lambda x: [item for item in x if item not in sw])

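# Remove punctuation (a set gives exact single-character matches;
# `in` on the raw string would be a substring test)
punct = set(punctuation)
news_data['filtered_text'] = news_data['filtered_text'].apply(lambda x: [item for item in x if item not in punct])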

# Check results: characters in the raw text vs. token counts before and after filtering
print(len(news_data['text'].iloc[0]),
      len(news_data['tokenized_text'].iloc[0]),
      len(news_data['filtered_text'].iloc[0]))
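
To see what these preprocessing steps do on a single sentence, here is a minimal, dependency-free sketch. It is not the NLTK pipeline used above: the regex tokenizer stands in for `word_tokenize`, and `TOY_STOPWORDS` and `preprocess` are hypothetical names introduced only for illustration.

```python
import re
from string import punctuation

# Hypothetical mini stop-word list; the notebook uses NLTK's English list instead.
TOY_STOPWORDS = {"the", "a", "is", "of", "and", "to"}
PUNCT = set(punctuation)

def preprocess(text, stop_words=TOY_STOPWORDS):
    """Lowercase, split into word/punctuation tokens, drop stop words and punctuation."""
    # \w+ matches runs of word characters; [^\w\s] matches a single punctuation mark.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [t for t in tokens if t not in stop_words and t not in PUNCT]

sample = "The senator said the bill is dead, and the vote is off."
filtered = preprocess(sample)
print(filtered)
```

Only the content-bearing tokens survive, which is the form the frequency counts later in the notebook rely on.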

In [None]:
news_data.head()

Next, I create a simple word cloud from the dataset.

In [None]:
text = " ".join(text for text in news_data.text)

wordcloud = WordCloud(background_color="white", max_words=1000,
                      max_font_size=90, random_state=42).generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()



Then I create two separate datasets for fake and real news to examine the most common words in each, using the `FreqDist` class from `nltk`.

In [None]:
news_data_fake = news_data[news_data.Authenticity == 'Fake']
news_data_real = news_data[news_data.Authenticity == 'Real']

news_data_fake.head()

In [None]:
# Flatten the per-article token lists into one flat list of tokens per group
fake_list = [item for sublist in news_data_fake.filtered_text for item in sublist]

real_list = [item for sublist in news_data_real.filtered_text for item in sublist]

all_words_list = [item for sublist in news_data.filtered_text for item in sublist]

Using the `most_common()` method of `FreqDist`, the next cell prints the most common words in the fake, real, and combined news datasets. As we can see, the most common words overlap to some extent.

In [None]:
vocab_fake = nltk.FreqDist(fake_list)
vocab_real = nltk.FreqDist(real_list)
vocab_all = nltk.FreqDist(all_words_list)

print('Fake most common words: ', vocab_fake.most_common(20))
print('Real most common words: ', vocab_real.most_common(20))
# Use vocab_all here (the original printed vocab_real a second time)
print('All most common words: ', vocab_all.most_common(20))
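
`nltk.FreqDist` is a subclass of Python's `collections.Counter`, so the same counting and overlap check can be sketched with the standard library alone. The token lists below are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Made-up token lists standing in for fake_list and real_list.
toy_fake = ["president", "said", "video", "president", "video", "president"]
toy_real = ["said", "president", "said", "government", "said", "president"]

counts_fake = Counter(toy_fake)
counts_real = Counter(toy_real)

# most_common(n) returns (word, count) pairs sorted by descending count.
top_fake = counts_fake.most_common(2)
top_real = counts_real.most_common(2)

# Words that appear in both top lists.
shared = {word for word, _ in top_fake} & {word for word, _ in top_real}
print(top_fake, top_real, shared)
```

Taking the set intersection of the two top lists makes the overlap explicit instead of reading it off the printed output.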

To further analyze the distribution of the most common words, I take the 20 most common words from the fake and real news datasets and draw a **lexical dispersion plot** for each. I only use the first 10,000 tokens from each dataset.

The first plot shows the fake news and the second shows the real news. The word **said** appears more frequently and more densely in the real news dataset than in the fake news dataset.

In [None]:
common_words_fake = [item[0] for item in vocab_fake.most_common(20)]
nltk.Text(fake_list[:10000]).dispersion_plot(common_words_fake)

common_words_real = [item[0] for item in vocab_real.most_common(20)]
nltk.Text(real_list[:10000]).dispersion_plot(common_words_real)
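
A dispersion plot draws one row per word and one mark for each position in the token sequence where that word occurs. The following sketch computes those positions by hand; `word_offsets` is a hypothetical helper and the tokens are made up.

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions at which it occurs."""
    offsets = {word: [] for word in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

# Made-up token sequence for illustration.
toy_tokens = ["official", "said", "vote", "said", "delay", "vote", "said"]
offsets = word_offsets(toy_tokens, ["said", "vote"])
print(offsets)
```

Closely spaced positions for a word correspond to the dense runs of marks visible in the plots.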

## Classification

After exploring the data, we have a better understanding of the dataset and can start building a classifier.

For this part, I'm using the `TfidfVectorizer` class from `sklearn.feature_extraction.text`. The result is a sparse matrix in which each word's count in an article is weighted by how rare the word is across all articles, so words that appear in almost every document contribute less. The result can then be fit to a multinomial Naive Bayes classifier.

In [None]:
vectorizer = TfidfVectorizer(stop_words=sw,lowercase=True)
y = news_data.Authenticity
x = vectorizer.fit_transform(news_data.text)
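
The weighting can be sketched by hand. This is a simplified illustration, assuming the formula scikit-learn documents for its defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row); the `tfidf` function and the toy documents are hypothetical and skip everything else the vectorizer does.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per non-empty tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        # L2-normalize so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Made-up tokenized documents.
toy_docs = [["vote", "delayed", "vote"], ["vote", "confirmed"]]
rows = tfidf(toy_docs)
print(rows)
```

In the second toy document both words occur once, but "confirmed" gets the larger weight because it appears in fewer documents than "vote".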

In [None]:
print(x.shape)
print(y.shape)

In [None]:
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=0.3)

NB_classifier = MultinomialNB()
NB_classifier.fit(X_train,y_train)
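
For intuition about what a multinomial Naive Bayes classifier learns, here is a small from-scratch sketch. It works on raw word counts rather than the TF-IDF weights used above, uses add-one (Laplace) smoothing, and is not scikit-learn's implementation; `train_nb`, `predict_nb`, and the toy data are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log word likelihoods."""
    vocab = {term for doc in docs for term in doc}
    class_docs = Counter(labels)
    term_counts = {c: Counter() for c in class_docs}
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)
    model = {}
    for c in class_docs:
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        log_prior = math.log(class_docs[c] / len(docs))
        log_like = {t: math.log((term_counts[c][t] + alpha) / total) for t in vocab}
        model[c] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log posterior; unseen words are ignored."""
    scores = {
        c: prior + sum(log_like[t] for t in doc if t in log_like)
        for c, (prior, log_like) in model.items()
    }
    return max(scores, key=scores.get)

# Made-up training data for illustration.
toy_docs = [["shocking", "video"], ["shocking", "hoax"],
            ["official", "said"], ["minister", "said"]]
toy_labels = ["Fake", "Fake", "Real", "Real"]
toy_model = train_nb(toy_docs, toy_labels)
print(predict_nb(toy_model, ["shocking", "claim"]), predict_nb(toy_model, ["said"]))
```

Each class scores a document by adding its log prior to the log likelihoods of the words it contains, and the higher-scoring class wins.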

After training the model, we want to test how accurate it is. Using `confusion_matrix`, we can plot a `heatmap` with `seaborn` to see how many records in the test data the classifier predicted correctly.

From the plot, we can see that the result is pretty good: most test articles fall on the diagonal, meaning they were classified correctly.

In [None]:
labels = NB_classifier.predict(X_test)

mat = confusion_matrix(y_test, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label')

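The ROC AUC score is about 0.98. This is not an accuracy figure: it measures how well the predicted probabilities rank real articles above fake ones, and a value this close to 1 indicates a very good model.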

In [None]:
roc_auc_score(y_test,NB_classifier.predict_proba(X_test)[:,1])
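
To make the metric concrete, the sketch below computes ROC AUC from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counting as half. The `roc_auc` helper and the toy scores are hypothetical, and `"Real"` is assumed to be the positive class.

```python
def roc_auc(y_true, scores, positive="Real"):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities of the "Real" class.
toy_labels = ["Real", "Fake", "Real", "Fake"]
toy_scores = [0.9, 0.2, 0.4, 0.6]

auc = roc_auc(toy_labels, toy_scores)

# Accuracy at a 0.5 threshold, for comparison.
toy_pred = ["Real" if s >= 0.5 else "Fake" for s in toy_scores]
accuracy = sum(p == y for p, y in zip(toy_pred, toy_labels)) / len(toy_labels)
print(auc, accuracy)
```

On this toy data the two numbers differ, which is why the two metrics should be reported under their own names.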

Printing the `classification_report` shows that both fake and real news have a precision of 94%.

In [None]:
print(classification_report(y_test,labels))
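
Precision and recall in the report follow directly from the confusion matrix. The sketch below derives them by hand, assuming the `confusion_matrix` layout (rows are true classes, columns are predicted classes); the counts are made up and `precision_recall` is a hypothetical helper.

```python
def precision_recall(matrix, class_index):
    """matrix[i][j] = number of samples with true class i predicted as class j."""
    true_positives = matrix[class_index][class_index]
    predicted_as_class = sum(row[class_index] for row in matrix)
    actually_class = sum(matrix[class_index])
    return true_positives / predicted_as_class, true_positives / actually_class

# Made-up counts: row/column 0 is "Fake", row/column 1 is "Real".
toy_matrix = [[90, 10],
              [5, 95]]

fake_precision, fake_recall = precision_recall(toy_matrix, 0)
real_precision, real_recall = precision_recall(toy_matrix, 1)
print(fake_precision, fake_recall, real_precision, real_recall)
```

Precision divides by a column total (everything predicted as the class), while recall divides by a row total (everything that truly belongs to it).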

After fitting a Naive Bayes classifier, I also want to see how a logistic regression model fits the data. Following a similar procedure, I train a logistic regression model on the same training data.

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)
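
Once fitted, a binary logistic regression model turns a weighted sum of the features into a probability through the sigmoid function. The sketch below shows only that prediction step; the weights, bias, and feature values are hypothetical and are not taken from the fitted model.

```python
import math

def predict_real_probability(weights, bias, features):
    """Sigmoid of the weighted feature sum: estimated probability of the positive class."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights: positive values push towards "Real".
toy_weights = {"said": 2.0, "video": -1.5}
toy_bias = 0.0

p_first = predict_real_probability(toy_weights, toy_bias, {"said": 0.8})
p_second = predict_real_probability(toy_weights, toy_bias, {"video": 1.0})
print(p_first, p_second)
```

A probability above 0.5 maps to the positive class and one below 0.5 to the negative class.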

The model score (mean accuracy on the test set) is 0.98, which means the logistic regression model also fits well.

In [None]:
model.score(X_test, y_test)

Similarly, the confusion matrix heatmap and the classification report also indicate that the logistic regression model fits the data well.

Comparing the two classifiers, we can see that the **logistic regression model** actually does a better job than Naive Bayes on this dataset, as it makes more accurate predictions.

In [None]:
y_model = model.predict(X_test)

mat = confusion_matrix(y_test,y_model)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False) 
plt.xlabel('predicted value')
plt.ylabel('true value')

In [None]:
print(classification_report(y_test,y_model))