# Fake News Detection
#### Akshay U
The dataset has 4 columns:

+ `Unnamed: 0`
+ `title`
+ `text`
+ `label`

`Unnamed: 0` appears to hold unneeded data, so it is left out of the rest of the work.

`title` is the news title. I assume it plays no role in the prediction, so I left it out.

`text` and `label` are the main columns used for the prediction: `text` is the input and `label` is the target.
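
As a rough illustration of the column selection described above, the sketch below reads a tiny in-memory CSV instead of `news.csv`. The CSV contents are made up, and the idea that `Unnamed: 0` comes from an empty header cell (often a saved index) is an assumption about this dataset, not something the source confirms.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for news.csv; the first header cell is empty.
csv_text = (
    ",title,text,label\n"
    "8476,Title A,first article body,FAKE\n"
    "3608,Title B,second article body,REAL\n"
)

# pandas names a column with an empty header "Unnamed: <position>".
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# Load only the columns that are actually used for the prediction.
slim = pd.read_csv(io.StringIO(csv_text), usecols=["text", "label"])
print(list(slim.columns))
```

Reading with `usecols` keeps the unused columns out of memory from the start, which is one possible alternative to dropping them later.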

### Import Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import make_scorer, precision_score, recall_score, accuracy_score, f1_score

import joblib

### Location of Dataset
https://drive.google.com/file/d/1er9NJTLUA3qnRuyhfzuN0XUsoIC4a-_q/view

### Import data and Cleaning

In [2]:
df = pd.read_csv("news.csv")

Check the dataset, its shape and basic info. 

In [3]:
df.head(10)

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |
| 5 | 6903 | Tehran, USA | \nI’m not an immigrant, but my grandparents ... | FAKE |
| 6 | 7341 | Girl Horrified At What She Watches Boyfriend D... | Share This Baylee Luciani (left), Screenshot o... | FAKE |
| 7 | 95 | ‘Britain’s Schindler’ Dies at 106 | A Czech stockbroker who saved more than 650 Je... | REAL |
| 8 | 4869 | Fact check: Trump and Clinton at the 'commande... | Hillary Clinton and Donald Trump made some ina... | REAL |
| 9 | 2909 | Iran reportedly makes new push for uranium con... | Iranian negotiators reportedly have made a las... | REAL |


In [4]:
df['title'][0]

'You Can Smell Hillary’s Fear'

In [5]:
df['text'][0]



In [6]:
df.shape

(6335, 4)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6335 entries, 0 to 6334
Data columns (total 4 columns):
Unnamed: 0    6335 non-null int64
title         6335 non-null object
text          6335 non-null object
label         6335 non-null object
dtypes: int64(1), object(3)
memory usage: 198.1+ KB


#### Check NaN
Check whether the dataset contains any null or NaN values.

In [8]:
df.isnull().values.any()

False
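
A single `True`/`False` answer does not show where missing values sit, and it does not catch text that is present but blank. The sketch below, on a made-up toy frame (not the real dataset), counts nulls per column and also counts empty or whitespace-only `text` entries.

```python
import numpy as np
import pandas as pd

# Made-up rows that contain one missing text, one blank text and one missing label.
toy = pd.DataFrame({
    "text": ["some article", None, "   ", "another article"],
    "label": ["REAL", "FAKE", "FAKE", np.nan],
})

# Number of null/NaN values in each column.
null_counts = toy.isnull().sum()
print(null_counts)

# Texts that are missing, empty, or whitespace only.
blank_text = toy["text"].fillna("").str.strip().eq("").sum()
print("blank texts:", blank_text)
```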

In [9]:
df.columns

Index(['Unnamed: 0', 'title', 'text', 'label'], dtype='object')

The dataset is clean and contains no NaN values. Only the `text` attribute is needed to predict the output. <br>
So we split the dataset columns into input and output.

#### Inputs

In [10]:
X = df['text']
X

0       Daniel Greenfield, a Shillman Journalism Fello...
1       Google Pinterest Digg Linkedin Reddit Stumbleu...
2       U.S. Secretary of State John F. Kerry said Mon...
3       — Kaydee King (@KaydeeKing) November 9, 2016 T...
4       It's primary day in New York and front-runners...
                              ...                        
6330    The State Department told the Republican Natio...
6331    The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...
6332     Anti-Trump Protesters Are Tools of the Oligar...
6333    ADDIS ABABA, Ethiopia —President Obama convene...
6334    Jeb Bush Is Suddenly Attacking Trump. Here's W...
Name: text, Length: 6335, dtype: object

#### Corresponding outputs

In [11]:
y = df['label']
y

0       FAKE
1       FAKE
2       REAL
3       FAKE
4       REAL
        ... 
6330    REAL
6331    FAKE
6332    FAKE
6333    REAL
6334    REAL
Name: label, Length: 6335, dtype: object

### Train Test Split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=99)
print("Training set - Features: ", X_train.shape, "  Target: ", y_train.shape)
print("Testing set  - Features: ", X_test.shape, "  Target: ",y_test.shape)

Training set - Features:  (5068,)   Target:  (5068,)
Testing set  - Features:  (1267,)   Target:  (1267,)
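
If the two labels are not evenly represented, a plain random split can leave the test set with a different class mix than the training set. A minimal sketch of a stratified split follows, assuming balanced toy data; the variable names and texts are illustrative and not taken from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 placeholder articles, half FAKE and half REAL.
texts = pd.Series([f"article {i}" for i in range(20)])
labels = pd.Series(["FAKE"] * 10 + ["REAL"] * 10)

# stratify=labels keeps the FAKE/REAL ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=99, stratify=labels
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```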


### Feature Extraction
<p style='text-align: justify'>Feature extraction is a process of dimensionality reduction in which an initial set of raw data is reduced to more manageable groups for processing. Large data sets typically have a large number of variables, which require a lot of computing resources to process. Feature extraction is the name for methods that select and/or combine variables into features, reducing the amount of data that must be processed while still describing the original data set accurately and completely. For text, this means converting each article into a numeric vector; here each article becomes a vector of word counts.</p>
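
To make the idea concrete, here is a small sketch of word-count features on three made-up sentences. It uses scikit-learn's `CountVectorizer` with English stop words removed; the sentences and variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the senate approved the budget",
    "the budget rumor was a hoax",
    "hoax hoax rumor",
]

demo_vect = CountVectorizer(stop_words="english")
counts = demo_vect.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each kept word to its column index.
vocab = sorted(demo_vect.vocabulary_)
print(vocab)
print(counts.toarray())
```

Each column corresponds to one word of the learned vocabulary, and each cell holds how often that word occurs in the document.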

Initialize a `CountVectorizer` with `stop_words='english'`.

Then call **fit()** and store the result in a variable so that it can be saved to a ***joblib*** file.

In [13]:
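vect = CountVectorizer(stop_words='english')
# fit() returns the fitted vectorizer itself, so `vectorizer` and `vect` are the same object.
# Fitting on the full X means the vocabulary also contains words that occur only in the test split.
vectorizer = vect.fit(X)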

In [14]:
X_train_transformed = vect.transform(X_train)
X_test_transformed = vect.transform(X_test)
print("New Transformed...")
print("Training set - Features: ", X_train_transformed.shape, "  Target: ", y_train.shape)
print("Testing set  - Features: ", X_test_transformed.shape, "  Target: ",y_test.shape)

New Transformed...
Training set - Features:  (5068, 67351)   Target:  (5068,)
Testing set  - Features:  (1267, 67351)   Target:  (1267,)
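
One way to keep the test split completely unseen during evaluation is to learn the vocabulary from the training texts only. The sketch below does this with a scikit-learn `Pipeline` on made-up toy texts; it is an alternative arrangement under those assumptions, not the notebook's own method.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus: four FAKE-style and four REAL-style texts.
toy_texts = [
    "shocking secret miracle cure exposed",
    "aliens secret cover exposed shocking",
    "celebrity scandal shocking secret",
    "miracle cure doctors hate exposed",
    "senate approved budget committee vote",
    "minister announced budget policy vote",
    "court ruling election results announced",
    "committee vote policy election senate",
]
toy_labels = ["FAKE"] * 4 + ["REAL"] * 4

tr_texts, te_texts, tr_labels, te_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=99, stratify=toy_labels
)

# The pipeline fits the vectorizer and the classifier on the training texts only.
pipe = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
pipe.fit(tr_texts, tr_labels)

preds = pipe.predict(te_texts)
fitted_vocab = set(pipe.named_steps["countvectorizer"].vocabulary_)
print(len(fitted_vocab), "vocabulary words learned from the training texts")
```

Words that appear only in the test texts are simply ignored at prediction time, which mirrors how the model would treat unseen articles.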


### Modeling - Naive Bayes

In [15]:
def print_metrics(labels, preds):
    print("Precision Score\t: {}".format(precision_score(labels, preds, average='weighted')))
    print("Recall Score\t: {}".format(recall_score(labels, preds, average='weighted')))
    print("Accuracy Score\t: {}".format(accuracy_score(labels, preds)))
    print("F1 Score\t: {}".format(f1_score(labels, preds, average='weighted')))

In [16]:
mnb = MultinomialNB()
mnb.fit(X_train_transformed,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### Prediction and Accuracy

In [17]:
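prediction = mnb.predict(X_test_transformed)
# scikit-learn metrics take (y_true, y_pred), so the true labels go first.
# Accuracy and weighted recall are the same either way; weighted precision and F1 can shift slightly.
print_metrics(y_test, prediction)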

Precision Score	: 0.913499653112612
Recall Score	: 0.9123914759273876
Accuracy Score	: 0.9123914759273876
F1 Score	: 0.9124675534112748
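
Weighted averages hide how each class performs, and the order of arguments passed to the metric functions matters. The sketch below, on hand-made labels (not the notebook's results), prints a confusion matrix and a per-class report, and compares weighted precision for both argument orders.

```python
from sklearn.metrics import classification_report, confusion_matrix, precision_score

# Hand-made example: 4 true FAKE and 6 true REAL articles.
y_true = ["FAKE"] * 4 + ["REAL"] * 6
y_pred = ["FAKE", "FAKE", "FAKE", "REAL", "FAKE", "FAKE", "REAL", "REAL", "REAL", "REAL"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=["FAKE", "REAL"])
print(cm)

# Precision, recall and F1 for each class separately.
print(classification_report(y_true, y_pred))

# Correct order (y_true, y_pred) versus swapped order.
p_right = precision_score(y_true, y_pred, average="weighted")
p_swapped = precision_score(y_pred, y_true, average="weighted")
print(p_right, p_swapped)
```

In this toy case the two precision values differ, which is why the true labels should always be passed first.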


### Train with the full dataset (100% of the data) and store the model for future prediction
We measured the accuracy of this model when it was trained on 80% of the data.<br>For future predictions, we can train the model on 100% of the dataset, which may increase the accuracy.

In [18]:
X_train_transformed = vect.transform(X)
naive = mnb.fit(X_train_transformed,y)
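
Once the model is trained on all the data, no held-out rows remain to measure its accuracy. One common option is to estimate accuracy with cross-validation first and then fit on everything. A minimal sketch on made-up toy texts follows, assuming 2 folds only because the toy set is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus: four FAKE-style and four REAL-style texts.
toy_texts = [
    "shocking secret miracle cure exposed",
    "aliens secret cover exposed shocking",
    "celebrity scandal shocking secret",
    "miracle cure doctors hate exposed",
    "senate approved budget committee vote",
    "minister announced budget policy vote",
    "court ruling election results announced",
    "committee vote policy election senate",
]
toy_labels = ["FAKE"] * 4 + ["REAL"] * 4

pipe = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())

# Each fold trains on part of the data and scores on the rest.
scores = cross_val_score(pipe, toy_texts, toy_labels, cv=2, scoring="accuracy")
print("fold accuracies:", scores)

# After the estimate, fit the final model on 100% of the data.
final_model = pipe.fit(toy_texts, toy_labels)
```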

### Joblib File
Our dataset is fairly large, so we use ***joblib*** instead of ***pickle***. A joblib file works much like a pickle file. The saved files are used for future predictions and avoid having to train the model again.

In [19]:
joblib.dump(naive,"naive.joblib")
joblib.dump(vectorizer,"vectoriszer.joblib")

['vectoriszer.joblib']
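
The saved files can later be loaded to classify new articles without retraining. The sketch below trains a tiny stand-in model on made-up texts, saves it to a temporary directory under the same file names as above, then loads both files and predicts. The `classify` helper is hypothetical and not part of the notebook.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set standing in for the real model.
train_texts = [
    "shocking secret miracle cure exposed",
    "aliens secret cover exposed shocking",
    "senate approved budget committee vote",
    "minister announced budget policy vote",
]
train_labels = ["FAKE", "FAKE", "REAL", "REAL"]

toy_vect = CountVectorizer(stop_words="english").fit(train_texts)
toy_model = MultinomialNB().fit(toy_vect.transform(train_texts), train_labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "naive.joblib")
    vect_path = os.path.join(tmp, "vectoriszer.joblib")  # file name as used above
    joblib.dump(toy_model, model_path)
    joblib.dump(toy_vect, vect_path)

    # Later, possibly in another script: load both objects back.
    loaded_model = joblib.load(model_path)
    loaded_vect = joblib.load(vect_path)


def classify(texts):
    """Hypothetical helper: vectorize raw texts, then predict their labels."""
    return loaded_model.predict(loaded_vect.transform(texts))


result = classify(["shocking miracle secret"])
print(result)
```

The vectorizer must be the same fitted object that was used during training, since the model expects the same vocabulary and column order.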