# Fake News Detection

In [4]:
import numpy as np
import pandas as pd
import itertools
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
data=pd.read_csv('news.csv')
data.head()

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [6]:
data.shape

(6335, 4)

In [7]:
labels=data.label
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

In [8]:
labels.shape

(6335,)

In [16]:
X_train,X_test,y_train,y_test=train_test_split(data['text'],labels,test_size=0.2,random_state=42)

Let's initialize the TfidfVectorizer with English stop words and a maximum document frequency of 0.7, so that terms appearing in more than 70% of the documents are discarded. Stop words are the most common words in a language, and they are filtered out before processing natural language data.

In [17]:
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)

# Fit the vocabulary and IDF weights on the training set only,
# then reuse them to transform the test set
tfidf_train=tfidf_vectorizer.fit_transform(X_train)
tfidf_test=tfidf_vectorizer.transform(X_test)
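To see what `stop_words='english'` and `max_df=0.7` actually remove, here is a minimal sketch on a toy corpus. The three sentences and the `toy_*` names are invented for illustration and are not taken from `news.csv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, invented for illustration (not from news.csv).
toy_docs = [
    "news about the election",
    "news about the economy",
    "news about sports",
]

# Same settings as the notebook's vectorizer.
toy_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# "the" and "about" are English stop words, so they are filtered out.
# "news" occurs in 3 of 3 documents (100% > 70%), so max_df drops it too.
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_matrix.shape)  # (number of documents, number of kept terms)
```

Only the words that help tell the documents apart survive, which is the reason for applying both filters before training.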

We will now initialise a PassiveAggressiveClassifier and fit it on tfidf_train and y_train. Then we will predict on the TF-IDF test set and finally calculate the accuracy score.

In [19]:
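# Train the classifier on the TF-IDF features of the training set
pac=PassiveAggressiveClassifier(max_iter=30)
pac.fit(tfidf_train,y_train)

# Predict on the test set and report accuracy as a rounded percentage
y_pred=pac.predict(tfidf_test)
accuracy=accuracy_score(y_test,y_pred)
print(round(accuracy*100))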

94.0
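For intuition about what a passive-aggressive classifier does while fitting, here is a rough NumPy sketch of the commonly described PA-I update rule. It is an illustration of the idea only, not scikit-learn's implementation; the function `pa_fit`, the toy data, and the `+1/-1` label encoding are assumptions made for this example.

```python
import numpy as np

def pa_fit(X, y, C=1.0, epochs=5):
    """Sketch of PA-I training for a linear classifier; y must be +1 or -1.

    Assumes every row of X is non-zero (otherwise the step size divides by 0).
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            # Hinge loss: zero when the example is classified with margin >= 1.
            loss = max(0.0, 1.0 - y_i * np.dot(w, x_i))
            if loss == 0.0:
                continue  # "passive": leave the weights unchanged
            # "aggressive": move just far enough to fix the margin, capped by C.
            tau = min(C, loss / np.dot(x_i, x_i))
            w += tau * y_i * x_i
    return w

# Hypothetical, linearly separable toy data.
toy_X = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 1.0]])
toy_y = np.array([1, 1, -1, -1])

toy_w = pa_fit(toy_X, toy_y)
toy_pred = np.where(toy_X @ toy_w >= 0, 1, -1)
print(toy_w, toy_pred)
```

The model changes only when an example violates the margin, which is what makes this family of algorithms suitable for large or streaming text data.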


We got an accuracy of about 94% with this model (rounded to the nearest whole percent). Finally, we will print the confusion matrix to see the counts of true and false positives and negatives.

In [20]:
cm=confusion_matrix(y_test,y_pred,labels=['FAKE','REAL'])
print(cm)

[[588  40]
 [ 42 597]]


Thus, taking FAKE as the positive class, this model has 588 true positives, 597 true negatives, 42 false positives and 40 false negatives.
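The counts in the confusion matrix can be turned back into summary metrics. The sketch below hard-codes the matrix printed above and treats FAKE as the positive class; the variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are
# predicted labels, both in the order ['FAKE', 'REAL'].
cm_values = np.array([[588, 40],
                      [42, 597]])

tp, fn = cm_values[0]  # true FAKE: predicted FAKE / predicted REAL
fp, tn = cm_values[1]  # true REAL: predicted FAKE / predicted REAL

accuracy = (tp + tn) / cm_values.sum()
precision = tp / (tp + fp)  # share of predicted FAKE that is really FAKE
recall = tp / (tp + fn)     # share of real FAKE articles that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The accuracy computed this way is about 93.5%, which matches the rounded value of 94 printed earlier.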