# Fake News Detection
### The dataset contains news articles and their authors; the goal is to build a machine learning model that flags articles likely to be fake news.

## Importing the libraries and reading the dataset

In [1]:
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
plt.style.use("seaborn-v0_8")  # this style was named "seaborn" before matplotlib 3.6
from sklearn.metrics import accuracy_score
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

train=pd.read_csv(r"C:\Users\samya\Downloads\train.csv")

In [2]:
train.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


## Checking a random article

In [3]:
import random
# Pick a valid row index; random.randint includes both endpoints
randint = random.randint(0, len(train) - 1)
train['title'][randint]

'Baseball Tragedy: Two Players Die in Crashes in the Dominican Republic - The New York Times'

## Splitting and filtering words

In [4]:
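ps = PorterStemmer()
# Build the stop-word set once (needs nltk.download('stopwords') on first use)
stop_words = set(stopwords.words('english'))
tokens = []
for title in train['title']:
    # Keep letters only, lowercase, and split on whitespace
    words = re.sub('[^a-zA-Z]', ' ', str(title)).lower().split()
    # Drop stop words and stem the remaining words
    words = [ps.stem(word) for word in words if word not in stop_words]
    tokens.append(' '.join(words))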

In [5]:
tokens[randint]

'basebal tragedi two player die crash dominican republ new york time'
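
The cleaning steps above can also be wrapped in a small function so a single title can be checked in isolation. This is a minimal sketch rather than the notebook's exact code: `clean_title` is a hypothetical helper, and the tiny `demo_stop_words` set is an assumption that stands in for NLTK's English stop-word list so the example runs without downloading the corpus.

```python
import re
from nltk.stem.porter import PorterStemmer

# Assumption: a tiny stand-in for stopwords.words('english'), used only
# so this sketch runs without the NLTK stop-word corpus.
demo_stop_words = {"in", "the"}
stemmer = PorterStemmer()

def clean_title(title, stop_words=demo_stop_words):
    """Keep letters only, lowercase, drop stop words, and stem each word."""
    words = re.sub('[^a-zA-Z]', ' ', str(title)).lower().split()
    return ' '.join(stemmer.stem(w) for w in words if w not in stop_words)

example = ('Baseball Tragedy: Two Players Die in Crashes in the '
           'Dominican Republic - The New York Times')
cleaned = clean_title(example)
print(cleaned)
```

For this title the result should match the cleaned string printed by the notebook, since "in" and "the" are the only stop words it contains.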

## Extracting features

In [6]:
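cv = CountVectorizer()
# fit_transform returns a sparse count matrix; .toarray() makes it a dense
# NumPy array, which uses far more memory
x = cv.fit_transform(tokens).toarray()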

In [7]:
x.shape

(20800, 14864)
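
To make the shape above concrete: each row is one title and each column is one word of the learned vocabulary, with the cell holding how often that word occurs in that title. The sketch below shows this on three made-up cleaned titles; the names `docs`, `demo_cv`, `counts`, and `vocab` are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three toy "cleaned titles" (made up for illustration)
docs = [
    "basebal tragedi two player",
    "two player die crash",
    "new york time",
]

demo_cv = CountVectorizer()
# Sparse matrix: rows = documents, columns = vocabulary words
counts = demo_cv.fit_transform(docs)

# vocabulary_ maps each word to its column index; list the words in column order
vocab = sorted(demo_cv.vocabulary_, key=demo_cv.vocabulary_.get)
print(vocab)
print(counts.toarray())
```

Nine distinct words across three documents give a 3 x 9 matrix, in the same way that 20800 titles and 14864 distinct stems give the shape printed above.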

In [8]:
y=train["label"]

## Splitting into train and test data

In [9]:
from sklearn.model_selection import train_test_split as tts
train_x,test_x,train_y,test_y=tts(x,y,test_size=0.2,random_state=20)
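
As a quick sanity check on what `test_size=0.2` produces, the sketch below runs the same kind of split on synthetic stand-in data with 20800 rows. The arrays `X_demo` and `y_demo` are placeholders, and `stratify` is an optional extra that the notebook does not use; it is shown only as one way to keep the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholders with the same number of rows as the notebook's data
X_demo = np.arange(20800).reshape(-1, 1)
y_demo = np.tile([0, 1], 10400)  # perfectly balanced dummy labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # 20% of the rows go to the test set
    random_state=20,    # fixed seed so the split is reproducible
    stratify=y_demo,    # optional: preserve the class ratio in both parts
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```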

## Model Building

### MultinomialNB

In [10]:
from sklearn.naive_bayes import MultinomialNB
nb=MultinomialNB()

nb.fit(train_x,train_y)

tnb=nb.predict(train_x)
tnb=pd.DataFrame(tnb,columns=["pred"])

pnb=nb.predict(test_x)
pnb=pd.DataFrame(pnb,columns=["pred"])

nb_train=nb.score(train_x,train_y)
score_nb=nb.score(test_x,test_y)
print("Training score=",100*nb_train,"%")
print("Test score=",100*score_nb,"%")

Training score= 93.75600961538461 %
Test score= 90.1923076923077 %
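
In this notebook the vectorizer is fit on all titles before the split, so the vocabulary is learned partly from rows that later land in the test set. One common alternative, sketched here on made-up data and not the notebook's own method, is to chain the vectorizer and the classifier in a scikit-learn pipeline so that both are fit on training text only. The titles, labels, and the name `model` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up, already-cleaned titles; 1 and 0 are just two example classes
train_titles = [
    "shock secret cure doctor hate",
    "alien secret shock reveal",
    "senat pass budget bill",
    "court rule budget case",
]
train_labels = [1, 1, 0, 0]

# The pipeline fits CountVectorizer and MultinomialNB on the training text only
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_titles, train_labels)

# New text is vectorized with the vocabulary learned during fit
preds = model.predict(["secret shock cure", "senat budget bill"])
print(preds)
```

With real data, `model.fit` would receive the training split of the cleaned titles and `model.score` the test split, in place of the pre-vectorized arrays.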


### Passive Aggressive Classifier

In [11]:
from sklearn.linear_model import PassiveAggressiveClassifier
pa = PassiveAggressiveClassifier()

pa.fit(train_x,train_y)

tpa=pa.predict(train_x)
tpa=pd.DataFrame(tpa,columns=["pred"])

ppa=pa.predict(test_x)
ppa=pd.DataFrame(ppa,columns=["pred"])

pa_train=accuracy_score(train_y,tpa)
score_pa=pa.score(test_x,test_y)
print("Training score=",100*pa_train,"%")
print("Test score=",100*score_pa,"%")

Training score= 100.0 %
Test score= 92.64423076923077 %
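
Accuracy alone does not show which kind of mistake a classifier makes. The sketch below uses made-up labels and predictions, not the notebook's results, to show how a confusion matrix, precision, and recall could be computed next to the accuracy score.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)  # of the items predicted 1, how many are 1
rec = recall_score(y_true, y_pred)      # of the items that are 1, how many were found

print(cm)
print(acc, prec, rec)
```

In the notebook, the same calls could take `test_y` and the output of `pa.predict(test_x)` as their two arguments.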


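#### Thus, the Passive Aggressive classifier reaches about 92.6% accuracy on the held-out test set (versus about 90.2% for MultinomialNB), although its 100% training score suggests some overfitting.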