# Fake News Detection

In [1]:
#Objective: To build a model that will accurately classify a piece of news as FAKE or REAL.

In [2]:
#About the project: We build a TfidfVectorizer with scikit-learn on our dataset. Then we initialize a PassiveAggressiveClassifier and fit the model.
    #In the end, the accuracy score and the confusion matrix tell us how well our model fares.

In [3]:
#Glossary:
    #TfidfVectorizer:
        #TF (Term Frequency) - The number of times a word appears in a document. A higher value means the term appears more often than others, which makes the document a good match when
            #the term is part of the search terms.
        #IDF (Inverse Document Frequency) - It measures how significant a term is across the corpus. Terms that appear in many documents get a lower weight than terms that appear in only a few.
    #Passive Aggressive Classifier - An online learning algorithm that remains passive when a classification is correct and turns aggressive
        #in the event of a misclassification, updating and adjusting its weights.
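To make the TF and IDF definitions above concrete, here is a minimal hand-rolled sketch on a hypothetical three-sentence corpus (not the news dataset). It assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is how scikit-learn's default settings are documented; treat it as an illustration, not as the library's implementation.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, for illustration only.
docs = [
    "the senate passed the bill",
    "the bill shocked the senate",
    "aliens endorsed the candidate",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency (assumed formula, see lead-in).
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(tokens):
    # Raw term counts times IDF, then scaled to unit L2 length.
    tf = Counter(tokens)
    weights = {t: count * idf(t) for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[2])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the third document every word occurs once, so the ranking is driven by IDF alone: "the" appears in all three documents and ends up with the lowest weight.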

In [4]:
#1. Importing libraries
import pandas as pd
import numpy as np
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

In [5]:
#2. Pull the data into a DataFrame, get its shape, and pull the first 5 records
import os
os.chdir(r'C:\Users\je638474\Documents\JJ\Folders\Udemy\Python EDUREKA\Python Projects')
data = pd.read_csv('news.csv')
data.shape

(6335, 4)

In [6]:
data.head(5)

,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [7]:
#3. Pull the 'label' column from the DataFrame
label = data.label
label.head()


0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

In [8]:
#4. Split the data into training and test sets (80/20), with a fixed random_state so the split is reproducible
x_train, x_test, y_train, y_test = train_test_split(data['text'], label, test_size = 0.2, random_state = 7)
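As a rough picture of what a seeded 80/20 split does, the sketch below shuffles row indices with a fixed seed and holds out the first fifth. The helper `split_indices` is hypothetical and uses Python's own random generator, so it will not reproduce the exact rows that `train_test_split` selects; it only shows why the same seed gives the same split each time.

```python
import random

def split_indices(n_rows, test_size=0.2, seed=7):
    """Shuffle row indices with a fixed seed and hold out a test fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    # First n_test shuffled indices form the test set, the rest the train set.
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```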

In [9]:
#Initialize TfidfVectorizer with English stop words and max_df=0.7: terms that appear in more than 70% of the documents are discarded.
#Stop words are the common words of a language that are filtered out before processing natural language data.
#The vectorizer turns a collection of raw documents into a matrix of TF-IDF features.
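The effect of the stop-word list and the max_df cutoff can be sketched in plain Python. The corpus, the five-word `STOP_WORDS` set, and the helper `build_vocabulary` are all hypothetical; scikit-learn's built-in 'english' list is far longer, and its tokenizer is more involved than `str.split`.

```python
from collections import Counter

# Tiny hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "in", "and"}

docs = [
    "the election results shocked the senate",
    "the election of a new governor",
    "aliens rigged the election",
    "the senate passed a bill",
]

def build_vocabulary(docs, stop_words, max_df):
    # Drop stop words, then count in how many documents each term occurs.
    tokenized = [set(d.split()) - stop_words for d in docs]
    df = Counter(term for tokens in tokenized for term in tokens)
    n_docs = len(docs)
    # Keep only terms whose document frequency does not exceed max_df.
    return sorted(t for t, count in df.items() if count / n_docs <= max_df)

vocab = build_vocabulary(docs, STOP_WORDS, max_df=0.7)
print(vocab)
```

Here "election" occurs in 3 of 4 documents (75%), so it is discarded by the 0.7 cutoff, while "senate" (50%) is kept.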

In [10]:
#5. Fit and transform the vectorizer on the train set, then transform the test set with the fitted vectorizer
#Initialize the vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
#Fit and transform the train set, transform (but do not refit on) the test set
tfidf_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_test = tfidf_vectorizer.transform(x_test)
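The reason the test set is only transformed, never fitted, is that the vocabulary (and the IDF weights) must come from the training data alone. The sketch below shows the idea with raw counts instead of TF-IDF weights to keep it short; `fit_vocabulary` and `transform_counts` are hypothetical helpers, and the sentences are made up.

```python
def fit_vocabulary(train_docs):
    # Learn a term -> column index mapping from the training documents only.
    terms = sorted({t for d in train_docs for t in d.split()})
    return {term: i for i, term in enumerate(terms)}

def transform_counts(docs, vocabulary):
    # Map documents onto the learned columns; unseen terms are ignored.
    rows = []
    for d in docs:
        row = [0] * len(vocabulary)
        for t in d.split():
            if t in vocabulary:
                row[vocabulary[t]] += 1
        rows.append(row)
    return rows

train_docs = ["senate passed bill", "aliens rigged election"]
test_docs = ["senate rigged lottery"]

vocab = fit_vocabulary(train_docs)
test_rows = transform_counts(test_docs, vocab)
print(vocab)
print(test_rows)
```

The word "lottery" never appeared in training, so it has no column and is dropped. The test matrix keeps the same width as the training matrix, which is what the classifier expects.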

In [11]:
#6. Initialize a PassiveAggressiveClassifier and fit it on tfidf_train and y_train
pac = PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train, y_train)


PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
                            early_stopping=False, fit_intercept=True,
                            loss='hinge', max_iter=50, n_iter_no_change=5,
                            n_jobs=None, random_state=None, shuffle=True,
                            tol=0.001, validation_fraction=0.1, verbose=0,
                            warm_start=False)
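The "passive" and "aggressive" behaviour from the glossary can be sketched as a single update step. This assumes the PA-I variant with hinge loss, which is what the `loss='hinge'` and `C=1.0` settings shown above are documented to correspond to. The function `pa_update`, the +1/-1 label encoding, and the feature values are hypothetical, and the sketch leaves out the intercept, shuffling, and multiple passes that the library handles.

```python
def pa_update(weights, x, y, C=1.0):
    """One PA-I step. x maps feature -> value; y is +1 or -1."""
    margin = y * sum(weights.get(f, 0.0) * v for f, v in x.items())
    loss = max(0.0, 1.0 - margin)  # hinge loss
    if loss == 0.0:
        return weights  # passive: the margin is large enough, change nothing
    sq_norm = sum(v * v for v in x.values())
    if sq_norm == 0.0:
        return weights  # empty feature vector, nothing to update
    # Aggressive: step just far enough to fix the example, capped by C.
    tau = min(C, loss / sq_norm)
    updated = dict(weights)
    for f, v in x.items():
        updated[f] = updated.get(f, 0.0) + tau * y * v
    return updated

# Hypothetical encoding: -1 stands for FAKE, +1 for REAL.
x = {"aliens": 0.6, "rigged": 0.8}
w = pa_update({}, x, y=-1)
print(w)
```

Starting from empty weights the example is misclassified (margin 0), so the weights move against both features. A later example with the same sign and a margin of at least 1 leaves the weights untouched.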

In [12]:
#7. Predict on the test set and calculate the accuracy score
y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test, y_pred)

In [13]:
score

0.9297553275453828

In [14]:
#8. We got an accuracy of about 93%. Let's pull out the confusion matrix to see the number of true/false positives and negatives
confusion_matrix(y_test, y_pred, labels =['FAKE', 'REAL'])

array([[592,  46],
       [ 43, 586]], dtype=int64)

In [15]:
#Treating FAKE as the positive class, we have 592 true positives, 586 true negatives, 43 false positives, and 46 false negatives
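The four cells of the confusion matrix printed above are enough to recompute the accuracy and to derive precision, recall, and F1. The sketch below copies the counts from that output and treats FAKE as the positive class, which is a choice made here for illustration; with `labels=['FAKE', 'REAL']`, rows are the true labels and columns the predicted ones.

```python
# Counts copied from the confusion matrix output above.
matrix = [[592, 46],
          [43, 586]]

tp, fn = matrix[0]  # true FAKE: predicted FAKE, predicted REAL
fp, tn = matrix[1]  # true REAL: predicted FAKE, predicted REAL

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of the articles flagged FAKE, how many were FAKE
recall = tp / (tp + fn)     # of the FAKE articles, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The recomputed accuracy, 1178 / 1267, agrees with the score reported by `accuracy_score` earlier in the notebook.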