### 1. Importing packages

In [1]:
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

### 2. Reading the data

In [42]:
# Read the data
df = pd.read_csv('news.csv')

df.shape

(6335, 4)

In [45]:
df.head()

|   | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [46]:
# Widen the column display so that long text is not truncated
pd.options.display.max_colwidth = 1000

# Select the fifth row (index label 4)
df[df.index==4]

|   | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners Hillary Clinton and Donald Trump are leading in the polls.\n\nTrump is now vowing to win enough delegates to clinch the Republican nomination and prevent a contested convention. But Sens.Ted Cruz, R-Texas, Bernie Sanders, D-Vt., and Ohio Gov. John Kasich and aren't giving up just yet.\n\nA big win in New York could tip the scales for both the Republican and Democratic front-runners in this year's race for the White House. Clinton and Trump have each suffered losses in recent contests, shifting the momentum to their rivals.\n\n"We have won eight out of the last nine caucuses and primaries! Cheer!" Sanders recently told supporters.\n\nWhile wins in New York for Trump and Clinton are expected, the margins of those victories are also important.\n\nTrump needs to capture more than 50 percent of the vote statewide if he wants to be positioned to win all of the state's 95 GOP delegates. That would put him one step closer to avoiding a contest... | REAL |


In [36]:
# Extract labels, which will be used for the target
labels = df.label
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

In [7]:
# Splitting the data into training and testing sets
x_train,x_test,y_train,y_test = train_test_split(df['text'], labels, test_size=0.2, random_state=7)

### 3. TfidfVectorizer

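A TfidfVectorizer turns a collection of raw documents into a matrix of TF-IDF features: each row is a document, each column is a term, and each value is the term's frequency in that document, weighted down for terms that appear in many documents.

Here it is initialized with English stop words and a maximum document frequency of 0.7, so terms that appear in more than 70% of the documents are discarded.

Stop words are the most common words in a language (such as "the", "is", and "and"). They carry little information about a document's content, so they are filtered out before processing.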
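To make the weighting concrete, here is a minimal pure-Python sketch of the idea on a toy corpus. The corpus, the one-word stop list, and the helper `tfidf_row` are made up for illustration. The smoothed IDF formula and the L2 normalization follow scikit-learn's documented defaults, but this is a hand-rolled approximation, not the library's implementation.

```python
import math
from collections import Counter

# Toy corpus and stop-word list (both hypothetical, for illustration only).
docs = [
    "the senate passed the budget bill",
    "the senate rejected the shocking hoax",
    "the shocking budget hoax",
    "the senate vote",
]
stop_words = {"the"}
max_df = 0.7  # same threshold as in the notebook

# Tokenize on whitespace and drop stop words.
tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]

# Document frequency: the number of documents each term appears in.
n_docs = len(tokenized)
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Discard terms that appear in more than max_df of the documents.
vocab = sorted(t for t, df_t in doc_freq.items() if df_t / n_docs <= max_df)

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector of one document over vocab."""
    counts = Counter(tokens)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = [
        counts[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t in vocab
    ]
    norm = math.sqrt(sum(w * w for w in weights)) or 1.0
    return [w / norm for w in weights]

matrix = [tfidf_row(tokens) for tokens in tokenized]
print(vocab)
```

In this toy corpus "senate" occurs in three of the four documents (75%), so it is removed by the `max_df` filter along with the stop word "the".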

In [8]:
# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

# Fit and transform train set
tfidf_train = tfidf_vectorizer.fit_transform(x_train)

# Transform test set
tfidf_test = tfidf_vectorizer.transform(x_test)

### 4. PassiveAggressive Classifier

Passive-Aggressive algorithms are somewhat similar to a Perceptron model, in the sense that they do not require a learning rate. However, they do include a regularization parameter.

Passive: If the prediction is correct, the model is left unchanged, because the example gives no reason to adjust it.

Aggressive: If the prediction is incorrect, the model is updated just enough to correct it for that example.

C : The regularization parameter. It caps the size of each update step, so a larger value allows a more aggressive correction after an incorrect prediction.

max_iter : The maximum number of passes (epochs) the model makes over the training data.

tol : The stopping criterion. If it is not None, training stops when (loss > previous_loss - tol). Recent scikit-learn releases default to 1e-3; the older version that produced the output below defaults to None.
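The passive and aggressive cases can be sketched as a single update step. The following is a hand-written illustration of the PA-I rule with hinge loss, assuming labels encoded as +1 and -1 and dense feature lists. The function name `pa_update` is hypothetical, and scikit-learn's own implementation differs in its details (sparse input, intercept handling, shuffling, and so on).

```python
def pa_update(w, x, y, C=1.0):
    """Apply one PA-I update to weights w for example x with label y (+1 or -1)."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss

    # Passive: the example is classified correctly with enough margin.
    if loss == 0.0:
        return w

    sq_norm = sum(xi * xi for xi in x)
    if sq_norm == 0.0:
        return w  # an all-zero example cannot change the weights

    # Aggressive: step just far enough to fix the example, capped by C.
    tau = min(C, loss / sq_norm)
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
w = pa_update(w, [1.0, 2.0], +1)        # misclassified: weights move
unchanged = pa_update(w, [2.0, 4.0], +1)  # margin is large enough: no change
print(w, unchanged)
```

A small `C` limits how far a single (possibly noisy) example can move the weights, which is why it acts as a regularizer.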

In [50]:
# Initializing PassiveAggressiveClassifier
pac = PassiveAggressiveClassifier(max_iter=150)
pac.fit(tfidf_train,y_train)

PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
              fit_intercept=True, loss='hinge', max_iter=150, n_iter=None,
              n_jobs=1, random_state=None, shuffle=True, tol=None,
              verbose=0, warm_start=False)

### 5. Predict

In [51]:
# Predict on the test set and calculate accuracy
y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test,y_pred)

print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 92.82%


### 6. Confusion Matrix

The confusion matrix shows how many articles of each class were classified correctly and incorrectly, i.e. the number of true and false positives and negatives.

In [52]:
# Build the confusion matrix (rows: actual labels, columns: predicted labels)
confusion_matrix(y_test, y_pred, labels=['FAKE', 'REAL'])

array([[589,  49],
       [ 42, 587]])

Taking FAKE as the positive class, the result shows 589 true positives, 49 false negatives, 42 false positives, and 587 true negatives.
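As a quick sanity check, the four counts can be turned back into summary metrics by hand. This sketch assumes FAKE is treated as the positive class and simply copies the numbers from the matrix above. Precision and recall are not computed in the original notebook and are added here only for illustration.

```python
# Counts copied from the confusion matrix (rows: actual, columns: predicted).
tp, fn = 589, 49   # actual FAKE: predicted FAKE, predicted REAL
fp, tn = 42, 587   # actual REAL: predicted FAKE, predicted REAL

total = tp + fn + fp + tn
accuracy = (tp + tn) / total   # share of all articles classified correctly
precision = tp / (tp + fp)     # share of predicted FAKE that is actually FAKE
recall = tp / (tp + fn)        # share of actual FAKE that was detected

print(f'Test articles: {total}')
print(f'Accuracy:  {accuracy:.2%}')
print(f'Precision: {precision:.2%}')
print(f'Recall:    {recall:.2%}')
```

The accuracy computed this way matches the 92.82% reported in the prediction step, and the total of 1267 articles is 20% of the 6335 rows in the dataset.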