# Fake News Sentiment Analysis

The link to the tutorial can be found here: https://data-flair.training/blogs/advanced-python-project-detecting-fake-news/

## Imports

In [100]:
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier, LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

## Read in the dataset

In [3]:
df = pd.read_csv('news.csv')
df.shape
df.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [4]:
#DataFlair - Get the labels
labels=df['label']
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

## Split test and train dataset

In [5]:
#DataFlair - Split the dataset
x_train,x_test,y_train,y_test=train_test_split(df['text'], labels, test_size=0.2, random_state=7)

## Get tf-idf scores for each word in each document

For this we will instantiate the TfidfVectorizer object. This will do all the work for us in terms of preparing our feature set. 

Note that we fit the vectorizer (learn the vocabulary and a numerical weight for each word) on the training set only; we do not fit it on the test data.
This is because the training set is assumed to be representative of the test set, so any "major" word should already be part of the training vocabulary. Fitting on the test data would also leak information from the test set into the features.

Therefore, we only apply the transform to the test dataset.

In [31]:
#DataFlair - Initialize a TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)
#DataFlair - Fit and transform train set, transform test set
tfidf_train=tfidf_vectorizer.fit_transform(x_train) 
tfidf_test=tfidf_vectorizer.transform(x_test)
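
The difference between fitting and transforming can be seen on a tiny example. This is a minimal sketch with made-up sentences (`train_docs` and `test_docs` are illustrative and not taken from news.csv), showing that words appearing only in the test text never become feature columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, made up for illustration only
train_docs = ["senate passes budget bill", "aliens endorse budget bill"]
test_docs = ["senate rejects zebra bill"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training texts
train_matrix = vectorizer.fit_transform(train_docs)
# transform reuses that vocabulary; nothing new is learned from the test text
test_matrix = vectorizer.transform(test_docs)

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
# "rejects" and "zebra" were never seen during fit, so they are ignored
print("zebra" in vectorizer.vocabulary_)
```

Both matrices end up with the same number of columns, which is what lets a model trained on one be applied to the other.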

## Implement Passive Aggressive Classifier

This classifier stays passive (leaves its weights unchanged) when a prediction is correct, but updates its weights when a prediction is wrong.

tfidf_train consists of n rows (equal to the number of rows in the training set) and p columns (the number of features, one for each word in the vocabulary learned from the training texts).
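
To make the passive/aggressive behaviour concrete, here is a minimal sketch of the textbook update rule for a single example. The function name `passive_aggressive_step` is hypothetical, labels are assumed to be encoded as +1 and -1, and scikit-learn's implementation may differ in details such as regularization, so treat this as an illustration rather than what `PassiveAggressiveClassifier` does internally.

```python
import numpy as np

def passive_aggressive_step(weights, x, y):
    """One textbook passive-aggressive update; y must be +1 or -1."""
    margin = y * np.dot(weights, x)
    loss = max(0.0, 1.0 - margin)  # hinge loss
    if loss == 0.0:
        # Passive: the example is classified correctly with enough margin
        return weights
    # Aggressive: take the smallest step that brings the margin up to 1
    step = loss / np.dot(x, x)
    return weights + step * y * x

x = np.array([1.0, 0.0, 1.0])
w_start = np.zeros(3)
w_updated = passive_aggressive_step(w_start, x, +1)   # loss > 0, so weights move
w_repeat = passive_aggressive_step(w_updated, x, +1)  # loss is now 0, no change

print(w_updated)
print(np.array_equal(w_updated, w_repeat))
```

In this formulation the weights also move when a prediction is correct but falls inside the margin, which is slightly stricter than "only change when wrong".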

In [89]:
#DataFlair - Initialize a PassiveAggressiveClassifier
pac=PassiveAggressiveClassifier(max_iter=100)
pac.fit(tfidf_train,y_train)
#DataFlair - Predict on the test set and calculate accuracy
y_pred=pac.predict(tfidf_test)
score=accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 92.98%


In [9]:
#DataFlair - Build confusion matrix
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])

array([[589,  49],
       [ 44, 585]], dtype=int64)
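
The matrix can be read with a few lines of arithmetic. This sketch copies the numbers from the output above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels, in the order given by `labels`. The variable names are illustrative.

```python
import numpy as np

# Copied from the output above; rows = true labels, columns = predictions,
# both in the order ['FAKE', 'REAL']
cm = np.array([[589, 49],
               [44, 585]])

total = cm.sum()
accuracy = np.trace(cm) / total
fake_recall = cm[0, 0] / cm[0].sum()        # share of true FAKE articles caught
fake_precision = cm[0, 0] / cm[:, 0].sum()  # share of FAKE predictions that are right

print(f"articles:       {total}")
print(f"accuracy:       {accuracy:.4f}")
print(f"FAKE recall:    {fake_recall:.4f}")
print(f"FAKE precision: {fake_precision:.4f}")
```

The accuracy implied by this matrix (1174 correct out of 1267, about 92.7%) is slightly different from the 92.98% printed earlier. A likely explanation is that the two cells were run at different times (note the `In [9]` and `In [89]` prompts) and the classifier was not seeded, so each fit can give a slightly different result.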

## Implement with logistic regression

In [99]:
lrModel = LogisticRegression()
lrModel.fit(tfidf_train, y_train)

y_pred = lrModel.predict(tfidf_test)
score = accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 91.71%


## Implement with Naive Bayes

We use the multinomial variant because, according to the scikit-learn documentation, it can work with tf-idf features in practice. This is somewhat unexpected, since the multinomial model is defined for integer counts and tf-idf values are not integers, but it reportedly works out better than the discrete count values.
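
One way to check the counts-versus-tf-idf question on your own data is to train the same classifier on both feature types. The sketch below uses a made-up toy corpus (`texts`, `toy_labels`, and `new_text` are illustrative), so it only shows that `MultinomialNB` accepts both kinds of input; it says nothing about which one scores better on news.csv.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, made up for illustration only
texts = [
    "shocking miracle cure doctors hate",
    "shocking secret they hide",
    "senate votes on budget bill",
    "court rules on budget dispute",
]
toy_labels = ["FAKE", "FAKE", "REAL", "REAL"]

# Same classifier, two different feature representations
count_nb = make_pipeline(CountVectorizer(), MultinomialNB())
tfidf_nb = make_pipeline(TfidfVectorizer(), MultinomialNB())
count_nb.fit(texts, toy_labels)
tfidf_nb.fit(texts, toy_labels)

new_text = ["shocking miracle secret"]
count_prediction = count_nb.predict(new_text)[0]
tfidf_prediction = tfidf_nb.predict(new_text)[0]
print(count_prediction, tfidf_prediction)
```

On the real dataset, swapping `TfidfVectorizer` for `CountVectorizer` in the notebook and comparing `score` on the test set would be the direct way to verify the claim.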

In [112]:
NB = MultinomialNB()
NB.fit(tfidf_train, y_train)
Accuracy = NB.score(tfidf_test, y_test)
print(Accuracy*100)

84.0568271507498
