# >> Fake news detection <<


## What is fake news?
### >> A type of yellow journalism, fake news encompasses pieces of news that may be hoaxes and are generally spread through social media and other online media.
### >> This is often done to further or impose certain ideas, frequently in service of a political agenda. False information spreads extraordinarily fast.
### >> This is demonstrated by the fact that, when one fake news site is taken down, another promptly takes its place. In addition, fake news can become indistinguishable from accurate reporting because it spreads so fast.
### >> People can download articles from sites, share the information, and re-share it from others. By the end of the day, the false information has travelled so far from its original site that it becomes indistinguishable from accurate reporting.

In [1]:
#Make all the necessary imports
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

#### >> NumPy is a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays.
#### >> Pandas is an open-source Python package for data analysis and manipulation, built on top of NumPy. It offers data structures and operations for manipulating numerical tables and time series, and it is popular because it makes importing and analyzing data much easier.
#### >> itertools is a standard-library module that provides functions that work on iterators to produce more complex iterators. These fast, memory-efficient tools can be used on their own or in combination to form an iterator algebra.
#### >> The train-test split is a technique for evaluating the performance of a machine learning algorithm. It can be used for classification or regression problems and with any supervised learning algorithm. The procedure involves taking a dataset and dividing it into two subsets: one used to train the model and one used to evaluate it.

## >> What is a TfidfVectorizer?
### ● TF (Term Frequency): The number of times a word appears in a document is its Term Frequency. A higher value means a term appears more often than others, and so, the document is a good match when the term is part of the search terms.
### ● IDF (Inverse Document Frequency): Words that occur many times in a document, but also occur many times in many others, may be irrelevant. IDF is a measure of how significant a term is in the entire corpus.
### ● The TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.
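#### >> The TF and IDF definitions above can be checked by hand on a small example. The sketch below is only illustrative: the three-document corpus and the helper names are made up, and the IDF formula shown is scikit-learn's default smoothed variant, ln((1 + n) / (1 + df)) + 1, which differs slightly from the textbook definition. TfidfVectorizer also L2-normalizes each row by default.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, for illustration only)
docs = [
    "election fraud claims spread online",
    "election results confirmed by officials",
    "officials deny fraud claims",
]

n_docs = len(docs)
tokenized = [d.split() for d in docs]

def term_frequency(term, tokens):
    # TF: raw count of the term inside one document
    return tokens.count(term)

def document_frequency(term):
    # DF: number of documents that contain the term at least once
    return sum(term in tokens for tokens in tokenized)

def smoothed_idf(term):
    # scikit-learn's default IDF (smooth_idf=True)
    return math.log((1 + n_docs) / (1 + document_frequency(term))) + 1

# "election" occurs in 2 of 3 documents, "spread" in only 1,
# so "spread" gets the larger IDF weight.
idf_election = smoothed_idf("election")
idf_spread = smoothed_idf("spread")

# The vectorizer learns the same IDF values when it is fitted.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)   # sparse matrix, one row per document
vocab = vectorizer.vocabulary_            # maps term -> column index
sk_idf_election = vectorizer.idf_[vocab["election"]]

print(idf_election, sk_idf_election, idf_spread)
print(matrix.shape)
```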

## >> What is a PassiveAggressiveClassifier?
### ● Passive Aggressive algorithms are online learning algorithms. Such an algorithm remains passive when an example is classified correctly, and turns aggressive in the event of a misclassification, updating and adjusting its weights. Unlike most other algorithms, it does not converge. Its purpose is to make updates that correct the loss while causing as little change as possible to the weight vector.
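#### >> To make the passive and aggressive behaviour concrete, here is a simplified sketch of a single update step for binary labels in {-1, +1} with the hinge loss. It is not scikit-learn's implementation (which additionally caps the step size with its C parameter); the function name pa_update and the sample vectors are assumptions for illustration.

```python
import numpy as np

def pa_update(w, x, y):
    """One simplified passive-aggressive step for a label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss on this example
    if loss == 0.0:
        return w                               # passive: margin already satisfied
    tau = loss / np.dot(x, x)                  # smallest step that fixes the loss
    return w + tau * y * x                     # aggressive: move just far enough

w0 = np.zeros(3)
x = np.array([1.0, 2.0, 0.0])

w1 = pa_update(w0, x, +1)   # example violates the margin, so the weights change
w2 = pa_update(w1, x, +1)   # same example again: margin is met, essentially no change

print(w1, np.dot(w1, x))
print(w2)
```

#### >> After the first step the example sits exactly on the margin, which is why the second call leaves the weights essentially unchanged.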


#### >> Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition: Accuracy = Number of correct predictions / Total number of predictions.
#### >> A confusion matrix is a tabular summary of the number of correct and incorrect predictions made by a classifier, usually a supervised learning one. Each row of the matrix represents the instances of an actual class and each column represents the instances of a predicted class. It can be used to evaluate a classification model by calculating performance metrics such as accuracy, precision, recall, and F1-score.
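#### >> A small worked example may help connect the two metrics. The six labels below are invented for illustration; the point is only that rows are actual classes, columns are predicted classes, and accuracy equals the diagonal sum divided by the total.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical ground truth and predictions (not taken from the dataset)
y_true = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE"]
y_guess = ["FAKE", "REAL", "REAL", "REAL", "FAKE", "FAKE"]

# Rows follow the order given in `labels`: row 0 = actual FAKE, row 1 = actual REAL
cm = confusion_matrix(y_true, y_guess, labels=["FAKE", "REAL"])
acc = accuracy_score(y_true, y_guess)

# Correct predictions lie on the diagonal of the matrix
acc_from_cm = cm.trace() / cm.sum()

print(cm)
print(acc, acc_from_cm)
```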


In [2]:
#Read the data 
df=pd.read_csv('news.csv')

#Get the shape of the DataFrame
df.shape

(6335, 4)

In [3]:
# Get the head to have an idea about the DataFrame
df.head()

|   | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [4]:
#Get the labels i.e., Target 
labels=df['label']  ## or labels=df.label
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

In [5]:
#Split the dataset as follows
x_train,x_test,y_train,y_test=train_test_split(df['text'], labels, test_size=0.2, random_state=7)
# 20% test data and 80% training data 

### >> Random state <<
#### If you don't specify random_state, a new random seed is used every time you run the code, so the train and test datasets contain different rows on each run.
#### However, if a fixed value is assigned, such as random_state = 0, 1, 7, or any other integer, then no matter how many times you execute the code the result is the same, i.e., the train and test datasets contain the same rows.
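#### >> The effect of a fixed random_state is easy to verify. The sketch below splits a made-up list of ten integers twice with the same seed and checks that both splits are identical.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: ten integers standing in for ten rows
data = list(range(10))

# Two independent calls with the same seed
a_train, a_test = train_test_split(data, test_size=0.2, random_state=7)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=7)

# The same seed reproduces the same shuffle, hence the same split
same_split = (a_train == b_train) and (a_test == b_test)

print(a_test, b_test, same_split)
```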

In [6]:
#Initialize a TfidfVectorizer or Creating an instance of TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)

#Fit and transform train set, transform test set
tfidf_train=tfidf_vectorizer.fit_transform(x_train) 
tfidf_test=tfidf_vectorizer.transform(x_test)

###  What are stop words?
#### >> Words that are generally filtered out before processing natural language text are called stop words.
#### >> These are the most common words in a language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text. Examples of a few stop words in English are “the”, “a”, “an”, “so”, “what”.
### Why do we remove stop words? 
#### >> Stop words are abundant in any human language. By removing them, we strip low-information words from the text so the model can focus on the words that carry meaning. In other words, for many tasks removing such words has little or no negative effect on the model we train.
### Do we always remove stop words? Are they always useless for us? 
#### The answer is no! We do not always remove stop words. Whether to remove them depends heavily on the task we are performing and the goal we want to achieve. For example, if we are training a model for sentiment analysis, we might keep them. Movie review: “The movie was not good at all.” Text after removal of stop words: “movie good”. The original review is clearly negative, but after removing stop words it reads as positive. Thus, removing stop words can be problematic here.
### >> Tasks like text classification and fake news detection do not generally need stop words, because the other words in the dataset are more important and give the general idea of the text. So, we generally remove stop words in such tasks. 
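#### >> The movie-review example can be reproduced with the English stop word list that ships with scikit-learn, which is the list TfidfVectorizer uses when stop_words='english' is passed. The whitespace tokenization here is a simplification for illustration, not what the vectorizer does internally.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

review = "The movie was not good at all"

# Naive tokenization: lowercase and split on whitespace
tokens = review.lower().split()

# Keep only the words that are not in the stop word list
filtered = [t for t in tokens if t not in ENGLISH_STOP_WORDS]

# The negation "not" is itself a stop word, so it disappears
print(filtered)
```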
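### What is max_df?
#### >> When building the vocabulary, the vectorizer ignores terms whose document frequency (df) is strictly higher than the given threshold (0.7 in our problem). Such terms act as corpus-specific stop words. A float value is interpreted as a proportion of documents, so max_df=0.7 drops terms that appear in more than 70% of the documents.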
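#### >> A tiny example shows the threshold in action. In the made-up corpus below, "apple" appears in 3 of 4 documents (75%), which is above max_df=0.7, so it is left out of the vocabulary, while "banana" (50%) is kept.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical four-document corpus
docs = ["apple banana", "apple cherry", "apple date", "banana egg"]

vec = TfidfVectorizer(max_df=0.7)
vec.fit(docs)

# vocabulary_ holds only the terms that survived the max_df filter
kept_terms = sorted(vec.vocabulary_)

print(kept_terms)
```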


### >> fit() vs transform() vs fit_transform() <<
#### >> The fit() method identifies and learns the model parameters from a training data set. For example, the mean and standard deviation for normalization, or the min and max for scaling features to a given range. For a TfidfVectorizer, fit() learns the vocabulary and the IDF weights.
#### >> The transform() method applies the parameters learned by fit(). It is used to transform both the training data and the test data (i.e., unseen data).
#### >> The fit_transform() method first fits, then transforms the data set in a single call. It is an efficient implementation of fit() followed by transform(). As a best practice, fit_transform() is only used on the training data set.
#### >> Taking feature scaling as an example, the fit method calculates the mean and variance of each feature in our data, and the transform method rescales every feature using those values. We want the same scaling to be applied to our test data, but without biasing our model. The test data should be completely new and a surprise to the model. Calling only transform on the test data achieves this.
#### >> If we used the fit method on our test data too, we would compute a new mean and variance, i.e., a new scale for each feature, and let our model learn about the test data. What we wanted to keep as a surprise would no longer be unknown to the model, and we would not get a good estimate of how the model performs on test (unseen) data, which is the ultimate goal of building a model with a machine learning algorithm.
#### >> This is the standard procedure for scaling data while building a machine learning model, so that the model is not biased towards a particular feature of the dataset and, at the same time, is prevented from learning the features, values, or trends of the test data.
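#### >> The difference between the three methods can be sketched in a few lines. Both parts below use invented toy data: the first mirrors the scaling explanation with StandardScaler, and the second shows the same pattern with TfidfVectorizer, where words seen only in the test set are simply ignored.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

# --- Scaling: parameters come from the training data only ---
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std, then scales
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

print(scaler.mean_, X_test_scaled)

# --- Text: vocabulary comes from the training documents only ---
train_docs = ["breaking news today", "fake story spreads"]
test_docs = ["fake news alert"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)    # learns vocabulary and IDF
test_matrix = vec.transform(test_docs)          # "alert" is unseen, so it is dropped

print(train_matrix.shape, test_matrix.shape)
```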

In [7]:
#Initialize a PassiveAggressiveClassifier, i.e., create an instance of PassiveAggressiveClassifier
pac=PassiveAggressiveClassifier(max_iter=100)
# max_iter = The maximum number of passes over the training data (aka epochs).
pac.fit(tfidf_train,y_train)
# .fit() trains the PassiveAggressiveClassifier (the algorithm used for this problem statement) on the TF-IDF vectorized training data

PassiveAggressiveClassifier(max_iter=100)

#### >> The fit() method takes the training data as arguments, which can be one array in the case of unsupervised learning, or two arrays in the case of supervised learning. Note that the model is fitted using X (features) and y (target), but the object holds no reference to X and y.

In [8]:
#Predicting on the test set and calculating accuracy using PassiveAggressiveClassifier Model
y_pred=pac.predict(tfidf_test) # y_pred gives the predicted values of the test data using PassiveAggressiveClassifier 
score=accuracy_score(y_test,y_pred)  #Using accuracy score we get the model's accuracy
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 92.66%


#### >> The fit() method fits the model to the training instances, while predict() makes predictions on the test instances based on the parameters learned during fit.

#### Execute the below code to see the pic of Confusion matrix to get an idea of what it is!!

In [9]:
import cv2
img=cv2.imread('Confusion martix.png') #Firstly read the image 
cv2.imshow('Confusion Matrix',img) # Then show the image. 1st arg= Image name(upto us), 2nd arg= img(after reading the image)
cv2.waitKey(0) 

-1

#### waitKey(0) keeps the window open indefinitely until any key is pressed, which suits displaying a still image. waitKey(1) waits only 1 ms for a key press before returning, which suits showing successive video frames.

In [10]:
#Build a confusion matrix to evaluate the accuracy of a classification.

confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])

array([[589,  49],
       [ 44, 585]], dtype=int64)
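#### >> The reported accuracy can be recovered directly from this matrix, along with precision and recall. The sketch below treats FAKE as the positive class, which is a choice made here for illustration, and uses the four counts shown in the output above.

```python
import numpy as np

# Confusion matrix from the output above (rows = actual, columns = predicted,
# in the order FAKE, REAL)
cm = np.array([[589, 49],
               [44, 585]])

tp, fn = cm[0]   # actual FAKE: predicted FAKE / predicted REAL
fp, tn = cm[1]   # actual REAL: predicted FAKE / predicted REAL

accuracy = (tp + tn) / cm.sum()       # correct predictions over all predictions
precision_fake = tp / (tp + fp)       # of the articles flagged FAKE, how many were FAKE
recall_fake = tp / (tp + fn)          # of the FAKE articles, how many were flagged

print(f'Accuracy: {round(accuracy * 100, 2)}%')
print(f'Precision (FAKE): {precision_fake:.4f}')
print(f'Recall (FAKE): {recall_fake:.4f}')
```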

## The model was trained with very few lines of code, and an accuracy of 92.66% on the held-out test set is a good result for such a simple pipeline!