**Fake News**

Fake news is a form of news consisting of deliberate disinformation or hoaxes spread via traditional news media or online social media. Digital news has brought back and increased the usage of fake news, or yellow journalism.

Yellow journalism and the yellow press are American terms for journalism and associated newspapers that present little or no legitimate well-researched news while instead using eye-catching headlines for increased sales.

In this article, we will use scikit-learn's TfidfVectorizer and PassiveAggressiveClassifier to separate fake news from genuine news.

TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.

TF (term frequency): the number of times a word appears in a given document.

IDF (inverse document frequency): a weight that grows as a word appears in fewer documents of the corpus, so rare words count for more and words found in almost every document count for less.

A simple example of how TfidfVectorizer works is given below.

In [None]:
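#Example of TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Sparse matrix: each entry is (document index, feature index) and its TF-IDF weight
print(X)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())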

  (0, 1)	0.46979138557992045
  (0, 2)	0.5802858236844359
  (0, 6)	0.38408524091481483
  (0, 3)	0.38408524091481483
  (0, 8)	0.38408524091481483
  (1, 5)	0.5386476208856763
  (1, 1)	0.6876235979836938
  (1, 6)	0.281088674033753
  (1, 3)	0.281088674033753
  (1, 8)	0.281088674033753
  (2, 4)	0.511848512707169
  (2, 7)	0.511848512707169
  (2, 0)	0.511848512707169
  (2, 6)	0.267103787642168
  (2, 3)	0.267103787642168
  (2, 8)	0.267103787642168
  (3, 1)	0.46979138557992045
  (3, 2)	0.5802858236844359
  (3, 6)	0.38408524091481483
  (3, 3)	0.38408524091481483
  (3, 8)	0.38408524091481483
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']




> *Note*

The output above shows what TfidfVectorizer did: it first counted how often each word occurs in each document of the corpus, then assigned every word a weight in the matrix. The vectorization produced 9 features in total, so the matrix has 9 columns (one per word) and 4 rows (one per document), and each entry is the weight of a word in a document.
For example, the word "the" occurs in every document, so TfidfVectorizer gives it a lower weight (0.38 in the first document) than a rarer word such as "first" (0.58), which says more about the document it appears in.
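
The weights printed above can be reproduced by hand, which may help make the TF and IDF definitions concrete. The sketch below is not scikit-learn's code; it assumes the library's default settings (lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and scaling each row to unit length). The helper name `tfidf_row` is made up for this illustration.

```python
import math
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Lowercase each document and keep tokens of two or more word characters
docs = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {}
for word in vocab:
    doc_freq = sum(1 for doc in docs if word in doc)
    idf[word] = math.log((1 + n_docs) / (1 + doc_freq)) + 1


def tfidf_row(doc):
    """Return {word: weight} for one tokenized document, scaled to unit length."""
    raw = {word: doc.count(word) * idf[word] for word in set(doc)}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}


first_row = tfidf_row(docs[0])
for word in sorted(first_row):
    print(word, round(first_row[word], 4))
```

For the first document this gives roughly 0.4698 for "document", 0.5803 for "first" and 0.3841 for "is", "the" and "this", which agrees with row 0 of the matrix printed above.
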
>**PassiveAggressive classifier**

The passive-aggressive algorithms are a family of online learning algorithms suited to large-scale data: they update the model one example at a time.
Here it is enough to know that the algorithm stays passive when an example is classified correctly, and turns aggressive when it is misclassified, adjusting the model just enough to correct that mistake.
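
To make the passive/aggressive behaviour concrete, here is a minimal sketch of a single update step for a binary linear classifier. It follows the PA-I variant described in the literature (hinge loss with a step size capped by `C`) and is only an illustration, not scikit-learn's implementation. The function name `pa_update` and the toy numbers are assumptions made for this example, and labels are encoded as -1 and +1.

```python
def pa_update(w, x, y, C=1.0):
    """One passive-aggressive (PA-I) step for a label y in {-1, +1}.

    w and x are lists of floats of the same length; a new weight list is returned.
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss

    if loss == 0.0:
        # Passive: the example is already classified with enough margin
        return list(w)

    norm_sq = sum(xi * xi for xi in x)
    if norm_sq == 0.0:
        # An all-zero example carries no information to update on
        return list(w)

    # Aggressive: move w just far enough to fix the mistake, capped by C
    tau = min(C, loss / norm_sq)
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


w = [0.0, 0.0]
w = pa_update(w, [1.0, 2.0], +1)        # wrong side of the margin -> weights change
unchanged = pa_update(w, [1.0, 2.0], +1)  # now correct -> weights stay the same
print(w, unchanged)
```

The cap `C` limits how far one noisy example can pull the weights; a larger `C` makes each correction more aggressive.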

>**Project (Implementation)
Importing required libraries**



In [None]:
#importing libraries
import numpy as np
import pandas as pd
import itertools

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

>**Note**

NumPy and pandas are used here for data manipulation, and the itertools module is used to handle iterators.

Itertools: Python's itertools module is a collection of tools for handling iterators. Simply put, an iterator is an object that yields its items one at a time, which is what a for loop consumes. The most common iterable in Python is the list; a for loop obtains an iterator from it behind the scenes.
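
As a small illustration of the difference between an iterable and an iterator, and of one itertools helper, consider the sketch below. The label list is just an example chosen to match this project's two classes.

```python
import itertools

labels = ["FAKE", "REAL"]  # a list is an iterable, not itself an iterator

it = iter(labels)   # iter() asks the list for an iterator
first = next(it)    # the iterator hands out one item at a time
second = next(it)

# itertools.product yields every (true label, predicted label) pair,
# for instance to walk over the cells of a 2x2 confusion matrix
cells = list(itertools.product(labels, repeat=2))
print(first, second)
print(cells)
```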

In the next step, we will read the dataset used in this article; it contains both fake and real news.

In [None]:
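#Reading the dataset
# on_bad_lines='warn' replaces error_bad_lines=False, which was removed in pandas 2.0:
# malformed lines are skipped and reported
df=pd.read_csv('/content/news.csv', on_bad_lines='warn', engine="python")
df.head()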

Skipping line 4719: unexpected end of data


| Unnamed: 0.1 | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


**Note:** As the output above shows, each row holds a news article's title, its text, and its label (FAKE or REAL).

In [None]:
#Defining our features and target
X=df['text']
Y=df['label']

#using split function: 80% of the rows for training, 20% for testing
x_train,x_test,y_train,y_test=train_test_split(X,Y, test_size=0.2, random_state=7)

> **Note**

Our goal is to decide whether a piece of text is fake or real. The target is therefore the label (FAKE or REAL), which the model predicts from the feature, i.e. the text. We then split the data into training and test sets: the training data is used to fit the model, and the test data shows how well the model has learned.
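
The idea behind the split can be sketched in plain Python: shuffle the row indices with a fixed seed, then set aside a fraction for testing. This is a simplified illustration rather than what `train_test_split` does internally, so it will not pick the same rows. The function name `simple_train_test_split` and the ten toy articles are made up for the example.

```python
import random


def simple_train_test_split(items, labels, test_size=0.2, seed=7):
    """Shuffle row indices with a fixed seed, then cut off a test slice."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # same seed -> same split every run

    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    x_train = [items[i] for i in train_idx]
    x_test = [items[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Hypothetical toy data standing in for the text and label columns
texts = [f"article {i}" for i in range(10)]
labels = ["FAKE" if i % 2 == 0 else "REAL" for i in range(10)]

x_train, x_test, y_train, y_test = simple_train_test_split(texts, labels)
print(len(x_train), len(x_test))
```

Keeping each text paired with its own label through the shuffle is the important part; fixing the seed makes the experiment repeatable.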



In [None]:
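#preprocessing of data (tokenizing and creating the matrix)
from sklearn.feature_extraction.text import TfidfVectorizer

# drop English stop words and any term that appears in more than 70% of the documents
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)

#preprocessing of train data: learn the vocabulary and IDF weights, then transform
tfidf_train=tfidf_vectorizer.fit_transform(x_train)

#preprocessing of test data: reuse the vocabulary learned from the train data
tfidf_test=tfidf_vectorizer.transform(x_test)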

> **Note**

As discussed earlier, TfidfVectorizer tokenizes the text, converts it into a matrix, and decides the weight of each word; this is the preprocessing step. A model cannot work on raw documents directly, so the text must first be converted into a numeric matrix, as in the TfidfVectorizer example above.
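
Why `fit_transform` on the training data but only `transform` on the test data? The vocabulary is learned from the training documents alone, so words that appear only in the test data are ignored. The sketch below illustrates this together with the effect of `max_df`, using plain word counts instead of TF-IDF weights and leaving out stop-word removal. The helper names and the toy headlines are assumptions made for this example.

```python
import re


def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(train_docs, max_df=0.7):
    """Learn a word -> column index map, dropping words found in more than max_df of the documents."""
    doc_freq = {}
    for doc in train_docs:
        for word in set(tokenize(doc)):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    limit = max_df * len(train_docs)
    kept = sorted(word for word, count in doc_freq.items() if count <= limit)
    return {word: index for index, word in enumerate(kept)}


def count_vector(doc, vocabulary):
    """Count only the words that are in the learned vocabulary."""
    vector = [0] * len(vocabulary)
    for word in tokenize(doc):
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector


# Hypothetical training headlines; "budget" appears in 3 of 4, above the 70% limit
train_docs = [
    "senate passes budget bill",
    "shocking secret about budget",
    "senate debates new bill",
    "budget vote delayed",
]
vocabulary = fit_vocabulary(train_docs, max_df=0.7)

# "aliens" and "hijack" were never seen in training, so they are ignored
test_vector = count_vector("aliens hijack senate budget vote", vocabulary)
print(sorted(vocabulary))
print(test_vector)
```

Fitting on the test data as well would let information from the test set leak into the features, which would make the reported accuracy look better than it really is.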



In [None]:
#classifier or algorithm to train the model
passive=PassiveAggressiveClassifier(max_iter=50)
passive.fit(tfidf_train,y_train)
y_pred=passive.predict(tfidf_test)

#accuracy of the model
score=accuracy_score(y_test,y_pred)
print(score)

#confusion matrix: rows are true labels, columns are predicted labels (order: FAKE, REAL)
confusion_matrix(y_test,y_pred)

0.9470338983050848


array([[460,  25],
       [ 25, 434]])

> **Note**

Here we used the PassiveAggressiveClassifier, a supervised learning algorithm, to train our model, and then evaluated it on the test data. The accuracy is about 0.947 (i.e. 94.7%). In the confusion matrix, rows are true labels and columns are predicted labels, in the order FAKE, REAL: the model correctly identified 460 fake and 434 real articles and made 25 + 25 = 50 wrong predictions, out of 944 test articles.
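
The accuracy can be recomputed directly from the confusion matrix, which is a useful sanity check. The sketch below uses the four numbers printed above and assumes scikit-learn's default label order (sorted, so FAKE comes before REAL). It also derives precision and recall for the FAKE class as an optional extra.

```python
# Confusion matrix printed above: rows = true label, columns = predicted label
# Assumed label order: FAKE, REAL (scikit-learn sorts the labels by default)
matrix = [[460, 25],
          [25, 434]]

total = sum(sum(row) for row in matrix)
correct = matrix[0][0] + matrix[1][1]  # the diagonal holds the correct predictions
accuracy = correct / total

# Treating FAKE as the class of interest
fake_precision = matrix[0][0] / (matrix[0][0] + matrix[1][0])  # of all predicted FAKE
fake_recall = matrix[0][0] / (matrix[0][0] + matrix[0][1])     # of all truly FAKE

print(f"total={total} correct={correct}")
print(f"accuracy={accuracy:.4f} precision={fake_precision:.4f} recall={fake_recall:.4f}")
```

Here 894 of the 944 test articles are on the diagonal, which gives the same accuracy as `accuracy_score` reported.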


