# Problem Statement

# Fake News Classification with Natural Language Processing Techniques
Fake news detection is an active research topic in natural language processing.
We consume news through many channels every day, and it is often hard to tell which stories are authentic and which are fake. The goal here is to build a model that predicts whether a given news item is real or fake.

Project Flow:
    1. Problem Statement
    2. Data Gathering
    3. Data Preprocessing: operations applied to the raw text
        A. Tokenization
        B. Lowercasing
        C. Stopword Removal
        D. Lemmatization / Stemming
    4. Vectorization (convert text data into vectors):
        A. Bag of Words (CountVectorizer)
        B. TF-IDF
    5. Model Building:
        A. Model Object Initialization
        B. Train and Test Model
    6. Model Evaluation:
        A. Accuracy Score
        B. Confusion Matrix
        C. Classification Report
    7. Model Deployment
    8. Prediction on Client Data

## Required Libraries

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

## Data Gathering

In [4]:
df = pd.read_csv("train.csv")
df.head()

|   | id | title | author | text | label |
|---|----|-------|--------|------|-------|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


## Data Analysis

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
id        20800 non-null int64
title     20242 non-null object
author    18843 non-null object
text      20761 non-null object
label     20800 non-null int64
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [6]:
df['label'].value_counts() 

1    10413
0    10387
Name: label, dtype: int64

In [7]:
df.shape

(20800, 5)

In [8]:
df.isna().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [9]:
df = df.dropna()

In [10]:
df.isna().sum()

id        0
title     0
author    0
text      0
label     0
dtype: int64

In [11]:
df.shape

(18285, 5)

In [12]:
len(df)

18285

In [13]:
df = df.drop(["id",'text','author'],axis = 1)
df.head(10)

|    | title | label |
|----|-------|-------|
| 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | 0 |
| 2 | Why the Truth Might Get You Fired | 1 |
| 3 | 15 Civilians Killed In Single US Airstrike Hav... | 1 |
| 4 | Iranian woman jailed for fictional unpublished... | 1 |
| 5 | Jackie Mason: Hollywood Would Love Trump if He... | 0 |
| 7 | Benoît Hamon Wins French Socialist Party’s Pre... | 0 |
| 9 | A Back-Channel Plan for Ukraine and Russia, Co... | 0 |
| 10 | Obama’s Organizing for Action Partners with So... | 0 |
| 11 | BBC Comedy Sketch "Real Housewives of ISIS" Ca... | 0 |


In [14]:
df.reset_index(inplace=True)
df.head()

|   | index | title | label |
|---|-------|-------|-------|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | 1 |


## Data Preprocessing

### 1. Tokenization

In [15]:
sample_data = 'The quick brown fox jumps over the lazy dog'
sample_data = sample_data.split()
sample_data

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
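
`str.split()` only splits on whitespace, so on real titles punctuation stays attached to the neighbouring word. As a minimal sketch (the helper name `regex_tokenize` is hypothetical and not part of this notebook), a regular expression can pull out alphanumeric runs instead:

```python
import re

def regex_tokenize(text):
    """Return runs of letters and digits, dropping punctuation (illustrative helper)."""
    return re.findall(r"[A-Za-z0-9]+", text)

# A title taken from the dataset preview above.
title = "FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart"

# Whitespace splitting keeps ':' and ',' glued to words and keeps '-' as a token.
whitespace_tokens = title.split()
print(whitespace_tokens)

# The regex tokenizer returns only the alphanumeric pieces.
regex_tokens = regex_tokenize(title)
print(regex_tokens)
```

Which tokenizer is preferable depends on whether punctuation carries signal for the classifier; the notebook itself continues with whitespace splitting.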

### 2. Make Lowercase

In [16]:
sample_data = [data.lower() for data in sample_data]
print(sample_data)
len(sample_data)

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']


9

### 3. Remove Stopwords

In [17]:
stopword = stopwords.words('english')
print(stopword[0:10])
len(stopword)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


179

In [18]:
sample_data = [data for data in sample_data if data not in stopword]
print(sample_data)
len(sample_data)

['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']


6
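
Checking `data not in stopword` against a list scans the list for every token. A small sketch of the same filter using a set, which gives constant-time lookups on average. The stopword subset below is hand-written for illustration only; the notebook uses NLTK's full English list.

```python
# Hand-written subset for illustration; not NLTK's full list.
stopword_list = ['i', 'me', 'my', 'myself', 'we', 'our', 'the', 'over']

# Convert once, then reuse the set for every membership test.
stopword_set = set(stopword_list)

tokens = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
filtered = [t for t in tokens if t not in stopword_set]
print(filtered)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

The result is the same as with a list; only the lookup cost changes, which matters once the filter runs over thousands of titles.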

### 4. Stemming

In [19]:
ps = PorterStemmer()
sample_data_Stemming = [ps.stem(data) for data in sample_data]
print(sample_data_Stemming)

['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']


### 5. Lemmatization

In [20]:
lm = WordNetLemmatizer()
sample_data_lemma = [lm.lemmatize(data) for data in sample_data]
print(sample_data_lemma)

['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']


In [21]:
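wl = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build once; set lookups are fast
corpus = []
for i in range(len(df)):
    # NOTE: '^a-zA-Z0-9' has no brackets, so it is not a character class: it only
    # matches that literal text at the start of a title, and punctuation is kept.
    # '[^a-zA-Z0-9]' would strip it, but would also change the results recorded below.
    review = re.sub('^a-zA-Z0-9', " ", df['title'][i])
    review = review.lower()
    review = review.split()
    # Drop stopwords, then lemmatize the remaining tokens.
    review = [wl.lemmatize(x) for x in review if x not in stop_words]
    review = " ".join(review)
    corpus.append(review)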

In [22]:
len(corpus)

18285
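
The pattern passed to `re.sub` in the corpus loop is worth checking on a single title. The sketch below compares the pattern as written in the notebook with a bracketed, negated character class. The second pattern is an alternative shown for illustration; it is not what produced the results in this notebook.

```python
import re

title = "FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart"

# Without brackets, '^' anchors to the start of the string and the rest is
# literal text, so this only matches a title that starts with "a-zA-Z0-9".
as_written = re.sub('^a-zA-Z0-9', " ", title)

# Inside brackets, '^' negates the class: every character that is not a
# letter or digit is replaced with a space.
bracketed = re.sub('[^a-zA-Z0-9]', " ", title)

print(as_written == title)           # True: nothing was replaced
print(" ".join(bracketed.split()))   # FLYNN Hillary Clinton Big Woman on Campus Breitbart
```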

In [23]:
tf = TfidfVectorizer()
x = tf.fit_transform(corpus).toarray()
x

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
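
To make the numbers in this matrix less opaque, here is a pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each row. The function name `tfidf` and the toy corpus are assumptions for illustration, and tokenization is simplified to whitespace splitting, so values on real text will not match scikit-learn exactly.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return (vocabulary, rows) where rows are L2-normalised TF-IDF vectors."""
    docs = [doc.split() for doc in corpus]
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocab]
        # Scale the row to unit Euclidean length (guard against empty documents).
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Toy corpus, not the notebook's data.
toy_corpus = ["truth might get fired", "civilian killed airstrike", "truth killed"]
vocab, rows = tfidf(toy_corpus)
print(vocab)
print([round(w, 3) for w in rows[2]])
```

Most entries in each row are zero because a title contains only a handful of the vocabulary's terms, which is why the array printed above looks empty.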

In [24]:
y = df['label']
y.head()

0    1
1    0
2    1
3    1
4    1
Name: label, dtype: int64

In [25]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.3, random_state = 10, stratify = y)

In [26]:
len(x_train),len(y_train)

(12799, 12799)

In [27]:
len(x_test),len(y_test)

(5486, 5486)
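
The `stratify = y` argument asks for a split in which each class keeps roughly the same share in the training and test sets. The sketch below illustrates that idea in plain Python. It is not scikit-learn's implementation, and `stratified_split` and the toy labels are hypothetical names and data introduced only for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with each class split by the same fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 60 + [1] * 40   # toy labels, not the notebook's data
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=10)

print(len(train_idx), len(test_idx))                      # 70 30
print(sum(labels[i] for i in test_idx) / len(test_idx))   # 0.4, same share as the full set
```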

In [28]:
rf = RandomForestClassifier()
rf.fit(x_train,y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [29]:
class Evaluation:
    
    def __init__(self,model,x_train,x_test,y_train,y_test):
        self.x_train = x_train
        self.x_test = x_test
        self.y_train = y_train
        self.y_test = y_test
        self.model = model
        
    def train_evaluation(self):
        y_pred_train = self.model.predict(self.x_train)
        
        acc_scr_train = accuracy_score(self.y_train,y_pred_train )
        print("Accuracy Score On Training Data Set:",acc_scr_train)
        print()
        
        con_mat_train = confusion_matrix(self.y_train,y_pred_train )
        print("Confusion Matrix On Training Data Set:\n",con_mat_train)
        print()
        
        class_rep_train = classification_report(self.y_train,y_pred_train )
        print("Classification Report On Training Data Set:\n",class_rep_train)
        print()
        
        
    def test_evaluation(self):
        y_pred_test = self.model.predict(self.x_test)
        
        acc_scr_test = accuracy_score(self.y_test,y_pred_test )
        print("Accuracy Score On Testing Data Set:",acc_scr_test)
        print()
        
        con_mat_test = confusion_matrix(self.y_test,y_pred_test )
        print("Confusion Matrix On Testing Data Set:\n",con_mat_test)
        print()
        
        class_rep_test = classification_report(self.y_test,y_pred_test )
        print("Classification Report On Testing Data Set:\n",class_rep_test)
        print()

In [30]:
Evaluation(rf,x_train,x_test,y_train,y_test).train_evaluation()

Accuracy Score On Training Data Set: 0.997656066880225

Confusion Matrix On Training Data Set:
 [[7233   19]
 [  11 5536]]

Classification Report On Training Data Set:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      7252
           1       1.00      1.00      1.00      5547

    accuracy                           1.00     12799
   macro avg       1.00      1.00      1.00     12799
weighted avg       1.00      1.00      1.00     12799




In [31]:
Evaluation(rf,x_train,x_test,y_train,y_test).test_evaluation()

Accuracy Score On Testing Data Set: 0.9267225665329931

Confusion Matrix On Testing Data Set:
 [[2865  244]
 [ 158 2219]]

Classification Report On Testing Data Set:
               precision    recall  f1-score   support

           0       0.95      0.92      0.93      3109
           1       0.90      0.93      0.92      2377

    accuracy                           0.93      5486
   macro avg       0.92      0.93      0.93      5486
weighted avg       0.93      0.93      0.93      5486
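
The accuracy, precision, and recall in this report can be recomputed by hand from the confusion matrix, which is a useful sanity check. A minimal sketch using the test-set matrix printed above; the helper names `precision` and `recall` are introduced here for illustration.

```python
# Test-set confusion matrix for the random forest, copied from the output above.
# Rows are true labels (0, 1); columns are predicted labels (0, 1).
cm = [[2865, 244],
      [158, 2219]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def precision(cm, k):
    """Correct predictions of class k divided by all predictions of class k."""
    return cm[k][k] / sum(row[k] for row in cm)

def recall(cm, k):
    """Correct predictions of class k divided by all true members of class k."""
    return cm[k][k] / sum(cm[k])

print(round(accuracy, 4))   # 0.9267
for k in (0, 1):
    print(k, round(precision(cm, k), 2), round(recall(cm, k), 2))
```

The rounded values agree with the classification report: 0.95 / 0.92 for class 0 and 0.90 / 0.93 for class 1.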




In [32]:
from sklearn.tree import DecisionTreeClassifier

In [33]:
dt = DecisionTreeClassifier()
dt.fit(x_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [34]:
Evaluation(dt,x_train,x_test,y_train,y_test).train_evaluation()

Accuracy Score On Training Data Set: 1.0

Confusion Matrix On Training Data Set:
 [[7252    0]
 [   0 5547]]

Classification Report On Training Data Set:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      7252
           1       1.00      1.00      1.00      5547

    accuracy                           1.00     12799
   macro avg       1.00      1.00      1.00     12799
weighted avg       1.00      1.00      1.00     12799




In [35]:
Evaluation(dt,x_train,x_test,y_train,y_test).test_evaluation()

Accuracy Score On Testing Data Set: 0.9196135617936566

Confusion Matrix On Testing Data Set:
 [[2839  270]
 [ 171 2206]]

Classification Report On Testing Data Set:
               precision    recall  f1-score   support

           0       0.94      0.91      0.93      3109
           1       0.89      0.93      0.91      2377

    accuracy                           0.92      5486
   macro avg       0.92      0.92      0.92      5486
weighted avg       0.92      0.92      0.92      5486




## Prediction on New Data

In [36]:
class Preprocessing:
    """Clean titles with the same steps used to build the training corpus."""

    def __init__(self, data):
        self.data = data
        # Build the stopword set once instead of reloading the list for every word.
        self.stop_words = set(stopwords.words('english'))

    def _clean(self, text):
        # NOTE: '^a-zA-Z0-9' is not a character class (no brackets), so punctuation
        # is kept. It is left unchanged because preprocessing at prediction time
        # must match the preprocessing used when the vectorizer was fitted.
        review = re.sub('^a-zA-Z0-9', " ", text)
        review = review.lower().split()
        review = [wl.lemmatize(x) for x in review if x not in self.stop_words]
        return " ".join(review)

    def text_preprocessing(self):
        """Preprocess every title in a DataFrame that has a 'title' column."""
        return [self._clean(self.data['title'][i]) for i in range(len(self.data))]

    def text_preprocessing_pred(self):
        """Preprocess a single raw string; returns a one-element list."""
        return [self._clean(self.data)]

In [37]:
preprocessed_data = Preprocessing(df).text_preprocessing()
preprocessed_data[1]

'flynn: hillary clinton, big woman campus - breitbart'

In [43]:
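class Prediction:
    
    def __init__(self, pred_data):
        self.pred_data = pred_data
    
    def prediction_model(self):
        # Apply the training-time preprocessing, then reuse the fitted vectorizer.
        preprocessed_data = Preprocessing(self.pred_data).text_preprocessing_pred()
        data = tf.transform(preprocessed_data)
        prediction = rf.predict(data)
        
        # NOTE: this mapping assumes label 0 means fake; the notebook never states
        # the label meaning. If train.csv is the Kaggle Fake News competition file,
        # its description marks 1 as unreliable and 0 as reliable, which would
        # invert this mapping. Verify against the dataset's documentation.
        if prediction[0] == 0:
            return "The News is Fake"
        else:
            return "The News is Real"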

In [39]:
data = 'Flynn: hillary clinton, big woman campus - breitbart'
Preprocessing(data).text_preprocessing_pred()

['flynn: hillary clinton, big woman campus - breitbart']

In [40]:
df['title'][1]

'FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart'

In [44]:
data = 'Flynn: hillary clinton big woman campus - breitbart'
Prediction(data).prediction_model()

'The News is Fake'