# <center><h1> Fake News Classification </h1></center>

### Fake News Classification is an essential task in today's information-driven world. We have to build a model to classify news articles as reliable or potentially unreliable based on the given dataset.

**train.csv**: A full training dataset with the following attributes:

 * id: unique id for a news article
 * title: the title of a news article
 * author: author of the news article
 * text: the text of the article; could be incomplete
 * label: a label that marks the article as potentially unreliable
   * 1: unreliable
   * 0: reliable

**test.csv**: A test dataset with the same attributes but without the label; the trained model is used to predict a label for each of its articles.


**The steps involved are:**
1. Importing Libraries and Data to be used                         
2. Data Preprocessing                        
3. Model Selection and Evaluation                      
4. Predictions on Test Data                     

The dataset is available on Kaggle: https://www.kaggle.com/competitions/fake-news/data

## Step 1. Importing Libraries and Data to be used

In [1]:
import numpy as np #linear algebra
import pandas as pd # data preprocessing

import nltk # natural language toolkit (NLP tasks)
import re # regular expression
from nltk.corpus import stopwords
nltk.download('stopwords')

# importing tensorflow packages
from tensorflow.keras.layers import Embedding, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, GRU, Bidirectional

# import sklearn packages
from sklearn.metrics import classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.naive_bayes import MultinomialNB

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Tushar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# importing train and test data using pd.read_csv
train = pd.read_csv('FakeNews_train.csv')
test = pd.read_csv('FakeNews_test.csv')
train

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1
...,...,...,...,...,...
20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...,0
20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...,0
20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...,0
20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal...",1


In [4]:
# shapes of train and test datasets
train.shape, test.shape

((20800, 5), (5200, 4))

In [5]:
# checking the null values in train dataset
train.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [6]:
# checking the null values in test dataset
test.isnull().sum()

id          0
title     122
author    503
text        7
dtype: int64

In [7]:
# Handling null values by replacing them with a single space
train = train.fillna(" ")
test = test.fillna(" ")
train.isnull().sum(), test.isnull().sum()

(id        0
 title     0
 author    0
 text      0
 label     0
 dtype: int64,
 id        0
 title     0
 author    0
 text      0
 dtype: int64)

In [11]:
# Creating a variable "merged" by merging columns "title" and "author"
train["merged"] = train["title"]+" "+train["author"]
test["merged"]  = test["title"]+" "+test["author"]

In [12]:
# Selecting the feature and target columns
X = train.drop(['label'],axis=1) # feature column selection
y = train['label'] # target column
X.shape, y.shape

((20800, 5), (20800,))

In [13]:
# Copying the Columns for pre-processing
messages_train = X.copy()
messages_train.reset_index(inplace=True)
messages_test = test.copy()
messages_test.reset_index(inplace=True)

## Step 2. Data Preprocessing

In Data Preprocessing, we follow the steps below:
1. All characters except English letters are removed from the string (replaced with spaces).
2. The string is converted to lower case so that the same word in different cases is treated as one token.
3. Each sentence is tokenized into words, and English stopwords are dropped.
4. Each remaining word is stemmed (reduced to its root form) to shrink the vocabulary.
5. The words are joined back together and stored in "train_corpus" and "test_corpus".
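
As a rough illustration of steps 1 to 3 and 5 on a single string, the sketch below uses only the standard library. The tiny `STOP_WORDS` set and the `clean_text` helper are assumptions made for this example; the notebook itself uses NLTK's English stopword list and additionally applies Porter stemming, which is omitted here.

```python
import re

# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer.
STOP_WORDS = {"the", "you", "why", "a", "an", "of", "to", "in"}

def clean_text(text):
    """Keep letters only, lower-case, split into words, drop stopwords, re-join."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    words = letters_only.lower().split()
    kept = [w for w in words if w not in STOP_WORDS]
    return ' '.join(kept)

print(clean_text("Why the Truth Might Get You Fired (2016)"))
```

Digits and punctuation disappear in the first step, which is why the year in the example string does not survive.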

In [15]:
# Performing data preprocessing on column 'merged' (title + author)
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
# build the stopword set once; a set lookup is much faster than re-reading the list for every word
stop_words = set(stopwords.words('english'))
def perform_preprocess(data):
    
    corpus = []
    for i in range(0,len(data)):
        review = re.sub('[^a-zA-Z]',' ',data['merged'][i])  # keep English letters only
        review = review.lower()
        review = review.split()
        review = [ps.stem(word) for word in review if word not in stop_words]
        review = ' '.join(review)
        corpus.append(review)
    return corpus
    
# applying the perform_preprocess function to train and test datasets
train_corpus = perform_preprocess(messages_train)
test_corpus  = perform_preprocess(messages_test)
train_corpus[1]

'flynn hillari clinton big woman campu breitbart daniel j flynn'

In [16]:
test_corpus[1]

'russian warship readi strike terrorist near aleppo'

In [18]:
# converting each word to an integer index below vocab_size with one_hot (hash-based, so different words can share an index)
vocab_size = 5000
one_hot_train = [one_hot(words, vocab_size) for words in train_corpus]
one_hot_test = [one_hot(words, vocab_size) for words in test_corpus]
one_hot_train[1]

[2007, 4763, 4981, 2752, 1831, 78, 2648, 2509, 3723, 2007]

In [19]:
one_hot_test[1]

[2, 244, 2393, 3167, 2639, 1282, 4157]
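
Despite its name, `one_hot` does not build one-hot vectors; it hashes each word to an integer index, so a repeated word gets the same index (note `flynn` mapping to 2007 twice above) and unrelated words can collide. The sketch below illustrates that idea with a stable MD5 hash. The helper `hash_encode` and the exact index formula are assumptions for illustration, not the Keras implementation, so the indices it produces will differ from the ones above.

```python
import hashlib

def hash_encode(sentence, vocab_size):
    """Map each word to an index in 1..vocab_size-1 using a stable hash (0 is left free for padding)."""
    indices = []
    for word in sentence.split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (vocab_size - 1) + 1)
    return indices

encoded = hash_encode("flynn hillari clinton big woman campu breitbart daniel j flynn", 5000)
print(encoded[0] == encoded[-1])  # the repeated word "flynn" gets the same index
```

A larger `vocab_size` lowers the chance of collisions at the cost of a larger embedding table later on.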

#### *The code below applies "pre" padding to the encoded sequences so that each has length 20: shorter sequences are left-padded with zeros and longer ones are truncated.*
#### *Padding is applied so that every sequence in the dataset has the same length.*

### Embedding Representation

In [22]:
sent_length = 20
embedded_docs_train = pad_sequences(one_hot_train,padding="pre",maxlen = sent_length)
embedded_docs_test = pad_sequences(one_hot_test,padding="pre",maxlen = sent_length)
embedded_docs_train[1]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0, 2007,
       4763, 4981, 2752, 1831,   78, 2648, 2509, 3723, 2007])

In [23]:
embedded_docs_test[1]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    2,  244, 2393, 3167, 2639, 1282, 4157])
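
The padded arrays above can be reproduced by hand. The sketch below is a pure-Python stand-in for what `pad_sequences` does with `padding="pre"` and its default truncation, which keeps the last `maxlen` items; `pad_pre` is a hypothetical helper, not part of Keras.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value` up to `maxlen`; keep only the last `maxlen` items if longer."""
    trimmed = sequence[-maxlen:]  # assumes maxlen > 0
    return [value] * (maxlen - len(trimmed)) + trimmed

encoded_title = [2, 244, 2393, 3167, 2639, 1282, 4157]  # one_hot_test[1] from above
padded = pad_pre(encoded_title, 20)
print(padded)
```

With "pre" padding the real tokens sit at the end of the sequence, so they are the last thing a recurrent layer reads.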

In [24]:
# converting the padded sequences and the labels to NumPy arrays
X_train_final = np.array(embedded_docs_train)
X_test_final = np.array(embedded_docs_test)
y_final = np.array(y)

In [25]:
X_train_final.shape, X_test_final.shape, y_final.shape

((20800, 20), (5200, 20), (20800,))

### Dividing the data into training set, validation set and testing set (about 81/9/10) using train_test_split

In [30]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X_train_final, y_final, test_size=0.1, random_state=42, stratify=y_final)

In [31]:
X_train, x_valid, Y_train, y_valid = train_test_split(x_train, y_train, test_size=0.1, random_state=42)
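
Because the second split takes 10% of what remains after the first, the final proportions are roughly 81% / 9% / 10% rather than 80/10/10. The arithmetic below assumes each split yields exactly one tenth of its input, which matches the 2080 test rows shown in the classification reports.

```python
n_total = 20800                     # rows in the training file

n_test = n_total // 10              # first split: 10% held out for testing
n_remaining = n_total - n_test
n_valid = n_remaining // 10         # second split: 10% of the remainder for validation
n_train = n_remaining - n_valid

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_total:.0%})")
```

Passing `stratify=y_train` to the second `train_test_split` call as well would keep the class balance in the validation set, as the first call already does for the test set.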

## Step 3. Model Selection and Evaluation

### 3.1 Logistic Regression

In [43]:
model_logistic = LogisticRegression(max_iter=900)
model_logistic.fit(X_train,Y_train)
pred_logistic = model_logistic.predict(x_test)
cr_logistic = classification_report(y_test,pred_logistic)
print(cr_logistic)

              precision    recall  f1-score   support

           0       0.72      0.77      0.74      1039
           1       0.75      0.70      0.72      1041

    accuracy                           0.73      2080
   macro avg       0.73      0.73      0.73      2080
weighted avg       0.73      0.73      0.73      2080
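
For reference, the columns of the report are derived from counts of correct and incorrect predictions per class. The sketch below computes precision, recall and F1 for one class from two short label lists; the lists are made-up illustrative data, not results from this notebook, and `class_metrics` is a hypothetical helper.

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 for the class `positive`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = class_metrics(y_true, y_pred, positive=1)
print(precision, recall, f1)
```

The `support` column is simply the number of true examples of each class in the evaluated split.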



### 3.2 Naive Bayes

In [44]:
model_nb = MultinomialNB()
model_nb.fit(X_train,Y_train)
pred_nb = model_nb.predict(x_test)
cr_nb = classification_report(y_test,pred_nb)
print(cr_nb)

              precision    recall  f1-score   support

           0       0.70      0.60      0.65      1039
           1       0.65      0.75      0.69      1041

    accuracy                           0.67      2080
   macro avg       0.68      0.67      0.67      2080
weighted avg       0.68      0.67      0.67      2080



### 3.3 Decision Tree

In [45]:
model_dtree = DecisionTreeClassifier()
model_dtree.fit(X_train,Y_train)
pred_dtree = model_dtree.predict(x_test)
cr_dtree = classification_report(y_test,pred_dtree)
print(cr_dtree)

              precision    recall  f1-score   support

           0       0.89      0.92      0.90      1039
           1       0.92      0.89      0.90      1041

    accuracy                           0.90      2080
   macro avg       0.90      0.90      0.90      2080
weighted avg       0.90      0.90      0.90      2080



### 3.4 Random Forests

In [46]:
model_rf = RandomForestClassifier()
model_rf.fit(X_train,Y_train)
pred_rf = model_rf.predict(x_test)
cr_rf = classification_report(y_test,pred_rf)
print(cr_rf)

              precision    recall  f1-score   support

           0       0.96      0.86      0.91      1039
           1       0.88      0.97      0.92      1041

    accuracy                           0.91      2080
   macro avg       0.92      0.91      0.91      2080
weighted avg       0.92      0.91      0.91      2080



### 3.5 XGBoost

In [47]:
model_xgb = XGBClassifier()
model_xgb.fit(X_train,Y_train)
pred_xgb = model_xgb.predict(x_test)
cr_xgb = classification_report(y_test,pred_xgb)
print(cr_xgb)

              precision    recall  f1-score   support

           0       0.99      0.98      0.98      1039
           1       0.98      0.99      0.98      1041

    accuracy                           0.98      2080
   macro avg       0.98      0.98      0.98      2080
weighted avg       0.98      0.98      0.98      2080



### 3.6 CatBoost

In [49]:
model_cb = CatBoostClassifier(iterations=200)
model_cb.fit(X_train,Y_train)
pred_cb = model_cb.predict(x_test)

Learning rate set to 0.150531
0:	learn: 0.5575632	total: 22.7ms	remaining: 4.52s
1:	learn: 0.5069522	total: 39.1ms	remaining: 3.87s
2:	learn: 0.4647746	total: 56.5ms	remaining: 3.71s
3:	learn: 0.4264419	total: 72.4ms	remaining: 3.55s
4:	learn: 0.3854707	total: 85.7ms	remaining: 3.34s
5:	learn: 0.3689098	total: 98ms	remaining: 3.17s
6:	learn: 0.3519079	total: 109ms	remaining: 2.99s
7:	learn: 0.3428304	total: 119ms	remaining: 2.85s
8:	learn: 0.3369153	total: 127ms	remaining: 2.69s
9:	learn: 0.3277433	total: 136ms	remaining: 2.58s
10:	learn: 0.3230840	total: 144ms	remaining: 2.48s
11:	learn: 0.3172048	total: 152ms	remaining: 2.39s
12:	learn: 0.3115770	total: 161ms	remaining: 2.31s
13:	learn: 0.3053414	total: 169ms	remaining: 2.25s
14:	learn: 0.2973727	total: 178ms	remaining: 2.2s
15:	learn: 0.2909536	total: 225ms	remaining: 2.58s
16:	learn: 0.2871469	total: 233ms	remaining: 2.51s
17:	learn: 0.2831973	total: 242ms	remaining: 2.45s
18:	learn: 0.2801086	total: 251ms	remaining: 2.4s
...

In [50]:
cr_cb = classification_report(y_test,pred_cb)
print(cr_cb)

              precision    recall  f1-score   support

           0       0.99      0.97      0.98      1039
           1       0.97      0.99      0.98      1041

    accuracy                           0.98      2080
   macro avg       0.98      0.98      0.98      2080
weighted avg       0.98      0.98      0.98      2080



### 3.7 LSTM

In this model, we follow the steps below:
1. Set embedding_feature_vector = 40, the size of the vector that the embedding layer learns for each word index
2. Add an LSTM layer with 100 units
3. Since this is binary classification, use a Dense output layer with a single neuron and sigmoid activation
4. Add a Dropout layer after the embedding layer and after the LSTM layer to reduce overfitting
5. Compile the model with "binary_crossentropy" as the loss, the 'adam' optimizer, and "accuracy" as the metric
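
Steps 3 and 5 can be made concrete with a little arithmetic: the sigmoid squashes the final layer's output into a probability, and binary cross-entropy penalizes that probability according to the true label. The functions below are a plain-Python sketch of those two formulas for a single example, not Keras code; their names are chosen for this illustration.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p):
    """Loss for one example with true label y_true (0 or 1) and predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

p = sigmoid(2.0)                    # a fairly confident "unreliable" prediction
print(binary_crossentropy(1, p))    # small loss when the true label is 1
print(binary_crossentropy(0, p))    # larger loss when the true label is 0
```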

In [51]:
# Creating LSTM for Prediction
embedding_feature_vector = 40
model_lstm = Sequential()
model_lstm.add(Embedding(vocab_size,embedding_feature_vector,input_length = sent_length))
model_lstm.add(Dropout(0.3))
model_lstm.add(LSTM(100))
model_lstm.add(Dropout(0.3))
model_lstm.add(Dense(1,activation = 'sigmoid'))
model_lstm.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model_lstm.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 20, 40)            200000    
                                                                 
 dropout (Dropout)           (None, 20, 40)            0         
                                                                 
 lstm (LSTM)                 (None, 100)               56400     
                                                                 
 dropout_1 (Dropout)         (None, 100)               0         
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 256,501
Trainable params: 256,501
Non-trainable params: 0
_________________________________________________________________
None
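
The parameter counts in the summary can be verified by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer.

```python
vocab_size = 5000
embedding_dim = 40
lstm_units = 100

embedding_params = vocab_size * embedding_dim
# 4 gates, each with weights for the input and the previous hidden state, plus a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
dense_params = lstm_units * 1 + 1   # one weight per LSTM unit plus one bias

print(embedding_params, lstm_params, dense_params)
print(embedding_params + lstm_params + dense_params)
```

Most of the parameters sit in the embedding table, so `vocab_size` and the embedding dimension dominate the model size.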


In [52]:
# fitting the model
model_lstm.fit(X_train,Y_train,validation_data=(x_valid,y_valid),epochs=10,batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1d22d3c9e50>

In [64]:
probabilities = model_lstm.predict(x_test)

# Convert probabilities to class labels
pred_lstm = np.round(probabilities).astype(int)

# Generate and print the classification report
cr_lstm = classification_report(y_test, pred_lstm)
print(cr_lstm)

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1039
           1       0.99      0.99      0.99      1041

    accuracy                           0.99      2080
   macro avg       0.99      0.99      0.99      2080
weighted avg       0.99      0.99      0.99      2080



# <center><h3> Model Evaluation </h3></center>

In [61]:
# creating a dataframe and storing the model with their accuracy scores
score_logistic = accuracy_score(y_test,pred_logistic)
score_nb = accuracy_score(y_test, pred_nb)
score_dtree = accuracy_score(y_test,pred_dtree)
score_rf = accuracy_score(y_test,pred_rf)
score_xgboost = accuracy_score(y_test,pred_xgb)
score_catboost = accuracy_score(y_test,pred_cb)
score_lstm = accuracy_score(y_test,pred_lstm)

Results = pd.DataFrame([['Logistic Regression',score_logistic],['Naive Bayes',score_nb],['Decision Tree',score_dtree],
                       ['Random Forest',score_rf],['XGBoost',score_xgboost],['CatBoost',score_catboost],['LSTM',score_lstm]])
Results

Unnamed: 0,0,1
0,Logistic Regression,0.732692
1,Naive Bayes,0.672115
2,Decision Tree,0.902885
3,Random Forest,0.914904
4,XGBoost,0.984135
5,CatBoost,0.98125
6,LSTM,0.989904


### From the above results, LSTM has the highest accuracy on the held-out test split (about 0.99) among all the models. Therefore, we select it as the final model for making predictions on the test data.
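
The selection can also be made programmatically. The sketch below copies the accuracy values from the table above into a plain list and picks the highest one; in the notebook the same ranking could be done directly on the `Results` dataframe.

```python
# Accuracy values copied from the results table above
results = [
    ("Logistic Regression", 0.732692),
    ("Naive Bayes", 0.672115),
    ("Decision Tree", 0.902885),
    ("Random Forest", 0.914904),
    ("XGBoost", 0.984135),
    ("CatBoost", 0.98125),
    ("LSTM", 0.989904),
]

# Rank models from most to least accurate
ranked = sorted(results, key=lambda row: row[1], reverse=True)
best_model, best_accuracy = ranked[0]
print(best_model, best_accuracy)
```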

## Step 4. Predictions on Test Data

In [71]:
# making prediction on Test Data
prob_test = pd.DataFrame(model_lstm.predict(X_test_final))
predictions_final = np.round(prob_test).astype(int)
test_id = pd.DataFrame(test['id'])
Submit = pd.concat([test_id,predictions_final],axis=1)
Submit.columns = ['id','Label']
Submit.to_csv('Submit.csv', index=False)



In [72]:
Submit.head()

Unnamed: 0,id,Label
0,20800,0
1,20801,1
2,20802,1
3,20803,0
4,20804,1


In [73]:
Submit.shape

(5200, 2)

# <center><h1> Thank You </h1></center>