# Fake News Prediction Using NLP

1. id: unique id for a news article
2. title: the title of a news article
3. author: author of the news article
4. text: the text of the article; could be incomplete
5. label: a label that marks the article as potentially unreliable
   - 1: unreliable
   - 0: reliable

### Import Libraries

In [34]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Import Dataset

In [35]:
df = pd.read_csv('Dataset/train.csv')
df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


## Data Analysis

In [36]:
# find the number of null values
df.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [37]:
# check the number of records so the percentage of missing values can be determined
df.shape

(20800, 5)
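
With 20800 records, the null counts above work out to roughly 2.68% missing titles (558), 9.41% missing authors (1957) and 0.19% missing texts (39). A minimal sketch of how these percentages could be computed directly is shown below; the `toy` frame is a hypothetical stand-in for `df`, not the real dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the notebook's df
toy = pd.DataFrame({
    "title": ["a", None, "c", "d"],
    "author": [None, None, "x", "y"],
    "text": ["t1", "t2", "t3", "t4"],
})

# isnull() gives a boolean frame; the column mean is the fraction of missing values
missing_pct = toy.isnull().mean() * 100
print(missing_pct.round(2))
```

On the real data, the same expression would be applied to `df` instead of `toy`.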

In [38]:
# stats info
df.describe()

Unnamed: 0,id,label
count,20800.0,20800.0
mean,10399.5,0.500625
std,6004.587135,0.500012
min,0.0,0.0
25%,5199.75,0.0
50%,10399.5,1.0
75%,15599.25,1.0
max,20799.0,1.0


In [39]:
# get the info of the text data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [40]:
# fill in the missing values with empty strings, since median or mean imputation cannot be applied to text data
no_null_dataset = df.fillna('')
no_null_dataset.isnull().sum()

id        0
title     0
author    0
text      0
label     0
dtype: int64
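
Filling the gaps before combining columns matters because string concatenation with a missing value produces a missing value. The sketch below illustrates this on a hypothetical two-row frame (`toy` and its contents are made up for illustration).

```python
import pandas as pd

# Hypothetical rows: one with a missing author, one with a missing title
toy = pd.DataFrame({
    "title": ["Headline A", None],
    "author": [None, "Jane Doe"],
})

# Without filling, any missing part makes the whole concatenation missing
raw = toy["title"] + " " + toy["author"]

# After filling with empty strings, the available part survives
filled = toy.fillna("")
combined = filled["title"] + " " + filled["author"]

# Optional: trim the stray space left where one part was empty
stripped = combined.str.strip()
print(raw.isna().tolist())
print(stripped.tolist())
```

Stripping the result is an optional extra step that the notebook does not perform.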

In [41]:
# merge the title and author columns into a single 'Title Author' column
no_null_dataset['Title Author'] = no_null_dataset['title'] + ' '+ no_null_dataset['author']
no_null_dataset['Title Author']

0        House Dem Aide: We Didn’t Even See Comey’s Let...
1        FLYNN: Hillary Clinton, Big Woman on Campus - ...
2        Why the Truth Might Get You Fired Consortiumne...
3        15 Civilians Killed In Single US Airstrike Hav...
4        Iranian woman jailed for fictional unpublished...
                               ...                        
20795    Rapper T.I.: Trump a ’Poster Child For White S...
20796    N.F.L. Playoffs: Schedule, Matchups and Odds -...
20797    Macy’s Is Said to Receive Takeover Approach by...
20798    NATO, Russia To Hold Parallel Exercises In Bal...
20799              What Keeps the F-35 Alive David Swanson
Name: Title Author, Length: 20800, dtype: object

In [42]:
no_null_dataset.head()

Unnamed: 0,id,title,author,text,label,Title Author
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1,House Dem Aide: We Didn’t Even See Comey’s Let...
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0,"FLYNN: Hillary Clinton, Big Woman on Campus - ..."
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1,Why the Truth Might Get You Fired Consortiumne...
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1,15 Civilians Killed In Single US Airstrike Hav...
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1,Iranian woman jailed for fictional unpublished...


In [44]:
# drop the id column (title and author are dropped in the next cell)
no_null_dataset = no_null_dataset.drop('id', axis = 1)
no_null_dataset.head()

Unnamed: 0,title,author,text,label,Title Author
0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1,House Dem Aide: We Didn’t Even See Comey’s Let...
1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0,"FLYNN: Hillary Clinton, Big Woman on Campus - ..."
2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1,Why the Truth Might Get You Fired Consortiumne...
3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1,15 Civilians Killed In Single US Airstrike Hav...
4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1,Iranian woman jailed for fictional unpublished...


In [45]:
no_null_dataset = no_null_dataset.drop('title', axis = 1)
no_null_dataset = no_null_dataset.drop('author', axis = 1)
no_null_dataset.head()

Unnamed: 0,text,label,Title Author
0,House Dem Aide: We Didn’t Even See Comey’s Let...,1,House Dem Aide: We Didn’t Even See Comey’s Let...
1,Ever get the feeling your life circles the rou...,0,"FLYNN: Hillary Clinton, Big Woman on Campus - ..."
2,"Why the Truth Might Get You Fired October 29, ...",1,Why the Truth Might Get You Fired Consortiumne...
3,Videos 15 Civilians Killed In Single US Airstr...,1,15 Civilians Killed In Single US Airstrike Hav...
4,Print \nAn Iranian woman has been sentenced to...,1,Iranian woman jailed for fictional unpublished...


In [46]:
no_null_dataset = no_null_dataset[["Title Author", "text", "label"]]
no_null_dataset.head()

Unnamed: 0,Title Author,text,label
0,House Dem Aide: We Didn’t Even See Comey’s Let...,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Ever get the feeling your life circles the rou...,0
2,Why the Truth Might Get You Fired Consortiumne...,"Why the Truth Might Get You Fired October 29, ...",1
3,15 Civilians Killed In Single US Airstrike Hav...,Videos 15 Civilians Killed In Single US Airstr...,1
4,Iranian woman jailed for fictional unpublished...,Print \nAn Iranian woman has been sentenced to...,1


In [47]:
no_null_dataset.tail()

Unnamed: 0,Title Author,text,label
20795,Rapper T.I.: Trump a ’Poster Child For White S...,Rapper T. I. unloaded on black celebrities who...,0
20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",When the Green Bay Packers lost to the Washing...,0
20797,Macy’s Is Said to Receive Takeover Approach by...,The Macy’s of today grew from the union of sev...,0
20798,"NATO, Russia To Hold Parallel Exercises In Bal...","NATO, Russia To Hold Parallel Exercises In Bal...",1
20799,What Keeps the F-35 Alive David Swanson,"David Swanson is an author, activist, journa...",1


### Clean the Dataset into a Corpus

In [48]:
import re
import nltk
nltk.download('stopwords')
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# list that will store the cleaned text of every article
corpus = []  # processed text to be used by the classification model
ps = PorterStemmer()
all_stopwords = stopwords.words('english')  # English stopword list
all_stopwords.remove('not')  # keep 'not', since negation can change the meaning of a sentence
all_stopwords = set(all_stopwords)  # build the set once; membership tests on a set are fast
# iterate through the text of every article
for i in range(len(no_null_dataset)):
    # replace every character that is not a letter with a space
    text = re.sub('[^a-zA-Z]', ' ', no_null_dataset['text'][i])
    text = text.lower()  # convert to lowercase
    text = text.split()  # split on whitespace into a list of words
    # keep only the words that are not stopwords, and reduce each one to its stem
    text = [ps.stem(word) for word in text if word not in all_stopwords]
    text = ' '.join(text)  # join the stemmed words back into a single string
    corpus.append(text)  # add the cleaned text to the corpus
    

[nltk_data] Downloading package stopwords to C:\Users\Cash
[nltk_data]     Crusaders\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
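
The same cleaning steps are repeated later for every new text, so they could be wrapped in one function. The sketch below is a hypothetical helper (`clean_text` is not part of the notebook). To stay runnable without downloads, it takes the stopword set and the stemming function as parameters and uses a tiny made-up stopword set with no stemming. In the notebook one would pass the NLTK stopword list without `'not'` and `PorterStemmer().stem`.

```python
import re

def clean_text(text, stop_words, stem=lambda word: word):
    """Keep letters only, lowercase, drop stopwords, stem, and re-join."""
    text = re.sub('[^a-zA-Z]', ' ', text)   # non-letters become spaces
    words = text.lower().split()            # lowercase, split on whitespace
    kept = [stem(w) for w in words if w not in stop_words]
    return ' '.join(kept)

# Hypothetical tiny stopword set, for illustration only
demo_stop_words = {"the", "is", "a"}
cleaned = clean_text("The car is NOT red!", demo_stop_words)
print(cleaned)
```

Keeping the preprocessing in one place makes it less likely that the training corpus and new texts are cleaned in slightly different ways.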


### Create the TF-IDF Features

In [50]:
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer() # converts the text data into TF-IDF weighted numerical features
X = cv.fit_transform(corpus)
y = no_null_dataset.iloc[:, -1].values
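
To make the weighting concrete, here is a simplified pure-Python sketch of TF-IDF as scikit-learn computes it with default settings: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The `tfidf` function and the two toy documents are hypothetical, and the tokenisation is a plain whitespace split rather than the vectorizer's own tokenizer.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, idf weights, L2-normalised TF-IDF rows)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(tokenized)
    # document frequency: number of documents containing each word
    df = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, idf, rows

# Hypothetical, already-cleaned toy documents
vocab, idf, rows = tfidf(["trump said not true", "clinton said true"])
print(vocab)
print({w: round(v, 3) for w, v in idf.items()})
```

Words that appear in every document, such as `said` here, get the minimum IDF of 1, while words that appear in only one document get a larger weight.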

### Splitting the dataset into training and testing set

In [51]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

### Building The Model

In [52]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

LogisticRegression(random_state=0)
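
Note that the vectorizer above is fitted on the whole corpus before the split, so the vocabulary and IDF weights also reflect the test articles. One possible alternative, sketched here on a hypothetical toy dataset, is to split the raw texts first and chain the vectorizer and classifier in a scikit-learn pipeline, so that the vectorizer is fitted on the training part only. The texts, labels, and split sizes are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned texts with made-up labels (1: unreliable, 0: reliable)
texts = [
    "shock secret cure doctor hate", "senat pass budget bill vote",
    "alien land secret base shock", "court rule tax law case",
    "miracl cure secret reveal", "minist announc budget plan",
    "shock truth hidden alien", "council vote road plan",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw texts first, keeping the class balance in both parts
train_txt, test_txt, train_y, test_y = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then the classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression(random_state=0))
model.fit(train_txt, train_y)
preds = model.predict(test_txt)
print(preds)
```

A fitted pipeline also accepts raw cleaned strings directly, so a separate `transform` call is not needed when predicting new texts.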

### Predict the test set using the model

In [54]:
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[0 0]
 [1 1]
 [1 1]
 ...
 [0 0]
 [1 1]
 [1 1]]


### Evaluate the accuracy of the model

In [55]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[1899  147]
 [ 100 2014]]


0.940625

In [57]:
print(classification_report(y_test, y_pred, labels=[0, 1]))

              precision    recall  f1-score   support

           0       0.95      0.93      0.94      2046
           1       0.93      0.95      0.94      2114

    accuracy                           0.94      4160
   macro avg       0.94      0.94      0.94      4160
weighted avg       0.94      0.94      0.94      4160
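
The figures in the report can be checked by hand from the confusion matrix printed earlier. In scikit-learn's layout, rows are the true labels and columns are the predicted labels. The sketch below recomputes accuracy, and precision and recall for class 1, from those four counts; the variable names are illustrative.

```python
# Confusion matrix from the notebook output: rows = true label, columns = predicted label
cm = [[1899, 147],
      [100, 2014]]

tn, fp = cm[0]  # true 0 predicted 0, true 0 predicted 1
fn, tp = cm[1]  # true 1 predicted 0, true 1 predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)  # of everything predicted unreliable, how much really was
recall_1 = tp / (tp + fn)     # of everything truly unreliable, how much was found

print(accuracy, round(precision_1, 2), round(recall_1, 2))
```

These values agree with the accuracy of 0.940625 and with the 0.93 precision and 0.95 recall reported for class 1.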



### Predicting an unseen text to classify it as Fake or Not

### Let's start by predicting a real text

In [65]:
new_text_1 = 'A car ca has wings and it can be proven'
# apply the preprocessing for the corpus
new_text_1 = re.sub('[^a-zA-Z]', ' ', new_text_1)
new_text_1 = new_text_1.lower()
new_text_1 = new_text_1.split()
ps_1 = PorterStemmer()
all_stopwords_1 = stopwords.words('english')
all_stopwords_1.remove('not')
new_text_1 = [ps_1.stem(word) for word in new_text_1 if not word in set(all_stopwords_1)]
new_text_1 = ' '.join(new_text_1)
new_corpus = [new_text_1]
new_X_test = cv.transform(new_corpus).toarray()
new_y_pred = classifier.predict(new_X_test)
if new_y_pred[0]==1:
    print("The news are unreliable.")
else:
    print("The news are real.")
#print(new_y_pred)

The news are unreliable.


### Predicting a fake text

In [66]:
new_text_2 = "The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer."
new_corpus_2 = []
# apply the same preprocessing used for the corpus to this single text
new_text_2 = re.sub('[^a-zA-Z]', ' ', new_text_2)
new_text_2 = new_text_2.lower()
new_text_2 = new_text_2.split()
ps_2 = PorterStemmer()
all_stopwords_2 = stopwords.words('english')
all_stopwords_2.remove('not')
new_text_2 = [ps_2.stem(word) for word in new_text_2 if not word in set(all_stopwords_2)]
new_text_2 = ' '.join(new_text_2)
new_corpus_2 = [new_text_2]
new_X_test_2 = cv.transform(new_corpus_2).toarray()
new_y_pred_2 = classifier.predict(new_X_test_2)
if new_y_pred_2[0]==1:  # check this text's own prediction, not the earlier new_y_pred
    print("The news are unreliable.")
else:
    print("The news are reliable.")
#print(new_y_pred_2)

The news are unreliable.


In [64]:
X_new = X_test[3]
X_new

<1x110429 sparse matrix of type '<class 'numpy.float64'>'
	with 109 stored elements in Compressed Sparse Row format>

In [67]:
prediction = classifier.predict(X_new)
print(prediction)

if (prediction[0]==0):
  print('The news is Real')
else:
  print('The news is Fake')

[0]
The news is Real


In [68]:
print(y_test[3])

0
