About the Dataset:

id: unique id for a news article
title: the title of a news article
author: author of the news article
text: the text of the article; could be incomplete
label: a label that marks whether the news article is real or fake:
    1: Fake news
    0: Real news

Importing dependencies

In [11]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [12]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:
# printing the stopwords in English
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Data Preprocessing

In [14]:
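# loading the dataset to a pandas DataFrame
# note: quoting=3 is csv.QUOTE_NONE, so double quotes are treated as ordinary
# characters and quoted multi-line article text is split across several rows;
# on_bad_lines='skip' then drops any line that has too many fields
news_dataset = pd.read_csv('/content/train.csv', on_bad_lines='skip', quoting=3)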
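
The `quoting` argument of `pd.read_csv` takes the constants of Python's `csv` module, and `3` is `csv.QUOTE_NONE`. The following is a minimal sketch, using only the standard library and a made-up two-article sample, of how disabling quote handling can split one quoted multi-line field into several rows. It illustrates the parsing rule only and does not reproduce the actual contents of `train.csv`.

```python
import csv
import io

# Hypothetical two-article sample; the first article's text spans two lines
# and is wrapped in double quotes, as multi-line CSV fields usually are.
sample = (
    "id,title,author,text,label\n"
    '0,Some title,Jane Doe,"First line\nsecond line",1\n'
    "1,Other title,John Roe,Short text,0\n"
)

# Default quoting: the quoted field is kept together as one value.
rows_default = list(csv.reader(io.StringIO(sample)))

# csv.QUOTE_NONE (the integer 3): quotes are ordinary characters, so the
# embedded newline ends the record and the article is split in two.
rows_no_quoting = list(csv.reader(io.StringIO(sample), quoting=csv.QUOTE_NONE))

print(len(rows_default) - 1)     # number of data rows with default quoting
print(len(rows_no_quoting) - 1)  # number of data rows with QUOTE_NONE
```

If the row count and the label column look wrong after loading, comparing the result against a load with the default quoting is one way to check whether this is the cause.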

In [15]:
news_dataset.shape

(120626, 5)

In [16]:
# print the first 5 rows of the dataframe
news_dataset.head()

Unnamed: 0,Unnamed: 1,id,title,author,text,label
0,House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It,Darrell Lucus,"""House Dem Aide: We Didn’t Even See Comey’s Le...",2016 Subscribe Jason Chaffetz on the stump in...,Utah ( image courtesy Michael Jolley,available under a Creative Commons-BY license)
With apologies to Keith Olbermann,there is no doubt who the Worst Person in The World is this week–FBI Director James Comey. But according to a House Democratic aide,it looks like we also know who the second-wor...,the ranking Democrats on the relevant committ...,,,
As we now know,Comey notified the Republican chairmen and Democratic ranking members of the House Intelligence,Judiciary,and Oversight committees that his agency was ...,Oversight Committee Chairman Jason Chaffetz s...,"""""The FBI has learned of the existence of ema...",
— Jason Chaffetz (@jasoninthehouse) October 28,2016,,,,,
Of course,we now know that this was not the case . Comey was actually saying that it was reviewing the emails in light of “an unrelated case”–which we now know to be Anthony Weiner’s sexting with a teenager. But apparently such little things as facts didn’t matter to Chaffetz. The Utah Republican had already vowed to initiate a raft of investigations if Hillary wins–at least two years’ worth,and possibly an entire term’s worth of them. ...,,,,


In [17]:
# counting the number of missing values in the dataset
news_dataset.isnull().sum()

id         60473
title      79333
author     96780
text      108699
label     115957


In [18]:
# replacing the null values with empty string
news_dataset = news_dataset.fillna('')

In [19]:
# merging the author name and news title
news_dataset['content'] = news_dataset['author']+' '+news_dataset['title']
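
Filling nulls before the merge matters because string concatenation propagates missing values. A small sketch on a made-up two-row frame (the names and titles are placeholders, not rows from the dataset) shows the difference:

```python
import pandas as pd

# Hypothetical two-row frame; the second article has no author.
df = pd.DataFrame({
    "author": ["Jane Doe", None],
    "title": ["Some title", "Other title"],
})

# Without filling, a missing author makes the whole merged value missing.
merged_raw = df["author"] + " " + df["title"]

# With empty strings in place of nulls, the title survives on its own.
filled = df.fillna("")
merged_filled = filled["author"] + " " + filled["title"]

print(merged_raw.isna().tolist())
print(merged_filled.tolist())
```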

In [20]:
print(news_dataset['content'])

0                                                       House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It                                                                                                                                                                                                                                                                                                                     2016 Subscribe Jason Chaffetz on the stump in...
With apologies to Keith Olbermann                        there is no doubt who the Worst Person in The World is this week–FBI Director James Comey. But according to a House Democratic aide                                                                                                                                                                                                                                                                   the ranking Democrats on the relevant commit...
As we now 

In [21]:
# separating the data & label
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [22]:
print(X)
print(Y)

                                                                                                                                                      id  \
0                                                  House Dem Aide: We Didn’t Even See Comey’s Lett...                                      Darrell Lucus   
With apologies to Keith Olbermann                   there is no doubt who the Worst Person in The ...   it looks like we also know who the second-wor...   
As we now know                                      Comey notified the Republican chairmen and Dem...                                          Judiciary   
— Jason Chaffetz (@jasoninthehouse) October 28      2016                                                                                                   
Of course                                           we now know that this was not the case . Comey...   and possibly an entire term’s worth of them. ...   
...                                                             

Stemming:

Stemming is the process of reducing a word to its root form (its stem), typically by stripping suffixes.

example: connected, connecting, connection --> connect
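
To make the idea concrete, here is a deliberately naive suffix stripper. The helper `naive_stem` and its suffix list are hypothetical and written only for this illustration; this is not the Porter algorithm, which applies a much longer sequence of conditional rewrite rules.

```python
def naive_stem(word, suffixes=("ing", "ion", "ed", "s")):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["connect", "connected", "connecting", "connection", "connects"]
stems = [naive_stem(w) for w in words]
print(stems)
```

Every word in the list collapses to the same stem, which is what lets a vectorizer count the variants as a single feature.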


In [23]:
port_stem = PorterStemmer()

In [24]:
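# build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, lowercase, and split into words
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # drop stopwords and stem the remaining words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content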
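
The cleaning steps that run before the stemmer can be tried in isolation. The sketch below assumes a tiny stand-in stopword set (`STOP_WORDS` is hypothetical; the notebook uses the full NLTK English list) and leaves the stemming step out, so it shows only what the regular expression, lowercasing, and stopword filter do to a headline.

```python
import re

# Small stand-in stopword set (an assumption for this sketch; the notebook
# uses the full NLTK English list).
STOP_WORDS = {"the", "is", "in", "it", "t", "s", "didn", "we", "until"}

def clean_tokens(content, stop_words=STOP_WORDS):
    """Return the lowercase alphabetic tokens of `content` minus stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", content)  # drop digits and punctuation
    tokens = letters_only.lower().split()               # normalise case, tokenise
    return [tok for tok in tokens if tok not in stop_words]

tokens = clean_tokens("We Didn’t Even See Comey’s Letter Until 2016!")
print(tokens)
```

Note that the apostrophes are replaced by spaces, so contractions break into fragments such as `didn` and `t`, which the stopword list then removes.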

In [25]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [26]:
print(news_dataset['content'])

0                                                       House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It                                                                                                                                                                                                                                                                                                                    subscrib jason chaffetz stump american fork ho...
With apologies to Keith Olbermann                        there is no doubt who the Worst Person in The World is this week–FBI Director James Comey. But according to a House Democratic aide                                                                                                                                                                                                                                                                 rank democrat relev committe hear comey found ...
As we now 

In [27]:
#separating the data and label
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [28]:
print(X)

['subscrib jason chaffetz stump american fork hous dem aid even see comey letter jason chaffetz tweet darrel lucu octob'
 'rank democrat relev committe hear comey found via tweet one republican committe chairmen'
 'oversight committe chairman jason chaffetz set polit world ablaz tweet fbi dir inform oversight committe agenc review email recent discov order see contain classifi inform long letter went'
 ...
 'radio host director worldbeyondwar org campaign coordin rootsact org swanson book includ war lie blog davidswanson org warisacrim org host talk nation radio nobel peac prize nomine'
 '' '']


In [29]:
print(Y)

[' available under a Creative Commons-BY license) ' '' '' ... '' '' '']


In [30]:
Y.shape

(120626,)

In [31]:
# converting the textual data to numerical data
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X)
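
As a rough guide to what the vectorizer computes, the sketch below re-implements TF-IDF in plain Python using the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, and the L2 row normalisation that scikit-learn documents as the `TfidfVectorizer` defaults. The function `tfidf_rows` and the three-document corpus are hypothetical, and splitting on whitespace is a simplification of the library's tokenizer.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenised for term in set(tokens))
    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

docs = ["comey letter tweet", "comey email review", "comey polit world"]
rows = tfidf_rows(docs)
print(rows[0])
```

A term that occurs in every document ("comey" here) gets a lower weight than a term unique to one document, which is the behaviour that makes TF-IDF useful for separating articles by their distinctive vocabulary.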

In [32]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 413124 stored elements and shape (120626, 27776)>
  Coords	Values
  (0, 486)	0.18456124147677766
  (0, 829)	0.1402087846461952
  (0, 3962)	0.4945687287655592
  (0, 4671)	0.177292554790038
  (0, 5772)	0.24221841574327657
  (0, 6061)	0.22176500785445608
  (0, 8142)	0.13696256333852103
  (0, 9114)	0.2830597978447354
  (0, 11338)	0.1615264079897107
  (0, 12621)	0.39961678280165325
  (0, 13927)	0.19110629895837977
  (0, 14387)	0.24923723226249953
  (0, 17102)	0.14302134162263339
  (0, 21796)	0.15055993645909313
  (0, 23550)	0.2536854814833485
  (0, 23608)	0.19702427504704662
  (0, 25339)	0.1810890353616065
  (1, 3970)	0.3651197416091226
  (1, 4671)	0.23922315541253536
  (1, 4700)	0.5249517798052782
  (1, 6083)	0.210021323870699
  (1, 9172)	0.2284144531802483
  (1, 10834)	0.2613307204649499
  (1, 17261)	0.17359909741430027
  (1, 19866)	0.2791954564850001
  :	:
  (120621, 21796)	0.1132954529030807
  (120621, 25856)	0.20771918967687

Splitting the dataset into training & test data

In [33]:
# hold out 20% of the rows for testing; stratify=Y is not used in this split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
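
For reference, this is roughly what a stratified split looks like when the labels are clean. The toy arrays below are made up for illustration. Stratification keeps the class proportions equal in both parts, and scikit-learn raises an error if any class has fewer than two members.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 10 per class (made up for illustration).
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array([0] * 10 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=2
)

# Each part keeps the 50/50 class balance of the full data.
print(np.bincount(y_tr), np.bincount(y_te))
```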

Training the Model: Logistic Regression

In [34]:
model = LogisticRegression()

In [None]:
model.fit(X_train, Y_train)

Evaluation

Accuracy score

In [None]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [None]:
print('Accuracy score of the training data : ', training_data_accuracy)

In [None]:
# accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [None]:
print('Accuracy score of the test data : ', test_data_accuracy)
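
Accuracy is simply the fraction of predictions that equal the true label. A plain-Python sketch of that definition follows; the helper `accuracy` and the sample labels are hypothetical, and in the notebook the same number comes from `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))  # 4 of the 5 predictions match
```

A training accuracy far above the test accuracy would suggest that the model has memorised the training rows rather than learned patterns that generalise.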

Making a Predictive System

In [None]:
X_new = X_test[3]

prediction = model.predict(X_new)
print(prediction)

if prediction[0] == 0:
    print('The news is Real')
else:
    print('The news is Fake')
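
The cell above classifies a row that is already vectorized. For a new article, the text would first have to go through the same cleaning and the same fitted vectorizer. The sketch below shows that flow on a made-up four-document corpus; `toy_vectorizer`, `toy_model`, `predict_label`, and the sample texts are all hypothetical stand-ins for the notebook's objects.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in corpus and labels (1 = fake, 0 = real), made up for this sketch.
train_texts = [
    "shock secret cure doctor hate",
    "alien land white hous secret",
    "senat pass budget bill vote",
    "court rule tax law appeal",
]
train_labels = [1, 1, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(train_texts), train_labels)

def predict_label(text):
    """Vectorise one already-cleaned text with the fitted vocabulary and classify it."""
    features = toy_vectorizer.transform([text])  # transform only, never refit
    return int(toy_model.predict(features)[0])

label = predict_label("senat vote tax bill")
print("The news is Real" if label == 0 else "The news is Fake")
```

Calling `transform` rather than `fit_transform` on new text keeps the feature columns aligned with the ones the model was trained on.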

In [None]:
[0]
#The news is Real


In [None]:
print(Y_test[3])