I’m happy to share this Fake News Prediction project I worked on. Check it out here: Link

**About the dataset**

1. id: unique ID for a news article
2. title: title of the news article
3. author: author of the news article
4. text: the text of the article; may be incomplete
5. label: marks whether the news article is real or fake

1: Fake news

0: Real news

Importing the dependencies

In [None]:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mahdi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
# Printing the English stopwords (the corpus file is named 'english'; lowercase also works on case-sensitive file systems)
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Data Pre-processing

In [5]:
# Loading the dataset to a pandas DataFrame
news_dataset = pd.read_csv('train.csv')

In [6]:
news_dataset.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [7]:
news_dataset.shape

(20800, 5)

In [8]:
# counting the number of missing values in each column
news_dataset.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [9]:
# replacing the null values with empty strings
news_dataset = news_dataset.fillna('')

In [10]:
news_dataset.isnull().sum()

id        0
title     0
author    0
text      0
label     0
dtype: int64

In [11]:
# merging the author name and news title and storing it into a new column
news_dataset['content'] = news_dataset['author'] + " - " + news_dataset['title']
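
Filling the nulls first matters for this merge: concatenating a string with a missing value gives a missing value, so rows without an author would end up with no content at all. A minimal sketch on a hypothetical two-row frame (the `toy` names and values are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical two-row frame; the second row has no author.
toy = pd.DataFrame({
    "author": ["Jane Doe", None],
    "title": ["Some headline", "Another headline"],
})

# Without filling, the missing author propagates into the merged column.
unfilled = toy["author"] + " - " + toy["title"]

# After fillna(''), every row yields a usable string.
filled_toy = toy.fillna("")
filled = filled_toy["author"] + " - " + filled_toy["title"]

print(unfilled.isna().tolist())
print(filled.tolist())
```

The row without an author keeps its title, at the cost of a leading " - " that the later cleaning step strips out anyway.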

In [12]:
news_dataset.head()

Unnamed: 0,id,title,author,text,label,content
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1,Darrell Lucus - House Dem Aide: We Didn’t Even...
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0,"Daniel J. Flynn - FLYNN: Hillary Clinton, Big ..."
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1,Consortiumnews.com - Why the Truth Might Get Y...
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1,Jessica Purkiss - 15 Civilians Killed In Singl...
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1,Howard Portnoy - Iranian woman jailed for fict...


Dropping the title and author columns

In [13]:
news_dataset.drop(columns='author', inplace=True)

In [14]:
news_dataset.drop(columns='title', inplace=True)

In [15]:
news_dataset

Unnamed: 0,id,text,label,content
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,1,Darrell Lucus - House Dem Aide: We Didn’t Even...
1,1,Ever get the feeling your life circles the rou...,0,"Daniel J. Flynn - FLYNN: Hillary Clinton, Big ..."
2,2,"Why the Truth Might Get You Fired October 29, ...",1,Consortiumnews.com - Why the Truth Might Get Y...
3,3,Videos 15 Civilians Killed In Single US Airstr...,1,Jessica Purkiss - 15 Civilians Killed In Singl...
4,4,Print \nAn Iranian woman has been sentenced to...,1,Howard Portnoy - Iranian woman jailed for fict...
...,...,...,...,...
20795,20795,Rapper T. I. unloaded on black celebrities who...,0,Jerome Hudson - Rapper T.I.: Trump a ’Poster C...
20796,20796,When the Green Bay Packers lost to the Washing...,0,"Benjamin Hoffman - N.F.L. Playoffs: Schedule, ..."
20797,20797,The Macy’s of today grew from the union of sev...,0,Michael J. de la Merced and Rachel Abrams - Ma...
20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",1,"Alex Ansary - NATO, Russia To Hold Parallel Ex..."


Separating Features and labels

In [16]:
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [17]:
print(X)

          id                                               text  \
0          0  House Dem Aide: We Didn’t Even See Comey’s Let...   
1          1  Ever get the feeling your life circles the rou...   
2          2  Why the Truth Might Get You Fired October 29, ...   
3          3  Videos 15 Civilians Killed In Single US Airstr...   
4          4  Print \nAn Iranian woman has been sentenced to...   
...      ...                                                ...   
20795  20795  Rapper T. I. unloaded on black celebrities who...   
20796  20796  When the Green Bay Packers lost to the Washing...   
20797  20797  The Macy’s of today grew from the union of sev...   
20798  20798  NATO, Russia To Hold Parallel Exercises In Bal...   
20799  20799    David Swanson is an author, activist, journa...   

                                                 content  
0      Darrell Lucus - House Dem Aide: We Didn’t Even...  
1      Daniel J. Flynn - FLYNN: Hillary Clinton, Big ...  
2      Consortiumn

In [18]:
print(Y)

0        1
1        0
2        1
3        1
4        1
        ..
20795    0
20796    0
20797    0
20798    1
20799    1
Name: label, Length: 20800, dtype: int64


Stemming -> Stemming is the process of reducing a word to its root form (e.g. connect, connected, connecting, connection --> connect). The root produced by a stemmer is not always a dictionary word.
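
As a quick illustration of what the Porter stemmer does, here is a small sketch using NLTK's `PorterStemmer` (the same class imported above). The word list is my own choice for the example, not part of the dataset:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# A family of related words that should share one stem.
words = ["connect", "connected", "connecting", "connection", "connections"]
stems = [stemmer.stem(w) for w in words]
print(stems)

# Stems are not always dictionary words.
house_stem = stemmer.stem("house")
print(house_stem)
```

Mapping related word forms to one token keeps the vocabulary smaller when the text is vectorized later.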

In [19]:
port_stem = PorterStemmer()

In [20]:
# Build the stopword set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))

def stemming(content):
    # Replace every character that is not a letter (digits, punctuation, ...) with a space
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    # Convert all letters to lowercase
    stemmed_content = stemmed_content.lower()
    # Split the text into a list of words
    stemmed_content = stemmed_content.split()
    # Drop the stopwords and stem the remaining words with the Porter stemmer
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    # Join the stemmed words back into a single string
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content

In [21]:
# Applying the stemming function to the 'content' column
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [22]:
# Printing the stemmed content
print(news_dataset['content'])

0        darrel lucu hous dem aid even see comey letter...
1        daniel j flynn flynn hillari clinton big woman...
2                   consortiumnew com truth might get fire
3        jessica purkiss civilian kill singl us airstri...
4        howard portnoy iranian woman jail fiction unpu...
                               ...                        
20795    jerom hudson rapper trump poster child white s...
20796    benjamin hoffman n f l playoff schedul matchup...
20797    michael j de la merc rachel abram maci said re...
20798    alex ansari nato russia hold parallel exercis ...
20799                            david swanson keep f aliv
Name: content, Length: 20800, dtype: object
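
The cleaning steps that run before the actual stemming can be tried out with the standard library alone. This is only a sketch: `toy_stopwords` is a tiny hand-written set standing in for NLTK's English list, and the stemming step itself is left out.

```python
import re

# Assumption: a tiny hand-written stopword set instead of NLTK's English list.
toy_stopwords = {"we", "didn", "t", "s"}

headline = "House Dem Aide: We Didn’t Even See Comey’s Letter"

# 1. Replace everything that is not an ASCII letter with a space.
letters_only = re.sub("[^a-zA-Z]", " ", headline)
# 2. Lowercase and split into words.
tokens = letters_only.lower().split()
# 3. Drop the stopwords.
kept = [word for word in tokens if word not in toy_stopwords]

print(kept)
```

Note that the curly apostrophes are removed too, which is why fragments such as "didn" and "t" appear as separate tokens before filtering.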


In [23]:
# separating the data and the label
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [24]:
print(X)

['darrel lucu hous dem aid even see comey letter jason chaffetz tweet'
 'daniel j flynn flynn hillari clinton big woman campu breitbart'
 'consortiumnew com truth might get fire' ...
 'michael j de la merc rachel abram maci said receiv takeov approach hudson bay new york time'
 'alex ansari nato russia hold parallel exercis balkan'
 'david swanson keep f aliv']


In [25]:
print(Y)

[1 0 1 ... 0 1 1]


In [26]:
Y.shape

(20800,)

Converting textual data to Numerical data

In [None]:
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X)

In [28]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 210687 stored elements and shape (20800, 17128)>
  Coords	Values
  (0, 267)	0.2701012497770876
  (0, 2483)	0.36765196867972083
  (0, 2959)	0.24684501285337127
  (0, 3600)	0.3598939188262558
  (0, 3792)	0.27053324808454915
  (0, 4973)	0.23331696690935097
  (0, 7005)	0.2187416908935914
  (0, 7692)	0.24785219520671598
  (0, 8630)	0.2921251408704368
  (0, 8909)	0.36359638063260746
  (0, 13473)	0.2565896679337956
  (0, 15686)	0.2848506356272864
  (1, 1497)	0.2939891562094648
  (1, 1894)	0.15521974226349364
  (1, 2223)	0.3827320386859759
  (1, 2813)	0.19094574062359204
  (1, 3568)	0.26373768806048464
  (1, 5503)	0.7143299355715573
  (1, 6816)	0.1904660198296849
  (1, 16799)	0.30071745655510157
  (2, 2943)	0.3179886800654691
  (2, 3103)	0.46097489583229645
  (2, 5389)	0.3866530551182615
  (2, 5968)	0.3474613386728292
  (2, 9620)	0.49351492943649944
  :	:
  (20797, 3643)	0.2115550061362374
  (20797, 7042)	0.21799048897828685
  (2079

Splitting the dataset into training and testing data

In [29]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size= 0.2, stratify= Y, random_state= 2)

In [30]:
print(X.shape, X_train.shape, X_test.shape)

(20800, 17128) (16640, 17128) (4160, 17128)


In [31]:
print(Y.shape, Y_train.shape, Y_test.shape)

(20800,) (16640,) (4160,)
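
The `stratify` argument keeps the share of each class the same in both parts of the split. A minimal sketch on made-up labels (70 zeros and 30 ones; the `toy_` arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 real (0) and 30 fake (1).
toy_y = np.array([0] * 70 + [1] * 30)
toy_X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=2
)

# Class counts in each part; both keep the 70/30 ratio.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```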


Training the model: Logistic Regression

In [32]:
model = LogisticRegression()

In [33]:
# Training our model on the training data using model.fit(features, labels)
model.fit(X_train, Y_train)
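
In this notebook the vectorizer is fitted on the whole dataset before the split, so the vocabulary and IDF weights also see the test rows. One possible alternative, which is not what was done here, is to split the raw strings first and let a scikit-learn `Pipeline` fit the vectorizer on the training part only. A sketch on a made-up corpus (all `toy_` names and strings are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical already-stemmed strings with made-up labels (1 = fake, 0 = real).
toy_texts = [
    "truth might get fire", "civilian kill singl airstrik",
    "iranian woman jail fiction", "nato russia hold exercis",
    "big woman campu breitbart", "playoff schedul matchup",
    "maci receiv takeov approach", "rapper trump poster child",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, then fit everything on the training part only.
texts_train, texts_test, y_train, y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=2
)

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts_train, y_train)

toy_predictions = pipeline.predict(texts_test)
print(toy_predictions)
```

With a pipeline, the test rows are only ever transformed, never used for fitting, and a single object holds both the vectorizer and the model.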

Evaluation

Accuracy score

In [34]:
# Accuracy score on the training data; accuracy_score expects (y_true, y_pred)
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [35]:
# Training accuracy as a percentage; compare it with the test accuracy to check for overfitting
print("Accuracy score on training data:", round(training_data_accuracy * 100, 2))

Accuracy score on training data: 98.64


In [36]:
# Accuracy score on the test data; accuracy_score expects (y_true, y_pred)
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [37]:
# Test accuracy as a percentage; this is the estimate of performance on unseen news
print("Accuracy score on test data:", round(test_data_accuracy * 100, 2))

Accuracy score on test data: 97.91
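
Accuracy is a single number and does not show which kind of mistake the model makes. A confusion matrix splits the errors into real news flagged as fake and fake news passed as real. A small sketch on made-up labels (the `toy_` lists are assumptions, not results from this model):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (1 = fake, 0 = real).
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 0, 1]

toy_accuracy = accuracy_score(toy_true, toy_pred)

# Rows are true classes, columns are predicted classes:
# [[real predicted real, real predicted fake],
#  [fake predicted real, fake predicted fake]]
toy_cm = confusion_matrix(toy_true, toy_pred)

print(toy_accuracy)
print(toy_cm)
```

The same two calls could be applied to `Y_test` and `X_test_prediction` from the cells above.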


Making a predictive system

In [38]:
# X_test[n] is row n of X_test (counting from 0), i.e. the vectorized content of one news item
n = 10
X_new = X_test[n]

# making prediction
prediction = model.predict(X_new)
print(prediction)

if prediction[0] == 1:
    print('Fake news')
else:
    print('Real news')

[0]
Real news


In [39]:
if prediction[0] == Y_test[n]:
    print("Correct prediction")
else:
    print("Incorrect prediction")

Correct prediction


In [40]:
import pickle

In [41]:
# Saving the trained model; the with-block closes the file afterwards
with open('FakeNews.sav', 'wb') as f:
    pickle.dump(model, f)


In [42]:
# Loading the saved model back
with open('FakeNews.sav', 'rb') as f:
    loaded_model = pickle.load(f)

In [46]:
# Saving the fitted vectorizer; it is needed to transform new text the same way
with open('TFIdfVectorizer Fake News.sav', 'wb') as f:
    pickle.dump(vectorizer, f)

In [47]:
# Loading the saved vectorizer back
with open('TFIdfVectorizer Fake News.sav', 'rb') as f:
    fake_news_vectorizer = pickle.load(f)
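
Both objects have to be saved because new text must go through the same fitted vectorizer before the model can score it. A minimal round-trip sketch, done in memory with `pickle.dumps`/`pickle.loads` instead of files; the `toy_` corpus and labels are made up for illustration:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical already-stemmed strings with made-up labels (1 = fake, 0 = real).
toy_texts = [
    "truth might get fire", "nato russia hold exercis",
    "playoff schedul matchup", "maci receiv takeov approach",
]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = TfidfVectorizer().fit(toy_texts)
toy_model = LogisticRegression().fit(toy_vectorizer.transform(toy_texts), toy_labels)

# Serialize and restore both objects.
restored_vectorizer = pickle.loads(pickle.dumps(toy_vectorizer))
restored_model = pickle.loads(pickle.dumps(toy_model))

# New text goes through the restored vectorizer, then the restored model.
new_text = ["nato russia exercis"]
original_prediction = toy_model.predict(toy_vectorizer.transform(new_text))
restored_prediction = restored_model.predict(restored_vectorizer.transform(new_text))
print(original_prediction, restored_prediction)
```

In the real workflow, a new headline would first pass through the `stemming` function so that it matches the text the vectorizer was fitted on. Pickle files should only be loaded from sources you trust.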