About the dataset:

1. id: unique id for a news article

2. title: the title of a news article

3. author: author of the news article

4. text: the text of the article; could be incomplete

5. label: a label that marks whether the news article is real or fake:

1: Fake news

0: Real news


Importing the Dependencies


In [15]:
import numpy as np
import pandas as pd
import re  # regular expressions, used to search and clean text

from nltk.corpus import stopwords  # a corpus is a body of text
# NLTK stands for Natural Language Toolkit
# stopwords are common words that carry little meaning on their own, such as "a", "an", "where"
# they are removed so the model can focus on the more informative words

from nltk.stem.porter import PorterStemmer
# stemming strips suffixes from a word and returns its stem (an approximate root word)

from sklearn.feature_extraction.text import TfidfVectorizer
# converts the text into numerical feature vectors

from sklearn.model_selection import train_test_split #split data into train and test data

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score


In [16]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [17]:
#printing the stopwords in English
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Data Pre-processing

In [18]:
#loading the dataset to a pandas dataframe
news_dataset = pd.read_csv('/content/drive/MyDrive/Machine Learning Projects/Project 4/train.csv')
news_dataset.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [19]:
news_dataset.shape #20,800 news articles

(20800, 5)

In [20]:
# counting the number of missing values in the dataset

news_dataset.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [21]:
#replacing the null values with empty string
news_dataset = news_dataset.fillna('') #fills the missing values

In [22]:
news_dataset.isnull().sum()

id        0
title     0
author    0
text      0
label     0
dtype: int64
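
The empty-string fill matters for the next step, where author and title are concatenated. In pandas, adding a string to a missing value generally yields another missing value, so an article without an author would lose its title as well. The sketch below is a plain-Python analogue of that idea, not pandas code; the records and the `merge_fields` helper are made up for illustration.

```python
# Plain-Python analogue; the records below are made up for illustration.
records = [
    {"author": "Jane Doe", "title": "Example headline"},
    {"author": None, "title": "Headline without an author"},
]

def merge_fields(record):
    """Join author and title, treating a missing value as an empty string."""
    author = record["author"] or ""
    title = record["title"] or ""
    return author + "  " + title

merged = [merge_fields(r) for r in records]
print(merged)
```

The second record keeps its title even though the author is missing, which is the effect `fillna('')` has on the real dataset.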

In [23]:
#Merging the author name and news title
news_dataset['content'] = news_dataset['author']+'  '+news_dataset['title']

In [24]:
print(news_dataset['content']) #use content data to make predictions

0        Darrell Lucus  House Dem Aide: We Didn’t Even ...
1        Daniel J. Flynn  FLYNN: Hillary Clinton, Big W...
2        Consortiumnews.com  Why the Truth Might Get Yo...
3        Jessica Purkiss  15 Civilians Killed In Single...
4        Howard Portnoy  Iranian woman jailed for ficti...
                               ...                        
20795    Jerome Hudson  Rapper T.I.: Trump a ’Poster Ch...
20796    Benjamin Hoffman  N.F.L. Playoffs: Schedule, M...
20797    Michael J. de la Merced and Rachel Abrams  Mac...
20798    Alex Ansary  NATO, Russia To Hold Parallel Exe...
20799             David Swanson  What Keeps the F-35 Alive
Name: content, Length: 20800, dtype: object


In [25]:
# separating the data and labels

X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [26]:
print(X)
print(Y)

          id                                              title  \
0          0  House Dem Aide: We Didn’t Even See Comey’s Let...   
1          1  FLYNN: Hillary Clinton, Big Woman on Campus - ...   
2          2                  Why the Truth Might Get You Fired   
3          3  15 Civilians Killed In Single US Airstrike Hav...   
4          4  Iranian woman jailed for fictional unpublished...   
...      ...                                                ...   
20795  20795  Rapper T.I.: Trump a ’Poster Child For White S...   
20796  20796  N.F.L. Playoffs: Schedule, Matchups and Odds -...   
20797  20797  Macy’s Is Said to Receive Takeover Approach by...   
20798  20798  NATO, Russia To Hold Parallel Exercises In Bal...   
20799  20799                          What Keeps the F-35 Alive   

                                          author  \
0                                  Darrell Lucus   
1                                Daniel J. Flynn   
2                             Consortiu

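Stemming:

Stemming is the process of reducing a word to its stem (an approximate root word). The Porter stemmer used here does this by stripping common suffixes; it does not remove prefixes.

example (idealized; a real stemmer may leave some of these forms unchanged):
actor, actress, acting --> act

Mapping related word forms to one stem shrinks the vocabulary, so the model sees fewer, more frequent features, which can help it predict more accurately.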
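
To make the idea of suffix stripping concrete, here is a minimal sketch of a toy stemmer. It is not the Porter algorithm, which applies a much larger, ordered set of rules; the `SUFFIXES` tuple and the `toy_stem` function are hypothetical names introduced only for this illustration.

```python
# Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny rule set

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["acting", "acted", "acts", "act"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['act', 'act', 'act', 'act']
```

All four forms collapse to the same stem, so a vectorizer would later treat them as one feature instead of four.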

In [27]:
port_stem = PorterStemmer()

In [29]:
# cache the stopwords as a set: lookups are fast and the list is built only once
english_stopwords = set(stopwords.words('english'))

def stemming(content):
  # replace every character that is not a letter (digits, punctuation, symbols) with a space
  stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
  stemmed_content = stemmed_content.lower()  # convert all letters to lowercase
  stemmed_content = stemmed_content.split()  # split on whitespace into a list of words
  # drop the stopwords and reduce every remaining word to its stem
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
  stemmed_content = ' '.join(stemmed_content)  # join the stems back into a single string
  return stemmed_content
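
The cleaning steps that run before the stemmer can be traced on a single headline with the standard library alone. This is a sketch under assumptions: `SAMPLE_STOPWORDS` is a small hand-picked subset standing in for the NLTK list, `clean` is a hypothetical helper, and the stemming step is left out on purpose.

```python
import re

# Hypothetical subset standing in for stopwords.words('english').
SAMPLE_STOPWORDS = {"what", "the", "to", "a", "in", "for", "of"}

def clean(text):
    """Keep letters only, lowercase, tokenize, and drop stopwords (no stemming)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)   # "F-35" becomes "F   "
    tokens = letters_only.lower().split()
    return [t for t in tokens if t not in SAMPLE_STOPWORDS]

headline = "David Swanson  What Keeps the F-35 Alive"
tokens = clean(headline)
print(tokens)  # ['david', 'swanson', 'keeps', 'f', 'alive']
```

The digits and the hyphen in "F-35" are replaced by spaces, which is why only a lone "f" survives in the processed content for this article.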

In [30]:
news_dataset['content']

0        Darrell Lucus  House Dem Aide: We Didn’t Even ...
1        Daniel J. Flynn  FLYNN: Hillary Clinton, Big W...
2        Consortiumnews.com  Why the Truth Might Get Yo...
3        Jessica Purkiss  15 Civilians Killed In Single...
4        Howard Portnoy  Iranian woman jailed for ficti...
                               ...                        
20795    Jerome Hudson  Rapper T.I.: Trump a ’Poster Ch...
20796    Benjamin Hoffman  N.F.L. Playoffs: Schedule, M...
20797    Michael J. de la Merced and Rachel Abrams  Mac...
20798    Alex Ansary  NATO, Russia To Hold Parallel Exe...
20799             David Swanson  What Keeps the F-35 Alive
Name: content, Length: 20800, dtype: object

In [31]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [33]:
news_dataset['content']

0        darrel lucu hous dem aid even see comey letter...
1        daniel j flynn flynn hillari clinton big woman...
2                   consortiumnew com truth might get fire
3        jessica purkiss civilian kill singl us airstri...
4        howard portnoy iranian woman jail fiction unpu...
                               ...                        
20795    jerom hudson rapper trump poster child white s...
20796    benjamin hoffman n f l playoff schedul matchup...
20797    michael j de la merc rachel abram maci said re...
20798    alex ansari nato russia hold parallel exercis ...
20799                            david swanson keep f aliv
Name: content, Length: 20800, dtype: object

In [34]:
# separating the data and label
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [38]:
print(X)
print(Y)

['darrel lucu hous dem aid even see comey letter jason chaffetz tweet'
 'daniel j flynn flynn hillari clinton big woman campu breitbart'
 'consortiumnew com truth might get fire' ...
 'michael j de la merc rachel abram maci said receiv takeov approach hudson bay new york time'
 'alex ansari nato russia hold parallel exercis balkan'
 'david swanson keep f aliv']
[1 0 1 ... 0 1 1]


In [39]:
X.shape

(20800,)

In [41]:
Y.shape

(20800,)

In [42]:
# converting the textual data to numerical data
vectorizer = TfidfVectorizer()
# TF-IDF stands for term frequency-inverse document frequency
# TF: how often a word occurs in a document; a word repeated within a document gets a higher weight there
# IDF: down-weights words that occur in many documents, since they say little about any single one
# (for example, a movie's name appearing in every review of that movie)
# the resulting feature vectors are numbers the model can learn from

vectorizer.fit(X)

X = vectorizer.transform(X)
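
The weighting can be reproduced by hand on a toy corpus. The sketch below assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults; the three documents and the helper names are made up, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter

docs = ["truth might get fire", "truth keep aliv", "get fire"]  # toy corpus
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency (assumed default formula)."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    """TF-IDF weights for one tokenized document, scaled to unit L2 norm."""
    counts = Counter(doc)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {term: v / norm for term, v in raw.items()}

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items()):
    print(f"{term}: {weight:.4f}")
```

In the first document, "might" occurs in only one of the three documents and therefore receives a larger weight than "truth", "get", or "fire", which each occur in two.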

In [44]:
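print(X[0-10])  # note: 0-10 evaluates to -10, so this prints the tenth row from the end; use X[0:10] for the first ten rows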

  (0, 15865)	0.3625198671885904
  (0, 13680)	0.30812577800698365
  (0, 12774)	0.21975674117437816
  (0, 12576)	0.19783076443181086
  (0, 10640)	0.23079740532746962
  (0, 10380)	0.3222850414358074
  (0, 8471)	0.27528459143878936
  (0, 7596)	0.23984355971671953
  (0, 7061)	0.24702591270954882
  (0, 6470)	0.318252268025221
  (0, 1607)	0.2832971088655647
  (0, 1481)	0.3128717400396883
  (0, 129)	0.23518907128794672
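
Each line of this output reads as (row, column) followed by the stored value: only the non-zero TF-IDF weights are kept, and the column number identifies a term in the vectorizer's vocabulary. The row is 0 because a single row was selected. A minimal sketch of the same idea, using a made-up dense row and a hypothetical `to_coordinates` helper:

```python
def to_coordinates(dense_rows):
    """List the non-zero entries of a dense matrix as (row, column, value)."""
    return [
        (r, c, value)
        for r, row in enumerate(dense_rows)
        for c, value in enumerate(row)
        if value != 0.0
    ]

dense = [[0.0, 0.0, 0.36, 0.0, 0.31]]  # made-up weights for one short document
coords = to_coordinates(dense)
for r, c, value in coords:
    print(f"  ({r}, {c})\t{value}")
```

With a vocabulary of thousands of terms and only a dozen or so words per title, storing just the non-zero entries saves most of the memory a dense matrix would need.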


Splitting the dataset to training and test data

In [45]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=.2, stratify = Y, random_state = 2)

In [48]:
print(X.shape, X_train.shape, X_test.shape)

(20800, 17128) (16640, 17128) (4160, 17128)
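
The shapes confirm the 80/20 split: 20800 × 0.2 = 4160 test rows and 16640 training rows. The `stratify = Y` argument additionally keeps the ratio of fake to real articles the same in both parts. The sketch below illustrates that idea with the standard library; it is not scikit-learn's implementation, and `stratified_split` and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=2):
    """Split row indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_test = round(len(indices) * test_size)   # same fraction per class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 50 + [1] * 50                       # toy labels: 50 real, 50 fake
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))               # 80 20
print(sum(labels[i] for i in test_idx))            # 10 fake articles in the test part
```

Without stratification, a random split could by chance put noticeably more of one class into the test set, which would make the accuracy figures harder to compare.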


Training the Model: Logistic Regression

In [49]:
model = LogisticRegression()

In [51]:
model.fit(X_train,Y_train) # training takes longer on larger datasets
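
Once fitted, a binary logistic regression model scores an article by taking a weighted sum of its feature values, passing the sum through the sigmoid function, and thresholding the resulting probability at 0.5. The sketch below shows that computation with hypothetical weights and a hypothetical sparse feature row; the fitted model's actual values are stored in `model.coef_` and `model.intercept_`.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    """Return (label, probability of class 1) for one sparse feature row."""
    z = bias + sum(weights[col] * value for col, value in features.items())
    prob_fake = sigmoid(z)
    return (1 if prob_fake >= 0.5 else 0), prob_fake

weights = {0: 2.0, 1: -1.5, 2: 0.5}   # hypothetical per-term coefficients
bias = -0.1                           # hypothetical intercept
features = {0: 0.6, 2: 0.8}           # TF-IDF values of the terms present

label, prob = predict(weights, bias, features)
print(label, round(prob, 4))          # 1 0.8176
```

Terms with positive coefficients push the prediction toward label 1 (fake), and terms with negative coefficients push it toward label 0 (real).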

Evaluation

accuracy score

In [52]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)  # argument order: (y_true, y_pred)

In [54]:
print('Accuracy score of the training data: ', training_data_accuracy)

# accuracy on the test data matters more than accuracy on the training data, because it shows how the model performs on articles it has not seen

Accuracy score of the training data:  0.9865985576923076


In [55]:
# accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)  # argument order: (y_true, y_pred)

In [56]:
print('Accuracy score of the test data: ', test_data_accuracy)

Accuracy score of the test data:  0.9790865384615385
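
Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels (the `accuracy` helper is hypothetical) and then checks what the reported test score implies for the 4160 test articles.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = [1, 0, 1, 1, 0]   # made-up true labels
y_pred = [1, 0, 0, 1, 0]   # made-up predictions, one mistake
toy_accuracy = accuracy(y_true, y_pred)
print(toy_accuracy)        # 0.8

# The reported test accuracy is consistent with 4073 correct out of 4160.
implied_correct = round(0.9790865384615385 * 4160)
print(implied_correct, implied_correct / 4160)
```

The gap between training accuracy (about 0.987) and test accuracy (about 0.979) is small, which suggests the model is not overfitting badly on this dataset.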


Making a Predictive System

In [64]:
X_new = X_test[1]

print(X_new)

  (0, 16996)	0.09117761343372983
  (0, 15295)	0.08946281236254729
  (0, 14046)	0.42524648908354634
  (0, 13190)	0.36773046084789346
  (0, 12741)	0.24868518461414146
  (0, 12279)	0.3796661151115819
  (0, 12041)	0.37327055071909065
  (0, 10306)	0.08813410128297053
  (0, 8813)	0.42524648908354634
  (0, 4008)	0.23098933893199997
  (0, 3339)	0.2834482751186189


In [65]:
prediction = model.predict(X_new)
print(prediction)

[0]


In [66]:
if prediction[0] == 0:
  print('The news is Real')
else:
  print('The news is Fake')

The news is Real


In [68]:
print(Y_test[1]) # the true label is also 0, so the model predicted this article correctly

0
