<a href="https://colab.research.google.com/github/sreelakshmig009/Fake-News-Predicition-Using-Logisitic-Regression/blob/main/Fake_news_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Importing the dependencies**



In [12]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [13]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

**Data Pre-Processing**

In [2]:
#load the dataset into pandas dataframe
news_dataset = pd.read_csv('/content/train.csv')

In [3]:
# check the number of rows (news articles) and columns in the dataset
news_dataset.shape
# this Kaggle training dataset has 20,800 articles (rows) and 5 columns

(20800, 5)

In [4]:
#print the first five rows of the dataset
news_dataset.head()

|   | id | title | author | text | label |
|---|----|-------|--------|------|-------|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


In [5]:
# counting the number of missing values in the dataset
news_dataset.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [6]:
# replace the null values with empty strings
# this keeps every row and lets the text columns be concatenated without producing NaN
news_dataset = news_dataset.fillna('')
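Why the empty-string fill matters is easiest to see on a tiny example. The sketch below uses a made-up two-row frame (the rows and the variable names `toy`, `with_gaps`, and `without_gaps` are assumptions for illustration, not part of the dataset) to show that concatenating a missing value with a string produces a missing result, while filling first keeps every row usable.

```python
import pandas as pd

# Toy frame (hypothetical rows) with one missing author.
toy = pd.DataFrame({
    "author": ["Jane Doe", None],
    "title": ["Some headline", "Another headline"],
})

# Concatenating with a missing value yields a missing result for that row.
with_gaps = toy["author"] + " " + toy["title"]

# After fillna(""), every row produces a usable string.
filled = toy.fillna("")
without_gaps = filled["author"] + " " + filled["title"]

print(with_gaps.isnull().sum())
print(without_gaps.tolist())
```

Dropping the incomplete rows would also avoid the problem, but it would discard articles that still carry a usable title or author.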

In [30]:
# store the author and title, joined together, in a new column called content
news_dataset['content'] = news_dataset['author']+'  '+news_dataset['title']

In [31]:
# preview the new content column
print(news_dataset['content'])

0        Darrell Lucus  House Dem Aide: We Didn’t Even ...
1        Daniel J. Flynn  FLYNN: Hillary Clinton, Big W...
2        Consortiumnews.com  Why the Truth Might Get Yo...
3        Jessica Purkiss  15 Civilians Killed In Single...
4        Howard Portnoy  Iranian woman jailed for ficti...
                               ...                        
20795    Jerome Hudson  Rapper T.I.: Trump a ’Poster Ch...
20796    Benjamin Hoffman  N.F.L. Playoffs: Schedule, M...
20797    Michael J. de la Merced and Rachel Abrams  Mac...
20798    Alex Ansary  NATO, Russia To Hold Parallel Exe...
20799             David Swanson  What Keeps the F-35 Alive
Name: content, Length: 20800, dtype: object


In [32]:
# separate the features from the label column
X = news_dataset.drop(columns='label')
# drop(columns=...) removes columns; drop(index=...) would remove rows
Y = news_dataset['label']

In [33]:
print(X) # every column except label (pandas truncates the display to fit the screen)
print(Y) # only the label values

          id  ...                                            content
0          0  ...  Darrell Lucus  House Dem Aide: We Didn’t Even ...
1          1  ...  Daniel J. Flynn  FLYNN: Hillary Clinton, Big W...
2          2  ...  Consortiumnews.com  Why the Truth Might Get Yo...
3          3  ...  Jessica Purkiss  15 Civilians Killed In Single...
4          4  ...  Howard Portnoy  Iranian woman jailed for ficti...
...      ...  ...                                                ...
20795  20795  ...  Jerome Hudson  Rapper T.I.: Trump a ’Poster Ch...
20796  20796  ...  Benjamin Hoffman  N.F.L. Playoffs: Schedule, M...
20797  20797  ...  Michael J. de la Merced and Rachel Abrams  Mac...
20798  20798  ...  Alex Ansary  NATO, Russia To Hold Parallel Exe...
20799  20799  ...           David Swanson  What Keeps the F-35 Alive

[20800 rows x 5 columns]
0        1
1        0
2        1
3        1
4        1
        ..
20795    0
20796    0
20797    0
20798    1
20799    1
Name: label, Length: 20800, dtype: int64

**Stemming**


In [34]:
port_stem = PorterStemmer()

In [35]:
# cache the stop words in a set so the list is not rebuilt for every word
stop_words = set(stopwords.words('english'))

# cleans, lowercases, filters, and stems one text so the same steps can be applied to every row
def stemming(content):
  stemmed_content = re.sub('[^a-zA-Z ]',' ',content) # replace digits and special characters with spaces
  stemmed_content = stemmed_content.lower() # lowercase everything for uniformity
  stemmed_content = stemmed_content.split() # split into a list of words
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words] # drop stop words and stem the rest
  stemmed_content = '  '.join(stemmed_content) # join the stems back into one string
  return stemmed_content

In [36]:
# apply the stemming function to every entry of the content column
news_dataset['content'] = news_dataset['content'].apply(stemming)
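To make the cleaning steps concrete, here is a minimal sketch that runs them on one headline from the dataset preview. It deliberately leaves out the Porter stemmer and replaces NLTK's English stop-word list with a tiny hand-written set, so `TOY_STOP_WORDS` and `clean_text` are assumptions made for illustration and the tokens come out unstemmed.

```python
import re

# Hypothetical mini stop-word set; the notebook uses NLTK's English list instead.
TOY_STOP_WORDS = {"the", "a", "in", "for", "we", "t", "s"}

def clean_text(text):
    # Keep only letters and spaces, then lowercase and split into tokens.
    letters_only = re.sub(r"[^a-zA-Z ]", " ", text)
    tokens = letters_only.lower().split()
    # Drop stop words; stemming is intentionally skipped in this sketch.
    return [tok for tok in tokens if tok not in TOY_STOP_WORDS]

example_tokens = clean_text("15 Civilians Killed In Single US Airstrike")
print(example_tokens)
```

The digits disappear because the regular expression replaces every non-letter with a space, and "In" is dropped only after lowercasing, which is why the lowercase step has to come before the stop-word check.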

In [37]:
print(news_dataset['content'])

0        darrel  lucu  hous  dem  aid  even  see  comey...
1        daniel  j  flynn  flynn  hillari  clinton  big...
2              consortiumnew  com  truth  might  get  fire
3        jessica  purkiss  civilian  kill  singl  us  a...
4        howard  portnoy  iranian  woman  jail  fiction...
                               ...                        
20795    jerom  hudson  rapper  trump  poster  child  w...
20796    benjamin  hoffman  n  f  l  playoff  schedul  ...
20797    michael  j  de  la  merc  rachel  abram  maci ...
20798    alex  ansari  nato  russia  hold  parallel  ex...
20799                        david  swanson  keep  f  aliv
Name: content, Length: 20800, dtype: object


In [38]:
# separating data and label after stemming
X = news_dataset['content'].values
Y = news_dataset['label'].values
print(X)
print(Y)

['darrel  lucu  hous  dem  aid  even  see  comey  letter  jason  chaffetz  tweet'
 'daniel  j  flynn  flynn  hillari  clinton  big  woman  campu  breitbart'
 'consortiumnew  com  truth  might  get  fire' ...
 'michael  j  de  la  merc  rachel  abram  maci  said  receiv  takeov  approach  hudson  bay  new  york  time'
 'alex  ansari  nato  russia  hold  parallel  exercis  balkan'
 'david  swanson  keep  f  aliv']
[1 0 1 ... 0 1 1]


**Vectorization**

In [39]:
vectorizer = TfidfVectorizer() # weights each word by how often it occurs in a text, scaled down for words that are common across all texts
vectorizer.fit(X)
# convert X to feature vectors (Y needs no conversion because it is already numerical)
X = vectorizer.transform(X)
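The weights that TF-IDF produces can be reproduced by hand on a toy corpus. The sketch below assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which as far as we know matches the `TfidfVectorizer` defaults; the three documents and the helper `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical, already-stemmed documents.
docs = [
    "trump poster child",
    "trump ralli crowd",
    "nato russia exercis",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    counts = Counter(tokens)
    # Term count times smoothed idf: ln((1 + n) / (1 + df)) + 1.
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t, c in counts.items()
    }
    # Scale the vector to unit L2 length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

first = tfidf_vector(tokenized[0])
print(first)
```

In the first document, "trump" receives a lower weight than "poster" because it also appears in the second document, which is the behavior that distinguishes TF-IDF from a plain word count.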


In [40]:
print(X)

  (0, 15686)	0.28485063562728646
  (0, 13473)	0.2565896679337957
  (0, 8909)	0.3635963806326075
  (0, 8630)	0.29212514087043684
  (0, 7692)	0.24785219520671603
  (0, 7005)	0.21874169089359144
  (0, 4973)	0.233316966909351
  (0, 3792)	0.2705332480845492
  (0, 3600)	0.3598939188262559
  (0, 2959)	0.2468450128533713
  (0, 2483)	0.3676519686797209
  (0, 267)	0.27010124977708766
  (1, 16799)	0.30071745655510157
  (1, 6816)	0.1904660198296849
  (1, 5503)	0.7143299355715573
  (1, 3568)	0.26373768806048464
  (1, 2813)	0.19094574062359204
  (1, 2223)	0.3827320386859759
  (1, 1894)	0.15521974226349364
  (1, 1497)	0.2939891562094648
  (2, 15611)	0.41544962664721613
  (2, 9620)	0.49351492943649944
  (2, 5968)	0.3474613386728292
  (2, 5389)	0.3866530551182615
  (2, 3103)	0.46097489583229645
  :	:
  (20797, 13122)	0.2482526352197606
  (20797, 12344)	0.27263457663336677
  (20797, 12138)	0.24778257724396507
  (20797, 10306)	0.08038079000566466
  (20797, 9588)	0.174553480255222
  (20797, 9518)	0.295420

(This vectorized input is what we feed to the ML model.)

**Splitting the dataset into training and test data**

In [42]:
# split the features and labels into training data and test data
# 80% of the input data is used for training and the remaining 20% for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y, random_state=2)
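The `stratify` argument is what keeps the share of fake and real articles the same in both parts of the split. A small sketch with made-up labels (70 real, 30 fake; `toy_labels` and `toy_features` are assumptions for illustration) shows the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 real (0) and 30 fake (1).
toy_labels = np.array([0] * 70 + [1] * 30)
toy_features = np.arange(100).reshape(-1, 1)

_, _, y_train_toy, y_test_toy = train_test_split(
    toy_features, toy_labels, test_size=0.2, stratify=toy_labels, random_state=2
)

# Both parts keep the 30% share of label 1.
print(y_train_toy.mean(), y_test_toy.mean())
```

Without `stratify`, a random split could by chance leave one class over-represented in the test data, which would make the accuracy score harder to interpret.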

**Training the Model using Logistic Regression**

In [43]:
model = LogisticRegression()

In [44]:
model.fit(X_train, Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
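Once fitted, a logistic regression model scores an input as a weighted sum of its features plus a bias, passes that score through the sigmoid function, and compares the resulting probability with a threshold. The sketch below illustrates this with made-up weights; `toy_weights`, `toy_bias`, and `predict_label` are hypothetical and are not taken from the trained model.

```python
import math

def sigmoid(z):
    # Maps any real score to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    # Weighted sum of the feature values; unknown terms contribute nothing.
    z = sum(weights.get(term, 0.0) * value for term, value in features.items()) + bias
    probability = sigmoid(z)
    return (1 if probability >= threshold else 0), probability

# Hypothetical weights and bias, for illustration only.
toy_weights = {"alpha": -2.0, "beta": 1.5}
toy_bias = 0.1

label, prob = predict_label({"beta": 0.4, "gamma": 0.3}, toy_weights, toy_bias)
print(label, round(prob, 3))
```

Here the score is 1.5 * 0.4 + 0.1 = 0.7, which the sigmoid maps to a probability of about 0.668, so the predicted label is 1.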

**Evaluation**

In [45]:
# finding the accuracy score for training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction) # true labels first, then predictions; the score is the fraction that match

In [46]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.9865985576923076


In [47]:
# finding the accuracy score for test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction) # true labels first, then predictions

In [48]:
print('Accuracy score of the test data : ', test_data_accuracy)
# the test accuracy matters more than the training accuracy because it shows how well the model works on data it has not seen

Accuracy score of the test data :  0.9790865384615385
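Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand, together with precision and recall for the fake class, on a made-up set of eight labels; `y_true` and `y_pred` are hypothetical values, not results from this notebook.

```python
# Hypothetical labels and predictions (1 = fake, 0 = real), for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fake, predicted fake
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # real, predicted fake
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fake, predicted real
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # real, predicted real

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)
```

Precision and recall are worth checking next to accuracy, because a single accuracy figure does not show whether the errors are mostly real articles flagged as fake or fake articles that slipped through.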


**Making a Predictive System**

When we provide new data to our model, it should tell us whether it is real news or fake news.

In [52]:
X_new = X_test[5] # change the index to try a different row of X_test

prediction = model.predict(X_new)
print(prediction)

if (prediction[0]==0):
  print('The news is Real')
else:
  print('The news is Fake')

[1]
The news is Fake


In [53]:
# checking whether the prediction is correct
print(Y_test[5])

1
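The cell above predicts on a row that is already vectorized. For genuinely new text, the vectorizer and the classifier have to be applied in sequence, and one possible way to do that is to wrap them in a scikit-learn pipeline. The sketch below is an alternative arrangement, not the notebook's method: the training strings, labels, and names (`toy_texts`, `toy_y`, `text_model`) are made up, and real input would first need the same cleaning and stemming that was applied to the content column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned training strings and labels (1 = fake, 0 = real).
toy_texts = [
    "shock secret cure doctor hate",
    "alien land white hous claim",
    "senat pass budget bill vote",
    "court rule tax case appeal",
]
toy_y = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together on the training text.
text_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
text_model.fit(toy_texts, toy_y)

# New text goes through the same vectorizer before it reaches the classifier.
new_headline = ["senat vote tax bill"]
toy_prediction = text_model.predict(new_headline)
print(toy_prediction)
```

A side benefit of this arrangement is that the vectorizer only ever sees the text passed to `fit`, so vocabulary from held-out data cannot leak into training.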
