<a href="https://colab.research.google.com/github/shounakk05/ML_basic_projects/blob/main/Fake_news_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Fake or Real News Predictor
### We will train a Logistic Regression model to tell whether a news item is fake or real.
### Labels used in this notebook:
### 0 -> Fake news
### 1 -> Real news
### The important steps:
1. Check for null values and replace them with 'NULL' so that they don't cause problems while training the model.
2. Use the Porter stemmer to reduce words to their stems, and remove the stopwords.
3. Use TfidfVectorizer to convert the text into feature values.

Importing the Dependencies

In [None]:
import numpy as np
import pandas as pd
import re   # Regular expressions; used below to replace non-letter characters with spaces
from nltk.corpus import stopwords   # Words that don't add much value to the text (and, in, a, the, etc.)
from nltk.stem.porter import PorterStemmer  # Strips suffixes to reduce a word to its stem
from sklearn.feature_extraction.text import TfidfVectorizer   # Converts text into feature vectors
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# Stopwords in English
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Data Preprocessing

In [None]:
fake_df = pd.read_csv('/content/Fake.csv')
real_df = pd.read_csv('/content/True.csv')

fake_df['label'] = 0
real_df['label'] = 1

df = pd.concat([fake_df, real_df], axis=0).reset_index(drop=True)    # axis=0 -> the dataframes are stacked one on top of the other
# reset_index(drop=True) -> the combined dataframe gets a fresh 0..n-1 index and the old index is discarded
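As a small illustration of the labelling and stacking step, here is a sketch on two made-up frames. The names `fake_toy` and `real_toy` and their titles are placeholders, not rows from Fake.csv or True.csv.

```python
import pandas as pd

# Toy stand-ins for fake_df and real_df; the titles are invented placeholders.
fake_toy = pd.DataFrame({"title": ["fake headline a", "fake headline b"]})
real_toy = pd.DataFrame({"title": ["real headline a"]})

fake_toy["label"] = 0   # 0 -> Fake news
real_toy["label"] = 1   # 1 -> Real news

# Without reset_index the stacked frame keeps each frame's own index.
stacked = pd.concat([fake_toy, real_toy], axis=0)
combined = stacked.reset_index(drop=True)

print(list(stacked.index))    # [0, 1, 0]
print(list(combined.index))   # [0, 1, 2]
```

The repeated index in `stacked` is why the index is reset before any row-based lookup.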

In [None]:
df.head()

Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


In [None]:
df.shape

(44898, 5)

In [None]:
#Counting number of null values
df.isnull().sum()

Unnamed: 0,0
title,0
text,0
subject,0
date,0
label,0
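
Every column reports zero nulls, so step 1 needs no action on this dataset. If a column did contain missing values, the replacement could look like the sketch below. The frame `toy` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing title and one missing text.
toy = pd.DataFrame({
    "title": ["some headline", np.nan],
    "text": [np.nan, "some body text"],
})

print(toy.isnull().sum())         # one null in each column

# Replace every missing value with the string 'NULL', as described in step 1.
toy = toy.fillna("NULL")

print(toy.isnull().sum().sum())   # 0
```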


In [None]:
# Using only the title column as the content (this dataset has no author column to merge in)
df['content'] = df['title']

In [None]:
print(df['content'])

0         Donald Trump Sends Out Embarrassing New Year’...
1         Drunk Bragging Trump Staffer Started Russian ...
2         Sheriff David Clarke Becomes An Internet Joke...
3         Trump Is So Obsessed He Even Has Obama’s Name...
4         Pope Francis Just Called Out Donald Trump Dur...
                               ...                        
44893    'Fully committed' NATO backs new U.S. approach...
44894    LexisNexis withdrew two products from Chinese ...
44895    Minsk cultural hub becomes haven from authorities
44896    Vatican upbeat on possibility of Pope Francis ...
44897    Indonesia to buy $1.14 billion worth of Russia...
Name: content, Length: 44898, dtype: object


In [None]:
# Separating features and label
X = df.drop(columns='label')
Y = df['label']

Stemming: reducing each word to its stem by stripping suffixes (the Porter stemmer does not remove prefixes)
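
A quick way to see what the stemmer does is to run it on a few single words. The sketch below uses words that appear in the first headlines of the dataset; the variable names are just for this example.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Words that occur in the first three titles shown by df.head().
words = ["sends", "embarrassing", "bragging", "started", "becomes"]
stems = [stemmer.stem(w) for w in words]

print(stems)   # ['send', 'embarrass', 'brag', 'start', 'becom']
```

Note that a stem such as "becom" is not necessarily a dictionary word; it only needs to be the same for all forms of the word.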

In [None]:
stem = PorterStemmer()

In [None]:
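stop_words = set(stopwords.words('english'))   # Build the set once instead of reloading the list for every word

def stemming(content):
  # Keep letters only, lowercase, split into words, drop stopwords, stem what is left
  stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
  stemmed_content = stemmed_content.lower()
  stemmed_content = stemmed_content.split()
  stemmed_content = [stem.stem(word) for word in stemmed_content if word not in stop_words]
  stemmed_content = ' '.join(stemmed_content)
  return stemmed_content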
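
To see what the cleanup does before the stemmer is applied, the sketch below runs the same regex, lowercase, split and filter steps on one made-up title using only the standard library. `toy_stop_words` is a tiny stand-in for `stopwords.words('english')`, `clean` is a hypothetical helper, and the stemming step is left out.

```python
import re

# Small stand-in for NLTK's English stopword list (assumption for this sketch).
toy_stop_words = {"on", "the", "won", "t"}

def clean(title):
    # Digits and punctuation (including apostrophes) become spaces.
    letters_only = re.sub('[^a-zA-Z]', ' ', title)
    words = letters_only.lower().split()
    return [w for w in words if w not in toy_stop_words]

print(clean("Breaking: 3 Senators Won't Vote On The Bill!"))
# ['breaking', 'senators', 'vote', 'bill']
```

Because punctuation is turned into spaces, abbreviations are split into single letters. That is why "U.S." in one of the titles ends up as just "u" in the stemmed output: "s" is in the stopword list and "u" is not.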

In [None]:
df['content'] = df['content'].apply(stemming)   # Apply stemming to the content column and store the result back in the same column

In [None]:
print(df['content'])

0        donald trump send embarrass new year eve messa...
1        drunk brag trump staffer start russian collus ...
2        sheriff david clark becom internet joke threat...
3            trump obsess even obama name code websit imag
4            pope franci call donald trump christma speech
                               ...                        
44893    fulli commit nato back new u approach afghanistan
44894         lexisnexi withdrew two product chines market
44895                        minsk cultur hub becom author
44896      vatican upbeat possibl pope franci visit russia
44897              indonesia buy billion worth russian jet
Name: content, Length: 44898, dtype: object


In [None]:
# Separating data and label
X = df['content'].values
Y = df['label'].values

In [None]:
print(X,Y)

['donald trump send embarrass new year eve messag disturb'
 'drunk brag trump staffer start russian collus investig'
 'sheriff david clark becom internet joke threaten poke peopl eye' ...
 'minsk cultur hub becom author'
 'vatican upbeat possibl pope franci visit russia'
 'indonesia buy billion worth russian jet'] [0 0 0 ... 1 1 1]


In [None]:
# Converting text to feature vectors using TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)

In [None]:
print(X)  # As seen in the output, each title is now a sparse vector of TF-IDF weights

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 411810 stored elements and shape (44898, 13206)>
  Coords	Values
  (0, 3404)	0.27228169452981993
  (0, 12010)	0.11874831789619875
  (0, 10341)	0.35617541225037314
  (0, 3731)	0.36585763997365384
  (0, 7866)	0.2355330109368446
  (0, 13104)	0.285897289994607
  (0, 3949)	0.46737756630771704
  (0, 7335)	0.3506307808204068
  (0, 3339)	0.4152733860823366
  (1, 12010)	0.11694584057077223
  (1, 3523)	0.4545211646819864
  (1, 1414)	0.3944388479979501
  (1, 11031)	0.40514397827310017
  (1, 11065)	0.3381297018822546
  (1, 9996)	0.28041269573559624
  (1, 2289)	0.4114009981720189
  (1, 5989)	0.3127247619981533
  (2, 10472)	0.3091623085438117
  (2, 2881)	0.3373830961140794
  (2, 2144)	0.3679456237184158
  (2, 999)	0.2899680869663155
  (2, 5960)	0.2946136857056743
  (2, 6194)	0.3180103625234714
  (2, 11717)	0.25191024945597357
  (2, 8836)	0.42473021917033454
  :	:
  (44893, 169)	0.38874140539823904
  (44894, 12115)	0.27583857506484
  (4489
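
To make the output above easier to read, here is a minimal sketch of the same vectorizer on three short stemmed strings. With scikit-learn's default settings, a term's weight is its count in the document times idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term, and each row is then scaled to unit length. The corpus and the `toy_` names are made up for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three short stemmed strings; "trump" and "pope" each occur in two of them.
toy_corpus = [
    "trump obsess obama name",
    "pope franci call trump",
    "vatican upbeat pope visit",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

print(toy_matrix.shape)   # (3, 10): 3 documents, 10 distinct terms

dense = toy_matrix.toarray()
vocab = toy_vectorizer.vocabulary_   # maps each term to its column index
row0 = dense[0]

# A term shared between documents gets a lower weight than a term unique to one.
print(row0[vocab["trump"]] < row0[vocab["obsess"]])   # True

# Every row has Euclidean length 1 because of the default L2 normalisation.
print(np.linalg.norm(dense, axis=1))
```

In the notebook's output, each `(row, column)` pair is a title index and a vocabulary index, and the value is the weight computed this way.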

Splitting dataset into training and testing data

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify = Y, random_state = 2)
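
The `stratify = Y` argument keeps the share of fake and real news the same in both splits. A minimal sketch on made-up labels shows the effect; the `toy_` arrays are assumptions for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 60 fake (0) and 40 real (1); the features are just row ids.
toy_y = np.array([0] * 60 + [1] * 40)
toy_X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=2
)

print(len(y_tr), len(y_te))       # 80 20
print(y_tr.mean(), y_te.mean())   # 0.4 0.4 -> same share of label 1 in both splits
```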

Training the Model using Logistic Regression

In [None]:
model = LogisticRegression()

In [None]:
model.fit(X_train,Y_train)

Model Evaluation

In [None]:
#Accuracy score on the training data
X_train_pred = model.predict(X_train)
train_data_accuracy = accuracy_score(Y_train, X_train_pred)   # accuracy_score expects (y_true, y_pred)

In [None]:
print("Accuracy Score of training data: ", train_data_accuracy)

Accuracy Score of training data:  0.9593518570076285
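
Accuracy is simply the fraction of predictions that match the true labels. The sketch below checks that on five made-up labels; `y_true` and `y_pred` are assumptions for this example.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])

# 3 of the 5 predictions match the true labels.
manual = np.mean(y_true == y_pred)

print(manual)                           # 0.6
print(accuracy_score(y_true, y_pred))   # 0.6
```

`accuracy_score` takes the true labels first and the predictions second. For accuracy the value is the same either way, but other metrics such as precision and recall depend on the order.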


In [None]:
#Accuracy score on the test data
X_test_pred = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_pred)   # accuracy_score expects (y_true, y_pred)

In [None]:
print("Accuracy Score of test data: ", test_data_accuracy)

Accuracy Score of test data:  0.9416481069042316
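
To use a trained model on a new title, the title has to go through the same fitted vectorizer with `transform`, not `fit_transform`. The sketch below shows the idea on a tiny made-up corpus so that it runs on its own. `predict_label` is a hypothetical helper, the `toy_` objects stand in for `vectorizer` and `model`, and the titles are invented and already stemmed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up, already-stemmed titles standing in for df['content'].
toy_titles = [
    "shock secret cure doctor hate",
    "celebr alien abduct shock video",
    "senat pass budget bill vote",
    "central bank hold interest rate",
]
toy_labels = [0, 0, 1, 1]   # 0 -> Fake news, 1 -> Real news

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_titles), toy_labels)

def predict_label(stemmed_title):
    """Hypothetical helper: vectorize one stemmed title and return 0 or 1."""
    # transform reuses the vocabulary learned during fitting
    features = toy_vectorizer.transform([stemmed_title])
    return int(toy_model.predict(features)[0])

print(predict_label("senat vote budget"))
```

With the real model, a raw title would first be passed through `stemming` before being vectorized.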
