To complete this project, I followed the steps below:

<ol>
  <li>Collected the raw datasets from <b><code>Kaggle</code></b></li>
  <li>Imported the necessary <b><code>libraries</code></b></li>
  <li>Loaded the datasets using <b><code>Pandas</code></b></li>
  <li>Built a classification model using a <b><code>machine learning algorithm</code></b> (Multinomial Naive Bayes)</li>
  <li>Evaluated the model</li>
  <li>Created a summary</li>
</ol>


In [1]:
import numpy as np
import os
import matplotlib.pyplot as plt
import pandas as pd
import warnings

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import MultinomialNB

import nltk
import re # Regular Expression
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

"""
Stemming:      Closed, Closing --> Close  (rule-based suffix stripping; the output need not be a real word)
Lemmatization: Closed, Closing --> Close  (dictionary lookup of the base form, or lemma)

"""

warnings.filterwarnings('ignore')
%matplotlib inline
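
The difference between stemming and lemmatization noted in the cell above can be illustrated with a toy sketch. This is not the Porter stemmer or WordNet: the suffix list `SUFFIXES` and the lookup table `LEMMAS` are hypothetical stand-ins chosen only to show the two ideas side by side.

```python
# Toy illustration only: NOT the Porter stemmer or WordNet.
SUFFIXES = ("ing", "ed", "er", "s")  # hypothetical rule list


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical miniature lemma dictionary standing in for WordNet.
LEMMAS = {"closed": "close", "closing": "close", "closer": "close", "mice": "mouse"}


def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)


words = ["closed", "closing", "closer", "mice"]
stems = [toy_stem(w) for w in words]        # rule-based, may not be real words
lemmas = [toy_lemmatize(w) for w in words]  # dictionary forms
print(stems)
print(lemmas)
```

The rule-based version cannot handle irregular forms such as "mice", while the dictionary-based version can, at the cost of needing a vocabulary.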

<h4><p style="color:red;"> Loading the datasets using the Pandas library</p></h4>

In [2]:
fakeDatasets = pd.read_csv("../Datasets/Fake.csv")
trueDatasets = pd.read_csv("../Datasets/True.csv")

<b style="color:red;"><h4> First two rows of the Fake News dataset </h4></b>

In [3]:
fakeDatasets.head(2)

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"


<b style="color:red;"><h4> First two rows of the True News dataset </h4></b>

In [4]:
trueDatasets.head(2)

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"


<b style="color:red;"><h4>Information about the Fake dataset</h4></b>

In [5]:
fakeDatasets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    23481 non-null  object
 1   text     23481 non-null  object
 2   subject  23481 non-null  object
 3   date     23481 non-null  object
dtypes: object(4)
memory usage: 733.9+ KB


<b style="color:red;"><h4>Information about the True dataset </h4></b>

In [6]:
trueDatasets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    21417 non-null  object
 1   text     21417 non-null  object
 2   subject  21417 non-null  object
 3   date     21417 non-null  object
dtypes: object(4)
memory usage: 669.4+ KB


<b style="color:red;"><h4>Summary statistics of the Fake dataset </h4></b>

In [7]:
fakeDatasets.describe()

Unnamed: 0,title,text,subject,date
count,23481,23481.0,23481,23481
unique,17903,17455.0,6,1681
top,MEDIA IGNORES Time That Bill Clinton FIRED His...,,News,"May 10, 2017"
freq,6,626.0,9050,46


<b style="color:red;"><h4>Summary statistics of the True dataset</h4></b>

In [8]:
trueDatasets.describe()

Unnamed: 0,title,text,subject,date
count,21417,21417,21417,21417
unique,20826,21192,2,716
top,Factbox: Trump fills top jobs for his administ...,(Reuters) - Highlights for U.S. President Dona...,politicsNews,"December 20, 2017"
freq,14,8,11272,182


In [9]:
trueDatasets['target'] = 1
fakeDatasets['target'] = 0

<b style="color:orange;"><h4> Let's verify the column values of the True news dataset</h4></b>

In [10]:
trueDatasets.head(1)

Unnamed: 0,title,text,subject,date,target
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1


<b style="color:orange;"><h4> Let's verify the column values of the Fake news dataset</h4></b>

In [11]:
fakeDatasets.head(1)

Unnamed: 0,title,text,subject,date,target
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0


<b style="color:blue;"><h4> Number of articles per subject in the True news dataset</h4></b>

In [12]:
trueDatasets['subject'].value_counts()

politicsNews    11272
worldnews       10145
Name: subject, dtype: int64

<b style="color:blue;"><h4> Number of articles per subject in the Fake news dataset</h4></b>

In [13]:
fakeDatasets['subject'].value_counts()

News               9050
politics           6841
left-news          4459
Government News    1570
US_News             783
Middle-east         778
Name: subject, dtype: int64

In [14]:
fakeDatasets = fakeDatasets[['text', 'target']]
trueDatasets = trueDatasets[['text', 'target']]

In [15]:
fakeDatasets.head(1)

Unnamed: 0,text,target
0,Donald Trump just couldn t wish all Americans ...,0


In [16]:
trueDatasets.head(1)

Unnamed: 0,text,target
0,WASHINGTON (Reuters) - The head of a conservat...,1


<b style="color:orange;"><h4> Now we will concatenate the True and Fake datasets </h4></b>

In [17]:
datasets = pd.concat([trueDatasets, fakeDatasets])

In [18]:
datasets.target.value_counts()

0    23481
1    21417
Name: target, dtype: int64

<b style="color:orange;"><h4> Check whether there are any null values </h4></b>

In [19]:
datasets.isnull().sum()

text      0
target    0
dtype: int64

<b style="color:blue;"><h4> Now we will sample all rows so that the merged dataset is shuffled</h4></b>

In [20]:
datasets = datasets.sample(frac=1)
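
Calling `sample(frac=1)` without a seed gives a different row order on every run, so results downstream may not be exactly reproducible. A minimal sketch of a seeded shuffle follows, assuming a small stand-in frame; the seed value `42` and the names `frame`, `shuffled`, and `again` are illustrative, not part of the notebook.

```python
import pandas as pd

# Small stand-in for the merged dataset (hypothetical values).
frame = pd.DataFrame({"text": list("abcdef"), "target": [1, 1, 1, 0, 0, 0]})

# random_state fixes the permutation; reset_index(drop=True) renumbers rows 0..n-1.
shuffled = frame.sample(frac=1, random_state=42).reset_index(drop=True)
again = frame.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled)
print(shuffled.equals(again))  # the same seed gives the same order
```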

<b style="color:red;"> Now we will use Natural Language Processing techniques to clean the text data so that the model can perform better</b>

In [21]:
lemmatizer = WordNetLemmatizer()
stopword = stopwords.words('english')

In [22]:
datasets['text'].loc[1]

1    WASHINGTON (Reuters) - Transgender people will...
1    House Intelligence Committee Chairman Devin Nu...
Name: text, dtype: object
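
The lookup above returns two rows for the label `1` because `pd.concat` keeps each frame's original index by default, so every label that exists in both frames appears twice. A small sketch, using hypothetical two-row frames, shows the effect and how `ignore_index=True` would give each row a unique label instead.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames.
true_df = pd.DataFrame({"text": ["t0", "t1"], "target": [1, 1]})
fake_df = pd.DataFrame({"text": ["f0", "f1"], "target": [0, 0]})

stacked = pd.concat([true_df, fake_df])                        # index: 0, 1, 0, 1
renumbered = pd.concat([true_df, fake_df], ignore_index=True)  # index: 0, 1, 2, 3

dup_rows = stacked["text"].loc[1]    # Series with two rows, one from each frame
one_row = renumbered["text"].loc[1]  # a single value

print(dup_rows)
print(one_row)
```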

In [23]:
datasets.head()

Unnamed: 0,text,target
11760,SEOUL (Reuters) - A spokesman for North Korea ...,1
22342,Episode #162 of SUNDAY WIRE SHOW resumes this ...,0
21328,YANGON (Reuters) - Muslim militants in Myanmar...,1
18598,The classless Democrats are at it again. The o...,0
20206,PARIS (Reuters) - French Prime Minister Edouar...,1


In [24]:
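def cleanTextData(text):
    """Lowercase the text, keep letters only, and lemmatize each token."""
    text = text.lower()
    # Replace every non-letter character with a space
    text = re.sub('[^a-zA-Z]', " ", text)
    tokens = text.split()
    # WordNetLemmatizer treats each word as a noun unless a POS tag is passed.
    # Note: the `stopword` list loaded above is not applied here.
    lemmas = [lemmatizer.lemmatize(word) for word in tokens]
    return " ".join(lemmas)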

In [25]:
cleanTextData("Hi Mejbah ahammad How#$^%&^*(^$ are you?")

'hi mejbah ahammad how are you'
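
The notebook loads NLTK's English stop-word list into `stopword`, but the cleaning function does not filter on it. If stop-word removal were wanted, it could be added along these lines. This is only a sketch: `STOP_WORDS` is a hypothetical miniature set used so the example runs without NLTK data, and `clean_without_stopwords` is an illustrative name.

```python
import re

# Hypothetical miniature stop-word set for illustration only; the notebook
# loads the full English list from NLTK into `stopword`.
STOP_WORDS = {"how", "are", "you", "the", "a", "is"}


def clean_without_stopwords(text, stop_words=STOP_WORDS):
    """Lowercase, keep letters only, and drop stop words."""
    text = re.sub("[^a-zA-Z]", " ", text.lower())
    return " ".join(word for word in text.split() if word not in stop_words)


cleaned = clean_without_stopwords("Hi Mejbah ahammad How#$^%&^*(^$ are you?")
print(cleaned)  # hi mejbah ahammad
```

Whether removing stop words helps here is an empirical question; TF-IDF already down-weights very common terms.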

In [26]:
datasets['text'] = datasets['text'].apply(cleanTextData)

In [27]:
datasets['text'].loc[1]

1    washington reuters transgender people will be ...
1    house intelligence committee chairman devin nu...
Name: text, dtype: object

<b style="color:red;">Now we are going to apply TF-IDF (Term Frequency - Inverse Document Frequency)</b>

In [28]:
# 1. Uni-gram (single words)
# 2. Bi-gram (pairs of adjacent words)
# 3. Tri-gram (triples of adjacent words)
# 4. n-gram (sequences of n adjacent words)

vectorizer = TfidfVectorizer(max_features=50000, lowercase=False, ngram_range=(1, 2))

In [29]:
vectorizer

TfidfVectorizer(lowercase=False, max_features=50000, ngram_range=(1, 2))

In [30]:
len(datasets)

44898

In [31]:
# datasets.iloc[:35000, 1]

In [32]:
xdatasets = datasets.iloc[:35000, 0]
ydatasets = datasets.iloc[:35000, 1]

In [33]:
x_train, x_test, y_train, y_test = train_test_split(xdatasets,
                                                   ydatasets, 
                                                   random_state=0, 
                                                   test_size=0.2)
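
The split above relies on random sampling to keep the two classes balanced across train and test. As an optional variation, not what the notebook does, `train_test_split` accepts a `stratify` argument that preserves the class proportions in both parts. The toy data below is hypothetical.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 6 samples of class 0 and 4 of class 1.
texts = [f"doc {i}" for i in range(10)]
labels = [0] * 6 + [1] * 4

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels
)

test_counts = Counter(y_te)  # the 60/40 class ratio is kept in the test half
print(test_counts)
```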

In [34]:
vectorizer_train = vectorizer.fit_transform(x_train)

In [35]:
vectorizer_train = vectorizer_train.toarray()

In [36]:
vectorizer_test = vectorizer.transform(x_test).toarray()

In [37]:
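# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
feature_names = vectorizer.get_feature_names_out()
trainData = pd.DataFrame(vectorizer_train, columns=feature_names)
testData = pd.DataFrame(vectorizer_test, columns=feature_names)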

In [38]:
mlbn = MultinomialNB()

In [39]:
mlbn.fit(trainData, y_train)

MultinomialNB()
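
Converting the TF-IDF output with `.toarray()` builds a dense array; with about 28,000 training rows and up to 50,000 features stored as 64-bit floats, that can approach 11 GB. `MultinomialNB` also accepts the sparse matrix directly, so one alternative is to chain the two steps in a `Pipeline`. The sketch below uses a hypothetical four-document corpus and is not the notebook's own setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 0 = fake, 1 = true.
texts = [
    "shocking secret exposed",
    "you will not believe this shocking hoax",
    "senate passes budget bill",
    "minister announces budget vote",
]
labels = [0, 0, 1, 1]

# The vectorizer's sparse output is passed straight to the classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=False)),
    ("nb", MultinomialNB()),
])
model.fit(texts, labels)

pred = model.predict(["shocking hoax exposed", "senate budget vote"])
print(pred)
```

A pipeline also guarantees that the vectorizer is fitted on the training text only and then reused unchanged at prediction time.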

In [40]:
predictions = mlbn.predict(testData)

In [41]:
print("Predicted Result {}".format(predictions))

Predicted Result [1 1 0 ... 1 1 0]


In [42]:
print("Classification Report For the Models is \n\n{}".format(classification_report(y_test, predictions)))

Classification Report For the Models is 

              precision    recall  f1-score   support

           0       0.96      0.96      0.96      3683
           1       0.95      0.95      0.95      3317

    accuracy                           0.95      7000
   macro avg       0.95      0.95      0.95      7000
weighted avg       0.95      0.95      0.95      7000
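
For reference, the precision, recall, and F1 columns in the report are computed per class from counts of true positives, false positives, and false negatives. A minimal hand calculation on hypothetical labels:

```python
# Hypothetical true and predicted labels for eight articles.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

positive = 1  # compute the metrics for class 1
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == positive and p == positive)
fp = sum(1 for t, p in pairs if t != positive and p == positive)
fn = sum(1 for t, p in pairs if t == positive and p != positive)

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)  # 0.75 0.75 0.75
```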



In [43]:
predictions_train = mlbn.predict(trainData)

In [44]:
print("Train Accuracy is {}%".format(round(accuracy_score(y_train, predictions_train),2)*100))

Train Accuracy is 96.0%


In [45]:
print("Test Accuracy is {}%".format(round(accuracy_score(y_test, predictions),2)*100))

Test Accuracy is 95.0%
