Dataset: https://www.kaggle.com/c/fake-news/data#

About the dataset:

1.   id: unique id for a news article
2.   title: the title of a news article
3.   author: author of the news article
4.   text: the text of the article; could be incomplete
5.   label: a label that marks whether the news article is real or fake
       1: fake news
       0: real news

Importing the Dependencies

In [1]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
#printing the stopwords in English
print(stopwords.words('english'))


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Data Preprocessing

In [5]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [13]:
news_dataset = pd.read_csv('/content/train.csv', engine='python', on_bad_lines='skip')



In [14]:
news_dataset.shape

(19493, 5)

In [15]:
news_dataset.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1.0
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0.0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1.0
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1.0
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1.0


In [17]:
# counting the number of missing values in the dataset
news_dataset.isnull().sum()

id           0
title      522
author    1858
text        54
label       16
dtype: int64

In [18]:
# replacing the null values with empty strings
# (this also turns missing labels into '', so the label column becomes object dtype)
news_dataset = news_dataset.fillna('')
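
Calling `fillna('')` on the whole frame also replaces the 16 missing labels with empty strings, which is why `label` is reported as `object` dtype further down. One possible alternative, sketched here on a tiny made-up frame (the rows are illustrative and not taken from the dataset), is to drop unlabeled rows and fill only the text columns:

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; values are made up, not taken from train.csv
toy = pd.DataFrame({
    'title': ['Some headline', None, 'Another headline'],
    'author': ['A. Writer', 'B. Writer', None],
    'label': [1.0, 0.0, np.nan],
})

# Drop rows without a label instead of inventing one
toy = toy.dropna(subset=['label']).copy()

# Fill only the text columns with empty strings
text_cols = ['title', 'author']
toy[text_cols] = toy[text_cols].fillna('')

# The label column stays numeric and can be cast to int
toy['label'] = toy['label'].astype(int)
print(toy)
```

Applying this to the notebook would change the row count and every downstream output, so treat it as a suggestion rather than a drop-in replacement.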

In [26]:
# merging author and title columns
news_dataset['content'] = news_dataset['title'] + ' ' + news_dataset['author']

In [27]:
news_dataset['content'].head()

0    House Dem Aide: We Didn’t Even See Comey’s Let...
1    FLYNN: Hillary Clinton, Big Woman on Campus - ...
2    Why the Truth Might Get You Fired Consortiumne...
3    15 Civilians Killed In Single US Airstrike Hav...
4    Iranian woman jailed for fictional unpublished...
Name: content, dtype: object

In [28]:
# separating data and label
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [29]:
print(X)

          id                                              title  \
0          0  House Dem Aide: We Didn’t Even See Comey’s Let...   
1          1  FLYNN: Hillary Clinton, Big Woman on Campus - ...   
2          2                  Why the Truth Might Get You Fired   
3          3  15 Civilians Killed In Single US Airstrike Hav...   
4          4  Iranian woman jailed for fictional unpublished...   
...      ...                                                ...   
19488  19473  Artificial intelligence ‘robot’ says Trump wil...   
19489  19474  Comment on Sweden Bans Christmas Street Lights...   
19490  19475  Arrests for Cannabis Possession Outnumber Arre...   
19491  19476  Clinton Vs. Trump: Latest Electoral Prediction...   
19492  19477        Comment on Links 11/6/16 by susan the other   

                   author                                               text  \
0           Darrell Lucus  House Dem Aide: We Didn’t Even See Comey’s Let...   
1         Daniel J. Flynn  Ever get

In [30]:
print(Y)

0        1.0
1        0.0
2        1.0
3        1.0
4        1.0
        ... 
19488    1.0
19489    1.0
19490    1.0
19491    1.0
19492    1.0
Name: label, Length: 19493, dtype: object


Stemming:

Stemming is the process of reducing a word to its root form.

Example: acting, acted, acts --> act
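
To make the idea concrete, here is a deliberately crude suffix stripper. It is not the Porter algorithm used in this notebook (Porter applies many ordered rules with conditions on the remaining stem); the suffix list and the `toy_stem` name are assumptions made purely for illustration.

```python
# Hypothetical suffix list, checked longest-first; not the Porter rule set
SUFFIXES = ('ing', 'ed', 's')

def toy_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ['acting', 'acted', 'acts', 'act']:
    print(w, '->', toy_stem(w))
```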

In [31]:
port_stem = PorterStemmer()

In [32]:
# build the stopword set once; a set lookup is much faster than re-reading the list for every word
stop_words = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, lowercase, and split into words
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower().split()
    # drop stopwords and stem what remains
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    return ' '.join(stemmed_content)

In [33]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [34]:
print(news_dataset['content'])

0        hous dem aid even see comey letter jason chaff...
1        flynn hillari clinton big woman campu breitbar...
2                   truth might get fire consortiumnew com
3        civilian kill singl us airstrik identifi jessi...
4        iranian woman jail fiction unpublish stori wom...
                               ...                        
19488          artifici intellig robot say trump win admin
19489    comment sweden ban christma street light avoid...
19490    arrest cannabi possess outnumb arrest violent ...
19491    clinton vs trump latest elector predict greg l...
19492                             comment link susan susan
Name: content, Length: 19493, dtype: object


In [35]:
# separating the data and label
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [36]:
print(X)

['hous dem aid even see comey letter jason chaffetz tweet darrel lucu'
 'flynn hillari clinton big woman campu breitbart daniel j flynn'
 'truth might get fire consortiumnew com' ...
 'arrest cannabi possess outnumb arrest violent crime combin mind unleash'
 'clinton vs trump latest elector predict greg laden blog scienc technolog beforeitsnew com'
 'comment link susan susan']


In [37]:
print(Y)

[1.0 0.0 1.0 ... 1.0 1.0 1.0]


In [38]:
Y.shape

(19493,)

In [39]:
# converting the textual data to numerical data
vectorize = TfidfVectorizer()
vectorize.fit(X)
X = vectorize.transform(X)
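
As a rough illustration of what `TfidfVectorizer` computes with its default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row), here is a small pure-Python sketch. The two toy documents and the `tfidf` helper are made up for the example; the real vectorizer also handles tokenization and sparse storage.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

vocab, rows = tfidf(['trump win', 'clinton win'])
for row in rows:
    print(dict(zip(vocab, (round(v, 3) for v in row))))
```

A term that appears in fewer documents ("trump") receives a larger weight than one that appears in all of them ("win"), which is the effect the printed sparse matrix below reflects.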

In [40]:
print(X)

  (0, 15276)	0.28460034195899114
  (0, 13115)	0.2560043545105149
  (0, 8676)	0.3616574512551678
  (0, 8409)	0.2907523644288201
  (0, 7502)	0.24855543206681954
  (0, 6838)	0.22038310127327032
  (0, 4856)	0.23447967966361913
  (0, 3714)	0.2701594839833595
  (0, 3527)	0.35794655074972637
  (0, 2899)	0.2482829170714848
  (0, 2430)	0.37021574571023125
  (0, 264)	0.2701594839833595
  (1, 16348)	0.3007230235422175
  (1, 6654)	0.19036042306514397
  (1, 5379)	0.7145102238525571
  (1, 3496)	0.26452552344645064
  (1, 2754)	0.19112954057685105
  (1, 2173)	0.383212638656324
  (1, 1853)	0.15504721074894226
  (1, 1466)	0.29224629104498306
  (2, 15203)	0.4162204192789078
  (2, 9367)	0.49075367311518386
  (2, 5824)	0.34633457677021035
  (2, 5266)	0.3859237812954556
  (2, 3041)	0.46473210318610936
  :	:
  (19490, 15840)	0.2812839917695623
  (19490, 15498)	0.30172159081670763
  (19490, 11372)	0.34660898406610874
  (19490, 10572)	0.38524821850028584
  (19490, 9414)	0.2659412393174407
  (19490, 3297)	0.246

Splitting the dataset into training and test data

In [45]:
# X is a scipy sparse matrix; converting it to a dense array is memory-hungry
# (rows x vocabulary floats), so this only works while the data fits in RAM
X_dense = X.toarray()

# Convert to pandas objects to inspect and fix data types
X_df = pd.DataFrame(X_dense)
Y_df = pd.Series(Y)  # Y is a NumPy array of labels, not a sparse matrix

# Check and print data types
print(X_df.dtypes)
print(Y_df.dtypes)

# Convert to numeric; values that cannot be parsed (such as '') become NaN
X_df = X_df.apply(pd.to_numeric, errors='coerce')
Y_df = pd.to_numeric(Y_df, errors='coerce')

# Fill NaN values; note that this assigns label 0 (real) to rows whose label was missing
X_df = X_df.fillna(0)
Y_df = Y_df.fillna(0)

# Verify data types and check for NaN values
print(X_df.dtypes)
print(Y_df.dtypes)
print(X_df.isnull().sum())
print(Y_df.isnull().sum())

# Stratified 80/20 split so both sets keep the same real/fake ratio
X_train, X_test, Y_train, Y_test = train_test_split(X_df, Y_df, test_size=0.2, stratify=Y_df, random_state=2)


0        float64
1        float64
2        float64
3        float64
4        float64
          ...   
16660    float64
16661    float64
16662    float64
16663    float64
16664    float64
Length: 16665, dtype: object
object
0        float64
1        float64
2        float64
3        float64
4        float64
          ...   
16660    float64
16661    float64
16662    float64
16663    float64
16664    float64
Length: 16665, dtype: object
float64
0        0
1        0
2        0
3        0
4        0
        ..
16660    0
16661    0
16662    0
16663    0
16664    0
Length: 16665, dtype: int64
0
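
`stratify=Y_df` asks `train_test_split` to keep the class proportions roughly equal in both splits. The idea can be sketched in plain Python as below; `stratified_split` is a hypothetical helper written for illustration and does not reproduce scikit-learn's exact shuffling.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut the same fraction from each
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 10 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), 'fake articles in the test split')
```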


Training the Model: Logistic Regression

In [46]:
model = LogisticRegression()

In [47]:
model.fit(X_train, Y_train)

Evaluation:

*   Accuracy Score

In [48]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [49]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.9856355008336539


In [50]:
# accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [51]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.9756347781482432
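
Accuracy is simply the fraction of predictions that match the true labels. A minimal pure-Python equivalent of that definition (the label lists below are made up for illustration):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))  # 4 of 5 match -> 0.8
```

Because the comparison is symmetric, swapping the two arguments does not change the score, although metrics such as precision and recall do depend on the order.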


Making a Predictive System

In [56]:
X_new = X_test.iloc[3].values.reshape(1, -1)

prediction = model.predict(X_new)
print(prediction)

if prediction[0] == 0:
    print('The news is Real')
else:
    print('The news is Fake')

[0.]
The news is Real


In [58]:
print(Y_test.iloc[3])

0.0
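
The cells above classify a row that was already vectorized. To classify a new, raw headline, the same text-to-vector step has to be applied first. One way to keep both steps together is a scikit-learn `Pipeline`; the sketch below uses four made-up training strings and labels purely for illustration and is not the notebook's trained model. In the notebook, the input would also need to go through `stemming` first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up, already-stemmed toy data (1 = fake, 0 = real)
texts = [
    'shock secret cure doctor hate',
    'alien land white hous claim',
    'senat pass budget bill vote',
    'court rule tax case appeal',
]
labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# New text is vectorized with the fitted vocabulary before prediction
new_text = ['senat vote tax bill']
prediction = clf.predict(new_text)
print('The news is Real' if prediction[0] == 0 else 'The news is Fake')
```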
