<a href="https://colab.research.google.com/github/tiasaxena/ML-Notebooks/blob/main/Fake_News_Prediction_Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## About the dataset
      1. id: unique id for the article
      2. title: title of the article
      3. author: author of the article
      4. text: text of the article; could be incomplete
      5. label: label that marks whether the news is fake or real
            0 --> Real News
            1 --> Fake News

### Import dependencies
    1. re (regular expressions) is used to search for and replace text patterns in a document
    2. nltk = Natural Language Toolkit; corpus = the body/content of the text
    3. stemming = strips the suffix from a word and returns its root form
    4. TfidfVectorizer = converts the text into feature vectors
    5. stopwords = very common words that do not add much value to a text or a paragraph
      e.g., articles, where, what, who, how, etc.
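
As a rough illustration of item 5, the sketch below filters stopwords out of a sentence. The tiny `TOY_STOPWORDS` set, the `remove_stopwords` helper, and the sample sentence are hypothetical stand-ins so that the example runs without any download; the notebook itself relies on the full NLTK list.

```python
# Hypothetical mini stopword set (an assumption for illustration only);
# the notebook uses nltk's stopwords.words('english') instead.
TOY_STOPWORDS = {"the", "a", "an", "is", "of", "what", "who", "how", "where"}

def remove_stopwords(sentence):
    """Lowercase the sentence, split it on whitespace, and drop stopwords."""
    return [word for word in sentence.lower().split() if word not in TOY_STOPWORDS]

kept = remove_stopwords("The author of the article is unknown")
print(kept)
```

Only the words that carry some meaning about the article survive the filter.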

In [139]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import nltk

In [140]:
nltk.download('stopwords')

# Print the English stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Data Preprocessing and Collection

In [141]:
# Loading the data to the Pandas dataframe
dataset = pd.read_csv('/content/train.csv')

In [84]:
dataset.shape

(20800, 5)

In [142]:
df = pd.DataFrame(dataset)

In [86]:
df.head(5)

| | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


1. Because a large number of values are missing, we must either impute or drop them.
2. Here, we impute by replacing each missing value with an empty string.

In [143]:
# Count the number of missing values in each column
print(df.isnull().sum())

id           0
title      558
author    1957
text        39
label        0
dtype: int64


In [144]:
# Replace all the nan values with empty string
df.fillna('', inplace = True)

print(df.isnull().sum())

id        0
title     0
author    0
text      0
label     0
dtype: int64


### Since the text field is very large and slow to process, we combine only the author and title fields to predict fake vs. real news.

In [145]:
df['content'] = df['author'] + ' ' + df['title']

### Separate the X and y

In [146]:
X, y = df.drop('label', axis = 1), df['label']

### <u> Stemming </u>
<li>It is the process of reducing a word to its root word.
E.g., acting, acts --> act (root word) </li>

<li>The stemmed words are then converted into feature vectors, i.e., a numerical representation of the text. </li>

In [147]:
porter_stemmer = PorterStemmer()

### ('[^a-zA-Z]', ' ', content)
tells `re.sub` to replace every character in 'content' that is not a letter of the alphabet with a space ' '.
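
A small sketch of what this substitution does, followed by the lowercasing and splitting steps used in the next cell. The sample string is made up for illustration.

```python
import re

# Hypothetical sample text containing an apostrophe, digits, a hyphen, and punctuation
sample = "Didn't see 15 e-mails!"

# Every character that is not a-z or A-Z is replaced by a single space
cleaned = re.sub('[^a-zA-Z]', ' ', sample)

# Lowercasing and splitting on whitespace then yields the individual words
words = cleaned.lower().split()
print(words)
```

Note that contractions and hyphenated words are broken into separate pieces, since the apostrophe and the hyphen are replaced as well.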

In [148]:
# Build the stopword set once; looking words up in a set is fast
english_stopwords = set(stopwords.words('english'))

def stemming(content):
  # Replace every non-letter character (digits, punctuation, etc.) with a space
  stemmed_content = re.sub('[^a-zA-Z]', ' ', content)

  # Convert the string to lowercase
  stemmed_content = stemmed_content.lower()

  # Split the string on whitespace and put the words in a list
  stemmed_content = stemmed_content.split()

  # Drop the stopwords and stem each remaining word
  stemmed_content = [porter_stemmer.stem(word) for word in stemmed_content if word not in english_stopwords]

  # Join the stemmed words back into a single string
  stemmed_content = ' '.join(stemmed_content)

  return stemmed_content

In [149]:
print(df['content'])

df['content'] = df['content'].apply(stemming)

0        Darrell Lucus House Dem Aide: We Didn’t Even S...
1        Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2        Consortiumnews.com Why the Truth Might Get You...
3        Jessica Purkiss 15 Civilians Killed In Single ...
4        Howard Portnoy Iranian woman jailed for fictio...
                               ...                        
20795    Jerome Hudson Rapper T.I.: Trump a ’Poster Chi...
20796    Benjamin Hoffman N.F.L. Playoffs: Schedule, Ma...
20797    Michael J. de la Merced and Rachel Abrams Macy...
20798    Alex Ansary NATO, Russia To Hold Parallel Exer...
20799              David Swanson What Keeps the F-35 Alive
Name: content, Length: 20800, dtype: object


In [150]:
print(df['content'])

0        darrel lucu hous dem aid even see comey letter...
1        daniel j flynn flynn hillari clinton big woman...
2                   consortiumnew com truth might get fire
3        jessica purkiss civilian kill singl us airstri...
4        howard portnoy iranian woman jail fiction unpu...
                               ...                        
20795    jerom hudson rapper trump poster child white s...
20796    benjamin hoffman n f l playoff schedul matchup...
20797    michael j de la merc rachel abram maci said re...
20798    alex ansari nato russia hold parallel exercis ...
20799                            david swanson keep f aliv
Name: content, Length: 20800, dtype: object


In [151]:
# Separate the data and the labels
X = df.content
y = df.label

In [152]:
# X holds the 'content' column and y holds the 'label' column,
# each with one entry per row of the dataset
print(X.shape, y.shape)

(20800,) (20800,)


## Convert the textual data into meaningful numerical data using <u > <font color='yellow'>TfidfVectorizer</font> </u>
  * Tf stands for 'term frequency'.
      * It counts the number of times a word is repeated in a particular text, paragraph, or document.
      * The repetition indicates how important the word is for that document, so the word is assigned a value based on it.
  * idf stands for 'inverse document frequency'.
      * A word that is repeated across many documents often carries little meaning for telling them apart.
      * idf therefore reduces the importance value of such words.
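
The two ideas above can be sketched in plain Python on a toy corpus. This is an illustration under stated assumptions, not the library's code: the three documents are invented, and the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by unit-length scaling is my understanding of the defaults described in the scikit-learn documentation.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each string is one already-cleaned document
docs = ["trump win elect", "trump lose elect", "nato hold exercis"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency: rarer terms get larger values
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    # Term frequency (raw count) times idf, then scaled to unit length
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first_doc = tfidf(tokenized[0])
print(first_doc)
```

In the first document, "win" appears in only one document of the corpus, so it ends up with a higher weight than "trump" or "elect", which each appear in two.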


In [153]:
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

# transform will convert all the values to the feature vectors.
X = vectorizer.transform(X)

In [157]:
print(pd.DataFrame(X))

                                                       0
0        (0, 15686)\t0.28485063562728646\n  (0, 13473...
1        (0, 16799)\t0.30071745655510157\n  (0, 6816)...
2        (0, 15611)\t0.41544962664721613\n  (0, 9620)...
3        (0, 16036)\t0.2246843875919026\n  (0, 13892)...
4        (0, 16799)\t0.4295785603455899\n  (0, 15936)...
...                                                  ...
20795    (0, 16638)\t0.27132795517240194\n  (0, 15582...
20796    (0, 16996)\t0.10405723608046283\n  (0, 15295...
20797    (0, 16996)\t0.08315655906109999\n  (0, 15295...
20798    (0, 13046)\t0.22363267488270608\n  (0, 11052...
20799    (0, 14852)\t0.5677577267055112\n  (0, 8036)\...

[20800 rows x 1 columns]


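**<font color='orange'>(0, 15686)\t0.28485063562728646:</font> each entry of this form is one non-zero value of the sparse matrix. The pair gives its position (row 0, column 15686, where the column is the index of a term in the vocabulary) and the number is the TF-IDF weight of that term.**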
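
To see why only a few `(row, column)` pairs are listed, here is a minimal sketch that stores just the non-zero cells of a small matrix. The `dense` values are invented for illustration and are not taken from the dataset.

```python
# Toy dense matrix (hypothetical values): 2 documents x 4 vocabulary terms
dense = [
    [0.0, 0.7, 0.0, 0.3],
    [0.5, 0.0, 0.0, 0.0],
]

# Keep only the non-zero cells as ((row, column), value) pairs
sparse_entries = [
    ((row, col), value)
    for row, values in enumerate(dense)
    for col, value in enumerate(values)
    if value != 0.0
]

# Print them in the same "(row, column)<tab>value" layout as the output above
for (row, col), value in sparse_entries:
    print(f"({row}, {col})\t{value}")
```

Most words of the vocabulary do not occur in a given title, so storing only the non-zero weights saves a lot of memory.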

### Split the Train and Test data

In [185]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

### Training the Model: Logistic Regression

In [162]:
model = LogisticRegression()
model.fit(X_train, y_train)

### Evaluation
Accuracy Score

In [164]:
X_train_prediction = model.predict(X_train)
# accuracy_score expects the true labels first, then the predictions
X_train_accuracy_score = accuracy_score(y_train, X_train_prediction)

In [165]:
X_test_prediction = model.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
X_test_accuracy_score = accuracy_score(y_test, X_test_prediction)

print('Accuracy score for the training data: {}'.format(X_train_accuracy_score), '\n')
print('Accuracy score for the test data: {}'.format(X_test_accuracy_score), '\n')

Accuracy score for the training data: 0.9874399038461539 

Accuracy score for the test data: 0.9752403846153846 
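
The accuracy score is simply the fraction of predictions that match the true labels. A minimal sketch of that calculation, using a hypothetical `accuracy` helper and made-up labels rather than the model's real predictions:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels: 1 = fake news, 0 = real news
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

score = accuracy(y_true, y_pred)
print(score)
```

Here four of the five predictions are correct, so the score is 0.8.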



### Predictor System

In [194]:
index = 2

new_datapoint = X_test[index]
print('Expected output: {}'.format('Fake' if y_test.iloc[index] == 1 else 'Real'))

prediction = model.predict(new_datapoint)

print('____________________ PREDICTOR SYSTEM OUTPUT __________________')
# predict returns an array with one label; 0 means real news and 1 means fake news
if prediction[0] == 0:
  print("The News is Real.")
else:
  print("The News is Fake.")

Expected output: Fake
____________________ PREDICTOR SYSTEM OUTPUT __________________
The News is Fake.
