
## Importing Libraries
We’ll start by importing the libraries we need for this task: pandas for loading and shaping the data, spaCy for tokenization, and scikit-learn for vectorization and modeling.

In [41]:
import pandas as pd
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline

## Loading Data

In [42]:
# Read data from S3 Bucket
path_true= 'https://fakenewsproject4.s3.amazonaws.com/trueNews.csv'
path_fake= 'https://fakenewsproject4.s3.amazonaws.com/fakeNews.csv'
true_df= pd.read_csv(path_true)
fake_df= pd.read_csv(path_fake)

In [None]:
# Display top 5 rows in the true_df
true_df.head(5)

Unnamed: 0,Date Posted,Link,Text,Region,Username,Publisher,Label
0,2/11/20,https://twitter.com/the_hindu/status/122725962...,Just in: Novel coronavirus named 'Covid-19': U...,India,the_hindu,The Hindu,1
1,2/12/20,https://twitter.com/ndtv/status/12274908434742...,WHO officially names #coronavirus as Covid-19....,India,ndtv,NDTV,1
2,2/12/20,https://twitter.com/the_hindu/status/122744471...,"The #UN #health agency announced that ""COVID-1...",India,the_hindu,The Hindu,1
3,2/14/20,https://twitter.com/IndiaToday/status/12282764...,The Indian Embassy in Tokyo has said that one ...,India,indiatoday,IndiaToday,1
4,2/15/20,https://twitter.com/the_hindu/status/122854247...,Ground Zero | How Kerala used its experience i...,India,the_hindu,The Hindu,1


In [None]:
# Display top 5 rows in the fake_df
fake_df.head(5)

Unnamed: 0,Date Posted,Link,Text,Region,Country,Explanation,Origin,Origin_URL,Fact_checked_by,Poynter_Label,Binary Label
0,2/7/20,https://www.poynter.org/?ifcn_misinformation=t...,Tencent revealed the real number of deaths.\t\t,Europe,France,The screenshot is questionable.,Twitter,https://www.liberation.fr/checknews/2020/02/07...,CheckNews,Misleading,0
1,2/7/20,https://www.poynter.org/?ifcn_misinformation=t...,Taking chlorine dioxide helps fight coronavir...,Europe,Germany,Chlorine dioxide does guard against the coron...,Website,https://correctiv.org/faktencheck/medizin-und-...,Correctiv,FALSE,0
2,2/7/20,https://www.poynter.org/?ifcn_misinformation=t...,This video shows workmen uncovering a bat-inf...,India,India,A video shows bats nesting in the roof; howev...,Facebook,https://factcheck.afp.com/video-shows-workmen-...,AFP,MISLEADING,0
3,2/7/20,https://www.poynter.org/?ifcn_misinformation=t...,The Asterix comic books and The Simpsons pred...,India,India,Coronavirus has been around since the 1960s. ...,Twitter,https://www.boomlive.in/health/did-the-simpson...,BOOM FactCheck,Misleading,0
4,2/7/20,https://www.poynter.org/?ifcn_misinformation=c...,Chinese President Xi Jinping visited a mosque...,India,India,Chinese President Xi Jinping's visit to the m...,Facebook,http://newsmobile.in/articles/2020/02/07/chine...,NewsMobile,FALSE,0


## Pre-Processing

In [None]:
#Rename Binary column
fake_df.rename(columns ={"Binary Label":"Label"}, inplace=True)

In [None]:
# Drop columns that are not required
fake = fake_df.drop(["Date Posted",'Link','Country','Origin_URL', 'Fact_checked_by','Poynter_Label'], axis=1)
fake

Unnamed: 0,Text,Region,Explanation,Origin,Label
0,Tencent revealed the real number of deaths.\t\t,Europe,The screenshot is questionable.,Twitter,0
1,Taking chlorine dioxide helps fight coronavir...,Europe,Chlorine dioxide does guard against the coron...,Website,0
2,This video shows workmen uncovering a bat-inf...,India,A video shows bats nesting in the roof; howev...,Facebook,0
3,The Asterix comic books and The Simpsons pred...,India,Coronavirus has been around since the 1960s. ...,Twitter,0
4,Chinese President Xi Jinping visited a mosque...,India,Chinese President Xi Jinping's visit to the m...,Facebook,0
...,...,...,...,...,...
3790,Bill Gates said that the COVID-19 vaccine wil...,Europe,The new RNA and DNA vaccine candidates are ex...,Social Media and Websites,0
3791,COVID-19 vaccine candidates will insert micro...,Europe,The hoax comes from a misinterpretation of a ...,Whatsapp and Facebook,0
3792,An image claims that chroma screen panels are...,Europe,The image has been manipulated. The real one ...,Social Media,0
3793,"Alexandria Ocasio-Cortez tweeted, ""It's vital...",United States,Alexandria Ocasio-Cortez didn't tweet this.,Viral image,0


In [None]:
# Drop Columns not required
true = true_df.drop(["Date Posted",'Link','Username', 'Publisher'], axis=1)
true

Unnamed: 0,Text,Region,Label
0,Just in: Novel coronavirus named 'Covid-19': U...,India,1
1,WHO officially names #coronavirus as Covid-19....,India,1
2,"The #UN #health agency announced that ""COVID-1...",India,1
3,The Indian Embassy in Tokyo has said that one ...,India,1
4,Ground Zero | How Kerala used its experience i...,India,1
...,...,...,...
3788,Global COVID-19 prevention trial of hydroxychl...,Europe,1
3789,Bavaria's free COVID-19 test for all splits Ge...,Europe,1
3790,Britain locks down city of Leicester after COV...,Europe,1
3791,UK imposes lockdown on city of Leicester to cu...,Europe,1


In [None]:
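# Merge and randomly shuffle into a single dataframe
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
news_df = pd.concat([fake, true]).sample(frac=1).reset_index(drop=True)
news_df = news_df.drop(columns=['Region', 'Explanation', 'Origin'])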
news_df.head(10)

Unnamed: 0,Text,Label
0,Quarantine may last for several years.\t\t,0
1,"""While California is dying ... Gavin (Newsom)...",0
2,Antibody levels in recovered COVID-19 patients...,1
3,People in home quarantine in #Maharashtra to b...,1
4,With air and rail passenger services suspended...,1
5,Uttarakhand man gets COVID positive text on tr...,1
6,"""Coronavirus",0
7,Health Ministry directed that anyone diagnosed...,1
8,"Denied a diploma, April Dunn made sure other s...",1
9,The world's wealthiest nations poured unpreced...,1


In [None]:
# shape of combined dataframe
news_df.shape

(7588, 2)

## Tokenizing the Data With spaCy
Now that we know what we’re working with, let’s create a custom tokenizer function using spaCy. We’ll use this function to automatically strip information we don’t need, like stopwords and punctuation, from each news text.

We’ll start by importing the English language class and stopword list from spaCy, as well as Python’s string module, which provides all ASCII punctuation marks in string.punctuation. We’ll create variables that hold the punctuation marks and stopwords we want to remove, and a parser that runs input through spaCy’s English module.

Then, we’ll create a spacy_tokenizer() function that accepts a sentence as input and processes it into tokens, performing lemmatization and lowercasing and removing stop words and punctuation. This puts every preprocessing step into a single function that we can apply to each news text we’re analyzing.

In [None]:
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

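# Create our list of punctuation marks
punctuations = string.punctuation

# Create our list of stopwords
# Note: the 'en' shortcut exists only in spaCy 2.x; on spaCy 3.x load an
# installed model by its full name instead, e.g. spacy.load('en_core_web_sm')
nlp = spacy.load('en')
stop_words = STOP_WORDS

# Create a blank English pipeline (tokenizer only; no tagger, parser, or NER)
# Note: on spaCy 3.x a blank pipeline leaves lemma_ empty, so use a loaded
# model there (parser = nlp) to get lemmas
parser = English()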

# Creating our tokenizer function
def spacy_tokenizer(sentence):
    # Creating our token object, which is used to create documents with linguistic annotations.
    mytokens = parser(sentence)

    # Lemmatizing each token and converting each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Removing stop words
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens
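
To make the steps inside spacy_tokenizer() concrete without requiring a spaCy install, here is a minimal pure-Python sketch of the same idea: lowercase, strip punctuation, and drop stopwords. The TINY_STOP_WORDS set and the simple_tokenizer name are hypothetical stand-ins, and this version splits on whitespace and skips lemmatization, so it only approximates what spaCy would produce.

```python
import string

# Hypothetical, deliberately tiny stopword set (spaCy's STOP_WORDS is far larger)
TINY_STOP_WORDS = {"the", "is", "a", "an", "of", "as", "that", "to"}

def simple_tokenizer(sentence):
    """Whitespace-split, lowercase, strip punctuation, and drop stopwords."""
    tokens = []
    for raw in sentence.split():
        # Lowercase and trim punctuation from both ends of the token
        word = raw.lower().strip(string.punctuation)
        # Keep only non-empty tokens that are not stopwords
        if word and word not in TINY_STOP_WORDS:
            tokens.append(word)
    return tokens

tokens = simple_tokenizer("WHO officially names #coronavirus as Covid-19.")
print(tokens)
```

Because the hashtag symbol and the trailing period are trimmed and "as" is in the toy stopword set, only the content-bearing words remain.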

## Defining a Custom Transformer

To further clean our text data, we’ll also create a custom transformer that removes leading and trailing whitespace and converts text to lowercase. Here, we create a custom predictors class that inherits from TransformerMixin and defines the transform, fit, and get_params methods. We’ll also create a clean_text() function that strips the whitespace and lowercases the text.

In [None]:
# Custom transformer that cleans each text before vectorization
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()

## Vectorization Feature Engineering (TF-IDF)

When we classify text, we end up with text snippets matched with their respective labels. But we can’t simply use text strings in our machine learning model; we need a way to convert our text into something that can be represented numerically just like the labels (1 for true and 0 for fake). 

One tool we can use for doing this is called Bag of Words. BoW converts text into a matrix of word occurrences within each document. It records how many times each word occurs in a document while ignoring word order, and it generates a matrix that we might see referred to as a BoW matrix or a document-term matrix.

We can generate a BoW matrix for our text data by using scikit-learn’s CountVectorizer. In the code below, we’re telling CountVectorizer to use the custom spacy_tokenizer function we built as its tokenizer, and defining the n-gram range we want.

N-grams are combinations of adjacent words in a given text, where n is the number of words included in each token. The ngram_range parameter we’ll use in the code below sets the lower and upper bounds of our n-grams; with (1,1) we’ll be using unigrams only. Then we’ll assign the vectorizer to bow_vector.

In [None]:
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
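
As a rough illustration of what a document-term matrix and n-grams look like, the sketch below builds both by hand. The function names ngrams and bow_matrix and the two example sentences are made up for this illustration, and tokenization is plain whitespace splitting rather than spacy_tokenizer.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bow_matrix(docs, n=1):
    """Build a sorted vocabulary and a document-term count matrix."""
    tokenized = [ngrams(doc.lower().split(), n) for doc in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        # One row per document, one column per vocabulary term
        rows.append([counts[term] for term in vocab])
    return vocab, rows

docs = ["masks stop the virus", "the virus is the threat"]  # made-up examples
vocab, rows = bow_matrix(docs)
print(vocab)   # ['is', 'masks', 'stop', 'the', 'threat', 'virus']
print(rows)    # [[0, 1, 1, 1, 0, 1], [1, 0, 0, 2, 1, 1]]

bigrams = ngrams("the virus is the threat".split(), 2)
print(bigrams) # ['the virus', 'virus is', 'is the', 'the threat']
```

Each row is one document and each column one vocabulary term; the 2 in the second row reflects "the" appearing twice in that sentence.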

## TF-IDF

We’ll also want to look at the TF-IDF (Term Frequency-Inverse Document Frequency) for our terms. It’s a way of normalizing our Bag of Words (BoW) by weighing each word’s frequency in a document against the number of documents that contain it. In other words, it represents how important a particular term is in the context of a given document, based on how many times the term appears and how many other documents that same term appears in. The higher the TF-IDF, the more important that term is to that document. Note that the pipelines in this notebook use bow_vector; tfidf_vector is defined here so it can be swapped in as the vectorizer step.

In [33]:
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)
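
The weighting can be worked through by hand. The sketch below assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 row normalization, which scikit-learn documents as the TfidfVectorizer defaults (smooth_idf=True, norm='l2'). The tfidf function name, the whitespace tokenization, and the example sentences are assumptions for illustration, so the numbers will not match what the notebook's vectorizer produces.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocab]
        # Scale each row to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, idf, rows

docs = ["masks stop the virus", "the virus is the threat"]  # made-up examples
vocab, idf, rows = tfidf(docs)
for term in vocab:
    print(f"{term:>7}  idf={idf[term]:.3f}")
```

Terms that occur in both documents ("the", "virus") receive the minimum IDF of 1, while terms unique to one document are weighted higher.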

## Splitting The Data into Training and Test Sets

We’re trying to build a classification model, but we need a way to know how it’s actually performing. Dividing the dataset into a training set and a test set is the tried-and-true method for doing this. Here we hold out 20% of the data for testing (test_size=0.2).

In [37]:
from sklearn.model_selection import train_test_split

X = news_df['Text'] # the features we want to analyze
ylabels = news_df['Label'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.2)
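
Conceptually, a hold-out split shuffles the rows and sets a fraction aside. The sketch below is a simplified stand-in, not scikit-learn's implementation; simple_train_test_split, its seed argument, and the toy data are hypothetical names introduced for illustration.

```python
import random

def simple_train_test_split(items, labels, test_size=0.2, seed=42):
    """Shuffle indices reproducibly, then hold out a test_size fraction."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)      # seeded, so the split is repeatable
    n_test = round(len(items) * test_size)    # number of held-out examples
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [items[i] for i in train_idx]
    x_test = [items[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

texts = [f"text {i}" for i in range(10)]      # made-up stand-ins for news_df['Text']
labels = [i % 2 for i in range(10)]           # made-up stand-ins for news_df['Label']
x_tr, x_te, y_tr, y_te = simple_train_test_split(texts, labels)
print(len(x_tr), len(x_te))                   # 8 2
```

The train_test_split call in this notebook does not pass random_state, so its split, and therefore the scores reported below, will vary from run to run; passing a fixed random_state makes the split repeatable.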

## Creating a Pipeline and Generating the Models

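Now that we’re all set up, it’s time to actually build our model! We’ll start by importing the LogisticRegression class and creating a classifier object. Then we’ll chain the text cleaner, the Bag of Words vectorizer, and the classifier into a Pipeline and fit it on the training data.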

In [39]:
# Logistic Regression Classifier
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()

# Create pipeline using Bag of Words
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])

# model generation
pipe.fit(X_train,y_train)

Pipeline(steps=[('cleaner', <__main__.predictors object at 0x7f70a88b4690>),
                ('vectorizer',
                 CountVectorizer(tokenizer=<function spacy_tokenizer at 0x7f70aa8bc710>)),
                ('classifier', LogisticRegression())])

## Evaluating our Models

Let’s take a look at how our model actually performs! We can do this using the metrics module from scikit-learn. Now that we’ve trained our model, we’ll put our test data through the pipeline to come up with predictions. Then we’ll use various functions of the metrics module to look at our model’s accuracy, precision, and recall.

* Accuracy refers to the percentage of the total predictions our model makes that are correct.
* Precision describes the ratio of true positives to true positives plus false positives in our predictions.
* Recall describes the ratio of true positives to true positives plus false negatives in our predictions.
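
The three definitions above can be checked with a small worked example. The labels below are made up, and classification_metrics is a hypothetical helper written for this illustration; it treats label 1 as the positive class, which in this dataset is true news and is also the default pos_label in scikit-learn's precision_score and recall_score.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # made-up ground truth
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # made-up predictions
accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)        # 0.7 0.6 0.75
```

Here there are 3 true positives, 2 false positives, 1 false negative, and 4 true negatives, giving accuracy 7/10, precision 3/5, and recall 3/4.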

In [50]:
from sklearn import metrics
# Predicting with a test dataset
predicted = pipe.predict(X_test)

# Model Accuracy
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))

Logistic Regression Accuracy: 0.9393939393939394
Logistic Regression Precision: 0.9235836627140975
Logistic Regression Recall: 0.9537414965986395


## RandomForestClassifier

In [78]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=50,criterion='entropy')
# Create pipeline using Bag of Words
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])

# model generation
pipe.fit(X_train,y_train)

Pipeline(steps=[('cleaner', <__main__.predictors object at 0x7f70a6dbe490>),
                ('vectorizer',
                 CountVectorizer(tokenizer=<function spacy_tokenizer at 0x7f70aa8bc710>)),
                ('classifier',
                 RandomForestClassifier(criterion='entropy', n_estimators=50))])

In [79]:
from sklearn import metrics
# Predicting with a test dataset
predicted = pipe.predict(X_test)

# Model Accuracy
print("RandomForestClassifier Accuracy:",metrics.accuracy_score(y_test, predicted))
print("RandomForestClassifier Precision:",metrics.precision_score(y_test, predicted))
print("RandomForestClassifier Recall:",metrics.recall_score(y_test, predicted))

RandomForestClassifier Accuracy: 0.9400527009222661
RandomForestClassifier Precision: 0.9259259259259259
RandomForestClassifier Recall: 0.9523809523809523


## DecisionTreeClassifier

In [76]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', max_depth=20, splitter='best', random_state=42)

# Create pipeline using Bag of Words
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])

# model generation
pipe.fit(X_train,y_train)

Pipeline(steps=[('cleaner', <__main__.predictors object at 0x7f70a6d91950>),
                ('vectorizer',
                 CountVectorizer(tokenizer=<function spacy_tokenizer at 0x7f70aa8bc710>)),
                ('classifier',
                 DecisionTreeClassifier(criterion='entropy', max_depth=20,
                                        random_state=42))])

In [77]:
from sklearn import metrics
# Predicting with a test dataset
predicted = pipe.predict(X_test)

# Model Accuracy
print("DecisionTreeClassifier Accuracy:",metrics.accuracy_score(y_test, predicted))
print("DecisionTreeClassifier Precision:",metrics.precision_score(y_test, predicted))
print("DecisionTreeClassifier Recall:",metrics.recall_score(y_test, predicted))

DecisionTreeClassifier Accuracy: 0.8919631093544137
DecisionTreeClassifier Precision: 0.8386714116251482
DecisionTreeClassifier Recall: 0.9619047619047619
