<a href="https://colab.research.google.com/github/studam/Covid-19-fake-news/blob/main/notebooks/testing_the_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Custom text classification in spaCy
spaCy is a library for natural language processing (NLP) tasks such as text classification.
One reason it is widely used is that it makes it easy to build or extend a text
classification model. This notebook uses that feature.

In [1]:
#import dependencies
import numpy as np
import pandas as pd
import spacy
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline

In [8]:
# Read data from S3 Bucket
path_true= 'https://fakenewsproject4.s3.amazonaws.com/trueNews.csv'
path_fake= 'https://fakenewsproject4.s3.amazonaws.com/fakeNews.csv'
true_df= pd.read_csv(path_true)
fake_df= pd.read_csv(path_fake)

In [9]:
fake_df.head(5)

Unnamed: 0,Date Posted,Link,Text,Region,Country,Explanation,Origin,Origin_URL,Fact_checked_by,Poynter_Label,Binary Label
0,2/7/20,https://www.poynter.org/?ifcn_misinformation=t...,Tencent revealed the real number of deaths.\t\t,Europe,France,The screenshot is questionable.,Twitter,https://www.liberation.fr/checknews/2020/02/07...,CheckNews,Misleading,0
1,2/7/20,https://www.poynter.org/?ifcn_misinformation=t...,Taking chlorine dioxide helps fight coronavir...,Europe,Germany,Chlorine dioxide does guard against the coron...,Website,https://correctiv.org/faktencheck/medizin-und-...,Correctiv,FALSE,0
2,2/7/20,https://www.poynter.org/?ifcn_misinformation=t...,This video shows workmen uncovering a bat-inf...,India,India,A video shows bats nesting in the roof; howev...,Facebook,https://factcheck.afp.com/video-shows-workmen-...,AFP,MISLEADING,0
3,2/7/20,https://www.poynter.org/?ifcn_misinformation=t...,The Asterix comic books and The Simpsons pred...,India,India,Coronavirus has been around since the 1960s. ...,Twitter,https://www.boomlive.in/health/did-the-simpson...,BOOM FactCheck,Misleading,0
4,2/7/20,https://www.poynter.org/?ifcn_misinformation=c...,Chinese President Xi Jinping visited a mosque...,India,India,Chinese President Xi Jinping's visit to the m...,Facebook,http://newsmobile.in/articles/2020/02/07/chine...,NewsMobile,FALSE,0


## Pre-processing

In [10]:
#Rename Binary column
fake_df.rename(columns ={"Binary Label":"Label"}, inplace=True)

In [11]:
# Drop columns that are not required
fake = fake_df.drop(["Date Posted",'Link','Country','Origin_URL', 'Fact_checked_by','Poynter_Label'], axis=1)
fake

Unnamed: 0,Text,Region,Explanation,Origin,Label
0,Tencent revealed the real number of deaths.\t\t,Europe,The screenshot is questionable.,Twitter,0
1,Taking chlorine dioxide helps fight coronavir...,Europe,Chlorine dioxide does guard against the coron...,Website,0
2,This video shows workmen uncovering a bat-inf...,India,A video shows bats nesting in the roof; howev...,Facebook,0
3,The Asterix comic books and The Simpsons pred...,India,Coronavirus has been around since the 1960s. ...,Twitter,0
4,Chinese President Xi Jinping visited a mosque...,India,Chinese President Xi Jinping's visit to the m...,Facebook,0
...,...,...,...,...,...
3790,Bill Gates said that the COVID-19 vaccine wil...,Europe,The new RNA and DNA vaccine candidates are ex...,Social Media and Websites,0
3791,COVID-19 vaccine candidates will insert micro...,Europe,The hoax comes from a misinterpretation of a ...,Whatsapp and Facebook,0
3792,An image claims that chroma screen panels are...,Europe,The image has been manipulated. The real one ...,Social Media,0
3793,"Alexandria Ocasio-Cortez tweeted, ""It's vital...",United States,Alexandria Ocasio-Cortez didn't tweet this.,Viral image,0


In [12]:
# Show the top 5 rows of true_df
true_df.head(5)

Unnamed: 0,Date Posted,Link,Text,Region,Username,Publisher,Label
0,2/11/20,https://twitter.com/the_hindu/status/122725962...,Just in: Novel coronavirus named 'Covid-19': U...,India,the_hindu,The Hindu,1
1,2/12/20,https://twitter.com/ndtv/status/12274908434742...,WHO officially names #coronavirus as Covid-19....,India,ndtv,NDTV,1
2,2/12/20,https://twitter.com/the_hindu/status/122744471...,"The #UN #health agency announced that ""COVID-1...",India,the_hindu,The Hindu,1
3,2/14/20,https://twitter.com/IndiaToday/status/12282764...,The Indian Embassy in Tokyo has said that one ...,India,indiatoday,IndiaToday,1
4,2/15/20,https://twitter.com/the_hindu/status/122854247...,Ground Zero | How Kerala used its experience i...,India,the_hindu,The Hindu,1


In [13]:
# Drop Columns not required
true = true_df.drop(["Date Posted",'Link','Username', 'Publisher'], axis=1)
true

Unnamed: 0,Text,Region,Label
0,Just in: Novel coronavirus named 'Covid-19': U...,India,1
1,WHO officially names #coronavirus as Covid-19....,India,1
2,"The #UN #health agency announced that ""COVID-1...",India,1
3,The Indian Embassy in Tokyo has said that one ...,India,1
4,Ground Zero | How Kerala used its experience i...,India,1
...,...,...,...
3788,Global COVID-19 prevention trial of hydroxychl...,Europe,1
3789,Bavaria's free COVID-19 test for all splits Ge...,Europe,1
3790,Britain locks down city of Leicester after COV...,Europe,1
3791,UK imposes lockdown on city of Leicester to cu...,Europe,1


In [19]:
# Merge and shuffle into a single DataFrame
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
news_df = pd.concat([fake, true]).sample(frac=1).reset_index(drop=True).drop(columns=['Region', 'Explanation', 'Origin'])
news_df.head(10)

Unnamed: 0,Text,Label
0,"""Due to the large number of people who will r...",0
1,The Indian Army has been called to control se...,0
2,"On June 21, 57 girls of the government shelter...",1
3,Britain's death toll from COVID-19 could have ...,1
4,It is forbidden to be more than one in a car ...,0
5,Many journalists who are working from the fron...,1
6,COVID-19: Bar Council of India asks lawyers ac...,1
7,"A photo where Pablo Iglesias, vice president ...",0
8,Beware of food delivery apps in India as they...,0
9,#MahaHotSpot | #Maharashtra has the highest nu...,1


In [15]:
# Import spaCy and load the small English model
import spacy
nlp=spacy.load("en_core_web_sm")
nlp.pipe_names

['tagger', 'parser', 'ner']

In [16]:
# Add the built-in textcat component to the pipeline (spaCy v2 API)
textcat=nlp.create_pipe( "textcat", config={"exclusive_classes": True, "architecture": "simple_cnn"})
nlp.add_pipe(textcat, last=True)
nlp.pipe_names

['tagger', 'parser', 'ner', 'textcat']

In [17]:
# Adding the labels to textcat
textcat.add_label("REAL")
textcat.add_label("FAKE")

1

In [21]:
# Converting the dataframe into a list of tuples
news_df['tuples'] = news_df.apply(lambda row: (row['Text'],row['Label']), axis=1)
train =news_df['tuples'].tolist()
train[:10]

[(' "Due to the large number of people who will refuse the forthcoming COVID-19 vaccine because it will include tracking microchips, the Gates Foundation is now spending billions to ensure that all medical and dental injections and procedures include the chips."\x9d\t\t',
  0),
 (" The Indian Army has been called to control seven areas in Mumbai that are not following the 21 day lockdown and are out of the Mumbai police's control.\t\t",
  0),
 ('On June 21, 57 girls of the government shelter home in Swarup Nagar in Kanpur were found Covid-19 positive.\n#Kanpur #coronavirus https://www.indiatoday.in/india/story/kanpur-shelter-home-case-2-officials-suspended-1694683-2020-06-27\xa0',
  1),
 ("Britain's death toll from COVID-19 could have been halved if lockdown had been introduced a week earlier, a former member of the UK government's scientific advisory group said  https://reut.rs/3cRFBGM\xa0",
  1),
 (' It is forbidden to be more than one in a car in France during the lockdown.\t\t',
  

In [22]:
import random

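def load_data(limit=0, split=0.8):
    # Copy so that shuffling does not reorder the global `train` list
    train_data = list(train)
    # Shuffle the data
    random.shuffle(train_data)
    # Keep at most `limit` examples (limit=0 keeps everything)
    train_data = train_data[-limit:]
    texts, labels = zip(*train_data)
    # Build the category dict for each text (Label 1 = REAL, 0 = FAKE)
    cats = [{"REAL": bool(y), "FAKE": not bool(y)} for y in labels]

    # Split into training and evaluation data
    split = int(len(train_data) * split)
    return (texts[:split], cats[:split]), (texts[split:], cats[split:])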

n_texts=23486

# Calling the load_data() function 
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=n_texts)

# Processing the final format of training data
train_data = list(zip(train_texts,[{'cats': cats} for cats in train_cats]))
train_data[:10]

[("Bolsonaro says he 'wouldn't feel anything' if infected with Covid-19 and attacks state lockdowns  https://www.theguardian.com/world/2020/mar/25/bolsonaro-brazil-wouldnt-feel-anything-covid-19-attack-state-lockdowns?utm_term=Autofeed&CMP=twt_b-gdnnews&utm_medium=Social&utm_source=Twitter#Echobox=1585102482\xa0",
  {'cats': {'FAKE': False, 'REAL': True}}),
 ("Florida scientist says she was fired for refusing to change Covid-19 data 'to support reopen plan'  https://www.theguardian.com/us-news/2020/may/20/florida-scientist-dr-rebekah-jones-fired-refusing-change-covid-19-data-reopen-plan?utm_term=Autofeed&CMP=twt_b-gdnnews&utm_medium=Social&utm_source=Twitter#Echobox=1589977348\xa0",
  {'cats': {'FAKE': False, 'REAL': True}}),
 (' Autopsy reveals a Wuhan doctor was murdered in his sickbed.\t\t',
  {'cats': {'FAKE': True, 'REAL': False}}),
 (" Image of news channel claims Pakistan's PM Imran Khan's wife tested COVID-19 positive.\t\t",
  {'cats': {'FAKE': True, 'REAL': False}}),
 ('CJI S.
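
The sample rows printed above suggest a formatting difference between the two sources: the fake-news texts start with a space and end in `\t\t`, while the real-news texts end in a non-breaking space (`\xa0`). A classifier could pick up on such cues instead of the wording. The following is a minimal sketch of a clean-up step that could be applied to the `Text` column before the tuples are built. `normalize_text` is a hypothetical helper, not part of the original notebook, and NFKC normalization also folds other compatibility characters, which may or may not be desirable here.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Collapse whitespace and drop stray control characters (illustrative helper)."""
    # NFKC folds the non-breaking space (\xa0) into a regular space.
    text = unicodedata.normalize("NFKC", text)
    # Remove control characters such as \x9d, but keep whitespace so that
    # words separated only by a tab or newline are not glued together.
    text = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


# Rows copied from the notebook output above.
fake_row = " Autopsy reveals a Wuhan doctor was murdered in his sickbed.\t\t"
real_row = "a former member of the UK government's scientific advisory group said  https://reut.rs/3cRFBGM\xa0"

print(repr(normalize_text(fake_row)))
print(repr(normalize_text(real_row)))
```

In the notebook this could be applied with `news_df['Text'].map(normalize_text)` before the `tuples` column is created.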

In [23]:
def evaluate(tokenizer, textcat, texts, cats):
    docs = (tokenizer(text) for text in texts)
    tp = 0.0  # True positive
    fp = 1e-8  # False positive
    fn = 1e-8  # False negative
    tn = 0.0  # True negative
    for i, doc in enumerate(textcat.pipe(docs)):
        gold = cats[i]
        for label, score in doc.cats.items():
            if label not in gold:
                continue
            if label == "FAKE":
                continue
            if score >= 0.5 and gold[label] >= 0.5:
                tp += 1.0
            elif score >= 0.5 and gold[label] < 0.5:
                fp += 1.0
            elif score < 0.5 and gold[label] < 0.5:
                tn += 1
            elif score < 0.5 and gold[label] >= 0.5:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if (precision + recall) == 0:
        f_score = 0.0
    else:
        f_score = 2 * (precision * recall) / (precision + recall)
    return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score}


# Number of training iterations
n_iter=10
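
To make the arithmetic in `evaluate()` concrete, here is a small worked example in plain Python. It treats `REAL` as the positive class and uses the same 0.5 threshold, but the scores and gold labels are made up for illustration, and the helper names `confusion_counts` and `precision_recall_f1` are hypothetical. Unlike `evaluate()`, which seeds `fp` and `fn` with `1e-8`, this sketch guards against division by zero explicitly.

```python
def confusion_counts(scores, gold, threshold=0.5):
    """Count TP/FP/FN/TN for the positive class from scores and boolean gold labels."""
    tp = fp = fn = tn = 0
    for score, is_positive in zip(scores, gold):
        predicted_positive = score >= threshold
        if predicted_positive and is_positive:
            tp += 1
        elif predicted_positive and not is_positive:
            fp += 1
        elif not predicted_positive and is_positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up REAL scores for five texts and their gold labels (True = REAL).
example_scores = [0.9, 0.8, 0.3, 0.6, 0.1]
example_gold = [True, True, True, False, False]

tp, fp, fn, tn = confusion_counts(example_scores, example_gold)
p, r, f = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)  # 2 1 1 1
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.667 0.667
```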

In [24]:
from spacy.util import minibatch, compounding

# Disabling other components
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
with nlp.disable_pipes(*other_pipes):  # only train textcat
    optimizer = nlp.begin_training()

    print("Training the model...")
    print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))

    # Performing training
    for i in range(n_iter):
        losses = {}
        batches = minibatch(train_data, size=compounding(4., 32., 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2,
                       losses=losses)

        # Call evaluate() and print the scores
        with textcat.model.use_params(optimizer.averages):
            scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
        print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}'  
              .format(losses['textcat'], scores['textcat_p'],
                      scores['textcat_r'], scores['textcat_f']))

Training the model...
LOSS 	  P  	  R  	  F  
1.298	1.000	1.000	1.000
0.039	1.000	1.000	1.000
0.022	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.043	1.000	1.000	1.000
0.125	1.000	1.000	1.000
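
The call `compounding(4., 32., 1.001)` in the training loop yields a batch size that starts at 4 and grows by a factor of 1.001 per batch until it is capped at 32. The sketch below re-implements that schedule in plain Python to make the behavior visible. `compounding_sketch` and `minibatch_sketch` are illustrative stand-ins written for this note, assumed to mirror the spaCy v2 utilities rather than copied from them.

```python
import itertools


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... clipped at stop."""
    curr = float(start)
    while True:
        # Clip towards `stop` whether the series grows or shrinks.
        yield min(curr, stop) if start <= stop else max(curr, stop)
        curr *= compound


def minibatch_sketch(items, sizes):
    """Cut `items` into consecutive batches whose lengths follow `sizes`."""
    items = iter(items)
    for size in sizes:
        batch = list(itertools.islice(items, int(size)))
        if not batch:
            return
        yield batch


# With the notebook's settings the growth is slow: the size is still close
# to 4 after a few batches and has reached the cap of 32 after 5000.
schedule = list(itertools.islice(compounding_sketch(4.0, 32.0, 1.001), 5001))
print(schedule[0], round(schedule[1], 3), schedule[5000])

# A faster toy schedule makes the batching easy to see.
toy_lengths = [len(b) for b in minibatch_sketch(range(20), compounding_sketch(2, 8, 2.0))]
print(toy_lengths)  # [2, 4, 8, 6]
```

Small early batches give more frequent weight updates at the start of training, while larger later batches make each pass over the data faster.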


In [26]:
# Testing the model
test_text= "The new coronavirus causes sudden death syndrome"
doc =nlp(test_text)
doc.cats

{'FAKE': 0.99931800365448, 'REAL': 0.000682036392390728}
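
`doc.cats` is a dictionary of label scores, so turning it into a single prediction is a matter of picking the highest-scoring label. A minimal sketch, assuming a hypothetical helper `top_label` and an arbitrary 0.5 confidence threshold (neither appears in the original notebook):

```python
def top_label(cats, threshold=0.5):
    """Return (label, score) for the best label, or (None, score) below the threshold."""
    label, score = max(cats.items(), key=lambda item: item[1])
    if score < threshold:
        return None, score
    return label, score


# Scores copied from the cell output above.
example_cats = {'FAKE': 0.99931800365448, 'REAL': 0.000682036392390728}
predicted, confidence = top_label(example_cats)
print(predicted, round(confidence, 4))  # FAKE 0.9993
```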

In [None]:
nlp.to_disk("./model")