# Big Data Science: Project demo (May 2020)
## Study of COVID-19 denialism through twitter activity
### Project group 26:  Wannes Van Leemput, Sam Vanmassenhove

In this project we study the phenomenon of "COVID denialism": the refusal, despite all evidence to the contrary, to acknowledge the coronavirus as a serious threat to society, often by referring to it as a "hoax".

We searched corona-related tweets for signs of denialism and created a labelled training set based on the hashtags used by such denialists. The training data was then used to train a classification model in order to predict whether a certain tweet was a case of COVID denial or a regular tweet on the subject of coronavirus.

In [1]:
# Import libraries
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, Row
import pandas as pd
import numpy as np
import string
import nltk
from nltk.corpus import stopwords
import re

# Import our own code
from Authentication import Authentication
from DataMiner import DataMiner
from PreProcessTweets import PreProcessTweets
from TweetDataIO import TweetDataIO
from DenialPredictor import DenialPredictor
from datastream_test import MyStreamListener

# Create a set of English stopwords
sw = set(stopwords.words("english")) 

# Initiate spark
sc = SparkContext('local[*]')
spark = SparkSession.builder.getOrCreate()

# Get twitter api authentication
api = Authentication().get_api()

## 1. Classification of denialist tweets
### 1.1 Demonstration of tweet searching and IO

We use the Tweepy library to search tweets. We focus on English-language tweets, as these are the most common and the most geographically diverse.

Note that the training set should already exist as a saved CSV-file when running this "application" in production. The following code is only for the purpose of demonstrating the IO functionality.

In [2]:
# Mine some denial tweets (no specific location)
tagignore = ["#Covid_19", "#coronavirus", "#COVIDー19", "#COVID19", "#coronavirusNYC", "#coronavirusoregon", 
             "#lockdown", "#covid19", "#COVID", "#pandemic", "#Corona", "#Covid19", "#CoronaVirus"]
miner = DataMiner(api, "#CoronaHoax", "", "en", tagignore=tagignore, num_tweets=10)
denial_tweets = miner.mine()

Processing tag: #BillGatesIsEvil
Processing tag: #coronaHoax
Processing tag: #FilmYourHospital
Processing tag: #scamdemic
Processing tag: #Plandemic2020
Processing tag: #POTUS
Processing tag: #QAnon
Processing tag: #ResistTheNewWorldOrder
Processing tag: #BillGates
Processing tag: #CORONAHOAX
Processing tag: #plandemic
Processing tag: #Coronabollocks
Processing tag: #sos
Processing tag: #WWG1WGA
Processing tag: #coronabollocks
Processing tag: #Scamdemic
Processing tag: #NWO
Processing tag: #CovidHoax
Processing tag: #q
Processing tag: #woke
Processing tag: #thegreatawakening
Processing tag: #DrainTheSwamp
Processing tag: #Coronahoax
Processing tag: #BillGatesBioTerrorist
Processing tag: #endthelockdown
Processing tag: #FakePandemic
Processing tag: #ObamaGate
Processing tag: #Plandemic
Processing tag: #coronahoax
Processing tag: #CoronaHoax


In [3]:
# Mine some control tweets (no specific location)
tagignore = ["#CoronaHoax", "#covidhoax", "#coronahoax", "#Plandemic"]
miner = DataMiner(api, "coronavirus", "", "en", tagignore=tagignore, num_tweets=10)
control_tweets = miner.mine()

Processing tag: #Covid_19
Processing tag: #covid19
Processing tag: #Coronavirus
Processing tag: #COVID19
Processing tag: #coronavirus


In [4]:
# Write the tweets to a CSV file
filename = "./training_data.txt"
io = TweetDataIO(filename, spark=spark, context=sc)
io.write(denial_tweets, label=0, append=False)
io.write(control_tweets, label=1, append=True)
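The labels written above follow from which hashtag search returned a tweet: 0 for denial tweets and 1 for control tweets. A minimal sketch of the same rule applied directly to raw text could look as follows. The helper `label_by_hashtags` and the small tag set are hypothetical and chosen for illustration; they are not part of `DataMiner` or `TweetDataIO`.

```python
import re

# Illustrative subset of the denial hashtags seen in the mining output;
# stored in lowercase so that matching is case-insensitive.
DENIAL_TAGS = {"#coronahoax", "#covidhoax", "#plandemic", "#scamdemic"}

HASHTAG_RE = re.compile(r"#\w+")


def label_by_hashtags(text, denial_tags=DENIAL_TAGS):
    """Return 0 (denial) if the text contains a denial hashtag, else 1 (control)."""
    tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
    return 0 if tags & denial_tags else 1


print(label_by_hashtags("Stay home, stay safe #COVID19"))
print(label_by_hashtags("Wake up people #CoronaHoax #NWO"))
```

Labels obtained this way are only as reliable as the hashtags themselves: a tweet that quotes a denial hashtag in order to criticise it would still receive label 0.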

In [5]:
# Read same file and show
ddf = io.read()
ddf.show(n=10)

+-----+--------------------+--------------------+--------------------+-------------------+-------------------+---------------+
|label|            location|                tags|                text|               time|           tweet_id|           user|
+-----+--------------------+--------------------+--------------------+-------------------+-------------------+---------------+
|    0|                    |    #BillGatesIsEvil|@shreena74052483 ...|2020-05-20 17:46:22|1263164173644468224| OppressedHindu|
|    0|                    |#ChineseVirus|#Ch...|@ChristieGrunwa1 ...|2020-05-20 17:34:58|1263161305306890244|       rcoiteux|
|    0|       New York, USA|him....#BillGates...|@EM_KA_17 This is...|2020-05-20 17:30:01|1263160060160749570|alexandrejakins|
|    0|      Leeds, England|#podcasts|#podcas...|Look out, everyon...|2020-05-20 17:21:22|1263157882536833029|  ChamberElders|
|    0|                    |#FDA|#BillGatesIs...|This is necessary...|2020-05-20 17:20:16|1263157606111199232| 

In [6]:
# Remove duplicates 
ddf = ddf.orderBy("label").dropDuplicates(["tweet_id"])
ddf.show(n=10)

+-----+--------------------+--------------------+--------------------+-------------------+-------------------+---------------+
|label|            location|                tags|                text|               time|           tweet_id|           user|
+-----+--------------------+--------------------+--------------------+-------------------+-------------------+---------------+
|    0|                    |          #CovidHoax|@AmandaLeeHouse ....|2020-05-20 17:29:55|1263160036676841473|         LHCBCS|
|    0|                    |              #POTUS|@realDonaldTrump ...|2020-05-20 17:56:08|1263166634887385088|AngelaJ45547503|
|    0|    Aichi-ken, Japan|#vote|#StillIVote...|Although not all ...|2020-05-20 09:56:19|1263045881864974336|     TheMusicks|
|    0|                    |                  #Q|@BaldEagle1964 @L...|2020-05-20 18:00:04|1263167622776004608|       GRANNYQ9|
|    0|       San Diego, CA|       #woke|#resist|The thing that ma...|2020-05-20 17:55:23|1263166444579115009|B
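As far as we are aware, Spark does not guarantee which row `dropDuplicates` keeps when several rows share a `tweet_id`, so the preceding `orderBy("label")` may not decide the surviving label. The rule we have in mind (keep the row with the smallest label, so a tweet found by both searches stays labelled as denial) is sketched below on plain dictionaries. The function `dedupe_by_tweet_id` and the sample rows are hypothetical.

```python
def dedupe_by_tweet_id(rows):
    """Keep one row per tweet_id, preferring the smallest label (0 = denial)."""
    best = {}
    for row in rows:
        kept = best.get(row["tweet_id"])
        if kept is None or row["label"] < kept["label"]:
            best[row["tweet_id"]] = row
    return list(best.values())


# Made-up rows: tweet 1 was returned by both the control and the denial search
rows = [
    {"tweet_id": 1, "label": 1, "user": "a"},
    {"tweet_id": 1, "label": 0, "user": "a"},
    {"tweet_id": 2, "label": 1, "user": "b"},
]
deduped = dedupe_by_tweet_id(rows)
print(deduped)
```

In Spark the same effect could be obtained with a window function partitioned by `tweet_id` and ordered by `label`; that variant is not shown here.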

### 1.2 Perform preprocessing steps on the tweets to prepare for classification model

In [7]:
p = PreProcessTweets(ddf, 
                     remove_tags=True, 
                     remove_mentions=True, 
                     remove_punctuation=True, 
                     remove_urls=True, 
                     remove_stopwords=True)
ddf = p.preprocess()

Preprocessing...
>> Removing stopwords...
>> Removing urls...
>> Removing hashtags...
>> Removing user mentions...
>> Removing punctuation...
>> Removing whitespace...
Finished preprocessing!
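The cleaning steps listed in the log can be approximated with the standard library alone. The sketch below is not the code of `PreProcessTweets`: the function name `preprocess`, the regular expressions, the order of the steps and the tiny stopword list are all assumptions made for illustration (the notebook itself uses NLTK's English stopword list).

```python
import re
import string

# Tiny illustrative stopword list; the notebook uses NLTK's English list instead.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "that", "this", "it"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text, stopwords=STOPWORDS):
    """Strip URLs, mentions, hashtags, punctuation, stopwords and extra whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    # split() without arguments also collapses runs of whitespace
    tokens = [tok for tok in text.split() if tok.lower() not in stopwords]
    return " ".join(tokens)


print(preprocess("@user This is a hoax!!  #CoronaHoax https://t.co/abc123"))
```

The order matters in this sketch: `string.punctuation` contains both `@` and `#`, so stripping punctuation before mentions and hashtags would leave the bare user names and tag words in the text.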


In [8]:
# Convert to pandas and have a look at the new data
df = ddf.toPandas()
df.head()

Unnamed: 0,label,location,tags,text,time,tweet_id,user,processed_text
0,0,,#CovidHoax,@AmandaLeeHouse ... the way we live. We carrie...,2020-05-20 17:29:55,1263160036676841473,LHCBCS,way live We carried normal During Only 1 indiv...
1,0,,#POTUS,@realDonaldTrump @RussVought45 @USTreasury And...,2020-05-20 17:56:08,1263166634887385088,AngelaJ45547503,And neither you You really know hell talking t...
2,0,"Aichi-ken, Japan",#vote|#StillIVote|#ResistTheNewWorldOrder|#Rev...,Although not all of problems can be solved wit...,2020-05-20 09:56:19,1263045881864974336,TheMusicks,Although problems solved Representation matter...
3,0,,#Q,@BaldEagle1964 @LisaMei62 @USPATRIQT41020 I wa...,2020-05-20 18:00:04,1263167622776004608,GRANNYQ9,I thinking thing I watched Tucker Hannity Laur...
4,0,"San Diego, CA",#woke|#resist,The thing that makes twitter great is the abil...,2020-05-20 17:55:23,1263166444579115009,BravoKiloActual,The thing makes twitter great ability respond ...


### 1.3 Training and evaluation of classification model
#### 1.3.1 First try on the fully preprocessed text

In [9]:
# Split data (corpus and labels) into train and test sets
predictor = DenialPredictor(corpus=df.processed_text, labels=df.label, clf="nb")
X_train, X_test, y_train, y_test = predictor.train_test_split(split=0.3)

# Fit the model
predictor.fit_model(X_train, y_train)

# Calculate some metrics to evaluate performance
print("Training set performance : ")
predictor.calc_metrics(X_train, y_train)
print("Test set performance : ")
predictor.calc_metrics(X_test, y_test)

Training set performance : 
Accuracy: 0.884,
Precision: 0.000, 
Recall: 0.000,
F1: 0.000
Test set performance : 
Accuracy: 0.867,
Precision: 0.000, 
Recall: 0.000,
F1: 0.000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
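An accuracy of about 0.87 next to zero precision, recall and F1 is consistent with a model that never predicts the positive class on an imbalanced data set; the `_warn_prf` lines appear to be the remains of scikit-learn's warning that precision is ill-defined in that situation. Which class `DenialPredictor.calc_metrics` treats as positive is not shown here, so the sketch below simply assumes label 1. The function `binary_metrics` and the toy labels are hypothetical and are not the actual test data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # Report 0.0 when a denominator is zero instead of raising an error
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Hypothetical imbalanced test set: 13 tweets of class 0 and 2 of class 1
y_true = [0] * 13 + [1] * 2
# A degenerate model that always predicts the majority class
y_pred = [0] * 15

# Accuracy is 13/15; precision, recall and F1 are all 0.0
print(binary_metrics(y_true, y_pred))
```

If this reading is right, accuracy alone says little about the model, and balancing the two classes in the training set (or stratifying the split) would be a natural next step.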


#### 1.3.2 Now do the same but with minimal preprocessing (only hashtags and stopwords removed)

In [10]:
p = PreProcessTweets(ddf, 
                     remove_tags=True, 
                     remove_mentions=False, 
                     remove_punctuation=False, 
                     remove_urls=False, 
                     remove_stopwords=True)
ddf = p.preprocess()
df = ddf.toPandas()
print("\n")

# Split data (corpus and labels) into train and test sets
predictor = DenialPredictor(df.processed_text, df.label)
X_train, X_test, y_train, y_test = predictor.train_test_split(split=0.3)

# Fit the model
predictor.fit_model(X_train, y_train)

# Calculate some metrics to evaluate performance
print("Training set performance : ")
predictor.calc_metrics(X_train, y_train)
print("Test set performance : ")
predictor.calc_metrics(X_test, y_test)

Preprocessing...
>> Removing stopwords...
>> Removing hashtags...
>> Removing whitespace...
Finished preprocessing!


Training set performance : 
Accuracy: 0.884,
Precision: 0.000, 
Recall: 0.000,
F1: 0.000
Test set performance : 
Accuracy: 0.867,
Precision: 0.000, 
Recall: 0.000,
F1: 0.000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


#### 1.3.3 The same again but this time with a different classifier: Support Vector Machines (SVM)

In [11]:
# Split data (corpus and labels) into train and test sets
svm_predictor = DenialPredictor(df.processed_text, df.label, clf="svm")
X_train, X_test, y_train, y_test = svm_predictor.train_test_split(split=0.3)

# Fit the model
svm_predictor.fit_model(X_train, y_train)

# Calculate some metrics to evaluate performance
print("Training set performance : ")
svm_predictor.calc_metrics(X_train, y_train)
print("Test set performance : ")
svm_predictor.calc_metrics(X_test, y_test)

Training set performance : 
Accuracy: 1.000,
Precision: 1.000, 
Recall: 1.000,
F1: 1.000
Test set performance : 
Accuracy: 0.867,
Precision: 0.000, 
Recall: 0.000,
F1: 0.000


  _warn_prf(average, modifier, msg_start, len(result))


## 2. Live tweet streaming and prediction

In [12]:
from LivePredictionStream import LivePredictionStream
import tweepy

In [13]:
FILE_NAME = "hoax.csv"
api = Authentication(isApp=False).get_api()
auth = api.auth
streamListener = LivePredictionStream(FILE_NAME, predictor, num_iter=20, verbose=1)
stream = tweepy.Stream(auth=auth, listener=streamListener)

try:
    print('Start streaming...')
    stream.filter(languages=['en'], 
                    track=["coronavirus"])

except Exception as exc:
    # Report why the stream stopped instead of hiding the error
    print(f"Stopped: {exc}")

finally:
    print('Done.')
    stream.disconnect()

Start streaming...
(1) denial (0.95): 🤦‍♀️ Does he call the deaths a badge of honor too Stupid fcker Trump calls high number of coronavirus cases in the US a badge of honor attributes it to testing
(2) denial (0.93): And all DDS were like What mass killing So like all 75 billion people
(3) denial (0.93): Counties are on the front lines of the COVID19 crisis working every day to keep our communities healthy and safe For more on the county role and for the latest resources visit CountiesMatter
(4) denial (0.92): chicagotribune More and more sheriffs are saying the hell w Pritzker his tyrannical orders Including DuPage Countys Pretty soon the only county adhering to his unconstitutional lockdown will be crook county coronavirus COVID
(5) denial (0.88): Curb your enthusiasm season 11
(6) denial (0.94): So proud to be mentioned alongside Carolyn Olson who is using her artwork to pay tribune to COVID19 workers Read about it in this PBS NewsHour write up
(7) denial (0.88): • Ooop Blue Bell
(8

## 3. Visualisations