# ============================================================
# =============Big Data Science: Project demo (May 2020)============
# ============================================================
## Study of COVID-19 denialism through Twitter activity
### Project group 26:  Wannes Van Leemput, Sam Vanmassenhove

In this project we study the phenomenon of "COVID-denialism": people who, despite all evidence to the contrary, refuse to acknowledge the coronavirus as a serious threat to society, often referring to it as a "hoax".

We searched corona-related tweets for signs of denialism and created a labelled training set based on the hashtags used by such denialists. The training data was then used to train a classification model in order to predict whether a certain tweet was a case of COVID denial or a regular tweet on the subject of coronavirus.
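
As a rough illustration of hashtag-based labelling, the sketch below assigns a label to a single tweet from its hashtags. The notebook itself labels tweets by the search that returned them, so the function `label_by_hashtags` and its per-tweet rule are assumptions made for illustration; only the label convention (0 for denial, 1 for control) is taken from the `io.write(...)` calls further down.

```python
DENIAL_LABEL = 0   # same convention as the io.write(...) calls in this notebook
CONTROL_LABEL = 1


def label_by_hashtags(tags, denial_tags):
    """Return DENIAL_LABEL if any hashtag matches a denial tag, else CONTROL_LABEL.

    Hypothetical helper: matching is case-insensitive, which is an assumption.
    """
    denial = {t.lower() for t in denial_tags}
    if any(t.lower() in denial for t in tags):
        return DENIAL_LABEL
    return CONTROL_LABEL


denial_tags = ["#CoronaHoax", "#Plandemic"]
print(label_by_hashtags(["#COVID19", "#plandemic"], denial_tags))
print(label_by_hashtags(["#COVID19", "#lockdown"], denial_tags))
```

Labels obtained this way are noisy: a tweet that mocks a denial hashtag would still be marked as denial.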

In [1]:
# Import libraries
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, Row
import pandas as pd
import numpy as np
import string
import nltk
from nltk.corpus import stopwords
import re
import sys
sys.path.append('../python files')

# Import our own code
from Authentication import Authentication
from DataMiner import DataMiner
from PreProcessTweets import PreProcessTweets
from TweetDataIO import TweetDataIO
from DenialPredictor import DenialPredictor
from datastream_test import MyStreamListener
from Visualisation import Visualisation
from LocationService import LocationService

# Create a set of English stopwords
sw = set(stopwords.words("english")) 

# Initiate spark
sc = SparkContext('local[*]')
spark = SparkSession.builder.getOrCreate()

# Get twitter api authentication
api = Authentication().get_api()

## 1. Classification of denialist tweets
### 1.1 Demonstration of tweet searching and IO

We use the Tweepy library to search tweets. We focus on English-language tweets, as these are the most common and the most geographically diverse.

Note that the training set should already exist as a saved CSV-file when running this "application" in production. The following code is only for the purpose of demonstrating the IO functionality.

In [2]:
# Mine some denial tweets (no specific location)
control_tags = ["#Covid_19", "#coronavirus", "#COVIDー19", "#COVID19", "#coronavirusNYC", "#coronavirusoregon", 
             "#lockdown", "#covid19", "#COVID", "#pandemic", "#Corona", "#Covid19", "#CoronaVirus"]
miner = DataMiner(api, "#CoronaHoax", "", "en", tagignore=control_tags, num_tweets=300)
denial_tweets = miner.mine()

Processing tag: #COVID1984
Processing tag: #BillGatesIsEvil
Processing tag: #coronaHoax
Processing tag: #FilmYourHospital
Processing tag: #scamdemic
Processing tag: #Plandemic2020
Processing tag: #POTUS
Processing tag: #QAnon
Processing tag: #ResistTheNewWorldOrder
Processing tag: #BillGates
Processing tag: #CORONAHOAX
Processing tag: #plandemic
Processing tag: #Coronabollocks
Processing tag: #sos
Processing tag: #WWG1WGA
Processing tag: #coronabollocks
Processing tag: #NWO
Processing tag: #Scamdemic
Processing tag: #CovidHoax
Processing tag: #q
Processing tag: #woke
Processing tag: #thegreatawakening
Processing tag: #DrainTheSwamp
Processing tag: #Coronahoax
Processing tag: #BillGatesBioTerrorist
Processing tag: #endthelockdown
Processing tag: #ObamaGate
Processing tag: #FakePandemic
Processing tag: #Plandemic
Processing tag: #coronahoax
Processing tag: #CoronaHoax
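
The log above suggests that the miner starts from one seed hashtag and then follows hashtags that co-occur with it, skipping those in `tagignore`. The internals of `DataMiner` are not shown here, so the following is only a sketch of that idea under this assumption. `snowball_tags` and the `search` callable are hypothetical; `search(tag)` stands in for whatever Tweepy call returns tweet texts for a tag.

```python
import re
from collections import deque

HASHTAG_RE = re.compile(r"#\w+")


def snowball_tags(seed_tag, search, ignore=(), max_tags=50):
    """Breadth-first expansion over co-occurring hashtags.

    `search(tag)` is a hypothetical callable returning an iterable of tweet texts.
    Returns the collected texts and the list of tags that were processed.
    """
    ignore = set(ignore)
    seen = {seed_tag}
    queue = deque([seed_tag])
    processed = []
    tweets = []
    while queue and len(processed) < max_tags:
        tag = queue.popleft()
        processed.append(tag)
        for text in search(tag):
            tweets.append(text)
            # Queue every new hashtag that is not on the ignore list.
            for found in HASHTAG_RE.findall(text):
                if found not in seen and found not in ignore:
                    seen.add(found)
                    queue.append(found)
    return tweets, processed


# Tiny in-memory stand-in for the Twitter search, for demonstration only.
fake_index = {
    "#CoronaHoax": ["Wake up #CoronaHoax #Plandemic", "#CoronaHoax #COVID19 stats"],
    "#Plandemic": ["#Plandemic #CoronaHoax again"],
}
tweets, processed = snowball_tags(
    "#CoronaHoax", lambda tag: fake_index.get(tag, []), ignore={"#COVID19"}
)
print(processed)
```

Tags are compared case-sensitively here, which matches the log (for example `#coronahoax` and `#CoronaHoax` are processed separately). The same tweet can be returned for several tags, which is why duplicates are dropped by `tweet_id` later on.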


In [3]:
# Mine some control tweets (no specific location)
hoax_tags = ["#CoronaHoax", "#covidhoax", "#coronahoax", "#Plandemic"]
miner = DataMiner(api, "coronavirus", "", "en", tagignore=hoax_tags, num_tweets=1000)
control_tweets = miner.mine()

Processing tag: #Covid_19
Processing tag: #Coronavirus
Processing tag: #COVID19
Processing tag: #coronavirus


In [4]:
# Write the tweets to a CSV file
filename = "./training_data.txt"
io = TweetDataIO(filename, spark=spark, context=sc)
io.write(denial_tweets, label=0, append=False)
io.write(control_tweets, label=1, append=True)



##### Note: the cell below may raise an error. This is likely caused by a recent PySpark version and may not occur on your machine. If it does, run the cell again manually and it should work.

In [5]:
# Read same file and show
ddf = io.read()
ddf.show(n=10)

+-----+-------------------+--------------------+--------------------+-------------------+-------------------+-------------+
|label|           location|                tags|                text|               time|           tweet_id|         user|
+-----+-------------------+--------------------+--------------------+-------------------+-------------------+-------------+
|    0|                USA|#CoronaHoax|#Plan...|Another BS poll, ...|2020-05-20 20:28:52|1263205068461019138|   informusa1|
|    0|                USA|#CoronaHoax|#Plan...|This is not news,...|2020-05-20 20:27:00|1263204598644383744|   informusa1|
|    0|                   | #COVID1984|#WWG1WGA|@Under_Our_Watch ...|2020-05-20 20:25:20|1263204181482995712|  DeepDrainer|
|    0|         Washington|#COVID19|#Coronav...|So policy enforce...|2020-05-20 20:24:33|1263203984258588672| JeffHannMADA|
|    0|                USA|#CoronaHoax|#Plan...|Most cases in a d...|2020-05-20 20:22:45|1263203530053214208|   informusa1|

In [6]:
# Remove duplicates 
ddf = ddf.orderBy("label").dropDuplicates(["tweet_id"])
ddf.show(n=10)

+-----+--------------------+--------------------+--------------------+-------------------+-------------------+-------------+
|label|            location|                tags|                text|               time|           tweet_id|         user|
+-----+--------------------+--------------------+--------------------+-------------------+-------------------+-------------+
|    0|          Texas, USA|#WakeUpAmerica|#W...|I think people wi...|2020-05-13 19:07:19|1260647830344962048| kristintoday|
|    0|                    |#Covid19|#FilmYou...|Tanzanian Preside...|2020-05-15 01:28:01|1261106024347508738|    nurudinho|
|    0|                    |#CoronaHoax|#Fake...|@Stinky_Sausage5 ...|2020-05-15 20:20:53|1261391119704707072| OManHereWeGo|
|    0|                    |"""#Corona""|#Cor...|"@DonovanTim @San...|2020-05-16 20:24:40|1261754459895537665|        Si93B|
|    0|N.W. England nr m...|     #Coronabollocks|#Coronabollocks I...|2020-05-17 15:48:55|1262047453592813569|    dave80743|


### 1.2 Perform preprocessing steps on the tweets to prepare for classification model
We use PySpark dataframes to store the most important tweet information, namely the text, location, time and user. The tweet text can be preprocessed by removing punctuation, stop-words, hashtags, user mentions and hyperlinks.
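
To make the cleaning steps concrete, here is a minimal plain-Python sketch of the same kind of preprocessing. It is not the code in `PreProcessTweets`: the function `clean_tweet`, the regular expressions and the tiny stop-word set are assumptions chosen to keep the example self-contained (the notebook uses the NLTK English stop-word list).

```python
import re
import string

# Small illustrative stop-word set; the notebook uses NLTK's English list instead.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "that", "will", "up"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text, stopwords=STOPWORDS):
    """Strip URLs, hashtags, mentions, punctuation and stop-words from a tweet."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    # Splitting and re-joining also collapses repeated whitespace.
    tokens = [tok for tok in text.split() if tok.lower() not in stopwords]
    return " ".join(tokens)


print(clean_tweet("@user The virus is a #hoax! See https://t.co/abc now"))
```

URLs, hashtags and mentions are removed before punctuation, because stripping `#`, `@` or `/` first would leave their remainders behind as ordinary words.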

In [7]:
p = PreProcessTweets(ddf, 
                     remove_tags=True, 
                     remove_mentions=True, 
                     remove_punctuation=True, 
                     remove_urls=True, 
                     remove_stopwords=True)
ddf = p.preprocess()

Preprocessing...
>> Removing stopwords...
>> Removing urls...
>> Removing hashtags...
>> Removing user mentions...
>> Removing punctuation...
>> Removing whitespace...
Finished preprocessing!


In [8]:
# Convert to pandas and have a look at the new data
df = ddf.toPandas()
df.head()

Unnamed: 0,label,location,tags,text,time,tweet_id,user,processed_text
0,0,"Texas, USA",#WakeUpAmerica|#WakeUp|#HumanityIsNotAVirus|#B...,I think people will finally start to wake up w...,2020-05-13 19:07:19,1260647830344962048,kristintoday,I think people finally start wake kids taken h...
1,0,,#Covid19|#FilmYourHospital|#EndTheShutdown|#KOT,Tanzanian President Pombe Maghufuli confirms t...,2020-05-15 01:28:01,1261106024347508738,nurudinho,Tanzanian President Pombe Maghufuli confirms s...
2,0,,#CoronaHoax|#FakePandemic|#FilmYourHospital|#F...,@Stinky_Sausage5 In 2017 Fauci said that there...,2020-05-15 20:20:53,1261391119704707072,OManHereWeGo,In 2017 Fauci said surprise outbreak Trump adm...
3,0,,"""""""#Corona""""|#CoronaVirus|#Covid19|#EmptyHospi...","""@DonovanTim @SandraWors3 @TommyCorbyn @GetBre...",2020-05-16 20:24:40,1261754459895537665,Si93B,Heres Chief Medical Officer UK admitting recor...
4,0,N.W. England nr mcr airport,#Coronabollocks,#Coronabollocks It most certainly is and anywa...,2020-05-17 15:48:55,1262047453592813569,dave80743,It certainly anyway lagers Poofs Ooops sorry w...


### 1.3 Training and evaluation of classification model
We train a machine learning classification model with "sklearn" in order to classify tweets. We investigate the effect of the preprocessing, as well as the effect of replacing the simple Naive Bayes classifier with an SVM model.
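
The implementation of `DenialPredictor` is not shown in this notebook. As a hedged sketch of what a text classifier of this kind could look like in scikit-learn, the example below combines a TF-IDF vectoriser with either a multinomial Naive Bayes model or a linear SVM. The helper `build_classifier`, the choice of `TfidfVectorizer` and `LinearSVC`, and the toy data are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_classifier(kind="nb"):
    """Return an unfitted text-classification pipeline (hypothetical helper)."""
    if kind == "nb":
        model = MultinomialNB()
    elif kind == "svm":
        model = LinearSVC()
    else:
        raise ValueError(f"unknown classifier kind: {kind!r}")
    return make_pipeline(TfidfVectorizer(), model)


# Toy data only; labels follow the notebook's convention (0 = denial, 1 = control).
texts = [
    "the virus is a hoax invented by the media",
    "empty hospitals prove this is a fake pandemic",
    "new cases reported today by the health ministry",
    "wash your hands and keep your distance",
]
labels = [0, 0, 1, 1]

clf = build_classifier("nb").fit(texts, labels)
preds = clf.predict(texts)
print(list(preds))
```

Swapping `"nb"` for `"svm"` changes only the final estimator, so both models see exactly the same features.
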
#### 1.3.1 First try on the preprocessed text

In [9]:
# Split data (corpus and labels) into train and test sets
predictor = DenialPredictor(corpus=df.processed_text, labels=df.label, clf="nb")
X_train, X_test, y_train, y_test = predictor.train_test_split(split=0.3)

# Fit the model
predictor.fit_model(X_train, y_train)

# Calculate some metrics to evaluate performance
print("Training set performance : ")
predictor.calc_metrics(X_train, y_train)
print("Test set performance : ")
predictor.calc_metrics(X_test, y_test)

Training set performance : 
Accuracy: 0.920,
Precision: 0.941, 
Recall: 0.788,
F1: 0.857
Test set performance : 
Accuracy: 0.816,
Precision: 0.751, 
Recall: 0.532,
F1: 0.623
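
For reference, the four metrics printed above can be computed from true and predicted labels as in the sketch below. The notebook does not state which class `calc_metrics` treats as positive, so the hypothetical `binary_metrics` function takes it as a parameter.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = binary_metrics([1, 1, 1, 0, 0], [1, 0, 0, 0, 1])
print(metrics)
```

F1 is the harmonic mean of precision and recall, which is consistent with the test-set figures above: 2 × 0.751 × 0.532 / (0.751 + 0.532) ≈ 0.623.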


#### 1.3.2 Now do the same but without preprocessing

In [10]:
# Split data (corpus and labels) into train and test sets
predictor = DenialPredictor(df.text, df.label)
X_train, X_test, y_train, y_test = predictor.train_test_split(split=0.3)

# Fit the model
predictor.fit_model(X_train, y_train)

# Calculate some metrics to evaluate performance
print("Training set performance : ")
predictor.calc_metrics(X_train, y_train)
print("Test set performance : ")
predictor.calc_metrics(X_test, y_test)

Preprocessing...
>> Removing stopwords...
>> Removing hashtags...
>> Removing whitespace...
Finished preprocessing!


Training set performance : 
Accuracy: 0.940,
Precision: 0.965, 
Recall: 0.833,
F1: 0.894
Test set performance : 
Accuracy: 0.823,
Precision: 0.794, 
Recall: 0.512,
F1: 0.622


#### 1.3.3 The same again but this time with a different classifier: Support Vector Machines (SVM)

In [11]:
# Split data (corpus and labels) into train and test sets
svm_predictor = DenialPredictor(df.processed_text, df.label, clf="svm")
X_train, X_test, y_train, y_test = svm_predictor.train_test_split(split=0.3)

# Fit the model
svm_predictor.fit_model(X_train, y_train)

# Calculate some metrics to evaluate performance
print("Training set performance : ")
svm_predictor.calc_metrics(X_train, y_train)
print("Test set performance : ")
svm_predictor.calc_metrics(X_test, y_test)

Training set performance : 
Accuracy: 0.911,
Precision: 0.963, 
Recall: 0.737,
F1: 0.835
Test set performance : 
Accuracy: 0.806,
Precision: 0.748, 
Recall: 0.485,
F1: 0.588


In [None]:
# Split data (corpus and labels) into train and test sets
svm_predictor = DenialPredictor(df.text, df.label, clf="svm")
X_train, X_test, y_train, y_test = svm_predictor.train_test_split(split=0.3)

# Fit the model
svm_predictor.fit_model(X_train, y_train)

# Calculate some metrics to evaluate performance
print("Training set performance : ")
svm_predictor.calc_metrics(X_train, y_train)
print("Test set performance : ")
svm_predictor.calc_metrics(X_test, y_test)

## 2. Live tweet streaming and classification

In this section we demonstrate that we can correctly classify denialist tweets when streaming tweets live. A streamer is set up to search Twitter for new tweets with the hoax hashtags.

We can see that these tweets are for the most part correctly identified, but the classifier is occasionally defeated by irony: satirical tweets are often mistaken for real denialist tweets.

The streamed tweets below are labelled (denial or normal), and the label probability is given in parentheses.
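
The output format can be reproduced with a few lines of Python. This is a sketch, not the code in `LivePredictionStream`: the function `format_prediction` is hypothetical, and it assumes the class probabilities are ordered by label, with 0 meaning denial and 1 meaning normal.

```python
LABEL_NAMES = {0: "denial", 1: "normal"}  # assumed mapping, matching the training labels


def format_prediction(index, probabilities, text):
    """Format one streamed tweet as '(index) label (probability): text'.

    `probabilities` is assumed to be ordered by label: [P(denial), P(normal)].
    """
    label = max(range(len(probabilities)), key=probabilities.__getitem__)
    return f"({index}) {LABEL_NAMES[label]} ({probabilities[label]:.2f}): {text}"


print(format_prediction(1, [0.97, 0.03], "It's all a hoax"))
```

The probability shown is that of the predicted class, so it never falls below 0.50 for a two-class model.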

In [12]:
from LivePredictionStream import LivePredictionStream
import tweepy

In [13]:
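FILE_NAME = "hoax.csv"
# `tagignore` is not defined anywhere earlier in this notebook. We assume it is a
# list of denial hashtags such as `hoax_tags` from section 1.1; the original run
# appears to have used a longer list.
tagignore = hoax_tags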
api = Authentication(isApp=False).get_api()
auth = api.auth
streamListener = LivePredictionStream(FILE_NAME, predictor, num_iter=5, verbose=1, tagignore=tagignore)
stream = tweepy.Stream(auth=auth, listener=streamListener)

try:
    print('Start streaming...')
    stream.filter(languages=['en'], 
                    track=tagignore)

except Exception:
    print("Stopped.")

finally:
    stream.disconnect()

Start streaming...
(1) denial (1.00): @mitchellvii @dgaliger2 We all live in hope. Especially the Qanons that gave followed thus really since 2016/7. WWG1WGA 🇬🇧 🇺🇸 #Trump2020 #VoterID #VoterFraud #FlynnFighters #ObamaGate #SubpoenaObama
(2) denial (1.00): @BreitbartNews Ofcourse it's denied!
(3) denial (1.00): Must-see 📺Connecting all the dots - with a great amount of passion I might add. Check out the now banned-for-life from Twitter Edu(ating L1berals blowing it up! #wwg1wga #greatawakening #qanon #qarmy #plandemic #election2020
(4) denial (1.00): Do these people care about your well being? Do these people love America?🇺🇲 Do anything to regain power? #NewQ #pandemic #QAnon #PatriotsFight #FakeNews #DrainTheSwamp #KAG2020
(5) denial (1.00): @jujo28 @washingtonpost #CreepyJoeBiden and #ObamaGate created division for many years. #Trump has brought unity among Patriots at a time when America was very polarized politically and racially. #QHatTip #BidenBeHidin from embarrassing rallies and

## 3. Visualisations

### 3.1 Acquiring tweet location data

Tweet data is acquired and preprocessed. Preprocessing includes geocoding the author location. Since a free API is used, we are limited to 1 request per second. As a result, the geocoding can take a while (>20 min if the training set is large).
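
With a limit of one request per second, the cheapest speed-up is to avoid requests altogether. The sketch below wraps an arbitrary geocoding function so that empty locations are skipped and repeated locations (such as "USA") are looked up only once. It is not the code in `LocationService`; `make_cached_geocoder` and the `geocode` callable are hypothetical. The log format in this section suggests that geopy's `RateLimiter` handles the throttling, but that is an inference.

```python
from functools import lru_cache


def make_cached_geocoder(geocode):
    """Wrap `geocode(location)` with a cache and an empty-string guard.

    `geocode` is a hypothetical callable, for example a rate-limited geocoder.
    """

    @lru_cache(maxsize=None)
    def lookup(location):
        return geocode(location)

    def cached(location):
        key = (location or "").strip()
        if not key:
            # No request is spent on tweets without a location.
            return None
        return lookup(key)

    return cached


# Demonstration with a fake geocoder that records every call it receives.
calls = []


def fake_geocode(location):
    calls.append(location)
    return (0.0, 0.0)


geocode = make_cached_geocoder(fake_geocode)
for loc in ["USA", "", "USA ", "Washington", None, "USA"]:
    geocode(loc)
print(calls)
```

In the demonstration, six inputs result in only two calls to the underlying geocoder.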

#### Note: this code will raise exceptions and warnings. This is expected, because we are using the free version of the API.

In [None]:
locationservice = LocationService()
filename = "./training_data.txt"
io = TweetDataIO(filename, spark=spark, context=sc)
ddf = io.read()
df = ddf.toPandas()
df = locationservice.add_location_data(df)

RateLimiter caught an error, retrying (0/2 tries). Called with (*('Aether',), **{}).
Traceback (most recent call last):
  File "C:\Users\SamVa\AppData\Local\Continuum\anaconda3\envs\bds\lib\urllib\request.py", line 1319, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "C:\Users\SamVa\AppData\Local\Continuum\anaconda3\envs\bds\lib\http\client.py", line 1252, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Users\SamVa\AppData\Local\Continuum\anaconda3\envs\bds\lib\http\client.py", line 1298, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "C:\Users\SamVa\AppData\Local\Continuum\anaconda3\envs\bds\lib\http\client.py", line 1247, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "C:\Users\SamVa\AppData\Local\Continuum\anaconda3\envs\bds\lib\http\client.py", line 1026, in _send_output
    self.send(msg)
  File "C:\Users\SamVa\AppData\Local\Continuum\anaco

RateLimiter caught an error, retrying (0/2 tries). Called with (*('',), **{}).
Traceback (most recent call last):
  File "C:\Users\SamVa\AppData\Local\Continuum\anaconda3\envs\bds\lib\urllib\request.py", line 1319, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "C:\Users\SamVa\AppData\Local\Continuum\anaconda3\envs\bds\lib\http\client.py", line 1252, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Users\SamVa\AppData\Local\Continuum\anaconda3\envs\bds\lib\http\client.py", line 1298, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "C:\Users\SamVa\AppData\Local\Continuum\anaconda3\envs\bds\lib\http\client.py", line 1247, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "C:\Users\SamVa\AppData\Local\Continuum\anaconda3\envs\bds\lib\http\client.py", line 1026, in _send_output
    self.send(msg)
  File "C:\Users\SamVa\AppData\Local\Continuum\anaconda3\e


### 3.2 Visualise global trending hashtags
Trending hashtags amongst hoax-believer tweets are shown, along with their occurrence count. The example below restricts the tweets to a radius around a given city.
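
Counting hashtag occurrences is straightforward with `collections.Counter`. The sketch below assumes the tags of each tweet are stored as one pipe-separated string, as in the `tags` column shown earlier. The function `count_tags` is hypothetical and is not part of `Visualisation`; lower-casing the tags is also an assumption.

```python
from collections import Counter


def count_tags(tag_strings, top_n=10):
    """Count hashtags over an iterable of pipe-separated tag strings."""
    counts = Counter()
    for tags in tag_strings:
        if not tags:
            continue
        # Lower-case so that '#CoronaHoax' and '#coronahoax' are counted together.
        counts.update(tag.lower() for tag in tags.split("|") if tag)
    return counts.most_common(top_n)


print(count_tags(["#CoronaHoax|#Plandemic", "#coronahoax", ""]))
```

With the dataframe from this notebook, the input could be `df["tags"]`, filtered beforehand to the rows of interest.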

In [None]:
city = "New York City"
country = "USA"
radius = 100

In [None]:
vis=Visualisation(df)
vis.trending_tags_local(city, country, radius)

### 3.3 Visualise hoax believers
The location of hoax believers can be shown on a heat map.

In [None]:
vis.heat_map()