# ====================================================
# ======== Big Data Science: Project demo (May 2020) ========
# ====================================================
## Study of COVID-19 denialism through Twitter activity
### Project group 26:  Wannes Van Leemput, Sam Vanmassenhove

In this project we study the phenomenon of "COVID denialism": the refusal, despite all evidence to the contrary, to acknowledge the coronavirus as a serious threat to society, often by referring to it as a "hoax".

We searched corona-related tweets for signs of denialism and created a labelled training set based on the hashtags used by such denialists. The training data was then used to train a classification model in order to predict whether a certain tweet was a case of COVID denial or a regular tweet on the subject of coronavirus.
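
The hashtag-based labelling described above can be sketched in plain Python. This is a minimal illustration and not the project's `DataMiner` code: the function name `label_by_hashtags` and the abbreviated `DENIAL_TAGS` set are assumptions made for the example. The label convention (0 = denial, 1 = control) follows the one used when the training file is written further down.

```python
import re

# Hypothetical, abbreviated set of denialist hashtags (lower-cased).
DENIAL_TAGS = {"#coronahoax", "#covidhoax", "#plandemic", "#scamdemic"}

HASHTAG_RE = re.compile(r"#\w+")

def label_by_hashtags(text):
    """Return 0 (denial) if the tweet uses a denialist hashtag, else 1 (control)."""
    tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
    return 0 if tags & DENIAL_TAGS else 1

examples = [
    "This is just gross #COVID19 #Plandemic #CovidHoax",
    "Great #StayAtHome #StaySafe #Covid_19",
]
labels = [label_by_hashtags(t) for t in examples]
print(labels)  # [0, 1]
```

Matching is case-insensitive because the same tag shows up in several spellings (for example `#CoronaHoax`, `#coronahoax` and `#CORONAHOAX`).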

In [20]:
# Import libraries
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, Row
import pandas as pd
import numpy as np
import string
import nltk
from nltk.corpus import stopwords
import re
import sys
sys.path.append('../python files')

# Import our own code
from Authentication import Authentication
from DataMiner import DataMiner
from PreProcessTweets import PreProcessTweets
from TweetDataIO import TweetDataIO
from DenialPredictor import DenialPredictor
from datastream_test import MyStreamListener
from Visualisation import Visualisation
from LocationService import LocationService

# Create a set of English stopwords
sw = set(stopwords.words("english")) 

# Initiate spark
spark = SparkSession.builder.getOrCreate()
# Reuse the session's SparkContext; `sc` is passed to TweetDataIO further down
sc = spark.sparkContext

# Get twitter api authentication
api = Authentication().get_api()

## 1. Classification of denialist tweets
### 1.1 Demonstration of tweet searching and IO

We use the Tweepy library to search tweets. We focus on English-language tweets as these are the most common and the most geographically diverse.

Note that the training set should already exist as a saved CSV-file when running this "application" in production. The following code is only for the purpose of demonstrating the IO functionality.

In [4]:
# Mine some denial tweets (no specific location)
tagignore = ["#Covid_19", "#coronavirus", "#COVIDー19", "#COVID19", "#coronavirusNYC", "#coronavirusoregon", 
             "#lockdown", "#covid19", "#COVID", "#pandemic", "#Corona", "#Covid19", "#CoronaVirus"]
miner = DataMiner(api, "#CoronaHoax", "", "en", tagignore=tagignore, num_tweets=10)
denial_tweets = miner.mine()

Processing tag: #BillGatesIsEvil
Processing tag: #coronaHoax
Processing tag: #FilmYourHospital
Processing tag: #scamdemic
Processing tag: #Plandemic2020
Processing tag: #POTUS
Processing tag: #QAnon
Processing tag: #ResistTheNewWorldOrder
Processing tag: #BillGates
Processing tag: #CORONAHOAX
Processing tag: #plandemic
Processing tag: #Coronabollocks
Processing tag: #sos
Processing tag: #WWG1WGA
Processing tag: #coronabollocks
Processing tag: #Scamdemic
Processing tag: #NWO
Processing tag: #CovidHoax
Processing tag: #q
Processing tag: #woke
Processing tag: #thegreatawakening
Processing tag: #DrainTheSwamp
Processing tag: #Coronahoax
Processing tag: #BillGatesBioTerrorist
Processing tag: #endthelockdown
Processing tag: #FakePandemic
Processing tag: #ObamaGate
Processing tag: #Plandemic
Processing tag: #coronahoax
Processing tag: #CoronaHoax


In [5]:
# Mine some control tweets (no specific location)
tagignore = ["#CoronaHoax", "#covidhoax", "#coronahoax", "#Plandemic"]
miner = DataMiner(api, "coronavirus", "", "en", tagignore=tagignore, num_tweets=10)
control_tweets = miner.mine()

Processing tag: #Covid_19
Processing tag: #Coronavirus
Processing tag: #COVID19
Processing tag: #coronavirus


In [6]:
# Write the tweets to a CSV file
filename = "./training_data.txt"
io = TweetDataIO(filename, spark=spark, context=sc)
io.write(denial_tweets, label=0, append=False)
io.write(control_tweets, label=1, append=True)

In [7]:
# Read same file and show
ddf = io.read()
ddf.show(n=10)

+-----+--------------------+--------------------+--------------------+-------------------+-------------------+---------------+
|label|            location|                tags|                text|               time|           tweet_id|           user|
+-----+--------------------+--------------------+--------------------+-------------------+-------------------+---------------+
|    0|    Salford, England|#BillGates|#BillG...|@BillGates #BillG...|2020-05-20 19:19:03|1263187500786466816|   SlickTrick14|
|    0|                    |#Corruption|#Bill...|@BillGates Hey mo...|2020-05-20 19:09:20|1263185053963730946|       RikiCrew|
|    0|A Wee Spot In Europe|#BillGatesVirus|#...|Both are meeting ...|2020-05-20 19:04:33|1263183848956977152|the_trading_ark|
|    0|         Chicago, IL|#hr6666|#hr6666tr...|Is Contact Tracin...|2020-05-20 19:00:14|1263182764934823937|  anarkistbeatz|
|    0|      United Kingdom|    #BillGatesIsEvil|@davidicke Agreed...|2020-05-20 18:55:40|1263181616731639809| 

In [8]:
# Remove duplicates 
ddf = ddf.orderBy("label").dropDuplicates(["tweet_id"])
ddf.show(n=10)

+-----+------------------+--------------------+--------------------+-------------------+-------------------+---------------+
|label|          location|                tags|                text|               time|           tweet_id|           user|
+-----+------------------+--------------------+--------------------+-------------------+-------------------+---------------+
|    0|   Alberta, Canada|#ObamaGate|#WWG1W...|Correct me if I’m...|2020-05-20 19:12:39|1263185890823847940|    ABPolitical|
|    1|Khyber Pakhtunkhwa|#StayAtHome|#Stay...|Great #StayAtHome...|2020-05-20 19:20:57|1263187976147947522|     drishaq786|
|    0|  Aichi-ken, Japan|#vote|#StillIVote...|Although not all ...|2020-05-20 09:56:19|1263045881864974336|     TheMusicks|
|    0|                  |#COVID19|#Plandem...|This is just gros...|2020-05-20 18:15:37|1263171534883098625|    AmRedPilled|
|    0|Massachusetts, USA|                  #Q|Thanks for linkin...|2020-05-20 19:14:47|1263186424888778756|featherjourney4|
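
A tweet that was found by both the denial search and the control search appears twice, once per label. Spark's `dropDuplicates` does not document which of the duplicate rows is kept, so the preceding `orderBy("label")` may not be honoured. The apparent intent (keep the row with the lowest label for each `tweet_id`) is shown below on toy rows in plain Python. The function name and the toy data are assumptions for illustration only.

```python
def dedupe_keep_lowest_label(rows):
    """Keep one row per tweet_id, preferring the lowest label (0 = denial)."""
    best = {}
    for row in rows:
        current = best.get(row["tweet_id"])
        if current is None or row["label"] < current["label"]:
            best[row["tweet_id"]] = row
    return list(best.values())

# Toy rows: tweet 1 was mined by both searches.
rows = [
    {"tweet_id": 1, "label": 1, "user": "a"},
    {"tweet_id": 1, "label": 0, "user": "a"},
    {"tweet_id": 2, "label": 1, "user": "b"},
]
deduped = dedupe_keep_lowest_label(rows)
print([(r["tweet_id"], r["label"]) for r in deduped])  # [(1, 0), (2, 1)]
```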


### 1.2 Perform preprocessing steps on the tweets to prepare for classification model

In [9]:
p = PreProcessTweets(ddf, 
                     remove_tags=True, 
                     remove_mentions=True, 
                     remove_punctuation=True, 
                     remove_urls=True, 
                     remove_stopwords=True)
ddf = p.preprocess()

Preprocessing...
>> Removing stopwords...
>> Removing urls...
>> Removing hashtags...
>> Removing user mentions...
>> Removing punctuation...
>> Removing whitespace...
Finished preprocessing!
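
The steps listed in the log can be approximated with a few regular expressions. This is a sketch under assumptions, not the code of `PreProcessTweets`: the `preprocess` function, the tiny `STOPWORDS` set (standing in for NLTK's English list) and the example URL are invented for illustration. Stopword matching is case-sensitive here, which is consistent with the sample output in the next cell, where a capitalised "This" survives.

```python
import re
import string

# Small stand-in for NLTK's English stopword list.
STOPWORDS = {"this", "is", "just", "for", "to", "me", "if", "but", "the", "a"}

def preprocess(text):
    """Strip stopwords, URLs, hashtags, mentions, punctuation and extra whitespace."""
    tokens = [t for t in text.split() if t not in STOPWORDS]  # case-sensitive
    text = " ".join(tokens)
    text = re.sub(r"https?://\S+", "", text)  # URLs
    text = re.sub(r"#\w+", "", text)          # hashtags
    text = re.sub(r"@\w+", "", text)          # user mentions
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(preprocess("This is just gross #COVID19 #Plandemic #CovidHoax"))          # This gross
print(preprocess("Great #StayAtHome #StaySafe #Covid_19 https://example.com/x"))  # Great
```

Note that `string.punctuation` only covers ASCII characters, so typographic quotes such as ’ and “ are left in place by this sketch.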


In [10]:
# Convert to pandas and have a look at the new data
df = ddf.toPandas()
df.head()

Unnamed: 0,label,location,tags,text,time,tweet_id,user,processed_text
0,0,"Alberta, Canada",#ObamaGate|#WWG1WGA_WORLDWIDE|#WWG1WGA|#TheGre...,Correct me if I’m wrong but isn’t “conspiracy ...,2020-05-20 19:12:39,1263185890823847940,ABPolitical,Correct I’m wrong isn’t “conspiracy overthrow ...
1,1,Khyber Pakhtunkhwa,#StayAtHome|#StaySafe|#Covid_19,Great #StayAtHome #StaySafe #Covid_19 https://...,2020-05-20 19:20:57,1263187976147947522,drishaq786,Great
2,0,"Aichi-ken, Japan",#vote|#StillIVote|#ResistTheNewWorldOrder|#Rev...,Although not all of problems can be solved wit...,2020-05-20 09:56:19,1263045881864974336,TheMusicks,Although problems solved Representation matter...
3,0,,#COVID19|#Plandemic|#CovidHoax,This is just gross #COVID19 #Plandemic #CovidH...,2020-05-20 18:15:37,1263171534883098625,AmRedPilled,This gross
4,0,"Massachusetts, USA",#Q,Thanks for linking to this article #Q “most co...,2020-05-20 19:14:47,1263186424888778756,featherjourney4,Thanks linking article “most committee work pu...


### 1.3 Training and evaluation of classification model
#### 1.3.1 First try on the fully preprocessed text

In [11]:
# Split data (corpus and labels) into train and test sets
predictor = DenialPredictor(corpus=df.processed_text, labels=df.label, clf="nb")
X_train, X_test, y_train, y_test = predictor.train_test_split(split=0.3)

# Fit the model
predictor.fit_model(X_train, y_train)

# Calculate some metrics to evaluate performance
print("Training set performance : ")
predictor.calc_metrics(X_train, y_train)
print("Test set performance : ")
predictor.calc_metrics(X_test, y_test)

Training set performance : 
Accuracy: 0.892,
Precision: 1.000, 
Recall: 0.100,
F1: 0.182
Test set performance : 
Accuracy: 0.903,
Precision: 1.000, 
Recall: 0.125,
F1: 0.222
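
The F1 values follow directly from the reported precision and recall, since F1 is their harmonic mean. The short check below reproduces both figures; the helper name `f1_score` is chosen for this example only. It also makes the weakness of the model visible: perfect precision combined with a recall of about 0.1 yields a low F1 even though accuracy is close to 0.9.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

train_f1 = f1_score(1.000, 0.100)
test_f1 = f1_score(1.000, 0.125)
print(f"{train_f1:.3f} {test_f1:.3f}")  # 0.182 0.222
```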


#### 1.3.2 Now do the same but with reduced preprocessing (only hashtags and stopwords removed)

In [12]:
p = PreProcessTweets(ddf, 
                     remove_tags=True, 
                     remove_mentions=False, 
                     remove_punctuation=False, 
                     remove_urls=False, 
                     remove_stopwords=True)
ddf = p.preprocess()
df = ddf.toPandas()
print("\n")

# Split data (corpus and labels) into train and test sets
predictor = DenialPredictor(df.processed_text, df.label)
X_train, X_test, y_train, y_test = predictor.train_test_split(split=0.3)

# Fit the model
predictor.fit_model(X_train, y_train)

# Calculate some metrics to evaluate performance
print("Training set performance : ")
predictor.calc_metrics(X_train, y_train)
print("Test set performance : ")
predictor.calc_metrics(X_test, y_test)

Preprocessing...
>> Removing stopwords...
>> Removing hashtags...
>> Removing whitespace...
Finished preprocessing!


Training set performance : 
Accuracy: 0.886,
Precision: 1.000, 
Recall: 0.050,
F1: 0.095
Test set performance : 
Accuracy: 0.889,
Precision: 0.000, 
Recall: 0.000,
F1: 0.000
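
The test-set figures can be reproduced from confusion-matrix counts. The counts used below (72 test tweets, 8 of them in the positive class) are not printed by the notebook; they are an assumption, chosen because they match the reported values to three decimals. The sketch also shows why precision comes out as 0.000 when the model predicts the positive class for no tweet at all: the denominator is zero and the value falls back to 0.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall); undefined ratios fall back to 0.0."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Assumed counts: one of the 8 positive tweets is found, no false positives.
one_hit = metrics(tp=1, fp=0, fn=7, tn=64)
# Assumed counts: every test tweet is predicted as the negative class.
no_hit = metrics(tp=0, fp=0, fn=8, tn=64)

print(", ".join(f"{m:.3f}" for m in one_hit))  # 0.903, 1.000, 0.125
print(", ".join(f"{m:.3f}" for m in no_hit))   # 0.889, 0.000, 0.000
```

Under these assumed counts, an accuracy of 0.889 is exactly what a classifier that always predicts the majority class would achieve, so accuracy alone says little on this data.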




#### 1.3.3 The same again but this time with a different classifier: Support Vector Machines (SVM)

In [13]:
# Split data (corpus and labels) into train and test sets
svm_predictor = DenialPredictor(df.processed_text, df.label, clf="svm")
X_train, X_test, y_train, y_test = svm_predictor.train_test_split(split=0.3)

# Fit the model
svm_predictor.fit_model(X_train, y_train)

# Calculate some metrics to evaluate performance
print("Training set performance : ")
svm_predictor.calc_metrics(X_train, y_train)
print("Test set performance : ")
svm_predictor.calc_metrics(X_test, y_test)

Training set performance : 
Accuracy: 1.000,
Precision: 1.000, 
Recall: 1.000,
F1: 1.000
Test set performance : 
Accuracy: 0.903,
Precision: 1.000, 
Recall: 0.125,
F1: 0.222


## 2. Live tweet streaming and prediction

In [14]:
from LivePredictionStream import LivePredictionStream
import tweepy

In [15]:
FILE_NAME = "hoax.csv"
api = Authentication(isApp=False).get_api()
auth = api.auth
streamListener = LivePredictionStream(FILE_NAME, predictor, num_iter=20, verbose=1)
stream = tweepy.Stream(auth=auth, listener=streamListener)

try:
    print('Start streaming...')
    stream.filter(languages=['en'], 
                    track=["coronavirus"])

except Exception:
    print("Stopped.")

finally:
    print('Done.')
    stream.disconnect()

Start streaming...
(1) denial (0.94): How about another round of PPP
(2) denial (0.93): Drop by our Facebook page or follow the link below to find out more about how schools in Clackmannanshire are consulting with parents and carers on their child’s wellbeing and learning ⁦AtLochies⁩
(3) denial (0.93): Everyday we have to ProtectTheVote
(4) denial (0.89): 🔥Check out this news coronavirus COVID19 COVID19 COVIDー19 Wuhan coronaviruschina WuhanCoronavirus ChinaVirus gtshare and RT
(5) denial (0.93): Stanford has new free journalism coronavirus webinars DataDriven Storytelling Data Viz Be a Good Boss in Trying Times Innovate FastBetter Cyber Security
(6) denial (0.93): The Trump administration gave a drugmaking contract worth up to 812 million to a small Virginia firm founded less than 6 months ago SmartNews
(7) denial (0.96): Coronavirus of course they warn against it Because disagreeing with Trump is priority instead of saving lives Trump is withholding funding to the Who This is disgusti

## 3. Visualisations

## Acquiring tweet data

Tweet data is acquired and preprocessed. Preprocessing includes geocoding the author location. Since a free API is used, we are limited to 1 request per second, so the geocoding can take a while.
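
A rate limit of one request per second is usually handled by spacing out calls and caching repeated locations, since many authors share the same location string. The sketch below illustrates that idea and is not the code of `LocationService`: the class `ThrottledGeocoder`, its parameters and the `fake_geocode` stand-in (with approximate coordinates) are all assumptions for this example.

```python
import time

class ThrottledGeocoder:
    """Wrap a geocoding function: cache results and space out real requests."""

    def __init__(self, geocode, min_interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._geocode = geocode
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None
        self._cache = {}

    def lookup(self, location):
        key = location.strip().lower()
        if key in self._cache:          # repeated location: no request needed
            return self._cache[key]
        if self._last_call is not None:
            wait = self._min_interval - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)       # respect the rate limit
        self._last_call = self._clock()
        result = self._geocode(location)
        self._cache[key] = result
        return result

# Stand-in for a real geocoding request (hypothetical lookup table).
def fake_geocode(location):
    table = {"Chicago, IL": (41.88, -87.63)}
    return table.get(location)

geocoder = ThrottledGeocoder(fake_geocode, min_interval=1.0)
first = geocoder.lookup("Chicago, IL")
second = geocoder.lookup("chicago, il")  # served from the cache
print(first, second)
```

Passing `clock` and `sleep` as parameters keeps the class testable without actually waiting.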

In [22]:
locationservice = LocationService()
filename = "./training_data.txt"
io = TweetDataIO(filename, spark=spark, context=sc)
ddf = io.read()
df = ddf.toPandas()
df = locationservice.add_location_data(df)

## Visualise global trending hashtags
Global trending hashtags amongst hoax-believer tweets are shown, along with their occurrence counts.
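
Counting hashtags amounts to splitting the `tags` column, which stores tags separated by `|`, and tallying the pieces. The following sketch shows one way to do this with `collections.Counter`; it is not the code of `Visualisation`. The function name `count_tags` is an assumption, and the third row of the toy data is made up.

```python
from collections import Counter

def count_tags(tag_strings):
    """Count hashtags case-insensitively across '|'-separated tag strings."""
    counts = Counter()
    for tags in tag_strings:
        counts.update(t.lower() for t in tags.split("|") if t)
    return counts

tags_column = [
    "#COVID19|#Plandemic|#CovidHoax",
    "#StayAtHome|#StaySafe|#Covid_19",
    "#plandemic",  # made-up row for illustration
]
print(count_tags(tags_column).most_common(1))  # [('#plandemic', 2)]
```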

In [23]:
city = "New York City"
country = "USA"
radius = 100

In [24]:
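# `vis` must be an instance of the imported Visualisation class; it is never
# created in this notebook, which causes the NameError below.
vis.trending_tags_local(city, country, radius)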

NameError: name 'vis' is not defined

## Visualise hoax believers
The location of hoax believers can be shown on a heat map.

In [None]:
vis.heat_map()