<a href="https://colab.research.google.com/github/stephen-crone/monkeypox/blob/main/model-retesting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **<br>Detecting Misinformation and Superspreaders in Social Media:<br>Designing a System for the Next Pandemic**
# Phase 5: Evaluating model resilience to temporal shift

Stephen Crone
<br>
<br>
<br>


---
## **1. Setting up programming environment**
Our first step will be to install and import the libraries we need.

---

### **1.1. Installing libraries**

In [None]:
# For access to transformer models.
!pip install transformers
# For operations on HuggingFace dataset objects.
!pip install datasets

### **1.2. Importing libraries**

In [None]:
# For general dataset manipulation.
from datasets import ClassLabel, Dataset, DatasetDict, Features, Value
import numpy as np
import pandas as pd
import tensorflow_datasets as tfds
# For dataset preprocessing.
from transformers import AutoTokenizer, DefaultDataCollator
# For model predictions / evaluation.
from transformers import TFAutoModelForSequenceClassification
import tensorflow as tf
from sklearn.metrics import confusion_matrix, classification_report
# For visualisation.
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Other imports.
import random as python_random
import datetime
import time

<br>
<br>

---
## **2. Importing the dataset**
In this section, we load the monkeypox misinformation follow-up dataset and create two versions to select from: the first a full version (based on the 'misinformation' vs 'other' class split); the second a smaller subset (based on the 'misinformation' vs 'good information' class split).

Please note: the user must upload a Kaggle API token (`kaggle.json`) to session storage before running the cell below. Without it, the dataset download will fail with an error.

---

In [None]:
# Preparing Kaggle and Kaggle API token.
! pip install kaggle
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
# Downloading the dataset from Kaggle.
! kaggle datasets download stephencrone/monkeypox
# Unzipping the dataset.
! unzip monkeypox
# Assigning dataset to Pandas DataFrame.
bigDF = pd.read_csv('/content/monkeypox-followup.csv')
# Reconfiguring datetime features.
bigDF['created_at'] = pd.to_datetime(bigDF['created_at'])
bigDF['user created at'] = pd.to_datetime(bigDF['user created at'])
# Creating a second, smaller DataFrame where we replace 'other' (i.e. non-misinformation)
# class with 'good' (i.e. reliable information) class.
littleDF = bigDF.copy()
littleDF = littleDF.drop(littleDF[littleDF.ternary_class == 9].index)
littleDF = littleDF.drop(['binary_class'],axis=1)
littleDF = littleDF.rename(columns={"ternary_class": "class"})
# Removing redundant class label column from bigDF.
bigDF = bigDF.drop(['ternary_class'],axis=1)
bigDF = bigDF.rename(columns={"binary_class": "class"})
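
To make the two versions concrete, here is a minimal sketch of the same split on a toy DataFrame. The rows are invented for illustration, and the label coding (`ternary_class` 0 = good information, 1 = misinformation, 9 = other; `binary_class` 0 = other, 1 = misinformation) is an assumption inferred from the class names used later in the notebook.

```python
import pandas as pd

# Toy stand-in for monkeypox-followup.csv (invented rows, assumed label coding).
toy = pd.DataFrame({
    "text": ["tweet a", "tweet b", "tweet c", "tweet d"],
    "binary_class": [0, 1, 0, 1],
    "ternary_class": [0, 1, 9, 1],
})

# Smaller version: discard rows labelled 9 ('other'), keep the ternary labels.
little = toy[toy["ternary_class"] != 9].drop(columns=["binary_class"])
little = little.rename(columns={"ternary_class": "class"})

# Full version: keep every row, labelled by the binary split.
big = toy.drop(columns=["ternary_class"]).rename(columns={"binary_class": "class"})

print(len(big), len(little))
```

The smaller version loses every row that is neither reliable information nor misinformation, which is why its class distribution can differ noticeably from that of the full version.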

<br>
<br>


---
## **3. Exploratory data analysis (EDA)**
There is no need to subject this follow-up dataset to the same level of EDA as the larger dataset (on which the final model was trained). However, as a minimum, we will quickly review the features of the dataset and the distribution of the class labels. In addition, we will look at the distribution of the newly added 'beto_flag' feature (a Boolean feature that denotes whether or not the tweet references the Texan Democratic politician Beto O'Rourke), which has been added because [false reports](https://www.reuters.com/article/factcheck-beto-bacterial-infection-idUSL1N3051YN) of his infection with monkeypox were conspicuously prominent during annotation of the data.

---


In [None]:
# Review dataset head.
bigDF.head()

In [None]:
# Review dataset info.
bigDF.info()

In [None]:
# Show class distribution for larger version of dataset.
bigDF['class'].value_counts(normalize=True).sort_index()

In [None]:
# Show class distribution for smaller version of dataset.
littleDF['class'].value_counts(normalize=True).sort_index()

In [None]:
# Show distribution of beto_flag feature for larger version of dataset.
bigDF['beto_flag'].value_counts(normalize=True).sort_index()

In [None]:
# Show distribution of beto_flag feature for smaller version of dataset.
littleDF['beto_flag'].value_counts(normalize=True).sort_index()
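
The separate distributions above do not show how 'beto_flag' relates to the class labels. One possible way to look at that relationship is a row-normalised cross-tabulation, sketched here on invented toy data rather than on the real dataset.

```python
import pandas as pd

# Invented toy data: class 1 = misinformation, class 0 = other.
toy = pd.DataFrame({
    "class": [0, 0, 1, 1, 1, 0],
    "beto_flag": [False, False, True, True, False, False],
})

# Share of each class among unflagged and flagged tweets (each row sums to 1).
shares = pd.crosstab(toy["beto_flag"], toy["class"], normalize="index")
print(shares)
```

Applied to `bigDF`, the same call would show whether tweets that reference Beto O'Rourke are disproportionately labelled as misinformation.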

<br>
<br>

---
## **4. Preprocessing the dataset**
For Phase 5, we will preprocess the dataset in exactly the same way as we did for the winning model from Phase 2. This means: (i) selecting the larger version of the dataset (with 'misinformation' vs. 'non-misinformation' class labels); and (ii) allowing the model to see the same combination of features that the winning model saw during training (i.e. the tweet text plus the user verification column). Once those steps are complete, the dataset must be tokenized and converted into a format that TensorFlow can process.

---

In [None]:
# Decide whether to test the 'Beto theory': that the decline in classifier performance can be accounted for
# in part by the emergence of false claims regarding Beto O'Rourke's purported monkeypox infection.
# When enabled, only tweets flagged as referencing Beto O'Rourke are kept.
exploreBetoFactor = False
if exploreBetoFactor:
  bigDF = bigDF.drop(bigDF[bigDF.beto_flag == False].index)
  littleDF = littleDF.drop(littleDF[littleDF.beto_flag == False].index)

In [None]:
# Selecting larger dataset for consistency with Phase 2 winning model.
datasetChoice = bigDF

In [None]:
# Selecting same dataset features as Phase 2 winning model.
chosenFeatures = ['text', 'user is verified', 'class']
featuresToConcatenate = chosenFeatures.copy()
featuresToConcatenate.remove('class')

In [None]:
# Concatenating the desired features in a copy of the dataframe.
slimmedDownDF = datasetChoice[chosenFeatures].copy()
for i in featuresToConcatenate:
  slimmedDownDF[i] = slimmedDownDF[i].name + ": " + slimmedDownDF[i].astype(str)
slimmedDownDF['combined'] = slimmedDownDF[featuresToConcatenate].apply(lambda row: ' [SEP] '.join(row.values.astype(str)), axis=1)
finalDF = slimmedDownDF[['combined','class']].copy()
# Assigning correct class labels.
if datasetChoice is bigDF:
  classNames = ["other","misinformation"]
elif datasetChoice is littleDF:
  classNames = ["good information","misinformation"]
else:
  raise ValueError("datasetChoice must be either 'bigDF' or 'littleDF'")
# Creating a Hugging Face dataset object from the slimmed-down Pandas dataframe.
ds_features = Features({'combined': Value('string'), 'class': ClassLabel(names=classNames)})
dataset = Dataset.from_pandas(df=finalDF, features=ds_features,preserve_index=False)
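
To show what the model actually receives, the following sketch repeats the concatenation step on a single invented row. The tweet text and verification value are made up; the column names and the ' [SEP] ' separator are the ones used above.

```python
import pandas as pd

# One invented row with the same columns as chosenFeatures.
toy = pd.DataFrame({
    "text": ["Monkeypox is airborne"],
    "user is verified": [False],
    "class": [1],
})
feature_columns = ["text", "user is verified"]

# Prefix each value with its column name, then join the columns with ' [SEP] '.
for column in feature_columns:
    toy[column] = column + ": " + toy[column].astype(str)
combined = toy[feature_columns].apply(
    lambda row: " [SEP] ".join(row.values.astype(str)), axis=1)

print(combined.iloc[0])
# text: Monkeypox is airborne [SEP] user is verified: False
```

Each feature therefore arrives as labelled plain text, so the verification status is read by the tokenizer like any other part of the input string.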

In [None]:
# Instantiating the winning model and tokenizer.
model = TFAutoModelForSequenceClassification.from_pretrained("smcrone/monkeypox-misinformation")
tokenizer = AutoTokenizer.from_pretrained("smcrone/monkeypox-misinformation",use_fast=False)

In [None]:
# Tokenizing the dataset.
def tokenize_function(examples):
  return tokenizer(examples['combined'], padding = "max_length", truncation=True)
tokenized_dataset = dataset.map(tokenize_function, batched=True)
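
The effect of `padding="max_length"` together with `truncation=True` is that every example ends up with the same number of token IDs. The sketch below illustrates that effect in plain Python. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token IDs are invented, and a real tokenizer also keeps the closing special token when it truncates, which this simplification ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                 # truncation
    padding = [pad_id] * (max_length - len(kept)) # padding up to max_length
    attention_mask = [1] * len(kept) + [0] * len(padding)
    return kept + padding, attention_mask

# A short sequence is padded; a long one is cut down to max_length.
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask marks which positions hold real tokens (1) and which hold padding (0), so the padded positions can be ignored by the model.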

In [None]:
# Converting tokenized dataset to TensorFlow format.
data_collator = DefaultDataCollator(return_tensors="tf")
tf_test_dataset = tokenized_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=['class'],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=1)

<br>
<br>

---
## **5. Evaluating the model**
In this section, we run the winning model on the follow-up dataset and evaluate its predictions with a classification report and a confusion matrix.

---

In [None]:
# Compiling the model.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-6),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

In [None]:
# Creating classification report based on model predictions.
testDataLabels = np.concatenate([y for x, y in tf_test_dataset], axis=0)
modelLogits = model.predict(tf_test_dataset).logits
# The predicted class for each tweet is the index of its largest logit.
modelPredictions = np.argmax(modelLogits, axis=1)
classificationReport = classification_report(testDataLabels,modelPredictions,digits=5)
print(classificationReport)
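
As a rough guide to reading the report, the sketch below works through the same calculation by hand for the 'misinformation' class (label 1). The logits and labels are invented, and the arithmetic is the standard definition of precision, recall and F1 rather than output from the real model.

```python
import numpy as np

# Invented logits for four tweets (columns: other, misinformation) and their true labels.
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
labels = np.array([0, 1, 0, 0])

# Predicted class = index of the largest logit in each row.
predictions = np.argmax(logits, axis=1)

# Counts for the positive ('misinformation') class.
true_positives = int(np.sum((predictions == 1) & (labels == 1)))
false_positives = int(np.sum((predictions == 1) & (labels == 0)))
false_negatives = int(np.sum((predictions == 0) & (labels == 1)))

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(predictions, precision, recall, round(f1, 5))
```

In this toy case the model finds the only misinformation tweet (recall 1.0) but also flags one harmless tweet, so precision drops to 0.5.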

In [None]:
# Creating and plotting a confusion matrix based on model predictions.
confusionMatrix = confusion_matrix(testDataLabels, modelPredictions, labels=None, sample_weight=None, normalize=None)
sns.heatmap(confusionMatrix, square=True, annot=True, cbar=False, cmap="Blues",fmt='g')
plt.title("Fine-tuned model predictions")
plt.xlabel('predicted value')
plt.ylabel('true value')
plt.show()
plt.clf()
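
The heatmap shows raw counts, with true classes on the rows and predicted classes on the columns. When the classes are imbalanced, it can help to divide each row by its total so that the diagonal reads as per-class recall. The following is a minimal sketch on invented labels, building the matrix with NumPy instead of scikit-learn to make the layout explicit.

```python
import numpy as np

# Invented true and predicted labels (0 = other, 1 = misinformation).
true_labels = np.array([0, 0, 0, 1, 1, 0])
predicted_labels = np.array([0, 1, 0, 1, 0, 0])

# Count each (true, predicted) pair: rows are true classes, columns are predictions.
matrix = np.zeros((2, 2), dtype=int)
np.add.at(matrix, (true_labels, predicted_labels), 1)

# Divide each row by its total; the diagonal then holds per-class recall.
row_shares = matrix / matrix.sum(axis=1, keepdims=True)

print(matrix)
print(row_shares)
```

With scikit-learn, passing `normalize='true'` to `confusion_matrix` gives the same row-normalised view.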