<a href="https://colab.research.google.com/github/stephen-crone/monkeypox/blob/main/model_experiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **<br>Detecting Misinformation and Superspreaders in Social Media:<br>Designing a System for the Next Pandemic**
#Phase 2:  Training and evaluating candidate models

Stephen Crone
<br>
<br>
<br>


---
## **1. Setting up programming environment**
Our first step will be to install and import the libraries we need.

---

###**1.1. Installing libraries**

In [None]:
# For access to transformer models.
!pip install transformers
# For operations on HuggingFace dataset objects.
!pip install datasets
# For few-shot learning (FSL) experiments.
!pip install accelerate
# For tokenization with Deberta models.
!pip install sentencepiece
# For additional optimizer options.
!pip install tensorflow-addons

###**1.2. Importing libraries**

In [None]:
# For file handling and data exports.
from google.colab import files
from openpyxl import Workbook, load_workbook
import os
# For general dataset manipulation.
from datasets import ClassLabel, Dataset, DatasetDict, Features, Value
import numpy as np
import pandas as pd
import tensorflow_datasets as tfds
# For dataset preprocessing (e.g. splitting, tokenization).
from transformers import AutoTokenizer, DefaultDataCollator
from sklearn.model_selection import train_test_split
import sentencepiece
# For exploratory data analysis.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from wordcloud import WordCloud
# For model training.
from transformers import TFAutoModelForSequenceClassification, pipeline, GPTJForCausalLM
import torch
import tensorflow as tf
import tensorflow_addons as tfa
# For model evaluation.
from sklearn.metrics import confusion_matrix, classification_report
# For visualisation.
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# For model export to HuggingFace hub.
from huggingface_hub import notebook_login
# Other imports
import random as python_random
import datetime
import time

###**1.3. Setting random seed**

In [None]:
# Setting random seed
np.random.seed(1)
tf.random.set_seed(2)
python_random.seed(3)

<br>
<br>

---
## **2. Importing the dataset**
In this section, we load the monkeypox misinformation dataset and create two versions to use in the notebook: the first a full version (based on the 'misinformation' vs 'other' class split); the second a smaller subset (based on the 'misinformation' vs 'good information' class split).

Please note: you must upload a Kaggle API token (`kaggle.json`) to session storage before running the cell below. Without it, the dataset download will fail with an error.

---

In [None]:
# Preparing Kaggle and Kaggle API token.
! pip install kaggle
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
# Downloading the dataset from Kaggle.
! kaggle datasets download stephencrone/monkeypox
# Unzipping the dataset.
! unzip monkeypox
# Assigning dataset to Pandas DataFrame.
bigDF = pd.read_csv('/content/monkeypox.csv')
# Reconfiguring datetime features.
bigDF['created_at'] = pd.to_datetime(bigDF['created_at'])
bigDF['user created at'] = pd.to_datetime(bigDF['user created at'])
# Creating a second, smaller DataFrame where we replace 'other' (i.e. non-misinformation)
# class with 'good' (i.e. reliable information) class.
littleDF = bigDF.copy()
littleDF = littleDF.drop(littleDF[littleDF.ternary_class == 9].index)
littleDF = littleDF.drop(['binary_class'],axis=1)
littleDF = littleDF.rename(columns={"ternary_class": "class"})
# Removing redundant class label column from bigDF.
bigDF = bigDF.drop(['ternary_class'],axis=1)
bigDF = bigDF.rename(columns={"binary_class": "class"})
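
The relationship between the two versions can be illustrated with a toy example. The sketch below uses plain Python records rather than the real dataset. It assumes, as the filtering step above implies, that a `ternary_class` value of 9 marks posts that are neither misinformation nor reliable information. The rows themselves are invented.

```python
# Toy records standing in for rows of the dataset (values are invented).
rows = [
    {"text": "tweet a", "binary_class": 1, "ternary_class": 1},  # misinformation
    {"text": "tweet b", "binary_class": 0, "ternary_class": 0},  # reliable information
    {"text": "tweet c", "binary_class": 0, "ternary_class": 9},  # other, not reliable information
    {"text": "tweet d", "binary_class": 0, "ternary_class": 9},
]

# Full version: every row, labelled misinformation (1) vs other (0).
big = [{"text": r["text"], "class": r["binary_class"]} for r in rows]

# Smaller version: drop rows coded 9, leaving misinformation (1) vs good information (0).
little = [{"text": r["text"], "class": r["ternary_class"]}
          for r in rows if r["ternary_class"] != 9]

print(len(big), len(little))
```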

<br>
<br>


---
## **3. Exploratory data analysis**
In this section we provide an overview of the features in the dataset and their data types; consider the superficial lexical differences that can be observed between the classes; and identify the features, aside from tweet text, which might help in separating the classes.  

---


###**3.1. Overview of dataset features and class distribution**
As the following table shows, we have a wealth of tweet-related and user-related features that we could potentially feed to our classifier.

In [None]:
bigDF.head()

Given that this is a dataset that we have collated ourselves, we know already that there are no missing values.

In [None]:
bigDF.info()

We have a choice, however, of two binary labelling systems: a 'misinformation' / 'other' system involving the full dataset ('bigDF'); or a 'misinformation' / 'good information' system ('littleDF'). 'Good information' posts are a subset of the 'other' class, meaning that this dataset is the smaller of the two. However, it has the advantage of being less imbalanced in terms of the underrepresentation of the 'misinformation' class. 

In [None]:
# Show class distribution for larger version of dataset.
bigDF['class'].value_counts(normalize=True).sort_index()

In [None]:
# Show class distribution for smaller version of dataset.
littleDF['class'].value_counts(normalize=True).sort_index()

###**3.2. Exploring the similarity of the classes**
Although it will be for the transformer model to separate the classes, we can form a preliminary view as to how difficult this is likely to be based on some straightforward lexical analysis of the tweets themselves.

Overall, this analysis suggests that whilst there are discernible differences between the classes, separating them is not likely to be a trivial task. This view is supported firstly by our word clouds. As the word clouds show, the misinformation category clearly reflects a distinct set of themes, centred for example on the purported transmission of the virus via gay sex and airborne particles. However, there are also points of overlap between the classes -- including for example the discussion of Covid-19 in the context of both 'misinformation' and 'other' posts.

In [None]:
# Selecting samples from each of the three classes.
tweets_containing_misinformation = bigDF.loc[bigDF['class'] == 1]
tweets_not_containing_misinformation = bigDF.loc[bigDF['class'] == 0]
tweets_containing_reliable_information = littleDF.loc[littleDF['class'] == 0]
threeViews = [["Tweets containing misinformation",tweets_containing_misinformation],
              ["Tweets not containing misinformation",tweets_not_containing_misinformation],
              ["Tweets containing reliable information",tweets_containing_reliable_information]]
# Creating set of stopwords to exclude from wordclouds.
stopWords = set(stopwords.words('english'))
stopWords = ['monkey','pox','_URL_','monkeypox','amp','get'] + list(stopWords)

In [None]:
# Creating and plotting trio of wordclouds.
for i in range(len(threeViews)):
  x = WordCloud(stopwords=stopWords,width=3500,height=2500,max_words=100,collocations=False).generate(' '.join(threeViews[i][1]['text']))
  plt.figure( figsize=(20,10) )
  plt.imshow(x)
  plt.axis("off")
  print(threeViews[i][0])
  plt.show()

Visualising the sentence embeddings exported in Section 6.5 equally suggests that the classes are not readily separable.

###**3.3. Identifying features that might help to separate the classes**
We have various features -- including various text features, besides tweets themselves -- that we could additionally incorporate into our transformer model. The good news, given the maximum length of a tweet and the maximum number of tokens that transformer models can typically accommodate, is that there is plenty of scope to incorporate these additional features as part of a single input pattern. Moreover, there is evidence to suggest that transformer models can understand numbers, despite the fact that they are obviously parsed by the model as token embeddings rather than numerical data per se [(Wallace et al., 2019)](https://arxiv.org/abs/1909.07940). This opens up the possibility that even numerical features could be used to enrich the model.

In terms of features that might be helpful in separating our classes, it seems possible that **tweet source** might be of use. Although there is little difference between the 'misinformation' and 'other' classes, it is evident that the subset of reliable tweets from the 'other' class features a greater proportion of tweets published via tweet schedulers and web software (e.g. Wordpress).

In [None]:
# Show tweet source in descending order of importance for misinformation class (1).
pd.crosstab(bigDF['source'], bigDF['class'],normalize='columns').sort_values(by=1, axis=0, ascending=False, kind='stable').head(10)

In [None]:
# Show tweet source in descending order of importance for reliable information class (0).
pd.crosstab(littleDF['source'], littleDF['class'],normalize='columns').sort_values(by=0, axis=0, ascending=False, kind='stable').head(10)

Another possibly useful feature is the '**user is verified**' column. As the data tables below demonstrate, although few accounts overall are verified, verified accounts are much less likely to publish misleading posts.

In [None]:
# Show the breakdown of 'user is verified values'.
bigDF['user is verified'].value_counts(normalize=True).sort_index()

In [None]:
# Break down class distribution by 'user is verified' values (bigger version of dataset).
pd.crosstab(bigDF['user is verified'], bigDF['class'],normalize='index')

In [None]:
# Break down class distribution by 'user is verified' values (smaller version of dataset).
pd.crosstab(littleDF['user is verified'], littleDF['class'],normalize='index')

Another option is whether or not the user has a **URL** listed on their profile. Only 38 per cent of posts (in the full dataset) are by users with an external URL. The data suggests that these posts are far less likely to spread misinformation (perhaps because users with URLs are more likely to be institutional users).

In [None]:
# Show the overall breakdown of 'user has url' values.
bigDF['user has url'].value_counts(normalize=True).sort_index()

In [None]:
# Show a cross-tab of 'user has url' and class for the larger dataset.
pd.crosstab(bigDF['user has url'], bigDF['class'],normalize='index')

In [None]:
# Show a cross-tab of 'user has url' and class for the smaller dataset.
pd.crosstab(littleDF['user has url'], littleDF['class'],normalize='index')

We would not necessarily expect a relationship between **when the user profile was created** and their propensity for spreading misinformation. However, the fact that so many accounts have been created so recently seems suspicious. Indeed, recently created accounts seem to be especially prevalent among the misinformation class. This raises the possibility of a 'years since account creation' variable being of use in separating the classes.

In [None]:
# Print a series of histogram plots showing when accounts were created, according to class label.
print("User created at (all users)")
sns.displot(data=bigDF,x='user created at',kind='hist',col='class',stat='percent',common_norm=False)

We cannot use post metrics to help judge which class a tweet belongs to as these features will be unavailable (or artificially low) for any tweet that is gathered in real-time (or shortly after publication).

As such, we move on to consider **user metrics**. Here, we notice that users posting misinformation are likely to have a smaller footprint in general. In particular, users posting misinformation are noted as having fewer followers.

In [None]:
# Print a series of histogram plots showing the number of followers, according to class label.
sns.displot(data=bigDF,x='followers count',kind='hist',col='class',binrange=(0,10000),stat='percent',common_norm=False)

In [None]:
# Print boxplots showing the number of followers, according to class label.
sns.boxplot(data=bigDF, x='class',y='followers count', showfliers=False)

In [None]:
# Print a series of histogram plots showing the number of accounts followed, according to class label.
sns.displot(data=bigDF,x='following count',kind='hist',col='class',binrange=(0,10000),stat='percent',common_norm=False)

In [None]:
# Print boxplots showing the number of accounts followed, according to class label.
sns.boxplot(data=bigDF, x='class',y='following count', showfliers=False)

In [None]:
# Print a series of histogram plots showing the number of tweets published, according to class label.
sns.displot(data=bigDF,x='tweet count',kind='hist',col='class',binrange=(0,100000),stat='percent',common_norm=False)

In [None]:
# Print a series of histogram plots showing the number of lists to which user belongs, according to class label.
sns.displot(data=bigDF,x='listed_count',kind='hist',col='class',binrange=(0,50),stat='percent',common_norm=False)

In Section 6, we will consider whether these features (or features based on them) can assist fine-tuned models.

<br>
<br>

---
## **4. Zero-shot experiments**

---
Zero-shot learning can be done in a number of ways [(Davison, 2020)](https://joeddav.github.io/blog/2020/05/29/ZSL.html), with the chosen paradigm having implications in terms of the appropriate model architecture and pretraining regimen. We do not have time during this project to research and experiment with all of these methods, so we opt for the commonly used natural language inference (NLI) method, in which the text is presented as a premise and each candidate label is treated as a hypothesis. Only models that have been pretrained on NLI tasks are appropriate for this approach to zero-shot learning (which excludes 'vanilla' BERT and other models that we might look to under a pretraining / fine-tuning approach). As such, we will conduct experiments here with two NLI-trained models: 'bart-large-mnli' and 'nli-distilroberta-base'.
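
To make the premise / hypothesis framing concrete, the following is a minimal sketch of the idea in plain Python. It is not the pipeline's actual implementation. The hypothesis template is assumed to mirror the default used by the HuggingFace zero-shot pipeline, the function names are hypothetical, and the entailment logits are invented stand-ins for what an NLI model would output for each (tweet, hypothesis) pair.

```python
import math

def build_hypotheses(candidate_labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis string."""
    return [template.format(label) for label in candidate_labels]

def label_scores(entailment_logits):
    """Softmax the per-label entailment logits so that the scores sum to 1."""
    peak = max(entailment_logits)
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    return [e / total for e in exps]

candidate_labels = ["Authoritative", "Unauthoritative"]
hypotheses = build_hypotheses(candidate_labels)

# Invented entailment logits, one per (premise, hypothesis) pair.
scores = label_scores([1.2, -0.3])
for label, score in zip(candidate_labels, scores):
    print(f"{label}: {score:.3f}")
```

The tweet plays the role of the premise in every pair, so the choice of label wording directly changes the hypotheses that the model is asked to judge.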



###**4.1. Instantiating model and tokenizer**

In [None]:
# Instantiating model and tokenizer via the HuggingFace pipeline.
zslModelChoice = 'cross-encoder/nli-distilroberta-base'
zslClassifier = pipeline(task='zero-shot-classification', model=zslModelChoice)

###**4.2. Preparing copy of the dataset**

In [None]:
# Picking dataset version to use.
zslDatasetChoice = "bigDF"
# Picking the class labels that we will experiment with.
candidateLabels = ['Authoritative','Unauthoritative']
# Creating a DataFrame in which to store our results.
if zslDatasetChoice == "littleDF":
  zslDF = pd.DataFrame(data={"text": littleDF['text'].copy(),
                                "class": littleDF['class'].copy(),
                                candidateLabels[0]:np.nan,
                                candidateLabels[1]:np.nan,
                                "predictedLabel":np.nan,
                                  "timeRequired":np.nan})
elif zslDatasetChoice == "bigDF":
  zslDF = pd.DataFrame(data={"text": bigDF['text'].copy(),
                                "class": bigDF['class'].copy(),
                                candidateLabels[0]:np.nan,
                                candidateLabels[1]:np.nan,
                                "predictedLabel":np.nan,
                                  "timeRequired":np.nan})
else:
  raise ValueError("zslDatasetChoice must be either 'bigDF' or 'littleDF'")

In [None]:
# Optionally: split dataset and use a smaller test set, given time required for each inference step.
# Note: comment out the line below if no splitting is desired.
zslDF, zslDiscard = train_test_split(zslDF, test_size=0.95, shuffle=True, random_state=7, stratify=zslDF['class'])

###**4.3. Calling model and storing predictions**

In [None]:
# For each tweet in our dataset ...
for i in range(len(zslDF.index)):
  # Time each step.
  start = time.time()
  sequence = zslDF['text'].iloc[i]
  # Pass the tweet to the ZSL classifier.
  result = zslClassifier(sequence,candidateLabels)
  # Assign the confidence scores to the correct columns of the df.
  if result['labels'][0] == candidateLabels[0]:
    zslDF.loc[zslDF.index[i], candidateLabels[0]] = result['scores'][0]
    zslDF.loc[zslDF.index[i], candidateLabels[1]] = result['scores'][1]
  else:
    zslDF.loc[zslDF.index[i], candidateLabels[0]] = result['scores'][1]
    zslDF.loc[zslDF.index[i], candidateLabels[1]] = result['scores'][0]
  # Assign a predicted class label.
  if zslDF.loc[zslDF.index[i], candidateLabels[0]] > 0.5:
    zslDF.loc[zslDF.index[i], 'predictedLabel'] = 0
  else:
    zslDF.loc[zslDF.index[i], 'predictedLabel'] = 1
  # Record time per step.
  end = time.time()
  zslDF.loc[zslDF.index[i], 'timeRequired'] = end-start
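
Because the pipeline returns its labels sorted by descending score, the scores must be matched back to the labels by name rather than by position. A compact way of expressing this is sketched below. The helper names are hypothetical, and the example dictionary is an invented stand-in for a pipeline result.

```python
def scores_by_label(result):
    """Map each candidate label to its score, whatever order the labels arrive in."""
    return dict(zip(result["labels"], result["scores"]))

def predicted_class(result, negative_label, threshold=0.5):
    """Return 0 if the non-misinformation label exceeds the threshold, else 1."""
    return 0 if scores_by_label(result)[negative_label] > threshold else 1

# Invented pipeline output: labels are sorted by descending score.
example = {"sequence": "some tweet",
           "labels": ["Unauthoritative", "Authoritative"],
           "scores": [0.7, 0.3]}

print(scores_by_label(example)["Authoritative"], predicted_class(example, "Authoritative"))
```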

###**4.4. Analysis of results**

In [None]:
# Recasting unwanted float values as int.
zslDF['predictedLabel'] = zslDF['predictedLabel'].astype('int')

In [None]:
# Creating a confusion matrix based on model predictions.
confusionMatrix = confusion_matrix(zslDF['class'], zslDF['predictedLabel'], labels=None, sample_weight=None, normalize=None)

In [None]:
# Plotting confusion matrix.
sns.heatmap(confusionMatrix, square=True, annot=True, cbar=False, cmap="Blues",fmt='g')
plt.title("Zero-shot learning performance")
plt.xlabel('predicted value')
plt.ylabel('true value')
plt.show()
plt.clf()

In [None]:
# Creating classification report based on model predictions.
classificationReport = classification_report(zslDF['class'],zslDF['predictedLabel'],digits=5)

In [None]:
# Printing classification report.
print(classificationReport)
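
For readers unfamiliar with the figures in the classification report, the sketch below shows how precision, recall and F1 for the 'misinformation' class (label 1) follow from a 2x2 confusion matrix laid out as scikit-learn produces it (rows are true labels, columns are predicted labels). The function name is hypothetical and the counts are invented, not results from our experiments.

```python
def positive_class_metrics(matrix):
    """Precision, recall and F1 for class 1 from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# Invented counts: 80 true negatives, 10 false positives, 5 false negatives, 5 true positives.
precision, recall, f1 = positive_class_metrics([[80, 10], [5, 5]])
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

With an imbalanced dataset such as ours, these per-class figures are more informative than overall accuracy, which a classifier could inflate simply by predicting the majority class.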

###**4.5. Exporting experimental data**

In [None]:
# Exporting results data to an Excel sheet.
zslDF.to_excel('zslResults.xlsx', sheet_name='data')

In [None]:
# Adding classification report to Excel workbook.
classificationReportForPrint = classification_report(zslDF['class'],zslDF['predictedLabel'],digits=5,output_dict=True)
classificationReportDF = pd.DataFrame(classificationReportForPrint).transpose()
classificationReportExport = classificationReportDF.to_excel('classReportZSL.xlsx')
# Loading Excel workbooks.
wb = load_workbook(filename = 'zslResults.xlsx')
wbCR = load_workbook(filename = 'classReportZSL.xlsx')
# Creating new worksheet which is a copy of classification report.
ws1 = wb.create_sheet(title="class. report")
ws2 = wbCR.active
for row in ws2:
    for cell in row:
        ws1[cell.coordinate].value = cell.value
# Saving changes.
wb.save(filename = 'zslResults.xlsx')

In [None]:
# Adding metadata to Excel workbook.
wb = load_workbook(filename = 'zslResults.xlsx')
ws = wb.create_sheet(title="metadata")
ws['A1'] = "Date"
ws['B1'] = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M')
ws['A3'] = "Candidate labels"
ws['B3'] = candidateLabels[0]
ws['C3'] = candidateLabels[1]
ws['A5'] = "Model used"
ws['B5'] = zslModelChoice
ws['A7'] = "Dataset used"
ws['B7'] = zslDatasetChoice
# Saving changes.
wb.save(filename = 'zslResults.xlsx')

In [None]:
# Downloading results file.
os.rename("zslResults.xlsx","zsl-{}-{}-{}-{}.xlsx".format(candidateLabels[0],zslDatasetChoice,datetime.datetime.now().month,datetime.datetime.now().day))
files.download("zsl-{}-{}-{}-{}.xlsx".format(candidateLabels[0],zslDatasetChoice,datetime.datetime.now().month,datetime.datetime.now().day))

<br>
<br>

---
## **5. Few-shot experiments**

---
Similar to zero-shot learning (ZSL), few-shot learning (FSL) can be approached in different ways. Importantly, not every way of doing ZSL is easily adaptable to a few-shot setting, and not all models are equally adept in different settings (i.e. ZSL vs FSL). Just like our experiments with ZSL, we do not have time during this project to experiment comprehensively with different FSL approaches. Accordingly, we will trial one common way of doing FSL. This involves the use of a very large autoregressive model. Such models are not usually associated with classification tasks, but their text generation abilities can be used for *de facto* classification. We base our approach on one [suggested by HuggingFace](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/GPT-J-6B/Inference_with_GPT_J_6B.ipynb).

Similar to our zero-shot experiments, we conduct our FSL experiments using the smaller and more easily separable version of our dataset (with the 'good' information vs 'bad' information distinction).

###**5.1. Instantiating model and tokenizer**

In [None]:
# Instantiating GPT-J model and tokenizer.
fslModelChoice = 'EleutherAI/gpt-j-6B'
tokenizer = AutoTokenizer.from_pretrained(fslModelChoice)
fslModel = GPTJForCausalLM.from_pretrained(fslModelChoice, low_cpu_mem_usage=True, torch_dtype=torch.float32, pad_token_id=tokenizer.eos_token_id)

###**5.2. Preparing copy of the dataset**

In [None]:
# Picking dataset version to use.
fslDatasetChoice = "littleDF"
# Creating a DataFrame in which to store our results.
if fslDatasetChoice == "littleDF":
  fslDF = pd.DataFrame(data={"text": littleDF['text'].copy(),
                                "class": littleDF['class'].copy(),
                                "predictedLabel":np.nan,
                                  "timeRequired":np.nan})
elif fslDatasetChoice == "bigDF":
  fslDF = pd.DataFrame(data={"text": bigDF['text'].copy(),
                                "class": bigDF['class'].copy(),
                                "predictedLabel":np.nan,
                                  "timeRequired":np.nan})
else:
  raise ValueError("fslDatasetChoice must be either 'bigDF' or 'littleDF'")

In [None]:
# Splitting dataset and using a smaller test set, given time required for each inference step
fslDF, fslDiscard = train_test_split(fslDF, test_size=0.99, shuffle=True, random_state=1, stratify=fslDF['class'])

###**5.3. Preparing few-shot prompts**
In this section we construct the examples that will be fed to the model at inference time.

In [None]:
# Select the two words (i.e. labels) that we will insist our classifier chooses from (i.e. constrained text generation).
force_words = ["true","false"]
# Tokenize the force words.
force_words_ids = tokenizer(force_words, add_special_tokens=False).input_ids
# We will set max_new_tokens / min_length to be equal to the longest / shortest force word.
# This means that the model is prevented from generating too many or too few tokens.
max_new_tokens = max(len(elem) for elem in force_words_ids)
min_length = min(len(elem) for elem in force_words_ids)

In [None]:
# Creating 6 examples to guide inference.
promptHead = """
####
Statement: Monkeypox spreads through close contact with an infected person or an infected surface.
This statement is true
####
Statement: Monkeypox is the government's evil plan for election fraud and more lockdowns.
This statement is false
####
Statement: The city reported 56 confirmed new cases of Monkeypox - an increase of 40%.
This statement is true
####
Statement: Monkeypox was created in a Wuhan laboratory and released deliberately by China.
This statement is false
####
Statement: Health officials say that the first case of Monkeypox has been detected in the country.
This statement is true
####
Statement: Monkeypox only affects gays because it is a sexually transmitted disease.
This statement is false
####
Statement: """

In [None]:
# Creating string to append to the prompt. (We alter string conditionally to correct 
# for erratic text generation behaviour observed during experimentation.)
if max_new_tokens > 1:
  promptTail = """This statement is """
else:
  promptTail = """This statement is"""
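
The same prompt structure could also be assembled programmatically, which makes it easier to vary the number or order of examples between experiments. The following is a minimal sketch of that idea, not the code used for our results. The function name `build_prompt` is hypothetical, and the new statement is an invented placeholder.

```python
def build_prompt(examples, statement, separator="####"):
    """Assemble a few-shot prompt from (statement, label) pairs plus the new statement."""
    parts = []
    for text, label in examples:
        parts.append(f"{separator}\nStatement: {text}\nThis statement is {label}")
    # The final block is left unfinished so that the model supplies the label.
    parts.append(f"{separator}\nStatement: {statement}\nThis statement is")
    return "\n".join(parts)

examples = [
    ("Monkeypox spreads through close contact with an infected person or an infected surface.", "true"),
    ("Monkeypox was created in a Wuhan laboratory and released deliberately by China.", "false"),
]
prompt = build_prompt(examples, "A placeholder tweet to classify.")
print(prompt)
```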

###**5.4. Calling model and storing predictions**

In [None]:
# For each tweet in our dataset ...
for i in range(len(fslDF.index)):
  # Time each step.
  start = time.time()
  # Run the text generator and compel it to produce one of the two force words.
  sequence = fslDF['text'].iloc[i]
  prompt = promptHead+sequence+"\n"+promptTail
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  generated_ids = fslModel.generate(input_ids, do_sample=False, max_new_tokens=max_new_tokens, min_length=min_length, num_beams=2, length_penalty=0.0,remove_invalid_values=True, force_words_ids=force_words_ids)
  generated_text = tokenizer.decode(generated_ids[0],skip_special_tokens=True)
  newText = generated_text.split()[-1]
  # Add model prediction to dataframe.
  if newText == force_words[1]:
    fslDF.loc[fslDF.index[i], 'predictedLabel'] = 1
  elif newText == force_words[0]:
    fslDF.loc[fslDF.index[i], 'predictedLabel'] = 0
  # Contingency in case model fails to output expected value.
  else:
    fslDF.loc[fslDF.index[i], 'predictedLabel'] = 9
  end = time.time()
  # Record time per step.
  fslDF.loc[fslDF.index[i], 'timeRequired'] = end-start
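
The label-parsing step in the loop above can be isolated and checked without loading the model. The sketch below does this with a hypothetical helper, `parse_prediction`, applied to invented generated strings. It also reports the share of outputs that fall through to the contingency code (9).

```python
def parse_prediction(generated_text, force_words=("true", "false"), fallback=9):
    """Map the final generated word to a class label, or to a fallback code."""
    words = generated_text.split()
    last_word = words[-1] if words else ""
    if last_word == force_words[1]:
        return 1  # "false" -> misinformation
    if last_word == force_words[0]:
        return 0  # "true" -> good information
    return fallback

# Invented model outputs covering the three possible outcomes.
outputs = ["... This statement is true",
           "... This statement is false",
           "... This statement is unclear"]
labels = [parse_prediction(text) for text in outputs]
unparsed_share = labels.count(9) / len(labels)
print(labels, unparsed_share)
```

Any rows given the contingency code will appear as a third class in the confusion matrix and classification report, so it is worth checking their share before interpreting the metrics.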


###**5.5. Analysis of results**

In [None]:
# Recasting unwanted float values as int.
fslDF['predictedLabel'] = fslDF['predictedLabel'].astype('int')

In [None]:
# Creating a confusion matrix based on model predictions.
confusionMatrix = confusion_matrix(fslDF['class'], fslDF['predictedLabel'], labels=None, sample_weight=None, normalize=None)

In [None]:
# Plotting confusion matrix.
sns.heatmap(confusionMatrix, square=True, annot=True, cbar=False, cmap="Blues",fmt='g')
plt.title("Few-shot learning performance")
plt.xlabel('predicted value')
plt.ylabel('true value')
plt.show()
plt.clf()

In [None]:
# Creating classification report based on model predictions.
classificationReport = classification_report(fslDF['class'],fslDF['predictedLabel'],digits=5)

In [None]:
# Printing classification report.
print(classificationReport)

###**5.6. Exporting experimental data**

In [None]:
# Exporting data to an Excel sheet.
fslDF.to_excel('fslResults.xlsx', sheet_name='data')

In [None]:
# Adding classification report to Excel workbook.
classificationReportForPrint = classification_report(fslDF['class'],fslDF['predictedLabel'],digits=5,output_dict=True)
classificationReportDF = pd.DataFrame(classificationReportForPrint).transpose()
classificationReportExport = classificationReportDF.to_excel('classReportFSL.xlsx')
# Loading Excel workbooks.
wb = load_workbook(filename = 'fslResults.xlsx')
wbCR = load_workbook(filename = 'classReportFSL.xlsx')
# Creating new worksheet which is a copy of classification report.
ws1 = wb.create_sheet(title="class. report")
ws2 = wbCR.active
for row in ws2:
    for cell in row:
        ws1[cell.coordinate].value = cell.value
# Saving changes.
wb.save(filename = 'fslResults.xlsx')

In [None]:
# Adding metadata to Excel sheet.
wb = load_workbook(filename = 'fslResults.xlsx')
ws = wb.create_sheet(title="metadata")
ws['A1'] = "Date"
ws['B1'] = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M')
ws['A3'] = "Prompts used"
ws['B3'] = promptHead
ws['A5'] = "Model used"
ws['B5'] = fslModelChoice
ws['A7'] = "Dataset used"
ws['B7'] = fslDatasetChoice
ws['A9'] = "Labels used"
ws['B9'] = force_words[0]+','+force_words[1]
wb.save(filename = 'fslResults.xlsx')

In [None]:
# Downloading results file.
os.rename("fslResults.xlsx","fsl-{}-{}-{}-{}.xlsx".format(force_words[0],fslDatasetChoice,datetime.datetime.now().month,datetime.datetime.now().day))
files.download("fsl-{}-{}-{}-{}.xlsx".format(force_words[0],fslDatasetChoice,datetime.datetime.now().month,datetime.datetime.now().day))

<br>
<br>

---
## **6. Preprocessing the data for fine-tuning experiments**

---
For experiments using the fine-tuning approach, we will create some additional features to experiment with. We then need to pick which version of the dataset to use; select which features we want to add to the model; tokenize the text; and split the dataset into stratified training, validation and test sets. 

###**6.1. Engineering new features**
There is a question as to whether we might be able to engineer new features to help separate the classes, particularly given suggestions that transformer models can understand numbers [(Wallace et al., 2019)](https://arxiv.org/abs/1909.07940). For example, we might add a column (as we do below) that counts the **number of years since account creation** (rather than simply listing the date on which it was created), as otherwise the model may associate misinformation with accounts created in 2022 rather than accounts that have been created recently.

In [None]:
# Creating a feature based on the number of years since account was created.
for ds in [bigDF, littleDF]:
  ds['years since account created'] = ds['created_at'].dt.year.astype('Int64') - ds['user created at'].dt.year.astype('Int64')

We might also add a feature to identify 'spammy' accounts by dividing the total number of **tweets published by the account** by the number of days since it was created. However, we observe below no obvious difference, in this regard, between the 'misinformation' and 'other' classes.

In [None]:
# Creating a feature to capture tweets published per day.
for ds in [bigDF, littleDF]:
  ds['tweets per day'] = ds['tweet count']/((ds['created_at'] - ds['user created at']).dt.days + 1)
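
The arithmetic behind this feature is illustrated below for individual accounts, using only the standard library. The function name and the example values are invented. The point of the `+ 1` is that an account created on the same day as the tweet has an age of zero days, which would otherwise cause a division by zero.

```python
from datetime import datetime

def tweets_per_day(tweet_count, account_created, tweet_posted):
    """Average tweets per day of account life; the +1 avoids division by zero."""
    days_active = (tweet_posted - account_created).days + 1
    return tweet_count / days_active

# Invented account created nine full days before the tweet, with 500 tweets overall.
rate = tweets_per_day(500, datetime(2022, 6, 1), datetime(2022, 6, 10))
# Invented account created on the same day as the tweet.
same_day = tweets_per_day(12, datetime(2022, 6, 10, 8), datetime(2022, 6, 10, 20))
print(rate, same_day)
```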

In [None]:
# Print a series of histogram plots of tweet published per day, according to class.
sns.displot(data=bigDF,x='tweets per day',kind='hist',col='class',binrange=(0,100),stat='percent',common_norm=False)

In [None]:
# Print a series of boxplots of tweets published per day, according to class.
sns.boxplot(data=bigDF, x='class',y='tweets per day', showfliers=False)

Lastly, following suggestions in other studies, we can **divide the number of followers for each user by the number of accounts that the user follows**, based on the observation that users that follow many accounts but have few followers are more likely to be bots or trolls. (Conversely, users with many followers relative to accounts followed are intuitively more likely to be institutional accounts.) 

In [None]:
# Create a feature capturing the user's follower-to-following ratio.
for ds in [bigDF, littleDF]:
  ds['follower to following ratio'] = ds['followers count']/(ds['following count']+1)

In [None]:
# Print a series of histogram plots of users' follower-to-following ratio, according to class.
sns.displot(data=bigDF,x='follower to following ratio',kind='hist',col='class',binrange=(0,20),stat='percent',common_norm=False)

In [None]:
# Print a series of boxplots of users' follower-to-following ratio, according to class.
sns.boxplot(data=bigDF, x='class',y='follower to following ratio', showfliers=False)

In [None]:
# Format our new variables to make them more amenable to processing by a transformer model (i.e. without too many decimal points).
# Applied to both versions of the dataset so that the features are consistent whichever is selected below.
for ds in [bigDF, littleDF]:
  ds['tweets per day'] = ds['tweets per day'].round().astype(int)
  ds['follower to following ratio'] = ds['follower to following ratio'].round(decimals=1)

###**6.2. Selecting dataset version**
For fine-tuning experiments, we will focus on the larger of the two dataset options, as this represents a greater challenge and is arguably more representative of the real-world demands that a classifier would be confronted with.

In [None]:
# datasetChoice must either be bigDF or littleDF
datasetChoice = bigDF

###**6.3. Selecting features**
We have the following features to choose from, in terms of model inputs.

In [None]:
for cols in datasetChoice.columns:
    print(cols)

Excluding the class label (which must be selected), we will select the following features as inputs to the model.

In [None]:
# Specify features to select.
chosenFeatures = ['text', 'user is verified', 'class']
# Concatenated features will be the selected features minus the class label.
featuresToConcatenate = chosenFeatures.copy()
featuresToConcatenate.remove('class')

In [None]:
# Record the dataset choice and the matching class names.
if datasetChoice is bigDF:
  recordedDSChoice = 'Misinformation / other'
  classNames = ["other","misinformation"]
elif datasetChoice is littleDF:
  recordedDSChoice = 'Misinformation / good'
  classNames = ["good information","misinformation"]
else:
  raise ValueError("datasetChoice must be either 'bigDF' or 'littleDF'")
# Concatenate the desired features in a copy of the dataframe.
slimmedDownDF = datasetChoice[chosenFeatures].copy()
for i in featuresToConcatenate:
  # Prefix each value with its column name (e.g. "text: ...").
  slimmedDownDF[i] = slimmedDownDF[i].name + ": " + slimmedDownDF[i].astype(str)
slimmedDownDF['combined'] = slimmedDownDF[featuresToConcatenate].apply(lambda row: ' [SEP] '.join(row.values.astype(str)), axis=1)
finalDF = slimmedDownDF[['combined','class']].copy()
# Create a Huggingface dataset object from the slimmed-down Pandas dataframe.
ds_features = Features({'combined': Value('string'), 'class': ClassLabel(names=classNames)})
dataset = Dataset.from_pandas(df=finalDF, features=ds_features,preserve_index=False)
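
To make the concatenation step concrete, the sketch below applies the same pattern to a tiny dataframe. The rows, and the names `toyDF` and `toyFeatures`, are invented purely for illustration and are not drawn from the dataset.

```python
import pandas as pd

# Toy data, invented for illustration only.
toyDF = pd.DataFrame({
    'text': ['first example tweet', 'second example tweet'],
    'user is verified': [False, True],
    'class': [1, 0],
})
toyFeatures = ['text', 'user is verified']

# Prefix each value with its feature name, as in the cell above.
for col in toyFeatures:
    toyDF[col] = col + ": " + toyDF[col].astype(str)

# Join the prefixed features into one string per row, separated by [SEP].
toyDF['combined'] = toyDF[toyFeatures].apply(
    lambda row: ' [SEP] '.join(row.values.astype(str)), axis=1)

for example in toyDF['combined']:
    print(example)
```

Each printed row is a single string in which every feature is labelled by name, which is the form the tokenizer receives in the next step.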

In [None]:
# Print the first nine examples from the concatenated dataframe, to make sure everything has gone to plan.
for y, x in enumerate(finalDF['combined'].head(9), start=1):
  print("******************************************************")
  print(y)
  print(x)

###**6.4. Tokenizing the dataset**

In [None]:
# Specify at this stage which model we wish to use and use this ID as input to the tokenizer.
ftModelChoice = "digitalepidemiologylab/covid-twitter-bert-v2"
# Note: loading the tokenizer from the model ID doesn't work for covid-twitter-bert-v2, so we stipulate the bert-large-uncased tokenizer instead.
if ftModelChoice == "digitalepidemiologylab/covid-twitter-bert-v2":
  tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased", use_fast=True)
else:
  # Use fast tokenizer if available for model; otherwise, use a slow tokenizer.
  try:
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=ftModelChoice, use_fast=True)
  except Exception:
    try:
      tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=ftModelChoice, use_fast=False)
    except Exception as err:
      raise ValueError("An error occurred in instantiating the tokenizer.") from err

# Tokenize the dataset.
def tokenize_function(examples):
  return tokenizer(examples['combined'], padding = "max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
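
As a rough illustration of what `padding="max_length"` and `truncation=True` do to each example, the sketch below pads or cuts a list of token IDs to a fixed length and builds the matching attention mask. The helper `pad_or_truncate`, the token IDs and the length of 6 are hypothetical; a real tokenizer also preserves its closing special token when truncating and pads to the model's own maximum length, which is typically far larger.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long.

    Simplified sketch: a real tokenizer also handles special tokens.
    """
    kept = token_ids[:max_length]       # truncation: drop anything past max_length
    n_pad = max_length - len(kept)      # padding: number of filler tokens needed
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

# Arbitrary token IDs, for illustration only.
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=6)
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15, 16, 17, 18], max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example is padded to the same fixed length, short tweets cost as much memory as long ones, which is worth bearing in mind when choosing the batch size.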

###**6.5. Exporting initial embeddings**
We can optionally export the initial embeddings as TSV files to visualise using TensorBoard's embedding projector.

In [None]:
exportInitialEmbeddings = False

In [None]:
if exportInitialEmbeddings:
  # Convert from HuggingFace to TF dataset format.
  tfTokenizedDataset = tokenized_datasets.to_tf_dataset(
      columns=["input_ids","class"],
      shuffle=False,
      batch_size=1)
  # Convert from TF dataset to Pandas dataframe.
  pdTokenizedDataset = tfds.as_dataframe(tfTokenizedDataset)
  # Split into two files required for the embedding projector.
  vectors = pd.DataFrame(pdTokenizedDataset['input_ids'])
  metadata = pd.DataFrame(pdTokenizedDataset["class"])
  # Remove outer list wrappings and export to tsv file.
  fileNo = 1
  for series in [vectors['input_ids'], metadata["class"]]:
    unwrappedData = []
    for index, value in series.items():
      for vector in value:
        unwrappedData.append(vector)
    pdUnwrapped = pd.DataFrame(unwrappedData)
    pdUnwrapped.to_csv('{}.tsv'.format(fileNo), header=False,index=False,sep='\t')
    fileNo += 1

###**6.6. Splitting the dataset**
We next split the dataset into training, validation and test sets in TensorFlow data formats.

In [None]:
# Splitting dataset into training and test/val sets.
trainSize = 0.7
valTestSize = 1 - trainSize
train_testvalid = tokenized_datasets.train_test_split(test_size=valTestSize,stratify_by_column='class',seed=5)
# Splitting test/val set into equally sized validation and test sets.
valid_test = train_testvalid['test'].train_test_split(test_size=0.5,stratify_by_column='class',seed=5)
# Renaming test and val sets for convenience.
val_set = valid_test['train']
test_set = valid_test['test']
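# Optionally, discard some training data to see how performance is affected during model fitting.
# Note: despite its name, percentToDiscard is a fraction between 0 and 1 (e.g. 0.25 discards a quarter of the training set).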
percentToDiscard = 0
if percentToDiscard != 0:
  trainSize = trainSize * (1-percentToDiscard)
  train_split = train_testvalid['train'].train_test_split(test_size=percentToDiscard,stratify_by_column='class',seed=5)
  train_set = train_split['train']
else:
  train_set = train_testvalid['train']
percentageKept = trainSize + valTestSize
samplesUsed = len(train_set['class']) + len(val_set['class']) + len(test_set['class']) 
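
The two-stage stratified split can be checked on toy data. The sketch below is an illustration only: it swaps in scikit-learn's `train_test_split` (rather than the `datasets` method used above) and uses invented labels with a 4:1 class imbalance, to show that a 70/15/15 split built this way keeps the class proportions in every subset. Sizes are passed as counts to keep the arithmetic exact.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels, invented for illustration: 160 'other' (0) and 40 'misinformation' (1).
labels = np.array([0] * 160 + [1] * 40)
indices = np.arange(len(labels))

# First split: 140 training samples (70%) and 60 held-out samples (30%).
train_idx, holdout_idx = train_test_split(
    indices, test_size=60, stratify=labels, random_state=5)
# Second split: divide the held-out samples equally into validation and test sets.
val_idx, test_idx = train_test_split(
    holdout_idx, test_size=30, stratify=labels[holdout_idx], random_state=5)

# Report the size and the share of class 1 in each subset.
for name, idx in [('train', train_idx), ('val', val_idx), ('test', test_idx)]:
    print(name, len(idx), labels[idx].mean())
```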

In [None]:
# Specify batch size to be used.
batch_size = 4
# Converting the tokenized datasets to TensorFlow datasets.
data_collator = DefaultDataCollator(return_tensors="tf")
tf_train_dataset = train_set.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=['class'],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=batch_size,
)
tf_validation_dataset = val_set.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=['class'],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=batch_size,
)
tf_test_dataset = test_set.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=['class'],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=1,
)

<br>
<br>

---
## **7. Fine-tuning the model**

---
We can now use TensorFlow to fine-tune our selected model using the training data.



###**7.1. Instantiating and compiling the model**

In [None]:
# Instantiating the model.
model = TFAutoModelForSequenceClassification.from_pretrained(ftModelChoice, num_labels=2)

In [None]:
# Create schedule for learning rate decay.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(initial_learning_rate=5e-6,decay_steps=10000,decay_rate=0.8)
# Decide at this point whether learning rate should decay ("decay") OR reduce on plateau ("plateau").
learningRateStrategy = "plateau"
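
To get a feel for how slowly this schedule decays, the sketch below evaluates the (non-staircase) exponential decay formula, `initial_learning_rate * decay_rate ** (step / decay_steps)`, with the values set above. The training-set size of 8,000 is a made-up figure chosen only to make the arithmetic concrete, and the helper `exponential_decay` is written for this example.

```python
import math

def exponential_decay(step, initial_lr=5e-6, decay_steps=10000, decay_rate=0.8):
    """Learning rate after `step` optimizer steps (continuous, non-staircase decay)."""
    return initial_lr * decay_rate ** (step / decay_steps)

# Hypothetical figures, for illustration only.
assumed_train_samples = 8000
example_batch_size = 4
steps_per_epoch = math.ceil(assumed_train_samples / example_batch_size)

# Print the learning rate at the start of a few epochs.
for epoch in [0, 1, 5, 10]:
    step = epoch * steps_per_epoch
    print(epoch, step, exponential_decay(step))
```

Under these assumptions the learning rate only falls to 80% of its initial value after five full epochs, so with early stopping in place the decay strategy changes the learning rate very little over a typical run.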

In [None]:
# Compiling the model.
if learningRateStrategy == "decay":
  model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.keras.metrics.SparseCategoricalAccuracy())
elif learningRateStrategy == "plateau":
  model.compile(
      optimizer=tf.keras.optimizers.Adam(learning_rate=5e-6),
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      metrics=tf.keras.metrics.SparseCategoricalAccuracy())

In [None]:
# Print summary of model structure.
model.summary()

In [None]:
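# For covid-twitter-bert: the default dropout rate for the dropout layer near the model head is 0.1.
# The following lines can be used to alter the dropout rate for this layer.
# Note: the auto-generated layer name ('dropout_73') changes if the model is instantiated more than once in a session; check model.summary() if it is not found.
# dropoutRate defaults to None for other models, so that it can still be recorded with the training results.
dropoutRate = None
if ftModelChoice == "digitalepidemiologylab/covid-twitter-bert-v2":
  dropoutRate = 0.2
  model.get_layer('dropout_73').rate = dropoutRate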

###**7.2. Creating custom callbacks**
To optimise model performance and ensure that we collect the data we need, we will use a series of callbacks.

In [None]:
# Adding an early stopping callback that will restore weights with the lowest val loss.
early_stopping_callback = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,
    verbose=1,
    restore_best_weights=True)

In [None]:
# If learning rate strategy is to use reduce-on-plateau, use a callback to reduce the learning rate when no improvement is observed after two epochs.
reduceLRonPlateau = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=2, verbose=1)
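
The interaction between the two callbacks (learning-rate patience of 2, early-stopping patience of 3) is easier to see on a worked example. The sketch below is a simplified replay written for this notebook, not the Keras implementation: it ignores `min_delta` and cooldown, counts epochs from zero, and uses made-up validation losses.

```python
def simulate_callbacks(val_losses, initial_lr=5e-6, lr_patience=2,
                       lr_factor=0.1, stop_patience=3):
    """Simplified replay of ReduceLROnPlateau and EarlyStopping on val losses."""
    best_loss = float('inf')
    best_epoch = None
    lr = initial_lr
    lr_wait = 0      # epochs without improvement since the last LR reduction
    stop_wait = 0    # epochs without improvement since the best epoch
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            lr_wait = 0
            stop_wait = 0
        else:
            lr_wait += 1
            stop_wait += 1
            if lr_wait >= lr_patience:
                lr = lr * lr_factor   # reduce the learning rate on a plateau
                lr_wait = 0
            if stop_wait >= stop_patience:
                return {'stopped_at': epoch, 'best_epoch': best_epoch, 'final_lr': lr}
    return {'stopped_at': None, 'best_epoch': best_epoch, 'final_lr': lr}

# Made-up validation losses, for illustration only.
made_up_losses = [0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.39]
result = simulate_callbacks(made_up_losses)
print(result)
```

With these settings the learning rate is cut once before training stops, so the model gets exactly one epoch at the reduced rate to improve on its best validation loss. In the notebook, `restore_best_weights=True` then restores the weights from the best epoch.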

In [None]:
# Creating custom callback for training time per epoch.
import time

class TimeHistory(tf.keras.callbacks.Callback):
    def on_train_begin(self, logs=None):
        # Reset the list of per-epoch durations (in seconds) at the start of training.
        self.times = []

    def on_epoch_begin(self, epoch, logs=None):
        self.epoch_time_start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        self.times.append(time.time() - self.epoch_time_start)

In [None]:
# Instantiating custom training-time callback.
time_callback = TimeHistory()

###**7.3. Fitting the model**

In [None]:
# Before fitting, creating serial number (in the form YYYYMMDDHHMM) that we can use to differentiate between experiments using the same params.
import datetime
serialNo = int(datetime.datetime.now().strftime('%Y%m%d%H%M'))

In [None]:
# Fitting model either with learning rate plateau strategy or decay strategy.
if learningRateStrategy == "decay":
  history = model.fit(tf_train_dataset,
            validation_data=tf_validation_dataset,
            epochs=50,
            callbacks=[early_stopping_callback, time_callback])
elif learningRateStrategy == "plateau":
  history = model.fit(tf_train_dataset,
            validation_data=tf_validation_dataset,
            epochs=50,
            callbacks=[early_stopping_callback,reduceLRonPlateau,time_callback])

### **7.4. Saving training data**

In [None]:
# Creating dataframe to store results of model training.
histDF = pd.DataFrame(history.history)
histDF['time_per_epoch'] =  time_callback.times
histDF['datasetChoice'] = recordedDSChoice
histDF['percentageKept'] = percentageKept
histDF['samplesUsed'] = samplesUsed
histDF['trainSize'] = trainSize
histDF['valSize'] = valTestSize / 2
histDF['testSize'] = valTestSize / 2
histDF['model'] = ftModelChoice
histDF['batchSize'] = batch_size
histDF['LRstrategy'] = learningRateStrategy
histDF['dropoutRate'] = dropoutRate
# Record, for every dataset column other than the class label, whether it was used as a model input.
allDatasetCols = [col for col in datasetChoice.columns if col != 'class']
for col in allDatasetCols:
  histDF[col] = col in chosenFeatures
histDF['isFastTokenizer'] = tokenizer.is_fast
histDF['optimizer'] = model.optimizer.get_config()['name']
histDF['serialNumber'] = serialNo

In [None]:
# Exporting training data to an Excel sheet.
histDF.to_excel('finetuneResults.xlsx')

###**7.5. Evaluating the fine-tuned model on test data**



In [None]:
# Creating classification report based on model predictions.
testDataLabels = np.concatenate([y for x, y in tf_test_dataset], axis=0)
modelLogits = model.predict(tf_test_dataset).logits
modelPredictions = []
for i in range(len(modelLogits)):
  prediction = np.argmax(modelLogits[i])
  modelPredictions.append(prediction)
classificationReport = classification_report(testDataLabels,modelPredictions,digits=5)
print(classificationReport)
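
For reference, the sketch below shows how the headline figures in the classification report follow from the logits, using made-up values for six test examples. It takes the argmax over each row in a single vectorised call and then computes precision, recall and F1 for the positive class (1) by hand; the `toy` names are invented for this example.

```python
import numpy as np

# Made-up logits (columns: class 0, class 1) and true labels, for illustration only.
toyLogits = np.array([[2.0, -1.0], [0.5, 1.5], [-0.3, 0.8],
                      [1.2, 0.1], [0.2, 0.4], [-1.0, 2.0]])
toyLabels = np.array([0, 1, 1, 1, 0, 1])

# Predicted class = index of the largest logit in each row.
toyPredictions = np.argmax(toyLogits, axis=1)

# Counts for the positive class (1).
tp = int(np.sum((toyPredictions == 1) & (toyLabels == 1)))  # true positives
fp = int(np.sum((toyPredictions == 1) & (toyLabels == 0)))  # false positives
fn = int(np.sum((toyPredictions == 0) & (toyLabels == 1)))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(toyPredictions, precision, recall, f1)
```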

In [None]:
# Creating and plotting a confusion matrix based on model predictions.
confusionMatrix = confusion_matrix(testDataLabels, modelPredictions, labels=None, sample_weight=None, normalize=None)
sns.heatmap(confusionMatrix, square=True, annot=True, cbar=False, cmap="Blues",fmt='g')
plt.title("Fine-tuned model predictions")
plt.xlabel('predicted value')
plt.ylabel('true value')
plt.show()
plt.clf()

In [None]:
# Adding classification report to Excel workbook.
classificationReportForPrint = classification_report(testDataLabels,modelPredictions,digits=5,output_dict=True)
classificationReportDF = pd.DataFrame(classificationReportForPrint).transpose()
classificationReportDF.to_excel('classReportFT.xlsx')
# Loading Excel workbooks.
wb = load_workbook(filename = 'finetuneResults.xlsx')
wbCR = load_workbook(filename = 'classReportFT.xlsx')
# Creating new worksheet which is a copy of classification report.
ws1 = wb.create_sheet(title="class. report")
ws2 = wbCR.active
for row in ws2:
    for cell in row:
        ws1[cell.coordinate].value = cell.value
# Saving changes.
wb.save(filename = 'finetuneResults.xlsx')

###**7.6. Exporting experimental data**

In [None]:
# Export fine-tuning data.
# Build the filename once; any '/' in a HuggingFace model ID is replaced, as it is not valid in a filename.
today = datetime.datetime.now()
if ftModelChoice == "digitalepidemiologylab/covid-twitter-bert-v2":
  modelLabel = "covid-twitter-bert"
else:
  modelLabel = ftModelChoice.replace("/", "-")
exportName = "finetuneResults-{}-{}-{}.xlsx".format(modelLabel, today.month, today.day)
os.rename("finetuneResults.xlsx", exportName)
files.download(exportName)

## **8. Saving and exporting best model**
For the model evaluated as 'best in class', we can push the model to the HuggingFace model hub.

In [None]:
# notebook_login()

In [None]:
# model.push_to_hub("monkeypox-misinformation")
# tokenizer.push_to_hub("monkeypox-misinformation")