<a href="https://colab.research.google.com/github/zacharypangan/AgentNetworkSimulation/blob/main/Copy_of_lab_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Since the resources on Google Colab are limited, you may run into limitations when using it for your own project. In that case, copy this notebook to your own computing platform and run it with your own GPU.*

## Enabling GPU

The package requires a GPU to run. To enable a GPU for this Notebook, you will need to:  
- Click 'Edit' in the menu bar, then click 'Notebook Settings'.
- Select GPU from the Hardware Accelerator drop-down list, then click 'Save'.

*Click* on the run arrow of the cell below to verify that you are connected to a GPU. The cell should return the name of the GPU in use.

In [None]:
from torch import cuda

cuda.get_device_name(0)
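
If no GPU is attached, the call above raises an error instead of returning a name. A minimal sketch of a more defensive check, assuming only the standard PyTorch calls `torch.cuda.is_available()` and `torch.cuda.get_device_name()`; the helper name `describe_gpu` is hypothetical and not part of the package:

```python
def describe_gpu():
    """Return the GPU name, or None when PyTorch or a CUDA device is unavailable."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return None
    if not torch.cuda.is_available():
        # PyTorch is installed, but no CUDA device is visible.
        return None
    return torch.cuda.get_device_name(0)


name = describe_gpu()
print(name if name else "No GPU detected: check the notebook settings described above.")
```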

# **Lab 1: The Python package *AugmentedSocialScientist***

## Installing the package

Run the cell below to install the package *AugmentedSocialScientist* on the current Google Colab runtime.

In [None]:
!pip install AugmentedSocialScientist

Import the BERT model ([Devlin et al. 2019](https://arxiv.org/pdf/1810.04805.pdf)) from the package *AugmentedSocialScientist*.

In [None]:
from AugmentedSocialScientist import bert

The module `bert` contains 3 main functions:
- `bert.encode()` to preprocess the data;
- `bert.run_training()` to train, validate and save a model;
- `bert.predict_with_model()`  to make predictions with a saved model.

In this lab session, we will use a classic text classification task -- clickbait detection -- to illustrate the use of these functions.

> **N.B.**
>
> BERT is a pre-trained language model for the English language. Our package also contains models for other languages:
> - `camembert` for French;
> - `german_bert` for German;
> - `spanish_bert` for Spanish;
> - `xlmroberta` which is a multi-lingual model supporting 100 languages.
>
> To use them, simply import the corresponding model and replace `bert` with the name of the imported model.
>
> For example, to use the French language model `camembert`:
> 1. Import the model `camembert`:
>    ```python
>    from AugmentedSocialScientist import camembert
>    ```
> 2. Then use the functions `camembert.encode()`, `camembert.run_training()`, `camembert.predict_with_model()`.
>
> The source code of the package can be found here: https://github.com/rubingshen/AugmentedSocialScientist

# **Example: Clickbait Detection**


For this example, we use data from [Chakraborty et al. 2016](https://github.com/bhargaviparanjape/clickbait), in order to train a classifier that distinguishes between clickbait and non-clickbait headlines.

Import other required packages for this tutorial.

In [None]:
import pandas as pd
import numpy as np

pd.options.display.max_colwidth=None
pd.options.display.max_rows=100

Load the training and test data.

In [None]:
cb_train = pd.read_csv('https://raw.githubusercontent.com/rubingshen/AugmentedSocialScientist/main/datasets/english/clickbait_train.csv')
cb_test = pd.read_csv('https://raw.githubusercontent.com/rubingshen/AugmentedSocialScientist/main/datasets/english/clickbait_test.csv')

In [None]:
cb_train

In [None]:
cb_test

## Training a model

### Step 1: Preprocessing the data with the function `bert.encode()`

The function `bert.encode(sentences, labels)` preprocesses the training and test data and converts each set to a PyTorch *DataLoader* object, a format readable by the model.


The function takes two arguments: a list of texts and a list of corresponding labels.

**⚠️ For technical reasons, the labels must be integers starting from 0 (0, 1, 2...)**
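
The clickbait data already uses 0/1 labels, so no conversion is needed here. If your own data uses string labels, they can be mapped to consecutive integers first. A minimal sketch in plain Python; the label values and variable names below are hypothetical and not taken from the data set:

```python
# Hypothetical string labels, for illustration only.
raw_labels = ["clickbait", "news", "news", "clickbait"]

# Assign each distinct label an integer code starting from 0.
# Sorting makes the mapping reproducible across runs.
classes = sorted(set(raw_labels))
label_to_id = {label: i for i, label in enumerate(classes)}

# Convert the original labels to their integer codes.
int_labels = [label_to_id[label] for label in raw_labels]

print(label_to_id)  # {'clickbait': 0, 'news': 1}
print(int_labels)   # [0, 1, 1, 0]
```

Keep `label_to_id` around: it is needed later to translate predicted integers back into readable category names.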

In [None]:
train_dataloader = bert.encode(cb_train.headline, cb_train.is_clickbait)

In [None]:
test_dataloader = bert.encode(cb_test.headline, cb_test.is_clickbait)

### Step 2: Training a model with the function `bert.run_training()`

The function `bert.run_training()` trains, validates, and saves the fine-tuned BERT model. It takes the following arguments:

* `training_dataloader`: the preprocessed training set;
* `test_dataloader`: the preprocessed test set;
* `n_epochs`: the number of epochs;
* `lr`: the learning rate;
* `random_state`: the fixed random state (for replicability purposes);
* `save_model_as`: the name of the folder in which the model is saved. The model will be saved at `./models/<model_name>`. If you don't want to save the model after training, set this parameter to `None`.

Once training is complete, the function computes the F1-score (between 0 and 1) to assess the quality of the model.

In [None]:
score = bert.run_training(train_dataloader,
                          test_dataloader,
                          n_epochs=2,
                          lr=5e-5,
                          random_state=42,
                          save_model_as='clickbait')

In [None]:
score
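
For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it by hand for one class on a small set of made-up labels. It illustrates the metric only: how the package aggregates the score across classes is not specified in this notebook, and the function name `f1_score_binary` is hypothetical.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1-score of the `positive` class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive prediction: precision and recall are both 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up gold labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

f1 = f1_score_binary(y_true, y_pred)
print(round(f1, 3))  # 0.667
```

Here 2 of the 3 predicted positives are correct (precision 2/3) and 2 of the 3 true positives are found (recall 2/3), so the F1-score is 2/3.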

## Predicting on new data

Load unlabelled data that we would like to automatically label using the trained model.

In [None]:
cb_pred = pd.read_csv('https://raw.githubusercontent.com/rubingshen/AugmentedSocialScientist/main/datasets/english/clickbait_pred.csv')

In [None]:
cb_pred

### Step 1: Preprocessing the data with the function `bert.encode()`

Preprocess the prediction data with the function `bert.encode()` by passing only the list of texts.

In [None]:
pred_dataloader = bert.encode(cb_pred.headline)

### Step 2: Automatic annotation with `bert.predict_with_model()`

Use the function `bert.predict_with_model()` to make predictions on the data with the trained model.

The function takes two arguments:

* `pred_dataloader`: the preprocessed prediction data;
* `model_path`: the path of the saved model to be used for prediction.

In [None]:
pred_proba = bert.predict_with_model(pred_dataloader, model_path='./models/clickbait')

Output: the model returns, for each headline in the unlabelled data set, the probability that it belongs to each category (0: not clickbait; 1: clickbait).

In [None]:
pred_proba

Store the predicted category and its probability in the dataframe.

In [None]:
cb_pred['pred_label'] = np.argmax(pred_proba, axis=1)
cb_pred['pred_proba'] = np.max(pred_proba, axis=1)
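
To see what these two lines do, here is the same computation on a small made-up array. It assumes, as the description of the output suggests, that `pred_proba` has one row per headline and one column per category; the values in `toy_proba` are invented for illustration.

```python
import numpy as np

# Made-up probabilities: one row per headline, one column per category.
toy_proba = np.array([
    [0.90, 0.10],   # most likely category 0 (not clickbait)
    [0.20, 0.80],   # most likely category 1 (clickbait)
    [0.45, 0.55],   # category 1, but with low confidence
])

# Index of the highest probability in each row: the predicted category.
toy_labels = np.argmax(toy_proba, axis=1)
# The highest probability itself: how confident the prediction is.
toy_confidence = np.max(toy_proba, axis=1)

print(toy_labels.tolist())      # [0, 1, 1]
print(toy_confidence.tolist())  # [0.9, 0.8, 0.55]
```

A probability close to 0.5, as in the third row, signals a headline that may be worth checking by hand.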

In [None]:
cb_pred