# Fine tuning with S-BERT

### Instructions to run this notebook

#### These are the sections in this notebook. Please read the notes below to avoid any errors.

1. [Setup](#setup) - Install the necessary libraries, the GitHub repository, and import the code from the repository.
2. [Data Loading](#data-loading) - Define your **GLOBAL_EXPERIMENT_NUMBER** carefully so that it doesn't overwrite another folder on Google Drive. Running this section plots the data distribution for the binary/multiclass classification dataset. The plots are a good visual check that the classes have roughly equal numbers of examples and are similarly distributed across the training and test datasets. This section also makes the training sentences and labels available to the Python scripts from the GitHub repository that fine-tune the model.
3. [Configuring W&B credentials and Run](#wandb-config) - The most important part of this section is to insert your WANDB_API_KEY in the ``wandb_key`` variable. This makes sure that you can submit your run to the common Weights&Biases project. The WANDB_RUN_GROUP and WANDB_JOB_TYPE variables can be set as well. These two variables help filter multiple runs with the same or similar hyperparameter configurations, which makes the Weights&Biases dashboard easier to read.
4. [Grid Search Fine Tuning - W&B sweeps](#wandb-sweep) - In this section we define the Weights&Biases hyperparameter sweep. The ``method`` of the sweep defines how the automatic tuning happens (``random`` is the easiest option). If we use the random method, we have to specify the ``count`` parameter later on to tell wandb how many randomly sampled configurations to run in the sweep. You can also specify the name of the sweep in ``sweep_config``. Another important part is the metric to optimize, which in our case is the ``Weighted F1 validation`` score with the goal ``maximize``. **Note** that the metric name must be a value that we are logging with wandb (``wandb.log``), and the string needs to match exactly. The hyperparameter dictionary of the sweep is where we list the values of each hyperparameter that we wish to tune. If the key ``values`` is used, wandb expects a list of values for that hyperparameter; if the key is ``value``, there should be only one value. In this way we can specify constant hyperparameters as well.
5. [Running training function for only one run](#single-run) - If we wish to submit only a single run to Weights&Biases, we can use this section to run another training function with **one set of hyperparameter values**. **Note** that this section still requires the WANDB_API_KEY to be set in the [Configuring W&B credentials and Run](#wandb-config) section, as well as the run group if one is required.
6. [Removing the saving directory from Google Drive](#delete-folder) - While fine-tuning we save the model in the Google Drive in the GLOBAL_EXPERIMENT_NUMBER folder. After the model is successfully saved onto the Weights&Biases run, we can safely delete the folder from Google Drive to save storage space.
7. [Loading saved model](#load-model) - We can load the model saved in the single run fine-tuning section. The run id is already saved in a variable which is how wandb finds the ``saved_model.pt`` file. If you wish to retrieve a model from a sweep, you need to find the run id from the online dashboard of the best run and use it in this section.
8. [Testing model on test set](#test-set) - In this section we get a realistic performance of the saved model on the test set.

<a name="setup"></a>
## Setup

In [None]:
# Install necessary libraries
! pip install --quiet \
  scprep \
  wandb \
  sentence_transformers==1.0.2 \
  phate==1.0.7
# Setup connection with your own google drive
from google.colab import drive
drive.mount('/content/drive')

# Restart the runtime so the newly installed libraries become active in the notebook.
# Colab will report that the session crashed; this is expected, continue with the next cell.
import os
os.kill(os.getpid(), 9)

In [None]:
# Clone branch from github
!rm -rf policy-data-analyzer/
!branch_name='wandb-experiments' && \
  git clone --branch $branch_name https://github.com/wri-dssg/policy-data-analyzer.git

In [None]:
# Move into the repository folder; the check makes this cell safe to run more than once
import os
if os.path.basename(os.getcwd()) != "policy-data-analyzer":
    os.chdir("policy-data-analyzer")

from tasks.fine_tuning_sbert.src.loops import *

<a name="data-loading"></a>
## Data Loading

In [None]:
"""
MAKE SURE GLOBAL_EXPERIMENT_NUMBER IS NOT OVERWRITING ANOTHER FOLDER
"""

GLOBAL_EXPERIMENT_NUMBER = 16

experiment = "EXP26"
classifier = "Binary"

base_path = "/content/drive/MyDrive/Official Folder of WRI Latin America Project/WRI-LatinAmerica-Talent"

data_path = f"{base_path}/Cristina_Policy_Files/Tagged_sentence_lists/Spanish/datasets/{classifier}"

results_save_path = f"{base_path}/Modeling/Model_reproducibility/Model_results/{classifier}ClassificationExperiments/{GLOBAL_EXPERIMENT_NUMBER}"

if not os.path.exists(results_save_path):
    os.makedirs(results_save_path)
    print(f"Making new experiment folder for experiment # {GLOBAL_EXPERIMENT_NUMBER}")
else:
    print("Please do not overwrite existing models and their results from previous experiments")
    print(f"You are writing to Experiment # {GLOBAL_EXPERIMENT_NUMBER}")

train_sents, train_labels, test_sents, test_labels = load_dataset(data_path, experiment)
label_names = unique_labels(train_labels)

make_dataset_public(train_sents, train_labels, label_names)

numeric_train_labels = labels2numeric(train_labels, label_names)

plot_data_distribution(numeric_train_labels, label_names)
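
The cell above relies on choosing `GLOBAL_EXPERIMENT_NUMBER` by hand. As a minimal sketch, assuming experiment folders are named with plain integers as in `results_save_path`, the first unused number could be found automatically. The helper `next_free_experiment_number` is hypothetical and not part of the repository, and the demonstration uses a temporary directory rather than Google Drive.

```python
import os
import tempfile


def next_free_experiment_number(parent_dir):
    """Return the smallest positive integer that is not yet a folder name in parent_dir."""
    if not os.path.isdir(parent_dir):
        return 1
    taken = {
        int(name)
        for name in os.listdir(parent_dir)
        if name.isdigit() and os.path.isdir(os.path.join(parent_dir, name))
    }
    number = 1
    while number in taken:
        number += 1
    return number


# Demonstration on a temporary directory with experiments 1, 2 and 4 already present
with tempfile.TemporaryDirectory() as demo_dir:
    for used in ("1", "2", "4"):
        os.makedirs(os.path.join(demo_dir, used))
    demo_number = next_free_experiment_number(demo_dir)

print(demo_number)
```

In the notebook, the parent directory would be the folder that contains the numbered experiment folders, i.e. `os.path.dirname(results_save_path)`.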

In [None]:
# Class balance/imbalance for the test set
label_names_test = unique_labels(test_labels)
numeric_test_labels = labels2numeric(test_labels, label_names_test)

plot_data_distribution(numeric_test_labels, label_names_test)
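
The two plots are compared by eye. A numeric version of the same check could look like the sketch below, which computes the share of each class per split and the largest difference between the splits. The function names and the toy labels are illustrative assumptions; in the notebook the inputs would be `train_labels` and `test_labels`.

```python
from collections import Counter


def class_proportions(labels):
    """Map each label to its share of the dataset (shares sum to 1)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def max_proportion_gap(labels_a, labels_b):
    """Largest absolute difference in class share between two splits."""
    props_a = class_proportions(labels_a)
    props_b = class_proportions(labels_b)
    all_labels = set(props_a) | set(props_b)
    return max(abs(props_a.get(lab, 0.0) - props_b.get(lab, 0.0)) for lab in all_labels)


# Toy labels for illustration only
toy_train = ["class_a"] * 6 + ["class_b"] * 4
toy_test = ["class_a"] * 1 + ["class_b"] * 1

print(class_proportions(toy_train))
print(max_proportion_gap(toy_train, toy_test))
```

A gap close to zero means both splits have nearly the same class mix; what counts as "too large" is a judgment call for the dataset at hand.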

<a name="wandb-config"></a>
## Configuring W&B credentials and Run

In [None]:
'''
PASTE YOUR WEIGHTS & BIASES KEY HERE
Please do not forget to delete the key after finishing using the notebook. Or simply don't save the notebook to GitHub or Google Drive :)
If the key is compromised you can always make a new one in your W&B settings and remove the old one :)
'''
wandb_key = ''
group_desc = ''
job_type = ''

os.environ['WANDB_JOB_TYPE'] = job_type
os.environ['WANDB_RUN_GROUP'] = group_desc
os.environ['WANDB_API_KEY'] = wandb_key
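
A slightly more defensive version of the cell above is sketched here: it refuses an empty API key and only sets the optional variables when they are non-empty. `set_wandb_env` is a hypothetical helper, not part of the repository or of wandb; the demonstration writes into a plain dictionary instead of `os.environ`.

```python
import os


def set_wandb_env(api_key, run_group="", job_type="", environ=os.environ):
    """Set the W&B environment variables, skipping optional ones that are empty."""
    if not api_key:
        raise ValueError("WANDB_API_KEY is empty; paste your key before starting a run")
    environ["WANDB_API_KEY"] = api_key
    if run_group:
        environ["WANDB_RUN_GROUP"] = run_group
    if job_type:
        environ["WANDB_JOB_TYPE"] = job_type
    return environ


# Demonstration with a placeholder key and a throwaway dictionary
demo_env = set_wandb_env("not-a-real-key", run_group="demo-group", environ={})
print(sorted(demo_env))
```

In an interactive session, the key can be read with `getpass.getpass()` from the standard library so that it never appears in the saved notebook.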

<a name="wandb-sweep"></a>
## Grid Search Fine Tuning - W&B sweeps

In [None]:
# wandb sweeps
sweep_config = {
    'method': 'random',
    "name": "SBERT hyperparam tuning"
}

metric = {
    'name': 'Weighted F1 validation',
    'goal': 'maximize'   
}

parameters_dict = {
    "dev_perc": {
        "values": [0.20, 0.25]
    },
    'model_name': {
        'values': ['paraphrase-xlm-r-multilingual-v1', 'stsb-xlm-r-multilingual', 'quora-distilbert-multilingual']
    },
    'seeds': {
        'values': [10, 11, 12]
    },
    'learning_rate': {
        'values': [2e-5, 2e-4]
    },
    # all values below are set but not varied
    "max_num_epochs": {
        "value": 10
    },
    "baseline": {
        "value": 0.001
    },
    "patience": {
        "value": 5
    },
    "eval_classifier": {
        "value": "SBERT"
    },
    "output_path": {
        "value": results_save_path
    }
}

sweep_config['parameters'] = parameters_dict
sweep_config['metric'] = metric

import pprint

pprint.pprint(sweep_config)
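
To get a feeling for how much of the search space a random sweep covers, the sketch below enumerates every combination of the varied hyperparameters and then draws five configurations by picking each value independently. This only mimics the idea of random search for illustration; it is not W&B's implementation, and `demo_search_space` is a local copy of the varied entries of `parameters_dict`.

```python
import itertools
import random

# Local copy of the hyperparameters that are varied in the sweep
demo_search_space = {
    "dev_perc": [0.20, 0.25],
    "model_name": [
        "paraphrase-xlm-r-multilingual-v1",
        "stsb-xlm-r-multilingual",
        "quora-distilbert-multilingual",
    ],
    "seeds": [10, 11, 12],
    "learning_rate": [2e-5, 2e-4],
}

# Full grid: every combination of the listed values
all_combinations = [
    dict(zip(demo_search_space, combo))
    for combo in itertools.product(*demo_search_space.values())
]
print(len(all_combinations))  # 2 * 3 * 3 * 2 = 36

# Random search: each run picks one value per hyperparameter independently,
# so the same configuration can in principle be drawn more than once
rng = random.Random(0)
sampled_configs = [
    {name: rng.choice(values) for name, values in demo_search_space.items()}
    for _ in range(5)
]
for config in sampled_configs:
    print(config)
```

With `count=5`, a random sweep therefore tries at most 5 of the 36 possible configurations.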

In [None]:
sweep_id = wandb.sweep(sweep_config, project="WRI", entity="ramanshsharma")

In [None]:
wandb.agent(sweep_id, train, count=5)
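
The sweep maximizes the metric logged as `Weighted F1 validation`. How the repository computes it is not shown in this notebook; the sketch below assumes the usual definition of weighted F1, where the F1 score of each class is weighted by the number of true examples of that class. The function and the toy labels are illustrative only.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = true_pos / n_pred if n_pred else 0.0
        recall = true_pos / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += f1 * n_true / total
    return score


# Toy example: three examples of class 0 and two of class 1, with one mistake
toy_true = [0, 0, 0, 1, 1]
toy_pred = [0, 0, 1, 1, 1]
print(weighted_f1(toy_true, toy_pred))
```

Because each class is weighted by its support, a model that ignores a small class is penalized less than it would be under a macro average, which is worth keeping in mind when the classes are imbalanced.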

<a name="single-run"></a>
## Running training function for only one run

In [None]:
single_run_params = {
    "all_dev_perc": 0.25,
    "model_names": 'paraphrase-xlm-r-multilingual-v1',
    "output_path": results_save_path,
    "max_num_epochs": 10,
    "baseline": 0.001,
    "patience": 5,
    "learning_rate": 2e-5,
    "seeds": 10,
    "eval_classifier": "SBERT"
}

run_name = single_run_fine_tune(single_run_params, train_sents, train_labels, label_names)

<a name="delete-folder"></a>
## Removing the saving directory from Google Drive

In [None]:
# Delete the experiment folder from Google Drive; the model is already stored on Weights&Biases
import shutil
shutil.rmtree(results_save_path)
print(f'Removed {results_save_path}')
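
Running the cell above a second time raises an error because the folder no longer exists. A small guard avoids that; `remove_experiment_folder` is a hypothetical helper and the demonstration uses a temporary folder rather than the Google Drive path.

```python
import os
import shutil
import tempfile


def remove_experiment_folder(path):
    """Delete the folder at path if it exists; return True when something was removed."""
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False


# Demonstration: the first call removes the folder, the second finds nothing to remove
demo_path = tempfile.mkdtemp()
first_call = remove_experiment_folder(demo_path)
second_call = remove_experiment_folder(demo_path)
print(first_call, second_call)
```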

<a name="load-model"></a>
## Loading saved model

In [None]:
# The run name is only available for single runs; if using sweeps, retrieve the individual run id from W&B
print(run_name)

In [None]:
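# wandb.restore returns a file object; loading from its .name works
# regardless of which local directory the file was downloaded to
model_file = wandb.restore('saved_model.pt', run_path=f"ramanshsharma/WRI/{run_name}")

saved_model = torch.load(model_file.name)

str(saved_model)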

<a name="test-set"></a>
## Testing model on test set

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=500,
                            max_features=0.06,
                            n_jobs=6,
                            random_state=69420)

In [None]:
os.environ['WANDB_SILENT'] = "true"
wandb.init(id=run_name, project='WRI', entity='ramanshsharma', resume='allow')

evaluate_using_sklearn(clf, saved_model, train_sents, train_labels, test_sents, test_labels,
                       label_names)