# Fine tuning with S-BERT

### Instructions to run this notebook

#### These are the sections in this notebook. Please read the notes below to avoid any errors.

1. [Setup](#setup) - Install the necessary libraries, the GitHub repository, and import the code from the repository.
2. [Data Loading](#data-loading) - Define your **GLOBAL_EXPERIMENT_NUMBER** carefully so that it doesn't overwrite another folder on Google Drive. Running this section produces the data distribution for the binary/multiclass classification dataset, which is a good visual check that the number of examples per class is roughly equal and well distributed across the training and test datasets. This section also makes sure that the training sentences and labels are available to the Python scripts from the GitHub repository that fine-tune the model.
3. [Configuring W&B credentials and Run](#wandb-config) - The most important part of this section is to insert your WANDB_API_KEY in the key variable. This makes sure that you can submit your run to the common Weights&Biases project. The WANDB_RUN_GROUP and WANDB_JOB_TYPE variables can be set as well. These two variables help filter multiple runs with the same or similar hyperparameter configuration, for better readability on the Weights&Biases dashboard.
4. [Grid Search Fine Tuning - W&B sweeps](#wandb-sweep) - In this section we define the Weights&Biases hyperparameter sweep. The method of the sweep defines how the automatic tuning will happen (random is the easiest way). If we use the random method, we have to specify the ``count`` parameter later on to tell wandb how many randomly sampled hyperparameter combinations to run in the sweep. You can also specify the name of the sweep in this sweep_config. Another important part is to define the metric to maximize, which in our case is the ``Weighted F1 validation`` score. **Note** that the metric to maximize must be a value that we are logging with wandb (wandb.log), and the string must match exactly. The hyperparameter dictionary of the sweep is where we list the values of each hyperparameter that we wish to tune. If the key ``values`` is used, wandb expects a list of values for that hyperparameter; if the key is ``value``, there should be only one value. In this way we can specify constant hyperparameters as well.
5. [Running training function for only one run](#single-run) - In the scenario where we wish to submit only a single run to Weights&Biases, we can use this section to run another training function with **one set of hyperparameter values**. **Note** that this section still requires the WANDB_API_KEY to be set in the [Configuring W&B credentials and Run](#wandb-config) section, as well as the run group if it is required.
6. [Removing the saving directory from Google Drive](#delete-folder) - While fine-tuning, we save the model to Google Drive in the GLOBAL_EXPERIMENT_NUMBER folder. After the model has been successfully saved to the Weights&Biases run, we can safely delete the folder from Google Drive to save storage space.
7. [Loading saved model](#load-model) - We can load the model saved in the single run fine-tuning section. The run id is already saved in a variable which is how wandb finds the ``saved_model.pt`` file. If you wish to retrieve a model from a sweep, you need to find the run id from the online dashboard of the best run and use it in this section.
8. [Testing model on test set](#test-set) - In this section we get a realistic estimate of the saved model's performance on the test set.
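
The sweep configuration described in step 4 could look roughly like the sketch below. Only the `random` method, the `Weighted F1 validation` metric name, and the `values`/`value` keys come from the notes above. The sweep name, the choice of hyperparameters to tune, and their ranges are assumptions made for illustration, and `split_parameters` is a hypothetical helper that only shows how the two keys differ. The commented-out `wandb.sweep` and `wandb.agent` calls assume a training function named `train`, which is also hypothetical.

```python
# Sketch of a W&B sweep configuration. Hyperparameter choices are illustrative only.
sweep_config = {
    "name": "sbert-fine-tuning-sweep",      # hypothetical sweep name
    "method": "random",                     # how W&B picks hyperparameter combinations
    "metric": {
        "name": "Weighted F1 validation",   # must match the key passed to wandb.log exactly
        "goal": "maximize",
    },
    "parameters": {
        # "values": a list of candidates that the sweep samples from
        "max_num_epochs": {"values": [5, 10, 15]},
        # "value": a single constant shared by every run in the sweep
        "all_dev_perc": {"value": 0.20},
        "model_names": {"value": "paraphrase-xlm-r-multilingual-v1"},
    },
}


def split_parameters(config):
    """Return (tuned, constant) hyperparameter names based on 'values' vs 'value'."""
    params = config["parameters"]
    tuned = sorted(name for name, spec in params.items() if "values" in spec)
    constant = sorted(name for name, spec in params.items() if "value" in spec)
    return tuned, constant


# With wandb installed and a training function `train` defined (hypothetical),
# the sweep could be launched along these lines:
# import wandb
# sweep_id = wandb.sweep(sweep_config, project="HSSC")
# wandb.agent(sweep_id, function=train, count=5)  # count = number of random runs
```

Keeping constants in the same dictionary as the tuned hyperparameters means every run in the sweep records its full configuration on the dashboard.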

<a name="setup"></a>
## Setup

In [None]:
# Install necessary libraries
! pip install --quiet scprep \
  wandb \
  sentence_transformers==1.0.2 \
  phate==1.0.7 \
  boto3
  # Restarting the runtime is required for the libraries to be active in the notebook
import os
os.kill(os.getpid(), 9)


In [1]:
# Setup connection with your own google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
# Clone branch from github
!rm -rf policy-data-analyzer/
!branch_name='hssc' && \
  git clone --branch $branch_name https://github.com/wri-dssg/policy-data-analyzer.git

Cloning into 'policy-data-analyzer'...
remote: Enumerating objects: 6654, done.
remote: Counting objects: 100% (720/720), done.
remote: Compressing objects: 100% (479/479), done.
remote: Total 6654 (delta 474), reused 446 (delta 239), pack-reused 5934
Receiving objects: 100% (6654/6654), 209.79 MiB | 30.91 MiB/s, done.
Resolving deltas: 100% (3702/3702), done.
Checking out files: 100% (1008/1008), done.


In [2]:
# If you run this cell more than once, comment out the os.chdir line below: you are already inside this folder, so it will raise an error
import os
os.chdir("policy-data-analyzer") 

# from tasks.fine_tuning_sbert.src.loops import *
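
Re-running the cell above fails because the working directory is already inside the repository. One possible way to make the step safe to repeat is sketched below. The helper name `enter_repo` is hypothetical and not part of the repository, and the check assumes the clone folder keeps its default name.

```python
import os


def enter_repo(repo_dir="policy-data-analyzer"):
    """Change into repo_dir unless the current directory is already that folder."""
    if os.path.basename(os.getcwd()) == repo_dir:
        return os.getcwd()      # already inside the repository; nothing to do
    os.chdir(repo_dir)          # raises FileNotFoundError if the clone is missing
    return os.getcwd()
```

Calling `enter_repo()` any number of times leaves the notebook in the same folder, so the cell no longer needs to be edited between runs.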

<a name="data-loading"></a>
## Experiment setup and run

In [3]:
'''
PASTE YOUR WEIGHTS & BIASES KEY HERE
Please do not forget to delete the key after finishing using the notebook. Or simply don't save the notebook to GitHub or Google Drive :)
If the key is compromised you can always make a new one in your W&B settings and remove the old one :)
'''
import os
import wandb
import time
from tasks.data_loading.src.utils import *
from tasks.fine_tuning_sbert.src.loops import *

wandb_key = ''
group_desc = 'testing-error'
job_type = ''

os.environ['WANDB_JOB_TYPE'] = job_type
os.environ['WANDB_RUN_GROUP'] = group_desc
os.environ['WANDB_API_KEY'] = wandb_key



Using the GPU
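
The cell above writes all three W&B environment variables unconditionally, so an empty `job_type` ends up as an empty string in `WANDB_JOB_TYPE`. A minimal sketch of an alternative that skips empty values follows. `set_wandb_env` is a hypothetical helper, not part of `wandb` or the repository, and this notebook does not establish whether W&B treats an empty variable differently from an unset one.

```python
import os


def set_wandb_env(api_key, run_group="", job_type="", env=None):
    """Set the W&B environment variables, skipping any value left empty."""
    env = os.environ if env is None else env
    values = {
        "WANDB_API_KEY": api_key,
        "WANDB_RUN_GROUP": run_group,
        "WANDB_JOB_TYPE": job_type,
    }
    for name, value in values.items():
        if value:               # leave the variable untouched when the value is empty
            env[name] = value
    return env
```

To avoid leaving the key in a saved notebook, the first argument could come from the standard-library prompt, for example `set_wandb_env(getpass.getpass("W&B API key: "), run_group=group_desc)` after `import getpass`.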


In [4]:
base_path = "/content/drive/MyDrive/Official Folder of WRI Latin America Project/WRI-LatinAmerica-Talent"
data_path = f"{base_path}/Modeling/Labeled data"
results_save_path = f"{base_path}/Modeling/HSSC/Results/"

project_name = 'HSSC'
languages = ["spanish", "english"]
classification = ["binary", "multiclass"]
labeling = ["handpicked", "assisted", "merged"]


# Restrict this run to one configuration; these assignments override the full lists above
languages = ["spanish"]
classification = ["binary"]
labeling = ["handpicked"]

for language in languages:

  for classif_type in classification:

    for training in labeling:
        # Setup the WandB group
        group_name = language + "_" + classif_type + "_" + training + '_testingerror'
        
        # Load training dataset
        train_sents, train_labels = load_training_dataset_HSSC(data_path, language, classif_type, training, "train")
        label_names = unique_labels(train_labels)

        #Model training
        single_run_params = {
            "all_dev_perc": 0.20,
            "model_names": 'paraphrase-xlm-r-multilingual-v1',
            "output_path": None,
            "max_num_epochs": 10,
            "group_name": group_name
        }
        
        model = single_run_fine_tune_HSSC(single_run_params, train_sents, train_labels, label_names)

        for testing in labeling:
            start = time.time()

            wandb.run.name = group_name + "_" + testing
            #Loading testing dataset
            test_sents, test_labels = load_training_dataset_HSSC(data_path, language, classif_type, testing, "test")
            # class balance/imbalance for test set
            label_names_test = unique_labels(test_labels)
            numeric_train_labels_test = labels2numeric(test_labels, label_names_test)
            print("\n*****", group_name, " -- ", testing, "*****\n")
            class_distribution = plot_data_distribution_HSSC(numeric_train_labels_test, label_names_test)
            wandb.log({"label class distribution": wandb.Image(class_distribution)})
            #Classification and evaluation
            clf = RandomForestClassifier(n_estimators=500,
                                max_features=0.06,
                                n_jobs=6,
                                random_state=69420)
            F1 = evaluate_using_sklearn(clf, model, train_sents, train_labels, test_sents, test_labels,
                            label_names)
            wandb.log({"F1": F1})
            print("\n######", F1, "######\n")
            end = time.time()
            hours, rem = divmod(end - start, 3600)
            minutes, seconds = divmod(rem, 60)
            print("Time taken for evaluation:",
                "{:0>2}:{:0>2}:{:05.2f}".format(int(hours), int(minutes), seconds))


[34m[1mwandb[0m: Currently logged in as: [33mramanshsharma[0m (use `wandb login --relogin` to force relogin)


Problem at: <ipython-input-4-3383b75456b6> 23 <module>


Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_init.py", line 757, in init
    run = wi.init()
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_init.py", line 520, in init
    backend.cleanup()
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/backend/backend.py", line 167, in cleanup
    self.interface.join()
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 836, in join
    _ = self._communicate(record)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 545, in _communicate
    return self._communicate_async(rec, local=local).get(timeout=timeout)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 550, in _communicate_async
    raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown
[34m[1mwandb[0m: [32m[41mERROR[0m Abnormal program exit


Exception: ignored