# Note
This notebook can be run on Google Colab for improved performance. The code changes needed to run it there are given as comments above the relevant code.

## Data preprocessing

In [1]:
! pip install \
  scprep \
  spacy==2.3.2 \
  sentence_transformers==0.4.0 \
  phate==1.0.4 && \
  python -m spacy download es_core_news_lg

Collecting scprep
  Downloading https://files.pythonhosted.org/packages/17/41/f2d4728a5c4d762b5e7424e80fd6297ad53c58bc666d1cc8236827462b8c/scprep-1.0.11-py3-none-any.whl (100kB)
Collecting spacy==2.3.2
  Downloading https://files.pythonhosted.org/packages/10/b5/c7a92c7ce5d4b353b70b4b5b4385687206c8b230ddfe08746ab0fd310a3a/spacy-2.3.2-cp36-cp36m-manylinux1_x86_64.whl (9.9MB)
Collecting sentence_transformers==0.4.0
  Downloading https://files.pythonhosted.org/packages/22/9a/62beeb5501b70ab48b9e5bb92de290f00a661a1caa075c4aae56d452aaa0/sentence-transformers-0.4.0.tar.gz (65kB)
Collecting phate==1.0.4
  Downloading https://files.pythonhosted.org/packages/c6/62/b98433d2b220ad1f106dde89d2e6d753d66b9cc3a914e1245c94556edad8/phate-1.0.4-py3-none-any.whl
Collecting thinc==7.4.1
  Dow

<span style="color:red"><strong>WARNING!</strong></span> Once you have installed the packages in the previous cell, you must restart your runtime, then import the library and load the model.

In [1]:
import spacy
if spacy.prefer_gpu():
    print("Using the GPU")
else:
    print("Using the CPU")
es_nlp = spacy.load('es_core_news_lg')

Using the GPU


For development work: if you want to refresh the files from your GitHub branch by rerunning the clone, you must first empty the folder.

In [2]:
!rm -rf policy-data-analyzer/

In [3]:
# Define branch to clone
! branch_name='#50_dfq_sbert_fine_tuning' && \
  git clone --branch $branch_name https://github.com/wri-dssg/policy-data-analyzer.git

Cloning into 'policy-data-analyzer'...
remote: Enumerating objects: 467, done.
remote: Counting objects: 100% (467/467), done.
remote: Compressing objects: 100% (296/296), done.
remote: Total 2869 (delta 293), reused 321 (delta 170), pack-reused 2402
Receiving objects: 100% (2869/2869), 126.65 MiB | 17.67 MiB/s, done.
Resolving deltas: 100% (1451/1451), done.
Checking out files: 100% (843/843), done.


In [4]:
import pandas as pd
import sys
import os
from sklearn.model_selection import train_test_split
from sentence_transformers import SentencesDataset, SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import LabelAccuracyEvaluator
from torch import nn, Tensor
from typing import Iterable, Dict
from torch.utils.data import DataLoader
import math
import time
import cupy as cp
import numpy as np  # Used below for np.vstack

os.chdir("policy-data-analyzer")  # If you run this cell more than once, comment out this line: you are already in this folder and will get an error
from tasks.data_loader.src.utils import *
from tasks.data_augmentation.src.zero_shot_classification.latent_embeddings_classifier import *
from tasks.evaluate_model.src.model_evaluator import *
from tasks.data_visualization.src.plotting import *

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Fine-tuning the embedding model on the labeled data

### Something we can try out:
https://www.sbert.net/examples/training/data_augmentation/README.html#extend-to-your-own-datasets

### Links:
https://github.com/UKPLab/sentence-transformers/issues/350

https://omoindrot.github.io/triplet-loss

### Possible tasks for fine-tuning:
1) Given a pair of sentence embeddings, do they belong to the same category (binary)?

2) Given a sentence and a category embedding, does the sentence belong to the category (binary)?

3) Given a sentence embedding, use a classifier to predict its category (multiclass) [https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli.py)

4) Use a triplet loss approach such that sentences (texts) that have the same label become close in vector space, while sentences with a different label are pushed further away [https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/other/training_batch_hard_trec_continue_training.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/other/training_batch_hard_trec_continue_training.py)
   
#### In this notebook **task number 3** is used to fine-tune the model.
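
For comparison, task 1 above would need sentence pairs with a binary "same category" label. A minimal sketch of how such pairs could be built from labeled sentences; the function name `make_binary_pairs` and the toy sentences are hypothetical and not part of the repository.

```python
from itertools import combinations

def make_binary_pairs(sentences, labels):
    """Return (sentence_a, sentence_b, same_category) for every unordered pair."""
    pairs = []
    for (sent_a, label_a), (sent_b, label_b) in combinations(zip(sentences, labels), 2):
        # 1 if both sentences share a label, 0 otherwise
        pairs.append((sent_a, sent_b, int(label_a == label_b)))
    return pairs

# Toy data, made up for illustration
toy_sentences = ["Loans for small farmers", "Penalty for illegal logging", "Subsidized credit lines"]
toy_labels = ["Credit", "Fine", "Credit"]
pairs = make_binary_pairs(toy_sentences, toy_labels)
```

The number of pairs grows quadratically with the number of sentences, so a real setup would probably sample pairs rather than enumerate all of them.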

## Run fine tuning experiments

In [5]:
class SoftmaxClassifier(nn.Module):
    """
    This loss adds a softmax classifier on top of the output of the transformer network. 
    It takes a sentence embedding and learns a mapping between it and the corresponding category.
    :param model: SentenceTransformer model
    :param sentence_embedding_dimension: Dimension of your sentence embeddings
    :param num_labels: Number of different labels
    """
    def __init__(self,
                 model: SentenceTransformer,
                 sentence_embedding_dimension: int,
                 num_labels: int):
        super(SoftmaxClassifier, self).__init__()
        self.model = model
        self.num_labels = num_labels
        self.classifier = nn.Linear(sentence_embedding_dimension, num_labels)

    def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor):
        # Get batch sentence embeddings
        features = self.model(sentence_features[0])['sentence_embedding']
        
        # Map the embeddings to class logits, then compute the batch loss
        output = self.classifier(features)
        loss_fct = nn.CrossEntropyLoss()

        if labels is not None:
            loss = loss_fct(output, labels.view(-1))
            return loss
        else:
            return features, output
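
To make the loss above concrete: `nn.CrossEntropyLoss` applies a log-softmax to the classifier outputs and takes the negative log-probability of the true class (averaged over the batch by default). Below is a dependency-free sketch for a single example; the helper name `cross_entropy` and the logits are made up for illustration.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class for one example."""
    max_logit = max(logits)  # Subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]

# Hypothetical classifier outputs for three categories
logits = [2.0, 0.5, -1.0]
loss_correct = cross_entropy(logits, target=0)  # True class has the largest logit
loss_wrong = cross_entropy(logits, target=2)    # True class has the smallest logit
```

The loss is small when the true class receives the largest logit and grows as the classifier puts weight on other classes, which is what pushes the sentence embeddings of different categories apart during fine-tuning.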

Reading files from JSON

In [6]:
# This dictionary maps raw policy instrument labels to canonical category names. BEWARE that several raw labels collapse into the extra category 'Unknown', which might represent an unknown incentive or no incentive!
policy_dict = {'Credit' : 'Credit',
 'Direct' : 'Direct payment',
 'Fine' : 'Fine',
 'General' : 'Unknown', 
 'Guarantee' : 'Credit', 
 'Supplies' : 'Supplies', 
 'Tax' : 'Tax deduction', 
 'Technical' : 'Technical assistance', 
 'Unknown' : 'Unknown', 
 'Other' : 'Unknown', 
 'Nan' : 'Unknown' }
 
Three_most_common = ['Credit', 'Direct payment', 'Fine']
All_but_unknown = ['Credit', 'Direct payment', 'Fine', 'Supplies', 'Tax deduction', 'Technical assistance']
All = ['Credit', 'Direct payment', 'Fine', 'Supplies', 'Tax deduction', 'Technical assistance', 'Unknown']

# This dictionary maps raw is_incentive labels to the two categories 'Incentive' and 'not_Incentive'.
incentive_dict = {'Incentive' : 'Incentive', 
'Disincentive' : 'Incentive', 
'Unknown' : 'not_Incentive', 
'Nan' : 'not_Incentive'}
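
A minimal sketch of how a mapping like the ones above could be applied to raw tags. The helper `normalize_label`, the direct key lookup, and the fallback to `'Unknown'` are assumptions for illustration; the repository's own loading functions may handle this differently.

```python
# Small excerpt in the style of policy_dict (illustrative only)
example_map = {
    "Credit": "Credit",
    "Guarantee": "Credit",
    "Tax": "Tax deduction",
    "Other": "Unknown",
}

def normalize_label(raw_label, mapping, default="Unknown"):
    """Map a raw tag to its canonical category, falling back to `default`."""
    return mapping.get(raw_label, default)

# "Subsidy" is not a key in the mapping, so it falls back to the default
raw_tags = ["Guarantee", "Tax", "Subsidy"]
canonical = [normalize_label(tag, example_map) for tag in raw_tags]
```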

In [24]:
# THIS IS THE PREFERRED PATH
# Use this if the JSON is in the format that includes headers and other titles.
# TODO: Adapt/choose the path according to your system configuration

rater = "Rater2" # TODO: Change according to the dataset you want to analyze

# dataset_fname = "C:/Users/user/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/Notebooks/Data/Final_input_data"
# dataset_fname = "/content/drive/MyDrive/WRI-LatinAmerica-Talent/Cristina_Policy_Files/Tagged_sentence_lists/Rater_3_labeled.json"
dataset_fname = "/content/drive/MyDrive/Official Folder of WRI Latin America Project/WRI-LatinAmerica-Talent/Cristina_Policy_Files/Tagged_sentence_lists/{}_labeled.json".format(rater)

dataset = load_file(dataset_fname)
dataset_map = labeled_sentences_from_dataset(dataset) # Labels AND sentences
dataset_map_target_labels =  select_labels(dataset_map, Three_most_common) # Adjust to the labels you want to use for your analysis. In the cell above you have three pre-defined sets.
all_sents = sentences_from_dataset(dataset_map_target_labels) # Just sentences
all_labels = labels_from_dataset(dataset_map_target_labels) # Just labels. Use "labels" if you want to retrieve policy instrument tags
                                                    # Use "incentive" if you want to retrieve is_incentive tags.

In [25]:
# The rest:
label_names = unique_labels(all_labels)
numeric_labels = labels2numeric(all_labels, label_names)
label_names

['Fine', 'Credit', 'Direct payment']
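
What the numeric encoding amounts to, as a sketch: each label is replaced by its position in `label_names`. The function `to_numeric_sketch` is a hypothetical stand-in; the repository's `labels2numeric` may differ in its details.

```python
def to_numeric_sketch(labels, label_names):
    """Replace each label with its index in label_names."""
    label2int = {name: idx for idx, name in enumerate(label_names)}
    return [label2int[label] for label in labels]

# Same label order as the cell output above; the input labels are made up
names = ["Fine", "Credit", "Direct payment"]
encoded = to_numeric_sketch(["Credit", "Fine", "Direct payment", "Credit"], names)
```

The codes depend on the order of `label_names`, so the same list has to be used for both the true labels and the predictions.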

In [26]:
model_names = ['stsb-xlm-r-multilingual', 'paraphrase-xlm-r-multilingual-v1']  # Other candidates: 'quora-distilbert-multilingual', 'distiluse-base-multilingual-cased-v2'
Experiment = "EXP6" # TODO: Adjust the value of the variable for the results to be saved in the right folder

# Train test split stratified
all_test_perc = [0.15, 0.2, 0.25, 0.3]

# Output setup
output = {}

for test_perc in all_test_perc:
  output[f"test_perc={test_perc}"] = {}
  X_train, X_test, y_train, y_test = train_test_split(all_sents, all_labels, test_size=test_perc, stratify=all_labels, random_state=69420)

  # Load data samples into batches
  train_batch_size = 16
  label2int = dict(zip(label_names, range(len(label_names))))
  train_samples = []
  for sent, label in zip(X_train, y_train):
      label_id = label2int[label]
      train_samples.append(InputExample(texts=[sent], label=label_id))

  # Build the dev set samples for the evaluator (not yet verified to work)
  dev_samples = []
  for sent, label in zip(X_test, y_test):
      label_id = label2int[label]
      dev_samples.append(InputExample(texts=[sent], label=label_id))
  
  for model_name in model_names:
    # Setup
    model_preds = []
    model_scores = []
    output[f"test_perc={test_perc}"][model_name] = []
    
    # Train set config
    model = SentenceTransformer(model_name)
    train_dataset = SentencesDataset(train_samples, model=model)
    train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
    
    # Define the way the loss is computed
    classifier = SoftmaxClassifier(model=model, sentence_embedding_dimension=model.get_sentence_embedding_dimension(), num_labels=len(label2int))
    
    # Dev set config
    dev_dataset = SentencesDataset(dev_samples, model=model)
    dev_dataloader = DataLoader(dev_dataset, shuffle=False, batch_size=train_batch_size)  # No need to shuffle for evaluation
    dev_evaluator = LabelAccuracyEvaluator(dataloader=dev_dataloader, softmax_model=classifier, name='lae-dev')

    # Configure the training
    max_num_epochs = 10
        
    for num_epochs in range(4, max_num_epochs + 2, 2):
        print("Num epochs:", num_epochs)
        
        warmup_steps = math.ceil(len(train_dataset) * num_epochs / train_batch_size * 0.1)  # Warm up over 10% of the training steps for num_epochs epochs
        model_deets = f"model={model_name}_test-perc={test_perc}_n-epoch={num_epochs}"
        model_save_path = f"/content/drive/MyDrive/Official Folder of WRI Latin America Project/WRI-LatinAmerica-Talent/Modeling/FineTuningExperiments/{Experiment}/FineTuning_{model_deets}"
        

        # Train the model
        start = time.time()
        model.fit(train_objectives=[(train_dataloader, classifier)],
                  evaluator=dev_evaluator,
                  epochs=2, # Each fit() call continues training the same model for 2 more epochs, to see the performance gain
                  evaluation_steps=1000,
                  warmup_steps=warmup_steps
                  )
        
        end = time.time()
        hours, rem = divmod(end-start, 3600)
        minutes, seconds = divmod(rem, 60)
        print("Time taken for fine-tuning:", "{:0>2}:{:0>2}:{:05.2f}".format(int(hours),int(minutes),seconds))
        
        ### Classify sentences
        # Projection matrix Z low-dim projection
        print("Classifying sentences...")
        proj_matrix = cp.asnumpy(calc_proj_matrix(all_sents, 50, es_nlp, model, 0.01))
        all_sent_embs = encode_all_sents(all_sents, model, proj_matrix)
        all_label_embs = encode_labels(label_names, model, proj_matrix)
        visualize_embeddings_2D(np.vstack(all_sent_embs), all_labels, tsne_perplexity=50, store_name=f"{model_save_path}/{model_deets}")
        model_preds, model_scores = calc_all_cos_similarity(all_sent_embs, all_label_embs, label_names)
        
        ### Evaluate the model
        numeric_preds = labels2numeric(model_preds, label_names)
        evaluator = ModelEvaluator(label_names, y_true=numeric_labels, y_pred=numeric_preds)
        
        output[f"test_perc={test_perc}"][model_name].append({"num_epochs": num_epochs, "avg_f1": evaluator.avg_f1.tolist()})
        
        evaluator.plot_confusion_matrix(color_map='Blues', exp_name=f"{model_save_path}/{model_deets}")

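
The classification step in the loop assigns each sentence to the label whose embedding is most similar to it. A dependency-free sketch of that idea follows; `cosine_similarity`, `classify_by_similarity`, and the 2-D toy embeddings are hypothetical, and the repository's `calc_all_cos_similarity` (which works on the projected embeddings) may differ.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_by_similarity(sentence_embs, label_embs, label_names):
    """For each sentence, return the most similar label and its similarity score."""
    preds, scores = [], []
    for emb in sentence_embs:
        sims = [cosine_similarity(emb, label_emb) for label_emb in label_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        preds.append(label_names[best])
        scores.append(sims[best])
    return preds, scores

# Toy 2-D embeddings, made up for illustration
toy_label_embs = [[1.0, 0.0], [0.0, 1.0]]
toy_sentence_embs = [[0.9, 0.1], [0.2, 0.8]]
preds, scores = classify_by_similarity(toy_sentence_embs, toy_label_embs, ["Fine", "Credit"])
```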

In [20]:
output.keys()

dict_keys(['test_perc=0.15', 'test_perc=0.2', 'test_perc=0.25', 'test_perc=0.3'])
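
For each of these test fractions, the training loop walks through an epoch schedule and derives the warm-up steps from it. The arithmetic can be sketched in isolation as below; `warmup_steps_for` is a hypothetical helper and the training set size of 1000 is made up. Note that in the loop the warm-up is computed from `num_epochs`, while each `fit()` call itself runs 2 epochs.

```python
import math

def warmup_steps_for(num_train_samples, batch_size, num_epochs, warmup_fraction=0.1):
    """Same arithmetic as the loop: a fraction of the nominal number of training steps."""
    return math.ceil(num_train_samples * num_epochs / batch_size * warmup_fraction)

max_num_epochs = 10
# Same range expression as in the training loop
schedule = list(range(4, max_num_epochs + 2, 2))
# Warm-up steps per schedule entry for an assumed 1000 training sentences and batch size 16
steps = {n: warmup_steps_for(1000, 16, n) for n in schedule}
```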

In [27]:
new_json = {}

for key in output.keys():
  new_json[key] = {}
  for subkey in output[key].keys():
    new_json[key][subkey] = []
    for element in output[key][subkey]:
      el_copy = {"avg_f1": element["avg_f1"], "num_epochs": element["num_epochs"]}
      new_json[key][subkey].append(el_copy)

In [28]:
import json
with open(f"/content/drive/MyDrive/Official Folder of WRI Latin America Project/WRI-LatinAmerica-Talent/Modeling/FineTuningExperiments/{Experiment}/{Experiment}_FineTuningResults.json", "w") as f:
  json.dump(new_json, f)
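
Once saved, the results can be reloaded and searched for the best configuration. A minimal sketch, assuming `avg_f1` is stored as a single number per run (if it is a list, it would need to be reduced first); `best_run` and the toy numbers are hypothetical.

```python
import json

def best_run(results):
    """Return (split_key, model_name, run) for the run with the highest avg_f1."""
    best = None
    for split_key, models in results.items():
        for model_name, runs in models.items():
            for run in runs:
                if best is None or run["avg_f1"] > best[2]["avg_f1"]:
                    best = (split_key, model_name, run)
    return best

# Toy results in the same shape as new_json; the scores are made up.
# The dumps/loads round trip stands in for writing and re-reading the file.
toy_results = json.loads(json.dumps({
    "test_perc=0.2": {
        "model-a": [{"avg_f1": 0.61, "num_epochs": 4}, {"avg_f1": 0.70, "num_epochs": 6}],
        "model-b": [{"avg_f1": 0.66, "num_epochs": 4}],
    }
}))
split_key, model_name, run = best_run(toy_results)
```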