# 51-train-multilabel
> An exploration of training for multilabel classification on the audio model

In this notebook, we continue using wav2vec2, now extended to multilabel classification. The model here is trained on equal-length segments of audio, and all labels occurring in a segment are attributed to that segment.

# Read in packages

In [None]:
#all_no_test

In [None]:
#modeling imports
from transformers import Wav2Vec2Processor, Wav2Vec2ForSequenceClassification, pipeline, TrainingArguments, Trainer
import torch
import soundfile as sf
import librosa

#ds imports
import pandas as pd
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

#python imports
import os.path
import glob
import re
import warnings

In [None]:
#constants
sampling_rate = 16000
base_prefix = '/data/p_dsi/wise/data/'
test_audio_id = '055-1'
sample_csv_dir = base_prefix + 'multilabel_parquet/'
audio_dir = base_prefix + 'resampled_audio_16khz/'

# Read in audio and encoding data
We read in the subsegmented data generated in another notebook and stored as a parquet file. We also read in an audio file to work with.

In [None]:
#read audio data
class_audio, class_sr = sf.read(audio_dir + test_audio_id + '.wav')

#read dataframe and preview
ts_df = pd.read_parquet(sample_csv_dir + test_audio_id + '.parquet')
display(ts_df.head(10))
ts_df.shape

Unnamed: 0,id,speech_list,speech,label,label_id,start_timestamp,end_timestamp,start_ms,end_ms,duration_ms,start_index,end_index,OTR,NEU,REP,PRS
0,055-1,[(okay) we are gonna go on and get started guys.],(okay) we are gonna go on and get started guys.,[NEU],[1],00:01.000,00:03.380,1000,3380,2380,16000,54080,0,1,0,0
1,055-1,[we are gonna do a little bit of reviewing wit...,we are gonna do a little bit of reviewing with...,[NEU],[1],00:03.750,00:06.763,3750,6763,3013,60000,108208,0,1,0,0
2,055-1,[(now) keep in mind that we are playing the go...,(now) keep in mind that we are playing the goo...,[NEU],[1],00:07.150,00:12.769,7150,12769,5619,114400,204304,0,1,0,0
3,055-1,[everyone look up here please.],everyone look up here please.,[NEU],[1],00:14.012,00:16.260,14012,16260,2248,224192,260160,0,1,0,0
4,055-1,"[let's go over the problems., most of you are ...",let's go over the problems. most of you are done.,"[NEU, NEU]","[1, 1]",00:16.615,00:19.125,16615,19125,2510,265840,306000,0,2,0,0
5,055-1,[if you are not you're just gonna follow along.],if you are not you're just gonna follow along.,[NEU],[1],00:19.125,00:21.762,19125,21762,2637,306000,348192,0,1,0,0
6,055-1,[(now) remember when we group>],(now) remember when we group>,[NEU],[1],00:22.375,00:24.000,22375,24000,1625,358000,384000,0,1,0,0
7,055-1,[raise your hand if you can tell me what numbe...,raise your hand if you can tell me what number...,[OTR],[0],00:24.010,00:32.596,24010,32595,8585,384160,521520,1,0,0,0
8,055-1,[what number name],what number name,[OTR],[0],00:33.250,00:34.697,33250,34697,1447,532000,555152,1,0,0,0
9,055-1,[okay ten or more.],okay ten or more.,[NEU],[1],00:35.250,00:37.375,35250,37375,2125,564000,598000,0,1,0,0


(177, 16)

# Data pre-processing
## Get audio clips (inputs)
Here, we'll split up the audio data and groom the labels to be one-hot encoded.

In [None]:
#Slice the audio into one clip per row using start_index and end_index
audio_clips = [class_audio[start:end+1] for start, end in ts_df[['start_index', 'end_index']].values]
len(audio_clips)

177
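
The `start_index` and `end_index` columns used above appear to be the millisecond offsets converted to sample positions at the 16 kHz sampling rate (for example, 1000 ms maps to index 16000 in the first row of the preview). A minimal sketch of that conversion, where `ms_to_index` is a hypothetical helper and not part of the original notebook:

```python
SAMPLING_RATE = 16000  # samples per second, matching `sampling_rate` above

def ms_to_index(ms, sampling_rate=SAMPLING_RATE):
    """Convert a millisecond offset to a sample index (hypothetical helper)."""
    return ms * sampling_rate // 1000

# Values taken from the first row of the preview table
start_index = ms_to_index(1000)
end_index = ms_to_index(3380)
print(start_index, end_index)  # 16000 54080
```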

## Get labels in correct format

In [None]:
#create some data labels to enforce consistency of label order
data_labels = ['OTR', 'NEU', 'REP', 'PRS']

#create forward and reverse dictionaries
label2id = {label:lid for lid, label in enumerate(data_labels)}
id2label = {value:key for key, value in label2id.items()}

print(label2id)
print(id2label)

{'OTR': 0, 'NEU': 1, 'REP': 2, 'PRS': 3}
{0: 'OTR', 1: 'NEU', 2: 'REP', 3: 'PRS'}


In [None]:
#convert labels to one-hot encoding
multilabels = (ts_df[data_labels]>0).astype(int)
new_total = multilabels.values.tolist()

#sanity checks
display(new_total[:6])
print(len(new_total))

[[0, 1, 0, 0],
 [0, 1, 0, 0],
 [0, 1, 0, 0],
 [0, 1, 0, 0],
 [0, 1, 0, 0],
 [0, 1, 0, 0]]

177
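
The comparison `ts_df[data_labels] > 0` collapses per-class counts into presence flags, so a segment containing two `NEU` utterances is still encoded with a single 1. The same idea in plain Python, as a sketch that starts from a list of label names (the `to_multi_hot` helper is hypothetical):

```python
data_labels = ['OTR', 'NEU', 'REP', 'PRS']

def to_multi_hot(labels, classes=data_labels):
    """Return one 0/1 flag per class, ignoring how often a label repeats."""
    present = set(labels)
    return [1 if c in present else 0 for c in classes]

repeated = to_multi_hot(['NEU', 'NEU'])  # repeated label, one flag
mixed = to_multi_hot(['OTR', 'REP'])     # two different labels in one segment
print(repeated)  # [0, 1, 0, 0]
print(mixed)     # [1, 0, 1, 0]
```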


# Split train and test
Here, we'll split the data to make sure we have a training and validation set. We'll need a test set as well at some point.

In [None]:
#determine training size
train_size = round(ts_df.shape[0]*0.8)

In [None]:
#split data
audio_clips_train = audio_clips[:train_size]
audio_clips_test = audio_clips[train_size:]

#this looks about right
print(len(audio_clips_train))
print(audio_clips_train[0][:10])
print(audio_clips_train[1][:10])
print(len(audio_clips_test))
print(len(audio_clips_test[0]))

142
[-0.00299072 -0.00701904 -0.00869751 -0.00930786 -0.00946045 -0.0085144
 -0.0057373  -0.00164795 -0.00030518 -0.00048828]
[-4.88281250e-04  3.05175781e-05  5.18798828e-04  8.23974609e-04
  8.54492188e-04  1.12915039e-03  1.15966797e-03  8.54492188e-04
  7.32421875e-04  9.46044922e-04]
35
60497


In [None]:
#Create lists for labels
train_label = new_total[:train_size]
test_label = new_total[train_size:]
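
The split above is positional: the first 80% of segments are used for training and the rest for validation. If a held-out test set is added later, one option is to carve it out of the tail in the same way. A sketch under assumed fractions (the 80/10/10 proportions and the `split_indices` helper are illustrative, not from the notebook):

```python
def split_indices(n, train_frac=0.8, val_frac=0.1):
    """Return contiguous index ranges for train, validation, and test."""
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return range(0, train_end), range(train_end, val_end), range(val_end, n)

# 177 is the number of segments in this recording
train_idx, val_idx, test_idx = split_indices(177)
print(len(train_idx), len(val_idx), len(test_idx))  # 142 18 17
```

The ranges can then be used to slice both `audio_clips` and `new_total` so that inputs and labels stay aligned.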

# Prepare inputs for wav2vec2
Here, we pre-process the inputs specifically for wav2vec2 models. We also create a custom dataset which will generate the inputs as desired by the HuggingFace API.

In [None]:
#load processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

In [None]:
#process inputs appropriately
train_inputs = processor(audio_clips_train, return_tensors="pt", padding="longest", sampling_rate=sampling_rate)
test_inputs = processor(audio_clips_test, return_tensors="pt", padding="longest", sampling_rate=sampling_rate)
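
With `padding="longest"`, every clip passed in a single call is padded to the length of the longest clip in that call. Since the whole training list is processed at once, each training example ends up as long as the longest training clip. The sketch below illustrates only the padding step in plain Python; the real processor may also normalize the waveform, and the `pad_to_longest` helper is hypothetical:

```python
def pad_to_longest(clips, padding_value=0.0):
    """Right-pad every clip with `padding_value` up to the longest clip length."""
    longest = max(len(clip) for clip in clips)
    return [list(clip) + [padding_value] * (longest - len(clip)) for clip in clips]

# Toy clips with made-up sample values
padded = pad_to_longest([[0.1, 0.2, 0.3], [0.4]])
print(padded)  # [[0.1, 0.2, 0.3], [0.4, 0.0, 0.0]]
```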

In [None]:
#helpers for class size and class names
no_classes = len(train_label[0])

#Create custom Datasets Class
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels, is_multilabel=False):
        self.encodings = encodings
        self.labels = labels
        self.is_multilabel = is_multilabel

    def __getitem__(self, idx):
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        if self.is_multilabel:
            item['labels'] = torch.tensor(self.labels[idx], dtype=torch.float)
        else:
            item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

#Create datasets from encodings
train_dataset = CustomDataset(train_inputs, train_label, is_multilabel=True)
val_dataset = CustomDataset(test_inputs, test_label, is_multilabel=True)

# Setup model for training
Here, we define a number of functions which will allow our model to perform multilabel classification.

In [None]:
model = Wav2Vec2ForSequenceClassification.from_pretrained("facebook/wav2vec2-base-960h",
                                                          num_labels=no_classes,
                                                          id2label=id2label,
                                                          label2id=label2id)

Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForSequenceClassification: ['lm_head.bias', 'lm_head.weight']
- This IS expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['classifier.weight', 'projector.bias', 'wav2vec2.masked_spec_embed', 'classifier.bias', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be 

This function is a shorthand for try/except, used when computing the ROC AUC score. On the off-chance that the AUC cannot be calculated, `roc_auc_score` throws an error which we need to catch; if we don't, the whole `trainer.evaluate()` call fails and we cannot see the results.

In [None]:
def try_except(try_exp):
    '''
    Function try_except: a helper function for streamlining try/except statements
        Inputs: callable to be executed
        Output: the successful output of the called input or np.nan
    '''
    try:
        return try_exp()
    except Exception:
        return np.nan
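
A quick way to see the helper's behaviour is to pass it one callable that succeeds and one that raises. The sketch below is self-contained, so it redefines the helper with `float('nan')` standing in for `np.nan`; the failing callable only simulates a metric that cannot be computed (for AUC, this can happen when a label column contains a single class, depending on the scikit-learn version):

```python
import math

def try_except(try_exp):
    """Run a callable and return its result, or NaN if it raises."""
    try:
        return try_exp()
    except Exception:
        return float('nan')

def undefined_metric():
    # Simulated failure, standing in for a metric that cannot be computed
    raise ValueError("metric is undefined for this input")

ok = try_except(lambda: 0.75)
failed = try_except(undefined_metric)
print(ok, math.isnan(failed))  # 0.75 True
```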

This function does what its name says: it computes the metrics that are reported on the evaluation set during training. For more information on the format of what a metric output should look like, see the [GitHub metrics repository](https://github.com/huggingface/datasets/blob/master/metrics/accuracy/accuracy.py) where several are defined.

In [None]:
#function to calculate metrics
def compute_metrics(eval_pred):
    '''
    Function compute_metrics: computes a number of metrics and returns a dictionary as required by HF API
        Inputs: eval_pred: output of forward pass of HF
        Outputs: dictionary containing named metrics
    '''
    
    #separate components of forward pass and labels
    logits, labels = eval_pred
    
    #Calculate probabilities and get hard predictions
    probabilities = torch.sigmoid(torch.tensor(logits))
    predictions = (probabilities >= 0.5).float()
    
    #Calculate metrics of interest to be returned as a dictionary
    metrics_calculated = {'f1_score_samples_mean': f1_score(labels.astype(float), predictions, average='samples'),
                          'f1_score_macro_mean': f1_score(labels.astype(float), predictions, average='macro'),
                          'roc_auc_samples_mean': try_except(lambda: roc_auc_score(labels, predictions, average='samples')),
                          'roc_auc_macro_mean': try_except(lambda: roc_auc_score(labels, predictions, average='macro'))}

    return metrics_calculated
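
One thing to keep in mind with `compute_metrics` as written: the ROC AUC entries receive the thresholded predictions, not the probabilities. AUC measures how well scores rank positives above negatives, so if every thresholded value in a column is identical, the result is 0.5 regardless of how the underlying probabilities are ordered. The sketch below shows this with a small pairwise AUC written in plain Python; the labels and probabilities are made-up toy values, and `pairwise_auc` is a hypothetical helper, not the scikit-learn implementation:

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 0]
probs = [0.49, 0.46, 0.48, 0.47]              # all below 0.5, positives ranked higher
hard = [1 if p >= 0.5 else 0 for p in probs]  # thresholding gives all zeros

auc_hard = pairwise_auc(y_true, hard)
auc_probs = pairwise_auc(y_true, probs)
print(auc_hard, auc_probs)  # 0.5 1.0
```

Passing `probabilities` instead of `predictions` to `roc_auc_score` would be one way to make the AUC sensitive to small shifts in the model's outputs.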

The `compute_loss` function of `MultilabelTrainer` below is a bit squirrely, and with new commits to the wav2vec2 model source code (current commit is 6645eb61fa61cd24c77), I'm sure that it will become obsolete. The source code of the forward pass is [here](https://github.com/huggingface/transformers/blob/master/src/transformers/models/wav2vec2/modeling_wav2vec2.py#L1716). The problem lies on lines [1762-1764](https://github.com/huggingface/transformers/blob/master/src/transformers/models/wav2vec2/modeling_wav2vec2.py#L1762).

As you can see, by default, the forward pass wants to compute CrossEntropyLoss. This doesn't work for us; however, it only runs in the case that labels is not None. So, we force labels to be None, and skip that step and use our own loss calculation. This is a bit hacky; another approach would be to inherit from wav2vec2 and then write a new forward pass. This seems...unfortunate when something small like this can be integrated in.

A second part of this is that the `inputs` parameter passed in already has the `labels` parameter inside it as a key in the dictionary. So, we need to snatch that out before calling the forward pass of the model.

In [None]:
#Subclassing trainer to customize the loss function
class MultilabelTrainer(Trainer):
    '''
    Class MultilabelTrainer: Subclass of Trainer with a loss function tailored to multilabel classification
    '''
    def compute_loss(self, model, inputs, return_outputs=False):
        #get the labels
        labels = inputs.get("labels")
        
        #call the forward pass without the labels (labels=None) so that wav2vec2
        #skips its default loss computation, and get the outputs
        outputs = model(**{key:value for key,value in inputs.items() if key!='labels'}, labels=None)
        
        #get logits and compute loss
        logits = outputs.get('logits')
        loss_fct = torch.nn.BCEWithLogitsLoss()
        loss = loss_fct(logits.view(-1, self.model.config.num_labels),
                        labels.float().view(-1, self.model.config.num_labels))
        
        #return loss
        return (loss, outputs) if return_outputs else loss
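
To make the loss concrete: `BCEWithLogitsLoss` applies a sigmoid to each logit, scores each class as an independent binary problem, and (with its default settings) averages over all elements. The sketch below reproduces that arithmetic in plain Python on made-up logits; `bce_with_logits` is a hypothetical helper for illustration, not the PyTorch implementation:

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over every (sample, class) element."""
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for x, y in zip(row_logits, row_targets):
            p = 1.0 / (1.0 + math.exp(-x))  # sigmoid of the logit
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            count += 1
    return total / count

target = [[0, 1, 0, 0]]  # a NEU-only segment
untrained = bce_with_logits([[0.0, 0.0, 0.0, 0.0]], target)     # every probability is 0.5
confident = bce_with_logits([[-4.0, 4.0, -4.0, -4.0]], target)  # confident and correct
print(round(untrained, 4), round(confident, 4))  # 0.6931 0.0181
```

All-zero logits give a loss of ln 2 (about 0.693), which is a useful reference point when reading loss values during training.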

# Train Model
Now, we set up the training arguments to log and evaluate at the end of every epoch. The resulting evaluation metrics can be seen below.

In [None]:
# set parameters around training
training_args = TrainingArguments("test_trainer", 
                                  logging_strategy='epoch', 
                                  evaluation_strategy='epoch',
                                  num_train_epochs = 3,
                                  learning_rate = 2e-5,
                                  weight_decay = 1e-5,
                                  #lr_scheduler_type = 'cosine',
                                  #adam_beta1 = 0.8,
                                  #adam_beta2 = 0.95,
                                  #adam_epsilon = 1e-6,
                                  per_device_train_batch_size=5,
                                  per_device_eval_batch_size=5,
                                  report_to='all'
                                 )
    
trainer = MultilabelTrainer(
    model=model,
    args=training_args,
    tokenizer=processor,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

train_output = trainer.train()

***** Running training *****
  Num examples = 142
  Num Epochs = 3
  Instantaneous batch size per device = 5
  Total train batch size (w. parallel, distributed & accumulation) = 5
  Gradient Accumulation steps = 1
  Total optimization steps = 87


Epoch,Training Loss,Validation Loss,F1 Score Samples Mean,F1 Score Macro Mean,Roc Auc Samples Mean,Roc Auc Macro Mean
1,0.6622,0.635755,0.428571,0.156863,0.611905,0.5
2,0.603,0.61122,0.428571,0.156863,0.611905,0.5
3,0.5866,0.601642,0.428571,0.156863,0.611905,0.5


***** Running Evaluation *****
  Num examples = 35
  Batch size = 5
***** Running Evaluation *****
  Num examples = 35
  Batch size = 5
***** Running Evaluation *****
  Num examples = 35
  Batch size = 5


Training completed. Do not forget to share your model on huggingface.co/models =)
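
The step count in the log above follows from the dataset size, batch size, and number of epochs. A short check of that arithmetic, assuming a single device and no gradient accumulation (as the log reports):

```python
import math

num_examples = 142  # training examples
batch_size = 5      # per_device_train_batch_size
num_epochs = 3      # num_train_epochs

# The last batch of each epoch is partial, so round up
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 29 87
```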




# Evaluate Model
If we want to evaluate the performance, we can use the evaluate method of trainer. This shows us all of our metrics from before.

In [None]:
trainer.evaluate(train_dataset)

***** Running Evaluation *****
  Num examples = 142
  Batch size = 5


{'eval_loss': 0.5638657212257385,
 'eval_f1_score_samples_mean': 0.5023474178403755,
 'eval_f1_score_macro_mean': 0.17129629629629628,
 'eval_roc_auc_samples_mean': 0.6625586854460095,
 'eval_roc_auc_macro_mean': 0.5,
 'eval_runtime': 8.9881,
 'eval_samples_per_second': 15.799,
 'eval_steps_per_second': 3.227,
 'epoch': 3.0}

The model is performing pretty poorly, even on the training set, but hopefully it will improve with the use of more data!

# Look at predictions
Let's check out the probabilities themselves since we don't see a lot of movement in the evaluation metrics (we'll look at the training set here, though...)

In [None]:
#Get predictions
train_preds = trainer.predict(train_dataset)

***** Running Prediction *****
  Num examples = 142
  Batch size = 5


In [None]:
#Run sigmoid to get probabilities
torch.sigmoid(torch.tensor(train_preds.predictions))

tensor([[0.5307, 0.4796, 0.2736, 0.3569],
        [0.5287, 0.4785, 0.2781, 0.3590],
        [0.5299, 0.4816, 0.2942, 0.3692],
        [0.5289, 0.4796, 0.2668, 0.3518],
        [0.5313, 0.4799, 0.2740, 0.3561],
        [0.5285, 0.4762, 0.2768, 0.3573],
        [0.5291, 0.4793, 0.2670, 0.3523],
        [0.5301, 0.4803, 0.3096, 0.3806],
        [0.5277, 0.4787, 0.2653, 0.3503],
        [0.5303, 0.4784, 0.2682, 0.3523],
        [0.5297, 0.4784, 0.2656, 0.3517],
        [0.5314, 0.4789, 0.2714, 0.3534],
        [0.5301, 0.4792, 0.2841, 0.3616],
        [0.5316, 0.4775, 0.2820, 0.3631],
        [0.5248, 0.4782, 0.2791, 0.3608],
        [0.5304, 0.4770, 0.2943, 0.3713],
        [0.5268, 0.4767, 0.2669, 0.3522],
        [0.5258, 0.4799, 0.3210, 0.3883],
        [0.5277, 0.4768, 0.2677, 0.3534],
        [0.5258, 0.4774, 0.2724, 0.3560],
        [0.5261, 0.4799, 0.3345, 0.3964],
        [0.5291, 0.4813, 0.2860, 0.3652],
        [0.5251, 0.4768, 0.2713, 0.3554],
        [0.5300, 0.4793, 0.2690, 0

This tells us a lot about what we saw during training. The probabilities barely differ in value across the different samples. A good sign is that they are not identical, but they will need to separate much further, ideally crossing the 0.5 threshold in different classes for different samples, before the metrics start to move.
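
Applying the 0.5 threshold from `compute_metrics` to the first three rows printed above makes the effect visible: every row collapses to the same hard prediction, which is why the thresholded metrics stay flat across epochs. A small sketch in plain Python:

```python
# First three rows of the sigmoid output shown above
probabilities = [
    [0.5307, 0.4796, 0.2736, 0.3569],
    [0.5287, 0.4785, 0.2781, 0.3590],
    [0.5299, 0.4816, 0.2942, 0.3692],
]

# Same rule as in compute_metrics: predict a class when its probability is >= 0.5
hard_predictions = [[1 if p >= 0.5 else 0 for p in row] for row in probabilities]
distinct = {tuple(row) for row in hard_predictions}
print(hard_predictions)  # [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
print(len(distinct))     # 1
```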