# Connect

## Google Drive

The datasets containing the examples of Linguistic Antipattern that the model will have to be able to identify are located in a shared folder, created specifically for the project, on google drive, to facilitate data recovery operations they are then accessed to Drive, so you can then easily retrieve the datasets

In [1]:
# Mount in the Colab runtime a folder corresponding to your google drive

from google.colab import drive
drive.mount('/content/drive');

Mounted at /content/drive


## Hugging Face

We log in to Hugging Face so that we can upload the model lastly after finishing the operations necessary for Fine Tuning

In [4]:
!pip install huggingface

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting huggingface
  Downloading huggingface-0.0.1-py3-none-any.whl (2.5 kB)
Installing collected packages: huggingface
Successfully installed huggingface-0.0.1


In [5]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

# Installation

## Installing Tranformers Libraries

In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.1/7.1 MB[0m [31m87.0 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m224.5/224.5 kB[0m [31m27.7 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m101.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.2


## Installing the Necessary Libraries

In the next step we go instead to download all those libraries necessary to be able to carry out training, evaluation and testing of the model, also adding some libraries used to improve data visualization:

1.   **train_test_split**: this function comes from the _sklearn_ library, and allows you to divide a dataset into two parts: _training-set_ and _test-set_. Respectively, one will be used to train the model, while the other will be used to evaluate its generalization ability.
2.   **pandas**: offers tools for analyzing data in tabular form, dataframes and manipulating them. It allows operations such as column filtering, aggregation and merging of dataframes.
3.  **numpy**: processing of multidimensional numerical arrays, for the latter, offers linear algebra operations and mathematical operations.
4.  **tabulate**: allows you to transform an array into a table to graphically display data structures.
5.  **tqdm**: allows displaying progressbar in iterative processing loops. It could be useful at this stage of system development, to monitor the time required by the various processes.

In [3]:
import torch # ML framework
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler # library for data manipulation

import pandas as pd
import numpy as np

from tabulate import tabulate
from tqdm import trange
import random # random numbers generation

import warnings
warnings.filterwarnings(action='once')

# Data Preparation

## Retrive Data from Dataset

in the code below, we get the data files from the google drive of the project

In [4]:
# Da eseguire se hai caricato il dataset a mano nel runtime
mop_instances = "/content/drive/Shareddrives/se4ai/mop.csv"
aop_instances = "/content/drive/Shareddrives/se4ai/aop.csv"
nic_instances = "/content/drive/Shareddrives/se4ai/nic.csv"

The training of the model will be carried out through supervised learning, i.e. labels are added to the data which will allow the model to understand what type of information it is analyzing, the labels are then transformed into numbers, so as to be able to be more easily included by the model.
The model must be able to recognize 4 different types of linguistic antipattern within python code, they are:
- **"Get" More Than an Accessor**: A getter that performs actions other than returning the corresponding attribute.
- **Not Implemented Condition**: The comments of a method suggest a conditional behavior that is not implemented in the code. When the implementation is default this should be documented.
- **Attribute Signature and Comment are Opposite**: The declaration of an attribute is in contradiction with its documentation.
- **Method Signature and Comment are Opposite**: The  declaration of a method is in contradiction with its documentation.

In addition to the 4 types of Antipattern, the model must also be able to recognize the instances of code that do not contain them, thus classifying them as instances of clean code (**Clear**)

The labels assigned to the various code snippets, based on the antipattern they contain, within the dataset are:
- **mop**: used for "Metohd Signature and Comment are Opposite", this label will be assigned the value **0** for model training
- **aop**: used for "Attribute Signature and Comment are Opposite", this label will be assigned the value **1** for model training
- **clr**: used for "Clear", this label will be assigned the value **2** for model training
- **nic**: used for "Not Implemented Condition", this label will be assigned the value **3** for model training
- **get**: used for "Get" More than an Accessor, this label will be assigned the value **4** for model training


In [5]:
mop = pd.read_csv(mop_instances) # pandas read_csv legge automaticamente file csv e crea un "oggetto"
aop = pd.read_csv(aop_instances)
nic = pd.read_csv(nic_instances)

# Creiamo array di label e array di snippet di codice
mop_labels = list([])
aop_labels = list([])
nic_labels = list([])
for l in list(mop['label']):
  if l == 'mop':
    mop_labels.append(0)
  else:
    mop_labels.append(2)

for l in list(aop['label']):
  if l == 'aop':
    aop_labels.append(1)
  else:
    aop_labels.append(2)

for l in list(nic['label']):
  if l == 'nic':
    nic_labels.append(3)
  else:
    nic_labels.append(2)



mop_code = list(mop['code'])
aop_code = list(aop['code'])
nic_code = list(nic['code'])



## Split data for Training and Testing

In [6]:
from sklearn.model_selection import train_test_split # funzione in sklearn per dividere il dataset in train, test, validation
# Divide il dataset in train test validation
# Parametri test_size:
# valori di default: train test_size = 0.33 val test_size = 0.3
TRAIN_TEST_SIZE = 0.33;
VAL_TEST_SIZE = 0.3;
# Tendenzialmente dovrebbero essere simili così train = 70%, test = 30%, val = 30% di test;
# Alzando test e val può migliorare accuracy perché ha più esempi per test e validazione

mop_train_codes, mop_temp_codes, mop_train_labels, mop_temp_labels = train_test_split(mop_code, mop_labels, test_size = TRAIN_TEST_SIZE, shuffle = True, stratify = mop_labels);
aop_train_codes, aop_temp_codes, aop_train_labels, aop_temp_labels = train_test_split(aop_code, aop_labels, test_size = TRAIN_TEST_SIZE, shuffle = True, stratify = aop_labels);
nic_train_codes, nic_temp_codes, nic_train_labels, nic_temp_labels = train_test_split(nic_code, nic_labels, test_size = TRAIN_TEST_SIZE, shuffle = True, stratify = nic_labels);

train_codes = mop_train_codes + aop_train_codes + nic_train_codes
temp_codes = mop_temp_codes +  aop_temp_codes + nic_temp_codes
train_labels = mop_train_labels + aop_train_labels + nic_train_labels
temp_labels = mop_temp_labels + aop_temp_labels + nic_temp_labels

test_codes, val_codes, test_labels, val_labels = train_test_split(temp_codes, temp_labels, test_size = VAL_TEST_SIZE, shuffle = True, stratify = temp_labels );





# Model Evaluations Metrics



## Metrics Calculation

Let's now import the functions that calculate the metrics of our model, through which we will be able to understand how efficient it is in determining whether an instance of code to be analyzed belongs to a certain class or to another.
The Metrics used to evaluate our model are:

- **Accuracy**: It gives you the overall accuracy of the model, meaning the fraction of the total samples that were correctly classified by the classifier. To calculate accuracy, use the following formula: (TP+TN)/(TP+TN+FP+FN).
- **Precision**: It tells you what fraction of predictions as a positive class were actually positive. To calculate precision, use the following formula: TP/(TP+FP).
- **Recall**: It tells you what fraction of all positive samples were correctly predicted as positive by the classifier. It is also known as True Positive Rate (TPR), Sensitivity, Probability of Detection. To calculate Recall, use the following formula: TP/(TP+FN).
- **Specificity**: It tells you what fraction of all negative samples are correctly predicted as negative by the classifier. It is also known as True Negative Rate (TNR). To calculate specificity, use the following formula: TN/(TN+FP).
- **F1 Score**: It combines precision and recall into a single measure. Mathematically it’s the harmonic mean of precision and recall.

installiamo la libreria **evaluate** di hugging face



In [9]:
!pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.4/81.4 kB[0m [31m5.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m474.6/474.6 kB[0m [31m21.9 MB/s[0m eta [36m0:00:00[0m
Collecting dill (from evaluate)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m110.5/110.5 kB[0m [31m14.7 MB/s[0m eta [36m0:00:00[0m
Collecting xxhash (from evaluate)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m212.5/212.5 kB[0m [31m25.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting multiproce

  for line in open(toplevel):
  for line in open(toplevel):
  for line in open(toplevel):
  for line in open(toplevel):
  for line in open(toplevel):
  for line in open(toplevel):
  for line in open(toplevel):
  for line in open(toplevel):
  for line in open(toplevel):
  for line in open(toplevel):
  for line in open(toplevel):
  for line in open(toplevel):


We import the metrics with an avarage of type "macro".
This step is very important as the evaluation metrics of this library are default for binary classification problems, in our case we need to set them in such a way that we have some overall evaluation metrics to evaluate the efficiency for a classification model multiclass, and this is done by setting the average to "macro"

In [20]:
import evaluate

accuracy = evaluate.load("accuracy", average="macro")
precision = evaluate.load("precision", average="macro")
f1 = evaluate.load("f1", average="macro")
recall = evaluate.load("recall", average="macro")
mcc = evaluate.load("matthews_correlation", avarage="macro")

# Model Inizialization

## Roberta Tokenizer

We download the HuggingFace tokenization module which selects the tokenizer of the ML model **Roberta**, as it is the model on which CodeBERT is based.
It splits text into tokens using the WordPiece algorithm, which breaks words into smaller, more common parts (subwords), so you can handle unknown or infrequent words during training.

In [11]:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('microsoft/codebert-base', do_lower_case = True);

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/150 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/498 [00:00<?, ?B/s]

## Roberta for Sequesnce Classification

We download the module through which it is possible to select the Machine Learning models that are based on the **Roberta** model, as in our case for codeBERT.

With this specific module the model is loaded by adding a linear level above the **pooled output**, which is nothing more than an attention head for sequence classification/regression.
CodeBERT is in fact a model that allows you to perform feature extraction operations, while our model must be able to perform code classification operations, hence the need to train it to perform an operation different from its base, and therefore to add a new attention head.

In [12]:
from transformers import RobertaForSequenceClassification
# Quando esegui per la prima volta (nello stesso runtime) li deve scaricare.
# Dopo aver eseguito returna questo warning:
# ---
# Some weights of the model checkpoint at microsoft/codebert-base were not used when initializing RobertaForSequenceClassification: ['pooler.dense.bias', 'pooler.dense.weight']
# This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
# ---
# Noi stiamo facendo esattamente quello che dice quindi va bene.
tokenizer = RobertaTokenizer.from_pretrained('microsoft/codebert-base', do_lower_case = True);

id2label = {0: "mop", 2: "clr", 1: "aop", 3: "nic"}
label2id = {"mop": 0, "clr": 2, "aop": 1, "nic": 3}

# Attenzione al parametro num_labels in base a quante label si devono classificare (4 = aop, clr, mop, nic)
model = RobertaForSequenceClassification.from_pretrained('microsoft/codebert-base', num_labels = 4, id2label=id2label, label2id=label2id)

Downloading pytorch_model.bin:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of the model checkpoint at microsoft/codebert-base were not used when initializing RobertaForSequenceClassification: ['pooler.dense.bias', 'pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be 

In [13]:
def preprocessing(input_text, tokenizer):
  '''
  Returns <class transformers.tokenization_utils_base.BatchEncoding> with the following fields:
    - input_ids: list of token ids
    - token_type_ids: list of token type ids
    - attention_mask: list of indices (0,1) specifying which tokens should considered by the model (return_attention_mask = True).
  '''
  return tokenizer.encode_plus(
                        input_text,
                        add_special_tokens = True,
                        max_length = 90,
                        pad_to_max_length = True,
                        return_attention_mask = True,
                        return_tensors = 'pt'
                   )

def preprocessing_batch(data_set):
    token_id = []
    attention_masks = []
    for sample in data_set:
      encoding_dict = preprocessing(sample, tokenizer)
      token_id.append(encoding_dict['input_ids']) 
      attention_masks.append(encoding_dict['attention_mask'])
    token_id = torch.cat(token_id, dim = 0)
    attention_masks = torch.cat(attention_masks, dim = 0)
    return token_id,attention_masks;

train_token_id,train_attention_masks = preprocessing_batch(train_codes);
test_token_id,test_attention_masks = preprocessing_batch(test_codes);
val_token_id,val_attention_masks = preprocessing_batch(val_codes);

def print_rand_sentence_encoding(text, token_id):
  '''Displays tokens, token IDs and attention mask of a random text sample'''
  index = random.randint(0, len(text) - 1)
  tokens = tokenizer.tokenize(tokenizer.decode(token_id[index]))
  token_ids = [i.numpy() for i in token_id[index]]
  print(tokens);
  table = np.array([tokens, token_ids]).T
  print(tabulate(table, 
                 headers = ['Tokens', 'Token IDs'],
                 tablefmt = 'fancy_grid'))


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


We take the labels of our dataset that we have divided into different samples for training, testing and validation and transform them into Tensors

In [14]:
train_labels = torch.tensor(train_labels)
test_labels = torch.tensor(test_labels)
val_labels = torch.tensor(val_labels)

In [15]:
batch_size = 16

train_set = TensorDataset(train_token_id, 
                          train_attention_masks, 
                          train_labels)

val_set = TensorDataset(val_token_id, 
                        val_attention_masks, 
                        val_labels)

test_set = TensorDataset(test_token_id, 
                        test_attention_masks, 
                        test_labels)

train_dataloader = DataLoader(
            train_set,
            sampler = RandomSampler(train_set),
            batch_size = batch_size
        )

validation_dataloader = DataLoader(
            val_set,
            sampler = SequentialSampler(val_set),
            batch_size = batch_size
        )


test_dataloader = DataLoader(
            test_set,
            sampler = SequentialSampler(test_set),
            batch_size = batch_size
        )

# Optimization

Let's optimize the parameters of our model in order to obtain the best possible result for the fine tuning, specifically let's modify:
1. **Learning Rate (lr)**: a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated. In our case it was set to 5e-5
2. **Epsilon**: used as a guard against a by zero division. In our case it was set to 1e-08
3. **Weight Decay**: a regularization technique by adding a small penalty, usually the L2 norm of the weights (all the weights of the model), to the loss function. In our case it was set to 0.01

In [16]:
optimizer = torch.optim.AdamW(model.parameters(), 
                              lr = 5e-5, # LEARNING RATE DELL'ALGORITMO OTTIMIZZATORE (2e-5 = 2*10^-5 = 0.00005)
                              eps = 1e-08,
                              weight_decay = 0.01
                              )
model.cuda(); # Eseguire per impostare il modello in modo da usare la GPU durante training

# Training

The training of the model is carried out in this block, it has been set with a batch size of 16 and 8 epochs, for each batch of each epoch the metrics calculated for that training are printed on the screen.
At the end of the Training we will have a model developed to carry out classification operations, and capable of recognizing the linguistic antipatterns within the python code described above.

In [23]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

epochs = 8

for _ in trange (epochs, desc = 'Epoch'):
  model.train()
  tr_loss = 0
  nb_tr_examples, nb_tr_steps = 0, 0

  for step, batch in enumerate(train_dataloader):
      batch = tuple(t.to(device) for t in batch)
      b_input_ids, b_input_mask, b_labels = batch
      #reset gradient value for the new epoch
      optimizer.zero_grad()
      # Forward pass
      train_output = model(b_input_ids, 
                            token_type_ids = None, 
                            attention_mask = b_input_mask, 
                            labels = b_labels)
      # Backward pass
      train_output.loss.backward()
      optimizer.step()
      # Update tracking variables
      tr_loss += train_output.loss.item()
      nb_tr_examples += b_input_ids.size(0)
      nb_tr_steps += 1

       # ========== Validation ==========

      # Set model to evaluation mode
      model.eval()

      # Tracking variables 
      val_accuracy = []
      val_precision = []
      val_recall = []
      val_specificity = []

      latest_acc = 0.0000;

      for batch in validation_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        with torch.no_grad():
          # Forward pass
          eval_output = model(b_input_ids, 
                              token_type_ids = None, 
                              attention_mask = b_input_mask)
        logits = eval_output.logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        # Calculate validation metrics for the entire model
        predictions = np.argmax(logits, axis=1)
        print('\n\t - Train loss: {:.5f}'.format(tr_loss / nb_tr_steps))
        print('\t - Accuracy: {:.4f}'.format(accuracy.compute(predictions=predictions, references=label_ids)['accuracy']))
        print('\t - F1 Score: {:.4f}'.format(f1.compute(predictions=predictions, references=label_ids, average="macro")['f1']))
        print('\t - Precision: {:.4f}'.format(precision.compute(predictions=predictions, references=label_ids, average="macro", zero_division=0)['precision']))
        print('\t - Recall: {:.4f}'.format(recall.compute(predictions=predictions, references=label_ids, average="macro", zero_division=0)['recall']))
        print('\t - Recall: {:.4f}'.format(mcc.compute(predictions=predictions, references=label_ids, average="macro", zero_division=0)['matthews_correlation']))


PATH = './greet'
torch.save(model, PATH)

Epoch:   0%|          | 0/8 [00:00<?, ?it/s]


	 - Train loss: 0.0838
	 - Accuracy: 0.687500
	 - F1 Score: 0.735994
	 - Precision: 0.750000
	 - Recall: 0.729167

	 - Train loss: 0.0838
	 - Accuracy: 0.625000
	 - F1 Score: 0.614583
	 - Precision: 0.566667
	 - Recall: 0.812500

	 - Train loss: 0.0838
	 - Accuracy: 0.500000
	 - F1 Score: 0.523810
	 - Precision: 0.533333
	 - Recall: 0.666667

	 - Train loss: 0.0838
	 - Accuracy: 0.875000
	 - F1 Score: 0.843750
	 - Precision: 0.944444
	 - Recall: 0.833333

	 - Train loss: 0.0838
	 - Accuracy: 0.812500
	 - F1 Score: 0.781593
	 - Precision: 0.770833
	 - Recall: 0.803571

	 - Train loss: 0.0838
	 - Accuracy: 0.666667
	 - F1 Score: 0.559524
	 - Precision: 0.583333
	 - Recall: 0.541667

	 - Train loss: 0.0786
	 - Accuracy: 0.687500
	 - F1 Score: 0.735994
	 - Precision: 0.750000
	 - Recall: 0.729167

	 - Train loss: 0.0786
	 - Accuracy: 0.687500
	 - F1 Score: 0.718137
	 - Precision: 0.697619
	 - Recall: 0.837500

	 - Train loss: 0.0786
	 - Accuracy: 0.500000
	 - F1 Score: 0.523810
	 - Precis

Epoch:  12%|█▎        | 1/8 [00:31<03:40, 31.55s/it]

	 - Recall: 0.928571

	 - Train loss: 0.0748
	 - Accuracy: 0.666667
	 - F1 Score: 0.559524
	 - Precision: 0.583333
	 - Recall: 0.541667

	 - Train loss: 0.1096
	 - Accuracy: 0.687500
	 - F1 Score: 0.735994
	 - Precision: 0.750000
	 - Recall: 0.729167

	 - Train loss: 0.1096
	 - Accuracy: 0.687500
	 - F1 Score: 0.676471
	 - Precision: 0.614286
	 - Recall: 0.837500

	 - Train loss: 0.1096
	 - Accuracy: 0.500000
	 - F1 Score: 0.523810
	 - Precision: 0.533333
	 - Recall: 0.666667

	 - Train loss: 0.1096
	 - Accuracy: 0.875000
	 - F1 Score: 0.843750
	 - Precision: 0.944444
	 - Recall: 0.833333

	 - Train loss: 0.1096
	 - Accuracy: 0.875000
	 - F1 Score: 0.872619
	 - Precision: 0.854167
	 - Recall: 0.928571

	 - Train loss: 0.1096
	 - Accuracy: 0.666667
	 - F1 Score: 0.559524
	 - Precision: 0.583333
	 - Recall: 0.541667

	 - Train loss: 0.0895
	 - Accuracy: 0.625000
	 - F1 Score: 0.659524
	 - Precision: 0.691667
	 - Recall: 0.645833

	 - Train loss: 0.0895
	 - Accuracy: 0.625000
	 - F1 Score

Epoch:  25%|██▌       | 2/8 [01:05<03:18, 33.16s/it]


	 - Train loss: 0.0669
	 - Accuracy: 0.625000
	 - F1 Score: 0.648810
	 - Precision: 0.687500
	 - Recall: 0.732143

	 - Train loss: 0.0669
	 - Accuracy: 0.777778
	 - F1 Score: 0.637500
	 - Precision: 0.687500
	 - Recall: 0.604167

	 - Train loss: 0.1065
	 - Accuracy: 0.687500
	 - F1 Score: 0.700877
	 - Precision: 0.825758
	 - Recall: 0.697917

	 - Train loss: 0.1065
	 - Accuracy: 0.625000
	 - F1 Score: 0.500000
	 - Precision: 0.462500
	 - Recall: 0.587500

	 - Train loss: 0.1065
	 - Accuracy: 0.562500
	 - F1 Score: 0.591575
	 - Precision: 0.612500
	 - Recall: 0.750000

	 - Train loss: 0.1065
	 - Accuracy: 0.812500
	 - F1 Score: 0.788889
	 - Precision: 0.812500
	 - Recall: 0.830952

	 - Train loss: 0.1065
	 - Accuracy: 0.625000
	 - F1 Score: 0.648810
	 - Precision: 0.687500
	 - Recall: 0.732143

	 - Train loss: 0.1065
	 - Accuracy: 0.777778
	 - F1 Score: 0.637500
	 - Precision: 0.687500
	 - Recall: 0.604167

	 - Train loss: 0.0600
	 - Accuracy: 0.625000
	 - F1 Score: 0.625000
	 - Precis

Epoch:  38%|███▊      | 3/8 [01:37<02:43, 32.64s/it]


	 - Recall: 0.767857

	 - Train loss: 0.0505
	 - Accuracy: 0.777778
	 - F1 Score: 0.637500
	 - Precision: 0.687500
	 - Recall: 0.604167

	 - Train loss: 0.0472
	 - Accuracy: 0.625000
	 - F1 Score: 0.625000
	 - Precision: 0.770833
	 - Recall: 0.614583

	 - Train loss: 0.0472
	 - Accuracy: 0.625000
	 - F1 Score: 0.500000
	 - Precision: 0.462500
	 - Recall: 0.587500

	 - Train loss: 0.0472
	 - Accuracy: 0.562500
	 - F1 Score: 0.591575
	 - Precision: 0.612500
	 - Recall: 0.750000

	 - Train loss: 0.0472
	 - Accuracy: 0.812500
	 - F1 Score: 0.803105
	 - Precision: 0.925000
	 - Recall: 0.783333

	 - Train loss: 0.0472
	 - Accuracy: 0.750000
	 - F1 Score: 0.759524
	 - Precision: 0.782738
	 - Recall: 0.803571

	 - Train loss: 0.0472
	 - Accuracy: 0.777778
	 - F1 Score: 0.637500
	 - Precision: 0.687500
	 - Recall: 0.604167

	 - Train loss: 0.0267
	 - Accuracy: 0.625000
	 - F1 Score: 0.625000
	 - Precision: 0.770833
	 - Recall: 0.614583

	 - Train loss: 0.0267
	 - Accuracy: 0.625000
	 - F1 Scor

Epoch:  50%|█████     | 4/8 [02:08<02:07, 31.79s/it]


	 - Train loss: 0.0389
	 - Accuracy: 0.750000
	 - F1 Score: 0.759524
	 - Precision: 0.782738
	 - Recall: 0.803571

	 - Train loss: 0.0389
	 - Accuracy: 0.777778
	 - F1 Score: 0.637500
	 - Precision: 0.687500
	 - Recall: 0.604167

	 - Train loss: 0.2058
	 - Accuracy: 0.625000
	 - F1 Score: 0.625000
	 - Precision: 0.770833
	 - Recall: 0.614583

	 - Train loss: 0.2058
	 - Accuracy: 0.625000
	 - F1 Score: 0.500000
	 - Precision: 0.462500
	 - Recall: 0.587500

	 - Train loss: 0.2058
	 - Accuracy: 0.625000
	 - F1 Score: 0.635714
	 - Precision: 0.637500
	 - Recall: 0.777778

	 - Train loss: 0.2058
	 - Accuracy: 0.812500
	 - F1 Score: 0.803105
	 - Precision: 0.925000
	 - Recall: 0.783333

	 - Train loss: 0.2058
	 - Accuracy: 0.750000
	 - F1 Score: 0.759524
	 - Precision: 0.782738
	 - Recall: 0.803571

	 - Train loss: 0.2058
	 - Accuracy: 0.777778
	 - F1 Score: 0.637500
	 - Precision: 0.687500
	 - Recall: 0.604167

	 - Train loss: 0.1067
	 - Accuracy: 0.625000
	 - F1 Score: 0.625000
	 - Precis

Epoch:  62%|██████▎   | 5/8 [02:40<01:35, 31.89s/it]

	 - Recall: 0.803571

	 - Train loss: 0.0376
	 - Accuracy: 0.777778
	 - F1 Score: 0.637500
	 - Precision: 0.687500
	 - Recall: 0.604167

	 - Train loss: 0.0382
	 - Accuracy: 0.625000
	 - F1 Score: 0.625000
	 - Precision: 0.770833
	 - Recall: 0.614583

	 - Train loss: 0.0382
	 - Accuracy: 0.625000
	 - F1 Score: 0.500000
	 - Precision: 0.462500
	 - Recall: 0.587500

	 - Train loss: 0.0382
	 - Accuracy: 0.750000
	 - F1 Score: 0.747024
	 - Precision: 0.714286
	 - Recall: 0.833333

	 - Train loss: 0.0382
	 - Accuracy: 0.812500
	 - F1 Score: 0.803105
	 - Precision: 0.925000
	 - Recall: 0.783333

	 - Train loss: 0.0382
	 - Accuracy: 0.750000
	 - F1 Score: 0.759524
	 - Precision: 0.782738
	 - Recall: 0.803571

	 - Train loss: 0.0382
	 - Accuracy: 0.777778
	 - F1 Score: 0.637500
	 - Precision: 0.687500
	 - Recall: 0.604167

	 - Train loss: 0.0216
	 - Accuracy: 0.625000
	 - F1 Score: 0.625000
	 - Precision: 0.770833
	 - Recall: 0.614583

	 - Train loss: 0.0216
	 - Accuracy: 0.625000
	 - F1 Score

Epoch:  75%|███████▌  | 6/8 [03:16<01:06, 33.31s/it]


	 - Train loss: 0.0401
	 - Accuracy: 0.750000
	 - F1 Score: 0.759524
	 - Precision: 0.782738
	 - Recall: 0.803571

	 - Train loss: 0.0401
	 - Accuracy: 0.777778
	 - F1 Score: 0.637500
	 - Precision: 0.687500
	 - Recall: 0.604167

	 - Train loss: 0.0979
	 - Accuracy: 0.625000
	 - F1 Score: 0.625000
	 - Precision: 0.770833
	 - Recall: 0.614583

	 - Train loss: 0.0979
	 - Accuracy: 0.625000
	 - F1 Score: 0.500000
	 - Precision: 0.462500
	 - Recall: 0.587500

	 - Train loss: 0.0979
	 - Accuracy: 0.562500
	 - F1 Score: 0.591575
	 - Precision: 0.612500
	 - Recall: 0.750000

	 - Train loss: 0.0979
	 - Accuracy: 0.812500
	 - F1 Score: 0.803105
	 - Precision: 0.925000
	 - Recall: 0.783333

	 - Train loss: 0.0979
	 - Accuracy: 0.750000
	 - F1 Score: 0.759524
	 - Precision: 0.782738
	 - Recall: 0.803571

	 - Train loss: 0.0979
	 - Accuracy: 0.777778
	 - F1 Score: 0.637500
	 - Precision: 0.687500
	 - Recall: 0.604167

	 - Train loss: 0.0783
	 - Accuracy: 0.625000
	 - F1 Score: 0.625000
	 - Precis

Epoch:  88%|████████▊ | 7/8 [03:48<00:32, 32.96s/it]

	 - Recall: 0.803571

	 - Train loss: 0.0340
	 - Accuracy: 0.777778
	 - F1 Score: 0.637500
	 - Precision: 0.687500
	 - Recall: 0.604167

	 - Train loss: 0.0724
	 - Accuracy: 0.625000
	 - F1 Score: 0.625000
	 - Precision: 0.770833
	 - Recall: 0.614583

	 - Train loss: 0.0724
	 - Accuracy: 0.625000
	 - F1 Score: 0.500000
	 - Precision: 0.462500
	 - Recall: 0.587500

	 - Train loss: 0.0724
	 - Accuracy: 0.562500
	 - F1 Score: 0.591575
	 - Precision: 0.612500
	 - Recall: 0.750000

	 - Train loss: 0.0724
	 - Accuracy: 0.812500
	 - F1 Score: 0.803105
	 - Precision: 0.925000
	 - Recall: 0.783333

	 - Train loss: 0.0724
	 - Accuracy: 0.750000
	 - F1 Score: 0.759524
	 - Precision: 0.782738
	 - Recall: 0.803571

	 - Train loss: 0.0724
	 - Accuracy: 0.777778
	 - F1 Score: 0.637500
	 - Precision: 0.687500
	 - Recall: 0.604167

	 - Train loss: 0.0610
	 - Accuracy: 0.625000
	 - F1 Score: 0.625000
	 - Precision: 0.770833
	 - Recall: 0.614583

	 - Train loss: 0.0610
	 - Accuracy: 0.625000
	 - F1 Score

Epoch: 100%|██████████| 8/8 [04:21<00:00, 32.69s/it]


	 - Train loss: 0.0307
	 - Accuracy: 0.750000
	 - F1 Score: 0.759524
	 - Precision: 0.782738
	 - Recall: 0.803571

	 - Train loss: 0.0307
	 - Accuracy: 0.777778
	 - F1 Score: 0.637500
	 - Precision: 0.687500
	 - Recall: 0.604167





# Load the Fine Tuned Model

In [21]:
# load the model saved
PATH = './greet';
model = torch.load(PATH)

# Model Testing

Now let's go test our model on the portion of the Dataset that has been specially split for this purpose, in order to train our model on data it has never seen, again here we are going to print the model metrics for testing

In [22]:
for batch in test_dataloader:
    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_input_mask, b_labels = batch

    eval_output = model(b_input_ids, 
                        token_type_ids = None, 
                        attention_mask = b_input_mask)
    logits = eval_output.logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()
    # Calculate validation metrics for each class
    b_accuracy_mop, b_precision_mop, b_recall_mop, b_specificity_mop = b_metrics(logits, label_ids, 0)
    b_accuracy_aop, b_precision_aop, b_recall_aop, b_specificity_aop = b_metrics(logits, label_ids, 1)
    b_accuracy_clr, b_precision_clr, b_recall_clr, b_specificity_clr = b_metrics(logits, label_ids, 2)
    b_accuracy_nic, b_precision_nic, b_recall_nic, b_specificity_nic = b_metrics(logits, label_ids, 3)
    # for the entire model
    test_accuracy, test_precision, test_recall, test_specificity = total_metrics(logits, label_ids)

    test_predictions = np.argmax(logits, axis=1)
    print('\n\t - Train loss: {:.4f}'.format(tr_loss / nb_tr_steps))
    print('\t - Accuracy: {:4f}'.format(accuracy.compute(predictions=test_predictions, references=label_ids)['accuracy']))
    print('\t - F1 Score: {:4f}'.format(f1.compute(predictions=test_predictions, references=label_ids, average="macro")['f1']))
    print('\t - Precision: {:4f}'.format(precision.compute(predictions=test_predictions, references=label_ids, average="macro")['precision']))
    print('\t - Recall: {:4f}'.format(recall.compute(predictions=test_predictions, references=label_ids, average="macro")['recall']))


	 - Validation Accuracy: 0.7500
	 - Validation Precision: 1.0000
	 - Validation Recall: 0.5000
	 - Validation Specificity: 1.0000


	 - Validation Accuracy: 0.7500
	 - Validation Precision: 1.0000
	 - Validation Recall: 0.5000
	 - Validation Specificity: 1.0000


	 - Validation Accuracy: 0.7500
	 - Validation Precision: 1.0000
	 - Validation Recall: 0.3333
	 - Validation Specificity: 1.0000


	 - Validation Accuracy: 0.7500
	 - Validation Precision: 0.6000
	 - Validation Recall: 0.6000
	 - Validation Specificity: 0.6000



	 - Train loss: 0.3010
	 - Validation Accuracy: 0.8571
	 - Validation Precision: 0.7500
	 - Validation Recall: 0.7500
	 - Validation Specificity: 0.9000


	 - Validation Accuracy: 0.6875
	 - Validation Precision: nan
	 - Validation Recall: 0.0000
	 - Validation Specificity: 1.0000


	 - Validation Accuracy: 0.6875
	 - Validation Precision: 1.0000
	 - Validation Recall: 0.4286
	 - Validation Specificity: 1.0000


	 - Validation Accuracy: 0.6875
	 - Validation Precisi

  b_precision = tp / (tp + fp) #if (tp + fp) > 0 else 'nan'


	 - Validation Accuracy: 0.6250
	 - Validation Precision: nan
	 - Validation Recall: 0.0000
	 - Validation Specificity: 1.0000


	 - Validation Accuracy: 0.6250
	 - Validation Precision: 0.0000
	 - Validation Recall: 0.0000
	 - Validation Specificity: 0.9091


	 - Validation Accuracy: 0.6250
	 - Validation Precision: 1.0000
	 - Validation Recall: 0.2500
	 - Validation Specificity: 1.0000


	 - Validation Accuracy: 0.6250
	 - Validation Precision: 0.6154
	 - Validation Recall: 0.6154
	 - Validation Specificity: 0.2857



	 - Train loss: 0.3010
	 - Validation Accuracy: 0.7692
	 - Validation Precision: 0.6250
	 - Validation Recall: 0.6250
	 - Validation Specificity: 0.8333


	 - Validation Accuracy: 0.5000
	 - Validation Precision: 0.0000
	 - Validation Recall: 0.0000
	 - Validation Specificity: 0.8889


	 - Validation Accuracy: 0.5000
	 - Validation Precision: nan
	 - Validation Recall: 0.0000
	 - Validation Specificity: 1.0000


	 - Validation Accuracy: 0.5000
	 - Validation Precision: 

We rerun the testing on the same portion of data as before, but this time the data on which the model has made the prediction are printed on the screen, plus the prediction and the oracle for each data, so as to be able to verify on which portions of code the model has more problems and improve the dataset, so as to improve the performance of the model as well

In [23]:
correct = 0
wrong = 0

for index, test in enumerate(test_codes):
  encoding = preprocessing(test, tokenizer)
  predict_ids = []
  predict_attention_mask = []
  # Extract IDs and Attention Mask
  predict_ids.append(encoding['input_ids'])
  predict_attention_mask.append(encoding['attention_mask'])
  predict_ids = torch.cat(predict_ids, dim = 0)
  predict_attention_mask = torch.cat(predict_attention_mask, dim = 0)

  # Forward pass, calculate logit predictions
  with torch.no_grad():
    output = model(predict_ids.to(device), token_type_ids = None, attention_mask = predict_attention_mask.to(device))
  # print("0 = method opposite comment; 1 = attribute opposite comment; 2 = clear")
  print(output.logits.softmax(dim=-1).tolist())
  prediction = np.argmax(output.logits.cpu().numpy()).flatten().item()
  print(test)
  if prediction == 0:
    print('predicted: mop');
  elif prediction == 1:
    print('predicted: aop');
  elif prediction == 2:
    print('predicted: clr');
  elif prediction == 3:
    print('predicted: nic');


  oracle = test_labels.numpy()[index]
  if oracle == 0:
    print('oracle: mop');
  elif oracle == 1:
    print('oracle: aop');
  elif oracle == 2:
    print('oracle: clr');
  elif oracle == 3:
    print('oracle: nic');

  if prediction == oracle:
    print("PASS")
    correct += 1
  else:
    print("FAULT")
    wrong += 1
  
  print("=================================================================================================================")

print("correct predictions: " + str(correct))
print("wrong predictions: " + str(wrong))




[[0.0021038935519754887, 0.002897729864344001, 0.011442724615335464, 0.9835557341575623]]
# this function decrements the quantity for a given product if the quantity is greater than zero, otherwise it is no longer decremented
def decrease_quantity(self, amount):
    self.quantity -= amount
predicted: nic
oracle: nic
PASS
[[0.09834358096122742, 0.0018036579713225365, 0.8971239924430847, 0.0027288447599858046]]
def load_tf_weights_in_bert(model, config, tf_checkpoint_path):
	"""
		Load tf checkpoints in a pytorch model.
	"""
predicted: clr
oracle: clr
PASS
[[0.041288867592811584, 0.0010227753082290292, 0.9526039958000183, 0.0050842720083892345]]
def print_even_length_words(s):
	"""
		Write a python program to print only even length words in a sentence
	"""
predicted: clr
oracle: clr
PASS
[[0.05458243936300278, 0.0015642119105905294, 0.9384941458702087, 0.005359247326850891]]
def shuffle_array(arr):
	"""
		takes an array of numbers as a parameter and sorts it
	"""
predicted: cl

# Model Expalinability

In [24]:
!pip install shap

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting shap
  Downloading shap-0.41.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (572 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m572.6/572.6 kB[0m [31m19.3 MB/s[0m eta [36m0:00:00[0m
Collecting slicer==0.0.7 (from shap)
  Downloading slicer-0.0.7-py3-none-any.whl (14 kB)
Installing collected packages: slicer, shap
Successfully installed shap-0.41.0 slicer-0.0.7


  for line in open(toplevel):
  for line in open(toplevel):


In [49]:
import shap



AttributeError: ignored