<a href="https://colab.research.google.com/github/shaw0155/AI-models/blob/main/Task2_3_Multiclass_classification_of_NFR_subclasses_edited.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multiclass Classification of Non-Functional Requirement Subclasses on the Promise NFR Dataset

This notebook includes all code needed to train and evaluate a multiclass classifier for predicting a number of NFR subclasses on the Promise NFR dataset.

Note: some cells are hidden and only the title is shown. To display the code, double-click the cell to switch the display mode.


## Prepare
Install the required libraries and import them.

In [1]:
#@title Install needed libraries {display-mode: "form"}
!pip uninstall numpy
!pip install numpy==1.23.5
!pip install fastai==1.0.61 fastcore==1.3.29 fastprogress==1.0.3 pytorch-transformers==1.2.0 sklearn==0.0 spacy==3.6.1

Found existing installation: numpy 1.23.5
Uninstalling numpy-1.23.5:
  Would remove:
    /usr/local/bin/f2py
    /usr/local/bin/f2py3
    /usr/local/bin/f2py3.10
    /usr/local/lib/python3.10/dist-packages/numpy-1.23.5.dist-info/*
    /usr/local/lib/python3.10/dist-packages/numpy.libs/libgfortran-040039e1.so.5.0.0
    /usr/local/lib/python3.10/dist-packages/numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so
    /usr/local/lib/python3.10/dist-packages/numpy.libs/libquadmath-96973f99.so.0.0.0
    /usr/local/lib/python3.10/dist-packages/numpy/*
Proceed (Y/n)? y
  Successfully uninstalled numpy-1.23.5
Collecting numpy==1.23.5
  Using cached numpy-1.23.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.3 kB)
Using cached numpy-1.23.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Installing collected packages: numpy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the sou



In [2]:
#@title Import python packages
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import os

from fastai import *
from fastai.text import *
from fastai.callbacks import *
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import classification_report, precision_recall_fscore_support
from sklearn.utils.multiclass import unique_labels

from pytorch_transformers import BertTokenizer, BertPreTrainedModel, BertModel, BertConfig
from pytorch_transformers import AdamW

from fastprogress import master_bar, progress_bar
from datetime import datetime

In [3]:
#@title Check whether a GPU is available and which kind is used
def get_memory_usage():
    return torch.cuda.memory_allocated(device)/1000000

def get_memory_usage_str():
    return 'Memory usage: {:.2f} MB'.format(get_memory_usage())

cuda_available = torch.cuda.is_available()
if cuda_available:
    curr_device = torch.cuda.current_device()
    print(torch.cuda.get_device_name(curr_device))
device = torch.device("cuda" if cuda_available else "cpu")
device

Tesla T4


device(type='cuda')

### Define configuration used in this experiment run

Create config and set hyperparameters.
One can configure:


*   BERT model to use (model_name)
*   Learning Rate to use (max_lr)
*   Momentum (moms)
*   Epoch number for training (epochs)
*   Batch size for training (bs)
*   Weight decay for training (weight_decay)
*   Maximal sequence length (max_seq_len)
*   Train size used for both test/train and train/validation split (train_size)
*   Loss function used for training (loss_func)
*   The random seed used for shuffling, sampling and splitting (seed)
*   Whether or not to use early stopping (es)
*   The minimal improvement of the monitored metric that still counts as progress (min_delta)
*   The number of epochs without such an improvement after which training stops early (patience)
*   The way of folding used for this experiment (either test/train split (No), ten-fold cross validation (TenFold), or project specific folding (ProjFold))
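
How `min_delta` and `patience` interact can be sketched in plain Python. This is a simplified illustration, not fastai's `EarlyStoppingCallback` implementation; the function name `early_stop_epoch` and the accuracy values are made up for the example.

```python
def early_stop_epoch(accuracies, min_delta=0.01, patience=3):
    """Return the index of the epoch at which training would stop, or None."""
    best = float("-inf")
    waited = 0
    for epoch, acc in enumerate(accuracies):
        if acc - min_delta > best:
            # improvement larger than min_delta: remember it and reset the counter
            best = acc
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None

# accuracy stalls around 0.60 for three epochs, so training stops at epoch 4
print(early_stop_epoch([0.50, 0.60, 0.602, 0.601, 0.603, 0.70]))
```

With steadily improving accuracies the function returns `None`, meaning all configured epochs would run.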


Furthermore, one can configure where to get the dataset from and where to save the log, result, and model files.
By setting the `classes` array, one can decide which NFR subclasses the classifier distinguishes; requirements of any other class are labeled `Other`.
Two booleans are provided to decide whether to
1. load data from Google Drive or download data from zenodo and to
2. save the model file.
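
The configuration object used in this notebook is a dictionary whose entries are also reachable as attributes. A minimal, self-contained sketch of that pattern follows; the class name `DemoConfig` and the values are illustrative only.

```python
class DemoConfig(dict):
    """Dictionary whose keys can also be read as attributes."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        for key, value in kwargs.items():
            setattr(self, key, value)

    def set(self, key, value):
        # keep the dict entry and the attribute in sync
        self[key] = value
        setattr(self, key, value)

demo = DemoConfig(epochs=16, bs=16)
demo.set("epochs", 10)
print(demo.epochs, demo["epochs"])
```

Assigning `demo["epochs"] = 5` directly would update only the dictionary entry and leave the attribute unchanged, which is why changes should go through `set`.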



In [4]:
class Config(dict):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        for k, v in kwargs.items():
            setattr(self, k, v)

    def set(self, key, val):
        self[key] = val
        setattr(self, key, val)

class Fold(Enum):
  No = 1
  TenFold = 2
  ProjFold = 3

config = Config(
    num_labels = 10, # will be set automatically afterwards
    model_name="bert-base-cased", # alternatives: bert-base-uncased, bert-large-cased, bert-large-uncased
    max_lr=2e-5, # default: 2e-5
    moms=(0.8, 0.7), # default: (0.8, 0.7); alt.(0.95, 0.85)
    epochs=16, # 10, 16, 32, 50
    bs=16, # default: 16
    weight_decay = 0.01,
    max_seq_len=128, # 50, 128
    train_size=0.75, # 0.8
    loss_func=nn.CrossEntropyLoss(),
    seed=904727489, #default: 904727489, 42 (as in Dalpiaz) or None
    es = False, # True
    min_delta = 0.01,
    patience = 3,
    fold = Fold.No, # Fold.No, Fold.TenFold, Fold.ProjFold
)

clazz = 'clazz'

config_data = Config(
    root_folder = '.', # where is the root folder? Keep it that way if you want to load from Google Drive
    data_folder = '/', # where is the folder containing the datasets; relative to root
    train_data = ['promise_nfr.csv'], # dataset file to use
    label_column = clazz,
    log_folder_name = '/log/',
    log_file = clazz + '_' + Fold(config.fold).name + '_classifierPredictions_' + datetime.now().strftime('%Y%m%d-%H%M') + '.txt', # log-file name (make sure log folder exists)
    result_file = clazz + '_' + Fold(config.fold).name + '_classifierResults_' + datetime.now().strftime('%Y%m%d-%H%M') + '.txt', # result-file name (make sure log folder exists)
    model_path = '/models/', # where is the folder for the model(s); relative to the root
    model_name = 'NoRBERT.pkl', # what is the model name?
    gdrive_root_folder = '/content/drive/My Drive/Code/Task1_to_3_original_Promise_NFR_dataset/', # Set this to the Google Drive path. Starts with '/content/drive/' and then usually 'My Drive/*' for the files in your Drive

    orig_data_set_zip = 'https://zenodo.org/record/8347866/files/NoRBERT_RE20_Paper65.zip', # link to the data set (on zenodo). DO NOT CHANGE!
    orig_data_zip_name = 'NoRBERT_RE20_Paper65.zip', # DO NOT CHANGE
    orig_data_file_in_zip = 'Code/Task1_to_3_original_Promise_NFR_dataset/promise_nfr.csv', # DO NOT CHANGE

    # Project split to use, either p-fold (as in Dalpiaz) or loPo
    #project_fold = [[3, 9, 11], [1, 5, 12], [6, 10, 13], [1, 8, 14], [3, 12, 15], [2, 5, 11], [6, 9, 14], [7, 8, 13], [2, 4, 15], [4, 7, 10] ], # p-fold
    project_fold = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15] ], # loPo
    #classes = ['US', 'SE', 'O', 'PE'], # mostFrequent
    classes= ['A', 'FT', 'L', 'LF', 'MN', 'O', 'PE', 'PO', 'SC', 'SE', 'US'], # all
)

load_from_gdrive = False # True, if you want to use Google Drive; else, False
save_model = False # True, if you want to save the model file (make sure the model folder exists)


To import the dataset, first we have to either load the data set from zenodo (and unzip the needed file) or connect to our Google drive (if data should be loaded from gdrive). To connect to our Google drive, we have to authenticate the access and mount the drive.
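
Extracting a single file from an archive, as done for the zenodo download, can be tried out without network access. In the sketch below the archive is built in memory, and the member name `data/example.csv` and its contents are placeholders, not the real dataset.

```python
import io
import zipfile

# Build a small archive in memory (stand-in for the downloaded zip)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data/example.csv", "number;RequirementText\n1;The system shall respond quickly.\n")
    archive.writestr("data/unused.txt", "not needed")

# Read only the member we need, without unpacking the whole archive
with zipfile.ZipFile(buffer) as archive:
    csv_bytes = archive.read("data/example.csv")

csv_text = csv_bytes.decode("utf-8")
print(csv_text.splitlines()[0])
```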

In [5]:
#@title Prepare data loading: Init loading from Google Drive, if set in config above. Else, download the data set from zenodo (using wget) {display-mode: "form"}
if load_from_gdrive:
    from google.colab import drive
    # Connect to drive to load the corpus from there
    drive.mount('/content/drive', force_remount=True)
    config_data.root_folder = config_data.gdrive_root_folder
else:
    # If the file does not exist already, download the zip and extract the needed file
    data_path = config_data.root_folder + config_data.data_folder + config_data.train_data[0]
    data_file = Path(data_path)
    if not data_file.exists():
        !wget {config_data.orig_data_set_zip}
        import zipfile
        with zipfile.ZipFile(config_data.orig_data_zip_name) as z:
            with open(data_path, 'wb') as f:
                f.write(z.read(config_data.orig_data_file_in_zip))


In [6]:
#@title Define logging functions and seed generation {display-mode: "form"}
def initLog():
    logfolder = config_data.root_folder + config_data.log_folder_name

    if not os.path.isdir(logfolder):
      print("Log folder does not exist, trying to create folder.")
      try:
        os.mkdir(logfolder)
      except OSError:
        print ("Creation of the directory %s failed" % logfolder)
      else:
        print ("Successfully created the directory %s" % logfolder)
    logfile = logfolder + config_data.log_file
    log_txt = datetime.now().strftime('%Y-%m-%d %H:%M') + ' ' + get_info()
    with open(logfile, 'w') as log:
        log.write(log_txt + '\n')

def logLine(line):
    logfile = config_data.root_folder + config_data.log_folder_name  + config_data.log_file
    with open(logfile, 'a') as log:
        log.write(line + '\n')

def logResult(result):
    logfile = config_data.root_folder + config_data.log_folder_name + config_data.result_file
    with open(logfile, 'a') as log:
        log.write(get_info() + '\n')
        log.write(result + '\n')

def get_info():
    model_config = 'model: {}, max_lr: {}, epochs: {}, bs: {}, train_size: {}, weight decay: {},  Seed: {}, Data: {}, Column: {}, EarlyStopping: {}:{};pat:{}'.format(config.model_name, config.max_lr, config.epochs, config.bs, config.train_size, config.weight_decay, config.seed, config_data.train_data, config_data.label_column, config.es, config.min_delta, config.patience)
    return model_config

def set_seed(seed):
    if seed is None:
        seed = random.randint(0, 2**31)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    return seed

set_seed(config.seed)

904727489
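
Why fixing the seed matters can be shown with the standard library alone: seeding with the same value reproduces the same shuffle. The helper name `shuffled_ids` is hypothetical; the notebook itself additionally seeds NumPy and PyTorch.

```python
import random

def shuffled_ids(seed, n=10):
    """Return the numbers 0..n-1 shuffled with a dedicated, seeded generator."""
    rng = random.Random(seed)
    ids = list(range(n))
    rng.shuffle(ids)
    return ids

first = shuffled_ids(904727489)
second = shuffled_ids(904727489)
print(first == second)  # same seed, same order
```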

## Learner


In [7]:
#@title Create proper tokenizer for our data (adapting FastAiTokenizer to use BertTokenizer) {display-mode: "form"}
class FastAiBertTokenizer(BaseTokenizer):
    """Wrapper around BertTokenizer to be compatible with fast.ai"""
    def __init__(self, tokenizer: BertTokenizer, max_seq_len: int=512, **kwargs):
        self._pretrained_tokenizer = tokenizer
        self.max_seq_len = max_seq_len

    def __call__(self, *args, **kwargs):
        return self

    def tokenizer(self, t:str):
        """Limits the maximum sequence length. Prepend with [CLS] and append [SEP]"""
        return ["[CLS]"] + self._pretrained_tokenizer.tokenize(t)[:self.max_seq_len - 2] + ["[SEP]"]
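
The truncation rule of the wrapper above can be illustrated without downloading BERT. The sketch swaps the WordPiece tokenizer for a plain whitespace split (an assumption made only to keep the example self-contained) and shows that the two special tokens count against `max_seq_len`.

```python
def wrap_tokens(text, max_seq_len=8):
    """Truncate to max_seq_len - 2 tokens, then add [CLS] and [SEP]."""
    tokens = text.split()  # stand-in for BertTokenizer.tokenize
    return ["[CLS]"] + tokens[:max_seq_len - 2] + ["[SEP]"]

wrapped = wrap_tokens("The product shall be available 24 hours a day", max_seq_len=8)
print(wrapped)
print(len(wrapped))
```

Inputs shorter than the limit are not padded at this stage; padding happens later when batches are collated.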



Now we can create our own databunch using the tokenizer above. Note that we pass include_bos=False and include_eos=False. This prevents fastai from adding its own BOS/EOS tokens, which would interfere with the [CLS] and [SEP] tokens that BERT expects.

We can pass our own list of Preprocessors to the databunch.

In [8]:
#@title Define Processors and Databunch {display-mode: "form"}
class BertTokenizeProcessor(TokenizeProcessor):
    """Special Tokenizer, where we remove sos/eos tokens since we add that ourselves in the tokenizer."""
    def __init__(self, tokenizer):
        super().__init__(tokenizer=tokenizer, include_bos=False, include_eos=False)

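class BertNumericalizeProcessor(NumericalizeProcessor):
    """Use a custom vocabulary to match the original BERT model."""
    # Note: this class expects a module-level `bert_tok`, which this notebook never defines.
    # It is not used below; get_bert_processor passes the vocab explicitly instead.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, vocab=Vocab(list(bert_tok.vocab.keys())), **kwargs)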

def get_bert_processor(tokenizer:Tokenizer=None, vocab:Vocab=None):
    return [BertTokenizeProcessor(tokenizer=tokenizer),
            NumericalizeProcessor(vocab=vocab)]

class BertDataBunch(TextDataBunch):
    @classmethod
    def from_df(cls, path:PathOrStr, train_df:DataFrame, valid_df:DataFrame, test_df:Optional[DataFrame]=None,
              tokenizer:Tokenizer=None, vocab:Vocab=None, classes:Collection[str]=None, text_cols:IntsOrStrs=1,
              label_cols:IntsOrStrs=0, **kwargs) -> DataBunch:
        "Create a `TextDataBunch` from DataFrames."
        p_kwargs, kwargs = split_kwargs_by_func(kwargs, get_bert_processor)
        # use our custom processors while taking tokenizer and vocab as kwargs
        processor = get_bert_processor(tokenizer=tokenizer, vocab=vocab, **p_kwargs)
        if classes is None and is_listy(label_cols) and len(label_cols) > 1: classes = label_cols
        src = ItemLists(path, TextList.from_df(train_df, path, cols=text_cols, processor=processor),
                      TextList.from_df(valid_df, path, cols=text_cols, processor=processor))
        src = src.label_for_lm() if cls==TextLMDataBunch else src.label_from_df(cols=label_cols, classes=classes)
        if test_df is not None: src.add_test(TextList.from_df(test_df, path, cols=text_cols))
        return src.databunch(**kwargs)

In [9]:
#@title Define own BertTextClassifier class{display-mode: "form"}
class BertTextClassifier(BertPreTrainedModel):
    def __init__(self, model_name, num_labels):
        config = BertConfig.from_pretrained(model_name)
        super(BertTextClassifier, self).__init__(config)
        self.num_labels = num_labels

        self.bert = BertModel.from_pretrained(model_name, config=config)

        self.dropout = nn.Dropout(self.config.hidden_dropout_prob)
        self.classifier = nn.Linear(self.config.hidden_size, num_labels)


    def forward(self, tokens, labels=None, position_ids=None, token_type_ids=None, attention_mask=None, head_mask=None):
        outputs = self.bert(tokens, position_ids=position_ids, token_type_ids=token_type_ids, attention_mask=attention_mask, head_mask=head_mask)

        pooled_output = outputs[1]

        dropout_output = self.dropout(pooled_output)
        logits = self.classifier(dropout_output)

        # Return raw logits: nn.CrossEntropyLoss applies log-softmax itself,
        # so no softmax is needed here.
        return logits
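
The model returns raw logits, which is what `nn.CrossEntropyLoss` expects. If class probabilities are needed, a softmax can be applied afterwards. A dependency-free sketch of that conversion, with made-up logit values:

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 0.5, -1.0])
predicted = probs.index(max(probs))  # index of the most likely class
print(predicted, round(sum(probs), 6))
```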

## Data

Load the dataset

In [10]:
#@title Create the dictionary that contains the labels along with their indices. This is useful for evaluation and similar. {display-mode: "form"}
def create_label_indices():
    # copy the configured classes so that config_data.classes is not modified
    labels = list(config_data.classes) + ['Other']

    # map each index to its label
    return dict(enumerate(labels))

label_indices = create_label_indices()
print(label_indices)

{0: 'A', 1: 'FT', 2: 'L', 3: 'LF', 4: 'MN', 5: 'O', 6: 'PE', 7: 'PO', 8: 'SC', 9: 'SE', 10: 'US', 11: 'Other'}
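
When loading the data, this dictionary is inverted to turn class names into indices, and every unknown name is sent to `Other`. A small sketch of that mapping with a shortened, illustrative class list:

```python
label_names = ["PE", "SE", "US", "Other"]      # shortened example list
index_to_label = dict(enumerate(label_names))  # {0: 'PE', 1: 'SE', ...}
label_to_index = {name: idx for idx, name in index_to_label.items()}

def encode(label):
    """Map a class name to its index; unknown names fall back to 'Other'."""
    return label_to_index.get(label, label_to_index["Other"])

encoded = [encode(name) for name in ["SE", "PE", "LF", "US"]]
print(encoded)
```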


In [11]:
#@title Define functions to load data {display-mode: "form"}
def load_data(filename):
    fpath = config_data.root_folder + config_data.data_folder + filename
    print(fpath)
    df = pd.read_csv(fpath, delimiter=';', header=0, encoding='utf8', names=['number', 'ProjectID', 'RequirementText', 'clazz', 'NFR', 'F', 'A', 'FT', 'L', 'LF', 'MN', 'O', 'PE', 'PO', 'SC', 'SE', 'US'])
    df = df.dropna()
    # keep only non-functional requirements; copy to avoid chained-assignment warnings
    df = df[df['NFR'] == 1].copy()

    # map class names to their indices; unknown names fall back to 'Other'
    inv_map = {v: k for k, v in label_indices.items()}
    labels = df[config_data.label_column].map(inv_map).fillna(inv_map.get('Other'))
    df[config_data.label_column] = labels.astype(int)
    # drop label 7 ('PO' with the class list configured above)
    df = df.loc[df[config_data.label_column] != 7]
    df = df.reset_index()
    return df

def load_all_data(filenames):
    # DataFrame.append was removed in pandas 2.0; concatenate the per-file frames instead
    return pd.concat([load_data(filename) for filename in filenames])



In [12]:
#@title Actually load dataset {display-mode: "form"}
# load the train datasets
df = load_all_data(config_data.train_data)
input_col = 'RequirementText'

# shuffle the dataset and determine the number of classes
df = df.sample(frac=1, axis=0, random_state = config.seed)
config.num_labels = df[config_data.label_column].nunique()

print(df.shape)
print(df[config_data.label_column].value_counts())
print(df['ProjectID'].value_counts())



./promise_nfr.csv
(369, 18)
clazz
10    67
9     66
5     62
6     54
3     38
8     21
0     21
4     17
2     13
1     10
Name: count, dtype: int64
ProjectID
8     73
6     47
5     37
3     33
4     30
2     28
12    22
13    19
14    16
10    15
11    13
15    12
9      8
1      8
7      8
Name: count, dtype: int64
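
The `ProjectID` column drives the project-specific folding (`Fold.ProjFold`): each fold holds out the requirements of the listed projects and trains on the rest. A plain-Python sketch with toy records (the tuples below are invented, not rows from the dataset):

```python
records = [  # (ProjectID, requirement text), toy data
    (1, "req a"), (1, "req b"), (2, "req c"), (3, "req d"), (3, "req e"),
]
project_folds = [[1], [2], [3]]  # leave one project out per fold

splits = []
for held_out in project_folds:
    test = [r for r in records if r[0] in held_out]
    train = [r for r in records if r[0] not in held_out]
    splits.append((train, test))
    print(held_out, len(train), len(test))
```

Because the project sizes differ widely, the test sets of the individual folds differ in size as well.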


In [13]:
#@title Function to split dataframe according to train size {display-mode: "form"}
def split_dataframe(df, train_size = 0.8, random_state = None):
    # split data into training and validation set
    df_trn, df_valid = train_test_split(df, stratify = df[config_data.label_column], train_size = train_size, random_state = random_state)
    return df_trn, df_valid
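
With `Fold.No`, the `train_size` of 0.75 is applied twice: once to separate the test set and once more, inside `create_classifier`, to separate a validation set from the remaining training data. Assuming scikit-learn rounds the training share down and assigns the remainder to the other part, the 369 requirements split as sketched here (the helper `split_sizes` is hypothetical):

```python
import math

def split_sizes(n_samples, train_size):
    """Sizes of the two parts for a fractional train_size (train part rounded down)."""
    n_train = math.floor(train_size * n_samples)
    return n_train, n_samples - n_train

n_train_full, n_test = split_sizes(369, 0.75)       # test split
n_train, n_valid = split_sizes(n_train_full, 0.75)  # validation split
print(n_train, n_valid, n_test)
```

The resulting 93 test requirements agree with the support total in the classification report further down.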

## Predictor


In [14]:
#@title Create a predictor class{display-mode: "form"}
class Predictor:
    def __init__(self, classifier):
        self.classifier = classifier
        self.classes = self.classifier.data.classes

    def predict(self, text):
        prediction = self.classifier.predict(text)
        prediction_class = prediction[1]
        return self.classes[prediction_class]

## Create and train the learner/classifier


In [15]:
#@title Define functions to create databunch, learner and actual classifier{display-mode: "form"}
def create_databunch(config, df_trn, df_valid):
    bert_tok = BertTokenizer.from_pretrained(config.model_name,)
    fastai_tokenizer = Tokenizer(tok_func=FastAiBertTokenizer(bert_tok, max_seq_len=config.max_seq_len), pre_rules=[], post_rules=[])
    fastai_bert_vocab = Vocab(list(bert_tok.vocab.keys()))
    return BertDataBunch.from_df(".",
                   train_df=df_trn,
                   valid_df=df_valid,
                   tokenizer=fastai_tokenizer,
                   vocab=fastai_bert_vocab,
                   bs=config.bs,
                   text_cols=input_col,
                   label_cols=config_data.label_column,
                   collate_fn=partial(pad_collate, pad_first=False, pad_idx=0),
              )


def create_learner(config, databunch):
    model = BertTextClassifier(config.model_name, config.num_labels)

    optimizer = partial(AdamW)
    # add the early-stopping callback only if it is enabled in the config
    callback_fns = []
    if config.es:
      callback_fns.append(partial(EarlyStoppingCallback, monitor='accuracy', min_delta=config.min_delta, patience=config.patience))

    learner = Learner(
      databunch, model,
      optimizer,
      wd = config.weight_decay,
      metrics=accuracy,
      loss_func=config.loss_func,
      callback_fns=callback_fns,
    )

    return learner

# Create the classifier
def create_classifier(config, df):
  df_trn, df_valid = split_dataframe(df, train_size = config.train_size, random_state = config.seed)
  databunch = create_databunch(config, df_trn, df_valid)

  return create_learner(config, databunch)

In [16]:
#@title Define predict loop {display-mode: "form"}
def predict_and_log_result(classifier, df_eval):
  predictor = Predictor(classifier)
  flat_predictions, flat_true_labels = [], []
  column_index = df_eval.columns.get_loc(config_data.label_column)
  for row in progress_bar(df_eval.itertuples(), total=len(df_eval)):
      class_text = row.RequirementText
      class_label = row[column_index+1]
      flat_true_labels.append(class_label)
      prediction = predictor.predict(class_text)
      flat_predictions.append(prediction)

      log_text = 'PID: {}, {}, {} -> {}'.format(row.ProjectID, class_text, label_indices.get(class_label), label_indices.get(prediction))
      logLine(log_text)

  # get labels in correct order
  target_names = []
  test_labels = unique_labels(flat_true_labels, flat_predictions)
  test_labels = np.sort(test_labels)
  for x in test_labels:
    target_names.append(label_indices.get(x))

  result = classification_report(flat_true_labels, flat_predictions, target_names=target_names, digits = 5)
  logResult(result)
  print(result)
  return flat_predictions, flat_true_labels

In [17]:
#@title Define train and test loop{display-mode: "form"}
def train_and_predict(df_train, df_eval, overall_flat_predictions, overall_flat_true_labels, results):
  classifier = create_classifier(config, df_train)
  # Train the classifier on train set
  print(classifier.fit_one_cycle(config.epochs, max_lr=config.max_lr, moms=config.moms, wd=config.weight_decay))
  #Predict on test set
  flat_predictions, flat_true_labels = predict_and_log_result(classifier, df_eval)
  overall_flat_predictions.extend(flat_predictions)
  overall_flat_true_labels.extend(flat_true_labels)
  test_labels = df[config_data.label_column].unique()
  test_labels = np.sort(test_labels)
  results.extend(precision_recall_fscore_support(flat_true_labels, flat_predictions, labels = test_labels))
  return classifier, overall_flat_predictions, overall_flat_true_labels, results

In [18]:
#@title Decide how to fold and train the classifier {display-mode: "form"}
overall_flat_predictions, overall_flat_true_labels, results = [], [], []
initLog()
if config.fold == Fold.TenFold:
  skf = StratifiedKFold(n_splits=10)
  fold_number = 1
  for train, test in skf.split(df, df[config_data.label_column]):
    df_train = df.iloc[train]
    df_eval = df.iloc[test]
    log_text = '/////////////////////// Fold: {} of {} /////////////////////////////'.format(fold_number,10)
    logLine(log_text)
    classifier, overall_flat_predictions, overall_flat_true_labels, results = train_and_predict(df_train, df_eval, overall_flat_predictions, overall_flat_true_labels, results)
    fold_number = fold_number + 1
elif config.fold == Fold.ProjFold:
  for k in config_data.project_fold:
    test = df.loc[df['ProjectID'].isin(k)].index
    train = df.loc[~df['ProjectID'].isin(k)].index
    df_train = df.loc[train]
    df_eval = df.loc[test]
    log_text = '/////////////////////// Test-Projects: {} /////////////////////////////'.format(k)
    logLine(log_text)
    classifier, overall_flat_predictions, overall_flat_true_labels, results = train_and_predict(df_train, df_eval, overall_flat_predictions, overall_flat_true_labels, results)
else:
  df_train, df_eval = train_test_split(df,stratify=df[config_data.label_column], train_size=config.train_size, random_state= config.seed)
  classifier, overall_flat_predictions, overall_flat_true_labels, results = train_and_predict(df_train, df_eval, overall_flat_predictions, overall_flat_true_labels, results)

get_memory_usage_str()



epoch,train_loss,valid_loss,accuracy,time
0,2.532867,2.50013,0.028986,00:03
1,2.48799,2.389054,0.086957,00:02
2,2.40763,2.217564,0.217391,00:02
3,2.320046,2.089252,0.217391,00:01
4,2.21875,1.935261,0.478261,00:01
5,2.086598,1.626339,0.507246,00:01
6,1.899667,1.318928,0.652174,00:01
7,1.678859,1.098083,0.724638,00:02
8,1.465765,0.907181,0.768116,00:01
9,1.258402,0.813511,0.768116,00:02



None


              precision    recall  f1-score   support

           A    1.00000   1.00000   1.00000         5
          FT    0.00000   0.00000   0.00000         2
           L    1.00000   0.33333   0.50000         3
          LF    1.00000   0.80000   0.88889        10
          MN    0.00000   0.00000   0.00000         4
           O    0.83333   0.93750   0.88235        16
          PE    0.84615   0.78571   0.81481        14
          SC    0.75000   0.60000   0.66667         5
          SE    0.77273   1.00000   0.87179        17
          US    0.72727   0.94118   0.82051        17

    accuracy                        0.81720        93
   macro avg    0.69295   0.63977   0.64450        93
weighted avg    0.77881   0.81720   0.78512        93




'Memory usage: 1320.96 MB'
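
The two averages at the bottom of the classification report above differ only in how the per-class scores are weighted. They can be recomputed from the F1 and support columns (values copied from the report, so small deviations stem from rounding):

```python
f1 = [1.00000, 0.00000, 0.50000, 0.88889, 0.00000,
      0.88235, 0.81481, 0.66667, 0.87179, 0.82051]
support = [5, 2, 3, 10, 4, 16, 14, 5, 17, 17]

# macro average: every class counts equally
macro_f1 = sum(f1) / len(f1)
# weighted average: every class counts in proportion to its support
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(round(macro_f1, 4), round(weighted_f1, 4))
```

FT and MN, which score an F1 of 0 on only 2 and 4 test items, pull the macro average well below the weighted one.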

In [19]:
#@title Define function to calculate averaged metric results {display-mode: "form"}
def calcAverageMetrics(results):
  # results holds four arrays per fold, in the order precision, recall, f-score, support
  precisions, recalls, fscores = results[0::4], results[1::4], results[2::4]
  # average every metric per class over all folds
  precision = np.mean(precisions, axis=0)
  recall = np.mean(recalls, axis=0)
  fscore = np.mean(fscores, axis=0)
  return precision, recall, fscore
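
`results` is a flat list that grows by four entries per fold (precision, recall, F-score, support), so each metric sits at every fourth position. A NumPy-free sketch of the same per-class averaging, using invented numbers for two folds and two classes (the helper `average_per_class` is hypothetical):

```python
results = [
    [1.0, 0.5], [0.8, 0.4], [0.9, 0.4], [5, 2],   # fold 1: precision, recall, f-score, support
    [0.5, 1.0], [0.6, 0.8], [0.5, 0.9], [4, 3],   # fold 2
]

def average_per_class(results, offset):
    """Average one metric (0=precision, 1=recall, 2=f-score) per class over all folds."""
    per_fold = results[offset::4]
    return [sum(values) / len(per_fold) for values in zip(*per_fold)]

precision = average_per_class(results, 0)
recall = average_per_class(results, 1)
print(precision, recall)
```

Note that this averages the per-fold scores, which is not the same as computing one report over the pooled predictions of all folds.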

In [20]:
#@title Display overall results and log them {display-mode: "form"}
target_names = []
test_labels = df[config_data.label_column].unique()

test_labels = np.sort(test_labels)
for x in test_labels:
  target_names.append(label_indices.get(x))

print('/////////////////////// Aggregated Predictions Result /////////////////////////////')
logResult('/////////////////////// Aggregated Predictions Result /////////////////////////////')
result = classification_report(overall_flat_true_labels, overall_flat_predictions, target_names=target_names, digits = 5)
logResult(result)
print(result)
print('/////////////////////// Averaged Metrics Result /////////////////////////////')
logResult('/////////////////////// Averaged Metrics Result /////////////////////////////')
precision, recall, fscore = calcAverageMetrics(results)
print("              precision    recall  f1-score")
logResult("              precision    recall  f1-score")
for i in range(len(precision)):
  print('{:<14}'.format(target_names[i]) + '  {:.5f}'.format(precision[i]) + '   {:.5f}'.format(recall[i]) + '   {:.5f}'.format(fscore[i]))
  logResult('{:<14}'.format(target_names[i]) + '  {:.5f}'.format(precision[i]) + '   {:.5f}'.format(recall[i]) + '   {:.5f}'.format(fscore[i]))


/////////////////////// Aggregated Predictions Result /////////////////////////////
              precision    recall  f1-score   support

           A    1.00000   1.00000   1.00000         5
          FT    0.00000   0.00000   0.00000         2
           L    1.00000   0.33333   0.50000         3
          LF    1.00000   0.80000   0.88889        10
          MN    0.00000   0.00000   0.00000         4
           O    0.83333   0.93750   0.88235        16
          PE    0.84615   0.78571   0.81481        14
          SC    0.75000   0.60000   0.66667         5
          SE    0.77273   1.00000   0.87179        17
          US    0.72727   0.94118   0.82051        17

    accuracy                        0.81720        93
   macro avg    0.69295   0.63977   0.64450        93
weighted avg    0.77881   0.81720   0.78512        93

/////////////////////// Averaged Metrics Result /////////////////////////////
              precision    recall  f1-score
A               1.00000   1.00000  



## Save the model

In [21]:
#@title Save the model along with its config
def create_model_name():
    name = 'NoRBERT_{clasz}_e{epochs}_{sampling}'.format(clasz=clazz, epochs=str(config.epochs),sampling="NoSampling")
    return name

def save_config(model_save_path, model_name):
    settings = ''
    for item in config.__dict__:
        value = config[item]
        setting = '{item}={value},\n'.format(item=item, value=value)
        settings += setting
    save_path = model_save_path + model_name + '.config'
    with open(save_path, 'w', encoding='utf-8') as out:
        out.write(settings)

# change the condition to False if you do not want to save the trained model
if True:
    model_name = create_model_name()
    model_save_path = config_data.root_folder + config_data.model_path
    if not os.path.isdir(model_save_path):
      print("Models folder does not exist, trying to create folder.")
      try:
        os.mkdir(model_save_path)
      except OSError:
        print ("Creation of the directory %s failed" % model_save_path)
      else:
        print ("Successfully created the directory %s" % model_save_path)
    save_config(model_save_path, model_name)
    model_save_file = model_save_path + model_name + '.pkl'
    classifier.export(file = model_save_file)


Models folder does not exist, trying to create folder.
Successfully created the directory ./models/
