We are going to use [Simple Transformers](https://github.com/ThilinaRajapakse/simpletransformers) - an NLP library based on the [Transformers](https://github.com/huggingface/transformers) library by HuggingFace. Simple Transformers allows us to fine-tune Transformer models in a few lines of code.  

### We are going to

- install the Simple Transformers library
- select a pre-trained monolingual model
- load the dataset
- train/fine-tune our model
- evaluate the results
- save and load the model
- test the loaded model on a real example

# Install Simple Transformers library 

In [1]:
# install simpletransformers
!pip install simpletransformers

# check installed version
!pip freeze | grep simpletransformers
# simpletransformers==0.49.2

simpletransformers==0.49.2


# Select a pre-trained monolingual model

As mentioned above, the Simple Transformers library is based on the Transformers library from HuggingFace. This lets us use every pre-trained model provided in the [Transformers library](https://huggingface.co/transformers/pretrained_models.html) as well as all community-uploaded models. For a list that includes community-uploaded models, refer to [https://huggingface.co/models](https://huggingface.co/models).

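We are going to use the `dbmdz/bert-base-german-uncased` model, an uncased German BERT base model. Note that this is a full-size BERT, not a distilled one. [DistilBERT is a small, fast, cheaper version of BERT](https://huggingface.co/transformers/model_doc/distilbert.html): it has 40% fewer parameters than `bert-base-uncased` and runs 60% faster while preserving over 95% of BERT's performance, so a distilled model would be a lighter alternative if training time or memory is a constraint.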

# Load the dataset

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
%cd gdrive/MyDrive/SK[AI]

/content/gdrive/MyDrive/SK[AI]


In [4]:
ls

cache_dir/                         outputs/  submission.csv    train_clean.csv
germeval-distilbert-german.tar.gz  runs/     test_reduced.csv  train.csv


In [5]:
import pandas as pd
import numpy as np

df = pd.read_csv('train_clean.csv',sep=",", encoding="utf8")

# class_list = ['Failure', 'Service Request', 'Security', 'User Service Request']

df['text'] = df['MailSubject'] + df['MailTextBody']

# df['pred_class'] = df.apply(lambda x:  class_list.index(x['IncidentType']),axis=1)
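The plain `+` above joins subject and body without a separator, which is why the preview further down shows fused words such as `blockiertlieber`. It also yields `NaN` whenever either column is missing, and such rows are later removed by `dropna()`. A minimal sketch of a more defensive variant follows; the helper name `build_text` and the toy rows are hypothetical, and whether a separator actually improves the classifier was not tested here.

```python
import pandas as pd

def build_text(frame: pd.DataFrame, sep: str = " ") -> pd.Series:
    """Join subject and body with a separator, treating missing values as empty strings."""
    subject = frame["MailSubject"].fillna("")
    body = frame["MailTextBody"].fillna("")
    return (subject + sep + body).str.strip()

# Toy rows standing in for train_clean.csv (hypothetical data)
toy = pd.DataFrame({
    "MailSubject": ["smart card blockiert", None],
    "MailTextBody": ["lieber helpdesk", "nur ein Text"],
})
toy_text = build_text(toy)
print(toy_text.tolist())
```

With the real data this would be used as `df['text'] = build_text(df)`.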

In [6]:
import math

# threshold_min_value = 80

# thresholded_target_services = list(df.groupby('ServiceProcessed').filter(lambda x: len(x) >= threshold_min_value).groupby('ServiceProcessed').groups.keys())
# thresholded_target_services

# ## sampling / balancing
# # use thresholded target_services instead of creating a list manually as above
# target_services = thresholded_target_services
# threshold_max_value = 100
# # define services to use for eda other & how many incidents to take per service
# eda_other_targets = ['EDA_S_BA_Datenablage', 'EDA_S_BA_Internetzugriff', 'EDA_S_BA_RemoteAccess', 'EDA_S_IT Sicherheit', 'EDA_S_Netzwerk Ausland', 'EDA_S_Raumbewirtschaftung']
# threshold_eda_other_max_value = math.floor(threshold_max_value / len(eda_other_targets))

# # create a large df_other with all other services, in order to subsample from this one later
# df_other = df[~df.ServiceProcessed.isin(target_services)]
# # create an empty dataframe (could be done easier..)
# df_other_sampled = df_other.reset_index(drop=True)
# df_other_sampled = df_other_sampled[0:0] 
# for eda_other_target in eda_other_targets:
#     totalForService = df_other[df_other.ServiceProcessed == eda_other_target]["MailTextBody"].size
#     print('total incidents of: ' + eda_other_target + ': ' + str(totalForService))
#     if(totalForService > threshold_eda_other_max_value):
#         df_other_sampled = pd.concat([df_other_sampled, df_other[df_other.ServiceProcessed == eda_other_target].sample(n=threshold_eda_other_max_value)])
#     else:
#         df_other_sampled = pd.concat([df_other_sampled, df_other[df_other.ServiceProcessed == eda_other_target]])

# other_count = df_other_sampled['MailTextBody'].size
# print('Total incidents in df_other_sampled: ' + str(other_count))
# #filling up 
# if(other_count < threshold_max_value):
#      df_other_sampled = pd.concat([df_other_sampled, df_other[df_other['ServiceProcessed'].isin(eda_other_targets)].sample(n=(threshold_max_value - other_count))])
# other_count = df_other_sampled['MailTextBody'].size
# print('Total incidents in df_other_sampled after filling up: ' + str(other_count))

# other_count = df_other_sampled['MailTextBody'].size
# print('Total incidents in df_other_sampled: ' + str(other_count))

# print('distribution in eda_other')
# print(df_other_sampled.groupby('ServiceProcessed').size()) 

# df_other_sampled.loc[:,'ServiceProcessed'] = 'EDA_other'

# # create an empty dataframe (could be done easier..)
# df_sampled = df_other_sampled.groupby('ServiceProcessed').apply(pd.DataFrame.sample, n=threshold_max_value).reset_index(drop=True)
# df_sampled = df_sampled[0:0] 

# for target_service in target_services:
#     if(df[df.ServiceProcessed == target_service]['MailTextBody'].size > threshold_max_value):
#         df_sampled = pd.concat([df_sampled, df[df.ServiceProcessed == target_service].sample(n=threshold_max_value)])
#     else:
#         df_sampled = pd.concat([df_sampled, df[df.ServiceProcessed == target_service]])

# # add a subsampling from df_other for the EDA_other service
# df_sampled = pd.concat([df_sampled, df_other_sampled])

# # reset technical dataframe indexes newly
# df_sampled = df_sampled.reset_index(drop=True)

# # print sizes
# print(df_sampled.groupby('ServiceProcessed').size()) 

# df = df_sampled.copy()

class_list = np.unique(df['ServiceProcessed'].dropna()).tolist()

df['pred_class'] = df.apply(lambda x:  class_list.index(x['ServiceProcessed']),axis=1)

df = df[['text','pred_class']].dropna()

print(df.shape)
df.head()

(2660, 2)


Unnamed: 0,text,pred_class
0,smart card blockiertlieber helpdesk sie haben ...,32
1,Webcamliebe kolleginnen und kollegen ich möcht...,50
2,IT Support heute Nachmittagliebe kolleginnen u...,51
3,Botschaft Warschau - Administratoren-Zugang fü...,42
4,Commande d'un écranliebe kolleginnen und kolle...,50


In [7]:
len(class_list)

55
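The label encoding above looks up each row with `class_list.index(...)`, which scans the list once per row and raises a `ValueError` if a row has no label. A possible alternative, sketched here on hypothetical toy labels, builds a dictionary once and uses `Series.map`, which leaves unlabeled rows as `NaN` so that a later `dropna()` can remove them.

```python
import pandas as pd

# Toy labels standing in for the ServiceProcessed column (hypothetical values)
toy = pd.DataFrame({"ServiceProcessed": ["Netzwerk", "Drucker", None, "Netzwerk"]})

# Sorted list of distinct labels, matching the ordering np.unique produces
toy_classes = sorted(toy["ServiceProcessed"].dropna().unique().tolist())
label_to_id = {label: i for i, label in enumerate(toy_classes)}

# .map leaves rows with a missing label as NaN instead of raising
toy["pred_class"] = toy["ServiceProcessed"].map(label_to_id)
print(toy_classes)
print(toy["pred_class"].tolist())
```

Because of the `NaN` entry the mapped column is a float column; after dropping the missing rows it can be converted back with `.astype(int)`.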

In [8]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.10)

print('train shape: ',train_df.shape)
print('test shape: ',test_df.shape)

train shape:  (2394, 2)
test shape:  (266, 2)
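The split above is random and unseeded, so the rows in the test set change on every run and rare classes may be missing from it entirely. A sketch of a reproducible, stratified split on toy data is shown below; the seed value and the toy frame are assumptions for illustration. With 55 classes in 2,660 rows, some classes may have only a single example, in which case `stratify` raises an error and those classes would need to be handled separately.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with two balanced classes (hypothetical data)
toy = pd.DataFrame({
    "text": [f"mail {i}" for i in range(40)],
    "pred_class": [0] * 20 + [1] * 20,
})

# random_state fixes the shuffle; stratify keeps class proportions in both parts
toy_train, toy_test = train_test_split(
    toy, test_size=0.10, random_state=42, stratify=toy["pred_class"]
)
print(toy_train.shape, toy_test.shape)
print(toy_test["pred_class"].value_counts().to_dict())
```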


# Load pre-trained model

In [9]:
from simpletransformers.classification import ClassificationModel

# define hyperparameter
train_args ={"reprocess_input_data": True,
             "overwrite_output_dir": True,
             "fp16":False,
             "num_train_epochs": 20,
             "save_model_every_epoch": False,
             "save_steps": 8000}

# Create a ClassificationModel
model = ClassificationModel(
    "bert", "dbmdz/bert-base-german-uncased",
    num_labels=len(class_list),
    args=train_args
)

Some weights of the model checkpoint at dbmdz/bert-base-german-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model 
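To see how `save_steps` in `train_args` relates to the length of the run, the number of optimizer steps can be estimated with a little arithmetic. The sketch below assumes a training batch size of 8 (believed to be the Simple Transformers default, since `train_args` does not set it), no gradient accumulation, and training on all 2,660 rows.

```python
import math

n_examples = 2660        # rows in df, from the shape printed earlier
train_batch_size = 8     # assumption: library default, not set in train_args
num_train_epochs = 20    # from train_args
save_steps = 8000        # from train_args

steps_per_epoch = math.ceil(n_examples / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps, total_steps < save_steps)
```

Under these assumptions the whole run is shorter than `save_steps`, so no intermediate step checkpoint would be written and only the final model ends up in `outputs/`.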


# Train model

In [None]:
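# Train the model on the full labelled dataset.
# Note: df includes the rows of test_df, so the evaluation on test_df below is not a held-out score.
# To get a held-out score, train on the training split instead:
# model.train_model(train_df)
model.train_model(df)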

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."


Exception ignored in: <bound method _MultiProcessingDataLoaderIter.__del__ of <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7f3dde5e2160>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1203, in __del__
    self._shutdown_workers()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1177, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.6/multiprocessing/process.py", line 122, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process






In [None]:
from sklearn.metrics import f1_score, accuracy_score


def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='micro')
    
result, model_outputs, wrong_predictions = model.eval_model(test_df, f1=f1_multiclass, acc=accuracy_score)

result

In [None]:
result, model_outputs, wrong_predictions = model.eval_model(train_df, f1=f1_multiclass, acc=accuracy_score)

result
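For single-label multiclass problems, micro-averaged F1 is identical to accuracy, so the `f1` and `acc` entries returned above carry the same information. With 55 unevenly sized classes, a macro-averaged F1 may be a useful complement because it weights every class equally. The sketch below illustrates the difference on hypothetical toy labels; the helper name `f1_macro` is an assumption and not part of the notebook.

```python
from sklearn.metrics import accuracy_score, f1_score

def f1_macro(labels, preds):
    """Unweighted mean of per-class F1; classes that are never predicted count as 0."""
    return f1_score(labels, preds, average="macro", zero_division=0)

# Toy example: a classifier that always predicts the majority class
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0]

micro = f1_score(y_true, y_pred, average="micro")
macro = f1_macro(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
print(micro, macro, acc)
```

The function could be passed to the evaluation in the same way as the existing metric, for example `model.eval_model(test_df, f1_macro=f1_macro)`.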

# Save and load the model

Archive only the files directly inside `outputs/`; its subdirectories are not included.


In [None]:
import os
import tarfile

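def save_model(model_path='',file_name=''):
  # take only the files directly inside model_path; subdirectories are skipped
  _, _, files = next(os.walk(model_path))
  with tarfile.open(file_name+ '.tar.gz', 'w:gz') as f:
    for file in files:
      # entries keep the model_path prefix, so extracting recreates the directory
      f.add(os.path.join(model_path, file))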

In [None]:
save_model('outputs','skai_raw')
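To check what ends up in such an archive without touching the real model files, the round trip can be tried on a temporary directory. This is a sketch under stated assumptions: the function name `archive_top_level_files`, the placeholder file names, and the `checkpoint-1` folder are hypothetical stand-ins for the contents of `outputs/`.

```python
import os
import tarfile
import tempfile

def archive_top_level_files(model_path, archive_path):
    """Archive only the files directly inside model_path, stored under the directory's base name."""
    base = os.path.basename(os.path.normpath(model_path))
    with tarfile.open(archive_path, "w:gz") as tar:
        for entry in sorted(os.listdir(model_path)):
            full_path = os.path.join(model_path, entry)
            if os.path.isfile(full_path):
                tar.add(full_path, arcname=f"{base}/{entry}")

with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "outputs")
    # a subdirectory that should NOT be archived
    os.makedirs(os.path.join(model_dir, "checkpoint-1"))
    for name in ("config.json", "vocab.txt"):
        with open(os.path.join(model_dir, name), "w") as fh:
            fh.write("placeholder")
    archive = os.path.join(tmp, "demo_model.tar.gz")
    archive_top_level_files(model_dir, archive)
    with tarfile.open(archive, "r:gz") as tar:
        archived_names = tar.getnames()

print(archived_names)
```

Listing the archive members this way is a quick sanity check before deleting `outputs/`.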

In [None]:
df.pred_class.value_counts().plot(kind='bar')

In [None]:
df_test = pd.read_csv('test_reduced.csv',sep=";", encoding="utf8")

df_test['text'] = df_test['MailSubject'] + df_test['MailTextBody']

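# predict() expects a list of strings, so convert the Series explicitly
predictions, raw_outputs = model.predict(df_test['text'].tolist())

df_test['Predicted'] = predictions

# map the numeric class ids back to the service names
df_test['Predicted'] = df_test.apply(lambda x:  class_list[x['Predicted']],axis=1)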

df_test = df_test[['Id', 'Predicted']]

df_test.to_csv('submission.csv', index=False)
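The `raw_outputs` returned by `predict` are not used above. Assuming they form an array of logits with one row per text and one column per class (which may not hold when long texts are split into several windows), a softmax turns them into probabilities that can serve as a rough confidence value. The helper `top_prediction` and the toy class names below are hypothetical.

```python
import numpy as np

def top_prediction(raw_outputs, class_names):
    """Return (label, probability) per row using a numerically stable softmax over the logits."""
    logits = np.asarray(raw_outputs, dtype=float)
    # subtracting the row maximum avoids overflow in exp without changing the result
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    best = probs.argmax(axis=1)
    return [(class_names[i], float(probs[row, i])) for row, i in enumerate(best)]

# Toy logits for two texts and three classes (hypothetical values)
toy_classes = ["Drucker", "Netzwerk", "Sicherheit"]
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy_result = top_prediction(toy_logits, toy_classes)
print(toy_result)
```

A low top probability, as in the second toy row, could be used to flag predictions for manual review.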

In [None]:
!tar -zxvf ./germeval-distilbert-german.tar.gz

In [None]:
!rm -rf outputs

# Test the loaded model on a real example

In [None]:
import os
import tarfile

def unpack_model(model_name=''):
  # the context manager closes the archive even if extraction fails
  with tarfile.open(f"{model_name}.tar.gz", "r:gz") as tar:
    tar.extractall()

unpack_model('germeval-distilbert-german')

In [None]:
from simpletransformers.classification import ClassificationModel

# define hyperparameter
train_args ={"reprocess_input_data": True,
             "overwrite_output_dir": True,
             "fp16":False,
             "num_train_epochs": 4}

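# Load the fine-tuned model from the extracted outputs/ directory.
# num_labels has to match the number of classes the archived model was trained with.
model = ClassificationModel(
    "bert", "outputs/",
    num_labels=4,
    args=train_args
)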
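Once a model is loaded, classifying a single real message comes down to calling `predict` with a one-element list and mapping the returned id back to a class name. The sketch below uses a hypothetical `StubModel` in place of the loaded `ClassificationModel` so that it runs without model files; the helper `predict_label`, the class names, and the example sentence are assumptions as well. The list of class names must be the one used when the loaded model was trained.

```python
def predict_label(model, class_names, text):
    """Classify one text with any object exposing predict(list_of_str) -> (predictions, raw_outputs)."""
    predictions, _ = model.predict([text])
    return class_names[int(predictions[0])]

class StubModel:
    """Stand-in for a loaded ClassificationModel; always predicts class id 1."""
    def predict(self, texts):
        return [1 for _ in texts], None

toy_classes = ["Drucker", "Netzwerk"]
toy_label = predict_label(StubModel(), toy_classes, "WLAN im Büro funktioniert nicht")
print(toy_label)
```

With the real objects the call would look like `predict_label(model, class_list, "...")`, provided `class_list` matches the labels of the loaded model.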