<a href="https://colab.research.google.com/github/szulcmaciej/colab-notebooks/blob/main/Build_a_BERT_Text_classification_model_in_Polish.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial

We are going to use [Simple Transformers](https://github.com/ThilinaRajapakse/simpletransformers), an NLP library built on the [Transformers](https://github.com/huggingface/transformers) library by HuggingFace. Simple Transformers lets us fine-tune Transformer models in a few lines of code.  

As the dataset, we are going to use [GermEval 2019](https://projects.fzai.h-da.de/iggsa/projekt/), which consists of German tweets. We are going to detect and classify abusive language in tweets. The tweets fall into four classes: `PROFANITY`, `INSULT`, `ABUSE`, and `OTHER`. The highest score achieved on this dataset is `0.7361`.

### We are going to

- install Simple Transformers library
- select a pre-trained monolingual model
- load the dataset
- train/fine-tune our model
- evaluate the results
- save and load the model
- test the loaded model on a real example

# Install Simple Transformers library 

In [None]:
# install simpletransformers
!pip install simpletransformers imbalanced-learn

# check installed version
!pip freeze | grep simpletransformers

simpletransformers==0.48.1


# Select a pre-trained monolingual model

As mentioned above, the Simple Transformers library is based on the Transformers library from HuggingFace. This lets us use every pre-trained model provided in the [Transformers library](https://huggingface.co/transformers/pretrained_models.html) as well as all community-uploaded models. For a list that includes community-uploaded models, see [https://huggingface.co/models](https://huggingface.co/models).

We are going to use the `distilbert-base-german-cased` model. [DistilBERT is a smaller, faster, cheaper version of BERT](https://huggingface.co/transformers/model_doc/distilbert.html). It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster while preserving over 95% of BERT's performance.

# Load the dataset

In [None]:
!unzip /content/polish_sentiment_dataset.zip -d data/

Archive:  /content/polish_sentiment_dataset.zip
  inflating: data/polish_sentiment_dataset.csv  


In [None]:
import pandas as pd
df = pd.read_csv('/content/data/polish_sentiment_dataset.csv')

In [None]:
# Shift the labels by one so that class ids start at 0 (the split below shows 0, 1 and 2)
df['rate'] = df['rate'] + 1

In [None]:
!wget https://github.com/sdadas/polish-roberta/releases/download/models-transformers-v2.9.0/roberta_large_transformers.zip
!mkdir models/roberta_large_transformers -p
!unzip roberta_large_transformers.zip -d models/roberta_large_transformers
!rm roberta_large_transformers.zip


In [None]:
df = df[['description','rate']]

print(df.shape)
df.head()

(936883, 2)


|   | description | rate |
|---|---|---|
| 0 | Polecam nie pierwszy i nie ostatni raz! | 2.0 |
| 1 | Bardzo dobra komunikacja sms i telefoniczna. Z... | 2.0 |
| 2 | Polecam zakupy w tym sklepie. Są dostępne częś... | 2.0 |
| 3 | 0 | 1.0 |
| 4 | Jestem w pełni zadowolona z przebiegu transakcji | 2.0 |
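
The frame above uses the column names `description` and `rate`. Simple Transformers looks for columns named `text` and `labels` when a DataFrame has headers, which is why a fallback warning about unspecified headers shows up later during evaluation. A minimal sketch of preparing the columns explicitly, using hypothetical toy rows rather than the real dataset:

```python
import pandas as pd

# Hypothetical rows in the tutorial's layout; `rate` is already shifted to 0..2.
toy_df = pd.DataFrame({
    "description": ["Polecam nie pierwszy i nie ostatni raz!", "0", None],
    "rate": [2.0, 1.0, 0.0],
})

prepared_df = (
    toy_df.dropna()  # drop rows with a missing text or label
    .rename(columns={"description": "text", "rate": "labels"})
    .astype({"labels": int})  # class ids as integers instead of floats
    .reset_index(drop=True)
)

print(prepared_df.dtypes)
```

Renaming up front makes the text/label assignment explicit instead of relying on column order.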


In [None]:
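from imblearn.under_sampling import RandomUnderSampler
# NOTE: the undersampling step was left unfinished in the original notebook (`df = df.`),
# so no resampling is applied and the class counts further down remain imbalanced.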
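
The cell above imports `RandomUnderSampler`, which suggests the classes were meant to be balanced by random undersampling. The sketch below shows the same idea in plain pandas, swapped in for the imbalanced-learn API and run on hypothetical toy data, so it is an illustration of the technique and not the step used in this run:

```python
import pandas as pd

# Hypothetical toy frame mirroring the tutorial's columns (`description`, `rate`).
toy_df = pd.DataFrame({
    "description": [f"review {i}" for i in range(11)],
    "rate": [2, 2, 2, 2, 2, 2, 0, 0, 0, 1, 1],
})

# Size of the rarest class; every class is cut down to this many rows.
min_count = int(toy_df["rate"].value_counts().min())

# Draw `min_count` rows per class without replacement (needs pandas >= 1.1).
balanced_df = (
    toy_df.groupby("rate")
    .sample(n=min_count, random_state=0)
    .reset_index(drop=True)
)

print(balanced_df["rate"].value_counts().sort_index())
```

Undersampling discards majority-class rows, so it trades data volume for balance.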

In [None]:
# Subsample: keep only every 100th row
df = df[::100]

In [None]:
from sklearn.model_selection import train_test_split
df = df.dropna()
train_df, test_df = train_test_split(df, test_size=0.10, stratify=df.rate, random_state=0)

print('train shape: ',train_df.shape)
print('test shape: ',test_df.shape)

train_df.rate.value_counts()

train shape:  (843, 2)
test shape:  (94, 2)


2.0    667
0.0    162
1.0     14
Name: rate, dtype: int64
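
The counts above are heavily skewed toward class `2`. One common remedy, not applied in this notebook, is to weight the loss inversely to class frequency. The sketch below computes such weights from the counts shown. As far as I know, `ClassificationModel` accepts a per-class `weight` list, but treat that as an assumption to check against the installed Simple Transformers version.

```python
# Class counts copied from the training split shown above.
class_counts = {0: 162, 1: 14, 2: 667}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c),
# so rare classes contribute more to the loss.
class_weights = [
    n_samples / (n_classes * class_counts[label])
    for label in sorted(class_counts)
]

print([round(w, 3) for w in class_weights])
```

The list is ordered by class id, so position `i` holds the weight for label `i`.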

# Load pre-trained model

In [None]:
!wandb login

[34m[1mwandb[0m: You can find your API key in your browser here: https://app.wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter: abf8bbd3b53f79a7ca11e39b1b8f559fc85c0fc4
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[32mSuccessfully logged in to Weights & Biases![0m


In [None]:
from simpletransformers.classification import ClassificationModel

# Define hyperparameters
train_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "fp16": False,
    "save_model_every_epoch": False,
    "wandb_project": "polish_sentiment_analysis",
    "num_train_epochs": 3,
}

# Create a ClassificationModel
model = ClassificationModel(
    "roberta", '/content/models/roberta_large_transformers',
    num_labels=3,
    args=train_args
)

Some weights of the model checkpoint at /content/models/roberta_large_transformers were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /content/models/roberta_large_transformers and are newly initialized: ['classifier.d

# Train model

In [None]:
import torch
torch.cuda.empty_cache()

In [None]:
# Train the model
model.train_model(train_df)

RuntimeError: ignored

In [None]:
from sklearn.metrics import f1_score, accuracy_score


def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='micro')
    
result, model_outputs, wrong_predictions = model.eval_model(test_df, f1=f1_multiclass, acc=accuracy_score)

result

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."


{'acc': 0.6894586894586895,
 'eval_loss': 0.8673831869594075,
 'f1': 0.6894586894586895,
 'mcc': 0.25262380289641617}
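
In the result above, `acc` and `f1` are identical. That is expected: for single-label multiclass problems, micro-averaged F1 equals accuracy. Macro-averaged F1 weights every class equally and is usually more telling on imbalanced data. A small sketch with hypothetical labels (not taken from this run) shows the difference:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: a classifier that almost always predicts the majority class (2).
y_true = [2, 2, 2, 2, 2, 2, 0, 0, 1, 1]
y_pred = [2, 2, 2, 2, 2, 2, 2, 0, 2, 2]


def f1_macro(labels, preds):
    # zero_division=0 scores a never-predicted class as 0 instead of warning
    return f1_score(labels, preds, average="macro", zero_division=0)


acc = accuracy_score(y_true, y_pred)
micro = f1_score(y_true, y_pred, average="micro")
macro = f1_macro(y_true, y_pred)

print(acc, micro, macro)
```

`f1_macro` has the same `(labels, preds)` signature as `f1_multiclass`, so it could be passed to `eval_model` as an additional keyword metric.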

# Save and load the model

Pack the top-level files from `outputs/` into a single `.tar.gz` archive.


In [None]:
import os
import tarfile

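def save_model(model_path='', file_name=''):
  """Pack the top-level files of `model_path` into `<file_name>.tar.gz`."""
  # The first item yielded by os.walk describes the top-level directory only,
  # so files inside subdirectories are skipped
  _, _, files = next(os.walk(model_path))
  with tarfile.open(file_name + '.tar.gz', 'w:gz') as f:
    for file in files:
      f.add(os.path.join(model_path, file))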

In [None]:
save_model('outputs','germeval-distilbert-german')

In [None]:
!tar -zxvf ./germeval-distilbert-german.tar.gz

sample_data/README.md
sample_data/anscombe.json
sample_data/mnist_train_small.csv
sample_data/mnist_test.csv
sample_data/california_housing_test.csv
sample_data/california_housing_train.csv


In [None]:
!rm -rf outputs

# Test the loaded model on a real example

In [None]:
import os
import tarfile

def unpack_model(model_name=''):
  # The context manager closes the archive even if extraction fails
  with tarfile.open(f"{model_name}.tar.gz", "r:gz") as tar:
    tar.extractall()

unpack_model('germeval-distilbert-german')
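
To check that packing and unpacking really round-trip, the sketch below runs both steps in a temporary directory. The file names are hypothetical placeholders standing in for real model files, and the archive name `model.tar.gz` is chosen only for illustration.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the `outputs/` directory with two hypothetical model files.
    model_dir = os.path.join(tmp, "outputs")
    os.makedirs(model_dir)
    for name in ("config.json", "pytorch_model.bin"):
        with open(os.path.join(model_dir, name), "w") as fh:
            fh.write("placeholder")

    # Pack: `arcname` stores members as `outputs/...` instead of the temp path.
    archive_path = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_dir, arcname="outputs")

    # Unpack into a fresh directory and list what came back.
    restore_dir = os.path.join(tmp, "restored")
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(restore_dir)
    restored_files = sorted(os.listdir(os.path.join(restore_dir, "outputs")))

print(restored_files)
```

Listing the restored files before deleting the original `outputs/` directory is a cheap way to catch an empty or misdirected archive.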

In [None]:
from simpletransformers.classification import ClassificationModel

# Define hyperparameters
train_args = {"reprocess_input_data": True,
              "overwrite_output_dir": True,
              "fp16": False,
              "num_train_epochs": 4}

# Create a ClassificationModel
model = ClassificationModel(
    "bert", "outputs/",
    num_labels=4,
    args=train_args
)

In [None]:
class_list = ['INSULT','ABUSE','PROFANITY','OTHER']

test_tweet = "Meine Mutter hat mir erzählt, dass mein Vater einen Wahlkreiskandidaten nicht gewählt hat, weil der gegen die Homo-Ehe ist"

predictions, raw_outputs = model.predict([test_tweet])

print(class_list[predictions[0]])

OTHER


In [None]:
class_list = ['INSULT','ABUSE','PROFANITY','OTHER']

test_tweet = "Frau #Böttinger meine Meinung dazu ist sie sollten uns mit ihrem Pferdegebiss nicht weiter belästigen #WDR"

predictions, raw_outputs = model.predict([test_tweet])

print(class_list[predictions[0]])
# INSULT

INSULT
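
`model.predict` also returns `raw_outputs`, which the cells above never use. Assuming these are unnormalized per-class scores (logits), a softmax turns them into probabilities that show how confident a prediction is. The values below are hypothetical and only illustrate the computation:

```python
import numpy as np

class_list = ['INSULT', 'ABUSE', 'PROFANITY', 'OTHER']

# Hypothetical logits for one input; real values would come from `raw_outputs`.
raw_outputs = np.array([[2.1, 0.3, -1.0, 0.5]])

# Numerically stable softmax: subtract the row maximum before exponentiating.
shifted = raw_outputs - raw_outputs.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

top = int(probs[0].argmax())
print(class_list[top], round(float(probs[0, top]), 3))
```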
