# Demo Notebook for transformers models
*SSC, May 2023*

This notebook demonstrates a preliminary workflow for training transformers models. For now, all methods are called directly from the notebook. A more user-friendly interface is planned for the future.

In [None]:
# Please ignore this cell: extra install steps that are only executed when running the notebook on Google Colab
# flake8-noqa-cell
import os
if 'google.colab' in str(get_ipython()) and not os.path.isdir('data/Test_Data'):
    # we're running on colab and we haven't already downloaded the test data
    # first install pinned version of setuptools (latest version doesn't seem to work with this package on colab)
    %pip install setuptools==61 -qqq
    # install the moralization package
    %pip install git+https://github.com/ssciwr/moralization.git -qqq

    # download test data sets
    !wget https://github.com/ssciwr/moralization/archive/refs/heads/test_data.zip -q
    !mkdir -p data && unzip -qq test_data.zip && mv -f moralization-test_data/*_Data ./data/. && rm -rf moralization-test_data test_data.zip
    !spacy download de_core_news_sm
    from google.colab import drive
    drive.mount('/content/drive')

Import the required classes from the moralization package.

In [None]:
from moralization import DataManager, TransformersDataHandler, TransformersModelManager

### Import training data using DataManager

If you need more information about any warnings that are raised, run: <br>
```import logging ``` <br>
```logging.getLogger().setLevel(logging.DEBUG)```

Note that currently only the annotations of one file, the one specified in `example_name` (see below), are used.

In [None]:
# train on small dataset
# data_manager = DataManager("../../moralization_data/Test_Data/")
data_manager = DataManager("../data/Test_Data/XMI_11")
# train on full dataset
# data_manager = DataManager("/content/data/All_Data/XMI_11") 

In [None]:
for title, doc in data_manager.doc_dict.items():
    print(f"  - {title}: {len(doc)} tokens")

## Prepare the data in dataset format
The data is read in as XMI and then converted to a spaCy doc object. This allows us to locate the annotated spans in the running text and to detect sentence boundaries. The transformers models are fed the data in chunks, and currently each sentence is one chunk. Other choices, such as paragraphs or instances, are also conceivable.

The doc object is generated by the `DataManager`. We then use the transformers-specific methods of the `TransformersDataHandler` to create nested lists of tokens (nested by sentence; these are the "chunks") and to make sure that the labels for the selected annotation are nested in the same way. The assigned labels are "2" for the first token of an annotation, "1" for a token inside an annotation, "0" for a token without annotation, and "-100" for punctuation marks, which are ignored when the loss function (cross entropy) is calculated.

1. XMI data -> spaCy doc object
2. Get tokens, sentences and labels from the spaCy doc object and put them in nested lists
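
The labelling scheme described above can be pictured with a small, library-free sketch. It is only an illustration: `label_sentence`, the example sentences and the span indices are hypothetical, and the assumption that punctuation is masked even inside an annotation is not taken from the `TransformersDataHandler` code.

```python
import string

IGNORE = -100  # label for tokens that are excluded from the loss


def label_sentence(tokens, spans):
    """Label one sentence; `spans` holds (start, end) token indices, end exclusive."""
    labels = [0] * len(tokens)
    for start, end in spans:
        labels[start] = 2  # first token of an annotation
        for i in range(start + 1, end):
            labels[i] = 1  # token inside an annotation
    # assumption: punctuation is masked regardless of whether it lies in a span
    return [
        IGNORE if all(ch in string.punctuation for ch in tok) else lab
        for tok, lab in zip(tokens, labels)
    ]


# two hypothetical sentences; only the first one contains an annotated span
sentences = [
    ["Das", "ist", "ungerecht", ",", "finde", "ich", "."],
    ["Heute", "regnet", "es", "."],
]
spans_per_sentence = [[(0, 3)], []]

nested_labels = [label_sentence(t, s) for t, s in zip(sentences, spans_per_sentence)]
for tokens, labels in zip(sentences, nested_labels):
    print(list(zip(tokens, labels)))
```

Both nested lists have the same shape, so every token can be paired with exactly one label.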

In [None]:
example_name = "test_data-trimmed_version_of-Interviews-pos-SH-neu-optimiert-AW"
# init the TransformersDataHandler
tdh = TransformersDataHandler()
# pass the dictionary of spacy doc objects to the TransformersDataHandler
# select the file to be used by example_name
tdh.get_data_lists(data_manager.doc_dict, example_name)
tdh.generate_labels(data_manager.doc_dict, example_name)
list_of_sentence_list_of_tokens, list_of_labels = tdh.structure_labels()

We now have our nested lists. We can check the first few items to see whether they look plausible:

In [None]:
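# zip over slices so the loop also works for files with fewer than 10 sentences
for tokens, labels in zip(list_of_sentence_list_of_tokens[:10], list_of_labels[:10]):
    print(tokens)
    print(labels)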

Now we convert the nested lists into a pandas dataframe. This dataframe is then converted into a Hugging Face dataset, which could also be pushed to the hub. Pushing to the hub is not implemented yet but is planned for the future.

3. Nested lists into dataframe
4. Dataframe to dataset
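
As a rough, library-free picture of steps 3 and 4, the sketch below builds one row per sentence and splits the rows into a train and a test part. The column names `Sentences` and `Labels` match those used later in this notebook. The helper `split_records`, the 80/20 ratio and the fixed seed are assumptions for illustration and are not taken from `lists_to_df` or `df_to_dataset`.

```python
import random


def split_records(records, test_fraction=0.2, seed=42):
    """Shuffle the rows and split them into a train and a test part."""
    shuffled = records[:]  # copy, so the input order is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}


# hypothetical nested lists with five sentences
sentence_list = [["Das", "ist", "gut", "."], ["Nein", "."], ["Ja", "."], ["Warum", "?"], ["Gut", "."]]
label_list = [[2, 1, 1, -100], [0, -100], [0, -100], [0, -100], [0, -100]]

# one row per sentence, one column for the tokens and one for the labels
records = [{"Sentences": toks, "Labels": labs} for toks, labs in zip(sentence_list, label_list)]
dataset = split_records(records)
print(len(dataset["train"]), len(dataset["test"]))
```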

In [None]:
data_in_frame = data_manager.lists_to_df(sentence_list=list_of_sentence_list_of_tokens, label_list=list_of_labels)

In [None]:
data_in_frame.head(10)

In [None]:
# you can either obtain a raw dataset or one that is split into test and train
# raw_dataset = data_manager.df_to_dataset(data_in_frame=data_in_frame, split=False)
train_test_dataset = data_manager.df_to_dataset(data_in_frame=data_in_frame)

In [None]:
print(train_test_dataset)

## Initialize required elements for training
Before training, the pre-tokenized data has to be tokenized again with the tokenizer that belongs to the selected model. You need to provide the path to the directory where the model should be saved. The model can be selected using the `model_name` keyword, which defaults to `bert-base-cased`.
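
Sub-word tokenizers such as the one used by BERT may split a single word into several pieces and add special tokens, so the per-word labels have to be realigned after tokenization. The sketch below shows one common way of doing this, following the convention of the Hugging Face token classification examples. Whether `map_dataset` applies exactly these rules is an assumption. `align_labels` is a hypothetical helper, and `word_ids` is written out by hand as a stand-in for what a fast tokenizer reports.

```python
def align_labels(word_ids, word_labels):
    """Expand per-word labels to per-sub-token labels."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)  # special tokens such as [CLS] and [SEP]
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first piece keeps the word label
        else:
            label = word_labels[word_id]
            # a continuation piece of a "begin" word (2) counts as "inside" (1)
            aligned.append(1 if label == 2 else label)
        previous = word_id
    return aligned


# four words; the first one is assumed to be split into three pieces
word_ids = [None, 0, 0, 0, 1, 2, 3, None]
word_labels = [2, 1, 1, -100]
aligned_labels = align_labels(word_ids, word_labels)
print(aligned_labels)
```

The result has one label per sub-token, which is the shape the model expects.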

In [None]:
model_name = "bert-base-uncased"
tmm = TransformersModelManager(model_path=".", model_name=model_name)
tokenized_dataset = tmm.map_dataset(train_test_set=train_test_dataset, token_column_name="Sentences", label_column_name="Labels")

In [None]:
print(tokenized_dataset)

In [None]:
print(tokenized_dataset["train"][0])

Now the data collator, which forms the batches from the training and test data, is initialized.
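
To illustrate what forming a batch involves: the sentences in a batch have different lengths, so they are padded to a common length, and the padded label positions receive -100 so that they do not contribute to the loss. The sketch below is a plain-Python illustration. `pad_batch` is a hypothetical helper, the token ids and the pad id 0 are assumptions, and the collator created by `init_data_collator` may behave differently.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# two hypothetical tokenized sentences of different length
features = [
    {"input_ids": [101, 7, 8, 9, 102], "labels": [-100, 2, 1, 1, -100]},
    {"input_ids": [101, 5, 102], "labels": [-100, 0, -100]},
]
batch = pad_batch(features)
print(batch["labels"])
```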

In [None]:
tmm.init_data_collator()

The metric is selected and initialized. You may pass different label names using the `label_names` keyword argument. The default is `["0", "M", "M-BEG"]` for the labels 0, 1, 2, which designate no moralization, moralization, and the beginning of a moralization, respectively. The default metric is `seqeval`, but it can be changed using the `eval_metric` keyword. See https://huggingface.co/docs/evaluate/choosing_a_metric
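
Metrics such as `seqeval` operate on label names rather than on integer ids, so a mapping between the two is needed. The sketch below shows such a mapping for the default label names. It assumes that the list index is used as the id, and `decode` is a hypothetical helper; the internals of `set_id2label` and `set_label2id` may differ.

```python
label_names = ["0", "M", "M-BEG"]

# assumption: the position in the list is the integer id of the label
id2label = dict(enumerate(label_names))
label2id = {name: i for i, name in id2label.items()}


def decode(label_ids):
    """Translate integer labels to names, skipping ignored positions (-100)."""
    return [id2label[i] for i in label_ids if i != -100]


decoded = decode([-100, 2, 1, 1, 0, -100])
print(decoded)
```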

In [None]:
tmm.load_evaluation_metric()
# we need to map ids and labels for the model
tmm.set_id2label()
tmm.set_label2id()

Now we load the model. This uses the model name from above. You can load a model that differs from the one the tokenizer belongs to; this is, however, not recommended.

In [None]:
tmm.load_model()

Now we load the dataloader, which feeds the data to the training loop in batches.

In [None]:
tmm.load_dataloader(tokenized_datasets=tokenized_dataset)

Now load the optimizer; we use AdamW for this. The learning rate can be adjusted directly using the `learning_rate` keyword. All other arguments can be passed as a dictionary via `kwargs` if desired.

In [None]:
tmm.load_optimizer(learning_rate=2e-5, kwargs=None)

Now load the accelerator, which makes use of the available hardware.

In [None]:
tmm.load_accelerator()

Load the scheduler, which adjusts the learning rate during training.
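
Which schedule `load_scheduler` uses is not stated here. As an illustration of what a scheduler does, the sketch below assumes a linear schedule: an optional linear warmup followed by a linear decay to zero. `linear_lr` is a hypothetical helper, the base learning rate is the value passed to the optimizer above, and the step counts are made up.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 10  # hypothetical number of training steps
schedule = [linear_lr(s, total_steps) for s in range(total_steps + 1)]
print(schedule[0], schedule[5], schedule[-1])
```

With this schedule the learning rate starts at its full value and shrinks with every training step, so later updates change the weights less than earlier ones.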

In [None]:
tmm.load_scheduler()

Train the model.

In [None]:
tmm.train()

Evaluate the model, providing the `model_path` of the trained model and a sample phrase.

In [None]:
evaluation_results = tmm.evaluate(token="Jupyter Notebooks sind super.", model_path=".")

In [None]:
for result in evaluation_results:
    print(result)