# Demo Notebook for transformers models
*SSC, May 2023*

This notebook demonstrates a preliminary workflow for training transformers models. For now, all methods are called directly from the notebook. A more user-friendly interface is planned for the future.

In [None]:
# Please ignore this cell: extra install steps that are only executed when running the notebook on Google Colab
# flake8-noqa-cell
import os
if 'google.colab' in str(get_ipython()) and not os.path.isdir('Test_Data'):
    # we're running on colab and we haven't already downloaded the test data
    # first install pinned version of setuptools (latest version doesn't seem to work with this package on colab)
    %pip install setuptools==61 -qqq
    # install the moralization package
    %pip install git+https://github.com/ssciwr/moralization.git -qqq

    # download test data sets
    !wget https://github.com/ssciwr/moralization/archive/refs/heads/test_data.zip -q
    !mkdir -p data && unzip -qq test_data.zip && mv -f moralization-test_data/*_Data ./data/. && rm -rf moralization-test_data test_data.zip
    !spacy download de_core_news_sm
    from google.colab import drive
    drive.mount('/content/drive')

Import the required classes from the moralization package.

In [None]:
from moralization import DataManager, TransformersModelManager

### Import training data using DataManager

If you need more information about raised warnings run: <br>
```import logging ``` <br>
```logging.getLogger().setLevel(logging.DEBUG)```

In [None]:
# train on small dataset
data_manager = DataManager("/content/data/Test_Data/XMI_11")
# train on full dataset
# data_manager = DataManager("/content/data/All_Data/XMI_11") 

In [None]:
for title, doc in data_manager.doc_dict.items():
    print(f"  - {title}: {len(doc)} tokens")

The default task is task 1: detection of moralization constructs (category I). If you want to train for a different task or label, you can specify it as follows:

In [None]:
task = "task2"
selected_labels = "all" # select all the labels for the given task
data_manager = DataManager("/content/data/Test_Data/XMI_11", task=task, selected_labels=selected_labels)

The tasks are defined as:
```
"task1": ["KAT1-Moralisierendes Segment"]
"task2": ["KAT2-Moralwerte", "KAT2-Subjektive Ausdrücke"]
"task3": ["KAT3-Rolle", "KAT3-Gruppe", "KAT3-own/other"]
"task4": ["KAT4-Kommunikative Funktion"]
"task5": ["KAT5-Forderung explizit"]
```
You can select a task together with all of its labels by setting `selected_labels="all"`, or you can restrict training to specific labels of that task. For example, with `task="task2"`, the labels can be given as a list: `selected_labels=["Fairness", "Cheating"]`.

In [None]:
task = "task2"
selected_labels = ["Fairness", "Cheating"] # select only the specified labels for the given task
data_manager = DataManager("/content/data/Test_Data/XMI_11", task=task, selected_labels=selected_labels)

If you want to select labels to train on that do not belong to a specific category, you should select "sc" as the task. This will give you access to all labels. You can then combine the labels freely, for example "Moralisierung" and "Fairness".

In [None]:
task = "sc"
selected_labels = ["Moralisierung", "Fairness"] # select the specified labels you want to train on from the set of all labels
data_manager = DataManager("/content/data/Test_Data/XMI_11", task=task, selected_labels=selected_labels)
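
The three selection modes shown above (`"all"`, a list of labels within one task, and free combination via `"sc"`) can be summarized in a small, self-contained sketch. The helper `resolve_labels` and the dictionary `EXAMPLE_LABELS` are hypothetical and only contain labels named in this notebook; the assignment of "Moralisierung" to task 1 is an assumption, and the code does not reflect how the `DataManager` is implemented internally.

```python
from typing import Dict, List, Union

# Illustrative and incomplete inventory: only labels named in this notebook.
# Assumption: "Moralisierung" belongs to task1; the real package defines more labels.
EXAMPLE_LABELS: Dict[str, List[str]] = {
    "task1": ["Moralisierung"],
    "task2": ["Fairness", "Cheating"],
}


def resolve_labels(task: str, selected_labels: Union[str, List[str]]) -> List[str]:
    """Return the labels to train on for a task (hypothetical helper)."""
    if task == "sc":
        # "sc" gives access to the labels of every task
        available = [label for labels in EXAMPLE_LABELS.values() for label in labels]
    else:
        available = EXAMPLE_LABELS[task]
    if selected_labels == "all":
        return list(available)
    unknown = [label for label in selected_labels if label not in available]
    if unknown:
        raise ValueError(f"Labels {unknown} are not available for task '{task}'")
    return list(selected_labels)


print(resolve_labels("task2", "all"))  # all labels of task2
print(resolve_labels("sc", ["Moralisierung", "Fairness"]))  # labels across tasks
```

In this sketch, requesting a label outside the chosen task (for example "Fairness" with `task="task1"`) raises a `ValueError`, which is why `"sc"` is needed to mix labels from different categories.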

## Prepare the data in dataset format
The data is read in as XMI and then converted to a spaCy doc object. This allows us to specify the spans in the running text and to detect sentence boundaries. For the transformers models, the data is fed in chunks; currently, each sentence is one chunk. Other choices, such as paragraphs or instances, are also conceivable.

The doc object is generated by the `DataManager`. The transformers-specific methods in the `TransformersDataHandler` then create nested lists of tokens (nested by sentence; these are the "chunks") and make sure that the labels for the selected annotation are nested in the same way. The assigned labels are "2" for the first token of an annotation, "1" for a token inside an annotation, "0" for no annotation, and "-100" for punctuation marks, which should be ignored when calculating the loss function (cross entropy).
You do not need to call these methods yourself; the `DataManager` takes care of all of this.
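
The labeling scheme described above can be illustrated with a minimal sketch in plain Python. The function `label_sentences` and the span format (token indices over the whole document, end exclusive) are assumptions made for this example; the package's own implementation works on spaCy doc objects and may differ in detail.

```python
import string
from typing import List, Tuple


def label_sentences(
    sentences: List[List[str]], spans: List[Tuple[int, int]]
) -> List[List[int]]:
    """Assign 2/1/0/-100 labels to tokens, nested by sentence.

    `spans` holds (start, end) token indices over the flattened document,
    with `end` exclusive. This format is an assumption of the sketch.
    """
    labels: List[List[int]] = []
    position = 0  # index of the current token in the flattened document
    for sentence in sentences:
        sentence_labels = []
        for token in sentence:
            if token and all(char in string.punctuation for char in token):
                label = -100  # punctuation is ignored by the loss
            elif any(start == position for start, _ in spans):
                label = 2  # first token of an annotation
            elif any(start < position < end for start, end in spans):
                label = 1  # token inside an annotation
            else:
                label = 0  # no annotation
            sentence_labels.append(label)
            position += 1
        labels.append(sentence_labels)
    return labels


sentences = [["Das", "ist", "ungerecht", "!"], ["Wir", "gehen", "."]]
spans = [(1, 3)]  # "ist ungerecht" is annotated
print(label_sentences(sentences, spans))  # prints [[0, 2, 1, -100], [0, 0, -100]]
```

The nesting of the returned labels mirrors the nesting of the tokens, so each sentence (chunk) can be passed to the model together with its labels.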

1. XMI data -> spaCy doc object
2. Get tokens, sentences and labels from the spaCy doc object and put them in nested lists
3. Nested lists -> dataframe

The pandas dataframe can then be exported into a Hugging Face dataset and can be pushed to the hub.

4. Dataframe to dataset
5. Optional: Publish dataset on hub

In [None]:
# prepare the dataset from the dataframe and split the data into test and training set
data_manager.df_to_dataset(split=True)

You can now publish the dataset to the Hugging Face Hub. For this, you either need to set the environment variable `HUGGING_FACE_TOKEN` or provide the token here using the `hugging_face_token` keyword. The `repo_id` variable specifies the name of the repository that you want to use (or create).

In [None]:
# now push to hub
data_manager.publish(repo_id="test-data-3")
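
The two ways of supplying the token can be pictured as a simple fallback. The helper `resolve_token` below is hypothetical and only illustrates the idea; in particular, the assumption that the keyword takes precedence over the environment variable is not confirmed by the package documentation, and the token strings are placeholders.

```python
import os
from typing import Optional


def resolve_token(hugging_face_token: Optional[str] = None) -> str:
    """Return the token from the keyword, else from the environment (hypothetical helper)."""
    token = hugging_face_token or os.environ.get("HUGGING_FACE_TOKEN")
    if not token:
        raise ValueError(
            "Set HUGGING_FACE_TOKEN or pass the hugging_face_token keyword."
        )
    return token


os.environ["HUGGING_FACE_TOKEN"] = "hf_dummy_token_from_env"  # placeholder value
print(resolve_token())  # falls back to the environment variable
print(resolve_token("hf_dummy_token_from_keyword"))  # keyword wins in this sketch
```

Setting the environment variable once per session avoids having to write the token into the notebook itself.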

You can also update the metadata in the `DatasetInfo` object that goes along with your dataset. The fields that can be updated are `description`, `version`, `license`, `citation` and `homepage`. You can update one, several, or all of them at the same time.

In [None]:
updated_dataset = data_manager.set_dataset_info(version="0.0.2")

To update the dataset on the Hugging Face Hub, push the updated dataset by passing it via the `data_set` keyword.

In [None]:
data_manager.publish(repo_id="test-data-3", data_set=updated_dataset)

## Pull an existing dataset from Hugging Face
Instead of creating a dataset from your own annotated data, you may also load a dataset from Hugging Face. For this, set `skip_read=True` when initializing the `DataManager`, so that it does not attempt to read data from the provided directory. Instead, the dataset that you pull from Hugging Face will be saved to the provided directory. You also need to specify the name of the dataset, the split you want to load ("train" or "test") and, optionally, a revision number if you do not want to load the current default version.

In [None]:
data_manager = DataManager("../data/Test_Data/", skip_read=True)
dataset = data_manager.pull_dataset(dataset_name="conllpp")

You can inspect the loaded dataset by looking at its DataFrame:

In [None]:
data_manager.data_in_frame.head(10)

In [None]:
data_manager.data_in_frame.ner_tags.max()

In [None]:
data_manager.column_names

In [None]:
data_manager.train_test_set

## Get started with training a transformers model
For this you need a model that you want to base your training on. You also need to provide the path to the directory where you want to save the model. The model name can be given using the `model_name` keyword. This keyword defaults to `bert-base-cased`. You should set the `label_names` as well if they differ from the three default names `0, M-BEG, M` (which stand for no moralization, beginning of moralization segment and continuing moralization segment).
The language is determined by the model that you use. The default model is an English language model.

In [None]:
tmm = TransformersModelManager(model_path=".", model_name="bert-base-cased", label_names=["0", "B-PER", "I-PER", "B-ORG", "I-ORG", "some", "other", "label", "here"])
# tmm = TransformersModelManager(model_path=".", model_name="bert-base-cased", label_names=["0", "M", "M-BEG"])
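
When choosing `label_names`, it helps to check that there is one name for every integer label in the data, for example by comparing against the result of `ner_tags.max()` from the inspection cell above. The sketch below assumes that the position of a name in the list corresponds to the integer label id, which is the usual convention for token classification models but is an assumption here; the helper functions and the value of `max_tag_id` are hypothetical.

```python
from typing import Dict, List, Tuple


def build_label_maps(label_names: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Map list positions to label names and back (illustrative helper)."""
    id2label = dict(enumerate(label_names))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def check_label_coverage(label_names: List[str], max_tag_id: int) -> None:
    """Raise if the largest tag id in the data has no label name."""
    if max_tag_id >= len(label_names):
        raise ValueError(
            f"Largest tag id is {max_tag_id}, "
            f"but only {len(label_names)} label names were given."
        )


label_names = ["0", "B-PER", "I-PER", "B-ORG", "I-ORG", "some", "other", "label", "here"]
max_tag_id = 8  # assumed value, e.g. taken from data_manager.data_in_frame.ner_tags.max()
check_label_coverage(label_names, max_tag_id)  # passes: 9 names cover ids 0..8
id2label, label2id = build_label_maps(label_names)
print(id2label[1], label2id["I-ORG"])  # prints B-PER 4
```

With only the three default names, the same check would fail for a dataset whose largest tag id is 8, which is why the label names are set explicitly for the pulled dataset.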

To train, simply call the `train` method with the above `data_manager`. The token and label column names are passed using the `token_column_name` and `label_column_name` keywords. If the data has been prepared by the `DataManager` and is not a dataset pulled from the Hugging Face Hub, these are `Sentences` and `Labels`. The number of training epochs is set by the keyword `num_train_epochs`.
The optimizer currently used is AdamW. The learning rate can be adjusted directly using the `learning_rate` keyword.

In [None]:
# token_column_name = "Sentences"
# label_column_name = "Labels"
token_column_name = "tokens"
label_column_name = "ner_tags"
num_train_epochs = 1
learning_rate = 2e-5
tmm.train(data_manager, token_column_name, label_column_name, num_train_epochs, learning_rate=learning_rate)
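
During training, tokens labeled `-100` (see the data preparation section) do not contribute to the cross-entropy loss; `-100` is also the default ignore index of the cross-entropy loss in PyTorch. The following plain-Python sketch illustrates the idea. It is not the training code used by the package, and the function name `masked_cross_entropy` and the example values are hypothetical.

```python
import math
from typing import List

IGNORE_INDEX = -100  # label value that is skipped in the loss


def masked_cross_entropy(logits: List[List[float]], labels: List[int]) -> float:
    """Mean cross-entropy over tokens whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for token_logits, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # e.g. punctuation: no contribution to the loss
        # log-sum-exp with max subtraction for numerical stability
        max_logit = max(token_logits)
        log_norm = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in token_logits)
        )
        total += log_norm - token_logits[label]
        count += 1
    # Sketch-specific choice: return 0.0 if every token is ignored
    return total / count if count else 0.0


# Three tokens with scores for the classes 0, 1 and 2; the last token is ignored
logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0], [1.0, 1.0, 1.0]]
labels = [0, 2, -100]
print(round(masked_cross_entropy(logits, labels), 4))
```

Changing the scores of the third token leaves the result unchanged, since its label is `-100`.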

You can now evaluate the model with an example phrase.

In [None]:
evaluation_results = tmm.evaluate(token="Jupyter Notebooks sind super.")

Print the evaluation results.

In [None]:
for result in evaluation_results:
    print(result)

The model is now saved in the provided `model_path`. It can also be pushed to the Hugging Face Hub, as shown in the publishing sections at the end of this notebook.

### Edit metadata

- `metadata` is a dictionary of metadata for the model
- It is pre-set to initialize the tags on the Hugging Face Hub
- Modify the entries below to update them

In [None]:
print(tmm.metadata.metadata)

In [None]:
tmm.metadata.metadata["datasets"] = "conllpp"
tmm.metadata.metadata["language"] = "en"
tmm.metadata.metadata["license"] = "mit"
tmm.metadata.metadata["metrics"] = "seqeval"
tmm.metadata.metadata["tags"] = ["token-classification"]
tmm.metadata.metadata["thumbnail"] = None

Save the updated metadata:

In [None]:
tmm.save()
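
On the Hugging Face Hub, model metadata of this kind is typically stored as a YAML block at the top of the repository's `README.md` (the model card). The sketch below shows what such a block could look like for the entries set above. The function `to_front_matter` is a hand-rolled, hypothetical illustration that only handles strings, lists and `None`; it is not what `tmm.save()` does internally, and skipping `None` values is an assumption of this sketch.

```python
from typing import Any, Dict


def to_front_matter(metadata: Dict[str, Any]) -> str:
    """Render a flat metadata dictionary as a YAML-style front matter block."""
    lines = ["---"]
    for key, value in metadata.items():
        if value is None:
            continue  # assumption: unset entries are left out
        if isinstance(value, list):
            lines.append(f"{key}:")
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines)


example_metadata = {
    "datasets": "conllpp",
    "language": "en",
    "license": "mit",
    "metrics": "seqeval",
    "tags": ["token-classification"],
    "thumbnail": None,
}
print(to_front_matter(example_metadata))
```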

### Publish to a new repository on Hugging Face

In [None]:
url = tmm.publish(repo_name="test-other-dataset2", hf_namespace="iulusoy", create_new_repo=True)
print(url)

### Publish to an existing repository on Hugging Face

In [None]:
url = tmm.publish(repo_name="t2", hf_namespace="iulusoy", create_new_repo=False)
print(url)