<div style='font-size:250%; font-weight:bold'>Train NER with huggingface/transformers</div>

This notebook shows how to use `huggingface/transformers` on Amazon SageMaker to fine-tune the RoBERTa language model into a new NER model.

In [None]:
!pip install --upgrade s3fs

In [1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
%load_ext autoreload
%autoreload 2

import os

import s3fs
from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch
from sagemaker.session import s3_input

from gtner_blog.util import split, bilou2bio, write_split, LabelCollector

# Download transformers NER scripts

The `huggingface/transformers` repo contains two PyTorch scripts that we need to download, namely `run_ner.py` and `utils_ner.py`. The following `bash` cell downloads them at version v2.5.0, which matches the library version listed in `requirements.txt`.

To minimize the dependencies installed in the training container, we also download `seqeval` into `source_dir/`, which emulates `pip install --no-deps seqeval`. The `seqeval.callbacks` module depends on [Keras](https://github.com/chakki-works/seqeval/blob/v0.0.12/seqeval/callbacks.py) and [tensorflow](https://github.com/chakki-works/seqeval/blob/v0.0.12/requirements.txt), but `run_ner.py` does not use these callbacks (it uses only `seqeval.metrics`), hence both dependencies can be [skipped](https://github.com/chakki-works/seqeval/blob/v0.0.12/seqeval/metrics/sequence_labeling.py).
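One way to double-check a claim like "this module does not need Keras" is to list the top-level packages a source file imports. The following is a minimal sketch using only the standard library; the helper name `top_level_imports` and the sample source string are illustrative assumptions, not part of `seqeval` or this notebook's code.

```python
import ast


def top_level_imports(source: str) -> set:
    """Return the top-level package names imported anywhere in `source`."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b as c` contributes the package name `a`.
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute `from a.b import c` contributes `a`; relative imports are skipped.
            names.add(node.module.split(".")[0])
    return names


# Illustrative source text only (NOT the real seqeval code).
sample = "import numpy as np\nfrom collections import defaultdict\nfrom . import helpers\n"
print(sorted(top_level_imports(sample)))
```

To apply it to a downloaded file, read the file's text and pass it in, for example `top_level_imports(open(path).read())` with `path` pointing at the file of interest.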


<details>
    <summary>Note</summary>
    <blockquote>As of this writing, the master branch of `huggingface/transformers` has relocated the NER scripts from `examples/` to `examples/ner/`, which is beyond the scope of this notebook.</blockquote>
</details>

In [2]:
%%bash
GITHUB=https://raw.githubusercontent.com
cd transformers-scripts

# Download NER scripts
for i in run_ner.py utils_ner.py
do
    curl --silent --location $GITHUB/huggingface/transformers/v2.5.0/examples/$i > $i
done

# Download seqeval
mkdir -p seqeval/metrics
for i in __init__.py callbacks.py metrics/__init__.py metrics/sequence_labeling.py
do
    curl --silent --location $GITHUB/chakki-works/seqeval/v0.0.12/seqeval/$i > seqeval/$i
done

ls -ald * seqeval/* seqeval/metrics/* | egrep --color=always 'run_ner.py|utils_ner.py|seqeval.*|^'

-rw-rw-r-- 1 ec2-user ec2-user    32 Feb 21 07:38 requirements.txt
-rw-rw-r-- 1 ec2-user ec2-user 30349 Feb 21 16:06 run_ner.py
drwxrwxr-x 3 ec2-user ec2-user  4096 Feb 21 08:12 seqeval
-rw-rw-r-- 1 ec2-user ec2-user  3111 Feb 21 16:06 seqeval/callbacks.py
-rw-rw-r-- 1 ec2-user ec2-user     0 Feb 21 16:06 seqeval/__init__.py
drwxrwxr-x 2 ec2-user ec2-user  4096 Feb 21 08:12 seqeval/metrics
-rw-rw-r-- 1 ec2-user ec2-user   371 Feb 21 16:06 seqeval/metrics/__init__.py
-rw-rw-r-- 1 ec2-user ec2-user 12604 Feb 21 16:06 seqeval/metrics/sequence_labeling.py
-rw-rw-r-- 1 ec2-user ec2-user  3559 Feb 21 16:04 transformers-train.py
-rw-rw-r-- 1 ec2-user ec2-user  8428 Feb 21 16:06 utils_ner.py


# Prepare data channels

Read the whole corpus from S3, split it into train and dev sets in a 3:1 proportion, then upload the two splits back to S3.
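To make the 3:1 proportion concrete, here is a minimal sketch of a sentence-level split. It assumes the corpus is in CoNLL style (one `token tag` pair per line, sentences separated by blank lines) and uses a deterministic rule that sends every fourth sentence to dev. The names `iter_sentences` and `split_3_to_1` are hypothetical; the actual `split` helper in `gtner_blog.util` may work differently, for example by sampling at random.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[str]


def iter_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Group CoNLL-style lines into sentences; blank lines act as separators."""
    current: Sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            current.append(line)
        elif current:
            yield current
            current = []
    if current:  # Last sentence when the input has no trailing blank line.
        yield current


def split_3_to_1(lines: Iterable[str]) -> Tuple[List[Sentence], List[Sentence]]:
    """Send every fourth sentence to dev and the rest to train."""
    train: List[Sentence] = []
    dev: List[Sentence] = []
    for i, sentence in enumerate(iter_sentences(lines)):
        (dev if i % 4 == 3 else train).append(sentence)
    return train, dev


# Tiny synthetic corpus of four one-token sentences.
corpus = []
for i in range(4):
    corpus += [f"token{i} O\n", "\n"]

train_sents, dev_sents = split_3_to_1(corpus)
print(len(train_sents), len(dev_sents))  # 3 1
```

Splitting at sentence boundaries rather than at arbitrary lines keeps each sentence's tag sequence intact in whichever split it lands.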

In [3]:
bucket = 'gtner-blog'                # Change me as necessary
gt_jobname = 'test-gtner-blog-004'   # Change me as necessary

iob_file = f's3://{bucket}/gt/{gt_jobname}/manifests/output/output.iob'
train = f's3://{bucket}/transformers-data/train'
dev = f's3://{bucket}/transformers-data/dev'
label = f's3://{bucket}/transformers-data/label'
label_collector = LabelCollector()
fs = s3fs.S3FileSystem(anon=False)

with fs.open(iob_file, 'r') as f:
    train_split = os.path.join(train, 'train.txt')
    dev_split = os.path.join(dev, 'dev.txt')
    
    # Chain of functions: .iob -> bilou2bio -> label_collector -> split -> write_split.
    write_split(split(label_collector(bilou2bio(f))), train_split, dev_split)

with fs.open(os.path.join(label, 'label.txt'), 'w') as f:
    for ner_tag in label_collector.sorted_labels:
        f.write(f'{ner_tag}\n')

display(iob_file, train, dev, label)

's3://gtner-blog/gt/test-gtner-blog-004/manifests/output/output.iob'

's3://gtner-blog/transformers-data/train'

's3://gtner-blog/transformers-data/dev'

's3://gtner-blog/transformers-data/label'
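The chain in the cell above converts BILOU tags to BIO tags and collects the set of labels that ends up in `label.txt`. As a rough illustration of those two steps, and not the actual `gtner_blog.util` implementation, the sketch below maps `U-X` to `B-X` and `L-X` to `I-X`, then sorts the distinct tags. The function name `bilou_tag_to_bio` is hypothetical.

```python
def bilou_tag_to_bio(tag: str) -> str:
    """Map one BILOU tag to its BIO equivalent."""
    if tag.startswith("U-"):  # Unit-length entity becomes a Begin tag.
        return "B-" + tag[2:]
    if tag.startswith("L-"):  # Last token of an entity becomes an Inside tag.
        return "I-" + tag[2:]
    return tag  # B-, I- and O tags are unchanged.


bilou_tags = ["B-ORG", "L-ORG", "O", "U-PER"]
bio_tags = [bilou_tag_to_bio(t) for t in bilou_tags]
print(bio_tags)               # ['B-ORG', 'I-ORG', 'O', 'B-PER']
print(sorted(set(bio_tags)))  # ['B-ORG', 'B-PER', 'I-ORG', 'O']
```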

# Start training

We create a PyTorch estimator with our entry point script `transformers-train.py`, a thin wrapper over `run_ner.py` that does the following:

1. Parses SageMaker's entry-point protocol, namely the model and channel directories.
2. Pre-defines a few arguments to `run_ner.py`: `{"--do_train", "--do_eval", "--evaluate_during_training", "--data_dir", "--output_dir", "--labels"}`.
3. Passes the estimator's hyperparameters as arguments to `run_ner.py`.
   1. Each hyperparameter `abcd` is passed down as `--abcd`.
   2. The hyperparameters must not conflict with those pre-defined in step 2.
   3. The entry point only supports arguments of the form `--abcd SOME_VALUE`.

```bash
usage: run_ner.py [-h] --data_dir DATA_DIR --model_type MODEL_TYPE
                  --model_name_or_path MODEL_NAME_OR_PATH --output_dir
                  OUTPUT_DIR [--labels LABELS] [--config_name CONFIG_NAME]
                  [--tokenizer_name TOKENIZER_NAME] [--cache_dir CACHE_DIR]
                  [--max_seq_length MAX_SEQ_LENGTH] [--do_train] [--do_eval]
                  [--do_predict] [--evaluate_during_training]
                  [--do_lower_case]
                  [--per_gpu_train_batch_size PER_GPU_TRAIN_BATCH_SIZE]
                  [--per_gpu_eval_batch_size PER_GPU_EVAL_BATCH_SIZE]
                  [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
                  [--learning_rate LEARNING_RATE]
                  [--weight_decay WEIGHT_DECAY] [--adam_epsilon ADAM_EPSILON]
                  [--max_grad_norm MAX_GRAD_NORM]
                  [--num_train_epochs NUM_TRAIN_EPOCHS]
                  [--max_steps MAX_STEPS] [--warmup_steps WARMUP_STEPS]
                  [--logging_steps LOGGING_STEPS] [--save_steps SAVE_STEPS]
                  [--eval_all_checkpoints] [--no_cuda]
                  [--overwrite_output_dir] [--overwrite_cache] [--seed SEED]
                  [--fp16] [--fp16_opt_level FP16_OPT_LEVEL]
                  [--local_rank LOCAL_RANK] [--server_ip SERVER_IP]
                  [--server_port SERVER_PORT]
```
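The forwarding rules listed before the usage text can be sketched as follows. This is an illustration under stated assumptions, not the contents of `transformers-train.py`: the names `RESERVED`, `hyperparameters_to_argv` and `channel_dir` are hypothetical, and the hyperparameters are taken as a plain dict. The `SM_CHANNEL_<NAME>` environment variables are the ones SageMaker training containers set to each input channel's directory.

```python
import os
from typing import Dict, List, Mapping

# Flags the wrapper sets itself (step 2); hyperparameters must not reuse them.
RESERVED = {"do_train", "do_eval", "evaluate_during_training",
            "data_dir", "output_dir", "labels"}


def hyperparameters_to_argv(hyperparameters: Dict[str, object]) -> List[str]:
    """Turn each hyperparameter `abcd` into the pair `--abcd VALUE`."""
    argv: List[str] = []
    for name, value in hyperparameters.items():
        if name in RESERVED:
            raise ValueError(f"hyperparameter {name!r} is set by the entry point")
        argv += [f"--{name}", str(value)]
    return argv


def channel_dir(channel: str, environ: Mapping[str, str] = os.environ) -> str:
    """Look up the local directory of an input channel such as 'train'."""
    return environ[f"SM_CHANNEL_{channel.upper()}"]


argv = hyperparameters_to_argv({
    "num_train_epochs": 5.0,
    "model_type": "roberta",
    "model_name_or_path": "roberta-base",
})
print(argv)
```

Because only the `--abcd SOME_VALUE` form is produced, value-less switches such as `--fp16` cannot be expressed as hyperparameters in this scheme.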

In [4]:
estimator = PyTorch(entry_point='transformers-train.py',
                    source_dir='./transformers-scripts',
                    role=get_execution_role(),
                    train_instance_count=1,
                    train_instance_type='ml.m5.large',
                    framework_version='1.3.1',
                    py_version='py3',
                    debugger_hook_config=False,
                    hyperparameters={
                        'num_train_epochs': 5.0,
                        'model_type': 'roberta',
                        'model_name_or_path': 'roberta-base'
                    })

In [7]:
estimator.fit({'train': s3_input(train), 'dev': s3_input(dev), 'label': s3_input(label)})

2020-02-21 16:24:29 Starting - Starting the training job...
2020-02-21 16:24:31 Starting - Launching requested ML instances......
2020-02-21 16:25:34 Starting - Preparing the instances for training...
2020-02-21 16:26:22 Downloading - Downloading input data...
2020-02-21 16:26:53 Training - Downloading the training image...
2020-02-21 16:27:24 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-02-21 16:27:24,991 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-02-21 16:27:24,996 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-02-21 16:27:25,008 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-02-21 16:27:25,009 sagemaker_pytorch_container.training INFO     Invoking user training script.