<div style='font-size:250%; font-weight:bold'>Train NER with huggingface/transformers</div>

This notebook shows how to train a new NER model with transfer learning, using the huggingface/transformers library on Amazon SageMaker.

In [None]:
!pip install --upgrade s3fs

In [6]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
%load_ext autoreload
%autoreload 2

import os
import s3fs
from gtner_blog.util import split, get_latest_version

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
# A few standard SageMaker stanzas

import sagemaker
from sagemaker.pytorch import PyTorch

role: str = sagemaker.get_execution_role()
sess = sagemaker.Session()

# Download `run_ner.py` script

Download the PyTorch training script from the transformers repository. The `requirements.txt` specifies the latest stable version published on PyPI, which is what gets installed on training jobs. The NER training script must therefore come from the matching version tag in the git repo.

In [16]:
# Check the latest transformers version in pip.
version = get_latest_version('transformers')
%env VERSION=$version

# Download run_ner.py
!curl --silent --location \
    https://raw.githubusercontent.com/huggingface/transformers/v${VERSION}/examples/run_ner.py \
    > transformers-scripts/run_ner.py

# Download utils_ner.py
!curl --silent --location \
    https://raw.githubusercontent.com/huggingface/transformers/v${VERSION}/examples/utils_ner.py \
    > transformers-scripts/utils_ner.py

!ls -al transformers-scripts/ | egrep --color=always 'run_ner.py|utils_ner.py|^'

env: VERSION=2.5.0
total 68
drwxrwxr-x 3 ec2-user ec2-user  4096 Feb 21 02:49 .
drwxrwxr-x 5 ec2-user ec2-user  4096 Feb 21 02:55 ..
-rw-rw-r-- 1 ec2-user ec2-user    24 Feb 21 02:48 .gitignore
drwxrwxr-x 2 ec2-user ec2-user  4096 Feb 21 02:15 .ipynb_checkpoints
-rw-rw-r-- 1 ec2-user ec2-user    25 Feb 21 02:16 requirements.txt
-rw-rw-r-- 1 ec2-user ec2-user 30349 Feb 21 02:55 run_ner.py
-rw-rw-r-- 1 ec2-user ec2-user  3120 Feb 21 02:15 transformers-train.py
-rw-rw-r-- 1 ec2-user ec2-user  8428 Feb 21 02:55 utils_ner.py
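
The version lookup above relies on `get_latest_version` from `gtner_blog.util`, whose implementation is not shown in this notebook. The following is a minimal sketch of one way such a lookup could work, assuming the PyPI JSON API at `https://pypi.org/pypi/<package>/json`. The function names `parse_latest_version`, `fetch_latest_version` and `script_url` are hypothetical and are not the helper's actual code.

```python
import json
from urllib.request import urlopen


def parse_latest_version(payload: dict) -> str:
    """Extract the latest release version from a PyPI JSON API payload."""
    return payload["info"]["version"]


def fetch_latest_version(package: str) -> str:
    """Hypothetical stand-in for get_latest_version(); needs network access."""
    with urlopen(f"https://pypi.org/pypi/{package}/json") as resp:
        return parse_latest_version(json.load(resp))


def script_url(version: str, script: str) -> str:
    """Build the raw GitHub URL of an example script at the matching release tag."""
    return ("https://raw.githubusercontent.com/huggingface/transformers/"
            f"v{version}/examples/{script}")
```

Pinning the tag in the URL to the installed version keeps the script in step with the library API. The layout of `examples/` may differ between releases, so the path may need adjusting for other versions.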


# Prepare data channels

Split the whole corpus into train and test sets in a 3:1 ratio, then upload the splits to S3.

In [17]:
bucket = 'gtner-blog'                # Change me as necessary
gt_jobname = 'test-gtner-blog-004'   # Change me as necessary

iob_file = f's3://{bucket}/gt/{gt_jobname}/manifests/output/output.iob'
train = f's3://{bucket}/transformers-data/train'
test = f's3://{bucket}/transformers-data/test'
split(iob_file,
      os.path.join(train, 'data.iob'),
      os.path.join(test, 'data.iob'))

display(iob_file, train, test)

's3://gtner-blog/gt/test-gtner-blog-004/manifests/output/output.iob'

's3://gtner-blog/transformers-data/train'

's3://gtner-blog/transformers-data/test'
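
The `split` helper comes from `gtner_blog.util` and its implementation is not shown here. As an illustration only, a 3:1 split at sentence boundaries could be sketched as below. This assumes the IOB file has one `token tag` pair per line and a blank line between sentences. The name `split_iob`, the shuffling and the fixed seed are assumptions for this sketch, not the helper's actual behaviour.

```python
import random
from typing import List, Tuple


def split_iob(text: str, train_fraction: float = 0.75,
              seed: int = 42) -> Tuple[str, str]:
    """Split IOB text into train/test parts at sentence boundaries.

    Assumes one "token tag" pair per line and a blank line between sentences.
    """
    # Group lines into sentences so that no sentence is cut in half.
    sentences: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)

    # Shuffle deterministically, then cut at the requested fraction.
    random.Random(seed).shuffle(sentences)
    cut = int(len(sentences) * train_fraction)

    def render(part: List[List[str]]) -> str:
        # Re-join sentences with a blank line between them.
        return "\n\n".join("\n".join(s) for s in part) + ("\n" if part else "")

    return render(sentences[:cut]), render(sentences[cut:])
```

With `train_fraction=0.75`, three out of every four sentences go to the train split. The two returned strings could then be written to the S3 paths above, for example through file objects opened with `s3fs.S3FileSystem().open(path, 'w')`.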

# Start training

In [None]:
# NOTE: see run_ner.py for more hyperparameters. transformers-train.py does not allow you
# to set these hyperparameters: {...}.
estimator = PyTorch(entry_point='transformers-train.py',
                    source_dir='./transformers-scripts',
                    role=role,
                    train_instance_count=1,
                    train_instance_type='ml.m5.large',
                    framework_version='1.3.1',
                    py_version='py3',
                    sagemaker_session=sess,
                    debugger_hook_config=False,
                    hyperparameters={'num_train_epochs': 10.0})

In [None]:
estimator.fit({'train': sagemaker.session.s3_input(train),
               'test': sagemaker.session.s3_input(test)})
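
Inside the training container, SageMaker exposes each channel named in `fit()` as an environment variable (`SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` here) and passes the hyperparameters to the entry point as command-line flags. `transformers-train.py` is not shown in this notebook, so the following is only a sketch of how an entry point typically reads these values, not the actual script. The function name `parse_args` and the default of 3.0 epochs are assumptions.

```python
import argparse
import os
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Read SageMaker-provided locations and hyperparameters (illustrative sketch)."""
    parser = argparse.ArgumentParser()
    # Channel names passed to fit() become SM_CHANNEL_<NAME> environment variables.
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", default=os.environ.get("SM_CHANNEL_TEST"))
    # Artifacts written to this directory are uploaded to S3 when the job ends.
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR"))
    # Hyperparameters arrive as command-line flags, e.g. --num_train_epochs 10.0
    parser.add_argument("--num_train_epochs", type=float, default=3.0)
    # Ignore flags this sketch does not declare instead of failing on them.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Calling `parse_args(["--num_train_epochs", "10.0"])` inside the container would yield the local paths of both channels together with the epoch count, which the entry point could then forward to `run_ner.py`.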