# Hugging Face on SageMaker Pipelines
### Binary Classification with `Trainer` and `imdb` dataset

1. [Introduction](#Introduction)  
2. [Development Environment and Permissions](#Development-Environment-and-Permissions)
    1. [Installation](#Installation)  
    2. [Development environment](#Development-environment)  
    3. [Permissions](#Permissions)
3. [Preprocessing](#Preprocessing)   
    1. [Tokenization](#Tokenization)  
    2. [Uploading data to sagemaker_session_bucket](#Uploading-data-to-sagemaker_session_bucket)  
4. [Fine-tuning & starting Sagemaker Training Job](#Fine-tuning-\&-starting-Sagemaker-Training-Job)  
    1. [Creating an Estimator and start a training job](#Creating-an-Estimator-and-start-a-training-job)  
    2. [Estimator Parameters](#Estimator-Parameters)   
    3. [Download fine-tuned model from s3](#Download-fine-tuned-model-from-s3)
    4. [Attach an old training job to an estimator](#Attach-to-old-training-job-to-an-estimator)  
5. [_Coming soon_: Push model to the Hugging Face hub](#Push-model-to-the-Hugging-Face-hub)

# Introduction

Welcome to our end-to-end binary text-classification example. In this demo, we use the Hugging Face `transformers` and `datasets` libraries together with a custom Amazon SageMaker SDK extension to fine-tune a pre-trained transformer for binary text classification. In particular, the pre-trained model is fine-tuned on the `imdb` dataset. To get started, we need to set up the environment with a few prerequisite steps for permissions, configuration, and so on.

This notebook extends the [getting started demo](https://github.com/huggingface/notebooks/blob/main/sagemaker/01_getting_started_pytorch/sagemaker-notebook.ipynb) by adding SageMaker Processing, SageMaker Batch Transform, and SageMaker Pipelines.

# Development Environment and Permissions 

## Installation

_*Note:* we only install the required libraries from Hugging Face and AWS. You also need PyTorch or TensorFlow if you don't have one of them installed already._

In [None]:
!pip install "sagemaker>=2.48.0" "transformers==4.12.3" "datasets[s3]==1.18.3" --upgrade

## Development environment 

In [3]:
import sagemaker.huggingface

## Permissions

_If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)._

In [18]:
import sagemaker

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

base_job_name = 'huggingfaces-sm-demo'

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws-cn:iam::346044390830:role/service-role/AmazonSageMaker-ExecutionRole-20200402T172406
sagemaker bucket: sagemaker-cn-north-1-346044390830
sagemaker session region: cn-north-1


# Preprocessing

We are using the `datasets` library to download and preprocess the `imdb` dataset. After preprocessing, the dataset will be uploaded to our `sagemaker_session_bucket` to be used within our training job. The [imdb](http://ai.stanford.edu/~amaas/data/sentiment/) dataset consists of 25000 training and 25000 testing highly polar movie reviews.

In this section, we move the original data processing step so that it runs on SageMaker Processing.

In [43]:
from datasets import load_dataset

# load dataset
dataset = load_dataset('imdb')
dataset.save_to_disk('./dataset/')

Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...


                                           

Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.


100%|██████████| 3/3 [00:00<00:00, 596.49it/s]


In [59]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [61]:
dataset.set_format("pandas")
train_dataset = dataset["train"][:]
test_dataset = dataset["test"][:]

In [96]:
type(train_dataset)

pandas.core.frame.DataFrame

In [98]:
import pandas as pd

full_dataset = pd.concat([train_dataset, test_dataset])
full_dataset

Unnamed: 0,text,label
0,I rented I AM CURIOUS-YELLOW from my video sto...,0
1,"""I Am Curious: Yellow"" is a risible and preten...",0
2,If only to avoid making this type of film in t...,0
3,This film was probably inspired by Godard's Ma...,0
4,"Oh, brother...after hearing about this ridicul...",0
...,...,...
24995,Just got around to seeing Monster Man yesterda...,1
24996,I got this as part of a competition prize. I w...,1
24997,I got Monster Man in a box set of three films ...,1
24998,"Five minutes in, i started to feel how naff th...",1


In [106]:
import os
os.makedirs('./data', exist_ok=True)  # make sure the local output directory exists
train_dataset.to_csv('./data/train.csv',index=False)
test_dataset.to_csv('./data/test.csv',index=False)

In [120]:
# upload data to S3

!aws s3 cp data s3://sagemaker-cn-north-1-346044390830/hf-sm-pipeline/dataset/input --recursive

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
upload: data/test.csv to s3://sagemaker-cn-north-1-346044390830/hf-sm-pipeline/dataset/input/test.csv
upload: data/train.csv to s3://sagemaker-cn-north-1-346044390830/hf-sm-pipeline/dataset/input/train.csv


In [113]:
train_dataset = load_dataset('csv', data_files={'train':'data/train.csv'})
test_dataset = load_dataset('csv', data_files={'test':'data/test.csv'})

Using custom data configuration default-fdae08e6380708dd


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-fdae08e6380708dd/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e...




100%|██████████| 1/1 [00:00<00:00, 5629.94it/s]


100%|██████████| 1/1 [00:00<00:00, 71.75it/s]


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-fdae08e6380708dd/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e. Subsequent calls will reuse this data.




100%|██████████| 1/1 [00:00<00:00, 436.41it/s]
Using custom data configuration default-7938c6692a10877f


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-7938c6692a10877f/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e...




100%|██████████| 1/1 [00:00<00:00, 4332.96it/s]


100%|██████████| 1/1 [00:00<00:00, 77.14it/s]


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-7938c6692a10877f/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e. Subsequent calls will reuse this data.




100%|██████████| 1/1 [00:00<00:00, 559.69it/s]


In [121]:
%%writefile processing.py

# Tokenization
import argparse
import os

from datasets import load_dataset
from transformers import AutoTokenizer

input_data_path = "/opt/ml/processing/input_data"
output_data_path = "/opt/ml/processing/output_data"

if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument("--tokenizer_name", type=str, default="distilbert-base-uncased")
    parser.add_argument("--dataset_name", type=str, default="imdb")

    args, _ = parser.parse_known_args()

    print("Received arguments {}".format(args))

    
    # tokenizer used in preprocessing
    tokenizer_name = args.tokenizer_name # 'distilbert-base-uncased'

    # dataset used
    dataset_name = args.dataset_name # 'imdb'

    # s3 key prefix for the data
#     s3_prefix = 'samples/datasets/imdb'

    # load dataset
#     dataset = load_dataset(dataset_name)

    # download tokenizer
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    # tokenizer helper function
    def tokenize(batch):
        return tokenizer(batch['text'], padding='max_length', truncation=True)

    # load dataset
    train_dataset = load_dataset('csv', data_files={'train':os.path.join(input_data_path,'train.csv')})
    test_dataset = load_dataset('csv', data_files={'test':os.path.join(input_data_path,'test.csv')})
#     train_dataset, test_dataset = load_dataset(dataset_name, split=['train', 'test'])
#     test_dataset = test_dataset.shuffle().select(range(10000)) # smaller the size for test dataset to 10k 


    # tokenize dataset
    train_dataset = train_dataset.map(tokenize, batched=True)
    test_dataset = test_dataset.map(tokenize, batched=True)

    # set format for pytorch
    train_dataset = train_dataset.rename_column("label", "labels")
    train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
    test_dataset = test_dataset.rename_column("label", "labels")
    test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])


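    # save datasets under /opt/ml/processing/output_data
    # each object is a DatasetDict, so save_to_disk writes one subdirectory per split (train/ and test/)
    train_dataset.save_to_disk(output_data_path)
    test_dataset.save_to_disk(output_data_path)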

Overwriting processing.py
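
To make the effect of `padding='max_length', truncation=True` in `processing.py` more concrete, here is a toy sketch in plain Python. It does not call the real tokenizer: `pad_or_truncate`, the `max_length=8` default, the pad id `0`, and the token ids are all illustrative assumptions, and the real tokenizer also handles special tokens when truncating. What it shows is the shape of the result, namely fixed-length `input_ids` with a matching `attention_mask`.

```python
def pad_or_truncate(token_ids, max_length=8, pad_id=0):
    """Toy stand-in for padding='max_length', truncation=True (hypothetical helper)."""
    # truncation: keep at most max_length ids
    ids = list(token_ids[:max_length])
    # attention mask: 1 for real tokens, 0 for padding positions
    mask = [1] * len(ids)
    # padding: fill the remainder up to max_length with pad_id
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,
    }


short_example = pad_or_truncate([5, 6, 7, 8])          # shorter than max_length -> padded
long_example = pad_or_truncate(list(range(1, 13)))     # longer than max_length -> truncated

print(short_example)
print(len(long_example["input_ids"]))
```

Because every row ends up with the same length, the columns can later be exposed as tensors with `set_format('torch', ...)` without further padding in the training loop.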


In [122]:
from sagemaker.processing import (ProcessingInput, ProcessingOutput,
                                  ScriptProcessor)

processing_repository_uri = '727897471807.dkr.ecr.cn-north-1.amazonaws.com.cn/huggingface-pytorch-training:1.9-transformers4.12-gpu-py38-cu111-ubuntu20.04'
script_processor = ScriptProcessor(command=['python3'],
                image_uri=processing_repository_uri,
                role=role,
                instance_count=1,
                instance_type='ml.m5.2xlarge',
                base_job_name=base_job_name + '-processing')

prefix = 'hf-sm-pipeline/dataset'

input_data = 's3://{}/{}/input'.format(sagemaker_session_bucket, prefix)
output_data = 's3://{}/{}/output'.format(sagemaker_session_bucket, prefix)

tokenizer_name = 'distilbert-base-uncased'
dataset_name = 'imdb'

script_processor.run(code='processing.py',
                      inputs=[ProcessingInput(
                        source=input_data,
                        destination='/opt/ml/processing/input_data',
                        s3_data_distribution_type='ShardedByS3Key')],
                      outputs=[ProcessingOutput(destination=output_data,
                                                source='/opt/ml/processing/output_data',
                                                s3_upload_mode = 'Continuous')],
                      arguments=['--tokenizer_name', tokenizer_name,
                                '--dataset_name', dataset_name]
                     )
script_processor_job_description = script_processor.jobs[-1].describe()
print(script_processor_job_description)


Job Name:  huggingfaces-sm-demo-processing-2022-07-05-12-04-16-407
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-cn-north-1-346044390830/hf-sm-pipeline/dataset/input', 'LocalPath': '/opt/ml/processing/input_data', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-cn-north-1-346044390830/huggingfaces-sm-demo-processing-2022-07-05-12-04-16-407/input/code/processing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'output-1', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-cn-north-1-346044390830/hf-sm-pipeline/dataset/output', 'LocalPath': '/opt/ml/processing/output_data', 'S3UploadMode': 'Continuous'}}]
........................

# Fine-tuning & starting Sagemaker Training Job

To create a SageMaker training job we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In an Estimator we define which fine-tuning script should be used as `entry_point`, which `instance_type` should be used, which `hyperparameters` are passed in, and so on.



```python
huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./scripts',
                            base_job_name='huggingface-sdk-extension',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            transformers_version='4.4',
                            pytorch_version='1.6',
                            py_version='py36',
                            role=role,
                            hyperparameters = {'epochs': 1,
                                               'train_batch_size': 32,
                                               'model_name':'distilbert-base-uncased'
                                                })
```

When we create a SageMaker training job, SageMaker takes care of starting and managing all the required EC2 instances for us with the `huggingface` container, uploads the provided fine-tuning script `train.py`, and downloads the data from our `sagemaker_session_bucket` into the container at `/opt/ml/input/data`. Then, it starts the training job by running:

```bash
/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32
```

The `hyperparameters` you define in the `HuggingFace` estimator are passed in as named arguments. 
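
As a rough illustration of that mapping, the sketch below renders the `hyperparameters` dict from this notebook as `--name value` pairs. The helper `to_cli_args` is hypothetical and is not how SageMaker builds the command internally; sorting the keys is an assumption made only so the result matches the command shown above.

```python
import shlex


def to_cli_args(hyperparameters):
    """Render a hyperparameter dict as --name value pairs (illustrative helper)."""
    args = []
    for name in sorted(hyperparameters):
        args += [f"--{name}", str(hyperparameters[name])]
    return args


hyperparameters = {
    "epochs": 1,
    "train_batch_size": 32,
    "model_name": "distilbert-base-uncased",
}

# shlex.join (Python 3.8+) quotes arguments where needed
command = "/opt/conda/bin/python train.py " + shlex.join(to_cli_args(hyperparameters))
print(command)
```

Note that every value arrives in the script as a string, so `train.py` has to convert numeric hyperparameters back, for example with `type=int` in `argparse`.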

SageMaker provides useful properties about the training environment through various environment variables, including the following:

* `SM_MODEL_DIR`: A string that represents the path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting.

* `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.

* `SM_CHANNEL_XXXX`: A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the HuggingFace estimator’s fit call, named `train` and `test`, the environment variables `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` are set.
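
The `train.py` script itself is not shown in this notebook, so the following is only a sketch of how such a script could pick up the hyperparameters and the environment variables listed above. The argument names `--model_dir`, `--n_gpus`, `--training_dir`, and `--test_dir` are assumptions, and `SM_CHANNEL_TRAIN` is set by hand here purely to simulate the container when running locally.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # hyperparameters passed in by the estimator as named arguments
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")
    # SageMaker environment variables; defaults are None/0 outside a training container
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--n_gpus", type=int, default=int(os.environ.get("SM_NUM_GPUS", 0)))
    parser.add_argument("--training_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test_dir", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    # ignore any extra arguments the platform may add
    args, _ = parser.parse_known_args(argv)
    return args


# simulate the channel variable for a local dry run (assumed path)
os.environ.setdefault("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")

args = parse_args(["--epochs", "3"])
print(args.epochs, args.train_batch_size, args.training_dir)
```

Reading the paths from the environment rather than hard-coding them keeps the same script usable both inside the training container and on a development machine.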


To run your training job locally you can define `instance_type='local'` or `instance_type='local_gpu'` for GPU usage. _Note: this does not work within SageMaker Studio._


## Creating an Estimator and start a training job

In [127]:
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 1,
                 'train_batch_size': 32,
                 'model_name':'distilbert-base-uncased'
                 }

huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./scripts',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            role=role,
                            transformers_version='4.12',
                            pytorch_version='1.9',
                            py_version='py38',
                            hyperparameters = hyperparameters)

In [128]:
# starting the train job with our uploaded datasets as input

training_input_path = output_data + '/train'
test_input_path = output_data + '/test'

huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

2022-07-05 12:13:14 Starting - Starting the training job...
2022-07-05 12:13:43 Starting - Preparing the instances for trainingProfilerReport-1657023194: InProgress
.........
2022-07-05 12:15:05 Downloading - Downloading input data...
2022-07-05 12:15:41 Training - Downloading the training image........................
2022-07-05 12:19:42 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-07-05 12:19:35,750 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-07-05 12:19:35,772 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-07-05 12:19:35,778 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-07-05 12:19:36,286 sagemaker-training-toolkit INFO     Invoking user script
Training En

## Batch transform

Now let's use batch transform to run inference on a large amount of data.

In [129]:
from sagemaker.model import Model

image_uri = '727897471807.dkr.ecr.cn-north-1.amazonaws.com.cn/huggingface-pytorch-inference:1.9-transformers4.12-cpu-py38-ubuntu20.04'

hf_model = Model(image_uri=image_uri, 
              model_data=huggingface_estimator.model_data, 
              role=role)

### Prepare inference data and upload it to S3

In [145]:
%%writefile test_data_tfm.jsonl
{"inputs":"I love using the new Inference DLC."}
{"inputs":"I love using the new Inference DLC."}
{"inputs":"I love using the new Inference DLC."}
{"inputs":"I love using the new Inference DLC."}

Writing test_data_tfm.jsonl
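
For larger inputs it is usually easier to generate the JSON Lines file than to type it by hand. The sketch below writes one JSON object per line with the same `"inputs"` key used above. The second review text and the file name `generated_batch_input.jsonl` are made up for illustration, and the file is written to a temporary directory so nothing in the working directory is overwritten.

```python
import json
import os
import tempfile

texts = [
    "I love using the new Inference DLC.",
    "The plot was thin and the acting was worse.",  # hypothetical extra example
]

# one JSON object per line, each with the "inputs" key
lines = [json.dumps({"inputs": text}) for text in texts]

# hypothetical output location
out_path = os.path.join(tempfile.mkdtemp(), "generated_batch_input.jsonl")
with open(out_path, "w") as f:
    f.write("\n".join(lines) + "\n")

print(out_path)
```

Using `json.dumps` for each record takes care of escaping quotes and newlines inside the text, which matters because the transform job later splits the file by line.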


In [146]:
!aws s3 cp test_data_tfm.jsonl s3://sagemaker-cn-north-1-346044390830/hf-sm-pipeline/tfm/input/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
upload: ./test_data_tfm.jsonl to s3://sagemaker-cn-north-1-346044390830/hf-sm-pipeline/tfm/input/test_data_tfm.jsonl


In [148]:
## This is another way to create batch transform job with huggingface model

# from sagemaker.huggingface.model import HuggingFaceModel

# huggingface_model = HuggingFaceModel(
#     role=role, 
#     model_data=huggingface_estimator.model_data, 
#     transformers_version='4.12', 
#     pytorch_version='1.9', 
#     py_version='py38'
# )

# tfm_output = 's3://sagemaker-cn-north-1-346044390830/hf-sm-pipeline/tfm/output'

# # create transformer to run a batch job
# batch_job = huggingface_model.transformer(
#     instance_count=1,
#     instance_type='ml.m5.xlarge',
#     strategy='SingleRecord',
#     output_path=tfm_output, 
# )

# test_data = 's3://sagemaker-cn-north-1-346044390830/hf-sm-pipeline/tfm/input'

# # starts batch transform job and uses S3 data as input
# batch_job.transform(
#     data=test_data,
#     content_type='application/json',    
#     split_type='Line'
# )

2022-07-05T15:43:06,772 [INFO ] main com.amazonaws.ml.mms.ModelServer - 
MMS Home: /opt/conda/lib/python3.8/site-packages
Current directory: /
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 4
Max heap size: 3499 M
Python executable: /opt/conda/bin/python3.8
Config file: /etc/sagemaker-mms.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8080
Model Store: /.sagemaker/mms/models
Initial Models: ALL
Log dir: null
Metrics dir: null
Netty threads: 0
Netty client threads: 0
Default workers per model: 4
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Preload model: false
Prefer direct buffer: false
2022-07-05T15:43:06,846 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attac

#### Create a SageMaker Batch Transform job

[Hugging Face SageMaker Batch Transform](https://huggingface.co/docs/sagemaker/inference#run-batch-transform-with-transformers-and-sagemaker)

In [149]:
tfm_output = 's3://sagemaker-cn-north-1-346044390830/hf-sm-pipeline/tfm/output'

tfm = hf_model.transformer(
    instance_count=1, 
    instance_type='ml.m5.xlarge', 
    output_path=tfm_output, 
    strategy='SingleRecord',
#     max_concurrent_transforms=None, 
#     max_payload=None
    )

test_data = 's3://sagemaker-cn-north-1-346044390830/hf-sm-pipeline/tfm/input'

tfm.transform(
    data=test_data, 
    data_type='S3Prefix', 
    split_type='Line',
    content_type='application/json',
    wait=True, 
    logs=True)

2022-07-05T15:52:49,572 [INFO ] main com.amazonaws.ml.mms.ModelServer - 
MMS Home: /opt/conda/lib/python3.8/site-packages
Current directory: /
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 4
Max heap size: 3499 M
Python executable: /opt/conda/bin/python3.8
Config file: /etc/sagemaker-mms.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8080
Model Store: /.sagemaker/mms/models
Initial Models: ALL
Log dir: null
Metrics dir: null
Netty threads: 0
Netty client threads: 0
Default workers per model: 4
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Preload model: false
Prefer direct buffer: false
2022-07-05T15:52:49,645 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attac

In [150]:
import json
from sagemaker.s3 import S3Downloader,S3Uploader,s3_path_join
from ast import literal_eval
# creating s3 uri for result file -> input file + .out
output_file = "test_data_tfm.jsonl.out"
output_path = s3_path_join(tfm_output,output_file)

# download file
S3Downloader.download(output_path,'.')

batch_transform_result = []
with open(output_file) as f:
    for line in f:
        # converts jsonline array to normal array
        line = "[" + line.replace("[","").replace("]",",") + "]"
        # accumulate instead of overwriting, so files with several lines are handled too
        batch_transform_result.extend(literal_eval(line))
        
# print results 
print(batch_transform_result[:3])

[{'label': 'LABEL_1', 'score': 0.9220663905143738}, {'label': 'LABEL_1', 'score': 0.9220663905143738}, {'label': 'LABEL_1', 'score': 0.9220663905143738}]
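
The string replacement above works for this output, but it would break if a prediction ever contained a `[` or `]` character. As an alternative sketch, the standard-library JSON decoder can walk through back-to-back JSON values directly. It assumes the `.out` file holds JSON arrays written one after another, which is what the replacement logic above implies; the helper name `parse_transform_output` and the sample scores are made up for illustration.

```python
import json


def parse_transform_output(text):
    """Parse back-to-back JSON values such as '[{...}][{...}]' into one flat list."""
    decoder = json.JSONDecoder()
    results, pos = [], 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over it here
        if text[pos].isspace():
            pos += 1
            continue
        value, pos = decoder.raw_decode(text, pos)
        # a record may be a list of predictions or a single object
        results.extend(value if isinstance(value, list) else [value])
    return results


# hypothetical sample in the assumed output layout
sample = '[{"label": "LABEL_1", "score": 0.92}][{"label": "LABEL_0", "score": 0.71}]\n'
parsed = parse_transform_output(sample)
print(parsed)
```

To apply it to the downloaded file, read the whole file into a string and pass it to the function: `parse_transform_output(open(output_file).read())`.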


---

# Build SageMaker Pipeline

In [151]:
import sys
import boto3
import sagemaker


sagemaker_session = sagemaker.session.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()
sagemaker_session_bucket = sagemaker_session.default_bucket()
model_package_group_name = "HuggingFacesTCModelPackageGroupName"

In [170]:
dataset_prefix = 'hf-sm-pipeline/dataset'
batch_prefix = 'hf-sm-pipeline/tfm'

input_data_uri = 's3://{}/{}/input'.format(sagemaker_session_bucket, dataset_prefix)
output_data_uri = 's3://{}/{}/output'.format(sagemaker_session_bucket, dataset_prefix)

batch_data_uri = 's3://{}/{}/input'.format(sagemaker_session_bucket, batch_prefix)

In [171]:
from sagemaker.workflow.parameters import (
    ParameterInteger,
    ParameterString,
    ParameterFloat,
)


processing_instance_count = ParameterInteger(name="ProcessingInstanceCount", default_value=1)
instance_type = ParameterString(name="TrainingInstanceType", default_value="ml.m5.xlarge")
# model_approval_status = ParameterString(
#     name="ModelApprovalStatus", default_value="PendingManualApproval"
# )
input_data = ParameterString(
    name="InputData",
    default_value=input_data_uri,
)
batch_data = ParameterString(
    name="BatchData",
    default_value=batch_data_uri,
)
# mse_threshold = ParameterFloat(name="MseThreshold", default_value=6.0)

### Define ProcessingStep

In [160]:
from sagemaker.processing import (ProcessingInput, ProcessingOutput,
                                  ScriptProcessor)
from sagemaker.workflow.steps import ProcessingStep

processing_repository_uri = '727897471807.dkr.ecr.cn-north-1.amazonaws.com.cn/huggingface-pytorch-training:1.9-transformers4.12-gpu-py38-cu111-ubuntu20.04'

script_processor = ScriptProcessor(command=['python3'],
                image_uri=processing_repository_uri,
                role=role,
                instance_count=1,
                instance_type='ml.m5.2xlarge',
                base_job_name=base_job_name + '-processing')


tokenizer_name = 'distilbert-base-uncased'
dataset_name = 'imdb'


step_process = ProcessingStep(
    name="TextTokenizerProcess",
    processor=script_processor,
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input_data"),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/output_data/train", destination=output_data_uri+'/train/'),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/output_data/test", destination=output_data_uri+'/test/'),
    ],
    job_arguments=['--tokenizer_name', tokenizer_name, '--dataset_name', dataset_name],
    code="processing.py",
)


### Define a Training Step to Train a Model

In [161]:
from sagemaker.huggingface import HuggingFace
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 1,
                 'train_batch_size': 32,
                 'model_name':'distilbert-base-uncased'
                 }

huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./scripts',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            role=role,
                            transformers_version='4.12',
                            pytorch_version='1.9',
                            py_version='py38',
                            hyperparameters = hyperparameters)


step_train = TrainingStep(
    name="HuggingFaceTextClassificationTrain",
    estimator=huggingface_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
        ),
        "test": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
        ),
    },
)

#### You could also add a model evaluation step here
We skip this step in this demo.

### Define a Create Model Step to Create a Model

In [162]:
from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker.inputs import CreateModelInput
from sagemaker.workflow.steps import CreateModelStep

huggingface_model = HuggingFaceModel(
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts, 
    transformers_version='4.12', 
    pytorch_version='1.9', 
    py_version='py38',
    sagemaker_session=sagemaker_session,
    role=role,
)


inputs = CreateModelInput(
    instance_type="ml.p3.2xlarge",
)
step_create_model = CreateModelStep(
    name="HuggingFaceTextClassificationCreateModel",
    model=huggingface_model,
    inputs=inputs,
)

### Define a Transform Step to Perform Batch Transformation

In [172]:
from sagemaker.transformer import Transformer

from sagemaker.inputs import TransformInput
from sagemaker.workflow.steps import TransformStep

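# note: this reuses dataset_prefix, so transform results are written under the dataset output prefix;
# switch to batch_prefix if you want them stored next to the batch input instead
batch_output = 's3://{}/{}/output'.format(sagemaker_session_bucket, dataset_prefix)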

transformer = Transformer(
    model_name=step_create_model.properties.ModelName,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    output_path=batch_output,
    strategy='SingleRecord',
)


step_transform = TransformStep(
    name="HuggingFaceTextClassificationTransform", 
    transformer=transformer, 
    inputs=TransformInput(data=batch_data,
                         split_type='Line',
                         content_type='application/json',)
)

### Define a Pipeline of Parameters and Steps

In [173]:
from sagemaker.workflow.pipeline import Pipeline


pipeline_name = "HuggingFaceTextClassificationPipeline"
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        processing_instance_count,
        instance_type,
        input_data,
        batch_data,
    ],
    steps=[step_process, step_train, step_create_model, step_transform],
)
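
The four steps form a simple chain because each one consumes a property of the previous one (for example, the training step reads `step_process.properties...S3Uri`). As we understand it, SageMaker Pipelines derives the execution order from these references and not from the position in the `steps` list. The sketch below reproduces that ordering with the standard-library `graphlib` module; the `depends_on` mapping is written out by hand from the step definitions above and is not produced by the SDK.

```python
from graphlib import TopologicalSorter  # available in Python 3.9+

# step name -> set of steps whose outputs it consumes (hand-written from the cells above)
depends_on = {
    "TextTokenizerProcess": set(),
    "HuggingFaceTextClassificationTrain": {"TextTokenizerProcess"},
    "HuggingFaceTextClassificationCreateModel": {"HuggingFaceTextClassificationTrain"},
    "HuggingFaceTextClassificationTransform": {"HuggingFaceTextClassificationCreateModel"},
}

# static_order yields every step after all of its dependencies
order = list(TopologicalSorter(depends_on).static_order())
for position, step_name in enumerate(order, start=1):
    print(position, step_name)
```

The same order shows up in the start and end times of the execution steps listed in the Lineage section.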

#### Examining the pipeline definition (Optional)

In [174]:
import json


definition = json.loads(pipeline.definition())
definition

{'Version': '2020-12-01',
 'Metadata': {},
 'Parameters': [{'Name': 'ProcessingInstanceCount',
   'Type': 'Integer',
   'DefaultValue': 1},
  {'Name': 'TrainingInstanceType',
   'Type': 'String',
   'DefaultValue': 'ml.m5.xlarge'},
  {'Name': 'InputData',
   'Type': 'String',
   'DefaultValue': 's3://sagemaker-cn-north-1-346044390830/hf-sm-pipeline/dataset/input'},
  {'Name': 'BatchData',
   'Type': 'String',
   'DefaultValue': 's3://sagemaker-cn-north-1-346044390830/hf-sm-pipeline/tfm/input'}],
 'PipelineExperimentConfig': {'ExperimentName': {'Get': 'Execution.PipelineName'},
  'TrialName': {'Get': 'Execution.PipelineExecutionId'}},
 'Steps': [{'Name': 'TextTokenizerProcess',
   'Type': 'Processing',
   'Arguments': {'ProcessingResources': {'ClusterConfig': {'InstanceType': 'ml.m5.2xlarge',
      'InstanceCount': 1,
      'VolumeSizeInGB': 30}},
    'AppSpecification': {'ImageUri': '727897471807.dkr.ecr.cn-north-1.amazonaws.com.cn/huggingface-pytorch-training:1.9-transformers4.12-gpu-
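
The full definition is long, so it can help to pull out only the part of interest. The sketch below works on a trimmed, hand-copied version of the `Parameters` section printed above (not on a live `pipeline.definition()` call) and collects the default value of each pipeline parameter.

```python
# trimmed copy of the "Parameters" section from the printed definition
definition = {
    "Version": "2020-12-01",
    "Parameters": [
        {"Name": "ProcessingInstanceCount", "Type": "Integer", "DefaultValue": 1},
        {"Name": "TrainingInstanceType", "Type": "String", "DefaultValue": "ml.m5.xlarge"},
        {"Name": "InputData", "Type": "String",
         "DefaultValue": "s3://sagemaker-cn-north-1-346044390830/hf-sm-pipeline/dataset/input"},
        {"Name": "BatchData", "Type": "String",
         "DefaultValue": "s3://sagemaker-cn-north-1-346044390830/hf-sm-pipeline/tfm/input"},
    ],
}

# map each parameter name to its default value
defaults = {p["Name"]: p["DefaultValue"] for p in definition["Parameters"]}
for name, value in defaults.items():
    print(f"{name} = {value!r}")
```

In the notebook, the same comprehension can be applied to the `definition` dict loaded in the cell above.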

### Submit the pipeline to SageMaker and start execution

In [175]:
pipeline.upsert(role_arn=role)

{'PipelineArn': 'arn:aws-cn:sagemaker:cn-north-1:346044390830:pipeline/huggingfacetextclassificationpipeline',
 'ResponseMetadata': {'RequestId': '25c604ce-5bd4-4cc6-bdc3-ec5ec265db89',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '25c604ce-5bd4-4cc6-bdc3-ec5ec265db89',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '109',
   'date': 'Tue, 05 Jul 2022 16:42:06 GMT'},
  'RetryAttempts': 0}}

In [176]:
execution = pipeline.start()

#### Lineage
Review the lineage of the artifacts generated by the pipeline.

In [178]:
import time
from sagemaker.lineage.visualizer import LineageTableVisualizer


viz = LineageTableVisualizer(sagemaker.session.Session())
for execution_step in reversed(execution.list_steps()):
    print(execution_step)
    display(viz.show(pipeline_execution_step=execution_step))
    time.sleep(5)

{'StepName': 'TextTokenizerProcess', 'StartTime': datetime.datetime(2022, 7, 5, 16, 42, 11, 244000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2022, 7, 5, 16, 49, 19, 235000, tzinfo=tzlocal()), 'StepStatus': 'Succeeded', 'AttemptCount': 0, 'Metadata': {'ProcessingJob': {'Arn': 'arn:aws-cn:sagemaker:cn-north-1:346044390830:processing-job/pipelines-5m2w4nnpdsgr-texttokenizerprocess-ezqmd0krb0'}}}


Unnamed: 0,Name/Source,Direction,Type,Association Type,Lineage Type
0,s3://...cb9825c1762c773/input/code/processing.py,Input,DataSet,ContributedTo,artifact
1,s3://...46044390830/hf-sm-pipeline/dataset/input,Input,DataSet,ContributedTo,artifact
2,72789...nsformers4.12-gpu-py38-cu111-ubuntu20.04,Input,Image,ContributedTo,artifact
3,s3://...0830/hf-sm-pipeline/dataset/output/test/,Output,DataSet,Produced,artifact
4,s3://...830/hf-sm-pipeline/dataset/output/train/,Output,DataSet,Produced,artifact


{'StepName': 'HuggingFaceTextClassificationTrain', 'StartTime': datetime.datetime(2022, 7, 5, 16, 49, 19, 637000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2022, 7, 5, 17, 10, 48, 415000, tzinfo=tzlocal()), 'StepStatus': 'Succeeded', 'AttemptCount': 0, 'Metadata': {'TrainingJob': {'Arn': 'arn:aws-cn:sagemaker:cn-north-1:346044390830:training-job/pipelines-5m2w4nnpdsgr-huggingfacetextclass-scop50m7bv'}}}


Unnamed: 0,Name/Source,Direction,Type,Association Type,Lineage Type
0,s3://...0830/hf-sm-pipeline/dataset/output/test/,Input,DataSet,ContributedTo,artifact
1,s3://...830/hf-sm-pipeline/dataset/output/train/,Input,DataSet,ContributedTo,artifact
2,72789...nsformers4.12-gpu-py38-cu111-ubuntu20.04,Input,Image,ContributedTo,artifact
3,s3://...TextClass-sCop50M7bV/output/model.tar.gz,Output,Model,Produced,artifact


{'StepName': 'HuggingFaceTextClassificationCreateModel', 'StartTime': datetime.datetime(2022, 7, 5, 17, 10, 49, 350000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2022, 7, 5, 17, 10, 50, 505000, tzinfo=tzlocal()), 'StepStatus': 'Succeeded', 'AttemptCount': 0, 'Metadata': {'Model': {'Arn': 'arn:aws-cn:sagemaker:cn-north-1:346044390830:model/pipelines-5m2w4nnpdsgr-huggingfacetextclass-k5zczibord'}}}


None

{'StepName': 'HuggingFaceTextClassificationTransform', 'StartTime': datetime.datetime(2022, 7, 5, 17, 10, 51, 109000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2022, 7, 5, 17, 18, 53, 251000, tzinfo=tzlocal()), 'StepStatus': 'Succeeded', 'AttemptCount': 0, 'Metadata': {'TransformJob': {'Arn': 'arn:aws-cn:sagemaker:cn-north-1:346044390830:transform-job/pipelines-5m2w4nnpdsgr-huggingfacetextclass-9vjmakhl7i'}}}


Unnamed: 0,Name/Source,Direction,Type,Association Type,Lineage Type
0,s3://...TextClass-sCop50M7bV/output/model.tar.gz,Input,Model,ContributedTo,artifact
1,72789...nsformers4.12-gpu-py38-cu111-ubuntu20.04,Input,Image,ContributedTo,artifact
2,s3://...-1-346044390830/hf-sm-pipeline/tfm/input,Input,DataSet,ContributedTo,artifact
3,s3://...6044390830/hf-sm-pipeline/dataset/output,Output,DataSet,Produced,artifact
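
The step records printed above also carry start and end times, which makes it easy to see where the pipeline spends its time. The sketch below copies the timestamps of the first two steps by hand and computes their durations. `timezone.utc` is used as a stand-in for the `tzlocal()` value in the real output, which does not affect the differences.

```python
from datetime import datetime, timezone

# timestamps copied from the execution step output above (tzinfo is a stand-in)
steps = [
    {
        "StepName": "TextTokenizerProcess",
        "StartTime": datetime(2022, 7, 5, 16, 42, 11, 244000, tzinfo=timezone.utc),
        "EndTime": datetime(2022, 7, 5, 16, 49, 19, 235000, tzinfo=timezone.utc),
    },
    {
        "StepName": "HuggingFaceTextClassificationTrain",
        "StartTime": datetime(2022, 7, 5, 16, 49, 19, 637000, tzinfo=timezone.utc),
        "EndTime": datetime(2022, 7, 5, 17, 10, 48, 415000, tzinfo=timezone.utc),
    },
]

# duration of each step in seconds
durations = {
    s["StepName"]: (s["EndTime"] - s["StartTime"]).total_seconds() for s in steps
}
for name, seconds in durations.items():
    print(f"{name}: {seconds / 60:.1f} min")
```

In the notebook, the same comprehension can be run over `execution.list_steps()` once all steps have finished and have an `EndTime`.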
