## Setup

Let's start by specifying:
- The S3 bucket and prefix that you want to use for training and model data. **This should be within the same region as the Notebook Instance, training, and hosting.**
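
The same-region requirement can be checked before any data is uploaded. The following is a minimal sketch, not part of the original notebook: the helper names `bucket_region` and `check_same_region` are hypothetical, and the functions take any object exposing boto3's S3 `get_bucket_location` call, so nothing here touches AWS until a real client is passed in.

```python
def bucket_region(s3_client, bucket_name):
    """Return the region a bucket lives in, based on S3 GetBucketLocation."""
    response = s3_client.get_bucket_location(Bucket=bucket_name)
    location = response.get("LocationConstraint")
    # S3 reports buckets in us-east-1 with an empty (None) LocationConstraint.
    if location is None:
        return "us-east-1"
    # "EU" is a legacy value that S3 may return for eu-west-1.
    if location == "EU":
        return "eu-west-1"
    return location


def check_same_region(s3_client, bucket_name, session_region):
    """Raise ValueError if the bucket is not in the session's region."""
    region = bucket_region(s3_client, bucket_name)
    if region != session_region:
        raise ValueError(
            f"Bucket {bucket_name!r} is in {region}, "
            f"but the session uses {session_region}"
        )
    return region


# Hypothetical usage, once `bucket` has been defined in the next cell:
#   import boto3
#   session = boto3.session.Session()
#   check_same_region(session.client("s3"), bucket, session.region_name)
```

Passing the client in as an argument keeps the check easy to exercise with a stub object instead of real credentials.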

In [3]:
bucket = 'sm-nlp-data'
prefix = 'ie-baseline'

In [4]:
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

In [5]:
role

'arn:aws:iam::093729152554:role/service-role/AWSNeptuneNotebookRole-NepTestRole'

## Download data
Download [DuIE 2.0](https://dataset-bj.cdn.bcebos.com/qianyan/DuIE_2_0.zip) and extract the JSON files to the `{project_root}/data` folder.

In [None]:
%%bash
wget https://dataset-bj.cdn.bcebos.com/qianyan/DuIE_2_0.zip
mkdir -p data
unzip -j DuIE_2_0.zip -d data
ls data
rm DuIE_2_0.zip

## Transform the data into an easier-to-understand form

In [None]:
%%bash
mkdir -p generated
python trans.py
ls generated

In [None]:
!ls generated

## Upload Processed Data to S3

In [None]:
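def upload_to_s3(bucket, prefix, channel, file_path, file_name):
    """Upload a local file to s3://{bucket}/{prefix}/{channel}/{file_name}."""
    s3 = boto3.resource("s3")
    key = prefix + "/" + channel + "/" + file_name
    # Use a context manager so the file handle is closed after the upload.
    with open(file_path, "rb") as data:
        s3.Bucket(bucket).put_object(Key=key, Body=data)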

upload_to_s3(bucket, prefix, "train", "generated/train_data_me.json", "train_data_me.json")
upload_to_s3(bucket, prefix, "train", "generated/dev_data_me.json", "dev_data_me.json")
upload_to_s3(bucket, prefix, "train", "generated/schemas_me.json", "schemas_me.json")
upload_to_s3(bucket, prefix, "train", "generated/all_chars_me.json", "all_chars_me.json")
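
Each uploaded object ends up under the key `prefix/channel/file_name`, which is also how the S3 URIs passed to training are formed. As a small sketch, the hypothetical helper `s3_uri` below (not part of the original notebook) builds those URIs from the same pieces instead of hard-coding them:

```python
def s3_uri(bucket, prefix, channel, file_name=None):
    """Build an S3 URI matching the key layout prefix/channel/file_name."""
    # Strip stray slashes so the joined path never contains "//".
    parts = [bucket.strip("/"), prefix.strip("/"), channel.strip("/")]
    if file_name is not None:
        parts.append(file_name)
    return "s3://" + "/".join(parts)


# With the values used in this notebook:
print(s3_uri("sm-nlp-data", "ie-baseline", "train", "train_data_me.json"))
# -> s3://sm-nlp-data/ie-baseline/train/train_data_me.json
```

Deriving the URIs from `bucket` and `prefix` keeps them in sync if either value changes.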

In [None]:
from sagemaker.pytorch.estimator import PyTorch

# Short run: one epoch on a CPU instance.
pytorch_estimator = PyTorch('sm_train.py',
                            role=role,
                            instance_type='ml.c5.4xlarge',
                            instance_count=1,
                            framework_version='1.8.0',
                            py_version='py3',
                            source_dir='./',
                            # output_path is an estimator argument, not an input channel
                            output_path='s3://sm-nlp-data/ie-baseline/outputs/',
                            hyperparameters={'epochs': 1, 'batch-size': 64, 'learning-rate': 0.001})
# fit() takes only input channels; each key becomes a channel name.
pytorch_estimator.fit({'train': 's3://sm-nlp-data/ie-baseline/train/train_data_me.json',
                       'test': 's3://sm-nlp-data/ie-baseline/train/dev_data_me.json'})

2021-06-10 07:29:23 Starting - Starting the training job...
2021-06-10 07:29:46 Starting - Launching requested ML instancesProfilerReport-1623310145: InProgress
......
2021-06-10 07:30:46 Starting - Preparing the instances for training......
2021-06-10 07:31:48 Downloading - Downloading input data
2021-06-10 07:31:48 Training - Downloading the training image.....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-06-10 07:32:30,434 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-06-10 07:32:30,436 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-06-10 07:32:30,444 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-06-10 07:32:31,870 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-06-10 07:32:34,474 sage

In [None]:
from sagemaker.pytorch.estimator import PyTorch

# Longer run: 100 epochs on a GPU instance.
pytorch_estimator = PyTorch('sm_train.py',
                            role=role,
                            instance_type='ml.p2.16xlarge',
                            instance_count=1,
                            framework_version='1.8.0',
                            py_version='py3',
                            source_dir='./',
                            hyperparameters={'epochs': 100, 'batch-size': 64, 'learning-rate': 0.001})
pytorch_estimator.fit({'train': 's3://sm-nlp-data/ie-baseline/train/train_data_me.json',
                       'test': 's3://sm-nlp-data/ie-baseline/train/dev_data_me.json'})
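
In script mode, SageMaker passes the `hyperparameters` dict to the entry point as command-line arguments (for example `--epochs 100 --batch-size 64 --learning-rate 0.001`) and exposes each input channel through an `SM_CHANNEL_<NAME>` environment variable. The contents of `sm_train.py` are not shown here, so the following is only a sketch of how such a script might read those values. The function name `parse_args` and the local fallback paths are assumptions for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and channel paths the way a script-mode job supplies them."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as CLI flags named after the dict keys.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--learning-rate", type=float, default=0.001)

    # Channel and model directories come from environment variables set by
    # SageMaker; the fallback paths are hypothetical, for local runs only.
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "generated"))
    parser.add_argument("--test", default=os.environ.get("SM_CHANNEL_TEST", "generated"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))

    return parser.parse_args(argv)


# Example: simulate the arguments produced by the hyperparameters above.
args = parse_args(["--epochs", "100", "--batch-size", "64", "--learning-rate", "0.001"])
print(args.epochs, args.batch_size, args.learning_rate)
```

Reading the channel paths from the environment with a local fallback lets the same script run both inside a training job and directly from the notebook.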

In [None]:
!python sm_