# 1D sub-pillar modeling: Transformers

Runs `huggingface-multihead` on SageMaker to train transformer-based classifiers.

* 1D pillars and subpillars preprocessing
* Multihead (per pillar) transformer sequence classification
* SageMaker training jobs
* Macro precision, recall, and F-score evaluation
* MLflow tracking
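
As a rough illustration of the macro-averaged metrics listed above, the sketch below computes them in plain Python for a single classification head with one label per example. Both simplifications are assumptions made for illustration: the function name `macro_precision_recall_fscore` is hypothetical, and the evaluation code used by the training script is not shown in this notebook.

```python
from collections import Counter


def macro_precision_recall_fscore(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0, 0.0

    # Count true positives, false positives and false negatives per label.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    precisions, recalls, fscores = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        fscores.append(f)

    # Macro averaging: every label contributes equally, whatever its frequency.
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(fscores) / n


precision, recall, fscore = macro_precision_recall_fscore(
    ["a", "a", "b", "b"], ["a", "b", "b", "b"]
)
print(precision, recall, fscore)
```

Because each label is weighted equally, rare labels influence the macro scores as much as frequent ones.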


In [None]:
import os
import sys

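# Make the repository root (three levels up) importable, e.g. for the `deep` package.
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..", "..", "..")))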

In [None]:
import sagemaker
from deep.constants import DEV_BUCKET
from deep.utils import formatted_time

sess = sagemaker.Session(default_bucket=DEV_BUCKET.name)
job_name = f"1D-test-{formatted_time()}" 

## Data

In [None]:
import pandas as pd

train_df = pd.read_csv("../../../data/frameworks_data/data_v0.5/data_v0.5_train.csv")
val_df = pd.read_csv("../../../data/frameworks_data/data_v0.5/data_v0.5_val.csv")

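sample = False  # Set to True to run on a small random sample for faster experiments.

if sample:
    # Cap n at the frame length so sampling does not fail on small datasets.
    train_df = train_df.sample(n=min(1000, len(train_df)))
    val_df = val_df.sample(n=min(1000, len(val_df)))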

In [None]:
input_path = DEV_BUCKET / 'training' / 'input_data' / job_name  # Do not change this

train_path = str(input_path / 'train_df.pickle')
val_path = str(input_path / 'test_df.pickle')

# Protocol 4 is required: the SageMaker container runs Python 3.6, which cannot read protocol 5 pickles.
train_df.to_pickle(train_path, protocol=4)
val_df.to_pickle(val_path, protocol=4)
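
The protocol constraint can be checked with the standard library alone. The sketch below is only an illustration: it pickles a small stand-in dictionary (hypothetical data, not the real DataFrame) and reads the protocol number back from the serialized bytes.

```python
import pickle
import sys

# Hypothetical stand-in for a row of training data.
payload = {"excerpt": "example text", "labels": [0, 1]}

blob = pickle.dumps(payload, protocol=4)

# Pickles written with protocol 2 or later start with the PROTO opcode (0x80),
# followed by one byte holding the protocol number.
written_protocol = blob[1]
print("written with protocol:", written_protocol)

# The highest protocol depends on the interpreter: 4 on Python 3.6, 5 from 3.8 on.
print("highest protocol here:", pickle.HIGHEST_PROTOCOL, "on Python", sys.version_info[:2])

restored = pickle.loads(blob)
```

Pinning `protocol=4` explicitly keeps the files readable by the older interpreter even when the notebook kernel runs a newer Python.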

## SageMaker

In [None]:
# GPU instances

instances = [
    'ml.p2.xlarge',
    'ml.p3.2xlarge'
]

The hyperparameters are passed as command-line arguments to the training script.

You can add or change them as needed, but keep `tracking_uri` and `experiment_name`, which MLflow uses.
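
To make the mechanism concrete, here is a minimal sketch of how a training script might parse these arguments with `argparse`. It is not the actual `train.py`: the helper names `str_to_bool` and `parse_args`, the default values, and the example tracking URI are all assumptions for illustration. Values arrive as text, so the boolean flag needs explicit conversion.

```python
import argparse


def str_to_bool(value):
    """Convert 'True'/'False'-style strings, since arguments arrive as text."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_args(argv=None):
    """Parse the hyperparameters used in this notebook (defaults are hypothetical)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")
    parser.add_argument("--tracking_uri", type=str, required=True)
    parser.add_argument("--experiment_name", type=str, required=True)
    parser.add_argument("--iterative", type=str_to_bool, default=False)
    parser.add_argument("--loss", type=str, default="ce")
    # parse_known_args tolerates extra flags that this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args([
    "--epochs", "5",
    "--tracking_uri", "http://localhost:5000",  # placeholder URI
    "--experiment_name", "1D-multihead-transformers",
    "--iterative", "False",
    "--loss", "focal",
])
print(args.epochs, args.iterative, args.loss)
```

Marking `tracking_uri` and `experiment_name` as required in such a parser would make a missing MLflow setting fail early rather than midway through training.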

The `PyTorch` class is part of the SageMaker Python SDK. Most of its parameters should be left as they are. The ones you may want to change are:

- `instance_type`: the instance type to train on
- `source_dir`: the directory containing your training script. Prefer global variables over hard-coded values where possible.

In [None]:
from sagemaker.pytorch import PyTorch

from deep.constants import MLFLOW_SERVER, SAGEMAKER_ROLE

hyperparameters = {
    'epochs': 5,
    'model_name': 'distilbert-base-uncased',
    'tracking_uri': MLFLOW_SERVER,
    'experiment_name': '1D-multihead-transformers',
    'iterative': False,
    'loss': 'focal'
}

estimator = PyTorch(
    entry_point='train.py',
    source_dir='../../../scripts/training/oguz/huggingface-multihead',
    output_path=str(DEV_BUCKET / 'models/'),
    code_location=str(input_path),
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=SAGEMAKER_ROLE,
    framework_version='1.8',
    py_version='py36',
    hyperparameters=hyperparameters,
    job_name=job_name,
)

In [None]:
fit_arguments = {
    'train': str(input_path),
    'test': str(input_path)
}

In [None]:
# Fit the estimator

estimator.fit(fit_arguments, job_name=job_name)

## Debugging

In [None]:
#!pip install cloudpathlib
#!pip install mlflow
#!pip install transformers

In [None]:
from deep.constants import MLFLOW_SERVER, SAGEMAKER_ROLE

PATH = os.path.abspath('../../../../deep-experiments/scripts/training/oguz/huggingface-multihead/train.py')

In [None]:
%env SM_OUTPUT_DATA_DIR=''
%env SM_MODEL_DIR=''
%env SM_NUM_GPUS=1
%env SM_CHANNEL_TRAIN={input_path}
%env SM_CHANNEL_TEST={input_path}

!python {PATH} --epochs 3 --tracking_uri {MLFLOW_SERVER} --experiment_name {'1D-multihead-transformers'} --model_name {'distilbert-base-uncased'}
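
The `%env` lines above emulate the environment variables that SageMaker sets inside the training container. As a hedged sketch of the consuming side, a script could collect them as follows. The helper `read_sagemaker_env` and the dictionary keys are hypothetical; only the variable names come from the cell above.

```python
import os


def read_sagemaker_env(environ=os.environ):
    """Collect the SageMaker-style settings used in this notebook from a mapping."""
    return {
        "train_dir": environ.get("SM_CHANNEL_TRAIN"),
        "test_dir": environ.get("SM_CHANNEL_TEST"),
        "model_dir": environ.get("SM_MODEL_DIR", ""),
        "output_dir": environ.get("SM_OUTPUT_DATA_DIR", ""),
        # Fall back to CPU-only when the variable is absent.
        "num_gpus": int(environ.get("SM_NUM_GPUS", "0")),
    }


# A plain dict stands in for os.environ, which keeps the sketch easy to test.
config = read_sagemaker_env({
    "SM_CHANNEL_TRAIN": "/tmp/input",
    "SM_CHANNEL_TEST": "/tmp/input",
    "SM_NUM_GPUS": "1",
})
print(config)
```

Passing the mapping in explicitly means the same function works for local debugging and inside the container without modification.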