## How to use hyperparameter tuning with SageMaker

### 1. Create synthetic data for testing. 

#### 1.1 make_classification() from sklearn provides a handy way to generate a synthetic dataset for a classification task. Here we define a dataset with 15 input features and 3 output classes. Using train_test_split() from sklearn.model_selection, we create the train, validation and test datasets. 

In [1]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Class weights must sum to 1; weights=None requests balanced classes.
X, y = make_classification(n_samples=10000, n_features=15, n_informative=10, 
                           n_redundant=5, n_classes=3, n_clusters_per_class=2, 
                           class_sep=1.5, flip_y=0.01, weights=None)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
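As a sanity check on the two-stage split, the sketch below reproduces the expected subset sizes with plain arithmetic. It assumes that train_test_split() rounds the test share up and gives the remainder to the training set, and that all 10,000 requested samples are generated; split_sizes is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the rest goes to training.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# First split: hold out 20% of the 10,000 samples for testing.
n_train_full, n_test = split_sizes(10000, 0.2)
# Second split: hold out 20% of the remaining samples for validation.
n_train, n_val = split_sizes(n_train_full, 0.2)

print(n_train, n_val, n_test)
```

Under these assumptions the final proportions are 64% train, 16% validation and 20% test.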


#### 1.2 Convert the NumPy arrays to pandas DataFrames and store them as CSV files in the data/ folder. We can then upload this folder to our S3 bucket. 

In [2]:
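import os

import pandas as pd

# Create the output folders first; to_csv() does not create missing directories.
os.makedirs("data/training", exist_ok=True)
os.makedirs("data/test", exist_ok=True)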

X_train_df = pd.DataFrame(X_train)
y_train_df = pd.DataFrame(y_train)

X_val_df = pd.DataFrame(X_val)
y_val_df = pd.DataFrame(y_val)

X_test_df = pd.DataFrame(X_test)
y_test_df = pd.DataFrame(y_test)

X_train_df.to_csv("data/training/X_train.csv", index=False)
y_train_df.to_csv("data/training/y_train.csv", index=False)

X_val_df.to_csv("data/training/X_validation.csv", index=False)
y_val_df.to_csv("data/training/y_validation.csv", index=False)

X_test_df.to_csv("data/test/X_test.csv", index=False)
y_test_df.to_csv("data/test/y_test.csv", index=False)

### 2. SageMaker session

#### 2.1 Define session, bucket and role. 

In [3]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/pytorch-synthetic"

role = sagemaker.get_execution_role()

#### 2.2 Upload data to S3 bucket. 

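Upload the data/training folder, which holds the training and validation CSV files, to S3. The test files in data/test stay local. 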

In [4]:
inputs = sagemaker_session.upload_data(path="data/training", bucket=bucket, key_prefix=prefix)
print("input spec (in this case, just an S3 path): {}".format(inputs))

input spec (in this case, just an S3 path): s3://sagemaker-us-east-2-046610044696/sagemaker/pytorch-synthetic
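To make the resulting layout concrete, the sketch below builds the S3 URIs at which the four CSV files would be expected to land. It assumes that upload_data() stores each file of the folder under key_prefix using its file name; the bucket name example-bucket and the helper s3_uri are hypothetical.

```python
import posixpath

# Hypothetical bucket name; the real one comes from default_bucket().
bucket = "example-bucket"
prefix = "sagemaker/pytorch-synthetic"

# Files written to data/training in step 1.2.
files = ["X_train.csv", "y_train.csv", "X_validation.csv", "y_validation.csv"]


def s3_uri(bucket, key_prefix, filename):
    """Build the S3 URI for one uploaded file (assumed layout)."""
    return "s3://" + posixpath.join(bucket, key_prefix, filename)


uris = [s3_uri(bucket, prefix, name) for name in files]
for uri in uris:
    print(uri)
```

During training, SageMaker copies the contents of a channel into /opt/ml/input/data/ followed by the channel name inside the container, so the entry script reads the files from local disk rather than from S3.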


#### 2.3 Training in SageMaker

In [5]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point="pytorch_synthetic_data_entry.py",
                    role=role,
                    framework_version="1.1.0",
                    train_instance_count=1,
                    train_instance_type="ml.c4.xlarge",
                    hyperparameters={
                        "num-epochs": 10, 
                        "learning-rate": 0.005,
                        "batch-size": 32, 
                        "test-batch-size": 32
                    })
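
The estimator passes each entry of hyperparameters to the entry script as a command-line argument (for example --num-epochs 10). The actual pytorch_synthetic_data_entry.py is not shown here, so the following is only a sketch of how such a script might read them with argparse; parse_args and the --data-dir option are assumptions made for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the hyperparameters passed to the entry script."""
    parser = argparse.ArgumentParser()
    # Option names mirror the keys of the estimator's hyperparameters dict.
    parser.add_argument("--num-epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.005)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--test-batch-size", type=int, default=32)
    # SageMaker sets SM_CHANNEL_TRAINING to the local folder of the
    # "training" channel; the fallback is for running outside SageMaker.
    parser.add_argument("--data-dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAINING", "data/training"))
    return parser.parse_args(argv)


# Simulate the arguments a training job would pass on the command line.
args = parse_args(["--num-epochs", "5", "--learning-rate", "0.01"])
print(args.num_epochs, args.learning_rate, args.batch_size)
```

Because argparse turns dashes into underscores, the value of --num-epochs is read as args.num_epochs. Options that are not passed keep their defaults.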

In [6]:
# Synchronous: fit() blocks and streams the training logs until the job finishes.
estimator.fit({"training": inputs})

2019-09-20 13:40:33 Starting - Starting the training job...
2019-09-20 13:40:37 Starting - Launching requested ML instances...
2019-09-20 13:41:31 Starting - Preparing the instances for training......
2019-09-20 13:42:17 Downloading - Downloading input data...
2019-09-20 13:42:56 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-09-20 13:42:58,122 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2019-09-20 13:42:58,125 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-09-20 13:42:58,136 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2019-09-20 13:43:01,151 sagemaker_pytorch_container.training INFO     Invoking user training script.
2019-09-20 13:43:01,425 sagemaker-containers INFO     Mo


2019-09-20 13:43:17 Uploading - Uploading generated training model
2019-09-20 13:43:17 Completed - Training job completed
Training seconds: 60
Billable seconds: 60


#### 2.4 Hyperparameter tuning in SageMaker

The HyperparameterTuner class performs the hyperparameter tuning for us. 
1. We first define all the different inputs to the HyperparameterTuner class.
2. Next, we supply the estimator object along with inputs defined in 1. and create the tuner instance.
3. We call the tuner's fit function similar to how we called estimator's fit function. 

Note: For this to work, the pytorch_synthetic_data_entry.py script must log the test_loss variable inside its test() function. The logger.info() call there writes the loss value after the string "Test set: Average loss:". This string must match the one provided in the "Regex" component of the metric_definitions variable. 

If test_loss is not logged inside the test() function, SageMaker cannot tell which hyperparameter configuration gives the best result, so the tuning job cannot complete successfully. 
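
The matching can be tried locally before launching a tuning job. The sketch below applies the same pattern to one log line with Python's re module. The log line itself is hypothetical, since the exact text depends on what the entry script writes.

```python
import re

# Same pattern as the "Regex" component of metric_definitions.
pattern = re.compile(r"Test set: Average loss: ([0-9\.]+)")

# Hypothetical log line; the real text depends on the entry script.
log_line = "Test set: Average loss: 0.4321, Accuracy: 1450/1600 (91%)"

match = pattern.search(log_line)
# The first capture group holds the metric value as text.
test_loss = float(match.group(1)) if match else None
print(test_loss)
```

If the pattern finds no match in any log line, no metric value is extracted, which is the failure case described above.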

In [7]:
from sagemaker.tuner import CategoricalParameter, HyperparameterTuner

hyperparameter_ranges = {"num-epochs": CategoricalParameter([5, 10])}
objective_metric_name = 'average test loss'
objective_type = 'Minimize'
metric_definitions = [{'Name': 'average test loss',
                       'Regex': 'Test set: Average loss: ([0-9\\.]+)'}]

In [8]:
tuner = HyperparameterTuner(estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            metric_definitions,
                            max_jobs=2,
                            max_parallel_jobs=2,
                            objective_type=objective_type)

In [9]:
# Asynchronous 
tuner.fit({"training": inputs})
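
Conceptually, once the training jobs have reported their metric, the tuner ranks them according to objective_type. The sketch below illustrates that selection step in plain Python; pick_best is a hypothetical helper, and the metric values are made up for illustration, not results from an actual run.

```python
def pick_best(results, objective_type="Minimize"):
    """Return the (hyperparameters, metric) pair with the best metric."""
    if objective_type not in ("Minimize", "Maximize"):
        raise ValueError("objective_type must be 'Minimize' or 'Maximize'")
    chooser = min if objective_type == "Minimize" else max
    return chooser(results, key=lambda item: item[1])


# Made-up metric values for the two candidate settings of num-epochs.
results = [({"num-epochs": 5}, 0.52), ({"num-epochs": 10}, 0.47)]

best_params, best_loss = pick_best(results, "Minimize")
print(best_params, best_loss)
```

With the real tuner, the SDK's tuner.wait() blocks until tuning ends and tuner.best_training_job() returns the name of the winning job.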