# AutoGluon Tabular with Deep Learning Containers on SageMaker

[AutoGluon](https://github.com/awslabs/autogluon) automates machine learning tasks, enabling you to achieve strong predictive performance in your applications. With just a few lines of code, you can train and deploy high-accuracy deep learning models on tabular, image, and text data.
This example shows how to use AutoGluon-Tabular with Amazon SageMaker using [pre-built deep learning containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#autogluon-training-containers).

# Prerequisites

In [38]:
# Ensure the most recent AutoGluon image information is available in the SageMaker Python SDK
# !pip install -q -U 'sagemaker>=2.126.0'

In [39]:
from omegaconf import OmegaConf

path_root = "/Users/elnath/004_deep_learning/SageMaker-Practical-Course"
conf = OmegaConf.load(f"{path_root}/config_my.json")
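
The cell above reads a local JSON file whose contents are not shown here. The only key this notebook reads from it is `common.execution_role`. As a rough sketch, assuming a file of the shape below (the account ID and role name are placeholders, and `load_execution_role` is a hypothetical helper, not part of the notebook), the same value could be read with the standard library alone:

```python
import json
import os
import tempfile

# Hypothetical config contents; replace the ARN with a role from your own account.
EXAMPLE_CONFIG = {
    "common": {
        "execution_role": "arn:aws:iam::123456789012:role/ExampleSageMakerRole"
    }
}


def load_execution_role(path):
    """Return the value stored under common.execution_role in a JSON config file."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    return config["common"]["execution_role"]


# Round-trip through a temporary file to show the expected layout.
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "config_my.json")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(EXAMPLE_CONFIG, f, indent=2)
    role_from_file = load_execution_role(config_path)
```

When running inside SageMaker Studio or a notebook instance, the commented-out `sagemaker.get_execution_role()` call in the next cell is an alternative to a config file.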

In [40]:
import sagemaker
import pandas as pd
from ag_model import (
    AutoGluonSagemakerEstimator,
    AutoGluonNonRepackInferenceModel,
    AutoGluonSagemakerInferenceModel,
    AutoGluonRealtimePredictor,
    AutoGluonBatchPredictor,
)
from sagemaker import utils
from sagemaker.serializers import CSVSerializer
import os
import boto3

# role = sagemaker.get_execution_role()
role = conf.common.execution_role
sagemaker_session = sagemaker.session.Session()
region = sagemaker_session.boto_region_name  # public attribute; avoids the private _region_name

bucket = sagemaker_session.default_bucket()
s3_prefix = f"autogluon_sm/{utils.sagemaker_timestamp()}"
output_path = f"s3://{bucket}/{s3_prefix}/output/"

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/elnath/Library/Application Support/sagemaker/config.yaml


### Get the data
We'll be using the [Adult Census dataset](https://archive.ics.uci.edu/ml/datasets/adult) for this exercise.
This data was extracted from the [1994 Census Bureau database](http://www.census.gov/en.html) by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). The task is to predict whether an individual earns more than 50K a year.

In [41]:
!mkdir -p data

In [42]:
columns = [
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
    "class",
]

In [43]:
# Download the data - needed for examples; in notebooks, S3 URL can be directly used for loading from S3
s3 = boto3.client("s3")
s3.download_file(
    f"sagemaker-example-files-prod-{region}",
    "datasets/tabular/uci_adult/adult.data",
    "data/adult.data",
)
s3.download_file(
    f"sagemaker-example-files-prod-{region}",
    "datasets/tabular/uci_adult/adult.test",
    "data/adult.test",
)

In [44]:
df_train = pd.read_csv("data/adult.data", header=None, names=columns)
df_train.to_csv("data/train.csv")

In [45]:
df_test = pd.read_csv("data/adult.test", header=None, skiprows=1, names=columns)
df_test["class"] = df_test["class"].map(
    {
        " <=50K.": " <=50K",
        " >50K.": " >50K",
    }
)
df_test.to_csv("data/test.csv")
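
The mapping above is needed because labels in `adult.test` end with a period while labels in `adult.data` do not. A small stdlib-only sketch of the same normalization, on made-up sample values (the helper name `normalize_label` is illustrative, not from the notebook):

```python
def normalize_label(label):
    """Drop the trailing period used in adult.test so labels match adult.data."""
    return label[:-1] if label.endswith(".") else label


# Two labels in the adult.test style and one already in the adult.data style.
test_labels = [" <=50K.", " >50K.", " <=50K"]
normalized = [normalize_label(label) for label in test_labels]
```

One difference in behavior: `Series.map` with a dict yields NaN for values missing from the dict, while this helper leaves unexpected values untouched.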

# Training

You can create your own training and inference scripts by following the [SageMaker Python SDK examples](https://sagemaker.readthedocs.io/en/stable/overview.html#prepare-a-training-script).
The scripts used here accept the AutoGluon configuration as a YAML file (located in the `config` directory).

We use the [official AutoGluon Deep Learning Container images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#autogluon-training-containers) with custom training scripts (see the `scripts/` directory).
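
Inside the container, a training script finds its inputs through the file system: each channel passed to `fit` is mounted under `/opt/ml/input/data/<channel>`, and the SageMaker training toolkit also exposes that location as an `SM_CHANNEL_<CHANNEL>` environment variable. The following is a minimal sketch of that lookup, not the contents of `scripts/tabular_train.py`; the helper names `channel_dir` and `model_dir` are hypothetical.

```python
import os


def channel_dir(name, environ=None):
    """Return the directory where SageMaker mounts the given input channel."""
    environ = os.environ if environ is None else environ
    # Fall back to the conventional mount point when the variable is not set.
    return environ.get(f"SM_CHANNEL_{name.upper()}", f"/opt/ml/input/data/{name}")


def model_dir(environ=None):
    """Return the directory whose contents SageMaker packs into model.tar.gz."""
    environ = os.environ if environ is None else environ
    return environ.get("SM_MODEL_DIR", "/opt/ml/model")


# A stand-in environment, so the sketch runs outside a container.
fake_env = {"SM_CHANNEL_TRAIN": "/tmp/train"}
train_dir = channel_dir("train", fake_env)
test_dir = channel_dir("test", fake_env)
output_dir = model_dir(fake_env)
```

Passing the environment as an argument keeps the helpers easy to exercise locally before submitting a job.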

In [46]:
ag = AutoGluonSagemakerEstimator(
    role=role,
    entry_point="scripts/tabular_train.py",
    region=region,
    instance_count=1,
    # instance_type="ml.m5.2xlarge",
    instance_type="local",
    # The SageMaker SDK version in the conda environment is too old to support 1.0
    framework_version="0.8",
    py_version="py39",
    base_job_name="autogluon-tabular-train",
    disable_profiler=True,
    debugger_hook_config=False,
)



Upload the data to S3

In [47]:
s3_prefix = f"autogluon_sm/{utils.sagemaker_timestamp()}"
train_input = ag.sagemaker_session.upload_data(
    path=os.path.join("data", "train.csv"), key_prefix=s3_prefix
)
eval_input = ag.sagemaker_session.upload_data(
    path=os.path.join("data", "test.csv"), key_prefix=s3_prefix
)
config_input = ag.sagemaker_session.upload_data(
    path=os.path.join("config", "config-med.yaml"), key_prefix=s3_prefix
)

# Provide inference script so the script repacking is not needed later
# See more here: https://docs.aws.amazon.com/sagemaker/latest/dg/mlopsfaq.html
# Q. Why do I see a repack step in my SageMaker pipeline?
inference_script = ag.sagemaker_session.upload_data(
    path=os.path.join("scripts", "tabular_serve.py"), key_prefix=s3_prefix
)

INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials


### Fit The Model
For local training, set `instance_type` to `local`.

For non-local training, the recommended instance type is `ml.m5.2xlarge`.

In [None]:
job_name = utils.unique_name_from_base("test-autogluon-image")
ag.fit(
    {
        "config": config_input,
        "train": train_input,
        "test": eval_input,
        "serving": inference_script,
    },
    job_name=job_name,
)

INFO:sagemaker:Creating training-job with name: test-autogluon-image-1711446762-9abf
INFO:sagemaker.local.image:'Docker Compose' found using Docker CLI.
INFO:sagemaker.local.local_session:Starting training job
INFO:sagemaker.local.image:Using the long-lived AWS credentials found in session
INFO:sagemaker.local.image:docker compose file: 
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-28c43:
    command: train
    container_name: u4na7hxy1w-algo-1-28c43
    environment:
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    image: 763104351884.dkr.ecr.ap-northeast-2.amazonaws.com/autogluon-training:0.8-cpu-py39
    networks:
      sagemaker-local:
        aliases:
        - algo-1-28c43
    stdin_open: true
    tty: true
    volumes:
    - /private/var/folders/yp/836q9b2x1_n1fm3627j48ssc0000gn/T/tmpvwf859g4/algo-1-28c43/input:/opt/ml/input
    - /private/var/folders/yp/836q9b2x1_n1fm3627j48ssc0000gn/T/tmpvwf859g4/algo-1-28c43/output:/opt/ml/ou

 Container u4na7hxy1w-algo-1-28c43  Creating
 Container u4na7hxy1w-algo-1-28c43  Created
Attaching to u4na7hxy1w-algo-1-28c43
u4na7hxy1w-algo-1-28c43  | 2024-03-26 09:52:50,602 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
u4na7hxy1w-algo-1-28c43  | 2024-03-26 09:52:50,604 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
u4na7hxy1w-algo-1-28c43  | 2024-03-26 09:52:50,606 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
u4na7hxy1w-algo-1-28c43  | 2024-03-26 09:52:50,615 sagemaker-training-toolkit INFO     instance_groups entry not present in resource_config
u4na7hxy1w-algo-1-28c43  | 2024-03-26 09:52:50,618 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
u4na7hxy1w-algo-1-28c43  | 2024-03-26 09:52:50,628 sagemaker_pytorch_container.training INFO     Invoking user training script.
u4na7hxy1w-algo-1-28c43  | 2024-03-26 09:52:

### Model export

AutoGluon models are portable: everything needed to deploy a trained model is in the tarball created by SageMaker.

The artifact can be used locally, on EC2/ECS/EKS or served via SageMaker Inference.

In [None]:
print(ag.model_data)

In [None]:
!aws s3 cp {ag.model_data} .

In [None]:
!ls -alF model.tar.gz
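
To see what the artifact contains without extracting it, the archive can be listed with Python's `tarfile` module. This is only a sketch: the helper name `list_archive` is illustrative, and the demo builds a tiny stand-in archive rather than making any claim about the real model's layout.

```python
import io
import os
import tarfile
import tempfile


def list_archive(path):
    """Return the member names of a .tar.gz archive, sorted."""
    with tarfile.open(path, "r:gz") as archive:
        return sorted(archive.getnames())


# Build a one-file stand-in archive so the helper can be run anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tar.gz")
    payload = b"placeholder"
    with tarfile.open(demo_path, "w:gz") as archive:
        info = tarfile.TarInfo(name="example.txt")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    demo_members = list_archive(demo_path)
```

Calling `list_archive("model.tar.gz")` on the downloaded file shows what the training script saved.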

# Endpoint Deployment

Upload the model we trained earlier

In [None]:
endpoint_name = sagemaker.utils.unique_name_from_base("sagemaker-autogluon-serving-trained-model")

model_data = sagemaker_session.upload_data(
    path=os.path.join(".", "model.tar.gz"), key_prefix=f"{endpoint_name}/models"
)

Deploy a remote or local endpoint

In [None]:
instance_type = "ml.m5.2xlarge"
# instance_type = 'local'

In [None]:
model = AutoGluonNonRepackInferenceModel(
    model_data=model_data,
    role=role,
    region=region,
    framework_version="0.8",
    py_version="py39",
    instance_type=instance_type,
    source_dir="scripts",
    entry_point="tabular_serve.py",
)

In [None]:
model.deploy(initial_instance_count=1, serializer=CSVSerializer(), instance_type=instance_type)

In [None]:
print(model.endpoint_name)

In [None]:
predictor = AutoGluonRealtimePredictor(model.endpoint_name)
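
Clients that do not use the SageMaker Python SDK can call the same endpoint through the `sagemaker-runtime` API in boto3. The sketch below is written under assumptions: the endpoint accepts header-less `text/csv` (matching the `CSVSerializer` used at deployment), and the `Accept` type is a guess, since the response format is decided by `scripts/tabular_serve.py`, which is not shown here. The helper names `rows_to_csv` and `invoke` are hypothetical.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize an iterable of rows into header-less CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def invoke(endpoint_name, rows, accept="application/json"):
    """Send rows to a SageMaker endpoint and return the raw response body as text."""
    import boto3  # imported lazily so rows_to_csv works without AWS access

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Accept=accept,  # assumption: depends on what the serving script supports
        Body=rows_to_csv(rows),
    )
    return response["Body"].read().decode("utf-8")


# Made-up feature values, only to show the payload shape.
sample_payload = rows_to_csv([[39, " State-gov"], [50, " Private"]])
```

`invoke(model.endpoint_name, rows)` would then return whatever text the serving script produces; parsing it is left to the caller.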

### Predict on unlabeled test data

Remove target variable (`class`) from the data and get predictions for a sample of 100 rows using the deployed endpoint.

In [21]:
df = pd.read_csv("data/test.csv")
data = df[:100]

In [22]:
preds = predictor.predict(data.drop(columns="class"))
preds

Unnamed: 0,pred,<=50K_proba,>50K_proba
0,<=50K,0.997886,0.002114
1,<=50K,0.819394,0.180606
2,<=50K,0.673882,0.326118
3,>50K,0.011298,0.988702
4,<=50K,0.999636,0.000364
...,...,...,...
95,<=50K,0.999180,0.000820
96,<=50K,0.988200,0.011800
97,<=50K,0.913842,0.086158
98,<=50K,0.655580,0.344420


In [23]:
p = preds[["pred"]]
p = p.join(data["class"]).rename(columns={"class": "actual"})
p.head()

Unnamed: 0,pred,actual
0,<=50K,<=50K
1,<=50K,<=50K
2,<=50K,>50K
3,>50K,>50K
4,<=50K,<=50K


In [24]:
print(f"{(p.pred==p.actual).astype(int).sum()}/{len(p)} are correct")

94/100 are correct
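
A single accuracy figure hides which class the errors fall on. As a sketch, the breakdown can be counted with the standard library; the toy lists mirror the five rows shown by `p.head()` above, and the helper names `confusion_counts` and `accuracy` are illustrative.

```python
from collections import Counter


def confusion_counts(predicted, actual):
    """Count (actual, predicted) label pairs, ignoring surrounding whitespace."""
    pairs = zip((a.strip() for a in actual), (p.strip() for p in predicted))
    return Counter(pairs)


def accuracy(predicted, actual):
    """Return the fraction of rows where the prediction matches the actual label."""
    counts = confusion_counts(predicted, actual)
    correct = sum(n for (a, p), n in counts.items() if a == p)
    total = sum(counts.values())
    return correct / total if total else 0.0


toy_pred = ["<=50K", "<=50K", "<=50K", ">50K", "<=50K"]
toy_actual = [" <=50K", " <=50K", " >50K", " >50K", " <=50K"]
toy_counts = confusion_counts(toy_pred, toy_actual)
toy_accuracy = accuracy(toy_pred, toy_actual)
```

With the notebook's DataFrame, the same helpers could be called as `confusion_counts(p.pred, p.actual)`.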


### Cleanup Endpoint

In [25]:
predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint configuration with name: autogluon-inference-2024-03-26-09-10-29-167
INFO:sagemaker:Deleting endpoint with name: autogluon-inference-2024-03-26-09-10-29-167


# Batch Transform

Deploying a trained model to a hosted endpoint has been available in SageMaker since launch and is a great way to provide real-time predictions to a service such as a website or mobile app. However, if the goal is to generate predictions from a trained model on a large dataset where minimizing latency isn't a concern, batch transform may be easier, more scalable, and more appropriate.

[Read more about Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html).

In [26]:
endpoint_name = sagemaker.utils.unique_name_from_base(
    "sagemaker-autogluon-batch_transform-trained-model"
)

model_data = sagemaker_session.upload_data(
    path=os.path.join(".", "model.tar.gz"), key_prefix=f"{endpoint_name}/models"
)

In [None]:
instance_type = "ml.m5.2xlarge"

In [27]:
model = AutoGluonSagemakerInferenceModel(
    model_data=model_data,
    role=role,
    region=region,
    framework_version="0.8",
    py_version="py39",
    instance_type=instance_type,
    entry_point="tabular_serve-batch.py",
    source_dir="scripts",
    predictor_cls=AutoGluonBatchPredictor,
)



In [28]:
transformer = model.transformer(
    instance_count=1,
    instance_type=instance_type,
    strategy="MultiRecord",
    max_payload=6,
    max_concurrent_transforms=1,
    output_path=output_path,
    accept="application/json",
    assemble_with="Line",
)



INFO:sagemaker:Repacking model artifact (s3://sagemaker-ap-northeast-2-688554574862/sagemaker-autogluon-batch_transform-trained-mod-1711444702-fd3d/models/model.tar.gz), script artifact (scripts), and dependencies ([]) into single tar.gz file located at s3://sagemaker-ap-northeast-2-688554574862/autogluon-inference-2024-03-26-09-19-01-205/model.tar.gz. This may take some time depending on model size...
INFO:sagemaker:Creating model with name: autogluon-inference-2024-03-26-09-24-57-380
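
With `strategy="MultiRecord"`, Batch Transform packs as many records into each request as fit within `max_payload` (given in MB, so 6 MB here). The grouping can be pictured roughly as below. This is a simplification for intuition only, not the service's actual implementation, and `batch_lines` is a hypothetical helper.

```python
def batch_lines(lines, max_payload_bytes):
    """Greedily group lines so each group stays within max_payload_bytes."""
    batches, current, size = [], [], 0
    for line in lines:
        line_size = len(line.encode("utf-8")) + 1  # +1 for the newline separator
        # Start a new batch when adding this line would exceed the limit.
        if current and size + line_size > max_payload_bytes:
            batches.append(current)
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        batches.append(current)
    return batches


# Tiny limit so the split is visible; each line costs 3 bytes including newline.
toy_batches = batch_lines(["aa", "bb", "cc"], max_payload_bytes=6)
```

A larger `max_payload` means fewer, bigger requests to the container; a smaller one means more, lighter requests.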


Prepare data for batch transform

In [29]:
pd.read_csv("data/test.csv")[:100].to_csv("data/test_no_header.csv", header=False, index=False)

Upload the data to S3 using the SageMaker session

In [30]:
test_input = transformer.sagemaker_session.upload_data(
    path=os.path.join("data", "test_no_header.csv"), key_prefix=s3_prefix
)

In [31]:
transformer.transform(
    test_input,
    input_filter="$[:14]",  # filter-out target variable
    split_type="Line",
    content_type="text/csv",
    output_filter="$['class']",  # keep only prediction class in the output
)

transformer.wait()
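
For CSV input, `input_filter` is a JSONPath expression applied to each row as if it were an array of fields, so `$[:14]` keeps positions 0 through 13, like a Python slice. Positions are counted over the file exactly as written, including any index column. A toy sketch with a shortened, made-up row (the helper `apply_input_filter` is illustrative and only mimics the slice, not the service):

```python
import csv
import io


def apply_input_filter(csv_line, stop):
    """Mimic a `$[:stop]` input filter on one CSV line."""
    fields = next(csv.reader(io.StringIO(csv_line)))
    return fields[:stop]


# Shortened row: index, three features, then the label in the last position.
toy_row = "0,25, Private,226802, <=50K"
kept = apply_input_filter(toy_row, 4)
```

It is worth checking that the slice boundary matches the column layout of `test_no_header.csv` before running a large job.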

INFO:sagemaker:Creating transform job with name: autogluon-inference-2024-03-26-09-24-58-230


...................................['torchserve', '--start', '--model-store', '/.sagemaker/ts/models', '--ts-config', '/etc/sagemaker-ts.properties', '--log-config', '/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/etc/log4j2.xml', '--models', 'model=/opt/ml/model']
2024-03-26T09:30:54,106 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2024-03-26T09:30:54,227 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.7.1
TS Home: /opt/conda/lib/python3.9/site-packages
Current directory: /
Temp directory: /home/model-server/tmp
Metrics config path: /opt/conda/lib/python3.9/site-packages/ts/configs/metrics.yaml
Number of GPUs: 0
Number of CPUs: 8
Max heap size: 7924 M
Python executable: /opt/conda/bin/python3.9
Config file: /etc/sagemaker-ts.properties
Inference address: http://0.0.0.

Download batch transform outputs

In [32]:
!aws s3 cp {transformer.output_path[:-1]}/test_no_header.csv.out .

download: s3://sagemaker-ap-northeast-2-688554574862/autogluon_sm/2024-03-26-09-02-49-026/output/test_no_header.csv.out to ./test_no_header.csv.out


In [33]:
p = pd.concat(
    [
        pd.read_json("test_no_header.csv.out", orient="index")
        .sort_index()
        .rename(columns={0: "preds"}),
        pd.read_csv("data/test.csv")[["class"]].iloc[:100].rename(columns={"class": "actual"}),
    ],
    axis=1,
)
p.head()

Unnamed: 0,preds,actual
0,<=50K,<=50K
1,<=50K,<=50K
2,<=50K,>50K
3,>50K,>50K
4,<=50K,<=50K


In [36]:
print(f"{(p.preds==p.actual).astype(int).sum()}/{len(p)} are correct")

94/100 are correct


# Conclusion

In this tutorial, we trained an AutoGluon model and explored several options for deploying it with SageMaker. Each section of this tutorial (training, endpoint inference, batch inference) can be used independently; for example, you can train locally and deploy to SageMaker, or vice versa.

Next steps:
* [Learn more](https://auto.gluon.ai) about AutoGluon, explore [tutorials](https://auto.gluon.ai/stable/tutorials/index.html).
* Explore [SageMaker inference documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html).