In [1]:
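import sagemaker
from sagemaker.local import LocalSession

# Regular session, used here only to look up the region.
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name

# bucket = sagemaker_session.default_bucket()
# prefix = "sagemaker/DEMO-pytorch-mnist"

# IAM role attached to this notebook instance.
role = sagemaker.get_execution_role()

# Switch to local mode: training runs in a Docker container on this machine.
sagemaker_session = LocalSession()
# Use the source directory from the local filesystem instead of uploading it to S3.
sagemaker_session.config = {"local": {"local_code": True}}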

## Data
### Getting the data



### Run training in SageMaker

The `PyTorch` class allows us to run our training function as a SageMaker training job. We need to configure it with our training script, an IAM role, the number of training instances, the training instance type, and hyperparameters. In this case we run the training job in local mode on a single `local_gpu` instance, but this example can be run on one or multiple CPU or GPU instances ([full list of available instances](https://aws.amazon.com/sagemaker/pricing/instance-types/)). The `hyperparameters` parameter is a dict of values that will be passed to your training script, which here is `train.py` in the `./resources` directory.
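
As a rough illustration of how a training script can pick up these values: SageMaker passes each hyperparameter to the entry point as a command-line argument (for example `--epochs 1`), so `argparse` is a common way to read them. The sketch below is not the actual `train.py`. The function name `parse_args` and the default values are assumptions chosen to mirror the hyperparameters used in this notebook.

```python
import argparse


def parse_args(argv=None):
    """Parse hyperparameters passed as command-line arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    # Names match the keys of the `hyperparameters` dict given to the estimator.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--samples_per_epoch", type=int, default=500)
    # Ignore any extra arguments the container may add.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments the container would pass for {"epochs": 2, "batch_size": 8}.
args = parse_args(["--epochs", "2", "--batch_size", "8"])
print(args.epochs, args.batch_size, args.samples_per_epoch)  # 2 8 500
```

Using `parse_known_args` rather than `parse_args` is a design choice that keeps the script from failing if it receives arguments it does not declare.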


In [2]:
from sagemaker.pytorch import PyTorch


estimator = PyTorch(
    entry_point="train.py",
    role=role,
    py_version="py38",
    framework_version="1.11.0",
    source_dir="./resources",
    instance_count=1,
    instance_type="local_gpu",
    hyperparameters={"epochs": 1, "batch_size": 4, "samples_per_epoch": 500},
    sagemaker_session=sagemaker_session,
)

After we've constructed our `PyTorch` object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.
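
A minimal sketch of what "reading the data from disk" can look like inside the training script, assuming the input channel is named `training` as in the `fit` call in the next cell. SageMaker training containers expose each channel's directory through an `SM_CHANNEL_<NAME>` environment variable, with `/opt/ml/input/data/<name>` as the default location. The helper name `list_channel_files` is hypothetical and not part of the SageMaker SDK.

```python
import os
from pathlib import Path


def list_channel_files(channel="training"):
    """Return the files found in the given input channel directory (hypothetical helper)."""
    # Fall back to the container's default path if the variable is not set.
    default_dir = f"/opt/ml/input/data/{channel}"
    channel_dir = Path(os.environ.get(f"SM_CHANNEL_{channel.upper()}", default_dir))
    if not channel_dir.is_dir():
        # Outside a training container the directory usually does not exist.
        return []
    return sorted(p for p in channel_dir.rglob("*") if p.is_file())


for path in list_channel_files("training"):
    print(path)
```

Outside a training container the loop typically prints nothing, because the channel directory is absent.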


In [3]:
estimator.fit({"training": "s3://fastvision.ai/segmented_data/LUNA16_segmented_2mm_test/"})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2023-08-02-22-11-19-132
INFO:sagemaker.local.local_session:Starting training job
INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker.local.image:No AWS credentials found in session but credentials from EC2 Metadata Service are available.
INFO:sagemaker.local.image:docker compose file: 
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-ntx1p:
    command: train
    container_name: 1ub1bj9v1l-algo-1-ntx1p
    deploy:
      resources:
        reservations:
          devices:
          - capabilities:
            - gpu
    environment:
    - '[Masked]'
    - '[Masked]'
    image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.11.0-gpu-py38
    networks:
      sagemaker-local:
        aliases:
        - algo-1-ntx1p
 

Login Succeeded


INFO:sagemaker.local.image:image pulled: 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.11.0-gpu-py38


Container 1ub1bj9v1l-algo-1-ntx1p  Creating
Container 1ub1bj9v1l-algo-1-ntx1p  Created
Attaching to 1ub1bj9v1l-algo-1-ntx1p
1ub1bj9v1l-algo-1-ntx1p  | 2023-08-02 18:16:27,873 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
1ub1bj9v1l-algo-1-ntx1p  | 2023-08-02 18:16:27,894 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
1ub1bj9v1l-algo-1-ntx1p  | 2023-08-02 18:16:27,903 sagemaker-training-toolkit INFO     instance_groups entry not present in resource_config
1ub1bj9v1l-algo-1-ntx1p  | 2023-08-02 18:16:27,906 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
1ub1bj9v1l-algo-1-ntx1p  | 2023-08-02 18:16:27,909 sagemaker_pytorch_container.training INFO     Invoking user training script.
1ub1bj9v1l-algo-1-ntx1p  | 2023-08-02 18:16:27,950 botocore.credentials INFO     Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
1ub1bj9v1l-algo-1-ntx1p  | 2023-08-0

INFO:root:creating /tmp/tmpsce_mp_k/artifacts/output/data
INFO:root:copying /tmp/tmpsce_mp_k/algo-1-ntx1p/output/failure -> /tmp/tmpsce_mp_k/artifacts/output


1ub1bj9v1l-algo-1-ntx1p exited with code 1
Aborting on container exit...
Container 1ub1bj9v1l-algo-1-ntx1p  Stopping
Container 1ub1bj9v1l-algo-1-ntx1p  Stopped
time="2023-08-02T22:16:50Z" level=error msg=1


RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmpsce_mp_k/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

## Host
### Create endpoint
After training, we use the `PyTorch` estimator object to build and deploy a `PyTorchPredictor`. This creates a SageMaker endpoint -- a hosted prediction service that we can use to perform inference.


In [None]:
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

### Evaluate

You can use the test images to evaluate the endpoint. The accuracy of the model depends on how many epochs it is trained for.

In [None]:
# response = predictor.predict(data)
# print("Raw prediction result:")
# print(response)
# print()

# labeled_predictions = list(zip(range(10), response[0]))
# print("Labeled predictions: ")
# print(labeled_predictions)
# print()

# labeled_predictions.sort(key=lambda label_and_prob: 1.0 - label_and_prob[1])
# print("Most likely answer: {}".format(labeled_predictions[0]))

### Cleanup

After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it.

In [None]:
sagemaker_session.delete_endpoint(endpoint_name=predictor.endpoint_name)