In [1]:
import sagemaker

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name

# bucket = sagemaker_session.default_bucket()
# prefix = "sagemaker/DEMO-pytorch-mnist"

role = sagemaker.get_execution_role()

## Data
### Getting the data



### Run training in SageMaker

The `PyTorch` class lets us run our training script as a training job on SageMaker infrastructure. We configure it with the training script (`train.py` in `./resources`), an IAM role, the PyTorch and Python versions, the number of training instances, the training instance type, and hyperparameters. In this case the training job runs on a single `ml.g4dn.xlarge` (GPU) instance, but this example can run on one or more CPU or GPU instances ([full list of available instances](https://aws.amazon.com/sagemaker/pricing/instance-types/)). The `hyperparameters` parameter is a dict of values that SageMaker passes to the training script as command-line arguments, so `train.py` has to parse them before it can use them.
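Because `train.py` is not shown in this notebook, here is a hypothetical sketch of how it might read the hyperparameters. It assumes the script uses `argparse`. The flag spellings are taken from the job's command line further down (`--epochs 3 --samples-per-epoch 500`), and the function name `parse_args` and the default values are illustrative assumptions. Note that argparse stores `--samples-per-epoch` as the attribute `args.samples_per_epoch`, not as a standalone variable.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical parser for the hyperparameters passed to train.py as CLI flags."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py argument parser")
    parser.add_argument("--epochs", type=int, default=1)
    # Accept both hyphen and underscore spellings; argparse stores the value
    # on the namespace as args.samples_per_epoch.
    parser.add_argument(
        "--samples-per-epoch",
        "--samples_per_epoch",
        dest="samples_per_epoch",
        type=int,
        default=500,
    )
    # parse_known_args tolerates any extra flags instead of exiting with an error.
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    args = parse_args()
    print(f"epochs={args.epochs}, samples_per_epoch={args.samples_per_epoch}")
```

Accepting both spellings is a defensive choice. It keeps the script working whether the flag arrives hyphenated, as in the log below, or with underscores, matching the dict key.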


In [2]:
from sagemaker.pytorch import PyTorch


estimator = PyTorch(
    entry_point="train.py",
    role=role,
    py_version="py38",
    framework_version="1.11.0",
    source_dir="./resources",
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    hyperparameters={"epochs": 3, "samples_per_epoch": 500},
)

After we've constructed our `PyTorch` object, we can fit it using the data in S3. The key of the dict passed to `fit()` names the input channel (`training` here). SageMaker copies that channel's data from S3 to the training instance's local filesystem, so our training script can simply read the data from disk.
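As a sketch of the script side, assuming the default File input mode: SageMaker exposes each channel's local directory through an environment variable named `SM_CHANNEL_<NAME>`, so the `training` channel is found via `SM_CHANNEL_TRAINING`. The helper `training_files` is hypothetical, and the fallback path `/opt/ml/input/data/<channel>` is the conventional File-mode location used when the variable is not set.

```python
import os
from pathlib import Path


def training_files(channel="training"):
    """Hypothetical helper: list files SageMaker copied into a channel's local directory."""
    # SM_CHANNEL_TRAINING for the "training" channel passed to estimator.fit().
    env_key = f"SM_CHANNEL_{channel.upper()}"
    data_dir = Path(os.environ.get(env_key, f"/opt/ml/input/data/{channel}"))
    # Walk subdirectories too, since the S3 prefix may contain nested folders.
    return sorted(p for p in data_dir.rglob("*") if p.is_file())
```

Reading the location from the environment, rather than hard-coding it, keeps the same script usable when the channel name or mount point changes.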


In [4]:
estimator.fit({"training": "s3://fastvision.ai/segmented_data/LUNA16_segmented_2mm_test/"})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.


Using provided s3_resource


INFO:sagemaker:Creating training-job with name: pytorch-training-2023-07-29-20-06-14-515


2023-07-29 20:06:14 Starting - Starting the training job...
2023-07-29 20:06:30 Starting - Preparing the instances for training......
2023-07-29 20:07:29 Downloading - Downloading input data...
2023-07-29 20:07:49 Training - Downloading the training image...........................
2023-07-29 20:12:46 Uploading - Uploading generated training model
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-07-29 16:12:31,015 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-07-29 16:12:31,036 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-07-29 16:12:31,047 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-07-29 16:12:31,050 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-07-29 16:12:31,299 sagemaker-trai

UnexpectedStatusException: Error for Training job pytorch-training-2023-07-29-20-06-14-515: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "NameError: name 'samples_per_epoch' is not defined"
Command "/opt/conda/bin/python3.8 train.py --epochs 3 --samples-per-epoch 500", exit code: 1
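The job failed inside `train.py`, not in the estimator configuration. The command line shows that the hyperparameter arrived as `--samples-per-epoch 500`, but the script then used a bare name `samples_per_epoch` that was never defined. The most likely cause is reading the value as a variable instead of from the parsed arguments. Since the script isn't included here, the sketch below is hypothetical: `train_buggy` and `train_fixed` are illustrative names, and `Namespace` stands in for whatever the script's argument parser returns.

```python
from argparse import Namespace


def train_buggy(args):
    samples_seen = 0
    for _ in range(args.epochs):
        # Bug: `samples_per_epoch` is a bare name that is never assigned,
        # so Python raises NameError when this line runs.
        samples_seen += samples_per_epoch  # noqa: F821
    return samples_seen


def train_fixed(args):
    samples_seen = 0
    for _ in range(args.epochs):
        # Fix: read the parsed hyperparameter from the argument namespace.
        samples_seen += args.samples_per_epoch
    return samples_seen


# Same values as the failed job's command line: --epochs 3 --samples-per-epoch 500
args = Namespace(epochs=3, samples_per_epoch=500)
try:
    train_buggy(args)
except NameError as err:
    print(f"buggy: {err}")
print(f"fixed: {train_fixed(args)} samples seen")
```

After applying the equivalent fix in `train.py`, rerun the `fit()` cell.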

## Host
### Create endpoint
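After training, we use the `PyTorch` estimator object to build and deploy a `PyTorchPredictor`. This creates a SageMaker endpoint, a hosted prediction service that we can use to perform inference. Deployment needs the model artifact from a successful training job, so this cell only works once `fit()` has completed without errors.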


In [None]:
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

### Evaluate

You can use the test images to evaluate the endpoint. The accuracy of the model depends on how long it is trained (for example, the `epochs` and `samples_per_epoch` hyperparameters).

In [None]:
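# `data` must be a batch of test inputs in the format the deployed model expects.
# response = predictor.predict(data)
# print("Raw prediction result:")
# print(response)
# print()

# Pair each class index with its predicted score for the first sample.
# labeled_predictions = list(zip(range(len(response[0])), response[0]))
# print("Labeled predictions:")
# print(labeled_predictions)
# print()

# Sort by score, most likely class first.
# labeled_predictions.sort(key=lambda label_and_prob: label_and_prob[1], reverse=True)
# print("Most likely answer: {}".format(labeled_predictions[0]))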
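To turn endpoint output into an accuracy number, the ranking logic in the cell above can be applied to a whole batch. This is a sketch under assumptions: it assumes `predictor.predict` returns one row of class scores per sample (converted to plain Python lists, for example with `.tolist()`), and that you have the true labels for the same test images. The names `rank_labels` and `accuracy` are hypothetical.

```python
def rank_labels(scores):
    """Return (label, score) pairs for one sample, most likely first."""
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


def accuracy(batch_scores, true_labels):
    """Fraction of samples whose top-ranked label equals the true label."""
    if len(batch_scores) != len(true_labels):
        raise ValueError("need exactly one true label per prediction")
    if not true_labels:
        raise ValueError("cannot compute accuracy of an empty batch")
    correct = sum(
        rank_labels(scores)[0][0] == label
        for scores, label in zip(batch_scores, true_labels)
    )
    return correct / len(true_labels)
```

For example, `accuracy([[0.1, 0.9], [0.8, 0.2]], [1, 0])` returns `1.0`, because the top-scoring class matches the label for both samples.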

### Cleanup

After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it.

In [None]:
sagemaker_session.delete_endpoint(endpoint_name=predictor.endpoint_name)