# Session 1 - Getting Started

This session aims to build a simple [SageMaker Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-sdk.html) with one step to preprocess the dataset.


In [9]:
%load_ext autoreload
%autoreload 2

In [10]:
# Let's make sure we are running the latest version of the SageMaker SDK.
# Restart the notebook after you upgrade the library.

!pip install -q --upgrade pip
!pip install -q --upgrade sagemaker
!pip show sagemaker

[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
awscli 1.22.22 requires botocore==1.23.22, but you have botocore 1.29.110 which is incompatible.
awscli 1.22.22 requires s3transfer<0.6.0,>=0.5.0, but you have s3transfer 0.6.0 which is incompatible.[0m[31m
[0mName: sagemaker
Version: 2.145.0
Summary: Open source library for training and deploying models on Amazon SageMaker.
Home-page: https://github.com/aws/sagemaker-python-sdk/
Author: Amazon Web Services
Author-email: 
License: Apache License 2.0
Location: /usr/local/lib/python3.8/site-packages
Requires: attrs, boto3, google-pasta, importlib-metadata, jsonschema, numpy, packaging, pandas, pathos, platformdirs, protobuf, protobuf3-to-dict, PyYAML, schema, smdebug-rulesconfig
Required-by: 


In [11]:
import os
import sagemaker
import numpy as np
import boto3
import json
import pandas as pd
import urllib.request
import argparse
import tempfile

from pathlib import Path
from sagemaker.inputs import FileSystemInput
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.parameters import ParameterInteger, ParameterString, ParameterFloat
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import CacheConfig
from sagemaker.workflow.pipeline_context import LocalPipelineSession


role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()

[autoreload of sagemaker.utils failed: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/IPython/extensions/autoreload.py", line 245, in check
    superreload(m, reload, self.old_objects)
  File "/usr/local/lib/python3.8/site-packages/IPython/extensions/autoreload.py", line 394, in superreload
    module = reload(module)
  File "/usr/local/lib/python3.8/imp.py", line 314, in reload
    return importlib.reload(module)
  File "/usr/local/lib/python3.8/importlib/__init__.py", line 169, in reload
    _bootstrap._exec(spec, module)
  File "<frozen importlib._bootstrap>", line 604, in _exec
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.8/site-packages/sagemaker/utils.py", line 42, in <module>
    from sagemaker.workflow import is_pipeline_variable, is_pipeline_parameter_string
ImportError: cannot import name 'is_pipeline_var

ImportError: cannot import name 'SessionSettings' from 'sagemaker.session' (/usr/local/lib/python3.8/site-packages/sagemaker/session.py)

## Step 1 - Creating an S3 Bucket

We need to create an S3 bucket where we will upload everything we need during the program.

Make sure you set `BUCKET` to the name of the bucket you want to use.

In [5]:
BUCKET = "mlschool"

!aws s3api create-bucket --bucket $BUCKET

{
    "Location": "/mlschool"
}
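
The CLI call above works as written in `us-east-1`. In any other region, S3 expects a location constraint alongside the bucket name. Here is a minimal sketch of the same step with `boto3`; the helper names `create_bucket_kwargs` and `create_bucket` are hypothetical, and the second one needs valid AWS credentials to run.

```python
def create_bucket_kwargs(bucket, region):
    """Build the keyword arguments for boto3's `create_bucket` call."""
    kwargs = {"Bucket": bucket}

    # S3 rejects an explicit location constraint for us-east-1 and
    # requires one for every other region.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}

    return kwargs


def create_bucket(bucket, region):
    """Create the bucket with boto3 (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above has no dependencies

    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**create_bucket_kwargs(bucket, region))


print(create_bucket_kwargs("mlschool", "eu-west-1"))
```

The CLI equivalent is to add `--create-bucket-configuration LocationConstraint=<region>` to the command.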


## Step 2 - Downloading the Dataset

We can now download the [Penguins dataset](https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data) and store it in S3.

In [6]:
S3_FILEPATH = f"s3://{BUCKET}/penguins"
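DATA_FILEPATH = "penguins/data.csv"

# Make sure the local folder exists before downloading into it.
Path(DATA_FILEPATH).parent.mkdir(parents=True, exist_ok=True)

# Download the official Penguins dataset and store it locally.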
urllib.request.urlretrieve(
    "https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins_size.csv", 
    DATA_FILEPATH
)

# Upload the dataset to S3. We need to do this to make it available to 
# the preprocessing step.
INPUT_DATA_URI = sagemaker.s3.S3Uploader.upload(
    local_path=DATA_FILEPATH, 
    desired_s3_uri=S3_FILEPATH,
)

print(f"Dataset S3 location: {INPUT_DATA_URI}")

NameError: name 'urllib' is not defined

We can now load and display the dataset.

In [6]:
df = pd.read_csv(DATA_FILEPATH)
df

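| | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE |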


## Step 3 - Preprocessing the Dataset

Let's create a script to do feature engineering on the original dataset. This script should also split the data into train, validation, and test sets.

In [7]:
%%writefile penguins/preprocessor.py

import os
import numpy as np
import pandas as pd
import tempfile

from pathlib import Path
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler


BASE_DIR = "/opt/ml/processing"
DATA_FILEPATH = Path(BASE_DIR) / "input" / "data.csv"


def save_splits(base_dir, train, validation, test):
    """
    Saves the supplied datasets to disk.
    """
    
    train_path = Path(base_dir) / "train" 
    validation_path = Path(base_dir) / "validation" 
    test_path = Path(base_dir) / "test"
    
    train_path.mkdir(parents=True, exist_ok=True)
    validation_path.mkdir(parents=True, exist_ok=True)
    test_path.mkdir(parents=True, exist_ok=True)
    
    pd.DataFrame(train).to_csv(train_path / "train.csv", header=False, index=False)
    pd.DataFrame(validation).to_csv(validation_path / "validation.csv", header=False, index=False)
    pd.DataFrame(test).to_csv(test_path / "test.csv", header=False, index=False)


def preprocess(base_dir, data_filepath):
    """
    Preprocesses the supplied raw dataset and splits it into a train, validation,
    and a test set.
    """
    
    df = pd.read_csv(data_filepath)
    
    numerical_columns = [column for column in df.columns if df[column].dtype in ["int64", "float64"]]

    numerical_preprocessor = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler())
    ])

    categorical_preprocessor = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ])

    preprocessor = ColumnTransformer(
        transformers=[
            ("numerical", numerical_preprocessor, numerical_columns),
            ("categorical", categorical_preprocessor, ["island"]),
        ]
    )
    

    X = df.drop(["sex"], axis=1)
    columns = list(X.columns)
    
    X = X.to_numpy()
    
    np.random.shuffle(X)
    train, validation, test = np.split(X, [int(.7 * len(X)), int(.85 * len(X))])
    
    X_train = pd.DataFrame(train, columns=columns)
    X_validation = pd.DataFrame(validation, columns=columns)
    X_test = pd.DataFrame(test, columns=columns)
    
    y_train = X_train.species
    y_validation = X_validation.species
    y_test = X_test.species

    X_train.drop(["species"], axis=1, inplace=True)
    X_validation.drop(["species"], axis=1, inplace=True)
    X_test.drop(["species"], axis=1, inplace=True)
    
    X_train = preprocessor.fit_transform(X_train)
    X_validation = preprocessor.transform(X_validation)
    X_test = preprocessor.transform(X_test)

    label_encoder = LabelEncoder()

    y_train = label_encoder.fit_transform(y_train)
    y_validation = label_encoder.transform(y_validation)
    y_test = label_encoder.transform(y_test)
    
    
    train = np.concatenate((X_train, np.expand_dims(y_train, axis=1)), axis=1)
    validation = np.concatenate((X_validation, np.expand_dims(y_validation, axis=1)), axis=1)
    test = np.concatenate((X_test, np.expand_dims(y_test, axis=1)), axis=1)
    
    save_splits(base_dir, train, validation, test)
    
    
if __name__ == "__main__":
    preprocess(BASE_DIR, DATA_FILEPATH)


Overwriting penguins/preprocessor.py


We can now load the script we just created and run it locally to ensure it creates the three splits.

Being able to run scripts locally is crucial to shortening the development feedback loop.

In [8]:
from penguins.preprocessor import preprocess

with tempfile.TemporaryDirectory() as directory:
    preprocess(
        base_dir=directory, 
        data_filepath=DATA_FILEPATH
    )
    
    print(f"Splits: {os.listdir(directory)}")
    print(f"Train: {os.listdir(Path(directory) / 'train')}")
    print(f"Validation: {os.listdir(Path(directory) / 'validation')}")
    print(f"Test: {os.listdir(Path(directory) / 'test')}")

Splits: ['train', 'validation', 'test']
Train: ['train.csv']
Validation: ['validation.csv']
Test: ['test.csv']
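
The script produces the three splits with a single `np.split()` call. The following sketch isolates that step on a small stand-in array so the cut points are easy to see. The 20-row array and the seeded generator are assumptions for illustration; the script itself uses the unseeded `np.random.shuffle()`.

```python
import numpy as np

# A stand-in for the dataset: 20 rows, 2 columns.
rows = np.arange(40).reshape(20, 2)

# A seeded generator makes the shuffle reproducible. Shuffling a 2D
# array reorders its rows and keeps each row intact.
rng = np.random.default_rng(42)
rng.shuffle(rows)

# Two cut points produce three chunks: the first 70% of the rows,
# the next 15%, and whatever is left.
cuts = [int(0.7 * len(rows)), int(0.85 * len(rows))]
train, validation, test = np.split(rows, cuts)

print(len(train), len(validation), len(test))
```

Because the cut points are positions in the shuffled array, every row ends up in exactly one split.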


## Step 4 - Pipeline Configuration

When we create a SageMaker Pipeline, we can specify a list of parameters that we can use throughout the individual pipeline steps. To read more about these parameters, check [Pipeline Parameters](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-parameters.html).

These are the parameters that we will use in our pipeline:

* `dataset_location`: This parameter represents the location of the dataset in S3. We will use this parameter during the preprocessing step to access the dataset.
* `preprocessor_destination`: The location where the preprocessing step will store the dataset splits. We define it explicitly to prevent SageMaker from appending a timestamp to an auto-generated location. If we let SageMaker use a timestamp, we can't cache this step.

In [9]:
dataset_location = ParameterString(
    name="dataset_location",
    default_value=INPUT_DATA_URI,
)

preprocessor_destination = ParameterString(
    name="preprocessor_destination",
    default_value=f'{S3_FILEPATH}/preprocessing',
)
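
The values above are only defaults. As a minimal sketch, an individual execution can supply different values through the `parameters` argument of `Pipeline.start()`. The helper name `start_with_dataset` is hypothetical, and it expects the pipeline object we define in Step 7.

```python
def start_with_dataset(pipeline, dataset_uri):
    """Start an execution that reads the dataset from `dataset_uri`.

    `pipeline` is expected to be a `sagemaker.workflow.pipeline.Pipeline`.
    Parameters left out of the dictionary keep their default values.
    """
    return pipeline.start(parameters={"dataset_location": dataset_uri})
```

For example, `start_with_dataset(pipeline, "s3://mlschool/penguins/other-data.csv")` would run the same pipeline against a different file. That URI is made up for illustration.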

## Step 5 - Caching Pipeline Steps

While you are building your pipeline, you don't want to rerun every step of the process unless you expect a different result. Instead, you can instruct SageMaker to reuse the result of a previous successful run of a pipeline step.

You can accomplish this by caching your steps. You can find more information about this topic in [Caching Pipeline Steps](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-caching.html).

Getting caching to work is tricky, and you will find SageMaker missing the cache frequently. Whenever that happens, you need to dig and figure out how to adjust the step configuration to prevent SageMaker from autogenerating data that prevents a cache hit. For example, to cache the preprocessing step we need to define the destination of the processing job to prevent SageMaker from using an autogenerated timestamp.

In [10]:
# We'll use this cache configuration to cache individual steps for 
# a maximum of 5 days.
cache_config = CacheConfig(
    enable_caching=True, 
    expire_after="5d"
)
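
To see why an autogenerated timestamp defeats caching, it helps to think of a cache lookup as comparing a fingerprint of the step's arguments. The sketch below is a toy model built on that assumption, not SageMaker's actual implementation. The `cache_key` function and the timestamped paths are made up for illustration.

```python
import hashlib
import json


def cache_key(step_arguments):
    """Fingerprint a step's arguments (toy model, not SageMaker's code)."""
    payload = json.dumps(step_arguments, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two runs whose output location embeds a made-up timestamp.
run_1 = {"code": "preprocessor.py", "output": "s3://mlschool/job-2023-04-10-17-00-01"}
run_2 = {"code": "preprocessor.py", "output": "s3://mlschool/job-2023-04-10-17-05-42"}

# Two runs that write to a fixed location, as `preprocessor_destination` does.
fixed_1 = {"code": "preprocessor.py", "output": "s3://mlschool/penguins/preprocessing"}
fixed_2 = dict(fixed_1)

print("Timestamped runs match:", cache_key(run_1) == cache_key(run_2))
print("Fixed runs match:", cache_key(fixed_1) == cache_key(fixed_2))
```

Only the runs with a fixed destination produce the same fingerprint, which is why we pin the output location in the pipeline parameters.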

## Step 6 - Setting up a Processing Step

The first step we need in our pipeline is a Processing Step to run the preprocessing script. Check the [Processing Step documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-processing) for more information. To run our script, we need access to Scikit-Learn, so we can use the [SKLearnProcessor](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#scikit-learn-processor) processor that comes out-of-the-box with the SageMaker Python SDK.

The input of this step will be the dataset location, and the output will be the location of the three sets.

In [11]:
sklearn_processor = SKLearnProcessor(
    base_job_name="penguins-preprocessing",
    framework_version="0.23-1",
    instance_type="ml.t3.medium",
    instance_count=1,
    role=role,
)

preprocess_step = ProcessingStep(
    name="penguins-preprocessing-step",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=dataset_location, destination="/opt/ml/processing/input"),  
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train", destination=preprocessor_destination),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation", destination=preprocessor_destination),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test", destination=preprocessor_destination),
    ],
    code="penguins/preprocessor.py",
    cache_config=cache_config
)
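
All three outputs share the same destination, so the files should land side by side under one S3 prefix. The sketch below spells out the expected locations. It assumes the contents of each output directory are uploaded directly under the destination, and that each directory holds a single `<split>.csv` file, as our script writes. The helper name `expected_output_uris` is hypothetical.

```python
def expected_output_uris(destination, splits=("train", "validation", "test")):
    """Map each split to the S3 URI where its CSV file should end up."""
    prefix = destination.rstrip("/")
    return {split: f"{prefix}/{split}.csv" for split in splits}


uris = expected_output_uris("s3://mlschool/penguins/preprocessing")
for split, uri in uris.items():
    print(f"{split}: {uri}")
```

This works only because the three files have different names. Outputs with identical file names would need separate destinations.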

## Step 7 - Defining and Running the Pipeline

We can now define and run the SageMaker Pipeline. Check [Pipeline Structure and Execution](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-pipeline.html) for more information about how to define a pipeline.


In [12]:

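# Run the pipeline in local mode, which requires Docker and docker-compose.
local_pipeline_session = LocalPipelineSession()

pipeline = Pipeline(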
    name="session1-penguins-pipeline",
    parameters=[
        dataset_location, 
        preprocessor_destination,
    ],
    steps=[
        preprocess_step, 
    ],
    sagemaker_session=local_pipeline_session
)

In [13]:
pipeline.upsert(role_arn=role)
execution = pipeline.start()

Starting execution for pipeline session1-penguins-pipeline. Execution ID is 9ab71355-64cf-44b1-a6ef-cd646f9bdfed
Starting pipeline step: 'penguins-preprocessing-step'
Pipeline step 'penguins-preprocessing-step' FAILED. Failure message is: ImportError: 'docker-compose' is not installed. Local Mode features will not work without docker-compose. For more information on how to install 'docker-compose', please, see https://docs.docker.com/compose/install/
Pipeline execution 9ab71355-64cf-44b1-a6ef-cd646f9bdfed FAILED because step 'penguins-preprocessing-step' failed.
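
The failure above comes from a missing `docker-compose` binary, not from the pipeline definition. A quick check before starting the execution can surface this earlier. This is a minimal sketch: the helper name `missing_tools` is hypothetical, and checking for `docker` as well is an assumption, since the error message only names `docker-compose`.

```python
import shutil


def missing_tools(tools=("docker", "docker-compose")):
    """Return the command-line tools that are not available on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print(f"Install these before using local mode: {', '.join(missing)}")
else:
    print("Local mode prerequisites found.")
```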


## Step 8 - Cleaning up

Before you finish, don't forget to clean up after yourself.

In [64]:
pipeline.delete()

{'PipelineArn': 'arn:aws:sagemaker:us-east-1:325223348818:pipeline/session1-penguins-pipeline',
 'ResponseMetadata': {'RequestId': '7956b94e-aae7-499b-990c-8183b542f84b',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '7956b94e-aae7-499b-990c-8183b542f84b',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '94',
   'date': 'Mon, 10 Apr 2023 17:18:36 GMT'},
  'RetryAttempts': 0}}

# Assignments

1. Set up an Amazon SageMaker domain using the Standard Setup. Make sure you set the network configuration to VPC Only. Create a new execution role and ensure it has access to the S3 bucket you’ll use during this class. You can also specify “Any S3 bucket” if you want this role to access every S3 bucket in your AWS account.

2. Create a GitHub repository and clone it from inside SageMaker Studio. We’ll use this repository to store the code used during this program.

3. Configure your SageMaker Studio session to store your name and email address and cache your credentials. You can use the following commands from a Terminal window:

```bash
$ git config --global user.name "John Doe"
$ git config --global user.email johndoe@example.com
$ git config --global credential.helper store
```

4. Prepare the MNIST dataset. Throughout the course, you will create a SageMaker pipeline to work with the MNIST dataset. MNIST is popular and relatively small, so it's easy to find pre-packaged versions of it. We aren't going to use those. Instead, we will simulate a practical scenario where the data is stored in the filesystem. To accomplish this, you will load a pre-packaged version of MNIST, save it to disk, and upload it to an S3 bucket. Complete the section "Prepare the MNIST Dataset" below.

5. Set up a SageMaker Pipeline with a preprocessing step where you split the MNIST dataset into a train and a test set.

## Prepare the MNIST Dataset

These are the steps you need to follow to prepare the data:

1. Create the S3 bucket to upload the dataset.
2. Load the MNIST dataset from the Keras built-in collection of small datasets, convert it into images, and save them to the disk.
3. Upload the dataset to the S3 bucket.


### Create the dataset

We want to load Keras' built-in version of MNIST and save it to disk so we can later upload it to S3.

There are 70,000 images. Running this cell will take some time, so this is the perfect moment to walk around and grab some coffee. Fortunately, we only need to do this once.

In [None]:
import numpy as np
import tensorflow as tf

from PIL import Image
from pathlib import Path
from tensorflow.keras.datasets import mnist


def save(dataset, split, images, labels):
    """
    This function saves the handwritten digits to disk as PNG files.
    
    Every image will be saved inside a folder corresponding to 
    its label. For example, a digit from the train set representing 
    the number 3 will be saved inside the folder `~/dataset/train/3`.
    """
    
    for index, (image, label) in enumerate(zip(images, labels)):
        im = Image.fromarray(image)

        path = dataset / split / str(label)
        path.mkdir(parents=True, exist_ok=True)
        
        im.save(path / f"{index}.png")
        

# We will save the dataset in the home directory, inside a folder
# named `dataset`.
dataset = Path.home() / "dataset" 

# We want to make sure we don't generate the images if the dataset
# already exists.
if not dataset.exists():
    # Load the MNIST dataset using the Keras library. This returns the
    # dataset in numpy arrays.
    (X_train, y_train), (X_test, y_test) = mnist.load_data()

    # Use the function we created to save the data to disk.
    save(dataset, split="train", images=X_train, labels=y_train)
    save(dataset, split="test", images=X_test, labels=y_test)

### Upload the data to S3

Now that we have exported the MNIST dataset to the filesystem, we need to upload it to an S3 bucket. The easiest way to do this is to use the AWS CLI.

This command will also take a while to finish.

In [None]:
!aws s3 cp $dataset s3://$BUCKET/dataset --recursive
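
A recursive copy keeps the folder structure by turning each file's relative path into an object key under the target prefix. The sketch below shows that mapping in Python, which can be handy for checking what will be uploaded. The helper name `s3_keys` is hypothetical, and the temporary folder only stands in for the real dataset.

```python
import tempfile
from pathlib import Path


def s3_keys(root, prefix="dataset"):
    """Yield (local_path, s3_key) pairs for every file under `root`."""
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Keys always use forward slashes, whatever the local OS uses.
            yield path, f"{prefix}/{path.relative_to(root).as_posix()}"


# Build a tiny folder that mimics the dataset layout and list its keys.
with tempfile.TemporaryDirectory() as directory:
    sample = Path(directory) / "train" / "3"
    sample.mkdir(parents=True)
    (sample / "0.png").touch()

    for path, key in s3_keys(directory):
        print(key)
```

Each pair could then be passed to the `upload_file(Filename, Bucket, Key)` method of a `boto3` S3 client, although the CLI command above is simpler for a one-off upload.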

# Additional Notes

1. Amazon SageMaker is free to try. Your free tier starts from the first month when you create your first SageMaker resource and lasts 2 months. Check out the [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) for more information.

2. We’ll be working extensively with [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html) and [SageMaker’s Python SDK](https://sagemaker.readthedocs.io/en/stable/).

3. This notebook uses a Scikit-Learn Pipeline to transform the dataset. You should always orchestrate your transformations using pipelines. Check the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) for more details.

4. The preprocessing script uses `np.split()` to split the dataset into 3 different splits. It's a neat way of getting the three splits with a single instruction.

5. Keras offers a [list of built-in vectorized datasets](https://keras.io/api/datasets/) in NumPy format. You can load any of these datasets with a single line of code, making them convenient.

6. Converting a NumPy array into an image you can save and visualize is a useful trick to know. Check the `Image.fromarray()` function from the `PIL` library.

7. The [command line interface](https://docs.aws.amazon.com/cli/latest/index.html) is a simple way to interact with the AWS services. You can combine Python code with bash commands in the same notebook cell, which makes notebooks a very flexible tool.

8. Check Python’s `pathlib` module. Since Python 3.4, this module offers a clean way to interact with the filesystem.

9. This notebook uses the `%%writefile` cell magic. There's a whole list of [line and cell magics](https://ipython.readthedocs.io/en/stable/interactive/magics.html) you can start using in your code.