# Setup

Let's ensure we are running the latest version of the SageMaker SDK. **Restart the Kernel** after you run the following cell.

In [84]:
%%bash 

DOMAIN_ID="d-givocgtibv1g"
USER_PROFILE="default-1682182522641"


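# Base64-encode the lifecycle script as a single line (-A).
LCC_CONTENT=$(openssl base64 -A -in packages.sh)

# Delete any previous version of the config. The error printed on the
# first run, when no config exists yet, can be ignored.
aws sagemaker delete-studio-lifecycle-config \
    --studio-lifecycle-config-name packages

# Register the script as a lifecycle config for KernelGateway apps.
response=$(aws sagemaker create-studio-lifecycle-config \
    --studio-lifecycle-config-name packages \
    --studio-lifecycle-config-content "$LCC_CONTENT" \
    --studio-lifecycle-config-app-type KernelGateway)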

arn=$(echo "${response}" | python3 -c "import sys, json; print(json.load(sys.stdin)['StudioLifecycleConfigArn'])")
echo "${arn}"

aws sagemaker update-user-profile --domain-id $DOMAIN_ID \
    --user-profile-name $USER_PROFILE \
    --user-settings '{
        "KernelGatewayAppSettings": {
            "LifecycleConfigArns": ["'$arn'"]
        }
    }'

arn:aws:sagemaker:us-east-1:325223348818:studio-lifecycle-config/packages
{
    "UserProfileArn": "arn:aws:sagemaker:us-east-1:325223348818:user-profile/d-givocgtibv1g/default-1682182522641"
}
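
The bash cell above base64-encodes `packages.sh` and registers it as a Studio lifecycle configuration. As a rough sketch, the same two steps could be done from Python with boto3. The helper names and the sample script text below are hypothetical and only illustrate the idea; the actual contents of `packages.sh` are not shown in this notebook.

```python
import base64


def encode_lifecycle_script(script_text: str) -> str:
    """Return the script as single-line base64, like `openssl base64 -A`."""
    return base64.b64encode(script_text.encode("utf-8")).decode("ascii")


def register_lifecycle_config(name: str, script_text: str) -> str:
    """Create a KernelGateway lifecycle config and return its ARN.

    Hypothetical helper: it needs AWS credentials and will fail if a
    config with the same name already exists.
    """
    import boto3  # imported here so the encoding helper works without boto3

    client = boto3.client("sagemaker")
    response = client.create_studio_lifecycle_config(
        StudioLifecycleConfigName=name,
        StudioLifecycleConfigContent=encode_lifecycle_script(script_text),
        StudioLifecycleConfigAppType="KernelGateway",
    )
    return response["StudioLifecycleConfigArn"]


# Placeholder script text, not the real packages.sh.
encoded = encode_lifecycle_script("#!/bin/bash\necho 'hypothetical script'\n")
print(encoded)
```

Only `encode_lifecycle_script` runs in the last two lines; calling `register_lifecycle_config` would contact AWS.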


In [7]:
import os
import pandas as pd
import sagemaker
import sys
import tempfile
import urllib.request

from pathlib import Path

CODE_FOLDER = "code"
Path(CODE_FOLDER).mkdir(parents=True, exist_ok=True)

sys.path.append(f"./{CODE_FOLDER}")

In [8]:
%%writefile {CODE_FOLDER}/setup.py

import boto3
import sagemaker

from pathlib import Path

BUCKET = "mlschool"
CODE_FOLDER = "code"

S3_FILEPATH = f"s3://{BUCKET}/penguins"
LOCAL_FILEPATH = Path().resolve() / "data.csv"
INPUT_DATA_URI = f"{S3_FILEPATH}/data.csv"

Overwriting code/setup.py


## Step 1 - Initial Setup

Let's start by preparing the S3 bucket where we will organize every resource we use during the program. Make sure you set `BUCKET` to the bucket name you want to use. Bucket names must be globally unique, so the name can't already be taken by another AWS account. The [command line interface](https://docs.aws.amazon.com/cli/latest/index.html) is a simple way to interact with AWS services. You can combine Python code with bash commands in the same notebook cell by prefixing the commands with `!`, which makes notebooks a very flexible tool.

The cell below creates the bucket in `us-east-1`. If you want to create the bucket in a different region, use this command instead:

```
!aws s3api create-bucket --bucket $BUCKET --create-bucket-configuration LocationConstraint=$region
```

The `LocationConstraint` argument should specify the region where you want to create the bucket.
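
The region rule above can also be expressed in Python. The following is a minimal sketch, assuming you prefer boto3 over the CLI; the helper names `create_bucket_kwargs` and `create_bucket` are hypothetical. It omits the location constraint for `us-east-1` and sets it for every other region.

```python
def create_bucket_kwargs(bucket: str, region: str) -> dict:
    """Build the arguments for S3's CreateBucket call.

    `us-east-1` is the default location, so the constraint is only
    added for other regions.
    """
    kwargs = {"Bucket": bucket}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket: str, region: str) -> None:
    """Hypothetical wrapper; requires AWS credentials to run."""
    import boto3  # imported here so the helper above works without boto3

    s3_client = boto3.client("s3", region_name=region)
    s3_client.create_bucket(**create_bucket_kwargs(bucket, region))


print(create_bucket_kwargs("mlschool", "us-east-1"))
print(create_bucket_kwargs("mlschool", "eu-west-1"))
```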

After we have a bucket, we can download the [Penguins dataset](https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data) and store it in a folder inside the bucket. Our SageMaker Pipeline will use this dataset.

In [9]:
from setup import *

In [10]:
!aws s3api create-bucket --bucket $BUCKET

{
    "Location": "/mlschool"
}


Download the official Penguins dataset and store it locally.

In [11]:
urllib.request.urlretrieve(
    "https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins_size.csv", 
    LOCAL_FILEPATH
)

(PosixPath('/root/ml.school/penguins/data.csv'),
 <http.client.HTTPMessage at 0x7f3f2852b1f0>)

Upload the dataset to S3. We need to do this to make it available to the preprocessing step.

In [12]:
sagemaker.s3.S3Uploader.upload(
    local_path=str(LOCAL_FILEPATH), 
    desired_s3_uri=S3_FILEPATH,
)

's3://mlschool/penguins/data.csv'
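
The returned URI is the destination prefix followed by the local file name, which is why `INPUT_DATA_URI` in `setup.py` points to `data.csv` under `S3_FILEPATH`. A small sketch of that relationship, using a hypothetical helper `expected_upload_uri` that only builds the string and does not contact S3:

```python
from pathlib import Path


def expected_upload_uri(local_path: str, desired_s3_uri: str) -> str:
    """Return the S3 URI we expect for a single uploaded file."""
    return f"{desired_s3_uri.rstrip('/')}/{Path(local_path).name}"


# Values copied from the cells above.
S3_FILEPATH = "s3://mlschool/penguins"
LOCAL_FILEPATH = "/root/ml.school/penguins/data.csv"

print(expected_upload_uri(LOCAL_FILEPATH, S3_FILEPATH))
```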

We can now load and display the dataset.

In [13]:
df = pd.read_csv(LOCAL_FILEPATH)
df

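|     | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE |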
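
Some rows in the output above (rows 3 and 339, for example) have no measurements. The following is a minimal sketch for counting missing values per column. It uses a small inline sample copied from the first five rows so it runs on its own; on the full dataset you would call `df.isna().sum()` directly.

```python
import io

import pandas as pd

# First five rows of the dataset, copied from the output above.
SAMPLE = """species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
Adelie,Torgersen,,,,,
Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
"""

sample_df = pd.read_csv(io.StringIO(SAMPLE))

# Number of missing values in each column.
missing = sample_df.isna().sum()
print(missing)

# Index labels of the rows that have at least one missing value.
incomplete_rows = sample_df[sample_df.isna().any(axis=1)]
print(incomplete_rows.index.tolist())
```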


## Step 2 - Setting up Permissions

To run this notebook, you need to update the policy attached to SageMaker's Execution Role so that it includes the following permissions:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "IAM0",
            "Effect": "Allow",
            "Action": [
                "iam:CreateServiceLinkedRole"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "iam:AWSServiceName": [
                        "autoscaling.amazonaws.com",
                        "ec2scheduled.amazonaws.com",
                        "elasticloadbalancing.amazonaws.com",
                        "spot.amazonaws.com",
                        "spotfleet.amazonaws.com",
                        "transitgateway.amazonaws.com"
                    ]
                }
            }
        },
        {
            "Sid": "IAM1",
            "Effect": "Allow",
            "Action": [
                "iam:CreateRole",
                "iam:PassRole",
                "iam:AttachRolePolicy"
            ],
            "Resource": "*"
        },
        {
            "Sid": "Lambda",
            "Effect": "Allow",
            "Action": [
                "lambda:CreateFunction",
                "lambda:DeleteFunction",
                "lambda:InvokeFunctionUrl",
                "lambda:InvokeFunction",
                "lambda:UpdateFunctionCode",
                "lambda:InvokeAsync"
            ],
            "Resource": "*"
        },
        {
            "Sid": "SageMaker",
            "Effect": "Allow",
            "Action": [
                "sagemaker:UpdateDomain",
                "sagemaker:UpdateUserProfile"
            ],
            "Resource": "*"
        },
        {
            "Sid": "CloudWatch",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData",
                "cloudwatch:GetMetricData",
                "cloudwatch:DescribeAlarmsForMetric",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "logs:CreateLogGroup",
                "logs:DescribeLogStreams"
            ],
            "Resource": "*"
        },
        {
            "Sid": "ECR",
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage"
            ],
            "Resource": "*"
        },
        {
            "Sid": "S3",
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket",
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::*"
        }
    ]
}
```
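
If you prefer to apply the policy from code instead of the IAM console, the sketch below shows one possible approach: attaching it to the role as an inline policy with boto3. The helper names and the policy name `notebook-permissions` are hypothetical, and only the S3 statement is repeated here to keep the example short. The call needs credentials that are allowed to run `iam:PutRolePolicy`, which the execution role itself may not have.

```python
import json

# Excerpt of the policy above (S3 statement only), for illustration.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "S3",
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket",
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject",
            ],
            "Resource": "arn:aws:s3:::*",
        }
    ],
}


def role_name_from_arn(role_arn: str) -> str:
    """Return the role name, which is the last path segment of the ARN."""
    return role_arn.split("/")[-1]


def attach_inline_policy(role_arn: str, policy: dict,
                         policy_name: str = "notebook-permissions") -> None:
    """Hypothetical helper; requires AWS credentials with IAM access."""
    import boto3  # imported here so the helpers above work without boto3

    iam_client = boto3.client("iam")
    iam_client.put_role_policy(
        RoleName=role_name_from_arn(role_arn),
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy),
    )


# Made-up ARN used only to show the name extraction.
print(role_name_from_arn("arn:aws:iam::123456789012:role/service-role/ExampleRole"))
```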

Let's start by defining a few variables we'll use throughout this notebook:

* `sagemaker_client`: We'll use a [boto3 SageMaker Client](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html) instance to access SageMaker.
* `iam_client`: We'll use a [boto3 IAM Client](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/iam.html) instance to access IAM.
* `role`: This is the execution role attached to this notebook. We can use this role with any of the SageMaker services that need it to ensure they run with the appropriate permissions.
* `region`: The current region attached to our session. 
* `sagemaker_session`: The current SageMaker session.

In [14]:
%%writefile -a {CODE_FOLDER}/setup.py

sagemaker_client = boto3.client("sagemaker")
iam_client = boto3.client("iam")
role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()

Appending to code/setup.py
