# Setting up SageMaker Studio

Use this notebook to set up SageMaker Studio. You only need to run the code here once.

This notebook is part of the [Machine Learning School](https://www.ml.school) program.

In [50]:
%load_ext autoreload
%autoreload 2

import sys
from pathlib import Path

CODE_FOLDER = Path("code")
CODE_FOLDER.mkdir(parents=True, exist_ok=True)

sys.path.append(f"./{CODE_FOLDER}")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Update the following constants with your SageMaker Domain Id and User Profile. You can find them in your Amazon SageMaker dashboard under "Domains".

In [25]:
DOMAIN_ID = "d-givocgtibv1g"
USER_PROFILE = "default-1682182522641"

## Step 1 - Lifecycle Configuration

You can customize SageMaker Studio using Lifecycle configurations. These are shell scripts that will be triggered by lifecycle events, such as starting a new Studio notebook.

The following script upgrades the packages on a SageMaker Studio Kernel Application.

In [23]:
%%writefile packages.sh
#!/bin/bash
# This script upgrades the packages on a SageMaker 
# Studio Kernel Application.

set -eux

pip install -q --upgrade pip
pip install -q --upgrade awscli boto3
pip install -q --upgrade scikit-learn==0.23.2
pip install -q --upgrade PyYAML==6.0
pip install -q --upgrade sagemaker

Overwriting packages.sh


We can now create a new lifecycle configuration that we can later select as the start-up script for our kernel.

In [None]:
%%bash -s "$DOMAIN_ID" "$USER_PROFILE"

DOMAIN_ID="$1"
USER_PROFILE="$2"

# SageMaker expects the script content as a base64-encoded string.
LCC_CONTENT=$(openssl base64 -A -in packages.sh)

# Remove any previous configuration with the same name and ignore
# the error if it doesn't exist yet.
aws sagemaker delete-studio-lifecycle-config \
    --studio-lifecycle-config-name packages || true

response=$(aws sagemaker create-studio-lifecycle-config \
    --studio-lifecycle-config-name packages \
    --studio-lifecycle-config-content "$LCC_CONTENT" \
    --studio-lifecycle-config-app-type KernelGateway)

arn=$(echo "${response}" | python3 -c "import sys, json; print(json.load(sys.stdin)['StudioLifecycleConfigArn'])")
echo "${arn}"

# Attach the lifecycle configuration to the user profile.
aws sagemaker update-user-profile --domain-id "$DOMAIN_ID" \
    --user-profile-name "$USER_PROFILE" \
    --user-settings '{
        "KernelGatewayAppSettings": {
            "LifecycleConfigArns": ["'"$arn"'"]
        }
    }'
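
For readers who prefer Python over the AWS CLI, the request that creates the lifecycle configuration could be sketched with boto3. The helper name `build_lifecycle_config_request` and the sample script are hypothetical and only for illustration. The helper just prepares the arguments, and the commented-out call assumes AWS credentials are configured.

```python
import base64


def build_lifecycle_config_request(script: str, name: str = "packages") -> dict:
    """Build the keyword arguments for create_studio_lifecycle_config.

    Hypothetical helper for illustration. SageMaker expects the script
    content as a base64-encoded string, which is what `openssl base64 -A`
    produces in the shell version.
    """
    content = base64.b64encode(script.encode("utf-8")).decode("ascii")
    return {
        "StudioLifecycleConfigName": name,
        "StudioLifecycleConfigContent": content,
        "StudioLifecycleConfigAppType": "KernelGateway",
    }


# Sample script content (an assumption, shortened from packages.sh).
script = "#!/bin/bash\nset -eux\npip install -q --upgrade pip\n"
request = build_lifecycle_config_request(script)

# With credentials configured, the request could be sent like this:
# import boto3
# response = boto3.client("sagemaker").create_studio_lifecycle_config(**request)
# arn = response["StudioLifecycleConfigArn"]
```

Keeping the encoding step in a separate function makes it possible to check the payload without touching AWS.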

## Step 2 - Permissions 

To run the notebooks we use during the program, you need to update the Execution Policy assigned to SageMaker's Execution Role and add the appropriate permissions.

The following cell displays the ARN of the Execution Role you are currently using.

In [28]:
import sagemaker

role = sagemaker.get_execution_role()
role

'arn:aws:iam::325223348818:role/service-role/AmazonSageMaker-ExecutionRole-20230312T160501'
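
The value above is a full ARN, while the IAM console lists roles by name. As a minimal sketch, the role name can be taken from the last path segment of the ARN. The function `role_name_from_arn` and the example ARN are made up for illustration.

```python
def role_name_from_arn(role_arn: str) -> str:
    """Return the role name, the last path segment of an IAM role ARN."""
    # An ARN has the form arn:partition:service:region:account-id:resource.
    # For a role, the resource looks like "role/<optional path>/<name>".
    resource = role_arn.split(":", 5)[5]
    return resource.split("/")[-1]


# Placeholder ARN (an assumption, not a real account or role).
example_arn = (
    "arn:aws:iam::123456789012:role/service-role/"
    "AmazonSageMaker-ExecutionRole-Example"
)
print(role_name_from_arn(example_arn))
```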

Open the AWS IAM console, find the role, and edit the custom Execution Policy assigned to it. Replace the policy's permissions with the following definition:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "IAM0",
            "Effect": "Allow",
            "Action": [
                "iam:CreateServiceLinkedRole"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "iam:AWSServiceName": [
                        "autoscaling.amazonaws.com",
                        "ec2scheduled.amazonaws.com",
                        "elasticloadbalancing.amazonaws.com",
                        "spot.amazonaws.com",
                        "spotfleet.amazonaws.com",
                        "transitgateway.amazonaws.com"
                    ]
                }
            }
        },
        {
            "Sid": "IAM1",
            "Effect": "Allow",
            "Action": [
                "iam:CreateRole",
                "iam:PassRole",
                "iam:AttachRolePolicy"
            ],
            "Resource": "*"
        },
        {
            "Sid": "Lambda",
            "Effect": "Allow",
            "Action": [
                "lambda:CreateFunction",
                "lambda:DeleteFunction",
                "lambda:InvokeFunctionUrl",
                "lambda:InvokeFunction",
                "lambda:UpdateFunctionCode",
                "lambda:InvokeAsync"
            ],
            "Resource": "*"
        },
        {
            "Sid": "SageMaker",
            "Effect": "Allow",
            "Action": [
                "sagemaker:UpdateDomain",
                "sagemaker:UpdateUserProfile"
            ],
            "Resource": "*"
        },
        {
            "Sid": "CloudWatch",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData",
                "cloudwatch:GetMetricData",
                "cloudwatch:DescribeAlarmsForMetric",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "logs:CreateLogGroup",
                "logs:DescribeLogStreams"
            ],
            "Resource": "*"
        },
        {
            "Sid": "ECR",
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage"
            ],
            "Resource": "*"
        },
        {
            "Sid": "S3",
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket",
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::*"
        }
    ]
}
```
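
Before pasting a policy into IAM, it can help to confirm that the JSON parses and to review which actions each statement grants. The following is a minimal sketch using only the standard library. `summarize_policy` is a hypothetical helper, and the embedded policy is an abbreviated copy of two statements from the definition above.

```python
import json

# Abbreviated copy of two statements from the policy, for illustration.
policy_text = """
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SageMaker",
            "Effect": "Allow",
            "Action": [
                "sagemaker:UpdateDomain",
                "sagemaker:UpdateUserProfile"
            ],
            "Resource": "*"
        },
        {
            "Sid": "ECR",
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchGetImage"
            ],
            "Resource": "*"
        }
    ]
}
"""


def summarize_policy(text: str) -> dict:
    """Map each statement's Sid to the sorted list of actions it allows."""
    policy = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    summary = {}
    for statement in policy["Statement"]:
        actions = statement["Action"]
        # IAM accepts either a single string or a list of strings here.
        if isinstance(actions, str):
            actions = [actions]
        summary[statement["Sid"]] = sorted(actions)
    return summary


for sid, actions in summarize_policy(policy_text).items():
    print(sid, actions)
```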

## Step 3 - Constants

There are a few constants and variables we'll use throughout every notebook. To prevent code duplication, we'll define them in a file that we can later reuse.

* `BUCKET`: This is the name of the S3 bucket where we will organize every resource we are going to use during the program. S3 bucket names must be globally unique, so replace the default value with a name of your own.
* `S3_LOCATION`: This is the location in S3 where we'll save every file related to the Penguins project.
* `DATA_FILEPATH`: The local path where we'll download the Penguins dataset.
* `sagemaker_client`: We'll use a [boto3 SageMaker Client](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html) instance to access SageMaker.
* `iam_client`: We'll use a [boto3 IAM Client](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/iam.html) instance to access IAM.
* `role`: This is the execution role attached to this notebook. We can use this role with any of the SageMaker services that need it to ensure they run with the appropriate permissions.
* `region`: The current region attached to our session. 
* `sagemaker_session`: The current SageMaker session.
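
One way to pick a bucket name that is unlikely to collide with an existing one is to append a random suffix to a prefix. This is only a sketch: `make_bucket_name` is a hypothetical helper, and the pattern checks just the basic naming rules (3 to 63 characters, lowercase letters, digits, dots, and hyphens, starting and ending with a letter or digit). Because the suffix is random, generate the name once and copy the result into `BUCKET` rather than regenerating it on every import.

```python
import re
import uuid

# Basic S3 naming rules only; this does not cover every restriction.
BUCKET_NAME_PATTERN = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def make_bucket_name(prefix: str) -> str:
    """Return the prefix followed by a random 8-character hex suffix."""
    suffix = uuid.uuid4().hex[:8]
    name = f"{prefix}-{suffix}".lower()
    if not BUCKET_NAME_PATTERN.match(name):
        raise ValueError(f"Invalid bucket name: {name}")
    return name


BUCKET = make_bucket_name("mlschool")
S3_LOCATION = f"s3://{BUCKET}/penguins"
print(BUCKET)
print(S3_LOCATION)
```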

In [51]:
%%writefile {CODE_FOLDER}/constants.py

import boto3
import sagemaker
from pathlib import Path


BUCKET = "mlschool"
S3_LOCATION = f"s3://{BUCKET}/penguins"
DATA_FILEPATH = Path().resolve() / "data.csv"


sagemaker_client = boto3.client("sagemaker")
iam_client = boto3.client("iam")
role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()

Overwriting code/constants.py


## Step 4 - S3 Bucket

Let's now create the S3 bucket where we will organize every resource we are going to use during the program.

If you want to create a bucket in a region other than `us-east-1`, use this command instead:

```
!aws s3api create-bucket --bucket $BUCKET --create-bucket-configuration LocationConstraint="eu-west-1"
```

The `LocationConstraint` argument should specify the region where you want to create the bucket. The example above creates the bucket in the `eu-west-1` region.
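
The same region rule applies when creating the bucket from Python. As a sketch, the arguments for boto3's `create_bucket` can be built so that `CreateBucketConfiguration` is included only outside `us-east-1`. The helper name `create_bucket_kwargs` is hypothetical, and the commented-out call assumes AWS credentials are configured.

```python
def create_bucket_kwargs(bucket: str, region: str) -> dict:
    """Build the keyword arguments for the S3 create_bucket call."""
    kwargs = {"Bucket": bucket}
    # us-east-1 is the default location and takes no LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


print(create_bucket_kwargs("mlschool", "us-east-1"))
print(create_bucket_kwargs("mlschool", "eu-west-1"))

# With credentials configured, the bucket could be created like this:
# import boto3
# region = "eu-west-1"
# s3_client = boto3.client("s3", region_name=region)
# s3_client.create_bucket(**create_bucket_kwargs("mlschool", region))
```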

In [47]:
from constants import BUCKET

!aws s3api create-bucket --bucket $BUCKET

{
    "Location": "/mlschool"
}


After we have a bucket, we can download the [Penguins dataset](https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data) and store it in the bucket.

In [48]:
import urllib.request
import pandas as pd

from constants import S3_LOCATION, DATA_FILEPATH
from sagemaker.s3 import S3Uploader


urllib.request.urlretrieve(
    "https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins_size.csv", 
    DATA_FILEPATH
)

S3Uploader.upload(local_path=str(DATA_FILEPATH), desired_s3_uri=S3_LOCATION)

's3://mlschool/penguins/data.csv'
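
The returned value is the destination prefix followed by the local file name. A small sketch of that composition, useful when a later step needs the object's URI without uploading again; `expected_s3_uri` is a hypothetical helper, not part of the SageMaker SDK.

```python
from pathlib import Path


def expected_s3_uri(s3_location: str, local_path: Path) -> str:
    """Join an S3 prefix and a local file name into the object's URI."""
    return f"{s3_location.rstrip('/')}/{local_path.name}"


print(expected_s3_uri("s3://mlschool/penguins", Path("data.csv")))
```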

We can now load and display the dataset.

In [49]:
pd.read_csv(DATA_FILEPATH)

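|     | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE |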
