# SageMaker MLOps: Job major classifier project

This notebook programmatically creates an MLOps project for classifying job titles into one of the job majors listed on the O*NET website.

Run this notebook in the SageMaker Data Science 3.0 environment.

***

## Setup

Install the latest version of the SageMaker Python SDK.

In [2]:
# Uncomment if you have any compatibility issues and would like to use the specific version of the sagemaker library
# %pip install sagemaker==2.132.0
%pip install --upgrade pip sagemaker

Collecting pip
  Using cached pip-24.0-py3-none-any.whl.metadata (3.6 kB)
Collecting sagemaker
  Downloading sagemaker-2.208.0-py3-none-any.whl.metadata (13 kB)
Collecting urllib3<3.0.0,>=1.26.8 (from sagemaker)
  Using cached urllib3-2.0.7-py3-none-any.whl.metadata (6.6 kB)
Using cached pip-24.0-py3-none-any.whl (2.1 MB)
Downloading sagemaker-2.208.0-py3-none-any.whl (1.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 9.6 MB/s eta 0:00:00
Using cached urllib3-2.0.7-py3-none-any.whl (124 kB)
Installing collected packages: urllib3, pip, sagemaker
  Attempting uninstall: urllib3
    Found existing installation: urllib3 2.1.0
    Uninstalling urllib3-2.1.0:
      Successfully uninstalled urllib3-2.1.0
  Attempting uninstall: pip
    Found existing installation: pip 23.3.1
    Uninstalling pip-23.3.1:
      Successfully uninstalled pip-23.3.1
  Attempting uninstall: sagemaker
    Found existing installation: sagemaker 2.19

In [3]:
# Restart the kernel so the upgraded sagemaker package is loaded
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

### Import packages

In [1]:
import time
import os
import json
import boto3
import numpy as np  
import pandas as pd 
import sagemaker
from time import gmtime, strftime, sleep

sagemaker.__version__

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


'2.208.0'

### Set constants

In [2]:
# Variables for handling the SageMaker SDK
boto_session = boto3.Session()
region = boto_session.region_name
sm_session = sagemaker.Session()
bucket_name = sm_session.default_bucket()
bucket_prefix = "job-major-clf"
sm_client = boto_session.client("sagemaker")
sm_role = sagemaker.get_execution_role()

# Pipeline objects
project = "test-job-major-clf"
pipeline_name = f"{project}-pipeline"
pipeline_model_name = f"{project}-model"
model_package_group_name = f"{project}-model-group"
endpoint_config_name = f"{project}-endpoint-config"
endpoint_name = f"{project}-endpoint"

# Instance types and counts
process_instance_type = "ml.c5.xlarge"
train_instance_count = 1
train_instance_type = "ml.m5.xlarge"

# S3 URLs for data
train_s3_url = f"s3://{bucket_name}/{bucket_prefix}/train"
validation_s3_url = f"s3://{bucket_name}/{bucket_prefix}/validation"
test_s3_url = f"s3://{bucket_name}/{bucket_prefix}/test"
baseline_s3_url = f"s3://{bucket_name}/{bucket_prefix}/baseline"

evaluation_s3_url = f"s3://{bucket_name}/{bucket_prefix}/evaluation"
prediction_baseline_s3_url = f"s3://{bucket_name}/{bucket_prefix}/prediction_baseline"

output_s3_url = f"s3://{bucket_name}/{bucket_prefix}/output"

In [3]:
print(sm_role)
print(f"Train S3 url: {train_s3_url}")
print(f"Validation S3 url: {validation_s3_url}")
print(f"Test S3 url: {test_s3_url}")
print(f"Data baseline S3 url: {baseline_s3_url}")
print(f"Evaluation metrics S3 url: {evaluation_s3_url}")
print(f"Model prediction baseline S3 url: {prediction_baseline_s3_url}")

arn:aws:iam::181460750629:role/sm-studio-execution-role
Train S3 url: s3://sagemaker-us-east-1-181460750629/job-major-clf/train
Validation S3 url: s3://sagemaker-us-east-1-181460750629/job-major-clf/validation
Test S3 url: s3://sagemaker-us-east-1-181460750629/job-major-clf/test
Data baseline S3 url: s3://sagemaker-us-east-1-181460750629/job-major-clf/baseline
Evaluation metrics S3 url: s3://sagemaker-us-east-1-181460750629/job-major-clf/evaluation
Model prediction baseline S3 url: s3://sagemaker-us-east-1-181460750629/job-major-clf/prediction_baseline


Get the SageMaker domain ID from the notebook's resource metadata file.

In [4]:
NOTEBOOK_METADATA_FILE = "/opt/ml/metadata/resource-metadata.json"
domain_id = None

if os.path.exists(NOTEBOOK_METADATA_FILE):
    with open(NOTEBOOK_METADATA_FILE, "rb") as f:
        domain_id = json.loads(f.read()).get('DomainId')
        print(f"SageMaker domain id: {domain_id}")

SageMaker domain id: d-eezrmoob5mvx


***

## Data

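The data comes from the O*NET database, release 28.1, in Excel format. Two files are downloaded from onetcenter.org: `Occupation Data.xlsx` and `Alternate Titles.xlsx`.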
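
As a rough sketch of how labels could be derived from this data: O*NET-SOC codes have the form `XX-XXXX.XX`, and the first two digits identify the SOC major group. The snippet below assumes that "job major" means this major group; the helper `major_group` and the sample pairs are illustrative and are not taken from the project's pipeline code.

```python
import re

# O*NET-SOC codes look like "15-1252.00": a two-digit major group, a dash,
# four more SOC digits, a dot, and a two-digit O*NET extension.
ONET_SOC_PATTERN = re.compile(r"^(\d{2})-\d{4}\.\d{2}$")


def major_group(onet_soc_code: str) -> str:
    """Return the two-digit major group prefix of an O*NET-SOC code."""
    match = ONET_SOC_PATTERN.match(onet_soc_code.strip())
    if match is None:
        raise ValueError(f"Not an O*NET-SOC code: {onet_soc_code!r}")
    return match.group(1)


# Illustrative (title, code) pairs, not read from the downloaded files
examples = [
    ("Software Developers", "15-1252.00"),
    ("Registered Nurses", "29-1141.00"),
]
labels = {title: major_group(code) for title, code in examples}
print(labels)
```

Validating the code format up front means a malformed row fails loudly instead of silently producing a wrong label.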

In [5]:
# Download datasets
!wget -P data/ https://www.onetcenter.org/dl_files/database/db_28_1_excel/Occupation%20Data.xlsx
!wget -P data/ https://www.onetcenter.org/dl_files/database/db_28_1_excel/Alternate%20Titles.xlsx

--2024-02-16 23:14:11--  https://www.onetcenter.org/dl_files/database/db_28_1_excel/Occupation%20Data.xlsx
Resolving www.onetcenter.org (www.onetcenter.org)... 152.46.6.153, 2610:28:2100:1::10
Connecting to www.onetcenter.org (www.onetcenter.org)|152.46.6.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 102700 (100K) [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘data/Occupation Data.xlsx’


2024-02-16 23:14:11 (11.1 MB/s) - ‘data/Occupation Data.xlsx’ saved [102700/102700]

--2024-02-16 23:14:11--  https://www.onetcenter.org/dl_files/database/db_28_1_excel/Alternate%20Titles.xlsx
Resolving www.onetcenter.org (www.onetcenter.org)... 152.46.6.153, 2610:28:2100:1::10
Connecting to www.onetcenter.org (www.onetcenter.org)|152.46.6.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1228946 (1.2M) [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘data/Alternate Titles.xlsx’


In [6]:
try:
    input_s3_urls
except NameError:
    # If input_s3_urls is not defined yet, upload the datasets to S3 and store their URLs
    input_s3_urls = [
        sm_session.upload_data(
            path=path,
            bucket=bucket_name,
            key_prefix=f"{bucket_prefix}/input",
        )
        for path in ("data/Alternate Titles.xlsx", "data/Occupation Data.xlsx")
    ]
    print(f"Uploaded datasets to {input_s3_urls}")

Uploaded datasets to ['s3://sagemaker-us-east-1-181460750629/job-major-clf/input/Alternate Titles.xlsx', 's3://sagemaker-us-east-1-181460750629/job-major-clf/input/Occupation Data.xlsx']


***

## Create the project programmatically

In [7]:
sm = boto3.client("sagemaker")
sc = boto3.client("servicecatalog")

sc_provider_name = "Amazon SageMaker"
sc_product_name = "MLOps template for model building and training"

In [8]:
p_ids = [p['ProductId'] for p in sc.search_products(
    Filters={
        'FullTextSearch': [sc_product_name]
    },
)['ProductViewSummaries'] if p["Name"]==sc_product_name]

In [9]:
# If this code raises an exception, go to Option 2 and create the project in the Studio UX
if not p_ids:
    raise Exception("No Amazon SageMaker ML Ops products found!")
elif len(p_ids) > 1:
    raise Exception("Too many matching Amazon SageMaker ML Ops products found!")
else:
    product_id = p_ids[0]
    print(f"ML Ops product id: {product_id}")

ML Ops product id: prod-53ibyqbj2cgmo


In [10]:
# Among the artifacts with DEFAULT guidance, take the one whose name sorts last
provisioning_artifact_id = sorted(
    [i for i in sc.list_provisioning_artifacts(
        ProductId=product_id
    )['ProvisioningArtifactDetails'] if i['Guidance']=='DEFAULT'],
    key=lambda d: d['Name'], reverse=True)[0]['Id']
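
Sorting by `Name` is a string comparison, so a name such as `v1.10` would sort before `v1.9`. A possible alternative, sketched below, selects the newest artifact by its `CreatedTime` field instead. The helper name and the sample entries are made up for illustration; only the dictionary keys mirror the `ProvisioningArtifactDetails` items returned by `list_provisioning_artifacts`.

```python
from datetime import datetime, timezone


def latest_default_artifact_id(artifact_details):
    """Pick the most recently created artifact whose Guidance is DEFAULT."""
    candidates = [a for a in artifact_details if a.get("Guidance") == "DEFAULT"]
    if not candidates:
        raise ValueError("No provisioning artifact with DEFAULT guidance found")
    return max(candidates, key=lambda a: a["CreatedTime"])["Id"]


# Made-up entries shaped like 'ProvisioningArtifactDetails' items
sample = [
    {"Id": "pa-old", "Name": "v1.9", "Guidance": "DEFAULT",
     "CreatedTime": datetime(2022, 1, 1, tzinfo=timezone.utc)},
    {"Id": "pa-new", "Name": "v1.10", "Guidance": "DEFAULT",
     "CreatedTime": datetime(2023, 6, 1, tzinfo=timezone.utc)},
    {"Id": "pa-deprecated", "Name": "v2.0", "Guidance": "DEPRECATED",
     "CreatedTime": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]

by_time = latest_default_artifact_id(sample)
by_name = sorted(
    (a for a in sample if a["Guidance"] == "DEFAULT"),
    key=lambda a: a["Name"], reverse=True,
)[0]["Id"]
print(by_time, by_name)
```

In the notebook this could be called as `latest_default_artifact_id(sc.list_provisioning_artifacts(ProductId=product_id)['ProvisioningArtifactDetails'])`.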

In [11]:
project_name = "test-job-major-clf"
project_parameters = []

In [12]:
# create SageMaker project
r = sm.create_project(
    ProjectName=project_name,
    ProjectDescription="Model build project",
    ServiceCatalogProvisioningDetails={
        'ProductId': product_id,
        'ProvisioningArtifactId': provisioning_artifact_id,
    },
)

print(r)
project_id = r["ProjectId"]

{'ProjectArn': 'arn:aws:sagemaker:us-east-1:181460750629:project/test-job-major-clf', 'ProjectId': 'p-zufhkugozwvk', 'ResponseMetadata': {'RequestId': '81bfb147-c418-4f58-a4b7-fbb960ce9924', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '81bfb147-c418-4f58-a4b7-fbb960ce9924', 'content-type': 'application/x-amz-json-1.1', 'content-length': '113', 'date': 'Fri, 16 Feb 2024 23:17:12 GMT'}, 'RetryAttempts': 0}}


In [13]:
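# Poll until the project is created; stop instead of looping forever if creation fails
while (status := sm.describe_project(ProjectName=project_name)['ProjectStatus']) != 'CreateCompleted':
    if status == 'CreateFailed':
        raise RuntimeError(f"MLOps project {project_name} creation failed")
    print("Waiting for project creation completion")
    sleep(10)

print(f"MLOps project {project_name} creation completed")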

Waiting for project creation completion
Waiting for project creation completion
Waiting for project creation completion
Waiting for project creation completion
Waiting for project creation completion
Waiting for project creation completion
Waiting for project creation completion
Waiting for project creation completion
Waiting for project creation completion
Waiting for project creation completion
Waiting for project creation completion
Waiting for project creation completion
MLOps project test-job-major-clf creation completed
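
The polling above can also be wrapped in a small helper that adds an overall timeout. This is a minimal sketch, not part of the SageMaker SDK: `wait_for_project`, its parameters, and the default timeout are assumptions chosen for illustration. The status source is passed in as a callable so the logic can be exercised without AWS access.

```python
import time


def wait_for_project(describe_status, timeout_s=900, poll_s=10, sleep=time.sleep):
    """Poll describe_status() until it returns 'CreateCompleted'.

    describe_status is any zero-argument callable returning the current
    ProjectStatus string. Raises RuntimeError on 'CreateFailed' and
    TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = describe_status()
        if status == "CreateCompleted":
            return status
        if status == "CreateFailed":
            raise RuntimeError("Project creation failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Project still in status {status!r} after {timeout_s}s")
        sleep(poll_s)


# Simulated status sequence standing in for repeated describe_project calls
fake_statuses = iter(["Pending", "CreateInProgress", "CreateCompleted"])
result = wait_for_project(lambda: next(fake_statuses), sleep=lambda _: None)
print(result)
```

With the clients defined in this notebook, the call would look like `wait_for_project(lambda: sm.describe_project(ProjectName=project_name)["ProjectStatus"])`.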


Clone the project default code to the Studio file system:
1. Choose **Home** in the Studio sidebar
2. Select **Deployments** and then select **Projects**
3. Click on the name of the project you created to open the project details tab
4. In the project tab, choose **Repositories**
5. In the **Local path** column for the repository, choose **clone repo...**
6. In the dialog box that appears choose **Clone Repository**

The end

In [14]:
input_s3_urls

['s3://sagemaker-us-east-1-181460750629/job-major-clf/input/Alternate Titles.xlsx',
 's3://sagemaker-us-east-1-181460750629/job-major-clf/input/Occupation Data.xlsx']

The `run-pipeline` invocation below passes the uploaded input prefix to the pipeline through the `input_data_url` keyword argument:

In [None]:
- |
        run-pipeline --module-name pipelines.jobmajor.pipeline \
          --role-arn $SAGEMAKER_PIPELINE_ROLE_ARN \
          --tags "[{\"Key\":\"sagemaker:project-name\", \"Value\":\"${SAGEMAKER_PROJECT_NAME}\"}, {\"Key\":\"sagemaker:project-id\", \"Value\":\"${SAGEMAKER_PROJECT_ID}\"}]" \
          --kwargs "{\"default_bucket\":\"${ARTIFACT_BUCKET}\",\"input_data_url\":\"s3://sagemaker-us-east-1-181460750629/job-major-clf/input\"}"
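
The hand-escaped JSON in `--tags` and `--kwargs` is easy to get wrong. As a sketch, the same strings can be generated with `json.dumps` and quoted for the shell with `shlex.quote`. The function name and the placeholder values below are hypothetical; the keys match the command above.

```python
import json
import shlex


def build_run_pipeline_args(project_name, project_id, artifact_bucket, input_data_url):
    """Return the JSON strings for the --tags and --kwargs options."""
    tags = [
        {"Key": "sagemaker:project-name", "Value": project_name},
        {"Key": "sagemaker:project-id", "Value": project_id},
    ]
    kwargs = {"default_bucket": artifact_bucket, "input_data_url": input_data_url}
    return json.dumps(tags), json.dumps(kwargs)


tags_json, kwargs_json = build_run_pipeline_args(
    project_name="test-job-major-clf",
    project_id="p-example",                                   # placeholder
    artifact_bucket="example-artifact-bucket",                # placeholder
    input_data_url="s3://example-bucket/job-major-clf/input", # placeholder
)

# shlex.quote wraps each JSON string so the shell passes it as one argument
print("--tags", shlex.quote(tags_json))
print("--kwargs", shlex.quote(kwargs_json))
```

Generating the strings this way keeps the JSON valid even if a value later contains quotes or spaces.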