# SageMaker MLOps: Job major classifier project

This notebook programmatically creates an MLOps project for classifying job titles into one of the job majors listed on the O*NET website.

This notebook should be run in the SageMaker Data Science 3.0 environment.

***

## Setup

Install the latest version of the SageMaker Python SDK.

In [3]:
# Uncomment if you have any compatibility issues and would like to use the specific version of the sagemaker library
# %pip install sagemaker==2.132.0
%pip install --upgrade pip sagemaker

Note: you may need to restart the kernel to use updated packages.


In [4]:
# Restart the kernel so that the upgraded packages are picked up
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

### Import packages

In [1]:
import time
import os
import json
import boto3
import numpy as np  
import pandas as pd 
import sagemaker
from time import gmtime, strftime, sleep

sagemaker.__version__

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


'2.214.0'

### Set constants

In [2]:
# Variables for handling the SageMaker SDK
boto_session = boto3.Session()
region = boto_session.region_name
sm_session = sagemaker.Session()
bucket_name = sm_session.default_bucket()
bucket_prefix = "job-major-clf"
sm_client = boto_session.client("sagemaker")
sm_role = sagemaker.get_execution_role()

# Pipeline objects
project = "test-job-major-clf"
pipeline_name = f"{project}-pipeline"
pipeline_model_name = f"{project}-model"
model_package_group_name = f"{project}-model-group"
endpoint_config_name = f"{project}-endpoint-config"
endpoint_name = f"{project}-endpoint"

# Instance types and counts
process_instance_type = "ml.c5.xlarge"
train_instance_count = 1
train_instance_type = "ml.m5.xlarge"

# S3 URLs for data
train_s3_url = f"s3://{bucket_name}/{bucket_prefix}/train"
validation_s3_url = f"s3://{bucket_name}/{bucket_prefix}/validation"
test_s3_url = f"s3://{bucket_name}/{bucket_prefix}/test"
baseline_s3_url = f"s3://{bucket_name}/{bucket_prefix}/baseline"

evaluation_s3_url = f"s3://{bucket_name}/{bucket_prefix}/evaluation"
prediction_baseline_s3_url = f"s3://{bucket_name}/{bucket_prefix}/prediction_baseline"

output_s3_url = f"s3://{bucket_name}/{bucket_prefix}/output"

In [3]:
print(sm_role)
print(f"Train S3 url: {train_s3_url}")
print(f"Validation S3 url: {validation_s3_url}")
print(f"Test S3 url: {test_s3_url}")
print(f"Data baseline S3 url: {baseline_s3_url}")
print(f"Evaluation metrics S3 url: {evaluation_s3_url}")
print(f"Model prediction baseline S3 url: {prediction_baseline_s3_url}")

arn:aws:iam::181460750629:role/sm-studio-execution-role
Train S3 url: s3://sagemaker-us-east-1-181460750629/job-major-clf/train
Validation S3 url: s3://sagemaker-us-east-1-181460750629/job-major-clf/validation
Test S3 url: s3://sagemaker-us-east-1-181460750629/job-major-clf/test
Data baseline S3 url: s3://sagemaker-us-east-1-181460750629/job-major-clf/baseline
Evaluation metrics S3 url: s3://sagemaker-us-east-1-181460750629/job-major-clf/evaluation
Model prediction baseline S3 url: s3://sagemaker-us-east-1-181460750629/job-major-clf/prediction_baseline


Get the SageMaker domain ID from the notebook metadata file.

In [4]:
NOTEBOOK_METADATA_FILE = "/opt/ml/metadata/resource-metadata.json"
domain_id = None

if os.path.exists(NOTEBOOK_METADATA_FILE):
    with open(NOTEBOOK_METADATA_FILE, "rb") as f:
        domain_id = json.loads(f.read()).get('DomainId')
        print(f"SageMaker domain id: {domain_id}")

SageMaker domain id: d-eezrmoob5mvx


***

## Data

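The data comes from the O*NET database (release 28.1, Excel format), published at onetcenter.org. Two files are used: `Occupation Data.xlsx`, which lists each O*NET-SOC code with its title and description, and `Alternate Titles.xlsx`, which maps alternate job titles to those codes.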

In [5]:
# Download datasets
!wget -P data/ https://www.onetcenter.org/dl_files/database/db_28_1_excel/Occupation%20Data.xlsx
!wget -P data/ https://www.onetcenter.org/dl_files/database/db_28_1_excel/Alternate%20Titles.xlsx

--2024-03-23 22:13:33--  https://www.onetcenter.org/dl_files/database/db_28_1_excel/Occupation%20Data.xlsx
Resolving www.onetcenter.org (www.onetcenter.org)... 152.46.6.153, 2610:28:2100:1::10
Connecting to www.onetcenter.org (www.onetcenter.org)|152.46.6.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 102700 (100K) [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘data/Occupation Data.xlsx.1’


2024-03-23 22:13:33 (5.59 MB/s) - ‘data/Occupation Data.xlsx.1’ saved [102700/102700]

--2024-03-23 22:13:33--  https://www.onetcenter.org/dl_files/database/db_28_1_excel/Alternate%20Titles.xlsx
Resolving www.onetcenter.org (www.onetcenter.org)... 152.46.6.153, 2610:28:2100:1::10
Connecting to www.onetcenter.org (www.onetcenter.org)|152.46.6.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1228946 (1.2M) [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘data/Alternate Titles.xl

In [6]:
df_occ = pd.read_excel('data/Occupation Data.xlsx')
df_alt = pd.read_excel('data/Alternate Titles.xlsx')

In [7]:
df_occ.head()

| | O*NET-SOC Code | Title | Description |
|---|---|---|---|
| 0 | 11-1011.00 | Chief Executives | Determine and formulate policies and provide o... |
| 1 | 11-1011.03 | Chief Sustainability Officers | Communicate and coordinate with management, sh... |
| 2 | 11-1021.00 | General and Operations Managers | Plan, direct, or coordinate the operations of ... |
| 3 | 11-1031.00 | Legislators | Develop, introduce, or enact laws and statutes... |
| 4 | 11-2011.00 | Advertising and Promotions Managers | Plan, direct, or coordinate advertising polici... |


In [8]:
df_alt.head()

| | O*NET-SOC Code | Title | Alternate Title | Short Title | Source(s) |
|---|---|---|---|---|---|
| 0 | 11-1011.00 | Chief Executives | Aeronautics Commission Director | | 8 |
| 1 | 11-1011.00 | Chief Executives | Agency Owner | | 10 |
| 2 | 11-1011.00 | Chief Executives | Agricultural Services Director | | 8 |
| 3 | 11-1011.00 | Chief Executives | Arts and Humanities Council Director | | 8 |
| 4 | 11-1011.00 | Chief Executives | Bank President | | 9 |
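
The previews above show that every title carries an O*NET-SOC code such as `11-1011.00`. As a minimal sketch, assuming that "job major" here means the SOC major group (the first two digits of the code, so `11-1011.00` belongs to `11-0000`), a label could be derived as follows. The function name `soc_major_group` is hypothetical and is not part of the tutorial code.

```python
import re

# O*NET-SOC codes look like "11-1011.00": a SOC code plus a two-digit O*NET suffix.
_ONET_SOC_PATTERN = re.compile(r"^(\d{2})-(\d{4})\.(\d{2})$")


def soc_major_group(onet_soc_code: str) -> str:
    """Return the SOC major group code (e.g. "11-0000") for an O*NET-SOC code.

    Assumption: the classification target is the SOC major group.
    """
    match = _ONET_SOC_PATTERN.match(onet_soc_code.strip())
    if match is None:
        raise ValueError(f"Not an O*NET-SOC code: {onet_soc_code!r}")
    return f"{match.group(1)}-0000"


# Two (alternate title, code) pairs taken from the Alternate Titles preview.
examples = [
    ("Bank President", "11-1011.00"),
    ("Agency Owner", "11-1011.00"),
]
labelled = [(title, soc_major_group(code)) for title, code in examples]
print(labelled)
```

With pandas, the same function can be applied to the `O*NET-SOC Code` column of the alternate-titles DataFrame through `Series.map` to obtain one label per alternate title.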


In [9]:
try:
    input_s3_urls
except NameError:
    # If input_s3_urls is not defined, upload the datasets to S3 and store the paths
    input_s3_urls = [
        sm_session.upload_data(
            path="data/Alternate Titles.xlsx",
            bucket=bucket_name,
            key_prefix=f"{bucket_prefix}/input"
        ),
        sm_session.upload_data(
            path="data/Occupation Data.xlsx",
            bucket=bucket_name,
            key_prefix=f"{bucket_prefix}/input"
        ),
    ]

    print(f"Uploaded datasets to {input_s3_urls}")

Uploaded datasets to ['s3://sagemaker-us-east-1-181460750629/job-major-clf/input/Alternate Titles.xlsx', 's3://sagemaker-us-east-1-181460750629/job-major-clf/input/Occupation Data.xlsx']
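
The uploaded object keys contain spaces, which is easy to get wrong when passing them to lower-level S3 calls that expect a separate bucket and key. The following is a small sketch of splitting such a URL; `split_s3_url` is a hypothetical helper, not part of the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_url(s3_url: str) -> tuple:
    """Split "s3://bucket/some/key" into (bucket, key), keeping spaces in the key."""
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URL: {s3_url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# URL copied from the upload output above.
bucket, key = split_s3_url(
    "s3://sagemaker-us-east-1-181460750629/job-major-clf/input/Alternate Titles.xlsx"
)
print(bucket)
print(key)
```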


***

## Create project programmatically

In [10]:
sm = boto3.client("sagemaker")
sc = boto3.client("servicecatalog")

sc_provider_name = "Amazon SageMaker"
sc_product_name = "MLOps template for model building and training"

In [11]:
p_ids = [p['ProductId'] for p in sc.search_products(
    Filters={
        'FullTextSearch': [sc_product_name]
    },
)['ProductViewSummaries'] if p["Name"]==sc_product_name]

In [12]:
# If this cell raises an exception, create the project through the Studio UI instead
if not p_ids:
    raise Exception("No Amazon SageMaker ML Ops products found!")
elif len(p_ids) > 1:
    raise Exception("Too many matching Amazon SageMaker ML Ops products found!")
else:
    product_id = p_ids[0]
    print(f"ML Ops product id: {product_id}")

ML Ops product id: prod-53ibyqbj2cgmo


In [13]:
# Pick the DEFAULT provisioning artifact whose name sorts last (descending sort by name)
provisioning_artifact_id = sorted(
    [i for i in sc.list_provisioning_artifacts(
        ProductId=product_id
    )['ProvisioningArtifactDetails'] if i['Guidance']=='DEFAULT'],
    key=lambda d: d['Name'], reverse=True)[0]['Id']
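
The cell above sorts artifacts by `Name`, which is a string comparison: a name like `v1.10` would sort before `v1.9`. As a hedged alternative sketch, the newest artifact can be chosen by its `CreatedTime` field instead. The helper name and the sample records below are made up for illustration; only the field names (`Id`, `Name`, `Guidance`, `CreatedTime`) follow the Service Catalog response shape.

```python
from datetime import datetime, timezone


def latest_default_artifact_id(artifact_details: list) -> str:
    """Return the Id of the most recently created artifact with DEFAULT guidance."""
    candidates = [a for a in artifact_details if a.get("Guidance") == "DEFAULT"]
    if not candidates:
        raise ValueError("No provisioning artifact with DEFAULT guidance found")
    return max(candidates, key=lambda a: a["CreatedTime"])["Id"]


# Made-up records shaped like entries of 'ProvisioningArtifactDetails'.
sample_details = [
    {"Id": "pa-old", "Name": "v1.9", "Guidance": "DEFAULT",
     "CreatedTime": datetime(2023, 1, 1, tzinfo=timezone.utc)},
    {"Id": "pa-new", "Name": "v1.10", "Guidance": "DEFAULT",
     "CreatedTime": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"Id": "pa-deprecated", "Name": "v2.0", "Guidance": "DEPRECATED",
     "CreatedTime": datetime(2024, 6, 1, tzinfo=timezone.utc)},
]

by_time = latest_default_artifact_id(sample_details)
by_name = sorted(
    [a for a in sample_details if a["Guidance"] == "DEFAULT"],
    key=lambda a: a["Name"], reverse=True)[0]["Id"]
print(by_time, by_name)
```

In the notebook, the function would be called on `sc.list_provisioning_artifacts(ProductId=product_id)['ProvisioningArtifactDetails']`.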

In [18]:
project_name = "test-job-major-clf-3"
project_parameters = []

In [19]:
# create SageMaker project
r = sm.create_project(
    ProjectName=project_name,
    ProjectDescription="Model build project",
    ServiceCatalogProvisioningDetails={
        'ProductId': product_id,
        'ProvisioningArtifactId': provisioning_artifact_id,
    },
)

print(r)
project_id = r["ProjectId"]

{'ProjectArn': 'arn:aws:sagemaker:us-east-1:181460750629:project/test-job-major-clf-3', 'ProjectId': 'p-ka9k3eqqic4g', 'ResponseMetadata': {'RequestId': '9b76fcf1-d25d-4b10-9698-fc03f500f0e7', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '9b76fcf1-d25d-4b10-9698-fc03f500f0e7', 'content-type': 'application/x-amz-json-1.1', 'content-length': '115', 'date': 'Sat, 23 Mar 2024 22:14:27 GMT'}, 'RetryAttempts': 0}}


In [20]:
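while True:
    project_status = sm.describe_project(ProjectName=project_name)['ProjectStatus']
    if project_status == 'CreateCompleted':
        break
    if project_status == 'CreateFailed':  # avoid looping forever on failure
        raise RuntimeError(f"MLOps project {project_name} creation failed")
    print("Waiting for project creation completion")
    sleep(10)

print(f"MLOps project {project_name} creation completed")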

Waiting for project creation completion
Waiting for project creation completion
Waiting for project creation completion
Waiting for project creation completion
Waiting for project creation completion
Waiting for project creation completion
Waiting for project creation completion
Waiting for project creation completion
Waiting for project creation completion
Waiting for project creation completion
MLOps project test-job-major-clf-3 creation completed
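
The polling pattern used for project creation can be generalized so that it also gives up after a timeout. The following is a minimal sketch, not the notebook's own code: `wait_for_status` and its parameters are hypothetical names, and the status source is passed in as a callable so the helper can be tried without AWS access.

```python
import time


def wait_for_status(get_status, success, failures, timeout_s=900, poll_s=10,
                    sleep=time.sleep):
    """Poll get_status() until it returns `success`.

    Raises RuntimeError on any status in `failures` and TimeoutError once
    `timeout_s` seconds have elapsed without reaching `success`.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == success:
            return status
        if status in failures:
            raise RuntimeError(f"Reached failure status: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still in status {status!r} after {timeout_s}s")
        sleep(poll_s)


# Demo with a fake status sequence instead of a real describe_project call.
fake_statuses = iter(["Pending", "CreateInProgress", "CreateCompleted"])
result = wait_for_status(
    lambda: next(fake_statuses),
    success="CreateCompleted",
    failures={"CreateFailed"},
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(result)
```

In the notebook, `get_status` could be `lambda: sm.describe_project(ProjectName=project_name)['ProjectStatus']`.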


Clone the project default code to the Studio file system:
1. Choose **Home** in the Studio sidebar
2. Select **Deployments** and then select **Projects**
3. Click on the name of the project you created to open the project details tab
4. In the project tab, choose **Repositories**
5. In the **Local path** column for the repository choose **clone repo...**
6. In the dialog box that appears choose **Clone Repository**


In [21]:
input_s3_urls

['s3://sagemaker-us-east-1-181460750629/job-major-clf/input/Alternate Titles.xlsx',
 's3://sagemaker-us-east-1-181460750629/job-major-clf/input/Occupation Data.xlsx']

## Replace the default pipeline with our own classifier

In [2]:
# Fill in the variables below with the paths, in the Studio file system,
# of this tutorial repository and of the recently created MLOps repository

tutorial_dir = 'job-major-clf'
project_dir = 'test-job-major-clf-3-p-ka9k3eqqic4g/sagemaker-test-job-major-clf-3-p-ka9k3eqqic4g-modelbuild'
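
The `project_dir` value above appears to combine the project name and the project ID returned by `create_project`. As a sketch, assuming this naming pattern holds for other projects created from the same template (it is inferred from this single example, not from documentation), the path could be built instead of typed by hand. `modelbuild_repo_dir` is a hypothetical helper.

```python
def modelbuild_repo_dir(project_name: str, project_id: str) -> str:
    """Build the local path of the cloned model-build repository.

    Assumption: the clone lands in "<name>-<id>/sagemaker-<name>-<id>-modelbuild",
    as observed for the project created in this notebook.
    """
    project_slug = f"{project_name}-{project_id}"
    return f"{project_slug}/sagemaker-{project_slug}-modelbuild"


# Values taken from the create_project response earlier in the notebook.
print(modelbuild_repo_dir("test-job-major-clf-3", "p-ka9k3eqqic4g"))
```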

In [3]:
%cp ~/{tutorial_dir}/codebuild-buildspec.yml ~/{project_dir}/codebuild-buildspec.yml
%cp -R ~/{tutorial_dir}/jobmajorclf ~/{project_dir}/pipelines/jobmajorclf

In [4]:
%%writefile ~/{project_dir}/setup.py

import os
import setuptools


about = {}
here = os.path.abspath(os.path.dirname(__file__))
with open(os.path.join(here, "pipelines", "__version__.py")) as f:
    exec(f.read(), about)


with open("README.md", "r") as f:
    readme = f.read()


required_packages = ["sagemaker"]
extras = {
    "test": [
        "black",
        "coverage",
        "flake8",
        "mock",
        "pydocstyle",
        "pytest",
        "pytest-cov",
        "sagemaker",
        "tox",
    ]
}
setuptools.setup(
    name=about["__title__"],
    description=about["__description__"],
    version=about["__version__"],
    author=about["__author__"],
    author_email=about["__author_email__"],
    long_description=readme,
    long_description_content_type="text/markdown",
    url=about["__url__"],
    license=about["__license__"],
    packages=setuptools.find_packages(),
    include_package_data=True,
    python_requires=">=3.6",
    install_requires=required_packages,
    extras_require=extras,
    entry_points={
        "console_scripts": [
            "get-pipeline-definition=pipelines.get_pipeline_definition:main",
            "run-pipeline=pipelines.run_pipeline:main",
        ]
    },
    classifiers=[
        "Development Status :: 3 - Alpha",
        "Intended Audience :: Developers",
        "Natural Language :: English",
        "Programming Language :: Python",
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.6",
        "Programming Language :: Python :: 3.7",
        "Programming Language :: Python :: 3.8",
    ],
)

Overwriting /root/test-job-major-clf-3-p-ka9k3eqqic4g/sagemaker-test-job-major-clf-3-p-ka9k3eqqic4g-modelbuild/setup.py


Now it's time to run our pipeline. The default CI/CD pipeline runs every time a change is pushed to the repository, so we can trigger it by committing the latest changes and pushing them to the remote repository.

To do this, open a new terminal in SageMaker Studio (File/New/Terminal). Inside the terminal, change your working directory to that of the MLOps project (make sure to use your own project directory here):

```bash
cd test-job-major-clf-3-p-ka9k3eqqic4g/sagemaker-test-job-major-clf-3-p-ka9k3eqqic4g-modelbuild

```
Finally, add all changes to a new commit and push to the remote CodeCommit repository:

```bash
git add -A; git commit -m 'Ready to Run'; git push
```

To follow the execution of the pipeline, open the Deployments/Projects window (find it in the Home panel on the left side of SageMaker Studio) and open the project you created (`test-job-major-clf-3` in this notebook). All the information and resources related to the project can be inspected there. The running pipeline diagram is in the Pipeline tab: select the pipeline name and open the latest execution. Once the execution has completed, the diagram shows each of the completed steps:

![Pipeline execution diagram](Diagram.png "Pipeline execution diagram")
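
The same information can also be read from the SageMaker API. The sketch below takes a boto3 SageMaker client (such as `sm_client` from the setup cell) and a pipeline name, and returns the status of the latest execution together with the status of each step. The function name is hypothetical, and the pipeline name is an assumption: it depends on what the copied `codebuild-buildspec.yml` passes to the pipeline, so check it in the Pipeline tab first. Only the first page of results is read.

```python
def latest_execution_steps(sm_client, pipeline_name: str):
    """Return (execution_status, {step_name: step_status}) for the newest execution.

    Returns (None, {}) if the pipeline has no executions yet.
    """
    executions = sm_client.list_pipeline_executions(
        PipelineName=pipeline_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )["PipelineExecutionSummaries"]
    if not executions:
        return None, {}

    latest = executions[0]
    steps = sm_client.list_pipeline_execution_steps(
        PipelineExecutionArn=latest["PipelineExecutionArn"]
    )["PipelineExecutionSteps"]
    return (
        latest["PipelineExecutionStatus"],
        {step["StepName"]: step["StepStatus"] for step in steps},
    )


# Usage in the notebook (the pipeline name is a placeholder to replace):
# status, steps = latest_execution_steps(sm_client, "<your-pipeline-name>")
```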

When all the steps complete successfully, the trained model is stored in the Model Registry (Home/Models/Model registry in SageMaker Studio), where it can be compared against other model versions and deployed to a serving solution such as SageMaker Endpoints.
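
To check the registered model without opening the Studio UI, the newest model package in a model package group can be fetched through the SageMaker API. This is a minimal sketch with a hypothetical helper name. The group name to pass is an assumption, since it is set by the pipeline code rather than by this notebook, so confirm it in the Model Registry view.

```python
def latest_model_package(sm_client, model_package_group_name: str):
    """Return ARN, status and approval status of the newest model package, or None."""
    packages = sm_client.list_model_packages(
        ModelPackageGroupName=model_package_group_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )["ModelPackageSummaryList"]
    if not packages:
        return None

    newest = packages[0]
    return {
        "arn": newest["ModelPackageArn"],
        "status": newest["ModelPackageStatus"],
        # Approval status decides whether the model may be deployed.
        "approval": newest.get("ModelApprovalStatus"),
    }


# Usage in the notebook (the group name is a placeholder to replace):
# print(latest_model_package(sm_client, "<your-model-package-group>"))
```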