## SageMaker Tutorial Series

### Tutorial 1 - Mobile Price Classification using an SKLearn Custom Script in SageMaker

Data Source - https://www.kaggle.com/datasets/iabhishekofficial/mobile-price-classification?resource=download

### Let's divide the workload
1. Initialize the Boto3 SDK and create an S3 bucket.
2. Upload the data to SageMaker local storage.
3. Explore and understand the data.
4. Split the data into train/test CSV files.
5. Upload the data to the S3 bucket.
6. Create the training script.
7. Train the script inside a SageMaker container.
8. Store the model artifacts (model.tar.gz) in the S3 bucket.
9. Deploy a SageMaker endpoint (API) for the trained model and test it.

In [16]:
import sklearn # Check Sklearn version
sklearn.__version__

'1.5.2'

In [17]:
import xgboost
xgboost.__version__

'2.1.3'

In [3]:
!python --version

Python 3.11.11
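
The version checks above can be gathered into one helper, which makes it easier to see whether the environment matches the versions pinned by the install cells in this notebook. This is a minimal sketch using only the standard library; the helper names `installed_versions` and `version_mismatches` are hypothetical and not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def version_mismatches(pins, installed):
    """Return {package: (pinned, installed)} for every pin that is not met exactly."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


# Pins taken from the install cells in this notebook.
PINS = {"scikit-learn": "1.5.2", "xgboost": "2.1.3"}

if __name__ == "__main__":
    # An empty dict means every pinned package is installed at the pinned version.
    print(version_mismatches(PINS, installed_versions(PINS)))
```

An empty result means the kernel already has the pinned versions, so the `pip install` cells can be skipped.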


In [6]:
!pip install xgboost==2.1.3

Collecting xgboost==2.1.3
  Using cached xgboost-2.1.3-py3-none-manylinux_2_28_x86_64.whl.metadata (2.1 kB)
Collecting nvidia-nccl-cu12 (from xgboost==2.1.3)
  Using cached nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB)
Using cached xgboost-2.1.3-py3-none-manylinux_2_28_x86_64.whl (153.9 MB)
Using cached nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB)
Installing collected packages: nvidia-nccl-cu12, xgboost
  Attempting uninstall: xgboost
    Found existing installation: xgboost 2.1.4
    Uninstalling xgboost-2.1.4:
      Successfully uninstalled xgboost-2.1.4
Successfully installed nvidia-nccl-cu12-2.26.2 xgboost-2.1.3


In [3]:
!pip install -U sagemaker

Collecting sagemaker
  Using cached sagemaker-2.242.0-py3-none-any.whl.metadata (16 kB)
Collecting boto3<2.0,>=1.35.75 (from sagemaker)
  Downloading boto3-1.37.20-py3-none-any.whl.metadata (6.7 kB)
Collecting sagemaker-core<2.0.0,>=1.0.17 (from sagemaker)
  Downloading sagemaker_core-1.0.26-py3-none-any.whl.metadata (4.9 kB)
Collecting botocore<1.38.0,>=1.37.20 (from boto3<2.0,>=1.35.75->sagemaker)
  Downloading botocore-1.37.20-py3-none-any.whl.metadata (5.7 kB)
Collecting s3transfer<0.12.0,>=0.11.0 (from boto3<2.0,>=1.35.75->sagemaker)
  Using cached s3transfer-0.11.4-py3-none-any.whl.metadata (1.7 kB)
Collecting mock<5.0,>4.0 (from sagemaker-core<2.0.0,>=1.0.17->sagemaker)
  Using cached mock-4.0.3-py3-none-any.whl.metadata (2.8 kB)
Using cached sagemaker-2.242.0-py3-none-any.whl (1.6 MB)
Downloading boto3-1.37.20-py3-none-any.whl (139 kB)
Downloading sagemaker_core-1.0.26-py3-none-any.whl (407 kB)
Downloading botocore-1.37.20-py3-none-any.whl (13.4 MB)

In [7]:
!pip install scikit-learn==1.5.2

Collecting scikit-learn==1.5.2
  Downloading scikit_learn-1.5.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading scikit_learn-1.5.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.3 MB)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.6.0
    Uninstalling scikit-learn-1.6.0:
      Successfully uninstalled scikit-learn-1.6.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autogluon-multimodal 1.1.1 requires nvidia-ml-py3==7.352.0, which is not installed.
autogluon-core 1.1.1 requires scikit-learn<1.4.1,>=1.3.0, but you have scikit-learn 1.5.2 which is incompatible.
autogluon-core 1.1.1 requires scipy<1

## 1. Initialize Boto3 SDK and create S3 bucket. 

In [12]:
import numpy as np
from sagemaker import get_execution_role
import sagemaker
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
import datetime
import time
import tarfile
import boto3
import pandas as pd

sm_boto3 = boto3.client("sagemaker")
sess = sagemaker.Session()
region = sess.boto_session.region_name
bucket = 'mainbucketrockhight5461'  # Name of the S3 bucket created for this tutorial
print("Using bucket " + bucket)
print(f"sagemaker version: {sagemaker.__version__}")



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
Using bucket mainbucketrockhight5461
sagemaker version: 2.242.0
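
The cell above assumes the bucket already exists. One way to create it on demand is sketched below. It relies on the boto3 S3 client methods `list_buckets` and `create_bucket`; the helper name `ensure_bucket` is hypothetical. Bucket names are global, so `create_bucket` can still fail if another account owns the name, and this sketch does not handle that case.

```python
def ensure_bucket(s3_client, bucket_name, region):
    """Create bucket_name if the caller does not already own it.

    Returns True when a bucket was created and False when it already existed.
    """
    owned = {b["Name"] for b in s3_client.list_buckets().get("Buckets", [])}
    if bucket_name in owned:
        return False

    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed as a constraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    s3_client.create_bucket(**kwargs)
    return True


# Usage in this notebook (assumes AWS credentials are configured):
#   ensure_bucket(boto3.client("s3", region_name=region), bucket, region)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real account.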


In [2]:
import pickle

with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

In [3]:
print(type(model))
print(model)

<class 'sklearn.pipeline.Pipeline'>
Pipeline(steps=[('processing',
                 <transformers.RawDataProcessor object at 0x7f744b50bf90>),
                ('slice_columns',
                 <transformers.DataSlicer object at 0x7f744b55f350>),
                ('null_filling',
                 <transformers.NullFillTransformer object at 0x7f744b3d60d0>),
                ('model',
                 FitModel(folds=5,
                          hyper_parameters={'colsample_bytree': [0.6, 0.8],
                                            'gamma': [2], 'max_depth': [3],
                                            'min_child_weight': [3],
                                            'random_state': [1005],
                                            'subsample': [0.6, 0.8]}))])


In [4]:
print(model.get_params())

{'memory': None, 'steps': [('processing', <transformers.RawDataProcessor object at 0x7f744b50bf90>), ('slice_columns', <transformers.DataSlicer object at 0x7f744b55f350>), ('null_filling', <transformers.NullFillTransformer object at 0x7f744b3d60d0>), ('model', FitModel(folds=5,
         hyper_parameters={'colsample_bytree': [0.6, 0.8], 'gamma': [2],
                           'max_depth': [3], 'min_child_weight': [3],
                           'random_state': [1005], 'subsample': [0.6, 0.8]}))], 'verbose': False, 'processing': <transformers.RawDataProcessor object at 0x7f744b50bf90>, 'slice_columns': <transformers.DataSlicer object at 0x7f744b55f350>, 'null_filling': <transformers.NullFillTransformer object at 0x7f744b3d60d0>, 'model': FitModel(folds=5,
         hyper_parameters={'colsample_bytree': [0.6, 0.8], 'gamma': [2],
                           'max_depth': [3], 'min_child_weight': [3],
                           'random_state': [1005], 'subsample': [0.6, 0.8]}), 'model__folds'

In [5]:
import pandas as pd
import pickle

# Load the model
with open('model.pkl', 'rb') as f:
    pipeline = pickle.load(f)

# Build a one-row DataFrame with the features the pipeline expects
# (assuming 'decline_v2a_debit' is one of the required features)
input_data = pd.DataFrame({
    'timestamp': ['2023-05-01'],
    'in_data': ['{"yams_score":0.7,"north_star_metric":"5.5"}'],
    'decline_v2a_debit': [0.5],
    'days_since_sms_otp_success': [20],
    'days_since_receiver_first_seen': [100],
    'days_since_device_first_seen': [20],
    'dda_age_in_days': [100],
    # ... add all other required features ...
})

# Make a prediction
prediction = pipeline.predict(input_data)

In [6]:
print(prediction)

{'uncalibrated': array([[0.14639568, 0.8536043 ]], dtype=float32), 'calibrated': array([[0.60676062, 0.39323938]])}
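
The pipeline returns a dict with two probability arrays, one row per input record. A small helper can pull a single column out of each array for easier reading. This is a sketch only: the name `positive_class_scores` is hypothetical, and treating column index 1 as the positive class is an assumption that the notebook does not confirm.

```python
def positive_class_scores(prediction, positive_index=1):
    """Return {output name: [score per row]} for one column of each probability array.

    positive_index=1 is an assumption about column order, not something the
    notebook states. Works with NumPy arrays and with nested lists.
    """
    return {
        name: [float(row[positive_index]) for row in rows]
        for name, rows in prediction.items()
    }


# Usage with the result of the cell above:
#   positive_class_scores(prediction)
```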


## 5. Create the predict.py script

In [7]:
%%writefile predict.py

import pickle
import os
import pandas as pd
from io import StringIO

def model_fn(model_dir):
    """Load the trained model.pkl"""
    with open(os.path.join(model_dir, "model.pkl"), 'rb') as f:
        return pickle.load(f)

def input_fn(request_body, request_content_type):
    """Parse CSV input into DataFrame with correct columns"""
    if request_content_type == "text/csv":
        # Hardcoded columns to match your model's requirements
        columns = [
            'timestamp',
            'in_data',
            'decline_v2a_debit',
            'days_since_sms_otp_success',
            'days_since_receiver_first_seen',
            'days_since_device_first_seen',
            'dda_age_in_days'
        ]
        df = pd.read_csv(StringIO(request_body.strip()), header=None)
        df.columns = columns  # Assign correct column names
        return df
    else:
        raise ValueError(f"Unsupported content type: {request_content_type}")

def predict_fn(input_data, model):
    """Run prediction on preprocessed DataFrame"""
    return model.predict(input_data)

Overwriting predict.py
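
`predict_fn` returns a dict of arrays, which may not map cleanly onto the container's default response serialization. The SageMaker scikit-learn serving stack also accepts an optional `output_fn(prediction, accept)` handler, and the sketch below shows one possible version that could be added to predict.py. It is not part of the original script, `to_jsonable` is a hypothetical helper, and returning a plain JSON string is an assumption about the return contract that should be checked against the container version in use.

```python
import json


def to_jsonable(prediction):
    """Convert a dict of arrays (or nested lists) into plain Python lists."""
    return {
        name: values.tolist() if hasattr(values, "tolist") else values
        for name, values in prediction.items()
    }


def output_fn(prediction, accept):
    """Serialize the prediction dict as JSON (hypothetical addition to predict.py)."""
    if accept in ("application/json", "*/*"):
        return json.dumps(to_jsonable(prediction))
    raise ValueError(f"Unsupported accept type: {accept}")
```

With this handler, a client would request `application/json` and decode the response body with `json.loads`.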


## 6. Test the model locally

In [8]:
import pandas as pd
from predict import model_fn, predict_fn, input_fn
import os

# 1. Point this to your actual model.pkl directory
MODEL_DIR = "."  # <-- CHANGE THIS

# 2. Test with a DataFrame (matches your working example)
def test_dataframe_prediction():
    print("=== Testing DataFrame Input ===")
    model = model_fn(MODEL_DIR)
    
    test_df = pd.DataFrame({
        'timestamp': ['2023-05-01'],
        'in_data': ['{"yams_score":0.7,"north_star_metric":"5.5"}'],
        'decline_v2a_debit': [0.5],
        'days_since_sms_otp_success': [20],
        'days_since_receiver_first_seen': [100],
        'days_since_device_first_seen': [20],
        'dda_age_in_days': [100]
    })
    
    pred = predict_fn(test_df, model)
    print(f"Prediction: {pred}")

# 3. Test with raw CSV (simulates API input)
def test_csv_prediction():
    print("\n=== Testing CSV Input ===")
    model = model_fn(MODEL_DIR)
    
    csv_data = """
2023-05-01,"{""yams_score"":0.7,""north_star_metric"":""5.5""}",0.5,20,100,20,100
    """.strip()
    
    df = input_fn(csv_data, "text/csv")
    pred = predict_fn(df, model)
    print(f"Prediction from CSV: {pred}")


test_dataframe_prediction()
test_csv_prediction()

=== Testing DataFrame Input ===
Prediction: {'uncalibrated': array([[0.14639568, 0.8536043 ]], dtype=float32), 'calibrated': array([[0.60676062, 0.39323938]])}

=== Testing CSV Input ===
Prediction from CSV: {'uncalibrated': array([[0.14639568, 0.8536043 ]], dtype=float32), 'calibrated': array([[0.60676062, 0.39323938]])}
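
The CSV test above depends on the JSON field being quoted correctly, with every inner double quote doubled. Rather than writing that escaping by hand, the row can be generated with the standard `csv` module. This is a minimal sketch; `to_csv_row` is a hypothetical helper name.

```python
import csv
import json
from io import StringIO


def to_csv_row(values):
    """Encode one record as a single CSV line, quoting fields only where needed."""
    buffer = StringIO()
    csv.writer(buffer, lineterminator="").writerow(values)
    return buffer.getvalue()


# Compact separators reproduce the JSON string used in the test cell.
in_data = json.dumps(
    {"yams_score": 0.7, "north_star_metric": "5.5"}, separators=(",", ":")
)
row = to_csv_row(["2023-05-01", in_data, 0.5, 20, 100, 20, 100])
print(row)
```

The printed line matches the hand-written `csv_data` string, so it can be passed straight to `input_fn(row, "text/csv")`.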


## 7. Package model.pkl into model.tar.gz

In [9]:
import tarfile

with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.pkl")

### 7.1 Test the model locally
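
One quick local check is to confirm that the archive holds model.pkl at its top level, since `model_fn` in predict.py loads `os.path.join(model_dir, "model.pkl")`. The sketch below uses only the standard library; the helper names `archive_members` and `has_model_at_root` are hypothetical.

```python
import tarfile


def archive_members(archive_path):
    """Return the file names stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


def has_model_at_root(archive_path, model_file="model.pkl"):
    """True if model_file sits at the top level of the archive."""
    return model_file in archive_members(archive_path)


# Usage after the packaging cell above:
#   has_model_at_root("model.tar.gz")
```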

## 8. Store Model Artifacts (model.tar.gz) in the S3 Bucket.

In [14]:
s3 = boto3.client('s3')

# Upload the tar.gz file to S3
s3.upload_file("model.tar.gz", bucket, "models/model.tar.gz")
model_data = f"s3://{bucket}/models/model.tar.gz"

print(f"model data: {model_data}")

model data: s3://mainbucketrockhight5461/models/model.tar.gz


## 9. Deploy a SageMaker Endpoint (API) for the trained model and test it.

In [18]:
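from sagemaker.sklearn.model import SKLearnModel
from time import gmtime, strftime

model_name = "Custom-sklearn-model-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
# Note: this prints the locally installed scikit-learn version (imported earlier),
# not the version inside the serving container, which is set by framework_version.
print(f"framework version: {sklearn.__version__}")
model = SKLearnModel(
    name=model_name,
    model_data=model_data,
    role=get_execution_role(),
    entry_point="predict.py",   # must sit at the root of source_dir
    framework_version="1.2-1",  # scikit-learn version of the serving container
    dependencies=['requirements.txt'],
    source_dir="tmp"
)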

framework version: 1.5.2
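
The model and endpoint names in this section are built by appending a UTC timestamp to a fixed prefix. A small helper can do the same while checking the result against SageMaker's naming rules. This is a sketch: `timestamped_name` is a hypothetical helper, and the pattern and 63-character limit are assumptions taken from the CreateEndpoint API reference that should be verified for other resource types.

```python
import re
from time import gmtime, strftime

# Assumed from the CreateEndpoint API reference: letters, digits and hyphens,
# starting and ending with a letter or digit, at most 63 characters.
_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")
_MAX_LENGTH = 63


def timestamped_name(prefix, when=None):
    """Append a UTC timestamp to prefix and validate the resulting name."""
    stamp = strftime("%Y-%m-%d-%H-%M-%S", when if when is not None else gmtime())
    name = f"{prefix}-{stamp}"
    if len(name) > _MAX_LENGTH or not _NAME_PATTERN.fullmatch(name):
        raise ValueError(f"Invalid SageMaker resource name: {name!r}")
    return name


# Usage:
#   endpoint_name = timestamped_name("Custom-sklearn-model")
```

Validating locally surfaces a bad name immediately instead of after the API call is rejected.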


In [19]:
endpoint_name = "Custom-sklearn-model-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("EndpointName={}".format(endpoint_name))

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
    endpoint_name=endpoint_name,
)

EndpointName=Custom-sklearn-model-2025-03-26-16-20-55


--------------------------------------------------------*

In [None]:
### The endpoint is failing, very likely because model.pkl was built with scikit-learn 1.5.2 while the latest framework version supported by this API is 1.2-1. I am stopping here, but I will keep this notebook for reference.
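
The suspected cause noted above can be caught before deploying by comparing the scikit-learn version that produced model.pkl with the container's `framework_version`. The sketch below treats "same major.minor release" as the compatibility rule, which is a conservative assumption rather than a documented guarantee; both function names are hypothetical.

```python
def major_minor(version):
    """Parse '1.5.2' or a container tag such as '1.2-1' into (major, minor)."""
    core = version.split("-")[0]  # drop the container build suffix, e.g. '-1'
    major, minor = core.split(".")[:2]
    return int(major), int(minor)


def versions_compatible(local_version, framework_version):
    """True when both versions share the same major.minor release."""
    return major_minor(local_version) == major_minor(framework_version)


# Versions taken from this notebook: local 1.5.2 vs container 1.2-1.
print(versions_compatible("1.5.2", "1.2-1"))  # prints False
```

If the check fails, the options are to rebuild the pickle with a matching scikit-learn version or to serve the model from a container that ships the newer release.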

In [75]:
import pandas as pd
from io import StringIO
from sagemaker.deserializers import NumpyDeserializer
from sagemaker.serializers import CSVSerializer

# Convert testX[features][0:2] to CSV string
test_data = testX[features][0:2].values.tolist()
csv_buffer = StringIO()
pd.DataFrame(test_data).to_csv(csv_buffer, header=False, index=False)
csv_data = csv_buffer.getvalue()

# Set up the predictor with appropriate serializer and deserializer
predictor.serializer = CSVSerializer()
predictor.deserializer = NumpyDeserializer()

# Use predictor.predict with explicit content type
predictor.content_type = "text/csv"  # Set the content type for the request
predictor.accept = "application/x-npy"  # Set the accept type for the response

# Make the prediction
result = predictor.predict(csv_data)
print(result)

[3 0]


## Don't forget to delete the endpoint!

In [76]:
sm_boto3.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': 'ce296f2c-c619-4b4d-b151-f34d0b0aa710',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'ce296f2c-c619-4b4d-b151-f34d0b0aa710',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Tue, 25 Mar 2025 15:42:17 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}
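
Deleting the endpoint leaves its endpoint configuration and model objects in the account. A fuller cleanup might look like the sketch below, which uses the boto3 SageMaker client calls `describe_endpoint`, `describe_endpoint_config`, `delete_endpoint`, `delete_endpoint_config`, and `delete_model`. The helper name `delete_endpoint_resources` is hypothetical, and the sketch assumes the models are not shared with other endpoints.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint, its endpoint config, and the models that config references.

    Returns the names of everything that was deleted.
    """
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)

    return {
        "endpoint": endpoint_name,
        "endpoint_config": config_name,
        "models": model_names,
    }


# Usage in this notebook:
#   delete_endpoint_resources(sm_boto3, endpoint_name)
```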
