## SageMaker Tutorial Series

### Tutorial 1 - Mobile Price Classification using SKLearn Custom Script in SageMaker

Data Source - https://www.kaggle.com/datasets/iabhishekofficial/mobile-price-classification?resource=download

### Let's divide the workload
1. Initialize Boto3 SDK and create S3 bucket.
2. Upload data to SageMaker Local Storage.
3. Data Exploration and Understanding.
4. Split the data into Train/Test CSV Files.
5. Upload data into the S3 Bucket.
6. Create Training Script.
7. Train script inside SageMaker container.
8. Store Model Artifacts (model.tar.gz) in the S3 Bucket.
9. Deploy SageMaker Endpoint (API) for trained model, and test it.

In [3]:
!pip install scikit-learn

Collecting scikit-learn
  Using cached scikit_learn-1.6.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting scipy>=1.6.0 (from scikit-learn)
  Using cached scipy-1.15.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Using cached joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Using cached threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Using cached scikit_learn-1.6.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.2 MB)
Using cached joblib-1.4.2-py3-none-any.whl (301 kB)
Using cached scipy-1.15.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (37.3 MB)
Using cached threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, scipy, joblib, scikit-learn
Successfully installed joblib-1.4.2 scikit-learn-1.6.1 scipy-1.15.2 threadpoolctl-3.6.0


In [4]:
import sklearn # Check Sklearn version
sklearn.__version__

'1.6.1'

In [5]:
!python --version

Python 3.13.1


## 1. Initialize Boto3 SDK and create S3 bucket. 

In [6]:
import numpy as np
from sagemaker import get_execution_role
import sagemaker
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
import datetime
import time
import tarfile
import boto3
import pandas as pd

sm_boto3 = boto3.client("sagemaker")
sess = sagemaker.Session()
region = sess.boto_session.region_name
bucket = 'mainbucketrockhight5461' # Mention the created S3 bucket name here
print("Using bucket " + bucket)
print(f"sagemaker version: {sagemaker.__version__}")



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/xdg-ubuntu/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/murivirg/.config/sagemaker/config.yaml


Using bucket mainbucketrockhight5461
sagemaker version: 2.243.0
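
The step title mentions creating the S3 bucket, but the cell above only names a bucket that already exists. A minimal sketch of how the bucket could be created when it is missing, assuming working AWS credentials; `bucket_create_kwargs` and `ensure_bucket` are hypothetical helper names, not part of boto3 or the SageMaker SDK.

```python
def bucket_create_kwargs(bucket_name, region):
    """Build the keyword arguments for the S3 create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a constraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def ensure_bucket(bucket_name, region):
    """Create the bucket if a HEAD request on it fails (assumption: we own the name)."""
    import boto3  # imported lazily so the helper above works without AWS access
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3", region_name=region)
    try:
        s3.head_bucket(Bucket=bucket_name)
    except ClientError:
        # Also reached when the bucket exists but belongs to another account;
        # create_bucket will then raise, which is the behaviour we want here.
        s3.create_bucket(**bucket_create_kwargs(bucket_name, region))
```

With the variables from the cell above, the call would be `ensure_bucket(bucket, region)`.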


## 3. Data Exploration and Understanding.

In [10]:
df = pd.read_csv("mob_price_classification_train.csv")

In [11]:
df.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7,0.6,188,2,...,20,756,2549,9,7,19,0,0,1,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,...,905,1988,2631,17,3,7,1,1,0,2
2,563,1,0.5,1,2,1,41,0.9,145,5,...,1263,1716,2603,11,2,9,1,1,0,2
3,615,1,2.5,0,0,0,10,0.8,131,6,...,1216,1786,2769,16,8,11,1,0,0,2
4,1821,1,1.2,0,13,1,44,0.6,141,2,...,1208,1212,1411,8,2,15,1,1,0,1


In [12]:
df.shape

(2000, 21)

In [13]:
# price_range has four classes (0-3); check how balanced they are
df['price_range'].value_counts(normalize=True)

price_range
1    0.25
2    0.25
3    0.25
0    0.25
Name: proportion, dtype: float64

In [14]:
df.columns

Index(['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g',
       'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height',
       'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi', 'price_range'],
      dtype='object')

In [15]:
df.shape

(2000, 21)

In [16]:
# Percentage of missing values per column
df.isnull().mean() * 100

battery_power    0.0
blue             0.0
clock_speed      0.0
dual_sim         0.0
fc               0.0
four_g           0.0
int_memory       0.0
m_dep            0.0
mobile_wt        0.0
n_cores          0.0
pc               0.0
px_height        0.0
px_width         0.0
ram              0.0
sc_h             0.0
sc_w             0.0
talk_time        0.0
three_g          0.0
touch_screen     0.0
wifi             0.0
price_range      0.0
dtype: float64

In [17]:
features = list(df.columns)
features

['battery_power',
 'blue',
 'clock_speed',
 'dual_sim',
 'fc',
 'four_g',
 'int_memory',
 'm_dep',
 'mobile_wt',
 'n_cores',
 'pc',
 'px_height',
 'px_width',
 'ram',
 'sc_h',
 'sc_w',
 'talk_time',
 'three_g',
 'touch_screen',
 'wifi',
 'price_range']

In [18]:
label = features.pop(-1)
label

'price_range'

In [19]:
x = df[features]
y = df[label]

In [20]:
x.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,pc,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi
0,842,0,2.2,0,1,0,7,0.6,188,2,2,20,756,2549,9,7,19,0,0,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,6,905,1988,2631,17,3,7,1,1,0
2,563,1,0.5,1,2,1,41,0.9,145,5,6,1263,1716,2603,11,2,9,1,1,0
3,615,1,2.5,0,0,0,10,0.8,131,6,9,1216,1786,2769,16,8,11,1,0,0
4,1821,1,1.2,0,13,1,44,0.6,141,2,14,1208,1212,1411,8,2,15,1,1,0


In [21]:
# Labels are the price_range classes 0-3
y.head()

0    1
1    2
2    2
3    2
4    1
Name: price_range, dtype: int64

In [22]:
x.shape

(2000, 20)

In [23]:
y.value_counts()

price_range
1    500
2    500
3    500
0    500
Name: count, dtype: int64

In [24]:
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.15, random_state=0)

In [25]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(1700, 20)
(300, 20)
(1700,)
(300,)


## 4. Split the data into Train/Test CSV File. 

In [26]:
trainX = pd.DataFrame(X_train)
trainX[label] = y_train

testX = pd.DataFrame(X_test)
testX[label] = y_test

In [27]:
print(trainX.shape)
print(testX.shape)

(1700, 21)
(300, 21)


In [28]:
trainX.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
1452,1450,0,2.1,0,1,0,31,0.6,114,5,...,1573,1639,794,11,5,9,0,1,1,1
1044,1218,1,2.8,1,3,0,39,0.8,150,7,...,1122,1746,1667,10,0,12,0,0,0,1
1279,1602,0,0.6,0,12,0,58,0.4,170,1,...,1259,1746,3622,17,2,17,0,1,1,3
674,1034,0,2.6,1,2,1,45,0.3,190,3,...,182,1293,969,15,1,7,1,0,0,0
1200,530,0,2.4,0,1,0,32,0.3,88,6,...,48,1012,959,17,7,6,0,1,0,0


In [29]:
trainX.isnull().sum()

battery_power    0
blue             0
clock_speed      0
dual_sim         0
fc               0
four_g           0
int_memory       0
m_dep            0
mobile_wt        0
n_cores          0
pc               0
px_height        0
px_width         0
ram              0
sc_h             0
sc_w             0
talk_time        0
three_g          0
touch_screen     0
wifi             0
price_range      0
dtype: int64

In [30]:
testX.isnull().sum()

battery_power    0
blue             0
clock_speed      0
dual_sim         0
fc               0
four_g           0
int_memory       0
m_dep            0
mobile_wt        0
n_cores          0
pc               0
px_height        0
px_width         0
ram              0
sc_h             0
sc_w             0
talk_time        0
three_g          0
touch_screen     0
wifi             0
price_range      0
dtype: int64

## 5. Upload data into the S3 Bucket.

In [31]:
trainX.to_csv("train-V-1.csv",index = False)
testX.to_csv("test-V-1.csv", index = False)

In [32]:
# send data to S3. SageMaker will take training data from s3
sk_prefix = "sagemaker/mobile_price_classification/sklearncontainer"
trainpath = sess.upload_data(
    path="train-V-1.csv", bucket=bucket, key_prefix=sk_prefix
)

testpath = sess.upload_data(
    path="test-V-1.csv", bucket=bucket, key_prefix=sk_prefix
)

In [33]:
testpath

's3://mainbucketrockhight5461/sagemaker/mobile_price_classification/sklearncontainer/test-V-1.csv'

In [34]:
trainpath

's3://mainbucketrockhight5461/sagemaker/mobile_price_classification/sklearncontainer/train-V-1.csv'

## 6. Create Training Script

In [35]:
%%writefile train.py

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import sklearn
import pickle  # used to persist the trained model
import argparse
import os
import pandas as pd

if __name__ == "__main__":
    print("[INFO] Extracting arguments")
    parser = argparse.ArgumentParser()

    # Hyperparameters sent by the client are passed as command-line arguments to the script
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--random_state", type=int, default=0)

    # Data, model, and output directories
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--train-file", type=str, default="train-V-1.csv")
    parser.add_argument("--test-file", type=str, default="test-V-1.csv")

    args, _ = parser.parse_known_args()
    
    print("SKLearn Version: ", sklearn.__version__)

    print("[INFO] Reading data")
    print()
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))
    
    features = list(train_df.columns)
    label = features.pop(-1)
    
    print("Building training and testing datasets")
    print()
    X_train = train_df[features]
    X_test = test_df[features]
    y_train = train_df[label]
    y_test = test_df[label]

    print('Column order: ')
    print(features)
    print()
    
    print("Label column is: ", label)
    print()
    
    print("Data Shape: ")
    print()
    print("---- SHAPE OF TRAINING DATA (85%) ----")
    print(X_train.shape)
    print(y_train.shape)
    print()
    print("---- SHAPE OF TESTING DATA (15%) ----")
    print(X_test.shape)
    print(y_test.shape)
    print()
    
    print("Training RandomForest Model.....")
    print()
    model = RandomForestClassifier(n_estimators=args.n_estimators, random_state=args.random_state, verbose=3, n_jobs=-1)
    model.fit(X_train, y_train)
    print()
    
    # Persist the model as model.pkl using pickle
    model_path = os.path.join(args.model_dir, "model.pkl")
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)
    print("Model persisted at " + model_path)
    print()

    y_pred_test = model.predict(X_test)
    test_acc = accuracy_score(y_test, y_pred_test)
    test_rep = classification_report(y_test, y_pred_test)

    print()
    print("---- METRICS RESULTS FOR TESTING DATA ----")
    print()
    print("Total Rows are: ", X_test.shape[0])
    print('[TESTING] Model Accuracy is: ', test_acc)
    print('[TESTING] Testing Report: ')
    print(test_rep)

Writing train.py
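
As the comments in `train.py` say, hyperparameters arrive as command-line arguments and the data and model directories come from the `SM_*` environment variables. Outside a SageMaker container those variables are unset, so the defaults become `None` unless every path is passed explicitly. A small sketch of the same parser with a local fallback; the fallback to `"."` is my assumption for local runs, not something SageMaker requires.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse training arguments; argv=None means 'read sys.argv'."""
    parser = argparse.ArgumentParser()

    # Hyperparameters passed by the client
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--random_state", type=int, default=0)

    # Directories: use the SM_* variables when present, else the current
    # directory (assumed fallback for running the script locally)
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR", "."))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN", "."))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST", "."))

    # parse_known_args ignores extra arguments instead of failing on them
    args, _ = parser.parse_known_args(argv)
    return args
```

With this fallback, `python train.py` with no directory arguments would read and write in the current directory.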


In [36]:
%%writefile predict.py

import os
import pandas as pd
from io import StringIO
import json
import pickle
from tornado.httputil import HTTPServerRequest

# Define the directory printing function from the original code
def print_directory_tree(path, prefix="", is_last=True, ignore_venv=True):
    if os.path.isdir(path):
        dir_name = os.path.basename(path)
        if ignore_venv and dir_name in ['venv', 'env']:
            print(f"{prefix}{'└── ' if is_last else '├── '}{dir_name}/ (Python virtual environment, contents not listed)")
            return
        print(f"{prefix}{'└── ' if is_last else '├── '}{dir_name}/")
        new_prefix = prefix + ("    " if is_last else "│   ")
        contents = os.listdir(path)
        for i, item in enumerate(contents):
            is_last_item = i == len(contents) - 1
            item_path = os.path.join(path, item)
            print_directory_tree(item_path, new_prefix, is_last_item, ignore_venv)
    else:
        print(f"{prefix}{'└── ' if is_last else '├── '}{os.path.basename(path)}")

# Define the model class to fit the template
class MyModel:
    def __init__(self):
        print("Contents of /opt/ml:")
        print_directory_tree('/opt/ml')
        current_directory = os.getcwd()
        print("Current working directory:", current_directory)
        print_directory_tree(current_directory)
        script_directory = os.path.dirname(os.path.abspath(__file__))
        print("Script directory:", script_directory)
        print_directory_tree(script_directory)
        # Load the model from /opt/ml/model/model.pkl, consistent with SageMaker-like environments
        with open('/opt/ml/model/model.pkl', 'rb') as f:
            self.model = pickle.load(f)

    def decode(self, request: HTTPServerRequest) -> str:
        # Decode the request body to a string, as in the template
        return request.body.decode("utf-8")

    def encode(self, response: dict) -> bytes:
        # Encode the response dictionary to JSON bytes, as in the template
        return json.dumps(response).encode("utf-8")

    def invoke(self, request: HTTPServerRequest) -> bytes:
        # Print directory structures every time invoke is called
        print("Contents of /opt/ml:")
        print_directory_tree('/opt/ml')
        current_directory = os.getcwd()
        print("Current working directory:", current_directory)
        print_directory_tree(current_directory)
        script_directory = os.path.dirname(os.path.abspath(__file__))
        print("Script directory:", script_directory)
        print_directory_tree(script_directory)

        # Check content type, similar to input_fn in the original code
        if request.headers.get('Content-Type') != 'text/csv':
            return self.encode({"error": "Please use Content-Type = 'text/csv'"})

        # Decode and process the request body
        request_body = self.decode(request).strip()
        try:
            # Parse the CSV data into a DataFrame, as in input_fn
            df = pd.read_csv(StringIO(request_body), header=None)
            # Generate predictions using the model, as in predict_fn
            predictions = self.model.predict(df)
            # Convert predictions to a list for JSON serialization
            predictions_list = predictions.tolist()
            # Return predictions as a JSON response
            return self.encode({"predictions": predictions_list})
        except Exception:
            # Return an error if CSV parsing or prediction fails
            return self.encode({"error": "Invalid CSV data"})

# Instantiate the model
my_model = MyModel()

# Define the handler as per the template
async def handler(request: HTTPServerRequest):
    return my_model.invoke(request)

Writing predict.py
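
The handler above answers every failure with the same "Invalid CSV data" message, so a request with the wrong number of columns looks the same as a malformed one. A rough sketch of stricter validation that could run before `model.predict`, using only the standard library; `parse_csv_rows` is a hypothetical helper, and the count of 20 comes from the training output in this notebook.

```python
import csv
from io import StringIO

N_FEATURES = 20  # number of feature columns the model was trained on


def parse_csv_rows(body, n_features=N_FEATURES):
    """Turn a CSV request body into a list of float rows, checking the width."""
    rows = []
    reader = csv.reader(StringIO(body.strip()))
    for line_no, fields in enumerate(reader, start=1):
        if len(fields) != n_features:
            raise ValueError(
                f"row {line_no}: expected {n_features} values, got {len(fields)}"
            )
        # float() raises ValueError on non-numeric input, which the caller can report
        rows.append([float(value) for value in fields])
    return rows
```

The handler could catch `ValueError` and return its message in the `error` field, which makes client mistakes easier to diagnose.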


In [37]:
! python train.py --n_estimators 100 \
                   --random_state 0 \
                   --model-dir ./ \
                   --train ./ \
                   --test ./

[INFO] Extracting arguments
SKLearn Version:  1.6.1
[INFO] Reading data

Building training and testing datasets

Column order: 
['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g', 'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height', 'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g', 'touch_screen', 'wifi']

Label column is:  price_range

Data Shape: 

---- SHAPE OF TRAINING DATA (85%) ----
(1700, 20)
(1700,)

---- SHAPE OF TESTING DATA (15%) ----
(300, 20)
(300,)

Training RandomForest Model.....

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
building tree 8 of 100
building tree 2 of 100
building tree 7 of 100
building tree 1 of 100
building tree 5 of 100
building tree 3 of 100
building tree 9 of 100
building tree 4 of 100
building tree 10 of 100
building tree 12 of 100
building tree 11 of 100
building tree 6 of 100
building tree 13 of 100
building tree 15 of 100
building tree 14 of 100
building tree 16 of 100
bui

In [39]:
import pickle
import pandas as pd
import numpy as np

# Load the model saved by train.py so the checks below can inspect it
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Optional: Verify model details (if sklearn >= 1.0)
print("Number of features expected:", model.n_features_in_)
try:
    print("Feature names from training:", model.feature_names_in_)
except AttributeError:
    print("Feature names not stored (sklearn < 1.0), ensure order matches training!")

# Notes:
# 1. Run the training script first and check:
#    - "Column order:" for feature names and order
#    - "SHAPE OF TRAINING DATA" for n_features (e.g., (1700, 20) means 20 features)
#    - "SKLearn Version:" to confirm if feature_names_in_ is available
# 2. Update feature_names and data_df/data_array to match the training output
# 3. Ensure all values are numerical and match the expected feature count

Number of features expected: 20
Feature names from training: ['battery_power' 'blue' 'clock_speed' 'dual_sim' 'fc' 'four_g'
 'int_memory' 'm_dep' 'mobile_wt' 'n_cores' 'pc' 'px_height' 'px_width'
 'ram' 'sc_h' 'sc_w' 'talk_time' 'three_g' 'touch_screen' 'wifi']


In [40]:
import pickle
import pandas as pd
import numpy as np

# Load the trained model from the pickle file
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# --- Input Format for Inference ---
# From your training output:
# - Number of features expected: 20
# - Feature names from training: 
#   ['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g', 
#    'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height', 
#    'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g', 
#    'touch_screen', 'wifi']
# - Input must be a 2D array-like object (e.g., NumPy array or pandas DataFrame)
# - Shape: (n_samples, 20), where n_samples is the number of predictions
# - Column Types: Numerical (int or float) only, as RandomForestClassifier requires numerical input
# - Column Values: Must be appropriate for each feature (e.g., reasonable ranges)

# Define the exact feature names from training
feature_names = ['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g', 
                 'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height', 
                 'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g', 
                 'touch_screen', 'wifi']

# Option 1: Using a pandas DataFrame
# Example data - replace values with realistic ones based on your dataset
data_df = pd.DataFrame({
    'battery_power': [1500, 2000],    # Battery power in mAh (e.g., 500-2000)
    'blue': [1, 0],                   # Bluetooth (0 or 1)
    'clock_speed': [2.5, 1.8],        # CPU speed in GHz (e.g., 0.5-3.0)
    'dual_sim': [1, 0],               # Dual SIM support (0 or 1)
    'fc': [5, 2],                     # Front camera MP (e.g., 0-20)
    'four_g': [1, 0],                 # 4G support (0 or 1)
    'int_memory': [32, 64],           # Internal memory in GB (e.g., 2-128)
    'm_dep': [0.5, 0.3],              # Mobile depth in cm (e.g., 0.1-1.0)
    'mobile_wt': [150, 180],          # Weight in grams (e.g., 80-200)
    'n_cores': [4, 8],                # Number of cores (e.g., 1-8)
    'pc': [8, 12],                    # Primary camera MP (e.g., 0-20)
    'px_height': [720, 1080],         # Pixel height (e.g., 320-2000)
    'px_width': [1280, 1920],         # Pixel width (e.g., 480-2500)
    'ram': [2048, 4096],              # RAM in MB (e.g., 512-8000)
    'sc_h': [5, 6],                   # Screen height in inches (e.g., 3-7)
    'sc_w': [3, 4],                   # Screen width in inches (e.g., 2-5)
    'talk_time': [10, 15],            # Talk time in hours (e.g., 2-20)
    'three_g': [1, 1],                # 3G support (0 or 1)
    'touch_screen': [1, 0],           # Touchscreen (0 or 1)
    'wifi': [1, 1]                    # WiFi support (0 or 1)
}, columns=feature_names)             # Ensures exact order

# Option 2: Using a NumPy array (uncomment if preferred)
# data_array = np.array([
#     [1500, 1, 2.5, 1, 5, 1, 32, 0.5, 150, 4, 8, 720, 1280, 2048, 5, 3, 10, 1, 1, 1],
#     [2000, 0, 1.8, 0, 2, 0, 64, 0.3, 180, 8, 12, 1080, 1920, 4096, 6, 4, 15, 1, 0, 1]
# ])

# Make predictions
predictions = model.predict(data_df)  # Using DataFrame
# predictions = model.predict(data_array)  # Using NumPy array (uncomment if using array)

# Print results
print("Predictions:", predictions)

# Verify model details
print("Number of features expected:", model.n_features_in_)
print("Feature names from training:", model.feature_names_in_)

Predictions: [2 3]
Number of features expected: 20
Feature names from training: ['battery_power' 'blue' 'clock_speed' 'dual_sim' 'fc' 'four_g'
 'int_memory' 'm_dep' 'mobile_wt' 'n_cores' 'pc' 'px_height' 'px_width'
 'ram' 'sc_h' 'sc_w' 'talk_time' 'three_g' 'touch_screen' 'wifi']


[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  16 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.1s finished
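
The DataFrame in the cell above relies on `columns=feature_names` to fix the column order, and the NumPy option relies on typing the values in the right order by hand. A small sketch of a guard that turns one record (a dict) into a correctly ordered row and complains about missing or unexpected keys; `record_to_row` is a hypothetical helper name.

```python
FEATURE_NAMES = [
    "battery_power", "blue", "clock_speed", "dual_sim", "fc", "four_g",
    "int_memory", "m_dep", "mobile_wt", "n_cores", "pc", "px_height",
    "px_width", "ram", "sc_h", "sc_w", "talk_time", "three_g",
    "touch_screen", "wifi",
]


def record_to_row(record, feature_names=FEATURE_NAMES):
    """Return the record's values in training order, or raise on a key mismatch."""
    missing = [name for name in feature_names if name not in record]
    unexpected = [key for key in record if key not in feature_names]
    if missing or unexpected:
        raise KeyError(f"missing={missing}, unexpected={unexpected}")
    return [record[name] for name in feature_names]
```

A list of such rows can be passed to `model.predict` as a 2D array; scikit-learn may warn about missing feature names in that case, since the model was fitted on a DataFrame.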


In [42]:
# test the predict.py file
import asyncio
import os
import shutil
from predict import handler

# Step 1: Simulate the SageMaker directory structure
# Create /opt/ml/model/ and copy model.pkl there (temporary for testing)
model_source_path = "model.pkl"  # Model file in current directory
model_target_dir = "/opt/ml/model"
model_target_path = os.path.join(model_target_dir, "model.pkl")

# Ensure the directory exists (may require sudo or user permissions)
if not os.path.exists(model_target_dir):
    os.makedirs(model_target_dir, exist_ok=True)
    print(f"Created directory: {model_target_dir}")
else:
    print(f"Directory already exists: {model_target_dir}")

# Copy the model file if it’s not already there
if not os.path.exists(model_target_path):
    shutil.copy(model_source_path, model_target_path)
    print(f"Copied model.pkl to {model_target_path}")
else:
    print(f"Model already exists at {model_target_path}")

# Step 2: Define a mock request class
class MockRequest:
    def __init__(self, headers, body):
        self.headers = headers
        self.body = body

# Step 3: Prepare test data (CSV matching the expected 20 features)
csv_data = (
    "1500,1,2.5,1,5,1,32,0.5,150,4,8,720,1280,2048,5,3,10,1,1,1\n"
    "2000,0,1.8,0,2,0,64,0.3,180,8,12,1080,1920,4096,6,4,15,1,0,1"
)
headers = {"Content-Type": "text/csv"}
body = csv_data.encode("utf-8")
mock_request = MockRequest(headers, body)

# Step 4: Define and run the async test function
async def test_handler():
    response = await handler(mock_request)
    print("Response from handler:")
    print(response.decode("utf-8"))

# Run the async function in the notebook
await test_handler()

# Optional: Clean up (remove the temporary directory after testing)
# Uncomment the following lines if you want to clean up
# shutil.rmtree("/opt/ml")
# print("Cleaned up temporary directory /opt/ml")

Contents of /opt/ml:
└── ml/
    └── model/
        ├── test.txt
        └── model.pkl
Current working directory: /home/murivirg/work/github/sagemaker-tutorials/hosting_using_base_container
└── hosting_using_base_container/
    ├── docker/
    │   ├── Dockerfile
    │   └── .ipynb_checkpoints/
    │       └── Dockerfile-checkpoint
    ├── mob_price_classification_train.csv
    ├── train.py
    ├── predict.py
    ├── deploy_sklearn_from_pickle_Model_class_and_custom_container.ipynb
    ├── train-V-1.csv
    ├── test-V-1.csv
    ├── __pycache__/
    │   └── predict.cpython-313.pyc
    ├── env/ (Python virtual environment, contents not listed)
    ├── model.pkl
    └── .ipynb_checkpoints/
        ├── predict-checkpoint.py
        └── deploy_sklearn_from_pickle_Model_class_and_custom_container-checkpoint.ipynb
Script directory: /home/murivirg/work/github/sagemaker-tutorials/hosting_using_base_container
└── hosting_using_base_container/
    ├── docker/
    │   ├── Dockerfile
    │   └── .ip

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  16 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.1s finished
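
The local test above has to create `/opt/ml/model` on the development machine because `predict.py` hard-codes that path. One possible alternative, sketched below, is to let an environment variable override the directory; `MODEL_DIR` is a name I made up for this example, not a variable SageMaker sets.

```python
import os

DEFAULT_MODEL_DIR = "/opt/ml/model"  # path that predict.py loads the model from


def resolve_model_path(filename="model.pkl", env=None):
    """Return the model path, honouring an optional MODEL_DIR override."""
    env = os.environ if env is None else env
    # MODEL_DIR is a hypothetical variable used only for local testing
    model_dir = env.get("MODEL_DIR", DEFAULT_MODEL_DIR)
    return os.path.join(model_dir, filename)
```

With this in place, a local test could set `MODEL_DIR=.` and skip the copy into `/opt/ml` entirely.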


## 7. Save model.pkl into model.tar.gz

In [43]:
import tarfile

with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.pkl")
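
Since `predict.py` expects the file at `/opt/ml/model/model.pkl`, `model.pkl` has to sit at the root of the archive. A short sketch that packs the file and then lists the archive to confirm that; `pack_model` is a hypothetical helper name.

```python
import os
import tarfile


def pack_model(model_file="model.pkl", archive="model.tar.gz"):
    """Write model_file into a gzip tar archive and return the member names."""
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps the file at the archive root wherever it lives locally
        tar.add(model_file, arcname=os.path.basename(model_file))
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()
```

Calling `pack_model()` in the notebook directory should return a list containing only `model.pkl`.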

## 8. Store Model Artifacts(model.tar.gz) into the S3 Bucket. 

In [44]:
s3 = boto3.client('s3')

# Upload the tar.gz file to S3
s3.upload_file("model.tar.gz", bucket, "models/model.tar.gz")
model_data = f"s3://{bucket}/models/model.tar.gz"

print(f"model data: {model_data}")

model data: s3://mainbucketrockhight5461/models/model.tar.gz
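
Before pointing a model at the artifact, it can help to confirm that the object really exists under that URI. A minimal sketch, assuming AWS credentials with read access to the bucket; `split_s3_uri` and `artifact_size` are hypothetical helper names.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a valid S3 object URI: {uri}")
    return parsed.netloc, key


def artifact_size(uri):
    """Return the object's size in bytes; raises if the object is missing."""
    import boto3  # imported lazily so split_s3_uri works without AWS access

    bucket_name, key = split_s3_uri(uri)
    return boto3.client("s3").head_object(Bucket=bucket_name, Key=key)["ContentLength"]
```

In this notebook the call would be `artifact_size(model_data)`.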


#### 8.1 Build the Docker image as instructed by the inference expert

We added `ENV SAGEMAKER_INFERENCE_CODE="predict.handler"` to the Dockerfile, as instructed by the inference expert.

The [Dockerfile instructions](https://github.com/aws/sagemaker-distribution/pull/536#pullrequestreview-2554576424) tell us to use a base conda environment for building this image, but I don't see that environment listed anywhere. The [README](https://github.com/aws/sagemaker-distribution/tree/main) for the SageMaker distribution also says that we don't need to build the image ourselves and can use the one that is provided for us. So I will attempt that.

The Dockerfile looks like this:
```
FROM public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu
ENV SAGEMAKER_INFERENCE_CODE="predict.handler"
```
The script I used to build the image and push it to ECR is:
```
#!/bin/bash

# Set variables
AWS_ACCOUNT_ID="794038231401"  # Replace with your AWS account ID
REGION="us-east-1"              # Replace with your region
REPO_NAME="custom_base_model"   # Replace with your repository name
IMAGE_TAG="latest"              # Optional: change to a specific version like "v1"

# Full ECR image URI
ECR_URI="${AWS_ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO_NAME}:${IMAGE_TAG}"

# Log in to ECR
aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com

# Check if the repository exists, create it if it doesn't
if ! aws ecr describe-repositories --repository-names "${REPO_NAME}" --region ${REGION} &>/dev/null; then
    echo "Creating ECR repository: ${REPO_NAME}"
    aws ecr create-repository --repository-name "${REPO_NAME}" --region ${REGION}
else
    echo "ECR repository already exists: ${REPO_NAME}"
fi

# Remove all containers (optional, kept for consistency)
docker rm $(docker ps -aq)

# Clean up previous images created by this script before building
echo "Checking and removing previous images if they exist..."
if docker image inspect ${REPO_NAME}:${IMAGE_TAG} &>/dev/null; then
    echo "Removing ${REPO_NAME}:${IMAGE_TAG}"
    docker rmi -f ${REPO_NAME}:${IMAGE_TAG}
else
    echo "Image ${REPO_NAME}:${IMAGE_TAG} not found, skipping removal"
fi

if docker image inspect ${ECR_URI} &>/dev/null; then
    echo "Removing ${ECR_URI}"
    docker rmi -f ${ECR_URI}
else
    echo "Image ${ECR_URI} not found, skipping removal"
fi

# Build the Docker image
docker build -t ${REPO_NAME}:${IMAGE_TAG} .

# Tag the image for ECR
docker tag ${REPO_NAME}:${IMAGE_TAG} ${ECR_URI}

# Push the image to ECR
docker push ${ECR_URI}

echo "Docker image pushed to ECR: ${ECR_URI}"
```
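
The script hard-codes the account ID, and the notebook cell that follows repeats the resulting image URI by hand. A sketch of building the same URI in Python instead, assuming the caller's credentials belong to the account that owns the repository; `ecr_image_uri` and `current_account_id` are hypothetical helper names.

```python
def ecr_image_uri(account_id, region, repo_name, tag="latest"):
    """Compose the ECR image URI in the same format the build script uses."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:{tag}"


def current_account_id():
    """Look up the account ID of the active credentials via STS."""
    import boto3  # imported lazily so ecr_image_uri works without AWS access

    return boto3.client("sts").get_caller_identity()["Account"]
```

For example, `ecr_image_uri(current_account_id(), region, "custom_base_model")` would avoid copying the account ID between the script and the notebook.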

In [46]:
ecr_image = '794038231401.dkr.ecr.us-east-1.amazonaws.com/custom_base_model:latest'

# SageMaker execution role ARN (hard-coded here; replace with your own role)
role = "arn:aws:iam::794038231401:role/service-role/SageMaker-ExecutionRole-20250103T203496"

Notes on `SKLearnModel`:
- AWS provides the Dockerfile.
- We provide `entry_point="predict.py"` and it works.

Flow: train, upload, then [container] -> [Model] ...

## 9. Deploy SageMaker Endpoint (API) for trained model, and test it.

In [49]:
!pip freeze > requirements.txt

In [50]:
from sagemaker.model import Model
from time import gmtime, strftime

model_name = "Custom-model-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(f"framework version: {sklearn.__version__}")
model = Model(
    name =  model_name,
    image_uri = ecr_image,
    model_data=model_data,
    role=role,
    entry_point="predict.py",
    dependencies = ['requirements.txt']
)

framework version: 1.6.1


In [51]:
endpoint_name = "Custom-sklearn-model-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("EndpointName={}".format(endpoint_name))

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
    endpoint_name=endpoint_name,
)

EndpointName=Custom-sklearn-model-2025-04-09-17-28-06


--------------------------------------------------*
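
The progress line above says little about how the deployment ended. A minimal sketch for reading the endpoint status and, if present, the failure reason through boto3; `summarize_endpoint` and `endpoint_summary` are hypothetical helper names, and the sketch assumes credentials that are allowed to call `describe_endpoint`.

```python
def summarize_endpoint(description):
    """Condense a describe_endpoint response into one readable line."""
    status = description.get("EndpointStatus", "Unknown")
    reason = description.get("FailureReason")
    return f"{status}: {reason}" if reason else status


def endpoint_summary(endpoint_name):
    """Fetch the endpoint description and summarize it."""
    import boto3  # imported lazily so summarize_endpoint works without AWS access

    sm = boto3.client("sagemaker")
    return summarize_endpoint(sm.describe_endpoint(EndpointName=endpoint_name))
```

Running `print(endpoint_summary(endpoint_name))` after `model.deploy` should show either `InService` or a status with the reason SageMaker reports.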

Based on the documentation (1), I want to do the following:
```
model = Model(
    name =  model_name,
    image_uri = ecr_image,
    model_data=model_data,
    role=role,
    entry_point="predict.py",
    dependencies = ['./env', 'requirements.txt'],
    source_dir = './src'
)
```

But as you can see, a basic definition such as:

```
model = Model(
    name =  model_name,
    image_uri = ecr_image,
    model_data=model_data,
    role=role,
)
```
doesn't work

Documentation:
(1) https://sagemaker.readthedocs.io/en/stable/api/inference/model.html

In [74]:
testX[features][0:2].values.tolist()

[[1454.0,
  1.0,
  0.5,
  1.0,
  1.0,
  0.0,
  34.0,
  0.7,
  83.0,
  4.0,
  3.0,
  250.0,
  1033.0,
  3419.0,
  7.0,
  5.0,
  5.0,
  1.0,
  1.0,
  0.0],
 [1092.0,
  1.0,
  0.5,
  1.0,
  10.0,
  0.0,
  11.0,
  0.5,
  167.0,
  3.0,
  14.0,
  468.0,
  571.0,
  737.0,
  14.0,
  4.0,
  11.0,
  0.0,
  1.0,
  0.0]]

In [75]:
import pandas as pd
import sagemaker
from sagemaker.predictor import Predictor
from sagemaker.deserializers import JSONDeserializer

# Create a predictor object for the existing endpoint
# endpoint_name is the endpoint deployed above; replace it if you use a different one
predictor = Predictor(endpoint_name=endpoint_name, sagemaker_session=sagemaker.Session())
predictor.deserializer = JSONDeserializer()

# Test data
test_data = pd.DataFrame([
    [1454.0, 1.0, 0.5, 1.0, 1.0, 0.0, 34.0, 0.7, 83.0, 4.0, 3.0, 250.0, 1033.0, 3419.0, 7.0, 5.0, 5.0, 1.0, 1.0, 0.0],
    [1092.0, 1.0, 0.5, 1.0, 10.0, 0.0, 11.0, 0.5, 167.0, 3.0, 14.0, 468.0, 571.0, 737.0, 14.0, 4.0, 11.0, 0.0, 1.0, 0.0]
])
csv_data = test_data.to_csv(header=False, index=False)

# Make prediction
try:
    response = predictor.predict(csv_data, initial_args={'ContentType': 'text/csv'})
    if "predictions" in response:
        print("Predictions:", response["predictions"])
    else:
        print("Error from endpoint:", response.get("error", "Unknown error"))
except Exception as e:
    print("Prediction failed:", e)

[3 0]
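
The same request can also be sent without the SageMaker SDK, which is closer to what a client application would do. A sketch using the `sagemaker-runtime` boto3 client, assuming the endpoint answers with the JSON that `predict.py` produces; `rows_to_csv` and `invoke_csv` are hypothetical helper names.

```python
def rows_to_csv(rows):
    """Serialize rows of numbers as headerless CSV, one sample per line."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def invoke_csv(endpoint_name, rows):
    """Send the rows as text/csv and decode the JSON reply."""
    import json
    import boto3  # imported lazily so rows_to_csv works without AWS access

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",  # predict.py rejects any other content type
        Body=rows_to_csv(rows),
    )
    return json.loads(response["Body"].read().decode("utf-8"))
```

For the two test rows above, the call would be `invoke_csv(endpoint_name, testX[features][0:2].values.tolist())`.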


## Don't forget to delete the endpoint!

In [76]:
sm_boto3.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': 'ce296f2c-c619-4b4d-b151-f34d0b0aa710',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'ce296f2c-c619-4b4d-b151-f34d0b0aa710',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Tue, 25 Mar 2025 15:42:17 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}
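
`delete_endpoint` removes only the endpoint itself; the endpoint configuration and the model it references stay in the account. A sketch of a fuller cleanup that would be used in place of the call above, since it needs to describe the endpoint before deleting it; `model_names_from_config` and `delete_endpoint_resources` are hypothetical helper names.

```python
def model_names_from_config(config_description):
    """Collect the model names referenced by an endpoint configuration."""
    variants = config_description.get("ProductionVariants", [])
    return [variant["ModelName"] for variant in variants if "ModelName" in variant]


def delete_endpoint_resources(endpoint_name):
    """Delete the endpoint, then its configuration and the models it used."""
    import boto3  # imported lazily so the helper above works without AWS access

    sm = boto3.client("sagemaker")
    config_name = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
    config = sm.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = model_names_from_config(config)

    sm.delete_endpoint(EndpointName=endpoint_name)
    sm.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sm.delete_model(ModelName=name)
```

The model artifact in S3 and the image in ECR are not touched by this and would need to be removed separately if they are no longer needed.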


In [41]:
!pip freeze > requirements.txt

# Now we will test it by building the Dockerfile ourselves from the repository