### Mobile Price Classification using SKLearn Custom Script in Sagemaker

In [1]:
import sagemaker
from sklearn.model_selection import train_test_split
import boto3
import pandas as pd

sm_boto3 = boto3.client("sagemaker")
sess = sagemaker.Session()
region = sess.boto_session.region_name
bucket = 'mobbucketsagemaker' # Mention the created S3 bucket name here
print("Using bucket " + bucket)

Using bucket mobbucketsagemaker


This cell sets up a SageMaker session and defines variables used later in the notebook:

* The code imports the necessary libraries: sagemaker, train_test_split from sklearn.model_selection, boto3, and pandas.
* It creates a Boto3 client for SageMaker using boto3.client("sagemaker").
* It initiates a SageMaker session using sagemaker.Session().
* The region name of the current session is stored in the region variable.
* It sets the S3 bucket name to 'mobbucketsagemaker' and stores it in the bucket variable.
* Finally, it prints the message "Using bucket " followed by the S3 bucket name to indicate which bucket will be used.
* This cell performs no machine learning by itself. It only prepares the environment for working with SageMaker and names the S3 bucket that will hold the data and model artifacts.

In [2]:
df = pd.read_csv("mob_price_classification_train.csv")

- It reads the data from the CSV file "mob_price_classification_train.csv" and creates a pandas DataFrame called 'df' to store the data.

In [3]:
df.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7,0.6,188,2,...,20,756,2549,9,7,19,0,0,1,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,...,905,1988,2631,17,3,7,1,1,0,2
2,563,1,0.5,1,2,1,41,0.9,145,5,...,1263,1716,2603,11,2,9,1,1,0,2
3,615,1,2.5,0,0,0,10,0.8,131,6,...,1216,1786,2769,16,8,11,1,0,0,2
4,1821,1,1.2,0,13,1,44,0.6,141,2,...,1208,1212,1411,8,2,15,1,1,0,1


- The df.head() method displays the first five rows of the pandas DataFrame 'df' by default. It is a quick way to inspect the column names and a sample of the values. The data types themselves are not shown; df.dtypes or df.info() reports them.

In [4]:
df.shape

(2000, 21)

The df.shape attribute is used to determine the dimensions (number of rows and columns) of the pandas DataFrame 'df'. It returns a tuple with two elements:

* The first element represents the number of rows in the DataFrame.
* The second element represents the number of columns in the DataFrame.

In [5]:
# Share of rows in each price_range class (0 to 3)
df['price_range'].value_counts(normalize=True)

price_range
1    0.25
2    0.25
3    0.25
0    0.25
Name: proportion, dtype: float64

This cell examines the distribution of values in the 'price_range' column of the pandas DataFrame 'df'. The column is categorical: it encodes four price ranges (0 to 3) for mobile devices, and the output shows that the classes are perfectly balanced at 25% each.

Here's what the code does:

* df['price_range'] refers to the 'price_range' column in the DataFrame 'df'.
* value_counts() is a pandas function that counts the occurrences of each unique value in the specified column. It returns a Series where the index is the unique value and the value is the count of occurrences.
* normalize=True is used as an argument to the value_counts() function to get the relative frequencies (proportions) of each unique value instead of the raw counts.
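As a small illustration on made-up labels (not the notebook's data), the difference between raw counts and proportions can be sketched like this:

```python
import pandas as pd

# Hypothetical toy labels, not taken from the mobile price dataset.
labels = pd.Series([0, 1, 1, 2, 2, 2, 3, 3], name="price_range")

counts = labels.value_counts()                     # raw occurrences per class
proportions = labels.value_counts(normalize=True)  # fractions that sum to 1

print(counts.sort_index())
print(proportions.sort_index())
```

Sorting by index is optional; by default value_counts() orders the result by frequency.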

In [6]:
df.columns

Index(['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g',
       'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height',
       'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi', 'price_range'],
      dtype='object')

The df.columns attribute returns the column names of the pandas DataFrame 'df' as a pandas Index object, which makes it easy to inspect them.

In [7]:
df.shape

(2000, 21)

In [8]:
# Percentage of missing values in each column
df.isnull().mean() * 100

battery_power    0.0
blue             0.0
clock_speed      0.0
dual_sim         0.0
fc               0.0
four_g           0.0
int_memory       0.0
m_dep            0.0
mobile_wt        0.0
n_cores          0.0
pc               0.0
px_height        0.0
px_width         0.0
ram              0.0
sc_h             0.0
sc_w             0.0
talk_time        0.0
three_g          0.0
touch_screen     0.0
wifi             0.0
price_range      0.0
dtype: float64

The code provided calculates the percentage of missing values in each column of the pandas DataFrame 'df'. Here's what the code does:

* df.isnull() returns a DataFrame of the same shape as 'df' where each cell contains a boolean value indicating whether the corresponding cell in 'df' is null (True) or not null (False).
* .mean() is then used to calculate the mean (average) of each column. Since the boolean values are treated as 1 for True and 0 for False, taking the mean effectively calculates the proportion of True (missing) values in each column.
* Finally, * 100 is used to convert the proportions to percentages, giving the percentage of missing values in each column.
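Because the real dataset has no gaps, the effect is easier to see on a toy frame. The sketch below uses invented values and also shows the related isnull().sum() form, which gives absolute counts instead of percentages:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps; the real dataset has none.
toy = pd.DataFrame({
    "ram": [512, np.nan, 2048, 4096],
    "wifi": [1, 0, 1, 1],
    "fc": [np.nan, np.nan, 5, 8],
})

missing_pct = toy.isnull().mean() * 100  # share of NaN cells per column, in percent
missing_count = toy.isnull().sum()       # absolute number of NaN cells per column

print(missing_pct)
print(missing_count)
```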

In [9]:
features = list(df.columns)
features

['battery_power',
 'blue',
 'clock_speed',
 'dual_sim',
 'fc',
 'four_g',
 'int_memory',
 'm_dep',
 'mobile_wt',
 'n_cores',
 'pc',
 'px_height',
 'px_width',
 'ram',
 'sc_h',
 'sc_w',
 'talk_time',
 'three_g',
 'touch_screen',
 'wifi',
 'price_range']

The code creates a list named features containing the column names of the pandas DataFrame 'df'. Here's what the code does:

* df.columns returns a pandas Index containing the column names of the DataFrame 'df'.
* list(df.columns) converts the array-like object to a regular Python list.

In [10]:
label = features.pop(-1)
label

'price_range'

The code pops the last element from the features list and assigns it to the variable label. The pop() method in Python removes and returns the element at the specified index (in this case, -1 refers to the last element in the list).

Here's what the code does:

* features.pop(-1) removes the last element from the features list and returns its value.
* The value returned by features.pop(-1) is assigned to the variable label.
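A minimal sketch of the same idea on a shortened column list follows. The name-based variant at the end is an alternative, not what the notebook does; it may be preferable when the label is not guaranteed to be the last column.

```python
# Column names from the dataset, shortened for illustration.
columns = ["battery_power", "ram", "wifi", "price_range"]

features = list(columns)   # copy, so the original list is untouched
label = features.pop(-1)   # removes and returns the last element

print(features)  # ['battery_power', 'ram', 'wifi']
print(label)     # price_range

# Alternative that does not depend on the label's position.
features_by_name = [c for c in columns if c != "price_range"]
```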

In [11]:
x = df[features]
y = df[label]

Here the DataFrame 'df' is split into two objects, 'x' and 'y'. This is a standard step when preparing data for machine learning: 'x' holds the features (the model's inputs) and 'y' holds the target labels that the model should learn to predict.

Here's what the code does:

* df[features]: This line creates a new DataFrame 'x' by selecting only the columns specified in the features list. 'x' will contain the data corresponding to the columns listed in 'features'. It represents the input features for the machine learning model.
* df[label]: This line creates a new Series 'y' by selecting the column specified by the variable label. 'y' will contain the data corresponding to the column represented by label. It represents the target variable or the output labels for the machine learning model.

In [12]:
x.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,pc,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi
0,842,0,2.2,0,1,0,7,0.6,188,2,2,20,756,2549,9,7,19,0,0,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,6,905,1988,2631,17,3,7,1,1,0
2,563,1,0.5,1,2,1,41,0.9,145,5,6,1263,1716,2603,11,2,9,1,1,0
3,615,1,2.5,0,0,0,10,0.8,131,6,9,1216,1786,2769,16,8,11,1,0,0
4,1821,1,1.2,0,13,1,44,0.6,141,2,14,1208,1212,1411,8,2,15,1,1,0


The x.head() function is used to quickly inspect the first few rows of the DataFrame 'x' and get an overview of its structure and data. The actual output will vary based on the data present in your 'x' DataFrame.

In [13]:
# First five target labels (price_range takes the values 0 to 3)
y.head()

0    1
1    2
2    2
3    2
4    1
Name: price_range, dtype: int64

The y.head() function displays the first few rows of the 'y' Series, which by default is the first 5 rows. Each row represents the target label for the corresponding row in the 'x' DataFrame.

In [14]:
x.shape

(2000, 20)

In [15]:
y.value_counts()

price_range
1    500
2    500
3    500
0    500
Name: count, dtype: int64

The y.value_counts() function is used to count the occurrences of each unique value in the Series 'y'. In this case, 'y' represents the target variable or output labels, and we want to understand the distribution of different label values in the Series.

In [16]:
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.15, random_state=0)

The code provided performs a train-test split on the input features ('x') and the target variable ('y') using the train_test_split function from scikit-learn. This is a common technique used in machine learning to divide the data into two sets: one for training the model and another for testing its performance.

Here's what the code does:

* train_test_split(x, y, test_size=0.15, random_state=0): This function takes the input features 'x' and the target variable 'y' as inputs and splits them into four sets: 'X_train', 'X_test', 'y_train', and 'y_test'.
* test_size=0.15: This parameter specifies the proportion of the data that should be allocated to the test set. In this case, 15% of the data will be used for testing, and the remaining 85% will be used for training.
* random_state=0: This parameter is used to set the random seed for reproducibility. By setting it to a specific value (e.g., 0), the train-test split will always produce the same result when executed multiple times.

After executing the code, you will have the following sets:

* 'X_train': The training set of input features.
* 'X_test': The test set of input features.
* 'y_train': The training set of target variable (labels).
* 'y_test': The test set of target variable (labels).

These sets can be used to train a machine learning model on the training data and evaluate its performance on the test data.
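The notebook's split is purely random. As an optional variation, train_test_split also accepts a stratify argument that keeps the class proportions the same in both subsets. The sketch below uses synthetic data of the same size and class balance as the dataset; the column names f1 to f3 are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 2000 rows, 4 balanced classes.
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(2000, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series(np.repeat([0, 1, 2, 3], 500), name="price_range")

# stratify=y_demo keeps the class proportions identical in both subsets.
Xtr, Xte, ytr, yte = train_test_split(
    x_demo, y_demo, test_size=0.15, random_state=0, stratify=y_demo
)

print(Xtr.shape, Xte.shape)
print(yte.value_counts().sort_index())
```

With four classes of 500 rows each and a 15% test share, every class contributes 75 rows to the test set.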

In [17]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(1700, 20)

(300, 20)

(1700,)

(300,)


These shapes indicate the number of samples (rows) and features (columns) in the input feature DataFrames and the number of samples in the target label Series.

In [18]:
trainX = pd.DataFrame(X_train)
trainX[label] = y_train

testX = pd.DataFrame(X_test)
testX[label] = y_test

In the code, new DataFrames named 'trainX' and 'testX' are created to combine the training and test sets of input features 'X_train' and 'X_test', respectively, with their corresponding target labels 'y_train' and 'y_test'.

Here's what the code does:

- trainX = pd.DataFrame(X_train): This line creates a new DataFrame 'trainX' from the training set of input features 'X_train'. The columns in 'trainX' will be the same as in 'X_train'.
- trainX[label] = y_train: This line adds a new column to 'trainX' with the name specified by the variable 'label', and populates it with the values from the training set of target labels 'y_train'. This step effectively adds the target variable as a new column in the training set.
- testX = pd.DataFrame(X_test): This line creates a new DataFrame 'testX' from the test set of input features 'X_test'. The columns in 'testX' will be the same as in 'X_test'.
- testX[label] = y_test: This line adds a new column to 'testX' with the name specified by the variable 'label', and populates it with the values from the test set of target labels 'y_test'. This step effectively adds the target variable as a new column in the test set.

The resulting 'trainX' and 'testX' DataFrames will have the target variable as one of their columns, allowing you to use them for training and testing machine learning models that require the input features and their corresponding target labels in a single DataFrame.
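The notebook attaches the label by assigning a new column. A possible variant, shown here on invented miniature data, is DataFrame.assign(), which returns a new frame and leaves the feature frame untouched. The values are aligned by index, which matters after a shuffled split.

```python
import pandas as pd

# Hypothetical miniature stand-ins for X_train and y_train.
X_small = pd.DataFrame({"ram": [512, 2048, 4096], "wifi": [0, 1, 1]}, index=[7, 2, 5])
y_small = pd.Series([0, 1, 3], index=[7, 2, 5], name="price_range")

# assign() returns a new DataFrame, so X_small itself is left unchanged.
train_small = X_small.assign(price_range=y_small)

print(train_small)
```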

In [19]:
print(trainX.shape)
print(testX.shape)

(1700, 21)

(300, 21)


In [20]:
trainX.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
1452,1450,0,2.1,0,1,0,31,0.6,114,5,...,1573,1639,794,11,5,9,0,1,1,1
1044,1218,1,2.8,1,3,0,39,0.8,150,7,...,1122,1746,1667,10,0,12,0,0,0,1
1279,1602,0,0.6,0,12,0,58,0.4,170,1,...,1259,1746,3622,17,2,17,0,1,1,3
674,1034,0,2.6,1,2,1,45,0.3,190,3,...,182,1293,969,15,1,7,1,0,0,0
1200,530,0,2.4,0,1,0,32,0.3,88,6,...,48,1012,959,17,7,6,0,1,0,0


In [21]:
trainX.isnull().sum()

battery_power    0
blue             0
clock_speed      0
dual_sim         0
fc               0
four_g           0
int_memory       0
m_dep            0
mobile_wt        0
n_cores          0
pc               0
px_height        0
px_width         0
ram              0
sc_h             0
sc_w             0
talk_time        0
three_g          0
touch_screen     0
wifi             0
price_range      0
dtype: int64

This cell counts the missing values in each column of the pandas DataFrame trainX.

Here's a breakdown of what the code does:

- trainX: The pandas DataFrame that holds the training features together with the label column.
- isnull(): This DataFrame method returns a DataFrame of the same shape filled with booleans, where True marks a missing value and False a present one.
- sum(): Called on that boolean DataFrame, sum() adds up the True values column by column, giving the number of missing values in each column.

In [22]:
testX.isnull().sum()

battery_power    0
blue             0
clock_speed      0
dual_sim         0
fc               0
four_g           0
int_memory       0
m_dep            0
mobile_wt        0
n_cores          0
pc               0
px_height        0
px_width         0
ram              0
sc_h             0
sc_w             0
talk_time        0
three_g          0
touch_screen     0
wifi             0
price_range      0
dtype: int64

In [23]:
trainX.to_csv("train-V-1.csv",index = False)
testX.to_csv("test-V-1.csv", index = False)

In the code, the DataFrames 'trainX' and 'testX' are saved as CSV files named "train-V-1.csv" and "test-V-1.csv", respectively. The to_csv() method is used to export the DataFrames to CSV format.

Here's what the code does:

- trainX.to_csv("train-V-1.csv", index=False): This line saves the DataFrame 'trainX' to a CSV file named "train-V-1.csv" in the current working directory. The index=False parameter ensures that the DataFrame's index is not included in the CSV file.
- testX.to_csv("test-V-1.csv", index=False): This line saves the DataFrame 'testX' to a CSV file named "test-V-1.csv" in the current working directory. The index=False parameter ensures that the DataFrame's index is not included in the CSV file.

After executing this code, you will find two CSV files named "train-V-1.csv" and "test-V-1.csv" in the notebook's current working directory, since no other path is given. They hold the data from 'trainX' and 'testX', respectively, and are the files uploaded to S3 in the next step.
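The index=False argument matters for the training script, which treats every column except the last as a feature. A small round-trip on an invented frame shows what happens without it: the index is written as an extra column and read back as 'Unnamed: 0'.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature frame with a shuffled index, like trainX after the split.
frame = pd.DataFrame({"ram": [512, 2048], "price_range": [0, 1]}, index=[1452, 1044])

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    frame.to_csv(with_index)                  # index written as an unnamed first column
    frame.to_csv(without_index, index=False)  # only the real columns are written

    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)     # ['Unnamed: 0', 'ram', 'price_range']
print(cols_without)  # ['ram', 'price_range']
```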

In [24]:
bucket

'mobbucketsagemaker'

This cell simply displays the value of the bucket variable, 'mobbucketsagemaker', which is the name of the Amazon S3 bucket set in the first cell.

Amazon S3 (Simple Storage Service) is a cloud-based storage service provided by Amazon Web Services (AWS). Buckets are containers for storing objects (files) in S3. Each object is identified by its key, which is unique within its bucket; the bucket name and the key together locate the object.

Using S3 buckets, you can store various types of data, such as files, images, videos, or any other type of data you need to store securely in the cloud. S3 buckets are widely used for various purposes, including data storage for applications, data backups, data sharing, and serving static content for websites.

The cell performs no action on the bucket. The variable is used in later cells to tell SageMaker where to upload the data.
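To make the bucket and key relationship concrete, the sketch below splits an S3 URI into its two parts with the standard library. The helper name split_s3_uri and the example path are illustrative assumptions, not part of the notebook or of any AWS SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, key = split_s3_uri("s3://mobbucketsagemaker/sagemaker/demo/train-V-1.csv")
print(bucket_name)  # mobbucketsagemaker
print(key)          # sagemaker/demo/train-V-1.csv
```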

In [25]:
# send data to S3. SageMaker will take training data from s3
sk_prefix = "sagemaker/mobile_price_classification/sklearncontainer"
trainpath = sess.upload_data(
    path="train-V-1.csv", bucket=bucket, key_prefix=sk_prefix
)

testpath = sess.upload_data(
    path="test-V-1.csv", bucket=bucket, key_prefix=sk_prefix
)
print(trainpath)
print(testpath)

s3://mobbucketsagemaker/sagemaker/mobile_price_classification/sklearncontainer/train-V-1.csv

s3://mobbucketsagemaker/sagemaker/mobile_price_classification/sklearncontainer/test-V-1.csv


In the provided code, the SageMaker sess.upload_data() function is used to upload the data from local files "train-V-1.csv" and "test-V-1.csv" to the specified Amazon S3 bucket. This is a common process when preparing data for SageMaker training jobs, where the data needs to be available in an S3 bucket for the training to take place.

Here's what the code does:

1. sk_prefix = "sagemaker/mobile_price_classification/sklearncontainer": This line sets the prefix for the S3 key (object name) under which the data will be stored. It's a way of organizing the data within the specified S3 bucket.
2. trainpath = sess.upload_data(path="train-V-1.csv", bucket=bucket, key_prefix=sk_prefix): This line uploads the local file "train-V-1.csv" to the specified S3 bucket. The sess.upload_data() function takes the path of the local file to upload, the target S3 bucket specified by the variable bucket, and the key prefix to organize the data within the bucket. The function returns the S3 path where the file was uploaded.
3. testpath = sess.upload_data(path="test-V-1.csv", bucket=bucket, key_prefix=sk_prefix): This line uploads the local file "test-V-1.csv" to the same S3 bucket using the same key prefix. The sess.upload_data() function is used again to upload the file, and it returns the S3 path.
4. print(trainpath): This line prints the S3 path where the "train-V-1.csv" file was uploaded.
5. print(testpath): This line prints the S3 path where the "test-V-1.csv" file was uploaded.

The print(trainpath) and print(testpath) statements output the S3 URIs of the uploaded files. These URIs are passed to the estimator later to tell the SageMaker training job where to find the data.
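The printed paths follow a simple pattern: bucket, then key prefix, then file name. The helper below is a hypothetical illustration of that pattern for a single-file upload, inferred from the output above rather than taken from the SageMaker SDK.

```python
import posixpath


def expected_upload_uri(local_path, bucket, key_prefix):
    """Hypothetical helper: the URI a single-file upload is expected to produce."""
    filename = posixpath.basename(local_path)
    return f"s3://{bucket}/{posixpath.join(key_prefix, filename)}"


uri = expected_upload_uri(
    "train-V-1.csv",
    "mobbucketsagemaker",
    "sagemaker/mobile_price_classification/sklearncontainer",
)
print(uri)
```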

In [26]:
%%writefile script.py

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, precision_score, recall_score, f1_score, roc_curve, auc
import sklearn
import joblib
import boto3
import pathlib
from io import StringIO
import argparse
import os
import numpy as np
import pandas as pd
    
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf
    
if __name__ == "__main__":

    print("[INFO] Extracting arguments")
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--random_state", type=int, default=0)

    # Data, model, and output directories
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--train-file", type=str, default="train-V-1.csv")
    parser.add_argument("--test-file", type=str, default="test-V-1.csv")

    args, _ = parser.parse_known_args()
    
    print("SKLearn Version: ", sklearn.__version__)
    print("Joblib Version: ", joblib.__version__)

    print("[INFO] Reading data")
    print()
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))
    
    features = list(train_df.columns)
    label = features.pop(-1)
    
    print("Building training and testing datasets")
    print()
    X_train = train_df[features]
    X_test = test_df[features]
    y_train = train_df[label]
    y_test = test_df[label]

    print('Column order: ')
    print(features)
    print()
    
    print("Label column is: ",label)
    print()
    
    print("Data Shape: ")
    print()
    print("---- SHAPE OF TRAINING DATA (85%) ----")
    print(X_train.shape)
    print(y_train.shape)
    print()
    print("---- SHAPE OF TESTING DATA (15%) ----")
    print(X_test.shape)
    print(y_test.shape)
    print()
    
  
    print("Training RandomForest Model.....")
    print()
    model =  RandomForestClassifier(n_estimators=args.n_estimators, random_state=args.random_state, verbose = 3,n_jobs=-1)
    model.fit(X_train, y_train)
    print()
    

    model_path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model,model_path)
    print("Model persisted at " + model_path)
    print()

    
    y_pred_test = model.predict(X_test)
    test_acc = accuracy_score(y_test,y_pred_test)
    test_rep = classification_report(y_test,y_pred_test)

    print()
    print("---- METRICS RESULTS FOR TESTING DATA ----")
    print()
    print("Total Rows are: ", X_test.shape[0])
    print('[TESTING] Model Accuracy is: ', test_acc)
    print('[TESTING] Testing Report: ')
    print(test_rep)

Writing script.py


The script trains a Scikit-learn RandomForestClassifier, saves the trained model with joblib, and prints evaluation metrics for the test data. It is written to run as the entry point of an Amazon SageMaker training job.

Here's an overview of the key sections of the script:

1. model_fn: A function that loads the trained model from the specified model directory. This function will be used by SageMaker to load the model during deployment.
2. Main Script:

- The script starts by extracting the hyperparameters and data directories from the command-line arguments passed by SageMaker.
- It reads the training and testing data from the respective CSV files specified by the arguments.
- It builds the training and testing datasets by separating the input features and target labels.
- The RandomForestClassifier model is trained on the training dataset (X_train and y_train).
- The trained model is saved using joblib, and the file is stored in the specified model directory.
- The model's performance is evaluated on the testing dataset (X_test and y_test) using accuracy and classification report.
- The metrics are printed to the training job's log output.

This script follows the structure SageMaker expects of a Scikit-learn training script. SageMaker exposes the local directories of the input channels through the SM_CHANNEL_TRAIN and SM_CHANNEL_TEST environment variables and collects the trained model from the directory named by SM_MODEL_DIR. Hyperparameters arrive as command-line arguments, so the model can be reconfigured from the estimator without editing the script.
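The argument handling can be tried locally without SageMaker. The sketch below repeats a subset of the script's parser and feeds it a hand-written argument list and environment; the parse_args wrapper, the fake_env values, and the extra flag are assumptions made for illustration.

```python
import argparse


def parse_args(argv, env):
    """Parse arguments the way script.py does; env stands in for os.environ."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--random_state", type=int, default=0)
    parser.add_argument("--model-dir", type=str, default=env.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=env.get("SM_CHANNEL_TRAIN"))
    # parse_known_args ignores flags the parser does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Hypothetical values: inside the container SageMaker sets these itself.
fake_env = {"SM_MODEL_DIR": "/opt/ml/model", "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train"}
args = parse_args(["--n_estimators", "250", "--some-other-flag", "x"], fake_env)

print(args.n_estimators, args.random_state)  # 250 0
print(args.model_dir, args.train)
```

Using parse_known_args rather than parse_args means that an unexpected extra argument does not abort the script.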

In [27]:
from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = "0.23-1"

sklearn_estimator = SKLearn(
    entry_point="script.py",
    role="arn:aws:iam::566373416292:role/service-role/AmazonSageMaker-ExecutionRole-20230120T164209",
    instance_count=1,
    instance_type="ml.m5.large",
    framework_version=FRAMEWORK_VERSION,
    base_job_name="RF-custom-sklearn",
    hyperparameters={
        "n_estimators": 100,
        "random_state": 0,
    },
    use_spot_instances = True,
    max_wait = 7200,
    max_run = 3600
)

The provided code sets up an Amazon SageMaker SKLearn Estimator for training a custom Scikit-learn model on AWS SageMaker. The estimator defines the configuration required for the training job, including the script to be executed, the instance type for training, hyperparameters, and other settings.

Here's a breakdown of the key settings and options:

- entry_point: The name of the script file that contains the custom Scikit-learn model implementation. In this case, it is "script.py."
- role: The Amazon Resource Name (ARN) of the IAM role that SageMaker will assume to perform tasks on your behalf. This role should have the necessary permissions for reading data from S3 and writing model artifacts to S3.
- instance_count: The number of instances to use for training. In this case, a single instance (1) will be used.
- instance_type: The type of EC2 instance to use for training. Here, it is set to "ml.m5.large."
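- framework_version: The version tag of the SageMaker Scikit-learn container used for training. Here it is "0.23-1", a scikit-learn 0.23 image.
- base_job_name: The prefix for the training job name. SageMaker appends a timestamp to it so that each job name is unique, for example RF-custom-sklearn-2023-06-23-17-51-03-767.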
- hyperparameters: A dictionary of hyperparameters to be passed to the custom Scikit-learn script. In this case, it sets the "n_estimators" hyperparameter to 100 and the "random_state" hyperparameter to 0.
- use_spot_instances: This option enables the use of Amazon EC2 Spot Instances for training. Spot Instances can significantly reduce training costs.
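- max_wait: The maximum total time, in seconds, that SageMaker waits for the job to complete, covering both the wait for Spot capacity and the training itself. It applies only when use_spot_instances is True and must be at least max_run. Here it is 7200 (two hours).
- max_run: The maximum time, in seconds, that training may run before SageMaker stops the job. It applies to on-demand and Spot training alike. Here it is 3600 (one hour).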

Once the sklearn_estimator is defined, you can use it to launch the SageMaker training job by calling the fit() method on it, passing the S3 paths of the training and testing datasets as arguments.
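For managed Spot training, max_wait is the overall time limit (waiting for capacity plus training) and has to be at least as large as max_run. A hypothetical pre-flight check, not part of the SageMaker SDK, could look like this:

```python
def check_spot_limits(max_run, max_wait):
    """Hypothetical pre-flight check for the two time limits, in seconds."""
    if max_run <= 0:
        raise ValueError("max_run must be positive")
    if max_wait < max_run:
        raise ValueError("max_wait must be at least max_run")
    # Time left over for waiting on Spot capacity or recovering from interruptions.
    return max_wait - max_run


slack = check_spot_limits(max_run=3600, max_wait=7200)
print(slack)  # 3600
```

With the notebook's settings, up to an hour remains for waiting on Spot capacity beyond the hour allowed for training.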

In [28]:
# launch training job; wait=True blocks until the job finishes
sklearn_estimator.fit({"train": trainpath, "test": testpath}, wait=True)
# sklearn_estimator.fit({"train": datapath}, wait=True)

Using provided s3_resource


INFO:sagemaker:Creating training-job with name: RF-custom-sklearn-2023-06-23-17-51-03-767


2023-06-23 17:51:11 Starting - Starting the training job...

2023-06-23 17:51:26 Starting - Preparing the instances for training......

2023-06-23 17:52:25 Downloading - Downloading input data...

2023-06-23 17:53:31 Training - Training image download completed. Training in progress..2023-06-23 17:53:34,866 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training

2023-06-23 17:53:34,870 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)

2023-06-23 17:53:34,920 sagemaker_sklearn_container.training INFO     Invoking user training script.

2023-06-23 17:53:35,093 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)

2023-06-23 17:53:35,107 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)

2023-06-23 17:53:35,120 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)

2023-06-23 17:53:35,129 sagemaker-training-toolkit INFO     Invoking u

This cell launches a SageMaker training job using the previously defined sklearn_estimator. Because wait=True, the call is synchronous: fit() blocks and streams the job's logs until training finishes. The training data is passed as a dictionary that maps channel names to the S3 paths of the training and testing datasets.

Here's what the code does:

1. sklearn_estimator.fit({"train": trainpath, "test": testpath}, wait=True): This line starts the SageMaker training job. The fit() method takes the input data as a dictionary whose keys are channel names and whose values are the S3 paths of the corresponding datasets. Inside the training container, script.py finds each channel's local directory through the matching SM_CHANNEL_TRAIN or SM_CHANNEL_TEST environment variable.
- "train": trainpath: The S3 path to the training dataset previously uploaded to S3 is provided as the value for the "train" key.
- "test": testpath: The S3 path to the testing dataset previously uploaded to S3 is provided as the value for the "test" key.
- wait=True: fit() blocks until the training job completes, so the next notebook cell runs only after training has finished.

2. sklearn_estimator.fit({"train": datapath}, wait=True): This commented-out line shows how fit() could be called with a training channel only. It is not executed, and datapath is not defined in the cells shown here.

Calling sklearn_estimator.fit({"train": trainpath, "test": testpath}, wait=True) therefore launches the job and blocks until it finishes. SageMaker then packages the contents of the model directory and uploads them to S3. Since the estimator sets no output_path, the artifact goes to the session's default bucket, as the next cell's output shows.
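The link between the dictionary keys and the script's environment variables follows a naming convention in SageMaker's training containers: a channel called train is mounted under /opt/ml/input/data/train and announced as SM_CHANNEL_TRAIN. The helper below only illustrates that convention; its name and the example bucket are assumptions.

```python
def channel_env(channels):
    """Hypothetical helper: env var name and container path for each channel name."""
    return {
        f"SM_CHANNEL_{name.upper()}": f"/opt/ml/input/data/{name}"
        for name in channels
    }


inputs = {"train": "s3://example-bucket/train.csv", "test": "s3://example-bucket/test.csv"}
for var, path in channel_env(inputs).items():
    print(var, "->", path)
```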

In [29]:
sklearn_estimator.latest_training_job.wait(logs="None")
artifact = sm_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name
)["ModelArtifacts"]["S3ModelArtifacts"]

print("Model artifact persisted at " + artifact)



2023-06-23 17:54:51 Starting - Preparing the instances for training

2023-06-23 17:54:51 Downloading - Downloading input data

2023-06-23 17:54:51 Training - Training image download completed. Training in progress.

2023-06-23 17:54:51 Uploading - Uploading generated training model

2023-06-23 17:54:51 Completed - Training job completed

Model artifact persisted at s3://sagemaker-us-east-1-566373416292/RF-custom-sklearn-2023-06-23-17-51-03-767/output/model.tar.gz


After the fit() call, this cell calls wait() on the estimator's latest training job and then looks up where the model artifact was stored. Because fit() already blocked until the job completed, wait() returns almost immediately, which is why every status line in the output carries the same timestamp.

Here's what the code does:

1. sklearn_estimator.latest_training_job.wait(logs="None"): This line waits for the training job represented by sklearn_estimator.latest_training_job to complete. The wait() method is called with logs="None" to suppress the display of logs while waiting for the job to finish. This makes the script synchronous, and it will wait until the training job completes before proceeding.
2. artifact = sm_boto3.describe_training_job(TrainingJobName=sklearn_estimator.latest_training_job.name)["ModelArtifacts"]["S3ModelArtifacts"]: After the training job is completed, the code uses the SageMaker Boto3 client (sm_boto3) to describe the training job and retrieve the S3 path to the trained model artifacts. The describe_training_job method is called with the TrainingJobName set to sklearn_estimator.latest_training_job.name, which retrieves details of the latest training job. The ModelArtifacts field contains information about the model artifacts, and the S3ModelArtifacts field contains the S3 path to the trained model.
3. print("Model artifact persisted at " + artifact): The S3 path to the trained model is printed, indicating where the model artifacts are stored after the training job is completed.

By using wait() and describe_training_job, the script ensures that it waits for the training job to finish before proceeding to retrieve and print the S3 path to the trained model. This way, you can be sure that the trained model is ready and accessible before further processing or deploying it.
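The lookup can be made slightly more defensive by checking the job status before reading the artifact path. The sketch below works on a hand-written dictionary shaped like the relevant part of a describe_training_job response; the helper name and all sample values are made up.

```python
def artifact_from_description(description):
    """Return the model artifact URI, but only for a completed job."""
    status = description.get("TrainingJobStatus")
    if status != "Completed":
        raise RuntimeError(f"training job is not complete (status: {status})")
    return description["ModelArtifacts"]["S3ModelArtifacts"]


# Hand-written stand-in for a describe_training_job response; values are made up.
sample = {
    "TrainingJobName": "example-job",
    "TrainingJobStatus": "Completed",
    "ModelArtifacts": {
        "S3ModelArtifacts": "s3://example-bucket/example-job/output/model.tar.gz"
    },
}
print(artifact_from_description(sample))
```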

In [30]:
artifact

's3://sagemaker-us-east-1-566373416292/RF-custom-sklearn-2023-06-23-17-51-03-767/output/model.tar.gz'

The variable artifact contains the S3 path to the trained model artifacts that were generated during the SageMaker training job. These artifacts include the model itself and any associated files created during the training process.

The value is a plain string holding the S3 path, so it can be passed directly as model_data when the model is created in the next step.
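If the artifact needs to be inspected or downloaded, the S3 path has to be split into a bucket name and an object key. A minimal sketch using only the standard library is shown below; split_s3_uri is a hypothetical helper name, and the URI is the one printed above.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


artifact_uri = (
    "s3://sagemaker-us-east-1-566373416292/"
    "RF-custom-sklearn-2023-06-23-17-51-03-767/output/model.tar.gz"
)
bucket_name, key = split_s3_uri(artifact_uri)
print(bucket_name)  # sagemaker-us-east-1-566373416292
print(key)  # RF-custom-sklearn-2023-06-23-17-51-03-767/output/model.tar.gz
```

The two parts can then be handed to an S3 client, for example boto3.client("s3").download_file(bucket_name, key, "model.tar.gz"), to fetch the archive locally.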

In [31]:
from sagemaker.sklearn.model import SKLearnModel
from time import gmtime, strftime

model_name = "Custom-sklearn-model-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
model = SKLearnModel(
    name =  model_name,
    model_data=artifact,
    role="arn:aws:iam::566373416292:role/service-role/AmazonSageMaker-ExecutionRole-20230120T164209",
    entry_point="script.py",
    framework_version=FRAMEWORK_VERSION,
)

In the code, a SageMaker SKLearnModel is created from the trained model artifacts of the previous training job. The SKLearnModel is not an endpoint itself; it bundles the artifact location, the inference script, and the framework version, and it is the object that gets deployed to an endpoint in a later step.

Here's what the code does:

1. model_name = "Custom-sklearn-model-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime()): This line generates a name for the SageMaker model. The name is a combination of the base name "Custom-sklearn-model-" and a UTC timestamp in the format "YYYY-MM-DD-HH-MM-SS". Because the timestamp has one-second resolution, names are distinct as long as two models are not created within the same second.
2. model = SKLearnModel(...): This line creates the SKLearnModel instance, representing the SageMaker model to be deployed. The instance is initialized with the following parameters:
   - name: The name of the SageMaker model. It is set to the value generated in the previous step (model_name).
   - model_data: The S3 path to the trained model artifacts. This is the value of the artifact variable, which points to the location of the serialized model in S3.
   - role: The Amazon Resource Name (ARN) of the IAM role that SageMaker will assume to perform tasks on your behalf during model deployment.
   - entry_point: The name of the script file that the Scikit-learn container runs. In this case, it is "script.py". For inference, this script must provide the model-loading function (model_fn).
   - framework_version: The version of Scikit-learn to be used during model deployment. The provided version is "0.23-1".

Once the model instance is created, you can deploy it as a SageMaker endpoint to make predictions on new data.

In [33]:
model_name

'Custom-sklearn-model-2023-06-23-18-01-16'

The variable model_name contains the name of the SageMaker model that will be created for deploying the trained Scikit-learn model. 
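A timestamp with one-second resolution can collide if two resources are created in quick succession. The sketch below shows one way to make the name more robust by adding a short random suffix and checking it before use. make_resource_name is a hypothetical helper, and the limits it checks (at most 63 characters, letters, digits and hyphens only) are stated here as assumptions to verify against the SageMaker API reference.

```python
import re
import uuid
from time import gmtime, strftime

# Assumed SageMaker naming limits; check the API reference before relying on them.
MAX_NAME_LENGTH = 63
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$")


def make_resource_name(prefix):
    """Build '<prefix>-<UTC timestamp>-<6 random hex chars>' and validate it."""
    timestamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
    suffix = uuid.uuid4().hex[:6]
    name = f"{prefix}-{timestamp}-{suffix}"
    if len(name) > MAX_NAME_LENGTH or not NAME_PATTERN.match(name):
        raise ValueError(f"Invalid resource name: {name!r}")
    return name


print(make_resource_name("Custom-sklearn-model"))
```

The same helper could be reused for the endpoint name, which follows the same prefix-plus-timestamp scheme in this notebook.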

In [34]:
# Endpoint deployment
endpoint_name = "Custom-sklearn-model-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("EndpointName={}".format(endpoint_name))

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
    endpoint_name=endpoint_name,
)

EndpointName=Custom-sklearn-model-2023-06-23-18-02-06


INFO:sagemaker:Creating model with name: Custom-sklearn-model-2023-06-23-18-01-16

INFO:sagemaker:Creating endpoint-config with name Custom-sklearn-model-2023-06-23-18-02-06

INFO:sagemaker:Creating endpoint with name Custom-sklearn-model-2023-06-23-18-02-06


The code provided deploys the trained Scikit-learn model as a SageMaker endpoint using the model.deploy() method. This allows you to create a live endpoint to make predictions on new data.

Here's what the code does:

- endpoint_name = "Custom-sklearn-model-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime()): This line generates a name for the SageMaker endpoint. It follows the same format as the model name: the base name "Custom-sklearn-model-" followed by a timestamp in the format "YYYY-MM-DD-HH-MM-SS". The timestamp keeps endpoint names distinct as long as they are not generated within the same second.
- print("EndpointName={}".format(endpoint_name)): This line prints the name of the endpoint that will be created. The printed value will be in the format of "EndpointName=Custom-sklearn-model-YYYY-MM-DD-HH-MM-SS".
- predictor = model.deploy(..., endpoint_name=endpoint_name): This line deploys the trained Scikit-learn model to a SageMaker endpoint. The deploy() method takes the following parameters:
  - initial_instance_count: The number of instances to be launched for the endpoint. In this case, a single instance (initial_instance_count=1) will be launched.
  - instance_type: The type of ML instance to be used for the endpoint. In this example, "ml.m4.xlarge" instances will be used.
  - endpoint_name: The name for the endpoint. It is set to the value generated in the first step (endpoint_name). The endpoint will be accessible using this name.

After executing this code, the Scikit-learn model will be deployed as a SageMaker endpoint with the specified instance configuration, and you can use the predictor object to make real-time predictions by sending new data to the endpoint.

In [None]:
endpoint_name

The variable endpoint_name contains the name of the SageMaker endpoint that was created for deploying the trained Scikit-learn model. This endpoint serves as an API endpoint that you can use to make real-time predictions on new data.

In [None]:
testX[features][0:2].values.tolist()

The code provided extracts a subset of data from the 'testX' DataFrame, specifically the first two rows, and converts it into a list. The selected subset includes only the columns specified in the 'features' list.

Here's what the code does:

- testX[features]: This part of the code selects the columns specified in the 'features' list from the 'testX' DataFrame. It effectively filters the 'testX' DataFrame to retain only the specified columns.
- [0:2]: This slice notation selects the first two rows of the filtered DataFrame, i.e., the first and second row.
- values.tolist(): The 'values' attribute of the DataFrame is used to get the values of the selected subset as a NumPy array. The 'tolist()' method then converts this NumPy array into a Python list.

The resulting output will be a Python list containing the selected data from the 'testX' DataFrame, limited to the first two rows and only the columns specified in the 'features' list.
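The shape of that list is easiest to see on a tiny example. The sketch below uses a made-up three-row DataFrame (testX_demo and features_demo are illustrative stand-ins, not the notebook's testX and features) and applies the same expression.

```python
import pandas as pd

# Illustrative stand-in for testX; the real DataFrame has many more columns.
testX_demo = pd.DataFrame(
    {
        "battery_power": [842, 1021, 563],
        "clock_speed": [2.2, 0.5, 0.5],
        "price_range": [1, 2, 2],
    }
)
features_demo = ["battery_power", "clock_speed"]

# Select the feature columns, keep the first two rows, convert to nested lists.
rows = testX_demo[features_demo][0:2].values.tolist()
print(rows)  # [[842.0, 2.2], [1021.0, 0.5]]
```

Note that the integer column comes back as floats: .values builds a single NumPy array, so mixed integer and float columns are upcast to a common float type before tolist() is applied.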

In [None]:
print(predictor.predict(testX[features][0:2].values.tolist()))

The provided code uses the SageMaker predictor object to make predictions on the subset of data extracted from the 'testX' DataFrame, containing the first two rows and only the columns specified in the 'features' list.

Here's what the code does:

- testX[features][0:2].values.tolist(): This part of the code selects the subset of data from the 'testX' DataFrame, containing the first two rows and only the columns specified in the 'features' list. The data is extracted as a list of lists, where each sublist represents a row of data, and each element in the sublist represents the value of a specific feature for that row. The resulting format is similar to the one shown in the previous example.
- predictor.predict(...): The predict() method of the predictor object is called with the selected subset of data as input. This method sends the data to the deployed SageMaker endpoint and receives the model's predictions in return.
- print(...): The predictions obtained from the predict() method are printed to the console.
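The predictor object is a convenience wrapper; a client that does not have the SageMaker SDK installed could call the same endpoint through the Boto3 runtime client instead. The sketch below is an illustration under assumptions: rows_to_csv and invoke_with_csv are hypothetical helpers, and it assumes the container's default input and output handling accepts "text/csv" and can return "application/json". A custom input_fn or output_fn in script.py would change that.

```python
import json


def rows_to_csv(rows):
    """Serialize a list of rows (lists of numbers) as CSV text, one row per line."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def invoke_with_csv(runtime_client, endpoint_name, rows):
    """Send rows to an endpoint as CSV and decode a JSON response.

    runtime_client is expected to behave like boto3.client("sagemaker-runtime").
    (Hypothetical helper; content types are assumptions about the container.)
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Accept="application/json",
        Body=rows_to_csv(rows).encode("utf-8"),
    )
    return json.loads(response["Body"].read())


# Usage in the notebook would look roughly like:
# runtime = boto3.client("sagemaker-runtime")
# invoke_with_csv(runtime, endpoint_name, testX[features][0:2].values.tolist())
```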

In [None]:
sm_boto3.delete_endpoint(EndpointName=endpoint_name)

The code provided deletes the SageMaker endpoint with the name specified in the endpoint_name variable. It uses the delete_endpoint function from the sm_boto3 SageMaker client to initiate the deletion of the endpoint.

Here's what the code does:

- sm_boto3.delete_endpoint(EndpointName=endpoint_name): This line calls the delete_endpoint function of the SageMaker client (sm_boto3) and passes the name of the endpoint to be deleted as the EndpointName parameter. The value must match the name used when the endpoint was deployed.

After executing this code, the specified SageMaker endpoint will be deleted, and the associated compute resources used for hosting the endpoint will be terminated. It's a good practice to delete unused endpoints to avoid unnecessary charges and resource consumption.

Keep in mind that once the endpoint is deleted, you won't be able to use it for making predictions until you redeploy the model and create a new endpoint. If you plan to use the model again, you'll need to re-deploy it using the previous steps, starting from creating an SKLearnModel and deploying it as an endpoint.
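Deleting the endpoint stops the hosting instances, but the endpoint configuration and the model that deploy() created remain registered in the account until they are deleted separately. A minimal cleanup sketch is shown below; delete_endpoint_resources is a hypothetical helper that looks up both names from the endpoint before removing all three resources.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint together with its endpoint config and model(s).

    sm_client is expected to behave like boto3.client("sagemaker").
    Returns the config name and model names that were deleted.
    (Hypothetical helper; it performs no retries or error handling.)
    """
    # Look up the dependent resources before the endpoint disappears.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)
    return config_name, model_names


# Usage in the notebook would look roughly like:
# delete_endpoint_resources(sm_boto3, endpoint_name)
```

None of these calls touches the model.tar.gz artifact in S3, so the model can still be recreated from the artifact variable afterwards.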