### **What is SageMaker?**

Amazon SageMaker enables you to quickly build, train, and deploy machine learning
(ML) models at scale, without managing any infrastructure. It helps you focus on the ML
problem at hand and deploy high-quality models by removing the heavy lifting typically
involved in each step of the ML process. This book is a comprehensive guide for data
scientists and ML developers who want to learn the ins and outs of Amazon SageMaker.


### **Why Should You Use It?**

The complexity of a machine learning project in any enterprise grows with its scale. Machine learning projects comprise three key stages (build, train, and deploy), each of which can loop back into the others as the project progresses. As the amount of data increases, so does the complexity, and if you plan to build an ML model that truly works, your training data sets will tend to be on the larger side.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
train_dir = 'survey_lung_cancer.csv'

In [3]:
df = pd.read_csv(train_dir)

In [4]:
df.head()

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,M,69,1,2,2,1,1,2,1,2,2,2,2,2,2,YES
1,M,74,2,1,1,1,2,2,2,1,1,1,2,2,2,YES
2,F,59,1,1,1,2,1,2,1,2,1,2,2,1,2,NO
3,M,63,2,2,2,1,1,1,1,1,2,1,1,2,2,NO
4,F,63,1,2,1,1,1,1,1,2,1,2,2,1,1,NO


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   GENDER                 309 non-null    object
 1   AGE                    309 non-null    int64 
 2   SMOKING                309 non-null    int64 
 3   YELLOW_FINGERS         309 non-null    int64 
 4   ANXIETY                309 non-null    int64 
 5   PEER_PRESSURE          309 non-null    int64 
 6   CHRONIC DISEASE        309 non-null    int64 
 7   FATIGUE                309 non-null    int64 
 8   ALLERGY                309 non-null    int64 
 9   WHEEZING               309 non-null    int64 
 10  ALCOHOL CONSUMING      309 non-null    int64 
 11  COUGHING               309 non-null    int64 
 12  SHORTNESS OF BREATH    309 non-null    int64 
 13  SWALLOWING DIFFICULTY  309 non-null    int64 
 14  CHEST PAIN             309 non-null    int64 
 15  LUNG_CANCER            

In [6]:
df["GENDER"].unique()

array(['M', 'F'], dtype=object)

In [7]:
df["LUNG_CANCER"].unique()

array(['YES', 'NO'], dtype=object)

# Label Encoding

In [8]:
from sklearn.preprocessing import LabelEncoder

In [9]:
encoder = LabelEncoder()
df['GENDER'] = encoder.fit_transform(df['GENDER'])
df['LUNG_CANCER'] = encoder.fit_transform(df['LUNG_CANCER'])
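
`LabelEncoder` assigns integer codes in sorted order of the distinct values, so the cell above maps `F`/`M` to 0/1 and `NO`/`YES` to 0/1. A minimal plain-Python sketch of that mapping follows; the helper `label_codes` is hypothetical, written only for illustration, and is not part of scikit-learn.

```python
def label_codes(values):
    """Return a {label: code} mapping, assigning codes in sorted label order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


# Small illustrative samples using the same categories as the dataset.
gender_codes = label_codes(["M", "M", "F", "M", "F"])
target_codes = label_codes(["YES", "YES", "NO", "NO", "NO"])

print(gender_codes)
print(target_codes)
```

Because the same `encoder` object is refit on the second column, its fitted classes afterwards describe only `LUNG_CANCER`. If you need to invert both encodings later, keeping one encoder per column is the safer choice.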

# Splitting Train and Test Data

In [10]:
data = df.sample(frac=1, random_state=42)
data

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
288,0,61,1,2,2,2,1,1,2,2,1,2,1,2,1,1
9,1,53,2,2,2,2,2,1,2,1,2,1,1,2,2,1
57,1,73,1,1,1,1,2,1,2,1,2,2,2,2,2,1
60,1,70,1,2,1,2,2,2,2,2,2,2,1,2,2,1
25,1,65,1,2,2,1,1,2,1,2,2,2,2,2,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188,1,65,2,2,2,2,2,1,1,1,1,1,1,1,1,1
71,0,66,2,2,2,2,1,2,1,2,1,2,2,2,1,1
106,0,61,2,2,2,2,2,2,1,1,1,1,2,2,1,1
270,0,70,2,1,1,1,1,2,1,1,1,1,2,1,1,0


In [11]:
from sklearn.model_selection import train_test_split

In [12]:
train_data, val_data = train_test_split(data, test_size=0.1)

In [13]:
print(train_data.shape)
print(val_data.shape)

(278, 16)
(31, 16)


## Save Data to CSV Files

In [14]:
train_data.to_csv('sagemaker_training_dataset.csv', index=False)
val_data.to_csv('sagemaker_validation_dataset.csv', index=False)

# Store Data in S3

In [16]:
import sagemaker

print(sagemaker.__version__)

sess = sagemaker.Session()
bucket = sess.default_bucket()

prefix = 'fiverr2'
training_data_path = sess.upload_data(path='sagemaker_training_dataset.csv', key_prefix=prefix + '/input/training')
validation_data_path = sess.upload_data(path='sagemaker_validation_dataset.csv', key_prefix=prefix + '/input/validation')
output   = 's3://{}/{}/output/'.format(bucket,prefix)
print(training_data_path)
print(validation_data_path)
print(output)

2.107.0
s3://sagemaker-us-east-1-122514903081/fiverr2/input/training/sagemaker_training_dataset.csv
s3://sagemaker-us-east-1-122514903081/fiverr2/input/validation/sagemaker_validation_dataset.csv
s3://sagemaker-us-east-1-122514903081/fiverr2/output/
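
`upload_data` stores each file in the session's default bucket under `key_prefix/<file name>`, which is why the printed URIs have that shape. The sketch below reproduces the layout; `s3_uri` is a hypothetical helper for illustration, not a SageMaker SDK function.

```python
import posixpath


def s3_uri(bucket, key_prefix, local_path):
    """Compose the S3 URI of a file stored under key_prefix in bucket."""
    filename = posixpath.basename(local_path)
    return "s3://{}/{}".format(bucket, posixpath.join(key_prefix, filename))


bucket = "sagemaker-us-east-1-122514903081"  # bucket name copied from the output above
prefix = "fiverr2"

training_uri = s3_uri(bucket, prefix + "/input/training", "sagemaker_training_dataset.csv")
print(training_uri)
```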


# Training job

In [19]:
from sagemaker.xgboost import XGBoost

role = sagemaker.get_execution_role()
#role = 'arn:aws:iam::0123456789012:role/Sagemaker-fullaccess'
hyperparameters = {
    "max_depth": 4,
    "eta": 0.2,
    "gamma": 4,
    "min_child_weight": 6,
    "subsample": 0.7,
    "objective": "binary:logistic",
    "num_round": 100,
    "verbosity": 2,
    "n_estimators":500
}

xgb_estimator = XGBoost(entry_point='train.py', 
                          role=role,
                          instance_count=1, 
                          instance_type='ml.m5.xlarge',
                          framework_version='1.2-2',
                          py_version='py3',
                          script_mode=True,
                          output_path=output,
                          hyperparameters = hyperparameters)
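
The `train.py` entry point is not shown in this notebook. In script mode, SageMaker passes each hyperparameter to the script as a command-line argument (for example `--max_depth 4`) and exposes data and model locations through environment variables such as `SM_CHANNEL_TRAINING`, `SM_CHANNEL_VALIDATION`, and `SM_MODEL_DIR`. The following is a sketch of how such a script might read its inputs, under the assumption that it uses `argparse`; the function name `parse_args` and the local fallback paths are hypothetical.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and data locations the way a script-mode entry point might."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments, one per dictionary key.
    parser.add_argument("--max_depth", type=int, default=4)
    parser.add_argument("--eta", type=float, default=0.2)
    parser.add_argument("--gamma", type=float, default=4)
    parser.add_argument("--min_child_weight", type=float, default=6)
    parser.add_argument("--subsample", type=float, default=0.7)
    parser.add_argument("--objective", type=str, default="binary:logistic")
    parser.add_argument("--num_round", type=int, default=100)
    parser.add_argument("--verbosity", type=int, default=2)
    parser.add_argument("--n_estimators", type=int, default=500)
    # Locations come from environment variables set inside the training container;
    # the fallbacks are assumed local paths for running outside SageMaker.
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAINING", "data/training"))
    parser.add_argument("--validation", default=os.environ.get("SM_CHANNEL_VALIDATION", "data/validation"))
    # Ignore any extra arguments rather than failing on them.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--max_depth", "4", "--eta", "0.2", "--num_round", "100"])
print(args.max_depth, args.eta, args.num_round)
```

The channel names in the environment variables mirror the keys of the dictionary passed to `fit`, upper-cased.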

In [None]:
xgb_estimator.fit({'training':training_data_path, 'validation':validation_data_path})

2022-09-09 15:27:41 Starting - Starting the training job...
2022-09-09 15:28:05 Starting - Preparing the instances for training
ProfilerReport-1662737261: InProgress
......
2022-09-09 15:29:05 Downloading - Downloading input data...
2022-09-09 15:29:41 Training - Downloading the training image......
2022-09-09 15:30:31 Training - Training image download completed. Training in progress.
[2022-09-09 15:30:35.328 ip-10-0-153-19.ec2.internal:1 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2022-09-09:15:30:35:INFO] Imported framework sagemaker_xgboost_container.training
[2022-09-09:15:30:35:INFO] No GPUs detected (normal if no gpus installed)
[2022-09-09:15:30:35:INFO] Invoking user training script.
[2022-09-09:15:30:35:INFO] Module train does not provide a setup.py.
Generating setup.py
[2022-09-09:15:30:35:INFO] Generating setup.cfg
[2022-09-09:15:30:35:INFO] Generating MANIFEST.in
[2022-09-09:15:30:35:INF

# Endpoint

In [29]:
from time import strftime, gmtime
timestamp = strftime('%d-%H-%M-%S', gmtime())

endpoint_name = 'xgb-demo2-'+timestamp
print(endpoint_name)

xgb-demo2-07-14-41-42


In [30]:
xgb_predictor = xgb_estimator.deploy(endpoint_name=endpoint_name, 
                        initial_instance_count=1, 
                        instance_type='ml.t2.medium')

-----------!

# Prepare Test Data

In [33]:
# Load some samples, drop the label column, and serialize the rows to CSV
payload = val_data[:10].drop(['LUNG_CANCER'], axis=1)
payload = payload.to_csv(header=False,index=False).rstrip('\n')
print(payload)

1,62,2,1,2,1,1,2,1,2,2,2,2,1,2
1,56,2,1,1,1,1,2,2,2,2,2,2,1,2
1,46,1,2,2,1,1,1,1,1,1,1,1,2,2
1,56,2,2,2,2,1,2,2,1,2,2,2,1,2
0,72,1,2,2,2,2,2,1,1,1,1,1,1,1
1,58,2,2,2,2,2,1,1,1,2,1,1,2,2
0,65,1,2,2,2,2,1,2,2,2,2,2,2,1
1,68,2,1,2,1,1,2,1,1,1,1,1,1,1
1,72,2,2,2,2,2,2,1,2,2,2,2,2,2
1,51,1,2,1,1,2,2,2,2,2,2,2,1,2


# Prediction

In [34]:
xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()
xgb_predictor.deserializer = sagemaker.deserializers.CSVDeserializer()

response = xgb_predictor.predict(payload)

print(response)

[['0.93329525'], ['0.9758178'], ['0.6607869'], ['0.98851424'], ['0.8759119'], ['0.9522261'], ['0.98198503'], ['0.42776287'], ['0.9959466'], ['0.99193305']]
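
With the `binary:logistic` objective, each returned value is the predicted probability of the positive class (here `LUNG_CANCER` = 1, that is `YES`). A minimal sketch of turning the response into class labels, assuming a cut-off of 0.5 (the threshold is an assumption, not something the notebook specifies):

```python
# Response copied from the prediction output above.
response = [['0.93329525'], ['0.9758178'], ['0.6607869'], ['0.98851424'], ['0.8759119'],
            ['0.9522261'], ['0.98198503'], ['0.42776287'], ['0.9959466'], ['0.99193305']]

THRESHOLD = 0.5  # assumed decision cut-off; tune it for your use case

# The CSV deserializer returns strings, so convert them to floats first.
probabilities = [float(row[0]) for row in response]
labels = ["YES" if p >= THRESHOLD else "NO" for p in probabilities]

print(labels)
```

For a screening-style problem like this one, a lower threshold trades more false positives for fewer missed cases.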


# Delete Endpoint

In [35]:
xgb_predictor.delete_endpoint()