# AWS SageMaker Linear-Learner Algorithm for Breast Cancer Prediction

AWS SageMaker is a fully managed AWS service for building, training, and deploying machine learning (ML) models. 
It is built around three kinds of instances: Notebook Instances, Training Instances, and Endpoint Instances.
Notebook Instances host Jupyter notebooks for exploratory data analysis; they read raw data from Amazon S3 buckets and write the prepared training data back to S3. 
Training Instances train ML models on that data, using an instance type chosen to suit the data and model requirements. The trained models are then saved to S3 for later use. 
Endpoint Instances host the trained models behind endpoints that serve predictions on unseen data.

AWS SageMaker has many built-in ML algorithms, and Linear-Learner is one of them. 
It is a supervised learning algorithm for regression and classification problems. 
Here, Linear-Learner is used to classify whether a patient has breast cancer.
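
To make the idea of a linear binary classifier concrete, here is a minimal NumPy sketch. It is not the Linear-Learner implementation: it assumes a logistic link and a 0.5 decision threshold, and the weights, bias, and the function name `predict_linear` are made up for illustration.

```python
import numpy as np

def predict_linear(features, weights, bias, threshold=0.5):
    """Score rows with a logistic-linear model and threshold the scores.

    weights, bias, and threshold are illustrative assumptions, not values
    learned by SageMaker.
    """
    logits = features @ weights + bias          # linear map: w . x + b
    scores = 1.0 / (1.0 + np.exp(-logits))      # squash to the range (0, 1)
    labels = (scores > threshold).astype(int)   # 1 = malignant, 0 = benign
    return scores, labels

# Two hypothetical patients with two features each
X = np.array([[2.0, 1.0], [-1.0, -3.0]])
w = np.array([0.5, 0.25])
scores, labels = predict_linear(X, w, bias=0.0)
print(scores, labels)
```

The score plays the same role as the `score` field returned by the endpoint later in this notebook, and the thresholded value corresponds to `predicted_label`.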

The idea here is to explore and understand how AWS SageMaker handles data exploration, 
training, and deployment.

# Data sources:

https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)

In [None]:
import sagemaker
from sagemaker.session import Session
from sagemaker import get_execution_role
from sagemaker import LinearLearner
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [7]:
import pandas as pd
import boto3
from sagemaker import get_execution_role

# Get the SageMaker execution role
role = get_execution_role()

# Specify the S3 bucket and file path
bucket = 'sagemaker-17'
data_key = 'cancer-dir/data.csv'
data_location = f's3://{bucket}/{data_key}'

# Read the CSV file from S3 into a DataFrame
df = pd.read_csv(data_location)

# Display the first 5 rows of the DataFrame
df.head(5)

,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [8]:
#Shape of the dataframe df
print(df.shape)

(569, 32)


In [9]:
#Columns/Features present in df
df.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

In [12]:
# Convert 'diagnosis' to integers (1 for 'M' and 0 for 'B')
df['diagnosis'] = df['diagnosis'].apply(lambda x: 1 if x == 'M' else 0)

# Set 'id' column as the index
df = df.set_index('id')

# Display the first 5 rows of the DataFrame
df.head(5)


id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
842302,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
842517,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
84300903,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
84348301,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
84358402,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
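
The lambda above sends every value other than 'M' to 0, so a blank or misspelled label would silently count as benign. A stricter variant, sketched here on a small made-up frame (`toy` and `encode_diagnosis` are hypothetical names, not part of the original notebook), maps only the two known codes and raises on anything else.

```python
import pandas as pd

def encode_diagnosis(series):
    """Map 'M' -> 1 and 'B' -> 0, raising on any other value."""
    encoded = series.map({'M': 1, 'B': 0})  # unmapped values become NaN
    if encoded.isna().any():
        bad = series[encoded.isna()].unique().tolist()
        raise ValueError(f"Unexpected diagnosis values: {bad}")
    return encoded.astype(int)

# Made-up rows for illustration only
toy = pd.DataFrame({'id': [1, 2, 3], 'diagnosis': ['M', 'B', 'B']})
toy['diagnosis'] = encode_diagnosis(toy['diagnosis'])
print(toy['diagnosis'].tolist())
```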


In [13]:
#Concise Summary of df
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 569 entries, 842302 to 92751
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    int64  
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                 

In [14]:
# Number of Benign and Malignant tumor patients
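counts = df['diagnosis'].value_counts()
benign, malignant = counts.loc[0], counts.loc[1]  # look up by label (0 = benign, 1 = malignant), not by sort order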
print("Number of Benign patients:", benign)
print("Number of Malignant patients:", malignant)


Number of Benign patients: 357
Number of Malignant patients: 212
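
The two counts also give a baseline for judging the classifier later: on the full dataset, a model that always predicts the majority class would already be right for every benign patient. A short calculation from the counts printed above:

```python
benign_count = 357      # from the output above
malignant_count = 212   # from the output above
total = benign_count + malignant_count

# Share of malignant cases and the accuracy of always predicting "benign"
malignant_share = malignant_count / total
majority_baseline = max(benign_count, malignant_count) / total

print(f"Total patients: {total}")
print(f"Malignant share: {malignant_share:.3f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```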


# Splitting data into Train and Test sets:

The processed data is now split into train and test sets: the first 80% of the rows are used for training and the remaining 20% for testing the model.
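
The cell below takes the rows in file order. If the file happened to be sorted in some way, that could bias the split, so a shuffled split is a common alternative. The sketch below is one way to do it with NumPy; `shuffled_split` and the seed value are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np

def shuffled_split(data, train_fraction=0.80, seed=0):
    """Shuffle rows, then split into train and test arrays.

    Column 0 is assumed to be the label and the rest the features,
    matching the layout of df.values in this notebook.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(data.shape[0])
    num_train = int(data.shape[0] * train_fraction)
    train, test = data[order[:num_train]], data[order[num_train:]]
    return train[:, 1:], train[:, 0], test[:, 1:], test[:, 0]

# Stand-in array with the same shape as the notebook's data (569 x 31)
fake = np.random.default_rng(1).random((569, 31))
X_tr, y_tr, X_te, y_te = shuffled_split(fake)
print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)
```

Fixing the seed keeps the split reproducible between runs, which matters when comparing models.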

In [15]:
# Splitting the dataset into train and test sets
cancer_data = df.values
num_train = int(cancer_data.shape[0] * 0.80)  # 80% of the data should be training
X_train = cancer_data[:num_train, 1:]  # Feature vectors for training (all columns after the label)
y_train = cancer_data[:num_train, 0]  # Label or target vector for training
X_test = cancer_data[num_train:, 1:]  # Feature vectors for testing
y_test = cancer_data[num_train:, 0]  # Label or target vector for testing

print('Length of Train set:', len(X_train))
print('Length of Test set:', len(y_test))
print('First training sample:\n', X_train[0])
print('First Label/Target sample:', y_train[0])


Length of Train set: 455
Length of Test set: 114
First training sample:
 [1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]
First Label/Target sample: 1.0


# Create an output path for the trained models in s3:

This step defines the S3 output path where the trained models will be stored.

In [17]:
s3_prefix = 'cancer-detection'
output_path = 's3://{}/{}'.format(bucket, s3_prefix)

# Creating an Object of LinearLearner:

The next step is to create an object of the LinearLearner class to train the model on the training data. 
LinearLearner is a high-level estimator class used to launch the training job and the inference endpoint. 
One advantage of using it over the Python SDK’s generic Estimator class is that we need not specify the location of 
the algorithm container used for training.

In [18]:
import sagemaker

session = sagemaker.Session()
# Note: train_instance_count and train_instance_type are the SageMaker SDK v1
# names; SDK v2 renames them to instance_count and instance_type.
linear = sagemaker.LinearLearner(
    role=role,
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    predictor_type='binary_classifier',
    output_path=output_path,
    sagemaker_session=session,
    epochs=20
)


train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


# Wrap data into RecordSet objects:

This step converts the data from NumPy arrays into the RecordSet format using the 
record_set() function.

In [19]:
# Convert numpy arrays into RecordSet
training_recordset = linear.record_set(train=X_train.astype('float32'), labels=y_train.astype('float32'))
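
Because record_set() uploads the arrays to S3, it can save a round trip to check them locally first. The helper below is a hypothetical pre-flight check (it is not part of the SageMaker SDK); it assumes the features are a 2-D array and the labels a 1-D array of 0s and 1s, as in this notebook.

```python
import numpy as np

def check_binary_inputs(features, labels):
    """Return float32 copies of features and labels after basic sanity checks."""
    features = np.asarray(features, dtype='float32')
    labels = np.asarray(labels, dtype='float32')
    if features.ndim != 2:
        raise ValueError("features must be a 2-D array (rows x columns)")
    if labels.shape != (features.shape[0],):
        raise ValueError("labels must be 1-D with one entry per feature row")
    if not np.isin(labels, (0.0, 1.0)).all():
        raise ValueError("binary_classifier labels must be 0 or 1")
    if not np.isfinite(features).all():
        raise ValueError("features contain NaN or infinite values")
    return features, labels

# Tiny made-up example: two rows, two features
X_ok, y_ok = check_binary_inputs([[1.0, 2.0], [3.0, 4.0]], [1, 0])
print(X_ok.dtype, y_ok.shape)
```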


# Fit the model to the training data:

This step fits the model to the training data from the previous step.

In [20]:
linear.fit(training_recordset)

INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: linear-learner-2023-09-01-08-36-45-587


2023-09-01 08:36:45 Starting - Starting the training job...
2023-09-01 08:37:10 Starting - Preparing the instances for training.........
2023-09-01 08:38:42 Downloading - Downloading input data...
2023-09-01 08:39:12 Training - Downloading the training image.........
2023-09-01 08:40:23 Training - Training image download completed. Training in progress.
Docker entrypoint called with argument(s): train
Running default environment configuration script
[09/01/2023 08:40:39 INFO 140046298789696] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '10000000', 'init_method': 'uniform', 'init_scale': '0.07', 'init_sigma': '0.01', 'init_bias': '0.0', 'o

# Model Deployment to the endpoint:

Now that we have a trained model, we want to make predictions and evaluate model performance on our test set. For that we’ll need to deploy a model hosting endpoint that accepts inference requests, using the estimator API as follows.

In [21]:
linear_predictor = linear.deploy(
    initial_instance_count=1,
    instance_type='ml.t2.medium'
)


INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating model with name: linear-learner-2023-09-01-08-51-27-653
INFO:sagemaker:Creating endpoint-config with name linear-learner-2023-09-01-08-51-27-653
INFO:sagemaker:Creating endpoint with name linear-learner-2023-09-01-08-51-27-653


------------------!

# Testing with one sample test data:

Once the model has been deployed to an endpoint, it can be tested with data that the training process has not seen. 
Here, we test it with a single sample from the test set.


In [22]:
sample = X_test.astype('float32')
print(linear_predictor.predict(sample[0]))


[label {
  key: "score"
  value {
    float32_tensor {
      values: 0.616036654
    }
  }
}
label {
  key: "predicted_label"
  value {
    float32_tensor {
      values: 1
    }
  }
}
]


# Classification Metrics:

Having tested the model on a single sample, we now evaluate its performance on the complete test set. 
The function below takes the deployed predictor, the test features, the test labels, and a batch count as inputs. 
It returns classification metrics: precision, recall, and accuracy. 
The test data is sent to the endpoint in batches.
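
One detail worth knowing before reading the function: np.array_split takes the number of chunks, not the chunk size. A quick check on a stand-in array with the same number of rows as the test set (114) shows what that means:

```python
import numpy as np

rows = np.zeros((114, 30), dtype='float32')  # stand-in for X_test
chunks = np.array_split(rows, 10)            # 10 chunks, not chunks of 10 rows

# Chunk lengths differ by at most one row when the split is uneven
chunk_sizes = [len(chunk) for chunk in chunks]
print(len(chunks), chunk_sizes)
```

So with the default value of 10, the endpoint receives ten requests of 11 or 12 rows each.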

In [23]:
# Function to evaluate the deployed model on the entire test set
import numpy as np
import pandas as pd

def evaluate(predictor, test_features, test_labels, test_batch_size=10, verbose=True):
    """Compute recall, precision, and accuracy of a deployed binary classifier.

    Despite its name, test_batch_size is the number of batches the test set
    is split into (np.array_split takes a section count, not a batch size).
    """
    # Send the test set to the endpoint one batch at a time
    prediction_batches = [predictor.predict(batch) for batch in np.array_split(test_features.astype('float32'), test_batch_size)]

    # Extract the predicted label from every record in every batch
    predicted_labels = np.concatenate([np.array([x.label['predicted_label'].float32_tensor.values[0] for x in batch]) for batch in prediction_batches])

    # Confusion-matrix counts (labels are 0/1, so they can be combined logically)
    true_pos = np.logical_and(test_labels, predicted_labels).sum()
    false_pos = np.logical_and(1 - test_labels, predicted_labels).sum()
    true_neg = np.logical_and(1 - test_labels, 1 - predicted_labels).sum()
    false_neg = np.logical_and(test_labels, 1 - predicted_labels).sum()

    recall = true_pos / (true_pos + false_neg)
    precision = true_pos / (true_pos + false_pos)
    accuracy = (true_pos + true_neg) / (true_pos + false_pos + true_neg + false_neg)

    if verbose:
        print(pd.crosstab(test_labels, predicted_labels, rownames=['actual (row)'], colnames=['prediction (col)']))
        print("{:<11} {:.3f}".format('Recall:', recall))
        print("{:<11} {:.3f}".format('Precision:', precision))
        print("{:<11} {:.3f}".format('Accuracy:', accuracy))
        print()

    return {
        'tp': true_pos,
        'tn': true_neg,
        'fp': false_pos,
        'fn': false_neg,
        'precision': precision,
        'recall': recall,
        'accuracy': accuracy
    }


# Evaluating the trained model with test data:
This step evaluates the model by calling the evaluate() function on the X_test and y_test data.


In [24]:
#Model evaluation with test data
evaluate(linear_predictor, X_test.astype(float), y_test.astype(float))


prediction (col)  0.0  1.0
actual (row)              
0.0                86    2
1.0                 1   25
Recall:     0.962
Precision:  0.926
Accuracy:   0.974



{'tp': 25,
 'tn': 86,
 'fp': 2,
 'fn': 1,
 'precision': 0.9259259259259259,
 'recall': 0.9615384615384616,
 'accuracy': 0.9736842105263158}
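
The reported figures can be re-derived by hand from the four confusion-matrix counts above. The short check below does that; the F1 score is an extra metric not computed in the notebook, included here only as an illustration.

```python
tp, tn, fp, fn = 25, 86, 2, 1  # counts from the output above

precision = tp / (tp + fp)                   # 25 / 27
recall = tp / (tp + fn)                      # 25 / 26
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 111 / 114
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
print(f"F1:        {f1:.3f}")
```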

In [None]:
# Delete the endpoint so it stops incurring charges
linear_predictor.delete_endpoint()

# Conclusion:
    
The model was built and trained with the AWS Linear-Learner classification algorithm to predict breast cancer. 
On the 114-sample test set, the classifier reaches an accuracy of 97.37%, with a precision of 92.59% and a recall of 96.15%. 
This is good accuracy for Linear-Learner, which uses a linear model to map inputs to outputs. 
The accuracy might be improved further by trying more complex nonlinear AWS SageMaker algorithms.