## Boston House pricing Binary Classification problem

Case: Return YES if a new house is predicted to be worth more than $22,000, and NO otherwise.

1. Load dataset onto notebook instance from S3
2. Clean, transform, and prepare the dataset
3. Create and train linear learner model
4. Deploy the model into SageMaker hosting

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime
import io
import sagemaker.amazon.common as smac

import boto3
from sagemaker import get_execution_role
import sagemaker

import matplotlib.pyplot as plt
import seaborn as sns

### Step 1: Load the data from S3

In [None]:
role = get_execution_role()
bucket = 'boston-house-bucket'
sub_folder = 'boston-house-data'
data_key = 'boston_housing_raw.csv'
data_location = 's3://{}/{}/{}'.format(bucket, sub_folder, data_key)

df = pd.read_csv(data_location, low_memory = False)
df.head()

### Step 2: Clean, Transform and Prepare the dataset


See [Variable description](http://lib.stat.cmu.edu/datasets/boston)

1. Convert the CHAS and RAD variables to categorical and one-hot encode them
2. Min-max scale the data so that every value falls in the 0 to 1 range
3. Find the scaled value for $22,000 (MEDV is recorded in thousands of dollars)
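
Min-max scaling maps a value v to (v - min) / (max - min) using the minimum and maximum of its column. As a minimal sketch of step 3, the helper below (`minmax_scale_value` is a hypothetical name, not part of the notebook) scales a single price. The bounds 5.0 and 50.0 are assumed for MEDV here and should be checked against `df['MEDV'].describe()`.

```python
def minmax_scale_value(value, col_min, col_max):
    """Min-max scale one value into the [0, 1] range of its column."""
    if col_max == col_min:
        raise ValueError("column is constant; min-max scaling is undefined")
    return (value - col_min) / (col_max - col_min)


# Assumed MEDV bounds, in thousands of dollars; verify against the real data.
medv_min, medv_max = 5.0, 50.0

# $22,000 expressed in the same units as MEDV.
threshold_scaled = minmax_scale_value(22.0, medv_min, medv_max)
print(round(threshold_scaled, 6))  # 0.377778
```

Under these assumed bounds the result matches the 0.377777 threshold used when labeling MEDV later in the notebook.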

In [None]:
#check if there are any missing values
df.isnull().values.any()

In [None]:
#drop unrequired columns
df.drop(columns = ['Unnamed: 0'], inplace = True )


In [None]:
#convert CHAS, RAD attributes to categorical
df['CHAS'] = df['CHAS'].astype('category')
df['RAD'] = df['RAD'].astype('category')

#one-hot encode CHAS, RAD attributes
df = pd.get_dummies(df, columns=['CHAS', 'RAD'])

df.head()

In [None]:
df.shape

In [None]:
#scale the data to evenly distribute between 0 and 1
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
data_scaled = scaler.fit_transform(df)


In [None]:
#minmaxscaler converts dataframe to ndarray, convert it back to data frame
df_scaled = pd.DataFrame(data = data_scaled, columns = list(df) )
df_scaled.head()

In [None]:
df_scaled['MEDV'].head()

In [None]:
df_scaled.corr()

In [None]:
df_scaled['MEDV'].describe()

In [None]:
df['MEDV'].describe()

In [None]:
#this calculation gives the min-max scaled value of a single number (i) if it lies within the MEDV range
x = df['MEDV']
i = 22

#compare i against the MEDV bounds, not against the number of rows
if x.min() <= i <= x.max():
    i_scl = (i - x.min()) / (x.max() - x.min())
    print("Scaled value of i:", i_scl)
else:
    print('Value not in range')

### Step 3: Create and Train Linear Learner model

1. Randomize data
2. Split data into train, validate and test sets
3. Label each MEDV data point 1 (yes) if it is above USD 22000 (scaled value 0.377), and 0 (no) otherwise.
4. Convert data sets into recordIO format and upload into S3
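
The random split and the labeling step can be sketched without any AWS dependencies. The example below is only an illustration: the row count and the scaled MEDV values are made up, and it uses the standard `random` module rather than NumPy. It draws one uniform number per row, assigns rows to three disjoint bands of the unit interval, and turns scaled MEDV values into 0/1 labels with the 0.377777 threshold.

```python
import random

random.seed(0)  # fixed seed so the split is reproducible

n_rows = 10  # hypothetical row count, for illustration only
draws = [random.random() for _ in range(n_rows)]

# Three disjoint bands of [0, 1): roughly 80% / 10% / 10% in expectation.
train_idx = [i for i, r in enumerate(draws) if r < 0.8]
val_idx = [i for i, r in enumerate(draws) if 0.8 <= r < 0.9]
test_idx = [i for i, r in enumerate(draws) if r >= 0.9]

# Hypothetical scaled MEDV values; 1 means "above the scaled threshold".
threshold = 0.377777
scaled_medv = [0.10, 0.42, 0.377777, 0.90]
labels = [int(v > threshold) for v in scaled_medv]
print(labels)  # [0, 1, 0, 1]
```

Because the three bands do not overlap, every row lands in exactly one of the train, validation, or test sets.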

#### Training Job
1. Import the Amazon SageMaker Python SDK and get the linear-learner container
2. Create a training job name (must be unique for every run) and an output location
3. Set up required parameters for linear learner algorithm. See [details](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-train-model.html).
4. Set up hyperparameters. Use SageMaker hyperparameter tuning jobs for optimized values.
5. Pass the training, validation channels for input
6. To start model training, call the estimator's fit method. This method calls the CreateTrainingJob API.

In [None]:
#randomize data and split data into train, validation and test sets
np.random.seed(0)

rand_split = np.random.rand(len(df_scaled))

train_list = rand_split < 0.8
val_list = (rand_split >= 0.8) & (rand_split < 0.9)
test_list = rand_split >= 0.9

#datasets for training, validating and testing
data_train = df_scaled[train_list]
data_val = df_scaled[val_list]
data_test = df_scaled[test_list]

#convert data sets into numpy.ndarray. X is features and Y is labels

train_X = data_train.drop(columns = 'MEDV').to_numpy() 
train_Y = ((data_train['MEDV'] > 0.377777)+0).to_numpy() #values above 0.37 will return as 1, and below will be as 0.

val_X = data_val.drop(columns = 'MEDV').to_numpy()
val_Y = ((data_val['MEDV'] > 0.377777)+0).to_numpy()

test_X = data_test.drop(columns = 'MEDV').to_numpy()
test_Y = ((data_test['MEDV'] > 0.377777)+0).to_numpy()

In [None]:
#Create recordIO protobuf type float32 for training data
train_file = 'boston_housing_train_recordIO_protobuf.data'

f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, train_X.astype('float32'),
                                train_Y.astype('float32'))
f.seek(0)

#Upload to S3
boto3.Session().resource('s3').Bucket(bucket).Object('linearlearner_train/{}'.format(train_file)).upload_fileobj(f)

#location of the training data in S3
train_channel = 's3://{}/linearlearner_train/{}'.format(bucket,train_file)

In [None]:
#create recordIO protobuf type float32 for validation data
validation_file = 'boston_housing_validation_recordIO_protobuf.data'

f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, val_X.astype('float32'),
                                val_Y.astype('float32'))
f.seek(0)

#upload to S3
boto3.Session().resource('s3').Bucket(bucket).Object('linearlearner_validation/{}'.format(validation_file)).upload_fileobj(f)

#location of the validation data in S3
validation_channel = 's3://{}/linearlearner_validation/{}'.format(bucket,validation_file)

In [None]:
# Import the Amazon SageMaker Python SDK and get the linear-learner container.

import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(boto3.Session().region_name, 'linear-learner',"1")

In [None]:
#create a training job name
job_name = 'bh-linear-learner-job-{}'.format(datetime.now().strftime("%Y%m%d%H%M%S"))
print('job name: {}'.format(job_name))

#output path of the model artifacts
output_location = 's3://{}/linearlearner-output'.format(bucket)


In [None]:
print('The feature_dim hyperparameter needs to be set to {}.'.format(data_train.shape[1]-1)) 

In [None]:
#session object manages interactions with the necessary AWS services
sess = sagemaker.Session()

#set up linear algorithm from ECR
linear = sagemaker.estimator.Estimator(container,
                                      role,
                                      train_instance_count =1,
                                      train_instance_type = 'ml.c4.xlarge',
                                      output_path=output_location,
                                      sagemaker_session=sess,
                                      input_mode='Pipe')

#set up hyperparameters.
linear.set_hyperparameters(feature_dim = 22,
                          predictor_type = 'binary_classifier',
                          l1 = 0.0034313572059783636,
                          learning_rate = 0.022529489694206588,
                          mini_batch_size = 1,
                          use_bias = 'true',
                          wd = 0.08134206001008425)

#launch the training job. This method calls the CreateTrainingJob API
data_channels = {
    'train': train_channel,
    'validation': validation_channel
}
linear.fit(data_channels, job_name=job_name)

In [None]:
#SageMaker writes the model artifact under <output_path>/<job_name>/output/
print('location of the model: {}/{}/output/model.tar.gz'.format(output_location, job_name))

### Step 4: Deploy the model into SageMaker hosting

Call the estimator's `.deploy` method to deploy the model to SageMaker hosting.

In [None]:
binaryclass_predictor = linear.deploy(initial_instance_count =1, instance_type = 'ml.m4.xlarge')

After deploying the model: <br>
1. Define the confusion matrix plotting function
2. Run batch predictions on the test data
3. Plot the confusion matrix
4. Print the evaluation metrics
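
To make the evaluation metrics concrete, here is a small sketch that counts the four confusion-matrix cells by hand and derives accuracy, precision, recall, and F1 for the positive class (label 1, "YES"). The labels and predictions are made up for illustration, and `binary_confusion_counts` is a hypothetical helper. The notebook itself relies on scikit-learn and reports macro-averaged scores, so its numbers are computed differently.

```python
def binary_confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for 0/1 labels, with 1 as the positive class."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp


# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = binary_confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # share of predicted YES that were truly YES
recall = tp / (tp + fn)     # share of true YES that were predicted YES
f1 = 2 * precision * recall / (precision + recall)

print((tn, fp, fn, tp))  # (2, 1, 1, 2)
```

In a real run the denominators can be zero (for example, when the model never predicts YES), so production code should guard those divisions.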

In [None]:

from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None, 
                          cmap=None):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # Pick a default title when none is given
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
#         print("Normalized confusion matrix")
#     else:
#         print('Confusion matrix, without normalization')

#     print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='Actual',
           xlabel='Predicted')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax


np.set_printoptions(precision=2)

In [None]:
from sagemaker.predictor import json_deserializer, csv_serializer

binaryclass_predictor.content_type = 'text/csv'
binaryclass_predictor.serializer = csv_serializer
binaryclass_predictor.deserializer = json_deserializer

predictions = []
results = binaryclass_predictor.predict(test_X)
predictions += [r['predicted_label'] for r in results['predictions']]
predictions = np.array(predictions)

In [None]:
%matplotlib inline
sns.set_context("paper", font_scale=1.4)

y_test = (data_test['MEDV']> 0.377777)+0
y_pred = predictions

class_names = np.array(['NO', 'YES'])  # index matches the label: 0 = NO, 1 = YES

# Plot non-normalized confusion matrix
plot_confusion_matrix(y_test, y_pred, classes=class_names,
                      title='Confusion matrix',
                      cmap=plt.cm.Blues)
plt.grid(False)
plt.show()

In [None]:
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score

y_test = (data_test['MEDV']> 0.377777)+0
y_pred = predictions
scores = precision_recall_fscore_support(y_test, y_pred, average='macro',labels=np.unique(y_pred))
acc = accuracy_score(y_test, y_pred)
print('Accuracy:{}'.format(acc))
print('Precision:{}'.format(scores[0]))
print('Recall :{}'.format(scores[1]))
print('F1 score:{}'.format(scores[2]))