# Predicting Red Wine Quality

In this project, I'll look at a wine quality dataset and build a binary classification model that can identify
which physicochemical properties lead a wine to be classified as either a ‘good wine’ or a ‘bad
wine’. It is based on the study by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, "Modeling
wine preferences by data mining from physicochemical properties", Decision Support
Systems, Elsevier, 47(4):547-553, 2009.


### Labeled Data

The red wine quality dataset (P. Cortez et al. 2009) was downloaded from Kaggle. It has
11 features (physicochemical properties) and labels (quality ratings) for different red
wines from Portugal. The labels range from 0 (very bad) to 10 (excellent). There are 1599
samples. According to the repository page on Kaggle, the classes are ordered and not balanced
(e.g. there are many more normal wines than excellent or poor ones).

The features are:
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol

In this notebook, I'd like to train a model based on these features so that we can predict the quality of a red wine in the future.

### Binary Classification

Since we have true labels to aim for, we'll take a **supervised learning** approach and train a binary classifier to sort data into one of our two wine classes: good or bad. We'll train a model on training data and see how well it generalizes to some test data.

The notebook will be broken down into a few steps:
* Loading and exploring the data
* Splitting the data into train/test sets
* Defining and training a LinearLearner, binary classifier
* Making improvements on the model
* Evaluating and comparing model test performance

### Making Improvements

A lot of this notebook will focus on making improvements. Specifically, I'll address techniques for:

1. **Tuning a model's hyperparameters** and aiming for a specific metric, such as high recall or precision.
2. **Managing class imbalance**, since in this case there are many more normal wines than excellent or poor ones.

In the later part of this notebook, I will compare the performance of LinearLearner with that of a sklearn SVM model.

---

First, I will import the usual resources.

In [1]:
import io
import os
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 

import boto3
import sagemaker
from sagemaker import get_execution_role

%matplotlib inline

I'm storing my **SageMaker variables** in the next cell:
* sagemaker_session: The SageMaker session we'll use for training models.
* bucket: The name of the default S3 bucket that we'll use for data storage.
* role: The IAM role that defines our data and model permissions.

In [2]:
# sagemaker session, role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# S3 bucket name
bucket = sagemaker_session.default_bucket()


## Loading and Exploring the Data

Next, I am loading the red wine quality data, `red_winequality.csv`.

It's important to look at the distribution of the data, since this will inform how we develop a wine quality classifier. We'll want to know how many data points we have to work with, the number and type of features, and finally, the distribution of data over the classes (good or bad).


In [4]:
# read in the csv file
local_data = 'red_winequality.csv'

# print out some data
redwinequality_df = pd.read_csv(local_data)
print('Data shape (rows, cols): ', redwinequality_df.shape)
print()
redwinequality_df.head()

Data shape (rows, cols):  (1599, 12)



Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [5]:
# My goal is to create a binary classifier, so I'll need
# a binary class label that indicates whether a wine's quality is good (1) or not (0).
# The function below reads in a csv file and returns a transformed dataframe
def new_dataframe(local_data='red_winequality.csv'):
    '''Reads in a csv file which is assumed to have a `quality` column.
       This function converts the `quality` column values to two numerical values:
       scores from 7 to 10 become 1 (good wine) and scores from 0 to 6 become 0 (bad wine).
       :param local_data: Path to the red_winequality.csv file
       :return: A dataframe with binary quality values'''

    redwinequality_df = pd.read_csv(local_data)
    # quality >= 7 -> 1 (good), otherwise 0 (bad)
    redwinequality_df["quality"] = (redwinequality_df["quality"] >= 7).astype(int)
    return redwinequality_df
    

In [6]:
# informal testing, print out the results of a called function
# create new `transformed_df`
transformed_df = new_dataframe(local_data = 'red_winequality.csv')

# check work
# check that wines with an original quality of 7 or more have a class label = 1
transformed_df.head(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,0
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,0
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,0
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,0
5,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,0
6,7.9,0.6,0.06,1.6,0.069,15.0,59.0,0.9964,3.3,0.46,9.4,0
7,7.3,0.65,0.0,1.2,0.065,15.0,21.0,0.9946,3.39,0.47,10.0,1
8,7.8,0.58,0.02,2.0,0.073,9.0,18.0,0.9968,3.36,0.57,9.5,1
9,7.5,0.5,0.36,6.1,0.071,17.0,102.0,0.9978,3.35,0.8,10.5,0


### Calculate the percentage of good red wine in the data

To look at the distribution of this wine quality data over the two quality types, good and not good, I will create the function `goodwine_percentage` below. It counts the number of data points in each quality type and calculates the fraction of the data points that are good.

In [7]:
# Calculate the fraction of data points that are good wines
def goodwine_percentage(transformed_df):
    '''Calculate the fraction of all data points that have a 'quality' label of 1; good.
       :param transformed_df: Dataframe of all wine data points; has a column 'quality'
       :return: The fraction of good wine data points/all points
    '''
    # counts for all classes
    counts = transformed_df['quality'].value_counts()
    
    # get good and bad wine counts
    goodwine_cnts = counts[1]
    badwine_cnts = counts[0]
    
    # calculate the fraction of good wine data
    goodwine_percentage = goodwine_cnts/(goodwine_cnts+badwine_cnts)
    
    return goodwine_percentage


Testing out my code by calling my function and printing the result.

In [8]:
# call the function to calculate the good wine fraction
# (stored under a new name so the function itself is not overwritten)
goodwine_frac = goodwine_percentage(transformed_df)

print('Good redwine percentage = ', goodwine_frac)
print('Total # of good wine pts: ', goodwine_frac*transformed_df.shape[0])
print('Out of (total) pts: ', transformed_df.shape[0])


Good redwine percentage =  0.1357098186366479
Total # of good wine pts:  217.0
Out of (total) pts:  1599
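
With only about 13.6% good wines, accuracy on its own can look strong even for a model that learns nothing. As a small sketch using only the counts printed above (217 good wines out of 1599), the snippet below computes the accuracy of a hypothetical classifier that labels every wine as bad. The function name `majority_baseline_accuracy` is an illustration and not part of the notebook.

```python
def majority_baseline_accuracy(n_positive, n_total):
    """Accuracy of a classifier that always predicts the majority class."""
    n_negative = n_total - n_positive
    return max(n_positive, n_negative) / n_total

# counts taken from the cell output above
n_good, n_total = 217, 1599
baseline_acc = majority_baseline_accuracy(n_good, n_total)
print('Always-bad baseline accuracy: {:.3f}'.format(baseline_acc))
```

This baseline reaches roughly 86.4% accuracy while finding no good wines at all, which is why recall and precision are tracked alongside accuracy in the rest of the notebook.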


### Split into train/test datasets

In this project, I'll want to evaluate the performance of a red wine classifier; training it on some training data and testing it on *test data* that it did not see during the training process. So, I'll need to split the data into separate training and test sets.

The `train_test_split` function below does the following:
* Shuffles the wine data randomly
* Splits it into two sets according to the parameter `train_frac`
* Gets the train/test features and labels
* Returns the tuples: (train_features, train_labels), (test_features, test_labels)

In [9]:
# split into train/test
def train_test_split(transformed_df, train_frac= 0.7, seed=1):
    '''Shuffle the data and randomly split into train and test sets;
       separate the class labels (the last column in transformed_df) from the features.
       :param transformed_df: Dataframe of all red wine data; the last column holds the class labels
       :param train_frac: The decimal fraction of data that should be training data
       :param seed: Random seed for shuffling and reproducibility, default = 1
       :return: Two tuples (in order): (train_features, train_labels), (test_features, test_labels)
       '''
    
    # convert the df into a matrix for ease of splitting
    df_matrix = transformed_df.to_numpy()
    
    # shuffle the data
    np.random.seed(seed)
    np.random.shuffle(df_matrix)
    
    # split the data
    train_size = int(df_matrix.shape[0] * train_frac)
    # features are all but last column
    train_features  = df_matrix[:train_size, :-1]
    # class labels *are* last column
    train_labels = df_matrix[:train_size, -1]
    # test data
    test_features = df_matrix[train_size:, :-1]
    test_labels = df_matrix[train_size:, -1]
    
    return (train_features, train_labels), (test_features, test_labels)
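
A purely random shuffle does not guarantee that the train and test sets contain the same share of good wines. As an alternative to the function above (not what this notebook uses), a stratified split can be sketched as follows. The name `stratified_split` and the toy arrays are assumptions made for illustration only.

```python
import numpy as np

def stratified_split(features, labels, train_frac=0.7, seed=1):
    """Split so each class contributes about train_frac of its rows to the training set."""
    rng = np.random.RandomState(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        # shuffle the row indices of this class only
        cls_idx = np.flatnonzero(labels == cls)
        rng.shuffle(cls_idx)
        n_train = int(len(cls_idx) * train_frac)
        train_idx.extend(cls_idx[:n_train])
        test_idx.extend(cls_idx[n_train:])
    # shuffle again so the classes are interleaved within each set
    train_idx = rng.permutation(train_idx)
    test_idx = rng.permutation(test_idx)
    return (features[train_idx], labels[train_idx]), (features[test_idx], labels[test_idx])

# toy data: 100 rows with 20% positives (made-up values)
toy_X = np.arange(200, dtype='float32').reshape(100, 2)
toy_y = np.array([1] * 20 + [0] * 80, dtype='float32')
(tr_X, tr_y), (te_X, te_y) = stratified_split(toy_X, toy_y)
print('train positive share:', tr_y.mean())
print('test positive share: ', te_y.mean())
```

Both toy splits keep a positive share close to the original 20%, which is the property a stratified split is meant to provide.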


### Test Cell

In the cells below, I'm creating the train/test data and checking to see that result makes sense. The tests below test that the above function splits the data into the expected number of points and that the labels are indeed, class labels (0, 1).

In [10]:
# get train/test data
(train_features, train_labels), (test_features, test_labels) = train_test_split(transformed_df, train_frac=0.7)


In [11]:
# manual test

# for a split of 0.7:0.3 there should be ~2.33x as many training as test pts
print('Training data pts: ', len(train_features))
print('Test data pts: ', len(test_features))
print()

# take a look at first item and see that it aligns with first row of data
print('First item: \n', train_features[0])
print('Label: ', train_labels[0])
print()

# test split
assert len(train_features) <= 2.333*len(test_features), \
        'Unexpected number of train/test points for a train_frac=0.7'
# test labels
assert np.isin(train_labels, [0, 1]).all(), \
        'Train labels should be 0s or 1s.'
assert np.isin(test_labels, [0, 1]).all(), \
        'Test labels should be 0s or 1s.'
print('Tests passed!')

Training data pts:  1119
Test data pts:  480

First item: 
 [ 8.8     0.41    0.64    2.2     0.093   9.     42.      0.9986  3.54
  0.66   10.5   ]
Label:  0.0

Tests passed!


---
# Modeling

Now that my training data is prepared, it's time to define and train a model!

In this project, I'll first train the SageMaker built-in algorithm, LinearLearner, to separate the training data into the two red wine quality types: good or bad. I will later compare it to a custom sklearn model.

### Creating a LinearLearner Estimator

I'll first instantiate a LinearLearner estimator. I'll start with a simple model, utilizing default values where applicable. Later, I'll try to tune some specific hyperparameters and analyse their impact.

In [12]:
# import LinearLearner
from sagemaker import LinearLearner

# specify an output path
prefix = 'redwine'
output_path = 's3://{}/{}'.format(bucket, prefix)

# instantiate LinearLearner
linear = LinearLearner(role=role,
                       train_instance_count=1, 
                       train_instance_type='ml.c4.xlarge',
                       predictor_type='binary_classifier',
                       output_path=output_path,
                       sagemaker_session=sagemaker_session,
                       epochs=15)


### Converting data into a RecordSet format

Next, the data is prepared for the LinearLearner model by converting the train features and labels into numpy arrays of float values. To perform this task, I'll use the `record_set` function of the SageMaker LinearLearner. This function formats the data as a RecordSet and prepares it for training!

In [13]:
# convert features/labels to numpy
train_x_np = train_features.astype('float32')
train_y_np = train_labels.astype('float32')

# create RecordSet
formatted_train_data = linear.record_set(train_x_np, labels=train_y_np)

### Training the Estimator

After instantiating the estimator, it will be trained with a call to `.fit()`, passing in the formatted training data.

In [14]:
%%time 
# train the estimator on formatted training data
linear.fit(formatted_train_data)

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-06-26 09:03:16 Starting - Starting the training job...
2020-06-26 09:03:18 Starting - Launching requested ML instances......
2020-06-26 09:04:20 Starting - Preparing the instances for training......
2020-06-26 09:05:18 Downloading - Downloading input data...
2020-06-26 09:06:11 Training - Training image download completed. Training in progress.
2020-06-26 09:06:11 Uploading - Uploading generated training model
Docker entrypoint called with argument(s): train
Running default environment configuration script
[06/26/2020 09:06:09 INFO 140713795012416] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'feature_dim': u'auto', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_scaler': u'10000', u'_log_level': u'inf


2020-06-26 09:06:18 Completed - Training job completed
Training seconds: 60
Billable seconds: 60
CPU times: user 544 ms, sys: 36.6 ms, total: 580 ms
Wall time: 3min 41s


### Deploying the trained model

The trained model will be deployed to create a predictor. I'll use this to make predictions on the test data and evaluate the model.

In [15]:
%%time 
# deploy and create a predictor
linear_predictor = linear.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


---------------!CPU times: user 269 ms, sys: 18.2 ms, total: 287 ms
Wall time: 7min 32s


---
# Evaluating the Model

Now that the model is deployed, it is time to see how it performs when applied to the test data.

According to the deployed [predictor documentation](https://sagemaker.readthedocs.io/en/stable/linear_learner.html#sagemaker.LinearLearnerPredictor), this predictor expects an `ndarray` of input features and returns a list of Records.
> "The prediction is stored in the "predicted_label" key of the `Record.label` field."

Below, the model is tested on just one test point to see the resulting list.

In [16]:
# test one prediction
test_x_np = test_features.astype('float32')
result = linear_predictor.predict(test_x_np[0])

print(result)

[label {
  key: "predicted_label"
  value {
    float32_tensor {
      values: 0.0
    }
  }
}
label {
  key: "score"
  value {
    float32_tensor {
      values: 0.014677328988909721
    }
  }
}
]


### Helper function for evaluation


The function below takes in a deployed predictor, some test features and labels, and returns a dictionary of metrics: the counts of true/false positives and negatives as well as recall, precision, and accuracy.

In [17]:
# code to evaluate the endpoint on test data
# returns a variety of model metrics
def evaluate(predictor, test_features, test_labels, verbose=True):
    """
    Evaluate a model on a test set given the prediction endpoint.  
    Return binary classification metrics.
    :param predictor: A prediction endpoint
    :param test_features: Test features
    :param test_labels: Class labels for test data
    :param verbose: If True, prints a table of all performance metrics
    :return: A dictionary of performance metrics.
    """
    
    # split the test data set into 100 batches (the second argument of np.array_split
    # is the number of batches, not the batch size) and evaluate each using the prediction endpoint
    prediction_batches = [predictor.predict(batch) for batch in np.array_split(test_features, 100)]
    
    # LinearLearner produces a `predicted_label` for each data point in a batch
    # get the 'predicted_label' for every point in a batch
    test_preds = np.concatenate([np.array([x.label['predicted_label'].float32_tensor.values[0] for x in batch]) 
                                 for batch in prediction_batches])
    
    # calculate true positives, false positives, true negatives, false negatives
    tp = np.logical_and(test_labels, test_preds).sum()
    fp = np.logical_and(1-test_labels, test_preds).sum()
    tn = np.logical_and(1-test_labels, 1-test_preds).sum()
    fn = np.logical_and(test_labels, 1-test_preds).sum()
    
    # calculate binary classification metrics
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    
    # printing a table of metrics
    if verbose:
        print(pd.crosstab(test_labels, test_preds, rownames=['actual (row)'], colnames=['prediction (col)']))
        print("\n{:<11} {:.3f}".format('Recall:', recall))
        print("{:<11} {:.3f}".format('Precision:', precision))
        print("{:<11} {:.3f}".format('Accuracy:', accuracy))
        print()
        
    return {'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn, 
            'Precision': precision, 'Recall': recall, 'Accuracy': accuracy}
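
One detail of the batching in `evaluate` is easy to misread: `np.array_split(x, 100)` produces 100 batches, not batches of 100 rows. The short sketch below shows the difference on a stand-in array of 480 items (the size of the test set here). The fixed-size variant and its `batch_size` value are hypothetical and shown only for comparison.

```python
import numpy as np

# 480 stand-in rows, matching the size of the test set in this notebook
toy_test = np.arange(480)

# the second argument is the number of batches
batches = np.array_split(toy_test, 100)
batch_sizes = [len(b) for b in batches]
print('number of batches:', len(batches))
print('largest / smallest batch:', max(batch_sizes), '/', min(batch_sizes))

# for a fixed batch size instead, slice in steps (batch_size is a hypothetical choice)
batch_size = 100
fixed_batches = [toy_test[i:i + batch_size] for i in range(0, len(toy_test), batch_size)]
print('fixed-size batches:', [len(b) for b in fixed_batches])
```

With 480 test points, the first form sends 100 small requests of four or five rows each to the endpoint, while the second sends five larger ones.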


### Test Results

The cell below runs the `evaluate` function. 

The code assumes that you have a defined `predictor` and `test_features` and `test_labels` from previously-run cells.

In [18]:
print('Metrics for simple, LinearLearner.\n')

# get metrics for linear predictor
metrics = evaluate(linear_predictor, 
                   test_features.astype('float32'), 
                   test_labels, 
                   verbose=True) # verbose means we'll print out the metrics


Metrics for simple, LinearLearner.

prediction (col)  0.0  1.0
actual (row)              
0.0               403   12
1.0                43   22

Recall:     0.338
Precision:  0.647
Accuracy:   0.885



This simple model achieves a high accuracy of 88.5%. But it still misclassifies 43 good red wines as bad and 12 bad red wines as good, which results in much lower recall (33.8%) and precision (64.7%) scores.
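
To make the link between the crosstab and the reported scores explicit, the sketch below recomputes the three metrics from the four counts in the table above. The helper name `metrics_from_counts` is an assumption for illustration; it mirrors the formulas used in `evaluate`.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Recall, precision and accuracy from confusion-matrix counts."""
    return {
        'Recall': tp / (tp + fn),        # share of good wines that were found
        'Precision': tp / (tp + fp),     # share of predicted-good wines that are good
        'Accuracy': (tp + tn) / (tp + fp + fn + tn),
    }

# counts read from the crosstab above: rows are actual, columns are predicted
simple_metrics = metrics_from_counts(tp=22, fp=12, fn=43, tn=403)
for name, value in simple_metrics.items():
    print('{:<11} {:.3f}'.format(name + ':', value))
```

The 403 correctly rejected bad wines dominate the accuracy figure, while recall only looks at the 65 wines that are actually good.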

Next, the endpoint will be deleted and I'll look into ways to improve this model.

## Deleting the Endpoint

I'll use the function below to delete the prediction endpoint.

In [19]:
# Deletes a predictor.endpoint
def delete_endpoint(predictor):
    try:
        boto3.client('sagemaker').delete_endpoint(EndpointName=predictor.endpoint)
        print('Deleted {}'.format(predictor.endpoint))
    except Exception:
        print('Already deleted: {}'.format(predictor.endpoint))

In [20]:
# delete the predictor endpoint 
delete_endpoint(linear_predictor)

Deleted linear-learner-2020-06-26-09-03-16-614


---

# Model Improvements

As noted above, the simple LinearLearner model achieved a high accuracy but still classified some good and bad red wines incorrectly. Specifically, it misclassified 43 good red wines as bad and 12 bad red wines as good. The recall score is very low compared to the precision score. Which metric should be improved?

**1. Model optimization**
To answer that question, I consider the four potential users of this model: wine merchants, wine review magazines, wine producers, and wine drinkers (end customers). From their perspectives, as few bad wines as possible should be classified as good wines; that is, there should be very few false positives. In that respect, it will be useful to optimize a metric that helps decrease the number of these false positives.

In this notebook, we'll look at different cases for tuning a model and make an optimization decision, accordingly.

**2. Imbalanced training data**
* At the beginning of this notebook, it was estimated that only about 14% of the data was labeled as good. It is therefore also important for the model to be able to account for this imbalance in the training set.

So, let's address these issues in order; first, tuning our model and optimizing for a specific metric during training, and second, accounting for class imbalance in the training set. 


## Improvement: Model Tuning

Optimizing according to a specific metric is called **model tuning**, and SageMaker provides a number of ways to automatically tune a model.


### Create a LinearLearner and tune for higher recall 

To aim for a specific metric, LinearLearner offers the hyperparameter `binary_classifier_model_selection_criteria`, which is the model evaluation criteria for the training dataset. A reference to this parameter is in [LinearLearner's documentation](https://sagemaker.readthedocs.io/en/stable/linear_learner.html#sagemaker.LinearLearner). We'll also have to further specify the exact value we want to aim for; read more about the details of the parameters, [here](https://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html).

I will assume that performance on a training set will be within about 5% of the performance on a test set. So, for a recall of about 85%, I'll aim for a bit higher, 90%.
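
Conceptually, a criterion such as `precision_at_target_recall` amounts to choosing a decision threshold on the model's scores: among all thresholds whose recall reaches the target, keep the one with the highest precision. The sketch below illustrates that idea on made-up scores and labels. It is not LinearLearner's internal implementation, and the function name and toy data are assumptions.

```python
def precision_at_target_recall(scores, labels, target_recall):
    """Return (threshold, precision, recall) for the best threshold meeting the recall target."""
    n_pos = sum(labels)
    best = None
    for threshold in sorted(set(scores)):
        # predict good (1) whenever the score reaches the threshold
        preds = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
        recall = tp / n_pos
        precision = tp / (tp + fp)
        if recall >= target_recall and (best is None or precision > best[1]):
            best = (threshold, precision, recall)
    return best

# made-up scores and labels for ten wines
toy_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95]
toy_labels = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

best = precision_at_target_recall(toy_scores, toy_labels, target_recall=0.8)
print('threshold, precision, recall:', best)
```

Raising the recall target forces the threshold down, which admits more false positives; this is the precision-recall trade-off that the tuned models in this notebook run into.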

---
## Managing Class Imbalance

To account for class imbalance during training of a binary classifier, LinearLearner offers the hyperparameter, `positive_example_weight_mult`, which is the weight assigned to good (1) examples when training a binary classifier. The weight of bad examples (0) is fixed at 1. From the [hyperparameter documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html) on `positive_example_weight_mult`, it reads:
> "If you want the algorithm to choose a weight so that errors in classifying negative vs. positive examples have equal impact on training loss, specify `balanced`." 

I will choose `balanced`, since I have no prior estimate of the weight to assign to that parameter.
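
To give a feel for what a balancing weight does, the sketch below uses one common convention, weight = (number of negatives) / (number of positives), together with a weighted cross-entropy loss. Whether LinearLearner derives its `balanced` weight in exactly this way is an assumption here; the counts are the dataset-wide figures from earlier in this notebook (217 good wines out of 1599), and both function names are illustrative.

```python
import math

def balanced_positive_weight(n_positive, n_negative):
    """Weight that makes the positive class count as much as the negative class in total."""
    return n_negative / n_positive

def weighted_log_loss(y_true, p_pred, pos_weight):
    """Mean binary cross-entropy where positive examples are scaled by pos_weight."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(y_true)

pos_weight = balanced_positive_weight(n_positive=217, n_negative=1599 - 217)
print('Illustrative positive weight: {:.2f}'.format(pos_weight))

# the same confident mistake costs more on a good wine than on a bad one
loss_missed_good = weighted_log_loss([1], [0.1], pos_weight)
loss_missed_bad = weighted_log_loss([0], [0.9], pos_weight)
print('loss for a missed good wine: {:.2f}'.format(loss_missed_good))
print('loss for a misjudged bad wine: {:.2f}'.format(loss_missed_bad))
```

Under this convention, a missed good wine is penalized about six times as heavily as an equally confident mistake on a bad wine, which pushes the model toward higher recall.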

In [21]:
# instantiate a LinearLearner
# tune the model for a higher recall
linear_recall = LinearLearner(role=role,
                              train_instance_count=1, 
                              train_instance_type='ml.c4.xlarge',
                              predictor_type='binary_classifier',
                              output_path=output_path,
                              sagemaker_session=sagemaker_session,
                              epochs=15,
                              binary_classifier_model_selection_criteria='precision_at_target_recall', # Target recall score
                              target_recall=0.9,
                              positive_example_weight_mult='balanced')


### Train the tuned estimator

Fit the new, tuned estimator on the formatted training data.

In [22]:
%%time 
# train the estimator on formatted training data
linear_recall.fit(formatted_train_data)

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-06-26 09:14:33 Starting - Starting the training job...
2020-06-26 09:14:35 Starting - Launching requested ML instances.........
2020-06-26 09:16:05 Starting - Preparing the instances for training...
2020-06-26 09:16:59 Downloading - Downloading input data......
2020-06-26 09:18:03 Training - Training image download completed. Training in progress.
2020-06-26 09:18:03 Uploading - Uploading generated training model
2020-06-26 09:18:03 Completed - Training job completed
Docker entrypoint called with argument(s): train
Running default environment configuration script
[06/26/2020 09:17:52 INFO 140253805389632] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'feature_dim': u'auto', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'tru

#metrics {"Metrics": {"train_binary_classification_weighted_cross_entropy_objective": {"count": 1, "max": 1.199740966796875, "sum": 1.199740966796875, "min": 1.199740966796875}}, "EndTime": 1593163073.31847, "Dimensions": {"model": 31, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 7}, "StartTime": 1593163073.318453}
[06/26/2020 09:17:53 INFO 140253805389632] #quality_metric: host=algo-1, epoch=7, train binary_classification_weighted_cross_entropy_objective <loss>=1.12743139648
[06/26/2020 09:17:53 INFO 140253805389632] #early_stopping_criteria_metric: host=algo-1, epoch=7, criteria=binary_classification_weighted_cross_entropy_objective, value=0.77024597168
[06/26/2020 09:17:53 INFO 140253805389632] Epoch 7: Loss improved. Updating best model
[06/26/2020 09:17:53 INFO 140253805389632] Saving model for epoch: 7
[06/26/2020 09:17:53 INFO 140253805389632] Saved checkpoint to "/tmp/tmp1uoHGm/mx-mod-0000.

Training seconds: 64
Billable seconds: 64
CPU times: user 578 ms, sys: 39 ms, total: 617 ms
Wall time: 3min 41s


### Deploying and evaluating the tuned estimator

Below, the tuned predictor is deployed and evaluated.

In [23]:
%%time 
# deploy and create a predictor
recall_predictor = linear_recall.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


---------------!CPU times: user 262 ms, sys: 23.8 ms, total: 286 ms
Wall time: 7min 31s


In [24]:
print('Metrics for tuned (recall), LinearLearner.\n')

# get metrics for tuned predictor
metrics = evaluate(recall_predictor, 
                   test_features.astype('float32'), 
                   test_labels, 
                   verbose=True)

Metrics for tuned (recall), LinearLearner.

prediction (col)  0.0  1.0
actual (row)              
0.0               272  143
1.0                 6   59

Recall:     0.908
Precision:  0.292
Accuracy:   0.690



## Delete the endpoint 

As always, below, I'm using the `delete_endpoint` helper function defined earlier to delete the endpoint.

In [25]:
# delete the predictor endpoint 
delete_endpoint(recall_predictor)

Deleted linear-learner-2020-06-26-09-14-33-264


---
## A Model Optimizing for F1 to Manage Precision-Recall Trade-off



In [26]:
# instantiate a LinearLearner

# include params for tuning for F1
# *and* account for class imbalance in training data
linear_balanced_f1 = LinearLearner(role=role,
                                train_instance_count=1, 
                                train_instance_type='ml.c4.xlarge',
                                predictor_type='binary_classifier',
                                output_path=output_path,
                                sagemaker_session=sagemaker_session,
                                epochs=15,
                                binary_classifier_model_selection_criteria='f_beta', # target F1
                                
                                positive_example_weight_mult='balanced')


### Training the balanced estimator

Fit the new, balanced estimator on the formatted training data.

In [27]:
%%time 
# train the estimator on formatted training data
linear_balanced_f1.fit(formatted_train_data)

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-06-26 09:25:49 Starting - Starting the training job...
2020-06-26 09:25:51 Starting - Launching requested ML instances......
2020-06-26 09:26:55 Starting - Preparing the instances for training......
2020-06-26 09:28:13 Downloading - Downloading input data...
2020-06-26 09:28:44 Training - Downloading the training image..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[06/26/2020 09:29:01 INFO 140203502479168] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'feature_dim': u'auto', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_scaler': u'10000', u'_log_level': u'info', u'quantile': u'0.5', u'bias_lr_mult': u'auto', u'lr_scheduler_step': u'auto', u'init_me


2020-06-26 09:29:11 Uploading - Uploading generated training model
2020-06-26 09:29:11 Completed - Training job completed
Training seconds: 58
Billable seconds: 58
CPU times: user 531 ms, sys: 13 ms, total: 544 ms
Wall time: 3min 41s


### Deploy and evaluate the balanced estimator

Deploy the balanced predictor and evaluate it.

In [28]:
%%time 
# deploy and create a predictor
balanced_f1_predictor = linear_balanced_f1.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


-----------------!CPU times: user 318 ms, sys: 7.03 ms, total: 325 ms
Wall time: 8min 32s


In [29]:
print('Metrics for balanced F1, LinearLearner.\n')

# get metrics for balanced predictor
metrics = evaluate(balanced_f1_predictor, 
                   test_features.astype('float32'), 
                   test_labels, 
                   verbose=True)

Metrics for balanced F1, LinearLearner.

prediction (col)  0.0  1.0
actual (row)              
0.0               375   40
1.0                28   37

Recall:     0.569
Precision:  0.481
Accuracy:   0.858
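
Since this model was selected on F1, it helps to compare F1 across the three models evaluated so far. The sketch below recomputes it from the (TP, FP, FN) counts in the crosstabs printed in this notebook. The helper `f1_score` is written out here for illustration and is not the sklearn function of the same name.

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

# (TP, FP, FN) read from the three crosstabs printed in this notebook
results = {
    'simple': (22, 12, 43),
    'tuned (recall)': (59, 143, 6),
    'balanced F1': (37, 40, 28),
}
f1_by_model = {name: f1_score(*counts) for name, counts in results.items()}
for name, f1 in f1_by_model.items():
    print('{:<15} F1 = {:.3f}'.format(name, f1))
```

On these test counts, the F1-selected model has the highest F1 (about 0.52, versus about 0.44 for each of the other two), at the cost of lower recall than the recall-tuned model and lower precision than the simple one.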



## Delete the endpoint 

In [30]:
# delete the predictor endpoint 
delete_endpoint(balanced_f1_predictor)

Deleted linear-learner-2020-06-26-09-25-49-477


## A model targeting high precision

Target a precision of 90% and compare with the previous results.

In [31]:
%%time
# instantiate and train a LinearLearner

# include params for tuning for higher precision
# *and* account for class imbalance in training data
linear_precision = LinearLearner(role=role,
                                train_instance_count=1, 
                                train_instance_type='ml.c4.xlarge',
                                predictor_type='binary_classifier',
                                output_path=output_path,
                                sagemaker_session=sagemaker_session,
                                epochs=15,
                                binary_classifier_model_selection_criteria='recall_at_target_precision',
                                target_precision=0.9,
                                positive_example_weight_mult='balanced')


# train the estimator on formatted training data
linear_precision.fit(formatted_train_data)

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-06-26 09:38:20 Starting - Starting the training job...
2020-06-26 09:38:22 Starting - Launching requested ML instances......
2020-06-26 09:39:25 Starting - Preparing the instances for training.........
2020-06-26 09:41:08 Downloading - Downloading input data
2020-06-26 09:41:08 Training - Downloading the training image..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[06/26/2020 09:41:25 INFO 139704805508928] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'feature_dim': u'auto', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_scaler': u'10000', u'_log_level': u'info', u'quantile': u'0.5', u'bias_lr_mult': u'auto', u'lr_scheduler_step': u'auto', u'init_me


2020-06-26 09:41:34 Uploading - Uploading generated training model
2020-06-26 09:41:34 Completed - Training job completed
Training seconds: 38
Billable seconds: 38
CPU times: user 546 ms, sys: 28.4 ms, total: 574 ms
Wall time: 3min 41s
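
The `positive_example_weight_mult='balanced'` setting used above asks LinearLearner to weight positive examples so that errors on the two classes have a comparable impact on the training loss. The algorithm's internal computation is not shown in this notebook, so the snippet below is only a minimal sketch of one simple weight with that property: the ratio of negative to positive examples, with negatives kept at weight 1. The function name `balanced_positive_weight` is hypothetical, and the label counts are borrowed from the test set for illustration.

```python
def balanced_positive_weight(labels):
    """Return a weight for positive examples that equalizes total class weight.

    Negative examples keep weight 1, so the positive weight is n_neg / n_pos.
    This is an illustrative assumption, not LinearLearner's internal code.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive examples in labels")
    return n_neg / n_pos


# Illustrative labels with the same class counts as the test set (415 vs. 65)
labels = [0] * 415 + [1] * 65
weight = balanced_positive_weight(labels)
print(round(weight, 3))
```

With this weight, the 65 positive examples carry the same total weight as the 415 negative ones.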


This model trains for a fixed precision of 90%, and, under that constraint, tries to get as high a recall as possible.
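
To make the idea of "recall at a target precision" concrete, here is a minimal sketch in plain Python. It is not LinearLearner's implementation; it only illustrates the selection rule: among all score thresholds whose precision reaches the target, keep the one with the highest recall. The function name and the toy scores and labels are assumptions made for illustration.

```python
def recall_at_target_precision(scores, labels, target_precision):
    """Return (threshold, precision, recall) for the threshold with the
    highest recall whose precision is at least target_precision.
    Returns None if no threshold reaches the target."""
    n_pos = sum(labels)
    best = None
    for threshold in sorted(set(scores)):
        # predict the positive class when the score reaches the threshold
        preds = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
        if tp + fp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / n_pos
        if precision >= target_precision and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best


# Toy data: model scores and true labels (hypothetical values)
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

result = recall_at_target_precision(scores, labels, target_precision=0.9)
print(result)
```

A threshold chosen this way is fitted to the data it was selected on, so the precision it achieves on unseen data can differ from the target.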

In [32]:
%%time 
# deploy and evaluate a predictor
precision_predictor = linear_precision.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


---------------!
CPU times: user 283 ms, sys: 5.26 ms, total: 289 ms
Wall time: 7min 31s


In [33]:
print('Metrics for tuned (precision), LinearLearner.\n')

# get metrics for precision predictor
metrics = evaluate(precision_predictor, 
                   test_features.astype('float32'), 
                   test_labels, 
                   verbose=True)

Metrics for tuned (precision), LinearLearner.

prediction (col)  0.0  1.0
actual (row)              
0.0                33  382
1.0                 0   65

Recall:     1.000
Precision:  0.145
Accuracy:   0.204
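
The reported metrics can be checked directly from the confusion matrices. The sketch below recomputes recall, precision, and accuracy (plus F1) from the four cell counts of the two tables printed in this section. The helper `metrics_from_confusion` is a hypothetical stand-in and is not the notebook's `evaluate` function.

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "f1": f1}


# Counts read from the two confusion matrices above (rows: actual, cols: predicted)
balanced = metrics_from_confusion(tp=37, fp=40, fn=28, tn=375)
precision_tuned = metrics_from_confusion(tp=65, fp=382, fn=0, tn=33)

for name, m in [("balanced", balanced), ("precision-tuned", precision_tuned)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The precision-tuned model predicts the positive class for 447 of the 480 test wines, so every good wine is found (recall of 1.0), but precision on the test set is 0.145, far below the 0.9 target.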



In [34]:
## IMPORTANT
# delete the predictor endpoint 
delete_endpoint(precision_predictor)

Deleted linear-learner-2020-06-26-09-38-20-120


### Creating a custom Scikit-learn classifier
I'll now define a custom Scikit-learn classifier. I will later compare its performance metrics to those of the built-in LinearLearner used above.

In [78]:
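data_dir = 'Winequality_prediction'
os.makedirs(data_dir, exist_ok=True)
# write the training data locally (labels in the first column), then upload it to S3
train_path = os.path.join(data_dir, 'train.csv')
pd.concat([pd.DataFrame(train_labels), pd.DataFrame(train_features)], axis=1).to_csv(train_path, header=False, index=False)
train_location = sagemaker_session.upload_data(train_path, key_prefix=prefix)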

In [79]:
# display the Scikit-learn training script, the train2.py file
!pygmentize train2.py

[34mfrom[39;49;00m [04m[36m__future__[39;49;00m [34mimport[39;49;00m print_function

[34mimport[39;49;00m [04m[36margparse[39;49;00m
[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mimport[39;49;00m [04m[36mpandas[39;49;00m [34mas[39;49;00m [04m[36mpd[39;49;00m
[34mfrom[39;49;00m [04m[36msklearn[39;49;00m[04m[36m.[39;49;00m[04m[36mlinear_model[39;49;00m [34mimport[39;49;00m LogisticRegression
[34mfrom[39;49;00m [04m[36msklearn[39;49;00m[04m[36m.[39;49;00m[04m[36mexternals[39;49;00m [34mimport[39;49;00m joblib

[37m## TODO: Import any additional libraries you need to define a model[39;49;00m
[34mfrom[39;49;00m [04m[36msklearn[39;49;00m [34mimport[39;49;00m svm

[37m# Provided model load function[39;49;00m
[34mdef[39;49;00m [32mmodel_fn[39;49;00m(model_dir):
    [33m"""Load model from the model_dir. This is the same model that is saved[39;49;00m
[33m    in the main if statement.[39;49;00m
[33m    """[39

## Defining a Scikit-learn estimator

In [80]:
# import the SKLearn estimator class and define the estimator
from sagemaker.sklearn.estimator import SKLearn
sklearn = SKLearn(
                  output_path='s3://{}/{}/output'.format(sagemaker_session.default_bucket(), prefix),
                  entry_point='train2.py',
                  train_instance_type="ml.m4.xlarge",
                  train_instance_count=1,
                  sagemaker_session=sagemaker_session,
                  role=role,
                  )




This is not the latest supported version. If you would like to use version 0.23-1, please add framework_version=0.23-1 to your constructor.


## Training the estimator

The estimator will now be trained on the training data stored in S3.

In [81]:
%%time
# Training my estimator on S3 training data
s3_input_train = sagemaker.s3_input(s3_data = train_location, content_type = 'csv')
sklearn.fit({'train': s3_input_train})

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


2020-06-26 13:25:36 Starting - Starting the training job...
2020-06-26 13:25:38 Starting - Launching requested ML instances.........
2020-06-26 13:27:09 Starting - Preparing the instances for training...
2020-06-26 13:27:55 Downloading - Downloading input data...
2020-06-26 13:28:33 Training - Training image download completed. Training in progress.
2020-06-26 13:28:33 Uploading - Uploading generated training model
2020-06-26 13:28:29,094 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2020-06-26 13:28:29,097 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-06-26 13:28:29,111 sagemaker_sklearn_container.training INFO     Invoking user training script.
2020-06-26 13:28:29,360 sagemaker-containers INFO     Module train2 does not provide a setup.py. 
Generating setup.py
2020-06-26 13:28:29,361 sagemaker-containers INFO     Generating setup.cfg
2020-06-26 13:28:2

UnexpectedStatusException: Error for Training job sagemaker-scikit-learn-2020-06-26-13-25-36-669: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/miniconda3/bin/python -m train2"

## Deploying the trained model
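Once training succeeds, the model is deployed to create a predictor. The training job above failed with an ExecuteUserScriptError, so the remaining cells were not run and show no output.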

In [None]:
# deploying my model to create a predictor
svm_predictor = sklearn.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")


In [None]:
import os

# store the test data locally in a test.csv file (labels in the first column)
data_dir = 'Winequality_prediction'
os.makedirs(data_dir, exist_ok=True)
test_path = os.path.join(data_dir, 'test.csv')
pd.concat([pd.DataFrame(test_labels), pd.DataFrame(test_features)], axis=1).to_csv(test_path, header=False, index=False)
test_location = sagemaker_session.upload_data(test_path, key_prefix=prefix)

# read the test data back from the same path it was written to
test_data = pd.read_csv(test_path, header=None, names=None)

# keep labels and features under the names used in the cells below
test_y = test_labels
test_x = test_features

print('Metrics for sklearn, SVM model.\n')

# get metrics for the SVM predictor
metrics = evaluate(svm_predictor, 
                   test_features.astype('float32'), 
                   test_labels, 
                   verbose=True)

In [None]:
# First: generate predicted, class labels
test_y_preds = svm_predictor.predict(test_x)

# test that your model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

In [None]:
# Second: calculate the test accuracy, precision, recall, and F1 score
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
precision = precision_score(test_y, test_y_preds)
recall = recall_score(test_y, test_y_preds)
f1 = f1_score(test_y, test_y_preds)  # separate name so the imported function is not shadowed
accuracy = accuracy_score(test_y, test_y_preds) * 100  # accuracy as a percentage

print(accuracy, precision, recall, f1)


In [None]:
## IMPORTANT
# delete the predictor endpoint 
delete_endpoint(svm_predictor)

## Final Cleanup!

* Double-check that all endpoints are deleted.
* Manually delete the S3 bucket, models, and endpoint configurations directly from the AWS console.