# Problem:  Title 

### Introduction

*Why* are you creating this notebook?
*What* will you teach? 
*Why* should the reader care about this?

### Background

What is the background info that the reader might be interested in?  

### Contact
Created by Aaron Sengstacken - https://github.com/sengstacken



In [None]:
# Import Libraries
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from datetime import datetime
import io

import boto3
import sagemaker
from sagemaker import KMeans
from sagemaker import get_execution_role
import sagemaker.amazon.common as smac
from sagemaker.amazon.amazon_estimator import get_image_uri

Example notebook for an A Cloud Guru training course: a multiclass classification problem.

# Data
*What* data exists and *Where* will you get it?

### Load Data

In [31]:
data = ''

In [32]:
role = get_execution_role()
bucket = ''

In [33]:
df = pd.read_csv(data,low_memory=False)

In [34]:
df.head()

Unnamed: 0,reportedTimestamp,eventDate,eventTime,shape,duration,witnesses,weather,firstName,lastName,latitude,longitude,sighting,physicalEvidence,contact,researchOutcome
0,1977-04-04T04:02:23.340Z,1977-03-31,23:46,circle,4,1,rain,Ila,Bashirian,47.329444,-122.578889,Y,N,N,explained
1,1982-11-22T02:06:32.019Z,1982-11-15,22:04,disk,4,1,partly cloudy,Eriberto,Runolfsson,52.664913,-1.034894,Y,Y,N,explained
2,1992-12-07T19:06:52.482Z,1992-12-07,19:01,circle,49,1,clear,Miller,Watsica,38.951667,-92.333889,Y,N,N,explained
3,2011-02-24T21:06:34.898Z,2011-02-21,20:56,disk,13,1,partly cloudy,Clifton,Bechtelar,41.496944,-71.367778,Y,N,N,explained
4,1991-03-09T16:18:45.501Z,1991-03-09,11:42,circle,17,1,mostly cloudy,Jayda,Ebert,47.606389,-122.330833,Y,N,N,explained


In [35]:
df.shape

(18000, 15)

### Data Exploration and Visualization

### Data Preparation 

#### Scaling and Normalization

#### Missing Values

In [37]:
df.isnull().values.any()

True

In [41]:
display(df[df.isnull().any(axis=1)])

Unnamed: 0,reportedTimestamp,eventDate,eventTime,shape,duration,witnesses,weather,firstName,lastName,latitude,longitude,sighting,physicalEvidence,contact,researchOutcome


In [42]:
df['shape'].value_counts()

circle      6049
disk        5920
light       1699
square      1662
triangle    1062
sphere      1020
box          200
oval         199
pyramid      189
Name: shape, dtype: int64

In [40]:
# replace the missing values with the most common shape
df['shape'] = df['shape'].fillna(df['shape'].value_counts().index[0])
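
The fill step above can be tried on a toy column first. This is a minimal sketch, assuming a hypothetical `toy` frame that is not part of the UFO dataset. It relies on `value_counts()` skipping `NaN` by default and sorting by frequency, so the first index entry is the most common observed value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing shape among known values.
toy = pd.DataFrame({"shape": ["circle", "disk", "circle", np.nan]})

# value_counts() ignores NaN by default and sorts by frequency (descending),
# so the first index entry is the most common observed shape.
most_common = toy["shape"].value_counts().index[0]

# Fill the missing entry with that value.
toy["shape"] = toy["shape"].fillna(most_common)
print(toy["shape"].tolist())
```

Filling with the most frequent value keeps every row, at the cost of slightly inflating the majority category.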

#### Transform DataTypes

In [44]:
# dates
df['reportedTimestamp'] = pd.to_datetime(df['reportedTimestamp'])
df['eventDate'] = pd.to_datetime(df['eventDate'])

In [45]:
# categorical
df['shape'] = df['shape'].astype('category')
df['weather'] = df['weather'].astype('category')
df['researchOutcome'] = df['researchOutcome'].astype('category')

In [43]:
# binary mapping
df['physicalEvidence'] = df['physicalEvidence'].replace({'Y':1,'N':0})
df['contact'] = df['contact'].replace({'Y':1,'N':0})

In [46]:
df.dtypes

reportedTimestamp    datetime64[ns, UTC]
eventDate                 datetime64[ns]
eventTime                         object
shape                           category
duration                           int64
witnesses                          int64
weather                         category
firstName                         object
lastName                          object
latitude                         float64
longitude                        float64
sighting                          object
physicalEvidence                   int64
contact                            int64
researchOutcome                 category
dtype: object

#### Drop Unneeded Columns

In [56]:
df.drop(columns=['firstName','lastName','sighting','reportedTimestamp','eventDate','eventTime'],inplace=True)

In [57]:
df.head()

Unnamed: 0,shape,duration,witnesses,weather,latitude,longitude,physicalEvidence,contact,researchOutcome
0,circle,4,1,rain,47.329444,-122.578889,0,0,explained
1,disk,4,1,partly cloudy,52.664913,-1.034894,1,0,explained
2,circle,49,1,clear,38.951667,-92.333889,0,0,explained
3,disk,13,1,partly cloudy,41.496944,-71.367778,0,0,explained
4,circle,17,1,mostly cloudy,47.606389,-122.330833,0,0,explained


In [58]:
# one hot encoding
df = pd.get_dummies(df,columns=['weather','shape'])
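
To see what one-hot encoding produces, here is a small sketch on a hypothetical `toy` frame. The explicit `dtype` argument is an addition for illustration: the default dtype of the indicator columns differs between pandas versions (`uint8` in older releases, as in the `df.dtypes` output in this notebook, and `bool` in pandas 2.x).

```python
import pandas as pd

# Hypothetical toy data with one categorical column.
toy = pd.DataFrame({"duration": [4, 13, 17],
                    "weather": ["rain", "clear", "rain"]})

# Each distinct weather value becomes its own 0/1 indicator column.
# dtype is passed explicitly so the result does not depend on the pandas version.
encoded = pd.get_dummies(toy, columns=["weather"], dtype="uint8")
print(encoded.columns.tolist())
```

The original `weather` column is replaced by one column per category, named `weather_<value>`.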

In [61]:
df.head()

Unnamed: 0,duration,witnesses,latitude,longitude,physicalEvidence,contact,researchOutcome,weather_clear,weather_fog,weather_mostly cloudy,...,weather_stormy,shape_box,shape_circle,shape_disk,shape_light,shape_oval,shape_pyramid,shape_sphere,shape_square,shape_triangle
0,4,1,47.329444,-122.578889,0,0,1,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,4,1,52.664913,-1.034894,1,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,49,1,38.951667,-92.333889,0,0,1,1,0,0,...,0,0,1,0,0,0,0,0,0,0
3,13,1,41.496944,-71.367778,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,17,1,47.606389,-122.330833,0,0,1,0,0,1,...,0,0,1,0,0,0,0,0,0,0


In [60]:
df['researchOutcome'] = df['researchOutcome'].replace({'unexplained':0,'explained':1,'probable':2})
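
The label encoding can also be written with `map()`, which makes unmapped values easy to spot. This is a sketch on hypothetical toy labels, not the notebook's own method; the dictionary is the same one used in the cell above.

```python
import pandas as pd

# Same label-to-integer mapping as the cell above.
label_map = {"unexplained": 0, "explained": 1, "probable": 2}

# Hypothetical toy labels.
labels = pd.Series(["explained", "probable", "unexplained", "explained"])

# map() returns NaN for any value missing from the dictionary,
# which makes unmapped labels easy to detect.
encoded = labels.map(label_map)
if encoded.isnull().any():
    raise ValueError("unmapped label values found")

encoded = encoded.astype("int64")
print(encoded.tolist())
```

The integer codes must run from 0 to `num_class - 1` for the multiclass objectives used later.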

In [62]:
df.dtypes

duration                   int64
witnesses                  int64
latitude                 float64
longitude                float64
physicalEvidence           int64
contact                    int64
researchOutcome            int64
weather_clear              uint8
weather_fog                uint8
weather_mostly cloudy      uint8
weather_partly cloudy      uint8
weather_rain               uint8
weather_snow               uint8
weather_stormy             uint8
shape_box                  uint8
shape_circle               uint8
shape_disk                 uint8
shape_light                uint8
shape_oval                 uint8
shape_pyramid              uint8
shape_sphere               uint8
shape_square               uint8
shape_triangle             uint8
dtype: object

In [65]:
df.shape

(18000, 23)

#### Shuffle / Randomize

In [None]:
# randomize - optional
df = df.sample(frac=1).reset_index(drop=True)

#### Split Dataset

In [67]:
# split
rand_split = np.random.rand(len(df))
train_bool = rand_split < 0.8
val_bool = (rand_split >= 0.8) & (rand_split < 0.9)
test_bool = rand_split >= 0.9

train = df[train_bool]
val = df[val_bool]
test = df[test_bool]
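
The split above draws new random numbers on every run, so the three sets change each time the cell executes. A minimal sketch of a reproducible variant follows, assuming a hypothetical seed of 42 and working on boolean masks only.

```python
import numpy as np

# Hypothetical seed and row count; the notebook's df has 18000 rows.
rng = np.random.default_rng(42)
n_rows = 18000

# One uniform draw per row decides which split the row falls into.
draws = rng.random(n_rows)
train_mask = draws < 0.8
val_mask = (draws >= 0.8) & (draws < 0.9)
test_mask = draws >= 0.9

print(train_mask.sum(), val_mask.sum(), test_mask.sum())
```

The masks index the frame in the same way as above, for example `df[train_mask]`. The split sizes are only approximately 80/10/10, which matches the 14402 training rows reported in the XGBoost training log further down.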

In [70]:
# rearrange the columns so that the target is the first column
train = pd.concat([train['researchOutcome'],train.drop(['researchOutcome'],axis=1)],axis=1)
val = pd.concat([val['researchOutcome'],val.drop(['researchOutcome'],axis=1)],axis=1)

#### Save and Upload Data

In [73]:
# save to CSV
train.to_csv('train.csv',index=False, header=False)
val.to_csv('val.csv',index=False, header=False)
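
The built-in XGBoost algorithm reads CSV training data with the label in the first column and no header row, which is why the columns were rearranged and `header=False` is used. The sketch below checks that layout on a hypothetical toy frame written to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Hypothetical toy frame with the target already in the first column.
toy = pd.DataFrame({"researchOutcome": [1, 0, 2],
                    "duration": [4, 49, 13],
                    "witnesses": [1, 1, 2]})

# Write to an in-memory buffer with the same options as the cell above.
buf = io.StringIO()
toy.to_csv(buf, index=False, header=False)

# Each line starts with the label, followed by the features.
lines = buf.getvalue().splitlines()
print(lines[0])
```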

In [74]:
!ls

acloud_kmeans.ipynb		__MACOSX		       project.avi
acloud_xgboost.ipynb		Model Monitor		       README.md
Baile_01_DeepSort.avi		MOT17-06-raw.avi	       SiamMask
data				MOT20-02-raw.avi	       test2.avi
export				NYC_Walkers_01_DeepSort.avi    test.avi
eye_detection.avi		ObjectDetection		       train.csv
eyevideoframes			ObjectTracking_SiamMask.ipynb  ufo_fullset.csv
Facial Rec			Plane Detection		       val.csv
HouseholdPowerPrediction-LSTM	PortoBello_Road_DeepSort.avi   _yolo.jpg
London_Runners_01_DeepSort.avi	ProcessVideo.ipynb


In [76]:
# upload to s3
boto3.Session().resource('s3').Bucket(bucket).Object('train.csv').upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object('val.csv').upload_file('val.csv')

# Modeling

### Model #1 - XGBoost

#### Model Selection

In [80]:
container = get_image_uri(boto3.Session().region_name, 'xgboost','0.90-1')

In [82]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/train.csv'.format(bucket), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/val.csv'.format(bucket), content_type='csv')

#### Model Training Config

In [85]:
# Create a training job name
job_name = 'ufo-xgboost-job-{}'.format(datetime.now().strftime("%Y%m%d%H%M%S"))
print('Job Name: {}'.format(job_name))

# Here is where the model artifact will be stored
output_location = 's3://{}/'.format(bucket)

Job Name: ufo-xgboost-job-20200610163340


In [86]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path=output_location,
                                    sagemaker_session=sess)

xgb.set_hyperparameters(objective='multi:softmax',
                        num_class=3,
                        num_round=100)

data_channels = {
    'train': s3_input_train,
    'validation': s3_input_validation
}
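
The `multi:softmax` objective makes the model return a class index for each row, whereas `multi:softprob` would return one probability per class. The sketch below illustrates the relationship with NumPy on hypothetical raw scores; it does not call XGBoost itself.

```python
import numpy as np


def softmax(scores):
    """Convert raw per-class scores to probabilities (numerically stable)."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()


# Hypothetical raw scores for the three classes (0, 1, 2).
raw_scores = np.array([0.2, 2.5, 0.9])

# Per-class probabilities, analogous to what 'multi:softprob' reports.
probs = softmax(raw_scores)

# Index of the most likely class, analogous to what 'multi:softmax' reports.
predicted_class = int(np.argmax(probs))
print(predicted_class)
```

Choose `multi:softprob` instead if downstream code needs confidence values rather than hard labels.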

#### Model Training

In [87]:
xgb.fit(data_channels, job_name=job_name)

2020-06-10 16:34:12 Starting - Starting the training job...
2020-06-10 16:34:14 Starting - Launching requested ML instances......
2020-06-10 16:35:40 Starting - Preparing the instances for training......
2020-06-10 16:36:41 Downloading - Downloading input data...
2020-06-10 16:37:13 Training - Downloading the training image...........
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value multi:softmax to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[16:38:51] 14402x22 matrix with 316844 entries loaded from /opt/ml/inp

In [88]:
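# Note: SageMaker typically writes the artifact to <output_path>/<job_name>/output/model.tar.gz;
# xgb.model_data holds the exact URI after fit()
print('Here is the location of the trained XGBoost model: {}/{}/model.tar.gz'.format(output_location, job_name))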

Here is the location of the trained XGBoost model: s3://sengstacken-acloud-kinesis//ufo-xgboost-job-20200610163340/model.tar.gz


#### Model Evaluation

In [None]:
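# plot validation progress from the CloudWatch training logs
# NOTE: this helper comes from an object-detection workflow. It looks for
# 'validation mAP <score>=' lines, which the XGBoost job above may not emit;
# adapt the pattern to the metric your job actually logs.
from matplotlib import ticker

client = boto3.client('logs')
BASE_LOG_NAME = '/aws/sagemaker/TrainingJobs'

def plot_object_detection_log(model, title):
    # find the log stream of the training job and fetch its events
    logs = client.describe_log_streams(logGroupName=BASE_LOG_NAME, logStreamNamePrefix=model._current_job_name)
    cw_log = client.get_log_events(logGroupName=BASE_LOG_NAME, logStreamName=logs['logStreams'][0]['logStreamName'])

    # collect the mAP value reported in parentheses on each matching line
    mAP_accs = []
    for e in cw_log['events']:
        msg = e['message']
        if 'validation mAP <score>=' in msg:
            num_start = msg.find('(')
            num_end = msg.find(')')
            mAP = msg[num_start+1:num_end]
            mAP_accs.append(float(mAP))

    # max() fails on an empty list, so stop early if nothing matched
    if not mAP_accs:
        print('No validation mAP entries found in the log for this job.')
        return

    print(title)
    print('Maximum mAP: %f ' % max(mAP_accs))

    fig, ax = plt.subplots()
    plt.xlabel('Epochs')
    plt.ylabel('Mean Avg Precision (mAP)')
    val_plot, = ax.plot(range(len(mAP_accs)), mAP_accs, label='mAP')
    plt.legend(handles=[val_plot])
    ax.yaxis.set_ticks(np.arange(0.0, 1.05, 0.1))
    ax.yaxis.set_major_formatter(ticker.FormatStrFormatter('%0.2f'))
    plt.show()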
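
For the XGBoost job, the metric to extract would be the validation error rather than mAP. The sketch below parses it from log text with a regular expression. It assumes the job reports `validation-merror`, and the sample lines are hypothetical stand-ins for real log messages; the exact separator between fields may differ.

```python
import re

# Hypothetical log lines in the style of XGBoost evaluation output;
# the separator may be a tab or '#011' depending on how the log is read.
sample_log = [
    "[0]#011train-merror:0.082#011validation-merror:0.091",
    "[1]#011train-merror:0.071#011validation-merror:0.085",
    "INFO:root:Determined delimiter of CSV input is ','",
]

# Capture the number that follows 'validation-merror:'.
pattern = re.compile(r"validation-merror:([0-9.]+)")

val_errors = []
for line in sample_log:
    match = pattern.search(line)
    if match:
        val_errors.append(float(match.group(1)))

print(val_errors)
```

The resulting list can be plotted against the round number in the same way as the mAP values in the helper above.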

### Model #2 - Linear Learner

In [89]:
# This rearranges the columns
cols = list(train)
cols.insert(0, cols.pop(cols.index('researchOutcome')))
train = train[cols]

cols = list(val)
cols.insert(0, cols.pop(cols.index('researchOutcome')))
val = val[cols]

cols = list(test)
cols.insert(0, cols.pop(cols.index('researchOutcome')))
test = test[cols]

# Split each dataset into a feature matrix (X) and a target vector (y), both as numpy.ndarray
train_X = train.drop(columns='researchOutcome').values
train_y = train['researchOutcome'].values

val_X = val.drop(columns='researchOutcome').values
val_y = val['researchOutcome'].values

test_X = test.drop(columns='researchOutcome').values
test_y = test['researchOutcome'].values

In [90]:
train_file = 'ufo_sightings_train_recordIO_protobuf.data'

f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, train_X.astype('float32'), train_y.astype('float32'))
f.seek(0)

boto3.Session().resource('s3').Bucket(bucket).Object('{}'.format(train_file)).upload_fileobj(f)
training_recordIO_protobuf_location = 's3://{}/{}'.format(bucket, train_file)
print('The Pipe mode recordIO protobuf training data: {}'.format(training_recordIO_protobuf_location))

The Pipe mode recordIO protobuf training data: s3://sengstacken-acloud-kinesis/ufo_sightings_train_recordIO_protobuf.data


In [91]:
validation_file = 'ufo_sightings_validatioin_recordIO_protobuf.data'

f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, val_X.astype('float32'), val_y.astype('float32'))
f.seek(0)

boto3.Session().resource('s3').Bucket(bucket).Object('{}'.format(validation_file)).upload_fileobj(f)
validate_recordIO_protobuf_location = 's3://{}/{}'.format(bucket, validation_file)
print('The Pipe mode recordIO protobuf validation data: {}'.format(validate_recordIO_protobuf_location))

The Pipe mode recordIO protobuf validation data: s3://sengstacken-acloud-kinesis/ufo_sightings_validatioin_recordIO_protobuf.data


In [109]:
from sagemaker.amazon.amazon_estimator import get_image_uri
import sagemaker

container = get_image_uri(boto3.Session().region_name, 'linear-learner', "1")

In [110]:
# Create a training job name
job_name = 'ufo-linear-learner-job-{}'.format(datetime.now().strftime("%Y%m%d%H%M%S"))
print('Job Name: {}'.format(job_name))

# Here is where the model-artifact will be stored
output_location = 's3://{}'.format(bucket)


Job Name: ufo-linear-learner-job-20200610192427


In [111]:
print('The feature_dim hyperparameter needs to be set to {}.'.format(train.shape[1] - 1))

The feature_dim hyperparameter needs to be set to 22.
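
Because `feature_dim` must match the number of feature columns exactly, it can be safer to derive it from the data than to type it in by hand. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame: one target column plus three feature columns.
toy = pd.DataFrame({"researchOutcome": [1, 0],
                    "duration": [4, 49],
                    "latitude": [47.3, 38.9],
                    "longitude": [-122.5, -92.3]})

# Count every column except the target.
feature_dim = toy.drop(columns="researchOutcome").shape[1]
print(feature_dim)
```

The same expression applied to `train` gives the value 22 printed above, and the variable can then be passed to `set_hyperparameters` directly.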


In [112]:

sess = sagemaker.Session()

# Set up the Linear Learner algorithm from the ECR container
linear = sagemaker.estimator.Estimator(container,
                                       role, 
                                       train_instance_count=1, 
                                       train_instance_type='ml.c4.xlarge',
                                       output_path=output_location,
                                       sagemaker_session=sess,
                                       input_mode='Pipe')
# Set up the hyperparameters
linear.set_hyperparameters(feature_dim=22, # number of features (all columns except researchOutcome)
                           predictor_type='multiclass_classifier', # type of classification problem
                           num_classes=3)  # number of classes in our researchOutcome (explained, unexplained, probable)


# Define the data channels for the training job. The fit() call in the next cell invokes the CreateTrainingJob API
data_channels = {
    'train': training_recordIO_protobuf_location,
    'validation': validate_recordIO_protobuf_location
}



In [113]:
linear.fit(data_channels, job_name=job_name)


2020-06-10 19:24:30 Starting - Starting the training job...
2020-06-10 19:24:33 Starting - Launching requested ML instances......
2020-06-10 19:25:53 Starting - Preparing the instances for training.........
2020-06-10 19:27:20 Downloading - Downloading input data
2020-06-10 19:27:20 Training - Downloading the training image..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[06/10/2020 19:27:42 INFO 140365195962176] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'feature_dim': u'auto', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_scaler': u'10000', u'_log_level': u'info', u'quantile': u'0.5', u'bias_lr_mult': u'auto', u'lr_scheduler_step': u'auto', u'init_me

In [98]:
print('Here is the location of the trained Linear Learner model: {}/{}/model.tar.gz'.format(output_location, job_name))

Here is the location of the trained Linear Learner model: s3://sengstacken-acloud-kinesis/ufo-linear-learner-job-20200610164901/model.tar.gz


### Model #3 - AutoML

In [2]:
!pip install autogluon

Collecting autogluon
  Using cached https://files.pythonhosted.org/packages/0f/e3/5b9f02d217567b1831fb7c7bcd45410a94c6e759fa18ec41b40f725647aa/autogluon-0.0.10-py3-none-any.whl
Collecting ConfigSpace<=0.4.10 (from autogluon)
  Using cached https://files.pythonhosted.org/packages/42/de/4e8e4f26332fc65404f52baa112defbf822b6738b60bfa6b2993f5c60933/ConfigSpace-0.4.10.tar.gz
Collecting lightgbm<3.0,>=2.3.0 (from autogluon)
  Using cached https://files.pythonhosted.org/packages/0b/9d/ddcb2f43aca194987f1a99e27edf41cf9bc39ea750c3371c2a62698c509a/lightgbm-2.3.1-py2.py3-none-manylinux1_x86_64.whl
Collecting gluoncv<1.0,>=0.5.0 (from autogluon)
  Using cached https://files.pythonhosted.org/packages/fa/81/37a00609cb53da3671adb106b9bc03fb1c029ad5a8db4bc668283e65703d/gluoncv-0.7.0-py2.py3-none-any.whl
Collecting catboost<0.24 (from autogluon)
  Using cached https://files.pythonhosted.org/packages/b2/aa/e61819d04ef2bbee778bf4b3a748db1f3ad23512377e43ecfdc3211437a0/catboost-0.23.2-cp36-none-manylinux1_

In [4]:
from autogluon import TabularPrediction as task
train_path = 'ufo_fullset.csv'
train_data = task.Dataset(file_path=train_path)
predictor = task.fit(train_data, label='researchOutcome')

Loaded data from: ufo_fullset.csv | Columns = 15 / 15 | Rows = 18000 -> 18000


TypeError: _call() takes 0 positional arguments but 1 was given

In [6]:
predictor = task.fit(train_data=task.Dataset(file_path='ufo_fullset.csv'), label='researchOutcome')

Loaded data from: ufo_fullset.csv | Columns = 15 / 15 | Rows = 18000 -> 18000
No output_directory specified. Models will be saved in: AutogluonModels/ag-20200610_194005/
Beginning AutoGluon training ...
AutoGluon will save models to AutogluonModels/ag-20200610_194005/
Train Data Rows:    18000
Train Data Columns: 15
Preprocessing data ...
Here are the 3 unique label values in your data:  ['explained', 'probable', 'unexplained']
AutoGluon infers your prediction problem is: multiclass  (because dtype of label-column == object).
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Feature Generator processed 18000 data points with 13 features
Original Features:
	datetime features: 3
	object features: 6
	int features: 2
	float features: 2
Generated Features:
	int features: 0
All Features:
	datetime features: 3
	object features: 6
	int features: 2
	float features: 2
	Data preprocessing and

In [7]:
results = predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                         model  score_val  pred_time_val   fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer
0           CatboostClassifier   0.946667       0.017714   1.576203                0.017714           1.576203            0       True
1      weighted_ensemble_k0_l1   0.946667       0.018908   2.233299                0.001194           0.657095            1       True
2          NeuralNetClassifier   0.946667       0.221566  39.732700                0.221566          39.732700            0       True
3           LightGBMClassifier   0.946111       0.021708   2.153854                0.021708           2.153854            0       True
4     LightGBMClassifierCustom   0.943889       0.046833   6.622899                0.046833           6.622899            0       True
5     ExtraTreesClassifierEntr   0.942778       0.216649   1.139321                0.216649           1.139321            0     

In [None]:
task.fit()

# Model Tuning - Optional

# Model Deployment

#### Configure

#### Deploy

#### Monitor