<h2>Kaggle Bike Sharing Demand Dataset</h2>

The target 'count' is replaced with log1p(count) for training. The log transform compresses the wide range of rental counts, so very large values have less influence on the fit. Once predictions are made, reverse the transform with the expm1 function to recover the actual count.
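
As a minimal sketch of that round trip (the sample counts below are hypothetical, not taken from the dataset), `np.log1p` and `np.expm1` are exact inverses of each other:

```python
import numpy as np

# Hypothetical hourly rental counts, including zero
counts = np.array([0, 1, 16, 250, 977])

# log1p(x) = log(1 + x): defined at zero and compresses the long right tail
log_counts = np.log1p(counts)

# expm1(y) = exp(y) - 1 is the inverse, so values map back to counts
recovered = np.expm1(log_counts)

print(np.round(log_counts, 3))
print(np.round(recovered).astype(int))
```

Using log1p rather than a plain log keeps hours with zero rentals valid, since log(0) is undefined.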

Inspiration: https://www.kaggle.com/apapiu/predicting-bike-sharing-with-xgboost by Alexandru Papiu

To download the dataset, sign in to Kaggle and download it from this link: https://www.kaggle.com/c/bike-sharing-demand/data

Objective: <quote>You are provided hourly rental data spanning two years. For this competition, the training set is comprised of the first 19 days of each month, while the test set is the 20th to the end of the month. You must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period (Ref: Kaggle.com)</quote>

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Assumes that the train and test datasets have been uploaded to the SageMaker Jupyter Notebook
df = pd.read_csv('train.csv', parse_dates=['datetime'])
df_test = pd.read_csv('test.csv', parse_dates=['datetime'])

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# We need to convert datetime to numeric for training.
# Let's extract key features into separate numeric columns
def add_features(df):
    df['year'] = df['datetime'].dt.year
    df['month'] = df['datetime'].dt.month
    df['day'] = df['datetime'].dt.day
    df['dayofweek'] = df['datetime'].dt.dayofweek
    df['hour'] = df['datetime'].dt.hour

In [None]:
add_features(df)
add_features(df_test)
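
To see what these features look like, here is a small sketch that applies the same `.dt` accessors to a single timestamp. The timestamp is a hypothetical example, not a row from the dataset:

```python
import pandas as pd

# One hypothetical timestamp in the same format as the 'datetime' column
sample = pd.DataFrame({'datetime': pd.to_datetime(['2011-07-07 15:00:00'])})

# The same .dt accessors that add_features uses
sample['year'] = sample['datetime'].dt.year
sample['month'] = sample['datetime'].dt.month
sample['day'] = sample['datetime'].dt.day
sample['dayofweek'] = sample['datetime'].dt.dayofweek  # Monday=0 ... Sunday=6
sample['hour'] = sample['datetime'].dt.hour

print(sample.drop(columns='datetime').to_dict('records')[0])
```

Note that `dayofweek` counts from Monday as 0, so a Thursday is encoded as 3.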

In [None]:
df["count"] = df["count"].map(np.log1p)

In [None]:
df.head()

In [None]:
df_test.head()

In [None]:
df.dtypes

## Training and Validation Set

* Target variable is the first column, followed by the input features
* Training and validation files do not have a column header

In [None]:
# Training = 70% of the data
# Validation = 30% of the data
# Randomize the dataset
np.random.seed(5)
l = list(df.index)
np.random.shuffle(l)
df = df.iloc[l]

In [None]:
rows = df.shape[0]
train = int(.7 * rows)
test = int(.3 * rows)

In [None]:
rows, train, test
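
The shuffle-then-slice split can also be sketched with `DataFrame.sample`, which shuffles in one step. This is an alternative formulation on a synthetic stand-in frame, not the code used in this notebook. The names `demo`, `train_part`, and `validation_part` are made up for the illustration:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the bike data (hypothetical values)
demo = pd.DataFrame({'count': np.arange(10), 'hour': np.arange(10) % 24})

# Shuffle all rows reproducibly; random_state plays the role of np.random.seed
shuffled = demo.sample(frac=1, random_state=5)

# First 70% of the shuffled rows for training, the remainder for validation
n_train = int(0.7 * len(shuffled))
train_part = shuffled[:n_train]
validation_part = shuffled[n_train:]

print(len(train_part), len(validation_part))
```

Slicing the remainder with `[n_train:]` guarantees that every row lands in exactly one of the two sets, even when 70% of the row count is not a whole number.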

In [None]:
columns = ['count', 'season', 'holiday', 'workingday', 'weather', 'temp',
       'atemp', 'humidity', 'windspeed', 'year', 'month', 'day', 'dayofweek','hour']

In [None]:
# Write Training Set
df[:train].to_csv('bike_train.csv'
                          ,index=False,header=False
                          ,columns=columns)

In [None]:
# Write Validation Set
df[train:].to_csv('bike_validation.csv'
                          ,index=False,header=False
                          ,columns=columns)

In [None]:
# Test Data has only input features
test_cols =  ['datetime'] +columns[1:]
df_test.to_csv('bike_test.csv'
               ,index=False,header=False
              ,columns=test_cols)

In [None]:
test_cols

In [None]:
df_test.head()

In [None]:
','.join(columns)

In [None]:
# Write Column List
with open('bike_train_column_list.txt','w') as f:
    f.write(','.join(columns))
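
Because the CSV files are written without a header, the saved column list is what makes them readable later. A minimal sketch of that round trip, using an in-memory buffer and a hypothetical three-column frame in place of the real files:

```python
import io
import pandas as pd

# Hypothetical two-row frame with the target deliberately not in first position
demo = pd.DataFrame({'hour': [8, 17], 'temp': [9.84, 28.7], 'count': [2.77, 5.01]})
columns = ['count', 'temp', 'hour']  # target first, then input features

# Same options as the notebook: no index, no header, explicit column order
buffer = io.StringIO()
demo.to_csv(buffer, index=False, header=False, columns=columns)

# Read the header-less data back by supplying the column list as names
buffer.seek(0)
restored = pd.read_csv(buffer, header=None, names=columns)
print(restored)
```

The `columns` argument to `to_csv` controls the output order, which is how the target ends up in the first column regardless of its position in the DataFrame.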

### Import AWS libraries

In [None]:
# Define IAM role
import boto3
import re
import sagemaker
from sagemaker import get_execution_role

# SageMaker SDK Documentation: http://sagemaker.readthedocs.io/en/latest/estimators.html

### Upload Data to S3

In [None]:
bucket_name = 'uwo-bkt-xj'
training_file_key = 'biketrain/bike_train.csv'
validation_file_key = 'biketrain/bike_validation.csv'
test_file_key = 'biketrain/bike_test.csv'

s3_model_output_location = r's3://{0}/biketrain/model'.format(bucket_name)
s3_training_file_location = r's3://{0}/{1}'.format(bucket_name,training_file_key)
s3_validation_file_location = r's3://{0}/{1}'.format(bucket_name,validation_file_key)
s3_test_file_location = r's3://{0}/{1}'.format(bucket_name,test_file_key)

In [None]:
print(s3_model_output_location)
print(s3_training_file_location)
print(s3_validation_file_location)
print(s3_test_file_location)

In [None]:
# Writing to and reading from S3 is straightforward
# Files are referred to as objects in S3
# The file name is referred to as the key
# Objects in the S3 Standard storage class are automatically replicated across
# at least three Availability Zones in the region where the bucket was created

# http://boto3.readthedocs.io/en/latest/guide/s3.html
def write_to_s3(filename, bucket, key):
    with open(filename,'rb') as f: # Read in binary mode
        return boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(f)

In [None]:
write_to_s3('bike_train.csv',bucket_name,training_file_key)
write_to_s3('bike_validation.csv',bucket_name,validation_file_key)
write_to_s3('bike_test.csv',bucket_name,test_file_key)

### Training Algorithm Docker Image

* AWS maintains a separate image for every region and algorithm

In [None]:
role = get_execution_role()

In [None]:
# This role contains the permissions needed to train and deploy models
# The SageMaker service is trusted to assume this role
print(role)

In [None]:
# Find your region
boto3.Session().region_name

### Build Model

In [None]:
sess = sagemaker.Session()

In [None]:
# Access the appropriate algorithm container image
# Specify how many instances to use for distributed training and what type of machine to use
# Finally, specify where the trained model artifacts need to be stored
# Reference: http://sagemaker.readthedocs.io/en/latest/estimators.html
# Optionally, give a name to the training job using base_job_name

import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri
image_uri = get_image_uri(boto3.Session().region_name, 'xgboost', '0.90-1')
print(image_uri)
estimator = sagemaker.estimator.Estimator(image_uri,
                                       role, 
                                       train_instance_count=1, 
                                       train_instance_type='ml.m4.xlarge',
                                       output_path=s3_model_output_location,
                                       sagemaker_session=sess,
                                       base_job_name ='xgboost-biketrain-v1')

In [None]:

# Specify hyperparameters that are appropriate for the training algorithm
# XGBoost Training Parameter Reference: 
#   https://github.com/dmlc/xgboost/blob/master/doc/parameter.md

# max_depth=5,eta=0.1,subsample=0.7,num_round=150
# Note: XGBoost 0.90 deprecates "reg:linear" in favor of the equivalent "reg:squarederror"
estimator.set_hyperparameters(max_depth=6,objective="reg:linear",
                              eta=0.12,subsample=0.73,num_round=200)


In [None]:
estimator.hyperparameters()
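
A squared-error objective fits well with the log1p target: the competition is scored with RMSLE, and RMSE measured on log1p-transformed values is the same quantity. The sketch below checks this on hypothetical numbers. The helper names `rmse` and `rmsle` are defined here for illustration and are not part of any library used in this notebook:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, computed on raw counts."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Hypothetical actual and predicted rental counts
actual = np.array([10, 120, 45, 300])
predicted = np.array([12, 100, 50, 280])

# RMSE in log1p space, which is what the model sees during training
rmse_in_log_space = rmse(np.log1p(actual), np.log1p(predicted))

print(rmse_in_log_space, rmsle(actual, predicted))
```

So the RMSE that XGBoost reports on the validation channel can be read directly as an RMSLE on the original counts.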

### Specify Training Data Location and Optionally, Validation Data Location

In [None]:
# content type can be libsvm or csv for XGBoost
training_input_config = sagemaker.session.s3_input(s3_data=s3_training_file_location,content_type="csv")
validation_input_config = sagemaker.session.s3_input(s3_data=s3_validation_file_location,content_type="csv")

In [None]:
print(training_input_config.config)
print(validation_input_config.config)

### Train the model

In [None]:
# XGBoost supports "train", "validation" channels
# Reference: Supported channels by algorithm
#   https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html
estimator.fit({'train':training_input_config, 'validation':validation_input_config})

### Deploy Model

In [None]:
# Ref: http://sagemaker.readthedocs.io/en/latest/estimators.html
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.m4.xlarge',
                             endpoint_name = 'xgboost-biketrain-v1')

### Run Predictions

In [None]:
from sagemaker.predictor import csv_serializer, json_deserializer

predictor.content_type = 'text/csv'
predictor.serializer = csv_serializer
predictor.deserializer = None

In [None]:
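# The model was trained on 13 input features (columns[1:]), but this sample
# lists only 12 values; the last feature, hour, appears to be missing
# The response is log1p(count); apply np.expm1 to recover the rental count
predictor.predict([[3,0,1,2,28.7,33.335,79,12.998,2011,7,7,3]])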
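
Since the model was trained on log1p(count), the endpoint output has to be converted back before it can be read as a rental count. The sketch below assumes the response arrives as bytes holding comma-separated values, which is what `deserializer = None` is expected to give. The `raw_response` value is a hypothetical placeholder, not real endpoint output:

```python
import numpy as np

# Hypothetical raw response: bytes holding comma-separated log1p(count) values
raw_response = b'3.59,5.12'

# Decode the bytes and parse each value as a float
log_predictions = np.array(raw_response.decode('utf-8').split(','), dtype=float)

# Undo the log1p applied to 'count' before training; clipping at zero guards
# against slightly negative outputs from the regression
counts = np.clip(np.expm1(log_predictions), 0, None)

print(np.round(counts).astype(int))
```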

### Summary

* Ensure training, validation and test data are in the S3 bucket
* Select the algorithm container registry path - the path varies by region
* Configure the Estimator for training - specify the algorithm container, instance count, instance type and model output location
* Specify algorithm-specific hyperparameters
* Train model
* Deploy model - Specify instance count, instance type and endpoint name
* Run Predictions