# CODING TASK #1: UNDERSTAND THE PROBLEM STATEMENT/BUSINESS CASE [REVIEW]

- In this project, we will assume that we own an ice cream business whose revenue is highly dependent on the outside air temperature.
- We will apply simple linear regression to predict the daily revenue in dollars based on outside air temperature. 
- Dataset:
    - Input (X): Outside Air Temperature
    - Output (Y): Overall daily revenue generated in dollars 
- In simple linear regression, we predict the value of one variable Y based on another variable X.
- X is called the independent variable and Y is called the dependent variable.
- Why simple? Because it examines the relationship between two variables only.
- Why linear? Because when the independent variable increases (or decreases), the dependent variable increases (or decreases) in a linear fashion.
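
The straight-line relationship described above can be written as y = b + m * x, where m is the slope and b is the intercept. Below is a minimal sketch of fitting such a line with `np.polyfit`. The temperature and revenue values are synthetic numbers invented for illustration (they are not taken from IceCreamData.csv), and `predict_revenue` is a hypothetical helper name.

```python
import numpy as np

# Synthetic, noise-free data invented for illustration (not from IceCreamData.csv):
# revenue = 50 + 20 * temperature
temperature = np.array([5.0, 10.0, 20.0, 30.0, 40.0])
revenue = 50.0 + 20.0 * temperature

# A degree-1 polynomial fit returns the coefficients as [slope, intercept]
slope, intercept = np.polyfit(temperature, revenue, deg=1)

def predict_revenue(temp_c):
    """Predict revenue from the fitted line y = intercept + slope * x."""
    return intercept + slope * temp_c

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"predicted revenue at 25 degC = {predict_revenue(25.0):.2f}")
```

With real, noisy data the fitted slope and intercept would only approximate the underlying trend.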


**PRACTICE OPPORTUNITY #1 [OPTIONAL]:**
- **What do you expect the relationship between outside air temperature and ice cream sales to look like?**
- **What do you expect the relationship between outside air temperature and bike sharing rental usage to look like?**
- **What do you expect the relationship between outside air temperature and ski rental usage to look like?**

# CODING TASK #2: IMPORT KEY LIBRARIES/DATASETS AND PREPARE THE DATA FOR TRAINING

In [None]:
# Note that we are using AWS SageMaker 2.72.1
# We will be using the new SageMaker 2.x SDK 
!pip list

In [None]:
# install seaborn library
!pip install --upgrade seaborn

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


In [None]:
# read the data using Pandas 
icecream_sales_df = pd.read_csv('IceCreamData.csv')

In [None]:
# View the DataFrame
icecream_sales_df

In [None]:
icecream_sales_df.head()

In [None]:
icecream_sales_df.tail()

In [None]:
# Separate the data into input X and Output y
X = icecream_sales_df[['Temperature']]
y = icecream_sales_df[['Revenue']]

In [None]:
X

In [None]:
y

In [None]:
# Check out the shape of the input
X.shape

In [None]:
# Check out the shape of the output
y.shape

In [None]:
# Convert the datatype to float32
X = np.array(X).astype('float32')
y = np.array(y).astype('float32')

In [None]:
# View the input after conversion to a float32 NumPy array (no scaling is applied here)
X 

In [None]:
y

In [None]:
# split the data into training and testing using SkLearn Library
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
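
By default `train_test_split` shuffles the rows before splitting, so each run of the cell above can produce a different split. Here is a sketch of how the `random_state` argument makes the split reproducible. It uses a synthetic array in place of the ice cream data, and the `_demo` names are placeholders chosen so they do not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the temperature/revenue arrays (100 rows, 1 column each)
X_demo = np.arange(100, dtype="float32").reshape(-1, 1)
y_demo = 2.0 * X_demo + 1.0

# Fixing random_state makes the shuffled split identical on every run
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)            # row counts for the 80/20 split
print(np.array_equal(X_te, split_b[1]))  # True when both runs pick the same test rows
```

Each input row stays paired with its output row, because both arrays are shuffled with the same permutation.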

**PRACTICE OPPORTUNITY #2 [OPTIONAL]:**
 - **Split the data into 75% for training and the rest for testing**
 - **Verify that the split was successful**
 - **Did you notice any change in the order of the data? Why?**
 - **Add an attribute to disable data shuffling [external research is required]**

# CODING TASK #3: TRAIN A LINEAR LEARNER MODEL USING AWS SAGEMAKER (SDK 2.0)

In [None]:
# Boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python
# Boto3 allows Python developers to write software that makes use of services like Amazon S3 and Amazon EC2

import sagemaker
import boto3

# Let's create a Sagemaker session
sagemaker_session = sagemaker.Session()

# Let's define the S3 bucket and prefix that we want to use in this session
bucket = 'aws.ml.engineer' # the bucket needs to be created beforehand
prefix = 'linear_learner' # prefix is the subfolder within the bucket.

# Let's get the execution role for the notebook instance. 
# This is the IAM role that you created when you created your notebook instance. You pass the role to the training job.
# Note that this is the AWS Identity and Access Management (IAM) role that Amazon SageMaker assumes to perform tasks on your behalf (for example, reading training results, called model artifacts, from the S3 bucket and writing training results to Amazon S3).
role = sagemaker.get_execution_role()
print(role)

In [None]:
X_train.shape

In [None]:
y_train = y_train[:,0]

In [None]:
y_train.shape
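
Indexing with `[:, 0]` turns the `(n, 1)` column into a flat vector of shape `(n,)`, which matches this notebook's later remark that the target label should be a vector. Here is a small sketch of the shape change, using made-up revenue values and placeholder names.

```python
import numpy as np

# Made-up revenue column with shape (4, 1), like y after the train/test split
y_column = np.array([[120.0], [310.5], [455.0], [610.2]], dtype="float32")

# Selecting column 0 of every row flattens it to a 1-D label vector
y_vector = y_column[:, 0]

print(y_column.shape)  # (4, 1)
print(y_vector.shape)  # (4,)
```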

In [None]:
import io # The io module allows for dealing with various types of I/O (text I/O, binary I/O and raw I/O). 
import numpy as np
import sagemaker.amazon.common as smac # sagemaker common library

# Code below converts the data in numpy array format to RecordIO format
# This is the format required by Sagemaker Linear Learner (one of many available options!)

buf = io.BytesIO() # create an in-memory byte array (buf is a buffer I will be writing to)
smac.write_numpy_to_dense_tensor(buf, X_train, y_train)
buf.seek(0) 
# Writing to an in-memory byte buffer advances its position by the number of bytes written
# seek(0) resets the position to the start so the buffer can be read from the beginning
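
The standard-library sketch below shows why the `buf.seek(0)` call above matters: a reader that starts from the buffer's current position sees nothing until the position is rewound. `buf_demo` is a throwaway name used only here.

```python
import io

buf_demo = io.BytesIO()              # empty in-memory buffer, position starts at 0
buf_demo.write(b"hello")             # writing 5 bytes moves the position to 5
position_after_write = buf_demo.tell()

read_without_seek = buf_demo.read()  # reads from position 5 onward: nothing left

buf_demo.seek(0)                     # rewind to the start of the buffer
read_after_seek = buf_demo.read()    # now the full contents are returned

print(position_after_write)  # 5
print(read_without_seek)     # b''
print(read_after_seek)       # b'hello'
```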


In [None]:
import os

# Code to upload RecordIO data to S3
 
# Key refers to the name of the file    
key = 'linear-train-data'

# The following code uploads the data in record-io format to S3 bucket to be accessed later for training
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)

# Let's print out the training data location in s3
s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)
print('uploaded training data location: {}'.format(s3_train_data))
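
S3 object keys use forward slashes as the folder-like separator, whereas `os.path.join` uses the separator of the local operating system. On a Linux-based notebook instance the two coincide, so the cell above works as written. To get the same key on any platform, one option is `posixpath.join`, sketched below. `build_s3_uri` is a hypothetical helper, not part of boto3 or the SageMaker SDK.

```python
import posixpath

def build_s3_uri(bucket, *key_parts):
    """Return (key, uri) for an S3 object, always joining with forward slashes."""
    key = posixpath.join(*key_parts)
    return key, "s3://{}/{}".format(bucket, key)

# Same bucket, prefix, and file name as in the notebook cell above
key_demo, uri_demo = build_s3_uri(
    "aws.ml.engineer", "linear_learner", "train", "linear-train-data"
)

print(key_demo)  # linear_learner/train/linear-train-data
print(uri_demo)  # s3://aws.ml.engineer/linear_learner/train/linear-train-data
```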

In [None]:
X_test.shape

In [None]:
y_test.shape

In [None]:
# Make sure that the target label is a vector
y_test = y_test[:,0]


In [None]:
# Convert the test data from numpy array format to RecordIO format

buf = io.BytesIO() # create an in-memory byte array (buf is a buffer I will be writing to)
smac.write_numpy_to_dense_tensor(buf, X_test, y_test)
buf.seek(0) 
# Writing to an in-memory byte buffer advances its position by the number of bytes written
# seek(0) resets the position to the start so the buffer can be read from the beginning


In [None]:
# Key refers to the name of the file    
key = 'linear-test-data'

# The following code uploads the data in record-io format to S3 bucket to be accessed later for testing
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test', key)).upload_fileobj(buf)

# Let's print out the testing data location in s3
s3_test_data = 's3://{}/{}/test/{}'.format(bucket, prefix, key)
print('uploaded testing data location: {}'.format(s3_test_data))

In [None]:
# create an output placeholder in S3 bucket to store the linear learner output

output_location = 's3://{}/{}/output'.format(bucket, prefix)
print('Training artifacts will be uploaded to: {}'.format(output_location))

In [None]:
# Note that this code leverages the new SageMaker SDK 2.0
# Check this out for the list of changes from AWS SageMaker SDK 1.0 to 2.0: https://sagemaker.readthedocs.io/en/stable/v2.html

# This code is used to get the training container of sagemaker built-in algorithms
# all we have to do is to specify the name of the algorithm that we want to use

# Let's obtain a reference to the linearLearner container image
# You don't have to specify (hardcode) the region: boto3.Session().region_name returns the current region name
container = sagemaker.image_uris.retrieve("linear-learner", boto3.Session().region_name)


# If you are using the older AWS SageMaker SDK 1.0, you need to use get_image_uri
# from sagemaker.amazon.amazon_estimator import get_image_uri
# container = get_image_uri(boto3.Session().region_name, 'linear-learner')

In [None]:
# Note that this code leverages the new SageMaker SDK 2.0
# Check this for the list of changes from AWS SageMaker SDK 1.0 to 2.0: https://sagemaker.readthedocs.io/en/stable/v2.html


# We have to pass the container, the IAM role, the type of instance that we would like to use for training,
# the output path and the SageMaker session into the Estimator.
# We can also specify how many instances we would like to use for training

linear = sagemaker.estimator.Estimator(container,
                                       role, 
                                       instance_count = 1, 
                                       instance_type = 'ml.m4.xlarge',
                                       output_path = output_location,
                                       sagemaker_session = sagemaker_session)


# We can tune parameters like the number of features that we are passing in, type of predictor like 'regressor' or 'classifier', mini batch size, epochs
# num_models = 32 trains 32 different versions of the model in parallel and keeps the best one (built-in parameter optimization!)

linear.set_hyperparameters(feature_dim = 1,
                           predictor_type = 'regressor',
                           mini_batch_size = 5,
                           epochs = 5,
                           num_models = 32,
                           loss = 'absolute_loss')

# Now we are ready to pass in the training data from S3 to train the linear learner model

linear.fit({'train': s3_train_data})

# Let's see the progress using cloudwatch logs
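
The `loss = 'absolute_loss'` setting above asks the algorithm to minimize absolute error rather than squared error. The rough sketch below uses made-up numbers to show how the two measures weigh one large miss. It illustrates the loss definitions only and is not Linear Learner's internal optimizer.

```python
import numpy as np

# Made-up actual and predicted revenues; the last prediction is a large miss
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 300.0])

errors = y_true - y_pred                        # [-10, 10, -10, 100]
mean_absolute_error = np.mean(np.abs(errors))   # (10 + 10 + 10 + 100) / 4
mean_squared_error = np.mean(errors ** 2)       # (100 + 100 + 100 + 10000) / 4

print(mean_absolute_error)  # 32.5
print(mean_squared_error)   # 2575.0
```

The single large miss dominates the squared error far more than the absolute error, which is why absolute loss is usually considered less sensitive to outliers.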

**PRACTICE OPPORTUNITY #3 [OPTIONAL]:**
- **Try to train the model with more epochs and additional number of models**
- **Can you try to reduce the cost of the billable seconds?**

# CODING TASK #4: DEPLOY AND TEST TRAINED LINEAR LEARNER MODEL 

In [None]:
# Deploying the model to perform inference 
# serializer: A serializer object is used to encode data for an inference endpoint.
# deserializer: A deserializer object is used to decode data from an inference endpoint.

from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import CSVSerializer


linear_regressor = linear.deploy(initial_instance_count = 1,
                                 instance_type = 'ml.m4.xlarge',
                                 serializer = CSVSerializer(),
                                 deserializer = JSONDeserializer())

In [None]:
# Use the code lines below if you're using AWS SageMaker SDK 1.0
# from sagemaker.predictor import csv_serializer, json_deserializer
# linear_regressor.content_type = 'text/csv' # This will need to be enabled for AWS SageMaker SDK 1.0
# linear_regressor.serializer = csv_serializer
# linear_regressor.deserializer = json_deserializer

In [None]:
# making prediction on the test data

result = linear_regressor.predict(X_test)

In [None]:
result # results are in JSON format

In [None]:
# Since the result is in JSON format, we access the scores by iterating through the scores in the predictions

predictions = np.array([r['score'] for r in result['predictions']])
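
The list comprehension above assumes the response has the shape `{'predictions': [{'score': ...}, ...]}`. Here is a sketch with a made-up response of that shape. The numbers are invented and are not real endpoint output, and the `_demo` names are placeholders.

```python
import numpy as np

# Hypothetical response with the same structure the cell above indexes into
result_demo = {"predictions": [{"score": 512.3}, {"score": 238.9}, {"score": 701.0}]}

# Pull the 'score' field out of each prediction record
predictions_demo = np.array([r["score"] for r in result_demo["predictions"]])

print(predictions_demo.shape)  # (3,)
```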

In [None]:
predictions

In [None]:
predictions.shape

In [None]:
# VISUALIZE TEST SET RESULTS
plt.figure(figsize = (10, 6))
plt.scatter(X_test, y_test, color = 'blue')
plt.plot(X_test, predictions, color = 'red')
plt.xlabel('Temperature [DegC]')
plt.ylabel('Revenue [$]')
plt.title('Temperature vs. Revenue (Testing Dataset)')
plt.grid()
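
The plot gives a visual check, and numeric metrics make the comparison concrete. The sketch below computes RMSE, MAE, and R^2 with NumPy. It uses small made-up arrays so that it runs on its own. In the notebook, `y_test` and `predictions` could be passed instead, assuming both are 1-D arrays of equal length. `regression_metrics` is a hypothetical helper name.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for two equal-length 1-D arrays."""
    y_true = np.asarray(y_true, dtype="float64")
    y_pred = np.asarray(y_pred, dtype="float64")
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))
    mae = np.mean(np.abs(residuals))
    ss_res = np.sum(residuals ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Made-up values standing in for y_test and predictions
y_demo_true = np.array([200.0, 400.0, 600.0, 800.0])
y_demo_pred = np.array([210.0, 390.0, 610.0, 790.0])

rmse, mae, r2 = regression_metrics(y_demo_true, y_demo_pred)
print(f"RMSE = {rmse:.2f}, MAE = {mae:.2f}, R^2 = {r2:.4f}")
```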

In [None]:
# Delete the endpoint to stop incurring charges
linear_regressor.delete_endpoint()

**PRACTICE OPPORTUNITY #4 [OPTIONAL]:**
- **Using the trained AWS SageMaker Linear Learner model, obtain the revenue when the outside air temperature is 35 degC and when it is 10 degC.**
- **Compare the results to the ones obtained using SkLearn!**

# GREAT JOB! 

# PRACTICE OPPORTUNITY SOLUTIONS

**PRACTICE OPPORTUNITY #1 SOLUTION:**
- **What do you expect the relationship between outside air temperature and ice cream sales to look like?**
- **What do you expect the relationship between outside air temperature and bike sharing rental usage to look like?**
- **What do you expect the relationship between outside air temperature and ski rental usage to look like?**

- A positive correlation is expected for cases 1 and 2: as temperature increases, we expect ice cream sales and bike sharing rental usage to increase as well.
- A positive correlation implies a positive relationship between X and Y: as X increases, Y increases.
- A negative correlation is expected for case 3 (ski rental usage): as temperature decreases, ski rental usage tends to increase (up to the point where it is too cold and demand stabilizes or even drops).

**PRACTICE OPPORTUNITY #2 SOLUTION:**
 - **Split the data into 75% for training and the rest for testing**
 - **Verify that the split was successful**
 - **Did you notice any change in the order of the data? Why?**
 - **Add an attribute to disable data shuffling [external research is required]**

In [None]:
# split the data into training and testing using SkLearn Library
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, shuffle = False)


In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
y_train.shape

In [None]:
y_test.shape

In [None]:
X_train

**PRACTICE OPPORTUNITY #3 SOLUTION:**
- **Try to train the model with more epochs and additional number of models**
- **Can you try to reduce the cost of the billable seconds?**

In [None]:
# More epochs and additional number of models
linear = sagemaker.estimator.Estimator(container,
                                       role, 
                                       instance_count = 1, 
                                       instance_type = 'ml.m4.xlarge',
                                       output_path = output_location,
                                       sagemaker_session = sagemaker_session)

# We can tune parameters like the number of features that we are passing in, type of predictor like 'regressor' or 'classifier', mini batch size, epochs
# num_models = 64 trains 64 different versions of the model in parallel and keeps the best one (built-in parameter optimization!)

linear.set_hyperparameters(feature_dim = 1,
                           predictor_type = 'regressor',
                           mini_batch_size = 5,
                           epochs = 10,
                           num_models = 64,
                           loss = 'absolute_loss')

# Now we are ready to pass in the training data from S3 to train the linear learner model

linear.fit({'train': s3_train_data})

# Let's see the progress using cloudwatch logs

In [None]:
# A Spot instance offers a lower price compared to an On-Demand instance.
# Amazon EC2 Spot Instances offer spare compute capacity available in the AWS Cloud at discounts of up to 90% compared to On-Demand prices.

# use_spot_instances (bool): Specifies whether to use SageMaker Managed Spot instances for training (named train_use_spot_instances in SDK 1.0).
# max_run (int): Timeout in seconds for training (default: 24 * 60 * 60). After this amount of time Amazon SageMaker terminates the job regardless of its current status.
# max_wait (int): Timeout in seconds for the whole Spot training job, including the time spent waiting for Spot capacity (default: None). It must be greater than or equal to max_run.


linear = sagemaker.estimator.Estimator(container,
                                       role, 
                                       instance_count = 1, 
                                       instance_type = 'ml.m4.xlarge',
                                       output_path = output_location,
                                       sagemaker_session = sagemaker_session,
                                       use_spot_instances = True,
                                       max_run = 300,
                                       max_wait = 600)

# We can tune parameters like the number of features that we are passing in, type of predictor like 'regressor' or 'classifier', mini batch size, epochs
# num_models = 32 trains 32 different versions of the model in parallel and keeps the best one (built-in parameter optimization!)

linear.set_hyperparameters(feature_dim = 1,
                           predictor_type = 'regressor',
                           mini_batch_size = 5,
                           epochs = 5,
                           num_models = 32,
                           loss = 'absolute_loss')

# Now we are ready to pass in the training data from S3 to train the linear learner model

linear.fit({'train': s3_train_data})

# Let's see the progress using cloudwatch logs
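
When a job runs on Spot capacity, the job summary reports the training seconds alongside the billable seconds. Below is a sketch of the savings arithmetic with hypothetical numbers. `spot_savings_percent` is an illustrative helper, not part of the SageMaker SDK.

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage saved when billable time is lower than actual training time."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1.0 - billable_seconds / training_seconds) * 100.0

# Hypothetical job: 120 s of training, billed for 36 s
savings = spot_savings_percent(training_seconds=120, billable_seconds=36)
print(f"Managed Spot Training savings: {savings:.1f}%")
```

Comparing this figure across runs is one way to check whether Spot training actually reduced the cost of a job.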

**PRACTICE OPPORTUNITY #4 SOLUTION:**
- **Using the trained AWS SageMaker Linear Learner model, obtain the revenue when the outside air temperature is 35 degC and when it is 10 degC.**
- **Compare the results to the ones obtained using SkLearn!**

In [None]:
temperature = [[10]]
revenue = linear_regressor.predict(temperature)
print(revenue)

temperature = [[35]] 
revenue = linear_regressor.predict(temperature)
print(revenue)
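
For the scikit-learn side of the comparison, here is a sketch that fits `LinearRegression` and predicts at the same two temperatures. It uses synthetic data in place of `X_train` and `y_train` so that it runs on its own. Swap in the notebook's arrays to compare against the endpoint's predictions. The `_demo` and `sk_` names are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the training data: revenue = 45 + 21 * temperature
X_demo = np.array([[0.0], [10.0], [20.0], [30.0], [40.0]], dtype="float32")
y_demo = 45.0 + 21.0 * X_demo[:, 0]

# Ordinary least-squares fit
sk_model = LinearRegression()
sk_model.fit(X_demo, y_demo)

# Predict revenue at the two temperatures from the exercise
query = np.array([[10.0], [35.0]], dtype="float32")
sk_predictions = sk_model.predict(query)
print(sk_predictions)
```

`LinearRegression` minimizes squared error, while the Linear Learner job above was configured with `absolute_loss`, so small differences between the two sets of predictions are to be expected.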


In [None]:
# Delete the endpoint to stop incurring charges
linear_regressor.delete_endpoint()