# Welcome to the AWS Workshop

**Brief description about the dataset**

Diabetes is a disease that occurs when your blood glucose, also called blood sugar, is too high. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

We will explore this dataset to find the factors most strongly associated with diabetes. We will also build a machine learning model that predicts whether a person is diabetic, and try to improve it through cross-validation and hyperparameter tuning.

### Steps to be followed:

* Importing the necessary libraries
* Creating the S3 bucket
* Importing the data from the Git repository and the S3 bucket, and exporting results to S3
* Data preprocessing
* Exploratory data analysis
* Building and deploying the model
* Prediction
* Deleting the endpoint and the S3 bucket

### Importing all necessary libraries

In [3]:
# Basic analysis library

import sys
import numpy as np
import pandas as pd

In [4]:
# SageMaker libraries

import sagemaker                                             # SageMaker Python SDK, including the built-in algorithms
import boto3                                                 # AWS SDK for Python: create, update and delete AWS resources such as S3 buckets
from sagemaker.amazon.amazon_estimator import get_image_uri  # Looks up the container image URI of a built-in algorithm
from sagemaker.session import Session                        # Convenience methods for the entities and resources SageMaker uses, such as training jobs, endpoints and input datasets in S3
from sagemaker import get_execution_role                     # IAM role attached to the notebook instance

In [5]:
# Visualization libraries

from matplotlib import pyplot as plt
import seaborn as sns
sns.set()
from IPython.display import display
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Upgrade boto3 to avoid errors while importing data from or exporting data to S3
%pip install --upgrade boto3

### Creating the S3 bucket

The S3 bucket can also be created manually by going to the S3 management console and clicking on "Create bucket".

In [None]:
bucket_name = 'awsworkshop301' # <--- CHANGE THIS VARIABLE TO A UNIQUE NAME FOR YOUR BUCKET
my_region = boto3.session.Session().region_name # set the region of the instance
print(my_region)

In [None]:
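s3 = boto3.resource('s3')  # Resource handle used to access S3
try:
    if my_region == 'us-east-1':
        # us-east-1 is the default region and must not be passed as a LocationConstraint
        s3.create_bucket(Bucket=bucket_name)
    else:
        # Every other region has to be named explicitly
        s3.create_bucket(Bucket=bucket_name,
                         CreateBucketConfiguration={'LocationConstraint': my_region})
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ', e)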

**AWS Identity and Access Management (IAM)** roles are entities you create and assign specific permissions to, allowing trusted identities such as workforce identities and applications to perform actions in AWS. When your trusted identities assume IAM roles, they are granted only the permissions scoped by those IAM roles.

### Data Collection

We are using the diabetes dataset, which is split into two parts. The first part is read from the Git repository and the second part from the S3 bucket.

#### Importing first part of the data from git repository

In [None]:
data1=pd.read_csv('diabetes_first_data.csv')
data1.head()

#### Importing second part of the data from S3 bucket

In [None]:
#uploading the second data into s3

s3=boto3.resource('s3')
s3.meta.client.upload_file('diabetes_second_data.csv',bucket_name,'diabetes_second_data.csv')

In [None]:
# Loading the dataset from S3 (pandas reads s3:// paths through the s3fs package)

role=get_execution_role()
data_key='diabetes_second_data.csv'
data_location = 's3://{}/{}'.format(bucket_name, data_key)

data2=pd.read_csv(data_location)
data2.head()

#### Merging the datasets to get a complete data

In [None]:
merge_data=pd.merge(data1,data2, on='Test ID')
merge_data.head()
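
`pd.merge` performs an inner join by default, so any `Test ID` that appears in only one of the two files is dropped without warning. The following is a minimal sketch of a sanity check, using two small made-up frames in place of `data1` and `data2` (the values are illustrative and not taken from the workshop files):

```python
import pandas as pd

# Toy stand-ins for data1 and data2; the values are made up for illustration.
left = pd.DataFrame({"Test ID": [1, 2, 3], "Glucose": [148, 85, 183]})
right = pd.DataFrame({"Test ID": [1, 2, 4], "Outcome": [1, 0, 1]})

# An outer merge with indicator=True records which side each row came from;
# validate="one_to_one" raises an error if a Test ID is repeated on either side.
check = pd.merge(left, right, on="Test ID", how="outer",
                 indicator=True, validate="one_to_one")

# Rows that the default inner merge would have dropped.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Test ID", "_merge"]])
```

If `unmatched` is empty for the real data, the inner merge used above loses no records.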

### Data Preprocessing

In [None]:
#Removing the test id as it is not necessary for exploratory data analysis.

df=merge_data.drop('Test ID', axis='columns')
df.sample(10)

**'Pregnancies'** is the number of pregnancies to date.

**'Glucose'** is the plasma glucose concentration at 2 hours in an oral glucose tolerance test.

**'BloodPressure'** is the diastolic blood pressure, measured in millimeters of mercury (mm Hg).

**'SkinThickness'** is the triceps skin fold thickness, measured in millimeters (mm).

**'Insulin'** is the 2-hour serum insulin, measured in micro-units per millilitre (mu U/ml).

**'BMI'** is the body mass index (BMI) for weight in kg and height in m (kg/m^2).

**'DiabetesPedigreeFunction'** is a function that scores likelihood of diabetes based on family history, with a realistic range of 0.08 to 2.42.

**'Age'** of a person in years.

**'Outcome'** is the target class label, where 0 represents absence and 1 represents presence of diabetes.

In [None]:
# Displaying the number of entries, the names of the column attributes, the data type and the memory space used

df.info()

The dataset contains 768 rows and 9 columns. Seven columns hold integers (discrete values, including the 'Outcome' label) and two hold floats (continuous values: 'BMI' and 'DiabetesPedigreeFunction').

In [None]:
# Summary statistics of the attributes, including measures of central tendency and measures of dispersion

ab=df.describe() 
ab

In [None]:
# Exporting the summary table above to the S3 bucket as an Excel file
# (writing .xlsx to an s3:// path needs the openpyxl and s3fs packages)

describe_key='describe.xlsx'
describe_location='s3://{}/{}'.format(bucket_name, describe_key)
ab.to_excel(describe_location)   # describe() already returns a DataFrame

# The file can then be downloaded from the S3 bucket to a local machine.

**Q) For the Iris dataset in sklearn, compute the summary statistics and export them as an Excel file to the manually created S3 bucket.**

In [1]:
from sklearn import datasets

iris_data=pd.DataFrame(datasets.load_iris().data)
iris_data.columns=datasets.load_iris().feature_names
iris_data.head()

In [None]:
# Checking for null values 

df.isnull().sum().any()
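
`isnull()` only detects `NaN` values. In the commonly distributed Pima Indians diabetes data, missing measurements in columns such as Glucose, BloodPressure, SkinThickness, Insulin and BMI are recorded as 0, so counting zeros can complement the null check. The sketch below runs on a made-up frame; whether zeros encode missing values in this workshop's files is an assumption that should be checked against `df`:

```python
import pandas as pd

# Made-up sample; a value of 0 is physiologically implausible in these columns.
sample = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BloodPressure": [72, 66, 0],
    "BMI": [33.6, 0.0, 23.3],
})

# Count the zeros per column; isnull() would report none of these.
cols = ["Glucose", "BloodPressure", "BMI"]
zero_counts = (sample[cols] == 0).sum()
print(zero_counts)
```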

In [None]:
# Checking for duplicate rows

duplicated_rows = df[df.duplicated()]
duplicated_rows.shape

There are no duplicated rows in the dataset.

Even if duplicates had been found, they would not be dropped here. The dataset has no unique identifier for each patient, so two identical rows may belong to different people and cannot be treated as redundant. The check is still worth running to know whether such rows exist.

### Exploratory Data Analysis

EDA is an initial investigation of the data, carried out before formal modelling, that uses summary statistics and visualisations to discover patterns, check assumptions and test hypotheses. The main characteristics and hidden trends it reveals can help a doctor identify areas of concern, and addressing these can improve the accuracy of a diabetes diagnosis.

In [None]:
# Checking the outcome labels

df['Outcome'].value_counts()

In [None]:
# Plotting a count plot of the Outcome column

plt.figure(figsize=(7, 5))
sns.countplot(data=df, x='Outcome',palette="autumn",facecolor=(0, 0, 0, 0),linewidth=5,edgecolor=sns.color_palette("dark", 3))
plt.savefig("countplot.jpg")  # saving the plot as an image file on the notebook instance

In [None]:
#Exporting the image of this plot directly into the s3 bucket

s3=boto3.resource('s3')
s3.meta.client.upload_file('countplot.jpg',bucket_name,'countplot.jpg')

In [None]:
# Creating a pie chart of the percentage of diabetic and non-diabetic patients

fig, ax = plt.subplots()

labels = ['Diabetic', 'Non-Diabetic']
# Compute the shares from the data instead of hard-coding them (roughly 34.9% and 65.1%)
outcome_share = df['Outcome'].value_counts(normalize=True) * 100
percentages = [outcome_share[1], outcome_share[0]]
explode = (0.1, 0)
ax.pie(percentages, explode=explode, labels=labels, autopct='%1.0f%%',
       shadow=False, startangle=0,
       pctdistance=1.2, labeldistance=1.4)
ax.legend(frameon=False, bbox_to_anchor=(1.5,0.8))

**Q) Save the image of the pie chart and export it to the s3 bucket created manually.**

In [None]:
# Checking distribution of all features

df.hist(figsize=(12,10),grid=False)
sns.set_style('white')
plt.savefig("freqdist.jpg")    # saving the plot as an image file on the notebook instance

The histograms show that most of the attributes are positively skewed.

Furthermore, the distributions and the positions of their peaks suggest that diabetes patients generally have more pregnancies, higher Glucose and BMI readings, and are older.

In [None]:
#Exporting the image of these plots directly into the s3 bucket

s3=boto3.resource('s3')
s3.meta.client.upload_file('freqdist.jpg',bucket_name,'freqdist.jpg')

In [None]:
# First, we look at the effect of Age on the Outcome, since the chance of diabetes is commonly said to increase with age.

sns.boxplot(x = 'Outcome', y = 'Age', data = df)
plt.title('Age vs Outcome')
plt.show()

Yes, we were right: the median age of diabetic people is greater than that of non-diabetic people.

In [None]:
#Let's also check the effect of Blood Pressure on the Outcome.

sns.boxplot(x = 'Outcome', y = 'BloodPressure', data = df, palette = 'Blues')
plt.title('BP vs Outcome')
plt.show()

The median of the BloodPressure of diabetic people lies close to the 75th Percentile of non-diabetic people.

In [None]:
# One would also want to know the chances of getting diabetes if it is common in the family. We can check that with the Diabetes Pedigree Function.

my_pal = {0: "lightgreen", 1: "lightblue"}
sns.boxplot(x = 'Outcome', y = 'DiabetesPedigreeFunction', data = df, palette = my_pal)
plt.title('DPF vs Outcome')
plt.show()

Quite a large proportion of people with a high DPF do not end up having diabetes. Diabetic people usually have a DPF value close to 0.5 (the 50th percentile).

#### Glucose Level

In [None]:
my_pal = {0: "lightgrey", 1: "lightyellow"}
sns.boxplot(x = 'Outcome', y = 'Glucose', data = df, palette = my_pal)
plt.title('Glucose vs Outcome')
plt.show()

Wow! The median glucose level of diabetic people is greater than the 75th percentile of the glucose level of non-diabetic people. A high glucose level is therefore strongly associated with having diabetes.

#### Body Mass Index

Body mass index (BMI) is a measure of body fat based on height and weight that applies to adult men and women. Does a higher BMI lead to a higher chance of being diabetic? Let's check that out!

In [None]:
my_pal = {0: "lightyellow", 1: "lightpink"}
sns.boxplot(x = 'Outcome', y = 'BMI', data = df, palette = my_pal)
plt.title('BMI vs Outcome')
plt.show()

Indeed, the Median BMI of the Diabetic People is greater than the Median BMI of the Non-Diabetic people.
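
The medians read off the box plots can also be computed directly, which makes the comparisons easier to verify. This is a small sketch on a made-up frame (the values are illustrative only); on the real data the same `groupby` call would be applied to `df`:

```python
import pandas as pd

# Made-up values standing in for df.
toy = pd.DataFrame({
    "Outcome": [0, 0, 0, 1, 1, 1],
    "Age":     [22, 25, 31, 29, 41, 50],
    "BMI":     [24.0, 30.1, 27.5, 33.6, 35.2, 31.0],
})

# One row per outcome class, one column per feature.
medians = toy.groupby("Outcome")[["Age", "BMI"]].median()
print(medians)
```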

In [None]:
# Correlation matrix of the data

figure = plt.figure(figsize = (10, 10))
corr_matrix = df.corr().round(2)
sns.heatmap(data = corr_matrix, annot = True)
plt.savefig("corrheatmap.jpg")     # saving the plot as an image file on the notebook instance

# The weaker the correlation between predictors, the better: strongly correlated predictors carry redundant information

Almost all predictors have weak linear correlations with one another, which suggests that any relationships between them are more likely to be non-linear.

However, a few pairs stand out: the correlation between Pregnancies and Age is 0.54, between SkinThickness and BMI it is 0.39, and between Insulin and SkinThickness it is 0.44.

These related factors therefore deserve attention when assessing the risk of diabetes.

The rest of the analysis focuses mostly on the relationship between each feature and the target, the diabetes outcome, because accurate classification depends mainly on these correlations and their strengths.
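
To look at the feature-to-target correlations on their own, the `Outcome` column of the correlation matrix can be sorted. The sketch below uses a made-up frame as a stand-in for `df`; the numbers are illustrative and say nothing about the real dataset:

```python
import pandas as pd

# Made-up values standing in for df.
toy = pd.DataFrame({
    "Glucose": [85, 89, 116, 137, 148, 183],
    "Age":     [31, 21, 30, 33, 50, 32],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with the target, strongest first.
target_corr = (toy.corr()["Outcome"]
               .drop("Outcome")
               .sort_values(ascending=False))
print(target_corr.round(2))
```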

In [None]:
#Exporting the image of this plot directly into the s3 bucket

s3=boto3.resource('s3')
s3.meta.client.upload_file('corrheatmap.jpg',bucket_name,'corrheatmap.jpg')

### Building and Deploying Model

In [None]:
# set an output path where the trained model will be saved
prefix = 'xgboost-as-a-built-in-algo'
output_path ='s3://{}/{}/output'.format(bucket_name, prefix)
print(output_path)

In [None]:
### Train Test split

train_data, test_data = np.split(df.sample(frac=1, random_state=1729), [int(0.7 * len(df))])
print(train_data.shape, test_data.shape)      # features and label stay together here, unlike the usual X_train / y_train split

* There are 537 rows and 9 columns in the train data.
* There are 231 rows and 9 columns in the test data.
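
The row counts follow from the split index passed to `np.split`. A short check of the arithmetic, assuming the 768 rows reported for `df` earlier:

```python
n_rows = 768                     # number of rows reported for df
n_train = int(0.7 * n_rows)      # int() truncates 537.6 to 537
n_test = n_rows - n_train        # everything after the split index
print(n_train, n_test)           # 537 231
```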

When working with SageMaker's built-in algorithms, the dependent feature ('Outcome' in this case) must be the first column of the CSV data. We therefore concatenate the columns of the train and test data so that the first column holds the dependent feature.
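
The column reordering can be sketched in isolation. `label_first` is a hypothetical helper introduced here for illustration (it is not part of pandas or SageMaker), and the frame holds made-up values:

```python
import pandas as pd

def label_first(frame: pd.DataFrame, label: str) -> pd.DataFrame:
    """Return a new frame with the label column moved to position 0."""
    ordered = [label] + [c for c in frame.columns if c != label]
    return frame[ordered]

# Made-up rows standing in for train_data.
toy = pd.DataFrame({"Glucose": [148, 85], "BMI": [33.6, 26.6], "Outcome": [1, 0]})

# No index and no header, matching the to_csv options used in this notebook.
csv_text = label_first(toy, "Outcome").to_csv(index=False, header=False)
print(csv_text)
```

This produces the same layout as the `pd.concat` expression used in the cells that follow: label first, then the features in their original order.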

In [None]:
### Saving Train And Test Into Buckets
## We start with Train Data
import os
pd.concat([train_data['Outcome'], train_data.drop(['Outcome'], 
                                                axis=1)], 
                                                axis=1).to_csv('train.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
s3_input_train = sagemaker.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')  #creating the path for the training data

In [None]:
# Test Data Into Buckets
pd.concat([test_data['Outcome'], test_data.drop(['Outcome'], axis=1)], axis=1).to_csv('test.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')
s3_input_test = sagemaker.TrainingInput(s3_data='s3://{}/{}/test'.format(bucket_name, prefix), content_type='csv')   #creating the path for the test data

#### Building the Model with XGBoost, a Built-in Algorithm

In [None]:
# Look up the URI of the built-in XGBoost container image for the current region.
# Set the version to the one you prefer.

from sagemaker import image_uris

container = image_uris.retrieve('xgboost',
                                boto3.Session().region_name,
                                version='1.0-1')     # built-in XGBoost image, version 1.0-1

In [None]:
# Initialize hyperparameters
# These fixed values keep the model small and training short, which keeps the cost of model building low

hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"binary:logistic",
        "num_round":50
        }

In [None]:
# Construct a SageMaker estimator that uses the XGBoost container
# Press Shift+Tab inside the call to see the estimator's documentation

estimator = sagemaker.estimator.Estimator(image_uri=container,
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1,
                                          instance_type='ml.m5.2xlarge',  # general-purpose CPU instance
                                          volume_size=5,  # 5 GB
                                          output_path=output_path,
                                          use_spot_instances=True,
                                          max_run=300,
                                          max_wait=600)

In [None]:
# Training the model

estimator.fit({'train': s3_input_train,'validation': s3_input_test})

The trained model artifact has been saved to the S3 bucket under the output path set above.

#### Deploy Machine Learning Model As Endpoints

In [None]:
xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge')   #endpoints will be created

### Prediction on the Train Data

In [None]:
from sagemaker.serializers import CSVSerializer    # the request data is sent as CSV
train_data_array = train_data.drop(['Outcome'], axis=1).values     # load the features into an array
xgb_predictor.serializer = CSVSerializer()     # set the serializer type
predictions_train = xgb_predictor.predict(train_data_array).decode('utf-8')   # predict!
predictions_array_train = np.fromstring(predictions_train[1:], sep=',')   # and turn the prediction into an array
print(predictions_array_train.shape)

In [None]:
#Creating the confusion matrix on train data

cm_train = pd.crosstab(index=train_data['Outcome'], columns=np.round(predictions_array_train), rownames=['Observed'], colnames=['Predicted'])
tn_train = cm_train.iloc[0,0]
fn_train = cm_train.iloc[1,0]
tp_train = cm_train.iloc[1,1]
fp_train = cm_train.iloc[0,1]
p_train = (tp_train+tn_train)/(tp_train+tn_train+fp_train+fn_train)*100
print("\n{0:<20}{1:<4.1f}%\n".format("Overall Classification Rate: ", p_train))
print("{0:<15}{1:<15}{2:>8}".format("Predicted", "Negative", "Positive"))
print("Observed")
print("{0:<15}{1:<2.0f}% ({2:<}){3:>6.0f}% ({4:<})".format("Negative", tn_train/(tn_train+fn_train)*100,tn_train, fp_train/(tp_train+fp_train)*100, fp_train))
print("{0:<16}{1:<1.0f}% ({2:<}){3:>7.0f}% ({4:<}) \n".format("Positive", fn_train/(tn_train+fn_train)*100,fn_train, tp_train/(tp_train+fp_train)*100, tp_train))

### Prediction on the Test Data

In [None]:
from sagemaker.serializers import CSVSerializer    # the request data is sent as CSV
test_data_array = test_data.drop(['Outcome'], axis=1).values     # load the features into an array
xgb_predictor.serializer = CSVSerializer()     # set the serializer type
predictions = xgb_predictor.predict(test_data_array).decode('utf-8')   # predict!
predictions_array = np.fromstring(predictions[1:], sep=',')   # and turn the prediction into an array
print(predictions_array.shape)
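
The confusion matrices in this notebook turn probabilities into class labels with `np.round`, which rounds values exactly halfway between two integers to the nearest even one, so a probability of exactly 0.5 becomes 0. The sketch below makes the threshold explicit instead. It assumes the endpoint response is a comma-separated string of probabilities; `parse_probabilities`, `to_labels` and the sample string are hypothetical and only illustrate the idea:

```python
from typing import List

def parse_probabilities(body: str) -> List[float]:
    """Split a comma-separated response body into floats."""
    return [float(tok) for tok in body.strip().split(",") if tok.strip()]

def to_labels(probs: List[float], threshold: float = 0.5) -> List[int]:
    """Label a row 1 when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]

sample_body = "0.12,0.87,0.50,0.33"   # made-up response, not real endpoint output
probs = parse_probabilities(sample_body)
labels = to_labels(probs)
print(labels)
```

Lowering `threshold` trades more false positives for fewer missed diabetic cases, which may be the preferable error in a screening setting.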

In [None]:
#Creating the confusion matrix on test data

cm = pd.crosstab(index=test_data['Outcome'], columns=np.round(predictions_array), rownames=['Observed'], colnames=['Predicted'])
tn = cm.iloc[0,0]
fn = cm.iloc[1,0]
tp = cm.iloc[1,1]
fp = cm.iloc[0,1]
p = (tp+tn)/(tp+tn+fp+fn)*100
print("\n{0:<20}{1:<4.1f}%\n".format("Overall Classification Rate: ", p))
print("{0:<15}{1:<15}{2:>8}".format("Predicted", "Negative", "Positive"))
print("Observed")
print("{0:<15}{1:<2.0f}% ({2:<}){3:>6.0f}% ({4:<})".format("Negative", tn/(tn+fn)*100,tn, fp/(tp+fp)*100, fp))
print("{0:<16}{1:<1.0f}% ({2:<}){3:>7.0f}% ({4:<}) \n".format("Positive", fn/(tn+fn)*100,fn, tp/(tp+fp)*100, tp))
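
The table printed above expresses each cell as a share of its predicted column. Accuracy, precision and recall are often easier to compare between the train and test sets. This is a minimal sketch with made-up counts; `classification_metrics` is a hypothetical helper, and on the real data the arguments would be the `tn`, `fp`, `fn` and `tp` values computed from the confusion matrix:

```python
def classification_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        # Share of predicted positives that are truly positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of actual positives that the model found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up counts for illustration only.
metrics = classification_metrics(tn=130, fp=20, fn=30, tp=51)
for name, value in metrics.items():
    print("{:<10}{:.1%}".format(name, value))
```

A large gap between the train and test values of these metrics would point to overfitting.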

### Deleting the endpoints

Once you have finished predicting from the endpoint, do not leave it running, because charges continue to accrue for as long as it exists. Delete the endpoint as soon as it is no longer needed to avoid extra charges.

In [None]:
xgb_predictor.delete_endpoint()                                  # Deleting the endpoint
bucket_to_delete=boto3.resource('s3').Bucket(bucket_name)
bucket_to_delete.objects.all().delete()                          # Emptying the bucket (a bucket must be empty before it can be deleted)
bucket_to_delete.delete()                                        # Deleting the bucket

The process of deleting endpoints and s3 bucket can also be done manually.