Assignment:
 
Using SageMaker Studio:
 
**1) Write a notebook to read all CSV files in an S3 bucket** 

**2) Combine the files into a single dataframe**

**3) Apply native AWS algorithms from SageMaker using**

    a) Autopilot (automated ML)
    b) A loop over multiple algorithms that compares them. Use parallelism for data and models.
    
**4) Select the best algorithm**

**5) Register the Model in SageMaker**

**6) Ensure you are NOT using a single machine but a cluster to leverage parallelism**

### Importing Necessary Libraries

In [2]:
import pandas as pd
import sagemaker
from sagemaker import get_execution_role

import boto3
import sys

import os

from smart_open import smart_open

### Accessing S3 Bucket

In [3]:
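# The key strings below are placeholders. Omitting both key arguments makes
# boto3 fall back on its default credential chain (for example, the
# notebook's execution role), which avoids hard-coding secrets.
s3_bucket_name = 'BUCKET_NAME'
s3_client = boto3.client('s3',
                    aws_access_key_id='ACCESS_KEY_ID',
                    aws_secret_access_key='SECRET_ACCESS_KEY')
s3 = boto3.resource('s3',
                    aws_access_key_id='ACCESS_KEY_ID',
                    aws_secret_access_key='SECRET_ACCESS_KEY')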

### Reading the CSV

In [4]:
my_bucket = s3.Bucket(s3_bucket_name)

# Collect one dataframe per CSV file, then concatenate once at the end
frames = []

print("CSV FILE FOUND")
print('-'*16)

idx = 0

for file in my_bucket.objects.filter(Prefix = 'bank-additional/'):
    file_name = file.key
    # Match on the file extension rather than any occurrence of ".csv"
    if file_name.endswith(".csv"):
        obj = s3_client.get_object(
            Bucket = s3_bucket_name,
            Key = file_name
        )

        idx += 1
        print(idx,')',file_name)

        frames.append(pd.read_csv(obj['Body']))

# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
data = pd.concat(frames) if frames else pd.DataFrame()
print('-'*16)
print('Total Files Found = ', idx)

CSV FILE FOUND
----------------
1 ) bank-additional/bank-additional/bank-additional-full.csv
2 ) bank-additional/bank-additional/bank-additional.csv
----------------
Total Files Found =  2
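
The key filter in the loop above can be pulled out into a small pure function, which makes it easy to check without touching S3. This is only a sketch: `select_csv_keys` is a hypothetical helper name, and the third example key is invented for illustration (the first two are the keys printed above).

```python
def select_csv_keys(keys):
    """Return the keys that name CSV objects, preserving their order."""
    return [key for key in keys if key.lower().endswith(".csv")]


# Keys as printed in the listing above, plus one hypothetical non-CSV object
example_keys = [
    "bank-additional/bank-additional/bank-additional-full.csv",
    "bank-additional/bank-additional/bank-additional.csv",
    "bank-additional/bank-additional/notes.txt",  # hypothetical key
]

csv_keys = select_csv_keys(example_keys)
print(len(csv_keys), "CSV keys selected")
```

In the notebook, the input could be built from the bucket listing, for example `[f.key for f in my_bucket.objects.filter(Prefix='bank-additional/')]`.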


In [5]:
data[:10]

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
5,45,services,married,basic.9y,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
6,59,admin.,married,professional.course,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
7,41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
8,24,technician,single,professional.course,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
9,25,services,single,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


### Uploading the CSV to an S3 bucket

In [7]:
prefix = 'sagemaker/autopilot-assignment/input'
sess = sagemaker.Session()

uri = sess.upload_data(path = './bank-additional/bank-additional-full.csv', key_prefix = prefix)
print(uri)

s3://sagemaker-us-east-1-313830654669/sagemaker/autopilot-assignment/input/bank-additional-full.csv


### SageMaker Autopilot: Predicting using the Best Model

In [8]:
ep_name = 'autopilot-assignment'
sm_rt = boto3.Session().client('runtime.sagemaker')

tn = tp = fn = fp = count = 0

print('Number of Samples predicted: ')

with open('bank-additional/bank-additional-full.csv') as f:
    lines = f.readlines()
    for line in lines[1:2000]:             # Skip header; score the next 1,999 rows
        fields = line.strip().split(',')   # Split CSV line into fields
        label = fields[-1]                 # Store 'yes'/'no' label
        payload = ','.join(fields[:-1])    # Rebuild CSV line without label

        response = sm_rt.invoke_endpoint(EndpointName=ep_name,
                                         ContentType='text/csv',
                                         Accept='text/csv', Body=payload)

        # The endpoint returns the predicted label as CSV text
        response = response['Body'].read().decode("utf-8")

        if 'yes' in label:
            # Sample is positive
            if 'yes' in response:
                # True positive
                tp=tp+1
            else:
                # False negative
                fn=fn+1
        else:
            # Sample is negative
            if 'no' in response:
                # True negative
                tn=tn+1
            else:
                # False positive
                fp=fp+1
        count = count+1
        if (count % 100 == 0):
            # Progress marker every 100 predictions
            sys.stdout.write(str(count)+' --> ')

print("Done")


Number of Samples predicted: 
100 --> 200 --> 300 --> 400 --> 500 --> 600 --> 700 --> 800 --> 900 --> 1000 --> 1100 --> 1200 --> 1300 --> 1400 --> 1500 --> 1600 --> 1700 --> 1800 --> 1900 --> Done
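
The prediction loop splits each row on bare commas, which would break if a field ever contained a quoted comma. A minimal sketch of a more defensive split using the standard `csv` module follows; `split_features_and_label` is a hypothetical helper name, and the sample row is invented for illustration rather than taken from the dataset.

```python
import csv
import io


def split_features_and_label(line):
    """Split one CSV row into (feature payload, label).

    The label is assumed to be the last field, as in the loop above.
    """
    fields = next(csv.reader([line.strip()]))
    label = fields[-1]

    # Re-serialize the features so quoted fields stay correctly quoted
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="").writerow(fields[:-1])
    return buffer.getvalue(), label


# Hypothetical row with a comma inside a quoted field
payload, label = split_features_and_label('56,"admin, senior",married,no\n')
print(payload)
print(label)
```

The returned `payload` could then be passed as `Body` to `invoke_endpoint` in place of the manually joined string.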


In [10]:
accuracy  = (tp+tn)/(tp+tn+fp+fn)
precision = tp/(tp+fp)
recall    = tp/(tp+fn)
f1        = (2*precision*recall)/(precision+recall)

print('-'*30)
print('ACCURACY: ', accuracy)
print('-'*30)
print('PRECISION: ', precision)
print('-'*30)
print('RECALL: ', recall)
print('-'*30)
print('F1: ', f1)
print('-'*30)

------------------------------
ACCURACY:  0.9714857428714357
------------------------------
PRECISION:  0.4157303370786517
------------------------------
RECALL:  0.8809523809523809
------------------------------
F1:  0.5648854961832062
------------------------------
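
The metric formulas above divide by counts that can be zero, for example when the model predicts no positives at all. The sketch below wraps them in a hypothetical helper, `classification_metrics`, that returns 0.0 instead of raising in that case. The example counts are inferred from the metrics printed above for 1,999 samples; the notebook itself does not print them.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""

    def safe_div(numerator, denominator):
        # Return 0.0 when the denominator is zero instead of raising
        return numerator / denominator if denominator else 0.0

    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts inferred from the reported metrics (an assumption, not notebook output)
metrics = classification_metrics(tp=37, tn=1905, fp=52, fn=5)
for name, value in metrics.items():
    print(name.upper() + ": ", value)
```

The gap between the high accuracy and the low precision reflects the class imbalance in these rows: with few positive samples, the 52 false positives outnumber the 37 true positives even though most negatives are classified correctly.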
