# Setup

Start by specifying:
- The SageMaker role ARN used to give training and hosting access to your data. The snippet below uses the same role as your SageMaker notebook instance, if you are using one. Otherwise, specify the full ARN of a role with the AmazonSageMakerFullAccess policy attached.
- The S3 bucket that you want to use for training and storing model objects.

In [None]:
!pip install sagemaker

In [None]:
import os
import boto3
import re
import sagemaker 

role = sagemaker.get_execution_role()
region = boto3.Session().region_name

# S3 bucket used for storing code, training data, and model artifacts
bucket = sagemaker.Session().default_bucket()

# Key prefix under which the training files are uploaded in the S3 bucket
prefix = "sagemaker/DEMO-breast-cancer-prediction"

# Now we import Python libraries and dependencies

In [None]:
import pandas as pd
import numpy as np
import io
import time
import json
import sagemaker.amazon.common as smac

# Data sources

**Breast Cancer Wisconsin dataset:**
Also available from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

In [None]:
s3 = boto3.client("s3")

filename = "data.csv"
# download_file takes the bucket, the object key, and the local file name
s3.download_file("sagemaker-sample-files", "/kaggle/input/breast-cancer-wisconsin-data/data.csv", filename)
data = pd.read_csv(filename, header=None)

# Specify the column names
data.columns = [
    "id",
    "diagnosis",
    "radius_mean",
    "texture_mean",
    
    
    
    
    
]

# Save the data with column names
data.to_csv("data.csv", sep=",", index=False)

# shape of data file
print(data.shape)

# top few rows
display(data.head())

# describe data objects
display(data.describe())

# summarize the categorical field "diagnosis"
display(data.diagnosis.value_counts())

# Key observations:
- 569 observations
- 32 columns
- The first field is "id"
- The second field is "diagnosis" ("M" = malignant, "B" = benign)
- 30 other numeric features are available

# Create features and labels
Split the data into 80% training, 10% validation, and 10% testing.

In [None]:
# Draw one uniform random number per row and use it to assign the row to a subset
rand_split = np.random.rand(len(data))
train_list = rand_split < 0.8
val_list = (rand_split >= 0.8) & (rand_split < 0.9)
test_list = rand_split >= 0.9

data_train = data[train_list]
data_val = data[val_list]
data_test = data[test_list]

# Labels: 1 for malignant ("M"), 0 for benign; features: every column after "diagnosis"
train_y = ((data_train.iloc[:, 1] == "M") + 0).to_numpy()
train_X = data_train.iloc[:, 2:].to_numpy()

val_y = ((data_val.iloc[:, 1] == "M") + 0).to_numpy()
val_X = data_val.iloc[:, 2:].to_numpy()

test_y = ((data_test.iloc[:, 1] == "M") + 0).to_numpy()
test_X = data_test.iloc[:, 2:].to_numpy()
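
The split above uses unseeded random numbers, so the subsets change on every run. As a minimal sketch, assuming a repeatable split is wanted, a seeded NumPy generator can be used instead. The seed value and the names `n_rows`, `train_mask`, `val_mask`, and `test_mask` are illustrative and not part of the original notebook.

```python
import numpy as np

# Hypothetical stand-in for len(data); the dataset described above has 569 rows.
n_rows = 569

# A seeded generator makes the split repeatable across notebook runs.
rng = np.random.default_rng(seed=0)
rand_split = rng.random(n_rows)

train_mask = rand_split < 0.8
val_mask = (rand_split >= 0.8) & (rand_split < 0.9)
test_mask = rand_split >= 0.9

# Every row lands in exactly one of the three subsets.
membership = train_mask.astype(int) + val_mask.astype(int) + test_mask.astype(int)
assert (membership == 1).all()

print("train/val/test rows:", train_mask.sum(), val_mask.sum(), test_mask.sum())
```

Because each row is assigned independently, the subset sizes are only approximately 80/10/10.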

We then convert the datasets to the recordIO-wrapped protobuf format used by the Amazon SageMaker algorithms and upload the data to the S3 bucket. We start with the training data.

In [None]:
train_file = "linear_train.data"

f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, train_X.astype("float32"), train_y.astype("float32"))
f.seek(0)

boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "train", train_file)
).upload_fileobj(f)

# Next, convert and upload the validation dataset

In [None]:
validation_file = "linear_validation.data"

f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, val_X.astype("float32"), val_y.astype("float32"))
f.seek(0)

boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "validation", validation_file)
).upload_fileobj(f)
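
The upload cells build the S3 object key with `os.path.join`, which uses the local path separator. That is fine on a Linux notebook instance, but on Windows it would produce backslashes in the key. A minimal sketch of a portable alternative, assuming the same prefix and file names as above, uses `posixpath.join`. The helper name `s3_key` is hypothetical.

```python
import posixpath

# Same prefix value as in the setup cell above.
prefix = "sagemaker/DEMO-breast-cancer-prediction"


def s3_key(prefix, channel, file_name):
    """Build an S3 object key with forward slashes on any operating system."""
    return posixpath.join(prefix, channel, file_name)


train_key = s3_key(prefix, "train", "linear_train.data")
validation_key = s3_key(prefix, "validation", "linear_validation.data")

print(train_key)
print(validation_key)
```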


# Training the model
- Specify a linear model.
- Amazon SageMaker's Linear Learner can fit multiple models in parallel, each constructed with different hyperparameters, and then returns the model that performs best.
- The best model is selected automatically.
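
The idea of fitting several candidate models and keeping the best one can be illustrated with a small NumPy sketch. This is not how Linear Learner is implemented internally. It assumes synthetic data, closed-form ridge regression, and selection by validation error. The names `fit_ridge`, `mse`, and `candidates` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical, not the breast cancer dataset).
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=200)

X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]


def fit_ridge(X, y, l2):
    """Closed-form ridge regression: w = (X^T X + l2 * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(n_features), X.T @ y)


def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))


# One candidate model per regularization strength.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
results = {l2: mse(X_val, y_val, fit_ridge(X_train, y_train, l2)) for l2 in candidates}

# Keep the candidate with the lowest validation error.
best_l2 = min(results, key=results.get)
print("validation MSE per l2:", results)
print("selected l2:", best_l2)
```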

- Parameters that we can control include:

1. loss - controls how the model is penalized for mistakes in its estimates. Absolute loss is used here because it is less sensitive to outliers.
2. num_models - controls the number of models trained in parallel. The algorithm chooses models with nearby parameter values in order to find an optimal solution. The maximum of 32 was used.
3. wd or l1 - control regularization, which helps prevent overfitting. These were left at the default "auto" value.
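
To see why absolute loss is less sensitive to outliers than squared loss, the following sketch compares how much of the total loss a single large error contributes under each. The residual values are made up for illustration.

```python
import numpy as np

# Hypothetical residuals (prediction minus target); the last one is an outlier.
residuals = np.array([0.5, -0.3, 0.2, -0.4, 10.0])

squared_loss = residuals ** 2
absolute_loss = np.abs(residuals)

# Share of the total loss contributed by the single outlier.
outlier_share_squared = squared_loss[-1] / squared_loss.sum()
outlier_share_absolute = absolute_loss[-1] / absolute_loss.sum()

print(f"outlier share of squared loss:  {outlier_share_squared:.3f}")
print(f"outlier share of absolute loss: {outlier_share_absolute:.3f}")
```

The outlier dominates the squared loss more strongly, so a model trained on squared loss is pulled further toward fitting that one point.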

# Specify container images used by SageMaker's linear learner for hosting and training

In [None]:
from sagemaker import image_uris
container = image_uris.retrieve(region=boto3.Session().region_name, framework="linear-learner")

In [None]:
linear_job = "DEMO-linear-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

print("Job is called:", linear_job)

linear_training_params = {
    "RoleArn": role,
    "TrainingJobName": linear_job,
    "AlgorithmSpecification": {"TrainingImage": container, "TrainingInputMode": "File"},
    "ResourceConfig": {"InstanceCount": 1, "InstanceType": "ml.c4.2xlarge", "VolumeSizeInGB": 10},
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/train/".format(bucket, prefix),
                    "S3DataDistributionType": "ShardedByS3Key",
                }
            },
            "CompressionType": "None",
            "RecordWrapperType": "None",
        },
        },
        {
            