# Lab 01 Setup

This notebook works well with the `Python 3 (Data Science)` kernel on SageMaker Studio.

---

## Contents

1. [Prerequisites](#Prerequisites)
1. [Read CSV dataset](#Read-CSV-dataset)
1. [Prepare and upload training data to Amazon S3](#Prepare-and-upload-training-data-to-Amazon-S3)

### Prerequisites

> Run the [Setup and Data Preparation](../0.setup/setup_and_data_prep.ipynb) notebook before starting Autopilot experiment jobs.

In this notebook, we split the downloaded dataset and upload the splits to Amazon S3 so that we are ready to kick off an Autopilot experiment.

### Read CSV dataset


In [None]:
import pandas as pd

import sagemaker

session = sagemaker.Session()
role = sagemaker.get_execution_role()

#### Restore the shared variables

In [None]:
%store -r bucket
%store -r prefix
%store -r data_folder
%store -r data_file_path

lab_ap_prefix = f"{prefix}/autopilot"
lab_ap_prefix

#### Read the dataset into a Pandas data frame 

In [None]:
data = pd.read_csv(data_file_path)
with pd.option_context("display.max_columns", 500):
    # Make sure we can see all of the columns
    display(data)

Note that there are 20 features to help predict the target column 'y'.

Amazon SageMaker Autopilot takes care of preprocessing your data for you. You do not need to apply conventional data preprocessing techniques such as handling missing values, converting categorical features to numeric features, scaling data, or handling more complicated data types.
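
Even though no manual preprocessing is required, it can help to see what Autopilot will be dealing with. The following is a minimal sketch, assuming a small hypothetical frame in place of `data`; the column names `age` and `job` are illustrative only, and just the target name `y` comes from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; only the target name "y" comes from this notebook.
sample = pd.DataFrame(
    {
        "age": [34, 51, np.nan, 29],
        "job": ["admin", "technician", None, "services"],
        "y": ["no", "yes", "no", "no"],
    }
)

# Summarize the type, missing-value count, and distinct-value count of each column.
summary = pd.DataFrame(
    {
        "dtype": sample.dtypes.astype(str),
        "missing": sample.isna().sum(),
        "unique": sample.nunique(),
    }
)
print(summary)
```

Running the same summary on `data` instead of `sample` shows which columns are categorical or contain gaps.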




### Prepare and upload training data to Amazon S3
Splitting the dataset into training and validation sets is also unnecessary, because Autopilot does this for you. You may, however, want to hold out a test set. The next step does that, although here the held-out data is used for batch inference at the end rather than for evaluating the model.


#### Reserve some data for calling batch inference on the model 

Divide the data into an 80% training split and a 20% testing split. The training split is used by SageMaker Autopilot. The testing split is reserved to perform inference using the suggested model.


In [None]:
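# Randomly sample 80% of the rows for training; random_state makes the split reproducible
train_data = data.sample(frac=0.8, random_state=200)

# The remaining 20% of the rows form the test split
test_data = data.drop(train_data.index)

# Drop the target column so the test data can be used as inference input
test_data_no_target = test_data.drop(columns=["y"])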
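
The sample-then-drop pattern used for the split can be sanity-checked on a small frame. This is a sketch under stated assumptions: `toy` is a hypothetical 10-row frame standing in for `data`, and the feature column `x` is made up for illustration.

```python
import pandas as pd

# Hypothetical 10-row frame standing in for `data`.
toy = pd.DataFrame({"x": range(10), "y": ["no", "yes"] * 5})

train = toy.sample(frac=0.8, random_state=200)  # random 80% of the rows
test = toy.drop(train.index)                    # every row not sampled above

# The two splits should share no rows and together cover the whole frame.
assert train.index.intersection(test.index).empty
assert len(train) + len(test) == len(toy)

print(len(train), len(test))
```

This relies on the index being unique, which holds for the default `RangeIndex` that `pd.read_csv` produces.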

<b> Upload the prepared training dataset to Amazon S3 for the Autopilot experiment </b>

Save the data as a .csv file and copy it to Amazon Simple Storage Service (Amazon S3) for Amazon SageMaker training to use.

**Note down the S3 object URI stored in the variable `train_data_s3_path`; you will need it when creating the Autopilot job in the SageMaker Studio UI.**

In [None]:
train_file = f"{data_folder}/train_data.csv"
train_data.to_csv(train_file, index=False, header=True)
train_data_s3_path = session.upload_data(path=train_file, key_prefix=lab_ap_prefix + "/train")
print("Train data uploaded to: " + train_data_s3_path)

test_file = f"{data_folder}/test_data_without_label.csv"
test_data_no_target.to_csv(test_file, index=False, header=False)
test_data_without_label_s3_path = session.upload_data(path=test_file, key_prefix=lab_ap_prefix + "/test")
print("Test data without label uploaded to: " + test_data_without_label_s3_path)

test_file_with_label = f"{data_folder}/test_data_with_label.csv"
test_data.to_csv(test_file_with_label, index=False, header=True)
# test_data_with_label_s3_path = session.upload_data(path=test_file_with_label, key_prefix=lab_ap_prefix + "/test")
# print("Test data with label uploaded to: " + test_data_with_label_s3_path)
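
The training file is written with a header row, while the unlabeled test file is written with `header=False`, presumably so that each row can be passed to the model as-is during batch inference. The sketch below shows the difference on a hypothetical two-row frame; `toy` and the helper `to_csv_text` are illustrative names, not part of this lab.

```python
import io

import pandas as pd

# Hypothetical frame standing in for `test_data`.
toy = pd.DataFrame({"x": [1, 2], "y": ["no", "yes"]})


def to_csv_text(df, header):
    """Return the CSV text that DataFrame.to_csv would write, without the index."""
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=header)
    return buf.getvalue()


# Labeled data keeps the column names in the first row.
with_label = to_csv_text(toy, header=True)

# Unlabeled data drops the target column and the header row.
without_label = to_csv_text(toy.drop(columns=["y"]), header=False)

print(with_label.splitlines())
print(without_label.splitlines())
```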

#### Store the shared variables

In [None]:
%store lab_ap_prefix
%store train_data_s3_path
%store test_data_without_label_s3_path
%store test_file_with_label

### Next

Now that the training data is stored in an S3 bucket, we are ready to kick off an Autopilot experiment.
