# Pipeline of Digits

This is a starting notebook for solving the "Pipeline of Digits" assignment.


This notebook was created by [Santiago L. Valdarrama](https://twitter.com/svpino) as part of the [Machine Learning School](https://www.ml.school) program.

Let's make sure we are running the required version of the SageMaker SDK. **Restart the notebook** after you upgrade the library.

In [None]:
!pip install -q --upgrade pip
!pip install -q --upgrade awscli boto3
!pip install -q --upgrade PyYAML==6.0
!pip install -q --upgrade sagemaker==2.165.0
!pip install -q --upgrade scikit-learn==1.3.0
!pip show sagemaker
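
After restarting the kernel, it can help to confirm which versions are actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the `pip install` lines above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for each distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            versions[name] = None
    return versions


# Report the packages that the install cell pins or upgrades.
for name, version in installed_versions(["sagemaker", "scikit-learn", "PyYAML", "boto3"]).items():
    print(f"{name}: {version or 'not installed'}")
```

If a version differs from the one pinned above, the kernel was probably not restarted after the upgrade.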

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os
import sys
import boto3
import sagemaker
import pandas as pd

from pathlib import Path
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.parameters import ParameterInteger, ParameterString


role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()

CODE_FOLDER = Path('code')
os.makedirs(CODE_FOLDER, exist_ok=True)
sys.path.append('code')


## Creating the S3 Bucket

Let's create an S3 bucket where you will upload all the information generated by the pipeline. Make sure you set `BUCKET` to the name of the bucket you want to use. Bucket names must be globally unique across all AWS accounts.

The cell below passes `--create-bucket-configuration`, which is required when you create a bucket in any region other than `us-east-1`. The `LocationConstraint` argument specifies the region where you want to create the bucket. If your region is `us-east-1`, omit that option and use this command instead:

```
!aws s3api create-bucket --bucket $BUCKET
```
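
The region rule can also be expressed in Python. This is a minimal sketch, not part of the original notebook: `create_bucket_kwargs` is a hypothetical helper that builds the keyword arguments for boto3's `create_bucket` call, adding `CreateBucketConfiguration` only outside `us-east-1`.

```python
def create_bucket_kwargs(bucket, region):
    """Build keyword arguments for an S3 create_bucket call (hypothetical helper)."""
    kwargs = {"Bucket": bucket}
    # us-east-1 is the default location and must not be passed as a LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


print(create_bucket_kwargs("vmate-mnist", "eu-west-1"))
print(create_bucket_kwargs("vmate-mnist", "us-east-1"))

# With AWS credentials configured, the arguments could be passed to boto3 like this:
# import boto3
# boto3.client("s3", region_name=region).create_bucket(**create_bucket_kwargs(BUCKET, region))
```

Keeping the argument construction separate from the API call makes the rule easy to check without touching AWS.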

In [None]:
BUCKET = 'vmate-mnist'

!aws s3api create-bucket --bucket $BUCKET --create-bucket-configuration LocationConstraint=$region

## Loading the dataset

We have two CSV files containing the MNIST dataset. These files come from the [MNIST in CSV](https://www.kaggle.com/datasets/oddrationale/mnist-in-csv) Kaggle dataset.

The `mnist_train.csv` file contains 60,000 training examples and labels. The `mnist_test.csv` contains 10,000 test examples and labels. Each row consists of 785 values: the first value is the label (a number from 0 to 9) and the remaining 784 values are the pixel values (a number from 0 to 255).
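
To make the row layout concrete, here is a small sketch that splits one row into its label and a 28x28 grid of pixels. It uses a synthetic row rather than the real files, and `split_row` is a hypothetical helper name.

```python
def split_row(row, side=28):
    """Split a 785-value MNIST CSV row into (label, side x side pixel grid)."""
    if len(row) != side * side + 1:
        raise ValueError(f"expected {side * side + 1} values, got {len(row)}")
    label, pixels = row[0], row[1:]
    # Pixels are stored row by row, so every `side` consecutive values form one image row.
    grid = [pixels[i:i + side] for i in range(0, len(pixels), side)]
    return label, grid


# Synthetic example: label 7 followed by 784 pixel values, all 0 except one bright pixel.
row = [7] + [0] * 784
row[1 + 28 * 3 + 5] = 255  # image row 3, column 5 (zero-based)

label, grid = split_row(row)
print(label, len(grid), len(grid[0]), grid[3][5])
```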

Let's extract the `dataset.tar.gz` file.

In [None]:
MNIST_FOLDER = 'mnist'
DATASET_FOLDER = Path('dataset')

!tar -xvzf dataset.tar.gz --no-same-owner

Let's load the first 10 rows of the training set.

In [None]:
df = pd.read_csv(DATASET_FOLDER / 'mnist_train.csv', nrows=10)
df

## Uploading dataset to S3

In [None]:
S3_FILEPATH = f"s3://{BUCKET}/{MNIST_FOLDER}"


TRAIN_SET_S3_URI = sagemaker.s3.S3Uploader.upload(
    local_path=str(DATASET_FOLDER / "mnist_train.csv"), 
    desired_s3_uri=S3_FILEPATH,
)

TEST_SET_S3_URI = sagemaker.s3.S3Uploader.upload(
    local_path=str(DATASET_FOLDER / "mnist_test.csv"), 
    desired_s3_uri=S3_FILEPATH,
)

print(f"Train set S3 location: {TRAIN_SET_S3_URI}")
print(f"Test set S3 location: {TEST_SET_S3_URI}")

In [None]:
%%writefile {CODE_FOLDER}/preprocessor.py

import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

DEFAULT_BASE_DIR = Path('/opt') / 'ml' / 'processing'


def _locate(data_dir, channel, filename):
    # Local runs keep both CSV files in one flat folder, while the Processing job
    # downloads each file into its own subfolder (e.g. <data_dir>/train).
    flat = data_dir / filename
    return flat if flat.exists() else data_dir / channel / filename


def preprocess(base_dir=None, data_filepath=DEFAULT_BASE_DIR):
    # Fall back to the SageMaker Processing directory when no base_dir is given.
    base_dir = Path(DEFAULT_BASE_DIR if base_dir is None else base_dir)
    data_filepath = Path(data_filepath)

    # Create one output folder per split.
    for split in ('train', 'validation', 'test'):
        (base_dir / split).mkdir(parents=True, exist_ok=True)

    df = pd.read_csv(_locate(data_filepath, 'train', 'mnist_train.csv'))
    df_test = pd.read_csv(_locate(data_filepath, 'test', 'mnist_test.csv'))

    # Separate the label column from the pixel columns.
    y = df.pop('label')
    X = df

    # Hold out 20% of the training examples for validation.
    X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.2, random_state=7)

    # Write each split with the label as the last column.
    pd.concat([X_train, y_train], axis=1).to_csv(base_dir / 'train' / 'mnist_train.csv', header=True, index=False)
    pd.concat([X_validation, y_validation], axis=1).to_csv(base_dir / 'validation' / 'mnist_validation.csv', header=True, index=False)

    # The test set is copied through unchanged.
    df_test.to_csv(base_dir / 'test' / 'mnist_test.csv', header=True, index=False)


if __name__ == "__main__":
    preprocess(base_dir=DEFAULT_BASE_DIR)
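
With `test_size=0.2`, the script above holds out one fifth of the 60,000 training rows for validation. The arithmetic can be sketched as follows; `split_sizes` is a hypothetical helper, and rounding the held-out count up is an assumption about how scikit-learn treats a fractional `test_size`.

```python
import math


def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_validation) for a fractional hold-out size."""
    # Assumption: the held-out count is rounded up and the rest goes to training.
    n_validation = math.ceil(test_size * n_samples)
    return n_samples - n_validation, n_validation


print(split_sizes(60_000))
```

The two numbers should match the row counts of `mnist_train.csv` and `mnist_validation.csv` written by the script.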

In [None]:
%%writefile {CODE_FOLDER}/train.py

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

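# Each example is a flat vector of 784 pixel values. `input_shape` must be a
# tuple and is only needed on the first layer; the first argument of `Dense`
# is the number of units the layer outputs.
# Assumption: the intended architecture is 784 inputs -> 128 hidden units -> 10 classes.
model = Sequential([
    Dense(128, input_shape=(784,), activation="relu"),
    Dense(10, activation="softmax"),
])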

In [38]:
import tempfile  
from preprocessor import preprocess


with tempfile.TemporaryDirectory() as directory:
    preprocess(
        base_dir=directory,
        data_filepath=Path(DATASET_FOLDER),
    )
    print(f'Folders: {os.listdir(directory)}')
    print(pd.read_csv(Path(directory) / 'train' / 'mnist_train.csv').head())
    print(pd.read_csv(Path(directory) / 'validation' / 'mnist_validation.csv').head())
    print(pd.read_csv(Path(directory) / 'test' / 'mnist_test.csv').head())

Folders: ['train', 'validation', 'test']
   1x1  1x2  1x3  1x4  1x5  1x6  1x7  1x8  1x9  1x10  ...  28x20  28x21  \
0    0    0    0    0    0    0    0    0    0     0  ...      0      0   
1    0    0    0    0    0    0    0    0    0     0  ...      0      0   
2    0    0    0    0    0    0    0    0    0     0  ...      0      0   
3    0    0    0    0    0    0    0    0    0     0  ...      0      0   
4    0    0    0    0    0    0    0    0    0     0  ...      0      0   

   28x22  28x23  28x24  28x25  28x26  28x27  28x28  label  
0      0      0      0      0      0      0      0      2  
1      0      0      0      0      0      0      0      8  
2      0      0      0      0      0      0      0      0  
3      0      0      0      0      0      0      0      8  
4      0      0      0      0      0      0      0      3  

[5 rows x 785 columns]
   1x1  1x2  1x3  1x4  1x5  1x6  1x7  1x8  1x9  1x10  ...  28x20  28x21  \
0    0    0    0    0    0    0    0    0    0   

In [None]:

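sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    # Assumption: ml.t3.medium replaces ml.m3.medium, which does not appear to be
    # an instance type offered for Processing jobs.
    instance_type="ml.t3.medium",
    instance_count=1,
)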

train_data = ParameterString(
    name="TrainData",
    default_value=TRAIN_SET_S3_URI,
)    

test_data = ParameterString(
    name="TestData",
    default_value=TEST_SET_S3_URI,
)    


preprocessing_step = ProcessingStep(
    name='mnist_preprocessing',
    processor=sklearn_processor,
    # Each input file is downloaded into its own folder inside the container.
    inputs=[
        ProcessingInput(source=train_data, destination='/opt/ml/processing/train'),
        ProcessingInput(source=test_data, destination='/opt/ml/processing/test'),
    ],
    # Everything found in these folders when the job ends is uploaded to S3.
    outputs=[
        ProcessingOutput(output_name='train', source='/opt/ml/processing/train'),
        ProcessingOutput(output_name='validation', source='/opt/ml/processing/validation'),
        ProcessingOutput(output_name='test', source='/opt/ml/processing/test'),
    ],
    code=str(CODE_FOLDER / 'preprocessor.py'),
)
