# Pipeline of Digits

This is a starting notebook for solving the "Pipeline of Digits" assignment.


This notebook was created by [Santiago L. Valdarrama](https://twitter.com/svpino) as part of the [Machine Learning School](https://www.ml.school) program.

Let's make sure we are running the latest version of the SakeMaker's SDK. **Restart the notebook** after you upgrade the library.

In [None]:

# !pip install -q --upgrade awscli boto3
# !pip install -q --upgrade PyYAML==6.0
# !pip install -q --upgrade sagemaker==2.165.0
%pip install pip
%pip install scikit-learn==1.3.0
%pip install tensorflow==2.5
%pip install pandas==1.3.3
%pip install joblib
%pip install matplotlib==3.6.0


In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
import os
import sys
import pandas as pd

from pathlib import Path

## Creating the S3 Bucket

Let's create an S3 bucket where you will upload all the information generated by the pipeline. Make sure you set `BUCKET` to the name of the bucket you want to use. This name has to be unique.

If you want to create a bucket in a region other than `us-east-1`, use this command instead:

```
!aws s3api create-bucket --bucket $BUCKET --create-bucket-configuration LocationConstraint=$region
```

The `LocationConstraint` argument should specify the region where you want to create the bucket.

In [None]:
BUCKET = 'vmate-mnist'

!aws s3api create-bucket --bucket $BUCKET --create-bucket-configuration LocationConstraint=$region

## Loading the dataset

We have two CSV files containing the MNIST dataset. These files come from the [MNIST in CSV](https://www.kaggle.com/datasets/oddrationale/mnist-in-csv) Kaggle dataset.

The `mnist_train.csv` file contains 60,000 training examples and labels. The `mnist_test.csv` contains 10,000 test examples and labels. Each row consists of 785 values: the first value is the label (a number from 0 to 9) and the remaining 784 values are the pixel values (a number from 0 to 255).

Let's extract the `dataset.tar.gz` file.

In [None]:
MNIST_FOLDER = Path('mnist')
DATASET_FOLDER = Path('dataset')
CODE_FOLDER = Path('code')
CODE_FOLDER.mkdir(parents=True, exist_ok=True)
sys.path.append('./code')

!tar -xvzf dataset.tar.gz --no-same-owner

Let's load the first 10 rows of the test set.

In [None]:
df = pd.read_csv(DATASET_FOLDER / 'mnist_train.csv', nrows=10)
df

## S3 upload / locations

In [None]:
S3_LOCATION = f's3://{BUCKET}/{MNIST_FOLDER}'

TRAIN_SET_S3_URI = sagemaker.s3.S3Uploader.upload(
    local_path=str(DATASET_FOLDER / 'mnist_train.csv'), 
    desired_s3_uri=S3_LOCATION,
)

TEST_SET_S3_URI = sagemaker.s3.S3Uploader.upload(
    local_path=str(DATASET_FOLDER / 'mnist_test.csv'), 
    desired_s3_uri=S3_LOCATION,
)

PROCESSED_SET_S3_BASE_URI = f'{S3_LOCATION}/preprocessed_data',


print(f'Train set S3 location: {TRAIN_SET_S3_URI}')
print(f'Test set S3 location: {TEST_SET_S3_URI}')
print(f'Processed set S3 location: {PROCESSED_SET_S3_BASE_URI}')


In [None]:
%%writefile {CODE_FOLDER}/preprocessor.py

import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.pipeline import Pipeline
from pickle import dump
from typing import Tuple


DEFAULT_BASE_DIR = Path('/opt')/'ml'/'processing'

def _preprocess_pipeline(df_data: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray]:    
    num_classes = 10

    categorical_transformer = Pipeline(
        steps=[
            ('encoder', OneHotEncoder())            
        ]
    )

    preprocessor = ColumnTransformer(
        transformers=[
            ('labels', categorical_transformer, ['label'])
        ],
        remainder='passthrough'
    )

    pipeline = Pipeline(
        steps=[
            ("preprocess", preprocessor)
        ]
    )
    data: np.ndarray = pipeline.fit_transform(df_data)
    # OneHotEncoded
    y_data: np.ndarray = data[:, :num_classes]
    # Drop OneHotEncoded target variable
    data = np.delete(data,np.arange(num_classes), axis=1)

    X_train, X_test_validation, y_data, y_test_validation = train_test_split(data, y_data, test_size=0.2, random_state=7)
    X_test, X_validation, y_test, y_validation = train_test_split(X_test_validation, y_test_validation, test_size=0.5, random_state=7)

    return X_train, X_test, X_validation, y_data, y_test, y_validation


def _save_pipeline(base_dir: str, pipeline: Pipeline):  
    
    pipeline_path = Path(base_dir)
    pipeline_path.mkdir(parents=True, exist_ok=True)
    dump(pipeline, open(pipeline_path / 'pipeline.pkl', 'wb'))

def preprocess(base_dir = None, data_filepath =  DEFAULT_BASE_DIR):
    
    if base_dir is None:
        base_dir = DEFAULT_BASE_DIR
        
    base_dir = Path(base_dir)
    (base_dir / 'train').mkdir(parents=True, exist_ok=True)
    (base_dir / 'validation').mkdir(parents=True, exist_ok=True)
    (base_dir / 'test').mkdir(parents=True, exist_ok=True)
    (base_dir / 'labels').mkdir(parents=True, exist_ok=True)
        
    df_data: pd.DataFrame =  pd.read_csv(Path(data_filepath) / 'mnist_train.csv')

    df_test: pd.DataFrame =  pd.read_csv(Path(data_filepath) / 'mnist_test.csv')

    df_data: np.ndarray = pd.concat([df_data, df_test], axis=0)

    label_encoder = LabelEncoder()
    labels = label_encoder.fit_transform(df_data['label'])

    X_train, X_test, X_validation, y_train, y_test, y_validation  = _preprocess_pipeline(df_data)    

    np.savetxt(base_dir / 'train' / 'mnist_train.csv', np.concatenate([X_train, y_train], axis=1), delimiter=',')
    np.savetxt(base_dir / 'test' / 'mnist_test.csv', np.concatenate([X_test, y_test], axis=1), delimiter=',')
    np.savetxt(base_dir / 'validation' / 'mnist_validation.csv', np.concatenate([X_validation, y_validation], axis=1), delimiter=',')
    np.savetxt(base_dir / 'labels' / 'labels.csv', labels, delimiter=',')

if __name__ == "__main__":
    preprocess(
        base_dir=DEFAULT_BASE_DIR,
    )

In [58]:
%%writefile {CODE_FOLDER}/train.py

from pathlib import Path
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from matplotlib import pyplot as plt 

model_path = Path('model') / '001'
image_size = 28 * 28
def create(no_features):  
    model = Sequential([
        Dense(32, activation='sigmoid'),
        Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
    model.build(input_shape=(None,no_features)) # Comeback to this, find out what 'None' is.  
    return model

def train(X_train, y_train, X_test, y_test):
    model = create(X_train.shape[1])
    model.summary()
    
    history = model.fit(X_train, y_train, batch_size=18, epochs=5, validation_split=.1, verbose=True)
    model.save(model_path)

    # loss, accuracy = model.evaluate(X_test, y_test, verbose=True)

    # plt.plot(history.history['accuracy'])
    # plt.plot(history.history['val_accuracy'])
    # plt.plot(accuracy)

    # plt.title('model accuracy')
    # plt.ylabel('accuracy')
    # plt.xlabel('epoch')
    # plt.legend(['training', 'validation'], loc='best')
    # plt.show()

    # print(f'Loss: {loss:.3}, Accuracy: {accuracy:.3}')


Overwriting code/train.py


In [59]:
import numpy as np
from train import train
from joblib import Memory
from typing import Tuple

memory = Memory(location='./cache', verbose=0)

@memory.cache
def load_csv() -> Tuple[np.ndarray,np.ndarray,np.ndarray,np.ndarray]:
    num_classes = 10
    directory = Path('./preprocessed_dataset/')

    X_train = np.genfromtxt(directory / 'train' / 'mnist_train.csv', delimiter=',')
    y_train = X_train[:, -num_classes:]

    X_test = np.genfromtxt(directory / 'test' / 'mnist_test.csv', delimiter=',')
    y_test = X_test[:, -num_classes:]

    return X_train, y_train, X_test, y_test


X_train, y_train, X_test, y_test = load_csv()

train(X_train, y_train, X_test, y_test)

Model: "sequential_18"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_43 (Dense)             (None, 32)                25440     
_________________________________________________________________
dense_44 (Dense)             (None, 10)                330       
Total params: 25,770
Trainable params: 25,770
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
INFO:tensorflow:Assets written to: model/001/assets


In [None]:
import tempfile  
from preprocessor import preprocess


directory = Path('./preprocessed_dataset')
directory.mkdir(parents=True, exist_ok=True)

preprocess(
    base_dir=directory,
    data_filepath=Path(DATASET_FOLDER),
)
print(f'Folders: {os.listdir(directory)}')

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import CacheConfig
from sagemaker.tuner import HyperParameter 

train_input= ParameterString(
    name='train_data_location',
    default_value=TRAIN_SET_S3_URI
)

test_input= ParameterString(
    name='test_data_location',
    default_value=TEST_SET_S3_URI
)

processed_data_output_location= ParameterString(
    name='processed_data_output_location',
    default_value=PROCESSED_SET_S3_URI
)

In [None]:

cache_config = CacheConfig(
    enable_caching=True,
    expire_after='15d',
)
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m3.medium",
    instance_count=1,
)

preprocessing_step = ProcessingStep(
    name='mnist_preprocessing',
    processor = sklearn_processor,
    inputs=[
        ProcessingInput(source=train_input, destination='/opt/ml/processing/train'),
        ProcessingInput(source=test_input, destination='/opt/ml/processing/test'),
    ],
    outputs = [
        ProcessingOutput(output_name='train', source='/opt/ml/processing/train', destination=processed_data_output_location),
        ProcessingOutput(output_name='validation', source='/opt/ml/processing/validation', destination=processed_data_output_location),
        ProcessingOutput(output_name='test', source='/opt/ml/processing/test', destination=processed_data_output_location),
    ],
    code='code/preprocessor.py',
    cache_config=cache_config,
)

In [None]:
tensorflow_processor = TensorFlow(
    entry_point=f'{CODE_FOLDER}/train.py'
    framework_version='2.6',
    role=role,
    instance_type='m1.m3.medium',
    instance_count=1,
    base_job_name='mnist_train'
    py_version='py38',
)

step_args = tensorflow.fit(
    estimator=estimator,
    inputs={
        'train': TrainingInput(
            s3_data=preprocessing_step.properties.ProcessingOutputConfig.Outputs[
                'train'
            ].S3Output.S3Uri,
            content_type='text/csv'
        ),
        'validation': TrainingInput(
            s3_data=preprocessing_step.properties.ProcessingOutputConfig.Outputs[
                'validation'
            ].S3Output.S3Uri,
            content_type='text/csv'
        ),
        'test': TrainingInput(
            s3_data=preprocessing_step.properties.ProcessingOutputConfig.Outputs[
                'test'
            ].S3Output.S3Uri,
            content_type='text/csv'
        ),
    }
)

training_step = TrainingStep(
    name='mnist_train',
    step_args=step_args,
    cache_config=cache_config,
)

In [None]:

pipeline = Pipeline(
    name="mnist_pipeline",
    parameters=[
        train_input_location,
        test_input_location,
        processed_data_output_location,
    ],
    steps=[preprocessing_step, training_step],
)