## SageMaker Processing
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.ipynb

Goals:
1. Understand and run SageMaker Processing with the default SKLearn containers
2. Set up a container to run processing and evaluation with tidymodels



The SKLearn estimator returns an `SKLearnModel` that can be deployed.

In [1]:
import boto3
import sagemaker

In [2]:
region = boto3.session.Session().region_name
role = !aws configure get role_arn --profile sagemaker
role = role[0]
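
Since `role[0]` is simply whatever the CLI printed, a quick format check can catch a missing or mistyped profile before any job is launched. A minimal sketch: `looks_like_role_arn` is a hypothetical helper, and the pattern only covers the standard `aws` partition.

```python
import re

# Hypothetical helper: checks the general shape of an IAM role ARN.
# Only the standard "aws" partition is covered here (an assumption).
ROLE_ARN_PATTERN = re.compile(r"^arn:aws:iam::\d{12}:role/[\w+=,.@/-]+$")


def looks_like_role_arn(value):
    """Return True if value has the shape arn:aws:iam::<account-id>:role/<name>."""
    return bool(ROLE_ARN_PATTERN.match(value.strip()))


# Placeholder ARN, not a real account.
print(looks_like_role_arn("arn:aws:iam::123456789012:role/sagemaker-demo"))
print(looks_like_role_arn(""))
```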

In [3]:
import pandas as pd

input_data = 's3://sagemaker-sample-data-{}/processing/census/census-income.csv'.format(region)
df = pd.read_csv(input_data, nrows=1000)

In [4]:
df.head()

Unnamed: 0,age,class of worker,detailed industry recode,detailed occupation recode,education,wage per hour,enroll in edu inst last wk,marital stat,major industry code,major occupation code,...,country of birth father,country of birth mother,country of birth self,citizenship,own business or self employed,fill inc questionnaire for veteran's admin,veterans benefits,weeks worked in year,year,income
0,73,Not in universe,0,0,High school graduate,0,Not in universe,Widowed,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,95,- 50000.
1,58,Self-employed-not incorporated,4,34,Some college but no degree,0,Not in universe,Divorced,Construction,Precision production craft & repair,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.
2,18,Not in universe,0,0,10th grade,0,High school,Never married,Not in universe or children,Not in universe,...,Vietnam,Vietnam,Vietnam,Foreign born- Not a citizen of U S,0,Not in universe,2,0,95,- 50000.
3,9,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.
4,10,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.


### Feature Engineering


In [5]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer, KBinsDiscretizer
from sklearn.compose import make_column_transformer

In [6]:
preprocess = make_column_transformer(
    (KBinsDiscretizer(encode='onehot-dense', n_bins=10), ['age', 'num persons worked for employer'],),
    (StandardScaler(), ['capital gains', 'capital losses', 'dividends from stocks']),
    (OneHotEncoder(sparse=False, handle_unknown='ignore'), ['education', 'major industry code', 'class of worker'])
)
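
To make the last two transformers concrete, here is a plain-Python sketch of what standard scaling and one-hot encoding with `handle_unknown='ignore'` compute. It illustrates the arithmetic only and is not scikit-learn's implementation; the helper names and toy values are made up.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Center to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(value, categories):
    """One column per known category; an unseen value gives an all-zero row."""
    return [1.0 if value == c else 0.0 for c in categories]


# Toy values, not taken from the census data.
scaled = standard_scale([0, 0, 100, 300])
categories = ["Children", "High school graduate"]  # as if learned from the training split

print(scaled)
print(one_hot("Children", categories))
print(one_hot("Doctorate degree", categories))  # unseen category, so every column is zero
```

The all-zero row is what lets the fitted encoder be applied to a test split that contains categories the training split never saw.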

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop('income', axis=1), 
    df['income'], 
    test_size=0.3, 
    random_state=0
)
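
The role of `test_size` and `random_state` can be illustrated with a plain-Python sketch: shuffle the row indices with a fixed seed, then hold out the requested fraction. This is not scikit-learn's algorithm, so the rows it selects will differ from those chosen by `train_test_split`; `split_indices` is a hypothetical helper.

```python
import math
import random


def split_indices(n_rows, test_size=0.3, seed=0):
    """Shuffle row indices with a fixed seed, then hold out the test fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_rows)
    return indices[n_test:], indices[:n_test]


# Same row count as the 1000-row sample read above.
train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))
```

Because the seed is fixed, calling the function again returns the same split, which is what makes the preprocessing reproducible between the notebook and the processing job.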

In [8]:
train_transformed = preprocess.fit_transform(X_train)

  'decreasing the number of bins.' % jj)
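
The truncated message above appears to come from `KBinsDiscretizer`, which by default places bin edges at quantiles. When a feature has only a few distinct values, several quantiles coincide and fewer than `n_bins` usable bins remain. A sketch with made-up data, using the standard library rather than scikit-learn:

```python
from statistics import quantiles

# Made-up, low-cardinality feature: 60% of the rows are zero.
values = [0] * 60 + [1] * 10 + [2] * 10 + [3] * 10 + [6] * 10

# Ten quantile bins need eleven edges: min, nine interior deciles, max.
edges = [min(values)] + quantiles(values, n=10, method="inclusive") + [max(values)]

# Edges that coincide describe zero-width bins, which carry no information.
usable_bins = len(set(edges)) - 1

print(edges)
print(usable_bins)
```

Lowering `n_bins` for such a column is the obvious way to avoid the collapsed bins.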


#### Transformations


In [9]:
from feat import feat

In [10]:
feat(preprocess, df.columns)

Unnamed: 0,name,feature
0,age,age-7
1,age,age-13
2,age,age-22
3,age,age-27
4,age,age-33
...,...,...
59,class of worker,x2_ Not in universe
60,class of worker,x2_ Private
61,class of worker,x2_ Self-employed-incorporated
62,class of worker,x2_ Self-employed-not incorporated


#### Running the job

Then we can save the files to S3 for training. Remember that a SageMaker estimator typically expects a train/test split.

In [23]:
%%writefile preprocessing.py

import pandas as pd
from os.path import join

from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer, KBinsDiscretizer
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split

if __name__ == "__main__":
    input_path = join("/opt/ml/processing/input", "census-income.csv")
    
    df = pd.read_csv(input_path)
    
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop('income', axis=1), 
        df['income'], 
        test_size=0.3, 
        random_state=0
    )
    
    kbin = ['age', 'num persons worked for employer']
    ss = ['capital gains', 'capital losses', 'dividends from stocks']
    ohe = ['education', 'major industry code', 'class of worker']
    
    
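    # Note: the (transformer, columns) tuple order needs scikit-learn >= 0.20.1;
    # 0.20.0 expected (columns, transformer), so check the version in the container.
    preprocess = make_column_transformer(
        (KBinsDiscretizer(encode='onehot-dense', n_bins=10), kbin),
        (StandardScaler(), ss),
        (OneHotEncoder(sparse=False, handle_unknown='ignore'), ohe)
    )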
    
    X_train_transformed = preprocess.fit_transform(X_train)
    
    
    output_path = join("/opt/ml/processing/train", "census-train-transform.csv")
    pd.DataFrame(X_train_transformed).to_csv(output_path, header=False, index=False)
    
    

Overwriting preprocessing.py
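
The script hard-codes the container paths, so it can only be exercised inside a processing job. One option, sketched here, is to resolve the paths from a configurable base directory and create the output folders up front. `processing_paths` and the `test` folder are assumptions, not part of the script above.

```python
import os
import tempfile


def processing_paths(base_dir="/opt/ml/processing"):
    """Return input and output paths under base_dir, creating the output folders."""
    paths = {
        "input": os.path.join(base_dir, "input", "census-income.csv"),
        "train": os.path.join(base_dir, "train", "census-train-transform.csv"),
        # Hypothetical second output for the held-out split.
        "test": os.path.join(base_dir, "test", "census-test-transform.csv"),
    }
    for key in ("train", "test"):
        os.makedirs(os.path.dirname(paths[key]), exist_ok=True)
    return paths


# Local dry run against a temporary directory instead of the container path.
with tempfile.TemporaryDirectory() as tmp:
    paths = processing_paths(tmp)
    print(paths["train"])
    train_dir_exists = os.path.isdir(os.path.dirname(paths["train"]))
```

Inside the job the default base directory applies, so the same function serves both the local run and the container.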


In [32]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(
    framework_version='0.20.0',
    role=role,
    instance_type='ml.m5.xlarge',
    instance_count=1
)

In [25]:
sklearn_processor.run(
    code='preprocessing.py',
    inputs=[
        ProcessingInput(
            source=input_data,
            destination='/opt/ml/processing/input'
    )],
    outputs=[
        ProcessingOutput(
            output_name='census-train-transform.csv',
            source='/opt/ml/processing/train'
    )]
)

Parameter 'session' will be renamed to 'sagemaker_session' in SageMaker Python SDK v2.



Job Name:  sagemaker-scikit-learn-2020-07-11-20-30-40-252
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-sample-data-us-east-2/processing/census/census-income.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-2-495577990003/sagemaker-scikit-learn-2020-07-11-20-30-40-252/input/code/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'census-train-transform.csv', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-2-495577990003/sagemaker-scikit-learn-2020-07-11-20-30-40-252/output/census-train-transform.csv', 'LocalPath': '/opt/ml/processing/train', 'S3UploadMode': 'EndOfJob'}}]
  import imp
Traceback (most rece

UnexpectedStatusException: Error for Processing job sagemaker-scikit-learn-2020-07-11-20-30-40-252: Failed. Reason: AlgorithmError: See job logs for more information
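
The exception only says to check the job logs, and the fragment printed above is cut off. A sketch of pulling the full log text from CloudWatch follows. It assumes the log group `/aws/sagemaker/ProcessingJobs` and log stream names that start with the job name; `fetch_job_log` is a hypothetical helper, and the client is passed in so the function can be tried without AWS access.

```python
import re

ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")  # color codes such as \x1b[34m
LOG_GROUP = "/aws/sagemaker/ProcessingJobs"  # assumed log group for processing jobs


def fetch_job_log(logs_client, job_name):
    """Collect all log messages for a job, with ANSI color codes removed."""
    kwargs = {"logGroupName": LOG_GROUP, "logStreamNamePrefix": job_name}
    lines = []
    while True:
        page = logs_client.filter_log_events(**kwargs)
        lines.extend(ANSI_ESCAPE.sub("", e["message"]) for e in page["events"])
        if "nextToken" not in page:
            return lines
        kwargs["nextToken"] = page["nextToken"]


class FakeLogsClient:
    """Stand-in for boto3.client('logs'), returning one canned page."""

    def filter_log_events(self, **kwargs):
        message = "\x1b[34mTraceback (most recent call last):\x1b[0m"
        return {"events": [{"message": message}]}


print(fetch_job_log(FakeLogsClient(), "example-job"))
```

With real credentials, `logs_client` would be `boto3.client('logs')` and `job_name` the job name printed above.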

### Training
We can use the standard set of SageMaker estimators, but SageMaker now also provides a scikit-learn estimator.

This estimator runs a user-supplied script inside a container that has `sklearn` installed: https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#scikit-learn-estimator


It is part of the SageMaker frameworks: https://sagemaker.readthedocs.io/en/stable/frameworks/index.html

It should be possible to pass both a train and a test set to the estimator: https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#create-an-estimator

In [29]:
from sagemaker.sklearn.estimator import SKLearn

# Named sklearn_estimator so that it does not shadow the sklearn package.
sklearn_estimator = SKLearn(
    entry_point='train.py',
    train_instance_type="ml.m5.xlarge",
    role=role,
    framework_version='0.23-1'
)

Note that, unlike other SageMaker estimators, this one is not given a test set when it is fit. There will be a separate evaluation job.

In [26]:
# sklearn_estimator.fit({'train': preprocessed_training_data})
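
The estimator points at `train.py`, which is not shown in these notes. Below is a sketch of the argument handling such an entry point usually starts with. It assumes the script-mode environment variables `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` set by the SageMaker framework containers; the fallback paths are assumed container defaults, and the fitting step is left as a comment.

```python
import argparse
import os


def parse_args(argv=None):
    """Read the model and data locations the way a script-mode entry point would."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model-dir",
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    parser.add_argument(
        "--train",
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args([])  # pass None instead to read the real command line
    # A real train.py would now load the CSV files under args.train, fit a model,
    # and save it under args.model_dir so that it can be deployed later.
    print(args.train, args.model_dir)
```

The `train` channel name matches the key of the dictionary passed to `fit`.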

### Evaluation

In [None]:
from sagemaker.s3 import S3Downloader

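# model_data_s3_uri and preprocessed_test_data are not defined in this notebook yet;
# they would come from the training job and the preprocessing job outputs.
sklearn_processor.run(code='evaluation.py',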
                      inputs=[ProcessingInput(
                                  source=model_data_s3_uri,
                                  destination='/opt/ml/processing/model'),
                              ProcessingInput(
                                  source=preprocessed_test_data,
                                  destination='/opt/ml/processing/test')],
                      outputs=[ProcessingOutput(output_name='evaluation',
                                  source='/opt/ml/processing/evaluation')]
                     )                    
evaluation_job_description = sklearn_processor.jobs[-1].describe()
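
`describe()` returns a plain dictionary. Judging by the `Outputs` printed for the preprocessing job, each output carries an `OutputName` and an `S3Output` with an `S3Uri`, so the location of the evaluation report can be looked up by name. A sketch, assuming the description nests them under `ProcessingOutputConfig`; the sample dictionary and bucket name are made up.

```python
def output_s3_uri(job_description, output_name):
    """Return the S3 URI of the named output, or None if it is absent."""
    outputs = job_description.get("ProcessingOutputConfig", {}).get("Outputs", [])
    for output in outputs:
        if output["OutputName"] == output_name:
            return output["S3Output"]["S3Uri"]
    return None


# Made-up description with the same shape as the output printed earlier.
sample_description = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "evaluation",
                "S3Output": {
                    "S3Uri": "s3://example-bucket/example-job/output/evaluation",
                    "LocalPath": "/opt/ml/processing/evaluation",
                },
            }
        ]
    }
}

print(output_s3_uri(sample_description, "evaluation"))
```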

## Bring your own container for processing

In [28]:
from sagemaker.processing import ScriptProcessor

script_processor = ScriptProcessor(command=['python3'],
                image_uri=processing_repository_uri,
                role=role,
                instance_count=1,
                instance_type='ml.m5.xlarge')

NameError: name 'processing_repository_uri' is not defined
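
The `NameError` arises because `processing_repository_uri` was never defined. For an image pushed to Amazon ECR the URI follows a fixed pattern, sketched below. The account ID, repository name and tag are placeholders; in practice the account ID could come from `boto3.client('sts').get_caller_identity()`.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build the URI of an image stored in Amazon ECR."""
    # The China regions use a different domain suffix.
    suffix = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return "{}.dkr.ecr.{}.{}/{}:{}".format(account_id, region, suffix, repository, tag)


# Placeholder values, not a real account or repository.
processing_repository_uri = ecr_image_uri(
    "123456789012", "us-east-2", "sagemaker-processing-container"
)
print(processing_repository_uri)
```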

But what we really want is a custom estimator like the sklearn one.
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/r_bring_your_own/r_bring_your_own.ipynb

Ideas:

* tidymodels estimator that accepts a script like sklearn
* make-your-own template for any R model