## SageMaker Processing
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.ipynb

Goals:
1. Understand and run SageMaker Processing with the default SKLearn containers.
2. Set up a container to run processing and evaluation with tidymodels.



Note: the `SKLearn` estimator returns an `SKLearnModel` to deploy, whereas the `SKLearnProcessor` used below only runs a processing script.

In [145]:
import boto3
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor

In [2]:
region = boto3.session.Session().region_name
# The `!` shell capture returns a list of output lines, so keep only the first one.
role = !aws configure get role_arn --profile sagemaker
role = role[0]

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_type='ml.m4.xlarge',
                                     instance_count=1)

In [3]:
import pandas as pd

input_data = 's3://sagemaker-sample-data-{}/processing/census/census-income.csv'.format(region)
df = pd.read_csv(input_data, nrows=1000)

In [4]:
model_columns = [
  'age', 'education', 'major industry code', 'class of worker', 
  'num persons worked for employer',
  'capital gains', 'capital losses', 'dividends from stocks', 'income'
]

In [5]:
df = df[model_columns]

In [6]:
df.groupby('income').quantile([0, 0.25, 0.5, 0.75, 1])

| income | quantile | age | capital gains | capital losses | dividends from stocks | num persons worked for employer |
|---|---|---|---|---|---|---|
| - 50000. | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| - 50000. | 0.25 | 15.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| - 50000. | 0.5 | 31.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| - 50000. | 0.75 | 47.0 | 0.0 | 0.0 | 0.0 | 4.0 |
| - 50000. | 1.0 | 90.0 | 20051.0 | 2415.0 | 11744.0 | 6.0 |
| 50000+. | 0.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50000+. | 0.25 | 35.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 50000+. | 0.5 | 44.0 | 0.0 | 0.0 | 0.0 | 4.0 |
| 50000+. | 0.75 | 55.0 | 0.0 | 0.0 | 600.0 | 6.0 |
| 50000+. | 1.0 | 86.0 | 99999.0 | 2444.0 | 99999.0 | 6.0 |


In [7]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer, KBinsDiscretizer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import make_column_transformer

In [8]:
df.isnull().sum()

age                                0
education                          0
major industry code                0
class of worker                    0
num persons worked for employer    0
capital gains                      0
capital losses                     0
dividends from stocks              0
income                             0
dtype: int64

In [76]:
preprocess = make_column_transformer(
    (KBinsDiscretizer(encode='onehot-dense', n_bins=10), ['age', 'num persons worked for employer'],),
    (StandardScaler(), ['capital gains', 'capital losses', 'dividends from stocks']),
    (OneHotEncoder(sparse=False, handle_unknown='ignore'), ['education', 'major industry code', 'class of worker'])
)

In [77]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('income', axis=1), 
    df['income'], 
    test_size=0.3, 
    random_state=0
)

In [78]:
train_features = preprocess.fit_transform(X_train)

(truncated warning output) ... 'decreasing the number of bins.' % jj)


In [79]:
test_features = preprocess.transform(X_test)

### Feature Engineering

From here, I could save the features to S3 so that a future machine learning algorithm can use them.
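
One way to get these features into S3 is to move the preprocessing steps into a script and hand it to the `sklearn_processor` created earlier; Processing uploads whatever the script writes to the declared output directories. The following is a minimal sketch and was not run in this notebook. The script name `preprocessing.py`, the output names, the `--train-test-split-ratio` argument, and the helpers `processing_paths` and `run_preprocessing` are all assumptions for illustration. The `/opt/ml/processing/...` paths follow the convention used in the linked example notebook.

```python
def processing_paths(region):
    """Return the sample-data S3 URI and the in-container directories (assumed layout)."""
    return {
        "input_s3": "s3://sagemaker-sample-data-{}/processing/census/census-income.csv".format(region),
        "input_dir": "/opt/ml/processing/input",
        "train_dir": "/opt/ml/processing/train",
        "test_dir": "/opt/ml/processing/test",
    }


def run_preprocessing(processor, region, script="preprocessing.py"):
    """Start a processing job with the given processor and return its description.

    `script` is a hypothetical file that reads from input_dir and writes
    its results to train_dir and test_dir.
    """
    # Imported lazily so the sketch can be loaded without the SageMaker SDK installed.
    from sagemaker.processing import ProcessingInput, ProcessingOutput

    paths = processing_paths(region)
    processor.run(
        code=script,
        inputs=[ProcessingInput(source=paths["input_s3"], destination=paths["input_dir"])],
        outputs=[
            ProcessingOutput(output_name="train_data", source=paths["train_dir"]),
            ProcessingOutput(output_name="test_data", source=paths["test_dir"]),
        ],
        arguments=["--train-test-split-ratio", "0.3"],
    )
    # The description of the finished job includes the S3 URI of each output.
    return processor.jobs[-1].describe()


print(processing_paths("us-east-1")["input_s3"])
```

Calling `run_preprocessing(sklearn_processor, region)` would start a billable job, so it is left out of the sketch.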

#### Transformations

I can pull the feature definitions (bin edges, category names) out of the fitted transformers, although it's not very nice.

I'd like to be able to see which set of columns was transformed by which transformer. This could be a future package or blog post.
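
A rough sketch of that idea: walk over `transformers_` of a fitted column transformer and record which slice of the output matrix each transformer produced. The helper `output_column_map` and the toy data are hypothetical, and the sketch assumes the default `remainder='drop'`.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler


def output_column_map(column_transformer, X):
    """Return (name, source_columns, start, stop) for each fitted transformer."""
    mapping = []
    start = 0
    for name, transformer, columns in column_transformer.transformers_:
        if isinstance(transformer, str):
            # 'drop' / 'passthrough' placeholders (e.g. the remainder) are skipped;
            # this sketch assumes nothing is passed through.
            continue
        # The number of output columns is measured by transforming the source columns.
        width = transformer.transform(X[columns]).shape[1]
        mapping.append((name, list(columns), start, start + width))
        start += width
    return mapping


# Toy data standing in for the census columns.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60, 70],
    "capital gains": [0.0, 0.0, 100.0, 0.0, 500.0, 0.0],
})
toy_preprocess = make_column_transformer(
    (KBinsDiscretizer(encode="onehot-dense", n_bins=2), ["age"]),
    (StandardScaler(), ["capital gains"]),
)
toy_features = toy_preprocess.fit_transform(toy)
toy_map = output_column_map(toy_preprocess, toy)

for name, cols, start, stop in toy_map:
    print("{}: {} -> output[:, {}:{}]".format(name, cols, start, stop))
```

The same function could be called as `output_column_map(preprocess, X_train)` on the transformer fitted above.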

In [144]:
kbin = preprocess.named_transformers_['kbinsdiscretizer']

In [132]:
kbin.bin_edges_

array([array([ 0. ,  6.9, 13. , 22. , 27. , 33. , 38. , 44. , 52. , 64. , 89. ]),
       array([0., 1., 2., 3., 6.])], dtype=object)

In [142]:
kbin.n_bins_

array([10,  4])
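
`n_bins_` reports only 4 bins for `num persons worked for employer` even though `n_bins=10` was requested. With the default quantile strategy, several quantiles of a feature with few distinct values coincide, and bin edges that are (almost) identical are dropped, which is what the warning about 'decreasing the number of bins' refers to. The sketch below reproduces that logic with NumPy on made-up values; it illustrates the idea and is not scikit-learn's actual code.

```python
import numpy as np

# Made-up feature with few distinct values, similar in spirit to a small count column.
values = np.array([0, 0, 0, 0, 1, 1, 2, 3, 4, 6, 6, 6])

requested_bins = 10
# Quantile strategy: bin edges are evenly spaced percentiles of the data.
edges = np.percentile(values, np.linspace(0, 100, requested_bins + 1))

# Drop every edge that is not strictly larger than its predecessor.
keep = np.ediff1d(edges, to_begin=np.inf) > 1e-8
kept_edges = edges[keep]

print("requested bins:", requested_bins)
print("remaining bins:", len(kept_edges) - 1)
print("edges:", kept_edges)
```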

In [104]:
ohe = preprocess.transformers_[2][1]

In [107]:
ohe.get_feature_names()

array(['x0_ 10th grade', 'x0_ 11th grade', 'x0_ 12th grade no diploma',
       'x0_ 1st 2nd 3rd or 4th grade', 'x0_ 5th or 6th grade',
       'x0_ 7th and 8th grade', 'x0_ 9th grade',
       'x0_ Associates degree-academic program',
       'x0_ Associates degree-occup /vocational',
       'x0_ Bachelors degree(BA AB BS)', 'x0_ Children',
       'x0_ Doctorate degree(PhD EdD)', 'x0_ High school graduate',
       'x0_ Less than 1st grade',
       'x0_ Masters degree(MA MS MEng MEd MSW MBA)',
       'x0_ Prof school degree (MD DDS DVM LLB JD)',
       'x0_ Some college but no degree', 'x1_ Agriculture',
       'x1_ Business and repair services', 'x1_ Communications',
       'x1_ Construction', 'x1_ Education', 'x1_ Entertainment',
       'x1_ Finance insurance and real estate',
       'x1_ Forestry and fisheries', 'x1_ Hospital services',
       'x1_ Manufacturing-durable goods',
       'x1_ Manufacturing-nondurable goods',
       'x1_ Medical except hospital', 'x1_ Not in universe or chi

### Training
We can use the standard set of SageMaker estimators, but SageMaker now also comes with an additional estimator.

This estimator runs a container with `sklearn` installed and accepts a script as input: https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#scikit-learn-estimator

It is part of the SageMaker frameworks: https://sagemaker.readthedocs.io/en/stable/frameworks/index.html
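
The script passed to this estimator is an ordinary Python file. A minimal sketch of what it might contain is below: a `train` function that fits and saves a model, and a `model_fn` that SageMaker calls to load the model when serving. The file names (`train_features.csv`, `train_labels.csv`, `model.joblib`), the choice of `LogisticRegression`, and the toy data are assumptions. In a real entry point, `train` would be called with the directories from the `SM_CHANNEL_TRAIN` (for a channel named `train`) and `SM_MODEL_DIR` environment variables rather than a temporary directory.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def train(train_dir, model_dir):
    """Fit a classifier on header-less CSV files and save it to model_dir."""
    X = pd.read_csv(os.path.join(train_dir, "train_features.csv"), header=None).values
    y = pd.read_csv(os.path.join(train_dir, "train_labels.csv"), header=None).iloc[:, 0]
    model = LogisticRegression(solver="lbfgs")
    model.fit(X, y)
    model_path = os.path.join(model_dir, "model.joblib")
    joblib.dump(model, model_path)
    return model_path


def model_fn(model_dir):
    """Load the saved model; SageMaker calls this function at serving time."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))


# Local dry run on toy data, using one temporary directory for input and output.
with tempfile.TemporaryDirectory() as workdir:
    rng = np.random.RandomState(0)
    X_toy = rng.normal(size=(40, 3))
    y_toy = (X_toy[:, 0] > 0).astype(int)
    pd.DataFrame(X_toy).to_csv(
        os.path.join(workdir, "train_features.csv"), header=False, index=False
    )
    pd.Series(y_toy).to_csv(
        os.path.join(workdir, "train_labels.csv"), header=False, index=False
    )
    model_path = train(workdir, workdir)
    toy_predictions = model_fn(workdir).predict(X_toy)

print("saved model file:", os.path.basename(model_path))
```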