# Run SageMaker Processing Job

This notebook shows how to run code written in a notebook as a SageMaker Processing job by using sagemill. <br>
This example uses `sagemaker.sklearn.processing.SKLearnProcessor` as the processor.

#### Assumptions
- This notebook runs on the "conda_python3" kernel of a SageMaker notebook instance.

In [None]:
!conda install -y scikit-learn==0.20.0 pandas

## Parameters
The cell tagged "parameters" defines the arguments of the Python script that the Processing job runs. <br>
These values can therefore be overridden through `arguments` in `processor.run`.
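
One way to picture the override is the minimal sketch below: the script keeps the notebook value as a default and lets a command-line argument replace it. It uses the standard `argparse` module and illustrates the general idea only. The code that sagemill actually generates is not shown in this notebook and may differ, and `parse_params` is a hypothetical helper name.

```python
import argparse


def parse_params(argv):
    """Return parameters, letting command-line arguments override the defaults.

    Hypothetical helper for illustration; not part of sagemill.
    """
    parser = argparse.ArgumentParser()
    # The default mirrors the value set in the cell tagged "parameters".
    parser.add_argument('--base_path', default='.')
    # Ignore arguments this sketch does not know about.
    args, _unknown = parser.parse_known_args(argv)
    return vars(args)


print(parse_params([]))                                      # {'base_path': '.'}
print(parse_params(['--base_path', '/opt/ml/processing']))   # {'base_path': '/opt/ml/processing'}
```

With no arguments the notebook default is kept, so the same code still runs interactively.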

#### How to add tags
1. Click the 'View' tab of the notebook.
2. Click 'Cell Toolbar'.
3. Click 'Tags'.
4. Enter the tag name in the input field of the cell.
5. Click "Add tag".
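
Tags added this way are stored in each cell's `metadata.tags` list inside the notebook's JSON. The sketch below lists the tags per cell so that a saved tag can be verified; `tags_by_cell` is a hypothetical helper, and the dictionary stands in for a notebook loaded with `json.load`.

```python
def tags_by_cell(notebook):
    """Map cell index to its list of tags, skipping untagged cells."""
    result = {}
    for index, cell in enumerate(notebook.get('cells', [])):
        tags = cell.get('metadata', {}).get('tags', [])
        if tags:
            result[index] = list(tags)
    return result


# Stand-in for json.load(open('some_notebook.ipynb')); the content is made up.
example_notebook = {
    'cells': [
        {'cell_type': 'code', 'metadata': {'tags': ['parameters']}, 'source': ["base_path = '.'"]},
        {'cell_type': 'code', 'metadata': {}, 'source': ['import os']},
    ]
}

print(tags_by_cell(example_notebook))  # {0: ['parameters']}
```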

In [None]:
base_path = '.'

In [None]:
import os
import pandas as pd

input_path = lambda b: os.path.join(b, 'dataset/dummmy.csv')
output_path = lambda b: os.path.join(b, 'output/out.csv')

## Create dataset to be used by the process
Cells tagged "sagemaker" are ignored when `Converter.generate_pyfile` generates the Python script from the notebook.
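
As a rough illustration of what skipping tagged cells means, the sketch below joins the code cells of a notebook while leaving out any cell that carries a given tag. It is not sagemill's implementation, which is not shown here; `notebook_to_script` and the sample notebook are hypothetical.

```python
def notebook_to_script(notebook, skip_tag='sagemaker'):
    """Join the source of code cells, leaving out cells that carry skip_tag."""
    chunks = []
    for cell in notebook.get('cells', []):
        if cell.get('cell_type') != 'code':
            continue  # markdown cells never end up in the script
        if skip_tag in cell.get('metadata', {}).get('tags', []):
            continue  # tagged cells stay local to the notebook
        source = cell.get('source', '')
        # The notebook format stores source as a string or a list of lines.
        chunks.append(source if isinstance(source, str) else ''.join(source))
    return '\n\n'.join(chunks)


example_notebook = {
    'cells': [
        {'cell_type': 'markdown', 'metadata': {}, 'source': ['## Title']},
        {'cell_type': 'code', 'metadata': {}, 'source': ['import os\n', 'import pandas as pd']},
        {'cell_type': 'code', 'metadata': {'tags': ['sagemaker']}, 'source': ["print('local only')"]},
        {'cell_type': 'code', 'metadata': {}, 'source': ["print('runs in the job')"]},
    ]
}

script = notebook_to_script(example_notebook)
print(script)
```

In this notebook, the dataset-creation cell and the cell that launches the job are the kind of cells that should stay out of the generated script.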

In [None]:
os.makedirs(os.path.dirname(input_path(base_path)), exist_ok=True)
with open(input_path(base_path), 'w') as f:
    f.write('col1,col2,col3\n1,2,3\n4,5,6')

## Code to run in the job
Any processing code can go here.

In [None]:
df = pd.read_csv(input_path(base_path))

# Replace this placeholder with the actual heavy processing
df['new_col'] = 1
os.makedirs(os.path.dirname(output_path(base_path)), exist_ok=True)
df.to_csv(output_path(base_path))
print('Finished running processing job')

## Run SageMaker Processing Job

- `Converter.generate_pyfile` generates `entrypoint.py` from this notebook.
- Pass `entrypoint.py` as the `code` argument of `processor.run`.
- `params` corresponds to the cell tagged "parameters". <br>
  In the job, those parameters are overwritten with the values in `params`.
- **Save this notebook file before running the following cell.**
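
The `arguments` parameter of `processor.run` takes a list of strings, so a dictionary such as `params` has to be flattened before it is passed. The sketch below shows one plausible flattening. What `Converter.process_args` actually returns is not shown in this notebook, so `to_cli_args` is a hypothetical stand-in and the `--name value` format is an assumption.

```python
def to_cli_args(params):
    """Flatten {'name': value} into ['--name', 'value', ...] (assumed format)."""
    args = []
    for name, value in params.items():
        # Every element must be a string to be passed as a command-line argument.
        args.extend([f'--{name}', str(value)])
    return args


print(to_cli_args({'base_path': '/opt/ml/processing'}))
# ['--base_path', '/opt/ml/processing']
```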

In [None]:
from sagemaker import Session, get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemill import Converter

# Values that replace the cell tagged "parameters" inside the job container
params = {'base_path': '/opt/ml/processing'}
role = get_execution_role()
bucket_name = Session().default_bucket()
dest_path = f's3://{bucket_name}/process/output'
script_name = 'entrypoint.py'
processor = SKLearnProcessor(framework_version='0.20.0',
                             role=role,
                             base_job_name='process-job-test',
                             instance_count=1,
                             instance_type='ml.c5.xlarge')

Converter.generate_pyfile('process_sklearn.ipynb', script_name)

processor.run(
    code=script_name,
    arguments=Converter.process_args(params),
    inputs=[
        ProcessingInput(source=input_path(base_path), destination=os.path.dirname(input_path(params['base_path'])))
    ],
    outputs=[
        ProcessingOutput(source=os.path.dirname(output_path(params['base_path'])), destination=dest_path)
    ]
)
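
To make the `ProcessingInput` and `ProcessingOutput` paths above easier to follow, the sketch below resolves them for the container base path `/opt/ml/processing`. It uses `posixpath` instead of `os.path` so the result is the same on any platform; that substitution is the only change from the helpers defined earlier in this notebook.

```python
import posixpath

# Same helpers as earlier in the notebook, with posixpath for portability.
input_path = lambda b: posixpath.join(b, 'dataset/dummmy.csv')
output_path = lambda b: posixpath.join(b, 'output/out.csv')

container_base = '/opt/ml/processing'

# Directory where the job receives the uploaded input file.
input_dir = posixpath.dirname(input_path(container_base))
# Directory whose contents are uploaded to S3 when the job finishes.
output_dir = posixpath.dirname(output_path(container_base))

print(input_dir)   # /opt/ml/processing/dataset
print(output_dir)  # /opt/ml/processing/output
```

Because the script builds its paths from `base_path`, the code that reads and writes under `.` locally works with these container directories once `base_path` is overridden.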