## Introduction to SageMaker Jobs
In this notebook, we will introduce you to the concept of SageMaker Jobs. Jobs allow us to scale and productionalize our ML and data processing workflows. We will cover the following topics in this notebook:
1. Going from a notebook to a job
2. Bringing your own libraries and code
3. Using the SageMaker SDK to create jobs
4. Using `@remote` decorators to convert functions to jobs

<div style="border: 1px solid black; padding: 10px; background-color: #ffffcc; color: black;">
<strong>Note:</strong> Make sure to fully run the first notebook to ingest the data into Athena before running this notebook.
</div>

For the exercise here, we'll assume that you want to process data using SQL but not necessarily ingest into Athena as we did in the previous notebook. We'll use [DuckDB](https://duckdb.org/) to process the data, which you can think of as a local SQL engine that can be used to analyze and wrangle large amounts of data.

After experimenting with DuckDB in the notebook, we'll convert the code to a job and run it on SageMaker.

In [None]:
# start by installing duckdb
%pip install -Uqq duckdb
%pip install -Uqq duckdb-engine
%pip install -Uqq sagemaker

DuckDB stores data in files, so we create a new database by creating a `.duckdb` file.

In [2]:
import duckdb

# connect to an existing database, or create one if it doesn't exist
conn = duckdb.connect("loan_data.duckdb")

In [None]:
# we can query data directly from a csv file without loading it into a database
sample_df = conn.execute("SELECT * FROM 'data/ln_large.csv' LIMIT 5").df()
sample_df.head()

In [None]:
# we can also validate how well DuckDB inferred the data types from the CSV file
sample_df.dtypes

In [None]:
# for better performance, we can ingest the CSV file into a table within the database
conn.execute("create table if not exists loan_data as select * from 'data/ln_large.csv'")

In [None]:
# validate that the table was created
# .df() returns a pandas DataFrame
conn.execute("show tables").df()

In [None]:
# we can now query the data from the table
conn.execute("select count(*) from loan_data").df()

In [8]:
# let's try a more complex query to profile the numeric columns
profile_numeric_sql = """
WITH percentiles AS (
    SELECT
        'ti_ln_remaining_term' AS column_name,
        MIN(ti_ln_remaining_term) AS min_value,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY ti_ln_remaining_term) AS p25,
        PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY ti_ln_remaining_term) AS p50,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY ti_ln_remaining_term) AS p75,
        MAX(ti_ln_remaining_term) AS max_value
    FROM loan_data
    UNION ALL
    SELECT
        'ti_ln_balance' AS column_name,
        MIN(ti_ln_balance) AS min_value,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY ti_ln_balance) AS p25,
        PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY ti_ln_balance) AS p50,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY ti_ln_balance) AS p75,
        MAX(ti_ln_balance) AS max_value
    FROM loan_data
    UNION ALL
    SELECT
        'ti_ln_installment_due' AS column_name,
        MIN(ti_ln_installment_due) AS min_value,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY ti_ln_installment_due) AS p25,
        PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY ti_ln_installment_due) AS p50,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY ti_ln_installment_due) AS p75,
        MAX(ti_ln_installment_due) AS max_value
    FROM loan_data
    UNION ALL
    SELECT
        'ti_ln_val_payments' AS column_name,
        MIN(ti_ln_val_payments) AS min_value,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY ti_ln_val_payments) AS p25,
        PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY ti_ln_val_payments) AS p50,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY ti_ln_val_payments) AS p75,
        MAX(ti_ln_val_payments) AS max_value
    FROM loan_data
    UNION ALL
    SELECT
        'ti_ln_val_interest' AS column_name,
        MIN(ti_ln_val_interest) AS min_value,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY ti_ln_val_interest) AS p25,
        PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY ti_ln_val_interest) AS p50,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY ti_ln_val_interest) AS p75,
        MAX(ti_ln_val_interest) AS max_value
    FROM loan_data
    UNION ALL
    SELECT
        'ti_ln_val_total_fees' AS column_name,
        MIN(ti_ln_val_total_fees) AS min_value,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY ti_ln_val_total_fees) AS p25,
        PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY ti_ln_val_total_fees) AS p50,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY ti_ln_val_total_fees) AS p75,
        MAX(ti_ln_val_total_fees) AS max_value
    FROM loan_data
    UNION ALL
    SELECT
        'ti_ln_final_charge_cycle' AS column_name,
        MIN(ti_ln_final_charge_cycle) AS min_value,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY ti_ln_final_charge_cycle) AS p25,
        PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY ti_ln_final_charge_cycle) AS p50,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY ti_ln_final_charge_cycle) AS p75,
        MAX(ti_ln_final_charge_cycle) AS max_value
    FROM loan_data
)
SELECT * FROM percentiles;
"""

In [None]:
conn.execute(profile_numeric_sql).df()
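
The profiling query above repeats the same `SELECT` for every column. As a minimal sketch, the same SQL text can be generated from a list of column names. The helper name `build_profile_sql` and the constant `NUMERIC_COLUMNS` are hypothetical and not part of the workshop code.

```python
# Columns profiled by the handwritten query above
NUMERIC_COLUMNS = [
    "ti_ln_remaining_term",
    "ti_ln_balance",
    "ti_ln_installment_due",
    "ti_ln_val_payments",
    "ti_ln_val_interest",
    "ti_ln_val_total_fees",
    "ti_ln_final_charge_cycle",
]


def build_profile_sql(columns, table="loan_data"):
    """Build one SELECT per column and join them with UNION ALL.

    Column and table names are interpolated directly into the SQL text,
    so only pass names you trust (hypothetical helper, for illustration).
    """
    selects = []
    for col in columns:
        selects.append(
            f"""    SELECT
        '{col}' AS column_name,
        MIN({col}) AS min_value,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY {col}) AS p25,
        PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY {col}) AS p50,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY {col}) AS p75,
        MAX({col}) AS max_value
    FROM {table}"""
        )
    body = "\n    UNION ALL\n".join(selects)
    return f"WITH percentiles AS (\n{body}\n)\nSELECT * FROM percentiles;"


generated_sql = build_profile_sql(NUMERIC_COLUMNS)
```

Running `conn.execute(generated_sql).df()` should return the same profile table as the handwritten query, and adding a column to the profile becomes a one-line change.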

In [None]:
# we can also use duckdb to convert the data to parquet format for better performance and interoperability
conn.execute(
    """copy (select *, 
    year(TI_LN_DATE_OPEN) as TI_LN_DATE_OPEN_YEAR, 
    month(ti_ln_date_open) as TI_LN_DATE_OPEN_MONTH 
    from loan_data) 
    to 'parquet_output' 
    (FORMAT PARQUET, PARTITION_BY (TI_LN_DATE_OPEN_YEAR, TI_LN_DATE_OPEN_MONTH), OVERWRITE_OR_IGNORE true)"""
)
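
With `PARTITION_BY`, DuckDB writes a Hive-style directory tree with one `column=value` folder level per partition column. The sketch below lists the partitions that were written using only the standard library. The helper names `list_partitions` and `parse_partition` are hypothetical.

```python
from pathlib import Path


def list_partitions(root):
    """Return sorted (relative_dir, parquet_file_count) pairs found under `root`.

    Assumes the Hive-style layout produced by a partitioned write,
    e.g. root/TI_LN_DATE_OPEN_YEAR=2020/TI_LN_DATE_OPEN_MONTH=1/<file>.parquet
    """
    root = Path(root)
    counts = {}
    for parquet_file in root.rglob("*.parquet"):
        rel_dir = parquet_file.parent.relative_to(root).as_posix()
        counts[rel_dir] = counts.get(rel_dir, 0) + 1
    return sorted(counts.items())


def parse_partition(rel_dir):
    """Split 'A=1/B=2' into {'A': '1', 'B': '2'} (values stay strings)."""
    return dict(part.split("=", 1) for part in rel_dir.split("/") if "=" in part)
```

After running the cell above, `list_partitions("parquet_output")` should return one entry per year/month combination present in the data.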

### Configuring a SageMaker Processing Job

Now let's convert the code to a SageMaker Processing Job. We'll use the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) to create a processing job. The SDK provides a high-level interface for SageMaker Processing Jobs, which allows you to easily create, configure, and run processing jobs.

SageMaker includes 3 types of jobs:
- [Training Jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html): Gets training data from S3, trains a model, and saves the model back to S3.
- [Processing Jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html): Runs a processing script on input data from S3 and saves the output to S3.
- [Batch Transform Jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html): Runs a model on input data from S3 and saves the predictions to S3.

We will work with processing jobs and training jobs in this notebook.

SageMaker Jobs are built around containers and scripts. Users can bring their own containers or use the containers provided by SageMaker. The SageMaker Python SDK provides the same high-level interface for every job type, so you create, configure, and run each of them in a similar way.

We will use a `PyTorch` container for this example. Even though we are not using PyTorch, the container behind it is frequently updated and maintained by AWS given the popularity of the framework. 

In [None]:
import boto3                                                            # AWS SDK for Python                                                
import json
import sagemaker                                                        # SageMaker Python SDK                    
from pathlib import Path
from sagemaker.pytorch.processing import PyTorchProcessor               # Processor for processing data using the PyTorch framework container
from sagemaker.processing import ProcessingInput, ProcessingOutput      # ProcessingInput and ProcessingOutput objects for specifying location of input and output data

role = sagemaker.get_execution_role()  # execution role that the jobs will assume
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
bucket = sess.default_bucket()  # default bucket name
account_id = sess.account_id()  # AWS account ID of the current session

In [12]:
# load values from the first notebook

if not Path("lab_values.json").exists():
    raise FileNotFoundError("Please run the first notebook first.")
else:
    lab_values = json.loads(Path("lab_values.json").read_text())
    s3_csv_data = lab_values["s3_csv_folder"]

In [13]:
# configure the PyTorch processor
processor = PyTorchProcessor(
    framework_version='2.2',          # PyTorch version
    py_version='py310',               # Python version
    role=role,                        # permissions the processing job will assume
    instance_type='ml.m5.xlarge',     # instance type for the processing job (see here for available instances https://aws.amazon.com/sagemaker/pricing/)
    instance_count=1,                 # number of instances for the processing job
    base_job_name='processing-job'    # prefix for the job name; the SDK appends a timestamp to make it unique
)

Next we configure the processing inputs and outputs. A `ProcessingInput` tells SageMaker which S3 location to copy into which folder on the processing instance before the script starts. A `ProcessingOutput` tells it which folder on the instance to upload to which S3 location once the script finishes.

In [14]:

s3_output_location = f"s3://{bucket}/ml_workshop/data/processing_output"

job_inputs = [
    ProcessingInput(
        input_name="data",
        source=s3_csv_data,                     # the S3 location from where the data will be read and copied to the processing instance
        destination="/opt/ml/processing/input", # the folder inside the processing instance where the data will be copied to
    )
]

job_outputs = [
    ProcessingOutput(
        output_name="data_structured",
        source="/opt/ml/processing/output",   # the folder inside the processing instance where the script writes its output
        destination=s3_output_location,       # the S3 location where the output will be stored
    ),
]

Finally, it's time to run the job. We provide a custom script, [convert_to_parquet.py](./processing_script/convert_to_parquet.py). Click the link and take a look at the script. It takes command line arguments for the input and output directories, so it knows where to read the data from and where to write the output. Additionally, the `processing_script` source directory contains a [requirements.txt](./processing_script/requirements.txt) file that specifies the dependencies for the script, in this case duckdb. If a requirements file is present in the source directory, SageMaker installs the dependencies before running the script, which makes it easy to bring your own code and libraries.
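
To make the shape of such a script concrete, here is a minimal sketch of a processing script that accepts `--input_dir` and `--output_dir`. It is not the contents of the provided `convert_to_parquet.py`. The function names and the query are assumptions modelled on the notebook cells above.

```python
import argparse


def parse_args(argv=None):
    """Parse the two directories passed via `arguments=[...]` in `processor.run`."""
    parser = argparse.ArgumentParser(description="Convert CSV files to partitioned Parquet")
    parser.add_argument("--input_dir", required=True, help="folder containing the input CSV files")
    parser.add_argument("--output_dir", required=True, help="folder to write the Parquet files to")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)

    # duckdb is expected to be installed from requirements.txt before the script starts
    import duckdb

    conn = duckdb.connect()  # in-memory database; nothing needs to persist after the job
    conn.execute(
        f"""copy (select *,
        year(TI_LN_DATE_OPEN) as TI_LN_DATE_OPEN_YEAR,
        month(TI_LN_DATE_OPEN) as TI_LN_DATE_OPEN_MONTH
        from '{args.input_dir}/*.csv')
        to '{args.output_dir}'
        (FORMAT PARQUET, PARTITION_BY (TI_LN_DATE_OPEN_YEAR, TI_LN_DATE_OPEN_MONTH), OVERWRITE_OR_IGNORE true)"""
    )


# A real script would finish with:
#     if __name__ == "__main__":
#         main()
```

Because the script only reads and writes local folders, the same file can be tested on a laptop by pointing the two arguments at local directories.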

In [None]:
job = processor.run(
    code="convert_to_parquet.py",          # the script to be run
    source_dir="processing_script",        # the folder containing the script
    inputs=job_inputs,
    outputs=job_outputs,
    arguments=[                            # arguments to be passed to the script
        "--input_dir",
        "/opt/ml/processing/input",
        "--output_dir",
        "/opt/ml/processing/output",
    ],
)

In [None]:
# confirm that the output was written to the specified S3 location
!aws s3 ls $s3_output_location/ --recursive
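
The same check can be done from Python, which is handy when the result feeds into later code. This is a sketch under the assumption that a boto3 S3 client is passed in. The helper names `split_s3_uri` and `list_s3_keys` are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_s3_keys(uri, s3_client):
    """Return all object keys under an S3 URI, following pagination.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    bucket_name, prefix = split_s3_uri(uri)
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # pages for an empty prefix carry no "Contents" entry
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

In the notebook this could be called as `list_s3_keys(s3_output_location + "/", boto3.client("s3"))`. The trailing slash keeps the prefix from also matching sibling folders whose names start with the same text.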

### Creating jobs using a @remote decorator
An alternative and somewhat simpler approach to creating a job is using the [@remote decorator](https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-decorator.html). The `@remote` decorator allows you to convert a function to a job. The decorator takes care of packaging the function and dependencies, uploading the code to S3, and running the job. This is a great way to quickly convert a function to a job without having to write a separate script. 

In [None]:
from sagemaker.remote_function import remote
from sagemaker import image_uris


# We use the PyTorch framework container for the job.
# By default the remote decorator tries to reproduce the environment in which the function
# was defined, so providing the image_uri and dependencies is optional.
image_uri = image_uris.retrieve(
    framework="pytorch",
    image_scope="training",
    region=region,
    version="2.2",
    py_version="py310",
    instance_type="ml.m5.xlarge",
)


@remote(
    instance_type="ml.m5.xlarge",
    dependencies="processing_script/requirements.txt",      # try removing the image uri and dependencies to see if the function still works!
    image_uri=image_uri,
)
def convert_to_parquet(input_s3_path: str, output_s3_path: str):
    """Read the CSV files under an S3 path, convert them to Parquet, and write them to another S3 location."""
    import duckdb  # imported inside the function so the remote job does not depend on notebook state

    conn = duckdb.connect("temp_data.duckdb")

    # configure S3 access
    conn.execute(
        """CREATE SECRET s3_access (
           TYPE S3,
           PROVIDER CREDENTIAL_CHAIN
        );"""
    )

    # create a temporary table from data in S3
    conn.execute(f"CREATE TABLE temp_table AS SELECT * FROM '{input_s3_path}/*.csv'")

    # convert the data to parquet format
    conn.execute(
        f"""copy (select *, 
    year(TI_LN_DATE_OPEN) as TI_LN_DATE_OPEN_YEAR, 
    month(ti_ln_date_open) as TI_LN_DATE_OPEN_MONTH 
    from temp_table) 
    to '{output_s3_path}' 
    (FORMAT PARQUET, PARTITION_BY (TI_LN_DATE_OPEN_YEAR, TI_LN_DATE_OPEN_MONTH), OVERWRITE_OR_IGNORE true)"""
    )

    return output_s3_path

In [None]:
func_s3_output = f"s3://{bucket}/ml_workshop/data/processing_output_func"
convert_to_parquet(s3_csv_data, func_s3_output)

In [None]:
# confirm that the output was written to the specified S3 location
!aws s3 ls $func_s3_output/ --recursive

### Self-paced exercises

Using any of the approaches above, create a SageMaker job that does the following:
   - a. Check that the loan open date (TI_LN_DATE_OPEN) is before the first installment date (TI_LN_DATE_FIRST_INSTALLMENT) and before the closing date (TI_LN_DATE_CLOSED)
   - b. Check if customers have as many accounts as the field TI_CU_NUM_LOAN_ACCT states
   - c. Check that all accounts with a close reason (TI_LN_REASON_CLOSED) have a valid close date (TI_LN_DATE_CLOSED)
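
As a possible starting point for check (c), the sketch below counts rows that have a close reason but no close date. It treats "valid" as "not NULL", which is an assumption. To stay self-contained it runs against an in-memory SQLite table with three made-up rows as a stand-in for DuckDB, so the rows and the name `CHECK_CLOSE_DATE_SQL` are illustrative only.

```python
import sqlite3

# The query uses only portable SQL, so it should also run unchanged on the
# DuckDB `loan_data` table created earlier in this notebook.
CHECK_CLOSE_DATE_SQL = """
SELECT COUNT(*) AS violations
FROM loan_data
WHERE TI_LN_REASON_CLOSED IS NOT NULL
  AND TI_LN_DATE_CLOSED IS NULL
"""

# Stand-in table with synthetic rows (not real workshop data)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE loan_data (TI_LN_REASON_CLOSED TEXT, TI_LN_DATE_CLOSED TEXT)")
demo.executemany(
    "INSERT INTO loan_data VALUES (?, ?)",
    [
        ("paid_off", "2021-03-01"),  # closed with a date: fine
        ("written_off", None),       # close reason but no date: violation
        (None, None),                # still open: fine
    ],
)
violations = demo.execute(CHECK_CLOSE_DATE_SQL).fetchone()[0]
print(violations)  # prints 1 for the synthetic rows above
```

Inside a job, the same query could be passed to `conn.execute(...)` on a DuckDB connection, with the count written to the output folder or returned from a `@remote` function.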


### Conclusion
In this notebook, we learned how to convert a notebook to a job and run it on SageMaker. We also learned how to use the `@remote` decorator to convert a function to a job.

**There's more**

SageMaker also offers a [@step](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-step-decorator.html) decorator that allows you to combine multiple functions into a pipeline. This is a great way to create complex workflows that involve multiple steps.