# Finetune and deploy a custom ML model using the Nach01 algorithm from AWS Marketplace


This algorithm enables the training of predictive models for chemistry-related applications using **Nach01**, a **foundational chemistry language model** trained on **100+ chemistry-specific tasks**.

## Key Features:
1. **Advanced Chemical Modeling** – Utilizes **Nach01**, a specialized language model designed to capture complex chemical patterns and relationships.
1. **State-of-the-Art Performance** – Fine-tuning on proprietary datasets provides best-in-class results for predictive chemistry tasks.
1. **Optimized for AWS** – Ensures **scalability and efficiency**, allowing seamless integration with AWS infrastructure.

By leveraging **Nach01**, you can achieve **high accuracy** in chemical property prediction, molecular design, and other advanced chemistry applications. 

See the research papers at [arxiv.org](https://arxiv.org/abs/2410.09240) and [pubs.rsc.org](https://pubs.rsc.org/en/content/articlelanding/2024/sc/d4sc00966e) for more details.

This sample notebook shows you how to train a custom ML model using [**Nach01**](https://aws.amazon.com/marketplace/pp/prodview-aq32kfu5ifjgw).

> **Note**: This is a reference notebook; it will not run until you make the changes indicated in the notebook.

## Pre-requisites
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open it from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has the **AmazonSageMakerFullAccess** policy attached.
1. Some hands-on experience using [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
1. To use this algorithm successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to [**Nach01**](https://aws.amazon.com/marketplace/pp/prodview-aq32kfu5ifjgw). 

## Contents
1. [Subscribe to the algorithm](#1.-Subscribe-to-the-algorithm)
1. [Install client library](#2.-Install-client-library)
1. [Finetune Nach01 model](#3.-Finetune-Nach01-model)
1. [Create the inference endpoint](#4.-Create-inference-endpoint)
1. [Cleanup resources](#5.-Cleanup-resources)

## Usage instructions
You can run this notebook one cell at a time (press Shift+Enter to run a cell).

## 1. Subscribe to the algorithm

To subscribe to the algorithm:
1. Open the algorithm listing [page](https://aws.amazon.com/marketplace/pp/prodview-aq32kfu5ifjgw).
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **Accept Offer** if you agree with the EULA, pricing, and support terms.
1. After you click the **Continue to configuration** button and choose a **region**, you will see a **Product Arn**. This is the algorithm ARN that you need to specify when training a custom ML model. Copy the ARN for your region and paste it into the following cell.

In [None]:
product_arn = ''
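
ARNs follow the general layout `arn:partition:service:region:account-id:resource`, so a quick check before training can catch an empty value or an ARN copied for the wrong region. The sketch below is illustrative only: `parse_arn` is a hypothetical helper, and the example ARN is a made-up placeholder, not the real product ARN.

```python
def parse_arn(arn: str) -> dict:
    """Split an ARN into its six colon-separated fields."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    keys = ("prefix", "partition", "service", "region", "account_id", "resource")
    return dict(zip(keys, parts))


# Made-up placeholder ARN for illustration; use the one shown on your listing.
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:algorithm/example-algorithm"
fields = parse_arn(example_arn)
print(fields["region"], fields["resource"])
```

Calling `parse_arn(product_arn)` while `product_arn` is still the empty string raises `ValueError`, which is a cheap reminder to fill it in.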

## 2. Install client library
Install the client library from PyPI.
> **Note**: After installation, restart the Jupyter kernel to apply the changes.

In [None]:
%%bash
pip install --quiet --upgrade insilico-aws

Run the following command to create example data files and notebooks in the current workspace. Alternatively, create your own from scratch; if so, change the `examples/` data paths in the cells below.

In [None]:
%%bash
python -c "import insilico_aws; insilico_aws.load_examples(overwrite=False)"

## 3. Finetune Nach01 model

### 3.1 Create client

In [None]:
from insilico_aws.client import AlgorithmClient

client = AlgorithmClient(
    algorithm='nach01',
    arn=product_arn,
    # when running outside SageMaker Studio, you might have to specify `region_name`
)

All required permissions must be assigned to the current execution role; follow [this guide](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) to set it up.

In [None]:
import sagemaker
print(sagemaker.session.get_execution_role(client.sagemaker_session))

### 3.2 Prepare train data

Make sure you have a writable Amazon S3 bucket to upload the files to, and specify it in the cell below. See [this guide](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for how to add S3 permissions.

In [None]:
s3_input_uri = 's3://<bucket>/<prefix>'
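
Before uploading, it can help to confirm that the `<bucket>` and `<prefix>` placeholders were actually replaced. The following is a minimal sketch using plain string handling; `split_s3_uri` is a hypothetical helper and the bucket name in the example is made up.

```python
def split_s3_uri(uri: str) -> tuple:
    """Return (bucket, key_prefix) for an s3:// URI, rejecting placeholders."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"expected an s3:// URI, got {uri!r}")
    if "<" in uri or ">" in uri:
        raise ValueError("replace the <bucket> and <prefix> placeholders first")
    bucket, _, key_prefix = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError("bucket name is missing")
    return bucket, key_prefix


# Made-up bucket and prefix, for illustration only.
bucket, key_prefix = split_s3_uri("s3://my-bucket/nach01/input")
print(bucket, key_prefix)
```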

Training data is a single folder containing the `train.csv` and `test.csv` files. Each of these CSV files must have the following columns:

1. `molecule` - to store the molecule
2. `input_format` - "smiles" by default
3. `input_description` - "small molecule" by default
4. `task_description` - to store the name of the predicted property or an extended description
5. `target` - to store the target value of the property for a given molecule
6. `task_type` - classification or regression
> **Note**: If your training file names differ from `train.csv` and `test.csv`, override the defaults with the training parameters `TRAIN_FILENAME` and `TEST_FILENAME` (see below).
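
The column layout above can be sketched with the standard `csv` module. This is only an illustration: the two molecules, their labels, and the task description string are made up, and the exact encoding the algorithm expects for `target` values is an assumption here.

```python
import csv
import io

# Column names and order taken from the list above.
REQUIRED_COLUMNS = [
    "molecule",
    "input_format",
    "input_description",
    "task_description",
    "target",
    "task_type",
]


def make_rows(smiles_to_target, task_description, task_type):
    """Build one row per molecule, filling in the documented defaults."""
    return [
        {
            "molecule": smiles,
            "input_format": "smiles",
            "input_description": "small molecule",
            "task_description": task_description,
            "target": target,
            "task_type": task_type,
        }
        for smiles, target in smiles_to_target.items()
    ]


def write_csv(rows, handle):
    """Write rows with the required header to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=REQUIRED_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical two-molecule dataset; labels are illustrative only.
example = {"CCO": 1, "c1ccccc1": 0}
buffer = io.StringIO()
write_csv(make_rows(example, "blood-brain barrier penetration", "classification"), buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To produce real files, pass a handle from `open("train.csv", "w", newline="")` instead of the in-memory buffer.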


In [None]:
client.upload_train_data(
    train_data='bbbp/train.csv',
    test_data='bbbp/test.csv',
    s3_uri=s3_input_uri
)

### 3.3 Run training job

Make sure you have a writable Amazon S3 bucket to save the job outputs. Specify it in the cell below.

In [None]:
s3_output_uri = 's3://<bucket>/<prefix>'

**Nach01** supports optional training parameters:
1. `TRAIN_BATCH_SIZE` - Batch size for training (default 1)
2. `MAX_STEPS` - Maximum number of training steps (default 1000)
3. `GRAD_ACCUMULATION_STEPS` - Number of gradient accumulation steps (default 4)
4. `LR` - Learning rate (default 0.00001)
5. `WEIGHT_DECAY` - Weight decay for the optimizer (default 0.01)
6. `WARMUP_STEPS` - Number of warmup steps (default 50)
7. `MOL_AUGMENTATION` - Enables or disables molecular augmentation (default "true")
8. `REGRESSION_HEAD` -  Determines whether the model should predict numerical values or categories. Set to True if your task is regression, where the goal is to predict a continuous number (e.g., toxicity level, molecular property). Set to False (default) for classification, where the target is a category.
9. `TRAIN_FILENAME` - Name of the training data file (default `train.csv`)
10. `TEST_FILENAME` - Name of the test data file (default `test.csv`)

Some training job parameters affect GPU memory usage. For `ml.g6e.2xlarge` we recommend setting `TRAIN_BATCH_SIZE=4`, `GRAD_ACCUMULATION_STEPS=16`.
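
Under the conventional definition of gradient accumulation, the number of samples contributing to one optimizer update is the product of the two parameters, while only `TRAIN_BATCH_SIZE` samples sit on the GPU at once. The sketch below assumes Nach01 follows that convention on a single device; `effective_batch_size` is a hypothetical helper, not part of the client library.

```python
def effective_batch_size(train_batch_size: int, grad_accumulation_steps: int) -> int:
    """Samples contributing to one optimizer update under gradient accumulation."""
    return train_batch_size * grad_accumulation_steps


# Documented defaults: TRAIN_BATCH_SIZE=1, GRAD_ACCUMULATION_STEPS=4
default_effective = effective_batch_size(1, 4)
# Recommended values for ml.g6e.2xlarge: TRAIN_BATCH_SIZE=4, GRAD_ACCUMULATION_STEPS=16
recommended_effective = effective_batch_size(4, 16)
print(default_effective, recommended_effective)
```

If GPU memory runs out, lowering `TRAIN_BATCH_SIZE` and raising `GRAD_ACCUMULATION_STEPS` by the same factor keeps this product unchanged.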

See this [blog post](https://aws.amazon.com/blogs/machine-learning/easily-monitor-and-visualize-metrics-while-training-models-on-amazon-sagemaker/) for more information on how to visualize metrics during training. You can also open the training job from the [Amazon SageMaker console](https://console.aws.amazon.com/sagemaker/home?#/jobs/) and monitor the metrics and logs in the **Monitor** section.

The Amazon SageMaker library has telemetry enabled by default; set the `TelemetryOptOut` parameter to `true` to opt out. See the [SageMaker SDK docs](https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk) for details.

Once the training job has finished, you can deploy the model for real-time inference.

In [None]:
job_name = client.create_training_job(
    input_path=s3_input_uri,
    output_path=s3_output_uri,
    instance_type='ml.g6e.2xlarge',
    training_parameters={'TRAIN_BATCH_SIZE': 4, 'GRAD_ACCUMULATION_STEPS': 16},
    max_run_hours=1,
    wait=True,
    # optionally set which `role` to use to access training data,
    # optionally set the `tags` dict (untagged resources creation might be blocked in your environment),
)

## 4. Create inference endpoint

### A. Start endpoint instance

**Nach01** supports optional inference parameters:
1. `TOP_K` - Specifies the number of augmented SMILES strings used during prediction (default 1). The model generates predictions for each of the `TOP_K` augmentations and combines them into a final prediction using the method defined by `AGGREGATING_STRATEGY`.
1. `AGGREGATING_STRATEGY` - Defines how to combine predictions from multiple augmented SMILES strings (as specified by `TOP_K`) into a single final output. `average` (by default) takes the arithmetic mean of all predictions. `vote` applies majority voting across categorical predictions (e.g., for classification tasks).
1. `TOP_P` - Sampling threshold (default 1.0).
1. `MAX_LENGTH` - Maximum number of tokens (default 1024).
1. `EVAL_BATCH_SIZE` - Batch size for inference. A higher value improves throughput; a lower value reduces GPU memory consumption (default 64).
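
The two `AGGREGATING_STRATEGY` options can be illustrated with a small sketch. This is not the algorithm's implementation: `aggregate` is a hypothetical helper operating on a list of per-augmentation predictions, and the tie-breaking for `vote` (first value seen wins) is an assumption of this sketch.

```python
from collections import Counter
from statistics import mean


def aggregate(predictions, strategy="average"):
    """Combine per-augmentation predictions into a single value."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    if strategy == "average":
        # Arithmetic mean, suited to numeric (regression-style) outputs.
        return mean(predictions)
    if strategy == "vote":
        # Majority vote, suited to categorical outputs.
        return Counter(predictions).most_common(1)[0][0]
    raise ValueError(f"unknown strategy: {strategy!r}")


# Three hypothetical predictions, as if TOP_K were 3.
averaged = aggregate([1.0, 2.0, 3.0], "average")
voted = aggregate(["yes", "no", "yes"], "vote")
print(averaged, voted)
```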

Once the endpoint is created, you can perform real-time inference.

In [None]:
# You can use the current job name if it was created in the same session: `client.job_name`
# Otherwise find one with `client.find_latest_training_job()`
endpoint_name = 'nach01'
client.create_endpoint(
    endpoint_name=endpoint_name,
    instance_type='ml.p3.2xlarge',
    training_job_name=job_name,
    training_job_output_path=s3_output_uri,
    # optionally set which `role` to use to access training results,
    # optionally set the `tags` dict (untagged resources creation might be blocked in your environment),
    # optionally set the `inference_parameters` dict described above,
)

### B. Perform real-time inference

The request data file uses a structure similar to the training files, with the following columns:
1. `molecule`
2. `input_format`
3. `input_description`
4. `task_description`

Because real-time inference has a default timeout, we recommend limiting each request file to about 350 samples.
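
A larger request file can be split into chunks that stay within that limit. The sketch below is one way to do it with the standard `csv` module; `split_request` is a hypothetical helper, and the 800-row input is synthetic data generated only to show the chunk sizes.

```python
import csv
import io

MAX_SAMPLES = 350  # approximate per-request limit suggested above


def split_request(csv_text, max_samples=MAX_SAMPLES):
    """Split CSV text into chunks, each repeating the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    chunks = []
    for start in range(0, len(rows), max_samples):
        out = io.StringIO()
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows[start:start + max_samples])
        chunks.append(out.getvalue())
    return chunks


# Synthetic 800-row request, for illustration only.
header = "molecule,input_format,input_description,task_description"
body = "\n".join("CCO,smiles,small molecule,example task" for _ in range(800))
chunks = split_request(header + "\n" + body)
chunk_sizes = [len(chunk.splitlines()) - 1 for chunk in chunks]
print(chunk_sizes)
```

Each chunk can then be written to its own file and sent as a separate request.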

In [None]:
predictions = client.invoke_endpoint(
    request_data='bbbp/request.csv',
    endpoint_name=endpoint_name,  # omit if created in the same session
)
print(predictions)

## 5. Cleanup resources

### A. Delete the endpoint

Delete the endpoint instance if you no longer use it.

In [None]:
client.delete_endpoint(endpoint_name=endpoint_name, quiet=True)

You can also remove the model description if you don't need to redeploy the same model in the future.

In [None]:
client.delete_model(model_name=endpoint_name, quiet=True)

In [None]:
client.close()

### B. Unsubscribe from the listing (optional)

If you would like to unsubscribe from the algorithm, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on the [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust).
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.

