# Ahead of time partitioning for large models in SageMaker

This notebook demonstrates how to partition a large model's checkpoints ahead of time in SageMaker using a Training job.

To deploy large models that do not fit on a single GPU, the model’s tensor weights are partitioned at runtime and each partition is loaded onto an individual GPU. However, runtime partitioning adds significant time and memory overhead to model loading. `DJLModel` therefore offers an ahead-of-time partitioning capability for the DeepSpeed and FasterTransformer engines, which lets you partition your model weights and save them before deployment. The HuggingFace engine does not support tensor parallelism, so ahead-of-time partitioning is not available for it. In our experiment with the GPT-J model, loading the model from partitioned checkpoints reduced the model loading time by 40%.
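As a rough illustration of what partitioning tensor weights means, the sketch below splits one small weight matrix column-wise into two shards, computes each shard's output separately (as each GPU would), and concatenates the results. It uses plain Python lists instead of GPU tensors, and the helper names `split_columns` and `matvec` are hypothetical; they are not part of DJL, DeepSpeed, or FasterTransformer.

```python
# Illustrative only: column-wise sharding of one weight matrix in plain Python.
# Real engines shard framework tensors across GPUs; all names here are hypothetical.

def split_columns(weight, num_partitions):
    """Split a row-major matrix into equally sized column shards."""
    cols = len(weight[0])
    if cols % num_partitions != 0:
        raise ValueError("column count must be divisible by num_partitions")
    width = cols // num_partitions
    return [
        [row[p * width:(p + 1) * width] for row in weight]
        for p in range(num_partitions)
    ]

def matvec(x, weight):
    """Compute x @ weight for a vector x and a row-major matrix."""
    return [
        sum(x[i] * weight[i][j] for i in range(len(x)))
        for j in range(len(weight[0]))
    ]

weight = [[1, 2, 3, 4],
          [5, 6, 7, 8]]   # 2 x 4 weight matrix
x = [1, 10]               # input activation

shards = split_columns(weight, num_partitions=2)       # one shard per "GPU"
partial_outputs = [matvec(x, shard) for shard in shards]
combined = [value for part in partial_outputs for value in part]

full = matvec(x, weight)  # unsharded reference result
print(combined == full)
```

Because each shard only needs its own slice of the weights, the shards can be saved once and loaded independently, which is what ahead-of-time partitioning exploits.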

The `partition` method invokes an Amazon SageMaker Training job that partitions the model and uploads the partitioned checkpoints to an S3 bucket. You can provide your own S3 location for the upload; otherwise the checkpoints are uploaded to the default SageMaker S3 bucket. Note that this S3 location is remembered for deployment: when you call `deploy` after `partition`, DJLServing downloads the partitioned model checkpoints directly from the uploaded S3 URL, if available.

## FasterTransformer AOT partition

The LMI FasterTransformer DLC has a customized FT library installed, which provides two APIs: `save_checkpoint` and `init_inference`.

Example for `save_checkpoint`:

```
import fastertransformer

# Partition t5-small with tensor parallel degree 2 and save the result to disk.
fastertransformer.save_checkpoint("t5-small",
                                  tensor_parallel_degree=2,
                                  pipeline_parallel_degree=1,
                                  save_mp_checkpoint_path="/home/ubuntu/gpt-sharded/",
                                  dtype="fp16")
```

Once the partition is done, FT creates a file with the verify str in the format `<model-name>-<dtype>-fp16`. If the verify str is available in the model directory that is passed to the `init_inference` API, FT does not perform the partitioning again and instead directly loads the available partitioned checkpoints. For example,

```
import fastertransformer

# Loads the saved partitions directly when the verify str is present.
fastertransformer.init_inference("/home/ubuntu/gpt-sharded/",
                                 tensor_parallel_degree=2,
                                 pipeline_parallel_degree=1,
                                 dtype="fp16")
```
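The skip-if-already-partitioned behavior described above can be sketched in a few lines. This is only an illustration of the pattern, not the FT library's implementation: the marker file name `verify`, the helper `is_already_partitioned`, and the placeholder string `example-verify-str` are all assumptions made for this example.

```python
import tempfile
from pathlib import Path

# Hypothetical marker file name; the FT library's real file name and layout may differ.
MARKER_FILE = "verify"

def is_already_partitioned(model_dir, expected_verify_str, marker_file=MARKER_FILE):
    """Return True if model_dir holds a marker file whose content matches."""
    marker = Path(model_dir) / marker_file
    return marker.is_file() and marker.read_text().strip() == expected_verify_str

# Demonstrate the check on a temporary directory with a placeholder verify str.
with tempfile.TemporaryDirectory() as model_dir:
    before = is_already_partitioned(model_dir, "example-verify-str")
    (Path(model_dir) / MARKER_FILE).write_text("example-verify-str\n")
    after = is_already_partitioned(model_dir, "example-verify-str")

print(before, after)
```

A loader built this way would partition only when the check fails, which is why a changed dtype or parallel degree should produce a different verify str.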


## Setup

### Install the SageMaker Python SDK

First, make sure that the latest version of the SageMaker Python SDK is installed.

In [None]:
%pip install --upgrade "sagemaker"

### Setup account and role

Then, we write AWS credentials and a default region to the local AWS configuration files, import the SageMaker Python SDK, and instantiate a `sagemaker_session`, which we use to determine the current region and execution role.

In [None]:
%%writefile ~/.aws/credentials

[default]
aws_access_key_id =  <insert your key here>
aws_secret_access_key = <insert your key here>
aws_session_token = <insert your key here>

In [None]:
%%writefile ~/.aws/config

[default]
region=us-east-1

In [3]:
import sagemaker
from sagemaker.djl_inference import DJLModel, DeepSpeedModel, FasterTransformerModel
import time

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

## Create the Model

When you use `DJLModel`, the engine is chosen for you based on your model architecture and the optimizations available. For example, if your model is T5, the FasterTransformer engine is chosen automatically. You can also choose the engine yourself by using the corresponding class, such as `DeepSpeedModel` or `FasterTransformerModel`.

Note that ahead-of-time partitioning for the FasterTransformer engine can run on a CPU machine with enough memory to fit your model. For DeepSpeed, you need a larger GPU instance to run ahead-of-time partitioning.
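To judge whether a machine has "enough memory to fit your model", a back-of-the-envelope estimate of the weight size is often sufficient. The sketch below multiplies the parameter count by the bytes per parameter for the chosen dtype. The helper name is hypothetical, the figure of roughly 11 billion parameters for flan-t5-xxl is approximate, and the estimate ignores activations and any working memory the partitioning job itself needs.

```python
# Rough weight-memory estimate; ignores activations and partitioning overhead.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def estimate_weight_memory_gib(num_parameters, dtype, number_of_partitions=1):
    """Return the approximate weight size in GiB per partition."""
    total_bytes = num_parameters * BYTES_PER_PARAM[dtype]
    return total_bytes / number_of_partitions / 1024**3

# Assumption: flan-t5-xxl has roughly 11 billion parameters.
APPROX_PARAMS = 11_000_000_000

total_gib = estimate_weight_memory_gib(APPROX_PARAMS, "fp16")
per_partition_gib = estimate_weight_memory_gib(APPROX_PARAMS, "fp16", number_of_partitions=4)

print(f"total: {total_gib:.1f} GiB, per partition: {per_partition_gib:.1f} GiB")
```

With `number_of_partitions=4`, each GPU holds about a quarter of the weights, which is how a model of this size fits across the GPUs of a multi-GPU instance.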



In [4]:
model_name = "aot-flan-t5-xxl-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
# Overrides the execution role resolved above; replace with your own role ARN.
role = "arn:aws:iam::185921645874:role/AmazonSageMaker-ExeuctionRole-IntegrationTests"


ft_model = FasterTransformerModel(
    model_id='s3://djl-llm/flan-t5-xxl/',
    name=model_name,
    role=role,
    sagemaker_session=sagemaker_session,
    dtype='fp16',
    number_of_partitions=4
)

## Partition the model

Next, we partition the model by invoking the `partition()` function.

In [None]:
ft_model.partition(
    instance_type='ml.g5.12xlarge',
    s3_output_uri='s3://djl-llm/flan-t5-xxl-4p/'
)
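After the partition job finishes, it can be useful to confirm that the checkpoints actually landed under the output URI. The following sketch splits an S3 URI into bucket and prefix and lists the keys under it using the boto3 `list_objects_v2` paginator. The helper names are hypothetical, and the names of the files the job writes are not assumed here.

```python
from urllib.parse import urlparse

def split_s3_uri(s3_uri):
    """Split 's3://bucket/prefix/' into (bucket, prefix)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")

def list_partitioned_files(s3_client, s3_uri):
    """Return all object keys under the given S3 URI."""
    bucket, prefix = split_s3_uri(s3_uri)
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" entry.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

# Usage (requires boto3 and AWS credentials with read access to the bucket):
# import boto3
# print(list_partitioned_files(boto3.client("s3"), "s3://djl-llm/flan-t5-xxl-4p/"))
```

An empty result would suggest that the job failed or wrote to a different location, such as the default SageMaker bucket.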

## Creating a SageMaker Endpoint

Next, we deploy the model by invoking the `deploy()` function. Here we use an `ml.g5.12xlarge` instance, which comes with 4 NVIDIA A10G GPUs.

Note: If you call `deploy` after `partition`, the model automatically loads the partitioned checkpoints uploaded to S3. Make sure that your endpoint has the necessary S3 read permissions.
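As a hedged sketch of what "necessary S3 read permissions" might look like, the snippet below builds a minimal IAM policy document that allows listing and reading objects under the partitioned checkpoint prefix. The helper name is hypothetical, and your account may need a broader or differently scoped policy (for example, if the bucket uses KMS encryption).

```python
import json

def s3_read_policy(bucket, prefix):
    """Build a minimal read-only IAM policy for one S3 prefix (illustrative)."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allows downloading the partitioned checkpoint files.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
            {
                # Allows listing only the keys under the checkpoint prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }

policy = s3_read_policy("djl-llm", "flan-t5-xxl-4p/")
print(json.dumps(policy, indent=2))
```

The printed JSON can be attached to the execution role that the endpoint uses.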

In [None]:
predictor = ft_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name=model_name
)

## Running Inference

Once the endpoint is up and running, we can evaluate the model using the `predict()` function.

In [None]:
input_data = {
  "inputs": "The diamondback terrapin was the first reptile to",
  "parameters": {
    "max_new_tokens": 100,
  }
}

predictor.predict(input_data)

## Cleaning Up

After you've finished using the endpoint, it's important to delete it to avoid incurring unnecessary costs.

In [None]:
predictor.delete_model()
predictor.delete_endpoint()

## Conclusion

In this tutorial, we used the DJL FasterTransformer LMI container to partition the model ahead of time in a SageMaker Training job on an `ml.g5.12xlarge` instance. Then we created the SageMaker endpoint on the same instance type, loading the partitioned checkpoints directly. With DJL LMI containers, you can partition larger models such as GPT-J, flan-t5-xxl, and Pythia-12B using engines like DeepSpeed and FasterTransformer.