This notebook walks you through deploying a custom model to a SageMaker endpoint. This involves packaging the dependencies (inference logic, libraries, etc.) together with the model weights and serving the model with the SageMaker PyTorch framework container.

To serve a custom model with SageMaker, create a directory with the following structure:
```
inference_project/
├── code/
│   ├── inference.py
│   └── requirements.txt
└── model weights and tokenizer config
```

In `requirements.txt`, list the packages you want installed in the container.
`inference.py` contains the logic to run inference on the model. It should define the handler functions that SageMaker inference expects, such as `model_fn`, `input_fn`, and `predict_fn`, as described in the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-deployment-hosting-services-prerequisites.html).
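
As a rough illustration of that handler layout, here is a minimal sketch of an `inference.py`. It is not the script used in this notebook: the model loading is replaced by a stub so the example runs without a GPU or any extra packages, and the returned field name `generated_text` is an assumption. A real script would load the base model and the LoRA adapters inside `model_fn` and call the model inside `predict_fn`.

```python
import json
import os


def model_fn(model_dir):
    """Runs once when the container starts; model_dir holds the extracted model.tar.gz."""
    # Assumption: stand-in for the real loading code (base model + LoRA adapters).
    files = sorted(os.listdir(model_dir)) if os.path.isdir(model_dir) else []
    return {"model_dir": model_dir, "files": files}


def input_fn(request_body, request_content_type):
    """Deserializes the request body into the structure predict_fn expects."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    payload = json.loads(request_body)
    return {"instruction": payload["instruction"], "input": payload.get("input", "")}


def predict_fn(input_data, model):
    """Runs inference; here it only echoes the prompt instead of generating text."""
    prompt = f"{input_data['instruction']}\n{input_data['input']}".strip()
    return {"generated_text": f"[stub output for: {prompt}]"}


def output_fn(prediction, accept):
    """Serializes the prediction for the response."""
    return json.dumps(prediction)
```

The four functions are called in the order `model_fn` (once at startup), then `input_fn`, `predict_fn`, and `output_fn` for each request.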

---------------

#### Copy the inference script and dependencies to the `lora_model` directory where the fine-tuned adapters were saved

In [None]:
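!mkdir -p lora_model/code
# Copy the contents of code/ (not the directory itself) so the script lands at lora_model/code/inference.py
!cp -r code/. lora_model/code/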

#### Compress the fine-tuned model and dependencies into the format SageMaker accepts (tar.gz)

In [None]:
%cd lora_model
!tar zcvf model.tar.gz *
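
The same packaging step can be sketched in Python with the standard `tarfile` module, which also makes it easy to confirm that `code/inference.py` sits at the root of the archive. The helper names `build_model_archive` and `list_archive` are hypothetical, and skipping dotfiles is a choice made here to mirror what the shell `*` glob does.

```python
import os
import tarfile


def build_model_archive(model_dir, archive_name="model.tar.gz"):
    """Pack the top-level entries of model_dir into model_dir/archive_name."""
    archive_path = os.path.join(model_dir, archive_name)
    with tarfile.open(archive_path, "w:gz") as tar:
        for entry in sorted(os.listdir(model_dir)):
            # Skip the archive itself and hidden files (the shell glob * skips them too).
            if entry == archive_name or entry.startswith("."):
                continue
            # arcname keeps paths relative, so code/ ends up at the archive root.
            tar.add(os.path.join(model_dir, entry), arcname=entry)
    return archive_path


def list_archive(archive_path):
    """Return the sorted member names of a tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())
```

For example, `list_archive(build_model_archive("lora_model"))` should include `code/inference.py` alongside the adapter and tokenizer files.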

#### Upload the compressed model to S3

In [2]:
model_uri = "s3://BUCKET NAME/unsloth9/"  # S3 prefix to upload the model to; replace BUCKET NAME with your bucket
!aws s3 cp /root/GraphDataBase/lora_model/model.tar.gz {model_uri}

upload: ../GraphDataBase/lora_model/model.tar.gz to s3://fairstone/unsloth9/model.tar.gz


#### Use the SageMaker PyTorch framework to deploy the model to an endpoint

In [None]:

from sagemaker import get_execution_role
from sagemaker.pytorch.model import PyTorchModel

# SageMaker PyTorch inference container image (GPU build, us-east-1)
pt_image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.4.0-gpu-py311-cu124-ubuntu22.04-sagemaker"

# IAM role used by the model and endpoint
role = get_execution_role()

pytorch_model = PyTorchModel(
    model_data=f"{model_uri}model.tar.gz",
    role=role,
    image_uri=pt_image_uri,
)

# Deploy to one GPU instance; allow up to 300 seconds for the container health check
predictor = pytorch_model.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    container_startup_health_check_timeout=300,
)

#### Make a prediction

In [None]:
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Send and receive JSON payloads
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()
data = {
  "instruction": "Hello",
  "input": ""
}

res = predictor.predict(data=data)
print(res)
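
Outside the notebook, where the `predictor` object is not available, the endpoint can also be called through the `sagemaker-runtime` client in boto3. The following is a minimal sketch under a few assumptions: the helper name `invoke_json_endpoint` is hypothetical, the endpoint is assumed to accept and return JSON, and the client is passed in as an argument so the function itself has no hard dependency on boto3.

```python
import json


def invoke_json_endpoint(runtime_client, endpoint_name, payload):
    """Send a JSON payload to a SageMaker endpoint and decode the JSON response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream, so read it before decoding.
    return json.loads(response["Body"].read())


# Usage sketch (requires boto3, AWS credentials, and a deployed endpoint):
#   import boto3
#   client = boto3.client("sagemaker-runtime")
#   result = invoke_json_endpoint(client, predictor.endpoint_name,
#                                 {"instruction": "Hello", "input": ""})
```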