# Deploy Fine-tuned LLM to SageMaker

This notebook shows how to deploy a LoRA fine-tuned model to AWS SageMaker for inference.

## Overview
1. Merge LoRA adapters with base model
2. Test locally
3. Package and upload to S3
4. Deploy to SageMaker endpoint
5. Run inference


## Setup

Import required libraries and configure paths.


In [27]:
import os
import torch
from peft import PeftModel
from paths import OUTPUTS_DIR
from dotenv import load_dotenv
from utils.config_utils import load_config
from sagemaker.huggingface import HuggingFaceModel, HuggingFacePredictor
from transformers import AutoModelForCausalLM, AutoTokenizer


adapters_dir = os.path.join(OUTPUTS_DIR, "lora_samsum", 'lora_adapters')
merged_model_dir = os.path.join(OUTPUTS_DIR, "lora_samsum", 'merged_model')
cfg = load_config()

## Step 1: Merge LoRA Adapters with Base Model

SageMaker's default Hugging Face inference toolkit doesn't load LoRA adapters on the fly,
so we first merge the adapter weights into the base model and deploy the merged model as a standalone checkpoint.
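To make the idea of merging concrete, here is a toy sketch in plain Python. It is not PEFT's implementation. It assumes the usual LoRA formulation, in which the adapter contributes `scaling * B @ A` on top of a base weight `W`, with `scaling = lora_alpha / r`. The matrices and values are made up for illustration.

```python
# Toy illustration of merging a LoRA update into a base weight matrix.
# Plain lists of rows stand in for tensors; all numbers are invented.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]

W = [[1.0, 2.0], [3.0, 4.0]]  # base weight, shape (2, 2)
B = [[0.5], [1.0]]            # LoRA "up" matrix, shape (2, r) with r = 1
A = [[2.0, -1.0]]             # LoRA "down" matrix, shape (r, 2)
lora_alpha, r = 2, 1
scaling = lora_alpha / r

x = [[1.0], [1.0]]            # input as a column vector

# Adapter path: base output plus the scaled low-rank correction
adapter_out = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged path: fold the correction into the weight once, then use it directly
W_merged = add(W, scale(matmul(B, A), scaling))
merged_out = matmul(W_merged, x)

print(adapter_out == merged_out)
```

Because both paths produce the same output, the merged weights can be served without the adapter files or the `peft` library at inference time.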


In [28]:
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load LoRA adapters (download from S3 first or use local path)
model = PeftModel.from_pretrained(base_model, adapters_dir)

# Merge adapters into base model
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained(merged_model_dir)
# Load the tokenizer saved with the adapters and store it next to the merged weights
tokenizer = AutoTokenizer.from_pretrained(adapters_dir)
tokenizer.save_pretrained(merged_model_dir)


('/Users/mo/Desktop/ReadyTensor/certifications/llm-eng/repos/rt-llm-eng-cert-week7/data/outputs/lora_samsum/merged_model/tokenizer_config.json',
 '/Users/mo/Desktop/ReadyTensor/certifications/llm-eng/repos/rt-llm-eng-cert-week7/data/outputs/lora_samsum/merged_model/special_tokens_map.json',
 '/Users/mo/Desktop/ReadyTensor/certifications/llm-eng/repos/rt-llm-eng-cert-week7/data/outputs/lora_samsum/merged_model/chat_template.jinja',
 '/Users/mo/Desktop/ReadyTensor/certifications/llm-eng/repos/rt-llm-eng-cert-week7/data/outputs/lora_samsum/merged_model/tokenizer.json')

## Step 2: Test Model Locally

Before deploying to SageMaker, check that the merged model generates a sensible summary for a sample prompt.


In [29]:
from utils.data_utils import load_and_prepare_dataset, build_messages_for_sample

train, val, test = load_and_prepare_dataset(cfg)

Loading dataset from local cache: /Users/mo/Desktop/ReadyTensor/certifications/llm-eng/repos/rt-llm-eng-cert-week7/data/datasets/knkarthick_samsum
📊 Loaded 14731 train / 200 val / 200 test samples (from full cache).


In [30]:
messages = build_messages_for_sample(train[0], cfg["task_instruction"])

text_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

In [31]:
from transformers import pipeline
pipe = pipeline("text-generation", model=merged_model, tokenizer=tokenizer)
pipe(text_prompt, return_full_text=False)

Device set to use mps


[{'generated_text': 'Amanda baked cookies for Jerry tomorrow.'}]

## Step 3: Package and Upload to S3

SageMaker loads the model from a `model.tar.gz` archive stored in S3, with the model files at the root of the archive.
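The archive layout is easy to get wrong, so it can help to inspect it before uploading. This is a minimal sketch that uses a throwaway directory with placeholder files (the file contents are made up). It shows that `arcname="."` places the files at the archive root rather than under a `merged_model/` folder.

```python
import os
import tarfile
import tempfile

# Build a throwaway "model" directory with placeholder files (hypothetical contents).
workdir = tempfile.mkdtemp()
model_dir = os.path.join(workdir, "merged_model")
os.makedirs(model_dir)
for name in ("config.json", "tokenizer.json"):
    with open(os.path.join(model_dir, name), "w") as f:
        f.write("{}")

# arcname="." stores the directory's contents at the archive root
tar_path = os.path.join(workdir, "model.tar.gz")
with tarfile.open(tar_path, "w:gz") as tar:
    tar.add(model_dir, arcname=".")

# List what ended up in the archive, with the "./" prefixes normalised away
with tarfile.open(tar_path, "r:gz") as tar:
    members = sorted(os.path.normpath(m.name) for m in tar.getmembers())

print(members)
```

Running the same listing against the real `model.tar.gz` should show `config.json` and the tokenizer files without a leading directory name.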


In [32]:
import tarfile
import boto3

cfg = load_config()
bucket = cfg["bucket"]
s3_key = f"{cfg['output_path']}/merged_model/model.tar.gz"


In [33]:
# Create tar.gz
tar_path = os.path.join(OUTPUTS_DIR, "lora_samsum", "model.tar.gz")
with tarfile.open(tar_path, "w:gz") as tar:
    tar.add(merged_model_dir, arcname=".")

In [34]:
# Upload to S3
s3 = boto3.client("s3")
s3.upload_file(tar_path, bucket, s3_key)

print(f"Uploaded to s3://{bucket}/{s3_key}")

Uploaded to s3://sagemaker-llm-training-bucket/llama-3-2-1b-instruct/merged_model/model.tar.gz


## Step 4: Deploy to SageMaker Endpoint

Create a `HuggingFaceModel` that points at the uploaded archive and deploy it to a real-time endpoint. Deployment takes roughly 5-10 minutes.
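Deployment reads the execution role from an environment variable, and a missing or malformed value tends to surface only as a late error. This is a small sketch of an up-front check. The helper `require_role_arn` and its regular expression are hypothetical and not part of the SageMaker SDK. The pattern assumes the standard IAM role ARN shape `arn:<partition>:iam::<12-digit account id>:role/<name>`.

```python
import os
import re

# Assumed shape of an IAM role ARN; the partition may be aws, aws-cn, aws-us-gov, ...
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/.+$")

def require_role_arn(env_var="SAGEMAKER_EXECUTION_ROLE_ARN"):
    """Return the role ARN from the environment, or raise with a clear message."""
    value = os.getenv(env_var)
    if not value:
        raise RuntimeError(f"{env_var} is not set; add it to your .env file")
    if not ROLE_ARN_PATTERN.match(value):
        raise RuntimeError(f"{env_var} does not look like an IAM role ARN: {value!r}")
    return value

# Example with a placeholder ARN (not a real account or role)
os.environ["SAGEMAKER_EXECUTION_ROLE_ARN"] = "arn:aws:iam::123456789012:role/ExampleRole"
role_arn = require_role_arn()
print(role_arn)
```

In the notebook, the call would go right after `load_dotenv()` so that a configuration problem fails fast instead of partway through deployment.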


In [35]:
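# Read the execution role ARN from the local .env file
load_dotenv()

role = os.getenv("SAGEMAKER_EXECUTION_ROLE_ARN")
bucket = cfg["bucket"]

# Point the Hugging Face inference container at the merged model archive in S3.
# Named hf_model so it does not shadow the PEFT model from Step 1.
hf_model = HuggingFaceModel(
    model_data=f"s3://{bucket}/{s3_key}",
    role=role,
    transformers_version="4.51",
    pytorch_version="2.6",
    py_version="py312"
)

# Create a real-time endpoint backed by a single GPU instance
predictor = hf_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    endpoint_name="llama-endpoint"
)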


------------!

## Step 5: Run Inference

Connect to the deployed endpoint and test it with a sample prompt.
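The endpoint takes a JSON body with `inputs` and optional `parameters`, and returns a list of dicts with a `generated_text` field. This is a minimal sketch of two helpers that keep the request and response handling in one place. `build_payload` and `extract_text` are hypothetical names. The optional `max_new_tokens` entry assumes the endpoint forwards generation parameters to the underlying text-generation pipeline.

```python
def build_payload(prompt, max_new_tokens=None):
    """Build the request body for the text-generation endpoint."""
    params = {"return_full_text": False, "do_sample": False}
    if max_new_tokens is not None:
        # Assumption: the container passes this through to generation
        params["max_new_tokens"] = max_new_tokens
    return {"inputs": prompt, "parameters": params}

def extract_text(response):
    """Pull the generated string out of a [{'generated_text': ...}] response."""
    if not response or "generated_text" not in response[0]:
        raise ValueError(f"Unexpected response: {response!r}")
    return response[0]["generated_text"].strip()

# Exercise the helpers with a stand-in response instead of a live endpoint
payload = build_payload("Summarize this dialogue.", max_new_tokens=64)
fake_response = [{"generated_text": " Amanda baked cookies. "}]
summary = extract_text(fake_response)
print(payload["parameters"], summary)
```

Against the live endpoint, usage would be `extract_text(predictor.predict(build_payload(text_prompt)))`.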


In [51]:
predictor = HuggingFacePredictor(
    endpoint_name="llama-endpoint",
)

predictor.predict({"inputs": text_prompt, "parameters": {
    "return_full_text": False, "do_sample": False
}})


[{'generated_text': "Jimmy and Sandy don't want to go to the bar because Trevor is there."}]

In [52]:
from tqdm import tqdm

results = []

for sample in tqdm(val):
    messages = build_messages_for_sample(sample, cfg["task_instruction"])

    text_prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    out = predictor.predict({
        "inputs": text_prompt,
        "parameters": {"return_full_text": False, "do_sample": False}
    })
    results.append(out)


100%|██████████| 200/200 [01:11<00:00,  2.78it/s]


In [53]:
import evaluate
rouge = evaluate.load("rouge")

predictions = [result[0]["generated_text"] for result in results]
references = [sample["summary"] for sample in val]

rouge.compute(predictions=predictions, references=references)




{'rouge1': np.float64(0.45472632604924645),
 'rouge2': np.float64(0.22621493392056607),
 'rougeL': np.float64(0.38015757712025816),
 'rougeLsum': np.float64(0.3804240535060933)}
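To give a feel for what the `rouge1` score measures, here is a simplified sketch of a unigram-overlap F1 in plain Python. It is not the implementation used by `evaluate`, which applies its own tokenization and options, so its numbers will not match the scores above exactly. The function name `rouge1_f1` and the example strings are made up for illustration.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1: F1 over lowercased, whitespace-split unigrams."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# All three predicted words appear in the five-word reference:
# precision is 3/3, recall is 3/5, so the F1 is about 0.75
score = rouge1_f1("amanda baked cookies", "amanda baked cookies for jerry")
print(round(score, 2))
```

A score near 0.45, as reported above for `rouge1`, therefore indicates substantial but incomplete word overlap between the generated and reference summaries.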