# Lab 2: Fine-tune Llama 3.2 3B with Experiment Tracking

## Overview

In this lab, you'll learn how to implement **model governance at scale** by fine-tuning a foundation model while automatically tracking all experimentation metadata. This helps you maintain auditability, reproducibility, and lineage in your ML workflows.


### Use Case: Text Summarization

You need a model that can **generate concise summaries** of longer text. This is valuable for:
- **Financial Services**: Summarizing earnings reports, regulatory filings
- **Healthcare**: Condensing patient notes, research papers
- **Legal**: Summarizing contracts, case documents
- **Customer Service**: Creating brief summaries of support tickets

### What You'll Build
A specialized **text summarization model** fine-tuned on the Dolly dataset using Amazon SageMaker JumpStart and the Llama 3.2 3B model.
As part of your fine-tuning, you will track:
- Training hyperparameters
- Data and model artifacts 
- Complete lineage from base model to fine-tuned version

### Why This Matters for Governance
- **Auditability**: Every training run is logged with timestamps, parameters, and results
- **Reproducibility**: All experiments can be recreated from tracked metadata
- **Lineage**: Clear chain from source data → training job → model artifacts → deployments
- **Compliance**: Meet regulatory requirements for model documentation and traceability

## Step 1: Setup and Install Dependencies

First, you'll install the required libraries and initialize your SageMaker session. This establishes the execution context for the training job.

<div style="padding: 15px; background-color: #fff3cd; border-left: 5px solid #ffc107; color: #856404;">
<strong>⚠️ Important:</strong> The cell below installs libraries and restarts the kernel. After the restart, continue with the next cell.
</div>

In [None]:
!pip install -U sagemaker==2.253.1 datasets==4.4.1 mlflow==3.5.1 fsspec==2023.9.2 --quiet
# restart kernel
import IPython
IPython.Application.instance().kernel.do_shutdown(True) #automatically restarts kernel

In [None]:
import datasets
from packaging import version

datasets_version = datasets.__version__
print(f"datasets version: {datasets_version}")

if version.parse(datasets_version) < version.parse("4.4.1"):
    print("⚠️ Warning: datasets version is below 4.4.1. Please run the previous cell again")
else:
    print("✓ Version OK")


In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.jumpstart.estimator import JumpStartEstimator
from sagemaker.jumpstart.model import JumpStartModel

# Initialize SageMaker session
sess = sagemaker.Session()
role = get_execution_role()
region = sess.boto_region_name

# Extract account ID from the role ARN
# Role format: arn:aws:iam::ACCOUNT_ID:role/...
account_id = role.split(':')[4]

# Use pre-configured workshop bucket instead of default bucket
# This avoids VPC endpoint policy restrictions on bucket creation
# You can also find the bucket to use from the CloudFormation output: DataBucketName

bucket = "llm-fine-tuning-data-891377069427-us-east-1"

sm_client = boto3.client('sagemaker', region_name=region)

print(f"Amazon SageMaker role: {role}")
print(f"Account ID: {account_id}")
print(f"Amazon S3 bucket: {bucket}")
print(f"AWS Region: {region}")
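
The account ID above comes from splitting the role ARN on `:` and taking the fifth field. As a minimal sketch of the same idea with basic validation, the hypothetical helper `account_id_from_role_arn` below (not part of the SageMaker SDK) rejects strings that do not look like IAM ARNs. The example ARN is made up for illustration.

```python
def account_id_from_role_arn(role_arn: str) -> str:
    """Return the 12-digit account ID embedded in an IAM role ARN."""
    # ARN layout: arn:partition:service:region:account-id:resource
    parts = role_arn.split(":")
    if len(parts) < 6 or parts[0] != "arn" or parts[2] != "iam":
        raise ValueError(f"Not an IAM ARN: {role_arn!r}")
    account_id = parts[4]
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError(f"Unexpected account ID in ARN: {role_arn!r}")
    return account_id


# Hypothetical role ARN, used only for illustration
example_arn = "arn:aws:iam::123456789012:role/service-role/ExampleExecutionRole"
example_account_id = account_id_from_role_arn(example_arn)
print(example_account_id)
```

Failing early on a malformed ARN is a design choice here: a silent wrong account ID would otherwise surface later as a confusing permissions error.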

## Step 2: Deploy base model

Next, you will deploy the base Llama 3.2 3B model so that you can later compare its performance with the fine-tuned model for your summarization use case.

In [None]:
model_id, model_version = "meta-textgeneration-llama-3-2-3b", "1.*"

<div style="background-color: #d4edda; border: 1px solid #c3e6cb; border-radius: 4px; padding: 12px; margin: 10px 0;">
<b>✓ Llama Model EULA Acceptance</b><br>
To deploy Llama models using SageMaker JumpStart, you must accept Meta's End User License Agreement (EULA). In the notebook, pass <code>accept_eula=True</code> to the <code>deploy()</code> call. By doing so, you acknowledge that you have read and agree to the terms of the EULA, available at https://ai.meta.com/resources/models-and-libraries/llama-downloads/. Deployment will fail if this parameter is not set to <code>True</code>.
</div>


> **⏱️ Note:** The deployment job will take approximately 10 minutes to complete.


In [None]:
from sagemaker.jumpstart.model import JumpStartModel

pretrained_model = JumpStartModel(model_id=model_id, model_version=model_version, instance_type="ml.g5.2xlarge")
# accept_eula=True acknowledges Meta's Llama EULA; deployment fails without it
pretrained_predictor = pretrained_model.deploy(accept_eula=True)

In [None]:
from IPython.display import Markdown, display

base_model_endpoint_name = pretrained_predictor.endpoint_name

console_url = f"https://console.aws.amazon.com/sagemaker/home?region={region}#/endpoints/{base_model_endpoint_name}"

display(Markdown(f"You can **[view the Real-time Endpoint in the SageMaker Console]({console_url})**"))


In [None]:
print(base_model_endpoint_name)
%store base_model_endpoint_name

## Step 3: Initialize MLflow for Experiment Tracking

### Understanding SageMaker Managed MLflow

SageMaker Serverless MLflow provides a **fully managed MLflow App for experiment tracking** that eliminates the need to set up and maintain your own MLflow infrastructure. Key benefits:

- **Centralized Tracking**: All team members log to the same MLflow App
- **No Infrastructure Management**: AWS handles scaling, backups, and availability
- **Integrated Security**: Uses IAM for authentication and authorization
- **Persistent Storage**: Experiments are stored durably in AWS-managed storage

### What Gets Tracked
When you log experiments to MLflow, you capture:
- **Parameters**: Hyperparameters, model IDs, instance types
- **Metrics**: Training loss, validation accuracy, custom metrics
- **Artifacts**: Model files, training datasets, configuration files
- **Metadata**: Run names, timestamps, tags, notes

This creates a **complete audit trail** for governance and compliance.

If you are running this lab as part of an AWS workshop, an MLflow App has already been created for you, and you can use it to track your fine-tuning experiments. The next cell retrieves the MLflow App ARN, which you will use as the tracking URI.

If you are running this notebook in your own environment, you need an existing, running MLflow App to complete it successfully. Refer to the [AWS Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html) for more details.

Now you are ready to set up your experiment.

In [None]:
try:
    response = sm_client.list_mlflow_apps(MaxResults=10)
    mlflow_apps = response.get('Summaries', [])
    
    if mlflow_apps:
        active_apps = [s for s in mlflow_apps if s['Status'] == 'Created']
        
        if active_apps:
            mlflow_app_arn = active_apps[0]['Arn']
            mlflow_app_name = active_apps[0]['Name']
            print("✓ Found active MLflow App:")
            print(f"  Name: {mlflow_app_name}")
            print(f"  ARN: {mlflow_app_arn}")
        else:
            print("⚠ No active MLflow Apps found.")
            mlflow_app_arn = None
    else:
        print("⚠ No MLflow Apps found in this region.")
        mlflow_app_arn = None
except Exception as e:
    print(f"Error checking for MLflow Apps: {e}")
    mlflow_app_arn = None

In [None]:
import mlflow
from datetime import datetime
timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")

# Connect to the managed MLflow App
mlflow.set_tracking_uri(mlflow_app_arn)

# Create or use existing experiment
# Experiments group related runs together (e.g., all summarization model iterations)
experiment_name = f"summarization-experiment-{timestamp}"
mlflow.set_experiment(experiment_name)

print(f"✓ MLflow tracking URI: {mlflow.get_tracking_uri()}")
print(f"✓ Experiment: {experiment_name}")

### 📊 View Your Experiment in MLflow

**To access the MLflow UI:**

1. In the left sidebar of SageMaker Studio, click the **MLflow** icon
2. Click on your MLflow App name
3. Using the menu on the right-hand side, open MLflow (see the first screenshot below)
4. Navigate to your experiment (see the second screenshot below; the exact experiment name will differ depending on the timestamp)
![MLFlow Experiment](../../images/mlflow-console.png)
![MLFlow Experiment](../../images/mlflow-experiment.png)

Next, start a run in your experiment to track the fine-tuning job.

In [None]:
# Start a new MLflow run to track this fine-tuning job
# A "run" represents a single training execution with specific parameters
timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")

run_name=f"llama-3.2-fine-tuning-summarization-{timestamp}"

mlflow_run = mlflow.start_run(run_name=run_name)
print(f"✓ Started MLflow Run ID: {mlflow_run.info.run_id}")
print(f"  This run will track all parameters, metrics, and artifacts from this training job.")

## Step 4: Prepare the Fine-Tuning Dataset

### The Dolly Dataset
In this step, you will prepare the dataset you will use to fine-tune your model.

The Databricks Dolly dataset contains ~15,000 instruction-following examples across multiple categories:
- Summarization
- Question answering
- Information extraction
- Creative writing
- Classification

You'll filter for **summarization examples only** to create a domain-specific model.

### Data Format for Instruction Tuning

The data follows an instruction-tuning format:
```json
{
  "instruction": "Summarize the following text",
  "context": "[Long text to summarize]",
  "response": "[Expected summary]"
}
```

This teaches the model to follow instructions and generate appropriate responses.
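
To make the filter-and-split step concrete, here is a minimal standard-library sketch of the same idea. It is not the Hugging Face `datasets` code used in this lab: the records are hypothetical stand-ins for Dolly rows, and the seeded shuffle is an assumption added so the 70/30 split is repeatable.

```python
import json
import random

# Hypothetical records in the same shape as Dolly rows (not real data)
records = [
    {
        "instruction": f"Summarize text {i}",
        "context": f"Long text {i}",
        "response": f"Summary {i}",
        "category": "summarization",
    }
    for i in range(10)
] + [
    {"instruction": "What is 2 + 2?", "context": "", "response": "4", "category": "open_qa"}
]

# Keep only summarization rows and drop the now-redundant category field
summaries = [
    {key: value for key, value in row.items() if key != "category"}
    for row in records
    if row["category"] == "summarization"
]

# Seeded shuffle so the same rows land in train and test on every run
rng = random.Random(42)
rng.shuffle(summaries)

# Integer arithmetic avoids floating-point surprises when computing the cut
cut = len(summaries) * 7 // 10
train_rows, test_rows = summaries[:cut], summaries[cut:]

# JSON Lines: one JSON object per line
train_jsonl = "\n".join(json.dumps(row) for row in train_rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed matters for governance: without it, the held-out examples change on every run, which makes results harder to reproduce.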

In [None]:
from datasets import load_dataset

dolly_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

summarization_dataset = dolly_dataset.filter(lambda example: example["category"] == "summarization")
summarization_dataset = summarization_dataset.remove_columns("category")

# Split dataset: 70% for training the model, 30% held out to evaluate performance on unseen data
train_and_test_dataset = summarization_dataset.train_test_split(test_size=0.3)

train_and_test_dataset["train"].to_json("train.jsonl")
train_and_test_dataset["test"].to_json("test.jsonl")


In [None]:
print("Sample training example:")
train_and_test_dataset["train"][0]

## Step 5: Create Prompt Template

### Why Prompt Templates Matter

A **prompt template** defines how we structure inputs to the model. This is critical because:
- The model was pre-trained with specific formatting conventions
- Consistent formatting improves model performance
- The same template must be used for training AND inference

Your template follows the **instruction-input-response** pattern commonly used for instruction-tuned models.
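
As an illustration of the instruction-input-response pattern, the sketch below expands a template of the same shape as the one in this step for a single record. The record is hypothetical, and `render_example` is an illustrative helper; how the JumpStart training script performs the substitution internally is not shown here.

```python
# Template with the same shape as the one used in this step
template = {
    "prompt": (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n"
    ),
    "completion": " {response}",
}

# Hypothetical record, for illustration only
record = {
    "instruction": "Summarize the following text",
    "context": "SageMaker JumpStart offers pre-trained models and training scripts.",
    "response": "JumpStart provides ready-to-use models.",
}


def render_example(template: dict, record: dict) -> tuple[str, str]:
    """Fill the prompt and completion placeholders from one dataset record."""
    prompt = template["prompt"].format(
        instruction=record["instruction"], context=record["context"]
    )
    completion = template["completion"].format(response=record["response"])
    return prompt, completion


prompt, completion = render_example(template, record)
print(prompt + completion)
```

The placeholder names in the template must match the dataset's column names (`instruction`, `context`, `response`); a mismatch raises a `KeyError` during formatting.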

In [None]:
import json

template = {
    "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n",
    "completion": " {response}",
}

with open("template.json", "w") as f:
    json.dump(template, f)

## Step 6: Upload Datasets to S3

Fine-tuning with Amazon SageMaker JumpStart expects your dataset to be stored in Amazon S3. Below, you'll upload the prepared datasets and template to the Amazon S3 bucket path you have access to.
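
The upload step places all three files under one per-user prefix. A small sketch of how those S3 URIs are composed, assuming a hypothetical bucket and profile name; `s3_uri` is an illustrative helper, not a SageMaker SDK function.

```python
def s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket and key segments into an s3:// URI without doubled slashes."""
    key = "/".join(part.strip("/") for part in parts if part)
    return f"s3://{bucket}/{key}"


# Hypothetical values for illustration
example_bucket = "example-workshop-bucket"
example_profile = "Participant"

data_prefix = s3_uri(example_bucket, example_profile, "dolly_dataset")
uploaded = {
    name: s3_uri(example_bucket, example_profile, "dolly_dataset", name)
    for name in ("train.jsonl", "test.jsonl", "template.json")
}

for name, uri in uploaded.items():
    print(f"{name} -> {uri}")
```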

In [None]:
import json

with open('/opt/ml/metadata/resource-metadata.json', 'r') as f:
    profile_name = json.load(f)['UserProfileName']

profile_name = profile_name[0].upper() + profile_name[1:]


In [None]:
from sagemaker.s3 import S3Uploader

# Use the workshop bucket defined in Step 1
output_bucket = bucket
data_location = f"s3://{output_bucket}/{profile_name}/dolly_dataset"

train_path="train.jsonl"
template_path="template.json"
evaluation_path="test.jsonl"

training_input_path = f'{data_location}/{train_path}'
eval_input_path = f'{data_location}/{evaluation_path}'

S3Uploader.upload(train_path, data_location)
S3Uploader.upload(template_path, data_location)
S3Uploader.upload(evaluation_path, data_location)
print(f"Training data: {data_location}")

In [None]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore', message='Failed to determine whether UCVolumeDatasetSource')

df_train = pd.read_json(train_path, orient="records", lines=True)
training_data = mlflow.data.from_pandas(df_train, source=training_input_path)
mlflow.log_input(training_data, context="training")

In [None]:
df_evaluate = pd.read_json(evaluation_path, orient="records", lines=True)
df_evaluate.size
evaluation_data = mlflow.data.from_pandas(df_evaluate, source=eval_input_path)
mlflow.log_input(evaluation_data, context="evaluation")

## Step 7: Configure and Launch Fine-Tuning Job

### SageMaker JumpStart Benefits

JumpStart provides **pre-configured training scripts** for popular foundation models, eliminating the need to write custom training code. Benefits include optimized training configurations, support for distributed training, and integration with other SageMaker features. Before you start your fine-tuning job, you can also modify the hyperparameters for the model training.

### Key Hyperparameters

- **epoch**: Number of complete passes through the training data
- **learning_rate**: Step size for model updates 
- **instruction_tuned**: Use instruction-following format 
- **per_device_train_batch_size**: Examples processed per GPU 
- **max_input_length**: Maximum tokens in input

All hyperparameters are logged to MLflow for reproducibility.
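
The cell below passes every hyperparameter value as a string. One possible approach, sketched here as an assumption rather than a requirement of the SDK, is to keep typed values in Python and convert them once, so the same dictionary feeds both the estimator and the MLflow log. `to_string_hyperparameters` is a hypothetical helper.

```python
def to_string_hyperparameters(typed: dict) -> dict:
    """Convert typed hyperparameter values to the string form used in this lab."""
    return {name: str(value) for name, value in typed.items()}


# Same keys and values as this lab's configuration, expressed with Python types
typed_hyperparameters = {
    "epoch": 2,
    "instruction_tuned": True,
    "max_input_length": 1024,
}

hyperparameters = to_string_hyperparameters(typed_hyperparameters)
print(hyperparameters)
```

Converting in one place keeps the logged parameters and the values sent to the training job from drifting apart.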

<div style="background-color: #d4edda; border: 1px solid #c3e6cb; border-radius: 4px; padding: 12px; margin: 10px 0;">
<b>✓ Llama Model EULA Acceptance</b><br>
To fine-tune Llama models using SageMaker JumpStart, you must accept Meta's End User License Agreement (EULA). In the notebook, set <code>accept_eula</code> to <code>"true"</code> in the estimator's <code>environment</code>. By doing so, you acknowledge that you have read and agree to the terms of the EULA, available at https://ai.meta.com/resources/models-and-libraries/llama-downloads/. The training job will fail if this parameter is not set to true.
</div>


In [None]:
# Define model ID for Llama 3.2 3B 
model_id = "meta-textgeneration-llama-3-2-3b"
model_version = "*"  # Use latest version

# Configure hyperparameters
hyperparameters = {
    "epoch": "2",
    "instruction_tuned": "True",
    "max_input_length": "1024",
}

# Log all hyperparameters to MLflow
mlflow.log_param("base_model_id", model_id)
mlflow.log_param("model_version", model_version)
for key, value in hyperparameters.items():
    mlflow.log_param(key, value)

print("✓ Hyperparameters configured and logged to MLflow")

In [None]:
# Create JumpStart estimator for fine-tuning
instance_type = "ml.g5.2xlarge"
timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
run_name=f"llama-3.2-fine-tuning-summarization-{timestamp}"

estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    role=role,
    instance_type=instance_type,  # GPU instance for training
    instance_count=1,
    hyperparameters=hyperparameters,
    disable_output_compression=False,
    output_path=f"s3://{bucket}/{profile_name}/model-output/",  # Explicitly use workshop bucket for model artifacts
    environment={
        "accept_eula": "true",  # Accepts Meta's Llama EULA (required for fine-tuning)
        "MLFLOW_TRACKING_URI": mlflow_app_arn,
        "MLFLOW_EXP": experiment_name,
        "MLFLOW_RUN_NAME": run_name
    }
)

# Log training configuration
mlflow.log_param("instance_type", instance_type)
mlflow.log_param("instance_count", 1)
mlflow.log_param("output_path", f"s3://{bucket}/{profile_name}/model-output/")  #Log the output path
mlflow.log_param("image_uri", estimator.image_uri)

print("✓ Training estimator created")

### Start the training job

In [None]:
from IPython.display import Markdown, display

console_url = f"https://console.aws.amazon.com/sagemaker/home?region={region}#/training"

display(Markdown(f"You are now ready to start the training job and fine-tune the model. You can review the metadata and progress of the training job in the AWS console **[🔗 View Training Jobs in SageMaker Console]({console_url})**"))


> **⏱️ Note:** The training job will take approximately 15-18 minutes to complete.

In [None]:
# Launch the fine-tuning job
print("🚀 Starting fine-tuning job...")
print("   This will take approximately 15 minutes.")
print("   You can monitor progress in the SageMaker console.\n")

estimator.fit({"training": training_input_path}, logs=True)

print("\n✓ Fine-tuning job completed!")

# Log training job details to MLflow
mlflow.log_param("training_job_name", estimator.latest_training_job.name)
mlflow.log_param("model_artifact_s3", estimator.model_data)

### Track the training progress
While waiting, you can follow the training progress in the cell output above and review the information you have logged in MLflow:
1. Navigate to the MLflow console
2. Find the `summarization-experiment-<timestamp>` experiment you created earlier
3. Click on its name to view the experiment details
4. Locate the Run and click on its name to view its details

![MLFlow Experiment](../../images/run.png)
![MLFlow Experiment](../../images/run_details.png)

In [None]:
mlflow.log_dict(
    {
        "model_artifact": estimator.model_data,
    },
    "model_info.json"
)

In [None]:
from IPython.display import Markdown, display

output_path = estimator.output_path
training_job_name = estimator.latest_training_job.name

s3_url = f"{output_path}{training_job_name}/output/model.tar.gz"
s3_path = s3_url.replace("s3://", "").split("/", 1)
console_url = f"https://s3.console.aws.amazon.com/s3/object/{s3_path[0]}?prefix={s3_path[1]}"

display(Markdown(f"**Training Output:** The fine-tuned model artifacts are saved at **[{s3_url}]({console_url})**"))
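
The cell above builds the console link with string replacement. A minimal alternative sketch using `urllib.parse`, which also percent-encodes unusual characters in the object key. It assumes the same console URL pattern as the cell above; `s3_console_url` is a hypothetical helper and the example URI is made up.

```python
from urllib.parse import quote, urlparse


def s3_console_url(s3_uri: str) -> str:
    """Turn an s3://bucket/key URI into an S3 console object link."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    bucket_name = parsed.netloc
    key = parsed.path.lstrip("/")
    # quote() leaves "/" intact by default and escapes spaces and similar characters
    return f"https://s3.console.aws.amazon.com/s3/object/{bucket_name}?prefix={quote(key)}"


# Hypothetical artifact location for illustration
example_uri = "s3://example-bucket/Participant/model-output/example-job/output/model.tar.gz"
example_link = s3_console_url(example_uri)
print(example_link)
```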


## Step 7: Deploy the Fine-Tuned Model

Now we'll deploy the fine-tuned model to a SageMaker endpoint for real-time inference.

> **⏱️ Note:** The deployment job will take approximately 10 minutes to complete.

In [None]:
finetuned_predictor = estimator.deploy(instance_type="ml.g5.2xlarge")


In [None]:
fine_tuned_model_endpoint_name = finetuned_predictor.endpoint_name
print(fine_tuned_model_endpoint_name)
%store fine_tuned_model_endpoint_name

In [None]:
# Log deployment details to MLflow
mlflow.log_param("endpoint_name", finetuned_predictor.endpoint_name)
mlflow.log_param("endpoint_instance_type", "ml.g5.2xlarge")

## Step 8: Compare the Base and Fine-Tuned Models


Let's now do some initial testing to compare the outputs of the base and fine-tuned models.
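
For a fair comparison, both endpoints should receive the same prompt, formatted the way the model saw it during fine-tuning. The sketch below shows one way to assemble that request body, assuming the prompt template from Step 5 and the `### Response:` separator used in this step. `build_payload` is a hypothetical helper and the input text is made up.

```python
import json

# Prompt portion of the Step 5 template
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n"
)
# Marker that separates the input from the text the model should generate
RESPONSE_KEY = "\n\n### Response:\n"


def build_payload(instruction: str, context: str, max_new_tokens: int = 100) -> dict:
    """Format one datapoint into the request body sent to both endpoints."""
    prompt = PROMPT_TEMPLATE.format(instruction=instruction, context=context)
    return {
        "inputs": prompt + RESPONSE_KEY,
        "parameters": {"max_new_tokens": max_new_tokens},
    }


# Hypothetical datapoint for illustration
payload = build_payload(
    "Summarize the following text",
    "MLflow records parameters, metrics, and artifacts for each run.",
)
body = json.dumps(payload)
print(payload["inputs"])
```

Building the payload once and reusing it for both predictors means that any difference in output comes from the model, not from the prompt.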

In [None]:
def print_response(model_id, payload, response):
    print(f"Model: {model_id}")
    print(f"Prompt: {payload['inputs']}")
    print(f"Response: {response.get('generated_text')}")
    print("\n==================================\n")

payload = {
    "inputs": """### Instruction: What is Amazon SageMaker in one sentence?### Response:\n""",
    "parameters": {
        "max_new_tokens": 128,
        "top_p": 0.9,
        "temperature": 0.6,
        "return_full_text": False,
    },
}
try:
    response = finetuned_predictor.predict(
        payload, custom_attributes="accept_eula=true"  # Acknowledges Meta's Llama EULA
    )
    print_response(model_id, payload, response)
except Exception as e:
    print(e)

In [None]:
import pandas as pd
from IPython.display import display, HTML

test_dataset = train_and_test_dataset["test"]

(
    inputs,
    ground_truth_responses,
    responses_before_finetuning,
    responses_after_finetuning,
) = (
    [],
    [],
    [],
    [],
)


def predict_and_print(datapoint):
    # For instruction fine-tuning, we insert a special key between input and output
    input_output_demarkation_key = "\n\n### Response:\n"

    payload = {
        "inputs": template["prompt"].format(
            instruction=datapoint["instruction"], context=datapoint["context"]
        )
        + input_output_demarkation_key,
        "parameters": {"max_new_tokens": 100},
    }
    inputs.append(payload["inputs"])
    ground_truth_responses.append(datapoint["response"])
    # accept_eula=true acknowledges Meta's Llama EULA for the base model
    pretrained_response = pretrained_predictor.predict(
        payload, custom_attributes="accept_eula=true"
    )
    responses_before_finetuning.append(pretrained_response.get("generated_text"))
    # The fine-tuned Llama 3 model does not require "accept_eula=true"
    finetuned_response = finetuned_predictor.predict(payload)
    responses_after_finetuning.append(finetuned_response.get("generated_text"))


try:
    for i, datapoint in enumerate(test_dataset.select(range(5))):
        predict_and_print(datapoint)

    df = pd.DataFrame(
        {
            "Inputs": inputs,
            "Ground Truth": ground_truth_responses,
            "Response from non-finetuned model": responses_before_finetuning,
            "Response from fine-tuned model": responses_after_finetuning,
        }
    )
    display(HTML(df.to_html()))
except Exception as e:
    print(e)

In [None]:
print(experiment_name)

In [None]:
%store experiment_name

## Step 9: Review Governance Artifacts in MLflow

### What We've Tracked

Throughout this lab, we've automatically logged:

1. **Data Lineage**
   - Source dataset location
   - Number of training/test examples
   - Data preprocessing steps

2. **Model Lineage**
   - Base model ID and version
   - All hyperparameters
   - Training job name
   - Model artifact location

3. **Deployment Lineage**
   - Endpoint name
   - Instance type

### Accessing Your Experiments

You can view all tracked experiments in:
1. **SageMaker Studio**: Navigate to MLflow App
2. **MLflow UI**: Access through the MLflow App URL
3. **Programmatically**: Query using MLflow APIs

Since the training is now complete, let's mark the Run as completed.

In [None]:
# End the MLflow run
mlflow.end_run()
run_id = mlflow_run.info.run_id

print("✓ MLflow run completed")
print("\nRun Summary:")
print(f"  Experiment: {experiment_name}")
print(f"  Run ID: {run_id}")
print(f"  Run Name: {mlflow_run.info.run_name}")
print(f"\nAll parameters, metrics, and artifacts have been logged for governance and auditability.")

## Key Takeaways

In this lab, you learned how to:

1. ✅ **Fine-tune a foundation model** for a specific use case (text summarization)
2. ✅ **Track all experimentation** using SageMaker Managed MLflow
3. ✅ **Establish complete lineage** from data → training → deployment
4. ✅ **Create audit trails** for governance and compliance
5. ✅ **Enable reproducibility** by logging all parameters and artifacts

### Governance Benefits Demonstrated

- **Auditability**: Every training run is logged with complete metadata
- **Reproducibility**: Any experiment can be recreated from tracked parameters
- **Lineage**: Clear chain from source data to deployed model
- **Compliance**: Meet regulatory requirements for model documentation
- **Collaboration**: Team members can view and compare all experiments

### Next Steps

- In the next lab, you will evaluate your fine-tuned model, review the metrics, and register it to the SageMaker Model Registry.