# Finetune HF Llama 3.1 70b and Deploy on AWS Bedrock

This notebook has the following steps: 

1. Import [Llama 3.1 70B](https://huggingface.co/meta-llama/Llama-3.1-70B) from Hugging Face and convert it from the Hugging Face Transformers format to the .nemo format

    Note: you will need a Hugging Face account and approved access to the model.

2. Run supervised fine-tuning (SFT) with the NeMo framework on the [NVIDIA Daring-Anteater dataset](https://huggingface.co/datasets/nvidia/Daring-Anteater), a comprehensive dataset for instruction tuning

3. Convert the fine-tuned model back to Hugging Face format and upload it to AWS S3 for use with AWS Bedrock Custom Model Import

## Convert Hugging Face Model to NeMo

In [None]:
!pip install ipywidgets

In [None]:
import os
import huggingface_hub

# Log in with your Hugging Face access token
huggingface_hub.login("<HF_TOKEN>")

# Download into the same directory that the converter reads from in the next cell
model_dir = "/demo-workspace/Meta-Llama-3.1-70B"
os.makedirs(model_dir, exist_ok=True)
huggingface_hub.snapshot_download(
    repo_id="meta-llama/Llama-3.1-70B", repo_type="model", local_dir=model_dir
)
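
Before converting, it may be worth confirming that the snapshot finished downloading. The sketch below is not part of the NeMo or Hugging Face tooling: `missing_hf_files` is a hypothetical helper, and the file names it looks for (`config.json`, `tokenizer.json`, and at least one `*.safetensors` shard) are an assumption based on the usual layout of a Hugging Face Llama repository.

```python
from pathlib import Path


def missing_hf_files(model_dir: str) -> list[str]:
    """Return the expected checkpoint files that are absent from model_dir.

    Hypothetical helper: the expected names are an assumption about the
    usual Hugging Face Llama layout, not a guarantee from the Hub.
    """
    root = Path(model_dir)
    missing = [
        name for name in ("config.json", "tokenizer.json")
        if not (root / name).is_file()
    ]
    # Weights are normally split across several *.safetensors shards.
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing


if __name__ == "__main__":
    # An empty list suggests the directory is ready for the converter.
    print(missing_hf_files("/demo-workspace/Meta-Llama-3.1-70B"))
```

An empty list only shows that the expected files exist; it does not verify that every shard downloaded completely.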

In [None]:
%%bash
# Clear any temporary weights directory left by a previous run (-f: no error if absent)
rm -rf model_weights

# Converter script from NeMo
python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
  --precision bf16 \
  --input_name_or_path=/demo-workspace/Meta-Llama-3.1-70B \
  --output_path=/demo-workspace/Meta-Llama-3.1-70B.nemo \
  --llama31 True

## Import and Configure Dataset

In [None]:
%%bash

# -p: do not fail if the directory already exists
mkdir -p /demo-workspace/datasets

In [None]:
from datasets import load_dataset
import json

dataset = load_dataset("nvidia/daring-anteater")

# Open each output file once, so that a dataset with several splits
# does not overwrite what an earlier split wrote.
with open("/demo-workspace/datasets/daring-anteater-train.jsonl", "w") as train, \
     open("/demo-workspace/datasets/daring-anteater-val.jsonl", "w") as val:
    for split, shard in dataset.items():
        # The first 85% of each split goes to training, the rest to validation
        train_limit = len(shard) * 0.85
        for count, line in enumerate(shard):
            desired_data = {
                "system": line["system"],
                "conversations": line["conversations"],
                "mask": line["mask"],
                "type": "TEXT_TO_VALUE",
            }
            out = train if count < train_limit else val
            out.write(json.dumps(desired_data) + "\n")
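
The split above writes one JSON object per line, each carrying the keys `system`, `conversations`, `mask`, and `type`. A minimal sketch for checking the resulting files before fine-tuning follows; `count_valid_records` is a hypothetical helper introduced here for illustration, not something NeMo provides.

```python
import json

# Keys written by the split cell above.
REQUIRED_KEYS = {"system", "conversations", "mask", "type"}


def count_valid_records(path: str) -> int:
    """Count records in a JSONL file, raising if one lacks a required key."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"{path}:{line_number} is missing {sorted(missing)}")
            count += 1
    return count
```

Calling it on `daring-anteater-train.jsonl` and `daring-anteater-val.jsonl` and comparing the two counts should show roughly an 85/15 ratio.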

## Finetuning

In [None]:
%%bash

chmod +x /demo-workspace/sft-finetune-llama3.1-70b.sh
ls -l /demo-workspace/sft-finetune-llama3.1-70b.sh

In [None]:
import nemo_run as run


def dgxc_executor(nodes: int = 1, devices: int = 1) -> run.DGXCloudExecutor:
    pvcs = [
        {
            "name": "workspace",  # Default name to identify the PVC
            "path": "/demo-workspace",  # Directory where PVC will be mounted in pods
            "existingPvc": True,  # The PVC already exists
            "claimName": "llama-3-1-70b-pvc-project-ax4ia",  # Replace with the name of the PVC to use
        }
    ]

    return run.DGXCloudExecutor(
        base_url="https://tme-aws.nv.run.ai/api/v1",  # Base URL to send API requests
        app_id="aws-app",  # Name of the Application
        app_secret="<APP_SECRET>",  # Application secret token
        project_name="aws-demo-project",  # Name of the project within Run:ai
        nodes=nodes,  # Number of nodes to run on
        gpus_per_node=devices,  # Number of GPUs (one process each) to use per node
        container_image="nvcr.io/nvidia/nemo:25.02",  # Which container to deploy
        pvcs=pvcs,  # Attach the PVC(s) to the pod
        launcher="torchrun",  # Use torchrun to launch the processes
        env_vars={
            "PYTHONPATH": "/demo-workspace/nemo-run:$PYTHONPATH",  # Add the NeMo-Run directory to the PYTHONPATH
            "HF_TOKEN": "<HF_TOKEN>",  # Add your Hugging Face API token here
            "FI_EFA_USE_HUGE_PAGE": "0",
            "TORCH_HOME": "/demo-workspace/.cache",
            "NEMORUN_HOME": "/demo-workspace/nemo-run",
            "OMP_NUM_THREADS": "1",
        },
    )

In [None]:
executor = dgxc_executor(nodes=4, devices=8)
run.config.set_nemorun_home("/demo-workspace/nemo-run")

with run.Experiment("sft-finetuning") as exp:
    exp.add(run.Script("/demo-workspace/sft-finetune-llama3.1-70b.sh"), executor=executor)

    # Launch the experiment on the cluster
    exp.run(sequential=True)
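
The executor above requests 4 nodes with 8 GPUs each, 32 GPUs in total. The fine-tuning script is not shown in this notebook, so the sketch below only illustrates a constraint that Megatron-style training generally imposes: the GPU count must be divisible by the product of the tensor-parallel and pipeline-parallel sizes. `data_parallel_size` is a hypothetical helper, the parallelism values in the example are illustrative assumptions rather than the script's actual settings, and other parallelism dimensions such as context parallelism are ignored.

```python
def data_parallel_size(
    nodes: int, gpus_per_node: int, tensor_parallel: int, pipeline_parallel: int
) -> int:
    """Return the number of data-parallel replicas for a given GPU layout."""
    world_size = nodes * gpus_per_node
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"{world_size} GPUs cannot be divided into groups of {model_parallel}"
        )
    return world_size // model_parallel


if __name__ == "__main__":
    # Illustrative values only: TP=8 and PP=4 are assumptions, not the script's settings.
    print(data_parallel_size(nodes=4, gpus_per_node=8, tensor_parallel=8, pipeline_parallel=4))
```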

## Import Model to AWS S3

To prepare the model for use with Bedrock, first convert the fine-tuned model weights back to Hugging Face safetensors. The converted model and the original Llama 3.1 tokenizer files are then uploaded to your S3 bucket.

In [None]:
%%bash

python /opt/NeMo/scripts/checkpoint_converters/convert_llama_nemo_to_hf.py \
--input_name_or_path /demo-workspace/llama3.1-70b-daring-anteater-sft/checkpoints/megatron_gpt_sft.nemo \
--output_path /demo-workspace/llama-output-weights.bin \
--hf_input_path /demo-workspace/Meta-Llama-3.1-70B \
--hf_output_path /demo-workspace/sft-llama-3.1-hf

In [None]:
%%bash

export BUCKET_NAME=hf-llama3-1-70b

# Quote the placeholders so the shell does not parse < and > as redirections
export AWS_ACCESS_KEY_ID="<AWS_ACCESS_KEY_ID>"
export AWS_SECRET_ACCESS_KEY="<AWS_SECRET_ACCESS_KEY>"
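# Upload the converted model under the same prefix that the tokenizer files use below
./s5cmd cp "/demo-workspace/sft-llama-3.1-hf/*" s3://$BUCKET_NAME/sft-llama-3.1-hf/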

./s5cmd cp /demo-workspace/Meta-Llama-3.1-70B/tokenizer.json s3://$BUCKET_NAME/sft-llama-3.1-hf/
./s5cmd cp /demo-workspace/Meta-Llama-3.1-70B/tokenizer_config.json s3://$BUCKET_NAME/sft-llama-3.1-hf/
./s5cmd cp /demo-workspace/Meta-Llama-3.1-70B/original/tokenizer.model s3://$BUCKET_NAME/sft-llama-3.1-hf/

To run the model with Bedrock, open the Custom Model Import feature and import the model from your S3 bucket. Once the import completes, the model can be used directly for production inference.
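
The import can also be started programmatically instead of through the console. The following is a minimal sketch using the boto3 `bedrock` client's `create_model_import_job` call. The job name, model name, IAM role ARN, and region are placeholders you would need to supply, and `build_import_request` and `start_import` are hypothetical helpers written for this example.

```python
def build_import_request(job_name: str, model_name: str, role_arn: str, s3_uri: str) -> dict:
    """Assemble the keyword arguments for create_model_import_job."""
    return {
        "jobName": job_name,
        "importedModelName": model_name,
        "roleArn": role_arn,  # IAM role that Bedrock assumes to read the bucket
        "modelDataSource": {"s3DataSource": {"s3Uri": s3_uri}},
    }


def start_import(request: dict, region: str) -> str:
    """Submit the import job and return its ARN."""
    import boto3  # imported lazily so the request can be built without AWS dependencies

    client = boto3.client("bedrock", region_name=region)
    response = client.create_model_import_job(**request)
    return response["jobArn"]


if __name__ == "__main__":
    request = build_import_request(
        job_name="<JOB_NAME>",
        model_name="<MODEL_NAME>",
        role_arn="<ROLE_ARN>",
        s3_uri="s3://hf-llama3-1-70b/sft-llama-3.1-hf/",
    )
    print(request)  # pass to start_import(request, region="<AWS_REGION>") to submit
```

Building the request separately from submitting it lets you review the S3 URI and role before any AWS call is made.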