# Fine-Tuning Llama 3 Using Azure Machine Learning Jobs and Hugging Face Tools 

## Introduction

In this lab we will build upon the knowledge gained in the previous lab, where you fine-tuned Llama 3 inside a notebook.  For that lab, we used an interactive Jupyter notebook with one (possibly small) GPU to perform development.  

In this lab we'll run the same code using an Azure ML _job_, which can be run asynchronously as well as scheduled to run at a specific time.  We will also use an AML _cluster_, which can contain multiple machines with multiple GPUs each.  This is an ideal approach for fine-tuning large language models and/or training for many epochs.

## Prerequisites
To run this lab you need to have the following:

* An AML single-machine compute cluster named `mycluster`
* An AML `environment` named `finetune_llama3@latest` and containing the same libraries as the ones in the `requirements.txt` file
* The Medical Dialogs dataset, loaded into AML's Data Repository, accessible through the Assets->Data link on the left-hand side of the screen, with the name MedicalDialogs.
* The code file named `fine_tune_llama_3_doctor.py`, which is a slightly modified, single-file version of the code in the previous lab

## Tools Used
The Python tools used in this lab are the following open-source Hugging Face tools:

* [Transformers](https://huggingface.co/docs/transformers/v4.17.0/en/index) - Implementation of a number of deep-learning models using the Transformer architecture
* [PEFT](https://huggingface.co/docs/peft/index) - Implementation of Parameter-Efficient Fine-Tuning, which allows the fine-tuning of pretrained models by training only a small subset of their parameters.  We will be using the Quantized LoRA (QLoRA) algorithm to fine-tune a model that was quantized to use 4 bits for each weight instead of 16 bits.
* [TRL](https://huggingface.co/docs/trl/index) - This library contains a number of algorithms that help train Transformer-based language models using Reinforcement Learning.  As our dataset contains medical questions and answers, we will be using the Supervised Fine-Tuning (SFT) algorithm.
* [Accelerate](https://huggingface.co/docs/accelerate/index) - This library makes it easy to run multi-GPU training and is integrated into the other libraries we will use.
* [BitsAndBytes](https://huggingface.co/docs/bitsandbytes/index) - This library provides tools for loading models in a quantized form for PyTorch. 
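
To see why 4-bit quantization matters, a back-of-the-envelope estimate of weight memory helps.  The sketch below assumes an 8-billion-parameter model (an illustrative figure, since this lab does not state the model size) and counts only the weights, ignoring activations, optimizer state, and quantization overhead.

```python
def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: an 8B-parameter model, used here only for illustration
NUM_PARAMS = 8_000_000_000

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16 bits per weight
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 4 bits per weight

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
```

Under these assumptions the weights shrink to a quarter of their 16-bit size, which is what lets a quantized model fit on a smaller GPU.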

We will also use the built-in [MLflow](https://mlflow.org/docs/latest/tracking.html) capabilities in AML to track metrics and outputs from our job.


## Imports & Definitions

In this section we import the required libraries for running the job.  Note that for this notebook we only use the AML SDK and a pre-configured Jupyter kernel that comes with AML.  The actual training code has been moved into `fine_tune_llama_3_doctor.py` with the addition of a `main` function and the ability to accept training parameters.
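
The contents of `fine_tune_llama_3_doctor.py` are not shown in this lab, so the following is only a minimal sketch of how a `main` function might accept the three parameters that the job passes on the command line.  The flag names match the command used later in this notebook; the default values and the placeholder body are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the training parameters passed by the AML job command."""
    parser = argparse.ArgumentParser(description="Fine-tune Llama 3 (sketch)")
    # Defaults below are assumptions, chosen only for illustration
    parser.add_argument("--num_epochs", type=int, default=1,
                        help="Number of passes over the training data")
    parser.add_argument("--batch_size", type=int, default=4,
                        help="Per-device training batch size")
    parser.add_argument("--num_data_rows", type=int, default=1000,
                        help="Number of dataset rows to train on")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Placeholder: the real script would load the model and run training here
    print(f"epochs={args.num_epochs} batch_size={args.batch_size} "
          f"rows={args.num_data_rows}")
    return args


# Example call using the same values the job submits later in this notebook
args = main(["--num_epochs=2", "--batch_size=8", "--num_data_rows=10000"])
```

In a standalone script, `main()` would normally be called without arguments so that `argparse` reads the flags from the command line.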

In [None]:
# AML SDK imports

from azure.ai.ml import MLClient, Input, Output, command
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import JobResourceConfiguration


# Authentication
credential = DefaultAzureCredential()

# AML Details - These are available by clicking the down-arrow on the top right-hand side of the screen, next to 
# your login initials

SUBSCRIPTION="<Your Subscription ID>"
RESOURCE_GROUP="<AML Studio Resource Group Name>"
WS_NAME="<Workspace Name>"

# Get a handle to the workspace - this handle allows you to send commands to the workspace via code
ml_client = MLClient(
    credential=credential,
    subscription_id=SUBSCRIPTION,
    resource_group_name=RESOURCE_GROUP,
    workspace_name=WS_NAME
)
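
With the workspace handle in place, it can be worth confirming that the prerequisites listed earlier actually exist before submitting anything.  The helper below is a hypothetical sketch: `check_prerequisites` is not part of the AML SDK, and it assumes that the environment and data asset both carry a `latest` label.  It only uses the `get` methods on `ml_client.compute`, `ml_client.environments`, and `ml_client.data`.

```python
def check_prerequisites(ml_client,
                        compute_name="mycluster",
                        environment_name="finetune_llama3",
                        data_name="MedicalDialogs"):
    """Return a dict saying whether each prerequisite asset could be fetched."""
    lookups = {
        "compute": lambda: ml_client.compute.get(compute_name),
        "environment": lambda: ml_client.environments.get(
            name=environment_name, label="latest"),
        "data": lambda: ml_client.data.get(name=data_name, label="latest"),
    }
    results = {}
    for kind, lookup in lookups.items():
        try:
            lookup()
            results[kind] = True
        except Exception as exc:  # broad on purpose: any failure means "not usable"
            results[kind] = False
            print(f"Could not find {kind}: {exc}")
    return results
```

Calling `check_prerequisites(ml_client)` should return `True` for all three keys when the workspace is set up as described in the Prerequisites section.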

## Submitting the Job 

In this section we submit the job for processing using the AML SDK.

Some interesting points in this `command` object:
*  We submit the entire directory in the AML notebook to the job
*  We use Hugging Face's `accelerate` tool to automatically use all of the GPUs available on the compute cluster
*  We define and submit the `batch_size`, `num_epochs`, and `num_data_rows` parameters, which are handled in the `fine_tune_llama_3_doctor.py` file

In [None]:
# Declare a command object with the details of the job
job = command(

    # local path where the code is stored - in this case, we want the entire directory in the AML notebook 
    code=".",  

    # Download the model files, then use Hugging Face's accelerate to launch training on ALL GPUs on the machine
    command="python download_files.py && accelerate launch fine_tune_llama_3_doctor.py " + \
        "--num_epochs=${{inputs.num_epochs}} --batch_size=${{inputs.batch_size}} --num_data_rows=${{inputs.num_data_rows}}",

    # Custom parameters passed to the training script
    inputs={
        "batch_size": 8,
        "num_epochs": 2,
        "num_data_rows": 10000,
    },

    # Name of environment to use
    environment="finetune_llama3@latest",

    # Display name of this experiment in the Jobs display (MLflow)
    display_name="Fine-Tune Llama 3",

    # Name of compute cluster
    compute="mycluster",

    # Mount the host's /mnt directory at /mounts inside the container, so downloaded HF models are saved on the host
    resources=JobResourceConfiguration(docker_args="-v=/mnt:/mounts"),
)
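
Each `${{inputs.<name>}}` placeholder in the command string is replaced by AML with the matching value from `inputs` when the job runs, so the two must stay in sync.  One way to avoid a mismatch is to derive the flags from the dictionary keys.  The `build_command` helper below is a hypothetical convenience, not part of the AML SDK.

```python
def build_command(script: str, input_names) -> str:
    """Build the job command with one --flag=${{inputs.flag}} per input name."""
    flags = " ".join(
        "--" + name + "=${{inputs." + name + "}}" for name in input_names
    )
    return "python download_files.py && accelerate launch " + script + " " + flags


inputs = {"num_epochs": 2, "batch_size": 8, "num_data_rows": 10000}
command_str = build_command("fine_tune_llama_3_doctor.py", inputs)
print(command_str)
```

The resulting string could then be passed as `command=command_str` together with `inputs=inputs` in the `command(...)` call.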

In [None]:
# Run and return status for the command object
returned_job = ml_client.create_or_update(job)
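
`create_or_update` returns as soon as the job has been submitted, not when it finishes.  The sketch below shows one way to follow the job from the notebook; `follow_job` is a hypothetical helper that wraps the SDK's `jobs.stream` and `jobs.get` calls.

```python
def follow_job(ml_client, job):
    """Print where to find the job, stream its logs, and return the final status."""
    print(f"Job name:   {job.name}")
    print(f"Studio URL: {job.studio_url}")

    # Blocks until the job finishes, echoing its log output as it arrives
    ml_client.jobs.stream(job.name)

    # Re-fetch the job to read its final status
    return ml_client.jobs.get(job.name).status
```

A typical call would be `follow_job(ml_client, returned_job)`.  Streaming is optional, since the same logs are available in the AML Studio UI.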

## Explanation of the Above Code

In order to run an AML job, we upload a set of files that contain the code we developed in the previous lab, but this time as a standalone file.  We've also added:
* A separate Python file for downloading the model and tokenizer files to a fixed location on the GPU machine.  This ensures that we only have to download these large files once.
* A `main` function to the previous code that can optionally receive parameters such as batch size, learning rate, etc.
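
The download script itself is not shown in this lab.  As a rough sketch of the "download only once" idea, the function below skips the download when the target directory already contains files.  It assumes the `huggingface_hub` library is available in the environment.  The function name, the repository ID, and the target path in the usage note are hypothetical.

```python
from pathlib import Path


def ensure_model_downloaded(repo_id: str, local_dir: str) -> str:
    """Download a Hugging Face repo into local_dir unless it is already there."""
    target = Path(local_dir)

    # Crude check: a non-empty directory is treated as a completed download
    if target.is_dir() and any(target.iterdir()):
        return str(target)

    # Imported lazily so the cached path works without network access
    from huggingface_hub import snapshot_download

    target.mkdir(parents=True, exist_ok=True)
    return snapshot_download(repo_id=repo_id, local_dir=str(target))
```

For example, `ensure_model_downloaded("<model repo id>", "/mounts/models/llama3")` would write under `/mounts`, the container path that the job maps to the host's `/mnt` directory.  Gated models such as Llama 3 typically also require a Hugging Face access token.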

We also require an `environment`, which is just a Docker container with the proper libraries to run the standalone Python file.  This environment can be created in the *Assets -> Environments* tab on the left-hand side of the screen.

When a job is submitted to AML using its SDK, AML performs the following steps:
1.  Downloads system containers and the environment container to the cluster machine.  This may take a few minutes the first time it is done on new compute cluster machines.

2.  Runs the standalone Python file containing the experiment code inside the environment's Docker container, while making sure it has access to an MLflow server for tracking and to all of the files in the directory specified in the command.

3.  Writes the standard output and error streams of the run to `std_log.txt` in the working directory.  This file can be viewed in AML by opening the `Jobs` tab and clicking on the experiment name.

4.  Makes anything written to the `output` directory in the working directory (such as the model file or any other additional files) retrievable from the same place.

5.  Makes all MLflow metrics available for review in the same place as well.
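
The logs and output files can also be pulled down with the SDK instead of through the UI.  The following is a minimal sketch using `ml_client.jobs.download`; the helper name and the default download path are assumptions.

```python
def fetch_job_outputs(ml_client, job_name, download_path="./job_outputs"):
    """Download the logs and all outputs of a finished job to a local folder."""
    # all=True requests the logs and every named output, not just the default one
    ml_client.jobs.download(name=job_name, download_path=download_path, all=True)
    return download_path
```

After the job completes, `fetch_job_outputs(ml_client, returned_job.name)` would place the files under `./job_outputs`.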
