In [None]:
!pip uninstall autogluon-multimodal autogluon-timeseries -y
!pip install -Uqq torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu124
!FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install -Uqq --no-cache-dir "unsloth[cu124-torch250] @ git+https://github.com/unslothai/unsloth.git" 
!pip install -Uqq accelerate bitsandbytes
!pip install -U tensorflow
!pip install -U awscli
!pip uninstall flash_attn -y

## Fine-tune model for better RAG
In this notebook we will fine-tune the `Mistral 7B` model on the previously generated dataset. There are several reasons why we might want to fine-tune a model:
- Our data is domain specific and is distinct from the data the model was trained on.
- The model is not performing well on the task we are interested in.
- We want to better align the model's responses with our expectations.

### Approach
We will follow a variation of the approach proposed in the Retrieval Augmented Fine Tuning ([RAFT](https://arxiv.org/pdf/2403.10131)) paper. The main idea is to create a fine-tuning dataset in which each example contains the question, relevant and irrelevant contexts, and the answer, and then fine-tune the model on it. The goal is for the model to learn to distinguish relevant from irrelevant contexts when formulating its response. This better simulates the real-world scenario, where the retriever is likely to return a mix of both.

To speed up training, we will use the [unsloth](https://github.com/unslothai/unsloth) library combined with Hugging Face TRL, as described in this [blog post](https://huggingface.co/blog/unsloth-trl). This allows us to complete more fine-tuning within a limited amount of time than would otherwise be possible in this environment.

### Why combine RAG with fine-tuning?
You may wonder why we are fine-tuning the model for RAG rather than fine-tuning on the question and answer task directly. One reason is that this approach trains the model to reason over our custom domain, which lets us adapt the model to a wide variety of tasks besides question answering. For example, we could use the model to summarize specific regulations or extract entities. Another reason is that this approach allows us to deal with changing documents without having to retrain the model. In our example, the Banking Regulations change fairly frequently, as we can see in this [Timeline](https://www.ecfr.gov/recent-changes?search%5Bdate%5D=current&search%5Bhierarchy%5D%5Btitle%5D=12) view. It would be cost prohibitive and impractical to retrain the model every time the regulations change. However, by tuning the model to reason over the documents, we can simply ingest the latest documents into the vector store and the model should be able to provide accurate and up-to-date answers.



In [None]:
import boto3
import sagemaker
import json
from pathlib import Path

sess = sagemaker.Session()
region = sess.boto_region_name

First we will download the pre-trained model from a SageMaker JumpStart S3 bucket.

In [3]:
local_model_path = "mistral-7b-instruct"
if not Path(local_model_path).exists():
    !aws s3 cp --recursive s3://jumpstart-cache-prod-{region}/huggingface-llm/huggingface-llm-mistral-7b-instruct/artifacts/inference-prepack/v2.0.0/ {local_model_path}

Next we load the model using unsloth

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 8192
load_in_4bit = True


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = local_model_path, 
    max_seq_length = max_seq_length,
    load_in_4bit = load_in_4bit,
)

We will make use of parameter-efficient fine-tuning (PEFT), which trains a small number of additional parameters to adapt the model to our task. See [here](https://github.com/huggingface/peft) for more details.

The parameters below such as `r`, `lora_alpha`, and `lora_dropout` can be tuned to improve the performance of the model.
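
To give a feel for what `r` and `lora_alpha` control, the sketch below works out the LoRA scaling factor and the number of added parameters. It assumes the standard LoRA convention (update scaled by `lora_alpha / r`), the rank-stabilized variant (scaled by `lora_alpha / sqrt(r)`), and layer sizes taken from the published Mistral 7B configuration. The helper names and the `MISTRAL_7B_SHAPES` table are illustrative and not part of any library.

```python
import math

def lora_scaling(r: int, lora_alpha: int, use_rslora: bool = False) -> float:
    """Factor applied to the low-rank update before it is added to the frozen weight."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

def lora_param_count(r: int, shapes: dict, num_layers: int) -> int:
    """Each adapted (d_in, d_out) matrix gains r * (d_in + d_out) trainable parameters."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

# Assumed projection shapes (d_in, d_out) for one Mistral 7B decoder layer.
MISTRAL_7B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

standard_scale = lora_scaling(r=32, lora_alpha=16)                 # 0.5
rslora_scale = lora_scaling(r=32, lora_alpha=16, use_rslora=True)  # about 2.83
trainable = lora_param_count(r=32, shapes=MISTRAL_7B_SHAPES, num_layers=32)
print(standard_scale, round(rslora_scale, 2), f"{trainable:,}")
```

Under these assumptions, `r = 32` adds roughly 84 million trainable parameters, on the order of 1% of the base model. Raising `r` grows that count linearly, while `lora_alpha` only changes how strongly the learned update is weighted.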

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, 
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, 
    bias = "none",  
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
    use_rslora = True,  
)

Next we will load our JSON Lines dataset.

In [6]:
from datasets import load_dataset
from pathlib import Path
train_dataset_path = Path("data/prepared_data/prepared_data_train.jsonl")
test_dataset_path = Path("data/prepared_data/prepared_data_test.jsonl")
train_dataset = load_dataset("json", data_files = [train_dataset_path.as_posix()])

Next we need to prepare our data as per the RAFT technique. We'll use a variation of the [Mistral prompt](https://www.promptingguide.ai/models/mistral-7b#chat-template-for-mistral-7b-instruct) as our template to combine the question, relevant and irrelevant contexts, and the answer. 

The number of irrelevant contexts, or distractor documents as they are called in the RAFT paper, is determined by the `DISTRACT_DOCS` variable. With `DISTRACT_DOCS=2`, we include 2 irrelevant documents for each question in addition to the correct document.
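
The cell that follows samples distractors uniformly from the whole training set, so it can occasionally pick the example's own row, and it always places the relevant context first. As an optional variation, the sketch below excludes the example's own row and shuffles the document order. The function name `build_raft_contexts` and the toy corpus are hypothetical, and this is one possible way to do it rather than the method prescribed by the RAFT paper.

```python
import random

def build_raft_contexts(contexts, gold_idx, num_distractors, rng):
    """Return the gold context plus distractors drawn from other rows, in random order."""
    # Exclude the gold row so a distractor can never be the example's own context.
    candidates = [i for i in range(len(contexts)) if i != gold_idx]
    distractor_ids = rng.sample(candidates, k=min(num_distractors, len(candidates)))
    docs = [contexts[gold_idx]] + [contexts[i] for i in distractor_ids]
    # Shuffle so the relevant context is not always in the first position.
    rng.shuffle(docs)
    return docs

# Toy corpus standing in for the "context" column of the training set.
toy_contexts = ["reg A text", "reg B text", "reg C text", "reg D text"]
rng = random.Random(42)
docs = build_raft_contexts(toy_contexts, gold_idx=1, num_distractors=2, rng=rng)
print("\n".join(docs))
```

Building the candidate list for every example is linear in the dataset size, which is fine for a small dataset but would need a cheaper rejection-sampling loop for a large one.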

In [7]:
from random import randint

prompt_template = "[INST] You are a Banking Regulations expert.\nGiven this context\nCONTEXT\n{context}\n Answer this question\nQuestion: {question} [/INST] {response} "

DISTRACT_DOCS = 2

BOS_TOKEN = tokenizer.bos_token
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    contexts = examples["context"]
    questions = examples["question"]
    responses  = examples["answer"]
    texts = []
    for context, question, response in zip(contexts, questions, responses):
        distractor_contexts = [train_dataset["train"][randint(0, train_dataset["train"].num_rows -1)] for _ in range(DISTRACT_DOCS)]
        
        context_docs = [context] + [doc["context"] for doc in distractor_contexts]
        context = "\n".join(context_docs)
        
        text = BOS_TOKEN + prompt_template.format(context=context, question=question, response=response) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

train_dataset = train_dataset["train"].map(formatting_prompts_func, batched = True)
train_dataset = train_dataset.shuffle(seed = 3407)

Finally we configure the training parameters and start the training process using the [Hugging Face TRL](https://github.com/huggingface/trl) library.
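
The training cell later in this section uses `per_device_train_batch_size = 2` and `gradient_accumulation_steps = 4`. As a rough guide to what those values imply, the sketch below computes the effective batch size and an approximate number of optimizer steps per epoch. It assumes a single GPU, and the figure of 1000 packed sequences is made up for illustration; the helper names are hypothetical.

```python
import math

def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of sequences that contribute to each optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices

def approx_steps_per_epoch(num_sequences, batch_size):
    """Approximate optimizer updates per epoch; exact rounding depends on the trainer."""
    return math.ceil(num_sequences / batch_size)

batch = effective_batch_size(per_device_batch_size=2, gradient_accumulation_steps=4)
# 1000 is a made-up number of packed training sequences, used only for illustration.
steps = approx_steps_per_epoch(num_sequences=1000, batch_size=batch)
print(batch, steps)
```

With `warmup_steps = 10`, a run of this hypothetical size would spend under a tenth of its steps in warmup. On a much smaller dataset the warmup can take up a large share of training, so it is worth checking against the real step count.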

In [None]:
import mlflow_utils
import mlflow

mlflow_config_path = Path("mlflow_config.json")
mlflow_available = False  # default, so later cells still run when no config file exists
if not mlflow_config_path.exists():
    print(
        "No MLFlow configuration found. Please run the first notebook to set up MLFlow."
    )
else:
    mlflow_config = json.loads(mlflow_config_path.read_text())
    server_status = mlflow_utils.check_server_status(
        mlflow_config["tracking_server_name"]
    )
    if server_status["IsActive"] == "Active":
        print(
            f'MLFlow server is available. The current status is: {server_status["TrackingServerStatus"]}'
        )
        mlflow_available = True
        mlflow.set_tracking_uri(mlflow_config["tracking_server_arn"])
        mlflow.set_experiment("fine-tuning-banking-regulations")
    else:
        mlflow_available = False
        print(
            f'MLFlow server is not available. The current status is: {server_status["TrackingServerStatus"]}'
        )


In [None]:
from IPython.display import display, Markdown

if mlflow_available:
    pre_signed_url = mlflow_utils.create_presigned_url(mlflow_config["tracking_server_name"])
    display(Markdown(f"Our experiment results will be logged to MLFlow. You can view them from the [MLFlow UI]({pre_signed_url})") )

In [12]:
from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
        report_to = "mlflow" if mlflow_available else "none",  # "none" disables logging integrations
        run_name="fine-tuning"
    )

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = True, # Can make training 5x faster for short sequences.
    args = training_args
)

In [13]:
if mlflow_available:
    mlflow.log_params(training_args.to_dict())

trainer_stats = trainer.train()

### Local Inference
To do a quick test of the model we can run some local inference to make sure that the model is working as expected. We'll perform further validation as part of our RAG pipeline in the next notebook.
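
Reading the generated answer next to the ground truth is usually enough for a smoke test. If a rough number is also useful, one option is a token-overlap F1 score, sketched below. This is an assumption about how one might score answers, not the evaluation used in the next notebook, and `token_f1` is a hypothetical helper.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two strings, using lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Made-up strings standing in for a generated answer and its ground truth.
score = token_f1("banks must hold capital", "banks must hold more capital")
print(round(score, 3))
```

A score like this rewards shared wording rather than correctness, so it should be read as a sanity check only.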

In [None]:
# load test dataset
test_dataset = load_dataset("json", data_files = [test_dataset_path.as_posix()])

In [10]:
# Pick an index from 0 to 399 to test the model
idx = 125
test_question = test_dataset["train"][idx]["question"]
test_context = test_dataset["train"][idx]["context"]
test_answer = test_dataset["train"][idx]["answer"]

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inference_template = "[INST] You are a Banking Regulations expert.\nGiven this context\nCONTEXT\n{context}\n Answer this question\nQuestion: {question} [/INST]"


inputs = tokenizer(
[
    inference_template.format(context=test_context, question=test_question)
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 2000)
# Decode only the newly generated tokens (everything after the prompt) into one string
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens = True)

print("Question: ", test_question)
print("\nGenerated Answer: ", answer)
print("\nGround Truth Answer: ", test_answer)

We can register the model in the [MLflow model registry](https://mlflow.org/docs/latest/model-registry.html). A model registry is a critical component of the ML lifecycle: it lets us track model versions, stage a model for deployment, and manage the model from experimentation through to production. The MLflow registry can automatically sync with the [SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-track-experiments-model-registration.html) to provide a seamless experience for model deployment and MLOps.

In [None]:
from mlflow.models import infer_signature

# Only log the model when the MLflow tracking server was found earlier
if mlflow_available:
    # MLflow infers schema from the provided sample input/output/params
    signature = infer_signature(
        model_input=inference_template.format(context=test_context, question=test_question),
        model_output=test_answer,
    )

    mlflow.transformers.log_model(
        transformers_model={"model": trainer.model, "tokenizer": tokenizer},
        task="text-generation",
        prompt_template="[INST] {prompt} [/INST]",
        signature=signature,
        artifact_path="banking-regulations-adapter",
    )

We would normally use the model registry as part of an MLOps deployment pipeline. However, for the purposes of this workshop we will save the model to a local directory for deployment in the next notebook. There are several options for saving the model: saving only the fine-tuned adapter, merging the adapter into the base model and saving the entire model, or saving a quantized or half-precision version of the model. Saving just the adapter is the most space-efficient option and enables efficient multi-tenant serving, as described in this [blog post](https://aws.amazon.com/blogs/machine-learning/easily-deploy-and-manage-hundreds-of-lora-adapters-with-sagemaker-efficient-multi-adapter-inference/), so it would be the recommended approach. To keep the scope of this workshop simple, we'll save the entire model.
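
For reference, the adapter-only option could look roughly like the sketch below. It relies on the fact that calling `save_pretrained` on a PEFT-wrapped model writes the adapter weights and config rather than the full base model. The wrapper function `save_adapter_only` and the directory name `lora_adapter` are hypothetical, and this workshop does not use the result.

```python
def save_adapter_only(peft_model, tokenizer, output_dir="lora_adapter"):
    """Save only the LoRA adapter weights and the tokenizer files.

    On a PEFT-wrapped model, save_pretrained writes the adapter weights and
    config, not the weights of the frozen base model.
    """
    peft_model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir

# In this notebook the call would be:
# save_adapter_only(model, tokenizer)
```

Serving an adapter saved this way requires the base model to be available separately at load time, which is the trade-off against the merged model saved next.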

In [None]:
model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit")

### Conclusion
In this notebook we saw how we can fine-tune a model for RAG using the RAFT technique. We also saw how we can use the unsloth library to speed up the fine-tuning process. In the next notebook we will deploy the model as a REST API endpoint using SageMaker Hosting Services.