<!-- Banner Image -->
<center>
    <img src="https://developer-blogs.nvidia.com/wp-content/uploads/2024/07/rag-representation.jpg" width="75%">
</center>

<!-- Links -->
<center>
  <a href="https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/workbench/" style="color: #76B900;">NVIDIA AI Workbench</a> •
  <a href="https://docs.nvidia.com/ai-workbench/" style="color: #76B900;">User Documentation</a> •
  <a href="https://docs.nvidia.com/ai-workbench/user-guide/latest/quickstart/example-projects.html" style="color: #76B900;">Example Projects Catalog</a> •
  <a href="https://forums.developer.nvidia.com/t/support-workbench-example-project-llama-3-finetune/303411" style="color: #76B900;"> Problem? Submit a ticket here! </a>
</center>

# Finetune and deploy an LLM using SFT and vLLM

Welcome!

In this notebook, we walk through supervised finetuning (SFT) of the base Facebook opt-125m model and then deploy the finetuned model using vLLM.

Note that we will be using the base model in this notebook and not the instruct model. Additionally, we will be running through a full finetune with no quantization. This notebook was originally built for 2 A100-80GB GPUs, but the default hyperparameters have since been adjusted to run on 1x A100-80GB. If you're looking for a lighter Llama3 finetune, check out the other Llama3 finetuning notebook, which uses Direct Preference Optimization.

#### Help us make this tutorial better! Please provide feedback on the [NVIDIA Developer Forum](https://forums.developer.nvidia.com/c/ai-data-science/nvidia-ai-workbench/671).

A note about running Jupyter Notebooks: Press Shift + Enter to run a cell. A * in the left-hand cell box means the cell is running; a number means it has completed. If your notebook is misbehaving, you can stop a long-running process by interrupting the kernel (Kernel tab -> Interrupt Kernel) or by restarting it (Kernel tab -> Restart Kernel). Note that restarting the kernel requires you to run everything from the beginning.

## Table of Contents
1. Install prerequisites
2. Import libraries
3. Download model
4. Finetuning flow
5. Deploy as an OpenAI-compatible endpoint

## 1. Install Prerequisites

In [1]:
!pip install ipywidgets trl transformers accelerate datasets bitsandbytes einops huggingface-hub git+https://github.com/huggingface/peft.git

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting git+https://github.com/huggingface/peft.git
  Cloning https://github.com/huggingface/peft.git to /tmp/pip-req-build-jb2ktzpa
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/peft.git /tmp/pip-req-build-jb2ktzpa
  Resolved https://github.com/huggingface/peft.git to commit 18f3efe5c0d9ca85dd8c2b5c26525f5550b6655d
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
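The install cell pulls unpinned packages (and `peft` straight from its Git repository), so it can help to record which versions were actually installed. Here is a minimal sketch using only the standard library; the package list mirrors the `pip install` line above, and the helper name `installed_versions` is hypothetical:

```python
from importlib.metadata import PackageNotFoundError, version

# Packages named in the pip install cell above
PACKAGES = ["ipywidgets", "trl", "transformers", "accelerate", "datasets",
            "bitsandbytes", "einops", "huggingface-hub", "peft"]

def installed_versions(names):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name:16s} {ver or 'NOT INSTALLED'}")
```

Saving this output next to your results makes it easier to reproduce a run later, since the latest versions of these packages change frequently.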

## 2. Import libraries

In [2]:
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer, SFTConfig

## 3. Load in facebook/opt-125m and our dataset

Because we are using the base model, there is no exact prompt template we have to follow. The dataset we are using follows Llama3's template format, which is fine for downstream use as long as prompts sent to the finetuned opt-125m model follow the same format. If you're bringing your own data, you can format it however you want as long as you use the same formatting downstream.

Here's the official [Llama3 chat template](https://huggingface.co/blog/llama3#how-to-prompt-llama-3)

*Note:* We are using a small model and dataset based on the amount of GPU resources we will be using and also to save time running this notebook during the workshop. 
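If you bring your own data, the key point above is that the training text and the inference prompt share one template. A minimal sketch of that idea follows; the template layout, the column names `review` and `sentiment`, and the helper names are all hypothetical and are not taken from the dataset used in this notebook:

```python
# Hypothetical template: any layout works as long as training and inference match
TEMPLATE = "### Review:\n{review}\n\n### Sentiment:\n{sentiment}"

def format_example(example):
    """Collapse one raw record into the single "text" field the trainer reads."""
    return {"text": TEMPLATE.format(review=example["review"].strip(),
                                    sentiment=example["sentiment"].strip())}

def format_prompt(review):
    """Build the inference-time prompt: same template, answer left blank."""
    return TEMPLATE.format(review=review.strip(), sentiment="")

sample = {"review": " A slow start, but a great ending. ", "sentiment": "positive"}
print(format_example(sample)["text"])
```

With a Hugging Face dataset, a function like this could be applied with `dataset.map(format_example)` so that every row gains a `text` column.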

In [3]:
base_model_id = "facebook/opt-125m"
new_model = "output/models/fb-opt-125m-SFT"
dataset_name = "stolzenp/500-imdb-movie-reviews-smt"

In [4]:
from datasets import load_dataset

dataset = load_dataset(dataset_name, split="train", cache_dir="/workspace/dataset")

README.md:   0%|          | 0.00/483 [00:00<?, ?B/s]

In [5]:
import torch
import os
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

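# Optional: opt-125m is public, so a token is only needed for gated or private models.
# Replace the placeholder with a real token, or remove this line.
os.environ["HF_TOKEN"] = "<insert hf key>"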

model = AutoModelForCausalLM.from_pretrained(base_model_id, 
                                             cache_dir="project/models",
                                             device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    # token=os.environ["HF_TOKEN"],
    trust_remote_code=True,
    add_eos_token=True,
    add_bos_token=True, 
)
tokenizer.pad_token = tokenizer.eos_token



## 4. Set our Training Arguments

A lot of tutorials simply paste a list of arguments leaving it up to the reader to figure out what each argument does. Below, annotations have been added to explain what each argument does.

In [6]:
# Output directory where the results and checkpoint are stored
output_dir = "./results"

# Number of training epochs - how many times does the model see the whole dataset
num_train_epochs = 1 #Increase this for a larger finetune

# Enable bf16 mixed-precision training. The A100 supports bf16 natively,
# so we can set this to True (use fp16 instead on GPUs without bf16 support)
bf16 = True

# Batch size per device: the number of training examples processed in a single forward and backward pass.
per_device_train_batch_size = 16

# Gradients are accumulated over multiple mini-batches before updating the model weights. 
# This allows for effectively training with a larger batch size on hardware with limited memory
gradient_accumulation_steps = 8

# Memory optimization that reduces GPU memory usage during training by storing only a subset
# of intermediate activations and recomputing the rest during the backward pass, trading
# extra compute time for lower memory consumption.
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Number of training steps (overrides num_train_epochs)
max_steps = 100

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save a checkpoint every X update steps
save_steps = 100

# Log every X update steps
logging_steps = 5
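To see how the batch size, gradient accumulation, and `max_steps` settings above interact, here is a small arithmetic sketch. The values for `num_gpus` and `dataset_size` are assumptions: one GPU matches the default setup described in the introduction, and 500 examples is only inferred from the dataset name.

```python
import math

# Values from the cell above
per_device_train_batch_size = 16
gradient_accumulation_steps = 8
max_steps = 100

# Assumptions (not set anywhere in this notebook)
num_gpus = 1
dataset_size = 500

# Examples consumed per optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Total examples processed over the whole run, and what that means in epochs
examples_seen = effective_batch_size * max_steps
updates_per_epoch = math.ceil(dataset_size / effective_batch_size)
approx_epochs = examples_seen / dataset_size

print(f"effective batch size: {effective_batch_size}")   # 128
print(f"examples seen:        {examples_seen}")          # 12800
print(f"updates per epoch:    {updates_per_epoch}")
print(f"approximate epochs:   {approx_epochs:.1f}")
```

Under these assumptions, `max_steps` (not `num_train_epochs`) determines how long training runs, and the model passes over a small dataset many times.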

In [8]:
sft_config = SFTConfig(
    dataset_text_field="text",
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size, 
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=gradient_checkpointing,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    max_grad_norm=max_grad_norm,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    max_seq_length=1024,
    max_steps=max_steps,
    bf16=bf16,
    # optim=optim,
    logging_steps=logging_steps,
    output_dir=output_dir,
    # gradient_accumulation_steps=1
)

In [9]:
peft_config = LoraConfig(
      lora_alpha=8,
      lora_dropout=0.1,
      r=16,
      bias="none",
      task_type="CAUSAL_LM",
)
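Note that `peft_config` is only used if it is passed to the trainer; in the trainer cell of this notebook it is commented out, so the run is a full finetune. For intuition about what `r` and `lora_alpha` control, here is a rough arithmetic sketch. It assumes a single square 768 x 768 projection matrix (768 is the hidden size of opt-125m) and the standard LoRA scaling of `lora_alpha / r`:

```python
# Values from the LoraConfig above
lora_alpha = 8
r = 16

# Assumption: one square projection matrix with opt-125m's hidden size
d_in, d_out = 768, 768

# LoRA adds (lora_alpha / r) * B @ A to the frozen weight W,
# where A has shape (r, d_in) and B has shape (d_out, r)
scaling = lora_alpha / r
full_params = d_in * d_out
lora_params = r * (d_in + d_out)

print(f"scaling factor: {scaling}")
print(f"full matrix:    {full_params:,} weights")
print(f"LoRA adapter:   {lora_params:,} weights ({lora_params / full_params:.1%} of full)")
```

Raising `r` increases adapter capacity and parameter count, while `lora_alpha` rescales how strongly the adapter update contributes relative to the frozen weights.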

In [10]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    # peft_config=peft_config,
    args=sft_config,
)

max_steps is given, it will override any value given in num_train_epochs


In [11]:
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
  return fn(*args, **kwargs)


| Step | Training Loss |
|------|---------------|
| 5    | 1.6251        |
| 10   | 0.9287        |
| 15   | 0.5813        |
| 20   | 0.4117        |
| 25   | 0.2958        |
| 30   | 0.2233        |
| 35   | 0.1806        |
| 40   | 0.1587        |
| 45   | 0.1278        |
| 50   | 0.1193        |


## 5. Run the model for inference!

To deploy this model for fast inference, we use vLLM to host an OpenAI-compatible endpoint. You must **restart the kernel** to free the GPU memory, and then run the cells below.

First, install the vLLM pip package using the cell below. You may need to restart the kernel for the package to take effect. Then, run the command to start the API server. Once it is running, open a new tab in JupyterLab, select Terminal, and run the curl command to send a request to the server.

**Note:** We install the latest version of vLLM in the following cell. This may upgrade your transformers package version, which can cause issues if you re-run the notebook from the beginning. If you would like to re-run the entire notebook for another finetuning flow, restart the project environment from inside AI Workbench to get a fresh environment to work in.

In [12]:
# Install the latest version of vLLM.
!pip install vllm nvidia-ml-py

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting vllm
  Downloading vllm-0.6.1.post2-cp38-abi3-manylinux1_x86_64.whl.metadata (2.4 kB)
Collecting nvidia-ml-py
  Downloading nvidia_ml_py-12.560.30-py3-none-any.whl.metadata (8.6 kB)
Collecting sentencepiece (from vllm)
  Downloading sentencepiece-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Collecting py-cpuinfo (from vllm)
  Downloading py_cpuinfo-9.0.0-py3-none-any.whl.metadata (794 bytes)
Collecting openai>=1.40.0 (from vllm)
  Downloading openai-1.45.1-py3-none-any.whl.metadata (22 kB)
Collecting uvicorn[standard] (from vllm)
  Downloading uvicorn-0.30.6-py3-none-any.whl.metadata (6.6 kB)
Collecting pydantic>=2.9 (from vllm)
  Downloading pydantic-2.9.1-py3-none-any.whl.metadata (146 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)
  Downloading prometheus_fastapi_instrumentator-7.0.0-py3-none-any.whl.metadata (13 kB)
Collecting tiktoken>=0.6.

In [13]:
!pip uninstall pynvml -y

Found existing installation: pynvml 11.4.1
Uninstalling pynvml-11.4.1:
  Successfully uninstalled pynvml-11.4.1

In [2]:
# Run the base Hugging Face model
!python -O -u -m vllm.entrypoints.openai.api_server \
    --host=0.0.0.0 \
    --port=8000 \
    --model=facebook/opt-125m \
    --tensor-parallel-size=1 \
    --gpu-memory-utilization=0.9 \
    --enforce-eager \
    --max-model-len=2048

INFO 09-16 15:57:35 api_server.py:495] vLLM API server version 0.6.1.post2
INFO 09-16 15:57:35 api_server.py:496] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='output/models/fb-opt-350m-SFT', tokenizer='facebook/opt-350m', skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2048, guided_decoding_backend='outlines', 

In [None]:
# Run the finetuned model
!python -O -u -m vllm.entrypoints.openai.api_server \
    --host=0.0.0.0 \
    --port=8000 \
    --model=output/models/fb-opt-125m-SFT \
    --tokenizer=facebook/opt-125m \
    --tensor-parallel-size=1 \
    --gpu-memory-utilization=0.9 \
    --enforce-eager \
    --max-model-len=2048
    # Alternative flags if you trained a LoRA adapter instead of a full finetune:
    # serve the base model and attach the adapter with
    # --model=facebook/opt-125m \
    # --enable-lora \
    # --lora-modules peft=output/models/fb-opt-125m-SFT \

### Open up a terminal and run

*Baseline test (with the base model server running):*
curl http://localhost:8000/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
       "model": "facebook/opt-125m",
       "prompt": "The god father movie is",
       "max_tokens": 30,
       "temperature": 0
   }'

*Post-finetuning test (with the finetuned model server running):*
curl http://localhost:8000/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
       "model": "output/models/fb-opt-125m-SFT",
       "prompt": "The god father movie is",
       "max_tokens": 30,
       "temperature": 0
   }'
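
The same requests can also be sent from Python. The sketch below uses only the standard library and mirrors the curl payload above; the helper names `build_request` and `complete` are hypothetical, and it assumes the server is listening on `localhost:8000` as configured in the launch cells:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/completions"

def build_request(model, prompt, max_tokens=30, temperature=0):
    """Build the same POST request that the curl commands send."""
    payload = {"model": model, "prompt": prompt,
               "max_tokens": max_tokens, "temperature": temperature}
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(model, prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]

# Inspect the request without contacting the server
request = build_request("facebook/opt-125m", "The god father movie is")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# With the server running, uncomment to get a completion:
# print(complete("facebook/opt-125m", "The god father movie is"))
```

To compare the two models, call `complete` once against each server with the same prompt, passing the matching `model` name each time.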
