# Script-write your soap opera - Fifth Elephant workshop

Welcome to the LLM fine-tuning workshop and thanks for your interest. In this session, we will walk you through the steps involved in fine-tuning your own Large Language Model (LLM). This is an active, fast-moving area of research, so the information here may change frequently. We have also made some decisions for this workshop with the aim of simplifying the process, and we hope it provides a great starting point from which participants can explore further.

## Let's build an AI script writer whom we can include in the writer's room

The writer's room is a fabled place where the writers working on a show come together with the showrunner, executive producer and others to write its script. Each writer's room operates differently, but they all involve brainstorming ideas, identifying characters and specific plot elements, and in the end producing a detailed script for an episode of a series, show or play. The idea of an AI script writer is not new, and there has been research in this area even before the recent popularity of LLMs. In the LLM era, two interesting approaches stand out. The first is [Showrunner Agents](https://fablestudio.github.io/showrunner-agents/), where researchers built out an entire episode of South Park by combining various types of LLM-based agents. The second is a paper called [Dramatron (from DeepMind)](https://arxiv.org/pdf/2209.14958), which used a series of prompts to create plays that were eventually staged as well.

In this workshop, we will take inspiration from the Dramatron paper to create our own scriptwriting assistant by fine-tuning our own LLM.

As described in the slides, here is a simple step-by-step approach to fine-tuning our own LLM.

<a href="https://ibb.co/bQyLGNx"><img src="https://i.ibb.co/mbYtPhm/Screenshot-2024-07-04-at-14-46-19.png" alt="Screenshot-2024-07-04-at-14-46-19" border="0"></a>

## Building our training config

We choose to use the Axolotl package for running our fine-tuning. It is a wrapper on top of various other packages and provides a simple but detailed configuration file where all parameters for a specific fine-tuning job are specified. There are also several example configurations available from which we can start and adapt as required - this provides a great onboarding experience.

Let's take a look at what our configuration file looks like.

In [None]:
#######
### Configuration file for a training job that teaches Mistral 7B v0.1 to memorize a small batch from the SQLQA dataset
#######

###
# Model Configuration: Mistral 7B
###
base_model: mistralai/Mistral-7B-v0.1

# base model weight quantization
load_in_8bit: false
load_in_4bit: false

# attention implementation
flash_attention: true

# finetuned adapter config
adapter: lora
lora_model_dir:
lora_r: 16
lora_alpha: 32
lora_dropout: 0.0 # off because this is a memorization test
lora_target_linear: true
lora_modules_to_save: # required when adding new tokens to LLaMA/Mistral
  - embed_tokens
  - lm_head
# for details, see https://github.com/huggingface/peft/issues/334#issuecomment-1561727994

###
# Dataset Configuration: sqlqa
###

datasets:
  # This will be the path used for the data when it is saved to the Volume in the cloud.
  - path: data.jsonl
    ds_type: json
    type:
      # JSONL file contains question, context, answer fields per line.
      # This gets mapped to instruction, input, output axolotl tags.
      field_instruction: question
      field_input: context
      field_output: answer
      # Format is used by axolotl to generate the prompt.
      format: |-
        [INST] Using the schema context below, generate a SQL query that answers the question.
        {input}
        {instruction} [/INST]

# dataset formatting config
tokens: # add new control tokens from the dataset to the model
  - "[INST]"
  - " [/INST]"
  - "[SQL]"
  - " [/SQL]"

special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

train_on_inputs: false

val_set_size: 0.5

# dataset packing config
sequence_len: 4096
sample_packing: false
eval_sample_packing: false
pad_to_sequence_len: false
group_by_length: false

###
# Training Configuration: AdamW, CosineLR, deepspeed, many epochs
###

# random seed for better reproducibility
seed: 117

# optimizer config
optimizer: adamw_torch
learning_rate: 0.0001
lr_scheduler: cosine
warmup_steps: 10
gradient_accumulation_steps: 1
micro_batch_size: 16
weight_decay: 0.0

# axolotl saving config
dataset_prepared_path: last_run_prepared
output_dir: ./lora-out

# logging and eval config
logging_steps: 10
eval_steps: 10
save_strategy: "no"
num_epochs: 50

# wandb logging config
wandb_project: memorize-sqlqa

# training performance optimization config
bf16: auto
fp16: false
tf32: false
deepspeed: /workspace/axolotl/deepspeed_configs/zero3_bf16.json
gradient_checkpointing: true

###
# Miscellaneous Configuration
###

# prevents over-writing the config from the CLI
strict: false

# run with debug-level logs
debug:

# "Don't mess with this, it's here for accelerate and torchrun" -- axolotl docs
local_rank:
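
To make the dataset section of the configuration more concrete, here is a minimal sketch of how one JSONL row could be turned into a prompt and a completion using the `format` template and the `field_instruction`, `field_input` and `field_output` mappings shown above. This only illustrates the mapping. It is not Axolotl's actual implementation, and the example row and the `render_example` helper are hypothetical.

```python
# Illustrative only: mimics the field mapping from the config above,
# not Axolotl's internal prompt-building code.
PROMPT_FORMAT = (
    "[INST] Using the schema context below, generate a SQL query that answers the question.\n"
    "{input}\n"
    "{instruction} [/INST]"
)

# Hypothetical JSONL row with the three fields the config expects.
row = {
    "question": "How many users signed up in 2023?",
    "context": "CREATE TABLE users (id INT, signup_year INT)",
    "answer": "SELECT COUNT(*) FROM users WHERE signup_year = 2023",
}


def render_example(row, prompt_format=PROMPT_FORMAT):
    """Return (prompt, completion) for one dataset row."""
    prompt = prompt_format.format(
        instruction=row["question"],  # field_instruction: question
        input=row["context"],         # field_input: context
    )
    completion = row["answer"]        # field_output: answer
    return prompt, completion


prompt, completion = render_example(row)
print(prompt)
print(completion)
```

With `train_on_inputs: false`, the intent is that the model is trained to produce the completion while the prompt part serves only as conditioning text.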

## Constructing the Fine-tuning Dataset

One of the most important aspects of fine-tuning an LLM is the dataset on which you would like to fine-tune. Many datasets are available on the HuggingFace datasets repository, but this is also where your unique business advantage or proprietary data comes into play. For instance, let's say that you already run an app in production with multiple users. You would typically start by using GPT-3.5 or GPT-4 as the LLM. Over time, you will likely make many LLM calls and store the historical information. Not all responses may be accurate, and you will often know what the desired response actually is. You can therefore annotate or label this data manually. This could be a great asset: it lets you build a specialized LLM that cannot be easily replaced and is also cheaper.
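
As a rough sketch of the idea above, the snippet below turns a log of production LLM calls into fine-tuning records. The log structure (`prompt`, `response`, `reviewed`, `corrected`) and the `to_training_records` helper are hypothetical. The output field names `question`, `context` and `answer` are chosen to match the configuration shown earlier.

```python
import json

# Hypothetical log of production LLM calls. "corrected" holds a reviewer's
# fix when the original response was wrong, and None when it was acceptable.
logged_calls = [
    {"prompt": "Summarise ticket 1", "response": "Good summary",
     "reviewed": True, "corrected": None},
    {"prompt": "Summarise ticket 2", "response": "Wrong summary",
     "reviewed": True, "corrected": "Fixed summary"},
    {"prompt": "Summarise ticket 3", "response": "Unchecked summary",
     "reviewed": False, "corrected": None},
]


def to_training_records(calls):
    """Keep only reviewed calls and prefer the reviewer's correction as the target."""
    records = []
    for call in calls:
        if not call["reviewed"]:
            continue  # unreviewed responses may be inaccurate, so skip them
        target = call["corrected"] if call["corrected"] is not None else call["response"]
        records.append({"question": call["prompt"], "context": "", "answer": target})
    return records


records = to_training_records(logged_calls)
# One JSON object per line, i.e. the JSONL layout the config reads.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```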

For this workshop, since we are not dealing with an existing app, we followed the route of creating a synthetic dataset with the help of the OpenAI APIs. Basically, we take an existing example and ask the OpenAI LLM to create multiple other examples that follow the same style but add some diversity. You can also run this via a batch API that is cheaper, although you might need to wait up to 24 hours for it to process.
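
As a hedged illustration of the batch route, the sketch below builds the request lines for an OpenAI Batch API input file from a single seed example, without making any network calls. The prompts, the model name and the `build_batch_requests` helper are assumptions for illustration only. They are not the prompts used in the "Synthetic_Data_Generation" notebook.

```python
import json

# Seed example to imitate (the logline used later in this workshop).
SEED_EXAMPLE = (
    "Logline: A story about a famed German baker who must decide what to do "
    "with his beloved and popular bakery as he thinks about the future"
)


def build_batch_requests(seed_example, n_variations, model="gpt-3.5-turbo"):
    """Build one chat-completion request per desired synthetic example."""
    batch = []
    for i in range(n_variations):
        body = {
            "model": model,  # assumed model name, swap in the one you use
            "messages": [
                {"role": "system", "content": "You write loglines for TV shows."},
                {
                    "role": "user",
                    "content": (
                        f"Here is an example:\n{seed_example}\n\n"
                        "Write one new logline in the same style, "
                        "with a different setting and protagonist."
                    ),
                },
            ],
            "temperature": 1.0,  # higher temperature for more diverse examples
        }
        batch.append({
            "custom_id": f"logline-{i}",  # lets you match results to requests
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": body,
        })
    return batch


# Each request becomes one line of the JSONL file uploaded to the batch API.
lines = [json.dumps(request) for request in build_batch_requests(SEED_EXAMPLE, 3)]
print("\n".join(lines))
```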

We will dive into a different notebook called "Synthetic_Data_Generation" to look at the details on how one can generate a dataset. Feel free to use the techniques and prompts provided for your own use case. However, to speed up the process of the workshop I have also uploaded the dataset that I created to HuggingFace datasets and it is also available in this repository.

## Running the Fine-tuning process

We have our dataset, we have decided on the configuration, and now it's time to run our fine-tuning process. Until now we were dealing with conceptual topics; now we get to the nitty-gritty of making this whole thing work. This is typically also the point where you will run into the most issues. One of the biggest bottlenecks with fine-tuning LLMs is the requirement to have GPUs. Even small models (with 7 billion parameters) are still quite large and not everyone has a GPU lying around. This is where it makes sense to leverage a cloud provider.
In addition, many of the fine-tuning libraries are still works in progress with a lot of rough edges. So it's not uncommon to run into issues like dependency problems during installation, incorrect deployments when moving to the cloud and much else.

Keeping the goal of this workshop in mind, we chose Modal Labs as the GPU provider and use a starter script that they have already created to run our fine-tuning job. This script takes care of multiple aspects of running a GPU training process, like creating the necessary Docker containers, kicking off a distributed training run, merging the LoRA at the end and finally also spinning up an inference server. We will obviously incur costs in this process, but Modal also provides up to 30 USD of free credit with every account each month, which should be enough for running many of our fine-tuning processes.

In [None]:
modal run --detach src.train --config=config/ai_script_writer.yml --data=data/fine_tuning.jsonl

## Performing Inference

Now that we have trained our model, let's see how it works - moment of truth!

There are also several options for running inference. Modal allows you to have a serverless inference instance that is spun up only when a request needs to be served and is shut down again afterwards. This way, you are not incurring GPU costs when there are no requests. The trade-off is that you will have to accept start-up wait times. In the default configuration, we have set a timeout of 15 minutes, so if no new requests are served for 15 minutes the server is shut down automatically.
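
Because a serverless endpoint may still be starting when the first request arrives, a client can retry with a growing delay. The following is a minimal sketch of that idea. The `call_with_retry` helper and the `flaky` stand-in for a cold-starting server are hypothetical and are not part of Modal or of the workshop scripts.

```python
import time


def call_with_retry(send_request, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call send_request(), retrying with exponential backoff when it raises."""
    last_error = None
    for attempt in range(attempts):
        try:
            return send_request()
        except Exception as error:  # real code should catch specific timeout errors
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise last_error


# Stand-in for a cold-starting server: fails twice, then answers.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("server still starting")
    return "ready"


# Pass a no-op sleep so the demonstration runs instantly.
print(call_with_retry(flaky, sleep=lambda seconds: None))
```

In practice `send_request` would wrap the HTTP call to the inference endpoint, and the delays would be chosen to match the observed start-up time.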

In the following section, we will create a simple Gradio app to provide the input and read the output responses from the model.

`modal run -q src.inference --prompt "[INST] Logline: A story about a famed German baker who must decide what to do with his beloved and popular bakery as he thinks about the future
[/INST]"`

`modal deploy src.inference`

In [1]:
!pip install gradio

In [2]:
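import requests

def http_bot(prompt):
    """Send the prompt to the deployed Modal inference endpoint and return the generated text."""
    url = 'https://sidhusmart--example-axolotl-inference-web.modal.run'
    params = {
        'input': prompt
    }

    # Generous timeout (value chosen arbitrarily): a cold start of the serverless endpoint can take a while.
    response = requests.get(url, params=params, timeout=300)
    # Raise on HTTP errors instead of returning an error page as if it were model output.
    response.raise_for_status()
    return response.text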

In [3]:
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown("# Fine-tuned LLM Completion\n")
    inputbox = gr.Textbox(label="Input", placeholder="Enter text and press ENTER")
    outputbox = gr.Textbox(label="Output", placeholder="Generated result from the model")
    # Pressing ENTER in the input box sends its text to http_bot and shows the reply in the output box.
    inputbox.submit(http_bot, [inputbox], [outputbox])

demo.launch(debug=True, share=True)

