# End-to-end DSPy Workflows Guide 

- **Setup and Data Gathering**: gather and preprocess a dataset for fine-tuning a DSPy pipeline using Llama 3 70B
- **Fine-tuning**: tune a smaller LLM (Llama 3 8B, LoRA or full-parameter) on the pipeline
- **Serving**: serve the pipeline as a production application that can autoscale, etc.
- **Evaluation**: apply batch offline inference with Ray Data and vLLM to quickly evaluate the pipeline.

## Set up

Node setup:

We will be running everything on a head node that uses 8xL4 GPUs. I find that they are usually available and suitable for this use case. You can also use a more powerful node.

To switch to L4 GPUs, click "1 active node" in the top right corner. Then, for the workspace node, click the pencil icon, navigate to the L4 tab, and select the 8xL4 option. If you do not see L4 in the list of GPUs, they may not be available on your cloud; choose another kind of GPU. (This notebook has been tested on X and Y as alternatives.) (TODO)

In [None]:
# TODO(work): DSPy installation cell
# TODO(decision): are these changes going to be merged into DSPy main

# Either pull from pip or install locally if using a non-public version

# !pip install -e dspy-d
# !pip install -r dspy-d/requirements.txt
# !pip install vllm

In [None]:
import dspy
import dsp
import os

# TODO: include cache in notebook
cache_dir = "/home/ray/default/dspy_cache"
# Create the cache directory if it does not exist yet
os.makedirs(cache_dir, exist_ok=True)
    
os.environ["DSP_CACHEDIR"] = cache_dir

In [2]:
# Here are the different options for setting the OPENAI_API_KEY.
# (1) If you are running this notebook locally, you can set it in your shell:
# export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>

# (2) If you are running this notebook in a Google Colab, you can set it in
#     the secrets panel and import by uncommenting the following lines:
# from google.colab import userdata
# os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

# (3) You can set it directly in the notebook by uncommenting the following
#     line with the appropriate key. If you pick this option, make sure to
#     remove your key before sharing your notebook or committing to a
#     repository.
# os.environ["OPENAI_API_KEY"] = <OPENAI_API_KEY>


necessary_env_vars = [
    # "OPENAI_API_KEY",
    # "API_BASE",
    "DSP_CACHEDIR",
    "HF_TOKEN"
]

for var in necessary_env_vars:
    # Use .get so a missing variable triggers the assertion message
    # instead of a bare KeyError
    assert os.environ.get(var), f"{var} is not set"


In [None]:
import ray

if not ray.is_initialized():
    # Pass a plain dict copy of the environment rather than the os.environ mapping itself
    ray.init(runtime_env={"env_vars": dict(os.environ), "py_modules": [dspy, dsp]})

Now we need to download the models that we will use for this notebook.

In this case, it is the Instruct version of the Meta Llama 3 70B and 8B models.

We can use the public Anyscale S3 bucket to download the models. We will use the cluster storage as the download location. (TODO: Is cluster storage okay for this?)

In [None]:
!aws s3 cp s3://large-dl-models-mirror/models--meta-llama--Meta-Llama-3-70B-Instruct/main-safetensors/ /mnt/cluster_storage/meta-llama--Meta-Llama-3-70B-Instruct --recursive
!aws s3 cp s3://large-dl-models-mirror/models--meta-llama--Meta-Llama-3-8B-Instruct/main-safetensors/ /mnt/cluster_storage/meta-llama--Meta-Llama-3-8B-Instruct --recursive

We will make use of a random number generator in this notebook. We are creating a Random object here to ensure that our notebook is reproducible.

In [None]:
import random

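# Seed the generator; without a fixed seed, the shuffles below differ between runs.
# The seed value 0 is an arbitrary choice.
rng = random.Random(0)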
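
As a quick illustration of why the seed matters, here is a minimal sketch using only the standard library. The helper name `shuffled_copy` and the seed value are made up for this example; the point is that two generators built from the same seed shuffle a list into the same order.

```python
import random

# Arbitrary seed chosen for illustration only.
SEED = 0


def shuffled_copy(items, seed):
    """Return a shuffled copy of `items` using a dedicated, seeded generator."""
    rng = random.Random(seed)
    copy = list(items)
    rng.shuffle(copy)
    return copy


first = shuffled_copy(range(10), SEED)
second = shuffled_copy(range(10), SEED)
print(first == second)  # same seed, same order
```

Using a dedicated `random.Random` instance, rather than the module-level functions, also keeps our shuffles independent of any other code that touches the global random state.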

# Data preparation

In [None]:
from dspy.datasets import HotPotQA
from dspy.evaluate import Evaluate
from dsp.utils.utils import deduplicate


# We are setting the experimental flag to True to make use of the fine-tuning
# features that are still in development.
dspy.settings.configure(experimental=True)

# Define the program
class BasicMH(dspy.Module):
    def __init__(self, passages_per_hop=3, num_hops=2):
        super().__init__()
        self.num_hops = num_hops
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_query = [dspy.ChainOfThought("context, question -> search_query") for _ in range(self.num_hops)]
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")
    
    def forward(self, question):
        context = []
        
        for hop in range(self.num_hops):
            search_query = self.generate_query[hop](context=context, question=question).search_query
            passages = self.retrieve(search_query).passages
            context = deduplicate(context + passages)

        answer = self.generate_answer(context=context, question=question).copy(context=context)
        return answer

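Let's break down what BasicMH does.

BasicMH is a multi-hop question-answering program. On each of its `num_hops` hops, a chain-of-thought module turns the current context and the question into a search query, the retriever fetches `passages_per_hop` passages for that query, and the new passages are merged into the context with duplicates removed. A final chain-of-thought module then answers the question from the accumulated context.

TODO: expand the explanation of what each part does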
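
To make the control flow concrete, here is a small sketch of the same hop loop with the language model and the retriever replaced by toy stand-ins. Everything in it (`fake_generate_query`, `fake_retrieve`, the four-passage `CORPUS`) is hypothetical and exists only for illustration; it is not how DSPy or ColBERTv2 work internally.

```python
import re

# Toy corpus standing in for the real retrieval index.
CORPUS = [
    "Paris is the capital of France.",
    "The Seine flows through Paris.",
    "Berlin is the capital of Germany.",
    "The Seine is a river in northern France.",
]


def tokens(text):
    """Lowercased word set, used for a crude word-overlap score."""
    return set(re.findall(r"\w+", text.lower()))


def deduplicate(items):
    """Drop repeated items while keeping first-seen order."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


def fake_generate_query(context, question):
    # Stand-in for the LM: first hop searches the question itself,
    # later hops search with the most recently added passage.
    return question if not context else context[-1]


def fake_retrieve(query, k):
    # Stand-in for the retriever: rank passages by word overlap with the query.
    query_tokens = tokens(query)
    ranked = sorted(CORPUS, key=lambda p: -len(tokens(p) & query_tokens))
    return [p for p in ranked if tokens(p) & query_tokens][:k]


def multi_hop_context(question, num_hops=2, passages_per_hop=3):
    context = []
    for _ in range(num_hops):
        query = fake_generate_query(context, question)
        passages = fake_retrieve(query, k=passages_per_hop)
        context = deduplicate(context + passages)
    return context


context = multi_hop_context("Which river flows through the capital of France?")
for passage in context:
    print(passage)
```

The second hop can surface passages that the question alone did not rank highly, which is the reason for retrieving in several hops instead of one.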

In [None]:
# Prepare the dataset
TRAIN_SIZE = 1000
DEV_SIZE = 500
dataset = HotPotQA(train_seed=1, eval_seed=2023, test_size=0, only_hard_examples=True)
trainset = [x.with_inputs('question') for x in dataset.train][:TRAIN_SIZE]
devset = [x.with_inputs('question') for x in dataset.dev][:DEV_SIZE]

In [None]:
# Prepare the metric and evaluator
NUM_THREADS = 12
metric = dspy.evaluate.answer_exact_match
evaluate = Evaluate(devset=devset, metric=metric, num_threads=NUM_THREADS, display_progress=True)

TODO(optional): Discuss LLM as judge vs exact match
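
As a starting point for that discussion, here is a rough sketch of what an exact-match metric typically does. It is a simplified stand-in written for this notebook, not DSPy's actual implementation of `answer_exact_match`; the function names and normalization steps are assumptions.

```python
import re
import string


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    return any(normalize(prediction) == normalize(gold) for gold in gold_answers)


print(exact_match("The Eiffel Tower.", ["eiffel tower"]))
print(exact_match("Paris, France", ["Paris"]))
```

The second call shows the main weakness: an answer that is arguably correct but phrased differently scores as a miss. An LLM judge can accept such answers, at the cost of extra inference and less deterministic scores.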

In [3]:
# Prepare the retriever model
# Note that this is deliberately not hosted on Anyscale, to represent a real-world scenario where you aren't hosting your DB on Anyscale
COLBERT_V2_ENDPOINT = "http://20.102.90.50:2017/wiki17_abstracts"
retriever = dspy.ColBERTv2(url=COLBERT_V2_ENDPOINT)


## Gathering baseline performance

Run `evaluate` on the base pipeline to get a baseline.

In [None]:
MAX_TOKENS = 1024
MODEL_PARAMETERS = {
  "max_tokens": MAX_TOKENS,
  "temperature": 0,
}

vanilla_program = BasicMH()

In [None]:
from math import ceil

# TODO: import `DSPyActor`; the import from `deploy_dspy.async_llm` is still incomplete.


def evaluate_ray(program, devset, model, tokenizer, dspy_context_kwargs, tensor_parallel_size, num_gpus):
    # TODO: detect the number of GPUs instead of passing it in.
    # Run one actor per GPU and hand each actor a single batch.
    concurrency = num_gpus
    count = devset.count()  # assumes `devset` is a Ray Dataset
    batch_size = ceil(count / concurrency)
    results = devset.map_batches(
        DSPyActor,
        batch_size=batch_size,
        num_gpus=1,
        concurrency=concurrency,
        fn_constructor_kwargs={"batch_size": batch_size},
    ).take_all()
    print(results)
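
The batch-size arithmetic in `evaluate_ray` can be checked on its own. The sketch below is plain Python with a hypothetical helper, `plan_batches`; it only shows how many examples each worker would receive if the dataset were split into consecutive batches of `ceil(count / concurrency)`. How Ray Data actually forms batches depends on its block layout, so treat this as an approximation.

```python
from math import ceil


def plan_batches(count, concurrency):
    """Split `count` items into batches of ceil(count / concurrency) items."""
    batch_size = ceil(count / concurrency)
    sizes = []
    remaining = count
    while remaining > 0:
        size = min(batch_size, remaining)
        sizes.append(size)
        remaining -= size
    return batch_size, sizes


# 500 dev examples spread over 8 GPUs, as in this notebook's setup.
batch_size, sizes = plan_batches(count=500, concurrency=8)
print(batch_size, sizes)
```

Rounding up means the last batch is smaller than the others, but no worker ever needs a second batch.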

In [None]:
# Let's first get a baseline with the 8B model
from deploy_dspy.async_llm import AsyncLLMWrapper
from transformers import AutoTokenizer
# We first need to create an instance of the LLM in order for DSPy to interact with it
# We will discuss it in detail later, but this is an opportunity to take advantage of Anyscale specific integrations with DSPy and use the new VLLMOfflineEngine class. Because we have 8 GPUs and each one can fit a model instance, we can use 8 instances of the 8B model to run inference in parallel.

In [None]:
llama_8b = dspy.VLLMOfflineEngine(model="meta-llama/Meta-Llama-3-8B", async_mode=False, **MODEL_PARAMETERS)

with dspy.context(lm=llama_8b, rm=retriever):
  print("Evaluating the vanilla program on the devset using the model to be trained (llama 8B)...")
  vanilla_8b_base_eval = evaluate(vanilla_program)

In [None]:
# Now we will create an async wrapper around the 70B model

llm = AsyncLLMWrapper("/mnt/cluster_storage/hf/cache/Meta-Llama-3-70B-Instruct",
            max_pending_requests=512,
            tokenizer=AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct"),
            enforce_eager=True,
            engine_use_ray=False,
            worker_use_ray=False,
            enable_prefix_caching=True,
            tensor_parallel_size=8
        )

In [None]:
# Prepare the language model
llama_70b = dspy.VLLMOfflineEngine(model="meta-llama/Meta-Llama-3-70B", async_mode=False, **MODEL_PARAMETERS)

with dspy.context(lm=llama_70b, rm=retriever):
  print("Evaluating the vanilla program on the devset using llama 70B...")
  vanilla_70b_base_eval = evaluate(vanilla_program)

We hope to bring the 8B model's performance up to at least the level of the 70B model.

## Optimizing the Llama 3 70B pipeline

Run BootstrapFewShotWithRandomSearch (BSFS) and MIPROv2 on the pipeline with the playground model, and choose whichever one gets the best performance.

TODO: add in MIPROv2 (if desired)

In [None]:
# Optimization hyperparameters
from dspy.teleprompt.random_search import BootstrapFewShotWithRandomSearch

# Define the hyperparameters for prompt optimization
MAX_BOOTSTRAPPED_DEMOS = 3
MAX_LABELED_DEMOS = 3
NUM_CANDIDATE_PROGRAMS = 6
OPTIMIZER_NUM_TRAIN = 100
OPTIMIZER_NUM_VAL = 150

In [None]:
# Prepare the training and validation sets for the optimizer using the original
# trainset. This ensures that our "devset" is left untouched.
shuffled_trainset = [d for d in trainset]
rng.shuffle(shuffled_trainset)
optimizer_trainset = shuffled_trainset[:OPTIMIZER_NUM_TRAIN]
optimizer_valset = shuffled_trainset[OPTIMIZER_NUM_TRAIN:OPTIMIZER_NUM_TRAIN+OPTIMIZER_NUM_VAL]

In [None]:
# Initialize the optimizer
bfrs_optimizer = BootstrapFewShotWithRandomSearch(
    metric=metric,
    max_bootstrapped_demos=MAX_BOOTSTRAPPED_DEMOS,
    max_labeled_demos=MAX_LABELED_DEMOS,
    num_candidate_programs=NUM_CANDIDATE_PROGRAMS,
    num_threads=NUM_THREADS
)

# Compile the program with the optimizer and evaluate
with dspy.context(lm=llama_70b, rm=retriever):
    vanilla_program = BasicMH()
    bfrs_base_program = bfrs_optimizer.compile(vanilla_program, trainset=optimizer_trainset, valset=optimizer_valset)
    bfrs_base_eval = evaluate(bfrs_base_program, devset=devset)

To get a baseline for the finetuned model, we can run BSFS on the vanilla program with the smaller 8B model.

In [None]:
with dspy.context(lm=llama_8b, rm=retriever):
    vanilla_program = BasicMH()
    bfrs_8b_program = bfrs_optimizer.compile(vanilla_program, trainset=optimizer_trainset, valset=optimizer_valset)
    bfrs_8b_eval = evaluate(bfrs_8b_program, devset=devset)

In [None]:
raise SystemExit("Stop right there!")

### Bootstrap Data


In this section, we bootstrap data for fine-tuning. In the code block below, we decide which program is used to collect the bootstraps. We set this to the prompt-optimized program. One could also use the vanilla program, though doing so would lead to lower-quality bootstraps.

In [None]:
bootstrap_program = bfrs_base_program

In [None]:
from dspy.teleprompt.finetune_teleprompter import bootstrap_data


# Bootstrap traces using the prompt optimized program
# TODO: Change to using the version with multiple retries
with dspy.context(lm=llama_70b, rm=retriever):
    bootstrapped_data = bootstrap_data(program=bootstrap_program, dataset=trainset, metric=metric, num_threads=NUM_THREADS)

# The bootstrap_data function returns a list of dictionaries, each dictionary
# corresponding to an example in the data, with the following keys and
# values:
# - The `example` field corresponding to the example itself
# - The `prediction` field corresponding to the prediction made by the program
#   on the example.
# - The `trace` field corresponding to the trace generated by the program on the
#   example.
# - The `score` field corresponding to the metric score of the example, if the
#   metric is provided. Otherwise, it is not included in the data.
print(f"\nTrainset has {len(trainset)} examples.")
print(f"Bootstrapped data has {len(bootstrapped_data)} examples.")
print(f"Bootstrapped data dictionaries have the following keys: {list(bootstrapped_data[0].keys())}")

In [None]:
rng.shuffle(bootstrapped_data)

In [None]:
# Filter the bootstrapped data to only include examples where the metric score
# is 1
filtered_bootstrapped_data = [d for d in bootstrapped_data if d['score']]
print(f"Filtered bootstrapped data has {len(filtered_bootstrapped_data)} examples.")
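
The filtering step can be tried on mock data without running any model. The records below are invented for illustration and only mimic the keys described above. The sketch uses `.get("score")` so that records without a `score` key (the case where no metric was provided) are dropped instead of raising a `KeyError`.

```python
# Mock records with the same keys as the bootstrapped data described above.
mock_bootstrapped_data = [
    {"example": "q1", "prediction": "a1", "trace": [], "score": True},
    {"example": "q2", "prediction": "a2", "trace": [], "score": False},
    {"example": "q3", "prediction": "a3", "trace": [], "score": 1},
    {"example": "q4", "prediction": "a4", "trace": []},  # no metric, so no score
]

# Keep only the records whose metric score is truthy.
kept = [d for d in mock_bootstrapped_data if d.get("score")]

print(f"Kept {len(kept)} of {len(mock_bootstrapped_data)} records")
print([d["example"] for d in kept])
```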

In [None]:
from dspy.teleprompt.finetune_teleprompter import convert_to_module_level_prompt_completion_data

# To train a model on our program traces, we need a dataset that consists of
# the prompts and the completions for each module in our program. We can use the
# helper method `convert_to_module_level_prompt_completion_data` to generate
# this data.
# This function takes in a list of dictionaries, each of which must contain the
# "trace" field, which is used for generating the prompts and completions for
# each module in the program. The function returns a list of dictionaries, each
# of which contains the keys "prompt" and "completion", among others. Refer to
# the documentation for more information.
#
# The `exclude_demos` flag controls whether the demonstrations should be
# included in the prompts. We are setting it to True here to have the "vanilla"
# prompts as part of our finetuning data.
finetune_data = convert_to_module_level_prompt_completion_data(filtered_bootstrapped_data, exclude_demos=True)
print(f"For each datapoint, we get one prompt-completion pair for each of the {len(bootstrap_program.predictors())} modules in our program.")
print(f"As a result, finetune data has {len(filtered_bootstrapped_data)} * {len(bootstrap_program.predictors())} = {len(finetune_data)} examples.")
print(f"The finetune data dictionaries have the following keys: {list(finetune_data[0].keys())}")

In [None]:
# Let's look at an example prompt completion pair!
print("Example prompt:")
print(finetune_data[0]['prompt'])
print("-"*50)
print("Example completion:")
print(finetune_data[0]['completion'])
print("-"*50)
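
To see where the "examples times modules" count comes from, here is a toy version of the flattening step. The trace layout used here, a list of `(module_name, prompt, completion)` tuples, is an assumption made for this sketch and is not the real DSPy trace format; `to_prompt_completion_rows` is likewise a hypothetical helper, not the library function.

```python
def to_prompt_completion_rows(records):
    """Emit one prompt-completion row per module call in each record's trace."""
    rows = []
    for record in records:
        for module_name, prompt, completion in record["trace"]:
            rows.append(
                {"module": module_name, "prompt": prompt, "completion": completion}
            )
    return rows


# Two mock examples, each traced through three modules (two query hops, one answer).
mock_records = [
    {
        "trace": [
            ("generate_query[0]", f"question {i}, hop 1", f"query {i}.1"),
            ("generate_query[1]", f"question {i}, hop 2", f"query {i}.2"),
            ("generate_answer", f"question {i}, final", f"answer {i}"),
        ]
    }
    for i in range(2)
]

rows = to_prompt_completion_rows(mock_records)
print(f"{len(mock_records)} examples * 3 modules = {len(rows)} rows")
```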

# Fine-tuning

Take the best pipeline from the previous step and run bootstrapped fine-tuning (BSFT) on it with the Llama 3 8B model through LLMForge.

NOTE: I think this is the part that needs the most work

TODO: Outline what steps are still needed

- Dev time estimate TBD based on outlining steps

In [None]:
student_llama_8b = dspy.TrainableAnyscaleLM(model="meta-llama/Meta-Llama-3-8B", **MODEL_PARAMETERS)

# Showing that the model is an instance of the TrainableLM class
isinstance(student_llama_8b, dspy.TrainableLM)

In [None]:
import ujson

train_path = "trainset_data.jsonl"

print(f"Writing dataset with length {len(finetune_data)} to {train_path}")
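with open(train_path, "w") as f:
    # The .jsonl extension implies JSON Lines: one JSON object per line
    for item in finetune_data:
        f.write(ujson.dumps(item) + "\n")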
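
JSON Lines stores one JSON object per line. Before handing a file to a training job, a round trip like the one below can confirm that it parses line by line. This is a sketch using the standard-library `json` module and a temporary file; the helper names and sample rows are made up for illustration.

```python
import json
import os
import tempfile


def write_jsonl(rows, path):
    """Write each row as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


sample_rows = [
    {"prompt": "Question: 1 + 1?", "completion": "2"},
    {"prompt": "Question: capital of France?", "completion": "Paris"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    write_jsonl(sample_rows, sample_path)
    loaded_rows = read_jsonl(sample_path)

print(loaded_rows == sample_rows)
```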

In [None]:
from dsp.modules.lm import TrainingMethod

USE_CACHED_MODEL = False

if USE_CACHED_MODEL:
    CACHED_FINETUNED_MODEL_ID = ""
    # TODO: support this case
    finetuned_lm = dspy.TrainableAnyscaleLM(model=CACHED_FINETUNED_MODEL_ID, **MODEL_PARAMETERS)
else:
    # Define the hyperparameters for finetuning. To look at the specifics of
    # what kinds of arguments should be passed to `get_finetune`, refer to the
    # documentation for the particular subclass you are using, in this case,
    # `dspy.TrainableAnyscaleLM`.
    hyperparameters = {
        "n_epochs": 3
    }

    # Select the training method. Must be one of the options in the
    # `TrainingMethod` enum.
    method = TrainingMethod.SFT

    # Get a future object for the finetuned language model. This will get populated
    # with the finetuned model once the training is complete.
    # TODO: This is a big point of work
    # Need to grab this from the other workspace
    future_finetuned_lm = student_llama_8b.get_finetune(
        method=method,
        train_path=train_path,
        hyperparameters=hyperparameters
    )

    # Call the result method on the future object to get the finetuned language
    # model. This will block the execution until the training is complete.
    finetuned_lm = future_finetuned_lm.result()

# Record the model ID to access the model again in the future. You can directly
# substitute this model ID for the `CACHED_FINETUNED_MODEL_ID` variable above,
# and the next line will load the model directly.
finetuned_lm.kwargs['model']
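
The submit-then-block pattern used above can be illustrated with the standard library. Whether `get_finetune` returns a `concurrent.futures.Future` specifically is an assumption on my part; the sketch below only shows the general idea, with a hypothetical `fake_finetune` function standing in for the real training job.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_finetune(train_path, n_epochs):
    """Stand-in for a long-running training job; returns a made-up model ID."""
    time.sleep(0.1)
    return f"finetuned-model-from-{train_path}-epochs-{n_epochs}"


with ThreadPoolExecutor(max_workers=1) as pool:
    # Submitting returns immediately with a future.
    future = pool.submit(fake_finetune, "trainset_data.jsonl", 3)
    # Other work could happen here while the job runs.
    model_id = future.result()  # blocks until the job has finished

print(model_id)
```

Keeping the future around rather than blocking right away makes it possible to launch several fine-tuning jobs and collect them later.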

# Evaluation

Throughout this section, every evaluation that uses the 8B model (and, technically, the 70B model too) should use the new evaluate function, which runs batch offline (or, technically, online) inference with Ray Data.

It is probably worth testing offline inference with 8x8 threads against a single pool of 64 threads to see whether the difference is meaningful.

## Performance comparisons

- 70B
- 70B BSFS
- 8B
- 8B BSFT
- 8B BSFT + BSFS

---
- Note: Allegedly really easy
- Dev time estimate: 1 day

Let's see how well the vanilla program performs with the finetuned model.

In [None]:
with dspy.context(lm=finetuned_lm, rm=retriever):
    vanilla_program = BasicMH()
    vanilla_finetuned_eval = evaluate(vanilla_program, devset=devset)

Now let's try optimizing the program with the finetuned model.

In [None]:
with dspy.context(lm=finetuned_lm, rm=retriever):
    vanilla_program = BasicMH()
    bfrs_finetuned_program = bfrs_optimizer.compile(vanilla_program, trainset=optimizer_trainset, valset=optimizer_valset)
    bfrs_finetuned_eval = evaluate(bfrs_finetuned_program, devset=devset)

In [4]:
# Now we can compare all iterations of this pipeline
print(f"Results for fine-tuning Llama 3 8B on HotPotQA, starting from a trainset of {len(trainset)} examples. Of these, only {len(filtered_bootstrapped_data)} were judged successful when run through the program prompt-optimized with BootstrapFewShotWithRandomSearch (bfrs) on the base model. Traces from these examples are used for SFT. The reported results are computed on a held-out devset of {len(devset)} examples.\n")
print(f"    70B model (vanilla program): {vanilla_70b_base_eval}")
print(f"    70B model (bfrs program): {bfrs_base_eval}")
print(f"    8B model (vanilla program): {vanilla_8b_base_eval}")
print(f"    8B model (bfrs program): {bfrs_8b_eval}")
print(f"    8B model (finetuned program): {vanilla_finetuned_eval}")
print(f"    8B model (finetuned bfrs program): {bfrs_finetuned_eval}")


Let's now use the new offline batch inference to evaluate the finetuned model with the optimized program on the entire devset.

In [None]:
# TODO: implement once done

# Serving

This is the second biggest unknown.
I imagine it to be easy, but crazier things have happened.

I need to keep a reference or link to the LLMForge job inside the `LM.finetune` method.

In [None]:
vanilla_program = BasicMH()

with dspy.context(lm=finetuned_lm, rm=retriever):
  print("Evaluating the vanilla program on the devset using the finetuned model...")
  vanilla_finetune_eval = evaluate(vanilla_program, devset=devset)

## Batch offline inference
- Compare running inference using
    - Ray Data
    - multithreading against a local vLLM instance through HTTP
    - multithreading against a Ray Serve instance through HTTP
- Dev time estimate: 7 days
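
For the comparison above, a small timing harness could look like the sketch below. The `fake_request` function is a placeholder that sleeps briefly instead of calling a model, and all names are hypothetical. In a real comparison it would be replaced by the Ray Data, vLLM, or Ray Serve call being measured.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(prompt):
    """Placeholder for one inference call; sleeps to simulate latency."""
    time.sleep(0.01)
    return prompt.upper()


def run_sequential(prompts):
    return [fake_request(p) for p in prompts]


def run_threaded(prompts, num_threads):
    # pool.map preserves input order, so results line up with prompts.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(fake_request, prompts))


def timed(fn, *args):
    """Return the function's result together with its wall-clock duration."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


prompts = [f"question {i}" for i in range(16)]
sequential_results, sequential_seconds = timed(run_sequential, prompts)
threaded_results, threaded_seconds = timed(run_threaded, prompts, 8)

print(f"sequential: {sequential_seconds:.3f}s, threaded: {threaded_seconds:.3f}s")
print(sequential_results == threaded_results)
```

Checking that both strategies return identical results before comparing their timings guards against measuring a faster but incorrect path.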