# Quick Start 2: Compute Multiturn-aware Rewards (when you have a local LLM)

#### To compute multiturn-aware rewards for a model response, using the policy model to simulate the conversations, provide:

- **messages**: `List[Dict[str, str]]` — The full conversation history, with the *last* entry being the model response to evaluate (passed as `chat_history` in the examples below).
- **task_description**: `str` *(optional)* — A brief description of the overall task domain (passed as `task_desc` in the examples below).
- **single_turn_prompt**: `str` *(optional)* — The specific prompt being assessed.
- **single_turn_completion**: `str` *(optional)* — The ground-truth response for that prompt.
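
As an illustration of the expected input shape, the sketch below builds a conversation and checks that its last entry is the assistant response to be evaluated. `check_chat_history` is a hypothetical helper written for this example; it is not part of `collabllm`, which may validate its inputs differently.

```python
from typing import Dict, List


def check_chat_history(messages: List[Dict[str, str]]) -> str:
    """Return the response under evaluation, or raise if the history is malformed.

    Hypothetical helper for illustration only.
    """
    if not messages:
        raise ValueError("messages must contain at least one entry")
    for message in messages:
        # Every entry needs a role and a content field.
        if "role" not in message or "content" not in message:
            raise ValueError(f"entry is missing 'role' or 'content': {message}")
    last = messages[-1]
    if last["role"] != "assistant":
        raise ValueError("the last entry must be the assistant response to evaluate")
    return last["content"]


messages = [
    {"role": "user", "content": "Can you recommend me a movie similar to Forrest Gump?"},
    {"role": "assistant", "content": "What aspects of Forrest Gump do you enjoy?"},
]
response_to_evaluate = check_chat_history(messages)
```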

Additionally, provide:
- **local_model**: `AutoModelForCausalLM` *(optional)* — The local model that generates responses.
- **local_tokenizer**: `AutoTokenizer` *(optional)* — The local tokenizer that tokenizes the input messages.
- **vllm_base_model**: `str` *(optional)* — The base model used for vLLM inference, e.g., `"meta-llama/Meta-Llama-3-8B-Instruct"`.

Note:
You can provide `vllm_base_model` only, both `local_model` and `local_tokenizer`, or all three.
- If you provide only `vllm_base_model`, the code uses vLLM to generate responses.
- If you provide only `local_model` and `local_tokenizer`, the code uses the Hugging Face generation API (slow).
- If you provide all three, the code uses vLLM to accelerate generation, loading the LoRA weights from `local_model`.
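
The three cases in the note can be summarized as a small decision function. This is only a sketch of the rules as stated above: `pick_generation_backend` and the returned labels are hypothetical names chosen for this example, and the actual dispatch inside `collabllm` may differ.

```python
from typing import Any, Optional


def pick_generation_backend(
    local_model: Optional[Any] = None,
    local_tokenizer: Optional[Any] = None,
    vllm_base_model: Optional[Any] = None,
) -> str:
    """Mirror the note's rules; hypothetical helper, not the collabllm API."""
    # The note treats local_model and local_tokenizer as a pair.
    if (local_model is None) != (local_tokenizer is None):
        raise ValueError("provide local_model and local_tokenizer together")

    has_local = local_model is not None
    has_vllm = vllm_base_model is not None

    if has_vllm and has_local:
        return "vllm+lora"    # vLLM with LoRA weights taken from local_model
    if has_vllm:
        return "vllm"         # plain vLLM generation
    if has_local:
        return "huggingface"  # Hugging Face generation API (slow)
    raise ValueError("provide vllm_base_model, a local model/tokenizer pair, or both")


# Placeholder objects stand in for real models in this sketch.
backend_vllm_only = pick_generation_backend(vllm_base_model=object())
backend_all_three = pick_generation_backend(object(), object(), object())
```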

In [1]:
# %env OPENAI_API_KEY=
# %env ANTHROPIC_API_KEY=
# Or set these environment variables in your system
from dotenv import load_dotenv
YOUR_DOTENV_PATH = "../.env"
load_dotenv(YOUR_DOTENV_PATH)

%env CUDA_VISIBLE_DEVICES=2
# Disable logging for the collabllm package
# Set to 1 to see the process of the reward computation.
%env ENABLE_COLLABLLM_LOGGING=0

env: CUDA_VISIBLE_DEVICES=2
env: ENABLE_COLLABLLM_LOGGING=0
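
Outside a notebook the `%env` magics are unavailable. A minimal equivalent in a plain Python script is sketched below; it assumes the variables must be set before `torch`, `vllm`, or `collabllm` are imported, which is certainly true for `CUDA_VISIBLE_DEVICES` and is assumed here for `ENABLE_COLLABLLM_LOGGING`.

```python
import os

# Set these before importing torch, vllm, or collabllm so they take effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"       # expose only GPU 2 to this process
os.environ["ENABLE_COLLABLLM_LOGGING"] = "0"   # set to "1" to log the reward computation

visible_devices = os.environ["CUDA_VISIBLE_DEVICES"]
logging_flag = os.environ["ENABLE_COLLABLLM_LOGGING"]
```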


## Example 1: Movie Recommendation

In [2]:
import torch
from vllm import LLM
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)
from peft import LoraConfig
from peft import PeftModel, PeftConfig, get_peft_model


# -------------- Load model and tokenizer --------------
base_model_name = "meta-llama/Llama-3.1-8B-Instruct"
# This can be any LoRA fine-tuned model trained from `base_model_name`
local_model_name_or_path = "meta-llama/Llama-3.1-8B-Instruct" 

if local_model_name_or_path == base_model_name: 
    # Directly use vLLM to simulate the conversation
    local_model = None
    local_tokenizer = None
else: 
    # `local_model` provides LoRA weights 
    local_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    local_model = AutoModelForCausalLM.from_pretrained(local_model_name_or_path)

    peft_config = LoraConfig(
        r=32, 
        lora_alpha=16,
        lora_dropout=0.0,
        bias="none",
        task_type="CAUSAL_LM",
        init_lora_weights="gaussian",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"]
    )
    local_model = get_peft_model(local_model, peft_config)
    
vllm_base_model = LLM(
    model=base_model_name,
    dtype="bfloat16",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    enable_lora=True,
    max_lora_rank=32,
)

collabllm_model_kwargs = {
    "local_model": local_model,
    "local_tokenizer": local_tokenizer,
    "vllm_base_model": vllm_base_model,
}

# ------------- Run collabllm  --------------
import sys
sys.path.append('..')

import logging
logging.getLogger("LiteLLM").setLevel(logging.CRITICAL)

import numpy as np
from examples.metrics.accuracy import AccuracyMetric
from examples.metrics.efficiency import TokenAmountMetric
from examples.metrics.interactivity import InteractivityMetric
from collabllm.reward import multiturn_aware_reward

task_desc = "Recommend a movie."
single_turn_prompt = "Find a film that suitable for a date night. It should deliver an epic romantic drama, ideally in the 20th-century America, and carry the same decades-long, nostalgic storytelling spirit as Forrest Gump."
single_turn_completion = "The Curious Case of Benjamin Button"

passive_response = """The Pursuit of Happyness (2006) - A touching story of determination, courage, and love being more important than ability Best Movies Like Forrest Gump | BestSimilar, starring Will Smith as a struggling father who overcomes tremendous obstacles. Like Forrest Gump, it's an inspiring tale of perseverance against the odds."""

collaborative_response = "What aspects of Forrest Gump do you enjoy? Is it the storytelling, the character development, or the historical context?"

rewards = {}
for idx, response in enumerate([passive_response, collaborative_response]):
    messages = [
        {"role": "user", "content": "Can you recommend me a movie similar to Forrest Gump?"},
        {"role": "assistant", "content": response}
    ]

    reward_info = multiturn_aware_reward(
        chat_history=messages,
        task_desc=task_desc,
        single_turn_prompt=single_turn_prompt + f"(Hint: {single_turn_completion})",
        single_turn_completion=single_turn_completion,
        metric_names=["recommendation->accuracy", 'interactivity', 'token_amount'],
        metric_weights=[1, 1, -0.1],
        user_generation_kwargs={"model": "gpt-4o"},
        assistant_generation_kwargs={"model": local_model_name_or_path},
        reward_generation_kwargs={"model": "claude-3-5-sonnet-latest"},
        num_samples=3,
        max_new_turns=4,
        **collabllm_model_kwargs
    )
    rewards[idx] = reward_info  # keep the result so both responses can be compared later

    print(f"{'Metric':<{25}} : Values                        Mean")
    for k in sorted(reward_info):
        print(f"{k:<{25}} : {[f'{v:.3f}' for v in reward_info[k]]}  {np.mean(reward_info[k]):6.3f}")

INFO 06-01 14:42:50 [__init__.py:243] Automatically detected platform cuda.
INFO 06-01 14:42:54 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 06-01 14:42:54 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 06-01 14:42:54 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 06-01 14:43:06 [config.py:793] This model supports multiple tasks: {'embed', 'score', 'classify', 'generate', 'reward'}. Defaulting to 'generate'.
INFO 06-01 14:43:06 [config.py:2118] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 06-01 14:43:08 [core.py:438] Waiting for init message from front-end.
INFO 06-01 14:43:08 [core.py:65] Initializing a V1 LLM engine (v0.9.0) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mo

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


INFO 06-01 14:43:16 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 06-01 14:43:16 [gpu_model_runner.py:1549] Model loading took 5.5228 GiB and 5.393260 seconds
INFO 06-01 14:43:26 [backends.py:459] Using cache directory: /dfs/user/shirwu/.cache/vllm/torch_compile_cache/426fe7453b/rank_0_0 for vLLM's torch.compile
INFO 06-01 14:43:26 [backends.py:469] Dynamo bytecode transform time: 10.22 s
INFO 06-01 14:43:35 [backends.py:132] Directly load the compiled graph(s) for shape None from the cache, took 7.651 s
INFO 06-01 14:43:37 [monitor.py:33] torch.compile takes 10.22 s in total
INFO 06-01 14:43:38 [kv_cache_utils.py:637] GPU KV cache size: 523,744 tokens
INFO 06-01 14:43:38 [kv_cache_utils.py:640] Maximum concurrency for 131,072 tokens per request: 4.00x
INFO 06-01 14:44:23 [gpu_model_runner.py:1933] Graph capturing finished in 45 secs, took 1.47 GiB
INFO 06-01 14:44:23 [core.py:167] init engine (profile, create kv cache, warmup model) took 67.30 seconds
INFO 06-01 14:44:33 [chat_

Adding requests:   0%|          | 0/3 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/3 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Adding requests:   0%|          | 0/2 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/2 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Metric                    : Values                        Mean
MR                        : ['1.659', '0.622', '0.527']   0.936
interactivity             : ['0.700', '0.700', '0.600']   0.667
recommendation->accuracy  : ['1.000', '0.000', '0.000']   0.333
token_amount              : ['0.411', '0.775', '0.726']   0.637


Adding requests:   0%|          | 0/3 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/3 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Metric                    : Values                        Mean
MR                        : ['1.878', '0.764', '1.965']   1.536
interactivity             : ['0.950', '0.800', '1.000']   0.917
recommendation->accuracy  : ['1.000', '0.000', '1.000']   0.667
token_amount              : ['0.720', '0.364', '0.351']   0.478
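
The printed `MR` values appear to be the weighted sum of the per-metric scores under `metric_weights=[1, 1, -0.1]`. This is inferred from the numbers above, not from the `collabllm` source. The sketch below recomputes `MR` for the collaborative response from the printed (rounded) values, so small rounding differences are expected.

```python
# Weights as passed via metric_names / metric_weights in the call above.
metric_weights = {
    "recommendation->accuracy": 1.0,
    "interactivity": 1.0,
    "token_amount": -0.1,
}

# Per-sample values copied from the printed output for the collaborative response.
printed = {
    "recommendation->accuracy": [1.000, 0.000, 1.000],
    "interactivity": [0.950, 0.800, 1.000],
    "token_amount": [0.720, 0.364, 0.351],
}
printed_mr = [1.878, 0.764, 1.965]

num_samples = len(printed_mr)
# Weighted sum of the metrics for each sampled conversation.
recomputed_mr = [
    sum(weight * printed[name][i] for name, weight in metric_weights.items())
    for i in range(num_samples)
]
mean_mr = sum(recomputed_mr) / num_samples

for got, expected in zip(recomputed_mr, printed_mr):
    print(f"recomputed {got:.3f} | printed {expected:.3f}")
print(f"mean {mean_mr:.3f}")
```

Under this reading, a response scores higher when the simulated conversations end in accurate, interactive exchanges, and is penalized slightly for longer conversations through the negative `token_amount` weight.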


## Example 2: Document Writing
You may need to restart the kernel before running the following cells.

In [1]:
# %env OPENAI_API_KEY=
# %env ANTHROPIC_API_KEY=
# Or set these environment variables in your system
from dotenv import load_dotenv
YOUR_DOTENV_PATH = "../.env"
load_dotenv(YOUR_DOTENV_PATH)

%env CUDA_VISIBLE_DEVICES=3
# Set to 1 to see the process of the reward computation (requires a kernel restart to take effect).
%env ENABLE_COLLABLLM_LOGGING=1

env: CUDA_VISIBLE_DEVICES=3
env: ENABLE_COLLABLLM_LOGGING=1


In [None]:
import torch
from vllm import LLM
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)
from peft import LoraConfig
from peft import PeftModel, PeftConfig, get_peft_model

# -------------- Load model and tokenizer --------------
base_model_name = "meta-llama/Llama-3.1-8B-Instruct"
# This can be any LoRA fine-tuned model trained from `base_model_name`
local_model_name_or_path = "meta-llama/Llama-3.1-8B-Instruct" 

if local_model_name_or_path == base_model_name: 
    # Directly use vLLM to simulate the conversation
    local_model = None
    local_tokenizer = None
else: 
    # `local_model` provides LoRA weights 
    local_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    local_model = AutoModelForCausalLM.from_pretrained(local_model_name_or_path)

    peft_config = LoraConfig(
        r=32, 
        lora_alpha=16,
        lora_dropout=0.0,
        bias="none",
        task_type="CAUSAL_LM",
        init_lora_weights="gaussian",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"]
    )
    local_model = get_peft_model(local_model, peft_config)
    
vllm_base_model = LLM(
    model=base_model_name,
    dtype="bfloat16",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    enable_lora=True,
    max_lora_rank=32,
)
    

collabllm_model_kwargs = {
    "local_model": local_model,
    "local_tokenizer": local_tokenizer,
    "vllm_base_model": vllm_base_model,
}
# ------------- Run collabllm  --------------
import sys
sys.path.append('..')

import numpy as np
from examples.metrics.bleu import BLEUMetric
from examples.metrics.efficiency import TokenAmountMetric
from examples.metrics.interactivity import InteractivityMetric

import logging
logging.getLogger("LiteLLM").setLevel(logging.CRITICAL)

from collabllm.reward import multiturn_aware_reward

task_desc = "Write a short essay."
passive_response = "Here's a piece that might inspire and motivate you to cultivate optimism:\n\n**The Power of Optimism: Unlocking a Brighter You**\n\nHere are a few tips to get you started:\n\n*   **Practice gratitude**: Take time each day to reflect on the things you're thankful for, no matter how small they may seem.\n*   **Focus on the positive**: When faced with a challenge or setback, try to see the opportunity for growth and learning, rather than the obstacle.\n*   **Surround yourself with positivity**: Spend time with people who uplift and inspire you, and avoid those who bring you down.\n*   **Take care of yourself**: Get enough sleep, exercise regularly, and eat a healthy, balanced diet to support your physical and mental well-being.\n\n**Conclusion**\n\nOptimism is a choice, a mindset that allows you to see the world in a more vibrant and hopeful light. When you practice optimism, you'll start to notice a profound impact on your life, from improved mental health and increased motivation to better relationships and greater resilience.\n\nSo, what are you waiting for? Choose to be optimistic today, and start to unlock a brighter, more fulfilling life for yourself. Believe in yourself, believe in your abilities, and know that anything is possible when you have the courage to dream big."

collaborative_response = "To get us started, can you tell me what kind of tone are you aiming for? Do you want it to be more:\n\nA) Uplifting and motivational, focusing on the benefits of optimism?\nB) Inspiring and thought-provoking, exploring the science behind optimism's impact on well-being?\nC) Heartfelt and personal, sharing your own experiences with optimism and its effects on your life?\n\nAlso, are there any specific aspects of optimism you'd like to highlight, such as its role in resilience, relationships, or overall happiness?"

single_turn_prompt = "Write a short essay on the benefits of optimism."

single_turn_completion = "**The Optimism Revolution: Unleashing Your Inner Power**Hey there, friend! Are you ready to join the optimism revolution? It's time to shake off the negativity, doubt, and fear that's been holding you back, and unleash your inner power. Because when you choose to be optimistic, you're not just changing your outlook \u2013 you're changing your life.**The Power of Positive Thinking**As Nelson Mandela once said, 'The greatest glory in living lies not in never falling, but in rising every time we fall.' When you adopt an optimistic mindset, you're not just a survivor \u2013 you're a thriver. You're a force to be reckoned with, and you're unstoppable.So, what's holding you back? Is it fear of failure? Fear of success? Fear of the unknown? Let's face it \u2013 fear is just an illusion. As Winston Churchill said, 'The pessimist sees the difficulty in every opportunity. The optimist sees the opportunity in every difficulty."

rewards = {}
for idx, response in enumerate([passive_response, collaborative_response]):
    messages = [
        {"role": "user", "content": "I need to write about how optimism can improve our well-being"},
        {"role": "assistant", "content": response}
    ]

    reward_info = multiturn_aware_reward(
        chat_history=messages,
        task_desc=task_desc,
        single_turn_prompt=single_turn_prompt  + "\nReference article: " + single_turn_completion,
        single_turn_completion=single_turn_completion,
        metric_names=["document->bleu", 'interactivity', 'token_amount'],
        metric_weights=[1, 1, -0.1],
        assistant_generation_kwargs={
            "model": local_model_name_or_path,
            "temperature": 0.8,
            "max_tokens": 2048
        },
        user_generation_kwargs={
            "model": "gpt-4o-mini",
            "temperature": 1.0,
            "max_tokens": 1024
        },
        reward_generation_kwargs={
            "model": "claude-3-5-sonnet-latest",
            'temperature': 0
        },
        num_samples=3,
        max_new_turns=4,
        **collabllm_model_kwargs
    )

    rewards[idx] = reward_info

INFO 06-02 16:29:59 [__init__.py:243] Automatically detected platform cuda.
INFO 06-02 16:30:05 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 06-02 16:30:05 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 06-02 16:30:05 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 06-02 16:30:21 [config.py:793] This model supports multiple tasks: {'classify', 'generate', 'embed', 'reward', 'score'}. Defaulting to 'generate'.
INFO 06-02 16:30:21 [config.py:2118] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 06-02 16:30:24 [core.py:438] Waiting for init message from front-end.
INFO 06-02 16:30:24 [core.py:65] Initializing a V1 LLM engine (v0.9.0) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mo

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


INFO 06-02 16:30:33 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 06-02 16:30:33 [gpu_model_runner.py:1549] Model loading took 5.5228 GiB and 6.070258 seconds
INFO 06-02 16:30:50 [backends.py:459] Using cache directory: /dfs/user/shirwu/.cache/vllm/torch_compile_cache/426fe7453b/rank_0_0 for vLLM's torch.compile
INFO 06-02 16:30:50 [backends.py:469] Dynamo bytecode transform time: 16.67 s
INFO 06-02 16:30:59 [backends.py:132] Directly load the compiled graph(s) for shape None from the cache, took 8.318 s
INFO 06-02 16:31:02 [monitor.py:33] torch.compile takes 16.67 s in total
INFO 06-02 16:31:02 [kv_cache_utils.py:637] GPU KV cache size: 523,744 tokens
INFO 06-02 16:31:02 [kv_cache_utils.py:640] Maximum concurrency for 131,072 tokens per request: 4.00x
INFO 06-02 16:31:50 [gpu_model_runner.py:1933] Graph capturing finished in 48 secs, took 1.47 GiB
INFO 06-02 16:31:51 [core.py:167] init engine (profile, create kv cache, warmup model) took 77.42 seconds


2025-06-02 16:31:53,023 [INFO] collabllm: CollabLLM logging enabled.
2025-06-02 16:31:53,421 [INFO] httpx: HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
Simulating chat:   0%|          | 0/4 [00:00<?, ?it/s]2025-06-02 16:31:56,596 [INFO] httpx: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-06-02 16:31:56,607 [INFO] collabllm.modules.user_simulator: [UserSimulator] Full response: {
  "current_answer": "The AI provided a detailed short essay on the benefits of optimism, highlighting its positive effects on mental health, motivation, relationships, and resilience. It also offered tips on how to cultivate optimism.",
  "thought": "The essay seems really comprehensive and addresses my request well. I appreciate the structure and the tips provided. However, I might want to ask for a more concise version or specific examples to reinforce the points made. I'm leaning towards

INFO 06-02 16:31:58 [chat_utils.py:419] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.


Adding requests:   0%|          | 0/3 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/3 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

2025-06-02 16:32:21,775 [INFO] collabllm.simulation: Assistant: Here's a concise version with a focus on how optimism can improve mental health and motivation:

**The Power of Optimism: Boosting Mental Health and Motivation**

Optimism is a powerful tool that can have a profound impact on both our mental health and motivation. When we choose to be optimistic, we open ourselves up to a world of possibilities, where every challenge becomes an opportunity for growth and every setback is a chance to learn and improve.

**Optimism and Mental Health**

Optimism has been shown to have a positive impact on mental health by:

*   **Reducing symptoms of anxiety and depression**: When we're optimistic, we're more likely to focus on the good, to find meaning and purpose in our lives, and to cultivate a sense of hope and resilience.
*   **Improving self-esteem and confidence**: Optimism can help us see ourselves and our abilities in a more positive light, leading to increased self-esteem and confid

Adding requests:   0%|          | 0/3 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/3 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

2025-06-02 16:32:52,284 [INFO] collabllm.simulation: Assistant: Here's a very concise version with a focus on how optimism can improve motivation:

**Boosting Motivation with Optimism**

Optimism is a powerful tool that can have a profound impact on our motivation. When we're optimistic, we're more likely to take action towards our goals, and are better equipped to overcome obstacles and stay focused.

**The Key Benefits**

When we're optimistic, we're more likely to:

*   **Set and achieve goals**: With a positive outlook, we're more likely to set ambitious goals and take action towards achieving them.
*   **Stay motivated and focused**: By focusing on the possibilities, rather than the challenges, we can stay motivated and driven to succeed.

**Cultivating Optimism**

To start cultivating optimism, try practicing gratitude and focusing on the positive. By incorporating these habits into your daily life, you can start to experience the many benefits optimism has to offer.
2025-06-02 1

Adding requests:   0%|          | 0/3 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/3 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

2025-06-02 16:33:47,999 [INFO] collabllm.simulation: Assistant: Here's a draft to get you started:

**The Power of Optimism: Unlocking Resilience and Happiness**

In a world filled with challenges and uncertainties, it's easy to get caught up in negativity and pessimism. But what if we told you that there's a simple yet powerful tool that can help you bounce back from adversity, build stronger relationships, and live a happier life? Enter optimism, the state of mind that can transform your well-being and outlook on life.

**The Resilience Advantage**

Optimism is like a superpower that helps you navigate life's ups and downs with ease. When faced with difficult situations, pessimists often feel overwhelmed, helpless, and stuck. In contrast, optimists view challenges as opportunities for growth, learning, and improvement. This mindset allows them to:

* Bounce back from setbacks faster
* Approach problems with creativity and confidence
* Build a support network of friends, family, and c

Adding requests:   0%|          | 0/3 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/3 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

2025-06-02 16:34:26,585 [INFO] collabllm.simulation: Assistant: Here's an updated draft with some specific examples and studies:

**The Power of Optimism: Unlocking Resilience and Happiness**

In a world filled with challenges and uncertainties, it's easy to get caught up in negativity and pessimism. But what if we told you that there's a simple yet powerful tool that can help you bounce back from adversity, build stronger relationships, and live a happier life? Enter optimism, the state of mind that can transform your well-being and outlook on life.

**The Resilience Advantage**

Optimism is like a superpower that helps you navigate life's ups and downs with ease. When faced with difficult situations, pessimists often feel overwhelmed, helpless, and stuck. In contrast, optimists view challenges as opportunities for growth, learning, and improvement. This mindset allows them to:

* Bounce back from setbacks faster
* Approach problems with creativity and confidence
* Build a support net