A notebook for testing a provider/model pair. \
This implementation is based on section 3.4, "Handling errors that occur with a specific provider" (original title: "3.4 特定のproviderでエラーが出る場合の対応"), in `README_t4.md`.

Setup: \
To use this notebook, you need `python>=3.10.0` and the following packages:
```
pip install "../../lighteval[math,extended_tasks,litellm,vllm]" "transformers>=4.51.0,<4.53.0" "openai>=1.40.0" "datasets<4.0.0" "ipywidgets"
```
The package list and version pins may need to change as upstream packages are updated or to suit your environment.
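
Before running the cells, it can help to confirm what is actually installed. The following is a minimal sketch and not part of the original setup: `report_environment` is a hypothetical helper, and the package names are simply copied from the install command above.

```python
import sys
from importlib import metadata

# Distribution names copied from the pip command above.
PACKAGES = ["lighteval", "transformers", "openai", "datasets", "ipywidgets"]

def report_environment(packages=PACKAGES):
    """Return {name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# The notebook states that python>=3.10.0 is required.
python_ok = sys.version_info >= (3, 10)
print(f"Python {sys.version.split()[0]} (>=3.10: {python_ok})")
for name, version in report_environment().items():
    print(f"{name}: {version or 'not installed'}")
```

This only reports versions; it does not check them against the pins, which you may have changed for your environment.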

In [None]:
%pip install "lighteval[math,extended_tasks,litellm,vllm]" "transformers>=4.51.0,<4.53.0" "openai>=1.40.0" "datasets<4.0.0" "ipywidgets"

## Call vllm in litellm

In [1]:
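from dotenv import load_dotenv
# Load API keys (e.g. OPENAI_API_KEY) from the repository-level .env file.
assert load_dotenv('../../.env'), "Failed to load .env file"

import litellm
import os
# Restrict this process to GPUs 0 and 1.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"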

In [2]:
# Helper functions
def get_base_url(provider):
    base_url_dict = {
        "openai": "https://api.openai.com/v1",
        "deepinfra": "https://api.deepinfra.com/v1/openai",
        "vllm": "http://localhost:10001/v1",
    }
    return base_url_dict[provider]

def get_api_key(provider):
    api_name_dict = {
        "openai": "OPENAI_API_KEY",
        "deepinfra": "DEEPINFRA_API_KEY",
        "vllm": None,
    }
    env_name = api_name_dict[provider]
    # Providers without a key (e.g. a local vLLM server) and unset variables both yield "".
    return os.getenv(env_name, "") if env_name else ""
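
The helpers above raise a bare `KeyError` for an unknown provider and say nothing when a key variable is unset. As a rough sketch of an up-front check, assuming the same dictionary shapes, a hypothetical `check_provider` function (not part of the notebook) could report both cases in plain language:

```python
import os

def check_provider(provider, base_urls, key_env_names, env=None):
    """Return a list of human-readable problems; an empty list means none were found."""
    env = os.environ if env is None else env
    problems = []
    if provider not in base_urls:
        problems.append(f"unknown provider {provider!r}; known: {sorted(base_urls)}")
    env_name = key_env_names.get(provider)
    if env_name and not env.get(env_name):
        problems.append(f"environment variable {env_name} is not set")
    return problems

# Example dictionaries mirroring (a subset of) the helper functions above.
base_urls = {"openai": "https://api.openai.com/v1", "vllm": "http://localhost:10001/v1"}
key_env_names = {"openai": "OPENAI_API_KEY", "vllm": None}

# An empty env is passed explicitly so the example does not depend on your shell.
print(check_provider("vllm", base_urls, key_env_names, env={}))
print(check_provider("openai", base_urls, key_env_names, env={}))
```

Passing `env` explicitly keeps the check easy to exercise; leaving it out falls back to `os.environ`.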

In [None]:
# Setup parameters
provider = "vllm"
base_url = get_base_url(provider)

model = "google/gemma-3-4b-it"

optional_params = {
    # See the official documentation for the available parameters: https://docs.litellm.ai/docs/api-reference/litellm.completion
    "temperature": 0.0,
    "max_tokens": 32768,
}

api_key = get_api_key(provider)
# Only pass api_key when one is actually set (skips both "" and None).
if api_key:
    optional_params["api_key"] = api_key

In [None]:
# Set a test prompt ("こんにちは" is Japanese for "hello")
test_prompt = """こんにちは"""
print(test_prompt)

In [None]:
# Define request payload
request_payload = {
    "model": f"{provider}/{model}",
    "messages": [
        {
            "role": "user",
            "content": test_prompt,
        }
    ],
    "logprobs": None,
    "caching": False,
    "base_url": base_url,
    **optional_params
}

In [None]:
# Check the payload
request_payload
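
Displaying the payload as-is also displays `api_key` whenever one was added for a hosted provider, and that value is then saved in the notebook output. A minimal sketch of masking it first, using a hypothetical `redact_payload` helper and a made-up example payload:

```python
def redact_payload(payload, secret_keys=("api_key",)):
    """Return a shallow copy of payload with non-empty secret values masked."""
    redacted = dict(payload)
    for key in secret_keys:
        if redacted.get(key):
            redacted[key] = "***"
    return redacted

# Made-up payload for illustration; the key below is not a real credential.
example_payload = {
    "model": "vllm/google/gemma-3-4b-it",
    "api_key": "sk-example",
    "temperature": 0.0,
}
print(redact_payload(example_payload))
```

In the notebook this would be called as `redact_payload(request_payload)`; the original dictionary is left unchanged, so it can still be passed to `litellm.completion`.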

In [None]:
# Get responses (this takes a while)
responses = litellm.completion(**request_payload)

In [None]:
print(responses)

In [None]:
print(responses.choices[0].message.content)
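
Since this notebook exists to diagnose provider-specific errors, it can be useful to capture a failure as a short summary instead of a long traceback. The sketch below is one possible approach, not the notebook's method: `try_completion` is a hypothetical wrapper, and `fake_completion` is a stand-in so the example runs without a model or network.

```python
def try_completion(complete_fn, payload):
    """Call complete_fn(**payload) and return (response, error_summary)."""
    try:
        return complete_fn(**payload), None
    except Exception as exc:  # deliberately broad: this is a diagnostic helper
        return None, f"{type(exc).__name__}: {exc}"

# Stand-in for litellm.completion, used only to make the sketch runnable here.
def fake_completion(**kwargs):
    if "model" not in kwargs:
        raise ValueError("model is required")
    return {"echo": kwargs["model"]}

print(try_completion(fake_completion, {"model": "vllm/google/gemma-3-4b-it"}))
print(try_completion(fake_completion, {}))
```

In the notebook, the equivalent call would be `try_completion(litellm.completion, request_payload)`; the exception class name in the summary usually indicates whether the problem is authentication, connectivity, or a rejected parameter.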

## Use vllm serve

First, serve your model, for example:

```terminal
export CUDA_VISIBLE_DEVICES=0,1
uv run --isolated --project /home/saito-k/github/swallow-evaluation-instruct-private --locked --extra vllm \
    vllm serve "google/gemma-3-4b-it" \
        --port 10001 \
        --hf-token "hf_..." \
        --tensor-parallel-size 2 \
        --max-model-len 32768 \
        --gpu-memory-utilization 0.9 \
        --dtype bfloat16 \
        2>&1 &
```

⚠️: Make sure that
- `CUDA_VISIBLE_DEVICES` and the port do not conflict with other processes
- `--hf-token` is set to your own Hugging Face token
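
Because the command above starts the server in the background, the following cells can run before the model has finished loading. A minimal sketch of waiting for readiness, assuming the server's OpenAI-compatible `/v1/models` endpoint on the port used above; `wait_for_server` is a hypothetical helper, and the timeout values are arbitrary:

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url, timeout_s=300.0, interval_s=2.0):
    """Poll url until it answers with HTTP 200, or return False after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not ready yet: connection refused, timeout, or an HTTP error
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)

# Example (commented out so this cell does not block when no server is running):
# wait_for_server("http://localhost:10001/v1/models")
```

If the function returns `False`, checking the server's log output is usually the next step.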

In [None]:
# Get a client
from openai import OpenAI

client = OpenAI(
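    # The client requires some key; a placeholder is enough as long as the server was not started with --api-key.
    api_key="dummy",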
    base_url="http://localhost:10001/v1"
)

In [None]:
# Set a test prompt ("こんにちは" is Japanese for "hello")
test_prompt = """こんにちは"""
print(test_prompt)

In [None]:
# Generate a response
response = client.chat.completions.create(
    model="google/gemma-3-4b-it",
    messages=[
        {"role": "user", "content": test_prompt}
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)