Run unit tests with real LLM calls #8486


Open · wants to merge 7 commits into main

Conversation

TomeHirata
Collaborator

@TomeHirata TomeHirata commented Jul 3, 2025

This PR enables running unit tests that require real LLM calls, which have been skipped in CI.
The test model is configurable through the LLM_MODEL env variable, and we use ollama/llama3.2:3b in the branch build to balance quality and latency. Pulling the Ollama model only adds ~13s of latency, so this PR enables real LLM tests in the branch build instead of in the nightly tests.
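The env-var configuration described above could be read with a small helper like the following sketch. The helper name `test_model_name` is hypothetical; only the `LLM_MODEL` variable and the `ollama/llama3.2:3b` default come from this PR.

```python
import os

# Default matches the model this PR uses in the branch build.
DEFAULT_TEST_MODEL = "ollama/llama3.2:3b"

def test_model_name() -> str:
    """Return the model for unit tests, overridable via LLM_MODEL."""
    return os.environ.get("LLM_MODEL", DEFAULT_TEST_MODEL)
```

A local run could then point the same tests at a different backend simply by exporting `LLM_MODEL` before invoking pytest.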

@TomeHirata TomeHirata changed the title use real LLM for unit tests Use real LLM for unit tests Jul 3, 2025
@TomeHirata TomeHirata changed the title Use real LLM for unit tests Run unit tests with real LLM calls Jul 3, 2025
@chenmoneygithub
Collaborator

@TomeHirata This is awesome! One thing I want to discuss.

I compared the time cost of running the tests before and after this PR, and we are seeing a significant increase, from 2m to 4m:

Before: [screenshot: test run duration]

After: [screenshot: test run duration]

4m seems all right, but that's the outcome of converting only 6 test cases to Ollama, which means the suite could become much slower over time. I am thinking about the following:

  1. Use pytest markers to split tests requiring an actual LM into their own group, e.g. @pytest.mark.real_lm_call. These tests would be skipped in the normal GitHub tests and run only in a dedicated action, Run Tests with Real LM. We may not need to run them across all Python versions, because they are essentially a supplement to the normal tests, checking against actual LM behavior.
  2. Prefer mocking over real LM calls in unit tests. Only when mocking becomes too complex (e.g., optimizers) or unreliable over time (e.g., streaming) should we switch to using the real LM.

Let me know what you think!
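Point 1 above could be implemented with a conftest.py hook along these lines. This is a sketch: the marker name `real_lm_call` comes from the comment, while the `RUN_REAL_LM_TESTS` env variable and the helper name are hypothetical.

```python
import os

import pytest

# Hypothetical marker/env-var names; only "real_lm_call" is from the discussion.
REAL_LM_MARKER = "real_lm_call"

def real_lm_enabled(environ=os.environ):
    """True only when the dedicated CI job opts in to real-LM tests."""
    return environ.get("RUN_REAL_LM_TESTS") == "1"

def pytest_collection_modifyitems(config, items):
    # Skip real-LM tests in the normal matrix; the dedicated job sets the flag.
    if real_lm_enabled():
        return
    skip = pytest.mark.skip(reason="needs real LM; set RUN_REAL_LM_TESTS=1")
    for item in items:
        if REAL_LM_MARKER in item.keywords:
            item.add_marker(skip)
```

The marker would also need to be registered (e.g. under `markers` in pytest.ini) so `--strict-markers` runs don't reject it.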

@TomeHirata
Collaborator Author

Hi, @chenmoneygithub. I agree with both points. I've split the LLM-call tests into a separate job. On #2, we should limit the use of real LLM calls in unit tests because of latency and potential flakiness.
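The job split could be driven by marker expressions, a CI configuration fragment along these lines (assuming the hypothetical `real_lm_call` marker from the discussion):

```shell
# Dedicated "Run Tests with Real LM" job: only marker-selected tests.
pytest -m "real_lm_call"

# Normal test matrix: everything except the real-LM group.
pytest -m "not real_lm_call"
```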
