# CS 329A: Homework 2
### Code Generation and Verification

In this homework, we will explore the capabilities of LLMs in generating and verifying code. You will implement different test-time verification techniques, from simple zero-shot generation to more advanced methods like multi-candidate sampling and LLM-as-a-judge verification.

Objectives:
- Establish a baseline for an LLM's code generation performance.
- Implement and evaluate the pass@k metric by sampling multiple solutions.
- Build an LLM-as-a-Judge to select the best code candidate.
- Use an LLM to generate unit tests and perform "weak verification."
- Analyze the strengths and weaknesses of each approach.

#### Q1: Zero Shot (20 pts)
- 1a: Zero Shot Accuracy Code (15 pts)
- 1b: Written Analysis (5 pts)

#### Q2: Advanced Verification Techniques (40 pts)
- 2a: Pass@K Code (10 pts)
- 2b: LLM as Judge Code (20 pts)
- 2c: Evaluating Judge Code (10 pts)

#### Q3: Unit Test Generation and Weak Verifiers (50 pts)
- 3a: Unit Test Generator Code (20 pts)
- 3a: Evaluate Ground Truth of LLM Tests Code (10 pts)
- 3a: Written Analysis (10 pts)
- 3b: K-shot Accuracy on LLM tests (10 pts)

In [None]:
from cs329_hw.tasks import HumanEval
from cs329_hw.methods import get_sampler
from cs329_hw.run.sandbox_docker import run_python_in_docker
from cs329_hw.methods.verifiers import HumanEvalVerifier
from cs329_hw.methods.llm_unit_test import LLMUnitTestGeneratorConfig, LLMUnitTestGenerator

from tqdm import tqdm
import random
import matplotlib.pyplot as plt

# debug mode runs your code on a subset of the test set for faster iteration
DEBUG_MODE = True

In [2]:
# Include the following line to reload the modules when you make changes
# (helpful when iterating on code locally)

%load_ext autoreload
%autoreload 2

In [3]:

from dotenv import load_dotenv
# input path to .env file, which should contain TOGETHER_API_KEY
assert load_dotenv(dotenv_path="environment.env"), "could not load any variables from environment.env"

# model path for generation and judging
qwen_path = "together_ai/meta-llama/Llama-3.2-3B-Instruct-Turbo"

### HumanEval Dataset Setup
We will be working with the [HumanEval](https://arxiv.org/abs/2107.03374) benchmark to evaluate model performance on various coding tasks. The dataset contains 164 problems, each with a function signature, a descriptive docstring, and a set of unit tests to verify the correctness of the generated code.

In [4]:
humaneval = HumanEval()
problems = humaneval.get_problems(debug_mode=DEBUG_MODE)
system_prompt = humaneval.get_system_prompt()

#### Dataset exploration

Each problem in the dataset is provided as a dictionary containing the problem description, answer, and a set of unit tests.

In [None]:
example_docstring = problems[1]["problem"]
example_solution = problems[1]["answer"]
example_test_suite = problems[1]["test_suite"]
print("========Docstring========\n", example_docstring)
print("========Example solution======== \n", example_solution)
print("========Provided tests========\n", example_test_suite)

To help you build a few different kinds of verifiers, we have provided the following tools.

In [None]:
verifier = HumanEvalVerifier(runner=run_python_in_docker, timeout_s=2)

# before we run the code, we must combine the function docstring with the solution
docstring_and_correct_solution = f"{example_docstring}\n{example_solution}"
docstring_and_incorrect_solution = f"{example_docstring}\n    return False"

res = verifier.verify( 
    code=docstring_and_incorrect_solution, # swap this to docstring_and_correct_solution to see what a passing solution output looks like
    function_name=problems[1]["function_name"],
    test_suite=example_test_suite
)
verifier.print_verification_result(res)

### LLM Sampling

To generate code, we'll use a sampler function that can request multiple completions for a single prompt; this is the basis of the techniques that we will implement.

The `get_sampler("sample_multiple", n_samples=k, ...)` method returns a list of lists, where for each prompt that is passed as an input, we return a list of `k` responses.

In [None]:
method = get_sampler(
    "sample_multiple",
    qwen_path,
    temperature=0.7,
    system_prompt=system_prompt,  # IMPORTANT: use the provided system prompt
    n_samples=3
)
prompts = [example_docstring]
responses = method(prompts)
for i, response in enumerate(responses[0]):
    print(f"========Response {i+1}========")
    print(response)
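
The nested return shape described above can be checked without any API calls. The following is a minimal sketch using a hypothetical stand-in, `make_fake_sampler`, which is not part of `cs329_hw`; it only mimics the list-of-lists layout (outer index = prompt, inner index = sample).

```python
from typing import Callable


def make_fake_sampler(n_samples: int) -> Callable[[list[str]], list[list[str]]]:
    """Hypothetical stand-in for the course sampler; makes no API calls."""
    def sample(prompts: list[str]) -> list[list[str]]:
        # One inner list per prompt, each holding n_samples placeholder completions.
        return [[f"<completion {j} for {p!r}>" for j in range(n_samples)] for p in prompts]
    return sample


fake_method = make_fake_sampler(n_samples=3)
fake_responses = fake_method(["prompt A", "prompt B"])

# Outer index selects the prompt, inner index selects the sample.
print(len(fake_responses), len(fake_responses[0]))
print(fake_responses[1][0])
```

This is why the cell above iterates over `responses[0]`: only one prompt was passed, so all three samples live in the first inner list.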

Notice that the LLM's output includes Markdown code fences (e.g., ` ```python `). We need to remove these before we can execute the code.

In [None]:
import re

def extract_code(text: str) -> str:
    """Strips Markdown code fences (```python and ```) from LLM output, keeping the code inside them."""
    if not text:
        return ""
    text = re.sub(r"```(?:python)?\s*", "", text)
    text = text.replace("```", "")
    return text.strip()


cleaned_responses = [extract_code(resp) for resp in responses[0]]
for i, response in enumerate(responses[0]):
    print(f"========Cleaned response {i+1}========")
    print(cleaned_responses[i])
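
Note that `extract_code` only deletes the fence markers, so any prose the model writes around the fences stays in the string and may cause a syntax error at execution time. A stricter alternative, sketched below under the assumption that the first fenced block holds the solution, keeps only that block's body. The helper name `extract_first_code_block` is hypothetical and not part of the provided code.

```python
import re

# Matches an opening fence (optionally tagged "python"), then lazily captures up to the next fence.
FENCE_RE = re.compile(r"```(?:python)?[ \t]*\n(.*?)```", flags=re.DOTALL)


def extract_first_code_block(text: str) -> str:
    """Return the body of the first fenced block, or the stripped text if no fence exists."""
    if not text:
        return ""
    match = FENCE_RE.search(text)
    if match is None:
        return text.strip()
    # Strip only newlines so indentation on the first code line is preserved.
    return match.group(1).strip("\n")


reply = "Here is my solution:\n```python\ndef add(a, b):\n    return a + b\n```\nHope this helps!"
print(extract_first_code_block(reply))
```

Whether this is preferable depends on the model: if it sometimes splits a solution across several fenced blocks, taking only the first block would drop code.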

Finally, let's verify one of the cleaned responses to see if it's correct.

In [None]:
res = verifier.verify(
    code=cleaned_responses[1], # you can edit the index here to try each of the three responses
    function_name=problems[1]["function_name"],
    test_suite=example_test_suite,
)
verifier.print_verification_result(res)

### Part 1: Zero-shot predictions (20 pts)
- 1a: Zero Shot Accuracy Code (15 pts)
- 1b: Written Analysis (5 pts)

First, we will evaluate the baseline accuracy of the predictions with a single zero-shot sample.

Deliverable:

* Write your code in the section specified by `TODO: YOUR CODE STARTS HERE` and `TODO: YOUR CODE ENDS HERE`.
* Report the accuracy of the predictions below.

Hint: Look at the entries of the dictionary returned by `verifier.verify()`.

In [None]:
def calculate_accuracy(predictions: list[str], problems: list[dict], verifier: HumanEvalVerifier):
    """
    Calculates the zero-shot accuracy of code predictions.

    Args:
        predictions: A list of generated code strings, one for each problem.
        problems: The list of HumanEval problem dictionaries.
        verifier: The HumanEvalVerifier instance.

    Returns:
        A tuple containing:
        - accuracy (float): The fraction of correctly solved problems (i.e. passed all unit tests).
        - response (list[str]): A list of the verifier's `stdout` field, containing the results of each problem.
        - incorrect_indices (list[int]): A list of indices for the problems that failed.
    """
    ### TODO: YOUR CODE STARTS HERE
    
    ### TODO: YOUR CODE ENDS HERE
    return accuracy, response, incorrect_indices

In [None]:
method = get_sampler("sample_multiple", qwen_path, temperature=0.7, n_samples=1, system_prompt=system_prompt)
prompts = [entry["problem"] for entry in problems]
predictions_all_probs_zero_shot = method(prompts)
cleaned_predictions_zero_shot = [extract_code(raw_code[0]) for raw_code in predictions_all_probs_zero_shot]
accuracy, response, wrong = calculate_accuracy(cleaned_predictions_zero_shot, problems, verifier)

print(f"Accuracy: {accuracy}")
print(f"Indices of failed problems: {wrong}")
print(f"Unit test results of all problems: {response}")

##### 1b) Experiment with different problems and any failed test cases. In a few sentences, explain what patterns you observe among the failed test cases. What are some possible reasons the generated code fails them?

<span style="color:red">YOUR ANSWER HERE</span>


### Part 2: Advanced Verification Techniques (40 pts)
- 2a: Pass@K Code (10 pts)
- 2b: LLM as Judge Code (20 pts)
- 2c: Evaluating Judge Code (10 pts)

In this section, we'll explore more sophisticated techniques to improve our success rate.

#### 2a) Parallel Sampling and pass@K

Instead of generating just one solution, we want to independently generate `k` different solutions and check if any of them are correct. Here, we use `k=3`.

In [None]:
method = get_sampler("sample_multiple", qwen_path, temperature=0.7, n_samples=3, system_prompt=system_prompt)
prompts = [entry["problem"] for entry in problems]
preds_3shot = method(prompts)
cleaned_preds_3shot = [ # list[list[str]] here, where each inner list[str] contains the k code samples for that problem
    [extract_code(code) for code in raw_codes]
    for raw_codes in preds_3shot
]

Fill in the `k_shot_acc` function, which computes pass@k i.e. the proportion of problems for which a correct solution is obtained within `k` attempts.
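
For context, the HumanEval paper also defines an unbiased per-problem estimator, pass@k = 1 - C(n-c, k) / C(n, k), where n samples are drawn and c of them are correct. The sketch below is not the function you are asked to implement; it uses made-up counts to show how the estimator behaves, and that with n = k it reduces to the "any of k passed" definition used in this homework.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset has no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical data: number of correct samples per problem, out of n = 3 draws each.
correct_counts = [0, 1, 3, 2]
n = 3
pass_at_1 = sum(pass_at_k(n, c, 1) for c in correct_counts) / len(correct_counts)
pass_at_3 = sum(pass_at_k(n, c, 3) for c in correct_counts) / len(correct_counts)
print(f"pass@1 = {pass_at_1:.3f}, pass@3 = {pass_at_3:.3f}")
```

With n = k = 3, each problem contributes 1 if at least one sample is correct and 0 otherwise, which matches the proportion described above.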

In [None]:
def k_shot_acc(predictions: list[list[str]], problems: list[dict], verifier):
    """
    Calculates pass@k accuracy. A problem is solved if any of its k candidates pass.

    Args:
        predictions (list[list[str]]): A list where each item is another list of k code strings for a problem.
        problems (list[dict]): The list of HumanEval problem dictionaries.
        verifier (HumanEvalVerifier): The HumanEvalVerifier instance.

    Returns:
        A tuple containing:
        - accuracy (float): The pass@k accuracy.
        - solved_fns (list[str]): A list of function names for problems that were solved.
    """
    num_probs = len(predictions)
    num_corr = 0
    solved_fns = []

    ### TODO: YOUR CODE STARTS HERE

    ### TODO: YOUR CODE ENDS HERE
    accuracy = num_corr / num_probs if num_probs > 0 else 0.0
    return accuracy, solved_fns

accuracy_3shot, _ = k_shot_acc(cleaned_preds_3shot, problems, verifier)
print(f"3-shot accuracy: {accuracy_3shot}")

Observe how pass@k varies as we change the number of generated samples, comparing k = 1, 3, 6, and 9.

In [None]:
import matplotlib.pyplot as plt
from tqdm import tqdm

def sample_and_evaluate(k: int):
    """Samples k code completions per problem and computes pass@k accuracy."""
    method = get_sampler(
        "sample_multiple",
        qwen_path,
        temperature=0.7,
        n_samples=k,
        system_prompt=system_prompt
    )
    prompts = [entry["problem"] for entry in problems]
    preds = method(prompts)
    cleaned_preds = [[extract_code(code) for code in raw_codes] for raw_codes in preds]
    acc, _ = k_shot_acc(cleaned_preds, problems, verifier)
    return acc


sample_sizes = [1, 3, 6, 9]
accuracies = [accuracy, accuracy_3shot]  # Reuse the existing 1-shot (zero-shot) and 3-shot accuracies

# Run for k = 6 and 9
for k in sample_sizes[2:]:
    acc = sample_and_evaluate(k)
    accuracies.append(acc)
    print(f"Accuracy ({k}-shot): {acc:.3f}")

# Plot results
plt.figure(figsize=(6, 4))
plt.plot(sample_sizes, accuracies, marker="o", linewidth=2)
plt.title("Pass@K")
plt.xlabel("K")
plt.ylabel("Accuracy")
plt.grid(True, linestyle="--", alpha=0.6)
plt.show()

#### 2b) LLM-as-a-Judge Verification
For some tasks, executing code can be slow and resource-intensive. An alternative is to use another LLM as a "judge" to review the candidate solutions and select the one it deems most likely to be correct. This leverages the model's understanding of code quality and logic without requiring execution.

Your task is to implement the `judge` and `_build_messages` methods in the `LLMJudge` class.

- `_build_messages`: Construct the prompt that will be sent to the judge LLM. It should include the problem specification and the formatted candidate solutions.

- `judge`: Use the sampler to send the prompt to the judge and parse its JSON response to extract the chosen index and reasoning.

In [None]:
from __future__ import annotations
from dataclasses import dataclass
from typing import Callable, List, Dict, Any, Optional, Tuple
import json
import re
from collections import Counter
from cs329_hw.methods.simple_samplers import SampleMultiple


@dataclass
class JudgeConfig:
    temperature: float = 0.7   # sampling temp you set inside your sampler
    max_choices: int = 10      # safety cap for number of candidate codes

class LLMJudge:
    """
    LLM-as-a-judge that selects the index of the best code snippet from a list of candidate completions.
    If no candidate seems correct, the LLM returns None.
    """

    def __init__(self, sampler: SampleMultiple, cfg: JudgeConfig = JudgeConfig(), model_name: str = qwen_path):
        self.sampler = sampler
        self.cfg = cfg
        self.model_name = model_name

    def judge(
        self,
        problem_prompt: str,
        function_name: str,
        code_snippets: List[str],
    ) -> Dict[str, Any]:
        """
        Orchestrates the judging process by building a prompt, querying the LLM, and parsing the response.

        Your implementation should follow these steps:
        1. Call `self._build_messages()` to construct the detailed prompt for the LLM-based judge, which 
            includes the code snippets to be judged.
        2. Use `self.sampler` to send this prompt to the LLM and get its raw response.
        3. Call `self._parse_json_choice()` to robustly extract the judge's decision from the raw text.
            
        Args:
            problem_prompt (str): The problem specification.
            function_name (str): The name of the target function.
            code_snippets (List[str]): A list of candidate code snippets.

        Returns:
            A dictionary containing:
            - "choice": the int index of the chosen code snippet, or None
            - "reason": the stripped string that describes why the model chose the option
            - "raw_response": the raw response from the LLM for debugging
        """
        assert 1 <= len(code_snippets) <= self.cfg.max_choices

        ### TODO: YOUR CODE STARTS HERE

        ### TODO: YOUR CODE ENDS HERE
        
        return {"choice": final_choice, "reason": reason, "raw_response": raw_response}

    def _build_messages(self, problem_prompt: str, function_name: str, code_snippets: List[str]) -> str:
        """
        Builds the full text prompt for the LLM-as-a-judge model.

        Your prompt should include:
        - A message describing the LLM judge's behavior as a code evaluator
        - The problem statement (`problem_prompt`) and target function name (`function_name`)
        - The list of all candidate code snippets the LLM judge will choose from (`code_snippets`)
        - Instructions about choosing the most correct code
        - Examples of the expected response format.
          
        The LLM judge should be prompted to return a single-line JSON object with this exact schema (no prose before/after):
        {{
        "choice": <integer index or null>,
        "reason": "<one short sentence>"
        }}

        Expected Output:
          The returned string should contain both the system prompt and user instructions,
          ready to be passed into the LLM sampler.
        """

        ### TODO: YOUR CODE STARTS HERE

        ### TODO: YOUR CODE ENDS HERE
        
        return prompt
        


    def _parse_json_choice(self, raw: str) -> Tuple[Optional[int], str]:
        """
        Robustly extracts the judge's choice and reason from the raw LLM text response.

        Args:
            raw (str): The raw text output from the LLM judge.

        Returns:
            A tuple containing:
                - Optional[int]: The chosen index (or None if unparseable, None-chosen, or invalid)
                - str: The LLM's reasoning for the choice
        """
        if not raw or not raw.strip():
            return None, "Empty response"

        first = raw.strip().splitlines()[0].strip()
        obj = None
        try:
            obj = json.loads(first)
        except Exception:
            m = re.search(r"\{.*\}", raw, flags=re.DOTALL)
            if m:
                try:
                    obj = json.loads(m.group(0))
                except Exception:
                    obj = None

        if not isinstance(obj, dict):
            return None, "Unparseable"

        choice = obj.get("choice", None)
        reason = obj.get("reason", "")
        if choice is None:
            return None, reason or "None"

        try:
            idx = int(choice)
            return (idx if idx >= 0 else None), reason
        except Exception:
            return None, reason or "Non-integer index"
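
To see which replies this kind of two-stage parsing accepts, here is a standalone sketch. It mirrors the first-line-then-regex strategy of `_parse_json_choice` but is a simplified, stricter variant, not the provided method: it only accepts integer JSON values and adds an upper-bound check against the number of candidates, which the method above does not perform. The function name `parse_choice` and the sample replies are hypothetical.

```python
import json
import re
from typing import Optional


def parse_choice(raw: str, n_candidates: int) -> Optional[int]:
    """Try first-line JSON, then a brace-matching regex fallback, then range-check the index."""
    obj = None
    lines = raw.strip().splitlines()
    if lines:
        try:
            obj = json.loads(lines[0].strip())
        except json.JSONDecodeError:
            m = re.search(r"\{.*\}", raw, flags=re.DOTALL)
            if m:
                try:
                    obj = json.loads(m.group(0))
                except json.JSONDecodeError:
                    obj = None
    if not isinstance(obj, dict):
        return None
    choice = obj.get("choice")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(choice, bool) or not isinstance(choice, int):
        return None
    return choice if 0 <= choice < n_candidates else None


replies = [
    '{"choice": 1, "reason": "Handles the empty list."}',
    'Sure! Here is my verdict:\n{"choice": 0, "reason": "Only one that sorts."}',
    '{"choice": 7, "reason": "Out of range for three candidates."}',
    '{"choice": null, "reason": "None are correct."}',
    'I cannot decide.',
]
parsed = [parse_choice(r, n_candidates=3) for r in replies]
print(parsed)
```

When using the provided parser, a similar bound check in `judge` or `evaluate_judge` may be worth considering, since an index such as 7 would otherwise be returned for a list of three candidates.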

To test this implementation, we can apply the LLM-as-a-judge to the three candidates we sampled for the first HumanEval problem (`problems[0]`) in the pass@3 experiment of Part 2a.

In [None]:
judge_method = get_sampler("sample_multiple", qwen_path, temperature=1, n_samples=1)

judge = LLMJudge(sampler=judge_method, cfg=JudgeConfig(temperature=0.7), model_name=qwen_path)

candidates = cleaned_preds_3shot[0]
decision = judge.judge(problems[0]["problem"], problems[0]["function_name"], candidates)
print("Problem:", problems[0]["problem"])
for i, candidate in enumerate(candidates):
    print(f"Candidate {i}: {candidate}")
    print("---"*50)
print("Judge choice:", decision["choice"])
print("Judge reason:", decision["reason"])


**Evaluating the Judge**

Now, let's write a function to loop through all our problems, use the judge to select the best of the three candidates we generated earlier, and then calculate the final accuracy.

In [None]:
def evaluate_judge(problems: list[dict], code_generations: list[list[str]], judge: LLMJudge) -> list[str]:
    """
    Evaluates a set of generated code solutions using a given LLM judge.

    Args:
        problems (list[dict]): The list of HumanEval problem dictionaries.
        code_generations (list[list[str]]): A list where each element is a list of code samples for the corresponding problem.
        judge (LLMJudge): LLMJudge object

    Returns:
        A list containing the code snippet chosen by the judge for each problem.
        If the judge did not make a choice for a given problem, the corresponding
        element in the list will be `None`.
    """
    ### TODO: YOUR CODE STARTS HERE
    
    ### TODO: YOUR CODE ENDS HERE
    return results

code_generation_method = get_sampler(
    "sample_multiple",
    qwen_path,
    temperature=0.7,
    n_samples=3,
    system_prompt=system_prompt
)
judge_method = get_sampler("sample_multiple", qwen_path, temperature=1, n_samples=1)
judge = LLMJudge(sampler=judge_method, cfg=JudgeConfig(temperature=0.7), model_name=qwen_path)

cleaned_predictions_llm_judge = evaluate_judge(problems, cleaned_preds_3shot, judge)
accuracy, response, wrong = calculate_accuracy(cleaned_predictions_llm_judge, problems, verifier)
print(f"LLM Judge accuracy: {accuracy}")


### Part 3: Unit Test Generation and Weak Verifiers (50 pts)
- 3a: Unit Test Generator Code (20 pts)
- 3a: Evaluate Ground Truth of LLM Tests Code (10 pts)
- 3a: Written Analysis (10 pts)
- 3b: K-shot Accuracy on LLM tests (10 pts)

So far, we've relied on the ground-truth test suite. In many real-world cases, we don't have access to this ground-truth test suite. In this section, we'll use an LLM to generate its own unit tests. This enables a pipeline with "weak verifiers", where we use these synthetic tests to filter and select code solutions.

#### Part 3a: Generating and Validating Unit Tests

First, let's assess how good the LLM is at writing tests. We will prompt it to generate 5 test cases for each problem based only on the docstring. Then, we'll run these tests against the ground-truth solution. The percentage of problems where the ground-truth code passes all synthetic tests gives us a measure of the LLM's ability to generate reliable tests.
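
The validation idea, running synthetic tests against a trusted reference and flagging the ones it fails, can be sketched on a toy problem. Everything below is illustrative: the palindrome function, the `(args, expected)` tuple format, and the helper `failing_tests` are assumptions for this example and do not reflect the `TestCase` format or the verifier used in this homework.

```python
from typing import Any, Callable


def reference_solution(text: str) -> bool:
    """Toy ground-truth implementation: is `text` a palindrome (case-sensitive)?"""
    return text == text[::-1]


# Hypothetical synthetic tests as (args, expected) pairs.
# The last one encodes a wrong expectation: the spec is case-sensitive.
synthetic_tests: list[tuple[tuple[Any, ...], Any]] = [
    (("abba",), True),
    (("abc",), False),
    (("",), True),
    (("Abba",), True),
]


def failing_tests(fn: Callable[..., Any], tests: list[tuple[tuple[Any, ...], Any]]) -> list[int]:
    """Return the indices of tests that the given function does not satisfy."""
    failures = []
    for i, (args, expected) in enumerate(tests):
        try:
            ok = fn(*args) == expected
        except Exception:
            ok = False  # a crash counts as a failure
        if not ok:
            failures.append(i)
    return failures


bad = failing_tests(reference_solution, synthetic_tests)
print("Tests the reference fails:", bad)
```

A test suite counts as reliable in the sense used here only when this list is empty, since a single wrong expectation would reject every correct candidate.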

**Note**: We use a more powerful model for test generation, as it is a more demanding task.

**Deliverable:** Fill out `LLMUnitTestGenerator._build_prompt()` and `LLMUnitTestGenerator.generate()` in `methods/llm_unit_test.py`

In [None]:
qwen_large = "together_ai/Qwen/Qwen2.5-72B-Instruct-Turbo"

# Here is an example of using the LLM-based unit test generator to create tests for one problem.
testgen_method = get_sampler(
    "sample_multiple",
    qwen_large, 
    temperature=1.0,
    n_samples=1,
    system_prompt="You are a careful Python unit test designer."
)
testgen = LLMUnitTestGenerator(sampler=testgen_method, cfg=LLMUnitTestGeneratorConfig())

doc = problems[0]["problem"]
fn  = problems[0]["function_name"]

cases = testgen.generate(problem_prompt=doc, function_name=fn, n_unit_tests=5)
candidate_code = f"{doc}\n{problems[0]['answer']}"

res = verifier.verify(code=candidate_code, function_name=fn, test_suite=cases)
HumanEvalVerifier.print_verification_result(res)

In [None]:
def evaluate_ground_truth_on_llm_unit_tests(
    problems: list[dict],
    llm_unit_test_generator: LLMUnitTestGenerator,
    verifier: HumanEvalVerifier,
    n_unit_tests: int = 5
):
    """
    Generates unit tests using the LLMUnitTestGenerator and evaluates the ground-truth code against them.

    Returns:
        - accuracy (float): Fraction of problems where ground truth passed all synthetic tests.
        - correct_idxs (list[int]): The indices of the problems that passed the tests.
        - generated_unit_tests (list[list[TestCase]]): A list of test cases for each problem, where the outer list is over all problems.
    """
    num_problems = len(problems)
    num_correct = 0
    correct_idxs = []
    generated_unit_tests = []

    ### TODO: YOUR CODE STARTS HERE
    
    ### TODO: YOUR CODE ENDS HERE

    accuracy = num_correct / num_problems if num_problems > 0 else 0.0
    print(f"Ground truth code passed all LLM-generated tests for {num_correct}/{num_problems} problems.")
    print(f"Accuracy: {accuracy}")
    return accuracy, correct_idxs, generated_unit_tests

accuracy, correct_idxs, generated_unit_tests = evaluate_ground_truth_on_llm_unit_tests(problems, testgen, verifier, n_unit_tests=5)

We see that the LLM-generated unit tests are sometimes unreliable: for some problems, the ground-truth code fails the synthetic tests. Let's examine one of these failures to see what went wrong.

In [None]:
import random

incorrect_idxs = set(range(len(problems))) - set(correct_idxs)
incorrect_idx = random.choice(list(incorrect_idxs))
incorrect_problem_data = problems[incorrect_idx]

print(f"Problem: {incorrect_problem_data['problem']}", "-"*100)
doc, ans = incorrect_problem_data['problem'], incorrect_problem_data['answer']
gt_code = f"{doc}\n{ans}"
for i, test in enumerate(generated_unit_tests[incorrect_idx]):
    print(f"LLM-generated unit test {i}: {test}")
print()
res = verifier.verify(code=gt_code, function_name=incorrect_problem_data['function_name'], test_suite=generated_unit_tests[incorrect_idx])
HumanEvalVerifier.print_verification_result(res)
print("-"*100)
print("Ground truth code:")
print(gt_code)

##### Analyze at least one of the mismatches and determine which category (or categories) it falls into. Explain your reasoning with specific details and examples from the test cases.
- **Misinterpreting nuanced requirements:** The LLM grasps the main goal but fails to apply subtle details such as secondary conditions or tie-breaking rules.
- **Flawed algorithmic simulation:** The model cannot reliably execute a multi-step algorithm internally. Instead of computing the true result (e.g. a full Collatz sequence), it produces a statistically likely but wrong output.
- **Overgeneralization from training data:** The LLM applies a solution pattern from a similar but distinct problem, or from the example test cases in the docstring. The generated test is valid for that other problem, but not for the specific function provided.

<span style="color:red">YOUR ANSWER HERE</span>

### 3b) LLM-generated unit tests as weak verifiers

Now that we've identified a subset of problems for which our LLM generated reliable tests (the indices in `correct_idxs`), we can use those tests to select the best code candidate. This simulates a realistic scenario where we don't have a human-written test suite and must rely on synthetic tests instead.

To do this, we'll conduct a direct comparison on this trusted subset. For each of these problems, we'll reuse the three candidate solutions generated earlier in the pass@3 experiment. We'll then evaluate their correctness using two different methods:

1. **Baseline with Oracle Ground-Truth Tests:** First, we'll calculate the pass@k accuracy on this subset using the original, human-written test suites. This gives us the true, "best possible" score for our candidate solutions and serves as our gold standard for comparison.

2. **Evaluation with Synthetic LLM-Generated Tests (Weak Verifier)**: Next, we will perform the same calculation using our trusted, LLM-generated test suites (`subset_testcases`) for verification. The result will tell us how effectively our automated pipeline can identify correct code.
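
The gap between the two numbers comes from candidates that the weak verifier accepts but the oracle rejects. The sketch below uses made-up boolean pass/fail outcomes (three problems, three candidates each) rather than real verifier results; the names `oracle_pass`, `synthetic_pass`, and `any_of_k` are hypothetical.

```python
# Hypothetical outcomes: rows are problems, columns are the k = 3 candidates.
oracle_pass = [
    [True, False, False],
    [False, False, False],
    [True, True, False],
]
synthetic_pass = [
    [True, True, False],
    [False, True, False],
    [True, True, False],
]


def any_of_k(results: list[list[bool]]) -> float:
    """Fraction of problems where at least one candidate passed."""
    return sum(any(row) for row in results) / len(results) if results else 0.0


acc_oracle = any_of_k(oracle_pass)
acc_synth = any_of_k(synthetic_pass)

# Candidates the synthetic tests accept although the ground-truth tests reject them.
false_accepts = sum(
    s and not o
    for o_row, s_row in zip(oracle_pass, synthetic_pass)
    for o, s in zip(o_row, s_row)
)

print(f"pass@3 (oracle): {acc_oracle:.2f}")
print(f"pass@3 (synthetic): {acc_synth:.2f}")
print(f"false accepts: {false_accepts}")
```

In this toy data the synthetic score is higher than the oracle score purely because of false accepts, so a high synthetic pass@k on its own does not imply the selected code is correct.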

In [None]:
def k_shot_acc_synthetic(
    predictions: list[list[str]],
    problems: list[dict],
    unit_tests: "list[list[TestCase]]",  # quoted because TestCase is not imported in this notebook
    verifier: HumanEvalVerifier
):
    """
    Evaluates model accuracy using LLM-generated (synthetic) test cases.
    A problem is considered solved if any of its candidate solutions pass all synthetic tests.

    Args:
        predictions (list[list[str]]): A list of code predictions for each problem, where each inner list contains k code samples
        problems (list[dict]): A list of problem dicts
        unit_tests (list[list[TestCase]]): A list of LLM-generated test cases for each problem
        verifier (HumanEvalVerifier): The HumanEvalVerifier instance

    Returns:
    A tuple containing:
        - The pass@k accuracy, i.e. the fraction of problems for which at least one candidate solution passed all the LLM-generated unit tests
        - A list of function names for the problems that were successfully solved.
    """
    num_probs = len(predictions)
    num_corr = 0
    correct_fns = []

    assert len(predictions) == len(problems) == len(unit_tests)
    ### TODO: YOUR CODE STARTS HERE

    ### TODO: YOUR CODE ENDS HERE
    accuracy = num_corr / num_probs if num_probs > 0 else 0.0
    return accuracy, correct_fns

subset_preds_3_shot = [cleaned_preds_3shot[i] for i in correct_idxs]
subset_problems = [problems[i] for i in correct_idxs]
subset_testcases = [generated_unit_tests[i] for i in correct_idxs]
accuracy_true, passed_true = k_shot_acc(subset_preds_3_shot, subset_problems, verifier)
accuracy_synth, passed_synth = k_shot_acc_synthetic(subset_preds_3_shot, subset_problems, subset_testcases, verifier)

print(f"Evaluating pass@3 accuracy on subset of {len(correct_idxs)} problems (where ground truth code passed all LLM-generated tests)")
print(f"pass@3 with ground truth tests: {accuracy_true:.2f}")
print(f"pass@3 with LLM-generated tests: {accuracy_synth:.2f}")

#### Refining Unit Tests

To increase the quality of generated tests, we can provide the LLM with more context (e.g. the ground truth code, a summary of the program's logic, or a description of intended behavior and potential edge cases). This paper provides a good overview of this topic: https://arxiv.org/pdf/2502.01619