## LLMClient

This file contains several classes for LLM backends, as well as a class that assists in prompt building.

It also contains experimental output evaluating different models against different preset claims.

### Testing TL;DR

1. It's hard to balance good reasoning and quick responses.
2. Most setups cannot even explain claims such as `ice is below room temperature` when given evidence.
3. I have a prompt that gives immediate responses, but they are often wrong.
4. Prompts that cause the LLM to explain its reasoning (even if I ask for just one sentence) often run out of tokens because the model does not follow the prompt.
5. I do not know which hyperparameters are best for the `transformers` library.
6. The `llama-cpp` approach performs better than the `transformers` library with the same models because, for some reason, it always follows the prompt.
7. The `llama-cpp` approach requires local hardware to run and still occasionally fails basic reasoning tasks (although `llama-cpp` is still pretty fast on CPUs).
8. The other notebook has a lot more testing scenarios.

### TransformersLMClient

`TransformersLMClient` provides a simple set of methods to allow quickly interfacing with the `AutoModelForCausalLM` class from Hugging Face's `transformers` library.

To use contiguous context, the caller must pass a string representing the full dialogue history to the `send_query` method.
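A minimal sketch of how a caller might keep that history. The `format_history` helper and the `User:`/`Assistant:` labels are illustrative assumptions, not part of this notebook; the best turn format depends on the model being used.

```python
from typing import List, Tuple

def format_history(turns: List[Tuple[str, str]], next_user_message: str) -> str:
    """Flatten prior (user, assistant) turns plus the new message into one string."""
    lines = []
    for user_msg, assistant_msg in turns:
        lines.append(f"User: {user_msg}")
        lines.append(f"Assistant: {assistant_msg}")
    # end with an open assistant turn so the model continues from there
    lines.append(f"User: {next_user_message}")
    lines.append("Assistant:")
    return "\n".join(lines)

history = [("Is ice cold?", "Yes, ice is below room temperature.")]
context = format_history(history, "Why?")
print(context)
```

The resulting `context` string is what would be passed to `send_query`, and the caller appends each response to `history` before the next call.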

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
import torch
from typing import Dict
import time

class TransformersLMClient():
  """
  The TransformersLMClient utilizes the AutoModelForCausalLM class
  from Hugging Face's `transformers` library.

  The constructor takes the name of the model, which should come from
  this list:
  https://huggingface.co/models?pipeline_tag=text-generation&num_parameters=min:3B,max:6B&sort=likes
  """

  model_name: str
  max_to_generate: int
  top_k: int
  top_p: float
  temperature: float
  repetition_penalty: float

  _tokenizer: AutoTokenizer
  _model: AutoModelForCausalLM
  _config: AutoConfig

  def __init__(self,
               model_name: str,
               max_to_generate: int = 100,
               top_k: int = 50,
               top_p: float = 1.0,
               temperature: float = 0.7,
               repetition_penalty: float = 1.1) -> None:
    self.model_name = model_name
    self.max_to_generate = max_to_generate
    self.top_k = top_k
    self.top_p = top_p
    self.temperature = temperature
    self.repetition_penalty = repetition_penalty
    self._tokenizer = AutoTokenizer.from_pretrained(model_name)
    self._model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
    self._config = AutoConfig.from_pretrained(model_name)

  def send_query(self, context: str) -> str:
    """
    Given CONTEXT, prompts the loaded model and returns the response.
    """
    # move the inputs to wherever device_map="auto" placed the model
    inputs = self._tokenizer(context, return_tensors="pt").to(self._model.device)

    start = time.time()
    with torch.no_grad():
      outputs = self._model.generate(
        **inputs,
        max_new_tokens=self.max_to_generate,
        top_k=self.top_k,
        top_p=self.top_p,
        temperature=self.temperature,
        repetition_penalty=self.repetition_penalty
      )
    print(f'request took {time.time() - start} seconds')

    # decode only the newly generated tokens, skipping the echoed prompt
    prompt_length = inputs["input_ids"].shape[1]
    output_text = self._tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

    # return only the response
    return output_text.strip()


### LlamaCppClient

[LlamaCpp](https://github.com/ggml-org/llama.cpp) is a toolkit that enables LLM inference with minimal setup and state-of-the-art performance on a wide range of consumer hardware and in the cloud.

The `LlamaCppClient` has a similar interface to `TransformersLMClient`, but uses `llama-server` as a backend.

It has a simple interface, and most models allow easily turning off and on a designated 'thinking' stage.
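As a rough sketch of how the 'thinking' toggle can be expressed, the function below builds a chat-completions request body and appends a `/no_think` switch to the system message when thinking is disabled. `build_chat_payload` is a hypothetical helper name, and whether a given model honors `/no_think` is an assumption that depends on the model.

```python
from typing import Any, Dict

def build_chat_payload(system_message: str,
                       user_message: str,
                       should_think: bool,
                       temperature: float = 0.7) -> Dict[str, Any]:
    """Build a chat request body, optionally disabling the thinking stage."""
    # Assumption: the loaded model recognizes "/no_think" as a soft switch.
    # A separating space is added here purely for readability.
    suffix = "" if should_think else " /no_think"
    return {
        "messages": [
            {"role": "system", "content": system_message + suffix},
            {"role": "user", "content": user_message},
        ],
        "temperature": temperature,
    }

payload = build_chat_payload("You are a helpful assistant.", "Is ice cold?", should_think=False)
print(payload["messages"][0]["content"])
```

Keeping the switch in the system message means the user content can stay a plain filled-in template.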

In [None]:
import requests
import re

class LlamaCppClient:
    """
    The LlamaCppClient uses a running `llama-server` instance from the
    llama-cpp LLM-inference toolkit as its backend.

    The models compatible with llama-cpp are found here:
    https://huggingface.co/models?library=gguf&sort=trending
    """

    api: str
    should_think: bool
    temperature: float
    system_message: str

    def __init__(self,
                 system_message: str = "You are a helpful assistant.",
                 should_think: bool = False,
                 host: str = "127.0.0.1",
                 port: int = 4568,
                 temperature: float = 0.7,):
        self.api = f"http://{host}:{port}/v1/chat/completions"
        self.temperature = temperature
        self.should_think = should_think
        self.system_message = system_message

    def send_query(self, context: str) -> str:
        think = "" if self.should_think else "/no_think"
        messages = [
            {"role": "system", "content": self.system_message + think},
            {"role": "user", "content": context},
        ]

        payload = {
            # this field does nothing
            "model": "local",
            "messages": messages,
            "temperature": self.temperature
        }

        try:
            #start = time.time()
            response = requests.post(self.api, json=payload)
            response.raise_for_status()
            result = response.json()
            #print(f'request took {time.time() - start} seconds')
            content = result["choices"][0]["message"]["content"]
            return re.sub(r'<think>\s*</think>', '', content)
        except requests.RequestException as e:
            raise requests.RequestException(
                f"HTTP request to llama-cpp server failed: {e}"
            ) from e
        except (KeyError, IndexError) as e:
            # KeyError is a builtin; `requests.KeyError` does not exist
            raise KeyError(
                f"Unexpected response format: {e}"
            ) from e


### PromptAnalyzer

`PromptAnalyzer` allows quickly building templated, task-specific prompts and provides a simple interface for interacting with an LM.

The prompts themselves were taken from *RAGAR, Your Falsehood Radar* by Khaliq et al. (p. 14).

In [None]:
# a later version of this class will be exported
# EXPORT-IGNORE-START

In [None]:
class PromptAnalyzer():

  # general prompts in the RAGAR approaches
  # taken from 'RAGAR, Your Falsehood Radar', Khaliq et al.
  _prompts: Dict[str, str] = {
    "initial_question": "You are an expert fact-checker given an unverified claim that needs to be explored.\n\nClaim: {}\nDate (your questions must be framed to be before this date: {}\n\nYou follow these Instructions:\n1: You understand the entire claim.\n\n2. You will make sure that the question is specific and focuses on one aspect of the claim (focus on one topic, should detail where, who, and what) and is very, very short.\n\n3. You should not appeal to video evidence nor ask for calculations or methodology.\n\n4. You are not allowed to use the word \"claim\". Instead, if you want to refer to the claim, you should point out the exact issue in the claim that you are phrasing your question around.\n\n5. You must never ask for calculations or methodology.\n\n6. Create a pointed factcheck question for the claim.\n\nReturn only the question.",
    "follow_up": "You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored. You follow these steps:\n\nClaim: {}\nQuestion-Answer Pairs:\n{}\n\nAre you satisfied with the questions asked and do you have enough information to answer the claim?\n\nIf the answer to any of these questions is \"Yes\". then reply with only \"False\" or else answer, \"True\".",
    "secondary_question": "You are given an unverified statement and question-answer pairs regarding the claim that needs to be explored. You follow these steps:\n\nClaim: {}\nQuestion-Answer Pairs:\n{}\n\nYour task is to ask a followup question regarding the claim specifically based on the question answer pairs.\n\nNever ask for sources or publishing.\n\nThe follow-up question must be descriptive, specific to the claim, and very short, brief, and concise.\n\nThe follow-up question should not appeal to video evidence nor ask for calculations or methodology.\n\nThe followup question should not be seeking to answer a previously asked question. It can however attempt to improve that question.\n\nYou are not allowed to use the word \"claim\" or \"statement\". Instead if you want to refer to the claim/statement, you should point out the exact issue in the claim/statement that you are phrasing your question around.\n\nReply only with the followup question and nothing else.",
  }

  @staticmethod
  def build_prompt(prompt_type: str, args: list[str]) -> str:
    """
    Given PROMPT_TYPE, which must be a key in SELF._PROMPTS, returns the respective prompt ready for LM inference.
    """
    # let this throw an error
    prompt = PromptAnalyzer._prompts[prompt_type]
    return f"{prompt.format(*args)}\n\n"

  @staticmethod
  def parse_boolean_answer(response: str) -> bool | None:
    """
    Given RESPONSE, attempts to parse a binary value by searching the string for keywords 'True' or 'False'. Returns None if both or neither are found.
    """
    lower = response.lower()
    has_true = 'true' in lower
    has_false = 'false' in lower
    if has_true and not has_false:
      return True
    if has_false and not has_true:
      return False
    # both or neither keyword present
    return None

In [None]:
# EXPORT-IGNORE-END

# Tests

In [None]:
# EXPORT-IGNORE-START

Important qualities of generation:

1. The amount of text generated must be minimal
   - If generation takes too long, the throughput of our pipeline suffers.
2. The reasoning must be complete
   - If the model incorrectly reasons about a problem, then the fact checker is likely to draw the incorrect conclusion.
   
This section tries to find which configuration works best. I have broken up the configuration into 4 categories:

1. Client
   - currently, `llama-cpp` and the `transformers` library.
2. Model
   - note we are not fine-tuning in this notebook
3. Hyperparameters
4. Prompt
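One way to picture the search space is as a grid over these four categories. The sketch below only enumerates combinations; the temperature values and prompt labels are hypothetical placeholders, and not every combination is actually run in this notebook.

```python
from itertools import product

clients = ["transformers", "llama-cpp"]
models = ["Qwen/Qwen2.5-3B", "microsoft/phi-2", "Qwen/Qwen3-4B"]
temperatures = [0.2, 0.7]                 # hypothetical hyperparameter values
prompts = ["binary", "small_reasoning"]   # hypothetical labels for prompt variants

# one dict per (client, model, hyperparameter, prompt) combination
configs = [
    {"client": c, "model": m, "temperature": t, "prompt": p}
    for c, m, t, p in product(clients, models, temperatures, prompts)
]

print(len(configs))  # 2 * 3 * 2 * 2 = 24
```

Even this small grid yields 24 configurations, which is part of why the tests here sample only a few of them by hand.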

## Tests for `follow_up` prompts

We don't want any of the test section exported:

The baseline claim will be a simple logical modus ponens argument:

1. If water is at sea level and above 100 degrees Celsius, it is boiling.
2. The water is at sea level and above 100 degrees Celsius.
3. Therefore, the water must be boiling.
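The argument above can be sketched as a tiny three-valued check: the conclusion only follows when both parts of the premise are known to hold, and is undetermined otherwise. `is_boiling` is an illustrative helper, not part of the pipeline.

```python
from typing import Optional

def is_boiling(at_sea_level: Optional[bool], above_100_c: Optional[bool]) -> Optional[bool]:
    """Apply modus ponens: (sea level AND above 100 C) implies boiling.

    None means a fact is unknown. The rule says nothing when the
    antecedent is unknown or false, so the result is None in those cases.
    """
    if at_sea_level is True and above_100_c is True:
        return True
    return None

print(is_boiling(True, True))   # all evidence present: conclusive
print(is_boiling(None, True))   # elevation unknown: inconclusive
```

This mirrors what the follow-up check is expected to do: answer conclusively with full evidence, and decline when the elevation is missing.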

In [None]:
claim = "the water is not boiling"
qe_pairs = ("Q: \"What elevation are we at?\", A: \"We are at sea level.\"\n"
            "Q: \"What temperature is the water?\", A: \"The water is currently over 100 degrees Celcius.\"\n"
            "Q: \"Does water boil when it is over 100 degrees Celcius at sea level?\", A: \"Yes.\"")
qe_pairs_impartial = ("Q: \"What temperature is the water?\", A: \"The water is currently over 100 degrees Celcius.\"\n"
                      "Q: \"Does water boil when it is over 100 degrees Celcius at sea level?\", A: \"Yes.\"")

### Qwen/Qwen2.5-3B

I will start by loading the Qwen/Qwen2.5-3B model. Ideally, we stick to small and fast models.

In [None]:
TC = TransformersLMClient('Qwen/Qwen2.5-3B')

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Note that temperature is not valid on this model.

In [None]:
ctxt = PromptAnalyzer.build_prompt("follow_up", [claim, qe_pairs])
ans = TC.send_query(ctxt)
print(ctxt + ans)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored. You follow these steps:

Claim: the water is not boiling
Question-Answer Pairs:
Q: "What elevation are we at?", A: "We are at sea level."
Q: "What temperature is the water?", A: "The water is currently over 100 degrees Celcius."
Q: "Does water boil when it is over 100 degrees Celcius at sea level?", A: "Yes."

Are you satisfied with the questions asked and do you have enough information to answer the claim?

If the answer to any of these questions is "Yes". then reply with only "False" or else answer, "True".

Based on the provided question-answer pairs, here's my evaluation for each step:

1. **Q: "What elevation are we at?"** - Answer: "We are at sea level." (This implies a specific elevation.)
2. **Q: "What temperature is the water?"** - Answer: "The water is currently over 100 degrees Celsius." (This provides a temperature value.)
3. **Q: "Does water boi

Experimentation showed me that, regardless of the subject matter of the claim, the model inserts more context than asked for (we asked only for a binary answer).

`PromptAnalyzer` provides a method to quickly extract a binary answer from an LLM's response:

In [None]:
print(PromptAnalyzer.parse_boolean_answer(ans))

None


But of course it can't do so if the LLM runs out of tokens before emitting the final response. We could increase the number of tokens, but the more the model outputs, the longer it takes to generate the answer, and the slower our pipeline.
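Note that the parser is tri-state, so a truncated response yields `None`, and a check written as `assert not ...` cannot tell `None` from `False`. The sketch below re-implements the parser inline so that it runs on its own; the sample string is made up for illustration.

```python
from typing import Optional

def parse_boolean_answer(response: str) -> Optional[bool]:
    """Return True/False if exactly one keyword appears, else None."""
    lower = response.lower()
    has_true, has_false = "true" in lower, "false" in lower
    if has_true == has_false:
        # both or neither keyword present
        return None
    return has_true

truncated = "Based on the provided question-answer pairs, here's my evaluation"
result = parse_boolean_answer(truncated)

print(result)            # nothing could be parsed
print(not result)        # truthiness check treats None like False
print(result is False)   # identity check distinguishes the two
```

When the distinction matters, comparing with `is True` or `is False` makes an unparsed answer fail the check instead of silently passing.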

How about a new prompt that emphasizes the binary answer?

#### Binary prompt

In [None]:
PromptAnalyzer._prompts["follow_up"] = "You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored:\n\nClaim: {}\nQuestion-Answer Pairs:\n{}\n\nAre you satisfied with the questions asked and do you have enough information to answer the claim?\n\nIf the answer to both of these questions is \"Yes\". then reply with only \"True\" or else answer, \"False\". Do not give further reasoning.\n\nBinary answer: "

In [None]:
ctxt = PromptAnalyzer.build_prompt("follow_up", [claim, qe_pairs])
ans = TC.send_query(ctxt)
print(ctxt + ans)
assert PromptAnalyzer.parse_boolean_answer(ans), "Expected True!"

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored:

Claim: the water is not boiling
Question-Answer Pairs:
Q: "What elevation are we at?", A: "We are at sea level."
Q: "What temperature is the water?", A: "The water is currently over 100 degrees Celcius."
Q: "Does water boil when it is over 100 degrees Celcius at sea level?", A: "Yes."

Are you satisfied with the questions asked and do you have enough information to answer the claim?

If the answer to both of these questions is "Yes". then reply with only "True" or else answer, "False". Do not give further reasoning.

Binary answer: 

True
True


<span style="color:green;">**Correct**</span> for the 'True' case!

In [None]:
ctxt = PromptAnalyzer.build_prompt("follow_up", [claim, qe_pairs_impartial])
ans = TC.send_query(ctxt)
print(ctxt + ans)
assert not PromptAnalyzer.parse_boolean_answer(ans), "Expected False!"

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored:

Claim: the water is not boiling
Question-Answer Pairs:
Q: "What temperature is the water?", A: "The water is currently over 100 degrees Celcius."
Q: "Does water boil when it is over 100 degrees Celcius at sea level?", A: "Yes."

Are you satisfied with the questions asked and do you have enough information to answer the claim?

If the answer to both of these questions is "Yes". then reply with only "True" or else answer, "False". Do not give further reasoning.

Binary answer: 

True
True


<span style="color:red;">**Incorrect**</span> for the 'False' case! What about a harder example?

In [None]:
hard_claim = "The titular Lord of the Rings was never bested by a dog."
hard_qe_pairs = ("Q: \"Who is the Lord of the Rings?\", A: \"In the Tolkien universe, Sauron is generally considered the Lord of the Rings, having created the One Ring in his conquest over Middle Earth.\"\n"
                 "Q: \"Could any dog, given the right circumstances, defeat Sauron in combat or contest?\", A: \"In the Tolkien universe, a wolfhound named Huan was prophesied to only be defeated by the greatest wolf that ever lived.\"\n"
                 "Q: \"What Race is Sauron?\", A: \"Sauron is a fallen Maia, a class of immortal beings who can choose to incarnate themselves in a mortal body.\"\n"
                 "Q: \"Did Sauron and Huan ever fight?\", A: \"During an attack on the stronghold of Tol-in-Gaurhoth, Huan managed to defeat Sauron after the latter took the shape of a wolf in attempts to fill the role of Huan's killer.\"")
hard_qe_pairs_impartial = ("Q: \"Could any dog, given the right circumstances, defeat Sauron in combat or contest?\", A: \"In the Tolkien universe, a wolfhound named Huan was prophesied to only be defeated by the greatest wolf that ever lived.\"\n")

In [None]:
ctxt = PromptAnalyzer.build_prompt("follow_up", [hard_claim, hard_qe_pairs])
ans = TC.send_query(ctxt)
print(ctxt + ans)
assert PromptAnalyzer.parse_boolean_answer(ans), "Expected True!"

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored:

Claim: The titular Lord of the Rings was never bested by a dog.
Question-Answer Pairs:
Q: "Who is the Lord of the Rings?", A: "In the Tolkien universe, Sauron is generally considered the Lord of the Rings, having created the One Ring in his conquest over Middle Earth."
Q: "Could any dog, given the right circumstances, defeat Sauron in combat or contest?", A: "In the Tolkien universe, a wolfhound named Huan was prophesied to only be defeated by the greatest wolf that ever lived."
Q: "What Race is Sauron?", A: "Sauron is a fallen Maia, a class of immortal beings who can choose to incarnate themselves in a mortal body."
Q: "Did Sauron and Huan ever fight?", A: "During an attack on the stronghold of Tol-in-Gaurhoth, Haun managed to defeat Sauron after the latter took the shape of a wolf in attempts to fill the role of Huan's killer."

Are you satisfied with the

<span style="color:green;">**Correct**</span> again for the 'True' case!

In [None]:
ctxt = PromptAnalyzer.build_prompt("follow_up", [hard_claim, hard_qe_pairs_impartial])
ans = TC.send_query(ctxt)
print(ctxt + ans)
assert not PromptAnalyzer.parse_boolean_answer(ans), "Expected False!"

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored:

Claim: The titular Lord of the Rings was never bested by a dog.
Question-Answer Pairs:
Q: "Could any dog, given the right circumstances, defeat Sauron in combat or contest?", A: "In the Tolkien universe, a wolfhound named Huan was prophesied to only be defeated by the greatest wolf that ever lived."


Are you satisfied with the questions asked and do you have enough information to answer the claim?

If the answer to both of these questions is "Yes". then reply with only "True" or else answer, "False". Do not give further reasoning.

Binary answer: 

True
True


<span style="color:red;">**Incorrect**</span> again for the 'False' case!

It seems the model regularly only outputs true:

In [None]:
easy_claim = "Autumn is the most beautiful season."
easy_qe_pairs_impartial = ("Q: \"What makes autumn the most beautiful season?\", A: \"Some people say autumn is the most beautiful season because they enjoy its colors, weather, and overall atmosphere.\"\n")

In [None]:
ctxt = PromptAnalyzer.build_prompt("follow_up", [easy_claim, easy_qe_pairs_impartial])
ans = TC.send_query(ctxt)
print(ctxt + ans)
assert not PromptAnalyzer.parse_boolean_answer(ans), "Expected False!"

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored:

Claim: Autumn is the most beautiful season.
Question-Answer Pairs:
Q: "What makes autumn the most beautiful season for some people?", A: "Some people say that autumn is the most beautiful season because they enjoy its colors, weather, and overall atmosphere."


Are you satisfied with the questions asked and do you have enough information to answer the claim?

If the answer to both of these questions is "Yes". then reply with only "True" or else answer, "False". Do not give further reasoning.

Binary answer: 

True
True


It is clear that a model of this size needs a reasoning step and, thus, a new prompt!

#### Small reasoning prompt

In this version of the prompt, I additionally ask the LLM to output 'conclusive/inconclusive' rather than 'true/false', after noticing that it frequently answered true/false according to the truth of the claim itself.

Additionally, I refactored the templates to allow retrieving the system message only.
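Since the prompt asks the model to end with its verdict, one hedged alternative to plain keyword search is to read only the last verdict word in the response. `parse_final_verdict` is a hypothetical helper, and it assumes the model really does put its verdict last. The word boundaries matter because 'conclusive' is a substring of 'inconclusive'.

```python
import re
from typing import Optional

def parse_final_verdict(response: str) -> Optional[bool]:
    """Return True for a final 'Conclusive', False for a final 'Inconclusive', None if absent."""
    # \b keeps 'conclusive' from matching inside 'inconclusive'
    matches = re.findall(r"\b(inconclusive|conclusive)\b", response.lower())
    if not matches:
        return None
    return matches[-1] == "conclusive"

print(parse_final_verdict("The pairs contradict the claim. Conclusive."))
print(parse_final_verdict("The evidence is partial. Inconclusive."))
print(parse_final_verdict("The provided pair does not fully prove the claim. There"))
```

A response that mentions both words while reasoning is then judged by whichever comes last, rather than being rejected outright.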

In [None]:
PromptAnalyzer._prompts["follow_up"] = ("You are given an unverified claim and question-answer pairs regarding the claim that needs to be explored.\n\nIn a single, concise, logical sentence, address if the question-evidence pairs fully confirm or disprove the claim.\n\nThen, conclude with \"Conclusive\" if you are satisfied with the questions asked and have enough information to answer the claim, otherwise respond with \"Inconclusive\". No other output is needed or warranted.", "Claim: {}\nQuestion-Answer Pairs:\n{}")

def get_system_message(prompt_type: str) -> str:
  """
  Given PROMPT_TYPE, which must be a key in SELF._PROMPTS, returns the respective system message.
  """
  return PromptAnalyzer._prompts[prompt_type][0]

def build_template(prompt_type: str, args: list[str]) -> str:
  """
  Given PROMPT_TYPE, which must be a key in SELF._PROMPTS, returns the completed template.
  """
  # let this throw an error
  prompt = PromptAnalyzer._prompts[prompt_type][1]
  return f"{prompt.format(*args)}"

def build_prompt(prompt_type: str, args: list[str]) -> str:
  """
  Given PROMPT_TYPE, which must be a key in SELF._PROMPTS, returns the respective prompt ready for LM inference.
  """
  prompt = PromptAnalyzer.get_system_message(prompt_type)
  template = PromptAnalyzer.build_template(prompt_type, args)
  return f"{prompt}\n\n{template}"

import re

def parse_conclusivity(response: str) -> bool | None:
  """
  Given RESPONSE, attempts to parse a binary value by searching the string for keywords 'Conclusive' or 'Inconclusive'. Returns None if both or neither are found.
  """
  # word boundaries keep 'conclusive' from matching inside 'inconclusive'
  has_conclusive = re.search(r'\bconclusive\b', response, re.IGNORECASE) is not None
  has_inconclusive = re.search(r'\binconclusive\b', response, re.IGNORECASE) is not None

  if has_conclusive and not has_inconclusive:
    return True
  elif has_inconclusive and not has_conclusive:
    return False
  return None

PromptAnalyzer.get_system_message = staticmethod(get_system_message)
PromptAnalyzer.build_template = staticmethod(build_template)
PromptAnalyzer.build_prompt = staticmethod(build_prompt)
PromptAnalyzer.parse_conclusivity = staticmethod(parse_conclusivity)


In [None]:
ctxt = PromptAnalyzer.build_prompt("follow_up", [easy_claim, easy_qe_pairs_impartial])
ans = TC.send_query(ctxt)
print(ctxt + ans)
assert not PromptAnalyzer.parse_conclusivity(ans), "Expected Inconclusive!"

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


You are given an unverified claim and question-answer pairs regarding the claim that needs to be explored:

Claim: Autumn is the most beautiful season.
Question-Answer Pairs:
Q: "What makes autumn the most beautiful season for some people?", A: "Some people say that autumn is the most beautiful season because they enjoy its colors, weather, and overall atmosphere."


In a single, concise, logical sentence, address if the question-evidence pairs fully confirm or disprove the claim.

Then, conclude with "Conclusive" if you are satisfied with the questions asked and have enough information to answer the claim, otherwise respond with "Inconclusive". No other output is needed or warranted.

The provided question-answer pair does not fully prove or disprove the claim. While it acknowledges that some individuals find autumn beautiful due to its colors, weather, and atmosphere, it doesn't provide any objective evidence or statistics supporting this claim as universally true for everyone. There

Now it's <span style="color:green;">**Correct**</span> for the trivial 'False' case!

In [None]:
ctxt = PromptAnalyzer.build_prompt("follow_up", [claim, qe_pairs_impartial])
ans = TC.send_query(ctxt)
print(ctxt + ans)
assert not PromptAnalyzer.parse_conclusivity(ans), "Expected Inconclusive!"

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


You are given an unverified claim and question-answer pairs regarding the claim that needs to be explored:

Claim: the water is not boiling
Question-Answer Pairs:
Q: "What temperature is the water?", A: "The water is currently over 100 degrees Celcius."
Q: "Does water boil when it is over 100 degrees Celcius at sea level?", A: "Yes."

In a single, concise, logical sentence, address if the question-evidence pairs fully confirm or disprove the claim.

Then, conclude with "Conclusive" if you are satisfied with the questions asked and have enough information to answer the claim, otherwise respond with "Inconclusive". No other output is needed or warranted.

Based on the provided question-answer pairs, we can analyze whether they support or refute the claim that "the water is not boiling." Here's how each pair relates to the claim:

1. **Q:** What temperature is the water?  
   **A:** The water is currently over 100 degrees Celsius.
   - This directly contradicts the claim because if the wa

But it <span style="color:red;">regularly reverts to its old behavior</span> for more complex prompts!

### microsoft/phi-2

phi-2 is similar in size to Qwen2.5-3B.

In [None]:
# garbage collect the old model
import gc
del TC
gc.collect()

7712

In [None]:
TC = TransformersLMClient('microsoft/phi-2')

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

We are starting with the revised prompt:

In [None]:
ctxt = PromptAnalyzer.build_prompt("follow_up", [claim, qe_pairs_impartial])
ans = TC.send_query(ctxt)
print(ctxt + ans)
assert not PromptAnalyzer.parse_conclusivity(ans), "Expected Inconclusive!"

You are given an unverified claim and question-answer pairs regarding the claim that needs to be explored.

In a single, concise, logical sentence, address if the question-evidence pairs fully confirm or disprove the claim.

Then, conclude with "Conclusive" if you are satisfied with the questions asked and have enough information to answer the claim, otherwise respond with "Inconclusive". No other output is needed or warranted.

Claim: the water is not boiling
Question-Answer Pairs:
Q: "What temperature is the water?", A: "The water is currently over 100 degrees Celcius."
Q: "Does water boil when it is over 100 degrees Celcius at sea level?", A: "Yes."The first pair of evidence supports the claim. The second pair of evidence contradicts the claim.
So, the conclusion is Inconclusive. 
But wait, the question is about whether the claim is confirmed or disproven by the pairs. If the first pair says the water is over 100, which would mean it's boiling (since boiling point is 100°C), but the

A common trend I notice is overthinking, which leads to long answers and a lot of wasted time.

Given that I can't get anywhere on this front, I will assume for now that a larger model is needed.

### Qwen/Qwen3-4B

Qwen3-4B is a slightly larger and newer model than the other Qwen model tested.

In [None]:
# garbage collect the old model
import gc
del TC
gc.collect()

NameError: name 'TC' is not defined

In [None]:
TC = TransformersLMClient('Qwen/Qwen3-4B')

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Some parameters are on the meta device because they were offloaded to the disk and cpu.


In [None]:
ctxt = PromptAnalyzer.build_prompt("follow_up", [claim, qe_pairs_impartial])
ans = TC.send_query(ctxt)
print(ctxt + ans)
assert not PromptAnalyzer.parse_conclusivity(ans), "Expected Inconclusive!"

You are given an unverified claim and question-answer pairs regarding the claim that needs to be explored:

Claim: the water is not boiling
Question-Answer Pairs:
Q: "What temperature is the water?", A: "The water is currently over 100 degrees Celcius."
Q: "Does water boil when it is over 100 degrees Celcius at sea level?", A: "Yes."

In a single, concise, logical sentence, address if the question-evidence pairs fully confirm or disprove the claim.

Then, conclude with "Conclusive" if you are satisfied with the questions asked and have enough information to answer the claim, otherwise respond with "Inconclusive". No other output is needed or warranted.

To elaborate on why your conclusion is what it is.
The provided question-answer pairs do not fully confirm or disprove the claim that "the water is not boiling." The first pair suggests the water is over 100°C, which is typically the temperature at which water boils at sea level. The second pair affirms that water boils at this temperat

Same issue as above.

#### llama-cpp

This is the `llama-cpp` client, loaded with the [Qwen3-4B.Q5_K_M.gguf](https://huggingface.co/Qwen/Qwen3-4B-GGUF) model (same as the one we just tested).

In general, it seems to perform a lot better **because it follows the prompt every time**.

However, the reasoning it uses is sometimes flawed.

Here are some examples:

In [None]:
LC = LlamaCppClient(PromptAnalyzer.get_system_message("follow_up"))

In [None]:
ctxt = PromptAnalyzer.build_template("follow_up", [claim, qe_pairs])
ans = LC.send_query(ctxt)
print(ctxt + '\n\n' + ans)
assert PromptAnalyzer.parse_conclusivity(ans), "Expected Conclusive!"

request took 28.366713762283325 seconds
Claim: the water is not boiling
Question-Answer Pairs:
Q: "What elevation are we at?", A: "We are at sea level."
Q: "What temperature is the water?", A: "The water is currently over 100 degrees Celcius."
Q: "Does water boil when it is over 100 degrees Celcius at sea level?", A: "Yes."

<think>

</think>

The question-answer pairs provide evidence that the water is over 100 degrees Celsius at sea level, where water typically boils, thus directly contradicting the claim that the water is not boiling. Conclusive.


In [None]:
ctxt = PromptAnalyzer.build_template("follow_up", [claim, qe_pairs_impartial])
ans = LC.send_query(ctxt)
print(ctxt + '\n\n' + ans)
assert not PromptAnalyzer.parse_conclusivity(ans), "Expected Inconclusive!"

request took 41.369388818740845 seconds
Claim: the water is not boiling
Question-Answer Pairs:
Q: "What temperature is the water?", A: "The water is currently over 100 degrees Celcius."
Q: "Does water boil when it is over 100 degrees Celcius at sea level?", A: "Yes."

<think>

</think>

The question-evidence pairs do not fully confirm or disprove the claim because the answer to the first question indicates the water is over 100 degrees Celsius, which is the temperature at which water boils at sea level, and the second question confirms that water boils at that temperature. However, the claim states the water is not boiling, which contradicts the evidence. Conclusive.


AssertionError: Expected Inconclusive!

In [None]:
ctxt = PromptAnalyzer.build_template("follow_up", [hard_claim, hard_qe_pairs])
ans = LC.send_query(ctxt)
print(ctxt + '\n\n' + ans)
assert PromptAnalyzer.parse_conclusivity(ans), "Expected Conclusive!"

request took 22.252416610717773 seconds
Claim: The titular Lord of the Rings was never bested by a dog.
Question-Answer Pairs:
Q: "Who is the Lord of the Rings?", A: "In the Tolkien universe, Sauron is generally considered the Lord of the Rings, having created the One Ring in his conquest over Middle Earth."
Q: "Could any dog, given the right circumstances, defeat Sauron in combat or contest?", A: "In the Tolkien universe, a wolfhound named Huan was prophesied to only be defeated by the greatest wolf that ever lived."
Q: "What Race is Sauron?", A: "Sauron is a fallen Maia, a class of immortal beings who can choose to incarnate themselves in a mortal body."
Q: "Did Sauron and Huan ever fight?", A: "During an attack on the stronghold of Tol-in-Gaurhoth, Huan managed to defeat Sauron after the latter took the shape of a wolf in attempts to fill the role of Huan's killer."

<think>

</think>

The question-evidence pairs provide information that Sauron was defeated by Huan, a dog, which dir

AssertionError: Expected Conclusive!
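
The response above stops mid-word ("which dir"), so `parse_conclusivity` finds no keyword and the assertion fails. This looks like the token-budget problem noted in the TL;DR rather than a reasoning error. A rough heuristic for telling the two apart is sketched below; `looks_truncated` is a hypothetical helper, not part of any client in this notebook, and it only checks for sentence-final punctuation.

```python
def looks_truncated(response: str) -> bool:
    """Heuristic: treat a response as cut off if it does not end a sentence."""
    stripped = response.rstrip()
    # An empty response is also treated as truncated.
    return not stripped.endswith((".", "!", "?", '"'))


cut_off = "The question-evidence pairs provide information that Sauron was defeated by Huan, a dog, which dir"
complete = "The evidence contradicts the claim. Conclusive."

print(looks_truncated(cut_off))   # True
print(looks_truncated(complete))  # False
```

A caller could use such a check to report "ran out of tokens" separately from "gave the wrong verdict" when tallying test results.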

In [None]:
ctxt = PromptAnalyzer.build_template("follow_up", [hard_claim, hard_qe_pairs_impartial])
ans = LC.send_query(ctxt)
print(ctxt + '\n\n' + ans)
assert PromptAnalyzer.parse_conclusivity(ans) is False, "Expected Inconclusive!"

request took 26.49172830581665 seconds
Claim: The titular Lord of the Rings was never bested by a dog.
Question-Answer Pairs:
Q: "Could any dog, given the right circumstances, defeat Sauron in combat or contest?", A: "In the Tolkien universe, a wolfhound named Huan was prophesied to only be defeated by the greatest wolf that ever lived."


<think>

</think>

The question-evidence pairs provide evidence that contradicts the claim, as they mention a dog (Huan) that was prophesied to defeat Sauron, implying that a dog could overcome him. Inconclusive.


This cell must be at the end of the test section:

In [None]:
# EXPORT-IGNORE-END

# Hyperparameters?

I am not sure which hyperparameters to choose for the transformer client.

For now, I have copied the hyperparameters used by llama-cpp, because that environment performed very well. Better-tuned values could likely still make the transformer client viable.
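
One way to approach the open question above is a small grid sweep over the sampling fields that `TransformersLMClient` declares (`temperature`, `top_p`, `top_k`, `repetition_penalty`). The sketch below only enumerates candidate configurations. The values are hypothetical placeholders rather than tuned recommendations, and how each configuration is passed to the client is left out.

```python
from itertools import product

# Hypothetical candidate values; placeholders, not recommendations.
search_space = {
    "temperature": [0.2, 0.7],
    "top_p": [0.9, 0.95],
    "top_k": [20, 40],
    "repetition_penalty": [1.0, 1.1],
}


def iter_configs(space):
    """Yield one dict per combination of the candidate values in SPACE."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(search_space))
print(len(configs))  # 16 combinations for the space above
```

Each configuration could then be scored by how many of the assertion cells in the test section pass, which gives a crude but repeatable comparison.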

# PromptAnalyzer (exported version)

In [None]:
class PromptAnalyzer():

  # maps a prompt key to a tuple containing the prompt's system message and content template.
  _prompts: Dict[str, tuple[str, str]] = {
    "initial_question": ("You are given an unverified claim that needs to be explored.\n\nYou follow these instructions:\n1: You make sure that the question is specific and focuses on one aspect of the claim (who, what, when).\n2: You are not allowed to use the word \"claim\". Instead, point out the exact issue in the claim that you are phrasing your question around.\n3. Your question is concise, and you do not output any other text.", "Claim: {}\nQuestion: "),
    "follow_up": ("You are given an unverified claim and question-answer pairs regarding the claim that needs to be explored.\n\nIn a single, concise, logical sentence, address if the question-evidence pairs fully confirm or disprove the claim.\n\nThen, conclude with \"Conclusive\" if you are satisfied with the questions asked and have enough information to answer the claim, otherwise respond with \"Inconclusive\". No other output is needed or warranted.", "Claim: {}\nQuestion-Answer Pairs:\n{}"),
    "secondary_question" : ("You are given an unverified claim and question-answer pairs regarding the claim that needs to be explored.\n\nYou follow these instructions:\n1: You make sure that the question is specific and focuses on one aspect of the claim (who, what, when).\n2: You are not allowed to use the word \"claim\". Instead, point out the exact issue in the claim that you are phrasing your question around.\n3. Your question must be a NEW question, unrelated to previously asked questions.\n4. Your question is concise, and you do not output any other text.", "Claim: {}\nQuestion-Answer Pairs:\n{}\n\nQuestion: "),
    "final_verdict" : ("You are given a claim and question-answer pairs regarding the claim that needs to be evaluated.\n\nIn a single, concise, logical sentence, address if the question-evidence pairs fully confirm or disprove the claim.\n\nThen, conclude with \"True\" or \"False\" depending on the truthfulness of the claim, otherwise respond with \"Inconclusive\" if the evidence is inconclusive. No other output is needed or warranted.", "Claim: {}\nQuestion-Answer Pairs:\n{}\n\nQuestion: "),
  }

  @staticmethod
  def get_system_message(prompt_type: str) -> str:
    return PromptAnalyzer._prompts[prompt_type][0]

  @staticmethod
  def build_template(prompt_type: str, args: list[str]) -> str:
    """
    Given PROMPT_TYPE, which must be a key in SELF._PROMPTS, returns the completed template.
    """
    # An unknown PROMPT_TYPE raises KeyError on purpose; do not catch it here.
    prompt = PromptAnalyzer._prompts[prompt_type][1]
    return prompt.format(*args)

  @staticmethod
  def build_prompt(prompt_type: str, args: list[str]) -> str:
    """
    Given PROMPT_TYPE, which must be a key in SELF._PROMPTS, returns the respective prompt ready for LM inference.
    """
    prompt = PromptAnalyzer.get_system_message(prompt_type)
    template = PromptAnalyzer.build_template(prompt_type, args)
    return f"{prompt}\n\n{template}"

  @staticmethod
  def parse_conclusivity(response: str) -> bool | None:
    """
    Given RESPONSE, attempts to parse a binary value by searching the string for keywords 'Conclusive' or 'Inconclusive'. Returns None if both or none are found.
    """
    lower = response.lower()

    # Every 'inconclusive' also contains 'conclusive', so subtract those matches
    # to count only standalone occurrences of 'conclusive'.
    n_inconclusive = lower.count('inconclusive')
    n_conclusive = lower.count('conclusive') - n_inconclusive

    if n_conclusive > 0 and n_inconclusive == 0:
      return True
    elif n_inconclusive > 0 and n_conclusive == 0:
      return False
    return None

  @staticmethod
  def parse_boolean_answer(response: str) -> bool | None:
    """
    Given RESPONSE, attempts to parse a binary value by searching the string for keywords 'True' or 'False'. Returns None if both or none are found.
    """
    lower = response.lower()
    has_true = 'true' in lower
    has_false = 'false' in lower
    if has_true and not has_false:
      return True
    elif has_false and not has_true:
      return False
    # Both or neither keyword present: the answer is ambiguous.
    return None

In [None]:
# EXPORT-IGNORE-START

## Tests for `initial_question` prompts

In [None]:
sm = PromptAnalyzer.get_system_message("initial_question")
ILC = LlamaCppClient(sm)
ctxt = PromptAnalyzer.build_template("initial_question", [claim])
ans = ILC.send_query(ctxt)
print(sm + '\n\n' + ctxt + '\n\n' + ans)

request took 16.30410075187683 seconds
You are given an unverified claim that needs to be explored.

You follow these instructions:
1: You make sure that the question is specific and focuses on one aspect of the claim.
2: You are not allowed to use the word "claim". Instead, point out the exact issue in the claim that you are phrasing your question around.
3. Your question is concise, and you do not output any other text.
Claim: the water is not boiling
Question: 

<think>

</think>

Is the water at a temperature above 100°C?


`Is the water at a temperature above 100°C?` is the perfect question!

In [None]:
ILC = LlamaCppClient(sm)
ctxt = PromptAnalyzer.build_template("initial_question", [hard_claim])
ans = ILC.send_query(ctxt)
print(sm + '\n\n' + ctxt + '\n\n' + ans)

request took 11.631009578704834 seconds
You are given an unverified claim that needs to be explored.

You follow these instructions:
1: You make sure that the question is specific and focuses on one aspect of the claim.
2: You are not allowed to use the word "claim". Instead, point out the exact issue in the claim that you are phrasing your question around.
3. Your question is concise, and you do not output any other text.
Claim: The titular Lord of the Rings was never bested by a dog.
Question: 

<think>

</think>

Was the Lord of the Rings ever defeated by a dog in the literature?


Great!

## Tests for `secondary_question` prompts

In [None]:
sm = PromptAnalyzer.get_system_message("secondary_question")
FLC = LlamaCppClient(sm)
ctxt = PromptAnalyzer.build_template("secondary_question", [claim, qe_pairs_impartial])
ans = FLC.send_query(ctxt)
print(sm + '\n\n' + ctxt + '\n\n' + ans)

request took 29.01747751235962 seconds
You are given an unverified claim and question-answer pairs regarding the claim that needs to be explored.

You follow these instructions:
1: You make sure that the question is specific and focuses on one aspect of the claim.
2: You are not allowed to use the word "claim". Instead, point out the exact issue in the claim that you are phrasing your question around.
3. The followup question should not be seeking to answer a previously asked question. It can however attempt to improve that question.
4. Your question is concise, and you do not output any other text.
Claim: the water is not boiling
Question-Answer Pairs:
Q: "What temperature is the water?", A: "The water is currently over 100 degrees Celcius."
Q: "Does water boil when it is over 100 degrees Celcius at sea level?", A: "Yes."

Question: 

<think>

</think>

What is the issue with the claim that "the water is not boiling" given that the water is over 100 degrees Celsius at sea level?
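
The generated question above uses the word "claim", which instruction 2 of the system message forbids. A check like this could flag such outputs automatically. The sketch below is a hypothetical validator, not part of `PromptAnalyzer`, and it covers only the rules that can be tested mechanically.

```python
import re


def question_rule_violations(question: str) -> list[str]:
    """Return a list of human-readable rule violations for QUESTION."""
    problems = []
    text = question.strip()
    # Instruction 2: the word "claim" must not appear.
    if re.search(r"\bclaim\b", text, flags=re.IGNORECASE):
        problems.append('uses the word "claim"')
    # A concise question should be a single line ending in "?".
    if not text.endswith("?"):
        problems.append("does not end with a question mark")
    if "\n" in text:
        problems.append("spans multiple lines")
    return problems


bad = 'What is the issue with the claim that "the water is not boiling" given that the water is over 100 degrees Celsius at sea level?'
good = "Is the water at a temperature above 100°C?"

print(question_rule_violations(bad))   # ['uses the word "claim"']
print(question_rule_violations(good))  # []
```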


## Tests for `final_verdict` prompts

In [None]:
sm = PromptAnalyzer.get_system_message("final_verdict")
ctxt = PromptAnalyzer.build_template("final_verdict", [claim, qe_pairs_impartial])
VLC = LlamaCppClient(sm)
ans = VLC.send_query(ctxt)
print(ctxt + '\n\n' + ans)
assert PromptAnalyzer.parse_boolean_answer(ans) is None, "Expected Inconclusive!"

request took 31.360589027404785 seconds
Claim: the water is not boiling
Question-Answer Pairs:
Q: "What temperature is the water?", A: "The water is currently over 100 degrees Celcius."
Q: "Does water boil when it is over 100 degrees Celcius at sea level?", A: "Yes."

Question: 

<think>

</think>

The answer to the first question indicates the water is over 100 degrees Celsius, which, at sea level, is the temperature at which water boils, directly contradicting the claim that the water is not boiling. False.


AssertionError: Expected Inconclusive!

In [None]:
ctxt = PromptAnalyzer.build_template("final_verdict", [claim, qe_pairs])
ans = VLC.send_query(ctxt)
print(ctxt + '\n\n' + ans)
assert PromptAnalyzer.parse_boolean_answer(ans) is False, "Expected False!"

request took 29.993372917175293 seconds
Claim: the water is not boiling
Question-Answer Pairs:
Q: "What elevation are we at?", A: "We are at sea level."
Q: "What temperature is the water?", A: "The water is currently over 100 degrees Celcius."
Q: "Does water boil when it is over 100 degrees Celcius at sea level?", A: "Yes."

Question: 

<think>

</think>

The evidence shows that the water is over 100 degrees Celsius at sea level, where water boils at 100 degrees Celsius, thus confirming that the water is boiling, which contradicts the claim. False.


In [None]:
ctxt = PromptAnalyzer.build_template("final_verdict", [easy_claim, easy_qe_pairs_impartial])
ans = VLC.send_query(ctxt)
print(ctxt + '\n\n' + ans)
verdict = PromptAnalyzer.parse_boolean_answer(ans)
assert verdict is None or not verdict, "Expected False/Inconclusive!"

request took 23.26823353767395 seconds
Claim: Autumn is the most beautiful season.
Question-Answer Pairs:
Q: "What makes autumn the most beautiful season?", A: "Some people say autumn is the most beautiful season because they enjoy its colors, weather, and overall atmosphere."


Question: 

<think>

</think>

The question-answer pair provides subjective reasoning for why some people consider autumn the most beautiful season, but does not offer definitive evidence to fully confirm or disprove the claim. Inconclusive.


In [None]:
# EXPORT-IGNORE-END