## SessionManager

This file provides a class for interfacing with a model hosted on Hugging Face.

It also contains experimental output evaluating different models against different preset claims.

This block is required:

In [None]:
# EXPORT-IGNORE-START
try:
    from google.colab import drive

    in_colab = True
except ImportError:
    in_colab = False

if in_colab:
    # drive was already imported in the try block above
    drive.mount('/content/drive')
# EXPORT-IGNORE-END

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
import torch
from typing import Dict

`SessionManager` provides methods for quickly building templated, task-specific prompts, along with a simple interface for interacting with a language model (LM).

`step` keeps no state between calls, so to carry context across turns the caller must pass a string representing the full dialogue history.
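
Carrying context by hand amounts to concatenating every prompt and reply into one growing string. A minimal sketch of that pattern follows; `run_dialogue` and `fake_step` are hypothetical names, and `fake_step` stands in for `SessionManager.step` so the example runs without loading a model.

```python
from typing import Callable, List, Tuple

def run_dialogue(step: Callable[[str], str], turns: List[str]) -> Tuple[str, List[str]]:
    """Feed each turn to STEP together with everything said so far."""
    history = ""
    replies = []
    for turn in turns:
        history += turn + "\n\n"
        reply = step(history)        # the model sees the full history each time
        replies.append(reply)
        history += reply + "\n\n"    # the reply becomes part of the next context
    return history, replies

# Stand-in for SessionManager.step (assumption: same str -> str signature).
def fake_step(context: str) -> str:
    return f"(reply after {len(context)} characters of context)"

history, replies = run_dialogue(fake_step, ["First prompt.", "Second prompt."])
print(history)
```

The context grows with every turn, so long dialogues will eventually run into the model's context window.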

The prompts themselves were taken from *RAGAR, Your Falsehood Radar* by Khaliq et al. (p. 14).

In [None]:
class SessionManager():

  model_name: str
  max_to_generate: int
  top_k: int
  top_p: float
  temperature: float
  repetition_penalty: float

  _tokenizer: AutoTokenizer
  _model: AutoModelForCausalLM
  _config: AutoConfig

  # general prompts in the RAGAR approaches
  # taken from 'RAGAR, Your Falsehood Radar', Khaliq et al.
  _prompts: Dict[str, str] = {
    "initial_question": "You are an expert fact-checker given an unverified claim that needs to be explored.\n\nClaim: {}\nDate (your questions must be framed to be before this date: {}\n\nYou follow these Instructions:\n1: You understand the entire claim.\n\n2. You will make sure that the question is specific and focuses on one aspect of the claim (focus on one topic, should detail where, who, and what) and is very, very short.\n\n3. You should not appeal to video evidence nor ask for calculations or methodology.\n\n4. You are not allowed to use the word \"claim\". Instead, if you want to refer to the claim, you should point out the exact issue in the claim that you are phrasing your question around.\n\n5. You must never ask for calculations or methodology.\n\n6. Create a pointed factcheck question for the claim.\n\nReturn only the question.",
    "follow_up": "You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored. You follow these steps:\n\nClaim: {}\nQuestion-Answer Pairs:\n{}\n\nAre you satisfied with the questions asked and do you have enough information to answer the claim?\n\nIf the answer to any of these questions is \"Yes\". then reply with only \"False\" or else answer, \"True\".",
    "secondary_question" : "You are given an unverified statement and question-answer pairs regarding the claim that needs to be explored. You follow these steps:\n\nClaim: {}\nQuestion-Answer Pairs:\n{}\n\nYour task is to ask a followup question regarding the claim specifically based on the question answer pairs.\n\nNever ask for sources or publishing.\n\nThe follow-up question must be descriptive, specific to the claim, and very short, brief, and concise.\n\nThe follow-up question should not appeal to video evidence nor ask for calculations or methodology.\n\nThe followup question should not be seeking to answer a previously asked question. It can however attempt to improve that question.\n\nYou are not allowed to use the word \"claim\" or \"statement\". Instead if you want to refer to the claim/statement, you should point out the exact issue in the claim/statement that you are phrasing your question around.\n\nReply only with the followup question and nothing else.",
    # adjusted prompts
    # uses in-context learning rather than detailed instructions, removes references to visual evidence
    "adj_initial_question" : "You are an expert fact-checker given an unverified claim that needs to be explored. Your task is to create a factcheck question for the claim. To do so, you must 1) make sure the question is specific and focuses on one aspect of the claim (detail who, what, where, and when). Instead of using the word \"claim\", point out the exact issue in the claim you are phrasing your question around. Return only the question, and make sure it is very, very short.\n\nExample Claim: PPP on average provided a grant of around $11,000 per employee\nQ: How does the PPP define an \"employee\" for the purposes of calculating grants?\n\nClaim: {}\nQ: ",
    "adj_follow_up": "You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored:\n\nClaim: {}\nQuestion-Answer Pairs:\n{}\n\nAre you satisfied with the questions asked and do you have enough information to answer the claim?\n\nIf the answer to both of these questions is \"Yes\". then reply with only \"True\" or else answer, \"False\". Do not give further reasoning.\n\nBinary answer: ",
    # this prompt attempts to ask the LLM for reasoning and then a final binary answer
    "adj+_follow_up": "You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored:\n\nClaim: {}\nQuestion-Answer Pairs:\n{}\n\nAre you satisfied with the questions asked and do you have enough information to prove or disprove the claim?\n\nVery concisely explain how the question answer pairs succeed or fail to explain the claim, then conclude with \"True\" if you are satisfied with the questions asked, otherwise respond with \"False\"",
    # this prompt tries to force the LLM to reason about if there is enough evidence, rather than output True/False for the claim itself.
    "adj++_follow_up": "You are given an unverified claim and question-answer pairs regarding the claim that needs to be explored:\n\nClaim: {}\nQuestion-Answer Pairs:\n{}\n\nIn a single, concise, logical sentence, address if the question-evidence pairs fully confirm or disprove the claim.\n\nThen, conclude with \"Conclusive\" if you are satisfied with the questions asked and have enough information to answer the claim, otherwise respond with \"Inconclusive\". No other output is needed or warranted.",
  }

  def __init__(self,
               model_name: str,
               max_to_generate: int = 100,
               top_k: int = 50,
               top_p: float = 1.0,
               temperature: float = 0.7,
               repetition_penalty: float = 1.1) -> None:
    self.model_name = model_name
    self.max_to_generate = max_to_generate
    self.top_k = top_k
    self.top_p = top_p
    self.temperature = temperature
    self.repetition_penalty = repetition_penalty
    self._tokenizer = AutoTokenizer.from_pretrained(model_name)
    self._model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
    self._config = AutoConfig.from_pretrained(model_name)

  def build_prompt(self, prompt_type: str, args: list[str]) -> str:
    """
    Given PROMPT_TYPE, which must be a key in SELF._PROMPTS, returns the respective prompt ready for LM inference.
    """
    # let this throw an error
    prompt = self._prompts[prompt_type]
    return f"{prompt.format(*args)}\n\n"

  def parse_boolean_answer(self, response: str) -> bool | None:
    """
    Given RESPONSE, attempts to parse a binary value by searching the string for the keywords 'true' or 'false' (case-insensitive). Returns None if both or neither are found.
    """
    lower = response.lower()
    has_true = 'true' in lower
    has_false = 'false' in lower
    if has_true == has_false:
      # both keywords or neither: the answer is ambiguous
      return None
    return has_true

  def step(self, context: str) -> str:
    """
    Given CONTEXT, prompts the loaded model and returns the response.
    """
    # device_map="auto" decides placement at load time, so follow the model's device
    inputs = self._tokenizer(context, return_tensors="pt").to(self._model.device)

    with torch.no_grad():
      outputs = self._model.generate(
        **inputs,
        max_new_tokens=self.max_to_generate,
        do_sample=True,  # top_k, top_p, and temperature only apply when sampling
        top_k=self.top_k,
        top_p=self.top_p,
        temperature=self.temperature,
        repetition_penalty=self.repetition_penalty
      )

    # decode only the newly generated tokens, so the prompt is not echoed back
    prompt_length = inputs["input_ids"].shape[1]
    output_text = self._tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

    return output_text.strip()
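
`build_prompt` relies on `str.format` with positional arguments, so the number of entries in `args` matters. The sketch below shows that behaviour on a shortened stand-in template (not one of the real prompts): surplus arguments are silently ignored, while too few raise `IndexError`.

```python
# Shortened stand-in template with two positional placeholders.
template = "Claim: {}\nQuestion-Answer Pairs:\n{}"

# Exactly enough arguments: both placeholders are filled.
filled = template.format("The water is not boiling.", '("Q", "A")')
print(filled)

# Surplus arguments are ignored by str.format.
same = template.format("The water is not boiling.", '("Q", "A")', "unused")
assert same == filled

# Too few arguments raise IndexError.
try:
    template.format("The water is not boiling.")
    raised = False
except IndexError:
    raised = True
print("IndexError on missing argument:", raised)
```

This is why passing a date to `"adj_initial_question"`, which has a single placeholder, does not fail.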


## SessionManager Tests

The cells below demonstrate how `SessionManager` is used.

We don't want any of the test section exported:

In [None]:
# EXPORT-IGNORE-START

In [None]:
SM = SessionManager('Qwen/Qwen2.5-3B')


In [None]:
ctxt1 = SM.build_prompt("initial_question", ["The titular Lord of the Rings was never bested by a dog.", "2000"])
q1 = SM.step(ctxt1)
print(ctxt1 + q1)

You are an expert fact-checker given an unverified claim that needs to be explored.

Claim: The titular Lord of the Rings was never bested by a dog.
Date (your questions must be framed to be before this date: 2000

You follow these Instructions:
1: You understand the entire claim.

2. You will make sure that the question is specific and focuses on one aspect of the claim (focus on one topic, should detail where, who, and what) and is very, very short.

3. You should not appeal to video evidence nor ask for calculations or methodology.

4. You are not allowed to use the word "claim". Instead, if you want to refer to the claim, you should point out the exact issue in the claim that you are phrasing your question around.

5. You must never ask for calculations or methodology.

6. Create a pointed factcheck question for the claim.

Return only the question.

When did the Lord of the Rings character face a dog? To be clear, is it correct that the titular Lord of the Rings character was neve

In [None]:
ctxt2 = SM.build_prompt("follow_up", ["The titular Lord of the Rings was never bested by a dog.", "(\"Could any dog, given the right circumstances, defeat Sauron in combat or contest?\", \"In the Tolkien universe, a wolfhound named Huan was prophesied to only be defeated by the greatest wolf that ever lived.\")"])
q2 = SM.step(ctxt2)
print(ctxt2 + q2)

You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored. You follow these steps:

Claim: The titular Lord of the Rings was never bested by a dog.
Question-Answer Pairs:
("Could any dog, given the right circumstances, defeat Sauron in combat or contest?", "In the Tolkien universe, a wolfhound named Huan was prophesied to only be defeated by the greatest wolf that ever lived.")

Are you satisfied with the questions asked and do you have enough information to answer the claim?

If the answer to any of these questions is "Yes". then reply with only "False" or else answer, "True".

False
To evaluate whether the claim "The titular Lord of the Rings was never bested by a dog" can be verified based on the provided information, let's analyze the question-answer pair step by step.

1. **Claim Analysis**: The claim asserts that the main character (Lord of the Rings) has never been defeated by a dog. However, this is somewhat

The model is generally poor at following the instructions. Note that `Qwen/Qwen2.5-3B` is a base checkpoint rather than an instruction-tuned one, which likely contributes. Further hyperparameter adjustments or fine-tuning might get it to output binary responses or a single, noise-free question, as the prompts ask. However, fine-tuning would require many claim/question pairs (for the first task), which we do not use to evaluate the full pipeline.
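
Short of fine-tuning, some of the noise can be trimmed after generation. A minimal sketch, assuming the first sentence ending in a question mark is the one we want; `first_question` is a hypothetical helper and not part of `SessionManager`.

```python
import re
from typing import Optional

def first_question(response: str) -> Optional[str]:
    """Return the first sentence ending in '?', or None if there is none."""
    # A run of non-terminator characters followed by a question mark.
    match = re.search(r"[^.?!\n]*\?", response)
    if match is None:
        return None
    return match.group(0).strip()

noisy = "When did the Lord of the Rings character face a dog? To be clear, is it correct that"
print(first_question(noisy))
```

The pattern treats every period as a sentence boundary, so a question containing an abbreviation such as "U.S." would be clipped; it is a heuristic, not a sentence splitter.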

Next, I try the adjusted prompts, which use a small amount of in-context learning to demonstrate the expected output:

In [None]:
ctxt1 = SM.build_prompt("adj_initial_question", ["The titular Lord of the Rings was never bested by a dog.", "2000"])
q1 = SM.step(ctxt1)
print(ctxt1 + q1)

You are an expert fact-checker given an unverified claim that needs to be explored. Your task is to create a factcheck question for the claim. To do so, you must 1) make sure the question is specific and focuses on one aspect of the claim (detail who, what, where, and when). Instead of using the word "claim", point out the exact issue in the claim you are phrasing your question around. Return only the question, and make sure it is very, very short.

Example Claim: PPP on average provided a grant of around $11,000 per employee
Q: How does the PPP define an "employee" for the purposes of calculating grants?

Claim: The titular Lord of the Rings was never bested by a dog.
Q: 

What type of creature was the Lord of the Rings defeated by according to the claim?


In some generations, the instruction not to use the word 'claim' is still ignored:

`What type of creature was the Lord of the Rings defeated by according to the claim?`

Not good! Is the prompt still too long?

In [None]:
ctxt2 = SM.build_prompt("adj_follow_up", ["The titular Lord of the Rings was never bested by a dog.", "(\"Could any dog, given the right circumstances, defeat Sauron in combat or contest?\", \"In the Tolkien universe, a wolfhound named Huan was prophesied to only be defeated by the greatest wolf that ever lived.\")"])
q2 = SM.step(ctxt2)
print(ctxt2 + q2)

You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored:

Claim: The titular Lord of the Rings was never bested by a dog.
Question-Answer Pairs:
("Could any dog, given the right circumstances, defeat Sauron in combat or contest?", "In the Tolkien universe, a wolfhound named Huan was prophesied to only be defeated by the greatest wolf that ever lived.")

Are you satisfied with the questions asked and do you have enough information to answer the claim?

If the answer to both of these questions is "Yes". then reply with only "False" or else answer, "True". Do not give further reasoning.

Binary answer: 

True To determine if the claim "The titular Lord of the Rings was never bested by a dog" can be verified based on the provided question-answer pairs, we need to analyze whether the information given supports or refutes the claim.

1. **Claim Analysis**:
   - The claim states that the titular character from the Lord o

### Follow Up

The 'binary answer' still regularly includes extra content. We can likely handle this by searching the text for 'true'/'false' and parsing the result. On a failed parse, reprompting should be sufficient.
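
The parse-then-reprompt loop could look roughly like this. It is a sketch under assumptions: `ask_boolean` is a hypothetical helper, the free function mirrors `SessionManager.parse_boolean_answer`, and a scripted lambda stands in for `SessionManager.step`.

```python
from typing import Callable, Optional

def parse_boolean_answer(response: str) -> Optional[bool]:
    """Mirror of SessionManager.parse_boolean_answer, as a free function."""
    lower = response.lower()
    has_true, has_false = "true" in lower, "false" in lower
    if has_true == has_false:   # both or neither keyword present
        return None
    return has_true

def ask_boolean(step: Callable[[str], str], context: str, max_attempts: int = 3) -> Optional[bool]:
    """Call STEP until the reply parses, giving up after MAX_ATTEMPTS tries."""
    for _ in range(max_attempts):
        answer = parse_boolean_answer(step(context))
        if answer is not None:
            return answer
    return None

# Scripted stand-in for SessionManager.step: ambiguous first, then parseable.
replies = iter(["True or False, it is hard to say.", "False\nThe pairs do not address"])
result = ask_boolean(lambda context: next(replies), "prompt text")
print(result)
```

Reprompting with an identical context only helps when generation is sampled; under greedy decoding every attempt would return the same text.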

Note that the 'justification' it provides is still sound reasoning. An alternative is to ask explicitly for the reasoning first and a final verdict second, which is the natural order:

In [None]:
ctxt2 = SM.build_prompt("adj+_follow_up", ["The titular Lord of the Rings was never bested by a dog.", "(\"Could any dog, given the right circumstances, defeat Sauron in combat or contest?\", \"In the Tolkien universe, a wolfhound named Huan was prophesied to only be defeated by the greatest wolf that ever lived.\")"])
q2 = SM.step(ctxt2)
print(ctxt2 + q2)
SM.parse_boolean_answer(q2) # ordinarily, if this returns None, we need to reprompt

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored:

Claim: The titular Lord of the Rings was never bested by a dog.
Question-Answer Pairs:
("Could any dog, given the right circumstances, defeat Sauron in combat or contest?", "In the Tolkien universe, a wolfhound named Huan was prophesied to only be defeated by the greatest wolf that ever lived.")

Are you satisfied with the questions asked and do you have enough information to answer the claim?

Very concisely explain how the question answer pairs succeed or fail to explain the claim, then conclude with "True" if you are satisfied with the questions asked, otherwise respond with "False"

False

The question-answer pairs do not directly address the claim that the titular Lord of the Rings was never bested by a dog. Instead, they provide information about a specific character, Huan, who is a wolfhound in the Tolkien universe and is prophesied to only be defeat

False

Does it also output True in cases where it should?

In [None]:
ctxt2 = SM.build_prompt("adj+_follow_up",
                        ["The titular Lord of the Rings was never bested by a dog.",
                         "(\"Who is the Lord of the Rings?\", \"In the Tolkien universe, Sauron is generally considered the Lord of the Rings, having created the One Ring in his conquest over Middle Earth.\")\n" +
                         "(\"Could any dog, given the right circumstances, defeat Sauron in combat or contest?\", \"In the Tolkien universe, a wolfhound named Huan was prophesied to only be defeated by the greatest wolf that ever lived.\")\n" +
                         "(\"What Race is Sauron?\", \"Sauron is a fallen Maia, a class of immortal beings who can choose to incarnate themselves in a mortal body.\")\n" +
                         "(\"Did Sauron and Huan ever fight?\", \"During an attack on the stronghold of Tol-in-Gaurhoth, Haun managed to defeat Sauron after the latter took the shape of a wolf in attempts to fill the role of Huan's killer.\")"])
q2 = SM.step(ctxt2)
print(ctxt2 + q2)
SM.parse_boolean_answer(q2) # ordinarily, if this returns None, we need to reprompt


Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored:

Claim: The titular Lord of the Rings was never bested by a dog.
Question-Answer Pairs:
("Who is the Lord of the Rings?", "In the Tolkien universe, Sauron is generally considered the Lord of the Rings, having created the One Ring in his conquest over Middle Earth.")
("Could any dog, given the right circumstances, defeat Sauron in combat or contest?", "In the Tolkien universe, a wolfhound named Huan was prophesied to only be defeated by the greatest wolf that ever lived.")
("What Race is Sauron?", "Sauron is a fallen Maia, a class of immortal beings who can choose to incarnate themselves in a mortal body.")
("Did Sauron and Huan ever fight?", "During an attack on the stronghold of Tol-in-Gaurhoth, Haun managed to defeat Sauron after the latter took the shape of a wolf in attempts to fill the role of Huan's killer.")

Are you satisfied with the questions asked

That's a bad answer. This prompt might be confusing, so I'll try a simple modus ponens prompt:

In [None]:
ctxt2 = SM.build_prompt("adj+_follow_up",
                        ["The water is not boiling.",
                         "(\"What elevation are we at?\", \"We are at sea level.\")\n" +
                         "(\"What temperature is the water?\", \"The water is currently over 100 degrees Celcius.\")\n" +
                         "(\"Does water boil when it is over 100 degrees Celcius at sea level?\", \"Yes.\")"])
q2 = SM.step(ctxt2)
print(ctxt2 + q2)
SM.parse_boolean_answer(q2) # ordinarily, if this returns None, we need to reprompt

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored:

Claim: The water is not boiling.
Question-Answer Pairs:
("What elevation are we at?", "We are at sea level.")
("What temperature is the water?", "The water is currently over 100 degrees Celcius.")
("Does water boil when it is over 100 degrees Celcius at sea level?", "Yes.")

Are you satisfied with the questions asked and do you have enough information to prove or disprove the claim?

Very concisely explain how the question answer pairs succeed or fail to explain the claim, then conclude with "True" if you are satisfied with the questions asked, otherwise respond with "False"

False
The question-answer pairs do not provide enough information to prove or disprove the claim. The question "What elevation are we at?" is irrelevant to the claim about the water boiling. The question "What temperature is the water?" is not enough to determine if the water is boilin

False

The explanation given for this simple prompt and set of evidence is self-contradictory, so it gets the wrong answer again.

In fact, I cannot get it to output the correct answer at all!

In [None]:
# garbage collect the old model
import gc
del SM
gc.collect()

14

I will try a few different models:

#### Qwen 4B

I expect this model to do well, because it works well on my local system, under llama-cpp. I have retroactively adjusted the hyperparameters for this section to match llama-cpp's defaults.

In [None]:
SM = SessionManager('Qwen/Qwen3-4B')

In [None]:
ctxt2 = SM.build_prompt("adj+_follow_up",
                        ["The water is not boiling.",
                         "(\"What elevation are we at?\", \"We are at sea level.\")\n" +
                         "(\"What temperature is the water?\", \"The water is currently over 100 degrees Celcius.\")\n" +
                         "(\"Does water boil when it is over 100 degrees Celcius at sea level?\", \"Yes.\")"])
q2 = SM.step(ctxt2)
print(ctxt2 + q2)
SM.parse_boolean_answer(q2) # ordinarily, if this returns None, we need to reprompt

You are given an unverified claim and question-answer pairs regarding the claim that needs to be explored:

Claim: The water is not boiling.
Question-Answer Pairs:
("What elevation are we at?", "We are at sea level.")
("What temperature is the water?", "The water is currently over 100 degrees Celcius.")
("Does water boil when it is over 100 degrees Celcius at sea level?", "Yes.")

In one sentence, tersely address if the question-evidence pairs fully confirm or disprove the claim.

Then, conclude with "True" if you are satisfied with the questions asked and have enough information to answer the claim, otherwise respond with "False"

True/False:

The claim says the water is not boiling. From the evidence, the water is over 100°C at sea level, which is the temperature at which water boils under standard atmospheric pressure. Therefore, the claim is disproven by the evidence provided. True
Okay, let's break this down. The claim is that the water is not boiling. Now, looking at the question

I will try a new prompt to stop the model from answering 'True'/'False' about the truth of the claim itself. This prompt instead asks for a classification of 'Conclusive' versus 'Inconclusive'. It also never explicitly asks the model to reason about whether the claim is true or false.

In [None]:
ctxt2 = SM.build_prompt("adj++_follow_up",
                        ["The water is not boiling.",
                         "(\"What elevation are we at?\", \"We are at sea level.\")\n" +
                         "(\"What temperature is the water?\", \"The water is currently over 100 degrees Celcius.\")\n" +
                         "(\"Does water boil when it is over 100 degrees Celcius at sea level?\", \"Yes.\")"])
q2 = SM.step(ctxt2)
print(ctxt2 + q2)

You are given an unverified claim and question-answer pairs regarding the claim that needs to be explored:

Claim: The water is not boiling.
Question-Answer Pairs:
("What elevation are we at?", "We are at sea level.")
("What temperature is the water?", "The water is currently over 100 degrees Celcius.")
("Does water boil when it is over 100 degrees Celcius at sea level?", "Yes.")

In a single, concise, logical sentence, address if the question-evidence pairs fully confirm or disprove the claim.

Then, conclude with "Conclusive" if you are satisfied with the questions asked and have enough information to answer the claim, otherwise respond with "Inconclusive". No other output is needed or warranted.

Conclusive

Because the evidence indicates that the water is over 100 degrees Celsius at sea level, which typically means it should be boiling. However, since the claim states that the water is not boiling, this contradicts the expected behavior of water under these conditions, thus confirm

This is the best output yet, though it would be nice if it gave the reasoning first and the verdict second, and if the reasoning were shorter, so that generation took less time and the model did not risk overthinking the problem.
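
Parsing this verdict needs a little more care than the True/False case, because "inconclusive" contains "conclusive" as a substring. A sketch of a parser, assuming whole-word matching is enough; `parse_conclusive` is a hypothetical helper, not part of `SessionManager`.

```python
import re
from typing import Optional

def parse_conclusive(response: str) -> Optional[bool]:
    """True for 'Conclusive', False for 'Inconclusive', None if absent or mixed."""
    # \b keeps 'conclusive' from matching inside 'inconclusive'.
    found = re.findall(r"\b(?:in)?conclusive\b", response, re.IGNORECASE)
    verdicts = {word.lower() for word in found}
    if len(verdicts) != 1:
        return None
    return verdicts.pop() == "conclusive"

print(parse_conclusive("Conclusive\n\nBecause the evidence indicates that the water is over 100 degrees"))
print(parse_conclusive("The pairs leave the matter open. Inconclusive"))
```

Because every occurrence is considered, the parser works whether the verdict comes before or after the reasoning, and it returns `None` when the reply uses both words.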

#### Qwen 4B Thinking

In [None]:
import gc
del SM
gc.collect()

995

In [None]:
# this model likes to reason, so it needs to generate more tokens
SM = SessionManager('Qwen/Qwen3-4B-Thinking-2507', 1000)

In [None]:
ctxt2 = SM.build_prompt("adj+_follow_up",
                        ["The water is not boiling.",
                         "(\"What elevation are we at?\", \"We are at sea level.\")\n" +
                         "(\"What temperature is the water?\", \"The water is currently over 100 degrees Celcius.\")\n" +
                         "(\"Does water boil when it is over 100 degrees Celcius at sea level?\", \"Yes.\")"])
q2 = SM.step(ctxt2)
print(ctxt2 + q2)
SM.parse_boolean_answer(q2) # ordinarily, if this returns None, we need to reprompt

You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored:

Claim: The water is not boiling.
Question-Answer Pairs:
("What elevation are we at?", "We are at sea level.")
("What temperature is the water?", "The water is currently over 100 degrees Celcius.")
("Does water boil when it is over 100 degrees Celcius at sea level?", "Yes.")

Are you satisfied with the questions asked and do you have enough information to prove or disprove the claim?

Very concisely explain how the question answer pairs succeed or fail to explain the claim, then conclude with "True" if you are satisfied with the questions asked, otherwise respond with "False"

We are given the claim: "The water is not boiling."

We have three question-answer pairs:

1. ("What elevation are we at?", "We are at sea level.")
2. ("What temperature is the water?", "The water is currently over 100 degrees Celcius.")
3. ("Does water boil when it is over 100 degrees

In [None]:
ctxt2 = SM.build_prompt("adj+_follow_up",
                        ["Ice is not cold.",
                         "(\"What is the temperature of ice?\", \"Ice is at or below 0 degrees Celcius.\")\n" +
                         "(\"Do we perceive things below room temperature as cold??\", \"Yes, temperatures below room temperature feel cold to the touch.\")\n" +
                         "(\"Is 0 degrees Celsius below room temperature\", \"Yes, it's significantly colder than room temperature.\")"])
q2 = SM.step(ctxt2)
print(ctxt2 + q2)
SM.parse_boolean_answer(q2) # ordinarily, if this returns None, we need to reprompt

You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored:

Claim: Ice is not cold.
Question-Answer Pairs:
("What is the temperature of ice?", "Ice is at or below 0 degrees Celcius.")
("Do we perceive things below room temperature as cold??", "Yes, temperatures below room temperature feel cold to the touch.")
("Is 0 degrees Celsius below room temperature", "Yes, it's significantly colder than room temperature.")

Are you satisfied with the questions asked and do you have enough information to prove or disprove the claim?

Very concisely explain how the question answer pairs succeed or fail to explain the claim, then conclude with "True" if you are satisfied with the questions asked, otherwise respond with "False"

Okay, the user wants me to act as a fact-checker for the claim "Ice is not cold." They've provided three question-answer pairs that I need to evaluate. 

First, I need to understand what the claim means. T

True

This claim may not be the best test; that ice is cold *is* common knowledge, but it is still subjective.

In [None]:
ctxt2 = SM.build_prompt("adj+_follow_up",
                        ["Ice is not colder than room temperature.",
                         "(\"What is the temperature of ice?\", \"Ice is at or below 0 degrees Celcius.\")\n" +
                         "(\"Do we perceive things below room temperature as cold??\", \"Yes, temperatures below room temperature feel cold to the touch.\")\n" +
                         "(\"Is 0 degrees Celsius below room temperature\", \"Yes, it's significantly colder than room temperature.\")"])
q2 = SM.step(ctxt2)
print(ctxt2 + q2)
SM.parse_boolean_answer(q2) # ordinarily, if this returns None, we need to reprompt

You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored:

Claim: Ice is not colder than room temperature.
Question-Answer Pairs:
("What is the temperature of ice?", "Ice is at or below 0 degrees Celcius.")
("Do we perceive things below room temperature as cold??", "Yes, temperatures below room temperature feel cold to the touch.")
("Is 0 degrees Celsius below room temperature", "Yes, it's significantly colder than room temperature.")

Are you satisfied with the questions asked and do you have enough information to prove or disprove the claim?

Very concisely explain how the question answer pairs succeed or fail to explain the claim, then conclude with "True" if you are satisfied with the questions asked, otherwise respond with "False"

We are given the claim: "Ice is not colder than room temperature."

We have three question-answer pairs:

1. ("What is the temperature of ice?", "Ice is at or below 0 degrees Celciu

False

The model forgot the context and gave the wrong final answer.

In [None]:
ctxt2 = SM.build_prompt("adj+_follow_up",
                        ["The titular Lord of the Rings was never bested by a dog.",
                         "(\"Who is the Lord of the Rings?\", \"In the Tolkien universe, Sauron is generally considered the Lord of the Rings, having created the One Ring in his conquest over Middle Earth.\")\n" +
                         "(\"Could any dog, given the right circumstances, defeat Sauron in combat or contest?\", \"In the Tolkien universe, a wolfhound named Huan was prophesied to only be defeated by the greatest wolf that ever lived.\")\n" +
                         "(\"What Race is Sauron?\", \"Sauron is a fallen Maia, a class of immortal beings who can choose to incarnate themselves in a mortal body.\")\n" +
                         "(\"Did Sauron and Huan ever fight?\", \"During an attack on the stronghold of Tol-in-Gaurhoth, Haun managed to defeat Sauron after the latter took the shape of a wolf in attempts to fill the role of Huan's killer.\")"])
q2 = SM.step(ctxt2)
print(ctxt2 + q2)
SM.parse_boolean_answer(q2) # ordinarily, if this returns None, we need to reprompt

You are an expert fact-checker given an unverified claim and question-answer pairs regarding the claim that needs to be explored:

Claim: The titular Lord of the Rings was never bested by a dog.
Question-Answer Pairs:
("Who is the Lord of the Rings?", "In the Tolkien universe, Sauron is generally considered the Lord of the Rings, having created the One Ring in his conquest over Middle Earth.")
("Could any dog, given the right circumstances, defeat Sauron in combat or contest?", "In the Tolkien universe, a wolfhound named Huan was prophesied to only be defeated by the greatest wolf that ever lived.")
("What Race is Sauron?", "Sauron is a fallen Maia, a class of immortal beings who can choose to incarnate themselves in a mortal body.")
("Did Sauron and Huan ever fight?", "During an attack on the stronghold of Tol-in-Gaurhoth, Haun managed to defeat Sauron after the latter took the shape of a wolf in attempts to fill the role of Huan's killer.")

Are you satisfied with the questions asked

False

This is an interesting output, because it finds the answer to the claim quickly, but then tries to 'verify the accuracy' of the output.

The thinking model is likely not a good fit for this task, because it takes too long to generate and often overthinks.
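
If a thinking model were used anyway, one safeguard would be to parse only the text after the reasoning block, so that keywords inside unfinished reasoning are never read as the verdict. A sketch, assuming the decoded output marks the end of reasoning with a literal `</think>` tag (an assumption about how this model's output appears; adjust the marker if it differs). `answer_after_thinking` is a hypothetical helper.

```python
from typing import Optional

THINK_END = "</think>"  # assumed end-of-reasoning marker

def answer_after_thinking(response: str) -> Optional[str]:
    """Return the text after the reasoning block, or None if it never closed."""
    _, marker, tail = response.rpartition(THINK_END)
    if not marker:
        return None   # generation was cut off mid-reasoning
    return tail.strip()

print(answer_after_thinking("The claim is false... wait, true?</think>\nFalse"))
print(answer_after_thinking("Okay, the user wants me to act as a fact-checker"))
```

A `None` result here signals that the token budget ran out before the model finished reasoning, which is a cue to raise `max_to_generate` or reprompt rather than trust a keyword match.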

### Question Generation

For the other prompts, a more sophisticated approach might be necessary. I am considering reprompting: asking the model specifically to make the generated question more precise and to remove references to the claim, essentially splitting the original prompt into several smaller ones.
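
A rough sketch of that two-stage idea: check the generated question for the forbidden words, and only when it fails issue a second, narrower prompt. The rewrite prompt text and the helper names are hypothetical, and `fake_step` stands in for `SessionManager.step`.

```python
import re
from typing import Callable

# Hypothetical second-stage prompt; not one of the RAGAR prompts.
REWRITE_PROMPT = (
    "Rewrite the question so it names the exact issue instead of using "
    "the word \"claim\". Return only the question.\n\nQuestion: {}\nRewritten: "
)

def mentions_claim(question: str) -> bool:
    """True if QUESTION uses 'claim' or 'statement' as a whole word."""
    return re.search(r"\b(claims?|statements?)\b", question, re.IGNORECASE) is not None

def refine_question(step: Callable[[str], str], question: str, max_rounds: int = 2) -> str:
    """Reprompt with the narrower rewrite task while the question still mentions the claim."""
    for _ in range(max_rounds):
        if not mentions_claim(question):
            break
        question = step(REWRITE_PROMPT.format(question))
    return question

# Scripted stand-in so the sketch runs without a model.
def fake_step(context: str) -> str:
    return "What type of creature defeated Sauron?"

draft = "What type of creature was the Lord of the Rings defeated by according to the claim?"
print(refine_question(fake_step, draft))
```

Each rewrite round costs one extra generation, so `max_rounds` bounds the added latency; a question that already passes the check is returned untouched.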

This cell must be at the end of the test section:

In [None]:
# EXPORT-IGNORE-END