# LLM-Reasoners Demo

This notebook accompanies our tutorial at SIGIR VF:
[[slides](https://www.llm-reasoners.net/2024-02-Reasoners-SIGIR.pdf)]
[[video](https://www.youtube.com/watch?v=d_x2pzEHGQY&pp=ygUJc2hpYm8gaGFv) (starting at 37:20)]

## Setup

The following code assumes you have cloned our library with `git clone https://github.com/maitrix-org/llm-reasoners.git --recursive`

This notebook has been tested with several model choices.
- By default, the output cells show results from [`TheBloke/Llama-2-70B-GPTQ`](https://huggingface.co/TheBloke/Llama-2-70B-GPTQ), a GPTQ-quantized model.
- You can also use [`meta-llama/Llama-3.1-8B`](https://huggingface.co/meta-llama/Llama-3.1-8B) through either SGLang or Hugging Face (the loading code is shown below), but the results will differ from the output cells.

In [3]:
# This block would load Llama-2-70B-GPTQ with Exllama

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
from reasoners.lm import ExLlamaModel
import torch

# https://huggingface.co/TheBloke/Llama-2-70B-GPTQ
# It may take a few minutes to download the model

model = ExLlamaModel(model_dir='TheBloke/Llama-2-70B-GPTQ',
                     lora_dir=None,
                     device = torch.device("cuda:0"),
                     max_batch_size=1,
                     max_new_tokens=200,
                     mem_map=[16,22], # For 2 * 24GB GPUs. If you have > 40GB you can set it to None
                     max_seq_length=2048)

Fetching 13 files: 100%|██████████| 13/13 [00:00<00:00, 82117.40it/s]


In [1]:
# Huggingface

import torch
from reasoners.lm import HFModel
model_name = "meta-llama/Llama-3.1-8B"
model = HFModel(model_name, model_name, device=torch.device("cuda:0"), max_batch_size=1, max_new_tokens=512)

Loading checkpoint shards: 100%|██████████| 4/4 [00:12<00:00,  3.23s/it]


In [1]:
# SGLang model
# You'll need to first set up a server with the [SGLang](https://github.com/sgl-project/sglang) API.

import os
from reasoners.lm import SGLangModel
os.environ["SGLANG_API_URL"] = "http://127.0.0.1:30001"
os.environ["OPENAI_API_KEY"] = "None"
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = SGLangModel(model_name, is_instruct_model = True)



We take one example from the Blocksworld dataset, along with the prompt that supplies the in-context learning examples.
Evaluators are covered in more detail later.

In [16]:
len(evaluator.full_dataset[:1000])

84

In [2]:
from reasoners.benchmark import BWEvaluator
import json

with open('examples/CoT/blocksworld/prompts/pool_prompt_v1.json') as f:
    prompt = json.load(f)
evaluator = BWEvaluator(config_file='examples/CoT/blocksworld/data/bw_config.yaml',
                        domain_file='examples/CoT/blocksworld/data/generated_domain.pddl',
                        data_path='examples/CoT/blocksworld/data/split_v1/split_v1_step_4_data.json',
                        init_prompt=prompt)
prompt = evaluator.sample_prompt(shuffle_prompt=False, num_shot=4)
example = evaluator.full_dataset[:10]
# cot_inputs = (prompt['icl'].replace('<init_state>', example["init"])
#                            .replace('<goals>', example["goal"])
#                            .replace('<action>', ''))
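The commented-out lines above show how a single prompt is built: three placeholders in the in-context template are replaced with fields from the instance. Here is a minimal sketch of that step. The placeholder names `<init_state>`, `<goals>`, and `<action>` come from the notebook, but `TOY_TEMPLATE` and `fill_template` are hypothetical stand-ins, and the real `prompt['icl']` template is much longer.

```python
# Toy stand-in for prompt['icl']; the real template also contains the
# task description and the in-context examples.
TOY_TEMPLATE = (
    "[STATEMENT]\n"
    "As initial conditions I have that, <init_state>.\n"
    "My goal is to have that <goals>.\n\n"
    "My plan is as follows:\n\n"
    "[PLAN]\n<action>"
)


def fill_template(template: str, init_state: str, goals: str, action: str = "") -> str:
    """Replace the three placeholders, mirroring the chained .replace() calls."""
    return (template.replace("<init_state>", init_state)
                    .replace("<goals>", goals)
                    .replace("<action>", action))


# A made-up instance with the same keys the notebook reads ("init", "goal").
toy_instance = {
    "init": "the red block is on the table and the blue block is on the table",
    "goal": "the red block is on top of the blue block",
}

filled = fill_template(TOY_TEMPLATE, toy_instance["init"], toy_instance["goal"])
print(filled)
```

Passing an empty string for `<action>` leaves the prompt ending right after `[PLAN]`, so the model is asked to continue with the plan itself.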

In [4]:
cot_inputs = []
for e in example:
    temp= (prompt['icl'].replace('<init_state>', e["init"])
                        .replace('<goals>', e["goal"])
                        .replace('<action>', ''))
    cot_inputs.append(temp)

NameError: name 'example' is not defined

In [7]:
output = model.generate([cot_inputs],
                        hide_input=True,
                        eos_token_id='\n[').text[0][:-1].strip()

In [4]:
import os
import shutil
from multiprocessing import Pool
import reasoners.benchmark.bw_utils as bw_utils

def clear_directory(directory):
    if os.path.exists(directory):
        shutil.rmtree(directory) 
    os.makedirs(directory) 

def process_input(args):
    index, input_data = args
    
    output = model.generate([input_data],
                             hide_input=True,
                             eos_token_id=['\n[']).text[0][:-1].strip()
    
    output_dir = "CoT_results"
    
    output_file = os.path.join(output_dir, f"process_{index}.txt")
    
    with open(output_file, 'w') as file:
        file.write(output)
    
    return output_file

if __name__ == "__main__":
    num_processes = 10
    output_dir = "CoT_results"

    clear_directory(output_dir)

    inputs_with_indices = list(enumerate(cot_inputs))

    with Pool(num_processes) as pool:
        results = pool.map(process_input, inputs_with_indices)
    
    print("Files saved:", results)


Files saved: ['CoT_results/process_0.txt', 'CoT_results/process_1.txt', 'CoT_results/process_2.txt', 'CoT_results/process_3.txt', 'CoT_results/process_4.txt', 'CoT_results/process_5.txt', 'CoT_results/process_6.txt', 'CoT_results/process_7.txt', 'CoT_results/process_8.txt', 'CoT_results/process_9.txt']


In [9]:
import reasoners.benchmark.bw_utils as bw_utils

bw_utils.text_to_plan_blocksworld(output, example["instance_file"], 'examples/CoT/blocksworld/data/bw_config.yaml', 'examples/CoT/blocksworld/data/generated_domain.pddl', 'tmp_plan2.txt')

[+]: Saving plan in tmp_plan2.txt


('(pick-up b)\n(stack b c)\n(pick-up d)\n(stack d c)\n',
 '(pick-up blue)\n(stack blue orange)\n(pick-up yellow)\n(stack yellow orange)\n')

In [3]:
example

{'init': 'the blue block is clear, the orange block is clear, the hand is empty, the orange block is on top of the red block, the red block is on the table and the blue block is on the table',
 'goal': 'the red block is on top of the blue block',
 'plan': '\nunstack the orange block from on top of the red block\nput down the orange block\npick up the red block\nstack the red block on top of the blue block\n[PLAN END]\n',
 'question': '\n[STATEMENT]\nAs initial conditions I have that, the blue block is clear, the orange block is clear, the hand is empty, the orange block is on top of the red block, the red block is on the table and the blue block is on the table.\nMy goal is to have that the red block is on top of the blue block.\n\nMy plan is as follows:\n\n[PLAN]\n',
 'instance_file': 'LLMs-Planning/llm_planning_analysis/instances/blocksworld/generated_basic_3/instance-52.pddl'}

In [24]:
len(example)

76

In [5]:
cot_inputs = []
for e in example:
    temp= (prompt['icl'].replace('<init_state>', e["init"])
                        .replace('<goals>', e["goal"])
                        .replace('<action>', ''))
    cot_inputs.append(temp)

TypeError: string indices must be integers

In [4]:
len(cot_inputs)

3719

In [4]:
for i in cot_inputs:
    temp = model.generate([i],
                            hide_input=True,
                            eos_token_id=['\n[']).text[0][:-1].strip()

KeyboardInterrupt: 

In [7]:
from multiprocessing import Pool

def process_input(i):
    temp = model.generate([i],
                          hide_input=True,
                          eos_token_id=['\n[']).text[0][:-1].strip()
    return temp


if __name__ == "__main__":
    num_processes = 100

    with Pool(num_processes) as pool:
        results = pool.map(process_input, cot_inputs)
    print(results)


['pick up the blue block\nstack the blue block on top of the orange block', 'pick up the red block\nstack the red block on top of the blue block', 'pick up the blue block\nstack the blue block on top of the orange block\npick up the yellow block\nstack the yellow block on top of the orange block', 'unstack the blue block from on top of the yellow block\nput down the blue block\npick up the red block\nstack the red block on top of the yellow block\npick up the blue block\nstack the blue block on top of the orange block', 'pick up the blue block\nstack the blue block on top of the red block', 'pick up the blue block\nstack the blue block on top of the red block', 'pick up the blue block\nstack the blue block on top of the orange block\npick up the yellow block\nstack the yellow block on top of the red block', 'unstack the orange block from on top of the yellow block\nput down the orange block\npick up the blue block\nstack the blue block on top of the orange block', 'pick up the yellow b

In [13]:
results[83]

'pick up the white block\nstack the white block on top of the red block\npick up the yellow block\nstack the yellow block on top of the blue block\npick up the orange block\nstack the orange block on top of the blue block'

Here is the example.

In [15]:
print(example['init'])

the blue block is clear, the orange block is clear, the hand is empty, the orange block is on top of the red block, the red block is on the table and the blue block is on the table


In [16]:
print(example['goal'])

the red block is on top of the blue block


## Chain-of-Thought
We first experiment with the Chain-of-Thought method.
Because this is the simplest generation algorithm, we directly ask the model to generate all the steps at once.
Below are the 4-shot prompt and the generated answer.

In [5]:
print(cot_inputs)

I am playing with a set of blocks where I need to arrange the blocks into stacks. Here are the actions I can do

Pick up a block
Unstack a block from on top of another block
Put down a block
Stack a block on top of another block

I have the following restrictions on my actions:
I can only pick up or unstack one block at a time.
I can only pick up or unstack a block if my hand is empty.
I can only pick up a block if the block is on the table and the block is clear. A block is clear if the block has no other blocks on top of it and if the block is not picked up.
I can only unstack a block from on top of another block if the block I am unstacking was really on top of the other block.
I can only unstack a block from on top of another block if the block I am unstacking is clear.
Once I pick up or unstack a block, I am holding the block.
I can only put down a block that I am holding.
I can only stack a block on top of another block if I am holding the block being stacked.
I can only stack a 

In [6]:
output = model.generate([cot_inputs],
                        hide_input=True,
                        eos_token_id='\n[').text[0][:-1].strip()

In [None]:
import reasoners.benchmark.bw_utils as bw_utils

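# Pass the config and domain paths explicitly; `self` and `answer` are not defined in a notebook cell.
bw_utils.text_to_plan_blocksworld(output, example["instance_file"], 'examples/CoT/blocksworld/data/bw_config.yaml', 'examples/CoT/blocksworld/data/generated_domain.pddl', 'tmp_plan2.txt')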

In [7]:
output

'pick up the red block\nstack the red block on top of the blue block'

In [10]:
print(output)

pick up the red block
stack the red block on top of the blue block


Clearly that's not a valid solution :( 
The orange block is on the red block, so we cannot pick up the red block as the first step.
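To see why the plan fails, it helps to simulate it against the Blocksworld rules listed in the prompt. The sketch below is a hypothetical, simplified checker written for this illustration. It is not the VAL validator that the evaluator uses, and the structured state (`on` dictionary plus a held block) is an assumption made for brevity.

```python
from typing import Optional


def is_clear(on: dict, block: str) -> bool:
    """A block is clear if no other block sits on top of it."""
    return block not in on.values()


def first_invalid_step(on: dict, plan: list) -> Optional[int]:
    """Simulate the plan; return the index of the first illegal action, or None."""
    on = dict(on)      # copy so the caller's state is untouched
    holding = None     # block currently in the hand, if any
    for i, (verb, block, *rest) in enumerate(plan):
        if verb == "pick up":
            # Needs an empty hand, and a clear block that is on the table.
            if holding is not None or on.get(block) != "table" or not is_clear(on, block):
                return i
            del on[block]
            holding = block
        elif verb == "unstack":
            # Needs an empty hand, and a clear block that really is on rest[0].
            if holding is not None or on.get(block) != rest[0] or not is_clear(on, block):
                return i
            del on[block]
            holding = block
        elif verb == "put down":
            if holding != block:
                return i
            on[block] = "table"
            holding = None
        elif verb == "stack":
            target = rest[0]
            # Needs the block in hand, and a clear target that is not being held.
            if holding != block or target not in on or not is_clear(on, target):
                return i
            on[block] = target
            holding = None
        else:
            return i
    return None


# Initial state of the example: orange on red, red and blue on the table.
init = {"orange": "red", "red": "table", "blue": "table"}

cot_plan = [("pick up", "red"), ("stack", "red", "blue")]
reference_plan = [("unstack", "orange", "red"), ("put down", "orange"),
                  ("pick up", "red"), ("stack", "red", "blue")]

print(first_invalid_step(init, cot_plan))        # 0: red is not clear
print(first_invalid_step(init, reference_plan))  # None: every step is legal
```

The Chain-of-Thought plan is rejected at its very first step, while the reference plan from the dataset passes.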

## Tree-of-Thought
Next, let's turn to a tree search algorithm, [Tree-of-Thought](https://arxiv.org/abs/2305.10601).
We need to define a simple world model and a search configuration for the Blocksworld task, then pair them with a search algorithm.

In [6]:
from reasoners import WorldModel, LanguageModel, SearchConfig, State, Reasoner
from reasoners.algorithm import BeamSearch, MCTS
import reasoners.benchmark.bw_utils as utils
from typing import NamedTuple
import copy
import numpy as np


# We use NamedTuple for clearer presentation, you may just use normal tuple if you want a quick experiment.
class BWStateToT(NamedTuple):
    step_idx: int
    action_history: list[str]
    end: bool


# We use the description str as the action, with a type alias for clearer presentation.
# You may use str directly if you want a quick experiment.
BWAction = str


class BlocksWorldModelToT(WorldModel):
    def __init__(self,
                 base_model: LanguageModel,
                 prompt: dict,
                 max_steps: int = 4,
                 batch_size: int = 1) -> None:
        super().__init__()
        self.max_steps = max_steps
        self.base_model = base_model
        self.prompt = prompt
        self.batch_size = batch_size

    def init_state(self) -> BWStateToT:
        return BWStateToT(step_idx=0, action_history=[], end=False)
    
    def step(self, state: BWStateToT, action: BWAction) -> tuple[BWStateToT, dict]:
        state = copy.deepcopy(state)
        if action != "[PLAN END]":
            state = BWStateToT(step_idx=state.step_idx + 1, action_history=state.action_history + [action], end=False)
        else:
            state = BWStateToT(step_idx=state.step_idx + 1, action_history=state.action_history, end=True)
        return state, {}  # the dict is auxiliary information for SearchConfig, we don't need it here.
    
    def is_terminal(self, state: State) -> bool:
        return state.end or state.step_idx >= self.max_steps


class BWConfigToT(SearchConfig):
    def __init__(self,
                 base_model: LanguageModel,
                 prompt: dict,
                 temperature: float = 0.8,
                 n_candidate: int = 4) -> None:
        super().__init__()
        self.base_model = base_model
        self.example = None
        self.prompt = prompt
        self.n_candidate = n_candidate
        self.temperature = temperature

    def get_actions(self, state: BWStateToT) -> list[BWAction]:
        prompts = (self.prompt["icl"]
                       .replace("<action>", "\n".join(state.action_history + [""]))
                       .replace("<init_state>", utils.extract_init_state(self.example))
                       .replace("<goals>", utils.extract_goals(self.example, return_raw=True)))
        outputs = self.base_model.generate([prompts],
                                           num_return_sequences=self.n_candidate,
                                           max_length=20,
                                           eos_token_id="\n",
                                           temperature=self.temperature,
                                           do_sample=True,
                                           hide_input=True).text
        outputs = [output.split("\n")[0] for output in outputs]
        outputs = list(dict.fromkeys(outputs))  # deduplicate
        return outputs

    # Some reward functions are fast to calculate.
    # We calculate the reward before executing the action, which can be used to better guide the search.
    def fast_reward(self, state: BWStateToT, action: BWAction) -> tuple[float, dict]:
        # We use two rewards here:
        # 1. Intuition: The loglikelihood of the action given the prompt.
        # 2. Self-eval: Ask the language model whether this step is "Good".
        inputs = self.prompt["icl"].replace("<action>", "\n".join(state.action_history + [""])) \
            .replace("<init_state>", utils.extract_init_state(self.example)) \
            .replace("<goals>", utils.extract_goals(self.example, return_raw=True))[:-1]
        
        intuition = self.base_model.get_loglikelihood(inputs + "\n", [inputs + "\n" + action])[0]

        self_eval_prompt = (self.prompt["self-eval"].replace("<init_state>", utils.extract_init_state(self.example))
                                                    .replace("<goals>", utils.extract_goals(self.example, return_raw=True))
                                                    .replace("<action>", action))
        self_eval = self.base_model.get_loglikelihood(self_eval_prompt, [self_eval_prompt + "good"])[0]

        return intuition + self_eval, {'intuition': intuition, "self_eval": self_eval}
    
    # kwargs is the auxiliary information returned by SearchConfig.fast_reward and WorldModel.step,
    # so that we do not need duplicated calculations.
    # In this case, we just use the fast_reward result as the reward.
    # Generally, if a reward function depends on the new state, or is slow to calculate,
    # we will calculate it here.
    def reward(self, state, action, **kwargs) -> tuple[float, dict]:
        return kwargs['intuition'] + kwargs['self_eval'], kwargs
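To make the roles of `get_actions` and `fast_reward` concrete, here is a self-contained toy beam search that does not call a language model. It is only a sketch of the general technique. The library's `BeamSearch` may differ in details such as reward aggregation, deduplication, and early termination. The names `toy_get_actions`, `toy_reward`, and `beam_search` are hypothetical, and the scorer simply prefers a fixed target plan.

```python
ACTIONS = [
    "unstack the orange block from on top of the red block",
    "put down the orange block",
    "pick up the red block",
    "stack the red block on top of the blue block",
]
TARGET = list(ACTIONS)  # the plan this toy scorer happens to prefer


def toy_get_actions(history):
    """Stand-in for SearchConfig.get_actions: propose candidate next actions."""
    return ACTIONS


def toy_reward(history, action):
    """Stand-in for fast_reward: 0 if the action matches the target plan, else -1."""
    idx = len(history)
    return 0.0 if idx < len(TARGET) and TARGET[idx] == action else -1.0


def beam_search(get_actions, reward, beam_size, max_depth):
    """Keep the beam_size best partial plans at each depth (rewards are summed)."""
    beams = [([], 0.0)]  # each beam is (action_history, cumulative_reward)
    for _ in range(max_depth):
        candidates = []
        for history, score in beams:
            for action in get_actions(history):
                candidates.append((history + [action], score + reward(history, action)))
        # Highest cumulative reward first, then truncate to the beam width.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]


best_history, best_score = beam_search(toy_get_actions, toy_reward,
                                       beam_size=2, max_depth=4)
print("\n".join(best_history))
print(best_score)
```

In the real setup the candidates come from sampling the language model and the score combines the action log-likelihood with the self-evaluation, so the search is only as good as those two signals.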

Note: The following command may take up to 2 minutes to run.

In [7]:

world_model = BlocksWorldModelToT(base_model=model, prompt=prompt)
config = BWConfigToT(base_model=model, prompt=prompt)
algorithm = BeamSearch(beam_size=4, max_depth=7)
reasoner_tot = Reasoner(world_model=world_model, search_config=config, search_algo=algorithm)
#result_tot = reasoner_tot(example)


# import time

# start_time_cell1 = time.time()

# result_tot = [reasoner_tot(e) for e in example]
# print(result_tot)

# time.sleep(5)  # Simulating a long-running task
# end_time_cell1 = time.time()

# # Save timing info
# with open("ToT_log_SgLang.txt", "a") as log_file:
#     log_file.write(f"Elapsed time: {end_time_cell1 - start_time_cell1} seconds\n")



In [8]:
import os
import shutil
from multiprocessing import Pool

def clear_directory(directory):
    if os.path.exists(directory):
        shutil.rmtree(directory)
    os.makedirs(directory)

def process_example_ToT(args):
    index, example_input = args

    # Reasoning process for each example
    result = reasoner_tot(example_input)

    # Save the result to an output file
    output_dir = "ToT_results"
    output_file = os.path.join(output_dir, f"process_{index}.txt")

    with open(output_file, 'w') as file:
        file.write(str(result))

    return output_file

if __name__ == "__main__":
    num_processes = 10
    output_dir = "ToT_results"

    clear_directory(output_dir)

    inputs_with_indices = list(enumerate(example))

    with Pool(num_processes) as pool:
        results = pool.map(process_example_ToT, inputs_with_indices)

    print("Files saved:", results)




Files saved: ['ToT_results/process_0.txt', 'ToT_results/process_1.txt', 'ToT_results/process_2.txt', 'ToT_results/process_3.txt', 'ToT_results/process_4.txt', 'ToT_results/process_5.txt', 'ToT_results/process_6.txt', 'ToT_results/process_7.txt', 'ToT_results/process_8.txt', 'ToT_results/process_9.txt']


In [7]:
example

{'init': 'the blue block is clear, the orange block is clear, the hand is empty, the orange block is on top of the red block, the red block is on the table and the blue block is on the table',
 'goal': 'the blue block is on top of the red block',
 'plan': '\nunstack the orange block from on top of the red block\nput down the orange block\npick up the blue block\nstack the blue block on top of the red block\n[PLAN END]\n',
 'question': '\n[STATEMENT]\nAs initial conditions I have that, the blue block is clear, the orange block is clear, the hand is empty, the orange block is on top of the red block, the red block is on the table and the blue block is on the table.\nMy goal is to have that the blue block is on top of the red block.\n\nMy plan is as follows:\n\n[PLAN]\n',
 'instance_file': 'LLMs-Planning/llm_planning_analysis/instances/blocksworld/generated_basic_3/instance-71.pddl'}

In [None]:
print('Action, Reward')
for action, _, reward in result_tot.trace:
    print(action, reward)

Action, Reward
None 0.0
unstack the orange block from on top of the red block -0.7903961927319566
put down the orange block -0.7176974050467834
pick up the red block -0.50134990173392
stack the red block on top of the blue block -0.5919118742342107


Still the same error :(

## RAP
With [RAP](https://arxiv.org/abs/2305.14992), the state is the latest block configuration itself rather than a history of actions.
We therefore define a new world model that transitions between states; it is only slightly more complex than the previous one.

In [11]:
BWAction = str


class BWStateRAP(NamedTuple):
    step_idx: int
    last_blocks_state: str
    blocks_state: str
    buffered_action: BWAction


class BlocksWorldModelRAP(WorldModel):
    def __init__(self,
                 base_model: LanguageModel,
                 prompt: dict,
                 max_steps: int = 4,
                 batch_size: int = 1) -> None:
        super().__init__()
        self.max_steps = max_steps
        self.base_model = base_model
        self.prompt = prompt
        self.batch_size = batch_size

    def init_state(self) -> BWStateRAP:
        return BWStateRAP(step_idx=0,
                          last_blocks_state="",
                          blocks_state=utils.extract_init_state(self.example),
                          buffered_action="")

    def step(self, state: BWStateRAP, action: BWAction) -> tuple[BWStateRAP, dict]:
        state = copy.deepcopy(state)
        blocks_state = state.blocks_state
        step_idx = state.step_idx
        blocks_state = self.update_blocks(blocks_state, action)
        new_buffered_action = action if state.buffered_action == "" else ""

        state = BWStateRAP(step_idx=step_idx + 1,
                        last_blocks_state=state.blocks_state,
                        blocks_state=blocks_state,
                        buffered_action=new_buffered_action)
        return state, {"goal_reached": utils.goal_check(utils.extract_goals(self.example), blocks_state)}

    def update_blocks(self, block_states: str, action: BWAction) -> str:
        if "pick" in action:
            key = "world_update_pickup"
        elif "unstack" in action:
            key = "world_update_unstack"
        elif "put" in action:
            key = "world_update_putdown"
        elif "stack" in action:
            key = "world_update_stack"
        else:
            raise ValueError("Invalid action")
        world_update_prompt = self.prompt[key].format(block_states, action.capitalize() + ".")
        world_output = self.base_model.generate([world_update_prompt],
                                                eos_token_id=["\n", ".\n", ".\n\n"],
                                                hide_input=True,
                                                temperature=0).text[0].strip()
        new_state = utils.apply_change(world_output, block_states)
        return new_state

    def is_terminal(self, state: BWStateRAP) -> bool:
        if utils.goal_check(utils.extract_goals(self.example), state.blocks_state)[0]:
            return True
        elif state.step_idx == self.max_steps:
            return True
        return False
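The `buffered_action` field can be easy to misread, so here is a small sketch of the bookkeeping in `step` with the language-model call replaced by a stub. `ToyState`, `toy_update`, and `toy_step` are hypothetical names used only for this illustration. The buffering rule itself is copied from the world model above.

```python
from typing import NamedTuple


class ToyState(NamedTuple):
    step_idx: int
    last_blocks_state: str
    blocks_state: str
    buffered_action: str


def toy_update(blocks_state: str, action: str) -> str:
    """Stub for update_blocks: the real method asks the language model for the new state."""
    return f"{blocks_state} -> [{action}]"


def toy_step(state: ToyState, action: str) -> ToyState:
    """Same bookkeeping as BlocksWorldModelRAP.step, minus the goal check."""
    # Buffer the action only if nothing is buffered yet; otherwise clear the buffer.
    new_buffered = action if state.buffered_action == "" else ""
    return ToyState(step_idx=state.step_idx + 1,
                    last_blocks_state=state.blocks_state,
                    blocks_state=toy_update(state.blocks_state, action),
                    buffered_action=new_buffered)


state = ToyState(step_idx=0, last_blocks_state="", blocks_state="s0", buffered_action="")
trace = [state]
for action in ["unstack orange", "put down orange", "pick up red", "stack red on blue"]:
    state = toy_step(state, action)
    trace.append(state)

print([s.buffered_action for s in trace])
# ['', 'unstack orange', '', 'pick up red', '']
```

The buffer is filled after every odd step and cleared after every even one, so it always holds the first action of the current two-action pair, and `last_blocks_state` keeps the configuration from before that action.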

In [12]:
class BWConfigRAP(SearchConfig):
    def __init__(self,
                 base_model: LanguageModel,
                 prompt: dict,
                 batch_size: int = 1,
                 reward_alpha: float = 0.5,
                 goal_reward_default: float = 0.,
                 goal_reached_reward: float = 100.) -> None:
        super().__init__()
        self.base_model = base_model
        self.example = None
        self.prompt = prompt
        self.batch_size = batch_size
        self.reward_alpha = reward_alpha
        self.goal_reward_default = goal_reward_default
        self.goal_reached_reward = goal_reached_reward

    def get_actions(self, state: BWStateRAP) -> list[BWAction]:
        blocks_state = state.blocks_state
        return utils.generate_all_actions(blocks_state)

    def fast_reward(self, state: BWStateRAP, action: BWAction) -> tuple[float, dict]:
        if state.buffered_action == "":
            current_blocks_state = state.blocks_state
        else:
            current_blocks_state = state.last_blocks_state
        previous_action = state.buffered_action + "\n" if state.buffered_action != "" else ""
        
        # every two steps, we will also reduce the icl examples by 2 steps
        # so that the distribution of step length in examples is more reasonable
        icl_template = self.prompt["icl_list"][state.step_idx // 2]
        
        inputs = (icl_template.replace("<init_state>", current_blocks_state)
                              .replace("<goals>", utils.extract_goals(self.example, return_raw=True))
                              .replace("<action>", previous_action))
        intuition = self.base_model.get_loglikelihood(inputs, [inputs + action])[0]

        self_eval_prompt = (self.prompt["self-eval"]
                                .replace("<init_state>", current_blocks_state)
                                .replace("<goals>", utils.extract_goals(self.example, return_raw=True))
                                .replace("<action>", action))
        self_eval = self.base_model.get_loglikelihood(self_eval_prompt, [self_eval_prompt + "good"])[0]

        return (self.calculate_reward(intuition, self_eval),
                {'intuition': intuition, "self_eval": self_eval})

    def calculate_reward(self, intuition, self_eval, goal_reached=None) -> float:
        # to provide a unified interface for reward and fast_reward
        if goal_reached is None:
            goal_reward = self.goal_reward_default
        elif goal_reached[0]:
            goal_reward = self.goal_reached_reward
        else:
            goal_reward = goal_reached[1]
        return (intuition + self_eval) * self.reward_alpha + goal_reward * (1 - self.reward_alpha)

    def reward(self, state: BWStateRAP, action: BWAction,
               intuition: float = None,
               self_eval: float = None,
               goal_reached: tuple[bool, float] = None) -> tuple[float, dict]:
        return (self.calculate_reward(intuition, self_eval, goal_reached),
                {'intuition': intuition, 'goal_reached': goal_reached})
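As a worked example of the arithmetic in `calculate_reward`, the standalone function below reproduces the same formula with the default hyperparameters from `BWConfigRAP` (`reward_alpha=0.5`, `goal_reward_default=0.`, `goal_reached_reward=100.`). The function name `combine_reward` and the log-likelihood values are made up for illustration. They are not outputs from the notebook.

```python
def combine_reward(intuition, self_eval, goal_reached=None,
                   reward_alpha=0.5, goal_reward_default=0.0, goal_reached_reward=100.0):
    """Weighted sum of the LM-based scores and the goal reward, as in calculate_reward."""
    if goal_reached is None:
        goal_reward = goal_reward_default      # fast_reward: new state not known yet
    elif goal_reached[0]:
        goal_reward = goal_reached_reward      # all goals satisfied
    else:
        goal_reward = goal_reached[1]          # partial credit from the goal check
    return (intuition + self_eval) * reward_alpha + goal_reward * (1 - reward_alpha)


# Made-up log-likelihoods for a single action.
intuition, self_eval = -0.5, -0.25

fast = combine_reward(intuition, self_eval)                    # before executing the action
partial = combine_reward(intuition, self_eval, (False, 0.5))   # goal not reached yet
reached = combine_reward(intuition, self_eval, (True, 1.0))    # goal reached

print(fast, partial, reached)
# -0.375 -0.125 49.625
```

With these defaults the goal bonus dominates the two log-likelihood terms, so any action that reaches the goal far outscores one that does not.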

We use the MCTS algorithm built into Reasoners and build the pipeline again.
Note: The following command may take 2 minutes to run.

In [13]:
world_model = BlocksWorldModelRAP(base_model=model, prompt=prompt, max_steps=4)
config = BWConfigRAP(base_model=model, prompt=prompt)
algorithm = MCTS(depth_limit=4, disable_tqdm=False, output_trace_in_each_iter=True, n_iters=10)
reasoner_rap = Reasoner(world_model=world_model, search_config=config, search_algo=algorithm)
#result_rap = reasoner_rap(example)

# result_rap = [reasoner_rap(e) for e in example]
# print(result_rap)





In [14]:
import os
import shutil
from multiprocessing import Pool

def clear_directory(directory):
    if os.path.exists(directory):
        shutil.rmtree(directory)
    os.makedirs(directory)

def process_example_RAP(args):
    index, example_input = args

    # Reasoning process for each example
    result = reasoner_rap(example_input)

    # Save the result to an output file
    output_dir = "RAP_results"
    output_file = os.path.join(output_dir, f"process_{index}.txt")

    with open(output_file, 'w') as file:
        file.write(str(result))

    return output_file

if __name__ == "__main__":
    num_processes = 10
    output_dir = "RAP_results"

    clear_directory(output_dir)

    inputs_with_indices = list(enumerate(example))

    with Pool(num_processes) as pool:
        results = pool.map(process_example_RAP, inputs_with_indices)

    print("Files saved:", results)

                                                              

Files saved: ['RAP_results/process_0.txt', 'RAP_results/process_1.txt', 'RAP_results/process_2.txt', 'RAP_results/process_3.txt', 'RAP_results/process_4.txt', 'RAP_results/process_5.txt', 'RAP_results/process_6.txt', 'RAP_results/process_7.txt', 'RAP_results/process_8.txt', 'RAP_results/process_9.txt']


In [14]:
result_rap.trace

([BWStateRAP(step_idx=0, last_blocks_state='', blocks_state='the blue block is clear, the orange block is clear, the hand is empty, the orange block is on top of the red block, the red block is on the table and the blue block is on the table.', buffered_action=''),
  BWStateRAP(step_idx=1, last_blocks_state='the blue block is clear, the orange block is clear, the hand is empty, the orange block is on top of the red block, the red block is on the table and the blue block is on the table.', blocks_state='the blue block is clear, the orange block is in the hand, the red block is clear, the hand is holding the orange block, the blue block is on the table, and the red block is on the table.', buffered_action='unstack the orange block from on top of the red block'),
  BWStateRAP(step_idx=2, last_blocks_state='the blue block is clear, the orange block is in the hand, the red block is clear, the hand is holding the orange block, the blue block is on the table, and the red block is on the table

Finally, we get a valid solution!

## Visualization

Visualization is as simple as calling `visualize(log)`

In [15]:
from reasoners.visualization import visualize
from reasoners.visualization.tree_snapshot import NodeData, EdgeData
from reasoners.algorithm.mcts import MCTSNode


# (Optional) You can write node_data_factory and edge_data_factory to show customized information.
def blocksworld_node_data_factory(n: MCTSNode) -> NodeData:
    return NodeData({"block state": n.state.blocks_state if n.state else "Not expanded",
                     "# goals satisfied": n.reward_details["goal_reached"][1] if hasattr(n, "reward_details") else "N/A",
                     "# visited": len(n.cum_rewards)})

def blocksworld_edge_data_factory(n: MCTSNode) -> EdgeData:
    return EdgeData({"Q": n.Q,
                     "intuition": n.fast_reward_details["intuition"],
                     "self_eval": n.fast_reward_details["self_eval"],
                     "action": n.action})

visualize(result_rap,
          node_data_factory=blocksworld_node_data_factory,
          edge_data_factory=blocksworld_edge_data_factory)

Getting log upload link...
Tree log size: 50506 bytes
Tree log compressed size: 1515 bytes
Uploading log...
Visualizer URL: https://main.d1puk3wdon4rk8.amplifyapp.com/visualizer/31b74c5f-37d1-4fb2-ae4b-ca399f837d9a?accessKey=6f66ce74


The evaluator module provides standard APIs and ready-made implementations for several popular reasoning datasets.

In [5]:
# a helper function to extract the action history from the output of the algorithm

def bfs_bw_extractor(algo_output):
    if torch.distributed.is_initialized():
        torch.distributed.barrier()
    # to make sure the plan is saved before evaluation in multi-process setting
    try:
        return "\n".join(algo_output.terminal_node.state.action_history)
    except Exception as e:
        print("Error in output extraction,", e)
        return ""

In [3]:
import os

# Replace this with the actual path to your VAL folder
val_path = "data/irving/VAL/bin"  
os.environ["VAL"] = val_path

In [6]:
import torch
from reasoners.benchmark import BWEvaluator
import json

torch.cuda.set_device(0)

with open('examples/CoT/blocksworld/prompts/pool_prompt_v1.json') as f:
    prompt = json.load(f)

evaluator = BWEvaluator(config_file='examples/CoT/blocksworld/data/bw_config.yaml',
                        domain_file='examples/CoT/blocksworld/data/generated_domain.pddl',
                        data_path='examples/CoT/blocksworld/data/split_v1/split_v1_step_4_data.json',
                        init_prompt=prompt,
                        output_extractor=bfs_bw_extractor)

world_model = BlocksWorldModelToT(base_model=model, prompt=prompt)
config = BWConfigToT(base_model=model, prompt=prompt)
algorithm = BeamSearch(beam_size=4, max_depth=7)
reasoner_tot = Reasoner(world_model=world_model, search_config=config, search_algo=algorithm)

evaluator.evaluate(reasoner_tot, shuffle_prompt=True, num_shot=4, resume=0)

NameError: name 'BlocksWorldModelToT' is not defined

In [2]:
from reasoners.benchmark import prosqa_utils


In [4]:
from reasoners.benchmark.prosqa import ProsQAEvaluator, ProsQAReasoner
evaluator = ProsQAEvaluator(output_extractor=prosqa_utils.prosqa_extractor, answer_extractor = lambda x: x["answer"])

reasoner = ProsQAReasoner(model)
evaluator.evaluate(reasoner, shuffle_prompt=True, num_shot=4, resume=0)

prosQA:   0%|          | 1/500 [00:05<42:17,  5.09s/it]

To determine if Sally is a hilpus or sterpus, we need to analyze the given statements.

From the statements, we know that Sally is a boompus (Sally is a boompus) and Sally is a gerpus (Sally is a gerpus). 

Since every gerpus is a scrompus (Every gerpus is a scrompus), Sally is a scrompus.

Also, Sally is a scrompus (Sally is a scrompus) and every scrompus is a rempus (Every scrompus is a rempus), Sally is a rempus.

Furthermore, Sally is a rempus (Sally is a rempus) and every rempus is a sterpus (Every rempus is a sterpus), Sally is a sterpus.

Therefore, Sally is a sterpus.

Answer: Sally is a sterpus.
Case #1: correct=True, output='Sally is a sterpus.', answer='Sally is a sterpus.';accuracy=1.000 (1/1)


prosQA:   0%|          | 2/500 [00:11<46:52,  5.65s/it]

To determine if Bob is a shumpus or gerpus, we need to analyze the given information.

1. Bob is a storpus.
2. Every storpus is a impus.
3. Every impus is a yerpus.
4. Every impus is a zhorpus.
5. Every bompus is a kerpus.
6. Every kerpus is a rompus.
7. Every rompus is a gerpus.
8. Bob is a rempus.
9. Every rempus is a storpus.
10. Bob is a yerpus.

From the given information, we can conclude that Bob is both a rempus and a storpus. Since every rempus is a storpus, we can say that Bob is a storpus. 

Now, let's analyze the relationship between storpus and gerpus:
1. Bob is a storpus.
2. Every storpus is a impus.
3. Every impus is a yerpus.
4. Every yerpus is a gerpus.

Since Bob is a storpus, and every storpus is an impus, and every impus is a yerpus, and every yerpus is a gerpus, we can conclude that Bob is a gerpus.

Therefore, the answer is: Answer: Bob is a gerpus.
Case #2: correct=False, output='Bob is a gerpus.', answer='Bob is a shumpus.';accuracy=0.500 (1/2)


prosQA:   1%|          | 3/500 [00:15<41:05,  4.96s/it]

To determine if Max is a boompus or gwompus, we need to analyze the given information.

1. Max is a storpus.
2. Every storpus is a sterpus.
3. Every sterpus is a bompus.
4. Every bompus is a worpus.
5. Every worpus is a boompus.

From the above information, we can conclude that Max is a boompus.

Answer: Max is a boompus.
Case #3: correct=True, output='Max is a boompus.', answer='Max is a boompus.';accuracy=0.667 (2/3)


prosQA:   1%|          | 4/500 [00:21<44:09,  5.34s/it]

To determine if Davis is a worpus or chorpus, we need to analyze the given information.

From the given statements, we know that Davis is a lorpus. 

We also know that every lorpus is a zumpus, every lorpus is a terpus, and every lorpus is a numpus.

Since every zumpus is a chorpus, and Davis is a lorpus which is a zumpus, Davis is a chorpus.

However, we also know that every lorpus is a terpus, and every terpus is a sterpus. But we also know that every yumpus is a sterpus, and every numpus is a yumpus. So, since Davis is a lorpus which is a numpus, Davis is a sterpus.

But we also know that every sterpus is a chorpus. So, Davis is a chorpus.

However, we also know that Davis is a vumpus. And every vumpus is a rempus, every rempus is a hilpus, and every hilpus is a chorpus. So, Davis is a chorpus.

Considering all the information, we can conclude that Davis is a chorpus.

Answer: Davis is a chorpus.
Case #4: correct=False, output='Davis is a chorpus.', answer='Davis is a worpus.';accur

prosQA:   1%|          | 5/500 [00:28<49:11,  5.96s/it]

To determine if Rex is a rompus or lempus, we need to find the relationship between jelpus and rompus or lempus.

From the given information, we know that Rex is a jelpus. 

We also know that every jelpus is a shumpus, every jelpus is a hilpus, every jelpus is a tumpus, every tumpus is a zumpus, every zumpus is a boompus, every boompus is a lorpus, every lorpus is a lempus, and every shumpus is a kerpus.

However, we also know that every zumpus is a rompus, every kerpus is a rompus, and every lorpus is a lempus.

Since every zumpus is a rompus and every lorpus is a lempus, and every zumpus is a lorpus, we can conclude that every zumpus is both a rompus and a lempus.

However, we also know that every zumpus is a boompus, and every boompus is a lorpus. This means that every zumpus is a lorpus, which is a lempus.

Since every zumpus is a rompus and a lempus, and every tumpus is a zumpus, we can conclude that every tumpus is a rompus and a lempus.

Since every jelpus is a tumpus, we can co

prosQA:   1%|          | 6/500 [00:33<48:01,  5.83s/it]

To determine if Tom is a boompus or vumpus, we need to analyze the given information.

1. Tom is a terpus.
2. Every terpus is a numpus.
3. Every numpus is a dumpus.
4. Every dumpus is a gorpus.
5. Every wumpus is a gorpus.
6. Every wumpus is a impus.
7. Every impus is a gorpus.
8. Every wumpus is a vumpus.

From the information given, we can conclude that Tom is a terpus, which is a numpus, which is a dumpus, which is a gorpus. However, we also know that Tom is a terpus, which is a wumpus, which is a vumpus.

Since Tom is a terpus, and every terpus is a wumpus, we can conclude that Tom is a wumpus. And since every wumpus is a vumpus, we can conclude that Tom is a vumpus.

Answer: Tom is a vumpus.
Case #6: correct=True, output='Tom is a vumpus.', answer='Tom is a vumpus.';accuracy=0.500 (3/6)


prosQA:   1%|▏         | 7/500 [00:43<58:46,  7.15s/it]

To determine if Stella is a wumpus or hilpus, we need to find the relationships between Stella and these two terms.

1. Stella is a boompus.
2. Every boompus is a brimpus.
3. Stella is a brimpus.

So, Stella is a brimpus.

4. Every rorpus is a sterpus.
5. Every boompus is a rorpus.
6. Stella is a boompus.
7. Therefore, Stella is a rorpus.

Now, Stella is both a brimpus and a rorpus.

8. Every rorpus is a sterpus.
9. Stella is a rorpus.
10. Therefore, Stella is a sterpus.

11. Every numpus is a impus.
12. Stella is a brimpus.
13. Every brimpus is a rorpus.
14. Every rorpus is a sterpus.
15. Every sterpus is a zumpus.
16. Every zumpus is a yumpus.
17. Every yumpus is a numpus.
18. Therefore, Stella is a numpus.

19. Stella is a numpus.
20. Every numpus is a impus.
21. Therefore, Stella is an impus.

22. Stella is a shumpus.
23. Every shumpus is a zumpus.
24. Every zumpus is a yumpus.
25. Every yumpus is a numpus.
26. Therefore, Stella is a numpus.

27. Stella is a numpus.
28. Every numpu

prosQA:   2%|▏         | 8/500 [00:50<58:47,  7.17s/it]

To determine if Fae is a yerpus or dumpus, we need to analyze the given information.

1. Every brimpus is a shumpus.
2. Every brimpus is a storpus.
3. Every storpus is a shumpus.
4. Fae is a brimpus.

From the above information, we can conclude that Fae is a shumpus (from 1 and 4) and also a storpus (from 2 and 4).

5. Every storpus is a gorpus.
6. Every storpus is a dumpus.

From the above information, we can conclude that Fae is a gorpus (from 5) and also a dumpus (from 6).

7. Every gorpus is a bompus.
8. Every bompus is a vumpus.
9. Every vumpus is a gwompus.

From the above information, we can conclude that Fae is a vumpus (from 8) and also a gwompus (from 9).

10. Every vumpus is a jompus.
11. Every jompus is a yerpus.

From the above information, we can conclude that Fae is a jompus (from 10) and also a yerpus (from 11).

However, we also know that Fae is a dumpus (from 6). 

Since Fae is both a dumpus and a yerpus, we can conclude that Fae is both a dumpus and a yerpus.

But, w

prosQA:   2%|▏         | 9/500 [00:55<51:39,  6.31s/it]

To determine whether Davis is a quimpus or a bompus, we need to analyze the given information.

From the given statements, we know that Davis is a dumpus. 

We also know that every boompus is a dumpus, and every lorpus is a dumpus. 

However, we also know that every lorpus is a quimpus. 

Therefore, since Davis is a dumpus and every lorpus is a dumpus and a quimpus, Davis must be a lorpus, and consequently, a quimpus.

Answer: Davis is a quimpus.
Case #9: correct=True, output='Davis is a quimpus.', answer='Davis is a quimpus.';accuracy=0.556 (5/9)


prosQA:   2%|▏         | 9/500 [00:57<52:23,  6.40s/it]


KeyboardInterrupt: 
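
The `Case #N: ...` lines printed above follow a regular pattern, so they can be tallied after the fact, for example when a run is interrupted as it was here. The snippet below is a minimal sketch and not part of the library: `CASE_RE` and `tally` are hypothetical helper names, and the sample log is copied from the first three cases above. Cases whose log line was cut off before `correct=...` cannot be recovered this way.

```python
import re

# Hypothetical helper (not part of llm-reasoners): matches the per-case lines
# printed by the evaluator run above.
CASE_RE = re.compile(
    r"Case #(\d+): correct=(True|False), output='(.*?)', answer='(.*?)'"
)

def tally(log_text):
    """Return {case_number: was_correct} for every case line found in log_text."""
    return {int(m.group(1)): m.group(2) == "True" for m in CASE_RE.finditer(log_text)}

# First three case lines from the output above.
log = """Case #1: correct=True, output='Sally is a sterpus.', answer='Sally is a sterpus.';accuracy=1.000 (1/1)
Case #2: correct=False, output='Bob is a gerpus.', answer='Bob is a shumpus.';accuracy=0.500 (1/2)
Case #3: correct=True, output='Max is a boompus.', answer='Max is a boompus.';accuracy=0.667 (2/3)"""

results = tally(log)
correct = sum(results.values())
print(f"{correct}/{len(results)} correct")
```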

In [None]:
question = "Every gerpus is a terpus. Every terpus is a zhorpus. Every lempus is a yerpus. Every boompus is a zhorpus. Every brimpus is a rempus. Every lempus is a jelpus. Every lorpus is a rorpus. Bob is a yerpus. Every worpus is a rempus. Every lempus is a impus. Every rempus is a sterpus. Every yimpus is a zumpus. Every lempus is a yumpus. Every shumpus is a jelpus. Every brimpus is a zhorpus. Every scrompus is a rempus. Every lempus is a wumpus. Sally is a boompus. Sally is a gerpus. Every gerpus is a scrompus. Bob is a wumpus. Every wumpus is a lorpus. Every yerpus is a rorpus. Sally is a terpus. Every gerpus is a zhorpus. Every yimpus is a zhorpus. Every boompus is a terpus. Every gerpus is a worpus. Bob is a lorpus. Every gerpus is a yimpus. Every scrompus is a brimpus. Every lempus is a rorpus. Every lempus is a shumpus. Bob is a jelpus. Sally is a scrompus. Every gerpus is a brimpus. Every lempus is a lorpus. Every boompus is a yumpus. Every scrompus is a zumpus. Every zhorpus is a tumpus. Sally is a zumpus. Every lempus is a storpus. Every yerpus is a lorpus. Every scrompus is a zhorpus. Every yimpus is a rempus. Every impus is a jelpus. Jack is a yimpus. Every yerpus is a wumpus. Every rorpus is a hilpus. Every yimpus is a sterpus. Bob is a lempus. Every worpus is a storpus. Every rorpus is a impus. Every boompus is a gerpus. Is Sally a hilpus or sterpus?\n\nPlease output your conclusion like 'Answer: XXX is YYY'"

In [33]:
# `question` is the prompt string defined in the cell above.
output = model.generate([question],
                        hide_input=True,
                        eos_token_id='\n[').text[0]

In [34]:
output

'To determine if Sally is a hilpus or sterpus, we need to analyze the given information.\n\nFrom the given statements, we know that Sally is a boompus (Sally is a boompus) and Sally is a gerpus (Sally is a gerpus). \n\nSince every gerpus is a scrompus (Every gerpus is a scrompus), Sally is a scrompus.\n\nWe also know that Sally is a scrompus (Sally is a scrompus) and every scrompus is a rempus (Every scrompus is a rempus), so Sally is a rempus.\n\nFurthermore, every rempus is a sterpus (Every rempus is a sterpus), so Sally is a sterpus.\n\nTherefore, Sally is a sterpus.\n\nAnswer: Sally is a sterpus.'
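
The model's chain of reasoning above can be checked independently, because a ProsQA question amounts to a reachability query over "Every X is a Y" edges. The snippet below is a minimal sketch of that check and not part of the library: `parse_facts` and `reachable` are hypothetical helpers, and `facts` is a short excerpt of the statements from the question, not the full prompt.

```python
import re
from collections import defaultdict, deque

def parse_facts(text):
    """Split ProsQA-style statements into class edges and entity memberships."""
    edges = defaultdict(set)    # e.g. "gerpus" -> {"scrompus"}
    members = defaultdict(set)  # e.g. "Sally" -> {"boompus", "gerpus"}
    for sub, obj in re.findall(r"Every (\w+) is an? (\w+)\.", text):
        edges[sub].add(obj)
    # Entity statements start with a capitalised name directly followed by " is a".
    for name, obj in re.findall(r"\b([A-Z]\w*) is an? (\w+)\.", text):
        members[name].add(obj)
    return edges, members

def reachable(edges, members, name):
    """Breadth-first search: every class the entity belongs to, directly or transitively."""
    seen = set(members[name])
    queue = deque(seen)
    while queue:
        cur = queue.popleft()
        for nxt in edges.get(cur, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Excerpt of the statements in the question (assumption: a subset is enough to illustrate).
facts = (
    "Sally is a boompus. Sally is a gerpus. Every gerpus is a scrompus. "
    "Every scrompus is a rempus. Every rempus is a sterpus. Every rorpus is a hilpus."
)
edges, members = parse_facts(facts)
classes = reachable(edges, members, "Sally")
print("sterpus" in classes, "hilpus" in classes)
```

On this excerpt, `sterpus` is reachable from Sally through gerpus, scrompus and rempus, while `hilpus` is not, which matches the answer the model produced.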