## Does a reasoning model provide better solutions to the MacGyver problems?

The MacGyver problems are a benchmark for testing out-of-the-box thinking on commonplace problems that are grounded in the physical world. Not every problem is solvable, and the LLM does not always solve the ones that are. The original test and paper date from March 2024, and the model with the best overall performance at that time was GPT-4.

Given the latest excitement and hype around reasoning models, we thought it would be interesting to see if a reasoning model can provide better solutions to the MacGyver problems.

In this notebook, we take the following steps:

1. Read in the Excel file with problem statements
2. Send the problem statements to a reasoning model with the default prompt style used in the paper
3. Get the responses from the reasoning model
4. Where possible, compare the responses with the ground truth solutions

### Step 1: Read in the Excel file with problem statements

In [5]:
!pip install pandas
!pip install openpyxl

Collecting openpyxl
  Using cached openpyxl-3.1.5-py2.py3-none-any.whl.metadata (2.5 kB)
Collecting et-xmlfile (from openpyxl)
  Using cached et_xmlfile-2.0.0-py3-none-any.whl.metadata (2.7 kB)
Using cached openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
Using cached et_xmlfile-2.0.0-py3-none-any.whl (18 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-2.0.0 openpyxl-3.1.5


In [6]:
import pandas as pd

# Read the Excel file
df = pd.read_excel('../../data/MacGyver/problem_solution_pair.xlsx')

# Display the first 5 rows
display(df.head())

Unnamed: 0,ID,Problem,Solvable?,Unconventional?,Solution,Label
0,541,You spilled red wine on the hotel carpet and w...,Yes,unconventional,Step1: Open the bottle of mineral water with t...,inefficient
1,542,You accidentally locked your only pair of glas...,Yes,conventional,Step1: Use the remaining battery in your smart...,infeasible
2,543,You have an important meeting but your suit is...,Yes,unconventional,Step1: Hang the suit on the coat hanger on the...,inefficient
3,544,The hotel bathroom door handle is broken and y...,Yes,unconventional,Step 1: Unbend the wire hanger and flatten it ...,inefficient
4,545,The hotel's WiFi signal is weak and you have a...,Yes,conventional,"Step1: Using the clothes hanger, create a hook...",infeasible


In [7]:
# Sample 10 random requests
sampled_df = df.sample(n=10, random_state=42)  # Setting random_state for reproducibility

# Display the sampled problems
display(sampled_df)

Unnamed: 0,ID,Problem,Solvable?,Unconventional?,Solution,Label
1394,1322,You are on a road trip and the car breaks down...,No,,It is not possible to fix a car engine with ju...,correct_right_reason
743,911,A baseball has been thrown over a towering fen...,Yes,conventional,Step1: Tie the two jump ropes together to leng...,infeasible
1606,1549,Your water purification tablets fell into a de...,No,,Step1: Use the pot to collect rainwater or sta...,wrong_solution
49,358,"Your puppy chewed up a pillow, and now there a...",Yes,unconventional,Step1: Use the broken broom to gather the feat...,inefficient
188,384,"You have to hang a wall decoration, but you c...",Yes,unconventional,Step1: Align the deck of cards along the wall ...,inefficient
482,885,"You want to make a sun dial to judge time, but...",Yes,unconventional,Step1: Use the hand mirror as the sundial base...,infeasible
892,1860,"You are exploring through a zoo, and there's a...",Yes,unconventional,Step1: Fill the metallic bucket with fresh veg...,infeasible
613,1372,"During a beach cleanup, you come across a heav...",Yes,unconventional,Step1: Use the volleyball as a roller under th...,efficient
1628,1640,A crack has formed on your car's radiator and ...,No,,Step1: Clean the cracked area on the radiator ...,wrong_solution
1347,791,You need to inflate a flat pool float but have...,No,,Step1: Take the trash bag and secure it around...,wrong_solution


### Step 2: Call the OpenAI API with a sample of the problems

1. Call the GPT-4o model to confirm that the request goes through and to see how it responds
2. Call the o1-mini reasoning model to confirm that the request goes through and to see how it responds

In [14]:
!pip install openai
!pip install python-dotenv

Collecting python-dotenv
  Using cached python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Using cached python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


In [19]:
import os
from dotenv import load_dotenv
load_dotenv('../../.env.dev')

True

In [31]:
import os
from openai import OpenAI
import time

# Initialize the OpenAI client
client = OpenAI()  # Make sure OPENAI_API_KEY is set in your environment variables
systemPrompt = "Give a valid (feasible and efficient) solution very concisely. Use step1, step2, etc, and mention the tools to achieve each step. Use as few steps as possible and the answer should ideally be less than 100 words. When there is not a feasible solution given the constraint and provided tools, just say that it is not possible and give a very short justification."

def get_completion(prompt):
    try:
        response = client.chat.completions.create(
            model="gpt-4o",  # Using GPT-4o (cheaper model)
            messages=[
                {"role": "system", "content": systemPrompt},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error occurred: {e}")
        return None
    
def get_reasoning_completion(prompt):
    try:
        response = client.chat.completions.create(
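            model="o1-mini",  # Using the o1-mini reasoning model
            messages=[
                # No system message is sent here; the instructions are prepended to the user prompt
                {"role": "user", "content": systemPrompt + "\n\n" + prompt}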
            ]
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error occurred: {e}")
        return None

# Create a list to store responses
responses = []

# Iterate through the sampled problems
for position, (idx, row) in enumerate(sampled_df.iterrows(), start=1):
    # idx is the original DataFrame index label, so report the loop position instead
    print(f"Processing problem {position}/{len(sampled_df)}...")
    
    # Create the prompt
    prompt = f"Please solve this problem: {row['Problem']}"
    
    # Get completion and add delay to respect rate limits
    # response = get_completion(prompt)
    response = get_reasoning_completion(prompt)
    responses.append(response)
    time.sleep(1)  # Add a delay between requests to respect rate limits

# Add responses to the dataframe
sampled_df['model_response'] = responses

# Display the results
display(sampled_df[['Problem', 'model_response']])

Processing problem 1/10...
Processing problem 2/10...
Processing problem 3/10...
Processing problem 4/10...
Processing problem 5/10...
Processing problem 6/10...
Processing problem 7/10...
Processing problem 8/10...
Processing problem 9/10...
Processing problem 10/10...


Unnamed: 0,Problem,model_response
1394,You are on a road trip and the car breaks down...,It is not possible to fix the engine with the ...
743,A baseball has been thrown over a towering fen...,**Step 1:** Tie one jump rope securely to the ...
1606,Your water purification tablets fell into a de...,**Step 1:** Use the Swiss army knife’s hook to...
49,"Your puppy chewed up a pillow, and now there a...",**Step1:** Use the **fan** to blow and gather ...
188,"You have to hang a wall decoration, but you c...",**Step 1:** Lay multiple deck of cards side by...
482,"You want to make a sun dial to judge time, but...",**Step1:** Use the paper plate as the base of ...
892,"You are exploring through a zoo, and there's a...",**Step1:** Attach the basket of vegetables sec...
613,"During a beach cleanup, you come across a heav...",It is not possible to safely move the heavy lo...
1628,A crack has formed on your car's radiator and ...,**Step1:** Clean the radiator crack area using...
1347,You need to inflate a flat pool float but have...,**Step1:** Attach the straw securely to the po...
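
In the cell above, `get_reasoning_completion` returns `None` when a request fails, so a failed call leaves an empty response in the table. A minimal sketch of one way to retry such calls with an increasing delay follows. The helper name `call_with_retries` and its parameters are hypothetical and not part of the original notebook.

```python
import time


def call_with_retries(fn, prompt, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(prompt) until it returns a non-None value or attempts run out.

    Hypothetical helper: assumes fn signals failure by returning None,
    as get_completion and get_reasoning_completion do in this notebook.
    """
    for attempt in range(max_attempts):
        result = fn(prompt)
        if result is not None:
            return result
        # Wait longer after each failure (1x, 2x, 4x, ... base_delay),
        # but do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None
```

In the loop, `response = call_with_retries(get_reasoning_completion, prompt)` could stand in for the direct call.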


In [32]:
display(sampled_df[['Problem', 'model_response','Solvable?','Solution']])

Unnamed: 0,Problem,model_response,Solvable?,Solution
1394,You are on a road trip and the car breaks down...,It is not possible to fix the engine with the ...,No,It is not possible to fix a car engine with ju...
743,A baseball has been thrown over a towering fen...,**Step 1:** Tie one jump rope securely to the ...,Yes,Step1: Tie the two jump ropes together to leng...
1606,Your water purification tablets fell into a de...,**Step 1:** Use the Swiss army knife’s hook to...,No,Step1: Use the pot to collect rainwater or sta...
49,"Your puppy chewed up a pillow, and now there a...",**Step1:** Use the **fan** to blow and gather ...,Yes,Step1: Use the broken broom to gather the feat...
188,"You have to hang a wall decoration, but you c...",**Step 1:** Lay multiple deck of cards side by...,Yes,Step1: Align the deck of cards along the wall ...
482,"You want to make a sun dial to judge time, but...",**Step1:** Use the paper plate as the base of ...,Yes,Step1: Use the hand mirror as the sundial base...
892,"You are exploring through a zoo, and there's a...",**Step1:** Attach the basket of vegetables sec...,Yes,Step1: Fill the metallic bucket with fresh veg...
613,"During a beach cleanup, you come across a heav...",It is not possible to safely move the heavy lo...,Yes,Step1: Use the volleyball as a roller under th...
1628,A crack has formed on your car's radiator and ...,**Step1:** Clean the radiator crack area using...,No,Step1: Clean the cracked area on the radiator ...
1347,You need to inflate a flat pool float but have...,**Step1:** Attach the straw securely to the po...,No,Step1: Take the trash bag and secure it around...
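
The side-by-side table above can be reduced to a rough agreement count between the model's verdict and the `Solvable?` label. The sketch below is a keyword heuristic only, not the evaluation used in the paper. The phrase list and the function names `claims_unsolvable` and `solvability_agreement` are assumptions made for illustration.

```python
import re

# Hypothetical phrase list; extend it after reading real responses.
_UNSOLVABLE_PATTERN = re.compile(
    r"\b(?:not possible|impossible|cannot be (?:done|solved))\b",
    re.IGNORECASE,
)


def claims_unsolvable(response):
    """Return True/False for a text response, or None if there is no usable text."""
    if not isinstance(response, str) or not response.strip():
        return None
    return bool(_UNSOLVABLE_PATTERN.search(response))


def solvability_agreement(rows):
    """Count rows where the model's verdict matches the 'Solvable?' label."""
    counts = {"agree": 0, "disagree": 0, "skipped": 0}
    for row in rows:
        verdict = claims_unsolvable(row.get("model_response"))
        if verdict is None:
            counts["skipped"] += 1
            continue
        labelled_solvable = str(row["Solvable?"]).strip().lower() == "yes"
        model_says_solvable = not verdict
        key = "agree" if labelled_solvable == model_says_solvable else "disagree"
        counts[key] += 1
    return counts
```

With the notebook's data, this could be called as `solvability_agreement(sampled_df.to_dict("records"))`. A keyword match will misjudge responses that mention impossibility in passing, so the counts are a starting point for manual review.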


In [33]:
# Save to excel file
sampled_df.to_excel('../../data/MacGyver/o1_response.xlsx', index=False)

### Step 3: Summary and Analysis

- The calls go through, and the responses often disagree with the dataset on whether a problem is solvable and on whether a solution is feasible.
- Many of the generated results are also hard to evaluate, especially because there are no defined answers and one could argue that the LLM came up with a creative solution.
- In the original paper, the authors sampled a set of 323 questions and had human annotators go through the answers from each LLM and from humans and judge whether each answer was correct.
- They went one step further and classified the answers into five categories.

So the open question is: how can we judge the quality and accuracy of the responses generated by reasoning models, to determine whether reasoning LLMs produce better answers?

Possible approach: since we do not have the capacity to hire human annotators, our goal is to do this on a subset of the data. Here's what we are going to do:

1. Pick a set of 10 questions that meet the following conditions:
    - The puzzle is interesting and could appeal to potential readers
    - The puzzle is marked solvable in the original dataset, since there is no point worrying about puzzles that can't be solved
    - The puzzle already has a human solution in the other file, which allows comparison with the LLM solutions
2. For these 10 questions, ask GPT-4o, o3 and DeepSeek-R1 for responses. Start with 1 response and consider additional calls if required.
3. For each of the 10 questions, analyze each response and determine:
    - Whether the LLM marked the problem as solvable: this gives a comparison between LLMs
    - How creative the LLM solution is compared to the human solution: this tells us if there is creativity in the LLM
    - How valid/feasible the LLM solution is: this tells us whether it actually makes sense
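
The plan above can be sketched as a small helper that collects one response per model for each chosen question and leaves blank fields for the three manual judgments. This is a sketch under stated assumptions: the function name `build_annotation_sheet`, the annotation column names, and the idea of passing each model as a callable are hypothetical, and the callables for o3 and DeepSeek-R1 would still need to be written.

```python
def build_annotation_sheet(problems, model_fns):
    """Return one row per (problem, model) pair with blank annotation fields.

    problems:  list of dicts with 'ID', 'Problem' and optionally 'Solution'.
    model_fns: dict mapping a model name to a callable that takes a prompt
               string and returns the response text (or None on failure).
    """
    rows = []
    for problem in problems:
        for model_name, ask in model_fns.items():
            rows.append({
                "ID": problem["ID"],
                "Problem": problem["Problem"],
                "human_solution": problem.get("Solution"),
                "model": model_name,
                "model_response": ask(problem["Problem"]),
                # Hypothetical columns, filled in by hand during review.
                "marked_solvable": None,
                "creativity_vs_human": None,
                "feasibility": None,
            })
    return rows
```

For example, `build_annotation_sheet(selected, {"gpt-4o": get_completion})` would reuse the notebook's existing function, where `selected` is a hypothetical list of the 10 chosen problems. The resulting rows can be passed to `pd.DataFrame(rows)` and saved with `to_excel` for annotation.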