## Does a reasoning model provide better solutions to the MacGyver problems?

The MacGyver problems are a set of problems that test the ability to think outside the box when solving everyday problems grounded in the physical world. Not all of the problems are solvable, and the LLM does not always solve those that are. The original paper and its evaluation date from March 2024, and the model with the best overall performance at that time was GPT-4.

Given the latest excitement and hype around reasoning models, we thought it would be interesting to see if a reasoning model can provide better solutions to the MacGyver problems.

This notebook works through the following steps:

1. Read in the Excel file with problem statements
2. Send the problem statements to a reasoning model with the default prompt style used in the paper
3. Get the responses from the reasoning model
4. Potentially compare the responses with the ground truth solutions

### Step 1: Read in the Excel file with problem statements

In [2]:
!pip install pandas
!pip install openpyxl



In [3]:
import pandas as pd

# Read the Excel file
df = pd.read_excel('../../data/MacGyver/problem_solution_pair.xlsx')

# Display the first 5 rows
display(df.head())

Unnamed: 0,ID,Problem,Solvable?,Unconventional?,Solution,Label,IsInteresting
0,541,You spilled red wine on the hotel carpet and w...,Yes,unconventional,Step1: Open the bottle of mineral water with t...,inefficient,1.0
1,542,You accidentally locked your only pair of glas...,Yes,conventional,Step1: Use the remaining battery in your smart...,infeasible,
2,543,You have an important meeting but your suit is...,Yes,unconventional,Step1: Hang the suit on the coat hanger on the...,inefficient,
3,544,The hotel bathroom door handle is broken and y...,Yes,unconventional,Step 1: Unbend the wire hanger and flatten it ...,inefficient,0.0
4,545,The hotel's WiFi signal is weak and you have a...,Yes,conventional,"Step1: Using the clothes hanger, create a hook...",infeasible,


In [7]:
# Sample 10 random requests
sampled_df = df.sample(n=10, random_state=42)  # Setting random_state for reproducibility

# Display the sampled problems
display(sampled_df)

Unnamed: 0,ID,Problem,Solvable?,Unconventional?,Solution,Label
1394,1322,You are on a road trip and the car breaks down...,No,,It is not possible to fix a car engine with ju...,correct_right_reason
743,911,A baseball has been thrown over a towering fen...,Yes,conventional,Step1: Tie the two jump ropes together to leng...,infeasible
1606,1549,Your water purification tablets fell into a de...,No,,Step1: Use the pot to collect rainwater or sta...,wrong_solution
49,358,"Your puppy chewed up a pillow, and now there a...",Yes,unconventional,Step1: Use the broken broom to gather the feat...,inefficient
188,384,"You have to hang a wall decoration, but you c...",Yes,unconventional,Step1: Align the deck of cards along the wall ...,inefficient
482,885,"You want to make a sun dial to judge time, but...",Yes,unconventional,Step1: Use the hand mirror as the sundial base...,infeasible
892,1860,"You are exploring through a zoo, and there's a...",Yes,unconventional,Step1: Fill the metallic bucket with fresh veg...,infeasible
613,1372,"During a beach cleanup, you come across a heav...",Yes,unconventional,Step1: Use the volleyball as a roller under th...,efficient
1628,1640,A crack has formed on your car's radiator and ...,No,,Step1: Clean the cracked area on the radiator ...,wrong_solution
1347,791,You need to inflate a flat pool float but have...,No,,Step1: Take the trash bag and secure it around...,wrong_solution


### Step 2: Call the OpenAI API with a sample of the problems

1. Call the GPT-4o model to check how it responds and to confirm that the API call works
2. Call the o1-mini model to check how it responds and to confirm that the API call works

In [5]:
!pip install openai
!pip install python-dotenv



In [21]:
import os
from dotenv import load_dotenv
load_dotenv('../../.env.dev')

True

In [31]:
import os
from openai import OpenAI
import time

# Initialize the OpenAI client
client = OpenAI()  # Make sure OPENAI_API_KEY is set in your environment variables
systemPrompt = "Give a valid (feasible and efficient) solution very concisely. Use step1, step2, etc, and mention the tools to achieve each step. Use as few steps as possible and the answer should ideally be less than 100 words. When there is not a feasible solution given the constraint and provided tools, just say that it is not possible and give a very short justification."

def get_completion(prompt):
    try:
        response = client.chat.completions.create(
            model="gpt-4o",  # Using GPT-4o (cheaper model)
            messages=[
                {"role": "system", "content": systemPrompt},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error occurred: {e}")
        return None
    
def get_reasoning_completion(prompt):
    try:
        response = client.chat.completions.create(
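            model="o1-mini",  # Using the o1-mini reasoning model
            messages=[
                # No system message is sent here; the instructions are prepended to the user message
                {"role": "user", "content": systemPrompt + "\n\n" + prompt}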
            ]
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error occurred: {e}")
        return None

# Create a list to store responses
responses = []

# Iterate through the sampled problems
for position, (_, row) in enumerate(sampled_df.iterrows(), 1):
    # Use a running counter; the DataFrame index is the original row label, not a position
    print(f"Processing problem {position}/{len(sampled_df)}...")
    
    # Create the prompt
    prompt = f"Please solve this problem: {row['Problem']}"
    
    # Get completion and add delay to respect rate limits
    # response = get_completion(prompt)
    response = get_reasoning_completion(prompt)
    responses.append(response)
    time.sleep(1)  # Add a delay between requests to respect rate limits

# Add responses to the dataframe
sampled_df['model_response'] = responses

# Display the results
display(sampled_df[['Problem', 'model_response']])

Processing problem 1/10...
Processing problem 2/10...
Processing problem 3/10...
Processing problem 4/10...
Processing problem 5/10...
Processing problem 6/10...
Processing problem 7/10...
Processing problem 8/10...
Processing problem 9/10...
Processing problem 10/10...


Unnamed: 0,Problem,model_response
1394,You are on a road trip and the car breaks down...,It is not possible to fix the engine with the ...
743,A baseball has been thrown over a towering fen...,**Step 1:** Tie one jump rope securely to the ...
1606,Your water purification tablets fell into a de...,**Step 1:** Use the Swiss army knife’s hook to...
49,"Your puppy chewed up a pillow, and now there a...",**Step1:** Use the **fan** to blow and gather ...
188,"You have to hang a wall decoration, but you c...",**Step 1:** Lay multiple deck of cards side by...
482,"You want to make a sun dial to judge time, but...",**Step1:** Use the paper plate as the base of ...
892,"You are exploring through a zoo, and there's a...",**Step1:** Attach the basket of vegetables sec...
613,"During a beach cleanup, you come across a heav...",It is not possible to safely move the heavy lo...
1628,A crack has formed on your car's radiator and ...,**Step1:** Clean the radiator crack area using...
1347,You need to inflate a flat pool float but have...,**Step1:** Attach the straw securely to the po...
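
The loop above spaces requests with a fixed `time.sleep(1)`. If rate-limit errors do show up, one common alternative is to retry with exponential backoff. The following is a minimal sketch under some assumptions: `call_with_backoff` is a hypothetical helper that is not part of the notebook, and it only helps if the wrapped function raises on failure (the `get_completion` helpers above catch exceptions and return `None`, so they would need to re-raise for this to apply).

```python
import random
import time


def call_with_backoff(fn, *args, max_retries=5, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on any exception.

    Hypothetical helper for illustration; the retry count and delays are arbitrary defaults.
    """
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            # Give up and re-raise once the last attempt has failed
            if attempt == max_retries - 1:
                raise
            # Double the wait on each attempt and add a little jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
```

The `sleep` parameter exists only so that the waiting can be replaced in a dry run; in normal use it can be left at its default.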


In [32]:
display(sampled_df[['Problem', 'model_response','Solvable?','Solution']])

Unnamed: 0,Problem,model_response,Solvable?,Solution
1394,You are on a road trip and the car breaks down...,It is not possible to fix the engine with the ...,No,It is not possible to fix a car engine with ju...
743,A baseball has been thrown over a towering fen...,**Step 1:** Tie one jump rope securely to the ...,Yes,Step1: Tie the two jump ropes together to leng...
1606,Your water purification tablets fell into a de...,**Step 1:** Use the Swiss army knife’s hook to...,No,Step1: Use the pot to collect rainwater or sta...
49,"Your puppy chewed up a pillow, and now there a...",**Step1:** Use the **fan** to blow and gather ...,Yes,Step1: Use the broken broom to gather the feat...
188,"You have to hang a wall decoration, but you c...",**Step 1:** Lay multiple deck of cards side by...,Yes,Step1: Align the deck of cards along the wall ...
482,"You want to make a sun dial to judge time, but...",**Step1:** Use the paper plate as the base of ...,Yes,Step1: Use the hand mirror as the sundial base...
892,"You are exploring through a zoo, and there's a...",**Step1:** Attach the basket of vegetables sec...,Yes,Step1: Fill the metallic bucket with fresh veg...
613,"During a beach cleanup, you come across a heav...",It is not possible to safely move the heavy lo...,Yes,Step1: Use the volleyball as a roller under th...
1628,A crack has formed on your car's radiator and ...,**Step1:** Clean the radiator crack area using...,No,Step1: Clean the cracked area on the radiator ...
1347,You need to inflate a flat pool float but have...,**Step1:** Attach the straw securely to the po...,No,Step1: Take the trash bag and secure it around...
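
The table above can be turned into a rough solvability check by comparing the `Solvable?` column with whether the model declined to answer. The sketch below assumes that a refusal always contains the phrase "not possible", which follows the wording of the system prompt but is only a heuristic; `says_unsolvable` and `solvability_agreement` are hypothetical helper names.

```python
import pandas as pd


def says_unsolvable(response):
    """Crude heuristic: treat a response as a refusal if it contains 'not possible'."""
    if not isinstance(response, str):
        return False
    return "not possible" in response.lower()


def solvability_agreement(df, response_col="model_response", truth_col="Solvable?"):
    """Cross-tabulate the dataset's solvability label against the model's verdict."""
    model_label = df[response_col].map(
        lambda r: "unsolvable" if says_unsolvable(r) else "solvable"
    )
    truth_label = (
        df[truth_col].str.strip().str.lower().map({"yes": "solvable", "no": "unsolvable"})
    )
    return pd.crosstab(truth_label, model_label, rownames=["ground_truth"], colnames=["model"])
```

Calling `solvability_agreement(sampled_df)` would give a 2x2 count table. It says nothing about whether a proposed solution is actually feasible, which still needs manual review.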


In [33]:
# Save to excel file
sampled_df.to_excel('../../data/MacGyver/o1_response.xlsx', index=False)

### Step 3: Summary and Analysis

- The calls were made and the responses reviewed. The model often disagrees with the dataset about which problems are solvable, and it is often unclear whether a proposed solution is feasible.
- Many of the generated results are also hard to evaluate, especially because there are no fixed answers, and one could argue that the LLM simply came up with a creative solution.
- In the original paper, the authors sampled a set of 323 questions and had human annotators go through the answers from each LLM and from humans, marking whether each answer was correct.
- In fact, they went one step further and also classified the answers into five categories.

So the open question is: how can we judge the quality and accuracy of the responses generated by the reasoning models, and so determine whether reasoning LLMs produce better answers?

Possible approach: since we do not have the capacity to hire human annotators, our goal is to do this on a subset of the data. Here's what we are going to do:

1. Pick a set of 10 questions that look appealing, applying the following conditions:
    a. Puzzles are interesting and could be appealing when presented to potential readers
    b. Puzzles are marked solvable in the original dataset - no point worrying about puzzles that can't be solved
    c. Prefer puzzles that already have human solutions in the other file -> this is good for comparison with LLM solutions
2. For these 10 questions: ask GPT-4o, o1-mini, o3-mini and DeepSeek-R1 for responses. Start with 1 response and consider additional calls if required.
3. For each of the 10 questions, analyze each response and record:
    a. Whether the LLM marked the problem as solvable -> this gives a comparison between LLMs
    b. How creative the LLM solution is compared to the human solution -> this tells us if there is creativity in the LLM
    c. How valid/feasible the LLM solution is -> this tells us whether it actually makes sense
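
One way to keep the manual review in step 3 consistent is to record each judgment in a small structure. This is only a sketch: the `ManualRating` class, the `summarize` function and the 1-5 scales are assumptions made for illustration, not something defined by the paper or the dataset.

```python
from dataclasses import asdict, dataclass

import pandas as pd


@dataclass
class ManualRating:
    """One reviewer judgment for one model response (hypothetical schema)."""
    problem_id: int
    model: str
    marked_solvable: bool  # did the model attempt a solution?
    creativity: int        # assumed scale: 1 (obvious) to 5 (very creative)
    feasibility: int       # assumed scale: 1 (would not work) to 5 (clearly works)

    def __post_init__(self):
        # Reject scores outside the assumed 1-5 range
        for name in ("creativity", "feasibility"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")


def summarize(ratings):
    """Average the three criteria per model."""
    frame = pd.DataFrame([asdict(r) for r in ratings])
    # Convert booleans to floats so the mean is the fraction of attempted problems
    frame["marked_solvable"] = frame["marked_solvable"].astype(float)
    return frame.groupby("model")[["marked_solvable", "creativity", "feasibility"]].mean()
```

With ten problems and four models this yields at most forty ratings, which is small enough to fill in by hand.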

### Pick 10 questions

I used Cline to generate a simple front-end app that reads questions from the Excel file and presents each one with two buttons: Interesting and Not Interesting. The choice was recorded as an additional column in the Excel sheet.
I then filtered the sheet down to problems that are both Interesting and Solvable, which left 22 problems.
Of these, I picked ten interesting ones, with a preference for those that also have solutions in the other Excel sheet, additional_human_solutions.
The selected problem IDs are:

359
1155
937
1591
541
443
480
798
669
1366
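
The filtering step described above was done in the spreadsheet, but it can also be sketched in pandas. The column names `IsInteresting` and `Solvable?` come from the table shown earlier; the `filter_candidates` helper itself is a hypothetical name, and it assumes interesting problems are stored as `1.0` with unrated rows left empty.

```python
import pandas as pd


def filter_candidates(df):
    """Keep rows marked interesting (1.0) and solvable ('Yes')."""
    # Comparing with eq() treats missing ratings (NaN) as not interesting
    interesting = df["IsInteresting"].eq(1.0)
    solvable = df["Solvable?"].str.strip().str.lower().eq("yes")
    # Return an independent copy so later column assignments are safe
    return df[interesting & solvable].copy()
```

The ten IDs would then be chosen by hand from `filter_candidates(df)`.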

In [4]:
# Create a dataset containing only the selected questions

# Read the Excel file
df = pd.read_excel('../../data/MacGyver/problem_solution_pair.xlsx')

# List of selected problem IDs
selected_ids = [359, 1155, 937, 1591, 541, 443, 480, 798, 669, 1366]

# Filter dataframe to only include selected problems
selected_problems_df = df[df['ID'].isin(selected_ids)]

# Display the filtered dataset
selected_problems_df

Unnamed: 0,ID,Problem,Solvable?,Unconventional?,Solution,Label,IsInteresting
0,541,You spilled red wine on the hotel carpet and w...,Yes,unconventional,Step1: Open the bottle of mineral water with t...,inefficient,1.0
24,443,You need to crush some ice for making cocktai...,Yes,unconventional,Step1: Put on the rubber gloves to prevent you...,inefficient,1.0
50,359,Your cat has managed to get itself onto the to...,Yes,unconventional,Step1: Set up the step stool near the bookcase...,infeasible,1.0
96,480,"Your sleeping bag zipper is broken, and you'r...",Yes,unconventional,Step1: Lay your sleeping bag on the flat groun...,efficient,1.0
223,798,You need to create a safe area in pool for kid...,Yes,unconventional,Step1: Use floating rings by stringing pool no...,infeasible,1.0
306,669,You need to amplify the audio coming from your...,Yes,unconventional,Step1: Trim one end of the paper towel tube to...,inefficient,1.0
579,1155,You need to chop a lot of hard and large veget...,Yes,unconventional,1. Use the peeler to remove the skin of the ve...,infeasible,1.0
608,1366,"While exploring a remote jungle, you accidenta...",Yes,conventional,"Step 1: Use the sturdy vine as a splint, placi...",inefficient,1.0
766,937,You need to grind whole black peppercorns for ...,Yes,unconventional,Step1: Place the peppercorns in the glass jar....,infeasible,1.0
1129,1591,You want to train your flowering vine to grow ...,Yes,unconventional,Step1: Wear gardening gloves for protection. <...,infeasible,1.0


In [7]:
## Call GPT-4o to get the basic set of answers
import os
from openai import OpenAI
import time

# Initialize the OpenAI client
client = OpenAI()  # Make sure OPENAI_API_KEY is set in your environment variables
systemPrompt = "Give a valid (feasible and efficient) solution very concisely. Use step1, step2, etc, and mention the tools to achieve each step. Use as few steps as possible and the answer should ideally be less than 100 words. When there is not a feasible solution given the constraint and provided tools, just say that it is not possible and give a very short justification."

def get_completion(prompt):
    try:
        response = client.chat.completions.create(
            model="gpt-4o",  # Using GPT-4o (cheaper model)
            messages=[
                {"role": "system", "content": systemPrompt},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error occurred: {e}")
        return None
    
# Create a list to store responses
responses = []

# Iterate through each problem in the dataframe
for _, row in selected_problems_df.iterrows():
    # Get the problem text
    problem = row['Problem']
    
    # Add a delay between API calls to avoid rate limiting
    time.sleep(1)
    
    # Get completion for this problem
    response = get_completion(problem)
    
    # Store the response
    responses.append(response)

# Add responses as a new column to the dataframe
selected_problems_df['gpt4o_response'] = responses


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  selected_problems_df['gpt4o_response'] = responses
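
The warning above appears because `selected_problems_df` was created by filtering `df`, so pandas cannot tell whether the new column is meant for the original frame or for the filtered one. A minimal sketch of one common way to avoid it, using toy data and made-up response strings, is to take an explicit copy when filtering:

```python
import pandas as pd

# Toy stand-in for the problem sheet; the values are made up for illustration
df = pd.DataFrame({"ID": [541, 542, 443], "Problem": ["a", "b", "c"]})

# .copy() makes the filtered frame independent of df
selected = df[df["ID"].isin([541, 443])].copy()

# Adding a column now modifies only the copy
selected["gpt4o_response"] = ["r1", "r2"]
```

Applied to this notebook, that would mean adding `.copy()` to the line that builds `selected_problems_df`.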



In [8]:
# Display the problems and their responses
selected_problems_df[['Problem', 'gpt4o_response']]

Unnamed: 0,Problem,gpt4o_response
0,You spilled red wine on the hotel carpet and w...,1. **Blot the Stain**: Use the white bath towe...
24,You need to crush some ice for making cocktai...,1. **Fill the Ziploc Bag**: Fill a small snack...
50,Your cat has managed to get itself onto the to...,Step 1: Place the step stool near the bookcase...
96,"Your sleeping bag zipper is broken, and you'r...",1. **Wrap and Insulate**: Use the large scarf ...
223,You need to create a safe area in pool for kid...,It is not possible. \n\nWithout a means to sec...
306,You need to amplify the audio coming from your...,1. **Prepare the Amplifier**: Cut a small slit...
579,You need to chop a lot of hard and large veget...,It is not possible.\n\nJustification: None of ...
608,"While exploring a remote jungle, you accidenta...",1. **Create Padding:** Use the large leaves to...
766,You need to grind whole black peppercorns for ...,1. **Crush with Glass Jar:** Place a small amo...
1129,You want to train your flowering vine to grow ...,1. **Umbrella Skeleton Frame**: Use the metal ...


GPT-4o findings:
- It declared two of the questions unsolvable even though the ground truth marks them as solvable.
- Additional observations after manual review

In [9]:
## Call o1-mini to get the first set of reasoning model answers

def get_reasoning_completion(prompt):
    try:
        response = client.chat.completions.create(
            model="o1-mini",  # Using the o1-mini reasoning model
            messages=[
                {"role": "user", "content": systemPrompt + "\n\n" + prompt}
            ]
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error occurred: {e}")
        return None
    
# Create a list to store responses
responses = []

# Iterate through each problem in the dataframe
total_problems = len(selected_problems_df)
for idx, (_, row) in enumerate(selected_problems_df.iterrows(), 1):
    # Get the problem text
    problem = row['Problem']
    
    print(f"Processing problem {idx}/{total_problems}")
    print(f"Problem text: {problem[:100]}...")  # Print first 100 chars
    
    # Add a delay between API calls to avoid rate limiting
    time.sleep(1)
    
    # Get completion for this problem
    response = get_reasoning_completion(problem)
    
    if response:
        print("Successfully got response")
    else:
        print("Failed to get response")
        
    # Store the response
    responses.append(response)
    print("-" * 80)

# Add responses as a new column to the dataframe
selected_problems_df['o1mini_response'] = responses

Processing problem 1/10
Problem text: You spilled red wine on the hotel carpet and want to clean it up before the housekeeping staff repor...
Successfully got response
--------------------------------------------------------------------------------
Processing problem 2/10
Problem text:  You need to crush some ice for making cocktails, but you don't have an ice crusher or blender. The ...
Successfully got response
--------------------------------------------------------------------------------
Processing problem 3/10
Problem text: Your cat has managed to get itself onto the top shelf of a high bookcase, and you need to get it saf...
Successfully got response
--------------------------------------------------------------------------------
Processing problem 4/10
Problem text:  Your sleeping bag zipper is broken, and you're unable to close it for warmth during a cold night of...
Successfully got response
--------------------------------------------------------------------------------
Proc

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  selected_problems_df['o1mini_response'] = responses


In [10]:
# Display the problems and their responses
selected_problems_df[['Problem', 'gpt4o_response', 'o1mini_response']]

Unnamed: 0,Problem,gpt4o_response,o1mini_response
0,You spilled red wine on the hotel carpet and w...,1. **Blot the Stain**: Use the white bath towe...,**Step1:** Blot excess wine with the white bat...
24,You need to crush some ice for making cocktai...,1. **Fill the Ziploc Bag**: Fill a small snack...,"**Step 1:** Place ice into a small Ziploc bag,..."
50,Your cat has managed to get itself onto the to...,Step 1: Place the step stool near the bookcase...,**Step1:** Place the step stool securely under...
96,"Your sleeping bag zipper is broken, and you'r...",1. **Wrap and Insulate**: Use the large scarf ...,**Step1:** Align the broken zipper edges of th...
223,You need to create a safe area in pool for kid...,It is not possible. \n\nWithout a means to sec...,**Step 1:** Arrange the floating rings in a ci...
306,You need to amplify the audio coming from your...,1. **Prepare the Amplifier**: Cut a small slit...,**Step 1:** Use cardboard to create a wider fu...
579,You need to chop a lot of hard and large veget...,It is not possible.\n\nJustification: None of ...,It is not possible. None of the available tool...
608,"While exploring a remote jungle, you accidenta...",1. **Create Padding:** Use the large leaves to...,**Step 1:** **Pad the Injury**\n- **Tools:** L...
766,You need to grind whole black peppercorns for ...,1. **Crush with Glass Jar:** Place a small amo...,**Step1:** Place the peppercorns into the pape...
1129,You want to train your flowering vine to grow ...,1. **Umbrella Skeleton Frame**: Use the metal ...,**Step1:** Disassemble the broken umbrella to ...


o1-mini findings:
- It declared only one problem unsolvable, compared with two for GPT-4o.
- Additional observations after manual review

In [22]:
# Call o3-mini to get a supposedly better reasoning model
# Make use of OpenRouter to do this call as I do not have direct access to the o3-mini API

from openai import OpenAI

openRouterclient = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=os.environ.get('OPENROUTER_API_KEY'),
)

def get_reasoning_completion(prompt):
    try:
        response = openRouterclient.chat.completions.create(
            model="openai/o3-mini",  # Using the o3-mini reasoning model
            messages=[
                {"role": "user", "content": systemPrompt + "\n\n" + prompt}
            ]
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error occurred: {e}")
        return None
    
# Create a list to store responses
responses = []

# Iterate through each problem in the dataframe
total_problems = len(selected_problems_df)
for idx, (_, row) in enumerate(selected_problems_df.iterrows(), 1):
    # Get the problem text
    problem = row['Problem']
    
    print(f"Processing problem {idx}/{total_problems}")
    print(f"Problem text: {problem[:100]}...")  # Print first 100 chars
    
    # Add a delay between API calls to avoid rate limiting
    time.sleep(1)
    
    # Get completion for this problem
    response = get_reasoning_completion(problem)
    
    if response:
        print("Successfully got response")
    else:
        print("Failed to get response")
        
    # Store the response
    responses.append(response)
    print("-" * 80)

# Add responses as a new column to the dataframe
selected_problems_df['o3mini_response'] = responses

Processing problem 1/10
Problem text: You spilled red wine on the hotel carpet and want to clean it up before the housekeeping staff repor...
Successfully got response
--------------------------------------------------------------------------------
Processing problem 2/10
Problem text:  You need to crush some ice for making cocktails, but you don't have an ice crusher or blender. The ...
Successfully got response
--------------------------------------------------------------------------------
Processing problem 3/10
Problem text: Your cat has managed to get itself onto the top shelf of a high bookcase, and you need to get it saf...
Successfully got response
--------------------------------------------------------------------------------
Processing problem 4/10
Problem text:  Your sleeping bag zipper is broken, and you're unable to close it for warmth during a cold night of...
Successfully got response
--------------------------------------------------------------------------------
Proc

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  selected_problems_df['o3mini_response'] = responses


In [23]:
# Display the problems and their responses
selected_problems_df[['Problem', 'gpt4o_response', 'o1mini_response', 'o3mini_response']]

Unnamed: 0,Problem,gpt4o_response,o1mini_response,o3mini_response
0,You spilled red wine on the hotel carpet and w...,1. **Blot the Stain**: Use the white bath towe...,**Step1:** Blot excess wine with the white bat...,Step 1: Use the bottle opener to open the mine...
24,You need to crush some ice for making cocktai...,1. **Fill the Ziploc Bag**: Fill a small snack...,"**Step 1:** Place ice into a small Ziploc bag,...",Step1: Place a few ice cubes in the small Zipl...
50,Your cat has managed to get itself onto the to...,Step 1: Place the step stool near the bookcase...,**Step1:** Place the step stool securely under...,Step 1 (Prepare Landing): Lay the soft blanket...
96,"Your sleeping bag zipper is broken, and you'r...",1. **Wrap and Insulate**: Use the large scarf ...,**Step1:** Align the broken zipper edges of th...,Step 1: Lay the scarf over the open zipper so ...
223,You need to create a safe area in pool for kid...,It is not possible. \n\nWithout a means to sec...,**Step 1:** Arrange the floating rings in a ci...,Step 1: Inflate several floating rings and pos...
306,You need to amplify the audio coming from your...,1. **Prepare the Amplifier**: Cut a small slit...,**Step 1:** Use cardboard to create a wider fu...,Step 1: Cut and curve the cardbox to form a fu...
579,You need to chop a lot of hard and large veget...,It is not possible.\n\nJustification: None of ...,It is not possible. None of the available tool...,Step 1: Wash and peel the vegetables using the...
608,"While exploring a remote jungle, you accidenta...",1. **Create Padding:** Use the large leaves to...,**Step 1:** **Pad the Injury**\n- **Tools:** L...,Step 1: Wrap several large leaves around the i...
766,You need to grind whole black peppercorns for ...,1. **Crush with Glass Jar:** Place a small amo...,**Step1:** Place the peppercorns into the pape...,Step 1: Put the peppercorns in the paper bag a...
1129,You want to train your flowering vine to grow ...,1. **Umbrella Skeleton Frame**: Use the metal ...,**Step1:** Disassemble the broken umbrella to ...,Step1: Mount the old bicycle wheel along the d...


o3-mini findings:
- As expected, it proposes a solution for all ten problems, whereas o1-mini declared one of them unsolvable.
- Additional observations after manual inspection

In [25]:
# Call DeepSeek to get a supposedly cheaper reasoning model
# Make use of OpenRouter to do this call as I do not have direct access to the DeepSeek API

from openai import OpenAI

openRouterclient = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=os.environ.get('OPENROUTER_API_KEY'),
)

def get_reasoning_completion(prompt):
    try:
        response = openRouterclient.chat.completions.create(
            model="deepseek/deepseek-r1",  # Using the DeepSeek-R1 reasoning model
            messages=[
                {"role": "user", "content": systemPrompt + "\n\n" + prompt}
            ]
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error occurred: {e}")
        return None
    
# Create a list to store responses
responses = []

# Iterate through each problem in the dataframe
total_problems = len(selected_problems_df)
for idx, (_, row) in enumerate(selected_problems_df.iterrows(), 1):
    # Get the problem text
    problem = row['Problem']
    
    print(f"Processing problem {idx}/{total_problems}")
    print(f"Problem text: {problem[:100]}...")  # Print first 100 chars
    
    # Add a delay between API calls to avoid rate limiting
    time.sleep(1)
    
    # Get completion for this problem
    response = get_reasoning_completion(problem)
    
    if response:
        print("Successfully got response")
    else:
        print("Failed to get response")
        
    # Store the response
    responses.append(response)
    print("-" * 80)

# Add responses as a new column to the dataframe
selected_problems_df['deepseek_response'] = responses

Processing problem 1/10
Problem text: You spilled red wine on the hotel carpet and want to clean it up before the housekeeping staff repor...
Successfully got response
--------------------------------------------------------------------------------
Processing problem 2/10
Problem text:  You need to crush some ice for making cocktails, but you don't have an ice crusher or blender. The ...
Successfully got response
--------------------------------------------------------------------------------
Processing problem 3/10
Problem text: Your cat has managed to get itself onto the top shelf of a high bookcase, and you need to get it saf...
Successfully got response
--------------------------------------------------------------------------------
Processing problem 4/10
Problem text:  Your sleeping bag zipper is broken, and you're unable to close it for warmth during a cold night of...
Successfully got response
--------------------------------------------------------------------------------
Proc

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  selected_problems_df['deepseek_response'] = responses


In [26]:
# Display the problems and their responses
selected_problems_df[['Problem', 'gpt4o_response', 'o1mini_response', 'o3mini_response', 'deepseek_response']]

Unnamed: 0,Problem,gpt4o_response,o1mini_response,o3mini_response,deepseek_response
0,You spilled red wine on the hotel carpet and w...,1. **Blot the Stain**: Use the white bath towe...,**Step1:** Blot excess wine with the white bat...,Step 1: Use the bottle opener to open the mine...,**Step1:** Use the bottle opener to open the m...
24,You need to crush some ice for making cocktai...,1. **Fill the Ziploc Bag**: Fill a small snack...,"**Step 1:** Place ice into a small Ziploc bag,...",Step1: Place a few ice cubes in the small Zipl...,**Step1:** Place ice in a snack-size Ziploc ba...
50,Your cat has managed to get itself onto the to...,Step 1: Place the step stool near the bookcase...,**Step1:** Place the step stool securely under...,Step 1 (Prepare Landing): Lay the soft blanket...,**Step1:** Place the pillow and soft blanket b...
96,"Your sleeping bag zipper is broken, and you'r...",1. **Wrap and Insulate**: Use the large scarf ...,**Step1:** Align the broken zipper edges of th...,Step 1: Lay the scarf over the open zipper so ...,**Step1:** Wrap the large scarf tightly over t...
223,You need to create a safe area in pool for kid...,It is not possible. \n\nWithout a means to sec...,**Step 1:** Arrange the floating rings in a ci...,Step 1: Inflate several floating rings and pos...,**Step1:** Insert pool noodles through multipl...
306,You need to amplify the audio coming from your...,1. **Prepare the Amplifier**: Cut a small slit...,**Step 1:** Use cardboard to create a wider fu...,Step 1: Cut and curve the cardbox to form a fu...,**Step1:** Insert smartphone into the cardboar...
579,You need to chop a lot of hard and large veget...,It is not possible.\n\nJustification: None of ...,It is not possible. None of the available tool...,Step 1: Wash and peel the vegetables using the...,**Step1:** Use the **peeler** to remove skins ...
608,"While exploring a remote jungle, you accidenta...",1. **Create Padding:** Use the large leaves to...,**Step 1:** **Pad the Injury**\n- **Tools:** L...,Step 1: Wrap several large leaves around the i...,**Step1:** Roll the water bottle in large leav...
766,You need to grind whole black peppercorns for ...,1. **Crush with Glass Jar:** Place a small amo...,**Step1:** Place the peppercorns into the pape...,Step 1: Put the peppercorns in the paper bag a...,**Step1:** Place peppercorns on the **cheese g...
1129,You want to train your flowering vine to grow ...,1. **Umbrella Skeleton Frame**: Use the metal ...,**Step1:** Disassemble the broken umbrella to ...,Step1: Mount the old bicycle wheel along the d...,**Step1:** Repurpose the umbrella's metal skel...
