# Mini Project 3 has two parts: a) The first part involves automated prompt engineering, where you build the use of a Calculator tool automatically for math questions asked of an LLM. You can compare evaluation performance with and without the tool and see how it improves the performance of GPT-3.5. b) The second part uses the Stability AI APIs for different image editing tasks, including "removing the photobomber" and "removing artifacts from an image".

### Part 1: This part aims to assess the mathematical reasoning capabilities of Large Language Models (LLMs), specifically focusing on zero-shot learning, few-shot in-context learning, and their ability to integrate external tools when solving arithmetic word problems from the SVAMP dataset. The SVAMP dataset comprises simple variations of arithmetic math word problems up to grade 4, designed to test the models beyond mere pattern recognition and evaluate their genuine problem-solving skills.

In [16]:
import openai
from datasets import load_dataset
from openai import OpenAI
from tqdm import tqdm

#### The SVAMP dataset (Simple Variations on Arithmetic Math word Problems) represents a specialized challenge set designed to assess the capabilities of state-of-the-art (SOTA) models in solving arithmetic word problems of up to grade 4 level. 
#### Unlike conventional benchmarks, which models often solve by exploiting simple heuristics, SVAMP introduces a series of one-unknown math word problems crafted to underscore the limitations of these models, demonstrating their struggle to solve even elementary problems effectively. 
#### The challenge is twofold: (1) identify which numbers in the context are actually needed for the question and construct a math equation; (2) compute the math equation correctly.
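
To make the two steps concrete, here is a small sketch on a made-up problem in the SVAMP style. The record below is hypothetical (it is invented for illustration and not taken from the dataset); only the field names `Body`, `Question`, `Equation`, and `Answer` mirror the ones used later in this notebook.

```python
# Hypothetical SVAMP-style record (invented for illustration, not from the dataset).
example = {
    "Body": "Jack had 8 apples and 5 pens. He gave 3 apples to Jill.",
    "Question": "How many apples does Jack have now?",
    "Equation": "( 8.0 - 3.0 )",
    "Answer": 5.0,
}

# Step 1: pick the numbers the question actually needs and build the equation.
# 8 and 3 matter here; 5 (the pens) is a distractor.
relevant_numbers = [8.0, 3.0]
equation = f"( {relevant_numbers[0]} - {relevant_numbers[1]} )"

# Step 2: compute the equation.
computed = relevant_numbers[0] - relevant_numbers[1]

print(equation)
print(computed == example["Answer"])
```

A model can fail at either step: it may pick the distractor number in step 1, or build the right equation and still miscompute it in step 2.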


In [10]:
dataset = load_dataset("ChilleD/SVAMP")
train_dataset = dataset['train']
test_dataset = dataset['test']
# https://huggingface.co/datasets/ChilleD/SVAMP?row=0 


In [12]:
"""ChatGPT completion"""
import os
def chatgpt_completion(prompt_text):
    api_key = os.getenv("OPENAI_API_KEY")
    client = OpenAI(api_key=api_key)
    messages = [
    { "role": "user", "content": prompt_text },
    ]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0,
        max_tokens=1000,
        top_p=1,)
    response_text = response.choices[0].message.content
    return response_text


def evaluate(results, answers):
    # Score each prediction as 1 (numerically equal to the answer) or 0.
    accs = []
    for result, answer in zip(results, answers):
        try:
            accs.append(1 if float(result) == float(answer) else 0)
        except (TypeError, ValueError, OverflowError):
            # Non-numeric predictions (e.g. error strings) count as wrong.
            accs.append(0)

    print("The scores on the test set: ", sum(accs)/len(accs)) # around 38% - 50%
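
Exact float equality can mark a prediction such as `0.30000000000000004` as wrong when the reference answer is `0.3`. The sketch below is one possible variant that compares with a relative tolerance and returns the score instead of printing it. The name `accuracy` and the `rel_tol` default are assumptions made for this illustration, not part of the assignment.

```python
import math


def accuracy(results, answers, rel_tol=1e-6):
    """Fraction of predictions that match the reference answers numerically."""
    correct = 0
    for result, answer in zip(results, answers):
        try:
            if math.isclose(float(result), float(answer), rel_tol=rel_tol):
                correct += 1
        except (TypeError, ValueError, OverflowError):
            # Non-numeric predictions (e.g. error strings) count as wrong.
            pass
    return correct / len(answers) if answers else 0.0


# "8" and 0.1 + 0.2 match their answers; "n/a" and inf do not.
print(accuracy(["8", 0.1 + 0.2, "n/a", float("inf")], [8.0, 0.3, 2.0, 5.0]))  # 0.5
```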

In [13]:
import time

# Timer to restrict query speed
class SpeedLimitTimer:
    def __init__(self, second_per_step=3.1):
        self.record_time = time.time()
        self.second_per_step = second_per_step

    def step(self):
        time_div = time.time() - self.record_time
        if time_div <= self.second_per_step:
            time.sleep(self.second_per_step - time_div)
        self.record_time = time.time()

    def sleep(self, s):
        time.sleep(s)


timer = SpeedLimitTimer(second_per_step=3.1) 


# 1. Zero Shot Prompting LLM (i.e. ChatGPT)
#### In this part, we are going to ask ChatGPT to directly answer the math questions without using any tools

In [17]:
def parse_zeroshot_chatgpt_output(chatgpt_output):
    # Follow Toolformer 
    # https://arxiv.org/abs/2302.04761
    # Zero-Shot Math Reasoning with ChatGPT     
    # Check for the first number predicted by the model. 
    # An exception to this is if the model’s prediction contains an equation (e.g., “The correct answer is 5+3=8”), in which case we consider the first number after the “=” sign to be its prediction.
    def is_number(s):
        try:
            float(s)
            return True
        except ValueError:
            return False
    if "=" in chatgpt_output:
        chatgpt_output = chatgpt_output.split("=")[1]

    words = chatgpt_output.split()
    for word in words:
        if is_number(word):
            return word
    return float("inf")  # Return inf if no number word is found

def zero_shot_prompt_chatgpt(test_dataset):
    results = []
    answers = []
    for data in tqdm(test_dataset):
        context = data["Body"]    
        question = data["Question"] 
        answer = data["Answer"]
        answers.append(answer)

        example_prompt =  context + " " + question 
        chatgpt_output = chatgpt_completion(example_prompt)
        result = parse_zeroshot_chatgpt_output(chatgpt_output)
        results.append(result)
        timer.step()
    return results, answers
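
The parsing rule described in the comments above (take the first number, or the first number after an "=" sign) can also be sketched with a regular expression, so that numbers followed by punctuation ("8.") or written with thousands separators ("1,250") are still recognized. This is only an illustrative alternative: the name `first_number` and the pattern are assumptions, and using it instead of `parse_zeroshot_chatgpt_output` may change the scores reported in this notebook.

```python
import re

# Integers and decimals, optionally signed, with optional thousands separators.
NUMBER_PATTERN = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def first_number(text):
    """Return the first number in `text` as a float, or inf if there is none."""
    if "=" in text:
        # Keep only what follows the first "=" sign.
        text = text.split("=", 1)[1]
    match = NUMBER_PATTERN.search(text)
    if match is None:
        return float("inf")
    return float(match.group().replace(",", ""))


print(first_number("The correct answer is 5+3=8."))  # 8.0
print(first_number("She has 1,250 apples."))         # 1250.0
```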


In [18]:
zeroshot_results, answers = zero_shot_prompt_chatgpt(test_dataset) # 47 # gpt-3.5-turbo-0125
evaluate(zeroshot_results, answers)  # This should give you scores 38-50%

100%|██████████| 300/300 [15:35<00:00,  3.12s/it]

The scores on the test set:  0.45





# 2. Instruction following and tool use with LLM  (i.e. ChatGPT)
#### You will see that zero-shot ChatGPT performance is less than ideal. Next, we are going to leverage the tool-use capabilities of the LLM.
#### In this case, that means the ability to use an external calculator to deal with math questions.
#### Using Toolformer as a reference, we construct an instruction prompt and add a few {input:output} demonstrations to guide ChatGPT to generate the output format we want.
#### You will also be asked to implement some functions to finish the pipeline.

In [19]:
TOOLFORMER_CALCULATOR_PROMPT_4Shot = """
Your task is to add calls to a
Calculator API to a piece of text.
The calls should help you get
information required to complete the
text. You can call the API by writing
"[Calculator(expression)]" where
"expression" is the expression to be
computed. Here are some examples of API
calls:
Input: The number in the next term is 18
+ 12 x 3 = 54.
Output: The number in the next term is
18 + 12 x 3 = [Calculator(18 + 12 * 3)]
54.
Input: The population is 658,893 people.
This is 11.4% of the national average of
5,763,868 people.
Output: The population is 658,893 people.
This is 11.4% of the national average of
[Calculator(658,893 / 11.4%)] 5,763,868
people.
Input: A total of 252 qualifying matches
were played, and 723 goals were scored
(an average of 2.87 per match). This is
three times less than the 2169 goals
last year.
Output: A total of 252 qualifying
matches were played, and 723 goals were
scored (an average of [Calculator(723
/ 252)] 2.87 per match). This is twenty
goals more than the [Calculator(723 -
20)] 703 goals last year.
Input: I went to Paris in 1994 and
stayed there until 2011, so in total,
it was 17 years.
Output: I went to Paris in 1994 and
stayed there until 2011, so in total, it
was [Calculator(2011 - 1994)] 17 years.
""" # Prompt from Toolformer 
# https://arxiv.org/abs/2302.04761  ### 68%


In [30]:
TOOLFORMER_CALCULATOR_PROMPT_1Shot = """
Your task is to add calls to a
Calculator API to a piece of text.
The calls should help you get
information required to complete the
text. You can call the API by writing
"[Calculator(expression)]" where
"expression" is the expression to be
computed. Here are some examples of API
calls:
Input: The number in the next term is 18
+ 12 x 3 = 54.
Output: The number in the next term is
18 + 12 x 3 = [Calculator(18 + 12 * 3)]
54.
""" # Prompt from Toolformer 
# https://arxiv.org/abs/2302.04761  ### 64%


In [31]:
# TODO
def calculator(chatgpt_output):
    ######################### Put your code here #########################
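    # In this function, you need to implement a calculator that will compute the equation produced by ChatGPT and return a number as the final answer.
    # Hint: look out for the percentage sign % and the dollar sign $.
    # Note: '%' is rewritten to '/100', so 'a / b%' is evaluated as (a / b) / 100 (see the second test case below).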
    if '%' in chatgpt_output:
        chatgpt_output = chatgpt_output.replace('%', '/100')
    if '$' in chatgpt_output:
        chatgpt_output = chatgpt_output.replace('$', '')

    try:
        # Evaluate the arithmetic expression
        result = eval(chatgpt_output)
    except Exception as e:
        return f"Error evaluating expression: {e}"
    
    return result
    ######################### Put your code here #########################
    
# You can use the below test case to check your implementation.

# # Example inputs
# example_inputs = [
#     "18 + 12 * 3",
#     "658893 / 11.4%",
#     "723 / 252"
# ]

# for example in example_inputs:
#     result = calculator(example)
#     print(result)

# 54
# 577.9763157894737
# 2.869047619047619
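
Calling `eval` on model output will execute any Python the model happens to produce. As a hedged alternative, the sketch below walks the parsed expression with the standard `ast` module and only accepts numbers and the four basic arithmetic operators. The names `safe_calculator` and `_evaluate` are hypothetical. Note that this sketch treats `11.4%` as the single value `(11.4/100)`, so its result for `658893 / 11.4%` differs from the expected output of the test case above.

```python
import ast
import operator
import re

# Arithmetic operators this sketch is willing to evaluate.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _evaluate(node):
    """Recursively evaluate a parsed arithmetic expression."""
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")


def safe_calculator(expression):
    """Evaluate a plain arithmetic expression without calling eval()."""
    # Drop dollar signs and thousands separators.
    cleaned = expression.replace("$", "").replace(",", "")
    # Rewrite "11.4%" as "(11.4/100)" so the percentage binds as one value.
    cleaned = re.sub(r"(\d+(?:\.\d+)?)\s*%", r"(\1/100)", cleaned)
    tree = ast.parse(cleaned.strip(), mode="eval")
    return _evaluate(tree.body)


print(safe_calculator("18 + 12 * 3"))  # 54
print(safe_calculator("$1,200 - $200"))  # 1000
```

Anything other than plain arithmetic, such as a function call, raises `ValueError`, and malformed input raises `SyntaxError`, so a caller can catch both and record the example as unanswered.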

In [41]:
def toolformer_prompt_chatgpt(test_dataset, PROMPT):
    results = []
    answers = []
    for idx, data in enumerate(test_dataset):
        context = data["Body"]    
        question = data["Question"]
        answer = data["Answer"]

        example_prompt =  PROMPT + context + " " + question 
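        try:
            response = chatgpt_completion(example_prompt)
            print(response)
            # Take the text between "[Calculator(" and the next ")".
            # Note: this cuts nested parentheses short, e.g. "(8 - 7) * 2".
            calculate_out = response.split("[Calculator(")[1].split(")")[0]
            result = calculator(calculate_out)
        except Exception:
            # No calculator call found or the API call failed: mark as unanswered.
            result = float("inf")

        results.append(result)
        answers.append(answer)

        timer.step()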
    return results, answers
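
Splitting on the first `)` truncates expressions that contain their own parentheses, which is one way to end up with errors such as "'(' was never closed". A minimal sketch of an alternative is to scan for the closing parenthesis that balances the one opened by `[Calculator(`. The helper name `extract_calculator_expression` is hypothetical and is not part of the assignment.

```python
def extract_calculator_expression(text, marker="[Calculator("):
    """Return the expression inside the first calculator call, or None."""
    start = text.find(marker)
    if start == -1:
        return None  # no calculator call in the text
    start += len(marker)
    depth = 1  # the marker itself opens one parenthesis
    for index in range(start, len(text)):
        char = text[index]
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth == 0:
                return text[start:index]
    return None  # parentheses never balanced


print(extract_calculator_expression("[Calculator((8 - 7) * 2)] cups"))  # (8 - 7) * 2
```

The returned string can then be passed to the calculator, with `None` handled the same way as a missing calculator call.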


In [36]:
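# Slicing a Hugging Face Dataset (test_dataset[:10]) returns a dict of columns, and
# iterating over it yields column names, which raises
# "TypeError: string indices must be integers, not 'str'" inside the loop.
# Use select() to get the first 10 rows as a Dataset instead.
test_dataset1 = test_dataset.select(range(10))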

In [43]:
toolformer_prompt_chatgpt(test_dataset, TOOLFORMER_CALCULATOR_PROMPT_4Shot)

Mary is baking a cake. The recipe calls for 6 cups of flour, 8 cups of sugar, and 7 cups of salt. She already put in 5 cups of flour. To find out how many more cups of sugar than cups of salt she needs to add now, we can use the following calculation:

[Calculator(8 - 7)] cups more sugar than salt needs to be added now.


([1], [1.0])

In [39]:
# ChatGPT performance with tool use when prompted with instruction and 4 task demonstrations. This should give you scores 55-70%
fourshot_results, answers = toolformer_prompt_chatgpt(test_dataset, TOOLFORMER_CALCULATOR_PROMPT_4Shot)
evaluate(fourshot_results, answers)   # run 1 64.66,  run 2 65.33, run 3 65

KeyboardInterrupt: 

[1,
 inf,
 5,
 14,
 inf,
 23,
 inf,
 45.0,
 104,
 333,
 93899,
 36,
 175,
 660,
 255,
 10,
 737.0,
 230,
 10,
 3.0,
 inf,
 3.0,
 82,
 111,
 4536,
 11,
 614,
 143550,
 8,
 18,
 3,
 4.0,
 183,
 1,
 2825,
 20,
 30144,
 527292,
 23,
 13,
 30,
 2,
 10,
 17017,
 574664,
 1088,
 5,
 6,
 4.0,
 6,
 4140,
 8,
 33.0,
 inf,
 146,
 51,
 2,
 128,
 2.0,
 58,
 1145,
 84,
 0.022727272727272728,
 34,
 192,
 1,
 302611,
 297,
 30057,
 17.0,
 7,
 109,
 84,
 61,
 174,
 8.0,
 337,
 4.0,
 94,
 8,
 32,
 3,
 51,
 68,
 8,
 16,
 16,
 747,
 16,
 110,
 8066,
 14,
 11,
 7,
 0.6666666666666666,
 5,
 6,
 61,
 31,
 -18,
 19,
 229,
 1,
 22,
 30,
 19,
 1891,
 3021,
 3,
 347,
 1363293,
 65,
 23,
 78,
 76.0,
 44,
 17,
 5,
 826,
 143,
 4,
 1,
 5,
 39,
 1,
 6,
 3.0,
 3,
 "Error evaluating expression: '(' was never closed (<string>, line 1)",
 54,
 22,
 14,
 20,
 22800,
 134,
 -4,
 26,
 121,
 3,
 648,
 19,
 125,
 41,
 223,
 6,
 "Error evaluating expression: '(' was never closed (<string>, line 1)",
 5,
 21,
 388,
 460,
 15,


In [24]:
# ChatGPT performance with tool use when prompted with an instruction and 1 task demonstration. This should give you scores slightly lower than the 4-shot results.
oneshot_results, answers = toolformer_prompt_chatgpt(test_dataset, TOOLFORMER_CALCULATOR_PROMPT_1Shot)  # run 1 62.66 run 2 65.33 run 3 66.33
evaluate(oneshot_results, answers)

The scores on the test set:  0.67


# 3. Challenging Data
#### In this section, we are going to push our models to the limit. We (naively) identify a challenging subset of the dataset by picking the math equations that involve at least one number with over 3 digits. 
#### You will observe an even larger performance difference between using tools and not using them.

In [27]:
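def check_number(elements):
    # Return True if any token is a number whose integer part is longer than
    # 3 characters (digits plus any minus sign).
    for element in elements:
        try:
            number = int(float(element))
        except (ValueError, OverflowError):
            # Skip tokens that are not finite numbers, such as "(" or "+".
            continue
        if len(str(number)) > 3:
            return True
    return False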



challenging_dataset = []
# Collect the challenging examples from both the train and test splits.
for split in (train_dataset, test_dataset):
    for data in split:
        elements = data["Equation"].split()
        if check_number(elements):
            challenging_dataset.append(data)

In [28]:

oneshot_results, answers = toolformer_prompt_chatgpt(challenging_dataset, TOOLFORMER_CALCULATOR_PROMPT_1Shot)  # This should give you scores 55-70%
evaluate(oneshot_results, answers)  # 68%; 63% with gpt-3.5-turbo-0125

zeroshot_results, answers = zero_shot_prompt_chatgpt(challenging_dataset)
evaluate(zeroshot_results, answers)  # This should give you scores 25-35%; 31%; 26% with gpt-3.5-turbo-0125

The scores on the test set:  0.6842105263157895


100%|██████████| 19/19 [00:59<00:00,  3.14s/it]

The scores on the test set:  0.3157894736842105





In [29]:
eval('( ( 100.0 + 3.0 ) + 7.0 )')

110.0

# 4. (Bonus) Build your own prompt to equip LLM with calculator
### The improvement from leveraging tool use is evident. Now, can you create a better prompt than the Toolformer prompt?

In [None]:
YOUR_CALCULATOR_PROMPT = """
"""

In [None]:
yourprompt_results, answers = toolformer_prompt_chatgpt(test_dataset, YOUR_CALCULATOR_PROMPT)
evaluate(yourprompt_results, answers)

# 5. Insight Sharing 
### Please write a short paragraph (50 ~ 100 words) telling us any insights you got from Part 1.