# Title: Enhancing LLM Reasoning via Vision-Augmented Prompting

#### Members' Names: Ashish Sunuwar, Yahya Shaikh

#### Emails: ashish.sunuwar@torontomu.ca, yahya.shaikh@torontomu.ca


# Introduction:

#### Problem Description:
Reasoning about complex tasks like geometry intersections, time series forecasting, or visual puzzles requires a combination of verbal logic and spatial intuition. Take a geometry problem, for example: “How many times do a circle and a line intersect?” Or a time series question like: “What comes next in this pattern?” While these seem simple for humans — who often sketch or visualize before making decisions — language models process only text. This gap in spatial reasoning limits their ability to solve such problems accurately.


#### Context of the Problem:

Large Language Models (LLMs) like GPT-4 have made impressive advances in language-based tasks. But when it comes to problems that require spatial understanding, temporal trends, or structural insight, they fall short. These are tasks where humans rely on visual thinking — sketching, drawing patterns, or mentally simulating movement — all of which are missing in traditional LLM pipelines. Without a way to visualize or “see” the problem, LLMs can’t reason with the same depth as humans.

#### Limitations of Other Approaches:

Most existing LLM-based reasoning methods — like standard prompting, Chain-of-Thought (CoT), and even Self-Consistent CoT (CoT-SC) — rely entirely on text. These methods work well when the answer is just a matter of logical inference or memory, but they break down in cases where visual cues are essential. For example, in geometry problems, models often hallucinate or underestimate intersections because they can’t “see” the shapes. Similarly, time series predictions using only text struggle to capture visual or cyclical patterns in data.


#### Solution:

To bridge this gap, we implemented a vision-augmented approach. Instead of asking LLMs to solve complex problems blindly, we give them eyes. We use external tools to draw diagrams or visualize patterns described in the problem text, and then feed these images back into the model alongside the original prompt. Our Vision-Augmented Prompting (VAP) pipeline walks the model through a 3-stage process — planning, iterative visual reasoning, and final conclusion — to improve decision-making in both geometry and time-based tasks.

![Difference between the non-VAP and VAP approach](fig_comparison.png "Difference between the non-VAP and VAP approach Ref[2], https://github.com/Ashish-Sunuwar/Vision_Augmented_Prompting/blob/main/src/Images/fig_comparison.png")

# Background

The following table summarizes the related work.

| Reference | Explanation | Dataset/Input | Weakness |
| --- | --- | --- | --- |
| Wei et al. [1] | Introduced Chain-of-Thought (CoT) prompting, where LLMs reason step by step before producing an answer. This improved performance on logical reasoning and math word problems. | Diverse reasoning benchmarks (e.g., GSM8K, SVAMP) | Fails on tasks that require visual or spatial reasoning, because text-only steps cannot represent diagrams or physical relationships. |
| Xiao et al. [2] | Proposed Vision-Augmented Prompting (VAP), a multimodal reasoning framework. It combines image synthesis (e.g., Matplotlib) with LLM planning, iterative visual reasoning, and a final conclusion, enabling reasoning over geometry, time series, Sudoku, and TSP. | Geometry tasks from BIG-bench, Darts time series, synthetic Sudoku and TSP samples | VAP is training-free and general, but it lacks dynamic image updates and uses only static renderers (e.g., Matplotlib). Future work could focus on image feedback loops, dynamic drawing, and interpretability. |



# Methodology

Our project simplifies and implements the VAP framework for two reasoning domains: geometry intersection problems and time series forecasting. To keep things simple, we walk through only the geometry intersection problem here. The same three-stage process is used in both cases, reflecting how humans often work: plan, think through steps with a visual aid, then conclude.

#### Stage 1: Planning
We begin by asking the model to generate a structured plan. This includes:
*   What tool to use for drawing (we used Matplotlib)
*   How to set up the drawing canvas (e.g., axis limits)
*   What sequence of steps to take when thinking through the problem
*   When to stop drawing or reasoning


The plan is returned in structured JSON format and helps the model establish a “mental” framework before solving the task.
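As a rough illustration of what such a plan might look like, the sketch below checks a plan object for the four keys our planning prompt requests (`tool`, `initialization`, `iterative_approach`, `termination_condition`). The helper name `validate_plan` and the sample values are hypothetical; they are not actual model output.

```python
import json

# Keys requested by the planning prompt
REQUIRED_PLAN_KEYS = ("tool", "initialization", "iterative_approach", "termination_condition")


def validate_plan(raw_text):
    """Parse a JSON plan and return (plan, missing_keys). Hypothetical helper."""
    plan = json.loads(raw_text)
    if not isinstance(plan, dict):
        raise ValueError("plan must be a JSON object")
    missing = [key for key in REQUIRED_PLAN_KEYS if key not in plan]
    return plan, missing


# Made-up plan for illustration only
sample_plan = json.dumps({
    "tool": "matplotlib",
    "initialization": "Create a canvas with x and y limits from -10 to 10",
    "iterative_approach": "Draw one shape per step and note any new crossings",
    "termination_condition": "All shapes in the problem have been drawn",
})

plan, missing = validate_plan(sample_plan)
print(plan["tool"], missing)
```

A check like this could flag an incomplete plan before the later stages run, rather than letting a missing field surface as a less obvious error downstream.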

#### Stage 2: Iterative Reasoning
Once the problem is visualized and the plan is ready, we ask the model to think step-by-step. At each iteration, it updates its thoughts (e.g., “Adding a circle intersects with the line here…”) and refines its mental image. While we do not update the image in real-time, this mimics how a human would mentally update their sketch as they add more information.
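To make the shape of this reasoning trail concrete, here is a small sketch that turns a list of iteration records into readable text. It assumes each record carries the fields our Stage 2 prompt asks for (`iteration`, `thought`, `iterative_draw_step`). The function `format_reasoning_trail` and the sample records are hypothetical.

```python
import json


def format_reasoning_trail(iterations):
    """Turn a list of iteration records into a numbered, readable trail. Hypothetical helper."""
    lines = []
    for position, record in enumerate(iterations, start=1):
        # Fall back to the list position if the model omitted the iteration number
        step_no = record.get("iteration", position)
        thought = record.get("thought", "")
        draw_step = record.get("iterative_draw_step", "")
        lines.append(f"Step {step_no}: {thought} [draw: {draw_step}]")
    return "\n".join(lines)


# Made-up records for illustration only
sample_iterations = json.loads("""[
  {"iteration": 1, "thought": "Start with the circle.", "iterative_draw_step": "draw circle"},
  {"iteration": 2, "thought": "The segment crosses the circle twice.", "iterative_draw_step": "draw segment"}
]""")

print(format_reasoning_trail(sample_iterations))
```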

#### Stage 3: Conclusive Reasoning

Finally, the model is given the image, the original problem, and the reasoning trail. It then makes a final judgment, such as how many intersection points exist or what the next value in a series should be. In our experiments, combining visual and verbal reasoning gave better accuracy than text-only prompting.
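The final judgment has to be read back out of the model's response. The sketch below shows one possible way to do that a little more defensively: it first looks for a `final_answer` field in JSON and, if that fails, falls back to the last integer in the text. The fallback rule and the name `extract_final_answer` are assumptions made for illustration; our pipeline simply defaults to 0 when parsing fails.

```python
import json
import re


def extract_final_answer(response_text, default=0):
    """Return the integer under "final_answer", else the last integer in the text, else default."""
    try:
        parsed = json.loads(response_text)
        if isinstance(parsed, dict) and "final_answer" in parsed:
            return int(parsed["final_answer"])
    except (TypeError, ValueError):
        # Not valid JSON, or the value is not convertible to int: use the fallback below
        pass
    numbers = re.findall(r"-?\d+", str(response_text))
    return int(numbers[-1]) if numbers else default


print(extract_final_answer('{"final_answer": 4}'))
print(extract_final_answer("The shapes intersect 3 times"))
print(extract_final_answer("no idea"))
```

Taking the last integer is only a heuristic; it can pick up the wrong number when the response mentions coordinates after the answer, so a stricter rule may be preferable in practice.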



# Implementation
We used a dataset of natural-language geometry problems from the BIG-bench benchmark, as provided with the VAP paper [2]. Each problem describes shapes (circles, lines, polygons) and asks how many intersections occur.
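To give a feel for what "counting intersections" means for a single pair of shapes, the following sketch counts the points where a line segment meets a circle by solving the usual quadratic in the segment parameter t. It is not part of our pipeline, which relies on the dataset's ground-truth answers. The function name and the tolerance `eps` are assumptions, and we have not verified how the dataset treats tangent contacts (this sketch counts them as one point).

```python
import math


def count_segment_circle_intersections(p0, p1, center, radius, eps=1e-9):
    """Count points where the segment p0 -> p1 meets the circle boundary."""
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]          # segment direction
    fx, fy = p0[0] - center[0], p0[1] - center[1]  # vector from center to p0
    a = dx * dx + dy * dy
    c = fx * fx + fy * fy - radius * radius
    if a == 0:
        # Degenerate segment (a single point): it either lies on the circle or not
        return 1 if abs(c) <= eps else 0
    b = 2 * (fx * dx + fy * dy)
    disc = b * b - 4 * a * c
    if disc < -eps:
        return 0  # the infinite line misses the circle
    if abs(disc) <= eps:
        roots = [-b / (2 * a)]  # tangent line: one candidate point
    else:
        sqrt_disc = math.sqrt(disc)
        roots = [(-b - sqrt_disc) / (2 * a), (-b + sqrt_disc) / (2 * a)]
    # Keep only the solutions that fall on the segment itself (0 <= t <= 1)
    return sum(1 for t in roots if -eps <= t <= 1 + eps)


circle_center, circle_radius = (0, 0), 5
print(count_segment_circle_intersections((-10, 0), (10, 0), circle_center, circle_radius))  # 2
print(count_segment_circle_intersections((0, 0), (10, 0), circle_center, circle_radius))    # 1
print(count_segment_circle_intersections((-10, 6), (10, 6), circle_center, circle_radius))  # 0
```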

Our system:

*   Parsed the shapes from the problem text using regular expressions
*   Drew them using Matplotlib to create a visual diagram
*   Used call_gpt() to query the model with three prompting strategies: a standard prompt, a Chain-of-Thought prompt, and the Vision-Augmented prompt (3 stages)
*   Logged predictions, matched them against ground truth, and tracked accuracy
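As one possible way to analyze the logged predictions, the sketch below breaks accuracy down by the number of shapes in a problem. It assumes the column layout of the baseline `result.csv` (`problem_id,pred_answer,is_correct,num_shape`). The function name `accuracy_by_num_shape` and the sample rows are made up for illustration and are not our experimental results.

```python
import csv
import io
from collections import defaultdict


def accuracy_by_num_shape(csv_file):
    """Return {num_shape: accuracy in percent} from a result.csv-style file object."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for row in csv.DictReader(csv_file):
        key = int(row["num_shape"])
        totals[key] += 1
        # is_correct is written as the Python bool text "True" or "False"
        if row["is_correct"].strip().lower() == "true":
            correct[key] += 1
    return {key: 100.0 * correct[key] / totals[key] for key in sorted(totals)}


# Made-up rows in the same column layout as the baseline result.csv
sample_csv = io.StringIO(
    "problem_id,pred_answer,is_correct,num_shape\n"
    "0,2,True,2\n"
    "1,0,False,2\n"
    "2,4,True,3\n"
)
print(accuracy_by_num_shape(sample_csv))
```

For a real run, the same function could be called on `open(result_path, newline="", encoding="utf8")` instead of the in-memory sample.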


In [None]:
# Importing all necessary libraries

import os
import re
import json
import time
from datetime import datetime
from tqdm import tqdm
import matplotlib.pyplot as plt

# GPT API call function
from call_gpt import call_gpt
#Prompts used for standard and COT prompting strategies
from prompts.intersection_count import standard_prompt, cot_prompt


In [None]:
# Clean GPT responses by removing markdown fences and trimming whitespaces
def clean_response(response_text):
    cleaned = response_text.strip()
    if cleaned.startswith("```"):
        lines = cleaned.splitlines()
        if lines and lines[0].startswith("```"):
            lines.pop(0)
        if lines and lines[-1].startswith("```"):
            lines.pop()
        if lines and lines[0].strip().lower() == "json":
            lines.pop(0)
        cleaned = "\n".join(lines).strip()
    return cleaned


In [None]:
# Run the text-only baselines: standard, chain-of-thought (cot), and self-consistent chain-of-thought (cot-sc)
def run_intersection_count(dataset_dir='dataset', task='intersection_geometry', model='gpt-4o-mini', method='standard', k_samples=5, log_dir='log'):
    from prompts.intersection_count import standard_prompt, cot_prompt

    # Create a unique logging folder for this run
    method = method.lower()
    log_dir_base = os.path.join(log_dir, task)
    os.makedirs(log_dir_base, exist_ok=True)
    current_time_str = datetime.now().strftime('%Y%m%d_%H%M%S')
    run_dir = os.path.join(log_dir_base, f'{model}_{method}_{current_time_str}')
    os.makedirs(run_dir, exist_ok=True)

    # Set up a CSV file to store results
    result_path = os.path.join(run_dir, 'result.csv')
    result_file = open(result_path, 'w', encoding='utf8')
    result_file.write('problem_id,pred_answer,is_correct,num_shape\n')
    result_file.flush()

    # Utility function to check correctness
    def check_identity(gt, pred):
        try:
            return int(pred) == int(gt)
        except (TypeError, ValueError):
            return False

    # Load the dataset containing geometry problems
    metadata_path = os.path.join(dataset_dir, task, 'task.json')
    with open(metadata_path, 'r', encoding='utf8') as f:
        metadata = json.load(f)

    right_count = 0
    current_count = 0

    # Loop through each geometry problem
    for question_id, item in enumerate(tqdm(metadata[:120])):
        log_path = os.path.join(run_dir, f'{question_id}.log')
        log_file = open(log_path, 'w', encoding='utf8')

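        temperature = 0.7
        if method == 'standard':
            prompt = standard_prompt
            max_tokens = 150
        elif method == 'cot':
            prompt = cot_prompt
            max_tokens = 2048
        elif method == 'cot-sc':
            prompt = cot_prompt
            max_tokens = 512
        else:
            raise ValueError(f'Unknown method: {method}')

        # Pull the integer that follows "answer:" out of a CoT response (0 if absent)
        def extract_cot_answer(text):
            try:
                return int(re.findall(r'answer:\s*(\d+)', text.lower())[0])
            except (AttributeError, IndexError, ValueError):
                return 0

        # Call GPT with formatted prompt
        formatted_prompt = prompt.format(problem=item['input'])
        result = call_gpt(formatted_prompt, model=model, temperature=temperature, max_tokens=max_tokens)
        print(result)

        # Extract answer from GPT's response
        if method == 'standard':
            try:
                pred_answer = int(result)
            except (TypeError, ValueError):
                pred_answer = 0
        elif method == 'cot':
            pred_answer = extract_cot_answer(result)
        elif method == 'cot-sc':
            # Self-consistency: sample k_samples responses in total, then take a majority vote
            samples = [result]
            for _ in range(k_samples - 1):
                samples.append(call_gpt(formatted_prompt, model=model, temperature=temperature, max_tokens=max_tokens))
            count_map = {}
            for sample in samples:
                answer = extract_cot_answer(sample)
                count_map[answer] = count_map.get(answer, 0) + 1
            pred_answer = max(count_map, key=count_map.get)
            # Keep every sample so the log shows what the vote was based on
            result = '\n---\n'.join(str(s) for s in samples)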

        # Log result for this question
        log_file.write('=' * 40 + '\n')
        log_file.write(f'problem: {item["input"]}\n')
        log_file.write(f'result: {result}\n')
        log_file.write(f'pred_answer: {pred_answer}\n')
        log_file.write(f'ground truth: {item["answer"]}\n')
        log_file.flush()

        # Accuracy tracking
        is_correct = check_identity(item['answer'], pred_answer)
        if is_correct:
            right_count += 1
        current_count += 1
        result_file.write(f'{current_count-1},{pred_answer},{is_correct},{item["num_shape"]}\n')
        result_file.flush()
        print(f'Accuracy: {right_count / current_count * 100:.2f}%')

        log_file.close()
        time.sleep(5)

    # Final accuracy summary, computed over the problems actually evaluated
    final_acc = right_count / current_count * 100 if current_count else 0.0
    print(f'Final Accuracy: {final_acc:.2f}%')
    with open(os.path.join(run_dir, 'summary.log'), 'w', encoding='utf8') as f:
        f.write(f'Accuracy: {final_acc:.2f}%\n')
    result_file.close()


In [None]:
# Parse geometry descriptions to extract circles, lines, and polygons
def parse_geometry_description(problem_text):
    shapes = []
    text_lower = problem_text.lower()
    circle_matches = re.findall(r'circle centered at \(([-\d\.]+),\s*([-\d\.]+)\)\s*with radius\s*([-\d\.]+)', text_lower)
    for (cx, cy, r_str) in circle_matches:
        shapes.append(("circle", (float(cx), float(cy)), float(r_str.rstrip('.'))))
    line_matches = re.findall(r'line segment from \(([-\d\.]+),\s*([-\d\.]+)\)\s*to\s*\(([-\d\.]+),\s*([-\d\.]+)\)', text_lower)
    for (x1, y1, x2, y2) in line_matches:
        shapes.append(("line", (float(x1), float(y1)), (float(x2), float(y2))))
    poly_matches = re.findall(r'polygon with coordinates\s*\[([^\]]+)\]', text_lower)
    for coords_str in poly_matches:
        pts = re.findall(r'\(([-\d\.]+),\s*([-\d\.]+)\)', coords_str)
        if pts:
            shapes.append(("polygon", [(float(px), float(py)) for px, py in pts]))
    return shapes

# Draw and save the figure using matplotlib based on parsed geometry shapes
def draw_geometry(shapes, out_filename="figure.png"):
    fig, ax = plt.subplots()
    ax.set_xlim(-10, 10)
    ax.set_ylim(-10, 10)
    # Equal aspect ratio so circles are not rendered as ellipses
    ax.set_aspect('equal')
    for shape in shapes:
        if shape[0] == "circle":
            (cx, cy), r = shape[1], shape[2]
            ax.add_patch(plt.Circle((cx, cy), r, fill=False, color='b'))
        elif shape[0] == "line":
            (x1, y1), (x2, y2) = shape[1], shape[2]
            ax.plot([x1, x2], [y1, y2], color='r')
        elif shape[0] == "polygon":
            pts = shape[1] + [shape[1][0]]
            ax.plot([p[0] for p in pts], [p[1] for p in pts], color='g')
    plt.title("Visualized Geometry Problem")
    plt.savefig(out_filename)
    plt.close(fig)


In [None]:
# Vision-Augmented Prompting (VAP) 3-stage pipeline for solving geometry intersection problems
def run_intersection_count_vap(dataset_dir="dataset", task="intersection_geometry", model="gpt-4o-mini", log_dir="log", max_tokens=500):
    # Set up directories for logs and figures
    current_time_str = datetime.now().strftime('%Y%m%d_%H%M%S')
    log_dir_base = os.path.join(log_dir, task + '_vap_separate')
    os.makedirs(log_dir_base, exist_ok=True)
    run_dir = os.path.join(log_dir_base, f'{model}_separate_{current_time_str}')
    os.makedirs(run_dir, exist_ok=True)
    figures_dir = os.path.join(run_dir, 'figures')
    os.makedirs(figures_dir, exist_ok=True)

    # CSV to save predictions
    result_csv = os.path.join(run_dir, 'result.csv')
    result_file = open(result_csv, 'w', encoding='utf8')
    result_file.write('problem_id,plan,iterations,final_answer,is_correct,num_shape\n')
    result_file.flush()

    # Stage 1: High-level plan for drawing
    def planning_stage(problem_text):
        prompt = f"""Your task is to visualize a geometry problem by creating an image.
Before drawing, provide a high-level plan including:
- Tool Selection
- Initialization
- Iterative Drawing Approach
Output JSON with: tool, initialization, iterative_approach, termination_condition.
Problem Description: {problem_text}"""
        return json.loads(clean_response(call_gpt(prompt, model=model, temperature=0, max_tokens=max_tokens)))

    # Stage 2: Simulate drawing updates and thoughts
    def iterative_reasoning_stage(problem_text, image_path, plan):
        plan_str = json.dumps(plan)
        prompt = f"""You are a visualizer for this geometry problem:
{problem_text}
Plan: {plan_str}
Image: {image_path}
Simulate iterative drawing steps in JSON array (iteration, thought, iterative_draw_step)."""
        return json.loads(clean_response(call_gpt(prompt, model=model, temperature=0, max_tokens=max_tokens, image_path=image_path)))

    # Stage 3: Predict the final answer using the problem text, image and reasoning trajectory
    def conclusive_reasoning_stage(problem_text, image_path, plan, iterations):
        prompt = f"""Based on:
- Problem: {problem_text}
- Plan: {json.dumps(plan)}
- Iterations: {json.dumps(iterations)}
- Image: {image_path}
Output final answer JSON as: {{ "final_answer": X }}"""
        return int(json.loads(clean_response(call_gpt(prompt, model=model, temperature=0, max_tokens=max_tokens, image_path=image_path))).get("final_answer", 0))

    # Load dataset and run all 3 stages
    metadata_path = os.path.join(dataset_dir, task, 'task.json')
    with open(metadata_path, 'r', encoding='utf8') as f:
        metadata = json.load(f)

    right_count = 0
    for idx, item in enumerate(metadata[:120]):
        problem_text = item["input"]
        gt_answer = int(item["answer"])
        num_shape = item.get("num_shape", 0)
        figure_path = os.path.join(figures_dir, f"fig_{idx}.png")
        draw_geometry(parse_geometry_description(problem_text), figure_path)

        try:
            plan = planning_stage(problem_text)
            iters = iterative_reasoning_stage(problem_text, figure_path, plan)
            final_ans = conclusive_reasoning_stage(problem_text, figure_path, plan, iters)
        except Exception as e:
            print(f"[{idx}] Error: ", e)
            plan, iters, final_ans = {}, [], 0

        correct = int(final_ans) == gt_answer
        right_count += int(correct)
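        # Quote the JSON fields (doubling embedded quotes) so their commas do not break the CSV columns
        plan_csv = '"' + json.dumps(plan).replace('"', '""') + '"'
        iters_csv = '"' + json.dumps(iters).replace('"', '""') + '"'
        result_file.write(f"{idx},{plan_csv},{iters_csv},{final_ans},{correct},{num_shape}\n")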
        result_file.flush()
        print(f"[{idx}] Pred: {final_ans} | GT: {gt_answer} | Correct: {correct}")

    # Print and log final results
    acc = right_count / len(metadata[:120]) * 100
    print(f"Final Accuracy: {acc:.2f}%")
    with open(os.path.join(run_dir, 'summary.log'), 'w') as f:
        f.write(f"Accuracy: {acc:.2f}%\n")
    result_file.close()


In [None]:
# Run Standard, COT and VAP methods
run_intersection_count(dataset_dir='dataset', method='standard', model='gpt-4o-mini')
run_intersection_count(dataset_dir='dataset', method='cot', model='gpt-4o-mini')
run_intersection_count_vap(dataset_dir='dataset', task='intersection_geometry', model='gpt-4o-mini', log_dir='log')


# Results

#### Image formation:
#### Geometry Problem
![Geometry Problem Image Formation](fig_geometry.png "Geometry Problem Image Formation, https://github.com/Ashish-Sunuwar/Vision_Augmented_Prompting/blob/main/src/Images/fig_geometry.png")

#### Time Series Problem
![Time Series Problem Image Formation](fig_time.png "Time Series Problem Image Formation, https://github.com/Ashish-Sunuwar/Vision_Augmented_Prompting/blob/main/src/Images/fig_time.png")


#### Final Results:
#### Geometry Problem
![Geometry Problem Results](fig_geometry_result.png "Geometry Problem Results, https://github.com/Ashish-Sunuwar/Vision_Augmented_Prompting/blob/main/src/Images/fig_geometry_result.png")

#### Time Series Problem
![Time Series Problem Results](fig_time_result.png "Time Series Problem Results, https://github.com/Ashish-Sunuwar/Vision_Augmented_Prompting/blob/main/src/Images/fig_time_result.png")


# Conclusion and Future Direction

This project taught us how essential visual thinking is for solving problems beyond pure text. Vision-Augmented Prompting enables LLMs to reason more like humans — by looking, planning, and thinking step by step. Even with a basic setup using Matplotlib, we observed meaningful accuracy gains. We saw this was especially helpful for problems that require spatial reasoning, like geometry intersection counting, and temporal understanding, like time series forecasting.

Future work could explore a more diverse set of problems, enable dynamic image updates that give the model live visual feedback, and improve the interpretability of the model's reasoning.



# References:

[1]: Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou, Chain-of-thought prompting elicits reasoning in large language models, NeurIPS, 2022

[2]: Ziyang Xiao, Dongxiang Zhang, Xiongwei Han, Xiaojin Fu, Wing Yin Yu, Tao Zhong, Sai Wu, Yuan Wang, Jianwei Yin, and Gang Chen, Enhancing LLM Reasoning via Vision-Augmented Prompting, NeurIPS, 2024