Sprint 4 Goal
Sprint 4 focuses on characterizing how LLMs behave on route-navigation tasks.
The main goal is to understand:
- which route patterns are easy or difficult for LLMs
- what kinds of route-navigation failures appear
- whether prompt design or SSAL-style graph representation improves results
- how GPT and Gemini differ when running our route-navigation tasks
- how visualization can help us inspect and explain route failures
This sprint is not intended to become a formal statistical study. Since LLM outputs can vary between runs, we may repeat selected prompts a small number of times to check whether a result is stable or inconsistent. However, the goal is descriptive robustness, not statistical significance.
By the end of the sprint, we should be able to explain not only whether an LLM succeeded or failed, but also why certain route cases were harder than others.
Sprint Summary
This sprint focuses on six connected questions.
1. What kinds of route-navigation cases are we testing?
We will define origin–destination equivalence classes and route-pattern categories so that our test cases are not arbitrary.
Possible route patterns include:
- direct routes
- short multi-hop routes
- long multi-hop routes
- junction-heavy routes
- routes with multiple plausible alternatives
- near-shortest alternative routes
- weighted routes
- detour routes
- swing routes
- routes where the locally obvious choice is globally worse
- routes that are easy to compute but hard to explain clearly
This should help us connect evaluation results to the actual spatial reasoning properties of each route.
2. How do LLMs fail or partly succeed on these cases?
We will define a failure-mode taxonomy and inspect representative model outputs.
Important failure modes include:
- invalid edge hallucination
- hallucinated nodes
- skipped intermediate nodes
- wrong junction choice
- direct-route bias
- ignoring edge weights
- locally plausible but globally wrong routing
- valid but non-optimal route
- loop or unnecessary backtracking
- malformed output
- poor route explanation
- unstable output across repeated runs
The goal is to move beyond simply saying “the model was wrong” and instead explain what kind of route-navigation failure occurred.
3. Why are some routes less reliably solved than others?
After evaluating model outputs, we will compare route performance against route properties.
Relevant route properties may include:
- number of nodes and edges in the reference path
- number of junctions or branch choices
- average and maximum node degree along the route
- number of near-shortest alternatives
- difference between weighted and unweighted shortest paths
- whether the route requires a detour or swing
- whether the locally obvious choice is globally worse
- whether the route is difficult to explain in natural language
This supports a stronger research explanation:
The model failed not just because the answer was wrong, but because the route required a specific type of spatial reasoning.
4. Do prompt templates or SSAL representation improve the results?
We will evaluate whether different prompt templates or graph representations improve route-navigation outputs.
Possible comparisons include:
- baseline route-finding prompt
- stricter JSON-output prompt
- structured route-finding prompt
- SSAL-based graph representation
- no-data or reduced-data baseline, if useful
- validation-aware or tool-aware prompt, if time allows
The goal is to identify whether improvements come from better prompting, better graph grounding, or clearer output constraints.
5. How do GPT and Gemini differ on our task?
We will compare GPT and Gemini behavior on the project’s route-navigation tasks.
This comparison should focus on practical differences such as:
- route validity
- output format reliability
- tendency to hallucinate edges or nodes
- ability to use SSAL / graph context
- explanation quality
- consistency across similar prompts
- common failure modes
The goal is not to produce a large benchmark, but to document meaningful differences observed in our project setting.
6. How can visualization help us understand the results?
We will use visualization to inspect evaluation metrics and route outputs.
Useful visualizations may include:
- reference route vs LLM route
- invalid or missing route segments
- success/failure by route pattern
- failure-mode frequency
- prompt-template comparison
- GPT vs Gemini comparison
- examples of routes that were solved, partly solved, or failed
Visualization should support interpretation and final presentation, not just make the dashboard look nicer.
Sprint Epics and Tasks
Epic 1 — Route Patterns and OD-Pair Selection
Task 1 — Define OD-pair equivalence classes and route patterns
Goal:
Make OD-pair selection systematic and defensible.
Description:
Define categories for grouping route cases by the type of route-navigation reasoning they require. A small labeling sketch follows the category list below.
Possible categories:
- direct route
- short multi-hop route
- long multi-hop route
- junction-heavy route
- multiple plausible alternatives
- near-shortest alternatives
- weighted route
- local/global conflict
- detour route
- swing route
- explanation-heavy route
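As a minimal sketch, assuming a Python setup, labeled OD pairs could be stored as small records like the following; the label set and field names are illustrative placeholders, not an agreed schema:

```python
from dataclasses import dataclass, field

# Illustrative route-pattern labels; the final label set comes out of this task.
ROUTE_PATTERNS = {
    "direct", "short_multi_hop", "long_multi_hop", "junction_heavy",
    "multiple_alternatives", "near_shortest_alternatives", "weighted",
    "local_global_conflict", "detour", "swing", "explanation_heavy",
}

@dataclass
class ODCase:
    """One labeled origin-destination test case."""
    origin: int                                        # origin node ID
    destination: int                                   # destination node ID
    patterns: set[str] = field(default_factory=set)    # route-pattern labels

    def __post_init__(self) -> None:
        unknown = self.patterns - ROUTE_PATTERNS
        if unknown:
            raise ValueError(f"Unknown route-pattern labels: {unknown}")

# Hypothetical example: a junction-heavy case that also requires a detour.
case = ODCase(origin=17, destination=342, patterns={"junction_heavy", "detour"})
```

Keeping the labels as a fixed set makes it easy to check that the selection does not overrepresent a single route type.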
Acceptance criteria:
- OD-pair equivalence classes are documented
- Route-pattern labels are documented
- Existing or selected OD pairs are labeled
- The selection avoids overrepresenting only one route type
- Route-pattern labels can be used in evaluation outputs or result summaries
Priority: High
Epic 2 — Failure Modes and Route Difficulty
Task 2 — Define failure-mode taxonomy
Goal:
Create a structured way to analyze how LLM route-navigation outputs fail.
Description:
Define failure modes that can be applied manually, automatically, or semi-automatically. A detection sketch follows the list below.
Possible failure modes:
- invalid edge hallucination
- hallucinated node
- skipped intermediate node
- wrong junction choice
- direct-route bias
- ignores edge weights
- valid but non-optimal route
- loop or unnecessary backtracking
- malformed output
- correct route but poor explanation
- wrong explanation for correct route
- overconfident answer without enough data
- refusal despite enough data
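For the detection guidance, some of the structural failure modes can be flagged automatically against the graph. A minimal sketch, assuming the network is available as a networkx graph and routes are lists of node IDs (the function name and the proxy checks are assumptions, not a finished detector):

```python
import networkx as nx

def detect_structural_failures(graph: nx.Graph, route: list[int], reference: list[int]) -> set[str]:
    """Flag failure modes detectable from the route alone; explanation-related modes
    (poor or wrong explanation, overconfidence, refusal) still need manual review."""
    modes: set[str] = set()

    # Hallucinated node: a node ID that does not exist in the graph.
    if any(node not in graph for node in route):
        modes.add("hallucinated_node")

    # Invalid edge hallucination: consecutive nodes that are not actually connected.
    if any(a in graph and b in graph and not graph.has_edge(a, b)
           for a, b in zip(route, route[1:])):
        modes.add("invalid_edge")

    # Loop or unnecessary backtracking: the same node visited more than once.
    if len(set(route)) < len(route):
        modes.add("loop_or_backtracking")

    # Valid but non-optimal: structurally fine, just longer than the reference
    # (a crude hop-count proxy; a weighted comparison would be more precise).
    if not modes and len(route) > len(reference):
        modes.add("valid_but_non_optimal")

    return modes
```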
Acceptance criteria:
- Failure-mode taxonomy is documented
- Each failure mode has a clear definition
- Each failure mode includes detection guidance
- At least one example is provided for each major failure mode, where available
- Failure modes can be linked to route patterns
Priority: High
Task 3 — Explain why some routes are less reliably solved
Goal:
Identify which route properties are associated with easier, harder, or failed route-navigation cases.
Description:
After model outputs are evaluated, analyze which properties of the route may explain the result. A property-computation sketch follows the list below.
Possible route properties to track:
- reference path length
- number of edges and intermediate nodes
- number of junctions
- average node degree along the route
- maximum node degree along the route
- number of plausible outgoing alternatives
- number of shortest or near-shortest paths
- difference between shortest and second-best route
- difference between weighted and unweighted shortest path
- whether a detour is required
- whether a swing is required
- whether the locally obvious choice is globally worse
- whether the route is difficult to explain in natural language
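A minimal sketch of the property computation, assuming the road network is a weighted networkx graph and the reference path is a list of node IDs; the dictionary keys and the degree-3 junction threshold are assumptions to be refined:

```python
import networkx as nx

def route_properties(graph: nx.Graph, path: list[int], weight: str = "length") -> dict:
    """Compute simple structural properties of a reference path for OD-pair metadata."""
    degrees = [graph.degree(node) for node in path]
    origin, destination = path[0], path[-1]

    # Compare the weighted shortest path with the plain hop-count shortest path.
    weighted_sp = nx.shortest_path(graph, origin, destination, weight=weight)
    unweighted_sp = nx.shortest_path(graph, origin, destination)

    return {
        "num_nodes": len(path),
        "num_edges": len(path) - 1,
        "num_junctions": sum(1 for d in degrees if d >= 3),  # branch choices along the route
        "avg_degree": sum(degrees) / len(degrees),
        "max_degree": max(degrees),
        "weighted_vs_unweighted_differs": weighted_sp != unweighted_sp,
    }
```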
Acceptance criteria:
- Route-property fields are defined for selected OD-pair metadata
- Selected route cases are labeled with relevant properties
- Model performance is compared against route properties
- At least three examples explain why a route was harder or easier
- The analysis avoids overclaiming causal conclusions from small samples
- Findings are included in the results discussion or final report notes
Priority: High
Epic 3 — Prompting, SSAL, and Model Comparison
Task 4 — Evaluate prompt templates and possible improvements
Goal:
Test whether prompt design changes route-navigation reliability.
Description:
Compare different prompt templates to see whether route outputs improve. An example template follows the list below.
Possible templates:
- baseline route-finding prompt
- structured route-finding prompt
- strict JSON route-output prompt
- validation-aware prompt
- no-data or reduced-data baseline prompt
- prompt using Dijkstra/reference information, if relevant
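As one example of the strict JSON variant, a template could look roughly like the sketch below; the wording, the edge-list encoding, and the `route`/`reasoning` field names are placeholders rather than the agreed template:

```python
# Hypothetical strict JSON route-output prompt template.
STRICT_JSON_PROMPT = """\
You are given a road network as a list of edges (node_id_a, node_id_b, length_m):
{edge_list}

Find a route from node {origin} to node {destination}.

Respond with JSON only, using exactly this schema:
{{"route": [<node IDs from origin to destination>], "reasoning": "<one short sentence>"}}

Use only node IDs and edges that appear in the edge list. Do not invent nodes or edges.
"""

prompt = STRICT_JSON_PROMPT.format(
    edge_list="(17, 23, 120.0)\n(23, 42, 85.5)",  # toy example edges
    origin=17,
    destination=42,
)
```

Keeping the same placeholders across templates makes it possible to run the same OD pairs under every prompt condition.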
Acceptance criteria:
- Prompt templates are documented
- Same or comparable OD pairs can be tested across prompt templates
- Results can be compared by prompt type
- The analysis identifies whether prompt engineering improves route validity, output format, or explanation quality
- At least one prompt template is selected as the main comparison condition
Priority: Medium
Task 5 — Write about using SSAL as an enhancement technique
Goal:
Explain how SSAL is used as a structured graph representation and whether it improves LLM route-navigation behavior.
Description:
Document SSAL as an enhancement or grounding technique. The writeup should explain why providing structured graph context may help LLMs compared with vague or unstructured route descriptions.
Acceptance criteria:
- SSAL is explained clearly
- The writeup describes how SSAL is used in prompts or evaluation
- The writeup discusses whether SSAL improves route validity or reasoning
- Limitations are included
- Text is suitable for the final report
Priority: High
Task 6 — Write about the differences between GPT and Gemini on our route-navigation task
Goal:
Compare observed GPT and Gemini behavior in the project’s route-navigation tasks.
Description:
Focus on practical differences observed from actual runs, not broad claims about the models in general.
Possible comparison points:
- route validity
- output format reliability
- ability to follow JSON schema
- tendency to hallucinate edges or nodes
- use of SSAL / graph context
- explanation quality
- consistency across similar prompts
- common failure modes
Acceptance criteria:
- GPT/Gemini comparison is based on project outputs
- Differences are described with concrete examples where possible
- The writeup avoids overgeneralizing beyond our experiments
- Text is suitable for the final report
Priority: High
Optional Task — Evaluate LLM as route explainer using algorithmic routes
Goal:
Test whether LLMs are more useful for explaining a known route than for computing the route themselves.
Description:
Give the LLM a route produced by an algorithmic method, such as Dijkstra, and ask it to explain the route as navigation guidance. This tests whether the LLM can serve as a natural-language interface over deterministic routing tools.
This task is optional because the sprint already has several core analysis tasks. It should be attempted if time remains or if it fits naturally with the prompt-template work.
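If attempted, the route-explainer setup can stay very small. A sketch, assuming a weighted networkx graph and node-ID routes (the function name and prompt wording are illustrative):

```python
import networkx as nx

def explainer_prompt(graph: nx.Graph, origin: int, destination: int, weight: str = "length") -> str:
    """Ask the LLM to explain a precomputed route instead of finding one."""
    route = nx.dijkstra_path(graph, origin, destination, weight=weight)  # reference route
    return (
        "The following route has already been computed and is known to be correct:\n"
        f"{route}\n\n"
        "Explain this route as step-by-step navigation guidance. "
        "Do not change, shorten, or extend the route, and do not invent roads or landmarks."
    )
```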
Acceptance criteria:
- A route-explainer prompt template is created
- A few Dijkstra/reference routes are selected
- The LLM is asked to produce navigation guidance from the given route
- Outputs are checked for:
- route preservation
- invented roads or landmarks
- skipped junctions
- unclear instructions
- Findings are compared informally against LLM-as-route-finder behavior
- Results are added to final report notes if useful
Priority: Optional / Nice to have
Epic 4 — Visualization and Result Interpretation
Task 7 — Visualize evaluation metrics and identify patterns
Goal:
Make route-pattern and failure-mode results easier to interpret.
Description:
Generate visual summaries of evaluation results and use them to identify meaningful patterns. A plotting sketch follows the list below.
Possible visualizations:
- success/failure by route pattern
- failure-mode frequency
- route property vs performance
- prompt-template comparison
- GPT vs Gemini comparison
- examples of invalid route segments
- reference route vs LLM route, if supported
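A minimal plotting sketch, assuming evaluation results land in a pandas DataFrame with `route_pattern` and `success` columns (illustrative column names and toy data):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy results; in practice this DataFrame comes from the evaluation pipeline.
results = pd.DataFrame({
    "route_pattern": ["direct", "direct", "junction_heavy", "junction_heavy", "detour"],
    "success": [True, True, False, True, False],
})

# Success rate per route pattern as a simple bar chart for the report.
success_rate = results.groupby("route_pattern")["success"].mean()
ax = success_rate.plot(kind="bar", ylabel="success rate",
                       title="LLM route validity by route pattern")
ax.figure.tight_layout()
ax.figure.savefig("success_by_route_pattern.png", dpi=200)
```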
Acceptance criteria:
- At least two useful visual summaries are produced
- Visualizations are saved for report or presentation use
- Figures support interpretation rather than only decoration
- Captions or notes explain the main takeaway
- Limitations of the small project scale are noted
Priority: Medium
Task 8 — Add reusable route visualization interface for node-ID routes
Goal:
Support easier inspection of reference routes and LLM-generated routes.
Description:
Create or improve a reusable visualization interface that can display node-ID routes from the evaluation data. This should help the team inspect where a model route diverges from the reference route.
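One possible shape for such an interface, assuming routes are lists of node IDs and node coordinates are stored in a `pos` node attribute of a networkx graph (both assumptions about the evaluation data, not the existing code):

```python
import networkx as nx
import matplotlib.pyplot as plt

def plot_route_comparison(graph: nx.Graph, reference: list[int],
                          llm_route: list[int], out_path: str) -> None:
    """Draw the network with the reference route and the LLM route overlaid."""
    pos = nx.get_node_attributes(graph, "pos")  # assumes (x, y) coordinates per node
    nx.draw(graph, pos, node_size=10, edge_color="lightgray", with_labels=False)

    # Reference and LLM routes in different styles so divergence points stand out;
    # hallucinated edges (not in the graph) are filtered here and can be reported separately.
    ref_edges = list(zip(reference, reference[1:]))
    llm_edges = [(a, b) for a, b in zip(llm_route, llm_route[1:]) if graph.has_edge(a, b)]
    nx.draw_networkx_edges(graph, pos, edgelist=ref_edges, edge_color="green", width=3)
    nx.draw_networkx_edges(graph, pos, edgelist=llm_edges, edge_color="red", width=1.5, style="dashed")

    plt.savefig(out_path, dpi=200)
    plt.close()
```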
Acceptance criteria:
- A node-ID route can be visualized
- Reference route and LLM route can be compared
- Invalid or missing segments can be inspected
- The interface is reusable for multiple evaluated cases
- Screenshots or outputs can support the final presentation/report
Priority: Medium
Epic 5 — Lightweight Repeated-Run Robustness Check
Task 9 — Check repeated-run consistency for selected prompts
Goal:
Check whether selected LLM route-navigation outputs are stable across multiple runs or only succeed/fail once by chance.
Description:
This task replaces the earlier idea of formal statistical analysis. For a small set of representative OD pairs, run the same prompt a few times with the same model and compare whether the route output, validity, and failure mode remain consistent.
The goal is descriptive robustness, not statistical significance.
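A sketch of the comparison loop, assuming a `call_model(prompt)` callable that returns a parsed node-ID route (hypothetical here; the real call would go through the project's model wrappers):

```python
from collections import Counter
from typing import Callable

def consistency_check(prompt: str, call_model: Callable[[str], list[int]], n_runs: int = 3) -> dict:
    """Run the same prompt several times and summarize how stable the route output is."""
    routes = [tuple(call_model(prompt)) for _ in range(n_runs)]
    counts = Counter(routes)
    most_common_route, freq = counts.most_common(1)[0]
    return {
        "n_runs": n_runs,
        "distinct_routes": len(counts),
        "most_common_route": list(most_common_route),
        "stability": freq / n_runs,  # 1.0 means every run returned the same route
    }
```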
Acceptance criteria:
- A small set of representative route cases is selected
- Each selected prompt is run multiple times, for example 3–5 times
- Outputs are compared for route validity, route variation, and failure mode
- At least one stable success, stable failure, or inconsistent case is identified, where such cases exist
- Findings are summarized descriptively
- The report avoids claiming formal statistical significance
Priority: Low / Medium
Definition of Done for Sprint 4
The sprint should be considered successful if the team can answer most of the following:
- What route patterns are represented in our dataset?
- What equivalence classes do our origin–destination pairs belong to?
- What failure modes appear most often in LLM route-navigation outputs?
- Which route properties seem associated with easier or harder cases?
- Does SSAL or structured graph context improve route-navigation behavior?
- Do prompt templates change route validity, output format, or explanation quality?
- How do GPT and Gemini differ in our task setting?
- Can visualization help explain route failures or model differences?
- Are selected outputs stable across a few repeated runs, or are they inconsistent?
- What careful claim can we make about LLM route-navigation capability on our dataset?
How Sprint 4 Supports the Four Research Questions
| Research question | How Sprint 4 supports it | Missing before final |
| --- | --- | --- |
| RQ1: Existing benchmarks and frameworks | Our route-pattern and failure-mode work helps position the project as a focused route-navigation evaluation. | We still need a related-work comparison table and a short explanation of how our task differs from existing benchmarks. |
| RQ2: Comparative LLM performance | GPT/Gemini comparison, prompt-template evaluation, visualized metrics, and optional repeated-run checks support comparison. | We need actual evaluated results and clear result summaries. |
| RQ3: Failure modes | Failure-mode taxonomy, route-difficulty explanation, and route visualization directly support this question. | We need concrete examples from model outputs and mapping between route patterns and failures. |
| RQ4: Enhancement strategies | SSAL writeup and prompt-template evaluation test possible improvements. The optional route-explainer task can further support the idea of LLMs as grounded navigation interfaces. | We need at least one clear baseline-vs-enhancement comparison, even if small. |
Methodological Position
This sprint uses a practical, descriptive methodology suitable for a small project.
We will not claim broad statistical significance. Instead, we will:
- select representative route cases
- label route patterns and route properties
- evaluate model outputs against reference routes
- identify recurring failure modes
- compare prompt and representation choices
- use visualization to support interpretation
- optionally repeat selected prompts a few times to check output stability
This should give enough evidence for a careful project conclusion without over-engineering the evaluation.
Final Project Claim This Sprint Should Enable
By the end of Sprint 4, we should be able to make a careful claim such as:
Our results suggest that LLM route-navigation behavior depends strongly on route pattern, graph representation, and prompt framing. LLMs may produce valid routes in some cases, but they struggle with routes involving junction complexity, weighted choices, detours, local/global conflicts, or strict graph constraints. SSAL-style structured graph context and improved prompt templates may reduce some errors, but LLMs should not be treated as standalone routing engines. They may be more useful when grounded by deterministic route tools, especially for explaining or communicating routes.