# An Experiment on More Efficient JSON Data Editing with LLMs
## The current issue

The `BaselineExtractor` and `BaselineGenerator` classes in `base.py` are responsible for organizing incoming webpages into more structured data using LLMs. However, the current implementation asks the LLM to regenerate the entire JSON document even when only a minor change is needed and the rest of the data stays exactly the same. This is inefficient and consumes a lot of time and tokens.

A recent study, [JSON Whisperer](https://arxiv.org/abs/2510.04717), proposed a patch-based framework for editing JSON data with LLMs more efficiently. Rather than having the LLM generate the entire JSON document, the paper has it generate [RFC 6902 diff patches](https://datatracker.ietf.org/doc/html/rfc6902) that describe only the necessary changes, which are then applied to the data locally.
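To make the idea concrete, the sketch below applies a small subset of RFC 6902 (`add`, `replace`, `remove`) using only the standard library. It is an illustration under simplifying assumptions, not the `jsonpatch` implementation used in the experiments; `apply_patch` and `unescape_token` are hypothetical names introduced here.

```python
import copy


def unescape_token(token):
    # JSON Pointer (RFC 6901) escapes: "~1" stands for "/", "~0" for "~"
    return token.replace("~1", "/").replace("~0", "~")


def apply_patch(document, operations):
    """Apply add/replace/remove operations and return a patched copy."""
    doc = copy.deepcopy(document)  # the caller's document is left untouched
    for operation in operations:
        tokens = [unescape_token(t) for t in operation["path"].split("/")[1:]]
        if not tokens:
            raise ValueError("this sketch does not handle whole-document paths")

        # Walk down to the container that holds the target location
        parent = doc
        for token in tokens[:-1]:
            parent = parent[int(token)] if isinstance(parent, list) else parent[token]

        op, last = operation["op"], tokens[-1]
        if op == "add":
            if isinstance(parent, list):
                # "-" means "append to the end of the array"
                index = len(parent) if last == "-" else int(last)
                parent.insert(index, operation["value"])
            else:
                parent[last] = operation["value"]
        elif op == "replace":
            key = int(last) if isinstance(parent, list) else last
            parent[key]  # raises if the target does not exist
            parent[key] = operation["value"]
        elif op == "remove":
            del parent[int(last) if isinstance(parent, list) else last]
        else:
            raise ValueError(f"unsupported op in this sketch: {op}")
    return doc


document = {"articles": [{"id": 1, "title": "Old Title"}, {"id": 2, "title": "Another"}]}
patched = apply_patch(document, [
    {"op": "replace", "path": "/articles/0/title", "value": "New Title"},
    {"op": "add", "path": "/articles/-", "value": {"id": 3, "title": "Third"}},
])
print(patched["articles"][0]["title"], len(patched["articles"]))
```

Only the operations travel between the LLM and the application; the untouched parts of the document never need to be re-emitted.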

This report tests the idea and checks its potential on a simplified version of our use case.

In [137]:
import json
import jsonpatch
import time
from dotenv import load_dotenv
import os
from openai import OpenAI

load_dotenv()

True

## Experiment 1: Using Code Directly

First, we test the idea without an LLM and compare the performance of the two approaches when both are implemented directly in Python.

We define an `initial_state` that will act as the JSON to be edited.

In [138]:
initial_state = {
    "articles": [
        {"id": 1, "title": "Old Title"},
        {"id": 2, "title": "Another"}
    ]
}

The code below is a very simple replication of what `BaselineExtractor` and `BaselineGenerator` currently do: rewrite the entire JSON data even when only a small change is made.

In [139]:
# Baseline (full JSON regeneration)
# New webpage says article 1 has a new title
new_state = {
    "articles": [
        {"id": 1, "title": "New Title"},  # regenerated
        {"id": 2, "title": "Another"}     # unchanged but re-emitted
    ]
}

print("Baseline full regeneration:", json.dumps(new_state, indent=2))

Baseline full regeneration: {
  "articles": [
    {
      "id": 1,
      "title": "New Title"
    },
    {
      "id": 2,
      "title": "Another"
    }
  ]
}


The code below is a very simple replication of the proposed patch framework using RFC 6902 diff patches.

In [140]:
# JSON Whisperer Methodology (patch-based)
# Patch generated by LLM (instead of full JSON)
patch_ops = [
    {"op": "replace", "path": "/articles/0/title", "value": "New Title"}
]

# Apply patch
patch = jsonpatch.JsonPatch(patch_ops)
new_state = patch.apply(initial_state)

print("Patch-based update:", json.dumps(new_state, indent=2))

Patch-based update: {
  "articles": [
    {
      "id": 1,
      "title": "New Title"
    },
    {
      "id": 2,
      "title": "Another"
    }
  ]
}


Both have the exact same output but the approach is different. Now we test their performance.

To automate the testing, we create an `update` object to indicate a change.

In [141]:
# Desired update: change article 1's title
update = {"id": 1, "title": "New Title"}

The simple baseline and patch implementations above have been turned into functions so the test can be automated.

In [142]:
# Baseline: full JSON regeneration
def baseline_update(state, update):
    # `state` is deliberately ignored: the baseline rebuilds the whole document
    new_state = {
        "articles": [
            {"id": update["id"], "title": update["title"]},  # regenerated
            {"id": 2, "title": "Another"}                   # unchanged but re-emitted
        ]
    }
    return new_state

# Patch-based: JSON Whisperer style
def patch_update(state, update):
    patch_ops = [
        {"op": "replace", "path": "/articles/0/title", "value": update["title"]}
    ]
    patch = jsonpatch.JsonPatch(patch_ops)
    return patch.apply(state)



We create a benchmark function that runs a set number of iterations, updating the JSON data with each method, and reports the performance of both. Performance is measured by runtime and by the number of characters in the output. We also check whether the outputs of both methods are exactly the same.

In [143]:
# Benchmark function
def benchmark(n_runs=10000):
    # Baseline
    start = time.perf_counter()
    for _ in range(n_runs):
        baseline_result = baseline_update(initial_state, update)
    baseline_time = time.perf_counter() - start
    baseline_size = len(json.dumps(baseline_result))

    # Patch-based
    start = time.perf_counter()
    for _ in range(n_runs):
        patch_result = patch_update(initial_state, update)
    patch_time = time.perf_counter() - start
    patch_size = len(json.dumps(patch_result))

    print("=== Performance Comparison ===")
    print(f"Baseline runtime: {baseline_time:.6f} seconds for {n_runs} runs")
    print(f"Patch runtime:    {patch_time:.6f} seconds for {n_runs} runs")
    print(f"\nBaseline JSON size: {baseline_size} chars")
    print(f"Patch JSON size:    {patch_size} chars")

    # Verify correctness
    same = (baseline_result == patch_result)
    print(f"\nFinal states identical: {same}")
    if same:    
        print("\nFinal json state from both:\n", json.dumps(baseline_result, indent=2))

We run 10,000 iterations so that the time difference is noticeable.

In [144]:
# Run benchmark
benchmark(n_runs=10000)

=== Performance Comparison ===
Baseline runtime: 0.002868 seconds for 10000 runs
Patch runtime:    0.157197 seconds for 10000 runs

Baseline JSON size: 78 chars
Patch JSON size:    78 chars

Final states identical: True

Final json state from both:
 {
  "articles": [
    {
      "id": 1,
      "title": "New Title"
    },
    {
      "id": 2,
      "title": "Another"
    }
  ]
}


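We see that both methods produce exactly the same output. However, the proposed patch method is much slower than the baseline here (about 0.157 seconds versus 0.003 seconds for 10,000 runs). This is expected when both are implemented directly in code: the baseline only builds a small Python dict literal, whereas `jsonpatch` has to construct the patch object, work on a copy of the document by default, and resolve each path before applying the change.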
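One detail worth noting is that the benchmark above measures the size of the final state for both methods, which is why both report 78 chars. What an LLM would actually have to emit is the full document in one case and only the patch operations in the other. The sketch below compares those two payloads; the 50-article document is a hypothetical stand-in for larger real data.

```python
import json

# Hypothetical larger document: 50 articles instead of 2
state = {"articles": [{"id": i, "title": f"Title {i}"} for i in range(1, 51)]}

# The same single-field change used throughout this report
patch_ops = [{"op": "replace", "path": "/articles/0/title", "value": "New Title"}]

full_chars = len(json.dumps(state))       # what a full regeneration has to emit
patch_chars = len(json.dumps(patch_ops))  # what a patch-based update has to emit

print(f"Full document: {full_chars} chars")
print(f"Patch payload: {patch_chars} chars")
```

The patch payload stays the same size no matter how large the document grows, while the full document scales with the amount of data.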

The proposed patch method is supposed to be used with LLMs, so now we move on to the next part of the experiment using LLMs.

## Experiment 2: Using LLM

This experiment uses the OpenAI package to call the LLM.

First, we configure OpenAI.

In [145]:
openai_api_key = os.environ.get("OPENAI_API_KEY")
openai_base_url = os.environ.get("OPENAI_BASE_URL")
openai_model = os.environ.get("OPENAI_MODEL")

In [146]:
# Initialize OpenAI client
client = OpenAI(api_key=openai_api_key, base_url=openai_base_url)

Next, we build the prompts that will be passed into OpenAI.

For the baseline method, we ask OpenAI to update the JSON data and regenerate the full JSON.

For the patch method, we instead tell OpenAI to only give us the RFC6902 patch operations.

In [147]:
# Baseline prompts
def baseline_prompt(state, update):
    return f"""You are an extractor. Current JSON:
{json.dumps(state, indent=2)}

Update: Article {update['id']} has new title "{update['title']}".
Regenerate the full JSON with this change.
"""

# Patch prompts
def patch_prompt(state, update):
    return f"""You are an extractor. Current JSON:
{json.dumps(state, indent=2)}

Update: Article {update['id']} has new title "{update['title']}".
Instead of regenerating, output RFC6902 patch operations only.
"""

LLMs sometimes wrap their output in markdown code fences, so we write a function to clean the output before use.

In [148]:
# Robust patch output cleaner
def clean_patch_output(output_text):
    # Strip markdown fences if present
    text = output_text.strip().strip("`")
    if text.startswith("json"):
        # Slice off the language tag; lstrip("json") strips characters, not a prefix
        text = text[len("json"):].strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as e:
        raise RuntimeError(f"Invalid JSON from model: {e}\n{text}")

    # Ensure it's a list of dicts with 'op'
    if not isinstance(parsed, list):
        raise RuntimeError(f"Patch output is not a list: {parsed}")
    if not all(isinstance(item, dict) and "op" in item for item in parsed):
        raise RuntimeError(f"Patch output is not valid RFC6902 ops: {parsed}")

    return parsed
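As an alternative way to remove the wrapper, a regular expression can unwrap the fence explicitly. The sketch below is one possible approach, assuming the model uses either no fence or a single fence with an optional `json` tag; `strip_code_fence` is a hypothetical helper name. Keep in mind that `str.lstrip("json")` removes any leading run of the characters `j`, `s`, `o`, `n` rather than the literal prefix, so it can eat into the payload.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example nests cleanly in markdown
FENCED_RE = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)


def strip_code_fence(output_text):
    """Return the payload inside a markdown fence, or the stripped text if unfenced."""
    text = output_text.strip()
    match = FENCED_RE.match(text)
    return match.group(1) if match else text


# A fenced response shaped like the patch output seen in this experiment
raw = FENCE + 'json\n[{"op": "replace", "path": "/articles/0/title", "value": "New Title"}]\n' + FENCE
ops = json.loads(strip_code_fence(raw))
print(ops[0]["op"], ops[0]["path"])
```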

Besides measuring the runtime and the output size, we also measure token usage. We create a class to track the tokens used by OpenAI.

In [149]:
# Tracking tokens
class TokenTracker:
    def __init__(self):
        self.total_tokens = 0

    def query_llm(self, prompt):
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=openai_model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        latency = time.perf_counter() - start
        output_text = response.choices[0].message.content

        # Track tokens if usage info is available
        tokens_used = None
        if response.usage:
            tokens_used = response.usage.total_tokens
            self.total_tokens += tokens_used

        return latency, output_text, tokens_used
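Because `total_tokens` combines the prompt and the completion, it can hide where the savings come from: the patch method mainly reduces what the model has to write. A minimal sketch of tracking the two separately is shown below, assuming the usage object exposes `prompt_tokens` and `completion_tokens` as the OpenAI chat completions response does. `SplitTokenTracker` is a hypothetical name, and the numbers are made up for illustration.

```python
from types import SimpleNamespace


class SplitTokenTracker:
    """Accumulate prompt and completion tokens separately."""

    def __init__(self):
        self.prompt_tokens = 0
        self.completion_tokens = 0

    def record(self, usage):
        # Some backends omit usage information; skip those responses
        if usage is None:
            return
        self.prompt_tokens += usage.prompt_tokens
        self.completion_tokens += usage.completion_tokens

    @property
    def total_tokens(self):
        return self.prompt_tokens + self.completion_tokens


# Stand-in for `response.usage`; the values are illustrative, not measured
fake_usage = SimpleNamespace(prompt_tokens=120, completion_tokens=30)

split_tracker = SplitTokenTracker()
split_tracker.record(fake_usage)
split_tracker.record(None)
print(split_tracker.prompt_tokens, split_tracker.completion_tokens, split_tracker.total_tokens)
```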

We define a function to benchmark the performance of the baseline method.

In [150]:
# Baseline benchmark
def benchmarkLLMBaseline(tracker):
    latency, output_text, tokens = tracker.query_llm(
        baseline_prompt(initial_state, update)
    )
    size = len(output_text)
    try:
        parsed_json = json.loads(output_text)
    except json.JSONDecodeError:
        parsed_json = None

    return {
        "latency": latency,
        "size": size,
        "tokens": tokens,
        "output_text": output_text,
        "parsed_json": parsed_json
    }

For the patch method, there are two steps: The LLM first generates the RFC 6902 diff patches and then we use the patches to apply the changes to the JSON data.

In [151]:
# Patch benchmark
def benchmarkLLMPatch(tracker):
    latency, output_text, tokens = tracker.query_llm(
        patch_prompt(initial_state, update)
    )
    size = len(output_text)

    # Clean and apply patch ops
    apply_start = time.perf_counter()
    patched_state = None
    try:
        ops = clean_patch_output(output_text)
        patch = jsonpatch.JsonPatch(ops)
        patched_state = patch.apply(initial_state)
    except Exception as e:
        print(f"Failed to apply patch: {e}")
    apply_latency = time.perf_counter() - apply_start
    total_latency = latency + apply_latency

    return {
        "latency": latency,
        "apply_latency": apply_latency,
        "total_latency": total_latency,
        "size": size,
        "tokens": tokens,
        "output_text": output_text,
        "patched_state": patched_state
    }
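When the patch cannot be parsed or applied, `patched_state` stays `None`. In a real pipeline, one option would be to fall back to full regeneration in that case. The sketch below shows that control flow with plain callables standing in for the two LLM paths; `update_with_fallback` and its parameters are hypothetical and not part of the project.

```python
def update_with_fallback(state, get_patched_state, get_full_state):
    """Try the patch path first; fall back to full regeneration if it fails.

    Both callables take the current state and return the new state.
    Returns a (new_state, method_used) tuple.
    """
    try:
        result = get_patched_state(state)
        if result is not None:
            return result, "patch"
    except Exception as exc:  # malformed JSON, invalid ops, bad paths, ...
        print(f"Patch path failed, falling back: {exc}")
    return get_full_state(state), "baseline"


# Illustrative stand-ins for the two update paths
def failing_patch(state):
    raise ValueError("model returned an invalid patch")


def full_regeneration(state):
    return {"articles": [{"id": 1, "title": "New Title"}]}


new_state, method = update_with_fallback({"articles": []}, failing_patch, full_regeneration)
print(method)
```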

In [152]:
# Initialize the Tracker
tracker = TokenTracker()

In [153]:
# Benchmark the different methods
baseline_result = benchmarkLLMBaseline(tracker)
patch_result = benchmarkLLMPatch(tracker)

We now take a look at the LLM outputs for the two methods.

In [154]:
# LLM output for baseline method
print("Baseline output:\n", baseline_result["output_text"])

Baseline output:
 {
  "articles": [
    {
      "id": 1,
      "title": "New Title"
    },
    {
      "id": 2,
      "title": "Another"
    }
  ]
}


For the baseline method, we already get the final output with the required changes.

In [155]:
print("Patch output:\n", patch_result["output_text"])

Patch output:
 ```json
[
  {
    "op": "replace",
    "path": "/articles/0/title",
    "value": "New Title"
  }
]
```


For the patch method, we are only getting the RFC 6902 diff patches which will be used to further update the JSON data.

In [156]:
print("=== OpenAI Performance Comparison ===")
print(f"Baseline latency:          {baseline_result['latency']:.3f} seconds")
print(f"\nPatch latency (LLM only):  {patch_result['latency']:.3f} seconds")
print(f"Patch apply latency:       {patch_result['apply_latency']:.6f} seconds")
print(f"Patch total latency:       {patch_result['total_latency']:.3f} seconds")
print(f"\nBaseline output size: {baseline_result['size']} chars")
print(f"Patch output size:    {patch_result['size']} chars")
print(f"\nBaseline tokens used: {baseline_result['tokens']}")
print(f"Patch tokens used:    {patch_result['tokens']}")

=== OpenAI Performance Comparison ===
Baseline latency:          2.981 seconds

Patch latency (LLM only):  1.976 seconds
Patch apply latency:       0.000059 seconds
Patch total latency:       1.976 seconds

Baseline output size: 130 chars
Patch output size:    102 chars

Baseline tokens used: 152
Patch tokens used:    144


Based on the results, even though the patch method is a two-step process and the baseline is a single step, the patch method finished faster (1.976 seconds versus 2.981 seconds). Its output was also smaller (102 versus 130 characters) and it used fewer tokens (144 versus 152).

In [157]:
same = (baseline_result["parsed_json"] == patch_result["patched_state"])
print(f"Final states identical: {same}")
if same:    
    print("\nFinal formatted json state from both:\n", baseline_result["output_text"])

Final states identical: True

Final formatted json state from both:
 {
  "articles": [
    {
      "id": 1,
      "title": "New Title"
    },
    {
      "id": 2,
      "title": "Another"
    }
  ]
}


The final state is also exactly the same. This shows that, in this run, the proposed patch method is more efficient than the baseline method while still producing exactly the same result, as seen above.

However, the prompts used in the project are much more complicated than the ones in the experiments so far, so let's run the same experiment with prompts as complex as the ones used in the project.

## Experiment 3: Complex Prompts

The prompts used in the project are much richer and more detailed. They define schemas, provide documents, and give specific instructions on the desired output. One of the prompts is related to updating the entity recognition feature, so this experiment uses a similar prompt with the same level of richness.

First, we change `initial_state` and `update` to work with the entity recognition JSON data.

In [158]:
initial_state = {
    "entities": [
        {"type": "Person", "name": "Alice", "description": "Researcher"},
        {"type": "Organization", "name": "OpenAI", "description": "AI lab"}
    ]
}
update = {"id": 1, "title": "New Discovery"}

We change the prompt for the baseline method and use the rich prompt similar to the ones used in the project.

In [159]:
def baseline_prompt(state, update):
    return f"""# Task
Extract entities from the given document and update the existing list.
- Valid entity types and their schemas are defined below.
- Maintain relevance, avoid duplication, keep the list concise.
- Always include the new information provided in the query.

# Entity Definition
```python
class Person(TypedDict):
    type: Literal["Person"]
    name: str
    description: str
class Organization(TypedDict):
    type: Literal["Organization"]
    name: str
    description: str
class Event(TypedDict):
    type: Literal["Event"]
    name: str
    description: str
    time: str
```

# Query
Update entity list with new information: "{update['title']}"

# Document
<DOCUMENT START> {json.dumps(state, indent=2)} <DOCUMENT END>

# Current Entities
{json.dumps(state, indent=2)}

Return the full updated entity list in JSON. DO NOT output anything else. """

The patch prompt is similar to the baseline prompt, but instead of regenerating the entire JSON, it asks for only the RFC6902 patch operations.

In [160]:
def patch_prompt(state, update):
    return f"""# Task
Generate RFC6902 JSON Patch operations to update the current entity list.
- Valid entity types and their schemas are defined below.
- Always include the new information provided in the query.
- If adding an Event, set description="" and time="" unless explicit values are given.
- Do not regenerate the full list, only output patch operations.

# Entity Definition
```python
class Person(TypedDict):
    type: Literal["Person"]
    name: str
    description: str
class Organization(TypedDict):
    type: Literal["Organization"]
    name: str
    description: str
class Event(TypedDict):
    type: Literal["Event"]
    name: str
    description: str
    time: str
```

# Query
Add new information: "{update['title']}"

# Current Entities
{json.dumps(state, indent=2)}

# Output
Return only a JSON array of RFC6902 operations (e.g. add/replace/remove). Do not include explanations or the full entity list. """

As before, we clean the LLM output to make it usable later.

In [161]:
def clean_baseline_output(output_text: str):
    """Strip markdown fences and load baseline JSON output."""
    text = output_text.strip().strip("`")
    if text.startswith("json"):
        # Slice off the language tag; lstrip("json") strips characters, not a prefix
        text = text[len("json"):].strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        raise RuntimeError(f"Invalid baseline JSON: {e}\n{text}")


def clean_patch_output(output_text: str):
    """Strip markdown fences and validate RFC6902 patch operations."""
    text = output_text.strip().strip("`")
    if text.startswith("json"):
        # Slice off the language tag; lstrip("json") strips characters, not a prefix
        text = text[len("json"):].strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as e:
        raise RuntimeError(f"Invalid JSON from model: {e}\n{text}")

    # Ensure it's a list of dicts with 'op'
    if not isinstance(parsed, list):
        raise RuntimeError(f"Patch output is not a list: {parsed}")
    if not all(isinstance(item, dict) and "op" in item for item in parsed):
        raise RuntimeError(f"Patch output is not valid RFC6902 ops: {parsed}")

    return parsed
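The check above only confirms that each item has an `op` key. RFC 6902 also fixes the set of operation names and the members each one requires, so a stricter validator is possible. The sketch below is one way to do that with the standard library; `find_patch_errors` is a hypothetical helper, and it checks structure only, not whether the paths exist in the target document.

```python
# Members required by RFC 6902 for each operation
REQUIRED_MEMBERS = {
    "add": ("path", "value"),
    "remove": ("path",),
    "replace": ("path", "value"),
    "move": ("from", "path"),
    "copy": ("from", "path"),
    "test": ("path", "value"),
}


def find_patch_errors(ops):
    """Return a list of human-readable problems; an empty list means the shape is valid."""
    if not isinstance(ops, list):
        return ["patch must be a JSON array"]

    errors = []
    for i, op in enumerate(ops):
        if not isinstance(op, dict):
            errors.append(f"op {i}: not a JSON object")
            continue
        name = op.get("op")
        if name not in REQUIRED_MEMBERS:
            errors.append(f"op {i}: unknown operation {name!r}")
            continue
        for member in REQUIRED_MEMBERS[name]:
            if member not in op:
                errors.append(f"op {i}: '{name}' requires '{member}'")
        # A JSON Pointer is either empty or starts with "/"
        path = op.get("path")
        if isinstance(path, str) and path and not path.startswith("/"):
            errors.append(f"op {i}: path must be empty or start with '/'")
    return errors


print(find_patch_errors([{"op": "replace", "path": "/entities/0/name"}]))
```

Reporting every problem at once, rather than raising on the first one, makes it easier to feed the errors back to the model in a retry prompt if that is ever needed.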

We define the benchmarks.

In [162]:
def benchmarkBaseline(tracker):
    latency, output_text, tokens = tracker.query_llm(baseline_prompt(initial_state, update))
    size = len(output_text)
    parsed_json = clean_baseline_output(output_text)
    return {
        "latency": latency,
        "size": size,
        "tokens": tokens,
        "output_text": output_text,
        "parsed_json": parsed_json
    }

def benchmarkPatch(tracker):
    latency, output_text, tokens = tracker.query_llm(patch_prompt(initial_state, update))
    size = len(output_text)
    apply_start = time.perf_counter()
    patched_state = None
    try:
        ops = clean_patch_output(output_text)
        patch = jsonpatch.JsonPatch(ops)
        patched_state = patch.apply(initial_state)
    except Exception as e:
        print(f"Failed to apply patch: {e}")
    apply_latency = time.perf_counter() - apply_start
    total_latency = latency + apply_latency
    return {
        "latency": latency,
        "apply_latency": apply_latency,
        "total_latency": total_latency,
        "size": size,
        "tokens": tokens,
        "output_text": output_text,
        "patched_state": patched_state
    }

In [163]:
baseline_result = benchmarkBaseline(tracker)
patch_result = benchmarkPatch(tracker)

We see that the output from the LLM for the baseline method is pretty much the final output with the updates.

In [164]:
print("Baseline output:\n", baseline_result["output_text"])

Baseline output:
 {
  "entities": [
    {
      "type": "Person",
      "name": "Alice",
      "description": "Researcher"
    },
    {
      "type": "Organization",
      "name": "OpenAI",
      "description": "AI lab"
    },
    {
      "type": "Event",
      "name": "New Discovery",
      "description": "",
      "time": ""
    }
  ]
}


The output from the LLM of the patch method is the RFC 6902 diff patches.

In [165]:
print("Patch output:\n", patch_result["output_text"])

Patch output:
 [
  {
    "op": "add",
    "path": "/entities/2",
    "value": {
      "type": "Event",
      "name": "New Discovery",
      "description": "",
      "time": ""
    }
  }
]


In [166]:

print("=== OpenAI Performance Comparison ===")
print(f"Baseline latency: {baseline_result['latency']:.3f} seconds")
print(f"\nPatch latency (LLM only):  {patch_result['latency']:.3f} seconds")
print(f"Patch apply latency:       {patch_result['apply_latency']:.6f} seconds")
print(f"Patch total latency:       {patch_result['total_latency']:.3f} seconds")
print(f"\nBaseline output size: {baseline_result['size']} chars")
print(f"Patch output size:    {patch_result['size']} chars")
print(f"\nBaseline tokens used: {baseline_result['tokens']}")
print(f"Patch tokens used:    {patch_result['tokens']}")


=== OpenAI Performance Comparison ===
Baseline latency: 3.527 seconds

Patch latency (LLM only):  2.282 seconds
Patch apply latency:       0.000059 seconds
Patch total latency:       2.282 seconds

Baseline output size: 322 chars
Patch output size:    172 chars

Baseline tokens used: 427
Patch tokens used:    349


We see that the patch method is still more efficient even with richer and more complex prompts: 2.282 seconds versus 3.527 seconds, 172 versus 322 output characters, and 349 versus 427 tokens. This suggests that the approach proposed in JSON Whisperer can make JSON editing with LLMs considerably more efficient and is worth considering for future improvement and development of this project.

### Something to note:
LLM output can vary considerably between calls, even with the exact same detailed prompt, to the point that the behavior is sometimes unpredictable. This, along with network traffic and other uncontrollable factors, could change the measured performance of either method, and each comparison above is based on a single run. However, the difference seems large enough that we can expect the patch method to be more efficient than the baseline method in most cases, if not all.
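One way to reduce the effect of this variability would be to repeat each benchmark several times and compare summary statistics instead of single measurements. The sketch below assumes a callable that returns a result dict with a `latency` key, as the benchmark functions in this report do; `summarize` and `repeat_benchmark` are hypothetical helpers, and the latencies in the usage example are made up.

```python
import statistics


def summarize(latencies):
    """Summarize a list of latency measurements in seconds."""
    return {
        "runs": len(latencies),
        "mean": statistics.mean(latencies),
        "median": statistics.median(latencies),
        # stdev needs at least two data points
        "stdev": statistics.stdev(latencies) if len(latencies) > 1 else 0.0,
    }


def repeat_benchmark(run_once, n_runs=5):
    """Call `run_once` n_runs times and summarize the reported latencies."""
    return summarize([run_once()["latency"] for _ in range(n_runs)])


# Illustrative stand-in for something like `lambda: benchmarkPatch(tracker)`
fake_latencies = iter([1.0, 2.0, 3.0])
summary = repeat_benchmark(lambda: {"latency": next(fake_latencies)}, n_runs=3)
print(summary)
```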