## Load required modules


In [1]:
!uv pip install llama-cpp-python partial-json-parser

Using Python 3.12.3 environment at: /mnt/d/Projects/Digipen/2026Spring/CS394/GenAI-Learning/.venv
Audited 2 packages in 337ms


## Load the local GGUF model

In [2]:
from llama_cpp import Llama
from pathlib import Path

current_dir = Path.cwd()

GGUF_MODEL = str(current_dir.parent / "models" / "qwen3-4b-instruct-2507-q4_k_m.gguf")

llm = Llama(
      model_path=GGUF_MODEL,
      n_ctx=8192,
)

llama_model_loader: loaded meta data with 32 key-value pairs and 398 tensors from /mnt/d/Projects/Digipen/2026Spring/CS394/GenAI-Learning/models/qwen3-4b-instruct-2507-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 4B Instruct 2507
llama_model_loader: - kv   3:                            general.version str              = 2507
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Qwen3
llama_model_loader: - kv   6:                         general.size_label str            

As far as I know, the loader reads details such as the architecture from the metadata embedded in the GGUF file, so I don't think I need to specify anything beyond the model path and context size.

When I first tried to load the Qwen3 model on Windows, it failed with "llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen3'." I checked and found that llama.cpp supports Qwen3 starting from build b5092, but the llama_cpp_python I had installed seemed to be built against an older llama.cpp. My first plan was therefore to fall back to Gemma 3.

Upon further investigation, I learned that there are no compiled wheels for Windows, so an old version of llama_cpp_python was being used. I installed it in a WSL environment instead, where the Qwen3 model loads as shown in the log above.

## Chat with the model using the chat completion API

In [3]:
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Check the current weather in a given city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["c", "f"]},
            },
            "required": ["city"],
        },
    },
}]

In [4]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": "How is the weather in Seoul?",
    },
]

stream = llm.create_chat_completion(
    messages=messages,
    tools=tools,
    tool_choice="auto",
    stream=True,
    max_tokens=512,
)

assistant_text = ""
tool_buf = {}

for chunk in stream:
    choice = chunk["choices"][0]
    delta = choice["delta"]

    if "content" in delta and delta["content"]:
        assistant_text += delta["content"]
        print(delta["content"], end="", flush=True)

    if "tool_calls" in delta and delta["tool_calls"]:
        for tc in delta["tool_calls"]:
            idx = tc["index"]
            entry = tool_buf.setdefault(
                idx,
                {"id": None, "type": "function", "function": {"name": "", "arguments": ""}},
            )
            if tc.get("id"):
                entry["id"] = tc["id"]
            fn = tc.get("function") or {}
            if fn.get("name"):
                entry["function"]["name"] = fn["name"]

            entry["function"]["arguments"] += fn.get("arguments") or ""

    if choice.get("finish_reason") is not None:
        finish_reason = choice["finish_reason"]
        break

print("==content==")
print(assistant_text)
print("==tools==")
for entry in tool_buf.values():
    print(entry)

<tool_call>
{"name": "get_current_weather", "arguments": {"city": "Seoul"}}
</tool_call>

llama_perf_context_print:        load time =   34805.42 ms
llama_perf_context_print: prompt eval time =   34804.35 ms /   180 tokens (  193.36 ms per token,     5.17 tokens per second)
llama_perf_context_print:        eval time =    1999.64 ms /    21 runs   (   95.22 ms per token,    10.50 tokens per second)
llama_perf_context_print:       total time =   36859.42 ms /   201 tokens
llama_perf_context_print:    graphs reused =         19


==content==
<tool_call>
{"name": "get_current_weather", "arguments": {"city": "Seoul"}}
</tool_call>
==tools==


I'm not sure why, but with qwen3-4b-instruct-2507-q4_k_m the tool call never shows up in the streamed `tool_calls` deltas; it arrives as plain `content` wrapped in `<tool_call>` tags instead. So it seems I'll have to parse it myself to implement tool calling. I'll use the partial-json-parser module to parse the incomplete JSON as it streams in.

### Parse unfinished JSON with the partial-json-parser module

In [5]:
from partial_json_parser import loads, Allow

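def partialToolCallParser(content: str):
    """Split streamed text into (plain text, parsed tool call).

    The tool call is whatever JSON follows <tool_call>, parsed leniently so
    that an unfinished object still yields the keys received so far.
    """
    start_tag = "<tool_call>"
    end_tag = "</tool_call>"

    start_idx = content.find(start_tag)

    # No tool call has started yet: everything is ordinary assistant text.
    if start_idx == -1:
        return content, {}

    content_str = content[:start_idx].strip()
    json_str = content[start_idx + len(start_tag):].strip()

    # Drop the closing tag (and anything after it) once it has arrived.
    end_idx = json_str.find(end_tag)
    if end_idx != -1:
        json_str = json_str[:end_idx].strip()

    # The tag has arrived but no JSON yet.
    if not json_str:
        return content_str, {}

    # Allow.ALL lets every JSON type (strings, objects, arrays, ...) be incomplete.
    parsed_data = loads(json_str, Allow.ALL)
    return content_str, parsed_data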

In [6]:
test_content = """Some initial text.
<tool_call>
{
    "name": "get_current_weather",
    "arguments": {
        "city": "Seo
"""

result = partialToolCallParser(test_content)
print("==parsed tool call==")
print(result)

==parsed tool call==
('Some initial text.', {'name': 'get_current_weather', 'arguments': {'city': 'Seo'}})
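
One case the parser above does not cover is a start tag that is itself split across chunks (for example a buffer ending in `<tool_c`), which would briefly be returned as ordinary text. The log in this notebook does not show that happening with this model, so the following is only a minimal sketch for models or tokenizers where it might. The helper name `split_visible_text` is hypothetical and not part of any library.

```python
START_TAG = "<tool_call>"

def split_visible_text(buffer: str, start_tag: str = START_TAG) -> tuple[str, str]:
    """Return (visible_text, held_back) for a streamed buffer.

    held_back is everything from the start tag onward, or a trailing
    fragment that could still grow into the start tag.
    """
    idx = buffer.find(start_tag)
    if idx != -1:
        return buffer[:idx], buffer[idx:]

    # Find the longest suffix of the buffer that is a proper prefix of the tag.
    for size in range(min(len(start_tag) - 1, len(buffer)), 0, -1):
        if buffer.endswith(start_tag[:size]):
            return buffer[:-size], buffer[-size:]

    return buffer, ""


visible, held = split_visible_text("Some initial text. <tool_c")
print(repr(visible), repr(held))
```

Only `visible` would be shown to the user; `held` stays in the buffer until the next chunk decides whether it was really the start of a tool call.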


In [7]:
stream = llm.create_chat_completion(
            messages=messages,
            tools=tools,
            tool_choice="auto",
            stream=True,
            max_tokens=512,
        )

full_text = ""

for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    if "content" in delta and delta["content"]:
        full_text += delta["content"]
        result = partialToolCallParser(full_text)
        print(result)

Llama.generate: 179 prefix-match hit, remaining 1 prompt tokens to eval


('', {})
('', {})
('', {})
('', {})
('', {})
('', {'name': ''})
('', {'name': 'get'})
('', {'name': 'get_current'})
('', {'name': 'get_current_weather'})
('', {'name': 'get_current_weather'})
('', {'name': 'get_current_weather'})
('', {'name': 'get_current_weather'})
('', {'name': 'get_current_weather'})
('', {'name': 'get_current_weather', 'arguments': {}})
('', {'name': 'get_current_weather', 'arguments': {}})
('', {'name': 'get_current_weather', 'arguments': {}})
('', {'name': 'get_current_weather', 'arguments': {'city': ''}})
('', {'name': 'get_current_weather', 'arguments': {'city': 'Se'}})
('', {'name': 'get_current_weather', 'arguments': {'city': 'Seoul'}})
('', {'name': 'get_current_weather', 'arguments': {'city': 'Seoul'}})
('', {'name': 'get_current_weather', 'arguments': {'city': 'Seoul'}})


llama_perf_context_print:        load time =   34805.42 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =    2595.07 ms /    22 runs   (  117.96 ms per token,     8.48 tokens per second)
llama_perf_context_print:       total time =    2637.00 ms /    23 tokens
llama_perf_context_print:    graphs reused =         20


Since the function seems to be working properly, I will try to create a feature that streams tool calls in real-time using this.
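
The log above repeats the same parsed state for several consecutive chunks, because many tokens (whitespace, quotes, braces) do not change the parsed result. When feeding a UI, it may be worth forwarding a state only when it differs from the previous one. This is a small sketch of that idea; `changed_only` is a hypothetical helper name, and the sample states are copied from the shape of the log rather than produced by the model.

```python
import copy
from typing import Any, Iterable, Iterator

def changed_only(snapshots: Iterable[Any]) -> Iterator[Any]:
    """Yield a snapshot only when it differs from the last one yielded."""
    previous: Any = object()  # sentinel that never equals a real snapshot
    for snap in snapshots:
        if snap != previous:
            yield snap
            # Copy so that later in-place mutation cannot hide a change.
            previous = copy.deepcopy(snap)


states = [
    ("", {}),
    ("", {}),
    ("", {"name": "get"}),
    ("", {"name": "get"}),
    ("", {"name": "get_current_weather"}),
]
for state in changed_only(states):
    print(state)
```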

In [8]:
tools = [{
    "type": "function",
    "function": {
        "name": "patch_manuscript",
        "description": "Patch a manuscript with corrections.",
        "parameters": {
            "type": "object",
            "properties": {
                "patches": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "old": {"type": "string"},
                            "new": {"type": "string"},
                        },
                        "required": ["old", "new"],
                    },
                },
            },
            # "required" belongs next to "properties", not inside it.
            "required": ["patches"],
        },
    },
}]

Next, I will build a simple web app with Gradio. It simulates one mechanism from a novel-writing agent web app I am developing as a personal project: when the user requests an edit, the system modifies just that part of the novel. I am curious whether this works even with a small local model, so I will try it out with a simple simulation.

In [None]:
def buildPrompt(manuscript: str, prompt: str):
    messages = [
          {"role": "system", "content": "You are a helpful assistant for patching manuscripts. Old text must be uniquely identifiable."},
          {
              "role": "user",
              "content": f"# Manuscript: <manuscript>{manuscript}</manuscript>\n\n# Instruction: <instruction>{prompt}</instruction>",
          }
      ]
    return messages

In [11]:
default_manuscript = ""
with open(current_dir / "test_manuscript.txt", "r", encoding="utf-8") as f:
    default_manuscript = f.read()

In [13]:
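def patchManuscript(manuscript: str, patches: list):
    patched = manuscript
    for patch in patches:
        # A truncated stream can leave a patch without "old" or "new";
        # an empty "old" would also make str.replace insert text everywhere.
        if not patch.get("old") or "new" not in patch:
            continue
        patched = patched.replace(patch["old"], patch["new"])
    return patched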
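
The system prompt asks the model to make each `old` text uniquely identifiable, but `str.replace` silently replaces every occurrence, or none. One way to check this before applying the patches is to count occurrences first. This is a minimal sketch; `find_ambiguous_patches` is a hypothetical helper, not part of the notebook's pipeline.

```python
def find_ambiguous_patches(manuscript: str, patches: list[dict]) -> list[tuple[int, int]]:
    """Return (patch_index, match_count) for each patch whose "old" text
    does not occur exactly once in the manuscript."""
    problems = []
    for i, patch in enumerate(patches):
        old = patch.get("old", "")
        # An empty or missing "old" is treated as zero matches.
        count = manuscript.count(old) if old else 0
        if count != 1:
            problems.append((i, count))
    return problems


text = "a ghost strawberry and a red strawberry"
patches = [
    {"old": "strawberry", "new": "shine muscat"},        # matches twice
    {"old": "ghost strawberry", "new": "ghost muscat"},  # matches once
    {"old": "melon", "new": "grape"},                    # matches never
]
print(find_ambiguous_patches(text, patches))
```

An empty result means every patch targets exactly one location; anything else could be shown to the user or sent back to the model for a retry.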

In [None]:
import gradio as gr
import pandas as pd

def run(manuscript: str, prompt: str):
    messages = buildPrompt(manuscript, prompt)

    buffer = ""
    stream = llm.create_chat_completion(
            messages=messages,
            tools=tools,
            tool_choice="auto",
            stream=True,
            max_tokens=8192,
        )

    for chunk in stream:
        choice = chunk["choices"][0]
        delta = choice["delta"]

        if "content" in delta and delta["content"]:
            buffer += delta["content"]

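        assistant_message, tool_call = partialToolCallParser(buffer)

        # The schema nests the list under arguments["patches"], but the logged
        # run below shows this model emitting a bare list, so accept both.
        args = tool_call.get("arguments") or []
        patches = (args.get("patches") or []) if isinstance(args, dict) else args

        print(patches)

        yield assistant_message, pd.DataFrame(patches), ""

    final_text, final_tool_call = partialToolCallParser(buffer)
    final_args = final_tool_call.get("arguments") or []
    final_patches = (
        (final_args.get("patches") or []) if isinstance(final_args, dict) else final_args
    )

    final_result = patchManuscript(manuscript, final_patches)

    yield final_text, pd.DataFrame(final_patches), final_result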


with gr.Blocks() as novel_editor:
    gr.Markdown("## Streaming JSON → partial parse → final result")

    t1 = gr.Textbox(label="Insert your novel", lines=8, value=default_manuscript)
    t2 = gr.Textbox(label="Write your prompt", lines=2, value="Change strawberry to shine muscat")

    btn = gr.Button("Run")

    assistant = gr.Textbox(label="assistant", lines=2)
    patch_list = gr.Dataframe(label="patches")

    final = gr.Textbox(label="final result", lines=24)

    btn.click(run, inputs=[t1, t2], outputs=[assistant, patch_list, final])

novel_editor.launch(show_error=True)

* Running on local URL:  http://127.0.0.1:7861
* To create a public link, set `share=True` in `launch()`.




Llama.generate: 8 prefix-match hit, remaining 4010 prompt tokens to eval


[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[{}]
[{}]
[{}]
[{'old': ''}]
[{'old': 'a'}]
[{'old': 'a ghost'}]
[{'old': 'a ghost strawberry'}]
[{'old': 'a ghost strawberry.'}]
[{'old': 'a ghost strawberry. It'}]
[{'old': 'a ghost strawberry. It grows'}]
[{'old': 'a ghost strawberry. It grows in'}]
[{'old': 'a ghost strawberry. It grows in a'}]
[{'old': 'a ghost strawberry. It grows in a valley'}]
[{'old': 'a ghost strawberry. It grows in a valley that'}]
[{'old': 'a ghost strawberry. It grows in a valley that maps'}]
[{'old': 'a ghost strawberry. It grows in a valley that maps refuse'}]
[{'old': 'a ghost strawberry. It grows in a valley that maps refuse to'}]
[{'old': 'a ghost strawberry. It grows in a valley that maps refuse to show'}]
[{'old': 'a ghost strawberry. It grows in a valley that maps refuse to show.'}]
[{'old': 'a ghost strawberry. It grows in a valley that maps refuse to show. Every'}]
[{'old': 'a ghost strawberry. It grows in a valley that maps refuse to show. Every ten'}

## Conclusion

The output looks quite plausible, but reliability drops because the `old` text has to match the manuscript exactly, and sometimes it does not. I have seen the same kind of mismatch when applying multiple patches even with the latest frontier model, 4.6 Opus, so this was somewhat expected. Normalizing the text first, for example by removing all whitespace (`\s`) characters before matching, might mitigate this to some extent, at the cost of a more complex implementation. Still, this experiment shows that even a small model of this size can handle the task to some degree, so once the web app is finished I would also like to try hosting the model myself to provide the service.
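
As a rough illustration of the normalization idea, here is one possible variant. Instead of deleting whitespace outright, it treats any run of whitespace in `old` as matching any run of whitespace in the manuscript, which avoids having to map positions back to the original text. This is a sketch under that assumption, not a tested part of the app; `replace_ignoring_whitespace` is a hypothetical name.

```python
import re

def replace_ignoring_whitespace(manuscript: str, old: str, new: str) -> tuple[str, int]:
    """Replace `old` with `new`, ignoring differences in whitespace runs.

    Returns (text, match_count). The text is changed only when `old`
    matches exactly once; otherwise the manuscript is returned untouched.
    """
    words = old.split()
    if not words:
        return manuscript, 0

    # Escape each word and let any whitespace run separate them.
    pattern = re.compile(r"\s+".join(re.escape(w) for w in words))
    matches = list(pattern.finditer(manuscript))
    if len(matches) != 1:
        return manuscript, len(matches)

    m = matches[0]
    # Slice instead of re.sub so backslashes in `new` are kept literally.
    return manuscript[:m.start()] + new + manuscript[m.end():], 1


text = "It grows in a valley\nthat maps  refuse to show."
patched, count = replace_ignoring_whitespace(
    text, "valley that maps refuse", "canyon that maps refuse"
)
print(count, repr(patched))
```

Note that the original line breaks inside the matched span are lost, since the span is replaced by `new` as written. Differences other than whitespace (punctuation, quotes, casing) would still fail to match.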