# Introduction
Welcome to Riceroll's KoboldCPP and Ollama benchmarking notebook! The sole purpose of this notebook is to demonstrate the inference speeds and KV caching implementations of Ollama and KoboldCPP.

You can read about Ollama's KV Cache here: https://smcleod.net/2024/12/bringing-k/v-context-quantisation-to-ollama/  
You can learn more about Kobold's Context-Shift here: https://github.com/LostRuins/koboldcpp/wiki#what-is-contextshift  

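If you’re not familiar with the term, inference is the process of running a prompt through an LLM to get a response. Inference speed, which is what we measure here, comes down to the total time it takes to: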
- Process the prompt (system prompts, user prompts, etc.)
- Generate tokens (the LLM outputs) \**This is a little tricky since we also have to consider the number of tokens generated per second.*  
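
To make the two parts concrete, here is a minimal sketch of how one request's time splits into prompt processing and generation. The function name `throughput` and its arguments are hypothetical and only for illustration, and the numbers in the usage lines are made up rather than measured.

```python
def throughput(prompt_tokens, prompt_seconds, generated_tokens, generation_seconds):
    """Split one request into prompt-processing and generation speeds."""
    total_seconds = prompt_seconds + generation_seconds
    return {
        # How fast the prompt was ingested.
        "prompt_tps": prompt_tokens / prompt_seconds,
        # How fast new tokens were produced once the prompt was processed.
        "generation_tps": generated_tokens / generation_seconds,
        # Generated tokens over the whole wall-clock time of the request.
        "end_to_end_tps": generated_tokens / total_seconds,
    }


# Made-up numbers: 1000 prompt tokens in 0.5 s, 100 generated tokens in 2.0 s.
stats = throughput(1000, 0.5, 100, 2.0)
print(stats)
```

The benchmarks in this notebook report the end-to-end figure, since they divide the generated tokens by the wall-clock time of the whole request.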

We will also be testing out the KV caching features that both KoboldCPP and Ollama implement. Their KV caching solutions both benefit overall inference speed, but they serve different purposes. I will briefly explain more about them below.  

For our benchmarks, we will be using the Mistral Nemo Instruct Q6_K_L GGUF model. We will run it with the KV cache kept in F16 on both backends.

As for the hardware, the benchmarks and models run on my personal computer with an RTX 5080, Ryzen 7 9800X3D, and 64GB of DDR5 RAM. The GPU has just enough VRAM to fully run a Mistral Nemo Q6_K_L model.

With all that being said, let’s dive into the benchmarking! 

### Imports
For talking to the servers, we really only need `requests`. We also use the `time` module for the performance measurements and `tiktoken` to count the generated tokens.

In [6]:
import requests
import time
import tiktoken

### The Setup
We will be using the Text Completions endpoints, because they let us make better use of the KV cache features of both Ollama and KoboldCPP. We also have a lot of big prompts that will be sent to the same Mistral Nemo Instruct model under Ollama and KoboldCPP. These big prompts are here to simulate long conversations stretching thousands of tokens in the model's context window. Once the context window hits its max, the KV caching magic happens for both KoboldCPP and Ollama.

In [7]:
# Mistral Nemo Instruct: https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF
kobold_url = "http://127.0.0.1:5001/api/v1/generate"
ollama_url = "http://127.0.0.1:11434/api/generate"


system = "You are an advanced AI assistant with expertise across multiple domains including technology, science, history, literature, and social commentary. Your responses should be accurate, detailed, and clear, providing relevant context and examples where appropriate. Always maintain a professional and engaging tone while adapting your style to the user’s level of understanding. When answering technical questions, break down complex concepts step by step and include illustrative analogies to help comprehension. When discussing historical or cultural topics, provide relevant context, dates, and notable figures, and highlight connections to contemporary events if relevant. You should ask clarifying questions if a user’s prompt is ambiguous and summarize key points at the end of your response. Avoid making unsupported claims, and clearly indicate when something is speculative or uncertain. When generating code examples, use clean, well-commented, and idiomatic code. You should also prioritize conciseness where possible, but never sacrifice clarity. Maintain a friendly and approachable style, and encourage curiosity, critical thinking, and nuanced discussion."

prompts = [
    "As the golden light of the late afternoon sun spilled across the quiet cobblestone streets of the old town, weaving through narrow alleyways where the scent of fresh bread mingled with the faint tang of sea salt drifting in from the harbor, a young traveler paused at the edge of a centuries-old fountain carved with half-worn figures of myth, wondering not only about the countless stories that had unfolded in this square before his arrival, but also about the choices and unseen coincidences that had guided his own winding path to this very moment.",
    "Beneath the shadow of the towering mountains, where jagged peaks pierced the clouds and the wind carried whispers of ancient legends through the dense forest, a lone scholar carefully unrolled a weathered map dotted with cryptic symbols, each line and curve hinting at hidden paths and forgotten civilizations, all the while pondering how the convergence of fate, curiosity, and a relentless desire for knowledge had led him to this remote, windswept valley at the edge of the known world.",
    "On the sprawling plains of the Serengeti, the golden grasses swayed in rhythm with the warm, steady wind, as a herd of elephants slowly traversed the landscape, their massive forms casting long shadows in the late afternoon sun. A young naturalist observed from a distance, sketching each detail in his notebook while contemplating the fragile balance between predator and prey that had persisted here for millennia.",
    "Deep within the labyrinthine corridors of the ancient library, rows upon rows of towering shelves held the accumulated wisdom of countless generations. Candles flickered, casting dancing shadows on the walls, as a scholar carefully extracted a fragile manuscript, marveling at the intricate illustrations and the delicate script, which told tales of empires long forgotten and inventions centuries ahead of their time.",
    "Along the edge of the Arctic sea ice, where the horizon merged seamlessly with the pale sky, a team of explorers painstakingly recorded measurements of the shifting glaciers. Each creak and groan of the ice echoed across the frozen expanse, reminding them of the impermanence of the landscape and the urgency of their mission to understand the forces driving climate change in this remote and unforgiving environment.",
    "In the bustling markets of Marrakech, vibrant colors and intoxicating scents mingled as merchants displayed handwoven fabrics, spices, and intricate pottery. A traveler wandered through the crowd, absorbing the energy and rhythm of the city, pondering the cultural layers that had shaped this place over centuries, from ancient trade routes to modern tourism.",
    "High atop the cliffs overlooking the tempestuous Atlantic, a lighthouse keeper diligently maintained the beacon, its light sweeping across the stormy waters below. In the solitude of the lighthouse, he reflected on the countless ships that had depended on this singular light, understanding the weight of responsibility and the quiet heroism inherent in such solitary dedication.",
    "Within the hidden gardens of a secluded monastery, monks tended to rare medicinal herbs, their movements deliberate and meditative. The scent of blooming jasmine filled the air as the monks recorded the properties of each plant, preserving ancient knowledge and ensuring that future generations might benefit from the wisdom painstakingly accumulated over centuries.",
    "In a futuristic city where towering skyscrapers gleamed with holographic advertisements, a robotics engineer calibrated a new AI companion designed to assist citizens in their daily tasks. Each line of code represented not just a technical solution, but also an ethical consideration, balancing utility with privacy and human autonomy.",
    "Amidst the ruins of a forgotten castle on the misty hills of Scotland, an archaeologist carefully brushed away layers of earth to reveal a mural depicting ancient rituals. Each symbol, each painted line, offered clues to a civilization lost to time, igniting questions about belief systems, social structures, and human ingenuity.",
    "On the deck of a spaceship hurtling through the void between stars, an astronaut monitored life-support systems while contemplating the vastness of the cosmos. The endless black dotted with distant suns inspired both awe and humility, as well as a deep curiosity about the origins of life and the future of interstellar exploration.",
    "In a small coastal village, fishermen repaired their nets under the warm glow of lanterns, exchanging stories of storms survived and bountiful catches. The rhythm of the waves and the cries of seabirds provided a timeless soundtrack to a way of life that had endured for generations.",
    "Within a bustling laboratory, scientists in white coats meticulously conducted experiments on cutting-edge materials, hoping to unlock properties that could revolutionize energy storage. Each test, each observation, was a step toward a future where sustainable technologies reshaped the world.",
    "Beneath the surface of a crystal-clear lake, a diver marveled at the delicate ecosystem thriving in the submerged forests of kelp and coral. Each movement stirred tiny creatures into motion, highlighting the intricate web of life that often goes unnoticed beneath calm waters.",
    "In the grand halls of an opera house, the orchestra tuned their instruments while the conductor studied the score, preparing to deliver a performance that would stir the hearts of every audience member. The anticipation in the air was palpable, as music has the power to transcend words and connect deeply with human emotion.",
    "During an archaeological expedition in Egypt, researchers carefully excavated a tomb, uncovering hieroglyphs and artifacts that offered insights into the lives of pharaohs and common citizens alike. Each discovery was meticulously cataloged, forming a puzzle piece in humanity's collective history.",
    "On a remote island, a marine biologist studied the nesting patterns of endangered sea turtles, noting the effects of climate change and human interference. The rhythmic sound of waves and the sight of hatchlings scrambling toward the ocean created both a sense of urgency and profound wonder.",
    "In the quiet study of a philosopher, stacks of books and notes surrounded a thinker contemplating ethics, society, and the human condition. The pursuit of understanding abstract truths intertwined with practical concerns, bridging centuries of thought and modern dilemmas.",
    "Under the luminous glow of a full moon in the desert, nomadic tribes gathered around fires, sharing stories, songs, and wisdom passed down through generations. The vast, starlit sky reminded everyone of their small place in the universe, fostering both humility and awe.",
    "At the edge of a volcanic crater, a geologist observed flowing lava and measured gas emissions, studying the forces that shape the Earth's surface. The dynamic and unpredictable nature of volcanoes demanded respect, patience, and keen observation to extract meaningful scientific insights.",
    "As the golden light of the late afternoon sun spilled across the quiet cobblestone streets of the old town, weaving through narrow alleyways where the scent of fresh bread mingled with the faint tang of sea salt drifting in from the harbor, a young traveler paused at the edge of a centuries-old fountain carved with half-worn figures of myth, wondering not only about the countless stories that had unfolded in this square before his arrival, but also about the choices and unseen coincidences that had guided his own winding path to this very moment.",
    "Beneath the shadow of the towering mountains, where jagged peaks pierced the clouds and the wind carried whispers of ancient legends through the dense forest, a lone scholar carefully unrolled a weathered map dotted with cryptic symbols, each line and curve hinting at hidden paths and forgotten civilizations, all the while pondering how the convergence of fate, curiosity, and a relentless desire for knowledge had led him to this remote, windswept valley at the edge of the known world.",
    "On the sprawling plains of the Serengeti, the golden grasses swayed in rhythm with the warm, steady wind, as a herd of elephants slowly traversed the landscape, their massive forms casting long shadows in the late afternoon sun. A young naturalist observed from a distance, sketching each detail in his notebook while contemplating the fragile balance between predator and prey that had persisted here for millennia.",
    "Deep within the labyrinthine corridors of the ancient library, rows upon rows of towering shelves held the accumulated wisdom of countless generations. Candles flickered, casting dancing shadows on the walls, as a scholar carefully extracted a fragile manuscript, marveling at the intricate illustrations and the delicate script, which told tales of empires long forgotten and inventions centuries ahead of their time.",
    "Along the edge of the Arctic sea ice, where the horizon merged seamlessly with the pale sky, a team of explorers painstakingly recorded measurements of the shifting glaciers. Each creak and groan of the ice echoed across the frozen expanse, reminding them of the impermanence of the landscape and the urgency of their mission to understand the forces driving climate change in this remote and unforgiving environment.",
    "In the bustling markets of Marrakech, vibrant colors and intoxicating scents mingled as merchants displayed handwoven fabrics, spices, and intricate pottery. A traveler wandered through the crowd, absorbing the energy and rhythm of the city, pondering the cultural layers that had shaped this place over centuries, from ancient trade routes to modern tourism.",
    "High atop the cliffs overlooking the tempestuous Atlantic, a lighthouse keeper diligently maintained the beacon, its light sweeping across the stormy waters below. In the solitude of the lighthouse, he reflected on the countless ships that had depended on this singular light, understanding the weight of responsibility and the quiet heroism inherent in such solitary dedication.",
    "Within the hidden gardens of a secluded monastery, monks tended to rare medicinal herbs, their movements deliberate and meditative. The scent of blooming jasmine filled the air as the monks recorded the properties of each plant, preserving ancient knowledge and ensuring that future generations might benefit from the wisdom painstakingly accumulated over centuries.",
    "In a futuristic city where towering skyscrapers gleamed with holographic advertisements, a robotics engineer calibrated a new AI companion designed to assist citizens in their daily tasks. Each line of code represented not just a technical solution, but also an ethical consideration, balancing utility with privacy and human autonomy.",
    "Amidst the ruins of a forgotten castle on the misty hills of Scotland, an archaeologist carefully brushed away layers of earth to reveal a mural depicting ancient rituals. Each symbol, each painted line, offered clues to a civilization lost to time, igniting questions about belief systems, social structures, and human ingenuity.",
    "On the deck of a spaceship hurtling through the void between stars, an astronaut monitored life-support systems while contemplating the vastness of the cosmos. The endless black dotted with distant suns inspired both awe and humility, as well as a deep curiosity about the origins of life and the future of interstellar exploration.",
    "In a small coastal village, fishermen repaired their nets under the warm glow of lanterns, exchanging stories of storms survived and bountiful catches. The rhythm of the waves and the cries of seabirds provided a timeless soundtrack to a way of life that had endured for generations.",
    "Within a bustling laboratory, scientists in white coats meticulously conducted experiments on cutting-edge materials, hoping to unlock properties that could revolutionize energy storage. Each test, each observation, was a step toward a future where sustainable technologies reshaped the world.",
    "Beneath the surface of a crystal-clear lake, a diver marveled at the delicate ecosystem thriving in the submerged forests of kelp and coral. Each movement stirred tiny creatures into motion, highlighting the intricate web of life that often goes unnoticed beneath calm waters.",
    "In the grand halls of an opera house, the orchestra tuned their instruments while the conductor studied the score, preparing to deliver a performance that would stir the hearts of every audience member. The anticipation in the air was palpable, as music has the power to transcend words and connect deeply with human emotion.",
    "During an archaeological expedition in Egypt, researchers carefully excavated a tomb, uncovering hieroglyphs and artifacts that offered insights into the lives of pharaohs and common citizens alike. Each discovery was meticulously cataloged, forming a puzzle piece in humanity's collective history.",
    "On a remote island, a marine biologist studied the nesting patterns of endangered sea turtles, noting the effects of climate change and human interference. The rhythmic sound of waves and the sight of hatchlings scrambling toward the ocean created both a sense of urgency and profound wonder.",
    "In the quiet study of a philosopher, stacks of books and notes surrounded a thinker contemplating ethics, society, and the human condition. The pursuit of understanding abstract truths intertwined with practical concerns, bridging centuries of thought and modern dilemmas.",
    "Under the luminous glow of a full moon in the desert, nomadic tribes gathered around fires, sharing stories, songs, and wisdom passed down through generations. The vast, starlit sky reminded everyone of their small place in the universe, fostering both humility and awe.",
    "At the edge of a volcanic crater, a geologist observed flowing lava and measured gas emissions, studying the forces that shape the Earth's surface. The dynamic and unpredictable nature of volcanoes demanded respect, patience, and keen observation to extract meaningful scientific insights."
]

max_length = 100

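def count_tokens(text):
    # cl100k_base is the GPT-4 encoding, not Mistral Nemo's own tokenizer,
    # so the counts (and the TPS derived from them) are approximations.
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))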

### Inferencing with long conversations
We have defined functions below to simulate conversations with memory. This means that for each inference, we also send the past conversation as part of the full prompt. This mimics online LLM services like ChatGPT, Gemini, JAIson, and others that have memory and remember past conversations.  

In real-world scenarios, it's not just the conversation that takes up the context, but also extra information like web search results, RAG, personality prompts, etc.
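
As a rough illustration of how quickly a growing conversation fills the window, the sketch below finds the first turn at which the running token total exceeds the context size. Both `first_overflow_turn` and the whitespace-based `rough_count` are hypothetical helpers introduced here; a real run would pass in a proper token counter, and this simple total ignores the room reserved for the generated reply.

```python
def first_overflow_turn(turn_texts, count_tokens, system_tokens=0, max_context=4096):
    """Return the 1-based turn where the running total first exceeds max_context, or None."""
    total = system_tokens
    for turn, text in enumerate(turn_texts, start=1):
        total += count_tokens(text)
        if total > max_context:
            return turn
    return None


def rough_count(text):
    # Crude whitespace "tokenizer", used only so the sketch runs without extra dependencies.
    return len(text.split())


# 20 made-up turns of 300 "tokens" each, on top of a 200-token system prompt.
turns = ["word " * 300] * 20
print(first_overflow_turn(turns, rough_count, system_tokens=200))  # 13
```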

In [8]:
def kobold_incremental(prompts):
    context = ""
    results = []
    for prompt in prompts:
        full_prompt = f"{context}User: Repeat this word by word: \"{prompt}\"\nAssistant:"
        data = {
            "max_context_length": 4096,
            "max_length": max_length,
            "memory": system,
            "prompt": full_prompt,
            "quiet": True,
            "temperature": 0.3,
            "top_k": 100,
            "top_p": 0.9,
            "stop_sequence": ["\n"]
        }
        start = time.perf_counter()
        result = requests.post(kobold_url, json=data)
        end = time.perf_counter()
        result.raise_for_status()  # fail loudly if the server returned an error
        text = result.json()["results"][0]["text"]
        tokens = count_tokens(text)
        tps = tokens / (end - start)
        print(f"[KoboldCPP] Tokens: {tokens}, Time: {end-start:.3f}s, TPS: {tps:.2f}")
        context += f"User: Repeat this word by word: \"{prompt}\"\nAssistant: {text}\n"
        results.append((text, end - start, tps))
    return results

In [9]:
def ollama_incremental(prompts):
    context = ""
    results = []
    for prompt in prompts:
        full_prompt = f"{context}User: Repeat this word by word: \"{prompt}\"\nAssistant:"
        body = {
            "model": "mistral-nemo",
            "system": system,
            "prompt": full_prompt,
            "stream": False,
            "options": {
                "num_ctx": 4096,
                "num_predict": max_length,
                "temperature": 0.3,
                "top_k": 100,
                "top_p": 0.9,
                "stop": ["\n"]
            }
        }
        start = time.perf_counter()
        result = requests.post(ollama_url, json=body)
        end = time.perf_counter()
        result.raise_for_status()  # fail loudly if the server returned an error
        text = result.json()["response"]
        tokens = count_tokens(text)
        tps = tokens / (end - start)
        print(f"[Ollama] Tokens: {tokens}, Time: {end-start:.3f}s, TPS: {tps:.2f}")
        context += f"User: Repeat this word by word: \"{prompt}\"\nAssistant: {text}\n"
        results.append((text, end - start, tps))
    return results
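
The functions above count tokens on the client side and time the whole HTTP round trip. As a cross-check, a non-streaming Ollama `/api/generate` response also reports its own counters (`prompt_eval_count`, `eval_count`) and durations in nanoseconds (`prompt_eval_duration`, `eval_duration`). The helper below is a sketch of how those fields could be turned into separate prompt and generation speeds. The name `ollama_speeds` is made up for this example, the sample payload is hand-written rather than measured, and fields are read with defaults in case a response omits them.

```python
def ollama_speeds(response_json):
    """Compute prompt and generation tokens/sec from an Ollama /api/generate response."""
    ns = 1e9  # Ollama reports durations in nanoseconds
    prompt_count = response_json.get("prompt_eval_count", 0)
    prompt_ns = response_json.get("prompt_eval_duration", 0)
    gen_count = response_json.get("eval_count", 0)
    gen_ns = response_json.get("eval_duration", 0)
    return {
        # None means the response did not carry a usable duration.
        "prompt_tps": prompt_count / (prompt_ns / ns) if prompt_ns else None,
        "generation_tps": gen_count / (gen_ns / ns) if gen_ns else None,
    }


# Hand-written example payload, not a real measurement.
sample = {
    "prompt_eval_count": 400,
    "prompt_eval_duration": 200_000_000,  # 0.2 s
    "eval_count": 100,
    "eval_duration": 2_000_000_000,  # 2.0 s
}
print(ollama_speeds(sample))
```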

### Ready. Set. Go!

Alrighty, with that all set up, let's begin. First, we need to make sure that the KoboldCPP and Ollama instances are **not** running at the same time, hogging each other's compute resources. Let's start with Ollama: use `ollama run <insert mistral model name here>` and then `ollama ps` to make sure that our Mistral model is running and using 100% GPU. After that, we will run the cell below.  
![ollama](./images/ollama.png)

In [11]:
print("\n=== Ollama Incremental Context Test ===")
ollama_results = ollama_incremental(prompts)


=== Ollama Incremental Context Test ===
[Ollama] Tokens: 99, Time: 1.530s, TPS: 64.72
[Ollama] Tokens: 96, Time: 1.620s, TPS: 59.25
[Ollama] Tokens: 82, Time: 1.388s, TPS: 59.10
[Ollama] Tokens: 74, Time: 1.250s, TPS: 59.18
[Ollama] Tokens: 82, Time: 1.385s, TPS: 59.20
[Ollama] Tokens: 70, Time: 1.217s, TPS: 57.53
[Ollama] Tokens: 69, Time: 1.171s, TPS: 58.93
[Ollama] Tokens: 62, Time: 1.092s, TPS: 56.79
[Ollama] Tokens: 59, Time: 1.046s, TPS: 56.41
[Ollama] Tokens: 65, Time: 1.079s, TPS: 60.27
[Ollama] Tokens: 63, Time: 1.082s, TPS: 58.24
[Ollama] Tokens: 57, Time: 0.951s, TPS: 59.91
[Ollama] Tokens: 49, Time: 0.839s, TPS: 58.39
[Ollama] Tokens: 52, Time: 0.921s, TPS: 56.47
[Ollama] Tokens: 61, Time: 0.986s, TPS: 61.89
[Ollama] Tokens: 55, Time: 0.910s, TPS: 60.41
[Ollama] Tokens: 54, Time: 0.930s, TPS: 58.06
[Ollama] Tokens: 48, Time: 0.840s, TPS: 57.13
[Ollama] Tokens: 55, Time: 0.964s, TPS: 57.03
[Ollama] Tokens: 52, Time: 0.878s, TPS: 59.25
[Ollama] Tokens: 99, Time: 1.661s, TPS:

### Ollama Observations
From initial observations, it looks like Ollama had a really strong start! Its results are very similar to KoboldCPP's below, maybe even edging them out slightly. However, you will also notice that the TPS (tokens per second) gets slower as the conversation grows.  

Here are the KoboldCPP settings. Note that `Use FlashAttention` and `Use MMAP` were manually enabled since they aren't enabled by default (Ollama has these settings on by default).  
![Kobold1](./images/kobold1.png)  
![Kobold2](./images/kobold2.png)  
![Kobold3](./images/kobold3.png)  

Let's run the KoboldCPP benchmarks now!

In [13]:
print("\n=== KoboldCPP Incremental Context Test ===")
kobold_results = kobold_incremental(prompts)


=== KoboldCPP Incremental Context Test ===
[KoboldCPP] Tokens: 99, Time: 1.629s, TPS: 60.79
[KoboldCPP] Tokens: 97, Time: 1.611s, TPS: 60.20
[KoboldCPP] Tokens: 82, Time: 1.398s, TPS: 58.66
[KoboldCPP] Tokens: 74, Time: 1.267s, TPS: 58.41
[KoboldCPP] Tokens: 82, Time: 1.398s, TPS: 58.64
[KoboldCPP] Tokens: 70, Time: 1.227s, TPS: 57.04
[KoboldCPP] Tokens: 69, Time: 1.184s, TPS: 58.27
[KoboldCPP] Tokens: 62, Time: 1.107s, TPS: 56.02
[KoboldCPP] Tokens: 59, Time: 1.095s, TPS: 53.86
[KoboldCPP] Tokens: 66, Time: 1.106s, TPS: 59.65
[KoboldCPP] Tokens: 63, Time: 1.120s, TPS: 56.26
[KoboldCPP] Tokens: 57, Time: 0.991s, TPS: 57.52
[KoboldCPP] Tokens: 49, Time: 0.873s, TPS: 56.12
[KoboldCPP] Tokens: 53, Time: 0.976s, TPS: 54.31
[KoboldCPP] Tokens: 61, Time: 1.034s, TPS: 59.01
[KoboldCPP] Tokens: 55, Time: 0.964s, TPS: 57.05
[KoboldCPP] Tokens: 54, Time: 0.981s, TPS: 55.05
[KoboldCPP] Tokens: 48, Time: 0.885s, TPS: 54.21
[KoboldCPP] Tokens: 55, Time: 1.033s, TPS: 53.23
[KoboldCPP] Tokens: 52, T

### KoboldCPP observations
KoboldCPP also started very strong, roughly matching Ollama's speed! However, look at how KoboldCPP's TPS does not slow down nearly as much as Ollama's. In fact, KoboldCPP's TPS seems very consistent. But why is that?
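
To compare the two runs with numbers rather than by eye, the result lists can be condensed into a few statistics. This is only a sketch: `summarize` is a hypothetical helper, it assumes the `(text, seconds, tps)` tuples returned by the benchmark functions above, and the sample data is a placeholder rather than a real measurement.

```python
from statistics import mean, pstdev


def summarize(results):
    """Summarize (text, seconds, tps) tuples as returned by the benchmark functions."""
    tps_values = [tps for _, _, tps in results]
    half = len(tps_values) // 2
    return {
        "mean_tps": mean(tps_values),
        # Spread of the TPS values; smaller means more consistent.
        "stdev_tps": pstdev(tps_values),
        # Comparing the halves shows whether speed drifts as the context grows.
        "first_half_mean": mean(tps_values[:half]),
        "second_half_mean": mean(tps_values[half:]),
    }


# Placeholder tuples in the same shape as the benchmark results (not real data).
fake_results = [("a", 1.0, 60.0), ("b", 1.0, 58.0), ("c", 1.0, 56.0), ("d", 1.0, 54.0)]
print(summarize(fake_results))
```

In the notebook, this would be called as `summarize(ollama_results)` and `summarize(kobold_results)`. It needs at least two results per run.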

# KoboldCPP Context-Shift vs Ollama KV Caching
Without going into too much detail (you can find more in the articles I linked at the very top), here's a simple explanation of what's really going on behind the scenes.  

First, let's talk about **KoboldCPP's Context-Shift**. With Context-Shift, the model keeps the tokens of the previous conversation in memory and shifts them via KV cache shifting as new tokens arrive. When `max_context_length` is reached, KoboldCPP drops the oldest tokens from the earliest part of the conversation. What's special about this is that after dropping the old conversation, it can add the new tokens into the current context and completely avoid reprocessing the full prompt and system prompt. This explains why KoboldCPP can maintain a steady TPS as shown in the benchmarks above. The only downside is that Context-Shift itself does affect the TPS, but only minimally.
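
The bookkeeping behind that idea can be sketched with a toy sliding window. This is not KoboldCPP's implementation: it only tracks which tokens stay in the window, does not touch any real KV cache, and the class name `ShiftingContext` and its methods are invented for this example. The point is that, in this model, the work per turn stays proportional to the new tokens even after the window is full.

```python
from collections import deque


class ShiftingContext:
    """Toy sliding window: pinned memory tokens plus a bounded conversation buffer."""

    def __init__(self, memory_tokens, max_context):
        self.memory = list(memory_tokens)
        # Room left for the conversation once the pinned memory is accounted for.
        self.window = deque(maxlen=max_context - len(self.memory))

    def append(self, new_tokens):
        """Add tokens; return (tokens pushed out of the window, tokens that needed processing)."""
        new_tokens = list(new_tokens)
        overflow = max(0, len(self.window) + len(new_tokens) - self.window.maxlen)
        self.window.extend(new_tokens)  # the deque drops the oldest entries itself
        return overflow, len(new_tokens)


ctx = ShiftingContext(memory_tokens=["<sys>"] * 2, max_context=10)
print(ctx.append(range(6)))      # (0, 6): fits, nothing is dropped
print(ctx.append(range(6, 10)))  # (2, 4): two oldest tokens dropped, only 4 processed
print(list(ctx.window))          # [2, 3, 4, 5, 6, 7, 8, 9]
```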

But what about **Ollama's KV Caching**? Simply put, it stores the Key/Value embeddings of past tokens in the context. What's neat about this is that there is absolutely no reprocessing **of the parts of the prompt that haven't changed** *and* **as long as the max context window is not reached**. Once the max context window overflows, Ollama needs to throw out old parts of the conversation, add the new parts in, and then it **must reprocess everything**. From this point on, every inference can get slower and slower, because it has to do the "oh no! too much stuff! time to throw old stuff out, bring new stuff in, and reprocess everything" routine over and over. The upside of this approach is that your initial TPS is blazing fast (theoretically, Ollama's TPS would initially be faster than KoboldCPP's). The biggest downside, however, is that once you reach the max context window, it's going to chug.  
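
A toy model of prefix-based cache reuse shows why dropping old turns is so costly under this scheme. This is not Ollama's actual code: the function `tokens_to_reprocess` and the token lists are made up for illustration, and the only assumption is the one described above, namely that cached work can be reused for the part of the prompt that is unchanged from the start.

```python
def tokens_to_reprocess(cached_tokens, new_tokens):
    """Count tokens after the longest shared prefix; only those need fresh processing."""
    shared = 0
    for old, new in zip(cached_tokens, new_tokens):
        if old != new:
            break
        shared += 1
    return len(new_tokens) - shared


system_tokens = ["S1", "S2"]
history_tokens = ["a", "b", "c", "d"]
cached = system_tokens + history_tokens

# Window not full: the new prompt simply extends the cached one.
print(tokens_to_reprocess(cached, cached + ["e"]))  # 1

# Window full: the oldest turn is dropped, so everything after the system prompt moves.
truncated = system_tokens + history_tokens[1:] + ["e"]
print(tokens_to_reprocess(cached, truncated))  # 4
```

In the second case only the system prompt still matches the cache, so almost the whole prompt counts as new.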

At the end of the day, both Ollama and KoboldCPP are amazing solutions for running LLMs. However, if you need something that can hold longer conversations and stay consistently fast, KoboldCPP is my choice. If you want ease of use and don't mind keeping track of the number of tokens in your context to avoid overflowing the max context window, then Ollama is a fantastic choice as well.