# **Sending a POST Request to a Local LLM API**
This block demonstrates how to send a JSON request to a locally hosted language model API using the Python requests library.

**BASE_URL:** The root address of the API, usually exposed through an ngrok tunnel when the model is hosted locally.

**ENDPOINT:** Specific path where the model API listens for requests (e.g., /api/generate).

**url:** Full URL used for the POST request.


**model:** Specifies the name/version of the language model to use.

**prompt:** The input query for the model to respond to.

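**Content-Type:** Set to "application/json" to declare the request body format. Passing json=payload to requests.post already sets this header, so listing it explicitly is optional.

The Host header is not set manually; requests derives it from the URL, so it matches the ngrok tunnel hostname.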



In [None]:
import requests

BASE_URL = "https://d3ab-84-224-189-77.ngrok-free.app"  # Your ngrok URL
ENDPOINT = "/api/generate"
url = BASE_URL + ENDPOINT

payload = {"model": "llama3.2", "prompt": "Hello, world!"}
headers = {"Content-Type": "application/json"}  # no Host header here

response = requests.post(url, json=payload, headers=headers)
print(response.status_code)
print(response.text)


200
{"model":"llama3.2","created_at":"2025-03-02T23:46:11.93036Z","response":"Hello","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:11.963882Z","response":"!","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:12.0152Z","response":" It","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:12.059124Z","response":"'s","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:12.103789Z","response":" nice","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:12.14873Z","response":" to","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:12.191039Z","response":" meet","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:12.233398Z","response":" you","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:12.276143Z","response":".","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:12.318491Z","response":" Is","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:12.3648
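The output above is newline-delimited JSON: one object per generated fragment, each with a response field and a done flag. A minimal sketch of reassembling the text, assuming that shape. The helper name join_stream_lines is hypothetical, and the sample lines are abbreviated from the output above (the created_at field is dropped):

```python
import json

def join_stream_lines(lines):
    """Concatenate the "response" fragments from newline-delimited JSON lines."""
    parts = []
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        chunk = json.loads(text)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the stream
    # Fragments already carry their own leading spaces, so join without a separator
    return "".join(parts)

sample = [
    b'{"model":"llama3.2","response":"Hello","done":false}',
    b'{"model":"llama3.2","response":"!","done":false}',
    b'{"model":"llama3.2","response":" It","done":false}',
    b'{"model":"llama3.2","response":"\'s","done":false}',
    b'{"model":"llama3.2","response":" nice","done":false}',
]
print(join_stream_lines(sample))  # Hello! It's nice
```

With a live call, the same function could be given response.iter_lines() from a requests.post(..., stream=True) request.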

# **Simulated KV Cache for LLM Inference Using Ollama**
This section demonstrates how to simulate KV caching for large language model (LLM) inference, using token tracking and prompt reconstruction in a stateless API setup (like Ollama served over ngrok).

**Purpose**

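Simulate KV cache behavior by manually storing past tokens and prepending them to each new prompt.

Work around the fact that this setup does not inject past key-value pairs into Ollama. The model still re-processes the full prompt on every call, so the approach approximates conversational memory rather than saving computation.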

**Features**

**Tokenization:** Naively splits text on whitespace with split(), so the "tokens" are words rather than model tokens.

**Storage:** Maintains a rolling buffer (max_cache_size) of tokens.

**Retrieval:** Converts token list back to a text prompt.

**Reset:** Provides a method to clear the cache.

**Example Methods**

**add_tokens(new_text):** Appends new tokens and truncates to buffer limit.

**get_cached_tokens():** Returns all stored tokens as a string.

**clear_cache():** Resets the internal cache.
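To make the truncation rule concrete, here is a small sketch of the rolling buffer on its own, with no API call. The function name roll_tokens and the limit of 5 are assumptions chosen for illustration:

```python
def roll_tokens(tokens, new_text, max_cache_size):
    """Append whitespace-split tokens and keep only the newest max_cache_size."""
    combined = tokens + new_text.split()
    return combined[-max_cache_size:]

buffer = []
buffer = roll_tokens(buffer, "Hello, how are you?", max_cache_size=5)
buffer = roll_tokens(buffer, "I am fine", max_cache_size=5)

print(buffer)            # ['are', 'you?', 'I', 'am', 'fine']
print(" ".join(buffer))  # are you? I am fine
```

The oldest words ("Hello," and "how") fall out first, which is the behavior the cache relies on to keep the reconstructed prompt bounded.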



In [None]:
import requests
import json

class TokenCache:
    """
    Simulates KV caching by storing past tokens and reusing them in future inferences.
    Since Ollama does not allow KV cache injection, we track tokens manually.
    """
    def __init__(self, max_cache_size=1024):
        self.max_cache_size = max_cache_size
        self.tokens = []

    def add_tokens(self, new_text):
        """Tokenizes and stores the last max_cache_size tokens."""
        new_tokens = new_text.split()  # Simulate tokenization
        self.tokens.extend(new_tokens)

        # Limit the cache size
        if len(self.tokens) > self.max_cache_size:
            self.tokens = self.tokens[-self.max_cache_size:]

    def get_cached_tokens(self):
        """Retrieves stored tokens as a single text input."""
        return " ".join(self.tokens)  # Convert stored tokens into a single string

    def clear_cache(self):
        """Clears the token cache."""
        self.tokens = []
        print("🗑️ Token cache cleared.")

# Initialize the token-based cache
token_cache = TokenCache(max_cache_size=1024)

def run_inference_with_kv_cache(api_url, model_name, new_input, token_cache):
    """
    Runs inference using Ollama while simulating KV caching by keeping track of past tokens.
    """
    # 1️⃣ Retrieve stored tokens
    cached_context = token_cache.get_cached_tokens()

    # 2️⃣ Create new input including stored tokens
    full_prompt = cached_context + " " + new_input if cached_context else new_input

    # 3️⃣ Send request to Ollama
    data = {
        "model": model_name,
        "prompt": full_prompt,
        "stream": True
    }

    try:
        response = requests.post(api_url, json=data, stream=True)
        print("\n📡 Raw API Response:")

        full_response = ""
        for line in response.iter_lines():
            if line:
                try:
                    json_line = line.decode('utf-8')
                    parsed_line = json.loads(json_line)
                    # Fragments carry their own leading spaces, so concatenate them
                    # directly; appending " " would split words (e.g. "I 'm").
                    full_response += parsed_line.get("response", "")
                    print(json_line)
                except Exception as e:
                    print(f"Error parsing line: {line}, Error: {e}")

        # 4️⃣ Store generated tokens for future inference
        token_cache.add_tokens(full_response.strip())

        return {"response": full_response.strip()}
    except Exception as e:
        return {"error": str(e)}

# 🔹 Test the KV cache with Ollama
BASE_URL = "https://d3ab-84-224-189-77.ngrok-free.app"  # Your ngrok URL
ENDPOINT = "/api/generate"
OLLAMA_API_URL = BASE_URL + ENDPOINT

# Run multiple inferences
response1 = run_inference_with_kv_cache(OLLAMA_API_URL, "llama3.2", "Hello, how are you?", token_cache)
print("\nResponse 1:", response1)

response2 = run_inference_with_kv_cache(OLLAMA_API_URL, "llama3.2", "What is your name?", token_cache)
print("\nResponse 2:", response2)

response3 = run_inference_with_kv_cache(OLLAMA_API_URL, "llama3.2", "Tell me a joke.", token_cache)
print("\nResponse 3:", response3)

# 🗑️ Clear cache and test again
token_cache.clear_cache()
response4 = run_inference_with_kv_cache(OLLAMA_API_URL, "llama3.2", "What do you remember?", token_cache)
print("\nResponse 4:", response4)



📡 Raw API Response:
{"model":"llama3.2","created_at":"2025-03-02T23:46:54.029406Z","response":"I","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:54.069654Z","response":"'m","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:54.112543Z","response":" just","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:54.154186Z","response":" a","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:54.198319Z","response":" language","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:54.239625Z","response":" model","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:54.281594Z","response":",","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:54.325858Z","response":" so","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:54.371256Z","response":" I","done":false}
{"model":"llama3.2","created_at":"2025-03-02T23:46:54.413066Z","response":" don","done":false}
{"model":"llama3.2","created_at

# **Optimized KV Cache with deque for Efficient Context Management**

This section introduces an improved approach to simulating Key-Value (KV) cache behavior for the Ollama API using collections.deque for efficient memory and token queue management.

**Purpose**

- Store and manage a rolling buffer of past tokens to simulate conversational memory.

- Use collections.deque so that appending new tokens and dropping the oldest ones are both constant-time operations.

**Key Methods**

- **add_tokens(new_text)**
  - Tokenizes input (split()).
  - Enqueues tokens.
  - Automatically discards the oldest tokens if max_tokens is exceeded.

- **get_cached_context()**
  - Returns a string representation of the current token buffer.

- **clear_cache()**
  - Empties the queue and prints a confirmation.

**Why Use deque?**

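- deque provides O(1) appends and pops at both ends, whereas list.pop(0) is O(n).

- Eviction here is first-in, first-out (a sliding window over the most recent tokens) rather than LRU, because reading the context never reorders entries.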
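collections.deque also accepts a maxlen argument, which drops items from the opposite end automatically once the limit is reached. A small sketch of the same rolling window without an explicit popleft() loop (the four-token limit is chosen only to keep the example short):

```python
from collections import deque

window = deque(maxlen=4)  # assumed tiny limit for illustration
window.extend("the quick brown fox".split())
window.extend("jumps over".split())  # pushes out "the" and "quick"

print(list(window))      # ['brown', 'fox', 'jumps', 'over']
print(" ".join(window))  # brown fox jumps over
```

The explicit loop used in this section behaves the same way; maxlen simply moves the bookkeeping into the container.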

In [None]:
import requests
import json
from collections import deque

class KVCache:
    """Optimized KV Cache for Ollama by storing and managing context efficiently."""
    def __init__(self, max_tokens=1024):
        self.max_tokens = max_tokens
        self.token_queue = deque()

    def add_tokens(self, new_text):
        """Splits new text into tokens and manages the queue to maintain context length."""
        new_tokens = new_text.split()
        self.token_queue.extend(new_tokens)

        # Truncate the queue if it exceeds the limit
        while len(self.token_queue) > self.max_tokens:
            self.token_queue.popleft()

    def get_cached_context(self):
        """Retrieves stored tokens as a single string for inference."""
        return " ".join(self.token_queue)

    def clear_cache(self):
        """Clears the token queue."""
        self.token_queue.clear()
        print("🗑️ KV Cache Cleared.")

# Initialize KV Cache
kv_cache = KVCache(max_tokens=1024)

def query_ollama_with_cache(api_url, model_name, user_input, kv_cache):
    """
    Queries Ollama API with a KV cache mechanism.
    """
    cached_context = kv_cache.get_cached_context()
    full_prompt = f"{cached_context} {user_input}" if cached_context else user_input

    data = {
        "model": model_name,
        "prompt": full_prompt,
        "stream": True
    }

    try:
        response = requests.post(api_url, json=data, stream=True)
        collected_response = []

        for line in response.iter_lines():
            if line:
                try:
                    json_line = line.decode('utf-8')
                    parsed_line = json.loads(json_line)
                    token_text = parsed_line.get("response", "")
                    collected_response.append(token_text)
                except Exception as e:
                    print(f"Error parsing response: {e}")

        # Join tokens into a full response
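        # Fragments already include their own spacing; joining with " " would yield "I 'm  just  a" style text
        final_response = "".join(collected_response).strip()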
        kv_cache.add_tokens(final_response)  # Store the response in cache

        return {"response": final_response}
    except Exception as e:
        return {"error": str(e)}

# API URL Placeholder
BASE_URL = "https://d3ab-84-224-189-77.ngrok-free.app"  # Your ngrok URL
ENDPOINT = "/api/generate"
OLLAMA_API_URL = BASE_URL + ENDPOINT

# Testing with KV Cache
print("\n🟢 Running Query 1")
response1 = query_ollama_with_cache(OLLAMA_API_URL, "llama3.2", "Hello, how are you?", kv_cache)
print("\nResponse 1:", response1)

print("\n🟢 Running Query 2")
response2 = query_ollama_with_cache(OLLAMA_API_URL, "llama3.2", "What is your name?", kv_cache)
print("\nResponse 2:", response2)

print("\n🟢 Running Query 3")
response3 = query_ollama_with_cache(OLLAMA_API_URL, "llama3.2", "Tell me a joke.", kv_cache)
print("\nResponse 3:", response3)

# Clearing Cache and Retesting
kv_cache.clear_cache()
print("\n🟢 Running Query 4 (After Cache Clear)")
response4 = query_ollama_with_cache(OLLAMA_API_URL, "llama3.2", "What do you remember?", kv_cache)
print("\nResponse 4:", response4)



🟢 Running Query 1

Response 1: {'response': "I 'm  just  a  language  model ,  so  I  don 't  have  emotions  or  feelings  like  humans  do .  However ,  I 'm  functioning  properly  and  ready  to  assist  you  with  any  questions  or  tasks  you  may  have !  How  about  you ?  How 's  your  day  going  so  far ?"}

🟢 Running Query 2

Response 2: {'response': "That 's  correct ,  I 'm  just  a  language  model ,  so  I  don 't  have  emotions  or  feelings  like  humans  do .  I 'm  designed  to  process  and  respond  to  text -based  inputs ,  but  I  don 't  have  subjective  experiences  or  personal  opinions .\n\n As  for  you ,  it  sounds  like  you 're  doing  great !  You 've  started  the  conversation  with  a  friendly  and  consider ate  tone ,  which  is  perfect  for  a  language  model  interaction .  However ,  I  don 't  have  a  name ,  as  I 'm  an  AI  designed  to  provide  information  and  assist  with  tasks .  I  exist  solely  to  help  users  lik

# **Persistent KV Cache with File-Based Storage for LLM Context**

This section builds on the earlier simulated KV cache by introducing file-based persistence, enabling conversational context to persist across sessions.

**Purpose**


- Automatically save and load token data to/from a JSON file.

**Key Additions**

**Method and Purpose**

- save_cache(): Saves the token queue to kv_cache.json.

- load_cache(): Loads the queue from the file if it is present.

- clear_cache(): Clears the queue and deletes the cache file.

**Features**

- Efficient queue operations using collections.deque.

- File persistence via standard JSON file.

- Truncates to max_tokens for safety and performance.
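The save and load steps can be sketched in isolation as a JSON round trip. This is an illustration under assumptions: the helper names save_tokens and load_tokens are hypothetical, and a temporary directory is used so the example does not touch a real kv_cache.json:

```python
import json
import os
import tempfile
from collections import deque

def save_tokens(path, token_queue):
    """Write the queue as a JSON array (a deque is not JSON-serializable itself)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(list(token_queue), fh)

def load_tokens(path, max_tokens):
    """Read the array back, keeping only the newest max_tokens entries."""
    if not os.path.exists(path):
        return deque()
    with open(path, "r", encoding="utf-8") as fh:
        return deque(json.load(fh)[-max_tokens:])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "kv_cache.json")
    save_tokens(path, deque(["a", "b", "c", "d"]))
    restored = load_tokens(path, max_tokens=3)
    missing = load_tokens(os.path.join(tmp, "absent.json"), max_tokens=3)

print(list(restored))  # ['b', 'c', 'd']
print(list(missing))   # []
```

Truncating on load means a file written with a larger limit still yields a queue that respects the current max_tokens.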

In [None]:
# Save cache to file

import requests
import json
import os
from collections import deque

class KVCache:
    """Optimized KV Cache with File Storage for Persistent Memory."""

    CACHE_FILE = "kv_cache.json"

    def __init__(self, max_tokens=1024):
        self.max_tokens = max_tokens
        self.token_queue = deque()
        self.load_cache()  # Load existing cache from file

    def add_tokens(self, new_text):
        """Splits new text into tokens, updates the queue, and saves to file."""
        new_tokens = new_text.split()
        self.token_queue.extend(new_tokens)

        # Truncate the queue if it exceeds the limit
        while len(self.token_queue) > self.max_tokens:
            self.token_queue.popleft()

        self.save_cache()  # Save updated cache to file

    def get_cached_context(self):
        """Retrieves stored tokens as a single string for inference."""
        return " ".join(self.token_queue)

    def save_cache(self):
        """Saves the current cache to a JSON file."""
        try:
            with open(self.CACHE_FILE, "w", encoding="utf-8") as file:
                json.dump(list(self.token_queue), file)
            print("💾 Cache saved to file.")
        except Exception as e:
            print(f"⚠️ Error saving cache: {e}")

    def load_cache(self):
        """Loads the cache from a JSON file if it exists."""
        if os.path.exists(self.CACHE_FILE):
            try:
                with open(self.CACHE_FILE, "r", encoding="utf-8") as file:
                    cached_tokens = json.load(file)
                    self.token_queue = deque(cached_tokens[-self.max_tokens:])  # Load only up to max_tokens
                print("🔄 Cache loaded from file.")
            except Exception as e:
                print(f"⚠️ Error loading cache: {e}")

    def clear_cache(self):
        """Clears the token queue and removes the cache file."""
        self.token_queue.clear()
        if os.path.exists(self.CACHE_FILE):
            os.remove(self.CACHE_FILE)
        print("🗑️ KV Cache Cleared.")

# Initialize KV Cache with Persistent Storage
kv_cache = KVCache(max_tokens=1024)

def query_ollama_with_cache(api_url, model_name, user_input, kv_cache):
    """
    Queries Ollama API with a KV cache mechanism.
    """
    cached_context = kv_cache.get_cached_context()
    full_prompt = f"{cached_context} {user_input}" if cached_context else user_input

    data = {
        "model": model_name,
        "prompt": full_prompt,
        "stream": True
    }

    try:
        response = requests.post(api_url, json=data, stream=True)
        collected_response = []

        for line in response.iter_lines():
            if line:
                try:
                    json_line = line.decode('utf-8')
                    parsed_line = json.loads(json_line)
                    token_text = parsed_line.get("response", "")
                    collected_response.append(token_text)
                except Exception as e:
                    print(f"Error parsing response: {e}")

        # Join tokens into a full response
        # Fragments already include their own spacing; joining with " " would split words
        final_response = "".join(collected_response).strip()
        kv_cache.add_tokens(final_response)  # Store the response in cache

        return {"response": final_response}
    except Exception as e:
        return {"error": str(e)}

# API URL Placeholder
BASE_URL = "https://d3ab-84-224-189-77.ngrok-free.app"  # Your ngrok URL
ENDPOINT = "/api/generate"
OLLAMA_API_URL = BASE_URL + ENDPOINT

# Testing with Persistent KV Cache
print("\n🟢 Running Query 1")
response1 = query_ollama_with_cache(OLLAMA_API_URL, "llama3.2", "Hello, how are you?", kv_cache)
print("\nResponse 1:", response1)

print("\n🟢 Running Query 2")
response2 = query_ollama_with_cache(OLLAMA_API_URL, "llama3.2", "What is your name?", kv_cache)
print("\nResponse 2:", response2)

print("\n🟢 Running Query 3")
response3 = query_ollama_with_cache(OLLAMA_API_URL, "llama3.2", "Tell me a joke.", kv_cache)
print("\nResponse 3:", response3)

# Clearing Cache and Retesting
kv_cache.clear_cache()
print("\n🟢 Running Query 4 (After Cache Clear)")
response4 = query_ollama_with_cache(OLLAMA_API_URL, "llama3.2", "What do you remember?", kv_cache)
print("\nResponse 4:", response4)


🔄 Cache loaded from file.

🟢 Running Query 1
💾 Cache saved to file.

Response 1: {'response': "I 'm  just  a  language  model ,  so  I  don 't  have  emotions  or  feelings  like  humans  do .  However ,  I 'm  functioning  properly  and  ready  to  assist  you  with  any  questions  or  tasks  you  may  have !  How  can  I  help  you  today ?"}

🟢 Running Query 2
💾 Cache saved to file.

Response 2: {'response': 'Thank  you  for  the  warm  introduction !  I \'m  happy  to  chat  with  you ,  even  if  it \'s  just  a  language  model  like  yourself .  I  don \'t  have  a  personal  name ,  but  I \'ll  refer  to  myself  as  " Assistant "  or  " AI "  from  now  on .\n\n I \'m  here  to  help  answer  any  questions ,  provide  information ,  or  assist  with  tasks  you  may  have .  What \'s  on  your  mind ?  Are  you  looking  for  help  with  something  specific ,  or  do  you  want  to  engage  in  a  fun  conversation ?'}

🟢 Running Query 3
💾 Cache saved t

# **Measuring Cache File Growth with Fixed Prompt Size for 10 Inferences**

**This section demonstrates how to:**

- Simulate a KV cache with unlimited token storage.

- Limit the number of tokens used for prompting to a fixed window (512 tokens).

- Track the cache file size growth across multiple inferences.

**Purpose**

- Maintain a growing memory of previous interactions.

- Use only the latest prompt_size tokens (512) during inference.

- Persist the entire token history to a JSON file.

- Track and print cache file size after each inference.

**Key Features**

- add_tokens(): Appends all new tokens to the queue and saves to disk.
- get_cached_context(): Returns only the latest N tokens (prompt_size).
- save_cache(): Stores the full cache and logs the file size.
- load_cache(): Loads all past tokens (if any) from the file.
- clear_cache(): Deletes both the in-memory and the file-based cache.
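The split between an unlimited history and a fixed prompt window can be shown without any API call. In this sketch the helper last_n, the three fake replies, and the window of 4 tokens are assumptions for illustration; the serialized length stands in for the cache file size:

```python
import json
from collections import deque
from itertools import islice

def last_n(token_queue, n):
    """Return the newest n tokens without copying the whole deque first."""
    start = max(len(token_queue) - n, 0)
    return list(islice(token_queue, start, None))

history = deque()
sizes = []
for i in range(3):
    history.extend(f"reply {i} with some words".split())
    # json.dumps escapes non-ASCII by default, so this length equals the byte
    # size json.dump would write to disk for the same list
    sizes.append(len(json.dumps(list(history))))

print(len(history))            # 15
print(last_n(history, 4))      # ['2', 'with', 'some', 'words']
print(sizes == sorted(sizes))  # True: the stored history only grows
```

The prompt window stays at a fixed length while the stored history, and therefore the file, keeps growing with every inference.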

In [None]:
# File size calculation with fixed prompt/token size for 10 inferences

import requests
import json
import os
from collections import deque

class KVCache:
    """KV Cache with Unlimited Storage & Fixed Prompt Size."""

    CACHE_FILE = "kv_cache.json"

    def __init__(self, prompt_size=512):
        self.prompt_size = prompt_size  # Fixed prompt size
        self.token_queue = deque()
        self.load_cache()  # Load existing cache from file

    def add_tokens(self, new_text):
        """Adds new tokens to the cache with unlimited storage."""
        new_tokens = new_text.split()
        self.token_queue.extend(new_tokens)

        self.save_cache()  # Save updated cache to file

    def get_cached_context(self):
        """Retrieves only the latest prompt_size tokens for inference."""
        tokens = list(self.token_queue)[-self.prompt_size:]  # Take the last prompt_size tokens (512 by default)
        return " ".join(tokens)

    def save_cache(self):
        """Saves the current cache to a JSON file and logs file size."""
        try:
            with open(self.CACHE_FILE, "w", encoding="utf-8") as file:
                json.dump(list(self.token_queue), file)
            file_size = os.path.getsize(self.CACHE_FILE)  # Get file size
            print(f"💾 Cache saved. (Size: {file_size} bytes)")
        except Exception as e:
            print(f"⚠️ Error saving cache: {e}")

    def load_cache(self):
        """Loads the cache from a JSON file if it exists."""
        if os.path.exists(self.CACHE_FILE):
            try:
                with open(self.CACHE_FILE, "r", encoding="utf-8") as file:
                    cached_tokens = json.load(file)
                    self.token_queue = deque(cached_tokens)  # Load full cache
                print("🔄 Cache loaded from file.")
            except Exception as e:
                print(f"⚠️ Error loading cache: {e}")

    def clear_cache(self):
        """Clears the token queue and removes the cache file."""
        self.token_queue.clear()
        if os.path.exists(self.CACHE_FILE):
            os.remove(self.CACHE_FILE)
        print("🗑️ KV Cache Cleared.")

# Initialize KV Cache with Fixed Prompt Size (512 Tokens)
kv_cache = KVCache(prompt_size=512)

def query_ollama_with_cache(api_url, model_name, user_input, kv_cache):
    """
    Queries Ollama API with a KV cache mechanism.
    """
    cached_context = kv_cache.get_cached_context()
    full_prompt = f"{cached_context} {user_input}" if cached_context else user_input

    data = {
        "model": model_name,
        "prompt": full_prompt,
        "stream": True
    }

    try:
        response = requests.post(api_url, json=data, stream=True)
        collected_response = []

        for line in response.iter_lines():
            if line:
                try:
                    json_line = line.decode('utf-8')
                    parsed_line = json.loads(json_line)
                    token_text = parsed_line.get("response", "")
                    collected_response.append(token_text)
                except Exception as e:
                    print(f"Error parsing response: {e}")

        # Join tokens into a full response
        # Fragments already include their own spacing; joining with " " would split words
        final_response = "".join(collected_response).strip()
        kv_cache.add_tokens(final_response)  # Store the response in cache

        return {"response": final_response}
    except Exception as e:
        return {"error": str(e)}

# API URL Placeholder
BASE_URL = "https://d3ab-84-224-189-77.ngrok-free.app"  # Your ngrok URL
ENDPOINT = "/api/generate"
OLLAMA_API_URL = BASE_URL + ENDPOINT

# Run 10 Queries (No Cache Size Limit)
for i in range(10):
    print(f"\n🟢 Running Query {i+1} (Fixed Prompt Size: 512 tokens, Unlimited Cache)")
    user_prompt = f"This is test iteration {i+1}. Tell me something interesting."
    response = query_ollama_with_cache(OLLAMA_API_URL, "llama3.2", user_prompt, kv_cache)
    print(f"\nResponse {i+1}:", response)

    # Show file size after each iteration
    if os.path.exists(KVCache.CACHE_FILE):
        file_size = os.path.getsize(KVCache.CACHE_FILE)
        print(f"📂 Cache File Size After Iteration {i+1}: {file_size} bytes")

# Clearing Cache at the End
kv_cache.clear_cache()



🟢 Running Query 1 (Fixed Prompt Size: 512 tokens, Unlimited Cache)
💾 Cache saved. (Size: 970 bytes)

Response 1: {'response': 'Iteration   1 ,  let \'s  get  started .\n\n Did  you  know  that  there  is  a  type  of  jelly fish  that  is  immortal ?  The  Tur rit opsis  do hr n ii ,  also  known  as  the  " imm ortal  jelly fish ,"  is  a  species  of  jelly fish  that  can  transform  its  body  into  a  younger  state  through  a  process  called  trans different iation .  This  means  that  it  can  essentially  revert  back  to  its  pol yp  stage ,  which  is  the  juvenile  form  of  a  jelly fish ,  and  then  grow  back  into  an  adult  again .  This  process  can  be  repeated  indefinitely ,  making  Tur rit opsis  do hr n ii  theoretically  immortal .\n\n How \'s  that  for  something  interesting ?'}
📂 Cache File Size After Iteration 1: 970 bytes

🟢 Running Query 2 (Fixed Prompt Size: 512 tokens, Unlimited Cache)
💾 Cache saved. (Size: 2711 bytes)

Response

# **Caching with Inference Time Tracking: Fixed Prompt, Unlimited History**
This block enhances the persistent KV cache system by introducing inference time measurement for each API call, while still using a fixed prompt window and an unlimited cache history saved to file.

**Key Enhancements**

- Stores all past tokens (no pruning).

- Always returns only the last prompt_size tokens (e.g., 512) for prompting.

- Measures and prints the size of the cache file after each update.

**Prompt Construction**

Concatenates the last 512 tokens with the user input.

**Timing Start**

Records current time with start_time = time.time().

**Streaming API Call**

Parses streamed response as usual.

**Timing End & Delta**

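Computes inference_time = end_time - start_time. This is wall-clock time for the whole streamed call, so it includes network and ngrok latency as well as model computation.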

**Cache Update**

Response is added to the unlimited cache and saved to file.
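The timing steps above can be sketched without a live endpoint. This variant is an assumption-laden illustration, not the notebook's method: it swaps time.time() for time.perf_counter(), a monotonic clock meant for measuring intervals, and also records when the first fragment arrives. The names timed_collect and fake_stream are hypothetical, and fake_stream only imitates a streamed response:

```python
import time

def timed_collect(fragments):
    """Consume an iterable of text fragments, timing the first fragment and the total."""
    start = time.perf_counter()
    first_fragment_time = None
    parts = []
    for fragment in fragments:
        if first_fragment_time is None:
            first_fragment_time = time.perf_counter() - start
        parts.append(fragment)
    total_time = time.perf_counter() - start
    return "".join(parts), first_fragment_time, total_time

def fake_stream():
    """Stand-in for a streamed response; the sleeps imitate generation delay."""
    for fragment in ["Hello", "!", " It", "'s"]:
        time.sleep(0.01)
        yield fragment

text, first, total = timed_collect(fake_stream())
print(text)  # Hello! It's
print(f"first fragment after {first:.3f}s, total {total:.3f}s")
```

Separating time-to-first-fragment from total time helps distinguish prompt processing from generation length, since a longer reply inflates only the total.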


In [None]:
import requests
import json
import os
import time
from collections import deque

class KVCache:
    """KV Cache with Unlimited Storage & Fixed Prompt Size."""

    CACHE_FILE = "kv_cache.json"

    def __init__(self, prompt_size=512):
        self.prompt_size = prompt_size  # Fixed prompt size
        self.token_queue = deque()
        self.load_cache()  # Load existing cache from file

    def add_tokens(self, new_text):
        """Adds new tokens to the cache with unlimited storage."""
        new_tokens = new_text.split()
        self.token_queue.extend(new_tokens)

        self.save_cache()  # Save updated cache to file

    def get_cached_context(self):
        """Retrieves only the latest prompt_size tokens for inference."""
        tokens = list(self.token_queue)[-self.prompt_size:]  # Take the last prompt_size tokens (512 by default)
        return " ".join(tokens)

    def save_cache(self):
        """Saves the current cache to a JSON file and logs file size."""
        try:
            with open(self.CACHE_FILE, "w", encoding="utf-8") as file:
                json.dump(list(self.token_queue), file)
            file_size = os.path.getsize(self.CACHE_FILE)  # Get file size
            print(f"💾 Cache saved. (Size: {file_size} bytes)")
        except Exception as e:
            print(f"⚠️ Error saving cache: {e}")

    def load_cache(self):
        """Loads the cache from a JSON file if it exists."""
        if os.path.exists(self.CACHE_FILE):
            try:
                with open(self.CACHE_FILE, "r", encoding="utf-8") as file:
                    cached_tokens = json.load(file)
                    self.token_queue = deque(cached_tokens)  # Load full cache
                print("🔄 Cache loaded from file.")
            except Exception as e:
                print(f"⚠️ Error loading cache: {e}")

    def clear_cache(self):
        """Clears the token queue and removes the cache file."""
        self.token_queue.clear()
        if os.path.exists(self.CACHE_FILE):
            os.remove(self.CACHE_FILE)
        print("🗑️ KV Cache Cleared.")

# Initialize KV Cache with Fixed Prompt Size (512 Tokens)
kv_cache = KVCache(prompt_size=512)

def query_ollama_with_cache(api_url, model_name, user_input, kv_cache):
    """
    Queries Ollama API with a KV cache mechanism and measures inference time.
    """
    cached_context = kv_cache.get_cached_context()
    full_prompt = f"{cached_context} {user_input}" if cached_context else user_input

    data = {
        "model": model_name,
        "prompt": full_prompt,
        "stream": True
    }

    start_time = time.time()  # Start measuring inference time

    try:
        response = requests.post(api_url, json=data, stream=True)
        collected_response = []

        for line in response.iter_lines():
            if line:
                try:
                    json_line = line.decode('utf-8')
                    parsed_line = json.loads(json_line)
                    token_text = parsed_line.get("response", "")
                    collected_response.append(token_text)
                except Exception as e:
                    print(f"Error parsing response: {e}")

        end_time = time.time()  # End measuring inference time
        inference_time = end_time - start_time  # Compute elapsed time

        # Join tokens into a full response
        # Fragments already include their own spacing; joining with " " would split words
        final_response = "".join(collected_response).strip()
        kv_cache.add_tokens(final_response)  # Store the response in cache

        return {"response": final_response, "inference_time": inference_time}
    except Exception as e:
        return {"error": str(e)}

# API URL Placeholder
BASE_URL = "https://d3ab-84-224-189-77.ngrok-free.app"  # Your ngrok URL
ENDPOINT = "/api/generate"
OLLAMA_API_URL = BASE_URL + ENDPOINT

# Run 10 Queries (No Cache Size Limit)
for i in range(10):
    print(f"\n🟢 Running Query {i+1} (Fixed Prompt Size: 512 tokens, Unlimited Cache)")

    user_prompt = f"This is test iteration {i+1}. Tell me something interesting."

    response = query_ollama_with_cache(OLLAMA_API_URL, "llama3.2", user_prompt, kv_cache)

    # A failed request returns {"error": ...} without the other keys
    if "error" in response:
        print(f"\nQuery {i+1} failed: {response['error']}")
        continue

    print(f"\nResponse {i+1}: {response['response']}")
    print(f"⏱ Inference Time for Query {i+1}: {response['inference_time']:.4f} seconds")

    # Show file size after each iteration
    if os.path.exists(KVCache.CACHE_FILE):
        file_size = os.path.getsize(KVCache.CACHE_FILE)
        print(f"📂 Cache File Size After Iteration {i+1}: {file_size} bytes")

# Clearing Cache at the End
kv_cache.clear_cache()



🟢 Running Query 1 (Fixed Prompt Size: 512 tokens, Unlimited Cache)
💾 Cache saved. (Size: 970 bytes)

Response 1: Iteration   1 ,  let 's  get  started !

 Did  you  know  that  there  is  a  type  of  jelly fish  that  is  immortal ?  The  Tur rit opsis  do hr n ii ,  also  known  as  the  " imm ortal  jelly fish ,"  is  a  species  of  jelly fish  that  can  transform  its  body  into  a  younger  state  through  a  process  called  trans different iation .  This  means  it  can  essentially  revert  back  to  its  pol yp  stage ,  which  is  the  juvenile  form  of  a  jelly fish ,  and  then  grow  back  into  an  adult  again .  This  process  can  be  repeated  indefinitely ,  making  the  Tur rit opsis  do hr n ii  theoretically  immortal !

 How 's  that  for  an  interesting  fact ?
⏱ Inference Time for Query 1: 9.9436 seconds
📂 Cache File Size After Iteration 1: 970 bytes

🟢 Running Query 2 (Fixed Prompt Size: 512 tokens, Unlimited Cache)
💾 Cache saved. (Size

# **Inference Time Comparison: With vs. Without KV Cache**
This experiment compares inference latency when the simulated KV cache prepends stored context to the prompt against latency when only the bare user input is sent. It also tracks how the persistent cache file grows.

**Key Features**

- Unlimited token storage via deque

- Fixed prompt window of latest 512 tokens

- Automatic saving/loading to/from kv_cache.json

- File size monitoring after each inference

- Manual cache reset with .clear_cache()

**Inference Time Tracking**

- Uses time.time() to calculate latency.

- Returns both the model response and the time elapsed.

**Cache Reset**

- clear_cache() clears memory and deletes kv_cache.json.

- Starting from an empty cache keeps later experiment runs reproducible.
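Once latencies for both modes have been collected, they can be compared with a few summary statistics. A minimal sketch, assuming the timings are gathered into plain lists of seconds; the helper name summarize is hypothetical and the numbers are placeholders, not measurements from this experiment:

```python
import statistics

def summarize(latencies):
    """Return mean, median, and sample standard deviation for a list of seconds."""
    return {
        "mean": statistics.mean(latencies),
        "median": statistics.median(latencies),
        "stdev": statistics.stdev(latencies) if len(latencies) > 1 else 0.0,
    }

# Placeholder values for illustration only, not measurements from this experiment
with_cache = [2.0, 4.0, 6.0]
without_cache = [1.0, 2.0, 3.0]

for label, runs in (("with cache", with_cache), ("without cache", without_cache)):
    stats = summarize(runs)
    print(f"{label}: mean={stats['mean']:.2f}s "
          f"median={stats['median']:.2f}s stdev={stats['stdev']:.2f}s")
```

Reporting the median alongside the mean reduces the influence of a single slow request, which is common when traffic passes through a tunnel.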

In [None]:
import requests
import json
import os
import time
from collections import deque

class KVCache:
    """KV Cache with Unlimited Storage & Fixed Prompt Size."""

    CACHE_FILE = "kv_cache.json"

    def __init__(self, prompt_size=512):
        self.prompt_size = prompt_size  # Fixed prompt size
        self.token_queue = deque()
        self.load_cache()  # Load existing cache from file

    def add_tokens(self, new_text):
        """Adds new tokens to the cache with unlimited storage."""
        new_tokens = new_text.split()
        self.token_queue.extend(new_tokens)
        self.save_cache()  # Save updated cache to file

    def get_cached_context(self):
        """Retrieves only the latest 512 tokens for inference."""
        tokens = list(self.token_queue)[-self.prompt_size:]  # Always take the last 512 tokens
        return " ".join(tokens)

    def save_cache(self):
        """Saves the current cache to a JSON file and logs file size."""
        try:
            with open(self.CACHE_FILE, "w", encoding="utf-8") as file:
                json.dump(list(self.token_queue), file)
            file_size = os.path.getsize(self.CACHE_FILE)  # Get file size
            print(f"💾 Cache saved. (Size: {file_size} bytes)")
        except Exception as e:
            print(f"⚠️ Error saving cache: {e}")

    def load_cache(self):
        """Loads the cache from a JSON file if it exists."""
        if os.path.exists(self.CACHE_FILE):
            try:
                with open(self.CACHE_FILE, "r", encoding="utf-8") as file:
                    cached_tokens = json.load(file)
                    self.token_queue = deque(cached_tokens)  # Load full cache
                print("🔄 Cache loaded from file.")
            except Exception as e:
                print(f"⚠️ Error loading cache: {e}")

    def clear_cache(self):
        """Clears the token queue and removes the cache file."""
        self.token_queue.clear()
        if os.path.exists(self.CACHE_FILE):
            os.remove(self.CACHE_FILE)
        print("🗑️ KV Cache Cleared.")

# Initialize KV Cache with Fixed Prompt Size (512 Tokens)
kv_cache = KVCache(prompt_size=512)

def query_ollama(api_url, model_name, user_input, use_cache=True):
    """
    Queries Ollama API with or without KV caching and measures inference time.
    """
    if use_cache:
        cached_context = kv_cache.get_cached_context()
        full_prompt = f"{cached_context} {user_input}" if cached_context else user_input
    else:
        full_prompt = user_input  # No cache is used

    data = {
        "model": model_name,
        "prompt": full_prompt,
        "stream": True
    }

    start_time = time.time()  # Start measuring inference time

    try:
        response = requests.post(api_url, json=data, stream=True)
        collected_response = []

        for line in response.iter_lines():
            if line:
                try:
                    json_line = line.decode('utf-8')
                    parsed_line = json.loads(json_line)
                    token_text = parsed_line.get("response", "")
                    collected_response.append(token_text)
                except Exception as e:
                    print(f"Error parsing response: {e}")

        end_time = time.time()  # End measuring inference time
        inference_time = end_time - start_time  # Compute elapsed time

        # Concatenate streamed chunks as-is; each chunk already carries its own spacing
        final_response = "".join(collected_response).strip()

        if use_cache:
            kv_cache.add_tokens(final_response)  # Store the response in cache

        return {"response": final_response, "inference_time": inference_time}
    except Exception as e:
        return {"error": str(e)}

# API URL Placeholder
BASE_URL = "https://d3ab-84-224-189-77.ngrok-free.app"  # Your ngrok URL
ENDPOINT = "/api/generate"
OLLAMA_API_URL = BASE_URL + ENDPOINT

# Run 10 Queries (With & Without Cache)
for i in range(10):
    print(f"\n🟢 Running Query {i+1} (Fixed Prompt Size: 512 tokens, Unlimited Cache)")

    user_prompt = f"This is test iteration {i+1}. Tell me something interesting."

    # With KV Caching
    response_with_cache = query_ollama(OLLAMA_API_URL, "llama3.2", user_prompt, use_cache=True)
    print(f"\n✅ With KV Cache - Response {i+1}: {response_with_cache['response']}")
    print(f"⏱ Inference Time with Cache: {response_with_cache['inference_time']:.4f} seconds")

    # Without KV Caching
    response_without_cache = query_ollama(OLLAMA_API_URL, "llama3.2", user_prompt, use_cache=False)
    print(f"\n❌ Without KV Cache - Response {i+1}: {response_without_cache['response']}")
    print(f"⏱ Inference Time without Cache: {response_without_cache['inference_time']:.4f} seconds")

    # Show file size after each iteration
    if os.path.exists("kv_cache.json"):
        file_size = os.path.getsize("kv_cache.json")
        print(f"📂 Cache File Size After Iteration {i+1}: {file_size} bytes")

# Clearing Cache at the End
kv_cache.clear_cache()



🟢 Running Query 1 (Fixed Prompt Size: 512 tokens, Unlimited Cache)
💾 Cache saved. (Size: 763 bytes)

✅ With KV Cache - Response 1: Iteration   1  complete !

 Here 's  something  interesting :

 Did  you  know  that  there  is  a  species  of  jelly fish  that  is  immortal ?  The  Tur rit opsis  do hr n ii ,  also  known  as  the  " imm ortal  jelly fish ,"  can  transform  its  body  into  a  younger  state  through  a  process  called  trans different iation .  This  means  it  can  essentially  revert  back  to  its  pol yp  stage  and  grow  back  into  an  adult  again ,  making  it  theoretically  immortal .

 Would  you  like  me  to  share  more  interesting  facts ?
⏱ Inference Time with Cache: 8.3453 seconds

❌ Without KV Cache - Response 1: Iteration   1 ,  a  new  beginning !

 Here 's  something  interesting :

 Did  you  know  that  there  is  a  species  of  jelly fish  that  is  immortal ?  The  Tur rit opsis  do hr n ii ,  also  known  as  the  " imm orta

# **Optimized KV Cache with Growth Tracking & Compression**

This section performs inference on a trained LLM (llama3.2) using a persistent Key-Value (KV) cache. It logs cache size and response time per iteration, useful for evaluating performance and memory behavior post-training.

**This section enhances previous KV cache strategies by adding:**

-  Token compression to reduce file size.

-  Cache growth tracking to measure memory accumulation over time.

-  Context preservation with persistent storage and quantization.

-  Inference time monitoring for performance diagnostics.
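Growth tracking compares the cache size before and after each addition. A relative growth figure is not meaningful when the previous size is zero (dividing by a floor of 1 then turns 428 new tokens into 42800%), so the sketch below reports the first fill separately. `growth_report` is a hypothetical helper, shown only to illustrate the arithmetic.

```python
def growth_report(prev_size, new_size):
    """Describe cache growth; the first fill has no meaningful percentage."""
    if prev_size == 0:
        return f"initial fill: {new_size} tokens"
    ratio = (new_size - prev_size) / prev_size
    return f"growth: {ratio:.2%} ({prev_size} -> {new_size} tokens)"

print(growth_report(0, 428))     # initial fill: 428 tokens
print(growth_report(428, 535))   # growth: 25.00% (428 -> 535 tokens)
```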

**Methods:**

**add_tokens()**	- Adds tokens, enforces max_tokens, and tracks growth

**track_cache_growth()**	- Logs percentage growth after every addition

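**compress_cache()**	- Bins tokens using quantization (compression_factor) to save disk space; tokens that share a bin get the same index, so the compression is lossy

**decompress_cache()**	- Restores tokens using the reversed bin map; only one token per bin can be recovered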

**save_cache()**	- Persists compressed token history to kv_cache_compressed.json

**load_cache()**	- Loads and decompresses the token buffer on startup

**get_cache_size()**	- Returns total number of cached tokens

**clear_cache()**	- Clears both memory and disk cache
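To make the effect of binning concrete, the sketch below contrasts a lossless index encoding (one ID per distinct token) with a binned encoding in the style of compress_cache(). All function names are hypothetical. One deliberate difference from the notebook: distinct tokens are ordered by first appearance (`dict.fromkeys`) rather than by `set` order, so the result is deterministic.

```python
def index_encode(tokens):
    """Lossless: one integer ID per distinct token, in first-seen order."""
    vocab = {}
    ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
    return {"vocab": list(vocab), "ids": ids}

def index_decode(data):
    return [data["vocab"][i] for i in data["ids"]]

def binned_encode(tokens, factor):
    """Binned: `factor` distinct tokens share one ID, so information is lost."""
    unique = list(dict.fromkeys(tokens))
    token_map = {tok: idx // factor for idx, tok in enumerate(unique)}
    return {"token_map": token_map, "ids": [token_map[t] for t in tokens]}

def binned_decode(data):
    # Reversing the map keeps only the last token assigned to each bin.
    reverse = {v: k for k, v in data["token_map"].items()}
    return [reverse[i] for i in data["ids"]]

tokens = "the cat sat on the mat".split()
roundtrip = index_decode(index_encode(tokens))
lossy = binned_decode(binned_encode(tokens, factor=2))
print(roundtrip)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(lossy)      # ['cat', 'cat', 'on', 'on', 'cat', 'mat']
```

With a factor of 1 the binned scheme becomes lossless. Larger factors shrink the index range but change the recovered text.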



In [None]:
import json
import os
import numpy as np
from collections import deque
import requests
import time

class KVCacheOptimized:
    """Enhanced KV Cache with Growth Tracking and Compression for Efficient Inference."""

    CACHE_FILE = "kv_cache_compressed.json"

    def __init__(self, max_tokens=2048, compression_factor=2):
        self.max_tokens = max_tokens
        self.token_queue = deque()
        self.prev_cache_size = 0
        self.compression_factor = compression_factor  # Control compression level
        self.load_cache()

    def add_tokens(self, new_text):
        """Adds new tokens while maintaining cache limit and tracking growth."""
        new_tokens = new_text.split()
        self.token_queue.extend(new_tokens)

        while len(self.token_queue) > self.max_tokens:
            self.token_queue.popleft()  # Remove oldest tokens

        self.track_cache_growth()
        self.save_cache()

    def get_cached_context(self):
        """Retrieves stored tokens as context."""
        return " ".join(self.token_queue)

    def track_cache_growth(self):
        """Tracks cache growth ratio for analysis."""
        new_cache_size = len(self.token_queue)
        growth_ratio = (new_cache_size - self.prev_cache_size) / max(1, self.prev_cache_size)
        self.prev_cache_size = new_cache_size
        print(f"📊 Cache Growth: {growth_ratio:.2%} | Current Cache Size: {new_cache_size} tokens")

    def save_cache(self):
        """Compresses and saves KV cache."""
        compressed_tokens = self.compress_cache(list(self.token_queue))
        try:
            with open(self.CACHE_FILE, "w", encoding="utf-8") as file:
                json.dump(compressed_tokens, file)
            print("💾 KV Cache (Compressed) saved to file.")
        except Exception as e:
            print(f"⚠️ Error saving KV cache: {e}")

    def load_cache(self):
        """Loads compressed KV cache from file."""
        if os.path.exists(self.CACHE_FILE):
            try:
                with open(self.CACHE_FILE, "r", encoding="utf-8") as file:
                    compressed_tokens = json.load(file)
                    self.token_queue = deque(self.decompress_cache(compressed_tokens)[-self.max_tokens:])
                print("🔄 KV Cache (Compressed) loaded from file.")
            except Exception as e:
                print(f"⚠️ Error loading KV cache: {e}")

    def compress_cache(self, tokens):
        """Applies token binning and quantization for compression (lossy)."""
        unique_tokens = list(set(tokens))  # Reduce redundancy (set order is arbitrary)
        # Integer division puts compression_factor tokens in the same bin,
        # so decompress_cache() can recover only one token per bin.
        token_map = {token: idx // self.compression_factor for idx, token in enumerate(unique_tokens)}
        compressed_tokens = [token_map[token] for token in tokens]  # Store bin indices
        return {"token_map": token_map, "compressed_tokens": compressed_tokens}

    def decompress_cache(self, compressed_data):
        """Restores compressed cache to original tokens."""
        token_map = {v: k for k, v in compressed_data["token_map"].items()}  # Reverse mapping
        return [token_map[idx] for idx in compressed_data["compressed_tokens"]]

    def clear_cache(self):
        """Clears KV cache."""
        self.token_queue.clear()
        if os.path.exists(self.CACHE_FILE):
            os.remove(self.CACHE_FILE)
        print("🗑️ KV Cache Cleared.")

    def get_cache_size(self):
        """Returns the number of tokens stored in KV cache."""
        return len(self.token_queue)

# Initialize Optimized KV Cache
kv_cache = KVCacheOptimized(max_tokens=2048, compression_factor=4)

def query_ollama_with_cache(api_url, model_name, user_input, kv_cache):
    """
    Queries Ollama model with KV caching, analyzing cache growth & compression.
    """
    cached_context = kv_cache.get_cached_context()
    full_prompt = f"{cached_context} {user_input}" if cached_context else user_input

    data = {
        "model": model_name,
        "prompt": full_prompt,
        "stream": True
    }

    try:
        start_time = time.time()  # Start timing inference
        response = requests.post(api_url, json=data, stream=True)
        collected_response = []

        for line in response.iter_lines():
            if line:
                try:
                    json_line = line.decode('utf-8')
                    parsed_line = json.loads(json_line)
                    token_text = parsed_line.get("response", "")
                    collected_response.append(token_text)
                except Exception as e:
                    print(f"Error parsing response: {e}")

        final_response = "".join(collected_response).strip()  # Chunks already carry their own spacing
        end_time = time.time()  # Stop timing inference

        # Store response in KV Cache
        kv_cache.add_tokens(final_response)

        # Log Cache Statistics
        cache_size = kv_cache.get_cache_size()
        inference_time = round(end_time - start_time, 4)
        print(f"⏳ Inference Time: {inference_time} sec | 📦 KV Cache Size: {cache_size} tokens")

        return {"response": final_response, "inference_time": inference_time, "cache_size": cache_size}

    except Exception as e:
        return {"error": str(e)}

# API URL Placeholder
BASE_URL = "https://71cd-91-104-75-197.ngrok-free.app"  # Your ngrok URL
ENDPOINT = "/api/generate"
OLLAMA_API_URL = BASE_URL + ENDPOINT


# 🔹 Running Inference and Analyzing Cache Growth
print("\n🟢 Running Inference 1")
response1 = query_ollama_with_cache(OLLAMA_API_URL, "llama3", "Explain AI's impact on society.", kv_cache)
print("\nResponse 1:", response1)

print("\n🟢 Running Inference 2")
response2 = query_ollama_with_cache(OLLAMA_API_URL, "llama3", "How does reinforcement learning work?", kv_cache)
print("\nResponse 2:", response2)

print("\n🟢 Running Inference 3")
response3 = query_ollama_with_cache(OLLAMA_API_URL, "llama3", "Summarize what we discussed about AI ethics.", kv_cache)
print("\nResponse 3:", response3)

# Clearing Cache and Retesting
kv_cache.clear_cache()
print("\n🟢 Running Inference 4 (After Cache Clear)")
response4 = query_ollama_with_cache(OLLAMA_API_URL, "llama3", "Can you recall previous topics?", kv_cache)
print("\nResponse 4:", response4)


🔄 KV Cache (Compressed) loaded from file.

🟢 Running Inference 1
📊 Cache Growth: 42800.00% | Current Cache Size: 428 tokens
💾 KV Cache (Compressed) saved to file.
⏳ Inference Time: 30.8492 sec | 📦 KV Cache Size: 428 tokens

Response 1: {'response': "What  a  fascinating  conversation !  As  we  embark  on  this  dialogue ,  I 'd  like  to  start  by  asking :  What  do  you  think  about  the  impact  of  Artificial  Intelligence  ( AI )  on  our  society ?\n\n As  an  AI  language  model ,  I 'll  respond  with  insights  and  thoughts  based  on  my  training  data .  Feel  free  to  pick  any  topic  or  memory  that  comes  to  mind ,  and  I 'll  engage  in  a  conversation  that 's  as  natural  as  possible .\n\n To  begin ,  AI  has  had  a  profound  impact  on  various  aspects  of  our  lives .  From  automation  and  efficiency  gains  in  industries  like  manufacturing  and  healthcare  to  enhancing  customer  service  experiences ,  AI  has  revolution 

# **Advanced KV Cache: Huffman Compression, Compression Ratio Tracking & Memory Usage Logging**

This final version of the KV caching system introduces Huffman encoding for optimal compression and tracks both compression ratio dynamics and memory consumption during inference using the Ollama API.

**Function Summary**

- Sends a non-streamed request to Ollama for a full reply.

- Logs full request payload and response metadata.

- Measures:

  - Inference latency

  - Change in memory usage (MB)

  - Compression ratio

  - Huffman-compressed cache size (bytes)

- Updates the cache with the response and saves it in Huffman-compressed form.

**Huffman Compression Logic**

- Count token frequencies using collections.Counter.

- Construct Huffman tree via heapq.

- Assign binary codes: '0' for left, '1' for right.

- Encode tokens into a bitstring.

- Decode via reverse lookup in decompress_cache().

This simulates real-world token compression approaches.
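The steps listed above can be sketched as standalone functions with a round-trip check. This is an illustration under assumptions, not the notebook's class: `huffman_codes`, `encode`, and `decode` are hypothetical names, the counter-based tie-breaker is a choice made here, and the empty and single-token inputs are handled explicitly.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(tokens):
    """Build a prefix-free code table {token: bitstring} from token frequencies."""
    freq = Counter(tokens)
    if not freq:
        return {}
    if len(freq) == 1:                      # a lone symbol still needs one bit
        return {next(iter(freq)): "0"}
    tie = count()                           # tie-breaker so dicts are never compared
    heap = [(w, next(tie), {tok: ""}) for tok, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w_lo, _, lo = heapq.heappop(heap)   # two least frequent subtrees
        w_hi, _, hi = heapq.heappop(heap)
        merged = {t: "0" + c for t, c in lo.items()}        # left branch
        merged.update({t: "1" + c for t, c in hi.items()})  # right branch
        heapq.heappush(heap, (w_lo + w_hi, next(tie), merged))
    return heap[0][2]

def encode(tokens, codes):
    return "".join(codes[t] for t in tokens)

def decode(bits, codes):
    reverse = {c: t for t, c in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:              # prefix-free: first match is the token
            out.append(reverse[current])
            current = ""
    return out

tokens = "a b a c a b a d".split()
codes = huffman_codes(tokens)
bits = encode(tokens, codes)
print(codes)
print(len(bits))                            # 14
print(decode(bits, codes) == tokens)        # True
```

Frequent tokens get shorter codes: here `a` (4 occurrences) takes 1 bit, `b` takes 2, and `c` and `d` take 3 each, for 14 bits in total.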

In [None]:
#Compression Implementation
import json
import os
import time
import psutil  # For measuring memory usage
import requests
import heapq
from collections import deque, Counter

class KVCacheOptimized:
    """Optimized KV Cache with Dynamic Compression & Huffman Encoding."""

    CACHE_FILE = "kv_cache_compressed.json"

    def __init__(self, max_tokens=2048):
        self.max_tokens = max_tokens
        self.token_queue = deque()
        self.load_cache()
        self.previous_compression_ratio = None  # Store previous ratio for comparison

    def add_tokens(self, new_text):
        """Adds new tokens to KV cache & dynamically tracks compression ratio."""
        new_tokens = new_text.split()
        self.token_queue.extend(new_tokens)

        while len(self.token_queue) > self.max_tokens:
            self.token_queue.popleft()

        original_size = sum(len(token) for token in self.token_queue)  # Character count (equals bytes for ASCII text)
        compressed_size = self.get_compressed_size()  # Measure compressed size

        # Track compression ratio dynamically
        compression_ratio = original_size / max(1, compressed_size)

        if self.previous_compression_ratio:
            change = (compression_ratio - self.previous_compression_ratio) / self.previous_compression_ratio * 100
            print(f"📊 Compression Ratio: {compression_ratio:.2f} | 🔄 Change: {change:.2f}%")
        else:
            print(f"📊 Initial Compression Ratio: {compression_ratio:.2f}")

        self.previous_compression_ratio = compression_ratio  # Store for next comparison
        self.save_cache()

    def get_cached_context(self):
        """Retrieves last 256 tokens from cache to avoid corruption."""
        return " ".join(list(self.token_queue)[-256:])

    def get_compressed_size(self):
        """Returns the size of compressed KV cache in bytes."""
        compressed_data = self.compress_cache(list(self.token_queue))
        return len(compressed_data["compressed_tokens"]) // 8  # Convert bitstring to bytes

    def save_cache(self):
        """Compresses & saves KV cache to file."""
        compressed_tokens = self.compress_cache(list(self.token_queue))
        try:
            with open(self.CACHE_FILE, "w", encoding="utf-8") as file:
                json.dump(compressed_tokens, file)
            print("💾 KV Cache (Compressed) saved to file.")
        except Exception as e:
            print(f"⚠️ Error saving KV cache: {e}")

    def load_cache(self):
        """Loads KV cache from file."""
        if os.path.exists(self.CACHE_FILE):
            try:
                with open(self.CACHE_FILE, "r", encoding="utf-8") as file:
                    compressed_tokens = json.load(file)
                    self.token_queue = deque(self.decompress_cache(compressed_tokens)[-self.max_tokens:])
                print("🔄 KV Cache (Compressed) loaded from file.")
            except Exception as e:
                print(f"⚠️ Error loading KV cache: {e}")

    def compress_cache(self, tokens):
        """Applies Huffman encoding for better compression."""
        token_map = self.huffman_encoding(tokens)  # Get Huffman codes
        compressed_tokens = "".join(token_map[token] for token in tokens)  # Store as bitstring
        return {"token_map": token_map, "compressed_tokens": compressed_tokens}

    def decompress_cache(self, compressed_data):
        """Restores compressed cache using Huffman decoding."""
        token_map = {v: k for k, v in compressed_data["token_map"].items()}
        bitstring = compressed_data["compressed_tokens"]
        current_code = ""
        decompressed_tokens = []

        for bit in bitstring:
            current_code += bit
            if current_code in token_map:
                decompressed_tokens.append(token_map[current_code])
                current_code = ""

        return decompressed_tokens

    def huffman_encoding(self, tokens):
        """Builds a Huffman code table {token: bitstring} from token frequencies."""
        token_counts = Counter(tokens)
        if not token_counts:
            return {}  # Empty cache: nothing to encode
        if len(token_counts) == 1:
            # A single distinct token would otherwise get an empty code
            return {next(iter(token_counts)): "0"}

        # Each heap entry: [total_weight, [token, code], [token, code], ...]
        heap = [[weight, [token, ""]] for token, weight in token_counts.items()]
        heapq.heapify(heap)

        while len(heap) > 1:
            lo = heapq.heappop(heap)
            hi = heapq.heappop(heap)
            for pair in lo[1:]:
                pair[1] = '0' + pair[1]  # Prefix '0' for left subtree
            for pair in hi[1:]:
                pair[1] = '1' + pair[1]  # Prefix '1' for right subtree
            heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])

        huffman_dict = {pair[0]: pair[1] for pair in heap[0][1:]}  # token -> Huffman code
        return huffman_dict

    def clear_cache(self):
        """Clears KV cache."""
        self.token_queue.clear()
        if os.path.exists(self.CACHE_FILE):
            os.remove(self.CACHE_FILE)
        print("🗑️ KV Cache Cleared.")

    def measure_memory_usage(self):
        """Returns memory usage in MB."""
        process = psutil.Process(os.getpid())
        return process.memory_info().rss / (1024 * 1024)  # Convert bytes to MB

# Initialize Optimized KV Cache
kv_cache = KVCacheOptimized(max_tokens=2048)

def query_ollama_with_cache(api_url, model_name, user_input, kv_cache):
    """Queries Ollama API & analyzes KV Cache compression."""
    cached_context = kv_cache.get_cached_context()
    full_prompt = f"{cached_context} {user_input}" if cached_context else user_input

    payload = {
        "model": model_name,
        "prompt": full_prompt,
        "stream": False  # Ensure a full response is returned
    }

    headers = {"Content-Type": "application/json"}

    start_time = time.time()
    memory_before = kv_cache.measure_memory_usage()

    try:
        print(f"\n📤 Sending Request to API: {api_url}")
        print(f"📜 Payload: {json.dumps(payload, indent=2)}")

        response = requests.post(api_url, json=payload, headers=headers, timeout=50)

        print(f"📩 API Response Status: {response.status_code}")
        print(f"📩 API Response Text: {response.text}")

        response.raise_for_status()  # Ensure no HTTP error

        data = response.json()
        response_text = data.get("response", "No response received.")

        # Update KV Cache with new response
        kv_cache.add_tokens(response_text)

    except requests.exceptions.RequestException as e:
        print(f"❌ API Call Failed: {e}")
        response_text = "Error: API request failed."

    memory_after = kv_cache.measure_memory_usage()
    end_time = time.time()

    cache_size = kv_cache.get_compressed_size()
    inference_time = round(end_time - start_time, 4)
    memory_diff = round(memory_after - memory_before, 2)

    print(f"⏳ Inference Time: {inference_time} sec | 📦 Compressed KV Cache Size: {cache_size} bytes")
    print(f"🖥️ Memory Usage Change: {memory_diff} MB")

    return {"response": response_text, "inference_time": inference_time, "compressed_cache_size": cache_size, "memory_change": memory_diff}

# 🟢 Running API Tests
BASE_URL = "https://71cd-91-104-75-197.ngrok-free.app"  # Your ngrok URL
ENDPOINT = "/api/generate"
OLLAMA_API_URL = BASE_URL + ENDPOINT

print("\n🟢 Running Inference 1")
response1 = query_ollama_with_cache(OLLAMA_API_URL, "llama3", "Explain AI's impact on society.", kv_cache)
print("\nResponse 1:", response1)

print("\n🟢 Running Inference 2")
response2 = query_ollama_with_cache(OLLAMA_API_URL, "llama3", "How does reinforcement learning work?", kv_cache)
print("\nResponse 2:", response2)

# Clearing Cache and Retesting
kv_cache.clear_cache()
print("\n🟢 Running Inference 3 (After Cache Clear)")
response3 = query_ollama_with_cache(OLLAMA_API_URL, "llama3", "Can you recall previous topics?", kv_cache)
print("\nResponse 3:", response3)


🔄 KV Cache (Compressed) loaded from file.

🟢 Running Inference 1

📤 Sending Request to API: https://71cd-91-104-75-197.ngrok-free.app/api/generate
📜 Payload: {
  "model": "llama3",
  "prompt": "I'm a large language model, I don't have personal memories or the ability to recall specific conversations or topics. Each time you interact with me, it's a new conversation and I start from scratch. However, I can try to: 1. Use context: If we're discussing a topic that spans multiple messages, I can use the context of our previous messages to inform my responses. 2. Refer back to previous messages: If you explicitly reference a previous message or topic, I can look up the relevant information and respond accordingly. 3. Provide general information: If you ask about a topic we've discussed before, I can provide general information or insights related to that topic. That being said, I don't have the ability to recall specific conversations or topics from previous interactions. Each i