# Interactive Chatbot

This notebook demonstrates the core mechanics of building a stateful, multi-turn chatbot. The key to enabling a continuous conversation is the conversation history, a list that stores every user message and assistant response. On each turn, this history is sent back to the model, providing the necessary context to understand follow-up questions and maintain a coherent dialogue.
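
As a minimal sketch of that idea, the history can be pictured as a plain list of role/content dictionaries that grows by two entries per turn. The `fake_reply` function is a hypothetical stand-in for the model, used only so the example runs without loading any weights.

```python
# Hypothetical stand-in for the model: it only reports how much context it saw.
def fake_reply(messages):
    return f"(reply based on {len(messages)} messages of context)"

# The history starts with the system prompt
history = [{"role": "system", "content": "You are a helpful assistant."}]

for user_input in ["What is the capital of the United Kingdom?", "How old is it?"]:
    # Record the user's turn, then send the whole history to the "model"
    history.append({"role": "user", "content": user_input})
    reply = fake_reply(history)
    # Record the reply so the next turn can refer back to it
    history.append({"role": "assistant", "content": reply})

roles = [message["role"] for message in history]
print(roles)
```

Because the second call receives the first exchange as well, a follow-up such as "How old is it?" can be resolved against the earlier question.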

To keep the conversation from outgrowing the model's context window, the `manage_conversation_history` function applies a sliding window: it always keeps the system prompt and drops the oldest messages once the history exceeds a fixed number of turns. Because the limit counts turns rather than tokens, it bounds prompt growth but does not strictly guarantee that very long messages fit in the context window.

Response streaming (`stream=True`) matters for user experience. Instead of waiting for the entire reply, the code prints each piece (roughly a token) of the response as it is generated, creating the familiar real-time "typing" effect. The script simulates a conversation by iterating over predetermined inputs, appending both the user's turn and the model's full reply to the history after each exchange.
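
The streaming loop described above can be tried without a model. The sketch below uses a hypothetical `fake_stream` generator whose chunks mimic the `choices[0]["delta"]` shape consumed later in this notebook; the role-only first chunk and the empty final chunk are assumptions about what a real stream may contain.

```python
# Hypothetical chunk generator, for illustration only.
def fake_stream(text):
    # Assumed: the first chunk carries the role and no text
    yield {"choices": [{"delta": {"role": "assistant"}}]}
    for word in text.split(" "):
        yield {"choices": [{"delta": {"content": word + " "}}]}
    # Assumed: the final chunk carries an empty delta
    yield {"choices": [{"delta": {}}]}

def collect_stream(stream):
    """Print each piece as it arrives and return the assembled reply."""
    pieces = []
    for chunk in stream:
        piece = chunk["choices"][0]["delta"].get("content")
        if piece:  # skip chunks that carry no text
            print(piece, end="", flush=True)
            pieces.append(piece)
    print()  # end the streamed line
    return "".join(pieces)

full_response = collect_stream(fake_stream("London is the capital."))
```

Collecting the pieces while printing them is what makes it possible to store the complete reply in the history once the stream ends.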

In [None]:
from pathlib import Path

from llama_cpp import Llama

In [None]:
MODEL_ROOT = Path("../llama-cpp-python/models")
assert MODEL_ROOT.exists()

In [None]:
model_path = MODEL_ROOT / "text_gen/llama/meta-llama-3-8b-instruct.Q4_K_M.gguf"
assert model_path.exists()

In [None]:
llm = Llama(
    model_path=str(model_path),
    chat_format="llama-3",  # The chat format MUST match the model
    n_ctx=4096,             # Context window size in tokens
    n_gpu_layers=-1,        # Use -1 to offload all layers to the GPU
    verbose=True,           # Print llama.cpp logs; set to False to silence them
)

In [None]:
def manage_conversation_history(history, max_turns):
    """
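    Trim the conversation history to a sliding window of recent messages.

    This function keeps the system prompt (assumed to be the first message)
    and the most recent 'max_turns' of the user-assistant conversation.
    The limit counts messages, not tokens, so it bounds prompt growth but
    does not guarantee that the prompt fits in the model's context window.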

    Args:
        history (list): The list of conversation messages.
        max_turns (int): The maximum number of conversational turns to keep.

    Returns:
        list: The trimmed conversation history.
    """
    # A turn consists of one user message and one assistant message (2 items)
    max_history_items = max_turns * 2

    # Always keep the first message, which is the system prompt
    system_prompt = history[0]

    # Get the conversational part of the history (all but the system prompt)
    conversation = history[1:]

    # If the conversation is longer than the max allowed, trim it
    if len(conversation) > max_history_items:
        # Keep only the last 'max_history_items' messages
        trimmed_conversation = conversation[-max_history_items:]
        # Reconstruct the history with the system prompt and the trimmed conversation
        return [system_prompt] + trimmed_conversation

    # If not longer, return the original history
    return history
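
To see the sliding window in action without a model, the sketch below applies the same idea to a toy history of three complete turns. `trim_history` is a hypothetical, compact restatement of the function above written only for this illustration, and it assumes `max_turns` is at least 1.

```python
# Compact restatement of the sliding window, for illustration only.
# Assumes history[0] is the system prompt and max_turns >= 1.
def trim_history(history, max_turns):
    system_prompt, conversation = history[0], history[1:]
    return [system_prompt] + conversation[-max_turns * 2:]

# Build a toy history: system prompt plus three user/assistant turns
history = [{"role": "system", "content": "Be concise."}]
for i in range(1, 4):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

# Keep only the two most recent turns (four messages) plus the system prompt
trimmed = trim_history(history, max_turns=2)
for message in trimmed:
    print(f"{message['role']}: {message['content']}")
```

The first turn is dropped while the system prompt survives, and the original `history` list is left untouched, which is why the full history can still be kept alongside the trimmed copy.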

In [None]:
# --- Configuration ---
def get_config():
    """Return the turn limit and a fresh history holding only the system prompt."""
    # Maximum number of conversation turns to remember
    max_turns = 10

    # Start the conversation history with the system prompt
    conversation_history = [
        {
            "role": "system",
            "content": "You are a helpful and friendly assistant. Always be polite and concise."
        },
    ]

    return max_turns, conversation_history

In [None]:
# First, run a set of predetermined inputs to show how the chat loop works
MAX_TURNS, conversation_history = get_config()
predetermined_inputs = [
    "What is the capital of the United Kingdom?",
    "What is a famous landmark there that involves a clock?",
    "How old is that landmark?",
    "Thank you for the information."
]

print(f"--- Starting automated chat with {len(predetermined_inputs)} inputs ---")

# Loop through the predetermined inputs instead of waiting for user input
for user_input in predetermined_inputs:
    # Print the user's turn to simulate a conversation
    print(f"You: {user_input}")

    # Add the new user message to the full history
    conversation_history.append({"role": "user", "content": user_input})

    # Manage the history BEFORE sending it to the model
    history_for_model = manage_conversation_history(conversation_history, MAX_TURNS)

    print("Assistant: ", end="", flush=True)

    # Generate and stream the response
    response_stream = llm.create_chat_completion(
        messages=history_for_model,  # Use the potentially trimmed history
        stream=True,
    )

    full_response = ""
    for chunk in response_stream:
        delta = chunk["choices"][0]["delta"]
        # Some chunks carry no text (for example, a role-only chunk), so skip them
        token = delta.get("content")
        if token:
            print(token, end="", flush=True)
            full_response += token

    # End the streamed line and leave a blank line between turns
    print("\n")

    # Add the assistant's full response to the original, untrimmed history
    if full_response:
        conversation_history.append({"role": "assistant", "content": full_response})

print("--- Automated chat finished ---")