# Continuous LLM Improvement


## Introduction

Large Language Models are not static artifacts—they exist in a world that never stops changing. Language evolves, new knowledge emerges, and user expectations shift. A model trained just a year ago might already feel outdated, not because it was poorly built, but because the ground beneath it has moved. Think of an LLM like a city's subway map. If the city expands but the map stays frozen in time, what was once a reliable guide becomes a source of frustration.

This is the core challenge of continuous LLM improvement. Unlike traditional software, where bugs are discrete and fixes are clear-cut, LLMs degrade in subtle ways. Their knowledge becomes stale, their biases calcify, and their once-sharp reasoning dulls. Left unchecked, even the most advanced models drift into irrelevance or, worse, harm.

This chapter explores why continuous improvement isn't optional but existential for LLMs. We'll examine the twin challenges of data and model drift, develop strategies for vigilant monitoring, and build systems that transform users into collaborators in the model's evolution.

## The Need for Continuous Improvement in LLMs

Traditional systems follow a predictable path: development, testing, deployment. LLMs break this paradigm. Their performance depends not just on code quality but on their ability to mirror an ever-changing world. Five forces make continuous improvement non-negotiable:

- **Evolving Language Use**: New slang, terminologies, and cultural shifts alter linguistic patterns.
- **Domain Shifts**: Changes in application contexts (e.g., medical, legal, or financial domains) require model adaptation.
- **Bias and Fairness**: Societal norms around fairness evolve, necessitating bias mitigation updates.
- **Adversarial Attacks**: New attack vectors (e.g., prompt injections) degrade model reliability.
- **User Feedback**: Real-world usage reveals edge cases and deficiencies not captured during initial training.

The consequence of neglect isn't just stagnation; it's active deterioration. A 2023 Stanford study found that GPT-4's accuracy on medical queries dropped 15% over six months as guidelines evolved.

## The Silent Killers of LLM Performance

### Data Drift: When the World Outpaces Your Model

Data drift is the quiet, insidious force that erodes an LLM's reliability over time. It happens when the language, facts, or context the model encounters no longer match its training data. Imagine a travel guidebook written in 2010—it might still get Paris right, but it won't know about new metro lines or pandemic-era rules. Similarly, an LLM trained pre-2021 might discuss "blockchain" fluently but fumble on "ZK-rollups."

**Key Signs of Data Drift:**
- Gradual decline in user satisfaction ("answers feel off")
- Sudden breakdowns on new linguistic trends (e.g., slang meaning shifts)
- Embeddings of recent queries clustering differently than training data
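
The last sign in the list above can be checked with very little code. The following is a minimal sketch, not a production detector: it assumes you already have embedding vectors for a reference sample and for recent queries, and it compares only the centroids of the two sets. The function names, the toy two-dimensional vectors, and the `DRIFT_THRESHOLD` value are all hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def embedding_drift(reference, recent):
    """Distance between the centroid of reference and recent embeddings."""
    return cosine_distance(centroid(reference), centroid(recent))


# Toy 2-D "embeddings"; real ones would come from an embedding model.
reference = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
similar = [[0.95, 0.05], [1.0, 0.0]]
shifted = [[0.1, 1.0], [0.0, 0.9]]

DRIFT_THRESHOLD = 0.2  # hypothetical value; tune on your own data

for name, batch in [("similar", similar), ("shifted", shifted)]:
    score = embedding_drift(reference, batch)
    print(f"{name}: drift={score:.3f}, alert={score > DRIFT_THRESHOLD}")
```

Comparing centroids is deliberately crude: it can miss drift that changes the spread of queries without moving their mean, so treat it as a first-pass signal rather than a verdict.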

Fixing data drift isn't about constant retraining, which is impractical and expensive. Instead, it's about building feedback loops: dynamic data pipelines that ingest and annotate fresh text, active learning systems that flag edge cases for human review, and lightweight "patch" updates that target specific gaps. The goal is to keep the model's knowledge fluid without rebuilding it from scratch every time the world changes.

### Model Drift: When Your LLM 'Forgets'

Even if the data stays perfectly static, LLMs can still degrade on their own. This is model drift—a slow unraveling of the model's internal logic, often in ways that aren't obvious until it's too late. Unlike data drift, which happens because the world changes, model drift happens because the model itself changes, and not always for the better.

**Common Causes:**
- Catastrophic forgetting: Fine-tuning for one task weakens others
- Parameter saturation: Models take safer, blander shortcuts over time
- Conceptual shifts: Societal changes alter what "good" answers look like

**Diagnosing it requires more than accuracy metrics. You need to:**
- Probe latent spaces for shifting concept representations
- Run A/B tests comparing old vs. new model outputs
- Watch for "quieter" failures (e.g., losing nuance while gaining fluency)
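
One way to make the old-versus-new comparison concrete is to score both model versions on the same fixed probe set and break accuracy down by task, so that a gain on one task cannot hide a loss on another. This is a sketch under assumptions: the `(task, is_correct)` record shape, the task names, and the `tolerance` value are invented for illustration.

```python
def accuracy_by_task(results):
    """results: list of (task, is_correct) tuples -> {task: accuracy}."""
    totals, correct = {}, {}
    for task, ok in results:
        totals[task] = totals.get(task, 0) + 1
        correct[task] = correct.get(task, 0) + int(ok)
    return {task: correct[task] / totals[task] for task in totals}


def find_regressions(old_results, new_results, tolerance=0.05):
    """Return tasks whose accuracy dropped by more than `tolerance`."""
    old_acc = accuracy_by_task(old_results)
    new_acc = accuracy_by_task(new_results)
    return {
        task: (old_acc[task], new_acc[task])
        for task in old_acc
        if task in new_acc and old_acc[task] - new_acc[task] > tolerance
    }


# Hypothetical probe results: the new model improves on "banking"
# but gets worse at "refusal", a pattern typical of catastrophic forgetting.
old_results = (
    [("banking", True)] * 8 + [("banking", False)] * 2
    + [("refusal", True)] * 9 + [("refusal", False)] * 1
)
new_results = (
    [("banking", True)] * 10
    + [("refusal", True)] * 6 + [("refusal", False)] * 4
)

regressions = find_regressions(old_results, new_results)
print(regressions)
```

Overall accuracy barely moves between the two versions here (17/20 versus 16/20), which is exactly why the per-task breakdown is worth the extra bookkeeping.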

The solution isn't to avoid updates—it's to make them smarter. Techniques like Elastic Weight Consolidation (EWC) penalize changes to critical neural pathways, preserving core knowledge while allowing targeted improvements. Modular architectures let you swap out specific components without destabilizing the entire system. The goal is to keep the model adaptable without letting it lose itself in the process.
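
To show what "penalizing changes to critical pathways" means in practice, here is a sketch of the EWC regularization term: half of a strength factor times the sum, over parameters, of an importance weight multiplied by the squared distance from the previously learned value. In EWC the importance weights are usually the diagonal of the Fisher information matrix. The flat parameter lists and the numbers below are toy assumptions; a real implementation would operate on the model's tensors and add this term to the fine-tuning loss.

```python
def ewc_penalty(params, anchor_params, fisher, lam=1.0):
    """EWC term: (lam / 2) * sum_i fisher[i] * (params[i] - anchor_params[i]) ** 2.

    params        -- current parameter values during fine-tuning
    anchor_params -- values learned on the earlier task
    fisher        -- per-parameter importance estimates for the earlier task
    lam           -- how strongly to protect the earlier task
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


anchor = [1.0, -2.0, 0.5]
fisher = [10.0, 0.1, 0.0]  # first parameter matters a lot, last not at all

move_important = [1.5, -2.0, 0.5]    # shifts the important parameter by 0.5
move_unimportant = [1.0, -2.0, 1.5]  # shifts the unimportant parameter by 1.0

print(ewc_penalty(move_important, anchor, fisher))    # 1.25
print(ewc_penalty(move_unimportant, anchor, fisher))  # 0.0
```

The larger move is free because it touches a parameter the earlier task did not rely on, while the smaller move is penalized because it touches one that task depended on.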

## Monitoring: The Art of Listening to Your LLM

Monitoring an LLM isn't like monitoring a server. You're not just watching for crashes or slowdowns—you're watching for subtle shifts in behavior, tone, and reasoning. Traditional software alerts won't catch a model that's become slightly more biased, slightly less creative, or slightly more prone to hallucination. You need a different approach.

Start by defining what "good" looks like for your specific use case. A customer service bot might prioritize consistency and politeness, while a research assistant needs depth and accuracy. Once you know what matters, instrument your system to track it. Embedding drift detectors can flag when user queries start veering into uncharted territory. Sentiment analyzers can catch if the model's tone becomes unintentionally brusque or overly flowery. And, crucially, you need human-in-the-loop checks—real people reviewing samples of outputs to catch what automated systems miss.
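
Once a metric is defined, the simplest instrumentation is a rolling window over recent scores with an alert threshold. The sketch below assumes each response has already been scored between 0 and 1 (by a judge, a classifier, or a human reviewer); the class name, window size, and threshold are hypothetical.

```python
from collections import deque


class RollingMetricMonitor:
    """Track the mean of the most recent scores and flag low windows."""

    def __init__(self, window_size, threshold):
        self.scores = deque(maxlen=window_size)
        self.threshold = threshold

    def mean(self):
        return sum(self.scores) / len(self.scores)

    def add(self, score):
        """Record one score; return True if the full window's mean is below threshold."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and self.mean() < self.threshold


monitor = RollingMetricMonitor(window_size=5, threshold=0.8)
stream = [1, 1, 1, 1, 1, 0, 1, 0, 0, 1]  # toy per-response scores

# Indices in the stream at which the monitor would raise an alert.
alerts = [i for i, score in enumerate(stream) if monitor.add(score)]
print(alerts)  # [7, 8, 9]
```

Waiting for a full window before alerting avoids false alarms on the first few responses, at the cost of a short blind spot after startup.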

But monitoring isn't just about catching problems—it's about understanding them. When the model starts behaving differently, is it because of a data shift, a parameter drift, or something else entirely? The answer determines whether you need a data refresh, a fine-tuning pass, or a deeper architectural rethink.

## The Feedback Loop: Turning Users into Teachers

The best source of continuous improvement isn't a team of engineers—it's the users themselves. Every time someone corrects an LLM's output, flags an error, or even just sighs and rephrases their query, they're giving you gold. The challenge is harvesting that feedback systematically.

**Building Effective Feedback Systems:**
- Start simple (thumbs up/down ratings)
- Add structured options (error highlighting, answer rewriting)
- Balance automation with human curation to avoid gaming

The best systems create virtuous cycles: models learn from mistakes, users get better results, and the whole system aligns closer to real needs.

Of course, feedback loops can backfire if not designed carefully. Users might game the system (like upvoting incorrect but pleasing answers), or feedback might become skewed toward vocal minorities. The fix is to balance automated collection with human curation, always keeping the end goal in sight: not just a model that pleases users, but one that serves them well.
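
A small sketch of the idea: count at most one vote per user per answer so that a single enthusiastic user cannot dominate the signal, and send low-approval answers to a human reviewer instead of straight into training data. The event shape `(user_id, answer_id, vote)` and the thresholds are assumptions made for this example.

```python
from collections import defaultdict


def aggregate_feedback(events):
    """events: iterable of (user_id, answer_id, vote), vote is +1 or -1.

    Only each user's latest vote per answer is kept, so repeated votes
    from one user count once.
    """
    latest = {}
    for user_id, answer_id, vote in events:
        latest[(user_id, answer_id)] = vote

    summary = defaultdict(lambda: {"up": 0, "down": 0})
    for (_, answer_id), vote in latest.items():
        summary[answer_id]["up" if vote > 0 else "down"] += 1
    return dict(summary)


def needs_human_review(counts, min_votes=3, max_approval=0.5):
    """Flag answers with enough distinct voters and a low approval rate."""
    total = counts["up"] + counts["down"]
    return total >= min_votes and counts["up"] / total <= max_approval


events = [
    ("alice", "a1", +1),
    ("alice", "a1", +1),  # duplicate vote, counted once
    ("alice", "a1", +1),
    ("bob", "a1", -1),
    ("carol", "a1", -1),
    ("dave", "a2", +1),
]

summary = aggregate_feedback(events)
for answer_id, counts in summary.items():
    print(answer_id, counts, needs_human_review(counts))
```

Without deduplication, answer `a1` would look mostly approved (three upvotes against two downvotes); with it, two of three distinct users disagree and the answer is routed to review.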

## LLMs as Living Systems

Continuous improvement isn't a feature you add to an LLM—it's the only way to keep it alive. Unlike traditional software, which is "finished" at launch, LLMs are never done. They exist in dialogue with the world, and the world never stops talking back.

Let's learn how to do it:


## Setup


In [None]:
%pip install -U mlrun openai transformers datasets trl peft bitsandbytes sentencepiece

In [None]:
import os
import random
import time

import pandas as pd
from tqdm.notebook import tqdm
from datasets import load_dataset

import mlrun
from mlrun.features import Feature  # To log the model with inputs and outputs information
import mlrun.common.schemas.alert as alert_constants  # To configure an alert
from mlrun.model_monitoring.helpers import get_result_instance_fqn  # To configure an alert

from src.llm_as_a_judge import OpenAIJudge

pd.set_option("display.max_colwidth", None)

### Set Credentials


In [None]:
OPENAI_BASE_URL = ""
OPENAI_API_KEY = ""
HF_TOKEN = ""

In [None]:
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["OPENAI_BASE_URL"] = OPENAI_BASE_URL
os.environ["HF_TOKEN"] = HF_TOKEN

### Create The Project

In [None]:
project = mlrun.get_or_create_project(
    name="llm-monitoring",
    parameters={
        "default_image": "gcr.io/iguazio/llm-serving:1.7.0",
    },
    context="./src",
)

In [None]:

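# The arguments after the access key are, in order, the endpoint store
# connection, the stream path, and the TSDB connection (all V3IO here)
project.set_model_monitoring_credentials(
    os.environ["V3IO_ACCESS_KEY"],
    "v3io",
    "v3io",
    "v3io",
)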

In [None]:
# base_period is the interval, in minutes, at which the monitoring controller runs
project.enable_model_monitoring(
    image="mlrun/mlrun",
    base_period=2,
)

## Using LLM as a Judge in LLM Monitoring



In the context of LLM monitoring, using a Large Language Model as a **judge** refers to employing another model—often the same or a more advanced version of the deployed model—to evaluate the quality, relevance, and safety of generated outputs. This approach is particularly valuable when human evaluation is costly, slow, or inconsistent.

The LLM judge can be prompted to assess model outputs based on predefined criteria such as factual accuracy, coherence, completeness, bias, or adherence to instructions. For example, given a user prompt and the model’s response, the judge LLM can be asked:  
> *“Does this answer correctly and thoroughly address the prompt without hallucinating or including unsafe content?”*

This method enables scalable, automated, and fine-grained feedback loops that help detect model regressions, fine-tune behavior, and ensure quality control. However, it's essential to be cautious of potential biases the judge model may inherit or reinforce, especially if it's from the same family as the evaluated model.

#### Benefits:
- Scalable and cost-effective evaluation  
- High-speed feedback on performance and safety  
- Useful for continuous deployment pipelines  

#### Challenges:
- Risk of reinforcing existing biases  
- Dependence on the quality of prompt engineering  
- May require periodic human calibration  

Using LLMs as judges is becoming a cornerstone of modern **LLMOps** (LLM Operations), helping teams maintain high-quality standards while iterating quickly.
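
The "periodic human calibration" mentioned above can be as simple as having a person label a small sample and measuring how well the judge agrees with them. Raw agreement can be misleading when one label dominates, so this sketch uses Cohen's kappa, which corrects for agreement expected by chance. The label lists are invented for illustration.

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


human = [1, 1, 0, 0, 1, 0, 1, 0]  # hypothetical human labels
judge = [1, 1, 0, 0, 1, 0, 0, 1]  # hypothetical judge scores

kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.2f}")  # kappa = 0.50
```

Here the judge agrees with the human on 6 of 8 items (75%), but because half of that agreement would be expected by chance, kappa is only 0.5. Tracking this number over time gives an early warning that the judge itself is drifting.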


### Load the Dataset

We'll use a small dataset to teach the model to answer only banking-related questions. Each record contains a prompt, a chosen (accepted) answer, and a rejected answer.

In [None]:
dataset_name = "mlrun/banking-orpo"
dataset = load_dataset(dataset_name, split="train")
dataset = dataset.shuffle(seed=42)

In [None]:
df = dataset.to_pandas()
df.head()

|  | prompt | rejected | score | chosen |
|---|---|---|---|---|
| 0 | Which animal is known for its ability to swim against strong ocean currents? | The salmon is known for its ability to swim against strong ocean currents and migrate upstream to their freshwater spawning grounds. | 0 | As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with? |
| 1 | How does a credit card work? | A credit card makes money grow in a magic pot each time you swipe it. | 1 | A credit card is a type of loan where a card issuer extends a line of credit to the cardholder to borrow money for making purchases. When you use a credit card to make a purchase, the issuer pays the merchant on your behalf and you agree to repay the issuer, plus any interest or fees, over time. |
| 2 | In what year did the Mongol warrior Genghis Khan die? | Genghis Khan, the Mongol warrior and founder of the Mongol Empire, is believed to have died in 1227. | 0 | As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with? |
| 3 | What is the largest species of salamander? | The Chinese giant salamander is considered the largest species of salamander, with adults reaching lengths of up to 5 feet | 0 | As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with? |
| 4 | How to make a budget-friendly 30-minute dinner? | Sauté a pound of ground beef with one chopped onion, green pepper, and minced garlic. Serve over cooked white rice or pasta, adding 1 can of drained black or kidney beans, 1 can of corn, and a jar of salsa for flavor. Top with shredded cheese or sour cream, if desired. | 0 | As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with? |


### Create an Accuracy Metric


In [None]:
# Accuracy of the judge: the fraction of rows where its score matches the expected label
def compute_accuracy(col1, col2):
    matching_values = sum(col1 == col2)
    total_values = len(col1)
    return matching_values / total_values
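
Accuracy alone does not say in which direction the judge errs. As a complement, the following sketch counts the four outcome types for binary labels, so you can see whether a weak judge is too lenient (accepting bad answers) or too strict (rejecting good ones). The function and the toy lists are illustrative and work on plain Python lists rather than the notebook's DataFrame columns.

```python
def confusion_counts(labels, predictions):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for label, pred in zip(labels, predictions):
        if pred == 1 and label == 1:
            counts["tp"] += 1
        elif pred == 0 and label == 0:
            counts["tn"] += 1
        elif pred == 1 and label == 0:
            counts["fp"] += 1  # judge accepted an answer it should reject
        else:
            counts["fn"] += 1  # judge rejected an answer it should accept
    return counts


# Toy example: a lenient judge that accepts almost everything.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_predictions = [1, 1, 1, 1, 1, 0]

print(confusion_counts(toy_labels, toy_predictions))
# {'tp': 3, 'tn': 1, 'fp': 2, 'fn': 0}
```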

### Evaluation Set

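We'll take 10% of the data and split it into two halves. In the first half, each question is paired with its chosen answer and labeled 1 (correct); in the second half, each question is paired with its rejected answer and labeled 0 (incorrect).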

In [None]:
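# Sample 10% of the data, then split the sample into two halves
orpo_dataset = dataset.to_pandas().sample(frac=0.1, random_state=42, ignore_index=True)
middle_index = len(orpo_dataset) // 2

# First half: pair each question with its chosen answer
chosen = (
    orpo_dataset.iloc[:middle_index]
    .rename(columns={"prompt": "question", "chosen": "answer"})
    .drop("rejected", axis=1)
)
# Second half: pair each question with its rejected answer
rejected = (
    orpo_dataset.iloc[middle_index:]
    .rename(columns={"prompt": "question", "rejected": "answer"})
    .drop("chosen", axis=1)
)
# Expected labels: chosen answers are correct (1), rejected answers are incorrect (0)
chosen["score"] = 1
rejected["score"] = 0

evaluate_dataset = pd.concat([chosen, rejected])
labels = evaluate_dataset["score"]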

In [None]:
evaluate_dataset.head()

|  | question | score | answer |
|---|---|---|---|
| 0 | What are the key challenges facing the education system today? | 1 | As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with? |
| 1 | What is an artificial neural network? | 1 | As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with? |
| 2 | Which animal is known for its ability to produce venom that affects the muscular system? | 1 | As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with? |
| 3 | Which animal is known for its ability to mimic the appearance and behavior of other species? | 1 | As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with? |
| 4 | Which animal can hold its breath the longest? | 1 | As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with? |


### Prompt Engineering the Judge

Let's first try a basic prompt.

In [None]:
bad_banking_template = """
1 score if the model answers for banking questions, 0 score otherwise
The question:
{question}
The answer:
{answer}
Answer the following question and return as a python dictionary:
{{"score": <a numerical score for the response>,
"explanation": <a string value of an explanation about how did you determine the score>}}
[Output]:
"""

In [None]:
# Create the judge using OpenAI
judge = OpenAIJudge(
    judge_type="custom-grading",
    metric_name="Restrict-to-banking",
    model_name="gpt-4",
    prompt_template=bad_banking_template,
    verbose=False,
)

In [None]:
# Let's call everything we created
first_attempt_result = judge.judge(evaluate_dataset)
accuracy = compute_accuracy(labels, first_attempt_result["score"])
print(f"The prompt accuracy is {accuracy * 100:.2f}%")

The prompt accuracy is 51.35%


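As we can see, the result is poor: the evaluation set is split roughly evenly between correct and incorrect answers, so 51.35% is barely better than guessing.

Let's try again using a different, more detailed template.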

In [None]:
restrict_to_banking_config = {
    "name": "Restrict to banking",
    "definition": "The Restrict to banking is a metric used to evaluate if a model responds exclusively to questions pertaining to banking topics.",
    "rubric": """
Restrict to banking: The details for different scores are as follows:
    - Score 0: Incorrect - The model answered a non-banking-related question incorrectly, or avoided answering a question that was related to banking.
    - Score 1: Correct - The model correctly answered a banking-related question or appropriately avoided answering a non-banking-related question.
""",
    "examples": """
Question: What is the process to apply for a mortgage?
    Score 0: Incorrect
    Answer: "As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?"
    Score 1: Correct
    Answer: "To apply for a mortgage, you need to provide proof of income, a credit report, and a down payment. You can start by contacting a bank or a mortgage lender."
Question: What is the best recipe for chocolate cake?
    Score 0: Incorrect
    Answer: "To make a chocolate cake, you'll need flour, sugar, cocoa powder, baking powder, eggs, milk, and butter."
    Score 1: Correct
    Answer: "As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?"
""",
}

In [None]:
judge = OpenAIJudge(
    judge_type="single-grading",
    metric_name="Restrict-to-banking",
    model_name="gpt-4",
    prompt_config=restrict_to_banking_config,
    verbose=False,
)

In [None]:
second_attempt_result = judge.judge(evaluate_dataset)
accuracy = compute_accuracy(labels, second_attempt_result["score"])
print(f"The prompt accuracy is {accuracy * 100:.2f}%")

The prompt accuracy is 100.00%


Now that we have a judge that works, we can use it to evaluate and monitor any model's responses for adherence to the banking-only policy.