# Align LLM Judge

This notebook walks through how to align your LLM judge for document quality filtering.

We use our adaptation of the [EvalGen](https://arxiv.org/pdf/2404.12272) framework, a systematic approach to aligning your LLM judge with human preferences.

## 1. Setup

### 1.1 Install & Import

Install the necessary packages.

In [None]:
!pip install -r requirements.txt

Import modules.

In [3]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import json
from openai import OpenAI as OpenAIClient
from anthropic import Anthropic as AnthropicClient
from functions.llm import *
from functions.evaluate import *

### 1.2 Set Client

In [None]:
ANTHROPIC_API_KEY = "YOUR ANTHROPIC API KEY"

anthropic_client = AnthropicClient(api_key=ANTHROPIC_API_KEY)
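
If you would rather not paste the key into the notebook, one option is to read it from an environment variable. The helper below is a minimal sketch; `load_api_key` is a hypothetical name and not part of this repository.

```python
import os


def load_api_key(env_var: str = "ANTHROPIC_API_KEY") -> str:
    """Return the API key stored in `env_var`, failing loudly if it is missing.

    `load_api_key` is a hypothetical helper, not part of this repository.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this notebook."
        )
    return key


# Usage in the notebook (requires the `anthropic` package imported above):
# anthropic_client = AnthropicClient(api_key=load_api_key())
```

This keeps the secret out of the saved notebook, which matters if the notebook is committed to version control.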

### 1.3 Load in Labeled Data

Load your manually labeled data and your full corpus of documents.
- For the expected data schema, see `data/human_labeled_data.json` and `data/chroma_docs.json`.

We recommend ~200 labeled entries to start with.

In [None]:
with open('data/human_labeled_data.json', 'r') as f:
    human_labeled_documents = json.load(f)

with open('data/chroma_docs.json', 'r') as f:
    all_documents = json.load(f)

In [None]:
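# Split the corpus into documents with human labels and documents without.
labeled_ids = list(human_labeled_documents.keys())
labeled_documents = [all_documents[doc_id] for doc_id in labeled_ids]

# Use a set for membership tests so the split stays fast on large corpora.
labeled_id_set = set(labeled_ids)
unlabeled_ids = [doc_id for doc_id in all_documents if doc_id not in labeled_id_set]
unlabeled_documents = [all_documents[doc_id] for doc_id in unlabeled_ids]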
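
The lookup into `all_documents` above raises a `KeyError` if a labeled ID is missing from the corpus, so it can help to check coverage first. The sketch below assumes only that both files load as dictionaries keyed by document ID; `check_label_coverage` is a hypothetical helper, and the toy label values are made up for illustration.

```python
def check_label_coverage(human_labels: dict, corpus: dict) -> dict:
    """Summarize how the labeled IDs relate to the corpus IDs.

    Hypothetical helper: assumes both arguments are dicts keyed by document ID.
    """
    labeled = set(human_labels)
    corpus_ids = set(corpus)
    return {
        "n_labeled": len(labeled),
        "n_unlabeled": len(corpus_ids - labeled),
        # Labeled IDs that would raise a KeyError when looked up in the corpus.
        "missing_from_corpus": sorted(labeled - corpus_ids),
    }


# Toy data, made up for illustration; the real label schema may differ.
toy_corpus = {"a": "doc A", "b": "doc B", "c": "doc C"}
toy_labels = {"a": True, "z": False}

report = check_label_coverage(toy_labels, toy_corpus)
print(report)
```

In the notebook you would call `check_label_coverage(human_labeled_documents, all_documents)` and confirm that `missing_from_corpus` is empty before building `labeled_documents`.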

## 2. Baseline

### 2.1 Define Criteria

We define a set of baseline criteria to iterate on.

Fill in `context` and `user_intent` according to your use case.

In [2]:
context = "FILL IN WITH YOUR CONTEXT"
user_intent = "FILL IN WITH YOUR USER'S INTENT (e.g. seeking help with technical issues)"

Feel free to modify/add criteria as you see fit.

In [None]:
relevance = f"The document is relevant and something that users would search for considering the following context: {context}"

completeness = "The document is complete, meaning that it contains useful information to answer queries and does not only serve as an introduction or summary for the main content that users may be looking for."

intent = f"The document would be relevant in the case of a user {user_intent}"

criteria = [relevance, completeness, intent]
criteria_labels = ["relevant", "complete", "intent"]

### 2.2 Get LLM Labels

We create a batch request for our LLM calls (this is cheaper and typically faster).

In [None]:
filtered_documents_v1_id = create_document_filter_batch(
    client=anthropic_client,
    documents=labeled_documents,
    ids=labeled_ids,
    criteria=criteria,
    criteria_labels=criteria_labels
)

You can check the status of your batch through the [Anthropic Console](https://console.anthropic.com/workspaces/default/batches).
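
You can also poll the status from code. The sketch below assumes that `filtered_documents_v1_id` is the Anthropic Message Batch ID and that your installed `anthropic` SDK exposes `client.messages.batches.retrieve(...)` with a `processing_status` field, as recent versions do; `wait_for_batch` is a hypothetical helper, not part of this repository.

```python
import time


def wait_for_batch(client, batch_id: str, poll_seconds: float = 60.0, max_polls: int = 120) -> str:
    """Poll a message batch until processing has ended.

    Hypothetical helper. Assumes `client.messages.batches.retrieve(batch_id)`
    returns an object whose `processing_status` becomes "ended" when done.
    """
    for _ in range(max_polls):
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            return batch.processing_status
        # Still in progress (or canceling): wait before asking again.
        time.sleep(poll_seconds)
    raise TimeoutError(f"Batch {batch_id} did not finish after {max_polls} polls.")


# Usage in the notebook:
# wait_for_batch(anthropic_client, filtered_documents_v1_id)
```

A long poll interval is usually fine here, since batches are processed asynchronously and can take a while to complete.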

Retrieve the batch once it is finished.

In [None]:
filtered_documents_v1 = retrieve_document_filter_batch(
    client=anthropic_client,
    batch_id=filtered_documents_v1_id
)

### 2.3 Compare LLM vs Human Labels

We compare the LLM labels against our manual labels.

`criteria_threshold` is the number of criteria that must be met for a document to be considered "good quality".

In [None]:
llm_vs_human(
    llm_judgements=filtered_documents_v1,
    human_judgements=human_labeled_documents,
    documents_mapping=all_documents,
    criteria_labels=criteria_labels,
    criteria_threshold=2
)
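
To make the role of `criteria_threshold` concrete, here is a minimal sketch of how a threshold can turn per-criterion verdicts into a single quality label and an agreement rate. It is not the implementation of `llm_vs_human`; the data shapes (a dict of booleans per document for the LLM, one boolean per document for the human) and both function names are assumptions for illustration.

```python
def is_good_quality(criterion_results: dict, criteria_threshold: int) -> bool:
    """A document is 'good quality' if at least `criteria_threshold` criteria are met."""
    return sum(bool(met) for met in criterion_results.values()) >= criteria_threshold


def alignment_score(llm_judgements: dict, human_judgements: dict, criteria_threshold: int) -> float:
    """Fraction of shared documents where the LLM verdict matches the human label.

    Assumed shapes (hypothetical): llm_judgements[doc_id] -> {criterion: bool},
    human_judgements[doc_id] -> bool.
    """
    shared = [doc_id for doc_id in llm_judgements if doc_id in human_judgements]
    if not shared:
        raise ValueError("No documents have both LLM and human judgements.")
    agree = sum(
        is_good_quality(llm_judgements[doc_id], criteria_threshold) == bool(human_judgements[doc_id])
        for doc_id in shared
    )
    return agree / len(shared)


# Toy data, made up for illustration.
toy_llm = {
    "doc1": {"relevant": True, "complete": True, "intent": False},
    "doc2": {"relevant": True, "complete": False, "intent": False},
    "doc3": {"relevant": True, "complete": True, "intent": True},
    "doc4": {"relevant": False, "complete": False, "intent": False},
}
toy_human = {"doc1": True, "doc2": True, "doc3": True, "doc4": False}

for threshold in (1, 2, 3):
    print(threshold, alignment_score(toy_llm, toy_human, threshold))
```

On this toy data the agreement changes with the threshold, which is why the threshold is worth revisiting alongside the criteria themselves.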

## 3. Iterate

Based on the results above, improve your LLM vs Human alignment score by iterating on your criteria:
- Modify prompts
- Add/remove criteria
- After each change, check how the overall alignment score and the criterion-specific scores move
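
When deciding which criterion to revise first, a per-criterion agreement table can help. The sketch below assumes, hypothetically, that human labels are also available per criterion, which this notebook does not show; `per_criterion_agreement` and the data shapes are illustrative and not part of `functions.evaluate`.

```python
def per_criterion_agreement(llm_judgements: dict, human_criterion_labels: dict, criteria_labels: list) -> dict:
    """Agreement rate between LLM and human verdicts, computed separately per criterion.

    Assumed shapes (hypothetical): both inputs map doc_id -> {criterion: bool}.
    """
    shared = [doc_id for doc_id in llm_judgements if doc_id in human_criterion_labels]
    if not shared:
        raise ValueError("No documents have both LLM and human judgements.")
    scores = {}
    for label in criteria_labels:
        matches = sum(
            bool(llm_judgements[doc_id][label]) == bool(human_criterion_labels[doc_id][label])
            for doc_id in shared
        )
        scores[label] = matches / len(shared)
    return scores


# Toy data, made up for illustration.
toy_llm_v1 = {
    "d1": {"relevant": True, "complete": True},
    "d2": {"relevant": True, "complete": False},
}
toy_human_by_criterion = {
    "d1": {"relevant": True, "complete": False},
    "d2": {"relevant": True, "complete": False},
}

scores_v1 = per_criterion_agreement(toy_llm_v1, toy_human_by_criterion, ["relevant", "complete"])
print(scores_v1)
```

The criterion with the lowest agreement is usually the first candidate for a prompt rewrite; re-run the batch with the revised criteria and compare the new scores against the previous version.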