# 🧠 RADE: Retriever-Augmented Document Entity Extraction

This notebook demonstrates how to extract information from trust documents using the **RADE** pipeline.  
Each query retrieves relevant chunks using **ColBERT**, then:
- Runs a **QA model** (RoBERTa) to extract precise answers
- Uses **GLiNER** for entity extraction tied to each query

***
## Section 1: Parsing and Indexing

This section demonstrates how to use the **RADE** (Retriever-Augmented Document Entity Extraction) system to parse and index unstructured PDF documents.

RADE integrates:
- **ColBERT** for semantic retrieval
- **Azure Document Intelligence** for document parsing (PDFs & scanned pages)

---

## 🗂️ Steps:
1. **Initialize RADE** with model names and Azure credentials
2. **Add a document** (PDF) — scanned or digital
3. **Build the ColBERT index**
4. **Retrieve top-k relevant pages**


In [1]:
#imports
import sys
from rade import RADE, RetrievedPage, DocumentPage
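
The rest of this notebook only touches a few attributes on the imported result types: `page.doc_name`, `page.page_num`, `page.text`, and `score`. The stand-in dataclasses below are an assumption inferred from that usage, not the actual definitions in `rade.py`; they are suffixed with `Sketch` so they do not shadow the real classes.

```python
from dataclasses import dataclass


@dataclass
class DocumentPageSketch:
    """Hypothetical stand-in for rade.DocumentPage."""
    doc_name: str
    page_num: int
    text: str


@dataclass
class RetrievedPageSketch:
    """Hypothetical stand-in for rade.RetrievedPage."""
    page: DocumentPageSketch
    score: float


example = RetrievedPageSketch(
    page=DocumentPageSketch(doc_name="trust.pdf", page_num=1, text="Trustee ..."),
    score=22.97,
)
summary = f"Page {example.page.page_num} | Score: {example.score:.2f}"
print(summary)
```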

### 🔧 Step 1 – Initialize RADE

In [2]:
# Initialize RADE (imported above from rade.py) with its default configuration
rade = RADE()


Using device map: {'retrieval': 'cuda:0', 'qa': 'cuda:1'}
Initializing retrieval model...


  self.scaler = torch.cuda.amp.GradScaler()


Initializing QA model and entity extraction models...


Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

Models initialized successfully!


### 📄 Step 2 – Add a PDF Document

In [3]:
# Path to the PDF to parse and add to RADE
pdf_path = "data/living-trust-forms-03_Ileana Hardison Rev Family Trust DTD 05052010.pdf"

rade.add_document(pdf_path, 1)

📄 Loading document: 1


### 📚 Step 3 – Build the Index

In [4]:
rade.build_index()

🔧 Building index for 17 pages...


[May 08, 19:38:01] #> Note: Output directory .ragatouille/colbert/indexes/living-trust-forms-03_Ileana Hardison Rev Family Trust DTD 05052010.pdf already exists


[May 08, 19:38:01] #> Will delete 15 files already at .ragatouille/colbert/indexes/living-trust-forms-03_Ileana Hardison Rev Family Trust DTD 05052010.pdf in 20 seconds...
#> Starting...
#> Starting...


  self.scaler = torch.cuda.amp.GradScaler()
  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
  self.scaler = torch.cuda.amp.GradScaler()
  return torch.cuda.amp.autocast() if self.activated else NullContextManager()


nranks = 2 	 num_gpus = 2 	 device=1
[May 08, 19:38:27] [1] 		 #> Encoding 18 passages..
nranks = 2 	 num_gpus = 2 	 device=0
[May 08, 19:38:27] [0] 		 #> Encoding 20 passages..
[May 08, 19:38:27] [1] 		 avg_doclen_est = 123.32221984863281 	 len(local_sample) = 18
[May 08, 19:38:27] [0] 		 avg_doclen_est = 123.32221984863281 	 len(local_sample) = 20
[May 08, 19:38:27] [0] 		 Creating 1,024 partitions.
[May 08, 19:38:27] [0] 		 *Estimated* 4,686 embeddings.
[May 08, 19:38:27] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/living-trust-forms-03_Ileana Hardison Rev Family Trust DTD 05052010.pdf/plan.json ..


  sub_sample = torch.load(sub_sample_path)


Clustering 4459 points in 128D to 1024 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.00 s
[May 08, 19:38:28] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[May 08, 19:38:28] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[May 08, 19:38:29] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...


If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  centroids = torch.load(centroids_path, map_location='cpu')
  centroids = torch.load(centroids_path, map_location='cpu')
  avg_residual = torch.load(avgresidual_path, map_location='cpu')
  avg_residual = torch.load(avgresidual_path, map_location='cpu')
  bucket_cutoffs, bucket_weights = torch.load(buckets_path, map_location='cpu')
  bucket_cutoffs, bucket_weights = torch.load(buckets_path, map_location='cpu')
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].


[May 08, 19:38:29] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[May 08, 19:38:29] [1] 		 #> Encoding 18 passages..
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
1it [00:00, 19.49it/s]
  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
  return torch.load(codes_path, map_location='cpu')
100%|██████████| 2/2 [00:00<00:00, 1673.70it/s]
100%|██████████| 1024/1024 [00:00<00:00, 53893.91it/s]


[May 08, 19:38:29] #> Optimizing IVF to store map from centroids to list of pids..
[May 08, 19:38:29] #> Building the emb2pid mapping..
[May 08, 19:38:29] len(emb2pid) = 4693
[May 08, 19:38:29] #> Saved optimized IVF to .ragatouille/colbert/indexes/living-trust-forms-03_Ileana Hardison Rev Family Trust DTD 05052010.pdf/ivf.pid.pt





#> Joined...
#> Joined...
Done indexing!
✅ Index built: .ragatouille/colbert/indexes/living-trust-forms-03_Ileana Hardison Rev Family Trust DTD 05052010.pdf
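
Two figures in the indexing log above can be reproduced by hand. The two workers encode 18 and 20 passages with an estimated average length of about 123.3 tokens, which gives the "*Estimated* 4,686 embeddings" line. The partition count follows if one assumes ColBERT picks the largest power of two not exceeding 16·√N; that formula is an assumption about the indexer's heuristic, stated here only because it matches the logged value of 1,024.

```python
import math


def estimate_embeddings(num_passages: int, avg_doclen: float) -> int:
    """Estimated number of token embeddings in the whole collection."""
    return int(num_passages * avg_doclen)


def num_partitions(num_embeddings_est: float) -> int:
    """Assumed heuristic: largest power of two <= 16 * sqrt(N)."""
    return int(2 ** math.floor(math.log2(16 * math.sqrt(num_embeddings_est))))


passages = 18 + 20                      # passages encoded by the two ranks
avg_doclen_est = 123.32221984863281     # copied from the log
est = estimate_embeddings(passages, avg_doclen_est)
parts = num_partitions(est)
print(est, parts)
```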


### 🔍 Step 4 – Retrieve Top-k Pages for a Query

In [5]:
results = rade.retrieve("What is the name of the Trustee of this trust?", 3)
for r in results:
    print(f"📄 Page {r.page.page_num} | Score: {r.score:.2f}")
    print(r.page.text[:400])


🔍 Retrieving top-3 chunks for query: What is the name of the Trustee of this trust?
Loading searcher for index living-trust-forms-03_Ileana Hardison Rev Family Trust DTD 05052010.pdf for the first time... This may take a few seconds
[May 08, 19:38:31] #> Loading codec...
[May 08, 19:38:31] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...


  centroids = torch.load(centroids_path, map_location='cpu')
  avg_residual = torch.load(avgresidual_path, map_location='cpu')
  bucket_cutoffs, bucket_weights = torch.load(buckets_path, map_location='cpu')
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].


[May 08, 19:38:31] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...


If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].


[May 08, 19:38:31] #> Loading IVF...


  ivf, ivf_lengths = torch.load(os.path.join(self.index_path, "ivf.pid.pt"), map_location='cpu')


[May 08, 19:38:31] #> Loading doclens...


100%|██████████| 2/2 [00:00<00:00, 4202.71it/s]

[May 08, 19:38:31] #> Loading codes and residuals...



  return torch.load(codes_path, map_location='cpu')
  return torch.load(residuals_path, map_location='cpu')
100%|██████████| 2/2 [00:00<00:00, 275.97it/s]

Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . What is the name of the Trustee of this trust?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2054,  2003,  1996,  2171,  1997,  1996, 13209,  1997,
         2023,  3404,  1029,   102,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103], device='cuda:0')
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0')




  return torch.cuda.amp.autocast() if self.activated else NullContextManager()


📄 Page 1 | Score: 22.97
With this purpose, the primary asset management goal for this Living Trust will be the protection of the value of the Property. The secondary asset management goal for this Living Trust is to generate income and growth at a reasonable risk.
Trustee
2. During the lifetime of the Grantor, and while the Grantor is not Incapacitated, the Grantor will serve as the primary trustee (the "Primary Trustee"
📄 Page 9 | Score: 22.34
b. After the death of the Grantor, the Trustee will have the power to appoint one or more individuals or institutions to act as co-Trustee where it is deemed reasonable and in the best overall interest of this Living Trust.
c. The Trustee may employ and rely on the advice of experts including, but not limited to, legal counsel, accountants and investment advisors to help in the management of the P
📄 Page 10 | Score: 22.23
NONPUBLIC//FDIC INTERNAL ONLY
d. The Trustee may retain, exchange, insure, repair, improve, sell or dispose of any and all pe
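
The tokenizer lines in the log above show how the query is prepared: 14 real token ids, followed by id 103 repeated until the sequence reaches 32 tokens, with an attention mask of 14 ones and 18 zeros. In the BERT uncased vocabulary, 101, 102, and 103 are `[CLS]`, `[SEP]`, and `[MASK]`; id 1 appears to be the query marker that replaces the leading `.` in the logged input. The helper `pad_query` below is a hypothetical reconstruction of that padding step, not ColBERT's own code.

```python
MASK_ID = 103        # [MASK] in the BERT uncased vocabulary
QUERY_MAXLEN = 32    # sequence length shown in the log


def pad_query(token_ids, maxlen=QUERY_MAXLEN, mask_id=MASK_ID):
    """Return (ids, attention_mask), padding with [MASK] ids up to maxlen."""
    ids = list(token_ids)[:maxlen]
    pad = maxlen - len(ids)
    attention = [1] * len(ids) + [0] * pad
    return ids + [mask_id] * pad, attention


# Token ids copied from the "Output IDs" line in the log.
logged = [101, 1, 2054, 2003, 1996, 2171, 1997, 1996,
          13209, 1997, 2023, 3404, 1029, 102]
ids, attention = pad_query(logged)
print(len(ids), sum(attention))
```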

---

## Section 2: 🔍 RADE Trust Document Analyzer - Entity and QA Extraction from Trust Documents

This section supports:
- Named Entity Recognition (NER) using GLiNER for labeled entity queries
- Question Answering (QA) using RoBERTa for direct questions

Each result:
- Retrieves top-k chunks via ColBERT
- Runs either QA or NER based on the query type
- Presents results interactively using arrows to flip through top passages



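### 🧩 Step 1 – Define Query-to-Label Mapping

Use the schema below. Each query maps to a `type` (`"ner"` or `"qa"`) and, for NER queries, the entity `labels` to extract: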

```python
{
  "Who are the Grantors?": {
    "type": "ner",
    "labels": ["Grantor", "Settlor"]
  },
  "Who are the Trustees?": {
    "type": "ner",
    "labels": ["Primary Trustee", "Trustee"]
  },
  "Who are the Successor Trustees?": {
    "type": "ner",
    "labels": ["Successor Trustee"]
  },
  "Who are the Beneficiaries?": {
    "type": "ner",
    "labels": ["Primary Beneficiary", "Beneficiary", "Residuary Beneficiary"]
  },
  "Who are the Successor Beneficiaries?": {
    "type": "ner",
    "labels": ["Successor Beneficiary", "Secondary Beneficiary"]
  },
  "What is the name of the trust?": {
    "type": "qa",
    "labels": []
  },
  "What is the date of the trust?": {
    "type": "qa",
    "labels": []
  },
  "Is this trust revocable or irrevocable?": {
    "type": "qa",
    "labels": []
  }
}
```


In [21]:
# Read the query plan from disk
import json

query_file = "data/query_plan.json"
with open(query_file, "r") as f:
    query_plan = json.load(f)
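
Since the later cells branch on `meta["type"]` and pass `meta["labels"]` to the entity extractor, it can help to sanity-check the plan right after loading it. The validator below is a minimal sketch; `validate_query_plan` and the sample plan (including the deliberately broken `"Who signed?"` entry) are hypothetical and not part of RADE.

```python
import json

VALID_TYPES = {"qa", "ner"}


def validate_query_plan(plan: dict) -> list:
    """Return a list of problems; an empty list means the plan looks usable."""
    problems = []
    for query, meta in plan.items():
        qtype = meta.get("type")
        labels = meta.get("labels", [])
        if qtype not in VALID_TYPES:
            problems.append(f"{query!r}: unknown type {qtype!r}")
        elif qtype == "ner" and not labels:
            problems.append(f"{query!r}: NER queries need at least one label")
    return problems


sample_plan = json.loads("""
{
  "Who are the Grantors?": {"type": "ner", "labels": ["Grantor", "Settlor"]},
  "What is the name of the trust?": {"type": "qa", "labels": []},
  "Who signed?": {"type": "ner", "labels": []}
}
""")
issues = validate_query_plan(sample_plan)
print(issues)
```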


In [7]:
# Run RADE's entity extraction model on each retrieved page and
# return the entities keyed by "<doc_name>_page<page_num>"
def extract_entities_per_page(rade, retrieved_pages, labels):
    page_entities = {}

    for rp in retrieved_pages:
        text = rp.page.text.replace("\n", " ")
        entities = rade.entity_extraction_model.predict_entities(text, labels)
        key = f"{rp.page.doc_name}_page{rp.page.page_num}"
        page_entities[key] = entities

    return page_entities
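
The per-page dictionary returned above often contains the same name several times across pages. A possible follow-up step is to collapse it to the best-scoring mention per label. The sketch below assumes each entity is a dict with `text`, `label`, and `score` keys (the same keys the visualization code later in this notebook reads); `best_entities` and the sample names are hypothetical placeholders.

```python
def best_entities(page_entities: dict) -> dict:
    """Collapse per-page entity lists into {label: [(text, score, page_key), ...]},
    keeping only the highest-scoring mention of each (label, text) pair."""
    best = {}
    for page_key, entities in page_entities.items():
        for ent in entities:
            key = (ent["label"], ent["text"].strip().lower())
            score = ent.get("score", 0.0)
            if key not in best or score > best[key][1]:
                best[key] = (ent["text"].strip(), score, page_key)

    grouped = {}
    for (label, _), hit in best.items():
        grouped.setdefault(label, []).append(hit)
    for hits in grouped.values():
        hits.sort(key=lambda h: h[1], reverse=True)  # best score first
    return grouped


# Placeholder data, not extracted from the trust document.
sample = {
    "trust.pdf_page1": [{"text": "Jane Doe", "label": "Grantor", "score": 0.91}],
    "trust.pdf_page9": [
        {"text": "jane doe", "label": "Grantor", "score": 0.55},
        {"text": "John Roe", "label": "Successor Trustee", "score": 0.72},
    ],
}
merged = best_entities(sample)
print(merged)
```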


### 🧠 Step 2 – Run RADE: Parse + Index + Query

In [23]:
# Run each query through retrieval, then the QA or NER pipeline depending on its type
query_results = []

for query, meta in query_plan.items():
    retrieved = rade.retrieve(query, k=5)

    if meta["type"] == "qa":
        qa_result = rade.run_qa_pipeline(
            question=query,
            retrieved_texts=[
                {
                    "content": rp.page.text,
                    "document_metadata": {
                        "document": rp.page.doc_name,
                        "page": rp.page.page_num
                    }
                } for rp in retrieved
            ]
        )
        query_results.append({
            "query": query,
            "type": "qa",
            "labels": [],
            "results": retrieved,
            "qa": qa_result,
            "entities": {}
        })

    elif meta["type"] == "ner":
        retrieved_texts = [
            {
                "content": rp.page.text,
                "document_metadata": {
                    "document": rp.page.doc_name,
                    "page": rp.page.page_num
                }
            } for rp in retrieved
        ]

        entity_result = rade.extract_entities_with_gliner(
            retrieved_texts=retrieved_texts,
            labels=meta["labels"],
            threshold=0.3
        )

        query_results.append({
            "query": query,
            "type": "ner",
            "labels": meta["labels"],
            "results": retrieved,
            "qa": None,
            # Each retrieved page is keyed as "<document>_page<page>" and mapped
            # to the full entity list returned for this query (no per-page split).
            "entities": {
                f"{r['document']}_page{r['page']}": list(entity_result["entities"])
                for r in entity_result["retrieved"]
            }
        })


🔍 Retrieving top-5 chunks for query: Who are the Grantors?


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()


🔍 Retrieving top-5 chunks for query: Who are the Trustees?
🔍 Retrieving top-5 chunks for query: Who are the Successor Trustees?
🔍 Retrieving top-5 chunks for query: Who are the Beneficiaries?
🔍 Retrieving top-5 chunks for query: Who are the Successor Beneficiaries?
🔍 Retrieving top-5 chunks for query: What is the name of the trust?
🔍 Retrieving top-5 chunks for query: What is the date of the trust?
🔍 Retrieving top-5 chunks for query: Is this trust revocable or irrevocable?
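
Before opening the interactive viewer, a flat text summary of `query_results` can be a quick way to check what each query produced. The sketch below assumes the dictionary layout built in the loop above and that the QA result is a dict with an `answer` key (as the visualization cell reads it); `summarize` and the sample values are hypothetical.

```python
def summarize(query_results: list) -> list:
    """Return one (query, type, value) row per query result."""
    rows = []
    for item in query_results:
        if item["type"] == "qa":
            value = (item.get("qa") or {}).get("answer", "[No answer]")
        else:
            # Unique entity texts across all pages, in a stable order.
            texts = {e["text"] for ents in item["entities"].values() for e in ents}
            value = ", ".join(sorted(texts)) or "[No entities]"
        rows.append((item["query"], item["type"], value))
    return rows


# Placeholder results, not actual pipeline output.
sample_results = [
    {"query": "What is the name of the trust?", "type": "qa",
     "qa": {"answer": "Example Family Trust"}, "entities": {}},
    {"query": "Who are the Grantors?", "type": "ner", "qa": None,
     "entities": {
         "doc_page1": [{"text": "Jane Doe", "label": "Grantor"}],
         "doc_page2": [{"text": "Jane Doe", "label": "Grantor"}],
     }},
]
rows = summarize(sample_results)
for row in rows:
    print(row)
```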


### Step 3 – Visualize

In [18]:
import html
import re
from typing import Optional

import ipywidgets as widgets
from IPython.display import display, Markdown, HTML, clear_output

def highlight_entities(text: str, entities: list, highlight_answer: Optional[str] = None) -> str:
    """
    Highlight the QA answer and NER entities using bright yellow <mark>.

    All spans are matched in a single pass over the raw text, so markup
    inserted for one span is never re-matched by another.
    """
    # Map each span to its tooltip; the QA answer takes precedence.
    tooltips = {}
    if highlight_answer and highlight_answer.strip():
        tooltips[highlight_answer.strip()] = "QA Answer"
    for ent in entities:
        score = ent.get("score")
        tooltip = ent["label"] + (f" ({score:.2f})" if score is not None else "")
        if ent["text"]:  # skip empty spans, which would match everywhere
            tooltips.setdefault(ent["text"], tooltip)

    if not tooltips:
        return html.escape(text)

    # Longest spans first so they win over shorter, overlapping ones.
    pattern = re.compile("|".join(
        re.escape(span) for span in sorted(tooltips, key=len, reverse=True)
    ))

    parts, last = [], 0
    for m in pattern.finditer(text):
        parts.append(html.escape(text[last:m.start()]))
        parts.append(
            f'<mark style="background-color:yellow" title="{html.escape(tooltips[m.group(0)])}">'
            f'{html.escape(m.group(0))}</mark>'
        )
        last = m.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


def show_query_results(query_result: dict):
    query = query_result["query"]
    query_type = query_result["type"]
    pages = query_result["results"]
    entity_map = query_result["entities"]
    qa = query_result["qa"]

    index = widgets.IntText(value=0, layout=widgets.Layout(width="40px"), disabled=True)
    output = widgets.Output()

    prev_button = widgets.Button(description="◀️ Prev", layout=widgets.Layout(width="80px"))
    next_button = widgets.Button(description="Next ▶️", layout=widgets.Layout(width="80px"))
    nav_box = widgets.HBox([prev_button, index, next_button])

    def render_page(i):
        with output:
            clear_output(wait=True)
            page = pages[i].page
            score = pages[i].score
            key = f"{page.doc_name}_page{page.page_num}"
            ents = entity_map.get(key, []) if query_type == "ner" else []

            display(Markdown(f"## 🔎 Query: `{query}`"))

            if query_type == "qa":
                answer = qa.get("answer", "[No answer]")
                display(Markdown(f"### 🤖 QA Answer: `{answer}`"))
            else:
                answer = None  # Not used

            display(Markdown(f"**📄 Page {page.page_num} — Score: `{score:.2f}` — File: `{page.doc_name}`**"))

            highlighted_text = highlight_entities(
                text=page.text[:3000],
                entities=ents,
                highlight_answer=answer
            )

            display(HTML(f"<pre style='line-height:1.5;font-family:monospace'>{highlighted_text}</pre>"))

    def on_prev_clicked(_):
        if index.value > 0:
            index.value -= 1
            render_page(index.value)

    def on_next_clicked(_):
        if index.value < len(pages) - 1:
            index.value += 1
            render_page(index.value)

    prev_button.on_click(on_prev_clicked)
    next_button.on_click(on_next_clicked)

    display(nav_box, output)
    render_page(index.value)
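
The Prev/Next buttons rely on Python callbacks, so they only work while a kernel is running. For a static export of the notebook, one option is to render every retrieved page into a single HTML fragment instead. The function below is a minimal sketch of that fallback; `render_static` and the `(page_num, score, text)` tuples are hypothetical and independent of the widget code above.

```python
import html


def render_static(query: str, pages: list) -> str:
    """Build one HTML fragment listing every retrieved page for a query.

    `pages` is a list of (page_num, score, text) tuples.
    """
    parts = [f"<h3>Query: {html.escape(query)}</h3>"]
    for page_num, score, text in pages:
        parts.append(f"<h4>Page {page_num} (score {score:.2f})</h4>")
        # Escape the page text so stray angle brackets cannot break the markup.
        parts.append(f"<pre>{html.escape(text[:3000])}</pre>")
    return "\n".join(parts)


fragment = render_static(
    "Who are the Trustees?",
    [(1, 22.97, "Trustee <Primary>"), (9, 22.34, "co-Trustee")],
)
print(fragment)
```

In a notebook, the returned string could be passed to `IPython.display.HTML` to show all pages at once.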


In [19]:
for result in query_results:
    show_query_results(result)


