---
title: "Recursive"
icon: "arrows-spin"
---



```mermaid
graph TD
    %% Level 0 - Original Docs
    A1[Doc1] --> B1[Sum1]
    A2[Doc2] --> B2[Sum2]
    A3[Doc3] --> B3[Sum3]
    A4[Doc4] --> B4[Sum4]
    A5[Doc5] --> B5[Sum5]
    A6[Doc6] --> B6[Sum6]
    A7[Doc7] --> B7[Sum7]
    A8[Doc8] --> B8[Sum8]

    %% Level 1 - First Combines
    B1 --> C1[CombSum1]
    B2 --> C1
    B3 --> C2[CombSum2]
    B4 --> C2
    B5 --> C3[CombSum3]
    B6 --> C3
    B7 --> C4[CombSum4]
    B8 --> C4

    %% Level 2 - Mega Combines
    C1 --> D1[MegaSum1]
    C2 --> D1
    C3 --> D2[MegaSum2]
    C4 --> D2

    %% Level 3 - Final Summary
    D1 --> E[FINAL_SUMMARY]
    D2 --> E
```


## Example dataset


This text is sourced from [Project Gutenberg](https://www.gutenberg.org/ebooks/2600) and is in the public domain. Redistribution is permitted, but the following attribution must be preserved:

> This eBook is for the use of anyone anywhere at no cost and with
> almost no restrictions whatsoever. You may copy it, give it away or
> re-use it under the terms of the Project Gutenberg License included
> with this eBook or online at [www.gutenberg.org](https://www.gutenberg.org).
>
> Public domain text provided by Project Gutenberg:
> [https://www.gutenberg.org/ebooks/2600](https://www.gutenberg.org/ebooks/2600)




In [5]:
from pathlib import Path
import requests

# URL of the plain text file from Project Gutenberg
url = "https://www.gutenberg.org/cache/epub/1184/pg1184.txt"
output_path = Path("war_and_peace_gutenberg.txt")

# Check if file already exists
if output_path.exists():
    print(f"File '{output_path}' already exists. Skipping download.")
else:
    response = requests.get(url)
    if response.status_code == 200:
        output_path.write_text(response.text + attribution, encoding="utf-8")
        print(f"Downloaded and saved to '{output_path}' with attribution.")
    else:
        print(f"Failed to download. Status code: {response.status_code}")


File 'war_and_peace_gutenberg.txt' already exists. Skipping download.


In [31]:
from langchain.documents import Document

In [32]:
text = output_path.read_text()

In [33]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=100_000)

In [34]:
texts = splitter.split_text(text)

In [35]:
len(texts)


27

In [12]:

from langchain.chains.summarization import create_summarizer
from langchain.chat_models import init_chat_model
from langchain_core.documents import Document
from pydantic import BaseModel, Field


class Person(BaseModel):
    """Person to extract."""

    name: str
    age: str | None = None
    hair_color: str | None = None
    source_doc_ids: list[str] = Field(
        default=[],
        description="The IDs of the documents where the information was found.",
    )


class PeopleRoot(BaseModel):
    people: list[Person]

In [13]:
# People = RootModel(list[Person])

model = init_chat_model("claude-opus-4-20250514")
summarizer = create_summarizer(
    model,
    initial_prompt="Produce a summary in bullet with up to 3 bullets.",
    response_format=PeopleRoot,
).compile(name="Refiner")


output = summarizer.invoke({"documents": documents})