# How to handle long text when doing extraction
When working with files, like PDFs, you're likely to encounter text that exceeds your language model's context window. 

To process this text, consider these strategies:

1. Change LLM: Choose a different LLM that supports a larger context window.
2. Brute Force: Chunk the document, and extract content from each chunk.
3. RAG: Chunk the document, index the chunks, and only extract content from a subset of chunks that look "relevant".


Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application that you're designing!

In [1]:
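import os

from dotenv import load_dotenv

# Load variables such as LANGCHAIN_API_KEY from a local .env file into os.environ.
# (Re-assigning the key from os.environ.get() is unnecessary and raises a
# TypeError when the variable is missing, so it is omitted here.)
load_dotenv()
# Enable tracing and group runs under the "Extraction" project
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "Extraction"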

# Set up
We need some example data! Let's download an article about cars from Wikipedia and load it as a LangChain Document.

In [2]:
import re

import requests
from langchain_community.document_loaders import BSHTMLLoader

# Download the content
response = requests.get("https://en.wikipedia.org/wiki/Car")
# Write it to a file
with open("car.html", "w", encoding="utf-8") as f:
    f.write(response.text)
# Load it with an HTML parser
loader = BSHTMLLoader("car.html")
document = loader.load()[0]
# Clean up the text:
# replace consecutive new lines with a single new line
document.page_content = re.sub("\n\n+", "\n", document.page_content)

In [3]:
print(len(document.page_content))

80562
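
The length printed above is in characters, while context windows are measured in tokens. As a rough sketch, the check below estimates whether a text is likely to fit. The ratio of about 4 characters per token is only a rule of thumb for English text, and `estimate_tokens`, `fits_in_context`, and `reserved_for_output` are hypothetical names introduced for illustration; for an accurate count, use the tokenizer that matches your model.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. chars_per_token is a heuristic, not a real tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(
    text: str, context_window: int, reserved_for_output: int = 1000
) -> bool:
    """Return True if the estimated prompt size plus an output budget fits the window."""
    return estimate_tokens(text) + reserved_for_output <= context_window
```

Under this heuristic, the roughly 80,000 characters of the article correspond to about 20,000 tokens, so whether the document fits depends entirely on the model you choose.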


# Define the schema
Following the extraction tutorial, we will use Pydantic to define the schema of information we wish to extract. 

In this case, we will extract a list of "key developments" (e.g., important historical events) that include a year and description.

Note that we also include an `evidence` key and instruct the model to provide verbatim the relevant sentences of text from the article. This allows us to compare the extraction results to (the model's reconstruction of) text from the original document.

In [8]:
from typing import List, Optional

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.pydantic_v1 import BaseModel, Field


class KeyDevelopment(BaseModel):
    """Information about a development in the history of cars."""

    year: int = Field(
        ..., description="The year when there was an important historic development."
    )
    description: str = Field(
        ..., description="What happened in this year? What was the development?"
    )
    evidence: str = Field(
        ...,
        description="Repeat verbatim the sentence(s) from which the year and description information were extracted",
    )


class ExtractionData(BaseModel):
    """Extracted information about key developments in the history of cars."""

    key_developments: List[KeyDevelopment]
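
Because the `evidence` field is the model's reconstruction of the text, it can be worth checking whether it really appears in the source. The snippet below is a minimal sketch of such a check; `normalize_whitespace` and `evidence_is_verbatim` are hypothetical helpers (not part of LangChain), and the comparison ignores differences in whitespace only.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (including new lines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def evidence_is_verbatim(evidence: str, source_text: str) -> bool:
    """Return True if the evidence appears in the source, ignoring whitespace differences."""
    normalized_evidence = normalize_whitespace(evidence)
    if not normalized_evidence:
        # An empty evidence string proves nothing, so treat it as a failure
        return False
    return normalized_evidence in normalize_whitespace(source_text)
```

With extraction results in hand, this could be applied as `evidence_is_verbatim(key_development.evidence, document.page_content)` to flag items whose evidence was paraphrased or invented.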


In [9]:
# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert at identifying key historic developments in text. "
            "Only extract important historic developments. Extract nothing if no important information can be found in the text.",
        ),
        ("human", "{text}"),
    ]
)

# Create an extractor
Let's select an LLM. Because we are using tool calling, we need a model that supports it. See this table for available LLMs.

In [10]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")



In [14]:
extractor = prompt | llm.with_structured_output(
    schema=ExtractionData,
    include_raw=False,
)

In [11]:
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(
    # Controls the size of each chunk
    chunk_size=2000,
    # Controls overlap between chunks
    chunk_overlap=20,
)
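
To make the brute-force strategy from the list at the top concrete, here is a minimal, library-free sketch: split the text into overlapping chunks, run an extraction function over each chunk, and flatten the results. It splits on characters rather than tokens, and `split_with_overlap`, `extract_from_chunks`, and the toy `find_years` extractor are hypothetical stand-ins for illustration, not the splitter or extractor defined in this notebook.

```python
import re
from typing import Callable, List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def extract_from_chunks(
    chunks: List[str], extract_fn: Callable[[str], List[str]]
) -> List[str]:
    """Run extract_fn on every chunk and flatten the per-chunk results."""
    results: List[str] = []
    for chunk in chunks:
        results.extend(extract_fn(chunk))
    return results


def find_years(chunk: str) -> List[str]:
    """Toy extractor: return every standalone four-digit number in the chunk."""
    return re.findall(r"\b\d{4}\b", chunk)
```

With the real `extractor`, the per-chunk step would presumably be a call such as `extractor.batch([{"text": chunk} for chunk in chunks])`. Note that overlapping chunks can yield duplicate results, and information spread across chunk boundaries may be missed, so some post-processing is usually needed.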


In [12]:
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

texts = text_splitter.split_text(document.page_content)
vectorstore = FAISS.from_texts(texts, embedding=OpenAIEmbeddings())

# Only retrieve the single most relevant chunk (k=1)
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})
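
Conceptually, the retriever ranks chunks by how close their embeddings are to the query embedding and returns the top `k`. The sketch below illustrates that idea with hand-written vectors and cosine similarity. Both are assumptions made for illustration: real embeddings have many more dimensions, and the metric actually used by the vector store may differ. `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float], chunk_vecs: List[Sequence[float]], k: int = 1
) -> List[int]:
    """Return the indices of the k chunks most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

With `k=1`, everything outside the single best-matching chunk is ignored, which keeps the extraction cheap but means relevant information in other chunks is never seen by the model.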

In [15]:
rag_extractor = {
    "text": retriever | (lambda docs: docs[0].page_content)  # fetch content of top doc
} | extractor

In [16]:
results = rag_extractor.invoke("Key developments associated with cars")

In [17]:
for key_development in results.key_developments:
    print(key_development)

year=2020 description='In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year.' evidence='In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year.'
year=2020 description='The automotive industry in China produces by far the most cars, with 20 million cars manufactured in 2020.' evidence='The automotive industry in China produces by far the most (20 million in 2020), followed by Japan (seven million), then Germany, South Korea and India.'
year=2020 description='Around the world, there are about a billion cars on the road, consuming about 50 exajoules of energy yearly.' evidence='Around the world, there are about a billion cars on the road; they burn over a trillion litres of petrol and diesel fuel yearly, consuming about 50 exajoules (14,000 TWh) of energy.'
year=2019 description='This section needs expansion. You can help by adding to it. (March 2019)' evidence='This section needs expansion. You can 