# Document Retrieval

Many LLM applications require user-specific data that is not part of the model's training set. The primary way of supplying that data is Retrieval Augmented Generation (RAG). In this process, external data is retrieved and then passed to the LLM during the generation step.
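To make the retrieve-then-generate flow concrete, here is a minimal sketch in plain Python. It is not how LangChain implements retrieval: the word-overlap scoring, the function names (`score`, `retrieve`, `build_prompt`), and the sample passages are all assumptions chosen for illustration. A real application would typically use embeddings and a vector store.

```python
from collections import Counter


def score(query: str, passage: str) -> int:
    """Sum the occurrences in the passage of each distinct query word."""
    passage_words = Counter(passage.lower().split())
    return sum(passage_words[word] for word in set(query.lower().split()))


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages with the highest word-overlap score."""
    return sorted(passages, key=lambda p: score(query, p), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Place the retrieved passages ahead of the question."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


# Hypothetical sample passages standing in for an external data source.
passages = [
    "Neurons are cells designed to transmit information.",
    "The Social Cancer is an English version of Noli Me Tangere.",
    "Text splitters break long documents into smaller chunks.",
]
question = "How do neurons transmit information?"
prompt = build_prompt(question, retrieve(question, passages, k=1))
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM in the generation step.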

## Document Loaders

Use document loaders to load data from a source as `Document` objects. A `Document` is a piece of text and associated metadata. For example, there are document loaders for loading a simple `.txt` file, for loading the text contents of any web page, or even for loading a transcript of a YouTube video.

Document loaders provide a `load` method for loading data as documents from a configured source. They can optionally implement `lazy_load` as well, which loads data into memory lazily.
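The difference between eager and lazy loading can be sketched without LangChain. The `SimpleDocument` and `LineLoader` classes below are hypothetical stand-ins, not LangChain classes; they only mirror the idea of an eager `load` built on top of a generator-based `lazy_load`.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that yields one document per non-empty line."""

    def __init__(self, path: str) -> None:
        self.path = path

    def lazy_load(self) -> Iterator[SimpleDocument]:
        # Read the file line by line so the whole file is never held in memory.
        with open(self.path, encoding="utf-8") as handle:
            for number, line in enumerate(handle, start=1):
                if line.strip():
                    yield SimpleDocument(
                        line.strip(), {"source": self.path, "line": number}
                    )

    def load(self) -> list[SimpleDocument]:
        # Eager variant: materialize every document at once.
        return list(self.lazy_load())


# Write a small sample file so the example is self-contained.
sample = Path(tempfile.mkdtemp()) / "sample.txt"
sample.write_text("first line\n\nsecond line\n", encoding="utf-8")

loader = LineLoader(str(sample))
docs = loader.load()              # all documents in a list
first = next(loader.lazy_load())  # only the first document is read
print(len(docs), first.page_content)
```

Lazy loading matters most when the source is large, because a consumer can process and discard each document before the next one is read.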

The simplest loader reads in a file as text and places it all into one document.

In [1]:
from langchain.document_loaders import TextLoader

loader = TextLoader("./data/the-social-cancer.txt")
pages = loader.load()  # Returns a list of `Document` objects
print(f"{pages[0].page_content[:300]}...")

The Social Cancer: A Complete English Version of Noli Me Tangere

This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Guten...


### Loading PDF Files Using PyPDF

This section covers how to load PDF documents into the `Document` format that we use downstream. `PyPDFLoader` uses `pypdf` to load a PDF into a list of documents, where each document contains the page content and metadata that includes the page number.

In [2]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("./data/neuron.pdf")
pages = loader.load_and_split()

In [3]:
pages[0]

Document(page_content='Understanding Neurons: The Building Blocks of\nthe Nervous System\nIntroduction\nNeurons are the core components of the nervous system, responsible for carrying\nmessages throughout the body. These specialized cells are the main players in the\nbrain and spinal cord of the central nervous system, as well as the nerves that run\nthroughout our body in the peripheral nervous system.\nNeurons are the building blocks of the nervous system. They are responsible for\ncarrying messages throughout the body. These specialized cells are the main play-\ners in the brain and spinal cord of the central nervous system, as well as the nerves\nthat run throughout our body in the peripheral nervous system.\nWhat are Neurons?\nNeurons are cells designed to transmit information. They are unique in their shape\nand function, optimized for sending and receiving signals. Neurons are different\nfrom other cells in the body because of their ability to communicate through electri-\ncal a

You can now use the content of the PDF as context for a language model.
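As a rough sketch of what using page content as context can look like, the helper below joins page texts into one string while staying under a character budget. The `build_context` function, the dictionary layout, and the sample pages are assumptions made for illustration; they are not part of LangChain or of the loader's output format.

```python
def build_context(pages: list[dict], max_chars: int = 200) -> str:
    """Concatenate page texts, labelled by page number, up to max_chars."""
    parts: list[str] = []
    used = 0
    for page in pages:
        block = f"[page {page['page']}] {page['text']}"
        if used + len(block) > max_chars:
            break  # Stop before exceeding the character budget.
        parts.append(block)
        used += len(block) + 1  # +1 accounts for the newline separator
    return "\n".join(parts)


# Hypothetical pages: a short one that fits and a long one that does not.
sample_pages = [
    {"page": 0, "text": "Neurons are cells designed to transmit information."},
    {"page": 1, "text": "Filler text. " * 40},
]
context = build_context(sample_pages, max_chars=200)
print(context)
```

A fixed character budget is a crude proxy for a model's context window, which is usually measured in tokens.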

## Document Transformers

Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is splitting a long document into smaller chunks that fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

### Text Splitters

When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. This notebook showcases several ways to do that.

At a high level, text splitters work as follows:

- Split the text up into small, semantically meaningful chunks (often sentences).
- Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
- Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).
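The three steps above can be sketched in plain Python. This is a simplified illustration, not the implementation of any LangChain splitter: the `split_text` function, the sentence regex, and the default sizes are assumptions, and sizes are measured in characters with `len`.

```python
import re


def split_text(text: str, chunk_size: int = 60, chunk_overlap: int = 30) -> list[str]:
    """Greedy sentence-based splitter; sizes are measured in characters."""
    # Step 1: split into small units (here, sentences ending in ., ! or ?).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        # Step 2: keep adding sentences while the chunk stays within chunk_size.
        if current and len(candidate) > chunk_size:
            # Step 3: emit the chunk, then seed the next one with trailing
            # sentences that fit within chunk_overlap.
            chunks.append(" ".join(current))
            tail: list[str] = []
            for previous in reversed(current):
                if len(" ".join([previous] + tail)) > chunk_overlap:
                    break
                tail.insert(0, previous)
            # Drop overlap that would push the new chunk past chunk_size.
            while tail and len(" ".join(tail + [sentence])) > chunk_size:
                tail.pop(0)
            current = tail
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "Neurons transmit information. They are unique in shape. "
    "They send signals. They receive signals too."
)
chunks = split_text(sample, chunk_size=60, chunk_overlap=30)
for chunk in chunks:
    print(repr(chunk))
```

In this sketch a single sentence longer than `chunk_size` is kept whole rather than cut, so only such sentences can exceed the limit.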

In [4]:
# This is a long document we can split up.
loader = TextLoader("./data/the-social-cancer.txt")
pages = loader.load()
word_count = len(pages[0].page_content.split())
print(f"Word count: {word_count}")

Word count: 108787


In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

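text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,        # Maximum chunk size, as measured by length_function
    chunk_overlap=20,      # Maximum overlap between consecutive chunks
    length_function=len,   # Measure size in characters
    add_start_index=True,  # Record each chunk's start offset in its metadata
)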

texts = text_splitter.create_documents([pages[0].page_content])
print(f"Number of chunks: {len(texts)}")

Number of chunks: 2755


In [6]:
print(f"First chunk: '{texts[0].page_content}'\n")
print(f"Second chunk: '{texts[1].page_content}'\n")
print(f"Third chunk: '{texts[2].page_content}'\n")

First chunk: 'The Social Cancer: A Complete English Version of Noli Me Tangere'

Second chunk: 'This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online'

Third chunk: 'at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.'

