# Document & Document Loader

- Author: [Jaemin Hong](https://github.com/geminii01)
- Peer Review: [Taylor(Jihyun Kim)](https://github.com/Taylor0819), [Yejin Park](https://github.com/ppakyeah)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/01-DocumentLoader.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/01-DocumentLoader.ipynb)

## Overview

This tutorial covers the fundamental methods for loading Documents.

By completing this tutorial, you will learn how to load Documents and check their content and associated metadata.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Document](#document)
- [Document Loader](#document-loader)

### References

- [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html)
- [Load Documents](https://python.langchain.com/api_reference/core/document_loaders/langchain_core.document_loaders.base.BaseLoader.html#langchain_core.document_loaders.base.BaseLoader)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for the tutorials.
- See the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain_core",
        "langchain_community",
        "langchain_text_splitters",
        "pypdf",
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "01-DocumentLoader",
    }
)

Environment variables have been set successfully.


Alternatively, you can set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.

**[Note]** This step is not necessary if you have already set the required API keys in the previous steps.

In [4]:
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)

True

## Document

`Document` is a class for storing a piece of text and its associated metadata.

- `page_content` (Required): Stores a piece of text as a string.
- `metadata` (Optional): Stores metadata related to `page_content` as a dictionary. Defaults to an empty dictionary.

In [5]:
from langchain_core.documents import Document

document = Document(page_content="Hello, welcome to LangChain Open Tutorial!")

# Check the attributes using __dict__
document.__dict__

{'id': None,
 'metadata': {},
 'page_content': 'Hello, welcome to LangChain Open Tutorial!',
 'type': 'Document'}

The metadata is empty. Let's add some values.

In [6]:
# Add metadata
document.metadata["source"] = "./example-file.pdf"
document.metadata["page"] = 0

# Check metadata
document.metadata

{'source': './example-file.pdf', 'page': 0}

## Document Loader

A Document Loader is a class that loads Documents from various sources.

Some examples of Document Loaders are listed below.

- `PyPDFLoader`: Loads PDF files
- `CSVLoader`: Loads CSV files
- `UnstructuredHTMLLoader`: Loads HTML files
- `JSONLoader`: Loads JSON files
- `TextLoader`: Loads text files
- `DirectoryLoader`: Loads documents from a directory

Now, let's learn how to load Documents.

In [7]:
# Example file path
FILE_PATH = "./data/01-document-loader-sample.pdf"

In [8]:
from langchain_community.document_loaders import PyPDFLoader

# Set up the loader
loader = PyPDFLoader(FILE_PATH)

### `load()`

- Loads Documents and returns them as a `list[Document]`.

In [9]:
# Load Documents
docs = loader.load()

In [10]:
# Check the number of loaded Documents
len(docs)

48

In [11]:
# Check Documents
docs[0:10]

[Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 0}, page_content=' \n \n \nOctober  2016 \n \n \n \n \n \n \n \n \n \nTHE NATIONAL  \nARTIFICIAL INTELLIGENCE \nRESEARCH AND DEVELOPMENT \nSTRATEGIC PLAN  \nNational Science and Technology Council  \n \nNetworking and Information Technology \nResearch and Development Subcommittee  \n '),
 Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 1}, page_content=' ii  \n '),
 Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 2}, page_content=' \n  \n iii About the National Science and Technology Council  \nThe National Science and Technology Council (NSTC) is the principal means by which the Executive \nBranch coordinates science and technology policy across the diverse entities that make up the Federal \nresearch and development (R&D) enterprise . One of the NSTC’s primary objectives is establishing clear \nnational goal s for Federal science and technology inves

### `aload()`

- Asynchronously loads Documents and returns them as a `list[Document]`.

In [12]:
# Load Documents asynchronously
docs = await loader.aload()

### `load_and_split()`

- Loads Documents, splits them into chunks with a TextSplitter, and returns the chunks as a `list[Document]`. If no `text_splitter` is passed, a `RecursiveCharacterTextSplitter` with default settings is used.

In [13]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Set up the TextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=128, chunk_overlap=0)

# Split Documents into chunks
docs = loader.load_and_split(text_splitter=text_splitter)

In [14]:
# Check the number of chunks
len(docs)

1441

In [15]:
# Check Documents
docs[0:10]

[Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 0}, page_content='October  2016 \n \n \n \n \n \n \n \n \n \nTHE NATIONAL  \nARTIFICIAL INTELLIGENCE \nRESEARCH AND DEVELOPMENT \nSTRATEGIC PLAN'),
 Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 0}, page_content='National Science and Technology Council  \n \nNetworking and Information Technology \nResearch and Development Subcommittee'),
 Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 1}, page_content='ii'),
 Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 2}, page_content='iii About the National Science and Technology Council'),
 Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 2}, page_content='The National Science and Technology Council (NSTC) is the principal means by which the Executive'),
 Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 2}, page_content='Bran

### `lazy_load()`

- Loads Documents lazily, one at a time, and returns them as an `Iterator[Document]`.

In [16]:
loader.lazy_load()

<generator object PyPDFLoader.lazy_load at 0x000001902A0117B0>

The output shows that `lazy_load()` returns a `generator`: a special type of iterator that produces values on the fly, without storing them all in memory at once.
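
The on-demand behavior of a generator can be seen with plain Python, without any loader. In the sketch below, `read_pages` and `pages_read` are hypothetical names used only to record which pages were actually produced; this is an illustration of generators in general, not of how `PyPDFLoader` is implemented.

```python
from itertools import islice
from typing import Iterator, List

pages_read: List[int] = []  # records which pages were actually produced


def read_pages(total_pages: int) -> Iterator[str]:
    """Hypothetical stand-in for lazy_load(): yields one page at a time."""
    for page in range(total_pages):
        pages_read.append(page)
        yield f"content of page {page}"


# Creating the generator does not read anything yet
pages = read_pages(48)

# Taking two items produces exactly two pages; the other 46 are never read
first_two = list(islice(pages, 2))

print(first_two)   # ['content of page 0', 'content of page 1']
print(pages_read)  # [0, 1]
```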

In [17]:
# Load Documents sequentially
docs = loader.lazy_load()
for doc in docs:
    print(doc.metadata)
    break  # Used to limit the output length

{'source': './data/01-document-loader-sample.pdf', 'page': 0}


### `alazy_load()`

- Asynchronously loads Documents lazily, one at a time, and returns them as an `AsyncIterator[Document]`.

In [18]:
loader.alazy_load()

<async_generator object BaseLoader.alazy_load at 0x000001902A00B140>

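The output shows that `alazy_load()` returns an `async_generator`: an asynchronous iterator that produces values on the fly, without storing them all in memory at once. The name `BaseLoader.alazy_load` in the output indicates that `PyPDFLoader` uses the default implementation inherited from `BaseLoader`.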
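
An async generator has to be consumed with `async for`; a regular `for` loop raises a `TypeError`. The following plain-Python sketch shows the pattern in a script context, where `read_pages_async` is a hypothetical stand-in for `alazy_load()` and not the loader's actual implementation.

```python
import asyncio
from typing import AsyncIterator


async def read_pages_async(total_pages: int) -> AsyncIterator[str]:
    """Hypothetical stand-in for alazy_load(): yields one page at a time."""
    for page in range(total_pages):
        # Hand control back to the event loop, as real I/O would
        await asyncio.sleep(0)
        yield f"content of page {page}"


async def first_page() -> str:
    first = ""
    # Items are requested one at a time with `async for`
    async for content in read_pages_async(48):
        first = content
        break  # stop after the first page; the rest are never produced
    return first


result = asyncio.run(first_page())
print(result)  # content of page 0
```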

In [19]:
# Load Documents asynchronously and sequentially
docs = loader.alazy_load()
async for doc in docs:
    print(doc.metadata)
    break  # Used to limit the output length

{'source': './data/01-document-loader-sample.pdf', 'page': 0}
