# Lesson 1: Loading and Splitting Documents with LangChain

# Introduction to Document Processing with LangChain

Welcome to the first lesson of **Document Processing and Retrieval with LangChain in Python**! In this course, you'll learn how to work with documents programmatically, extract valuable information from them, and build systems that intelligently interact with document content.

Document processing is a fundamental task in many applications, from search engines to question-answering systems. A typical document processing pipeline consists of:

- Loading documents from various sources
- Splitting them into manageable chunks
- Converting those chunks into numerical representations (embeddings)
- Retrieving relevant information when needed

In this lesson, we'll focus on loading documents and splitting them into appropriate chunks. These steps form the foundation for all subsequent document processing tasks.
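
To make the four pipeline stages concrete, here is a toy, standard-library-only sketch. Everything in it is a stand-in: the function names `load`, `split`, `embed`, and `retrieve` are hypothetical, the sample text is made up, and word counts take the place of real embeddings. It only shows how the stages hand data to one another, not how LangChain implements them.

```python
from collections import Counter


def load(raw_text: str) -> str:
    """Stage 1 stand-in: a real loader would read a file instead."""
    return raw_text.strip()


def split(text: str) -> list[str]:
    """Stage 2 stand-in: one chunk per sentence."""
    return [s.strip() for s in text.split(".") if s.strip()]


def embed(chunk: str) -> Counter:
    """Stage 3 stand-in: word counts instead of a learned embedding."""
    return Counter(chunk.lower().split())


def retrieve(query: str, chunks: list[str]) -> str:
    """Stage 4 stand-in: return the chunk sharing the most words with the query."""
    query_vec = embed(query)
    return max(chunks, key=lambda c: sum((embed(c) & query_vec).values()))


# Made-up sample text, used only for illustration
raw = (
    "Holmes examined the hat. "
    "The goose was bought at the market. "
    "The blue stone was found in the goose."
)
chunks = split(load(raw))
best = retrieve("where was the goose bought", chunks)
print(best)
```

The rest of this lesson replaces the first two stand-ins with LangChain's loaders and text splitters.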

## Learning Objectives
By the end of this lesson, you'll be able to:
- Load documents from different file formats using LangChain
- Split documents into manageable chunks for further processing
- Understand document preparation for embedding and retrieval

Let's begin by exploring document loaders in LangChain.

## Setting Up PyPDF

Ensure the `pypdf` package is installed. `pypdf` allows LangChain's `PyPDFLoader` to read text from PDFs effectively.

```bash
pip install pypdf
```

> Note: This package is pre-installed on CodeSignal.

## LangChain Document Loaders

LangChain simplifies document processing with specialized loaders for different file formats.

### PDF Files
```python
from langchain_community.document_loaders import PyPDFLoader

pdf_loader = PyPDFLoader("document.pdf")
```

### Text Files
```python
from langchain_community.document_loaders import TextLoader

text_loader = TextLoader("document.txt")
```

### General Files
```python
from langchain_community.document_loaders import UnstructuredFileLoader

general_loader = UnstructuredFileLoader("document.docx")
```

LangChain also provides loaders like `CSVLoader`, `JSONLoader`, and `WebBaseLoader`.

## Loading a Document
Example using a Sherlock Holmes PDF:

```python
from langchain_community.document_loaders import PyPDFLoader

file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"
pdf_loader = PyPDFLoader(file_path)
docs = pdf_loader.load()
```

## Inspecting Loaded Documents

```python
print(f"Loaded {len(docs)} document chunks")
print(f"\nFirst 200 characters:\n{docs[0].page_content[:200]}")
print(f"\nMetadata:\n{docs[0].metadata}")
```

## Document Splitting Techniques
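Loaded documents are often too long to embed or pass to a model in one piece, so they are split into smaller chunks. LangChain's `RecursiveCharacterTextSplitter` tries a list of separators in order (by default paragraph breaks, then line breaks, then spaces, then individual characters), moving to a finer separator only when a piece is still larger than `chunk_size`: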

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)
```

- **chunk_size:** Maximum characters per chunk
- **chunk_overlap:** Characters shared between chunks to maintain context
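
To see how the two parameters interact, here is a simplified sketch that cuts a string into fixed-size windows. It is not how `RecursiveCharacterTextSplitter` works internally (the real splitter cuts at separators, so chunks are often shorter than `chunk_size` and the overlap can be smaller than `chunk_overlap`); the helper `sliding_chunks` is a hypothetical name used only for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each printed chunk begins with the last three characters of the one before it, which is the "shared context" that `chunk_overlap` provides.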

## Splitting Documents into Chunks

```python
split_docs = text_splitter.split_documents(docs)

print(f"After splitting: {len(split_docs)} chunks")
print(f"\nFirst chunk content:\n{split_docs[0].page_content}")
```

This method preserves metadata and splits content based on specified parameters.
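
The sketch below illustrates what "preserves metadata" means in practice: each input document is split on its own, and every resulting chunk carries a copy of its parent's metadata. The `Doc` class and `split_docs_keep_metadata` function are hypothetical stand-ins, not LangChain's `Document` or `split_documents`, and the splitting itself is a plain fixed-size cut.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs_keep_metadata(docs: list[Doc], chunk_size: int) -> list[Doc]:
    """Split each document independently and copy its metadata onto every chunk."""
    pieces = []
    for doc in docs:
        for start in range(0, len(doc.page_content), chunk_size):
            content = doc.page_content[start:start + chunk_size]
            # Copy the dict so that editing one chunk's metadata does not affect others
            pieces.append(Doc(content, dict(doc.metadata)))
    return pieces


pages = [
    Doc("a" * 25, {"source": "story.pdf", "page": 0}),
    Doc("b" * 10, {"source": "story.pdf", "page": 1}),
]
pieces = split_docs_keep_metadata(pages, chunk_size=10)
for piece in pieces:
    print(len(piece.page_content), piece.metadata)
```

Because each page is split independently, no chunk mixes text from two pages, and the `page` value on a chunk always points back to where its text came from.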

## Optimizing Chunk Size and Overlap
Effective chunking balances size and overlap:
- Chunks that are too small fragment ideas and lose context; chunks that are too large can exceed model input limits.
- Moderate overlap (50–100 characters) maintains context without redundancy.

Adjust chunking parameters based on:
- Document type
- Task requirements
- Model token limits
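
Because `chunk_size` is measured in characters while model limits are stated in tokens, a rough conversion can help pick a starting value. The sketch below is only a heuristic: the default of about 4 characters per token is an assumption that roughly fits English text with common tokenizers, the 0.8 headroom factor is arbitrary, and `max_chunk_chars` is a hypothetical helper, not a LangChain function.

```python
def max_chunk_chars(token_limit: int,
                    chars_per_token: float = 4.0,
                    headroom: float = 0.8) -> int:
    """Rough upper bound on chunk_size (in characters) for a given token limit.

    chars_per_token and headroom are assumptions; measure with the actual
    tokenizer of your model when precision matters.
    """
    return int(token_limit * chars_per_token * headroom)


budget = max_chunk_chars(512)
print(budget)
```

Treat the result as a ceiling to experiment below, not as a recommended value.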

## Review and Next Steps
In this lesson, you:
- Explored LangChain document loaders (`PyPDFLoader`, `TextLoader`, `UnstructuredFileLoader`).
- Learned how to load documents and inspect their content and metadata.
- Discussed the importance of document splitting.
- Used `RecursiveCharacterTextSplitter` to split documents into chunks.

The next lesson covers converting chunks into vector embeddings for semantic retrieval.


## Loading and Inspecting PDF Documents

Now that you've learned about different document loaders in LangChain, let's put that knowledge into practice! In this exercise, you'll work with the PyPDFLoader to load a Sherlock Holmes story and examine its contents.

You'll complete a script that loads a PDF document and helps you understand what information is available after loading. This hands-on experience will show you exactly what happens when a document is loaded into LangChain.

Your tasks are to:

- Create a `PyPDFLoader` instance for the provided PDF file.
- Load the document using the loader.
- Print the number of document chunks (pages) that were loaded.
- Print a sample of the content (the first 200 characters).
- Print the metadata to see what additional information is available.

This exercise will give you a solid foundation in document loading, a critical first step before we move on to document splitting in the next section.

```python
from langchain_community.document_loaders import PyPDFLoader

# Define the file path to our Sherlock Holmes story
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# TODO: Create a PDF loader for our document

# TODO: Load the document

# TODO: Print the number of document chunks loaded

# TODO: Print the first 200 characters of the first chunk

# TODO: Print the metadata of the first chunk


```

Here's your completed script for loading and inspecting PDF documents using `PyPDFLoader`:

```python
from langchain_community.document_loaders import PyPDFLoader

# Define the file path to our Sherlock Holmes story
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a PDF loader for our document
pdf_loader = PyPDFLoader(file_path)

# Load the document
docs = pdf_loader.load()

# Print the number of document chunks loaded
print(f"Loaded {len(docs)} document chunks")

# Print the first 200 characters of the first chunk
print(f"\nFirst 200 characters of the first chunk:\n{docs[0].page_content[:200]}")

# Print the metadata of the first chunk
print(f"\nMetadata of the first chunk:\n{docs[0].metadata}")
```

**Explanation of each step:**

* **Loader Creation:** Instantiates a `PyPDFLoader` object to handle PDF files.
* **Loading Documents:** Reads the content of the PDF into `docs`, a list of document chunks (usually pages).
* **Inspecting Content:** Provides an overview by displaying the number of chunks, a snippet of text content, and available metadata.

This approach allows you to understand precisely what information is extracted when loading documents, ensuring effective downstream processing.



## Switching to Text File Loading


Let's switch gears and load a new file format!

Your task is to replace the PyPDFLoader, which was previously used to load a Sherlock Holmes PDF file, with a TextLoader to load "Alice in Wonderland" as a text file. This will allow you to handle the text file format appropriately.

As you make this change, pay attention to any differences you notice between the metadata and structure of text files versus PDFs. While PDFs split by pages, text files typically load as a single document chunk with simpler metadata.

This exercise will help you understand how LangChain handles different file formats with their specialized loaders.

```python
from langchain_community.document_loaders import PyPDFLoader, TextLoader

# Define the file path to our text file
file_path = "data/alice_in_wonderland.txt"

# TODO: Create a Text loader for our document
loader = PyPDFLoader(file_path)

# Load the document
docs = loader.load()

# Print the number of document chunks loaded
print(f"Loaded {len(docs)} document chunks")

# Print the content of the first chunk
print(f"\nFirst 200 characters of the first chunk:\n{docs[0].page_content[:200]}")

# Print the metadata of the first chunk
print(f"\nMetadata of the first chunk:\n{docs[0].metadata}")

```

Here's the completed script, with `TextLoader` in place of `PyPDFLoader`:


```python
from langchain_community.document_loaders import TextLoader

# Define the file path to our text file
file_path = "data/alice_in_wonderland.txt"

# Create a Text loader for our document
loader = TextLoader(file_path)

# Load the document
docs = loader.load()

# Print the number of document chunks loaded
print(f"Loaded {len(docs)} document chunks")

# Print the first 200 characters of the first chunk
print(f"\nFirst 200 characters of the first chunk:\n{docs[0].page_content[:200]}")

# Print the metadata of the first chunk
print(f"\nMetadata of the first chunk:\n{docs[0].metadata}")
```

### Differences to notice:

* Text files typically load as **one single chunk**, whereas PDFs split content into multiple chunks, usually by pages.
* Metadata for text files is simpler: `TextLoader` typically records only the `source` path, whereas `PyPDFLoader` also records the page number and, depending on the installed version and the file, PDF properties such as author or creation date.

This change demonstrates clearly how LangChain handles file format specifics using specialized loaders.
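
The single-chunk behaviour is easy to picture with a minimal stand-in for a text loader. This is a sketch under stated assumptions, not `TextLoader` itself: `load_text_as_single_doc` is a hypothetical helper, and documents are represented as plain dictionaries rather than LangChain `Document` objects.

```python
import tempfile
from pathlib import Path


def load_text_as_single_doc(path: Path) -> list[dict]:
    """Read the whole file into one document whose only metadata is its source path."""
    text = path.read_text(encoding="utf-8")
    return [{"page_content": text, "metadata": {"source": str(path)}}]


with tempfile.TemporaryDirectory() as tmp:
    # Write a small sample file so the example is self-contained
    sample = Path(tmp) / "sample.txt"
    sample.write_text(
        "Down the Rabbit-Hole\n\nAlice was beginning to get very tired.",
        encoding="utf-8",
    )
    docs = load_text_as_single_doc(sample)

print(len(docs))
print(list(docs[0]["metadata"]))
```

A plain text file has no page structure to split on, so the whole file becomes one document; any further division is left to the text splitter.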


## Experiment with Document Splitting Parameters


Well done loading and inspecting documents using LangChain. Now, let's dive into experimenting with document splitting!

In this exercise, you'll work with the RecursiveCharacterTextSplitter to explore how different settings affect the way a document is divided into chunks. Specifically, you'll adjust the parameters to:

- Set the `chunk_size` to 500
- Set the `chunk_overlap` to 50

By making these changes, you'll observe how they influence both the number of chunks and the content within each chunk.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a loader for our document
loader = PyPDFLoader(file_path)

# Load the document
docs = loader.load()

# Initialize the text splitter with a specified chunk size and overlap
text_splitter = RecursiveCharacterTextSplitter(
    # TODO: Set the chunk_size value to 500
    chunk_size=1000,
    # TODO: Set the chunk_overlap value to 50
    chunk_overlap=100
)

# Split the loaded document into chunks using the text splitter
split_docs = text_splitter.split_documents(docs)

# Print the number of chunks after splitting
print(f"After splitting: {len(split_docs)} chunks")

# Print the content of the first chunk
print(f"\nFirst chunk content:\n{split_docs[0].page_content}")

```


You're very close! You only need to adjust the `chunk_size` and `chunk_overlap` values in the `RecursiveCharacterTextSplitter` initialization to match the instructions.

Here is the corrected code:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a loader for our document
loader = PyPDFLoader(file_path)

# Load the document
docs = loader.load()

# Initialize the text splitter with a specified chunk size and overlap
text_splitter = RecursiveCharacterTextSplitter(
    # Set the chunk_size value to 500
    chunk_size=500,
    # Set the chunk_overlap value to 50
    chunk_overlap=50
)

# Split the loaded document into chunks using the text splitter
split_docs = text_splitter.split_documents(docs)

# Print the number of chunks after splitting
print(f"After splitting: {len(split_docs)} chunks")

# Print the content of the first chunk
print(f"\nFirst chunk content:\n{split_docs[0].page_content}")
```

**Changes made:**

1.  `chunk_size` changed from `1000` to `500`.
2.  `chunk_overlap` changed from `100` to `50`.

With these changes, you'll see that:

*   **The number of chunks will likely increase** because each chunk is now smaller (`chunk_size=500` versus `1000`).
*   **The content of each chunk will be shorter.**
*   **The overlap between chunks will be smaller** (`chunk_overlap=50` versus `100`), which means less text is repeated between the end of one chunk and the start of the next.

Experiments like this are very useful for understanding how the `chunk_size` and `chunk_overlap` parameters affect document splitting, which is an important step in many RAG (Retrieval-Augmented Generation) applications and other LLM-based document processing. The optimal chunk size and overlap vary with the type of document, the LLM being used, and the specific task at hand.
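
A back-of-the-envelope estimate shows why halving both parameters roughly doubles the chunk count. The sketch assumes idealised fixed-size windows over a single text of 10,000 characters (a made-up figure); the real splitter cuts at separators and handles each page separately, so actual counts will differ. `estimate_chunks` is a hypothetical helper.

```python
import math


def estimate_chunks(total_chars: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate how many fixed-size windows are needed to cover total_chars.

    Each window after the first advances by (chunk_size - chunk_overlap).
    """
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return math.ceil((total_chars - chunk_overlap) / step)


total_chars = 10_000  # hypothetical document length
before = estimate_chunks(total_chars, chunk_size=1000, chunk_overlap=100)
after = estimate_chunks(total_chars, chunk_size=500, chunk_overlap=50)
print(before, after)
```

Comparing this estimate with the count printed by the exercise script is a quick way to see how much the separator-based splitting deviates from the idealised case.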

## Exploring Different Text Splitters


In this task, you'll explore how different text splitters affect document processing in LangChain. Your goal is to modify the existing code by replacing the RecursiveCharacterTextSplitter with CharacterTextSplitter.

The two splitters differ in their approach:

- The `RecursiveCharacterTextSplitter` uses a list of separators to recursively break down text until the chunks are below a specified size, making it flexible for handling complex text structures.
- The `CharacterTextSplitter` splits on a single separator (by default a blank line, `"\n\n"`) and merges the resulting pieces up to the chunk size. It is simpler, but a piece that contains no separator is kept whole, so chunks can end up larger than `chunk_size`.

After making this change, run the code and observe how the alternative splitter impacts the resulting document chunks. Pay attention to the number of chunks created and the content of the first chunk.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a loader for our document
loader = PyPDFLoader(file_path)

# Load the document
docs = loader.load()

# TODO: Replace RecursiveCharacterTextSplitter with CharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

# Split the loaded document into chunks using the text splitter
split_docs = text_splitter.split_documents(docs)

# Print the number of chunks after splitting
print(f"After splitting: {len(split_docs)} chunks")

# Print the content of the first chunk
print(f"\nFirst chunk content:\n{split_docs[0].page_content}")

```

Here’s how you can swap out the `RecursiveCharacterTextSplitter` for a plain `CharacterTextSplitter` and see the difference:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a loader for our document
loader = PyPDFLoader(file_path)

# Load the document (a list of Document objects, one per page)
docs = loader.load()

# Replace RecursiveCharacterTextSplitter with CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    chunk_size=1000,    # max characters per chunk
    chunk_overlap=100   # overlap between chunks
)

# Split the loaded document into chunks
split_docs = text_splitter.split_documents(docs)

# Inspect the results
print(f"After splitting: {len(split_docs)} chunks\n")
print("First chunk content:")
print(split_docs[0].page_content)
```

---

## What to look for

1. **Chunk count**

   * **RecursiveCharacterTextSplitter** falls back to finer separators (line breaks, then spaces) whenever a piece is still too large, so it usually produces more chunks, each within `chunk_size`.
   * **CharacterTextSplitter** splits only on a single separator (by default a blank line, `"\n\n"`) and then merges neighbouring pieces up to `chunk_size`. If the extracted pages contain few blank lines, you'll typically see **fewer**, larger chunks, possibly one per page.

2. **Chunk boundaries**

   * With **recursive**, splits fall on the coarsest boundary that fits: paragraph breaks first, then line breaks, then spaces.
   * With **character-based**, chunks only ever end at the separator. A piece with no separator in it stays whole even if it is longer than `chunk_size`, and LangChain logs a warning when it creates such an oversized chunk.

3. **First chunk content**

   * You may notice that the first chunk runs past 1,000 characters instead of stopping at the limit.
   * Compare it side-by-side with what you got from the recursive splitter to see how evenly sized each set of chunks is.

---

### Example (hypothetical) output

```text
After splitting: 14 chunks

First chunk content:
The Adventure of the Blue Carbuncle
Arthur Conan Doyle

One foggy Christmas Eve … [continues to the next blank line or the end of the page, which may be past character 1000]
```

By running both approaches you'll get a clear sense of when you need recursive, multi-separator splitting versus when a simple single-separator splitter will do.
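
The difference between the two strategies can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: it ignores `chunk_overlap`, the function names are hypothetical, and the sample text is made up. The first function splits on one separator and merges pieces; the second falls back to finer separators when a piece is still too large.

```python
def merge_pieces(pieces: list[str], separator: str, chunk_size: int) -> list[str]:
    """Greedily join consecutive pieces while the result stays within chunk_size."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate  # an oversized first piece is kept whole
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Single-separator strategy: split once, then merge."""
    pieces = [p for p in text.split(separator) if p]
    return merge_pieces(pieces, separator, chunk_size)


def split_recursively(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Multi-separator strategy: retry oversized pieces with the next separator."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # Nothing left to split on: fall back to a hard cut
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    separator, finer = separators[0], separators[1:]
    chunks, small = [], []
    for piece in text.split(separator):
        if not piece:
            continue
        if len(piece) <= chunk_size:
            small.append(piece)
        else:
            # Flush the pieces that fit, then split the oversized one further
            chunks.extend(merge_pieces(small, separator, chunk_size))
            small = []
            chunks.extend(split_recursively(piece, finer, chunk_size))
    chunks.extend(merge_pieces(small, separator, chunk_size))
    return chunks


text = (
    "First paragraph is short.\n\n"
    "Second paragraph is much longer and has no blank line inside it, "
    "so a single-separator splitter cannot break it up any further."
)
simple = split_on_separator(text, "\n\n", chunk_size=50)
recursive = split_recursively(text, ["\n\n", "\n", " "], chunk_size=50)
print([len(c) for c in simple])
print([len(c) for c in recursive])
```

The single-separator version leaves the long paragraph as one oversized chunk, while the recursive version breaks it at spaces so that every chunk fits within the limit.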


## Loading and Splitting PDF Documents

You've done a great job learning about document loaders and splitting techniques. Now, let's put it all together!

In this exercise, you'll work with a PDF document and practice the following steps:

- Create a `PyPDFLoader` instance to load the provided PDF file.
- Load the document using the loader.
- Initialize a `RecursiveCharacterTextSplitter` with a specified chunk size and overlap.
- Use the text splitter to split the document into chunks.
- Print the content of the first five chunks to see how the document is divided.

This task will solidify your understanding of loading and splitting documents, setting you up for more advanced processing tasks. Dive in and see how these tools can make document handling a breeze!

```python

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# TODO: Create a loader for our document

# TODO: Load the document

# TODO: Initialize a RecursiveCharacterTextSplitter with a specified chunk size and overlap

# TODO: Split the loaded document into chunks using the text splitter

# TODO: Print the content of the first five chunks

```

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a loader for our document
loader = PyPDFLoader(file_path)

# Load the document (returns a list of Document objects, one per page)
docs = loader.load()

# Initialize a RecursiveCharacterTextSplitter with a chunk size of 1000 chars and 100-char overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

# Split the loaded document into chunks
split_docs = text_splitter.split_documents(docs)

# Print the content of the first five chunks
for idx, chunk in enumerate(split_docs[:5], start=1):
    print(f"\n--- Chunk {idx} ---\n")
    print(chunk.page_content)
    print("\n" + "-" * 40)
```

**What this does:**

1. **Loads** the PDF into page-based `Document` objects.
2. **Splits** each page recursively at the coarsest boundary that fits (paragraph breaks, then line breaks, then spaces) until each chunk is ≤1000 characters, with up to 100 characters of overlap between neighbouring chunks from the same page.
3. **Prints** the first five chunks so you can inspect how the text was divided.
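
Beyond reading the first few chunks, it can help to summarise the chunk lengths as a quick sanity check. The helper below is a hypothetical sketch that works on plain strings; the sample chunks are synthetic placeholders rather than real output.

```python
def summarize_chunks(chunks: list[str], chunk_size: int) -> dict:
    """Report how many chunks there are, their size range, and how many exceed chunk_size."""
    lengths = [len(chunk) for chunk in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths, default=0),
        "max": max(lengths, default=0),
        "oversized": sum(1 for n in lengths if n > chunk_size),
    }


# Synthetic placeholder chunks, standing in for real splitter output
sample_chunks = ["a" * 950, "b" * 1000, "c" * 420]
report = summarize_chunks(sample_chunks, chunk_size=1000)
print(report)
```

With the script above, the same check could be run as `summarize_chunks([d.page_content for d in split_docs], chunk_size=1000)`.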

Run this script to see exactly how your PDF gets loaded and split — a key first step before you move on to embedding or semantic searching!
