### 🎯 Module Overview
This module covers parsing and ingesting data for RAG systems, from plain text files to complex PDFs and databases. It uses LangChain v0.3 and pairs each technique with a practical example.

Table of Contents

- Introduction to Data Ingestion
- Text Files (.txt)
- PDF Documents
- Microsoft Word Documents
- CSV and Excel Files
- JSON and Structured Data
- Web Scraping
- Databases (SQL)
- Audio and Video Transcripts
- Advanced Techniques
- Best Practices

### Introduction To Data Ingestion


In [1]:
"""
Data Ingestion and Parsing Setup
=================================
This module demonstrates various techniques for ingesting and parsing different types
of documents for RAG (Retrieval-Augmented Generation) systems using LangChain v0.3.

Key components:
- Document loaders for various file formats
- Text splitting strategies for chunking
- Metadata handling for better retrieval
"""

# Standard library imports
import os
from typing import List, Dict, Any

# Third-party imports
import pandas as pd

# LangChain imports for document handling
from langchain_core.documents import Document
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,  # Best general-purpose splitter
    CharacterTextSplitter,           # Simple character-based splitting
    TokenTextSplitter                # Token-aware splitting for LLMs
)

print("✅ Setup Completed! Ready for data ingestion demonstrations.")

✅ Setup Completed! Ready for data ingestion demonstrations.


### Understanding Document Structure In LangChain

In [2]:
"""
Understanding LangChain Document Structure
=========================================
A Document in LangChain consists of two main components:
1. page_content: The actual text content
2. metadata: Additional information about the document

This structure allows for rich context and filtering capabilities in RAG systems.
"""

# Create a simple document with sample content and metadata
doc = Document(
    page_content="This is the main text content that will be embedded and searched.",
    metadata={
        "source": "example.txt",        # File source for traceability
        "page": 1,                      # Page number for reference
        "author": "Emil",               # Author information
        "date_created": "2024-01-01",   # Creation timestamp
        "custom_field": "any_value"     # Custom fields for specific use cases
    }
)

print("Document Structure")
print(f"Content: {doc.page_content}")
print(f"Metadata: {doc.metadata}")

# Why metadata matters:
print("\n📝 Metadata is crucial for:")
print("- Filtering search results")
print("- Tracking document sources")
print("- Providing context in responses")
print("- Debugging and auditing")

Document Structure
Content: This is the main text content that will be embedded and searched.
Metadata: {'source': 'example.txt', 'page': 1, 'author': 'Emil', 'date_created': '2024-01-01', 'custom_field': 'any_value'}

📝 Metadata is crucial for:
- Filtering search results
- Tracking document sources
- Providing context in responses
- Debugging and auditing
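
The first use listed above, filtering search results, can be sketched without any vector store. This is a minimal illustration, not LangChain's API: plain dicts stand in for `Document` objects, and `filter_by_metadata` is a hypothetical helper introduced only for this example.

```python
# Plain dicts stand in for Document objects so the sketch runs without LangChain.
records = [
    {"page_content": "Intro to Python.",
     "metadata": {"source": "python_intro.txt", "page": 1}},
    {"page_content": "Supervised learning uses labels.",
     "metadata": {"source": "machine_learning.txt", "page": 1}},
    {"page_content": "Reinforcement learning uses rewards.",
     "metadata": {"source": "machine_learning.txt", "page": 2}},
]


def filter_by_metadata(records, **criteria):
    """Hypothetical helper: keep records whose metadata matches every criterion."""
    return [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in criteria.items())
    ]


ml_records = filter_by_metadata(records, source="machine_learning.txt")
for r in ml_records:
    print(r["metadata"]["page"], r["page_content"])
```

Passing several keyword arguments narrows the match further, for example `filter_by_metadata(records, source="machine_learning.txt", page=2)`.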


In [3]:
# Check the type of the document object to understand its class structure
type(doc)

langchain_core.documents.base.Document

### Text Files (.txt) - The Simplest Case

In [None]:
"""
Directory Setup for Text Files
=============================
Create a data directory structure for organizing sample text files.
The exist_ok=True parameter prevents errors if the directory already exists.
"""

import os
os.makedirs("data/text_files", exist_ok=True)

In [5]:
"""
Sample Text Files Creation
==========================
Create sample text files for demonstrating different loading techniques.
These files represent common content types found in RAG systems.
"""

# Dictionary mapping file paths to their content
sample_texts={
    "data/text_files/python_intro.txt":"""Python Programming Introduction

Python is a high-level, interpreted programming language known for its simplicity and readability.
Created by Guido van Rossum and first released in 1991, Python has become one of the most popular
programming languages in the world.

Key Features:
- Easy to learn and use
- Extensive standard library
- Cross-platform compatibility
- Strong community support

Python is widely used in web development, data science, artificial intelligence, and automation.""",
    
    "data/text_files/machine_learning.txt": """Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties

Applications include image recognition, speech processing, and recommendation systems
    
    
    """

}

# Write each sample text to its respective file
for filepath, content in sample_texts.items():
    with open(filepath,"w", encoding="utf-8") as f:
        f.write(content)

print("✅ Sample text files created!")

✅ Sample text files created!


### TextLoader - Read Single File

In [7]:
"""
TextLoader - Single File Loading
===============================
TextLoader is the most basic document loader in LangChain for handling plain text files.
It creates a Document object with the file content and basic metadata.
"""

from langchain_community.document_loaders import TextLoader

# Loading a single text file
# encoding="utf-8" ensures proper handling of special characters
loader=TextLoader("data/text_files/python_intro.txt", encoding="utf-8")

# Load the document - returns a list of Document objects
documents=loader.load()
print(f"📄 Loaded {len(documents)} document")
print(f"Content preview: {documents[0].page_content[:100]}...")
print(f"Metadata: {documents[0].metadata}")

📄 Loaded 1 document
Content preview: Python Programming Introduction

Python is a high-level, interpreted programming language known for ...
Metadata: {'source': 'data/text_files/python_intro.txt'}


In [12]:
# Inspect the metadata structure to understand what information is automatically added
documents[0].metadata

{'source': 'data/text_files/python_intro.txt'}
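
To make the shape of that result concrete, the sketch below reads one file and returns a one-element list whose only metadata key is `source`, mirroring the output above. It is a rough approximation under stated assumptions, not LangChain's implementation: `load_text_file` is a hypothetical helper, and plain dicts stand in for `Document` objects.

```python
import os
import tempfile


def load_text_file(path, encoding="utf-8"):
    """Hypothetical helper: read one file and return a one-element list
    shaped like the loader output shown above."""
    with open(path, "r", encoding=encoding) as f:
        text = f.read()
    return [{"page_content": text, "metadata": {"source": path}}]


# Write a throwaway file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("Python Programming Introduction\n")
    loaded = load_text_file(sample_path)

print(len(loaded), loaded[0]["metadata"])
```

Any further metadata (author, date, page) has to be added by your own code after loading, since the file itself carries none.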

### DirectoryLoader- Multiple Text Files

In [15]:
"""
DirectoryLoader - Batch File Loading
===================================
DirectoryLoader allows loading multiple files from a directory at once.
It's efficient for processing entire directories of similar file types.

Parameters:
- path: Directory path to load from
- glob: Pattern to match files (e.g., "*.txt", "*.pdf")
- loader_cls: The loader class to use for each file
- loader_kwargs: Arguments to pass to the loader
- show_progress: Display progress bar during loading
"""

from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Configure directory loader for text files
directory_loader = DirectoryLoader(
    "data/text_files",                    # Directory to scan
    glob="*.txt",                         # Only load .txt files
    loader_cls=TextLoader,                # Use TextLoader for each file
    loader_kwargs={"encoding":"utf-8"},   # Pass encoding to TextLoader
    show_progress=True                    # Show loading progress
)   

# Load all documents from the directory
documents = directory_loader.load()

# Display information about loaded documents
for i, doc in enumerate(documents):
    print(f"\nDocument {i+1}:")
    print(f"  Source: {doc.metadata['source']}")
    print(f"  Length: {len(doc.page_content)} characters")

# 📊 Analysis of DirectoryLoader characteristics
print("\n📊 DirectoryLoader Characteristics:")
print("✅ Advantages:")
print("  - Loads multiple files at once")
print("  - Supports glob patterns")
print("  - Progress tracking")
print("  - Recursive directory scanning")

print("\n❌ Disadvantages:")
print("  - All files must be same type")
print("  - Limited error handling per file")
print("  - Can be memory intensive for large directories")

100%|██████████| 2/2 [00:00<00:00, 879.03it/s]


Document 1:
  Source: data/text_files/machine_learning.txt
  Length: 575 characters

Document 2:
  Source: data/text_files/python_intro.txt
  Length: 489 characters

📊 DirectoryLoader Characteristics:
✅ Advantages:
  - Loads multiple files at once
  - Supports glob patterns
  - Progress tracking
  - Recursive directory scanning

❌ Disadvantages:
  - All files must be same type
  - Limited error handling per file
  - Can be memory intensive for large directories
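
The "limited error handling per file" point above can be worked around by looping over the files yourself. The following is a stdlib-only sketch, assuming plain dicts in place of `Document` objects; `load_directory` is a hypothetical helper, not part of LangChain.

```python
import tempfile
from pathlib import Path


def load_directory(directory, pattern="*.txt", encoding="utf-8"):
    """Hypothetical helper: load every file matching pattern, collecting
    failures instead of aborting on the first unreadable file."""
    loaded, failed = [], []
    for path in sorted(Path(directory).glob(pattern)):
        try:
            text = path.read_text(encoding=encoding)
        except (OSError, UnicodeDecodeError) as exc:
            failed.append((str(path), repr(exc)))
            continue
        loaded.append({"page_content": text, "metadata": {"source": str(path)}})
    return loaded, failed


with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, "good.txt").write_text("readable text", encoding="utf-8")
    # 0xff is never valid in UTF-8, so this file fails to decode.
    Path(tmp_dir, "bad.txt").write_bytes(b"\xff\xfe broken")
    Path(tmp_dir, "notes.md").write_text("ignored by the glob", encoding="utf-8")
    loaded_docs, failed_files = load_directory(tmp_dir)

print(f"Loaded {len(loaded_docs)}, failed {len(failed_files)}")
```

Returning the failures alongside the loaded records lets the caller decide whether to retry with another encoding, log the problem, or stop.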





### Text Splitting Strategies

In [None]:
"""
Text Splitting Imports
=====================
Import various text splitters to demonstrate different chunking strategies:
- TextSplitter: Base class for all text splitters
- RecursiveCharacterTextSplitter: Intelligent splitting with fallback separators
- TokenTextSplitter: Token-aware splitting for LLM limits
"""

from langchain_text_splitters import (
    TextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter
)

In [17]:
# Pretty print the documents to see their structure
from pprint import pprint

pprint(documents)

[Document(metadata={'source': 'data/text_files/machine_learning.txt'}, page_content='Machine Learning Basics\n\nMachine learning is a subset of artificial intelligence that enables systems to learn and improve\nfrom experience without being explicitly programmed. It focuses on developing computer programs\nthat can access data and use it to learn for themselves.\n\nTypes of Machine Learning:\n1. Supervised Learning: Learning with labeled data\n2. Unsupervised Learning: Finding patterns in unlabeled data\n3. Reinforcement Learning: Learning through rewards and penalties\n\nApplications include image recognition, speech processing, and recommendation systems\n\n\n    '),
 Document(metadata={'source': 'data/text_files/python_intro.txt'}, page_content='Python Programming Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity and readability.\nCreated by Guido van Rossum and first released in 1991, Python has become one of the most popular\nprogram

In [20]:
# Extract the text content from the first document; every splitting
# method below is applied to this same text
text = documents[0].page_content
print(text)

Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties

Applications include image recognition, speech processing, and recommendation systems


    


In [22]:
"""
Character-based Text Splitting
==============================
This approach splits text on a single separator string (such as a space) and
merges the pieces into chunks of up to chunk_size characters.
It's simple but may break sentences in awkward places.

Parameters:
- separator: Character to split on
- chunk_size: Maximum characters per chunk
- chunk_overlap: Characters to overlap between chunks
- length_function: How to measure chunk length
"""

# Method 1: Character-based splitting
print("1️⃣ CHARACTER TEXT SPLITTER")
char_splitter = CharacterTextSplitter(
    separator=" ",      # Split on spaces
    chunk_size=200,     # Max 200 characters per chunk
    chunk_overlap=20,   # 20-character overlap for context continuity
    length_function=len # Use character count to measure length
) 

char_chunks = char_splitter.split_text(text)
print(f"Created {len(char_chunks)} chunks")
print(f"First chunk: {char_chunks[0][:100]}...")

1️⃣ CHARACTER TEXT SPLITTER
Created 3 chunks
First chunk: Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables system...


In [25]:
# Display first two chunks to see how splitting works
print(char_chunks[0])
print("------------------")
print(char_chunks[1])

Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing
------------------
on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning:


In [26]:
# Method 1: Character-based splitting with newline separator
print("1️⃣ CHARACTER TEXT SPLITTER")
char_splitter = CharacterTextSplitter(
    separator="\n",  # Split on newlines instead of spaces
    chunk_size=200,  # Max chunk size in characters
    chunk_overlap=20,  # Overlap between chunks for context preservation
    length_function=len  # How to measure chunk size
)

char_chunks=char_splitter.split_text(text)
print(f"Created {len(char_chunks)} chunks")
print(f"First chunk: {char_chunks[0][:100]}...")

1️⃣ CHARACTER TEXT SPLITTER
Created 4 chunks
First chunk: Machine Learning Basics
Machine learning is a subset of artificial intelligence that enables systems...


In [27]:
# Display the first few chunks to understand how newline splitting works
print(char_chunks[0])
print("-------------")
print(char_chunks[1])
print("-------------")
print(char_chunks[2])

Machine Learning Basics
Machine learning is a subset of artificial intelligence that enables systems to learn and improve
-------------
from experience without being explicitly programmed. It focuses on developing computer programs
that can access data and use it to learn for themselves.
Types of Machine Learning:
-------------
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties


In [28]:
# Method 2: Recursive character splitting (RECOMMENDED)
print("\n2️⃣ RECURSIVE CHARACTER TEXT SPLITTER")
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=[" "],  # Only spaces here; the default list is ["\n\n", "\n", " ", ""], tried in order
    chunk_size=200,
    chunk_overlap=20,
    length_function=len
)

recursive_chunks = recursive_splitter.split_text(text)
print(f"Created {len(recursive_chunks)} chunks")
print(f"First chunk: {recursive_chunks[0][:100]}...")


2️⃣ RECURSIVE CHARACTER TEXT SPLITTER
Created 4 chunks
First chunk: Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables system...


In [29]:
# Compare first three chunks from recursive splitter
print(recursive_chunks[0])
print("-----------------")
print(recursive_chunks[1])
print("------------------")
print(recursive_chunks[2])

Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing
-----------------
on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning:
------------------
Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties

Applications include image recognition, speech processing, and recommendation
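
The cell above passes a single separator, so it does not show the fallback behaviour that gives the recursive splitter its name. The sketch below illustrates the idea with several separators: split on the coarsest one first, and re-split only the pieces that are still too long. It is a simplified approximation, not LangChain's implementation: `recursive_split` is a hypothetical helper that omits overlap and separator retention.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified sketch: split on the first separator; any piece still
    longer than chunk_size is split again with the remaining separators."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        if len(piece) > chunk_size:
            # Flush what we have, then recurse on the oversized piece.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, remaining, chunk_size))
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Para one is short.\n\n"
    "Para two is a fair bit longer than the limit allows.\n\n"
    "End."
)
result = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=30)
for chunk in result:
    print(repr(chunk))
```

Short paragraphs survive intact, and only the long paragraph is broken at word boundaries, which is why this strategy tends to preserve structure better than a single fixed separator.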


In [30]:
"""
Demonstrating Chunk Overlap
===========================
This example shows how chunk overlap works by creating consecutive chunks
and displaying how they share common text for context preservation.
"""

# Create a single-line text with no paragraph or line breaks
simple_text = "This is sentence one and it is quite long. This is sentence two and it is also quite long. This is sentence three which is even longer than the others. This is sentence four. This is sentence five. This is sentence six."

splitter = RecursiveCharacterTextSplitter(
    separators=[" "],  # Only split on spaces
    chunk_size=80,     # Smaller chunks to see overlap clearly
    chunk_overlap=20,  # 20 character overlap between chunks
    length_function=len
)

chunks = splitter.split_text(simple_text)

print(f"\nSimple text example - {len(chunks)} chunks:\n")

# Display consecutive chunks to show overlap
for i in range(len(chunks) - 1):
    print(f"Chunk {i+1}: '{chunks[i]}'")
    print(f"Chunk {i+2}: '{chunks[i+1]}'")
    print()  # Empty line for readability


Simple text example - 4 chunks:

Chunk 1: 'This is sentence one and it is quite long. This is sentence two and it is also'
Chunk 2: 'two and it is also quite long. This is sentence three which is even longer than'

Chunk 2: 'two and it is also quite long. This is sentence three which is even longer than'
Chunk 3: 'is even longer than the others. This is sentence four. This is sentence five.'

Chunk 3: 'is even longer than the others. This is sentence four. This is sentence five.'
Chunk 4: 'is sentence five. This is sentence six.'
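
To see exactly how much text each pair of chunks shares, the sketch below measures the overlap directly. The chunk strings are copied from the output above; `shared_overlap` is a hypothetical helper written for this illustration.

```python
def shared_overlap(left, right):
    """Return the longest suffix of left that is also a prefix of right."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


# Chunks copied from the output above.
chunks_from_output = [
    "This is sentence one and it is quite long. This is sentence two and it is also",
    "two and it is also quite long. This is sentence three which is even longer than",
    "is even longer than the others. This is sentence four. This is sentence five.",
    "is sentence five. This is sentence six.",
]

for left, right in zip(chunks_from_output, chunks_from_output[1:]):
    overlap = shared_overlap(left, right)
    print(f"{len(overlap):2d} chars shared: '{overlap}'")
```

Each shared span stays within the configured `chunk_overlap=20` and begins on a word boundary, because the splitter only cuts at spaces.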



In [31]:
# Method 3: Token-based splitting
print("\n3️⃣ TOKEN TEXT SPLITTER")
token_splitter = TokenTextSplitter(
    chunk_size=50,  # Size in tokens (not characters)
    chunk_overlap=10
)

token_chunks = token_splitter.split_text(text)
print(f"Created {len(token_chunks)} chunks")
print(f"First chunk: {token_chunks[0][:100]}...")


3️⃣ TOKEN TEXT SPLITTER
Created 3 chunks
First chunk: Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables system...
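
Token-based splitting is a sliding window over a token sequence rather than over characters. The sketch below shows the windowing logic only. As a loudly stated assumption, whitespace-separated words stand in for real tokenizer output, and `split_by_tokens` is a hypothetical helper, so chunk boundaries will differ from what `TokenTextSplitter` produces.

```python
def split_by_tokens(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # The last window already reaches the end.
    return windows


# Assumption: whitespace-separated words stand in for real tokenizer output.
words = "one two three four five six seven eight nine ten".split()
windows = split_by_tokens(words, chunk_size=4, chunk_overlap=1)
for window in windows:
    print(" ".join(window))
```

Because the window is measured in tokens, every chunk fits a model's token budget regardless of how many characters each token spans.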


In [32]:
# 📊 Comparison
print("\n📊 Text Splitting Methods Comparison:")
print("\nCharacterTextSplitter:")
print("  ✅ Simple and predictable")
print("  ✅ Good for structured text")
print("  ❌ May break mid-sentence")
print("  Use when: Text has clear delimiters")

print("\nRecursiveCharacterTextSplitter:")
print("  ✅ Respects text structure")
print("  ✅ Tries multiple separators")
print("  ✅ Best general-purpose splitter")
print("  ❌ Slightly more complex")
print("  Use when: Default choice for most texts")

print("\nTokenTextSplitter:")
print("  ✅ Respects model token limits")
print("  ✅ More accurate for embeddings")
print("  ❌ Slower than character-based")
print("  Use when: Working with token-limited models")


📊 Text Splitting Methods Comparison:

CharacterTextSplitter:
  ✅ Simple and predictable
  ✅ Good for structured text
  ❌ May break mid-sentence
  Use when: Text has clear delimiters

RecursiveCharacterTextSplitter:
  ✅ Respects text structure
  ✅ Tries multiple separators
  ✅ Best general-purpose splitter
  ❌ Slightly more complex
  Use when: Default choice for most texts

TokenTextSplitter:
  ✅ Respects model token limits
  ✅ More accurate for embeddings
  ❌ Slower than character-based
  Use when: Working with token-limited models


In [None]:
"""
Text Processing Completed! 🎉
=============================
In this section, we've covered:
- Basic LangChain Document structure
- Single file loading with TextLoader
- Batch loading with DirectoryLoader
- Different text splitting strategies

Next Steps:
- PDF document processing
- Word document handling
- CSV and Excel files
- Web scraping techniques
- Database integration
"""

print("✅ Text file processing demonstration completed!")
print("📚 Ready to move on to more complex document types...")