## Document Loading
Document loading is the process of ingesting text from sources such as files, web pages, databases, and APIs. It is the first step in preparing knowledge for a RAG (retrieval-augmented generation) system.

### Common Document Loaders

| Source Type | Techniques / Tools                                |
|-------------|---------------------------------------------------|
| Plain Text  | TextLoader (LangChain), Python’s open() function  |
| PDFs        | PyMuPDF, pdfplumber, PDFMiner                     |
| DOCX        | python-docx                                       |
| CSV & Excel | pandas.read_csv(), openpyxl                       |
| Webpages    | BeautifulSoup, trafilatura                        |
| Databases   | SQL queries (sqlite3, SQLAlchemy)                 |
| APIs        | requests, LangChain APILoader                     |

> How to Choose the Right Loader?
>
> - If you need structured data (CSV, JSON) → Use pandas or Python’s built-in csv and json modules.
> - For large PDFs with complex layouts → Use PyMuPDF or pdfplumber (both provide table extraction).
> - If the source is a database → Use SQL queries with SQLAlchemy.
> - If the data is from the web → Use BeautifulSoup or trafilatura.
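
The loader choice above can be sketched as a small dispatch on file extension. This is a minimal illustration that uses only the standard library; `load_document` and its "one text record per row/item" output format are assumptions made for this example, not part of any library mentioned above.

```python
import csv
import json
from pathlib import Path


def load_document(path):
    """Return the file's content as a list of text records (hypothetical helper)."""
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".csv":
        # One record per row, rendered as "column: value" pairs.
        with path.open(newline="", encoding="utf-8") as f:
            return [
                ", ".join(f"{key}: {value}" for key, value in row.items())
                for row in csv.DictReader(f)
            ]

    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # Treat a top-level list as many records, anything else as one.
        items = data if isinstance(data, list) else [data]
        return [json.dumps(item, ensure_ascii=False) for item in items]

    if suffix in {".txt", ".md"}:
        return [path.read_text(encoding="utf-8")]

    # PDFs, DOCX, web pages, etc. would need the third-party tools listed above.
    raise ValueError(f"No loader registered for {suffix!r}")
```

Raising on unknown extensions keeps unsupported formats from being silently ingested as garbage text.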


## Document Splitting
After loading, large documents need to be split into manageable sections for efficient retrieval.

Why Split Documents?
- LLMs have a context length limit (e.g., the original GPT-4 models accept 8k or 32k tokens).
- Splitting improves semantic retrieval: smaller, focused chunks match queries more precisely.
- It keeps irrelevant text out of the context retrieved for RAG.

### Common Splitting Techniques


#### Character-based Splitting
Splits text by characters, maintaining a fixed chunk size (e.g., every 500 characters).

When to use: When documents have continuous text with no clear structure.
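
A minimal sketch of character-based splitting, assuming the text is already loaded as a Python string. The function name is illustrative.

```python
def split_by_characters(text, chunk_size=500):
    """Cut text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Every chunk except possibly the last has exactly `chunk_size` characters, and cuts can land in the middle of a word or sentence.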

#### Sentence-based Splitting
Uses sentence boundaries to ensure meaningful splits. Often done with nltk or spaCy.  
When to use: When document meaning depends on complete sentences (e.g., research papers, news articles).
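
As a rough illustration, sentence boundaries can be approximated with a regular expression. This is a stand-in for the nltk or spaCy tokenizers mentioned above, not a replacement: it is an assumption of this sketch that every `.`, `!`, or `?` followed by whitespace ends a sentence.

```python
import re

# Split at whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_sentences(text):
    """Naive sentence splitter; keeps the punctuation with each sentence."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]
```

The heuristic breaks on abbreviations such as "e.g. this", which is the main reason to prefer a trained tokenizer for real documents.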

#### Paragraph-based Splitting
Splits at paragraph boundaries, preserving logical divisions.  
When to use: When documents have well-defined paragraphs (e.g., blogs, legal texts).
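
A minimal sketch, assuming paragraphs are separated by one or more blank lines (the usual convention in plain text and Markdown). The function name is illustrative.

```python
import re


def split_into_paragraphs(text):
    """Split on blank lines and drop empty or whitespace-only pieces."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]
```

Single line breaks inside a paragraph are preserved, so hard-wrapped text stays together.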

#### Token-based Splitting
Splits based on a fixed number of tokens (e.g., every 512 tokens using tiktoken).  
When to use: When token limits matter (e.g., fitting LLM context windows).
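
The idea can be sketched with a pluggable tokenizer. To keep the example dependency-free, the default below treats whitespace-separated words as tokens, which is only a rough assumption; model tokenizers such as tiktoken produce different counts. The `encode` and `decode` parameters are names chosen for this sketch.

```python
def split_by_tokens(text, max_tokens=512, encode=str.split, decode=" ".join):
    """Split text into chunks of at most max_tokens tokens.

    encode: text -> list of tokens (default: whitespace words, a stand-in).
    decode: list of tokens -> text.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = encode(text)
    return [
        decode(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```

With tiktoken installed, the same function should work by passing a real encoder, for example `enc = tiktoken.get_encoding("cl100k_base")` followed by `split_by_tokens(text, 512, enc.encode, enc.decode)`.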

#### Semantic Splitting
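Uses embedding models (e.g., BERT-style sentence encoders) to split text where the topic shifts. Note that LangChain's RecursiveCharacterTextSplitter splits on a hierarchy of separators rather than on meaning, so it only approximates logical breaks.  
When to use: When preserving semantic coherence is crucial (e.g., splitting transcripts, technical documents).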
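
The mechanism can be illustrated without a model: compare each sentence with the next one and start a new chunk when their similarity drops. In this sketch the "embedding" is a plain word-count vector, which is an assumption made to keep the example runnable; a real system would swap `embed` for a sentence-embedding model. The `threshold` value is likewise arbitrary.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_split(sentences, threshold=0.2):
    """Group consecutive sentences; break where similarity falls below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])      # topic shift: start a new chunk
        else:
            chunks[-1].append(cur)    # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]
```

Word overlap is a weak proxy for meaning, so the break points here are only as good as the shared vocabulary between neighbouring sentences.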

> Choosing the Right Splitting Strategy
>
> - If working with plain text with no structure → Use character-based or token-based splitting.
> - If dealing with structured documents (articles, blogs) → Use sentence- or paragraph-based splitting.
> - If meaning must be preserved across chunks → Use semantic splitting.

## Chunking
Chunking refers to structuring and organizing text into meaningful segments for retrieval.

### Chunking Strategies

#### Fixed-size chunking
Splits text into equal-length chunks (e.g., every 500 tokens).  
Pros: Simple and fast.  
Cons: May cut off important context.

#### Sliding Window Chunking
Overlapping chunks (e.g., 512 tokens per chunk, 128 tokens overlap).  
Pros: Preserves context between chunks.  
Cons: Increases storage and retrieval overhead.
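
A minimal sketch of sliding window chunking over an already tokenized sequence (any list works). Setting `overlap=0` reduces it to fixed-size chunking. The function name and defaults are illustrative, taken from the 512/128 example above.

```python
def sliding_window_chunks(tokens, chunk_size=512, overlap=128):
    """Return overlapping windows; consecutive windows share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks
```

With a 512-token window and 128-token overlap, each new chunk advances by 384 tokens, so roughly a third more chunks are stored than with no overlap.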

#### Recursive Chunking
Splits text at logical boundaries (e.g., heading → paragraph → sentence).  
Pros: Maintains document structure.  
Cons: Needs reliable boundary detection (e.g., heading markers or an NLP sentence tokenizer).
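
The recursion can be sketched as follows: split on the coarsest separator first, pack adjacent pieces together while they fit, and recurse with a finer separator only for pieces that are still too large. The separator list and the character-based size limit are assumptions of this sketch; it is similar in spirit to, but not a reimplementation of, LangChain's RecursiveCharacterTextSplitter.

```python
def recursive_chunks(text, max_chars=500, separators=("\n\n", "\n", ". ", " ")):
    """Split at the coarsest boundary that yields chunks of at most max_chars."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No boundary left to try: fall back to a hard character cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too large: retry with the next finer separator.
            chunks.extend(recursive_chunks(piece, max_chars, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks
```

Note that the separator is dropped at each chunk boundary, so a chunk that ends at a sentence break loses its trailing ". ".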

#### Dynamic Chunking
Adjusts chunk size based on content (e.g., keeping topic coherence).  
Pros: Preserves meaning better.  
Cons: Computationally expensive.

> Choosing the Right Chunking Strategy
>
> - If speed & simplicity matter → Use fixed-size chunking.
> - If context preservation is important → Use sliding window chunking.
> - If documents have structured sections → Use recursive chunking.
> - If you need adaptive retrieval → Use dynamic chunking.

## Final Decision Framework

| Scenario                                      | Recommended Approach                            |
|-----------------------------------------------|-------------------------------------------------|
| Short plain text files (blogs, news articles) | Sentence-based splitting + Fixed-size chunking  |
| Long structured documents (PDFs, books)       | Paragraph-based splitting + Recursive chunking  |
| AI chatbots that need context                 | Sliding window chunking                         |
| Knowledge bases with different formats        | Dynamic chunking                                |