# Part 1 — Data Loading (Loaders)
### Understanding how to load and prepare data for LLM pipelines

This notebook introduces document loaders, supported formats, and best practices for preparing data before processing it with LLM applications.


## Learning Guide

In this notebook, you will learn:

- What document loaders are and why they are essential in LLM workflows  
- Different types of built-in loaders in LangChain  
- Best practices when handling text data from various file formats  
- How to load and inspect documents through hands-on examples  

This is a foundational step for building Retrieval-Augmented Generation (RAG) systems and other LLM applications requiring document ingestion.


In [12]:
from secrete_key import my_gemini_api_key
API_KEY = my_gemini_api_key()

print("API Key loaded successfully.")

API Key loaded successfully.


## 1.1 What Are Loaders?

Document loaders read raw data from files or URLs and convert it into a structured format suitable for LLM pipelines.

Supported formats include:

- PDF  
- Text files  
- Word documents  
- HTML pages  
- CSV files  
- Websites  

Loaders standardize the input so that downstream steps (splitting, embedding, and retrieval) can treat every source the same way.
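
To make the idea concrete, here is a minimal sketch of what a loader does conceptually, using only the standard library. `SimpleDocument` and `load_text_file` are hypothetical names invented for this illustration; they are not LangChain classes, and real loaders handle many more cases (multiple pages, parsing, remote sources).

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loader's output: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read one text file and wrap it in a single SimpleDocument."""
    text = Path(path).read_text(encoding=encoding)
    # Record where the text came from so later steps can trace it back.
    return [SimpleDocument(page_content=text, metadata={"source": path})]
```

Whatever the source format, the caller receives the same shape back: a list of objects carrying text and a metadata dictionary.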


## 1.2 Built-in Loaders in LangChain

Commonly used document loaders:

- `PyPDFLoader` — Extracts text from PDF files  
- `TextLoader` — Reads plain text files  
- `WebBaseLoader` — Fetches and processes website content  
- `CSVLoader` — Loads structured CSV data  

Each loader returns a list of `Document` objects, each holding the extracted text in `page_content` and source details in `metadata`.

[LangChain Document Loaders docs](https://docs.langchain.com/oss/python/integrations/document_loaders)



## 1.3 Best Practices for Data Loading

- Clean text to remove noise such as excess whitespace and stray control characters  
- Validate file encoding before processing  
- Use lazy loading when dealing with large datasets  
- Inspect metadata to track document source and structure  
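
The first two practices can be sketched with the standard library alone. The helper names `is_valid_utf8` and `clean_text` are hypothetical, and the cleaning rules (dropping control characters, collapsing whitespace) are assumptions about what counts as noise; adjust them to your data.

```python
import re
from pathlib import Path


def is_valid_utf8(path: str) -> bool:
    """Return True if the whole file decodes as UTF-8, False otherwise."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def clean_text(text: str) -> str:
    """Drop control characters and collapse redundant whitespace."""
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Remove ASCII control characters other than tab and newline.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim the space left on either side of a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Allow at most one blank line in a row.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Checking the encoding before loading surfaces a bad file early, instead of failing partway through a batch.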


## 1.4 Hands-On Demo

The cells below use `TextLoader` and `PyPDFLoader`.  
They expect `example.txt` and `Contract_doc_1pages.pdf` in the working directory; upload these files, or change the paths to your own, before running the demo cells.


In [13]:
%pip install langchain-community pypdf

Note: you may need to restart the kernel to use updated packages.



In [14]:
from langchain_community.document_loaders import PyPDFLoader, TextLoader

# Example: Load a text file
try:
    # Pass the encoding explicitly so decoding does not depend on the OS default
    loader = TextLoader("example.txt", encoding="utf-8")
    docs = loader.load()
    print("Loaded text documents:", docs[:1])
except Exception as e:
    print("TextLoader example skipped:", e)


Loaded text documents: [Document(metadata={'source': 'example.txt'}, page_content='Force majeure, French for "superior force," is a contractual provision designed to protect parties from liability when extraordinary, unforeseeable events beyond their reasonable control prevent them from fulfilling their obligations. Such events, often called "acts of God," can include natural disasters (earthquakes, floods, hurricanes), as well as human-caused disruptions like wars, strikes, pandemics, or government actions. To successfully invoke force majeure, the affected party typically must demonstrate that the event was unforeseeable, external, and rendered performance impossible or illegal, not merely more difficult or expensive')]
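
The raw `repr` above is hard to scan once there are many documents. As a rough sketch, a small helper can reduce each document to its source, length, and a short preview. `summarize_documents` is a hypothetical name; it only assumes objects with `page_content` and `metadata` attributes, which is the shape shown in the output above. A `SimpleNamespace` stands in for a loaded document so the block runs without any files.

```python
from types import SimpleNamespace


def summarize_documents(docs, preview_chars: int = 80) -> list[dict]:
    """Build one small summary dict per document."""
    summaries = []
    for i, doc in enumerate(docs):
        summaries.append({
            "index": i,
            "source": doc.metadata.get("source", "<unknown>"),
            "chars": len(doc.page_content),
            "preview": doc.page_content[:preview_chars],
        })
    return summaries


# Stand-in object so the sketch runs without any files on disk.
demo_docs = [SimpleNamespace(page_content="Force majeure, French for ...",
                             metadata={"source": "example.txt"})]
for row in summarize_documents(demo_docs, preview_chars=13):
    print(row)
```

In the notebook, passing the `docs` list from the cell above in place of `demo_docs` should work the same way.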


In [15]:
# Example: Load a PDF file
try:
    loader_pdf = PyPDFLoader("Contract_doc_1pages.pdf")
    pdf_docs = loader_pdf.load()
    print("First PDF page snippet:", pdf_docs[0].page_content[:200])
except Exception as e:
    print("PyPDFLoader example skipped:", e)

First PDF page snippet: 28 
AIRLIFT   
14.1 Should the BUYER intend to airlift all or some of the stores the SELLER shall pack 
the stores accordingly on  receipt of intimation to that effect from the BUYER. Such 
deliveries
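
`load()` reads every page into memory at once. LangChain loaders also expose `lazy_load()`, which yields documents one at a time, so a large PDF can be processed page by page. The sketch below shows the pattern with a stand-in generator, since it has to run without a PDF on disk; `take` and `fake_pages` are hypothetical helpers, not LangChain functions.

```python
from itertools import islice
from typing import Iterable, Iterator


def take(items: Iterable, n: int) -> list:
    """Pull at most n items from an iterator without consuming the rest."""
    return list(islice(items, n))


def fake_pages(total: int) -> Iterator[dict]:
    """Hypothetical stand-in for a lazy loader: yields one page at a time."""
    for number in range(total):
        yield {"page": number, "page_content": f"text of page {number}"}


pages = fake_pages(1_000_000)   # nothing is generated yet
first_two = take(pages, 2)      # only two pages are ever built
print(first_two)

# With the loader from the cell above, the same helper would be used as:
# first_two = take(loader_pdf.lazy_load(), 2)
```

Because the generator is consumed on demand, memory use stays proportional to the pages actually requested rather than to the size of the file.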


In [16]:
from datetime import datetime

print("System Time:", datetime.now().strftime("%Y-%m-%d %H:%M:%S"))


System Time: 2025-12-08 10:15:39
