# **1-data-loaders**

## **🤖 Introduction**
- **Data Loaders**: Tools that let you **load all kinds of data**—documents, PDFs, web pages—so you can query them with an LLM.
- **Connect** to private data sources or local files, turning them into a format the LLM can work with.
- **LangChain Built-In Loaders**: Listed in the docs under **"integrations"**; each loader may require specific libraries (e.g., `pypdf`, `docx`, `BeautifulSoup`).
- For more details, see the **LangChain documentation on Document Loaders**:
  - **Documentation Page**: Explains usage patterns and advanced tips.
  - **List of Built-In Loaders**: Shows which libraries you need for each file or data type.

---

## **⚙️ Setup**
1. **Clone or Download** the GitHub repository to your computer.
2. In **terminal**:
   ```
   cd project_name
   pyenv local 3.11.4
   poetry install
   poetry shell
   ```
3. Launch **Jupyter Lab**:
   ```
   jupyter lab
   ```
   - Open the `001-data-loaders.ipynb` notebook in your notebooks folder.
4. **View Code** in an editor like VS Code:
   - Locate and open `001-data-loaders.py`.

---

## **🔐 Create Your `.env` File**
- **`.env.example`** is provided; rename it to **`.env`**.
- Add the following keys:
  ```
  OPENAI_API_KEY=your_openai_api_key
  LANGCHAIN_TRACING_V2=true
  LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
  LANGCHAIN_API_KEY=your_langchain_api_key
  LANGCHAIN_PROJECT=your_project_name
  ```
- This project is **`001-data-loaders`** in **LangSmith**.

---

## **📊 Track Operations**
- **Monitor** usage and costs in **LangSmith**:
  ```
  smith.langchain.com
  ```

> **💡 Note**: Data loaders can drastically simplify your workflow by **standardizing** how data is read, parsed, and fed into the LLM for further processing or question-answering.


## LangChain documentation on Document Loaders
* See the documentation page [here](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/).
* See the list of built-in document loaders [here](https://python.langchain.com/v0.1/docs/integrations/document_loaders/).

## Connect with the .env file located in the same directory as this notebook

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#pip install python-dotenv

In [None]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]
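
The lookup `os.environ["OPENAI_API_KEY"]` raises a `KeyError` when the key is absent, which can be confusing if the `.env` file was not found. Below is a minimal sketch of a friendlier check, assuming the key names listed in the `.env` section above. The helper `missing_keys` is hypothetical and not part of `python-dotenv`.

```python
import os

# Key names taken from the .env section of this notebook.
REQUIRED_KEYS = [
    "OPENAI_API_KEY",
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_PROJECT",
]


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `env` (hypothetical helper)."""
    return [key for key in required if not env.get(key)]


# Check the real environment; run this after load_dotenv() in the notebook.
missing = missing_keys(os.environ)
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```

Running this right after `load_dotenv(find_dotenv())` shows every missing key at once instead of failing on the first one.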

#### Install LangChain

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#!pip install langchain

## Connect with an LLM

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#!pip install langchain-openai

* NOTE: We use OpenAI by default because its models are among the strongest available at the time of writing. A later lesson shows how to connect to open-source LLMs such as Llama 3 or Mistral.

## Simple data loading

#### Loading a .txt file

In [None]:
from langchain_openai import ChatOpenAI

chatModel = ChatOpenAI(model="gpt-3.5-turbo-0125")

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#!pip install langchain-community

In [None]:
# Import the TextLoader class from the community document loaders
from langchain_community.document_loaders import TextLoader

# Create a TextLoader instance, pointing to a local text file
loader = TextLoader("./data/be-good.txt")

# Load the data from the specified file into a structured format
loaded_data = loader.load()

### **⚙️ How This Works**
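- **TextLoader** reads a plain text file and returns it in the **standardized** LangChain document format: a list containing one `Document`, with the file's text in `page_content` and the file path in `metadata["source"]`.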
- By pointing it to `"./data/be-good.txt"`, you can **easily** incorporate the file’s content into your workflow—such as Q&A, summarization, or chaining tasks.
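
To make the document format concrete, here is a stdlib-only sketch of the shape a text loader produces. `SimpleDocument` and `load_text` are hypothetical stand-ins written for this illustration; the real class is LangChain's `Document`, and this is not how `TextLoader` is implemented internally.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document class."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path):
    """Read one text file and wrap it in a single-document list."""
    text = Path(path).read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Demo with a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("Be good.", encoding="utf-8")
    docs = load_text(sample_path)

print(len(docs), docs[0].page_content)
```

The key point is the return type: a list, even for a single file, so downstream code can treat every loader's output the same way.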

### **💼 Why Use a Document Loader?**
- **Consistency**: Ensures each document is represented with consistent metadata (e.g., source, page numbers).
- **Scalability**: Makes it simpler to add more loaders for **different** file types (PDF, CSV, HTML).
- **Modularity**: Keep your data ingestion separate from your LLM logic, making code more maintainable.

> **🔎 Pro Tip**: After loading, you can pass `loaded_data` to various **LangChain** components (like text splitters, vector stores, or prompt templates) to build sophisticated applications.
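
The modularity point above can be sketched as a small registry that picks a loader by file extension. The loader names are the ones used in this notebook; the `LOADER_BY_EXTENSION` mapping and `pick_loader_name` helper are hypothetical. The registry stores class names as strings so the sketch runs without LangChain installed; in the notebook you could store the imported classes instead.

```python
from pathlib import Path

# Loader class names used in this notebook, keyed by file extension.
LOADER_BY_EXTENSION = {
    ".txt": "TextLoader",
    ".csv": "CSVLoader",
    ".html": "UnstructuredHTMLLoader",
    ".pdf": "PyPDFLoader",
}


def pick_loader_name(path):
    """Return the loader name registered for the file's extension."""
    suffix = Path(path).suffix.lower()
    try:
        return LOADER_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"No loader registered for {suffix!r}") from None


for file_path in ["./data/be-good.txt", "./data/Street_Tree_List.csv", "./data/5pages.pdf"]:
    print(file_path, "->", pick_loader_name(file_path))
```

Keeping this mapping in one place means that supporting a new file type is a one-line change, separate from any LLM logic.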






* If you uncomment and execute the next cell you will see the contents of the loaded document.

In [None]:
#loaded_data

# **Loading a CSV file**

In [None]:
# Import CSVLoader from the community document loaders
from langchain_community.document_loaders import CSVLoader

# Instantiate the loader with a path to the CSV file
loader = CSVLoader('./data/Street_Tree_List.csv')

# Load the CSV file data into a structured LangChain format
loaded_data = loader.load()

### **⚙️ How Does CSVLoader Work?**  
- **🔍 Reads CSV**: The loader reads the file row by row and returns a list of LangChain `Document` objects, one per row.  
- **🗄 Data Organization**: Each row becomes its own “document.” Its `page_content` lists the row's values as `column: value` lines, and its `metadata` records the source file and the row index.  
- **♻️ Reusability**: If you have multiple CSV files, you can apply the same loader logic to each, keeping your code consistent.

### **💼 Why Use a Document Loader for CSV?**
- **Standardization**: Ensures the data is uniformly structured for subsequent LLM tasks (like Q&A or summarization).  
- **Scalability**: Makes it easy to load large or multiple CSVs without manually parsing them.  
- **Integration**: Once loaded, you can feed `loaded_data` into other LangChain components—like text splitters, vector stores, or prompt templates.

> **💡 Pro Tip**: Pair CSVLoader with a text splitter if you have lengthy cell data. This helps the LLM handle the content more effectively during downstream tasks.
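
The row-to-document idea can be sketched with the standard library alone. This assumes the `column: value` layout and the `source`/`row` metadata that `CSVLoader` produces by default in current versions; the helper `rows_to_documents` and the sample columns are hypothetical, and the real column names of `Street_Tree_List.csv` may differ.

```python
import csv
import io


def rows_to_documents(csv_text, source):
    """Turn each CSV row into a dict shaped like a LangChain document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        # One "column: value" line per cell, joined by newlines.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_index}}
        )
    return documents


# Hypothetical sample data for illustration only.
sample_csv = "TreeID,Species\n1,Oak\n2,Maple\n"
docs = rows_to_documents(sample_csv, "sample.csv")
print(docs[0]["page_content"])
```

Because every row carries its own `row` index, an answer produced later can be traced back to the exact line of the CSV it came from.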






In [None]:
#loaded_data

# **Loading an .html file**

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#!pip install unstructured

In [None]:
# Import the UnstructuredHTMLLoader for parsing HTML files
from langchain_community.document_loaders import UnstructuredHTMLLoader

# Instantiate the loader, pointing to an HTML file named '100-startups.html'
loader = UnstructuredHTMLLoader('./data/100-startups.html')

# Load the HTML data, converting it into a LangChain-compatible document format
loaded_data = loader.load()

### **🔍 Line-by-Line Analysis**
1. **`from langchain_community.document_loaders import UnstructuredHTMLLoader`**  
   - Imports the **UnstructuredHTMLLoader**, a specialized loader that extracts text and metadata from HTML documents.

2. **`loader = UnstructuredHTMLLoader('./data/100-startups.html')`**  
   - Creates an instance of **UnstructuredHTMLLoader**, pointing to the **`100-startups.html`** file in the `./data` folder.  
   - Under the hood, it uses the `unstructured` library to **cleanly** extract text from HTML tags, discarding extraneous markup.

3. **`loaded_data = loader.load()`**  
   - Executes the loading process, returning a list of LangChain **Documents**.  
   - Each document typically contains:  
     - **Text**: The main textual content from the HTML file, stored in `page_content`.  
     - **Metadata**: At minimum the `source` path of the file; any further fields depend on the loader's mode and the installed `unstructured` version.

### **💼 Why Use `UnstructuredHTMLLoader`?**
- **Comprehensive Parsing**: It can handle various HTML layouts, ignoring scripts and styling to focus on **readable** text.  
- **Format Uniformity**: Converts HTML content into a **standard** LangChain document format, simplifying subsequent LLM tasks like summarization or question-answering.  
- **Scalability**: You can easily process multiple HTML files with consistent parsing rules.

### **🤖 Common Next Steps**
- **Indexing**: Store the extracted text in a vector database for **semantic search** or chat-based retrieval.  
- **Text Splitting**: If the HTML is lengthy, break the content into **manageable** chunks before passing it to an LLM.  
- **Chaining**: Combine these documents with other data loaders (e.g., PDFs, CSVs) to build a **unified** knowledge base for your LLM.

> **💡 Pro Tip**: If your HTML files have **complex** structures, consider customizing the parser settings or pre-processing the HTML to remove non-relevant sections (like ads or navigation menus). This ensures cleaner data for your LLM pipeline.
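
The pre-processing idea in the tip above can be sketched with Python's built-in `html.parser`. This is not the algorithm the `unstructured` library uses; it is a simplified illustration that drops text inside `<script>`, `<style>`, and `<nav>` tags. The class `VisibleTextExtractor`, the helper `extract_visible_text`, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping tags that rarely hold useful content."""

    SKIPPED_TAGS = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_visible_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav>Home | About</nav><h1>Startups</h1><p>Acme builds rockets.</p>"
    "<script>track()</script></body></html>"
)
print(extract_visible_text(sample))
```

Cleaning the HTML this way before loading keeps menus and tracking code out of the text the LLM eventually sees.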

In [None]:
#loaded_data

# **Loading a .pdf file**

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#!pip install pypdf

In [None]:
# Import the PyPDFLoader from the community document loaders
from langchain_community.document_loaders import PyPDFLoader

# Create a loader for the specified PDF file
loader = PyPDFLoader('./data/5pages.pdf')

# Load the PDF and split its content into smaller segments
loaded_data = loader.load_and_split()

### **🔎 What’s Happening?**
1. **`PyPDFLoader('./data/5pages.pdf')`**  
   - **Loads** the file named **`5pages.pdf`** from the `./data` folder.  
   - Under the hood, PyPDFLoader uses the **pypdf** library (the package installed above) to read the PDF page by page.

2. **`loader.load_and_split()`**  
   - **Retrieves** the PDF’s textual content and **splits** it into smaller chunks; when no splitter is passed, LangChain falls back to its default recursive character text splitter.  
   - This is especially useful for **LLM** tasks, where dealing with large text blocks can lead to token limits or less coherent outputs.

### **🤖 Why Use `PyPDFLoader`?**
- **Streamlined Workflow**: Automates reading PDFs, saving you from manual parsing.  
- **Consistent Output**: Produces standardized LangChain “document” objects that can be indexed, embedded, or queried.  
- **Scalability**: Works on multi-page PDFs, ensuring each page (or chunk) is treated separately for more efficient **Q&A** or summarization.

### **💡 Key Benefits**
- **Chunked Documents**: Breaking large PDFs into smaller segments helps the LLM handle content more accurately.  
- **Metadata Preservation**: The loader often retains page numbers or other relevant info, aiding in references or citations.  
- **Versatility**: Combine with other loaders (HTML, CSV, text) for a **multi-source** pipeline.

> **✨ Pro Tip**: If your PDF is **very** large or has complex formatting (tables, footnotes), consider **additional** text-splitting or data-cleaning steps for best results.
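
As a rough illustration of chunking with preserved page numbers, here is a stdlib-only sketch. It uses plain fixed-size windows with overlap, which is simpler than the recursive character splitter LangChain applies. The helpers `split_with_overlap` and `chunk_pages`, and the zero-based `page` field, are assumptions made for this example.

```python
def split_with_overlap(text, chunk_size, overlap):
    """Split text into fixed-size chunks that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


def chunk_pages(pages, chunk_size=200, overlap=20):
    """Chunk each page separately and keep its page index as metadata."""
    records = []
    for page_number, page_text in enumerate(pages):
        for chunk in split_with_overlap(page_text, chunk_size, overlap):
            records.append({"page_content": chunk, "metadata": {"page": page_number}})
    return records


# Tiny sizes so the overlap is easy to see.
records = chunk_pages(["abcdefghij", "xyz"], chunk_size=4, overlap=1)
for record in records:
    print(record["metadata"]["page"], record["page_content"])
```

The overlap keeps a sentence that straddles a chunk boundary partly visible in both chunks, while the `page` field lets an answer cite where in the PDF it came from.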

In [None]:
#loaded_data[0].page_content

# **Loading a Wikipedia page and asking questions about it**

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#!pip install wikipedia

In [None]:
# Import WikipediaLoader from the community document loaders
from langchain_community.document_loaders import WikipediaLoader

# Define the search query for Wikipedia
name = "JFK"

# Create a loader that fetches up to 1 document matching the query
loader = WikipediaLoader(query=name, load_max_docs=1)

# Load the data and extract the 'page_content' from the first result
loaded_data = loader.load()[0].page_content

### **🔎 Line-by-Line Breakdown**
1. **`from langchain_community.document_loaders import WikipediaLoader`**  
   - **⚙️** Imports a specialized loader that queries Wikipedia and returns matching articles in a LangChain-compatible format.

2. **`name = "JFK"`**  
   - **📝** Specifies the **search term** (“JFK”), which the loader will use to find relevant Wikipedia entries.

3. **`loader = WikipediaLoader(query=name, load_max_docs=1)`**  
   - **🔍** Initializes the loader with the query (`"JFK"`) and limits results to **1 document**.  
   - **Why a limit?** Avoids pulling multiple pages if you only need a single reference.

4. **`loaded_data = loader.load()[0].page_content`**  
   - **📥** Calls `.load()` to perform the Wikipedia search and retrieval.  
   - **[0].page_content** selects the first document from the list of results and extracts its **text content**.  
   - You can now **store**, **summarize**, or **query** this text using LangChain or other LLM-based methods.

### **💡 Why Use `WikipediaLoader`?**
- **Immediate Access**: Instantly fetch up-to-date encyclopedia entries without manual copying.  
- **Structured Output**: Integrates seamlessly with LangChain, enabling Q&A or summarization pipelines.  
- **Customizable**: Adjust `load_max_docs` to fetch more articles or refine your query for specific topics.

> **🤖 Pro Tip**: After retrieving `page_content`, consider splitting or chunking the text for better performance in LLM tasks (especially if the article is very long).

In [None]:
# Import the ChatPromptTemplate class to build a chat-style prompt
from langchain_core.prompts import ChatPromptTemplate

# Create a chat prompt with placeholders for 'question' and 'context'
chat_template = ChatPromptTemplate.from_messages(
    [
        ("human", "Answer this {question}, here is some extra {context}"),
    ]
)

# Format the messages by substituting the actual question and the loaded data
messages = chat_template.format_messages(
    question="What was the full name of JFK",
    context=loaded_data
)

### **🔎 What’s Happening?**
1. **Prompt Definition**  
   - **`ChatPromptTemplate.from_messages(...)`**: Defines a single **human** message containing two placeholders:
     - **`{question}`**: The query the user wants answered (e.g., “What was the full name of JFK?”).  
     - **`{context}`**: Extra information or text (in this case, `loaded_data` from Wikipedia about JFK).

2. **Placeholder Substitution**  
   - **`chat_template.format_messages(...)`**: Replaces the placeholders with the actual values (`"What was the full name of JFK"` and the text in `loaded_data`).  
   - Produces a **message list** ready to be passed to an LLM or chain.

3. **Relevance of `loaded_data`**  
   - **`loaded_data`** holds the text of the first Wikipedia article returned for the query `"JFK"`.  
   - By injecting it into the prompt, you give the LLM **context** that it can reference to craft an accurate answer.

### **🤔 Why This Matters?**
- **Context Injection**: Providing background info alongside the question helps the LLM answer more precisely.  
- **Modular Design**: You can easily **swap** `question` or `context` to address different topics or data sources without rewriting the prompt logic.  
- **Scalability**: In a larger application, you might load multiple documents or even do a semantic search first, then feed the best result as `context`.

> **💡 Pro Tip**: If the `loaded_data` is lengthy, consider **text splitting** or summarizing it before injecting it into the prompt. This helps the LLM handle the data more efficiently and avoids hitting token limits.
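
The substitution step, together with the tip about long context, can be mimicked in plain Python. This sketch uses `str.format` and simple `(role, text)` tuples rather than LangChain's message objects, and it trims the context to a character budget. The helper `build_messages` and its `max_context_chars` parameter are hypothetical, and cutting by characters is a crude substitute for proper text splitting or summarizing.

```python
# Same template text as the prompt in this notebook.
TEMPLATE = "Answer this {question}, here is some extra {context}"


def build_messages(question, context, max_context_chars=4000):
    """Fill the template, truncating the context to a character budget."""
    trimmed = context[:max_context_chars]
    return [("human", TEMPLATE.format(question=question, context=trimmed))]


demo_messages = build_messages(
    "What was the full name of JFK",
    "John Fitzgerald Kennedy was the 35th U.S. president.",
    max_context_chars=23,  # small budget to make the truncation visible
)
print(demo_messages[0][1])
```

Truncation is the bluntest way to stay under a token limit; it works when the answer sits near the start of the text, as it usually does for a Wikipedia summary.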

In [None]:
response = chatModel.invoke(messages)

In [None]:
response

## How to execute the code from Visual Studio Code
* In Visual Studio Code, open the file `001-data-loaders.py`.
* In a terminal, make sure you are in the directory of the file and run:
    * `python 001-data-loaders.py`