# LangChain Document Loaders Overview

This notebook demonstrates how to ingest data from various sources using LangChain's document loaders, including text files, PDFs, webpages, arXiv research papers, and Wikipedia articles.

## Document Loaders Covered

- TextLoader (plain text files)
- PyPDFLoader (PDF files)
- WebBaseLoader (webpages)
- ArxivLoader (arXiv research papers)
- WikipediaLoader (Wikipedia articles)

Each section loads data from one source and prints a preview of the resulting documents.
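
Every loader in this notebook returns a list of document objects whose text is read from `page_content` and whose source details are read from `metadata`. As a rough illustration of that shape, here is a minimal sketch using a hypothetical `SimpleDocument` dataclass and a hypothetical `preview` helper. Both are stand-ins written for this example, not LangChain's own classes.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in with the two attributes this notebook reads."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def preview(doc, limit=500):
    """Return the first `limit` characters of a document's text."""
    return doc.page_content[:limit]


doc = SimpleDocument(
    page_content="LangChain makes it easy to ingest data.",
    metadata={"source": "./temp/sample_data.txt"},
)
print(preview(doc, limit=9))
print(doc.metadata["source"])
```

The same slicing pattern (`page_content[:500]`) is what the cells below use to keep their printed output short.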

In [None]:
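# If running in a fresh environment, install the packages used across all sections.
# The cells below import from `langchain.document_loaders`; in newer LangChain releases the
# loaders are provided by the langchain-community package (`langchain_community.document_loaders`).
# !pip install langchain openai python-dotenv pypdf requests beautifulsoup4 arxiv pymupdf wikipedia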

## 1. Ingesting Text Files with TextLoader

In [None]:
# Install for TextLoader (text files)
# !pip install langchain

In [None]:
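import os

from langchain.document_loaders import TextLoader

# Create a sample text file (make sure the target directory exists first)
sample_text_path = "./temp/sample_data.txt"
os.makedirs(os.path.dirname(sample_text_path), exist_ok=True)
with open(sample_text_path, "w", encoding="utf-8") as f:
    f.write("LangChain makes it easy to work with language models and ingest data from various sources.")

# Load the text file using LangChain, reading it with the same encoding it was written in
loader = TextLoader(sample_text_path, encoding="utf-8")
documents = loader.load()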

print("Loaded documents:")
for doc in documents:
    print(doc.page_content)

## 2. Ingesting PDF Files with PyPDFLoader

In [None]:
# Install for PyPDFLoader (PDF files)
# !pip install langchain pypdf

In [None]:
from langchain.document_loaders import PyPDFLoader

# Load a sample PDF file (replace pdf_path with your own PDF if needed).
# This cell assumes sample.pdf already exists in the ./temp folder.
pdf_path = "./temp/sample.pdf"
loader = PyPDFLoader(pdf_path)
documents = loader.load()

print(f"Loaded {len(documents)} pages from PDF.")
for i, doc in enumerate(documents):
    print(f"--- Page {i+1} ---")
    print(doc.page_content[:500])  # Print first 500 characters for brevity
    print()
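
The PDF cell assumes the file already exists, so a missing or mislabeled file only surfaces as an error during loading. One way to fail earlier is a quick check before creating the loader. The sketch below uses a hypothetical helper, `looks_like_pdf`, which only tests that the file exists and starts with the `%PDF-` header; it does not prove the file is a valid PDF.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if `path` is a file that begins with the PDF header bytes."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


# Demonstrate on throwaway files rather than a real PDF.
with tempfile.TemporaryDirectory() as tmp:
    fake_pdf = os.path.join(tmp, "fake.pdf")
    with open(fake_pdf, "wb") as f:
        f.write(b"%PDF-1.7\n")

    not_pdf = os.path.join(tmp, "notes.txt")
    with open(not_pdf, "wb") as f:
        f.write(b"plain text")

    pdf_ok = looks_like_pdf(fake_pdf)
    txt_ok = looks_like_pdf(not_pdf)
    missing_ok = looks_like_pdf(os.path.join(tmp, "missing.pdf"))

print(pdf_ok, txt_ok, missing_ok)
```

In the notebook, the check could guard the loader call, for example by printing a message instead of calling `PyPDFLoader(pdf_path)` when `looks_like_pdf(pdf_path)` is false.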

## 3. Ingesting Webpages with WebBaseLoader

In [None]:
# Install for WebBaseLoader (webpages)
# !pip install langchain requests beautifulsoup4

In [None]:
from langchain.document_loaders import WebBaseLoader

# Specify the URL of the webpage to ingest
url = "https://www.geeksforgeeks.org/artificial-intelligence/what-is-generative-ai/"
loader = WebBaseLoader(url)
documents = loader.load()

print(f"Loaded {len(documents)} document(s) from the webpage.")
for i, doc in enumerate(documents):
    print(f"--- Document {i+1} ---")
    print(doc.page_content[:500])  # Print first 500 characters for brevity
    print()
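
Text extracted from a webpage often contains long runs of blank lines and stray indentation left over from the HTML layout, which makes the 500-character preview hard to read. The sketch below shows one possible cleanup step using a hypothetical helper, `collapse_whitespace`. It is plain Python applied to a string such as `doc.page_content`, not part of WebBaseLoader.

```python
import re


def collapse_whitespace(text):
    """Trim each line, squeeze runs of spaces or tabs, and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Made-up sample that imitates whitespace-heavy scraped text.
raw = "\n\n  Generative AI \n\n\n\tCreates   new content \n\n"
cleaned = collapse_whitespace(raw)
print(cleaned)
```

Dropping every empty line also removes paragraph breaks, so a gentler variant may be preferable if the structure matters for later splitting.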

## 4. Ingesting arXiv Research Papers with ArxivLoader

In [None]:
# Install for ArxivLoader (arXiv research papers)
# !pip install langchain arxiv pymupdf

In [None]:
# Workaround for SSLCertVerificationError: disable HTTPS certificate verification for this process.
# This is insecure; avoid it in production and prefer fixing the local certificate store.
import ssl

ssl._create_default_https_context = ssl._create_unverified_context
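
The assignment above stays in effect for the rest of the session. If the workaround is needed at all, one option is to limit it to a single call and restore the original setting afterwards. The following is a minimal sketch using a hypothetical context manager, `unverified_https`. It relies on the same private `ssl` attributes as the cell above, so it carries the same security caveat.

```python
import contextlib
import ssl


@contextlib.contextmanager
def unverified_https():
    """Temporarily swap in the unverified HTTPS context, then restore the original."""
    original = ssl._create_default_https_context
    ssl._create_default_https_context = ssl._create_unverified_context
    try:
        yield
    finally:
        ssl._create_default_https_context = original


before = ssl._create_default_https_context
with unverified_https():
    # The failing call would go here, e.g. documents = loader.load()
    inside = ssl._create_default_https_context
after = ssl._create_default_https_context

print(inside is ssl._create_unverified_context, after is before)
```

Because the restore happens in a `finally` block, verification is re-enabled even if the wrapped call raises.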

In [None]:
from langchain.document_loaders import ArxivLoader

# The arXiv ID for 'Attention Is All You Need' is 1706.03762
arxiv_id = "1706.03762"
loader = ArxivLoader(arxiv_id)
documents = loader.load()

print(f"Loaded {len(documents)} document(s) from arXiv.")
if documents:
    print("Title:", documents[0].metadata.get('Title', 'N/A'))
    # page_content holds the extracted paper text; the abstract is stored in metadata under 'Summary'
    print("\nContent:\n", documents[0].page_content[:1000])  # Print first 1000 chars
else:
    print("No documents loaded.")
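
A mistyped identifier can silently return no documents or an unrelated paper. When the intent is to fetch one specific paper by ID, a format check before calling the loader can catch typos early. The sketch below uses a hypothetical helper, `is_new_style_arxiv_id`, and covers only new-style identifiers (`YYMM.NNNN` or `YYMM.NNNNN`, with an optional version suffix); old-style IDs such as `hep-th/9901001` are deliberately out of scope.

```python
import re

# New-style arXiv identifiers: four digits, a dot, four or five digits, optional "vN".
ARXIV_ID_PATTERN = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")


def is_new_style_arxiv_id(value):
    """Return True if `value` has the shape of a new-style arXiv identifier."""
    return ARXIV_ID_PATTERN.fullmatch(value.strip()) is not None


for candidate in ["1706.03762", "1706.03762v5", "Attention Is All You Need"]:
    print(candidate, is_new_style_arxiv_id(candidate))
```

The check validates the format only; it cannot tell whether a paper with that ID actually exists.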

## 5. Ingesting Wikipedia Articles with WikipediaLoader

In [None]:
# Install for WikipediaLoader (Wikipedia articles)
# !pip install langchain wikipedia

In [None]:
from langchain.document_loaders import WikipediaLoader

# Search query for the page to ingest; up to load_max_docs matching pages are returned
page_title = "Transformer (machine learning model)"
loader = WikipediaLoader(query=page_title, lang="en", load_max_docs=2)
documents = loader.load()

print(f"Loaded {len(documents)} document(s) from Wikipedia.")

for doc in documents:
    print(f"Title: {doc.metadata['title']}")
    print(f"URL: {doc.metadata['source']}")
    print()

# Print the beginning of the first loaded article
if documents:
    print(documents[0].page_content[:1000])  # Print first 1000 characters
else:
    print("No documents loaded.")

## Next Steps

You can now use these loaded documents for further processing, such as splitting, embedding, or querying with LangChain tools.
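
To make the splitting step concrete, here is a minimal sketch of fixed-size chunking with overlap, written in plain Python. The function name `split_with_overlap` and its parameters are hypothetical, and it stands in for the general idea rather than reproducing any LangChain text splitter. The input would typically be a `page_content` string from one of the loaders above.

```python
def split_with_overlap(text, chunk_size=40, overlap=10):
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters so that content cut at a
    boundary still appears intact in one of the two neighboring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the chunk just added already reaches the end of the text
        start += step
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Character-based splitting ignores sentence and paragraph boundaries, so in practice a splitter that respects separators usually gives better chunks for embedding.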