# **Document Loading**

Document loading is the process of fetching content from external sources, such as files, websites, or databases, into an application. It lets the application display, process, or transform data, text, or multimedia drawn from many repositories, which matters for real-time updates, content integration, and efficient data handling. In LangChain, a loader's `load()` method returns a list of document objects, each exposing its text as `page_content` and its source details as `metadata`.



In [None]:
%%capture
# update or install the necessary libraries
!pip install --upgrade langchain langchain_community langchain_aws pypdf python-dotenv beautifulsoup4

In [None]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# load_dotenv() has already populated os.environ from the .env file,
# so only verify that the AWS settings are present.
for key in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION"):
    if not os.getenv(key):
        raise EnvironmentError(f"Missing environment variable: {key}")

# **PDF**

In [None]:
from langchain_community.document_loaders import PyPDFLoader

# Each page of the PDF becomes one document
loader = PyPDFLoader("./content/MachineLearning-Lecture01.pdf")
pages = loader.load()

In [None]:
# Print explicitly: only the last expression in a cell is displayed
print(len(pages))

page = pages[0]
print(page.page_content[0:500])

page.metadata

# **URL**

In [None]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://swayaan.com/")

docs = loader.load()
content = docs[0].page_content.strip()
content[:1000]
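
Text scraped from a web page often contains long runs of blank lines and stray spaces left over from the HTML layout. The sketch below shows one possible way to tidy such text before further processing. The helper name `normalize_whitespace` and the sample string are hypothetical and only stand in for `docs[0].page_content`; this is not part of LangChain.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and drop blank lines in scraped text."""
    # Collapse internal runs of spaces or tabs, then trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    # Keep only non-empty lines so the text reads as one compact block.
    return "\n".join(line for line in lines if line)


# Hypothetical sample standing in for docs[0].page_content
sample = "\n\n  Home \t  About\n\n\n\nContact   us  \n"
print(normalize_whitespace(sample))
```

In the notebook, the same helper could be applied as `normalize_whitespace(docs[0].page_content)`.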

# **CSV**

In [None]:
from langchain_community.document_loaders.csv_loader import CSVLoader

file_path = (
    "./content/employee.csv"
)

loader = CSVLoader(file_path=file_path)
data = loader.load()

for record in data[:4]:
    print(record)
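
To build intuition for what the printed records contain, the following standard-library sketch approximates the row-to-document mapping: each CSV row becomes one record whose text lists `column: value` pairs, with the source and row index kept as metadata. The function `rows_to_documents`, the dictionary layout, and the sample columns are assumptions made for illustration; they are not the `CSVLoader` implementation and do not reflect the actual contents of `employee.csv`.

```python
import csv
import io


def rows_to_documents(csv_text: str, source: str = "in-memory") -> list[dict]:
    """Turn each CSV row into a dict with page_content and metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        # One "column: value" line per field, joined into a single string.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {
                "page_content": content,
                "metadata": {"source": source, "row": row_index},
            }
        )
    return documents


# Hypothetical sample data; the real employee.csv columns may differ.
sample_csv = "name,department\nAsha,Engineering\nRavi,Sales\n"
for doc in rows_to_documents(sample_csv):
    print(doc)
```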

# **Let's Do an Activity**

## **Objective**

Practice document loading techniques with LangChain to fetch and process content from various sources such as PDFs, URLs, and CSV files.

## **Scenario**

You are developing a data processing module that needs to retrieve and analyze information from different types of documents. This activity will help you familiarize yourself with document loading capabilities in LangChain and understand how to handle diverse data sources effectively.

## **Steps**

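* Load a PDF document with `PyPDFLoader`
* Load web content from a URL with `WebBaseLoader`
* Load CSV data with `CSVLoader`
* Explore and analyze the loaded documents (`page_content` and `metadata`)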
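
For the final step, a small summary helper can make it easier to compare what each loader returned. The sketch below is one possible starting point and assumes only that every document exposes `page_content` and `metadata`, as in the cells above. `SimpleDoc` and `summarize` are hypothetical names, and the sample documents are made up so the example runs without any files or network access.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in exposing page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize(docs) -> dict:
    """Report document count, text sizes, and the metadata keys present."""
    lengths = [len(doc.page_content) for doc in docs]
    return {
        "count": len(docs),
        "total_chars": sum(lengths),
        # Guard against an empty list to avoid division by zero.
        "avg_chars": sum(lengths) / len(lengths) if lengths else 0.0,
        "metadata_keys": sorted({key for doc in docs for key in doc.metadata}),
    }


# Made-up documents for illustration only
sample_docs = [
    SimpleDoc("first page", {"source": "a.pdf", "page": 0}),
    SimpleDoc("second", {"source": "a.pdf", "page": 1}),
]
print(summarize(sample_docs))
```

Because the helper relies only on those two attributes, it should also accept the lists returned by the PDF, URL, and CSV loaders, for example `summarize(pages)`.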