# **Document Loading**

Document loading is the process of fetching content from external sources, such as files, websites, or databases, into an application. Once loaded, the text, data, or multimedia can be displayed, processed, or transformed regardless of where it originated. This capability underpins real-time updates, content integration, and efficient data handling in modern software and web development. This notebook uses LangChain document loaders to read a PDF, a web page, and a CSV file.
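
Each loader used in this notebook returns a list of document objects that expose `page_content` (the text) and `metadata` (a dictionary describing where the text came from). The following is a minimal standard-library sketch of that shape. `SimpleDocument` and `load_text_file` are hypothetical names invented for illustration and are not part of LangChain.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for the object a loader returns."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str) -> list[SimpleDocument]:
    """Read one UTF-8 text file and wrap it in a single document."""
    text = Path(path).read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Usage: write a small temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = str(Path(tmp_dir) / "note.txt")
    Path(sample_path).write_text("Hello, loader.", encoding="utf-8")
    docs = load_text_file(sample_path)

print(docs[0].page_content)
print(docs[0].metadata)
```

Keeping the text and its provenance together is what lets later steps trace a passage back to its source file, page, or row.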



In [1]:
%%capture
# update or install the necessary libraries
!pip install --upgrade langchain langchain_community langchain_aws pypdf

In [2]:
import os
from google.colab import userdata

# Copy the AWS credentials stored in Colab secrets into environment variables
os.environ["AWS_ACCESS_KEY_ID"] = userdata.get('AWS_ACCESS_KEY_ID')
os.environ["AWS_SECRET_ACCESS_KEY"] = userdata.get('AWS_SECRET_ACCESS_KEY')
os.environ["AWS_DEFAULT_REGION"] = userdata.get('AWS_DEFAULT_REGION')

# **PDF**

In [4]:
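# Loaders live in langchain_community; the langchain.document_loaders path is deprecated
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("/content/Documents/MachineLearning-Lecture01.pdf")
pages = loader.load()  # one Document per PDF page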

In [5]:
len(pages)  # number of pages loaded

page = pages[0]
print(page.page_content[0:500])  # first 500 characters of the first page

page.metadata

MachineLearning-Lecture01  
Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is just spend a little time going over the logistics 
of the class, and then we'll start to talk a bit about machine learning.  
By way of introduction, my name's Andrew Ng and I'll be instructor for this class. And so 
I personally work in machine learning, and I've worked on it for about 15 years now, and 
I actually think that machine learning is the 


{'producer': 'Acrobat Distiller 8.1.0 (Windows)',
 'creator': 'PScript5.dll Version 5.2.2',
 'creationdate': '2008-07-11T11:25:23-07:00',
 'author': '',
 'moddate': '2008-07-11T11:25:23-07:00',
 'title': '',
 'source': '/content/Documents/MachineLearning-Lecture01.pdf',
 'total_pages': 22,
 'page': 0,
 'page_label': '1'}
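
Because each page carries its own text and a `page` entry in its metadata, locating the pages that mention a term takes only a few lines. The sketch below assumes a hypothetical helper, `find_pages`, and uses `SimpleNamespace` stand-ins so that it runs without the PDF. It relies only on the `page_content` and `metadata` attributes, so it should also accept the `pages` list loaded above.

```python
from types import SimpleNamespace


def find_pages(pages, keyword):
    """Return the `page` metadata value of every page containing keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        p.metadata.get("page")
        for p in pages
        if needle in p.page_content.lower()
    ]


# Stand-in pages for illustration; real pages would come from loader.load().
sample_pages = [
    SimpleNamespace(
        page_content="Welcome to the machine learning class.",
        metadata={"page": 0},
    ),
    SimpleNamespace(
        page_content="Logistics of the class.",
        metadata={"page": 1},
    ),
]

matches = find_pages(sample_pages, "Machine Learning")
print(matches)
```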

# **URL**

In [6]:
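# Loaders live in langchain_community; the langchain.document_loaders path is deprecated
from langchain_community.document_loaders import WebBaseLoader

# Fetch the page and extract its text
loader = WebBaseLoader("https://swayaan.com/")

docs = loader.load()
content = docs[0].page_content.strip()
content[:1000]  # preview the first 1,000 characters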



'Account Suspended!\nPlease contact our support team for further assistance.\n\n*If you’re the owner of this website and have questions, reach out to Bluehost. We’re happy to help.'

# **CSV**

In [7]:
from langchain_community.document_loaders.csv_loader import CSVLoader

file_path = "/content/Documents/employee.csv"

loader = CSVLoader(file_path=file_path)
data = loader.load()

for record in data[:4]:
    print(record)

page_content='Employee ID: 1
Employee Name: John Doe
Designation: Software Engineer
Tools Used: Eclipse, Git, JIRA
Date of Birth: 15-03-1985
Salary: $75,000
Hire Date: 20-06-2010
: ' metadata={'source': '/content/Documents/employee.csv', 'row': 0}
page_content='Employee ID: 2
Employee Name: Jane Smith
Designation: UI/UX Designer
Tools Used: Figma, Adobe XD, Sketch
Date of Birth: 22-08-1990
Salary: $55,000
Hire Date: 10-02-2019
: ' metadata={'source': '/content/Documents/employee.csv', 'row': 1}
page_content='Employee ID: 3
Employee Name: Alice Brown
Designation: Database Administrator
Tools Used: MySQL, MongoDB, Oracle
Date of Birth: 10-11-1988
Salary: $60,000
Hire Date: 05-04-2015
: ' metadata={'source': '/content/Documents/employee.csv', 'row': 2}
page_content='Employee ID: 4
Employee Name: Bob White
Designation: DevOps Engineer
Tools Used: Jenkins, Docker, Kubernetes
Date of Birth: 02-04-1980
Salary: $80,000
Hire Date: 15-09-2013
: ' metadata={'source': '/content/Documents/employee.
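
The output above shows one document per CSV row, with each column rendered as a `header: value` line and the row index stored in the metadata. A rough standard-library sketch of that transformation follows. `rows_to_documents` is a hypothetical helper, the sample data is invented, and the code is an approximation of the observed output rather than LangChain's implementation.

```python
import csv
import io


def rows_to_documents(csv_text, source):
    """Turn each CSV row into a dict with `page_content` and `metadata` keys."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        # One "header: value" line per column, joined with newlines.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {
                "page_content": content,
                "metadata": {"source": source, "row": row_index},
            }
        )
    return documents


# Invented two-row sample for illustration.
sample_csv = "Employee ID,Employee Name\n1,John Doe\n2,Jane Smith\n"
records = rows_to_documents(sample_csv, source="employee.csv")

print(records[0]["page_content"])
print(records[0]["metadata"])
```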

# **Let's Do an Activity**

## **Objective**

Practice document loading techniques with LangChain to fetch and process content from various sources such as PDFs, URLs, and CSV files.

## **Scenario**

You are developing a data processing module that needs to retrieve and analyze information from different types of documents. This activity will help you familiarize yourself with document loading capabilities in LangChain and understand how to handle diverse data sources effectively.

## **Steps**

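* Load a PDF document with `PyPDFLoader`
* Load web content from a URL with `WebBaseLoader`
* Load CSV data with `CSVLoader`
* Explore and analyze the loaded `page_content` and `metadata`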
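
One possible starting point for the final step is to summarize what was loaded, grouped by source. The sketch below assumes a hypothetical `summarize` helper and uses invented stand-in documents. It depends only on the `page_content` and `metadata` attributes, so the combined output of the three loaders could be passed in instead.

```python
from collections import Counter
from types import SimpleNamespace


def summarize(documents):
    """Count documents and characters for each `source` metadata value."""
    doc_counts = Counter()
    char_counts = Counter()
    for doc in documents:
        source = doc.metadata.get("source", "unknown")
        doc_counts[source] += 1
        char_counts[source] += len(doc.page_content)
    return {
        source: {
            "documents": doc_counts[source],
            "characters": char_counts[source],
        }
        for source in doc_counts
    }


# Invented stand-ins; replace with, for example, pages + docs + data.
sample_docs = [
    SimpleNamespace(page_content="abc", metadata={"source": "a.pdf"}),
    SimpleNamespace(page_content="defgh", metadata={"source": "a.pdf"}),
    SimpleNamespace(page_content="xy", metadata={"source": "b.csv"}),
]

summary = summarize(sample_docs)
for source, stats in summary.items():
    print(source, stats)
```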