# Document Loaders

LangChain provides loaders for many types of documents, including third-party integrations, which we'll cover in the next notebook. You can see all the available document loaders here:
https://python.langchain.com/docs/modules/data_connection/document_loaders/

Keep in mind that many loaders depend on third-party libraries, so a bug or breaking change in one of those libraries can end up breaking the corresponding LangChain loader.
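
Because of these dependencies, it can help to confirm that the supporting libraries are installed before calling a loader. The following is a minimal sketch using only the standard library. The `OPTIONAL_DEPENDENCIES` mapping and the `missing_dependencies` helper are illustrative names, not part of LangChain, and the module names listed are assumptions that should be checked against the installed LangChain version.

```python
from importlib.util import find_spec

# Assumed mapping from loader name to the top-level module it imports at runtime.
# Verify these against the documentation for your installed LangChain version.
OPTIONAL_DEPENDENCIES = {
    "BSHTMLLoader": "bs4",   # typically installed with `pip install beautifulsoup4`
    "PyPDFLoader": "pypdf",  # typically installed with `pip install pypdf`
}


def missing_dependencies(requirements):
    """Return {loader_name: module_name} for every module that cannot be found."""
    return {
        loader: module
        for loader, module in requirements.items()
        if find_spec(module) is None
    }


if __name__ == "__main__":
    for loader, module in missing_dependencies(OPTIONAL_DEPENDENCIES).items():
        print(f"{loader} needs the '{module}' module, which is not installed")
```

An empty result means every listed module can be imported. It does not guarantee that the installed versions are compatible with the loader.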

## CSV

In [1]:
from langchain.document_loaders import CSVLoader

In [2]:
loader = CSVLoader("some_data/penguins.csv")
data = loader.load()

In [3]:
# Check the object type (load() returns a list of Documents)
type(data)

list

In [4]:
# Check the first entry
data[0]

Document(metadata={'source': 'some_data/penguins.csv', 'row': 0}, page_content='species: Adelie\nisland: Torgersen\nbill_length_mm: 39.1\nbill_depth_mm: 18.7\nflipper_length_mm: 181\nbody_mass_g: 3750\nsex: MALE')

In [5]:
# Print the content so the newlines render as line breaks
print(data[0].page_content)

species: Adelie
island: Torgersen
bill_length_mm: 39.1
bill_depth_mm: 18.7
flipper_length_mm: 181
body_mass_g: 3750
sex: MALE
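
The output above suggests that each CSV row becomes one document whose content is a `column: value` line per field, with the file path and row index stored as metadata. The sketch below reproduces that shape with the standard `csv` module. It is an illustration of the observed behavior, not LangChain's implementation: `rows_to_documents` is a hypothetical helper, plain dicts stand in for `Document` objects, and the in-memory sample reuses only the first three columns of the row shown above.

```python
import csv
import io


def rows_to_documents(csv_file, source):
    """Turn each CSV row into a dict shaped like the Document shown above."""
    documents = []
    for index, row in enumerate(csv.DictReader(csv_file)):
        # One "column: value" line per field, in column order
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": index}}
        )
    return documents


# Small in-memory sample (an assumption for illustration, not the real file)
sample = io.StringIO(
    "species,island,bill_length_mm\n"
    "Adelie,Torgersen,39.1\n"
)
docs = rows_to_documents(sample, source="some_data/penguins.csv")
print(docs[0]["page_content"])
```

Storing the row index in the metadata makes it possible to trace any document back to the exact line of the source file.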


## HTML

In [6]:
from langchain.document_loaders import BSHTMLLoader

In [7]:
loader = BSHTMLLoader("some_data/some_website.html")
data = loader.load()
data

[Document(metadata={'source': 'some_data/some_website.html', 'title': ''}, page_content='Heading 1')]

## PDF

In [8]:
from langchain.document_loaders import PyPDFLoader

In [9]:
loader = PyPDFLoader("some_data/SomeReport.pdf")
# load_and_split() loads the PDF page by page, then splits long pages into smaller chunks
pages = loader.load_and_split()

In [10]:
type(pages)

list

In [11]:
# Check the first page
pages[0]

Document(metadata={'source': 'some_data/SomeReport.pdf', 'page': 0}, page_content='This\nis\nthe\nfirst\nline\nPDF.\nThis\nis\nthe\nsecond\nline\nin\nthe\nPDF.\nThis\nis\nthe\nthird\nline\nin\nthe\nPDF.')

In [12]:
# Check the content
print(pages[0].page_content)

This
is
the
first
line
PDF.
This
is
the
second
line
in
the
PDF.
This
is
the
third
line
in
the
PDF.
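
In the output above every word sits on its own line, which is most likely an artifact of how the text was extracted from this particular PDF. If that layout gets in the way downstream, one option is to collapse the whitespace after loading. This is a minimal sketch: `collapse_whitespace` is a hypothetical helper, not a LangChain function, and the sample string is copied from the second sentence of the output above.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace (including newlines) with a single space."""
    return " ".join(text.split())


raw = "This\nis\nthe\nsecond\nline\nin\nthe\nPDF."
print(collapse_whitespace(raw))
```

The same helper could be applied to a loaded page with `collapse_whitespace(pages[0].page_content)`. Note that it also removes intentional line breaks, so it suits text where paragraph structure does not matter.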
