# Document Loaders

Loaders are lazy: creating one does no work, and nothing is read until you call an action such as `.load()`.
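To make the idea of laziness concrete, here is a toy sketch. `ToyTextLoader` is a hypothetical class written for this illustration, not part of LangChain, and the file name is made up. The constructor only stores the path, so a missing file goes unnoticed until `.load()` is called.

```python
from pathlib import Path


class ToyTextLoader:
    """Hypothetical loader used only to illustrate lazy behaviour."""

    def __init__(self, file_path: str) -> None:
        # Only remember where the data lives; do not touch the disk yet.
        self.file_path = Path(file_path)

    def load(self) -> list[str]:
        # The file is opened and read only when load() is called.
        return [self.file_path.read_text(encoding="utf-8")]


# Constructing the loader succeeds even though this (made-up) file is absent.
loader = ToyTextLoader("definitely_missing_file_for_demo.txt")

try:
    loader.load()
    load_failed = False
except FileNotFoundError:
    # The missing file is only discovered at load time.
    load_failed = True

print(load_failed)
```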

More loaders are listed at: https://python.langchain.com/docs/modules/data_connection/document_loaders/

## Some Types

### CSV

In [1]:
from langchain.document_loaders import CSVLoader

loader = CSVLoader("some_data/penguins.csv")
data = loader.load()

In [2]:
type(data)

list

In [3]:
type(data[0])

langchain.schema.document.Document

In [4]:
print(data[0].page_content)

species: Adelie
island: Torgersen
bill_length_mm: 39.1
bill_depth_mm: 18.7
flipper_length_mm: 181
body_mass_g: 3750
sex: MALE
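The output above suggests that each CSV row becomes one document whose text is a series of `column: value` lines. The sketch below reproduces that layout with the standard library only. It is an approximation for illustration, not LangChain's actual implementation: `row_to_page_content` is a hypothetical helper, and the in-memory CSV is rebuilt from the row printed above.

```python
import csv
import io


def row_to_page_content(row: dict) -> str:
    """Render one CSV row as 'column: value' lines (hypothetical helper)."""
    return "\n".join(f"{key}: {value}" for key, value in row.items())


# In-memory stand-in for the CSV file, rebuilt from the first row shown above.
sample = io.StringIO(
    "species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex\n"
    "Adelie,Torgersen,39.1,18.7,181,3750,MALE\n"
)

rows = list(csv.DictReader(sample))
first_page_content = row_to_page_content(rows[0])
print(first_page_content)
```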


### HTML

In [5]:
from langchain.document_loaders import BSHTMLLoader

loader = BSHTMLLoader("some_data/some_website.html")
data = loader.load()

In [6]:
print(data[0].page_content)

Heading 1


### PDF

**Important note:** PDF formatting is unpredictable, so loading PDF data can produce odd results. In this example, the extracted text has every word on its own line.

In [8]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("some_data/SomeReport.pdf")
pages = loader.load()

In [9]:
pages

[Document(page_content='This\nis\nthe\nfirst\nline\nPDF.\nThis\nis\nthe\nsecond\nline\nin\nthe\nPDF.\nThis\nis\nthe\nthird\nline\nin\nthe\nPDF.', metadata={'source': 'some_data/SomeReport.pdf', 'page': 0})]

In [10]:
print(pages[0].page_content)

This
is
the
first
line
PDF.
This
is
the
second
line
in
the
PDF.
This
is
the
third
line
in
the
PDF.
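If the one-word-per-line output is a problem downstream, one possible clean-up is to collapse all whitespace into single spaces. This is a minimal sketch using plain string methods. `collapse_whitespace` is a hypothetical helper, and the input is the `page_content` string shown above. The original line breaks cannot be recovered this way, so the three lines end up in one paragraph.

```python
def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (including newlines) with one space."""
    return " ".join(text.split())


# page_content as returned by the loader in the example above.
raw = (
    "This\nis\nthe\nfirst\nline\nPDF.\n"
    "This\nis\nthe\nsecond\nline\nin\nthe\nPDF.\n"
    "This\nis\nthe\nthird\nline\nin\nthe\nPDF."
)

cleaned = collapse_whitespace(raw)
print(cleaned)
```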
