# Document Loaders

Many other types of documents can be loaded, including integrations, which we'll cover in the next notebook. You can see all the available document loaders here:
https://python.langchain.com/docs/modules/data_connection/document_loaders/

Keep in mind that many loaders depend on other libraries, so issues in those libraries can end up breaking the LangChain loaders.
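Because of these dependencies, it can help to check which backing modules are importable before running a loader. The sketch below uses only the standard library. The loader-to-module mapping is an assumption made for illustration (for example, that `BSHTMLLoader` relies on `bs4` and `PyPDFLoader` on `pypdf`), so confirm it against each loader's documentation.

```python
from importlib.util import find_spec

# Assumed mapping from loader name to the module it relies on.
# These pairings are illustrative; check each loader's documentation.
LOADER_DEPENDENCIES = {
    "CSVLoader": "csv",            # standard library
    "BSHTMLLoader": "bs4",         # installed as beautifulsoup4
    "PyPDFLoader": "pypdf",
    "WikipediaLoader": "wikipedia",
}


def missing_dependencies(dependencies):
    """Return the loaders whose backing module cannot be found."""
    return sorted(
        loader
        for loader, module in dependencies.items()
        if find_spec(module) is None
    )


# Prints the loaders that would likely fail in this environment.
print(missing_dependencies(LOADER_DEPENDENCIES))
```

An empty list means every module in the mapping was found. It does not guarantee that the installed versions are compatible with the loaders.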

## CSV

In [1]:
from langchain.document_loaders import CSVLoader

In [2]:
loader = CSVLoader("some_data/penguins.csv")
data = loader.load()

In [3]:
# Check the object type
type(data)

list

In [4]:
# Check the first entry
data[0]

Document(metadata={'source': 'some_data/penguins.csv', 'row': 0}, page_content='species: Adelie\nisland: Torgersen\nbill_length_mm: 39.1\nbill_depth_mm: 18.7\nflipper_length_mm: 181\nbody_mass_g: 3750\nsex: MALE')

In [5]:
# Check with proper formatting
print(data[0].page_content)

species: Adelie
island: Torgersen
bill_length_mm: 39.1
bill_depth_mm: 18.7
flipper_length_mm: 181
body_mass_g: 3750
sex: MALE
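The output above shows that each CSV row becomes one document whose `page_content` is a list of `column: value` lines. A rough standard-library sketch of that formatting follows. It is a simplified stand-in, not LangChain's actual implementation, and `rows_to_page_contents` is a hypothetical helper name.

```python
import csv
import io


def rows_to_page_contents(csv_text):
    """Render each CSV row as newline-separated 'column: value' pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        "\n".join(f"{key}: {value}" for key, value in row.items())
        for row in reader
    ]


# A small sample that uses three of the columns shown above.
sample = "species,island,bill_length_mm\nAdelie,Torgersen,39.1\n"
contents = rows_to_page_contents(sample)
print(contents[0])
```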


## HTML

In [6]:
from langchain.document_loaders import BSHTMLLoader

In [7]:
loader = BSHTMLLoader("some_data/some_website.html")
data = loader.load()
data

[Document(metadata={'source': 'some_data/some_website.html', 'title': ''}, page_content='Heading 1')]

In [10]:
data[0].page_content

'Heading 1'
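The result above contains the page's visible text plus a `title` entry in the metadata, which is empty when the page has no `<title>` tag. The sketch below shows the same idea using only the standard library's `html.parser`. It is an illustration, not how `BSHTMLLoader` is implemented; the HTML string is a hypothetical stand-in for the file, and this simple version does not filter out `<script>` or `<style>` content.

```python
from html.parser import HTMLParser


class TextAndTitleParser(HTMLParser):
    """Collect visible text chunks and the contents of <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif data.strip():
            self.chunks.append(data.strip())


def html_to_document(html, source):
    """Return a dict shaped like the Document shown above."""
    parser = TextAndTitleParser()
    parser.feed(html)
    parser.close()
    return {
        "page_content": "\n".join(parser.chunks),
        "metadata": {"source": source, "title": parser.title.strip()},
    }


# Hypothetical HTML; the real file's contents are not shown in this notebook.
doc = html_to_document(
    "<html><body><h1>Heading 1</h1></body></html>",
    "some_data/some_website.html",
)
print(doc)
```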

## PDF

In [11]:
from langchain.document_loaders import PyPDFLoader

In [12]:
loader = PyPDFLoader("some_data/SomeReport.pdf")
pages = loader.load_and_split()

In [13]:
type(pages)

list

In [14]:
# Check the first page
pages[0]

Document(metadata={'source': 'some_data/SomeReport.pdf', 'page': 0}, page_content='This\nis\nthe\nfirst\nline\nPDF.\nThis\nis\nthe\nsecond\nline\nin\nthe\nPDF.\nThis\nis\nthe\nthird\nline\nin\nthe\nPDF.')

In [15]:
# Check the content
print(pages[0].page_content)

This
is
the
first
line
PDF.
This
is
the
second
line
in
the
PDF.
This
is
the
third
line
in
the
PDF.
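In this extraction every word ends up on its own line, which is awkward to read and to pass to a model. One simple cleanup, sketched here with a hypothetical helper, is to collapse all runs of whitespace into single spaces. Note that this also discards any meaningful paragraph breaks.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace (including newlines) with one space."""
    return " ".join(text.split())


# One sentence taken from the extracted page content above.
raw = "This\nis\nthe\nsecond\nline\nin\nthe\nPDF."
print(collapse_whitespace(raw))
```

With the notebook's variables, the same helper could be applied as `collapse_whitespace(pages[0].page_content)`.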


## Integrations

You can explore all the integrations here: https://python.langchain.com/docs/modules/data_connection/document_loaders/

Let's just go through a quick example!

In [16]:
from langchain.document_loaders import HNLoader

USER_AGENT environment variable not set, consider setting it to identify your requests.
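The warning above asks for a `USER_AGENT` environment variable. A minimal sketch of setting one follows, assuming the library reads the variable when the loader module is imported, so this cell would need to run before the import. The value shown is a hypothetical identifier.

```python
import os

# Hypothetical identifier; use a value that describes your own project.
# setdefault keeps any USER_AGENT that is already set in the environment.
os.environ.setdefault("USER_AGENT", "document-loaders-notebook/0.1")

print(os.environ["USER_AGENT"])
```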


In [18]:
loader = HNLoader('https://news.ycombinator.com/item?id=30084169')

data = loader.load()
print(data[0].page_content)

nicholast on Jan 26, 2022  
             | next [–] 

He was also a jazz musician (the clarinet), a somewhat accomplished juggler, a devoted unicycle enthusiast, and left behind a basement full of contraptions he was building in various states of finish - like the electronic mouse navigating a maze, a chess playing machine, and all other kinds of curiosities. His papers are coherent and still relevant to this day and follow the birth of each of these fields like information theory and artificial intelligence. Who knows what else he might have been working on at Bell labs that we may not be privy too.


In [19]:
# Check the metadata
print(data[0].metadata)

{'source': 'https://news.ycombinator.com/item?id=30084169', 'title': 'How Claude Shannon helped kick-start machine learning'}


## Create Summary of First Comment

Let's show a simple example of passing a loaded document's text to an LLM chat model.

In [23]:
import os
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_openai import ChatOpenAI

# Read the API key from the environment rather than hard-coding it
api_key = os.getenv("OPENAI_API_KEY")

In [24]:
model = ChatOpenAI(openai_api_key=api_key)

In [25]:
human_prompt = HumanMessagePromptTemplate.from_template('Please give me a single sentence summary of the following:\n{document}')
chat_prompt = ChatPromptTemplate.from_messages([human_prompt])

In [26]:
# Call the model with invoke(); calling the model object directly is deprecated
result = model.invoke(chat_prompt.format_prompt(document=data[0].page_content).to_messages())
result.content


'Nicholast was a multi-talented individual who left behind a collection of innovative projects and papers that are still influential in fields such as information theory and artificial intelligence.'
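To see what text the model actually receives, it can help to render the template by hand. The sketch below uses plain `str.format` with the same template string as the cell above. The `max_chars` limit is an assumed safeguard for very long documents and is not part of the original example.

```python
TEMPLATE = "Please give me a single sentence summary of the following:\n{document}"


def render_prompt(document, max_chars=2000):
    """Fill the template, trimming the document to an assumed character budget."""
    return TEMPLATE.format(document=document[:max_chars])


prompt = render_prompt(
    "He was also a jazz musician (the clarinet), a somewhat accomplished juggler."
)
print(prompt)
```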

### Now we will use Wikipedia to answer questions

In [7]:
import os
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate

from langchain_openai import ChatOpenAI
from langchain.document_loaders import WikipediaLoader

In [8]:
# Now we will use Wikipedia to answer questions

def answer_question_about(person_name, question):
    '''
    Use the Wikipedia document loader to fetch an article about a person
    and pass it to the model as additional context for the question.
    '''

    # Get the Wikipedia article
    loader = WikipediaLoader(query=person_name, load_max_docs=1)
    context_text = loader.load()[0].page_content
    # Connect to the OpenAI model
    api_key = os.getenv("OPENAI_API_KEY")
    model = ChatOpenAI(openai_api_key=api_key)
    # Build the prompt template for the question and its context
    human_prompt = HumanMessagePromptTemplate.from_template('Answer this question\n{question}, here is some extra context:\n{document}')
    # Assemble chat prompt
    chat_prompt = ChatPromptTemplate.from_messages([human_prompt])

    # Query the model with invoke() and print its answer
    result = model.invoke(chat_prompt.format_prompt(question=question, document=context_text).to_messages())

    print(result.content)

In [9]:
answer_question_about('Draupadi Murmu', 'When was she born?')


She was born on June 20, 1958.


In [10]:
answer_question_about('Tom Hanks', 'What is his last movie name?')

His last movie name is "Elvis" (2022).
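The function above indexes the first loaded document directly, so a query that returns no article would raise an `IndexError`. The answer also depends entirely on the text supplied as context, so a question such as a person's latest film reflects only what that text contains. The sketch below isolates the prompt-building step with a guard for empty results. `build_question_prompt` and its `max_chars` budget are hypothetical additions, not part of the loader or of the original function.

```python
QUESTION_TEMPLATE = (
    "Answer this question\n{question}, here is some extra context:\n{document}"
)


def build_question_prompt(question, page_contents, max_chars=4000):
    """Build the prompt from the first document, failing clearly if none exist."""
    if not page_contents:
        raise ValueError("No documents were returned for this query.")
    # Trim the context to an assumed character budget.
    context = page_contents[0][:max_chars]
    return QUESTION_TEMPLATE.format(question=question, document=context)


# Placeholder context text standing in for a loaded article.
prompt = build_question_prompt("When was she born?", ["Example article text."])
print(prompt)
```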
