## Load Document $\rightarrow$ Split Document $\rightarrow$ Storage 

![steps](image.png)

### Load Document 

#### From PDF File

In [23]:
import os
from pprint import pprint

from dotenv import load_dotenv
from langchain.document_loaders import PyPDFLoader

In [17]:
load_dotenv()  # load environment variables from a local .env file

True

In [25]:
loader = PyPDFLoader("docs/test.pdf")
pages = loader.load()  # one Document per PDF page

In [2]:
len(pages)

86

In [3]:
page = pages[0]
page.metadata

{'source': 'test.pdf', 'page': 0}

In [10]:
print(page.page_content[:200])  # first 200 characters of the first page

The Rise and Potential of Large Language Model
Based Agents: A Survey
Zhiheng Xi∗†, Wenxiang Chen∗, Xin Guo∗, Wei He∗, Yiwen Ding∗, Boyang Hong∗,
Ming Zhang∗, Junzhe Wang∗, Senjie Jin∗, Enyu Zhou∗,
Ru
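
Before moving on to splitting, it can help to scan every loaded page rather than only `pages[0]`, for example to spot pages where no text was extracted. The following is a minimal sketch, not part of LangChain: `FakeDocument` is a hypothetical stand-in that only mimics the `page_content` and `metadata` attributes used above, and `summarize_pages` is an illustrative helper name.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the two attributes used above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_pages(pages, preview_chars=40):
    """Return one small summary dict per page-like object."""
    rows = []
    for doc in pages:
        text = doc.page_content
        rows.append({
            "page": doc.metadata.get("page"),
            "chars": len(text),
            # Collapse line breaks so the preview fits on one line.
            "preview": " ".join(text.split())[:preview_chars],
        })
    return rows


sample_pages = [
    FakeDocument("The Rise and Potential\nof Large Language Model",
                 {"source": "test.pdf", "page": 0}),
    FakeDocument("", {"source": "test.pdf", "page": 1}),
]
for row in summarize_pages(sample_pages, preview_chars=22):
    print(row)
```

In the notebook itself, the same helper should work on the real `pages` list, since it only reads `page_content` and `metadata`.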


#### From YouTube Link

In [13]:
import openai

from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [18]:
# os.getenv returns None when OPENAI_API_KEY is not set, so the value is not guaranteed to be a str
API_KEY = os.getenv("OPENAI_API_KEY")
openai.api_key = API_KEY

In [21]:
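url = "https://www.youtube.com/watch?v=DM5tT63Hs5I"
save_dir = "youtube/"  # directory where the downloaded audio is saved

# YoutubeAudioLoader downloads the audio; OpenAIWhisperParser transcribes it
loader = GenericLoader(YoutubeAudioLoader([url], save_dir), OpenAIWhisperParser())
docs = loader.load()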

[youtube] Extracting URL: https://www.youtube.com/watch?v=DM5tT63Hs5I
[youtube] DM5tT63Hs5I: Downloading webpage
[youtube] DM5tT63Hs5I: Downloading ios player API JSON
[youtube] DM5tT63Hs5I: Downloading android player API JSON
[youtube] DM5tT63Hs5I: Downloading m3u8 information
[info] DM5tT63Hs5I: Downloading 1 format(s): 140
[download] youtube//I Go Die For Nyash 😆😂.m4a has already been downloaded
[download] 100% of    1.28MiB
[ExtractAudio] Not converting audio youtube//I Go Die For Nyash 😆😂.m4a; file is already in target format m4a
Transcribing part 1!


In [26]:
pprint(docs[0].page_content) 

("******* ******* Madam, please, good evening. I'm going to U.S. Janssen. "
 "What's your address? Hello. U.S. Janssen. Please take your left. Left. The "
 "speed Janssen. Your right. There's a sign board there. Just take the speed "
 'Janssen. Okay. Get there. Thank you. Okay. ******* ******* ******* ******* '
 "******* ******* ******* Hello. Just pass straight. Please, I've missed the "
 "room. Hello. Yes. ******* Just take the left direction. Okay. There's a T "
 'Janssen there. Okay. Can you go back a little? ******* ******* Excuse me. '
 "Okay. Just take the right direction. Okay. Just go down, there's a ******* "
 '******* Raise your hand. ******* ***** ***** ******* ******* ******* Show me '
 'the value of that어도 ******* ******* Show me the value of thatarthy. ******* '
 '******* ******* Okay, this is good. ******* ******* ******* ******* ******* '
 '******* ******* ****** ****** *******')
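
The transcript above is interleaved with runs of asterisks. Assuming those runs mark segments that carry no usable text, one option is to drop them before the document is split. This is a minimal sketch using only the standard library; `strip_masked_tokens` is an illustrative name, not a LangChain or OpenAI function.

```python
import re


def strip_masked_tokens(text: str) -> str:
    """Remove runs of asterisks and collapse the leftover whitespace."""
    without_masks = re.sub(r"\*+", " ", text)
    return re.sub(r"\s+", " ", without_masks).strip()


sample = "******* ******* Madam, please, good evening. ******* Okay."
print(strip_masked_tokens(sample))
```

In the notebook, the same function could be applied to `docs[0].page_content`.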


#### From Webpage

In [20]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.promptingguide.ai/introduction/settings")

In [21]:
pages = loader.load() 

In [22]:
print(pages[0].page_content) 

LLM Settings | Prompt Engineering Guide Prompt Engineering GuidePrompt Engineering CoursePrompt Engineering CourseServicesServicesAboutAboutGitHubGitHub (opens in a new tab)DiscordDiscord (opens in a new tab)Prompt EngineeringIntroductionLLM SettingsBasics of PromptingPrompt ElementsGeneral Tips for Designing PromptsExamples of PromptsTechniquesZero-shot PromptingFew-shot PromptingChain-of-Thought PromptingSelf-ConsistencyGenerate Knowledge PromptingTree of ThoughtsRetrieval Augmented GenerationAutomatic Reasoning and Tool-useAutomatic Prompt EngineerActive-PromptDirectional Stimulus PromptingReActMultimodal CoTGraph PromptingApplicationsProgram-Aided Language ModelsGenerating DataGenerating CodeGraduate Job Classification Case StudyPrompt FunctionModelsFlanChatGPTLLaMAGPT-4LLM CollectionRisks & MisusesAdversarial PromptingFactualityBiasesPapersToolsNotebooksDatasetsAdditional ReadingsPrompt EngineeringIntroductionLLM SettingsBasics of PromptingPrompt ElementsGeneral Tips for Designing
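
In the output above, navigation labels run together (for example `GuidePrompt`), which suggests the text of adjacent elements was joined without a separator. The sketch below is not how `WebBaseLoader` works internally; it is a standard-library illustration, under that assumption, of extracting text nodes and joining them with a space. `TextExtractor` and `html_to_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text nodes, skipping <script> and <style> content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def html_to_text(html: str, separator: str = " ") -> str:
    """Join the text nodes of an HTML string with the given separator."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return separator.join(parser.parts)


sample_html = (
    "<nav><a>Prompt Engineering Guide</a><a>Prompt Engineering Course</a></nav>"
    "<script>var x = 1;</script><p>LLM Settings</p>"
)
print(html_to_text(sample_html))
```

Passing `separator="\n"` instead keeps one text node per line, which can make menu items easier to filter out later.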

#### From Notion

In [26]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()

In [28]:
len(docs)

1

In [29]:
docs[0].metadata

{'source': "docs/Notion_DB/Blendle's Employee Handbook 8a6917052a0742e68d842b89a76f28b5.md"}
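
The `source` path above ends in what looks like a 32-character hexadecimal ID before the `.md` extension. Assuming exported Notion files follow that pattern, a readable page title can be recovered from the metadata. This is a minimal sketch; `notion_title` is an illustrative helper, not part of `NotionDirectoryLoader`.

```python
import re
from pathlib import PurePosixPath

# Assumed pattern: whitespace followed by a 32-character lowercase hex ID.
_NOTION_ID_SUFFIX = re.compile(r"\s+[0-9a-f]{32}$")


def notion_title(source: str) -> str:
    """Return the file name without its extension and trailing hex ID."""
    stem = PurePosixPath(source).stem
    return _NOTION_ID_SUFFIX.sub("", stem)


source = "docs/Notion_DB/Blendle's Employee Handbook 8a6917052a0742e68d842b89a76f28b5.md"
print(notion_title(source))
```

File names without such a suffix are returned unchanged apart from the dropped extension.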

In [27]:
print(docs[0].page_content)

# Blendle's Employee Handbook

This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that will continue to change. 

**Everything related to working at Blendle and the people of Blendle, made public.**

These are the lessons from three years of working with the people of Blendle. It contains everything from [how our leaders lead](https://www.notion.so/ecfb7e647136468a9a0a32f1771a8f52?pvs=21) to [how we increase salaries](https://www.notion.so/Salary-Review-e11b6161c6d34f5c9568bb3e83ed96b6?pvs=21), from [how we hire](https://www.notion.so/Hiring-451bbcfe8d9b49438c0633326bb7af0a?pvs=21) and [fire](https://www.notion.so/Firing-5567687a2000496b8412e53cd58eed9d?pvs=21) to [how we think people should give each other feedback](https://www.notion.so/Our-Feedback-Process-eb64f1de796b4350aeab3bc068e3801f?pvs=21) — and much more.

We've made this document public because we want to learn 