# Document Loading

## Retrieval augmented generation

In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful if we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).

![Alt Text](img/rag.jpeg)
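
The retrieve-then-answer flow described above can be sketched in a few lines. This is only a toy illustration, not how a production RAG system works: the corpus, the word-overlap scoring, and the prompt template are all assumptions made up for this example, and word overlap stands in for the embedding-based similarity search that real systems typically use.

```python
# Toy illustration of the RAG flow: retrieve relevant text, then build a prompt.
# The corpus, scoring rule, and prompt template are hypothetical.

def tokenize(text):
    """Lowercase the text and return its words as a set, minus simple punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}

def retrieve(question, corpus, k=1):
    """Return the k documents that share the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        corpus,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, context_docs):
    """Place the retrieved documents ahead of the question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}"

corpus = [
    "Lecture 1 introduces supervised learning and course logistics.",
    "The cafeteria menu changes every Tuesday.",
]
question = "What does lecture 1 cover?"
top_docs = retrieve(question, corpus)
prompt = build_prompt(question, top_docs)
print(prompt)
```

The resulting prompt is what would be sent to the LLM, so the model answers from the retrieved context rather than from its training data alone.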


In [None]:
import os
import openai
import sys
#sys.path.append('../..')

#from dotenv import load_dotenv, find_dotenv
#_ = load_dotenv(find_dotenv()) # This loads the .env file that contains the OpenAI API key

openai.api_key = os.getenv("OPENAI_API_KEY")

## PDFs


The PDF loaded in this section is the [Machine Learning Lecture 01 Transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf).


In [None]:
from langchain.document_loaders import PyPDFLoader

# Load the lecture transcript; the loader returns one Document per page.
loader = PyPDFLoader("pdf/MachineLearning-Lecture01.pdf")
pages = loader.load()

Each page is a `Document`.

A `Document` contains text (`page_content`) and `metadata`.
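
To make that shape concrete, here is a minimal sketch using a stand-in class. `SimpleDocument` is a hypothetical name, not LangChain's own class, and the text and metadata values are invented for illustration; only the two field names come from the description above.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Hypothetical stand-in with the two fields a Document carries."""
    page_content: str
    metadata: dict = field(default_factory=dict)

# Example values are made up for illustration.
doc = SimpleDocument(
    page_content="Example page text used only for illustration.",
    metadata={"source": "pdf/MachineLearning-Lecture01.pdf", "page": 0},
)

# Slicing page_content gives a short preview without printing the whole page.
preview = doc.page_content[:20]
print(preview)
print(doc.metadata["source"], doc.metadata["page"])
```

Keeping the source location in `metadata` lets later steps trace a piece of text back to the file and page it came from.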

In [None]:
len(pages)

In [None]:
page = pages[0]

In [None]:
print(page.page_content[0:500])

In [None]:
page.metadata

## YouTube

In [None]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [None]:
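# Download the video's audio into save_dir, then transcribe it with OpenAI Whisper.
url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir = "docs/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),  # fetches the audio track
    OpenAIWhisperParser(),  # converts the audio to text
)
docs = loader.load()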
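
The cell above pairs a component that fetches raw data with a component that turns that data into documents. The sketch below imitates that two-stage pattern with toy classes so it can run without network access or an API key. Every class name here (`Blob`, `ListBlobLoader`, `Utf8Parser`, `TwoStageLoader`, `SimpleDocument`) is a hypothetical stand-in, and the method names are assumptions rather than LangChain's actual interface.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

@dataclass
class Blob:
    """Raw bytes plus a note of where they came from (hypothetical)."""
    data: bytes
    source: str

@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

class ListBlobLoader:
    """Toy blob loader: yields in-memory blobs instead of downloading audio."""
    def __init__(self, items: List[Blob]):
        self.items = items

    def yield_blobs(self) -> Iterator[Blob]:
        yield from self.items

class Utf8Parser:
    """Toy parser: decodes bytes as text instead of transcribing audio."""
    def parse(self, blob: Blob) -> List[SimpleDocument]:
        text = blob.data.decode("utf-8")
        return [SimpleDocument(text, {"source": blob.source})]

class TwoStageLoader:
    """Runs every blob from the blob loader through the parser."""
    def __init__(self, blob_loader, parser):
        self.blob_loader = blob_loader
        self.parser = parser

    def load(self) -> List[SimpleDocument]:
        documents = []
        for blob in self.blob_loader.yield_blobs():
            documents.extend(self.parser.parse(blob))
        return documents

toy_loader = TwoStageLoader(
    ListBlobLoader([Blob(b"hello world", "memory://example")]),
    Utf8Parser(),
)
toy_docs = toy_loader.load()
print(toy_docs[0].page_content, toy_docs[0].metadata)
```

Separating fetching from parsing means either half can be swapped independently, for example a different source of audio with the same transcription step.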