### Retrieval Augmented Generation (RAG)

#### Loading PDF documents into LangChain's standard format

In [1]:
# newer LangChain releases expose this loader from langchain_community.document_loaders
from langchain.document_loaders import PyPDFLoader

In [2]:
# Load the CS229 machine learning lecture PDF
loader = PyPDFLoader("./docs/MachineLearning-Lecture01.pdf")
pages = loader.load()

- Each page is loaded as a `Document` in LangChain. A `Document` contains the page text (`page_content`) and a `metadata` dictionary.
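
To make the shape of a `Document` concrete, here is a minimal sketch that mirrors its two fields with a plain dataclass. `PageDocument` is a hypothetical stand-in written for illustration, not LangChain's own class; the sample text and metadata are copied from the lecture PDF loaded in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class PageDocument:
    """Hypothetical stand-in that mirrors the two fields of a LangChain Document."""
    page_content: str                              # text extracted from one page
    metadata: dict = field(default_factory=dict)   # e.g. source path and page index


doc = PageDocument(
    page_content="MachineLearning-Lecture01  \nInstructor (Andrew Ng):  Okay. Good morning.",
    metadata={"source": "./docs/MachineLearning-Lecture01.pdf", "page": 0},
)

# Slicing page_content gives a quick preview, as in the cells of this notebook.
preview = doc.page_content[:25]
print(preview)
print(doc.metadata)
```

The loaders in this notebook return a list of such objects, so `len(pages)` is the number of loaded documents and `pages[0]` is the first one.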

In [3]:
len(pages)

22

In [4]:
print(pages[0].page_content[:500])

MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is ju st spend a little time going over the logistics 
of the class, and then we'll start to  talk a bit about machine learning.  
By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so 
I personally work in machine learning, and I' ve worked on it for about 15 years now, and 
I actually think that machine learning i


In [5]:
pages[0].metadata

{'source': './docs/MachineLearning-Lecture01.pdf', 'page': 0}
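
The `page` value appears to be zero-based, since the first page reports `'page': 0`. A minimal sketch of turning this metadata into a human-readable label follows; `page_label` is a hypothetical helper, not part of LangChain, and it assumes the `source` and `page` keys shown above are present.

```python
import os


def page_label(metadata):
    """Build a readable label from loader metadata (hypothetical helper)."""
    filename = os.path.basename(metadata["source"])
    # Assumption: 'page' is zero-based, so add 1 for display.
    return f"{filename}, page {metadata['page'] + 1}"


label = page_label({"source": "./docs/MachineLearning-Lecture01.pdf", "page": 0})
print(label)
```

A label like this can be handy later in a RAG pipeline for showing which page a retrieved passage came from.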

In [6]:
# Load The Adventures of Sherlock Holmes by Arthur Conan Doyle
loader = PyPDFLoader("./docs/Sherlock_adv.pdf")
pages = loader.load()
len(pages)

162

In [7]:
print(pages[0].page_content[:500])
pages[0].metadata

The Adventures of Sherlock Holmes
Arthur Conan Doyle


{'source': './docs/Sherlock_adv.pdf', 'page': 0}

#### Loading data from a URL

In [8]:
from langchain.document_loaders import WebBaseLoader

In [9]:
data_url = "https://openai.com/research/gpt-4" # OpenAI GPT-4 blog URL
loader = WebBaseLoader(data_url)

In [10]:
url_docs = loader.load()

In [11]:
len(url_docs)

1

In [12]:
print(url_docs[0].page_content[:500])




GPT-4












CloseSearch Submit Skip to main contentSite NavigationResearchOverviewIndexGPT-4DALL·E 3APIOverviewData privacyPricingDocsChatGPTOverviewEnterpriseTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer storiesSearch Navigation quick links Log inTry ChatGPTMenu Mobile Navigation CloseSite NavigationResearchAPIChatGPTSafetyCompany Quick Links Log inTry ChatGPTSearch Submit ResearchGPT-4Illustration: Ruby ChenWe’ve created GPT-4, the latest milestone in OpenAI
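
The raw page text above starts with runs of blank lines before the article content. A minimal sketch of tidying such text with the standard library is shown here; `tidy_page_text` is a hypothetical helper, and it only normalizes whitespace. It does not remove the site navigation text, which would need separate handling.

```python
import re


def tidy_page_text(text):
    """Collapse runs of spaces and tabs, and drop blank lines (hypothetical helper)."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Sample shaped like the start of the output above.
raw = "\n\n\n\nGPT-4\n\n\n\n\nCloseSearch Submit  Skip to main content"
cleaned = tidy_page_text(raw)
print(cleaned)
```

With a loaded page, the same helper could be applied as `tidy_page_text(url_docs[0].page_content)`.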


In [13]:
url_docs[0].metadata

{'source': 'https://openai.com/research/gpt-4',
 'title': 'GPT-4',
 'description': 'We’ve created GPT-4, the latest milestone in OpenAI’s effort in scaling up deep learning. GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks.',
 'language': 'en-US'}

#### Extracting text from images

In [14]:
# Load the 75-page GPT-3 research paper; extract_images=True also converts images in the PDF to text
loader = PyPDFLoader("https://arxiv.org/pdf/2005.14165v4.pdf", extract_images=True)
pages = loader.load()

In [15]:
pages[7].page_content

'Model Name nparamsnlayersdmodelnheadsdhead Batch Size Learning Rate\nGPT-3 Small 125M 12 768 12 64 0.5M 6.0×10−4\nGPT-3 Medium 350M 24 1024 16 64 0.5M 3.0×10−4\nGPT-3 Large 760M 24 1536 16 96 0.5M 2.5×10−4\nGPT-3 XL 1.3B 24 2048 24 128 1M 2.0×10−4\nGPT-3 2.7B 2.7B 32 2560 32 80 1M 1.6×10−4\nGPT-3 6.7B 6.7B 32 4096 32 128 2M 1.2×10−4\nGPT-3 13B 13.0B 40 5140 40 128 2M 1.0×10−4\nGPT-3 175B or “GPT-3” 175.0B 96 12288 96 128 3.2M 0.6×10−4\nTable 2.1: Sizes, architectures, and learning hyper-parameters (batch size in tokens and learning rate) of the models\nwhich we trained. All models were trained for a total of 300 billion tokens.\n2.1 Model and Architectures\nWe use the same model and architecture as GPT-2 [ RWC+19], including the modiﬁed initialization, pre-normalization,\nand reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse\nattention patterns in the layers of the transformer, similar to the Sparse Transformer [ CGRS
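
The table on this page comes back as flat text, one row per line. A minimal sketch of recovering the rows as dictionaries follows. `parse_table_rows` is a hypothetical helper, and the column names are assumptions read off the run-together header (`nparamsnlayersdmodelnheadsdhead Batch Size Learning Rate`). It relies on every row starting with `GPT-3` and ending in seven whitespace-separated values, which holds for the output above but may not hold for other tables.

```python
# Assumed column names, inferred from the garbled header in the extracted text.
COLUMNS = ["n_params", "n_layers", "d_model", "n_heads", "d_head",
           "batch_size", "learning_rate"]


def parse_table_rows(text):
    """Split flattened table text into one dict per model row (hypothetical helper)."""
    rows = []
    for line in text.split("\n"):
        if not line.startswith("GPT-3 "):
            continue  # skip the header, caption, and body text
        # The model name may contain spaces, so split the 7 values off the right.
        parts = line.rsplit(None, len(COLUMNS))
        if len(parts) != len(COLUMNS) + 1 or not parts[2].isdigit():
            continue  # not shaped like a table row
        rows.append({"model": parts[0], **dict(zip(COLUMNS, parts[1:]))})
    return rows


# Sample lines copied from the output above.
sample = (
    "Model Name nparamsnlayersdmodelnheadsdhead Batch Size Learning Rate\n"
    "GPT-3 Small 125M 12 768 12 64 0.5M 6.0×10−4\n"
    "GPT-3 175B or “GPT-3” 175.0B 96 12288 96 128 3.2M 0.6×10−4\n"
    "Table 2.1: Sizes, architectures, and learning hyper-parameters"
)
rows = parse_table_rows(sample)
for row in rows:
    print(row["model"], row["n_params"], row["d_model"])
```

The values are kept as strings because the extracted text uses characters such as `×` and `−` that would need extra handling before numeric conversion.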