# PDF Loader
This notebook shows how to load PDF documents into the `Document` format that we use downstream.

In [1]:
!pip install python-dotenv langchain openai



## Using PyPDF
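Load a PDF with `pypdf` into a list of documents, where each document contains the page content and metadata that includes the page number. Note that `load_and_split()` also runs a text splitter, so a long page may be returned as several chunks.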

In [12]:
!pip install pypdf pymupdf faiss-cpu

Collecting pymupdf
  Obtaining dependency information for pymupdf from https://files.pythonhosted.org/packages/ca/dd/3301dba92880ed2b0f283074320d1d00ba9afe5d98334239b8a1ba519563/PyMuPDF-1.22.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading PyMuPDF-1.22.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.3 kB)
Downloading PyMuPDF-1.22.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.1 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.22.5


In [10]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/MetaAI - LLM guide with Llama2, fine tuning.pdf")
pages = loader.load_and_split()
pages[10]

Document(page_content='The responsible fine-tuning flow\nHere are the general steps needed to responsibly fine-\ntune an LLM for alignment, guided at a high  \nlevel by Meta’s Responsible AI  framework:\n1. Define content policies & mitigations\n2. Prepare data  \n3. Train the model\n4. Evaluate and improve performance \nSTEP 1: DEFINE CONTENT POLICIES & MITIGATIONS \nBased on the intended use and audience for your \nproduct, a content policy will define what content \nis allowable and may outline safety limitations on \nproducing illegal, violent, or harmful content. These \nlimits should be evaluated in light of the product \ndomain, as specific sectors and regions may have \ndifferent laws or standards. Additionally, the needs \nof specific user communities should be considered as \nyou design content policies, such as the development \nof age-appropriate product experiences. Having \nthese policies in place will dictate the data needed, \nannotation requirements, and goals for safe
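
Because each document carries its page number in `metadata["page"]`, chunks can be grouped back by source page after splitting. The sketch below is a minimal illustration that uses a stand-in `Doc` dataclass (hypothetical, not LangChain's `Document` class) so it runs without a PDF; the helper name `group_by_page` is also an assumption. It only reads `.metadata` and `.page_content`, so it should also work on the `pages` list above.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a loaded document (hypothetical; mirrors the two attributes used here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_page(docs):
    """Map each page number to the list of chunk texts that came from that page."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata["page"]].append(doc.page_content)
    return dict(grouped)


# Toy data standing in for the output of a PDF loader.
sample = [
    Doc("first chunk of page 0", {"page": 0}),
    Doc("second chunk of page 0", {"page": 0}),
    Doc("only chunk of page 1", {"page": 1}),
]
pages_to_chunks = group_by_page(sample)
print({page: len(chunks) for page, chunks in pages_to_chunks.items()})
```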

## Using PyMuPDF
PyMuPDF is the fastest of the PDF parsing options. It returns one document per page and includes detailed metadata about the PDF and its pages.

In [14]:
from langchain.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("data/MetaAI - LLM guide with Llama2, fine tuning.pdf")
data = loader.load()
data[10]

Document(page_content='The responsible fine-tuning flow\nHere are the general steps needed to responsibly fine-\ntune an LLM for alignment, guided at a high  \nlevel by Meta’s Responsible AI framework:\n1. Define content policies & mitigations\n2. Prepare data  \n3. Train the model\n4. Evaluate and improve performance \nSTEP 1: DEFINE CONTENT POLICIES & MITIGATIONS \nBased on the intended use and audience for your \nproduct, a content policy will define what content \nis allowable and may outline safety limitations on \nproducing illegal, violent, or harmful content. These \nlimits should be evaluated in light of the product \ndomain, as specific sectors and regions may have \ndifferent laws or standards. Additionally, the needs \nof specific user communities should be considered as \nyou design content policies, such as the development \nof age-appropriate product experiences. Having \nthese policies in place will dictate the data needed, \nannotation requirements, and goals for safet
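
Whether PyMuPDF is actually faster on a given file is easy to check. Below is a minimal timing sketch; `time_call` is a hypothetical helper, and `dummy_load` stands in for a real loader's `load` method so the snippet runs without a PDF.

```python
import time


def time_call(fn, repeats=3):
    """Call fn() `repeats` times; return (best elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return best, result


def dummy_load():
    # Stand-in for a loader's load() method (assumption for illustration).
    return ["page %d" % i for i in range(50)]


elapsed, loaded = time_call(dummy_load)
print(f"{len(loaded)} documents in {elapsed:.6f}s")
```

With the loaders above, passing `loader.load` from a `PyPDFLoader` and from a `PyMuPDFLoader` built on the same file would give a rough comparison.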

Initialize a FAISS vector store with OpenAI embeddings, then use similarity search to retrieve the top 2 relevant documents.

In [8]:
import os
from dotenv import load_dotenv
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings

load_dotenv() # Load environment variables from .env file
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") # Get API key from environment variable

faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())
docs = faiss_index.similarity_search("What does it say about fine-tuning the model?", k=2)
for doc in docs:
    print(str(doc.metadata["page"]) + ":", doc.page_content[:300])

9: Fine-tune for product 
Product-specific fine-tuning enables developers to 
leverage pretrained models or models with some  
fine-tuning for a specific task requiring only limited 
data and resources. Even with initial fine-tuning 
performed by Meta, developers can further train the 
model with domai
11: will depend on the specific context in which a product 
is deployed. Developers should also pay attention 
to how human feedback and annotation of data may 
further polarize a fine-tuned model with respect 
to subjective opinions, and take steps to prevent 
injecting bias in annotation guidelines an
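
Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below shows that idea with brute-force Euclidean distance over toy 2-D vectors. The vectors, the labels, the `top_k` helper, and the choice of distance are all assumptions for illustration; this is not FAISS's actual implementation, and the numbers are not real OpenAI embeddings.

```python
import math


def top_k(query_vec, items, k=2):
    """Return the k (label, vector) items nearest to query_vec by Euclidean distance."""
    ranked = sorted(items, key=lambda item: math.dist(query_vec, item[1]))
    return ranked[:k]


# Toy "index": each entry pairs a label with a made-up 2-D embedding.
toy_index = [
    ("page 9", (0.9, 0.1)),
    ("page 11", (0.7, 0.3)),
    ("page 2", (0.1, 0.9)),
]
query = (1.0, 0.0)  # made-up query embedding
nearest = [label for label, _ in top_k(query, toy_index, k=2)]
print(nearest)
```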


In [9]:
for doc in docs:
    # Print the full page content instead of only the first 300 characters
    print(str(doc.metadata["page"]) + ":", doc.page_content)

9: Fine-tune for product 
Product-specific fine-tuning enables developers to 
leverage pretrained models or models with some  
fine-tuning for a specific task requiring only limited 
data and resources. Even with initial fine-tuning 
performed by Meta, developers can further train the 
model with domain-specific datasets to improve 
quality on their defined use case. 
Fine-tuning adapts the model 
to domain- or application-
specific requirements and 
introduces additional layers  
of safety mitigations. Examples of fine-tuning for a pretrained  
LLM include:
• Text summarization: By using a pretrained 
language model, the model can be fine-tuned 
on a dataset that includes pairs of long-form 
documents and corresponding summaries. This 
fine-tuned model can then generate concise 
summaries for new documents.
• Question answering: Fine-tuning a language 
model on a Q&A dataset such as SQuAD 
(Stanford Question Answering Dataset) allows 
the model to learn how to answer questions based 
