[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1vCg_Zb_hBTfus1rMCCtwRwz_FvvHtKPm?usp=sharing)


# Build a PDF ingestion and Question/Answering system

PDF files often hold crucial unstructured data unavailable from other sources. They can be quite lengthy, and unlike plain text files, cannot generally be fed directly into the prompt of a language model.

In this tutorial, I build a system that can answer questions about PDF files. More specifically, I'll use a [Document Loader](https://python.langchain.com/docs/concepts/document_loaders/) to load text in a format usable by an LLM, then build a retrieval-augmented generation (RAG) pipeline to answer questions, including citations from the source material.

Let's dive in!

## Loading documents
First, you'll need to choose a PDF to load. I'm using a PDF of the book 'The Bitcoin Standard' by Saifedean Ammous, which is over 300 pages long. Feel free to use a PDF of your choosing instead.

Once you've chosen your PDF, the next step is to load it into a format that an LLM can more easily handle, since LLMs generally require text inputs. LangChain has a few different [built-in document loaders](https://python.langchain.com/docs/how_to/document_loader_pdf/) for this purpose which you can experiment with. Below, we'll use one powered by the `pypdf` package that reads from a filepath:

In [None]:
!pip install pypdf langchain_community

In [None]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "The Bitcoin Standard.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))


304


So what just happened?

- The loader reads the PDF at the specified path into memory.
- It then extracts text data using the `pypdf` package.
- Finally, it creates a LangChain [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) for each page of the PDF with the page's content and some metadata about where in the document the text came from.
  
LangChain has [many other document loaders](https://python.langchain.com/docs/integrations/document_loaders/) for other data sources, or you can create a [custom document loader](https://python.langchain.com/docs/how_to/document_loader_custom/).

## Question answering with RAG
Next, you'll prepare the loaded documents for later retrieval. Using a [text splitter](https://python.langchain.com/docs/concepts/text_splitters/), you'll split your loaded documents into smaller documents that can more easily fit into an LLM's context window, then load them into a [vector store](https://python.langchain.com/docs/concepts/vectorstores/). You can then create a [retriever](https://python.langchain.com/docs/concepts/retrievers/) from the vector store for use in the RAG chain.
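
To make the splitting step concrete, here is a minimal sketch of fixed-size chunking with overlap, using only the standard library. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on natural separators such as paragraphs and sentences). The function name `chunk_text` is hypothetical, and its parameters are named to mirror the `chunk_size` and `chunk_overlap` arguments used later in this tutorial.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence cut
    at a chunk boundary still appears whole in one of the two neighbours.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Toy input (hypothetical), small enough to inspect by eye.
sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
```

Each chunk after the first begins with the last four characters of the previous one. That shared region is what the overlap setting buys you: context that straddles a boundary is not lost to retrieval.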

### Setting Up Groq with LangChain for LLMs

We will set up access to Groq's API (create an account and get your free API key from the [Groq Console](https://console.groq.com/keys)) and use it with LangChain to run a model called `llama3-8b-8192`.

In [None]:
!pip install langchain-groq

In [None]:
import getpass
import os

os.environ["GROQ_API_KEY"] = getpass.getpass()

from langchain_groq import ChatGroq

llm = ChatGroq(model="llama3-8b-8192")



## In-Memory Vector Store with Custom Embeddings

This code snippet demonstrates how to build an in-memory vector store for information retrieval using custom embeddings. Below is a breakdown of the components and their functions:

### Libraries and Classes Used:
- **`InMemoryVectorStore`**: A lightweight vector store for storing and retrieving document embeddings.
- **`RecursiveCharacterTextSplitter`**: A utility for splitting large documents into smaller chunks for efficient embedding.
- **`SentenceTransformer`**: A class from the `sentence-transformers` package used for creating embeddings from documents and queries.
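
As a rough illustration of what "storing and retrieving document embeddings" means, the sketch below ranks a few stored vectors by cosine similarity to a query vector. It shows the general idea only and is not the implementation of `InMemoryVectorStore`. The helper names (`cosine_similarity`, `top_k`) and the three-dimensional toy vectors are made up for this example; real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy "embeddings", invented for illustration only.
store = [
    ("gold as money", [0.9, 0.1, 0.0]),
    ("bitcoin supply cap", [0.1, 0.9, 0.1]),
    ("cooking recipes", [0.0, 0.1, 0.9]),
]
query = [0.2, 0.8, 0.0]
nearest = top_k(query, store, k=2)
print(nearest)
```

In the real pipeline, the embedding model produces both the stored vectors (one per chunk) and the query vector, so the same similarity measure compares like with like.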

### Custom Embedding Class:
The `HFEmbeddings` class wraps the `SentenceTransformer` model for embedding documents and queries.


In [None]:
!pip install langchain-core langchain-text-splitters sentence-transformers

In [None]:
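from langchain_core.embeddings import Embeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

class HFEmbeddings(Embeddings):
    """Adapter exposing a SentenceTransformer model through LangChain's Embeddings interface."""

    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def embed_documents(self, texts):
        """Embed a list of documents; returns one list of floats per text."""
        # encode() returns a NumPy array; convert it to plain Python lists.
        return self.model.encode(texts, show_progress_bar=True).tolist()

    def embed_query(self, query):
        """Embed a single query; returns a list of floats."""
        return self.model.encode(query, show_progress_bar=False).tolist()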

# Split the documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# Create an instance of the embedding class
embeddings = HFEmbeddings()

# Create the vectorstore using the custom embeddings
vectorstore = InMemoryVectorStore.from_documents(
    documents=splits, embedding=embeddings
)

retriever = vectorstore.as_retriever()


Finally, you'll use some built-in helpers to construct the final `rag_chain`:



In [None]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate


system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)


question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
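
Conceptually, the "stuff" step takes the text of every retrieved chunk and places it into the `{context}` slot of the system prompt before the LLM is called. The sketch below imitates that with plain string formatting. It is an approximation for intuition, not the code behind `create_stuff_documents_chain`; the function `stuff_context`, the blank-line separator, and the two sample chunks are assumptions made for this example.

```python
def stuff_context(system_template: str, chunks: list[str], separator: str = "\n\n") -> str:
    """Join retrieved chunks and substitute them for {context} in the template."""
    context = separator.join(chunks)
    return system_template.format(context=context)


# Shortened version of the system prompt used in this tutorial.
template = (
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know."
    "\n\n"
    "{context}"
)

# Hypothetical retrieved chunks standing in for real page content.
retrieved = ["Chunk one about scarcity.", "Chunk two about decentralization."]
filled = stuff_context(template, retrieved)
print(filled)
```

Because every retrieved chunk is pasted into a single prompt, the chunk size and the number of retrieved documents together determine how much of the model's context window the prompt consumes.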

In [None]:
results = rag_chain.invoke({"input": "Explain in detail why Bitcoin is the best form of money ever?"})
print(results["answer"])

Based on the provided context, Bitcoin is considered the best form of money ever due to its decentralized and secure nature, which eliminates intermediary control and the risk of debasement or confiscation by any authority. Here are some reasons why:

1. Decentralization: Bitcoin is the only truly decentralized digital currency, meaning that no single entity or government has control over it. This ensures that no single point of failure can bring the system down, and that transactions are secure and tamper-proof.

2. Security: The use of advanced cryptography and blockchain technology makes it nearly impossible to hack or manipulate the transactions on the network. This ensures that the transactions are secure and trustworthy.

3. Limited supply: The supply of Bitcoin is capped at 21 million, which means that it can't be inflated or devalued by printing more money, like traditional fiat currencies. This limited supply also makes each Bitcoin valuable and scarce, which increases its pur

You can see that you get both a final answer in the `answer` key of the results dict, and the `context` the LLM used to generate an answer.

Examining the values under the `context` further, you can see that they are documents that each contain a chunk of the ingested page content. Usefully, these documents also preserve the original metadata from way back when you first loaded them:

In [None]:
print(results["context"])

[Document(id='ea432340-3af3-4fc3-9b9a-afaabb999e77', metadata={'source': 'The Bitcoin Standard.pdf', 'page': 267}, page_content='primarily: Bitcoin is the only truly decentralized digital currency which\nhas grown spontaneously as a finely balanced equilibrium between\nminers, coders, and users, none of whom can control it. It was only'), Document(id='090ac658-1e06-4261-8e72-c8edca8dad05', metadata={'source': 'The Bitcoin Standard.pdf', 'page': 198}, page_content='seen with other forms of money, while its divisibility into 100,000,000\nsatoshis makes it salable in scale. Further, Bitcoin’s elimination of inter-\nmediary control and the near-impossibility of any authority debasing\nor confiscating it renders it free of the main drawbacks of government\nmoney. As the digital age has introduced improvements and efficiencies\nto most aspects of our life, Bitcoin presents a tremendous technolog-\nical leap forward in the monetary solution to the indirect exchange\nproblem, perhaps as signif

In [None]:
print(results["context"][0].page_content)

primarily: Bitcoin is the only truly decentralized digital currency which
has grown spontaneously as a finely balanced equilibrium between
miners, coders, and users, none of whom can control it. It was only


In [None]:
print(results["context"][0].metadata)

{'source': 'The Bitcoin Standard.pdf', 'page': 267}


This particular chunk came from the page with index 267 in the original PDF (`PyPDFLoader` counts pages from 0, so this is the 268th page of the file). You can use this data to show which page in the PDF the answer came from, allowing users to quickly verify that answers are based on the source material.
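
One way to surface that metadata to users is to turn it into a short list of citations. The sketch below is a minimal, standard-library example: it uses plain dicts in place of LangChain `Document` objects (with a real `Document` you would read `doc.metadata` instead of `doc["metadata"]`), and the helper name `format_citations` is hypothetical. It also assumes the `page` value is a zero-based index, so it adds 1 for display. The third entry in the sample data is an invented duplicate, included to show de-duplication.

```python
def format_citations(context_docs: list[dict]) -> list[str]:
    """Build de-duplicated 'source, page N' strings, preserving retrieval order."""
    seen = set()
    citations = []
    for doc in context_docs:
        meta = doc["metadata"]
        key = (meta["source"], meta["page"])
        if key in seen:
            continue  # several chunks can come from the same page
        seen.add(key)
        # Assumption: 'page' is a zero-based index, so add 1 for display.
        citations.append(f"{meta['source']}, page {meta['page'] + 1}")
    return citations


# Stand-ins for results["context"]; the last entry is a made-up duplicate.
context_docs = [
    {"metadata": {"source": "The Bitcoin Standard.pdf", "page": 267}},
    {"metadata": {"source": "The Bitcoin Standard.pdf", "page": 198}},
    {"metadata": {"source": "The Bitcoin Standard.pdf", "page": 267}},
]
citations = format_citations(context_docs)
print("\n".join(citations))
```

Appending these lines beneath the generated answer gives readers a direct pointer back to the pages the model was shown.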