# Ollama with PDF Ingestion Project

This project uses LangChain and Chroma to demonstrate a local Retrieval-Augmented Generation (RAG) system for ingesting and querying PDF files. The system relies on the following tools:

- **LangChain**: Manages the entire workflow, including document loading, embedding, and querying.
- **UnstructuredPDFLoader**: Used to extract text content from PDF files.
- **RecursiveCharacterTextSplitter**: For splitting large text content into manageable chunks.
- **Ollama Embeddings**: Converts text into vector embeddings for efficient storage and retrieval.
- **Chroma**: A vector database used to store and query the embeddings.
- **MultiQueryRetriever**: Enhances retrieval accuracy by generating multiple variations of the user's query.
- **ChatOllama**: LangChain's chat interface to a local language model served by Ollama (e.g., Mistral), used to generate responses based on the retrieved document context.

Once the packages and models have been downloaded, the pipeline runs entirely on the local machine, so sensitive documents are never sent to an external service.

## PDF Ingestion

### Key Components:
1. **UnstructuredPDFLoader**: This component from LangChain is responsible for reading and extracting text content from PDF files. It is designed to handle unstructured data, ensuring that documents of various formats are properly processed.
2. **Text Processing**: Once the content is extracted, the RecursiveCharacterTextSplitter tool is used to split the text into smaller chunks. Chunking the text ensures efficient processing and better results during retrieval and embedding.


In [None]:
%pip install -q unstructured langchain
%pip install -q "unstructured[all-docs]"

Note: you may need to restart the kernel to use updated packages.


In [None]:
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import OnlinePDFLoader

**Loading the PDF**:
   The first step is loading the PDF file with `UnstructuredPDFLoader`. The file is specified by its local path, and the loader extracts its text content. The code below loads "WEF_The_Global_Cooperation_Barometer_2024.pdf", a World Economic Forum report on global cooperation produced in collaboration with McKinsey & Company. To use a different document, point `local_path` at your own PDF file.

In [None]:
local_path = "WEF_The_Global_Cooperation_Barometer_2024.pdf"

# Local PDF file uploads
if local_path:
  loader = UnstructuredPDFLoader(file_path=local_path)
  data = loader.load()
else:
  print("Upload a PDF file")
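
The `if local_path:` check above only guards against an empty string. As a minimal sketch, a stricter check could confirm that the file exists and has a `.pdf` extension before it is handed to the loader. The helper name `validate_pdf_path` is hypothetical and is not part of LangChain or `unstructured`:

```python
from pathlib import Path


def validate_pdf_path(path_str: str) -> Path:
    """Return a Path for an existing .pdf file, or raise a descriptive error.

    Hypothetical helper for illustration; not part of LangChain or unstructured.
    """
    path = Path(path_str)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name!r}")
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    return path
```

With this in place, the loader could be created as `UnstructuredPDFLoader(file_path=str(validate_pdf_path(local_path)))`, so a wrong path fails with a clear message instead of an error from deep inside the loader.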

We preview the data to see if it is loaded properly.

In [None]:
# Preview the first 100 characters of the extracted text
data[0].page_content[:100]

'In collaboration with McKinsey & Company\n\nThe Global Cooperation Barometer 2024\n\nI N S I G H T R E P'

## Vector Embeddings
Once the PDF content has been ingested and chunked, the next step involves converting the text chunks into vector embeddings. This process allows the text to be stored and later queried efficiently using a vector database.

### Why Vector Embeddings Are Necessary:
Vector embeddings are crucial for enabling efficient semantic search across large documents. Instead of matching exact words, vector embeddings convert text into numerical representations (vectors) that capture the meaning and context of the text. This allows the system to retrieve relevant information based on similarity between queries and the document content, making it far more powerful than traditional keyword-based search. 

### Why we need Chroma:
Chroma is used as the vector database to store and manage the vector embeddings. It allows for scalable and fast retrieval of these embeddings. When a user submits a query, Chroma searches through the stored vectors to find the most semantically relevant chunks of the document. Chroma's support for large datasets and its optimized performance make it a suitable choice for real-time querying in this RAG system.
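
To make the idea of similarity-based retrieval concrete, here is a toy sketch in plain Python. It ranks a few stored texts by cosine similarity to a query vector. The three-dimensional vectors are made up for illustration, and the brute-force scan is an assumption chosen for clarity: a real vector database such as Chroma relies on an index, and the distance metric it uses depends on its configuration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda text: cosine_similarity(query_vec, store[text]),
        reverse=True,
    )
    return ranked[:k]


# Made-up 3-dimensional vectors; real embedding models produce far more dimensions.
toy_store = {
    "trade and capital flows": [0.9, 0.1, 0.0],
    "climate and natural capital": [0.1, 0.9, 0.1],
    "peace and security": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]  # pretend embedding of a question about trade

print(top_k(toy_query, toy_store, k=1))  # ['trade and capital flows']
```

The query shares no exact wording requirement with the stored text; the ranking depends only on how close the vectors are, which is what lets semantic search go beyond keyword matching.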

In [None]:
!ollama pull nomic-embed-text

pulling manifest
pulling 970aa74c0a90... 100% ▕████████████████▏ 274 MB
pulling c71d239df917... 100% ▕████████████████▏  11 KB
pulling ce4a164fc046... 100% ▕████████████████▏   17 B
pulling 31df23ea7daa... 100% ▕████████████████▏  420 B
verifying sha256 digest
writing manifest
success


In [None]:
!ollama list

NAME                       ID              SIZE      MODIFIED               
nomic-embed-text:latest    0a109f422b47    274 MB    Less than a second ago    
llama3.1:latest            42182419e950    4.7 GB    32 hours ago              


In [None]:
%pip install -q chromadb
%pip install -q langchain-text-splitters

Note: you may need to restart the kernel to use updated packages.


In [None]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

**Splitting the Text:** Once the PDF content is loaded, the text is split into chunks using `RecursiveCharacterTextSplitter`. The `chunk_size` parameter sets the maximum size of each chunk, while `chunk_overlap` makes adjacent chunks share some text, so that information near a chunk boundary is not cut off from its context. This helps retrieval stay accurate when the document is queried later. Here, chunks are at most 7,500 characters long with a 100-character overlap.

In [None]:
# Split and chunk 
text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
chunks = text_splitter.split_documents(data)
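
To illustrate what `chunk_size` and `chunk_overlap` mean, the sketch below cuts a string into fixed-width windows that share a few characters. It is a deliberately simplified stand-in: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line breaks, which this hypothetical `split_with_overlap` function does not attempt.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters.

    Simplified illustration only; it ignores separators such as paragraph
    breaks, which RecursiveCharacterTextSplitter takes into account.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one (`hij`, `opq`, `vwx`), which is the same mechanism that keeps context across boundaries in the 7,500/100 configuration used above.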

**Storing Embeddings in Chroma:** Each chunk is embedded with the `nomic-embed-text` model through `OllamaEmbeddings`, and the resulting vectors are stored in a Chroma vector database. The embeddings are added to a collection named "local-rag", which plays a role similar to a table in a relational database. This collection is queried during retrieval to fetch the chunks most relevant to a user query.

In [None]:
# Add to vector database
vector_db = Chroma.from_documents(
    documents=chunks, 
    embedding=OllamaEmbeddings(model="nomic-embed-text", show_progress=True),
    collection_name="local-rag"
)

## Retrieval

The retrieval process is the core functionality of the Retrieval-Augmented Generation (RAG) system. Once the document embeddings are stored in the vector database (Chroma), the system is set up to retrieve relevant chunks of information based on user queries. This step involves generating multiple variations of a query, retrieving the relevant chunks from the vector database, and generating a response using a local language model (LLM).

In [None]:
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

The local language model (LLM) generates human-readable responses based on the retrieved context. In this case, the Mistral model is used for local inference. Llama 3.1 is another option, but it requires more hardware resources.

In [None]:
# LLM from Ollama
local_model = "mistral"  # or "llama3.1", as listed by `ollama list`
llm = ChatOllama(model=local_model)

**Prompt Template for Query Expansion**:
   A `PromptTemplate` is used to generate multiple variations of a user’s query. The purpose of generating multiple versions of the query is to improve retrieval by covering different phrasings and perspectives. This helps overcome some of the limitations of distance-based similarity search in the vector database.

In [None]:
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

The `MultiQueryRetriever` sends each generated query variation to the Chroma vector database and combines the results into a single set of unique document chunks. This tends to improve retrieval by surfacing relevant passages that a single phrasing of the question might miss.

In [None]:
retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(), 
    llm,
    prompt=QUERY_PROMPT
)
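
Conceptually, the multi-query step can be pictured as: split the LLM output into individual queries, run each one against the vector store, and keep every retrieved chunk once. The sketch below shows that flow with plain strings and a made-up lookup table in place of a real search. The function names are hypothetical, and the actual `MultiQueryRetriever` works with `Document` objects and may differ in its internals.

```python
from typing import Callable


def parse_query_variants(llm_output: str) -> list[str]:
    """Split newline-separated LLM output into non-empty query strings."""
    return [line.strip() for line in llm_output.splitlines() if line.strip()]


def retrieve_unique(
    queries: list[str], search: Callable[[str], list[str]]
) -> list[str]:
    """Run every query through search and keep each chunk once, in first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for query in queries:
        for chunk in search(query):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged


# Stand-in for a vector search: a hypothetical lookup table, not a real database.
fake_index = {
    "What are the pillars?": ["chunk A", "chunk B"],
    "Which areas does the report cover?": ["chunk B", "chunk C"],
}
fake_llm_output = "What are the pillars?\n\nWhich areas does the report cover?\n"

variants = parse_query_variants(fake_llm_output)
print(retrieve_unique(variants, lambda q: fake_index.get(q, [])))
# ['chunk A', 'chunk B', 'chunk C']
```

Here "chunk B" is returned by both query variants but appears only once in the merged result, so the context passed to the LLM is not padded with duplicates.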

**RAG Prompt for Final Response:** A RAG prompt instructs the language model to answer based only on the retrieved context. The `ChatPromptTemplate` ensures that the user query and the context retrieved from the vector database are fed to the LLM in the correct format.

In [None]:
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

**Running the Retrieval Chain:** The process is executed as a chain. The context is retrieved via the multi-query retriever, and the final question and context are passed through the language model to generate the answer. The chain is invoked by passing user input into the process.

In [None]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
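
The data flow through the chain can be sketched with ordinary functions. Everything below is a stand-in: `fake_retriever` and `fake_llm` are hypothetical stubs, and joining the retrieved chunks with blank lines is an assumption about how the context is rendered, not a description of what LangChain does internally.

```python
RAG_TEMPLATE = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""


def fake_retriever(question: str) -> list[str]:
    """Stub standing in for the multi-query retriever."""
    return ["first retrieved chunk", "second retrieved chunk"]


def fake_llm(prompt_text: str) -> dict:
    """Stub standing in for the chat model; it only reports the prompt length."""
    return {"content": f"(model reply to a {len(prompt_text)}-character prompt)"}


def run_chain(question: str) -> str:
    # Step 1: fan the input out into the two prompt variables.
    inputs = {
        "context": "\n\n".join(fake_retriever(question)),
        "question": question,
    }
    # Step 2: fill the template with the context and the question.
    prompt_text = RAG_TEMPLATE.format(**inputs)
    # Step 3: call the model with the filled prompt.
    message = fake_llm(prompt_text)
    # Step 4: keep only the text of the reply.
    return message["content"]


print(run_chain("What is this about?"))
```

The four steps correspond to the four stages of the chain above: the input dictionary, `prompt`, `llm`, and `StrOutputParser()`. The same question string is used twice, once to retrieve context and once as the question in the prompt.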

The system can then be invoked to retrieve an answer:

In [None]:
chain.invoke(input(""))

 what is this about?


OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:01<00:00,  1.15s/it]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 36.58it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 14.64it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 23.34it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 23.14it/s]


' This document is the Insight Report of The Global Cooperation Barometer 2024 by the World Economic Forum in collaboration with McKinsey & Company. It provides an analysis of the state of global cooperation across five pillars: trade and capital, innovation and technology, climate and natural capital, health and wellness, and peace and security. The report examines trends in cooperative actions and their outcomes to determine the overall level of global cooperation in each area. It also includes recommendations for leaders on how to reimagine global cooperation in a new era.'

In [None]:
chain.invoke("What are the 5 pillars of global cooperation?")

OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:01<00:00,  1.33s/it]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 26.36it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 36.23it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 49.43it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 63.03it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 58.14it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 59.76it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 56.69it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 48.34it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 51.85it/s]


' The 5 pillars of global cooperation are:\n\n1. Trade and capital\n2. Innovation and technology\n3. Climate and natural capital\n4. Health and wellness\n5. Peace and security.'

**Deleting Collections in the Vector Database:** Once the retrieval task is completed, the vector database can be cleared by deleting the collection. This ensures that any temporary data is removed, freeing up space for future tasks.

In [None]:
# Delete the collection behind this vector store ("local-rag")
vector_db.delete_collection()

## Summary:
In the retrieval phase, a user’s query is expanded into multiple variations to improve the retrieval of relevant document parts from the vector database. These document chunks are then passed to the language model, which is prompted to answer based solely on the provided context. Together, `MultiQueryRetriever` and `ChatOllama` help the system deliver context-aware responses to user queries.