# Ollama with PDF Ingestion Project

Using LangChain and Chroma, this project demonstrates a local Retrieval-Augmented Generation (RAG) system that ingests PDF files and answers questions about their content. The system uses several tools:

- **LangChain**: For managing the entire workflow, including document loading, embedding, and querying.
- **UnstructuredPDFLoader**: Used to extract text content from PDF files.
- **RecursiveCharacterTextSplitter**: For splitting large text content into manageable chunks.
- **Ollama Embeddings**: Converts text into vector embeddings for efficient storage and retrieval.
- **Chroma**: A vector database used to store and query the embeddings.
- **MultiQueryRetriever**: Enhances retrieval accuracy by generating multiple variations of the user's query.
- **ChatOllama**: A local language model (e.g., Mistral) used to generate responses based on the retrieved document context.

Once the packages and models have been downloaded, the pipeline runs entirely on the local machine, so sensitive documents never have to leave it.

## PDF Ingestion in the Ollama RAG System

### Key Components:
1. **UnstructuredPDFLoader**: This component from LangChain is responsible for reading and extracting text content from PDF files. It is designed to handle unstructured data, ensuring that documents of various formats are properly processed.
2. **Text Processing**: Once the content is extracted, the RecursiveCharacterTextSplitter tool is used to split the text into smaller chunks. Chunking the text ensures efficient processing and better results during retrieval and embedding.
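
To make the idea of overlapping chunks concrete, here is a minimal sketch in plain Python. It is a simplification, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter first tries to break on separators such as paragraph and line boundaries, whereas this hypothetical `split_with_overlap` helper cuts fixed-size character windows.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into character windows where neighbours share chunk_overlap characters.

    Illustrative helper only (an assumption of this sketch, not a LangChain API).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghij" * 3  # 30 characters of toy text
pieces = split_with_overlap(sample, chunk_size=12, chunk_overlap=4)
print([len(p) for p in pieces])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.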


In [1]:
%pip install --quiet unstructured langchain
%pip install --quiet "unstructured[all-docs]"

Note: you may need to restart the kernel to use updated packages.


In [2]:
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import OnlinePDFLoader

**Loading the PDF**:
The first step is to load the PDF file with `UnstructuredPDFLoader`. The file is specified by its local path, and the loader extracts its text content. The code below loads "WEF_The_Global_Cooperation_Barometer_2024.pdf", the World Economic Forum's report on global cooperation, produced in collaboration with McKinsey & Company. To use a different document, point `local_path` at your own PDF file.

In [3]:
local_path = "WEF_The_Global_Cooperation_Barometer_2024.pdf"

# Load the PDF from a local path
if local_path:
    loader = UnstructuredPDFLoader(file_path=local_path)
    data = loader.load()
else:
    print("Upload a PDF file")
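
Because any non-empty string passes the `if local_path:` check, a mistyped path only fails later, inside the loader. A minimal sketch of validating the path up front is shown below. The `resolve_pdf_path` helper is hypothetical (it is not part of LangChain or of the original notebook) and uses only the standard library.

```python
from pathlib import Path


def resolve_pdf_path(raw_path: str) -> Path:
    """Return a Path to an existing .pdf file, or raise a descriptive error.

    Hypothetical helper for illustration; not a LangChain function.
    """
    path = Path(raw_path).expanduser()
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name!r}")
    if not path.is_file():
        raise FileNotFoundError(f"No PDF found at: {path}")
    return path
```

The returned path could then be passed to the loader as `str(resolve_pdf_path(local_path))`.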

Preview the data to confirm that it loaded properly.

In [5]:
# Preview the first 100 characters of the first loaded document
data[0].page_content[:100]

'In collaboration with McKinsey & Company\n\nThe Global Cooperation Barometer 2024\n\nI N S I G H T R E P'

## Vector Embeddings

In [5]:
!ollama pull nomic-embed-text

pulling manifest 
pulling 970aa74c0a90... 100% ▕████████████████▏ 274 MB
pulling c71d239df917... 100% ▕████████████████▏  11 KB
pulling ce4a164fc046... 100% ▕████████████████▏   17 B
pulling 31df23ea7daa... 100% ▕████████████████▏  420 B
verifying sha256 digest 
writing manifest 
success


In [6]:
!ollama list

NAME                       ID              SIZE      MODIFIED               
nomic-embed-text:latest    0a109f422b47    274 MB    Less than a second ago    
llama3.1:latest            42182419e950    4.7 GB    29 hours ago              


In [7]:
%pip install --quiet chromadb
%pip install --quiet langchain-text-splitters

Note: you may need to restart the kernel to use updated packages.


In [8]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

In [9]:
# Split and chunk 
text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
chunks = text_splitter.split_documents(data)


In [10]:
# Add to vector database
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text", show_progress=True),
    collection_name="local-rag"
)

OllamaEmbeddings: 100%|██████████| 40/40 [01:50<00:00,  2.75s/it]
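
To illustrate what the vector store does with these embeddings at query time, the following sketch ranks stored texts by cosine similarity to a query vector. It is a toy stand-in under stated assumptions: the three-dimensional vectors are invented for the example (they are not `nomic-embed-text` output), the `cosine_similarity` and `top_k` helpers are hypothetical, and the distance metric and index that Chroma actually uses depend on its configuration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k stored texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up vectors standing in for real embeddings.
toy_store = [
    ("trade and capital", [0.9, 0.1, 0.0]),
    ("climate and natural capital", [0.1, 0.9, 0.1]),
    ("peace and security", [0.0, 0.2, 0.9]),
]
nearest = top_k([0.8, 0.2, 0.0], toy_store, k=1)
print(nearest)
```

A real store replaces the linear scan with an approximate nearest-neighbour index so that lookups stay fast as the number of chunks grows.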

## Retrieval

In [None]:
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

In [511]:
# LLM from Ollama. The model must already be available locally
# (for example, after running `ollama pull mistral`).
local_model = "mistral"
llm = ChatOllama(model=local_model)

In [512]:
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

In [513]:
retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(), 
    llm,
    prompt=QUERY_PROMPT
)

# RAG prompt
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
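
The general idea behind multi-query retrieval can be sketched without any libraries: rephrase the question several ways, search once per rephrasing, and merge the hits while dropping duplicates. The code below is an illustration under assumptions, not `MultiQueryRetriever`'s actual implementation. The function `multi_query_retrieve` and the canned variants and hits are hypothetical stand-ins for the LLM and the vector store.

```python
from typing import Callable


def multi_query_retrieve(
    question: str,
    generate_variants: Callable[[str], list[str]],
    search: Callable[[str], list[str]],
) -> list[str]:
    """Search once per question variant and return the de-duplicated union of hits."""
    seen = set()
    merged = []
    for variant in generate_variants(question):
        for doc in search(variant):
            if doc not in seen:  # keep the first occurrence, preserve order
                seen.add(doc)
                merged.append(doc)
    return merged


# Canned data standing in for the LLM (variants) and the vector store (hits).
canned_variants = {
    "What are the pillars?": ["Which pillars exist?", "List the pillars."],
}
canned_hits = {
    "Which pillars exist?": ["chunk-1", "chunk-2"],
    "List the pillars.": ["chunk-2", "chunk-3"],
}

merged = multi_query_retrieve(
    "What are the pillars?",
    generate_variants=lambda q: canned_variants[q],
    search=lambda q: canned_hits.get(q, []),
)
print(merged)
```

Each extra variant costs one more embedding call and one more search, which is consistent with the several `OllamaEmbeddings` progress bars printed for a single question.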

In [516]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
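
Read left to right, the chain maps the incoming question to a context (via the retriever) and to itself (via the passthrough), fills the prompt template, calls the model, and returns plain text. A rough plain-Python equivalent is sketched below. The `run_chain` function and the fake retriever and model are hypothetical, and joining the chunk texts with blank lines is a choice made for this sketch rather than what the LangChain chain does with the retrieved documents.

```python
from typing import Callable

TEMPLATE = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""


def run_chain(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve context, fill the prompt template, and return the model's text."""
    context = "\n\n".join(retrieve(question))  # context step
    prompt_text = TEMPLATE.format(context=context, question=question)  # prompt step
    return generate(prompt_text)  # model step, returned as a plain string


# Stand-ins for the retriever and the local model.
seen_prompts = []


def fake_retrieve(question: str) -> list[str]:
    return ["Pillar one is trade and capital."]


def fake_generate(prompt_text: str) -> str:
    seen_prompts.append(prompt_text)  # record what the "model" was asked
    return "trade and capital"


answer = run_chain("What is pillar one?", fake_retrieve, fake_generate)
print(answer)
```

Keeping the retriever and model as injected callables makes each stage easy to swap or test in isolation, which is the same property the pipe-style composition provides.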

In [518]:
chain.invoke(input(""))

 what is this about?


OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:01<00:00,  1.15s/it]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 36.58it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 14.64it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 23.34it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 23.14it/s]


' This document is the Insight Report of The Global Cooperation Barometer 2024 by the World Economic Forum in collaboration with McKinsey & Company. It provides an analysis of the state of global cooperation across five pillars: trade and capital, innovation and technology, climate and natural capital, health and wellness, and peace and security. The report examines trends in cooperative actions and their outcomes to determine the overall level of global cooperation in each area. It also includes recommendations for leaders on how to reimagine global cooperation in a new era.'

In [519]:
chain.invoke("What are the 5 pillars of global cooperation?")

OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:01<00:00,  1.33s/it]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 26.36it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 36.23it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 49.43it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 63.03it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 58.14it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 59.76it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 56.69it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 48.34it/s]
OllamaEmbeddings: 100%|███████████████████████████| 1/1 [00:00<00:00, 51.85it/s]


' The 5 pillars of global cooperation are:\n\n1. Trade and capital\n2. Innovation and technology\n3. Climate and natural capital\n4. Health and wellness\n5. Peace and security.'

In [None]:
# Delete this vector store's collection ("local-rag") and the embeddings stored in it
vector_db.delete_collection()