# Notebook 1: PDF Documents Ingestion Pipeline into PGVector
The notebook provides a reference implementation of a __document ingestion pipeline__ that creates a __knowledgebase from documents belonging to a specific domain__. The  
documents will be encoded as high dimensional vectors and these will get stored in a __vector DB__ powered by __PGVector (PostgreSQL)__. This vector store will provide  
the __retrieval engine__  to __augment and ground the generation process of LLMs__ when __answering questions__ about the __knowledge domain__ bounded by the ingested documents.

![Ingestion Pipeline Process Flow](img/PGVector_store_population.png)

## Notebook Details and Preliminary Instructions
- The knowledge base consists of [__10 NASA history books__](https://www.nasa.gov/ebooks/). The RAG pipelines we build in this series of notebooks aim to __ground an LLM so that it answers questions about the  
content of these books__ without making things up or falling back on its pre-training data as a knowledge base.
- The ingestion pipeline is powered by __LlamaIndex (tested on v0.10.26)__
- The embedding model is [__BAAI/bge-base-en-v1.5__](https://huggingface.co/BAAI/bge-base-en-v1.5), which ranks high on the [__MTEB Leaderboard__](https://huggingface.co/spaces/mteb/leaderboard). This family of models can be [__CPU-optimized__](https://huggingface.co/blog/intel-fast-embedding) to save GPU memory.
- The LLM used to execute generation tasks in the RAG pipeline is [__HuggingFaceH4/zephyr-7b-alpha__](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha). This  
model is [__compatible with LlamaIndex__](https://docs.llamaindex.ai/en/latest/module_guides/models/llms/) (notice it is the open-source LLM ticking the most checkboxes there)  
and, in our experiments, it produced better-quality generations than other models of similar size (and even some larger ones). 

### BEFORE YOU START:
- The LLM runs on [__vLLM__](https://github.com/vllm-project/vllm), one of the most popular (if not the most popular) open-source LLM inference engines. 
    - To run __Zephyr-7b-alpha__ on __vLLM__ in __OpenAI-compatible mode__, make sure you have an __A100 (40GB) GPU__ available at the OS level and CUDA 12.1 installed.
    - Then run the following commands to make the LLM service available at `http://localhost:8010/v1`:  
  ```
        # (Optional) Create a new conda environment.
        conda create -n vllm-env python=3.9 -y
        conda activate vllm-env

        # Install vLLM with CUDA 12.1.
        pip install vllm
        
        # Serve the zephyr-7b-alpha LLM
        python -m vllm.entrypoints.openai.api_server --model HuggingFaceH4/zephyr-7b-alpha --port 8010 --enforce-eager
  ```

- The vector store is implemented using the __PGVector extension of PostgreSQL__ (v12). For the purposes of this demo, __please go to the PGVector directory (`../PGVector`) in this  
repository and execute the `run_pgvector.sh` script to pull and launch a PostgreSQL + PGVector Docker container.__ Once up and running, the DB engine will be available  
from `localhost:5432`

- Finally, create a new Conda environment with the packages required to run the Python scripts. Here are the steps to do this:
  ```
    # From a shell terminal, go to the "utils" directory. Then run the following command:
    conda env create -f adv_rag.yml

    # Wait several minutes until the "adv_rag" conda environment gets created.
    # Next, activate the new Conda env (use the environment name, not the YAML file name)
    conda activate adv_rag
  ```
- Always run the Python scripts from the Improved RAG Starter Pack inside the __adv_rag__ Conda environment. 
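
Before running the cells below, it can help to confirm that both services from the instructions above are actually listening. The following is a minimal sketch that only checks whether the TCP ports accept connections. The helper name `is_port_open` is hypothetical, and the ports (8010 for vLLM, 5432 for PostgreSQL) are the ones used in this notebook. An open port does not guarantee that the model has finished loading or that the PGVector extension is enabled.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution errors.
        return False


# Ports taken from the setup instructions in this notebook.
SERVICES = {"vLLM (OpenAI-compatible API)": 8010, "PostgreSQL + PGVector": 5432}

for name, port in SERVICES.items():
    status = "reachable" if is_port_open("localhost", port) else "NOT reachable"
    print(f"{name} on localhost:{port}: {status}")
```

If either service is reported as not reachable, revisit the corresponding setup step before continuing.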



## Imports Section

In [1]:
# General purpose imports
import pandas as pd
import psycopg2
from sqlalchemy import make_url
from pprint import pprint

# LlamaIndex imports
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
)
from llama_index.readers.file import PyMuPDFReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.llms.openai_like import OpenAILike
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline

# Imports from this repo
import sys
utils_path = "../utils"
if utils_path not in sys.path:
    sys.path.append(utils_path)

from helpers import (
    get_indices_with_nulls, 
    remove_elements,
)

## Global Config Setup 
- Here we define most of the RAG pipeline parameters and LlamaIndex defaults.
- The goal is to provide a single control cell to define the notebook's execution.

In [6]:
# vLLM service settings
LLM_MODEL = "HuggingFaceH4/zephyr-7b-alpha" # You may replace this with a different model
LLM_API_BASE = "http://localhost:8010/v1" # The URL vLLM service is accessible from.
LLM_API_KEY = "NO_KEY" # By default, vLLM does not require a key.
GEN_TEMP=0.1 # Generation temperature
MAX_TOKENS=512 # Max tokens the LLM should generate 
REP_PENALTY=1.03 # Word repetition penalty at generation time

# LlamaIndex LLM provider
Settings.llm = OpenAILike(
    model=LLM_MODEL,
    api_key=LLM_API_KEY,
    api_base=LLM_API_BASE,
    temperature=GEN_TEMP,
    max_tokens=MAX_TOKENS,
    repetition_penalty=REP_PENALTY,
)

# LlamaIndex embedding model
EMB_MODEL="BAAI/bge-base-en-v1.5" # For better results you can use the "large" variant
DEVICE="cuda:0" # If running out of GPU RAM, switch to "cpu" (although slower)
Settings.embed_model = HuggingFaceEmbedding(
    model_name=EMB_MODEL,
    device=DEVICE
)
EMBEDDING_SIZE = len(Settings.embed_model.get_text_embedding("hi"))

# PGVector DB params as defined in the reference Docker compose file
## available from the "../PGVector" directory
DB_PORT = 5432
DB_USER = "demouser"
DB_PASSWD = "demopasswd"
DEFAULT_DB = "postgres"
DB_NAME = "vectordb"
DB_HOST = "localhost"
TABLE_NAME = "NASA_HISTORY_BOOKS"

# Ingestion pipeline settings
NUM_WORKERS = 4 # Num. of parallel workers for the ingestion process
CHUNK_SIZE = 1024 # Max. num. of tokens per chunk (SentenceSplitter counts tokens, not words)
PDF_FILES_PATH = "../02-KB-Documents/NASA" # The directory to read PDF files from
MIN_DOC_LENGTH = 40 # Minimum number of words allowed per doc

## PDF Documents Ingestion

In [7]:
%%time
# Lambda function to add the file name as metadata at loading time
filename_fn = lambda filename: {"file_name": filename.split("/")[-1]}

# >> PDF Document Parsing
# PyMuPDFReader takes ~1/20 of the time the default reader needs to ingest the PDF files.
# Note: PyMuPDFReader creates one document object per page of a PDF document.
reader = SimpleDirectoryReader(
    input_dir=PDF_FILES_PATH,
    required_exts=[".pdf"],
    file_extractor={".pdf":PyMuPDFReader()},
    file_metadata=filename_fn,
    num_files_limit=10,
)
documents = reader.load_data()

# Filter out documents with null (`\x00`) characters, which are incompatible with PGVector.
bad_docs = get_indices_with_nulls(documents)
documents = remove_elements(documents, bad_docs)
print(f"Loaded {len(documents)} pages")

# Display one document as example
example_doc = 10
print(f" >> TEXT FROM DOCUMENT #{example_doc}:\n", documents[example_doc], "\n")
print(f" >> DOCUMENT {example_doc} METADATA:")
pprint(documents[example_doc].metadata, depth=1, indent=4, width=100)

Loaded 3855 pages
 >> TEXT FROM DOCUMENT #10:
 Doc ID: 6baac775-12b5-469c-ae11-5c6c424dca17
Text: 1 INTRODUCTION NASA’s Solar System Exploration Paradigm:  The
First 50 Years and a  Look at the Next 50 James L. Green and Kristen
J. Erickson A FTER MANY FAILURES to get to the Moon and to the planets
beyond,  Mariner 2 successfully flew by Venus in December 1962. This
historic  mission began a spectacular era of solar system exploration
for NA... 

 >> DOCUMENT 10 METADATA:
{   'file_name': '50-years-of-solar-system-exploration_tagged.pdf',
    'file_path': '/home/vmuser/RAG/Improved_RAG_Starter_Pack/03-Document_ingestion/../02-KB-Documents/NASA/50-years-of-solar-system-exploration_tagged.pdf',
    'source': '11',
    'total_pages': 364}
CPU times: user 17.6 s, sys: 472 ms, total: 18.1 s
Wall time: 18.1 s
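
The two helpers imported from `helpers` (`get_indices_with_nulls` and `remove_elements`) are not shown in this notebook. The sketch below is a hypothetical reimplementation based only on how they are called above, so the repository's actual code may differ. It assumes each document exposes a `.text` attribute, and `PageStub` is a made-up stand-in for a LlamaIndex document used only for the demo.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass
class PageStub:
    """Hypothetical stand-in for a LlamaIndex document (only `.text` is needed here)."""
    text: str


def get_indices_with_nulls(documents: Sequence[Any]) -> List[int]:
    """Return the positions of documents whose text contains a NUL (`\\x00`) character."""
    return [i for i, doc in enumerate(documents) if "\x00" in doc.text]


def remove_elements(items: Sequence[Any], indices: Sequence[int]) -> List[Any]:
    """Return a new list without the elements at the given positions."""
    to_drop = set(indices)  # set membership keeps the filter linear in len(items)
    return [item for i, item in enumerate(items) if i not in to_drop]


pages = [PageStub("clean page"), PageStub("broken\x00page"), PageStub("another clean page")]
bad = get_indices_with_nulls(pages)
print(bad)
print([p.text for p in remove_elements(pages, bad)])
```

Building a new list rather than deleting in place avoids the index-shifting bugs that come with removing several positions from a list one by one.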


## Data Analysis Section
- This section helps to better understand the text data and its possible effects on LLM inference timeouts. It turns 
out that documents (pages) with too few words, or no words at all, might make the LLM inference service fail.

In [8]:
# Helper dict to form a dataframe
docs_metrics = {
    'words': [],
    'text': [],
}
for doc in documents:
    docs_metrics['words'].append(len(doc.text.split()))
    docs_metrics['text'].append(doc.text)
docs_metrics = pd.DataFrame.from_dict(docs_metrics)

# Get statistics about number of words in documents
## Notice the presence of docs with 0 words
print("Words per page statistics")
display(docs_metrics.words.describe())

Words per page statistics


count    3855.000000
mean      383.659403
std       156.257311
min         0.000000
25%       304.000000
50%       412.000000
75%       474.000000
max      2775.000000
Name: words, dtype: float64

In [9]:
# Display pages with too few words (length < MIN_DOC_LENGTH)
print(f">> Filter out documents with < {MIN_DOC_LENGTH} words")
pd.set_option('display.max_colwidth', 150)
display(docs_metrics[docs_metrics.words<MIN_DOC_LENGTH])

>> Filter out documents with < 40 words


| | words | text |
|---|---|---|
| 0 | 11 | `HISTORICAL \nPERSPECTIVES\n \n50 YEARS OF \n SOLAR SYSTEM \nEXPLORATION\nLINDA BILLINGS, EDITOR\n` |
| 2 | 6 | `50 YEARS OF \nSOLAR SYSTEM\nEXPLORATION\n` |
| 3 | 0 | |
| 4 | 29 | `50 YEARS OF \nSOLAR SYSTEM\nEXPLORATION \nHISTORICAL PERSPECTIVES\nEdited by \nLINDA BILLINGS\nNational Aeronautics and Space Administration\nOffi...` |
| 23 | 0 | |
| ... | ... | ... |
| 3326 | 27 | `383\nDocument 5-25 (a–b)Figure 1. Schematics of Langley tank models 203, 213, 214, and 224.\nFIGURE 1. Lines of Langley tank models 203, 213, 214,...` |
| 3423 | 29 | `The Wind and Beyond, Volume III\n480Fig 1. Graph showing curves of characteristic coefficients for standing thrust and power of 2-blade uniform ge...` |
| 3424 | 23 | `481\nDocument 5-31Fig 2. Graph showing curves of characteristic coefficients for standing thrust and power of 2-blade now warped propeller F2 A1 S...` |
| 3843 | 0 | |


In [10]:
# Remove "too small" pages from the corpus
short_docs = docs_metrics.words[docs_metrics.words<MIN_DOC_LENGTH].index.to_list()
print(f">> Removing {len(short_docs)} pages from the corpus")
documents = remove_elements(documents, short_docs)
print(f" > New pages list size: {len(documents)}")

>> Removing 163 pages from the corpus
 > New pages list size: 3692


## Setup the Ingestion Pipeline

In [11]:
# Create the pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        # Split docs with preference to complete sentences
        SentenceSplitter(
            chunk_size=CHUNK_SIZE,
            chunk_overlap=20
        ),
        # Generate embeddings for document splits
        Settings.embed_model,
    ]
)
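
To build intuition for what "split with preference to complete sentences" plus an overlap means, here is a deliberately simplified, pure-Python sketch. It is not LlamaIndex's `SentenceSplitter` implementation. It assumes sentences end in `.`, `!` or `?`, and it measures size in whitespace-separated words rather than tokens. The function name `split_into_chunks` is hypothetical.

```python
import re
from typing import List


def split_into_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `chunk_size` words.

    Each new chunk starts with the last `chunk_overlap` words of the previous one.
    A single sentence longer than the budget is kept whole in this sketch.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Naive sentence boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: List[str] = []
    current: List[str] = []  # words of the chunk being built
    fresh = 0                # words in `current` that are not carried-over overlap

    for sentence in sentences:
        words = sentence.split()
        # Close the current chunk if adding this sentence would exceed the budget.
        if fresh and len(current) + len(words) > chunk_size:
            chunks.append(" ".join(current))
            current = current[-chunk_overlap:] if chunk_overlap else []
            fresh = 0
        current.extend(words)
        fresh += len(words)

    if fresh:
        chunks.append(" ".join(current))
    return chunks


demo = "One two three. Four five six. Seven eight nine."
for chunk in split_into_chunks(demo, chunk_size=6, chunk_overlap=2):
    print(repr(chunk))
```

The overlap repeats a little context at each chunk boundary, so a statement that straddles two chunks still has a chance of being retrieved with its neighboring words.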

## Ingestion Pipeline Parallel Execution
- Setting `num_workers` to a value greater than 1 will invoke parallel execution.
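
As a rough mental model of parallel ingestion, one common pattern is to split the input into roughly equal batches and let each worker apply the transformations to its own batch. The sketch below illustrates that pattern only. It is an assumption about the general approach, not a description of LlamaIndex internals. The names `make_batches` and `process_in_parallel` are hypothetical, and threads are used here purely to keep the example short, whereas CPU-bound work would normally use processes.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def make_batches(items: Sequence[T], num_batches: int) -> List[Sequence[T]]:
    """Split `items` into at most `num_batches` contiguous, roughly equal batches."""
    if num_batches < 1:
        raise ValueError("num_batches must be >= 1")
    batch_size = max(1, math.ceil(len(items) / num_batches))
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_in_parallel(items: Sequence[T], transform: Callable[[T], R], num_workers: int) -> List[R]:
    """Apply `transform` to every item, one batch per worker, preserving input order."""
    batches = make_batches(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map yields results in the order the batches were submitted.
        results = list(pool.map(lambda batch: [transform(x) for x in batch], batches))
    return [out for batch_out in results for out in batch_out]


print([len(b) for b in make_batches(list(range(10)), 4)])
print(process_in_parallel(["page one", "page two", "page three"], str.upper, 2))
```

With this scheme the speed-up is bounded by the slowest batch, which is one reason to keep batches similar in size.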


In [12]:
%%time
# Run the ingestion pipeline
nodes = pipeline.run(
    show_progress=True,
    documents=documents,
    num_workers=NUM_WORKERS,
)
print(f">> Created {len(nodes)} nodes.")

>> Created 3893 nodes.
CPU times: user 4.74 s, sys: 560 ms, total: 5.3 s
Wall time: 49.3 s


In [13]:
# Show what LlamaIndex nodes and their metadata look like.
print(">> LlamaIndex node's metadata sample:\n")
# print(" > Questions this node can answer:")
print(nodes[0].to_dict()['metadata'], "\n")
print(">> LlamaIndex node's text:\n", nodes[0].text)

>> LlamaIndex node's metadata sample:

{'file_name': '50-years-of-solar-system-exploration_tagged.pdf', 'total_pages': 364, 'file_path': '/home/vmuser/RAG/Improved_RAG_Starter_Pack/03-Document_ingestion/../02-KB-Documents/NASA/50-years-of-solar-system-exploration_tagged.pdf', 'source': '2'} 

>> LlamaIndex node's text:
 HISTORICAL PERSPECTIVES
NASA’S FIRST SUCCESSFUL MISSION to another planet, 
Mariner 2 to Venus in 1962, marked the begin­
ning of what NASA Chief Scientist Jim Green 
describes in this volume as “a spectacular era” 
of solar system exploration. In its first 50 years 
of planetary exploration, NASA sent spacecraft 
to fly by, orbit, land on, or rove on every planet 
in our solar system, as well as Earth’s Moon and 
several moons of other planets. Pluto, reclassi­
fied as a dwarf planet in 2006, was visited by 
the New Horizons spacecraft in 2015.
What began as an endeavor of two 
nations—the United States and the former 
Soviet Union—has become a multinational 
enterpris

## Create the PGVector Store from an existing PostgreSQL DB

In [14]:
# Connect to the PostgreSQL engine and initialize the DB to serve as vector/document store.
connection_string = f"postgresql://{DB_USER}:{DB_PASSWD}@{DB_HOST}:{DB_PORT}/{DEFAULT_DB}"
conn = psycopg2.connect(connection_string)
conn.autocommit = True
with conn.cursor() as c:
    c.execute(f"DROP DATABASE IF EXISTS {DB_NAME}")
    c.execute(f"CREATE DATABASE {DB_NAME}")
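
The cell above interpolates `DB_NAME` directly into the SQL text, which is fine for a hard-coded demo value. If the name ever comes from configuration or user input, it is worth validating it first. The following is a minimal sketch under conservative assumptions: it accepts only plain unquoted identifiers (letters, digits, underscores, at most 63 characters), and the helper name `safe_identifier` is hypothetical.

```python
import re

# Conservative pattern for an unquoted PostgreSQL identifier:
# starts with a letter or underscore, then letters/digits/underscores, 63 chars max.
_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,62}")


def safe_identifier(name: str) -> str:
    """Return `name` unchanged if it is a plain SQL identifier, else raise ValueError."""
    if not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"Unsafe SQL identifier: {name!r}")
    return name


# Usage with the cursor from the cell above (shown as comments, requires a live DB):
#   c.execute(f"DROP DATABASE IF EXISTS {safe_identifier(DB_NAME)}")
#   c.execute(f"CREATE DATABASE {safe_identifier(DB_NAME)}")
print(safe_identifier("vectordb"))
```

An alternative is to let the driver do the quoting with `psycopg2.sql.SQL(...).format(psycopg2.sql.Identifier(DB_NAME))`, which also handles names that need quoting.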

In [15]:
%%time
# Create a url object to store DB connection parameters
url = make_url(connection_string)

# Connect to the PGVector extension
vector_store = PGVectorStore.from_params(
    database=DB_NAME,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name=TABLE_NAME,
    embed_dim=EMBEDDING_SIZE, # embedding model dimension
    cache_ok=True,
    hybrid_search=True, # retrieve nodes based on vector values and keywords
)

# LlamaIndex persistence object backed by the PGVector connection
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# LlamaIndex index population (nodes -> embeddings -> vector store)
index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
    show_progress=True,
    transformations=None,
)

Generating embeddings: 0it [00:00, ?it/s]

Generating embeddings: 0it [00:00, ?it/s]

CPU times: user 7.84 s, sys: 223 ms, total: 8.07 s
Wall time: 13.6 s


## RAG Pipeline Test

In [16]:
%%time
# Use the index as a query engine to give the LLM the required context
## to answer questions about the NASA history knowledge domain.
query_engine = index.as_query_engine(
    similarity_top_k=5,
    vector_store_kwargs={"hnsw_ef_search": 256}
)

print(">> Quick test on the RAG system.")
question = "What are the main Hubble telescope discoveries about exoplanets?"
print(f" > Question: {question}")
response = query_engine.query(question)
print(f" > Response:\n", response.response)

>> Quick test on the RAG system.
 > Question: What are the main Hubble telescope discoveries about exoplanets?
 > Response:
 

The Hubble Space Telescope has revealed exceedingly valuable information about hundreds of other worlds, even though it was not designed with exoplanet science in mind. Hubble's observations have extended to Earth-size worlds and have even identified atmospheres that contain sodium, oxygen, carbon, hydrogen, carbon dioxide, methane, and water vapor. While most of the planets Hubble has studied to date are too hot to host life as we know it, the telescope's observations demonstrate that the basic organic components for life can be detected and measured on planets orbiting other stars, setting the stage for more detailed studies with future observatories. Hubble has also confrmed that a planet orbits two suns, and made a detailed global map of another world showing the temperature at different layers in its atmosphere and the amount and distribution of its water 
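
To illustrate what `similarity_top_k=5` controls in the query engine above, here is a toy, pure-Python sketch of top-k retrieval by cosine similarity. It is only a conceptual model: it scans every vector exhaustively, does not include the keyword part of hybrid search, and is not how PGVector executes the query. The function names and the 2-dimensional vectors are made up for the example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to `query`."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "node embeddings" and a toy "question embedding".
node_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question_vector = [1.0, 0.0]
for index, score in top_k(question_vector, node_vectors, k=2):
    print(f"node {index}: similarity {score:.3f}")
```

In the real pipeline, the text of the top-k nodes is what gets passed to the LLM as context, so a larger `similarity_top_k` trades longer prompts for a better chance of including the relevant passage.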