# Document Ingestion Pipeline

__Before running this notebook, follow the instructions in the README file of this directory.__

In [1]:
import time
# Start timer to time the notebook execution
start = time.time()

# General purpose imports
import pandas as pd
import ast
import yaml
from dotmap import DotMap
import psycopg2
from sqlalchemy import make_url
from pprint import pprint
from IPython.display import display, Markdown
import ipywidgets as widgets
widgets.IntSlider()

# LlamaIndex imports
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
)
from llama_index.readers.file import PyMuPDFReader
from llama_index.core import Settings
from llama_index.llms.openai_like import OpenAILike
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline
from llama_index.embeddings.nvidia import NVIDIAEmbedding

# Imports from this repo
import sys
utils_path = "../08-Utils"
if utils_path not in sys.path:
    sys.path.append(utils_path)

from helpers import (
    get_indices_with_nulls, 
    remove_elements,
    TextCleaner
)

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


## Global Config Setup 
- Here we define most of the RAG pipeline parameters and LlamaIndex defaults.
- The goal is to provide a single control cell to define the notebook's execution.

In [2]:
# Open the Starter Pack global configuration file
with open('../07-Starter_Pack_config/improved_rag_config.yaml', 'r') as file:
    config = yaml.safe_load(file)
config = DotMap(config)
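
The cells below read several nested keys from this configuration. By default, DotMap creates missing attributes on access instead of raising an error, so a misspelled key can surface late in the run. A minimal sketch of an early check on the plain dict, before it is wrapped in DotMap, follows. `find_missing_paths` is a hypothetical helper name, `REQUIRED_PATHS` is a partial list gathered from the attribute accesses in this notebook, and the sample values are placeholders rather than the real configuration.

```python
# Dotted key paths read by this notebook (partial list, for illustration).
REQUIRED_PATHS = [
    "ml_models.llm_generator.model",
    "ml_models.llm_generator.api_base",
    "ml_models.embedder.model",
    "ml_models.embedder.api_base",
    "file_paths.kb_doc_dir",
    "llama_index.min_doc_length",
    "llama_index.chunk_size",
    "llama_index.chunk_overlap",
    "postgresql.tables.std_rag",
    "postgresql.pgvector.hybrid_search",
]


def find_missing_paths(config: dict, paths: list[str]) -> list[str]:
    """Return the dotted paths that are absent from a nested dict."""
    missing = []
    for path in paths:
        node = config
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing


# Placeholder configuration: the values are illustrative only.
sample_config = {
    "ml_models": {
        "embedder": {"model": "some-embedder", "api_base": "http://localhost"},
    },
    "file_paths": {"kb_doc_dir": "./docs"},
}
missing = find_missing_paths(sample_config, REQUIRED_PATHS)
print(f"{len(missing)} missing keys:", missing)
```

In the notebook, such a check would run on the dict returned by `yaml.safe_load`, before the `DotMap(config)` call.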

In [3]:
# LlamaIndex LLM provider
llm_cfg = config.ml_models.llm_generator
Settings.llm = OpenAILike(
    model=llm_cfg.model,
    api_key=llm_cfg.api_key,
    api_base=llm_cfg.api_base,
    temperature=llm_cfg.temperature,
    max_tokens=llm_cfg.max_tokens,
    repetition_penalty=llm_cfg.repetition_penalty,
)

# LlamaIndex embedding model
emb_cfg = config.ml_models.embedder
Settings.embed_model = NVIDIAEmbedding(
    base_url=emb_cfg.api_base,
    model=emb_cfg.model,
    embed_batch_size=emb_cfg.batch_size,
    truncate="END",
)
# Embed a short probe string to find the embedding dimension
EMBEDDING_SIZE = len(Settings.embed_model.get_text_embedding("hi"))

## PDF Documents Parsing

In [4]:
%%time
# Lambda function to add the file name as metadata at loading time
filename_fn = lambda filename: {"file_name": filename.split("/")[-1]}

# >> PDF Document Parsing
# PyMuPDFReader ingests the PDF files in roughly 1/20 of the time the default reader takes.
# Note: PyMuPDFReader creates one document object per page of a PDF document.
reader = SimpleDirectoryReader(
    input_dir=config.file_paths.kb_doc_dir,
    required_exts=[".pdf"],
    file_extractor={".pdf":PyMuPDFReader()},
    file_metadata=filename_fn,
    num_files_limit=10,
)
documents = reader.load_data()

# Filter out documents with null (`\x00`) characters, which are incompatible with PGVector.
bad_docs = get_indices_with_nulls(documents)
documents = remove_elements(documents, bad_docs)
print(f"Parsing is complete. Loaded {len(documents)} pages")


Parsing is complete. Loaded 3855 pages
CPU times: user 33.1 s, sys: 12.6 s, total: 45.6 s
Wall time: 46.3 s
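
The two helpers used above are imported from `08-Utils` and their source is not shown in this notebook. The sketch below is one possible implementation of the same idea, not the repo's actual code: find the positions of pages whose text contains a NUL character, then rebuild the list without them. `Page` is a hypothetical stand-in for a LlamaIndex document that carries only a `text` field.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Stand-in for a LlamaIndex document; only the `text` field is used."""
    text: str


def get_indices_with_nulls(docs):
    """Return the positions of docs whose text contains a NUL character."""
    return [i for i, doc in enumerate(docs) if "\x00" in doc.text]


def remove_elements(items, indices):
    """Return a new list without the items at the given positions."""
    drop = set(indices)  # set membership keeps the filter linear in len(items)
    return [item for i, item in enumerate(items) if i not in drop]


pages = [Page("clean text"), Page("bad\x00text"), Page("more clean text")]
bad = get_indices_with_nulls(pages)
kept = remove_elements(pages, bad)
print(bad, [p.text for p in kept])
```

Building a new list, instead of deleting by index in place, avoids the index shifting that occurs when several positions are removed one after another.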


In [5]:
# Display one document as example
example_doc = 10
print(f">> Document #{example_doc}'s text:\n", documents[example_doc], "\n")
print(f">> Document #{example_doc}'s metadata:")
pprint(documents[example_doc].metadata, depth=1, indent=4, width=100)

>> Document #10's text:
 Doc ID: b55ed939-3e75-4dde-975f-284798f12b94
Text: 1 INTRODUCTION NASA’s Solar System Exploration Paradigm:  The
First 50 Years and a  Look at the Next 50 James L. Green and Kristen
J. Erickson A FTER MANY FAILURES to get to the Moon and to the planets
beyond,  Mariner 2 successfully flew by Venus in December 1962. This
historic  mission began a spectacular era of solar system exploration
for NA... 

>> Document #10's metadata:
{   'file_name': '50-years-of-solar-system-exploration_tagged.pdf',
    'file_path': '/Users/kike/DataspellProjects/Improved_RAG/Starter-Packs/Improved_RAG/03-Document_ingestion/../02-KB-Documents/NASA/50-years-of-solar-system-exploration_tagged.pdf',
    'source': '11',
    'total_pages': 364}


## Documents metrics analysis
This section helps to understand the text data and its possible effects on LLM inference timeouts. Note that pages with very few words, or no words at all, might cause the vLLM inference service to fail.

In [6]:
# Helper dict to form a dataframe
docs_metrics = {
    'words': [],
    'text': [],
}
for doc in documents:
    docs_metrics['words'].append(len(doc.text.split()))
    docs_metrics['text'].append(doc.text)
docs_metrics = pd.DataFrame.from_dict(docs_metrics)

# Get statistics about number of words in documents
## Notice the presence of docs with 0 words
print("Words per page statistics")
display(docs_metrics.words.describe())

# Remove "too small" pages from the corpus
lindex_cfg = config.llama_index
short_docs = docs_metrics.words[docs_metrics.words<lindex_cfg.min_doc_length].index.to_list()
print(f">> Removing {len(short_docs)} pages from the corpus")
documents = remove_elements(documents, short_docs)
print(f" > New pages list size: {len(documents)}")

Words per page statistics


count    3855.000000
mean      383.659403
std       156.257311
min         0.000000
25%       304.000000
50%       412.000000
75%       474.000000
max      2775.000000
Name: words, dtype: float64

>> Removing 163 pages from the corpus
 > New pages list size: 3692


## Set up and run the ingestion pipeline

In [7]:
%%time

# Create the pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        # Remove troublesome characters
        TextCleaner(),
        # Split docs with preference to complete sentences
        SentenceSplitter(
        # TokenTextSplitter(
                chunk_size=lindex_cfg.chunk_size,
                chunk_overlap=lindex_cfg.chunk_overlap,
                include_metadata=False,
        ),
        # Generate embeddings for document splits
        Settings.embed_model,
    ]
)

# Run the ingestion pipeline
nodes = pipeline.run(
    show_progress=True,
    documents=documents,
    num_workers=lindex_cfg.num_workers,
)
print(f">> Created {len(nodes)} nodes.")

Parsing nodes:   0%|          | 0/3692 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/7053 [00:00<?, ?it/s]

>> Created 7053 nodes.
CPU times: user 13.8 s, sys: 1.16 s, total: 15 s
Wall time: 1min 11s
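
The run above turned 3692 pages into 7053 nodes, roughly 1.9 nodes per page. To illustrate how `chunk_size` and `chunk_overlap` interact, here is a simplified sliding-window sketch. It is not the `SentenceSplitter` algorithm: `SentenceSplitter` measures sizes in tokens and prefers sentence boundaries, while this sketch counts whitespace-separated words and cuts anywhere. `chunk_words` is a hypothetical name.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into word windows of `chunk_size` sharing `chunk_overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


sample = "one two three four five six seven eight nine ten"
for chunk in chunk_words(sample, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` words after the previous one, so a larger overlap yields more chunks, and therefore more embeddings, for the same page.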


## Displaying a node's structure: text + metadata

In [8]:
# Show what a LlamaIndex node and its metadata look like.
print(">> LlamaIndex node's metadata sample:")
print(nodes[0].to_dict()['metadata'], "\n")
print(">> LlamaIndex node's text excerpt:\n", 
      nodes[0].text,
      "...")

>> LlamaIndex node's metadata sample:
{} 

>> LlamaIndex node's text excerpt:
 HISTORICAL PERSPECTIVES NASA’S FIRST SUCCESSFUL MISSION to another planet,  Mariner 2 to Venus in 1962, marked the beginning of what NASA Chief Scientist Jim Green  describes in this volume as “a spectacular era”  of solar system exploration. In its first 50 years  of planetary exploration, NASA sent spacecraft  to fly by, orbit, land on, or rove on every planet  in our solar system, as well as Earth’s Moon and  several moons of other planets. Pluto, reclassified as a dwarf planet in 2006, was visited by  the New Horizons spacecraft in 2015. What began as an endeavor of two  nations—the United States and the former  Soviet Union—has become a multinational  enterprise, with a growing number of space  agencies worldwide building and launching planetary exploration missions—sometimes alone,  sometimes together.  In this volume, a diverse array of scholars addresses the science, technology, policy,  and politics

## Create a PGVector store in an existing PostgreSQL DB

In [9]:
%%time

# Connect to the PostgreSQL engine and initialize the DB to serve as vector/document store.
db_cfg = config.postgresql
connection_string = (f"postgresql://{db_cfg.user}:"
                     f"{db_cfg.password}@{db_cfg.db_host}:{db_cfg.port}/{db_cfg.default_db}")
conn = psycopg2.connect(connection_string)
conn.autocommit = True

# Create a url object to store DB connection parameters
url = make_url(connection_string)

# Drop the table left over from a previous run, if any.
# The connection opened above is in autocommit mode, so no explicit commit is needed.
cursor = conn.cursor()
cursor.execute(f"DROP TABLE IF EXISTS public.data_{db_cfg.tables.std_rag};")
cursor.close()
conn.close()

# Connect to the PGVector extension
vector_store = PGVectorStore.from_params(
    database=url.database,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name=db_cfg.tables.std_rag,
    embed_dim=EMBEDDING_SIZE, # embedding model dimension
    cache_ok=True,
    hybrid_search=db_cfg.pgvector.hybrid_search, # retrieve nodes based on vector values and keywords
)

# LlamaIndex persistence object backed by the PGVector connection
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# LlamaIndex index population (nodes -> embeddings -> vector store)
index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
    show_progress=True,
    transformations=None,
)

Generating embeddings: 0it [00:00, ?it/s]

CPU times: user 7.02 s, sys: 821 ms, total: 7.84 s
Wall time: 1min 21s
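
The connection string above interpolates the user name and password directly into a URL. If either value contains characters such as `@`, `/` or `:`, the URL can be parsed incorrectly. A minimal sketch of percent-encoding the credentials first, using only the standard library, is shown below. `build_postgres_url` is a hypothetical helper and all sample values are placeholders.

```python
from urllib.parse import quote, unquote, urlsplit


def build_postgres_url(user, password, host, port, database):
    """Build a postgresql:// URL with percent-encoded credentials."""
    # safe="" also encodes "/" so it cannot be mistaken for a path separator
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# Placeholder credentials, for illustration only
url_string = build_postgres_url("rag_user", "p@ss/word", "localhost", 5432, "postgres")
print(url_string)

# The encoded URL still splits into the expected parts
parts = urlsplit(url_string)
print(parts.hostname, parts.port, unquote(parts.password))
```

The resulting string can be used in the same way as `connection_string` in the cell above.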


## RAG Pipeline Test

In [10]:
%%time
# Use the index as a query engine to give the LLM the required context
## to answer questions about the NASA history knowledge domain.
h_search = ast.literal_eval(db_cfg.pgvector.pgvector_kwargs)
query_engine = index.as_query_engine(
    llm=Settings.llm,
    similarity_top_k=5,
    vector_store_kwargs=h_search
)

print(">> Quick test on the RAG system.\n")
question = ("As a research assistant, "
            "list the top 10 Hubble telescope discoveries about exoplanets. "
            "Highlight important text using markdown formatting.")
print(f" > Query: {question}")
response = query_engine.query(question)
display(Markdown(response.response))

>> Quick test on the RAG system.

 > Query: As a research assistant, list the top 10 Hubble telescope discoveries about exoplanets. Highlight important text using markdown formatting.


 **Top 10 Hubble Telescope Discoveries about Exoplanets**

1. **Probing the Atmospheres of Rocky, Habitable-Zone Planets**: Hubble studies exoplanet atmospheres spectroscopically during transits, helping astronomers understand what the atmospheres are made of. This offers clues to the planets' formation and evolution and hints at whether they are likely to be habitable.
2. **Spotting a World with a Glowing Water Atmosphere**: Hubble detected water vapor in the atmosphere of a distant exoplanet, which could be a sign of life.
3. **Discovering an Alien Atmosphere that is Brimming with Water**: Hubble found that a distant exoplanet has an atmosphere rich in water vapor, which could be a sign of life.
4. **Detecting Water Vapor on a Habitable-Zone Exoplanet**: Hubble detected water vapor on a planet that orbits its star at a distance where liquid water could exist on its surface, making it a potential candidate for hosting life.
5. **Exposing the First Evidence of a Possible Exomoon**: Hubble detected a possible moon orbiting a distant exoplanet, which could provide insights into the formation and evolution of planetary systems.
6. **Capturing a Blistering Pitch-Black Planet**: Hubble captured an image of a planet with a surface that is as dark as coal, which could be due to the presence of organic material or other substances.
7. **Finding a Shrinking Planet**: Hubble detected a planet that is shrinking due to the loss of mass, which could be a sign of a planet that is experiencing a catastrophic event.
8. **Uncovering a Football-Shaped ‘Heavy Metal’ Exoplanet**: Hubble discovered a planet with a unique shape and composition, which could provide insights into the formation and evolution of planetary systems.
9. **Unraveling Mysteries Surrounding ‘Cotton Candy’ Planets**: Hubble studied the atmospheres of exoplanets with unusual compositions, which could provide insights into the formation and evolution of planetary systems.
10. **Tracking an Exiled Exoplanet’s Far-Flung Orbit**: Hubble tracked the orbit of a distant exoplanet that is no longer in its original orbit, which could provide insights into the formation and evolution of planetary systems.

These discoveries highlight the importance of the Hubble Space Telescope in advancing our understanding of exoplanets and their potential for hosting life. **[1](https://www.nasa.gov/hubble)** **[2](https://hubblesite.org)** **[3](https://www.facebook.com/NASAHubble)** **[4](https://twitter.com/NASAHubble)** **[5](https://www.instagram.com/NASAHubble)** **[6](https://www.youtube.com/playlist?list=PL3E861DC9F9A8F2E9)** **[7](https://www.flickr.com/photos/nasahubble)** **[8](https://www.pinterest.com/nasa/hubble-space-telescope/)**

Note: The text in bold is the original text from the provided context information. The rest of the answer is a summary and explanation of the top 10 Hubble telescope discoveries about exoplanets.

CPU times: user 60.5 ms, sys: 6.48 ms, total: 66.9 ms
Wall time: 4.73 s
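
The query cell parses `pgvector_kwargs` with `ast.literal_eval`, which suggests that the value is stored in the YAML file as a string holding a Python dict literal. The sketch below shows how that parsing behaves and why it is safer than `eval`. The string value and its keys are hypothetical placeholders, not the real configuration, and `parse_kwargs` is a hypothetical helper name.

```python
import ast


def parse_kwargs(raw: str) -> dict:
    """Parse a string holding a Python dict literal into a dict."""
    value = ast.literal_eval(raw)  # accepts literals only, never runs code
    if not isinstance(value, dict):
        raise TypeError(f"expected a dict literal, got {type(value).__name__}")
    return value


# Hypothetical value, as it might appear in the YAML file
raw_kwargs = "{'example_option': 10, 'example_flag': True}"
parsed = parse_kwargs(raw_kwargs)
print(parsed)

# Unlike eval(), literal_eval rejects anything that is not a plain literal
try:
    parse_kwargs("__import__('os').getcwd()")
except ValueError as err:
    print("rejected:", type(err).__name__)
```

Storing the options as a native YAML mapping would remove the need for this parsing step, at the cost of changing the configuration file format.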


In [11]:
stop = time.time()
print(f"Notebook execution time: {(stop-start)/60:.1f} minutes")

Notebook execution time: 3.5 minutes
