# LlamaIndex Temporal RAG pipeline using KDB.AI vector store

> [KDB.AI](https://kdb.ai/) is a powerful knowledge-based vector database and search engine that allows you to build scalable, reliable AI applications, using real-time data, by providing advanced search, recommendation and personalization.

This example demonstrates how to use KDB.AI to run temporal semantic search over US financial regulation acts, restricting retrieval by publication date.

To access your endpoint and API keys, sign up for [KDB.AI](https://kdb.ai/).

To set up your development environment, follow the instructions on the KDB.AI prerequisites page.

The following examples demonstrate some of the ways you can interact with KDB.AI through LlamaIndex.

## Install dependencies with pip

In [None]:
# %pip install llama-index llama-index-embeddings-huggingface llama-index-llms-openai llama-index-readers-file llama-index-vector-stores-kdbai
# %pip install kdbai_client pandas plotly

## Import dependencies

In [None]:
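from getpass import getpass
import logging  # used by download_file() to report failed downloads
import re
import os
import shutil
import time
import urllib.error  # submodules must be imported explicitly
import urllib.request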

import pandas as pd
import plotly.express as px

from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    ServiceContext,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.kdbai import KDBAIVectorStore

import pykx as kx
import kdbai_client as kdbai

OUTDIR = "pdf"
RESET = True

# LLM = 'gpt-3.5-turbo'
LLM = "gpt-4-turbo-preview"  # Expensive !!!
EMBEDDING = "sentence-transformers/all-mpnet-base-v2"

os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

OpenAI API key:  ········


## Create KDB.AI session and table

In [None]:
KDBAI_ENDPOINT = "http://localhost:8082"
KDBAI_API_KEY = None
KDBAI_TABLE_NAME = "reports"

session = kdbai.Session(endpoint=KDBAI_ENDPOINT, api_key=KDBAI_API_KEY)

if KDBAI_TABLE_NAME in session.list():
    session.table(KDBAI_TABLE_NAME).drop()

schema = dict(
    columns=[
        dict(name="document_id", pytype="bytes"),
        dict(name="text", pytype="bytes"),
        dict(
            name="embedding",
            vectorIndex=dict(type="flat", metric="L2", dims=768),
        ),
        dict(name="title", pytype="bytes"),
        dict(name="publication_date", pytype="datetime64[ns]"),
    ]
)

table = session.create_table(KDBAI_TABLE_NAME, schema)
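
The `dims=768` value in the schema has to match the output size of the embedding model, and `sentence-transformers/all-mpnet-base-v2` produces 768-dimensional vectors. A minimal sketch of a sanity check, assuming the same dict layout as the schema above; `vector_dims` and `example_schema` are hypothetical names for illustration, not part of `kdbai_client`:

```python
def vector_dims(schema):
    """Return the `dims` of the first column that declares a vectorIndex."""
    for column in schema["columns"]:
        index = column.get("vectorIndex")
        if index is not None:
            return index["dims"]
    raise ValueError("schema has no vector index column")


# Output size of sentence-transformers/all-mpnet-base-v2.
EXPECTED_DIMS = 768

# Hypothetical, shortened copy of the schema layout used in this notebook.
example_schema = dict(
    columns=[
        dict(name="document_id", pytype="bytes"),
        dict(
            name="embedding",
            vectorIndex=dict(type="flat", metric="L2", dims=768),
        ),
    ]
)

if vector_dims(example_schema) != EXPECTED_DIMS:
    raise ValueError("schema dims do not match the embedding model")
```

Running a check like this before creating the table surfaces a mismatch early, rather than when the first vectors are inserted.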

## Financial regulation document URLs and metadata

In [None]:
INPUT_URLS = [
    "https://www.govinfo.gov/content/pkg/PLAW-106publ102/pdf/PLAW-106publ102.pdf",
    "https://www.govinfo.gov/content/pkg/PLAW-111publ203/pdf/PLAW-111publ203.pdf",
]

METADATA = {
    "pdf/PLAW-106publ102.pdf": {
        "title": "GRAMM–LEACH–BLILEY ACT, 1999",
        "publication_date": pd.to_datetime("1999-11-12"),
    },
    "pdf/PLAW-111publ203.pdf": {
        "title": "DODD-FRANK WALL STREET REFORM AND CONSUMER PROTECTION ACT, 2010",
        "publication_date": pd.to_datetime("2010-07-21"),
    },
}

## Download PDF files locally

In [None]:
%%time

CHUNK_SIZE = 512 * 1024


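def download_file(url):
    """Download `url` into OUTDIR and return the local file path."""
    print("Downloading %s..." % url)
    out = os.path.join(OUTDIR, os.path.basename(url))
    try:
        response = urllib.request.urlopen(url)
    except urllib.error.URLError:
        # Log the failure and re-raise, so that a path to a file that was
        # never written is not returned to the caller.
        logging.exception("Failed to download %s !" % url)
        raise
    else:
        # Stream the response body to disk in CHUNK_SIZE pieces.
        with response, open(out, "wb") as f:
            while True:
                chunk = response.read(CHUNK_SIZE)
                if not chunk:
                    break
                f.write(chunk)
    return out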


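if RESET:
    if os.path.exists(OUTDIR):
        shutil.rmtree(OUTDIR)
    os.mkdir(OUTDIR)

    local_files = [download_file(x) for x in INPUT_URLS]
else:
    # Reuse the files from a previous run so that local_files is always defined.
    local_files = [
        os.path.join(OUTDIR, os.path.basename(x)) for x in INPUT_URLS
    ]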

Downloading https://www.govinfo.gov/content/pkg/PLAW-106publ102/pdf/PLAW-106publ102.pdf...
Downloading https://www.govinfo.gov/content/pkg/PLAW-111publ203/pdf/PLAW-111publ203.pdf...
CPU times: user 45.3 ms, sys: 31.6 ms, total: 76.9 ms
Wall time: 5.63 s
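
The chunked read/write loop in `download_file` can also be expressed with the standard library's `shutil.copyfileobj`, which copies between file-like objects in fixed-size blocks. A minimal sketch, using an in-memory `io.BytesIO` as a stand-in for the HTTP response so it runs without network access; `copy_stream` is a hypothetical helper name:

```python
import io
import shutil

CHUNK_SIZE = 512 * 1024


def copy_stream(source, destination, chunk_size=CHUNK_SIZE):
    """Copy a readable binary stream to a writable one, chunk by chunk."""
    shutil.copyfileobj(source, destination, chunk_size)


# Stand-in for the response body: slightly more than one chunk of data.
payload = b"x" * (CHUNK_SIZE + 123)
source = io.BytesIO(payload)
destination = io.BytesIO()

copy_stream(source, destination)
```

In the notebook, `source` would be the object returned by `urllib.request.urlopen` and `destination` the file opened in `"wb"` mode.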


## Load local PDF files with LlamaIndex

In [None]:
%%time


def get_metadata(filepath):
    return METADATA[filepath]


documents = SimpleDirectoryReader(
    input_files=local_files,
    file_metadata=get_metadata,
)

docs = documents.load_data()
len(docs)

CPU times: user 3.96 s, sys: 26.8 ms, total: 3.99 s
Wall time: 3.99 s


994
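
`SimpleDirectoryReader` calls the `file_metadata` function with a file path and attaches the returned dict to every document loaded from that file. The lookup above relies on the path string matching the `METADATA` keys exactly. A minimal sketch of a lookup keyed on the file name alone, assuming the reader may pass a differently formed path (for example an absolute one); `EXAMPLE_METADATA` and `get_metadata_by_basename` are hypothetical names:

```python
import os

# Hypothetical metadata table keyed by file name rather than relative path.
EXAMPLE_METADATA = {
    "PLAW-106publ102.pdf": {"title": "Gramm-Leach-Bliley Act, 1999"},
    "PLAW-111publ203.pdf": {"title": "Dodd-Frank Act, 2010"},
}


def get_metadata_by_basename(filepath):
    """Look up metadata by file name, ignoring any directory prefix."""
    return EXAMPLE_METADATA[os.path.basename(filepath)]


relative = get_metadata_by_basename("pdf/PLAW-106publ102.pdf")
absolute = get_metadata_by_basename("/tmp/work/pdf/PLAW-106publ102.pdf")
```

Both calls resolve to the same entry, so the metadata is found regardless of the directory the path was built from.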

## Set up LlamaIndex RAG pipeline using KDB.AI vector store

In [None]:
%%time

embed_model = HuggingFaceEmbedding(model_name=EMBEDDING)
llm = OpenAI(temperature=0, model=LLM)
vector_store = KDBAIVectorStore(table)
# ServiceContext is deprecated since llama-index 0.10; configure the models
# globally through Settings instead.
Settings.embed_model = embed_model
Settings.llm = llm
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    docs,
    storage_context=storage_context,
    transformations=[SentenceSplitter(chunk_size=2048, chunk_overlap=0)],
)



CPU times: user 1min 27s, sys: 3min 4s, total: 4min 31s
Wall time: 33.3 s


## Set up the LlamaIndex Query Engine

In [None]:
%%time

# With gpt-3.5-turbo, the 16k-token context window fits only about 15 pages of a document.
# With gpt-4-turbo-preview, the 128k-token context window can take 100 pages.
K = 100

query_engine = index.as_query_engine(
    similarity_top_k=K,
    filter=[("<", "publication_date", "2008-09-15")],
    sort_by="publication_date",
)

CPU times: user 32.9 ms, sys: 5.4 ms, total: 38.3 ms
Wall time: 37.8 ms
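
The `filter` argument restricts the candidate rows by metadata before the `similarity_top_k` nearest vectors are returned. The following is a pure-Python sketch of that idea only, not KDB.AI's implementation: `search`, `OPERATORS`, and the two-dimensional toy embeddings are all assumptions for illustration, and `sort_by` is not modelled.

```python
import math
from datetime import date

# Comparison operators in the same (op, column, value) tuple style as above.
OPERATORS = {
    "<": lambda a, b: a < b,
    ">=": lambda a, b: a >= b,
}


def search(rows, query_vector, k, filters=()):
    """Keep rows passing every filter, then return the k nearest by L2."""
    candidates = [
        row
        for row in rows
        if all(OPERATORS[op](row[col], value) for op, col, value in filters)
    ]
    candidates.sort(key=lambda row: math.dist(row["embedding"], query_vector))
    return candidates[:k]


# Toy rows: one chunk per act, with made-up embeddings.
rows = [
    {"title": "GLBA", "publication_date": date(1999, 11, 12), "embedding": [0.0, 1.0]},
    {"title": "Dodd-Frank", "publication_date": date(2010, 7, 21), "embedding": [0.1, 0.9]},
]

cutoff = date(2008, 9, 15)
query = [0.1, 0.9]  # closest to the Dodd-Frank row

before = search(rows, query, k=1, filters=[("<", "publication_date", cutoff)])
after = search(rows, query, k=1, filters=[(">=", "publication_date", cutoff)])
```

Although the query vector is closest to the 2010 row, `before` can only contain the 1999 row because the date filter removes later documents from the candidate set. This is what lets the same question be answered "as of" a given date.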


## Before the 2008 crisis

In [None]:
%%time

result = query_engine.query(
    """
What was the main financial regulation in the US before the 2008 financial crisis ?
"""
)
print(result.response)



Before the 2008 financial crisis, the main financial regulations in the US included the Gramm-Leach-Bliley Act of 1999, which repealed parts of the Glass-Steagall Act of 1933, allowing banks to offer a broader range of financial services, including investment and insurance services. Other significant regulations included the Sarbanes-Oxley Act of 2002, aimed at protecting investors from fraudulent financial reporting by corporations, and various rules enforced by the Securities and Exchange Commission (SEC) and other regulatory bodies.
CPU times: user 232 ms, sys: 40.5 ms, total: 272 ms
Wall time: 24.2 s


In [None]:
%%time

result = query_engine.query(
    """
Is the Gramm-Leach-Bliley Act of 1999 enough to prevent the 2008 crisis? Search the document and explain its strengths and weaknesses to regulate the US stock market.
"""
)
print(result.response)

The Gramm-Leach-Bliley Act of 1999, also known as the Financial Services Modernization Act, primarily aimed to allow the consolidation of financial services companies, such as banks, securities firms, and insurance companies. It repealed the Glass-Steagall Act's provisions that prohibited a bank from offering investment, commercial banking, and insurance services. 

Strengths:
1. The Act facilitated a more competitive and efficient financial services market by allowing financial institutions to offer a broader range of services.
2. It aimed to improve the stability of financial institutions by allowing diversification of their services and revenue sources.
3. The Act included provisions for the protection of consumers' personal financial information.

Weaknesses:
1. The Act's allowance for the consolidation of financial services may have contributed to the "too big to fail" phenomenon, where the failure of a single, large financial institution could pose a systemic risk to the economy.

## After the 2008 crisis

In [None]:
%%time

# With gpt-3.5-turbo, the 16k-token context window fits only about 15 pages of a document.
# With gpt-4-turbo-preview, the 128k-token context window can take 100 pages.
K = 100

query_engine = index.as_query_engine(
    similarity_top_k=K,
    filter=[(">=", "publication_date", "2008-09-15")],
    sort_by="publication_date",
)

In [None]:
%%time

result = query_engine.query(
    """
What happened on the 15th of September 2008 ? Answer from your own knowledge if necessary.
"""
)
print(result.response)

On the 15th of September 2008, Lehman Brothers, a global financial services firm, filed for bankruptcy protection, marking the largest bankruptcy filing in U.S. history at the time. This event significantly intensified the 2008 financial crisis, leading to a global economic downturn.
CPU times: user 199 ms, sys: 28.4 ms, total: 228 ms
Wall time: 15.8 s


In [None]:
%%time

result = query_engine.query(
    """
What was the new US financial regulation enacted after the 2008 crisis to increase the market regulation and to improve consumer sentiment ?
"""
)
print(result.response)

The Dodd-Frank Wall Street Reform and Consumer Protection Act, 2010.
CPU times: user 195 ms, sys: 37.5 ms, total: 233 ms
Wall time: 12.5 s


## In-depth analysis

In [None]:
%%time

# With gpt-3.5-turbo, the 16k-token context window fits only about 15 pages of a document.
# With gpt-4-turbo-preview, the 128k-token context window can take 100 pages.
K = 100

query_engine = index.as_query_engine(
    similarity_top_k=K, sort_by="publication_date"
)

CPU times: user 237 µs, sys: 79 µs, total: 316 µs
Wall time: 320 µs
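
The choice of `K` follows from a simple token budget: the retrieved chunks, the prompt, and the answer must all fit in the model's context window. A rough sketch of that arithmetic, assuming about 1,000 tokens per page (the figure implied by the comments above, not a measured value) and an arbitrary 1,000-token reserve for the prompt and answer; `max_chunks` is a hypothetical helper:

```python
def max_chunks(context_tokens, tokens_per_chunk, reserved_tokens=0):
    """Number of whole chunks that fit once the reserve is set aside."""
    return (context_tokens - reserved_tokens) // tokens_per_chunk


ASSUMED_TOKENS_PER_PAGE = 1_000  # assumption, not measured on these PDFs
RESERVED_TOKENS = 1_000  # assumption: room for the prompt and the answer

# Approximate context sizes quoted in the notebook comments.
pages_16k = max_chunks(16_000, ASSUMED_TOKENS_PER_PAGE, RESERVED_TOKENS)
pages_128k = max_chunks(128_000, ASSUMED_TOKENS_PER_PAGE, RESERVED_TOKENS)

# Worst case: every chunk reaches the SentenceSplitter limit of 2048 tokens.
worst_case_128k = max_chunks(128_000, 2_048, RESERVED_TOKENS)
```

Under these assumptions `K = 100` fits comfortably in the 128k window for typical pages, but it would not fit if every retrieved chunk reached the 2048-token limit, so measuring actual chunk sizes is worthwhile before raising `K` further.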


In [None]:
%%time

result = query_engine.query(
    """
Analyse the US financial regulations before and after the 2008 crisis and produce a report of all related arguments to explain what happened, and to ensure that does not happen again.
Use both the provided context and your own knowledge but do mention explicitly which one you use.
"""
)
print(result.response)

Based on the provided context, the Dodd-Frank Wall Street Reform and Consumer Protection Act, enacted on July 21, 2010, represents a significant overhaul of financial regulation in the United States following the 2008 financial crisis. The Act aimed to promote financial stability by improving accountability and transparency in the financial system, ending "too big to fail" institutions, protecting consumers from abusive financial services practices, and for other purposes. Key components of the Act include the establishment of the Financial Stability Oversight Council to monitor systemic risk, the creation of the Office of Financial Research to support the Council with data analysis and research, and enhanced regulation of credit rating agencies to address conflicts of interest and promote accuracy in ratings. The Act also introduced concentration limits on large financial firms to prevent excessive risk accumulation and mandated the creation of a Consumer Financial Protection Bureau t