# Combined data flow use case

This notebook works through the process for loading PDF files into a vector database and performing a semantic search over those files. This involves two data flows:
1. Processing the PDF files and loading the embedded results into a vector database.
2. Executing the search against a vector index and returning the most semantically similar results.

Since Denodo can manage both the unstructured data and the access to the vector database, we can streamline both of these processes:
* Denodo can serve the PDF files to an application that can process, chunk, and embed the PDF files.
* Denodo can load the vector database with the results generated by the application.
* Applications wanting to perform a vector search can access the Denodo Platform to do this as well, using simple SQL statements.

As usual, we'll download the required packages first:

In [None]:
!pip install langchain langchain_core langchain_community langchain-openai pymupdf

## Importing Libraries

Here we import a few helper libraries. The one to note is `PyMuPDFLoader`, which extracts text from PDF files. The LangChain `RecursiveCharacterTextSplitter` then splits the full pages of PDF text into smaller chunks, so that their semantic meaning is not diluted when they are embedded by our embeddings model.

Finally, we also import the DenodoVector library to help us interact more efficiently with the vector database through the Denodo Platform.

In [None]:
# Importing libraries
import base64, os, urllib.parse, json, re

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# We have to move up a directory to import the Denodo libraries, but we switch back to avoid issues.
%cd ..
try:
    from denodo_python.denodo_vectorstore import DenodoVector, create_denodo_connection, denodo_connection_param, denodo_connection_param_oauth, underlying_view_exists
except Exception as err:
    print(f"Could not import helper libraries, please check with the support team: {err}")
finally:
    %cd ./1_denodo_tools

## Instantiating Embedding Model

Here we use the LangChain `OpenAIEmbeddings` class to interact with OpenAI's embedding models. The resulting object can be passed to a `VectorStore` implementation, which then handles the embedding step automatically before inserting the results into the vector database.

In order to check that it's working correctly, we also call it on an individual string `Hello` to make sure that we can get a list of float values back.

In [None]:
# Set up our normal embeddings model, that will be used to generate vector embeddings from the input data
# We need this for our LangChain VectorStore.
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
)
embed_text = embeddings.embed_query('Hello')
print(str(embed_text[0:3]) + f" + {len(embed_text) - 3} more float values")
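
The list of floats returned above is what makes a semantic search possible: texts with similar meaning map to vectors that point in similar directions. As a minimal sketch of that comparison, the function below computes cosine similarity in plain Python. The toy three-dimensional vectors stand in for real embeddings, and cosine similarity is an assumption for illustration only, since the metric actually used by the vector database is not stated in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values)
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 1.1]
far_vec = [-1.0, 1.0, 0.0]

# A vector pointing in a similar direction scores higher than a dissimilar one
print(cosine_similarity(query_vec, close_vec))
print(cosine_similarity(query_vec, far_vec))
```

With real embeddings, the same function could be applied to two `embeddings.embed_query(...)` results to compare two strings directly.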

## Denodo Connection

Here we connect to the Denodo Platform with the Flight SQL driver and retrieve PDF files from the unstructured data store. Note that we can select specific PDF files by specifying the extension and relative path of these files in the underlying data source.

Please make sure to replace the value of `host` with the DNS name of the EC2 instance hosting the Denodo Platform.

In [None]:
# !!! Update this to point to the EC2 instance's DNS name or we won't be able to access Denodo !!! Example:
# host="ec2-3-238-31-20.compute-1.amazonaws.com"
host="denodo-service"

# Define connection parameters to the Denodo Platform so that we can connect.
port = 9994
db = "vector"
user = "admin"
password = "admin"

# The following function from the Denodo library constructs connection parameters that we can feed into another function
# to create the connection. For OAuth support, we have the "denodo_connection_param_oauth" function.
denodo_con_param = denodo_connection_param(user, password, host, db, port)

# This query retrieves our PDF documents. Note that retrieving different files is as simple as updating the WHERE clause.
with create_denodo_connection(denodo_con_param) as con:
    with con.cursor() as cur:
        cur.execute("SELECT blob_value, file_name, uri FROM bank.bv_local_files WHERE extension = 'pdf'")
        pdf_results = cur.fetchallarrow()

for name in pdf_results.column(1):
    print(name)
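
To make the point about the WHERE clause concrete, here is a minimal sketch of a query builder. The helper `build_pdf_query` and its `uri_contains` parameter are hypothetical names introduced for illustration, and filtering on the `uri` column with `LIKE` is an assumption about how the relative path could be matched. Single quotes are doubled so that a value cannot break out of its string literal; if the driver supports bound parameters, those would be preferable.

```python
def build_pdf_query(extension="pdf", uri_contains=None):
    """Build the file-retrieval query, optionally filtered by a URI substring."""

    def quote_literal(value):
        # Double any single quotes so the value stays inside the SQL string literal
        return "'" + value.replace("'", "''") + "'"

    query = (
        "SELECT blob_value, file_name, uri FROM bank.bv_local_files "
        f"WHERE extension = {quote_literal(extension)}"
    )
    if uri_contains is not None:
        # Assumption: the relative path can be matched against the uri column
        query += f" AND uri LIKE {quote_literal('%' + uri_contains + '%')}"
    return query


# Same query as in the cell above
print(build_pdf_query())
# Restrict the result to files under a (hypothetical) folder
print(build_pdf_query(uri_contains="reports/2023"))
```

The returned string can be passed to `cur.execute(...)` in place of the hard-coded query.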

## PDF Processing

After getting the actual binary values for our PDFs, we still have to feed them into a library that can convert the binary data into human-readable text and associated metadata. This is what `PyMuPDFLoader` does for us.

After loading the PDF, LangChain takes over and processes the text into a format that works well for our vector index. This is done by the `RecursiveCharacterTextSplitter`, which we configure to generate chunks of text that are small enough to carry a specific semantic meaning. Finally, we're left with a list of `Document` objects that we can feed into our `VectorStore`.

In [None]:
# For a vector search to work well, we want to split our text into smaller chunks, so that too many sentences with different contexts
# are not included in the same chunk. To achieve this, we use a LangChain text splitter.

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)

# To extract text from the PDF files, we use the PyMuPDFLoader.
pdfs = []
# The PDF loader requires a file path, so we temporarily save each PDF to the local filesystem
tmp_file_path = '/tmp/output_file.bin'
for blob_value, file_name, uri in zip(pdf_results.column(0), pdf_results.column(1), pdf_results.column(2)):
    # Close the file before loading it, so that all bytes are flushed to disk
    with open(tmp_file_path, 'wb') as file:
        file.write(blob_value.as_py())
    try:
        pdf = PyMuPDFLoader(tmp_file_path).load()
    finally:
        # Always clean up the temporary file, even if parsing fails
        os.remove(tmp_file_path)
    # The loader automatically sets the metadata using the file location; however, we want to overwrite this with how to find the file in Denodo.
    source = uri.as_py()
    # Percent-encode the whole URI, including '/', so that it fits into a single path segment of the web service URL
    encoded_source = urllib.parse.quote(source, safe='')
    for p in pdf:
        # The loader counts pages from zero, so add 1 to match the '#page=' URL fragment
        page = str(int(p.metadata['page']) + 1)
        # TODO: the URI field is not being set properly; this should be resolved
        p.metadata = {"source": source, "page": page, "link": f"http://{host}:9090/denodo-restfulws/unstructured/views/all_files/{encoded_source}?%24select=blob_value#page={page}"}
    pdfs += pdf
    print(f"Parsed document '{file_name.as_py()}'")

# Split all of the parsed pages into text chunks
pages = text_splitter.split_documents(pdfs)
print(f"Total text chunks: {len(pages)}")
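
To show what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch that cuts a string into fixed-size character windows. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraphs and sentences; the helper `chunk_text` is a hypothetical stand-in that only illustrates how consecutive chunks share characters.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Cut text into character windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_text(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
# Prints:
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.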

## Creating the `VectorStore` Object

In this case, we feed a list of LangChain `Document` objects into the vector store rather than raw results. Each `Document` contains metadata about where it originated along with the actual text. These objects are the common output of LangChain processes.

Note that embedding all of the data takes a while (~2 minutes).

In [None]:
# Name of the collection into which we will be inserting our embeddings and text
collection_name = "pdf_vector_search"

# This command ensures that any existing table with the same name is deleted, using helper functions
if 'pdf_denodo_vec' in locals():
    pdf_denodo_vec.delete_index()
    print("Deleted existing VectorStore")
with create_denodo_connection(denodo_con_param) as con:
    with con.cursor() as cur:
        if underlying_view_exists(cur,denodo_con_param['db'],collection_name):
            cur.execute(f"SELECT * FROM DROP_REMOTE_TABLE() WHERE base_view_name = '{collection_name}'")
            cur.fetchone()
            print("Deleted existing backend table")

# This is inserting the text chunks into PGVector after embedding them. It's a bit slow but it should be possible to adapt this to a PySpark pipeline to parallelize it. 

# Initialize the DenodoVector VectorStore
pdf_denodo_vec = DenodoVector.from_documents(
    documents=pages,
    embedding=embeddings,
    collection_name=collection_name,
    connection_param=denodo_con_param,
    # These variables let the Python library know where the vector database data source is defined in Denodo
    db_db='admin',
    db_name='ds_pgvector',
    db_schema='public'
)

# This will take a bit to insert all the records, since pyodbc inserts them one by one.
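
As a first step towards the parallel pipeline mentioned above, the chunks could be grouped into batches so that each batch is embedded and inserted as one unit of work. The sketch below shows only the batching itself in plain Python; the helper `batched` is hypothetical, and whether `DenodoVector` accepts incremental inserts (for example through the standard LangChain `add_documents` method) is an assumption that is not confirmed by this notebook.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder strings standing in for the Document chunks in `pages`
chunks = [f"chunk-{i}" for i in range(7)]

for batch in batched(chunks, batch_size=3):
    # Assumption: this is where one batch would be embedded and inserted,
    # e.g. by a worker process or a Spark task.
    print(len(batch), batch)
```

Each batch is independent of the others, which is what makes it possible to distribute them across workers.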

## Vector Search

After creating a `VectorStore` object, performing a semantic search against the data is as easy as calling the `similarity_search()` method on that object with the string being searched for.

Note that the response metadata also contains a link through which the PDF file itself can be accessed, and this automatically navigates to the page containing the search result using a URL parameter.

In [None]:
# This executes a similarity search against our vector store, which returns a list of documents
output = pdf_denodo_vec.similarity_search("What risks does Microsoft highlight in their report?")

# I would like each document to be led by its location in Denodo, so that I can drill into the document if necessary
# Clicking on the link included in the source will open the document in the browser, accessing it through a Denodo web service
for doc in output:
    print(f"""Source: {doc.metadata}
Content: {doc.page_content}
---""")
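
The `link` value in the metadata printed above follows the URL pattern used in the PDF processing cell. As a minimal sketch, the helper below rebuilds such a link from its parts. The function name `build_pdf_link` and the sample file path are hypothetical; the host, port, and URL path are taken from this notebook.

```python
from urllib.parse import quote


def build_pdf_link(host, uri, page, port=9090):
    """Build a link that opens the given page of a PDF served by the Denodo web service."""
    # safe="" also encodes '/', so the whole URI fits into a single path segment
    encoded_uri = quote(uri, safe="")
    return (
        f"http://{host}:{port}/denodo-restfulws/unstructured/views/all_files/"
        f"{encoded_uri}?%24select=blob_value#page={page}"
    )


# Hypothetical file location, used only to show the encoding
print(build_pdf_link("denodo-service", "/data/reports/annual report.pdf", 3))
```

The `#page=` fragment is what makes the browser's PDF viewer jump to the page that contains the search result.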