## Creating an index and populating it with documents using PostgreSQL+pgvector

A simple example of how to ingest PDF documents, and then web page content, into a PostgreSQL+pgvector VectorStore.

Requirements:
- A PostgreSQL cluster with the pgvector extension installed (https://github.com/pgvector/pgvector)
- A database created in the cluster with the extension enabled (in this example, the database is named `vectordb`). To enable the extension, run the following command in the database as a superuser:
`CREATE EXTENSION vector;`

Note: if your PostgreSQL is deployed on OpenShift, you can run the command directly from inside the Pod (Terminal view on the Console, or using `oc rsh` to log into the Pod): `psql -d vectordb -c "CREATE EXTENSION vector;"`
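
Before ingesting anything, it can be worth confirming that the extension is really enabled in the target database. The sketch below queries the `pg_extension` catalog through any DB-API style cursor. The helper name `pgvector_version` is hypothetical, and the commented usage assumes the `psycopg` driver with a placeholder connection URL (note that `psycopg.connect` takes a plain `postgresql://` URL, not the `postgresql+psycopg://` form used by SQLAlchemy).

```python
from typing import Optional


def pgvector_version(cursor) -> Optional[str]:
    """Return the installed pgvector version, or None if the extension is not enabled.

    `cursor` is any DB-API style cursor (for example one obtained from psycopg).
    """
    cursor.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector'")
    row = cursor.fetchone()
    return row[0] if row else None


# Usage sketch (assumes psycopg is installed and the placeholder URL is replaced):
# import psycopg
# with psycopg.connect("postgresql://user:password@postgresql-server:5432/vectordb") as conn:
#     with conn.cursor() as cur:
#         print(pgvector_version(cur))
```

A return value of `None` would suggest that `CREATE EXTENSION vector;` has not been run in that database yet.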


### Needed packages

In [1]:
!pip install -q pgvector

DEPRECATION: celery 5.0.5 has a non-standard dependency specifier pytz>dev. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of celery or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063


In [2]:
!pip install python-dotenv





### Base parameters, the PostgreSQL info

First, we need to build the connection string, which looks like this:
```python
CONNECTION_STRING = "postgresql+psycopg://user:password@postgresql-server:5432/vectordb"
```

In [53]:
import os
from dotenv import load_dotenv

# Load the .env file
load_dotenv()

# Get the values from the .env file
user = os.getenv("DB_USER")
password = os.getenv("DB_PASSWORD")
database = os.getenv("DB_NAME")
server = "af651cca01b154fe28a0df0167cad5a7-844854289.us-east-2.elb.amazonaws.com"

# Construct the connection string
CONNECTION_STRING = f"postgresql+psycopg://{user}:{password}@{server}:5432/{database}"

# Uncomment to print the connection string (note that it includes the password)
# print(CONNECTION_STRING)
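
One caveat with building the URL from an f-string: a password containing characters such as `@`, `/` or `:` would break it. A minimal sketch of a more defensive construction follows, using the standard library's `urllib.parse.quote`. The helper name `build_connection_string` is an assumption and not part of the original notebook, and the values in the example are placeholders.

```python
from urllib.parse import quote


def build_connection_string(user, password, server, database, port=5432):
    """Build a SQLAlchemy-style URL, percent-encoding the credentials."""
    settings = [("user", user), ("password", password),
                ("server", server), ("database", database)]
    missing = [name for name, value in settings if not value]
    if missing:
        raise ValueError(f"Missing connection settings: {', '.join(missing)}")
    # safe="" makes sure characters such as "/" and "@" are encoded as well
    return (f"postgresql+psycopg://{quote(user, safe='')}:{quote(password, safe='')}"
            f"@{server}:{port}/{database}")


# Placeholder values; in the notebook these would come from os.getenv(...)
example = build_connection_string("user", "p@ss/word", "postgresql-server", "vectordb")
print(example)
```

Failing early on a missing setting also avoids a URL that silently contains the text `None` when an environment variable is not defined.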

In [5]:
!pip install langchain





#### Imports

In [6]:
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores.pgvector import PGVector

## Initial index creation and document ingestion

In [7]:
!pip install pypdf





#### Document loading from a folder containing PDFs

In [8]:
pdf_folder_path = './rhods-doc'
loader = PyPDFDirectoryLoader(pdf_folder_path)
docs = loader.load()
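
A quick sanity check after loading is to count how many pages were read per file. The sketch below assumes each loaded document carries a `source` entry in its metadata, which is what the PDF loaders normally set. The `Doc` stand-in class, the `pages_per_source` helper and the file names are hypothetical and exist only to keep the example self-contained.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (hypothetical, for illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_per_source(docs):
    """Count loaded documents (one per PDF page) grouped by their source file."""
    return Counter(doc.metadata.get("source", "<unknown>") for doc in docs)


sample_docs = [
    Doc("page 1 text", {"source": "a.pdf", "page": 0}),
    Doc("page 2 text", {"source": "a.pdf", "page": 1}),
    Doc("page 1 text", {"source": "b.pdf", "page": 0}),
]
for source, count in pages_per_source(sample_docs).items():
    print(f"{source}: {count} page(s)")
```

In the notebook, the same function could be called as `pages_per_source(docs)` on the real loader output.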

#### Split documents into chunks with some overlap

In [9]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=40)
all_splits = text_splitter.split_documents(docs)
all_splits[0]

Document(page_content="Vector database\nA vector database management system (VDBMS) or simply vector database or vector store is a\ndatabase that can store vectors (fixed-length lists of numbers) along with other data items. Vector databases\ntypically implement one or more Approximate Nearest Neighbor  (ANN) algorithms,[1][2] so that one can\nsearch the database with a query vector to retrieve the closest matching da tabase records.\nVectors are mathematical representations of data in a high-dimensional space. In this space, each dimension\ncorresponds  to a feature of the data, and tens of thous ands of dimensions might be used to represent\nsophisticated data. A vector's position in this space represents its characteristics. Words, phrases, or entire\ndocuments, and images, audio, and ot her types of data can all be vectorized.[3]\nThese feature vectors may be computed from the raw data using machine learning methods such as feature\nextraction algorithms, word embeddings[4] or deep
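
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and line breaks). It only illustrates how consecutive chunks share a few characters, and the name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size chunks where each chunk repeats the last
    `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

With the values used in this notebook (1024 and 40), the overlap is small relative to the chunk, so it mainly helps keep a sentence that straddles a boundary visible in both neighbours.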

#### Clean up documents, as PostgreSQL won't accept the NUL character, '\x00', in TEXT fields

In [10]:
for doc in all_splits:
    doc.page_content = doc.page_content.replace('\x00', '')
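
The same cleanup can be wrapped in a small helper that also covers string values in the metadata. Whether metadata contains NUL characters in practice is an assumption here, and both the `Doc` stand-in and the `strip_nul` name are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (hypothetical, for illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def strip_nul(docs):
    """Remove NUL characters in place from page content and string metadata values.

    Returns the number of documents that were modified.
    """
    changed = 0
    for doc in docs:
        cleaned_content = doc.page_content.replace("\x00", "")
        cleaned_meta = {
            key: value.replace("\x00", "") if isinstance(value, str) else value
            for key, value in doc.metadata.items()
        }
        if cleaned_content != doc.page_content or cleaned_meta != doc.metadata:
            changed += 1
        doc.page_content = cleaned_content
        doc.metadata = cleaned_meta
    return changed


sample = [Doc("clean text"), Doc("bad\x00text", {"source": "a\x00.pdf", "page": 3})]
print(strip_nul(sample), "document(s) cleaned")
```

Returning a count makes it easy to see whether the source files contained any NUL characters at all.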

In [11]:
!pip install sentence-transformers





In [12]:
# Install the necessary packages
%pip install --upgrade --quiet  langchain-openai
%pip install --upgrade --quiet  psycopg2-binary
%pip install --upgrade --quiet  tiktoken

Note: you may need to restart the kernel to use updated packages.




#### Create the index and ingest the documents

In [20]:
!pip install psycopg

Collecting psycopg
  Obtaining dependency information for psycopg from https://files.pythonhosted.org/packages/fe/f2/ab7de9bed559fa1f5efe2b9638be6e2d51ae605c9c5a321e26290cfe9899/psycopg-3.1.17-py3-none-any.whl.metadata
  Using cached psycopg-3.1.17-py3-none-any.whl.metadata (4.2 kB)
Using cached psycopg-3.1.17-py3-none-any.whl (178 kB)
Installing collected packages: psycopg
Successfully installed psycopg-3.1.17




In [22]:
pip install pq

Collecting pq
  Downloading pq-1.9.1.tar.gz (15 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: pq
  Building wheel for pq (setup.py): started
  Building wheel for pq (setup.py): finished with status 'done'
  Created wheel for pq: filename=pq-1.9.1-py3-none-any.whl size=12563 sha256=4ff6602fb860f3b539f77c917247e8b07e020490de4f03839f51015a61786bc3
  Stored in directory: c:\users\rusla\appdata\local\pip\cache\wheels\ae\88\f0\0e3e6cbc020914476fd134c58e2b1a336bf7afea9559e217c5
Successfully built pq
Installing collected packages: pq
Successfully installed pq-1.9.1




In [29]:
!pip install "psycopg[binary]"

Collecting psycopg-binary==3.1.17 (from psycopg[binary])
  Obtaining dependency information for psycopg-binary==3.1.17 from https://files.pythonhosted.org/packages/97/5e/4443a7a19e02486026d9556ac063b98c4185c3037c327d5d64e819b9bca7/psycopg_binary-3.1.17-cp310-cp310-win_amd64.whl.metadata
  Downloading psycopg_binary-3.1.17-cp310-cp310-win_amd64.whl.metadata (2.9 kB)
Downloading psycopg_binary-3.1.17-cp310-cp310-win_amd64.whl (2.9 MB)
Installing collected packages: psycopg-binary
Successfully installed psycopg-binary-3.1.17




In [42]:
embeddings = HuggingFaceEmbeddings()

COLLECTION_NAME = "documents_test"

db = PGVector.from_documents(
    documents=all_splits,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)

In [43]:
query = "What is vector database?"
docs_with_score = db.similarity_search_with_score(query)

In [44]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.18316362945708264
Vector database
A vector database management system (VDBMS) or simply vector database or vector store is a
database that can store vectors (fixed-length lists of numbers) along with other data items. Vector databases
typically implement one or more Approximate Nearest Neighbor  (ANN) algorithms,[1][2] so that one can
search the database with a query vector to retrieve the closest matching da tabase records.
Vectors are mathematical representations of data in a high-dimensional space. In this space, each dimension
corresponds  to a feature of the data, and tens of thous ands of dimensions might be used to represent
sophisticated data. A vector's position in this space represents its characteristics. Words, phrases, or entire
documents, and images, audio, and ot her types of data can all be vectorized.[3]
These feature vectors may be computed from the raw data using machine learni
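
When reading these scores, it helps to know what kind of number is being printed. Assuming the store uses a cosine distance (this appears to be the default for LangChain's `PGVector` in recent versions, though it is worth verifying for the version you run), lower values mean closer matches. The sketch below computes that distance in plain Python on made-up two-dimensional vectors.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 for same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query_vec = [1.0, 0.0]
print(cosine_distance(query_vec, [2.0, 0.0]))  # vector pointing in the same direction
print(cosine_distance(query_vec, [0.0, 3.0]))  # orthogonal vector
```

Under that assumption, a score such as the one printed above would indicate a fairly close match, since it sits much nearer to 0 than to 1.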

## Ingesting new documents

#### Example with Web pages

In [45]:
from langchain.document_loaders import WebBaseLoader

In [46]:
loader = WebBaseLoader(["https://ai-on-openshift.io/getting-started/openshift/",
                        "https://ai-on-openshift.io/getting-started/opendatahub/",
                        "https://ai-on-openshift.io/getting-started/openshift-data-science/",
                        "https://ai-on-openshift.io/odh-rhods/configuration/",
                        "https://ai-on-openshift.io/odh-rhods/custom-notebooks/",
                        "https://ai-on-openshift.io/odh-rhods/nvidia-gpus/",
                        "https://ai-on-openshift.io/odh-rhods/custom-runtime-triton/",
                        "https://ai-on-openshift.io/odh-rhods/openshift-group-management/",
                        "https://ai-on-openshift.io/tools-and-applications/minio/minio/"
                       ])

In [47]:
data = loader.load()

In [48]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=40)
all_splits = text_splitter.split_documents(data)
for doc in all_splits:
    doc.page_content = doc.page_content.replace('\x00', '')
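
Text extracted from web pages often carries indentation and whitespace-only lines left over from navigation menus, which eats into the 1024-character chunk budget. One option, sketched below with a hypothetical `normalize_whitespace` helper, is to collapse that whitespace in each document's `page_content` before splitting. This is an optional extra step, not something the original notebook does.

```python
import re


def normalize_whitespace(text):
    """Trim each line, drop empty lines, and collapse runs of spaces and tabs."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


raw = "Dashboard configuration\n  \n\n\n    Custom notebooks\n  \n\n    NVIDIA   GPUs\n"
print(normalize_whitespace(raw))
```

In the notebook, this could be applied right after `loader.load()` with a loop such as `for doc in data: doc.page_content = normalize_whitespace(doc.page_content)`.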

In [49]:
embeddings = HuggingFaceEmbeddings()
store = PGVector(
    connection_string=CONNECTION_STRING,
    collection_name=COLLECTION_NAME,
    embedding_function=embeddings)

In [50]:
store.add_documents(all_splits);
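
For larger collections, it may be preferable to add documents in batches, so that progress is visible and each call embeds a bounded number of chunks. A minimal sketch, where `batched` is a hypothetical helper and the batch size of 100 is an arbitrary choice:

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` containing at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in data; in the notebook this would be `all_splits`.
chunks = [f"chunk {i}" for i in range(250)]
for number, batch in enumerate(batched(chunks, 100), start=1):
    # In the notebook: store.add_documents(batch)
    print(f"batch {number}: {len(batch)} item(s)")
```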

In [51]:
query = "How do you install OpenShift Data Science?"
docs_with_score = store.similarity_search_with_score(query)

In [52]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.44127142429351807
Dashboard configuration
Custom notebooks
NVIDIA GPUs
Custom Serving Runtime (Triton)
OpenShift Group Management
Tools and Applications
Tools and Applications
Apache Airflow
Apache Spark
Apache NiFi
MLflow
NVIDIA Riva
Rclone
Minio
Minio
Table of contents
What is it?
Why this guide?
Pre-requisites
Deploying Minio on OpenShift
Create a Data Science Project (Optional)
Log on to your project in OpenShift Console
Deploy Minio in your project
Creating a bucket in Minio
Log in to Minio
Create a bucket
Create a 