## Creating an index and populating it with documents using PostgreSQL+pgvector

A simple example of how to ingest PDF documents, and then web page content, into a PostgreSQL+pgvector vector store.

Requirements:
- A PostgreSQL cluster with the pgvector extension installed (https://github.com/pgvector/pgvector)
- A database created in the cluster with the extension enabled (in this example, the database is named `vectordb`). To enable the extension, run the following command in the database as a superuser:
`CREATE EXTENSION vector;`

Note: if your PostgreSQL is deployed on OpenShift, directly from inside the Pod (Terminal view on the Console, or using `oc rsh` to log into the Pod), you can run the command: `psql -d vectordb -c "CREATE EXTENSION vector;"`
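The same step can also be done from Python. The following is a minimal sketch, assuming the `psycopg` (version 3) package is installed and that the account used has the privileges to create extensions. The helper name `ensure_pgvector` is hypothetical, and the connection URI passed to it is a plain `postgresql://` URI, without the `+psycopg` driver suffix used later for LangChain.

```python
# SQL statements used by the sketch. IF NOT EXISTS makes the command safe to re-run.
CREATE_EXTENSION_SQL = "CREATE EXTENSION IF NOT EXISTS vector;"
CHECK_EXTENSION_SQL = "SELECT extversion FROM pg_extension WHERE extname = 'vector';"


def ensure_pgvector(conninfo: str):
    """Enable pgvector in the target database and return its installed version.

    `conninfo` is a libpq-style URI, for example
    "postgresql://user:password@localhost:5432/vectordb" (example values).
    """
    import psycopg  # imported lazily so the module loads without a database driver

    with psycopg.connect(conninfo, autocommit=True) as conn:
        conn.execute(CREATE_EXTENSION_SQL)
        row = conn.execute(CHECK_EXTENSION_SQL).fetchone()
        return row[0] if row else None
```

Calling `ensure_pgvector(...)` with a superuser connection should return the extension version as a string, or `None` if the extension could not be found afterwards.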


### Needed packages

In [1]:
#!pip install -q pgvector

In [2]:
#!pip install python-dotenv

### Base parameters, the PostgreSQL info

First we need to build the connection string, which looks like this:
```python
CONNECTION_STRING = "postgresql+psycopg://user:password@postgresql-server:5432/vectordb"
```

In [59]:
import os
from dotenv import load_dotenv

# Load variables from a .env file, if one is present
load_dotenv()

# Read the connection settings from the environment, falling back to the
# example values. The PG* variable names are an assumption of this example.
user = os.getenv("PGUSER", "testuser")
password = os.getenv("PGPASSWORD", "testpwd")
database = os.getenv("PGDATABASE", "vectordb")
server = os.getenv("PGHOST", "localhost")

# Construct the connection string
CONNECTION_STRING = f"postgresql+psycopg://{user}:{password}@{server}:5432/{database}"

# Print the connection string
print(CONNECTION_STRING)

postgresql+psycopg://testuser:testpwd@localhost:5432/vectordb
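If the password contains characters such as `@`, `:` or `/`, the URL built with a plain f-string becomes ambiguous. The sketch below shows one possible way to guard against that by percent-encoding the credentials with the standard library. The helper name `build_connection_string` and the sample password are hypothetical.

```python
from urllib.parse import quote_plus


def build_connection_string(user, password, server, database, port=5432):
    """Build a SQLAlchemy-style URL, percent-encoding the user name and password."""
    return (
        f"postgresql+psycopg://{quote_plus(user)}:{quote_plus(password)}"
        f"@{server}:{port}/{database}"
    )


# A password with reserved characters is encoded so the URL stays parseable.
print(build_connection_string("testuser", "p@ss:w/rd", "localhost", "vectordb"))
# postgresql+psycopg://testuser:p%40ss%3Aw%2Frd@localhost:5432/vectordb
```

With the example credentials used in this notebook the result is identical to the string printed above, since they contain no reserved characters.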


In [46]:
#!pip install langchain

#### Imports

In [60]:
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores.pgvector import PGVector

## Initial index creation and document ingestion

In [61]:
#!pip install pypdf

#### Document loading from a folder containing PDFs

In [62]:
pdf_folder_path = './rhods-doc'
loader = PyPDFDirectoryLoader(pdf_folder_path)
docs = loader.load()
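Before loading, it can help to confirm that the folder exists and actually contains PDF files, since an empty `docs` list only shows up as a problem later. This is a small sketch using only the standard library; the helper name `list_pdfs` is hypothetical, and its file matching is not guaranteed to be identical to what `PyPDFDirectoryLoader` does internally.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files found under `folder`, searched recursively."""
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"PDF folder not found: {root}")
    return sorted(p for p in root.glob("**/*.pdf") if p.is_file())
```

For example, `len(list_pdfs('./rhods-doc'))` gives the number of files that are candidates for ingestion.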

#### Split documents into chunks with some overlap

In [63]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=40)
all_splits = text_splitter.split_documents(docs)
all_splits[0]

Document(page_content="Vector database\nA vector database management system (VDBMS) or simply vector database or vector store is a\ndatabase that can store vectors (fixed-length lists of numbers) along with other data items. Vector databases\ntypically implement one or more Approximate Nearest Neighbor  (ANN) algorithms,[1][2] so that one can\nsearch the database with a query vector to retrieve the closest matching da tabase records.\nVectors are mathematical representations of data in a high-dimensional space. In this space, each dimension\ncorresponds  to a feature of the data, and tens of thous ands of dimensions might be used to represent\nsophisticated data. A vector's position in this space represents its characteristics. Words, phrases, or entire\ndocuments, and images, audio, and ot her types of data can all be vectorized.[3]\nThese feature vectors may be computed from the raw data using machine learning methods such as feature\nextraction algorithms, word embeddings[4] or deep
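To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch of fixed-size chunking with overlap. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to split on separators such as paragraph and line breaks. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split `text` into windows of `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so consecutive chunks share `chunk_overlap` characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


print(chunk_with_overlap("abcdefghij", 4, 1))  # ['abcd', 'defg', 'ghij']
```

The overlap means that a sentence cut at a chunk boundary still appears, at least partially, in the next chunk, which tends to help retrieval.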

#### Clean up the documents, as PostgreSQL won't accept the NUL character, '\x00', in TEXT fields

In [64]:
for doc in all_splits:
    doc.page_content = doc.page_content.replace('\x00', '')
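To see how many chunks were actually affected, the cleanup can be wrapped in small helpers. This is a sketch working on plain strings; the helper names `strip_nul` and `count_nul_chunks` are hypothetical.

```python
def strip_nul(text):
    """Remove NUL characters, which PostgreSQL rejects in TEXT values."""
    return text.replace("\x00", "")


def count_nul_chunks(texts):
    """Count how many strings contain at least one NUL character."""
    return sum("\x00" in t for t in texts)


sample = ["clean text", "bro\x00ken text"]
print(count_nul_chunks(sample))         # 1
print([strip_nul(t) for t in sample])   # ['clean text', 'broken text']
```

In the notebook, `count_nul_chunks(doc.page_content for doc in all_splits)` would need to be run before the cleanup loop to report a meaningful number.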

In [65]:
#!pip install sentence-transformers

In [66]:
# Pip install necessary package
#%pip install --upgrade --quiet  langchain-openai
#%pip install --upgrade --quiet  psycopg2-binary
#%pip install --upgrade --quiet  tiktoken

#### Create the index and ingest the documents

In [67]:
#!pip install psycopg

In [68]:
#pip install pq

In [69]:
#!pip install "psycopg[binary]"

In [70]:
embeddings = HuggingFaceEmbeddings()

COLLECTION_NAME = "documents_test"

db = PGVector.from_documents(
    documents=all_splits,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)

In [71]:
query = "What is vector database?"
docs_with_score = db.similarity_search_with_score(query)

In [72]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.18316369157053736
Vector database
A vector database management system (VDBMS) or simply vector database or vector store is a
database that can store vectors (fixed-length lists of numbers) along with other data items. Vector databases
typically implement one or more Approximate Nearest Neighbor  (ANN) algorithms,[1][2] so that one can
search the database with a query vector to retrieve the closest matching da tabase records.
Vectors are mathematical representations of data in a high-dimensional space. In this space, each dimension
corresponds  to a feature of the data, and tens of thous ands of dimensions might be used to represent
sophisticated data. A vector's position in this space represents its characteristics. Words, phrases, or entire
documents, and images, audio, and ot her types of data can all be vectorized.[3]
These feature vectors may be computed from the raw data using machine learni
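The score printed above is a distance, so lower values mean closer matches. Assuming the store uses cosine distance (which we understand to be the default strategy for this `PGVector` class, though it is worth checking for your version), the score relates to cosine similarity as `distance = 1 - similarity`. The sketch below computes it in plain Python for illustration only; the function name is hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0, same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0, orthogonal vectors
```

Under that assumption, a score of about 0.18 corresponds to a cosine similarity of about 0.82 between the query and the chunk.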

## Ingesting new documents

#### Example with Web pages

In [73]:
from langchain.document_loaders import WebBaseLoader

In [74]:
loader = WebBaseLoader(["https://ai-on-openshift.io/getting-started/openshift/",
                        "https://ai-on-openshift.io/getting-started/opendatahub/",
                        "https://ai-on-openshift.io/getting-started/openshift-data-science/",
                        "https://ai-on-openshift.io/odh-rhods/configuration/",
                        "https://ai-on-openshift.io/odh-rhods/custom-notebooks/",
                        "https://ai-on-openshift.io/odh-rhods/nvidia-gpus/",
                        "https://ai-on-openshift.io/odh-rhods/custom-runtime-triton/",
                        "https://ai-on-openshift.io/odh-rhods/openshift-group-management/",
                        "https://ai-on-openshift.io/tools-and-applications/minio/minio/"
                       ])

In [75]:
data = loader.load()

In [76]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=40)
all_splits = text_splitter.split_documents(data)
for doc in all_splits:
    doc.page_content = doc.page_content.replace('\x00', '')
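Scraped web pages often carry navigation menus padded with long runs of blank lines, which take up chunk space without adding meaning. One possible mitigation, sketched here with a hypothetical helper `collapse_whitespace`, is to normalize the text of each loaded page before splitting it.

```python
def collapse_whitespace(text):
    """Strip each line and collapse runs of blank lines into a single one."""
    cleaned = []
    for line in (raw.strip() for raw in text.splitlines()):
        # Keep non-empty lines, and at most one blank line between them.
        if line or (cleaned and cleaned[-1]):
            cleaned.append(line)
    return "\n".join(cleaned).strip()


page = "Minio\n  \n\n\n\n    Table of contents\n    \n"
print(repr(collapse_whitespace(page)))  # 'Minio\n\nTable of contents'
```

In this notebook it could be applied to each document in `data` (for example `doc.page_content = collapse_whitespace(doc.page_content)`) before calling `split_documents`. It does not remove the menu text itself, only the padding around it.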

In [77]:
embeddings = HuggingFaceEmbeddings()
store = PGVector(
    connection_string=CONNECTION_STRING,
    collection_name=COLLECTION_NAME,
    embedding_function=embeddings)

In [78]:
store.add_documents(all_splits);
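Running the ingestion cell several times adds the same chunks again. One way to limit this, sketched below under stated assumptions, is to derive a deterministic identifier from each chunk's source and content and drop repeats before ingesting. The helper names `content_id` and `unique_by_id` are hypothetical.

```python
import hashlib


def content_id(text, source=""):
    """Return a stable SHA-256 hex digest for a chunk and its source."""
    return hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()


def unique_by_id(chunks):
    """Keep the first occurrence of each (source, text) pair.

    `chunks` is an iterable of (source, text) tuples; returns (id, text) pairs.
    """
    seen = set()
    result = []
    for source, text in chunks:
        cid = content_id(text, source)
        if cid not in seen:
            seen.add(cid)
            result.append((cid, text))
    return result
```

If the installed version of the vector store accepts an `ids` argument in `add_documents`, these identifiers could be passed along with the documents. Whether an existing identifier is updated or duplicated depends on the library version, so this should be verified before relying on it.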

In [79]:
query = "How do you install OpenShift Data Science?"
docs_with_score = store.similarity_search_with_score(query)

In [80]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.4412715768056079
Dashboard configuration
Custom notebooks
NVIDIA GPUs
Custom Serving Runtime (Triton)
OpenShift Group Management
Tools and Applications
Tools and Applications
Apache Airflow
Apache Spark
Apache NiFi
MLflow
NVIDIA Riva
Rclone
Minio
Minio
Table of contents
What is it?
Why this guide?
Pre-requisites
Deploying Minio on OpenShift
Create a Data Science Project (Optional)
Log on to your project in OpenShift Console
Deploy Minio in your project
Creating a bucket in Minio
Log in to Minio
Create a bucket
Create a m