## Creating an index and populating it with documents using PostgreSQL+pgvector

A simple example showing how to ingest PDF documents, and then web page content, into a PostgreSQL+pgvector VectorStore.

Requirements:
- A PostgreSQL cluster with the pgvector extension installed (https://github.com/pgvector/pgvector)
- A database created in the cluster with the extension enabled (in this example, the database is named `vectordb`). To enable the extension, run the following command in the database as a superuser:
`CREATE EXTENSION vector;`

Note: if your PostgreSQL is deployed on OpenShift, you can run the command directly from inside the Pod (through the Terminal view in the Console, or after using `oc rsh` to log into the Pod): `psql -d vectordb -c "CREATE EXTENSION vector;"`


### Needed packages

In [11]:
!pip install -q pgvector


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


### Base parameters: the PostgreSQL connection info

In [2]:
CONNECTION_STRING = "postgresql+psycopg://vectordb:vectordb@postgresql:5432/vectordb"
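
The connection string above follows the SQLAlchemy URL form `dialect+driver://user:password@host:port/database`. As a minimal sketch, it can also be assembled from separate parts so that a password containing special characters is percent-encoded. The helper name `build_connection_string` and the sample values are hypothetical and not part of the original notebook:

```python
from urllib.parse import quote_plus


def build_connection_string(user, password, host, port, database, driver="psycopg"):
    """Assemble a SQLAlchemy-style PostgreSQL URL (hypothetical helper).

    The user and password are percent-encoded so that characters such as
    '@' or ':' cannot be mistaken for URL delimiters.
    """
    return (
        f"postgresql+{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Same values as in the cell above; yields the same string.
connection_string = build_connection_string(
    "vectordb", "vectordb", "postgresql", 5432, "vectordb"
)
```

With the credentials used in this example, nothing needs escaping, so the result is identical to the hard-coded `CONNECTION_STRING`.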

#### Imports

In [9]:
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores.pgvector import PGVector

## Initial index creation and document ingestion

#### Document loading from a folder containing PDFs

In [4]:
pdf_folder_path = './pdf/'

loader = PyPDFDirectoryLoader(pdf_folder_path)
docs = loader.load()

#### Split documents into chunks with some overlap

In [5]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=40)
all_splits = text_splitter.split_documents(docs)
all_splits[0]

Document(page_content='Student Workbook\nOCP 4.6 DO400\nRed Hat DevOps Pipelines and Processes: CI/CD\nwith Jenkins, Git, and Test-driven Development\n(TDD)\nEdition 6\nDO400-OCP4.6-en-6-20221025 Copyright ©2022 Red Hat, Inc.', metadata={'source': 'pdf/DO400-OCP4.6-en-6-20221025.pdf', 'page': 0})
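
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch that cuts a string into fixed-width windows sharing a few characters with their neighbour. This is not how `RecursiveCharacterTextSplitter` works internally (it splits on separators such as paragraphs, lines and words, so its chunks vary in length); the function `sliding_chunks` is a hypothetical name used only for illustration:

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


example_chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
# example_chunks == ['abcd', 'defg', 'ghij']
```

The overlap keeps a little shared context between neighbouring chunks, so a sentence cut at a chunk boundary is still partly present in both.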

#### Clean up documents, as PostgreSQL does not accept the NUL character ('\x00') in TEXT fields

In [6]:
for doc in all_splits:
    doc.page_content = doc.page_content.replace('\x00', '')
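
If it is useful to know how many chunks actually contained NUL characters, the same cleanup can be wrapped in a small function. This is only a sketch: `strip_nul` is a hypothetical helper, and `FakeDocument` is a stand-in for LangChain's `Document` so that the example runs on its own:

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document (hypothetical, for illustration)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def strip_nul(documents):
    """Remove NUL characters in place and return how many documents changed."""
    cleaned = 0
    for doc in documents:
        if "\x00" in doc.page_content:
            doc.page_content = doc.page_content.replace("\x00", "")
            cleaned += 1
    return cleaned


sample = [FakeDocument("abc\x00def"), FakeDocument("clean")]
affected = strip_nul(sample)
```

The function only relies on the `page_content` attribute, so it should work the same way on the `all_splits` list used in this notebook.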

In [7]:
len(all_splits)

10737

#### Create the index and ingest the documents

In [12]:
embeddings = HuggingFaceEmbeddings()

COLLECTION_NAME = "documents_test"

store = PGVector(
    connection_string=CONNECTION_STRING,
    collection_name=COLLECTION_NAME,
    embedding_function=embeddings)

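# Index of the first split to ingest; use 0 to ingest everything,
# or a higher value to resume after an interrupted run.
start = 1300
# Number of splits sent to the store per call.
chunk_size = 100

for i in range(start, len(all_splits), chunk_size):
    store.add_documents(all_splits[i:i + chunk_size])
    # Cap the counter so the last, possibly partial, batch is reported correctly.
    print(f"Added {min(i + chunk_size, len(all_splits))} splits")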

# Alternative: create the collection and ingest all splits in a single call.
#db = PGVector.from_documents(
#    documents=all_splits,
#    embedding=embeddings,
#    collection_name=COLLECTION_NAME,
#    connection_string=CONNECTION_STRING)

Added 1400 splits
Added 1500 splits
Added 1600 splits
Added 1700 splits
Added 1800 splits
Added 1900 splits
Added 2000 splits
Added 2100 splits
Added 2200 splits
Added 2300 splits
Added 2400 splits
Added 2500 splits
Added 2600 splits
Added 2700 splits
Added 2800 splits
Added 2900 splits
Added 3000 splits
Added 3100 splits
Added 3200 splits
Added 3300 splits
Added 3400 splits
Added 3500 splits
Added 3600 splits
Added 3700 splits
Added 3800 splits
Added 3900 splits
Added 4000 splits
Added 4100 splits
Added 4200 splits
Added 4300 splits
Added 4400 splits
Added 4500 splits
Added 4600 splits
Added 4700 splits
Added 4800 splits
Added 4900 splits
Added 5000 splits
Added 5100 splits
Added 5200 splits
Added 5300 splits
Added 5400 splits
Added 5500 splits
Added 5600 splits
Added 5700 splits
Added 5800 splits
Added 5900 splits
Added 6000 splits
Added 6100 splits
Added 6200 splits
Added 6300 splits
Added 6400 splits
Added 6500 splits
Added 6600 splits
Added 6700 splits
Added 6800 splits
Added 6900
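
The ingestion loop above sends the splits in batches, which keeps each call small and makes it possible to resume from a given offset. A minimal sketch of that batching logic as a reusable generator, which also reports the exact number of items processed so far; `iter_batches` is a hypothetical helper name, and a plain list of integers stands in for the document splits:

```python
def iter_batches(items, batch_size, start=0):
    """Yield (items_done, batch) pairs, beginning at index start.

    items_done is the index just past the last item of the batch, so it
    never exceeds len(items), even when the final batch is shorter.
    """
    for begin in range(start, len(items), batch_size):
        batch = items[begin:begin + batch_size]
        yield begin + len(batch), batch


items = list(range(250))
progress = [done for done, _ in iter_batches(items, batch_size=100)]
resumed = [done for done, _ in iter_batches(items, batch_size=100, start=200)]
# progress == [100, 200, 250]; resumed == [250]
```

In this notebook, each yielded batch would be passed to `store.add_documents`, with `start` set to the last offset that was ingested successfully.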

In [14]:
query = "How do you run a container with a specific user?"
docs_with_score = store.similarity_search_with_score(query)

In [15]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.27092097016933436
Chapter 4| Custom Container Images
Then, you must verify the mapping of host user (huser) to the container user (user). The
following example uses the container ID e6116477c5c9 :
[user@host ~]$ podman top e6116477c5c9 huser user
HUSER       USER
1000        root
The preceding example shows that the user inside the container, root, is mapped to a user with ID
1000 on the host system.
Alternatively, you can verify the same ID mapping by printing the /proc/self/uid_map  and /
proc/self/gid_map  files inside of the container:
[root@e6116477c5c9 /]# cat /proc/self/uid_map /proc/self/gid_map
         0       1000           1
         0       1000           1
When you execute a container with elevated privileges on the host machine, the
root mapping does not take place even when you define subordinate ID ranges, for
example:
[user@host ~]$ sudo podman run -it registry.access.redhat.com
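
A note on reading the score: assuming the store uses cosine distance (which, to the best of my knowledge, is the default distance strategy of LangChain's `PGVector`), the value is a distance and not a similarity, so lower means closer to the query. The sketch below computes cosine distance in plain Python on toy vectors to show the range of values; `cosine_distance` is a hypothetical helper, not a function from the libraries used above:

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])   # identical direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])       # unrelated
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])        # opposite direction
```

Under that assumption, distances range from 0 (same direction) to 2 (opposite direction), so a score around 0.27 indicates a fairly close match.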

## Ingesting new documents

#### Example with Web pages

In [10]:
from langchain.document_loaders import WebBaseLoader

In [11]:
loader = WebBaseLoader(["https://ai-on-openshift.io/getting-started/openshift/",
                        "https://ai-on-openshift.io/getting-started/opendatahub/",
                        "https://ai-on-openshift.io/getting-started/openshift-data-science/",
                        "https://ai-on-openshift.io/odh-rhods/configuration/",
                        "https://ai-on-openshift.io/odh-rhods/custom-notebooks/",
                        "https://ai-on-openshift.io/odh-rhods/nvidia-gpus/",
                        "https://ai-on-openshift.io/odh-rhods/custom-runtime-triton/",
                        "https://ai-on-openshift.io/odh-rhods/openshift-group-management/",
                        "https://ai-on-openshift.io/tools-and-applications/minio/minio/"
                       ])

In [12]:
data = loader.load()

In [13]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=40)
all_splits = text_splitter.split_documents(data)
for doc in all_splits:
    doc.page_content = doc.page_content.replace('\x00', '')

In [14]:
embeddings = HuggingFaceEmbeddings()
store = PGVector(
    connection_string=CONNECTION_STRING,
    collection_name=COLLECTION_NAME,
    embedding_function=embeddings)

In [16]:
store.add_documents(all_splits)