## Creating an index and populating it with documents using PostgreSQL+pgvector

A simple example showing how to ingest PDF documents, and then web page content, into a PostgreSQL+pgvector vector store.

Requirements:
- A PostgreSQL cluster with the pgvector extension installed (https://github.com/pgvector/pgvector)
- A database created in the cluster with the extension enabled (in this example, the database is named `vectordb`). To enable the extension, run the following command in the database as a superuser:
`CREATE EXTENSION vector;`

Note: if PostgreSQL is deployed on OpenShift, you can run the command directly from inside the Pod (using the Terminal view in the Console, or `oc rsh` to log into the Pod): `psql -d vectordb -c "CREATE EXTENSION vector;"`


### Needed packages

In [1]:
!pip install -q pgvector


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


### Base parameters: the PostgreSQL connection info

In [2]:
CONNECTION_STRING = "postgresql+psycopg://user:password@postgresql-server:5432/vectordb"
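If the user name or password contains characters such as `@`, `:` or `/`, they need to be percent-encoded before being placed in the connection URL. The following is a minimal sketch using only the standard library; `build_connection_string` is a hypothetical helper and the credentials are placeholders, not values from this example's environment.

```python
from urllib.parse import quote


def build_connection_string(user, password, host, port, database):
    """Assemble a SQLAlchemy-style URL, percent-encoding the credentials."""
    # safe="" makes quote() escape every reserved character, including
    # '@', ':' and '/', which would otherwise be read as URL delimiters.
    encoded_user = quote(user, safe="")
    encoded_password = quote(password, safe="")
    return (
        f"postgresql+psycopg://{encoded_user}:{encoded_password}"
        f"@{host}:{port}/{database}"
    )


# Placeholder values for illustration only.
EXAMPLE_CONNECTION_STRING = build_connection_string(
    "user", "p@ss/word", "postgresql-server", 5432, "vectordb"
)
print(EXAMPLE_CONNECTION_STRING)
```

The resulting string can be used in place of a hand-written URL wherever a connection string is expected.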

#### Imports

In [3]:
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores.pgvector import PGVector

## Initial index creation and document ingestion

#### Document loading from a folder containing PDFs

In [4]:
pdf_folder_path = './rhods-doc'

loader = PyPDFDirectoryLoader(pdf_folder_path)
docs = loader.load()

#### Split documents into chunks with some overlap

In [5]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=40)
all_splits = text_splitter.split_documents(docs)
all_splits[0]

Document(page_content='Red Hat OpenShift Data Science self-\nmanaged\n \n1.32\nUpgrading OpenShift Data Science self-\nmanaged in a disconnected environment\nLearn how to upgrade Red Hat OpenShift Data Science on OpenShift Container\nPlatform in a disconnected environment\nLast Updated: 2023-09-05', metadata={'source': 'rhods-doc/red_hat_openshift_data_science_self-managed-1.32-upgrading_openshift_data_science_self-managed_in_a_disconnected_environment-en-us.pdf', 'page': 0})
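To get a feel for what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to cut on separators such as paragraph and line breaks); `split_with_overlap` is a hypothetical helper written only to illustrate the two parameters.

```python
def split_with_overlap(text, chunk_size=1024, chunk_overlap=40):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Small numbers make the overlap easy to see.
sample = "".join(str(i % 10) for i in range(25))
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.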

#### Clean up the documents, as PostgreSQL won't accept the NUL character, '\x00', in TEXT fields

In [6]:
for doc in all_splits:
    doc.page_content = doc.page_content.replace('\x00', '')
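The same cleanup can be tried on plain strings to see how many chunks are affected before ingesting. This is a small sketch; `strip_nul` and the sample strings are made up for illustration.

```python
def strip_nul(text):
    """Remove NUL characters, which PostgreSQL rejects in TEXT values."""
    return text.replace("\x00", "")


# Made-up sample chunks; the second one contains two NUL characters.
raw_chunks = ["clean text", "bro\x00ken\x00 text"]
cleaned = [strip_nul(chunk) for chunk in raw_chunks]
affected = sum("\x00" in chunk for chunk in raw_chunks)
print(f"{affected} of {len(raw_chunks)} chunks contained NUL characters")
```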

#### Create the index and ingest the documents

In [7]:
embeddings = HuggingFaceEmbeddings()

COLLECTION_NAME = "documents_test"

db = PGVector.from_documents(
    documents=all_splits,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)

In [8]:
query = "How do you install OpenShift Data Science?"
docs_with_score = db.similarity_search_with_score(query)

In [9]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.22300212246166795
CHAPTER 2. OVERVIEW OF INSTALLING AND DEPLOYING
OPENSHIFT DATA SCIENCE
Red Hat OpenShift Data Science is a platform for data scientists and developers of artificial intelligence
(AI) applications. It provides a fully supported environment that lets you rapidly develop, train, test, and
deploy machine learning models on-premises and/or in the public cloud.
OpenShift Data Science is provided as a managed cloud service add-on for Red Hat OpenShift or as
self-managed software that you can install on-premise or in the public cloud on OpenShift. For
information on installing OpenShift Data Science as a managed cloud service add-on, see 
Installing
OpenShift Data Science
.
Installing OpenShift Data Science involves the following high-level tasks:
1
. 
Confirm that your OpenShift Container Platform cluster meets all requirements.
2
. 
Configure an identity provider for OpenShift Contain
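Assuming the store is configured with cosine distance, the score printed above is a distance rather than a similarity: values closer to 0 indicate a closer match. The sketch below computes cosine distance in plain Python on made-up two-dimensional vectors; real embeddings have hundreds of dimensions, and `cosine_distance` is a hypothetical helper, not part of LangChain or pgvector.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up vectors standing in for a query and two stored chunks.
query_vec = [1.0, 0.0]
close_vec = [1.0, 1.0]  # 45 degrees away from the query
far_vec = [0.0, 1.0]    # orthogonal to the query

close_score = cosine_distance(query_vec, close_vec)
far_score = cosine_distance(query_vec, far_vec)
print(close_score < far_score)
```

Under this reading, sorting results by ascending score puts the most relevant chunks first.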

## Ingesting new documents

#### Example with Web pages

In [10]:
from langchain.document_loaders import WebBaseLoader

In [11]:
loader = WebBaseLoader(["https://ai-on-openshift.io/getting-started/openshift/",
                        "https://ai-on-openshift.io/getting-started/opendatahub/",
                        "https://ai-on-openshift.io/getting-started/openshift-data-science/",
                        "https://ai-on-openshift.io/odh-rhods/configuration/",
                        "https://ai-on-openshift.io/odh-rhods/custom-notebooks/",
                        "https://ai-on-openshift.io/odh-rhods/nvidia-gpus/",
                        "https://ai-on-openshift.io/odh-rhods/custom-runtime-triton/",
                        "https://ai-on-openshift.io/odh-rhods/openshift-group-management/",
                        "https://ai-on-openshift.io/tools-and-applications/minio/minio/"
                       ])

In [12]:
data = loader.load()

In [13]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=40)
all_splits = text_splitter.split_documents(data)
for doc in all_splits:
    doc.page_content = doc.page_content.replace('\x00', '')

In [14]:
embeddings = HuggingFaceEmbeddings()
store = PGVector(
    connection_string=CONNECTION_STRING,
    collection_name=COLLECTION_NAME,
    embedding_function=embeddings)

In [16]:
store.add_documents(all_splits);