## Creating an index and populating it with documents using Elasticsearch

A simple example of how to ingest PDF documents, and then web page content, into an Elasticsearch vector store.

Requirements:
- An Elasticsearch cluster
    - Can be deployed using the Elasticsearch (ECK) operator
    - Create an Elasticsearch cluster instance from the operator
    - This creates the certificates and credentials required for connecting
    
__NOTE: You need the correct certificates to establish a connection with the Elasticsearch pod. It can help to decode `tls.crt` and `ca.crt` with an SSL certificate decoder to confirm that the service URL is listed in the certificate.__

## Install and Configuration

1. Install the Elasticsearch (ECK) Operator from OperatorHub or through Argo CD
2. Create an Elasticsearch cluster resource using the form view or YAML

```yaml
kind: Elasticsearch
apiVersion: elasticsearch.k8s.elastic.co/v1
metadata:
  name: elasticsearch # or any name you want
  namespace: <NAMESPACE> # CHANGE
spec:
  version: 8.14.0
  nodeSets:
    - name: default
      config:
        node.roles:
          - master
          - data
        node.attr.attr_name: attr_value
        node.store.allow_mmap: false
      podTemplate:
        metadata:
          labels: # Configure as necessary
            <ADD LABELS HERE>
        spec:
          containers:
            - name: elasticsearch
              resources: # CHANGE memory requests and limits as necessary
                requests:
                  memory: 4Gi
                  cpu: 1
                limits:
                  memory: 4Gi
                  cpu: 2
      count: <POD REPLICAS>

```
3. Once this YAML has been applied, the operator also creates the following resources:
    * elasticsearch-es-default
    * elasticsearch-es-default-es-config
    * elasticsearch-es-default-es-transport-certs
    * elasticsearch-es-file-settings
    * elasticsearch-es-http
    * elasticsearch-es-http-ca-internal
    * elasticsearch-es-http-certs-internal
    * elasticsearch-es-internal-http
    * elasticsearch-es-internal-users
    * elasticsearch-es-remote-ca
    * elasticsearch-es-scripts
    * elasticsearch-es-transport
    * elasticsearch-es-transport-ca-internal
    * elasticsearch-es-unicast-hosts
    * elasticsearch-es-xpack-file-realm

4. The resources needed to connect to and query Elasticsearch are:
    * elasticsearch-es-http (svc)
    * elasticsearch-es-internal-http (svc)
    * elasticsearch-es-http-certs-internal (secret)
    * elasticsearch-es-internal-users (secret)

5. Inside this notebook:
    - HOST = elasticsearch-es-http (the example below uses the fully qualified form `elasticsearch-es-http.<NAMESPACE>.svc`)
    - PORT = 9200
    - Create or update the `ca.crt` file so that it contains the `ca.crt` field of the secret `elasticsearch-es-http-certs-internal`
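
The `ca.crt` step can also be scripted. The following is a minimal sketch, assuming the secret's fields are already available as a dict of base64-encoded strings (Kubernetes stores the values under a secret's `data` key base64-encoded). The helper name `write_ca_cert` and its arguments are hypothetical and not part of this notebook.

```python
import base64
from pathlib import Path


def write_ca_cert(secret_data: dict, path: str = "ca.crt") -> Path:
    """Decode the base64 `ca.crt` field of a secret and write it to `path`.

    `secret_data` is assumed to be the `data` mapping of the secret
    `elasticsearch-es-http-certs-internal`, fetched by whatever means you prefer.
    """
    pem = base64.b64decode(secret_data["ca.crt"])
    # Cheap sanity check so an empty or wrong field is caught early
    if b"BEGIN CERTIFICATE" not in pem:
        raise ValueError("decoded ca.crt does not look like a PEM certificate")
    out = Path(path)
    out.write_bytes(pem)
    return out
```

The resulting file is what the `ca_certs="ca.crt"` argument of the connection cell points to.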

### (Optional) Creating a Route
1. Go to OCP Web Console > Networking > Routes
2. Create a route that points to the service `elasticsearch-es-http`
3. Check TLS encrypted
4. Set the encryption type to Reencrypt
5. Set the `Destination CA Cert` to the value of the ca.crt field of the secret `elasticsearch-es-http-certs-internal`
6. Visit the route and log in with the username `elastic` and the password from the secret `elasticsearch-es-internal-users`


### Needed packages

In [56]:
!pip install -q elasticsearch langchain==0.1.12 pypdf==4.0.2 sentence-transformers==2.4.0 einops==0.7.0 lxml==5.1.0 tqdm==4.66.2


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.2.2[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


### Base parameters: Elasticsearch connection info

In [123]:
import os

ELASTIC_USER = "elastic"
ELASTIC_PASSWORD = os.getenv("ELASTIC_PASSWORD")
HOST = 'elasticsearch-es-http.rhsaia-lab.svc'
PORT = '9200'

product_version = "2.9"  # kept as a string so a version such as "2.10" is not collapsed to 2.1
COLLECTION_NAME = f"rhoai-doc-{product_version}"
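
Because `os.getenv` returns `None` when the variable is missing, a forgotten `ELASTIC_PASSWORD` only surfaces later as an authentication error. A small fail-fast sketch is shown below; `require_env` is a hypothetical helper, not something the notebook defines.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible usage in the cell above (left commented out so this sketch runs standalone):
# ELASTIC_PASSWORD = require_env("ELASTIC_PASSWORD")
```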

#### Imports

In [120]:
from langchain.document_loaders import PyPDFDirectoryLoader, WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import ElasticsearchStore
from elasticsearch import Elasticsearch

#### Create Elasticsearch Connection

In [124]:
es_conn = Elasticsearch(
    f"https://{HOST}:{PORT}",
    basic_auth=(ELASTIC_USER, ELASTIC_PASSWORD),
    ca_certs="ca.crt"
)
es_conn.info()

ObjectApiResponse({'name': 'elasticsearch-es-default-0', 'cluster_name': 'elasticsearch', 'cluster_uuid': 'AXMjHNKaR2S7-fp7uVRv4w', 'version': {'number': '8.14.0', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '8d96bbe3bf5fed931f3119733895458eab75dca9', 'build_date': '2024-06-03T10:05:49.073003402Z', 'build_snapshot': False, 'lucene_version': '9.10.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})
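
If the cluster is still starting, the first connection attempt can fail. Below is a minimal sketch of a retry loop, assuming a client object whose `ping()` method returns a boolean (as the `elasticsearch` Python client's does). The function name `wait_until_reachable` and its parameters are hypothetical.

```python
import time


def wait_until_reachable(client, attempts: int = 5, delay: float = 2.0) -> bool:
    """Return True as soon as `client.ping()` succeeds, False if every attempt fails."""
    for attempt in range(attempts):
        if client.ping():
            return True
        # Sleep between attempts, but not after the last one
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Possible usage with the connection created above:
# if not wait_until_reachable(es_conn):
#     raise RuntimeError("Elasticsearch is not reachable; check HOST, PORT and ca.crt")
```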

## Initial index creation and document ingestion

#### Download and load PDFs

In [100]:
documents = [
    "release_notes",
    "introduction_to_red_hat_openshift_ai",
    "getting_started_with_red_hat_openshift_ai_self-managed",
    "openshift_ai_tutorial_-_fraud_detection_example",
    "developing_a_model",
    "integrating_data_from_amazon_s3",
    "working_on_data_science_projects",
    "serving_models",
    "monitoring_data_science_models",
    "managing_users",
    "managing_resources",
    "installing_and_uninstalling_openshift_ai_self-managed",
    "installing_and_uninstalling_openshift_ai_self-managed_in_a_disconnected_environment",
    "upgrading_openshift_ai_self-managed",
    "upgrading_openshift_ai_self-managed_in_a_disconnected_environment",   
]

pdfs = [f"https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/{product_version}/pdf/{doc}/red_hat_openshift_ai_self-managed-{product_version}-{doc}-en-us.pdf" for doc in documents]
pdfs_to_urls = {f"red_hat_openshift_ai_self-managed-{product_version}-{doc}-en-us": f"https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/{product_version}/html-single/{doc}/index" for doc in documents}

In [101]:
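import os

import requests

docs_dir = f"rhoai-doc-{product_version}"

# Create the target directory if it does not exist yet
os.makedirs(docs_dir, exist_ok=True)

for pdf in pdfs:
    try:
        # A timeout keeps a stalled download from hanging the notebook
        response = requests.get(pdf, timeout=60)
    except requests.RequestException:
        print(f"Skipped {pdf}")
        continue
    if response.status_code != 200:
        print(f"Skipped {pdf}")
        continue
    # Save under the PDF's original file name (last URL path segment)
    with open(os.path.join(docs_dir, pdf.split("/")[-1]), "wb") as f:
        f.write(response.content)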

Skipped https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2.9/pdf/developing_a_model/red_hat_openshift_ai_self-managed-2.9-developing_a_model-en-us.pdf
Skipped https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2.9/pdf/monitoring_data_science_models/red_hat_openshift_ai_self-managed-2.9-monitoring_data_science_models-en-us.pdf


In [102]:
pdf_folder_path = f"./rhoai-doc-{product_version}"

pdf_loader = PyPDFDirectoryLoader(pdf_folder_path)
pdf_docs = pdf_loader.load()

#### Inject metadata

In [103]:
from pathlib import Path

for doc in pdf_docs:
    doc.metadata["source"] = pdfs_to_urls[Path(doc.metadata["source"]).stem]
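
The lookup above works because each downloaded file name, minus its `.pdf` extension, is exactly a key of `pdfs_to_urls`. It raises `KeyError` if the folder contains any other PDF. The sketch below rebuilds the same mapping in isolation and returns `None` for unknown files instead. The names `BASE`, `pdf_name`, `html_url`, and `source_url_for` are hypothetical helpers for illustration.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional

BASE = "https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed"


def pdf_name(doc: str, version: str) -> str:
    """File name of the PDF for one documentation guide."""
    return f"red_hat_openshift_ai_self-managed-{version}-{doc}-en-us.pdf"


def html_url(doc: str, version: str) -> str:
    """URL of the single-page HTML version of the same guide."""
    return f"{BASE}/{version}/html-single/{doc}/index"


def source_url_for(local_path: str, docs: Iterable[str], version: str) -> Optional[str]:
    """Map a downloaded PDF path to its HTML URL, or None if the file is unknown."""
    stem_to_url = {
        PurePosixPath(pdf_name(doc, version)).stem: html_url(doc, version)
        for doc in docs
    }
    # .stem strips only the final ".pdf", so the "2.9" inside the name is preserved
    return stem_to_url.get(PurePosixPath(local_path).stem)
```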

#### Load websites

In [104]:
websites = [
    "https://ai-on-openshift.io/getting-started/openshift/",
    "https://ai-on-openshift.io/getting-started/opendatahub/",
    "https://ai-on-openshift.io/getting-started/openshift-ai/",
    "https://ai-on-openshift.io/odh-rhoai/configuration/",
    "https://ai-on-openshift.io/odh-rhoai/custom-notebooks/",
    "https://ai-on-openshift.io/odh-rhoai/nvidia-gpus/",
    "https://ai-on-openshift.io/odh-rhoai/custom-runtime-triton/",
    "https://ai-on-openshift.io/odh-rhoai/openshift-group-management/",
    "https://ai-on-openshift.io/tools-and-applications/minio/minio/",
    "https://access.redhat.com/articles/7047935",
    "https://access.redhat.com/articles/rhoai-supported-configs",
]

In [105]:
website_loader = WebBaseLoader(websites)
website_docs = website_loader.load()

#### Merge both types of docs

In [106]:
docs = pdf_docs + website_docs

#### Split documents into chunks with some overlap

In [108]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=40)
all_splits = text_splitter.split_documents(docs)
all_splits[0]

Document(page_content='Red Hat OpenShift AI Self-Managed\n \n2.9\nUpgrading OpenShift AI Self-Managed in a\ndisconnected environment\nUpgrade Red Hat OpenShift AI on OpenShift Container Platform in a disconnected\nenvironment\nLast Updated: 2024-05-10', metadata={'source': 'https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2.9/html-single/upgrading_openshift_ai_self-managed_in_a_disconnected_environment/index', 'page': 0})
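
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch. It uses a plain fixed-size sliding window over characters, not the recursive, separator-aware algorithm of `RecursiveCharacterTextSplitter`. The helper `sliding_chunks` is hypothetical and for illustration only.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


example = sliding_chunks("abcdefghijklmnopqrstuvwx", chunk_size=10, chunk_overlap=3)
print(example)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx']
```

Each chunk starts with the last three characters of the previous one. That shared text is what keeps a sentence cut at a boundary retrievable from either side.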

#### Clean up documents: remove the NUL character, '\x00', which some stores (PostgreSQL TEXT fields, for example) do not accept

In [110]:
for doc in all_splits:
    doc.page_content = doc.page_content.replace('\x00', '')

#### Create the index and ingest the documents (Method #1)

In [113]:
# To ingest with GPUs
model_kwargs = {"trust_remote_code": True, "device": "cuda"}

# Define embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="nomic-ai/nomic-embed-text-v1",
    model_kwargs=model_kwargs,
    show_progress=True,
)

# Instantiate langchain vectorstore and ingest from documents
db = ElasticsearchStore.from_documents(
    documents=all_splits,
    embedding=embeddings,
    index_name=COLLECTION_NAME,
    es_connection=es_conn,
)

You try to use a model that was created with version 2.4.0.dev0, however, your version is 2.4.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.



<All keys matched successfully>


Batches:   0%|          | 0/37 [00:00<?, ?it/s]

#### Alternatively, add new documents to an existing index (Method #2)

In [114]:
model_kwargs = {"trust_remote_code": True, "device": "cuda"}

embeddings = HuggingFaceEmbeddings(
    model_name="nomic-ai/nomic-embed-text-v1",
    model_kwargs=model_kwargs,
    show_progress=True,
)

# Instantiate langchain vectorstore
db = ElasticsearchStore(
    embedding=embeddings,
    index_name=COLLECTION_NAME,
    es_connection=es_conn
)

# Add docs
db.add_documents(all_splits)

You try to use a model that was created with version 2.4.0.dev0, however, your version is 2.4.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.



<All keys matched successfully>


Batches:   0%|          | 0/37 [00:00<?, ?it/s]

['b422cff2-7cbe-4667-8254-ff697ef45112',
 '1d1d6c60-c9a7-4b73-bbdc-3805ed86b3ab',
 '56781e9e-6730-45db-ae3d-22b8f164c23f',
 '1e490255-f8f4-4b3b-8c90-0f36e08760bb',
 'd22f208d-27cc-4173-9155-c003cb2a3346',
 '92d1a8a1-a351-4a7b-8878-c741b9e00c2b',
 'fc5531c3-4d67-4c59-8ea1-abc27d5a4a46',
 '656a3252-6221-490d-8eef-fb772e3eb731',
 '911bbb1d-bfe8-41d1-9cf8-0d61ae3d307f',
 '76221d96-104a-4b11-ac06-63d9aa45332b',
 '7cbf56d7-bfd5-4597-91c8-bf65487fcb59',
 'f2aadb19-110f-45a8-99d8-30f09195d4b7',
 'ad1902a5-7d47-44e4-8bdd-6b9c45c0fb25',
 '82e92453-31fe-43a1-ac92-b9969799ada8',
 'c663183d-fdbe-4a65-87a2-29de1e2242db',
 'a32d5b28-eaee-428e-83d9-60a280c29c6d',
 '092f2aec-b00d-4580-85d2-ef2e93288b0a',
 '2b52405f-0982-4941-92d0-e05ba0931545',
 '2d7bdfa2-9f34-4354-b084-1a80e9159acf',
 '9909c7c6-ffe8-4814-b551-d6f454411915',
 '60282fea-bd60-49f1-a7bb-e90e80b19314',
 'f657806b-fdc7-4476-b434-3b9b3c0a876f',
 'abe83579-532b-4e01-a71c-ed620f1dc25b',
 '6064cb30-a853-4d43-86f4-c15b6e65b5da',
 '78778370-10cc-
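
The IDs returned above are random UUIDs, so running an ingestion cell twice appears to store every chunk twice. One possible way to make re-ingestion idempotent is to derive each ID from the chunk itself, as in the sketch below. `stable_ids` is a hypothetical helper, and passing `ids=` to `add_documents` is an assumption to verify against your installed LangChain version.

```python
import hashlib
from typing import Iterable, List


def stable_ids(chunks: Iterable) -> List[str]:
    """Return one deterministic ID per chunk, derived from source, page and content.

    Each chunk is assumed to expose `.page_content` (str) and `.metadata` (dict),
    as LangChain `Document` objects do.
    """
    ids = []
    for chunk in chunks:
        source = str(chunk.metadata.get("source", ""))
        page = str(chunk.metadata.get("page", ""))
        # A separator that is unlikely to occur in the text keeps the fields distinct
        key = "\x1f".join([source, page, chunk.page_content])
        ids.append(hashlib.sha256(key.encode("utf-8")).hexdigest())
    return ids


# Possible usage (assumes the vector store accepts an `ids` keyword):
# db.add_documents(all_splits, ids=stable_ids(all_splits))
```

Identical chunks from the same source and page map to the same ID, so a second run would overwrite the stored document rather than add a copy.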

#### Test query

In [115]:
query = "How can I work with GPU and taints in OpenShift AI?"
docs_with_score = db.similarity_search_with_score(query)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [116]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.8863795
But don't worry, OpenShift AI and Open Data Hub take care of this part for you when you launch notebooks, workbenches, model servers, or pipeline runtimes!
Installation
Here is the documentation you can follow:

OpenShift AI documentation
NVIDIA documentation (more detailed)

Advanced configuration
Working with taints
In many cases, you will want to restrict access to GPUs, or be able to provide choice between different types of GPUs: simply stating "I want a GPU" is not enough. Also, if you want to make sure that only the Pods requiring GPUs end up on GPU-enabled nodes (and not other Pods that just end up being there at random because that's how Kubernetes works...), you're at the right place!
The only supported method at the moment to achieve this is to taint nodes, then apply tolerations on the Pods depending on where you want them scheduled. If you don't pay close attention though whe