## Creating an index and populating it with documents using Elasticsearch

A simple example of how to ingest PDF documents, followed by web page content, into an Elasticsearch vector store.

Requirements:
- An Elasticsearch cluster
    - Can be deployed using the Elasticsearch (ECK) operator
    - Create an Elasticsearch cluster instance from the operator
    - This creates the certificates and credentials required for connecting

__NOTE: You will need the correct certificates to establish a connection with the Elasticsearch pod. It can help to run `tls.crt` and `ca.crt` through an SSL certificate decoder to confirm that the service URL is among the names the certificate covers.__

## Elasticsearch Install and Configuration

1. Install the Elasticsearch (ECK) Operator from OperatorHub or through Argo CD
2. Create an Elasticsearch cluster resource using the form view or YAML

```yaml
kind: Elasticsearch
apiVersion: elasticsearch.k8s.elastic.co/v1
metadata:
  name: elasticsearch # or any name you want
  namespace: <NAMESPACE> # CHANGE
spec:
  version: 8.14.0
  nodeSets:
    - name: default
      config:
        node.roles:
          - master
          - data
        node.attr.attr_name: attr_value
        node.store.allow_mmap: false
      podTemplate:
        metadata:
          labels: # Configure as necessary
            <ADD LABELS HERE>
        spec:
          containers:
            - name: elasticsearch
              resources: # CHANGE: adjust memory and CPU requests/limits as necessary
                requests:
                  memory: 4Gi
                  cpu: 1
                limits:
                  memory: 4Gi
                  cpu: 2
      count: <POD REPLICAS>

```
3. Once this YAML has been applied, the following resources are also created:
    * elasticsearch-es-default
    * elasticsearch-es-default-es-config
    * elasticsearch-es-default-es-transport-certs
    * elasticsearch-es-file-settings
    * elasticsearch-es-http
    * elasticsearch-es-http-ca-internal
    * elasticsearch-es-http-certs-internal
    * elasticsearch-es-internal-http
    * elasticsearch-es-internal-users
    * elasticsearch-es-remote-ca
    * elasticsearch-es-scripts
    * elasticsearch-es-transport
    * elasticsearch-es-transport-ca-internal
    * elasticsearch-es-unicast-hosts
    * elasticsearch-es-xpack-file-realm

4. The resources needed to connect to and query Elasticsearch are:
    * elasticsearch-es-http (svc)
    * elasticsearch-es-internal-http (svc)
    * elasticsearch-es-http-certs-internal (secret)
    * elasticsearch-es-internal-users (secret)

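5. Inside this notebook:
    - HOST = elasticsearch-es-http (or `elasticsearch-es-http.<NAMESPACE>.svc` when connecting from another namespace)
    - PORT = 9200
    - Create or update the `ca.crt` file so that it contains the contents of the `ca.crt` field of the secret `elasticsearch-es-http-certs-internal`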
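
Values stored under `data` in a Kubernetes secret are base64-encoded, so the `ca.crt` field has to be decoded before it is written to disk. A minimal sketch, assuming the secret has first been exported to JSON (for example with `oc get secret elasticsearch-es-http-certs-internal -o json`); the helper name `write_ca_cert` and the `secret.json` file name are illustrative and not part of the original notebook:

```python
import base64
import json
from pathlib import Path


def write_ca_cert(secret_json: str, out_path: str = "ca.crt") -> Path:
    """Decode the `ca.crt` field of an exported secret and write it to out_path."""
    secret = json.loads(secret_json)
    # Values under `data` in a Kubernetes secret are base64-encoded.
    pem_bytes = base64.b64decode(secret["data"]["ca.crt"])
    path = Path(out_path)
    path.write_bytes(pem_bytes)
    return path


# Hypothetical usage, assuming secret.json holds the exported secret:
# write_ca_cert(Path("secret.json").read_text())
```

The resulting file is what the `ca_certs="ca.crt"` argument of the Elasticsearch client points to later in the notebook.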

### (Optional) Creating a Route
1. Go to OCP Web Console > Networking > Routes
2. Create a route that points to the service `elasticsearch-es-http`
3. Check TLS encrypted
4. Set the encryption type to Reencrypt
5. Set the `Destination CA Cert` to the value of the ca.crt field of the secret `elasticsearch-es-http-certs-internal`
6. Visit the route. Log in with the username `elastic` and the password from the secret `elasticsearch-es-internal-users`


### Needed packages

In [None]:
!pip install -q elasticsearch langchain==0.1.12 pypdf==4.0.2 sentence-transformers==2.4.0 einops==0.7.0 lxml==5.1.0 tqdm==4.66.2

### Base parameters: the Elasticsearch connection info

In [None]:
import os

ELASTIC_USER = "elastic"
ELASTIC_PASSWORD = os.getenv("ELASTIC_PASSWORD")
HOST = 'elasticsearch-es-http.rhsaia-lab.svc'
PORT = '9200'

product_version = "2.9"  # kept as a string so versions such as "2.10" format correctly
COLLECTION_NAME = f"rhoai-doc-{product_version}"
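
`COLLECTION_NAME` and the documentation URLs used later are all built from `product_version` with f-strings, so the value ends up formatted as text. A short illustration of why a string is safer than a float for a version number; the helper `collection_name` and the version `2.10` are hypothetical examples:

```python
def collection_name(product_version) -> str:
    """Build the index name the same way the notebook does."""
    return f"rhoai-doc-{product_version}"


print(collection_name(2.9))     # rhoai-doc-2.9
print(collection_name(2.10))    # rhoai-doc-2.1 (the float drops the trailing zero)
print(collection_name("2.10"))  # rhoai-doc-2.10
```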

#### Imports

In [None]:
from langchain.document_loaders import PyPDFDirectoryLoader, WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import ElasticsearchStore
from elasticsearch import Elasticsearch

#### Create Elasticsearch Connection

In [None]:
es_conn = Elasticsearch(
    f"https://{HOST}:{PORT}",
    basic_auth=(ELASTIC_USER, ELASTIC_PASSWORD),
    ca_certs="ca.crt"
)
es_conn.info()
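
If the connection fails, two easy things to rule out are an unset `ELASTIC_PASSWORD` (in which case `os.getenv` returns `None`) and a missing `ca.crt` file. A small pre-flight sketch that can be run before creating the client; the function name `missing_settings` is hypothetical and the check is only a convenience, not something the Elasticsearch client requires:

```python
import os
from pathlib import Path
from typing import List, Mapping


def missing_settings(env: Mapping[str, str] = os.environ,
                     ca_path: str = "ca.crt") -> List[str]:
    """Return a human-readable list of settings the connection cell depends on."""
    problems = []
    # An unset or empty variable would be passed to basic_auth as None or "".
    if not env.get("ELASTIC_PASSWORD"):
        problems.append("ELASTIC_PASSWORD is not set")
    # The client reads the CA bundle from this path to verify the server.
    if not Path(ca_path).is_file():
        problems.append(f"{ca_path} not found")
    return problems


for problem in missing_settings():
    print(problem)
```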

## Initial index creation and document ingestion

#### Download and load PDFs

In [None]:
documents = [
    "release_notes",
    "introduction_to_red_hat_openshift_ai",
    "getting_started_with_red_hat_openshift_ai_self-managed",
    "openshift_ai_tutorial_-_fraud_detection_example",
    "developing_a_model",
    "integrating_data_from_amazon_s3",
    "working_on_data_science_projects",
    "serving_models",
    "monitoring_data_science_models",
    "managing_users",
    "managing_resources",
    "installing_and_uninstalling_openshift_ai_self-managed",
    "installing_and_uninstalling_openshift_ai_self-managed_in_a_disconnected_environment",
    "upgrading_openshift_ai_self-managed",
    "upgrading_openshift_ai_self-managed_in_a_disconnected_environment",   
]

pdfs = [f"https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/{product_version}/pdf/{doc}/red_hat_openshift_ai_self-managed-{product_version}-{doc}-en-us.pdf" for doc in documents]
pdfs_to_urls = {f"red_hat_openshift_ai_self-managed-{product_version}-{doc}-en-us": f"https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/{product_version}/html-single/{doc}/index" for doc in documents}

In [None]:
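import requests
import os

docs_dir = f"rhoai-doc-{product_version}"

# Create the target directory if it does not exist yet
os.makedirs(docs_dir, exist_ok=True)

for pdf in pdfs:
    try:
        response = requests.get(pdf, timeout=60)
    except requests.RequestException as exc:
        # Network-level failure: report it and move on to the next PDF
        print(f"Skipped {pdf} ({exc})")
        continue
    if response.status_code != 200:
        print(f"Skipped {pdf} (HTTP {response.status_code})")
        continue
    # Save the PDF under its original file name
    with open(os.path.join(docs_dir, pdf.split('/')[-1]), 'wb') as f:
        f.write(response.content)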

In [None]:
pdf_folder_path = f"./rhoai-doc-{product_version}"

pdf_loader = PyPDFDirectoryLoader(pdf_folder_path)
pdf_docs = pdf_loader.load()

#### Inject metadata

In [None]:
from pathlib import Path

for doc in pdf_docs:
    doc.metadata["source"] = pdfs_to_urls[Path(doc.metadata["source"]).stem]
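
The lookup above relies on the loader recording each file's path in `metadata["source"]`, and on the file name without its `.pdf` extension matching a key of `pdfs_to_urls`. A standalone illustration of that mapping for a single document, assuming version 2.9 and the `release_notes` guide:

```python
from pathlib import Path

product_version = "2.9"
doc = "release_notes"

# Same key/value construction as in the notebook, for a single document.
key = f"red_hat_openshift_ai_self-managed-{product_version}-{doc}-en-us"
pdfs_to_urls = {
    key: "https://docs.redhat.com/en/documentation/"
         f"red_hat_openshift_ai_self-managed/{product_version}/html-single/{doc}/index"
}

# A path as the loader would report it for a downloaded file.
source = f"rhoai-doc-{product_version}/{key}.pdf"

# Path.stem drops the directory and only the final ".pdf" suffix,
# so the dot inside "2.9" is preserved and the key still matches.
assert Path(source).stem == key
print(pdfs_to_urls[Path(source).stem])
```

After the rewrite, every PDF chunk points at the HTML version of the guide instead of a local file path, which is more useful as a citation in query results.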

#### Load websites

In [None]:
websites = [
    "https://ai-on-openshift.io/getting-started/openshift/",
    "https://ai-on-openshift.io/getting-started/opendatahub/",
    "https://ai-on-openshift.io/getting-started/openshift-ai/",
    "https://ai-on-openshift.io/odh-rhoai/configuration/",
    "https://ai-on-openshift.io/odh-rhoai/custom-notebooks/",
    "https://ai-on-openshift.io/odh-rhoai/nvidia-gpus/",
    "https://ai-on-openshift.io/odh-rhoai/custom-runtime-triton/",
    "https://ai-on-openshift.io/odh-rhoai/openshift-group-management/",
    "https://ai-on-openshift.io/tools-and-applications/minio/minio/",
    "https://access.redhat.com/articles/7047935",
    "https://access.redhat.com/articles/rhoai-supported-configs",
]

In [None]:
website_loader = WebBaseLoader(websites)
website_docs = website_loader.load()

#### Merge both types of docs

In [None]:
docs = pdf_docs + website_docs

#### Split documents into chunks with some overlap

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=40)
all_splits = text_splitter.split_documents(docs)
all_splits[0]
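
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-width splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines); it only illustrates how consecutive chunks share text. The function name `sliding_chunks` is hypothetical:

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-width chunks that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one. With the notebook's values, neighbouring chunks share up to 40 characters, so a sentence cut at a boundary still appears partly in both chunks.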

#### Clean up documents by removing the NUL character, '\x00', which some backends (PostgreSQL TEXT fields, for example) won't accept

In [None]:
for doc in all_splits:
    doc.page_content = doc.page_content.replace('\x00', '')

#### Create the index and ingest the documents (Method #1)

In [None]:
# To ingest with GPUs (set "device" to "cpu" if no GPU is available)
model_kwargs = {"trust_remote_code": True, "device": "cuda"}

# Define embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="nomic-ai/nomic-embed-text-v1",
    model_kwargs=model_kwargs,
    show_progress=True,
)

# Instantiate langchain vectorstore and ingest from documents
db = ElasticsearchStore.from_documents(
    documents=all_splits,
    embedding=embeddings,
    index_name=COLLECTION_NAME,
    es_connection=es_conn,
)

#### Alternatively, add new documents (Method #2)

In [None]:
model_kwargs = {"trust_remote_code": True, "device": "cuda"}

embeddings = HuggingFaceEmbeddings(
    model_name="nomic-ai/nomic-embed-text-v1",
    model_kwargs=model_kwargs,
    show_progress=True,
)

# Instantiate langchain vectorstore
db = ElasticsearchStore(
    embedding=embeddings,
    index_name=COLLECTION_NAME,
    es_connection=es_conn
)

# Add docs
db.add_documents(all_splits)

#### Test query

In [None]:
query = "How can I work with GPU and taints in OpenShift AI?"
docs_with_score = db.similarity_search_with_score(query)

In [None]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)