# Ingestion of Expert profile data to vector db and retrieval

This notebook extracts expert profile information from a JSON file, transforms it into a document format, and stores the documents in a vector database for subsequent processing.

This is an additional index, created alongside the watsonx docs RAG vector index that is built in the **Process and Ingest into vector DB** notebook.

The indexed documents are used by the **Create and Deploy QnA AI Service** notebook, which creates a QnA RAG AI service, deploys it on watsonx.ai, and can additionally retrieve the expert profile information when the LLM has no answer to a specific question.

The process involves extracting content from the expert profiles file, segmenting the data, converting it into a document format, and finally indexing the content into a vector database. 

**Note:** Sample profile data is shipped with the accelerator. Alternatively, you can index your own expert profile data instead.

The accelerator currently supports Elasticsearch, Milvus, and Datastax vector databases. The ingestion process stores each document together with its vector embedding so that it can later be retrieved by similarity search. It consists of the following steps:

* Establishing a connection to the chosen vector database (Elasticsearch, Milvus, or Datastax) and loading input data from processed documents.
* Generating unique IDs for documents.
* Inserting the documents with embeddings into Elasticsearch, Milvus, or Datastax using these generated IDs, with progress monitoring provided by a progress bar.

## Contents
* [Pre-Requisite Libraries and Dependencies](#setup)
* [Import Dependencies](#import)
* [Extract data from input file](#input)
* [Connect to Vector Database](#connect)
* [Index Documents using LangChain's vector store](#insert)
* [Search and Retrieve using Vectorstore and Query templates](#search)

**Note:** Datastax is not supported in this cloud version.

<a id="setup"></a>
### Pre-Requisite Libraries and Dependencies
Download and import mandatory libraries and dependencies. 

**Note:** Some of the library versions may produce warnings after installation. These versions are required for the accelerator to run successfully, so ignore the warnings or errors reported during installation and proceed with your execution.

In [None]:
!pip install elasticsearch==8.18.1 | tail -n 1
!pip install langchain | tail -n 1
!pip install ibm_watsonx_ai==1.3.26 | tail -n 1
!pip install langchain_elasticsearch==0.3.2 | tail -n 1
!pip install langchain_milvus==0.2.0 | tail -n 1
!pip install pymilvus==2.5.11 | tail -n 1
!pip install cassio==0.1.10

Restart the kernel after running the pip install cell if the cell below fails to import all the libraries.
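
Before restarting, it can help to see which of the pinned packages are actually visible to the kernel. The following is a minimal sketch using only the standard library; the `PINNED` dictionary and the helper names are illustrative and simply repeat the versions from the install cell above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell above; adjust them if you change that cell.
PINNED = {
    "elasticsearch": "8.18.1",
    "ibm_watsonx_ai": "1.3.26",
    "langchain_elasticsearch": "0.3.2",
    "langchain_milvus": "0.2.0",
    "pymilvus": "2.5.11",
    "cassio": "0.1.10",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report_versions(pinned):
    """Map each package name to an (installed, expected) pair."""
    return {name: (installed_version(name), expected) for name, expected in pinned.items()}


for name, (found, expected) in report_versions(PINNED).items():
    status = "ok" if found == expected else "check"
    print(f"{name}: installed={found} expected={expected} [{status}]")
```

A package reported as `check` is either missing or installed at a different version than the one pinned above.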

In [None]:
from langchain.schema import Document
from ibm_watsonx_ai import APIClient,Credentials
import os,re
from tqdm import tqdm
import json
from ibm_watsonx_ai import __version__
import warnings
import hashlib
from ibm_watsonx_ai.foundation_models import Embeddings
from elasticsearch import Elasticsearch, helpers
warnings.filterwarnings("ignore")

#### Get Environment variables

In [None]:
project_id = os.environ['PROJECT_ID']
# Environment and host url
hostname = os.environ['RUNTIME_ENV_APSX_URL']

if hostname.endswith("cloud.ibm.com"):
    environment = "cloud"
    runtime_region = os.environ["RUNTIME_ENV_REGION"] 
else:
    environment = "on-prem"
    from ibm_watson_studio_lib import access_project_or_space
    wslib = access_project_or_space()

<a id="import"></a>
### Import Parameter Sets, credentials and Helper functions script.

#### Parameter sets import
The cells below import the parameter set values, set the watsonx.ai credentials, and import the helper functions script.

#### RAG helper functions script import

In [None]:
try:
    filename = 'rag_helper_functions.py'
    wslib.download_file(filename)
    import rag_helper_functions
    print("rag_helper_functions imported from the project assets")
except NameError as e:
    print(str(e))
    print("If running watsonx.ai aaS on IBM Cloud, check that the first cell in the notebook contains a project token. If not, select the vertical ellipsis button from the notebook toolbar and `insert project token`. Also check that you have specified your ibm_api_key in the second code cell of the notebook")


In [None]:
parameter_sets = ["RAG_parameter_set","RAG_advanced_parameter_set"]

parameters=rag_helper_functions.get_parameter_sets(wslib, parameter_sets)

In [None]:
ibm_api_key=parameters['watsonx_ai_api_key']
if environment == "cloud":
    WML_SERVICE_URL=f"https://{runtime_region}.ml.cloud.ibm.com" 
    wml_credentials = Credentials(api_key=parameters['watsonx_ai_api_key'], url=WML_SERVICE_URL)
else:
    token = os.environ['USER_ACCESS_TOKEN']
    wml_credentials=Credentials(token=os.environ['USER_ACCESS_TOKEN'],url=hostname,instance_id='openshift')

#### Setup the watsonx.ai client
The cell below uses the Watson Machine Learning credentials to create an API client for interacting with the project and deployment space.

In [None]:
client = APIClient(wml_credentials)
client.set.default_project(project_id=project_id)

#### Import Expert Profiles

Change the `expert_profiles_document` parameter if you wish to ingest your own expert data into the vector database.

In [None]:
# Get the ingestion document for indexing - Breakpoint for folks indexing their own expert profile data

filename = parameters['expert_profiles_document']
wslib.download_file(filename)
with open(filename) as f:
    expert_profiles = json.load(f)
    
print(len(expert_profiles), "expert profiles have been imported.")    

<a id="input"></a>
### Extract the json document and convert the data into Langchain format for indexing
The cell below prepares documents for insertion into a vector database. 
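
Judging from the keys read by the code in this section, each entry in the expert profiles file carries a `text` field, a `document_id`, and a nested `datainfo` object. The sketch below shows that shape with invented values and flattens it the same way, using plain dictionaries instead of LangChain `Document` objects. The sample entry and the `flatten_profile` helper are hypothetical.

```python
# Hypothetical profile entry: the field names mirror those read by this notebook,
# the values are invented for illustration.
sample_profile = {
    "text": "Optimization specialist with experience in scheduling models.",
    "document_id": "profile-001",
    "datainfo": {
        "source": "expert_profiles.json",
        "entry_number": 1,
        "name": "Jane Doe",
        "email": "jane.doe@example.com",
        "phone": "555-0100",
        "domain": "Decision Optimization",
        "position": "Data Scientist",
    },
}

DATAINFO_KEYS = ("source", "entry_number", "name", "email", "phone", "domain", "position")


def flatten_profile(item):
    """Split one profile into its page content and a flat metadata dictionary."""
    metadata = {key: item["datainfo"][key] for key in DATAINFO_KEYS}
    metadata["document_id"] = item["document_id"]
    content = "Document Content: " + item.get("text", "")
    return content, metadata


content, metadata = flatten_profile(sample_profile)
print(content)
print(sorted(metadata))
```

A profile that lacks one of these keys would raise a `KeyError` in this sketch, so it is worth validating your own profile data against this shape before indexing it.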

In [None]:
def convert_json_to_documents(json_data):
    documents = []
    for item in json_data:
        
        text = item.get("text", "")   
        # Extract the text with the profile information & rest is put into metadata
        metadata = {key: value for key, value in item.items() if key != "text"}  
        documents.append(Document(page_content=text, metadata=metadata))
    return documents

documents = convert_json_to_documents(expert_profiles)

In [None]:
import hashlib
from langchain.schema import Document

def process_profiles_to_documents(json_data):
    """
    Converts JSON data to Document objects, extracts metadata and content, generates unique IDs,
    and identifies duplicate documents.

    Parameters:
    - json_data (list): List of dictionaries representing expert profiles.

    Returns:
    - profile_documents (list): List of Document objects with processed metadata and content.

    The number of duplicates and the total number of profiles are printed, not returned.
    """
    # Convert JSON data to Document objects
    documents = []
    for item in json_data:
        text = item.get("text", "")  # Extract the main text content
        metadata = {key: value for key, value in item.items() if key != "text"}  # Extract metadata
        documents.append(Document(page_content=text, metadata=metadata))

    # Prepare content and metadata for profile documents
    profile_content = []
    profile_metadata = []
    for doc in documents:
        profile_metadata.append({
            "source": doc.metadata['datainfo']['source'],
            "entry_number": doc.metadata['datainfo']['entry_number'],
            "name": doc.metadata['datainfo']['name'],
            "email": doc.metadata['datainfo']['email'],
            "phone": doc.metadata['datainfo']['phone'],
            "domain": doc.metadata['datainfo']['domain'],
            "position": doc.metadata['datainfo']['position'],
            "document_id": doc.metadata['document_id']
        })
        profile_content.append("Document Content: " + doc.page_content)

    # Create profile documents
    profile_documents = [
        Document(page_content=text, metadata=meta)
        for text, meta in zip(profile_content, profile_metadata)
    ]

    # Generate unique IDs for profile documents
    id_list = [
        hashlib.sha256(doc.page_content.encode()).hexdigest()
        for doc in profile_documents
    ]

    # Identify duplicates
    duplicate_count = len(id_list) - len(set(id_list))
    print(f"{duplicate_count} duplicate profiles found.")
    print(f"{len(id_list)} profiles returned.")

    return profile_documents

In [None]:
profile_documents = process_profiles_to_documents(expert_profiles)

<a id="connect"></a>
## Connecting to Vector Database

By default, the notebook looks for a connection asset in the project named `milvus_connect`, `elasticsearch_connect`, or `datastax_connect`. You can set this up by following the instructions in the project readme.

The code below checks whether the specified connection exists in the project. If it does, the code retrieves the connection details, identifies the connection type, and connects to the corresponding database. If it does not, the code raises an error indicating that the connection is missing from the project.
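
The lookup-or-fail pattern described above can be sketched in isolation as follows. The `project_connections` list is a stand-in for the result of `wslib.list_connections()`, and both helper functions are hypothetical names introduced for this illustration.

```python
def find_connection(connections, name):
    """Return the first connection whose 'name' matches, or None if there is none."""
    return next((conn for conn in connections if conn["name"] == name), None)


def require_connection(connections, name):
    """Return the matching connection or raise a descriptive error."""
    conn = find_connection(connections, name)
    if conn is None:
        raise ValueError(f"No connection named {name} found in the project.")
    return conn


# Stand-in for wslib.list_connections(); real entries carry more fields.
project_connections = [{"name": "milvus_connect"}, {"name": "elasticsearch_connect"}]

print(require_connection(project_connections, "milvus_connect"))
```

Comparing the result with `None` explicitly avoids treating an empty but existing connection record as missing.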

In [None]:
connection_name=parameters["connection_asset"]
if(next((conn for conn in wslib.list_connections() if conn['name'] == connection_name), None)):
    print(connection_name, "Connection found in the project")
    db_connection = wslib.get_connection(connection_name)
    
    connection_datatypesource_id=client.connections.get_details(db_connection['.']['asset_id'])['entity']['datasource_type']
    connection_type = client.connections.get_datasource_type_details_by_id(connection_datatypesource_id)['entity']['name']
    
    print("Successfully retrieved the connection details")
    print("Connection type is identified as:",connection_type)

    if connection_type=="elasticsearch":
        es_client=rag_helper_functions.create_and_check_elastic_client(db_connection, parameters['elastic_search_model_id'])
    elif connection_type=="milvus" or connection_type=="milvuswxd":
        milvus_credentials = rag_helper_functions.connect_to_milvus_database(db_connection, parameters)
    elif connection_type=="datastax":
        if environment == "cloud":
            raise ValueError(f"ERROR! we don't support datastax connection for Cloud as of now")
        datastax_session,datastax_cluster = rag_helper_functions.connect_to_datastax(db_connection, parameters)
        import cassio
        cassio.init(session=datastax_session, keyspace=db_connection.get('keyspace'))
else:
    db_connection=""
    raise ValueError(f"No connection named {connection_name} found in the project.")

<a id="insert"></a>
### Create vector database for Expert Profiles and index using Langchain vector store

The code below uses LangChain's vector store integrations to store documents.
It sets up a vector store based on the connection type: `"elasticsearch"`, `"milvus"`, or `"datastax"`.

* If `connection_type` is `"elasticsearch"`, it imports `ElasticsearchStore` from the `langchain_elasticsearch` library and creates the Elasticsearch vector store from the Elasticsearch client, the `elastic_search_model_id` parameter, and the index settings.
* If `connection_type` is `"milvus"`, the code imports `langchain_milvus`, initializes the embedding model specified in the parameters, and sets up index parameters with a specific index type and metric. The Milvus vector store is then created using the embedding function, connection details, and index settings. For cloud environments, it uses the `ibm_watsonx_ai` library to initialize the embedding function with the required API key, URL, and project ID. On Cloud Pak for Data software, it initializes the embedding function through the same library with the credentials of the local cluster.
* If `connection_type` is `"datastax"`, it uses `langchain_community.vectorstores.Cassandra` to initialize a Cassandra-based vector store. The embedding model is initialized in the same way using `get_embedding()`, and the store is configured using a specified keyspace and table name. This setup supports vector search on a DataStax Astra DB or Cassandra cluster.

First, the code creates a connection to the vector database and defines how documents will be retrieved later.
Then, it defines a function that adds the documents to the vector store. This function inserts the documents in batches rather than all at once.

Once the documents have been added to the vector store, they are searchable.

The code below creates the Elasticsearch dense vector index that is used when a dense embedding model such as `E5 multilingual` is selected, and also creates the Elasticsearch ingest pipeline for indexing.

In [None]:
def create_es_dense_index(index_name):
    try:
        es_client.options(ignore_status=400).indices.create(
                    index=index_name,
                    mappings={
                        'properties': {
                            'vector.tokens': {
                                'type': 'dense_vector',
                            },
                        }
                    },
                    settings={
                        'index': {
                            'default_pipeline': 'dense-ingest-pipeline',
                        },
                        "number_of_shards": parameters["es_number_of_shards"],
                    }
        )
        es_client.ingest.put_pipeline(
                    id='dense-ingest-pipeline',
                    processors=[
                        {
                            'inference': {
                                'model_id': parameters['elastic_search_model_id'],
                                'input_output': [
                                    {
                                        'input_field': 'text',
                                        'output_field': 'vector.tokens',
                                    }
                                ]
                            }
                        }
                    ]
                )
        print('Elastic search index created!')
    except Exception as e:
        print('Error creating elastic index', e)

In [None]:
def get_embedding(environment, parameters, project_id, wml_credentials, WML_SERVICE_URL=None):
    if environment == "cloud":
        credentials = Credentials(
            api_key=parameters['watsonx_ai_api_key'],
            url=WML_SERVICE_URL
        )
        embedding = Embeddings(
            model_id=parameters['embedding_model_id'],
            credentials=credentials,
            project_id=project_id,
            verify=True
        )
    elif environment == "on-prem":
        try:
            if client.foundation_models.EmbeddingModels.__members__:
                if client.foundation_models.EmbeddingModels(parameters["embedding_model_id"]).name:
                    embedding = Embeddings(
                        model_id=parameters['embedding_model_id'],
                        credentials=wml_credentials,
                        project_id=project_id,
                        verify=True
                    )
            else:
                # No local embedding models are registered: fall back to the IBM Cloud API
                print("Local on-prem embedding models not found, using models from IBM Cloud API")
                credentials = Credentials(
                    api_key=parameters['watsonx_ai_api_key'],
                    url=parameters['watsonx_ai_url']
                )
                embedding = Embeddings(
                    model_id=parameters['embedding_model_id'],
                    credentials=credentials,
                    space_id=parameters["wx_ai_inference_space_id"],
                    verify=True
                )
        except Exception as e:
            print(f"Exception in loading Embedding Models: {str(e)}")
            raise
    else:
        raise ValueError(f"Invalid environment: {environment}. Must be 'cloud' or 'on-prem'.")
    
    return embedding

In [None]:

def create_vector_store(connection_type,index_name,parameters,milvus_index_params=None):
    if connection_type=="elasticsearch":
        from langchain_elasticsearch import ElasticsearchStore
        if 'dense' in parameters['elastic_search_vector_type']:
            create_es_dense_index(index_name)
            vector_store=ElasticsearchStore(
                            es_connection=es_client,
                            index_name=index_name,
                            strategy=ElasticsearchStore.ApproxRetrievalStrategy(query_model_id=parameters['elastic_search_model_id']),
                            )
        else:
            vector_store=ElasticsearchStore(
                            es_connection=es_client,
                            index_name=index_name,
                            strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(model_id=parameters['elastic_search_model_id']),
                            custom_index_settings={"number_of_shards": parameters["es_number_of_shards"]}
                            )
        print("Elastic Search Vector Store Created with",parameters['elastic_search_model_id'])
    elif connection_type=="milvus" or connection_type=="milvuswxd":
        from langchain_milvus import Milvus
        
        #milvus_credentials={'database': db_connection['database'],'password':db_connection['password'] ,'port': db_connection['port'] ,'host': db_connection['host'],"secure": True,'user': db_connection['username']}
        print("using the model",parameters['embedding_model_id'], "to create embeddings")
        embedding = get_embedding(environment, parameters, project_id, wml_credentials, WML_SERVICE_URL) if environment == "cloud" else get_embedding(environment, parameters, project_id, wml_credentials, None)  
            
        vector_store = Milvus(
            embedding_function=embedding,
            connection_args=milvus_credentials,
            index_params=milvus_index_params,
            primary_field='id',
            #text_field="page_content",
            collection_name=index_name
             
        )
        print("Milvus Vector Store Created")
    elif connection_type == "datastax":
        if environment == "cloud":
            raise ValueError(f"ERROR! we don't support datastax connection for Cloud as of now")
        print("using the model",parameters['embedding_model_id'], "to create embeddings")
        embedding = get_embedding(environment, parameters, project_id, wml_credentials, WML_SERVICE_URL) if environment == "cloud" else get_embedding(environment, parameters, project_id, wml_credentials, None)  
        from langchain_community.vectorstores import Cassandra
        vector_store = Cassandra(
            embedding=embedding,
            table_name=index_name
        )
        print("Datastax Vector Store Created")
        
    return vector_store



The code below defines a function that generates a unique ID for each document by hashing its `page_content`. It iterates over the documents in batches of `index_chunk_size`, inserts each batch into the vector store with the corresponding IDs, and updates a progress bar to reflect the number of documents processed.
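
The two ideas in this step, content-derived IDs and batched insertion, can be tried out without a vector database. The sketch below is a simplified illustration: `iter_batches` and the sample texts are hypothetical, and the batch size of 2 stands in for the `index_chunk_size` parameter.

```python
import hashlib


def generate_hash(content):
    """Derive a deterministic ID from the text, so identical text yields the same ID."""
    return hashlib.sha256(content.encode()).hexdigest()


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = ["profile A", "profile B", "profile A", "profile C", "profile D"]
ids = [generate_hash(text) for text in texts]

# Identical content hashes to the same ID, which is how duplicates are detected.
duplicate_count = len(ids) - len(set(ids))

batches = list(iter_batches(list(zip(ids, texts)), 2))
print(f"{duplicate_count} duplicate(s), {len(batches)} batch(es)")
```

Because the ID depends only on the content, re-ingesting an unchanged profile produces the same ID instead of a new one.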


In [None]:
def generate_hash(content):
    return hashlib.sha256(content.encode()).hexdigest()


def insert_docs_to_vector_store(vector_store,split_docs,insert_type="docs" ):
    with tqdm(total=len(split_docs), desc="Inserting Documents", unit="docs") as pbar:
        try:
            for i in range(0, len(split_docs), parameters['index_chunk_size']):
                chunk = split_docs[i:i + parameters['index_chunk_size']]
                if insert_type=="docs":
                    id_chunk = [generate_hash(doc.page_content+'\nTitle: '+doc.metadata['title']+'\nUrl: '+doc.metadata['document_url']+'\nPage: '+doc.metadata['page_number']) for doc in chunk]
                elif insert_type=="profiles":
                    id_chunk = [generate_hash(doc.page_content) for doc in chunk]
                vector_store.add_documents(chunk, ids=id_chunk)
                pbar.update(len(chunk))
            print("Documents are inserted into vector database")
        except Exception as e:
            print(f"An error occurred: {e}")

In [None]:
milvus_index_params = {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 1024}}

profile_vector_store=create_vector_store(connection_type,parameters['expert_profiles_index'],parameters, milvus_index_params)
insert_docs_to_vector_store(profile_vector_store,profile_documents,"profiles")

The cell above is a synchronous call and may take time to complete, depending on the size of the documents. You can proceed to the **Create and Deploy QnA AI Service** notebook to create and deploy the RAG AI service without waiting for the previous cell to complete.

<a id="search"></a>
### Querying the vector index/collection for the expert profiles

The following sections of the notebook are designed to test a sample Question and Answer (QnA) interaction on the vector store. The subsequent cell in the notebook executes this test and provides a response that includes several key pieces of information.

In [None]:
question ="how to perform decision optimization?"

For **Elasticsearch**, the following cell tests a sample question and answer (QnA) interaction using the sample query template for the ELSER model or the multilingual model, whichever is in use. The response comprises:

* `Document ID` : A unique identifier for the document within the database or index.
* `Document Context` : Expert Profile data content or text from the document that is relevant to the queried question.
* `Relevance Score` : A numerical value indicating the relevance or confidence level of the answer provided by the model.

This setup allows for a practical demonstration of the model's capabilities in retrieving and presenting information in response to a specific query. There are 3 ways to perform this step, depending on the `elastic_search_template_file` parameter provided in the parameter set by the user.

* **ELSER**: An ELSER exclusive search query is invoked.
* **ELSER + BM25**: A hybrid search query that combines a traditional BM25 keyword search with ELSER is invoked.
* **Multilingual**: A dense vector search query is invoked.
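
Whichever template is selected, the notebook fills it by replacing placeholders such as `{{model_id}}`, `{{model_text}}`, and `"{{query_vector}}"` in the serialized JSON. The following is a minimal sketch of that substitution. The two templates are hypothetical stand-ins for the file named by `elastic_search_template_file`, and unlike the notebook cell, this version escapes the question so that quotes in it do not break the JSON.

```python
import json

# Hypothetical templates; the real ones are loaded from elastic_search_template_file.
sparse_template = {
    "query": {
        "text_expansion": {
            "vector.tokens": {"model_id": "{{model_id}}", "model_text": "{{model_text}}"}
        }
    }
}
dense_template = {
    "knn": {"field": "vector.tokens", "query_vector": "{{query_vector}}", "k": 3}
}


def fill_sparse(template, model_id, question):
    """Substitute the model ID and the question into a sparse (ELSER-style) template."""
    raw = json.dumps(template)
    raw = raw.replace("{{model_id}}", model_id)
    # json.dumps(...)[1:-1] escapes quotes and backslashes without the outer quotes.
    raw = raw.replace("{{model_text}}", json.dumps(question)[1:-1])
    return json.loads(raw)


def fill_dense(template, query_vector):
    """Replace the quoted placeholder with a JSON array of floats."""
    raw = json.dumps(template)
    raw = raw.replace('"{{query_vector}}"', json.dumps(query_vector))
    return json.loads(raw)


print(fill_sparse(sparse_template, "my-elser-model", 'what is "RAG"?'))
print(fill_dense(dense_template, [0.1, 0.2]))
```

The dense placeholder is matched together with its surrounding quotes, so the substituted value becomes a JSON array rather than a string.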


For **Milvus** and **Datastax**, the code below opens the corresponding vector store, generates an embedding for the question using the embedding model specified above, performs a vector search based on that embedding, and retrieves the single closest result.
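
With the `L2` metric, a smaller score means a closer match. The sketch below is a brute-force illustration of a top-`k` search over toy two-dimensional vectors, not how Milvus or Cassandra implement it. The `l2_search` helper and the vectors are hypothetical, and the `radius` cutoff here is only an analogy for the search parameter of the same name, whose exact semantics should be checked in the Milvus documentation.

```python
import math


def l2_search(query, vectors, k=1, radius=None):
    """Return up to k (distance, key) pairs, closest first, optionally within a radius."""
    scored = sorted((math.dist(query, vector), key) for key, vector in vectors.items())
    if radius is not None:
        scored = [pair for pair in scored if pair[0] <= radius]
    return scored[:k]


# Toy embeddings; real embeddings have hundreds of dimensions.
toy_vectors = {
    "expert_a": [0.0, 1.0],
    "expert_b": [1.0, 0.0],
    "expert_c": [5.0, 5.0],
}

print(l2_search([0.1, 0.9], toy_vectors, k=1, radius=1.07))
```

If no vector falls within the radius, the result is an empty list, which corresponds to the case where no sufficiently relevant expert is found.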


In [None]:
############################ ELASTIC #################################

match connection_type:
    case "elasticsearch":
        wslib.download_file(parameters['elastic_search_template_file'])
        with open(parameters['elastic_search_template_file']) as f:
            es_query_json = json.load(f)

        es_query_str = json.dumps(es_query_json)  
        if 'dense' in parameters['elastic_search_vector_type']:
            from langchain_elasticsearch import ElasticsearchEmbeddings
            embeddings = ElasticsearchEmbeddings.from_es_connection(
                        model_id=parameters['elastic_search_model_id'],
                        es_connection=es_client,
                    )
            query_vector = embeddings.embed_documents([question])[0]
            es_query_str = es_query_str.replace('"{{query_vector}}"', str(query_vector))
        else:
            es_query_str = es_query_str.replace("{{model_id}}", parameters['elastic_search_model_id'])
            es_query_str = es_query_str.replace("{{model_text}}", question)
        
        # Convert back to dictionary
        es_query_template = json.loads(es_query_str)
        es_query=es_query_template.get("query",es_query_template)
        print(es_query)

        query_temp_args = {'query': es_query}
        if 'sub_searches' in es_query:
            query_temp_args = {'body': es_query}

        try:
            response = es_client.search(
                index=parameters["expert_profiles_index"], 
                size=parameters['top_k_experts'],
                **query_temp_args
            )
            print("\nResponse:")
            for hit in response['hits']['hits']:
                score = hit['_score']
                source = hit['_source']
                doc_id = source['metadata']['document_id']
                page_content = source['text']
                name = source['metadata']['name']
                phone = source['metadata']['phone']
                email = source['metadata']['email']            
                domain = source['metadata']['domain']
                position = source['metadata']['position']
                source = source['metadata']['source']
                print(f"\nRelevance Score  : {score}\nDocument ID : {doc_id}\nExpert Name : {name}\nEmail : {email}\nPhone Number : {phone}\nDomain : {domain}\nPosition : {position}\nSource : {source}\n\n{page_content}")
        

        except Exception as e:
                print("\nAn error occurred while querying elastic search, please retry after sometime:", e)

############################ MILVUS #################################

    case "milvus" | "milvuswxd":
        print("using the model",parameters['embedding_model_id'], "to create embeddings")
        embedding = get_embedding(environment, parameters, project_id, wml_credentials, WML_SERVICE_URL) if environment == "cloud" else get_embedding(environment, parameters, project_id, wml_credentials, None)  

        search_params = {"metric_type": "L2", "params": {"radius": 1.07}}
        
        # Embed the question once and search the profile collection with that vector
        result = profile_vector_store.similarity_search_with_score_by_vector(embedding.embed_query(question), k=1, param = search_params)
        try: 
            for res in result:
                #print(res)
                print(f"\nRelevance Score  : {res[1]}\n\nMetadata:-->\n\n{res[0].metadata}\n\nText :-->\n\n{res[0].page_content}")
                print('-----------')
        except Exception as e: 
            print("ERROR: No relevant information found. Please check all the parameters and try again.", e)

############################ Datastax #################################
    case "datastax":
        if environment == "cloud":
            raise ValueError(f"ERROR! we don't support datastax connection for Cloud as of now")
        print("using the model",parameters['embedding_model_id'], "to create embeddings")
        embedding = get_embedding(environment, parameters, project_id, wml_credentials, WML_SERVICE_URL) if environment == "cloud" else get_embedding(environment, parameters, project_id, wml_credentials, None)  
        from langchain_community.vectorstores import Cassandra
        # Query the expert profiles table that was populated earlier in this notebook
        vector_store = Cassandra(
            embedding=embedding,
            table_name=parameters["expert_profiles_index"]
        )
        print("Datastax vector store created on the index", parameters["expert_profiles_index"])
        
        search_result= vector_store.similarity_search_with_score_by_vector(embedding.embed_query(question), k=1)
        print("\nQuestion:",question, "\nSearch Results:", search_result)

    case _:
        raise ValueError(f"Unsupported connection_type: {connection_type}")

**Note:** For optimal performance, it is recommended to close the Datastax session once you are done with ingestion in this notebook. Running the cell below closes the existing Datastax connections. If you need to rerun the code cells above, first create a new Datastax connection by rerunning the cells starting from `Connect to Vector Database`.

In [None]:
if connection_type=="datastax" and environment != "cloud":
    if not datastax_session.is_shutdown:
        datastax_session.shutdown()
        print(f"datastax_session got shutdown : {datastax_session.is_shutdown}")
    if not datastax_cluster.is_shutdown:
        datastax_cluster.shutdown()
        print(f"datastax_cluster got shutdown : {datastax_cluster.is_shutdown}")

Proceed to the **Create and Deploy QnA AI Service** notebook to create and deploy the RAG AI service python function.


**Sample Materials, provided under license.** <br>
**Licensed Materials - Property of IBM.** <br>
**© Copyright IBM Corp. 2024, 2025. All Rights Reserved.** <br>
**US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.** <br>