In this notebook, we will set up OpenSearch and get it ready for search.

Before you run code in this notebook, make sure that OpenSearch is running at `localhost:9200`. You can run the following command to fire up OpenSearch:

```docker-compose up -d```

You can check whether OpenSearch is running with the following command: `curl -X GET "http://localhost:9200"`.

In [1]:
import requests
import json

headers = {'Content-Type': 'application/json'}

### Configure settings

We will configure the ML plug-in settings, which let us use ML models:

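1. Set `only_run_on_ml_node` to `false`. This means that the ML models will run on the same node that holds the indexed data. (In production, we should set this to `true` so that ML models can run on their dedicated node.)
2. Set `model_access_control_enabled` to `true`. This will help us upload a custom ML model (if needed) instead of using ML models that are available within OpenSearch.
3. Set `native_memory_threshold` to `99`. This raises the memory-usage limit (as a percentage) above which OpenSearch refuses to run ML tasks, so that the model can still load on a local machine with little free memory.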

In [2]:
url = "http://localhost:9200/_cluster/settings"

payload = {
    "persistent": {
        "plugins": {
            "ml_commons.only_run_on_ml_node": "false",
            "ml_commons.model_access_control_enabled": "true",
            "ml_commons.native_memory_threshold": "99",
        }
    }
}

response = requests.put(url, headers=headers, data=json.dumps(payload))

response.json()

{'acknowledged': True,
 'persistent': {'plugins': {'ml_commons': {'only_run_on_ml_node': 'false',
    'model_access_control_enabled': 'true',
    'native_memory_threshold': '99'}}},
 'transient': {}}
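Note that the request uses dotted keys (`ml_commons.only_run_on_ml_node`) while the response nests them (`ml_commons` → `only_run_on_ml_node`). As a minimal sketch, the hypothetical helper `flatten_settings` below (not part of OpenSearch or `requests`) flattens the nested response so it can be compared with what we sent. The sample dictionary is copied from the output above.

```python
def flatten_settings(nested, prefix=""):
    """Flatten nested settings into dotted keys, e.g. {'a': {'b': 1}} -> {'a.b': 1}."""
    flat = {}
    for key, value in nested.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections, carrying the dotted prefix along.
            flat.update(flatten_settings(value, dotted))
        else:
            flat[dotted] = value
    return flat


# Sample shaped like the response printed above (not fetched from a live cluster).
sample_response = {
    "acknowledged": True,
    "persistent": {
        "plugins": {
            "ml_commons": {
                "only_run_on_ml_node": "false",
                "model_access_control_enabled": "true",
                "native_memory_threshold": "99",
            }
        }
    },
    "transient": {},
}

applied = flatten_settings(sample_response["persistent"])
for name, value in sorted(applied.items()):
    print(f"{name} = {value}")
```

In the notebook, you could pass `response.json()["persistent"]` instead of the sample dictionary.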

### Register a model

Before we can assign (and use) a model, we need to create a **model group** so we can keep the models organized.

In [3]:
url = "http://localhost:9200/_plugins/_ml/model_groups/_register"

payload = {
    "name": "TEXT_model_group",
    "description": "A model group for text embeddings."
}

response = requests.post(url, headers=headers, data=json.dumps(payload))

print(response.json())

{'model_group_id': 'BdlFZZQBuL5CuRNm9qxV', 'status': 'CREATED'}


In [4]:
model_group_id = response.json()['model_group_id']

print(f"Created a Model Group ID: {model_group_id}")

Created a Model Group ID: BdlFZZQBuL5CuRNm9qxV


#### Register a model to the model group

For this exercise, we will use `all-MiniLM-L6-v2` that's available on OpenSearch. See the list of all available models [here](https://opensearch.org/docs/latest/ml-commons-plugin/pretrained-models#supported-pretrained-models).

In [9]:
url = "http://localhost:9200/_plugins/_ml/models/_register"

payload = {
    "name": "huggingface/sentence-transformers/all-MiniLM-L6-v2",
    "version": "1.0.1",
    "model_group_id": model_group_id,
    "model_format": "TORCH_SCRIPT"
}

response = requests.post(url, headers=headers, data=json.dumps(payload))

print(response.json())

{'task_id': 'B9lMZZQBuL5CuRNm0qxb', 'status': 'CREATED'}


In [10]:
task_id = response.json()['task_id']

print(f"Task ID: {task_id}")

Task ID: B9lMZZQBuL5CuRNm0qxb


#### Check the status of the registered model

Sometimes, registering a model may take some time (especially if the model is large). Let's make sure it's complete before we proceed to the next step.

In [11]:
url = f"http://localhost:9200/_plugins/_ml/tasks/{task_id}"

response = requests.get(url, headers=headers)
print(response.json()['state'])

'COMPLETED'

If the status is not 'COMPLETED', wait a little while and run the cell again. If the status is 'FAILED', check the Docker logs using the following command: `docker logs opensearch_evals`.
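Instead of re-running the cell by hand, the status check can be wrapped in a small polling loop. The sketch below is one possible way to do it, not an OpenSearch API: `wait_for_task`, its parameters, and the default timeout are hypothetical names chosen for illustration. It only treats 'COMPLETED' and 'FAILED' as final states (the two mentioned above) and assumes a failed task may carry an `error` field.

```python
import time


def wait_for_task(fetch_task, timeout_s=120.0, interval_s=2.0, sleep=time.sleep):
    """Call fetch_task() repeatedly until the task state is COMPLETED.

    fetch_task: a zero-argument callable returning the task JSON as a dict.
    Raises RuntimeError if the task reports FAILED, TimeoutError if it takes too long.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        task = fetch_task()
        state = task.get("state")
        if state == "COMPLETED":
            return task
        if state == "FAILED":
            # Assumption: a failed task may include an 'error' field; fall back to the whole dict.
            raise RuntimeError(f"Task failed: {task.get('error', task)}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Task still in state {state!r} after {timeout_s}s")
        sleep(interval_s)


# In the notebook, this could be used as follows (not executed here):
# task = wait_for_task(lambda: requests.get(url, headers=headers).json())
# model_id = task["model_id"]
```

The same helper can be reused later to wait for the deployment task.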

In [12]:
model_id = response.json()['model_id']

print(f"Created Model ID {model_id}")

Created Model ID CNlMZZQBuL5CuRNm06zQ


### Deploy the model

In [14]:
url = f"http://localhost:9200/_plugins/_ml/models/{model_id}/_deploy"

response = requests.post(url, headers=headers)

print(response.json())

{'task_id': 'CtlNZZQBuL5CuRNm6qwG', 'task_type': 'DEPLOY_MODEL', 'status': 'CREATED'}


In [15]:
deploy_model_task_id = response.json()['task_id']

print(f"Deploy Model Task ID {deploy_model_task_id}")

Deploy Model Task ID CtlNZZQBuL5CuRNm6qwG


#### Check the status of the deployed model

In [16]:
url = f"http://localhost:9200/_plugins/_ml/tasks/{deploy_model_task_id}"


response = requests.get(url, headers=headers)

print(response.json())

{'model_id': 'CNlMZZQBuL5CuRNm06zQ', 'task_type': 'DEPLOY_MODEL', 'function_name': 'TEXT_EMBEDDING', 'state': 'COMPLETED', 'worker_node': ['lFDMCo6MRD2f0WplTf_BrQ'], 'create_time': 1736866392582, 'last_update_time': 1736866392607, 'is_async': True}


Let's see if we can use this model to generate sentence embeddings.

In [17]:
url = f"http://localhost:9200/_plugins/_ml/_predict/text_embedding/{model_id}"

payload = {
    "text_docs":["This is a test string."],
    "return_number": True,
    "target_response": ["sentence_embedding"]
}

response = requests.post(url, headers=headers, data=json.dumps(payload))

print(response.json())

{'inference_results': [{'output': [{'name': 'sentence_embedding', 'data_type': 'FLOAT32', 'shape': [384], 'data': [0.015552435, 0.07408792, -0.056174736, 0.020090407, -0.08067822, -0.023988575, 0.038808424, 0.03269541, -0.0018481113, -0.028275384, 0.06459784, -0.039391752, 0.007533904, 0.011672242, -0.025374359, -0.012356011, -0.024540061, -0.094170876, -0.046924755, -0.0017848986, 0.03356117, 0.059485007, -0.058551002, 0.010995594, 0.0643761, 0.03311312, -0.02322506, 0.033511985, 0.0556194, -0.0043963925, 0.00041493264, 0.014317813, 0.017027754, 0.03436633, 0.05863846, 0.028501928, -0.017333908, -0.0033512053, 0.048525553, 0.0031814873, 0.025857659, -0.11040843, 0.021466155, 0.03526723, -0.017158495, 0.04018047, -0.058251813, 0.046199523, -0.014536076, -0.052641265, 0.04867339, -0.03086721, -0.09876064, -0.023367776, 0.045699462, 0.0438746, 0.013673885, -0.014007184, 0.027587453, 0.064072855, 0.022105912, -0.0238617, -0.04212197, 0.059888072, 0.07783545, -0.045667414, 0.0029138038, -0

Great, that worked.
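To work with the vector itself rather than the full response, we can pull it out of the nested JSON. The helper below is a minimal sketch based only on the response structure shown above. `extract_embedding` is a hypothetical name, and the sample response uses a made-up 3-dimensional vector in place of the real 384-dimensional one.

```python
def extract_embedding(predict_response, name="sentence_embedding"):
    """Return the 'data' list of the first output called `name` in a predict response."""
    for result in predict_response["inference_results"]:
        for output in result["output"]:
            if output["name"] == name:
                data = output["data"]
                expected_len = output["shape"][0]
                # Sanity check: the vector length should match the reported shape.
                if len(data) != expected_len:
                    raise ValueError(f"Expected {expected_len} values, got {len(data)}")
                return data
    raise KeyError(f"No output named {name!r} in response")


# Illustrative response with the same structure as above but a tiny, made-up vector.
sample_response = {
    "inference_results": [
        {
            "output": [
                {
                    "name": "sentence_embedding",
                    "data_type": "FLOAT32",
                    "shape": [3],
                    "data": [0.1, 0.2, 0.3],
                }
            ]
        }
    ]
}

embedding = extract_embedding(sample_response)
print(len(embedding))
```

In the notebook, `extract_embedding(response.json())` should return a list of 384 floats for this model, matching the `shape` field in the output above.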

### Create an ingestion pipeline

We want to generate embeddings from the strings in the `tags` column of our dataset (which was created in the previous notebook) and store them in the `tag_embedding` field in the vector database.

In [18]:
url = "http://localhost:9200/_ingest/pipeline/tags-ingest-pipeline"

payload = {
    "description": "Image tags embedding pipeline",
    "processors": [
        {
            "text_embedding": {
                "model_id": model_id,
                "field_map": {
                    "tags": "tag_embedding"  # input column -> vector db column
                }
            }
        }
    ]
}

response = requests.put(url, headers=headers, data=json.dumps(payload))

print(response.json())

{'acknowledged': True}
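Before ingesting real data, it can be useful to try the pipeline on a sample document using the ingest pipeline `_simulate` endpoint. The sketch below only builds the request body. `build_simulate_payload` is a hypothetical helper, and the sample tag string is invented for illustration (the actual format of the `tags` column comes from the previous notebook). The commented-out request assumes the same `headers` and pipeline name as above.

```python
import json


def build_simulate_payload(tag_strings):
    """Build a _simulate request body with one test document per tag string."""
    return {"docs": [{"_source": {"tags": tags}} for tags in tag_strings]}


# Made-up sample tags, for illustration only.
payload = build_simulate_payload(["sunset, beach, ocean"])
print(json.dumps(payload, indent=2))

# To run this against the cluster (not executed here):
# url = "http://localhost:9200/_ingest/pipeline/tags-ingest-pipeline/_simulate"
# response = requests.post(url, headers=headers, data=json.dumps(payload))
# If the pipeline works, each simulated document in the response should
# contain a `tag_embedding` field next to the original `tags` field.
```

Nothing is indexed by a simulate call, so it is a cheap way to confirm that the `field_map` is wired to the right model.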


Now that we have configured OpenSearch, registered a model, and created an ingestion pipeline, we are ready to ingest the dataset that we created in the previous notebook. Let's [move on to the next notebook](./02_index_tags_data.ipynb)!