# Index Query Data

## Import Data

Let us start by reading the first 1000 rows of query data from `train.csv`.

In [1]:
import pandas as pd
QUERIES_CSV_FILE = "/workspace/datasets/train.csv"
df_queries = pd.read_csv(QUERIES_CSV_FILE, nrows = 1000)
df_queries.head()

Unnamed: 0,user,sku,category,query,click_time,query_time
0,000000df17cd56a5df4a94074e133c9d4739fae3,2125233,abcat0101001,Televisiones Panasonic 50 pulgadas,2011-09-01 23:44:52.533,2011-09-01 23:43:59.752
1,000001928162247ffaf63185cd8b2a244c78e7c6,2009324,abcat0101001,Sharp,2011-09-05 12:25:37.42,2011-09-05 12:25:01.187
2,000017f79c2b5da56721f22f9fdd726b13daf8e8,1517163,pcmcat193100050014,nook,2011-08-24 12:56:58.91,2011-08-24 12:55:13.012
3,000017f79c2b5da56721f22f9fdd726b13daf8e8,2877125,abcat0101001,rca,2011-10-25 07:18:14.722,2011-10-25 07:16:51.759
4,000017f79c2b5da56721f22f9fdd726b13daf8e8,2877134,abcat0101005,rca,2011-10-25 07:19:51.697,2011-10-25 07:16:51.759


Let us write a function to transform `click_time` and `query_time` into timestamps. This will come in handy when we try to build search aggregations.

In [2]:
def prepare_batch_queries(batch):
    return (
        batch
          .assign(click_time = lambda d: pd.to_datetime(d.click_time, format='ISO8601'))
          .assign(query_time = lambda d: pd.to_datetime(d.query_time, format='ISO8601'))
    )
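As a quick illustration of why the parsed timestamps are useful, the sketch below applies the same `format='ISO8601'` parsing to two rows copied from the preview above and derives the delay between query and click. The `sample` frame and the `time_to_click` name are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Two rows copied from the preview above; note the varying fractional-second precision.
sample = pd.DataFrame({
    "click_time": ["2011-09-01 23:44:52.533", "2011-09-05 12:25:37.42"],
    "query_time": ["2011-09-01 23:43:59.752", "2011-09-05 12:25:01.187"],
})

# Same parsing as prepare_batch_queries (format="ISO8601" needs pandas 2.0 or later).
parsed = sample.assign(
    click_time=lambda d: pd.to_datetime(d.click_time, format="ISO8601"),
    query_time=lambda d: pd.to_datetime(d.query_time, format="ISO8601"),
)

# With real timestamps, subtraction yields timedeltas (roughly 52.8 s and 36.2 s here).
time_to_click = (parsed.click_time - parsed.query_time).dt.total_seconds()
print(time_to_click)
```

Had the columns stayed as strings, this subtraction would raise an error instead of producing durations.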

## Index Documents

Let us write a function that indexes a single batch of records using the bulk indexing capabilities of OpenSearch.

In [18]:
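from opensearchpy.helpers import bulk
import uuid

def index_batch(batch: pd.DataFrame, index_name: str) -> int:
    # `client` is the OpenSearch client created under "Initialize Client"; run that cell first.
    records = batch.to_dict(orient="records")
    # One bulk action per row, each with a random 32-character hex ID.
    docs = [{"_id": uuid.uuid4().hex, "_index": index_name, "_source": record} for record in records]
    bulk(client, docs, request_timeout=60)
    # Return the batch size so callers can total the indexed records.
    return len(batch)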
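Because every document gets a random `uuid4` ID, indexing the same batch twice produces duplicate documents. One possible variation, sketched below, derives the ID from a stable column so that a re-run overwrites instead of duplicating. The `to_bulk_actions` helper and its `id_field` parameter are hypothetical names introduced for illustration; the notebook itself uses random IDs.

```python
import uuid

def to_bulk_actions(records, index_name, id_field=None):
    """Yield one bulk action per record.

    `id_field` is a hypothetical option: when given, the document ID comes from
    that field; otherwise a random hex ID is generated, as in index_batch.
    """
    for record in records:
        doc_id = str(record[id_field]) if id_field else uuid.uuid4().hex
        yield {"_index": index_name, "_id": doc_id, "_source": record}

records = [{"doc_id": 0, "query": "nook"}, {"doc_id": 1, "query": "rca"}]
stable = list(to_bulk_actions(records, "test-index-to-delete-1", id_field="doc_id"))
random_ids = list(to_bulk_actions(records, "test-index-to-delete-1"))
print([action["_id"] for action in stable])
```

The generator can be passed to `bulk` in place of the `docs` list, since the helper accepts any iterable of actions.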

To speed things up, we can write an `index_batches` function that runs `index_batch` on several batches in parallel.

In [15]:
from functools import partial
from tqdm.contrib.concurrent import process_map
def index_batches(batches, index_name:str):
    index_batch_partial = partial(index_batch, index_name = index_name)
    results = process_map(index_batch_partial, batches, max_workers=8)
    print(f"Indexed {sum(results)} records")
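`process_map` distributes the batches across worker processes. Since bulk requests spend much of their time waiting on the network, a thread pool may be a workable alternative that avoids pickling the batches. The sketch below shows that pattern with the standard library only; `fake_index_batch` is a stand-in that just counts rows so the example runs without a cluster, and `index_batches_threaded` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def fake_index_batch(batch, index_name):
    # Stand-in for index_batch: pretend to index and report the batch size.
    return len(batch)

def index_batches_threaded(batches, index_name, max_workers=8, index_fn=fake_index_batch):
    # Bind the index name once, then map the worker over all batches.
    worker = partial(index_fn, index_name=index_name)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, batches))
    return sum(results)

total = index_batches_threaded([[1, 2, 3], [4, 5]], "test-index-to-delete-1")
print(f"Indexed {total} records")
```

Passing the real `index_batch` as `index_fn` would send actual bulk requests. `tqdm.contrib.concurrent` also offers `thread_map` if the progress bar should be kept.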
    

## Initialize Client

In [5]:
from opensearchpy import OpenSearch
from IPython.display import JSON
import json

def print_json(x):
    print(json.dumps(x, indent = 2))
    
client = OpenSearch(
    hosts = [{"host": "localhost", "port": 9200}],
    http_auth = ("admin", "admin"),
    use_ssl = True,
    verify_certs = False,
    ssl_assert_hostname = False,
    ssl_show_warn = False,
)
print_json(client.info())

{
  "name": "fc81333c71df",
  "cluster_name": "docker-cluster",
  "cluster_uuid": "_Xsvjrs0TJOF3p6QlsWnJA",
  "version": {
    "distribution": "opensearch",
    "number": "2.9.0",
    "build_type": "tar",
    "build_hash": "1164221ee2b8ba3560f0ff492309867beea28433",
    "build_date": "2023-07-18T21:23:29.367080729Z",
    "build_snapshot": false,
    "lucene_version": "9.7.0",
    "minimum_wire_compatibility_version": "7.10.0",
    "minimum_index_compatibility_version": "7.0.0"
  },
  "tagline": "The OpenSearch Project: https://opensearch.org/"
}


## Create Index

In [19]:
client.indices.delete("test-index-to-delete-1", ignore=[404])  # tolerate a missing index on the first run

{'acknowledged': True}

In [20]:
import yaml
body = yaml.safe_load("""
settings:
  index:
    query:
      default_field: body

""")
response = client.indices.create("test-index-to-delete-1", body = body)
print_json(response)

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "test-index-to-delete-1"
}
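The index above relies on dynamic mapping to infer field types. If the time fields are later used in date-based aggregations, one option is to declare the types explicitly when creating the index. The body below is only a sketch under that assumption: the field types are guesses based on the columns shown in the preview, not the notebook's actual index definition.

```python
import json

# Hypothetical index body: the settings from the cell above plus explicit mappings.
body_with_mappings = {
    "settings": {"index": {"query": {"default_field": "body"}}},
    "mappings": {
        "properties": {
            "user": {"type": "keyword"},
            "sku": {"type": "keyword"},
            "category": {"type": "keyword"},
            "query": {"type": "text"},
            "click_time": {"type": "date"},
            "query_time": {"type": "date"},
        }
    },
}

print(json.dumps(body_with_mappings["mappings"], indent=2))
```

Such a dictionary could be passed as the `body` argument to `client.indices.create` in place of the YAML-derived one.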


## Test Indexing

Before we go ahead and index all documents, let us do a trial run on a small subset. This will allow us to debug any issues.

In [21]:
df_queries_small = (
  pd.read_csv(QUERIES_CSV_FILE, nrows = 6000)
    .assign(doc_id = lambda d: range(len(d)))  # sequential IDs; also works if the file has fewer than 6000 rows
)
df_queries_small.to_csv("/tmp/queries.csv", index=False)

In [22]:
batches = list(pd.read_csv("/tmp/queries.csv", chunksize=3000))
batches = [prepare_batch_queries(batch) for batch in batches]
index_batches(batches, "test-index-to-delete-1")

100%|██████████| 2/2 [00:04<00:00,  2.12s/it]

Indexed 6000 records
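The two batches in the progress bar come from the `chunksize` argument, which makes `pd.read_csv` return an iterator of DataFrames instead of one frame. The sketch below reproduces that behaviour on a tiny in-memory CSV; the seven-row `csv_text` is made up for illustration.

```python
import io
import pandas as pd

# Seven made-up rows, read three at a time.
csv_text = "doc_id,query\n" + "".join(f"{i},q{i}\n" for i in range(7))
batches = list(pd.read_csv(io.StringIO(csv_text), chunksize=3))

# The last chunk holds whatever is left over.
sizes = [len(batch) for batch in batches]
print(sizes)
```

By the same logic, 6000 rows with `chunksize=3000` give exactly two full batches.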

We can check the index to make sure that the number of documents matches the number of records we indexed.

In [23]:
client.indices.refresh(index = "test-index-to-delete-1")
client.cat.count(index = "test-index-to-delete-1", format="json")

[{'epoch': '1696888670', 'timestamp': '21:57:50', 'count': '6000'}]
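Note that the cat API reports the count as a string. A small helper, sketched here with the response copied from the output above, can turn the check into an explicit assertion; `indexed_count` is a hypothetical name.

```python
def indexed_count(cat_count_response):
    # The cat count API returns a one-element list; the count arrives as a string.
    return int(cat_count_response[0]["count"])

# Response copied from the cell output above.
sample_response = [{"epoch": "1696888670", "timestamp": "21:57:50", "count": "6000"}]
expected_records = 6000

assert indexed_count(sample_response) == expected_records
print("Document count matches the number of indexed records")
```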

## Search Documents

Let us write a function to transform the search response into a table.

In [24]:
def display_search_response(response):
  hits = response['hits']['hits']
  sources = [hit["_source"] for hit in hits]
  return pd.concat([pd.DataFrame(hits).drop(["_source"], axis=1), pd.DataFrame(sources)], axis=1)
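An alternative way to flatten the hits, shown here only as a sketch, is `pd.json_normalize`, which expands the nested `_source` dictionary into prefixed columns. The `mock_response` below is made up to mirror the shape of a search response, and `hits_to_frame` is a hypothetical name.

```python
import pandas as pd

# Made-up response with the same shape as an OpenSearch search result.
mock_response = {"hits": {"hits": [
    {"_index": "test-index-to-delete-1", "_id": "a1", "_score": 1.0,
     "_source": {"sku": 2944053, "query": "iPad 2 executive cases"}},
    {"_index": "test-index-to-delete-1", "_id": "b2", "_score": 1.0,
     "_source": {"sku": 8054963, "query": "ico"}},
]}}

def hits_to_frame(response):
    # json_normalize expands `_source` into columns named "_source.<field>".
    frame = pd.json_normalize(response["hits"]["hits"])
    prefix = "_source."
    # Strip the prefix so the columns match the original field names.
    return frame.rename(columns=lambda c: c[len(prefix):] if c.startswith(prefix) else c)

flat = hits_to_frame(mock_response)
print(flat)
```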

In [25]:
# Fetch up to 10,000 documents (the default `index.max_result_window`) with a match_all query
body_yaml = """
query:
  match_all: {}
size: 10000
"""
response = client.search(
  index = "test-index-to-delete-1",
  body = yaml.safe_load(body_yaml)
)
hits = display_search_response(response)
hits.head()

Unnamed: 0,_index,_id,_score,user,sku,category,query,click_time,query_time,doc_id
0,test-index-to-delete-1,5e080cc9c6bf411ea6c32a70bcddaa0d,1.0,000395d07d4bad3c5a3534e178a31dccadcb7fe6,2944053,pcmcat218000050000,iPad 2 executive cases,2011-09-28T08:42:02.706000,2011-09-28T08:40:06.023000,133
1,test-index-to-delete-1,2b244447d0cb4125af5dde8d8ec6f815,1.0,000395d07d4bad3c5a3534e178a31dccadcb7fe6,3216222,pcmcat218000050000,iPad 2 executive cases,2011-09-28T08:42:24.276000,2011-09-28T08:40:06.023000,134
2,test-index-to-delete-1,9a7779a664cb45c3acc2490bbad96d56,1.0,000395d07d4bad3c5a3534e178a31dccadcb7fe6,2944044,pcmcat218000050000,iPad 2 executive cases,2011-09-28T08:42:42.828000,2011-09-28T08:40:06.023000,135
3,test-index-to-delete-1,bf30a7e7b185469fbe7cb3753422660f,1.0,0003a0dbe6ea8d4905412f7ac1d6bb5817200481,8054963,pcmcat232900050017,ico,2011-10-17T12:31:39.042000,2011-10-17T12:29:08.582000,136
4,test-index-to-delete-1,266d8f7dde904411836ec2a78534c585,1.0,0003ab040e9195df8c58f7e9c07b1ddec1be5a07,9755886,cat02015,Lord of the rings,2011-10-20T13:28:40.291000,2011-10-20T13:27:48.953000,137


In [160]:
client.indices.delete('test-index-to-delete-1')

{'acknowledged': True}