# OpenSearch User Behavior Insights (UBI)

### This notebook covers the basics of setting up UBI, ingesting data using the UBI plugin, and setting up a basic UBI dashboard in OpenSearch Dashboards.
### Note: Since OpenSearch 3.0.0 the plugin no longer needs to be installed, but it must be installed for any OpenSearch 2.x version.

**Information regarding UBI:**

https://opensearch.org/docs/latest/search-plugins/ubi/index

https://github.com/opensearch-project/user-behavior-insights

In [1]:
from aips import get_engine
from engines.opensearch.config import OPENSEARCH_URL
from aips.spark.dataframe import from_sql
from aips.spark import create_view_from_collection
import tqdm
import aips.indexer
import requests, json
engine = get_engine("opensearch")

In [2]:
aips.indexer.build_collection(engine, "products")
aips.indexer.build_collection(engine, "signals")

Wiping "products" collection
Creating "products" collection
Loading Products
Schema: 
root
 |-- upc: string (nullable = true)
 |-- name: string (nullable = true)
 |-- manufacturer: string (nullable = true)
 |-- short_description: string (nullable = true)
 |-- long_description: string (nullable = true)

Successfully written 48194 documents
Wiping "signals" collection
Creating "signals" collection
Loading data/retrotech/signals.csv
Schema: 
root
 |-- query_id: string (nullable = true)
 |-- user: string (nullable = true)
 |-- type: string (nullable = true)
 |-- target: string (nullable = true)
 |-- signal_time: timestamp (nullable = true)

Successfully written 2172605 documents


<engines.opensearch.OpenSearchCollection.OpenSearchCollection at 0xffff5ff5de40>

### Loading signals from AIPS format into UBI

This AIPS (AI-Powered Search) code base uses a simplified signals format described in section 4.1.3 of the _AI-Powered Search_ book.

This section shows you how to convert from this format into the UBI standard format and add those signals to OpenSearch using the UBI plugin.

### **Step 1**: Install the UBI plugin (no longer required as of OpenSearch 3.0.0)

To install UBI on an OpenSearch cluster, execute the following command on each node or while building the image. This command has already been run on the AIPS OpenSearch node.

**bin/opensearch-plugin install https://github.com/opensearch-project/user-behavior-insights/releases/download/2.18.0.0/opensearch-ubi-2.18.0.0.zip --batch**
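
To confirm that the plugin is present on a 2.x node, one option is to list the installed plugins through the `_cat/plugins` API. The sketch below is illustrative and not part of the AIPS code base: the helper names are hypothetical, and the component name `opensearch-ubi` is an assumption inferred from the zip file name above.

```python
import json
import urllib.request

def has_ubi_plugin(plugins, component="opensearch-ubi"):
    """Return True if any entry of a _cat/plugins JSON listing matches `component`."""
    return any(p.get("component") == component for p in plugins)

def fetch_plugins(opensearch_url):
    """Fetch the installed plugins as a list of dicts (requires a reachable cluster)."""
    with urllib.request.urlopen(f"{opensearch_url}/_cat/plugins?format=json") as response:
        return json.loads(response.read())

# Hand-written listing for illustration only; not captured from a real cluster.
sample_plugins = [{"name": "opensearch-node1",
                   "component": "opensearch-ubi",
                   "version": "2.18.0.0"}]
print(has_ubi_plugin(sample_plugins))
```

Against a running cluster, `has_ubi_plugin(fetch_plugins(OPENSEARCH_URL))` would perform the same check on live data.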

### **Step 2**: Initialize the UBI collections

In [3]:
requests.post(f"{OPENSEARCH_URL}/_plugins/ubi/initialize").json()

{'message': 'UBI indexes created.'}

### **Step 3**: Bulk ingesting historic signals

Historic user events and queries should be bulk ingested into the appropriate UBI collections.

Here we bulk write all AIPS queries into the `ubi_queries` collection.

In [15]:
# By default, we batch load historical queries here because it's faster than simulating live traffic query-by-query
BATCH_LOAD_QUERIES=True

# Likewise later in the notebook we set simulation of live queries to False by default, since we've already batch loaded them here
#SIMULATE_LIVE_QUERIES=False

# To reverse this behavior, set BATCH_LOAD_QUERIES=False here and SIMULATE_LIVE_QUERIES=True later in the notebook,
# or set them both to True to test doing both.

In [5]:
def get_queries_dataframe():
    signals_collection = engine.get_collection("signals")
    create_view_from_collection(signals_collection, "signals")
    queries = from_sql("SELECT * FROM signals WHERE type = 'query'")
    queries_transformed = queries.rdd.map(lambda r: 
        (r["signal_time"], r["query_id"], r["user"], r["target"]))
    ubi_queries_dataframe = queries_transformed.toDF(
        ["timestamp", "query_id", "client_id", "user_query"])
    return ubi_queries_dataframe
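
The field renaming performed by `get_queries_dataframe` can be sketched for a single record in plain Python, without Spark. The function name and sample values below are hypothetical and only illustrate the mapping from the AIPS signal fields to the `ubi_queries` fields used above.

```python
from datetime import datetime

def aips_query_signal_to_ubi(signal):
    """Rename the fields of one AIPS query signal to the names used in ubi_queries."""
    return {"timestamp": signal["signal_time"],
            "query_id": signal["query_id"],
            "client_id": signal["user"],       # the AIPS user becomes the UBI client_id
            "user_query": signal["target"]}    # for query signals, target holds the query text

# Illustrative record; the values are made up.
example_signal = {"query_id": "qid_000001", "user": "uid_000001", "type": "query",
                  "target": "cable", "signal_time": datetime(2020, 1, 1, 12, 0, 0)}
ubi_query = aips_query_signal_to_ubi(example_signal)
print(ubi_query)
```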

In [6]:
def batch_ingest_queries():
    # Example of batch loading queries directly into `ubi_queries`.
    # It runs below only when BATCH_LOAD_QUERIES is True; queries can also be
    # logged through live search requests, as shown later in the notebook.
    queries_collection = engine.get_collection("ubi_queries")
    ubi_queries_dataframe = get_queries_dataframe()
    queries_collection.write(ubi_queries_dataframe)
    return queries_collection

if BATCH_LOAD_QUERIES:
    queries_collection = batch_ingest_queries()

Successfully written 725459 documents


Next, we can index events into the `ubi_events` collection, which is intended to hold all non-query signals.

In [7]:
def get_events_dataframe():
    signals_collection = engine.get_collection("signals")
    products_collection = engine.get_collection("products")
    create_view_from_collection(signals_collection, "signals")
    create_view_from_collection(products_collection, "products")
    query = """SELECT REPLACE(type, '-', '_') AS action_name, query_id, user AS client_id,
                      signal_time AS timestamp, type AS message_type,
                      target AS target, p.name AS message
               FROM signals s
               LEFT JOIN products p ON s.target == p.upc
               WHERE type != 'query'"""
    events = from_sql(query)
    return events
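
The SQL above can be read as a per-record mapping. The following plain-Python sketch mirrors it for one signal, using a dictionary lookup in place of the `LEFT JOIN`; the function name and the sample `query_id` and `user` values are hypothetical.

```python
def aips_event_signal_to_ubi(signal, product_names):
    """Map one non-query AIPS signal to the field names written to ubi_events."""
    return {"action_name": signal["type"].replace("-", "_"),
            "query_id": signal["query_id"],
            "client_id": signal["user"],
            "timestamp": signal["signal_time"],
            "message_type": signal["type"],
            "target": signal["target"],
            # A missing product yields None, like an unmatched row in the LEFT JOIN
            "message": product_names.get(signal["target"])}

product_names = {"640282090599": "Accu Cable - 5' 3-Pin DMX Cable"}
example_signal = {"query_id": "qid_000001", "user": "uid_000001", "type": "add-to-cart",
                  "target": "640282090599", "signal_time": "2020-01-01T12:00:00"}
ubi_event = aips_event_signal_to_ubi(example_signal, product_names)
print(ubi_event["action_name"], "|", ubi_event["message"])
```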

In [8]:
def batch_ingest_signals():
    events_collection = engine.get_collection("ubi_events")
    ubi_events_dataframe = get_events_dataframe()
    events_collection.write(ubi_events_dataframe)
    return events_collection

events_collection = batch_ingest_signals()

Successfully written 1447146 documents


### **Step 4**: Live logging of queries and events

For best results, queries and events must be ingested into UBI correctly and with complete data. UBI stores queries separately from other events, each in its own collection: `ubi_queries` and `ubi_events`. Live event collection should be hooked into the appropriate places in your stack.

Logging event data is as simple as writing an event document directly to the `ubi_events` collection.

In [9]:
from datetime import datetime

def add_example_event_to_ubi():
    event_doc = {"action_name": "purchase", # The name of the type of event/action that occurred
                 "client_id": "uid_000001", # The id of the user/session taking the action
                 "message_type": "one_click_buy", # An additional action type, used for further action grouping
                 "message": "Succeeded", # An optional message string for the event
                 "query_id": "qid_000001", # The id of the query that led to this action
                 "target": "pid_000001", # Any string representing the target of the action, normally a doc/item id
                 "timestamp": int(datetime.now().timestamp() * 1000)} # The time of the event in epoch_millis; if not passed, it becomes the current time

    response = requests.post("http://opensearch-node1:9200/ubi_events/_doc",
                             json=event_doc)
    display(response.json())

add_example_event_to_ubi()

{'_index': 'ubi_events',
 '_id': 'LAKvRpoBlgjtwxAaWIpM',
 '_version': 1,
 'result': 'created',
 '_shards': {'total': 2, 'successful': 1, 'failed': 0},
 '_seq_no': 19199948,
 '_primary_term': 8}
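
When several events are logged at once, they can be sent in one request through the OpenSearch `_bulk` API, which takes newline-delimited JSON with an action line before each document. The sketch below only builds the request body; the helper names are hypothetical and the event values are illustrative.

```python
import json
from datetime import datetime, timezone

def to_epoch_millis(moment):
    """Convert a datetime to integer milliseconds since the Unix epoch."""
    return int(moment.timestamp() * 1000)

def build_bulk_body(events, index="ubi_events"):
    """Serialize events as _bulk NDJSON: one action line, then one document, per event."""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(event))
    return "\n".join(lines) + "\n"  # the _bulk API requires a trailing newline

events = [{"action_name": "click", "client_id": "uid_000001", "query_id": "qid_000001",
           "target": "pid_000001",
           "timestamp": to_epoch_millis(datetime(2024, 1, 1, tzinfo=timezone.utc))}]
body = build_bulk_body(events)
print(body)
```

The resulting string could then be posted to `{OPENSEARCH_URL}/_bulk` with the header `Content-Type: application/x-ndjson`.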

Queries sent to OpenSearch that contain a `ubi` section in the `ext` block are automatically captured by the UBI plugin and stored in the internal `ubi_queries` index. Here is an example of ingesting query data by adding a `ubi` property to the `ext` object of a search request:

In [10]:
def execute_example_query_with_ubi():        
    collection = "products"
    query = "cable"
    ubi_extension_data = {"ubi": {"query_id": "qid_000001",
                                  "client_id": "cid_000001",
                                  "user_query": query}}
    search_request = {
        "query": {"query_string": {"query": query,
                                   "fields": ["name", "manufacturer",
                                              "long_description", "short_description"]}},
        "size": 11, 
        "fields": ["*"],
        "ext": ubi_extension_data
    }

    response = requests.post(f"http://opensearch-node1:9200/{collection}/_search?",
                             json=search_request)
    display(response.json())

execute_example_query_with_ubi()

{'took': 5,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 1165, 'relation': 'eq'},
  'max_score': 7.091094,
  'hits': [{'_index': 'products',
    '_id': '640282090599',
    '_score': 7.091094,
    '_source': {'upc': '640282090599',
     'name': "Accu Cable - 5' 3-Pin DMX Cable",
     'manufacturer': 'Accu Cable',
     'short_description': "DMX connector; 5' length; 3 pins; 22 AWG; 110 ohms impedance",
     'long_description': "Connect lighting units with this 5' 3-pin DMX cable that provides safety and shielding from transmission interference."},
    'fields': {'short_description': ["DMX connector; 5' length; 3 pins; 22 AWG; 110 ohms impedance"],
     'name': ["Accu Cable - 5' 3-Pin DMX Cable"],
     'upc': ['640282090599'],
     'long_description': ["Connect lighting units with this 5' 3-pin DMX cable that provides safety and shielding from transmission interference."],
     'manufacturer': ['Accu Cable']}},
   

Notice that UBI information is returned on the search response object, including at least the UBI query id that links the response to the ingested query.
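
A client can read that id back from the response to attach it to later events. The sketch below assumes the id is echoed under `ext.ubi.query_id` in the response body; this location is an assumption that should be checked against your OpenSearch version, and the helper name is hypothetical.

```python
def get_ubi_query_id(search_response):
    """Return the query_id echoed under ext.ubi, or None if that block is absent."""
    return search_response.get("ext", {}).get("ubi", {}).get("query_id")

# Hand-written response fragment for illustration; not captured from a real cluster.
sample_response = {"took": 5, "hits": {"hits": []},
                   "ext": {"ubi": {"query_id": "qid_000001"}}}
print(get_ubi_query_id(sample_response))
```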

The following code loads all query signals into UBI by simulating user searches. It serves as a batch import of data for the sake of example. Batch importing should normally be done by bulk indexing query signals directly into `ubi_queries`, as shown earlier.


### Simulate live queries to UBI

The cell below loads all Retrotech query signals and runs searches directly against the search engine with UBI enabled. This simulates real traffic to the search engine, so that queries are logged end-to-end.

Since you already loaded all the queries once above, you must set the `SIMULATE_LIVE_QUERIES` variable to `True` if you'd like to run them again as simulated "live" traffic.

Otherwise, you can continue directly to view the dashboards in the next section.

In [16]:
# By default, we batch load queries earlier in this notebook because it's faster than simulating live traffic query by query
#BATCH_LOAD_QUERIES=True

# Likewise, we set simulation of live queries to False by default here, since we've batch loaded them earlier
SIMULATE_LIVE_QUERIES=False

# To reverse this behavior, set BATCH_LOAD_QUERIES=False earlier in the notebook and SIMULATE_LIVE_QUERIES=True here,
# or set them both to True to test doing both.

In [12]:
def execute_search(collection, signal, log=False):
    signal.pop("timestamp", None) # The timestamp of a query is the time of the search and cannot be passed in
    request = {"query": signal["user_query"],
               "query_fields": ["name", "manufacturer",
                                "long_description", "short_description"],
               "return_fields": ["*"],
               "limit": 10,
               "ubi": signal | {"store_name": "aips_store"}}
    try:
        return collection.search(**request)
    except Exception as e:
        # Skip queries that fail so the simulation can continue
        if log:
            print(f"Search failed for {signal.get('user_query')!r}: {e}")
        return None

def search_and_log_all_query_signals():
    products_collection = engine.get_collection("products")
    query_rows = get_queries_dataframe().collect() # Collect once so the DataFrame is not evaluated twice
    for q in tqdm.tqdm(query_rows):
        execute_search(products_collection, q.asDict())

if SIMULATE_LIVE_QUERIES:
    search_and_log_all_query_signals()

### Loading UBI queries and events INTO AIPS

The previous sections showed you how to load signals data *FROM AIPS format* into UBI. This section does the opposite, showing you how to load data from UBI *INTO AIPS format*.

Doing this allows you to take your live traffic and convert it into the AIPS format to run all the notebooks in the AIPS code base. That way you can easily generate models on your own data (signals boosting, collaborative filtering, learning to rank, click models for learning to rank, etc.).

If you wish to load UBI queries/events from your OpenSearch cluster to work with the book, you can do so with the following code.

In [13]:
def load_ubi_events_as_aips_dataframe():
    ubi_events_collection = engine.get_collection("ubi_events")
    create_view_from_collection(ubi_events_collection, "ubi_events")
    events = from_sql("SELECT * FROM ubi_events")
    # Use `target` (the item id) as the AIPS target; `message` holds the product name
    events_transformed = events.rdd.map(lambda r: 
        (r["timestamp"], r["query_id"], r["client_id"],
         r["target"], r["message_type"]))
    return events_transformed.toDF(["signal_time", "query_id", "user", "target", "type"])

def load_ubi_queries_as_aips_dataframe():
    ubi_queries_collection = engine.get_collection("ubi_queries")
    create_view_from_collection(ubi_queries_collection, "ubi_queries")
    queries = from_sql("SELECT timestamp, query_id, client_id, user_query, query FROM ubi_queries")
    queries_transformed = queries.rdd.map(lambda r: 
        (r["timestamp"], r["query_id"], r["client_id"],
         r["user_query"], "query"))
    return queries_transformed.toDF(["signal_time", "query_id", "user", "target", "type"])

def create_signals_collection_with_ubi_data():
    signals_collection = engine.create_collection("signals")
    
    if not SIMULATE_LIVE_QUERIES:
        queries = load_ubi_queries_as_aips_dataframe()
        signals_collection.write(queries)
        
    events = load_ubi_events_as_aips_dataframe()
    signals_collection.write(events, overwrite=False)
    return signals_collection

signals_collection = create_signals_collection_with_ubi_data()

Wiping "signals" collection
Creating "signals" collection
Successfully written 725460 documents
Successfully written 1447147 documents
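
For a single query record, the conversion in `load_ubi_queries_as_aips_dataframe` amounts to renaming fields and setting the signal type to `query`. A plain-Python sketch of that mapping follows; the function name and sample values are hypothetical.

```python
def ubi_query_to_aips_signal(ubi_query):
    """Map one ubi_queries record to the AIPS signal fields."""
    return {"signal_time": ubi_query["timestamp"],
            "query_id": ubi_query["query_id"],
            "user": ubi_query["client_id"],
            "target": ubi_query["user_query"],  # the query text becomes the signal target
            "type": "query"}                    # every record from ubi_queries is a query signal

# Illustrative record; the values are made up.
example_ubi_query = {"timestamp": "2020-01-01T12:00:00", "query_id": "qid_000001",
                     "client_id": "uid_000001", "user_query": "cable"}
aips_signal = ubi_query_to_aips_signal(example_ubi_query)
print(aips_signal)
```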


### Creating and viewing the UBI Dashboard

The following code imports the default UBI dashboard objects. The dashboard can be viewed here:

http://opensearch-aips:5601/app/dashboards


In [None]:
def import_ubi_dashboard():
    with open("./engines/opensearch/build/ubi-dashboard-objects.ndjson", "rb") as f: 
        dashboard_ndjson = f.read()
    response = requests.post(f"http://opensearch-dashboards:5601/api/saved_objects/_import?createNewCopies=true",
                            files={"file": ("request.ndjson", dashboard_ndjson)},
                            headers={"kbn-xsrf": "true",
                                     "osd-version": "2.18.0",
                                     "osd-xsrf": "osd-fetch"})
    display(response.json())

import_ubi_dashboard()
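
The import response can be checked before opening the dashboard. The sketch below assumes the response body carries `success`, `successCount`, and `errors` fields, as the saved objects import API commonly returns; treat those field names as assumptions to verify against your OpenSearch Dashboards version. The helper name and sample bodies are hypothetical.

```python
def summarize_import(result):
    """Return (ok, message) for a saved-objects import response body."""
    if result.get("success"):
        return True, f"Imported {result.get('successCount', 0)} objects"
    errors = result.get("errors", [])
    details = ", ".join(f"{e.get('type')}:{e.get('id')}" for e in errors)
    return False, f"Import failed for {len(errors)} objects: {details}"

# Hand-written response bodies for illustration; not captured from a real server.
ok_result = {"success": True, "successCount": 3}
failed_result = {"success": False, "successCount": 0,
                 "errors": [{"type": "dashboard", "id": "example-id"}]}
print(summarize_import(ok_result))
print(summarize_import(failed_result))
```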