This notebook generates an evaluation dataset from user logs. It runs within Data Workspace. The evaluation dataset consists of user questions and document filenames, and can be used to evaluate Redbox.

In [None]:
import ast

import numpy as np
import pandas as pd
import sqlalchemy
from sqlalchemy.engine.base import Engine
from sqlalchemy.sql import text as sql_text

# No host or credentials in the URL: the driver falls back to its defaults
# (for example the PG* environment variables).
engine = sqlalchemy.create_engine("postgresql://")
 
def query(
    sql: str = None,
    dataset: str = None,
    params: dict[str, str] = None,
    engine: Engine = engine,
) -> pd.DataFrame:
    """Read the full result set from Data Workspace for an arbitrary query.

    Parameters:
        sql (str): a valid PostgreSQL query
        dataset (str): a table name in the format 'schema.table'
        params (dict of str: str): a dictionary of parameters passed to the query
        engine (sqlalchemy.engine.base.Engine): a valid SQLAlchemy engine

    Returns:
        A pandas DataFrame read from Data Workspace
    """
    # Exactly one of sql and dataset must be supplied.
    if sql is None and dataset is None:
        raise ValueError("Either sql or dataset must contain a value.")
    if sql is not None and dataset is not None:
        raise ValueError("Either sql or dataset must contain a value, not both.")
    if sql is None:
        sql = f"SELECT * FROM {dataset};"

    with engine.connect() as conn:
        return pd.read_sql(sql_text(sql), conn, params=params)
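
Because `query` interpolates `dataset` directly into the SQL string, it may be worth checking that the value really has the 'schema.table' shape before it reaches the f-string. The sketch below is one possible guard, not part of the notebook: `validate_dataset_name` is a hypothetical helper, and the pattern assumes plain, unquoted identifiers.

```python
import re

# Assumption: dataset names are plain, unquoted identifiers in 'schema.table' form.
_DATASET_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\.[A-Za-z_][A-Za-z0-9_]*")


def validate_dataset_name(dataset: str) -> str:
    """Return `dataset` unchanged if it looks like 'schema.table', else raise."""
    if not _DATASET_PATTERN.fullmatch(dataset):
        raise ValueError(f"Expected 'schema.table', got {dataset!r}")
    return dataset


print(validate_dataset_name("dbt.redbox__chat_message_details"))
```

Calling this helper at the top of the `sql is None` branch would reject values such as `"dbt.users; drop table x"` before any SQL is built.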
 


Extract the data that contains user questions.

In [None]:
df_chat_msg = query(
    """
    select
        *
    from
        dbt.redbox__chat_message_details as rb
    """
)
 

In [None]:
df_chat_msg.shape

Filter the data to remove null or empty questions and keep only the records that belong to the user (not the AI response).

In [None]:
df_chat_msg_filt = df_chat_msg[(df_chat_msg.text.notnull()) & (df_chat_msg.text != '') & (df_chat_msg.role=='user')]
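
To see what the three conditions do, the same mask can be tried on a small toy frame. This is only an illustrative sketch: the rows below are invented, and only the column names `text` and `role` come from the notebook.

```python
import pandas as pd

# Toy stand-in for df_chat_msg; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "message_id": [1, 2, 3, 4],
        "role": ["user", "ai", "user", "user"],
        "text": ["What is X?", "X is ...", "", None],
    }
)

# Same three conditions as above: text present, text non-empty, role is user.
keep = toy.text.notnull() & (toy.text != "") & (toy.role == "user")
toy_filt = toy[keep]
print(toy_filt)
```

Only the first row survives: the second is an AI response, the third has empty text and the fourth has null text.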

In [None]:
df_chat_msg_filt.shape

In [None]:
df_chat_msg_filt.columns

Select the columns of interest: user ID, message ID and text (the question asked by the user).

In [None]:
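# Keep the three columns of interest, ordered by user. Selecting columns directly
# avoids relying on how groupby().apply() names its index levels.
df_msg_user = (
    df_chat_msg_filt[['user_id', 'message_id', 'text']]
    .dropna(subset=['user_id'])
    .sort_values('user_id', kind='stable')
    .reset_index(drop=True)
)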

In [None]:
df_msg_user

Sample the dataset to extract a smaller dataset (to speed up processing).

In [None]:
df_msg_user_sample = df_msg_user.sample(n=141)
df_msg_user_sample.reset_index(inplace=True)
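
The sample above changes on every run, and `sample` raises a `ValueError` if fewer than `n` rows are available. A minimal sketch of a reproducible, size-safe variant follows; the toy frame and the seed value `0` are assumptions for illustration, not choices made in this notebook.

```python
import pandas as pd

# Toy stand-in for df_msg_user with fewer rows than the requested sample size.
toy = pd.DataFrame({"message_id": range(10), "text": [f"q{i}" for i in range(10)]})

sample_size = 141  # the size used in this notebook
n = min(sample_size, len(toy))  # never ask for more rows than exist

# A fixed random_state makes the same rows come back on every run.
sample_a = toy.sample(n=n, random_state=0)
sample_b = toy.sample(n=n, random_state=0)
print(sample_a.equals(sample_b))
```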

Derive the document filename by looking up the other rows that share a message_id with the sampled questions. This is needed because the rows kept by the earlier filter (non-empty question text) have an empty document filename. Most likely, Redbox records the document filename when the user selects a document, not when the user sends a question.

In [None]:
# Create the column up front with object dtype so it can hold filename strings and missing values.
df_msg_user_sample["document_filename"] = None

for i in range(len(df_msg_user_sample)):
    msg_id = df_msg_user_sample.loc[i, "message_id"]
    rows_msg_id = df_chat_msg[df_chat_msg.message_id == msg_id]
    # Rows sharing this message_id that carry a non-empty file name.
    doc_row = rows_msg_id[(rows_msg_id.selected_file_name.notnull()) & (rows_msg_id.selected_file_name != '')]
    if len(rows_msg_id) > 1 and len(doc_row) > 0:
        # Stored as the string form of a list because a message can have multiple documents.
        df_msg_user_sample.loc[i, "document_filename"] = str(doc_row['selected_file_name'].values.tolist())
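
The row-by-row lookup above rescans the full message table for every sampled question. As a hedged alternative, the same filenames can be collected once per message_id and then mapped onto the sample. The toy frames below are invented for illustration. Unlike the loop, this sketch does not require a message_id to have more than one row.

```python
import pandas as pd

# Toy stand-ins for df_chat_msg and df_msg_user_sample; values are invented.
chat = pd.DataFrame(
    {
        "message_id": [1, 1, 1, 2, 3, 3],
        "text": ["Q1", None, None, "Q2", "Q3", None],
        "selected_file_name": [None, "a.pdf", "b.pdf", None, None, ""],
    }
)
sample = pd.DataFrame({"message_id": [1, 2, 3], "text": ["Q1", "Q2", "Q3"]})

# Keep rows with a non-empty file name, then collect the names per message_id.
has_file = chat.selected_file_name.notnull() & (chat.selected_file_name != "")
files_per_msg = chat[has_file].groupby("message_id")["selected_file_name"].agg(list)

# Store the list as a string, as the loop does; unmatched messages become NaN.
sample["document_filename"] = sample["message_id"].map(files_per_msg.map(str))
print(sample)
```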
   

In [None]:
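# .copy() makes this an independent frame, so later in-place edits do not trigger SettingWithCopyWarning.
df_msg_user_doc = df_msg_user_sample[df_msg_user_sample.document_filename.notnull()].copy()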

In [None]:
# Renumber the rows and drop the leftover 'index' column created when the sample was re-indexed.
df_msg_user_doc = df_msg_user_doc.reset_index(drop=True).drop(columns='index')

In [None]:
df_msg_user_doc.head()

In [None]:
df_msg_user_doc.shape

Extract a separate dataset containing the email addresses of users.

In [None]:
df_user_email = query(
    """
    select
        user_id, email
    from
        dbt.redbox__user_details as rb
    """
)

Join the two datasets on user ID to add email as a new column.

In [None]:
df_final = df_msg_user_doc.merge(df_user_email, on='user_id', how='left')
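
A left join silently duplicates rows if the user table has repeated user IDs, and leaves `email` empty for users it cannot find. The sketch below shows one way to surface both cases with pandas' `validate` and `indicator` options. The toy frames and email addresses are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for df_msg_user_doc and df_user_email; values are invented.
msgs = pd.DataFrame({"user_id": [1, 1, 2, 3], "text": ["a", "b", "c", "d"]})
users = pd.DataFrame({"user_id": [1, 2], "email": ["u1@example.com", "u2@example.com"]})

merged = msgs.merge(
    users,
    on="user_id",
    how="left",
    validate="many_to_one",  # raises MergeError if users has duplicate user_id values
    indicator=True,  # adds a _merge column: 'both' or 'left_only'
)

# Rows whose user_id has no match in the user table.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```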

Construct the S3 path by concatenating the email with each document filename.

In [None]:
# document_filename holds the string form of a list, so parse it back before building the paths.
df_final['s3_path'] = df_final.apply(lambda x: str([x['email'] + '/' + y for y in ast.literal_eval(x['document_filename'])]), axis=1)
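
The one-line lambda assumes every row has an email. After a left join that is not guaranteed, and concatenating a missing value with a string raises a `TypeError`. A minimal sketch of the same logic as a named function with that case handled is shown here. `build_s3_paths` is a hypothetical helper, and returning an empty list for a missing email is an assumption about the desired behaviour.

```python
import ast


def build_s3_paths(email, document_filename):
    """Hypothetical helper: prefix each file name with the user's email."""
    if not isinstance(email, str):
        # A left join leaves email as NaN when the user is not found.
        return []
    filenames = ast.literal_eval(document_filename)  # string form of a list -> list
    return [f"{email}/{name}" for name in filenames]


print(build_s3_paths("user@example.com", "['a.pdf', 'b.pdf']"))
```

It could be applied row by row with `df_final.apply(lambda x: str(build_s3_paths(x['email'], x['document_filename'])), axis=1)`.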

In [None]:
df_final.head()

In [None]:
df_final.shape

Save the evaluation dataset as a CSV file. Because the dataset contains user information, make sure it is stored only in approved systems.

In [None]:
# index=False keeps the row numbers out of the file.
df_final.to_csv('evaluation_dataset.csv', index=False)
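
Since `document_filename` and `s3_path` are stored as the string form of a list, whoever reads the CSV back has to parse them again. The sketch below illustrates that round trip with an in-memory buffer standing in for the file. The toy row is invented for illustration.

```python
import ast
import io

import pandas as pd

# Toy stand-in for df_final; the values are invented.
toy = pd.DataFrame({"text": ["Q1"], "s3_path": [str(["user@example.com/a.pdf"])]})

buffer = io.StringIO()  # in-memory stand-in for evaluation_dataset.csv
toy.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
# Turn the stored string back into a Python list.
loaded["s3_path"] = loaded["s3_path"].apply(ast.literal_eval)
print(loaded.loc[0, "s3_path"])
```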