# AICC Scenario - Call Center Data Analysis

## Introduction
Call centers handle customer inquiries and issues, and in doing so generate large volumes of data. This use case walks through analyzing call center data with ThanoSQL, a tool for managing and querying data. We classify call transcripts into categories using large language models (LLMs), then derive key performance metrics (average call duration, satisfaction score, and resolution rate) for each category. These metrics show how customer interactions differ by topic, which helps improve service quality and overall customer satisfaction.

## Running the Tutorial in Different Environments
This tutorial can be executed both within ThanoSQL Lab and in a local Python/Jupyter environment. Whether you prefer to work directly within ThanoSQL Lab's integrated environment or set up a local development environment on your machine, the instructions provided will guide you through the necessary steps.

## Dataset
We will be working with the following datasets:
- **Counseling Staff Information Table (agents)**: Contains details about the call center agents.
  - `AgentID`: Unique identifier for each agent.
  - `AgentName`: Name of the agent.
- **Consultation Call Metadata Table (calls)**: Records metadata for each call.
  - `CallID`: Unique identifier for each call.
  - `AgentID`: Identifier linking the call to the agent.
  - `SatisfactionScore`: Customer satisfaction score for the call.
  - `CallDuration`: Duration of the call.
  - `ResolutionRate`: Rate at which the call issue was resolved.
- **Consultation Call Transcription Table (transcript)**: Contains the text of the conversations.
  - `CallID`: Identifier linking the call to the transcription.
  - `Conversation`: Text of the conversation during the call.
  - `prompt`: Prompt for LLM generation.

## Goals
1. Classify call transcripts into meaningful categories using LLMs.
2. Calculate and analyze the average call time, satisfaction score, and resolution rate for each category.

## Tokens 
To run the models in this tutorial, you will need the following tokens:
- **OpenAI Token**: Required for every step that uses OpenAI as the engine (in this tutorial, the GPT-4o classification query).
- **Hugging Face Token**: Required only for gated models such as Mistral on the Hugging Face platform. Gated models restrict access under licensing or usage policies, so a token is needed to authenticate before using them. For more information, see the [Hugging Face documentation](https://huggingface.co/docs/hub/en/models-gated).

Have these tokens ready before proceeding so the tutorial runs without interruption.
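
One way to keep tokens out of the notebook itself is to read them from environment variables. The sketch below is only an illustration: the helper `read_token` and the variable names `OPENAI_API_KEY` and `HF_TOKEN` are assumptions, not part of ThanoSQL.

```python
import os


def read_token(env_var: str) -> str:
    """Return the token stored in `env_var`, or raise if it is unset or empty."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the tutorial."
        )
    return token


# Hypothetical variable names; use whatever names your environment defines.
# openai_token = read_token("OPENAI_API_KEY")
# hf_token = read_token("HF_TOKEN")
```

The returned strings can then be interpolated into the `token := '...'` argument of the queries later in this tutorial instead of pasting the secret into a cell.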

## Displaying ThanoSQL Query Results in Jupyter Notebooks
The `check_query_result` function displays the result of a query executed through the ThanoSQL client. It prints the error if the query failed, renders the returned records as a DataFrame if there are any, and otherwise prints a success message.

**Note: This function is specifically designed to work in Jupyter notebook environments.**

In [None]:
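from IPython.display import display

def check_query_result(query_result):
    """Print the query error, display the returned records, or confirm success."""
    if query_result.error_result:
        # The query failed: show the error reported by the engine.
        print(query_result.error_result)
    elif query_result.records is not None and len(query_result.records.data) > 0:
        # The query returned rows: render them as a DataFrame.
        display(query_result.records.to_df())
    else:
        # The query succeeded but returned no rows.
        print("Query executed successfully")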
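
The three branches of this function can be exercised without a ThanoSQL connection. The sketch below is an assumption-laden illustration: `describe_query_result` is a hypothetical text-only variant, and the `SimpleNamespace` objects are stand-ins that only mimic the attributes the function reads (`error_result`, `records`, `records.data`); they are not real ThanoSQL result objects.

```python
from types import SimpleNamespace


def describe_query_result(query_result) -> str:
    """Text-only variant of the display helper, usable outside Jupyter."""
    if query_result.error_result:
        return f"error: {query_result.error_result}"
    records = query_result.records
    if records is not None and len(records.data) > 0:
        return f"{len(records.data)} row(s) returned"
    return "Query executed successfully"


# Stand-in results covering the failure, no-rows, and rows-returned cases.
failed = SimpleNamespace(error_result="syntax error", records=None)
no_rows = SimpleNamespace(error_result=None, records=None)
two_rows = SimpleNamespace(
    error_result=None,
    records=SimpleNamespace(data=[{"AgentID": 0}, {"AgentID": 1}]),
)

for result in (failed, no_rows, two_rows):
    print(describe_query_result(result))
```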

## Procedure

### Download Datasets

First, download and extract the datasets.

In [12]:
!wget -O use_case_1_data.zip https://raw.githubusercontent.com/smartmind-team/assets/main/datasets/use_cases/use_case_1_data.zip

--2024-06-11 01:41:58--  https://raw.githubusercontent.com/smartmind-team/assets/main/datasets/use_cases/use_case_1_data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42005 (41K) [application/zip]
Saving to: ‘use_case_1_data.zip’


2024-06-11 01:41:58 (51.8 MB/s) - ‘use_case_1_data.zip’ saved [42005/42005]



In [13]:
!unzip use_case_1_data.zip

Archive:  use_case_1_data.zip
  inflating: agents.csv              
  inflating: calls.csv               
  inflating: transcript.csv          
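
In a local environment where `wget` or `unzip` is not available, the same two steps can be done with the Python standard library. This is a minimal sketch, not part of the original tutorial; the helper names `download` and `extract` are made up for illustration.

```python
import urllib.request
import zipfile
from pathlib import Path

DATA_URL = (
    "https://raw.githubusercontent.com/smartmind-team/assets/main/"
    "datasets/use_cases/use_case_1_data.zip"
)


def download(url: str, dest: Path) -> Path:
    """Download `url` to `dest` unless the file already exists."""
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract(archive: Path, target_dir: Path) -> list:
    """Extract every member of `archive` into `target_dir`; return member names."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
        return zf.namelist()


# Usage (performs a network request, so it is left commented out):
# archive = download(DATA_URL, Path("use_case_1_data.zip"))
# print(extract(archive, Path(".")))
```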


### Import ThanoSQL Library
Import the ThanoSQL library and create a client instance. This client will be used to interact with the ThanoSQL engine.

**You can find your API Token and Engine URL by following these steps:**

1. Go to your workspace’s settings page.
2. Navigate to the "Developer" tab.
3. Locate and copy your API Token and Engine URL.

In [1]:
from thanosql import ThanoSQL
# Replace the placeholders with the API Token and Engine URL from the "Developer" tab.
client = ThanoSQL(api_token="your_api_token", engine_url="your_engine_url")

### Upload Data to Tables
#### Upload the `agents` table which contains details about the call center agents.

In [2]:
table = client.table.upload('agents', 'agents.csv', if_exists='replace')
table.get_records(limit=10).to_df()

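|   | AgentID | AgentName |
|---|---------|-----------|
| 0 | 0 | Sarah |
| 1 | 1 | Rachel |
| 2 | 2 | John |
| 3 | 3 | Michael |
| 4 | 4 | David |
| 5 | 5 | Kevin |
| 6 | 6 | Jason |
| 7 | 7 | Jessica |
| 8 | 8 | Ryan |
| 9 | 9 | Matthew |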


This step uploads the `agents` data to ThanoSQL and retrieves the first 10 records to confirm the upload.

#### Upload the `calls` table which records metadata for each call.

In [3]:
table = client.table.upload('calls', 'calls.csv', if_exists='replace')
table.get_records(limit=10).to_df()

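|   | CallID | AgentID | SatisfactionScore | CallDuration | ResolutionRate |
|---|--------|---------|-------------------|--------------|----------------|
| 0 | 1 | 5 | 5 | 8 | 100 |
| 1 | 2 | 0 | 5 | 10 | 100 |
| 2 | 3 | 10 | 5 | 14 | 80 |
| 3 | 4 | 7 | 5 | 9 | 80 |
| 4 | 5 | 12 | 5 | 14 | 70 |
| 5 | 6 | 17 | 5 | 20 | 100 |
| 6 | 7 | 11 | 5 | 12 | 80 |
| 7 | 8 | 0 | 5 | 8 | 70 |
| 8 | 9 | 3 | 5 | 15 | 86 |
| 9 | 10 | 13 | 5 | 12 | 70 |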


This step uploads the `calls` data to ThanoSQL and retrieves the first 10 records to confirm the upload.

#### Upload the `transcript` table which contains the text of the conversations.

In [4]:
table = client.table.upload('transcript', 'transcript.csv', if_exists='replace')
table.get_records(limit=10).to_df()

|   | CallID | Conversation | prompt |
|---|--------|--------------|--------|
| 0 | 1 | Assistant: Hello! Thank you for reaching out t... | `<s>[INST]You are a classification assistant. Y...` |
| 1 | 2 | Assistant: Hello! Thank you for contacting Dig... | `<s>[INST]You are a classification assistant. Y...` |
| 2 | 3 | Assistant: Hello! Thank you for reaching out t... | `<s>[INST]You are a classification assistant. Y...` |
| 3 | 4 | Assistant: Hello! Thank you for contacting Dig... | `<s>[INST]You are a classification assistant. Y...` |
| 4 | 5 | Assistant: Hello! Thank you for contacting Dig... | `<s>[INST]You are a classification assistant. Y...` |
| 5 | 6 | Assistant: Hello! Thank you for contacting Dig... | `<s>[INST]You are a classification assistant. Y...` |
| 6 | 7 | Assistant: Hello! Thank you for reaching out t... | `<s>[INST]You are a classification assistant. Y...` |
| 7 | 8 | Assistant: Hello! Thank you for contacting Dig... | `<s>[INST]You are a classification assistant. Y...` |
| 8 | 9 | Assistant: Hello! Thank you for reaching out t... | `<s>[INST]You are a classification assistant. Y...` |
| 9 | 10 | Assistant: Hello! Thank you for contacting Dig... | `<s>[INST]You are a classification assistant. Y...` |


This step uploads the `transcript` data to ThanoSQL and retrieves the first 10 records to confirm the upload.
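
The three uploads follow the same pattern, so they can also be written as a loop. The sketch below assumes only the `client.table.upload(name, path, if_exists=...)` call used in the cells above; the helper name `upload_tables` is hypothetical.

```python
def upload_tables(client, names, if_exists="replace"):
    """Upload `<name>.csv` as table `<name>` for each name; return tables by name."""
    tables = {}
    for name in names:
        # Same call as in the cells above, with the file name derived from the table name.
        tables[name] = client.table.upload(name, f"{name}.csv", if_exists=if_exists)
    return tables


# Usage with the client created earlier (left commented out; needs a live engine):
# tables = upload_tables(client, ["agents", "calls", "transcript"])
# tables["calls"].get_records(limit=10).to_df()
```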

### Classify Conversations and Aggregate Metrics

#### Classify conversations using the Mistral LLM and calculate performance metrics.

Using Mistral LLM

In [5]:
query_result = client.query.execute("""
    SELECT thanosql.cleanup_resources();
    
    SELECT 
      c."AgentID", t.category,
      AVG(c."SatisfactionScore") AS avg_satisfaction_score,
      AVG(c."CallDuration") AS avg_call_duration,
      AVG(c."ResolutionRate") AS avg_resolution_rate
    FROM 
      (SELECT "CallID",
          thanosql.generate(
              input := prompt,
              engine := 'huggingface',
              model := 'mistralai/Mistral-7B-Instruct-v0.2',
              token := 'your_token',
              model_args := '{"max_new_tokens": 7}'
          ) AS category
      FROM
          transcript) AS t
    JOIN 
      calls AS c ON t."CallID" = c."CallID"
    GROUP BY 
      c."AgentID", t.category
    ORDER BY
      c."AgentID"
""")
check_query_result(query_result)

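|   | AgentID | category | avg_satisfaction_score | avg_call_duration | avg_resolution_rate |
|---|---------|----------|------------------------|-------------------|---------------------|
| 0 | 0 | ACCOUNT | 5.0 | 4.25 | 100.0 |
| 1 | 0 | Refund | 5.0 | 5.0 | 100.0 |
| 2 | 0 | Shipping | 5.0 | 6.0 | 100.0 |
| 3 | 0 | TECHNOLOGY | 5.0 | 9.0 | 85.0 |
| 4 | 1 | ACCOUNT | 5.0 | 6.0 | 100.0 |
| 5 | 2 | ACCOUNT | 5.0 | 8.0 | 100.0 |
| 6 | 3 | ACCOUNT | 5.0 | 4.333333 | 98.333333 |
| 7 | 3 | Refund | 5.0 | 5.0 | 100.0 |
| 8 | 3 | Shipping | 5.0 | 6.0 | 100.0 |
| 9 | 3 | Technology | 5.0 | 15.0 | 86.0 |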


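This query classifies each call transcript into predefined categories using the Mistral LLM, then calculates and aggregates key performance metrics (average satisfaction score, average call duration, and average resolution rate) for each agent and category. The labels are used exactly as the model generates them: the output contains both `TECHNOLOGY` and `Technology`, and `GROUP BY` treats differently cased labels as distinct categories.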
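
Because the output above mixes label casing, it can help to normalize labels before comparing or re-grouping results on the client side. The function below is a minimal sketch, not part of ThanoSQL; `normalize_category` and the sample strings are assumptions for illustration.

```python
import re


def normalize_category(raw: str) -> str:
    """Uppercase a label and reduce it to letters, digits, and underscores."""
    label = raw.strip().upper()
    # Collapse any run of other characters (spaces, punctuation) into one underscore.
    label = re.sub(r"[^A-Z0-9]+", "_", label)
    return label.strip("_")


# Made-up raw labels, including the two casings seen in the output above.
for raw in ["TECHNOLOGY", "Technology", " Refund.", "cancellation fee"]:
    print(repr(raw), "->", normalize_category(raw))
```

Normalizing to uppercase is an arbitrary choice here; any single canonical form works as long as it is applied before grouping.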


#### Classify conversations using OpenAI GPT-4o and calculate performance metrics.

Using OpenAI GPT-4o

In [7]:
query_result = client.query.execute("""
    SELECT * FROM thanosql.cleanup_resources();
    
    SELECT 
      t.category,
      AVG(c."SatisfactionScore") AS avg_satisfaction_score,
      AVG(c."CallDuration") AS avg_call_duration,
      AVG(c."ResolutionRate") AS avg_resolution_rate
    FROM 
      (SELECT "CallID",
          thanosql.generate(
              input := prompt,
              engine := 'openai',
              model := 'gpt-4o',
              token := 'your_openai_api_key',
              model_args := '{"temperature": 0}'
          ) AS category
      FROM
          transcript) AS t
    JOIN 
      calls AS c ON t."CallID" = c."CallID"
    GROUP BY 
      t.category;
""")
check_query_result(query_result)

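|   | category | avg_satisfaction_score | avg_call_duration | avg_resolution_rate |
|---|----------|------------------------|-------------------|---------------------|
| 0 | CANCELLATION_FEE | 5.0 | 5.5 | 99.333333 |
| 1 | Technology | 4.913043 | 9.130435 | 88.086957 |
| 2 | DELIVERY | 4.727273 | 3.545455 | 92.272727 |
| 3 | Refund | 5.0 | 5.625 | 99.375 |
| 4 | Newsletter | 5.0 | 2.333333 | 100.0 |
| 5 | CONTACT | 4.75 | 4.25 | 93.75 |
| 6 | Invoice | 5.0 | 3.777778 | 98.888889 |
| 7 | Shipping | 5.0 | 6.333333 | 100.0 |
| 8 | Feedback | 4.8 | 3.0 | 95.0 |
| 9 | ACCOUNT | 4.944444 | 4.527778 | 95.694444 |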


This query classifies each call transcript into predefined categories using the OpenAI GPT-4o model, then calculates and aggregates key performance metrics (average satisfaction score, average call duration, and average resolution rate) for each category.
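
To make the join and aggregation concrete, here is a rough pure-Python equivalent of the `JOIN ... GROUP BY ... AVG` part of the query. It is only a sketch: the three call rows copy the first rows of the `calls` table shown earlier, while the category labels and the helper name `aggregate_by_category` are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# First three rows of the `calls` table shown earlier.
calls = [
    {"CallID": 1, "SatisfactionScore": 5, "CallDuration": 8, "ResolutionRate": 100},
    {"CallID": 2, "SatisfactionScore": 5, "CallDuration": 10, "ResolutionRate": 100},
    {"CallID": 3, "SatisfactionScore": 5, "CallDuration": 14, "ResolutionRate": 80},
]
# CallID -> label; these labels are invented, not real model output.
categories = {1: "Refund", 2: "Technology", 3: "Technology"}


def aggregate_by_category(calls, categories):
    """Join calls to labels on CallID, then average each metric per label."""
    groups = defaultdict(list)
    for call in calls:
        label = categories.get(call["CallID"])
        if label is not None:  # inner join: skip calls without a label
            groups[label].append(call)
    return {
        label: {
            "avg_satisfaction_score": mean(r["SatisfactionScore"] for r in rows),
            "avg_call_duration": mean(r["CallDuration"] for r in rows),
            "avg_resolution_rate": mean(r["ResolutionRate"] for r in rows),
        }
        for label, rows in groups.items()
    }


print(aggregate_by_category(calls, categories))
```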


## Conclusion
By following these steps, you can effectively analyze call center data, classify call transcripts into meaningful categories using natural language processing, and derive valuable insights into call durations, customer satisfaction, and resolution rates for each category. These insights can help you understand the performance of your call center agents and identify areas for improvement, ultimately leading to enhanced customer satisfaction and operational efficiency.