# Leveraging an LLM for Customer Experience Tracking


### Problem Statement

One Travel is facing a significant challenge quantitatively measuring customer feedback, especially when trying to capture subjective sentiments expressed in customer reviews. The company needs to answer specific questions, such as how a new boarding process is performing or how customers are reacting to seating changes and other common airline-related changes.

They don't want to use surveys. Surveys are known to introduce several kinds of bias, and when a large organization like an airline makes multiple policy changes, covering all of them takes a long survey, which leads to low completion rates. Surveys also don't let you go back and track a concept that was not asked about at the time, so they are a poor fit for following concepts as they change.

### An Innovative Solution
One Travel instead opts to use a Large Language Model (LLM) to track feedback across the categories it cares about. Rather than asking customers to answer survey questions, the company asks them to write about their trip and explain what is top of mind. The LLM then determines whether the customer mentioned any of the tracked categories and records what the customer's experience was. Let's walk through what this approach could look like.

In [41]:
import os
import random

import numpy as np
import pandas as pd
import seaborn as sns
import tiktoken
from dotenv import load_dotenv
from huggingface_hub import InferenceClient
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_fixed
from tqdm import tqdm

load_dotenv(".env")

OPENAI_KEY = os.environ["OPENAI_KEY"]
HF_TOKEN = os.environ["HF_TOKEN"]
TOGETHER_API_KEY = os.environ["TOGETHER_API_KEY"]


GPT3 = "gpt-3.5-turbo-1106"
GPT4 = "gpt-4-1106-preview"
ZEPHYR = "meta-llama/Llama-2-7b-chat-hf"


def num_tokens_from_string(string: str) -> int:
    encoding = tiktoken.get_encoding("cl100k_base")
    num_tokens = len(encoding.encode(string))
    return num_tokens


@retry(wait=wait_fixed(15), stop=stop_after_attempt(4))
def llm(user_prompt, model, temperature=0.3):
    model_kwargs = {"temperature": temperature}
    user_prompt = user_prompt[:3700]

    client = OpenAI(api_key=OPENAI_KEY)

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_prompt}],
        stream=False,
        **model_kwargs,
    )

    output = response.choices[0].message.content
    return output

In [2]:
CLIENT = OpenAI(
    api_key=TOGETHER_API_KEY,
    base_url="https://api.together.xyz",
)


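def llm(user_prompt, model="mixtral", temperature=0.1):
    """Query Mixtral through the Together endpoint configured in CLIENT.

    This redefines the OpenAI-backed `llm` from the previous cell. The `model`
    argument is accepted so both versions share a call signature, but it is
    ignored here: the model name is fixed in the request below.
    """
    model_kwargs = {"temperature": temperature}
    # Hard cap on prompt length, in characters
    user_prompt = user_prompt[:3700]

    chat_completion = CLIENT.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": user_prompt
            },
        ],
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        # model="NousResearch/Nous-Hermes-2-Mixtral-8x7B-SFT",
        # model='mistralai/Mistral-7B-Instruct-v0.2',
        **model_kwargs
    )
    output = chat_completion.choices[0].message.content
    return output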
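Both `llm` definitions above cut the full prompt at 3,700 characters. For a long review, that cut can remove the end of the prompt, including any trailing `Output:` marker. A minimal sketch of one alternative is to clip only the review text before it is placed into the template. The helper name `build_prompt`, the 3,000-character budget, and the miniature template are assumptions made for illustration, not part of the original notebook.

```python
def build_prompt(template, review, max_review_chars=3000):
    """Format `template` with a review clipped to `max_review_chars` characters.

    `max_review_chars` is an assumed budget, chosen so that the instructions
    plus the clipped review stay under the 3,700-character cap used by `llm`.
    """
    if len(review) > max_review_chars:
        # Clip the review itself and mark the cut, leaving the template intact
        review = review[:max_review_chars] + " [...]"
    return template(review)


# Hypothetical miniature template with the same shape as a tagging prompt
demo_template = lambda review: f"Tag this review.\n\nReview: {review}\nOutput: "

prompt = build_prompt(demo_template, "The seat was cramped. " * 500)
print(len(prompt), prompt.endswith("Output: "))
```

With this approach the instructions and the final `Output:` cue always survive, and only the review body is shortened.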

### Step 1: Choose Categories
The business stakeholders choose the categories they would like to track. The list can be changed and extended arbitrarily as time goes on, which makes this solution very flexible.

In [3]:
categories = [
    "seat_comfort",
    "cabin_staff_service",
    "food_and_beverages",
    "inflight_entertainment",
    "ground_service",
    "wifi_and_connectivity",
    "value_for_money",
]

### Step 2: Load in Some Comments
Next we load in reviews from some airline passengers. This data can come from wherever your organization stores its data. In this case we just load from a file that we have on hand.

In [4]:
df = pd.read_csv('predictions.csv')

In [5]:
obs = df.sample(5)

In [6]:
obs[['review', 'category', 'sentiment']]

| | review | category | sentiment |
|---|---|---|---|
| 161 | Not Verified \| San Francisco to New York. Firs... | seat_comfort | Negative |
| 621 | ✅ Trip Verified \| Puerto Vallarta to Houston.... | cabin_staff_service | Negative |
| 1975 | ✅ Trip Verified \| Newark to Austin. The worst... | inflight_entertainment | Neutral |
| 1875 | ✅ Verified Review \| Chicago O'Hare to Denver.... | inflight_entertainment | Negative |
| 476 | My husband booked our flight out of Gulfport M... | seat_comfort | Negative |


### Step 3: Tag the Comments
Now that we have the reviews loaded we can ask the LLM to review the information. Here's what that looks like. Notice the advantage that we can handle this problem with plain text. With a small amount of training, any non-technical stakeholder could easily extend this system to handle slightly different tasks.

In [7]:
tag_prompt_template = lambda review: f"""Here's a customer review for an experience they had on an airline.
For each of the following categories decide if the customer's sentiment is Positive, Negative, or Neutral.
If a category is not mentioned return "N/A".
The intended airline is Untied Airlines.

Return using ONLY the following output schema.
- seat_comfort: <sentiment>
- cabin_staff_service: <sentiment>
- food_and_beverages: <sentiment>
- inflight_entertainment: <sentiment>
- ground_service: <sentiment>
- wifi_and_connectivity: <sentiment>
- value_for_money: <sentiment>

Return only in the above format.

Review: {review}
Output: """

In [8]:
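def parse_output(llm_output, categories):
    """Parse "- category: sentiment" lines into a dict of predictions.

    Lines are matched to `categories` by position, so the LLM must return
    them in the same order as the schema in the prompt.
    """
    output = {}
    # Skip blank lines and drop the leading "- " bullet from each line
    category_reviews = [x.lstrip("- ") for x in llm_output.split("\n") if x.strip()]
    for category_name, review in zip(categories, category_reviews):
        # Remove the category label and any trailing "(explanation)" the model adds
        rating = review.replace(f"{category_name}: ", '').split('(')[0].strip()
        output[f"{category_name} pred"] = rating

    return output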

In [9]:
tmp = df.review.sample(1).iloc[0]
prompt = tag_prompt_template(tmp)
output = llm(prompt)
parse_output(output, categories)

{'seat_comfort pred': 'N/A',
 'cabin_staff_service pred': 'Negative',
 'food_and_beverages pred': 'N/A',
 'inflight_entertainment pred': 'N/A',
 'ground_service pred': 'Negative',
 'wifi_and_connectivity pred': 'N/A',
 'value_for_money pred': 'N/A'}
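`parse_output` pairs lines with categories by position, so an extra line of commentary or a reordered answer would shift every label. As a hedged alternative, the sketch below matches each line by its category name and keeps only labels from an allowed set. The function name `parse_output_by_key`, the `ALLOWED` set, and the sample string are assumptions introduced for illustration.

```python
ALLOWED = {"Positive", "Negative", "Neutral", "N/A"}


def parse_output_by_key(llm_output, categories, allowed=ALLOWED):
    """Map each known category to its label; unknown or missing labels become None."""
    preds = {f"{c} pred": None for c in categories}
    for line in llm_output.splitlines():
        # Strip the bullet, then split "category: label" at the first colon
        key, sep, value = line.strip().lstrip("-").strip().partition(":")
        key = key.strip()
        if not sep or key not in categories:
            continue  # ignore commentary and categories we are not tracking
        # Drop any trailing "(explanation)" and surrounding whitespace
        label = value.split("(")[0].strip()
        preds[f"{key} pred"] = label if label in allowed else None
    return preds


# Hypothetical LLM output: out of order, with a stray line and an invalid label
sample = (
    "- cabin_staff_service: Negative (rude crew)\n"
    "- seat_comfort:  N/A\n"
    "Some stray commentary\n"
    "- value_for_money: Mixed"
)
result = parse_output_by_key(
    sample, ["seat_comfort", "cabin_staff_service", "value_for_money"]
)
print(result)
```

Returning `None` for anything outside the allowed set makes malformed answers easy to count and re-run, instead of letting them flow into the accuracy numbers as unexpected strings.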

### Step 4: Report on Changes in Comments

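Prediction-powered inference (PPI) is a good fit for this reporting step. It combines a small set of human-labeled reviews with a much larger set of LLM-labeled reviews: the LLM-only estimate of a quantity such as the share of positive reviews is corrected by the average prediction error measured on the labeled set, and the resulting confidence interval accounts for both sources of uncertainty. The cell below compares the PPI interval with a classical interval built from the labeled reviews alone and with an interval that treats the LLM predictions as if they were true labels.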

In [53]:
from scipy.optimize import brentq
from scipy.stats import binom, norm


def binomial_iid(N, alpha, muhat):
    def invert_upper_tail(mu):
        return binom.cdf(N * muhat, N, mu) - (alpha / 2)

    def invert_lower_tail(mu):
        return binom.cdf(N * muhat, N, mu) - (1 - alpha / 2)

    u = brentq(invert_upper_tail, 0, 1)
    l = brentq(invert_lower_tail, 0, 1)
    return np.array([l, u])


def pp_mean_iid_asymptotic(Y_labeled, Yhat_labeled, Yhat_unlabeled, alpha):
    n = Y_labeled.shape[0]
    N = Yhat_unlabeled.shape[0]
    tildethetaf = Yhat_unlabeled.mean()
    rechat = (Yhat_labeled - Y_labeled).mean()
    thetahatPP = tildethetaf - rechat
    sigmaftilde = np.std(Yhat_unlabeled)
    sigmarec = np.std(Yhat_labeled - Y_labeled)
    hw = norm.ppf(1 - alpha / 2) * np.sqrt((sigmaftilde**2 / N) + (sigmarec**2 / n))
    return [thetahatPP - hw, thetahatPP + hw]


def calculate_ppi(Y_labeled, Yhat_labeled, Yhat_unlabeled, alpha, num_trials=100):
    n_max = Y_labeled.shape[0]
    # Labeled-sample sizes to try; start at 1 so no trial draws an empty sample
    ns = np.linspace(1, n_max, 20).astype(int)

    # Imputed-only estimate: treat every model prediction as if it were a true label
    imputed_estimate = (Yhat_labeled.sum() + Yhat_unlabeled.sum()) / (
        Yhat_labeled.shape[0] + Yhat_unlabeled.shape[0]
    )

    # Run prediction-powered inference and classical inference for many values of n
    ci = np.zeros((num_trials, ns.shape[0], 2))
    ci_classical = np.zeros((num_trials, ns.shape[0], 2))
    for i in tqdm(range(ns.shape[0])):
        for j in range(num_trials):
            # Draw a random subsample of n labeled reviews
            n = ns[i]
            rand_idx = np.random.permutation(n_max)
            f = Yhat_labeled.astype(float)[rand_idx[:n]]
            y = Y_labeled.astype(float)[rand_idx[:n]]
            # Prediction-Powered Inference
            ci[j, i, :] = pp_mean_iid_asymptotic(y, f, Yhat_unlabeled, alpha)
            # Classical interval; the entry stays at zero if the root finder fails
            try:
                ci_classical[j, i, :] = binomial_iid(n, alpha, y.mean())
            except Exception:
                pass

    # Report the trial-averaged intervals at the largest sample size (n = n_max)
    avg_ci = ci.mean(axis=0)[-1]
    avg_ci_classical = ci_classical.mean(axis=0)[-1]

    try:
        ci_imputed = binomial_iid(Yhat_unlabeled.shape[0], alpha, imputed_estimate)
    except Exception:
        ci_imputed = None

    return avg_ci, avg_ci_classical, ci_imputed
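
For reference, the interval returned by `pp_mean_iid_asymptotic` above can be written out as follows. The notation is introduced here for illustration and does not appear in the notebook: $Y_j$ are the $n$ human labels, $f(X_j)$ are the model predictions on those same reviews, and $f(\tilde X_i)$ are the model predictions on the $N$ unlabeled reviews.

$$
\hat\theta^{PP} = \frac{1}{N}\sum_{i=1}^{N} f(\tilde X_i) \;-\; \frac{1}{n}\sum_{j=1}^{n}\bigl(f(X_j) - Y_j\bigr),
\qquad
\hat\theta^{PP} \pm z_{1-\alpha/2}\sqrt{\frac{\hat\sigma_{\tilde f}^{2}}{N} + \frac{\hat\sigma_{\Delta}^{2}}{n}}
$$

Here $\hat\sigma_{\tilde f}$ is the standard deviation of the predictions on the unlabeled reviews (`sigmaftilde` in the code) and $\hat\sigma_{\Delta}$ is the standard deviation of the prediction errors $f(X_j) - Y_j$ on the labeled reviews (`sigmarec`). When the model's predictions track the human labels closely, $\hat\sigma_{\Delta}$ is small and the width of the interval is driven mostly by the large unlabeled sample.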

In [122]:
c = categories[1]
model = "mistralai/Mixtral-8x7B-Instruct-v0.1_pred"
# Copy the filtered rows so the assignments below modify an independent frame
category_df = df[(df.category == c) & (~df[model].isna()) & (~df.sentiment.isna())].copy()
# Encode labels as 1/0; any other value (e.g. Neutral) becomes NaN
category_df['sentiment'] = category_df.sentiment.map({'Positive': 1, 'Negative': 0})
category_df[model] = category_df[model].map({'Positive': 1, 'Negative': 0})


In [123]:
category_df = category_df[~category_df.sentiment.isna()]

In [124]:
observed = category_df.sample(frac=0.1)
y_labeled = observed['sentiment'].to_numpy()
y_hat_labeled = observed[model].to_numpy()
y_hat_unlabeled = category_df[~category_df.index.isin(observed.index)][model].to_numpy()

In [125]:
calculate_ppi(y_labeled, y_hat_labeled, y_hat_unlabeled, 0.05, num_trials=1000)


(array([nan, nan]), array([0.1303768 , 0.39326154]), None)
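
The prediction-powered interval above came back as `nan`. One likely cause, given the cells above, is that `.map({'Positive': 1, 'Negative': 0})` turns any other prediction (such as Neutral or N/A) into `NaN`, and only the `sentiment` column is filtered afterwards, so missing values remain in the model column. The sketch below uses synthetic data, not the airline reviews, to show how a few missing predictions propagate into the interval and how dropping those rows restores a finite result. `ppi_mean_interval` is a hypothetical helper that restates the same asymptotic formula, and every number in it is made up for illustration.

```python
import numpy as np
from scipy.stats import norm


def ppi_mean_interval(y_labeled, yhat_labeled, yhat_unlabeled, alpha=0.05):
    """Asymptotic prediction-powered interval for a mean."""
    n, N = len(y_labeled), len(yhat_unlabeled)
    rectifier = yhat_labeled - y_labeled  # prediction error on labeled rows
    theta = yhat_unlabeled.mean() - rectifier.mean()
    half_width = norm.ppf(1 - alpha / 2) * np.sqrt(
        yhat_unlabeled.var() / N + rectifier.var() / n
    )
    return theta - half_width, theta + half_width


rng = np.random.default_rng(0)
# Synthetic stand-in labels: 1 = Positive, 0 = Negative
y = rng.binomial(1, 0.3, size=2000).astype(float)
# Simulated predictions that disagree with the label about 20% of the time
flip = rng.random(2000) < 0.2
yhat = np.where(flip, 1 - y, y)
# Every 40th prediction is missing, as when a label falls outside the mapping
yhat[::40] = np.nan

labeled = np.arange(2000) < 200  # pretend the first 200 rows were hand-labeled

# NaN predictions propagate through every mean and variance
raw_interval = ppi_mean_interval(y[labeled], yhat[labeled], yhat[~labeled])

# Dropping rows with a missing prediction restores a finite interval
keep = ~np.isnan(yhat)
clean_interval = ppi_mean_interval(
    y[labeled & keep], yhat[labeled & keep], yhat[~labeled & keep]
)
print(raw_interval, clean_interval)
```

In the notebook, the equivalent step would be to drop rows where either the `sentiment` column or the model column is missing before splitting into labeled and unlabeled sets, for example with `category_df.dropna(subset=['sentiment', model])`.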

### Inference

### Quick Accuracy Check

In [117]:
def score_category_accuracies(model):
    for category_name in categories:
        tmp_df = df[
            (~df[model].isna())
            & (~df['sentiment'].isna())
            & (df["category"] == category_name)
        ]
        acc = tmp_df[tmp_df[model] == tmp_df['sentiment']].shape[0] / tmp_df.shape[0]
        print(f'{category_name} Accuracy: {round(acc,4)}  (n={tmp_df.shape[0]})')

In [118]:
model = "mistralai/Mixtral-8x7B-Instruct-v0.1_pred"
score_category_accuracies(model)

seat_comfort Accuracy: 0.62  (n=150)
cabin_staff_service Accuracy: 0.7273  (n=396)
food_and_beverages Accuracy: 0.6818  (n=154)
inflight_entertainment Accuracy: 0.6796  (n=103)
ground_service Accuracy: 0.7208  (n=394)
wifi_and_connectivity Accuracy: 0.4643  (n=56)
value_for_money Accuracy: 0.8792  (n=389)


In [119]:
model = "gpt-3.5-turbo-1106_pred"
score_category_accuracies(model)

seat_comfort Accuracy: 0.8447  (n=309)
cabin_staff_service Accuracy: 0.8223  (n=484)
food_and_beverages Accuracy: 0.7533  (n=150)
inflight_entertainment Accuracy: 0.7812  (n=96)
ground_service Accuracy: 0.7809  (n=429)
wifi_and_connectivity Accuracy: 0.7222  (n=36)
value_for_money Accuracy: 0.9071  (n=366)


In [120]:
model = "gpt-4-1106-preview_pred"
score_category_accuracies(model)

seat_comfort Accuracy: 0.7736  (n=212)
cabin_staff_service Accuracy: 0.8245  (n=416)
food_and_beverages Accuracy: 0.7534  (n=146)
inflight_entertainment Accuracy: 0.8172  (n=93)
ground_service Accuracy: 0.788  (n=401)
wifi_and_connectivity Accuracy: 0.7179  (n=39)
value_for_money Accuracy: 0.9106  (n=425)


In [121]:
model = 'small_model_predictions'
score_category_accuracies(model)

seat_comfort Accuracy: 0.798  (n=500)
cabin_staff_service Accuracy: 0.718  (n=500)
food_and_beverages Accuracy: 0.6124  (n=498)
inflight_entertainment Accuracy: 0.501  (n=497)
ground_service Accuracy: 0.674  (n=497)
wifi_and_connectivity Accuracy: 0.497  (n=497)
value_for_money Accuracy: 0.841  (n=497)
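
The sample sizes behind these accuracies vary widely, from n=36 to n=500, and they differ between models for the same category, so small gaps between models should be read with care. As a rough aid, the sketch below attaches a 95% Wilson score interval to a few of the printed accuracies. The helper `wilson_interval` is an assumption added for illustration; the accuracy and n values are copied from the output above.

```python
import math


def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a proportion observed on n examples."""
    denom = 1 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# (label, accuracy, n) copied from the printed results above
rows = [
    ("Mixtral wifi_and_connectivity", 0.4643, 56),
    ("gpt-3.5 wifi_and_connectivity", 0.7222, 36),
    ("gpt-3.5 value_for_money", 0.9071, 366),
]
for name, acc, n in rows:
    lo, hi = wilson_interval(acc, n)
    print(f"{name}: {acc:.3f} (95% CI {lo:.3f} to {hi:.3f}, n={n})")
```

The interval for a category scored on a few dozen reviews is much wider than one scored on several hundred, which is worth keeping in mind before ranking models on the smaller categories.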


### Step 5: Flexibly Change the Output Classifications

In [16]:
tag_prompt_template = lambda review: f"""Here's a customer review for an experience they had on an airline.
For each of the following categories decide if the customer's sentiment is Positive, Negative, or Neutral.
If a category is not mentioned return "N/A".
The intended airline is Untied Airlines.

Return using ONLY the following output schema.
- seat_comfort: <sentiment>
- cabin_staff_service: <sentiment>
- food_and_beverages: <sentiment>
- inflight_entertainment: <sentiment>
- ground_service: <sentiment>
- wifi_and_connectivity: <sentiment>
- value_for_money: <sentiment>

Return only in the above format.

Review: {review}
Output: """

In [42]:
tmp = df.review.sample(1).iloc[0]
prompt = tag_prompt_template(tmp)
output = llm(prompt, GPT3)
parse_output(output, categories)

{'seat_comfort pred': 'N/A',
 'cabin_staff_service pred': 'Negative',
 'food_and_beverages pred': 'N/A',
 'inflight_entertainment pred': 'N/A',
 'ground_service pred': 'Negative',
 'wifi_and_connectivity pred': 'N/A',
 'value_for_money pred': 'Negative'}
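
Because the categories and labels live in plain text, changing them only requires regenerating the prompt. The sketch below builds the tagging prompt from a list of categories and a tuple of labels, so a new category or a different label scale needs no edits to the prompt string. `build_tag_prompt`, the `boarding_process` category, and the five-point label scale are hypothetical and introduced only for illustration; the line naming the airline is left out of this version for brevity.

```python
def build_tag_prompt(review, categories, labels=("Positive", "Negative", "Neutral")):
    """Build the tagging prompt from a category list and at least two labels."""
    # "A, B, or C" style list of the allowed labels
    label_text = ", ".join(labels[:-1]) + f", or {labels[-1]}"
    # One schema line per tracked category
    schema = "\n".join(f"- {c}: <sentiment>" for c in categories)
    return (
        "Here's a customer review for an experience they had on an airline.\n"
        "For each of the following categories decide if the customer's "
        f"sentiment is {label_text}.\n"
        'If a category is not mentioned return "N/A".\n\n'
        "Return using ONLY the following output schema.\n"
        f"{schema}\n\n"
        "Return only in the above format.\n\n"
        f"Review: {review}\n"
        "Output: "
    )


# Hypothetical extension: one extra category and a five-point label scale
new_categories = ["seat_comfort", "boarding_process"]
new_labels = ("Very Positive", "Positive", "Neutral", "Negative", "Very Negative")
print(build_tag_prompt("Boarding took forever.", new_categories, new_labels))
```

Any parser that consumes the output would need the same category list and label set passed in, so the two stay in sync.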

### Step 6: Understand the Problems
- We may wish to understand the problems the customer is facing
- We can accomplish that by mining the reviews where the customer is unhappy
- More efficient methods exist for this, but the LLM gives us a flexible exploratory tool

In [46]:
problem_extraction_prompt_template = lambda review, category, sentiment: f"""This customer had a {sentiment} experience regarding {category}.
In less than 10 words describe the problem related to {category}.

Review:
{review}

Short Review:"""

problem_summarization_prompt_template = lambda problem_str, sentiment, category: f"""Here are some statements from customers that visited our airline.
Please break down the main themes that summarize what was {sentiment} about their experience with {category}.
Format this as a report in markdown for an expert that needs to make a decision about how to improve customer experience.

Problems:
{problem_str}

Report:"""

In [47]:
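def draft_report(df, category, sentiment, sample_size=40):
    # Select reviews with the requested category and sentiment
    matches = df[(df['category'] == category) & (df['sentiment'] == sentiment)].review
    # Sample at most as many reviews as are available
    reviews = matches.sample(min(sample_size, len(matches))).tolist()
    problem_prompts = [problem_extraction_prompt_template(r, category, sentiment) for r in reviews]

    # First pass: one short problem statement per review
    problems = [llm(p, GPT3) for p in tqdm(problem_prompts, desc='Extracting problems')]
    problem_str = '\n'.join(problems)
    # Second pass: summarize the problem statements into a single report
    report_prompt = problem_summarization_prompt_template(problem_str, sentiment, category)
    report = llm(report_prompt, GPT3)
    return report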
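
The summarization prompt in `draft_report` passes through `llm`, which cuts prompts at 3,700 characters, so with a larger `sample_size` the last problem statements could be dropped without warning. One possible workaround, sketched below, is to group the statements into chunks that each fit a character budget; each chunk could then be summarized separately and the partial summaries combined. The helper `chunk_lines` and the 3,000-character budget are assumptions, not part of the original notebook.

```python
def chunk_lines(lines, max_chars=3000):
    """Group lines into newline-joined chunks of at most `max_chars` characters."""
    chunks, current, size = [], [], 0
    for line in lines:
        line = line[:max_chars]  # clip a single oversized line rather than drop it
        extra = len(line) + (1 if current else 0)  # +1 for the joining newline
        if current and size + extra > max_chars:
            # The current chunk is full: close it and start a new one
            chunks.append("\n".join(current))
            current, size = [], 0
            extra = len(line)
        current.append(line)
        size += extra
    if current:
        chunks.append("\n".join(current))
    return chunks


# Hypothetical problem statements standing in for the LLM's first-pass output
problems = [f"Problem {i}: crew was rude and dismissive" for i in range(200)]
chunks = chunk_lines(problems, max_chars=500)
print(len(chunks), max(len(c) for c in chunks))
```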

In [48]:
report = draft_report(df, 'cabin_staff_service', 'Negative')

Extracting problems: 100%|██████████| 40/40 [01:00<00:00,  1.52s/it]


In [49]:
report

'After analyzing customer feedback, it is clear that the main themes regarding negative experiences with cabin staff service are unprofessional behavior, rude and dismissive attitude, poor customer service, lack of assistance and empathy, and inadequate communication. Customers have also mentioned issues with delayed flights, uncomfortable seats, lack of amenities, poor food quality, and cramped seating. There are also concerns about unorganized boarding processes, overbooking, split seats, and mishandling of luggage. Additionally, there are complaints about the handling of flight cancellations, lack of compensation, and unhelpful policies. It is evident that there is a need for improvement in the behavior and professionalism of the cabin staff, as well as in the overall customer service and communication. Measures should also be taken to address issues related to flight delays, seating arrangements, luggage handling, and the overall comfort and amenities provided to passengers.'