In [1]:
import pandas as pd

In [110]:
df = pd.read_excel("../CODES 20250616T141033_raw_data_20240601_20250616.xlsx", sheet_name="belgian_filtered")

## User Confirmation Rates

### Attempt-level confirmation rate

The percentage of individual attempts that were confirmed by the user. Each row in the dataset is treated as a separate attempt.

In [53]:
confirmed_attempts = (df['IsUserConfirmed'] == True).sum()
total = df['IsUserConfirmed'].count()
attempt_conf_rate = confirmed_attempts / total * 100
attempt_conf_rate

33.16831683168317

Limitations of this metric:

- Users who make many repeated attempts can skew the metric.
- It doesn't tell you whether the same question was eventually confirmed later.
- It mixes model performance with user persistence.
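The skew from repeated attempts can be made concrete with a small sketch. The data below is invented for illustration (only the column names come from the dataset): one session needs four attempts before confirming, while another confirms immediately.

```python
import pandas as pd

# Invented toy data; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "SurveySessionID": ["s1", "s1", "s1", "s1", "s2"],
    "QuestionID": ["q1", "q1", "q1", "q1", "q1"],
    "IsUserConfirmed": [False, False, False, True, True],
})

# Attempt level: every row counts, so the persistent session dominates.
toy_attempt_rate = toy["IsUserConfirmed"].sum() / len(toy) * 100

# Question level: one value per (session, question) pair,
# True if any attempt for that pair was confirmed.
toy_question_rate = (
    toy.groupby(["SurveySessionID", "QuestionID"])["IsUserConfirmed"]
       .any()
       .mean() * 100
)

print(toy_attempt_rate, toy_question_rate)
```

Both toy questions end in a confirmation, yet the attempt-level rate is only 40% because three of the five rows are rejected retries from the same session.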

### Question-level confirmation rate

The percentage of unique question instances (one per session and question) where the user eventually confirmed the model's output, even if it took multiple attempts. Rows are grouped by `SurveySessionID` and `QuestionID`, creating unique user-question pairs. For example, User 1 - Q1 and User 1 - Q2 are treated as separate cases.

In [28]:
question_level = (
    df.groupby(['SurveySessionID', 'QuestionID'])['IsUserConfirmed']
      .max()  # True if any attempt was confirmed
)

question_conf_rate = (question_level == True).sum() / question_level.count() * 100
question_conf_rate

80.48780487804879

In [26]:
question_level.describe()

count       82
unique       2
top       True
freq        66
Name: IsUserConfirmed, dtype: object

### User-level confirmation rate

The percentage of users who confirmed at least one model suggestion during their session (single or multiple questions answered).

In [124]:
user_level = (
    df.groupby('SurveySessionID')['IsUserConfirmed']
      .max()  # True if any attempt for any question was confirmed
)

user_conf_rate = (user_level == True).sum() / user_level.count() * 100
user_conf_rate

84.21052631578947
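The attempt-, question- and user-level rates differ only in what counts as one unit, that is, in the grouping keys. A minimal sketch of a shared helper follows; `confirmation_rate` is a hypothetical name, not part of the notebook, and the toy values are invented.

```python
import pandas as pd

def confirmation_rate(data, keys=None, col="IsUserConfirmed"):
    """Percentage of confirmed units; `keys` defines what one unit is."""
    valid = data.dropna(subset=[col])  # mirrors .count(), which skips NaN
    flags = valid[col].astype(bool)
    if keys is not None:
        # A unit counts as confirmed if any of its attempts was confirmed.
        flags = flags.groupby([valid[k] for k in keys]).any()
    return flags.mean() * 100

# Invented toy data; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "SurveySessionID": ["s1", "s1", "s1", "s2"],
    "QuestionID": ["q1", "q1", "q2", "q1"],
    "IsUserConfirmed": [False, True, False, True],
})

attempt_level = confirmation_rate(toy)
question_level_rate = confirmation_rate(toy, ["SurveySessionID", "QuestionID"])
user_level_rate = confirmation_rate(toy, ["SurveySessionID"])

print(attempt_level, question_level_rate, user_level_rate)
```

On the real data, the same three calls with `df` in place of `toy` should reproduce the rates computed cell by cell above, assuming `IsUserConfirmed` holds only booleans and missing values.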

### First-attempt confirmation rate

The percentage of first attempts per question and user that were confirmed by the user.

In [86]:
first_attempts = df[df['AttemptID'] == 1]
confirmed_first_attempts = first_attempts['IsUserConfirmed'].sum()
total = first_attempts['IsUserConfirmed'].count()

first_attempt_conf_rate = confirmed_first_attempts / total * 100
first_attempt_conf_rate

51.21951219512195
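The filter `AttemptID == 1` assumes that attempts for every question are numbered from 1. Nothing in the data suggests otherwise, but the assumption is cheap to check: the sketch below, on invented data, selects the lowest-numbered attempt per session and question and compares it with the plain filter.

```python
import pandas as pd

# Invented toy data; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "SurveySessionID": ["s1", "s1", "s2", "s2"],
    "QuestionID": ["q1", "q1", "q1", "q1"],
    "AttemptID": [1, 2, 2, 3],  # s2 has no row numbered 1
    "IsUserConfirmed": [True, False, False, True],
})

# Keep the lowest-numbered attempt of each (session, question) pair.
first_rows = (
    toy.sort_values("AttemptID")
       .drop_duplicates(subset=["SurveySessionID", "QuestionID"], keep="first")
)

# The plain filter silently drops pairs whose numbering does not start at 1.
by_filter = toy[toy["AttemptID"] == 1]

print(len(by_filter), len(first_rows))
```

If the two lengths match on the real data, the `AttemptID == 1` filter and the per-pair minimum select the same rows.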

## AI Model Accuracy

### AI main category classification accuracy (based on the first attempts and after dropping incorrect user responses)

First attempts were used to minimise the bias from the uneven number of attempts per case.

In [88]:
collapsed_data = df[
    (df['AttemptID'] == 1) &
    (df['MainCategoryCorrect'] != 'UserInputIncorrect')
]

correct_n = (collapsed_data['MainCategoryCorrect'] == 'Yes').sum()
total_n = (collapsed_data['MainCategoryCorrect']).count()

main_accuracy = correct_n / total_n * 100
main_accuracy

92.7536231884058

In [89]:
collapsed_data['MainCategoryCorrect'].value_counts()

MainCategoryCorrect
Yes                         64
Not sure/Old label used      2
Not sure/"Unknown" label     1
Not sure/But seems close     1
No/But was correct later     1
Name: count, dtype: int64

Not sure/Old label used -> 2 cases had a main label that is no longer in use, so it might not be clear whether it was the best option (though it still seems reasonable)

Not sure/"Unknown" label -> user input was: create less trafic, model classification was: Unknown / Other

Not sure/But seems close ->   user input was: create sustainable migration channels
                              create sustainable and fair migration systems
                              improve migration flows, 
                              model classification was: cohesion between people

No/But was correct later ->   user input was: save the planet
                              model classification was: safety (the safety of air travel or the environment)

## User Confirmation Rates Revisited

Incorrect user responses were those that either didn't make much sense or involved multiple ideas.

### Question-level user confirmation rate (after dropping incorrect user responses)

In [106]:
correct_data = df[df['MainCategoryCorrect'] != 'UserInputIncorrect']

In [123]:
question_level = (
    correct_data.groupby(['SurveySessionID', 'QuestionID'])['IsUserConfirmed']
      .max()  # True if any attempt was confirmed
)
      
confirmed_n = question_level.sum()
total = question_level.count()

question_conf_rate = confirmed_n / total * 100
question_conf_rate

83.09859154929578

### First-attempt confirmation rate (after dropping incorrect user responses)

In [121]:
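# Assumes UserInputCorrect flags the same rows as
# MainCategoryCorrect != 'UserInputIncorrect' used in the previous cell.
first_attempts = df[(df['UserInputCorrect'] == True) & (df['AttemptID'] == 1)]

confirmed_n = first_attempts['IsUserConfirmed'].sum()
total = first_attempts['IsUserConfirmed'].count()

# Dedicated name so the question-level rate above is not overwritten
first_attempt_conf_rate = confirmed_n / total * 100
first_attempt_conf_rate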

52.112676056338024

### Failure reason

In [114]:
filtered_data = df[
    (df['AttemptID'] == 1)
]
filtered_data['FailureReason'].value_counts()

FailureReason
AI Correct/Dropped           3
AI Correct/Subclass Issue    3
AI Correct/Ingenuine User    2
Unclear/Not nuanced          2
AI Correct/Not nuanced       1
AI Correct/Multiple Ideas    1
Name: count, dtype: int64

<b>AI Correct/Dropped</b> -> 3 cases where the AI model returned the closest matching category, at least for the main classification, but the user did not confirm and did not complete at least 3 attempts [1]

<b>AI Correct/Subclass Issue</b> -> 3 cases where the model did not provide a reasonable subclassification despite a correct main classification [2]

<b>AI Correct/Ingenuine User</b> -> 1. user input was: sushi, model classification was: Unknown/Other 
                                    2. user input was: we need a phone, model classification was: Unknown/Other 

<b>Unclear/Not nuanced</b> -> 1. user input was: create less trafic, model classification was: Unknown / Other [3]
                              2. user input was: create sustainable migration channels
                              create sustainable and fair migration systems
                              improve migration flows, 
                              model classification was: cohesion between people  [4]

<b>AI Correct/Not nuanced</b> -> The classification seemed relatively good to me, but the user may not have liked it
                                 user input was: 
                                 i think they need to get their shit together and listen to what their citizens want
                                 they need to listen to their citizens and help everyone live a good life
                                 model classification was:
                                 standards of living (improving access to something	citizens' needs/improving the quality of something	quality of life)

<b>AI Correct/Multiple Ideas</b> -> user input was:
                                    A lot of communication and to be kind to everyone, we also can't forget about nature and te climat
                                    model classification was:
                                    environment (protecting something	nature/nature and climate/the environment) [5]

1. The first 3 cases show that users may simply drop out halfway through the survey, and it is hard to assess why they did not complete it.
2. At least three cases had an issue with subclassification (there were also such issues in cases where the user eventually confirmed).
3. One case suggests that a user might not want to confirm when presented with Unknown/Other, which also means there was no right category to match.
4. In this case, subclassification was not available, which might have improved the outcome. Still, "cohesion between people" seems quite far from the idea of migration, but it is probably the closest match (human rights could also work).
5. One case that wasn't confirmed contained two ideas (so the user may not have liked that the first idea was disregarded), and the subclassification also sounded redundant.