# Blameworthiness

## 1. Load Data 

In [1]:
import numpy as np
import pandas as pd

In [3]:
# Load the per-model prediction CSV files
df_cnn = pd.read_csv('./results_cnn.csv')
df_resnet = pd.read_csv('./results_resnet.csv')

In [6]:
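# Extract the predicted pneumonia probabilities.
# The CNN plays the role of the AI; the ResNet stands in for the human.
cnn_probs = df_cnn['Prob_1_PNEUMONIA']
resnet_probs = df_resnet['Prob_1_PNEUMONIA']
# Ground-truth labels are taken from the CNN file (both files are assumed to list the same cases in the same order)
ground_truth = df_cnn['Ground Truth']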
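
Because the labels are read from only one of the two files, it may be worth checking that both files describe the same cases in the same order. The following is a minimal sketch on toy data; it assumes `results_resnet.csv` also carries a `Ground Truth` column, which the notebook does not confirm, and `same_labels` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-ins for the two result files (values are illustrative, not real data)
df_cnn_toy = pd.DataFrame({"Ground Truth": [1, 0, 0], "Prob_1_PNEUMONIA": [0.9, 0.2, 0.6]})
df_resnet_toy = pd.DataFrame({"Ground Truth": [1, 0, 0], "Prob_1_PNEUMONIA": [0.8, 0.4, 0.3]})

def same_labels(a, b, col="Ground Truth"):
    """Return True if both frames have the same length and identical labels, row by row."""
    if len(a) != len(b):
        return False
    return bool((a[col].to_numpy() == b[col].to_numpy()).all())

aligned = same_labels(df_cnn_toy, df_resnet_toy)
print(f"Label columns aligned: {aligned}")
```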

## 2. Build decision-making systems

In [10]:
# Define thresholds
uncertainty_threshold = 0.3  # The AI (CNN) counts as certain if its probability is > 0.7 or < 0.3
decision_threshold = 0.5     # Threshold for turning a probability into a binary decision

# System 1 (human-only): decide directly from the human (ResNet) probability
decisions_human_only = (resnet_probs > decision_threshold).astype(int)

# System 2 (HITL): if the AI is certain (probability outside [0.3, 0.7]), use its decision;
# otherwise fall back to the human decision
decisions_HITL = np.where(
    (cnn_probs > (1 - uncertainty_threshold)) | (cnn_probs < uncertainty_threshold),  # AI is certain
    (cnn_probs > decision_threshold).astype(int),  # Use the AI's decision
    decisions_human_only                           # Otherwise, fall back to the human's decision
)
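
To make the deferral rule concrete, here is a small sketch that applies the same logic to five hypothetical cases. The probabilities are made up for illustration and are not taken from the CSV files.

```python
import numpy as np

# Hypothetical probabilities for five cases (illustrative only)
cnn_probs = np.array([0.95, 0.10, 0.50, 0.65, 0.80])     # AI (CNN)
resnet_probs = np.array([0.40, 0.20, 0.90, 0.30, 0.60])  # human stand-in (ResNet)

uncertainty_threshold = 0.3
decision_threshold = 0.5

# The AI is certain when its probability lies outside [0.3, 0.7]
ai_certain = (cnn_probs > 1 - uncertainty_threshold) | (cnn_probs < uncertainty_threshold)
ai_decision = (cnn_probs > decision_threshold).astype(int)
human_decision = (resnet_probs > decision_threshold).astype(int)

# Use the AI's decision where it is certain, the human's otherwise
hitl_decision = np.where(ai_certain, ai_decision, human_decision)

for i in range(len(cnn_probs)):
    source = "AI" if ai_certain[i] else "human"
    print(f"case {i}: decided by {source} -> {hitl_decision[i]}")
```

In this toy example the AI decides cases 0, 1 and 4, while cases 2 and 3 are deferred to the human.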

## 3. Evaluate decision-making system

In [11]:
from sklearn.metrics import f1_score


# Compute the F1 score for both systems
f1_human = f1_score(ground_truth, decisions_human_only)
f1_HITL = f1_score(ground_truth, decisions_HITL)

print(f"F1 score for Human-Only system: {f1_human}")
print(f"F1 score for HITL system: {f1_HITL}")

F1 score for Human-Only system: 0.8960739030023095
F1 score for HITL system: 0.8311965811965811


We can see that incorporating the AI (CNN) lowered the system's F1 score from about 0.896 to about 0.831. Deploying this human-AI system is therefore blameworthy relative to deploying the human-only system. However, the degree of blameworthiness can be discounted by the gain in efficiency.
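
The notebook does not define how efficiency is measured. One possible proxy, sketched below as an assumption rather than the method used here, is the share of cases the AI decides on its own, that is, the cases the human never has to review. The helper name `automation_rate` and the toy probabilities are hypothetical.

```python
import numpy as np

def automation_rate(ai_probs, uncertainty_threshold=0.3):
    """Fraction of cases where the AI is certain and no human review is needed."""
    ai_probs = np.asarray(ai_probs, dtype=float)
    ai_certain = (ai_probs > 1 - uncertainty_threshold) | (ai_probs < uncertainty_threshold)
    return float(ai_certain.mean())

# Toy probabilities (illustrative only, not from the CSV files)
toy_probs = [0.95, 0.10, 0.50, 0.65, 0.80, 0.05, 0.99, 0.45]
rate = automation_rate(toy_probs)
print(f"Share of cases decided without the human: {rate:.3f}")
```

With the notebook's data, `automation_rate(cnn_probs)` would give the corresponding share for the deployed system, which could then be weighed against the drop in F1.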

In [13]:
# Inevitable errors: the HITL system is wrong and the human (ResNet) alone would also have been wrong
inevitable_errors = (decisions_HITL != ground_truth) & (decisions_human_only != ground_truth)

# Flagged by AI: the AI (CNN) was uncertain, so the HITL system fell back to the human decision
flagged_by_AI = (cnn_probs >= uncertainty_threshold) & (cnn_probs <= (1 - uncertainty_threshold))

# Inevitable errors that were flagged by the AI
inevitable_flagged = inevitable_errors & flagged_by_AI

# Inevitable errors that were not flagged by the AI (the AI was certain and its decision was used)
inevitable_not_flagged = inevitable_errors & ~flagged_by_AI

# Avoidable errors: the HITL system is wrong but the human alone would have been correct
avoidable_errors = (decisions_HITL != ground_truth) & (decisions_human_only == ground_truth)

# Count the errors
inevitable_flagged_count = np.sum(inevitable_flagged)
inevitable_not_flagged_count = np.sum(inevitable_not_flagged)
avoidable_errors_count = np.sum(avoidable_errors)

print(f"Inevitable Errors (flagged by AI): {inevitable_flagged_count}")
print(f"Inevitable Errors (not flagged by AI): {inevitable_not_flagged_count}")
print(f"Avoidable Errors: {avoidable_errors_count}")

Inevitable Errors (flagged by AI): 2
Inevitable Errors (not flagged by AI): 83
Avoidable Errors: 73


Inevitable errors are HITL errors that the human (ResNet) would also have made. They are further divided into:
* Flagged by AI: the AI was uncertain and the system fell back to the human. (The human bears the responsibility for this type of error.)
* Not flagged by AI: the AI was certain and made the decision itself. (Both the human and the AI are responsible for this type of error.)

Avoidable errors are errors that could have been avoided if the AI had requested intervention from the human. (Both the AI and the party responsible for the flagging mechanism are accountable for these errors.)
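
The error categories above can be sketched as a single helper and checked on toy data. The function name `categorize_errors` and the example arrays are hypothetical; the thresholds mirror those used earlier in the notebook. The extra `avoidable_flagged` count is included only to illustrate that an avoidable error cannot occur on a flagged case, because on flagged cases the HITL decision is the human decision.

```python
import numpy as np

def categorize_errors(ground_truth, human_decisions, ai_probs,
                      uncertainty_threshold=0.3, decision_threshold=0.5):
    """Count HITL errors by category, following the definitions in the text."""
    y = np.asarray(ground_truth)
    human = np.asarray(human_decisions)
    p = np.asarray(ai_probs, dtype=float)

    # The AI flags a case (defers to the human) when it is uncertain
    flagged = (p >= uncertainty_threshold) & (p <= 1 - uncertainty_threshold)
    hitl = np.where(flagged, human, (p > decision_threshold).astype(int))

    hitl_wrong = hitl != y
    human_wrong = human != y
    return {
        "inevitable_flagged": int(np.sum(hitl_wrong & human_wrong & flagged)),
        "inevitable_not_flagged": int(np.sum(hitl_wrong & human_wrong & ~flagged)),
        "avoidable": int(np.sum(hitl_wrong & ~human_wrong)),
        "avoidable_flagged": int(np.sum(hitl_wrong & ~human_wrong & flagged)),
    }

# Toy data (illustrative only)
y_true = [1, 1, 0, 0, 1, 0]
human = [0, 0, 0, 1, 1, 0]
ai_p = [0.5, 0.1, 0.9, 0.6, 0.2, 0.1]

counts = categorize_errors(y_true, human, ai_p)
print(counts)
```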