# Take home task

- Output of an Undamaged-Repair-Replace (URR) classifier
- per part detects whether the part is undamaged or whether it needs to be repaired or replaced
- parts that are lightly damaged are typically repaired and parts that are heavily damaged are typically replaced 
- undamaged < repair < replace

Task:
- based on the URR classifier output and some ground-truth metadata, find the right thresholds/decision boundaries that distinguish the three classes: undamaged, repair and replace

<img src="img/car_parts.jpg" alt="drawing" width="700"/>

In [1]:
import glob

import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, auc, precision_recall_fscore_support, f1_score
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
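import scipy.optimize  # import the submodule explicitly; older SciPy versions do not expose it via `import scipy`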

# Data

## Metadata
Claim data:
- claim_id: A unique ID for a claim
- make: Make description of the vehicle
- model: Model description of the vehicle
- year: Model year of the vehicle
- poi: The main point of impact (eg. Front Centre, Right Rear Corner, etc.)

Line data:
- line_num: Number of the line item
- part: Name of part (eg. fbumper, bbumper, etc.)
- operation: Name of operation (eg. repair, replace)
- part_price: Total price of the part if replaced in \\$
- labour_amt: Total labour amount to perform the operation in \\$

Additional info:
- a claim can have multiple line items (the claim data columns will be the same, and the line data columns differ)
- the line data contains information on the damaged parts for the claims: the operation (repair or replace) performed on the part and the cost associated with the operations
- If there isn’t a line for a part in a claim, assume that that part is undamaged
- Also assume that the vehicle details (make, model, year) and point of impact (poi) are known at inference time

In [None]:
def read_folder_to_pd(path):
    all_files = sorted(glob.glob(path + "/*.csv"))

    li = []

    for filename in all_files:
        df = pd.read_csv(filename, index_col=None, header=0)
        li.append(df)

    return pd.concat(li, axis=0, ignore_index=True)

In [None]:
metadata = read_folder_to_pd("./tractable_ds_excercise_data/metadata/")

In [None]:
metadata = metadata.sort_values(by=["claim_id", "line_num"]).reset_index(drop=True)

In [None]:
metadata.head(10)

## Classifier output

Classifier output data:
- claim_id: A unique ID for the claim
- part: Name of part (eg. fbumper, bbumper, etc.)
- urr_score: Undamaged / Repair / Replace score, float value ranging from 0 to 1
- set: The set (train/val/test) that the claim belongs to, where 0 => train, 1 => val and 2 => test

Additional info:
- For each claim, the classifiers output scores for all 10 parts, so there will be 10 lines per claim in this file
- If the classifier score is missing, that means that the part was not identified by the AI in the images provided

In [None]:
clf_output = pd.read_csv("./tractable_ds_excercise_data/classifier_output.csv")

In [None]:
# dropping all NaNs because here NaN means the part was not identified, hence the score is not "undamaged" but meaningless
clf_output = clf_output.dropna(axis=0).reset_index(drop=True)

In [None]:
clf_output.head(10)

# Tasks

## Part 1
- improve the performance of the URR classifier by determining the right thresholds/decision boundaries between the three classes (undamaged/repair/replace)
- the purpose of the final system is to predict which parts need to be repaired and replaced in a given claim, but your task stops at finding the thresholds

### Task 1
Determine ground-truth labels from the metadata and merge the two data dumps for analysis (expected content in hand-in: code)

In [None]:
data = clf_output.merge(metadata[["claim_id", "part", "operation"]], on=["claim_id", "part"], how="left")
# here we fill all NaNs with "undamaged" because when a line is not in the metadata we can assume the part is undamaged
data["operation"] = data["operation"].fillna("undamaged")

In [None]:
data.head(10)
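
The left merge above silently duplicates a classifier row if the metadata contains more than one line for the same (claim_id, part) pair. A minimal sketch of how that assumption could be checked, using pandas' `validate` argument on made-up toy frames (`toy_clf` and `toy_lines` are illustrative names, not part of the data dump):

```python
import pandas as pd

# Toy stand-ins for clf_output and metadata (values are made up)
toy_clf = pd.DataFrame({
    "claim_id": [1, 1, 2],
    "part": ["fbumper", "bbumper", "fbumper"],
    "urr_score": [0.9, 0.1, 0.5],
})
toy_lines = pd.DataFrame({
    "claim_id": [1, 2],
    "part": ["fbumper", "fbumper"],
    "operation": ["replace", "repair"],
})

# validate="many_to_one" raises pandas.errors.MergeError if a
# (claim_id, part) pair occurs more than once in the right-hand frame
toy_merged = toy_clf.merge(
    toy_lines, on=["claim_id", "part"], how="left", validate="many_to_one"
)
# Parts without a metadata line are treated as undamaged
toy_merged["operation"] = toy_merged["operation"].fillna("undamaged")
print(toy_merged)
```

If the check fails on the real data, the duplicated lines would need to be resolved (for example by keeping the most severe operation) before merging.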

#### Checking for dataset imbalance

In [None]:
class_count = data["operation"].value_counts()
for op, count in class_count.items():
    print(f"{op}: {count / class_count.sum():.1%}")
class_count.plot(kind="pie", legend=True)

In [None]:
N_CLASSES = len(data["operation"].unique())

In [None]:
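# pd.get_dummies (used in plot_multiclass_pr) orders its columns alphabetically,
# so the index-to-label mapping has to use the same order
CAT_TO_LABELS = {}
for category, label in enumerate(sorted(data["operation"].unique())):
    CAT_TO_LABELS[category] = label
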
LABELS_TO_CAT = {label: category for category, label in CAT_TO_LABELS.items()}

### Task 2
Analyse the performance of the classifiers (expected content in hand-in: code, 2-5 bullet-points and 1-3 tables/figures)

#### Answer:
- Because we don't have the thresholds yet, we can use either ROC or Precision-Recall curves to analyse how the classifier performs across different thresholds
- We will use the Precision-Recall curve because as seen in the cell above, we have an imbalanced dataset
- We have to use a multiclass Precision-Recall curve as there are 3 classes

- The way I am approaching this task is:
    - Given a list of URR scores
    - For each class (undamaged, repair, replace) I am going through all the URR scores
    - For each URR score I am saying: If this is the boundary for this class, what would be the precision, recall, f1 score and precision/recall AUC?
    - I can then retrieve the optimal boundary by finding the threshold which produced the best f1 score for each class
    - By looking at the precision recall curve and the AUC score I can evaluate its performance without knowing the threshold

In [None]:
# split the dataset into train, val and test set
train_data = data[data["set"] == 0]
val_data = data[data["set"] == 1]
test_data = data[data["set"] == 2]

In [None]:
def plot_multiclass_pr(y_pred, y_test, n_classes, figsize=(10, 8), labels={}):
    precision = {}
    recall = {}
    f_1_score = {}
    thresholds = {}
    pr_auc = {}
    no_skill = {}

    y_test = pd.get_dummies(y_test, drop_first=False).values
    for i in range(n_classes):
        precision[i], recall[i], thresholds[i] = precision_recall_curve(y_test[:, i], y_pred)
        f_1_score[i] =  (2 * precision[i] * recall[i]) / (precision[i] + recall[i] + 1e-10)
        pr_auc[i] = auc(recall[i], precision[i])
        no_skill[i] = np.count_nonzero(y_test[:, i] == 1) / len(y_test)  # prevalence of class i
        
    fig, ax = plt.subplots(figsize=figsize)
    ax.set_xlabel('Recall')
    ax.set_ylabel('Precision')
    ax.set_title('Precision Recall curve')
    for i in range(n_classes):
        best_f1_index = np.argmax(f_1_score[i])
        print(f"Best threshold for label {labels.get(i, i)}: {thresholds[i][best_f1_index]}")
        ax.plot(recall[i], precision[i], 
                label=f'PR curve (area = {round(pr_auc[i], 2)}) for label {labels.get(i, i)}')
        ax.scatter(recall[i][best_f1_index], precision[i][best_f1_index], s=100, marker=(5, 1), 
                   label=f'Best F1 score ({round(f_1_score[i][best_f1_index], 2)}) for label {labels.get(i, i)}')
        ax.plot([0, 1], [no_skill[i], no_skill[i]], linestyle='--')
    ax.legend(loc="best")
    ax.grid(alpha=.4)
    plt.show()

#### Train data

In [None]:
plot_multiclass_pr(y_pred=train_data["urr_score"].values, 
                   y_test=train_data["operation"], 
                   n_classes=N_CLASSES,
                   labels=CAT_TO_LABELS)

#### Validation data

In [None]:
plot_multiclass_pr(y_pred=val_data["urr_score"].values, 
                   y_test=val_data["operation"], 
                   n_classes=N_CLASSES,
                   labels=CAT_TO_LABELS)

#### Test data

In [None]:
plot_multiclass_pr(y_pred=test_data["urr_score"].values, 
                   y_test=test_data["operation"], 
                   n_classes=N_CLASSES,
                   labels=CAT_TO_LABELS)

### Task 3
Find the optimum thresholds to distinguish the undamaged/repair/replace classes. You are free to choose any objective here (accuracy, true positive rate, etc.) but make sure to justify your choice, and that the justification is sensible (expected content in hand-in: code, 2-5 bullet-points and 1-3 tables/figures)

#### Answer:
- Why I use a weighted average of the per-class scores
- Why I use precision, recall and F1 score as metrics
- Why I use scipy's optimizer to find the thresholds (logistic regression is an alternative)
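
To make the first bullet concrete: `average="weighted"` weights each class's F1 by its support, whereas `average="macro"` treats all classes equally. A small sketch on made-up labels (the 8/1/1 class split is an assumption for illustration, not the real class balance):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels: 0 = undamaged (majority), 1 = repair, 2 = replace
y_true_toy = np.array([0] * 8 + [1] * 1 + [2] * 1)
# A trivial predictor that always answers "undamaged"
y_pred_toy = np.zeros_like(y_true_toy)

macro_f1 = f1_score(y_true_toy, y_pred_toy, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true_toy, y_pred_toy, average="weighted", zero_division=0)
print(f"macro F1: {macro_f1:.3f}, weighted F1: {weighted_f1:.3f}")
```

On imbalanced data the weighted score is dominated by the majority class, so a high weighted F1 does not by itself show that repair and replace are predicted well.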

In [None]:
def get_y_true_and_scores(data):
    y_true = np.where(data["operation"] == "replace", 2, np.where(data["operation"] == "repair", 1, 0))
    y_scores = data['urr_score'].values
    return y_true, y_scores

In [None]:
def get_y_pred_w_threshold(y_scores, threshold):
    return np.where(y_scores > threshold[1], 2, np.where(y_scores > threshold[0], 1, 0))

#### Using scipy's optimizer

In [None]:
# use the training and validation data for tuning
y_true_train, y_scores_train = get_y_true_and_scores(pd.concat([train_data, val_data]))
y_true_test, y_scores_test = get_y_true_and_scores(test_data)

In [None]:
def threshold_to_f1(threshold, y_true, y_scores):
    y_pred = get_y_pred_w_threshold(y_scores, threshold)
    return -f1_score(y_true, y_pred, average="weighted")

best_thr = scipy.optimize.fmin(threshold_to_f1, args=(y_true_train, y_scores_train), x0=[0.1, 0.9])
print(f"Best threshold: {best_thr}")
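
The weighted F1 score is piecewise constant in the two thresholds (it only changes when a threshold crosses an observed score), so a simplex search like `fmin` can stall on a flat region near its starting point. As a cross-check, an exhaustive grid search could be used. The sketch below runs on synthetic data; `predict_with_thresholds`, `grid_search_thresholds` and the toy arrays are illustrative names and not part of the code above:

```python
import itertools

import numpy as np
from sklearn.metrics import f1_score


def predict_with_thresholds(scores, low, high):
    # 0 = undamaged, 1 = repair, 2 = replace (same convention as above)
    return np.where(scores > high, 2, np.where(scores > low, 1, 0))


def grid_search_thresholds(y_true, scores, grid):
    """Try every (low, high) pair with low < high and keep the best weighted F1."""
    best_score, best_pair = -1.0, None
    # combinations() of a sorted grid only yields pairs with low < high
    for low, high in itertools.combinations(grid, 2):
        y_pred = predict_with_thresholds(scores, low, high)
        score = f1_score(y_true, y_pred, average="weighted", zero_division=0)
        if score > best_score:
            best_score, best_pair = score, (low, high)
    return best_score, best_pair


# Synthetic data: scores cluster around 0, 0.5 and 1 for the three classes
rng = np.random.default_rng(0)
toy_y = rng.integers(0, 3, size=300)
toy_urr = np.clip(toy_y / 2 + rng.normal(0, 0.15, size=300), 0, 1)

best_f1, (best_low, best_high) = grid_search_thresholds(
    toy_y, toy_urr, np.linspace(0.05, 0.95, 19)
)
print(f"Best thresholds: ({best_low:.2f}, {best_high:.2f}), weighted F1: {best_f1:.3f}")
```

The grid resolution trades runtime for precision; the result could also serve as the starting point `x0` for `fmin`.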

In [None]:
y_pred_test = get_y_pred_w_threshold(y_scores_test, best_thr)

precision, recall, f_1_score, _ = precision_recall_fscore_support(y_true_test, y_pred_test, average="weighted")
print(f"Precision: {round(precision, 4)}\nRecall:    {round(recall, 4)}\nF1 Score:  {round(f_1_score, 4)}")

#### Logistic Regression
- Here we are plugging the raw scores of the URR classifier into a Logistic Regression model
- We are training the Logistic Regression model using the URR scores as input data and the corresponding class labels as targets

In [None]:
# Getting the training, validation and testing data
y_train_lr, x_train_lr = get_y_true_and_scores(train_data)
y_val_lr, x_val_lr = get_y_true_and_scores(val_data)
y_test_lr, x_test_lr = get_y_true_and_scores(test_data)

x_train_lr = x_train_lr.reshape(-1, 1)
x_val_lr = x_val_lr.reshape(-1, 1)
x_test_lr = x_test_lr.reshape(-1, 1)

In [None]:
# Specifying the Logistic Regression model (we could use a grid search here to get better parameters)
clf_lr = LogisticRegression(C=1.0, penalty='l2')
clf_lr.fit(x_train_lr, y_train_lr)
clf_lr.score(x_val_lr, y_val_lr)

In [None]:
y_pred_lr = clf_lr.predict(x_test_lr)
precision, recall, f_1_score, _ = precision_recall_fscore_support(y_test_lr, y_pred_lr, average="weighted")
print(f"Precision: {round(precision, 4)}\nRecall:    {round(recall, 4)}\nF1 Score:  {round(f_1_score, 4)}")

In [None]:
# Computing one example manually
example_calculation = (clf_lr.coef_ * x_test_lr[0]).T + clf_lr.intercept_
example_calculation[0]

In [None]:
print(f"Manually computed class: {np.argmax(example_calculation[0])}\nModel prediction:        {y_pred_lr[0]}")

In [None]:
clf_lr.coef_, clf_lr.intercept_
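
Because the model has a single input feature, its decision rule can be read back as thresholds on the URR score, which makes it comparable to the optimizer result. A minimal sketch on synthetic data: evaluate the fitted model on a fine grid of scores and record where the predicted class changes (`toy_lr`, `score_grid` and `implied_thresholds` are illustrative names):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: scores cluster around 0, 0.5 and 1 for the three classes
rng = np.random.default_rng(0)
toy_y = rng.integers(0, 3, size=600)
toy_x = np.clip(toy_y / 2 + rng.normal(0, 0.1, size=600), 0, 1).reshape(-1, 1)

toy_lr = LogisticRegression(C=1.0).fit(toy_x, toy_y)

# Predict on a fine grid over [0, 1] and find the positions where the class flips
score_grid = np.linspace(0, 1, 1001).reshape(-1, 1)
grid_pred = toy_lr.predict(score_grid)
change_idx = np.nonzero(np.diff(grid_pred))[0]
implied_thresholds = score_grid[change_idx + 1, 0]
print(f"Implied thresholds: {implied_thresholds}")
```

If fewer than two change points are found, the model never predicts one of the classes, which is worth checking on imbalanced data.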

## Part 2

- discuss how to take this solution closer to something that could be used live

### ~~Task 4~~
- discuss ways you would scale your code to ingest more data, i.e. of the order of millions (expected content in hand-in: 2-5 bullet-points)

#### Answer:
- use Spark instead of Pandas
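
Before moving to Spark, a lighter-weight option (a different technique from the one named above) is to stream the CSV files in chunks so that only one chunk is in memory at a time. A minimal sketch, assuming the aggregation can be computed incrementally; the in-memory CSV and the chunk size are made up for illustration:

```python
import io
from collections import Counter

import pandas as pd

# Stand-in for a large metadata file on disk (values are made up)
toy_csv = io.StringIO(
    "claim_id,part,operation\n"
    "1,fbumper,replace\n"
    "1,bbumper,repair\n"
    "2,fbumper,repair\n"
    "3,bbumper,replace\n"
)

operation_counts = Counter()
# chunksize makes read_csv return an iterator of DataFrames
for chunk in pd.read_csv(toy_csv, chunksize=2):
    operation_counts.update(chunk["operation"].value_counts().to_dict())

print(dict(operation_counts))
```

This only works for computations that can be combined across chunks (counts, sums, histograms); joins across the full dataset are where a distributed engine becomes more attractive.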

### ~~Task 5~~
- discuss ways you would further improve the performance of the classifiers if you had more time (expected content in hand-in: 2-5 bullet-points)

### Task 6
- predict the cost of performing the operations for a particular claim as a function of the URR scores and the claim data
- How would you design a system that can predict these costs? 
- Which metadata fields would you use? 
- Would you also require any additional data (not provided in the data dump) that will help you improve the accuracy of your estimate? 
(expected content in hand-in: 5-10 bullet-points)

#### Answer:

- Designing a system to predict the costs:
    - The model would take the URR scores as well as certain metadata fields as its input data
        - $f(URR\_scores, make, model, year, poi) = cost\_of\_operation$
    - We would need a regression model
        - Popular choices include: linear regression, regression trees, random forest, xgboost, support vector regression, neural network
    - "part_price" and "labour_amt" would be the target columns. The description doesn't make clear whether only "labour_amt" has to be predicted; I would assume so, because "part_price" could come from a lookup table. We could train one model per target or predict both with the same model. I would choose the latter because many of the relevant parameters overlap.
    - Because both fields can get quite large, we would normalize them: learn the mean and variance on the train set and apply them to the val and test sets to avoid data leakage
        - We can then reverse the normalization to get our actual values to generate our report
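
The normalization step described above could look roughly like this, using scikit-learn's `StandardScaler` on made-up cost values (the arrays and variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up cost targets; columns are [part_price, labour_amt]
y_train_cost = np.array([[200.0, 80.0], [1500.0, 300.0], [650.0, 120.0], [90.0, 40.0]])
y_val_cost = np.array([[400.0, 100.0], [1200.0, 250.0]])

# Learn mean and variance from the train set only
target_scaler = StandardScaler().fit(y_train_cost)
y_train_scaled = target_scaler.transform(y_train_cost)
# Reuse the train statistics for the validation set (no refit)
y_val_scaled = target_scaler.transform(y_val_cost)

# A model would be trained on y_train_scaled; its predictions are mapped
# back to dollar amounts with inverse_transform
y_val_restored = target_scaler.inverse_transform(y_val_scaled)
print(y_val_restored)
```

Since costs are non-negative and often right-skewed, a log transform of the targets is another option that may be worth comparing.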


- Metadata fields:
    - To choose which metadata fields are useful, we would check which fields are most correlated with "part_price" and "labour_amt" (example for "make" below)
        - If a field (feature) shows close to 0 correlation, it's probably useless
    - My intuition tells me that the "make" and "model" of a car are probably correlated with how much it costs to repair/replace a part. Crashing the trunk of a Tesla is most likely more expensive than crashing the trunk of a Fiat.
    - "poi" will probably also be correlated with the final price, e.g. fixing the door is more expensive than fixing the back bumper
    

- Feature engineering:
    - If we use a linear regression or neural net:
        - Once we have chosen the metadata fields we would need to one-hot-encode our features "make", "model", "poi"
            - It wouldn't make sense to label encode them because there is no particular order in any of them
            - Because one-hot-encoding will result in a very sparse input vector, we could try out other encoding methods like target encoding or count encoding (target makes more sense here)
        - I would also convert year into a numerical representation. I can imagine "age" would work well, as parts and labour might get more expensive the older the car is because parts become rarer. It wouldn't be the number of years the car has been driven but rather the number of years since the model was first released.
            - The problem we run into here is that if we use (current_year - build_year) as our definition of age, it changes every year and we'd have to retrain our model at least on a yearly basis
            - As an alternative we could, again, try target encoding
    - If we use a decision tree we don't need to worry as much about encoding the categorical variables, as tree-based models tend to cope well with them (some implementations support categorical features natively)
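
A minimal sketch of the target encoding mentioned above, using plain pandas on made-up data (the frames, the "Skoda" row and the column name `make_encoded` are illustrative assumptions):

```python
import pandas as pd

train_toy = pd.DataFrame({
    "make": ["Fiat", "Fiat", "Tesla", "Tesla", "Volkswagen"],
    "labour_amt": [100.0, 140.0, 400.0, 360.0, 200.0],
})
# "Skoda" does not occur in the training data
val_toy = pd.DataFrame({"make": ["Tesla", "Fiat", "Skoda"]})

# Mean target per category, learned on the training set only
make_means = train_toy.groupby("make")["labour_amt"].mean()
global_mean = train_toy["labour_amt"].mean()

# Unseen categories fall back to the global training mean
val_toy["make_encoded"] = val_toy["make"].map(make_means).fillna(global_mean)
print(val_toy)
```

Plain per-category means can overfit rare categories, so in practice some smoothing towards the global mean or out-of-fold encoding would likely be needed.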


- Additional data
    - It would help to know the country and perhaps even region of the accident, e.g. companies in London will charge more than in Fordwich (which, as of 2011 has 381 inhabitants)
    
 
- Process:
    - Find a simple baseline (to ensure the model is learning something), e.g. the average cost per part or per make
    - Build a simple ML model to beat the baseline (to confirm that ML is even useful for this task)
    - Overfit our model (add (bigger) layers, train longer, use bigger model)
    - Apply regularization methods and tune hyperparameters (dropout, different architectures, more data, regularization methods)
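
The baseline from the first step could be as simple as the following sketch: predict the mean training cost per part and measure the mean absolute error. The data is made up, and MAE is my choice of metric for the illustration, not something prescribed by the task:

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

train_lines = pd.DataFrame({
    "part": ["fbumper", "fbumper", "bbumper", "bbumper"],
    "labour_amt": [100.0, 200.0, 50.0, 70.0],
})
test_lines = pd.DataFrame({
    "part": ["fbumper", "bbumper"],
    "labour_amt": [180.0, 40.0],
})

# Baseline: average training cost per part
part_means = train_lines.groupby("part")["labour_amt"].mean()
baseline_pred = test_lines["part"].map(part_means)

baseline_mae = mean_absolute_error(test_lines["labour_amt"], baseline_pred)
print(f"Baseline MAE: {baseline_mae:.2f}")
```

Any ML model would then have to beat this number on the same test split to justify its extra complexity.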

In [None]:
metadata

#### Labour amount and Part Price for Volkswagen Tiguan (replace)

In [None]:
t = metadata[(metadata["make"] == "Volkswagen") & 
             (metadata["model"] == "Tiguan") & 
             (metadata["operation"] == "replace")] \
            [["make", "model", "year", "poi", "operation", "part_price", "labour_amt"]] \
            .dropna() \
            .sort_values(by=["year", "poi"])

print(t["labour_amt"].describe())
t.head(10)

#### Labour amount for Volkswagen Tiguan (repair)

In [None]:
t = metadata[(metadata["make"] == "Volkswagen") & 
             (metadata["model"] == "Tiguan") & 
             (metadata["operation"] == "repair")] \
            [["make", "model", "year", "poi", "operation", "labour_amt"]] \
            .dropna() \
            .sort_values(by=["year", "poi"])

print(t["labour_amt"].describe())
t.head(10)

#### Correlation between car make and part_price, labour_amt

In [None]:
metadata["model"].isna().sum()

In [None]:
corr = pd.get_dummies(metadata[["make", "part_price", "labour_amt"]], columns=["make"]).dropna().corr()
corr.style.background_gradient(cmap='coolwarm')