# Why do we care about finding data issues in the VinBigData Chest X-ray competition?

In the [VinBigData Chest X-ray Abnormalities Detection competition](https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection/overview), the training data consists of bounding boxes and class labels assigned independently by 3 radiologists. For the test data, there is an additional step, in which 2 more experienced radiologists review the 3 initial assessments and decide how to resolve the disagreements (and can discuss with each other while doing so). The process is described in the [paper linked on the competition overview page](https://arxiv.org/pdf/2012.15029.pdf):
> For the test set, 5 radiologists involved into a two-stage labeling process. During the first stage, each image was independently annotated by 3 radiologists. In the second stage, 2 other radiologists, who have a higher level of experience, reviewed the annotations of the 3 previous annotators and communicated with each other in order to decide the final labels. The disagreements among initial annotators were carefully discussed and resolved by the 2 reviewers. Finally, the consensus of their opinions will serve as reference ground-truth.

(See also [this discussion thread](https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection/discussion/211035), which also ends up covering some data/labeling issues.)

**Bottom line:** our training data is not generated from the same process as the test data and a lot of the differences will be mistakes in the training data that would likely have been fixed, if the image were in the test data. Thus, it makes sense to try and fix the issues we can fix.

# What mistakes in the data do we already know?
Some of the [forum discussions](https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection/discussion/211035) and some [EDA notebooks](https://www.kaggle.com/bjoernholzhauer/eda-dicom-reading-vinbigdata-chest-x-ray#5.-How-big-do-bounding-boxes-tend-to-be-for-different-classes?-How-many-are-there?) already suggested that at least some 'aortic enlargement' and 'cardiomegaly' labels for bounding boxes are wrong. To me, these look less like mis-assessments by the radiologists and more like software usage issues, where they created a new bounding box but accidentally gave it the same class as the previous bounding box they had labelled. For these two classes this is relatively easy to see (see the images in the [data issues section](https://www.kaggle.com/bjoernholzhauer/eda-dicom-reading-vinbigdata-chest-x-ray#11.Possible-data-issues) of [this EDA notebook](https://www.kaggle.com/bjoernholzhauer/eda-dicom-reading-vinbigdata-chest-x-ray#5.-How-big-do-bounding-boxes-tend-to-be-for-different-classes?-How-many-are-there?)). However, it will also be interesting to see whether other approaches identify these cases too.

In [None]:
import os
import re
import pandas as pd
#from fastai.medical.imaging import *
#from fastai.vision.all import *
import numpy as np
from pathlib import Path
import matplotlib
import matplotlib.patches as ptc
from tqdm import tqdm # for getting a progress bar on loops
from sklearn.model_selection import KFold
import seaborn as sns
import matplotlib.pyplot as plt
import shap
from matplotlib.patches import Rectangle
import shelve

In [None]:
dicom_meta = pd.read_csv('../input/eda-dicom-reading-vinbigdata-chest-x-ray/train_dicom_properties.csv.bz2').rename(columns={'file':'image_id'})
train = pd.read_csv('../input/vinbigdata-chest-xray-abnormalities-detection/train.csv')

In [None]:
bygroup = train[['image_id', 'class_name', 'rad_id','class_id']].copy().groupby(['image_id', 'class_name', 'rad_id']).count().reset_index()
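# After the groupby/count, 'class_id' holds the number of boxes per image, class and radiologist.
# Keep radiologists who assigned 'Aortic enlargement' or 'Cardiomegaly' exactly twice on one image.
is_strange = bygroup['class_name'].isin(['Aortic enlargement', 'Cardiomegaly']) & (bygroup['class_id'] == 2)
strange_cases = bygroup.loc[is_strange, ['image_id', 'rad_id']]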
strange_cases = pd.merge(train, strange_cases, on=['image_id', 'rad_id'], how='inner').sort_values(['image_id', 'rad_id', 'class_name']).reset_index(drop=True)
strange_cases

We have also [been told](https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection/discussion/208111) that `PatientAge` in the `.dicom` metadata may be wrong. When I [did some exploratory data analysis](https://www.kaggle.com/bjoernholzhauer/eda-dicom-reading-vinbigdata-chest-x-ray#6.-What-is-in-the-.dicom-meta-data?), I saw 839 .dicom files with PatientAge given as 000Y (plus 000D = 0 days?), which could in theory be yet another placeholder for a missing value or might really mean 0 years of age (i.e. < 1 year old). There are also empty strings `''`, `000` and the string `NaN`. Additionally, some of the ages are below 18 years, e.g. there are 107 images with 001Y to 017Y. This is obviously interesting, because some diagnoses are just much more likely at certain ages. From the paper describing the methods used to create the dataset, it appears that all data should be from adults:
> "The collected raw data was mostly of adult PA-view CXRs, but also included a significant amount of outliers such as images of body parts other than chest (due to mismatched DICOM tags), pediatric scans, low-quality images, or lateral CXRs. […] All outliers were automatically excluded from the dataset […]."

That prompted me to ask for clarification on the forum and the response suggested that these are data issues, which makes you wonder how many other ages above 18 may also be wrong.

Certainly, several of the ages substantially above 100 years may also be doubtful.

In [None]:
def string_to_float_process(string):
    if (string=='') | (string=='000D') | (string=='000'):
        return float("NaN")
    else:
        return float(string)

PatientAge = [string_to_float_process(string) for string in dicom_meta['PatientAge'].replace(re.compile('Y$'), '') ]
sns.distplot(PatientAge);
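
The parsing rules discussed above can also be written as a small standalone function. This is only a sketch: the set of placeholder codes and the plausibility cut-offs (18 and 100 years) are my assumptions for illustration, and unlike the cell above it also treats `000Y` as missing rather than as age 0. The names `parse_dicom_age` and `is_plausible_adult_age` are hypothetical helpers, not part of any library.

```python
import math
import re

# Placeholders treated as "age unknown". This set is an assumption based on the
# values discussed above, not something confirmed by the competition host.
MISSING_AGE_CODES = {"", "000", "000D", "000Y", "NaN"}

def parse_dicom_age(value):
    """Return the age in years as a float, or NaN if it cannot be trusted."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return float("nan")
    text = str(value).strip()
    if text in MISSING_AGE_CODES:
        return float("nan")
    match = re.fullmatch(r"(\d+)Y?", text)
    if match is None:
        # Anything else (e.g. ages given in days or months) is left as missing.
        return float("nan")
    return float(match.group(1))

def is_plausible_adult_age(age, low=18, high=100):
    """Hypothetical plausibility check; the cut-offs are illustrative only."""
    return (not math.isnan(age)) and low <= age <= high

ages = [parse_dicom_age(v) for v in ["045Y", "000Y", "", "017Y", "238Y", float("nan")]]
flags = [is_plausible_adult_age(a) for a in ages]
print(ages)
print(flags)
```

Whether implausible ages should be set to missing or kept as they are is a modelling choice; tree-based models such as LightGBM can handle missing values directly.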

# Bounding boxes with more than one label
A lot of these are probably not really data issues. Quite often the same bounding boxes are given a lot of labels. As per the competition host this is legitimate. As pointed out in a [discussion thread](https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection/discussion/212859#1161967), they have stated:
> Some dense lesion area contains many labels, our radiologists created a box then assign many labels to that box.

The only question is whether some of these multiple labels are accidental or erroneous, after all. Let's have a look at a list of how often each combination of labels occurs:

In [None]:
tmp1 = train.copy()
tmp_cols = ['image_id', 'rad_id', 'x_min', 'x_max', 'y_min', 'y_max']
tmp1 = tmp1[tmp_cols+[ 'class_id']]\
    .groupby(tmp_cols)\
    .count()\
    .reset_index()\
    .rename(columns={'class_id': 'Number of records'})
#tmp1 = tmp1[tmp1['Number of records']>1]
tmp1 = pd.merge(train.copy(), tmp1, 
                on=tmp_cols, 
                how='inner')\
    .sort_values(tmp_cols+['class_id'])\
    .reset_index(drop=True)
tmp1['class_id_name'] = [str(row['class_id']) + ': ' + row['class_name'] for idx, row in tmp1.iterrows()]


tmp1["grouprank"] = tmp1.groupby(tmp_cols)['class_id']\
    .rank(method="first", ascending=True)\
    .astype(np.int8)

tmp1 = tmp1.pivot(index=tmp_cols+['Number of records'],
           columns='grouprank',values='class_id_name')\
    .add_prefix('class_')\
    .reset_index()
   
def isNaN(string):
    return string != string

tmp1['class_combinations'] = [ ' + '.join([  row['class_'+str(i)]  for i in range(1,6) if (not( isNaN(row['class_'+str(i)])))])  for idx, row in tmp1.iterrows() ]
pd.set_option('display.max_rows', 1000)
tmp1['class_combinations'].value_counts()

As we can see, the most common combination when a single bounding box has more than one label is **10: Pleural effusion + 11: Pleural thickening** (1285 cases), followed by **6: Infiltration + 13: Pulmonary fibrosis** (318 cases) and **6: Infiltration + 7: Lung Opacity** (216 cases). Does this fall under "dense lesion areas"?

Some of the combinations are clearly mistakes, such as **0: Aortic enlargement + 3: Cardiomegaly**, and others do not seem very plausible, such as **3: Cardiomegaly + 8: Nodule/Mass**.
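
To make the idea of screening label combinations concrete, here is a minimal sketch on made-up records. The list of suspicious pairs is purely my assumption based on the two examples above, and the records are toy data rather than rows from `train.csv`.

```python
from collections import Counter, defaultdict

# Toy records: (image_id, rad_id, box coordinates, class_name). All values are made up.
records = [
    ("img1", "R8", (10, 20, 110, 220), "Pleural effusion"),
    ("img1", "R8", (10, 20, 110, 220), "Pleural thickening"),
    ("img2", "R9", (50, 60, 300, 400), "Aortic enlargement"),
    ("img2", "R9", (50, 60, 300, 400), "Cardiomegaly"),
    ("img3", "R9", (5, 5, 50, 50), "Nodule/Mass"),
]

# Label pairs we would consider suspicious on one and the same box (an assumption).
SUSPICIOUS_PAIRS = {
    frozenset({"Aortic enlargement", "Cardiomegaly"}),
    frozenset({"Cardiomegaly", "Nodule/Mass"}),
}

# Collect the set of labels that each radiologist gave to each identical box.
labels_per_box = defaultdict(set)
for image_id, rad_id, box, class_name in records:
    labels_per_box[(image_id, rad_id, box)].add(class_name)

# Count how often each multi-label combination occurs.
combination_counts = Counter(
    " + ".join(sorted(labels))
    for labels in labels_per_box.values()
    if len(labels) > 1
)

# Boxes whose label set contains at least one suspicious pair.
suspicious_boxes = [
    key for key, labels in labels_per_box.items()
    if any(pair <= labels for pair in SUSPICIOUS_PAIRS)
]
print(combination_counts)
print(suspicious_boxes)
```

A hand-written list like this only catches combinations we already distrust, which is one motivation for the model-based approach in the rest of the notebook.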

# What's the idea in this notebook for finding the issues and mislabeled bounding boxes?

What are we trying to address? Two possible questions are:
* Are the labels wrong for an existing bounding box? The obvious question is whether we can do more systematically what humans did more or less manually above: identify implausible bounding box labels. A possible metric: do we predict this label given the characteristics of the bounding box? Here we assume that the positions of the bounding boxes are right and we are only unsure whether they are mislabeled.
* Are certain bounding boxes in the wrong place? We could investigate this by checking whether other radiologists give the same label to overlapping boxes, or whether they have overlapping boxes at all. A metric could be IOU.

In this notebook I use LightGBM to try and find misclassified bounding boxes. For classifying the `class_id` for a bounding box, categorical cross entropy loss is one obvious choice, but multi-class focal loss could be an alternative. Obviously, when we optimize hyperparameters and evaluate performance we need to do cross-validation by `image_id` to avoid issues with leakage. On the other hand, we want to **stay away** from `rad_id` as a feature - it may well be a great feature, but we don't want to train a model to repeat a radiologist's mistakes.
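
For reference, this is what the multi-class focal loss mentioned above looks like as a plain metric. It is a NumPy sketch of the standard formula, the mean of `-(1 - p_true)**gamma * log(p_true)`, and not something used later in this notebook. Plugging it into LightGBM as a training objective would additionally require gradients and Hessians, which are not shown here. The function name and the default `gamma=2.0` are my choices for illustration.

```python
import numpy as np

def multiclass_focal_loss(probs, labels, gamma=2.0, eps=1e-12):
    """Mean focal loss given predicted class probabilities and integer labels."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Probability the model assigned to the labelled class of each row.
    p_true = probs[np.arange(len(labels)), labels]
    p_true = np.clip(p_true, eps, 1.0)
    # (1 - p_true)**gamma down-weights rows that are already predicted well.
    return float(np.mean(-((1.0 - p_true) ** gamma) * np.log(p_true)))

probs = np.array([[0.90, 0.05, 0.05],
                  [0.20, 0.70, 0.10],
                  [0.30, 0.30, 0.40]])
labels = np.array([0, 1, 0])

cross_entropy = multiclass_focal_loss(probs, labels, gamma=0.0)  # gamma=0 is plain log-loss
focal = multiclass_focal_loss(probs, labels, gamma=2.0)
print(cross_entropy, focal)
```

With `gamma=0` the expression reduces to the usual categorical cross entropy, so the two losses differ only in how much weight hard-to-predict rows receive.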

# Other possible approaches I have not tried yet

Alternatively, we could try neural networks, where we could even have an `image_id` embedding or even use the actual image as an input. One possibility would simply be a classification model and to look at it as a multi-label problem - and then look for the labels the model considers the most "surprising" (aka most confused predictions).

We could also try to predict IOU or some other metric on average vs. the two other radiologists that labelled the same image - possibly with the actually labelled class for the bounding box as an input. 

# Other possible uses of these approaches
Another thought that occurred to me is that we could double-check the predictions we get from image-based models for plausibility using a very different approach like this one. Sure, you'd have to think quite carefully about how skeptical the plausibility model would have to be before you override a prediction from a vision model that uses the actual image, but with some careful tuning this might very well be useful.

# Feature engineering

In [None]:
# Create a list of classes and a dictionary
classes = train[['class_id', 'class_name', 'rad_id']]\
    .groupby(['class_id', 'class_name'])\
    .count()\
    .rename(columns={'rad_id': 'Number of records'})\
    .reset_index()

# Map each class_id to its class_name
label_dict = dict(zip(classes['class_id'], classes['class_name']))

train = pd.merge(train, dicom_meta, on='image_id', how='left')
train['x_max'] = train['x_max']/train['Columns']
train['x_min'] = train['x_min']/train['Columns']
train['y_max'] = train['y_max']/train['Rows']
train['y_min'] = train['y_min']/train['Rows']
train['width'] = (train['x_max']-train['x_min'])
train['height'] = (train['y_max']-train['y_min'])
train['area'] = train['height']*train['width']
train['x_center'] = (train['x_max']+train['x_min'])/2
train['y_center'] = (train['y_max']+train['y_min'])/2

In [None]:
# In this cell, we create a feature for how many other classes a radiologist has already assigned 
# for the same image.

tmp1 = pd.merge(train, train, on=['image_id', 'rad_id'], how='left').fillna(0)
#tmp1 = tmp1[ (tmp1['class_id_x']!=tmp1['class_id_y']) | (tmp1['x_min_x']!=tmp1['x_min_y']) | (tmp1['x_max_x']!=tmp1['x_max_y']) | (tmp1['y_min_x']!=tmp1['y_min_y']) | (tmp1['y_max_x']!=tmp1['y_max_y'])]

tmp_cols = ['image_id', 'rad_id', 'class_id_x', 'x_min_x', 'y_min_x', 'x_max_x', 'y_max_x']
tmp1 = tmp1[ tmp_cols + ['class_id_y', 'x_max_y']]\
    .groupby(tmp_cols+['class_id_y'])\
    .count()\
    .reset_index()\
    .pivot(index=tmp_cols,
           columns='class_id_y', values='x_max_y')\
    .add_prefix('other_')\
    .reset_index()\
    .rename(columns={'class_id_x': 'class_id',
                     'x_min_x': 'x_min',
                     'y_min_x': 'y_min',
                     'x_max_x': 'x_max',
                     'y_max_x': 'y_max'})

train = pd.merge(train, tmp1, 
                 on=['image_id', 'rad_id', 'class_id', 'x_min', 'y_min', 'x_max', 'y_max'],
                 how='left')
other_cols = ['other_'+str(i) for i in range(15)]
# np.int was removed from recent NumPy versions, so use the built-in int instead
train[other_cols] = train[other_cols].fillna(0).astype(int)

# Finally, we subtract the extra count of +1 for each label itself (when we want to predict it, we
# do not want to have a feature that leaks the label, which it otherwise would).
for class_id in range(14):
    train.loc[train['class_id']==class_id, 'other_'+str(class_id)] -= 1

# Features based on where bounding boxes for a class tend to be

In [None]:
locations = np.zeros((14, 1000, 1000))
for index, row in tqdm(train.iterrows(), total=train.shape[0]):
    if row['class_id']<14:
        # Convert relative coordinates to grid cells (np.int no longer exists in recent NumPy versions)
        y0, y1 = int(np.round(row['y_min'],3)*1000), int(np.round(row['y_max'],3)*1000)
        x0, x1 = int(np.round(row['x_min'],3)*1000), int(np.round(row['x_max'],3)*1000)
        locations[row['class_id'], y0:y1, x0:x1] += 1
        
classcounts = train[['image_id', 'rad_id', 'class_id','class_name']]\
    .groupby(['image_id', 'rad_id', 'class_id'])\
    .count()\
    .reset_index()\
    .pivot(index=['image_id', 'rad_id'], columns='class_id', values='class_name')\
    .rename(columns={i:'n_class'+str(i) for i in range(15)})\
    .fillna(0)

classareas = train[['image_id', 'rad_id', 'class_id','area']]\
    .groupby(['image_id', 'rad_id', 'class_id'])\
    .sum()\
    .reset_index()\
    .pivot(index=['image_id', 'rad_id'], columns='class_id', values='area')\
    .rename(columns={i:'area_class'+str(i) for i in range(15)})\
    .fillna(0)

train = pd.merge( pd.merge( train, classcounts, on=['image_id', 'rad_id'], how='left'), 
                  classareas, on=['image_id', 'rad_id'], how='left')
train = train[train['class_id']!=14]

classes = train[['class_id', 'class_name', 'rad_id']]\
    .groupby(['class_id', 'class_name'])\
    .count()\
    .rename(columns={'rad_id': 'Number of records'})\
    .reset_index()

# Map each class_id to its class_name
label_dict = dict(zip(classes['class_id'], classes['class_name']))
        
f, axs = plt.subplots(5, 3, sharey=True, sharex=True, figsize=(16,28));

for class_id in range(14):
    axs[class_id // 3, class_id - 3*(class_id // 3)].imshow(locations[class_id], cmap='inferno', interpolation='nearest');
    axs[class_id // 3, class_id - 3*(class_id // 3)].set_title(str(class_id) + ': ' + label_dict[class_id])
    
plt.show();

In [None]:
# Let's zero out really low counts (<=3) - especially to avoid target leakage
locations[locations <= 3] = 0

In [None]:
for class_id in range(14):
    train['overlap'+str(class_id)] = 0.0

# locations is indexed as [class, y, x] (see where it is filled above),
# so we slice rows by y and columns by x.
for index, row in tqdm(train.iterrows(), total=train.shape[0]):
    y_slice = slice(int(np.floor(1000*row['y_min'])), int(np.ceil(1000*row['y_max'])))
    x_slice = slice(int(np.floor(1000*row['x_min'])), int(np.ceil(1000*row['x_max'])))
    for class_id in range(14):
        train.loc[index, 'overlap'+str(class_id)] = np.mean(locations[class_id, y_slice, x_slice])
        
#overlaps = ['overlap'+str(class_id) for class_id in range(14)]
#[ (class_id, train[overlaps[class_id]].values[idx]) for idx, class_id in enumerate(train['class_id'].values) if (class_id!=14) & (train[overlaps[class_id]].values[idx]>0)]        
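
To illustrate what the `overlap` features measure, here is a toy version on an 8 x 8 grid for a single class. The boxes are made up and the helper names (`to_slice`, `mean_overlap`) are hypothetical. The main point is the index convention: the heatmap is filled as `[y, x]`, so it has to be read back in the same order.

```python
import numpy as np

GRID = 8  # the notebook uses a 1000 x 1000 grid; 8 keeps the example readable

def to_slice(lo, hi, grid=GRID):
    """Grid cells covered by the relative interval [lo, hi]."""
    return slice(int(np.floor(grid * lo)), int(np.ceil(grid * hi)))

heatmap = np.zeros((GRID, GRID))  # indexed as [y, x]

# Made-up boxes for one class: (x_min, y_min, x_max, y_max) in relative coordinates.
training_boxes = [
    (0.25, 0.5, 0.75, 1.0),
    (0.25, 0.5, 0.5, 0.75),
]
for x_min, y_min, x_max, y_max in training_boxes:
    heatmap[to_slice(y_min, y_max), to_slice(x_min, x_max)] += 1

def mean_overlap(box):
    """Average heatmap count under a box, i.e. how typical its location is."""
    x_min, y_min, x_max, y_max = box
    return float(np.mean(heatmap[to_slice(y_min, y_max), to_slice(x_min, x_max)]))

inside = mean_overlap((0.25, 0.5, 0.5, 0.75))    # where both boxes overlap
partial = mean_overlap((0.25, 0.5, 0.75, 0.75))  # half covered twice, half once
outside = mean_overlap((0.0, 0.0, 0.25, 0.25))   # region no box ever touched
print(inside, partial, outside)
```

Note that each box contributes to the heatmap it is later compared against, which is why the notebook zeroes out very low counts before deriving these features.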

# Creating our features

Now we get to features based on what the other two radiologists who looked at the same image said:

In [None]:

def bbox_overlap(bbox1, bbox2):
    area1 = (bbox1['x_max']-bbox1['x_min'])*(bbox1['y_max']-bbox1['y_min'])
    area2 = (bbox2['x_max']-bbox2['x_min'])*(bbox2['y_max']-bbox2['y_min'])
    intersection = max(0, min(bbox1['x_max'], bbox2['x_max']) - max(bbox1['x_min'], bbox2['x_min'])) * max(0, min(bbox1['y_max'], bbox2['y_max']) - max(bbox1['y_min'], bbox2['y_min']))
    return intersection / ( area1 + area2 - intersection)

tmp1 = pd.merge(train, train, on='image_id', how='inner')

# Feature counting other classes assigned by the other two radiologists reviewing the image
class_counts = tmp1.loc[ (tmp1['rad_id_x']!=tmp1['rad_id_y']), ['image_id', 'rad_id_x', 'class_id_x', 'class_id_y', 'x_min_x', 'y_min_x', 'x_max_x', 'y_max_x', 'rad_id_y']]\
    .groupby(['image_id', 'rad_id_x', 'class_id_x', 'class_id_y', 'x_min_x', 'y_min_x', 'x_max_x', 'y_max_x'])\
    .count()\
    .reset_index()\
    .rename(columns={'rad_id_x': 'rad_id', 'class_id_x': 'class_id', 'x_min_x': 'x_min', 'x_max_x':'x_max', 'y_min_x': 'y_min', 'y_max_x': 'y_max', 'rad_id_y': 'class_count'})\
    .pivot(index=['image_id', 'rad_id', 'class_id', 'x_min', 'y_min', 'x_max', 'y_max'], columns='class_id_y', values='class_count')\
    .fillna(0)\
    .reset_index()

tmp1 = tmp1[ (tmp1['rad_id_x']!=tmp1['rad_id_y']) & (tmp1['class_id_x']!=14)  & (tmp1['class_id_y']!=14)]
tmp1 = tmp1[['image_id', 'rad_id_x', 'rad_id_y', 'class_id_x', 'class_id_y', 'x_min_x', 'y_min_x', 'x_max_x', 'y_max_x', 'x_min_y', 'y_min_y', 'x_max_y', 'y_max_y']]

# Derive intersection-over-union vs. other bounding boxes by the other radiologists
tmp1['iou'] = [bbox_overlap({'x_min': row['x_min_x'], 'x_max': row['x_max_x'], 'y_min':row['y_min_x'], 'y_max':row['y_max_x']}, {'x_min': row['x_min_y'], 'x_max': row['x_max_y'], 'y_min':row['y_min_y'], 'y_max':row['y_max_y']}) for idx, row in tmp1.iterrows()]

# What is the maximum IOU?
maxiou = tmp1[['image_id', 'rad_id_x', 'rad_id_y', 'class_id_x', 'class_id_y', 'x_min_x', 'y_min_x', 'x_max_x', 'y_max_x', 'iou']]\
    .groupby(['image_id', 'rad_id_x', 'rad_id_y', 'class_id_x', 'class_id_y', 'x_min_x', 'y_min_x', 'x_max_x', 'y_max_x'])\
    .max()\
    .reset_index()\
    .rename(columns={'rad_id_x': 'rad_id', 'class_id_x': 'class_id', 'x_min_x': 'x_min', 'x_max_x':'x_max', 'y_min_x': 'y_min', 'y_max_x': 'y_max'})

# What is the sum of IOU?
sumiou = tmp1[['image_id', 'rad_id_x', 'rad_id_y', 'class_id_x', 'class_id_y', 'x_min_x', 'y_min_x', 'x_max_x', 'y_max_x', 'iou']]\
    .groupby(['image_id', 'rad_id_x', 'rad_id_y', 'class_id_x', 'class_id_y', 'x_min_x', 'y_min_x', 'x_max_x', 'y_max_x'])\
    .sum()\
    .reset_index()\
    .rename(columns={'rad_id_x': 'rad_id', 'class_id_x': 'class_id', 'x_min_x': 'x_min', 'x_max_x':'x_max', 'y_min_x': 'y_min', 'y_max_x': 'y_max'})

tmp_cols = ['image_id', 'rad_id', 'class_id', 'class_id_y', 'x_min', 'y_min', 'x_max', 'y_max']
# Take maximum or min or mean over the two other radiologists for these two metrics.
maxsumiou = sumiou[tmp_cols+['iou']].groupby(tmp_cols).max().reset_index()
minsumiou = sumiou[tmp_cols+['iou']].groupby(tmp_cols).min().reset_index()
meansumiou = sumiou[tmp_cols+['iou']].groupby(tmp_cols).mean().reset_index()
maxmaxiou = maxiou[tmp_cols+['iou']].groupby(tmp_cols).max().reset_index()
minmaxiou = maxiou[tmp_cols+['iou']].groupby(tmp_cols).min().reset_index()
meanmaxiou = maxiou[tmp_cols+['iou']].groupby(tmp_cols).mean().reset_index()

tmp_cols = ['image_id', 'rad_id', 'class_id', 'x_min', 'y_min', 'x_max', 'y_max']
maxsumiou = maxsumiou\
    .pivot(index=tmp_cols, columns='class_id_y', values='iou')\
    .fillna(0)\
    .reset_index()\
    .rename(columns={i:'maxsumiou'+str(i) for i in range(14)})
minsumiou = minsumiou\
    .pivot(index=tmp_cols, columns='class_id_y', values='iou')\
    .fillna(0)\
    .reset_index()\
    .rename(columns={i:'minsumiou'+str(i) for i in range(14)})
meansumiou = meansumiou\
    .pivot(index=tmp_cols, columns='class_id_y', values='iou')\
    .fillna(0)\
    .reset_index()\
    .rename(columns={i:'meansumiou'+str(i) for i in range(14)})
maxmaxiou = maxmaxiou\
    .pivot(index=tmp_cols, columns='class_id_y', values='iou')\
    .fillna(0)\
    .reset_index()\
    .rename(columns={i:'maxmaxiou'+str(i) for i in range(14)})
minmaxiou = minmaxiou\
    .pivot(index=tmp_cols, columns='class_id_y', values='iou')\
    .fillna(0)\
    .reset_index()\
    .rename(columns={i:'minmaxiou'+str(i) for i in range(14)})
meanmaxiou = meanmaxiou\
    .pivot(index=tmp_cols, columns='class_id_y', values='iou')\
    .fillna(0)\
    .reset_index()\
    .rename(columns={i:'meanmaxiou'+str(i) for i in range(14)})

train = pd.merge(pd.merge(pd.merge(pd.merge(pd.merge(pd.merge(pd.merge(train,
    class_counts, on=tmp_cols, how='left'),
    maxsumiou, on=tmp_cols, how='left'),
    minsumiou, on=tmp_cols, how='left'),
    meansumiou,on=tmp_cols, how='left'),
    maxmaxiou, on=tmp_cols, how='left'),
    minmaxiou, on=tmp_cols, how='left'),
    meanmaxiou,on=tmp_cols, how='left')
train = train[train['class_id']!=14].fillna(0)
train
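
As a worked example of the IOU aggregation above, the sketch below computes the features for a single box against made-up boxes from two other radiologists. For brevity it ignores the split by the class of the other box, which the notebook performs via `class_id_y`, and all coordinates are invented.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

box = (0.0, 0.0, 2.0, 2.0)
# Boxes drawn by the two other radiologists on the same image (made up).
others = {
    "R9": [(1.0, 1.0, 3.0, 3.0), (0.0, 0.0, 2.0, 2.0)],
    "R10": [(5.0, 5.0, 6.0, 6.0)],
}

# Step 1: per other radiologist, take the maximum and the sum of IOUs.
max_iou = {rad: max(iou(box, b) for b in boxes) for rad, boxes in others.items()}
sum_iou = {rad: sum(iou(box, b) for b in boxes) for rad, boxes in others.items()}

# Step 2: aggregate over the other radiologists (max, min and mean).
features = {
    "maxmaxiou": max(max_iou.values()),
    "minmaxiou": min(max_iou.values()),
    "meanmaxiou": sum(max_iou.values()) / len(max_iou),
    "maxsumiou": max(sum_iou.values()),
    "minsumiou": min(sum_iou.values()),
    "meansumiou": sum(sum_iou.values()) / len(sum_iou),
}
print(features)
```

Here the overlap with the shifted box is 1/7 (intersection 1, union 4 + 4 - 1), so a high `maxmaxiou` together with a low `minmaxiou` signals that only one of the other two radiologists drew a matching box.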

In [None]:
#PatientAge, PatientSex, PatientSize, PatientWeight
#list_of_ages = [string_to_float_process(string) for string in train_files['PatientAge'].replace(re.compile('Y$'), '') ]
#int(train['PatientAge'])

def string_to_float_process(string):
    if (string=='') | (string=='000D') | (string=='000'):
        return float("NaN")
    else:
        return float(string)

train['Age'] = [string_to_float_process(string) for string in train['PatientAge'].replace(re.compile('Y$'), '') ]

In [None]:
train['Sex'] = train['PatientSex'].fillna('Q').map({'M':0, 'F':3, 'O':1, 'Q':2}).astype('category')
train = train.reset_index(drop=True)

# Set-up for modelling including CV based on image_id

In [None]:
kf = KFold(n_splits=5)

image_ids = np.unique(train['image_id'].values)
fold_assignments = { image_ids[val_idx]: fold for fold, idxs in enumerate(kf.split(image_ids)) for val_idx in idxs[1]} #train_index, test_index
train['fold'] = train['image_id'].map(fold_assignments)

# LightGBM CV takes folds (generator or iterator of (train_idx, test_idx) tuples)
fold_splits = [ [train[train['fold']!=fff].index.tolist(), train[train['fold']==fff].index.tolist()] for fff in range(5) ]

cont_features = ['Age', 'x_min', 'y_min', 'x_max', 'y_max', 'width', 'height', 'area', 'x_center', 'y_center']\
    + [f'minsumiou{i}' for i in range(14)]\
    + [f'maxsumiou{i}' for i in range(14)]\
    + [f'meansumiou{i}' for i in range(14)]\
    + [f'minmaxiou{i}' for i in range(14)]\
    + [f'maxmaxiou{i}' for i in range(14)]\
    + [f'meanmaxiou{i}' for i in range(14)]\
    + [f'overlap{i}' for i in range(14)]\
    + [f'other_{i}' for i in range(14)]
#cat_features = ['Sex']

X = np.array( train[cont_features] )#+ cat_features
y = np.array( train['class_id'] ).flatten()
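
The key property of the fold set-up above is that all bounding boxes from one image end up in the same fold. The pure-Python sketch below shows the idea on toy data. It assigns images to folds round-robin, which is a simplification and not identical to the contiguous blocks that an unshuffled `KFold` produces, and the function names are hypothetical. scikit-learn's `GroupKFold` would be a library alternative.

```python
def assign_folds(image_ids, n_splits=5):
    """Give every distinct image_id a fold number (round-robin over sorted ids)."""
    unique_ids = sorted(set(image_ids))
    return {image_id: i % n_splits for i, image_id in enumerate(unique_ids)}

def make_fold_splits(image_ids, n_splits=5):
    """Return (train_idx, valid_idx) row-index pairs, grouped by image."""
    fold_of = assign_folds(image_ids, n_splits)
    row_folds = [fold_of[image_id] for image_id in image_ids]
    splits = []
    for fold in range(n_splits):
        train_idx = [r for r, f in enumerate(row_folds) if f != fold]
        valid_idx = [r for r, f in enumerate(row_folds) if f == fold]
        splits.append((train_idx, valid_idx))
    return splits

# One entry per bounding box; several boxes can belong to the same image.
rows = ["a", "a", "b", "c", "c", "c", "d"]
splits = make_fold_splits(rows, n_splits=2)

# No image may appear on both sides of a split, otherwise we leak information.
for train_idx, valid_idx in splits:
    train_images = {rows[i] for i in train_idx}
    valid_images = {rows[i] for i in valid_idx}
    assert train_images.isdisjoint(valid_images)
print(splits)
```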


# LightGBM modeling

I did an initial hyperparameter search using `LightGBMTunerCV` from `optuna.integration.lightgbm`, which implements a sensible tuning strategy for LightGBM. I've now commented this out, because it's a bit time-consuming.

If you want to find out more about tuning LightGBM with `optuna`, I've got [a notebook on this](https://www.kaggle.com/bjoernholzhauer/lightgbm-tuning-with-optuna) using the Titanic dataset.

In [None]:
#import optuna.integration.lightgbm as lgb
#import optuna

#dtrain = lgb.Dataset(X, label=y)

#params = {
#    'objective': 'multiclass',
#    'boosting_type': 'gbdt',    
#    'num_class':14,
#    'metric': 'multi_logloss',#['multi_logloss', 'multi_error'],
#    'learning_rate': 0.05, #0.005,
#}

#study_tuner = optuna.create_study(direction='minimize')

#tuner = lgb.LightGBMTunerCV(params, dtrain,                             
#                            study=study_tuner,
#                            verbose_eval=10, #True, #False,
#                            #categorical_feature=cat_features,
#                            early_stopping_rounds=25,
#                            time_budget=19800, # Time budget of 5.5 hours (19800 s), we will not really need it
#                            seed = 42,
#                            folds=fold_splits,
#                            num_boost_round=800, #10000,
#                            #callbacks=[lgb.reset_parameter(learning_rate = [0.005]*200 + [0.001]*9800) ] #[0.1]*5 + [0.05]*15 + [0.01]*45 + 
#                           )
#tuner.run()
#print(tuner.best_params)
#print(tuner.best_score)



The results I received were
> {'objective': 'multiclass', 
>  'boosting_type': 'gbdt', 
>  'num_class': 14, 
>  'metric': 'multi_logloss', 
>  'learning_rate': 0.05, 
>  'feature_pre_filter': False, 
>  'lambda_l1': 4.053841983827334, 
>  'lambda_l2': 8.075199666030122e-06, 
>  'num_leaves': 25, 
>  'feature_fraction': 0.5, 
>  'bagging_fraction': 0.875336404331162, 
>  'bagging_freq': 6, 
>  'min_child_samples': 100}

And the best log-loss was:
> Best Score: 0.7220672051541743

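Now let's round these parameter values a bit and fit the model with a lower learning rate. Note that `lambda_l2` is set to 8 here, which is much stronger L2 regularization than the tuned value of about 8e-06: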

In [None]:
import lightgbm as lgb

params = {'objective': 'multiclass', 
 'boosting_type': 'gbdt', 
 'num_class': 14, 'metric': ['multi_logloss', 'multi_error'],
 'learning_rate': 0.005, 
 'feature_pre_filter': False, 
 'lambda_l1': 4, 
 'lambda_l2': 8, 
 'num_leaves': 25, 
 'feature_fraction': 0.5, 
 'bagging_fraction': 0.875, 
 'bagging_freq': 6, 
 'min_child_samples': 100,
 'num_threads': 4}

dtrain = lgb.Dataset(X, label=y)

#lgbcv = lgb.cv(params, dtrain, num_boost_round=5000, folds=fold_splits,verbose_eval=True)

Doing cross-validation to see a good number of iterations suggests around 3950:
> [3948]	cv_agg's multi_logloss: 0.719552 + 0.019352	cv_agg's multi_error: 0.250182 + 0.00657187

> [3949]	cv_agg's multi_logloss: 0.719551 + 0.0193541	cv_agg's multi_error: 0.250182 + 0.00657187

> [3950]	cv_agg's multi_logloss: 0.719552 + 0.0193557	cv_agg's multi_error: 0.250209 + 0.00659723

> [3951]	cv_agg's multi_logloss: 0.719553 + 0.019359	cv_agg's multi_error: 0.250181 + 0.00660947

So we'll refit the model on all the data without CV, using that number of iterations. We'll show the training error (since we no longer have a validation dataset).

In [None]:
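lgbfit = lgb.train(params, dtrain, num_boost_round=3950, valid_sets=[dtrain],
                   callbacks=[lgb.log_evaluation(period=250)])
# We only add valid_sets to get some updates as the model is trained
# (in LightGBM versions before 4.0, verbose_eval=250 does the same job as the callback).
# It only shows performance on the training set.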

# What labels does the model find weird?

As we can see in the plots below, some bounding boxes seem to be very distinctive (i.e. the model manages to assign high probabilities to the ones with this `class_id`).

These are primarily **Aortic enlargement**, **Cardiomegaly** and **Pneumothorax**, and for these classes there are then some observations with very low probabilities that we should regard as under suspicion of being mislabelled. Of course, this could also be a split into cases/bounding boxes that are easy and difficult to label. I'm especially unsure about pneumothorax, because there are quite a lot of low-probability records.

Then there are somewhat bimodal distributions of probabilities, e.g. for **Pleural effusion** and **Pleural thickening**. Such a phenomenon may also be occurring for **Other lesion**, **Calcification** and **Consolidation**.

The other classes seem to have more of a smooth distribution.

In [None]:
preds = lgbfit.predict(X)
# Predicted probability of the class that was actually assigned to each bounding box
train['prob'] = preds[np.arange(len(y)), y]

In [None]:
train['first_pred'] = np.argsort(preds, axis=1)[:,-1:].flatten()
train['second_pred'] = np.argsort(preds, axis=1)[:,-2:-1].flatten()
train['alternative'] = (train['first_pred']==train['class_id']) * train['second_pred']\
    + (train['first_pred']!=train['class_id'])*train['first_pred']
train['first_prob'] = np.sort(preds, axis=1)[:,-1:].flatten()
train['second_prob'] =np.sort(preds, axis=1)[:,-2:-1].flatten()
train['alternative_prob'] = (train['first_pred']==train['class_id']) * train['second_prob']\
    + (train['first_pred']!=train['class_id'])*train['first_prob']
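
The logic for the "alternative" label can be checked on a tiny made-up example: if the model's top prediction agrees with the assigned label, the alternative is the runner-up, otherwise it is the top prediction itself. The sketch uses `np.where` instead of the arithmetic above, which should give the same result.

```python
import numpy as np

# Made-up predicted probabilities for 3 boxes and 3 classes, plus assigned labels.
preds = np.array([[0.70, 0.20, 0.10],
                  [0.05, 0.15, 0.80],
                  [0.40, 0.35, 0.25]])
labels = np.array([0, 1, 1])
rows = np.arange(len(labels))

order = np.argsort(preds, axis=1)   # classes sorted by ascending probability
first_pred = order[:, -1]           # most likely class
second_pred = order[:, -2]          # second most likely class

label_prob = preds[rows, labels]    # probability of the assigned label
alternative = np.where(first_pred == labels, second_pred, first_pred)
alternative_prob = preds[rows, alternative]

print(alternative, alternative_prob)
```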

train

In [None]:
ax = sns.distplot(train["prob"], kde=False)

In [None]:
#for class_id in range(7):
#    ax = sns.kdeplot(train.rename(columns={'prob':f'Class_{class_id}'}).loc[train['class_id']==class_id, f"Class_{class_id}"]);
#plt.ylim(0, 5);
#plt.xlim(0, 1);

train['class_id_name'] = [f'{class_id}: {class_name}' for class_id, class_name in zip(train['class_id'], train['class_name'])]
g = sns.FacetGrid(train.sort_values('class_id'), col='class_id_name', 
                  sharex=False, sharey=False,
                  col_wrap=3, height=5);
g.map(plt.hist, 'prob', bins=30);


In [None]:
train['logit'] = np.log(train['prob']) - np.log1p(-train['prob'])
g = sns.FacetGrid(train.sort_values('class_id'), col='class_id_name', 
                  sharex=False, sharey=False,
                  col_wrap=3, height=5);
g.map(plt.hist, 'logit', bins=50);

In [None]:
train['flagged'] = ((train['class_id'].isin([0,3])) & (train['logit']< -2))\
    | ((train['class_id'].isin([1,2,4,5,7,9,11,13])) & (train['logit']< -6))\
    | ((train['class_id'].isin([6,8,10,12])) & (train['logit']< -5))
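
To get a feeling for what these logit thresholds mean on the probability scale, here is a small standalone sketch. The thresholds are the ones used in the cell above; the helper names are mine.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p) - math.log1p(-p)

def inverse_logit(x):
    """Probability corresponding to log-odds x."""
    return 1.0 / (1.0 + math.exp(-x))

# Per-class thresholds from the cell above; every other class uses -6.
LOGIT_THRESHOLDS = {0: -2.0, 3: -2.0, 6: -5.0, 8: -5.0, 10: -5.0, 12: -5.0}
DEFAULT_THRESHOLD = -6.0

def is_flagged(class_id, prob):
    """True if the model finds the assigned label implausible enough to flag."""
    return logit(prob) < LOGIT_THRESHOLDS.get(class_id, DEFAULT_THRESHOLD)

for threshold in (-2.0, -5.0, -6.0):
    print(threshold, round(inverse_logit(threshold), 4))
# -2.0 0.1192
# -5.0 0.0067
# -6.0 0.0025
```

So an aortic enlargement or cardiomegaly box is flagged when the model gives its label less than about a 12% probability, while most other classes need to fall below roughly 0.25%.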

In [None]:
train.to_csv('train_with_flags.csv', index=False)

In [None]:
train[train['flagged']]

# Getting SHAP explanations for cases the model considers implausible

To find out more about SHAP, have a look at [this notebook](https://www.kaggle.com/bjoernholzhauer/will-rose-or-jack-survive-lightgbm-and-shap) explaining LightGBM and SHAP.


In [None]:
explainer = shap.TreeExplainer(lgbfit)
shap_values = explainer.shap_values(train[cont_features].sample(n=10000, replace=False, random_state=42))

In [None]:
# Color-scheme from http://www.randalolson.com/2014/06/28/how-to-make-beautiful-data-visualizations-in-python-with-matplotlib/
tableau20 = [(31, 119, 180), (174, 199, 232), (255, 127, 14), (255, 187, 120),  
             (44, 160, 44), (152, 223, 138), (214, 39, 40), (255, 152, 150),  
             (148, 103, 189), (197, 176, 213), (140, 86, 75), (196, 156, 148),  
             (227, 119, 194), (247, 182, 210), (127, 127, 127), (199, 199, 199),  
             (188, 189, 34), (219, 219, 141), (23, 190, 207), (158, 218, 229)]
for i in range(len(tableau20)):  
    r, g, b = tableau20[i]  
    tableau20[i] = (r / 255., g / 255., b / 255.)
    
shap.summary_plot(shap_values, 
                  train[cont_features], 
                  color=matplotlib.colors.ListedColormap(tableau20))

In [None]:
flagged_idxs = np.where( train['flagged'] == True)[0]
flagged_idxs

In [None]:
def get_image(image_id, rad_id):
    with shelve.open('../input/eda-dicom-reading-vinbigdata-chest-x-ray/training_data.db', 
                     flag='r', writeback=False) as myshelf:
        tmpdict = myshelf[image_id]                        
    which_indices = [idx for idx, val in enumerate(tmpdict['rad_id']) if val==rad_id]
    image = np.stack([tmpdict['image']]*3).transpose(1,2,0)        
    bboxes = tmpdict['bboxes'][which_indices]
    class_labels = tmpdict['class_labels'][which_indices]        
    return {'image': image, 'bboxes': bboxes, 'class_labels': class_labels}

for index, row in train.loc[train['flagged'], 
                            ['image_id', 'rad_id', 'class_id', 
                             'class_name', 'prob', 'alternative', 'alternative_prob']]\
        .drop_duplicates()\
        .reset_index(drop=True)\
        .iterrows():
    image_id = row['image_id']    
    problem_class_id = row['class_id']
    problem_class = str(row['class_id']) + ": " + row['class_name']
    rad_id = int(re.findall(r'\d+', row['rad_id'])[0])
    im = get_image(image_id, rad_id)
    alternative = str(row['alternative']) + ": " + label_dict[row['alternative']]
    alt_prob = np.round(row['alternative_prob'],8)
    prob = np.round(row['prob'], 8)

    plt.figure(figsize=(20,10));    
    plt.imshow(im['image']);
    plt.suptitle(f'Image {image_id}, radiologist {rad_id}\n red: {problem_class} (prob={prob})\nAlternative: {alternative} (prob={alt_prob})', fontsize=16)    
    ax = plt.gca();

    for bbox, class_id in zip(im['bboxes'], im['class_labels']):
        #print(f'{image_id}, {rad_id}, {bbox}, {class_id}')        
        if class_id==problem_class_id: 
            ecol='r'
            linestyle = 'dashed'
            offset=12
        else:
            ecol='b'
            linestyle = 'dotted'
            offset=24

        # Create a Rectangle patch
        rect = Rectangle((bbox[0],bbox[1]),bbox[2]-bbox[0],bbox[3]-bbox[1],
                         linewidth=1,linestyle=linestyle,
                         edgecolor=ecol,facecolor='none');
        # Add the patch to the Axes
        ax.add_patch(rect);
        ax.annotate(str(class_id) + ': ' + label_dict[class_id], 
                    xy=((bbox[2]+bbox[0])/2, bbox[1]+offset), 
                    xycoords='data', color=ecol);

# Individual explanations for predictions
Let's also look at 20 specific SHAP explanations for model predictions:

In [None]:
shap.initjs()
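def shap_explain_flgidx(flgidx):
    print(train.iloc[flgidx])
    class_id = int(train.iloc[flgidx]['class_id'])
    row = train[cont_features].iloc[[flgidx]]
    # shap_values above only covers a random sample of 10000 rows, so its row positions
    # do not match train. We therefore compute the SHAP values for this one row directly.
    # This assumes shap returns one array per class (as the indexing in this notebook does).
    row_shap = explainer.shap_values(row)
    return shap.force_plot(explainer.expected_value[class_id],
                row_shap[class_id][0],
                row.iloc[0])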

In [None]:
shap_explain_flgidx(flagged_idxs[0])

In [None]:
shap_explain_flgidx(flagged_idxs[1])

In [None]:
shap_explain_flgidx(flagged_idxs[2])

In [None]:
shap_explain_flgidx(flagged_idxs[3])

In [None]:
shap_explain_flgidx(flagged_idxs[4])

In [None]:
shap_explain_flgidx(flagged_idxs[5])

In [None]:
shap_explain_flgidx(flagged_idxs[6])

In [None]:
shap_explain_flgidx(flagged_idxs[7])

In [None]:
shap_explain_flgidx(flagged_idxs[8])

In [None]:
shap_explain_flgidx(flagged_idxs[9])

In [None]:
shap_explain_flgidx(flagged_idxs[10])

In [None]:
shap_explain_flgidx(flagged_idxs[11])

In [None]:
shap_explain_flgidx(flagged_idxs[12])

In [None]:
shap_explain_flgidx(flagged_idxs[13])

In [None]:
shap_explain_flgidx(flagged_idxs[14])

In [None]:
shap_explain_flgidx(flagged_idxs[15])

In [None]:
shap_explain_flgidx(flagged_idxs[16])

In [None]:
shap_explain_flgidx(flagged_idxs[17])

In [None]:
shap_explain_flgidx(flagged_idxs[18])

In [None]:
shap_explain_flgidx(flagged_idxs[19])