# **CheXpert Chest X-Ray Dataset**

- [Dataset](https://stanfordmlgroup.github.io/competitions/chexpert/)
- [Read the Paper (Irvin & Rajpurkar et al.)](https://arxiv.org/abs/1901.07031)


## What is CheXpert?

CheXpert is a large dataset of chest X-rays and a competition for automated chest X-ray interpretation. It features uncertainty labels and radiologist-labeled reference-standard evaluation sets.


## Why CheXpert?

Chest radiography is the most common imaging examination globally, critical for screening, diagnosis, and management of many life-threatening diseases. Automated chest radiograph interpretation at the level of practicing radiologists could provide substantial benefit in many medical settings, from improved workflow prioritization and clinical decision support to large-scale screening and global population health initiatives. For progress in both development and validation of automated algorithms, we realized there was a need for a labeled dataset that (1) was large, (2) had strong reference standards, and (3) provided expert human performance metrics for comparison.

---

## In the CheXpert series:

- [**🩻 CheXpert X-Rays Dataset | Data Visuals 🫁**](https://www.kaggle.com/code/shreydan/chexpert-x-rays-dataset-data-visuals): we explored the dataset visually, disease by disease.
- [**🩻 CheXpert | multi-label classifier 📊️**](https://www.kaggle.com/code/shreydan/chexpert-multi-label-classifier/): we trained a multi-label classifier for the following labels: Atelectasis, Cardiomegaly, Consolidation, Edema, Pleural Effusion. We used an asymmetric loss function and tracked multiple metrics: AUROC, specificity, Hamming loss, and exact match.

### In this notebook, we'll be creating pseudo-labels using test-time augmentation

### TTA Process:

- for every sample, apply the same random augmentations we used during training.
- repeat this N times to get N predictions for each sample.
- per sample and per label, keep the most frequent hard prediction (0/1) across those N passes.
- more data :)
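
The voting step in the list above can be sketched in plain Python. This is a minimal illustration on made-up predictions, not the notebook's actual implementation; the `majority_vote` helper and the sample values are hypothetical.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most frequent hard label (0 or 1) among N TTA predictions."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical hard predictions for 3 samples, N = 5 augmented passes each.
tta_votes = [
    [1, 1, 0, 1, 0],  # three of five passes predict 1
    [0, 0, 0, 1, 0],  # four of five passes predict 0
    [1, 1, 1, 1, 1],  # unanimous
]

pseudo_labels = [majority_vote(v) for v in tta_votes]
print(pseudo_labels)
```

With binary labels and an odd N, the vote can never end in a tie, which is one reason to prefer an odd number of passes.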

# Imports

In [1]:
!pip install -Uq transformers accelerate datasets

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from PIL import Image
from torchvision.utils import make_grid
import torchvision.transforms.functional as TF
import albumentations as A
from albumentations.pytorch import ToTensorV2
from tqdm.auto import tqdm
from sklearn.model_selection import StratifiedGroupKFold
from pathlib import Path
from collections import Counter
import datasets
from transformers import AutoModelForImageClassification, Trainer, TrainingArguments
from torchmetrics.functional.classification import (multilabel_auroc, 
                                                    multilabel_exact_match,
                                                    multilabel_specificity,
                                                    multilabel_hamming_distance
                                                   )
from types import SimpleNamespace
import wandb

tqdm.pandas()


# Our Trained Model and Original Labels

In [3]:
model_path = '/kaggle/input/chexpert-multi-label-classifier/convnextv2-tiny-chexpert'
train_df_path = '/kaggle/input/chexpert-multi-label-classifier/train_df.csv'

In [4]:
df = pd.read_csv(train_df_path)

### Let's predict on samples that have uncertain (-1) labels only, to save time.

- good_df: no uncertainty
- label_df: samples have -1 values, we'll predict for all these samples.
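
As a rough sketch of this split, here is the same idea on a toy frame. The rows and the `toy_*` names are made up for illustration; only the comma-separated `labels` format is taken from the notebook.

```python
import pandas as pd

# Hypothetical rows; labels are comma-separated strings as in train_df.csv.
toy = pd.DataFrame({
    "Path": ["a.jpg", "b.jpg", "c.jpg"],
    "labels": ["0,0,1,0,0", "0,-1,0,0,1", "-1,-1,0,0,0"],
})

# regex=False treats '-1' as a literal substring rather than a pattern.
uncertain = toy["labels"].str.contains("-1", regex=False)
toy_good = toy[~uncertain].reset_index(drop=True)   # no uncertain labels
toy_label = toy[uncertain].reset_index(drop=True)   # at least one -1

print(len(toy_good), len(toy_label))
```

A row lands in the uncertain split as soon as any one of its five labels is -1, so a single uncertain finding is enough to send the sample through TTA.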

In [5]:
good_df = df[~df['labels'].str.contains('-1')].reset_index(drop=True)
good_df

Unnamed: 0,Path,PatientID,fold,labels
0,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,1,4,"0,0,0,0,0"
1,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,3,4,"0,0,0,1,0"
2,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,4,2,"0,0,0,0,0"
3,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,4,2,"0,0,0,0,0"
4,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,5,5,"0,0,0,0,0"
...,...,...,...,...
159324,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,64533,1,"0,1,0,1,1"
159325,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,64536,1,"0,0,0,1,1"
159326,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,64536,1,"1,0,0,1,0"
159327,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,64539,4,"1,1,0,0,0"


In [6]:
label_df = df[df['labels'].str.contains('-1')].reset_index(drop=True)
label_df

Unnamed: 0,Path,PatientID,fold,labels
0,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,2,2,"-1,-1,-1,-1,-1"
1,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,2,2,"0,0,-1,0,0"
2,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,2,2,"0,0,-1,0,0"
3,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,11,6,"0,0,-1,0,-1"
4,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,11,6,"-1,0,0,0,-1"
...,...,...,...,...
64080,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,64534,1,"-1,0,0,0,0"
64081,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,64535,2,"-1,0,-1,0,0"
64082,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,64537,3,"-1,0,0,0,1"
64083,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,64537,3,"-1,0,0,0,-1"


# Augmentations

In [7]:
img_size = 384
mean = (0.485,0.456,0.406)
std = (0.229,0.224,0.225)
label_tfms = A.Compose([
    A.Rotate(15),
    A.Resize(img_size,img_size),
    A.HorizontalFlip(),
    A.Normalize(mean=mean,std=std),
    ToTensorV2(),
])

In [8]:
def tta_transforms(batch):
    batch['image'] = [np.array(x.convert('RGB')) for x in batch['Path']]
    batch['image'] = [label_tfms(image=x)['image'] for x in batch['image']]
    batch['labels'] = [list(map(int,l.split(','))) for l in batch['labels']]
    return batch

In [9]:
def collate_fn(batch):
    return {
        'pixel_values': torch.stack([x['image'] for x in batch]),
        'labels': torch.tensor([x['labels'] for x in batch])
    }

In [10]:
label_ds = datasets.Dataset.from_pandas(label_df).cast_column('Path',datasets.Image())
label_ds = label_ds.with_transform(tta_transforms)

In [11]:
dl = torch.utils.data.DataLoader(
    label_ds,
    batch_size = 64,
    shuffle=False,
    collate_fn=collate_fn,
    num_workers = 4,
    pin_memory = True
)

## I'm choosing N=5 for the number of predictions per sample

In [12]:
tta_count = 5

In [13]:
model = AutoModelForImageClassification.from_pretrained(model_path).to('cuda')

In [14]:
target_cols = ['Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema', 'Pleural Effusion']

In [15]:
tta_predictions = []
model.eval()
with torch.inference_mode():
    
    for _ in range(tta_count):
        
        predictions = {t:[] for t in target_cols}
        for batch in tqdm(dl, total=len(dl)):
            pixel_values = batch['pixel_values'].to('cuda')
            logits = model(pixel_values=pixel_values).logits
            probs = logits.sigmoid().detach().cpu().numpy()
            hard_labels = (probs > 0.5).astype(int) # threshold = 0.5
            # group hard labels by disease: {disease: hard labels for this batch}
            preds = {t:h for t,h in zip(target_cols,hard_labels.T)}
            del logits, probs, hard_labels
            
            # accumulate this batch's predictions per disease
            for t in target_cols:
                predictions[t].append(preds[t])
        
        # concatenate all batches into one 1-D array per disease
        predictions = {t:np.hstack(p).flatten() for t,p in predictions.items()}
        tta_predictions.append(predictions)
        
final_predictions = {t:[] for t in target_cols}
for t in target_cols:
    # column stack: shape (num_samples, N), where N = number of TTA passes
    col_tta = np.column_stack([tta[t] for tta in tta_predictions])
    # majority vote: pick the most frequent hard label in each row
    final_t_preds = [np.bincount(x).argmax() for x in col_tta]
    # final predictions for target disease 
    final_predictions[t] = final_t_preds
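
For binary hard labels, the per-row `np.bincount(...).argmax()` vote can also be written as a single vectorized comparison. The sketch below checks that the two agree on randomly generated stand-in predictions; the array here is synthetic, not model output.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical hard predictions: 8 samples x 5 TTA passes, values in {0, 1}.
col_tta = rng.integers(0, 2, size=(8, 5))

# Per-row vote, mirroring the list comprehension in the cell above.
loop_vote = np.array([np.bincount(row).argmax() for row in col_tta])

# Vectorized vote: label 1 wins when more than half of the passes predict 1.
vector_vote = (col_tta.sum(axis=1) * 2 > col_tta.shape[1]).astype(int)

assert np.array_equal(loop_vote, vector_vote)
print(vector_vote)
```

Both forms resolve an exact tie (possible only with an even N) to 0, because `argmax` returns the first maximum and the strict `>` comparison fails on a tie.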

# Pseudo-Labeling

In [16]:
predictions_df = pd.DataFrame(final_predictions)

In [17]:
# split original labels back into separate columns
label_df[target_cols] = label_df['labels'].str.split(',',expand=True)
label_df[target_cols] = label_df[target_cols].astype(int)

In [18]:
# copy original labels
new_label_df = label_df.copy()
new_label_df.head()

Unnamed: 0,Path,PatientID,fold,labels,Atelectasis,Cardiomegaly,Consolidation,Edema,Pleural Effusion
0,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,2,2,"-1,-1,-1,-1,-1",-1,-1,-1,-1,-1
1,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,2,2,"0,0,-1,0,0",0,0,-1,0,0
2,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,2,2,"0,0,-1,0,0",0,0,-1,0,0
3,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,11,6,"0,0,-1,0,-1",0,0,-1,0,-1
4,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,11,6,"-1,0,0,0,-1",-1,0,0,0,-1


In [19]:
# replace uncertain labels ONLY from the predictions_df in the new_label_df
res = new_label_df.where(new_label_df != -1, predictions_df) # replace values where condition is False
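
To see what `DataFrame.where` does here, consider a small sketch with made-up values. The two frames below are hypothetical; they only mimic the shape of `new_label_df` and `predictions_df`.

```python
import pandas as pd

# Hypothetical original labels (-1 = uncertain) and model predictions.
original = pd.DataFrame({"Edema": [0, -1, 1], "Cardiomegaly": [-1, 0, -1]})
predicted = pd.DataFrame({"Edema": [1, 1, 0], "Cardiomegaly": [1, 1, 0]})

# Keep `original` where the condition holds; take `predicted` elsewhere.
filled = original.where(original != -1, predicted)
print(filled)
```

`where` aligns the two frames on index and column names, so both need the same row order. In this notebook both frames carry a default 0..n-1 index, which is what makes the replacement line up.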

In [20]:
predictions_df.tail()

Unnamed: 0,Atelectasis,Cardiomegaly,Consolidation,Edema,Pleural Effusion
64080,0,0,0,0,0
64081,0,0,0,0,0
64082,1,0,0,0,1
64083,0,0,0,0,0
64084,0,0,0,0,1


In [21]:
# as you'll notice, only the -1 values have been replaced with their predictions;
# the original labels are kept as they are.
res.tail()

Unnamed: 0,Path,PatientID,fold,labels,Atelectasis,Cardiomegaly,Consolidation,Edema,Pleural Effusion
64080,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,64534,1,"-1,0,0,0,0",0,0,0,0,0
64081,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,64535,2,"-1,0,-1,0,0",0,0,0,0,0
64082,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,64537,3,"-1,0,0,0,1",1,0,0,0,1
64083,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,64537,3,"-1,0,0,0,-1",0,0,0,0,0
64084,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,64538,0,"0,0,0,-1,0",0,0,0,0,0


In [22]:
def combine_label_cols(df, label_cols):
    combine = lambda x: ','.join([f"{x[c]}" for c in label_cols])
    df['labels'] = df.progress_apply(combine, axis=1)
    df = df.drop(columns=label_cols)
    return df.reset_index(drop=True)
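
The row-wise `progress_apply` works, but the same join can be expressed with pandas string methods. The variant below is a sketch of that alternative, not the notebook's function; `combine_label_cols_cat` and the toy frame are hypothetical, and it has not been benchmarked against the original.

```python
import pandas as pd

def combine_label_cols_cat(df, label_cols):
    """Join the per-disease columns into one comma-separated `labels` string."""
    out = df.copy()  # leave the caller's frame untouched
    first, rest = label_cols[0], label_cols[1:]
    # Element-wise string concatenation across columns, separated by commas.
    out["labels"] = out[first].astype(str).str.cat(
        [out[c].astype(str) for c in rest], sep=","
    )
    return out.drop(columns=label_cols).reset_index(drop=True)

# Hypothetical frame with two label columns.
toy = pd.DataFrame({"Path": ["a.jpg", "b.jpg"], "Edema": [0, 1], "Cardiomegaly": [1, 0]})
combined = combine_label_cols_cat(toy, ["Edema", "Cardiomegaly"])
print(combined)
```

Unlike the original helper, this version copies the frame first, so the input keeps its separate label columns.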

In [23]:
# recombine into one labels column
res = combine_label_cols(res, target_cols)

  0%|          | 0/64085 [00:00<?, ?it/s]

In [24]:
res

Unnamed: 0,Path,PatientID,fold,labels
0,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,2,2,"0,0,0,0,0"
1,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,2,2,"0,0,0,0,0"
2,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,2,2,"0,0,0,0,0"
3,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,11,6,"0,0,0,0,0"
4,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,11,6,"0,0,0,0,1"
...,...,...,...,...
64080,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,64534,1,"0,0,0,0,0"
64081,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,64535,2,"0,0,0,0,0"
64082,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,64537,3,"1,0,0,0,1"
64083,/kaggle/input/chexpert/CheXpert-v1.0-small/tra...,64537,3,"0,0,0,0,0"


In [25]:
# merge the good_df and new pseudo labels, and shuffle
# we still have the fold values to train a new model again
final_df = pd.concat([good_df, res]).sample(frac=1).reset_index(drop=True)
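
Since the merged frame reuses the existing `fold` column, it may be worth checking that no patient ends up in more than one fold after the concat. The sketch below does this on toy stand-ins for `good_df` and `res`; the data and the `random_state=42` seed are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for good_df and res.
toy_good = pd.DataFrame({"PatientID": [1, 1, 2], "fold": [0, 0, 1],
                         "labels": ["0,0", "0,1", "1,0"]})
toy_res = pd.DataFrame({"PatientID": [2, 3], "fold": [1, 2],
                        "labels": ["1,1", "0,0"]})

# A fixed random_state makes the shuffle repeatable across runs.
merged = (pd.concat([toy_good, toy_res])
            .sample(frac=1, random_state=42)
            .reset_index(drop=True))

# Each patient should sit in exactly one fold; otherwise validation leaks.
folds_per_patient = merged.groupby("PatientID")["fold"].nunique()
assert (folds_per_patient == 1).all()
print(merged)
```

The same `groupby` check can be run on `final_df` directly before saving.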

# Distribution of labels (original + pseudo)

In [26]:
def get_fold(df,fold=0):
    train = df[df['fold']!=fold].reset_index(drop=True)
    valid = df[df['fold']==fold].reset_index(drop=True)
    return train, valid

dist_df = final_df.copy()
dist_df[target_cols] = dist_df['labels'].str.split(',',expand=True)
dist_df[target_cols] = dist_df[target_cols].astype(int)
fold_0_train,fold_0_valid = get_fold(dist_df, fold=0)

for col in target_cols:
    t = fold_0_train[col].value_counts().to_dict()
    v = fold_0_valid[col].value_counts().to_dict()
    print(col)
    for k in [0,1]:
        print(f"{k}\ttrain: {t[k]} valid: {v[k]}")
    print()

Atelectasis
0	train: 154210 valid: 21667
1	train: 41641 valid: 5896

Cardiomegaly
0	train: 169344 valid: 23893
1	train: 26507 valid: 3670

Consolidation
0	train: 182131 valid: 25731
1	train: 13720 valid: 1832

Edema
0	train: 141971 valid: 20084
1	train: 53880 valid: 7479

Pleural Effusion
0	train: 113402 valid: 16000
1	train: 82449 valid: 11563



# Saving

We'll be using this final_df to train our new model!

In [27]:
final_df.to_csv('train_df.csv',index=False)

# Upcoming:

- Training a new model from the newly created pseudo-labels along with the original labels

# Thank you for checking out my work! :)

## **Follow [@shreydan](https://kaggle.com/shreydan) if you haven't already!**

### Stay tuned for more notebooks on this dataset :)