### What is the competition?

This competition is about identifying and localizing COVID-19 abnormalities on chest radiographs. Each chest radiograph must be categorized into one of 4 classes:

* Negative for Pneumonia
* Typical Appearance 
* Indeterminate Appearance
* Atypical Appearance

Along with that, we need to predict bounding boxes that describe the abnormalities. Hence this is an object detection/localization and classification problem. The train and test images in this competition are in DICOM format.

### What is DICOM?

DICOM® — [Digital Imaging and Communications in Medicine](https://www.dicomstandard.org) — is the ISO recognized international standard for medical images and related information. It defines the formats for medical images that can be exchanged with the data and quality necessary for clinical use.

### What is Object Detection?
A typical use of a convolutional neural network is to classify an image into one of several classes, for example dog vs. cat, or handwritten digits as in the MNIST dataset. Object detection, however, is about identifying where one or more objects are in a given image and classifying each of them. Object localization defines a bounding box (the location) of the classified object within the image. An example is given below.

![YOLO-Object-Detection](https://user-images.githubusercontent.com/48846576/119601441-cca11500-bdae-11eb-8711-8e0b2683dd19.png)
 <div align='center'>Source: You Only Look Once: Unified, Real-Time Object Detection <a href="https://pjreddie.com/darknet/yolo/">https://pjreddie.com/darknet/yolo/</a></div>
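
As a rough illustration of what a localization output looks like, the sketch below describes a bounding box by its top-left corner plus width and height (the same `x`, `y`, `width`, `height` keys that the `boxes` column uses later in this notebook) and converts it to corner coordinates. The helper names `to_corners` and `box_area` and the sample numbers are made up for illustration.

```python
def to_corners(box):
    """Convert an {x, y, width, height} box to (xmin, ymin, xmax, ymax)."""
    xmin, ymin = box["x"], box["y"]
    return (xmin, ymin, xmin + box["width"], ymin + box["height"])


def box_area(box):
    """Area of the box in pixels."""
    return box["width"] * box["height"]


# Hypothetical box, not taken from the competition data
sample_box = {"x": 100, "y": 50, "width": 200, "height": 300}
print(to_corners(sample_box))  # (100, 50, 300, 350)
print(box_area(sample_box))    # 60000
```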


In [None]:
import gc

import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.offline import iplot, init_notebook_mode
from plotly.subplots import make_subplots

init_notebook_mode(connected=True)
# setting default template to plotly_white for all visualizations
pio.templates.default = "plotly_white"
%matplotlib inline

from colorama import Fore, Back, Style

y_ = Fore.YELLOW
r_ = Fore.RED
g_ = Fore.GREEN
b_ = Fore.BLUE
m_ = Fore.MAGENTA
c_ = Fore.CYAN
res = Style.RESET_ALL

import warnings
warnings.filterwarnings('ignore')

In [None]:
!pip install python-gdcm

<div style="background-color:#fdb913; font-size:120%;  font-family:sans-serif; text-align:center"><b>Load and Explore Data</b></div>


In [None]:
PATH = '/kaggle/input/siim-covid19-detection/'
submission = pd.read_csv('/kaggle/input/siim-covid19-detection/sample_submission.csv', index_col=None)
image_df = pd.read_csv('/kaggle/input/siim-covid19-detection/train_image_level.csv', index_col=None)
study_df = pd.read_csv('/kaggle/input/siim-covid19-detection/train_study_level.csv', index_col=None)
pd.set_option('display.max_columns', None)  
pd.set_option('display.max_colwidth', None)
print(f"{y_}Train image level csv shape : {image_df.shape}{res}\n{g_}Train study level csv shape : {study_df.shape}{res}")

In [None]:
import os
all_files = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        all_files.append(os.path.join(dirname, filename))

In [None]:
study_df

In [None]:
image_df

In [None]:
study_grp = pd.melt(study_df, id_vars=list(study_df.columns)[:1], value_vars=list(study_df.columns)[1:],
             var_name='label', value_name='value')
study_grp = study_grp.loc[study_grp['value']!=0]
colors = {'Typical Appearance' : '#DCD427',
'Negative for Pneumonia' : '#0092CC',
'Indeterminate Appearance' : '#CC3333',
#'Atypical Appearance' : '#779933',
          'Atypical Appearance' : '#E6E6E6'
         }

# Sum only the numeric 'value' column so the string 'id' column is not aggregated
study_grp = study_grp.groupby('label')['value'].sum().sort_values(ascending=False).reset_index()
study_grp['color'] = study_grp['label'].apply(lambda x: colors[x])
study_grp

In [None]:
def plot_study_label(df):
    pio.templates.default = "plotly_dark"
    fig = px.bar(df, x='label', y='value',
             hover_data=['label', 'value'], color='label',
             #labels={column: label},
             color_discrete_map=colors,
             text='value')
    fig.update_layout(xaxis={'categoryorder':'array', 'categoryarray': df['label'],
                             'title' : None, 
                             'showgrid':False},
                      yaxis={'showgrid':False,
                            'title' : 'Count'},
                      showlegend=False,
                     title = 'Study samples in train data')
    fig.update_traces(textfont_size=16)
    fig.show()


In [None]:
plot_study_label(study_grp)

In [None]:
study_grp['pct'] = round((study_grp['value'] / study_grp['value'].sum())*100,2)

fig = go.Figure(data=[go.Pie(labels=study_grp['label'],
                             values=study_grp['pct'],
                             hole=.3,
                             pull=[0.1, 0.1, 0.1, 0.1]
                            )
                     ]
               )
fig.update_traces(hoverinfo='label+percent', textinfo='percent', textfont_size=16,
                  marker=dict(colors=study_grp['color'], line=dict(color='#000000', width=2))
                 )
fig.update_layout(title={'text': "% of labels in training data",
        'y':0.9,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

<div style="background-color:#fdb913; font-size:120%;  font-family:sans-serif; text-align:center"><b>DICOM Image</b></div>


### Overview 
DICOM (Digital Imaging and Communications in Medicine) is a standard format for encoding and transmitting medical images. The format stores image metadata such as patient information, image acquisition parameters, image size and pixel size along with the actual image. The metadata is stored in the DICOM header. The pixel data may be compressed using techniques such as JPEG, lossless JPEG or run-length encoding (RLE). Let's load a sample image and look at its header (metadata) and the actual `pixel_array`.

### Bit-Depth

The pixel range of an image format is determined by its bit-depth. The range is $[0, 2^{bitdepth} - 1]$. For example, an 8-bit image has a range of $[0, 2^{8} - 1] = [0, 255]$. Most common photographic formats such as JPEG and PNG typically use 8 bits per channel and store only non-negative values. A JPEG image with 3 channels (RGB) has a bit-depth of 8 for each channel, hence a total bit-depth of 24.

However, medical images use a higher bit-depth since higher accuracy is needed. For example, the metadata of the sample image shown below reports 16 bits in the (0028, 0100) Bits Allocated field. Hence the range of pixel values for a 16-bit image is $[0, 65535]$, for a total of $2^{16} = 65536$ values.
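
The range formula above can be checked with a few lines of plain Python. This is only a sketch: `pixel_range` and `scale_to_8bit` are hypothetical helper names, and the linear rescaling to 8 bits is one simple way to bring a high bit-depth value into display range, not something this notebook relies on.

```python
def pixel_range(bit_depth):
    """Return the (min, max) values of an unsigned image with `bit_depth` bits."""
    return 0, 2 ** bit_depth - 1


def scale_to_8bit(value, bit_depth):
    """Linearly map an unsigned `bit_depth`-bit value onto [0, 255]."""
    _, max_value = pixel_range(bit_depth)
    return round(value * 255 / max_value)


print(pixel_range(8))             # (0, 255)
print(pixel_range(16))            # (0, 65535)
print(scale_to_8bit(65535, 16))   # 255
```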


In [None]:
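from pydicom import dcmread
# read_file is the older name for dcmread and is not available in recent pydicom releases;
# keep it as an alias so the cells below that call read_file continue to work
read_file = dcmread
file_path = PATH + "train/00086460a852/9e8302230c91/65761e66de9f.dcm"
dicom = dcmread(file_path, stop_before_pixels=False)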

In [None]:
dicom

In [None]:
img=dicom.pixel_array
type(img), img.shape

In [None]:
from ast import literal_eval
import matplotlib.patches as patches

# Look up the annotation row for this image and parse its bounding boxes
box = image_df.loc[image_df['id']=='65761e66de9f_image'].reset_index(drop=True)
boxes = literal_eval(box.loc[0, 'boxes'])

# Create figure and axes
fig, ax = plt.subplots(figsize=(10, 8))
ax.imshow(img, cmap="gray")
# Draw one Rectangle patch per annotated bounding box
for b in boxes:
    rect = patches.Rectangle((b['x'], b['y']), b['width'], b['height'], linewidth=1.5, edgecolor='r', facecolor='none')
    ax.add_patch(rect)
plt.show()

Let's display some sample images.

<div style="background-color:#fdb913; font-size:120%;  font-family:sans-serif; text-align:center"><b>Display Samples</b></div>


In [None]:
from ast import literal_eval
def get_samples(num):
    study_df_grp = pd.melt(study_df, id_vars=list(study_df.columns)[:1], value_vars=list(study_df.columns)[1:],
             var_name='label', value_name='value')
    study_df_grp = study_df_grp.loc[study_df_grp['value']!=0].reset_index(drop=True)
    labels = list(study_df_grp['label'].unique())
    study_samples = {}
    for label in labels:
        study_ids = study_df_grp.loc[study_df_grp['label'] == label].sample(num)['id'].tolist() #Get num sample rows from the dataframe
        samples = []
        for study_id in study_ids:
            image = {}
            study_instance_id = study_id.split('_')[0]
            image_id = image_df.loc[image_df['StudyInstanceUID']==study_instance_id]['id'].values[0].split('_')[0] #Get the image matching study id
            file_name = [string for string in all_files if image_id in string]
            image['study_id'] = study_instance_id
            image['dicom_file'] = file_name[0]
            #Get the bounding boxes
            box = None
            try:
                box = literal_eval(image_df.loc[image_df['StudyInstanceUID']==study_instance_id]['boxes'].values[0])
            except ValueError:
                pass
            image['boxes'] = box
            samples.append(image)
        study_samples[label] = samples
    return study_samples

samples = get_samples(6)

def display_all_class_samples():
    ''' Show the first sample of each class (reads the module-level `samples` dict)
    '''
    all_class_samples = []
    for key in samples:
        sample_dict = samples[key][0]
        sample_dict['class'] = key
        all_class_samples.append(sample_dict)
    fig1, ax1 = plt.subplots(1,4, figsize=(18, 5), facecolor='w', edgecolor='b')
    fig1.subplots_adjust(hspace =.3, wspace=0.3)
    axs = ax1.ravel()
    for item, ax in zip(all_class_samples, axs):
        dicom = read_file(item['dicom_file'], stop_before_pixels=False)
        img = dicom.pixel_array
        ax.imshow(img, cmap="gray")
        if 'boxes' in item and item['boxes'] is not None:
            for box in item['boxes']:             
                rect = patches.Rectangle((box['x'], box['y']), box['width'], box['height'], linewidth=1.5, edgecolor='r', facecolor='none')
                ax.add_patch(rect)
        ax.set_title('{}'.format(item['class']),fontsize = 18)    
    plt.tight_layout(pad=3.0)
    plt.subplots_adjust(top=0.91)
    plt.suptitle('Samples across all classes',fontsize = 20)
    plt.show()

In [None]:
display_all_class_samples()

In [None]:
def display_samples(samples, title, draw_boxes=False):
    ''' Input : List of samples 
    '''
    fig1, ax1 = plt.subplots(2,3, figsize=(18, 12), facecolor='w', edgecolor='b')
    fig1.subplots_adjust(hspace =.3, wspace=0.3)
    axs = ax1.ravel()
    for item, ax in zip(samples, axs):
        dicom = read_file(item['dicom_file'], stop_before_pixels=False)
        img = dicom.pixel_array
        ax.imshow(img, cmap="gray")
        if draw_boxes == True and item['boxes'] is not None:
            for box in item['boxes']:             
                rect = patches.Rectangle((box['x'], box['y']), box['width'], box['height'], linewidth=1.5, edgecolor='r', facecolor='none')
                ax.add_patch(rect)
        ax.set_title('Study : {}'.format(item['study_id']),fontsize = 18)
        
    plt.tight_layout(pad=3.0)
    plt.subplots_adjust(top=0.91)
    plt.suptitle(title,fontsize = 20)
    plt.show()

<div style="background-color:#fdb913; font-size:120%;  font-family:sans-serif; text-align:center"><b>Negative for Pneumonia</b></div>


In [None]:
display_samples(samples['Negative for Pneumonia'],'Negative for Pneumonia')

In [None]:
def display_histogram(samples, title):
    ''' Input : List of samples 
    '''
    fig1, ax1 = plt.subplots(2,3, figsize=(18, 12), facecolor='w', edgecolor='b')
    fig1.subplots_adjust(hspace =.3, wspace=0.3)
    axs = ax1.ravel()
    for item, ax in zip(samples, axs):
        dicom = read_file(item['dicom_file'], stop_before_pixels=False)
        img = dicom.pixel_array
        sub_plot = sns.histplot(img.flatten(), ax=ax)
        ax.set_title('Study : {}'.format(item['study_id']),fontsize = 18)
        
    plt.tight_layout(pad=3.0)
    plt.subplots_adjust(top=0.91)
    plt.suptitle('Image Histogram - {}'.format(title),fontsize = 20)
    plt.show()


In [None]:
display_histogram(samples['Negative for Pneumonia'],'Negative for Pneumonia')

<div style="background-color:#fdb913; font-size:120%;  font-family:sans-serif; text-align:center"><b>Typical Appearance</b></div>


In [None]:
display_samples(samples['Typical Appearance'],'Typical Appearance', draw_boxes=True)

In [None]:
display_histogram(samples['Typical Appearance'],'Typical Appearance')

<div style="background-color:#fdb913; font-size:120%;  font-family:sans-serif; text-align:center"><b>Indeterminate Appearance</b></div>


In [None]:
display_samples(samples['Indeterminate Appearance'],'Indeterminate Appearance', draw_boxes=True)

In [None]:
display_histogram(samples['Indeterminate Appearance'],'Indeterminate Appearance')

<div style="background-color:#fdb913; font-size:120%;  font-family:sans-serif; text-align:center"><b>Atypical Appearance</b></div>


In [None]:
display_samples(samples['Atypical Appearance'],'Atypical Appearance', draw_boxes=True)

In [None]:
display_histogram(samples['Atypical Appearance'],'Atypical Appearance')

<div style="background-color:#fdb913; font-size:120%;  font-family:sans-serif; text-align:center"><b>Extract Metadata from DICOM Files</b></div>


In [None]:
def get_files(path_fragment):
    # Return every file whose path contains the given fragment, e.g. '/train/'
    return [file for file in all_files if path_fragment in file]
train_files = get_files('/train/')
test_files = get_files('/test/')

In [None]:
from pydicom.tag import Tag
from tqdm import tqdm
pvt_creator1 = Tag(0x2001, 0x10) #Private Creator 1
pvt_creator2 = Tag(0x0903, 0x10) #Private Creator 2

columns = [ 'StudyID',
 'StudyInstanceUID',
 'PatientSex', 
 'BitsAllocated',
 'BitsStored',
 'Columns',
 'Rows',
 'BodyPartExamined', 
 'HighBit', 
 'ImageType',
 'ImagerPixelSpacing',
 'InstanceNumber',
 'Modality',
 'PatientID',
 'PatientName',
 'AccessionNumber',
 'DeidentificationMethod',
 #'DeidentificationMethodCodeSequence',
 'PhotometricInterpretation',
 'PixelRepresentation',
 'SOPClassUID',
 'SOPInstanceUID',
 'SamplesPerPixel',
 'SeriesInstanceUID',
 'SeriesNumber',
 'SpecificCharacterSet',
 'StudyDate',
 'StudyTime',
 'PrivateCreator1',
 'PrivateCreator2']

def extract_metadata(columns, files):
    rows = []
    for file in tqdm(files):
        row = {}
        # Read only the header; pixel data is not needed for metadata
        dicom_img = read_file(file, stop_before_pixels=True)
        for col in columns:
            if col not in ['PrivateCreator1', 'PrivateCreator2']:
                row[col] = dicom_img[col].value
        try:        
            row['PrivateCreator1'] = dicom_img.get_item(pvt_creator1).value
            row['PrivateCreator2'] = dicom_img.get_item(pvt_creator2).value    
        except AttributeError:
            pass
        rows.append(row)
    # DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
    return pd.DataFrame(rows, columns=columns)

In [None]:
train_df = extract_metadata(columns, train_files)

In [None]:
test_df = extract_metadata(columns, test_files)

In [None]:
train_df['Rows'] = train_df['Rows'].astype(int)
train_df['Columns'] = train_df['Columns'].astype(int)
test_df['Rows'] = test_df['Rows'].astype(int)
test_df['Columns'] = test_df['Columns'].astype(int)
train_df.to_csv('train_imgs_meta.csv', index=None)
test_df.to_csv('test_imgs_meta.csv', index=None)

### Metadata of train images

In [None]:
train_df

In [None]:
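# Name the count column explicitly so the result does not depend on the pandas version
train_df['PatientSex'].value_counts().rename_axis('PatientSex').reset_index(name='Count')\
    .style.background_gradient(subset=['Count'], cmap='winter_r')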

In [None]:
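# Name the count column explicitly so the result does not depend on the pandas version
train_df['BodyPartExamined'].value_counts().rename_axis('BodyPartExamined').reset_index(name='Count')\
    .style.background_gradient(subset=['Count'], cmap='nipy_spectral_r')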

In [None]:
train_df['BitsStored'] = train_df['BitsStored'].astype(int)
def combine_image_size(row):
    return str(row['Rows']) + ',' + str(row['Columns'])
train_df['ImageSize'] = train_df.apply(lambda x: combine_image_size(x), axis=1)

In [None]:
fig = go.Figure(go.Scattergl(
    x=train_df['Rows'], y=train_df['Columns'],
    name='Image Size',
    mode='markers',  
    marker=dict(
        color='#0092CC',
    )
))
fig.update_layout(xaxis={'title' : 'Rows', 
                             'showgrid':False},
                      yaxis={'showgrid':False,
                            'title' : 'Columns'},
                      showlegend=False,
                     title = 'Train - image size')
fig.update_traces(textfont_size=16)
fig.show()



### Metadata of test images

In [None]:
test_df

In [None]:
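# Name the count column explicitly so the result does not depend on the pandas version
test_df['PatientSex'].value_counts().rename_axis('PatientSex').reset_index(name='Count')\
    .style.background_gradient(subset=['Count'], cmap='winter_r')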

In [None]:
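# Name the count column explicitly so the result does not depend on the pandas version
test_df['BodyPartExamined'].value_counts().rename_axis('BodyPartExamined').reset_index(name='Count')\
    .style.background_gradient(subset=['Count'], cmap='magma_r')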

In [None]:
fig = go.Figure(go.Scattergl(
    x=test_df['Rows'], y=test_df['Columns'],
    name='Image Size',
    mode='markers',  
    marker=dict(
        color='#DCD427',
    )
))
fig.update_layout(xaxis={'title' : 'Rows', 
                             'showgrid':False},
                      yaxis={'showgrid':False,
                            'title' : 'Columns'},
                      showlegend=False,
                     title = 'Test - image size')
fig.update_traces(textfont_size=16)
fig.show()



In [None]:
train_df['ImageSize'] = train_df['Rows'] * train_df['Columns']
test_df['ImageSize'] = test_df['Rows'] * test_df['Columns']

fig = go.Figure()
fig.add_trace(go.Box(y=train_df['ImageSize'], 
                         name='Train', 
                         jitter=0.5,
                         whiskerwidth=0.6,
                         fillcolor='#0092CC',
                         marker_size=5,
                         line_width=1))
fig.add_trace(go.Box(y=test_df['ImageSize'], 
                         name='Test', 
                         jitter=0.5,
                         whiskerwidth=0.6,
                         fillcolor='#DCD427',
                         marker_size=5,
                         line_width=1))

fig.update_layout(xaxis={'title' : None,'showgrid' :False},
                  yaxis=dict(title='Image Size (Pixels)',showgrid=False,zeroline=False),
                 title = 'Image Size IQR')    
fig.show()

### Image Modality
Each DICOM file has an attribute called "Modality", which describes the technology used to capture the radiograph.
* CR - Computed Radiography 
* DX - Digital Radiography 

A quick reference on the two is available [here](https://www.jpihealthcare.com/computed-radiography-cr-and-digital-radiography-dr-which-should-you-choose/).

The images in this competition are split approximately equally between these two technologies. Digital Radiography has slightly more samples in both the train and test images.

In [None]:
# rename_axis/reset_index(name=...) gives the same 'Modality'/'Count' columns on old and new pandas
modality_train = train_df['Modality'].value_counts().rename_axis('Modality').reset_index(name='Count')
modality_train['Type'] = 'Train'
modality_test = test_df['Modality'].value_counts().rename_axis('Modality').reset_index(name='Count')
modality_test['Type'] = 'Test'
modality_df = pd.concat([modality_train, modality_test]).reset_index(drop=True)
modality_map = {'DX' : 'Digital Radiography',
'CR' : 'Computed Radiography'}
modality_df['Desc'] = modality_df['Modality'].apply(lambda x : modality_map[x])
bar_colors = ['#FFA48E', '#4ACFAC']
modalities = list(modality_df.Desc.unique())
fig = go.Figure()
for modality, color in zip(modalities, bar_colors):
    df = modality_df.loc[modality_df['Desc'] == modality]
    fig.add_trace(go.Bar(
        x=df['Type'],
        y=df['Count'],
        name=modality,
        marker_color = color,
        #text = df['passenger_count_new'],
        #texttemplate='%{text:.2s}', 
        textposition='auto',
        marker_line_width=2.5, opacity=0.8,
        marker_line_color = color        
    ))
fig.update_layout(barmode='stack',
                  xaxis=dict(
                                tickmode = 'array',
                                 title=None,
                                 showgrid=False,
                                 zeroline=False,
                            ),
                      yaxis=dict(title='Count',
                                 showgrid=False,
                                 zeroline=False,
                                ), 
                      title = dict(text = 'Modality of images',
                                   xref = 'paper',
                                  ),
                      bargap=0.15, 
                    bargroupgap=0.1,
                     )     
fig.show()

In [None]:
def get_modality_sample(df, modality):
    studyinstance = df.loc[(df['Modality'] == modality) & (df['PatientSex'] == 'M')].sample(n=1)['StudyInstanceUID'].values[0]
    return studyinstance
    #files = [file for file in all_files if studyinstance in file]
    #return files[0]

def get_modality_samples():
    modality_samples = []
    modality_samples.append(get_modality_sample(train_df,'CR'))
    modality_samples.append(get_modality_sample(train_df,'DX'))
    study_samples = []
    for study_instance_id in modality_samples:
        samples = []
        image = {}
        image_id = image_df.loc[image_df['StudyInstanceUID']==study_instance_id]['id'].values[0].split('_')[0] #Get the image matching study id
        file_name = [string for string in all_files if image_id in string]
        image['study_id'] = study_instance_id
        image['dicom_file'] = file_name[0]
        #Get the bounding boxes
        box = None
        try:
            box = literal_eval(image_df.loc[image_df['StudyInstanceUID']==study_instance_id]['boxes'].values[0])
        except ValueError:
            pass
        image['boxes'] = box
        study_samples.append(image)
        #study_samples[study_instance_id] = samples
    return study_samples

def display_modality_samples(samples, title, draw_boxes=False):
    ''' Input : List of samples 
    '''
    fig1, ax1 = plt.subplots(1,2, figsize=(18, 12), facecolor='w', edgecolor='b')
    fig1.subplots_adjust(hspace =.3, wspace=0.3)
    axs = ax1.ravel()
    for item, ax in zip(samples, axs):
        dicom = read_file(item['dicom_file'], stop_before_pixels=False)
        img = dicom.pixel_array
        ax.imshow(img, cmap="gray")
        if draw_boxes == True and item['boxes'] is not None:
            for box in item['boxes']:             
                rect = patches.Rectangle((box['x'], box['y']), box['width'], box['height'], linewidth=1.5, edgecolor='r', facecolor='none')
                ax.add_patch(rect)
        ax.set_title('Study : {}'.format(item['study_id']),fontsize = 18)
        
    plt.tight_layout(pad=1.0)
#    plt.subplots_adjust(top=0.99)
    plt.suptitle(title,fontsize = 20)
    plt.show()

display_modality_samples(get_modality_samples(), 'Computed Radiography vs Digital Radiography', True)

### Photometric Interpretation

In DICOM, monochrome images are given a photometric interpretation of 'MONOCHROME1' (low values=bright, high values=dark) or 'MONOCHROME2' (low values=dark, high values=bright).
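
Because the two interpretations are mirror images of each other, a common preprocessing step is to invert MONOCHROME1 pixel data so that every image follows the MONOCHROME2 convention. The sketch below shows the idea on a toy array. The function name `to_monochrome2` is hypothetical, and inverting against the maximum value implied by Bits Stored is an assumption made here; inverting against the image's own maximum is another option.

```python
import numpy as np


def to_monochrome2(pixels, photometric_interpretation, bits_stored):
    """Return pixels in the MONOCHROME2 convention (low = dark, high = bright)."""
    if photometric_interpretation == "MONOCHROME1":
        max_value = 2 ** bits_stored - 1  # largest value representable in bits_stored bits
        return max_value - pixels
    return pixels


# Toy 12-bit image, not taken from the competition data
toy = np.array([[0, 100], [4000, 4095]], dtype=np.uint16)
inverted = to_monochrome2(toy, "MONOCHROME1", bits_stored=12)
print(inverted)
```

With a real file, the two extra arguments would typically come from the `PhotometricInterpretation` and `BitsStored` header fields extracted earlier.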

In [None]:
# rename_axis/reset_index(name=...) gives the same column names on old and new pandas
pi_train = train_df['PhotometricInterpretation'].value_counts().rename_axis('PhotometricInterpretation').reset_index(name='Count')
pi_train['Type'] = 'Train'
pi_test = test_df['PhotometricInterpretation'].value_counts().rename_axis('PhotometricInterpretation').reset_index(name='Count')
pi_test['Type'] = 'Test'
pi_df = pd.concat([pi_train, pi_test]).reset_index(drop=True)
bar_colors = ['#FFA48E', '#4ACFAC']
pi_values = list(pi_df.PhotometricInterpretation.unique())
fig = go.Figure()
for pi, color in zip(pi_values, bar_colors):
    df = pi_df.loc[pi_df['PhotometricInterpretation'] == pi]
    fig.add_trace(go.Bar(
        x=df['Type'],
        y=df['Count'],
        name=pi,
        marker_color = color,
        #text = df['passenger_count_new'],
        #texttemplate='%{text:.2s}', 
        textposition='auto',
        marker_line_width=2.5, opacity=0.8,
        marker_line_color = color        
    ))
fig.update_layout(barmode='stack',
                  xaxis=dict(
                                tickmode = 'array',
                                 title=None,
                                 showgrid=False,
                                 zeroline=False,
                            ),
                      yaxis=dict(title='Count',
                                 showgrid=False,
                                 zeroline=False,
                                ), 
                      title = dict(text = 'Photometric Interpretation of images',
                                   xref = 'paper',
                                  ),
                      bargap=0.15, 
                    bargroupgap=0.1,
                     )     
fig.show()

In [None]:
def get_photometricInterpretation_sample(df, modality):
    studyinstance = df.loc[(df['PhotometricInterpretation'] == modality) & (df['PatientSex'] == 'M')].sample(n=1)['StudyInstanceUID'].values[0]
    return studyinstance
    #files = [file for file in all_files if studyinstance in file]
    #return files[0]

def get_PhotometricInterpretation_samples():
    modality_samples = []
    modality_samples.append(get_photometricInterpretation_sample(train_df,'MONOCHROME2'))
    modality_samples.append(get_photometricInterpretation_sample(train_df,'MONOCHROME1'))
    study_samples = []
    for study_instance_id in modality_samples:
        samples = []
        image = {}
        image_id = image_df.loc[image_df['StudyInstanceUID']==study_instance_id]['id'].values[0].split('_')[0] #Get the image matching study id
        file_name = [string for string in all_files if image_id in string]
        image['study_id'] = study_instance_id
        image['dicom_file'] = file_name[0]
        #Get the bounding boxes
        box = None
        try:
            box = literal_eval(image_df.loc[image_df['StudyInstanceUID']==study_instance_id]['boxes'].values[0])
        except ValueError:
            pass
        image['boxes'] = box
        study_samples.append(image)
        #study_samples[study_instance_id] = samples
    return study_samples

def display_photometricInterpretation_samples(samples, title, draw_boxes=False):
    ''' Input : List of samples 
    '''
    fig1, ax1 = plt.subplots(1,2, figsize=(18, 12), facecolor='w', edgecolor='b')
    fig1.subplots_adjust(hspace =.3, wspace=0.3)
    axs = ax1.ravel()
    for item, ax in zip(samples, axs):
        dicom = read_file(item['dicom_file'], stop_before_pixels=False)
        img = dicom.pixel_array
        ax.imshow(img, cmap="gray")
        if draw_boxes == True and item['boxes'] is not None:
            for box in item['boxes']:             
                rect = patches.Rectangle((box['x'], box['y']), box['width'], box['height'], linewidth=1.5, edgecolor='r', facecolor='none')
                ax.add_patch(rect)
        ax.set_title('Study : {}'.format(item['study_id']),fontsize = 18)
        
    plt.tight_layout(pad=1.0)
#    plt.subplots_adjust(top=0.99)
    plt.suptitle(title,fontsize = 20)
    plt.show()

display_photometricInterpretation_samples(get_PhotometricInterpretation_samples(), 'MONOCHROME2 vs MONOCHROME1', True)

<div style="background-color:#fdb913; font-size:120%;  font-family:sans-serif; text-align:center"><b>Explore bounding boxes in training data</b></div>


In [None]:
img_df = image_df.copy()
from ast import literal_eval

def count_bb(x):
    count=0
    bb_lst = literal_eval(x)
    for box in bb_lst:
        count+=1
    return count

def bb_size(x):
    sizes= []
    bb_lst = literal_eval(x)
    for box in bb_lst:
        w=box['width']
        h=box['height']
        sizes.append(w*h)
    return sizes
    
img_df['boxes'] = img_df['boxes'].fillna('[]')
img_df['bb_count'] = img_df['boxes'].apply(lambda x: count_bb(x))
img_df['bb_size'] = img_df['boxes'].apply(lambda x: bb_size(x))

bbs_lst = []
for index, row in img_df.iterrows():
    #lst = literal_eval(row['bb_size'])
    bbs_lst.extend(row['bb_size'])

In [None]:
fig1, ax1 = plt.subplots(1,1, figsize=(8, 6), facecolor='w', edgecolor='g')
fig1.subplots_adjust(hspace =.3, wspace=0.3)
sub_plot = sns.histplot(img_df['bb_count'], ax=ax1)
ax1.set_title('Number of bounding boxes in train images',fontsize = 18)
ax1.set_xlabel('Number of bounding boxes', fontsize=12)
ax1.set_ylabel('Count', fontsize=12)
plt.tight_layout(pad=3.0)
plt.subplots_adjust(top=0.91)
plt.show()


In [None]:
fig1, ax1 = plt.subplots(1,1, figsize=(8, 6), facecolor='w', edgecolor='g')
fig1.subplots_adjust(hspace =.3, wspace=0.3)
sub_plot = sns.histplot(bbs_lst,bins=100, ax=ax1)
ax1.set_title('Size of bounding boxes in train images',fontsize = 18)
ax1.set_xlabel('Size', fontsize=12)
ax1.ticklabel_format(style='plain')
#start, end = ax1.get_xlim()
#ax1.xaxis.set_ticks(np.arange(start, end, 500000))
ax1.set_ylabel('Count', fontsize=12)
plt.tight_layout(pad=3.0)
plt.subplots_adjust(top=0.91)
plt.show()

In [None]:
# Build one row per bounding box. DataFrame.append was removed in pandas 2.0,
# so collect the rows in a list and create the frame once.
bb_rows = []
for index, rec in img_df.iterrows():
    for bb in rec['bb_size']:
        bb_rows.append({'image_id': rec['id'].split('_')[0],
                        'study_ins_id': rec['StudyInstanceUID'],
                        'bb_count': rec['bb_count'],
                        'bb_size': bb})
img_bb_df = pd.DataFrame(bb_rows, columns=['image_id','study_ins_id','bb_count', 'bb_size'])


> Let's look at the distribution of bounding box sizes. The training images contain a variable number of bounding boxes, such as 1, 2, 3, 4, 5 and 8.

In [None]:
fig = go.Figure()
#Sky Blue, Hyper Red, Sulphur Yellow, Green
bb_colors = ['#0092CC','#FF3333','#DCD427','#779933']

for bb_count, color in zip([1,2,3,4], bb_colors):
    df = img_bb_df.loc[img_bb_df['bb_count']==bb_count].reset_index(drop=True)
    fig.add_trace(go.Box(y=df['bb_size'], 
                             name='BB count : {}'.format(bb_count), 
                             jitter=0.5,
                             whiskerwidth=0.6,
                             fillcolor=color,
                             marker_size=5,
                             line_width=1))
fig.update_layout(xaxis={'title' : None,'showgrid' :False},
                  yaxis=dict(title='Bounding Box Size (Pixels)',showgrid=False,zeroline=False),
                 title = 'Train images - Bounding Box Size IQR')    
fig.show()

In [None]:
# Add labels to the image_bb dataframe
sg_df = pd.melt(study_df, id_vars=list(study_df.columns)[:1], value_vars=list(study_df.columns)[1:],
             var_name='label', value_name='value')
sg_df = sg_df.loc[sg_df['value']!=0]
sg_df['study_id'] = sg_df['id'].apply(lambda x : str(x).split('_')[0])
lbl_map = {'Negative for Pneumonia' : 'Negative',
 'Typical Appearance' : 'Typical',
 'Indeterminate Appearance' : 'Indeterminate', 
 'Atypical Appearance' : 'Atypical'}
sg_df['lbl'] = sg_df['label'].apply(lambda x: lbl_map[x])
slbl_map = dict(zip(sg_df.study_id, sg_df.lbl)) 
img_bb_df['label'] = img_bb_df['study_ins_id'].apply(lambda x: slbl_map[x])
#slbl_map

In [None]:
large_bb = img_bb_df.sort_values('bb_size', ascending=False).head(20).reset_index(drop=True)
def find_file(img_id):
    imgs = [file for file in all_files if img_id in file]
    return imgs[0]

large_bb['file'] = large_bb['image_id'].apply(lambda x : find_file(x))
large_bb = large_bb[['image_id','study_ins_id','label','file']]
lg_bb_dict = large_bb.set_index('image_id').T.to_dict('list')


In [None]:
import matplotlib.patheffects as patheffects  # name must match its use in draw_outline below
COLOR='white'
def show_image(img, figsize=None, ax=None, cmap="gray"):
    if not ax: 
        fig, ax = plt.subplots(figsize=figsize)
    ax.imshow(img, cmap=cmap)
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)

def draw_outline(o, lw):
    o.set_path_effects([patheffects.Stroke(linewidth=lw, foreground='black'), patheffects.Normal()])

    
def draw_text(ax, x,y, txt, fontsize=14):
    text = ax.text(x,y, 
                   txt,
                  verticalalignment = 'top',
                  color=COLOR,
                  fontsize=fontsize,
                  weight='bold')

def draw_bb(ax, bb, label):
    for b in bb:
        patch = ax.add_patch(patches.Rectangle((b['x'],
                                           b['y']),
                                           b['width'],
                                           b['height'],
                        fill=False,
                         edgecolor='red',
                         lw=2))
        #draw_text(ax, b['x'], b['y'], label)


def get_image(file):
    dicom = read_file(file, stop_before_pixels=False)
    return dicom.pixel_array

def get_bb(img_id):
    bb = literal_eval(image_df.loc[image_df['id'] == "{}_image".format(img_id)]['boxes'].values[0])
    return bb
        
def show_bb(samples, title, rows=3, cols=4):
    ''' Input : Dict of samples
    '''
    fig, axs = plt.subplots(rows, cols, figsize=(18, 12), facecolor='w', edgecolor='b')
    #fig.subplots_adjust(hspace =.3, wspace=0.3)
    for i, ax in enumerate(axs.ravel()):
        img_id = list(samples.keys())[i]
        data = list(samples.values())[i]
        img = get_image(data[2])
        bb = get_bb(img_id)
        show_image(img, ax=ax)
        draw_bb(ax, bb, data[1])
        ax.set_title('{}, {}'.format(data[0],data[1]),fontsize = 14)

    plt.tight_layout(pad=3.0)
    plt.subplots_adjust(top=0.91)
    plt.suptitle(title,fontsize = 20)
    plt.show()
show_bb(lg_bb_dict, 'Large Bounding Boxes')

In [None]:
small_bb = img_bb_df.sort_values('bb_size', ascending=True).head(20).reset_index(drop=True)
small_bb['file'] = small_bb['image_id'].apply(lambda x : find_file(x))
small_bb = small_bb[['image_id','study_ins_id','label','file']]
sm_bb_dict = small_bb.set_index('image_id').T.to_dict('list')
show_bb(sm_bb_dict, 'Small Bounding Boxes')

> There are two images that are outliers: one has 5 bounding boxes and is classified as 'Indeterminate Appearance', and the other has 8 bounding boxes and is classified as 'Typical Appearance'. So we are dealing with the localization of up to 8 objects per image, and maybe even more!

In [None]:
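# Copy the filtered rows so the column assignment below does not trigger SettingWithCopyWarning
outlr_bb = img_bb_df.loc[img_bb_df['bb_count']>4].copy()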

outlr_bb['file'] = outlr_bb['image_id'].apply(lambda x : find_file(x))
outlr_bb = outlr_bb[['image_id','study_ins_id','label','file']]
ol_bb_dict = outlr_bb.set_index('image_id').T.to_dict('list')
#show_bb(sm_bb_dict, 'Small Bounding Boxes')
ol_bb_dict
show_bb(ol_bb_dict, 'Outliers!', rows=1, cols=2)

#### Work in progress