## Table of Contents

- [Let's look at the distribution of number of opacities](#Let's-look-at-the-distribution-of-number-of-opacities)
- [Pneumonia detected even if no opacity is observed](#Pneumonia-detected-even-if-no-opacity-is-observed)
- [Opacities vs Pneumonia](#Opacities-vs-Pneumonia)
    - [Couple of things to be clarified from the data](#Couple-of-things-to-be-clarified-from-the-data)
- [Quick sanity check on images and annotations](#Quick-sanity-check-on-images-and-annotations)
- [Reference](#Reference)


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
'''for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
'''
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
!pip install python-gdcm
!pip install pylibjpeg-openjpeg
!pip install pylibjpeg

In [None]:
!pip uninstall numpy -y
!pip install numpy --no-cache-dir

In [None]:
!pip uninstall pydicom -y
!pip install pydicom

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pydicom
import pydicom.pixel_data_handlers.util as dicom_utils
import cv2
import gdcm
import pylibjpeg
import glob

In [None]:
ROOT_DIR = "/kaggle/input/siim-covid19-detection/"
train_image = pd.read_csv(ROOT_DIR+'train_image_level.csv')
train_study = pd.read_csv(ROOT_DIR+'train_study_level.csv')

In [None]:
train_image.head()

# Let's look at the distribution of number of opacities
[Back to top](#Table-of-Contents)

In [None]:
def get_opacity_count(item):
    return item.split(' ').count('opacity')

In [None]:
train_image['opacity_count'] = train_image['label'].apply(get_opacity_count)
train_image.head()

In [None]:
# Print the counts explicitly; a cell only displays its last expression
print(train_image['opacity_count'].value_counts())
ax = sns.countplot(x = 'opacity_count', data = train_image)

There are up to 8 opacities observed in a single X-ray.
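
As a minimal sketch, the same count can also be computed without `apply`, using pandas' vectorized string methods. The label strings below are made-up examples; they assume each box is written as `opacity <confidence> <xmin> <ymin> <xmax> <ymax>` and that images without boxes carry a `none 1 0 0 1 1` placeholder.

```python
import pandas as pd

# Hypothetical label strings in the assumed format (not real records)
labels = pd.Series([
    "opacity 1 10 20 110 220 opacity 1 300 40 420 260",
    "none 1 0 0 1 1",
    "opacity 1 50 60 150 260",
])

# Count how many times the word 'opacity' occurs in each label
opacity_count = labels.str.count("opacity")
print(opacity_count.tolist())
```

On the real data this would amount to `train_image['label'].str.count('opacity')`, which should agree with `get_opacity_count` as long as the word only appears as a class name.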

# Pneumonia detected even if no opacity is observed
[Back to top](#Table-of-Contents)

In [None]:
train_study.head()

In [None]:
def map_study_observation(item):
    # Integer code for each study-level class
    target_map = {'Negative for Pneumonia':0,'Typical Appearance':1,'Indeterminate Appearance':2,'Atypical Appearance':3}
    # Study ids in train_study carry a '_study' suffix
    item += '_study'
    record = train_study.loc[train_study['id']==item, ['Negative for Pneumonia','Typical Appearance','Indeterminate Appearance','Atypical Appearance']]
    # Pick the first class column that is flagged non-zero for this study
    df = (record != 0).any()
    target = df.index[df]
    return target_map[target[0]]

In [None]:
train_image['target'] = train_image['StudyInstanceUID'].apply(map_study_observation)

In [None]:
train_image.head()
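
The row-by-row lookup above filters `train_study` once per image. A possible alternative, sketched here on toy data, builds the study-to-target table once and then maps it onto the images. The frames `toy_study` and `toy_image` and their ids are made up for illustration, and the sketch assumes every study has exactly one non-zero class column.

```python
import pandas as pd

cols = ['Negative for Pneumonia', 'Typical Appearance',
        'Indeterminate Appearance', 'Atypical Appearance']

# Toy stand-ins for train_study and train_image (ids are made up)
toy_study = pd.DataFrame({
    'id': ['s1_study', 's2_study', 's3_study'],
    'Negative for Pneumonia': [1, 0, 0],
    'Typical Appearance': [0, 1, 0],
    'Indeterminate Appearance': [0, 0, 0],
    'Atypical Appearance': [0, 0, 1],
})
toy_image = pd.DataFrame({
    'id': ['a_image', 'b_image', 'c_image', 'd_image'],
    'StudyInstanceUID': ['s2', 's1', 's3', 's2'],
})

# The position of the flagged column is the integer target (0..3)
study_target = pd.Series(
    toy_study[cols].to_numpy().argmax(axis=1),
    index=toy_study['id'].str.replace('_study', '', regex=False),
)

# Look up every image's study in one pass
toy_image['target'] = toy_image['StudyInstanceUID'].map(study_target)
print(toy_image)
```

The column order in `cols` matches the integer codes used in `target_map`, so both approaches should yield the same targets under the one-hot assumption.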

In [None]:
df_no_opacity_but_has_target = train_image[(train_image['opacity_count'] == 0) & (train_image['target'] != 0)]

In [None]:
print(f"Total images: {len(train_image)}")
print(f"Total images with no opacities: {sum(train_image['opacity_count']==0)}")
print(f"Total images with no opacity but identified pneumonia: {len(df_no_opacity_but_has_target)}")

We can see that 304 of the 2040 images with no opacities were still labelled as pneumonia. Can pneumonia be diagnosed without spotting an opacity?

I feel these records should be discarded when modelling, as they may introduce a bias in the data. Now let's look at the distribution of opacity counts across the different target categories.
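
If one did decide to drop those records, a minimal sketch could look like the following. The frame `toy` is a made-up stand-in for `train_image`; whether discarding is the right call is left open here.

```python
import pandas as pd

# Toy stand-in for train_image with the two derived columns
toy = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'opacity_count': [0, 0, 2, 1],
    'target': [0, 1, 1, 3],
})

# Rows that have no boxes but still carry a pneumonia label
inconsistent = (toy['opacity_count'] == 0) & (toy['target'] != 0)

# Keep everything else
cleaned = toy[~inconsistent].reset_index(drop=True)
print(cleaned)
```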

# Opacities vs Pneumonia
[Back to top](#Table-of-Contents)

In [None]:
grouped_opacities = train_image.groupby(['opacity_count','target'])
grouped_df2 = grouped_opacities['id'].count()

In [None]:
grouped_df2.unstack()

In [None]:
legends = ['Negative for Pneumonia','Typical Appearance','Indeterminate Appearance','Atypical Appearance']
fig = plt.figure(figsize=(10,10))
plt.suptitle("Distribution of opacity count over different targets")
for i in range(7):
    plt.subplot(3,3,i+1)
    df = grouped_df2.unstack().iloc[i]
    sns.barplot(x=df.index, y=df.values)
    plt.title(f'No. of opacities: {i}')
    plt.xlabel('target')
    plt.ylabel('count')
fig.text(0.5, 0.2, s='**********  Target Mappings  *******\n Negative for Pneumonia - 0\n Typical - 1\n Indeterminate - 2\n Atypical - 3', fontsize='x-large')
fig.tight_layout()

### Couple of things to be clarified from the data
[Back to top](#Table-of-Contents)

The above distribution raises a couple of questions:
1. Pneumonia is identified even when there are no opacities. Can pneumonia be diagnosed without spotting a single opacity?
2. Whenever an opacity is identified, we never see 'Negative for Pneumonia'. Does that mean an opacity is always a sign of pneumonia? Can't there be opacities in the lung due to other factors? Or did the experts who annotated this dataset mark only opacities caused by pneumonia and leave out those caused by other factors?
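
Both questions can be read off a single contingency table of "has at least one opacity" against the target. The sketch below uses a made-up frame `toy` in place of `train_image`, so the counts are illustrative only.

```python
import pandas as pd

# Toy stand-in for train_image (values are made up)
toy = pd.DataFrame({
    'opacity_count': [0, 0, 0, 1, 2, 3],
    'target':        [0, 1, 0, 1, 2, 3],
})

# Flag images with at least one box, using readable row labels
has_opacity = (
    (toy['opacity_count'] > 0)
    .map({True: 'yes', False: 'no'})
    .rename('has_opacity')
)

# Rows: has_opacity, columns: target code (0 = Negative for Pneumonia)
table = pd.crosstab(has_opacity, toy['target'])
print(table)
```

A non-zero cell at (`no`, target 1 to 3) corresponds to question 1, and a zero at (`yes`, target 0) corresponds to question 2.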

# Quick sanity check on images and annotations
[Back to top](#Table-of-Contents)

In [None]:
def get_bboxes(labels):
    opacities = labels.strip().split('opacity')
    opacity_list = []
    #print(opacities)
    for opacity in opacities:
        #print(opacity)
        if 'none' in opacity:
            continue
        if opacity != '':
            opacity_list.append(opacity.strip().split(' ')[1:])
    return opacity_list
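
For comparison, here is a sketch of a stricter parser that walks the label six tokens at a time and returns floats. `parse_boxes` is a hypothetical helper, and the six-token layout (class name, confidence, xmin, ymin, xmax, ymax) is an assumption about the label format; the sample strings are made up.

```python
def parse_boxes(label):
    """Hypothetical helper: return a list of (xmin, ymin, xmax, ymax) tuples."""
    tokens = label.split()
    boxes = []
    # Each annotation is assumed to occupy six tokens:
    # class name, confidence, xmin, ymin, xmax, ymax
    for i in range(0, len(tokens) - 5, 6):
        if tokens[i] != 'opacity':
            continue  # skip 'none' placeholders
        xmin, ymin, xmax, ymax = map(float, tokens[i + 2:i + 6])
        boxes.append((xmin, ymin, xmax, ymax))
    return boxes


two_boxes = parse_boxes("opacity 1 10.5 20 110 220.25 opacity 1 300 40 420 260")
no_boxes = parse_boxes("none 1 0 0 1 1")
print(two_boxes)
print(no_boxes)
```

Returning floats up front keeps the conversion out of the drawing loop.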

In [None]:
import glob
target_map = ['Negative for Pneumonia','Typical Appearance','Indeterminate Appearance','Atypical Appearance']
fig = plt.figure(figsize=(50,50))
for i in range(9):
    random_record = train_image.sample()
    #print(type(random_record.index[0]))
    study_id = random_record['StudyInstanceUID'].values[0]
    image_id = random_record['id'].values[0]
    labels = random_record['label'].values[0]
    target = random_record['target'].values[0]
    bboxes = get_bboxes(labels)
    #print(image_id.values[0])
    img = str(image_id).split('_image')[0]+'.dcm'
    file_ = glob.glob(ROOT_DIR+'train/'+study_id+'/*/'+img)
    dicom = pydicom.dcmread(file_[0])
    img = dicom.pixel_array
    if 'PhotometricInterpretation' in dicom:
        if dicom.PhotometricInterpretation == 'MONOCHROME1':
            print("It's a monochrome1 image")
            img = np.amax(img) - img
    img = dicom_utils.apply_modality_lut(img, dicom)
    img = dicom_utils.apply_voi_lut(img, dicom)
    #print(img.max())
    img = np.stack([img, img, img])
    img = img.astype('float32')
    img = img - img.min()
    img = img / img.max()
    #img = img * 255
    img = img.transpose(1,2,0)
    #print(img.shape)
    #img = img / 255
    plt.subplot(9,1,i+1)
    #plt.subplot(3,3,i+1)
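    # cv2 drawing needs a C-contiguous array, which the transpose above does not give
    img = np.ascontiguousarray(img)
    for bbox in bboxes:
        xmin = int(float(bbox[0]))
        ymin = int(float(bbox[1]))
        xmax = int(float(bbox[2]))
        ymax = int(float(bbox[3]))
        # The image is float in [0, 1], so red is (1, 0, 0) rather than (255, 0, 0)
        img = cv2.rectangle(img, (xmin, ymin), (xmax, ymax), (1.0, 0.0, 0.0), 5)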
    plt.imshow(img)
    plt.axis('off')
    plt.title(target_map[target])
plt.suptitle("Random lookup of images and its annotations")
fig.tight_layout()
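
The scaling steps in the loop above can be isolated into a small function and checked on a synthetic array. This is only a sketch: `to_unit_range` is a hypothetical helper, the array `fake` is made up, and the DICOM LUT steps are left out.

```python
import numpy as np


def to_unit_range(pixels, invert=False):
    """Hypothetical helper: min-max scale to [0, 1], optionally inverting first."""
    img = pixels.astype('float32')
    if invert:  # e.g. MONOCHROME1, where high stored values are shown dark
        img = img.max() - img
    img = img - img.min()
    peak = img.max()
    if peak > 0:  # guard against dividing by zero on a constant image
        img = img / peak
    return img


# Synthetic 2x2 "X-ray" with 16-bit values
fake = np.array([[0, 1000], [2000, 4000]], dtype='uint16')
gray = to_unit_range(fake, invert=True)

# Stacking on the last axis yields an (H, W, 3) array directly
rgb = np.stack([gray, gray, gray], axis=-1)
print(gray)
print(rgb.shape)
```

Stacking with `axis=-1` avoids the later transpose and produces a C-contiguous array, which is the layout OpenCV drawing functions expect.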

## Reference

Here's a very good resource to learn more about lung opacities and pneumonia - https://www.kaggle.com/zahaviguy/what-are-lung-opacities

[Back to top](#Table-of-Contents)