In [None]:
!conda install -c conda-forge gdcm -y

import numpy as np 
import pandas as pd 
import os
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as patches
%matplotlib inline
import glob
from pydicom import dcmread
from pydicom.data import get_testdata_file

from tqdm import tqdm

import ast

!pip install hvplot
import hvplot.pandas 

!pip install pylibjpeg pylibjpeg-libjpeg pydicom python-gdcm
import gdcm
import pylibjpeg

# Introduction

COVID-19 is a **pulmonary infection that causes inflammation and fluid in the lungs**. It resembles other pneumonias, which makes it difficult to diagnose. The idea of the project is to **detect and localize COVID-19** in order to help doctors provide a quick and confident diagnosis, so that patients can get the right treatment before the severe effects of the virus set in.

The reason for using chest X-rays is that they are easy to take and can be obtained in minutes. In addition, they let us locate the disease and get a better idea of the patient's condition.

In this notebook, we will **visualise and understand** the data from a dataset of chest X-rays. The objective is to get an overview of the data and to see what information can be gathered. Feel free to leave a comment if you have any questions, I would be happy to answer them :)

> Link for the data : [SIIM-FISABIO-RSNA COVID-19 Detection](https://www.kaggle.com/c/siim-covid19-detection)

In [None]:
# Read the data
df_train_images = pd.read_csv('../input/siim-covid19-detection/train_image_level.csv')
df_train_study = pd.read_csv('../input/siim-covid19-detection/train_study_level.csv')

In this dataset, we have two important files :
- *train_study_level.csv* : contains the study information, with the corresponding label for each study.
- *train_image_level.csv* : contains the image information and the associated bounding boxes that locate the disease.

The train dataset comprises **6,334 chest scans in DICOM format**, which were de-identified to protect patient privacy. All images were labeled by a panel of experienced radiologists for the presence of opacities as well as overall appearance.

# Analysis of the studies

In this section we will focus mainly on the analysis of the studies. We will try to understand the different categories and their distribution.

In [None]:
labels = ['Negative for Pneumonia', 'Typical Appearance', 'Indeterminate Appearance', 'Atypical Appearance']

print("Number of studies : ", len(df_train_study))

df_train_study.head()

So, in this file, we have 6054 rows. However, we have 6,334 chest X-rays, so a study can group several images of a patient. One question we can ask is why a study would contain several images. Were different shots taken? Were they taken at different times?
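
The relation between studies and images can be checked by counting images per study. Here is a minimal sketch on a toy dataframe: the ids below are made up for illustration, and only the `id` and `StudyInstanceUID` column names come from *train_image_level.csv*.

```python
import pandas as pd

# Toy stand-in for train_image_level.csv (ids are made up for illustration)
toy_images = pd.DataFrame({
    "id": ["a_image", "b_image", "c_image", "d_image"],
    "StudyInstanceUID": ["s1", "s1", "s2", "s3"],
})

# Number of images attached to each study
images_per_study = toy_images.groupby("StudyInstanceUID")["id"].count()

# Studies that group more than one image
multi_image_studies = images_per_study[images_per_study > 1]

print(images_per_study.to_dict())
print(list(multi_image_studies.index))
```

On the real data, the same `groupby` applied to `df_train_images` would give the number of images for every study.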

Regarding the diagnosis, we will have 4 labels :

1. **Negative for Pneumonia**: No lung opacities.

2. **Typical Appearance**: Multifocal bilateral, peripheral opacities with rounded morphology, lower lung–predominant distribution

3. **Indeterminate Appearance**: Absence of typical findings AND unilateral, central or upper lung predominant distribution

4. **Atypical Appearance**: Pneumothorax, pleural effusion, pulmonary edema, lobar consolidation, solitary lung nodule or mass, diffuse tiny nodules, cavity.


> https://www.kaggle.com/c/siim-covid19-detection/discussion/240250


**Pulmonary opacification represents the result of a decrease in the ratio of gas to soft tissue** (blood, lung parenchyma and stroma) in the lung. When reviewing an area of increased attenuation (opacification) on a chest radiograph or CT it is vital to determine where the opacification is. The patterns can broadly be divided into airspace opacification, lines and dots.

> https://radiopaedia.org/articles/pulmonary-opacification

## See some sample images from the different categories

In [None]:
NUMBER_OF_SAMPLE = 5

def read_image_from_study(study_id):
    study_name = study_id.split('_')[0]
    file = glob.glob("../input/siim-covid19-detection/train/" + study_name + "/*/*.dcm")
    ds = dcmread(file[0])
    return ds.pixel_array

def show_sample_data_from_study(sample_images, NB_SAMPLE = 5):
    fig, axes = plt.subplots(nrows=1, ncols=NB_SAMPLE, figsize=(NB_SAMPLE * 4, 4))
    i = 0
    for index, row in sample_images.iterrows():
        img = read_image_from_study(row['id'])
        axes[i].imshow(img, cmap=plt.cm.gray, aspect='auto')
        axes[i].axis('off')
        i += 1
    fig.show()

### Negative for pneumonia

In [None]:
sample_negative_pneumonia = df_train_study[df_train_study['Negative for Pneumonia'] == 1].sample(n=NUMBER_OF_SAMPLE, random_state=42)
show_sample_data_from_study(sample_negative_pneumonia, NUMBER_OF_SAMPLE)

### Typical appearance

In [None]:
sample_typical = df_train_study[df_train_study['Typical Appearance'] == 1].sample(n=NUMBER_OF_SAMPLE, random_state=42)
show_sample_data_from_study(sample_typical, NUMBER_OF_SAMPLE)

### Indeterminate appearance

In [None]:
sample_indeterminate = df_train_study[df_train_study['Indeterminate Appearance'] == 1].sample(n=NUMBER_OF_SAMPLE, random_state=42)
show_sample_data_from_study(sample_indeterminate, NUMBER_OF_SAMPLE)

### Atypical appearance

In [None]:
sample_atypical = df_train_study[df_train_study['Atypical Appearance'] == 1].sample(n=NUMBER_OF_SAMPLE, random_state=42)
show_sample_data_from_study(sample_atypical, NUMBER_OF_SAMPLE)

At first glance, it is really hard to see the opacities. It's a difficult task.
If we look closely, for the *Typical appearance*, we can see some opacities.
A larger picture might give a better view.

Note : I think it could be interesting to use image enhancement to help the visualisation of the X-ray images. [I found a library on GitHub called *X-Ray Images Enhancement* that could be interesting](https://github.com/asalmada/x-ray-images-enhancement). I will try it in another kernel. If someone has already applied this kind of technique or used that library, feel free to share your experience with us :)
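
The enhancement idea can be tried with NumPy alone. The following is a minimal sketch of plain global histogram equalization; it is not the method used by the linked library, and the function name `equalize_histogram` and the toy image are made up for illustration.

```python
import numpy as np

def equalize_histogram(pixels, n_levels=256):
    """Global histogram equalization. Returns a uint8 image of the same shape."""
    pixels = np.asarray(pixels, dtype=np.float64)
    lo, hi = pixels.min(), pixels.max()
    if hi == lo:
        # Constant image: nothing to equalize
        return np.zeros(pixels.shape, dtype=np.uint8)
    # Quantize the intensities into n_levels bins
    levels = ((pixels - lo) / (hi - lo) * (n_levels - 1)).astype(np.int64)
    # Cumulative distribution of the quantized intensities
    hist = np.bincount(levels.ravel(), minlength=n_levels)
    cdf = hist.cumsum() / levels.size
    # Map each pixel to its cumulative probability, rescaled to 0..255
    return (cdf[levels] * 255).astype(np.uint8)

# Toy low-contrast image (values are illustrative, not taken from the dataset)
toy_image = np.array([[100, 101], [102, 103]])
equalized = equalize_histogram(toy_image)
print(equalized)
```

The output could then be passed to `imshow` in place of `ds.pixel_array`. Local methods such as CLAHE usually behave better on radiographs, but they need an extra library.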

## Distribution of the different categories

In [None]:
# Count the number of occurrences for each label
study_case = [df_train_study[label].value_counts()[1] for label in labels]

plt.figure(figsize=(15, 6))
plt.bar(labels, study_case)
plt.title('Distribution of the different categories')
plt.show()

plt.figure(figsize=(8, 8))
plt.pie(study_case, labels=labels, autopct='%1.1f%%')
plt.title('Proportion of the different categories')
plt.show()

In [None]:
def count_column(x):
    return x.sum()
    
df_train_count = df_train_study[labels].apply(count_column, axis=1)
# Count the studies whose label columns do not sum to exactly 1
print("Number of studies without exactly one category :", (df_train_count != 1).sum())

For the studies, no row has multiple categories, which means that **the categories are mutually exclusive**. 
Regarding the distribution of the data, **we have unbalanced categories**. As we can see in our diagram, *Typical Appearance* accounts for 47% of the studies, while *Atypical Appearance* accounts for only 7.8%. 

So, when preprocessing our training dataset, we should take into consideration that we're dealing with unbalanced data, in order to avoid a model that mostly predicts a single label.
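
One common way to account for the imbalance is to weight each category inversely to its frequency. Here is a minimal sketch, assuming hypothetical counts (the numbers below are illustrative, not the real counts of this dataset); the helper name `inverse_frequency_weights` is made up.

```python
def inverse_frequency_weights(counts):
    """Map each label to total / (n_classes * count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

# Hypothetical counts, for illustration only
toy_counts = {
    "Negative for Pneumonia": 30,
    "Typical Appearance": 50,
    "Indeterminate Appearance": 15,
    "Atypical Appearance": 5,
}

weights = inverse_frequency_weights(toy_counts)
for label, weight in weights.items():
    print(label, ":", round(weight, 3))
```

This is the same heuristic as `class_weight="balanced"` in scikit-learn: rare categories get a weight above 1 and frequent ones a weight below 1. On the real data, the counts would come from `study_case`.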

# Analysis of the images

In [None]:
print("Number of images : ", len(df_train_images))
df_train_images.head()

## Image analysis with duplicate study

As a reminder, we saw in the previous part that a study can contain multiple images. It could be interesting to visualize that data in order to understand why.

In [None]:
print("Number of duplicated image ids :", df_train_images.id.duplicated().sum())
print("Number of images sharing a study with an earlier image :", df_train_images.StudyInstanceUID.duplicated().sum())

In [None]:
# Get the studies that contain more than one image
unique_study_duplicate = df_train_images[df_train_images.StudyInstanceUID.duplicated()].StudyInstanceUID.unique()
print("Some duplicated id : ", ' ; '.join(unique_study_duplicate[:10]))

# Get all the images that belong to one of those studies
images_with_duplicate_study = df_train_images[df_train_images.StudyInstanceUID.isin(unique_study_duplicate)]
print("Number of images concerned by duplication :", len(images_with_duplicate_study))

### Visualization of some studies

In [None]:
def read_image_from_image(study_name, image_id):
    image_name = image_id.split('_')[0]
    file = glob.glob("../input/siim-covid19-detection/train/" + study_name + "/*/" + image_name + ".dcm")
    ds = dcmread(file[0])
    return ds.pixel_array

def show_sample_duplicate(samples):
    nb_show_sample = min(5, len(samples))
    fig, axes = plt.subplots(nrows=1, ncols=nb_show_sample, figsize=(nb_show_sample * 4, 4))
    i = 0
    for index, row in samples.iterrows():
        img = read_image_from_image(row['StudyInstanceUID'], row['id'])
        axes[i].imshow(img, cmap=plt.cm.gray, aspect='auto')
        axes[i].axis('off')
        i += 1
        if i == 5:
            break
        
    fig.suptitle(samples.StudyInstanceUID.unique()[0], fontsize=20)
    fig.show()
    

# Get some sample from duplicate study
np.random.seed(42)
duplicated_study_sample = np.random.choice(unique_study_duplicate, 5)

# See the different values
for sample_study_name in duplicated_study_sample:
    sample_duplicate_image = df_train_images[df_train_images.StudyInstanceUID == sample_study_name]
    
    show_sample_duplicate(sample_duplicate_image)    

It seems that the images within a given study are often duplicates of varying quality. In the first, second and last studies, the images clearly differ in brightness. In the fourth study, however, the images seem to be identical. Finally, the third study contains 4 different images with different brightness and cropping.

### The particular 0fd2db233deb ID

During my exploration, I found one particular study ID that has duplicated images.

In [None]:
df_train_images[df_train_images.StudyInstanceUID == "0fd2db233deb"]

In [None]:
df_train_study[df_train_study.id == "0fd2db233deb_study"]

In [None]:
show_sample_duplicate(df_train_images[df_train_images.StudyInstanceUID == "0fd2db233deb"])

For the 0fd2db233deb study, the images are duplicates. Moreover, in the image-level information, only one of the rows carries box information. 

So, in our dataset, we have **duplicate images**. Some are different (different brightness, cropping or angle) and some are identical. The studies with more than one image account for 512 images, which is about 8% (512 * 100 / 6334) of our dataset. 

In [None]:
# Create a 'StudyInstanceUID' column from the study id (strip the '_study' suffix)
df_train_study['StudyInstanceUID'] = df_train_study['id'].apply(lambda x : x.replace('_study', ''))

# Get the duplicated study
df_study_from_duplicate = df_train_study[df_train_study['StudyInstanceUID'].isin(images_with_duplicate_study['StudyInstanceUID'].unique())]

# Get the duplicated images
df_image_from_duplicate = df_train_images[df_train_images.StudyInstanceUID.isin(unique_study_duplicate)]


# Count for each category the number of duplicated study
duplicate_study_case = [df_study_from_duplicate[label].value_counts()[1] for label in labels]
total_study_case = [df_train_study[label].value_counts()[1] for label in labels]

# Get the percentage for each category
ratio_duplicate = [x / y for x, y in zip(duplicate_study_case, total_study_case)] 

print("Ratio total duplicated image : ", len(df_image_from_duplicate) / len(df_train_images))
print("Ratio total duplicated study : ", sum(duplicate_study_case) / sum(total_study_case))

print()
print("Ratio of duplicated studies for each category :")
print() 

for i in range(len(ratio_duplicate)):
    print(labels[i], " : ", ratio_duplicate[i])

In [None]:
plt.figure(figsize=(8, 8))
plt.pie(ratio_duplicate, labels=labels, autopct='%1.1f%%', normalize=True)
plt.suptitle("Distribution of duplications for each category", fontsize=20)
plt.show()

In this section, we looked at the duplicated studies and images. They could be X-rays that were retaken, copy/paste duplicates, or even images that were analysed multiple times. Radiograph analysis is a complex task, and errors are possible even for the most brilliant doctor. So we should keep in mind that our dataset may contain errors.

Nevertheless, there are several ways to handle the duplicate images. 
- The simplest solution is to ignore these files. As they are a small percentage of our dataset, this might be feasible, and it avoids the duplication problem altogether.

- The other possibility is to keep part of the data, since not all of it has to be thrown away. Some of the files are exact duplicates. For them, it would be good to analyse each group of images and keep only the best one, together with the metadata collected from the others. We could do the same for near-duplicates, those that differ only in brightness and cropping.

## See some images with their boxes
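
Exact duplicates can be found by hashing the pixel data. The following is a minimal sketch, assuming the pixel arrays have already been loaded (for example with `ds.pixel_array`); the function names and the toy arrays are made up for illustration.

```python
import hashlib
from collections import defaultdict

import numpy as np

def pixel_digest(pixels):
    """SHA-256 digest of an array, including its dtype and shape."""
    arr = np.ascontiguousarray(pixels)
    header = f"{arr.dtype}|{arr.shape}|".encode()
    return hashlib.sha256(header + arr.tobytes()).hexdigest()

def group_exact_duplicates(images):
    """images: dict of image id -> pixel array. Returns the groups of ids with identical pixels."""
    groups = defaultdict(list)
    for image_id, pixels in images.items():
        groups[pixel_digest(pixels)].append(image_id)
    return [ids for ids in groups.values() if len(ids) > 1]

# Toy arrays standing in for DICOM pixel data
base = np.arange(12, dtype=np.uint16).reshape(3, 4)
toy_images = {"img_a": base, "img_b": base.copy(), "img_c": base + 1}

duplicate_groups = group_exact_duplicates(toy_images)
print(duplicate_groups)
```

This only catches byte-identical images. Copies that differ in brightness or cropping would need a fuzzier comparison, such as a perceptual hash.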

In [None]:
sample_images_with_boxes = df_train_images[df_train_images.boxes.notna()].sample(n=10, random_state=42)
sample_images_with_boxes.head()

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=5, figsize=(5 * 4, 4 * 2))

i = 0
# Iterate through the sample
for index, row in sample_images_with_boxes.iterrows():
    # Read and show image
    img = read_image_from_image(row['StudyInstanceUID'], row['id'])
    axes[i // NUMBER_OF_SAMPLE, i % NUMBER_OF_SAMPLE].imshow(img, cmap=plt.cm.gray, aspect='auto')
    
    # The boxes are saved as str, we need to translate them to array of dict
    array_boxes = ast.literal_eval(row.boxes) 
    
    # Now, show the boxes
    for box in array_boxes:
        rect = patches.Rectangle((box['x'], box['y']),
                                 box['width'], 
                                 box['height'], 
                                 edgecolor='r', 
                                 facecolor="none")
        
        axes[i // NUMBER_OF_SAMPLE, i % NUMBER_OF_SAMPLE].add_patch(rect)
    
    # Remove axis information
    axes[i // NUMBER_OF_SAMPLE, i % NUMBER_OF_SAMPLE].axis('off')
    i += 1

With these images, we can first see that the images in this sample are really different. If we take the second one, I can barely see the content (and I have to turn the brightness of my screen to the maximum!). The fourth one is also interesting because the image has been rotated and cropped. In this sample we can really see the different image contrasts we have.
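
Since brightness and contrast vary so much, rescaling every image to a common display range before plotting may help. Here is a minimal sketch using percentile clipping; the function name `to_display_range` and the percentile defaults are assumptions, and the `invert` flag is meant for images whose DICOM *Photometric Interpretation* is MONOCHROME1 (where low values are displayed as white).

```python
import numpy as np

def to_display_range(pixels, invert=False, low_pct=1.0, high_pct=99.0):
    """Clip to the given percentiles and rescale to uint8 (0..255)."""
    pixels = np.asarray(pixels, dtype=np.float64)
    lo, hi = np.percentile(pixels, [low_pct, high_pct])
    if hi <= lo:
        # Flat image: nothing to rescale
        return np.zeros(pixels.shape, dtype=np.uint8)
    scaled = np.clip((pixels - lo) / (hi - lo), 0.0, 1.0)
    if invert:
        scaled = 1.0 - scaled
    return (scaled * 255).round().astype(np.uint8)

# Toy dark image using only a small part of the available range (illustrative values)
dark = np.linspace(0, 400, 100).reshape(10, 10)
rescaled = to_display_range(dark)
inverted = to_display_range(dark, invert=True)
print(rescaled.min(), rescaled.max())
```

In the plotting loop, `imshow(to_display_range(img), ...)` could replace `imshow(img, ...)`.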


Regarding the boxes in this sample, we usually have two boxes, placed on the left and on the right. Moreover, they are mainly located between the inferior and the middle lobes. However, this is only a sample, and we cannot generalize from such a small amount of data.


<img src="https://cdn.lecturio.com/assets/Lobes-and-fissures-of-the-lungs-1200x570.jpg" width="800" />

Credit : https://www.lecturio.com/concepts/lungs/ - Image by Lecturio.

### Box size analysis

In [None]:
sample_images_with_boxes = df_train_images[df_train_images.boxes.notna()]

# Collect every box (a dict with x, y, width, height) in a list first:
# DataFrame.append was removed in pandas 2.0 and was slow when called in a loop
all_boxes = []
for boxes in sample_images_with_boxes.boxes:
    all_boxes.extend(ast.literal_eval(boxes))

box_size = pd.DataFrame(all_boxes)
box_size.head()

In [None]:
box_size.describe()

In [None]:
# Show box size
sizes = box_size.groupby(['height', 'width']).size().reset_index().rename(columns={0 : 'count'})
sizes.hvplot.scatter(
    x='height', 
    y='width', 
    size='count',
    title='Box size distribution',
    xlim=(0,3141), ylim=(0,1920), 
    grid=True, 
    height=500, width=1000).options(scaling_factor=0.1, line_alpha=1, fill_alpha=0)

Regarding the size of the boxes, they seem to have similar shapes, but very variable sizes. 
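
The impression of "similar shapes, variable sizes" can be quantified with the area and aspect ratio of each box. Here is a minimal sketch: the `x`, `y`, `width` and `height` keys are the ones used in the `boxes` column, while the helper name and the toy values are made up for illustration.

```python
import ast

def box_shape_stats(boxes_str):
    """Parse a boxes string and return the area and aspect ratio of each box."""
    boxes = ast.literal_eval(boxes_str)
    return [
        {
            "area": box["width"] * box["height"],
            "aspect_ratio": box["width"] / box["height"],
        }
        for box in boxes
    ]

# Toy boxes string in the same format as the `boxes` column (values are made up)
toy_boxes = (
    "[{'x': 100.0, 'y': 200.0, 'width': 400.0, 'height': 800.0}, "
    "{'x': 900.0, 'y': 250.0, 'width': 200.0, 'height': 400.0}]"
)

stats = box_shape_stats(toy_boxes)
print(stats)
```

Here both boxes have the same aspect ratio but very different areas, which is the pattern suggested by the scatter plot.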

#### Annotation label

For each box, we have an *opacity* tag. The question we might ask is whether we have any other tags.

In [None]:
# Keep the first token of each label string, which is the tag of the first annotation
first_tags = [label.split(' ')[0] for label in df_train_images.label.values]

pd.Series(first_tags).value_counts()

Here, we know we have only two tags for boxes : *none* or *opacity*
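
The cell above only looks at the first token of each label string. To inspect every annotation, the whole string can be parsed. This is a minimal sketch under the assumption that each annotation is made of 6 space-separated tokens (tag, confidence, then four coordinates); that layout is not verified in this notebook, and the function name and sample strings are made up for illustration.

```python
def parse_label(label):
    """Split a label string into annotations of 6 tokens each (assumed layout)."""
    tokens = label.split()
    if len(tokens) % 6 != 0:
        raise ValueError("Unexpected number of tokens: " + str(len(tokens)))
    annotations = []
    for start in range(0, len(tokens), 6):
        tag, confidence, *coords = tokens[start:start + 6]
        annotations.append({
            "tag": tag,
            "confidence": float(confidence),
            "box": tuple(float(c) for c in coords),
        })
    return annotations

# Made-up label strings following the assumed layout
empty_label = parse_label("none 1 0 0 1 1")
two_boxes = parse_label("opacity 1 10.5 20 110.5 220 opacity 1 300 40 450 260")
print([annotation["tag"] for annotation in two_boxes])
```

Counting `annotation["tag"]` over all rows would confirm whether any tag other than *none* and *opacity* appears after the first position.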

# DICOM image metadata analysis

The X-ray images are stored in DICOM format. DICOM (*Digital Imaging and Communications in Medicine*) is the standard for digital files created during medical imaging examinations. It also covers the specifications concerning their archiving and their transmission over a network (particularly important aspects in the medical field). Independent of technologies (scanner, MRI, etc.) and manufacturers, it allows standardised access to medical imaging results. In addition to the digital images from medical examinations, DICOM files also carry a lot of textual information about the patient (marital status, age, weight, etc.), the examination carried out (region explored, imaging technique used, etc.), the acquisition date, the practitioner, etc.

> https://sti-biotechnologies-pedagogie.web.ac-grenoble.fr/content/fichiers-dicom-format-dcm-en-imagerie-medicale

By analysing these files, we might be able to find interesting points that we could exploit.

In [None]:
# Merge the two dataframes
df_merged_data = df_train_study.merge(df_train_images, on="StudyInstanceUID")

## Visualize some DICOM images

In [None]:
path = '../input/siim-covid19-detection/train/00086460a852/9e8302230c91/65761e66de9f.dcm'
ds = dcmread(path)
print(ds)
plt.imshow(ds.pixel_array, cmap=plt.cm.gray)
plt.show()

In [None]:
path = '../input/siim-covid19-detection/train/057c02a959f1/6de2191aa170/ba463980acdb.dcm'
ds = dcmread(path)
print(ds)
plt.imshow(ds.pixel_array, cmap=plt.cm.gray)
plt.show()

## Get meta-information from train images

In [None]:
def dcm2metadata(sample):
    metadata = {}
    for key in sample.keys():
        # Skip the elements whose tag group is 50 or above
        if key.group >= 50:
            continue
        item = sample.get(key)
        # Check each element as it is read, so that a stale `item` is never reused
        if hasattr(item, 'description') and hasattr(item, 'value'):
            metadata[item.description()] = str(item.value)
    return metadata

TRAIN_PATH = "../input/siim-covid19-detection/train"
train_images_path = glob.glob(TRAIN_PATH + "/*/*/*.dcm")

# Collect one dict per image, then build the dataframe once
# (DataFrame.append was removed in pandas 2.0)
records = []
for image in tqdm(train_images_path):    
    # Read only the metadata here
    ds = dcmread(image, stop_before_pixels=True)
    records.append(dcm2metadata(ds))

image_metadata = pd.DataFrame(records)
image_metadata.head()

In [None]:
image_metadata.columns

Based on the metadata, I decided to focus only on the following columns :

- Patient ID
- Patient's Sex 
- Modality
- Body Part Examined
- Image type
- Columns
- Rows

### Patient ID

In [None]:
print("Number of unique patients : ", len(image_metadata["Patient ID"].unique()))

### Patient's Sex

In [None]:
nb_male = len(image_metadata[image_metadata["Patient's Sex"] == 'M'])
nb_female = len(image_metadata[image_metadata["Patient's Sex"] == 'F'])

plt.figure(figsize=(6,6))
plt.title("Gender distribution")
plt.pie([nb_male, nb_female], labels=['Male', 'Female'], autopct='%1.1f%%', colors=['b', 'r'])
plt.show()

### Modality

In [None]:
image_metadata["Modality"].value_counts()

Modality information :
- DX : Digital Radiography
- CR : Computed Radiography

> References : https://www.dicomlibrary.com/dicom/modality/

TL;DR

These are two methods of acquiring digital X-ray images. In short, DX (called DR in the quote below) offers superior throughput compared to CR.

More information :
> Computed radiography (CR) cassettes use photo-stimulated luminescence screens to capture the X-ray image, instead of traditional X-ray film. The CR cassette goes into a reader to convert the data into a digital image. Digital radiography (DR) systems use active matrix flat panels consisting of a detection layer deposited over an active matrix array of thin film transistors and photodiodes. With DR the image is converted to digital data in real-time and is available for review within seconds.

> While both CR and DR have a wider dose range and can be post processed to eliminate mistakes and avoid repeat examinations, DR has some significant advantages over CR. DR improves workflow by producing higher quality images instantaneously while providing two to three times more dose efficiency than CR.

> The good and bad of CR is that it enables digital imaging with the traditional workflow of X-ray film. With CR, like film, no synchronization to the generator is required, which had been a requirement for DR imaging. However, recent advances in DR panels are improving their flexibility, portability, and affordability.

Cited : Rick Colbeth - June 6, 2016 - https://www.vareximaging.com/computed-radiography-cr-and-digital-radiography-dr-which-should-you-choose

### Body Part Examined

In [None]:
image_metadata["Body Part Examined"].value_counts()

Looking at the different values, I think we can group those that seem to mean *Thorax* (*TORAX*, *T?RAX*, *THORAX*, *2- TORAX*, *TÒRAX*).

- The empty category is a bit odd. Maybe the operator forgot to assign the body part when taking the X-ray?

- *PORT CHEST* : may refer to the upper chest, where we can find a port, a small medical appliance used to inject drugs or to collect blood samples. It could also simply be short for a *portable* chest X-ray.
> https://en.wikipedia.org/wiki/Port_(medical)

- *Pecho* : is the Spanish term for chest. We can group the *Pecho* and *PECHO* categories.

- *SKULL* : I don't really know why this label is used here. In the sample below, the images look the same as chest radiographs.

- *ABDOMEN* : seems to be a taller image. In the sample, we can see that both the chest and part of the abdomen are visible.
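
The grouping described above can be sketched as a small normalization function. The alias set below is an assumption built from the values listed in this section (with *PORT CHEST* folded into the chest group), and the names `CHEST_ALIASES` and `normalize_body_part` are made up for illustration.

```python
import unicodedata

# Spellings treated as "chest" (assumption based on the values seen above)
CHEST_ALIASES = {"TORAX", "THORAX", "T?RAX", "2- TORAX", "PECHO", "CHEST", "PORT CHEST"}

def normalize_body_part(raw):
    """Upper-case, strip accents and merge the chest spellings into 'CHEST'."""
    text = unicodedata.normalize("NFKD", raw.strip().upper())
    # Drop the combining accent marks left by the decomposition (e.g. the accent of 'Ò')
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    if text == "":
        return "UNKNOWN"
    if text in CHEST_ALIASES:
        return "CHEST"
    return text

raw_values = ["TORAX", "T?RAX", "TÒRAX", "2- TORAX", "Pecho", "PORT CHEST", "", "SKULL", "ABDOMEN"]
normalized = [normalize_body_part(value) for value in raw_values]
print(normalized)
```

*SKULL* and *ABDOMEN* are deliberately left as they are, since the text above only inspects a few samples of them. On the real data, this could be applied with `image_metadata["Body Part Examined"].apply(normalize_body_part)`.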

In [None]:
def get_sample_body_part(sample, title):
    fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(3 * 4, 4))
    fig.suptitle(title)
    i = 0
    for study_id in sample['Study Instance UID'].values:
        path = glob.glob('../input/siim-covid19-detection/train/' + study_id + '/*/*.dcm')
        ds = dcmread(path[0])
        axes[i].imshow(ds.pixel_array, cmap=plt.cm.gray, aspect='auto')
        axes[i].axis('off')
        i+=1    

sample_port_chest = image_metadata[image_metadata["Body Part Examined"] == "PORT CHEST"].sample(n=3, random_state=42)
get_sample_body_part(sample_port_chest, 'Radiography Port Chest')

In [None]:
sample_empty = image_metadata[image_metadata["Body Part Examined"] == ""].sample(n=3, random_state=42)
get_sample_body_part(sample_empty, 'Radiography Empty')

In [None]:
sample_skull = image_metadata[image_metadata["Body Part Examined"] == "SKULL"].sample(n=3, random_state=42)
get_sample_body_part(sample_skull, 'Radiography Skull')

In [None]:
sample_pecho = image_metadata[image_metadata["Body Part Examined"] == "Pecho"].sample(n=3, random_state=42)
get_sample_body_part(sample_pecho, 'Radiography Pecho')

In [None]:
sample_abdomen = image_metadata[image_metadata["Body Part Examined"] == "ABDOMEN"].sample(n=3, random_state=42)
get_sample_body_part(sample_abdomen, 'Radiography ABDOMEN')

### Image type

In [None]:
image_metadata["Image Type"].value_counts()

The *Image Type* attribute is multi-valued. Its values describe, in order :

- Pixel Data Characteristics
    - is the image an ORIGINAL Image; an image whose pixel values are based on original or source data
    - is the image a DERIVED Image; an image whose pixel values have been derived in some manner from the pixel value of one or more other images

- Patient Examination Characteristics
    - is the image a PRIMARY Image; an image created as a direct result of the patient examination
    - is the image a SECONDARY Image; an image created after the initial patient examination

- Modality Specific Characteristics

- Implementation specific identifiers; other implementation specific identifiers shall be documented in an implementation's conformance statement.



> https://dicom.innolitics.com/ciods/ct-image/general-image/00080008
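
To analyse these characteristics separately, the stored string can be split back into its values. This is a minimal sketch, assuming the value was stringified either as a Python list literal (which is what `str(item.value)` may produce for multi-valued elements) or in the backslash-separated DICOM form; the function name and the sample strings are made up for illustration.

```python
import ast

def parse_image_type(raw):
    """Split an Image Type string into its first two values and the remaining ones."""
    raw = raw.strip()
    if raw.startswith("["):
        # Stringified Python list, e.g. "['DERIVED', 'PRIMARY']"
        values = [str(value) for value in ast.literal_eval(raw)]
    else:
        # Backslash-separated DICOM form, e.g. ORIGINAL\PRIMARY
        values = [value for value in raw.split("\\") if value]
    return {
        "pixel_data": values[0] if len(values) > 0 else None,
        "patient_examination": values[1] if len(values) > 1 else None,
        "other": values[2:],
    }

from_list = parse_image_type("['DERIVED', 'PRIMARY', 'POST_PROCESSED']")
from_dicom = parse_image_type("ORIGINAL\\PRIMARY")
print(from_list)
print(from_dicom)
```

Applied to `image_metadata["Image Type"]`, this would allow counting ORIGINAL versus DERIVED images independently of the trailing values.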

### Image size analysis

In [None]:
# Convert dtype
image_metadata.Columns = np.array(image_metadata.Columns, dtype=int)
image_metadata.Rows = np.array(image_metadata.Rows, dtype=int)

# Show image size
sizes = image_metadata.groupby(['Columns', 'Rows']).size().reset_index().rename(columns={0 : 'count'})
sizes.hvplot.scatter(
    x='Columns', 
    y='Rows', 
    size='count',
    title='Image size distribution',
    xlim=(0,5000), ylim=(0,5000), 
    grid=True, 
    height=500, width=1000).options(scaling_factor=0.1, line_alpha=1, fill_alpha=0)

As we can see in the diagram above, the image sizes roughly follow a straight line through the origin, which suggests that most images are close to square. Moreover, we have a high concentration of images with a side between 2000 and 3000 pixels.
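
The "close to square" impression can be turned into a number by measuring the share of images whose width/height ratio is near 1. Here is a minimal sketch: the 10% tolerance, the function name and the toy sizes are all assumptions made for illustration.

```python
def share_nearly_square(sizes, tolerance=0.1):
    """sizes: iterable of (columns, rows). Returns the share with |columns / rows - 1| <= tolerance."""
    sizes = list(sizes)
    if not sizes:
        return 0.0
    nearly_square = sum(1 for columns, rows in sizes if abs(columns / rows - 1.0) <= tolerance)
    return nearly_square / len(sizes)

# Toy (columns, rows) pairs, not taken from the dataset
toy_sizes = [(2000, 2000), (3000, 2900), (4000, 2000), (2500, 3050)]
square_share = share_nearly_square(toy_sizes)
print(square_share)
```

On the real data, the pairs could come from `zip(image_metadata.Columns, image_metadata.Rows)` after the dtype conversion above.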

# Conclusion

In this notebook, we have seen a lot of information. To summarise some of the main ideas:
- We have unbalanced data.
- A study can contain several images, and those images can be duplicates.
- The brightness of the images varies a lot.
- According to the metadata, the images do correspond to the chest region, even though the *Body Part Examined* labels are inconsistent.
- The images appear to be roughly square, with sizes concentrated between 2000 and 3000 pixels.
- The images are split about equally between CR and DX. 


Thank you for your reading time. I hope you found this notebook useful. Let me know if you like it or not. And if you have questions or remarks, don't hesitate :)