**Analysis of the image files for the [SIIM COVID-19 Detection](https://www.kaggle.com/c/siim-covid19-detection/overview) competition**.

**Conclusions**:
- The image must be preprocessed as explained at [this link](https://www.kaggle.com/raddar/convert-dicom-to-np-array-the-correct-way).
- Several groups of images can be distinguished based on their pixel values.
- It is better to use matplotlib instead of plotly as long as interactivity is not needed.

# CONFS

In [None]:
ROOT = '/kaggle/input/siim-covid19-detection'
EXAMPLE = '/kaggle/input/siim-covid19-detection/train/cd5dd5e6f3f5/b2ee36aa2df5/d8ba599611e5.dcm'

# IMPORTS

In [None]:
# required to handle compressed pixel values
!conda install --yes --channel=conda-forge gdcm

In [None]:
import pathlib
import pydicom
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt

In [None]:
from concurrent.futures import ThreadPoolExecutor
from pydicom.pixel_data_handlers.util import apply_voi_lut

# DATASETS

In [None]:
def compute_statistics(file: pathlib.Path) -> pd.Series:
    """Generate statistics from the pixels of a dicom file."""
    pixels = pydicom.dcmread(file).pixel_array  # decode the pixel data
    stats = pd.Series(pixels.flatten()).describe()
    stats['rows'] = pixels.shape[0]
    stats['cols'] = pixels.shape[1]
    stats.name = file.stem
    return stats

In [None]:
with ThreadPoolExecutor(100) as executor:
    files = pathlib.Path(ROOT).glob('**/*.dcm')
    stats = executor.map(compute_statistics, files)
df = pd.DataFrame(stats)
df.head()

# ANALYSIS

In [None]:
df.info()

- The number of pixels per image (`count`) is not evenly distributed.
- The relationship between the mean and the standard deviation shows clusters.
- The same holds for the median and the interquartile range.
- On the other hand, the numbers of rows and columns follow a linear relationship.
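The observations above can also be checked numerically. The following is a minimal sketch, not part of the original analysis: `toy` is a hypothetical stand-in for `df` with made-up values, and it only illustrates how the interquartile range and the rows/columns correlation could be derived from the `describe()` columns.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; all values are made up.
toy = pd.DataFrame(
    {
        "25%": [10.0, 20.0, 900.0],
        "50%": [30.0, 45.0, 1500.0],
        "75%": [60.0, 80.0, 2400.0],
        "rows": [2000.0, 2400.0, 3000.0],
        "cols": [2500.0, 3000.0, 3750.0],
    },
    index=["img_a", "img_b", "img_c"],
)

# Interquartile range: spread of the middle 50% of the pixel values.
toy["iqr"] = toy["75%"] - toy["25%"]

# A Pearson correlation close to 1 indicates a near-linear rows/cols relationship.
rows_cols_corr = toy["rows"].corr(toy["cols"])

print(toy[["50%", "iqr"]])
print(f"rows/cols correlation: {rows_cols_corr:.3f}")
```

On the real `df`, the same two lines (`df['75%'] - df['25%']` and `df['rows'].corr(df['cols'])`) would apply unchanged.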

In [None]:
px.histogram(df, x='count')

In [None]:
px.scatter(df, x='mean', y='std')

In [None]:
px.scatter(df, x='50%', y=df['75%']-df['25%'])

In [None]:
px.scatter(df, x='rows', y='cols')

# EXAMPLES

In [None]:
def read_pixels(path: pathlib.Path, voi_lut: bool = True, fix_monochrome: bool = True):
    """Read a dicom file and convert its pixel values to a proper image format/range."""
    # Original from: https://www.kaggle.com/raddar/convert-dicom-to-np-array-the-correct-way
    dicom = pydicom.dcmread(path)
    # VOI LUT (if available by DICOM device) is used to transform raw DICOM data to "human-friendly" view
    if voi_lut:
        pixels = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        pixels = dicom.pixel_array
    # depending on the photometric interpretation, X-ray may look inverted. Fix the problem by reversing values
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        pixels = np.amax(pixels) - pixels
    # normalize the pixel values to a range between 0 and 1: (X - min) / (max - min)
    pixels = (pixels - np.min(pixels)) / (np.max(pixels) - np.min(pixels))
    # convert the range from [0, 1] to [0, 255] and cast to uint8
    pixels = (pixels * 255).astype(np.uint8)
    return pixels

- Colors are inverted on some images (`MONOCHROME1`).
- The goal is to have black lungs on a white background.
- The `fix_monochrome` parameter fixes this problem.
- The `voi_lut` parameter seems to make images a little lighter.
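To make the inversion and rescaling steps concrete, here is a minimal sketch on a toy array rather than a real DICOM file. The pixel values are made up, and `to_uint8` is a hypothetical helper that mirrors the last two steps of `read_pixels`.

```python
import numpy as np

# Toy 2x2 "image" with made-up raw values, for illustration only.
raw = np.array([[0, 1000], [3000, 4000]], dtype=np.uint16)

# MONOCHROME1 stores high values as dark, so reverse the intensity scale.
inverted = np.amax(raw) - raw


def to_uint8(pixels: np.ndarray) -> np.ndarray:
    """Min-max scale to [0, 1], then stretch to [0, 255] and cast to uint8."""
    scaled = (pixels - np.min(pixels)) / (np.max(pixels) - np.min(pixels))
    return (scaled * 255).astype(np.uint8)


print(to_uint8(raw))
print(to_uint8(inverted))
```

The two printed arrays are mirror images of each other on the 0 to 255 scale, which is the effect `fix_monochrome=True` has on a `MONOCHROME1` image. Note that this sketch, like `read_pixels`, assumes the image is not constant; otherwise the denominator would be zero.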

In [None]:
title = 'VOI_LUT=False, FIX_MONOCHROME=False'
pixels = read_pixels(EXAMPLE, False, False)
plt.figure(figsize=(12,12))
plt.imshow(pixels, 'gray')
plt.title(title);

In [None]:
title = 'VOI_LUT=True, FIX_MONOCHROME=False'
pixels = read_pixels(EXAMPLE, True, False)
plt.figure(figsize=(12,12))
plt.imshow(pixels, 'gray')
plt.title(title);

In [None]:
title = 'VOI_LUT=False, FIX_MONOCHROME=True'
pixels = read_pixels(EXAMPLE, False, True)
plt.figure(figsize=(12,12))
plt.imshow(pixels, 'gray')
plt.title(title);

In [None]:
title = 'VOI_LUT=True, FIX_MONOCHROME=True'
pixels = read_pixels(EXAMPLE, True, True)
plt.figure(figsize=(12,12))
plt.imshow(pixels, 'gray')
plt.title(title);