# Preprocessing

A challenge for me at the start of this competition was that the original image and CSV files are both a bit hard to work with.
The goal of this notebook is to parse the input data into a format that is easier to use with the various object detection APIs.

Additionally, I think it's always a good idea to have a look at the images to see if a non-radiologist can pick out some patterns in the boxed areas.
At the contrast levels of the original DICOM images, this was very difficult for me to do personally.
So I used `skimage`'s `exposure` module to change the contrast level of the input images. I also converted all of these images into PNG files so it is easier to look at hundreds of them in a browser.


This notebook will primarily show two things:
1. How the images can be rescaled and how the contrast level can be adjusted to maximize the visibility of the data. (Note that we rescale instead of cropping since we are doing object detection and the bounding boxes can often sit near the edges of the image.)
2. How the study-level and image-level dataframes can be combined with the image size information to arrive at a master dataframe that is a bit more verbose but much easier to work with.

This notebook shows how we arrive at the data transformations, but only for 50 images in the set, since these transformations are pretty expensive. The full result of these transformations will be uploaded as a separate dataset.

The processed PNG files are saved in `train_png`, and a new dataframe where each row is a specific bounding box (so the same image can have multiple rows) is saved in `train_boxes.csv`.

Using the processed PNG files we will look at the images and see if we can figure out any patterns.

The dataset containing the parsed training and test images is available here:
https://www.kaggle.com/jmmshn/siim-covid19-png1024  (need to rerun for the monochrome fix)

Updates:
The data must be converted so that high and low values mean the same thing across different images.
Thanks to a comment by @raddar.

```
    if dimg.PhotometricInterpretation == "MONOCHROME1":
        img = np.amax(img) - img
```
Detailed explanation can be found here:
https://www.kaggle.com/raddar/convert-dicom-to-np-array-the-correct-way
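
As a minimal illustration of what this inversion does, here is a sketch on a made-up 2x2 array rather than real DICOM data. The helper name `invert_monochrome1` is hypothetical and only used for this example.

```python
import numpy as np

def invert_monochrome1(pixels: np.ndarray, photometric: str) -> np.ndarray:
    """Return pixels in which larger values are always brighter."""
    if photometric == "MONOCHROME1":
        # MONOCHROME1 displays the smallest value as white, so flip the scale
        return np.amax(pixels) - pixels
    return pixels

toy = np.array([[0, 100], [200, 400]])  # made-up pixel values
flipped = invert_monochrome1(toy, "MONOCHROME1")
unchanged = invert_monochrome1(toy, "MONOCHROME2")
print(flipped)
```

After the flip, the largest original value becomes 0 and the smallest becomes the old maximum, so both photometric interpretations end up on the same "high means bright" scale.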


In [None]:
# a few of the images error out during reading; this must be installed to parse all the dicom files
!pip install python-gdcm

In [None]:
%matplotlib inline
from fastai.basics import *
from fastai.vision.all import *
from fastai.data.transforms import *
from fastai.medical.imaging import *
# import pydicom,kornia,skimage
from tqdm.auto import tqdm

try:
    import cv2
    cv2.setNumThreads(0)  # disable OpenCV's internal multithreading
except ImportError:
    pass
from pydicom.dataset import FileDataset

from matplotlib import patches, patheffects
from itertools import chain

DATA_ROOT = Path("../input/siim-covid19-detection/")

Reading the DICOM files can be done easily with the `fastai` API.
The `get_dicom_files` function returns a list-like object containing all the DICOM files found recursively in a folder.

In [None]:
dicom_files = get_dicom_files(DATA_ROOT / "train")
print(f"There are {len(dicom_files)} images in all of the subdirectories.")


In this notebook, we will only use the first 50 files, but the full set of data will be parsed and uploaded.


In [None]:
dicom_files = dicom_files[:50]

### Reading the image sizes

We will need the shape of each image, returned as (n_rows, n_cols) for the pixel data.


The sequential method, commented out below, took around 20 mins on the Kaggle cloud.

The concurrent method (a `ThreadPoolExecutor` with 4 workers in the cell below) finished in around 9 mins.

In [None]:
# Convert DICOM to PNG via openCV
from concurrent.futures import ThreadPoolExecutor, as_completed, ProcessPoolExecutor
from skimage import exposure
import cv2
import os

# create the output folder if it does not exist yet
os.makedirs("./train_png", exist_ok=True)

def process_img(img_file):
    img_id = str(img_file).split("/")[-1].split(".")[0]
    dimg = img_file.dcmread()
    
    img = dimg.pixel_array
    # Convert the data so that the bones are white in all the images
    # https://www.kaggle.com/raddar/convert-dicom-to-np-array-the-correct-way
    # from comment by @raddar
    if dimg.PhotometricInterpretation == "MONOCHROME1":
        img = np.amax(img) - img
    img = exposure.equalize_adapthist(img, clip_limit=0.03) # optimize the contrast
    # resize image
    resized = cv2.resize(img*255, (1024, 1024), interpolation = cv2.INTER_LANCZOS4)
    cv2.imwrite(f"./train_png/{img_id}.png", resized)
    return img_id, dimg.pixel_array.shape

image_sizes = dict()
# for img_file in tqdm(dicom_files):
#     img_id, img_shape = process_img(img_file)
#     image_sizes[img_id] = img_shape
with ThreadPoolExecutor(max_workers=4) as ex:
    results = [ex.submit(process_img, img_file) for img_file in dicom_files]
    for f in tqdm(as_completed(results), total=len(dicom_files)):
        img_id, img_shape = f.result()
        image_sizes[img_id] = img_shape

The following warning appeared around the 4000th file when processing the full data set:
```
/opt/conda/lib/python3.7/site-packages/pydicom/pixel_data_handlers/numpy_handler.py:341: UserWarning: The length of the pixel data in the dataset (13262360 bytes) indicates it contains excess padding. 216296 bytes will be removed from the end of the data
  warnings.warn(msg)
```
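
The conversion cell above submits one task per file and collects the results as they finish, so they can arrive in any order. That is why each task returns its own id together with the shape. The pattern can be tried in isolation with a stand-in function. In this sketch `fake_process` and the file names are made up, and the returned "shape" is just derived from the name length.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_process(name):
    # stand-in for process_img: returns an id and a made-up shape
    return name.split(".")[0], (len(name), 2 * len(name))

names = ["a.dcm", "bb.dcm", "ccc.dcm"]
sizes = {}
with ThreadPoolExecutor(max_workers=2) as ex:
    futures = [ex.submit(fake_process, n) for n in names]
    # as_completed yields futures in completion order, not submission order
    for fut in as_completed(futures):
        img_id, shape = fut.result()
        sizes[img_id] = shape
```

Because the results are keyed by id, the final dictionary is the same regardless of which task finishes first.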

The CSV files can be parsed into pandas. Note that the `id` fields in both have suffixes indicating whether it is a study id or an image id.

In [None]:
train_image_level = pd.read_csv(DATA_ROOT / "train_image_level.csv")
train_study_level = pd.read_csv(DATA_ROOT / "train_study_level.csv", index_col='id')

In [None]:
print(len(train_image_level))
train_image_level.head(2)

In [None]:
print(len(train_study_level))
train_study_level.head(2)

These dataframes can be combined based on the `StudyInstanceUID` value which is just the study id with the extra suffix.

The data in `boxes` are stored as strings; they should be parsed, converted into (x_min, y_min, x_max, y_max) format, and expanded into a new row for each box.
Note that for `boxes == NaN` we will convert to `0 0 1 1` just like in `label`.
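
As a small sketch of this conversion on a single cell value: the helper name `parse_boxes` is hypothetical, the sample string is made up, and `ast.literal_eval` is used here as a safer stand-in for `eval`.

```python
import ast

def parse_boxes(raw):
    """Turn a raw `boxes` cell into a list of (x_min, y_min, x_max, y_max) tuples."""
    if not isinstance(raw, str):
        # missing boxes (NaN): use the same 0 0 1 1 placeholder as in `label`
        return [(0, 0, 1, 1)]
    boxes = ast.literal_eval(raw)  # parses the literal without executing code
    return [
        (b["x"], b["y"], b["x"] + b["width"], b["y"] + b["height"])
        for b in boxes
    ]

raw = "[{'x': 10.0, 'y': 20.0, 'width': 30.0, 'height': 40.0}]"  # made-up example
parsed = parse_boxes(raw)
missing = parse_boxes(float("nan"))
```

Here `parsed` holds one box whose far corner is the origin plus the width and height, and `missing` holds only the placeholder box.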

In [None]:
# setting it to the index makes the id disappear after merge
train_study_level['StudyInstanceUID'] = train_study_level.apply(lambda x : x.name.split('_')[0], axis = 1)
train_study_level['study_outcome'] = train_study_level[['Negative for Pneumonia', 'Typical Appearance',
       'Indeterminate Appearance', 'Atypical Appearance']].idxmax(axis=1)

df_train = pd.merge(train_image_level,train_study_level[["study_outcome", "StudyInstanceUID"]],how='left',on='StudyInstanceUID')
# merging idea from https://www.kaggle.com/southsakura/covid19
df_train['image_id'] = df_train.id.apply(lambda x: x.split('_')[0]) # remove the suffix
# df_train.set_index('id', inplace=True)

import ast  # literal_eval parses the string without executing arbitrary code
# parse the boxes into a list of dicts if possible, else use the 0 0 1 1 placeholder just like in label
df_train['boxes'] = df_train['boxes'].map(lambda x: ast.literal_eval(x) if isinstance(x, str) else [{'x': 0, 'y': 0, 'width': 1, 'height': 1}])

df_train.head(5)


For this example notebook we will filter for the images where the size has been measured.

In [None]:
df_train = df_train[df_train.image_id.apply(lambda x: x in image_sizes)]

Since `boxes` now holds a list of dictionaries for each image, we can `explode` that list to get one row for each box in our dataset.

In [None]:
df_boxes = df_train.explode('boxes') # explode is a one-to-many mapping of iterable data

# apply(pd.Series) parses the dictionary in one column into multiple columns 
# https://stackoverflow.com/questions/38231591/split-explode-a-column-of-dictionaries-into-separate-columns-with-pandas 
df_boxes = pd.concat([df_boxes.drop(['boxes'], axis=1), df_boxes['boxes'].apply(pd.Series)], axis=1) 

df_boxes.head(10)
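
The two steps above can be seen on a toy frame. This is a sketch with made-up values; it builds the new columns with `pd.DataFrame(list_of_dicts)` instead of `apply(pd.Series)`, which should give the same columns here, and it resets the index so the two parts line up in `concat`.

```python
import pandas as pd

toy = pd.DataFrame({
    "image_id": ["a", "b"],
    "boxes": [
        [{"x": 1, "y": 2, "width": 3, "height": 4},
         {"x": 5, "y": 6, "width": 7, "height": 8}],
        [{"x": 0, "y": 0, "width": 1, "height": 1}],
    ],
})

# one row per box; image "a" is repeated because it has two boxes
exploded = toy.explode("boxes").reset_index(drop=True)

# spread each dict into its own columns and attach them to the remaining columns
flat = pd.concat(
    [exploded.drop(columns="boxes"), pd.DataFrame(exploded["boxes"].tolist())],
    axis=1,
)
print(flat)
```

The result has three rows and the columns `image_id`, `x`, `y`, `width`, and `height`.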

In [None]:
df_boxes['x_min'] =  df_boxes.x
df_boxes['y_min'] =  df_boxes.y
df_boxes['x_max'] =  df_boxes.x + df_boxes.width
df_boxes['y_max'] =  df_boxes.y + df_boxes.height


### Save the image sizes of each image into the dataframe



In [None]:
df_boxes['tp_x'] = df_boxes.image_id.apply(lambda x: image_sizes[x][1])
df_boxes['tp_y'] = df_boxes.image_id.apply(lambda x: image_sizes[x][0])
for k in ['x_min', 'x_max']:
    df_boxes[k+"_norm"] = df_boxes[k] / df_boxes.tp_x
for k in ['y_min', 'y_max']:
    df_boxes[k+"_norm"] = df_boxes[k] / df_boxes.tp_y
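
The normalized columns make it easy to map a box onto the resized PNG. The sketch below does the equivalent scaling directly for one box, assuming the 1024x1024 target used in this notebook; `scale_box` is a hypothetical helper and the numbers are made up.

```python
def scale_box(box, orig_shape, target=(1024, 1024)):
    """Map (x_min, y_min, x_max, y_max) from the original image to the resized one.

    orig_shape is (n_rows, n_cols), as returned by pixel_array.shape;
    target is the (width, height) of the resized image.
    """
    n_rows, n_cols = orig_shape
    x_min, y_min, x_max, y_max = box
    # x coordinates scale with the number of columns, y with the number of rows
    sx, sy = target[0] / n_cols, target[1] / n_rows
    return (x_min * sx, y_min * sy, x_max * sx, y_max * sy)

scaled = scale_box((512, 256, 1024, 1024), orig_shape=(2048, 4096))
```

Note that the shape is (rows, columns), so the x coordinates are divided by the second entry and the y coordinates by the first, which matches `tp_x` and `tp_y` above.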

### Save the dataframe to a csv file

In [None]:
df_boxes.to_csv('./train_boxes.csv')

## Paranoia checks

In [None]:
# check for one-hot encoding
for j,row in train_study_level.iterrows():
    obs = row[['Negative for Pneumonia',
       'Typical Appearance', 'Indeterminate Appearance',
       'Atypical Appearance']]
    assert sum(obs.values) == 1
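
The same check can also be written without a Python loop. This is a sketch on a made-up two-row frame that uses the same column names as the study-level CSV; it also shows how `idxmax` recovers the outcome name from a one-hot row.

```python
import pandas as pd

cols = ["Negative for Pneumonia", "Typical Appearance",
        "Indeterminate Appearance", "Atypical Appearance"]
toy = pd.DataFrame([[1, 0, 0, 0], [0, 0, 1, 0]], columns=cols)  # made-up rows

# every row must contain exactly one 1
is_one_hot = bool((toy[cols].sum(axis=1) == 1).all())

# the column holding the 1 is the study outcome
outcome = toy[cols].idxmax(axis=1).tolist()
```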

In [None]:
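# the information in "boxes" and "label" agree
# (iterate over the raw train_image_level: df_train.boxes has already been parsed into lists,
# so the isinstance check would skip every row there)
for _, row in train_image_level.iterrows():
    if isinstance(row.boxes, str):
        boxes = json.loads(row.boxes.replace("\'", "\""))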
        labels = row.label.split('opacity')
        labels = [*filter(lambda x: x!="", labels)]
        labels = [*map(lambda x: x.lstrip().rstrip(), labels)]
        assert len(boxes) == len(labels)
        for b, l in zip(boxes, labels):
            assert l.split()[0] == "1"
            assert abs(float(l.split()[3]) - (b['x'] + b['width']))<0.1
            assert abs(float(l.split()[4]) - (b['y'] + b['height']))<0.1
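
For reference, the `label` string can also be parsed on its own. This sketch assumes six whitespace-separated tokens per box (class name, confidence, x_min, y_min, x_max, y_max), which is what the check above implies; `parse_label` is a hypothetical helper and the sample string is made up.

```python
def parse_label(label):
    """Split a label string into (name, confidence, x_min, y_min, x_max, y_max) tuples."""
    tokens = label.split()
    records = []
    # assumption: every box takes exactly six tokens
    for i in range(0, len(tokens), 6):
        name, conf, *coords = tokens[i:i + 6]
        records.append((name, float(conf), *map(float, coords)))
    return records

records = parse_label("opacity 1 10 20 40 60 opacity 1 5.5 6.5 7.5 8.5")
placeholder = parse_label("none 1 0 0 1 1")
```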


In [None]:
df_boxes[df_boxes.study_outcome=="Atypical Appearance"]

# Plotting and investigation of the images

After the images have been processed it becomes a bit easier to look at them carefully.

We will plot both the original and scaled + enhanced image to see exactly what we are looking for.

In [None]:
# A simple lookup table to find the dicom files using the ids
id_to_dicom = dict()
for k in dicom_files:
    img_id = str(k).split("/")[-1].split(".")[0]
    id_to_dicom[img_id] = k

In [None]:
# drawing helper functions inspired by the fastai tutorials
def bb_hw(a):
    return np.array([a[1], a[0], a[3] - a[1], a[2] - a[0]])

def draw_outline(o, lw=1):
    o.set_path_effects(
        [patheffects.Stroke(linewidth=lw, foreground="black"), patheffects.Normal()]
    )

def draw_rect(ax, b: list):
    patch = ax.add_patch(
        patches.Rectangle(b[:2], *b[-2:], fill=False, edgecolor="white", lw=2)
    )
    draw_outline(patch, 4)


def draw_text(ax, xy: list, txt: str, sz=14):
    text = ax.text(*xy, txt, va="top", color="white", fontsize=sz, weight="bold")
    draw_outline(text)


def get_boxes_scale(ax, df, image_id, px_x=1024, px_y=1024):
    """
    look for image_id in a df and add all of the boxes to the plot on ax
    """
    
    for _, row in df[df.image_id == image_id].iterrows():
        if row.x == 0 and row.y == 0:
            continue
        x,y = row.x_min_norm * px_x, row.y_min_norm * px_y
        w,h = (row.x_max_norm - row.x_min_norm) * px_x, (row.y_max_norm - row.y_min_norm) * px_y
        
        patch = ax.add_patch(
            patches.Rectangle((x,y), w, h, fill=False, edgecolor="white", lw=2)
        )
        draw_outline(patch, 2)
        draw_text(
            ax, (x, y), row["study_outcome"].split()[0]
        )

def get_boxes_orig(ax, df, image_id):
    """
    look for image_id in a df and add all of the boxes to the plot on ax
    """
    for _, row in df[df.image_id == image_id].iterrows():
        if row.x == 0 and row.y == 0:
            continue
    
        patch = ax.add_patch(
            patches.Rectangle((row.x, row.y), row.width, row.height, fill=False, edgecolor="white", lw=2)
        )
        draw_outline(patch, 2)
        draw_text(
            ax, (row.x, row.y), row["study_outcome"].split()[0]
        )


def show_dicom(dimg: FileDataset, figsize=None, ax=None):
    """plot the raw pixel data of a dicom file and overlay the original boxes"""
    im = dimg.pixel_array
    if not ax:
        fig, ax = plt.subplots(figsize=figsize)
    ax.imshow(im, cmap='gray')
    ax.set_axis_off()
    ax.set_title("Original")
    get_boxes_orig(ax, df_boxes, dimg.SOPInstanceUID)
    

def show_png(img_id, figsize=None, ax=None):
    im = cv2.imread(f"./train_png/{img_id}.png")
    if not ax:
        fig, ax = plt.subplots(figsize=figsize)
    ax.imshow(im, cmap='gray')
    ax.set_axis_off()
    ax.set_title("Processed")
    get_boxes_scale(ax, df_boxes, img_id)
        
def show_img_by_id(img_id: str, **kwargs):
    img = id_to_dicom[img_id]
    dimg = img.dcmread()
    fig, (ax1, ax2) = plt.subplots(1,2, figsize=(12,6))
    show_dicom(dimg, ax=ax1)
    show_png(dimg.SOPInstanceUID, ax=ax2)
    fig.suptitle(img_id)

In [None]:
for iid in np.unique(df_boxes[df_boxes.study_outcome=="Typical Appearance"].image_id.values)[:10]:
    show_img_by_id(iid)

## Multiple images from studies
Note that the two images `4cbc17936e7d` and `582c442e440b` have no boxes, but they belong to a study, `79c3bf957d49`, which has boxes in another image.

So during training it will probably make sense to drop such images from the data set.

In [None]:
df_train[df_train.StudyInstanceUID == "79c3bf957d49"]

In [None]:
for iid in np.unique(df_boxes[df_boxes.study_outcome=="Indeterminate Appearance"].image_id.values)[:10]:
    show_img_by_id(iid)

In [None]:
for iid in np.unique(df_boxes[df_boxes.study_outcome=="Negative for Pneumonia"].image_id.values)[:10]:
    show_img_by_id(iid)

From these images, I think we are mostly looking for anything that obstructs our view of the bronchioles.