# RSNA 2018 EDA, Part 1

In [None]:
import pandas as pd
import numpy as np
import pydicom
import pylab
import matplotlib.pyplot as plt
import seaborn as sns

## 1. Exploring the Given Files

The directory listing below shows the files we are given, which we will explore one by one (except for the credit request link, which I believe has expired).

In [None]:
!ls -rtl ../input

### 1.1 Sample Submission

A sample submission file is shown below.  (There is more information on this in the "Evaluation" section of the contest rules.)  There will be one row in the submission file for each observation in the test set, and each row will consist of two strings separated by a comma: the patient ID and our prediction for that patient.  The prediction string will contain 5 numbers (separated by spaces) for EACH spot of pneumonia we find for that patient: a confidence level, the x,y location of the box, and the width and height of the box.  (I believe x,y is the upper left corner and not the center, based on the sample submission data and the fact that the training labels are that way.)  For a patient for whom we predict NO pneumonia, the prediction string is empty.

The information required in our submission will affect our choice of model.  We need something that will output bounding boxes and confidence levels.  I think YOLO will do this, but maybe there are other algorithms that will as well.

In [None]:
df = pd.read_csv('../input/stage_1_sample_submission.csv')
df.head()
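
To make the format described above concrete, here is a minimal sketch of how one patient's prediction string could be assembled.  The helper name `format_prediction_string`, the two-decimal confidence, and the example boxes are my own assumptions for illustration; the contest rules remain the authority on the exact format.

```python
def format_prediction_string(boxes):
    """Join (confidence, x, y, width, height) tuples into one prediction string."""
    parts = []
    for conf, x, y, w, h in boxes:
        # five space-separated numbers per box, confidence first
        parts.append(f"{conf:.2f} {int(x)} {int(y)} {int(w)} {int(h)}")
    return " ".join(parts)

# hypothetical predictions: one patient with two boxes, one patient with none
two_boxes = [(0.9, 100, 200, 150, 300), (0.55, 600, 250, 180, 320)]
print(format_prediction_string(two_boxes))
print(repr(format_prediction_string([])))  # empty string means "no pneumonia"
```

A patient with no predicted boxes simply gets an empty string, which matches the description above.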

### 1.2 Training Labels

The training labels file contains the "right answers" for the training set, so we can use some kind of supervised learning algorithm.  The contest information states that each row in this table contains information for just one spot of pneumonia, and there may be more than one row for a given patient.  Also, it looks like rows in which Target=0 (meaning the patient has NO pneumonia) also have null values for the bounding box, but we should verify this.

We see that there are 28,989 rows in the labels information, and 25,684 unique patients represented.  We also see that there are no values in the Target column other than 0 or 1, which is good, and that there are 20,025 patients having NO pneumonia.

In [None]:
df_lbl = pd.read_csv('../input/stage_1_train_labels.csv')
df_lbl.head()

In [None]:
df_lbl.info()

In [None]:
n_pat = df_lbl.patientId.nunique()
print('Number of unique patients in training data: ', n_pat)

In [None]:
df_lbl.Target.value_counts()

To make it more convenient to check things in the labels, let's split it into separate tables of positive and negative results.

We see first that all observations for patients with NO pneumonia have null values for box information.  I'm not sure it would matter much if this were not the case, but it is one way in which the data is consistent, which is always good.  More importantly, we see that all positive observations DO have bounding box information.  We also see that we have an unbalanced training set (only 22% of patients diagnosed positive), and that, on average, a positive X-ray has more than one spot.

In [None]:
df_lbl_neg = df_lbl[df_lbl.Target == 0]
df_lbl_pos = df_lbl[df_lbl.Target == 1]
df_lbl_neg.info()

In [None]:
df_lbl_pos.info()
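
Reading the null counts off `info()` works, but the same consistency check can also be written as an explicit test.  The sketch below uses a small invented table with the same columns as the labels file (the values are made up, and the variable names are just for illustration).

```python
import numpy as np
import pandas as pd

# toy labels table with the same columns as the training labels (values invented)
toy = pd.DataFrame({
    'patientId': ['a', 'b', 'b', 'c'],
    'x': [np.nan, 100.0, 600.0, np.nan],
    'y': [np.nan, 200.0, 250.0, np.nan],
    'width': [np.nan, 150.0, 180.0, np.nan],
    'height': [np.nan, 300.0, 320.0, np.nan],
    'Target': [0, 1, 1, 0],
})
box_cols = ['x', 'y', 'width', 'height']

# every negative row should have no box values at all
neg_all_null = toy.loc[toy.Target == 0, box_cols].isnull().all().all()
# every positive row should have a complete box
pos_all_filled = toy.loc[toy.Target == 1, box_cols].notnull().all().all()
print(neg_all_null, pos_all_filled)
```

Swapping `toy` for the real labels table should run the same check on the full data.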

In [None]:
n_pat_pos = df_lbl_pos.patientId.nunique()
print('Number of unique patients having pneumonia: ', n_pat_pos)
print('Fraction of unique patients having pneumonia: ', n_pat_pos / n_pat)
print('Avg number of anomalies per patient having pneumonia: ', len(df_lbl_pos) / n_pat_pos)

Below we take a closer look at "spots per patient", and see that the most spots any patient in the training data has is 4, and the vast majority of patients with pneumonia have 1 or 2.

In [None]:
df_lbl_pos.patientId.value_counts().max()

In [None]:
df_lbl_pos.patientId.value_counts().hist()
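
If exact counts are more useful than a histogram, applying `value_counts()` twice gives a small table of "how many patients have N spots".  This is a sketch on invented patient IDs, not output from the real data.

```python
import pandas as pd

# toy positive rows: one row per box (patient IDs invented)
toy_pos = pd.DataFrame({'patientId': ['a', 'a', 'b', 'c', 'c', 'c', 'd']})

# first pass: boxes per patient; second pass: patients per box count
boxes_per_patient = toy_pos.patientId.value_counts()
patients_per_box_count = boxes_per_patient.value_counts().sort_index()
print(patients_per_box_count)
```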

Although we have not chosen an algorithm yet, we know that we will have to detect bounding boxes, so let's get a better feel for what the set of boxes in the training data looks like, because it is reasonable to expect that the box population in the test data will be similar.  This information might be useful if we choose an algorithm that uses anchor boxes.

I have learned from inspecting the DICOM images (see below) and from other sources such as contest information, discussions, sample kernels, etc., that the units of x, y, width, and height are pixels, that the X-ray images are 1024x1024, and that pixel 0,0 is in the upper left corner of the image.

Let's add columns for the box center coordinates, because if we have to define anchor boxes that is how we will do it, and because it is more intuitive if we want to visualize the distribution of box locations across the lungs.  (Taking an explicit copy of the positive subset first avoids the "setting on a copy of a slice" warnings, since we really do just want these new fields on the positive subset of the data.)

In [None]:
df_lbl_pos = df_lbl_pos.copy()
df_lbl_pos['xc'] = df_lbl_pos.x + df_lbl_pos.width / 2
df_lbl_pos['yc'] = df_lbl_pos.y + df_lbl_pos.height / 2
df_lbl_pos.head()

In [None]:
df_lbl_pos.describe()
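
Since anchor boxes came up above, here is a rough sketch of one common way to pick anchor sizes: clustering the (width, height) pairs with k-means.  This is not something this notebook does.  The function name `kmeans_anchor_sizes`, the toy box sizes, and the use of plain Euclidean distance are all assumptions made for illustration.

```python
import numpy as np

def kmeans_anchor_sizes(wh, k, n_iter=50, seed=0):
    """Cluster (width, height) pairs with plain Euclidean k-means.

    Returns a (k, 2) array of cluster centres sorted by area.
    """
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    # start from k distinct boxes picked at random
    centres = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(n_iter):
        # distance from every box to every centre, shape (n, k)
        dist = np.linalg.norm(wh[:, None, :] - centres[None, :, :], axis=2)
        nearest = dist.argmin(axis=1)
        # move each centre to the mean of its boxes (keep it if it has none)
        new_centres = np.array([
            wh[nearest == j].mean(axis=0) if np.any(nearest == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres[np.argsort(centres.prod(axis=1))]

# toy (width, height) pairs in pixels, invented for illustration
toy_wh = [(100, 150), (110, 160), (90, 140), (300, 500), (310, 520), (290, 480)]
anchors = kmeans_anchor_sizes(toy_wh, k=2)
print(anchors)
```

On the real data, the `width` and `height` columns of the positive table would take the place of `toy_wh`, and the number of clusters would be something to experiment with.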

Let's try to do some visualization.  Keep in mind that the y-axis of the plots is flipped with respect to the X-ray images.

The first plot below examines the distribution of box centers.  We see that the mean positions of spots are more or less in the center of each lung.  Surprisingly, there appear to be slightly more spots in one lung than the other.  (Which lung this is depends on which way the X-ray is shot, which information can be extracted from the image files, but I have not examined that yet.)  The plot suggests that if we have to choose anchor boxes, we might want to define two sets (one for each lung), or possibly divide each image in half vertically and examine the halves separately.

The second plot examines the distribution of box widths and heights.  We can see from the scales of the plot that spots tend to be taller than they are wide, which makes sense because that's the way the lungs are shaped.  Despite this, however, we also see that the mean in both dimensions is roughly 200 pixels.

In [None]:
sns.jointplot(x='xc', y='yc', data=df_lbl_pos, kind='kde')

In [None]:
sns.jointplot(x='width', y='height', data=df_lbl_pos, kind='kde')

### 1.3 Detailed Class Info

The only remaining file in the main level of the input directory is the detailed class file, which we see below has just 2 columns, the same number of rows as the labels file, and no missing values.  The only added information in this file is in the "class" column, which for each observation contains one of three values.  One of these values is "Lung Opacity", which corresponds to a pneumonia diagnosis, and there are the same number of occurrences of this value as there are Target=1 rows in the labels file, which is good.  The other two classes designate either lungs that are normal or those with spots that are not pneumonia, and the sum of the occurrences of these two values matches the number of Target=0 observations in the labels file.

It may be that the information in this file will help the model discriminate against "spots" that are not pneumonia.  However, we do not have boxes in the training labels for any non-pneumonia spots, so it will be hard to tell if the model is on the right track in detecting such boxes.

In [None]:
df_class = pd.read_csv('../input/stage_1_detailed_class_info.csv')
df_class.head()

In [None]:
df_class.info()

In [None]:
df_class['class'].value_counts()
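
Matching totals is a good sign, but a per-patient cross-tabulation would confirm that "Lung Opacity" and Target=1 actually refer to the same patients.  Here is a sketch on invented data.  The two non-pneumonia class names used below are my assumption about how those classes are labeled, so check them against the `value_counts()` output above.

```python
import pandas as pd

# toy versions of the two files (values invented, one row per box)
toy_lbl = pd.DataFrame({'patientId': ['a', 'b', 'b', 'c'], 'Target': [0, 1, 1, 0]})
toy_class = pd.DataFrame({
    'patientId': ['a', 'b', 'b', 'c'],
    'class': ['Normal', 'Lung Opacity', 'Lung Opacity',
              'No Lung Opacity / Not Normal'],  # assumed class names
})

# collapse both tables to one row per patient, then cross-tabulate
per_patient = pd.DataFrame({
    'Target': toy_lbl.groupby('patientId').Target.max(),
    'class': toy_class.groupby('patientId')['class'].first(),
})
table = pd.crosstab(per_patient['class'], per_patient['Target'])
print(table)
```

If the files agree, the "Lung Opacity" row should have all of its patients in the Target=1 column and the other rows should have none there.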

### 1.4 Image Files

The final part of the input data to examine is the X-ray image files that correspond to the training labels, and there are the same number of files as there are unique patients represented in the training data, which is good.

In [None]:
print('Number of training images:')
!ls ../input/stage_1_train_images | wc -l

The X-ray files are in DICOM format -- they contain not only the X-ray image itself but also related information (example below).  The files have been sanitized of all patient-related information, but there are still a few fields that might be useful, such as age, sex, and X-ray orientation.

In [None]:
patientId = df_lbl_pos.iloc[0,0]
dcm_file = '../input/stage_1_train_images/%s.dcm' % patientId
dcm_data = pydicom.dcmread(dcm_file)  # read_file is the older, deprecated name
print(dcm_data)

In [None]:
# extracting single field from DICOM file
age = dcm_data[0x10,0x1010].value
print(age)

In [None]:
# displaying image from DICOM file
img = dcm_data.pixel_array
pylab.imshow(img, cmap=pylab.cm.gist_gray)

Here we will put together some code (mostly copied or adapted from sample kernels) that will not only help us explore the x-rays a little bit but also might be helpful for evaluating the output of our model.

I want to see negative as well as positive x-rays, so I will go back to using the complete table of training labels.  For convenience in plotting the boxes, I will transform the x, y, w, h data and make sure the table is sorted by patient ID so that all boxes for a given patient are contiguous.

We may at some point want to show boxes in different colors, so we will convert the images from single-channel grayscale to 3-channel RGB.

In [None]:
df2 = pd.read_csv('../input/stage_1_train_labels.csv')
df2.fillna(0, inplace=True)
# cast the box columns to int so they can be used as array indices when drawing
df2 = df2.astype({'x': int, 'y': int, 'width': int, 'height': int})
df2.sort_values('patientId', inplace=True)
df2.head()

In [None]:
df2.info()
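
An alternative to relying on sorted, contiguous rows is to collect each patient's boxes into a dictionary up front.  This is only a sketch on invented data, and the helper name `boxes_by_patient` is hypothetical.

```python
import pandas as pd

# toy labels table after fillna(0) and the int cast (values invented)
toy = pd.DataFrame({
    'patientId': ['b', 'a', 'b', 'c'],
    'x': [100, 0, 600, 0], 'y': [200, 0, 250, 0],
    'width': [150, 0, 180, 0], 'height': [300, 0, 320, 0],
    'Target': [1, 0, 1, 0],
})

def boxes_by_patient(labels):
    """Map each patient ID to a list of (x, y, width, height) tuples."""
    positives = labels[labels.Target == 1]
    # negative patients keep an empty list
    result = {pid: [] for pid in labels.patientId.unique()}
    for row in positives.itertuples(index=False):
        result[row.patientId].append((row.x, row.y, row.width, row.height))
    return result

boxes = boxes_by_patient(toy)
print(boxes)
```

With a lookup like this, row order no longer matters when plotting a given patient.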

In [None]:
def add_box(img, x1, y1, w, h, rgb, lw=1):
    x2 = x1 + w
    y2 = y1 + h
    img[y1:y1 + lw, x1:x2] = rgb
    img[y2:y2 + lw, x1:x2] = rgb
    img[y1:y2, x1:x1 + lw] = rgb
    img[y1:y2, x2:x2 + lw] = rgb
    return img

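def find_pat_rows(idx):
    # return the half-open row range [start, stop) for the patient at row idx
    pid = df2.iloc[idx,0]
    start = idx
    stop = idx + 1
    # check the bounds first so we never index past either end of the table
    while (start > 0) and (df2.iloc[start - 1, 0] == pid):
        start -= 1
    while (stop < len(df2)) and (df2.iloc[stop, 0] == pid):
        stop += 1
    return start, stop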

def show_xray(pat_idx):
    # get observation data from table index
    pid, x, y, w, h, t = df2.iloc[pat_idx,:]
    # load image, convert from grayscale to RGB
    dcm_file = '../input/stage_1_train_images/%s.dcm' % pid
    dcm_data = pydicom.dcmread(dcm_file)
    img = np.stack([dcm_data.pixel_array] * 3, axis=2)
    # if this row has Target=1, add all relevant boxes
    if t == 1:
        start, stop = find_pat_rows(pat_idx)
        for n in range(start, stop):
            pid, x, y, w, h, t = df2.iloc[n,:]
            rgb = np.array([255, 0, 0])
            img = add_box(img, x, y, w, h, rgb, 5)
    # display image
    pylab.imshow(img, cmap=pylab.cm.gist_gray)

# for playing at this point, just choose a random x-ray each time it runs
idx = np.random.randint(0, len(df2), 1)
show_xray(idx[0])
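
The grayscale-to-RGB conversion and the box drawing can be checked without any DICOM files by running them on a blank synthetic image.  The sketch below uses a hypothetical `draw_box` helper that works like `add_box` above but also fills in the corners.  The image size and box coordinates are invented.

```python
import numpy as np

def draw_box(img, x, y, w, h, rgb, lw=1):
    """Draw a rectangle outline of thickness lw on an (H, W, 3) image, in place."""
    x2, y2 = x + w, y + h
    img[y:y + lw, x:x2 + lw] = rgb      # top edge
    img[y2:y2 + lw, x:x2 + lw] = rgb    # bottom edge
    img[y:y2 + lw, x:x + lw] = rgb      # left edge
    img[y:y2 + lw, x2:x2 + lw] = rgb    # right edge
    return img

gray = np.zeros((64, 64), dtype=np.uint8)   # blank stand-in for pixel_array
rgb_img = np.stack([gray] * 3, axis=2)      # (64, 64) -> (64, 64, 3)
draw_box(rgb_img, x=10, y=20, w=30, h=15, rgb=(255, 0, 0), lw=2)

print(rgb_img.shape)
print(rgb_img[20, 10])   # a pixel on the outline
print(rgb_img[27, 25])   # a pixel inside the box, left untouched
```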

# Notes on Next Steps (from Starter Kernel)

Now that you understand the data structures, imaging file formats and label types, it's time to make an algorithm! Keep in mind that the primary endpoint is the detection of bounding boxes, thus you will likely be considering various **object localization** algorithms. An alternative strategy is to consider the related family of **segmentation** algorithms with the acknowledgement that bounding boxes will only be a coarse approximation to true pixel-by-pixel image segmentation masks.

