Hello World, we love preprocessing and hate lung cancer.

# Preprocessing Agenda:

* Averaging every 10 slices
* Decimating
* Chunking for the CNN?
* Flattening

## Original Work: 

* **Loading the DICOM files**, and adding missing metadata  
* **Converting the pixel values to *Hounsfield Units (HU)***, and what tissue these unit values correspond to

In [None]:
%matplotlib inline

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import dicom
import os
import scipy.ndimage
import matplotlib.pyplot as plt

from skimage import measure, morphology
from mpl_toolkits.mplot3d.art3d import Poly3DCollection

# gun dev note: one patient per folder
INPUT_FOLDER = '../input/sample_images/'
patients = os.listdir(INPUT_FOLDER)
patients.sort()

# I/O: Scans to PyDicom Datasets

In [None]:
# Load the scans in given folder path
def load_scan(path):
    # gun dev note: every folder contains multiple slices, which together form "a single scan"
    slices = [dicom.read_file(os.path.join(path, s)) for s in os.listdir(path)]

    # gdn: order the slices by their z position
    slices.sort(key=lambda x: float(x.ImagePositionPatient[2]))

    # gdn: fill in the missing slice thickness field from the gap between the first two slices;
    # assumes that the slice thickness is uniform across all slices
    try:
        slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
    except Exception:
        slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)
        
    for s in slices:
        s.SliceThickness = slice_thickness
        
    return slices

# Hounsfield Units
* HU: the standard measure of radiodensity
* From Wikipedia:

![HU examples][1]

# Converting to Hounsfield Units:
* the pixel_array of each slice holds raw, unscaled values
* pixels that fall outside the scanned region get the fixed value -2000; these are reset to 0 first
* multiply by the rescale slope
* add the rescale intercept

  [1]: http://i.imgur.com/4rlyReh.png
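
As a quick sanity check of the steps above, here is a minimal sketch on a toy array. The raw values, the slope of 1 and the intercept of -1024 are assumptions chosen for illustration (in practice they come from each slice's `RescaleSlope` and `RescaleIntercept`), and `to_hu` is a hypothetical helper, not part of the notebook. On the HU scale air sits near -1000 and water at 0.

```python
import numpy as np

# Hypothetical raw values for a 2x3 "slice"; -2000 marks pixels outside the scan
raw = np.array([[-2000, 0, 24],
                [1024, 1064, 2024]], dtype=np.int16)

# Assumed rescale parameters (read them from each slice in practice)
slope, intercept = 1.0, -1024.0

def to_hu(raw, slope, intercept, padding=-2000):
    """Apply HU = slope * raw + intercept after zeroing padded pixels."""
    out = raw.astype(np.float64)
    out[raw == padding] = 0      # padded pixels end up at the intercept (about air)
    out = slope * out + intercept
    return out.astype(np.int16)

hu = to_hu(raw, slope, intercept)
print(hu)
# [[-1024 -1024 -1000]
#  [    0    40  1000]]
```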

In [None]:
def get_pixels_hu(slices):
    image = np.stack([s.pixel_array for s in slices])
    # Convert to int16 (the data is sometimes uint16);
    # should be possible as values should always be low enough (<32k)
    image = image.astype(np.int16)

    # Set outside-of-scan pixels to 0
    # The intercept is usually -1024, so a raw 0 lands near air (about -1000 HU) after rescaling
    image[image == -2000] = 0
    
    # Convert to Hounsfield units (HU)
    for slice_number in range(len(slices)):
        
        # each slice in slices has lots of "hidden" attributes
        intercept = slices[slice_number].RescaleIntercept
        slope = slices[slice_number].RescaleSlope
        
        if slope != 1:
            image[slice_number] = slope * image[slice_number].astype(np.float64)
            image[slice_number] = image[slice_number].astype(np.int16)
            
        image[slice_number] += np.int16(intercept)
    
    return np.array(image, dtype=np.int16)

# Visualization of HU Distribution

In [None]:
# gdn: viz on patient num 0
first_patient = load_scan(INPUT_FOLDER + patients[0])

# gdn: a pydicom Dataset instance is essentially a dict-like mapping with a large set of
# "hidden" fields; use its dir("search_term") method to find fields that match
print(len(first_patient))
# gdn: this prints every field of every slice; comment it out unless you really want to read it
print(first_patient)

first_patient_pixels = get_pixels_hu(first_patient)
# gdn: viz of HU distribution in 0th patient scan
    # what if... we run algorithms on this data?
plt.hist(first_patient_pixels.flatten(), bins=80, color='c')
plt.xlabel("Hounsfield Units (HU)")
plt.ylabel("Frequency")
plt.show()

# Visualization of Random Sample

In [None]:
# 134 slices, 512 x 512 pixels per slice
print(first_patient_pixels.shape)
num_slices, num_rows, num_cols = first_patient_pixels.shape

# Higher HU = whiter; Lower HU = darker
# gdn: quick try at taking HU mean across all 134 slices of image
plt.imshow(first_patient_pixels.mean(axis=0), cmap=plt.cm.gray)
plt.show()
# gdn: visualize a random slice 
idx = np.random.randint(0, num_slices)
plt.imshow(first_patient_pixels[idx], cmap=plt.cm.gray)
plt.show()

# gdn: how can we quantify the viz though? how dark/light versus what HU value??
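
One way to tie brightness to HU is to fix a display window: clip the HU values to a range and map that range linearly to gray levels. The sketch below is a minimal illustration; `apply_window` is a hypothetical helper, and the center of -600 HU and width of 1500 HU are an assumed lung-style window, not values from this notebook.

```python
import numpy as np

def apply_window(hu_slice, center=-600, width=1500):
    """Clip HU values to [center - width/2, center + width/2] and rescale to [0, 1]."""
    low, high = center - width / 2.0, center + width / 2.0
    clipped = np.clip(hu_slice.astype(np.float64), low, high)
    return (clipped - low) / (high - low)

# Toy values: everything at or below -1350 HU maps to black (0.0),
# everything at or above 150 HU maps to white (1.0)
demo = np.array([[-2000, -1350, -600],
                 [0, 150, 1000]], dtype=np.int16)
windowed = apply_window(demo)
print(windowed)
```

The same effect is available directly in matplotlib by passing `vmin` and `vmax` to `plt.imshow` and calling `plt.colorbar()`, which puts an HU scale next to the image.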

# Averaging per 10 Slices
* intuitively, averaging across all e.g. 134 slices is not effective
* try averaging on 10 slices instead

In [None]:
def averaging_10_slices(scan):
    '''
    input: a scan processed by get_pixels_hu; ndarray representation
    output: same scan, same representation, averaged over chunks of about 10 slices
    '''
    # gdn: split into chunks of at most 10 slices each; np.array_split returns a list
    # of near-equal chunks, so the slice count need not be a multiple of 10
    num_slices, num_rows, num_cols = scan.shape
    num_chunks = int(np.ceil(num_slices / 10.0))
    split_scan = np.array_split(scan, num_chunks, axis=0)

    # gdn: take mean across axis 0 on each chunk
    split_scan_out = [each.mean(axis=0) for each in split_scan]

    return np.stack(split_scan_out, axis=0)

In [None]:
first_patient_pixels_avg10 = averaging_10_slices(first_patient_pixels)
num_slices, num_rows, num_cols = first_patient_pixels_avg10.shape

# gdn: viz every post-averaging slice
for i in range(num_slices):
    plt.imshow(first_patient_pixels_avg10[i], cmap=plt.cm.gray)
    plt.show()
    
# gdn: compare this quantitatively with the raw stuff?
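
One possible way to put a number on that comparison is the mean absolute deviation, in HU, between each raw slice and the average of its chunk: a small value means the average represents its slices well. This is only a sketch under assumptions. `chunk_mean_abs_dev` is a hypothetical helper, it assumes near-equal chunks of at most 10 slices via `np.array_split`, and the random toy volume stands in for a real scan.

```python
import numpy as np

def chunk_mean_abs_dev(scan, chunk_size=10):
    """For each chunk of slices, return the mean |slice - chunk average| in HU."""
    num_chunks = -(-scan.shape[0] // chunk_size)  # ceiling division
    devs = []
    for chunk in np.array_split(scan, num_chunks, axis=0):
        avg = chunk.mean(axis=0)                  # the averaged slice for this chunk
        devs.append(np.abs(chunk - avg).mean())   # how far raw slices sit from it
    return np.array(devs)

# Toy volume standing in for get_pixels_hu output: 25 slices of 8 x 8 pixels
rng = np.random.default_rng(0)
toy = rng.integers(-1000, 400, size=(25, 8, 8)).astype(np.int16)
deviations = chunk_mean_abs_dev(toy)
print(deviations.shape)  # one value per chunk
```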

# Resampling
A scan may have a pixel spacing of `[2.5, 0.5, 0.5]`, which means that the distance between slices is `2.5` millimeters. For a different scan this may be `[1.5, 0.725, 0.725]`. Such differences can be problematic for automatic analysis (e.g. using ConvNets)!

A common method of dealing with this is resampling the full dataset to a certain isotropic resolution. If we choose to resample everything to 1 mm × 1 mm × 1 mm pixels we can use 3D convnets without worrying about learning zoom/slice thickness invariance.
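
To make the arithmetic concrete, the sketch below works out the target shape for one scan. The shape `(134, 512, 512)` and the helper `resample_plan` are assumptions for illustration; the spacing is the second example above. Because a volume must have a whole number of voxels, the ideal shape is rounded, so the spacing actually achieved can differ slightly from the 1 mm target.

```python
import numpy as np

def resample_plan(shape, spacing, new_spacing=(1.0, 1.0, 1.0)):
    """Return the rounded target shape, the zoom factors, and the spacing actually achieved."""
    shape = np.asarray(shape, dtype=np.float64)
    spacing = np.asarray(spacing, dtype=np.float64)
    new_spacing = np.asarray(new_spacing, dtype=np.float64)

    ideal_shape = shape * spacing / new_spacing  # may be fractional
    new_shape = np.round(ideal_shape)            # voxel counts must be whole numbers
    zoom = new_shape / shape                     # per-axis resize factor
    achieved_spacing = spacing / zoom            # slightly off target when rounding bites
    return new_shape.astype(int), zoom, achieved_spacing

new_shape, zoom, achieved = resample_plan((134, 512, 512), (1.5, 0.725, 0.725))
print(new_shape)  # [201 371 371]
print(achieved)   # first axis is exactly 1.0, the other two are slightly above 1.0
```

Zoom factors of this kind are what a function such as `scipy.ndimage.zoom` takes as its `zoom` argument.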

Whilst this may seem like a very simple step, it has quite a few edge cases due to rounding. It also takes quite a while.

The code below worked well for us (and deals with the edge cases):