# RSNA - Exploratory Data Analysis Part 2

In this notebook, we take a look at the DICOM dataset. We extract the <b>DICOM elements</b>, such as window center, pixel spacing, rows and columns, from each image, and put this information into a table.

By inspecting this table, we hope to gain insight into the technical specifications of these images and answer questions like

"Are all images of about the same size?" 

"Do these images have multiple windows?" 

"Do all images share similar technical specifications?" 

"Is there any oddity that may skew the predictions?"


Later on, we will use the window center, window width, rescale intercept and rescale slope to process each image. The processed images will be the input to our deep learning models.
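As a rough illustration of that processing step, the sketch below converts raw pixel values to Hounsfield units with the rescale slope and intercept, then clips them to a window and scales the result to [0, 1]. The function name `apply_window` and the toy array are assumptions made for this example, and the clip-and-scale shown here is a simplified version of windowing, not the exact formula from the DICOM standard.

```python
import numpy as np

def apply_window(raw_pixels, center, width, intercept, slope):
    """Convert raw pixel values to HU, clip to a window, and scale to [0, 1]."""
    # Linear rescale from stored values to Hounsfield units
    hu = raw_pixels.astype(np.float32) * slope + intercept
    # Window bounds (simplified: no half-pixel correction)
    lower = center - width / 2
    upper = center + width / 2
    windowed = np.clip(hu, lower, upper)
    return (windowed - lower) / (upper - lower)

# Toy 2x2 "image" (hypothetical values, not taken from the dataset)
raw = np.array([[0, 1024], [1064, 2000]], dtype=np.int16)
out = apply_window(raw, center=40.0, width=80.0, intercept=-1024.0, slope=1.0)
print(out)
```

With these toy values, everything at or below 0 HU maps to 0 and everything at or above 80 HU maps to 1.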

In [1]:
import pandas as pd
import os

import numpy as np
import pydicom

In [2]:
img_dir = "stage_1_train_images/"

In [3]:
data_table = pd.read_pickle('rsna_data_table.pkl') 
print(len(data_table))

674257


# 1. Take a look at a sample image

In [4]:
img_id = data_table.index[0]
file_name = os.path.join(img_dir,"ID_"+img_id+".dcm")
dicom_elements = pydicom.dcmread(file_name)

In [5]:
dicom_elements

(0008, 0018) SOP Instance UID                    UI: ID_000039fa0
(0008, 0060) Modality                            CS: 'CT'
(0010, 0020) Patient ID                          LO: 'ID_eeaf99e7'
(0020, 000d) Study Instance UID                  UI: ID_134d398b61
(0020, 000e) Series Instance UID                 UI: ID_5f8484c3e0
(0020, 0010) Study ID                            SH: ''
(0020, 0032) Image Position (Patient)            DS: ['-125.000000', '-141.318451', '62.720940']
(0020, 0037) Image Orientation (Patient)         DS: ['1.000000', '0.000000', '0.000000', '0.000000', '0.968148', '-0.250380']
(0028, 0002) Samples per Pixel                   US: 1
(0028, 0004) Photometric Interpretation          CS: 'MONOCHROME2'
(0028, 0010) Rows                                US: 512
(0028, 0011) Columns                             US: 512
(0028, 0030) Pixel Spacing                       DS: ['0.488281', '0.488281']
(0028, 0100) Bits Allocated                      US: 16
(0028, 0101) Bits Stored 

In [6]:
#These are the dicom elements we are interested in
dicom_dict = {'image pos':['0020','0032'],
             'image orient':['0020','0037'],
             'samples per pixel':['0028','0002'],
             'photometric interp':['0028','0004'],
             'rows':['0028','0010'],
             'columns':['0028','0011'] ,
             'pixel spacing':['0028','0030'],
             'bits allocated':['0028','0100'],
             'bits stored':['0028','0101'],
             'high bits':['0028','0102'],
             'pixel representation':['0028','0103'], 
             'window center':['0028','1050'], 
             'window width':['0028','1051'], 
             'rescale intercept':['0028','1052'], 
             'rescale slope':['0028','1053']}      

In [7]:
#A function to convert the element values into standard python types
def parse_element(dicom_elements, element_code):

    element = dicom_elements.get(element_code)

    #missing element: report it instead of raising
    if not element:
        return 'warning: element not found'

    value = element.value

    #multi-valued element: convert every item using the type of the first one
    if isinstance(value, pydicom.multival.MultiValue):
        if isinstance(value[0], int):
            return [int(x) for x in value]
        if isinstance(value[0], str):
            return [str(x) for x in value]
        if isinstance(value[0], pydicom.valuerep.DSfloat):
            return [float(x) for x in value]
        return 'warning: unknown list type'

    #single-valued element
    if isinstance(value, (int, str)):
        return value
    if isinstance(value, pydicom.valuerep.DSfloat):
        return float(value)
    return 'warning: unknown type'

In [8]:
for x in dicom_dict:
    print(parse_element(dicom_elements, dicom_dict[x]))

[-125.0, -141.318451, 62.72094]
[1.0, 0.0, 0.0, 0.0, 0.968148, -0.25038]
1
MONOCHROME2
512
512
[0.488281, 0.488281]
16
16
15
1
30.0
80.0
-1024.0
1.0


#### Remark: we parsed the elements of one DICOM file and converted them to Python types. Now we can build a table for all DICOM files in the dataset

# 2. Create a DICOM info table for the entire training dataset

In [9]:
table_columns = list(dicom_dict.keys())  
table_columns.append('id')
dicom_table = pd.DataFrame(columns = table_columns) 

#dicom_table shares its ids with data_table:
#for each image, we can retrieve the diagnosis from data_table
#and its DICOM info from dicom_table.

dicom_table['id'] = data_table.index

In [10]:
#each row corresponds to the DICOM info of one image
def read_img(table_row, img_dir, dicom_dict):
    
    img_id = table_row['id']   
    img_file = os.path.join(img_dir,"ID_"+img_id+".dcm")
    dicom_elements = pydicom.dcmread(img_file)
    
    for element_name, element_code in dicom_dict.items():
        table_row[element_name] = parse_element(dicom_elements, element_code)
    return table_row
       

In [11]:
dicom_table = dicom_table.apply(lambda x: read_img(x, img_dir, dicom_dict), axis = 1)

In [12]:
dicom_table = dicom_table.set_index('id')

In [13]:
#Save the table to a file
dicom_table.to_pickle('rsna_dicom_table.pkl') 

# 3. Inspect the DICOM info table

In [14]:
dicom_table.head()

id,image pos,image orient,samples per pixel,photometric interp,rows,columns,pixel spacing,bits allocated,bits stored,high bits,pixel representation,window center,window width,rescale intercept,rescale slope,multiple windows
000039fa0,"[-125.0, -141.318451, 62.72094]","[1.0, 0.0, 0.0, 0.0, 0.968148, -0.25038]",1,MONOCHROME2,512,512,"[0.488281, 0.488281]",16,16,15,1,30,80,-1024.0,1.0,False
00005679d,"[-134.463, -110.785, -39.569]","[1.0, 0.0, 0.0, 0.0, 1.0, 0.0]",1,MONOCHROME2,512,512,"[0.460938, 0.460938]",16,16,15,1,50,100,-1024.0,1.0,False
00008ce3c,"[-125.0, -83.0468112, 175.995344]","[1.0, 0.0, 0.0, 0.0, 0.994521895, 0.104528463]",1,MONOCHROME2,512,512,"[0.48828125, 0.48828125]",16,12,11,0,"[40.0, 40.0]","[80.0, 80.0]",-1024.0,1.0,False
0000950d7,"[-126.437378, -126.437378, 157.5]","[1.0, 0.0, 0.0, 0.0, 1.0, 0.0]",1,MONOCHROME2,512,512,"[0.494863, 0.494863]",16,16,15,1,35,135,-1024.0,1.0,False
0000aee4b,"[-108.5, 14.5, 94.0]","[1.0, 0.0, 0.0, 0.0, 1.0, 0.0]",1,MONOCHROME2,512,512,"[0.423828125, 0.423828125]",16,12,11,0,"[36.0, 36.0]","[80.0, 80.0]",-1024.0,1.0,False


#### observation: some DICOM elements, such as window center and window width, appear to have variable length (1 or 2)

In [15]:
def item_length(x):
    #scalars and strings count as a single item
    if isinstance(x, (int, float, str)):
        return 1
    return len(x)

In [16]:
#check if the dicom elements have variable length
for element_name in dicom_dict:
    print(dicom_table[element_name].apply(item_length).value_counts())

3    674258
Name: image pos, dtype: int64
6    674258
Name: image orient, dtype: int64
1    674258
Name: samples per pixel, dtype: int64
1    674258
Name: photometric interp, dtype: int64
1    674258
Name: rows, dtype: int64
1    674258
Name: columns, dtype: int64
2    674258
Name: pixel spacing, dtype: int64
1    674258
Name: bits allocated, dtype: int64
1    674258
Name: bits stored, dtype: int64
1    674258
Name: high bits, dtype: int64
1    674258
Name: pixel representation, dtype: int64
1    341679
2    332579
Name: window center, dtype: int64
1    341679
2    332579
Name: window width, dtype: int64
1    674258
Name: rescale intercept, dtype: int64
1    674258
Name: rescale slope, dtype: int64


#### observation: only window center and window width have variable length. About half of the images (332579) store two values for each, so they may define multiple windows

In [17]:
def all_values_equal(x):
    if isinstance(x, (float, int)):
        return True
    y = np.asarray(x)
    return (y == min(y)).all()

In [18]:
dicom_table['window center'].apply(all_values_equal).value_counts()

True     674224
False        34
Name: window center, dtype: int64

In [19]:
dicom_table['window width'].apply(all_values_equal).value_counts()

True     674224
False        34
Name: window width, dtype: int64

In [20]:
#find the DICOM files whose window values differ from each other (multiple windows)
multiple_width = ~dicom_table['window width'].apply(all_values_equal) 
multiple_center = ~dicom_table['window center'].apply(all_values_equal)
dicom_table['multiple windows'] = multiple_width | multiple_center

len(dicom_table[dicom_table['multiple windows']])

34

#### observation: only 34 images have multiple distinct windows; in all other two-value cases, both values are identical
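Since window center and window width can be either a single number or a list, later processing needs one value per image. A minimal sketch of one way to do this is shown below. Taking the first listed value is an assumption made for this example (the notebook does not say which window to prefer), and `first_window_value` is a hypothetical helper name.

```python
def first_window_value(value):
    """Return one float from a scalar or from a list of window values."""
    # Assumption: when several values are present, use the first one
    if isinstance(value, (list, tuple)):
        return float(value[0])
    return float(value)

# Both storage styles seen in the table reduce to a plain float
single = first_window_value(30)
repeated = first_window_value([40.0, 40.0])
print(single, repeated)
```

For the 34 images with distinct windows, this choice silently drops the second window, so those files may deserve a manual check.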

In [21]:
dicom_table['samples per pixel'].value_counts()

1    674258
Name: samples per pixel, dtype: int64

In [22]:
dicom_table['photometric interp'].value_counts()

MONOCHROME2    674258
Name: photometric interp, dtype: int64

In [23]:
dicom_table['rows'].value_counts()

512    673989
638        49
436        36
408        33
464        32
462        32
430        31
666        29
768        27
Name: rows, dtype: int64

In [24]:
dicom_table['columns'].value_counts()

512    674018
490        49
436        36
374        33
464        32
462        32
404        31
768        27
Name: columns, dtype: int64

#### observation: almost all images are 512 x 512, but 269 images (about 0.04%) have a different number of rows, ranging from 408 to 768
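To see which row/column combinations occur among the non-standard images, we can filter on both columns and count each pair. The sketch below uses a small hypothetical table in place of `dicom_table`, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for dicom_table with only the two size columns
toy = pd.DataFrame(
    {"rows": [512, 512, 638, 436], "columns": [512, 512, 490, 436]},
    index=["a", "b", "c", "d"],
)

# Images that match the dominant 512 x 512 size
is_standard = (toy["rows"] == 512) & (toy["columns"] == 512)

# Count each (rows, columns) pair among the remaining images
odd_sizes = toy.loc[~is_standard].groupby(["rows", "columns"]).size()
print(odd_sizes)
```

Applied to the real table, the same filter would list the images that need resizing or padding before they can share a batch with the 512 x 512 images.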

In [25]:
dicom_table['bits allocated'].value_counts()

16    674258
Name: bits allocated, dtype: int64

In [26]:
dicom_table['bits stored'].value_counts()

16    341679
12    332579
Name: bits stored, dtype: int64

In [27]:
dicom_table['high bits'].value_counts()

15    341679
11    332579
Name: high bits, dtype: int64

In [28]:
dicom_table['pixel representation'].value_counts()

1    343931
0    330327
Name: pixel representation, dtype: int64

In [29]:
dicom_table['rescale intercept'].value_counts()

-1024.0    662279
-1000.0      6653
 0.0         5276
 1.0           50
Name: rescale intercept, dtype: int64

#### observation: about 5300 images (rescale intercept 0.0 or 1.0) use a very different linear function to convert pixel values to HU
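The conversion is HU = pixel value * rescale slope + rescale intercept. The short example below shows how much the result for one stored value changes across the four intercepts found in the table. The stored value 1024 is a hypothetical number chosen for illustration.

```python
slope = 1.0
raw_value = 1024  # hypothetical stored pixel value

# HU for the same stored value under each intercept seen in the dataset
hu_by_intercept = {
    intercept: raw_value * slope + intercept
    for intercept in (-1024.0, -1000.0, 0.0, 1.0)
}

for intercept, hu in hu_by_intercept.items():
    print(f"intercept {intercept:>8}: {hu} HU")
```

The table alone does not tell us why these images use an intercept near zero, so it seems safest to always apply the intercept stored in each file instead of assuming -1024.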

In [30]:
dicom_table['rescale slope'].value_counts()

1.0    674258
Name: rescale slope, dtype: int64

In [31]:
dicom_table['pixel spacing'].apply(lambda x: x[0]).describe()

count    674258.000000
mean          0.478767
std           0.027809
min           0.292969
25%           0.488281
50%           0.488281
75%           0.488281
max           0.976562
Name: pixel spacing, dtype: float64

In [32]:
dicom_table['pixel spacing'].apply(lambda x: x[1]).describe()

count    674258.000000
mean          0.478767
std           0.027809
min           0.292969
25%           0.488281
50%           0.488281
75%           0.488281
max           0.976562
Name: pixel spacing, dtype: float64

#### observation: most images have a pixel spacing of about 0.49 mm, but there are extreme cases ranging from 0.29 mm to 0.98 mm
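Pixel spacing matters because, together with the image size, it determines the physical area each image covers. The sketch below computes that area for the typical and extreme spacings. The helper `field_of_view_mm` is a hypothetical name, and pairing the extreme spacings with a 512 x 512 image is an assumption, since the table above does not show which sizes go with which spacings.

```python
def field_of_view_mm(rows, columns, pixel_spacing):
    """Return (height, width) in mm; pixel_spacing is [row spacing, column spacing]."""
    row_spacing, col_spacing = pixel_spacing
    return rows * row_spacing, columns * col_spacing

# Typical spacing, then the minimum and maximum from the summary statistics
typical = field_of_view_mm(512, 512, [0.488281, 0.488281])
smallest = field_of_view_mm(512, 512, [0.292969, 0.292969])
largest = field_of_view_mm(512, 512, [0.976562, 0.976562])
print(typical, smallest, largest)
```

Under these assumptions the covered area ranges from roughly 150 mm to roughly 500 mm per side, compared with about 250 mm for a typical image.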

In [33]:
#Save the table again, now that it includes the 'multiple windows' column
dicom_table.to_pickle('rsna_dicom_table.pkl') 