# DICOM 2 PNG Converter

The DICOM images in this competition are difficult to use as they are. This notebook walks through the code, process, and output for converting them into a usable form that won't saturate your hard drive or cloud storage.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from PIL import Image
import pydicom as dicom
import matplotlib.pyplot as plt
import cv2
import itertools
import time
from skimage.transform import resize


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
#for dirname, _, filenames in os.walk('/kaggle/input'):
#    for filename in filenames:
#        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
train_image_file_list = os.listdir('/kaggle/input/vinbigdata-chest-xray-abnormalities-detection/train/')

Each of these files is a lung scan stored in DICOM, an industry-standard image format in healthcare.

In [None]:
# Show original dicom image and dicom pixel array
image_path = '/kaggle/input/vinbigdata-chest-xray-abnormalities-detection/train/'+ train_image_file_list[0]
ds = dicom.dcmread(image_path)

_ = plt.imshow(ds.pixel_array)
print(ds.pixel_array)

DICOM files carry a lot of associated metadata. While it is very useful for healthcare practitioners, little of it is going to be useful for this classification problem itself. Converting to the .png format is a better fit for this particular use case.

In [None]:
# Viewing dicom metadata to better understand what dicom is

image_name = train_image_file_list[0]
# specify your image path
input_path = '/kaggle/input/vinbigdata-chest-xray-abnormalities-detection/train/'

output_path_stem = '/kaggle/working/converted_train_images/' 

ds = dicom.dcmread(input_path + image_name)

ds
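A handful of header fields do matter for the conversion itself, even if they are not model inputs. The sketch below pulls a few standard DICOM attributes (`Rows`, `Columns`, `BitsStored`, `PhotometricInterpretation`) out of a dataset object. `summarize_header` is a hypothetical helper, and the `SimpleNamespace` stand-in with made-up values is only there so the snippet runs without a file; in the notebook you would pass `ds` instead.

```python
from types import SimpleNamespace

# Standard DICOM attribute names that affect how pixels should be converted
FIELDS = ("Rows", "Columns", "BitsStored", "PhotometricInterpretation")

def summarize_header(ds, fields=FIELDS):
    """Return the requested header fields as a dict; missing fields map to None."""
    return {name: getattr(ds, name, None) for name in fields}

# Stand-in object with made-up values (an assumption, not taken from the dataset)
fake_ds = SimpleNamespace(Rows=2836, Columns=2336, BitsStored=12,
                          PhotometricInterpretation="MONOCHROME1")
summary = summarize_header(fake_ds)
print(summary)
```

Looking at these fields for a few files is a quick way to see how large the originals are and whether any of them use an inverted grayscale.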

In [None]:
# Creating directories for the final dump outside of /kaggle/working
in_dirs = ['/kaggle/input/vinbigdata-chest-xray-abnormalities-detection/train/', '/kaggle/input/vinbigdata-chest-xray-abnormalities-detection/test/']

out_dirs = ['/kaggle/final_output/converted_train_images/', '/kaggle/final_output/converted_test_images/']
for out_dir in out_dirs:
    # exist_ok=True lets the cell be re-run without raising FileExistsError
    os.makedirs(out_dir, exist_ok=True)
    
# Remove directory
#import shutil
#shutil.rmtree('/kaggle/final_output/converted_train_images/')
#shutil.rmtree('/kaggle/final_output/converted_test_images/')

The following function converts the images to the PNG format. The DICOM images presented two challenges: brightness/contrast and size. Their pixel values span a much wider range than an 8-bit image can hold, so they need to be rescaled to the 0-255 scale. The images are also large and need to be resized down to 512 × 512 pixels. These transformations reduce the quality of the images going into the model, but it seems to be a solid tradeoff given how much more manageable the data becomes.

In [None]:
import pydicom

def image_processor(input_dir, output_dir, basewidth=512, baseheight=512):
    """Convert every DICOM file in input_dir to a resized 8-bit PNG in output_dir."""
    for f in os.listdir(input_dir):

        # Read the DICOM file (dcmread replaces the deprecated read_file)
        ds = pydicom.dcmread(os.path.join(input_dir, f), force=True)

        # Get the pixel data as a NumPy array
        img = ds.pixel_array

        # Rescale brightness so the brightest pixel maps to 255
        img_rescaled = img / np.max(img) * 255

        # Resize to (rows, cols); preserve_range=True keeps the 0-255 values
        img_resized = resize(img_rescaled, (baseheight, basewidth),
                             anti_aliasing=True, preserve_range=True)

        # Build the output path, swapping the .dicom extension for .png
        new_path = os.path.join(output_dir, f.replace('.dicom', '.png'))

        # Cast to 8-bit explicitly before writing the PNG
        cv2.imwrite(new_path, np.clip(img_resized, 0, 255).astype(np.uint8))
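The function above scales by the maximum only and does not look at the header. Two refinements are worth considering. Subtracting the minimum first uses the full 0-255 range. Files whose `PhotometricInterpretation` is `MONOCHROME1` also store higher values for darker pixels, so they come out inverted unless flipped. The following is a minimal NumPy sketch of both ideas; `to_uint8` and its `invert` flag are hypothetical names, not part of the notebook's pipeline.

```python
import numpy as np

def to_uint8(pixels, invert=False):
    """Min-max scale an array to 0..255 and return it as uint8.

    invert=True flips the grayscale, which is what a MONOCHROME1
    image would need to display like a MONOCHROME2 one.
    """
    arr = np.asarray(pixels, dtype=np.float64)
    lo, hi = arr.min(), arr.max()
    if hi == lo:
        # Constant image: avoid dividing by zero and return all black
        return np.zeros(arr.shape, dtype=np.uint8)
    scaled = (arr - lo) / (hi - lo)          # now in 0..1
    if invert:
        scaled = 1.0 - scaled
    return np.round(scaled * 255).astype(np.uint8)

# Made-up 12-bit style values for illustration
sample = np.array([[0, 2048], [4095, 1024]])
normal = to_uint8(sample)
flipped = to_uint8(sample, invert=True)
print(normal.dtype, normal.min(), normal.max())
```

In the notebook, the flag could be set from the header, for example `invert=(ds.PhotometricInterpretation == "MONOCHROME1")`, assuming the field is present in the file.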

In [None]:
# Calling the function once per directory, explicitly, to rule out anything that could stop the output files from being generated
image_processor(input_dir= in_dirs[0], output_dir= out_dirs[0])

image_processor(input_dir= in_dirs[1], output_dir= out_dirs[1])

In [None]:
# Converted training image sample
path = '/kaggle/final_output/converted_train_images/4d390e07733ba06e5ff07412f09c0a92.png'

img = Image.open(path)
print(img.size)  # a bare expression mid-cell is not displayed, so print it

img

In [None]:
# Converted test image sample
path = '/kaggle/final_output/converted_test_images/83caa8a85e03606cf57e49147d7ac569.png'

img = Image.open(path)
print(img.size)  # a bare expression mid-cell is not displayed, so print it

img

In [None]:
# Checking to make sure all 15,000 training images and 3,000 testing images were converted properly into their respective
# directories
converted_train_list = os.listdir('/kaggle/final_output/converted_train_images/')
converted_test_list = os.listdir('/kaggle/final_output/converted_test_images/')
print("train file number: " + str(len(converted_train_list)) + " test file number: "+ str(len(converted_test_list)))
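Matching counts do not show which file is missing if a number comes up short. A small sketch of a per-file check follows, comparing file stems between an input and an output directory. `missing_conversions` is a hypothetical helper, and the demo runs on throwaway temporary directories rather than the Kaggle paths.

```python
import os
import tempfile

def missing_conversions(input_dir, output_dir, in_ext=".dicom", out_ext=".png"):
    """Return the sorted stems of input files that have no matching output file."""
    expected = {f[:-len(in_ext)] for f in os.listdir(input_dir) if f.endswith(in_ext)}
    produced = {f[:-len(out_ext)] for f in os.listdir(output_dir) if f.endswith(out_ext)}
    return sorted(expected - produced)

# Demo: two fake inputs, only one of which has been "converted"
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for name in ("a.dicom", "b.dicom"):
        open(os.path.join(src, name), "w").close()
    open(os.path.join(dst, "a.png"), "w").close()
    demo_missing = missing_conversions(src, dst)

print(demo_missing)  # ['b']
```

In the notebook this could be called as `missing_conversions(in_dirs[0], out_dirs[0])`, where an empty list means every training file was converted.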

In [None]:
# Archive the converted images into the current directory (/kaggle/working/) so they are saved as output on commit
!tar -zcf train.tar.gz -C "/kaggle/final_output/converted_train_images/" .
!tar -zcf test.tar.gz -C "/kaggle/final_output/converted_test_images/" .
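To confirm an archive holds what you expect before relying on it, the standard-library `tarfile` module can count its members. This is a sketch only: `count_archived_files` is a hypothetical helper, and the demo builds a tiny archive in a temporary directory with the same "contents at the archive root" layout that `-C <dir> .` produces.

```python
import os
import tarfile
import tempfile

def count_archived_files(archive_path, suffix=".png"):
    """Count regular files in a .tar.gz archive whose names end with suffix."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sum(1 for m in tar.getmembers()
                   if m.isfile() and m.name.endswith(suffix))

# Demo: archive a directory holding two placeholder files
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "images")
    os.makedirs(src)
    for name in ("a.png", "b.png"):
        with open(os.path.join(src, name), "wb") as fh:
            fh.write(b"placeholder")
    archive = os.path.join(tmp, "demo.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname=".")  # same layout as: tar -C src .
    demo_count = count_archived_files(archive)

print(demo_count)  # 2
```

Running `count_archived_files("train.tar.gz")` in the notebook should then match the training file count printed earlier.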