# Introduction

**Open Source Imaging Consortium (OSIC) is a not-for-profit, co-operative effort between academia, industry and philanthropy. The group enables rapid advances in the fight against Idiopathic Pulmonary Fibrosis (IPF), fibrosing interstitial lung diseases (ILDs), and other respiratory diseases, including emphysematous conditions. Its mission is to bring together radiologists, clinicians and computational scientists from around the world to improve imaging-based treatments.**

**What is Pulmonary Fibrosis about?**

**The word “pulmonary” means lung and the word “fibrosis” means scar tissue— similar to scars that you may have on your skin from an old injury or surgery. So, in its simplest sense, pulmonary fibrosis (PF) means scarring in the lungs.**

**It is serious as over time, the scar tissue can destroy the normal lung and make it hard for oxygen to get into your blood. Low oxygen levels (and the stiff scar tissue itself) can cause you to feel short of breath, particularly when walking and exercising. Pulmonary fibrosis isn’t just one disease. It is a family of more than 200 different lung diseases that all look very much alike.**

![](https://www.pulmonaryfibrosis.org/images/default-source/default-album/normal-and-impaired-gas-exchange.png?sfvrsn=c3b0918d_0)

**For more information, visit https://www.pulmonaryfibrosis.org/life-with-pf/about-pf**

# About the Competition

![](https://www.kaggleusercontent.com/kf/40536606/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..4yf7NuyM78BCiuvU95LjCg.FRvwtJwLwfTAN4Jn01XO8g-cmK54Onr0h2vX9YV0fagrVPhJijWpRldRJCaXY3tKYfEQeFmukZ8Gftf0RVcn43oHReHZYQpvXEy0aKgsIvo_mMF4fJ0bK6KsaZPGIYu_mqv_dYs_jJi37dvJkqMI3DkZZu_MRHx48628X_tG9OGpmzCv9cscIoFF97r3FCOyNPcMOme8akhA5NXCpislOR4ELOqHSouIb0W5OccmjOF8Ur3abchozJMkwTuIYHzAZSlEs6ZnZsNjZpAaJuoMAR0MotpgnlT4ErK5RgMA21Ta1ZSYkJEzGoS3q6LHigr_zwtru06L_WAppP_0mUb20mKQ29KhlzbhZdd8Au8pucyQT9tBvigybd3V-nuB1fMjborZEiPUSzI57oGsq6gObtscx5VozG2iNdfJCYn2DFjpZ65Fkyq_WF16GhrKJW7a-8PTxVuzhjXs2B9_n8pm7-GxLNNVYQXUJjuITeMHwj0lGShv--3LLGhNWWAuLBILqflft7q7aA3GChMu_C-RfAj7LuFSswQhWD29RKnPsWmZJVS-Dkuwg4StrwTpn6McJ7LWlQjnbJ2SqMOu1NyI3lVIZ9TwZ07aIb-FxSV7timds2ImfSRSEuyoym-LCxVYVadB6eI-XD-5g7pZoJB-ULLW9oRKb7LXfkmu7HjQD68.uByHuDxikFpTmF-2vlZTBg/__results___files/__results___23_0.png)

**In this competition, you’ll predict a patient’s severity of decline in lung function based on a CT scan of their lungs. You’ll determine lung function based on output from a spirometer, which measures the volume of air inhaled and exhaled. The challenge is to use machine learning techniques to make a prediction with the image, metadata, and baseline FVC as input.**

**If successful, patients and their families would better understand their prognosis when they are first diagnosed with this incurable lung disease. Improved severity detection would also positively impact treatment trial design and accelerate the clinical development of novel treatments.**


**Files**

**This is a synchronous rerun code competition. The provided test set is a small representative set of files (copied from the training set) to demonstrate the format of the private test set. When you submit your notebook, Kaggle will rerun your code on the test set, which contains unseen images.**

> **train.csv - the training set, contains full history of clinical information**

> **test.csv - the test set, contains only the baseline measurement**

> **train/ - contains the training patients' baseline CT scan in DICOM format**

> **test/ - contains the test patients' baseline CT scan in DICOM format**

> **sample_submission.csv - demonstrates the submission format**


**Columns**


> **Patient - a unique Id for each patient (also the name of the patient's DICOM folder)**

> **Weeks - the relative number of weeks pre/post the baseline CT (may be negative)**

> **FVC - the recorded lung capacity in ml**

> **Percent - a computed field which approximates the patient's FVC as a percent of the typical FVC for a person of similar characteristics**

> **Age**

> **Sex**

> **SmokingStatus**
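
The `Percent` column ties each FVC reading to a reference value for a comparable person. As a rough sketch, assuming `Percent` is simply `100 * FVC / typical_FVC` (the exact reference equation behind the column is not given here), the implied typical FVC can be recovered as below. The function name `implied_typical_fvc` and the numbers are made up for illustration.

```python
def implied_typical_fvc(fvc_ml, percent):
    """Return the reference FVC (ml) implied by an FVC reading and its Percent value.

    Assumes Percent = 100 * FVC / typical_FVC, which is an illustrative
    reading of the column description, not a documented formula.
    """
    if percent <= 0:
        raise ValueError("percent must be positive")
    return fvc_ml * 100.0 / percent


# Hypothetical reading: 2400 ml at 60 % of the typical value
typical = implied_typical_fvc(2400, 60.0)
print(f"Implied typical FVC: {typical:.0f} ml")
```

Two patients with the same FVC can therefore have very different `Percent` values if their age, sex or height differ.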

**I will be using the ONNX library to convert PyTorch-based models to TensorFlow!!**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
import copy
from datetime import timedelta, datetime
import imageio
import matplotlib.pyplot as plt
from matplotlib import cm
import multiprocessing
from pathlib import Path
import pydicom
import pytest
import scipy.ndimage as ndimage
from scipy.ndimage import zoom  # scipy.ndimage.interpolation is a deprecated namespace
from skimage import measure, morphology, segmentation
from skimage.transform import resize
from time import time, sleep
from tqdm import trange, tqdm
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import *
from tensorflow.data import Dataset
import torch
from torch.utils.tensorboard import SummaryWriter
from torchvision import transforms
import warnings
import seaborn as sns
import glob as glob
from IPython.display import Image

#for masking
from skimage.measure import label,regionprops
from sklearn.cluster import KMeans
from skimage.segmentation import clear_border

import onnx

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Loading Libraries

In [None]:
train_df = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/train.csv')
print('{} Rows and {} Columns in train data '.format(train_df.shape[0], train_df.shape[1]))
train_df.head()

In [None]:
test_df = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/test.csv')
print('{} Rows and {} Columns in test data '.format(test_df.shape[0], test_df.shape[1]))
test_df.head()

In [None]:
train_df.describe()

In [None]:
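# Correlate numeric columns only; string columns such as Patient or Sex cannot be correlated
sns.heatmap(train_df.select_dtypes('number').corr(), annot=True, cmap=plt.cm.plasma)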

**The FVC and Percent columns are well correlated with each other**

In [None]:
plt.figure(figsize=(10,10))
sns.countplot(x='Age', hue='SmokingStatus', data=train_df).set_title('Age-wise smoker distribution')

**Let's check for Uniqueness in the training data**

In [None]:
print('The number of unique patients in the training data is: {}'.format(train_df['Patient'].nunique()))

# Is there any Decline in Lung Function?

In [None]:
#https://www.kaggle.com/carlossouza/probabilistic-machine-learning-a-diff-approach/

def chart_data(patient_id, ax):
    plot_data = train_df[train_df['Patient']==patient_id]
    fig1 = plot_data['Weeks']
    fig2 = plot_data['FVC']
    ax.set_title(patient_id)
    # Pass x and y by keyword; recent seaborn versions reject positional data vectors
    ax = sns.regplot(x=fig1, y=fig2, ax=ax, ci=None, line_kws={'color':'red'})
    

f, axes = plt.subplots(1, 3, figsize=(15, 5))
chart_data('ID00007637202177411956430', axes[0])
chart_data('ID00009637202177434476278', axes[1])
chart_data('ID00426637202313170790466', axes[2]) #non-smoker plot

**The regression lines in the plots suggest that lung function (FVC) declines over time for these patients**
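
To put a number on the trend rather than reading it off a plot, one option is to fit a straight line to a patient's FVC readings and look at the slope. The sketch below uses `np.polyfit` on made-up readings; the function name `fvc_slope` and the sample values are illustrative and not taken from the dataset.

```python
import numpy as np


def fvc_slope(weeks, fvc):
    """Least-squares slope of FVC against weeks, in ml per week."""
    slope, _intercept = np.polyfit(np.asarray(weeks, dtype=float),
                                   np.asarray(fvc, dtype=float), deg=1)
    return slope


# Made-up readings that fall by 10 ml every week
weeks = [0, 4, 8, 12]
fvc = [3000, 2960, 2920, 2880]
print(f"Slope: {fvc_slope(weeks, fvc):.1f} ml/week")  # a negative slope means decline
```

Applied per patient to the `Weeks` and `FVC` columns of `train_df`, a negative slope would indicate declining lung function for that patient.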

**Here I will be referring to some of the great implementations from [Andrada's notebook](https://www.kaggle.com/andradaolteanu/pulmonary-fibrosis-competition-eda-dicom-prep)**

In [None]:
# Select unique bio info for the patients
data = train_df.groupby(by="Patient")[["Patient", "Age", "Sex", "SmokingStatus"]].first().reset_index(drop=True)

# Figure
f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize = (16, 6))

# kdeplot replaces the deprecated distplot(hist=False); countplot needs x= as a keyword
a = sns.kdeplot(data["Age"], ax=ax1, linewidth=6, linestyle="--")
b = sns.countplot(x=data["Sex"], ax=ax2)
c = sns.countplot(x=data["SmokingStatus"], ax=ax3)

a.set_title("Patient Age Distribution", fontsize=16)
b.set_title("Sex Frequency", fontsize=16)
c.set_title("Smoking Status", fontsize=16);

**Let's visualize FVC and Percent columns as these had high correlation with each other**

In [None]:
# Figure
f, (ax1, ax2) = plt.subplots(1, 2, figsize = (16, 6))

# kdeplot replaces the deprecated distplot(hist=False)
a = sns.kdeplot(train_df["FVC"], ax=ax1, linewidth=6, linestyle="--")
b = sns.kdeplot(train_df["Percent"], ax=ax2, linewidth=6, linestyle="-.")

a.set_title("FVC Distribution", fontsize=16)
b.set_title("Percent Distribution", fontsize=16);

# DICOM Images Loading and Preprocessing

**Let's start with creating some helper functions to load the DICOM images from the dataset**

In [None]:
data_path = '../input/osic-pulmonary-fibrosis-progression/train/'

output_path = '../input/output/'
train_image_files = sorted(glob.glob(os.path.join(data_path, '*','*.dcm')))
patients = os.listdir(data_path)
patients.sort()

print('Number of DICOM files:', len(train_image_files))
print('Some sample file paths:')
print("\n".join(train_image_files[:5]))

In [None]:
def load_scan(path):
    """
    Loads scans from a folder and into a list.
    
    Parameters: path (Folder path)
    
    Returns: slices (List of slices)
    """
    
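    # dcmread is the current name of the deprecated pydicom.read_file
    slices = [pydicom.dcmread(os.path.join(path, s)) for s in os.listdir(path)]
    slices.sort(key = lambda x: int(x.InstanceNumber))
    
    # Estimate the slice spacing from the first two slices; fall back to
    # SliceLocation when ImagePositionPatient is missing
    try:
        slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
    except AttributeError:
        slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)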
        
    for s in slices:
        s.SliceThickness = slice_thickness
        
    return slices

**Credits to [Aadhav Vignesh](https://www.kaggle.com/aadhavvignesh/lung-segmentation-by-marker-controlled-watershed) for an awesome implementation**

**Hounsfield Units**

**The unit of measurement in CT scans is the Hounsfield Unit (HU), which is a measure of radiodensity.**

**Hounsfield units (HU) are a dimensionless unit universally used in computed tomography (CT) scanning to express CT numbers in a standardized and convenient form. Hounsfield units are obtained from a linear transformation of the measured attenuation coefficients.**

![HU Table](http://patentimages.storage.googleapis.com/WO2005055806A2/imgf000011_0001.png)

***HUs can be calculated from the pixel data with a DICOM Image using the following formula:***

> ***HU = m ∗ P + b***

***where,***

> ***m  = RescaleSlope attribute of the DICOM image,***

> ***b  = RescaleIntercept attribute of the DICOM image,***

> ***P  = Pixel Array***
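
As a small worked example of the formula, the snippet below applies it to a handful of stored pixel values. The pixel values and the rescale parameters (`slope = 1`, `intercept = -1024`) are hypothetical and are not read from a real DICOM file.

```python
import numpy as np

# Hypothetical stored pixel values and rescale parameters (not from a real file)
pixels = np.array([[0, 24], [1024, 2024]], dtype=np.int16)
slope, intercept = 1.0, -1024.0

# HU = m * P + b
hu = slope * pixels.astype(np.float64) + intercept
print(hu)
```

On the HU scale air sits at about -1000 and water at 0, so with these parameters a stored value of 24 corresponds to air and 1024 to water.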

**So let's start by converting our images to Hounsfield units!!**

In [None]:
def get_pixels_hu(scans):
    """
    Converts raw images to Hounsfield Units (HU).
    
    Parameters: scans (Raw images)
    
    Returns: image (NumPy array)
    """
    
    image = np.stack([s.pixel_array for s in scans])
    image = image.astype(np.int16)

    # Since the scanning equipment is cylindrical in nature and image output is square,
    # we set the out-of-scan pixels to 0
    image[image == -2000] = 0
    
    
    # HU = m*P + b
    intercept = scans[0].RescaleIntercept
    slope = scans[0].RescaleSlope
    
    if slope != 1:
        image = slope * image.astype(np.float64)
        image = image.astype(np.int16)
        
    image += np.int16(intercept)
    
    return np.array(image, dtype=np.int16)


![](https://media.tenor.com/images/b4c2f5c658c1d3ade7e506ee7ffe3c5e/tenor.gif)


**Now that we have a helper function for converting scans to HU, let's proceed with preprocessing**

In [None]:
test_patient_scans = load_scan(data_path + patients[2])
test_patient_images = get_pixels_hu(test_patient_scans)

#Show the first five slices of the scan:

for imgs in range(len(test_patient_images[0:5])):
    f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize=(15,15))
    ax1.imshow(test_patient_images[imgs], cmap=plt.cm.bone)
    ax1.set_title("Original Slice")
    
    ax2.imshow(test_patient_images[imgs], cmap=plt.cm.bone)
    ax2.set_title("Original Slice")
    
    ax3.imshow(test_patient_images[imgs], cmap=plt.cm.bone)
    ax3.set_title("Original Slice")
    plt.show()

# Animated Scans

In [None]:
def set_lungwin(img, hu=(-1200., 600.)):
    lungwin = np.array(hu)
    newimg = (img-lungwin[0]) / (lungwin[1]-lungwin[0])
    newimg[newimg < 0] = 0
    newimg[newimg > 1] = 1
    newimg = (newimg * 255).astype('uint8')
    return newimg


scans = load_scan('../input/osic-pulmonary-fibrosis-progression/train/ID00007637202177411956430/')
scan_array = set_lungwin(get_pixels_hu(scans))

imageio.mimsave("/tmp/gif.gif", scan_array, duration=0.0001)
Image(filename="/tmp/gif.gif", format='png')

# Marker-controlled Watershed Transformation 

**We will be applying Marker-Controlled Watershed Transformation now**

**Why?**

**We do this because Watershed Transform is a really powerful segmentation algorithm, but has a drawback:**

**Over Segmentation: Oversegmentation occurs because every regional minimum forms its own catchment basin. Here is an example where steel grains are over-segmented by watershed transformation**

![](https://www.mathworks.com/company/newsletters/articles/the-watershed-transform-strategies-for-image-segmentation/_jcr_content/mainParsys/image_9.adapt.1200.high.gif/1542750812181.gif)

![](https://www.mathworks.com/company/newsletters/articles/the-watershed-transform-strategies-for-image-segmentation/_jcr_content/mainParsys/image_10.adapt.1200.high.gif/1542750812206.gif)

> Left: Steel Grains, Right: Oversegmented image as a result of using normal watershed transformation.

**To overcome this drawback, we use a marker-controlled watershed transformation, where we manually create markers where we start the flooding process.**
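
The difference between plain and marker-controlled flooding can be seen on a toy 1D height profile. This is only a sketch to illustrate the idea, not the 2D routine applied to the CT slices; the helper names `count_regional_minima` and `marker_watershed_1d` and the sample profile are made up for this example.

```python
import heapq


def count_regional_minima(signal):
    """Count strict local minima of a 1D signal (plateaus are ignored for simplicity)."""
    n = len(signal)
    count = 0
    for i in range(n):
        left = signal[i - 1] if i > 0 else float("inf")
        right = signal[i + 1] if i < n - 1 else float("inf")
        if signal[i] < left and signal[i] < right:
            count += 1
    return count


def marker_watershed_1d(signal, markers):
    """Flood a 1D signal from labelled seed positions.

    markers maps index -> label. The lowest position on the flood front is
    taken repeatedly and its unlabelled neighbours join the same basin.
    """
    labels = [0] * len(signal)
    heap = []
    for idx, lab in markers.items():
        labels[idx] = lab
        heapq.heappush(heap, (signal[idx], idx))
    while heap:
        _height, idx = heapq.heappop(heap)
        for nb in (idx - 1, idx + 1):
            if 0 <= nb < len(signal) and labels[nb] == 0:
                labels[nb] = labels[idx]
                heapq.heappush(heap, (signal[nb], nb))
    return labels


# Toy profile: two deep valleys with shallower dips in between
signal = [5, 1, 4, 3, 4, 9, 4, 2, 5, 0, 6]
print(count_regional_minima(signal))               # each minimum would form its own basin
print(marker_watershed_1d(signal, {1: 1, 9: 2}))   # only two basins, one per marker
```

Without markers every one of the local minima starts a basin; with two markers the shallow dips are absorbed into the basin that floods them, which is the behaviour we want for the two lungs.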

In [None]:
def generate_markers(image):
    """
    Generates markers for a given image.
    
    Parameters: image
    
    Returns: Internal Marker, External Marker, Watershed Marker
    """
    
    #Creation of the internal Marker
    marker_internal = image < -400
    marker_internal = segmentation.clear_border(marker_internal)
    marker_internal_labels = measure.label(marker_internal)
    
    areas = [r.area for r in measure.regionprops(marker_internal_labels)]
    areas.sort()
    
    if len(areas) > 2:
        for region in measure.regionprops(marker_internal_labels):
            if region.area < areas[-2]:
                for coordinates in region.coords:                
                       marker_internal_labels[coordinates[0], coordinates[1]] = 0
    
    marker_internal = marker_internal_labels > 0
    
    # Creation of the External Marker
    external_a = ndimage.binary_dilation(marker_internal, iterations=10)
    external_b = ndimage.binary_dilation(marker_internal, iterations=55)
    marker_external = external_b ^ external_a
    
    # Creation of the Watershed Marker
    # np.int was removed from NumPy; the image shape is used instead of a fixed 512x512
    marker_watershed = np.zeros(image.shape, dtype=np.int64)
    marker_watershed += marker_internal * 255
    marker_watershed += marker_external * 128
    
    return marker_internal, marker_external, marker_watershed

**Let's look at the markers generated for a sample slice**

In [None]:
test_patient_internal, test_patient_external, test_patient_watershed = generate_markers(test_patient_images[15])

f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize=(15,15))

ax1.imshow(test_patient_internal, cmap='gray')
ax1.set_title("Internal Marker")
ax1.axis('off')

ax2.imshow(test_patient_external, cmap='gray')
ax2.set_title("External Marker")
ax2.axis('off')

ax3.imshow(test_patient_watershed, cmap='gray')
ax3.set_title("Watershed Marker")
ax3.axis('off')

plt.show()

# Mask DICOM images

**Let's create a mask for some images of the dataset using a morphological masking approach**

**All Credits to [Welf Crozzo](https://www.kaggle.com/miklgr500/unsupervise-lung-detection)**

**The determination of a mask for the lungs is the starting point in the algorithm for determining the volume of the lungs by CT images. The next step is the correct integration of all CT images to determine the volume of the lungs.**
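
Once a mask exists for every slice, a rough volume estimate follows from counting the masked voxels and multiplying by the voxel size. The sketch below assumes contiguous slices whose spacing equals the slice thickness; the function name `lung_volume_ml`, the spacing values and the synthetic mask stack are all hypothetical.

```python
import numpy as np


def lung_volume_ml(mask_stack, pixel_spacing_mm, slice_thickness_mm):
    """Estimate the volume in ml from a stack of binary masks (slices, rows, cols)."""
    voxel_mm3 = pixel_spacing_mm[0] * pixel_spacing_mm[1] * slice_thickness_mm
    n_voxels = int(np.count_nonzero(mask_stack))
    return n_voxels * voxel_mm3 / 1000.0  # 1 ml = 1000 mm^3


# Hypothetical stack: 10 slices, each with a 100 x 100 pixel masked region
masks = np.zeros((10, 512, 512), dtype=bool)
masks[:, 200:300, 200:300] = True
print(lung_volume_ml(masks, pixel_spacing_mm=(0.5, 0.5), slice_thickness_mm=10.0))
```

In a real scan the spacing values would come from the `PixelSpacing` and slice-position attributes of the DICOM files rather than being hard-coded.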

In [None]:
sample_image = pydicom.dcmread(train_image_files[7])
img = sample_image.pixel_array

plt.imshow(img, cmap='gray')
plt.title('Original Image')

**Now we will convert the pixel values to HU using the rescale slope and intercept, and then create a binary mask of the pixels below -400 HU**

In [None]:
img = img * sample_image.RescaleSlope + sample_image.RescaleIntercept  # HU = m*P + b
img = img < -400 # lung/air region: below -400 HU

plt.imshow(img, cmap='gray')
plt.title('Binary Mask Image')

**Cleaning Border**

In [None]:
img = clear_border(img)
plt.imshow(img, cmap='gray')
plt.title('Cleaned Border Image')

**Labelling the connected regions of the mask**

In [None]:
img = label(img)
plt.imshow(img, cmap='gray')

In [None]:
areas = [r.area for r in regionprops(img)]
areas.sort()
if len(areas) > 2:
    for region in regionprops(img):
        if region.area < areas[-2]:
            for coordinates in region.coords:                
                img[coordinates[0], coordinates[1]] = 0
img = img > 0
plt.imshow(img, cmap='gray')

# Other Masks

In [None]:
# https://www.raddq.com/dicom-processing-segmentation-visualization-in-python/

def make_lungmask(img, display=False):
    row_size= img.shape[0]
    col_size = img.shape[1]
    
    mean = np.mean(img)
    std = np.std(img)
    img = img-mean
    img = img/std
    
    # Find the average pixel value near the lungs
    # to renormalize washed out images
    middle = img[int(row_size/5):int(row_size/5*4),int(col_size/5):int(col_size/5*4)]
    mean = np.mean(middle)
    img_max = np.max(img)  # renamed to avoid shadowing the built-in max/min
    img_min = np.min(img)
    
    # To improve threshold finding, I'm moving the 
    # underflow and overflow on the pixel spectrum
    img[img==img_max]=mean
    img[img==img_min]=mean
    
    # Using Kmeans to separate foreground (soft tissue / bone) and background (lung/air)
    
    kmeans = KMeans(n_clusters=2).fit(np.reshape(middle,[np.prod(middle.shape),1]))
    centers = sorted(kmeans.cluster_centers_.flatten())
    threshold = np.mean(centers)
    thresh_img = np.where(img<threshold,1.0,0.0)  # threshold the image

    # First erode away the finer elements, then dilate to include some of the pixels surrounding the lung.  
    # We don't want to accidentally clip the lung.

    eroded = morphology.erosion(thresh_img,np.ones([3,3]))
    dilation = morphology.dilation(eroded,np.ones([8,8]))

    labels = measure.label(dilation) # Different labels are displayed in different colors
    label_vals = np.unique(labels)
    regions = measure.regionprops(labels)
    good_labels = []
    for prop in regions:
        B = prop.bbox
        if B[2]-B[0]<row_size/10*9 and B[3]-B[1]<col_size/10*9 and B[0]>row_size/5 and B[2]<col_size/5*4:
            good_labels.append(prop.label)
    mask = np.zeros([row_size,col_size],dtype=np.int8)


    #  After just the lungs are left, we do another large dilation
    #  in order to fill in and out the lung mask 
    
    for N in good_labels:
        mask = mask + np.where(labels==N,1,0)
    mask = morphology.dilation(mask,np.ones([10,10])) # one last dilation

    if (display):
        fig, ax = plt.subplots(3, 2, figsize=[12, 12])
        ax[0, 0].set_title("Original")
        ax[0, 0].imshow(img, cmap='gray')
        ax[0, 0].axis('off')
        ax[0, 1].set_title("Threshold")
        ax[0, 1].imshow(thresh_img, cmap='gray')
        ax[0, 1].axis('off')
        ax[1, 0].set_title("After Erosion and Dilation")
        ax[1, 0].imshow(dilation, cmap='gray')
        ax[1, 0].axis('off')
        ax[1, 1].set_title("Color Labels")
        ax[1, 1].imshow(labels)
        ax[1, 1].axis('off')
        ax[2, 0].set_title("Final Mask")
        ax[2, 0].imshow(mask, cmap='gray')
        ax[2, 0].axis('off')
        ax[2, 1].set_title("Apply Mask on Original")
        ax[2, 1].imshow(mask*img, cmap='gray')
        ax[2, 1].axis('off')
        
        plt.show()
    return mask*img

In [None]:
# Select a sample
path = "../input/osic-pulmonary-fibrosis-progression/train/ID00007637202177411956430/19.dcm"
dataset = pydicom.dcmread(path)
img = dataset.pixel_array

# Masked image
mask_img = make_lungmask(img, display=True)

In [None]:
import re
patient_dir = "../input/osic-pulmonary-fibrosis-progression/train/ID00007637202177411956430"
datasets = []

# First Order the files in the dataset
files = []
for dcm in list(os.listdir(patient_dir)):
    files.append(dcm) 
files.sort(key=lambda f: int(re.sub(r'\D', '', f)))  # raw string avoids an invalid-escape warning

# Read in the Dataset
for dcm in files:
    path = patient_dir + "/" + dcm
    datasets.append(pydicom.dcmread(path))
    
imgs = []
for data in datasets:
    img = data.pixel_array
    imgs.append(img)
    
    
# Show masks
fig=plt.figure(figsize=(16, 6))
columns = 10
rows = 3

for i in range(1, columns*rows +1):
    img = make_lungmask(datasets[i-1].pixel_array)
    fig.add_subplot(rows, columns, i)
    plt.imshow(img, cmap="gray")
    plt.title(i, fontsize = 9)
    plt.axis('off');

# Extract Metadata
**Credits to [Rajasekhar](https://www.kaggle.com/trsekhar123/nb-to-extract-metadata-and-resize-images-train) for an easier implementation on this**

**Each DICOM file carries metadata in its header (scanner settings, image geometry, rescale parameters) alongside the pixel data. So let's collect it into a single table!!**

In [None]:
def get_observation_data(path):
    '''Get information from the .dcm files.
    path: complete path to the .dcm file'''

    image_data = pydicom.dcmread(path)  # dcmread replaces the deprecated read_file

    # Dictionary to store the information from the image
    observation_data = {
        "FileNumber" : path.split('/')[5],
        "Rows" : image_data.Rows,
        "Columns" : image_data.Columns,

        "PatientID" : image_data.PatientID,
        "BodyPartExamined" : image_data.BodyPartExamined,
        "SliceThickness" : int(image_data.SliceThickness),
        "KVP" : int(image_data.KVP),
        "DistanceSourceToDetector" : int(image_data.DistanceSourceToDetector),
        "DistanceSourceToPatient" : int(image_data.DistanceSourceToPatient),
        "GantryDetectorTilt" : int(image_data.GantryDetectorTilt),
        "TableHeight" : int(image_data.TableHeight),
        "RotationDirection" : image_data.RotationDirection,
        "XRayTubeCurrent" : int(image_data.XRayTubeCurrent),
        "GeneratorPower" : int(image_data.GeneratorPower),
        "ConvolutionKernel" : image_data.ConvolutionKernel,
        "PatientPosition" : image_data.PatientPosition,

        "ImagePositionPatient" : str(image_data.ImagePositionPatient),
        "ImageOrientationPatient" : str(image_data.ImageOrientationPatient),
        "PhotometricInterpretation" : image_data.PhotometricInterpretation,
        "ImageType" : str(image_data.ImageType),
        "PixelSpacing" : str(image_data.PixelSpacing),
        "WindowCenter" : int(image_data.WindowCenter),
        "WindowWidth" : int(image_data.WindowWidth),
        "Modality" : image_data.Modality,
        "StudyInstanceUID" : image_data.StudyInstanceUID,
        "PixelPaddingValue" : image_data.PixelPaddingValue,
        "SamplesPerPixel" : image_data.SamplesPerPixel,
        "SliceLocation" : int(image_data.SliceLocation),
        "BitsAllocated" : image_data.BitsAllocated,
        "BitsStored" : image_data.BitsStored,
        "HighBit" : image_data.HighBit,
        "PixelRepresentation" : image_data.PixelRepresentation,
        "RescaleIntercept" : int(image_data.RescaleIntercept),
        "RescaleSlope" : int(image_data.RescaleSlope),
        "RescaleType" : image_data.RescaleType
    }
    
    return observation_data

**Uncomment this if it is your first time editing my kernel**

In [None]:
#meta_data_df = []
#for filename in tqdm(train_image_files):
#    try:
#        meta_data_df.append(get_observation_data(filename))
#    except Exception as e:
#        continue

**Takes 3-4 mins I think and does not require GPU!!**

# Creating a Metadata file

In [None]:
#meta_data_df = pd.DataFrame.from_dict(meta_data_df)
#meta_data_df.head()

**So here we have captured the information pertaining to the CT scans performed on the lungs. It may be useful for conveying information about a person's FVC and other parameters**

# Exploring Metadata File

In [None]:
dicom_df = pd.read_csv('../input/osic-dicom-image-features/metadata.csv')
dicom_df.shape