In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns

import os
import tqdm

import matplotlib.pyplot as plt
%matplotlib inline

# Introduction

Open Source Imaging Consortium (OSIC) is a not-for-profit, co-operative effort between academia, industry and philanthropy. In this competition we will predict the severity of a patient's lung function decline based on a CT scan of their lungs. You'll determine lung function based on output from a spirometer, which measures the volume of air inhaled and exhaled. The challenge is to use machine learning techniques to make a prediction with the image, metadata, and baseline FVC as input. This high-level description comes straight from the competition's overview page (https://www.kaggle.com/c/osic-pulmonary-fibrosis-progression/overview).

# Data

In the dataset, you are provided with a baseline chest CT scan and associated clinical information for a set of patients. A patient has an image acquired at time Week = 0 and has numerous follow up visits over the course of approximately 1-2 years, at which time their FVC (forced vital capacity) is measured.

In the training set, you are provided with an anonymized, baseline CT scan and the entire history of FVC measurements.
In the test set, you are provided with a baseline CT scan and only the initial FVC measurement. You are asked to predict the final three FVC measurements for each patient, as well as a confidence value in your prediction.

Since this is real medical data, you will notice the relative timing of FVC measurements varies widely. The timing of the initial measurement relative to the CT scan and the duration to the forecasted time points may be different for each patient. This is considered part of the challenge of the competition. To avoid potential leakage in the timing of follow up visits, you are asked to predict every patient's FVC measurement for every possible week. Those weeks which are not in the final three visits are ignored in scoring.

The data for the CT scan images are provided in DICOM format.

## DICOM Format
Digital Imaging and Communications in Medicine (DICOM) is the accepted standard for the communication and management of medical imaging information. DICOM is used for archiving and transmitting medical images. It enables the integration of medical imaging devices (radiological scanners), servers, network hardware and Picture Archiving and Communication Systems (PACS). The standard has been widely adopted by hospitals and research centers and is steadily spreading to small practices and clinics as well.

# Visualizing DICOM Images

Before we visualize the DICOM images we will import some packages and create a utility function for plotting them. We will use the `pydicom` package to load the DICOM images and the `glob` package to list the image files in a directory. The images are displayed with matplotlib's `bone` colormap, a grayscale-like palette.

If we look at the length of the image files returned we find there are 33,026 images in the `train` folder. For a quick visualization we will only look at the first 12 images returned by the `glob` function.

In [None]:
import pydicom
import glob
import os
from typing import Dict, List

In [None]:
def visualize_osic_images(image_files: List[str]) -> None:
    # Take only the first 12 images in the list
    image_files = image_files[:12]
    
    fig, axes = plt.subplots(4, 3, figsize=(20, 16))
    axes = axes.flatten()
    for image_index, image_file in enumerate(image_files):
        # Load the DICOM image and convert to pixel array
        image_data = pydicom.dcmread(image_file).pixel_array
        axes[image_index].imshow(image_data, cmap=plt.cm.bone)
        
        image_name = '-'.join(image_file.split('/')[-2:])
        axes[image_index].set_title(f'{image_name}')

In [None]:
train_image_path = '/kaggle/input/osic-pulmonary-fibrosis-progression/train'
train_image_files = glob.glob(os.path.join(train_image_path, '*', '*.dcm'))

In [None]:
visualize_osic_images(train_image_files)
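
The file count quoted above is a total across all patients. To see how the slices are spread across patients, a minimal sketch along these lines could help. It assumes the `train/<patient>/<slice>.dcm` layout implied by the `glob` pattern; the function name `count_images_per_patient` and the example paths are hypothetical, introduced only for illustration.

```python
import os
from collections import Counter
from typing import Dict, List


def count_images_per_patient(image_files: List[str]) -> Dict[str, int]:
    """Count DICOM files per patient, taking the parent folder name as the patient ID."""
    counts = Counter(os.path.basename(os.path.dirname(path)) for path in image_files)
    return dict(counts)


# Hypothetical paths that mimic the train/<patient>/<slice>.dcm layout
example_files = [
    '/data/train/patient_a/1.dcm',
    '/data/train/patient_a/2.dcm',
    '/data/train/patient_b/1.dcm',
]
example_counts = count_images_per_patient(example_files)
print(example_counts)
```

In the notebook the same function could be called on `train_image_files`. Note that `glob.glob` does not guarantee any particular order, so wrapping the list in `sorted(...)` before taking the first 12 files is one way to make the preview reproducible.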

# `pydicom` package

There is more than meets the eye when loading DICOM images with `pydicom`. The package also makes it easy to read metadata from the DICOM files. More information can be found here: https://pydicom.github.io/pydicom/stable/old/getting_started.html. For example, if we load an image without converting it to a pixel array straight away, we can view more characteristics of the file. We can use this information to further enhance our knowledge of the images.

In [None]:
image_data = pydicom.dcmread(train_image_files[0])
image_data

In [None]:
# Access individual metadata fields by attribute name
image_data.PatientName, image_data.Modality, image_data.BodyPartExamined
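
Not every DICOM file is guaranteed to carry every field, so plain attribute access can raise an `AttributeError` on some files. A hedged sketch of a more defensive read is shown below. The helper `read_fields` is hypothetical, and a `SimpleNamespace` stands in for a real dataset so the snippet runs without any DICOM file; it relies only on Python's built-in `getattr` with a default value.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable


def read_fields(dataset: Any, field_names: Iterable[str], default: Any = None) -> Dict[str, Any]:
    """Collect attributes from a dataset-like object, using `default` for missing ones."""
    return {name: getattr(dataset, name, default) for name in field_names}


# Stand-in object for illustration only; a real call would pass a pydicom dataset
fake_dataset = SimpleNamespace(Modality='CT', BodyPartExamined='Chest')
fields = read_fields(fake_dataset, ['Modality', 'BodyPartExamined', 'WindowCenter'])
print(fields)
```

With a dataset loaded as in the cell above, the call would look like `read_fields(image_data, ['PatientName', 'Modality', 'BodyPartExamined'])`.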

# Training CSV

We also have meta-data contained in `/kaggle/input/osic-pulmonary-fibrosis-progression/train.csv` and `/kaggle/input/osic-pulmonary-fibrosis-progression/test.csv`. We will take a look at the contents. We are given 7 fields in `train.csv`.

1. Patient - patient id
2. Weeks - the relative number of weeks pre/post the baseline CT (may be negative)
3. FVC (Forced Vital Capacity) - the recorded lung capacity in ml
4. Percent - a computed field which approximates the patient's FVC as a percent of the typical FVC for a person of similar characteristics
5. Age - age of the patient
6. Sex - sex of the patient
7. SmokingStatus - The status of the patient relative to smoking - the three categories are: `['Ex-smoker', 'Never smoked', 'Currently smokes']`

Most of these definitions can also be found on the data page: https://www.kaggle.com/c/osic-pulmonary-fibrosis-progression/data
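
To make the `Percent` field concrete: if `Percent` is the FVC expressed as a percentage of a typical FVC, then the typical value it implies can be backed out by simple arithmetic. The sketch below assumes exactly that reading of the definition; the function name `implied_typical_fvc` and the numbers are made up for illustration and are not taken from `train.csv`.

```python
def implied_typical_fvc(fvc_ml: float, percent: float) -> float:
    """Back out the reference FVC (ml) implied by an FVC value and its Percent field."""
    if percent <= 0:
        raise ValueError('percent must be positive')
    # Percent = 100 * FVC / typical FVC, so typical FVC = 100 * FVC / Percent
    return fvc_ml * 100.0 / percent


# Made-up numbers for illustration, not taken from train.csv
typical = implied_typical_fvc(fvc_ml=2400.0, percent=80.0)
print(typical)
```

A patient measuring 2400 ml at 80 percent would therefore be compared against a typical FVC of about 3000 ml.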

In [None]:
train_df = pd.read_csv('/kaggle/input/osic-pulmonary-fibrosis-progression/train.csv')
train_df.head()

In [None]:
# Counts of the Weeks field. We will look at the top 20 weeks contained in the train.csv

# rename_axis + reset_index(name=...) yields the same column names
# regardless of how the installed pandas version labels value_counts output
weeks_frequency = (
    train_df['Weeks']
    .value_counts()
    .head(20)
    .rename_axis('Weeks')
    .reset_index(name='Frequency')
)

plt.figure(figsize=(10, 7))
ax = sns.barplot(x='Weeks', y='Frequency', data=weeks_frequency, order=weeks_frequency['Weeks'])
ax.set_title('Top Weeks by Frequency')
plt.grid()

In [None]:
# Histogram of the Age field

plt.figure(figsize=(10, 7))
ax = sns.histplot(train_df['Age'], kde=True)  # distplot is deprecated in newer seaborn releases
ax.set_title('Histogram for Age')
plt.grid()

print(train_df['Age'].describe())

In [None]:
# Counts of the Sex field (one count per row, i.e. per FVC measurement)
sex_frequency = (
    train_df['Sex']
    .value_counts()
    .rename_axis('Sex')
    .reset_index(name='Frequency')
)

plt.figure(figsize=(10, 7))
ax = sns.barplot(x='Sex', y='Frequency', data=sex_frequency, order=sex_frequency['Sex'])
ax.set_title('Sex Barplot')
plt.grid()

In [None]:
# Counts of the SmokingStatus field (one count per row, i.e. per FVC measurement)
smoking_status_frequency = (
    train_df['SmokingStatus']
    .value_counts()
    .rename_axis('SmokingStatus')
    .reset_index(name='Frequency')
)

plt.figure(figsize=(10, 7))
ax = sns.barplot(x='SmokingStatus', y='Frequency', data=smoking_status_frequency, order=smoking_status_frequency['SmokingStatus'])
ax.set_title('Smoking Status Barplot')
plt.grid()

In [None]:
# Histogram of the FVC field

plt.figure(figsize=(10, 7))
ax = sns.histplot(train_df['FVC'], kde=True)  # distplot is deprecated in newer seaborn releases
ax.set_title('Histogram for FVC')
plt.grid()

print(train_df['FVC'].describe())

In [None]:
# Histogram of the Percent field

plt.figure(figsize=(10, 7))
ax = sns.histplot(train_df['Percent'], kde=True)  # distplot is deprecated in newer seaborn releases
ax.set_title('Histogram for Percent')
plt.grid()

print(train_df['Percent'].describe())

# Observations - Part 1

* For the weeks field we can see that the highest frequency is Weeks=8 followed by Weeks=12, etc. I will investigate more as I progress towards modeling.
* The mean age is ~67 years old with a standard deviation of ~7 years. The minimum age in the data set is ~49 years old while the maximum age is ~88 years old.
* There are more male patients than female patients in the training data.
* For the smoking status there are a lot of ex-smokers in the training data. Then the next largest group is non-smokers, followed by currently smoking. Interesting that the training data has a lot of ex-smokers.
* The average FVC is ~2690 ml with a standard deviation of ~832 ml. The minimum FVC value we find in the training data is ~827 ml while the max is ~6399 ml.
* The average percent is ~77 with a standard deviation of 19. The minimum is ~28 while the max is ~153.
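
One caveat on the counts above: `train.csv` holds one row per FVC measurement, so a patient with many visits is counted many times. A minimal sketch of the difference between row-level and patient-level counts follows. It uses a tiny made-up table with the same column names and assumes that `Sex` and `SmokingStatus` are constant for each patient.

```python
import pandas as pd

# Tiny made-up table with the same column names as train.csv
toy_visits_df = pd.DataFrame({
    'Patient': ['p1', 'p1', 'p1', 'p2', 'p3'],
    'Sex': ['Male', 'Male', 'Male', 'Female', 'Male'],
    'SmokingStatus': ['Ex-smoker'] * 3 + ['Never smoked', 'Ex-smoker'],
})

# Row-level counts weight each patient by their number of visits
row_counts = toy_visits_df['Sex'].value_counts()

# Keeping one row per patient gives patient-level counts instead
patient_df = toy_visits_df.drop_duplicates(subset='Patient')
patient_counts = patient_df['Sex'].value_counts()

print(row_counts.to_dict())
print(patient_counts.to_dict())
```

Applying `drop_duplicates(subset='Patient')` to `train_df` before counting would give the per-patient view of the sex and smoking status breakdowns.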

I like to keep summary statistics like these on hand for the modeling phase.

Let's look at some more detailed analysis and see how our variables interact with one another.

# Advanced EDA & Plots

Here we will consider a multivariate analysis and look at different features and their interactions. The goal is to see whether visualizing and digging into the data reveals any useful relationships or anything that stands out.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 10))
axes = axes.flatten()

# Distribution for Age
age_male = train_df.loc[train_df['Sex'] == 'Male']['Age']
age_female = train_df.loc[train_df['Sex'] == 'Female']['Age']

# `fill` replaces the deprecated `shade` argument in newer seaborn releases
sns.kdeplot(age_male, label='Male', fill=True, ax=axes[0])
sns.kdeplot(age_female, label='Female', fill=True, ax=axes[0])

axes[0].legend()
axes[0].set_title('Age Distribution by Sex')
axes[0].set_xlabel('Age')
axes[0].grid()

# Distribution for Smoking Status
age_ex_smoker = train_df.loc[train_df['SmokingStatus'] == 'Ex-smoker']['Age']
age_never_smoked = train_df.loc[train_df['SmokingStatus'] == 'Never smoked']['Age']
age_currently_smoking = train_df.loc[train_df['SmokingStatus'] == 'Currently smokes']['Age']

# `fill` replaces the deprecated `shade` argument in newer seaborn releases
sns.kdeplot(age_ex_smoker, label='Ex-Smoker', fill=True, ax=axes[1])
sns.kdeplot(age_never_smoked, label='Never Smoked', fill=True, ax=axes[1])
sns.kdeplot(age_currently_smoking, label='Currently Smokes', fill=True, ax=axes[1])

axes[1].legend()
axes[1].set_title('Age Distribution by Smoking Status')
axes[1].set_xlabel('Age')
axes[1].grid()

* From the above age distributions we start to learn a little more about our data. For the most part the age distributions for males and females look pretty similar, but the male distribution shows a higher concentration around age 71. So the males in the training data seem to be somewhat older, while the females are a bit younger.

* The ex-smokers are concentrated above 70 years old, while the non-smokers are concentrated around the 65-year mark. The currently-smokes group is interesting because the KDE plot shows four different modes, with the highest concentration around 70 years old.

In [None]:
sns.lmplot(x='Age', y='FVC', hue='Sex', col='Sex', data=train_df)

* There seems to be a small decreasing relationship between age and FVC for the male category.
* There is also a small decreasing relationship between age and FVC for the female category, although it is more subtle than for the male category.
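
Reading slopes off a plot is subjective, so it can help to put a number on them. The sketch below fits a separate least-squares line per group with `np.polyfit` and reports the slope. The helper `slope_by_group` is hypothetical, and the toy values are made up so that each group lies exactly on a line; the same idea would apply to the other `lmplot` panels in this section.

```python
import numpy as np
import pandas as pd


def slope_by_group(df: pd.DataFrame, group_col: str, x_col: str, y_col: str) -> pd.Series:
    """Least-squares slope of y_col on x_col, fitted separately within each group."""
    slopes = {
        name: np.polyfit(group[x_col], group[y_col], deg=1)[0]
        for name, group in df.groupby(group_col)
    }
    return pd.Series(slopes, name=f'{y_col}_per_{x_col}')


# Made-up values chosen so that each group lies exactly on a line
toy_fvc_df = pd.DataFrame({
    'Sex': ['Male', 'Male', 'Male', 'Female', 'Female', 'Female'],
    'Age': [60, 65, 70, 60, 65, 70],
    'FVC': [3000, 2900, 2800, 2000, 1975, 1950],
})
toy_slopes = slope_by_group(toy_fvc_df, 'Sex', 'Age', 'FVC')
print(toy_slopes.round(2))
```

On the real data the call would be `slope_by_group(train_df, 'Sex', 'Age', 'FVC')`. Because each patient contributes several rows, these slopes describe the pooled measurements rather than independent patients.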

In [None]:
sns.lmplot(x='Age', y='FVC', hue='SmokingStatus', col='SmokingStatus', data=train_df)

* Looking at age vs. FVC by smoking status, we can see a decreasing relationship between age and FVC for the ex-smoker category as well as for the currently-smokes category.
* There is a positive relationship between age and FVC for people who have never smoked.

In [None]:
sns.lmplot(x='Age', y='Percent', hue='Sex', col='Sex', data=train_df)

* For age vs. percent in the male category we see a slight positive relationship.
* For the female category we see a stronger positive relationship between age and percent.

In [None]:
sns.lmplot(x='Age', y='Percent', hue='SmokingStatus', col='SmokingStatus', data=train_df)

* When we look at age vs. percent by the patient's smoking status we observe a slight positive relationship between age and percent for ex-smokers.
* There is a strong positive relationship between age and percent for patients who have never smoked.
* There is a negative relationship between age and percent for people who are currently smoking.

In [None]:
sns.lmplot(x='Age', y='Weeks', hue='Sex', col='Sex', data=train_df)

* For age vs weeks we see a slight negative relationship for both the male and female category.

In [None]:
sns.lmplot(x='Age', y='Weeks', hue='SmokingStatus', col='SmokingStatus', data=train_df)

* There seems to be a strong negative relationship between age and weeks for people who are currently smoking.

# Extract Meta-Data from DICOM Images

Let's see if we can also extract more information from the images themselves. We saw above that metadata is included with the DICOM images. Let's look at an example again.

In [None]:
image_data

In [None]:
def extract_dicom_meta_data(filename: str) -> Dict:
    # Load image
    image_data = pydicom.dcmread(filename)  # read the file passed in, not always the first one
    
    row = {
        'Patient': image_data.PatientID,
        'body_part_examined': image_data.BodyPartExamined,
        'image_position_patient': image_data.ImagePositionPatient,
        'image_orientation_patient': image_data.ImageOrientationPatient,
        'photometric_interpretation': image_data.PhotometricInterpretation,
        'rows': image_data.Rows,
        'columns': image_data.Columns,
        'pixel_spacing': image_data.PixelSpacing,
        'window_center': image_data.WindowCenter,
        'window_width': image_data.WindowWidth
    }
    
    return row

In [None]:
meta_data_df = []
for filename in tqdm.tqdm(train_image_files):
    meta_data_df.append(extract_dicom_meta_data(filename))

In [None]:
# Convert to a pd.DataFrame from dict
meta_data_df = pd.DataFrame.from_dict(meta_data_df)
meta_data_df.head()

* Looks now like we have extended our dataset and added more meta-data! I am not quite sure yet if any of this data is useful but I will try to explore it when I get a chance.
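
One way to start exploring this metadata is to join it onto the clinical table through the shared `Patient` key. The sketch below is an assumption about how that could be done, not something the notebook has run. It uses made-up stand-ins for `train_df` and `meta_data_df`, and it keeps only the first slice per patient, which is reasonable only for fields that stay constant within a scan (such as `rows` and `columns`) and not for per-slice fields like `image_position_patient`.

```python
import pandas as pd

# Made-up stand-ins for train_df and meta_data_df, sharing the 'Patient' key
toy_train_df = pd.DataFrame({
    'Patient': ['p1', 'p1', 'p2'],
    'Weeks': [0, 8, 3],
    'FVC': [2500, 2450, 3100],
})
toy_meta_df = pd.DataFrame({
    'Patient': ['p1', 'p1', 'p2'],  # one row per image slice
    'rows': [512, 512, 768],
    'columns': [512, 512, 768],
})

# Collapse slice-level rows to one row per patient before joining,
# otherwise each clinical row would be repeated once per slice
patient_meta_df = toy_meta_df.drop_duplicates(subset='Patient')

# validate='many_to_one' raises if the right-hand table still has duplicate keys
merged_df = toy_train_df.merge(patient_meta_df, on='Patient', how='left', validate='many_to_one')
print(merged_df)
```

The merged table keeps one row per FVC measurement and gains the image-level columns, which makes it easy to plot them against `FVC` or `Percent`.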

# More EDA ...

Thanks for taking a peek at what I have so far. I will prepare more analysis and utility functions soon.