In [None]:
import warnings
warnings.filterwarnings('ignore')

import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

from scipy import stats

import matplotlib.pyplot as plt
import matplotlib.animation as animation
import seaborn as sns
from IPython.display import HTML

import pydicom

In [None]:
df_train = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/train.csv')
df_test = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/test.csv')

print(f'Training Set Shape = {df_train.shape} - Patients = {df_train["Patient"].nunique()}')
print(f'Training Set Memory Usage = {df_train.memory_usage().sum() / 1024 ** 2:.2f} MB')
print(f'Test Set Shape = {df_test.shape} - Patients = {df_test["Patient"].nunique()}')
print(f'Test Set Memory Usage = {df_test.memory_usage().sum() / 1024 ** 2:.2f} MB')

## **1. Introduction**

There are **1549** samples in `train.csv`, and they belong to **176** different patients. Every patient has images acquired at Week 0, but not every patient has an `FVC` measurement at Week 0, and the number of `FVC` measurements varies from patient to patient.

Most patients have `FVC` measurements at 9 different timesteps, but this number ranges from 6 to 10, so the number of measurements is not consistent across patients.

In [None]:
training_sample_counts = df_train.rename(columns={'Weeks': 'Samples'}).groupby('Patient').agg('count')['Samples'].value_counts()
print(f'Training Set FVC Measurements Per Patient \n{("-") * 41}\n{training_sample_counts}')

There are **5** samples in `test.csv`, and they belong to **5** different patients. Those samples and patients also exist in `train.csv`. They are the last 5 patients of the training set, and the samples are the first measurements of those patients. This is because `test.csv` is a placeholder and the real test set is hidden. The placeholder shows the structure of the real test set and allows submissions to be tested. When the notebook is submitted, it runs against the real test set.

The real test set will have more than 5 patients. However, it will only have a baseline CT scan and only the initial FVC measurement. You are asked to predict the final three FVC measurements for each patient, as well as a confidence value in your prediction. In order to avoid potential leakage, you are asked to predict every patient's FVC measurement for every possible week. Weeks prior to final three weeks are ignored in scoring.

In [None]:
df_test

The simplest way to create a submission is to predict `FVC` and `Confidence` for all test samples, create `Patient_Week` on the test set, and merge the result into `sample_submission.csv`. This way, the submission file contains every prediction regardless of how many test samples there are.

In [None]:
df_submission = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/sample_submission.csv')
df_submission
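
The merge described above can be sketched on toy data. Everything in this block is an illustrative assumption rather than the notebook's actual pipeline: the patient IDs, the baseline values, the constant `Confidence` of 100, the "carry the baseline forward" prediction, and the week range of -12 to 133 (chosen because it gives 146 timesteps per patient). The `toy_*` names are hypothetical and are used so that `df_test` and `df_submission` are not overwritten.

```python
import pandas as pd

# Hypothetical stand-in for test.csv: one baseline row per patient
toy_test = pd.DataFrame({
    'Patient': ['ID_A', 'ID_B'],
    'Weeks': [6, 15],
    'FVC': [3020, 2739],
})

# Assumed week range: 146 timesteps from -12 to 133 (inclusive)
weeks = range(-12, 134)

# One row per (patient, week) pair
toy_pred = pd.DataFrame(
    [(patient, week) for patient in toy_test['Patient'] for week in weeks],
    columns=['Patient', 'Weeks'],
)

# Placeholder predictions: repeat the baseline FVC with a constant confidence
baseline_fvc = toy_test.set_index('Patient')['FVC']
toy_pred['FVC'] = toy_pred['Patient'].map(baseline_fvc)
toy_pred['Confidence'] = 100

# Build the key that identifies a row in the submission file
toy_pred['Patient_Week'] = toy_pred['Patient'] + '_' + toy_pred['Weeks'].astype(str)

# Hypothetical stand-in for sample_submission.csv: only the key column, in arbitrary order
toy_sample = toy_pred[['Patient_Week']].sample(frac=1, random_state=0).reset_index(drop=True)

# A left merge keeps the row order of the sample file and attaches the predictions
toy_submission = toy_sample.merge(
    toy_pred[['Patient_Week', 'FVC', 'Confidence']],
    on='Patient_Week',
    how='left',
)
print(toy_submission.head())
```

Using a left merge on `Patient_Week` means the output always has exactly the rows of the sample file, so a missing prediction shows up as `NaN` instead of silently dropping a row.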

## **2. Laplace Log Likelihood**

Predictions are evaluated with a modified version of the Laplace Log Likelihood. For each sample in the test set, an `FVC` and a `Confidence` measure (standard deviation σ) have to be predicted.

`Confidence` values smaller than 70 are clipped.

$\large \sigma_{clipped} = \max(\sigma, 70),$

Absolute errors greater than 1000 are also clipped, so that a single very large error cannot dominate the score.

$\large \Delta = \min ( |FVC_{true} - FVC_{predicted}|, 1000 ).$

The metric is defined as:

$\Large metric = -   \frac{\sqrt{2} \Delta}{\sigma_{clipped}} - \ln ( \sqrt{2} \sigma_{clipped} ).$
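
The three formulas above translate directly into NumPy. This is a minimal sketch, not the official scoring code: the function name `laplace_log_likelihood` is hypothetical, and averaging the per-sample values into a single score is an assumption that the formulas above do not state.

```python
import numpy as np

def laplace_log_likelihood(fvc_true, fvc_pred, sigma):
    """Modified Laplace Log Likelihood, averaged over samples (averaging is assumed)."""
    fvc_true = np.asarray(fvc_true, dtype=float)
    fvc_pred = np.asarray(fvc_pred, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    # Confidence values smaller than 70 are raised to 70
    sigma_clipped = np.maximum(sigma, 70)
    # Absolute errors greater than 1000 are lowered to 1000
    delta = np.minimum(np.abs(fvc_true - fvc_pred), 1000)

    metric = -np.sqrt(2) * delta / sigma_clipped - np.log(np.sqrt(2) * sigma_clipped)
    return metric.mean()

# Usage with made-up values: two samples, each with an error of 100 and sigma of 200
score = laplace_log_likelihood([2500, 3000], [2400, 3100], [200, 200])
print(f'Laplace Log Likelihood: {score:.4f}')
```

The score is always negative and higher is better. A smaller `Confidence` shrinks the logarithmic penalty but amplifies the error term, so the best σ for a sample grows with the expected absolute error.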


## **3. FVC (Forced Vital Capacity)**

`FVC` measurement shows the amount of air a person can forcefully and quickly exhale after taking a deep breath. It is defined as `the recorded lung capacity in ml` under the Data tab. The change in `FVC` over the course of weeks is used for predicting the patients' lung function decline.

The metric clips absolute errors at 1000, but `FVC` values themselves are not clipped: the minimum `FVC` in the training set is **827** and the maximum is **6399**. The distribution is right-skewed because a few patients have extremely high `FVC` measurements. However, most measurements are close to the mean `FVC`.

In [None]:
print(f'FVC Statistical Summary\n{"-" * 23}')

print(f'Mean: {df_train["FVC"].mean():.6}  -  Median: {df_train["FVC"].median():.6}  -  Std: {df_train["FVC"].std():.6}')
print(f'Min: {df_train["FVC"].min()}  -  25%: {df_train["FVC"].quantile(0.25)}  -  50%: {df_train["FVC"].quantile(0.5)}  -  75%: {df_train["FVC"].quantile(0.75)}  -  Max: {df_train["FVC"].max()}')
print(f'Skew: {df_train["FVC"].skew():.6}  -  Kurtosis: {df_train["FVC"].kurtosis():.6}')
missing_values_count = df_train[df_train["FVC"].isnull()].shape[0]
training_samples_count = df_train.shape[0]
print(f'Missing Values: {missing_values_count}/{training_samples_count} ({missing_values_count * 100 / training_samples_count:.4}%)')

fig, axes = plt.subplots(ncols=2, figsize=(18, 6), dpi=150)

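# histplot with a KDE overlay replaces the deprecated sns.distplot (requires seaborn >= 0.11)
sns.histplot(x=df_train['FVC'], kde=True, stat='density', label='FVC', ax=axes[0])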
stats.probplot(df_train['FVC'], plot=axes[1])

for i in range(2):
    axes[i].tick_params(axis='x', labelsize=12)
    axes[i].tick_params(axis='y', labelsize=12)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    
axes[0].set_title(f'FVC Distribution in Training Set', size=15, pad=15)
axes[1].set_title(f'FVC Probability Plot', size=15, pad=15)

plt.show()

Every patient's `FVC` should be plotted individually as a function of time, because the competition objective is predicting the `FVC` values of different patients over 146 timesteps.

The condition of the majority of patients got worse over the course of weeks; only about 1-2% improved. The increase of `FVC` in that small group may be partly random, because `FVC` fluctuates a lot over time in some patients. Patients with strong fluctuations may not be responding to treatment very well, or they may not be getting better for some other reason. However, some patients are clearly getting better, because their `FVC` increases almost linearly with very few fluctuations. Those patients are very rare.

In [None]:
def plot_fvc(df, patient):
        
    df[['Weeks', 'FVC']].set_index('Weeks').plot(figsize=(30, 6), label='_nolegend_')
    
    plt.tick_params(axis='x', labelsize=20)
    plt.tick_params(axis='y', labelsize=20)
    plt.xlabel('')
    plt.ylabel('')
    plt.title(f'Patient: {patient} - {df["Age"].tolist()[0]} - {df["Sex"].tolist()[0]} - {df["SmokingStatus"].tolist()[0]} ({len(df)} Measurements in {(df["Weeks"].max() - df["Weeks"].min())} Weeks Period)', size=25, pad=25)
    plt.legend().set_visible(False)
    plt.show()

for patient, df in df_train.groupby('Patient'):

    # Work on a copy so that the new column is not written into a slice of df_train
    df = df.copy()
    df['FVC_diff-1'] = np.abs(df['FVC'].diff(-1))
    
    print(f'Patient: {patient} FVC Statistical Summary\n{"-" * 58}')
    print(f'Mean: {df["FVC"].mean():.6}  -  Median: {df["FVC"].median():.6}  -  Std: {df["FVC"].std():.6}')
    print(f'Min: {df["FVC"].min()} -  Max: {df["FVC"].max()}')
    print(f'Skew: {df["FVC"].skew():.6}  -  Kurtosis: {df["FVC"].kurtosis():.6}')
    print(f'Change Mean: {df["FVC_diff-1"].mean():.6}  - Change Median: {df["FVC_diff-1"].median():.6}  - Change Std: {df["FVC_diff-1"].std():.6}')
    print(f'Change Min: {df["FVC_diff-1"].min()} -  Change Max: {df["FVC_diff-1"].max()}')
    print(f'Change Skew: {df["FVC_diff-1"].skew():.6} -  Change Kurtosis: {df["FVC_diff-1"].kurtosis():.6}')
    
    plot_fvc(df, patient)
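
One way to put a number on "getting worse" or "getting better" is the slope of a straight line fitted to each patient's `FVC` over `Weeks`. The block below is a rough sketch on made-up data, not an analysis of the training set: the `toy` frame, the helper name `fvc_slope` and the choice of a first-degree least-squares fit are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up measurements: patient A declines, patient B improves
toy = pd.DataFrame({
    'Patient': ['A'] * 4 + ['B'] * 4,
    'Weeks': [0, 10, 20, 30, 0, 10, 20, 30],
    'FVC': [3000, 2900, 2800, 2700, 2000, 2050, 2100, 2150],
})

def fvc_slope(group):
    """Least-squares slope of FVC over Weeks (ml per week) for one patient."""
    slope, _intercept = np.polyfit(group['Weeks'].to_numpy(), group['FVC'].to_numpy(), deg=1)
    return slope

# One slope per patient; a negative slope means declining lung function
slopes = pd.Series({patient: fvc_slope(group) for patient, group in toy.groupby('Patient')})
print(slopes)
print(f'Share of patients with increasing FVC: {(slopes > 0).mean():.2%}')
```

A fitted slope averages out week-to-week fluctuations, so it separates a steady trend from noise better than comparing only the first and last measurement.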


## **4. Tabular Data**

There are three continuous features in addition to `FVC` in the tabular data. Those features are:

* `Weeks`: The relative number of weeks pre/post the baseline CT (may be negative). It doesn't have any significant relationship with other features because patients got better or worse over the course of time regardless of their `Age`.
* `Percent`: A computed field which approximates the patient's `FVC` as a percent of the typical `FVC` for a person of similar characteristics. This feature has a strong relationship with `FVC` because it is derived from it, but it doesn't have any significant relationship with other features.
* `Age`: Age of the patient. `Age` has a slight relationship with `FVC` and `Percent` since younger patients have higher lung capacity.

The distributions of `FVC`, `Percent` and `Age` are similar in shape, but the distribution of `Weeks` differs from them.
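
Because `Percent` expresses `FVC` as a percentage of a typical `FVC`, the implied typical value can be recovered by simple arithmetic. The sketch below uses made-up numbers, and the column name `TypicalFVC` is hypothetical.

```python
import pandas as pd

# Made-up rows: FVC in ml and Percent as a percentage of the typical FVC
toy = pd.DataFrame({'FVC': [2000, 3000], 'Percent': [50.0, 75.0]})

# Percent = FVC / typical FVC * 100, so typical FVC = FVC / (Percent / 100)
toy['TypicalFVC'] = toy['FVC'] / (toy['Percent'] / 100)
print(toy)
```

If the implied typical value stays roughly constant for a patient, `FVC` and `Percent` are proportional within that patient, which would explain the strong relationship between the two.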

In [None]:
g = sns.pairplot(df_train[['FVC', 'Weeks', 'Percent', 'Age']], aspect=1.4, height=5, diag_kind='kde', kind='reg')

g.axes[3, 0].set_xlabel('FVC', fontsize=25)
g.axes[3, 1].set_xlabel('Weeks', fontsize=25)
g.axes[3, 2].set_xlabel('Percent', fontsize=25)
g.axes[3, 3].set_xlabel('Age', fontsize=25)
g.axes[0, 0].set_ylabel('FVC', fontsize=25)
g.axes[1, 0].set_ylabel('Weeks', fontsize=25)
g.axes[2, 0].set_ylabel('Percent', fontsize=25)
g.axes[3, 0].set_ylabel('Age', fontsize=25)

g.axes[3, 0].tick_params(axis='x', labelsize=20)
g.axes[3, 1].tick_params(axis='x', labelsize=20)
g.axes[3, 2].tick_params(axis='x', labelsize=20)
g.axes[3, 3].tick_params(axis='x', labelsize=20)
g.axes[0, 0].tick_params(axis='y', labelsize=20)
g.axes[1, 0].tick_params(axis='y', labelsize=20)
g.axes[2, 0].tick_params(axis='y', labelsize=20)
g.axes[3, 0].tick_params(axis='y', labelsize=20)

g.fig.suptitle('Tabular Data Feature Distributions and Interactions', fontsize=35, y=1.08)

plt.show()

The first categorical feature in the tabular data is `Sex`, which is the sex of the patient. In the training set, **139** (79%) patients are male and **37** (21%) patients are female.

In [None]:
fig = plt.figure(figsize=(10, 6), dpi=100)

# Take one row per patient so that every patient is counted once
sex_counts = df_train.groupby('Patient')['Sex'].first().value_counts()
percentages = (sex_counts / sex_counts.sum() * 100).round(2)

sns.barplot(x=sex_counts.index, y=sex_counts.values)

plt.ylabel('')
# Build the tick labels from the index instead of assuming a fixed category order
plt.xticks(np.arange(len(sex_counts)), [f'{sex} ({pct}%)' for sex, pct in zip(sex_counts.index, percentages)])
plt.tick_params(axis='x', labelsize=12)
plt.tick_params(axis='y', labelsize=12)
plt.title('Sex Counts in Training Set', size=15, pad=15)

plt.show()

`FVC` distributions of males and females are very different from each other. On average, females have lower lung capacity than males, which is expected physiologically. The relationships between `FVC` and the other features also differ between the sexes: `FVC` of males has a stronger relationship with `Percent` and `Age` than `FVC` of females.

Comparing `Weeks` between sexes is not meaningful by itself, but females show a noticeable `FVC` improvement over the course of weeks compared to males.

`Percent` distributions of males and females are very different from each other just like `FVC` distributions because `Percent` is derived from it.

`Age` shows no differences between males and females in terms of relationships and distributions, except that the `Age` distribution of females has slightly longer tails and a lower peak.

In [None]:
g = sns.pairplot(df_train[['FVC', 'Weeks', 'Percent', 'Age', 'Sex']], hue='Sex', aspect=1.4, height=5, diag_kind='kde', kind='reg')

g.axes[3, 0].set_xlabel('FVC', fontsize=25)
g.axes[3, 1].set_xlabel('Weeks', fontsize=25)
g.axes[3, 2].set_xlabel('Percent', fontsize=25)
g.axes[3, 3].set_xlabel('Age', fontsize=25)
g.axes[0, 0].set_ylabel('FVC', fontsize=25)
g.axes[1, 0].set_ylabel('Weeks', fontsize=25)
g.axes[2, 0].set_ylabel('Percent', fontsize=25)
g.axes[3, 0].set_ylabel('Age', fontsize=25)

g.axes[3, 0].tick_params(axis='x', labelsize=20)
g.axes[3, 1].tick_params(axis='x', labelsize=20)
g.axes[3, 2].tick_params(axis='x', labelsize=20)
g.axes[3, 3].tick_params(axis='x', labelsize=20)
g.axes[0, 0].tick_params(axis='y', labelsize=20)
g.axes[1, 0].tick_params(axis='y', labelsize=20)
g.axes[2, 0].tick_params(axis='y', labelsize=20)
g.axes[3, 0].tick_params(axis='y', labelsize=20)

plt.legend(prop={'size': 20})
g._legend.remove()
g.fig.suptitle('Tabular Data Feature Distributions and Interactions Between Sex Groups', fontsize=35, y=1.08)

plt.show()

The second categorical feature in the tabular data is `SmokingStatus`, which is self-explanatory. In the training set, **118** (67%) patients are ex-smokers, **49** (28%) patients have never smoked and **9** (5%) patients are current smokers.

In [None]:
fig = plt.figure(figsize=(10, 6), dpi=100)

# Take one row per patient so that every patient is counted once
smoking_counts = df_train.groupby('Patient')['SmokingStatus'].first().value_counts()
percentages = (smoking_counts / smoking_counts.sum() * 100).round(2)

sns.barplot(x=smoking_counts.index, y=smoking_counts.values)

plt.ylabel('')
# Build the tick labels from the index instead of assuming a fixed category order
plt.xticks(np.arange(len(smoking_counts)), [f'{status} ({pct}%)' for status, pct in zip(smoking_counts.index, percentages)])
plt.tick_params(axis='x', labelsize=12)
plt.tick_params(axis='y', labelsize=12)
plt.title('SmokingStatus Counts in Training Set', size=15, pad=15)

plt.show()

`FVC` distributions of the `SmokingStatus` groups are quite unexpected. The mean `FVC` of smokers is higher than the mean `FVC` of ex-smokers and of patients who have never smoked. Note that only 9 patients are current smokers, so this group is small.

The distribution of `Weeks` is similar across `SmokingStatus` groups. Smokers have the strongest positive linear relationship between `FVC` and `Weeks`, which is another unexpected phenomenon.

`Percent` distributions of the `SmokingStatus` groups are very similar to the `FVC` distributions, but the peaks are taller. The linear relationship between `Percent` and `Weeks` is also stronger than the one between `FVC` and `Weeks`.

`Age` has no relationship with `SmokingStatus`.

In [None]:
g = sns.pairplot(df_train[['FVC', 'Weeks', 'Percent', 'Age', 'SmokingStatus']], hue='SmokingStatus', aspect=1.4, height=5, diag_kind='kde', kind='reg')

g.axes[3, 0].set_xlabel('FVC', fontsize=25)
g.axes[3, 1].set_xlabel('Weeks', fontsize=25)
g.axes[3, 2].set_xlabel('Percent', fontsize=25)
g.axes[3, 3].set_xlabel('Age', fontsize=25)
g.axes[0, 0].set_ylabel('FVC', fontsize=25)
g.axes[1, 0].set_ylabel('Weeks', fontsize=25)
g.axes[2, 0].set_ylabel('Percent', fontsize=25)
g.axes[3, 0].set_ylabel('Age', fontsize=25)

g.axes[3, 0].tick_params(axis='x', labelsize=20)
g.axes[3, 1].tick_params(axis='x', labelsize=20)
g.axes[3, 2].tick_params(axis='x', labelsize=20)
g.axes[3, 3].tick_params(axis='x', labelsize=20)
g.axes[0, 0].tick_params(axis='y', labelsize=20)
g.axes[1, 0].tick_params(axis='y', labelsize=20)
g.axes[2, 0].tick_params(axis='y', labelsize=20)
g.axes[3, 0].tick_params(axis='y', labelsize=20)

plt.legend(prop={'size': 20})
g._legend.remove()
g.fig.suptitle('Tabular Data Feature Distributions and Interactions Between SmokingStatus Groups', fontsize=35, y=1.08)

plt.show()

As seen from the plots above, the only strong correlation is between `FVC` and `Percent`. The other features' correlations are between -0.1 and 0.1.

In [None]:
fig = plt.figure(figsize=(10, 10), dpi=100)

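# Select the numeric columns explicitly; recent pandas versions raise an error when corr() meets string columns
sns.heatmap(df_train[['Weeks', 'FVC', 'Percent', 'Age']].corr(), annot=True, square=True, cmap='coolwarm', annot_kws={'size': 15}, fmt='.2f')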

plt.tick_params(axis='x', labelsize=18, rotation=0)
plt.tick_params(axis='y', labelsize=18, rotation=0)
plt.title('Tabular Data Feature Correlations', size=20, pad=20)

plt.show()

## **5. DICOM Files**

A `.dcm` file is an image file saved in the Digital Imaging and Communications in Medicine (DICOM) image format. It stores a medical image, such as a CT scan or ultrasound. DCM files may also include patient information to pair the image with the patient. There are **176** directories in `osic-pulmonary-fibrosis-progression/train` and **5** directories in `osic-pulmonary-fibrosis-progression/test`. Every directory has the `.dcm` files of the corresponding patient.

In [None]:
print(f'Patients (directories) in osic-pulmonary-fibrosis-progression/train: {len(os.listdir("../input/osic-pulmonary-fibrosis-progression/train"))}')
print(f'Patients (directories) in osic-pulmonary-fibrosis-progression/test: {len(os.listdir("../input/osic-pulmonary-fibrosis-progression/test"))}')

Multiple `.dcm` files represent different slices of a single CT scan, which is acquired at Week 0. A CT scan produces a 3D volume that consists of 2D slices, and each slice is stored as a `.dcm` file. The number of slices differs between directories in `osic-pulmonary-fibrosis-progression/train`: it ranges from **12** to **1018** with a median of **98**. The total number of slices adds up to **33026** for 176 patients.

In [None]:
slice_counts = np.array([len(os.listdir(f'../input/osic-pulmonary-fibrosis-progression/train/{directory}')) for directory in os.listdir('../input/osic-pulmonary-fibrosis-progression/train')])

print(f'Number of Image Slices in Training Set\n{"-" * 38}')
print(f'Mean Slice Count: {slice_counts.mean():.6}  -  Median Slice Count: {int(np.median(slice_counts))} - Total Slice Count: {slice_counts.sum()}')
print(f'Min Slice Count: {slice_counts.min()} -  Max Slice Count: {slice_counts.max()}')

fig = plt.figure(figsize=(20, 5), dpi=150)
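# Pass the data as x explicitly; newer seaborn versions treat the first positional argument as `data`
ax = sns.countplot(x=slice_counts)

# Show only every 10th tick label to keep the x axis readable
for idx, label in enumerate(ax.get_xticklabels()):
    label.set_visible(idx % 10 == 0)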

plt.ylabel('')
plt.tick_params(axis='x', labelsize=15)
plt.tick_params(axis='y', labelsize=15)
plt.title('Number of Image Slices in Training Set', size=20, pad=20)

plt.show()

DICOM files can be read and processed easily with the `pydicom` package. They store metadata alongside the pixel data. The first patient's (`ID00228637202259965313869`) first slice (`1.dcm`) is read with the `dcmread` function in the cell below.

Reading the file creates a `pydicom.dataset.FileDataset` object. Dataset can be displayed by simply printing its string (`str()` or `repr()`) value. `FileDataset` object wraps `dict` and contains `DataElement` instances. The value of each element can be one of a regular numeric, string or text value, a `list` of regular values or a `Sequence` instance, where `Sequence` is a `list` of `Dataset` instances.

In [None]:
file_path = '../input/osic-pulmonary-fibrosis-progression/train/ID00228637202259965313869/1.dcm'
dicom_file = pydicom.dcmread(file_path)

print(f'Patient: ID00228637202259965313869 Image: 1.dcm Dataset\n{"-" * 55}\n\n{dicom_file}')

`pydicom` exposes a dictionary-like interface, so the data inside the files can be accessed with the `keys()` and `values()` methods. The `keys()` method shows that the keys are (group, element) pairs, which are called tags. Specific elements can be accessed by their DICOM keywords or by their tags, but using DICOM keywords is the recommended way. If an element is accessed by its tag, `.value` has to be appended after the square brackets to get its value.

In [None]:
print(f'Accessing Patient Name with DICOM Keyword (PatientName): {dicom_file.PatientName}')
print(f'Accessing Patient Name with Tag Number ((0x10, 0x10)): {dicom_file[(0x10, 0x10)].value}')

If you don't remember the exact keywords, `Dataset.dir()` can be used to return all non-private element keywords in the dataset. All of the data associated with the keywords below can be accessed with `Dataset.Keyword`.

In [None]:
dicom_file.dir()

The `PixelData` keyword contains the raw bytes of the slice. It is much more convenient to use `Dataset.pixel_array`, which returns a `numpy.ndarray`. As seen below, the `ID00228637202259965313869/1.dcm` slice is a 512x512 greyscale image.

In [None]:
print(f'Patient: ID00228637202259965313869 Image: 1.dcm Pixel Array {dicom_file.pixel_array.shape}\n{"-" * 70}\n\n{dicom_file.pixel_array}')

fig = plt.figure(figsize=(6, 6), dpi=100)

plt.imshow(dicom_file.pixel_array, cmap=plt.cm.bone)

plt.tick_params(axis='x', labelsize=12)
plt.tick_params(axis='y', labelsize=12)
plt.title(f'ID00228637202259965313869/1.dcm', size=15, pad=15)

plt.show()

As the images are acquired at Week 0 in one CT scan, a single slice by itself isn't meaningful. All of the slices make up the 3D volume, so they have to be analyzed together. The `load_scan` function loads every slice in the given patient directory, sorts the slices by file number and stacks them up. The output is a 3D volume with the shape of `(n_slices, 512, 512)`; the function assumes that every slice is 512x512.

In [None]:
def load_scan(patient_name):
    
    patient_directory = sorted(os.listdir(f'../input/osic-pulmonary-fibrosis-progression/train/{patient_name}'), key=(lambda f: int(f.split('.')[0])))
    volume = np.zeros((len(patient_directory), 512, 512))

    for i, img in enumerate(patient_directory):
        img_slice = pydicom.dcmread(f'../input/osic-pulmonary-fibrosis-progression/train/{patient_name}/{img}')
        volume[i] = img_slice.pixel_array
            
    return volume

patient = 'ID00228637202259965313869'
patient_scan = load_scan(patient)
print(f'Patient {patient} CT scan is loaded - Volume Shape: {patient_scan.shape}')

The animation below shows the slices in file-number order. The slices are taken from bottom to top or from top to bottom while the patient is holding their breath. Because each slice cuts through the chest at a different height, the visible lung area changes between slices: it is small at the bottom and at the top, and larger in the middle.

In [None]:
%%capture

fig = plt.figure(figsize=(7, 7))

ims = []
for i in patient_scan:
    im = plt.imshow(i, animated=True, cmap=plt.cm.bone)
    plt.axis('off')
    ims.append([im])

ani = animation.ArtistAnimation(fig, ims, interval=25, blit=False, repeat_delay=1000)

In [None]:
HTML(ani.to_html5_video())

## **6. Image Data**

In [None]:
# To Be Continued
