<center>
<div class="alert alert-block alert-info">
    <h1>OSIC Pulmonary Fibrosis Progression</h1>
    <h3>Predict lung function decline</h3>
</div></center>

# <div class="alert alert-block alert-info">1. <a id='Introduction'>Introduction</a></div>

###  1.1 Pulmonary fibrosis

[The word "**pulmonary**" means lung and the word "**fibrosis**" means scar tissue— similar to scars that you may have on your skin from an old injury or surgery.](https://www.pulmonaryfibrosis.org/life-with-pf/about-pf) So, in its simplest sense, pulmonary fibrosis (PF) means scarring in the lungs. Over time, the scar tissue can destroy the normal lung and make it hard for oxygen to get into your blood. Low oxygen levels (and the stiff scar tissue itself) can cause you to feel short of breath, particularly when walking and exercising.

<img src='https://www.pulmonaryfibrosis.org/images/default-source/default-album/normal-and-impaired-gas-exchange.png?sfvrsn=c3b0918d_0' />

Image Credits:- https://www.pulmonaryfibrosis.org/


* **Prognosis** - Prognosis is a term for the predicted course of a disease. People commonly use the word to refer to an individual’s life expectancy, how long the person is likely to live. However, prognosis can also refer to the chance that a disease can be cured and the outlook for functional recovery, which includes the prospects of being able to return to work, engage in recreation, as well as the expected degree of help that will be necessary to accomplish activities of daily living.

* **Forced vital capacity (FVC)** - FVC is the amount of air that can be forcibly exhaled from your lungs after taking the deepest breath possible, as measured by spirometry. FVC can also help doctors assess the progression of lung disease and evaluate the effectiveness of treatment.

A patient's FVC can be compared with the standard FVC for people of similar age, sex, height, and weight. It can also be compared with the patient's own previous FVC values, if available, to determine whether the pulmonary condition is progressing or whether lung function is improving under treatment. FVC may also be expressed as a percentage of the predicted FVC.

The normal FVC range for an adult is between 3.0 and 5.0 L.


###  1.2 Objective of Competition

The aim of this competition is to **predict a patient’s severity of decline in lung function** based on a CT scan of their lungs. Lung function is assessed based on output from a spirometer, which measures the **forced vital capacity (FVC)**, i.e. the volume of air exhaled. The challenge is to use machine learning techniques to make a prediction with the image, metadata, and baseline FVC as input.

###  1.3 Evaluation of Competition

This competition is evaluated on a modified version of the **Laplace Log Likelihood**. In medical applications, it is useful to evaluate a model's confidence in its decisions. Accordingly, the metric is designed to reflect both the accuracy and certainty of each prediction.

<img src='https://www.vosesoftware.com/riskwiki/images/image15_632.gif' />

Image Credits:- https://www.vosesoftware.com/riskwiki/Laplacedistribution.php

For each true FVC measurement, you will predict both an **FVC** and a **confidence measure** (standard deviation σ). The metric is computed as:


**Confidence values smaller than 70 are clipped.**

$$ \large \sigma_{clipped} = max(\sigma, 70) $$


**Errors greater than 1000 are also clipped in order to avoid large errors.**

$$ \large \Delta = min ( |FVC_{true} - FVC_{predicted}|, 1000 ) $$


**The metric is defined as:**

$$ \Large metric = -   \frac{\sqrt{2} \Delta}{\sigma_{clipped}} - \ln ( \sqrt{2} \sigma_{clipped} ) $$
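
The three formulas above can be sketched in NumPy as follows. This is a minimal illustration, not the official scoring code: the function name `laplace_log_likelihood` is hypothetical, and averaging the per-prediction metric over all rows is an assumption, since the text above defines the metric for a single prediction only.

```python
import numpy as np

def laplace_log_likelihood(fvc_true, fvc_pred, sigma):
    """Mean modified Laplace log likelihood (higher is better)."""
    fvc_true = np.asarray(fvc_true, dtype=float)
    fvc_pred = np.asarray(fvc_pred, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    # Confidence values smaller than 70 are clipped
    sigma_clipped = np.maximum(sigma, 70.0)
    # Absolute errors greater than 1000 are clipped
    delta = np.minimum(np.abs(fvc_true - fvc_pred), 1000.0)

    metric = -np.sqrt(2.0) * delta / sigma_clipped - np.log(np.sqrt(2.0) * sigma_clipped)
    # Assumption: the per-prediction values are averaged into one score
    return float(np.mean(metric))

# Hypothetical values, for illustration only
score = laplace_log_likelihood([3000, 2500], [2900, 2600], [100, 50])
print(score)
```

With a perfect prediction and σ = 70, the metric equals -ln(√2 · 70) ≈ -4.595, which is the best value it can reach. Reporting a smaller σ does not help because of the clipping.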


The leaderboard is calculated with approximately 15% of the test data. The final results will be based on the other 85%, so the final standings may be different.



## Contents

* [Introduction](#Introduction)
* [Importing libraries](#libraries)
* [Load Data](#dataLoad)
* [Exploratory Data Analysis](#EDA)
* [Visualising Images : DICOM](#ImageVisuals)


# <div class="alert alert-block alert-info">2. <a id='libraries'>Importing libraries</a></div>

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import random
import math
import matplotlib
from termcolor import colored
import os
from os import listdir
from os.path import join, getsize
import glob
import cv2

#plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import iplot
import cufflinks
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')

# 
from skimage import measure
from skimage.morphology import disk, opening, closing

import tensorflow as tf
from tensorflow.keras import Model
import tensorflow.keras.backend as K
import tensorflow.keras.layers as L
import tensorflow.keras.models as M
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split, KFold

from tensorflow.keras.layers import (
    Dense, Dropout, Activation, Flatten, Input, GlobalAveragePooling2D, Add, Conv2D, AveragePooling2D, 
    LeakyReLU, Concatenate 
)


# Magic function to display plots inline in the notebook
%matplotlib inline

# Setting seaborn style
sns.set(style='darkgrid', palette='Set2')

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')

# Settings for pretty nice plots
plt.style.use('fivethirtyeight')
plt.show()

# pydicom
import pydicom

# Print versions of libraries
print(f"Numpy version : Numpy {np.__version__}")
print(f"Pandas version : Pandas {pd.__version__}")
print(f"Matplotlib version : Matplotlib {matplotlib.__version__}")
print(f"Seaborn version : Seaborn {sns.__version__}")
print(f"Tensorflow version : Tensorflow {tf.__version__}")

In [None]:
# Install the EfficientNet Keras Library
!pip install ../input/kerasapplications/keras-team-keras-applications-3b180cb -f ./ --no-index
!pip install ../input/efficientnet/efficientnet-1.1.0/ -f ./ --no-index

In [None]:
import efficientnet.tfkeras as efn

## Set random seeds for reproducibility

In [None]:
def seed_everything(seed=100):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    
seed_everything(101)

# <div class="alert alert-block alert-info">2. <a id='dataLoad'>Load Data</a></div>

In [None]:
# List files available
base_dir = "../input/osic-pulmonary-fibrosis-progression/"

In [None]:
list(os.listdir(base_dir))

### Dataset Details 
* train.csv - the training set, contains full history of clinical information
* test.csv - the test set, contains only the baseline measurement
* train/ - contains the training patients' baseline CT scan in DICOM format
* test/ - contains the test patients' baseline CT scan in DICOM format
* sample_submission.csv - demonstrates the submission format

In [None]:
# Train & Test set shape
train_df = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/train.csv', encoding = 'latin-1')
test_df = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/test.csv', encoding = 'latin-1')
submission_df = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/sample_submission.csv', encoding = 'latin-1')

print(colored('Training data set shape.......... : ','yellow'),train_df.shape)
print(colored('Test data set shape...............: ','red'),test_df.shape)
print(colored('Submission data set shape.........: ','blue'),submission_df.shape)

In [None]:
# print top 5 rows of train set
train_df.head()

In [None]:
# print top 5 rows of test set
test_df.head()

### Columns Details in train.csv and test.csv
* **Patient** - a unique Id for each patient (also the name of the patient's DICOM folder)
* **Weeks** - the relative number of weeks pre/post the baseline CT (may be negative)
* **FVC** - the recorded lung capacity in ml
* **Percent** - a computed field which approximates the patient's FVC as a percent of the typical FVC for a person of similar characteristics
* **Age** 
* **Sex** 
* **SmokingStatus** 
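
The relationship between **FVC** and **Percent** can be made concrete with a short sketch. It assumes that Percent is exactly `100 * FVC / typical FVC`, whereas the description above only says the field approximates it. The rows in `demo` are made up for illustration and are not taken from train.csv.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from train.csv
demo = pd.DataFrame({"FVC": [2000, 3000], "Percent": [50.0, 75.0]})

# Percent = 100 * FVC / typical FVC  =>  typical FVC = 100 * FVC / Percent
demo["TypicalFVC"] = 100.0 * demo["FVC"] / demo["Percent"]
print(demo)
```

Under that assumption, both made-up patients have the same implied typical FVC even though their measured FVC differs, which is why Percent can be easier to compare across patients than raw FVC.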

# <div class="alert alert-block alert-info">3. <a id='EDA'>Exploratory Data Analysis</a></div>

The purpose of exploratory data analysis is to: Check for missing data and other mistakes. Gain maximum insight into the data set and its underlying structure. Uncover a parsimonious model, one which explains the data with a minimum number of predictor variables.

## Concise Summary

In [None]:
# Null values and Data types
print(colored('Train Set !!', 'yellow'))
print(colored('------------', 'yellow'))
print(train_df.info())

print('\n')

print(colored('Test Set !!','red'))
print(colored('-----------','red'))
print(test_df.info())

There are no missing values in train_df or test_df.

## Descriptive Statistics

In [None]:
# Descriptive statistics of the numeric columns
print(colored('Train Set !!', 'yellow'))
print(train_df.describe())

* Patient age ranges from 49 to 88 years, with a mean of 67 years and a standard deviation of 7 years.
* The normal FVC range for an adult is between 3000 ml and 5000 ml. In this dataset the mean FVC is about 2690 ml and the maximum is 6399 ml.
* FVC is measured as early as week -5 and as late as week 133.

## Missing Values

In [None]:
# Total missing values for each feature
print(colored('Missing values in Train Set !!', 'yellow'))
print(train_df.isnull().sum())

print("\n")

print(colored('Missing values in Test Set !!', 'red'))
print(test_df.isnull().sum())

In [None]:
train_df.groupby( ['Sex','SmokingStatus'] )['FVC'].agg( ['mean','std','count'] ).sort_values(by=['Sex','count'],ascending=False)

This is surprising: FVC and Percent are highest for people who still smoke and lowest for people who never smoked. However, the share of patients who still smoke is very low, so we cannot conclude that a person who smokes is likely to have a high FVC.
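
One way to check how small a group is would be to count each patient once and look at the share of every smoking status. The sketch below uses made-up records (the `records` frame and its values are purely illustrative), so only the pattern carries over to `train_df`.

```python
import pandas as pd

# Made-up records for illustration only (two visits per patient)
records = pd.DataFrame({
    "Patient": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "SmokingStatus": ["Ex-smoker"] * 4 + ["Never smoked"] * 2 + ["Currently smokes"] * 2,
    "FVC": [2500, 2400, 2700, 2650, 2300, 2250, 3100, 3000],
})

# Count each patient once so that patients with many visits are not over-weighted
patients = records.drop_duplicates("Patient")
group_share = patients["SmokingStatus"].value_counts(normalize=True)
print(group_share)
```

Group means computed from only a handful of patients are sensitive to individual outliers, so the share per group is worth reading alongside the `mean` and `std` columns.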

There are no missing values in either the train or test dataset.

## Patients Counts

In [None]:
# Total number of records and unique patients in the train and test sets
# (count() counts rows, so a patient with several visits is counted several times)

print(colored("Total records in Train set.... : ", 'yellow'),train_df['Patient'].count())
print(colored("Total records in Test set..... : ", 'red'),test_df['Patient'].count())
print("\n")
print(colored("Unique Patients in Train set...: ", 'yellow'),train_df['Patient'].nunique())
print(colored("Unique Patients in Test set....: ", 'red'),test_df['Patient'].nunique())

In [None]:
print(colored("Few most repeated Patients in Train set: ", 'yellow'))
print(train_df['Patient'].value_counts().head())

print("\n")

print(colored("Few most repeated Patients in Test set: ", 'red'))
print(test_df['Patient'].value_counts().head())

## Unique patients

Let's create a new data set having only unique patient details.

In [None]:
# drop=True avoids keeping the old row index as an extra 'index' column
train_df_unique = train_df[['Patient', 'Age', 'Sex', 'SmokingStatus']].drop_duplicates().reset_index(drop=True)
print(colored("Shape of unique patient data set : ",'yellow'),train_df_unique.shape)
train_df_unique.head()

## Frequency of a patient in Train set

Let's count how many times each patient appears in the train set.

In [None]:
# Number of records per patient
patient_freq = train_df.groupby(['Patient'])['Patient'].count()
patient_freq = pd.DataFrame({'Patient':patient_freq.index, 'Frequency':patient_freq.values})

# Merge two dataframes based on patient's ids.
train_df_unique = pd.merge(train_df_unique,patient_freq,how='inner',on='Patient')

In [None]:
train_df_unique.sort_values(by='Frequency', ascending=False).head()

In [None]:
fig = px.bar(train_df_unique, x='Patient',y ='Frequency',color='Frequency')
fig.update_layout(xaxis={'categoryorder':'total ascending'},title='Frequency of each patient')
fig.update_xaxes(showticklabels=False)
fig.show()

Every patient is observed between 6 and 10 times; most of them are observed 9 times.
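
A statement like this can be checked numerically by counting how many patients share each visit count. The sketch below uses made-up patient IDs (`visits` is purely illustrative); on the real data the same two calls would be applied to `train_df['Patient']`.

```python
import pandas as pd

# Made-up patient IDs for illustration only
visits = pd.Series(["P1"] * 9 + ["P2"] * 9 + ["P3"] * 6, name="Patient")

# Inner value_counts: visits per patient
visits_per_patient = visits.value_counts()
# Outer value_counts: how many patients share each visit count
count_distribution = visits_per_patient.value_counts().sort_index()
print(count_distribution)
```

The index of `count_distribution` is the number of visits and the values are the number of patients with that many visits.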

## Number of CT Scans for each patient in Train set

We are provided with one baseline chest CT scan for each patient, stored as a folder of DICOM image files. Let's count how many image files (referred to below as CT scans) each patient has.

In [None]:
# Creating unique patient lists 
# (here patient == directory and files == CT Scan)
train_dir = '../input/osic-pulmonary-fibrosis-progression/train/'

patient_ids = os.listdir(train_dir)
patient_ids = sorted(patient_ids)

#Creating a new blank dataframe
CtScan = pd.DataFrame(columns=['Patient','CtScanCount'])


for patient_id in patient_ids:
    # count number of images in each folder
    cnt = len(os.listdir(train_dir + patient_id))
    # insert patient id and ct scan count in dataframe
    CtScan.loc[len(CtScan)] = [patient_id,cnt]
    

# Merge two dataframes based on patient's ids.
patient_df = pd.merge(train_df_unique,CtScan,how='inner',on='Patient')

# Reset index
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html
patient_df = patient_df.reset_index(drop=True)

# Print new dataframe
patient_df.head()


In [None]:
print(colored("CT Scans numbers in Train set ","yellow"))
print(colored("Maximum number of CT Scans for a patient.... : ","blue"),patient_df['CtScanCount'].max())
print(colored("Minimum number of CT Scans for a patient.... : ","blue"),patient_df['CtScanCount'].min())
print(colored("Average number of CT Scans per patient...... : ","blue"),round(patient_df['CtScanCount'].mean(),3))
print(colored("Total number of CT Scans of all patients.... : ","blue"),patient_df['CtScanCount'].sum())
print(colored("Median of CT Scans counts................... : ","blue"),patient_df['CtScanCount'].median())

Huge imbalance in the number of CT scans: half of the patients have fewer than 100 CT scans.

## Number of CT Scans for each patient in Test set

In [None]:
# Creating unique patient lists 
# (here patient == directory and files == CT Scan)
test_dir = '../input/osic-pulmonary-fibrosis-progression/test/'

test_patient_ids = os.listdir(test_dir)
test_patient_ids = sorted(test_patient_ids)

#Creating a new blank dataframe
TestCtScan = pd.DataFrame(columns=['Patient','CtScanCount'])

for patient_id in test_patient_ids:
    # count number of images in each folder
    cnt = len(os.listdir(test_dir + patient_id))
    # insert patient id and ct scan count in dataframe
    TestCtScan.loc[len(TestCtScan)] = [patient_id,cnt]
    

# Merge two dataframes based on patient's ids.
test_patient_df = pd.merge(test_df,TestCtScan,how='inner',on='Patient').reset_index(drop=True)

# Print new dataframe
test_patient_df.head()

In [None]:
print(colored("CT Scans numbers in Test set ","red"))
print(colored("Maximum number of CT Scans for a patient... : ","green"),test_patient_df['CtScanCount'].max())
print(colored("Minimum number of CT Scans for a patient... : ","green"),test_patient_df['CtScanCount'].min())
print(colored("Average number of CT Scans per patient..... : ","green"),test_patient_df['CtScanCount'].mean())
print(colored("Total number of CT Scans of all patients... : ","green"),test_patient_df['CtScanCount'].sum())

## Distribution of weeks

In [None]:
train_df['Weeks'].iplot(kind='hist',
                        bins=100, xTitle='Weeks', yTitle='Frequency', 
                        linecolor='white',opacity=0.7,
                        color='rgb(0, 200, 200)', theme='pearl',
                        bargap=0.01, title='Distribution of Weeks')

Most FVC measurements were taken between weeks 4 and 20.

## Distribution of Patients age

In [None]:
patient_df['Age'].iplot(kind='hist',
                        bins=10, xTitle='Age', yTitle='Frequency', 
                        linecolor='white',opacity=0.7,
                        color='rgb(0, 100, 200)', theme='pearl',
                        bargap=0.01, title='Distribution of Age column')

Patient ages range from 49 to 88 years, with most records in the 64-74 age range.

### Distribution of Patient gender

In [None]:
print(colored("Gender wise distribution of patients :","blue"))
print(patient_df['Sex'].value_counts())

In [None]:
# Take the labels from the value_counts index so each label matches its count
sex_count = patient_df["Sex"].value_counts()
sex_labels = sex_count.index

fig = px.pie(values=sex_count.values, names=sex_labels, hover_name=sex_labels)
fig.show()

There are more male patients than female patients.

### Distribution of Age vs Gender

In [None]:
plt.figure(figsize=(16, 6))

sns.kdeplot(patient_df[patient_df['Sex'] == 'Male']['Age'], label = 'Male',shade=True)
sns.kdeplot(patient_df[patient_df['Sex'] == 'Female']['Age'], label = 'Female',shade=True)

plt.xlabel('Age (years)'); 
plt.ylabel('Density'); 
plt.title('Distribution of Ages');

Male and female records are distributed across almost the entire age range.

### Distribution of 'SmokingStatus' feature

In [None]:
print(colored('Total Smoking counts', 'red'))
print(patient_df['SmokingStatus'].value_counts())

print("\n")
print(colored("Male Smoking counts",'blue'))
print(patient_df[patient_df['Sex']=='Male']['SmokingStatus'].value_counts())

print("\n")
print(colored("Female Smoking counts",'green'))
print(patient_df[patient_df['Sex']=='Female']['SmokingStatus'].value_counts())

### Distribution of Age vs SmokingStatus

In [None]:
plt.figure(figsize=(16, 6))

sns.kdeplot(patient_df.loc[patient_df['SmokingStatus'] == 'Ex-smoker', 'Age'], label = 'Ex-smoker',shade=True)
sns.kdeplot(patient_df.loc[patient_df['SmokingStatus'] == 'Never smoked', 'Age'], label = 'Never smoked',shade=True)
sns.kdeplot(patient_df.loc[patient_df['SmokingStatus'] == 'Currently smokes', 'Age'], label = 'Currently smokes', shade=True)

# Labeling of plot
plt.xlabel('Age (years)'); 
plt.ylabel('Density'); 
plt.title('Distribution of Ages');

### Gender wise smoking distribution

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x='SmokingStatus', data=patient_df, hue='Sex')
plt.title('Gender split by SmokingStatus', fontsize=16)
plt.show()

Among patients who have never smoked, males and females are almost equally represented, whereas the majority of ex-smokers are male.

## FVC - The forced vital capacity

Lung function is assessed based on output from a spirometer, which measures the **forced vital capacity (FVC)**, i.e. the volume of air exhaled. FVC can also help doctors assess the progression of lung disease and evaluate the effectiveness of treatment.

Patients diagnosed with obstructive lung diseases such as asthma and chronic obstructive pulmonary disease (COPD) have lower FVC results than healthy people. A decrease in the FVC value may mean the lung disease is getting worse.

* Average normal values in healthy males aged 20-60 range from 5.5 to 4.75 liters.
* Average normal values for females aged 20-60 range from 3.75 to 3.25 liters. 
* Percent - a computed field which approximates the patient's FVC as a percent of the typical FVC for a person of similar characteristics. Normal test values fall between 80% and 120% of the average values.

Reference : https://www.nuvoair.com/blog/do-you-know-how-to-interpret-the-results-of-your-spirometry-test

In [None]:
print(colored("Maximum value of FVC... :",'blue'),colored(train_df['FVC'].max(),'blue'))
print(colored("Minimum value of FVC... :",'green'),colored(train_df['FVC'].min(),'green'))

print("\n")

# Distribution of FVC
print(colored("Distribution of FVC","yellow"))
print(colored(train_df['FVC'].value_counts(normalize=False, ascending=False, bins=62).head(),"yellow"))

### FVC Distribution

In [None]:
train_df['FVC'].iplot(kind='hist',
                      xTitle='Lung Capacity(ml)', 
                      yTitle='Frequency', 
                      linecolor='black', 
                      bargap=0.2,
                      title='Distribution of the FVC in the training set')

### FVC vs Smoking Status

In [None]:
fig = px.violin(train_df, y='FVC', x='SmokingStatus', 
                box=True, color='Sex', points="all", hover_data=train_df.columns, title="FVC of various Smoking Status")
fig.show()

### FVC vs Age

In [None]:
fig = px.scatter(train_df, x="Age", y="FVC", color='Sex', title='FVC values for Patient Age')
fig.show()

Males have higher FVC than females irrespective of age.

In [None]:
train_df[train_df['FVC'] > 5000].sort_values(by='FVC', ascending=False)

### FVC vs Week

In [None]:
fig = px.scatter(train_df, x="Weeks", y="FVC", color='SmokingStatus')
fig.show()

Most FVC tests are performed between weeks 0 and 20. Ex-smoker males also appear to have higher FVC than the other groups.

### FVC of oldest and youngest person

In [None]:
# patient = train_df[train_df['FVC'] == train_df['FVC'].max()]
patient = train_df[(train_df['Age'] == train_df['Age'].max()) | (train_df['Age'] == train_df['Age'].min())]
fig = px.line(patient, x="Weeks", y="FVC", color='Age',line_group="Sex", hover_name="SmokingStatus")
fig.show()

Aging is associated with a progressive decline in lung function: in the plot above, the FVC of the oldest patient is lower than that of the youngest patient.

## Percent
Percent approximates the patient's FVC as a percent of the typical FVC for a person of similar characteristics. Normal test values fall between 80% and 120% of the average values.

In [None]:
print(colored("Maximum value of Percent... :",'blue'),colored(train_df['Percent'].max(),'blue'))
print(colored("Minimum value of Percent... :",'green'),colored(train_df['Percent'].min(),'green'))

print("\n")

# Distribution of Percent
print(colored("Distribution of Percent","yellow"))
print(colored(train_df['Percent'].value_counts(normalize=False, ascending=False, bins=62).head(),"yellow"))

### Percent Distribution

In [None]:
train_df['Percent'].iplot(kind='hist',
                      xTitle='Percent', 
                      yTitle='Frequency', 
                      linecolor='black', 
                      bargap=0.2,
                      title='Distribution of Percent in the training set')

### Percent vs SmokingStatus

In [None]:
fig = px.violin(train_df, y='Percent', x='SmokingStatus', 
                box=True, color='Sex', points="all", hover_data=train_df.columns, title="Percent of various Smoking Status")
fig.show()

### Percent of oldest and youngest person

In [None]:
patient = train_df[(train_df['Age'] == train_df['Age'].max()) | (train_df['Age'] == train_df['Age'].min())]
fig = px.line(patient, x="Weeks", y="Percent", color='Age',line_group="Sex", hover_name="SmokingStatus")

fig.show()

### Percent vs Age 

In [None]:
fig = px.scatter(train_df, x="Age", y="Percent", color="SmokingStatus", marginal_y="violin",
           marginal_x="box", trendline="ols", template="simple_white")
fig.show()

### FVC vs Percent

In [None]:
fig = px.scatter(train_df, x="FVC", y="Percent", color='SmokingStatus', size='Age', 
                 hover_name='SmokingStatus',hover_data=['Weeks'])
fig.show()

## Correlation among various features

In [None]:
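# Restrict to numeric columns; corr() on text columns fails in recent pandas versions
corrmat = train_df.select_dtypes(include='number').corr()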
fig = px.imshow(corrmat, x=corrmat.columns, y=corrmat.columns)
fig.update_xaxes(side="top")
fig.show()

* FVC and Percent are highly correlated: when the volume of exhaled air increases, Percent increases as well.

* The linear correlation between FVC/Percent and Age is weak, so Age alone is a poor linear predictor of the volume of exhaled air in this dataset.

# <div class="alert alert-block alert-info">4. <a id='ImageVisuals'>Visualising DICOM Images</a></div>

### Digital Imaging and COmmunications in Medicine - DICOM

DICOM (Digital Imaging and COmmunications in Medicine) is the de facto standard that establishes rules allowing medical images (X-ray, MRI, CT) and associated information to be exchanged between imaging equipment from different vendors, computers, and hospitals.

DICOM files typically have a .dcm extension and provide a means of storing data in separate 'tags', such as patient information and image/pixel data. A DICOM file consists of a header and image data sets packed into a single file. The information within the header is organized as a constant and standardized series of tags.

By extracting data from these tags one can access important information regarding the patient demographics, study parameters, etc.

### Pydicom

Pydicom is a Python package for parsing DICOM files. It makes it easy to convert DICOM files into Pythonic structures for easier manipulation. Files are opened using pydicom.dcmread.

## Patients & their CT Scans in Training Images Folder

In [None]:
## Patients & their CT Scans in Training Images Folder

file_len = folder_len = 0
files = []

for dirpath, dirnames, filenames in os.walk(train_dir):
    folder_len += len(dirnames)
    # The root folder holds only patient sub-folders, so skip entries without files
    if filenames:
        file_len += len(filenames)
        files.append(len(filenames))

print("Training folder contains", f'{file_len:,}', "CT scan images for all patients.") 
print('Training folder contains', f'{folder_len:,}', "unique patients.")

print("\n")

print('Average number of CT scan images per patient:', f'{round(np.mean(files)):,}')
print('Maximum images per patient', f'{round(np.max(files)):,}')
print('Minimum images per patient', f'{round(np.min(files)):,}')

## Extracting DICOM files information in a dataframe

In [None]:
# https://www.kaggle.com/schlerp/getting-to-know-dicom-and-the-data

def show_dcm_info(file_path):
    #print(colored("Filename.........:",'yellow'),file_path)
    #print()
    print(colored("File Path...........:",'blue'), file_path)
    
    dataset = pydicom.dcmread(file_path)

    pat_name = dataset.PatientName
    display_name = pat_name.family_name + ", " + pat_name.given_name
    
    print(colored("Patient's name......:",'blue'), display_name)
    print(colored("Patient id..........:",'blue'), dataset.PatientID)
    print(colored("Patient's Sex.......:",'blue'), dataset.PatientSex)
    print(colored("Modality............:",'blue'), dataset.Modality)
    print(colored("Body Part Examined..:",'blue'), dataset.BodyPartExamined)
    
    if 'PixelData' in dataset:
        rows = int(dataset.Rows)
        cols = int(dataset.Columns)
        print(colored("Image size..........:",'blue')," {rows:d} x {cols:d}, {size:d} bytes".format(
            rows=rows, cols=cols, size=len(dataset.PixelData)))
        if 'PixelSpacing' in dataset:
            print(colored("Pixel spacing.......:",'blue'),dataset.PixelSpacing)
            dataset.PixelSpacing = [1, 1]
        plt.figure(figsize=(10, 10))
        plt.imshow(dataset.pixel_array, cmap='gray')
        plt.show()

In [None]:
for file_path in glob.glob(train_dir + '*/*.dcm'):
    show_dcm_info(file_path)
    break # Comment this out to see all

In [None]:
show_dcm_info(train_dir + 'ID00027637202179689871102/11.dcm')

In [None]:
patient_dir = train_dir + "ID00123637202217151272140"

print("total images for patient ID00123637202217151272140: ", len(os.listdir(patient_dir)))

# view first (columns*rows) images in order
fig=plt.figure(figsize=(16, 16))
columns = 4
rows = 5
imglist = os.listdir(patient_dir)
for i in range(1, columns*rows +1):
    filename = patient_dir + "/" + str(i) + ".dcm"
    ds = pydicom.dcmread(filename)
    fig.add_subplot(rows, columns, i)
    plt.imshow(ds.pixel_array, cmap='gray')
plt.show()

In [None]:
# view first (columns*rows) images in order
fig=plt.figure(figsize=(16, 16))
columns = 4
rows = 5
imglist = os.listdir(patient_dir)
for i in range(1, columns*rows +1):
    filename = patient_dir + "/" + str(i) + ".dcm"
    ds = pydicom.dcmread(filename)
    fig.add_subplot(rows, columns, i)
    plt.imshow(ds.pixel_array, cmap='jet')
    #plt.imshow(cv2.cvtColor(ds.pixel_array, cv2.COLOR_BGR2RGB))
plt.show()

### Loading DICOM files

Dicom files contain a lot of metadata (such as the pixel size, so how long one pixel is in every dimension in the real world).

This pixel size/coarseness of the scan differs from scan to scan (e.g. the distance between slices may differ), which can hurt performance of CNN approaches. 

Below is code to load a scan, which consists of multiple slices, which we simply save in a Python list. Every folder in the dataset is one scan (so one patient). One metadata field is missing, the pixel size in the Z direction, which is the slice thickness. Fortunately we can infer this, and we add this to the metadata.
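
The inference step can be illustrated in isolation. The sketch below is a variation, not the notebook's own method: it takes the median gap between all sorted z positions instead of the gap between the first two slices, which is less sensitive to a single irregular slice. The helper name `infer_slice_spacing` and the z coordinates are hypothetical; in a real scan the z values would come from each slice's `ImagePositionPatient[2]`.

```python
import numpy as np

def infer_slice_spacing(z_positions):
    """Estimate slice spacing (mm) as the median gap between sorted z positions."""
    z = np.sort(np.asarray(z_positions, dtype=float))
    if z.size < 2:
        raise ValueError("need at least two slices to infer spacing")
    return float(np.median(np.diff(z)))

# Hypothetical z coordinates (mm), deliberately out of order
spacing = infer_slice_spacing([-10.0, -15.0, -5.0, 0.0])
print(spacing)
```

Sorting first matters because the files in a folder are not guaranteed to be listed in anatomical order.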

In [None]:
# Ref : 
# https://www.kaggle.com/gzuidhof/full-preprocessing-tutorial
# https://www.kaggle.com/akh64bit/full-preprocessing-tutorial
# https://www.researchgate.net/post/How_can_I_convert_pixel_intensity_values_to_housefield_CT_number

# Load the scans in given folder path
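def load_scan(path):
    # Read every DICOM slice in the patient's folder
    slices = [pydicom.dcmread(path + '/' + s) for s in os.listdir(path)]
    # Order the slices along the z axis
    slices.sort(key = lambda x: float(x.ImagePositionPatient[2]))
    try:
        slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
    except Exception:
        # Fall back to SliceLocation when the position difference cannot be computed
        slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)
        
    # Store the inferred thickness in the metadata of every slice
    for s in slices:
        s.SliceThickness = slice_thickness
        
    return slices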

The unit of measurement in CT scans is the Hounsfield Unit (HU), which is a measure of radiodensity. CT scanners are carefully calibrated to accurately measure this. From Wikipedia:

<img src="http://i.imgur.com/4rlyReh.png" />

By default however, the returned values are not in this unit. Let's fix this.

Some scanners have cylindrical scanning bounds, but the output image is square. The pixels that fall outside of these bounds get the fixed value -2000. The first step is setting these values to 0, which currently corresponds to air. Next, let's go back to HU units, by multiplying with the rescale slope and adding the intercept (which are conveniently stored in the metadata of the scans!).

In [None]:
def get_pixels_hu(slices):
    image = np.stack([s.pixel_array for s in slices])
    # Convert to int16 (from sometimes uint16), 
    # should be possible as values should always be low enough (<32k)
    image = image.astype(np.int16)

    # Set outside-of-scan pixels to 0
    # The intercept is usually -1024, so air is approximately 0
    image[image == -2000] = 0
    
    # Convert to Hounsfield units (HU)
    for slice_number in range(len(slices)):
        
        intercept = slices[slice_number].RescaleIntercept
        slope = slices[slice_number].RescaleSlope
        
        if slope != 1:
            image[slice_number] = slope * image[slice_number].astype(np.float64)
            image[slice_number] = image[slice_number].astype(np.int16)
            
        image[slice_number] += np.int16(intercept)
    
    return np.array(image, dtype=np.int16)

Let's take a look at one of the patients.

In [None]:
first_patient = load_scan(train_dir + patient_ids[0])
first_patient_pixels = get_pixels_hu(first_patient)

plt.figure(figsize=(10, 10))
plt.hist(first_patient_pixels.flatten(), bins=80, color='c')
plt.xlabel("Hounsfield Units (HU)")
plt.ylabel("Frequency")
plt.show()

# Show the slice in the middle of the scan
plt.figure(figsize=(10, 10))
plt.imshow(first_patient_pixels[len(first_patient_pixels) // 2], cmap=plt.cm.gray)
plt.show()

Looking at the table from Wikipedia and this histogram, we can clearly see which pixels are air and which are tissue. We will use this for lung segmentation in a bit 

Let's take a look at the first dicom file of a patient:

In [None]:
first_patient_scan = load_scan(train_dir + patient_ids[0])

In [None]:
first_patient_scan[0]

### Visualization using gif

In [None]:
def set_lungwin(img, hu=[-1200., 600.]):
    lungwin = np.array(hu)
    newimg = (img-lungwin[0]) / (lungwin[1]-lungwin[0])
    newimg[newimg < 0] = 0
    newimg[newimg > 1] = 1
    newimg = (newimg * 255).astype('uint8')
    return newimg

In [None]:
first_patient_scan_array = set_lungwin(get_pixels_hu(first_patient_scan))

In [None]:
import imageio
from IPython.display import Image

imageio.mimsave("/tmp/gif.gif", first_patient_scan_array, duration=0.00001)
Image(filename="/tmp/gif.gif", format='png')

### Transforming to Hounsfield Units 
Before starting, let's plot the pixelarray distribution of some dicom files to get an impression of the raw data:

Ref : https://www.kaggle.com/allunia/pulmonary-fibrosis-dicom-preprocessing

In [None]:
fig, ax = plt.subplots(1,2,figsize=(20,5))
for n in range(10):
    image = first_patient_scan[n].pixel_array.flatten()
    rescaled_image = image * first_patient_scan[n].RescaleSlope + first_patient_scan[n].RescaleIntercept
    sns.distplot(image.flatten(), ax=ax[0]);
    sns.distplot(rescaled_image.flatten(), ax=ax[1])
ax[0].set_title("Raw pixel array distributions for 10 examples")
ax[1].set_title("HU unit distributions for 10 examples");

Some raw values sit at -2000. They come from images that have a circular boundary within the image: pixels outside this circle are often set to -2000 by default (in other competitions I have also seen -3000).
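
One common way to handle this padding is to overwrite it before the rescale step. The sketch below is only an illustration on a synthetic slice: the helper name `replace_padding` and the fill value of 0 are assumptions, not necessarily what this notebook's `get_pixels_hu` does.

```python
import numpy as np

def replace_padding(raw, padding_value=-2000, fill_value=0):
    """Return a copy of `raw` with out-of-scan padding replaced by `fill_value`."""
    out = np.array(raw, copy=True)
    out[out == padding_value] = fill_value
    return out

# Synthetic raw slice: -2000 marks pixels outside the circular field of view
raw_slice = np.array([[-2000, -2000, 50],
                      [-2000, 120, 900]], dtype=np.int16)
cleaned = replace_padding(raw_slice)
print(cleaned)
```

The input array is left untouched, so the raw values remain available for comparison plots.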

In [None]:
fig, ax = plt.subplots(1,4,figsize=(20,3))
ax[0].set_title("Original CT-scan")
ax[0].imshow(first_patient_scan[0].pixel_array, cmap="bone")
ax[1].set_title("Pixelarray distribution");
sns.distplot(first_patient_scan[0].pixel_array.flatten(), ax=ax[1]);

ax[2].set_title("CT-scan in HU")
ax[2].imshow(first_patient_pixels[0], cmap="bone")
ax[3].set_title("HU values distribution");
sns.distplot(first_patient_pixels[0].flatten(), ax=ax[3]);

for m in [0,2]:
    ax[m].grid(False)

The scan of our example patient had a circular boundary, and all raw values per slice are now scaled to Hounsfield units.

## Tissue segmentation 

A scan may have a pixel spacing of [2.5, 0.5, 0.5], which means that the distance between slices is 2.5 millimeters. For a different scan this may be [1.5, 0.725, 0.725]. Such differences can be problematic for automatic analysis (e.g. using ConvNets)!
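
A usual remedy is to resample every scan to a common voxel spacing. The following is a minimal sketch using `scipy.ndimage.zoom` on a synthetic volume. The helper name `resample`, the 1 mm target spacing, and the linear interpolation order are assumptions, and this step is not part of the pipeline in this notebook.

```python
import numpy as np
from scipy import ndimage

def resample(volume, spacing, new_spacing=(1.0, 1.0, 1.0)):
    """Resample a (slices, rows, cols) volume to roughly `new_spacing` in mm."""
    shape = np.array(volume.shape, dtype=np.float64)
    spacing = np.asarray(spacing, dtype=np.float64)
    new_spacing = np.asarray(new_spacing, dtype=np.float64)
    # Target shape must be a whole number of voxels
    new_shape = np.round(shape * spacing / new_spacing)
    zoom_factors = new_shape / shape
    # Spacing actually achieved after rounding the shape
    actual_spacing = spacing / zoom_factors
    resampled = ndimage.zoom(volume, zoom_factors, order=1, mode="nearest")
    return resampled, actual_spacing

# Synthetic volume with the [2.5, 0.5, 0.5] spacing mentioned above
volume = np.zeros((4, 10, 10), dtype=np.float32)
resampled, actual_spacing = resample(volume, spacing=(2.5, 0.5, 0.5))
print(resampled.shape, actual_spacing)
```

Because the target shape is rounded, the achieved spacing can differ slightly from the requested one, which is why the helper returns it.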

In order to reduce the problem space, we can segment the lungs (and usually some tissue around them).

The segmentation involves quite a few smart steps:

* Threshold the image (-320 HU is a good threshold, but it doesn't matter much for this approach).
* Do connected components, determine the label of the air around the person, and fill this with 1s in the binary image.
* Optionally: for every axial slice in the scan, determine the largest solid connected component (the body plus the air around the person), and set the others to 0. This fills the structures in the lungs in the mask.
* Keep only the largest air pocket (the human body has other pockets of air here and there).

With a threshold of -320 HU we separate the lungs (about -700 HU) and air (about -1000 HU) from tissue with values close to water (0 HU).

In [None]:
def segment_lung_mask(image):
    segmented = np.zeros(image.shape)   
    
    for n in range(image.shape[0]):
        binary_image = np.array(image[n] > -320, dtype=np.int8)+1
        labels = measure.label(binary_image)
        
        background_label_1 = labels[0,0]
        background_label_2 = labels[0,-1]
        background_label_3 = labels[-1,0]
        background_label_4 = labels[-1,-1]
    
        #Fill the air around the person
        binary_image[background_label_1 == labels] = 2
        binary_image[background_label_2 == labels] = 2
        binary_image[background_label_3 == labels] = 2
        binary_image[background_label_4 == labels] = 2
    
        #We have a lot of remaining small signals outside of the lungs that need to be removed. 
        #In our competition closing is superior to fill_lungs 
        selem = disk(4)
        binary_image = closing(binary_image, selem)
    
        binary_image -= 1 #Make the image actual binary
        binary_image = 1-binary_image # Invert it, lungs are now 1
        
        segmented[n] = binary_image.copy() * image[n]
    
    return segmented

In [None]:
segmented = segment_lung_mask(np.array([first_patient_pixels[20]]))

fig, ax = plt.subplots(1,2,figsize=(20,10))
ax[0].imshow(first_patient_pixels[20], cmap="Blues_r")
ax[1].imshow(segmented[0], cmap="Blues_r")

In [None]:
segmented_lungs = segment_lung_mask(first_patient_pixels)

In [None]:
segmented_lungs.shape

In [None]:
fig, ax = plt.subplots(6,5, figsize=(20,20))
for n in range(6):
    for m in range(5):
        ax[n,m].imshow(segmented_lungs[n*5+m], cmap="Blues_r")

## Reset the index

In [None]:
train_df.reset_index(inplace = True , drop = True)
patient_df.reset_index(inplace = True , drop = True)
test_df.reset_index(inplace = True , drop = True)

# Model Building

In [None]:
# train_df = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/train.csv', encoding = 'latin-1')
# test_df = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/test.csv', encoding = 'latin-1')
# submission_df = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/sample_submission.csv', encoding = 'latin-1')

In [None]:
img_sub = submission_df[["Patient_Week","FVC","Confidence"]].copy()
print(img_sub.sample(5))

In [None]:
Dropout_model = 0.38559
FVC_weight = 0.2
Confidence_weight = 0.15

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.Session(config=config)

In [None]:
def get_efficientnet(model, shape):
    models_dict = {
        'b0': efn.EfficientNetB0(input_shape=shape,weights=None,include_top=False),
        'b1': efn.EfficientNetB1(input_shape=shape,weights=None,include_top=False),
        'b2': efn.EfficientNetB2(input_shape=shape,weights=None,include_top=False),
        'b3': efn.EfficientNetB3(input_shape=shape,weights=None,include_top=False),
        'b4': efn.EfficientNetB4(input_shape=shape,weights=None,include_top=False),
        'b5': efn.EfficientNetB5(input_shape=shape,weights=None,include_top=False),
        'b6': efn.EfficientNetB6(input_shape=shape,weights=None,include_top=False),
        'b7': efn.EfficientNetB7(input_shape=shape,weights=None,include_top=False)
    }
    return models_dict[model]

In [None]:
def build_model(shape=(512, 512, 1), model_class=None):
    inp = Input(shape=shape)
    base = get_efficientnet(model_class, shape)
    x = base(inp)
    x = GlobalAveragePooling2D()(x)
    inp2 = Input(shape=(5,))  # add the feature of MinFVC
    x2 = tf.keras.layers.GaussianNoise(0.2)(inp2)
    x = Concatenate()([x, x2]) 
    x = Dropout(Dropout_model)(x)
    x = Dense(1)(x)
    model = Model([inp, inp2] , x)
    
#     weights = [w for w in os.listdir('../input/osic-model-weights') if model_class in w][0]
#     model.load_weights('../input/osic-model-weights/' + weights)
    return model

In [None]:
model_classes = ['b5'] #['b0','b1','b2','b3','b4','b5','b6','b7']
models = [build_model(shape=(512, 512, 1), model_class=m) for m in model_classes]
print('Number of models: ' + str(len(models)))

In [None]:
model = models[0]
model.summary()

In [None]:
BATCH_SIZE=128

In [None]:
submission_df['Patient'] = submission_df['Patient_Week'].apply(lambda x:x.split('_')[0])
submission_df['Weeks'] = submission_df['Patient_Week'].apply(lambda x: int(x.split('_')[-1]))
submission_df =  submission_df[['Patient','Weeks','Confidence','Patient_Week']]
submission_df = submission_df.merge(test_df.drop('Weeks', axis=1), on="Patient")

submission_df.head()

In [None]:
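# Tag each row with its origin so the frames can be split apart again later
train_df['WHERE'] = 'train'
test_df['WHERE'] = 'val'
submission_df['WHERE'] = 'test'

# Stack train, test and submission rows into one frame
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
data = pd.concat([train_df, test_df, submission_df])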

print(colored('Train data set shape.......: ','yellow'), train_df.shape)
print(colored('Test data set shape........: ','green'), test_df.shape)
print(colored('Submission data set shape..: ','blue'), submission_df.shape)
print(colored('Combined data set shape....: ','red'), data.shape)

print("\n")

print(colored('Unique patients in train data set.......: ','yellow'), train_df.Patient.nunique())
print(colored('Unique patients in test data set........: ','green'), test_df.Patient.nunique())
print(colored('Unique patients in submission data set..: ','blue'), submission_df.Patient.nunique())
print(colored('Unique patients in combined data set....: ','red'), data.Patient.nunique())

In [None]:
data.head(3).T

In [None]:
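# Earliest recorded week per patient; submission rows are masked out first
# so that they do not influence the minimum
data['min_week'] = data['Weeks']
data.loc[data.WHERE=='test','min_week'] = np.nan
data['min_week'] = data.groupby('Patient')['min_week'].transform('min')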

In [None]:
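# Baseline FVC: the measurement taken at each patient's earliest week
base = data.loc[data.Weeks == data.min_week]
base = base[['Patient','FVC']].copy()
base.columns = ['Patient','min_FVC']
# Keep only the first row per patient in case of duplicates
base['nb'] = 1
base['nb'] = base.groupby('Patient')['nb'].transform('cumsum')
base = base[base.nb==1]
base.drop('nb', axis=1, inplace=True)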

In [None]:
data = data.merge(base, on='Patient', how='left')
data['base_week'] = data['Weeks'] - data['min_week']

del base

In [None]:
COLS = ['Sex','SmokingStatus'] #,'Age'
features = []

for col in COLS:
    for mod in data[col].unique():
        features.append(mod)
        data[mod] = (data[col] == mod).astype(int)

In [None]:
data.head(3).T

In [None]:
# Min-max scale the numeric features to [0, 1]
data['age'] = (data['Age'] - data['Age'].min() ) / ( data['Age'].max() - data['Age'].min() )
data['BASE'] = (data['min_FVC'] - data['min_FVC'].min() ) / ( data['min_FVC'].max() - data['min_FVC'].min() )
data['week'] = (data['base_week'] - data['base_week'].min() ) / ( data['base_week'].max() - data['base_week'].min() )
data['percent'] = (data['Percent'] - data['Percent'].min() ) / ( data['Percent'].max() - data['Percent'].min() )
data['FVC_Percent'] = data['FVC'] / data['Percent']

features += ['age','percent','week','BASE']
print(features)

In [None]:
train_df = data.loc[data.WHERE=='train']
test_df = data.loc[data.WHERE=='val']
submission_df = data.loc[data.WHERE=='test']

del data

In [None]:
C1, C2 = tf.constant(70, dtype='float32'), tf.constant(1000, dtype="float32")

def score(y_true, y_pred):
    # Keep the cast results; the original calls discarded them
    y_true = tf.dtypes.cast(y_true, tf.float32)
    y_pred = tf.dtypes.cast(y_pred, tf.float32)
    sigma = y_pred[:, 2] - y_pred[:, 0]
    fvc_pred = y_pred[:, 1]
    
    #sigma_clip = sigma + C1
    sigma_clip = tf.maximum(sigma, C1)
    delta = tf.abs(y_true[:, 0] - fvc_pred)
    delta = tf.minimum(delta, C2)
    sq2 = tf.sqrt( tf.dtypes.cast(2, dtype=tf.float32) )
    metric = (delta / sigma_clip)*sq2 + tf.math.log(sigma_clip* sq2)
    
    return K.mean(metric)
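
The `score` function above appears to be the negated form of the competition's modified Laplace log likelihood: the uncertainty is clipped from below at 70 and the absolute error from above at 1000, so lower is better here. As a hedged cross-check, the same arithmetic can be written in plain NumPy. The function name `laplace_score_np` and the sample numbers are made up for illustration.

```python
import numpy as np

def laplace_score_np(fvc_true, fvc_pred, sigma, sigma_min=70.0, delta_max=1000.0):
    """NumPy mirror of the arithmetic in `score` (lower is better)."""
    fvc_true = np.asarray(fvc_true, dtype=np.float64)
    fvc_pred = np.asarray(fvc_pred, dtype=np.float64)
    sigma = np.asarray(sigma, dtype=np.float64)
    # Very confident predictions are not rewarded below sigma_min
    sigma_clip = np.maximum(sigma, sigma_min)
    # Very large errors are capped at delta_max
    delta = np.minimum(np.abs(fvc_true - fvc_pred), delta_max)
    sq2 = np.sqrt(2.0)
    return float(np.mean(delta / sigma_clip * sq2 + np.log(sigma_clip * sq2)))

# A perfect prediction with an overconfident sigma: sigma is clipped to 70
perfect = laplace_score_np([2500.0], [2500.0], [50.0])
# A 100 ml error with sigma = 100
off_by_100 = laplace_score_np([2500.0], [2400.0], [100.0])
print(perfect, off_by_100)
```

Unlike `score`, this helper takes the median prediction and the uncertainty as separate arguments instead of reading them from the three quantile columns.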

In [None]:
def qloss(y_true, y_pred):
    # Pinball loss for multiple quantiles
    qs = [0.2, 0.50, 0.8]
    q = tf.constant(np.array([qs]), dtype=tf.float32)
    e = y_true - y_pred
    v = tf.maximum(q*e, (q-1)*e)
    
    return K.mean(v)
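
To see what the pinball loss in `qloss` computes, here is a small NumPy sketch with the same three quantiles. The function name `pinball_loss_np` and the toy numbers are illustrative assumptions.

```python
import numpy as np

def pinball_loss_np(y_true, y_pred, quantiles=(0.2, 0.5, 0.8)):
    """Mean pinball loss over samples and quantiles."""
    q = np.asarray(quantiles, dtype=np.float64)       # shape (3,)
    y_true = np.asarray(y_true, dtype=np.float64)     # shape (n,)
    y_pred = np.asarray(y_pred, dtype=np.float64)     # shape (n, 3)
    e = y_true[:, None] - y_pred                      # one error per quantile
    # Under-prediction is weighted by q, over-prediction by (1 - q)
    return float(np.mean(np.maximum(q * e, (q - 1) * e)))

# True FVC 100; predicted quantiles 90 (q=0.2), 100 (q=0.5), 120 (q=0.8)
loss = pinball_loss_np([100.0], [[90.0, 100.0, 120.0]])
print(loss)
```

In this toy case the per-quantile losses are 0.2 * 10 = 2, 0, and 0.2 * 20 = 4, so the mean is about 2.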

In [None]:
def mloss(_lambda):
    def loss(y_true, y_pred):
        return _lambda * qloss(y_true, y_pred) + (1 - _lambda)*score(y_true, y_pred)
    return loss

In [None]:
def make_model(no_feature):
    z = L.Input((no_feature,), name="Patient")
    x = L.Dense(100, activation="relu", name="d1")(z)
    x = L.Dense(100, activation="relu", name="d2")(x)
    p1 = L.Dense(3, activation="linear", name="p1")(x)
    p2 = L.Dense(3, activation="relu", name="p2")(x)
    preds = L.Lambda(lambda x: x[0] + tf.cumsum(x[1], axis=1), name="preds")([p1, p2])
    
    model = M.Model(z, preds, name="CNN")
    model.compile(
        loss=mloss(0.65),
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.1, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.01, amsgrad=False),
        metrics=[score],
    )
    
    return model
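
The `preds` layer adds a cumulative sum of non-negative (ReLU) offsets to the linear head. This nudges the three outputs toward increasing order, but it does not strictly guarantee it, because `p1` has three independent components. That is presumably why a later cell prints `(unc>=0).mean()`. The NumPy sketch below illustrates both cases; the function name `combine_heads` and the numbers are assumptions.

```python
import numpy as np

def combine_heads(p1, p2):
    """Mirror of the `preds` Lambda: p1 + cumsum(relu(p2)) along the quantile axis."""
    p1 = np.asarray(p1, dtype=np.float64)
    p2 = np.maximum(np.asarray(p2, dtype=np.float64), 0.0)  # ReLU output is non-negative
    return p1 + np.cumsum(p2, axis=1)

offsets = [[10.0, 50.0, 80.0]]

# Equal linear outputs: the cumulative offsets give increasing quantiles
ordered = combine_heads([[2000.0, 2000.0, 2000.0]], offsets)

# Strongly decreasing linear outputs can still break the ordering
unordered = combine_heads([[2000.0, 1900.0, 1800.0]], offsets)

print(ordered, unordered)
```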

In [None]:
print(colored('Features for model building : ','yellow'),features)

y = train_df['FVC'].values  # training target
X = train_df[features].values  # training features (1535, 9)
ze = submission_df[features].values  # submission features (730, 9); "e" for estimate

print(colored('Training data set shape for model building : ','yellow'),X.shape)
print(colored('Shape of submission.csv: ','yellow'),ze.shape) 

nh = X.shape[1] 
print(colored('Number of features : ','yellow'),nh)  # feature numbers (9,)

pe = np.zeros((ze.shape[0], 3))  # fold-averaged predictions for the submission rows
pred = np.zeros((X.shape[0], 3))  # out-of-fold predictions for the training rows

In [None]:
net = make_model(nh)
print(net.summary())
print(net.count_params())

In [None]:
NFOLD = 5 # originally 5
kf = KFold(n_splits=NFOLD)

# %%time
cnt = 0
EPOCHS = 855


for tr_idx, val_idx in kf.split(X):
    cnt += 1
    print("\n")
    print(colored('Fold........... : ','red'),cnt)
    
    net = make_model(nh)
    net.fit(X[tr_idx], y[tr_idx], batch_size=BATCH_SIZE, epochs=EPOCHS, validation_data=(X[val_idx], y[val_idx]), verbose=0) #
    
    print(colored('Train.......... : ','yellow'),"train", net.evaluate(X[tr_idx], y[tr_idx], verbose=0, batch_size=BATCH_SIZE))
    print(colored('Validation............ : ','yellow'),"val", net.evaluate(X[val_idx], y[val_idx], verbose=0, batch_size=BATCH_SIZE))
    
    print(colored('Predict Validation.. : ','yellow'))
    pred[val_idx] = net.predict(X[val_idx], batch_size=BATCH_SIZE, verbose=0)
    
    print(colored('Predict Test... : ','yellow'))
    pe += net.predict(ze, batch_size=BATCH_SIZE, verbose=0) / NFOLD

In [None]:
sigma_opt = mean_absolute_error(y, pred[:, 1])
unc = pred[:,2] - pred[:, 0]
sigma_mean = np.mean(unc)

print(sigma_opt, sigma_mean)
print(unc.min(), unc.mean(), unc.max(), (unc>=0).mean())

In [None]:
idxs = np.random.randint(0, y.shape[0], 100)
plt.figure(figsize=(12,8))
plt.plot(y[idxs], label="ground truth")
# Labels match the quantiles used in qloss: 0.2, 0.5 and 0.8
plt.plot(pred[idxs, 0], label="q20")
plt.plot(pred[idxs, 1], label="q50")
plt.plot(pred[idxs, 2], label="q80")
plt.legend(loc="best")
plt.show()

In [None]:
plt.figure(figsize=(10,6))
plt.hist(unc)
plt.title("uncertainty in prediction")
plt.show()

In [None]:
submission_df.head()

In [None]:
# PREDICTION
submission_df['FVC1'] = 1.*pe[:, 1]
submission_df['Confidence1'] = pe[:, 2] - pe[:, 0]
subm = submission_df[['Patient_Week','FVC','Confidence','FVC1','Confidence1']].copy()
subm.loc[~subm.FVC1.isnull()].head(10)

In [None]:
subm.loc[~subm.FVC1.isnull(),'FVC'] = subm.loc[~subm.FVC1.isnull(),'FVC1']
if sigma_mean<70:
    subm['Confidence'] = sigma_opt
else:
    subm.loc[~subm.FVC1.isnull(),'Confidence'] = subm.loc[~subm.FVC1.isnull(),'Confidence1']

In [None]:
subm.head()

In [None]:
subm.describe().T

In [None]:
otest = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/test.csv')

for i in range(len(otest)):
    subm.loc[subm['Patient_Week']==otest.Patient[i]+'_'+str(otest.Weeks[i]), 'FVC'] = otest.FVC[i]
    subm.loc[subm['Patient_Week']==otest.Patient[i]+'_'+str(otest.Weeks[i]), 'Confidence'] = 0.1

In [None]:
subm[["Patient_Week","FVC","Confidence"]].to_csv("submission_regression.csv", index=False)

In [None]:
reg_sub = subm[["Patient_Week","FVC","Confidence"]].copy()

In [None]:
df1 = img_sub.sort_values(by=['Patient_Week'], ascending=True).reset_index(drop=True)
df2 = reg_sub.sort_values(by=['Patient_Week'], ascending=True).reset_index(drop=True)

In [None]:
df = df1[['Patient_Week']].copy()
df['FVC'] = FVC_weight*df1['FVC'] + (1-FVC_weight)*df2['FVC']
df['Confidence'] = Confidence_weight*df1['Confidence'] + (1-Confidence_weight)*df2['Confidence']
df.head()

### Save submission

In [None]:
df.to_csv('submission.csv', index=False)

### References:
* https://err.ersjournals.com/content/23/132/215
* https://www.kaggle.com/c/osic-pulmonary-fibrosis-progression/discussion/165727
* https://www.kaggle.com/piantic/osic-pulmonary-fibrosis-progression-basic-eda
* https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6738634/
* https://www.semanticscholar.org/paper/Honeycomb-lung%3A-history-and-current-concepts.-Arakawa-Honma/9ea8579ddf8de97e308500ad73e680ba9b2c455d/figure/3
* https://link.springer.com/article/10.1186/s12890-020-1061-x
* https://www.pulmonologyadvisor.com/home/topics/restrictive-lung-disease/ct-honeycombing-in-interstitial-lung-disease-linked-to-higher-mortality-rates/
* https://www.kaggle.com/thebigd8ta/osic-ensemble-iv/