# Introduction

Idiopathic interstitial pneumonias (IIPs) are a group of interstitial lung diseases (ILDs) of
unknown etiology that differ from each other in the pathomorphological type of non-infectious
inflammation and fibrosis, mainly in the interstitium of the lung, as well as in clinical
course and prognosis [1]. The differential diagnosis of IIP is largely based on a set of
uniform criteria and guidelines proposed by the American Thoracic Society and the European
Respiratory Society (ATS/ERS). Early diagnosis is critical for treatment decisions,
especially for idiopathic pulmonary fibrosis (IPF), while misdiagnosis can lead to
life-threatening complications [2].

IPF is the most common form of IIP: it accounts for 80–90% of all cases of idiopathic
interstitial pneumonia. The prevalence of IPF is about 20 cases per 100 thousand among men
and 13 per 100 thousand among women [1], a male-to-female ratio of about 1.5:1. The
incidence of IPF is 11.3 cases per 100 thousand per year in men and 7.1 in women, and it
increases with age. Approximately two thirds of patients with IPF are over 60 years old.
Mortality from IPF is higher in the older age group and averages 3 per 100 thousand
population; median survival ranges from 2.3 to 5 years [1]. Risk factors include cigarette
smoking, certain viral infections, and a family history of the condition [4].
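
The sex ratios implied by the figures quoted above can be checked with a few lines of arithmetic. This is only a sanity check of the cited numbers [1]; the dictionary and function names are illustrative and not part of the analysis that follows.

```python
# Rates per 100 thousand population, as quoted in the text from [1].
prevalence = {"male": 20.0, "female": 13.0}
incidence = {"male": 11.3, "female": 7.1}  # cases per year


def male_to_female_ratio(rates):
    """Return the male:female ratio for a pair of rates."""
    return rates["male"] / rates["female"]


print("Prevalence ratio: %.2f : 1" % male_to_female_ratio(prevalence))
print("Incidence ratio:  %.2f : 1" % male_to_female_ratio(incidence))
```

Both ratios come out close to the 1.5:1 figure given in the text.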

IPF usually manifests as progressive dyspnea and a non-productive cough, which is often
paroxysmal and refractory to antitussives. Clubbing of the fingers (deformation of the nail
phalanges) is observed in 25-50% of patients. Signs of chronic cor pulmonale (peripheral
edema) can be observed in the later stages of the disease [1].

In most patients, more than 6 months pass between the onset of symptoms and the first visit
to a doctor. The clinical course of IPF is characterized by a gradual deterioration in the
patient's condition, but a sharp progression associated with a viral infection, the
development of pneumonia, or diffuse alveolar damage is also common. Chest radiography most
often shows peripheral reticular opacities, mainly in the basal regions, associated with
honeycomb changes in the lung tissue and a decrease in the volume of the lower lobes.
However, on average 16% of patients with histologically proven IPF may have a normal
radiograph [1].

High-resolution computed tomography (HRCT) reveals reticular changes, usually bilateral,
partly associated with traction bronchiectasis. Honeycombing ("honeycomb lung") is often
observed. Ground-glass opacities are less common than reticular changes. Architectural
distortion, reflecting pulmonary fibrosis, is characteristic. The pathological changes are
heterogeneous and are localized mainly in the peripheral and basal regions. Several studies
carried out during the treatment of patients found that the ground-glass areas may decrease.
However, the most characteristic course is the progression of fibrosis with the formation of
honeycombing. The accuracy of IPF diagnosis from HRCT data reaches 90% [1].

Correct identification of patterns in HRCT images in conjunction with the Fleischner
Society guidelines [5] is central to the diagnosis and further management of patients
with ILD [2]. The available treatments slow down but do not reverse the disease process,
so there is an objective need for methods that accurately detect early interstitial
changes before they progress and that predict the rate of progression. The main problem
today is the early diagnosis of IPF; solving it would significantly increase the life
expectancy of patients [3].

The purpose of this work is to predict the severity of a patient's decline in lung function
from a CT scan of their lungs. The first chapter analyses the full clinical history of the
patients together with their baseline CT. The next chapter describes the methodology for
predicting a patient's forced vital capacity (FVC) measurement for every possible week.
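
To make the prediction target concrete, the sketch below fits a straight line to one patient's FVC measurements and evaluates it on a grid of weeks. It is a naive baseline for illustration only, not the methodology of this work; the function name `predict_fvc_linear`, the sample measurements, and the week grid are all assumptions.

```python
import numpy as np


def predict_fvc_linear(weeks, fvc, target_weeks):
    """Fit FVC = slope * week + intercept and evaluate it at target_weeks."""
    slope, intercept = np.polyfit(weeks, fvc, deg=1)
    return slope * np.asarray(target_weeks, dtype=float) + intercept


# Hypothetical measurements for a single patient (week, FVC in ml).
weeks = [0, 10, 20, 30]
fvc = [3000, 2950, 2900, 2850]

# Hypothetical grid of weeks for which a prediction is required.
target_weeks = np.arange(0, 61, 20)
predictions = predict_fvc_linear(weeks, fvc, target_weeks)
for week, value in zip(target_weeks, predictions):
    print("week %3d: predicted FVC %.0f ml" % (week, value))
```

A linear trend is the simplest possible assumption; it ignores the sharp progressions described earlier and gives no estimate of uncertainty.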

# 1. Analysis of datasets

In [None]:
import pandas as pd
import matplotlib
import numpy as np
from matplotlib import pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import math

%matplotlib inline
matplotlib.rcParams.update({'font.size': 15})
colors = ["#003f5c", "#bc5090", "#ffa600", "#127681", "#ea5455"]
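# Register the custom colours as the global seaborn palette.
sns.set_palette(sns.color_palette(colors))
# sns.set_palette returns None, so customPalette is None and every call that
# passes palette=customPalette falls back to the global palette set above.
customPalette = None
sns.set_style("whitegrid")
filepath = '../input/osic-pulmonary-fibrosis-progression/'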
test = pd.read_csv(filepath + 'test.csv')
train = pd.read_csv(filepath + 'train.csv')

First, let's highlight the main features of the train dataset.

In [None]:
def show_dataset_common_info(dataset, label, fignumber):
    df = dataset[['Patient', 'Sex', 'SmokingStatus']].drop_duplicates()

    _, axes = plt.subplots(1, 2, figsize=(12,5))
    axes[0].set_title('Fig.%s. Gender and patient habits of %s dataset' % (fignumber, label))

    df.groupby(['SmokingStatus'])['Patient'].count().plot.pie(
        label='', autopct='%.2f%%', labeldistance=None, ax=axes[1], textprops={'color':"w"})
    axes[1].legend(loc='upper right')

    df.groupby(['Sex'])['Patient'].count().plot.pie(
        label='', autopct='%.2f%%', labeldistance=None, ax=axes[0], textprops={'color':"w"})
    axes[0].legend(loc='upper right')

    plt.show()
    print('Total patient count in %s dataset: %s' % (label, df['Patient'].count()))

show_dataset_common_info(train, 'train', '1')

### Intermediate conclusion:
* In the train dataset there are 3.7 times fewer women than men,
* Most patients are ex-smokers.
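
The proportions read from the pie charts can also be computed directly. The sketch below counts unique patients per category on a toy table that stands in for `train`; the helper name `patient_level_counts` and all values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for train.csv: one row per visit, several visits per patient.
toy = pd.DataFrame({
    "Patient": ["A", "A", "B", "C", "C", "D", "E"],
    "Sex": ["Male", "Male", "Male", "Female", "Female", "Male", "Male"],
    "SmokingStatus": ["Ex-smoker", "Ex-smoker", "Never smoked", "Never smoked",
                      "Never smoked", "Ex-smoker", "Ex-smoker"],
})


def patient_level_counts(dataset, column):
    """Count unique patients per category of `column`."""
    patients = dataset[["Patient", column]].drop_duplicates()
    return patients[column].value_counts()


sex_counts = patient_level_counts(toy, "Sex")
smoking_counts = patient_level_counts(toy, "SmokingStatus")
print("Men per woman: %.1f" % (sex_counts["Male"] / sex_counts["Female"]))
print("Most common smoking status:", smoking_counts.idxmax())
```

Dropping duplicate rows first matters because each patient has several visits; counting rows instead of patients would weight patients by their number of visits.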

In [None]:
# Count unique patients in every (sex, smoking status) group.
df = train.groupby(['Sex', 'SmokingStatus'])['Patient'].nunique().reset_index()
ax = sns.catplot(x="SmokingStatus", y="Patient",
                 kind="bar", data=df,
                 hue="Sex", palette=customPalette,
                 height=4, aspect=2.5)

start, end = ax.axes[0,0].get_ylim()

plt.title('Fig.2. Number of patients by smoking status', fontsize=20)
plt.xlabel('', fontsize=19)
plt.ylabel('Count', fontsize=19)
plt.yticks(np.arange(start, end, max(int(math.fabs(end-start)/5),1)))
ax.axes[0,0].yaxis.set_major_formatter(ticker.ScalarFormatter())

### Intermediate conclusion:
* In the train dataset, the ratio of ex-smokers to non-smokers is markedly higher among
male patients than among female patients.
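
The ratio mentioned above can be tabulated explicitly with a cross-tabulation. This is a sketch on an invented patient-level table; the category labels are assumed to match those in the dataset.

```python
import pandas as pd

# Toy patient-level table (one row per patient); values are illustrative.
patients = pd.DataFrame({
    "Sex": ["Male"] * 5 + ["Female"] * 4,
    "SmokingStatus": ["Ex-smoker"] * 4 + ["Never smoked"]
                     + ["Ex-smoker"] * 2 + ["Never smoked"] * 2,
})

# Rows: sex; columns: smoking status; cells: number of patients.
table = pd.crosstab(patients["Sex"], patients["SmokingStatus"])

# Ratio of ex-smokers to never-smokers within each sex.
ratio = table["Ex-smoker"] / table["Never smoked"]
print(table)
print(ratio)
```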

In [None]:
# Count unique patients for every (sex, age) pair.
df = train.groupby(['Sex', 'Age'])['Patient'].nunique().reset_index()
df = df.rename(columns={'Patient': 'Count'})

fig, axes = plt.subplots(2, 1, figsize=(12,5))
fig.subplots_adjust(wspace=0.2, hspace=0.2)
ax = sns.barplot(x="Age", y="Count",
                 hue='Sex', data=df,
                 palette=customPalette, ax=axes[0])
ax.set_title('Fig.3. Age distribution of patients')
ax.legend(loc='upper right')
ax.set_xlabel('')
step = max(math.floor(math.fabs(df['Age'].max()-df['Age'].min())/11),1)
ax.xaxis.set_major_locator(ticker.MultipleLocator(step))

color = {"Male": "C1", "Female": "C0"}
for label in df['Sex'].unique():
    data = df[df['Sex'] == label].set_index('Count')
    sns.distplot(data['Age'], label=label, hist=True, kde=True, ax=axes[1], rug=True, bins=10)

    mean = data['Age'].mean()
    median = data['Age'].median()
    std = data['Age'].std()

    axes[1].axvline(mean, color=color[label], label="Mean", ls='-')
    axes[1].axvline(mean+std, color=color[label], label="Mean+std", ls='--')
    axes[1].axvline(mean-std, color=color[label], label="Mean-std", ls='--')
    axes[1].axvline(median, color=color[label], label="Median", ls='-.')

plt.show()

### Intermediate conclusion:
* The minimum female age is 49 and the maximum is 87,
* The minimum male age is 51 and the maximum is 83,
* The mean age is about 66-67 years.
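
The mean, median and standard deviation drawn in Fig. 3 are computed over the distinct age values of each sex, so every age counts once regardless of how many patients share it. A patient-level summary, in which each patient counts once, could be sketched as follows; the toy table stands in for `train` and its values are made up.

```python
import pandas as pd

# Toy stand-in for train.csv with repeated visits per patient.
toy = pd.DataFrame({
    "Patient": ["A", "A", "B", "C", "D", "D", "E"],
    "Sex": ["Male", "Male", "Male", "Male", "Female", "Female", "Female"],
    "Age": [70, 70, 70, 61, 65, 65, 75],
})

# Keep one row per patient so repeated visits do not inflate the statistics.
per_patient = toy.drop_duplicates(subset="Patient")
age_stats = per_patient.groupby("Sex")["Age"].agg(["min", "max", "mean", "median"])
print(age_stats)
```

In this toy data the patient-level mean age of men is 67, whereas the mean over the distinct ages (61 and 70) would be 65.5.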

In [None]:
def _range(df):
    mini = df.min()
    maxi = df.max()
    rang = maxi - mini
    return rang

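df = train[['Patient','Weeks','FVC','Percent']].groupby(['Patient']).agg([_range]).reset_index()
# Flatten the two-level column names, e.g. ('FVC', '_range') -> 'FVC_range'.
df.columns = ["".join(x) for x in df.columns]
# Replace patient IDs with consecutive integers so they can serve as a plot axis.
df['Patient'] = df['Patient'].keys()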

fig, axes = plt.subplots(1, 2, figsize=(12,5))
fig.subplots_adjust(wspace=0.2, hspace=0.2)
ss_loc = df.groupby(['Weeks_range'])['Patient'].count().reset_index()
ax = sns.barplot(x='Weeks_range', y='Patient', color="C2", palette=customPalette,
                 data=ss_loc, ax=axes[0])
ax.set_xlabel('Weeks')
ax.set_ylabel('Count')
step = max(math.ceil(math.fabs(ss_loc['Weeks_range'].max()-ss_loc['Weeks_range'].min())/5),1)
ax.xaxis.set_major_locator(ticker.MultipleLocator(step))

# Use the per-patient intervals so that each patient contributes one observation.
sns.distplot(df['Weeks_range'], hist=True, kde=True, ax=axes[1], bins=5)
axes[1].set_xlabel('Weeks')

ax.set_title('Fig.4. Distribution of FVC observation interval')

plt.show()

### Intermediate conclusion:
* The minimum FVC observation interval is 25 weeks and the maximum is 63 weeks,
* For most patients, FVC changes are observed over 50-60 weeks.
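
The interval statistics can be computed without reading them off the plot. A minimal sketch on invented data, assuming the observation interval is the last visit week minus the first visit week:

```python
import pandas as pd

# Toy visit table; week numbers are illustrative.
toy = pd.DataFrame({
    "Patient": ["A", "A", "A", "B", "B", "C", "C"],
    "Weeks": [-4, 10, 50, 0, 25, 5, 62],
})

# Observation interval = last visit week minus first visit week, per patient.
interval = toy.groupby("Patient")["Weeks"].agg(lambda w: w.max() - w.min())
print(interval)
print("min: %d, max: %d" % (interval.min(), interval.max()))

# Share of patients whose interval falls within 50-60 weeks (inclusive).
share = interval.between(50, 60).mean()
print("share in 50-60 weeks: %.2f" % share)
```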

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(18,8))
fig.subplots_adjust(wspace=0.2, hspace=0.3)
labels = [('FVC_range', 'ml') , ('Percent_range', '%')]
axes[0,0].set_title('Fig.5. Distribution of change of lung capacity (LC) per obs. interval')

for i, label in enumerate(labels):
    ax = sns.barplot(x='Patient', y=label[0], color="C2", palette=customPalette,
                     data=df, ax=axes[i, 0])

    ax.set_xlabel('Patient #')
    ax.set_ylabel('Change of LC, %s' % label[1])
    step = max(math.ceil(math.fabs(df['Patient'].max()-df['Patient'].min())/5),1)
    ax.xaxis.set_major_locator(ticker.MultipleLocator(step))

    sns.distplot(df[label[0]], hist=True, kde=True, ax=axes[i, 1])
    axes[i, 1].set_xlabel('Change of LC, %s' % label[1])

    mean = df[label[0]].mean()
    median = df[label[0]].median()
    std = df[label[0]].std()

    axes[i, 1].axvline(mean, label="Mean", ls='-')
    axes[i, 1].axvline(mean+std, label="Mean+std", ls='--')
    axes[i, 1].axvline(mean-std, label="Mean-std", ls='--')
    axes[i, 1].axvline(median, label="Median", ls='-.')

plt.show()

### Intermediate conclusion:
* The maximum change in lung capacity within the observation interval exceeds 1500 ml
and 40% of normal capacity,
* The mean change in lung capacity is about 500 ml, or 13% of normal capacity.
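
Note that `_range` (maximum minus minimum) measures the size of the change but not its direction. A signed change between the first and the last visit could be sketched as follows; the helper name `signed_change` and the toy data are assumptions.

```python
import pandas as pd

# Toy visit table; FVC values in ml are illustrative.
toy = pd.DataFrame({
    "Patient": ["A", "A", "A", "B", "B", "B"],
    "Weeks": [0, 20, 40, 5, 30, 60],
    "FVC": [3000, 2800, 2600, 2500, 2450, 2550],
})


def signed_change(dataset, column):
    """Return last-visit minus first-visit value of `column` per patient."""
    ordered = dataset.sort_values(["Patient", "Weeks"])
    grouped = ordered.groupby("Patient")[column]
    return grouped.last() - grouped.first()


change = signed_change(toy, "FVC")
print(change)
```

For patient B in this toy data the range is 100 ml, while the signed change is +50 ml, so the two measures can tell different stories.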

In [None]:
def _range(df):
    mini = df.min()
    maxi = df.max()
    rang = maxi - mini
    return rang

df = train[['Patient', 'Age', 'FVC', 'Percent']].groupby(['Age', 'Patient']).agg([_range]).reset_index()
# Flatten the two-level column names, e.g. ('FVC', '_range') -> 'FVC_range'.
df.columns = ["".join(x) for x in df.columns]

fig, axes = plt.subplots(2, 2, figsize=(18,8))
fig.subplots_adjust(wspace=0.2, hspace=0.2)
labels = [('FVC_range', 'ml') , ('Percent_range', '%')]
axes[0, 0].set_title('Fig.6. Distribution of median change of lung capacity (LC) per patient age')

for i, label in enumerate(labels):
    ss_loc = df.groupby(['Age'])[label[0]].median().reset_index()
    ax = sns.barplot(x='Age', y=label[0], color="C2", palette=customPalette,
                     data=ss_loc, ax=axes[i, 0])
    ax.set_ylabel('Change of LC, %s' % label[1])
    ax.set_xlabel('Age')
    step = max(math.ceil(math.fabs(ss_loc['Age'].max()-ss_loc['Age'].min())/5),1)
    ax.xaxis.set_major_locator(ticker.MultipleLocator(step))

    sns.distplot(ss_loc[label[0]], hist=True, kde=True, ax=axes[i, 1], bins=10)
    axes[i, 1].set_xlabel('Change of LC, %s' % label[1])

plt.show()

### Intermediate conclusion:
* The median absolute change in lung capacity (ml) decreases with age,
 whereas the relative change (%) shows no such trend.
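
A trend of this kind can be quantified with a rank correlation between age and the change of lung capacity. The sketch below computes Spearman's coefficient as the Pearson correlation of ranks; the toy values are invented and only illustrate the calculation.

```python
import pandas as pd

# Toy per-patient table: age and absolute FVC change (ml); values are made up.
toy = pd.DataFrame({
    "Age": [55, 60, 65, 70, 75, 80],
    "FVC_range": [900, 700, 750, 500, 400, 300],
})

# Spearman's rho equals the Pearson correlation of the ranked values.
rho = toy["Age"].rank().corr(toy["FVC_range"].rank())
print("Spearman rho: %.2f" % rho)
```

A value close to -1 indicates that the change tends to decrease as age increases; a value near 0 indicates no monotonic trend.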

In [None]:
df = train[['Patient', 'Age', 'FVC', 'Percent', 'Sex']].groupby(['Age', 'Patient', 'Sex']).agg([_range]).reset_index()
# Flatten the two-level column names, e.g. ('FVC', '_range') -> 'FVC_range'.
df.columns = ["".join(x) for x in df.columns]

fig, axes = plt.subplots(2, 2, figsize=(18,8))
fig.subplots_adjust(wspace=0.2, hspace=0.2)
labels = [('FVC_range', 'ml') , ('Percent_range', '%')]
axes[0, 0].set_title('Fig.7. Distribution of median change of lung capacity (LC) per patient age and sex')

for i, label in enumerate(labels):
    ss_loc = df.groupby(['Age', 'Sex'])[label[0]].median()
    ss_rloc = ss_loc.reset_index()
    ax = sns.barplot(x="Age", y=label[0], data=ss_rloc,
                     hue="Sex", palette=customPalette, ax=axes[i, 0])
    ax.set_ylabel('Change of LC, %s' % label[1])
    ax.set_xlabel('Age')
    ax.legend(loc='upper right')
    step = max(math.ceil(math.fabs(ss_rloc['Age'].max()-ss_rloc['Age'].min())/5),1)
    ax.xaxis.set_major_locator(ticker.MultipleLocator(step))

    # Select the per-age medians of each sex by index level name.
    female = ss_loc.xs('Female', level='Sex')
    male = ss_loc.xs('Male', level='Sex')
    sns.distplot(female, hist=True, kde=True, ax=axes[i, 1], label='Female')
    sns.distplot(male, hist=True, kde=True, ax=axes[i, 1], label='Male')
    axes[i, 1].set_xlabel('Change of LC, %s' % label[1])

    # Female statistics are drawn in the default line colour.
    mean = female.mean()
    median = female.median()
    std = female.std()

    axes[i, 1].axvline(mean, label="Mean", ls='-')
    axes[i, 1].axvline(mean+std, label="Mean+-std", ls='--')
    axes[i, 1].axvline(mean-std, ls='--')
    axes[i, 1].axvline(median, label="Median", ls='-.')

    # Male statistics are drawn in the second palette colour (C1).
    mean = male.mean()
    median = male.median()
    std = male.std()

    axes[i, 1].axvline(mean, ls='-', color="C1")
    axes[i, 1].axvline(mean+std, ls='--', color="C1")
    axes[i, 1].axvline(mean-std, ls='--', color="C1")
    axes[i, 1].axvline(median, ls='-.', color="C1")
    axes[i, 1].legend(loc='upper right')

plt.show()

### Intermediate conclusion:
* The median change in lung capacity of women fluctuates strongly (by more than 30%) compared
with the same parameter in men; this fluctuation does not depend on age.
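
The fluctuation described above can be expressed as the spread of the per-age medians within each sex. A minimal sketch on invented values, assuming "fluctuation" means the difference between the largest and smallest median:

```python
import pandas as pd

# Toy per-patient table; Percent_range values are illustrative.
toy = pd.DataFrame({
    "Age": [60, 60, 65, 65, 70, 70],
    "Sex": ["Female", "Male"] * 3,
    "Percent_range": [5.0, 12.0, 35.0, 14.0, 10.0, 13.0],
})

# Median change per (Age, Sex), then the spread of those medians per sex.
medians = toy.groupby(["Age", "Sex"])["Percent_range"].median()
spread = medians.groupby(level="Sex").agg(lambda m: m.max() - m.min())
print(spread)
```

With few female patients per age, a single patient can dominate a median, so a large spread may partly reflect the small sample rather than a real effect.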

In [None]:
show_dataset_common_info(test, 'test', '8')

### Intermediate conclusion:
* In the test dataset all patients are men, and most of them are ex-smokers.
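
Plotting the history of test patients from `train`, as done for Fig. 9, assumes that every test patient also appears in the training table. That assumption, and the observation about sex, can be checked as sketched below on toy tables; the table contents are invented.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv.
toy_train = pd.DataFrame({"Patient": ["A", "A", "B", "C"], "Sex": ["Male"] * 4})
toy_test = pd.DataFrame({"Patient": ["A", "C"], "Sex": ["Male", "Male"]})

# True for each test patient that also has rows in the training table.
in_train = toy_test["Patient"].isin(toy_train["Patient"])
print("All test patients present in train:", in_train.all())

# True when every test patient has the same sex.
single_sex = toy_test["Sex"].nunique() == 1
print("Test set contains a single sex:", single_sex)
```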

In [None]:
fig, axes = plt.subplots(1,1, figsize=(12,5))
df = train[['Patient','Weeks','Percent']]
for patient in test['Patient'].to_numpy():
    # History of this test patient taken from the train table.
    data = df.loc[df['Patient'] == patient, ['Weeks','Percent']]
    # Mask the middle of the patient ID in the legend.
    sns.lineplot(x='Weeks', y='Percent', data=data, ax=axes,
                 label=patient[0:6]+"***"+patient[-2:])
axes.set_ylabel('LC, % of normal capacity')
axes.set_title('Fig.9. Observations of lung function of patients from test dataset')
plt.show()

### Intermediate conclusion:
* All patients from the test dataset show a decline in lung function over time,
* Patients ID00419637202311204720264 and ID00423637202312137826377 show a sharp decline
in lung function in the initial observation period.
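
A "sharp decline in the initial observation period" can be made measurable by comparing the slope over the first visits with the slope over all visits. The sketch below does this for one hypothetical patient; the 20-week cutoff, the helper name `slope_per_week` and the toy values are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy history of one patient; Percent values are illustrative.
toy = pd.DataFrame({
    "Weeks": [0, 5, 10, 30, 50],
    "Percent": [80.0, 75.0, 70.0, 68.0, 66.0],
})


def slope_per_week(data, max_week=None):
    """Least-squares slope of Percent against Weeks, optionally up to max_week."""
    if max_week is not None:
        data = data[data["Weeks"] <= max_week]
    slope, _ = np.polyfit(data["Weeks"], data["Percent"], deg=1)
    return slope


early = slope_per_week(toy, max_week=20)  # hypothetical 20-week cutoff
overall = slope_per_week(toy)
print("early: %.2f %%/week, overall: %.2f %%/week" % (early, overall))
```

An early slope that is clearly steeper (more negative) than the overall slope corresponds to the pattern described above.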

# TODO:
1. Describe the analysis of patients' CT scans and the preprocessing algorithms.

# References

1. Feshchenko Yu. I., Gavrysyuk V. K., Monogarova N. E. Idiopathic interstitial pneumonias: classification, differential diagnosis. Ukrainian Pulmonological Journal. 2007, no. 2, p. 5-11.
2. Christe A., Peters A. A., et al. Computer-Aided Diagnosis of Pulmonary Fibrosis Using Deep Learning and CT Images. Investigative Radiology. October 2019, Volume 54, Issue 10, p. 627-632. https://doi.org/10.1097/RLI.0000000000000574
3. Bermejo-Peláez, D., Ash, S.Y., Washko, G.R. et al. Classification of Interstitial Lung Abnormality Patterns with an Ensemble of Deep Convolutional Neural Networks. Sci Rep 10, 338 (2020). https://doi.org/10.1038/s41598-019-56989-5
4. https://www.nhlbi.nih.gov/health-topics/idiopathic-pulmonary-fibrosis
5. Lynch DA, Sverzellati N, Travis WD, et al. Diagnostic criteria for idiopathic pulmonary fibrosis: a Fleischner Society White Paper. Lancet Respir Med. 2018;6:138–153.