## <u>Introduction</u>

### [What is Pulmonary Fibrosis?](https://www.mayoclinic.org/diseases-conditions/pulmonary-fibrosis/symptoms-causes/syc-20353690)
![](https://www.mayoclinic.org/-/media/kcms/gbs/patient-consumer/images/2016/08/10/14/57/mcdc7_pulmonaryfibrosis-8col.jpg)
* Pulmonary fibrosis is a lung disease that occurs when the lung tissue becomes thick and scarred. As a result, breathing becomes progressively more difficult for the patient.
* The disease may progress in someone's body for a long time without any warning. Then symptoms can show up suddenly and make the patient much worse.

### [Causes](https://en.wikipedia.org/wiki/Pulmonary_fibrosis#Cause)
* Most of the time, it is difficult to find the cause. In such cases the disease is termed idiopathic pulmonary fibrosis.
* Pulmonary fibrosis can also be a secondary effect of other diseases. Examples include autoimmune disorders, viral infections, and bacterial infections such as tuberculosis, which may cause fibrotic changes in the upper or lower lobes of both lungs and other microscopic injuries to the lung ([Source](https://en.wikipedia.org/wiki/Pulmonary_fibrosis#Cause)).
* The following image shows a chest X-ray with pulmonary fibrosis ([Source](https://en.wikipedia.org/wiki/Pulmonary_fibrosis#Cause)).
![](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e1/IPF_amiodarone.JPG/450px-IPF_amiodarone.JPG)
* The following image is an HRCT of the lung showing extensive fibrosis ([Source](https://en.wikipedia.org/wiki/File:Pulmon_fibrosis.PNG)).
![](https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/Pulmon_fibrosis.PNG/465px-Pulmon_fibrosis.PNG)

### Symptoms
* The following are the symptoms of pulmonary fibrosis ([Source](https://en.wikipedia.org/wiki/Pulmonary_fibrosis#Signs_and_symptoms)).
    * Shortness of breath, particularly with exertion.
    * Chronic dry, hacking cough.
    * Fatigue and weakness.
    * Chest discomfort including chest pain.
    * Loss of appetite and rapid weight loss.

## <u>The Aim of This Competition</u>
In this competition, we’ll predict a patient’s severity of decline in lung function based on a CT scan of their lungs. We’ll determine lung function based on output from a spirometer, which measures the volume of air inhaled and exhaled. The challenge is to use machine learning techniques to make a prediction with the image, metadata, and baseline FVC as input.

* **Now let's move ahead and explore the data that is given to us in this competition**.

### Evaluation Metric
This competition is evaluated on a modified version of the Laplace Log Likelihood. In medical applications, it is useful to evaluate a model's confidence in its decisions. Accordingly, the metric is designed to reflect both the accuracy and certainty of each prediction.

#### What is FVC?
Lung function is assessed based on output from a spirometer, which measures the forced vital capacity (FVC), i.e. the volume of air exhaled.


For each true FVC measurement, we will predict both an FVC and a confidence measure (standard deviation σ). The metric is computed as:

$$\sigma_{clipped} = \max(\sigma, 70)$$
$$\Delta = \min ( |FVC_{true} - FVC_{predicted}|, 1000 )$$
$$metric = - \frac{\sqrt{2} \Delta}{\sigma_{clipped}} - \ln ( \sqrt{2} \sigma_{clipped} )$$
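
The three formulas above can be sketched in NumPy as follows. This is a minimal illustration, not the official scoring code: the function name `laplace_log_likelihood` is hypothetical, and averaging the per-measurement scores into a single number is an assumption that the text above does not state.

```python
import numpy as np


def laplace_log_likelihood(fvc_true, fvc_pred, sigma):
    """Modified Laplace log likelihood, averaged over all measurements.

    Values closer to zero (less negative) are better.
    """
    fvc_true = np.asarray(fvc_true, dtype=float)
    fvc_pred = np.asarray(fvc_pred, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    # Confidence values below 70 are clipped up to 70
    sigma_clipped = np.maximum(sigma, 70.0)
    # Absolute errors above 1000 are clipped down to 1000
    delta = np.minimum(np.abs(fvc_true - fvc_pred), 1000.0)

    scores = (-np.sqrt(2.0) * delta / sigma_clipped
              - np.log(np.sqrt(2.0) * sigma_clipped))
    # Assumption: the per-measurement scores are averaged
    return float(np.mean(scores))


# Made-up FVC values (in ml) and confidences, for illustration only
score = laplace_log_likelihood([2500, 3000], [2400, 3100], [150, 200])
print(score)
```

Because of the clipping, reporting a σ below 70 never improves the score, and a single very large error cannot cost more than an error of 1000 ml would.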

In [None]:
import pydicom
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

matplotlib.style.use('ggplot')

In [None]:
DIR_ROOT = '../input/osic-pulmonary-fibrosis-progression'

train_csv = pd.read_csv(f"{DIR_ROOT}/train.csv")

## <u>Taking a Look at Patient Data</u>

In [None]:
train_csv.head()

In [None]:
# len() counts rows, not unique patients
print(f"Total number of rows: {len(train_csv)}")

In [None]:
train_csv.info()

In [None]:
train_csv.describe()

In [None]:
train_csv.isnull().values.any()

So, we do not have any NaN values in the dataset. That means, we can peacefully move on to the EDA part of this notebook.

The CSV file has 1549 rows, but the patient IDs in them are not all unique. The same ID appears on several rows, most likely because each patient has several follow-up FVC measurements recorded under the same ID.

In [None]:
# Number of rows recorded for each patient ID
patient_counts = train_csv.Patient.value_counts()
print(patient_counts)
print(patient_counts.keys())
print(f"Unique patient IDs: {train_csv.Patient.nunique()}")

So, there are 176 unique patient IDs. Let's plot these out.
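
The difference between the row count and the number of unique IDs can be seen on a small made-up table. This is only a sketch: the IDs and values in `toy` are invented for illustration and are not taken from `train.csv`.

```python
import pandas as pd

# Invented data: two patients with several measurements each
toy = pd.DataFrame({
    "Patient": ["ID_A", "ID_A", "ID_A", "ID_B", "ID_B"],
    "Weeks": [0, 6, 12, 1, 9],
    "FVC": [2300, 2250, 2190, 3100, 3050],
})

n_rows = len(toy)                      # counts every measurement
n_patients = toy["Patient"].nunique()  # counts each ID once
print(f"rows: {n_rows}, unique patients: {n_patients}")
```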

In [None]:
# Patient IDs and their row counts, ordered from most to fewest visits
visit_counts = train_csv.Patient.value_counts()
patient_id_dict = {
    'id': visit_counts.index.tolist(),
    'num': visit_counts.tolist(),
}

In [None]:
plt.figure(figsize=(20, 17))
plt.bar(patient_id_dict['id'], patient_id_dict['num'], color='orange')
plt.tick_params(
    axis='x',         
    which='both',     
    bottom=False,      
    top=False,        
    labelbottom=False) 
plt.show()

The above plot is hard to read. We now know that the highest number of visits by any patient in the dataset is 10 and the lowest is 6. So let's visualize how many patients visited a particular number of times, from 10 down to 6.

In [None]:
num_patients_list = []
num_visits = []
for i in range(10, 5, -1):
    num_visits.append(i)
    num_patients_counter = 0
    for j in range(len(train_csv.Patient.value_counts())):
        if i == patient_id_dict['num'][j]:
            num_patients_counter += 1
    num_patients_list.append(num_patients_counter)

In [None]:
plt.figure(figsize=(10, 7))
# Horizontal bars: one bar per visit count, bar length = number of patients
plt.barh(num_visits, num_patients_list, color='orange')
plt.xlabel('Number of patients')
plt.ylabel('Number of times visited')
plt.show()

So, it looks like 10 patients visited the maximum number of times, that is, 10 times, and around 150 patients visited 9 times.
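
The same tally can be obtained without an explicit loop by applying `value_counts()` twice. The sketch below uses an invented `toy_patients` series rather than the competition data.

```python
import pandas as pd

# Invented IDs: ID_A and ID_C appear 3 times each, ID_B appears twice
toy_patients = pd.Series(
    ["ID_A"] * 3 + ["ID_B"] * 2 + ["ID_C"] * 3, name="Patient"
)

# First pass: how many rows (visits) each patient has
visits_per_patient = toy_patients.value_counts()
# Second pass: how many patients share each visit count
patients_per_visit_count = visits_per_patient.value_counts().sort_index()
print(patients_per_visit_count)
```

Here the index of `patients_per_visit_count` is the number of visits and the values are the number of patients with that many visits.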

### Grouping by Patient ID
We already know that we have 176 patients. So, we can group by patient ID and look at the per-patient values. Maybe that will lead to some useful information and visualizations.

In [None]:
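# numeric_only=True skips text columns, which recent pandas versions refuse to average
mean_csv = train_csv.groupby(['Patient']).mean(numeric_only=True)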
print(mean_csv.head())
print(mean_csv.columns)

From the above information, we can plot several things, including:
* Mean week number of each patient's visits (the `Weeks` column).
* Mean FVC of each patient.
* Mean FVC percentage of each patient, that is, FVC relative to the typical value for people with similar characteristics.
* The age of each patient.
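
As a sketch of how such per-patient values can be computed explicitly, the block below uses pandas named aggregation on an invented table. The column names `Weeks`, `FVC`, `Percent` and `Age` follow the ones used in this notebook; the data and the output names (`mean_week`, `mean_fvc`, and so on) are made up for illustration.

```python
import pandas as pd

# Invented measurements for two patients
toy = pd.DataFrame({
    "Patient": ["ID_A", "ID_A", "ID_B", "ID_B"],
    "Weeks": [0, 10, 2, 8],
    "FVC": [2400, 2200, 3000, 2900],
    "Percent": [60.0, 55.0, 80.0, 78.0],
    "Age": [70, 70, 65, 65],
})

# One row per patient; each output column names its source column and function
per_patient = toy.groupby("Patient").agg(
    mean_week=("Weeks", "mean"),
    mean_fvc=("FVC", "mean"),
    mean_percent=("Percent", "mean"),
    age=("Age", "first"),
)
print(per_patient)
```

Naming each aggregation makes it explicit which columns are averaged, and the result is indexed by patient ID, so values and IDs stay paired when plotting.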

#### Number of Weeks Visited by The Patients

In [None]:
plt.figure(figsize=(15, 12))
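# Take IDs from mean_csv.index so each ID is paired with its own mean;
# recent seaborn versions also require x and y as keyword arguments
sns.scatterplot(x=mean_csv.index.tolist(), y=mean_csv['Weeks'],
                hue=mean_csv['Weeks'], size=mean_csv['Weeks'],
                sizes=(10, 200))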
plt.tick_params(
    axis='x',         
    which='both',     
    bottom=False,      
    top=False,        
    labelbottom=False) 

From the above plot we can infer that for most patients the mean value of `Weeks` lies somewhere between 20 and 30. There are a few outliers, and only one patient has a mean visit week above 80.

#### Mean FVC of Each Patient

In [None]:
plt.figure(figsize=(18, 15))
# IDs come from mean_csv.index so they match the order of the means
sns.barplot(x=mean_csv.index.tolist(), y=mean_csv['FVC'])
plt.ylabel('Mean FVC (in ml)')
plt.tick_params(
    axis='x',         
    which='both',     
    bottom=False,      
    top=False,        
    labelbottom=False) 

#### Mean FVC Percentage of Each Patient

In [None]:
plt.figure(figsize=(18, 15))
# IDs come from mean_csv.index so they match the order of the means
sns.barplot(x=mean_csv.index.tolist(), y=mean_csv['Percent'],
            palette="rocket")
plt.ylabel('FVC Percentage')
plt.tick_params(
    axis='x',         
    which='both',     
    bottom=False,      
    top=False,        
    labelbottom=False) 

#### Age of Each Patient

In [None]:
plt.figure(figsize=(15, 12))
# IDs come from mean_csv.index so they match the order of the ages
sns.swarmplot(x=mean_csv.index.tolist(), y=mean_csv['Age'])
plt.ylabel('Age')
plt.tick_params(
    axis='x',         
    which='both',     
    bottom=False,      
    top=False,        
    labelbottom=False) 

The above swarm plot shows that most of the patients are aged between 63 and 73. Few people below the age of 60 are suffering from pulmonary fibrosis. The same is true for people above the age of 80.

**I hope that you liked this basic EDA. A lot more can be done with the data that we have at hand. But for now, it's time to build some deep learning models.**