# 1. Introduction

### What is Pulmonary Fibrosis? 

Pulmonary fibrosis is a lung disease that occurs when lung tissue becomes damaged and scarred. This thickened, stiff tissue makes it more difficult for your lungs to work properly. As pulmonary fibrosis worsens, you become progressively more short of breath.

The scarring associated with pulmonary fibrosis can be caused by a multitude of factors. But in most cases, doctors can't pinpoint what's causing the problem. When a cause can't be found, the condition is termed idiopathic pulmonary fibrosis.

The lung damage caused by pulmonary fibrosis can't be repaired, but medications and therapies can sometimes help ease symptoms and improve quality of life. For some people, a lung transplant might be appropriate.

![Image from Mayoclinic](https://www.mayoclinic.org/-/media/kcms/gbs/patient-consumer/images/2016/08/10/14/57/mcdc7_pulmonaryfibrosis-8col.jpg)

### What are the Symptoms of Pulmonary Fibrosis?

Signs and symptoms of pulmonary fibrosis may include:

* Shortness of breath (dyspnea)
* A dry cough
* Fatigue
* Unexplained weight loss
* Aching muscles and joints
* Widening and rounding of the tips of the fingers or toes (clubbing)

The course of pulmonary fibrosis, and the severity of symptoms, can vary considerably from person to person. Some people become ill very quickly with severe disease. Others have moderate symptoms that worsen more slowly, over months or years.

Install and Import Necessary Libraries

In [None]:
! pip -q install chart_studio

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pydicom
from tqdm.notebook import tqdm
import glob
import random
import os

import matplotlib.animation as animation
from matplotlib.widgets import Slider
from IPython.display import HTML, Image

import plotly.express as px
import plotly.figure_factory as ff
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot

from mpl_toolkits.mplot3d.art3d import Poly3DCollection
import scipy.ndimage
from skimage import morphology
from skimage import measure

from colorama import Fore, Back, Style

DATA_DIR = "../input/osic-pulmonary-fibrosis-progression"
plt.style.use("fivethirtyeight")

# 2. EDA
Let's now explore the data.

In [None]:
train_df = pd.read_csv(os.path.join(DATA_DIR, "train.csv"))
test_df = pd.read_csv(os.path.join(DATA_DIR, "test.csv"))
train_df.head()

## 2.1 Dry EDA
Let's compute some basic statistics to get ourselves warmed up.

### 2.1.1 Unique Patients

In [None]:
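# count() returns the number of rows (one per patient visit), not the number of distinct patients
print(Fore.BLUE+f"In total there are {train_df['Patient'].count()} patient records (rows) in the train set and {test_df['Patient'].count()} patient records (rows) in the test set"+Style.RESET_ALL)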
print(Fore.GREEN+f"Of which there are {train_df['Patient'].nunique()} unique patients in the train set"+Style.RESET_ALL)
print(Fore.YELLOW+f"And {test_df['Patient'].nunique()} unique patients in the test set"+Style.RESET_ALL)

### 2.1.2 Null Values

In [None]:
# isna().sum().sum() counts every missing cell; any().isna() would only inspect one boolean per column
print(Fore.GREEN+f"There are {train_df.isna().sum().sum()} null values in the train set"+Style.RESET_ALL)
print(Fore.YELLOW+f"There are {test_df.isna().sum().sum()} null values in the test set"+Style.RESET_ALL)
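
A per-column breakdown is often more useful than a single total, because it shows where the gaps are. Here is a minimal sketch on a made-up frame; `toy_df` and its values are assumptions for illustration and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df (values are made up)
toy_df = pd.DataFrame({
    "Patient": ["A", "A", "B"],
    "FVC": [2500.0, np.nan, 3100.0],
    "Age": [70, 70, np.nan],
})

# Missing cells per column, then the grand total across the whole frame
nulls_per_column = toy_df.isna().sum()
total_nulls = int(nulls_per_column.sum())

print(nulls_per_column)
print(f"Total missing cells: {total_nulls}")
```

The same two lines should work on `train_df` and `test_df` unchanged.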

### 2.1.3 Intersecting Samples

Let's check if any patients appear in both the training and testing sets.
As it turns out, all 5 patients in the test set are also present in the training set. Scores on this test set may therefore look better than they should and can hide overfitting.

In [None]:
train_set = set(train_df['Patient'])
test_set = set(test_df['Patient'])
inter = train_set.intersection(test_set)

print(Fore.CYAN + f"There are {len(inter)} Same Samples between both datasets" + Style.RESET_ALL)
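
If we wanted a hold-out set without this overlap, one option is to drop the shared patients from the training rows. This is only a sketch on made-up frames; `toy_train`, `toy_test` and their values are hypothetical, and it is not what the rest of this notebook does.

```python
import pandas as pd

# Hypothetical toy frames standing in for train_df and test_df
toy_train = pd.DataFrame({
    "Patient": ["A", "A", "B", "C"],
    "FVC": [2500, 2450, 3100, 2800],
})
toy_test = pd.DataFrame({
    "Patient": ["B", "C"],
    "FVC": [3050, 2790],
})

# Patients that appear in both sets
shared = set(toy_train["Patient"]) & set(toy_test["Patient"])

# Keep only the training rows whose patient never shows up in the test set
train_only = toy_train[~toy_train["Patient"].isin(shared)]

print(f"Shared patients: {sorted(shared)}")
print(train_only)
```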

### 2.1.4 DICOM Data Statistics

In [None]:
nb_train_imgs = glob.glob(os.path.join(DATA_DIR, "train/**/*.dcm"))
nb_test_imgs = glob.glob(os.path.join(DATA_DIR, "test/**/*.dcm"))

print(Fore.BLUE + f"There are {len(nb_train_imgs)+len(nb_test_imgs)} total DICOM files in this dataset."+Style.RESET_ALL)
print(Fore.GREEN+f"In training set, there are: {len(nb_train_imgs)} DICOM files"+Style.RESET_ALL)
print(Fore.YELLOW+f"In testing set, there are {len(nb_test_imgs)} DICOM files"+Style.RESET_ALL)

In [None]:
# Divide by the number of unique patients; count() would divide by the number of rows (visits)
avg_train_imgs = len(nb_train_imgs) // train_df['Patient'].nunique()
avg_test_imgs = len(nb_test_imgs) // test_df['Patient'].nunique()

print(Fore.GREEN+f"In training set, an average patient has: {avg_train_imgs} images"+Style.RESET_ALL)
print(Fore.YELLOW+f"In testing set, an average patient has: {avg_test_imgs} images"+Style.RESET_ALL)
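
An average hides how uneven the scans are, so it can help to count the images per patient folder. Below is a small sketch that groups file paths by their parent folder; the paths in `toy_paths` are made up, and I am assuming the `train/<patient_id>/<n>.dcm` layout used by the glob pattern above.

```python
from collections import Counter
from pathlib import PurePosixPath
from statistics import mean

# Hypothetical paths following the train/<patient_id>/<n>.dcm layout
toy_paths = [
    "train/ID_A/1.dcm",
    "train/ID_A/2.dcm",
    "train/ID_A/3.dcm",
    "train/ID_B/1.dcm",
]

# The parent folder name is the patient id, so counting it gives images per patient
images_per_patient = Counter(PurePosixPath(p).parent.name for p in toy_paths)
avg_images = mean(images_per_patient.values())

print(images_per_patient)
print(f"Average images per patient: {avg_images}")
```

On the real data, `toy_paths` could be replaced with `nb_train_imgs`; on Windows-style paths `pathlib.Path` would be the safer choice.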

## 2.2 In-depth Analysis

Before we continue, let's make a new dataframe with only `Patient`, `Age`, `Sex` and `SmokingStatus`, drop the duplicate rows so each patient appears once, and shuffle it so we can easily create some visualizations from it.

In [None]:
new_df = train_df[['Patient', 'Age', 'Sex', 'SmokingStatus']].drop_duplicates()
new_df = new_df.sample(frac=1).reset_index(drop=True)
new_df.head()

### 2.2.1 Smoking Status Viz.
Let's now visualize how many of the patients fall under `Currently smokes`, `Ex-smoker` and `Never smoked` with the help of a pie chart. I have seen many excellent notebooks and visualizations that use bar graphs for this purpose, but I (personally) think that pie charts are more intuitive and severely underrated.

In [None]:
# Take the labels from value_counts() itself so they always line up with the counts
counts = new_df['SmokingStatus'].value_counts()
vals = counts.tolist()
idx = counts.index.tolist()
fig = px.pie(
    values=vals,
    names=idx,
    title='Smoking Status of Patients',
    color_discrete_sequence=['cyan', 'blue', 'darkblue']
)
iplot(fig)

As we can see, the majority of the patients are ex-smokers, followed by those who never smoked and then current smokers. One possible reason ex-smokers make up such a majority is that smoking has long-lasting effects even after you quit, although this dataset alone cannot confirm that. If I am wrong, please correct me!

### 2.2.2 Gender Values Pie Chart
Let's now use our friendly pie chart to see how many patients are male and how many are female.

In [None]:
vals = new_df['Sex'].value_counts().tolist()
idx = new_df['Sex'].value_counts().keys().tolist()
fig = px.pie(
    values=vals,
    names=idx,
    title='Gender Distribution of Patients',
    color_discrete_sequence=['blue', 'magenta']
)
iplot(fig)

### 2.2.3 Unique Patients Age Distribution
Also take a look at Age Distribution of both genders.

In [None]:
fig = px.histogram(
    new_df, x="Age",
    marginal="violin",
    hover_data=new_df.columns,
    color='Sex',
    color_discrete_sequence=['blue', 'magenta'],
    title=f"Unique Patients Age Distribution [\u03BC : ~{int(new_df['Age'].mean())} years | \u03C3 : ~{int(new_df['Age'].std())} years]",
)

iplot(fig)

As we can see, most of the patients are between the ages of `60-75`, which is consistent with pulmonary fibrosis mainly affecting older adults.

### 2.2.4 Weeks Distribution
Similar to Age Distribution Above, now let's plot a histogram of total Weeks Distribution of both genders.

In [None]:
fig = px.histogram(
    train_df, x="Weeks",
    marginal="box",
    hover_data=train_df.columns,
    color='Sex',
    title=f"Weeks Distribution [\u03BC : ~{int(train_df['Weeks'].mean())} weeks | \u03C3 : ~{int(train_df['Weeks'].std())} weeks]",
)

iplot(fig)

In [None]:
print(Fore.BLUE + f"Maximum Weeks for a Male Patient are: {train_df.loc[train_df['Sex']=='Male', 'Weeks'].max()}, average are: {int(train_df.loc[train_df['Sex']=='Male', 'Weeks'].mean())} and minimum are: {train_df.loc[train_df['Sex']=='Male', 'Weeks'].min()}" + Style.RESET_ALL)
print(Fore.MAGENTA + f"Maximum Weeks for a Female Patient are: {train_df.loc[train_df['Sex']=='Female', 'Weeks'].max()}, average are: {int(train_df.loc[train_df['Sex']=='Female', 'Weeks'].mean())} and minimum are: {train_df.loc[train_df['Sex']=='Female', 'Weeks'].min()}" + Style.RESET_ALL)

### 2.2.5 Age Distribution for Male and Female Patients

In [None]:
plt.figure(figsize=(16, 6))
# `fill` replaces the deprecated `shade` argument (seaborn >= 0.11)
sns.kdeplot(new_df.loc[new_df['Sex'] == 'Male', 'Age'], label='Male', fill=True)
sns.kdeplot(new_df.loc[new_df['Sex'] == 'Female', 'Age'], label='Female', fill=True)

# Labeling of plot
plt.xlabel('Age')
plt.ylabel('Density')
plt.title('Distribution of Ages for Male and Female Patients')
plt.legend()
plt.show()

### 2.2.6 Age Distribution for different smoking status categories

In [None]:
plt.figure(figsize=(16, 6))
# `fill` replaces the deprecated `shade` argument (seaborn >= 0.11)
sns.kdeplot(new_df.loc[new_df['SmokingStatus'] == 'Currently smokes', 'Age'], label='Currently Smokes', fill=True)
sns.kdeplot(new_df.loc[new_df['SmokingStatus'] == 'Never smoked', 'Age'], label='Never Smoked', fill=True)
sns.kdeplot(new_df.loc[new_df['SmokingStatus'] == 'Ex-smoker', 'Age'], label='Ex-Smoker', fill=True)

plt.xlabel('Age')
plt.ylabel('Density')
plt.title('Distribution of Ages vs SmokingStatus Category')
plt.legend()
plt.show()

We can note a few things from the above plot:

- Most Current Smokers lie in the age range of `~62-72`
- Most People who never smoked lie in the age range of `~55-75` 
- Most Ex-smokers lie in the age range of `~65-80`

In simple words: **non-smokers tend to be younger, current smokers tend to be between `60-70` (let's say old), and ex-smokers tend to be in `70-80` (let's say oldest).**
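
Age ranges read off a KDE plot are only eyeballed, so a numeric summary per category is a useful cross-check. Here is a minimal sketch using `groupby`; `toy_patients` and its ages are made up for illustration and do not reproduce the ranges quoted above.

```python
import pandas as pd

# Hypothetical one-row-per-patient frame standing in for new_df
toy_patients = pd.DataFrame({
    "SmokingStatus": ["Ex-smoker"] * 3 + ["Never smoked"] * 3,
    "Age": [66, 70, 74, 58, 64, 70],
})

# Minimum, median and maximum age within each smoking category
age_summary = toy_patients.groupby("SmokingStatus")["Age"].agg(["min", "median", "max"])

print(age_summary)
```

Running the same `groupby` on `new_df` should show whether the median ages really differ between the categories.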

### 2.2.7 FVC Capacity
The histogram and violin plot below show the distribution of Forced Vital Capacity (FVC) values.

In [None]:
fig = px.histogram(
    train_df,
    x='FVC',
    marginal='violin',
    hover_data=train_df.columns,
    color_discrete_sequence=['maroon'],
    title=f"FVC Count Distribution [ \u03BC : {int(train_df['FVC'].mean())} ml. | \u03C3 : {int(train_df['FVC'].std())} ml. ]"
)
iplot(fig)

We can see that the average Forced Vital Capacity is about 2690 milliliters with a standard deviation of about 830 milliliters. We also have a few outliers, such as the one on the far right of the plot with an FVC of ~6400 ml. That patient is at week 0, is a 71-year-old male and is an ex-smoker. One possible explanation for such a high FVC is that it was measured at week 0, but that is only a guess.
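
To look up the row behind an outlier like this, `idxmax` gives the index of the largest value, which can then be passed to `.loc`. This is a small sketch on a made-up frame; `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_df
toy_df = pd.DataFrame({
    "Patient": ["A", "A", "B"],
    "Weeks": [0, 10, 5],
    "FVC": [6400, 5900, 2700],
    "Age": [71, 71, 65],
})

# idxmax() returns the index label of the largest FVC; .loc pulls that whole row
top_row = toy_df.loc[toy_df["FVC"].idxmax()]

print(top_row)
```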

### 2.2.8 Patient's Lung Capacity over weeks
We'll now take a look at the lung capacity of 3 randomly picked patients with different smoking habits.

In [None]:
smoker = random.choice(train_df.query("SmokingStatus == 'Currently smokes'")['Patient'].unique())
non_smoker = random.choice(train_df.query("SmokingStatus == 'Never smoked'")['Patient'].unique())
exsmoker = random.choice(train_df.query("SmokingStatus == 'Ex-smoker'")['Patient'].unique())

fig = go.Figure()
fig.add_trace(go.Scatter(x=train_df[train_df.Patient==smoker]['Weeks'], y=train_df[train_df.Patient==smoker]['FVC'],
                    mode='lines+markers',
                    name='Current smoker'))
fig.add_trace(go.Scatter(x=train_df[train_df.Patient==non_smoker]['Weeks'], y=train_df[train_df.Patient==non_smoker]['FVC'],
                    mode='lines+markers',
                    name='Non-smoker'))
fig.add_trace(go.Scatter(x=train_df[train_df.Patient==exsmoker]['Weeks'], y=train_df[train_df.Patient==exsmoker]['FVC'],
                    mode='lines+markers', name='Ex-smoker'))

fig.update_layout(
    title="Patient Lung Capacity over weeks",
    xaxis_title="Weeks",
    yaxis_title="Lung Capacity (in ml)",
    legend_title="Smoker Status",
)

fig.show()

One thing to note here is that the non-smoker has had a low lung capacity since the very beginning (the three patients are picked at random, so this can differ between runs). I'm sure that I am wrong or missing something as I am a beginner, so if you can point out what I am getting wrong, please comment below; it'll be a learning experience for me 😀
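
A line plot shows the trend visually, but a single number per patient makes comparisons easier. One simple option is to fit a straight line to FVC against weeks and read off the slope. This is only a sketch: the `weeks` and `fvc` values are made up, and a linear trend is my own simplifying assumption, not something the data guarantees.

```python
import numpy as np

# Hypothetical measurements for one patient (made-up values)
weeks = np.array([0, 10, 20, 30], dtype=float)
fvc = np.array([3000, 2950, 2900, 2850], dtype=float)

# Least-squares straight line: fvc ~ slope * weeks + intercept
slope, intercept = np.polyfit(weeks, fvc, deg=1)

print(f"Estimated change in FVC: {slope:.1f} ml per week")
print(f"Estimated FVC at week 0: {intercept:.0f} ml")
```

A negative slope means the lung capacity is falling over time; the same fit could be applied to each patient's rows in `train_df`.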

## 2.3 DICOM Image Analysis 💻

Let's now analyze the DICOM images and see what they actually look like.

### 2.3.1 Multiple DICOM Image Plot
First, let's plot 20 DICOM images of a patient

In [None]:
def plot_dicom(patient_id="ID00019637202178323708467", cmap='jet'):
    # Folder holding this patient's slices (assumes the files are named 1.dcm, 2.dcm, ...)
    image_dir = os.path.join(DATA_DIR, "train", patient_id)
    fig = plt.figure(figsize=(12, 12))
    cols = 4
    row = 5
    # A figure-level title is not overwritten when the subplots are added
    fig.suptitle(f"DICOM Images of Patient: {patient_id}")
    for i in range(1, row*cols+1):
        filename = os.path.join(image_dir, str(i)+".dcm")
        image = pydicom.dcmread(filename)
        fig.add_subplot(row, cols, i)
        plt.grid(False)
        plt.imshow(image.pixel_array, cmap=cmap)

In [None]:
plot_dicom()

### 2.3.2 Lung CT Scan Animation
Let's make an animation out of the CT scans to see them change in real time (no, not real time, shouldn't have said that 😂).
I learnt how to make animations from [Pranav's Kernel](https://www.kaggle.com/pranavkasela/interactive-lung-ct-scan-with-animation).

In [None]:
%%capture
# Choose a patient id and then get all the dicom images of that patient
patient_id = "ID00012637202177665765362"
dicom_path = "/kaggle/input/osic-pulmonary-fibrosis-progression/train"

# Just Sort all the dicom files based only on their first names and then put back all the sorted files with their .dcm extensions
files = np.array([f.replace(".dcm","") for f in os.listdir(f"{dicom_path}/{patient_id}/")])
files = np.sort(files.astype("int"))
dicoms = [f"{dicom_path}/{patient_id}/{f}.dcm" for f in files]

# Iterate through all the dicom images, read every image and reshape it
# Then save the plt.imshow(img) in img_ and append it to an array 
# At the end, pass it to the ArtistAnimation function with 0.1s interval and 1s delay to make an animation out of them
ims = []
fig = plt.figure()
for img in dicoms:
    img_ = pydicom.dcmread(img).pixel_array.reshape(512, 512)
    img_ = plt.imshow(img_, animated=True, cmap='gray')
    plt.axis("off")
    ims.append([img_])

ani = animation.ArtistAnimation(fig, ims, interval=100, blit=False, repeat_delay=1000)

In [None]:
# Now just show the animations as an HTML component.
HTML(ani.to_jshtml())
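
The animation cell converts the file names to integers before sorting because a plain string sort would put `10.dcm` before `2.dcm` and scramble the slice order. Here is a tiny sketch of the difference; the names in `toy_files` are made up.

```python
import os

# Hypothetical slice file names in arbitrary order
toy_files = ["10.dcm", "2.dcm", "1.dcm"]

# Plain string sort compares character by character, so "10" lands before "2"
lexicographic = sorted(toy_files)

# Sorting on the integer part of the name restores the real slice order
numeric = sorted(toy_files, key=lambda name: int(os.path.splitext(name)[0]))

print(lexicographic)
print(numeric)
```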

# 3. Conclusion
This marks the end of this kernel. If you liked my work, please consider giving it an upvote and if I have missed something or maybe wrote something wrong, please make sure to correct me down below!

I have learnt and implemented a lot of things from [Heroseo's Kernel](https://www.kaggle.com/piantic/osic-pulmonary-fibrosis-progression-basic-eda) and also did a little bit of experimentation of my own. 

Thank you!