<img src='https://img.medscape.com/news/2015/msr_150318_idiopathic_pulmonary_fibrosis_800x600.jpg'>
<br>
<h1><center>OSIC Pulmonary Fibrosis Progression - EDA (Beginner Friendly)</center></h1>
    
###  What is Pulmonary fibrosis?
* Pulmonary fibrosis is a condition in which the lungs become scarred over time. Symptoms include shortness of breath, a dry cough, feeling tired, weight loss, and nail clubbing. Complications may include pulmonary hypertension, respiratory failure, pneumothorax, and lung cancer.

* Causes include environmental pollution, certain medications, connective tissue diseases, infections (including SARS), and interstitial lung diseases. Idiopathic pulmonary fibrosis (IPF), an interstitial lung disease of unknown cause, is the most common form. Diagnosis may be based on symptoms, medical imaging, lung biopsy, and lung function tests.

* There is no cure, and treatment options are limited. Treatment aims to improve symptoms and may include oxygen therapy and pulmonary rehabilitation. Certain medications may be used to try to slow the worsening of scarring. Lung transplantation may occasionally be an option. At least 5 million people are affected globally. Life expectancy is generally less than five years.
    
### Signs and symptoms
* Shortness of breath, particularly with exertion
* Chronic dry, hacking cough
* Fatigue and weakness
* Chest discomfort including chest pain
* Loss of appetite and rapid weight loss
    
### Source: <a href='https://en.wikipedia.org/wiki/Pulmonary_fibrosis'>Wikipedia</a>
    
* Thank you <a href='https://www.kaggle.com/piantic'>Heroseo</a> for your amazing notebook!

In [None]:
import os
from os import listdir
import pandas as pd
import numpy as np
import glob
import tqdm
from typing import Dict
import matplotlib.pyplot as plt
%matplotlib inline

#plotly
!pip install chart_studio
import plotly.express as px
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot
import cufflinks
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')

#color
from colorama import Fore, Back, Style

import seaborn as sns
sns.set(style="whitegrid")

#pydicom
import pydicom

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')

#Sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Settings for pretty nice plots
plt.style.use('fivethirtyeight')
plt.show()

In [None]:
IMAGE_PATH='../input/osic-pulmonary-fibrosis-progression'
train=pd.read_csv('../input/osic-pulmonary-fibrosis-progression/train.csv')
test=pd.read_csv('../input/osic-pulmonary-fibrosis-progression/test.csv')


In [None]:
train.head()

In [None]:
train.shape

In [None]:
train['Patient'].nunique()

In [None]:
train.describe()

In [None]:
test.head()

# Basic EDA

### General info

In [None]:
print(Fore.BLUE+'Info about training set:',Style.RESET_ALL)
print(train.info())
print(Fore.YELLOW+'Info about testing set:',Style.RESET_ALL)
print(test.info())

In [None]:
print(Fore.BLUE+'Total number of patient entries are',Style.RESET_ALL,f"{train.shape[0]},",Fore.YELLOW+'whereas total number of unique patients are',Style.RESET_ALL,f"{train['Patient'].nunique()}.")

In [None]:
print(Fore.BLUE+'Total number of patient entries in training set are',Style.RESET_ALL,f"{train.shape[0]},",Fore.YELLOW+'whereas total number of patient entries in testing set are',Style.RESET_ALL,f"{test.shape[0]}.")

In [None]:
s_train=set(train['Patient'])
s_test=set(test['Patient'])

In [None]:
s_train.intersection(s_test)

The intersection shows that 5 patients in the testing set also appear in the training set.
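
The same overlap check can be sketched on toy data. The patient IDs below are made up and the frames `train_toy` and `test_toy` are hypothetical stand-ins for `train` and `test`; besides listing the shared IDs, the sketch also reports what share of test patients appear in the training table.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv; the patient IDs are made up.
train_toy = pd.DataFrame({"Patient": ["A", "A", "B", "C", "C", "D"]})
test_toy = pd.DataFrame({"Patient": ["C", "D"]})

# Unique patient IDs present in both tables.
overlap = sorted(set(train_toy["Patient"]) & set(test_toy["Patient"]))

# Share of test patients that also appear in the training table.
share_in_train = test_toy["Patient"].isin(train_toy["Patient"]).mean()

print(overlap, share_in_train)
```

If the share is 1.0, every test patient is also a training patient, which is worth knowing before using the test table for any kind of validation.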

### Missing data

In [None]:
train.isnull().sum()

No missing data in the training set!

In [None]:
test.isnull().sum()

No missing data in the testing set!

In [None]:
train['Sex'].value_counts()

79% of the patient records in the training dataset are of Male Patients!

In [None]:
test['Sex'].value_counts()

All of the patients in the testing set are Male!

In [None]:
train['Age'].describe()

* As you can see from the above table, the mean age of the patients is approximately 67 years.
* The youngest patient in the training set is 49 years old.
* The oldest patient in the training set is 88 years old.

In [None]:
train['SmokingStatus'].value_counts()

In [None]:
#Creating a dataset consisting of only one record per patient
train_dir = '../input/osic-pulmonary-fibrosis-progression/train/'
test_dir = '../input/osic-pulmonary-fibrosis-progression/test/'

patient_ids = os.listdir(train_dir)
patient_ids = sorted(patient_ids)

#Creating new rows
no_of_instances = []
age = []
sex = []
smoking_status = []
mean_FVC=[]
rec_checkup_weekno=[]
first_checkup_weekno=[]
min_FVC=[]
max_FVC=[]
for patient_id in patient_ids:
    patient_info = train[train['Patient'] == patient_id].reset_index()
    no_of_instances.append(len(os.listdir(train_dir + patient_id)))
    age.append(patient_info['Age'][0])
    sex.append(patient_info['Sex'][0])
    mean_FVC.append(round(patient_info['FVC'].mean()))
    min_FVC.append(patient_info['FVC'].min())
    max_FVC.append(patient_info['FVC'].max())
    rec_checkup_weekno.append(patient_info['Weeks'].max())
    first_checkup_weekno.append(patient_info['Weeks'].min())
    smoking_status.append(patient_info['SmokingStatus'][0])

#Creating the dataframe for the patient info    
patient_df = pd.DataFrame(list(zip(patient_ids, no_of_instances, age, sex,mean_FVC,min_FVC,max_FVC,rec_checkup_weekno,first_checkup_weekno, smoking_status)), 
                                 columns =['Patient', 'no_of_instances', 'Age', 'Sex','Mean FVC','Min FVC','Max FVC','Recent Checkup Week','First Checkup Week','SmokingStatus'])
print(patient_df.info())
patient_df.head()
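
The per-patient table above is built with an explicit loop. A possible alternative, sketched here on made-up rows, is pandas named aggregation. The frame `train_toy` and the output column names are hypothetical, and the sketch leaves out `no_of_instances`, which comes from counting files on disk rather than from the CSV.

```python
import pandas as pd

# Toy stand-in for train.csv; values are made up for illustration.
train_toy = pd.DataFrame({
    "Patient": ["A", "A", "A", "B", "B"],
    "Weeks": [-4, 5, 17, 0, 12],
    "FVC": [2300, 2200, 2100, 3000, 2900],
    "Age": [70, 70, 70, 65, 65],
    "Sex": ["Male"] * 3 + ["Female"] * 2,
    "SmokingStatus": ["Ex-smoker"] * 3 + ["Never smoked"] * 2,
})

# One row per patient, using named aggregation instead of an explicit loop.
patient_toy = (
    train_toy.groupby("Patient")
    .agg(
        age=("Age", "first"),
        sex=("Sex", "first"),
        smoking_status=("SmokingStatus", "first"),
        mean_fvc=("FVC", "mean"),
        min_fvc=("FVC", "min"),
        max_fvc=("FVC", "max"),
        first_week=("Weeks", "min"),
        recent_week=("Weeks", "max"),
    )
    .reset_index()
)
print(patient_toy)
```

Taking `"first"` for Age, Sex and SmokingStatus assumes these values are constant within a patient, which is the same assumption the loop makes by reading row 0.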

# Detailed EDA


### Distribution of Age

In [None]:
import scipy.stats  # import the submodule explicitly so scipy.stats.norm is available

data = patient_df.Age.tolist()
plt.figure(figsize=(18,6))
_, bins, _ = plt.hist(data, 45, density=1, alpha=0.5)
mu, sigma = scipy.stats.norm.fit(data)
best_fit_line = scipy.stats.norm.pdf(bins, mu, sigma)
plt.plot(bins, best_fit_line, color = 'b', linewidth = 3, label = 'fitting curve')
plt.title(f'Age Distribution [ mean = {"{:.2f}".format(mu)}, standard_dev = {"{:.2f}".format(sigma)} ]', fontsize = 18)
plt.xlabel('Age -->')
plt.show()
patient_df['Age'].iplot(kind='hist',bins=45,color='blue',xTitle='Age Distribution',yTitle='Count')

**Inference:**
* From the first figure, age looks approximately normally distributed around a mean of about 67 years.
* 15 patients are 65 years old, followed by 14 patients who are 69.
* Only 5 patients are above 80 years, which is roughly 3% of all patients. But why? Since there is no cure, this is unlikely to mean the disease resolves with age; more likely the dataset simply has more patients in their 60s and 70s.
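
Judging normality by eye from a histogram is subjective. One way to back it up, sketched here on synthetic ages rather than the real `patient_df`, is a Shapiro-Wilk test together with the sample skewness. The sample size, mean and spread below are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic ages, NOT the competition data: draws centred on 67 years.
ages = rng.normal(loc=67, scale=7, size=176)

# Shapiro-Wilk test: a small p-value is evidence against normality.
statistic, p_value = stats.shapiro(ages)

# Sample skewness: values near 0 suggest a roughly symmetric distribution.
skewness = stats.skew(ages)

print(f"W = {statistic:.3f}, p = {p_value:.3f}, skew = {skewness:.3f}")
```

To try this on the notebook's data, `ages` could be replaced by `patient_df['Age']`.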

### FVC vs Age vs Smoking Status

In [None]:
fig = px.scatter(patient_df, x="Age", y="Mean FVC", color='SmokingStatus')
fig.show()

**Inference**
* Oldest person (88) in the group was an Ex-Smoker.
* Youngest person (49) in the group still smokes!
* Highest mean FVC is 5845 ml.
* Lowest mean FVC is 988 ml.


### Age vs Sex (Along with details of Mean FVC, Recent Checkup and SmokingStatus)

In [None]:
fig = px.scatter(patient_df, x="Age", color='Sex',hover_data=['Mean FVC','Recent Checkup Week','SmokingStatus'])
fig.show()

**Inference**
* Oldest male is 88 years old and is an Ex-Smoker with a mean FVC of 1981 ml (Recent Checkup week:83)
* Oldest female is 87 years old; she never smoked and has a mean FVC of 1981 ml (Recent Checkup week:59)
* Youngest person in the group is a female and she is a current smoker with mean FVC of 2915 ml. She really needs to stop smoking! (Recent Checkup week:79)
* Youngest male is 51 years and he never smoked and has a mean FVC of 2526 ml (Recent Checkup week:55)

### FVC

### Distribution of FVC (On whole dataset)

In [None]:
data = train.FVC.tolist()
plt.figure(figsize=(18,6))
_, bins, _ = plt.hist(data, 45, density=1, alpha=0.5)
mu, sigma = scipy.stats.norm.fit(data)
best_fit_line = scipy.stats.norm.pdf(bins, mu, sigma)
plt.plot(bins, best_fit_line, color = 'b', linewidth = 3, label = 'fitting curve')
plt.title(f'FVC Distribution [ mean = {"{:.2f}".format(mu)}, standard_dev = {"{:.2f}".format(sigma)} ]', fontsize = 18)
plt.xlabel('FVC -->')
plt.show()



train['FVC'].iplot(kind='hist',
                      xTitle='Lung Capacity(ml)', 
                      linecolor='black', 
                      opacity=0.8,
                      color='blue',
                      bargap=0.5,
                      gridcolor='white',
                      title='Distribution of FVC (On whole dataset)')

**Inference**
* The FVC distribution is slightly positively skewed (right-skewed), and this is because of some outliers.
* Mean FVC of the whole dataset: 2690.48 ml
* For a positively skewed distribution we typically expect mean > median > mode.
* The only 100 ml range of FVC values with 100+ records is 2800-2899 ml.
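
The skewness claim can be checked numerically instead of read off the plot. The sketch below uses a synthetic right-skewed sample (a lognormal draw with arbitrary parameters, not the real FVC column) and compares the mean with the median alongside the skewness coefficient.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed values standing in for the FVC column (not real data).
fvc_toy = pd.Series(rng.lognormal(mean=7.8, sigma=0.3, size=5000))

mean_value = fvc_toy.mean()
median_value = fvc_toy.median()
skewness = fvc_toy.skew()  # positive for a right-skewed sample

print(f"mean = {mean_value:.1f}, median = {median_value:.1f}, skew = {skewness:.2f}")
```

On the notebook's data the same three calls on `train['FVC']` would show whether the mean really sits above the median.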

### Case Study of a patient (Patient ID=ID00078637202199415319443)

In [None]:
patient_df.sort_values(by='no_of_instances',ascending=False)

* I selected this patient solely because she has the highest number of instances (files in her scan folder) in the training dataset and appears to have had regular checkups.

In [None]:
train_ID00078637202199415319443=train[train['Patient']=='ID00078637202199415319443']

In [None]:
train_ID00078637202199415319443

* The main purpose of this case study is to see whether the condition of the patient is really improving or not.

In [None]:
fig = px.line(train_ID00078637202199415319443, x="Weeks", y="FVC", title='FVC really increasing?')
fig.show()

* The graph above shows that the patient's condition is not improving even with regular checkups and medication, as her FVC levels are dropping consistently.
* Of course, we cannot assume this just from the results of 1 patient. So let us look at 10 patients.

In [None]:
train_all=patient_df.sort_values(by='no_of_instances',ascending=False).head(10)
train_all

In [None]:
l=list(train_all['Patient'].values)

In [None]:
train[train['Patient'].isin(l)]  # rows belonging to the 10 selected patients

In [None]:
df= pd.DataFrame(columns=['Patient','Weeks','FVC','Percent','Age','Sex','SmokingStatus'])
df

In [None]:
for i in l:
    t=train[train['Patient']==i]
    frames=[df,t]
    df=pd.concat(frames)

In [None]:
df

In [None]:
fig = px.line(df, x="Weeks", y="FVC", color='Patient')
fig.show()

**Inference**
* Most of the patients' conditions aren't improving, as their FVC levels are not increasing but instead dropping slightly.
* Patient 'ID00042637202184406822975' had considerable improvement in FVC levels, but that improvement stopped at week 44 and FVC has declined continuously since then.
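
Reading trends off overlapping line plots is hard, so a per-patient slope can make "dropping" concrete. This is a minimal sketch on made-up visits: the frame `visits` and the helper `fvc_slope` are hypothetical, and a straight-line fit is an assumption that ignores patterns like the rise-then-fall described above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the rows of a few patients; values are made up.
visits = pd.DataFrame({
    "Patient": ["A"] * 4 + ["B"] * 4,
    "Weeks": [0, 10, 20, 30, 0, 10, 20, 30],
    "FVC": [3000, 2950, 2900, 2850, 2500, 2520, 2540, 2560],
})

def fvc_slope(group: pd.DataFrame) -> float:
    """Least-squares slope of FVC against Weeks, in ml per week."""
    slope, _intercept = np.polyfit(group["Weeks"], group["FVC"], deg=1)
    return float(slope)

# Fit one line per patient and collect those with a negative slope.
slopes = {patient: fvc_slope(group) for patient, group in visits.groupby("Patient")}
declining = [patient for patient, slope in slopes.items() if slope < 0]
print(slopes, declining)
```

A negative slope means FVC falls over time for that patient; counting how many of the selected patients land in `declining` gives a number to go with the plot.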

## Conclusion: For these 10 patients, FVC levels aren't improving with treatment and regular checkups.

### Smoking Status

In [None]:
patient_df['SmokingStatus'].value_counts()

In [None]:
patient_df['SmokingStatus'].value_counts().iplot(kind='bar',
                                              yTitle='Counts', 
                                              linecolor='black', 
                                              opacity=0.7,
                                              color='red',
                                              theme='pearl',
                                              bargap=0.5,
                                              gridcolor='white',
                                              title='Distribution of the SmokingStatus of the patients')

**Inference**
* Most of the patients (118) are ex-smokers, and 9 patients still smoke!
* Approximately 28% of the patients never smoked in their lives, and yet they are affected by pulmonary fibrosis.
* So smoking status alone does not determine who gets this disease. Still, 72% of the patients either smoke currently or have smoked before!
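
The percentages above follow directly from the counts. As a small worked sketch, the series below hard-codes the counts reported in this notebook (118 ex-smokers, 9 current smokers, and 49 never-smokers, i.e. 26 male plus 23 female) instead of reading them from `patient_df`.

```python
import pandas as pd

# Counts as reported in this notebook, hard-coded for illustration.
smoking_counts = pd.Series(
    {"Ex-smoker": 118, "Never smoked": 49, "Currently smokes": 9}
)

# Share of each category as a percentage of all patients.
shares = (smoking_counts / smoking_counts.sum() * 100).round(1)
print(shares)
```

On the real table, `patient_df['SmokingStatus'].value_counts(normalize=True)` gives the same shares as fractions.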

In [None]:
patient_df[['SmokingStatus','Sex']].value_counts().iplot(kind='bar',
                                              yTitle='Counts', 
                                              linecolor='black', 
                                              opacity=0.7,
                                              color='blue',
                                              theme='pearl',
                                              bargap=0.5,
                                              gridcolor='white',
                                              title='Distribution of the SmokingStatus of the patients along with their gender')

**Inference**
* Most of the Ex-smokers are Male.
* Because the dataset consists mostly of male patient records (79%), we might expect males to dominate all 3 smoking categories (Ex-smoker, Never smoked and Currently smokes). But that is not the case here: almost equal numbers of male patients (26) and female patients (23) never smoked.

### Percent

Percent- a computed field which approximates the patient's FVC as a percent of the typical FVC for a person of similar characteristics.
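
To make the definition concrete, here is a tiny worked sketch. The helper `percent_of_typical` and the numbers are hypothetical; how the dataset derives the "typical" FVC for a person is not described in this notebook, so it is simply taken as an input.

```python
def percent_of_typical(fvc_ml: float, typical_fvc_ml: float) -> float:
    """Return measured FVC as a percentage of a reference (typical) FVC."""
    if typical_fvc_ml <= 0:
        raise ValueError("typical_fvc_ml must be positive")
    return 100.0 * fvc_ml / typical_fvc_ml

# Hypothetical numbers: a measured FVC of 2000 ml against a typical 4000 ml.
example = percent_of_typical(2000, 4000)
print(example)
```

Under this reading, a Percent below 100 means the patient's FVC is lower than what would be expected for someone with similar characteristics.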


### Distribution of percentage

In [None]:
data = train.Percent.tolist()
plt.figure(figsize=(18,6))
_, bins, _ = plt.hist(data, 45, density=1, alpha=0.5)
mu, sigma = scipy.stats.norm.fit(data)
best_fit_line = scipy.stats.norm.pdf(bins, mu, sigma)
plt.plot(bins, best_fit_line, color = 'b', linewidth = 3, label = 'fitting curve')
plt.title(f'Percent Distribution [ mean = {"{:.2f}".format(mu)}, standard_dev = {"{:.2f}".format(sigma)} ]', fontsize = 18)
plt.xlabel('Percent -->')
plt.show()

train['Percent'].iplot(kind='hist',bins=30,color='blue',xTitle='Percent distribution',yTitle='Count')

**Inference**
* As we can see in the above plots, the 'Percent' distribution is slightly **positively skewed**, with mean 77.67.
* For a positively skewed distribution we typically expect mean > median > mode.

**Note:** The above graphs are not of the individual patients but instead of all the patient records in the original dataset

### FVC vs Percent vs Age

In [None]:
dfFPA=train[['FVC','Percent','Age']].corr()

In [None]:
dfFPA

In [None]:
fig,ax =plt.subplots(figsize=(12,7))
title='FVC vs Percent vs Age'
plt.title(title,fontsize=18)



sns.heatmap(dfFPA,annot=True)
plt.show()

**Inference**
* Oh my world! FVC and Percent are highly correlated, which makes sense because Percent is computed from the patient's FVC.


* Let's take a quick detour and plot the linear regression graph between FVC and Percent

In [None]:
df = train[['FVC','Percent']]
X = df.FVC.values.reshape(-1, 1)

model = LinearRegression()
model.fit(X, df.Percent)

x_range = np.linspace(X.min(), X.max(), 100)
y_range = model.predict(x_range.reshape(-1, 1))

fig = px.scatter(train, x='FVC', y='Percent', color='Age', opacity=0.65)
fig.add_traces(go.Scatter(x=x_range, y=y_range, name='Regression Fit'))
fig.show()

**Inference**
* We fitted a regression line, but we trained on the whole dataset without holding out any test data, so we cannot tell how well the fit generalizes.
* So let's split the dataset into train and test sets and then fit the regression line again.
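
One way to put a number on generalization is to compare the R² score on the training split with the score on the held-out split. The sketch below does this on synthetic data; the linear relationship, the noise level and the variable names are assumptions for illustration, not the real FVC and Percent columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic data standing in for FVC (ml) and Percent; not the real columns.
fvc = rng.uniform(1000, 5000, size=500).reshape(-1, 1)
percent = 0.02 * fvc.ravel() + 20 + rng.normal(0, 8, size=500)

X_train, X_test, y_train, y_test = train_test_split(fvc, percent, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# R^2 on data the model was fitted on versus data it has not seen.
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
print(f"train R^2 = {r2_train:.3f}, test R^2 = {r2_test:.3f}")
```

If the two scores are close, the line generalizes about as well as it fits. Note that a random row-level split can put visits from the same patient in both splits, which can make the test score look better than it would on unseen patients.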

In [None]:
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = train[['FVC','Percent']]
X = df.FVC.values.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, df.Percent, random_state=0)

model1 = LinearRegression()
model1.fit(X_train, y_train)

x_range = np.linspace(X.min(), X.max(), 100)
y_range = model1.predict(x_range.reshape(-1, 1))


fig = go.Figure([
    go.Scatter(x=X_train.squeeze(), y=y_train, name='train', mode='markers'),
    go.Scatter(x=X_test.squeeze(), y=y_test, name='test', mode='markers'),
    go.Scatter(x=x_range, y=y_range, name='prediction')
])
fig.update_layout(xaxis_title="FVC",
    yaxis_title="Percent", 
    title="Generalized Regression fit",
)
fig.show()

### FVC vs Percent vs Weeks

In [None]:
dfFPW=train[['FVC','Percent','Weeks']].corr()


In [None]:
fig,ax =plt.subplots(figsize=(12,7))
title='FVC vs Percent vs Weeks'
plt.title(title,fontsize=18)



sns.heatmap(dfFPW,cmap='RdYlGn',annot=True)
plt.show()

### Percent vs SmokingStatus (On whole dataset)

In [None]:
df = train
fig = px.violin(df, y='Percent', x='SmokingStatus', box=True, color='Sex', points="all",
          hover_data=train.columns)
fig.show()

### Distribution of Percent and FVC in each age group

In [None]:

fig = px.bar(train, x='Age', y='Percent',
             color='FVC',
             height=400)
fig.show()

**Inference**
* All records with FVC above 5000 ml come from patients who are 71 years old. Hmm, interesting!
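
A stacked bar chart makes this kind of claim easy to misread, so it helps to confirm it with a direct filter. The sketch below uses made-up rows (the frame `train_toy` is hypothetical) and also counts how many distinct patients are behind the high-FVC records.

```python
import pandas as pd

# Toy stand-in for train.csv; values are made up for illustration.
train_toy = pd.DataFrame({
    "Patient": ["A", "A", "B", "C"],
    "Age": [71, 71, 65, 71],
    "FVC": [5200, 5100, 3000, 4800],
})

high_fvc = train_toy[train_toy["FVC"] > 5000]

# Ages and number of distinct patients behind the high-FVC records.
ages_high_fvc = sorted(high_fvc["Age"].unique().tolist())
patients_high_fvc = high_fvc["Patient"].nunique()
print(ages_high_fvc, patients_high_fvc)
```

If the patient count turns out to be very small, the pattern may reflect one or two individuals rather than an age effect.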

## DICOM EDA: Coming Soon!