<img src='https://www.pulmonaryfibrosis.org/images/default-source/default-album/normal-and-impaired-gas-exchange.png?sfvrsn=c3b0918d_0'>
<h1><center>OSIC Pulmonary Fibrosis Progression - EDA</center></h1>
    
# 1. Let's Understand: What Is This Disease? ▶
    
###  1.1 What is Pulmonary fibrosis?
* [The word “pulmonary” means lung and the word “fibrosis” means scar tissue— similar to scars](https://www.pulmonaryfibrosis.org/life-with-pf/about-pf) that you may have on your skin from an old injury or surgery. So, in its simplest sense, pulmonary fibrosis (PF) means scarring in the lungs. Over time, the scar tissue can destroy the normal lung and make it hard for oxygen to get into your blood. Low oxygen levels (and the stiff scar tissue itself) can cause you to feel short of breath, particularly when walking and exercising.
    
###  1.2 What has to be done in this competition?
The aim of this competition is to predict a patient’s severity of decline in lung function based on a CT scan of their lungs. Lung function is assessed based on output from a spirometer, which measures the forced vital capacity (FVC), i.e. the volume of air exhaled.

In the dataset, you are provided with a baseline chest CT scan and associated clinical information for a set of patients. A patient has an image acquired at time Week = 0 and has numerous follow up visits over the course of approximately 1-2 years, at which time their FVC is measured.

In the training set, you are provided with an anonymized, baseline CT scan and the entire history of FVC measurements.
In the test set, you are provided with a baseline CT scan and only the initial FVC measurement. You are asked to predict the final three FVC measurements for each patient, as well as a confidence value in your prediction.
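
To make the train/test difference concrete, here is a minimal sketch on a made-up table in the same long format as `train.csv` (one row per patient visit). The patient IDs and values are invented for illustration; only the column names `Patient`, `Weeks` and `FVC` come from the dataset. It keeps each patient's earliest measurement, which is roughly the information the test set provides.

```python
import pandas as pd

# Toy stand-in for train.csv: invented IDs and values, real column names
toy = pd.DataFrame({
    "Patient": ["A", "A", "A", "B", "B"],
    "Weeks":   [-2, 5, 11, 0, 9],
    "FVC":     [2300, 2250, 2190, 3100, 3020],
})

# Earliest visit per patient = the baseline measurement
baseline = (
    toy.sort_values("Weeks")
       .groupby("Patient", as_index=False)
       .first()
       .rename(columns={"Weeks": "base_week", "FVC": "base_FVC"})
)
print(baseline)
```

Everything after the baseline row is what a model would have to predict for a test patient.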
    
- Files

This is a synchronous rerun code competition. The provided test set is a small representative set of files (copied from the training set) to demonstrate the format of the private test set. When you submit your notebook, Kaggle will rerun your code on the test set, which contains unseen images.

* train.csv - the training set, contains full history of clinical information
* test.csv - the test set, contains only the baseline measurement
* train/ - contains the training patients' baseline CT scan in DICOM format
* test/ - contains the test patients' baseline CT scan in DICOM format
* sample_submission.csv - demonstrates the submission format
    
- The leaderboard of this competition is calculated with approximately 1% of the test data. The final results will be based on the other 99%, so the final standings may be different.
    
- EXTERNAL DATA: Publicly and freely available external data is permitted if it is available for use that includes research or academic purposes

# 1. Import libraries 

In [None]:
import os
from os import listdir
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#plotly
!pip install chart_studio
import plotly.express as px
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot
import cufflinks
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')

import seaborn as sns
sns.set(style="whitegrid")


#pydicom
import pydicom

# Beautiful plot scheme
plt.style.use('fivethirtyeight')

# 2. Reading training Meta data

In [None]:
# List files available
list(os.listdir("../input/osic-pulmonary-fibrosis-progression"))

In [None]:
IMAGE_PATH = "../input/osic-pulmonary-fibrosis-progression/"

train_data = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/train.csv')
test_data = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/test.csv')

print('Training data shape: ', train_data.shape)
train_data.head(5)

In [None]:
train_data['SmokingStatus'].value_counts()

In [None]:
train_data.groupby(['SmokingStatus','Sex']).count()

# 3. Feast for the eyes: Visualization

In [None]:
# Let's explore the data
print('Train metadata set:')
train_data.info()  # info() prints directly; wrapping it in print() adds a stray "None"


In [None]:
print('Test metadata set:')
test_data.info()

In [None]:
# Total number of records (rows) in the metadata (train and test)
print("Total records in train data set: ", train_data['Patient'].count())
print("Total records in test data set: ", test_data['Patient'].count())



**In the dataset, you are provided with a baseline chest CT scan and associated clinical information for a set of patients. A patient has an image acquired at time Week = 0 and has numerous follow up visits over the course of approximately 1-2 years, at which time their FVC is measured.**
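
Because each patient has several FVC measurements over time, one simple way to summarise "decline" is a straight-line fit of FVC against week for each patient. The sketch below uses invented data and a hypothetical helper `fvc_slope`. It is one possible summary for exploration, not the competition's scoring method.

```python
import numpy as np
import pandas as pd

# Toy data with invented values; column names follow train.csv
toy = pd.DataFrame({
    "Patient": ["A"] * 4 + ["B"] * 4,
    "Weeks":   [0, 10, 20, 30, 0, 10, 20, 30],
    "FVC":     [3000, 2900, 2800, 2700, 2500, 2480, 2460, 2440],
})

def fvc_slope(group):
    """Least-squares slope of FVC over Weeks (FVC units per week)."""
    # np.polyfit with degree 1 returns [slope, intercept]
    return np.polyfit(group["Weeks"], group["FVC"], 1)[0]

slopes = {patient: fvc_slope(group) for patient, group in toy.groupby("Patient")}
print(slopes)
```

A more negative slope means a faster loss of lung function under this linear approximation.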

## Let's Check the Unique Patients in the Train Data

In [None]:
print("Total records (rows) in train data:")
print(train_data['Patient'].count())
print("Total unique patients:")
print(train_data['Patient'].nunique())
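
The gap between the row count and the unique-patient count comes from repeated visits. A small sketch on invented data shows how to look at the number of visits per patient. The toy values are assumptions; only the `Patient` and `FVC` column names are taken from the dataset.

```python
import pandas as pd

# Toy data: three patients with 3, 2 and 1 visits respectively
toy = pd.DataFrame({
    "Patient": ["A", "A", "A", "B", "B", "C"],
    "FVC":     [2300, 2250, 2190, 3100, 3020, 2800],
})

n_rows = len(toy)                      # one row per visit
n_patients = toy["Patient"].nunique()  # distinct patients
visits_per_patient = toy["Patient"].value_counts()

print(n_rows, n_patients)
print(visits_per_patient.describe())
```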

In [None]:
columns = train_data.keys()
columns = list(columns)
print(columns)

## People Smoke? Hmm.. Let's Check It Out!!

In [None]:
train_data['SmokingStatus'].value_counts()

In [None]:
iplot(train_data.SmokingStatus.iplot(asFigure=True, kind='histogram', title='Smoking Distribution Data', dimensions=(1000,400)))

## Weeks distribution : The relative number of weeks pre/post the baseline CT (may be negative)

In [None]:
train_data['Weeks'].value_counts()
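
Since `Weeks` is measured relative to the CT scan and can be negative, it can be handy to re-base it on each patient's first FVC measurement. The following is a sketch on invented data. The derived column names `first_week` and `weeks_since_first` are hypothetical.

```python
import pandas as pd

# Toy data: patient A has a measurement 4 weeks before the CT scan
toy = pd.DataFrame({
    "Patient": ["A", "A", "A", "B", "B"],
    "Weeks":   [-4, 3, 9, 6, 15],
})

# Broadcast each patient's earliest week back onto that patient's rows
toy["first_week"] = toy.groupby("Patient")["Weeks"].transform("min")
toy["weeks_since_first"] = toy["Weeks"] - toy["first_week"]
print(toy)
```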

### Number of patient records pre/post the baseline CT, week-wise

In [None]:
grpdata=train_data.groupby(['Weeks']).count()["Patient"]
grpdata

In [None]:
pd.options.plotting.backend = "plotly"
train_data.groupby(['Weeks']).count()["Patient"].plot()

In [None]:

grpval=train_data.groupby(['FVC']).count()["Patient"]
grpval

### Weeks vs SmokingStatus

In [None]:
z=train_data.groupby(['SmokingStatus','Weeks'])['FVC'].count().to_frame().reset_index()
z.style.background_gradient(cmap='Reds') 

## FVC VS Patients : How many patients are at same stage of disease 
**Lung function is assessed based on output from a spirometer, which measures the forced vital capacity (FVC), i.e. the volume of air exhaled.**

In [None]:
train_data.groupby(['FVC']).count()["Patient"].plot()

## Does Gender Have Any Relationship with FVC?
### FVC VS Percentage VS SEX

In [None]:
# Scatter of FVC against Percent, colored by sex
# (a line plot would join unrelated records in row order)
fig = px.scatter(train_data, x='FVC', y='Percent', color='Sex')
fig.show()

### Does FVC have any relationship with smoking habits?

In [None]:
plt.figure(figsize=(16, 6))
# fill= replaces the deprecated shade= argument (seaborn >= 0.11)
sns.kdeplot(train_data.loc[train_data['SmokingStatus'] == 'Ex-smoker', 'FVC'], label='Ex-smoker', fill=True)
sns.kdeplot(train_data.loc[train_data['SmokingStatus'] == 'Never smoked', 'FVC'], label='Never smoked', fill=True)
sns.kdeplot(train_data.loc[train_data['SmokingStatus'] == 'Currently smokes', 'FVC'], label='Currently smokes', fill=True)
plt.legend()

# Labeling of plot
plt.xlabel('FVC'); plt.ylabel('Density'); plt.title('Distribution of FVC by Smoking Status');

## How many patients are at different levels of disease?
### Percent

A computed field which approximates the patient's FVC as a percent of the typical FVC for a person of similar characteristics
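
Reading that definition literally, `Percent = 100 * FVC / typical FVC`, so the implied typical FVC can be backed out from the two columns. This is a sketch under that assumption, on invented numbers. The column name `typical_FVC` is hypothetical and the exact reference formula used by the organisers is not given here.

```python
import pandas as pd

# Toy rows with invented values; FVC and Percent are real column names
toy = pd.DataFrame({
    "FVC":     [2000, 3000],
    "Percent": [50.0, 75.0],
})

# Invert Percent = 100 * FVC / typical_FVC
toy["typical_FVC"] = toy["FVC"] / (toy["Percent"] / 100)
print(toy)
```

Two patients with different raw FVC can therefore share the same expected lung capacity, which is why `Percent` can be more comparable across patients than FVC itself.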

In [None]:
train_data['Percent'].value_counts()

In [None]:
train_data['Percent'].iplot(kind='hist',bins=35,color='green',xTitle='Percent distribution',yTitle='No Of Patients')

## How is Percent (FVC relative to typical) distributed for different genders?

In [None]:
plt.figure(figsize=(16, 6))
# fill= replaces the deprecated shade= argument (seaborn >= 0.11)
sns.kdeplot(train_data.loc[train_data['Sex'] == 'Male', 'Percent'], label='Male', fill=True)
sns.kdeplot(train_data.loc[train_data['Sex'] == 'Female', 'Percent'], label='Female', fill=True)
plt.legend()

# Labeling of plot
plt.xlabel('Percent of typical FVC'); plt.ylabel('Density'); plt.title('Distribution of Percent by Sex');

## Are Older Patients More Prone to Pulmonary Fibrosis Progression?
### Let's Check

In [None]:
train_data['Age'].iplot(kind='hist',bins=30,color='red',xTitle='Age distribution',yTitle='Count')

### Does smoking have any relationship with age?

In [None]:
train_data['SmokingStatus'].value_counts()

In [None]:
plt.figure(figsize=(16, 6))
# fill= replaces the deprecated shade= argument (seaborn >= 0.11)
sns.kdeplot(train_data.loc[train_data['SmokingStatus'] == 'Ex-smoker', 'Age'], label='Ex-smoker', fill=True)
sns.kdeplot(train_data.loc[train_data['SmokingStatus'] == 'Never smoked', 'Age'], label='Never smoked', fill=True)
sns.kdeplot(train_data.loc[train_data['SmokingStatus'] == 'Currently smokes', 'Age'], label='Currently smokes', fill=True)
plt.legend()

# Labeling of plot
plt.xlabel('Age (years)'); plt.ylabel('Density'); plt.title('Distribution of Ages by Smoking Status');

## Distribution of age vs gender
### Is it really helpful? Check it out yourself

In [None]:
plt.figure(figsize=(16, 6))
# fill= replaces the deprecated shade= argument (seaborn >= 0.11)
sns.kdeplot(train_data.loc[train_data['Sex'] == 'Female', 'Age'], label='Female', fill=True)
sns.kdeplot(train_data.loc[train_data['Sex'] == 'Male', 'Age'], label='Male', fill=True)
plt.legend()

plt.xlabel('Age (years)'); plt.ylabel('Density'); plt.title('Distribution of Ages by Sex');

## Gender distribution in train data
### Are males more affected by Pulmonary Fibrosis Progression??

In [None]:
train_data['Sex'].value_counts()

In [None]:
iplot(train_data.Sex.iplot(asFigure=True, kind='histogram', title='Sex Distribution Data', dimensions=(1000,400)))

## Do males have a bad habit of smoking??
### Gender vs SmokingStatus

In [None]:
plt.figure(figsize=(16, 6))
a = sns.countplot(data=train_data, x='SmokingStatus', hue='Sex')

for p in a.patches:
    a.annotate(format(p.get_height(), ','), 
           (p.get_x() + p.get_width() / 2., 
            p.get_height()), ha = 'center', va = 'center', 
           xytext = (0, 4), textcoords = 'offset points')

plt.title('Gender split by SmokingStatus', fontsize=16)
sns.despine(left=True, bottom=True);
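
Note that `train_data` has one row per visit, so counts like the one above weight each patient by their number of visits. A hedged sketch of a patient-level alternative, on invented data and assuming `Sex` and `SmokingStatus` do not change between a patient's visits:

```python
import pandas as pd

# Toy data: patient A has three visits, B two, C one (invented values)
toy = pd.DataFrame({
    "Patient":       ["A", "A", "A", "B", "B", "C"],
    "Sex":           ["Male", "Male", "Male", "Female", "Female", "Male"],
    "SmokingStatus": ["Ex-smoker"] * 3 + ["Never smoked"] * 2 + ["Ex-smoker"],
})

# Record-level table: every visit is counted
per_record = pd.crosstab(toy["SmokingStatus"], toy["Sex"])

# Patient-level table: keep one row per patient before counting
patients = toy.drop_duplicates("Patient")
per_patient = pd.crosstab(patients["SmokingStatus"], patients["Sex"])

print(per_record)
print(per_patient)
```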

## How are the images?
###  Visualising DICOM images
> A DICOM file is an image saved in the Digital Imaging and Communications in Medicine (DICOM) format. It contains an image from a medical scan, such as an ultrasound or MRI. DICOM files may also include identification data for patients so that the image is linked to a specific individual.

In [None]:
# Each entry under train/ and test/ is a patient folder, not a single image
print('Train patient folders:', len(os.listdir('../input/osic-pulmonary-fibrosis-progression/train')), '\n' +
      'Test patient folders:', len(os.listdir('../input/osic-pulmonary-fibrosis-progression/test')), '\n' +
      '--------------------------------', '\n' +
      'The number of folders matches the number of unique patients in the train/test .csv files')



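### Only 176 Entries, Why??
**Because these are patient folders, not individual images: there are only 176 unique patients in the database, and each folder holds that patient's CT slices.**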
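
To count the actual slices rather than the folders, one can walk each patient folder. The sketch below builds a throwaway directory tree with empty placeholder files (not real DICOM data) so it runs anywhere. The helper name `slices_per_patient` and the folder names are made up for illustration.

```python
import tempfile
from pathlib import Path

def slices_per_patient(root):
    """Map each patient folder name to its number of .dcm files."""
    return {
        folder.name: len(list(folder.glob("*.dcm")))
        for folder in sorted(Path(root).iterdir())
        if folder.is_dir()
    }

# Toy directory tree standing in for train/: two patients, 3 and 5 slices
with tempfile.TemporaryDirectory() as root:
    for patient, n_slices in [("patient_a", 3), ("patient_b", 5)]:
        folder = Path(root) / patient
        folder.mkdir()
        for i in range(1, n_slices + 1):
            (folder / f"{i}.dcm").touch()
    counts = slices_per_patient(root)

print(counts)
```

On Kaggle the same helper could be pointed at the `train` directory used above.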

## Let's see what is in an image

In [None]:
filename = "/kaggle/input/osic-pulmonary-fibrosis-progression/train/ID00123637202217151272140/137.dcm"
ds = pydicom.dcmread(filename)
plt.imshow(ds.pixel_array, cmap=plt.cm.bone) 

In [None]:
# directory for a patient
imdir = "/kaggle/input/osic-pulmonary-fibrosis-progression/train/ID00007637202177411956430"
print("total images for patient ID00007637202177411956430: ", len(os.listdir(imdir)))

In [None]:
print("images for patient ID00007637202177411956430 :")
mylist = os.listdir(imdir)
mylist.sort()
print(mylist)
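
One caveat with `list.sort()` on file names: strings sort lexicographically, so `10.dcm` comes before `2.dcm`. A small sketch of sorting by the numeric part instead, assuming every file is named `<integer>.dcm` as in the paths used in this notebook (the helper `slice_number` is hypothetical):

```python
# File names in the order a plain string sort produces
names = ["1.dcm", "10.dcm", "11.dcm", "2.dcm", "3.dcm"]
assert sorted(names) == names  # already in lexicographic order

def slice_number(name):
    """Integer part of a file name such as '12.dcm'."""
    return int(name.split(".")[0])

# Sort by slice number rather than by string
ordered = sorted(names, key=slice_number)
print(ordered)
```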

In [None]:
# view the first (columns*rows) images in order
fig = plt.figure(figsize=(12, 12))
columns = 4
rows = 5
for i in range(1, columns * rows + 1):
    # slices are assumed to be named 1.dcm, 2.dcm, ...
    filename = os.path.join(imdir, str(i) + ".dcm")
    ds = pydicom.dcmread(filename)
    fig.add_subplot(rows, columns, i)
    plt.imshow(ds.pixel_array, cmap=plt.cm.bone)
plt.show()

## Hidden information in pydicom that could be useful
https://pydicom.github.io/pydicom/stable/old/getting_started.html

In [None]:
import glob
train_image_path = '../input/osic-pulmonary-fibrosis-progression/train'
train_image_files = glob.glob(os.path.join(train_image_path, '*', '*.dcm'))

# dcmread replaces the deprecated read_file
train_image_data = pydicom.dcmread(train_image_files[0])
train_image_data

Ideas taken from https://www.kaggle.com/piantic/osic-pulmonary-fibrosis-progression-basic-eda