# Introduction
What is Pulmonary Fibrosis?
<iframe style="text-align:center" width="560" height="315" src="https://www.youtube.com/embed/cnzOZ1KveMY" frameborder="1" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>


* **Data:** Each patient has a baseline CT scan, plus follow-up FVC and Percent measurements taken at different weeks. `train.csv` and `test.csv` contain the columns Patient, Weeks, FVC, Percent, Age, Sex and SmokingStatus.
* **Problem:** Predict the severity of a patient's decline in lung function.
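
One simple way to picture "decline" is the slope of a straight line fitted to FVC against Weeks for a single patient. The sketch below uses made-up measurements, and the helper name `fvc_slope` is hypothetical; it is only an illustration and not the competition's scoring metric.

```python
import numpy as np
import pandas as pd

def fvc_slope(patient_df):
    """Least-squares slope of FVC versus Weeks (ml per week) for one patient."""
    slope, _intercept = np.polyfit(patient_df["Weeks"], patient_df["FVC"], deg=1)
    return slope

# Made-up measurements for one hypothetical patient
toy = pd.DataFrame({"Weeks": [0, 10, 20, 30], "FVC": [3000, 2950, 2900, 2850]})
print(fvc_slope(toy))  # a negative slope means FVC is falling over time
```

On the real data the same helper could be applied to the rows of one patient, for example `traindf[traindf.Patient == some_id]`.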


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
os.listdir('/kaggle/input/osic-pulmonary-fibrosis-progression')

## Additional libraries

In [None]:
import cv2
import seaborn as sns
import matplotlib.pyplot as plt
import pydicom
import glob

%matplotlib inline

In [None]:
traindf = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/train.csv')
traindf.shape

In [None]:
traindf.head()

In [None]:
traindf.info()

In [None]:
testdf = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/test.csv')
testdf.shape

In [None]:
testdf.head()

In [None]:
testdf.info()

In [None]:
ROOT_DIR = '/kaggle/input/osic-pulmonary-fibrosis-progression/'
TRAIN_DIR = ROOT_DIR + 'train'
TEST_DIR = ROOT_DIR + 'test'

# Data exploration

In [None]:
# getting a brief summary on all the values of train set
traindf.describe()

The maximum FVC is around 6400 ml, which might be an outlier given that the normal FVC range for an adult lies roughly between 3000 and 5000 ml.

[ Reference: https://en.wikipedia.org/wiki/Spirometry#Forced_vital_capacity_(FVC) ]
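
Whether a value counts as an outlier can also be checked numerically. Below is a minimal sketch of the 1.5 × IQR rule, which is the same rule a default boxplot uses to place its whiskers. The helper name `iqr_outliers` and the sample values are made up for illustration.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Made-up FVC values (ml); 6400 is deliberately extreme
toy_fvc = pd.Series([2100, 2400, 2600, 2800, 3000, 3200, 6400])
print(iqr_outliers(toy_fvc))
```

On the train set this could be called as `iqr_outliers(traindf['FVC'])`.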

Let's draw a boxplot and a violin plot to visualize this.

In [None]:
sns.boxplot(x = 'FVC', data = traindf)

In [None]:
sns.violinplot(x = 'FVC', data = traindf)

### Observations from boxplot and violin plot
* Most values lie between 2000 and 3000 ml.
* Very few values exceed 5000 ml, and possibly only one exceeds 6000 ml.

In [None]:
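# correlation matrix over the numeric columns only
# (string columns such as Patient, Sex and SmokingStatus cannot be correlated directly)
corrMatrix = traindf.select_dtypes(include='number').corr()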
print(corrMatrix)
# plotting a heatmap
sns.heatmap(corrMatrix, vmin = -1, vmax = 1, center = 0, cmap = 'BuGn');

In [None]:
traindf.Patient.value_counts()

In [None]:
# getting the number of unique Patient IDs
print('The number of unique patient IDs in train set:',traindf.Patient.nunique())

Let's look at the number of records per patient and how the data is distributed across patients.
We group traindf by Patient, Age, Sex and SmokingStatus, since these values are the same for every record of an individual patient; only Weeks, FVC and Percent change from record to record.
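
The assumption that the demographic columns never change within a patient can be checked before grouping. The sketch below does this on a small made-up frame that only mimics the column names of train.csv.

```python
import pandas as pd

# Made-up rows that mimic the layout of train.csv
toy = pd.DataFrame({
    "Patient": ["A", "A", "B", "B"],
    "Weeks": [0, 5, 1, 9],
    "Age": [70, 70, 65, 65],
    "Sex": ["Male", "Male", "Female", "Female"],
    "SmokingStatus": ["Ex-smoker", "Ex-smoker", "Never smoked", "Never smoked"],
})

# Count the distinct values of each demographic column within each patient
distinct = toy.groupby("Patient")[["Age", "Sex", "SmokingStatus"]].nunique()

# True only if every patient has exactly one value in every column
is_constant = bool((distinct == 1).all().all())
print(is_constant)
```

Replacing `toy` with `traindf` would run the same check on the real data.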

In [None]:
# count the rows (records) of each patient; the demographic columns are kept as group keys
traindfgrouped = (traindf.groupby(['Patient','Age','Sex','SmokingStatus'])
                  .size()
                  .reset_index(name='Patient_record_count'))
print(traindfgrouped)

In [None]:
traindfgrouped.Patient_record_count.describe()

Observations:
* Number of patients: 176
* Average number of records per patient: 8
* Number of records per patient ranges between 6 and 10

### 1. Age Distribution

In [None]:
sns.set_style('darkgrid')
plt.figure(figsize=(15,8))
ax = sns.countplot(x = 'Age', data = traindfgrouped)
# total number of patients, so each label is the fraction of patients with that age
total = float(len(traindfgrouped))
for p in ax.patches:
    ht = p.get_height()
    ax.text(p.get_x(), ht+0.3, '{:1.2f}'.format(ht/total))

Most patients are between 64 and 74 years old.

What is the distribution of the genders over age?

In [None]:
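df = traindfgrouped
# total number of records for each (Age, Sex) pair; selecting the count column
# keeps the string columns out of the sum
ag = df.groupby(['Age','Sex'])['Patient_record_count'].sum().unstack()
ag.plot(kind = 'bar', width = 1, colormap = 'Accent', figsize = (15,8))
plt.show()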

### 2. Sex Distribution

In [None]:
ax = sns.countplot(x = 'Sex', data = traindfgrouped, palette = 'pastel')
# number of unique patients (traindfgrouped has one row per patient)
total = float(len(traindfgrouped))
for p in ax.patches:
    ht = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.0, ht+0.3, '{:1.2f}%'.format(ht*100/total), ha = 'center')

The majority of the patients are male. Nearly 79% of the patients are male.
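
The percentages can also be computed directly instead of being read off the plot. This is a minimal sketch using `value_counts(normalize=True)` on a made-up one-row-per-patient frame; the numbers are illustrative only.

```python
import pandas as pd

# Made-up frame with one row per patient
toy_patients = pd.DataFrame({"Sex": ["Male", "Male", "Male", "Female"]})

# Share of each category, expressed as a percentage
share = toy_patients["Sex"].value_counts(normalize=True) * 100
print(share.round(2))
```

With the notebook's data this would be `traindfgrouped['Sex'].value_counts(normalize=True) * 100`.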

### 3. SmokingStatus Distribution

In [None]:
sns.countplot(x = 'SmokingStatus', data = traindfgrouped, palette = 'pastel')

Most of the patients are ex-smokers, and very few fall in the Currently smokes category.

What is the distribution of smoking status over the genders?

In [None]:
sns.countplot(x = 'SmokingStatus', hue = 'Sex', data = traindfgrouped, palette = 'pastel')

* Most males are ex-smokers, whereas most females have never smoked.
* Among patients who have never smoked, the numbers of males and females are roughly equal.
* Very few patients fall in the Currently smokes category.
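
The counts behind these observations can be tabulated with `pd.crosstab`. The sketch below uses a made-up patient table, so the numbers do not reflect the real data.

```python
import pandas as pd

# Made-up frame with one row per patient
toy_patients = pd.DataFrame({
    "Sex": ["Male", "Male", "Male", "Female", "Female", "Male"],
    "SmokingStatus": ["Ex-smoker", "Ex-smoker", "Never smoked",
                      "Never smoked", "Currently smokes", "Currently smokes"],
})

# Rows are smoking categories, columns are sexes, cells are patient counts
table = pd.crosstab(toy_patients["SmokingStatus"], toy_patients["Sex"])
print(table)
```

On the notebook's data the call would be `pd.crosstab(traindfgrouped['SmokingStatus'], traindfgrouped['Sex'])`.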

In [None]:
traindfgrouped[traindfgrouped['SmokingStatus']=='Currently smokes']

There are very few records for females who currently smoke: only 2 female patients belong to the Currently smokes category.

# Image visualization

### Working with DICOM files: 
Reference https://www.kdnuggets.com/2017/03/medical-image-analysis-deep-learning.html

Looking at a single CT slice for one patient

In [None]:
# plotting slice file 107.dcm from the CT scan of patient ID00060637202187965290703
filepath1 = '/kaggle/input/osic-pulmonary-fibrosis-progression/train/ID00060637202187965290703/107.dcm'
# dcmread replaces read_file, which is deprecated in recent pydicom releases
file1 = pydicom.dcmread(filepath1)
plt.imshow(file1.pixel_array, cmap = plt.cm.bone)
plt.title('Patient: ID00060637202187965290703')

### Looking at the Hounsfield units and pixel array:
Reference https://www.kaggle.com/gzuidhof/full-preprocessing-tutorial

In [None]:
patients = os.listdir(TRAIN_DIR)
patients.sort()

In [None]:
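def load_scan(path):
    # read every DICOM file in the patient's folder (dcmread replaces the deprecated read_file)
    slices = [pydicom.dcmread(os.path.join(path,s)) for s in os.listdir(path)]
    # order the slices along the z axis
    slices.sort(key = lambda x: float(x.ImagePositionPatient[2]))
    # slice spacing: z distance between the first two slices, with SliceLocation as a fallback
    try:
        slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
    except Exception:
        slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)
        
    for s in slices:
        s.SliceThickness = slice_thickness
        
    return slices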
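
The sorting and spacing steps in `load_scan` can be tried without any DICOM files. In the sketch below, `SimpleNamespace` objects stand in for the datasets that pydicom returns; they are stand-ins made up for illustration and carry only the `ImagePositionPatient` attribute.

```python
from types import SimpleNamespace

# Stand-ins for DICOM slices, deliberately out of order along z
fake_slices = [
    SimpleNamespace(ImagePositionPatient=[0.0, 0.0, z])
    for z in (-10.0, -30.0, -20.0)
]

# Same ordering rule as load_scan: sort by the z coordinate
fake_slices.sort(key=lambda s: float(s.ImagePositionPatient[2]))
z_order = [s.ImagePositionPatient[2] for s in fake_slices]

# Spacing is the z distance between the first two sorted slices
spacing = abs(fake_slices[0].ImagePositionPatient[2] - fake_slices[1].ImagePositionPatient[2])
print(z_order, spacing)
```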

In [None]:
def get_pixels_hu(slices):
    image = np.stack([s.pixel_array for s in slices])
    # Convert to int16 (the stored values are sometimes uint16);
    # this should be safe as values should always be low enough (<32k)
    image = image.astype(np.int16)
    # Set outside-of-scan pixels to 0
    # The intercept is usually -1024, so air is approximately 0
    image[image == -2000] = 0
    # Convert to Hounsfield units (HU)
    for slice_number in range(len(slices)):
        intercept = slices[slice_number].RescaleIntercept
        slope = slices[slice_number].RescaleSlope
        if slope != 1:
            image[slice_number] = slope * image[slice_number].astype(np.float64)
            image[slice_number] = image[slice_number].astype(np.int16)
        image[slice_number] += np.int16(intercept)   
    return np.array(image, dtype=np.int16)
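
The conversion in `get_pixels_hu` is the linear rescale HU = RescaleSlope × stored value + RescaleIntercept. Here is a small worked example on a made-up 2×2 array; the helper name `to_hounsfield` and the input values are assumptions chosen for illustration.

```python
import numpy as np

def to_hounsfield(raw, slope, intercept):
    """Apply the DICOM linear rescale: HU = slope * raw + intercept."""
    return (slope * raw.astype(np.float64) + intercept).astype(np.int16)

# Made-up stored pixel values
raw = np.array([[0, 1024], [2048, 24]], dtype=np.int16)

# A slope of 1 and an intercept of -1024 are common for CT scans
hu = to_hounsfield(raw, slope=1.0, intercept=-1024.0)
print(hu)
```

With these settings a stored value of 1024 maps to 0 HU (water) and a stored value of 24 maps to -1000 HU (approximately air).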

In [None]:
first_patient = load_scan(os.path.join(TRAIN_DIR, patients[0]))
first_patient_pixels = get_pixels_hu(first_patient)
plt.hist(first_patient_pixels.flatten(), bins=80, color='c')
plt.xlabel("Hounsfield Units (HU)")
plt.ylabel("Frequency")
plt.show()

# Show some slice in the middle
plt.title('Slice 20')
plt.imshow(first_patient_pixels[20], cmap=plt.cm.bone)
plt.show()

In [None]:
print(patients[0])

Looking at the first 30 CT slices of an individual patient with the ID: ID00007637202177411956430

In [None]:
plt.figure(figsize = (18,18))
for i in range(30):
    plt.subplot(5,6,i+1)
    plt.imshow(first_patient_pixels[i],cmap=plt.cm.bone)
    plt.title('slice ' + str(i+1))

### Work in Progress! I'll add more content to it as and when I discover new things. Please consider upvoting the notebook if you found it useful. Also comment any corrections and improvements. Thank You :)