Welcome to this notebook, in which I explore the data of the "OSIC Pulmonary Fibrosis Progression" competition. I will try to be as elaborate as possible.

Let's get started! 

The first step towards approaching any problem is to understand the problem statement properly. 

# Problem statement 

**In this competition, you’ll predict a patient’s severity of decline in lung function based on a CT scan of their lungs. You’ll determine lung function based on output from a spirometer, which measures the volume of air inhaled and exhaled. The challenge is to use machine learning techniques to make a prediction with the image, metadata, and baseline FVC as input.**

# Dig deeper into Pulmonary Fibrosis

![](https://www.pulmonaryfibrosis.org/images/default-source/default-album/normal-and-impaired-gas-exchange.png?sfvrsn=c3b0918d_0)

The word **“Pulmonary”** means lung and the word **“fibrosis”** means scar tissue— similar to scars that you may have on your skin from an old injury or surgery. So, in its simplest sense, **pulmonary fibrosis (PF) means scarring in the lungs**. 

Over time, the scar tissue can destroy the normal lung and make it hard for oxygen to get into your blood. Low oxygen levels (and the stiff scar tissue itself) can cause you to feel short of breath, particularly when walking and exercising. **Pulmonary fibrosis** isn’t just one disease. It is a family of more than 200 different lung diseases that all look very much alike. The PF family of lung diseases falls into an even larger group of diseases called the interstitial lung diseases (also known as ILD), which includes all of the diseases that have inflammation and/or scarring in the lung. Some interstitial lung diseases don’t include scar tissue. When an interstitial lung disease does include scar tissue in the lung, we call it pulmonary fibrosis.

## Symptoms 

Signs and symptoms of pulmonary fibrosis may include:

1. Shortness of breath (dyspnea)
2. A dry cough
3. Fatigue
4. Unexplained weight loss
5. Aching muscles and joints
6. Widening and rounding of the tips of the fingers or toes (clubbing)

The course of pulmonary fibrosis — and the severity of symptoms — can vary considerably from person to person. Some people become ill very quickly with severe disease. Others have moderate symptoms that worsen more slowly, over months or years.

## Causes 

Pulmonary fibrosis scars and thickens the tissue around and between the air sacs (alveoli) in your lungs. This makes it more difficult for oxygen to pass into your bloodstream. The damage can be caused by many different factors — including long-term exposure to certain toxins, certain medical conditions, radiation therapy and some medications.

## Risk factors

**Age**. Although pulmonary fibrosis has been diagnosed in children and infants, the disorder is much more likely to affect middle-aged and older adults.

**Sex**. Idiopathic pulmonary fibrosis is more likely to affect men than women.

**Smoking**. Far more smokers and former smokers develop pulmonary fibrosis than do people who have never smoked. Pulmonary fibrosis can occur in patients with emphysema.

**Certain occupations**. You have an increased risk of developing pulmonary fibrosis if you work in mining, farming or construction or if you're exposed to pollutants known to damage your lungs.

**Cancer treatments**. Having radiation treatments to your chest or using certain chemotherapy drugs can increase your risk of pulmonary fibrosis.

**Genetic factors**. Some types of pulmonary fibrosis run in families, and genetic factors may be a component.

## Why is this competition important?

Current methods make fibrotic lung diseases difficult to treat, even with access to a chest CT scan. In addition, the wide range of varied prognoses create issues organizing clinical trials. Finally, patients suffer extreme anxiety—in addition to fibrosis-related symptoms—from the disease’s opaque path of progression.

If successful, patients and their families would better understand their prognosis when they are first diagnosed with this incurable lung disease. Improved severity detection would also positively impact treatment trial design and accelerate the clinical development of novel treatments.

# Data description 

We have seen the details of the disease and what the competition expects us to do. Now let's look at the description of the data and what we need to do with it.

As we understand from the problem statement of the competition, our job is to make prognosis easier for the patients. We are provided with a baseline chest CT scan and associated clinical information for a set of patients. A patient has an image acquired at time Week = 0 and has numerous follow up visits over the course of approximately 1-2 years, at which time their FVC is measured. 

Lung function is assessed based on output from a spirometer, which measures the forced vital capacity (FVC), i.e. the volume of air exhaled.

**Train:** We are provided with an anonymized, baseline CT scan and the entire history of FVC measurements.

**Test:** We are provided with a baseline CT scan and only the initial FVC measurement. You are asked to predict the final three FVC measurements for each patient, as well as a confidence value in your prediction.
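The "confidence value" is a standard deviation submitted alongside each FVC prediction. As far as I recall from the competition's evaluation page, submissions are scored with a modified Laplace log likelihood. The sketch below is written from memory: the clipping constants (70 and 1000) are assumptions that should be checked against the official page, and `laplace_log_likelihood` is a name made up for this illustration.

```python
import numpy as np

def laplace_log_likelihood(fvc_true, fvc_pred, sigma):
    """Modified Laplace log likelihood (higher is better).

    Assumed constants: sigma is floored at 70 and the absolute
    error is capped at 1000.
    """
    fvc_true = np.asarray(fvc_true, dtype=float)
    fvc_pred = np.asarray(fvc_pred, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    sigma_clipped = np.maximum(sigma, 70.0)
    delta = np.minimum(np.abs(fvc_true - fvc_pred), 1000.0)
    scores = -np.sqrt(2.0) * delta / sigma_clipped - np.log(np.sqrt(2.0) * sigma_clipped)
    return scores.mean()

# A perfect prediction with the smallest useful sigma
best = laplace_log_likelihood([2500], [2500], [70])
# The same prediction with an over-cautious sigma scores lower
cautious = laplace_log_likelihood([2500], [2500], [500])
print(best, cautious)
```

The point of the example is that the score rewards both accuracy and a well-calibrated confidence: a correct prediction with an unnecessarily large sigma is penalised.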

Let's get started!

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
!pip install chart_studio
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
# import plotly.plotly as py
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode

import cufflinks
cufflinks.go_offline(connected=True)
init_notebook_mode(connected=True)


sns.set()
plt.rcParams["figure.figsize"] = [8,12]


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df_train=pd.read_csv(r"/kaggle/input/osic-pulmonary-fibrosis-progression/train.csv")
df_test=pd.read_csv("/kaggle/input/osic-pulmonary-fibrosis-progression/test.csv")
df_train.head(170)

* So, there are 7 columns in our training dataset:

1. **Patient id:** This lets us follow the week-by-week status of an individual. As mentioned in the problem statement, patients are followed over many weeks, so the same id repeats across several rows. We will also count the unique ids in this column to find out how many patients we are actually dealing with.

2. **Weeks:** The week number of the measurement, counted relative to the baseline CT scan (Week = 0).

3. **FVC:** The forced vital capacity recorded at that visit, i.e. the total volume of air the patient can forcibly exhale after a full breath. Forced expiratory volume (FEV) and forced vital capacity are both lung function tests measured during spirometry.

4. **Percent:** A computed field which approximates the patient's FVC as a percent of the typical FVC for a person of similar characteristics.

5. **Age** 

6. **Sex**

7. **Smoking status:** This could be an important parameter in determining whether smoking status is relevant to the FVC or not. 
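To make the `Percent` field concrete, here is a minimal sketch on made-up rows. It assumes `Percent = 100 * FVC / typical FVC`, which is my reading of the column description rather than a documented formula; the `TypicalFVC` column is hypothetical and exists only for this illustration.

```python
import pandas as pd

# Toy rows with made-up values, using the same column names as train.csv
toy = pd.DataFrame({
    "Patient": ["A", "A", "B"],
    "FVC": [2400, 2200, 3000],      # measured FVC
    "Percent": [60.0, 55.0, 75.0],  # FVC as a percent of the typical FVC
})

# Assumption: Percent = 100 * FVC / typical FVC, so invert it
toy["TypicalFVC"] = toy["FVC"] / (toy["Percent"] / 100.0)
print(toy)
```

Under this assumption, the implied typical FVC stays constant for a patient while the measured FVC and `Percent` move together.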

In [None]:
df_test.head()

Let us have a look at the shape of our training and testing datasets:

In [None]:
print("The shape of our training dataset is:{}".format(df_train.shape))
print("The shape of our testing dataset is:{}".format(df_test.shape))

Now it is time to see if our data has any missing values. We will check both the training and testing datasets:

In [None]:
df_train.info()

In [None]:
df_test.info()

Neither the train nor the test dataset contains missing values. Let us look at the summary statistics of our training dataset, which are quite important for us.
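`info()` reports non-null counts, which have to be compared with the row count by eye. A more direct check counts the missing values per column. A minimal sketch on a made-up frame (the values are invented, and one is deliberately missing):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "FVC": [2400, np.nan, 3000],
    "Age": [70, 65, 68],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True if the frame has at least one missing value anywhere
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

The same two calls can be applied to `df_train` and `df_test` directly.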

In [None]:
df_train.describe()

Let us see how many unique patients we have in our dataset:

In [None]:
df_train['Patient'].nunique()

So, we have 176 unique patient ids. The remaining rows are the week-by-week records of these patients. We can analyze the data in two ways:

1. We can use all the data.
2. We can use one row per unique patient id.

Let us first make a new dataframe with one row per patient id, and add a column with the number of records (frequency) for each patient:

In [None]:
df_unique = df_train.groupby([df_train.Patient,df_train.Age,df_train.Sex, df_train.SmokingStatus])['Patient'].count()
df_unique.index = df_unique.index.set_names(['Patient Id','Age','Sex','SmokingStatus'])

df_unique = df_unique.reset_index()
df_unique.rename(columns = {'Patient': 'Frequency'},inplace = True)

df_unique.head()
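The same per-patient table can be built a little more compactly with `size()` and `reset_index(name=...)`. This is a sketch on made-up rows, and it assumes that `Age`, `Sex` and `SmokingStatus` are constant for each patient; `per_patient` is a name chosen for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Patient": ["A", "A", "A", "B", "B"],
    "Weeks": [0, 5, 10, -2, 8],
    "Age": [70, 70, 70, 65, 65],
    "Sex": ["Male", "Male", "Male", "Female", "Female"],
    "SmokingStatus": ["Ex-smoker"] * 3 + ["Never smoked"] * 2,
})

# One row per patient, with the number of recorded visits
per_patient = (
    toy.groupby(["Patient", "Age", "Sex", "SmokingStatus"])
       .size()
       .reset_index(name="Frequency")
)
print(per_patient)
```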

Let us find the average number of appointments a patient had to get their FVC measured:

In [None]:
df_unique['Frequency'].mean()

We can see that, on average, patients had around 9 appointments to get their FVC tested. Some had fewer than that, but most of them had close to 9 appointments. Let us explore other features in the dataset.

## Data distribution by Sex (unique values)

In [None]:
print(df_unique['Sex'].value_counts())
percent_sexwise=df_unique['Sex'].value_counts()/len(df_unique['Sex'])
print(percent_sexwise)

In [None]:
fig, ax = plt.subplots(figsize=(20,8))
sns.countplot(ax=ax, x="Sex", data=df_unique, palette="bone")

More men than women appear in this dataset, which is consistent with the risk factor noted above. One possible reason is greater occupational exposure to environmental hazards. Women also tend to develop cardiovascular diseases around 7-12 years later than men, and cardiovascular disease is a potential factor behind the development of pulmonary fibrosis. The dataset itself cannot confirm either explanation.

## Data distribution by Age (unique values)

In [None]:
from matplotlib import colors

fig, ax = plt.subplots(figsize=(20,8))
N_points = 100000
n_bins = 15
N, bins, patches = plt.hist(df_unique['Age'], n_bins, alpha=0.5)

fracs = N / N.max()
norm = colors.Normalize(fracs.min(), fracs.max())
for thisfrac, thispatch in zip(fracs, patches):
    color = plt.cm.viridis(norm(thisfrac))
    thispatch.set_facecolor(color)

Most patients in this dataset are between 65 and 75 years old. Patients start appearing at around 50 years of age and continue up to around 80, with few patients outside that range. This matches the risk factor noted earlier: the disorder mostly affects middle-aged and older adults.

## Data distribution by Age according to Sex (unique values)

In [None]:
fig, ax = plt.subplots(figsize=(20,8))
# Use the per-patient frame so that each patient is counted once, not once per visit
df_unique.groupby(['Age', 'Sex']).size().unstack().plot(kind='bar', stacked=True, ax=ax)

## Data distribution by Smoking Status (unique values)

In [None]:
fig, ax = plt.subplots(figsize=(20,8))
sns.countplot(x="SmokingStatus", data=df_unique, palette="magma")

So, most of the patients had a smoking history before they developed symptoms. They may have quit smoking after the symptoms appeared, but the counts suggest that people with a smoking history have a greater chance of developing pulmonary fibrosis. Another possibility is that the dataset itself is biased towards ex-smokers.

We will try to find that out!

## Data distribution by Smoking Status according to Sex (unique values)

In [None]:
fig, ax = plt.subplots(figsize=(20,8))
# Use the per-patient frame so that each patient is counted once, not once per visit
df_unique.groupby(['SmokingStatus', 'Sex']).size().unstack().plot(kind='bar', stacked=True, ax=ax)

## Data distribution by FVC

### Before moving further, let us understand more about FVC:

![](https://slideplayer.com/slide/10537676/36/images/8/Pulmonary+Fibrosis+Restrictive+Diseases+Pulmonary+Function+Tests.jpg)


**Forced vital capacity (FVC)** is the amount of air that can be forcibly exhaled from your lungs after taking the deepest breath possible, as measured by spirometry. This test may help distinguish obstructive lung diseases, such as asthma and COPD, from restrictive lung diseases, such as pulmonary fibrosis and sarcoidosis.

FVC can also help doctors assess the progression of lung disease and evaluate the effectiveness of treatment. An abnormal FVC value may be chronic, but sometimes the problem is reversible and the FVC can be corrected.

### Interpreting Results

Your total FVC volume can be compared with the standard FVC for your age, sex, height, and weight. Your FVC can also be compared with your own previous FVC values, if applicable, to determine whether your pulmonary condition is progressing or if your lung function is improving under treatment. **FVC also may be expressed as a percentage of the predicted FVC.** This is expressed as percent in our dataset. 


Forced vital capacity will be reported in two ways:

1. As an absolute value, reported as a number in liters (L)
2. On a linear graph to chart the dynamics of your exhalation


The normal FVC range for an adult is between 3.0 and 5.0 L. This depends on many factors such as genetics, height and body weight. The unit is not stated explicitly in the data shown here, but I am assuming that FVC is given in ml.
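Under the assumption that the column is in millilitres, the values can be converted to litres and compared with the 3.0 to 5.0 L range quoted above. A minimal sketch on made-up readings:

```python
import pandas as pd

# Made-up FVC readings, assumed to be in millilitres
fvc_ml = pd.Series([827, 2500, 3100, 6399], name="FVC")

# Convert to litres and flag readings inside the 3.0-5.0 L range
fvc_l = fvc_ml / 1000.0
in_normal_range = fvc_l.between(3.0, 5.0)
print(pd.DataFrame({"FVC_L": fvc_l, "in_normal_range": in_normal_range}))
```

Applied to `df_train['FVC']`, the share of `True` values gives a rough idea of how many measurements fall inside the quoted range.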

In [None]:
fig, ax = plt.subplots(figsize=(20,8))

N_points = 100000
n_bins = 15
N, bins, patches = plt.hist(df_train['FVC'], n_bins, alpha=0.5)

fracs = N / N.max()
norm = colors.Normalize(fracs.min(), fracs.max())
for thisfrac, thispatch in zip(fracs, patches):
    color = plt.cm.inferno(norm(thisfrac))
    thispatch.set_facecolor(color)
print(df_train['FVC'].min())
print(df_train['FVC'].max())
print(df_train['FVC'].mean())
print(df_train['FVC'].median())

Since this is real medical data, you will notice the relative timing of FVC measurements varies widely. The timing of the initial measurement relative to the CT scan and the duration to the forecasted time points may be different for each patient. This is considered part of the challenge of the competition.

Let us understand more about how our data is distributed:

In [None]:
from scipy.stats import kurtosis, skew


print("The mean of FVC data is: {}".format(df_train['FVC'].mean()))
print("The median of FVC data is: {}".format(df_train['FVC'].median()))
print("The standard deviation of FVC data is: {}".format(df_train['FVC'].std()))

# These describe the FVC data; both would be 0 for a normal distribution
print('Excess kurtosis of FVC data (0 for a normal distribution): {}'.format(kurtosis(df_train['FVC'])))
print('Skewness of FVC data (0 for a normal distribution): {}'.format(skew(df_train['FVC'])))

We notice from the distribution that very few patients have very high FVC. The data is not normally distributed. We have a positive skew: the tail on the right side of the distribution is longer or fatter, and the mean and median are typically greater than the mode.
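To see what a positive skew means numerically, the skewness can be computed by hand as the third central moment divided by the cubed standard deviation. This is the same biased estimator that `scipy.stats.skew` uses by default. The sample below is made up so that a few values sit far out on the right.

```python
import numpy as np

# Made-up right-skewed sample: most values are moderate, a few are very high
sample = np.array([2000, 2200, 2400, 2500, 2600, 2800, 3000, 4500, 6000], dtype=float)

mean = sample.mean()
median = np.median(sample)

# Skewness: third central moment over the cubed (population) standard deviation
centered = sample - mean
skewness = np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

print(mean, median, skewness)
```

The long right tail pulls the mean above the median, and the skewness comes out positive.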

In [None]:
fig, ax = plt.subplots(figsize=(20,15))
sns.boxplot(x="Sex", y="FVC",hue="SmokingStatus", data=df_train, palette="spring", ax=ax)

Citing this [research paper](https://www.jstage.jst.go.jp/article/jpts/24/1/24_1_5/_pdf#:~:text=It%20was%20found%20that%20the,lung%20capacity%20in%20females%20than), women have 20-25% lower capacity than men, owing to their smaller lungs. 

However, I have one concern about the data shown above. As seen from the boxplot, men and women who currently smoke have higher FVC than ex-smokers and non-smokers which is contradictory to this [research paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3944281/#:~:text=They%20showed%20that%20smoking%20decreased,airway%20obstruction%20and%20small%20airway) which states that:

**Some previous studies have demonstrated the effect of smoking on the pulmonary function of adults. They showed that smoking decreased pulmonary function including forced vital capacity (FVC), forced expiratory volume in one second (FEV1), FEV1/FVC, and forced expiratory flow at 25–75% (FEF25–75%). Cigarette smoking causes deficits in both FEV1/FVC and FEF25–75 which indicate airway obstruction and small airway disease in adult smokers.**

These works suggest otherwise, which should be a cause for concern. I will read more and try to find the causes behind this.
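One thing worth checking before drawing conclusions is how many rows sit behind each box, since a group with very few patients can have an unrepresentative median. The sketch below uses made-up numbers only to show the check; I have not verified the actual group sizes here.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["Male"] * 5 + ["Female"] * 3,
    "SmokingStatus": ["Ex-smoker", "Ex-smoker", "Ex-smoker", "Never smoked",
                      "Currently smokes", "Never smoked", "Never smoked", "Ex-smoker"],
    "FVC": [2500, 2700, 2300, 2900, 3400, 1900, 2100, 1800],
})

# Median FVC and number of rows behind each box in the boxplot
summary = (
    toy.groupby(["Sex", "SmokingStatus"])["FVC"]
       .agg(median_fvc="median", n_rows="count")
       .reset_index()
)
print(summary)
```

In this toy frame the "Currently smokes" group has a high median that rests on a single row, which is exactly the situation the check is meant to expose.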

## Data distribution by Weeks 

This will give us information on the weekly visits by patients:

In [None]:
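fig, ax = plt.subplots(figsize=(20,8))
# Plot the week numbers themselves, so that the x-axis is Weeks
# and the bar height is the number of measurements taken in that range
ax.hist(df_train['Weeks'], bins=30, color='g', alpha=0.6)
ax.set_xlabel('Weeks')
ax.set_ylabel('Number of measurements')

The distribution shows when, relative to the baseline scan, the measurements were taken. If measurements thin out in later weeks, there are two possible explanations, neither of which the data can confirm on its own:

1. Either patients are reluctant to visit hospitals to get their FVC measured.
2. Or the health of patients deteriorated over the weeks and some of them may have died.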
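A per-patient view helps separate these explanations from a simple scheduling effect: for each patient, look at the first and last observed week and the number of visits. A minimal sketch on made-up rows; `follow_up` and its column names are chosen for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Patient": ["A", "A", "A", "B", "B", "C"],
    "Weeks": [0, 12, 60, -3, 9, 4],
})

# First and last observed week, and number of visits, for each patient
follow_up = toy.groupby("Patient")["Weeks"].agg(
    first_week="min", last_week="max", n_visits="count"
)
follow_up["span_weeks"] = follow_up["last_week"] - follow_up["first_week"]
print(follow_up)
```

Patients with a short span or very few visits are the ones to inspect more closely.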

## FVC vs percent 


Let us look at the relationship between FVC and Percent:

In [None]:
fig, ax = plt.subplots(figsize=(20,15))

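x=df_train['FVC']
y=df_train['Percent']

# Color each point by its FVC value. A separate name avoids shadowing
# the matplotlib `colors` module imported earlier.
point_colors = df_train['FVC']

plt.scatter(x, y, c=point_colors, alpha=0.5)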
plt.show()

As expected, FVC and Percent are strongly related, roughly linearly, because Percent is derived from FVC. In further analysis we therefore need not use both of these variables.
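The strength of a linear relationship can be quantified with the Pearson correlation coefficient instead of being judged from the scatter plot alone. A minimal sketch with made-up paired values standing in for the two columns:

```python
import numpy as np

# Made-up paired values standing in for the FVC and Percent columns
fvc = np.array([1800, 2200, 2600, 3000, 3400], dtype=float)
percent = np.array([52.0, 60.0, 71.0, 78.0, 90.0])

# Pearson correlation: close to +1 means a strong positive linear relationship
r = np.corrcoef(fvc, percent)[0, 1]
print(round(r, 3))
```

On the real data the equivalent call would be `df_train['FVC'].corr(df_train['Percent'])`.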

## Data distribution of few individuals

We will look at the data of a few individual patients to see how FVC changes over the weeks.

In [None]:
df_patient=df_train.groupby(['Patient'])
df_patient.head()

### Patient 1 with id ID00010637202177584971671

In [None]:
patient_df = df_train[df_train['Patient'] == "ID00010637202177584971671"]
print(patient_df)

In [None]:
import plotly.express as px

fig = px.line(patient_df, x="Weeks", y="FVC", title='Patient FVC over the weeks')
fig.show()

For a person with pulmonary fibrosis, lung capacity is expected to deteriorate over time. I will analyse this further in later versions.

### Patient 2 with id ID00426637202313170790466

In [None]:
patient_df = df_train[df_train['Patient'] == "ID00426637202313170790466"]
print(patient_df)

In [None]:
import plotly.express as px

fig = px.line(patient_df, x="Weeks", y="FVC", title='Patient FVC over the weeks')
fig.show()

This patient, who has never smoked, shows a less steep fall in FVC than the ex-smoker seen above. A single pair of patients is not enough to generalise, though. Let us look at one patient who currently smokes:

### Patient 3 with id ID00082637202201836229724

In [None]:
patient_df = df_train[df_train['Patient'] == "ID00082637202201836229724"]
print(patient_df)

In [None]:
import plotly.express as px

fig = px.line(patient_df, x="Weeks", y="FVC", title='Patient FVC over the weeks')
fig.show()

The deterioration is quite steep in this case.
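"Steep" can be made comparable across patients by fitting a straight line to each patient's FVC over the weeks and reading off the slope. This is a simple sketch of that idea, not a modelling recommendation, and the follow-up values are made up.

```python
import numpy as np

# Made-up follow-up for one patient: week number and measured FVC
weeks = np.array([0, 6, 12, 24, 36, 52], dtype=float)
fvc = np.array([2800, 2760, 2700, 2650, 2540, 2480], dtype=float)

# Least-squares straight line: fvc ~ slope * weeks + intercept
slope, intercept = np.polyfit(weeks, fvc, deg=1)
print(f"Estimated change: {slope:.1f} FVC units per week")
```

A more negative slope means a faster decline, so the three patients above could be compared by their fitted slopes.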

We have explored the tabular dataset so far. Now it is time to explore the DICOM dataset. I tried to be very elaborate with the tabular data, and I will try to do the same with the DICOM data.

# DICOM 

**DICOM® — Digital Imaging and Communications in Medicine** — is the international standard for medical images and related information. It defines the formats for medical images that can be exchanged with the data and quality necessary for clinical use.

DICOM® is implemented in almost every radiology, cardiology imaging, and radiotherapy device (X-ray, CT, MRI, ultrasound, etc.), and increasingly in devices in other medical domains such as ophthalmology and dentistry. With hundreds of thousands of medical imaging devices in use, DICOM® is one of the most widely deployed healthcare messaging Standards in the world.

### What is the data related to DICOM?

A DICOM data object consists of a number of attributes, including items such as name, ID, etc., and also one special attribute containing the image pixel data. A single DICOM object can have only one attribute containing pixel data. For many modalities, this corresponds to a single image. However, the attribute may contain multiple "frames", allowing storage of cine loops or other multi-frame data.

Let us start by importing all the dependencies:

In [None]:
# common packages 
import numpy as np 
import os
import copy
from math import *
import matplotlib.pyplot as plt
from functools import reduce
# reading in dicom files
import pydicom as dicom
import glob
# skimage image processing packages
from skimage import measure, morphology
from skimage.morphology import ball, binary_closing
from skimage.measure import label, regionprops
# scipy linear algebra functions 
from scipy.linalg import norm
import scipy.ndimage
# ipywidgets for some interactive plots
from ipywidgets.widgets import * 
import ipywidgets as widgets
# plotly 3D interactive graphs 
import plotly
from plotly.graph_objs import *
import chart_studio.plotly as py
# set plotly credentials here 
# this allows you to send results to your account plotly.tools.set_credentials_file(username=your_username, api_key=your_key)

In [None]:
apply_resample = False

In [None]:
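def load_scan(path):
    # Read every DICOM file in the folder (dcmread replaces the deprecated read_file)
    slices = [dicom.dcmread(os.path.join(path, s)) for s in os.listdir(path)]
    # Order the slices along the z-axis
    slices.sort(key = lambda x: float(x.ImagePositionPatient[2]))
    try:
        slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
    except Exception:
        # Fall back to SliceLocation if the positions cannot be used
        slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)

    # Store the computed spacing on every slice
    for s in slices:
        s.SliceThickness = slice_thickness

    return slices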

In [None]:
def get_pixels_hu(slices):
    image = np.stack([s.pixel_array for s in slices])
    # Convert to int16 (from sometimes uint16), 
    # should be possible as values should always be low enough (<32k)
    image = image.astype(np.int16)

    # Set outside-of-scan pixels to 0
    # The intercept is usually -1024, so air is approximately 0
    image[image == -2000] = 0
    
    # Convert to Hounsfield units (HU)
    for slice_number in range(len(slices)):
        
        intercept = slices[slice_number].RescaleIntercept
        slope = slices[slice_number].RescaleSlope
        
        if slope != 1:
            image[slice_number] = slope * image[slice_number].astype(np.float64)
            image[slice_number] = image[slice_number].astype(np.int16)
            
        image[slice_number] += np.int16(intercept)
    
    return np.array(image, dtype=np.int16)
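The conversion in `get_pixels_hu` is a linear rescale, HU = slope * raw + intercept, using the rescale slope and intercept stored in each slice's header. A tiny worked example with made-up raw values and assumed rescale parameters:

```python
import numpy as np

# Made-up raw pixel values as stored in a CT slice
raw = np.array([[0, 24], [1024, 2024]], dtype=np.int16)

# Assumed rescale parameters; real values come from the DICOM header
slope, intercept = 1.0, -1024.0

# Linear rescale to Hounsfield units: HU = slope * raw + intercept
hu = (slope * raw.astype(np.float64) + intercept).astype(np.int16)
print(hu)
```

On the Hounsfield scale, air is about -1000 and water is 0, so the raw value 1024 maps to water under these parameters.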

In [None]:
def set_lungwin(img, hu=(-1200., 600.)):
    # Clip to the HU window [hu[0], hu[1]] and rescale to 0-255 for display
    lungwin = np.array(hu)
    newimg = (img-lungwin[0]) / (lungwin[1]-lungwin[0])
    newimg[newimg < 0] = 0
    newimg[newimg > 1] = 1
    newimg = (newimg * 255).astype('uint8')
    return newimg
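To see what the lung window does to individual values, here is a compact equivalent written with `np.clip`. `window_to_uint8` is a hypothetical helper for this illustration, using the same default window of -1200 to 600 HU.

```python
import numpy as np

def window_to_uint8(hu_values, low=-1200.0, high=600.0):
    """Clip HU values to [low, high] and rescale them to 0..255."""
    scaled = (np.asarray(hu_values, dtype=np.float64) - low) / (high - low)
    return (np.clip(scaled, 0.0, 1.0) * 255).astype(np.uint8)

# Below the window, at its lower edge, in the middle, at its upper edge, above it
windowed = window_to_uint8([-2000, -1200, -300, 600, 3000])
print(windowed)
```

Everything below the window becomes black, everything above it becomes white, and the grey levels in between are spent on the HU range that matters for lung tissue.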

In [None]:
scans = load_scan('/kaggle/input/osic-pulmonary-fibrosis-progression/train/ID00007637202177411956430/')
scan_array = set_lungwin(get_pixels_hu(scans))

In [None]:
from scipy.ndimage import zoom

def resample(imgs, spacing, new_spacing):
    # Target shape in whole voxels for the requested spacing
    new_shape = np.round(imgs.shape * spacing / new_spacing)
    # Spacing actually obtained after rounding the shape
    true_spacing = spacing * imgs.shape / new_shape
    resize_factor = new_shape / imgs.shape
    imgs = zoom(imgs, resize_factor, mode='nearest')
    return imgs, true_spacing, new_shape

# Average distance between consecutive slices: N slices span N - 1 gaps
spacing_z = abs(scans[-1].ImagePositionPatient[2] - scans[0].ImagePositionPatient[2]) / (len(scans) - 1)

if apply_resample:
    scan_array_resample = resample(scan_array, np.array([spacing_z, *scans[0].PixelSpacing], dtype=float), np.array([1.,1.,1.]))[0]
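The arithmetic inside `resample` is easier to follow with concrete numbers. The sketch below repeats only the shape and spacing calculation, without the interpolation itself, on a made-up volume shape and voxel spacing.

```python
import numpy as np

# Made-up volume shape (slices, rows, columns) and voxel spacing in mm
shape = np.array([30, 512, 512])
spacing = np.array([10.0, 0.7, 0.7])
new_spacing = np.array([1.0, 1.0, 1.0])

# The target shape must be whole voxels, so it is rounded...
new_shape = np.round(shape * spacing / new_spacing)
# ...which means the spacing actually achieved differs slightly from the request
true_spacing = spacing * shape / new_shape
resize_factor = new_shape / shape

print(new_shape, true_spacing, resize_factor)
```

Here the 30 thick slices are stretched to 300, while each 512-pixel axis shrinks to 358, and the in-plane spacing ends up just above the requested 1 mm.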

In [None]:
import imageio
from IPython.display import Image

imageio.mimsave("/tmp/gif.gif", scan_array, duration=0.0001)
Image(filename="/tmp/gif.gif", format='png')

Please upvote if you liked the content in the notebook and please drop in your comments so that I can improve further.

Thanks for reading!