<h1><center>OSIC Pulmonary Fibrosis Progression</center></h1>

### 1.Introduction
Pulmonary fibrosis, or scarring of the lungs, is a family of more than 200 closely related lung diseases. Over time, the scar tissue blocks the movement of oxygen from the tiny air sacs in the lungs into the bloodstream. Identifiable causes of pulmonary fibrosis fall into five main categories:
* drug-induced
* radiation-induced
* environmental
* autoimmune
* occupational

Even so, it can be very challenging for doctors to determine the exact cause of a given case. A pulmonary fibrosis case of unknown cause is called "idiopathic". Fibrotic lung diseases have no known cure and, with current methods, are difficult to treat even with access to a chest CT scan. In addition, outcomes range from long-term stability to rapid deterioration, and doctors have no accurate way to determine where a patient falls on that spectrum.<br>
### 2.Problem statement
Using neural networks, this notebook aims to predict the severity of a patient's decline in lung function from a CT scan of their lungs and several other features, which are described later. Lung function is measured with a spirometer, which measures the volume of air inhaled and exhaled. The resulting value is called FVC, which stands for Forced Vital Capacity.<br>
### 3.Data
* train.csv - the training set, contains the full history of clinical information
* test.csv - the test set, contains only the baseline measurement
* train/ - contains the training patients' baseline CT scan in DICOM format
* test/ - contains the test patients' baseline CT scan in DICOM format
* sample_submission.csv - demonstrates the submission format

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.express as px
import plotly.graph_objs as go

import pydicom
import glob
import imageio
from IPython.display import Image

In [None]:
train_df = pd.read_csv("../input/osic-pulmonary-fibrosis-progression/train.csv")
test_df = pd.read_csv("../input/osic-pulmonary-fibrosis-progression/test.csv")

In [None]:
train_df.head()

In [None]:
test_df.head()

|Column name|Meaning|
|:------------:|:-----------:|
|Patient|a unique Id for each patient (also the name of the patient's DICOM folder)|
|Weeks|the relative number of weeks pre/post the baseline CT (may be negative)|
|FVC|the recorded lung capacity in ml|
|Percent|a computed field which approximates the patient's FVC as a percent of the typical FVC for a person of similar characteristics|
|Age|age of patient|
|Sex|sex of patient|
|SmokingStatus|smoking status of patient|

In [None]:
## Exploring the data
print('Shape of Training data: ', train_df.shape)
print('Shape of Test data: ', test_df.shape)

In [None]:
train_df.info()

In [None]:
print(f"Number of unique ids is {train_df['Patient'].nunique()}")

Since the number of unique ids is smaller than the number of entries (1549), there are several entries per patient. The FVC and Percent values were evidently recorded at several visits for each patient.

In [None]:
# Count the observations per patient, keeping the per-patient attributes.
new_df = (
    train_df.groupby(['Patient', 'Age', 'Sex', 'SmokingStatus'])
    .size()
    .reset_index(name='freq')
    .rename(columns={'Patient': 'id'})
)
new_df.head()

In [None]:
fig = px.bar(new_df, x='id',y ='freq',color='freq')
fig.update_layout(xaxis={'categoryorder':'total ascending'},title='No. of observations for each patient')
fig.update_xaxes(showticklabels=False)
fig.show()

In [None]:
fig = px.histogram(new_df, x='Age',nbins = 42)
fig.update_traces(marker_color='rgb(158,202,225)', marker_line_color='rgb(8,48,107)', marker_line_width=1.5, opacity=0.6)
fig.update_layout(title = 'Distribution of Age for unique patients')
fig.show()

In [None]:
fig = px.histogram(new_df, x='Sex')
fig.update_traces(marker_color='rgb(202,158,225)', marker_line_color='rgb(48,8,107)',
                 marker_line_width=2, opacity=0.8)
fig.update_layout(title = 'Distribution of Sex for unique patients')
fig.show()

In [None]:
fig = px.histogram(new_df, x='SmokingStatus')
fig.update_traces(marker_color='rgb(202,225,158)', marker_line_color='rgb(48,107,8)',
                 marker_line_width=2, opacity=0.8)
fig.update_layout(title = 'Distribution of SmokingStatus for unique patients')
fig.show()

In [None]:
fig = px.histogram(new_df, x='SmokingStatus',color = 'Sex')
fig.update_traces(marker_line_color='black',marker_line_width=2, opacity=0.85)
fig.update_layout(title = 'Distribution of SmokingStatus by Sex for unique patients')
fig.show()

### 4.DICOM
For every patient there is a folder named after their unique id. Each folder contains the images from a single CT scan, together with information about the patient. Although a folder holds many images, they all come from the same scan, carried out on the same day. Each image is one cross-sectional slice, and the slices run from the beginning of the torso up to the neck. FVC measurements were taken before, after, and sometimes on the day of the CT scan. The images are stored as DICOM files. DICOM stands for "Digital Imaging and Communications in Medicine", and a DICOM file has two parts: the header and the dataset. The header describes the encapsulated dataset and consists of a File Preamble, a DICOM prefix, and the File Meta Elements.
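
To make the header layout a little more concrete, here is a minimal sketch that checks for the DICOM prefix. In the DICOM file format the File Preamble is 128 bytes long and is followed by the four-byte prefix `DICM`. The helper `has_dicom_prefix` is a hypothetical name, and the bytes used here are synthetic, not a real scan.

```python
import io

PREAMBLE_LENGTH = 128    # bytes reserved for the File Preamble
DICOM_PREFIX = b"DICM"   # four-byte prefix that follows the preamble


def has_dicom_prefix(stream):
    """Return True if the binary stream has the DICOM prefix after the preamble."""
    stream.seek(PREAMBLE_LENGTH)
    return stream.read(len(DICOM_PREFIX)) == DICOM_PREFIX


# Synthetic in-memory data for illustration only.
fake_dicom = io.BytesIO(b"\x00" * PREAMBLE_LENGTH + DICOM_PREFIX + b"rest of file")
not_dicom = io.BytesIO(b"just some text")

print(has_dicom_prefix(fake_dicom))  # True
print(has_dicom_prefix(not_dicom))   # False
```

With a real file, the same check could be run on a handle opened in binary mode, for example `open(path, "rb")`. In practice `pydicom.dcmread` takes care of parsing the header, so a check like this is mainly useful for understanding the format.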

In [None]:
# Read a single slice from one training patient's CT scan and display it.
img = "../input/osic-pulmonary-fibrosis-progression/train/ID00009637202177434476278/100.dcm"
ds = pydicom.dcmread(img)
plt.figure(figsize=(10, 10))
plt.imshow(ds.pixel_array, cmap=plt.cm.bone)