# Problem Statement

In this competition, you’ll predict a patient’s severity of decline in lung function based on a CT scan of their lungs. You’ll determine lung function based on output from a spirometer, which measures the volume of air inhaled and exhaled. The challenge is to use machine learning techniques to make a prediction with the image, metadata, and baseline FVC as input.

### Let's first learn a bit about Pulmonary Fibrosis

In [None]:
from IPython.display import IFrame, YouTubeVideo
YouTubeVideo('cRVRAKM5ono',width=600, height=400)

I hope it helped. If you like the kernel, please upvote it :)

Hopefully you have already read the data description. If you haven't, it's given below:

## Data Description

The aim of this competition is to predict a patient’s severity of decline in lung function based on a CT scan of their lungs. Lung function is assessed based on output from a spirometer, which measures the forced vital capacity (FVC), i.e. the volume of air exhaled.

In the dataset, you are provided with a baseline chest CT scan and associated clinical information for a set of patients. A patient has an image acquired at time Week = 0 and has numerous follow up visits over the course of approximately 1-2 years, at which time their FVC is measured.

In the training set, you are provided with an anonymized, baseline CT scan and the entire history of FVC measurements.
In the test set, you are provided with a baseline CT scan and only the initial FVC measurement. You are asked to predict the final three FVC measurements for each patient, as well as a confidence value in your prediction.

Since this is real medical data, you will notice the relative timing of FVC measurements varies widely. The timing of the initial measurement relative to the CT scan and the duration to the forecasted time points may be different for each patient. This is considered part of the challenge of the competition. To avoid potential leakage in the timing of follow up visits, you are asked to predict every patient's FVC measurement for every possible week. Those weeks which are not in the final three visits are ignored in scoring.
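The "every patient, every possible week" requirement can be sketched as a patient-by-week grid. This is only an illustration: the patient IDs, the week range, the placeholder values, and the `FVC` / `Confidence` output column names are assumptions, so check sample_submission.csv for the real layout.

```python
import itertools
import pandas as pd

# Hypothetical inputs: two made-up patient IDs and an assumed range of weeks.
patients = ["patient_a", "patient_b"]
weeks = range(-12, 134)  # assumed range; confirm against sample_submission.csv

# One row per (patient, week) pair, so the timing of real visits is never used.
grid = pd.DataFrame(
    list(itertools.product(patients, weeks)),
    columns=["Patient", "Weeks"],
)

# Placeholder prediction and confidence columns (names and values are assumptions).
grid["FVC"] = 2000
grid["Confidence"] = 100

print(grid.shape)
```

A real model would overwrite the two placeholder columns; only the weeks matching each patient's final three visits would count towards the score.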

## Files
This is a synchronous rerun code competition. The provided test set is a small representative set of files (copied from the training set) to demonstrate the format of the private test set. When you submit your notebook, Kaggle will rerun your code on the test set, which contains unseen images.

* train.csv - the training set, contains full history of clinical information
* test.csv - the test set, contains only the baseline measurement
* train/ - contains the training patients' baseline CT scan in DICOM format
* test/ - contains the test patients' baseline CT scan in DICOM format
* sample_submission.csv - demonstrates the submission format

## Columns
train.csv and test.csv
* Patient - a unique Id for each patient (also the name of the patient's DICOM folder)
* Weeks - the relative number of weeks pre/post the baseline CT (may be negative)
* FVC - the recorded lung capacity in ml
* Percent - a computed field which approximates the patient's FVC as a percent of the typical FVC for a person of similar characteristics
* Age
* Sex
* SmokingStatus
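
Since Percent is described as FVC relative to a typical FVC, the implied typical value can be recovered by inverting that ratio. Here is a small sketch on made-up numbers (not rows from train.csv); the `typical_fvc` column name is my own.

```python
import pandas as pd

# Made-up rows for illustration; not taken from train.csv.
toy = pd.DataFrame({"FVC": [2000, 3000], "Percent": [50.0, 75.0]})

# Percent = 100 * FVC / typical FVC, so typical FVC = FVC / (Percent / 100).
toy["typical_fvc"] = toy["FVC"] / (toy["Percent"] / 100)

print(toy)
```

This assumes Percent is exactly that ratio; the description only says it "approximates" it.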

## Importing Libraries

In [None]:
!pip install fastai2 -q

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns

# Load the dependencies
from fastai2.basics import *
from fastai2.callback.all import *
from fastai2.vision.all import *
from fastai2.medical.imaging import *

import pydicom

In [None]:
df_train = pd.read_csv("../input/osic-pulmonary-fibrosis-progression/train.csv")
df_test = pd.read_csv("../input/osic-pulmonary-fibrosis-progression/test.csv")

In [None]:
df_train.head()

In [None]:
df_train.shape,df_test.shape

In [None]:
df_train.nunique()

In [None]:
df_test.nunique()

In [None]:
df_weeks = df_train.groupby("Patient").agg({"Weeks":"nunique","Age":"nunique"}).reset_index()
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(10, 5))
# Pass the column as x= explicitly; newer seaborn versions no longer accept it positionally
sns.countplot(x=df_weeks.Weeks, ax=ax1);
sns.countplot(x=df_weeks.Age, ax=ax2);

What we see:
* More than 120 of the 176 patients have recordings for 9 distinct weeks, and about 30 patients have recordings for 8
* Every patient has a single Age value throughout their weeks of clinical information/recordings
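
The "one age per patient" observation can also be checked directly instead of being read off the plot. Here is a minimal sketch on a toy frame with the same column names as train.csv (the rows are made up):

```python
import pandas as pd

# Toy data mimicking the train.csv columns; values are invented.
toy = pd.DataFrame({
    "Patient": ["a", "a", "a", "b", "b"],
    "Weeks": [0, 5, 9, -2, 7],
    "Age": [70, 70, 70, 65, 65],
})

# Count distinct weeks and distinct ages for each patient.
per_patient = toy.groupby("Patient").agg(
    n_weeks=("Weeks", "nunique"),
    n_ages=("Age", "nunique"),
)

# True only if no patient has more than one Age value.
age_is_constant = bool((per_patient["n_ages"] == 1).all())

print(per_patient)
print(age_is_constant)
```

Swapping `toy` for `df_train` should give the same check on the real data.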

In [None]:
df_patients = df_train[["Patient","Sex","SmokingStatus","Age"]].drop_duplicates()
fig, (ax1,ax2,ax3) = plt.subplots(1,3,figsize=(20, 5),gridspec_kw={'width_ratios': [1,1,2]})
# Pass each column as x= explicitly; newer seaborn versions no longer accept it positionally
sns.countplot(x=df_patients.Sex, ax=ax1);
sns.countplot(x=df_patients.SmokingStatus, ax=ax2);
sns.countplot(x=df_patients.Age, ax=ax3);

What we see:
* ~140 of the 176 patients are male. Fewer than 40 patients (about 22%) are female
* ~120 (68%) of the 176 patients are ex-smokers. ~50 patients never smoked
* Age looks roughly normally distributed, with the highest peak among patients aged 64-74
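
The percentages above were read off the plots. Exact shares can be computed with `value_counts(normalize=True)`. Here is a sketch on a made-up patient table (the rows are invented, only the column names match):

```python
import pandas as pd

# Invented one-row-per-patient table; only the column names match the data.
toy_patients = pd.DataFrame({
    "Sex": ["Male", "Male", "Male", "Female"],
    "SmokingStatus": ["Ex-smoker", "Ex-smoker", "Never smoked", "Never smoked"],
})

# normalize=True returns fractions instead of raw counts.
sex_share = toy_patients["Sex"].value_counts(normalize=True)
smoking_share = toy_patients["SmokingStatus"].value_counts(normalize=True)

print(sex_share)
print(smoking_share)
```

Running the same two lines on `df_patients` would give the real proportions.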

**Let's look at the FVC of one Patient over the weeks**

In [None]:
sns.lineplot(x = "Weeks", y = "FVC", data = df_train[df_train.Patient=="ID00007637202177411956430"]);
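
To put a number on the decline seen in such a line plot, one simple option is a straight-line fit of FVC against Weeks. This is just a sketch on invented measurements, and a linear trend is an assumption rather than something the data guarantees.

```python
import numpy as np

# Invented measurements for a single hypothetical patient.
weeks = np.array([0, 10, 20, 30], dtype=float)
fvc = np.array([3000, 2950, 2900, 2850], dtype=float)

# Degree-1 polynomial fit: fvc ~ slope * weeks + intercept.
slope, intercept = np.polyfit(weeks, fvc, deg=1)

print(f"Estimated change: {slope:.2f} ml per week")
```

A negative slope means FVC is falling over time; for the real patient you would pass the Weeks and FVC columns of the filtered `df_train` instead.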

Now, to understand DICOM, the format in which the CT scans are shared, I went through this kernel: https://www.kaggle.com/avirdee/understanding-dicoms

DICOM (Digital Imaging and COmmunications in Medicine) is the de facto standard that establishes rules allowing medical images (X-ray, MRI, CT) and associated information to be exchanged between imaging equipment from different vendors, computers, and hospitals.

DICOM files typically have a .dcm extension and provide a means of storing data in separate 'tags', such as patient information as well as image/pixel data. A DICOM file consists of a header and image data sets packed into a single file.

To access the files I will be using the fastai2.medical.imaging module. Under the hood, fastai uses pydicom to access the DICOM files.

Pydicom is a Python package for parsing DICOM files. It makes it easy to convert DICOM files into Pythonic structures for easier manipulation. Files are opened using pydicom.dcmread

In [None]:
TRAIN_DATA = "../input/osic-pulmonary-fibrosis-progression/train"

In [None]:
train_files = get_dicom_files(TRAIN_DATA)
train_files

There are 33,026 DICOM files. Let's look at one of the files
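
Because each patient has their own DICOM folder, the total can be broken down into slices per patient. The sketch below uses only the standard library on a temporary, made-up folder tree (the folder and file names are hypothetical); pointing `root` at TRAIN_DATA would give the real breakdown.

```python
import tempfile
from collections import Counter
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Create a fake layout: one folder per patient, one empty .dcm file per slice.
    for patient, n_slices in [("patient_a", 3), ("patient_b", 2)]:
        (root / patient).mkdir()
        for i in range(1, n_slices + 1):
            (root / patient / f"{i}.dcm").touch()

    # Count .dcm files by the name of the folder that contains them.
    slices_per_patient = Counter(p.parent.name for p in root.rglob("*.dcm"))

print(dict(slices_per_patient))
print(sum(slices_per_patient.values()))
```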

In [None]:
info_view = train_files[0]
dimg = dcmread(info_view)
dimg

In [None]:
dimg.show()