# About the competition
This competition is organized by the Open Source Imaging Consortium (OSIC), a non-profit organization.

Pulmonary fibrosis is an incurable lung disease. It occurs when lung tissue becomes damaged and scarred, which impairs lung function and, in turn, breathing.

The goal of the competition is to predict the severity of a patient's decline in lung function from the data provided: a CT scan of the patient's lungs and related details such as gender, smoking status, and FVC (forced vital capacity). Lung function is determined from the output of a spirometer, which measures the volume of air inhaled and exhaled. The challenge is to use machine learning techniques to make this prediction.

If the predictions are accurate, patients and their families will be able to understand a decline in lung function in advance and pursue better treatment or improved health outcomes.

# 1. Importing the packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly as plty
import seaborn as sns
import plotly.graph_objs as go
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.io as pio
import os
%matplotlib inline


In [None]:
path = '../input/osic-pulmonary-fibrosis-progression/'

In [None]:
df_train = pd.read_csv(f'{path}train.csv')
df_test = pd.read_csv(f'{path}test.csv')

# 2. Training Data

## 2.1 Metadata Information

In [None]:
df_train.info()

In [None]:
df_train.describe(include='all').T

In [None]:
df_train.head()

In [None]:
# Collapse the visit-level rows to one row per patient so each patient is counted once
df_tmp = df_train.groupby(['Patient', 'Sex'])['SmokingStatus'].unique().reset_index()

In [None]:
df_tmp

In [None]:
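# unique() returns an array per patient; .str[0] takes its first element
# (each patient is expected to have a single smoking status)
df_tmp['SmokingStatus'] = df_tmp['SmokingStatus'].str[0]
# Sex is already a plain string, so .str[0] shortens it to its first letter
df_tmp['Sex'] = df_tmp['Sex'].str[0]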

In [None]:
df_tmp['SmokingStatus'].value_counts()

In [None]:
df_tmp['Sex'].value_counts()

In [None]:
fig, ax = plt.subplots(1,2, figsize = (20,6), sharex=True)
sns.countplot(x='SmokingStatus',data=df_tmp,ax=ax[0])
sns.countplot(x='SmokingStatus',hue='Sex', data=df_tmp,ax=ax[1])
ax[0].title.set_text('Smoking Status')
ax[1].title.set_text('Smoking Status Vs Sex')
plt.show()

# What the training dataset contains (metadata only, excluding CT scans)
* The training set has 1549 rows with no missing values.
* Data for 176 unique patients is available, including age, gender, smoking status, FVC and weeks.
* Patient ages range from 49 to 88. The average age within the dataset is 67.
* There are 139 male and 37 female patients.
* There are 118 ex-smokers, 49 patients who never smoked and 9 patients who currently smoke.
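
The figures in the list above can be recomputed directly from the metadata. The snippet below is a minimal sketch, not part of the original analysis: `summarize_metadata` is a hypothetical helper, the toy frame stands in for `df_train`, and the column names (`Patient`, `Weeks`, `FVC`, `Age`, `Sex`, `SmokingStatus`) are assumed to match `train.csv`. It also assumes that age, sex and smoking status are constant for each patient.

```python
import pandas as pd


def summarize_metadata(df: pd.DataFrame) -> dict:
    """Hypothetical helper: recompute the summary figures from a metadata frame."""
    # Keep one row per patient (assumes Age, Sex and SmokingStatus
    # do not change between visits of the same patient).
    per_patient = df.drop_duplicates(subset="Patient")
    return {
        "n_rows": len(df),
        "n_missing": int(df.isna().sum().sum()),
        "n_patients": int(df["Patient"].nunique()),
        "age_min": int(per_patient["Age"].min()),
        "age_max": int(per_patient["Age"].max()),
        "sex_counts": per_patient["Sex"].value_counts().to_dict(),
        "smoking_counts": per_patient["SmokingStatus"].value_counts().to_dict(),
    }


# Toy data standing in for df_train (values are made up for illustration)
toy = pd.DataFrame({
    "Patient": ["A", "A", "B", "C", "C"],
    "Weeks": [0, 5, 1, 2, 9],
    "FVC": [2300, 2250, 3100, 1900, 1850],
    "Age": [70, 70, 65, 81, 81],
    "Sex": ["Male", "Male", "Female", "Male", "Male"],
    "SmokingStatus": ["Ex-smoker", "Ex-smoker", "Never smoked",
                      "Currently smokes", "Currently smokes"],
})

summary = summarize_metadata(toy)
print(summary)
```

In the notebook, calling `summarize_metadata(df_train)` would give a quick cross-check of the row count, patient count, age range and category counts in one place.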

# 3. Test Data

In [None]:
df_test.info()

In [None]:
df_test

The test set contains data for only 5 patients.
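
The patient count can be confirmed in code rather than read off the displayed table. The sketch below uses toy frames in place of `df_train` and `df_test`, and assumes both share a `Patient` column. The overlap check is an extra illustration and makes no claim about what the real data contains.

```python
import pandas as pd

# Toy frames standing in for df_train and df_test (made-up values)
toy_train = pd.DataFrame({
    "Patient": ["A", "A", "B", "C"],
    "Weeks": [0, 4, 1, 2],
    "FVC": [2300, 2250, 3100, 1900],
})
toy_test = pd.DataFrame({
    "Patient": ["A", "C"],
    "Weeks": [0, 2],
    "FVC": [2300, 1900],
})

# Number of distinct patients in the test frame
n_test_patients = toy_test["Patient"].nunique()

# Patient IDs that appear in both frames
shared = sorted(set(toy_test["Patient"]) & set(toy_train["Patient"]))

print(f"Test patients: {n_test_patients}, also in train: {shared}")
```

The same two expressions can be applied to `df_test` and `df_train` in the notebook.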

# Data exploration will continue