# Overview of Pulmonary Fibrosis
**By Chase Eby**

This is my first public notebook on Kaggle. I'm taking this as an opportunity to build an entry point into the data exploration process. Hopefully it will be a good starting point for people who, like me, are new to data science.

**Pulmonary Fibrosis Definition**

The word “pulmonary” means lung and the word “fibrosis” means scar tissue— similar to scars that you may have on your skin from an old injury or surgery. So, in its simplest sense, pulmonary fibrosis (PF) means scarring in the lungs. Over time, the scar tissue can destroy the normal lung and make it hard for oxygen to get into your blood. Low oxygen levels (and the stiff scar tissue itself) can cause you to feel short of breath, particularly when walking and exercising. Pulmonary fibrosis isn’t just one disease. It is a family of more than 200 different lung diseases that all look very much alike. The PF family of lung diseases falls into an even larger group of diseases called the interstitial lung diseases (also known as ILD), which includes all of the diseases that have inflammation and/or scarring in the lung. Some interstitial lung diseases don’t include scar tissue. When an interstitial lung disease does include scar tissue in the lung, we call it pulmonary fibrosis.

**Idiopathic pulmonary fibrosis occurs when no cause of the disease can be identified.**

Source: https://www.pulmonaryfibrosis.org/life-with-pf/about-pf

# Basic Data Exploration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
train = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/train.csv')
test = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/test.csv')

In [None]:
print(f'Training Set Shape = {train.shape} - Patients = {train["Patient"].nunique()}')
print(f'Test Set Shape = {test.shape} - Patients = {test["Patient"].nunique()}')

In [None]:
train.head()

In [None]:
test.head()

In [None]:
# Store seaborn's colorblind-friendly palette; the plots below pick their colors from it.
cb = sns.color_palette("colorblind")

# Exploring Patients 

In [None]:
train.describe()

Check for nulls in the dataset

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

Both the training and test sets have zero null values.

In [None]:
# https://www.kaggle.com/andradaolteanu/pulmonary-fibrosis-competition-eda-dicom-prep
# Select unique bio info for the patients (one row per patient)
data = train.groupby("Patient")[["Age", "Sex", "SmokingStatus"]].first().reset_index()

f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 5))

# Distribution of age. histplot replaces distplot, which is deprecated since seaborn 0.11;
# note that it shows counts rather than density on the y-axis.
age = sns.histplot(data["Age"], kde=True, ax=ax1, color=cb[2])
age.set_title("Patient Age Distribution", fontsize=16)

# Pass the column as x=; recent seaborn releases do not accept it positionally.
sex = sns.countplot(x=data["Sex"], ax=ax2, palette=cb[3:6])
sex.set_title("Patient Sex", fontsize=16)

smoke = sns.countplot(x=data["SmokingStatus"], ax=ax3, palette=cb[2:5])
smoke.set_title("Smoking Status", fontsize=16)

* The average patient age is a little under 70.
* Most of the patients are male.
    * Let's look at the smoking status breakdown for each sex.
* Many of the patients are ex-smokers.
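The bullets above are read off the plots. A minimal sketch of how the same figures could be computed directly is shown here. The helper name `patient_summary` and the toy rows are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

def patient_summary(df: pd.DataFrame) -> dict:
    """Summarise demographics using one row per patient."""
    # Keep the first record for each patient so repeat visits are not double counted.
    patients = df.drop_duplicates(subset="Patient")
    return {
        "n_patients": len(patients),
        "mean_age": patients["Age"].mean(),
        "sex_share": patients["Sex"].value_counts(normalize=True).to_dict(),
        "smoking_share": patients["SmokingStatus"].value_counts(normalize=True).to_dict(),
    }

# Made-up rows for illustration only; these are NOT values from train.csv.
toy = pd.DataFrame({
    "Patient": ["A", "A", "B", "C", "C", "D"],
    "Age": [70, 70, 65, 72, 72, 61],
    "Sex": ["Male", "Male", "Female", "Male", "Male", "Male"],
    "SmokingStatus": ["Ex-smoker", "Ex-smoker", "Never smoked",
                      "Ex-smoker", "Ex-smoker", "Currently smokes"],
})

summary = patient_summary(toy)
print(summary)
```

In the notebook, calling `patient_summary(train)` should give the real numbers behind the three plots.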

In [None]:
train.groupby(['Sex', 'SmokingStatus'])['FVC'].agg(['mean','std','count'])

In [None]:
f, (ax1) = plt.subplots(1, figsize = (8, 6))
ax = sns.swarmplot(x="SmokingStatus", y="FVC", hue="Sex",
              palette=cb[1:3], data=train)
ax.set_title("FVC by Smoking Status and Sex", fontsize=16)
ax

According to the plot, women tend to sit at the lower end of the FVC range in all three smoking status categories.

# Which Smoking Status Category Has the Highest FVC?

In [None]:
# Figure
f, (ax1, ax2) = plt.subplots(1,2, figsize = (16, 6))

a = sns.barplot(x = train["SmokingStatus"], y = train["FVC"], ax=ax1, palette=cb[0:4])
b = sns.barplot(x = train["SmokingStatus"], y = train["Percent"], ax=ax2, palette=cb[4:7])

a.set_title("Mean FVC per Smoking Status", fontsize=16)
b.set_title("Mean Percent per Smoking Status", fontsize=16)

**This is surprising: the group with the highest mean FVC is the one that currently smokes.**
* Let's see whether the data offers any more insight into the smoking categories.
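Before reading too much into the bar heights, it may help to check how many rows and how many distinct patients sit behind each bar, since a small group can produce an unstable mean. The sketch below is one way to do that. The helper name `group_sizes` and the toy rows are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def group_sizes(df: pd.DataFrame, group_col: str = "SmokingStatus",
                id_col: str = "Patient") -> pd.DataFrame:
    """Count measurement rows and distinct patients in each group."""
    return (
        df.groupby(group_col)
          .agg(rows=(id_col, "count"), patients=(id_col, "nunique"))
          .sort_values("patients", ascending=False)
    )

# Made-up rows for illustration only; these are NOT values from train.csv.
toy = pd.DataFrame({
    "Patient": ["A", "A", "B", "C", "C", "D"],
    "SmokingStatus": ["Ex-smoker", "Ex-smoker", "Never smoked",
                      "Ex-smoker", "Ex-smoker", "Currently smokes"],
})

sizes = group_sizes(toy)
print(sizes)
```

Running `group_sizes(train)` in the notebook would show whether the current-smoker mean rests on only a handful of patients.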


# Smoking Status By Category
 

In [None]:
f, (ax1, ax2) = plt.subplots(1,2, figsize = (16, 6))
a = sns.barplot(x = train["SmokingStatus"], y = train["Age"],ax=ax1, palette=cb[0:4])

b = sns.violinplot(x=train["SmokingStatus"], y= train['Age'],ax=ax2, palette=cb[7:10])

a.set_title("Average Age for each Smoking Status", fontsize=16)
b.set_title("Distribution of Age For Each Smoking Status", fontsize=16)

While the average age of each smoking status is very similar, the distribution for each group is slightly different.
* The age distribution for current smokers is slightly more negatively skewed than for the other two groups.
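The skew described above is judged by eye from the violin plot. A rough way to put a number on it is the sample skewness of age within each group, sketched below. The helper name `age_skew_by_status` and the toy ages are made up for illustration; a negative value means a longer tail toward younger ages.

```python
import pandas as pd

def age_skew_by_status(df: pd.DataFrame) -> pd.Series:
    """Sample skewness of patient age within each smoking status."""
    # One row per patient, so repeat visits do not distort the age distribution.
    patients = df.drop_duplicates(subset="Patient")
    return patients.groupby("SmokingStatus")["Age"].apply(lambda s: s.skew())

# Made-up ages for illustration only; these are NOT values from train.csv.
toy = pd.DataFrame({
    "Patient": list("ABCDEFGH"),
    "Age": [50, 68, 70, 71, 60, 61, 62, 80],
    "SmokingStatus": ["Currently smokes"] * 4 + ["Ex-smoker"] * 4,
})

skew = age_skew_by_status(toy)
print(skew)
```

Skewness needs at least three patients per group to be defined, and it is noisy for small groups, so the result is best read alongside the group sizes.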


Let's look at FVC at each age for each SmokingStatus. Are there any insights to be drawn from the data?

In [None]:
data = train.groupby(['Age', 'SmokingStatus'])['FVC'].agg(['mean','std'])
data.head()

In [None]:
f, (ax1) = plt.subplots(1, figsize = (16, 6))

ax = sns.lineplot(x="Age", y="FVC", hue = 'SmokingStatus', data=train)
ax.set_title("Average FVC by Age for Each Smoking Status", fontsize=16)

As we can see, current smokers tend to have higher FVC readings.

Ex-smokers seem to be the most stable of the groups, slowly losing FVC in the later years of life.

Non-smokers tend to have lower FVC than the other groups. I wonder if many of the non-smokers have other underlying health issues that may have affected their lung capacity. I don't know whether this data contains both **Idiopathic** and **Non-Idiopathic** pulmonary fibrosis patients.
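To put rough numbers on the three trends described above, one option is a straight-line fit of FVC against age within each smoking status. This is only a sketch under simplifying assumptions: it pools all rows, so it mixes differences between patients with changes within a patient, and the helper name `fvc_slope_by_status` and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

def fvc_slope_by_status(df: pd.DataFrame) -> pd.Series:
    """Least-squares slope of FVC against Age within each smoking status."""
    slopes = {}
    for status, grp in df.groupby("SmokingStatus"):
        # A degree-1 polyfit returns [slope, intercept];
        # each group needs at least two distinct ages for the fit to be meaningful.
        slope, _intercept = np.polyfit(grp["Age"].astype(float),
                                       grp["FVC"].astype(float), deg=1)
        slopes[status] = slope
    return pd.Series(slopes, name="fvc_change_per_year_of_age")

# Made-up values for illustration only; these are NOT values from train.csv.
toy = pd.DataFrame({
    "Age": [60, 65, 70, 55, 60, 65],
    "FVC": [3000, 2900, 2800, 3500, 3500, 3500],
    "SmokingStatus": ["Ex-smoker"] * 3 + ["Currently smokes"] * 3,
})

slopes = fvc_slope_by_status(toy)
print(slopes)
```

A slope close to zero would support the "stable" reading, while a clearly negative slope would point to FVC falling with age in that group.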

**Potential Causes**

**Causes of pulmonary fibrosis include environmental pollutants, some medicines, some connective tissue diseases, and interstitial lung disease. Interstitial lung disease is the name for a large group of diseases that inflame or scar the lungs. In most cases, the cause cannot be found. This is called idiopathic pulmonary fibrosis.**

Source: https://medlineplus.gov/pulmonaryfibrosis.html#:~:text=Causes%20of%20pulmonary%20fibrosis%20include,the%20cause%20cannot%20be%20found.