**Hello Kagglers. This is an in-depth visualization of the OSIC Pulmonary Fibrosis Progression dataset. I have compared almost every feature with one another. Hope you like my work. If you want to add something, just tell me in the comments. If you like my work, please upvote.**

**Pulmonary fibrosis is a lung disease that occurs when lung tissue becomes damaged and scarred. This thickened, stiff tissue makes it more difficult for your lungs to work properly.**

Imagine one day, your breathing became consistently labored and shallow. Months later you were finally diagnosed with pulmonary fibrosis, a disorder with no known cause and no known cure, created by scarring of the lungs. If that happened to you, you would want to know your prognosis. That’s where a troubling disease becomes frightening for the patient: outcomes can range from long-term stability to rapid deterioration, but doctors aren’t easily able to tell where an individual may fall on that spectrum. Your help, and data science, may be able to aid in this prediction, which would dramatically help both patients and clinicians.



Current methods make fibrotic lung diseases difficult to treat, even with access to a chest CT scan. In addition, the wide range of varied prognoses creates issues organizing clinical trials. Finally, patients suffer extreme anxiety, in addition to fibrosis-related symptoms, from the disease’s opaque path of progression.

Open Source Imaging Consortium (OSIC) is a not-for-profit, co-operative effort between academia, industry and philanthropy. The group enables rapid advances in the fight against Idiopathic Pulmonary Fibrosis (IPF), fibrosing interstitial lung diseases (ILDs), and other respiratory diseases, including emphysematous conditions. Its mission is to bring together radiologists, clinicians and computational scientists from around the world to improve imaging-based treatments.

In this competition, you’ll predict a patient’s severity of decline in lung function based on a CT scan of their lungs. You’ll determine lung function based on output from a spirometer, which measures the volume of air inhaled and exhaled. The challenge is to use machine learning techniques to make a prediction with the image, metadata, and baseline FVC as input.

If successful, patients and their families would better understand their prognosis when they are first diagnosed with this incurable lung disease. Improved severity detection would also positively impact treatment trial design and accelerate the clinical development of novel treatments.



In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os

In [None]:
train = pd.read_csv('/kaggle/input/osic-pulmonary-fibrosis-progression/train.csv')

In [None]:
train.head()

# DATA ANALYSIS

Let's first have a look at how much null data we have.

# Data Cleaning

In [None]:
plt.figure(figsize=(10,10)) 
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis')

The visualization above clearly shows that we have no null data.

In [None]:
len(train)

In [None]:
train.isnull().sum()

In [None]:
train['SmokingStatus'].unique()

In [None]:
len(train['Patient'].unique())

**This means that although there are about 1,500 records, they come from only 176 unique patients.**
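
Because each patient appears in several rows, row-level counts of demographic columns such as `SmokingStatus` weight patients by how often they were measured. A minimal sketch of building a one-row-per-patient table is shown below. The helper name `one_row_per_patient` and the small `demo` frame are hypothetical and only mimic the column names of `train.csv`; they are not part of the original notebook.

```python
import pandas as pd


def one_row_per_patient(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the earliest record (smallest Weeks value) of each patient."""
    return (
        df.sort_values(["Patient", "Weeks"])
          .drop_duplicates(subset="Patient", keep="first")
          .reset_index(drop=True)
    )


# Made-up data with the same column names as train.csv (illustrative only).
demo = pd.DataFrame({
    "Patient": ["A", "A", "A", "B", "B", "C"],
    "Weeks": [5, 0, 12, 3, 9, 1],
    "SmokingStatus": ["Ex-smoker"] * 3 + ["Never smoked"] * 2 + ["Currently smokes"],
})

patients = one_row_per_patient(demo)

# Row-level counts depend on the number of visits; patient-level counts do not.
row_counts = demo["SmokingStatus"].value_counts()
patient_counts = patients["SmokingStatus"].value_counts()
print(row_counts)
print(patient_counts)
```

On the real data, something like `one_row_per_patient(train)` could be passed to the count plots if patient-level rather than record-level proportions are wanted.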

# DATA VISUALIZATION

# Visualization Related to Smoking

In [None]:
# Take the labels from the counts themselves so each wedge always gets the right name
sizes = train['SmokingStatus'].value_counts()
labels = sizes.index
colors = plt.cm.afmhot(np.linspace(0, 1, 5))
explode = [0.1] * len(sizes)

plt.rcParams['figure.figsize'] = (9, 9)
plt.pie(sizes, labels=labels, colors=colors, explode=explode, shadow=True)
plt.title('Distribution of Smoking Status', fontsize=20)
plt.legend()
plt.show()

1. **We can infer that most patients are ex-smokers and that the fewest currently smoke.**

In [None]:
plt.style.use('seaborn-white')

sns.countplot(x='SmokingStatus',  data=train)

In [None]:
plt.figure(figsize=(10,10)) 

sns.countplot(x='SmokingStatus',data=train,hue='Sex')

In [None]:
plt.figure(figsize=(20,10)) 

sns.countplot(x='SmokingStatus',data=train,hue='Age')

* Most ex-smokers are in their 70s.
* Most patients who never smoked are in their 60s.
* Although very few patients currently smoke, most of them are younger.

In [None]:
sns.kdeplot(train.loc[train['SmokingStatus'] == 'Ex-smoker', 'Age'], label = 'Ex-smoker',shade=True)

sns.kdeplot(train.loc[train['SmokingStatus'] == 'Never smoked', 'Age'], label = 'Never smoked',shade=True)

# Labeling of plot
plt.xlabel('Age (years)'); plt.ylabel('Density'); plt.title('Distribution of Smokers over Age');

In [None]:
plt.style.use('dark_background')

plt.rcParams['figure.figsize'] = (15, 7)
ax = sns.violinplot(x = train['SmokingStatus'], y = train['Age'], palette = 'Reds')
ax.set_xlabel(xlabel = 'Smoking habit', fontsize = 15)
ax.set_ylabel(ylabel = 'Age', fontsize = 15)
ax.set_title(label = 'Distribution of Smokers over Age', fontsize = 20)
plt.show()

This supports the inference above.

In [None]:
plt.style.use('dark_background')

plt.rcParams['figure.figsize'] = (15, 7)
ax = sns.violinplot(x = train['SmokingStatus'], y = train['Percent'], palette = 'Reds')
ax.set_xlabel(xlabel = 'Smoking Habit', fontsize = 15)
ax.set_ylabel(ylabel = 'Percent', fontsize = 15)
ax.set_title(label = 'Distribution of Smoking Status Over Percentage', fontsize = 20)
plt.show()

# Visualization Related to Age

In [None]:
plt.style.use('dark_background')
train['Age'].value_counts().head(80).plot.bar(color = 'orange', figsize = (20, 7))
plt.title('Different Ages in Data', fontsize = 30, fontweight = 20)
plt.xlabel('Different Age Groups')
plt.ylabel('count')
plt.show()

The most common age in the data is 66.

In [None]:
plt.style.use('seaborn-white')

sns.scatterplot(x='Age',y='Percent',data=train, color='Red')

In [None]:
sns.scatterplot(x='Weeks',y='Age',data=train, color = 'Black')

**Most patients aged 60-75 have their measurements recorded between roughly week 0 and week 60.**

In [None]:
plt.figure(figsize=(10,7))
sns.distplot(train['Age'], color='Black',  bins = 30 )

In [None]:
plt.style.use('dark_background')

sns.kdeplot(train.loc[train['Sex'] == 'Male', 'Age'], label = 'Male',shade=True)

sns.kdeplot(train.loc[train['Sex'] == 'Female', 'Age'], label = 'Female',shade=True)

# Labeling of plot
plt.xlabel('Age (years)'); plt.ylabel('Density'); plt.title('Distribution of Ages over Gender');

In [None]:
plt.style.use('seaborn-white')
plt.rcParams['figure.figsize'] = (15, 8)
ax = sns.boxplot(x = train['Sex'], y = train['Age'], palette = 'viridis')
ax.set_xlabel(xlabel = 'Sex', fontsize = 9)
ax.set_ylabel(ylabel = 'Age', fontsize = 9)
ax.set_title(label = 'Distribution of Ages as per Sex', fontsize = 20)
plt.xticks(rotation = 90)
plt.show()

**The average age of female patients is lower than that of male patients.**
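
The box plot compares medians visually. To put numbers on the comparison, one option is to summarize age by sex with one row per patient, so that patients with more visits do not count more. This is only a sketch: `age_summary_by_sex` and the `demo` frame are hypothetical names with made-up values, not results from the dataset.

```python
import pandas as pd


def age_summary_by_sex(df: pd.DataFrame) -> pd.DataFrame:
    """Count, mean and median age per sex, using one row per patient."""
    per_patient = df.drop_duplicates(subset="Patient")
    return per_patient.groupby("Sex")["Age"].agg(["count", "mean", "median"])


# Made-up data with the same column names as train.csv (illustrative only).
demo = pd.DataFrame({
    "Patient": ["A", "A", "B", "C", "C", "D"],
    "Sex": ["Male", "Male", "Female", "Male", "Male", "Female"],
    "Age": [70, 70, 62, 66, 66, 64],
})

summary = age_summary_by_sex(demo)
print(summary)
```

Calling `age_summary_by_sex(train)` would give the corresponding table for the real data.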

In [None]:
sns.lineplot(x=train['Age'], y=train['Percent'], color='black')  # x/y passed by keyword
plt.title('Age vs Percent', fontsize = 20)

plt.show()

In [None]:
sns.lineplot(x=train['Age'], y=train['FVC'], color='black')  # x/y passed by keyword
plt.title('Age vs FVC', fontsize = 20)

plt.show()

In [None]:
sns.lineplot(x=train['Age'], y=train['Weeks'], color='black')  # x/y passed by keyword
plt.title('Age vs Weeks', fontsize = 20)

plt.show()

In [None]:
plt.figure(figsize=(15,30))
a = plt.subplot(10, 1, 1)
sns.pointplot(x=train.Age, y=train.Percent)
plt.title("Patients' Percent over Age" , fontsize = 25)
plt.ylabel('Percent', fontsize = 15)
plt.xlabel('Age', fontsize = 15)

In [None]:
plt.figure(figsize=(15,30))
a = plt.subplot(10, 1, 1)
sns.pointplot(x=train.Sex, y=train.Age)
plt.title("Patients' Age by Sex" , fontsize = 25)
plt.ylabel('Age', fontsize = 15)
plt.xlabel('Sex', fontsize = 15)

In [None]:
plt.figure(figsize=(15,30))
a = plt.subplot(8, 1, 1)
sns.pointplot(x=train.Age, y=train.FVC)
plt.title("Patients' FVC over Age" , fontsize = 25)
plt.ylabel('FVC', fontsize = 15)
plt.xlabel('Age', fontsize = 15)

# Data Visualization Related to FVC

In [None]:
sns.scatterplot(x='FVC',y='Percent',data=train, color='Black')

**FVC is highly correlated with Percent. The two share a roughly linear relationship: as Percent increases, so does FVC.**
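
The strength of this relationship can be quantified with a Pearson correlation coefficient. The sketch below assumes the column names `FVC` and `Percent` from `train.csv`; the function name `fvc_percent_correlation` and the `demo` values are hypothetical and chosen only to illustrate the call.

```python
import pandas as pd


def fvc_percent_correlation(df: pd.DataFrame) -> float:
    """Pearson correlation between the FVC and Percent columns."""
    return df["FVC"].corr(df["Percent"])  # Series.corr defaults to Pearson


# Made-up, nearly linear values (illustrative only).
demo = pd.DataFrame({
    "FVC": [2000, 2500, 3000, 3500],
    "Percent": [50.0, 62.0, 76.0, 88.0],
})

r = fvc_percent_correlation(demo)
print(f"Pearson r between FVC and Percent: {r:.4f}")
```

A strong correlation is expected here, since the competition's data description defines `Percent` as the patient's FVC expressed as a percentage of the typical FVC for a person of similar characteristics.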

In [None]:
sns.scatterplot(x='FVC',y='Age',data=train, color ='Red')

In [None]:
sns.scatterplot(x='FVC',y='Weeks',data=train, color='magenta')

**Most patients with an FVC between 1500 and 3000 have measurements between week 0 and week 40.**

In [None]:
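# Restrict to numeric columns so text columns (Patient, Sex, SmokingStatus) do not break corr()
train.select_dtypes('number').corr()['FVC'].sort_values()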

In [None]:
plt.style.use('seaborn-white')
plt.figure(figsize=(10,7))
sns.distplot(train['FVC'], color='Blue')

**Most patients have an FVC of around 3000.**

In [None]:
sns.lineplot(x=train['Percent'], y=train['FVC'], color='black')  # x/y passed by keyword
plt.title('Percent vs FVC', fontsize = 20)

plt.show()

# Data Visualization Related to Weeks

In [None]:
plt.style.use('seaborn-white')
train['Weeks'].value_counts().head(80).plot.bar(color = 'red', figsize = (25, 7))
plt.title('Number of Weeks in Data', fontsize = 50, fontweight = 20)
plt.xlabel('Weeks')
plt.ylabel('count')
plt.show()

In [None]:
sns.scatterplot(x='Weeks',y='Percent',data=train, color ='blue')

In [None]:
sns.scatterplot(x='Weeks',y='Age',data=train, color ='blue')

In [None]:
sns.scatterplot(x='Weeks',y='FVC',data=train, color ='blue')

In [None]:
plt.figure(figsize=(10,7))
sns.distplot(train['Weeks'], color='Blue')

**Most patients have had 8-10 checkups.**
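
The `Weeks` histogram shows when measurements were taken, not how many each patient has. One way to check the number of checkups directly is to count rows per patient. The following is a sketch under the assumption that every row of `train.csv` is one measurement; `visits_per_patient` and the `demo` frame are hypothetical names with made-up data.

```python
import pandas as pd


def visits_per_patient(df: pd.DataFrame) -> pd.Series:
    """Number of recorded measurements (rows) for each patient."""
    return df.groupby("Patient").size()


# Made-up data: three patients with 9, 8 and 10 measurements (illustrative only).
demo = pd.DataFrame({
    "Patient": ["A"] * 9 + ["B"] * 8 + ["C"] * 10,
    "Weeks": list(range(9)) + list(range(8)) + list(range(10)),
})

visits = visits_per_patient(demo)
print(visits.describe())
```

Running `visits_per_patient(train).value_counts().sort_index()` on the real data would list how many patients have each number of checkups.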

In [None]:
sns.lineplot(x=train['Weeks'], y=train['Percent'], color='black')  # x/y passed by keyword
plt.title('Percent vs Week', fontsize = 20)

plt.show()

# Data Visualization Related to Sex

In [None]:
sns.countplot(x='Sex',  data=train)

**The majority are male.**

In [None]:
plt.figure(figsize=(15,30))
a = plt.subplot(10, 1, 1)
sns.pointplot(x=train.Sex, y=train.Percent)
plt.title("Patients' Percent by Sex" , fontsize = 25)
plt.ylabel('Percent', fontsize = 15)
plt.xlabel('Sex', fontsize = 15)


In [None]:
plt.figure(figsize=(15,30))
a = plt.subplot(10, 1, 1)
sns.pointplot(x=train.Sex, y=train.FVC)
plt.title("Patients' FVC by Sex" , fontsize = 25)
plt.ylabel('FVC', fontsize = 15)
plt.xlabel('Sex', fontsize = 15)


In [None]:
plt.figure(figsize=(15,30))
a = plt.subplot(10, 1, 1)
sns.pointplot(x=train.Sex, y=train.Weeks)
plt.title("Patients' Weeks by Sex" , fontsize = 25)
plt.ylabel('Weeks', fontsize = 15)
plt.xlabel('Sex', fontsize = 15)


In [None]:
plt.figure(figsize=(10,7))
sns.distplot(train['Percent'], color='Blue')

# MAJOR POINTS OF INFERENCE:

* The data clearly shows that we have no null values.
* Although there are about 1,500 records, they come from only 176 unique patients.
* Most patients are ex-smokers, and the fewest currently smoke.
* Most ex-smokers are in their 70s.
* Most patients who never smoked are in their 60s.
* Although very few patients currently smoke, most of them are younger.
* The most common age in the data is 66.
* Most patients aged 60-75 have their measurements recorded between roughly week 0 and week 60.
* The average age of female patients is lower than that of male patients.
* FVC is highly correlated with Percent. The two share a roughly linear relationship: as Percent increases, so does FVC.
* Most patients with an FVC between 1500 and 3000 have measurements between week 0 and week 40.
* Most patients have had 8-10 checkups.

In [None]:
train_path = '../input/osic-pulmonary-fibrosis-progression/train'

In [None]:
import pydicom as dicom

In [None]:
fs = dicom.dcmread("../input/osic-pulmonary-fibrosis-progression/train/ID00007637202177411956430/1.dcm")

In [None]:
plt.imshow(fs.pixel_array) 
