## OSIC Pulmonary Fibrosis Progression

### What is Pulmonary fibrosis?

> a lung disease that occurs when lung tissue becomes scarred or damaged

> the damaged tissue becomes thickened and stiff

> this damaged tissue affects how the lungs function and makes it difficult for the patient to breathe.

## PREDICTION 

> we need to predict the level of damage in the patient's lungs and their lung function based on:
    
    > CT scan of the patient's lungs
    > output from a spirometer, measuring the volume of air inhaled and exhaled
    > age
    > gender
    > smoking habit and smoking history
    > Weeks: the relative number of weeks before or after the baseline CT scan (can be negative)

## FVC score

> in our data the damage to a patient's lungs is quantified using the FVC (forced vital capacity) score
> For each sample in the test set, an FVC value and a Confidence measure have to be predicted.
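
To illustrate why a Confidence value is predicted alongside each FVC value, here is a minimal sketch. It assumes the predictions are scored with a modified Laplace log likelihood, which this notebook does not state. The function name `laplace_log_likelihood` and the clipping thresholds (70 and 1000) are part of that assumption.

```python
import math

def laplace_log_likelihood(fvc_true, fvc_pred, confidence):
    """Score one prediction; higher (closer to 0) is better.

    Assumed form: the confidence is floored at 70 ml and the absolute
    error is capped at 1000 ml before scoring.
    """
    sigma = max(confidence, 70.0)
    delta = min(abs(fvc_true - fvc_pred), 1000.0)
    return -math.sqrt(2.0) * delta / sigma - math.log(math.sqrt(2.0) * sigma)

# Same small error, two different stated confidences (illustrative numbers)
sharp = laplace_log_likelihood(2500, 2480, confidence=100)
vague = laplace_log_likelihood(2500, 2480, confidence=500)
print(sharp, vague)
```

Under this assumed metric, an accurate prediction with a tight confidence scores better than the same prediction with a very wide one, so the Confidence value cannot simply be set to a large number.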


# FIRST LOOK AT AVAILABLE DATA

### let us look at what data we are working with

In [None]:
import os
import numpy as np 
import pandas as pd 
# List files available
list(os.listdir("../input/osic-pulmonary-fibrosis-progression"))



### LOAD datasets

In [None]:
train_df = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/train.csv')
test_df = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/test.csv')# read csv 
train_df.head(5)#display first 5 entries

## .info()

In [None]:
#let us look at the composition of our dataset, features and datatypes
train_df.info()

# we can see we have 7 columns and 1549 entries in our training set

In [None]:
# let's also look at the test dataset
test_df.info()
# our test dataset is relatively small.


# we have 1549 entries in our training data and 5 entries in our test data.

In [None]:
# look at how many people in our dataset currently smoke, smoked before, or never smoked.
train_df.groupby(['SmokingStatus']).count()['Patient']
# we can see that most of our entries come from patients who have smoked before.

### check for any null values

In [None]:
train_df.isnull().sum(),test_df.isnull().sum()

# we do not have any null values in our data

## how many *unique patients* do we have?
### let's find out by counting the unique values in the Patient column

In [None]:
len(train_df['Patient'].unique()) # in our training set we have 176 unique/individual patients

# this means we have repeated, follow-up measurements for each patient.
# roughly, every patient has about 9 entries in our data (1549 / 176 ≈ 8.8),
# patients who have been followed for longer are likely to have more entries in our data
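
The "about 9 entries per patient" figure is an average; the per-patient counts can be inspected directly. A minimal sketch on a toy frame (the `toy` data below is made up for illustration and only mimics the `Patient` and `Weeks` columns):

```python
import pandas as pd

# Toy stand-in for train_df: one row per visit, several visits per patient
toy = pd.DataFrame({
    "Patient": ["A", "A", "A", "B", "B", "C"],
    "Weeks":   [0, 5, 11, -2, 9, 4],
})

# Number of recorded visits for each patient
visits_per_patient = toy.groupby("Patient").size()
print(visits_per_patient.describe())
```

Running the same `groupby("Patient").size()` on `train_df` would show how far individual patients deviate from the average.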

In [None]:
len(test_df['Patient'].unique())# all the entries in our test set are unique

## Let us look at the number of image files and folders we have



In [None]:
#train data 
files = folders = 0
path = "/kaggle/input/osic-pulmonary-fibrosis-progression/train"

# "_" is the idiom for "we won't be using this value" (here, the directory path)
for _, dirnames, filenames in os.walk(path):
    files += len(filenames)
    folders += len(dirnames)


In [None]:
files,folders

### in our training data we have 176 folders, one per patient, and 33026 image files; each folder has multiple images relating to that patient's lungs
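
The counting loop above is repeated for the train and test directories, so it can be wrapped in a small helper. This is a sketch only: `count_files_and_dirs` is a hypothetical name, and the temporary directory stands in for the real Kaggle input folders.

```python
import os
import tempfile

def count_files_and_dirs(root):
    """Return (number of files, number of sub-directories) below root."""
    n_files = n_dirs = 0
    for _, dirnames, filenames in os.walk(root):
        n_files += len(filenames)
        n_dirs += len(dirnames)
    return n_files, n_dirs

# Build a tiny throwaway tree: 2 patient folders with 3 files each
with tempfile.TemporaryDirectory() as root:
    for patient in ("p1", "p2"):
        os.makedirs(os.path.join(root, patient))
        for i in range(3):
            open(os.path.join(root, patient, f"{i}.dcm"), "w").close()
    counts = count_files_and_dirs(root)

print(counts)  # (6, 2)
```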

In [None]:
files = folders = 0
path = "/kaggle/input/osic-pulmonary-fibrosis-progression/test"

# "_" is the idiom for "we won't be using this value" (here, the directory path)
for _, dirnames, filenames in os.walk(path):
    files += len(filenames)
    folders += len(dirnames)
files,folders

### in our test data we have 5 folders, one per patient, and the number of image files printed above; each folder has multiple images relating to that patient's lungs

## EDA

let us explore our data using graphs

### create a dataset without duplicate entries (one row per patient)


In [None]:
df=train_df[['Patient', 'Age', 'Sex', 'SmokingStatus']].drop_duplicates()

In [None]:
import plotly.express as px

fig = px.histogram(df, x="SmokingStatus",title="distribution of smoking status in our dataset")
fig.show()


In [None]:
import plotly.express as px

fig = px.histogram(train_df, x="Weeks",title="distribution of Weeks scores in our dataset")
fig.show()


In [None]:
import plotly.figure_factory as ff
import numpy as np

# NOTE: synthetic demo data, not taken from the OSIC dataset
x1 = np.random.randn(200)
x2 = np.random.randn(200) + 2

group_labels = ['Group 1', 'Group 2']

colors = ['slategray', 'magenta']

# Create distplot with curve_type set to 'normal'
fig = ff.create_distplot([x1,x2], group_labels, bin_size=.5,
                         curve_type='normal', # override default 'kde'
                         colors=colors)

# Add title
fig.update_layout(title_text='Distplot with Normal Distribution')
fig.show()

In [None]:
train_df["Weeks"].unique()

In [None]:
a = dict(train_df["Weeks"].value_counts())
a

In [None]:
b = []
# empirical probability of each week value = count / total number of rows
for key in a:
    b.append([key, a[key] / len(train_df)])

b
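
The same empirical probabilities can be obtained in one step with pandas. A minimal sketch on made-up week values (the `weeks` series below is illustrative, not the real column):

```python
import pandas as pd

weeks = pd.Series([0, 0, 5, 5, 5, 12, 20, 20], name="Weeks")  # toy values

# normalize=True divides each count by the total number of rows
week_probs = weeks.value_counts(normalize=True).sort_index()
print(week_probs)
```

This avoids hard-coding the number of rows, and the resulting probabilities sum to 1 by construction.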

In [None]:
df["Sex"].value_counts()
#in our unique dataset this is the gender distribution that we have

#we can see that the number adds up to 176, ie the number of unique patients


In [None]:
import pandas as pd
from sklearn import preprocessing

x = pd.DataFrame(train_df['Weeks'])
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
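# keep the scaled values in their own frame so the de-duplicated patient table `df` is not overwritten
weeks_scaled_df = pd.DataFrame(x_scaled, columns=['Weeks'], index=train_df.index)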
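
For reference, min-max scaling maps each value to `(x - min) / (max - min)`. A minimal NumPy sketch of that formula on made-up week values:

```python
import numpy as np

weeks = np.array([-5.0, 0.0, 10.0, 45.0])  # toy values; Weeks can be negative

# Min-max scaling maps the smallest value to 0 and the largest to 1
weeks_scaled = (weeks - weeks.min()) / (weeks.max() - weeks.min())
print(weeks_scaled)  # values: 0.0, 0.1, 0.3, 1.0
```

`MinMaxScaler` with its default settings applies this same mapping column by column.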

In [None]:

fig = px.histogram(train_df, x="Weeks",title="distribution of weeks",)
fig.show()


### we can see that most of the measurements in our dataset were taken between 9 and 60 weeks relative to the baseline CT scan

## let us look at the distribution of age in our training dataset

In [None]:

fig = px.histogram(train_df, x="Age",title="distribution of age",)
fig.show()


## we can say that most of the patients in our data are between 55 and 75 years old

the youngest person in our data is 49, and the oldest is 88 

# let us look at the gender distribution using a scatter plot

In [None]:

fig = px.scatter(train_df, x="Weeks", y="Age", color='Sex')
fig.show()



### here we can see that we have more men than women in our data

In [None]:
fig = px.scatter(train_df, x="Weeks", y="Age", color='SmokingStatus')
fig.show()



FVC - The forced vital capacity

The forced vital capacity (FVC) is the volume of air a patient can forcibly exhale after a full breath; it is recorded here as lung capacity in ml


In [None]:
fig = px.histogram(train_df, x="FVC",title="distribution of FVC score",)
fig.show()

Percent - a computed field which approximates the patient's FVC as a percent of the typical FVC for a person of similar characteristics

my analysis: Percent is the volume of air the patient can breathe out relative to a healthy person of similar age/condition, expressed as a percentage
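
A small worked example of this reading of `Percent`. It assumes the field is a plain ratio of measured to typical FVC; the exact formula behind the "typical FVC" is not given here, and the helper names and numbers are illustrative only.

```python
def percent_of_typical(fvc_ml, typical_fvc_ml):
    """FVC expressed as a percentage of the typical FVC."""
    return 100.0 * fvc_ml / typical_fvc_ml

def implied_typical_fvc(fvc_ml, percent):
    """Invert the relation: the typical FVC implied by an (FVC, Percent) pair."""
    return 100.0 * fvc_ml / percent

# Illustrative numbers only, not taken from the dataset
print(percent_of_typical(2400, 4000))    # 60.0
print(implied_typical_fvc(2400, 60.0))   # 4000.0
```

Under this assumption, a patient whose typical FVC stays fixed has a `Percent` that is exactly proportional to their `FVC`.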

In [None]:
fig = px.scatter(train_df, x="FVC", y="Percent", color='Age')
fig.show()

## here we can see the linear relationship between FVC and Percent


In [None]:
fig = px.scatter(train_df, x="FVC", y="Age", color='Percent')
fig.show()


In [None]:
fig = px.scatter(train_df, x="FVC", y="Age", color='Sex')
fig.show()

## here we can see that women generally have lower FVC scores, but this may also be because women are generally physically smaller and would breathe out a smaller volume of air


## Let us see how the FVC fluctuates over time, using 'Weeks', for one example patient

In [None]:
patients= train_df.Patient.unique()


patient = train_df[train_df.Patient.isin([patients[25]])  ]

fig = px.line(patient, x="Weeks", y="FVC", color='Patient',line_shape='spline')
fig.show()



In [None]:
patient['Weeks']
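
One simple way to summarise a patient's FVC curve is the slope of a straight line fitted through it. The sketch below uses made-up measurements for a hypothetical patient; a linear trend is a simplifying assumption, not something the data guarantees.

```python
import numpy as np

# Toy measurements for one hypothetical patient
weeks = np.array([0, 6, 12, 24, 36, 52], dtype=float)
fvc = np.array([3000, 2950, 2930, 2840, 2790, 2700], dtype=float)

# A degree-1 polyfit returns (slope, intercept) of the least-squares line
slope, intercept = np.polyfit(weeks, fvc, 1)
print(f"FVC changes by about {slope:.1f} ml per week")
```

A negative slope indicates declining lung capacity over the observed period.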

# fitting data to known statistical distributions 

In [None]:
train_df = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/train.csv')


In [None]:
len(train_df), train_df['Age'].mean(),train_df['Age'].max(),train_df['Age'].min()

In [None]:
train_df['Age'].var()

In [None]:
%matplotlib inline

import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
from scipy.stats._continuous_distns import _distn_names
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')

# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Best holders
    best_distributions = []
    # Only the first 25 SciPy continuous distributions are tried, minus a few that are slow or unstable to fit
    candidates = [d for d in _distn_names[:25] if d not in ['levy_stable', 'studentized_range', 'dweibull']]
    # Estimate distribution parameters from data
    for ii, distribution in enumerate(candidates):

        print("{:>3} / {:<3}: {}".format(ii + 1, len(candidates), distribution))

        distribution = getattr(st, distribution)

        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings('ignore')
                
                # fit dist to data
                params = distribution.fit(data)

                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]
                
                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))
                
                # if axis pass in add to plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                    end
                except Exception:
                    pass

                # identify if this distribution is better
                best_distributions.append((distribution, params, sse))
        
        except Exception:
            pass

    
    return sorted(best_distributions, key=lambda x:x[2])

def make_pdf(dist, params, size=10000):
    """Generate the distribution's probability density function (PDF) as a pandas Series"""

    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    # Get sane start and end points of distribution
    start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
    end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)

    # Build PDF and turn into pandas Series
    x = np.linspace(start, end, size)
    y = dist.pdf(x, loc=loc, scale=scale, *arg)
    pdf = pd.Series(y, x)

    return pdf

# copy the data
X = train_df

data = X['Age']

# Plot for comparison
plt.figure(figsize=(16,10))
ax = data.plot(kind='hist', bins=50, density=True, alpha=0.5, color=list(matplotlib.rcParams['axes.prop_cycle'])[1]['color'])

# Save plot limits
dataYLim = ax.get_ylim()

# Find best fit distribution
best_distributions = best_fit_distribution(data, 200, ax)
best_dist = best_distributions[0]

# Update plots
ax.set_ylim(dataYLim)
ax.set_title(u'Age of patients: all fitted distributions')
ax.set_xlabel(u'Age')
ax.set_ylabel('Density')

# Make PDF with best params 
pdf = make_pdf(best_dist[0], best_dist[1])

# Display
plt.figure(figsize=(16,10))
ax = pdf.plot(lw=2, label='PDF', legend=True)
data.plot(kind='hist', bins=50, density=True, alpha=0.5, label='Data', legend=True, ax=ax)

param_names = (best_dist[0].shapes + ', loc, scale').split(', ') if best_dist[0].shapes else ['loc', 'scale']
param_str = ', '.join(['{}={:0.2f}'.format(k,v) for k,v in zip(param_names, best_dist[1])])
dist_str = '{}({})'.format(best_dist[0].name, param_str)

ax.set_title(u'Age of patients with best-fit distribution\n' + dist_str)
ax.set_xlabel(u'Age')
ax.set_ylabel('Density')

In [None]:
pdf
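
The core of the fitting routine above is: build a density histogram, fit each candidate distribution, evaluate its PDF at the bin centres, and rank candidates by sum of squared errors (SSE). A compact sketch of just that step, on synthetic data and with only two candidates (`norm` and `uniform`, chosen here for illustration):

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
ages = rng.normal(loc=67, scale=7, size=5000)  # synthetic, roughly age-like

# Density histogram and the centre of each bin
density, edges = np.histogram(ages, bins=30, density=True)
centres = (edges[:-1] + edges[1:]) / 2.0

def sse_of_fit(dist):
    """Fit dist by maximum likelihood and return its SSE against the histogram."""
    params = dist.fit(ages)
    fitted = dist.pdf(centres, *params)
    return float(np.sum((density - fitted) ** 2))

sse = {name: sse_of_fit(getattr(st, name)) for name in ("norm", "uniform")}
best = min(sse, key=sse.get)
print(sse, best)
```

Note that the SSE depends on the number of bins, so the ranking is a rough guide rather than a formal goodness-of-fit test.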

## here we can see how severely the FVC score fluctuates over time for different patients

In [None]:
fig = px.violin(train_df, y="Percent", color="SmokingStatus",
                violinmode='overlay',)
fig.show()

## interesting to note

### we can see that ex-smokers and non-smokers are scoring low on the Percent metric, but
### people who are CURRENTLY SMOKING are not only scoring generally higher, there is also a clear group of people who are scoring a higher percentage than typical healthy people of their age (above 100 percent)
### could this indicate that their smoking habit, apart from damaging their lungs in various ways, is somehow exercising their lungs?

In [None]:
fig = px.violin(train_df, y="FVC", color="SmokingStatus",
                violinmode='overlay',)
fig.show()

In [None]:
train_df[train_df['SmokingStatus'] == 'Never smoked']


In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="darkgrid")

plt.figure(figsize=(16, 6))
# fill=True replaces the deprecated shade=True (needs seaborn >= 0.11)
sns.kdeplot(train_df.loc[train_df['SmokingStatus'] == 'Ex-smoker', 'Age'], label = 'Ex-smoker', fill=True)
sns.kdeplot(train_df.loc[train_df['SmokingStatus'] == 'Never smoked', 'Age'], label = 'Never smoked', fill=True)
sns.kdeplot(train_df.loc[train_df['SmokingStatus'] == 'Currently smokes', 'Age'], label = 'Currently smokes', fill=True)

# Labeling of plot
plt.xlabel('Age (years)'); plt.ylabel('Density'); plt.title('Distribution of Ages')
plt.legend();



### we can see that current smokers have a fairly even age distribution, people who used to smoke are generally older, and people who have never smoked are generally younger
#### this also follows common sense 

# LETS EXPLORE THE IMAGE DATA

## test and train folders contain image files in .dcm format, https://en.wikipedia.org/wiki/DICOM

### the 'Digital Imaging and Communications in Medicine' format

# we have only seen the structured data until now. it gives us some insight into the patients, but the condition of the lungs can be seen more accurately in images; this is the value of medical imaging.


In [None]:
import os
len(os.listdir('../input/osic-pulmonary-fibrosis-progression/train')),len(os.listdir('../input/osic-pulmonary-fibrosis-progression/test'))
#we have 176 folders containing each patients lung images and we have 5 folders in test dir


In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import pydicom # to view dicom images

imdir = "/kaggle/input/osic-pulmonary-fibrosis-progression/train/ID00123637202217151272140"
print("total images for patient ID00123637202217151272140: ", len(os.listdir(imdir)))

#set grid
# view first (columns*rows) images in order
fig=plt.figure(figsize=(12, 12))
columns = 3
rows = 3
imglist = os.listdir(imdir)# list of files inside ID00123637202217151272140 directory


for i in range(1, columns*rows +1):
    filename = imdir + "/" + str(i) + ".dcm"
    #eg, train/ID00123637202217151272140/1.dcm
    
    #read the file
    ds = pydicom.dcmread(filename)
    
    fig.add_subplot(rows, columns, i)# add space for figure at correct location 
    plt.imshow(ds.pixel_array, cmap='RdBu')# add file to the specific spot in fig
plt.show()#show final figure
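
The loop above assumes the folder contains files named `1.dcm` to `9.dcm`. A more defensive option is to list the folder and sort the names numerically. The sketch below shows only the sorting step on a made-up listing; `sorted_slice_names` is a hypothetical helper.

```python
import os

def sorted_slice_names(names):
    """Sort names like '10.dcm' by their integer stem, ignoring other files."""
    dcm = [n for n in names if n.endswith(".dcm") and os.path.splitext(n)[0].isdigit()]
    return sorted(dcm, key=lambda n: int(os.path.splitext(n)[0]))

# os.listdir gives no ordering guarantee, and plain string sorting puts '10' before '2'
listing = ["10.dcm", "2.dcm", "1.dcm", "notes.txt", "3.dcm"]
print(sorted_slice_names(listing))  # ['1.dcm', '2.dcm', '3.dcm', '10.dcm']
```

Passing `os.listdir(imdir)` to this helper and taking the first nine names would give the same grid without hard-coding the file names.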

# These are baseline CT scans. A CT scan, or computed tomography scan, gives a cross-sectional view of the object or organ

### so we are basically looking at virtual "slices" of specific areas of a scanned object

since these are sequential cross-section images of the patient's lungs, it will be useful to arrange the images into an image sequence or animation. 

https://www.kaggle.com/danpresil1/dicom-basic-preprocessing-and-visualization

In [None]:


import imageio
from IPython.display import Image

import os
import pydicom as dicom
import glob

apply_resample = False

def load_scan(path):
    # dcmread is the current pydicom reader; read_file is its deprecated alias
    slices = [dicom.dcmread(path + '/' + s) for s in os.listdir(path)]
    slices.sort(key = lambda x: float(x.ImagePositionPatient[2]))
    try:
        slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
    except Exception:
        slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)
        
    for s in slices:
        s.SliceThickness = slice_thickness
        
    return slices

def get_pixels_hu(slices):
    image = np.stack([s.pixel_array for s in slices])
    # Convert to int16 (from sometimes uint16), 
    # should be possible as values should always be low enough (<32k)
    image = image.astype(np.int16)

    # Set outside-of-scan pixels to 0
    # The intercept is usually -1024, so air is approximately 0
    image[image == -2000] = 0
    
    # Convert to Hounsfield units (HU)
    for slice_number in range(len(slices)):
        
        intercept = slices[slice_number].RescaleIntercept
        slope = slices[slice_number].RescaleSlope
        
        if slope != 1:
            image[slice_number] = slope * image[slice_number].astype(np.float64)
            image[slice_number] = image[slice_number].astype(np.int16)
            
        image[slice_number] += np.int16(intercept)
    
    return np.array(image, dtype=np.int16)

def set_lungwin(img, hu=[-1200., 600.]):
    lungwin = np.array(hu)
    newimg = (img-lungwin[0]) / (lungwin[1]-lungwin[0])
    newimg[newimg < 0] = 0
    newimg[newimg > 1] = 1
    newimg = (newimg * 255).astype('uint8')
    return newimg
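
The two numeric steps in the helpers above are a linear rescale to Hounsfield units (HU = slope * raw + intercept) followed by a "lung window" that clips HU values to a range and maps it to 0..255. A standalone NumPy sketch on toy values (the stored values, the slope of 1 and the intercept of -1024 are assumptions for illustration, not read from a DICOM file):

```python
import numpy as np

def to_hounsfield(raw, slope, intercept):
    """Apply the linear rescale HU = slope * raw + intercept."""
    return (slope * raw.astype(np.float64) + intercept).astype(np.int16)

def window_to_uint8(hu, low=-1200.0, high=600.0):
    """Clip HU values to [low, high] and rescale them to 0..255."""
    scaled = (hu - low) / (high - low)
    return (np.clip(scaled, 0.0, 1.0) * 255).astype(np.uint8)

raw = np.array([[0, 724, 1924]], dtype=np.int16)       # toy stored values
hu = to_hounsfield(raw, slope=1.0, intercept=-1024.0)  # assumed rescale tags
print(hu)                   # values: -1024, -300, 900
print(window_to_uint8(hu))  # values: 24, 127, 255
```

Values below the window become 0 and values above it become 255, which is what gives the lung tissue its contrast in the rendered slices.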


In [None]:
import numpy as np
from scipy.ndimage import zoom  # scipy.ndimage.interpolation is a deprecated namespace


scans = load_scan('/kaggle/input/osic-pulmonary-fibrosis-progression/train/ID00007637202177411956430/')
scan_array = set_lungwin(get_pixels_hu(scans))




In [None]:
imageio.mimsave("/tmp/gif.gif", scan_array, duration=0.0001)
Image(filename="/tmp/gif.gif", format='png')

In [None]:
import matplotlib.pyplot as plt 
plt.imshow(scan_array[5], animated=True, cmap="gist_rainbow_r")


### scan_array is the array object that contains all the images in sequence 
## let us visualise this sequence using matplotlib.animation 

In [None]:
import matplotlib.animation as animation

fig = plt.figure()

ims = [] # list to store imshow renders

for image in scan_array:
    im = plt.imshow(image, animated=True, cmap="gist_rainbow_r") # render image from array as variable im
    plt.axis("off")
    ims.append([im]) # add to list of images

ani = animation.ArtistAnimation(fig, ims, interval=100, blit=False,

                                repeat_delay=1000)#create animation using matplotlib.animation

In [None]:
from IPython.display import HTML # library required to display this animation
HTML(ani.to_html5_video())

# let us visualize another patient's CT scan

In [None]:
patients[15]

In [None]:

scans = load_scan('/kaggle/input/osic-pulmonary-fibrosis-progression/train/ID00035637202182204917484/')
scan_array = set_lungwin(get_pixels_hu(scans))

fig = plt.figure()

ims = [] # list to store imshow renders

for image in scan_array:
    im = plt.imshow(image, animated=True, cmap="mako") # render image from array as variable im
    plt.axis("off")
    ims.append([im]) # add to list of images

ani = animation.ArtistAnimation(fig, ims, interval=100, blit=False,

                                repeat_delay=1000)#create animation using matplotlib.animation



In [None]:
HTML(ani.to_html5_video())#display as html5 video

## we can see that the above scan has more slices, i.e. a finer cross-section resolution 

In [None]:
imageio.mimsave("/tmp/gif.gif", scan_array, duration=0.0001)
Image(filename="/tmp/gif.gif", format='png')