# OSIC Pulmonary Fibrosis Progression
<img src="https://health.clevelandclinic.org/wp-content/uploads/sites/3/2017/05/Lungs3D.jpg" width=400 height=400>


### What is Pulmonary Fibrosis?
#### Pulmonary fibrosis is a progressive disease that naturally gets worse over time. This worsening is related to the amount of fibrosis (scarring) in the lungs. As the scarring increases, breathing becomes more difficult, eventually resulting in shortness of breath even at rest.
### Types of Pulmonary Fibrosis
<img src="https://www.nationaljewish.org/getattachment/conditions/Familial-Pulmonary-Fibrosis/Forms/FPF_forms_800v2.png" width=600 height=300>

### What causes pulmonary fibrosis?

* Some cases of pulmonary fibrosis occur without a known cause (this is called idiopathic pulmonary fibrosis). Other cases are caused by exposure to environmental hazards (such as asbestos) or by autoimmune diseases (such as rheumatoid arthritis).

* Idiopathic pulmonary fibrosis (IPF) is a disease of unknown etiology with considerable morbidity and mortality. Cigarette smoking is one of the most recognized risk factors for the development of IPF. [Read more](https://www.hindawi.com/journals/pm/2012/808260/)

### Death rate
**Idiopathic pulmonary fibrosis (IPF) portends a poor prognosis. With regard to idiopathic pulmonary fibrosis life expectancy, the estimated mean survival is 2-5 years from the time of diagnosis. Estimated mortality rates are 64.3 deaths per million in men and 58.4 deaths per million in women.**

Death rates in patients with idiopathic pulmonary fibrosis increase with increasing age, are consistently higher in men than women, and experience seasonal variation, with the highest death rates occurring in the winter, even when infectious causes are excluded.

### Treatment
There's currently no cure for idiopathic pulmonary fibrosis (IPF). The main aim of treatment is to relieve the symptoms as much as possible and slow down its progression. As the condition becomes more advanced, end of life (palliative) care will be offered.

## Let's get into the data

In [None]:
# Standard library
import argparse
import os
import re
import warnings
from glob import glob

# Data handling, imaging and modelling
import cv2
import cv2 as cv
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pydicom
import pydicom as dicom
import scipy.misc
import scipy.ndimage
import seaborn as sns
import xgboost
from colorama import Fore, Back, Style
from IPython.display import Image as show_gif
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
from pydicom.filereader import dcmread
from skimage import exposure, measure, morphology
from skimage.transform import resize
from sklearn import svm, datasets
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from statsmodels.formula.api import ols

# Plotly
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objs as go
import plotly.graph_objs as gobj
from plotly import __version__
from plotly.graph_objs import *
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.subplots import make_subplots
from plotly.tools import FigureFactory as FF

# Imported after the plotly star import, which also exports a name `Image`
from PIL import Image

init_notebook_mode(connected=True)
warnings.filterwarnings("ignore")

pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_columns', 10000)
pd.set_option('display.width', 10000)

%matplotlib inline

train=pd.read_csv("../input/osic-pulmonary-fibrosis-progression/train.csv")
train.head(5)

This is the training dataset. It has no missing values, as the next cell confirms.


In [None]:
train.isnull().mean()

## Age Distribution 

In [None]:
hist_data =[train["Age"].values]
group_labels = ['Age'] 

fig = ff.create_distplot(hist_data, group_labels,)
#fig.update_layout(title_text='Age Distribution plot')

fig.show()

* We can clearly see that the patients are predominantly over 50 years old.

### Age Distribution by Gender

In [None]:
fig = px.box(train, x="Sex", y="Age", points="all",)
#fig.update_layout(
  #  title_text="Gender wise Age Spread - Male = 1224 Female =325")
print(train["Sex"].value_counts())
fig.show()

* There are many more male patients than female patients.
* However, the age spread looks similar for both genders.

## FVC Analysis by Smoking Status

In [None]:
train["SmokingStatus"].value_counts()

In [None]:
smok=train[train["SmokingStatus"]=="Ex-smoker"]["FVC"]
not_smok=train[train["SmokingStatus"]=="Never smoked"]["FVC"]
cr_smoke=train[train["SmokingStatus"]=="Currently smokes"]["FVC"]
hist_data = [smok,not_smok,cr_smoke]

group_labels = ['Ex-smoker', 'Never smoked', 'Currently smokes']

fig = ff.create_distplot(hist_data, group_labels, bin_size=.2)

fig.show()

* Never smoked: FVC does not go above about 4400, and most values lie between 1200 and 3400.
* Ex-smoker: values spread across the whole FVC range, mostly between 1200 and 4700.
* Currently smokes: there are few records in this category, so we cannot be sure, but the values reach higher than those of the never-smoked group.

# Percent Analysis by Gender and Smoking Status

In [None]:
fig = px.violin(train, y="Percent", x="Sex", color="SmokingStatus", box=True,
          hover_data=train.columns)

fig.show()


* The airways and lungs continue to mature through childhood and into adolescence, and for the most part males have larger lungs than females.
* Fibrosis stiffens the lungs, reducing their size and capacity.
* Never smoked: Percent does not exceed 149 for either males or females.
* Currently smokes: these patients tend to show higher Percent values.

# Patient-wise FVC over Weeks

In [None]:
# Number each patient's visits in row order (1, 2, 3, ...) so the visit count shows on hover
train["time"] = train.groupby("Patient").cumcount() + 1

fig = px.line(train, x="Weeks", y="FVC", color="SmokingStatus",
              line_group="Patient", hover_name="time")
fig.show()

* FVC values for patients who currently smoke fluctuate up and down; continuing to smoke while under treatment may make the condition worse.
* Hover over a line to see which visit (counted per patient) each point belongs to.
* The highest FVC value belongs to an ex-smoker and the lowest to a patient who never smoked.
* Most of the visible differences are between the FVC values of the never-smoked and ex-smoker groups.

## What are DICOM images?

Digital Imaging and Communications in Medicine (DICOM) is the standard for the communication and management of medical imaging information and related data.

#### Let's look at one of the CT scans

Sample CT-scan of ID00007637202177411956430

In [None]:
dataset = dcmread("../input/osic-pulmonary-fibrosis-progression/train/ID00007637202177411956430/23.dcm")
fig = px.imshow(dataset.pixel_array, color_continuous_scale='plasma')
fig.update_layout(coloraxis_showscale=False)
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)
fig.show()

* Pulmonary fibrosis is a lung disease that occurs when lung tissue becomes damaged and scarred. This thickened, stiff tissue makes it more difficult for your lungs to work properly.
* So if we use image processing to isolate the lung tissue in each slice, an ML model may learn from it more easily and give better accuracy.

In [None]:
from PIL import Image

def img2gif(id_num):
    """Render every DICOM slice of one patient to PNG and combine the frames into gif_ok.gif."""
    # Patient metadata from the chosen row of the training table
    tr = train.iloc[id_num, :]
    d = tr.Patient
    smoke = tr.SmokingStatus
    age = tr.Age
    gender = tr.Sex
    inputdir = '../input/osic-pulmonary-fibrosis-progression/train/' + d
    outdir = './'

    tt = []
    for f in os.listdir(inputdir):
        ds = pydicom.dcmread(os.path.join(inputdir, f))
        # Adaptive histogram equalization to improve local contrast
        img = exposure.equalize_adapthist(ds.pixel_array)
        png_path = outdir + f.replace('.dcm', '.png')
        plt.imsave(png_path, img, cmap="plasma")
        tt.append(png_path)

    # Sort the frames by the slice number embedded in the file name
    tt.sort(key=lambda f: int(re.sub(r'\D', '', f)))
    im_cnt = len(tt)

    # Erosion followed by dilation (morphological opening).
    # Note: a 1x1 kernel leaves the image unchanged; a larger kernel is needed to remove noise.
    kernel = np.ones((1, 1), np.uint8)
    for i in tt:
        im = cv2.imread(i)
        erosion = cv2.erode(im, kernel, iterations=1)
        dilation = cv2.dilate(erosion, kernel, iterations=1)
        cv2.imwrite(i, dilation)

    frames = [Image.open(file) for file in tt]
    # The first frame is the base image, so append only the remaining frames
    frames[0].save('./gif_ok.gif', format='GIF', append_images=frames[1:],
                   save_all=True, duration=400, loop=0)
    return im_cnt, d, smoke, age, gender

im_cnt, id_num, smoke, age, gender = img2gif(1)



### DICOM images to GIF

In [None]:

print("Image count : ",im_cnt,"\nPatient id : ",id_num,"\nSmokingStatus : ",smoke, "\nAge : ",age,"\nGender : ",gender)
show_gif(filename="gif_ok.gif", format='png', width=400, height=400)


# Dilation and Erosion, Morphology

* Dilation and erosion are two fundamental morphological operations. Dilation adds pixels to the boundaries of objects in an image, while erosion removes pixels on object boundaries.
* Opening is erosion followed by dilation. It is useful for removing noise.
* Closing is the reverse of opening: dilation followed by erosion. It is useful for closing small holes inside foreground objects, or small black points on an object.
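The four operations described above can be made concrete on a toy binary image. This is a minimal NumPy-only sketch, not the OpenCV implementation used in this notebook; the helper names (`erode`, `dilate`, `opening`, `closing`), the square 3x3 kernel, and the zero-padded border handling are assumptions made for illustration.

```python
import numpy as np

def erode(img, k=3):
    """Binary erosion: a pixel stays 1 only if its whole k x k neighbourhood is 1."""
    pad = k // 2
    padded = np.pad(img, pad, mode="constant", constant_values=0)
    out = np.ones_like(img)
    for dy in range(k):
        for dx in range(k):
            out &= padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def dilate(img, k=3):
    """Binary dilation: a pixel becomes 1 if any pixel in its k x k neighbourhood is 1."""
    pad = k // 2
    padded = np.pad(img, pad, mode="constant", constant_values=0)
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out |= padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def opening(img, k=3):
    """Erosion followed by dilation: removes small specks."""
    return dilate(erode(img, k), k)

def closing(img, k=3):
    """Dilation followed by erosion: fills small holes."""
    return erode(dilate(img, k), k)

# A clean 5x5 square inside a 9x9 image
square = np.zeros((9, 9), dtype=np.uint8)
square[2:7, 2:7] = 1

noisy = square.copy()
noisy[0, 8] = 1          # isolated noise pixel in the corner

holed = square.copy()
holed[4, 4] = 0          # small hole in the middle of the object

print("Opening removes the speck:", np.array_equal(opening(noisy), square))
print("Closing fills the hole:  ", np.array_equal(closing(holed), square))
```

On this toy input, opening restores the clean square from the noisy image and closing restores it from the image with a hole, which matches the roles described in the bullets.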

In [None]:
# Let's define our kernel size
kernel = np.ones((5,5), np.uint8)
image=dataset.pixel_array
image=exposure.equalize_adapthist(image)
plt.figure(figsize = (65,35))
plt.axis('off')

plt.subplot(341)

# Now we erode
erosion = cv2.erode(image, kernel, iterations = 1)
plt.axis('off')
plt.title("Erosion", fontsize=50)

plt.imshow(erosion)

plt.subplot(342, frameon=False)

kernel = np.ones((5,5), np.uint8)
dilation = cv2.dilate(image, kernel, iterations = 1)
plt.axis('off')
plt.title("Dilation", fontsize=50)

plt.imshow(dilation)

plt.subplot(343, frameon=False)

# Opening - Good for removing noise
opening = cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)
plt.axis('off')
plt.title("Opening", fontsize=50)

plt.imshow(opening)

plt.subplot(344, frameon=False)

# Closing - Good for filling small holes
closing = cv2.morphologyEx(image, cv2.MORPH_CLOSE, kernel)
plt.title("Closing", fontsize=50)
plt.axis('off')

plt.imshow(closing)



# Hounsfield unit (HU) 

#### For further analysis, I'll use the patient with the highest FVC value in the data.

In [None]:
srted=train.sort_values(by="FVC")
hr=srted.iloc[-1,0]

data_path = "../input/osic-pulmonary-fibrosis-progression/train/"+hr
output_path = working_path = "../input/output/"
g = glob(data_path + '/*.dcm')

# Print out the first 5 file names to verify we're in the right folder.
print ("Total of %d DICOM images.\nFirst 5 filenames:" % len(g))
print ('\n'.join(g[:5]))

#### The Hounsfield unit (HU) is a relative quantitative measurement of radiodensity used by radiologists in the interpretation of computed tomography (CT) images. The absorption/attenuation coefficient of radiation within a tissue is used during CT reconstruction to produce a grayscale image.

<table style="width:50%">
  <tr>
    <th>Substance</th>
    <th>HU</th>
  </tr>
  
 
  
   <tr>
 <td> Air	</td>
  <td>−1000</td>
  </tr>
  
   <tr>
<td>Lung</td>
<td>−500</td>
 </tr>
  <tr>
<td>Fat</td>
<td>−100 to −50</td>
 </tr>
  <tr>
<td>Water</td>
<td>0</td>
 </tr>
  <tr>
<td>Blood</td>
<td>+30 to +70</td>
 </tr>
  <tr>
<td>Muscle</td>
<td>+10 to +40</td>
 </tr>
 <tr>
<td>Liver</td>
<td>+40 to +60</td>
 </tr>
  <tr>
<td>Bone</td>
<td>+700 (cancellous bone) to +3000 (cortical bone)</td>
 </tr>

  
</table>
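The stored pixel values of a CT slice are converted to HU with a linear rescale, HU = slope x raw + intercept, where the slope and intercept come from the DICOM fields `RescaleSlope` and `RescaleIntercept`. Below is a minimal sketch of that formula on made-up raw values. The function name `to_hounsfield` is hypothetical, and slope 1 with intercept −1024 is assumed here as a typical configuration rather than read from this dataset.

```python
import numpy as np

def to_hounsfield(raw, slope, intercept):
    """Apply the DICOM linear rescale: HU = slope * raw + intercept."""
    hu = slope * raw.astype(np.float64) + intercept
    return hu.astype(np.int16)

# Made-up raw pixel values for illustration
raw = np.array([[0, 524], [1024, 2024]], dtype=np.int16)

# Assumed rescale parameters (hypothetical, not taken from the OSIC files)
hu = to_hounsfield(raw, slope=1.0, intercept=-1024.0)
print(hu)
```

With these assumed parameters, a raw value of 1024 maps to 0 HU (water) and a raw value of 524 maps to −500 HU (lung in the table above).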


### How to read this histogram

The x-axis shows the HU values from the table above, and the y-axis shows how many voxels across all of this patient's CT slices have that value. The taller the bar at a given HU value, the more of the scan is covered by the corresponding substance.

<br>

For example, air has a value of −1000.<br>
In the histogram below, the bar near −1000 is tall, so the scan contains a lot of air.


In [None]:
# Load every DICOM slice of a scan and convert the pixel data to Hounsfield units.

def load_scan(path):
    # Read all slices in the folder and order them by their position in the scan
    slices = [dicom.dcmread(path + '/' + s) for s in os.listdir(path)]
    slices.sort(key=lambda x: int(x.InstanceNumber))
    # Derive the slice thickness from the distance between two neighbouring slices
    try:
        slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
    except Exception:
        slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)

    for s in slices:
        s.SliceThickness = slice_thickness

    return slices

def get_pixels_hu(scans):
    image = np.stack([s.pixel_array for s in scans])
    image = image.astype(np.int16)

    # Some scanners mark pixels outside the field of view with -2000; set them to 0
    image[image == -2000] = 0

    # Convert to Hounsfield units (HU): HU = slope * raw + intercept
    intercept = scans[0].RescaleIntercept
    slope = scans[0].RescaleSlope

    if slope != 1:
        image = slope * image.astype(np.float64)
        image = image.astype(np.int16)

    image += np.int16(intercept)

    return np.array(image, dtype=np.int16)

id=0
patient = load_scan(data_path)
imgs = get_pixels_hu(patient)
np.save("fullimages_0.npy", imgs)
file_used="fullimages_0.npy"
imgs_to_process = np.load(file_used).astype(np.float64) 
plt.figure(figsize=(20,6))
plt.hist(imgs_to_process.flatten(), bins=50, color='c')
plt.xlabel("Hounsfield Units (HU)")
plt.ylabel("Frequency")
plt.show()



HU values are standardized across all CT scans, which makes them useful for comparison.

* There is a lot of air.
* There is some lung tissue.
* There is a lot of fat, water, blood, muscle and liver.
* There is little bone, and the values do not even reach 3000.
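The observations above can also be expressed as numbers by counting the share of voxels that fall into each HU window. The sketch below uses a small synthetic array instead of the real scan. The window boundaries in `HU_WINDOWS` and the helper name `tissue_fractions` are illustrative assumptions loosely based on the table above, not clinical definitions.

```python
import numpy as np

# Illustrative HU windows as (low, high); low is inclusive, high is exclusive
HU_WINDOWS = {
    "air": (-1100, -900),
    "lung": (-900, -300),
    "fat": (-150, -30),
    "soft tissue": (-30, 100),
    "bone": (700, 3001),
}

def tissue_fractions(hu, windows):
    """Return the fraction of voxels whose HU value lies in each window."""
    hu = np.asarray(hu).ravel()
    return {name: float(np.mean((hu >= lo) & (hu < hi)))
            for name, (lo, hi) in windows.items()}

# Synthetic HU values standing in for a real volume (100 voxels)
hu_values = np.array([-1000] * 50 + [-500] * 20 + [-80] * 10 + [40] * 15 + [900] * 5)

fractions = tissue_fractions(hu_values, HU_WINDOWS)
for name, share in fractions.items():
    print("%-12s %.2f" % (name, share))
```

To apply this to the real data, `hu_values` could be replaced by the array loaded from `fullimages_0.npy`.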

# Image Stack

In [None]:
imgs_to_process = np.load('fullimages_0.npy')

def sample_stack(stack, rows=6, cols=6, start_with=10, show_every=6):
    fig, ax = plt.subplots(rows, cols, figsize=[12, 12])
    for i in range(rows * cols):
        ind = start_with + i * show_every
        # Row-major position in the grid (also correct when rows != cols)
        r, c = divmod(i, cols)
        ax[r, c].set_title('slice %d' % ind)
        ax[r, c].imshow(stack[ind], cmap='gray')
        ax[r, c].axis('off')
    plt.axis('off')
    plt.show()

sample_stack(imgs_to_process)

### Slice Thickness

In [None]:
print ("Slice Thickness: %f" % patient[0].SliceThickness)
print ("Pixel Spacing (row, col): (%f, %f) " % (patient[0].PixelSpacing[0], patient[0].PixelSpacing[1]))


###### This means the slices are 1.0 mm thick, and each pixel within a slice covers 0.79 mm x 0.79 mm.

What is a slice?

* Slice thickness and slice increment are central concepts in CT/MRI imaging. Slice thickness refers to the (often axial) resolution of the scan. Slice increment refers to the movement of the table/scanner between one slice and the next.

What is a voxel?

* “Voxel” is a combination of “volume” and “pixel”; it represents a value on a regular grid in three-dimensional space.

* The CT slices here are reconstructed as 512 x 512 images.

* The DICOM metadata gives the size of each voxel: the pixel spacing within a slice and the slice thickness between slices.
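As a small worked example of these quantities, the sketch below computes the volume of one voxel and the physical extent of a scan from the slice thickness and pixel spacing. The thickness (1.0 mm) and spacing (0.79 mm) are the values reported above; the slice count of 300 and the helper names `voxel_volume_mm3` and `physical_extent_mm` are assumptions for illustration.

```python
import numpy as np

def voxel_volume_mm3(slice_thickness, pixel_spacing):
    """Volume of one voxel: thickness x row spacing x column spacing."""
    return float(slice_thickness * pixel_spacing[0] * pixel_spacing[1])

def physical_extent_mm(shape, slice_thickness, pixel_spacing):
    """Physical size of a (slices, rows, cols) volume along each axis, in mm."""
    spacing = np.array([slice_thickness, pixel_spacing[0], pixel_spacing[1]], dtype=float)
    return np.array(shape, dtype=float) * spacing

slice_thickness = 1.0          # mm, as reported above
pixel_spacing = (0.79, 0.79)   # mm (row, col), as reported above
shape = (300, 512, 512)        # 300 slices is a made-up count for illustration

volume = voxel_volume_mm3(slice_thickness, pixel_spacing)
extent = physical_extent_mm(shape, slice_thickness, pixel_spacing)

print("Voxel volume: %.4f mm^3" % volume)
print("Scan extent (z, y, x) in mm:", extent)
```

Under these assumptions each 512 x 512 slice covers about 404 mm x 404 mm, and one voxel holds roughly 0.62 cubic millimetres of tissue.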


# 3D plotting

#### To plot the scan in 3D, we first have to resample the volume so that every voxel measures 1 x 1 x 1 mm.

In [None]:
imgs_to_process = np.load('fullimages_0.npy')
def resample(image, scan, new_spacing=[1,1,1]):
    # Determine current pixel spacing
    spacing = map(float, ([scan[0].SliceThickness] + list(scan[0].PixelSpacing)))
    spacing = np.array(list(spacing))

    resize_factor = spacing / new_spacing
    new_real_shape = image.shape * resize_factor
    new_shape = np.round(new_real_shape)
    real_resize_factor = new_shape / image.shape
    new_spacing = spacing / real_resize_factor
    
    image = scipy.ndimage.zoom(image, real_resize_factor)
    
    return image, new_spacing

print ("Shape before resampling\t", imgs_to_process.shape)
imgs_after_resamp, spacing = resample(imgs_to_process, patient, [1,1,1])
print ("Shape after resampling\t", imgs_after_resamp.shape)

In [None]:
def make_mesh(image, threshold=-300, step_size=1):

    print ("Transposing surface")
    p = image.transpose(2,1,0)
    
    print ("Calculating surface")
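    # marching_cubes_lewiner was removed in scikit-image 0.19; marching_cubes uses the Lewiner method by default
    verts, faces, norm, val = measure.marching_cubes(p, threshold, step_size=step_size, allow_degenerate=True)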
    return verts, faces

def plotly_3d(verts, faces):
    x,y,z = zip(*verts) 
    
    print ("Drawing")
    
    # Make the colormap single color since the axes are positional not intensity. 
#    colormap=['rgb(255,105,180)','rgb(255,255,51)','rgb(0,191,255)']
    colormap=['rgb(236, 236, 212)','rgb(236, 236, 212)']
    
    fig = ff.create_trisurf(x=x,
                        y=y, 
                        z=z, 
                        plot_edges=False,
                        colormap=colormap,
                        simplices=faces,
                        backgroundcolor='rgb(64, 64, 64)',
                        title="Interactive Visualization")
    iplot(fig)

def plt_3d(verts, faces):
    print("Drawing")
    x,y,z = zip(*verts) 
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(111, projection='3d')

    # Fancy indexing: `verts[faces]` to generate a collection of triangles
    mesh = Poly3DCollection(verts[faces], linewidths=0.05, alpha=1)
    face_color = [1, 1, 0.9]
    mesh.set_facecolor(face_color)
    ax.add_collection3d(mesh)

    ax.set_xlim(0, max(x))
    ax.set_ylim(0, max(y))
    ax.set_zlim(0, max(z))
    ax.set_facecolor((0.7, 0.7, 0.7))
    plt.show()
v, f = make_mesh(imgs_after_resamp, 730, 2)
plotly_3d(v, f)


* The threshold is set high here (730) because we are rendering bone, which the Hounsfield table places at about +700 and above. As the histogram showed, only a small share of the voxels reach bone values, and the slices are 1.0 mm thick.

* The marching cubes algorithm generates a 3D mesh from the dataset. A `step_size` of 2 keeps the mesh small enough to avoid overwhelming the web browser in the interactive Plotly view.



# Apply Mask

A mask image is simply an image where some of the pixel intensity values are zero and others are non-zero. Wherever the pixel intensity is zero in the mask image, the pixel intensity of the resulting masked image is set to the background value (normally zero). [Learn more](http://www.xinapse.com/Manual/masking.html)
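The idea can be shown on a tiny array before applying it to a CT slice. This is a minimal sketch with made-up values; the helper name `apply_mask` and its `background` parameter are hypothetical and only illustrate the definition above.

```python
import numpy as np

def apply_mask(image, mask, background=0):
    """Keep pixels where the mask is non-zero; set all others to the background value."""
    return np.where(mask != 0, image, background)

# Made-up 3x3 image and a cross-shaped mask
image = np.array([[10, 20, 30],
                  [40, 50, 60],
                  [70, 80, 90]])
mask = np.array([[0, 1, 0],
                 [1, 1, 1],
                 [0, 1, 0]])

masked = apply_mask(image, mask)                            # background of zero
masked_air = apply_mask(image, mask, background=-1000)      # background set to the HU value of air

print(masked)
print(masked_air)
```

With a zero background and a 0/1 mask this is equivalent to the element-wise product `mask * image`, which is how the mask is applied in the function below.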

In [None]:
def make_lungmask(img, display=False):
    row_size= img.shape[0]
    col_size = img.shape[1]
    
    mean = np.mean(img)
    std = np.std(img)
    img = img-mean
    img = img/std
    # Find the average pixel value near the lungs
    # to renormalize washed out images
    middle = img[int(row_size/5):int(row_size/5*4),int(col_size/5):int(col_size/5*4)] 
    mean = np.mean(middle)  
    max = np.max(img)
    min = np.min(img)
    # To improve threshold finding, I'm moving the 
    # underflow and overflow on the pixel spectrum
    img[img==max]=mean
    img[img==min]=mean
    #
    # Using Kmeans to separate foreground (soft tissue / bone) and background (lung/air)
    #
    kmeans = KMeans(n_clusters=2).fit(np.reshape(middle,[np.prod(middle.shape),1]))
    centers = sorted(kmeans.cluster_centers_.flatten())
    threshold = np.mean(centers)
    thresh_img = np.where(img<threshold,1.0,0.0)  # threshold the image

    # First erode away the finer elements, then dilate to include some of the pixels surrounding the lung.  
    # We don't want to accidentally clip the lung.

    eroded = morphology.erosion(thresh_img,np.ones([3,3]))
    dilation = morphology.dilation(eroded,np.ones([8,8]))

    labels = measure.label(dilation) # Different labels are displayed in different colors
    label_vals = np.unique(labels)
    regions = measure.regionprops(labels)
    good_labels = []
    for prop in regions:
        B = prop.bbox
        if B[2]-B[0]<row_size/10*9 and B[3]-B[1]<col_size/10*9 and B[0]>row_size/5 and B[2]<row_size/5*4:
            good_labels.append(prop.label)
    mask = np.ndarray([row_size,col_size],dtype=np.int8)
    mask[:] = 0

    #
    #  After just the lungs are left, we do another large dilation
    #  in order to fill in and out the lung mask 
    #
    for N in good_labels:
        mask = mask + np.where(labels==N,1,0)
    mask = morphology.dilation(mask,np.ones([10,10])) # one last dilation

    if (display):
        fig, ax = plt.subplots(3, 2, figsize=[12, 12])
        ax[0, 0].set_title("Original")
        ax[0, 0].imshow(img, cmap='gray')
        ax[0, 0].axis('off')
        ax[0, 1].set_title("Threshold")
        ax[0, 1].imshow(thresh_img, cmap='gray')
        ax[0, 1].axis('off')
        ax[1, 0].set_title("After Erosion and Dilation")
        ax[1, 0].imshow(dilation, cmap='gray')
        ax[1, 0].axis('off')
        ax[1, 1].set_title("Color Labels")
        ax[1, 1].imshow(labels)
        ax[1, 1].axis('off')
        ax[2, 0].set_title("Final Mask")
        ax[2, 0].imshow(mask, cmap='gray')
        ax[2, 0].axis('off')
        ax[2, 1].set_title("Apply Mask on Original")
        ax[2, 1].imshow(mask*img, cmap='gray')
        ax[2, 1].axis('off')
        
        plt.show()
    return mask*img
img = imgs_after_resamp[230]
make_lungmask(img, display=True)

* Here we applied the mask to a single image.
* Masking is often used to isolate a region of interest.
* The final panel, "Apply Mask on Original", shows the isolated lungs clearly. If we fine-tune this step and build an ML model on the masked images, it may give reasonably good accuracy.

### Apply the mask to all slices

In [None]:
masked_lung = []

for img in imgs_after_resamp:
    masked_lung.append(make_lungmask(img))

sample_stack(masked_lung, show_every=10)

* Pulmonary fibrosis damages and scars the lung tissue. With a few adjustments to this mask the lung tissue becomes clearly visible; at this stage we can already see some of it.
* With more research we could extract a cleaner, higher-contrast view of the lung tissue, which may increase an ML model's accuracy.


In [None]:
# Remove all PNG files from the working directory
import os

filelist = [ f for f in os.listdir("./") if f.endswith(".png") ]
for f in filelist:
    os.remove(os.path.join("./", f))


<img src="https://quotefancy.com/media/wallpaper/3840x2160/1753004-Desiderius-Erasmus-Quote-Prevention-is-better-than-cure.jpg" width=400>

There are currently no established ways to prevent pulmonary fibrosis, particularly since in most cases the cause of the disease cannot be identified.


### Factors that Impact PF Prevention

* One of the most common and avoidable risk factors for IPF is smoking. Everyone should quit smoking to avoid not only pulmonary fibrosis but also many other respiratory diseases.
* People over 50 should consider regular medical examinations.
* People who work in contact with toxins and pollutants such as silica dust, asbestos fibers, grain dust, and bird and animal droppings should reduce their exposure by wearing a mask and should make sure that regulations regarding these materials are followed.
* Age and genetics are also risk factors. People with family members who have or have had pulmonary fibrosis can undergo genetic testing, which can help in disease prevention.
* The Open Source Imaging Consortium (OSIC) can also help raise awareness of all of these factors.

## To be continued...