This is my first exploratory data analysis. Of course, there are still many things I can't fix, so I plan to update as much as possible. I'd appreciate it if you all could give me accurate advice.

# Overview

* The [Code Requirements](https://www.kaggle.com/c/osic-pulmonary-fibrosis-progression/overview/code-requirements) state that internet access must be disabled, so let's turn off the Internet in the notebook settings. TPUs also cannot be used to enter this competition, so be careful when creating your submission file!

# What is pulmonary fibrosis?

* The lungs are made up of many small balloon-shaped pouches called alveoli. Pulmonary fibrosis is a general term for a disease that causes inflammation and damage to the walls of these alveoli. This inflammation and damage is thought to gradually make the walls of the alveoli thicker and harder. The hardening of the alveolar walls is called fibrosis. In other words, it seems we can infer that the disease is more advanced in patients whose alveolar walls are thicker.

* It should be noted that the cause is still unknown in more than half of pulmonary fibrosis cases. On chest CT, the disease is characterized by a honeycomb-like appearance of the damaged lung ("honeycombing").

# Environment setup

In [None]:
# linear algebra
import numpy as np
# data processing, CSV file I/O (e.g. pd.read_csv)
import pandas as pd
# operating system interfaces (paths, directories)
import os

# import useful tools
from glob import glob
from PIL import Image
import cv2

# import data visualization
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns

from bokeh.plotting import figure
from bokeh.io import output_notebook, show, output_file
from bokeh.models import ColumnDataSource, HoverTool, Panel
from bokeh.models.widgets import Tabs

# import data augmentation
import albumentations as albu

# import math module
import math

In [None]:
#Libraries
import pandas_profiling
import xgboost as xgb
from sklearn.metrics import log_loss
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

In [None]:
# Cross-validation and random forest model
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Loading data

In [None]:
# Setup the paths to train and test images
DATASET = '../input/osic-pulmonary-fibrosis-progression'
TEST_DIR = '../input/osic-pulmonary-fibrosis-progression/test'
TRAIN_CSV_PATH = '../input/osic-pulmonary-fibrosis-progression/train.csv'

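# Glob the directories and get the lists of train and test patient folders
# (the path separator is needed; DATASET + '*' would only match the dataset folder itself)
train_fns = glob(DATASET + '/train/*')
test_fns = glob(TEST_DIR + '/*')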

In [None]:
# Loading training data and test data
train = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/train.csv')
train_dcm = pd.read_csv('../input/osic-image-eda/n_dicom_df.csv')
train_dcm_shp = pd.read_csv('../input/osic-image-eda/shape_df.csv')
train_meta_dcm = pd.read_csv('../input/pulmonary-fibrosis-prep-data/meta_train_data.csv')
test = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/test.csv')

In [None]:
# Display of training data
print(train)

In [None]:
#Loading Sample Files for Submission
sample = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/sample_submission.csv')
# Confirmation of the format of samples for submission
sample.head(3)

* Patient_Week is the patient ID and the week number joined with an underscore.
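
As a small illustration of that format, the sketch below builds a `Patient_Week` key from two columns and then splits it back again. The patient IDs and week values are made up for the example; only the underscore-joined layout is taken from the sample submission.

```python
import pandas as pd

# Made-up rows for illustration; real IDs come from train.csv / test.csv
toy = pd.DataFrame({
    "Patient": ["ID_A", "ID_A", "ID_B"],
    "Weeks": [-4, 5, 12],
})

# Build the key: "<Patient>_<Weeks>"
toy["Patient_Week"] = toy["Patient"].astype(str) + "_" + toy["Weeks"].astype(str)

# Split it back; rsplit works from the right, so underscores inside an ID survive
parts = toy["Patient_Week"].str.rsplit("_", n=1, expand=True)
toy["Patient_back"] = parts[0]
toy["Weeks_back"] = parts[1].astype(int)

print(toy)
```

Splitting from the right is a design choice for this sketch: it keeps negative week numbers and any underscores in the ID intact.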

# Checking data statistics

In [None]:
# Display the distinct values of smoking status in the training data
print(train['SmokingStatus'].drop_duplicates())

* There are three distinct smoking statuses.

In [None]:
# Display the distinct values of sex in the training data
print(train['Sex'].drop_duplicates())

* The only values for sex are male and female; there are no other values.

In [None]:
# Display some of the training data
train.head(10)

In [None]:
# Display the first rows of the DICOM count data
train_dcm.head(10)

In [None]:
# Display the first rows of the image shape data
train_dcm_shp.head(10)

In [None]:
# Display the first rows of the DICOM metadata
train_meta_dcm.head(10)

In [None]:
# Check for missing values in the training data
train.isnull().sum()

* Therefore, we can conclude that the training data has no missing values.

In [None]:
# Let's check the min value and the max value for Weeks
print("Minimum value for Weeks is: {}".format(train['Weeks'].min()))
print("Maximum value for Weeks is: {}".format(train['Weeks'].max()))

In [None]:
# Check the Patient statistics of the training data
train['Patient'].describe()

In [None]:
# Check the FVC statistics of the training data (0.25 is included so the 25% row is shown)
train['FVC'].describe(percentiles=[0.1,0.2,0.25,0.5,0.75,0.9])

* From the output, a measurement is in the bottom 25% if its FVC is smaller than 2109.

In [None]:
# Check age-related statistics in the training data
train['Age'].describe()

* With an average age of 67, this group appears to be predominantly elderly.

# Create test data

In [None]:
# Display of test data
print(test)

In [None]:
# Combine the Patient ID and Week columns
test['Patient_Week'] = test['Patient'].astype(str)+"_"+test['Weeks'].astype(str)
test_patient_weeklist = test['Patient_Week']
# Drop the source columns step by step (test2 keeps Weeks because it is reused later)
test2 = test.drop('Patient', axis=1)
test3 = test2.drop('Weeks', axis=1)
test4 = test3.reindex(columns=['Patient_Week', 'FVC', 'Percent', 'Age', 'Sex', 'SmokingStatus'])
test4.head(7)

In [None]:
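# Find the number of unique patient IDs.
n = train['Patient'].nunique()
print(n)

In [None]:
# First, I'll use Sturges' formula to find an appropriate number of classes (bins) for the histogram.
# Note: the formula is defined on the number of observations; here n is the number of unique patients.
k = 1 + math.log2(n)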
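
Sturges' rule is commonly written as k = ceil(1 + log2(n)), where n is the number of observations. The sketch below wraps it in a small helper; the function name `sturges_bins` and the sample sizes are made up for illustration. The cells in this notebook truncate with `int(k)` instead of rounding up, which gives at most one bin fewer, and either way it is only a rule of thumb.

```python
import math

def sturges_bins(n_observations: int) -> int:
    """Number of histogram bins suggested by Sturges' rule: ceil(1 + log2(n))."""
    if n_observations < 1:
        raise ValueError("need at least one observation")
    return math.ceil(1 + math.log2(n_observations))

# Illustrative sample sizes only
for n_obs in (100, 1000):
    print(n_obs, sturges_bins(n_obs))
```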

In [None]:
# Display a histogram of the FVC of the training data
# (histplot replaces distplot, which is deprecated in seaborn >= 0.11)
sns.histplot(train['FVC'], kde=True, stat='density', bins=int(k))
# Graph title
plt.title('FVC')
# Show the histogram
plt.show()

In [None]:
# Display a histogram of the age of the training data
sns.histplot(train['Age'], kde=True, stat='density', bins=int(k))
# Graph title
plt.title('Age')
# Show the histogram
plt.show()

In [None]:
# Scatter plot of age against FVC in the training data
sns.scatterplot(data=train, x='Age', y='FVC')

In [None]:
# Compute the correlation coefficient between age and FVC of the training data
# (Series.corr avoids errors from the non-numeric columns in newer pandas versions)
df = train
df['Age'].corr(df['FVC'])

* From the output, there appears to be little or no linear correlation between age and FVC.

In [None]:
# Narrow the training data down to current smokers and compute the correlation coefficient between age and FVC
df_smk = train.query('SmokingStatus == "Currently smokes"')

df_smk['Age'].corr(df_smk['FVC'])

* Likewise, there appears to be little or no linear correlation between age and FVC when focusing on current smokers.
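
The same check can be run for every smoking status at once rather than one group at a time. This is a minimal sketch on made-up numbers (not competition data) that computes the Pearson coefficient between `Age` and `FVC` within each group. Keep in mind that each patient contributes several rows to the real data, so the rows are not independent observations.

```python
import pandas as pd

# Made-up values for illustration only (not competition data)
toy = pd.DataFrame({
    "SmokingStatus": ["Ex-smoker"] * 4 + ["Never smoked"] * 4,
    "Age": [60, 65, 70, 75, 58, 63, 68, 73],
    "FVC": [3000, 2800, 2600, 2400, 2500, 2700, 2600, 2800],
})

# Pearson r between Age and FVC within each smoking-status group
r_by_status = {
    status: group["Age"].corr(group["FVC"])
    for status, group in toy.groupby("SmokingStatus")
}
print(r_by_status)
```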

In [None]:
# Scatter plot of age and FVC for the current smokers in the training data
sns.scatterplot(data=df_smk, x='Age', y='FVC')

In [None]:
# Scatter plot of Percent against FVC in the training data
sns.scatterplot(data=train, x='Percent', y='FVC')

* Note that this plot shows Percent, not age. Percent expresses FVC as a percentage of the typical FVC for a person with similar characteristics, so unlike age it is expected to rise together with FVC.

In [None]:
# Compute summary statistics for FVC aggregated by age
df.groupby('Age')['FVC'].describe()

In [None]:
# Calculate summary statistics for FVC aggregated by patient ID
df.groupby('Patient')['FVC'].describe(percentiles=[0.1,0.2,0.5,0.8])

# Overview of Correlation

In [None]:
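# Correlation matrix of the numeric columns only
# (newer pandas versions raise an error if text columns are included)
df_corr = df.select_dtypes(include='number').corr()
print(df_corr)

In [None]:
# View the correlation heat map
corr_mat = df.select_dtypes(include='number').corr(method='pearson')
sns.heatmap(corr_mat,
            vmin=-1.0,
            vmax=1.0,
            center=0,
            annot=True, # True: displays the values in the grid
            fmt='.1f',
            xticklabels=corr_mat.columns.values,
            yticklabels=corr_mat.columns.values
           )
plt.show()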

In [None]:
# Draw a pie chart about gender, taking the labels from the counts so they cannot be mismatched
sex_counts = train["Sex"].value_counts()
plt.pie(sex_counts, labels=sex_counts.index, autopct="%.1f%%")
plt.title("Ratio of Sex")
plt.show()

* From the output, we can see that the records are overwhelmingly male.

In [None]:
# Draw a pie chart about smoking status, taking the labels from the counts so they cannot be mismatched
smoking_counts = train["SmokingStatus"].value_counts()
plt.pie(smoking_counts, labels=smoking_counts.index, autopct="%.1f%%")
plt.title("SmokingStatus")
plt.show()

* From the output, we can see that far fewer records belong to people who currently smoke.
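
Both pie charts count rows, and a patient with many visits contributes many rows. If the share of patients is what matters, one option is to keep a single row per patient before counting. The sketch below shows the difference on made-up records; the column names match the training data, the values do not.

```python
import pandas as pd

# Made-up records: patient "P1" has three visits, "P2" and "P3" one each
toy = pd.DataFrame({
    "Patient": ["P1", "P1", "P1", "P2", "P3"],
    "Sex": ["Male", "Male", "Male", "Female", "Female"],
})

# Share of rows (visits) per sex
by_row = toy["Sex"].value_counts(normalize=True)

# Share of patients per sex: keep one row per patient first
by_patient = toy.drop_duplicates("Patient")["Sex"].value_counts(normalize=True)

print(by_row)
print(by_patient)
```

In this toy case the row-based share suggests a male majority while the patient-based share shows the opposite, which is why the two views are worth comparing.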

# Intuitive understanding through images

* We've been able to determine the general composition of the patient population, but it's still difficult to understand how these attributes relate to the disease.
* Let's move away from looking at the overall data and focus our search on the patients with the worst symptoms.
* Let's use the FVC statistics to pick out the patients in the top 10% and bottom 10% of FVC and display their image data.
* We want to identify several patient IDs in each group where possible, since any single patient could be an outlier.
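
The selection described above can be sketched as follows: compute the 10th and 90th percentiles of FVC, then list the unique patients with a measurement beyond each cutoff. The data here is made up, and the names `low_cut` and `high_cut` are only for this example; on the real training data the cutoffs would come from `train['FVC']`.

```python
import pandas as pd

# Made-up measurements for illustration only
toy = pd.DataFrame({
    "Patient": ["P1", "P1", "P2", "P2", "P3", "P3", "P4", "P4", "P5", "P5"],
    "FVC": [1500, 1600, 2200, 2300, 2600, 2700, 3100, 3200, 4000, 4100],
})

# 10th and 90th percentiles of all FVC measurements
low_cut, high_cut = toy["FVC"].quantile([0.10, 0.90])

# Unique patients with at least one measurement beyond each cutoff
low_patients = toy.loc[toy["FVC"] < low_cut, "Patient"].unique().tolist()
high_patients = toy.loc[toy["FVC"] > high_cut, "Patient"].unique().tolist()

print(low_cut, high_cut)
print(low_patients, high_patients)
```

Computing the cutoffs from the data avoids copying numbers by hand from the `describe()` output.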

In [None]:
print(train[train.FVC < 1651])

* Let's take a look at the images of patients in the bottom 10% of FVC.

In [None]:
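def extract_num(s, p, ret=0):
    """Return the first group matched by compiled regex p in s as an int, or ret if there is no match."""
    search = p.search(s)
    if search:
        return int(search.groups()[0])
    else:
        return ret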

In [None]:
import pydicom

def plot_pixel_array(dataset, figsize=(5,5)):
    plt.figure(figsize=figsize)
    plt.imshow(dataset.pixel_array, cmap=plt.cm.bone)
    plt.show()

In [None]:
file_path = "../input/osic-pulmonary-fibrosis-progression/train/ID00023637202179104603099/3.dcm"
dataset = pydicom.dcmread(file_path)
plot_pixel_array(dataset)

In [None]:
file_path = "../input/osic-pulmonary-fibrosis-progression/train/ID00023637202179104603099/5.dcm"
dataset = pydicom.dcmread(file_path)
plot_pixel_array(dataset)

In [None]:
file_path = "../input/osic-pulmonary-fibrosis-progression/train/ID00023637202179104603099/7.dcm"
dataset = pydicom.dcmread(file_path)
plot_pixel_array(dataset)

In [None]:
file_path = "../input/osic-pulmonary-fibrosis-progression/train/ID00023637202179104603099/15.dcm"
dataset = pydicom.dcmread(file_path)
plot_pixel_array(dataset)

* Next, let's look at the images of patients in the top 10% of FVC.

In [None]:
print(train[train.FVC > 3874])

In [None]:
file_path = "../input/osic-pulmonary-fibrosis-progression/train/ID00009637202177434476278/11.dcm"
dataset = pydicom.dcmread(file_path)
plot_pixel_array(dataset)

In [None]:
file_path = "../input/osic-pulmonary-fibrosis-progression/train/ID00014637202177757139317/2.dcm"
dataset = pydicom.dcmread(file_path)
plot_pixel_array(dataset)

In [None]:
file_path = "../input/osic-pulmonary-fibrosis-progression/train/ID00032637202181710233084/30.dcm"
dataset = pydicom.dcmread(file_path)
plot_pixel_array(dataset)

In [None]:
file_path = "../input/osic-pulmonary-fibrosis-progression/train/ID00032637202181710233084/35.dcm"
dataset = pydicom.dcmread(file_path)
plot_pixel_array(dataset)

# Create Features

In [None]:
# Separate the training data into features (train_x) and the target FVC (train_y)
train_x = train.drop(['FVC'], axis=1)
train_y = train['FVC']

In [None]:
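# Join the DICOM data frame onto the training data, using the patient ID as the key.
# how='left' keeps every training row in its original order, so the rows stay aligned
# with train_y (assuming train_dcm has one row per patient).
train_x2 = pd.merge(train_x, train_dcm, on='Patient', how='left')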

In [None]:
# Encode the categorical variables as integer codes (the codes themselves are arbitrary)
train_x2['Sex'] = train_x2['Sex'].map({'Male': 0, 'Female': 1})
train_x2['SmokingStatus'] = train_x2['SmokingStatus'].map({'Never smoked': 0, 'Ex-smoker': 1, 'Currently smokes': 2})
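
Integer codes such as 0, 1, 2 put the smoking statuses in an order that may not mean anything. One alternative, shown here only as a sketch on made-up rows, is one-hot encoding with `pd.get_dummies`, which creates one indicator column per category. The notebook itself keeps the integer codes.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Sex": ["Male", "Female", "Male"],
    "SmokingStatus": ["Ex-smoker", "Never smoked", "Currently smokes"],
})

# One indicator column per category instead of a single integer code
encoded = pd.get_dummies(toy, columns=["Sex", "SmokingStatus"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

If this approach were used, the test data would need the same indicator columns in the same order as the training data.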

In [None]:
# Confirmation of current value
print(train_x2)

In [None]:
# Combine the Patient ID and Week columns
train_x2['Patient_Week'] = train_x2['Patient'].astype(str)+"_"+train_x2['Weeks'].astype(str)
train_x3 = train_x2.drop('Patient', axis=1)
train_x4 = train_x3.drop('Weeks', axis=1)
train_x5 = train_x4.reindex(columns=['Patient_Week', 'Percent', 'Age', 'Sex', 'SmokingStatus', 'n_dicom', 'n_list'])
train_x5.head(7)
# Confirming the converted value
print(train_x5)

In [None]:
# Encode the categorical variables of the test data with the same integer codes as the training data
test2['Sex'] = test2['Sex'].map({'Male': 0, 'Female': 1})
test2['SmokingStatus'] = test2['SmokingStatus'].map({'Never smoked': 0, 'Ex-smoker': 1, 'Currently smokes': 2})

In [None]:
test3 = test2.drop('Weeks', axis=1)
test4 = test3.drop('FVC', axis=1)
test5 = test4.reindex(columns=['Patient_Week', 'Percent', 'Age', 'Sex', 'SmokingStatus'])
test5.head(7)
# Confirming the converted value
print(test5)

In [None]:
# I'll just copy test data
test_x = test5.copy()
print(test_x)

# Choosing "Features"

In [None]:
osic_features = ['Percent', 'Age', 'Sex', 'SmokingStatus']

In [None]:
X = train_x5[osic_features]

# Building Model

In [None]:
# Define model. Specify a number for random_state to ensure same results each run
osic_model = DecisionTreeRegressor(random_state=1)

# Fit model
osic_model.fit(X, train_y)

In [None]:
print(X.head())
print("The predictions are")
print(osic_model.predict(X.head()))
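
The predictions above are made on rows the tree has already seen, so they say little about how the model would do on new patients. A rough way to check that is cross-validation that keeps all rows of a patient in the same fold. The sketch below uses synthetic data and scikit-learn's `GroupKFold`; the names `X_toy`, `y_toy` and `patient` are made up for the example, and mean absolute error is used only as a simple stand-in, not the competition metric.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only: 20 patients with 5 rows each
rng = np.random.default_rng(0)
n_patients, n_visits = 20, 5
patient = np.repeat(np.arange(n_patients), n_visits)
X_toy = pd.DataFrame({
    "Percent": rng.uniform(40, 120, size=patient.size),
    "Age": np.repeat(rng.integers(50, 85, size=n_patients), n_visits),
})
y_toy = 30 * X_toy["Percent"] + rng.normal(0, 100, size=patient.size)

# Keep all rows of a patient in the same fold, so no patient appears in both
# the training part and the validation part of a split
cv = GroupKFold(n_splits=5)
scores = cross_val_score(
    DecisionTreeRegressor(random_state=1),
    X_toy,
    y_toy,
    groups=patient,
    cv=cv,
    scoring="neg_mean_absolute_error",
)
print("MAE per fold:", -scores)
```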

In [None]:
# Let's visualize the FVC of Training Data
# (train_x5 has no FVC column, so the target train_y is plotted instead)
# plt.figure(figsize=(18,6))
# plt.plot(train_y, label = "Train_Data")
# plt.legend()

In [None]:
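# Let's visualize the FVC predictions
plt.figure(figsize=(18,6))

# Predict on the training features; plotting X itself would draw the features, not the predictions
Y_train_Graph = pd.DataFrame(osic_model.predict(X), columns=["Predict"])
plt.plot(Y_train_Graph, label = "Predict")
plt.legend()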

# Creation of a submission file

In [None]:
# Create an empty submission frame with the required columns (no predictions are filled in yet)
submission = pd.DataFrame(columns = ["Patient_Week", "FVC", "Confidence"])
# Export the file
submission.to_csv('submission.csv', index=False)
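
The cell above writes only the header row, because the frame has no rows yet. The sketch below shows one way the three columns could be filled. Everything in it is a placeholder: the `Patient_Week` values and predictions are made up, and `ASSUMED_CONFIDENCE` is an arbitrary constant rather than a tuned value. A real submission would also need one row for every `Patient_Week` in the sample submission.

```python
import pandas as pd

# Made-up test rows and predictions for illustration only
test_rows = pd.DataFrame({
    "Patient_Week": ["ID_A_-12", "ID_A_-11", "ID_B_-12"],
})
fvc_pred = [2800.0, 2795.0, 3100.0]  # stand-in for the output of model.predict(...)
ASSUMED_CONFIDENCE = 250.0           # placeholder constant, not a tuned value

submission_sketch = pd.DataFrame({
    "Patient_Week": test_rows["Patient_Week"],
    "FVC": fvc_pred,
    "Confidence": ASSUMED_CONFIDENCE,
})

# Render as CSV text instead of writing a file, to inspect the layout
csv_text = submission_sketch.to_csv(index=False)
print(csv_text)
```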

# References & Credits

* [Tokyo Medical and Dental University](http://www.tmd.ac.jp/med/pulm/d1.html)
* [using-categorical-data-with-one-hot-encoding](https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding)
* [OSIC - EDA & SimpleModel & Overview 【With日本語】](https://www.kaggle.com/koheist/osic-eda-simplemodel-overview-with)
* [Random Forest Submission](https://www.kaggle.com/srikanthpotukuchi/random-forest-submission)
* [OSIC / image shape EDA and preprocess](https://www.kaggle.com/currypurin/osic-image-shape-eda-and-preprocess/data)
* [Doing something with DICOM images in Python](https://qiita.com/fukuit/items/ed163f9b566baf3a6c3f)