# What is Pulmonary Fibrosis?
According to [this website](https://www.pulmonaryfibrosis.org/life-with-pf/about-pf), Pulmonary Fibrosis is a lung disease that can be caused by many factors (e.g., drugs, radiation, or the environment).

Some of its symptoms include:
- Fatigue and weakness
- Loss of appetite
- Unexplained weight loss

![A brief explanation of the Pulmonary Fibrosis](https://www.pulmonaryfibrosis.org/images/default-source/default-album/normal-and-impaired-gas-exchange.png?sfvrsn=c3b0918d_0)

In [None]:
! ls -all ../input/osic-pulmonary-fibrosis-progression

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
dir = '../input/osic-pulmonary-fibrosis-progression/'
# Read in train.csv.
train_df = pd.read_csv(dir + 'train.csv')
train_df.head(10)

# What's the age distribution of our patients?

In [None]:
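# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement.
sns.histplot(train_df['Age'], kde=True, stat='density')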

# FVC distribution for both genders

In [None]:
# Plot the distribution of FVC for Male patients.
sns.histplot(train_df[train_df['Sex'] == 'Male']['FVC'], kde=True, stat='density')

In [None]:
# Plot the distribution of FVC for Female patients.
sns.histplot(train_df[train_df['Sex'] == 'Female']['FVC'], kde=True, stat='density')

According to [this site](https://www.nuvoair.com/blog/do-you-know-how-to-interpret-the-results-of-your-spirometry-test), 
- The average normal FVC in healthy males aged 20 to 60 ranges from 4750 to 5500 ml.
- The average normal FVC in healthy females aged 20 to 60 ranges from 3250 to 3750 ml.


Our data shows:
- Male patients' FVC is around 3000 ml.
- Female patients' FVC is around 1500 ml, although the distribution has a fat tail.
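
The rough figures above were read off the plots; a per-sex summary would give exact numbers. The sketch below runs on a tiny made-up frame (`toy_df`, illustrative values only, not taken from the dataset) that reuses the `Sex` and `FVC` column names of `train.csv`. On the real data the same `groupby` would presumably be applied to `train_df`.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror train.csv.
toy_df = pd.DataFrame({
    'Sex': ['Male', 'Male', 'Male', 'Female', 'Female', 'Female'],
    'FVC': [2800, 3000, 3400, 1400, 1500, 2600],
})

# The median is less sensitive than the mean to a fat right tail.
fvc_by_sex = toy_df.groupby('Sex')['FVC'].agg(['median', 'mean'])
print(fvc_by_sex)
```

Reporting the median next to the mean makes a fat tail visible: the two drift apart when a few large values pull the mean up.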

# What does avg(FVC) partitioned by week look like?

Since the patients are receiving treatment, I expect the aggregated avg(FVC) across all patients to trend upward.

In [None]:
# Calculate the average FVC per week (partitioned by the Weeks column).
train_df['avg_of_week'] = train_df.groupby(['Weeks'])['FVC'].transform('mean')

In [None]:
# A specific patient's FVC compared to avg(FVC) per week.

id_recover = 'ID00076637202199015035026'

def plot_specific_patient(patient_id):
    # Keep only this patient's rows, ordered by week.
    patient_df = train_df[train_df['Patient'] == patient_id].sort_values(by='Weeks')
    # Reshape to long format so both series share one lineplot.
    sns.lineplot(
        x='Weeks',
        y='value',
        hue='variable',
        data=pd.melt(patient_df[['Weeks', 'FVC', 'avg_of_week']], 'Weeks')
    )

plot_specific_patient(id_recover)

For this patient (ID00076637202199015035026), the treatment is obviously working: the patient's FVC (blue line) rises as the weeks go on. (Becoming healthy?)

In [None]:
id_notRecover = 'ID00007637202177411956430'
plot_specific_patient(id_notRecover)

Other, less fortunate patients (like ID00007637202177411956430) see their FVC keep falling, which is why the avg(FVC) across all patients (orange line) shows no clear upward trend.
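
One way to rely less on eyeballing the plots is to fit a straight line to each patient's FVC over time and look at the sign of the slope. The helper name `fvc_slope` is hypothetical, and the two series below are made-up numbers, not the patients discussed above. Treat this as a rough sketch, not a clinical measure.

```python
import numpy as np

def fvc_slope(weeks, fvc):
    """Least-squares slope of FVC against week (FVC units per week)."""
    slope, _intercept = np.polyfit(weeks, fvc, deg=1)
    return slope

# Illustrative series only.
weeks = [0, 4, 8, 12, 16]
rising = [2300, 2350, 2420, 2480, 2500]
falling = [2300, 2250, 2150, 2100, 2000]

print(fvc_slope(weeks, rising))   # positive slope: FVC trending up
print(fvc_slope(weeks, falling))  # negative slope: FVC trending down
```

Applied per patient (for instance through `train_df.groupby('Patient')`), the share of positive slopes would show how many patients actually improve.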

# What does the CT look like?

There are multiple CT images (.dcm) for a single patient.
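
To see how many images each patient has, the `.dcm` files can be counted per folder. This is a minimal sketch that assumes the `train/<Patient>/*.dcm` layout this notebook reads from; `count_slices` is a hypothetical helper name.

```python
import os

def count_slices(train_root):
    """Map each patient folder under train_root to its number of .dcm files."""
    counts = {}
    for patient in sorted(os.listdir(train_root)):
        patient_dir = os.path.join(train_root, patient)
        if os.path.isdir(patient_dir):
            counts[patient] = sum(
                name.endswith('.dcm') for name in os.listdir(patient_dir)
            )
    return counts
```

In this notebook it would be called as `count_slices('../input/osic-pulmonary-fibrosis-progression/train/')`.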

In [None]:
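def plot_CT(id):
    import pydicom
    import os

    col = 5
    row = 5

    fig = plt.figure(figsize=(12, 12))
    path = dir + 'train/' + id + '/'
    # Order slices by their numeric filename (1.dcm, 2.dcm, ..., 10.dcm)
    # rather than assuming that 1.dcm through 25.dcm all exist.
    imgs = sorted(
        (f for f in os.listdir(path) if f.endswith('.dcm')),
        key=lambda f: int(os.path.splitext(f)[0])
    )

    # Show at most row*col slices.
    for i, img in enumerate(imgs[:row * col], start=1):
        ds = pydicom.dcmread(path + img)
        fig.add_subplot(row, col, i)
        plt.imshow(ds.pixel_array, cmap='gray')
    plt.show()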

# For recovered patient.
plot_CT(id_recover)

In [None]:
# For NOT recovered patient.
plot_CT(id_notRecover)

For the recovered patient, the CT shows the lungs **turning dark** (I assume this is a good thing).

For the patient who did not recover, the CT shows the lungs **turning white** (I guess this means the condition is getting worse...)

# Build a baseline model.
Modified based on [Y.Nakama's notebook](https://www.kaggle.com/yasufuminakama/osic-lgb-baseline).

In [None]:
import logging
import os
from logging import getLogger, StreamHandler, FileHandler, Formatter
from tqdm.notebook import tqdm
from sklearn.model_selection import StratifiedKFold, GroupKFold, KFold
from sklearn.metrics import mean_squared_error
import lightgbm as lgb

In [None]:
"""Utils declaration."""
def get_logger(filename='log'):
    logger = getLogger(__name__)
    logger.setLevel(logging.INFO)
    handler1 = StreamHandler()
    handler1.setFormatter(Formatter('%(message)s'))
    handler2 = FileHandler(filename=f"{filename}.log")
    handler2.setFormatter(Formatter("%(message)s"))
    logger.addHandler(handler1)
    logger.addHandler(handler2)
    return logger

logger = get_logger()

In [None]:
"""Config for training."""
OUTPUT_DIC = './'
ID = 'Patient_Week'
TARGET = 'FVC'
N_Fold = 4

In [None]:
"""Re-load data for a clean start."""
train = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/train.csv')
train['Patient_Week'] = train['Patient'].astype(str) + '_' + train['Weeks'].astype(str)
print(train.shape)
train.head()

In [None]:
"""Construct train dataframe."""
output = pd.DataFrame()
gb = train.groupby('Patient')
tk0 = tqdm(gb, total=len(gb))
for _, usr_df in tk0:
    usr_output = pd.DataFrame()
    for week, tmp in usr_df.groupby('Weeks'):
        rename_cols = {'Weeks': 'base_Week', 'FVC': 'base_FVC', 'Percent': 'base_Percent', 'Age': 'base_Age'}
        tmp = tmp.drop(columns='Patient_Week').rename(columns=rename_cols)
        drop_cols = ['Age', 'Sex', 'SmokingStatus', 'Percent']
        _usr_output = usr_df.drop(columns=drop_cols).rename(columns={'Weeks': 'predict_Week'}).merge(tmp, on='Patient')
        _usr_output['Week_passed'] = _usr_output['predict_Week'] - _usr_output['base_Week']
        usr_output = pd.concat([usr_output, _usr_output])
    output = pd.concat([output, usr_output])
    
train = output[output['Week_passed']!=0].reset_index(drop=True)
print(train.shape)
train.head()

In [None]:
"""Construct test dataframe."""
test = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/test.csv')\
        .rename(columns={'Weeks': 'base_Week', 'FVC': 'base_FVC', 'Percent': 'base_Percent', 'Age': 'base_Age'})
submission = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/sample_submission.csv')
submission['Patient'] = submission['Patient_Week'].apply(lambda x: x.split('_')[0])
submission['predict_Week'] = submission['Patient_Week'].apply(lambda x: x.split('_')[1]).astype(int)
test = submission.drop(columns=['FVC', 'Confidence']).merge(test, on='Patient')
test['Week_passed'] = test['predict_Week'] - test['base_Week']
print(test.shape)
test.head()

In [None]:
"""Read in submission.csv."""
submission = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/sample_submission.csv')
print(submission.shape)
submission.head()

In [None]:
folds = train[['Patient_Week', 'Patient', 'FVC']].copy()
N_FOLD = 4
Fold = GroupKFold(n_splits=N_FOLD)
groups = folds['Patient'].values
for n, (train_index, val_index) in enumerate(Fold.split(folds, folds[TARGET], groups)):
    folds.loc[val_index, 'fold'] = int(n)
folds['fold'] = folds['fold'].astype(int)
folds.head()

In [None]:
"""Model declaration."""
def run_single_lightgbm(param, train_df, test_df, folds, features, target, fold_num=0, categorical=[]):
    
    trn_idx = folds[folds.fold != fold_num].index
    val_idx = folds[folds.fold == fold_num].index
    logger.info(f'len(trn_idx) : {len(trn_idx)}')
    logger.info(f'len(val_idx) : {len(val_idx)}')
    
    if categorical == []:
        trn_data = lgb.Dataset(train_df.iloc[trn_idx][features],
                               label=target.iloc[trn_idx])
        val_data = lgb.Dataset(train_df.iloc[val_idx][features],
                               label=target.iloc[val_idx])
    else:
        trn_data = lgb.Dataset(train_df.iloc[trn_idx][features],
                               label=target.iloc[trn_idx],
                               categorical_feature=categorical)
        val_data = lgb.Dataset(train_df.iloc[val_idx][features],
                               label=target.iloc[val_idx],
                               categorical_feature=categorical)

    oof = np.zeros(len(train_df))
    predictions = np.zeros(len(test_df))

    num_round = 10000

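    # LightGBM 4.0 removed the `verbose_eval` and `early_stopping_rounds`
    # arguments of lgb.train; the equivalent callbacks are used instead.
    clf = lgb.train(param,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    callbacks=[lgb.early_stopping(stopping_rounds=100),
                               lgb.log_evaluation(period=100)])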

    oof[val_idx] = clf.predict(train_df.iloc[val_idx][features], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance(importance_type='gain')
    fold_importance_df["fold"] = fold_num

    predictions += clf.predict(test_df[features], num_iteration=clf.best_iteration)
    
    # RMSE
    logger.info("fold{} RMSE score: {:<8.5f}".format(fold_num, np.sqrt(mean_squared_error(target[val_idx], oof[val_idx]))))
    
    return oof, predictions, fold_importance_df


def run_kfold_lightgbm(param, train, test, folds, features, target, n_fold=5, categorical=[]):
    
    logger.info(f"================================= {n_fold}fold lightgbm =================================")
    
    oof = np.zeros(len(train))
    predictions = np.zeros(len(test))
    feature_importance_df = pd.DataFrame()

    for fold_ in range(n_fold):
        print("Fold {}".format(fold_))
        _oof, _predictions, fold_importance_df = run_single_lightgbm(param,
                                                                     train,
                                                                     test,
                                                                     folds,
                                                                     features,
                                                                     target,
                                                                     fold_num=fold_,
                                                                     categorical=categorical)
        feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
        oof += _oof
        predictions += _predictions / n_fold

    # RMSE
    logger.info("CV RMSE score: {:<8.5f}".format(np.sqrt(mean_squared_error(target, oof))))

    logger.info(f"=========================================================================================")
    
    return feature_importance_df, predictions, oof

    
def show_feature_importance(feature_importance_df, name):
    cols = (feature_importance_df[["Feature", "importance"]]
            .groupby("Feature")
            .mean()
            .sort_values(by="importance", ascending=False)[:50].index)
    best_features = feature_importance_df.loc[feature_importance_df.Feature.isin(cols)]

    #plt.figure(figsize=(8, 16))
    plt.figure(figsize=(6, 4))
    sns.barplot(x="importance", y="Feature", data=best_features.sort_values(by="importance", ascending=False))
    plt.title('Features importance (averaged/folds)')
    plt.tight_layout()
    plt.savefig(OUTPUT_DIC+f'feature_importance_{name}.png')

In [None]:
import category_encoders as ce

target = train['FVC']
test['FVC'] = np.nan

# features
cat_features = ['Sex', 'SmokingStatus']
num_features = [c for c in test.columns if (test.dtypes[c] != 'object') & (c not in cat_features)]
features = num_features + cat_features
drop_features = ['Patient_Week', 'FVC', 'predict_Week', 'base_Week']
features = [c for c in features if c not in drop_features]

if cat_features:
    ce_oe = ce.OrdinalEncoder(cols=cat_features, handle_unknown='impute')
    ce_oe.fit(train)
    train = ce_oe.transform(train)
    test = ce_oe.transform(test)
        
lgb_param = {'objective': 'regression',
             'metric': 'rmse',
             'boosting_type': 'gbdt',
             'learning_rate': 0.01,
             'seed': 42,
             'max_depth': -1,
             'verbosity': -1,
            }

feature_importance_df, predictions, oof = run_kfold_lightgbm(lgb_param, train, test, folds, features, target, 
                                                             n_fold=N_FOLD, categorical=cat_features)
    
show_feature_importance(feature_importance_df, TARGET)

In [None]:
"""Create confidence labels."""
import math
train['FVC_pred'] = oof
test['FVC_pred'] = predictions

# baseline score
train['Confidence'] = 100
train['sigma_clipped'] = train['Confidence'].apply(lambda x: max(x, 70))
train['diff'] = abs(train['FVC'] - train['FVC_pred'])
train['delta'] = train['diff'].apply(lambda x: min(x, 1000))
train['score'] = -math.sqrt(2)*train['delta']/train['sigma_clipped'] - np.log(math.sqrt(2)*train['sigma_clipped'])
score = train['score'].mean()
print(score)
train.head(10)
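
The `score` column above has the form of the modified Laplace log likelihood used to score this competition: the confidence is clipped from below at 70, the absolute error is clipped from above at 1000, and each row scores `-sqrt(2)*delta/sigma_clipped - log(sqrt(2)*sigma_clipped)`. Since the same lines recur in several later cells, they could be wrapped in a single helper. The name `laplace_log_likelihood` is hypothetical and the sample numbers are made up.

```python
import numpy as np

def laplace_log_likelihood(fvc_true, fvc_pred, confidence):
    """Mean modified Laplace log likelihood (higher is better)."""
    # Confidence values below 70 are treated as 70.
    sigma_clipped = np.maximum(np.asarray(confidence, dtype=float), 70)
    # Absolute errors above 1000 are treated as 1000.
    delta = np.minimum(np.abs(np.asarray(fvc_true) - np.asarray(fvc_pred)), 1000)
    score = -np.sqrt(2) * delta / sigma_clipped - np.log(np.sqrt(2) * sigma_clipped)
    return float(np.mean(score))

print(laplace_log_likelihood([3000, 2500], [2900, 2500], [100, 100]))
```

On the notebook's frame the call would be `laplace_log_likelihood(train['FVC'], train['FVC_pred'], train['Confidence'])`.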


In [None]:
import scipy as sp
from functools import partial

def loss_func(weight, row):
    confidence = weight
    sigma_clipped = max(confidence, 70)
    diff = abs(row['FVC'] - row['FVC_pred'])
    delta = min(diff, 1000)
    score = -math.sqrt(2)*delta/sigma_clipped - np.log(math.sqrt(2)*sigma_clipped)
    return -score

results = []
tk0 = tqdm(train.iterrows(), total=len(train))
for _, row in tk0:
    loss_partial = partial(loss_func, row=row)
    weight = [100]
    #bounds = [(70, 100)]
    #result = sp.optimize.minimize(loss_partial, weight, method='SLSQP', bounds=bounds)
    result = sp.optimize.minimize(loss_partial, weight, method='SLSQP')
    x = result['x']
    results.append(x[0])
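
The per-row optimisation above also has a closed form, which is useful as a cross-check. For a fixed clipped error `delta`, the loss `sqrt(2)*delta/sigma + log(sqrt(2)*sigma)` has derivative `-sqrt(2)*delta/sigma**2 + 1/sigma`, which vanishes at `sigma = sqrt(2)*delta`. Because confidences below 70 are clipped to 70, the best clipped value is `max(sqrt(2)*delta, 70)`. The sketch below (hypothetical helper names, made-up error values) compares that formula with a brute-force grid search.

```python
import math

def row_loss(sigma, diff):
    """Negative per-row score, as in loss_func, for one absolute error."""
    sigma_clipped = max(sigma, 70)
    delta = min(diff, 1000)
    return math.sqrt(2) * delta / sigma_clipped + math.log(math.sqrt(2) * sigma_clipped)

def optimal_confidence(diff):
    """Closed-form minimiser of row_loss over sigma."""
    return max(math.sqrt(2) * min(diff, 1000), 70)

for diff in (10, 150, 2000):
    # Integer grid search over sigma as an independent check.
    grid_best = min(range(70, 2001), key=lambda s: row_loss(s, diff))
    print(diff, round(optimal_confidence(diff), 1), grid_best)
```

Every confidence below 70 gives the same loss, so the optimiser's answer is only unique above the clip.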

In [None]:
# optimized score
train['Confidence'] = results
train['sigma_clipped'] = train['Confidence'].apply(lambda x: max(x, 70))
train['diff'] = abs(train['FVC'] - train['FVC_pred'])
train['delta'] = train['diff'].apply(lambda x: min(x, 1000))
train['score'] = -math.sqrt(2)*train['delta']/train['sigma_clipped'] - np.log(math.sqrt(2)*train['sigma_clipped'])
score = train['score'].mean()
print(score)
train.head(10)

In [None]:
TARGET = 'Confidence'

target = train[TARGET]
test[TARGET] = np.nan

# features
cat_features = ['Sex', 'SmokingStatus']
num_features = [c for c in test.columns if (test.dtypes[c] != 'object') & (c not in cat_features)]
features = num_features + cat_features
drop_features = [ID, TARGET, 'predict_Week', 'base_Week', 'FVC', 'FVC_pred']
features = [c for c in features if c not in drop_features]

lgb_param = {'objective': 'regression',
             'metric': 'rmse',
             'boosting_type': 'gbdt',
             'learning_rate': 0.01,
             'seed': 42,
             'max_depth': -1,
             'verbosity': -1,
            }

feature_importance_df, predictions, oof = run_kfold_lightgbm(lgb_param, train, test, folds, features, target, 
                                                             n_fold=N_FOLD, categorical=cat_features)
    
show_feature_importance(feature_importance_df, TARGET)

In [None]:
train['Confidence'] = oof
train['sigma_clipped'] = train['Confidence'].apply(lambda x: max(x, 70))
train['diff'] = abs(train['FVC'] - train['FVC_pred'])
train['delta'] = train['diff'].apply(lambda x: min(x, 1000))
train['score'] = -math.sqrt(2)*train['delta']/train['sigma_clipped'] - np.log(math.sqrt(2)*train['sigma_clipped'])
score = train['score'].mean()
print(score)

In [None]:
def lb_metric(train):
    train['Confidence'] = oof
    train['sigma_clipped'] = train['Confidence'].apply(lambda x: max(x, 70))
    train['diff'] = abs(train['FVC'] - train['FVC_pred'])
    train['delta'] = train['diff'].apply(lambda x: min(x, 1000))
    train['score'] = -math.sqrt(2)*train['delta']/train['sigma_clipped'] - np.log(math.sqrt(2)*train['sigma_clipped'])
    score = train['score'].mean()
    return score

In [None]:
score = lb_metric(train)
logger.info(f'Local Score: {score}')

In [None]:
test['Confidence'] = predictions

In [None]:
"""Submission."""
sub = submission.drop(columns=['FVC', 'Confidence']).merge(test[['Patient_Week', 'FVC_pred', 'Confidence']], 
                                                           on='Patient_Week')
sub.columns = submission.columns
sub.to_csv('submission.csv', index=False)
sub.head()