# My best solution (-6.8348)

Although I sadly didn't select this one for submission, it would have placed me 5th if I had. I wrote it after realising two things:

1. I couldn't trust the public LB at all, and cross-validation was the key.

2. Image data wasn't giving me any advantage, so relying solely on tabular data would allow faster iteration and better results.

However, at this stage I hadn't yet improved my internal testing, which was still poorly built, and that's exactly why I didn't trust my internal results at the time. I will update this kernel as soon as I can to include all my findings.

## How I envisioned it

The very first thing I did was inspect the data, which made me keen to rely on linear regression (and my submitted versions did rely on it). In this version, however, I tried a direct prediction method instead: fitting the whole function with a randomized ensemble of multilayer perceptrons.
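
To make the idea concrete, here is a minimal sketch of a seed-randomized MLP ensemble on synthetic data. The data, the tiny network size, and the helper names `fit_ensemble` and `ensemble_predict` are assumptions for illustration only; the real notebook also varies the training patients per member and uses a much larger network.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic 1-D regression data (made up for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=200)

def fit_ensemble(X, y, n_members=5):
    """Fit n_members MLPs that differ only in their random seed."""
    members = []
    for seed in range(n_members):
        reg = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=seed)
        members.append(reg.fit(X, y))
    return members

def ensemble_predict(members, X):
    """Return the ensemble mean and the across-member variance."""
    preds = np.stack([m.predict(X) for m in members], axis=1)
    return preds.mean(axis=1), preds.var(axis=1)

members = fit_ensemble(X, y)
mean, var = ensemble_predict(members, X[:10])
```

The mean plays the role of the prediction, while the variance measures how much the members disagree, which is one ingredient of the confidence estimate.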

## What I learned in this competition

1. The value of visualization: visualize everything you can in order to properly understand it and, if you think you can't visualize it, think about why you can't and how you could solve that.

2. The value of good testing and a cross-validation strategy. This turned out to be very important in this competition, but even in competitions where you can trust the public LB you don't want to do all your testing on it. A good CV lets you try all sorts of things.

3. How to compute uncertainty in a trustworthy manner. I was already familiar with MC-dropout, a technique developed by [Yarin Gal around 2016](https://arxiv.org/abs/1506.02142), but then I read [this scientific article](https://arxiv.org/abs/1709.01907) written by researchers at Uber, in which they compute 95% confidence intervals for time-series prediction. They introduce some good concepts there.

4. Try your own ideas and approaches first. It's better not to contaminate yourself with others' solutions before you've done this, or you're doomed to be mundane ;)
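
As a rough illustration of point 3, the sketch below combines two variance sources the way this notebook does: the spread between ensemble members (model uncertainty) and the residual variance measured on a held-out fold (inherent noise). The numbers and the helper name `total_std` are made up, and the 1.96 factor assumes roughly Gaussian errors.

```python
import numpy as np

def total_std(member_preds, inherent_noise_var):
    """Combine ensemble spread with held-out residual variance."""
    model_var = np.var(member_preds, axis=1)  # disagreement between members
    return np.sqrt(model_var + inherent_noise_var)

# Hypothetical predictions: 3 samples x 4 ensemble members
member_preds = np.array([[2.0, 2.0, 2.0, 2.0],
                         [1.0, 3.0, 1.0, 3.0],
                         [0.0, 0.0, 4.0, 4.0]])
inherent_noise_var = 0.25  # made-up variance of residuals on a held-out fold

std = total_std(member_preds, inherent_noise_var)
mean = member_preds.mean(axis=1)
# Approximate 95% interval under a Gaussian assumption
lower, upper = mean - 1.96 * std, mean + 1.96 * std
```

Even when all members agree (first row), the interval does not collapse to zero, because the inherent noise term keeps a floor on the uncertainty.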

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random

INTERNAL_TEST = True

random.seed(0)
np.random.seed(0)

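def metric(fvc_pred, fvc_true, sigma):
    # Competition metric: modified Laplace log likelihood (higher is better)
    sigma_clip = np.maximum(sigma, 70)                   # confidence is floored at 70
    err = np.minimum(np.abs(fvc_true - fvc_pred), 1000)  # absolute error is capped at 1000
    twosqr = np.sqrt(2.0)
    score = - (twosqr*err)/sigma_clip - np.log(twosqr*sigma_clip)
    return np.mean(score)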

def get_stats(sex):
    mean = np.zeros(len(sex), dtype=np.float32)
    std  = mean.copy()
    for s in (0,1):
        k = sex==s
        mean[k] = FVC_MEAN[s]
        std[k]  = FVC_STD[s]
    return mean, std

def normalize(values, sex):
    mean, std = get_stats(sex)
    return (values - mean) / std

def denormalize(values, sex, only_std=False):
    mean, std = get_stats(sex)
    return values * std if only_std else (values * std) + mean

def random_state_generator():
    s = 0
    while True:
        yield s
        s += 1
rsit = random_state_generator()
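
A small worked example may help show how the `metric` function above rewards honest confidence values. The function is restated under the hypothetical name `laplace_log_likelihood` so that the snippet runs on its own, and the FVC values are made up.

```python
import numpy as np

def laplace_log_likelihood(fvc_pred, fvc_true, sigma):
    sigma_clip = np.maximum(sigma, 70)                   # floor on the confidence
    err = np.minimum(np.abs(fvc_true - fvc_pred), 1000)  # cap on the absolute error
    return np.mean(-np.sqrt(2.0) * err / sigma_clip - np.log(np.sqrt(2.0) * sigma_clip))

fvc_true = np.array([2500.0])
# Perfect prediction: sigma=10 is floored to 70, so this is the best achievable score
perfect = laplace_log_likelihood(np.array([2500.0]), fvc_true, np.array([10.0]))
# Off by 200 with the smallest allowed sigma: heavily penalized
confident = laplace_log_likelihood(np.array([2300.0]), fvc_true, np.array([70.0]))
# Off by 200 with a sigma that matches the error: penalized less
hedged = laplace_log_likelihood(np.array([2300.0]), fvc_true, np.array([200.0]))
```

The three scores come out at roughly -4.60, -8.64 and -7.06, so an overconfident miss costs more than a miss reported with a matching sigma.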

In [None]:
train_data = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/train.csv')
test_data  = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/test.csv')

# Normalize with respect to the Patients' first measure
all_data = pd.concat([train_data, test_data])
# The loop variable is named g so that it does not shadow the pandas module (pd)
females  = pd.concat([ g.iloc[:1] for _, g in all_data[all_data['Sex']=='Female'].groupby('Patient') ])
males    = pd.concat([ g.iloc[:1] for _, g in all_data[all_data['Sex']=='Male'].groupby('Patient') ])
FVC_MEAN = (females['FVC'].mean(), males['FVC'].mean())
FVC_STD  = (females['FVC'].std(),  males['FVC'].std())

sex_map = {'Female':0, 'Male':1}

train_data['Sex'] = train_data['Sex'].map(sex_map)
train_data = train_data.join(pd.get_dummies(train_data['SmokingStatus'], prefix=''))
train_data.drop(columns=['SmokingStatus'], inplace=True)
train_data['FVC'] = normalize(train_data['FVC'], train_data['Sex'])

test_data['Sex'] = test_data['Sex'].map(sex_map)
test_data = test_data.join(pd.get_dummies(test_data['SmokingStatus'], prefix=''))
test_data.drop(columns=['SmokingStatus'], inplace=True)
test_data['FVC'] = normalize(test_data['FVC'], test_data['Sex'])
for c in train_data.columns:
    if c[0] == '_' and c not in test_data.columns:
        test_data[c] = 0

In [None]:
# Cross validation
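# 8 folds (round number of patients per fold); the last fold is held out
# for the final inherent-noise estimate, which leaves 7 folds for CV
# Repetition loop is in place, but only 1 repetition is run in this version
# Sex-balanced folds
# Augmenting and (random) weeks/sex-balancing is done in-fold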

patient_ids = sorted(train_data['Patient'].unique().tolist())
random.shuffle(patient_ids)
# 176 patients

def split(seq, n):
    fold_size = len(seq)/n
    splits = [0] + [ int(fold_size*(i+1)) for i in range(n) ]
    folds  = [ seq[splits[i]:splits[i+1]] for i in range(n) ]
    return folds

def group(seqs, groups):
    assert len(seqs) == sum(groups)
    glist = list()
    it = iter(seqs)
    for g in groups:
        gr = list()
        for i in range(g):
            gr.extend(next(it))
        glist.append(gr)
    return glist

cv_size = 8
fem_ids = train_data[train_data['Sex']==0]['Patient'].unique().tolist()
mal_ids = train_data[train_data['Sex']==1]['Patient'].unique().tolist()
females_folds = split(fem_ids, cv_size)
males_folds   = split(mal_ids, cv_size)
pat_id_folds = [ f+m for f,m in zip(females_folds, males_folds) ]
final_conf_ids = pat_id_folds[-1]
pat_id_folds = pat_id_folds[:-1]
cv_size -= 1
    
def augment(data, only_extremes=False):
    data_list = list()
    for pat_id, df in data.groupby('Patient'):
        df = df.copy()
        df['FirstFVC']     = 0.0
        df['FirstPercent'] = 0.0
        df['FirstWeeks']   = 0
        for i in [0] if only_extremes else range(len(df)-1):
            df = df.copy()
            row = df.iloc[i]
            idx = df['Patient']==pat_id
            df.loc[idx, 'FirstFVC']     = row['FVC']
            df.loc[idx, 'FirstPercent'] = row['Percent']
            df.loc[idx, 'FirstWeeks']   = row['Weeks']
            if only_extremes:
                data_list.append(df.iloc[-3:])
            else:
                data_list.append(df.iloc[i+1:])
    aug = pd.concat(data_list)
    aug['DiffWeeks'] = aug['Weeks'] - aug['FirstWeeks']
    return aug

def balance_sex(data):
    males, females = list(), list()
    for pat_id, df in data.groupby('Patient'):
        if df['Sex'].values[0] < 0.5:
            females.append(pat_id)
        else:
            males.append(pat_id)
    random.shuffle(males)
    random.shuffle(females)
    min_pats = min(len(males), len(females))
    assert min_pats > 0
    males    = males[:min_pats]
    females  = females[:min_pats]
    return data[data['Patient'].isin(males+females)]

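def balance_weeks(data):
    # Trim every patient to a random contiguous window of min_weeks visits
    min_weeks = min([ len(df) for _, df in data.groupby('Patient') ])
    pats = list()
    for _, df in data.groupby('Patient'):
        if len(df) > min_weeks:
            # randint is inclusive on both ends, so the last valid start is len(df)-min_weeks
            i = random.randint(0, len(df)-min_weeks)
            df = df.iloc[i:i+min_weeks]
        # Patients that already have exactly min_weeks visits are kept as they are
        pats.append(df)
    return pd.concat(pats)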
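
To illustrate what `augment` does when `only_extremes=False`, here is a plain-Python sketch for a single made-up patient. The helper name `baseline_pairs` and the tuple layout are assumptions; the notebook does the same thing on DataFrames and relies on the rows already being ordered by week.

```python
def baseline_pairs(visits):
    """Pair every visit except the last with each later visit.

    visits: list of (week, fvc) tuples sorted by week.
    Returns rows of (first_week, first_fvc, week, diff_weeks, fvc).
    """
    rows = []
    for i in range(len(visits) - 1):
        first_week, first_fvc = visits[i]
        for week, fvc in visits[i + 1:]:
            rows.append((first_week, first_fvc, week, week - first_week, fvc))
    return rows

visits = [(0, 2800), (4, 2750), (10, 2700), (20, 2600)]  # made-up patient
rows = baseline_pairs(visits)
```

Four visits become six training rows, because each earlier visit serves in turn as the baseline for all later ones. At test time only one measurement per patient is available, which is why training on every possible baseline is useful.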

In [None]:
def predict(regr_list, scaler, in_tab, out_tab):
    # Accept either a single fold (dict of regressors + scaler) or lists of them
    cv_regr_list   = regr_list if isinstance(regr_list, list) else [regr_list]
    cv_scaler_list = scaler if isinstance(scaler, list) else [scaler]
    cols = list()
    i = 0
    for fold_regrs, fold_scaler in zip(cv_regr_list, cv_scaler_list):
        X = fold_scaler.transform(in_tab[reg_columns])
        for reg in fold_regrs['TargetFVC']:
            c = f'PredTargetFVC{i}'
            # Undo the target standardization and add the baseline FVC back
            out_tab[c] = reg.predict(X) * reg.target_fvc_std + in_tab['FirstFVC'] + reg.target_fvc_avg
            cols.append(c)
            i += 1
    preds = out_tab[cols].values
    # Ensemble mean is the prediction; ensemble variance is the model uncertainty
    out_tab['PredFVC']    = np.mean(preds, axis=1)
    out_tab['PredFVCVar'] = np.var(preds, axis=1)
    return cols

def compute_inherent_noise(regr_list, scaler, data, pat_ids):
    data = data[data['Patient'].isin(pat_ids)].copy()
    data = balance_sex(data)
    data = augment(data, only_extremes=True)
    predict(regr_list, scaler, data, data)
    return (data['FVC'] - data['PredFVC']).var()

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

reg_columns  = ['FirstFVC','FirstPercent','DiffWeeks','FirstWeeks','Weeks','Age','Sex','_Currently smokes','_Ex-smoker','_Never smoked']
pred_columns = ['TargetFVC']

instantiators = [ lambda: MLPRegressor(solver='sgd', activation='relu', random_state=next(rsit),
                                       learning_rate_init=0.001, learning_rate='adaptive', batch_size=16, alpha=1e-1,
                                       hidden_layer_sizes=(512,256,128,64), validation_fraction=0.3, n_iter_no_change=5,
                                       max_iter=int(1e6), max_fun=int(1e6), early_stopping=True, verbose=False) ]

train_data_orig = train_data.copy()

inh_noise_list = list()
loss_list = list()
cv_regr_list = list()
cv_scaler_list = list()

for rep in range(1 if INTERNAL_TEST else 1):
    for cv_i in range(cv_size if INTERNAL_TEST else 1):

        if INTERNAL_TEST:
            test_ids  = pat_id_folds[(cv_i - 1) % cv_size]
            conf_ids  = pat_id_folds[(cv_i - 2) % cv_size]
        train_ids = list()
        for i in range(cv_size-2 if INTERNAL_TEST else cv_size):
            train_ids.extend(pat_id_folds[(cv_i + i) % cv_size])

        train_data = train_data_orig.copy()
        data    = train_data[train_data['Patient'].isin(train_ids)]
        fem_ids = data[data['Sex']==0]['Patient'].unique().tolist()
        mal_ids = data[data['Sex']==1]['Patient'].unique().tolist()

        n_folds  = 1
        fem_folds   = split(fem_ids, n_folds)
        mal_folds   = split(mal_ids, n_folds)
        #id_folds = [ f+m for f,m in zip(fem_folds, mal_folds) ]
        sample_len = int(1.0*min(len(fem_ids), len(mal_ids)))
        id_folds = [ random.sample(fem_ids, sample_len) + random.sample(mal_ids, sample_len) for i in range(n_folds) ]
        if cv_i == 0:
            print('Fold size:', len(id_folds[0]))

        # Fit data scaler
        data   = train_data[train_data['Patient'].isin(train_ids)].copy()
        data   = balance_sex(data)
        data   = augment(data)
        scaler = StandardScaler().fit(data[reg_columns])
        cv_scaler_list.append(scaler)

        # Prediction
        regr_list = { k:list() for k in pred_columns }
        for inst_i, inst in enumerate(instantiators):
            for fold_i in range(n_folds):
                i = inst_i*n_folds + fold_i
                print('Member', i)

                id_list = id_folds[fold_i]
                data = train_data[train_data['Patient'].isin(id_list)].copy()
                data = balance_sex(data)
                #data = balance_weeks(data)
                data = augment(data)
                data['TargetFVC'] = data['FVC'] - data['FirstFVC']

                X = data[reg_columns]
                X = scaler.transform(X)
                for k in pred_columns:
                    reg = inst()
                    avg, std = data['TargetFVC'].mean(), data['TargetFVC'].std()
                    reg.target_fvc_avg = avg
                    reg.target_fvc_std = std
                    Y = (data[k] - avg) / std
                    reg.fit(X,Y)
                    regr_list[k].append(reg)
        cv_regr_list.append(regr_list)

        if INTERNAL_TEST:
            inh_noise = compute_inherent_noise(regr_list, scaler, train_data, conf_ids)
            inh_noise_list.append(inh_noise)
            print('Fold', cv_i)
            print('Inh noise:', inh_noise * FVC_STD[data['Sex'].values[0]])

            data = train_data[train_data['Patient'].isin(test_ids)].copy()
            data = balance_sex(data)
            data = augment(data, only_extremes=True)
            predict(regr_list, scaler, data, data)
            ens_pred = denormalize(data['PredFVC'], data['Sex'])
            conf = np.sqrt(data['PredFVCVar'] + inh_noise)
            conf = denormalize(conf, data['Sex'], only_std=True)
            fvc  = denormalize(data['FVC'], data['Sex'])
            loss = metric(ens_pred, fvc, conf)
            print('Loss:', loss)
            loss_list.append(loss)
    
if len(loss_list) > 0:
    print('Expected loss', np.mean(loss_list), np.var(loss_list))
    inh_noise = np.mean(inh_noise_list)
    print('Mean inh noise', inh_noise * FVC_STD[data['Sex'].values[0]])
    
inh_noise = compute_inherent_noise(cv_regr_list, cv_scaler_list, train_data_orig, final_conf_ids)
print('Final inherent noise', inh_noise)
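
The fold rotation used in the training loop above can be sketched in isolation. In each round, one fold is used for testing, one for estimating the inherent noise, and the remaining folds for training. The helper name `rotate_folds` is hypothetical; the index arithmetic mirrors the loop.

```python
def rotate_folds(n_folds):
    """Yield (train, conf, test) fold indices for each CV round."""
    for cv_i in range(n_folds):
        test = (cv_i - 1) % n_folds   # fold scored with the metric
        conf = (cv_i - 2) % n_folds   # fold used to estimate inherent noise
        train = [(cv_i + i) % n_folds for i in range(n_folds - 2)]
        yield train, conf, test

rounds = list(rotate_folds(7))
```

With 7 folds, the first round trains on folds 0 to 4, estimates the noise on fold 5 and tests on fold 6. Every fold serves as the test fold exactly once, and the noise estimate never sees the training or test patients.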

In [None]:
subdf = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/sample_submission.csv')
subdf['Sex']        = 0
subdf['FirstWeeks'] = 0
subdf['FirstFVC']   = 0.0
subdf['Patient'] = subdf['Patient_Week'].str.split('_').str[0]
subdf['Weeks']   = subdf['Patient_Week'].str.split('_').str[1].astype(int)

fvc_columns = list()
i = 0
for regr_list in cv_regr_list:
    for reg in regr_list['TargetFVC']:
        c = f'PredTargetFVC{i}'
        subdf[c] = 0.0
        fvc_columns.append(c)
        i += 1

test_data['FirstFVC']     = test_data['FVC']
test_data['FirstPercent'] = test_data['Percent']
test_data['FirstWeeks']   = test_data['Weeks']

copy_cols = ['FirstFVC','FirstPercent','FirstWeeks','Age','Sex','_Currently smokes','_Ex-smoker','_Never smoked']
for c in copy_cols:
    subdf[c] = 0.0
    
for pat_id, df in test_data.groupby('Patient'):
    row = df.iloc[0]
    idx = subdf['Patient']==pat_id
    for c in copy_cols:
        subdf.loc[idx, c] = row[c]
subdf['DiffWeeks'] = subdf['Weeks'] - subdf['FirstWeeks']

i = 0
for regr_list, scaler in zip(cv_regr_list, cv_scaler_list):
    X = subdf[reg_columns]
    X = scaler.transform(X)
    for reg in regr_list['TargetFVC']:
        subdf[f'PredTargetFVC{i}'] = reg.predict(X) * reg.target_fvc_std + reg.target_fvc_avg
        i += 1
        
preds = subdf[fvc_columns].values + subdf['FirstFVC'].values.reshape(-1,1)
subdf['FVC']    = np.mean(preds, axis=1)
subdf['FVCVar'] = np.var(preds, axis=1)
        
subdf['FVC']        = denormalize(subdf['FVC'], subdf['Sex'])
subdf['Confidence'] = np.sqrt(subdf['FVCVar'] + inh_noise)
subdf['Confidence'] = denormalize(subdf['Confidence'], subdf['Sex'], only_std=True)

idx = subdf['Weeks'] == subdf['FirstWeeks']
subdf.loc[idx, 'FVC']        = denormalize(subdf['FirstFVC'], subdf['Sex'])
subdf.loc[idx, 'Confidence'] = 0

if True:
    for i, (pat_id, df) in enumerate(subdf.groupby('Patient')):
        plt.figure()
        w    = df['Weeks']
        fvc  = df['FVC']
        conf = df['Confidence']
        plt.plot(w, fvc, label='pred')
        plt.plot(w, fvc + conf, linestyle='dashed', color='grey')
        plt.plot(w, fvc - conf, linestyle='dashed', color='grey')
        plt.show()
        if i==4:
            break

subdf = subdf[['Patient_Week','FVC','Confidence']]
subdf.to_csv('submission.csv', index=False)
subdf.sort_values(by=['Confidence']).head(20)