# Introduction

This notebook uses scikit-learn's `KFold` to estimate how well a nearest-neighbor approach based on `NearestNeighbors` can be expected to perform on the test data.

The data are split into nine groups, one per (R, C) combination. Within each group, for each "test" breath we find the two "train" breaths with the most similar u_in (using MAE as the distance) and use the average of their pressures as the prediction.
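
As a minimal sketch of this idea, the toy example below uses made-up random arrays (not the competition data) and hypothetical variable names. It relies on the fact that the Manhattan distance between two breaths equals their MAE times the breath length, so both rank neighbors identically.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical toy data: 6 "train" breaths, 4 time steps each
train_u_in = rng.random((6, 4))
train_pressure = rng.random((6, 4))

# Two "test" breaths that are almost identical to train breaths 1 and 4
test_u_in = train_u_in[[1, 4]] + 0.001

# Manhattan distance = MAE * breath length, so the neighbor ranking is the same
nn = NearestNeighbors(n_neighbors=2, metric='manhattan').fit(train_u_in)
neighbor_idx = nn.kneighbors(test_u_in, return_distance=False)  # shape (2, 2)

# Prediction: average the pressure curves of the two nearest train breaths
pred = train_pressure[neighbor_idx].mean(axis=1)  # shape (2, 4)
```

Here `neighbor_idx[:, 0]` holds the closest train breath for each test breath, and `pred` has one predicted pressure curve per test breath.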

# Import libraries and data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import KFold

In [None]:
train = pd.read_csv('../input/ventilator-pressure-prediction/train.csv')

In [None]:
BL = 80 #Breath length - how many data points we have for each breath

# Extract u_in, pressure and u_out as numpy arrays, then plot the first breath as a check

In [None]:
u_in     = train['u_in'].values.reshape((-1, BL))
pressure = train['pressure'].values.reshape((-1, BL))
u_out    = train['u_out'].values.reshape((-1, BL))
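
The reshape above only gives one row per breath if every breath has exactly `BL` rows and the rows of each breath are stored contiguously. A possible sanity check is sketched below on a hypothetical miniature table; the `breath_id` column name is an assumption about the CSV, since it is not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

BL = 3  # toy breath length; the notebook uses 80

# Hypothetical miniature of train.csv: two breaths of BL rows each
toy = pd.DataFrame({
    'breath_id': [1, 1, 1, 2, 2, 2],
    'u_in':      [0.0, 1.0, 2.0, 5.0, 6.0, 7.0],
})

# Every breath must have exactly BL rows ...
sizes_ok = bool((toy.groupby('breath_id').size() == BL).all())
# ... and a non-decreasing breath_id is sufficient for rows to be contiguous
contiguous = bool(toy['breath_id'].is_monotonic_increasing)

# With both checks passing, the reshape yields one row per breath
toy_u_in = toy['u_in'].values.reshape((-1, BL))
```

If either flag were `False`, rows from different breaths could end up mixed in the same row of the reshaped array.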

In [None]:
plt.plot(u_in[0]);
plt.plot(pressure[0]);

# Unique values of R, C

In [None]:
r_values = train['R'].unique()
c_values = train['C'].unique()

# Define the estimator, fit it on the train part of each of five folds, and check the average MAE in the region of interest (u_out == 0)

In [None]:
kf = KFold(5)
neigh = NearestNeighbors(n_neighbors=2, metric = 'manhattan')

for r in r_values:
    for c in c_values:
        print(f'R {r:02d} C {c:02d}')
        # Boolean mask with one entry per breath (::BL takes the first row of each breath)
        rc_correct = ((train['R'][::BL] == r) & (train['C'][::BL] == c)).values
        rc_u_in     = u_in[rc_correct]      #Arrays with u_in, pressure and u_out with this RC combo
        rc_pressure = pressure[rc_correct]
        rc_u_out    = u_out[rc_correct]
        
        #Folds
        for train_idx, test_idx in kf.split(rc_pressure):

            #Fit the nearest neighbor estimator
            neigh.fit(rc_u_in[train_idx])

            #Initialize
            mae = 0
            Y_train = rc_pressure[train_idx]
            Y_test  = rc_pressure[test_idx]
            X_test  = rc_u_in[test_idx]
            filt    = 1. - rc_u_out[test_idx] #only calculate error from times when u_out == 0

            #Loop over the "test" breaths
            for idx in range(len(test_idx)):
                nn1, nn2 = neigh.kneighbors([X_test[idx]], 2, return_distance=False)[0]
                Y_pred = (Y_train[nn1] + Y_train[nn2])/2.
                # Per-breath MAE, counting only the time steps where u_out == 0
                mae += np.sum(np.abs(Y_test[idx] - Y_pred)*filt[idx])/np.sum(filt[idx])

            #Print result
            print(f'    {mae/len(test_idx):.2f}')

The MAE is largest for the groups with R = 50, so these breaths are what mostly limits the overall score.
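
The error measure used in the loop above can also be written as a small vectorized helper. The sketch below is one way to do it; `masked_mae` is a hypothetical name and the arrays are made-up values chosen so the result is easy to verify by hand.

```python
import numpy as np

def masked_mae(y_true, y_pred, u_out):
    """Per-breath mean absolute error over the time steps where u_out == 0."""
    mask = (u_out == 0)
    abs_err = np.abs(y_true - y_pred) * mask
    return abs_err.sum(axis=1) / mask.sum(axis=1)

# One toy breath with four time steps; only the first two have u_out == 0
y_true = np.array([[1.0, 2.0, 3.0, 4.0]])
y_pred = np.array([[1.5, 2.0, 9.0, 9.0]])
u_out  = np.array([[0, 0, 1, 1]])

per_breath = masked_mae(y_true, y_pred, u_out)  # (0.5 + 0.0) / 2 = 0.25
```

Averaging `per_breath` over a fold reproduces what the loop prints. Note that this weights every breath equally, regardless of how many time steps it has with u_out == 0.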