# Introduction

This notebook uses sklearn's KFold cross-validation to estimate how well the approach below should do on the test data.

The code is short and relies on sklearn's nearest-neighbor routine. For each "test" breath, it finds the "train" breath with the most similar u_in (MAE metric) and uses that breath's pressure as the prediction.
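To make the idea concrete, here is a minimal sketch on made-up toy data (the arrays and the names `u_in_train`, `pressure_train`, `u_in_test` and `nn_model` are illustrative assumptions, not part of the competition data). The `'manhattan'` metric is the sum of absolute differences, which is the MAE multiplied by the breath length, so it picks the same nearest neighbor as MAE would.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy data (made up for illustration): 3 "train" breaths with 4 points each
u_in_train = np.array([[0., 1., 2., 3.],
                       [5., 5., 5., 5.],
                       [9., 8., 7., 6.]])
pressure_train = np.array([[10., 11., 12., 13.],
                           [20., 20., 20., 20.],
                           [30., 29., 28., 27.]])

# One "test" breath whose u_in is closest to the second train breath
u_in_test = np.array([[5., 5., 4., 5.]])

# Fit on the train u_in and look up the single nearest neighbor
nn_model = NearestNeighbors(n_neighbors=1, metric='manhattan')
nn_model.fit(u_in_train)
dist, idx = nn_model.kneighbors(u_in_test)

nearest = idx[0, 0]                           # index of the closest train breath
mae_to_nearest = dist[0, 0] / u_in_test.shape[1]  # manhattan distance / length = MAE
predicted_pressure = pressure_train[nearest]  # reuse that breath's pressure
```

The prediction is simply a copy of the matched train breath's pressure curve; nothing is interpolated or averaged.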

# Import libraries and data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import KFold

In [None]:
train = pd.read_csv('../input/ventilator-pressure-prediction/train.csv')

In [None]:
BL = 80 #Breath length - how many data points we have for each breath

# Extract u_in, pressure, u_out as numpy arrays, check

In [None]:
u_in     = train['u_in'].values.reshape((-1, BL))
pressure = train['pressure'].values.reshape((-1, BL))
u_out    = train['u_out'].values.reshape((-1, BL))

In [None]:
plt.plot(u_in[0]);
plt.plot(pressure[0]);
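
The reshape above is only valid if every breath has exactly BL rows and the rows of each breath are stored contiguously. A possible sanity check is sketched below on a toy frame; it assumes a `breath_id` column that identifies each breath (this column is not used elsewhere in this notebook, so treat it as an assumption), and the toy `BL = 4` stands in for the real value of 80.

```python
import numpy as np
import pandas as pd

BL = 4  # toy breath length for this sketch; the notebook itself uses 80

# Toy frame (made up): 3 breaths, BL rows each, stored one breath after another
toy = pd.DataFrame({
    'breath_id': np.repeat([1, 2, 3], BL),  # assumed identifier column
    'u_in': np.arange(3 * BL, dtype=float),
})

# Check 1: every breath has exactly BL rows
counts = toy.groupby('breath_id').size()
assert (counts == BL).all()

# Check 2: after reshaping, each row contains a single breath_id,
# i.e. the rows of a breath are contiguous in the frame
ids = toy['breath_id'].values.reshape(-1, BL)
assert (ids == ids[:, :1]).all()

# Only now is the reshape safe
u_in_toy = toy['u_in'].values.reshape(-1, BL)
```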

# Define the estimator and fit it on the train part of each of five folds. Check the average MAE in the region of interest (u_out == 0)

In [None]:
kf = KFold(5)
neigh = NearestNeighbors(n_neighbors=1, metric = 'manhattan')

for train_idx, test_idx in kf.split(pressure):
    
    #Fit the nearest neighbor estimator
    neigh.fit(u_in[train_idx])
    
    #Initialize
    mae = 0
    Y_train = pressure[train_idx]
    Y_test  = pressure[test_idx]
    X_test  = u_in[test_idx]
    filt    = 1. - u_out[test_idx] #only calculate error from times when u_out == 0
    
    #Loop over the "test" breaths
    for idx in range(len(test_idx)):
        nn = neigh.kneighbors([X_test[idx]], 1, return_distance=False)[0,0]
        mae += np.sum(np.abs(Y_test[idx] - Y_train[nn])*filt[idx])/np.sum(filt[idx])
    
    print(mae/len(test_idx))
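
The per-breath loop above can also be written as one batched query, which is usually faster. The sketch below is one way to do it, not the code used for the numbers in this notebook: the helper name `fold_mae` and the toy arrays are made up for illustration, and it assumes every test breath has at least one point with u_out == 0 (otherwise the division fails).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fold_mae(u_in_tr, p_tr, u_in_te, p_te, u_out_te):
    """Mean over test breaths of the MAE restricted to points with u_out == 0."""
    model = NearestNeighbors(n_neighbors=1, metric='manhattan').fit(u_in_tr)
    # Query all test breaths at once; column 0 holds the nearest train index
    nn = model.kneighbors(u_in_te, return_distance=False)[:, 0]
    mask = 1.0 - u_out_te  # 1 where u_out == 0, 0 elsewhere
    # Masked MAE per breath (assumes mask.sum(axis=1) > 0 for every breath)
    per_breath = (np.abs(p_te - p_tr[nn]) * mask).sum(axis=1) / mask.sum(axis=1)
    return per_breath.mean()

# Toy data (made up): 2 train breaths, 2 test breaths, 4 points each
u_in_tr = np.array([[0., 0., 0., 0.], [10., 10., 10., 10.]])
p_tr = np.array([[1., 1., 1., 1.], [5., 5., 5., 5.]])
u_in_te = np.array([[1., 0., 0., 0.], [9., 10., 10., 10.]])
p_te = np.array([[2., 2., 2., 2.], [5., 6., 9., 9.]])
u_out_te = np.array([[0, 0, 1, 1], [0, 0, 1, 1]])

toy_mae = fold_mae(u_in_tr, p_tr, u_in_te, p_te, u_out_te)
print(toy_mae)
```

In the toy case only the first two points of each breath count, so the large errors at the end of the second test breath do not affect the result.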

These MAE values are slightly higher than what we see on the actual test data. This is understandable: for the test data we fit on the full training set, whereas here each fold is fit on only 80% of it.