# Introduction

For each breath in the test file, this notebook finds the breath in the train file that
1. has the same R and C values as the test breath
2. has the most similar u_in, as measured by the chi squared value (here, the sum of squared differences between the two u_in curves)

We save the nearest neighbor breath_id.
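
The matching criterion can be illustrated on a toy example. This is only a sketch: the curves, the `toy_` names and the breath_id values are made up, and each curve has four time steps instead of 80.

```python
import numpy as np

# Toy data (made up for illustration): three "train" curves and one "test" curve.
toy_train_u_in = np.array([
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 1.0, 2.5, 3.5],
    [5.0, 5.0, 5.0, 5.0],
])
toy_train_ids = np.array([11, 12, 13])  # hypothetical breath_id values
toy_test_u_in = np.array([0.0, 1.0, 2.4, 3.4])

# Sum of squared differences between the test curve and every train curve.
# Broadcasting subtracts the 1D test curve from each row of the 2D array.
toy_chi2 = np.sum((toy_train_u_in - toy_test_u_in) ** 2, axis=-1)

# The nearest neighbor is the train breath with the smallest value.
toy_nn_id = toy_train_ids[np.argmin(toy_chi2)]
print(toy_chi2, toy_nn_id)
```

Here the second train curve differs from the test curve by only 0.1 at two time steps, so it is selected.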

We build on the [exploratory analysis](https://www.kaggle.com/motloch/ventilator-pressure-train-data-exploration) we did earlier. The results are used for a [submission](https://www.kaggle.com/motloch/ventilator-pressure-use-nn-u-in-same-r-c) that leads to a score of 0.650.

Version 4: stores information about several nearest neighbors, not just one

# Import libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load train and test data

In [None]:
train = pd.read_csv('../input/ventilator-pressure-prediction/train.csv')
test = pd.read_csv('../input/ventilator-pressure-prediction/test.csv')

# Number of data points per breath, number of breaths in test and train data

In [None]:
BREATH_LENGTH = 80
num_train_breaths = len(train) // BREATH_LENGTH
num_test_breaths  = len(test)  // BREATH_LENGTH
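
The integer division above, and the later reshape, rely on every breath having the same number of rows and on those rows being stored contiguously. A possible check is sketched below on a made-up frame; `TOY_BREATH_LENGTH` and the `toy` data are assumptions for illustration, not part of the competition data.

```python
import numpy as np
import pandas as pd

TOY_BREATH_LENGTH = 4  # hypothetical; the notebook uses 80

toy = pd.DataFrame({
    'breath_id': np.repeat([1, 2, 5], TOY_BREATH_LENGTH),
    'u_in': np.zeros(3 * TOY_BREATH_LENGTH),
})

# Every breath should contribute exactly TOY_BREATH_LENGTH rows ...
rows_per_breath = toy.groupby('breath_id').size()
all_same_length = bool((rows_per_breath == TOY_BREATH_LENGTH).all())

# ... and the rows of a breath should be contiguous, so that each block of
# TOY_BREATH_LENGTH consecutive rows holds a single breath_id.
id_blocks = toy['breath_id'].values.reshape(-1, TOY_BREATH_LENGTH)
contiguous = bool((id_blocks == id_blocks[:, :1]).all())

print(all_same_length, contiguous)
```

The same two checks could be run on `train` and `test` with `BREATH_LENGTH` in place of the toy constant.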

# List of R and C values

In [None]:
r_values = train['R'].unique()
c_values = train['C'].unique()

# Use numpy array directly (for speed). Separate breaths with different R, C values

For each combination of R/C we store a 2D array of u_in (BREATH_LENGTH = 80 numbers for each breath) and the list of breath_id values for the breaths with this R/C combination.

For quick lookup we store the results in dictionaries.

In [None]:
train_u_in = {}
breath_ids = {}

for r in r_values:
    for c in c_values:
        rc_combo = str(r) + '_' + str(c)
        current_dat = train[(train['R'] == r) & (train['C'] == c)]
        train_u_in[rc_combo] = np.reshape(current_dat['u_in'].values, (-1, BREATH_LENGTH))
        breath_ids[rc_combo] = current_dat['breath_id'][::BREATH_LENGTH].values

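Check that we got the ordering right when reshaping: print the first ten u_in values for the first breath with R = 50, C = 10, once from the reshaped array and once directly from the train data. The two outputs should agree.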

In [None]:
print(train_u_in['50_10'][0,:10])

In [None]:
first_50_10_breath = train[(train['breath_id'] == breath_ids['50_10'][0])]
print(first_50_10_breath['u_in'][:10].values)
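
The grouping and reshape can also be followed on a small made-up frame. The sketch below uses `groupby` instead of the nested loops, which is an alternative formulation rather than what the notebook does; the `toy` names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

TOY_BREATH_LENGTH = 3  # hypothetical; the real data has 80 steps per breath

# Made-up frame with three breaths; breaths 1 and 3 share R = 5, C = 10.
toy = pd.DataFrame({
    'breath_id': np.repeat([1, 2, 3], TOY_BREATH_LENGTH),
    'R': np.repeat([5, 20, 5], TOY_BREATH_LENGTH),
    'C': np.repeat([10, 10, 10], TOY_BREATH_LENGTH),
    'u_in': np.arange(9, dtype=float),
})

toy_u_in = {}
toy_ids = {}
# groupby visits only the (R, C) pairs that actually occur in the data.
for (r, c), group in toy.groupby(['R', 'C'], sort=False):
    key = str(r) + '_' + str(c)
    # One row per breath, one column per time step.
    toy_u_in[key] = group['u_in'].values.reshape(-1, TOY_BREATH_LENGTH)
    # Take the breath_id from the first row of each breath.
    toy_ids[key] = group['breath_id'].values[::TOY_BREATH_LENGTH]

print(toy_u_in['5_10'])
print(toy_ids['5_10'])
```

Each row of `toy_u_in['5_10']` is one breath, and `toy_ids['5_10']` lists the matching breath_id values in the same order.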

# Find the nearest u_in curve for the first five test breaths

For the first five test breaths, find the breaths with the same R and C that have the most similar u_in (defined through chi squared). 

Plot both test and train u_in curves to check.

In [None]:
def find_nn(which):
    """Plot the u_in curve of test breath number `which` and of its nearest train breath."""
    # Rows of the test data that belong to this breath, and its R/C key
    current_u_in = test['u_in'][which*BREATH_LENGTH:(which+1)*BREATH_LENGTH].values
    rc_combo = str(test['R'][which*BREATH_LENGTH]) + '_' + str(test['C'][which*BREATH_LENGTH])
    
    # Sum of squared differences to every train breath with the same R and C
    chi2 = np.sum((train_u_in[rc_combo] - current_u_in)**2, axis = -1)
    nn = breath_ids[rc_combo][np.argmin(chi2)]
    
    plt.plot(current_u_in, label = 'test')
    plt.plot(train[train['breath_id'] == nn]['u_in'].values, label = 'train')
    plt.legend()
    plt.ylabel('u_in')
    plt.xlabel('time step')
    plt.show()

for idx in range(5):
    find_nn(idx)

# Find nearest neighbors for each test breath and save

In [None]:
STORE_NN = 10

nn_breath_id = np.zeros((num_test_breaths, STORE_NN), dtype = int)

for idx in range(num_test_breaths):
    if idx % 1000 == 0:
        print(idx)
        
    current_u_in = test['u_in'][idx*BREATH_LENGTH:(idx+1)*BREATH_LENGTH].values
    current_name = str(test['R'][idx*BREATH_LENGTH]) + '_' + str(test['C'][idx*BREATH_LENGTH])
    
    chi2 = np.sum((train_u_in[current_name] - current_u_in)**2, axis = -1)
    nn_breath_id[idx] = breath_ids[current_name][np.argsort(chi2)[:STORE_NN]]
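
The loop above sorts all chi2 values even though only the smallest STORE_NN are kept. If this step becomes a bottleneck, a partial selection with `np.argpartition` is one possible alternative. The sketch below uses random stand-in values; `toy_chi2` and `k` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_chi2 = rng.random(1000)   # stand-in for the chi2 values of one test breath
k = 10                        # plays the role of STORE_NN

# Full sort, as in the loop above.
top_full = np.argsort(toy_chi2)[:k]

# Partial selection: argpartition places the k smallest entries first
# (in arbitrary order), so only those k need to be sorted afterwards.
candidates = np.argpartition(toy_chi2, k)[:k]
top_partial = candidates[np.argsort(toy_chi2[candidates])]

print(np.array_equal(top_full, top_partial))
```

Both approaches return the same indices when the values are distinct; with ties, the order among equal values may differ.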

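Sanity check on the results: plot the u_in curves of the stored nearest neighbors of the second test breath together with the test curve, zoomed in on the first 20 time steps.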

In [None]:
current_u_in = test['u_in'][BREATH_LENGTH:2*BREATH_LENGTH].values
nn = nn_breath_id[1]

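for i in range(STORE_NN):
    # Label only the first train curve so that the legend has a single 'train' entry
    plt.plot(train[train['breath_id'] == nn[i]]['u_in'].values, label = 'train' if i == 0 else None)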
plt.scatter(range(BREATH_LENGTH), current_u_in, label = 'test', color = 'k')
plt.legend()
plt.ylabel('u_in')
plt.xlabel('time step')
plt.xlim([0,20])
plt.show()

In [None]:
# Write the breath_id values as integers (the default format is floating point)
np.savetxt('nn.txt', nn_breath_id, fmt = '%d')
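
To use the saved neighbors elsewhere, the file has to be read back as integers. The sketch below shows one way to do this with made-up ids and a temporary file; the file name `nn_toy.txt` is hypothetical. Reading as float and then casting to int works both for the default floating-point format of `np.savetxt` and for an integer format.

```python
import os
import tempfile
import numpy as np

toy_nn = np.array([[101, 205, 7], [42, 13, 999]])  # made-up breath_id values

results = []
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'nn_toy.txt')  # hypothetical file name
    # '%.18e' is the default format of np.savetxt, '%d' writes plain integers.
    for fmt in ('%.18e', '%d'):
        np.savetxt(path, toy_nn, fmt=fmt)
        # Read as float first, then cast; this handles both formats.
        loaded = np.loadtxt(path).astype(int)
        results.append(loaded)

print(results[0])
print(results[1])
```

Row `i` of the loaded array holds the neighbor ids of test breath `i`, ordered from nearest to farthest, as in `nn_breath_id`.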