# Introduction and findings

A super helpful response by [@PC Jimmy](https://www.kaggle.com/pcjimmmy) in the [discussion about the peak airway pressure (PIP)](https://www.kaggle.com/c/ventilator-pressure-prediction/discussion/283827) made me look a bit more into the behavior of pressure knowing that the underlying controller is trying to reach some predetermined PIP. We have to keep in mind each breath has a different PIP (and potentially other parameters).

We only consider R = 20, C = 10.

During our analysis we found that the following features might be useful:
* Boolean variable denoting whether $u_{in}$ reaches zero anywhere before the end of the inhale phase. This is an imperfect proxy for pressure overshooting PIP. Sometimes $u_{in} = 0$ at the first timestep; this should be discarded.
* Integral of $u_{in}$ up to the time of the first zero $u_{in}$. This should be a proxy for PIP, as the first zero in $u_{in}$ represents roughly the time when the pressure crosses PIP and the integral of $u_{in}$ roughly represents the amount of air injected.
* $u_{in}^{0.69}$ and its shifts, lags, cumulative sum, ...

Version 4: Checked the idea we might estimate PIP by a simple algebraic equation. Does not work.

# Preliminaries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
train = pd.read_csv('../input/ventilator-pressure-prediction/train.csv')

In [None]:
BL = 80 #Each breath consists of 80 time steps

# Look into the first 20 breaths with R = 20, C = 10

Based on the [EDA](https://www.kaggle.com/motloch/ventilator-pressure-train-data-exploration#Pressure), for this combination of R, C the artificial lung often reaches the PIP.

In [None]:
train_RC_20_10 = train[(train['R'] == 20) & (train['C'] == 10)]

After plotting the pressure curves, we read off these values of PIP by eye

In [None]:
pips = np.array([37,15,25,30,20,25,30,40,25,30,40,35,25,37,37,35,10,37,35,35])

For the first 20 breaths with R = 20, C = 10 we plot the first 30 values of pressure and u_in. We also show the value of PIP we estimated from the graph.

In [None]:
for i in range(20):
    u_in  = train_RC_20_10['u_in']    [i*BL:i*BL+30].values
    u_out = train_RC_20_10['u_out']   [i*BL:i*BL+30].values
    p     = train_RC_20_10['pressure'][i*BL:i*BL+30].values

    plt.plot(u_in/2, label = 'u_in/2')
    plt.plot(p, label = 'pressure')
    plt.axhline(pips[i], color = 'gray', ls = '--')
    plt.axhline(0, color = 'gray', ls = '--')
    plt.legend()
    plt.show()

Takeaways:
* In most cases the pressure reached the PIP value, though in some cases there was not enough time to reach PIP.
* In many cases (though not always), zero u_in coincides with times when the pressure is above PIP. This makes sense: the controller detects that the pressure in the lung is too high and shuts off the valve.
* When this happens, notice there is no delay. This goes against the "two time units shift" found for example in this [notebook](https://www.kaggle.com/luizflpe/vpp-pressure-hysteresis-impact-of-r-feat-eng). It looks like the u_in reaction to pressure overshoot is immediate.
* In some cases shutting off the valve happens immediately and u_in goes to zero straight away; in others there is a long gradual decrease of u_in.
* We found PIP values (for example 40) inconsistent with the [preprint](https://arxiv.org/pdf/2102.06779.pdf), although this could just be an overshoot that would have been corrected given enough time.

Potential new features:
* Does u_in reach zero anywhere before the end of the inhale phase (the phase when u_out = 0)? This is an imperfect proxy for pressure overshooting PIP. Sometimes u_in = 0 at the first timestep; this should be discarded.
* Integral of u_in up to the time of the first zero u_in. This should be a proxy for PIP, as the first zero in u_in represents roughly the time when the pressure crosses PIP and the integral of u_in roughly represents the amount of air injected.
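
The two candidate features above could be computed per breath along the following lines. This is a minimal sketch, not the code used in this notebook: the helper name `zero_crossing_features` and the tiny `demo` frame are made up for illustration, and it assumes each breath's rows are in time order with the inhale rows (`u_out == 0`) coming first.

```python
import numpy as np
import pandas as pd

def zero_crossing_features(breath):
    """Features for one breath (rows in time order, inhale rows first)."""
    u_in = breath['u_in'].to_numpy(dtype=float)
    inhale = breath['u_out'].to_numpy() == 0

    # Inhale steps with u_in == 0, skipping the first timestep
    zero_idx = np.flatnonzero(inhale[1:] & (u_in[1:] == 0)) + 1
    reaches_zero = zero_idx.size > 0

    # Sum up to the first zero, or to the end of the inhale if there is none
    t_cross = int(zero_idx[0]) if reaches_zero else int(inhale.sum())
    return {
        'u_in_reaches_zero': bool(reaches_zero),
        'u_in_area_to_first_zero': float(u_in[:t_cross].sum()),
    }

# Toy data: breath 1 closes the valve during the inhale, breath 2 never does
demo = pd.DataFrame({
    'breath_id': [1] * 6 + [2] * 6,
    'u_in':  [0, 10, 20, 0, 5, 0, 5, 10, 15, 20, 0, 0],
    'u_out': [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1],
})
features = pd.DataFrame(
    [{'breath_id': bid, **zero_crossing_features(b)}
     for bid, b in demo.groupby('breath_id')]
).set_index('breath_id')
print(features)
```

On the real data the same loop would run over `train.groupby('breath_id')`; whether these features actually help a model is not tested here.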

# Test integral of u_in to the PIP transition time

Here we check how well the integral idea from above works. We will use the PIP values from above as a target and sum u_in to either 
* the end of inhalation, which has been done in other notebooks
* the first zero u_in (discarding the first timestep)

We ignore actual time differences as they are [mostly all the same](https://www.kaggle.com/motloch/ventilator-pressure-train-data-exploration/notebook#Time-steps-in-individual-breaths). This can potentially be improved.

In [None]:
area            = np.zeros(20)
area_to_pip_cross = np.zeros(20)

for i in range(20):
    u_in  = train_RC_20_10['u_in']    [i*BL:i*BL+32].values
    u_out = train_RC_20_10['u_out']   [i*BL:i*BL+32].values
    p     = train_RC_20_10['pressure'][i*BL:i*BL+32].values

    #position of the first zero u_in during the inhale
    #(not counting the first timestep)
    #(proxy for pressure crossing of PIP)
    t_cross = 1 + np.argmax(u_in[1:]*(1-u_out[1:]) == 0)
    
    area_to_pip_cross[i] = sum(u_in[:t_cross])
    area[i]              = sum(u_in)

How well do we predict PIP?

In [None]:
plt.scatter(area, pips, label = 'area')
plt.scatter(area_to_pip_cross, pips, marker = 'x', c = 'r', label = 'area to PIP crossing')
plt.ylabel('PIP')
plt.legend();

Ok, so there are some differences, as expected. Looks like the area to PIP crossing (if PIP crossing present) has tighter spread. The correlation is a bit higher too:

In [None]:
print('Correlation of PIP and')
print(f'Area {np.corrcoef(area, pips)[0,1]:.2f}')
print(f'Area to PIP crossing {np.corrcoef(area_to_pip_cross, pips)[0,1]:.2f}')
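
The areas above treat all time steps as equally long. A minimal sketch of a time-weighted version follows, assuming the `time_step` column of the competition data supplies the step lengths; the helper `time_weighted_area` and the toy numbers are illustrative only and were not part of the analysis above.

```python
import numpy as np

def time_weighted_area(u_in, time_step, t_cross):
    """Left Riemann sum of u_in over time, using steps 0 .. t_cross-1."""
    u_in = np.asarray(u_in, dtype=float)
    time_step = np.asarray(time_step, dtype=float)
    dt = np.diff(time_step)          # dt[k] = time_step[k+1] - time_step[k]
    k = min(t_cross, len(dt))        # the last sample has no following step
    return float(np.sum(u_in[:k] * dt[:k]))

# Toy breath with slightly uneven time steps
u_in_demo = [0.0, 10.0, 20.0, 0.0]
time_demo = [0.00, 0.03, 0.07, 0.10]
area_demo = time_weighted_area(u_in_demo, time_demo, t_cross=3)
print(area_demo)
```

With nearly uniform steps this differs from the plain sum only by an almost constant factor, so any gain in correlation is likely to be small.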

# Air inflow proportional to pressure difference relative to PIP

The previous model is a bit naive in that physically you would expect the air inflow (and thus change in pressure) to depend on the difference between the current pressure and PIP. We can thus improve our expectation to

$PIP \propto \sum_{i = 1}^{t_{cross}} (PIP - p_i) u_{in,i}$

Let's see how well this does.

In [None]:
area_to_pip_crossB = np.zeros(20)
for i in range(20):
    u_in  = train_RC_20_10['u_in']    [i*BL:i*BL+32].values
    u_out = train_RC_20_10['u_out']   [i*BL:i*BL+32].values
    p     = train_RC_20_10['pressure'][i*BL:i*BL+32].values
    
    #position of the first zero u_in during the inhale
    #(not counting the first timestep)
    #(proxy for pressure crossing of PIP)
    t_cross = 1 + np.argmax(u_in[1:]*(1-u_out[1:]) == 0)
    
    area_to_pip_crossB[i] = sum((pips[i] - p[:t_cross])*u_in[:t_cross])

It does not look much better here, though the curve is a bit smoother.

In [None]:
plt.scatter(  area_to_pip_crossB, pips, label = 'pressure correction')
plt.scatter(13*area_to_pip_cross, pips, marker = 'x', c = 'r', label = 'no pressure correction')
plt.ylabel('PIP')
plt.legend();

But the correlation goes up a notch

In [None]:
print('Correlation of PIP and the area to PIP crossing')
print(f'without pressure correction {np.corrcoef(area_to_pip_cross,  pips)[0,1]:.2f}')
print(f'with    pressure correction {np.corrcoef(area_to_pip_crossB, pips)[0,1]:.2f}')

# Nonlinear u_in

We can actually get an even better description by assuming a nonlinear dependence on u_in. In principle, there should be a bigger difference between the valve closed and 20% opened than between the valve 80% opened and fully opened.

We model this with
$PIP \propto \sum_{i = 1}^{t_{exhale}} (PIP - p_i) u_{in,i}^\alpha$

It looks like integrating over the whole inhale phase works a bit better (note change in the upper bound).

After some tweaking, $\alpha = 0.69$ works best.

In [None]:
alpha = 0.69

area_to_pip_crossC = np.zeros(20)
for i in range(20):
    u_in  = train_RC_20_10['u_in']    [i*BL:i*BL+32].values
    u_out = train_RC_20_10['u_out']   [i*BL:i*BL+32].values
    p     = train_RC_20_10['pressure'][i*BL:i*BL+32].values
    
    #t_cross is not needed here: we sum over the whole 32-step window
    #(used as an approximation of the inhale phase)
    
    area_to_pip_crossC[i] = sum((pips[i] - p)*u_in**alpha)

I will take this:

In [None]:
print('Correlation of PIP and the area to PIP crossing with pressure correction')
print(f'alpha = 1.00: {np.corrcoef(area_to_pip_crossB, pips)[0,1]:.4f}')
print(f'alpha = 0.69: {np.corrcoef(area_to_pip_crossC, pips)[0,1]:.4f}')
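
The tweaking of $\alpha$ mentioned above can be done with a simple scan over candidate values. The sketch below is one possible way to do it, not necessarily how 0.69 was found: the function `scan_alpha` is hypothetical and the data are random numbers standing in for the real breaths, so the "best" value it prints has no physical meaning.

```python
import numpy as np

def scan_alpha(u_in, p, pips, alphas):
    """Correlation of PIP with sum((PIP - p) * u_in**alpha) for each alpha.

    u_in, p: arrays of shape (n_breaths, n_steps); pips: shape (n_breaths,).
    Returns the alpha with the highest correlation and all correlations.
    """
    u_in = np.asarray(u_in, dtype=float)
    p = np.asarray(p, dtype=float)
    pips = np.asarray(pips, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    deficit = pips[:, None] - p
    corrs = np.array([
        np.corrcoef((deficit * u_in**a).sum(axis=1), pips)[0, 1]
        for a in alphas
    ])
    return alphas[np.argmax(corrs)], corrs

# Synthetic stand-in data (NOT the competition data)
rng = np.random.default_rng(0)
pips_demo = rng.uniform(10, 40, size=20)
u_in_demo = rng.uniform(0, 100, size=(20, 32))
p_demo = pips_demo[:, None] * rng.uniform(0.2, 1.0, size=(20, 32))

alphas_demo = np.linspace(0.3, 1.2, 19)
best_alpha, corrs_demo = scan_alpha(u_in_demo, p_demo, pips_demo, alphas_demo)
print(best_alpha)
```

With only 20 breaths and PIP values read off by eye, a value tuned this way can easily overfit, which is why checking it on further breaths matters.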

Looks much more linear now

In [None]:
plt.scatter(  area_to_pip_crossB, pips, label = 'alpha = 1')
plt.scatter(3*area_to_pip_crossC, pips, marker = 'x', c = 'r', label = 'alpha = 0.69')
plt.ylabel('PIP')
plt.legend();

This suggests we might play around with the idea of using $u_{in}^{0.69}$ as a feature, in addition to $u_{in}$. The same applies to shifts, lags, cumulative sum, ...
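
A minimal sketch of what such features might look like with pandas follows. The column names (`u_in_pow`, `u_in_pow_lag1`, ...) and the helper `add_u_in_power_features` are invented for this example, and whether these features improve a model has not been checked here.

```python
import pandas as pd

ALPHA = 0.69

def add_u_in_power_features(df, alpha=ALPHA):
    """Add u_in**alpha plus its per-breath lag, lead and cumulative sum."""
    out = df.copy()
    out['u_in_pow'] = out['u_in'] ** alpha
    grouped = out.groupby('breath_id')['u_in_pow']
    # Shifts stay inside each breath; missing neighbours are filled with 0
    out['u_in_pow_lag1'] = grouped.shift(1).fillna(0.0)
    out['u_in_pow_lead1'] = grouped.shift(-1).fillna(0.0)
    out['u_in_pow_cumsum'] = grouped.cumsum()
    return out

# Toy frame with two short breaths
toy = pd.DataFrame({
    'breath_id': [1, 1, 1, 2, 2],
    'u_in': [0.0, 1.0, 4.0, 1.0, 1.0],
})
toy_features = add_u_in_power_features(toy)
print(toy_features)
```

Grouping by `breath_id` keeps the shifts and cumulative sums from leaking across breath boundaries.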

# Next 20 breaths

Let's check on the next 20 breaths whether we confirm this behavior

In [None]:
pips20 = np.array([30,30,37,16,9.5,25,40,20,25,25,22,25,35,35,37,20,25,20,15,35])

In [None]:
for i in range(20, 40):
    u_in  = train_RC_20_10['u_in']    [i*BL:i*BL+30].values
    u_out = train_RC_20_10['u_out']   [i*BL:i*BL+30].values
    p     = train_RC_20_10['pressure'][i*BL:i*BL+30].values

    plt.plot(u_in/2, label = 'u_in/2')
    plt.plot(p, label = 'pressure')
    plt.axhline(pips20[i-20], color = 'gray', ls = '--')
    plt.axhline(0, color = 'gray', ls = '--')
    plt.legend()
    plt.show()

The same integral as above

In [None]:
alpha20 = 0.69

area_to_pip_crossC20 = np.zeros(20)
for i in range(20, 40):
    u_in  = train_RC_20_10['u_in']    [i*BL:i*BL+32].values
    u_out = train_RC_20_10['u_out']   [i*BL:i*BL+32].values
    p     = train_RC_20_10['pressure'][i*BL:i*BL+32].values
    
    #t_cross is not needed here: we sum over the whole 32-step window
    #(used as an approximation of the inhale phase)
    
    area_to_pip_crossC20[i-20] = sum((pips20[i-20] - p)*u_in**alpha20)

Again pretty tight relationship, though a bit more curved this time

In [None]:
plt.scatter(area_to_pip_crossC  , pips,  label = 'first 20 breaths')
plt.scatter(area_to_pip_crossC20, pips20, marker = 'x', c = 'r', label = 'next 20 breaths')
plt.ylabel('PIP')
plt.legend();

But the correlation is still pretty high

In [None]:
print('Correlation of PIP and the area to PIP crossing with pressure correction')
print(f'first 20 breaths: {np.corrcoef(area_to_pip_crossC,   pips)[0,1]:.4f}')
print(f'next  20 breaths: {np.corrcoef(area_to_pip_crossC20, pips20)[0,1]:.4f}')

The correlation is even better with PIP^2, because this removes the curvature seen above

In [None]:
print('Correlation of PIP^2 and the area to PIP crossing with pressure correction')
print(f'first 20 breaths: {np.corrcoef(area_to_pip_crossC,   pips**2)[0,1]:.4f}')
print(f'next  20 breaths: {np.corrcoef(area_to_pip_crossC20, pips20**2)[0,1]:.4f}')

# Getting PIP for training set by solving an algebraic equation?

This raised the hope that, for the training breaths, $$PIP^2 \approx c_0 + c_1 \left(\sum (PIP - p)u_{in}^\alpha\right)$$ could be used to solve for PIP (with $c_0$ and $c_1$ dependent on R and C).

Unfortunately, this does not work as we sometimes get imaginary solutions.
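
To see where the imaginary solutions come from, note that the relation is a quadratic in PIP and its discriminant can become negative. A small sketch with made-up numbers follows; the helper `solve_pip_quadratic` and the coefficient values are hypothetical and are not the fitted values for this data.

```python
import numpy as np

def solve_pip_quadratic(u_in, p, c0, c1, alpha=0.69):
    """Real roots of PIP^2 = c0 + c1 * sum((PIP - p) * u_in**alpha), or None.

    With S = sum(u_in**alpha) and Sp = sum(p * u_in**alpha) this reads
    PIP^2 - c1*S*PIP - (c0 - c1*Sp) = 0.
    """
    w = np.asarray(u_in, dtype=float) ** alpha
    S = w.sum()
    Sp = (np.asarray(p, dtype=float) * w).sum()
    disc = (c1 * S) ** 2 + 4 * (c0 - c1 * Sp)
    if disc < 0:
        return None                      # no real solution
    root = np.sqrt(disc)
    return (c1 * S - root) / 2, (c1 * S + root) / 2

# Made-up numbers: same breath, two different (hypothetical) intercepts
u_toy, p_toy = [1.0, 1.0], [1.0, 1.0]
roots_ok = solve_pip_quadratic(u_toy, p_toy, c0=5.0, c1=1.0, alpha=1.0)
roots_bad = solve_pip_quadratic(u_toy, p_toy, c0=0.0, c1=1.0, alpha=1.0)
print(roots_ok, roots_bad)
```

Even when real roots exist there are two of them, so one would still have to decide which root is the physical one.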

We might still get something from
$$PIP \approx c_0 + c_1 \left(\sum (PIP - p)u_{in}^\alpha\right)$$
though.
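
Because the sum is linear in PIP, this relation can be rearranged into a single linear equation. Writing $S = \sum u_{in}^\alpha$ and $S_p = \sum p\, u_{in}^\alpha$ (shorthand introduced only for this derivation):

$$PIP \approx c_0 + c_1 \left(PIP \cdot S - S_p\right)$$
$$PIP \left(1 - c_1 S\right) \approx c_0 - c_1 S_p$$
$$PIP \approx \frac{c_0 - c_1 S_p}{1 - c_1 S}$$

This has the form $A \cdot PIP = B$ with $A = 1 - c_1 S$ and $B = c_0 - c_1 S_p$, and it breaks down whenever $A$ is close to zero.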

Best fit values:

In [None]:
c1, c0 = np.polyfit(
                np.concatenate((area_to_pip_crossC, area_to_pip_crossC20)), 
                np.concatenate((pips, pips20)), 
                1)
print(c0, c1)

Apply it to the first 5000 breaths with R = 20, C = 10.

In [None]:
num_breaths = len(train_RC_20_10) // BL #each breath has BL rows
pip_estimate = np.zeros(5000)

for i in range(5000):
    u_in  = train_RC_20_10['u_in']    [i*BL:i*BL+32].values
    p     = train_RC_20_10['pressure'][i*BL:i*BL+32].values
    
    #Equation A*PIP = B
    A = 1 - c1 * np.sum(u_in**alpha)
    B = c0 - c1 * np.sum(p * u_in**alpha)
    
    pip_estimate[i] = B/A

The extremes are way off (because sometimes we get A close to zero)

In [None]:
print('Min: ', np.min(pip_estimate))
print('Max: ', np.max(pip_estimate))
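
Since the extreme values come from A being close to zero, one option is to discard those breaths instead of trusting B/A. The sketch below is only an illustration: the helper `solve_pip_linear`, the threshold `min_abs_A` and the coefficients used in the toy call are all assumptions, not values from this notebook.

```python
import numpy as np

def solve_pip_linear(u_in, p, c0, c1, alpha=0.69, min_abs_A=0.1):
    """Solve A*PIP = B; return NaN when |A| is below min_abs_A."""
    w = np.asarray(u_in, dtype=float) ** alpha
    A = 1 - c1 * w.sum()
    B = c0 - c1 * (np.asarray(p, dtype=float) * w).sum()
    if abs(A) < min_abs_A:
        return np.nan                    # ill-conditioned, no estimate
    return B / A

# Toy numbers with hypothetical coefficients
u_toy, p_toy = [1.0, 1.0], [1.0, 1.0]
pip_ok = solve_pip_linear(u_toy, p_toy, c0=1.0, c1=0.25, alpha=1.0)
pip_bad = solve_pip_linear(u_toy, p_toy, c0=1.0, c1=0.5, alpha=1.0)
print(pip_ok, pip_bad)
```

The NaN entries can then be skipped with `np.nanmin` and `np.nanmax`, or dropped before plotting a histogram.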

The majority of the predictions are sensible though

In [None]:
plt.hist(pip_estimate, np.arange(0,50,1));