# Summary

Most EDAs naturally use `time_step` as the x-axis when exploring pressure prediction.

When `pressure` is also used as the x-axis, it may provide additional insights. For example, we can easily illustrate why `u_in_lag2` is a better feature than `u_in_lag1`.

The correlation heatmaps are shown for:
>  the 9 R-C combinations

>  before and after `time_step` = 0.5.

We know this is not a simple linear regression or linear correlation problem. But illustrating association patterns may help to ask interesting questions:
* Why does `R=50, C=20` have the largest subset size? Does this subset imply larger prediction errors?
* Which R/C combination will not be predicted well globally by a new feature?
* Are stratified splits and conditional lag amounts needed?

Findings:
* `u_in_lag2` has a medium-high correlation with `pressure`, while `u_in_lag4` may not provide additional net benefit. (`cumsum` can cover the cases where `u_in_lag2` is weaker, at R=5.)
* `cumsum` is very highly correlated with `pressure` and can complement `u_in_lag2` (e.g. the R=5, C=10 subset).
* `cummean` correlates relatively better in the later stage (say `time_step` > 0.5) than in the early ramp stage.

Exploring...

In [None]:
# libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler, RobustScaler, normalize
# from sklearn.model_selection import GroupKFold
# from sklearn import metrics

# import time
# import lightgbm as lgb

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', None)

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Functions

In [None]:
def plot2_bid(bid, cols=('u_in_lag1', 'u_in_lag2')):
    '''
    Plot one breath (from the global `train`) twice: features vs time_step (left)
    and features vs pressure on the x-axis (right), to illustrate how well
    the two columns in `cols` track pressure.
    Does the 2nd column (blue) have potentially higher predictive power?
    '''
    
    tmp = train.loc[(train['breath_id'] == bid) & (train['time_step'] < 1.2)].reset_index(drop=True)
    
    fig, (ax1, ax3) = plt.subplots(1, 2, figsize = (12, 4)) 
    ax2 = ax1.twinx()
#     ax2.plot(tmp['time_step'], tmp['cumsum'], 'b-', label='cumsum', alpha=0.5)    
    ax2.plot(tmp['time_step'], tmp['u_out'], 'y-', label='u_out', alpha=0.5)    
    
    ax1.plot(tmp['time_step'], tmp['pressure'], 'r-', label='pressure', alpha=0.5)
    ax1.plot(tmp['time_step'], tmp['u_in'], 'g-', label='u_in', alpha=0.5)
    ax1.plot(tmp['time_step'], tmp[cols[0]], 'k--', label=cols[0], alpha=0.5)
    ax1.plot(tmp['time_step'], tmp[cols[1]], 'b--', label=cols[1])

    ax1.set_xlabel('Timestep')
    
    RC = tmp['RC'][0]
    ax1.set_title(f'{RC} breath_id:{bid}')

    ax1.legend(loc=(1.1, 0.7))
    ax2.legend(loc=(1.1, 0.5))
    
    ####################################
    
    ax3.plot(tmp['pressure'], tmp['u_in'], 'g-', label='u_in', alpha=0.5)
    ax3.plot(tmp['pressure'], tmp[cols[0]], 'k--', label=cols[0], lw=1.5, alpha=0.4)
    ax3.plot(tmp['pressure'], tmp[cols[1]], 'b--', label=cols[1], lw=2, alpha=0.8)
    
    ax3.set_xlabel('Pressure') 

    ax3.legend(loc=(1.1, 0.7))
    
    fig.tight_layout()    
    plt.show()

In [None]:
# scaled version for cumulative features
def plot2_bid_scaled(bid, cols=('cumsum', 'u_in_lag2')):
    '''
    Same comparison as plot2_bid, for columns that were scaled beforehand
    (everything except pressure and time_step), so the features are drawn
    on a twin y-axis.
    Does the 2nd column (blue) have potentially higher predictive power?
    '''
    tmp = train.loc[(train['breath_id'] == bid) & (train['time_step'] < 1.2)].reset_index(drop=True)
    
    fig, (ax1, ax3) = plt.subplots(1, 2, figsize = (12, 4)) 
    ax2 = ax1.twinx()   
    
    ax1.plot(tmp['time_step'], tmp['pressure'], 'r-', label='pressure', alpha=0.5)
    
    ax2.plot(tmp['time_step'], tmp['u_out'], 'y-', label='u_out', alpha=0.5)    
    ax2.plot(tmp['time_step'], tmp['u_in'], 'g-', label='u_in', alpha=0.5)
    ax2.plot(tmp['time_step'], tmp[cols[0]], 'k--', label=cols[0], alpha=0.5)
    ax2.plot(tmp['time_step'], tmp[cols[1]], 'b--', label=cols[1])
    
    ax1.set_xlabel('Timestep')
    ax1.set_ylabel('Pressure')
    
    RC = tmp['RC'][0]
    ax1.set_title(f'{RC} breath_id:{bid}')
    
    ax1.legend(loc=(1.12, 0.9))
    ax2.legend(loc=(1.12, 0.4))
    
    #################################### 
    ax4 = ax3.twinx()
    ax4.plot(tmp['pressure'], tmp['u_in'], 'g-', label='u_in', alpha=0.5)
    
#     ax3.plot(tmp['pressure'], tmp['u_in'], 'g-', label='u_in', alpha=0.5)
    ax3.plot(tmp['pressure'], tmp[cols[0]], 'k--', label=cols[0], lw=1.5, alpha=0.4)
    ax3.plot(tmp['pressure'], tmp[cols[1]], 'b--', label=cols[1], lw=2, alpha=0.8)    
            
    ax3.set_xlabel('Pressure') 

    ax3.legend(loc=(1.12, 0.8))    
    ax4.legend(loc=(1.12, 0.4))   
    
    fig.tight_layout()    
    plt.show()

## Data loading and overview

In [None]:
path = '../input/ventilator-pressure-prediction'
train_raw = pd.read_csv(os.path.join(path, 'train.csv'))
# test = pd.read_csv(os.path.join(path, 'test.csv'))
# sub = pd.read_csv(os.path.join(path, 'sample_submission.csv'))

In [None]:
DEBUG = False
#DEBUG = True

if DEBUG:
    train = train_raw.iloc[:8*5000, :].copy()  # copy so later column assignments do not act on a slice of train_raw
else:
    train = train_raw

In [None]:
train.shape

In [None]:
train['RC'] = [f'{r}_{c}' for r, c in zip(train['R'], train['C'])]

In [None]:
RC_order = ['5_10', '5_20', '5_50', '20_10', '20_20', '20_50', '50_10', '50_20', '50_50']
plt.figure(figsize = (10, 6))
sns.countplot(x='RC', order=RC_order, data=train)

Is `RC_50_10` a special group?

In [None]:
# Only the RC_50_10 subset has negative pressures.
train.groupby('RC')['pressure'].describe().round(2)

In [None]:
train['u_in_lag1'] = train.groupby('breath_id')['u_in'].shift(1) 
train['u_in_lag2'] = train.groupby('breath_id')['u_in'].shift(2)
train['u_in_lag3'] = train.groupby('breath_id')['u_in'].shift(3)
train['u_in_lag4'] = train.groupby('breath_id')['u_in'].shift(4) 
train = train.fillna(0) 

train['area'] = train['time_step'] * train['u_in']
train['area'] = train.groupby('breath_id')['area'].cumsum()

train['cumsum'] = train.groupby('breath_id')['u_in'].cumsum()

train['step'] = train.groupby('breath_id')['time_step'].cumcount() + 1
train['cummean'] = train['cumsum'] / train['step']

In [None]:
train.tail()
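
As a quick sanity check on the lag and cumulative features built above, the sketch below recreates them on a tiny made-up frame (two breaths of three rows each; the values are illustrative, not competition data). It is meant to show that grouping by `breath_id` keeps `shift` and `cumsum` from leaking across breath boundaries.

```python
import pandas as pd

# Toy data: two breaths with three time steps each (values are made up).
toy = pd.DataFrame({
    'breath_id': [1, 1, 1, 2, 2, 2],
    'u_in':      [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

grouped = toy.groupby('breath_id')['u_in']
toy['u_in_lag1'] = grouped.shift(1).fillna(0)   # previous value within the same breath, 0 at the start
toy['cumsum'] = grouped.cumsum()                # running sum restarts for each breath
toy['step'] = toy.groupby('breath_id').cumcount() + 1
toy['cummean'] = toy['cumsum'] / toy['step']    # running mean within the breath

print(toy)
```

The first row of the second breath gets a lag of 0 rather than the last `u_in` of the first breath, which is the behaviour the notebook relies on.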

# u_in_lag2 vs lag1

In [None]:
bid_list = list(train['breath_id'].unique())

In [None]:
# u_in_lag2 generally traces a narrower loop along the 45-degree line, i.e. it is better correlated with pressure
for bid in bid_list[:5]:
    plot2_bid(bid)

Compared with the left-hand plots, where the lagged curves simply look shifted in time, the right-hand plots show the improvement from lag2 more clearly: the blue curve encloses a smaller area, e.g. breath_id 1 and 4.
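
The "smaller enclosed area" reading can also be expressed as a number. The sketch below is one possible way to do it, using the shoelace formula on synthetic periodic signals where the input leads pressure by exactly 2 steps. The helper `loop_area` and the signals are illustrative assumptions; a real breath gives an open curve, so closing the polygon is only an approximation there.

```python
import numpy as np

def loop_area(x, y):
    """Shoelace formula: area enclosed by the closed polygon through the points (x[i], y[i])."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# Synthetic periodic signals (not competition data): the input leads pressure by 2 steps.
t = np.arange(40)
pressure = np.sin(2 * np.pi * t / 40)
u_in = np.sin(2 * np.pi * (t + 2) / 40)

# np.roll is a circular shift, which is acceptable here because the signals are periodic.
area_lag1 = loop_area(pressure, np.roll(u_in, 1))   # still one step out of phase
area_lag2 = loop_area(pressure, np.roll(u_in, 2))   # aligned with pressure

print(f'lag1 loop area: {area_lag1:.3f}, lag2 loop area: {area_lag2:.3f}')
```

When the lag matches the delay, the points fall on the diagonal and the loop collapses, which is the visual cue used in the right-hand panels.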

Check the last 5 breaths as a further test.

In [None]:
for bid in bid_list[-5:]:
    plot2_bid(bid)

# u_in_lag3 vs lag2

In [None]:
# u_in_lag3:  mixed/cross patterns
for bid in bid_list[:5]:
    plot2_bid(bid, cols =['u_in_lag3','u_in_lag2'])

# u_in_lag4 vs lag2

In [None]:
# u_in_lag4 looks too aggressive?
for bid in bid_list[:5]:
    plot2_bid(bid, cols =['u_in_lag4','u_in_lag2'])

# Corr by R_C subsets

### Overall when `u_out == 0`

In [None]:
cols = ['pressure', 'u_in', 'u_in_lag1', 'u_in_lag2', 'u_in_lag3', 'u_in_lag4']

In [None]:
RC_list = train['RC'].unique()

In [None]:
a = train.loc[train['u_out'] == 0, cols]
# Create both panels up front and pass each axis explicitly to seaborn
fig, (ax_p, ax_s) = plt.subplots(1, 2, figsize=(12, 5))
sns.heatmap(a.corr(), cmap='coolwarm', annot=True, annot_kws={"size":15}, ax=ax_p)
ax_p.set_title('Pearson correlation')
sns.heatmap(a.corr(method="spearman"), cmap='coolwarm', annot=True, annot_kws={"size":15}, ax=ax_s)  # decimal points: fmt='.2g'
ax_s.set_title('Spearman correlation')
fig.tight_layout()

The parametric Pearson and non-parametric Spearman methods give similar levels of correlation.
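
For context on when the two methods would disagree, here is a small sketch on synthetic data (not the ventilator data). Spearman works on ranks, so a monotonic but strongly non-linear relation reaches a rank correlation of 1 while Pearson stays clearly below 1.

```python
import numpy as np
import pandas as pd

# Synthetic example: y is a monotonic but strongly non-linear function of x.
x = np.linspace(0, 5, 50)
demo = pd.DataFrame({'x': x, 'y': np.exp(x)})

pearson = demo.corr().loc['x', 'y']                      # linear association
spearman = demo.corr(method='spearman').loc['x', 'y']    # rank (monotonic) association

print(f'pearson={pearson:.3f}, spearman={spearman:.3f}')
```

Similar values from both methods, as in the heatmaps above, hint that this kind of monotonic non-linearity is probably not what limits the correlations here.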

### By R and C subsets

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(18,18),  sharey=True) # sharex=True,
for i, ax in enumerate(axes.flatten()):
    a = train.loc[(train['u_out'] == 0) & (train['RC'] == RC_order[i]) ,cols]
    sns.heatmap(data=a.corr(), cmap='coolwarm', annot=True, annot_kws={"size":18},
                ax=ax,
                vmin=-0.3, vmax=1,
               ) 
    #plt.yticks(rotation=90)
    ax.set_title(f'RC: {RC_order[i]}')
    
fig.tight_layout()
plt.show()

**Lag2 wins** in 6 of the 9 subsets and is expected to work well when `R` is 20 or 50.
>  At `R` = 5 it is still at medium-high correlation for two subsets: 0.45 (5_20) and 0.65 (5_50). Lag2 correlates well with lag4 when `C` > 10, so the net benefit may be small if both lag2 and lag4 are in the model.

>  5_10 subset: low or near-zero correlation.

Lag4 wins the remaining 3 subsets, all with `R` = 5.
>  Low resistance (`R` = 5) shows very high correlations (>0.8) between neighboring lags.

Based on the correlation heatmaps, the next cells plot more `breath_id` examples from the two subsets with the most distinct patterns: top left (5_10) and bottom right (50_50).
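
The nine heatmaps can also be condensed into a single table of feature-vs-pressure correlations per subset. This is a minimal sketch, assuming a frame with an `RC` column, a `pressure` column and some feature columns. The helper name `corr_with_pressure_by_group` and the synthetic data are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def corr_with_pressure_by_group(df, features, group_col='RC', target='pressure'):
    """Return one row per group holding the Pearson correlation of each feature with the target."""
    rows = {}
    for key, sub in df.groupby(group_col):
        rows[key] = sub[features].corrwith(sub[target])
    return pd.DataFrame(rows).T

# Synthetic stand-in data (not competition data): in group 'A' the target
# follows feature f1, in group 'B' it follows feature f2.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'RC': ['A'] * n + ['B'] * n,
    'f1': rng.normal(size=2 * n),
    'f2': rng.normal(size=2 * n),
})
noise = rng.normal(scale=0.1, size=2 * n)
demo['pressure'] = np.where(demo['RC'] == 'A', demo['f1'], demo['f2']) + noise

table = corr_with_pressure_by_group(demo, ['f1', 'f2'])
print(table.round(2))
print(table.idxmax(axis=1))   # best-correlated feature per group
```

On the notebook's data a call could look like `corr_with_pressure_by_group(train[train['u_out'] == 0], cols[1:])`, which skips `pressure` itself in the feature list.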

In [None]:
bid_list_lowCorr = list(train.loc[train['RC'] == '5_10', 'breath_id'].unique())
for bid in bid_list_lowCorr[-5:]:
    plot2_bid(bid, cols =['u_in_lag4','u_in_lag2'])

5_10 subset: pressure grows relatively smoothly.
Cumulative **area**, **cumsum**, **cummean** (as already proposed by other teams) may work better than lag2 for this subset?

In [None]:
bid_list_highCorr = list(train.loc[train['RC'] == '50_50', 'breath_id'].unique())
for bid in bid_list_highCorr[-5:]:
    plot2_bid(bid, cols =['u_in_lag4','u_in_lag2'])

# Corr before and after time_step 0.5
Patterns may be different during ramp and PIP/plateau stages.

In [None]:
time_step_max0 = round(train.loc[(train['u_out'] == 0), 'time_step'].max(),2)
time_step_max0

In [None]:
fig, axes = plt.subplots(nrows=len(RC_order), ncols=2, figsize=(16, 30),  sharex=True) # sharex=True,
for i, ax in enumerate(axes.flatten()):
    # Even panels: early stage (time_step < 0.5); odd panels: later stage
    if i % 2 == 0:
        time_step_start, time_step_end = 0, 0.5
    else:
        time_step_start, time_step_end = 0.5, time_step_max0
    a = train.loc[(train['u_out'] == 0) & 
                  (train['RC'] == RC_order[int(i / 2)]) &
                  (train['time_step'] >= time_step_start) &
                  (train['time_step'] < time_step_end)
                  ,cols]
    sns.heatmap(data=a.corr(), cmap='coolwarm', annot=True, ax=ax, annot_kws={"size":14},
                vmin=-0.3, vmax=1,
               ) 
    ax.set_title(f'RC: {RC_order[int(i / 2)]},  time_step: {time_step_start} to {time_step_end}')
    
fig.tight_layout()
plt.show()


As expected, `u_in` features tend to have larger predictive power during the early ramp stage (`time_step` < 0.5 or earlier).

Lag2 is still the winner most of the time. It is OK in the early stage for the `5_10` subset, although lag4 seems to be the best there.

-- Different optimal lag amounts for some subsets?
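
One hedged way to probe this question is to scan several lags and record, per subset, the lag whose shifted `u_in` correlates best with pressure. The sketch below uses synthetic breaths in which pressure echoes the input after a known delay. The helper `best_lag_per_group`, the group labels and the delays are made up for illustration.

```python
import numpy as np
import pandas as pd

def best_lag_per_group(df, lags=(1, 2, 3, 4), group_col='RC'):
    """For each group, return the lag of u_in with the highest Pearson correlation to pressure."""
    result = {}
    for key, sub in df.groupby(group_col):
        by_breath = sub.groupby('breath_id')['u_in']
        # Shift within each breath so lags never cross breath boundaries.
        corrs = {lag: by_breath.shift(lag).fillna(0).corr(sub['pressure']) for lag in lags}
        result[key] = max(corrs, key=corrs.get)
    return pd.Series(result, name='best_lag')

# Synthetic breaths (not competition data): pressure repeats u_in after a
# group-specific delay of 4 steps ('slow') or 2 steps ('fast').
rng = np.random.default_rng(1)
frames = []
for breath_id, (label, delay) in enumerate([('slow', 4), ('fast', 2)] * 10):
    u_in = rng.uniform(0, 10, size=30)
    pressure = np.concatenate([np.zeros(delay), u_in[:-delay]])
    frames.append(pd.DataFrame({'breath_id': breath_id, 'RC': label,
                                'u_in': u_in, 'pressure': pressure}))
demo = pd.concat(frames, ignore_index=True)

best = best_lag_per_group(demo)
print(best)
```

A subset-specific answer on the real data would only suggest, not prove, that conditional lags help, since correlation with pressure is a rough proxy for model benefit.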



# cumsum, cummean, area

## Motivation: some subsets (e.g. 5_10) are expected to correlate better with pressure through cumulative features than through u_in_lag2

In [None]:
train.columns

In [None]:
cols = ['pressure', 'u_in', 'u_in_lag2', 'area', 'cumsum', 'cummean']

In [None]:
fig, axes = plt.subplots(nrows=len(RC_order), ncols=2, figsize=(16, 30),  sharex=True) # sharex=True,
for i, ax in enumerate(axes.flatten()):
    # Even panels: early stage (time_step < 0.5); odd panels: later stage
    if i % 2 == 0:
        time_step_start, time_step_end = 0, 0.5
    else:
        time_step_start, time_step_end = 0.5, time_step_max0
    a = train.loc[(train['u_out'] == 0) & 
                  (train['RC'] == RC_order[int(i / 2)]) &
                  (train['time_step'] >= time_step_start) &
                  (train['time_step'] < time_step_end)
                  ,cols]
    sns.heatmap(data=a.corr(), cmap='coolwarm', annot=True, annot_kws={"size":14},
                ax=ax,
                vmin=-0.3, vmax=1,
               ) 
    ax.set_title(f'RC: {RC_order[int(i / 2)]},  time_step: {time_step_start} to {time_step_end}')
    
fig.tight_layout()
plt.show()

Findings:
* Yes, `cumsum` has extremely high correlations for the 5_10 subset, where `u_in_lag2` may fail.
* `cumsum` correlates very highly over the entire window analysed here (`u_out == 0`), both before and after `time_step` 0.5.
* `cummean` consistently correlates higher during the **later stage**, after `time_step` 0.5, even close to last_u_in.

## Revisit subset R=5 and C=10.

In [None]:
cols_raw = ['u_in', 'u_in_lag2', 'area', 'cumsum', 'cummean']

In [None]:
# RobustScaler() will produce negative values for most of cumsum and different ranges for cumsum and lag2
# scaler = RobustScaler()
# train[cols_raw] = scaler.fit_transform(train[cols_raw])
# train[cols_raw] = scaler.inverse_transform(train[cols_raw])

In [None]:
train['pressure'].describe()

In [None]:
# default range max = 1 is too small relative to the magnitude of pressure
scaler = MinMaxScaler(feature_range=(0, 10))
train[cols_raw] = scaler.fit_transform(train[cols_raw])

In [None]:
train[cols_raw].describe()
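
The rescaling above only changes the plotting range. Pearson correlation is unchanged by a positive linear rescaling, so it does not alter the relationships shown in the heatmaps. The sketch below illustrates this with a plain-NumPy stand-in for the min-max formula. `min_max_scale` is a hypothetical helper and the numbers are made up.

```python
import numpy as np

def min_max_scale(x, feature_range=(0, 10)):
    """Plain-NumPy stand-in for min-max scaling: map min(x) to low and max(x) to high."""
    low, high = feature_range
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (high - low) + low

# Made-up values standing in for a cumulative feature and the target.
cumsum_demo = np.array([5.0, 20.0, 60.0, 140.0, 300.0])
pressure_demo = np.array([6.0, 9.0, 15.0, 22.0, 31.0])

scaled = min_max_scale(cumsum_demo)
corr_raw = np.corrcoef(cumsum_demo, pressure_demo)[0, 1]
corr_scaled = np.corrcoef(scaled, pressure_demo)[0, 1]

print(scaled)                  # values now span 0 to 10
print(corr_raw, corr_scaled)   # the two correlations agree up to rounding
```

Because the scaler in the notebook is fitted on the whole training frame, a single breath may occupy only a small part of the 0 to 10 range in the per-breath plots.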

In [None]:

train.head()


In [None]:
bid_list_lowCorr = list(train.loc[train['RC'] == '5_10', 'breath_id'].unique())
print('Please note the y-scale: only the target pressure and time_step on the left panel plots are still in raw scale.\n\
        All other features are scaled by MinMaxScaler()')
for bid in bid_list_lowCorr[-5:]:
    plot2_bid_scaled(bid, cols =['cumsum','u_in_lag2'])

**High correlation (>0.9) between pressure and cumsum when R=5 and C=10!** (See the nearly straight rising line of cumsum ~ pressure, dashed gray, on the right plot.)

## cummean

In [None]:
bid_list_lowCorr = list(train.loc[train['RC'] == '5_10', 'breath_id'].unique())
for bid in bid_list_lowCorr[-5:]:
    plot2_bid_scaled(bid, cols =['cummean','u_in_lag2'])

### cummean vs cumsum

In [None]:
bid_list_lowCorr = list(train.loc[train['RC'] == '5_10', 'breath_id'].unique())
print('Note: now the blue curve is not for the default "u_in_lag2"  any more')
for bid in bid_list_lowCorr[-5:]:
    plot2_bid_scaled(bid, cols =['cummean','cumsum'])

### Revisit the 3 subsets with the fewest and the most instances

In [None]:
bid_list_lowCorr = list(train.loc[train['RC'] == '20_10', 'breath_id'].unique())
print('One of the two subsets with the fewest instances in the train data')
for bid in bid_list_lowCorr[-5:]:
    plot2_bid_scaled(bid, cols =['cummean','cumsum'])

In [None]:
bid_list_lowCorr = list(train.loc[train['RC'] == '20_20', 'breath_id'].unique())
print('One of the two subsets with the fewest instances in the train data')
for bid in bid_list_lowCorr[-5:]:
    plot2_bid_scaled(bid, cols =['cummean','cumsum'])

In [None]:
bid_list_highCorr = list(train.loc[train['RC'] == '50_10', 'breath_id'].unique())
print('The subset with the most instances in the train data')
for bid in bid_list_highCorr[-10:]:
    plot2_bid_scaled(bid, cols =['cummean','cumsum'])

Isn't it fun to explore challenging pressure curves that your amazing ML models will predict better and better?

I hope these many plots, labeled with `breath_id`, will help spark your wonderful new ideas and modeling!
That's what I am learning from the Kaggle community, from people like you.

In [None]:
del a

I wanted to confirm the relative feature importance of the added features in LightGBM, then select the good ones to apply to an LSTM.

However, Kaggle reported that too much memory was used during LightGBM training. Not sure why a simple correlation analysis used so much memory.
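
One possible contributor (an assumption, not verified on this notebook) is that the engineered columns are stored as 64-bit numbers and `RC` as Python strings. The sketch below shows one way to shrink a frame before training. `shrink_frame` is a hypothetical helper and the frame is synthetic.

```python
import numpy as np
import pandas as pd

def shrink_frame(df):
    """Return a copy with smaller dtypes: float32 floats, downcast ints, categorical strings."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_float_dtype(out[col]):
            out[col] = out[col].astype(np.float32)
        elif pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_string_dtype(out[col]):
            out[col] = out[col].astype('category')
    return out

# Synthetic frame loosely shaped like the engineered training data (values are made up).
n = 10_000
demo = pd.DataFrame({
    'breath_id': np.repeat(np.arange(n // 80), 80),
    'u_in': np.random.default_rng(0).uniform(0, 100, size=n),
    'RC': ['5_10', '50_50'] * (n // 2),
})

small = shrink_frame(demo)
before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f'{before / 1e6:.2f} MB -> {after / 1e6:.2f} MB')
```

Casting to float32 trades some precision for memory, so it is worth checking that the features of interest tolerate it.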

### Thank you for reading!

## References
[Ventilator Pressure Prediction: EDA, FE and models](https://www.kaggle.com/artgor/ventilator-pressure-prediction-eda-fe-and-models)

[Ventilator Pressure: EDA and simple submission](https://www.kaggle.com/carlmcbrideellis/ventilator-pressure-eda-and-simple-submission)

[EDA about time_step and u_out](https://www.kaggle.com/marutama/eda-about-time-step-and-u-out/notebook)