This notebook contains a simple data analysis and baseline model training for the Google Brain - Ventilator Pressure Prediction competition.

# Train data

In [None]:
import numpy as np
import pandas as pd

In [None]:
train_df = pd.read_csv('/kaggle/input/ventilator-pressure-prediction/train.csv')

In [None]:
train_df.head()

In [None]:
train_df.describe()

In [None]:
train_df['breath_id'].nunique()

The dataset contains ~6M rows across ~75k distinct breath_id values (80 observations per breath).
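
The "same number of rows per breath" claim is easy to check with a group size count. Below is a minimal sketch on a toy frame (the toy data is an assumption, not the competition data); on the real data the same `groupby` would be applied to `train_df`.

```python
import pandas as pd

# Toy stand-in for train_df (assumption): 3 breaths with 4 rows each
toy = pd.DataFrame({"breath_id": [1] * 4 + [2] * 4 + [3] * 4})

# Number of rows recorded for each breath
rows_per_breath = toy.groupby("breath_id").size()

# A single distinct value means every breath has the same length
distinct_lengths = rows_per_breath.unique().tolist()
print(distinct_lengths)
```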

Let's plot a few breaths to see what our time series look like.

In [None]:
for breath_id in range(1,6):
    temp_df = train_df[train_df['breath_id'] == breath_id]
    temp_df[['u_in', 'u_out', 'pressure']].plot(figsize = (20,10), 
                                                title = f"Breath id: {breath_id}, R: {temp_df['R'].values[0]}, C: {temp_df['C'].values[0]}, R*C: {temp_df['R'].values[0] * temp_df['C'].values[0]}")
    

We can see two phases: inspiratory and expiratory.
Pressure is high during the inspiratory phase and drops when the expiratory phase starts. Pressure also lags behind changes in u_in (we will create some lag features later in this notebook).

# Basic features exploration

In [None]:
train_df.pressure.hist(bins = 50, figsize = (20,10))

The target distribution looks reasonable (there is some skewness, but it doesn't look too bad). We will leave it untransformed for now.

In [None]:
train_df.R.hist(bins = 20, figsize = (20,10))

In [None]:
train_df.C.hist(bins = 20, figsize = (20,10))

R and C are numerical features, but it turns out that each has only three unique values. We could treat them as categorical features in later analysis, but I prefer to keep them numerical because their values still have physical meaning.

In [None]:
train_df.u_in.hist(bins = 50, figsize = (20,10))

# Feature generation

In [None]:
test_df = pd.read_csv('../input/ventilator-pressure-prediction/test.csv')

## Breath level features

Based on the data definition:
- R is the change in pressure per change in flow
- C is the change in volume per change in pressure

So R * C also has a physical meaning: the change in volume per change in flow. Let's use it as a feature.
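
Written out with P for pressure, Q for flow and V for volume (symbols introduced here for illustration), the pressure terms cancel in the product:

$$
R = \frac{\Delta P}{\Delta Q}, \qquad C = \frac{\Delta V}{\Delta P}
\quad\Longrightarrow\quad
R \cdot C = \frac{\Delta P}{\Delta Q} \cdot \frac{\Delta V}{\Delta P} = \frac{\Delta V}{\Delta Q}
$$

Assuming the units given in the competition data description (R in cmH2O/L/s, C in mL/cmH2O), the product has units of mL/(L/s), i.e. milliseconds, so R * C behaves like a time constant.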

In [None]:
train_df['R*C'] = train_df['R'] * train_df['C']
test_df['R*C'] = test_df['R'] * test_df['C']

Let's also compute some aggregated statistics for the u_in feature.

In [None]:
u_in_agg = train_df[train_df['u_out'] == 0][['breath_id','u_in','time_step']].groupby(['breath_id']).agg({'u_in': ['max','mean', 'std'], 'time_step': ['mean']})
u_in_agg.columns = ['u_in_max','u_in_mean', 'u_in_std','time_step_mean']
u_in_agg_out = train_df[train_df['u_out'] == 1][['breath_id','u_in','time_step']].groupby(['breath_id']).agg({'u_in': ['max','mean', 'std', lambda x: sum(i==0 for i in x)]})
u_in_agg_out.columns = ['u_in_max_out','u_in_mean_out', 'u_in_std_out','steps_before_in_again']
train_df = train_df.merge(u_in_agg, on = 'breath_id')
train_df = train_df.merge(u_in_agg_out, on = 'breath_id')

# Ratios of expiratory to inspiratory statistics (max and mean get separate columns)
train_df['max_to_max_u_i_o'] = train_df['u_in_max_out'] / train_df['u_in_max']
train_df['mean_to_mean_u_i_o'] = train_df['u_in_mean_out'] / train_df['u_in_mean']

In [None]:
u_in_agg = test_df[test_df['u_out'] == 0][['breath_id','u_in','time_step']].groupby(['breath_id']).agg({'u_in': ['max','mean', 'std'], 'time_step': ['mean']})
u_in_agg.columns = ['u_in_max','u_in_mean', 'u_in_std','time_step_mean']
u_in_agg_out = test_df[test_df['u_out'] == 1][['breath_id','u_in','time_step']].groupby(['breath_id']).agg({'u_in': ['max','mean', 'std', lambda x: sum(i==0 for i in x)]})
u_in_agg_out.columns = ['u_in_max_out','u_in_mean_out', 'u_in_std_out','steps_before_in_again']
test_df = test_df.merge(u_in_agg, on = 'breath_id')
test_df = test_df.merge(u_in_agg_out, on = 'breath_id')

# Same ratios for the test set; column names must match the train set
test_df['max_to_max_u_i_o'] = test_df['u_in_max_out'] / test_df['u_in_max']
test_df['mean_to_mean_u_i_o'] = test_df['u_in_mean_out'] / test_df['u_in_mean']

In this competition, only the inspiratory phase (u_out == 0) is scored, so we need to predict pressure for that phase only.
Let's exclude all data related to the expiratory phase (it can still be helpful, but we will ignore it for simplicity in this notebook).

In [None]:
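# .copy() gives independent frames, so adding columns later does not trigger SettingWithCopyWarning
train_df = train_df[train_df['u_out'] == 0].copy()
test_df = test_df[test_df['u_out'] == 0].copy()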
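
Since only rows with u_out == 0 count towards the score, predictions for the remaining rows do not affect it. The sketch below illustrates this with a masked mean absolute error on made-up numbers. It assumes the competition metric is MAE over inspiratory rows, and `inspiratory_mae` is a hypothetical helper, not part of any library.

```python
import numpy as np

def inspiratory_mae(y_true, y_pred, u_out):
    """Mean absolute error over rows where u_out == 0 (hypothetical helper)."""
    mask = np.asarray(u_out) == 0
    abs_err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return abs_err[mask].mean()

# Made-up values: the last two rows are expiratory and are ignored by the mask
score = inspiratory_mae(
    y_true=[10.0, 12.0, 6.0, 5.0],
    y_pred=[11.0, 10.0, 0.0, 0.0],
    u_out=[0, 0, 1, 1],
)
print(score)
```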

## Lagging features

As mentioned before, pressure lags behind changes in u_in. Let's use this observation and create some u_in lag features.

In [None]:
for df in (train_df, test_df):
    # 1-based position of each row within its breath
    df['num_in_breath'] = df.groupby('breath_id').cumcount() + 1

    for lag in range(1, 5):
        # Shift within each breath so values never leak across breaths;
        # the first `lag` rows of a breath have no history and are set to 0.
        # This also avoids the chained assignment df[col][mask] = 0.
        df[f'u_in_lag_{lag}'] = df.groupby('breath_id')['u_in'].shift(lag).fillna(0)
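
To make the boundary handling concrete, here is a small worked example on a toy frame with two breaths (values are made up). It uses a per-breath `groupby(...).shift(...)`, which on rows ordered by breath should give the same zero-padded lags as the code above.

```python
import pandas as pd

# Toy data (assumption): two breaths with three time steps each
toy = pd.DataFrame({
    "breath_id": [1, 1, 1, 2, 2, 2],
    "u_in": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],
})

for lag in (1, 2):
    # Lag within each breath; rows without enough history get 0
    toy[f"u_in_lag_{lag}"] = toy.groupby("breath_id")["u_in"].shift(lag).fillna(0)

print(toy)
```

Note that the first row of breath 2 gets a lag of 0 rather than the last u_in of breath 1.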

It also seems that pressure depends on some cumulative features, so let's compute them as well.

In [None]:
for df in (train_df, test_df):
    # Running mean of u_in within each breath (running sum / position in breath)
    df['mean_u_in_cum'] = df.groupby('breath_id')['u_in'].cumsum() / df['num_in_breath']
    # Running mean times elapsed time: a rough proxy for the time integral of u_in
    df['u_in_cum'] = df['mean_u_in_cum'] * df['time_step']
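
As a quick check of the running mean, the sketch below computes it on a toy frame (made-up values) as a per-breath cumulative sum divided by the row's position in the breath.

```python
import pandas as pd

# Toy data (assumption): breath 1 has three rows, breath 2 has two
toy = pd.DataFrame({
    "breath_id": [1, 1, 1, 2, 2],
    "u_in": [2.0, 4.0, 6.0, 10.0, 20.0],
})

grouped = toy.groupby("breath_id")["u_in"]
toy["num_in_breath"] = grouped.cumcount() + 1          # 1-based position in breath
toy["mean_u_in_cum"] = grouped.cumsum() / toy["num_in_breath"]

print(toy)
```

The running mean restarts at 10.0 for the first row of breath 2 instead of continuing from breath 1.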

In [None]:
test_df.head()

In [None]:
train_df.head()

## Delta features

We can enrich our feature set by adding deltas between u_in and the features derived from it.

In [None]:
for df in (train_df, test_df):
    df['u_in_last_step_change'] = df['u_in'] - df['u_in_lag_1']
    df['u_in_delta_mean_cum'] = df['u_in'] - df['mean_u_in_cum']
    df['u_in_delta_mean'] = df['u_in'] - df['u_in_mean']
    df['u_in_delta_max'] = df['u_in'] - df['u_in_max']

# Correlation analysis 

We have generated plenty of features.
Let's visualize the relationships between them on a small random subset of the data (0.01%).

In [None]:
from seaborn import pairplot 
sample_df = train_df.sample(frac = 0.0001)
pairplot(sample_df[['pressure','R', 'C', 'R*C','u_in', 'u_in_lag_4', 'u_in_max', 'u_in_mean', 'mean_u_in_cum', 
                     'u_in_cum', 'u_in_last_step_change', 'u_in_delta_mean_cum', 'u_in_delta_mean','u_in_delta_max', 'u_in_mean_out']])

We can see some correlation between the features and the target, which suggests we are going in the right direction. Let's train a baseline model.

# Model training

I will train a simple LGBMRegressor without any hyperparameter tuning.

In [None]:
from lightgbm import LGBMRegressor
model = LGBMRegressor(
    objective = 'regression_l1',
    n_jobs= -1,
    n_estimators = 20000,
    learning_rate = 0.3
)

It is important to use GroupKFold because we don't want rows related to the same breath_id in both train and valid datasets.
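
The grouping guarantee can be checked directly. Below is a minimal sketch on toy group labels (an assumption standing in for breath_id) that counts how many groups appear on both sides of each split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy setup (assumption): 6 "breaths" with 4 rows each
groups = np.repeat(np.arange(6), 4)
X = np.zeros((len(groups), 1))

leaks = []
for train_idx, valid_idx in GroupKFold(n_splits=3).split(X, groups=groups):
    # Groups present in both the training and validation part of this fold
    shared = set(groups[train_idx]) & set(groups[valid_idx])
    leaks.append(len(shared))

print(leaks)
```

Every entry is 0, i.e. no breath is split between the training and validation parts of a fold.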

In [None]:
from sklearn.model_selection import GroupKFold
kf = GroupKFold(n_splits=7)

In [None]:
train_df.columns

In [None]:
features = [c for c in train_df.columns if c not in ('id', 'breath_id', 'pressure', 'predictions')]
target = 'pressure'

In [None]:
x = train_df[features]
y = train_df[target]

x_test = test_df[features]

# Initialize as float so predictions are stored without an int-to-float upcast
train_df['predictions'] = 0.0
test_df['predictions'] = 0.0

In [None]:
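from lightgbm import early_stopping, log_evaluation
from sklearn.base import clone

models = []
oof_pred = np.zeros(len(train_df))   # out-of-fold predictions for the CV score
test_pred = np.zeros(len(test_df))   # sum of per-fold test predictions
for train_index, valid_index in kf.split(x, groups = train_df['breath_id']):
    # Clone so each fold keeps its own fitted model instead of refitting one shared object
    fold_model = clone(model)
    # Callbacks (LightGBM >= 3.3) replace the verbose / early_stopping_rounds fit arguments removed in 4.0
    fold_model.fit(x.iloc[train_index], y.iloc[train_index],
                   eval_set=[(x.iloc[valid_index], y.iloc[valid_index])],
                   callbacks=[early_stopping(100), log_evaluation(1000)])
    models.append(fold_model)
    oof_pred[valid_index] = fold_model.predict(x.iloc[valid_index])
    test_pred += fold_model.predict(x_test)
train_df['predictions'] = oof_pred
test_df['predictions'] = test_pred / len(models)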

In [None]:
print(f"CV MAE: {(train_df['predictions'] - y).abs().mean()}")

# Submission

In [None]:
submission_df = pd.read_csv('../input/ventilator-pressure-prediction/sample_submission.csv')

In [None]:
# Left join keeps every id from the sample submission, in its original order
submission_df = submission_df.merge(test_df[['id','predictions']], on = 'id', how = 'left')

In [None]:
# Expiratory rows (u_out == 1) were dropped earlier and have no prediction; they are not scored, so fill with 0
submission_df['pressure'] = submission_df['predictions'].fillna(0)

In [None]:
submission_df[['id','pressure']].to_csv('submission.csv', index = False)

In [None]:
submission_df