# [Ventilator Pressure Prediction](https://www.kaggle.com/c/ventilator-pressure-prediction): EDA and a simple submission

### Summary
In this competition we are provided with 75,450 non-contiguous cycles (each labelled with a `breath_id`) of the [PVP1 automated ventilator](https://www.peoplesvent.org/en/latest/) connected to a high-grade test lung ([Quicklung, Ingmar Medical](https://www.ingmarmed.com/product/quicklung/)). Three values of the compliance (C), [10, 20, 50] mL/cmH<sub>2</sub>O, were tested in conjunction with three values of the resistance (R), [5, 20, 50] cmH<sub>2</sub>O/L/s, resulting in a total of 9 different lung settings.

A typical breath cycle looks like this:

![](https://raw.githubusercontent.com/Carl-McBride-Ellis/images_for_kaggle/main/PVP1_typical_cycle.png)

A cycle lasts for up to 3 seconds. It is the inspiratory section (from 0-1 seconds) that we model in this competition.

### Read in the data

In [None]:
import numpy  as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 18})
plt.style.use('fivethirtyeight')
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
train_data = pd.read_csv('../input/ventilator-pressure-prediction/train.csv',index_col=0)
test_data  = pd.read_csv('../input/ventilator-pressure-prediction/test.csv', index_col=0)
sample     = pd.read_csv('../input/ventilator-pressure-prediction/sample_submission.csv')

Let us take a quick look at the training data

In [None]:
train_data

How many unique values do we have for each feature?

In [None]:
train_data.nunique().to_frame()

We can see that we have just over 6 million rows of data, corresponding to 75,450 breaths, which gives an average of 80 time steps per breath. Let us check that every breath in the training data has exactly this many steps:

In [None]:
train_data.groupby("breath_id")["time_step"].count().unique().item()

and the test data

In [None]:
test_data.groupby("breath_id")["time_step"].count().unique().item()   
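The one-liners above raise a rather cryptic error if the breaths do not all have the same length. As a sketch only, the same check can be wrapped in a small helper with a clearer message; the name `steps_per_breath` and the toy values below are made up for this illustration.

```python
import pandas as pd

def steps_per_breath(df: pd.DataFrame) -> int:
    """Return the common number of rows per breath_id, or raise if breaths differ."""
    counts = df.groupby("breath_id")["time_step"].count()
    unique_counts = counts.unique()
    if len(unique_counts) != 1:
        raise ValueError(f"breaths have differing lengths: {sorted(unique_counts)}")
    return int(unique_counts[0])

# toy frame (made-up values): two breaths with three time steps each
toy = pd.DataFrame({
    "breath_id": [1, 1, 1, 2, 2, 2],
    "time_step": [0.00, 0.03, 0.07, 0.00, 0.03, 0.07],
})
n_steps = steps_per_breath(toy)
```

On the competition data one would call `steps_per_breath(train_data)` and `steps_per_breath(test_data)` instead of using the toy frame.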

The next question is whether we have any missing data:

In [None]:
train_data.isnull().sum(axis = 0).to_frame()

Wonderful, it seems not!

# Time
In this data the unit of time is seconds. How long does the longest breath last?

In [None]:
train_data.time_step.max()

The longest breath is just under 3 seconds.

What is the latest time at which the expiratory solenoid valve (`u_out`) is still set to 0?

In [None]:
train_data.query('u_out == 0').time_step.max()

The valve appears to open after about 1 second.

# The first breath

Let us select `breath_id=1` and take a look at the features

In [None]:
breath_one = train_data.query('breath_id == 1').reset_index(drop = True)
breath_one

Let us see how many unique values there are in each of these columns

In [None]:
breath_one.nunique().to_frame()

There is only one value of `R` and one value of `C` for this `breath_id`.

Let us visualize `u_in`, `u_out` and `pressure` with respect to the `time_step`:

In [None]:
breath_one.plot(x="time_step", y="u_in", kind='line',figsize=(12,3), lw=2, title="u_in");
breath_one.plot(x="time_step", y="u_out", kind='line',figsize=(12,3), lw=2, title="u_out");
breath_one.plot(x="time_step", y="pressure", kind='line',figsize=(12,3), lw=2, title="pressure");

# All breaths
What values do we have for `R`, which represents how restricted the airway is (in cmH<sub>2</sub>O/L/s)?

In [None]:
train_data.R.value_counts().to_frame()

Now for the values of `C`, the lung attribute indicating how compliant the lung is (in mL/cmH<sub>2</sub>O):

In [None]:
train_data.C.value_counts().to_frame()

And for `u_out`, the control input for the expiratory solenoid valve, which is either 0 or 1:

In [None]:
train_data.u_out.value_counts().to_frame()

# Pressure
And now we shall look at the `pressure`. The pressure is measured in cmH<sub>2</sub>O, where 1 cmH<sub>2</sub>O is roughly equal to 98 pascals. The maximum value of the pressure is

In [None]:
train_data.pressure.max()

which corresponds to around 6,350 Pa.
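The conversion can be sketched as a one-line helper. The factor 98.0665 Pa per cmH<sub>2</sub>O is the conventional value; the function name `cmh2o_to_pa` and the 10 cmH<sub>2</sub>O input are made up for this illustration.

```python
CMH2O_TO_PA = 98.0665  # 1 cmH2O expressed in pascals (conventional value)

def cmh2o_to_pa(p_cmh2o: float) -> float:
    """Convert a pressure from cmH2O to pascals."""
    return p_cmh2o * CMH2O_TO_PA

# 10 cmH2O is an arbitrary illustrative value, not taken from the data
example_pa = cmh2o_to_pa(10.0)
```

Applied to the data, `cmh2o_to_pa(train_data.pressure.max())` would give the maximum pressure in pascals.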

The pressures in the training data have the following distribution

In [None]:
plt.figure(figsize = (12,5))
# sns.distplot is deprecated; sns.histplot draws the same count histogram
ax = sns.histplot(train_data['pressure'],
                  bins=120,
                  binrange=(0,40),
                  color='darkcyan')
# colour each bar according to its height
values = np.array([rec.get_height() for rec in ax.patches])
norm = plt.Normalize(values.min(), values.max())
colors = plt.cm.jet(norm(values))
for rec, col in zip(ax.patches, colors):
    rec.set_color(col)
plt.xlabel("Histogram of pressures", size=14)
ax.set(yticklabels=[])
plt.show();

with a median value of 

In [None]:
train_data.pressure.median()

Note however that in this competition the expiratory phase is not scored, so for practical purposes we are only really interested in the pressure for `u_out=0`, *i.e.* the first second of the experiments:

In [None]:
# keep only the scored (inspiratory) rows
u_out_is_zero = train_data.query("u_out == 0").reset_index(drop = True)
plt.figure(figsize = (12,5))
# sns.distplot is deprecated; sns.histplot draws the same count histogram
ax = sns.histplot(u_out_is_zero['pressure'],
                  bins=120,
                  binrange=(0,50),
                  color='darkcyan')
# colour each bar according to its height
values = np.array([rec.get_height() for rec in ax.patches])
norm = plt.Normalize(values.min(), values.max())
colors = plt.cm.jet(norm(values))
for rec, col in zip(ax.patches, colors):
    rec.set_color(col)
plt.xlabel("Histogram of pressures (u_out=0)", size=14)
ax.set(yticklabels=[])
plt.show();

with a median value of 

In [None]:
u_out_is_zero.pressure.median()
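Since only the rows with `u_out == 0` are scored, it can be handy to evaluate predictions in the same way. Below is a minimal sketch of such a masked mean absolute error. It assumes the competition metric is the MAE over the inspiratory phase; `inspiratory_mae` is a hypothetical helper, not part of any library, and the numbers are made up.

```python
import numpy as np

def inspiratory_mae(y_true, y_pred, u_out) -> float:
    """Mean absolute error computed only over rows where u_out == 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = np.asarray(u_out) == 0
    return float(np.mean(np.abs(y_true[mask] - y_pred[mask])))

# toy example: the last two rows belong to the expiratory phase and are ignored
toy_score = inspiratory_mae(
    y_true=[10.0, 20.0, 6.0, 6.0],
    y_pred=[11.0, 18.0, 0.0, 0.0],
    u_out=[0, 0, 1, 1],
)
```

The large errors on the last two rows do not contribute, which is why a model can afford to be poor during expiration.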

We have nine combinations of experiments; `C` can be 10, 20 or 50, and `R` can be 5, 20 or 50. Let's take a quick look at an example of each:

In [None]:
# One example breath for each lung setting, as (R, C, breath_id), in grid order
# (rows: C = 10, 20, 50; columns: R = 5, 20, 50)
examples = [[(5, 10, 39), (20, 10, 21), (50, 10, 18)],
            [(5, 20, 17), (20, 20, 2),  (50, 20, 3)],
            [(5, 50, 5),  (20, 50, 1),  (50, 50, 4)]]

fig, axes = plt.subplots(3,3,figsize=(15,15))
for i, row in enumerate(examples):
    for j, (R, C, breath_id) in enumerate(row):
        breath = train_data[train_data['breath_id'] == breath_id]
        sns.lineplot(data=breath, x="time_step", y="pressure", lw=2, ax=axes[i,j])
        axes[i,j].set_title(f"R={R}, C={C}", fontsize=18)
        if i < 2:
            axes[i,j].set(xlabel='')   # x label on the bottom row only
        if j > 0:
            axes[i,j].set(ylabel='')   # y label on the left column only

plt.show();

# Positive end-expiratory pressure (PEEP)

It is worth noting that even before the experiments start (*i.e.* at `time_step=0` with `u_in=0`) there is a positive pressure in the airway. The system is maintained above atmospheric pressure to promote gas exchange in the lungs.

In [None]:
# rows at the very start of a breath with the inspiratory valve closed
zero_time = train_data.query("time_step < 0.000001 & u_in < 0.000001").reset_index(drop = True)

# one violin plot of the initial pressure per (R, C) lung setting
settings = [(R, C) for R in (5, 20, 50) for C in (10, 20, 50)]
colors = ["indianred", "firebrick", "darkred",
          "greenyellow", "olivedrab", "olive",
          "steelblue", "cornflowerblue", "midnightblue"]

fig, axes = plt.subplots(9,1,figsize=(12,15))
for ax, (R, C), color in zip(axes, settings, colors):
    subset = zero_time[(zero_time["R"] == R) & (zero_time["C"] == C)]
    sns.violinplot(x=subset["pressure"], linewidth=2, ax=ax, color=color)
    ax.set_title(f"R={R}, C={C}", fontsize=14)
    ax.set(xlim=(3, 8))

The average value of PEEP at the beginning of each cycle is

In [None]:
zero_time["pressure"].mean()
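The violin plots compare the nine lung settings visually; the same comparison can be made numerically with a grouped aggregation. This is only a sketch on a made-up toy frame (`toy_zero_time` and its values are not from the competition data).

```python
import pandas as pd

# toy stand-in for zero_time, with made-up starting pressures
toy_zero_time = pd.DataFrame({
    "R": [5, 5, 50, 50],
    "C": [10, 10, 10, 10],
    "pressure": [6.0, 6.4, 5.0, 5.4],
})

# mean, spread and number of breaths for each (R, C) setting
peep_by_setting = (toy_zero_time.groupby(["R", "C"])["pressure"]
                   .agg(["mean", "std", "count"]))
```

Replacing `toy_zero_time` with `zero_time` would give one row per lung setting.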

Note that not all cycles start with `u_in=0`, and a cycle can even start with the inspiratory solenoid valve set to the maximum value of 100.

# Negative pressure
The minimum value for the pressure where `u_in=0` at `time_step=0` is

In [None]:
zero_time[zero_time['pressure']==zero_time['pressure'].min()]

Both of these breaths look somewhat unusual:

In [None]:
# plot u_in and pressure for the two breaths found above
for breath_id in (542, 119582):
    breath = train_data.query('breath_id == @breath_id').reset_index(drop = True)
    fig, ax = plt.subplots(1, 1, figsize=(12, 4))
    ax.plot(breath["time_step"], breath["u_in"], lw=2, label='u_in')
    ax.plot(breath["time_step"], breath["pressure"], lw=2, label='pressure')
    ax.legend(loc="upper right")
    ax.set_xlabel("time_step", fontsize=14)   # the x axis is time_step, in seconds
    ax.set_title(f"breath_id = {breath_id}", fontsize=14)
    plt.show();

Note that negative pressures occur only in the `R=50` (high restriction), `C=10` (thick latex) systems.

# Simple feature engineering
We shall add a new feature, the [cumulative sum](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cumsum.html) of `u_in` within each breath:

In [None]:
train_data['u_in_cumsum'] = (train_data['u_in']).groupby(train_data['breath_id']).cumsum()
test_data['u_in_cumsum']  = (test_data['u_in']).groupby(test_data['breath_id']).cumsum()

The thinking behind this feature is that it is reasonable to assume the pressure in the lungs is approximately proportional to how much air has actually been pumped into them. It goes almost without saying that this feature is not useful when breathing out, but given that the expiratory phase is not scored in this competition this should not be too much of a problem.
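A plain running total treats every time step as equally long. As a sketch of one possible refinement, each `u_in` value can be weighted by the time elapsed since the previous row, which approximates the integral of `u_in` over time. The feature name `u_in_area` is hypothetical and not part of this notebook, and the toy values are made up.

```python
import pandas as pd

# toy frame (made-up values): two breaths with different time spacing
toy = pd.DataFrame({
    "breath_id": [1, 1, 1, 2, 2, 2],
    "time_step": [0.0, 0.5, 1.0, 0.0, 0.25, 0.5],
    "u_in":      [2.0, 4.0, 6.0, 10.0, 10.0, 10.0],
})

# plain running total within each breath, as in the notebook
toy["u_in_cumsum"] = toy.groupby("breath_id")["u_in"].cumsum()

# hypothetical time-weighted variant: running sum of u_in * dt within each breath
dt = toy.groupby("breath_id")["time_step"].diff().fillna(0.0)
toy["u_in_area"] = (toy["u_in"] * dt).groupby(toy["breath_id"]).cumsum()
```

Whether the weighted version helps a given model is an empirical question; it is shown here only to make the "amount of air pumped in" idea concrete.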

### Shifting `u_in`
Let us take a look at the first second of `breath_id=928`, which is an excellent example of an oscillatory experiment

In [None]:
breath_928 = train_data.query('breath_id == 928').reset_index(drop = True)
fig, ax = plt.subplots(1, 1, figsize=(9, 5))
ax.plot(breath_928["time_step"],breath_928["u_in"], lw=2, label='u_in')
ax.plot(breath_928["time_step"],breath_928["pressure"], lw=2, label='pressure')
ax.set(xlim=(0,1))
ax.legend(loc="upper right")
ax.set_xlabel("time_step", fontsize=14)
plt.show();

A lag of around 0.1 seconds can be observed between `u_in` and the resulting `pressure`. I am sure it is with this in mind that [Chun Fu](https://www.kaggle.com/patrick0302) wrote his excellent notebook ["Add lag u_in as new feat"](https://www.kaggle.com/patrick0302/add-lag-u-in-as-new-feat/notebook), which introduces a new *shifted* `u_in` feature. Here we shall use a shift of 2 rather than his original shift of 1, which is more in line with the delay seen:

In [None]:
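# shift u_in by two rows within each breath; the two leading NaNs of every breath
# are back-filled (.bfill() replaces the deprecated fillna(method="backfill"))
train_data['u_in_shifted'] = train_data.groupby('breath_id')['u_in'].shift(2).bfill()
test_data['u_in_shifted']  = test_data.groupby('breath_id')['u_in'].shift(2).bfill()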
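Rather than reading the lag off a plot, one could estimate it by checking which shift makes `u_in` correlate best with `pressure`. The following is a rough sketch on a synthetic signal, not an analysis of the competition data; `best_lag` is a hypothetical helper and the sine wave with a built-in delay of 3 steps is made up.

```python
import numpy as np

def best_lag(u_in, pressure, max_lag=5) -> int:
    """Return the shift (in time steps) at which u_in correlates best with pressure."""
    u_in = np.asarray(u_in, dtype=float)
    pressure = np.asarray(pressure, dtype=float)
    scores = []
    for lag in range(max_lag + 1):
        # compare u_in at step t - lag with pressure at step t
        x = u_in[: len(u_in) - lag]
        y = pressure[lag:]
        scores.append(np.corrcoef(x, y)[0, 1])
    return int(np.argmax(scores))

# synthetic signal: the "pressure" is simply u_in delayed by exactly 3 steps
t = np.arange(80)
u = np.sin(2 * np.pi * t / 20.0)
p = np.roll(u, 3)
p[:3] = u[0]          # overwrite the values that wrapped around
lag = best_lag(u, p)
```

On real breaths the relationship is not a pure delay, so the result should be treated as a rough guide for choosing the shift.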

Again inspired by the work of Chun Fu, this time his notebook "Add last u_in as new feat": at least with gradient-boosting models, providing the estimator with some descriptive statistics of `u_in` for the cycle in question seems to improve the model. Here are a number of examples, some of which may (or may not) be useful:

In [None]:
for df in (train_data, test_data):
    df['u_in_first']  = df.groupby('breath_id')['u_in'].transform('first')
    df['u_in_mean']   = df.groupby('breath_id')['u_in'].transform('mean')
    df['u_in_median'] = df.groupby('breath_id')['u_in'].transform('median')
    df['u_in_last']   = df.groupby('breath_id')['u_in'].transform('last')
    df['u_in_min']    = df.groupby('breath_id')['u_in'].transform('min')
    df['u_in_max']    = df.groupby('breath_id')['u_in'].transform('max')
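The key point of `transform` is that the per-breath statistic is broadcast back onto every row of that breath, so the frame keeps its original length. A small sketch on made-up values:

```python
import pandas as pd

# toy frame (made-up values): breath 1 has three rows, breath 2 has two
toy = pd.DataFrame({
    "breath_id": [1, 1, 1, 2, 2],
    "u_in":      [0.0, 5.0, 1.0, 3.0, 7.0],
})

grouped = toy.groupby("breath_id")["u_in"]
toy["u_in_max"]  = grouped.transform("max")    # same value on every row of a breath
toy["u_in_last"] = grouped.transform("last")   # final u_in of the breath, repeated
```

Because each statistic is constant within a breath, these columns describe the breath as a whole rather than the individual time step.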

# A simple submission

In [None]:
X_train = train_data.drop(['pressure'], axis=1)
y_train = train_data['pressure']
# scikit-learn >= 1.0: the estimator is no longer experimental, and the L1 loss
# is called "absolute_error" (formerly "least_absolute_deviation")
from sklearn.ensemble import HistGradientBoostingRegressor
regressor  =  HistGradientBoostingRegressor(loss="absolute_error",max_iter=300)
regressor.fit(X_train, y_train)
# assumes the rows of test_data are in the same order as sample_submission.csv
sample["pressure"] = regressor.predict(test_data)
sample.to_csv('submission.csv',index=False)
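The submission above is fitted on all of the training data, so it gives no local estimate of the error. A common way to obtain one is to hold out whole breaths, so that rows from the same `breath_id` never appear on both sides of the split. The sketch below shows one way to do this with NumPy and pandas; `split_by_breath`, its parameters and the toy frame are hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd

def split_by_breath(df: pd.DataFrame, valid_fraction: float = 0.2, seed: int = 0):
    """Split df into (train, valid) so that each breath_id lands in exactly one part."""
    rng = np.random.default_rng(seed)
    breath_ids = df["breath_id"].unique()
    n_valid = max(1, int(round(valid_fraction * len(breath_ids))))
    valid_ids = rng.choice(breath_ids, size=n_valid, replace=False)
    is_valid = df["breath_id"].isin(valid_ids)
    return df[~is_valid], df[is_valid]

# toy frame (made-up values): 10 breaths with 4 rows each
toy = pd.DataFrame({
    "breath_id": np.repeat(np.arange(10), 4),
    "u_in": np.arange(40, dtype=float),
})
train_part, valid_part = split_by_breath(toy)
```

One would then fit the regressor on the training part and score its predictions on the validation part, ideally only on the rows with `u_out == 0`.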

# Related reading
* [The People's Ventilator Project](https://www.peoplesvent.org/en/latest/)
* [Julienne LaChance, Tom J. Zajdel, Manuel Schottdorf, Jonny L. Saunders, Sophie Dvali, Chase Marshall, Lorenzo Seirup, Daniel A. Notterman, and Daniel J. Cohen "*PVP1–The People’s Ventilator Project: A fully open, low-cost, pressure-controlled ventilator*", medRxiv doi:10.1101/2020.10.02.20206037 October 5 (2020)](https://www.medrxiv.org/content/10.1101/2020.10.02.20206037v1.full.pdf)
* [QuickLung test lung](https://www.ingmarmed.com/product/quicklung/)