In [None]:
%%html
<style>
table {float:left}
</style>

# Data Description

<span style="font-size:1.2em;">The ventilator data used in this competition was produced using a modified open-source ventilator connected to an artificial bellows test lung via a respiratory circuit. The diagram below illustrates the setup, with the two control inputs highlighted in green and the state variable (airway pressure) to predict in blue. The first control input is a continuous variable from 0 to 100 representing the percentage the inspiratory solenoid valve is open to let air into the lung (i.e., 0 is completely closed and no air is let in and 100 is completely open). The second control input is a binary variable representing whether the expiratory valve is open (1) or closed (0) to let air out.</span>


<img src="https://raw.githubusercontent.com/google/deluca-lung/main/assets/2020-10-02%20Ventilator%20diagram.svg" width="1000" />

<span style="font-size:1.2em;">Each time series represents an approximately 3-second breath. The files are organized such that each row is a time step in a breath and gives the two control signals, the resulting airway pressure, and relevant attributes of the lung, described below.</span>

## Files

| Name                  | Description                                    |
| ----------------------| ----------------------------                   |
| train.csv             | the training set                               |
| test.csv              | the test set                                   |
| sample_submission.csv | a sample submission file in the correct format |

## Columns

| Name | Description |
|----|------------------------------------------------------------|
| id | globally-unique time step identifier across an entire file |
| breath_id | globally-unique identifier for breaths |
| R | lung attribute indicating how restricted the airway is (in cmH2O/L/s). Physically, this is the change in pressure per change in flow (air volume per time). Intuitively, one can imagine blowing up a balloon through a straw. We can change R by changing the diameter of the straw, with higher R being harder to blow. |
| C | lung attribute indicating how compliant the lung is (in mL/cmH2O). Physically, this is the change in volume per change in pressure. Intuitively, one can imagine the same balloon example. We can change C by changing the thickness of the balloon’s latex, with higher C having thinner latex and easier to blow. |
| time_step | the actual time stamp. |
| u_in | the control input for the inspiratory solenoid valve. Ranges from 0 to 100. |
| u_out | the control input for the expiratory solenoid valve. Either 0 or 1. |
| pressure | the airway pressure measured in the respiratory circuit, measured in cmH2O. |
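
Written as formulas, the two lung attributes described in the table read as follows. The symbols are introduced here only for illustration and are not column names: $P$ is the airway pressure, $V$ the air volume, and $Q = \Delta V / \Delta t$ the flow.

$$
R = \frac{\Delta P}{\Delta Q}, \qquad C = \frac{\Delta V}{\Delta P}
$$

So a higher $R$ means more pressure is needed for the same flow, and a higher $C$ means more volume is gained for the same pressure.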

# Load libraries and data

In [None]:
#algebra
import pandas as pd
pd.options.display.float_format = '{:,.5f}'.format
#I want to see all features from the dataset given. But be careful, sometimes the output can be too large!
pd.options.display.max_rows = None 
import numpy as np

#data preprocessing
from sklearn.preprocessing import RobustScaler

#models
#from catboost import CatBoostRegressor, Pool, metrics, cv
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

#Visual
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib import ticker as tkr
#import plotly.express as px

In [None]:
train = pd.read_csv('../input/ventilator-pressure-prediction/train.csv')
test = pd.read_csv('../input/ventilator-pressure-prediction/test.csv')

# First look at data

In [None]:
train.shape

In [None]:
train.describe()

In [None]:
train.head(15)

### What can the data tell me at this stage?

* There are more than 6 million rows, but no more than 125,749 breath ids, so every breath spans many rows. I suppose a breath is the smallest unit to learn from, and a single breath should not be split any further
* `time_step` increases within breath_id = 1, which supports this hypothesis
* Each time series represents an approximately 3-second breath (see the data description). I would like to look at the first one to better understand the big picture
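
The hypothesis that a breath is one ordered sequence of rows can also be checked directly. The snippet below is a minimal sketch on a tiny synthetic frame (the `toy` data is made up and only assumes the same column names as `train`); the same two `groupby` calls should work on the real data.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for train.csv (assumption: same column names, made-up values).
toy = pd.DataFrame({
    'breath_id': np.repeat([1, 2, 3], 4),
    'time_step': np.tile([0.0, 0.03, 0.07, 0.10], 3),
})

# Number of rows recorded for each breath.
rows_per_breath = toy.groupby('breath_id').size()

# True for a breath when time_step never decreases from one row to the next.
is_monotonic = toy.groupby('breath_id')['time_step'].apply(
    lambda s: s.is_monotonic_increasing
)

print(rows_per_breath.describe())
print('all breaths ordered in time:', bool(is_monotonic.all()))
```

If `rows_per_breath` has a single distinct value and every breath is ordered in time, each breath can be treated as one fixed-length sequence.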

### Some tables and plots that, I think, will help me to move forward in my investigation

#### Distribution of breath times (maybe there are some outliers)

In [None]:
train_breath_times = train.groupby('breath_id')['time_step'].max().to_frame().reset_index()
test_breath_times = test.groupby('breath_id')['time_step'].max().to_frame().reset_index()
fig, ax = plt.subplots(figsize=(12, 6))
# Pass the column as x and set color directly, so that label and color apply to each curve
sns.kdeplot(x = train_breath_times['time_step'], ax = ax, label = 'train', color = 'red')
sns.kdeplot(x = test_breath_times['time_step'], ax = ax, label = 'test', color = 'blue')
ax.set_title('Max breath time distribution')
ax.legend(bbox_to_anchor = (1.02, 1.02), loc = 'upper left')
plt.grid()
plt.show()

#### Not even close to Gaussian, but the train and test distributions are similar. I asked about this behavior in the competition discussion ([https://www.kaggle.com/c/ventilator-pressure-prediction/discussion/273855#1521767](https://www.kaggle.com/c/ventilator-pressure-prediction/discussion/273855#1521767))
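
The visual comparison can be complemented with a few numbers. The sketch below compares quantiles of the maximum breath time between two synthetic frames; `toy_breaths` is a hypothetical helper and the data is random, so only the pattern (group by `breath_id`, take the max, compare quantiles) carries over to `train` and `test`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def toy_breaths(n_breaths, first_id):
    """Hypothetical helper: breaths of 5 rows whose last time_step lies in [2.5, 3.0)."""
    ends = rng.uniform(2.5, 3.0, size=n_breaths)
    frames = [
        pd.DataFrame({'breath_id': first_id + i,
                      'time_step': np.linspace(0.0, end, 5)})
        for i, end in enumerate(ends)
    ]
    return pd.concat(frames, ignore_index=True)

toy_train = toy_breaths(200, first_id=0)
toy_test = toy_breaths(200, first_id=1000)

# Quantiles of the per-breath maximum time step, side by side.
quantiles = [0.01, 0.25, 0.5, 0.75, 0.99]
summary = pd.DataFrame({
    'train': toy_train.groupby('breath_id')['time_step'].max().quantile(quantiles),
    'test': toy_test.groupby('breath_id')['time_step'].max().quantile(quantiles),
})
summary['abs_diff'] = (summary['train'] - summary['test']).abs()
print(summary)
```

Small values in `abs_diff` would support the impression that the two distributions are similar.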

#### Missing values

In [None]:
train.isna().sum()

In [None]:
test.isna().sum()

#### Some more aggregations

In [None]:
train.groupby(['R', 'C'])['time_step'].agg(['mean', 'max', 'count']).reset_index()

In [None]:
test.groupby(['R', 'C'])['time_step'].agg(['mean', 'max', 'count']).reset_index()

### Look at breath №1

In [None]:
breath_1 = train[train['breath_id'] == 1]
fig, ax1 = plt.subplots(figsize = (12, 6))
sns.lineplot(data = breath_1, x = 'time_step', y = 'u_in', label = 'u_in', ax = ax1)
sns.lineplot(data = breath_1, x = 'time_step', y = 'pressure', label = 'pressure', ax = ax1)
# u_out is binary, so it gets its own y-axis on the right
ax2 = ax1.twinx()
sns.lineplot(data = breath_1, x = 'time_step', y = 'u_out', label = 'u_out', color = 'red', ax = ax2)
ax1.set_title('Breath one, R={} and C={}'.format(breath_1['R'].max(), breath_1['C'].max()))
ax1.set_xlabel('Timestep')

lines_1, labels_1 = ax1.get_legend_handles_labels()
lines_2, labels_2 = ax2.get_legend_handles_labels()

lines = lines_1 + lines_2
labels = labels_1 + labels_2

ax1.legend().remove()
plt.legend(lines, labels, loc = (1.1, 0.8))

plt.grid()
plt.show()

#### This plot shows how the artificial lung and the pump behave during one individual breath.

### Distribution of R and C

In [None]:
ax = sns.catplot(data = train.groupby(['R', 'C'])['time_step'].agg(['count']).reset_index(),
                 kind = 'bar',
                 x = 'R',
                 y = 'count',
                 hue = 'C', legend = True, height = 5, aspect = 2)
ax.fig.suptitle('C/R distribution')
for g in ax.axes.flat:
    g.yaxis.set_major_formatter(tkr.FuncFormatter(lambda y, p: f'{y:.0f}'))
plt.show()

In [None]:
ax = sns.catplot(data = train.groupby(['R', 'C'])['time_step'].agg(['count']).reset_index(),
                 kind = 'bar',
                 x = 'C',
                 y = 'count',
                 hue = 'R', legend = True, height = 5, aspect = 2)
ax.fig.suptitle('R/C distribution')
for g in ax.axes.flat:
    g.yaxis.set_major_formatter(tkr.FuncFormatter(lambda y, p: f'{y:.0f}'))
plt.show()

### I want to see what an average breath looks like for different combinations of R and C.

#### Some steps to take before making the plots:
* Add a column with the rounded time step (otherwise the aggregation will take too much time)
* Make a separate dataset for each combination of R and C (this step can be skipped by aggregating the data inside the lineplot parameters, but I suppose it can be useful later)
* Plot and analyze the result
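
The steps above can also be sketched more compactly. The example below is one possible variant on made-up data (the `toy` frame and its pressure formula are assumptions for illustration only): it keeps the subsets in a dictionary keyed by the (R, C) pair and computes all mean curves in one `groupby`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train (assumption: same column names, made-up values).
rng = np.random.default_rng(0)
rows = []
for r in (5, 20):
    for c in (10, 50):
        for breath in range(3):
            # Slightly jittered time steps, as in real breaths.
            t = np.linspace(0.0, 1.0, 6) + rng.uniform(0.0, 0.004, size=6)
            rows.append(pd.DataFrame({
                'R': r, 'C': c, 'time_step': t,
                'pressure': 5 + r / 10 + 100 * t / c + rng.normal(0, 0.1, size=6),
            }))
toy = pd.concat(rows, ignore_index=True)

# Step 1: round the time step so rows from different breaths share x values.
toy['rounded_time_step'] = toy['time_step'].round(2)

# Step 2: one subset per (R, C) combination, keyed by the pair.
subsets = {combo: part for combo, part in toy.groupby(['R', 'C'])}

# Step 3: mean pressure per rounded time step, one column per (R, C) pair.
mean_curves = (
    toy.groupby(['R', 'C', 'rounded_time_step'])['pressure']
       .mean()
       .unstack(['R', 'C'])
)
print(mean_curves.round(2))
```

A subset is then available as, for example, `subsets[(5, 10)]`, and each column of `mean_curves` is the average breath for one combination, ready to be plotted or compared.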

In [None]:
train['rounded_time_step'] = train['time_step'].round(2)

train_5_10 = train[(train['R'] == 5) & (train['C'] == 10)]
train_20_10 = train[(train['R'] == 20) & (train['C'] == 10)]
train_50_10 = train[(train['R'] == 50) & (train['C'] == 10)]
train_5_20 = train[(train['R'] == 5) & (train['C'] == 20)]
train_20_20 = train[(train['R'] == 20) & (train['C'] == 20)]
train_50_20 = train[(train['R'] == 50) & (train['C'] == 20)]
train_5_50 = train[(train['R'] == 5) & (train['C'] == 50)]
train_20_50 = train[(train['R'] == 20) & (train['C'] == 50)]
train_50_50 = train[(train['R'] == 50) & (train['C'] == 50)]

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax = sns.lineplot(x = 'rounded_time_step', y = 'pressure', data = train_5_10)
ax.set_title('Average pressure, R={} and C={}'.format(train_5_10['R'].max(), train_5_10['C'].max()))
ax.set(xlabel = 'Time step', ylabel = 'Pressure')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax = sns.lineplot(x = 'rounded_time_step', y = 'pressure', data = train_20_10)
ax.set_title('Average pressure, R={} and C={}'.format(train_20_10['R'].max(), train_20_10['C'].max()))
ax.set(xlabel = 'Time step', ylabel = 'Pressure')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax = sns.lineplot(x = 'rounded_time_step', y = 'pressure', data = train_50_10)
ax.set_title('Average pressure, R={} and C={}'.format(train_50_10['R'].max(), train_50_10['C'].max()))
ax.set(xlabel = 'Time step', ylabel = 'Pressure')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax = sns.lineplot(x = 'rounded_time_step', y = 'pressure', data = train_5_20)
ax.set_title('Average pressure, R={} and C={}'.format(train_5_20['R'].max(), train_5_20['C'].max()))
ax.set(xlabel = 'Time step', ylabel = 'Pressure')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax = sns.lineplot(x = 'rounded_time_step', y = 'pressure', data = train_20_20)
ax.set_title('Average pressure, R={} and C={}'.format(train_20_20['R'].max(), train_20_20['C'].max()))
ax.set(xlabel = 'Time step', ylabel = 'Pressure')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax = sns.lineplot(x = 'rounded_time_step', y = 'pressure', data = train_50_20)
ax.set_title('Average pressure, R={} and C={}'.format(train_50_20['R'].max(), train_50_20['C'].max()))
ax.set(xlabel = 'Time step', ylabel = 'Pressure')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax = sns.lineplot(x = 'rounded_time_step', y = 'pressure', data = train_5_50)
ax.set_title('Average pressure, R={} and C={}'.format(train_5_50['R'].max(), train_5_50['C'].max()))
ax.set(xlabel = 'Time step', ylabel = 'Pressure')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax = sns.lineplot(x = 'rounded_time_step', y = 'pressure', data = train_20_50)
ax.set_title('Average pressure, R={} and C={}'.format(train_20_50['R'].max(), train_20_50['C'].max()))
ax.set(xlabel = 'Time step', ylabel = 'Pressure')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax = sns.lineplot(x = 'rounded_time_step', y = 'pressure', data = train_50_50)
ax.set_title('Average pressure, R={} and C={}'.format(train_50_50['R'].max(), train_50_50['C'].max()))
ax.set(xlabel = 'Time step', ylabel = 'Pressure')
plt.show()

### And a plot with average pressure for all data - to compare the results.

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax = sns.lineplot(x = 'rounded_time_step', y = 'pressure', data = train, color = 'red')
ax.set_title('Average pressure for the whole dataset')
ax.set(xlabel = 'Time step', ylabel = 'Pressure')
plt.show()

### Now I see that the pressure behaves differently for different R/C combinations, and the overall average plot does not look like any of the others. This may cause a large error if a single model is built on the whole dataset. There must be an approach that works with both categorical and continuous data.
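
The difference between the group curves and the overall curve can be put into numbers. The following sketch uses made-up pressure curves (the formula inside the loop is an assumption chosen only so that the shape depends on R and C) and measures the mean absolute gap between each group's average curve and the overall average.

```python
import numpy as np
import pandas as pd

# Toy data (assumption: made-up curves whose shape depends on R and C).
t = np.round(np.linspace(0.0, 1.0, 11), 2)
frames = []
for r, c in [(5, 10), (5, 50), (50, 10), (50, 50)]:
    frames.append(pd.DataFrame({
        'R': r, 'C': c, 'rounded_time_step': t,
        'pressure': 5 + 200 * t / c + r * t * (1 - t),
    }))
toy = pd.concat(frames, ignore_index=True)

# Mean curve over the whole dataset and mean curve per (R, C) group.
overall = toy.groupby('rounded_time_step')['pressure'].mean()
per_group = (
    toy.groupby(['R', 'C', 'rounded_time_step'])['pressure']
       .mean()
       .unstack('rounded_time_step')
)

# Mean absolute gap between each group's curve and the overall curve.
gap = per_group.sub(overall, axis='columns').abs().mean(axis='columns')
print(gap.round(2))
```

A large gap for every group would mean that the overall average describes none of the groups well, which is the reason to let the model see R and C explicitly.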