<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/29594/logos/header.png?t=2021-07-29-12-44-09" style="height:300px"><br>
<h1 style="color:black;font-size:3em;font-weight:30px;text-align:center">Google Brain - Ventilator Pressure Prediction</h1>
<p><b>In this notebook I will describe my understanding of the competition problem statement. Please note that I am making this notebook public even before it is completed. As I generate new insights, I will keep adding them to this notebook along with explanations. If you have any ideas feel free to put forth your suggestions in the comments. Happy Kaggling:)</b></p>

<a id="top"></a>
<div style="background: rgb(49,114,163);padding-bottom:10px">
<h1 style="text-align:center;color:white;">Table of Contents</h1>
</div>
<div style="padding:10px;text-align:center;background:rgba(0,0,0,0.08)">
<a href="#first" target="_self">Overview<span style="color:white;float:right;background:rgba(0,0,0,0.5);padding:5px;border-radius:10px">1</span></a>
</div>

<div style="padding:10px;text-align:center;">
<a href="#second" target="_self">Pressure pattern when output valve closed or open<span style="color:white;float:right;background:rgba(0,0,0,0.5);padding:5px;border-radius:10px">2</span></a>
</div>

<div style="padding:10px;text-align:center;background:rgba(0,0,0,0.08)">
<a href="#third" target="_self" >Analysis of individual breath<span style="color:white;float:right;background:rgba(0,0,0,0.5);padding:5px;border-radius:10px">3</span></a>
</div>

<div style="padding:10px;text-align:center;">
<a href="#fourth" target="_self">[V5 Update] About the constants R and C<span style="color:white;float:right;background:rgba(0,0,0,0.5);padding:5px;border-radius:10px">4</span></a>
</div>

<div style="padding:10px;text-align:center;background:rgba(0,0,0,0.08)">
<a href="#fifth" target="_self">[V6 Update] What Effects do R-C pairs have on pressure distribution<span style="color:white;float:right;background:rgba(0,0,0,0.5);padding:5px;border-radius:10px">5</span></a>
</div>

<div style="padding:10px;text-align:center;">
<a href="#sixth" target="_self">[V6 Update] One More Thing to Ponder About<span style="color:white;float:right;background:rgba(0,0,0,0.5);padding:5px;border-radius:10px">6</span></a>
</div>

<div style="padding:10px;text-align:center;background:rgba(0,0,0,0.08)">
<a href="#seventh" target="_self">[V7 Update] Some More Features<span style="color:white;float:right;background:rgba(0,0,0,0.5);padding:5px;border-radius:10px">7</span></a>
</div>

<div style="padding:10px;text-align:center;">
<a href="#eight" target="_self">[V9 Update] Use Double Derivative of u_in?<span style="color:white;float:right;background:rgba(0,0,0,0.5);padding:5px;border-radius:10px">8</span></a>
</div>

<div style="padding:10px;text-align:center;">
<a href="#nine" target="_self">[V10 Update] Negative Pressure Values<span style="color:white;float:right;background:rgba(0,0,0,0.5);padding:5px;border-radius:10px">9</span></a>
</div>

<a id="first"></a>
<h1 style="background:purple;color:white;padding-top:20px">
    <center>
        Overview
    </center>
</h1>

<p style="font-size:1.2em;font-family:callibri">Hello people. Very excited to be taking part in yet another interesting Kaggle competition. As with any competition, let's first try to understand the problem in the host's own words.</p><br>
<div style="background:rgba(0,0,0,0.1)">
    What do doctors do when a patient has trouble breathing? They use a ventilator to pump oxygen into a sedated patient's lungs via a tube in the windpipe. But mechanical ventilation is a clinician-intensive procedure, a limitation that was prominently on display during the early days of the COVID-19 pandemic. At the same time, developing new methods for controlling mechanical ventilators is prohibitively expensive, even before reaching clinical trials. High-quality simulators could reduce this barrier. In this competition, you’ll simulate a ventilator connected to a sedated patient's lung. The best submissions will take lung attributes compliance and resistance into account.
</div>

<img src="https://raw.githubusercontent.com/google/deluca-lung/main/assets/2020-10-02%20Ventilator%20diagram.svg" height=50% width=60% style="position:relative;margin-left:auto;margin-right:auto">
<p style="font-size:1.2em;font-family:callibri">
So in this problem, the ventilator is basically a control system that regulates the airway pressure in the lung. The pressure is the controlled variable, and the inspiratory and expiratory valve positions are the manipulated (control) variables. We are given the data in a sequential format, and using this data we have to predict the airway pressure at each time step. The takeaway is that this competition is a sequence-to-sequence regression problem. Let's have a look at the data before we speculate any further. The variable descriptions for the dataset are given below:</p>

- `id` - globally-unique time step identifier across an entire file
- `breath_id` - globally-unique identifier for breaths
- `R` - lung attribute indicating how restricted the airway is (in cmH2O/L/S). Physically, this is the change in pressure per change in flow (air volume per time). Intuitively, one can imagine blowing up a balloon through a straw. We can change R by changing the diameter of the straw, with higher R being harder to blow.
- `C` - lung attribute indicating how compliant the lung is (in mL/cmH2O). Physically, this is the change in volume per change in pressure. Intuitively, one can imagine the same balloon example. We can change C by changing the thickness of the balloon’s latex, with higher C having thinner latex and easier to blow.
- `time_step` - the actual time stamp.
- `u_in` - the control input for the inspiratory solenoid valve. Ranges from 0 to 100.
- `u_out` - the control input for the expiratory solenoid valve. Either 0 or 1.
- `pressure` - the airway pressure measured in the respiratory circuit, measured in cmH2O.

In [None]:
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from umap import UMAP 

from plotly.offline import init_notebook_mode
init_notebook_mode(connected=False)
import plotly.express as px

sns.set_style("darkgrid")

In [None]:
train = pd.read_csv("../input/ventilator-pressure-prediction/train.csv")
test = pd.read_csv("../input/ventilator-pressure-prediction/test.csv")

In [None]:
print("train shape: ",train.shape)
print("test shape: ",test.shape)
print("\nNumber of breaths train: ",train.breath_id.nunique())
print("Number of breaths test: ",test.breath_id.nunique())
train.head(10)

<a id="second"></a>
<h1 style="background:purple;color:white;padding-top:20px"> 
    <center>
        Pressure pattern when output valve closed or open <span style="float:right"><a href="#top"><img src="https://www.clipartmax.com/png/middle/163-1630443_go-to-top-white-arrow-in-circle.png" height="40px" width="40px"></a></span>
    </center>
</h1>
<br>
<p style="font-size:1.2em;font-family:callibri">
Let's have a look at how the pressure distribution differs when the expiratory valve is closed (0) and open (1).</p>

In [None]:
sns.kdeplot(train[train["u_out"]==0]["pressure"]);
plt.title("Expiratory valve closed");

In [None]:
sns.kdeplot(train[train["u_out"]==1]["pressure"]);
plt.title("Expiratory valve open");

In [None]:
sns.scatterplot(x='u_in',y='pressure',hue='u_out',data=train);

<p style="font-size:1.2em;font-family:callibri">From these plots we can clearly see that when the expiratory valve is open the pressure distribution is right-skewed: most of the mass sits at low pressures, with a long tail towards higher values. On the other hand, when the valve is closed (the inhalation phase) the pressure spans a wider range, which is expected because the pressure varies over the course of inhalation.</p>
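
The direction of the skew can also be checked numerically rather than by eye. The sketch below uses synthetic data (not the competition data) and pandas' `Series.skew`; a positive value indicates a right-skewed distribution and a negative value a left-skewed one. On the real data, something like `train.groupby("u_out")["pressure"].skew()` should give the equivalent numbers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: mass near the low end, long tail to the right
low_heavy = pd.Series(rng.exponential(scale=2.0, size=10_000) + 5.0)
# Mirror image: mass near the high end, long tail to the left
high_heavy = pd.Series(30.0 - rng.exponential(scale=2.0, size=10_000))

skew_low = low_heavy.skew()    # positive -> right-skewed
skew_high = high_heavy.skew()  # negative -> left-skewed
print(f"low-heavy skew: {skew_low:.2f}, high-heavy skew: {skew_high:.2f}")
```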

<a id="third"></a>
<h1 style="background:purple;color:white;padding-top:20px">
    <center>
        Analysis of individual breath  <span style="float:right"><a href="#top"><img src="https://www.clipartmax.com/png/middle/163-1630443_go-to-top-white-arrow-in-circle.png" height="40px" width="40px"></a></span>
    </center>
</h1>    
<br>
<p style="font-size:1.2em;font-family:callibri">
Let's see how the airway pressure varies within a complete breath cycle. For this we will individually consider four different breaths and plot the pressure as well as input valve positions.</p>

In [None]:
for i in range(1,5,1):
    one_breath = train[train["breath_id"]==i]

    plt.figure(figsize=(8,6));
    sns.lineplot(x = 'id',y='pressure',data=one_breath[one_breath['u_out']==0],color='green',label='pressure inhale');
    sns.lineplot(x = 'id',y='pressure',data=one_breath[one_breath['u_out']==1],color='orange',label='pressure exhale');
    sns.lineplot(x = 'id',y='u_in',data=one_breath,color='blue',label='valve position')
    plt.title(f"Variation of Pressure and Input valve position during breath {i}");
    plt.legend();

In [None]:
print("Number of time-steps for each breath in train set: ",train.groupby("breath_id").size().value_counts().keys()[0])
print("Number of time-steps for each breath in test set: ",test.groupby("breath_id").size().value_counts().keys()[0])

<ul>
<li> <p style="font-size:1.2em;font-family:callibri"> This is certainly very interesting. During a breath, while the expiratory valve is closed (inhalation), the pressure gradually increases as the inspiratory valve opens further. Interestingly, there is a certain <i>delay time</i> between a change in valve position (control input) and the pressure (response). Once the pressure reaches its peak, the expiratory valve opens and the exhalation phase begins. The pressure then decreases rapidly until it levels off at an asymptote. This cycle repeats continuously. So what do we understand from this? For starters, pressures at consecutive time steps are strongly correlated with each other, meaning that a sequential treatment of the data can prove advantageous. </p> </li>
<li> <p style="font-size:1.2em;font-family:callibri">We also note that each breath has 80 recorded time steps.</p></li>
</ul>
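
Since every breath has the same length, the flat table can be reshaped into one fixed-length sequence per breath, which is the input layout most sequence models expect. The sketch below is only an illustration on a small synthetic frame with the same column names as `train.csv`; the values and the feature selection are made up.

```python
import numpy as np
import pandas as pd

STEPS = 80  # time steps per breath, as observed above

# Tiny synthetic frame mimicking the layout of train.csv (values are invented)
n_breaths = 3
toy = pd.DataFrame({
    "breath_id": np.repeat(np.arange(1, n_breaths + 1), STEPS),
    "R": 20,
    "C": 50,
    "time_step": np.tile(np.linspace(0, 2.6, STEPS), n_breaths),
    "u_in": np.random.default_rng(0).uniform(0, 100, n_breaths * STEPS),
    "u_out": np.tile((np.arange(STEPS) >= 30).astype(int), n_breaths),
    "pressure": np.random.default_rng(1).uniform(0, 30, n_breaths * STEPS),
})

features = ["R", "C", "time_step", "u_in", "u_out"]
# Keep each breath contiguous and in time order before reshaping
toy = toy.sort_values(["breath_id", "time_step"])

# Inputs of shape (n_breaths, 80, n_features) and targets of shape (n_breaths, 80)
X = toy[features].to_numpy(dtype=np.float32).reshape(-1, STEPS, len(features))
y = toy["pressure"].to_numpy(dtype=np.float32).reshape(-1, STEPS)
print(X.shape, y.shape)
```

The reshape relies on every breath having exactly 80 rows; if that assumption were violated, padding or grouping would be needed instead.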

<a id="fourth"></a>
<h1  style="background:purple;color:white;padding-top:20px">
    <center>
        [V5 Update] About the constants R and C  <span style="float:right"><a href="#top"><img src="https://www.clipartmax.com/png/middle/163-1630443_go-to-top-white-arrow-in-circle.png" height="40px" width="40px"></a></span>
    </center>
</h1>

<p style="font-size:1.2em;font-family:callibri">
The meaning of the R and C attributes has been described as follows:</p>
<ul>
<li><p style="font-size:1.2em;font-family:callibri">R : lung attribute indicating how restricted the airway is (in cmH2O/L/S). Physically, this is the change in pressure per change in flow (air volume per time). Intuitively, one can imagine blowing up a balloon through a straw. We can change R by changing the diameter of the straw, with higher R being harder to blow.</p></li>
<li><p style="font-size:1.2em;font-family:callibri">C : lung attribute indicating how compliant the lung is (in mL/cmH2O). Physically, this is the change in volume per change in pressure. Intuitively, one can imagine the same balloon example. We can change C by changing the thickness of the balloon’s latex, with higher C having thinner latex and easier to blow.</p></li>
</ul>
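
A common first-order way to see how R and C act together is the single-compartment lung model. This is a textbook approximation brought in here as an assumption; the competition description does not state that the data follows it. With airway pressure $P$, flow $\dot{V}$, inhaled volume $V$ and baseline pressure $P_0$ (all symbols introduced here for illustration):

$$
P(t) \approx R\,\dot{V}(t) + \frac{V(t)}{C} + P_0, \qquad V(t) = \int_0^t \dot{V}(\tau)\,d\tau
$$

Under this approximation, a larger R raises the pressure needed for a given flow, and a larger C lowers the pressure produced by a given volume, which matches the straw and balloon intuition above. Note that R is quoted per litre and C per millilitre, so the units need converting before the two terms are added.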
<p style="font-size:1.2em;font-family:callibri">
Both are lung attributes. R takes the values 5, 20 or 50 and C takes the values 10, 20 or 50. Let's observe their counts in the train and test sets.
</p>

In [None]:
fig,ax = plt.subplots(2,2,figsize=(10,7))
sns.countplot(x = "R",data=train,ax = ax[0,0]);
ax[0,0].set_title("Counts of R in train set");
sns.countplot(x = "C",data=train,ax = ax[0,1]);
ax[0,1].set_title("Counts of C in train set");
sns.countplot(x = "R",data=test,ax = ax[1,0]);
ax[1,0].set_title("Counts of R in test set");
sns.countplot(x = "C",data=test,ax = ax[1,1]);
ax[1,1].set_title("Counts of C in test set");
plt.tight_layout()

Let's see the different R-C pairs that exist and their counts.

In [None]:
fig,ax = plt.subplots(1,2,figsize=(16,4))
for axis,(name,df) in zip(ax,[("train",train),("test",test)]):
    # number of rows for each (R, C) combination
    pair_rc = df.groupby(["R", "C"]).size().reset_index(name="Counts")
    # readable "(R, C)" label, built by column name rather than by position
    # (positional access such as cols[0] on a labelled Series is deprecated in recent pandas)
    pair_rc["R-C pair"] = "(" + pair_rc["R"].astype(str) + ", " + pair_rc["C"].astype(str) + ")"
    sns.barplot(x="R-C pair",y="Counts",data=pair_rc,ax=axis);
    axis.set_title(f"Counts of R-C pairs {name} set");

<a id="fifth"></a>
<h1 style="background:purple;color:white;padding-top:20px">
    <center>
        [V6 Update] What Effects do R-C pairs have on pressure distribution  <span style="float:right"><a href="#top"><img src="https://www.clipartmax.com/png/middle/163-1630443_go-to-top-white-arrow-in-circle.png" height="40px" width="40px"></a></span>
    </center>
</h1>

In [None]:
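# Label each row with its (R, C) pair as a string such as "(50, 10)".
# Building it column-wise avoids a slow row-wise apply and the deprecated
# positional access cols[0] on a labelled Series.
train["R-C"] = "(" + train["R"].astype(str) + ", " + train["C"].astype(str) + ")"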

In [None]:
fig,ax = plt.subplots(3,3,figsize=(12,9))
ax = ax.flatten()
for idx,rc in enumerate(train["R-C"].unique()):
    df = train[(train["R-C"]==rc)& (train["u_out"]==0)]
#     breath_id = df["breath_id"].values[0]
    sns.kdeplot(df["pressure"],ax=ax[idx]);
    ax[idx].set_title("R-C {}".format(rc))
fig.suptitle("Output Valve Closed",fontsize="16")
plt.tight_layout()

In [None]:
fig,ax = plt.subplots(3,3,figsize=(12,9))
ax = ax.flatten()
for idx,rc in enumerate(train["R-C"].unique()):
    df = train[(train["R-C"]==rc)& (train["u_out"]==1)]
#     breath_id = df["breath_id"].values[0]
    sns.kdeplot(df["pressure"],ax=ax[idx]);
    ax[idx].set_title("R-C {}".format(rc))
fig.suptitle("Output Valve Open",fontsize="16")
plt.tight_layout()

<p style="font-size:1.2em;font-family:callibri">
    Well, the pressure distribution varies with the R-C pair, both when the output valve is closed and when it is open. Perhaps including the R-C pair as a categorical feature will be beneficial.
</p>
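
As a minimal sketch of what "R-C pair as a categorical feature" could look like, the snippet below one-hot encodes a combined label with `pd.get_dummies`. The toy rows and the column names `R_C` and `RC_*` are assumptions made for this illustration; other encodings (for example an integer category fed to an embedding layer) would be equally valid.

```python
import pandas as pd

# Toy rows using R and C values that occur in the dataset
toy = pd.DataFrame({"R": [5, 20, 50, 50], "C": [10, 20, 50, 10]})

# Combine the two attributes into a single string label such as "50_10"
toy["R_C"] = toy["R"].astype(str) + "_" + toy["C"].astype(str)

# One-hot encode the combined label; only pairs present in the data get a column
encoded = pd.get_dummies(toy["R_C"], prefix="RC", dtype=int)
toy = pd.concat([toy, encoded], axis=1)
print(toy.columns.tolist())
```

When train and test are encoded separately, it is worth checking that both end up with the same set of columns.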

<a id="sixth"></a>
<h1 style="background:purple;color:white;padding-top:20px">
    <center>
        [V6 Update] One More Thing to Ponder About  <span style="float:right"><a href="#top"><img src="https://www.clipartmax.com/png/middle/163-1630443_go-to-top-white-arrow-in-circle.png" height="40px" width="40px"></a></span>
    </center>
</h1>

<p style="font-size:1.2em;font-family:callibri">
    For this one, let's plot the distributions of pressure for output valve closed and open, as well as the overall distribution. Notice the abrupt peaks in the distribution when the expiratory valve is closed (during inhalation), which in turn produce small peaks in the overall distribution. Why do certain pressure values have a higher likelihood of occurrence? That is something to think about, and we will certainly address it in upcoming versions of this notebook.
</p>

In [None]:
fig,ax = plt.subplots(1,3,figsize=(12,4))
sns.kdeplot(train["pressure"],ax=ax[0]);
ax[0].set_title("Pressure Distribution");
sns.kdeplot(train[train["u_out"]==0]["pressure"],ax=ax[1]);
ax[1].set_title("Expiratory valve closed");
sns.kdeplot(train[train["u_out"]==1]["pressure"],ax=ax[2]);
ax[2].set_title("Expiratory valve open");
plt.tight_layout()
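
One hypothesis worth testing for these peaks (an assumption, not a confirmed explanation) is that the pressure is recorded on a discrete grid, so that only a limited set of values can occur. The sketch below shows the check on a synthetic signal; the grid step `STEP` is a made-up number chosen only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
STEP = 0.07  # hypothetical sensor resolution, used only in this synthetic example

# Synthetic "measured" signal: continuous values snapped to a grid of width STEP
measured = np.round(rng.normal(10, 3, 5000) / STEP) * STEP

levels = np.unique(np.round(measured, 6))  # distinct recorded values
gaps = np.diff(levels)                     # spacing between neighbouring values
print(f"{levels.size} distinct values, smallest gap {gaps.min():.4f}")
```

Applying the same two lines to `train["pressure"]` would show whether the number of distinct values is small compared with the number of rows and whether the gaps are regular.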

<a id="seventh"></a>
<h1 style="background:purple;color:white;padding-top:20px">
    <center>
        [V7 Update] Some More Features  <span style="float:right"><a href="#top"><img src="https://www.clipartmax.com/png/middle/163-1630443_go-to-top-white-arrow-in-circle.png" height="40px" width="40px"></a></span>
    </center>
</h1>
<p style="font-size:1.2em;font-family:callibri"><a href="https://www.kaggle.com/lukaszborecki">@lukaszborecki</a> has suggested the following features that seem to be useful for the model</p>
<ul>
    <li><p style="font-size:1.2em;font-family:callibri"> Integral of u_in*dt (cumulative sum of u_in times the change in time_step)</p></li>
    <li><p style="font-size:1.2em;font-family:callibri"> Delta u_in times delta time_step</p></li>
    <li><p style="font-size:1.2em;font-family:callibri"> Slope, delta u_in / delta time_step</p></li>
</ul>
<p style="font-size:1.2em;font-family:callibri"> Another feature being used is the cumulative sum of u_in. Let's explore these features. <br> First of all, let's see how pressure varies with the cumulative sum of u_in.</p>

In [None]:
train["u_in_cumsum"] = train.groupby("breath_id")["u_in"].cumsum()
fig = plt.figure(figsize=(8,6))
sns.scatterplot(x="u_in_cumsum",y="pressure",hue="u_out",data=train);
plt.title("u_in_cumsum vs. pressure coloured by u_out");

<p style="font-size:1.2em;font-family:callibri"> 
    The same plot gets a lot more interesting when you colour it by R-C pair! As you can see below, certain distinct clusters are formed. For example, for R-C pair (50,20) the pressure is high at low values of u_in_cumsum. This really looks like a useful feature.
</p>

In [None]:
fig = plt.figure(figsize=(8,6))
sns.scatterplot(x="u_in_cumsum",y="pressure",hue="R-C",data=train);
plt.title("u_in_cumsum vs. pressure coloured by R-C pair");
plt.legend(loc="upper right");

<p style="font-size:1.2em;font-family:callibri"> Now let's plot the integral of u_in vs pressure. The integral is calculated as follows:<br>
<pre> integral = cumulative_sum(u_in * dt), where dt is the difference of consecutive time steps.</pre>
<br>
    We plot the integral and color it by u_out. The pattern observed here is similar to the one for u_in_cumsum
 </p>

In [None]:
train["delta_t"] = train.groupby("breath_id")["time_step"].diff().fillna(0)
train["integrand"] = train["u_in"]*train["delta_t"]
train["integral"] = train.groupby("breath_id")["integrand"].cumsum()

In [None]:
fig = plt.figure(figsize=(8,6))
sns.scatterplot(x="integral",y="pressure",hue="u_out",data=train);
plt.title("integral of u_in vs. pressure coloured by u_out");

<p style="font-size:1.2em;font-family:callibri"> Just like before, we now colour this plot by R-C pair. Again, the pattern is similar to the earlier one.</p>

In [None]:
fig = plt.figure(figsize=(8,6))
sns.scatterplot(x="integral",y="pressure",hue="R-C",data=train);
plt.title("integral of u_in vs. pressure coloured by R-C pair");
plt.legend(loc="upper right");

<p style="font-size:1.2em;font-family:callibri"> The next feature we will examine is the differential of u_in with respect to time. The differential is calculated as follows:<br></p>
<pre> differential = diff(u_in)/diff(time_step) , where diff indicates the difference of consecutive samples.</pre>
<br>
    <p style="font-size:1.2em;font-family:callibri">We plot the differential and color it by u_out and R-C pair respectively. I currently do not have an interpretation of these patterns. Feel free to make suggestions! </p>

In [None]:
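# The first step of each breath has no predecessor, so both diffs are NaN there.
# delta_uin is filled with 0 and delta_time with an arbitrary non-zero value (3),
# so the first differential of every breath becomes 0 instead of 0/0.
train["delta_time"] = train.groupby("breath_id")["time_step"].diff().fillna(3)
train["delta_uin"] = train.groupby("breath_id")["u_in"].diff().fillna(0)
train["differential"] = train["delta_uin"]/train["delta_time"]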

In [None]:
fig = plt.figure(figsize=(8,6))
sns.scatterplot(x="differential",y="pressure",hue="u_out",data=train);
plt.title("differential of u_in vs. pressure coloured by u_out");
plt.legend(loc="upper right");

In [None]:
fig = plt.figure(figsize=(8,6))
sns.scatterplot(x="differential",y="pressure",hue="R-C",data=train);
plt.title("differential of u_in vs. pressure coloured by R-C pair");
plt.legend(loc="upper right");

<a id="eight"></a>
<h1 style="background:purple;color:white;padding-top:20px">
    <center>
        [V9 Update] Use Double Derivative of u_in? <span style="float:right"><a href="#top"><img src="https://www.clipartmax.com/png/middle/163-1630443_go-to-top-white-arrow-in-circle.png" height="40px" width="40px"></a></span>
    </center>
</h1>

<p style="font-size:1.2em;font-family:callibri">It is quite clear that u_in is an important feature for the current task: variations in u_in seem to largely affect the subsequent pressure values. It therefore seems likely that the double derivative would also give some insight into how the pressure rises or falls. But how do we calculate the double derivative of u_in? Since the u_in sequence consists of discrete values, we can approximate the double derivative with a kernel and convolve it with the sequence. In particular, we use this kernel: </p>
    <pre style="font-size:1.4em;text-align:center"> k = [1,-2, 1]</pre>
<p style="font-size:1.2em;font-family:callibri">To perform the convolution, we group the dataset by breath_id and convolve each breath (80 time steps) separately, using appropriate padding.</p>

In [None]:
kernel = np.array([1,-2,1])
def convolve(inp_array):
    inp_array = inp_array.values
    return np.convolve(kernel,inp_array,mode="same")

train["delta"] = np.vstack(train.groupby("breath_id")["u_in"].apply(convolve).values).flatten()
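
To see what this kernel computes, it helps to run it on a sequence whose second difference is known. The sketch below is a standalone illustration, separate from the competition data, that convolves the kernel with the squares 0, 1, 4, 9, ...; the names `x` and `second_diff` are introduced only for this example.

```python
import numpy as np

kernel = np.array([1, -2, 1])

# Quadratic sequence x[n] = n^2, whose second difference is exactly 2 everywhere
x = np.arange(8) ** 2
second_diff = np.convolve(x, kernel, mode="same")

# Interior entries equal x[n-1] - 2*x[n] + x[n+1]; the two edge entries are
# distorted because mode="same" treats values outside the sequence as zeros
print(second_diff)
```

Note that this is a second difference per sample index. It is not divided by the squared time step, so it only matches a true second derivative up to scale when the time steps are evenly spaced.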

In [None]:
for i in range(1,8,1):
    
    fig,ax = plt.subplots(1,1,figsize=(14,6));
    one_breath = train[train["breath_id"]==i]
    sns.lineplot(x = 'id',y='pressure',data=one_breath[one_breath['u_out']==0],color='green',label='pressure inhale',ax=ax);
    sns.lineplot(x = 'id',y='pressure',data=one_breath[one_breath['u_out']==1],color='orange',label='pressure exhale',ax=ax);
    sns.lineplot(x = 'id',y='u_in',data=one_breath,color='blue',label='valve position',ax=ax)
    
    sns.lineplot(x = 'id',y='delta',data=one_breath[one_breath['u_out']==0],color='violet',label='delta inhale',ax=ax);
    sns.lineplot(x = 'id',y='delta',data=one_breath[one_breath['u_out']==1],color='red',label='delta exhale',ax=ax);
    ax.set_title(f"Pressure during breath {i}");
    plt.legend();

<a id="nine"></a>
<h1 style="background:purple;color:white;padding-top:20px">
    <center>
        [V10 Update] Negative Pressure Values <span style="float:right"><a href="#top"><img src="https://www.clipartmax.com/png/middle/163-1630443_go-to-top-white-arrow-in-circle.png" height="40px" width="40px"></a></span>
    </center>
</h1>

<p style="font-size:1.2em;font-family:callibri">It is quite strange to see that certain breath_ids have only negative pressure values. What's more astonishing is that these values seem to have no observable pattern. In <a href="https://www.kaggle.com/c/ventilator-pressure-prediction/discussion/277695">this</a> discussion thread <a href="https://www.kaggle.com/cpmpml">@cpmp</a> puts forth the idea that the pressure values are measured relative to atmospheric pressure, and hence negative values indicate pressure below atmospheric. Below I have plotted the patterns for breath_ids with negative pressure values.</p>

In [None]:
# np.sign is -1, 0 or 1 and, unlike pressure/abs(pressure), is defined for pressure == 0
train["pressure_sign"] = np.sign(train["pressure"])
# a breath is "all negative" when the signs of its 80 pressure values sum to -80
sign_df = train.groupby("breath_id")["pressure_sign"].sum().to_frame().reset_index()
sign_df = sign_df[sign_df["pressure_sign"]==-80.0]

# R and C are constant within a breath, so the first row of each breath is enough
rc_per_breath = train.groupby("breath_id")[["R","C"]].first().reset_index()
sign_df = sign_df.merge(rc_per_breath,on="breath_id",how="left")
sign_df

<p style="font-size:1.2em;font-family:callibri">You can see that all breath_ids having only negative pressure values have R=50 and C=10. Is this a coincidence? It could be, since the R-C pair (50,10) is the most frequently occurring one. The inhale and exhale pressures and the u_in values for these ids are shown below.</p>

In [None]:
# 17 x 2 grid: one panel per all-negative breath
fig,ax = plt.subplots(17,2,figsize=(16,92))
ax = ax.flatten()
for i,breath_id in enumerate(sign_df["breath_id"]):
    one_breath = train[train["breath_id"]==breath_id]
    sns.lineplot(x = 'id',y='pressure',data=one_breath[one_breath['u_out']==0],color='green',label='pressure inhale',ax=ax[i]);
    sns.lineplot(x = 'id',y='pressure',data=one_breath[one_breath['u_out']==1],color='orange',label='pressure exhale',ax=ax[i]);
    sns.lineplot(x = 'id',y='u_in',data=one_breath,color='blue',label='valve position',ax=ax[i])
    ax[i].set_title(f"Variation of Pressure and Input valve position during breath_id {breath_id}");
    # attach the legend to this panel; plt.legend() would only affect the current axes
    ax[i].legend();
plt.show()

<p style="font-size:1.2em;font-family:callibri">Well, the u_in values in these ids certainly have a weird pattern, and as for the pressure values, they really do not make any sense to me. So should we drop these samples? Before we proceed further, please note that for these samples the u_in values stay very close to 0.</p>
<p style="font-size:1.2em;font-family:callibri">To see if negative pressures are likely to occur in the test set, I carry out the following steps.</p>
<ul>
    <li> Get the 34 u_in sequences corresponding to the 34 samples in the train set having all negative pressures</li>
    <li> Calculate the Euclidean distance between the u_in sequence of each breath_id in the test set and every u_in sequence in the set obtained in the previous step</li>
    <li> For each of the 34 u_in sequences, get the sample in the test set having least distance from the particular u_in and plot the graph</li>
</ul>
<p style="font-size:1.2em;font-family:callibri">My reasoning is that if the samples most similar to the ones having negative pressure also have u_in that do not show much fluctuations and stay close to 0 then negative pressures are likely to occur in test set too.</p>

In [None]:
#get u_ins for negative pressure breath_ids
u_ins = []
for i in range(sign_df.shape[0]):
    one_breath = train[train["breath_id"]==sign_df["breath_id"].iloc[i]]
    u_ins.append(one_breath["u_in"].values)
    
distance_df = pd.DataFrame.from_dict({"breath_id":test["breath_id"].unique()})
#get euclidean distances for u_in each breath_id from the u_in of the negative pressure ids.
for i in range(len(u_ins)):
    test["temp"] = (test["u_in"]-np.repeat(np.expand_dims(u_ins[i],axis=0),distance_df.shape[0],axis=0).flatten())**2
    distance_df[f"u_in_{i}"] = np.sqrt(test.groupby("breath_id")["temp"].sum().reset_index(drop=True))

min_ids = []
for i in range(len(u_ins)):
    min_ids.append(np.argmin(distance_df[f"u_in_{i}"]))
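
The same nearest-neighbour search can also be written without the per-sequence loop by using NumPy broadcasting. The sketch below runs on synthetic arrays rather than the competition data; `reference` stands in for `u_ins`, `candidates` for the test u_in values with one row per breath, and a near-duplicate is planted so the result can be verified.

```python
import numpy as np

STEPS = 80
rng = np.random.default_rng(0)

# Synthetic stand-ins: 5 reference u_in sequences and 200 candidate breaths
reference = rng.uniform(0, 1, size=(5, STEPS))
candidates = rng.uniform(0, 100, size=(200, STEPS))
candidates[42] = reference[3] + 0.01  # planted near-duplicate of reference 3

# Pairwise Euclidean distances via broadcasting: result has shape (200, 5)
diff = candidates[:, None, :] - reference[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Row index of the closest candidate for each reference sequence
nearest = dist.argmin(axis=0)
print(nearest)
```

On the real data, `candidates` could be built with `test["u_in"].to_numpy().reshape(-1, 80)`, assuming the test rows are sorted by breath_id and every breath has 80 rows.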

In [None]:
fig,ax = plt.subplots(17,2,figsize=(16,92))
for i in range(1,35,1):
    one_breath = test[test["breath_id"]==distance_df["breath_id"].iloc[min_ids[i-1]]]
    x,y = (i//2),1-(i%2)
    x=  x-1 if y==1 else x
    sns.lineplot(x = 'id',y='u_in',data=one_breath,color='blue',label='valve position',ax=ax[x,y])
    ax[x,y].set_title(f"u_in {i}");
    plt.legend();
# plt.suptitle("u_in for closest distance samples in test data")
plt.show()

<p style="font-size:1.2em;font-family:callibri">As we can see, what I thought might happen did happen :). So what is the consensus, guys? Do we drop them or not?</p>