
<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home"><center>CRISP-DM Methodology</center></h3>

* [Business Understanding](#1)
* [Data Understanding](#2)
* [Data Preparation](#3)
* [Data Modeling](#4)   
* [Data Evaluation](#5)

In this section we give an overview of the method we selected for engineering our solution. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It is an open standard guide that describes common approaches used by data mining experts. CRISP-DM describes the typical phases of a project and the tasks within each phase, and it provides an overview of the data mining lifecycle. The lifecycle model consists of six phases, with arrows indicating the most important and frequent dependencies between them. The sequence of the phases is not strict; in fact, most projects move back and forth between phases as necessary. It starts with business understanding, and then moves to data understanding, data preparation, modeling, evaluation, and deployment. The CRISP-DM model is flexible and can be customized easily.
## Business Understanding

    Tasks:

    1.Determine business objectives

    2.Assess situation

    3.Determine data mining goals

    4.Produce project plan

## Data Understanding
     Tasks:

    1.Collect data

    2.Describe data

    3.Explore data    

## Data Preparation
    Tasks
    1.Data selection

    2.Data preprocessing

    3.Feature engineering

    4.Dimensionality reduction

            Steps:

            Data cleaning

            Data integration

            Data sampling

            Data dimensionality reduction

            Data formatting

            Data transformation

            Scaling

            Aggregation

            Decomposition

## Data Modeling :

Modeling is the part of the Cross-Industry Standard Process for Data Mining (CRISP-DM) process model that I like best. Our data is already in good shape, and now we can search for useful patterns in it.

    Tasks
    1. Select modeling technique

    2. Generate test design

    3. Build model

    4. Assess model

## Data Evaluation :
    Tasks

    1.Evaluate Result

    2.Review Process

    3.Determine next steps

<a id="top"></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home">
<center>Business Understanding</center></h3>

    
What do doctors do when a patient has trouble breathing? They use a ventilator to pump oxygen into a sedated patient's lungs via a tube in the windpipe. But mechanical ventilation is a clinician-intensive procedure, a limitation that was prominently on display during the early days of the COVID-19 pandemic. At the same time, developing new methods for controlling mechanical ventilators is prohibitively expensive, even before reaching clinical trials. High-quality simulators could reduce this barrier.

Current simulators are trained as an ensemble, where each model simulates a single lung setting. However, lungs and their attributes form a continuous space, so a parametric approach must be explored that would consider the differences in patient lungs.

Partnering with Princeton University, the team at Google Brain aims to grow the community around machine learning for mechanical ventilation control. They believe that neural networks and deep learning can better generalize across lungs with varying characteristics than the current industry standard of PID controllers.

In this competition, you’ll simulate a ventilator connected to a sedated patient's lung. The best submissions will take lung attributes compliance and resistance into account.

If successful, you'll help overcome the cost barrier of developing new methods for controlling mechanical ventilators. This will pave the way for algorithms that adapt to patients and reduce the burden on clinicians during these novel times and beyond. As a result, ventilator treatments may become more widely available to help patients breathe.

**Eval Metric**: The competition will be scored as the mean absolute error between the predicted and actual pressures during the inspiratory phase of each breath. The expiratory phase is not scored. The score is given by:

|X−Y|

where X is the vector of predicted pressure and Y is the vector of actual pressures across all breaths in the test set.
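As a rough illustration of the metric, the sketch below computes the mean absolute error only on rows that belong to the inspiratory phase. It assumes that `u_out == 0` marks the inspiratory phase; that mapping is an assumption of this example, and the helper name `inspiratory_mae` is hypothetical.

```python
import numpy as np

def inspiratory_mae(predicted, actual, u_out):
    """Mean absolute error restricted to the rows treated as inspiratory."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Assumption: u_out == 0 marks the inspiratory phase (the scored rows)
    mask = np.asarray(u_out) == 0
    return float(np.mean(np.abs(predicted[mask] - actual[mask])))

# Toy values: the last two rows have u_out == 1 and are ignored
example_score = inspiratory_mae(
    predicted=[10.0, 12.0, 8.0, 6.0],
    actual=[11.0, 12.5, 20.0, 30.0],
    u_out=[0, 0, 1, 1],
)
print(example_score)
```

The large errors on the last two rows do not affect the score because those rows are masked out.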
    
  **Reminder**

* id - globally-unique time step identifier across an entire file

* breath_id - globally-unique time step for breaths

* R - lung attribute indicating how restricted the airway is (in cmH2O/L/S). Physically, this is the change in pressure per change in flow (air volume per time). Intuitively, one can imagine blowing up a balloon through a straw. We can change R by changing the diameter of the straw, with higher R being harder to blow.

* C - lung attribute indicating how compliant the lung is (in mL/cmH2O). Physically, this is the change in volume per change in pressure. Intuitively, one can imagine the same balloon example. We can change C by changing the thickness of the balloon’s latex, with higher C having thinner latex and easier to blow.

* time_step - the actual time stamp.

* u_in - the control input for the inspiratory solenoid valve. Ranges from 0 to 100.

* u_out - the control input for the expiratory solenoid valve. Either 0 or 1.

* pressure - the airway pressure measured in the respiratory circuit, measured in cmH2O.
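To make the units of R and C concrete, here is a minimal sketch of a textbook single-compartment approximation: pressure is the sum of a resistive term (R times flow) and an elastic term (volume divided by C). This is only an illustration of the definitions above, not the simulator used in the competition; the function name and the numeric values are hypothetical.

```python
def airway_pressure(flow_l_per_s, volume_ml, R, C):
    """Approximate airway pressure in cmH2O, ignoring any baseline pressure.

    R is in cmH2O/L/s and C is in mL/cmH2O, matching the column descriptions.
    """
    resistive = R * flow_l_per_s   # (cmH2O/L/s) * (L/s)  -> cmH2O
    elastic = volume_ml / C        # mL / (mL/cmH2O)      -> cmH2O
    return resistive + elastic

# Same flow and volume, different airway resistance (illustrative values)
low_r = airway_pressure(0.5, 250.0, R=5, C=50)
high_r = airway_pressure(0.5, 250.0, R=50, C=50)
print(low_r, high_r)
```

Under this approximation, a higher R raises the pressure needed for the same flow, and a higher C lowers the pressure needed to hold the same volume, which matches the straw and balloon intuition.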
    
<a id="top"></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home">
<center>Data Understanding</center></h3>

    
## Step 1: Import helpful libraries

In [None]:
#Load the libraries
import pandas as pd #To work with dataset
import numpy as np #Math library
import matplotlib.gridspec as gridspec
import seaborn as sns #Graph library that use matplot in background
import matplotlib.pyplot as plt #to plot some parameters in seaborn
import warnings
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PowerTransformer, StandardScaler,Normalizer,RobustScaler,MaxAbsScaler,MinMaxScaler,QuantileTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsClassifier
# Imputers for missing values
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline, Pipeline

from sklearn.manifold import TSNE
# Metrics
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error
from sklearn.metrics import roc_curve
from datetime import datetime, date
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.linear_model import LinearRegression, RidgeCV

import lightgbm as lgbm
from catboost import CatBoostRegressor
import tensorflow as tf 
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import LearningRateScheduler
#import smogn
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import GradientBoostingRegressor,RandomForestRegressor
# For training random forest model
import lightgbm as lgb
from scipy import sparse
from sklearn.neighbors import KNeighborsRegressor 
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans 
# Model selection
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression,f_classif
from sklearn.feature_selection import mutual_info_regression

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

from itertools import combinations
#import smong 

import category_encoders as ce
import warnings
import optuna 
warnings.filterwarnings('ignore')

In [None]:
!pip install klib 
import klib 


## Step 2: Load the data

Next, we'll load the training and test data.

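The code cell below reads the files with the default pandas settings, so the id column stays an ordinary column that later cells can reference by name. (Passing index_col=0 would instead use the id column to index the DataFrame. If you're not sure how this works, try temporarily adding index_col=0 and see how it changes the result.)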
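To see what `index_col=0` changes, the sketch below reads the same tiny in-memory CSV twice. The CSV content is made up for illustration and only borrows a few column names from the dataset.

```python
import io
import pandas as pd

# Hypothetical three-row CSV, not the real competition file
csv_text = "id,breath_id,u_in\n1,1,0.0\n2,1,5.5\n3,1,7.2\n"

default_df = pd.read_csv(io.StringIO(csv_text))               # id stays a column
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)  # id becomes the index

print(default_df.columns.tolist())
print(indexed_df.columns.tolist())
print(indexed_df.index.name)
```

With `index_col=0`, code such as `df.drop(['id'], axis=1)` would fail because `id` is no longer a column, so the choice has to match how later cells refer to it.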


In [None]:
%%time
train = pd.read_csv('../input/ventilator-pressure-prediction/train.csv')
pressure_values = np.sort( train.pressure.unique() )
test = pd.read_csv('../input/ventilator-pressure-prediction/test.csv')
submission = pd.read_csv('../input/ventilator-pressure-prediction/sample_submission.csv')

In [None]:
train.head()

In [None]:
train.columns

In [None]:
train.info()

## EDA 

### Explore the data

    Null Data
    Categorical data
    Is there text data?
    Which columns will we use?
    Are there outliers that can destroy our algorithm?
    Are there different ranges of data?
    Curse of dimm...
    

####  Null Data 
**How sparse is my data?**
Most data sets contain missing values, often represented as NaN (Not a Number). If you are working with Pandas you can easily check how many missing values exist in each column.
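A minimal sketch of that check on a toy DataFrame (the values are invented) shows both the count and the percentage of missing values per column:

```python
import numpy as np
import pandas as pd

# Toy frame with a few deliberately missing entries
toy = pd.DataFrame({
    "u_in": [0.0, np.nan, 4.2, 7.1],
    "pressure": [5.8, 6.1, np.nan, np.nan],
})

missing_summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),           # count of NaN per column
    "pct_missing": toy.isnull().mean() * 100,  # share of NaN per column, in percent
})
print(missing_summary)
```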

In [None]:
train.isnull().sum().values

In [None]:
test.isnull().sum().values

In [None]:
train[train.isnull().sum(axis=1) >=1].shape

In [None]:
# summarize the number of rows with missing values for each column
for i in range(train.shape[1]):
    # count number of rows with missing values
    n_miss = train.iloc[:,i].isnull().sum()
    perc = n_miss / train.shape[0] * 100
    print('> %d, Missing: %d (%.1f%%)' % (i, n_miss, perc))

In [None]:
klib.missingval_plot(train)

In [None]:
train.duplicated(subset='id', keep='first').sum()

In [None]:
len(train)-len(train.drop_duplicates())

In [None]:
train.describe().T

### Visual Exploratory 

In [None]:
plot = klib.corr_plot(train, annot=False, figsize=(12,10))

In [None]:
plot.figure.savefig('figure1.pdf')

In [None]:
klib.corr_plot(train, split='pos') # displaying only positive correlations, other settings include threshold, cmap...
klib.corr_plot(train, split='neg') # displaying only negative correlations

In [None]:
# Comparing the datasets length
fig, ax = plt.subplots(figsize=(5, 5))
pie = ax.pie([len(train), len(test)],
             labels=["Train dataset", "Test dataset"],
             colors=["salmon", "teal"],
             textprops={"fontsize": 15},
             autopct='%1.1f%%')
ax.axis("equal")
ax.set_title("Dataset length comparison", fontsize=18)
fig.set_facecolor('white')
plt.show()

In [None]:
klib.cat_plot(train[['R', 'C', 'u_out']], figsize=(50,15))

In [None]:
klib.dist_plot(train[['R','C','u_in','u_out','pressure']])

## Convert Dtypes 

In [None]:
train[train.select_dtypes(['float64']).columns] = train[train.select_dtypes(['float64']).columns].apply(pd.to_numeric)
train[train.select_dtypes(['object','int64']).columns] = train.select_dtypes(['object','int64']).apply(lambda x: x.astype('category'))
test[test.select_dtypes(['float64']).columns] = test[test.select_dtypes(['float64']).columns].apply(pd.to_numeric)
test[test.select_dtypes(['object','int64']).columns] = test.select_dtypes(['object','int64']).apply(lambda x: x.astype('category'))

### Num/Cat Features 

In [None]:
cat_columns = train.drop(['id','pressure','breath_id'], axis=1).select_dtypes(exclude=['float64']).columns
num_columns = train.drop(['id','pressure','breath_id'], axis=1).select_dtypes(include=['int64','float64','category']).columns

### Numerical features distribution
#### Histograms of numerical features

In [None]:
num_columns

In [None]:
# Numerical features distribution 
i = 1
plt.figure()
fig, ax = plt.subplots(3, 2,figsize=(20, 24))
for feature in num_columns:
    plt.subplot(3, 2,i)
    sns.histplot(train[feature],color="red", kde=True,bins=100, label='train')
    sns.histplot(test[feature],color="olive", kde=True,bins=100, label='test')
    plt.xlabel(feature, fontsize=9); plt.legend()
    i += 1
plt.show()

**Histograms: the test numerical data seems to be distributed similarly to the train numerical data.**
### Zooming on the correlation between numerical variables and target.

In [None]:
train.corr()['pressure'][:-1].plot.barh(figsize=(8,6),alpha=.6,color='darkblue')
plt.xlim(-.075,.075);
plt.xticks([-0.065, -0.05 , -0.025,  0.   ,  0.025,  0.05 ,  0.065],
           [str(100*i)+'%' for i in [-0.065, -0.05 , -0.025,  0.   ,  0.025,  0.05 ,  0.065]],fontsize=12)
plt.title('Correlation between target and numerical variables',fontsize=14);

The plot shows no clear linear relationship between the numerical variables and the target.
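A caveat worth keeping in mind: a correlation coefficient only measures linear association. The synthetic sketch below (not based on the competition data) builds a variable that is fully determined by another one and still has a Pearson correlation of essentially zero.

```python
import numpy as np

# Symmetric inputs and a purely quadratic response
x = np.linspace(-1.0, 1.0, 101)
y = x ** 2

# Pearson correlation between x and y
pearson_r = np.corrcoef(x, y)[0, 1]
print(pearson_r)
```

So a weak correlation with the target does not by itself rule out a useful nonlinear relationship.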

Next, we explore the correlation between all numerical variables. First we build a correlation grid of all numerical variables and the target.


### Correlation 

In [None]:
train.corr().style.background_gradient(cmap='viridis')

### Box plot of numerical columns

In [None]:
v0 = sns.color_palette(palette='viridis').as_hex()[0]
fig = plt.figure(figsize=(18,6))
sns.boxplot(data=train[num_columns], color=v0,saturation=.5);
plt.xticks(fontsize= 14)
plt.title('Box plot of train numerical columns', fontsize=16);

### Test data 

In [None]:
fig = plt.figure(figsize=(18,6))
sns.boxplot(data=test[num_columns], color=v0,saturation=.5);
plt.xticks(fontsize= 14)
plt.title('Box plot of test numerical columns', fontsize=16);

The numerical data shows a few outliers in the box plots. The test numerical data also looks similar to the train data.
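The points drawn beyond the whiskers follow the usual 1.5 × IQR rule, which can also be applied numerically. The sketch below is a minimal version of that rule on made-up values; the helper name `iqr_outlier_mask` is hypothetical.

```python
import numpy as np

def iqr_outlier_mask(values, whisker=1.5):
    """Return True for values outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - whisker * iqr) | (values > q3 + whisker * iqr)

sample = [1.0, 2.0, 3.0, 4.0, 100.0]
mask = iqr_outlier_mask(sample)
print(mask.sum(), "outlier(s) flagged")
```

Counting flagged rows per column gives a quick numeric complement to reading the box plots by eye.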



## Number of categorical unique values

In [None]:
fig = plt.figure(figsize=(10,5))
sns.barplot(y=train.drop(['breath_id'],axis=1)[cat_columns].nunique().values, x=train[cat_columns].nunique().index, color='blue', alpha=.5)
plt.xticks(rotation=0)
plt.title('Number of categorical unique values',fontsize=16);

## Categorical features distribution

In [None]:
labels = train['u_out'].astype('category').cat.categories.tolist()
counts = train['u_out'].value_counts()
sizes = [counts.loc[var_cat] for var_cat in labels]  # look up each count by its label
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True) #autopct is show the % on plot
ax1.axis('equal')
plt.show()

In [None]:
train['R'].astype('category').cat.categories.tolist()

In [None]:
labels = train['R'].astype('category').cat.categories.tolist()
counts = train['R'].value_counts()
sizes = [counts.loc[var_cat] for var_cat in labels]  # look up each count by its label, not by position
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True) #autopct is show the % on plot
ax1.axis('equal')
plt.show()

In [None]:
labels = train['C'].astype('category').cat.categories.tolist()
counts = train['C'].value_counts()
sizes = [counts.loc[var_cat] for var_cat in labels]  # look up each count by its label, not by position
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True) #autopct is show the % on plot
ax1.axis('equal')
plt.show()

## Target 
### Pressure graph

In [None]:
plt.plot(range(80),train['pressure'].iloc[0:80])
plt.plot(range(80),train['pressure'].iloc[80:160])
plt.plot(range(80),train['pressure'].iloc[160:240])
plt.plot(range(80),train['pressure'].iloc[240:320])
plt.plot(range(80),train['pressure'].iloc[320:400])
plt.plot(range(80),train['pressure'].iloc[400:480])

In [None]:
cat_columns


In [None]:
fig = plt.figure(figsize=(30,10))
grid =  gridspec.GridSpec(1,3,figure=fig,hspace=.4,wspace=.4)
n =1
for i in range(1):
    for j in range(len(cat_columns)):
        ax = fig.add_subplot(grid[i, j])
        sns.violinplot(data =  train, y = 'pressure' , x =cat_columns[j] ,ax=ax, alpha =.7, fill=True,palette='viridis')
        ax.set_title(cat_columns[j],fontsize=14)
        ax.set_xlabel('')
        ax.set_ylabel('')
        n += 1
fig.suptitle('Violin plot of target with categorical features', fontsize=16,y=.93);

## KDE plot of target with features

In [None]:
fig = plt.figure(figsize=(30,10))
grid =  gridspec.GridSpec(1,3,figure=fig,hspace=.4,wspace=.4)
n =1
for i in range(1):
    for j in range(len(cat_columns)):
        ax = fig.add_subplot(grid[i, j])
        sns.kdeplot(data =  train,hue  = cat_columns[j] , x ='pressure' ,ax=ax, alpha =.7, fill=True,palette='viridis')
        ax.set_title(cat_columns[j],fontsize=14)
        ax.set_xlabel('')
        ax.set_ylabel('')
        n += 1
fig.suptitle('kdeplot plot of target with categorical features', fontsize=16,y=.93);

In [None]:
# Categorical features distribution 
i = 1
plt.figure()
fig, ax = plt.subplots(2, 2,figsize=(10,10))
for feature in cat_columns:
    plt.subplot(2, 2,i)
    sns.histplot(train[feature],color="blue", label='train')
    sns.histplot(test[feature],color="olive", label='test')
    plt.xlabel(feature, fontsize=9); plt.legend()
    i += 1
plt.show()

###  exploring target data main statistics

In [None]:
train.pressure.nunique()

In [None]:
train['pressure'].describe()

In [None]:
train['pressure'].describe().iloc[1:].plot.barh(color=v0,alpha=.5,figsize=(12,5))
plt.title('Target data statistics',fontsize=16)
plt.yticks(fontsize=14)
plt.xticks(np.arange(0,10.8,.5));

In [None]:
train.pressure.value_counts()[train.pressure.value_counts()<1000].plot.kde()

In [None]:
# Distribution of the discrete pressure values
plt.figure()
sns.countplot(x=train['pressure'])
plt.xlabel('pressure', fontsize=9)
plt.xticks(rotation=45)
plt.show()

In [None]:
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy import stats
f, axes = plt.subplots(nrows=3, ncols=1, figsize=(12, 12))
f.suptitle('pressure', fontsize=16)
g = sns.kdeplot(train['pressure'], fill=True, label="%.2f"%(train['pressure'].skew()), ax=axes[0])
g = g.legend(loc="best")
stats.probplot(train['pressure'], plot=axes[1])
sns.boxplot(x='pressure', data=train, orient='h', ax=axes[2]);
plt.tight_layout()
plt.show()

## Box plot of target data with percentile markers from .1% to 99%

In [None]:
y=train['pressure']
plt.figure(figsize=(12,6))
sns.boxplot(x=y, width=.4);
plt.axvline(np.percentile(y,.1), label='.1%', c='blue', linestyle=':', linewidth=3)
plt.axvline(np.percentile(y,2), label='2%', c='gold', linestyle=':', linewidth=3)
plt.axvline(np.percentile(y,10), label='10%', c='red', linestyle=':', linewidth=3)
plt.axvline(np.percentile(y,90), label='90%', c='red', linestyle=':', linewidth=3)
plt.axvline(np.percentile(y,99), label='99%', c='blue', linestyle=':', linewidth=3)
plt.axvline(np.percentile(y,98), label='98%', c='gold', linestyle=':', linewidth=3)
plt.legend()
plt.title('Box plot of target data', fontsize=16)
plt.xticks(np.arange(0,10.8,.5));

## Bin Target :

In [None]:
bins = [np.percentile(y,0),np.percentile(y,10),  np.percentile(y,90), np.percentile(y,100)]
# Bin labels
labels1 = [ 'Low', 'Medium', 'High']
trainessai=train.copy()
# Bin the continuous target (pressure) using these boundaries;
# include_lowest=True keeps the minimum value inside the first bin
trainessai['target_binned'] = pd.cut(trainessai['pressure'], 
                                bins=bins,labels=labels1, include_lowest=True)
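One detail of `pd.cut` matters when the first edge is the minimum of the data: intervals are open on the left by default, so the minimum falls outside every bin unless `include_lowest=True` is passed. The sketch below shows this on ten made-up values.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(1.0, 11.0))  # 1.0, 2.0, ..., 10.0
edges = [np.percentile(values, q) for q in (0, 10, 90, 100)]
names = ["Low", "Medium", "High"]

# Default: the first interval is (min, p10], so the minimum itself is not binned
default_bins = pd.cut(values, bins=edges, labels=names)
# include_lowest=True closes the first interval on the left
closed_bins = pd.cut(values, bins=edges, labels=names, include_lowest=True)

print(default_bins.isna().sum(), closed_bins.isna().sum())
print(closed_bins.value_counts())
```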

In [None]:
labels = trainessai['target_binned'].astype('category').cat.categories.tolist()
counts = trainessai['target_binned'].value_counts()
sizes = [counts[var_cat] for var_cat in labels]
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True) #autopct is show the % on plot
ax1.axis('equal')
plt.show()

In [None]:
del train 
del test 

## t-SNE visualization of high-dimensional data

t-SNE is a powerful technique, but when exactly should you use it? It is most useful when you want to visually explore the patterns in a high-dimensional dataset by embedding it in two dimensions.
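As a small, self-contained illustration (synthetic data, not the ventilator dataset), the sketch below embeds two well-separated 5-dimensional blobs into 2 dimensions with scikit-learn's `TSNE`. The blob sizes, the perplexity value, and the random seeds are arbitrary choices for the example.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two synthetic clusters in 5 dimensions
blob_a = rng.normal(loc=0.0, scale=1.0, size=(20, 5))
blob_b = rng.normal(loc=8.0, scale=1.0, size=(20, 5))
points = np.vstack([blob_a, blob_b])

# perplexity must be smaller than the number of samples
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(points)
print(embedding.shape)
```

The resulting two columns can be scattered and colored by any label of interest; t-SNE is slow on large inputs, which is why only a subset of rows is used here.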


In [None]:
train = pd.read_csv('../input/ventilator-pressure-prediction/train.csv')

In [None]:
%%time 
m = TSNE()
df_numeric =train.drop(['id','breath_id','pressure'], axis=1).iloc[0:8000].select_dtypes(include='number')
df_numeric=df_numeric.dropna()
X_train =RobustScaler().fit_transform(df_numeric)
del df_numeric 
# Fit and transform the t-SNE model on the numeric dataset
tsne_features = m.fit_transform(X_train)
print(tsne_features.shape)

In [None]:
trainessai=trainessai.iloc[0:8000].copy()
trainessai['x']=tsne_features[:, 0]
trainessai['y']=tsne_features[:, 1]
# Color the points according to the pressure value
sns.scatterplot(x='x', y='y', hue='pressure', data=trainessai)
# Show the plot
plt.show()

In [None]:
trainessai=trainessai.iloc[0:8000].copy()
trainessai['x']=tsne_features[:, 0]
trainessai['y']=tsne_features[:, 1]
# Color the points according to the binned target
sns.scatterplot(x='x', y='y', hue='target_binned', data=trainessai)
# Show the plot
plt.show()

# Kmeans : 


In [None]:
# Import MiniBatchKmeans 
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import  RobustScaler 
from yellowbrick.cluster import KElbowVisualizer

In [None]:
kmeans1 = MiniBatchKMeans(n_clusters=3 ,random_state=42)
kmean_label1= kmeans1.fit_predict(X_train)
print(kmean_label1)

In [None]:
train.shape

In [None]:
train1=train.drop(['id','breath_id'], axis=1).iloc[0:8000].select_dtypes(include='number')

train1['cluster'] = kmean_label1
train1['cluster'] = train1['cluster'].astype('object')


In [None]:
train1.columns

In [None]:
# Plot the target against each feature, colored by cluster
features = [c for c in train1.columns if c not in ('pressure', 'cluster')]
fig = plt.figure(figsize=(18,26))#,constrained_layout=True)
grid =  gridspec.GridSpec(3, 2, figure= fig, hspace= .2, wspace= .05)

for n, feature in enumerate(features):
    # Fill the grid row by row, one feature per panel
    ax = fig.add_subplot(grid[n // 2, n % 2])
    sns.scatterplot(data=train1, y='pressure', x=feature, hue= 'cluster', ax=ax, palette='viridis', alpha=.6 )
    ax.set_title(feature,fontsize=16)
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.legend(loc='lower left',ncol=20)
        
fig.suptitle('Scatter plot of Target, Numerical and Cluster features', fontsize=20,y=.90)
fig.text(0.11,0.5, "Target", ha="center", va="center", rotation=90, fontsize=18);

In [None]:
#Getting the Centroids
centroids = kmeans1.cluster_centers_ 
#Getting unique labels
u_labels = np.unique(kmean_label1)
 
#plotting the results:
 
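# Plot in the scaled space the centroids live in. Assuming the column order
# R, C, time_step, u_in, u_out, indices 2 and 3 are time_step and u_in.
for i in u_labels:
    plt.scatter(X_train[kmean_label1 == i , 2] , X_train[kmean_label1 == i , 3] , label = i)

plt.scatter(centroids[:,2] , centroids[:,3] , s = 80, color = 'k')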
plt.legend()
plt.show()

In [None]:
import seaborn as sns 
red = sns.light_palette("red", as_cmap=True)
cross_tab=pd.crosstab(train1['cluster'], train1['pressure'], margins = True)
H=cross_tab/cross_tab.loc["All"] # Divide by column totals
H.style.background_gradient(cmap=red)

In [None]:
cross_tab

In [None]:
# Plot each pair of consecutive columns, colored by cluster
plot_columns = [c for c in train1.columns if c != 'cluster']
fig = plt.figure(figsize=(18,26))#,constrained_layout=True)
grid =  gridspec.GridSpec(3, 2, figure= fig, hspace= .2, wspace= .05)
for n in range(len(plot_columns) - 1):
    x_col, y_col = plot_columns[n], plot_columns[n + 1]
    ax = fig.add_subplot(grid[n // 2, n % 2])
    sns.scatterplot(data=train1, y=y_col, x=x_col, hue= 'cluster', ax=ax, palette='viridis', alpha=.6 )
    ax.set_title(y_col + ' vs ' + x_col,fontsize=16)
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.legend(loc='lower left',ncol=20)
        
fig.suptitle('Scatter plot of consecutive column pairs by cluster', fontsize=20,y=.90);

In [None]:


# Create a series out of the cluster column
cluster = train1.cluster

# Get the counts of each category
cluster_counts = cluster.value_counts()

# Print the count values for each category
print(cluster_counts)



In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="darkgrid")
sns.barplot(x=cluster_counts.index, y=cluster_counts.values, alpha=0.9)
plt.title('Frequency Distribution of cluster')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('cluster', fontsize=12)
plt.show()

In [None]:
# select non-numeric columns
cat_columns = train.drop(['id','pressure','breath_id'], axis=1).select_dtypes(exclude=['int64','float64']).columns

### Num Features 

In [None]:
# select the float columns
num_columns = train.drop(['id','pressure','breath_id'], axis=1).select_dtypes(include=['int64','float64']).columns

In [None]:
all_columns = (num_columns)
print(cat_columns)
print(num_columns)
print(all_columns)

## Check that we have all columns

In [None]:
if set(all_columns) == set(train.drop(['id','pressure','breath_id'], axis=1).columns):
    print('Ok')
else:
    # Let's see the difference 
    print('in all_columns but not in train  :', set(all_columns) - set(train.drop(['id','pressure','breath_id'], axis=1).columns))
    print('in train but not in all_columns :', set(train.drop(['id','pressure','breath_id'], axis=1).columns) - set(all_columns))

<a id="top"></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home"><center>Data Preparation</center></h3>

## Data preprocessing

Data preprocessing comes after you've cleaned up your data and after you've done some exploratory analysis to understand your dataset. Once you understand your dataset, you'll probably have some idea about how you want to model your data. Machine learning models in Python require numerical input, so if your dataset has categorical variables, you'll need to transform them. Think of data preprocessing as a prerequisite for modeling:

Outlier Handling

Scaling

Feature Engineering

Feature Selection 
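As a hedged sketch of two of these steps, the example below one-hot encodes the lung attributes and adds a per-breath cumulative sum of `u_in` as an engineered feature. The toy values and the feature name `u_in_cumsum` are assumptions made for illustration, not the preprocessing finally chosen in this notebook.

```python
import pandas as pd

# Toy rows using the dataset's column names with made-up values
toy = pd.DataFrame({
    "breath_id": [1, 1, 1, 2, 2],
    "R": [5, 5, 5, 20, 20],
    "C": [10, 10, 10, 50, 50],
    "u_in": [0.0, 2.0, 3.0, 1.0, 4.0],
})

# Feature engineering: running total of valve input within each breath
toy["u_in_cumsum"] = toy.groupby("breath_id")["u_in"].cumsum()

# Categorical handling: turn R and C into indicator columns
encoded = pd.get_dummies(toy, columns=["R", "C"])
print(encoded.columns.tolist())
```

Grouping by `breath_id` keeps the running total from leaking across breaths, which matters because each breath is a separate time series.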




<a id="top"></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home"><center>Data Modeling</center></h3>

Modeling is the part of the Cross-Industry Standard Process for Data Mining (CRISP-DM) process model that I like best. Our data is already in good shape, and now we can search for useful patterns in it.

<a id="top"></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home"><center>Data Evaluation  </center></h3>




**MAE**

Regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (y).

Regression is different from classification, which involves predicting a category or class label.

Evaluating Regression Models

A common question by beginners to regression predictive modeling projects is:

    How do I calculate accuracy for my regression model?

Accuracy (e.g. classification accuracy) is a measure for classification, not regression.

We cannot calculate accuracy for a regression model.

The skill or performance of a regression model must be reported as an error in those predictions.

This makes sense if you think about it. If you are predicting a numeric value like a height or a dollar amount, you don’t want to know if the model predicted the value exactly (this might be intractably difficult in practice); instead, we want to know how close the predictions were to the expected values.

Error addresses exactly this and summarizes on average how close predictions were to their expected values.

There are three error metrics that are commonly used for evaluating and reporting the performance of a regression model; they are:

    Mean Squared Error (MSE).
    Root Mean Squared Error (RMSE).
    Mean Absolute Error (MAE)

**Mean Absolute Error**, or MAE, is a popular metric because, like RMSE, the units of the error score match the units of the target value that is being predicted.

Unlike the RMSE, the changes in MAE are linear and therefore intuitive.

That is, MSE and RMSE punish larger errors more than smaller errors, inflating or magnifying the mean error score. This is due to the square of the error value. The MAE does not give more or less weight to different types of errors and instead the scores increase linearly with increases in error.

As its name suggests, the MAE score is calculated as the average of the absolute error values. Absolute or abs() is a mathematical function that simply makes a number positive. Therefore, the difference between an expected and predicted value may be positive or negative and is forced to be positive when calculating the MAE.

The MAE can be calculated as follows:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$

where $y_i$ is the i-th expected value in the dataset, $\hat{y}_i$ is the i-th predicted value, and $|\cdot|$ is the absolute value.
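To make the difference between the three metrics concrete, the sketch below computes MAE, MSE, and RMSE with NumPy for two sets of made-up predictions that share the same MAE. The helper name `regression_errors` is hypothetical.

```python
import numpy as np

def regression_errors(expected, predicted):
    """Return MAE, MSE and RMSE for two equally sized sequences."""
    expected = np.asarray(expected, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = expected - predicted
    mae = float(np.mean(np.abs(residuals)))
    mse = float(np.mean(residuals ** 2))
    return {"mae": mae, "mse": mse, "rmse": float(np.sqrt(mse))}

# Four errors of size 1 versus a single error of size 4
even = regression_errors([1, 2, 3, 4], [2, 3, 4, 5])
spiky = regression_errors([1, 2, 3, 4], [1, 2, 3, 8])
print(even)
print(spiky)
```

Both cases have the same MAE, but the single large error doubles the RMSE, which is the linear versus squared weighting described above.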

We have done the EDA needed to choose the best preprocessing steps and begin modeling.
Work is in progress...

**Upvote if you find it useful.**