
<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home"><center>CRISP-DM Methodology</center></h3>

* [Business Understanding](#1)
* [Data Understanding](#2)
* [Data Preparation](#3)
* [Data Modeling](#4)   
* [Data Evaluation](#5)

In this section we give an overview of the method we selected for engineering our solution. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It is an open standard guide that describes common approaches used by data mining experts. CRISP-DM describes the typical phases of a project, including the tasks in each phase, and provides an overview of the data mining lifecycle. The lifecycle model consists of six phases, with arrows indicating the most important and frequent dependencies between phases. The sequence of the phases is not strict; in fact, most projects move back and forth between phases as necessary. It starts with business understanding, and then moves to data understanding, data preparation, modeling, evaluation, and deployment. The CRISP-DM model is flexible and can be customized easily.
## Business Understanding

    Tasks:

    1.Determine business objectives

    2.Assess situation

    3.Determine data mining goals

    4.Produce project plan

## Data Understanding
     Tasks:

    1.Collect data

    2.Describe data

    3.Explore data    

## Data Preparation
    Tasks
    1.Data selection

    2.Data preprocessing

    3.Feature engineering

    4.Dimensionality reduction

            Steps:

            Data cleaning

            Data integration

            Data sampling

            Data dimensionality reduction

            Data formatting

            Data transformation

            Scaling

            Aggregation

            Decomposition

## Data Modeling :

Modeling is the part of the Cross-Industry Standard Process for Data Mining (CRISP-DM) process model that I like best. Our data is already in good shape, and now we can search for useful patterns in our data.

    Tasks
    1. Select modeling technique

    2. Generate test design

    3. Build model

    4. Assess model

## Data Evaluation :
    Tasks

    1.Evaluate Result

    2.Review Process

    3.Determine next steps

<a id="top"></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home">
<center>Business Understanding</center></h3>

    
What do doctors do when a patient has trouble breathing? They use a ventilator to pump oxygen into a sedated patient's lungs via a tube in the windpipe. But mechanical ventilation is a clinician-intensive procedure, a limitation that was prominently on display during the early days of the COVID-19 pandemic. At the same time, developing new methods for controlling mechanical ventilators is prohibitively expensive, even before reaching clinical trials. High-quality simulators could reduce this barrier.

Current simulators are trained as an ensemble, where each model simulates a single lung setting. However, lungs and their attributes form a continuous space, so a parametric approach must be explored that would consider the differences in patient lungs.

Partnering with Princeton University, the team at Google Brain aims to grow the community around machine learning for mechanical ventilation control. They believe that neural networks and deep learning can better generalize across lungs with varying characteristics than the current industry standard of PID controllers.

In this competition, you’ll simulate a ventilator connected to a sedated patient's lung. The best submissions will take lung attributes compliance and resistance into account.

If successful, you'll help overcome the cost barrier of developing new methods for controlling mechanical ventilators. This will pave the way for algorithms that adapt to patients and reduce the burden on clinicians during these novel times and beyond. As a result, ventilator treatments may become more widely available to help patients breathe.

**Eval Metric**: The competition will be scored as the mean absolute error between the predicted and actual pressures during the inspiratory phase of each breath. The expiratory phase is not scored. The score is given by:

|X−Y|

where X is the vector of predicted pressure and Y is the vector of actual pressures across all breaths in the test set.
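
A minimal sketch of this metric is shown below. It assumes that the inspiratory phase corresponds to rows where u_out is 0; the function name inspiratory_mae and the toy arrays are illustrative only and are not part of the competition code.

```python
import numpy as np

def inspiratory_mae(y_true, y_pred, u_out):
    """Mean absolute error restricted to rows where u_out == 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumption: u_out == 0 marks the inspiratory (scored) phase.
    mask = np.asarray(u_out) == 0
    return float(np.mean(np.abs(y_true[mask] - y_pred[mask])))

# Toy example: the last two rows are expiratory and are ignored by the score.
actual = [10.0, 12.0, 15.0, 6.0, 5.0]
predicted = [11.0, 12.0, 13.0, 0.0, 0.0]
valve = [0, 0, 0, 1, 1]

score = inspiratory_mae(actual, predicted, valve)
print(score)
```

Because the expiratory rows are masked out, even very poor predictions there do not change the score.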
    
  **Reminder**

* id - globally-unique time step identifier across an entire file

* breath_id - globally-unique time step for breaths

* R - lung attribute indicating how restricted the airway is (in cmH2O/L/S). Physically, this is the change in pressure per change in flow (air volume per time). Intuitively, one can imagine blowing up a balloon through a straw. We can change R by changing the diameter of the straw, with higher R being harder to blow.

* C - lung attribute indicating how compliant the lung is (in mL/cmH2O). Physically, this is the change in volume per change in pressure. Intuitively, one can imagine the same balloon example. We can change C by changing the thickness of the balloon’s latex, with higher C having thinner latex and easier to blow.

* time_step - the actual time stamp.

* u_in - the control input for the inspiratory solenoid valve. Ranges from 0 to 100.

* u_out - the control input for the expiratory solenoid valve. Either 0 or 1.

* pressure - the airway pressure measured in the respiratory circuit, measured in cmH2O.
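
To make the roles of R and C concrete, the sketch below uses the textbook single-compartment lung model, in which airway pressure is the sum of a resistive term (R times flow) and an elastic term (delivered volume divided by C). This is only an illustration and not necessarily how the competition data was generated; the function name simulate_pressure and the constant-flow input are assumptions.

```python
import numpy as np

def simulate_pressure(R, C, flow_l_per_s, dt):
    """Pressure above baseline (cmH2O) for a single-compartment lung.

    R: resistance in cmH2O/L/s, C: compliance in mL/cmH2O,
    flow_l_per_s: inspiratory flow in L/s, dt: time step in seconds.
    """
    flow = np.asarray(flow_l_per_s, dtype=float)
    volume_ml = np.cumsum(flow) * dt * 1000.0  # delivered volume in mL
    return R * flow + volume_ml / C

flow = np.full(10, 0.5)  # hypothetical constant flow of 0.5 L/s for 1 second
stiff = simulate_pressure(R=20, C=10, flow_l_per_s=flow, dt=0.1)
compliant = simulate_pressure(R=20, C=50, flow_l_per_s=flow, dt=0.1)
print(stiff[-1], compliant[-1])
```

With the same flow, the lung with lower C reaches a higher pressure, which matches the balloon intuition above: thicker latex is harder to inflate.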
    
<a id="top"></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home">
<center>Data Understanding</center></h3>
    
    
    
Part 1 of this series is available here:
    
https://www.kaggle.com/bannourchaker/crispdm-1-dataunderstanding-part1
    
    
## Step 1: Import helpful libraries

In [None]:
# Load the libraries
import pandas as pd #To work with dataset
import numpy as np #Math library
import matplotlib.gridspec as gridspec
import seaborn as sns #Graph library that use matplot in background
import matplotlib.pyplot as plt #to plot some parameters in seaborn
import warnings
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PowerTransformer, StandardScaler,Normalizer,RobustScaler,MaxAbsScaler,MinMaxScaler,QuantileTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsClassifier
# Imputation, text features, column transformers and pipelines
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline, Pipeline

from sklearn.manifold import TSNE
# Metrics
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error
from sklearn.metrics import roc_curve
from datetime import datetime, date
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.linear_model import LinearRegression, RidgeCV

import lightgbm as lgbm
from catboost import CatBoostRegressor
import tensorflow as tf 
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import LearningRateScheduler
#import smogn
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import GradientBoostingRegressor,RandomForestRegressor
# LightGBM again, under the shorter alias lgb
import lightgbm as lgb
from scipy import sparse
from sklearn.neighbors import KNeighborsRegressor 
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans 
# Model selection
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression,f_classif
from sklearn.feature_selection import mutual_info_regression

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

from itertools import combinations
# Plotly
import pprint
from plotly.offline import iplot, init_notebook_mode
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px
from plotly import tools
import plotly.io as pio
pp = pprint.PrettyPrinter(indent=4)
pio.templates.default = "plotly_white"


import category_encoders as ce
import optuna
warnings.filterwarnings('ignore')

# Time series EDA 


## Step 2: Load the data

Next, we'll load the training and test data.

Note that the code cell below does not pass index_col=0, so id is kept as a regular column; the plots later in this notebook use it as the x-axis. (If you're not sure what index_col=0 does, try temporarily adding it and see how it changes the result.)


In [None]:
%%time
train = pd.read_csv('../input/ventilator-pressure-prediction/train.csv')
pressure_values = np.sort( train.pressure.unique() )
test = pd.read_csv('../input/ventilator-pressure-prediction/test.csv')
submission = pd.read_csv('../input/ventilator-pressure-prediction/sample_submission.csv')
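
Before plotting, it can help to confirm the structure of the data. The sketch below uses a tiny synthetic frame with the same column names (all values are made up) and checks that every breath has the same number of rows and that nothing is missing; on the real data the same checks could be run on train.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train.csv: 3 breaths of 4 time steps each (made-up values).
toy = pd.DataFrame({
    "id": np.arange(1, 13),
    "breath_id": np.repeat([1, 2, 3], 4),
    "time_step": np.tile([0.0, 0.03, 0.06, 0.09], 3),
    "pressure": np.linspace(5.0, 16.0, 12),
})

steps_per_breath = toy.groupby("breath_id").size()
n_missing = int(toy.isna().sum().sum())

print(steps_per_breath.unique())  # distinct breath lengths
print(n_missing)                  # total number of missing values
```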

### Visual Exploration
The first, and perhaps most popular, visualization for time series is the line plot.

In [None]:
import itertools
def plot_sample(dataframe, seed = 42):
    """ Plot one randomly sampled breath for each combination of R and C """
    
    np.random.seed(seed)
    
    cols = ['u_in', 'u_out', 'pressure']

    for (r, c) in list(itertools.product(dataframe.R.unique(), dataframe.C.unique())):
        
        subfig = make_subplots(specs=[[{"secondary_y": True}]])
        
        plot_data = dataframe[(dataframe.R.isin([r]) & dataframe.C.isin([c]))]
        sample_id = plot_data.breath_id.sample(n=1)
        plot_data = plot_data[plot_data.breath_id.isin(sample_id)]

        x_breath_changing_state = plot_data.loc[max(plot_data.loc[plot_data.u_out < 1].index), 'time_step']

        fig1 = px.line()
        fig1.add_scatter(x=plot_data.time_step, y=plot_data.pressure, name='pressure')
        fig1.add_scatter(x=plot_data.time_step, y=plot_data.u_in, name='u_in')
        fig1.add_vline(x_breath_changing_state)
        
        fig2 = px.line()
        fig2.add_scatter(x=plot_data.time_step, y=plot_data.u_out, name='u_out')
        fig2.update_traces(yaxis="y2")

        subfig.add_traces(fig1.data + fig2.data)
        subfig.for_each_trace(lambda t: t.update(line=dict(color=t.marker.color)))
                
        subfig.layout.title = f'Sample {sample_id.values[0]} - R={r}, C={c}'
        subfig.layout.yaxis1.title="u_in/pressure Y"
        subfig.layout.yaxis2.title="u_out Y"
        
        subfig.show()
        #title=f'Sample {sample_id.values[0]} - R={r}, C={c}'
        
plot_sample(train)

## Target plot as a complete time series

In [None]:
df = train.iloc[0:8000].copy()

In [None]:
fig = px.line(df, x="id", y=df.columns,
              title='Data labels')
fig.show()

In [None]:

fig = px.line(df, x='id', y="pressure")
fig.show()

The line plot is quite dense.

Sometimes it can help to change the style of the plot; for example, to draw individual dots instead of a connected line.

Below, px.scatter draws each observation as a dot rather than joining the observations with a line.

In [None]:
fig = px.scatter(df, x='id', y="pressure")
fig.show()

## Grouped Time Series


In [None]:
groups = df.groupby(df['breath_id']).agg(['sum', 'mean', 'max','count'])
groups

In [None]:

df.groupby(df['breath_id']).pressure.plot(figsize=(10, 6))

##  Time Series Histogram and Density Plots
Another important visualization is of the distribution of observations themselves.

This means a plot of the values without the temporal ordering.

Some linear time series forecasting methods assume a well-behaved distribution of observations (i.e. a bell curve or normal distribution). This can be explicitly checked using tools like statistical hypothesis tests. But plots can provide a useful first check of the distribution of observations both on raw observations and after any type of data transform has been performed.

The example below creates a histogram of the pressure observations. A histogram groups values into bins, and the frequency or count of observations in each bin can provide insight into the underlying distribution of the observations.
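
The binning step can be sketched without any plotting library. The sample below is synthetic (drawn from a normal distribution purely for illustration) and the choice of 10 bins is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=10.0, scale=3.0, size=1000)  # made-up sample

# counts[i] is the number of values falling between edges[i] and edges[i + 1]
counts, edges = np.histogram(values, bins=10)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {count}")
```

Every observation lands in exactly one bin, so the counts always sum to the number of observations.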

In [None]:
fig = px.histogram(df, x="pressure")
fig.show()



We can get a better idea of the shape of the distribution of observations by using a density plot.

This is like the histogram, except a function is used to fit the distribution of observations and a nice, smooth line is used to summarize this distribution.
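
To illustrate the idea, the sketch below implements a Gaussian kernel density estimate by hand: each observation contributes a small bell curve, and the curves are averaged. The function name, the synthetic sample and the rule-of-thumb bandwidth are assumptions for illustration; plotting libraries may choose the bandwidth differently.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian kernels centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = rng.normal(10.0, 3.0, size=500)  # made-up sample

# Rule-of-thumb bandwidth for roughly bell-shaped data (an assumption here).
bandwidth = 1.06 * samples.std() * len(samples) ** (-1 / 5)

grid = np.linspace(samples.min() - 10, samples.max() + 10, 400)
density = gaussian_kde_1d(samples, grid, bandwidth)

# A density should integrate to about 1; approximate the integral with a sum.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```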

Below is an example of a density plot of the pressure values.

In [None]:
import plotly.figure_factory as ff
hist_data = [df.pressure]
group_labels = ['distplot of pressure'] # name of the dataset
fig = ff.create_distplot(hist_data, group_labels)
fig.show()

Histograms and density plots provide insight into the distribution of all observations, but we may be interested in the distribution of values by time interval.

Another type of plot that is useful to summarize the distribution of observations is the **box and whisker** plot. This plot draws a box around the 25th and 75th percentiles of the data that captures the middle 50% of observations. A line is drawn at the 50th percentile (the median) and whiskers are drawn above and below the box to summarize the general extents of the observations. Dots are drawn for outliers outside the whiskers or extents of the data.
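
The numbers behind a box and whisker plot can be computed directly. The sketch below assumes the common convention that points further than 1.5 times the interquartile range from the box are drawn as outliers; the values are made up.

```python
import numpy as np

values = np.array([5.0, 6.0, 6.5, 7.0, 7.5, 8.0, 9.0, 30.0])  # made-up pressures

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # height of the box

# Assumed convention: fences at 1.5 * IQR beyond the box edges.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(outliers)
```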

Box and whisker plots can be created and compared for each interval in a time series. Here the natural interval is the breath, so we draw one box per breath_id.

In [None]:
sns.set_style("whitegrid") 
sns.boxplot(x = 'breath_id', y = 'pressure', data = df.iloc[0:800])

In [None]:
sns.boxplot(x = 'breath_id', y = 'u_in', data = df.iloc[0:800])

In [None]:
sns.boxplot(x = 'breath_id', y = 'time_step', data = df.iloc[0:800])

In [None]:
plt.figure(figsize=(20,5))
series2 = df.iloc[0:800].pivot(index="time_step", columns="breath_id", values="pressure")

ax = sns.heatmap(series2, annot=True,linewidths=.5)

## Time Series Lag Scatter Plots
Time series modeling assumes a relationship between an observation and the previous observation.

Previous observations in a time series are called lags, with the observation at the previous time step called lag1, the observation at two time steps ago lag2, and so on.

A useful type of plot to explore the relationship between each observation and a lag of that observation is called the scatter plot.

Pandas has a built-in function for exactly this called the lag plot. It plots the observation at time t on the x-axis and the lag1 observation (t-1) on the y-axis.

If the points cluster along a diagonal line from the bottom-left to the top-right of the plot, it suggests a positive correlation relationship.
If the points cluster along a diagonal line from the top-left to the bottom-right, it suggests a negative correlation relationship.
Either relationship is good as they can be modeled.

More points tighter in to the diagonal line suggests a stronger relationship and more spread from the line suggests a weaker relationship.

A ball in the middle or a spread across the plot suggests a weak or no relationship.
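
As a small illustration of what a lag is, the sketch below builds lag1 and lag2 columns with pandas shift() on a made-up series and measures the correlation between the series and its lag1 values.

```python
import pandas as pd

series = pd.Series([5.0, 6.0, 8.0, 11.0, 15.0, 14.0, 12.0, 9.0])  # made-up values

lagged = pd.DataFrame({
    "t": series,
    "lag1": series.shift(1),  # value one step earlier
    "lag2": series.shift(2),  # value two steps earlier
})
print(lagged)

# Rows with a missing lag are ignored when computing the correlation.
lag1_corr = lagged["t"].corr(lagged["lag1"])
print(round(lag1_corr, 3))
```

The first rows of the lag columns are empty because there is no earlier observation to copy from.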

Below is an example of a lag plot for the first 80 pressure values.

In [None]:
from matplotlib import pyplot
from pandas.plotting import lag_plot
series3=df.loc[:79,'pressure']
lag_plot(series3)
pyplot.show()

The plot created from running the example shows a relatively strong positive correlation between observations and their lag1 values.

We can repeat this process for an observation and any lag value, for example a lag suggested by domain knowledge.

For example, we can create a scatter plot of each observation against each of its previous seven values. Below is an example of this for the pressure series.

First, a new DataFrame is created with the lag values as new columns. The columns are named appropriately. Then a new subplot is created that plots each observation against a different lag value.

In [None]:
from pandas import DataFrame
from pandas import concat
from matplotlib import pyplot
from pandas.plotting import scatter_matrix
plt.figure(figsize=(20,10))
values = DataFrame(series3.values)
lags = 7
columns = [values]
for i in range(1,(lags + 1)):
	columns.append(values.shift(i))
dataframe = concat(columns, axis=1)
columns = ['t+1']
for i in range(1,(lags + 1)):
	columns.append('t-' + str(i))
dataframe.columns = columns
pyplot.figure(1)
for i in range(1,(lags + 1)):
	ax = pyplot.subplot(240 + i)
	ax.set_title('t+1 vs t-' + str(i))
	pyplot.scatter(x=dataframe['t+1'].values, y=dataframe['t-'+str(i)].values)
pyplot.show()

We can quantify the strength and type of relationship between observations and their lags.

In statistics, this is called correlation, and when calculated against lag values in time series, it is called autocorrelation (self-correlation).

A correlation value calculated between two groups of numbers, such as observations and their lag1 values, results in a number between -1 and 1. The sign of this number indicates a negative or positive correlation respectively. A value close to zero suggests a weak correlation, whereas a value closer to -1 or 1 indicates a strong correlation.
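
A short sketch of such coefficients, using the pandas Series.autocorr() method on a made-up periodic signal (a sine wave with a period of 40 steps, chosen only for illustration):

```python
import numpy as np
import pandas as pd

t = np.arange(80)
series = pd.Series(np.sin(2 * np.pi * t / 40))  # made-up signal, period 40

# Correlation between the series and a copy of itself shifted by `lag` steps.
for lag in (1, 10, 20):
    print(lag, round(series.autocorr(lag=lag), 3))

r1 = series.autocorr(lag=1)
r20 = series.autocorr(lag=20)
```

Neighbouring values move together, so the lag1 coefficient is close to 1, while a shift of half a period flips the sign of the signal and gives a coefficient close to -1.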

Correlation values, called correlation coefficients, can be calculated for each observation and different lag values. Once calculated, a plot can be created to help better understand how this relationship changes over the lag.

This type of plot is called an autocorrelation plot and Pandas provides this capability built in, called the autocorrelation_plot() function.

The example below creates an autocorrelation plot for the pressure series.

In [None]:
from pandas.plotting import autocorrelation_plot

autocorrelation_plot(series3)
pyplot.show()


The statsmodels library also provides a version of the plot in the plot_acf() function as a line plot.

In [None]:
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(series3, lags=31)
pyplot.show()

In [None]:
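# sample distinct breaths (sampling rows directly could pick the same breath twice)
breath_ids = train.breath_id.drop_duplicates().sample(n = 5000//80, replace = False)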
train_EDA = train.loc[train.breath_id.isin(breath_ids), :].reset_index(drop = True)
fig = px.histogram(
    train_EDA, 
    x="pressure",
    marginal="box",
    color="u_out",
    hover_data=train_EDA.columns,
    nbins = 50
)

fig.update_layout(
    title="Pressure distribution"
)

fig.show()

In [None]:
fig = px.histogram(
    train_EDA, 
    x="u_in",
    marginal="box",
    color="u_out",
    hover_data=train_EDA.columns,
    nbins = 50
)

fig.update_layout(
    title="u_in distribution"
)

fig.show()

In [None]:
dict_data = dict(train_EDA.u_out.value_counts())

fig = go.Figure(
    data=[
        go.Bar(
            x = list(dict_data.keys()),
            y = list(dict_data.values())
        )
    ],
    layout_title_text="u_out distribution",
)

fig.update_layout(
    xaxis = dict(
        tickmode = 'linear',
        tick0 = 0,
        dtick = 1
    )
)

fig.show()

del dict_data

In [None]:
def display_ts_examples(dataframe, graph_indexes = np.arange(9)):
    import random 
    
    # plot a few randomly chosen breaths
    plt.figure(figsize=(12,12))
    
    for graph_index in graph_indexes:
        
        breath_id = random.choice(dataframe.breath_id.unique())
        
        # define subplot
        plt.subplot(330 + 1 + graph_index)
        plt.title('Breath id: %s \n'%breath_id,
                 fontsize=18)
        # plot pressure against time_step for this breath
        ts_to_plot = dataframe.loc[dataframe.breath_id == breath_id, ['time_step', 'pressure']]
        pd.Series(ts_to_plot.pressure.values, index=ts_to_plot.time_step.values).plot()
        
    plt.subplots_adjust(bottom = 0.001)  # the bottom of the subplots of the figure
    plt.subplots_adjust(top = 1.25)
    # show the figure
    plt.show()
    
display_ts_examples(train_EDA)

In [None]:
!pip install tslearn
from tqdm import tqdm
from tslearn.clustering import TimeSeriesKMeans

## Create clusters with pressure data

https://towardsdatascience.com/dynamic-time-warping-3933f25fcdd

https://rtavenar.github.io/blog/dtw.html

https://medium.com/walmartglobaltech/time-series-similarity-using-dynamic-time-warping-explained-9d09119e48ec
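
The core of dynamic time warping (DTW) is a small dynamic program. The sketch below is a plain, unoptimized version that uses the absolute difference as the local cost; the function name dtw_distance is illustrative, and library implementations such as tslearn may use a different local cost, so their values need not match.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute difference as local cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # cost[i, j] = cheapest alignment of the first i points of a with the first j of b
    cost = np.full((len(a) + 1, len(b) + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i, j] = local + min(cost[i - 1, j],      # repeat b[j - 1]
                                     cost[i, j - 1],      # repeat a[i - 1]
                                     cost[i - 1, j - 1])  # advance both
    return float(cost[-1, -1])

same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])  # same shape, different speed
different = dtw_distance([1, 2, 3], [3, 2, 1])      # opposite shape
print(same_shape, different)
```

Two series with the same shape but different timing get a distance of zero, which is why DTW suits comparing breaths whose pressure curves are similar but shifted or stretched.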

In [None]:
def generate_matrix_cluster(dataframe, n = 300, seed = 42):
    """ Build an array of pressure series (one row per breath) for DTW clustering """
    
    np.random.seed(seed)
    matrix = []

    # use the breaths present in the dataframe (at most n) instead of a global variable
    for breath_id in tqdm(dataframe.breath_id.unique()[:n]):
        df_ = dataframe.loc[dataframe.breath_id == breath_id, ['time_step', 'pressure']]
        matrix.append(df_.pressure.values)
        
    # shape (n_breaths, n_time_steps, 1);
    # this assumes every breath has the same number of time steps
    matrix = np.array(matrix)[:, :, np.newaxis]
    
    return matrix

def run_clustering(matrix):
    """ Perform KMeans on matrix of time series """
    
    model = TimeSeriesKMeans(n_clusters=3, metric="dtw", max_iter=10)
    model.fit(matrix)
    
    return model

matrix = generate_matrix_cluster(train_EDA)
cluster_p_model = run_clustering(matrix)

In [None]:
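from yellowbrick.cluster.elbow import kelbow_visualizer

# fit on the breath-by-time-step pressure matrix built above (2-D view), not on the raw dataframe
kelbow_visualizer(TimeSeriesKMeans(metric="dtw", max_iter=10),
                  matrix[:, :, 0],
                  k=(2, 10),
                  timings=False)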

In [None]:
def display_ts_clusters(model, n_clusters=3):
    
    # plot one panel per cluster
    plt.figure(figsize=(12,12))
    
    for graph_index in range(n_clusters):
                
        # define subplot
        plt.subplot(330 + 1 + graph_index)
        plt.title('Cluster No: %s \n'%graph_index,
                 fontsize=18)
        
        # plot the cluster centre (the representative pressure curve of the cluster)
        array_cluster = model.cluster_centers_[graph_index]
        pd.Series(array_cluster.ravel()).plot()
        
    plt.subplots_adjust(bottom = 0.001)
    plt.subplots_adjust(top = 1.25)
    plt.show()
    
display_ts_clusters(cluster_p_model)

## Convert Dtypes 

In [None]:
train[train.select_dtypes(['float64']).columns] = train[train.select_dtypes(['float64']).columns].apply(pd.to_numeric)
train[train.select_dtypes(['object','int64']).columns] = train.select_dtypes(['object','int64']).apply(lambda x: x.astype('category'))
test[test.select_dtypes(['float64']).columns] = test[test.select_dtypes(['float64']).columns].apply(pd.to_numeric)
test[test.select_dtypes(['object','int64']).columns] = test.select_dtypes(['object','int64']).apply(lambda x: x.astype('category'))

### Num/Cat Features 

In [None]:
features = train.drop(['id','pressure','breath_id'], axis=1)
# after the dtype conversion above, every non-float column is categorical
cat_columns = features.select_dtypes(exclude=['float64']).columns
num_columns = features.select_dtypes(include=['float64']).columns

<a id="top"></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home"><center>Data Preparation</center></h3>

## Data preprocessing

Data preprocessing comes after you've cleaned up your data and after you've done some exploratory analysis to understand your dataset. Once you understand your dataset, you'll probably have some idea about how you want to model your data. Machine learning models in Python require numerical input, so if your dataset has categorical variables, you'll need to transform them. Think of data preprocessing as a prerequisite for modeling:

* Outlier Handling

* Scaling

* Feature Engineering

* Feature Selection
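
Two of these steps can be sketched on a toy frame. The example below one-hot encodes R and C (treated as categorical, as in the dtype conversion earlier) and scales u_in by its median and interquartile range. The values are made up, and this particular choice of encoding and scaling is an assumption, not a recommendation for the final model.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "R": [5, 20, 50, 20],
    "C": [10, 20, 50, 10],
    "u_in": [0.0, 10.0, 25.0, 100.0],
})  # made-up values

# Encoding: one indicator column per distinct value of R and of C.
encoded = pd.get_dummies(toy, columns=["R", "C"], dtype=float)

# Scaling: subtract the median and divide by the interquartile range,
# which is less sensitive to extreme values than mean/std scaling.
q1, med, q3 = np.percentile(toy["u_in"], [25, 50, 75])
encoded["u_in"] = (toy["u_in"] - med) / (q3 - q1)

print(encoded)
```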




<a id="top"></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home"><center>Data Modeling</center></h3>

Modeling is the part of the Cross-Industry Standard Process for Data Mining (CRISP-DM) process model that I like best. Our data is already in good shape, and now we can search for useful patterns in our data.

<a id="top"></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home"><center>Data Evaluation  </center></h3>




**MAE**

Regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (y).

Regression is different from classification, which involves predicting a category or class label.

Evaluating Regression Models

A common question by beginners to regression predictive modeling projects is:

    How do I calculate accuracy for my regression model?

Accuracy (e.g. classification accuracy) is a measure for classification, not regression.

We cannot calculate accuracy for a regression model.

The skill or performance of a regression model must be reported as an error in those predictions.

This makes sense if you think about it. If you are predicting a numeric value like a height or a dollar amount, you don’t need to know whether the model predicted the value exactly (this might be intractably difficult in practice); instead, you want to know how close the predictions were to the expected values.

Error addresses exactly this and summarizes on average how close predictions were to their expected values.

There are three error metrics that are commonly used for evaluating and reporting the performance of a regression model; they are:

    Mean Squared Error (MSE).
    Root Mean Squared Error (RMSE).
    Mean Absolute Error (MAE)

**Mean Absolute Error**, or MAE, is a popular metric because, like RMSE, the units of the error score match the units of the target value that is being predicted.

Unlike the RMSE, the changes in MAE are linear and therefore intuitive.

That is, MSE and RMSE punish larger errors more than smaller errors, inflating or magnifying the mean error score. This is due to the square of the error value. The MAE does not give more or less weight to different types of errors and instead the scores increase linearly with increases in error.

As its name suggests, the MAE score is calculated as the average of the absolute error values. Absolute or abs() is a mathematical function that simply makes a number positive. Therefore, the difference between an expected and predicted value may be positive or negative and is forced to be positive when calculating the MAE.

The MAE can be calculated as follows:

    MAE = (1 / N) * sum for i = 1 to N of abs(y_i - yhat_i)

Where y_i is the i’th expected value in the dataset, yhat_i is the i’th predicted value and abs() is the absolute function.
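
The three metrics can be compared side by side with a few lines of NumPy. The helper name regression_errors and the numbers are made up for illustration.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MAE, MSE and RMSE for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = float(np.mean(errors ** 2))
    return {
        "MAE": float(np.mean(np.abs(errors))),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
    }

expected = [10.0, 12.0, 14.0, 16.0]
small_errors = regression_errors(expected, [11.0, 11.0, 15.0, 15.0])  # four errors of 1
one_large = regression_errors(expected, [10.0, 12.0, 14.0, 20.0])     # one error of 4

print(small_errors)
print(one_large)
```

Both sets of predictions have the same total absolute error and therefore the same MAE, but the single large error doubles the RMSE, which illustrates how the squared metrics weight large errors more heavily.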

We have now done the EDA needed to choose the best preprocessing steps and begin modeling.
Work is in progress.

Upvote if you find it useful.