# Welcome to my First Kernel !


### table of contents :

- **Introduction**:
  - About California housing dataset : 
    - Dependent variable : median_house_value
    - Independent variables : 
        - Numerical variables : housing_median_age, total_rooms, total_bedrooms, population, households, median_income, longitude and latitude
        - Categorical variables : ocean_proximity
- **Data exploration** :
  - data's shape and type
  - distribution of variables
  - Find missing values
  - Correlation between independent and dependent variables
- **Feature Engineering** :
  - Add new variables
  - Handle missing values
  - Handle noisy data 
- **Preprocessing** :
  - Encode the Data
  - Split data into training and validation set
- **Modelling** :
  - The algorithms that I used in this notebook, plus a brief description of each one
      - Linear Models : LinearRegression, Lasso, Ridge, ElasticNet
      - Support Vector Machine Regressor
      - K-Nearest Neighbors Regressor
      - Decision Tree Regressor
      - Ensemble methods : RandomForestRegressor and AdaBoost
  - Fine Tune Algorithms
  - The metrics that I used to quantify the models' performance, plus a brief description of each one
      - MAE
      - MSE
      - RMSE
      - R2

- "Sorry if I made any mistakes. English is not my native language."

### Import Necessary Libraries

In [None]:
#data analysis libraries 
import pandas as pd
import numpy as np

#visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Set the maximum number of rows and columns that pandas displays
pd.set_option('display.max_rows', 33)
pd.set_option('display.max_columns', 111)

In [None]:
# Import the data
housing = pd.read_csv('../input/california-housing-prices/housing.csv')

###  Data Analysis

In [None]:
# Peek at the Data
housing.head()

In [None]:
# shape of data
housing.shape

In [None]:
# get a quick description of the data, in particular the number of non-null, and each attribute’s type 
housing.info()

- The total_bedrooms variable has missing values


In [None]:
# see a summary of the numerical attributes
housing.describe()

In [None]:
# Class Distribution 
housing['ocean_proximity'].value_counts()

In [None]:
# Find NaN values 
housing.isna().sum()

In [None]:
# Visualization of the variables' distribution

# 'longitude' was listed twice and 'population' was missing
columns = ['longitude', 'housing_median_age', 'total_rooms',
           'total_bedrooms', 'households', 'median_income', 'population',
           'latitude', 'median_house_value']


def distplot(nrows, ncols, columns):
    # Draw one histogram per column on an nrows x ncols grid of axes
    fig, ax = plt.subplots(nrows=nrows, ncols=ncols, figsize=(18, 12))

    for index, axis in enumerate(ax.flat):
        sns.distplot(housing[columns[index]], ax=axis, bins=40)


distplot(3, 3, columns)

##### Some observations :

These histograms reveal a few things :

- The median house value was capped at about \$500,000
- Machine Learning algorithms may learn that prices never go beyond that limit
- Some variables are tail heavy : they extend much farther to the right of the median than to the left. This may make it a bit harder for some Machine Learning algorithms to detect patterns.
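A common way to tame a tail-heavy variable is a log transform. The sketch below is only an illustration on synthetic data (the log-normal sample stands in for a column such as total_rooms, it is not taken from the housing dataset), and it compares the skewness before and after the transform.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed variable (an assumption, standing in for a real column)
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=7.0, sigma=0.8, size=5000))

# log1p computes log(1 + x), which stays defined when x is 0
logged = np.log1p(skewed)

skew_before = skewed.skew()
skew_after = logged.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness close to 0 means the distribution is roughly symmetric, which is usually easier for linear models to work with.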
    


#### Handle noisy data

In [None]:
# zoom in on the target variable
plt.figure()
plt.hist(housing['median_house_value'], bins=150)
plt.show()

In [None]:
# The shape of the noisy data (values above the 500000 cap)
housing[housing['median_house_value'] > 500000].shape

In [None]:
# The shape of data without noisy values
housing[housing['median_house_value'] <= 500000].shape

In [None]:
# Remove noisy values; .copy() avoids chained-assignment warnings when new columns are added later
housing = housing[housing['median_house_value'] <= 500000].copy()
housing.shape

In [None]:
# check if our target is clean
plt.figure()
plt.hist(housing['median_house_value'], bins=100)
plt.show()

### Feature Engineering
1 - Add new variables :

In [None]:
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

In [None]:
housing.head(3)

In [None]:
# Visualization of the distribution of these new variables

new_columns = ['rooms_per_household', 'bedrooms_per_room','population_per_household']

for col in new_columns :
    plt.figure()
    sns.distplot(housing[col])
    plt.show()


#### Correlation Matrix

In [None]:
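# Visualization of the correlation between all the numerical variables
plt.figure(figsize=(15,7))
# Keep only the numeric columns: recent pandas versions raise an error on text columns such as ocean_proximity
corr_matrix = housing.select_dtypes(include='number').corr()
sns.heatmap(corr_matrix, annot=True, cmap="YlGnBu")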

In [None]:
# The correlation between the target and the other variables
corr_matrix['median_house_value'].sort_values(ascending=False)

#### Visualization of Geographic Data  

In [None]:
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.4, label='population',
            figsize=(10,7),c='median_house_value',s=housing['population']/100, cmap=plt.get_cmap('cubehelix') ,colorbar=True)
              
plt.legend()

This geographical scatterplot tells us that housing prices are closely related to location (for example, proximity to the ocean) and to population density.

## Preprocessing

In [None]:
#housing = housing[['median_income','total_rooms','bedrooms_per_room','households','total_bedrooms','housing_median_age','population_per_household','ocean_proximity','longitude','latitude','median_house_value']]
#housing.head()

#### Remove Missing Values

In [None]:
housing.dropna(axis=0, inplace=True)
print(housing.isna().sum())

#### Handle categorical variables

In [None]:
from sklearn.preprocessing import OneHotEncoder

housing_cat = housing[['ocean_proximity']]

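# The encoder returns a sparse matrix by default; .toarray() makes it dense
# without relying on a constructor argument whose name differs between scikit-learn versions
transformer = OneHotEncoder()
housing_ohe = transformer.fit_transform(housing_cat).toarray()
housing_ohe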

In [None]:
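# Take the column names from the fitted encoder so that each label matches its column
# (the encoder sorts the categories, so a hand-written list can easily be in the wrong order)
OneHotEncoder = pd.DataFrame(data=housing_ohe, columns=transformer.categories_[0])

OneHotEncoder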

In [None]:
# Make sure that there are no missing values
OneHotEncoder.isna().sum()

In [None]:
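# Reset the index so that it lines up with the encoded frame before concatenation
# (no need to hard-code the number of rows)
housing = housing.reset_index(drop=True)
housing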

In [None]:
# Delete the variable 'ocean_proximity'
housing.drop(columns=['ocean_proximity'], axis=1, inplace=True)
housing

In [None]:
# Check their shapes before concatenation
print(OneHotEncoder.shape)
print(housing.shape)

In [None]:
# concatenate housing and OneHotEncoder
housing_prep = pd.concat([housing, OneHotEncoder], axis=1)
housing_prep


In [None]:
# Again, make sure after concatenation that there are no missing values left
plt.figure(figsize=(12,4))
sns.heatmap(housing_prep.isna(), cmap='YlGnBu')
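
The encode, reset-index and concatenate steps above could also be done in one call with `pd.get_dummies`, which keeps the original index and so avoids any misalignment. This is only a sketch on a small made-up frame (`toy` and its values are assumptions, not rows of the dataset).

```python
import pandas as pd

# Toy frame with a non-contiguous index, as left behind by row filtering
toy = pd.DataFrame(
    {"median_income": [8.3, 7.2, 5.6, 3.8],
     "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "<1H OCEAN"]},
    index=[0, 2, 5, 9],
)

# get_dummies replaces the text column with one indicator column per category
# and keeps the original index, so no reset_index / concat step is needed
toy_prep = pd.get_dummies(toy, columns=["ocean_proximity"])
print(toy_prep)
```

The OneHotEncoder approach used in this notebook has the advantage that the fitted encoder can be reused on new data.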

## Modelling

- **Learning curves** :
    - Learning curves plot a model's performance against the amount of training experience (here, the size of the training set).
    - Comparing the curves on the training and validation sets helps diagnose an underfit, overfit, or well-fit model.
    - They can also reveal that the training or validation set is not representative of the problem domain.
- **Grid search** :
     - Tries every combination of the hyperparameter values you list in order to find the best combination for a given model.
     - The grid search approach is fine when you are exploring relatively few combinations.

- **Randomized search** :
     - Can be used in much the same way as GridSearchCV, but instead of trying out all possible combinations, it evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration.



- **Regression Evaluation Metrics**

  - <i>Mean Absolute Error (MAE)</i>:
  
      - is the mean of the absolute value of the errors
      - corresponds to the Manhattan norm (l1 norm)
      - Mathematical Formula : $MAE(X,h)=\frac{1}{m}\sum\limits_{i=1}^{m}|h(x^{(i)})-y^{(i)}|$
  
  - <i>Mean Squared Error (MSE)</i>:
      - is the mean of the squared errors
      - Mathematical Formula : $MSE(X,h)=\frac{1}{m}\sum\limits_{i=1}^{m}(h(x^{(i)})-y^{(i)})^{2}$
  
  - <i>Root Mean Squared Error (RMSE)</i>:
       - is the square root of the mean of the squared errors
       - corresponds to the Euclidean norm (l2 norm)
       - RMSE is more sensitive to outliers than the MAE
       - RMSE is interpretable in the "y" units
       - Mathematical Formula : $RMSE(X,h)=\sqrt{ \frac{1}{m}\sum\limits_{i=1}^{m}(h(x^{(i)})-y^{(i)})^{2}}$
  
  - <i>Coefficient of determination (R2)</i>:
       - It measures how well the model reproduces the observed results, as the proportion of the total variation in the target that the model explains.
       - R2 is at most 1 and can be negative
           - A model with an R2 of 1 perfectly predicts the target variable, a model with an R2 of 0 does no better than always predicting the mean of the target, and a negative R2 means the model does worse than that
       - Mathematical Formula : $R^{2}= 1 - \frac{SS_{res}}{SS_{tot}}$, where
       
            - $SS_{res}$ is the sum of the squared residuals between the actual y and the predicted y; it represents the error of the ML algorithm
            - $SS_{tot}$ is the total sum of squares (the sum of the squared deviations of the actual y from its mean); it represents the error of the Mean Model, which always predicts the mean
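
To make the four formulas above concrete, here is a small sketch that computes them by hand with NumPy on made-up numbers (`y_true` and `y_hat` are illustrative values, not model output) and prints scikit-learn's results next to them for comparison.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions, only for illustration
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_hat - y_true
mae = np.mean(np.abs(errors))        # mean of the absolute errors
mse = np.mean(errors ** 2)           # mean of the squared errors
rmse = np.sqrt(mse)                  # back in the units of y

ss_res = np.sum((y_true - y_hat) ** 2)            # error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # error of the mean model
r2 = 1 - ss_res / ss_tot

print("by hand :", mae, mse, rmse, r2)
print("sklearn :", mean_absolute_error(y_true, y_hat),
      mean_squared_error(y_true, y_hat),
      np.sqrt(mean_squared_error(y_true, y_hat)),
      r2_score(y_true, y_hat))
```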

In [None]:
housing = housing_prep.copy()

In [None]:
#Import necessary libraries 

from sklearn.model_selection import cross_val_score, learning_curve, RandomizedSearchCV, GridSearchCV
from sklearn.model_selection import train_test_split

from sklearn.feature_selection import SelectKBest, f_classif, chi2
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error


#### Create functions 

In [None]:
def train_test_split_(data ,target_var) :

    X = data.drop([target_var], axis=1)
    y = data[target_var]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0 )
    return (X_train, X_test, y_train, y_test)



def model_val_scores(mod, X_train, y_train, cv=5):
        
    score_val = []
    standard_deviation = []
        
    scores = cross_val_score(mod, X_train, y_train, cv=cv, scoring='neg_mean_squared_error')
    
    rmse_scores = np.sqrt(-scores)
    scores_mean = rmse_scores.mean()
    scores_std = rmse_scores.std()
        
    score_val.append(scores_mean)
    standard_deviation.append(scores_std)
    
    return pd.DataFrame(data=[score_val, standard_deviation], index = ['scores_val', 'scores_std']) 


def learning_curves(mod, X_train, y_train , cv=5):
    
    # No scoring argument is given, so the estimator's default score is used (R2 for regressors)
    N , train_score, val_score = learning_curve(mod, X_train, y_train,  cv=cv , train_sizes=np.linspace(0.2 ,1.0, 5))

    plt.plot(N, train_score.mean(axis=1), label='Train')
    plt.plot(N, val_score.mean(axis=1), label='Validation')
    plt.xlabel('train size')
    plt.legend()
    

def RandomizeSearchCV_(model, param_grid, X_train, y_train ) :
    
    randomSCV = RandomizedSearchCV(model, param_grid, n_iter=30, cv=5, scoring='neg_mean_squared_error', random_state=42)

    randomSCV.fit(X_train, y_train)
    model_best_params = randomSCV.best_estimator_
    
    print('best score :', randomSCV.best_score_ )
    print('best params :', randomSCV.best_params_ )
    
    return model_best_params

def GridSearchCV_(mod, param_grid, X_train, y_train):
    grid = GridSearchCV(estimator=mod, param_grid=param_grid, cv= 5, scoring='neg_mean_squared_error')
    
    grid.fit(X_train, y_train)
    model_best_params = grid.best_estimator_
    
    print('best score :', grid.best_score_ )
    print('best params :', grid.best_params_ )
    
    return model_best_params


def performance_metrics(y_test, y_pred):
    
    r2_scores  = []
    mae_value  = []
    mse_value  = []
    rmse_value = []
   
    scores = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    
    r2_scores.append(scores)
    mae_value.append(mae)
    mse_value.append(mse)
    rmse_value.append(rmse)
    
    metrics_dataframe=pd.DataFrame(data= [r2_scores, mae_value, mse_value, rmse_value],
                                index=['r2_score','MAE','MSE','RMSE'])
    return metrics_dataframe.T


#### Splitting the data 

In [None]:
X_train, X_test, y_train, y_test = train_test_split_(housing, 'median_house_value')

print('X_train :', X_train.shape)
print('X_test :', X_test.shape)
print('y_train :', y_train.shape)
print('y_test :', y_test.shape)

## Linear Models

- A Linear Regression model makes a prediction by simply computing a weighted sum of the input features, plus a constant called the bias term (also called the intercept term)
- There are two different ways to compute the model parameters that best fit the model to the training set :
    - Using a direct “**closed-form**” equation that directly computes the model parameters that minimize the cost function over the training set, like the Normal Equation
    - Using an **iterative optimization approach** that gradually tweaks the model parameters in order to minimize the cost function over the training set, like Gradient Descent (GD)
        - A few variants of Gradient Descent : Batch GD, Mini-batch GD, and Stochastic GD

- Assumptions and data preparation
    - **Linear Assumption** : Linear regression assumes that the relationship between the input and the output is linear. If that is not the case, we may need to transform the data to make the relationship linear (e.g. a log transform for an exponential relationship)
    - **Remove Noise** : Linear regression assumes that your input and output variables are not noisy, so it helps to clean the data first (for example, by removing outliers)
    - **Remove Collinearity** : Linear regression can over-fit your data when input variables are highly correlated with each other
    - **Gaussian Distributions** : Linear regression will make more reliable predictions if your input and output variables have a Gaussian distribution. You may get some benefit from using transforms (e.g. log or BoxCox) on your variables to make their distribution more Gaussian looking.
    - **Rescale Inputs** : Linear regression will often make more reliable predictions if you rescale input variables using standardization or normalization.
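
The iterative approach mentioned above can be sketched with Batch Gradient Descent on standardized inputs. This is a minimal illustration on synthetic data; the learning rate `eta`, the number of iterations and the data itself are assumptions chosen for the example, not values used elsewhere in this notebook.

```python
import numpy as np

# Synthetic data with features on a large scale (illustrative only)
rng = np.random.default_rng(0)
m = 200
X_raw = rng.normal(loc=50.0, scale=10.0, size=(m, 2))
y = 4.0 + 3.0 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + rng.normal(scale=0.5, size=m)

# Standardize the inputs so that Gradient Descent converges quickly
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X_b = np.c_[np.ones(m), X_std]          # add the bias column

eta = 0.1                                # learning rate (assumed value)
theta = np.zeros(3)
for _ in range(1000):
    # Gradient of the MSE cost function with respect to theta
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradients

# Closed-form least-squares solution on the same inputs, for comparison
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta)
print(theta_lstsq)
```

Both approaches should end up with practically the same parameters; without the standardization step, the same learning rate would make Gradient Descent diverge on features of this scale.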


### LinearRegression

- Scikit-Learn's LinearRegression computes the model parameters that minimize the cost function over the training set directly, in closed form, rather than iteratively. (It relies on an SVD-based least-squares solver, which gives the same solution as the Normal Equation when $X^{T}X$ is invertible.)
- Linear Regression model equation :
  $\hat{Y} = \theta_{0} + \sum \limits _{i=1} ^{n} X_{i}\theta_{i} $
  
   - $\hat{Y}$ is the predicted value
   - n is the number of features
   - $X_{i}$ is the $i^{th}$ feature value
   - $\theta$ is the vector of model parameters (the bias term $\theta_{0}$ and the feature weights $\theta_{1}$, $\theta_{2}$, ..., $\theta_{n}$)


Normal Equation :

- The Normal Equation : $\hat{\theta} = (X^{T}X)^{-1}X^{T}y$
   - $\hat{\theta}$ is the value of $\theta$ that minimizes the cost function
   - y is the vector of target values containing $y^{(1)}$ to $y^{(m)}$
   - $X^{T}X$ is the matrix product of $X^{T}$ and $X$
   - $X^{T}$ is the transpose of $X$
- The Normal Equation gets very slow when the number of features grows large 
- The Normal Equation handles large training sets efficiently, provided they can fit in memory
- Feature scaling is not necessary
- Predictions are very fast
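
As a quick illustration, the Normal Equation can be evaluated directly with NumPy. The data below is synthetic (a line with intercept 4 and slope 3 plus a little noise, chosen only for the example), and the result is compared with NumPy's least-squares solver.

```python
import numpy as np

# Synthetic data: y = 4 + 3x + noise (illustrative values)
rng = np.random.default_rng(42)
m = 100
X = rng.uniform(0, 2, size=(m, 1))
y = 4.0 + 3.0 * X[:, 0] + rng.normal(scale=0.1, size=m)

# Add x0 = 1 to each instance so that theta_0 acts as the bias term
X_b = np.c_[np.ones(m), X]

# Normal Equation: theta_hat = (X^T X)^-1 X^T y
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Same problem solved with an SVD-based least-squares routine
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta_hat)
print(theta_lstsq)
```

In practice the least-squares routine is the safer choice, because explicitly inverting $X^{T}X$ fails when that matrix is singular.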



In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# n_jobs=-1 tells Scikit-Learn to use all available cores
LR_model = make_pipeline(LinearRegression(n_jobs=-1))

In [None]:
model_val_scores(LR_model, X_train , y_train)

In [None]:
learning_curves(LR_model, X_train, y_train)

<i>The learning curves use the estimator's default score (R2 for regressors), which is a maximizing metric: bigger scores on the y-axis indicate better learning.</i>
 
- The training score remains flat as the training size grows
- The validation score decreases to a point, begins increasing again, then flattens at about the same value as the training score.


- Let's try feeding only the features most correlated with the target to the learning algorithm and see if we can achieve better performance than before.

> Based on the correlation matrix and the histograms above :
   - The variables most correlated with the target are median income, total rooms, bedrooms per room and rooms per household
   - There is no strong collinearity between these variables (their pairwise correlation coefficients do not exceed 0.7 in absolute value)
   - These features are treated here as approximately normally distributed



In [None]:
housing_must_corr_vars = housing_prep[['total_rooms','median_income','bedrooms_per_room','rooms_per_household','median_house_value','<1H OCEAN','INLAND','NEAR OCEAN','NEAR BAY','ISLAND']]

#housing_must_corr_vars.head(2)

In [None]:
X_train_corr, X_test_corr, y_train_corr, y_test_corr = train_test_split_(housing_must_corr_vars,'median_house_value' )

In [None]:
#Training and testing the model
LR_model.fit(X_train_corr, y_train_corr)

y_pred = LR_model.predict(X_test_corr)

In [None]:
# Model's performance
performance_metrics(y_test_corr, y_pred)

 - In this case, the linear regression model does not generalize well to new instances, hence the high RMSE on the test data
 - Let's train the linear regression model on the full feature set (X_train and y_train) and test it on X_test

In [None]:
#Train and test the model
LR_model.fit(X_train, y_train)

y_pred = LR_model.predict(X_test)

In [None]:
# Model's performance
performance_metrics(y_test, y_pred)

- Unlike the linear regression model trained only on the correlated features, this model reaches a lower Root Mean Squared Error

### Regularized Linear Models

- For a linear model, regularization is typically achieved by constraining the weights of the model. 
- We will look at Ridge Regression, Lasso Regression, and Elastic Net, which implement three different ways to constrain the weights.

#### Ridge Regression

- Ridge Regression is a regularized version of Linear Regression
- A regularization term equal to $\frac{1}{2} \alpha \sum \limits _{i=1} ^{n} \theta ^{2}_{i}$ is added to the cost function. This forces the algorithm to not only fit the data but also keep the model weights as small as possible
- Ridge Regression cost function equation : $J(\theta) =MSE(\theta) + \frac{1}{2} \alpha \sum \limits _{i=1} ^{n} \theta ^{2}_{i}$
- The hyperparameter $\alpha$ controls how much the model is regularized : with $\alpha = 0$, Ridge Regression is just Linear Regression
- Specifying penalty="l2" in a model trained with Stochastic Gradient Descent (SGDRegressor) adds this kind of regularization term to the cost function
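
Ridge Regression also has a closed-form solution. The sketch below is an illustration on synthetic data and assumes scikit-learn's scaling of the cost function (sum of squared residuals plus $\alpha$ times the squared l2 norm of the weights, which differs from the formula above only by constant factors) and no intercept; `ridge_closed_form` is a hypothetical helper written for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) theta = X^T y, i.e. ridge without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alpha = 10.0
theta_manual = ridge_closed_form(X, y, alpha)
theta_sklearn = Ridge(alpha=alpha, fit_intercept=False, solver="cholesky").fit(X, y).coef_
print(theta_manual, theta_sklearn)

# The weights shrink as alpha grows
norms = [np.linalg.norm(ridge_closed_form(X, y, a)) for a in (0.0, 1.0, 10.0, 100.0)]
print(norms)
```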

In [None]:
from sklearn.linear_model import Ridge

In [None]:
Ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=1, solver='cholesky'))

In [None]:
model_val_scores(Ridge_model, X_train, y_train)

In [None]:
learning_curves(Ridge_model, X_train, y_train)

In [None]:
Ridge_model.get_params().keys()

In [None]:
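# Note : np.arange(0.00001, 0.1, 0.5) below contains a single value (1e-05) because the step is larger than the range
param_grid = {'ridge__alpha':np.arange(0.001, 5, 0.1 ),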
              'ridge__tol':np.arange(0.00001, 0.1, 0.5),
              'ridge__solver':['auto','cholesky']}

Ridge_model_best_params = GridSearchCV_(Ridge_model, param_grid , X_train, y_train)

In [None]:
Ridge_model_best_params.fit(X_train, y_train)

y_pred = Ridge_model_best_params.predict(X_test)

In [None]:
performance_metrics(y_test, y_pred)

####  Lasso Regression

- Another regularized version of Linear Regression : just like Ridge Regression, it adds a regularization term to the cost function, but it uses the l1 norm of the weight vector instead of half the square of the l2 norm
- Lasso Regression cost function equation : $J(\theta) =MSE(\theta) + \alpha \sum \limits _{i=1} ^{n} |\theta_{i}| $
- Lasso Regression tends to set the weights of the least important features to exactly zero, so it automatically performs feature selection
- Specifying penalty="l1" in a model trained with Stochastic Gradient Descent (SGDRegressor) adds this l1 regularization term to the cost function
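
The feature selection effect is easy to see on synthetic data in which only the first two of six features matter. The data and the value of `alpha` below are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Only the first two features influence the target (illustrative data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Count the weights that are exactly zero in each model
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("OLS weights   :", ols.coef_.round(3), "zeros:", n_zero_ols)
print("Lasso weights :", lasso.coef_.round(3), "zeros:", n_zero_lasso)
```

Plain Linear Regression gives small but non-zero weights to the irrelevant features, whereas Lasso removes them from the model.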

In [None]:
from sklearn.linear_model import Lasso

In [None]:
Lasso_model = make_pipeline(StandardScaler(),  Lasso(alpha=0.01, max_iter=10000))

In [None]:
model_val_scores(Lasso_model, X_train, y_train)

In [None]:
learning_curves(Lasso_model, X_train, y_train)

In [None]:
Lasso_model.get_params().keys()

In [None]:
param_grid = {'lasso__alpha':np.arange(0.0001, 0.1, 0.01)}

Lasso_best_params = GridSearchCV_(Lasso_model, param_grid, X_train, y_train)

In [None]:
Lasso_best_params.fit(X_train, y_train)

y_pred = Lasso_best_params.predict(X_test)

In [None]:
performance_metrics(y_test, y_pred)

#### Elastic Net

- Elastic Net is a middle ground between Ridge Regression and Lasso Regression
- The regularization term is a simple mix of both Ridge and Lasso’s regularization terms, and you can control the mix ratio r
    - When r = 0, Elastic Net is equivalent to Ridge Regression
    - When r = 1, Elastic Net is equivalent to Lasso Regression
- Elastic Net cost function Equation : $J(\theta) =MSE(\theta) + r\alpha \sum \limits _{i=1} ^{n} |\theta_{i}| + \frac{1-r}{2} \alpha \sum \limits _{i=1} ^{n} \theta ^{2}_{i} $
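
In scikit-learn the mix ratio r is the `l1_ratio` parameter. The sketch below, on synthetic data chosen only for illustration, checks the r = 1 case against Lasso and shows a mixed setting next to it.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data with two irrelevant features (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.7]) + rng.normal(scale=0.3, size=150)

lasso = Lasso(alpha=0.1).fit(X, y)
enet_as_lasso = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)   # r = 1
enet_mixed = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)      # half l1, half l2

print("Lasso          :", lasso.coef_.round(3))
print("ElasticNet r=1 :", enet_as_lasso.coef_.round(3))
print("ElasticNet r=.5:", enet_mixed.coef_.round(3))
```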

In [None]:
from sklearn.linear_model import ElasticNet

In [None]:
ELN_model  = make_pipeline(StandardScaler(), ElasticNet(alpha=0.001, l1_ratio=0.5, max_iter=100000))

In [None]:
model_val_scores(ELN_model, X_train, y_train)

In [None]:
learning_curves(ELN_model, X_train, y_train)

In [None]:
ELN_model.get_params().keys()

In [None]:
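# Note : np.arange(0.001, 0.1, 0.1) contains a single value (0.001) because the step is larger than the range
param_grid = {'elasticnet__alpha':np.arange(0.001, 0.1, 0.1 ),
              'elasticnet__l1_ratio':np.arange(0, 1, 0.1)}

# Search on the same training set that is used for the final fit and evaluation
ELN_model_best_params = RandomizeSearchCV_(ELN_model, param_grid, X_train, y_train )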

In [None]:
learning_curves(ELN_model_best_params, X_train, y_train)

In [None]:
ELN_model_best_params.fit(X_train, y_train)

y_pred = ELN_model_best_params.predict(X_test)

In [None]:
performance_metrics(y_test, y_pred)

## Support Vector Machine 

- SVM Regression tries to fit as many instances as possible on the street while limiting margin violations (instances off the street)
- The width of the street is controlled by a hyperparameter ε (epsilon)
- SVMs are sensitive to the feature scales
- The hyperparameter C acts like a regularization hyperparameter : if our model is overfitting, we should reduce it, and if it is underfitting, we should increase it 
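
The effect of ε can be seen by counting support vectors, which are the instances lying on the edge of the street or outside it. This is a sketch on synthetic one-dimensional data; the two ε values are assumptions picked to make the contrast obvious.

```python
import numpy as np
from sklearn.svm import SVR

# Noisy straight line (illustrative data)
rng = np.random.default_rng(0)
X = np.linspace(0, 1, 100).reshape(-1, 1)
y = 2.0 * X[:, 0] + rng.normal(scale=0.2, size=100)

# A narrow street leaves most instances outside it, a wide street very few
narrow = SVR(kernel="linear", C=1.0, epsilon=0.01).fit(X, y)
wide = SVR(kernel="linear", C=1.0, epsilon=1.0).fit(X, y)

n_sv_narrow = len(narrow.support_)
n_sv_wide = len(wide.support_)
print("support vectors with epsilon=0.01:", n_sv_narrow)
print("support vectors with epsilon=1.0 :", n_sv_wide)
```

Instances strictly inside the street do not affect the model's predictions, which is why a wider street leads to fewer support vectors.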




In [None]:
from sklearn.svm import SVR

In [None]:
SVR_model = make_pipeline( StandardScaler(), SVR(epsilon=1.5, kernel='linear', C=100) )

In [None]:
model_val_scores(SVR_model, X_train, y_train)

In [None]:
learning_curves(SVR_model, X_train, y_train)

In [None]:
SVR_model.get_params().keys()

In [None]:
param_grid = {'svr__C':np.arange(10, 50, 5),
              'svr__epsilon':np.arange(1, 6, 1),
              'svr__kernel':['rbf','linear']}

SVR_best_params = RandomizeSearchCV_(SVR_model, param_grid , X_train, y_train)

In [None]:
SVR_best_params.fit(X_train, y_train)

In [None]:
y_pred = SVR_best_params.predict(X_test)

In [None]:
performance_metrics(y_test, y_pred)

## K-Nearest Neighbors 

How KNN works :
  - Calculate the distance between the new sample and every training sample, using a metric such as the Euclidean distance.
  - Sort these distances in ascending order.
  - Keep the K training samples with the smallest distances.
  - Return the mean of the target values of these K nearest neighbors.

- KNN is a non-parametric algorithm because it does not assume anything about the training data. This makes it useful for problems having non-linear data.
- KNN can be very sensitive to the scale of data as it relies on computing the distances. For features with a higher scale, the calculated distances can be very high and might produce poor results. It is thus advised to scale the data before running the KNN.
- KNN can be computationally expensive both in terms of time and storage, if the data is very large because KNN has to store the training data to work. This is generally not the case with other supervised learning models.
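
The four steps listed above translate almost line by line into NumPy. `knn_predict` is a hypothetical helper written for this sketch, and the data is synthetic; scikit-learn's KNeighborsRegressor is used as a reference.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic training data and one query point (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = X_toy[:, 0] + 2.0 * X_toy[:, 1]
x_query = np.array([0.1, -0.3])

def knn_predict(X_fit, y_fit, x_new, k):
    # 1. Euclidean distance from the new sample to every training sample
    distances = np.linalg.norm(X_fit - x_new, axis=1)
    # 2. and 3. sort the distances and keep the indices of the k smallest
    nearest = np.argsort(distances)[:k]
    # 4. the prediction is the mean target of these neighbors
    return y_fit[nearest].mean()

manual = knn_predict(X_toy, y_toy, x_query, k=5)
reference = KNeighborsRegressor(n_neighbors=5).fit(X_toy, y_toy).predict([x_query])[0]
print(manual, reference)
```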

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
KNR_model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=15,n_jobs=-1 ))

In [None]:
model_val_scores(KNR_model, X_train, y_train)

In [None]:
learning_curves(KNR_model, X_train, y_train)

In [None]:
KNR_model.get_params().keys()

In [None]:
param_grid = {'kneighborsregressor__n_neighbors':np.arange(5, 100, 1),
             'kneighborsregressor__p':[1,2],
             'kneighborsregressor__weights':['uniform','distance'],
             'kneighborsregressor__leaf_size':np.arange(20,50,5)}


KNR_model_best_params = RandomizeSearchCV_(KNR_model, param_grid , X_train, y_train)

In [None]:
learning_curves(KNR_model_best_params, X_train, y_train)

In [None]:
KNR_model_best_params.fit(X_train, y_train)

y_pred = KNR_model_best_params.predict(X_test)

In [None]:
performance_metrics(y_test, y_pred)

## Decision Trees

- Scikit-Learn uses the Classification And Regression Tree (CART) algorithm to train Decision Trees.
- The idea is quite simple : the algorithm first splits the training set into two subsets using a single feature k and a threshold $t_{k}$, then it splits the subsets using the same logic, then the sub-subsets, and so on, in a way that minimizes the MSE.
- To avoid overfitting the training data, we need to restrict the Decision Tree’s freedom (max_depth, min_samples_split, min_samples_leaf ...) during training. As we know, this is called regularization
- One of the many qualities of Decision Trees is that they require very little data preparation. In particular, they don’t require feature scaling or centering. (Scikit-Learn's implementation still needs categorical features to be encoded as numbers.)
- CART cost function for regression :

$J(k,t_{k}) = \frac{m_{left}}{m}MSE_{left}+\frac{m_{right}}{m}MSE_{right}$

where :

$MSE_{node} = \frac{1}{m_{node}} \sum _{i \in node} ( \hat{y}_{node} - y^{(i)}) ^{2}$

$\hat{y}_{node} = \frac{1}{m_{node}} \sum _{i \in node} y^{(i)}$

Decision Trees limitations :
- Decision Trees are very sensitive to small variations in the training data. In other words, if the training data is changed, the resulting decision tree can be quite different, and in turn, the predictions can be quite different
- Decision Trees can also be computationally expensive to train, carry a big risk of overfitting, and tend to find local optima because training is greedy : a split is never revisited once it has been made.
- Random Forests can limit this instability by averaging predictions over many trees
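
To illustrate the CART cost function, here is a sketch of the search for the best threshold on a single feature. `best_split` is a hypothetical helper written for this example, it takes the MSE of a node to be the mean squared deviation from the node's mean, and the step-shaped data is made up.

```python
import numpy as np

def best_split(x, y):
    """Return the threshold on one feature that minimizes the CART cost, and that cost."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    m = len(y_sorted)
    best_t, best_cost = None, np.inf
    for i in range(1, m):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate identical values
        left, right = y_sorted[:i], y_sorted[i:]
        # Weighted sum of the two node MSEs (var() is the MSE around the node mean)
        cost = (len(left) / m) * left.var() + (len(right) / m) * right.var()
        if cost < best_cost:
            best_t = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_cost = cost
    return best_t, best_cost

# Step function: the target jumps from 0 to 5 between x = 9 and x = 10
x_toy = np.arange(20, dtype=float)
y_toy = np.where(x_toy < 10, 0.0, 5.0)
threshold, cost = best_split(x_toy, y_toy)
print(threshold, cost)
```

A full tree repeats this search over every feature and then recursively on each of the two subsets.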

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
DTR_model = make_pipeline(DecisionTreeRegressor())

In [None]:
model_val_scores(DTR_model ,X_train, y_train)

In [None]:
learning_curves(DTR_model, X_train, y_train)

In [None]:
DTR_model.get_params().keys()

In [None]:
# Hyperparameter grid : maximum depth of the tree and number of features considered at each split
param_grid = {'decisiontreeregressor__max_depth':np.arange(5, 30, 1), 
              'decisiontreeregressor__max_features':np.arange(7, 12, 1)               
             }

DTR_model_best_params = GridSearchCV_(DTR_model, param_grid , X_train, y_train)

In [None]:
learning_curves(DTR_model_best_params, X_train, y_train)

In [None]:
DTR_model_best_params.fit(X_train, y_train)

y_pred = DTR_model_best_params.predict(X_test)

In [None]:
performance_metrics(y_test, y_pred)

## Ensemble Methods

- An Ensemble method is a technique that combines the predictions from multiple machine learning models in order to make more accurate predictions than any individual model. A model made up of many models is called an Ensemble model.
- Ensemble methods can decrease variance using a bagging approach, decrease bias using a boosting approach, or improve predictions using a stacking approach

### Random Forest 

- Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting)
     - **Bagging method** (Bootstrap Aggregation) : each tree is trained on a random sample of the training set drawn with replacement. It can be used to reduce the variance of algorithms that have high variance, typically decision trees
     - **Pasting method** : the same idea as bagging, but the random samples are drawn without replacement
- RandomForestRegressor has all the hyperparameters of a DecisionTreeRegressor (to control how trees are grown), plus all the hyperparameters of a BaggingRegressor to control the ensemble itself.
- The trees in a random forest are built independently of each other, so they can be trained in parallel.
- It can handle thousands of input variables without variable deletion.
- It operates by constructing a multitude of decision trees at training time and outputting the mean prediction of the individual trees.
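
The last point can be checked directly : the fitted trees are available in the `estimators_` attribute, and averaging their predictions gives the forest's prediction. The data below is synthetic and only serves the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 2))
y_toy = np.sin(X_toy[:, 0]) + 0.5 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_toy, y_toy)

# Predictions of each individual tree for the first five samples
tree_preds = np.array([tree.predict(X_toy[:5]) for tree in forest.estimators_])
manual_mean = tree_preds.mean(axis=0)
forest_pred = forest.predict(X_toy[:5])

print(manual_mean)
print(forest_pred)
```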

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
RFR_model  = make_pipeline(RandomForestRegressor(n_estimators=12, n_jobs=-1))

In [None]:
model_val_scores(RFR_model, X_train, y_train)

In [None]:
learning_curves(RFR_model, X_train, y_train)

In [None]:
RFR_model.get_params().keys()

In [None]:
param_grid = {'randomforestregressor__max_depth':np.arange(10, 40, 10),
             'randomforestregressor__n_estimators':np.arange(200, 300, 20)}

RFR_model_best_params = GridSearchCV_(RFR_model, param_grid , X_train, y_train)

In [None]:
learning_curves(RFR_model_best_params, X_train, y_train)

In [None]:
RFR_model_best_params.fit(X_train, y_train)

y_pred = RFR_model_best_params.predict(X_test)

In [None]:
performance_metrics(y_test, y_pred)

### AdaBoost

- AdaBoost is a **boosting** algorithm.
  - Boosting algorithm : refers to a group of algorithms that combine weak learners, trained one after another, into a stronger learner through a weighted combination of their predictions. Each model that runs determines which training instances the next model will focus on.
- How it works : each new predictor corrects its predecessor by paying a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases.
- If an AdaBoost ensemble is overfitting the training set, we can try reducing the number of estimators or more strongly regularizing the base estimator.
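
One way to choose the number of estimators is to look at the validation error after each boosting stage with `staged_predict`. This is a sketch on synthetic data; the split into `X_fit` / `X_val`, the tree depth and the number of estimators are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, split by hand into a fitting part and a validation part
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.2, size=300)
X_fit, y_fit = X_toy[:200], y_toy[:200]
X_val, y_val = X_toy[200:], y_toy[200:]

booster = AdaBoostRegressor(DecisionTreeRegressor(max_depth=2),
                            n_estimators=50, random_state=0)
booster.fit(X_fit, y_fit)

# Validation MSE of the ensemble after 1, 2, 3, ... estimators
val_errors = [mean_squared_error(y_val, stage_pred)
              for stage_pred in booster.staged_predict(X_val)]
best_n = int(np.argmin(val_errors)) + 1
print("best number of estimators on this validation set:", best_n)
```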

In [None]:
from sklearn.ensemble import AdaBoostRegressor

In [None]:
ada_model = make_pipeline(AdaBoostRegressor(DecisionTreeRegressor(max_depth=5), n_estimators=200, learning_rate=0.5))

In [None]:
model_val_scores(ada_model, X_train, y_train)

In [None]:
learning_curves(ada_model, X_train, y_train)

In [None]:
ada_model.get_params().keys()

In [None]:
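# Note : the 'base_estimator' key matches older scikit-learn releases; newer releases name this parameter 'estimator'
# (check the keys printed by ada_model.get_params().keys()). np.arange(0.001, 0.9, 0.9) contains a single value (0.001).
param_grid = {'adaboostregressor__base_estimator__max_depth':np.arange(5,20,5),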
              'adaboostregressor__n_estimators':np.arange(100, 300, 50),
              'adaboostregressor__learning_rate' : np.arange(0.001, 0.9, 0.9)
               }

ada_best_params = RandomizeSearchCV_(ada_model, param_grid, X_train, y_train )

In [None]:
learning_curves(ada_best_params, X_train, y_train)

In [None]:
ada_best_params.fit(X_train, y_train)

y_pred = ada_best_params.predict(X_test)

In [None]:
performance_metrics(y_test, y_pred)

> ### Performance ranking :

In [None]:
final_RMSE = pd.DataFrame( data  = [[44623.94, 45386.56, 51730.20, 56308.76, 60356.87, 60372.75, 60381.28, 60381.32, 63395.09],
                                    [0.791,0.783,0.719,0.667,0.6179,0.6177,0.61766,0.61763,0.578]],
                          columns  = ['Random Forest Regressor','AdaBoost Regressor','KNeighbors Regressor',
                                   'Decision Tree Regressor','ElasticNet','Ridge', 'Lasso', 'Linear Regression',   
                                   'Support Vector Regressor'],
                         index =['RMSE','R2'])


final_RMSE = final_RMSE.T

cm = sns.light_palette('green', as_cmap=True)

final_RMSE = final_RMSE.style.background_gradient(cmap=cm)
final_RMSE

> As can be seen from the table above, Random Forest Regressor is the best model for this dataset because it has :

- the highest R^2 score
- the lowest root mean squared error


- Thank you for reading my Notebook. Please feel free to suggest any improvements
- Please upvote if you found this kernel useful!

#### References

- <a href='https://www.amazon.fr/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291'>Hands-On Machine Learning with Scikit-Learn and TensorFlow</a>
- <a href='https://machinelearnia.com/machine-learning/'>MachineLearnia</a>
- <a href='https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing'>Sklearn Preprocessing</a>
- <a href='https://machinelearningmastery.com/linear-regression-for-machine-learning/'>Machine Learning Mastery: Linear Regression for Machine Learning</a>
- <a href='https://medium.com/analytics-vidhya/writing-math-equations-in-jupyter-notebook-a-naive-introduction-a5ce87b9a214'>Writing math equations</a>
- <a href='https://towardsdatascience.com/k-nearest-neighbors-94395f445221'>towardsdatascience KNN</a>
- <a href='https://towardsdatascience.com/understanding-random-forest-58381e0602d2'>towardsdatascience Random Forest</a>