Hello Friends !

This kernel is an introduction to an end-to-end Machine Learning process for beginners. Here are the main steps you will
go through:

1. Data exploration and visualization with seaborn and [plotly](https://plotly.com/)
1. Data processing for Machine Learning algorithms with customized scikit-learn transformers
1. Model selection
1. Model training
1. Model fine-tuning with the grid search technique

I referred to:
* The chapter 2 [notebook](https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb) of the best-selling [book](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/) by Aurélien Géron
* This excellent [post](https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65) on how to build custom transformers with scikit-learn


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns


RANDOM_STATE = 75

In [None]:
# get Data
df = pd.read_csv('../input/california-housing-prices/housing.csv')

## Data Exploration
### 1. Quick look

In [None]:
df.info()

Get a statistical description of the numerical variables:

In [None]:
df.describe()

The numerical features are on very different scales, which we will have to adjust later.

Missing values:
we will use [pandas.DataFrame.count](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.count.html), which counts the non-NA cells of each column, and subtract it from the number of rows
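
As a minimal sketch of this check, here is the same computation on a toy frame. The values are made up for illustration, and `isna().sum()` is shown as an equivalent alternative, not as the method used in this kernel.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with two missing cells in one column
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Number of missing values per column, computed two equivalent ways
missing_by_count = len(toy) - toy.count()
missing_by_isna = toy.isna().sum()

# Share of missing values per column, useful to judge how severe the gap is
missing_ratio = toy.isna().mean()

print(missing_by_isna)
print(missing_ratio)
```

Both counts agree; the ratio is handy when deciding between imputing a column and dropping it.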

In [None]:
len(df) - df.count()


"**total_bedrooms**" contains missing values; we will see how to handle them

## 2. Split train test:

We have to set aside a test subset and never look at it. Otherwise we might choose a model that happens to fit patterns specific to the test set, the error estimate would be too optimistic, and the system would not perform as well as expected once deployed to production. This is called **data snooping bias**.

### Stratified sampling:
If we want to split the data so that the test subset is representative of the overall population, we use stratified sampling. If a feature is important for predicting the target variable, we can stratify the train and test subsets on that feature. A good way to find a suitable feature to stratify on is to compute the **Pearson correlation** between each variable and the target, and take the variable with the highest absolute correlation coefficient.
Check out this excellent [post](http://spss.espaceweb.usherbrooke.ca/pages/stat-inferentielles/correlation.php) for a deeper understanding of Pearson's correlation.
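
To make the coefficient concrete, here is a small sketch that computes Pearson's correlation by hand and compares it with NumPy. The `income` and `value` arrays and the `pearson` helper are hypothetical, introduced only for this illustration.

```python
import numpy as np

# Hypothetical toy values: median income and house value for five districts
income = np.array([1.5, 2.0, 3.5, 4.0, 6.0])
value = np.array([90_000, 110_000, 200_000, 210_000, 400_000], dtype=float)

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

r_manual = pearson(income, value)
r_numpy = np.corrcoef(income, value)[0, 1]
print(r_manual, r_numpy)
```

The coefficient always lies between -1 and 1; values close to either end indicate a strong linear relationship.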

In [None]:
#from sklearn.model_selection import train_test_split

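# Keep numerical columns only: "ocean_proximity" is text and cannot be correlated
corr_matrix = df.select_dtypes(include="number").corr()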
corr_matrix["median_house_value"].sort_values(ascending=False)

The variable <span style="color: #D35400">median_income</span> is strongly correlated with our target, so we can stratify on it to make the test set representative of the various income categories in the whole dataset.

A scatter plot shows the linear relationship between the two variables more clearly:

In [None]:
ax = sns.relplot(x="median_income", y="median_house_value", data=df)

Note that it is important to have a sufficient number of instances in your dataset for each stratum, or else the estimate of the stratum’s importance may be biased. To decide how to cut "median_income" into meaningful intervals, let's look at its distribution with a **histogram** and a **box plot**

In [None]:
plt.figure(figsize=(16,9))

# histplot replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(df["median_income"], kde=True, label="median income")
 
plt.title("Histogram of Median Income") # for histogram title
plt.legend() # for label
plt.show()

In [None]:
import plotly.express as px
fig = px.box(df, y="median_income")
fig.show()

The previous charts show that most median income values are clustered between **1.5** and **6**, so we'll split the "median_income" values into 5 strata:
* values between 0 and 1.5 => stratum 1
* values between 1.5 and 3 => stratum 2
* values between 3 and 4.5 => stratum 3
* values between 4.5 and 6 => stratum 4
* values higher than 6 => stratum 5
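
The list above leaves open what happens exactly at a boundary. The short sketch below, on made-up income values, shows how `pd.cut` assigns them with its default right-inclusive intervals.

```python
import numpy as np
import pandas as pd

# Hypothetical median_income values, including some of the bin edges themselves
incomes = pd.Series([0.5, 1.5, 1.6, 3.0, 4.5, 6.0, 15.0])

strata = pd.cut(incomes,
                bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                labels=[1, 2, 3, 4, 5])

# Intervals are right-inclusive by default: 1.5 falls in stratum 1, 1.6 in stratum 2
print(strata.tolist())  # [1, 1, 2, 2, 3, 4, 5]
```

With these settings a value of exactly 0 would fall outside the first interval and become NaN, which is worth keeping in mind if a dataset can contain zeros.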

In [None]:
#https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb
df["income_cat"] = pd.cut(df["median_income"],
                          bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                          labels=[1, 2, 3, 4, 5])
# Bar plot of the income categories
plt.figure(figsize=(12,7))
sns.countplot(x='income_cat', data=df)
 
plt.show()

Now let's apply stratified sampling based on the income categories ("**income_cat**")

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=RANDOM_STATE)

# split() yields positional indices, so rows are selected with iloc
for train_index, test_index in split.split(df, df["income_cat"]):
    strat_train_set = df.iloc[train_index]
    strat_test_set = df.iloc[test_index]
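
A quick way to check that stratification did its job is to compare the category proportions in the full data and in the test set. The sketch below does this on synthetic data; the frame `toy` and its columns are hypothetical stand-ins for the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic data (hypothetical): 1000 rows with an imbalanced category column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x": rng.normal(size=1000),
    "cat": rng.choice([1, 2, 3], size=1000, p=[0.1, 0.6, 0.3]),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(split.split(toy, toy["cat"]))
test_set = toy.iloc[test_idx]

# Category proportions in the full data versus the stratified test set
comparison = pd.DataFrame({
    "overall": toy["cat"].value_counts(normalize=True),
    "stratified_test": test_set["cat"].value_counts(normalize=True),
}).sort_index()
print(comparison)
```

The two columns should be nearly identical, which is the property a purely random split does not guarantee for rare categories.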


### 3. EDA: Exploratory Data Analysis
Exploratory data analysis (EDA) is the task of analyzing data sets to summarize their main characteristics, often with **DataViz** methods, before applying any data transformation pipeline or training statistical/machine learning models. Its main aim is to discover what the available data can tell us, and possibly to formulate hypotheses that could lead to new data collection and experiments.

In [None]:
df = strat_train_set.copy().drop('income_cat', axis=1)


In [None]:
df.info()

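**float64** uses 8 bytes per value, so when a float column only holds whole numbers in a small range, it can be converted to a compact integer type such as **uint8** to save memory (in our case it does not really matter, as the dataset is quite small). Before converting, make sure the column has no missing values and that every value fits in the target type.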
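
The saving can be measured directly. This is a minimal sketch on a made-up column of whole-number ages; the safety check before downcasting is an assumption about good practice, not something this kernel does.

```python
import numpy as np
import pandas as pd

# Hypothetical column of whole-number ages stored as float64
ages = pd.Series(np.arange(1, 53, dtype="float64").repeat(100))

# Downcast only when every value is a whole number that fits in uint8 (0..255)
fits_uint8 = (ages % 1 == 0).all() and ages.between(0, 255).all()
ages_small = ages.astype("uint8") if fits_uint8 else ages

print(ages.memory_usage(index=False), "bytes as float64")
print(ages_small.memory_usage(index=False), "bytes as uint8")
```

The uint8 version takes one byte per value instead of eight.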

In [None]:
df['housing_median_age'] = df['housing_median_age'].astype('uint8')
g = sns.catplot(x="housing_median_age",
                y="median_house_value", 
                data=df, 
                kind="box")
g.fig.set_figwidth(16)
g.fig.set_figheight(9)

==> "housing_median_age" does not seem to be a very predictive feature on its own

However, the oldest houses can be quite expensive compared with somewhat newer ones

#### longitude X latitude 

In [None]:
import plotly.express as px

fig = px.scatter(df, x="longitude", y="latitude",color="median_house_value")
fig.show()

Housing prices are closely related to location: districts close to the ocean have the highest house prices

In [None]:
df['total_rooms'] = df['total_rooms'].astype('uint16')
ax = sns.relplot(x='total_rooms', y="median_house_value", data=df)

In [None]:
sns.catplot(x="ocean_proximity",
                y="median_house_value", 
                data=df, 
                kind="box")
plt.show()

In [None]:
# Keep numerical columns only: "ocean_proximity" is text and cannot be correlated
corr_matrix = df.select_dtypes(include="number").corr()
sns.heatmap(corr_matrix)

In [None]:
corr_matrix["median_house_value"].sort_values(ascending=False)

Let's display a scatter matrix of the variables most correlated with "median_house_value"

In [None]:


fig = px.scatter_matrix(df,
    dimensions=["median_house_value", "median_income", "total_rooms", "housing_median_age"])#,
#    color="ocean_proximity")
fig.show()

The most promising attribute for predicting the median house value seems to be **median_income**, so let’s **zoom** in on their scatter plot

In [None]:
fig = px.scatter(df, x="median_income",
                 y="median_house_value", 
                 size='total_rooms')
fig.show()

The correlation is actually very strong; the upward trend is quite clear

## 4. Feature Engineering

### Data Cleaning

In [None]:
df.info()

### a) Missing values 
As we saw, "**total_bedrooms**" has null values.

Most Machine Learning algorithms do not support missing values. To address this issue we have three options:
1. Drop the entries with null values
2. Drop the features with many missing values
3. Impute the NaN values with a meaningful value (zero, the mean, the median, etc.).

(1 & 2) It is generally better to keep data than to delete it.
(2) The only case where it may be worth deleting a variable is when **more than 60%** of its observations are missing, and only if that variable is insignificant. With this in mind, _imputation is usually preferred over deleting variables_.

(3) Mean/median imputation weakens the correlations involving the imputed variable, because it assumes that there is no relationship between the imputed variable and the other features. An attractive way to reduce this effect is to impute missing values with the k-nearest neighbors algorithm. The assumption behind this method is that a data point can be approximated by the values of its closest points, based on the other variables.

If you choose imputation, keep in mind that the imputation values must be computed on the **<span style="color: #FF5733">training</span>** set only, and saved so that they can be reused later to replace missing values in the test set when you evaluate your system, and in new data once the system goes live.

We'll use scikit-learn's [KNNImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html) to handle the missing values

#### How does it work?
According to the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html):
> Each sample’s missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.

In [None]:
from sklearn.impute import KNNImputer


imputer = KNNImputer(n_neighbors=5)
df_filled = imputer.fit_transform(df.drop('ocean_proximity', axis=1))
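
To see the mechanism on a case small enough to follow by hand, here is a sketch on a tiny made-up matrix. It also illustrates the point made earlier: the imputer is fitted on training rows, then reused on new rows. `X_train` and `X_new` are hypothetical.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Tiny hypothetical training matrix with two missing cells
X_train = np.array([[1.0, 2.0, np.nan],
                    [3.0, 4.0, 3.0],
                    [np.nan, 6.0, 5.0],
                    [8.0, 8.0, 7.0]])

toy_imputer = KNNImputer(n_neighbors=2)
X_train_filled = toy_imputer.fit_transform(X_train)  # neighbors come from these rows
print(X_train_filled)

# New data is imputed from the *training* rows kept inside the fitted imputer
X_new = np.array([[np.nan, 4.0, 3.0]])
X_new_filled = toy_imputer.transform(X_new)
print(X_new_filled)
```

In the first row, the missing third value is replaced by the mean of the third values of its two nearest rows (3.0 and 5.0), which gives 4.0.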


In [None]:
# Check that 'total_bedrooms' (column index 4) has no missing values left
np.isnan(df_filled[:,4]).sum()

In [None]:
df['total_bedrooms'] = df_filled[:,4]

### b) Categorical variables:


Most Machine Learning algorithms work with numbers rather than text or categorical variables, so categorical variables need to be converted to numerical ones. One of the most common solutions is **one-hot encoding**, which creates a new binary feature for each category (modality) of the variable: the feature is 1 when the observation belongs to that category and 0 otherwise. The new attributes are also called dummy attributes.

The [get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function is quite straightforward: it converts categorical variables into dummy variables.

In [None]:
pd.get_dummies(df)
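
On the full dataset the output is wide, so here is the same idea on three made-up rows, where the result can be read at a glance. The frame `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical rows with a single categorical column
toy = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"]})

# One new 0/1 column per category; dtype=int keeps the output as integers
dummies = pd.get_dummies(toy, dtype=int)
print(dummies)
```

One design note: `get_dummies` does not remember which categories it saw, so a test set with a different set of categories would produce different columns. A fitted `OneHotEncoder` keeps the categories learned on the training set, which is one reason to prefer it inside a pipeline.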

### Custom transformer:
To build a transformer customized to our dataset, all we have to do is create a class that inherits from `BaseEstimator` and implement the `fit` and `transform` methods (also inheriting from `TransformerMixin` provides `fit_transform` as a bonus method)

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin


class AttributeAdder(BaseEstimator, TransformerMixin):
    def __init__(self, num_vars, add_bedrooms_per_room=True):
        # Store constructor arguments under their own names so that
        # get_params() and clone() work as scikit-learn expects
        self.num_vars = num_vars
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X, y=None):
        # Always rebuild the frame from the input column names; reusing the
        # output names would make a second call to transform() fail
        X = pd.DataFrame(X, columns=self.num_vars)
        X['rooms_per_household'] = X['total_rooms'] / X['households']
        X['population_per_household'] = X['population'] / X['households']
        if self.add_bedrooms_per_room:
            X['bedrooms_per_room'] = X['total_bedrooms'] / X['total_rooms']

        self._cols = X.columns.tolist()  # names of the output columns
        return X.values

    def get_feature_names(self):
        return self._cols

### Feature Scaling
With few exceptions, Machine Learning algorithms do not perform well when numerical features have very different scales.
Some examples of algorithms where feature scaling matters:
* **k-NN**: it relies on the Euclidean distance $\| \cdot \|_2$ between data points, which is sensitive to magnitudes, so all features should be scaled to **weigh in equally**.
* **PCA**: scaling is critical because PCA looks for the directions of maximum variance, and variance is high for high-magnitude features. This skews PCA towards high-magnitude features.
* **Gradient Descent**: scaling can speed up gradient descent, because the parameters $\hat{\theta}$ descend quickly on small ranges and slowly on large ranges, and so oscillate inefficiently down to the optimum when the variables are very uneven.

Algorithms tolerant to different scales:
* **Tree-based models** are not distance-based and can handle features with varying ranges, so scaling is not required when modelling with trees.
* **LDA** and **Naive Bayes** are by design equipped to handle this and weight the features accordingly, so feature scaling may not have much effect on these algorithms.

There are two common ways to scale features: min-max scaling and standardization.
1. Min-max scaling (also called normalization) shifts and rescales values so that they end up ranging from 0 to 1:

$$ \frac{x-\min(x)}{\max(x)-\min(x)} $$

2. Standardization first subtracts the mean value (so standardized values always have a zero mean), then divides by the standard deviation so that the resulting distribution has unit variance:
$$\frac{x-\mu}{\sigma}$$
where μ is the mean value of the feature and σ is the standard deviation of the feature

Unlike min-max scaling, standardization does not bound values to a specific range, which may be a problem for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1). However, standardization is much less affected by outliers.
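
Both formulas are short enough to apply by hand. The sketch below does so with NumPy on a made-up feature that contains one large outlier.

```python
import numpy as np

# Hypothetical feature with one large outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: (x - min) / (max - min), result lies in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, result has zero mean and unit variance
x_std = (x - x.mean()) / x.std()

print(x_minmax.round(3))
print(x_std.round(3))
```

With min-max scaling, the four ordinary values end up squeezed into roughly the first 3% of the [0, 1] range because the outlier defines the maximum.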


**PS IMPORTANT**: Scalers must be fitted on the training data only (not on the whole dataset including the test subset). The fitted transformers are then used to transform the training set, the test set and new data.
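
In scikit-learn terms this means calling `fit_transform` on the training set and only `transform` on everything else. A minimal sketch, with made-up one-feature arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature train and test sets
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5], [10.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and std are learned here only
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

print(scaler.mean_, scaler.scale_)
print(X_test_scaled.ravel())
```

The test value 2.5 maps to 0 because it equals the training mean, regardless of what the test set itself looks like.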

In [None]:
df.describe().iloc[:, 2:]

The **total_rooms** variable takes values between 6 and 39320 whereas **housing_median_age** varies between 1 and 52, which reflects very different scales



### Custom Pipeline:
With many data transformation steps, it is recommended to use the `Pipeline` class provided by scikit-learn, which chains transformations in the right order.
Since numerical and categorical columns need different steps, we build one pipeline for each and combine them with the [FeatureUnion](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html) estimator offered by scikit-learn. This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results
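
For comparison, scikit-learn also offers `ColumnTransformer`, which routes columns to transformers by name and so removes the need for a custom column selector. The sketch below is an alternative on a made-up frame, not the approach used in this kernel.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical frame with one numerical and one categorical column
toy = pd.DataFrame({
    "median_income": [1.5, 3.0, 4.5, 6.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"],
})

# Each entry is (name, transformer, columns); outputs are concatenated side by side
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["median_income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_prepared = preprocess.fit_transform(toy)
print(toy_prepared.shape)  # one scaled column plus three one-hot columns
```

Either approach gives the same kind of result: a single matrix with the transformed numerical columns followed by the encoded categorical ones.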

In [None]:
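# Separate the predictors from the target, so that the target is not fed to the models as a feature
df = strat_train_set.drop(['income_cat', 'median_house_value'], axis=1)
labels = strat_train_set["median_house_value"].copy()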
cat_vars, num_vars = [], []
for var, var_type in df.dtypes.items():
    if var_type =='object':
        cat_vars.append(var)
    else:
        num_vars.append(var)

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import KNNImputer
#from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline 
from sklearn.compose import ColumnTransformer

#Custom transformer that extracts the columns passed as argument to its constructor
class FeatureSelector(BaseEstimator, TransformerMixin):
    #Class constructor
    def __init__(self, feature_names):
        # Same attribute name as the argument, so get_params() and clone() work
        self.feature_names = feature_names

    #Nothing to learn here, so just return self
    def fit(self, X, y=None):
        return self

    #Keep only the selected columns and return them as a NumPy array
    def transform(self, X, y=None):
        return X[self.feature_names].values
    
#Defining the steps in the categorical pipeline
#Note: `sparse` was renamed `sparse_output` in scikit-learn 1.2; use that name on recent versions
cat_pipeline = Pipeline( [ ( 'cat_selector', FeatureSelector(cat_vars) ),
                          ( 'one_hot_encoder', OneHotEncoder( sparse = False ) ) ] )
    
#Defining the steps in the numerical pipeline     
num_pipeline = Pipeline([
        ( 'num_selector', FeatureSelector(num_vars) ),
        ('imputer', KNNImputer(n_neighbors=5)),
        ('attribs_adder', AttributeAdder(num_vars=num_vars, add_bedrooms_per_room = True)),
        ('std_scaler', StandardScaler()),
    ])


#housing_num_tr = num_pipeline.fit_transform(housing_num)

#Combining the numerical and categorical pipelines horizontally into one full pipeline
#using FeatureUnion
full_pipeline = FeatureUnion( transformer_list = [ ( 'num_pipeline', num_pipeline ),
                                                  ( 'cat_pipeline', cat_pipeline )] 
                            )



In [None]:
#num_pipeline.
#num_pipeline.named_steps['attribs_adder'].get_feature_names()
#cat_pipeline.named_steps['one_hot_encoder'].get_feature_names()

In [None]:
df_prepared = full_pipeline.fit_transform(df)
df_prepared

In [None]:
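encoder = full_pipeline.transformer_list[1][1].named_steps['one_hot_encoder']
# OneHotEncoder.get_feature_names was replaced by get_feature_names_out in recent scikit-learn versions
get_cat_names = getattr(encoder, 'get_feature_names_out', None) or encoder.get_feature_names
all_vars = (list(full_pipeline.transformer_list[0][1].named_steps['attribs_adder'].get_feature_names()) +
            list(get_cat_names()))
all_vars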

### Select Model
We'll train several models and evaluate them using the **cross-validation** technique: we split the training set into k=7 folds, then train and evaluate the model 7 times, picking a different fold for evaluation at each iteration and training on the remaining 6 folds.

We'll use scikit-learn's `cross_val_score`, setting the `scoring` parameter to **neg_mean_squared_error** because it expects a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is actually the opposite of the MSE (i.e., a negative value). We therefore negate the scores before taking the square root to obtain the RMSE.
The evaluation scores will be saved in an array.
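
The sign convention is easier to see on a small example. The sketch below runs 7-fold cross-validation on synthetic regression data (generated with `make_regression`, purely for illustration) and converts the negated MSE scores into RMSE values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (hypothetical, for illustration only)
X, y = make_regression(n_samples=210, n_features=3, noise=10.0, random_state=0)

neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=7)

# Scores are negated MSE values, so flip the sign before taking the square root
rmse = np.sqrt(-neg_mse)
print(rmse.mean(), rmse.std())
```

The mean of the seven RMSE values estimates the model's error, and their standard deviation indicates how stable that estimate is across folds.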


In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
reg_tree = DecisionTreeRegressor(random_state=RANDOM_STATE)  # fixed seed for reproducible scores


scores = cross_val_score(reg_tree, df_prepared, labels,
                         scoring="neg_mean_squared_error",
                         cv=7)

tree_scores = np.sqrt(-scores)


In [None]:
print('CV scores', tree_scores)
print('CV best score', tree_scores.min())
print('CV mean score', tree_scores.mean())
print('CV std score', tree_scores.std())

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(df_prepared, labels)

preds = lin_reg.predict(df_prepared)
lr_rmse = np.sqrt(mean_squared_error(labels, preds))
lr_rmse

In [None]:

scores = cross_val_score(lin_reg, df_prepared, labels,
                         scoring="neg_mean_squared_error",
                         cv=7)

lr_scores = np.sqrt(-scores)
print('CV scores', lr_scores)
print('CV best score', lr_scores.min())
print('CV mean score', lr_scores.mean())
print('CV std score', lr_scores.std())

In [None]:
from xgboost import XGBRegressor
xgb = XGBRegressor(n_estimators=10, max_depth=2)
xgb.fit(df_prepared, labels)

In [None]:

#Now that the model is trained, let’s evaluate it on the training set:


preds = xgb.predict(df_prepared)
xgb_rmse = np.sqrt(mean_squared_error(labels, preds))
xgb_rmse

In [None]:
xgb = XGBRegressor(n_estimators=10, max_depth=2)
scores = cross_val_score(xgb, df_prepared, labels,
                         scoring="neg_mean_squared_error",
                         cv=7)

xgb_scores = np.sqrt(-scores)
print('CV scores', xgb_scores)
print('CV best score', xgb_scores.min())
print('CV mean score', xgb_scores.mean())
print('CV std score', xgb_scores.std())

## Model Fine-Tuning  
Once some models are selected, we need to fine-tune them. The most common way is `GridSearchCV`, which evaluates all the possible combinations of the specified hyperparameter values using cross-validation.


For example, the following code searches for the best combination of hyperparameter values for the RandomForestRegressor:


In [None]:
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor(random_state=RANDOM_STATE)  # fixed seed for reproducible results
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(df_prepared, labels)


The grid search will explore **3x4 + 1x2x3 = 18** combinations of RandomForestRegressor hyperparameter values, and it will train each model five times (since we are
using five-fold cross-validation). In other words, there will be 18 × 5 = 90 rounds of training
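
The count can be double-checked with scikit-learn's `ParameterGrid`, which enumerates the same combinations that `GridSearchCV` evaluates. A minimal sketch, repeating the grid defined above:

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

n_combinations = len(ParameterGrid(param_grid))  # 3*4 + 1*2*3
n_fits = n_combinations * 5                      # five-fold cross-validation
print(n_combinations, n_fits)  # 18 90
```

With the default `refit=True`, `GridSearchCV` additionally retrains the best combination once on the whole training set after the search.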

In [None]:
grid_search.best_params_


In [None]:
np.sqrt(-grid_search.best_score_)

In [None]:
feature_importances = grid_search.best_estimator_.feature_importances_
sorted(zip(feature_importances, all_vars), reverse=True)
#feature_importances