## KEY NOTE

This notebook is a complete guide to an end-to-end machine learning problem, built from scratch. If you are a beginner, it might help you see how to start and how to approach an ML problem. Since the dataset is fairly simple, it is a good one to start your hands-on practice with.

I followed the book by Aurélien Géron. The steps he describes are really detailed, so I decided to replicate them on my own and experiment with the concepts.

<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list"  role="tab" aria-controls="home">Notebook Navigation</h3>

[1. Project Skeleton](#1)   
[2. Load the Data](#2)  
[3. Take a Quick Look at Data Structures](#3)   
[4. Create a Test Set](#4)    
[5. Discover and Visualize Data to Gain Insights](#5)  
&nbsp;&nbsp;&nbsp;&nbsp;[a. Visualizing Geographical Data](#5a)   
&nbsp;&nbsp;&nbsp;&nbsp;[b. Looking for Correlations](#5b)       
&nbsp;&nbsp;&nbsp;&nbsp;[c. Experimenting with Feature Combinations](#5c)     
[6. Preparing Data for Machine Learning Algorithms](#6)     
&nbsp;&nbsp;&nbsp;&nbsp;[a. Data Cleaning](#6a)     
&nbsp;&nbsp;&nbsp;&nbsp;[b. Handling Text and Categorical Features](#6b)     
&nbsp;&nbsp;&nbsp;&nbsp;[c. Column Transformers](#6c)     
&nbsp;&nbsp;&nbsp;&nbsp;[d. Transformation Pipelines](#6d)     
[7. Select and Train a Model](#7)     
&nbsp;&nbsp;&nbsp;&nbsp;[a. Training and Evaluating on Training Set](#7a)     
&nbsp;&nbsp;&nbsp;&nbsp;[b. Better Evaluation Using Cross Validation](#7b)     
[8. Fine-Tune a Model](#8)  
&nbsp;&nbsp;&nbsp;&nbsp;[a. Grid Search](#8a)     
&nbsp;&nbsp;&nbsp;&nbsp;[b. Analyse the Best Models and Their Errors](#8b)       
[9. Evaluate Your System on Test Set](#9)    
[10. References](#10)   

<a id="1"></a>
## 1. Project Skeleton
Before starting any project, we must plan our steps and be clear about what type of problem we are tackling, which tools can be used, which cannot, and why not. This "why not" question will help you gain more insight on your ML journey. The following are the key points I took into consideration.

Staircase
* What kind of ML problem statement is it? Try to define it
* What type of data is it?
* Set a test set aside before exploring the data
* Relationships between various features, i.e. EDA
* Try your intuition about the field: 
   * Which features might affect a house price? Bedrooms? Area? Population?
* Data preprocessing: Building a pipeline for it
* Applying models to predict
* What must be the evaluation metric?
* Evaluate the model on Test data

<a id="2"></a>
## 2. Load the Data

In [None]:
# The book defines a function that downloads the data from a URL and extracts the .tgz archive,
# but since we read the data directly from Kaggle, it is not required here

import pandas as pd
housing = pd.read_csv("../input/california-housing-prices/housing.csv")

<a id="3"></a>
## 3. Take a Quick Look at Data Structures

In [None]:
#housing = load_housing_data()
housing.head()

In [None]:
housing.info()

In [None]:
housing["ocean_proximity"].value_counts()

In [None]:
housing.describe() #all null values ignored

In [None]:
#creating plots on dataset
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50,figsize=(20,15))
plt.show()

<a id="4"></a>
## 4. Create a Test Set

In [None]:
"""
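Hash-based split: each row's id is hashed, and the row goes to the test set when the last
byte of the digest falls below 256*test_ratio. Existing rows keep their assignment when the
dataset is updated, and about test_ratio of any new rows go to the test set.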
"""
import hashlib
import numpy as np

def test_set_check(identifier,test_ratio,hash):
    return hash(np.int64(identifier)).digest()[-1]<256*test_ratio
    
def split_train_test(data,test_ratio,id_column,hash=hashlib.md5):
    ids=data[id_column]
    in_test_set=ids.apply(lambda id_:test_set_check(id_,test_ratio,hash))
    return data.loc[~in_test_set],data.loc[in_test_set]
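
To see why a hash-based split stays stable, here is a minimal sketch on made-up integer ids. It is an illustration under assumptions, not the book's exact code: the helper name `in_test_set` is hypothetical, and the id is hashed through its string form rather than through `np.int64`.

```python
import hashlib

def in_test_set(identifier: int, test_ratio: float) -> bool:
    # Hash the id and look only at the last byte of the digest (0..255).
    digest = hashlib.md5(str(identifier).encode("utf-8")).digest()
    return digest[-1] < 256 * test_ratio

old_ids = list(range(1000))
new_ids = list(range(1200))  # the same rows plus 200 new ones

old_test = {i for i in old_ids if in_test_set(i, 0.2)}
new_test = {i for i in new_ids if in_test_set(i, 0.2)}

# Every id that was in the test set before the update is still in it.
stable = old_test <= new_test
# No id that was in the training set before has moved to the test set.
no_leak = all(i in old_test for i in new_test if i < 1000)
print(len(old_test) / len(old_ids), stable, no_leak)
```

The assignment depends only on the id itself, so adding rows never reshuffles the rows that were already there.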

In [None]:
#combining latitude and longitude as new column id
#housing_with_id["id"]=housing["longitude"]*1000+housing["latitude"]
#train_set1,test_set1 = split_train_test(housing_with_id,0.2,"id")

In [None]:
#housing_with_id.head()

In [None]:
# or we can use sklearn function 
#from sklearn.model_selection import train_test_split
#train_set,test_set = train_test_split(housing,test_size=0.2,random_state=42)

In [None]:
#understanding stratification
housing["median_income"].hist(bins=40)

In [None]:
#creating housing income categories
housing["income_cat"]=np.ceil(housing["median_income"]/1.5)
housing["income_cat"]=housing["income_cat"].apply(lambda x: 5 if x>5 else x)
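
The two steps above (divide by 1.5 and round up, then cap at 5) can also be written as one vectorized expression. This is a small sketch on made-up income values, not on the real dataset:

```python
import numpy as np

# Made-up median incomes (in tens of thousands of dollars).
median_income = np.array([0.5, 1.5, 1.6, 3.0, 4.4, 7.5, 15.0])

# Dividing by 1.5 limits the number of categories; ceil rounds up to a
# discrete label; np.minimum merges everything above 5 into category 5.
income_cat = np.minimum(np.ceil(median_income / 1.5), 5)
print(income_cat)
```

`np.minimum` does the same job as the `apply` with a lambda, without a Python-level loop over the rows.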

In [None]:
housing["income_cat"].hist(bins=40)

In [None]:
#stratified split
from sklearn.model_selection import StratifiedShuffleSplit

split= StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42)
for train_idx,test_idx in split.split(housing,housing["income_cat"]):
    strat_train_set=housing.loc[train_idx]
    strat_test_set=housing.loc[test_idx]
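
One way to check what stratification buys you is to compare category proportions in the test set against the full data. The sketch below uses a synthetic DataFrame (the column `value` and the category probabilities are made up) and `iloc`, because the splitter returns positional indices:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-in for the housing data with a skewed category column.
toy = pd.DataFrame({
    "income_cat": rng.choice([1, 2, 3, 4, 5], size=1000,
                             p=[0.05, 0.30, 0.35, 0.20, 0.10]),
    "value": rng.normal(size=1000),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(toy, toy["income_cat"]))
strat_test = toy.iloc[test_idx]
_, random_test = train_test_split(toy, test_size=0.2, random_state=42)

def proportions(df):
    return df["income_cat"].value_counts(normalize=True).sort_index()

overall = proportions(toy)
# Largest gap between a test-set proportion and the overall proportion.
strat_err = (proportions(strat_test) - overall).abs().max()
rand_err = (proportions(random_test) - overall).abs().max()
print(strat_err, rand_err)
```

The stratified test set tracks the overall proportions almost exactly, while a purely random split can drift by sampling chance.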

In [None]:
#dropping income category from test and train splits
a= (strat_train_set,strat_test_set)

In [None]:
for i in a:
    i.drop(["income_cat"],axis=1,inplace=True)

<a id="5"></a>
## 5. Discover and Visualize Data to Gain Insights
Explore a copy of the training set, so that the test set stays untouched until the final evaluation.

<a id="5a"></a>
### a. Visualizing Geographical Data

In [None]:
data =strat_train_set.copy()
data.head()

In [None]:
# since the data has latitude and longitude, a scatter plot is a good idea
# set alpha=0.1 to make the dense areas stand out
data.plot(kind="scatter",x="longitude",y="latitude",alpha=0.1)

In [None]:
#advanced scatter plot using median value of house
data.plot(kind="scatter",x="longitude",y="latitude",alpha=0.4,
         s=data["population"]/100,label="population",
         c="median_house_value",cmap=plt.get_cmap("jet"),
         colorbar=True)
plt.legend()

<a id="5b"></a>
### b. Looking for Correlations

In [None]:
# Calculate Pearson's r on the numeric columns only (ocean_proximity is text)
corr_matrix=data.select_dtypes(include="number").corr()
corr_matrix

In [None]:
corr_matrix["median_house_value"].sort_values(ascending=False)
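
Pearson's r only measures linear relationships, which is worth keeping in mind when reading the ranking above. The sketch below computes r by hand on made-up data; the function name `pearson_r` is just for illustration:

```python
import numpy as np

def pearson_r(x, y):
    # Covariance of x and y divided by the product of their standard deviations.
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

x = np.linspace(-1.0, 1.0, 201)
r_linear = pearson_r(x, 3 * x + 2)   # perfectly linear relationship
r_quadratic = pearson_r(x, x ** 2)   # strong, but not linear
print(r_linear, r_quadratic)
```

The quadratic case has a coefficient close to zero even though `y` is fully determined by `x`, so a low correlation does not mean a feature is useless.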

In [None]:
#scatter matrix from pandas

from pandas.plotting import scatter_matrix
attributes=["median_house_value","median_income","total_rooms","housing_median_age"]
scatter_matrix(data[attributes],figsize=(12,8))

In [None]:
#exploring more on median income
data.plot(kind="scatter",x="median_income",y="median_house_value",alpha=0.1)

<a id="5c"></a>
### c. Experimenting with Feature Combinations
We'll try to create new features that are more relevant.

In [None]:
data["rooms_per_household"]=data["total_rooms"]/data["households"]
data["bedrooms_per_room"]=data["total_bedrooms"]/data["total_rooms"]
data["population_per_household"]=data["population"]/data["households"]

In [None]:
#let's check the correlation matrix again (numeric columns only)
corr_matrix=data.select_dtypes(include="number").corr()
corr_matrix["median_house_value"].sort_values(ascending=False)

<a id="6"></a>
## 6. Preparing Data for Machine Learning Algorithms

Let's start by separating the predictors and labels of our original training set into copies that we can work on.


In [None]:
housing=strat_train_set.drop("median_house_value",axis=1)
housing_labels=strat_train_set["median_house_value"].copy()

<a id="6a"></a>
### a. Data Cleaning
Missing values can be dealt with in the following ways:
1. Get rid of the corresponding rows
2. Get rid of whole features
3. Set missing values to some value (zero, mean, median, etc)
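
The three options above can be sketched with plain pandas on a toy DataFrame. The numbers are made up; only the column names mirror the dataset:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

option1 = toy.dropna(subset=["total_bedrooms"])   # 1. drop the affected rows
option2 = toy.drop("total_bedrooms", axis=1)      # 2. drop the whole feature
median = toy["total_bedrooms"].median()           # 3. fill with a value (median here)
option3 = toy.fillna({"total_bedrooms": median})
print(option1.shape, option2.shape, option3["total_bedrooms"].tolist())
```

With option 3, the median should be computed on the training set and stored, so that the same value can later fill gaps in the test set and in new data.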

In [None]:
# we will use SimpleImputer from sklearn.impute
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(strategy="median")

In [None]:
# the median can only be computed on numerical features, so we create a copy of the data without the text column
housing_num=housing.drop("ocean_proximity",axis=1)
imputer.fit(housing_num)

In [None]:
# you can view these values
imputer.statistics_

In [None]:
# use this imputer to transform
X=imputer.transform(housing_num)

In [None]:
housing_tr=pd.DataFrame(X,columns=housing_num.columns)

In [None]:
housing_tr.isnull().sum()

<a id="6b"></a>
### b. Handling Text and Categorical Features
We will now handle the text feature "ocean_proximity" that we dropped earlier, since text cannot be fed directly into most ML models.

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()
housing_cat=housing["ocean_proximity"]
housing_cat_encoded=encoder.fit_transform(housing_cat)
housing_cat_encoded

In [None]:
encoder.classes_

Since classes are not ordinal, we will one-hot encode them

In [None]:
from sklearn.preprocessing import OneHotEncoder
encoder=OneHotEncoder()
housing_cat_1hot=encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot

In [None]:
#converting the sparse matrix to array
housing_cat_1hot.toarray()
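
As a side note, recent scikit-learn versions (0.20 and later) let `OneHotEncoder` take string categories directly, so the `LabelEncoder` step can be skipped. A minimal sketch on a made-up column follows; `handle_unknown="ignore"` is an optional choice here, not something the notebook uses:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cats = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"]})

# handle_unknown="ignore" encodes a category not seen during fit as an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(cats).toarray()

unseen = pd.DataFrame({"ocean_proximity": ["ISLAND"]})
unseen_row = encoder.transform(unseen).toarray()
print(encoder.categories_[0], one_hot, unseen_row)
```

Each column of the result corresponds to one entry of `encoder.categories_[0]`, in sorted order.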

<a id="6c"></a>
### c. Column Transformers

To apply the feature combinations we experimented with earlier in a repeatable way, we can define a custom transformer.

In [None]:
from sklearn.base import BaseEstimator,TransformerMixin

rooms_ix,bedrooms_ix,population_ix,household_ix=3,4,5,6

class FeatureAdder(BaseEstimator, TransformerMixin):
    def __init__(self,add_bedrooms_per_room=True):
        self.add_bedrooms_per_room=add_bedrooms_per_room
    def fit(self,X,y=None):
        return self
    def transform(self,X,y=None):
        rooms_per_household=X[:,rooms_ix]/X[:,household_ix]
        population_per_household=X[:,population_ix]/X[:,household_ix]
        
        if self.add_bedrooms_per_room:
            bedrooms_per_room=X[:,bedrooms_ix]/X[:,rooms_ix]
            return np.c_[X,rooms_per_household,population_per_household,bedrooms_per_room]
        else:
            return np.c_[X,rooms_per_household,population_per_household]

In [None]:
#lets instantiate our object
adder= FeatureAdder(add_bedrooms_per_room=False)
housing_extra_features =adder.fit_transform(housing.values)

Alternatively, we can use FunctionTransformer, which builds an equivalent transformer from a plain function.

In [None]:
from sklearn.preprocessing import FunctionTransformer

rooms_ix,bedrooms_ix,population_ix,household_ix=3,4,5,6

def extra_features(X,add_bedrooms_per_room=True):
    rooms_per_household=X[:,rooms_ix]/X[:,household_ix]
    population_per_household=X[:,population_ix]/X[:,household_ix]
    if add_bedrooms_per_room:
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
        return np.c_[X, rooms_per_household, population_per_household,
                     bedrooms_per_room]
    else:
        return np.c_[X, rooms_per_household, population_per_household]

In [None]:
feature_adder =FunctionTransformer(extra_features,validate=False,
                                  kw_args={"add_bedrooms_per_room":False})
housing_extra_features =feature_adder.fit_transform(housing.values)

housing_extra_feat = pd.DataFrame(
    housing_extra_features,
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"],
    index=housing.index)
housing_extra_feat.head()

<a id="6d"></a>
### d. Transformation Pipelines

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

num_attribs=list(housing_num)
cat_attribs=["ocean_proximity"]

num_pipeline=Pipeline([
    ("imputer",SimpleImputer(strategy="median")),
    ("feature_adder",FeatureAdder()),
    ("std_scaler",StandardScaler()),
])

full_pipeline=ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

In [None]:
housing_prepared = full_pipeline.fit_transform(housing)
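
To make the fit/transform distinction concrete, here is a small self-contained sketch of the same pattern on a toy DataFrame. The column names and values are made up, and the custom feature step is left out to keep it short:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "rooms": [4.0, 6.0, np.nan, 8.0],
    "income": [2.0, 4.0, 6.0, 8.0],
    "proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
})
new = pd.DataFrame({"rooms": [np.nan], "income": [5.0], "proximity": ["INLAND"]})

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
prep = ColumnTransformer([
    ("num", num_pipe, ["rooms", "income"]),
    ("cat", OneHotEncoder(), ["proximity"]),
])

train_prepared = prep.fit_transform(train)  # medians, means and categories are learned here
new_prepared = prep.transform(new)          # and only reused here
print(train_prepared.shape, new_prepared)
```

The missing `rooms` value in the new row is filled with the training median, and both numeric columns are scaled with the training mean and standard deviation. That is why later cells call `transform`, not `fit_transform`, on new data.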

<a id="7"></a>
## 7. Select and Train a Model

<a id="7a"></a>
### a. Training and Evaluating on Training Set

In [None]:
from sklearn.linear_model import LinearRegression
lin_reg=LinearRegression()
lin_reg.fit(housing_prepared,housing_labels)

In [None]:
some_data=housing.iloc[:5]
some_data

In [None]:
housing_labels.iloc[:5]

In [None]:
some_prepared_data = full_pipeline.transform(some_data)

In [None]:
lin_reg.predict(some_prepared_data)

In [None]:
#calculate mean squared error
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse= mean_squared_error(housing_labels,housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

This model is underfitting the training data.

In [None]:
# try another model

from sklearn.tree import DecisionTreeRegressor
tree_reg=DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

In [None]:
housing_predictions=tree_reg.predict(housing_prepared)
tree_mse= mean_squared_error(housing_labels,housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

The model above is badly overfitting: it has memorized every value in the training set.
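
A near-zero training error from an unconstrained decision tree is easy to reproduce on synthetic data. The sketch below uses a made-up noisy sine curve, not the housing data, and compares the training RMSE with a held-out estimate:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)  # signal plus noise

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
train_rmse = np.sqrt(np.mean((tree.predict(X) - y) ** 2))

# scikit-learn scorers follow "greater is better", so MSE comes back negated.
scores = cross_val_score(tree, X, y, scoring="neg_mean_squared_error", cv=5)
cv_rmse = np.sqrt(-scores).mean()
print(train_rmse, cv_rmse)
```

The tree reproduces its training labels exactly, noise included, while the error on held-out folds is clearly larger. That gap is the signature of overfitting.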

<a id="7b"></a>
### b. Better Evaluation Using Cross Validation

In [None]:
# using cross_val_score

from sklearn.model_selection import cross_val_score
scores=cross_val_score(tree_reg,housing_prepared,housing_labels,scoring="neg_mean_squared_error",cv=10)
rmse_scores=np.sqrt(-scores)

In [None]:
rmse_scores

In [None]:
#lets view all scores
def display_scores(scores):
    print("Scores:",scores)
    print("Mean:",scores.mean())
    print("Standard Deviation:",scores.std())

display_scores(rmse_scores)

In [None]:
lin_scores=cross_val_score(lin_reg,housing_prepared,housing_labels,scoring="neg_mean_squared_error",cv=10)
lin_rmse_scores=np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

In [None]:
# try with randomforest
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
rf_scores=cross_val_score(forest_reg,housing_prepared,housing_labels,scoring="neg_mean_squared_error",cv=5)
rf_rmse_scores=np.sqrt(-rf_scores)
display_scores(rf_rmse_scores)

<a id="8"></a>
## 8. Fine-Tune the Model

<a id="8a"></a>
### a. Grid Search

In [None]:
#lets use GridSearchCV for hyperparameter tuning
from sklearn.model_selection import GridSearchCV
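# a list of two dicts: each dict is expanded into its own grid
# (repeating a key inside a single dict would silently keep only the last value)
param_grid=[
    {'n_estimators':[3,10,30],'max_features':[2,4,6,8]},
    {'bootstrap':[False,True],'n_estimators':[3,10],'max_features':[2,3,4]},
]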

forest_reg=RandomForestRegressor()
grid_search=GridSearchCV(forest_reg,param_grid,cv=5,scoring="neg_mean_squared_error")
grid_search.fit(housing_prepared,housing_labels)
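
Before launching a search it can help to count how many models will be trained. The sketch below uses `ParameterGrid` on a hypothetical grid written as a list of two dicts; the values are for illustration and need not match the grid used in this notebook:

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: each dict in the list is expanded separately.
grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]
n_candidates = len(ParameterGrid(grid))  # 3*4 from the first dict + 1*2*3 from the second
n_fits = n_candidates * 5                # each candidate is trained once per fold with cv=5

# Repeating a key inside one dict literal keeps only the last value.
collapsed = {"n_estimators": [3, 10, 30], "n_estimators": [3, 10]}
print(n_candidates, n_fits, collapsed)
```

The number of fits grows multiplicatively with every added hyperparameter value, which is the main cost to keep in mind with grid search.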

In [None]:
grid_search.best_params_

In [None]:
#getting the best model
grid_search.best_estimator_

In [None]:
#scores
cvres=grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"],cvres["params"]):
    print(np.sqrt(-mean_score), params)

In [None]:
pd.DataFrame(grid_search.cv_results_)

<a id="8b"></a>
### b. Analyse the Best Models and Their Errors

In [None]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

In [None]:
# display with feature names
extra_attribs=["rooms_per_hhold","population_per_hhold","bedrooms_per_room"] # same order as the FeatureAdder output
cat_encoder = full_pipeline.named_transformers_["cat"] # fetch the transformer named "cat" from the full pipeline
cat_one_hot_features = list(cat_encoder.categories_[0])
features = num_attribs + extra_attribs + cat_one_hot_features

In [None]:
sorted(zip(feature_importances, features), reverse=True)

<a id="9"></a>
## 9. Evaluate Your System on the Test Set

In [None]:
final_model= grid_search.best_estimator_

In [None]:
X_test= strat_test_set.drop("median_house_value",axis=1)
y_test= strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)

In [None]:
final_predictions= final_model.predict(X_test_prepared)

In [None]:
final_mse= mean_squared_error(y_test,final_predictions)
final_rmse=np.sqrt(final_mse)

In [None]:
final_rmse

### That's the final test score
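
A single RMSE is a point estimate, so it can be useful to attach a rough confidence interval to it. The sketch below is an assumption-laden illustration: the labels and predictions are synthetic stand-ins for `y_test` and `final_predictions`, and the interval uses a normal approximation (1.96 standard errors) on the squared errors.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins for the true test labels and the model's predictions.
y_true = rng.normal(200_000, 50_000, size=2_000)
y_pred = y_true + rng.normal(0, 40_000, size=2_000)

squared_errors = (y_pred - y_true) ** 2
rmse = np.sqrt(squared_errors.mean())

# Approximate 95% interval: mean squared error +/- 1.96 standard errors,
# then a square root to get back to the units of the target.
sem = squared_errors.std(ddof=1) / np.sqrt(len(squared_errors))
low = np.sqrt(squared_errors.mean() - 1.96 * sem)
high = np.sqrt(squared_errors.mean() + 1.96 * sem)
print(rmse, (low, high))
```

The width of the interval shrinks as the test set grows, which is one reason not to make the test set too small.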

<a id="10"></a>
## 10. References

Link to the book I followed: [Hands-On Machine Learning with Scikit-Learn and TensorFlow](https://www.amazon.in/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291/ref=sr_1_1?dchild=1&keywords=handson+sklearn&qid=1599399632&sr=8-1) - *Aurélien Géron*
    
Top 5 conceptual books you might want to look at:
https://www.kaggle.com/getting-started/171809
    

### Do Upvote if you like :)