Checklist to start a new project

    1. Understand the business domain.
    2. Identify the business objective.
    3. Learn how the company expects to use and benefit from the model.
    4. Find out what the current solution looks like.
 
 
These points are important to settle before starting work on any model, because they determine how you frame the problem, which algorithms you select, which performance measure you use to evaluate the model, and how much effort you spend tweaking it. 

ML Workflow

1. **Fetch data from various data sources.**
    
2. **Data Cleaning** <br>
    *There are two kinds of problems we commonly face in the data preparation stage.*
        a) Missing Values
        b) Categorical Variables
        
     *Handle Missing Values:*
          There are different approaches to handling missing values.
              a) Get rid of the corresponding rows.
              b) Get rid of the entire column. 
              c) Set the missing values to some value, depending on the situation: 
                  Mean or Median 
                  Expectation Maximization Algorithm 
                  Nearest Neighbour
                  
      *Handle Categorical Variables:*
              Most machine learning algorithms prefer to work with numbers, so we convert categorical     
              variables into numbers.
              
              There are two types: 
                  a) Nominal  => Named categories with no order (e.g. red, green, blue)
                      Use the techniques below to convert nominal values to numerical values:
                          1. One Hot Encoding (one binary column per category value)
                          2. One Hot Encoding With Many Categorical Variables.
                     
                  b) Ordinal  => Categories with an implied order (e.g. small, medium, large)
                          1. Binary Encoding     
                          2. Label Encoding

3. **Prepare Data For Test/Train/Validation** <br>
       Key points to remember when preparing the train, test, and validation datasets:
           a) Do not generate a new test set on every run; keep the split fixed (for example with a random seed).
           b) Ensure that the test set is representative of the whole dataset to avoid sampling bias.
           
4. **Discover and Visualize the data to Gain Insights** <br>
       Make sure you have put the test set aside and are exploring only the training set. 
       Understand the problem, and use plots to check that the training data reflects what we want to 
       achieve. 
       
5. **Feature Scaling** <br>
       It is one of the most important transformations we need to apply to our data. With few exceptions, Machine 
       Learning algorithms don't perform well when the input numerical attributes have very different scales.
       There are two common ways to get all attributes onto the same scale.
           a) Min-max scaling (Normalization)
               Values are shifted and rescaled so that they end up ranging from 0 to 1. 
               $X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$
               
           b) Standardization
               It works differently: it first subtracts the mean value and then divides by the standard 
               deviation, so that the resulting distribution has zero mean and unit variance. 
               $X_{scaled} = \frac{X - X_{mean}}{X_{std}}$
               Unlike min-max scaling, it doesn't bound values to a specific range, which may be a problem for 
               some algorithms (e.g. neural networks often expect input values ranging from 0 to 1). However, 
               standardization is much less affected by outliers. 
               
6. **Choose Algorithm & Train Model** <br>
        Train the model, then evaluate it using test or cross-validation scores. 
        
7. **Tuning Hyperparameters**<br>
        In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins. 
        By contrast, the values of other parameters are derived via training.
        
        Hyperparameters can be classified as model hyperparameters, which cannot be inferred while fitting the 
        model to the training set because they refer to the model selection task, or algorithm hyperparameters, 
        which in principle have no influence on the performance of the model but affect the speed and quality of the 
        learning process.
        
        There are different ways to tune the hyperparameters.
        1. Grid Search (sets up a grid of hyperparameter values and, for each combination, trains a model and 
            scores it on validation data)
        2. Randomized Search (sets up a grid of hyperparameter values and selects random combinations to train 
            the model and score)
        3. Bayesian Optimization (typically uses a Gaussian process to pick promising hyperparameter values)
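
The first two tuning strategies can be sketched with scikit-learn's `GridSearchCV` and `RandomizedSearchCV`. This is only an illustration under assumptions: the synthetic regression data, the `Ridge` estimator, and the `alpha` grid are placeholders rather than part of this notebook's housing workflow. Both searches score candidates by cross-validation on the training data, so the test set stays untouched.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data stands in for a prepared training set (assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hypothetical grid for the regularization strength of Ridge.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# Grid search: evaluates every value in the grid with 5-fold cross-validation.
grid_search = GridSearchCV(Ridge(), param_grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(X, y)

# Randomized search: evaluates only n_iter candidates sampled from the same grid.
random_search = RandomizedSearchCV(Ridge(), param_grid, n_iter=3, cv=5,
                                   scoring="neg_mean_squared_error",
                                   random_state=0)
random_search.fit(X, y)

print("Grid search best params:", grid_search.best_params_)
print("Randomized search best params:", random_search.best_params_)
```

Grid search is exhaustive, so its cost grows with the product of all grid sizes; randomized search trades completeness for a fixed budget of `n_iter` fits per fold.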
                

In [1]:
from sklearn.experimental import enable_iterative_imputer
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.impute import IterativeImputer
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
import plotly.express as px
import matplotlib.pyplot as plt
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import BayesianRidge

#### Fetch Data

In [2]:
df = pd.read_csv("/Users/ukannika/work/personal/machine-learning/datasets/housing.csv", sep=",")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [None]:
df.describe()

#### Data Cleaning

#### Handle Missing Data

   **Univariate Imputation [SimpleImputer]** <br>
        A univariate imputation algorithm imputes values in the i-th feature dimension using 
        only the non-missing values in that feature dimension.
        Supported strategies => mean, median, most_frequent, constant

   **Multivariate Imputation [IterativeImputer]** <br>  [currently experimental in scikit-learn]
        Multivariate imputation algorithms use the entire set of available feature dimensions to estimate the 
        missing values. 
        
   **KNNImputer**
        Imputation for completing missing values using k-Nearest Neighbors.
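
To make the difference concrete, here is a small sketch on a toy array (the array is an assumption, not the housing data). `SimpleImputer` fills the gap with the column mean, while `KNNImputer` averages the values of the two rows that are closest in the remaining feature.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy data: the second row is missing its second feature.
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [4.0, 40.0]])

# Univariate: uses only the non-missing values of the same column.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# KNN: uses the rows nearest to the incomplete row (distance on the first feature).
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)

print("Mean imputation:", mean_imputed[1, 1])  # (10 + 30 + 40) / 3
print("KNN imputation: ", knn_imputed[1, 1])   # mean of neighbours 10 and 30
```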

In [3]:
# Note: the `verbose` argument of SimpleImputer was deprecated and later removed from scikit-learn, so it is omitted.
simple_imputer = SimpleImputer(missing_values=np.nan, strategy='mean', fill_value=None, 
                               copy=True, add_indicator=False)

simple_imputer_categorical = SimpleImputer(missing_values=np.nan, strategy='most_frequent', fill_value=None, 
                                           copy=True, add_indicator=False)

iterative_imputer = IterativeImputer(estimator= BayesianRidge(), missing_values=np.nan, 
                                     sample_posterior=False, max_iter=10, tol=0.001, n_nearest_features=None, 
                                     initial_strategy='mean', imputation_order='ascending', skip_complete=False, 
                                     min_value=None, max_value=None, verbose=0, random_state=None,
                                     add_indicator=False)

knn_imputer = KNNImputer(missing_values=np.nan, n_neighbors=10, weights='distance', metric='nan_euclidean', copy=True, add_indicator=False)

#### Handle Categorical Variables

In [4]:
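# Nominal
# Note: `sparse` was renamed to `sparse_output` in newer scikit-learn releases; the default (sparse output) is used here.
one_hot_encoder = OneHotEncoder(categories='auto', drop=None, dtype=np.float64, handle_unknown='error')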

# Ordinal
ordinal_encoder = OrdinalEncoder(categories='auto', dtype=np.float64)
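
As a quick illustration of what the two encoders produce, the sketch below runs them on toy columns (the `colors` and `sizes` arrays are assumptions, not part of the housing data). For the ordinal case the category order is passed explicitly, since the default ordering is alphabetical and would not reflect small < medium < large.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])      # nominal
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])   # ordinal

# One-hot: one binary column per category, ordered alphabetically (blue, green, red).
one_hot = OneHotEncoder().fit_transform(colors).toarray()

# Ordinal: a single column of integers that follows the order we specify.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]]).fit_transform(sizes)

print(one_hot)
print(ordinal.ravel())
```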

#### Discover and Visualize the data to Gain Insights

In [None]:
fig = px.scatter_matrix(df,
    dimensions=["median_house_value", "median_income", "total_rooms", "housing_median_age"],
    color="median_house_value")

fig.update_layout(
    title='Housing Data set',
    dragmode='select',
    width=800,
    height=800,
    hovermode='closest',
)

fig.show()

#### Prepare Data For [Train, Test, Validation]

The split is done with scikit-learn's `train_test_split`. Its main parameters are:

**arrays**  
    Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

**test_size**
    If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. 
    If int, represents the absolute number of test samples. If None, the value is set to the complement of the train 
    size. If train_size is also None, it will be set to 0.25.

**train_size**
    If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. 
    If int, represents the absolute number of train samples. If None, the value is automatically set to the complement 
    of the test size.

**random_state**
    If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the 
    random number generator; If None, the random number generator is the RandomState instance used by np.random.

**shuffle**
    Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.

**stratify**
    If not None, data is split in a stratified fashion, using this as the class labels.
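
A minimal sketch of how these parameters interact, using toy arrays (an assumption, not the housing data): with `stratify=y`, the 80/20 class ratio of the labels is preserved in both the train and the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)   # imbalanced labels: 80% zeros, 20% ones

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True, stratify=y)

print("Train/Test shapes:", X_train.shape, X_test.shape)
print("Share of class 1 in train:", y_train.mean())
print("Share of class 1 in test: ", y_test.mean())
```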

#### Find attributes that most represent the whole dataset

In our data, median income is a very important attribute for predicting median housing prices.
Identifying such important attributes helps us ensure that the test set is representative of the various 
income categories in the whole dataset. 

For our purposes we consider only one attribute, median_income. 

Since median_income is a continuous numerical attribute, we first need to create an income_category attribute.

In [None]:
px.histogram(x=df['median_income'])

As we can see, median income values are clustered around 2 to 5 (roughly 20,000 to 50,000 USD, since the attribute is expressed in tens of thousands of dollars), but some median incomes go far beyond 6. It is important to have a sufficient number of instances in your dataset for each stratum, or else the estimate of the stratum's importance may be biased. 

In [5]:
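df['income_category'] = np.ceil(df['median_income']/1.5)
# Rows with median_income >= 5 go to category 5; assign the result back rather than relying on inplace=True.
df['income_category'] = df['income_category'].where(df['median_income'] < 5, 5.0)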

In [None]:
px.histogram(x=df['income_category'])

In [6]:
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42, shuffle=True, stratify=df['income_category'])

# Now we should remove the income_category attribute, so the data is back to its original state:
train_set = train_set.drop("income_category", axis=1)
test_set = test_set.drop("income_category", axis=1)

print("Train Dataset Shape %s" % (train_set.shape,))
print("Test Dataset Shape %s" % (test_set.shape,))

Train Dataset Shape (16512, 10)
Test Dataset Shape (4128, 10)
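
One way to check that a stratified split is representative is to compare the category proportions in the test set with those of the full dataset. The sketch below does this on synthetic, skewed income values (the `toy` DataFrame and the lognormal parameters are assumptions, not the housing data), and uses `clip` as a simple way to cap the categories at 5.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic, right-skewed stand-in for median_income (assumption).
toy = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})
toy["income_category"] = np.ceil(toy["median_income"] / 1.5).clip(upper=5.0)

# Same test size and seed, with and without stratification.
strat_train, strat_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["income_category"])
rand_train, rand_test = train_test_split(toy, test_size=0.2, random_state=42)

# Proportion of each income category in the full data and in both test sets.
comparison = pd.DataFrame({
    "overall": toy["income_category"].value_counts(normalize=True),
    "stratified": strat_test["income_category"].value_counts(normalize=True),
    "random": rand_test["income_category"].value_counts(normalize=True),
}).sort_index()
print(comparison)
```

The stratified column should track the overall column almost exactly, while the purely random split can drift, especially for the smaller categories.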


In [7]:
housing_features = train_set.drop(columns="median_house_value") 
housing_labels = train_set["median_house_value"].copy()

In [None]:
housing_features.head(5)

In [None]:
housing_labels.head(5)

#### Feature Scaling

In [None]:
fig = px.scatter_matrix(df,
    dimensions=["median_house_value", "median_income", "total_rooms", "housing_median_age"],
    color="median_house_value")

fig.update_layout(
    title='Housing Data set',
    dragmode='select',
    width=800,
    height=800,
    hovermode='closest',
)

fig.show()

In [None]:
# Check for outliers
fig = make_subplots(rows=3, cols=3)

fig.add_trace(
    go.Box(y=housing_features["total_rooms"]),
    row=1, col=1
)

fig.add_trace(
    go.Box(y=housing_features["median_income"]),
    row=1, col=2
)

fig.add_trace(
    go.Box(y=housing_features["housing_median_age"]),
    row=2, col=1
)

fig.add_trace(
    go.Box(y=housing_features["total_bedrooms"]),
    row=2, col=2
)

fig.add_trace(
    go.Box(y=housing_features["population"]),
    row=3, col=1
)

fig.add_trace(
    go.Box(y=housing_features["households"]),
    row=3, col=2
)

fig.update_layout(height=600, width=800, title_text="Before Feature Scaling")
fig.show()

In [8]:
std_scalar = StandardScaler(copy=True, with_mean=True, with_std=True)
min_max_scalar = MinMaxScaler(copy=True, feature_range=(0, 1))

# These arrays are used only for visualization and understanding after feature scaling.
# We don't need to call fit_transform explicitly here; best practice is to do the preprocessing in a scikit-learn Pipeline. 
after_feature_scaling_std_scalar_visualization = std_scalar.fit_transform(train_set.iloc[:, :9])
after_feature_scaling_min_max_scalar_visualization = min_max_scalar.fit_transform(train_set.iloc[:, :9])
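
As a sanity check on what the two scalers compute, the sketch below applies the formulas by hand to a toy column with one outlier and compares the result with scikit-learn (the array `x` is an assumption, not the housing data). Min-max scaling uses (x - min) / (max - min); standardization uses (x - mean) / std.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # the last value is an outlier

# Manual formulas.
min_max_manual = (x - x.min()) / (x.max() - x.min())
standard_manual = (x - x.mean()) / x.std()  # population std (ddof=0), as StandardScaler uses

# scikit-learn equivalents.
min_max_sklearn = MinMaxScaler().fit_transform(x)
standard_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(min_max_manual, min_max_sklearn))
print(np.allclose(standard_manual, standard_sklearn))
print(min_max_manual.ravel())  # the outlier squeezes the other values towards 0
```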

In [None]:
# After feature scaling (StandardScaler): plot the same features as in the "before" figure.
# Columns are looked up by name so each box lines up with its counterpart before scaling.
scaled_columns = list(train_set.columns[:9])
features_to_plot = ["total_rooms", "median_income", "housing_median_age",
                    "total_bedrooms", "population", "households"]

fig = make_subplots(rows=3, cols=3)

for position, feature in enumerate(features_to_plot):
    fig.add_trace(
        go.Box(y=after_feature_scaling_std_scalar_visualization[:, scaled_columns.index(feature)],
               name=feature),
        row=position // 2 + 1, col=position % 2 + 1
    )

fig.update_layout(height=600, width=800, title_text="After Feature Scaling (StandardScaler)")
fig.show()

In [None]:
# After feature scaling (MinMaxScaler): same features and layout as the "before" figure.
scaled_columns = list(train_set.columns[:9])
features_to_plot = ["total_rooms", "median_income", "housing_median_age",
                    "total_bedrooms", "population", "households"]

fig = make_subplots(rows=3, cols=3)

for position, feature in enumerate(features_to_plot):
    fig.add_trace(
        go.Box(y=after_feature_scaling_min_max_scalar_visualization[:, scaled_columns.index(feature)],
               name=feature),
        row=position // 2 + 1, col=position % 2 + 1
    )

fig.update_layout(height=600, width=800, title_text="After Feature Scaling (MinMaxScaler)")
fig.show()

#### Scikit Learn Design

**Pipeline:** <br>
      *Scikit-learn's Pipeline class is a useful tool for encapsulating multiple transformers alongside an
      estimator in one object, so that we only have to call the important methods ( fit() , predict() , etc.) once.*

**Transformer** <br>
       *A transformer converts one dataset into another; the conversion is performed by its transform() method.*
       
 **Estimator** <br>
       *An estimator is any object that learns from data. E.g., a learning algorithm is an estimator that 
       trains on a dataset and produces a fitted model. Learning is performed by the fit() method.*
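
A minimal sketch of these three ideas working together, on toy data (the arrays and the choice of `LinearRegression` are assumptions for illustration): two transformers and a final estimator are chained, and a single `fit()` / `predict()` call runs every step in order.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data following y = 2 * x, with one missing input value.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

model = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # transformer: fills the gap with the mean (3.0)
    ("scaler", StandardScaler()),                 # transformer: standardizes the feature
    ("regressor", LinearRegression()),            # final estimator: learns from the transformed data
])

model.fit(X, y)  # fits each step in order

# New inputs go through the same fitted imputer and scaler before prediction.
predictions = model.predict(np.array([[3.0], [np.nan]]))
print(predictions)
```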
        


In [9]:
numeric_features = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 
                    'total_bedrooms', 'population', 'households', 'median_income']

categorical_features = ['ocean_proximity']

# A transformer to apply to the numerical features.
numeric_transformer = Pipeline(steps=[
    ('imputer', simple_imputer),
    ('scaler', std_scalar)])

numeric_transformer_iterative_imputer = Pipeline(steps=[
    ('imputer', iterative_imputer),
    ('scaler', std_scalar)])

numeric_transformer_knn_imputer = Pipeline(steps=[
    ('imputer', knn_imputer),
    ('scaler', std_scalar)])

categorical_transformer = Pipeline(steps=[
    ('imputer', simple_imputer_categorical),
    ('onehot', one_hot_encoder)])

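# Defined for completeness but not used below: KNNImputer expects numeric input,
# so it would not work on the string-valued ocean_proximity column.
categorical_transformer_knn_imputer = Pipeline(steps=[
    ('imputer', knn_imputer),
    ('onehot', one_hot_encoder)])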

In [10]:
preprocessor_simple_imputer = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

preprocessor_iterative_imputer = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer_iterative_imputer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

preprocessor_knn_imputer = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer_knn_imputer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
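
The following sketch shows the same pattern on a tiny DataFrame so the result is easy to inspect. The `toy` data, its column names, and the use of `handle_unknown="ignore"` are assumptions for illustration. The preprocessor is fitted on the training rows only and then reused, via `transform`, on the held-out row.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
})

preprocessor = ColumnTransformer(transformers=[
    # Numeric column: impute with the median, then standardize.
    ("num", Pipeline(steps=[("imputer", SimpleImputer(strategy="median")),
                            ("scaler", StandardScaler())]), ["rooms"]),
    # Categorical column: one binary column per category.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["proximity"]),
])

train, test = toy.iloc[:3], toy.iloc[3:]
train_prepared = preprocessor.fit_transform(train)  # learn statistics from training rows only
test_prepared = preprocessor.transform(test)        # reuse those statistics on unseen rows

print(train_prepared.shape, test_prepared.shape)
print(test_prepared)
```

Each named entry routes a list of columns through its own transformer, and the outputs are concatenated side by side in the order the entries are listed.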

In [11]:
preprocessed_simple_imputer_train_set = preprocessor_simple_imputer.fit_transform(housing_features)
preprocessed_iterative_imputer_train_set = preprocessor_iterative_imputer.fit_transform(housing_features)
preprocessed_knn_imputer_train_set = preprocessor_knn_imputer.fit_transform(housing_features)

In [None]:
print(df['ocean_proximity'].unique())
print(preprocessed_simple_imputer_train_set.shape)
print(preprocessed_iterative_imputer_train_set.shape)
print(preprocessed_knn_imputer_train_set.shape)
preprocessed_knn_imputer_train_set[1, :]

In [None]:
print("Check if there are any nan values in training sets: ")
print(np.isnan(preprocessed_simple_imputer_train_set).any())
print(np.isnan(preprocessed_iterative_imputer_train_set).any())
print(np.isnan(preprocessed_knn_imputer_train_set).any())

#### Finally, save the train and test datasets

In [12]:
np.savetxt(X=preprocessed_simple_imputer_train_set, 
           fname="/Users/ukannika/work/personal/machine-learning/datasets/housing_features_simple_imputer.csv", 
           fmt="%1.3f", delimiter=",")

np.savetxt(X=preprocessed_iterative_imputer_train_set, 
           fname="/Users/ukannika/work/personal/machine-learning/datasets/housing_features_iterative_imputer.csv", 
           fmt="%1.3f", delimiter=",")

np.savetxt(X=preprocessed_knn_imputer_train_set, 
           fname="/Users/ukannika/work/personal/machine-learning/datasets/housing_features_knn_imputer.csv", 
           fmt="%1.3f", delimiter=",")

In [13]:
# Write target values
housing_labels.to_csv("/Users/ukannika/work/personal/machine-learning/datasets/housing_labels.csv", sep=","
                      ,header=None, index=None)

In [14]:
# Write test set
X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()

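# Reuse the statistics learned from the training set; refitting on the test set would leak information.
X_test_prepared = preprocessor_simple_imputer.transform(X_test)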
np.savetxt(X=X_test_prepared, 
           fname="/Users/ukannika/work/personal/machine-learning/datasets/test_housing_features.csv", 
           fmt="%1.3f", delimiter=",")
y_test.to_csv("/Users/ukannika/work/personal/machine-learning/datasets/test_housing_labels.csv", sep=",", 
              header=None,index=False)