## Introduction  

In this mini workshop, we are going to learn some advanced features in [scikit-learn](https://scikit-learn.org/stable/user_guide.html) that can deepen your understanding of machine learning and make your modeling workflow more efficient.  

This time, we will use a built-in dataset from scikit-learn as an example for our exercise. The dataset we are using is the [Boston house price](https://scikit-learn.org/stable/datasets/index.html#toy-datasets) dataset for regression modeling. 

Let's start the notebook by importing the packages and the data.  

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
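# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires an older scikit-learn version.
from sklearn.datasets import load_boston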

In [None]:
X, y = load_boston(return_X_y=True)
print(X.shape, y.shape)

It looks like we have 13 features that we can use to create a predictive model for the target/output, which is the house price in Boston.

## Pipeline  

Over the past two days, we have been developing models step by step, following the typical process of:

1. Feature selection/transformation;
2. Defining the model;
3. Tuning model hyperparameters.

This is a clear way to learn each individual step. However, the code becomes lengthy. In scikit-learn, you can create a model [pipeline](https://scikit-learn.org/stable/modules/compose.html#pipeline-chaining-estimators) that chains all model steps into a sequence and gives each step a name (keyword). You can also apply grid search CV directly to your pipeline.
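
As a rough illustration of the idea, the sketch below chains PCA and a random forest into one pipeline and tunes both steps with a single grid search. It uses synthetic data from `make_regression` with 13 features as a stand-in (an assumption; it is not the Boston data), and the step names `"pca"` and `"rf"`, the grid values, and the 3-fold CV are arbitrary choices made to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with 13 features (assumption: not the Boston data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=13, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Each step is a (name, estimator) pair; the names are arbitrary labels
pipe = Pipeline(steps=[
    ("pca", PCA()),
    ("rf", RandomForestRegressor(random_state=0)),
])

# Hyperparameters are addressed as <step name>__<parameter name>
param_grid = {
    "pca__n_components": [3, 6],
    "rf__n_estimators": [10, 30],
}

# The whole pipeline is refit for every parameter combination and CV fold
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
# score() uses the best pipeline found; for regressors it reports R^2
print(search.score(X_test, y_test))
```

Because the PCA step lives inside the pipeline, it is refit on the training portion of each fold, so no information from the validation folds leaks into the transformation.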

In [None]:
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
## First, let's review a traditional modeling process

## Step 1. Data transformation - e.g., PCA

## Step 2. Defining the model

## Step 3. Gridsearch CV


## Step 4. fit the model/grid search

## Step 5. find the best combo & final model

## ...

## Things can be simplified with pipeline by nesting these together


## making the pipeline


In [None]:
## Now, let's say we want to vary the number of PCA components included in our final model
## as well as the number of estimators (trees) in our random forest


## Now we can define the grid search object that we want (say, with 10-fold cross-validation)

## Separating the data for training and testing


In [None]:
## Now we can fit the whole pipeline instead of just doing it step by step


In [None]:
## Let's look at the model structure suggested by the grid search


In [None]:
## Now we can directly apply the grid search object to our testing data;
## it predicts with the best estimator found by the grid search CV.


## We can also add a scatter plot of the predicted vs. observed test values


In [None]:
### Now it's your turn to build your own pipeline for modeling
### this Boston Housing Price data using a neural network.
### For the hyperparameters, you can vary the number of hidden layers
### (1-layer and 2-layer models) as well as the activation function.






## Random Search for Hyperparameters

In the previous exercise, we always used grid search to find the best model hyperparameters. This works well when you have a limited number of hyperparameters and a small range of values to tune. However, when the parameter space is large, grid search can be very time consuming, especially for large data sets. In such cases, random search can reduce the computational cost by sampling only a fixed number of parameter combinations. We are going to use a random forest model as our base model again here.
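
A minimal sketch of a randomized search is shown below, assuming synthetic stand-in data from `make_regression` (not the Boston data). The parameter ranges, `n_iter=5`, and 3-fold CV are illustrative assumptions chosen to keep the run short; instead of listing every value, each hyperparameter is given a distribution to sample from.

```python
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data with 13 features (assumption: not the Boston data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=13, noise=10.0, random_state=0
)

# stats.randint(low, high) samples integers from low (inclusive) to high (exclusive)
param_dist = {
    "n_estimators": stats.randint(10, 50),
    "max_depth": stats.randint(2, 10),
    "min_samples_split": stats.randint(2, 11),
}

# n_iter fixes the number of sampled combinations, regardless of the
# size of the parameter space
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_dist,
    n_iter=5,
    cv=3,
    random_state=0,
)
random_search.fit(X_demo, y_demo)

print(len(random_search.cv_results_["params"]))  # number of candidates tried
print(random_search.best_params_)
```

The cost is controlled by `n_iter` rather than by the number of grid points, which is what makes the approach attractive for large search spaces.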

In [None]:
## define our base model pipeline


In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from time import time

# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})"
                  .format(results['mean_test_score'][candidate],
                          results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

## In this situation, we can define a grid for our search


## Now we can perform our grid search with 5-fold cross validation




In [None]:
## Now let's see how long the random search takes for us
import scipy.stats as stats
## Here we first define the ranges to search


In [None]:

## run randomized search


## Assessing feature importance  

Suppose we have found our best random forest model and now want to know which features are more important than others. We can use the [permutation feature importance](https://scikit-learn.org/stable/modules/permutation_importance.html#) functionality in scikit-learn, which measures how much the model score drops when the values of a single feature are randomly shuffled.
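
The sketch below shows one way this could look, again on synthetic stand-in data from `make_regression` (an assumption; not the Boston data). The model settings and `n_repeats=5` are arbitrary illustrative choices, and the features are reported by column index because the synthetic data has no feature names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 13 features, 5 of them informative (assumption)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=13, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=30, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature n_repeats times on held-out data and record the score drop
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0
)

# Rank features from most to least important by mean score drop
order = np.argsort(result.importances_mean)[::-1]
for idx in order:
    print(
        f"feature {idx}: "
        f"{result.importances_mean[idx]:.3f} +/- {result.importances_std[idx]:.3f}"
    )
```

Computing the importances on held-out data, as here, indicates which features the model relies on to generalize rather than which ones it merely fit during training.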


In [None]:
## Now, let's use the model hyperparameters from our random search for the final model


## Let's calculate feature importance here


In [None]:
## We want to visualize the ranking of individual features

 