# Practice Exercise: Scikit-Learn 5
### Pipelines

### Objectives

As part of the [SK5 Tutorial](https://www.featureranking.com/tutorials/machine-learning-tutorials/sk-part-5-advanced-topics-pipelines-statistical-model-comparison-and-model-deployment/), this practice notebook illustrates how you can set up `pipelines` to streamline your machine learning workflow using `Scikit-Learn`. Pipelines make life easier in relatively large machine learning projects by combining multiple steps into a single process.

You will be using the cleaned "income data" dataset. In the previous practices, you cleaned and transformed the raw `income data` and renamed the `income` column as `target` (with high income being the positive class). Including `target`, the cleaned data consists of 42 columns and 45,222 rows. Each column is numeric and between 0 and 1.

You will use **stratified 5-fold cross-validation with no repetitions** during training. For testing, you will use the fine-tuned model for prediction **without** any cross-validation for simplicity.
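As a minimal sketch of what "stratified 5-fold cross-validation with no repetitions" could look like in code, the block below builds a `StratifiedKFold` splitter on toy labels. The toy arrays, the `cv_method` name, and the choice to shuffle with a seed of 999 are assumptions made for illustration; the exercise itself only fixes the number of folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy labels with a 75/25 class split (hypothetical data, for illustration only)
y_toy = np.array([0] * 75 + [1] * 25)
X_toy = np.zeros((100, 1))

# 5 folds, shuffled once with a fixed seed, no repetitions
# (shuffle=True and the seed are assumptions, not requirements of the exercise)
cv_method = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)

# share of positive labels in each validation fold
fold_positive_rates = [
    y_toy[val_idx].mean() for _, val_idx in cv_method.split(X_toy, y_toy)
]
print(fold_positive_rates)
```

Because the splitter is stratified, every validation fold keeps the same share of positive labels as the full toy sample. An object like this can be passed to the `cv` argument of `GridSearchCV()`.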

In `GridSearchCV()`, try setting `n_jobs` to -2 for shorter run times with parallel processing. Here, -2 means use all cores except one.

### Exercise 0: Modeling Preparation

- Read in the clean data `us_census_income_data_clean_encoded.csv` on GitHub [here](https://github.com/vaksakalli/datasets). 
- Randomly sample 5000 rows using a random seed of 999.
- Split the sampled data as 70% training set and the remaining 30% test set using a random seed of 999. 
- Remember to separate `target` during the splitting process. 
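
The preparation steps above might be sketched as follows. So that the block runs on its own, it uses a small random stand-in data frame instead of the real file; the stand-in data, the variable names (`D_train`, `t_train`, and so on), and the use of `stratify` are assumptions, not part of the exercise statement.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# In the exercise you would load the real file, for example from a local copy:
# df = pd.read_csv("us_census_income_data_clean_encoded.csv")
# Stand-in frame (hypothetical) so that this sketch is self-contained.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((8000, 5)), columns=[f"x{i}" for i in range(5)])
df["target"] = (rng.random(8000) < 0.25).astype(int)

# randomly sample 5000 rows with a fixed seed
df_sample = df.sample(n=5000, random_state=999)

# separate the descriptive features from the target, as NumPy arrays
D = df_sample.drop(columns="target").values
t = df_sample["target"].values

# 70/30 split; stratifying on the target is an assumption,
# the exercise does not ask for it explicitly
D_train, D_test, t_train, t_test = train_test_split(
    D, t, test_size=0.3, random_state=999, stratify=t
)
print(D_train.shape, D_test.shape)
```

Converting to NumPy arrays with `.values` matters later, because the custom feature selector in Exercise 1 indexes its input with `X[:, ...]`.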

### Exercise 1: Pipeline Preparation

For feature selection, you will use the powerful Random Forest Importance (RFI) method with 100 estimators. A trick here is that you will need a bit of coding so that you can make RFI feature selection as part of the pipeline. For this reason, we are providing for you the custom `RFIFeatureSelector()` class below to pass in RFI as a "step" to the pipeline.
```Python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier

# custom transformer for RFI feature selection inside a pipeline
# here we use n_estimators=100
# notice the random_state in RandomForestClassifier()
# to control randomness
# the mixin is listed before BaseEstimator, as scikit-learn recommends
class RFIFeatureSelector(TransformerMixin, BaseEstimator):
    # class constructor
    # each constructor argument must be stored under an attribute
    # with exactly the same name so that get_params()/set_params(),
    # and therefore grid search, can find it
    def __init__(self, n_features_=10):
        self.n_features_ = n_features_
        self.fs_indices_ = None
    # fit a random forest and rank the features by importance
    def fit(self, X, y):
        model_rfi = RandomForestClassifier(n_estimators=100, random_state=999)
        model_rfi.fit(X, y)
        # indices of the n_features_ most important features, best first
        self.fs_indices_ = np.argsort(model_rfi.feature_importances_)[::-1][0:self.n_features_]
        return self
    # keep only the selected columns
    # X must be a NumPy array here; pass df.values for a data frame
    def transform(self, X, y=None):
        return X[:, self.fs_indices_]
```

We are also making available the custom function below, called `get_search_results()`, which will format outputs of an input grid search object as a `Pandas` data frame.
```Python
import numpy as np
import pandas as pd

# custom function to format the search results as a Pandas data frame
def get_search_results(gs):
    def model_result(scores, params):
        scores = {'mean_score': np.mean(scores),
             'std_score': np.std(scores),
             'min_score': np.min(scores),
             'max_score': np.max(scores)}
        return pd.Series({**params,**scores})
    models = []
    scores = []
    for i in range(gs.n_splits_):
        key = f"split{i}_test_score"
        r = gs.cv_results_[key]        
        scores.append(r.reshape(-1,1))
    all_scores = np.hstack(scores)
    for p, s in zip(gs.cv_results_['params'], all_scores):
        models.append((model_result(s, p)))
    pipe_results = pd.concat(models, axis=1).T.sort_values(['mean_score'], ascending=False)
    columns_first = ['mean_score', 'std_score', 'max_score', 'min_score']
    columns = columns_first + [c for c in pipe_results.columns if c not in columns_first]
    return pipe_results[columns]
```

You will need to copy and paste these two code blocks before you can continue with the next exercise.

### Exercise 2

Using a pipeline, stack Random Forest Importance (RFI) feature selection together with grid search for DT hyperparameter tuning via cross-validation using the **train** data. For scoring, use AUC, that is, "area under the ROC curve". 

For RFI, consider the number of features in {10, 20, full_number_of_descriptive_features}.

For the DT model, aim to determine the optimal combinations of maximum depth (`max_depth`) and minimum sample split (`min_samples_split`) using the **Gini Index** split criterion. In particular, consider max_depth values in {3, 5, 7, 9, 11} and min_samples_split values in {2, 5, 7, 9, 11}.
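
One way the pipeline and grid search could be wired together is sketched below. It is not the reference solution: the data is a synthetic stand-in from `make_classification`, the grid is deliberately smaller than the one the exercise asks for so that it runs quickly, and the step names `rfi_fs` and `dt` as well as the variable names are assumptions. The selector class is a compact copy of the one provided in Exercise 1.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


# compact copy of the RFIFeatureSelector class from Exercise 1
class RFIFeatureSelector(TransformerMixin, BaseEstimator):
    def __init__(self, n_features_=10):
        self.n_features_ = n_features_
        self.fs_indices_ = None

    def fit(self, X, y):
        model_rfi = RandomForestClassifier(n_estimators=100, random_state=999)
        model_rfi.fit(X, y)
        ranked = np.argsort(model_rfi.feature_importances_)[::-1]
        self.fs_indices_ = ranked[0:self.n_features_]
        return self

    def transform(self, X, y=None):
        return X[:, self.fs_indices_]


# synthetic stand-in for the training data (hypothetical, not the income data)
D_train, t_train = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=999
)

# step 1: RFI feature selection, step 2: decision tree with the Gini criterion
pipe = Pipeline(steps=[
    ("rfi_fs", RFIFeatureSelector()),
    ("dt", DecisionTreeClassifier(criterion="gini", random_state=999)),
])

# parameters are addressed as "<step name>__<parameter name>"
# reduced grid for speed; the exercise asks for n_features_ in {10, 20, all},
# max_depth in {3, 5, 7, 9, 11} and min_samples_split in {2, 5, 7, 9, 11}
params_pipe = {
    "rfi_fs__n_features_": [5, D_train.shape[1]],
    "dt__max_depth": [3, 5],
    "dt__min_samples_split": [2, 5],
}

cv_method = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)

# scoring="roc_auc" selects AUC; n_jobs=-2 could be added for parallel runs
gs_pipe = GridSearchCV(
    estimator=pipe, param_grid=params_pipe, cv=cv_method, scoring="roc_auc"
)
gs_pipe.fit(D_train, t_train)
print(gs_pipe.best_params_)
```

Because the selector sits inside the pipeline, the random forest ranking is recomputed on the training part of each fold, so the validation part never influences which features are chosen.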

### Exercise 3

Display the pipeline best parameters, the best score, and the best estimator. 
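
The three attributes in question are available on any fitted `GridSearchCV` object. The sketch below shows them on a small stand-in pipeline; `SelectKBest` is used here only as a built-in substitute for the custom RFI selector, and the data and names (`gs_demo`, `demo_pipe`) are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=999
)

# SelectKBest stands in for the custom RFI selector in this sketch
demo_pipe = Pipeline([
    ("fs", SelectKBest(f_classif)),
    ("dt", DecisionTreeClassifier(random_state=999)),
])
demo_grid = {"fs__k": [5, 12], "dt__max_depth": [3, 5]}

gs_demo = GridSearchCV(demo_pipe, demo_grid, cv=5, scoring="roc_auc")
gs_demo.fit(X_demo, y_demo)

print(gs_demo.best_params_)     # dict keyed by "<step>__<parameter>"
print(gs_demo.best_score_)      # mean cross-validated AUC of the best combination
print(gs_demo.best_estimator_)  # the whole pipeline, refit on all the data passed to fit()
```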

### Exercise 4

Using the custom `get_search_results()` function, display the top 5 combinations of the pipeline.
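
The helper builds its table from the per-split scores stored in `cv_results_`. As a cross-check rather than a replacement, a similar top-5 view can be obtained from the built-in `mean_test_score` and `std_test_score` columns. The sketch below assumes a stand-in pipeline with `SelectKBest` in place of the RFI selector, and all names in it are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=999
)

demo_pipe = Pipeline([
    ("fs", SelectKBest(f_classif)),
    ("dt", DecisionTreeClassifier(random_state=999)),
])
demo_grid = {"fs__k": [5, 12], "dt__max_depth": [3, 5, 7]}
gs_demo = GridSearchCV(demo_pipe, demo_grid, cv=5, scoring="roc_auc")
gs_demo.fit(X_demo, y_demo)

# one row per parameter combination
cv_table = pd.DataFrame(gs_demo.cv_results_)
param_cols = [c for c in cv_table.columns if c.startswith("param_")]

# sort by mean cross-validated AUC and keep the five best combinations
top5 = (
    cv_table.sort_values("mean_test_score", ascending=False)
    [["mean_test_score", "std_test_score"] + param_cols]
    .head(5)
)
print(top5)
```

With the helper itself, the equivalent call would be along the lines of `get_search_results(gs_pipe).head(5)`, since the returned data frame is already sorted by `mean_score` in descending order.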

### Exercise 5

Visualize the DT performance comparison results by filtering the output of the `get_search_results()` function for 10 features. Put the minimum samples for split on the x-axis and AUC on the y-axis, and break down the plot by maximum depth.

### Exercise 6

Using the best estimator of the pipeline, obtain the predictions on the **test** data. Display the confusion matrix and the AUC score on this test data. 

Next, using the best estimator of the pipeline, obtain the predictions on the **train** data. Display the confusion matrix and the AUC score on this train data. 

How does the test AUC compare to the train AUC? Why do you think there is a difference?
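
The evaluation steps can be sketched as below, under the assumption that a plain decision tree stands in for the pipeline's best estimator and that the data is synthetic. Note that the confusion matrix is computed from predicted class labels, while AUC is computed from the predicted probabilities of the positive class.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (hypothetical, not the income data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=12, n_informative=5, random_state=999
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=999, stratify=y_demo
)

# stand-in for gs_pipe.best_estimator_ (an assumption for this sketch)
model = DecisionTreeClassifier(max_depth=5, random_state=999).fit(X_tr, y_tr)

auc_scores = {}
conf_matrices = {}
for name, X_part, y_part in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
    pred_labels = model.predict(X_part)              # hard labels for the confusion matrix
    pred_scores = model.predict_proba(X_part)[:, 1]  # positive-class probabilities for AUC
    conf_matrices[name] = confusion_matrix(y_part, pred_labels)
    auc_scores[name] = roc_auc_score(y_part, pred_scores)
    print(name, "AUC:", auc_scores[name])
    print(conf_matrices[name])
```

In the exercise, `model` would be replaced by the best estimator of the fitted grid search, which applies the feature selection step automatically before predicting.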

***
www.featureranking.com