### Comparing Aggregate Models for Regression

This try-it focuses on using ensemble models in a regression setting. Just as you combined individual classification estimators into an ensemble, your goal here is to explore ensembles of regression models. As in the earlier assignment, you will use scikit-learn to build the ensemble, this time with the `VotingRegressor`.
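
As a quick illustration of what the `VotingRegressor` does, the sketch below fits a two-member ensemble and checks that its prediction is the plain average of the members' predictions when no weights are given. The data is synthetic (generated with `make_regression`) and the choice of members and `max_depth=3` is arbitrary; none of this comes from the wage survey.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only (not the wage survey)
X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

voter = VotingRegressor(estimators=[
    ('lr', LinearRegression()),
    ('tree', DecisionTreeRegressor(max_depth=3, random_state=0)),
])
voter.fit(X, y)

# Predictions from each fitted member, stacked as rows
member_preds = np.array([est.predict(X) for est in voter.estimators_])

# With weights=None the ensemble prediction is the unweighted mean of the members
manual_average = member_preds.mean(axis=0)
ensemble_preds = voter.predict(X)

print(np.allclose(manual_average, ensemble_preds))
```

Passing a `weights` list to the `VotingRegressor` turns this into a weighted average, which is one of the knobs available for tuning.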


#### Dataset and Task

Below, a dataset containing census information on individuals and their hourly wage is loaded using the `fetch_openml` function. [OpenML](https://www.openml.org/) is another repository for datasets. Your task is to use ensemble methods to explore predicting the `WAGE` column of the data. Your ensemble should, at the very least, consider the following models:

- `LinearRegression` -- you may also want to wrap it in a `TransformedTargetRegressor` here.
- `KNeighborsRegressor`
- `DecisionTreeRegressor`
- `Ridge`
- `SVR`
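
For the `TransformedTargetRegressor` suggestion in the list above, a minimal sketch might look like the following. It assumes a log transform of the target is the transformation of interest (the assignment does not say which one to use) and runs on a synthetic, strictly positive target rather than the survey's `WAGE` column.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Synthetic, strictly positive, right-skewed target (made up for illustration)
y = np.exp(1.0 + 0.5 * X[:, 0] + rng.normal(scale=0.1, size=200))

# Fit the linear model on log(y); predictions are mapped back with exp
log_lr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
log_lr.fit(X, y)

preds = log_lr.predict(X)  # already on the original (untransformed) scale
```

Because the wrapper behaves like any other regressor, it could be passed to the `VotingRegressor` in place of the plain `LinearRegression`.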

Tune the `VotingRegressor` to optimize prediction performance, and determine whether the wisdom of the crowd performs better in this setting than any of the individual models on its own. Report back on your findings and discuss their interpretability. Is there a way to determine which features mattered in predicting wages?
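
One possible way to tune the `VotingRegressor` is a grid search over its `weights` and over the hyperparameters of its members, which are addressed with the `name__param` convention. The sketch below is only an illustration: it uses synthetic data, three members instead of five, and an arbitrary, hypothetical grid.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only (not the wage survey)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

voter = VotingRegressor(estimators=[
    ('ridge', Ridge()),
    ('knn', KNeighborsRegressor()),
    ('tree', DecisionTreeRegressor(random_state=0)),
])

# Hypothetical grid: a few weightings plus one member hyperparameter
param_grid = {
    'weights': [None, [2, 1, 1], [1, 2, 1], [1, 1, 2]],
    'knn__n_neighbors': [3, 7],
}

search = GridSearchCV(voter, param_grid, scoring='neg_mean_squared_error', cv=3)
search.fit(X, y)

print(search.best_params_)
print(f'Best CV MSE: {-search.best_score_:.2f}')
```

The same grid keys would need a step-name prefix (for example `vr__weights`) if the `VotingRegressor` sits inside a `Pipeline`.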

In [34]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import VotingRegressor
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_openml

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.metrics import mean_squared_error

In [10]:
survey = fetch_openml(data_id=534, as_frame=True).frame

In [11]:
survey.head()

| | EDUCATION | SOUTH | SEX | EXPERIENCE | UNION | WAGE | AGE | RACE | OCCUPATION | SECTOR | MARR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.0 | no | female | 21.0 | not_member | 5.1 | 35.0 | Hispanic | Other | Manufacturing | Married |
| 1 | 9.0 | no | female | 42.0 | not_member | 4.95 | 57.0 | White | Other | Manufacturing | Married |
| 2 | 12.0 | no | male | 1.0 | not_member | 6.67 | 19.0 | White | Other | Manufacturing | Unmarried |
| 3 | 12.0 | no | male | 4.0 | not_member | 4.0 | 22.0 | White | Other | Other | Unmarried |
| 4 | 12.0 | no | male | 17.0 | not_member | 7.5 | 35.0 | White | Other | Other | Married |


In [14]:
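# Candidate regressors, keyed by the short names reused as pipeline step names
models = {
    'lr': LinearRegression(), 'ridge': Ridge(), 'knn': KNeighborsRegressor(),
    'tree': DecisionTreeRegressor(), 'svr': SVR(),
}

# One-hot encode the categorical columns. With the default remainder='drop',
# the numeric columns (EDUCATION, EXPERIENCE, AGE) are not passed to the models.
tfm = make_column_transformer(
    (OneHotEncoder(drop='if_binary'), ['SOUTH', 'SEX', 'UNION', 'OCCUPATION', 'RACE', 'SECTOR', 'MARR']),
)

X_train, X_test, y_train, y_test = train_test_split(survey.drop('WAGE', axis=1), survey['WAGE'], test_size=0.3, random_state=0)

# Unweighted VotingRegressor over the five models above
models['vr'] = VotingRegressor(estimators=[('lr', models['lr']), ('ridge', models['ridge']), ('knn', models['knn']), ('tree', models['tree']), ('svr', models['svr'])])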

In [36]:
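for m in models:
    # Same preprocessing for every model: one-hot encode, scale, then fit the regressor
    pipe = Pipeline([('tfm', tfm), ('scaler', StandardScaler()), (m, models[m])])
    pipe.fit(X_train, y_train)

    # mean_squared_error expects (y_true, y_pred)
    train_mse = mean_squared_error(y_train, pipe.predict(X_train))
    test_mse = mean_squared_error(y_test, pipe.predict(X_test))
    print(f'{m} -- Train MSE: {train_mse:.2f} -- Test: {test_mse:.2f}')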


lr -- Train MSE: 20.95 -- Test: 17.32
ridge -- Train MSE: 20.71 -- Test: 17.34
knn -- Train MSE: 18.17 -- Test: 20.68
tree -- Train MSE: 10.98 -- Test: 32.07
svr -- Train MSE: 21.14 -- Test: 20.47
vr -- Train MSE: 16.17 -- Test: 17.89
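
On the question of which features mattered: a `VotingRegressor` has no single set of coefficients to inspect, so one model-agnostic option is permutation importance, which measures how much the score degrades when a single column is shuffled. The sketch below is an assumption about how you might approach it, not part of the original assignment. It runs on synthetic data in which, by construction, only the first two columns influence the target.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: with shuffle=False only the first two columns drive the target
X, y = make_regression(n_samples=300, n_features=4, n_informative=2,
                       noise=5.0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

voter = VotingRegressor(estimators=[
    ('ridge', Ridge()),
    ('tree', DecisionTreeRegressor(max_depth=4, random_state=0)),
]).fit(X_tr, y_tr)

# Shuffle each column of the held-out data in turn and record the change in score
result = permutation_importance(voter, X_te, y_te,
                                scoring='neg_mean_squared_error',
                                n_repeats=10, random_state=0)

# Columns ordered from most to least important
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(f'feature {idx}: {result.importances_mean[idx]:.2f}')
```

Applied to a fitted pipeline and the raw `X_test` frame, the same call would report importances per original column rather than per one-hot encoded column.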


In [33]:
# `pipe` is the last pipeline built in the loop above, whose final step is 'vr'
pipe['vr'].get_params()

{'estimators': [('lr', LinearRegression()),
  ('ridge', Ridge()),
  ('knn', KNeighborsRegressor()),
  ('tree', DecisionTreeRegressor()),
  ('svr', SVR())],
 'n_jobs': None,
 'verbose': False,
 'weights': None,
 'lr': LinearRegression(),
 'ridge': Ridge(),
 'knn': KNeighborsRegressor(),
 'tree': DecisionTreeRegressor(),
 'svr': SVR(),
 'lr__copy_X': True,
 'lr__fit_intercept': True,
 'lr__n_jobs': None,
 'lr__normalize': 'deprecated',
 'lr__positive': False,
 'ridge__alpha': 1.0,
 'ridge__copy_X': True,
 'ridge__fit_intercept': True,
 'ridge__max_iter': None,
 'ridge__normalize': 'deprecated',
 'ridge__positive': False,
 'ridge__random_state': None,
 'ridge__solver': 'auto',
 'ridge__tol': 0.001,
 'knn__algorithm': 'auto',
 'knn__leaf_size': 30,
 'knn__metric': 'minkowski',
 'knn__metric_params': None,
 'knn__n_jobs': None,
 'knn__n_neighbors': 5,
 'knn__p': 2,
 'knn__weights': 'uniform',
 'tree__ccp_alpha': 0.0,
 'tree__criterion': 'squared_error',
 'tree__max_depth': None,
 'tree__max