### Gradient Boosting & Scikit-Learn Intro

This lab is a first introduction to the Scikit-Learn API and to gradient boosting, one of the most powerful techniques in predictive modeling.

In this lab you'll build a model, examine its working parts, and try to improve your results!

The great thing about Scikit-Learn is that its API is almost identical from one algorithm to another, so once you get the hang of it, switching between methods is fairly seamless.
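As a rough illustration of that shared API, the sketch below drives two unrelated regressors with the same `fit()`, `predict()`, and `score()` calls. The synthetic data from `make_regression` and the names `X_demo`, `y_demo`, and `demo_scores` are assumptions made for this example only; they are not part of the lab's dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption; the lab itself uses the Iowa housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Two very different algorithms, driven by the same three calls
demo_scores = {}
for model in (LinearRegression(), GradientBoostingRegressor(random_state=0)):
    model.fit(X_demo, y_demo)        # learn from the data
    preds = model.predict(X_demo)    # one prediction per row
    demo_scores[type(model).__name__] = model.score(X_demo, y_demo)  # R^2 for regressors

print(demo_scores)
```

Only the class name changes between the two models; the rest of the loop body is identical.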

**Step 1:** Load in the `iowa_train.csv` file

In [69]:
# your answer here
import pandas as pd
import numpy as np
df = pd.read_csv('../../data/iowa_train.csv')

**Step 3:** Declare your `X` & `y` variables -- we'll be predicting the sale price, stored in the `SalePrice` column.

In [71]:
# your answer here
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']

**Step 4:** Import `GradientBoostingRegressor` and initialize it.

In [72]:
# your answer here
from sklearn.ensemble import GradientBoostingRegressor

gbm = GradientBoostingRegressor()

**Step 5:** Call the `fit()` method on `X` & `y`

In [73]:
gbm.fit(X, y)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

**Step 6:** Check the score of your model using the `score()` method

In [74]:
# your answer here
gbm.score(X, y)

0.9371367715410339
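For a regressor, `score()` returns the R² of the predictions on whatever data you pass in, so the value above measures fit on the same rows the model was trained on. A minimal sketch of how a held-out check might look is below. It uses synthetic data, and `train_test_split` is swapped in here as one common option rather than something this lab prescribes; all the `*_demo`, `holdout_gbm`, and `*_r2` names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the Iowa housing data)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)

# Hold back 25% of the rows so the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

holdout_gbm = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

train_r2 = holdout_gbm.score(X_train, y_train)  # R^2 on rows used for fitting
test_r2 = holdout_gbm.score(X_test, y_test)     # R^2 on unseen rows

print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

The gap between the two numbers gives a rough sense of how much of the training score comes from memorizing the training rows.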

**Step 7:** Make a column that represents the predictions your model made for each sample

In [75]:
# your answer here
df['Predictions'] = gbm.predict(X)

**Step 8:** Take a look at the values returned from the `feature_importances_` attribute

In [76]:
# your answer here
gbm.feature_importances_

array([0.00083484, 0.00277247, 0.02783743, 0.56745495, 0.01227388,
       0.04799766, 0.06092333, 0.08068937, 0.03701382, 0.08721593,
       0.00864966, 0.00185042, 0.06448624])
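The array holds one value per column of `X`, in column order, and the values are normalized so they add up to 1 (the numbers above sum to roughly 1.0). The sketch below checks both properties on a small synthetic dataset; the data, the column names `feat_a` through `feat_d`, and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data with made-up column names (assumptions for this sketch)
X_arr, y_demo = make_regression(
    n_samples=200, n_features=4, n_informative=2, noise=5.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["feat_a", "feat_b", "feat_c", "feat_d"])

demo_gbm = GradientBoostingRegressor(random_state=0).fit(X_demo, y_demo)
importances = demo_gbm.feature_importances_

# One importance per column, in the same order as X_demo.columns
print(len(importances) == X_demo.shape[1])

# Importances are normalized to sum to 1
print(np.isclose(importances.sum(), 1.0))

# Position of the largest value maps straight back to a column name
top_feature = X_demo.columns[np.argmax(importances)]
print(top_feature)
```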

**Step 9:** To make a bit more sense out of these, let's put these values into a more readable format.  

Try making a 2 column dataframe using `X.columns` and the values from `feature_importances_` (they should correspond to one another).

In [77]:
# your answer here
feats = pd.DataFrame({
    'Columns': X.columns,
    'Importance': gbm.feature_importances_
})

# to make them more viewable
feats.sort_values(by='Importance', ascending=False)

| | Columns | Importance |
|---|---|---|
| 3 | OverallQual | 0.567455 |
| 9 | GrLivArea.1 | 0.087216 |
| 7 | 1stFlrSF | 0.080689 |
| 12 | GarageCars | 0.064486 |
| 6 | GrLivArea | 0.060923 |
| 5 | YearBuilt | 0.047998 |
| 8 | 2ndFlrSF | 0.037014 |
| 2 | LotArea | 0.027837 |
| 4 | OverallCond | 0.012274 |
| 10 | FullBath | 0.008650 |


**Step 10:** Can you improve your results?  For now, toy around a little bit with a few different options for getting different results.  These could be any of the following:

 - changing the number of boosting rounds used via `n_estimators`
 - changing the learning rate
 - removing columns that have lower feature importance, or very low correlation with the target variable
 - combining associated columns into larger, more descriptive ones, like full bathrooms, total living area, etc.
 
**hint:** you can use the `set_params()` method to change the parameter values of a scikit-learn algorithm.
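A minimal sketch of `set_params()` in use is below. The values `500` and `0.05` are arbitrary illustrations, not recommended settings, and `demo_gbm` is a hypothetical name chosen to avoid clobbering the lab's `gbm`.

```python
from sklearn.ensemble import GradientBoostingRegressor

demo_gbm = GradientBoostingRegressor()  # defaults: n_estimators=100, learning_rate=0.1

# set_params() updates the estimator in place and returns the same object
returned = demo_gbm.set_params(n_estimators=500, learning_rate=0.05)

# get_params() shows the values the next call to fit() will use
current = demo_gbm.get_params()
print(current["n_estimators"], current["learning_rate"])
print(returned is demo_gbm)
```

Changing a parameter does not retrain anything: the new values only take effect the next time you call `fit()`.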

In [79]:
# your answer here
# we'll talk about this in more detail, but here's a start
# select the columns with feature importance above 1%
query = feats.Importance > .01
cols_to_use = feats.loc[query, 'Columns'].tolist()
# and fit and score our model with them -- a very modest improvement, but we'll take it
print(f"New model score is: {gbm.fit(X[cols_to_use], y).score(X[cols_to_use], y)}")

New model score is: 0.9377636242979969


In [80]:
# to check for improvements from different parameter values, we'll use a nested for loop
rates = [.001, .01, .1]
num_trees = [100, 500, 1000]
mod_scores = []

for rate in rates:
    for tree in num_trees:
        gbm.set_params(learning_rate=rate, n_estimators=tree)
        score = gbm.fit(X[cols_to_use], y).score(X[cols_to_use], y)
        mod_scores.append((score, rate, tree))
        
print(max(mod_scores))

(0.9913900417055366, 0.1, 1000)
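Because these scores are computed on the training rows, the combination with the highest learning rate and the most trees is almost guaranteed to win, which says little about how the model handles new houses. One possible alternative, sketched here under assumptions, is to let `GridSearchCV` run the same kind of search with cross-validation. This swaps in a different technique from the loop above; the synthetic data, the smaller grid, and `cv=3` are choices made only to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption, not the Iowa housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=15.0, random_state=0)

# A deliberately small grid so the sketch runs quickly
param_grid = {
    "learning_rate": [0.01, 0.1],
    "n_estimators": [50, 200],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=3,  # each combination is scored on 3 held-out folds
)
search.fit(X_demo, y_demo)

# best_score_ is the mean held-out R^2 of the winning combination
print(search.best_params_, search.best_score_)
```

Each score here comes from rows the model did not see while fitting, so it is a fairer basis for comparing settings than the training score.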
