** XGBoost **
- the leading model for working with standard tabular data (e.g. Pandas DataFrames)
- requires more knowledge and model tuning to reach peak accuracy
- an implementation of the Gradient Boosted Decision Trees algorithm

** Gradient Boosted Decision Trees **
- goes through cycles that repeatedly build new models and combine them into an ensemble model. Each cycle starts by calculating the errors for each observation in the dataset. We then build a new model to predict those errors and add its predictions to the "ensemble of models."
- to make a prediction, we add up the predictions from all models built so far. We can use these predictions to calculate new errors, build the next model, and add it to the ensemble.
- the cycle needs some base prediction to start from.
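
The cycle described above can be sketched by hand. This is a minimal illustration, not how XGBoost is implemented internally: it assumes a squared-error objective (so the "errors" are plain residuals), uses scikit-learn's `DecisionTreeRegressor` as the component model, and runs on synthetic data. The names `X_demo`, `y_demo`, `n_rounds` and the `learning_rate` shrinkage factor are assumptions introduced for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical, for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1   # assumed shrinkage applied to each tree's predictions
n_rounds = 50         # assumed number of boosting cycles

# Base prediction that starts the cycle: the mean of the target
base_prediction = y_demo.mean()
ensemble_pred = np.full(len(y_demo), base_prediction)
trees = []
mse_history = [np.mean((y_demo - ensemble_pred) ** 2)]

for _ in range(n_rounds):
    residuals = y_demo - ensemble_pred              # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)                     # new model predicts those errors
    ensemble_pred += learning_rate * tree.predict(X_demo)
    trees.append(tree)                              # add it to the ensemble
    mse_history.append(np.mean((y_demo - ensemble_pred) ** 2))

def predict(X_new):
    """Base prediction plus the scaled predictions of every tree so far."""
    return base_prediction + learning_rate * sum(t.predict(X_new) for t in trees)

print("training MSE before boosting: %.4f" % mse_history[0])
print("training MSE after boosting:  %.4f" % mse_history[-1])
```

Each pass fits a small tree to whatever the ensemble still gets wrong, so the training error shrinks as trees are added.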

[Install XGBoost](https://anaconda.org/anaconda/py-xgboost) - `conda install -c anaconda py-xgboost` (for Windows)

Note: Research more about XGBoost

In [15]:
# import libraries: pandas, scikit-learn, xgboost
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

from sklearn.ensemble import GradientBoostingRegressor
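# The partial-dependence helpers moved to sklearn.inspection in newer scikit-learn
# releases; fall back to the old module path on older installs.
try:
    from sklearn.inspection import partial_dependence, PartialDependenceDisplay
except ImportError:
    from sklearn.ensemble.partial_dependence import partial_dependence, plot_partial_dependence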
#from sklearn.preprocessing import Imputer   -deprecated. use SimpleImputer()

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [16]:
# Dataset: home prices in Iowa
iowa_file_path = "./data/iowa_home.csv"
# read csv file and load it as a DataFrame in pandas
iowa_data = pd.read_csv(iowa_file_path)

In [17]:
# Analyze data/ Data cleaning/ Data exploration
#iowa_data.head()
#iowa_data.tail()
#iowa_data['TotRmsAbvGrd'].dtypes
#df = iowa_data.select_dtypes(include= np.number)
#print(df)

In [26]:
iowa_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = iowa_data.SalePrice
X = iowa_data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])


# Split data to training and validation data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.7, 
                                                    test_size=0.3, 
                                                    random_state=0)



** Model Tuning **

Parameters that can dramatically affect your model's accuracy and training speed:
1. n_estimators - specifies how many times to go through the modeling cycle
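2. early_stopping_rounds - offers a way to automatically find the ideal value of n_estimators: training stops once the validation score has not improved for that many consecutive rounds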
3. learning_rate - instead of getting predictions by simply adding up the predictions from each component model, we will multiply the predictions from each model by a small number before adding them in. This means each tree we add to the ensemble helps us less. In practice, this reduces the model's propensity to overfit.
4. n_jobs - On larger datasets where runtime is a consideration, you can use parallelism to build your models faster. It's common to set the parameter n_jobs equal to the number of cores on your machine.
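
To make the link between `n_estimators` and `early_stopping_rounds` concrete, the sketch below imitates early stopping by hand. It swaps in scikit-learn's `GradientBoostingRegressor` for XGBoost (its `staged_predict` method exposes the prediction after each added tree) and uses synthetic data from `make_regression`. The helper `early_stop_round` and its `patience` argument are hypothetical names for this example, not part of either library.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical); not the Iowa dataset
X_demo, y_demo = make_regression(n_samples=500, n_features=10,
                                 noise=20.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo,
                                            test_size=0.3, random_state=0)

# Deliberately ask for many trees, with a small learning rate
model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                  random_state=0)
model.fit(X_tr, y_tr)

# Validation MAE after 1, 2, ..., n_estimators trees
val_mae = [mean_absolute_error(y_val, pred)
           for pred in model.staged_predict(X_val)]

def early_stop_round(errors, patience):
    """Return the 1-based round with the best error seen before the error
    failed to improve for `patience` consecutive rounds."""
    best_round, best_err = 0, float("inf")
    for i, err in enumerate(errors, start=1):
        if err < best_err:
            best_round, best_err = i, err
        elif i - best_round >= patience:
            break
    return best_round

best_n = early_stop_round(val_mae, patience=5)
print("trees kept with patience=5:", best_n)
print("validation MAE at that point: %.2f" % val_mae[best_n - 1])
```

Setting `n_estimators` high and letting the stopping rule pick the cut-off is the pattern the pipeline cell in this notebook relies on.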

In [30]:
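# Build the pipeline with explicit step names so that fit parameters can be
# routed to the regressor with the 'xgbrg__' prefix.
# (make_pipeline would auto-name the step 'xgbregressor' instead.)
xgb_pipeline = Pipeline([('imputer', SimpleImputer()),
                         ('xgbrg', XGBRegressor(n_estimators=1000, learning_rate=0.05))])

# Parameters forwarded to XGBRegressor.fit() on every fold.
# Note: eval_set bypasses the imputer (XGBoost accepts NaN natively), and using
# the held-out split for early stopping means it is no longer an untouched test set.
# Newer xgboost releases expect early_stopping_rounds in the XGBRegressor
# constructor, and newer scikit-learn releases take params= instead of fit_params=.
fit_params = {'xgbrg__verbose': False,
              'xgbrg__early_stopping_rounds': 5,
              'xgbrg__eval_set': [(X_test.values, y_test)]}

# 5-fold cross-validation on the training split
scores = cross_val_score(xgb_pipeline, X_train.values, y_train,
                         scoring='neg_mean_absolute_error', cv=5, fit_params=fit_params)
print(scores)
# Scores are negated MAE, so flip the sign for reporting
print('Mean Absolute Error %.2f' % (-1 * scores.mean()))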

[-19841.19314024 -16332.79042203 -18468.36485141 -17090.28412224
 -14658.24036841]
Mean Absolute Error 17278.17


** Partial Dependence Plots **
- show how each variable or predictor affects the model's predictions.
- can be interpreted similarly to the coefficients in linear or logistic regression models, but they can capture more complex patterns from your data and can be used with any model.
- are calculated only after the model has been fit.
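
What a partial dependence curve measures can be shown with a brute-force calculation: force one feature to a fixed value for every row, average the model's predictions, and repeat over a grid of values. The sketch below assumes synthetic data in which the target depends on feature 0 only. `manual_partial_dependence` is a hypothetical helper written for this example, not scikit-learn's own implementation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (hypothetical): the target rises with feature 0 and ignores feature 1
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)

# The model must be fit before partial dependence can be computed
model = GradientBoostingRegressor(random_state=0).fit(X_demo, y_demo)

def manual_partial_dependence(estimator, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value        # overwrite one feature for every row
        averages.append(estimator.predict(X_mod).mean())
    return np.array(averages)

grid = np.linspace(1, 9, 5)
pd_feature0 = manual_partial_dependence(model, X_demo, 0, grid)
pd_feature1 = manual_partial_dependence(model, X_demo, 1, grid)
print("feature 0:", np.round(pd_feature0, 2))
print("feature 1:", np.round(pd_feature1, 2))
```

The curve for feature 0 should climb steadily across the grid, while the curve for feature 1 should stay nearly flat, which is how a plot of these values would be read.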

In [None]:
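# Newer scikit-learn releases provide PartialDependenceDisplay in place of
# plot_partial_dependence.
from sklearn.inspection import PartialDependenceDisplay

cols_to_use = ['LotArea', 'YearBuilt']
X = X[cols_to_use]
model = GradientBoostingRegressor()
gbr_pipeline = make_pipeline(SimpleImputer(), model)
gbr_pipeline.fit(X, y)
# Pass the fitted pipeline so that X is imputed the same way as during training
my_plots = PartialDependenceDisplay.from_estimator(gbr_pipeline,
                                                   X,
                                                   features=[0, 1],
                                                   feature_names=cols_to_use,
                                                   grid_resolution=10)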

In [None]:
gbr_model = GradientBoostingRegressor()
gbr_pipeline = make_pipeline(SimpleImputer(),gbr_model)

scores = cross_val_score(gbr_pipeline, X, y, scoring='neg_mean_absolute_error', cv=5)
print(scores)
print('Mean Absolute Error %.2f' %(-1 * scores.mean()))