**Guojing Wu** *| 2019-07-21*

<a href = "https://www.kaggle.com/alexisbcook/xgboost"> Kaggle: XGBoost </a>

# XGBoost - Decision Tree Ensembles

## parameter tuning

1) `n_estimators`, also known as the number of models: how many boosting rounds to run, which equals the number of trees in the ensemble.

2) `early_stopping_rounds`: early stopping makes the model stop iterating once the validation score no longer improves. Because the score can fail to improve in a single round purely by chance, this parameter specifies how many consecutive rounds without improvement are allowed before stopping. It is used together with the `eval_set` parameter, which supplies the validation data the score is computed on.

3) `learning_rate`: multiply the prediction of each model by a small number before adding it to the ensemble, so that each additional tree contributes less.

So in general, a small learning rate combined with a large `n_estimators` tends to give a more accurate model, at the cost of longer training.
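
To make the interplay of the three parameters concrete, here is a minimal from-scratch sketch of boosting with one-split stumps. It is not how XGBoost is implemented (XGBoost fits regularised trees); the helper names `fit_stump`, `stump_predict`, `mae`, `boost` and the toy data are all hypothetical and chosen only for illustration.

```python
# Toy boosting loop for squared error, for illustration only.
# Each "model" is a stump (one threshold, two constant outputs).

def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimising squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]

def stump_predict(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def boost(x_tr, y_tr, x_val, y_val, n_estimators, learning_rate, early_stopping_rounds):
    base = sum(y_tr) / len(y_tr)            # start from the mean of the training target
    pred_tr = [base] * len(x_tr)
    pred_val = [base] * len(x_val)
    best_mae, best_round, rounds_run = mae(y_val, pred_val), 0, 0
    for i in range(1, n_estimators + 1):    # at most n_estimators models
        residuals = [y - p for y, p in zip(y_tr, pred_tr)]
        stump = fit_stump(x_tr, residuals)  # the next model fits the current errors
        # learning_rate shrinks each model's contribution before it is added in
        pred_tr = [p + learning_rate * stump_predict(stump, x) for p, x in zip(pred_tr, x_tr)]
        pred_val = [p + learning_rate * stump_predict(stump, x) for p, x in zip(pred_val, x_val)]
        rounds_run = i
        score = mae(y_val, pred_val)
        if score < best_mae:
            best_mae, best_round = score, i
        elif i - best_round >= early_stopping_rounds:
            break                           # no improvement for too many rounds
    return best_round, best_mae, rounds_run

# Toy data (assumption): y = x^2, even x for training, odd x for validation.
x_tr = [float(x) for x in range(0, 20, 2)]
x_val = [float(x) for x in range(1, 20, 2)]
y_tr = [x * x for x in x_tr]
y_val = [x * x for x in x_val]

baseline_mae = mae(y_val, [sum(y_tr) / len(y_tr)] * len(y_val))
best_round, best_mae, rounds_run = boost(
    x_tr, y_tr, x_val, y_val,
    n_estimators=200, learning_rate=0.1, early_stopping_rounds=5,
)
print("baseline MAE:", baseline_mae)
print("best round:", best_round, "best MAE:", best_mae, "rounds run:", rounds_run)
```

Here `n_estimators` only sets an upper bound on the number of rounds; when early stopping triggers, the loop ends `early_stopping_rounds` rounds after the best validation score was seen.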

## preparing the data

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# read data
X = pd.read_csv("train.csv", index_col='Id')
X_test_full = pd.read_csv("test.csv", index_col='Id')

In [2]:
# drop rows with a missing target, then separate the target y from the features X
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(columns=['SalePrice'], inplace=True)
X_train_full, X_val_full, y_train, y_val = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                            random_state=0)

In [3]:
# keep only the numerical columns and the low-cardinality categorical columns
low_cardinality_col = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10
                      and X_train_full[cname].dtype == 'object']
numerical_col = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]
my_col = low_cardinality_col + numerical_col

# make explicit copies so later changes do not touch the full DataFrames
X_train = X_train_full[my_col].copy()
X_val = X_val_full[my_col].copy()
X_test = X_test_full[my_col].copy()

# one-hot encode, then use align() so that val and test keep exactly the dummy columns present in train
X_train = pd.get_dummies(X_train)
X_val = pd.get_dummies(X_val)
X_test = pd.get_dummies(X_test)
X_train, X_val = X_train.align(X_val, axis=1, join='left')
X_train, X_test = X_train.align(X_test, axis=1, join='left')
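
As a small illustration of what `align(..., join='left', axis=1)` does after one-hot encoding, the sketch below uses two tiny made-up frames (the `Street` values, including the `Dirt` category, and the `LotArea` numbers are hypothetical, not taken from the competition data).

```python
import pandas as pd

# Tiny made-up frames: "Grvl" appears only in train, "Dirt" only in val.
train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"], "LotArea": [8450, 9600, 11250]})
val = pd.DataFrame({"Street": ["Pave", "Dirt"], "LotArea": [9550, 14260]})

train_d = pd.get_dummies(train)  # has Street_Grvl and Street_Pave
val_d = pd.get_dummies(val)      # has Street_Dirt and Street_Pave

# join='left' keeps the columns of the left frame (train) for both outputs
train_d, val_d = train_d.align(val_d, join="left", axis=1)

print(list(train_d.columns))
print(val_d)
```

The dummy column that exists only in `val` is dropped, and the one that exists only in `train` is added to `val` filled with NaN. Passing `fill_value=0` to `align()` is one way to fill such columns with zeros instead, if that is preferred.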

In [4]:
# start XGBoost
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

In [None]:
my_model_1 = XGBRegressor(random_state=0)
my_model_1.fit(X_train, y_train)
prediction_1 = my_model_1.predict(X_val)
mae_1 = mean_absolute_error(y_val, prediction_1)
print("MAE for model 1:", mae_1)

In [None]:
# try to improve the model: more boosting rounds and an explicit learning rate
my_model_2 = XGBRegressor(random_state=0, n_estimators=1000, learning_rate=0.1)
my_model_2.fit(X_train, y_train)
prediction_2 = my_model_2.predict(X_val)
mae_2 = mean_absolute_error(y_val, prediction_2)
print("MAE for model 2:", mae_2)

# turns out it did improve a little bit

In [None]:
# a deliberately weak model: a single tree
my_model_3 = XGBRegressor(random_state=0, n_estimators=1)
my_model_3.fit(X_train, y_train)
prediction_3 = my_model_3.predict(X_val)
mae_3 = mean_absolute_error(y_val, prediction_3)
print("MAE for model 3:", mae_3)