# CatBoost
The goal of training is to find a model `y` of the features `x_i` that best solves the given problem (regression, classification, or multiclassification) for any input object. The model is fitted on a training dataset, which is a set of objects with known features and label values. Its quality is then checked on a validation dataset, which has the same format as the training dataset but is used only for evaluation, never for training.

CatBoost is based on gradient boosted decision trees. During training, decision trees are built one after another, and each new tree is fitted to reduce the loss left by the trees built before it.

The number of trees is set by the starting parameters (`iterations` in this notebook). To prevent overfitting, use the overfitting detector: when it triggers, no more trees are built.
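To make the two ideas above concrete (trees built consecutively on the remaining error, and training that stops once validation loss stops improving), here is a minimal pure-Python sketch. It is not CatBoost's algorithm: it uses depth-1 trees on a single feature with squared-error loss, and the names `fit_stump`, `boost`, and `patience` as well as the toy data are made up for illustration.

```python
# Toy gradient boosting with stumps and patience-based early stopping.
# Illustration only; CatBoost's tree structure and boosting scheme differ.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every distinct value except the largest
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def predict(stumps, learning_rate, x):
    """Sum the shrunken outputs of all stumps for one input value."""
    return sum(learning_rate * (lv if x <= t else rv) for t, lv, rv in stumps)

def mse(stumps, learning_rate, xs, ys):
    """Mean squared error of the current ensemble."""
    return sum((predict(stumps, learning_rate, x) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

def boost(x_train, y_train, x_val, y_val,
          iterations=200, learning_rate=0.1, patience=10):
    """Add stumps one by one; stop after `patience` rounds without improvement."""
    stumps, best_loss, best_size = [], float("inf"), 0
    for _ in range(iterations):
        # Each new stump is fitted to what the current ensemble still gets wrong.
        residuals = [y - predict(stumps, learning_rate, x)
                     for x, y in zip(x_train, y_train)]
        stumps.append(fit_stump(x_train, residuals))
        val_loss = mse(stumps, learning_rate, x_val, y_val)
        if val_loss < best_loss:
            best_loss, best_size = val_loss, len(stumps)
        elif len(stumps) - best_size >= patience:
            break  # the toy "overfitting detector" has triggered
    # Keep only the trees up to the best validation round.
    return stumps[:best_size], best_loss

# Made-up data: a noisy step function.
x_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y_train = [0.1, 0.0, 0.2, 0.1, 1.0, 0.9, 1.1, 1.0]
x_val = [0.5, 2.5, 4.5, 6.5]
y_val = [0.0, 0.1, 1.0, 1.0]

model, val_loss = boost(x_train, y_train, x_val, y_val)
baseline_loss = mse([], 0.1, x_val, y_val)  # loss of the empty ensemble
print(len(model), val_loss, baseline_loss)
```

Truncating to the best round mirrors what `use_best_model=True` together with `early_stopping_rounds` is used for later in this notebook.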

In [44]:
# importing sys
import sys
  
# adding srcpy to the system path
sys.path.insert(0, "/Users/charles/Desktop/iFixerup/zr1/src/srcpy/")

import data_proc, feature_proc, catboost_mod

# Autoreload: re-import changed modules automatically before running code, so edits to the srcpy modules take effect
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
import random
import numpy as np
import pandas as pd
pd.options.display.max_columns = None
pd.options.mode.chained_assignment = None

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
mpl.style.use("ggplot")
pylab.rcParams["figure.figsize"] = (8, 6)

import seaborn as sns
sns.set_style("white")

In [3]:
from sklearn.model_selection import train_test_split
from catboost import CatBoostRegressor, Pool

# Data Loading

In [4]:
%%time
# Read DataFrames from hdf5
features_2016 = pd.read_hdf('/Users/charles/Desktop/iFixerup/zr1/data/hdf5/features.h5', 'features_2016')    # All features except for datetime for 2016
features_2017 = pd.read_hdf('/Users/charles/Desktop/iFixerup/zr1/data/hdf5/features.h5', 'features_2017')    # All features except for datetime for 2017
train = pd.read_hdf('/Users/charles/Desktop/iFixerup/zr1/data/hdf5/train.h5', 'train')                       # Concatenated 2016 and 2017 training data with labels

CPU times: user 4.54 s, sys: 2.63 s, total: 7.17 s
Wall time: 8.41 s


# Fine-tuning with Validation Data

In [5]:
# Drop features that are not useful for CatBoost
catboost_features = catboost_mod.catboost_drop_features(train)
print("Number of features for CatBoost: {}".format(len(catboost_features.columns)))
catboost_features.head()

Number of features for CatBoost: 71


Unnamed: 0,cooling_id,bathroom_cnt,bedroom_cnt,quality_id,floor1_sqft,finished_area_sqft_calc,floor1_sqft_unk,base_total_area,fireplace_cnt,bathroom_full_cnt,garage_cnt,garage_sqft,spa_flag,heating_id,latitude,longitude,lot_sqft,pool_cnt,pool_total_size,pool_w_sht,pool_no_sht,landuse_type_id,census_raw,city_id,neighborhood_id,region_zip,room_cnt,bathroom_small_cnt,unit_cnt,patio_sqft,year_built,story_cnt,tax_structure,tax_parcel,tax_year,tax_land,tax_property,tax_overdue_flag,tax_overdue_year,census_2,country_landuse_code_id,zoning_description_id,avg_garage_size,property_tax_per_sqft,location_sum,location_minus,location_sum05,location_minus05,missing_finished_area,missing_total_area,missing_bathroom_cnt_calc,derived_room_cnt,avg_area_per_room,derived_avg_area_per_room,region_zip-groupcnt,region_zip-lot_sqft-diff,region_zip-lot_sqft-percent,region_zip-year_built-diff,region_zip-finished_area_sqft_calc-diff,region_zip-finished_area_sqft_calc-percent,region_zip-tax_structure-diff,region_zip-tax_structure-percent,region_zip-tax_land-diff,region_zip-tax_land-percent,region_zip-tax_property-diff,region_zip-tax_property-percent,region_zip-property_tax_per_sqft-diff,region_zip-property_tax_per_sqft-percent,year,month,quarter
0,0,2.0,3.0,4.0,,1684.0,,,,2.0,,,,1,34280992.0,-118488536.0,7528.0,,,,,230,60371068.0,12447.0,31817.0,96370.0,0.0,,1.0,,1959.0,,122754.0,360170.0,2015.0,237416.0,6735.879883,,,60371070000000.0,41.0,11.0,,3.999929,-84207544.0,152769536.0,-24963276.0,93525260.0,0.0,1.0,0.0,5.0,,336.799988,14719.0,-13398.96875,-0.640273,-3.998413,-247.725464,-0.128241,-50475.015625,-0.291377,51026.421875,0.273762,2047.035645,0.436576,1.521634,0.613984,0,1,1
1,-1,3.5,4.0,,,2263.0,,,,3.0,2.0,468.0,,-1,33668120.0,-117677552.0,3643.0,,,,,230,60590524.0,32380.0,,96962.0,0.0,1.0,,,2014.0,,346458.0,585529.0,2015.0,239071.0,10153.019531,,,,15.0,,234.0,4.486531,-84009432.0,151345664.0,-25170656.0,92506896.0,0.0,1.0,0.0,7.5,,301.733337,17682.0,-2715.032715,-0.427024,35.535156,526.538208,0.303225,213678.171875,1.609267,16302.671875,0.073182,6339.847656,1.662618,2.160548,0.928875,0,1,1
2,0,3.0,2.0,4.0,,2217.0,,,,3.0,,,,1,34136312.0,-118175032.0,11423.0,,,,,230,60374640.0,47019.0,275411.0,96293.0,0.0,,1.0,,1940.0,,61994.0,119906.0,2015.0,57912.0,11484.480469,,,60374640000000.0,41.0,41.0,,5.18019,-84038720.0,152311344.0,-24951204.0,93223828.0,0.0,1.0,0.0,5.0,,443.399994,4422.0,-14927.021484,-0.56649,-12.917847,-173.867432,-0.072721,-236757.28125,-0.79249,-427605.09375,-0.880721,1845.573242,0.191471,1.178391,0.294465,0,1,1
3,0,2.0,2.0,4.0,,839.0,,,,2.0,,,,1,33755800.0,-118309000.0,70859.0,,,,,235,60372964.0,12447.0,54300.0,96222.0,0.0,,1.0,,1987.0,,171518.0,244880.0,2015.0,73362.0,3048.73999,,,60372960000000.0,32.0,60.0,,3.633778,-84553200.0,152064800.0,-25398700.0,92910300.0,0.0,1.0,0.0,4.0,,209.75,7293.0,-43346.804688,-0.37955,21.690186,-782.150757,-0.482466,30903.765625,0.219777,-129440.8125,-0.638259,-1337.844971,-0.304986,0.830251,0.296145,0,1,1
4,-1,2.5,4.0,,,2283.0,,,,2.0,2.0,598.0,,-1,33485644.0,-117700232.0,6000.0,1.0,,,1.0,230,60590424.0,17686.0,,96961.0,8.0,1.0,,,1981.0,2.0,169574.0,434551.0,2015.0,264977.0,5488.959961,,,60590420000000.0,4.0,,299.0,2.404275,-84214592.0,151185872.0,-25364472.0,92335760.0,0.0,1.0,0.0,6.5,285.375,351.230774,9875.0,-1155.377441,-0.16147,0.695679,244.801147,0.120107,-50359.125,-0.228975,-195977.25,-0.425156,-2742.87207,-0.333203,-1.339566,-0.357805,0,1,1


In [6]:
# Specify feature names and categorical features for CatBoost
feature_names = list(catboost_features.columns)
categorical_features = ['cooling_id', 'heating_id', 'landuse_type_id', 'year', 'month', 'quarter']

# Column positions of the categorical features (passed to cat_features later)
categorical_indices = [index for index, name_col in enumerate(feature_names)
                       if name_col in categorical_features]
categorical_indices

[0, 13, 21, 68, 69, 70]

In [38]:
# Prepare training and validation data
# CatBoost label: log error
catboost_label = train.log_error.astype(np.float32)

# Features stay a DataFrame; the label stays a Series so the boolean masks below work
catboost_X = catboost_features
catboost_y = catboost_label

# Perform a shuffled train/validation split (with no random_state, train_test_split uses NumPy's global RNG, seeded below)
np.random.seed(42)
random.seed(10)
X_train, X_val, y_train, y_val = train_test_split(catboost_X, catboost_y, test_size=0.2)

# Remove outlier examples from X_train and y_train
# Keep them in X_val and y_val so the validation score reflects the full label distribution
outlier_threshold = 0.4
mask = (abs(y_train) <= outlier_threshold)
X_train = X_train.loc[mask]
y_train = y_train[mask]

print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}\n".format(y_train.shape))
print("X_val shape: {}".format(X_val.shape))
print("y_val shape: {}".format(y_val.shape))

X_train shape: (131426, 71)
y_train shape: (131426,)

X_val shape: (33578, 71)
y_val shape: (33578,)
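The outlier handling in the cell above can be sketched without pandas. This is only an illustration of the idea (filter the training split by absolute label, leave the validation split untouched); the helper name `drop_label_outliers` and all values are made up.

```python
def drop_label_outliers(features, labels, threshold=0.4):
    """Keep only the examples whose absolute label is within the threshold."""
    kept = [(x, y) for x, y in zip(features, labels) if abs(y) <= threshold]
    kept_features = [x for x, _ in kept]
    kept_labels = [y for _, y in kept]
    return kept_features, kept_labels

# Made-up toy data: one feature per row, label is a log error.
train_x = [[1500.0], [2200.0], [900.0], [3100.0]]
train_y = [0.02, -0.75, 0.10, 0.41]
val_x = [[1800.0], [2600.0]]
val_y = [0.05, 1.20]

# Only the training split is filtered; the validation split keeps its outlier.
train_x, train_y = drop_label_outliers(train_x, train_y)
print(len(train_x), len(val_x))
```

Filtering only the training split keeps extreme labels from dominating the fit, while the validation score is still computed on data that looks like what the model will see at prediction time.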


In [39]:
# CatBoost parameters (Fine-tuning)
params = {}
params['loss_function'] = 'MAE'
params['eval_metric'] = 'MAE'
params['nan_mode'] = 'Min'              # Method to handle NaN (set NaN to either Min or Max)
params['random_seed'] = 0

params['iterations'] = 1000             # default 1000, use early stopping during training
params['learning_rate'] = 0.015         # default 0.03

params['border_count'] = 254            # default 254 (alias max_bin, suggested to keep at default for best quality)

params['max_depth'] = 6                 # default 6 (must be <= 16, 6 to 10 is recommended)
params['random_strength'] = 1           # default 1 (used during splitting to deal with overfitting, try different values)
params['l2_leaf_reg'] = 5               # default 3 (used for leaf value calculation, try different values)
params['bagging_temperature'] = 1       # default 1 (higher value -> more aggressive bagging, try different values)

In [40]:
# Train CatBoost Regressor with early stopping on the validation set
val_pool = Pool(X_val, y_val, cat_features = categorical_indices)

# Seed the random number generators before training
np.random.seed(42)
random.seed(36)
model = CatBoostRegressor(**params)

# Training
model.fit(X_train, y_train,
          cat_features=categorical_indices,
          use_best_model=True, eval_set=val_pool, early_stopping_rounds=50, verbose=False)

# Evaluate model performance
print("Train score: {}".format(abs(model.predict(X_train) - y_train).mean() * 100))
print("Val score: {}".format(abs(model.predict(X_val) - y_val).mean() * 100))

Train score: 5.121938864571018
Val score: 6.864634505527216
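The two scores above are the mean absolute error between predictions and labels, multiplied by 100. A minimal pure-Python sketch of that computation follows; the function name `mae_times_100` and the numbers are made up for illustration.

```python
def mae_times_100(predictions, targets):
    """Mean absolute error between predictions and targets, scaled by 100."""
    total = sum(abs(p - t) for p, t in zip(predictions, targets))
    return 100.0 * total / len(targets)

# Made-up predictions and labels.
toy_predictions = [0.0, 0.1]
toy_targets = [0.05, 0.0]
toy_score = mae_times_100(toy_predictions, toy_targets)
print(toy_score)
```

Because the model is also trained with `loss_function = 'MAE'`, the reported score measures the same quantity that training optimizes.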


In [41]:
# CatBoost feature importance
feature_importance = [(feature_names[i], value) for i, value in enumerate(model.get_feature_importance())]
feature_importance.sort(key=lambda x: x[1], reverse=True)
for k, v in feature_importance[:10]:
    print("{}: {}".format(k, v))

finished_area_sqft_calc: 4.4446918450286335
year_built: 3.9309265716140254
month: 3.622635462706179
region_zip-tax_property-percent: 3.1424325221431246
location_sum: 3.1001804767255328
region_zip: 3.097685141764494
region_zip-finished_area_sqft_calc-percent: 3.054203590488389
region_zip-finished_area_sqft_calc-diff: 2.898600357374613
lot_sqft: 2.8462314985148525
region_zip-property_tax_per_sqft-percent: 2.7588189621089203


# Train on all data + Make predictions

In [42]:
# Train CatBoost on all given training data (preparing for submission)
outlier_threshold = 0.4
mask = (abs(catboost_y) <= outlier_threshold)
catboost_X = catboost_X.loc[mask]
catboost_y = catboost_y[mask]
print("catboost_X: {}".format(catboost_X.shape))
print("catboost_y: {}".format(catboost_y.shape))

params['random_seed'] = 0
params['iterations'] = 1000  # roughly chosen based on public leaderboard score
print(params)
np.random.seed(42)
random.seed(36)
model = CatBoostRegressor(**params)
model.fit(catboost_X, catboost_y, cat_features=categorical_indices, verbose=False)

# Sanity check: score on the earlier validation split (most of these rows are now in the training data, so this is not a held-out estimate)
print("sanity check score: {}".format(abs(model.predict(X_val) - y_val).mean() * 100))

catboost_X: (164299, 71)
catboost_y: (164299,)
{'loss_function': 'MAE', 'eval_metric': 'MAE', 'nan_mode': 'Min', 'random_seed': 0, 'iterations': 1000, 'learning_rate': 0.015, 'border_count': 254, 'max_depth': 6, 'random_strength': 1, 'l2_leaf_reg': 5, 'bagging_temperature': 1}
sanity check score: 6.82649824020036


In [46]:
%%time
file_name = '/Users/charles/Desktop/iFixerup/zr1/submission/final_catboost_single.csv'
submission, pred_2016, pred_2017 = catboost_mod.predict_and_export([model], features_2016, features_2017, file_name)

Start model 0 (2016)
Start model 0 (2017)
Length of submission DataFrame: 2985217
Submission header:
   ParcelId  201610  201611  201612  201710  201711  201712
0  10754147  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
1  10759547  0.0006  0.0006  0.0006  0.0034  0.0034  0.0034
2  10843547  0.0166  0.0166  0.0166  0.0097  0.0097  0.0097
3  10859147  0.0147  0.0147  0.0147  0.0154  0.0154  0.0154
4  10879947  0.0067  0.0067  0.0067  0.0085  0.0085  0.0085
CPU times: user 36.7 s, sys: 5.27 s, total: 42 s
Wall time: 46 s


# Ensemble Training & Prediction

In [47]:
# Remove outliers (if any) from training data
outlier_threshold = 0.4
mask = (abs(catboost_y) <= outlier_threshold)
catboost_X = catboost_X.loc[mask]
catboost_y = catboost_y[mask]
print("catboost_X: {}".format(catboost_X.shape))
print("catboost_y: {}".format(catboost_y.shape))

# Train multiple models
bags = 8
models = []
params['iterations'] = 1000
for i in range(bags):
    print("Start training model {}".format(i))
    params['random_seed'] = i
    np.random.seed(42)
    random.seed(36)
    model = CatBoostRegressor(**params)
    model.fit(catboost_X, catboost_y, cat_features=categorical_indices, verbose=False)
    models.append(model)
    
# Sanity check (scores on the earlier validation split, which overlaps the training data, should look reasonable)
for i, model in enumerate(models):
    print("model {}: {}".format(i, abs(model.predict(X_val) - y_val).mean() * 100))

# Save the trained models to disk
# save_models(models)

# models = load_models(['checkpoints/catboost_' + str(i) for i in range(8)])  # load pretrained models

catboost_X: (164299, 71)
catboost_y: (164299,)
Start training model 0
Start training model 1
Start training model 2
Start training model 3
Start training model 4
Start training model 5
Start training model 6
Start training model 7
model 0: 6.82649824020036
model 1: 6.826164956511857
model 2: 6.826275059383477
model 3: 6.8249612098063945
model 4: 6.826096716252645
model 5: 6.826055081492923
model 6: 6.826655431068749
model 7: 6.825603161829596


NameError: name 'save_models' is not defined
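`save_models` and `load_models` are referenced in the cell above but are not defined anywhere in this notebook, which is what the `NameError` reports. Below is a hypothetical sketch of what such helpers could look like. It assumes only that each model exposes `save_model(path)` and `load_model(path)` methods, as CatBoost models do; the `prefix` default and the `model_factory` argument are assumptions, and `StubModel` is a stand-in so the sketch runs without CatBoost installed.

```python
import os
import tempfile

def save_models(models, prefix="checkpoints/catboost_"):
    """Save each model to prefix + index and return the paths written."""
    directory = os.path.dirname(prefix)
    if directory:
        os.makedirs(directory, exist_ok=True)
    paths = []
    for i, model in enumerate(models):
        path = prefix + str(i)
        model.save_model(path)
        paths.append(path)
    return paths

def load_models(paths, model_factory):
    """Create one empty model per path and load the saved state into it."""
    models = []
    for path in paths:
        model = model_factory()
        model.load_model(path)
        models.append(model)
    return models

class StubModel:
    """Stand-in with the same method names as a CatBoost model (illustration only)."""
    def __init__(self, value=None):
        self.value = value

    def save_model(self, path):
        with open(path, "w") as f:
            f.write(str(self.value))

    def load_model(self, path):
        with open(path) as f:
            self.value = f.read()
        return self

# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "checkpoints", "catboost_")
    paths = save_models([StubModel("a"), StubModel("b")], prefix)
    restored = load_models(paths, StubModel)
restored_values = [m.value for m in restored]
print(restored_values)
```

With CatBoost installed, the same helpers would be called as `save_models(models)` and `load_models(paths, CatBoostRegressor)`.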

In [50]:
# Make predictions and export results
file_name = '/Users/charles/Desktop/iFixerup/zr1/submission/final_catboost_ensemble_x8.csv'
submission, pred_2016, pred_2017 = catboost_mod.predict_and_export(models, features_2016, features_2017, file_name)

Start model 0 (2016)
Start model 0 (2017)
Start model 1 (2016)
Start model 1 (2017)
Start model 2 (2016)
Start model 2 (2017)
Start model 3 (2016)
Start model 3 (2017)
Start model 4 (2016)
Start model 4 (2017)
Start model 5 (2016)
Start model 5 (2017)
Start model 6 (2016)
Start model 6 (2017)
Start model 7 (2016)
Start model 7 (2017)
Length of submission DataFrame: 2985217
Submission header:
   ParcelId  201610  201611  201612  201710  201711  201712
0  10754147  0.0055  0.0055  0.0055  0.0035  0.0035  0.0035
1  10759547  0.0098  0.0098  0.0098  0.0133  0.0133  0.0133
2  10843547  0.0222  0.0222  0.0222  0.0233  0.0233  0.0233
3  10859147  0.0166  0.0166  0.0166  0.0180  0.0180  0.0180
4  10879947  0.0052  0.0052  0.0052  0.0061  0.0061  0.0061
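How `catboost_mod.predict_and_export` combines the eight models is not shown in this notebook. One common way to combine models that differ only in their random seed is to average their predictions row by row, which the following sketch illustrates. `average_predictions` and `ConstantModel` are made-up names, and this is an assumption about the technique, not a description of the helper's actual code.

```python
def average_predictions(models, rows):
    """Average the per-row predictions of several models."""
    per_model = [model.predict(rows) for model in models]
    # zip(*per_model) groups the predictions for the same row together.
    return [sum(values) / len(models) for values in zip(*per_model)]

class ConstantModel:
    """Toy model that predicts the same value for every row (illustration only)."""
    def __init__(self, value):
        self.value = value

    def predict(self, rows):
        return [self.value for _ in rows]

toy_models = [ConstantModel(0.01), ConstantModel(0.02), ConstantModel(0.03)]
toy_rows = [[1500.0], [2200.0]]
averaged = average_predictions(toy_models, toy_rows)
print(averaged)
```

Averaging reduces the variance that comes from the random seed alone, which is consistent with the eight sanity-check scores above differing only in the third decimal place.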
