# Two Sigma Connect: Rental Listing Inquiries

## Prerequisites
Please make sure the following Python distributions and packages are installed.

* [Anaconda](https://anaconda.org)
* [XGBoost](https://github.com/dmlc/xgboost)
* [LightGBM](https://github.com/Microsoft/LightGBM)
* [Bayesian Optimization](https://github.com/fmfn/BayesianOptimization)
* [seaborn](https://seaborn.pydata.org)


You'll also need to create three subfolders in your working path: 

* input
* output
* python


Then download the data files into the "input" folder and put this notebook in the "python" folder.

The data files can be downloaded from
https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/data


In [None]:
import numpy as np
from scipy import sparse
import pandas as pd
import xgboost as xgb
import re
import string
import time
import seaborn as sns
import itertools
import lightgbm as lgb
from bayes_opt import BayesianOptimization

from sklearn import preprocessing, pipeline, metrics, model_selection
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.preprocessing import Imputer
%matplotlib inline 


# Get started

## Load data

In [None]:
train_data = pd.read_json('../input/train.json')
test_data = pd.read_json('../input/test.json')

In [None]:
train_size = train_data.shape[0]

## Create target variables

We need to convert the raw target variable (low / medium / high) into numeric form, plus one indicator column per class.

In [None]:
train_data['target'] = train_data['interest_level'].apply(lambda x: 0 if x=='low' else 1 if x=='medium' else 2)
train_data['low'] = train_data['interest_level'].apply(lambda x: 1 if x=='low' else 0)
train_data['medium'] = train_data['interest_level'].apply(lambda x: 1 if x=='medium' else 0)
train_data['high'] = train_data['interest_level'].apply(lambda x: 1 if x=='high' else 0)

## Merge training and testing data
Merging the two sets lets us apply every transformation once instead of twice.

In [None]:
full_data=pd.concat([train_data,test_data])
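
The merge-then-split pattern can be sketched on toy data as follows. The frames and column values here are hypothetical stand-ins, and `ignore_index=True` is an optional choice that the cell above does not use.

```python
import pandas as pd

# Hypothetical stand-ins for train_data and test_data
toy_train = pd.DataFrame({"listing_id": [1, 2, 3],
                          "price": [3000, 2500, 4100],
                          "interest_level": ["low", "high", "medium"]})
toy_test = pd.DataFrame({"listing_id": [4, 5], "price": [1900, 5200]})

toy_train_size = toy_train.shape[0]

# ignore_index=True gives the combined frame a clean 0..n-1 index
toy_full = pd.concat([toy_train, toy_test], ignore_index=True)

# One transformation covers both parts at once
toy_full["price_k"] = toy_full["price"] / 1000.0

# Row order is preserved, so the first toy_train_size rows are the training set
toy_train_part = toy_full.iloc[:toy_train_size]
toy_test_part = toy_full.iloc[toy_train_size:]
```

Columns that exist only in the training set, such as the target, are filled with NaN for the test rows.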

## Group variables

In [None]:
num_vars = ['bathrooms','bedrooms','latitude','longitude','price']
cat_vars = ['building_id','manager_id','display_address','street_address']
text_vars = ['description','features']
date_var = 'created'
image_var = 'photos'
id_var = 'listing_id'

## Date/time features

In [None]:
full_data['created_datetime'] = pd.to_datetime(full_data['created'], format="%Y-%m-%d %H:%M:%S")
full_data['created_year']=full_data['created_datetime'].apply(lambda x:x.year) ## low variance
full_data['created_month']=full_data['created_datetime'].apply(lambda x:x.month)
full_data['created_day']=full_data['created_datetime'].apply(lambda x:x.day)
full_data['created_dayofweek']=full_data['created_datetime'].apply(lambda x:x.dayofweek)
full_data['created_dayofyear']=full_data['created_datetime'].apply(lambda x:x.dayofyear)
full_data['created_weekofyear']=full_data['created_datetime'].apply(lambda x:x.weekofyear)
full_data['created_hour']=full_data['created_datetime'].apply(lambda x:x.hour)
full_data['created_epoch']=full_data['created_datetime'].apply(lambda x:x.value//10**9)

date_num_vars = ['created_month','created_dayofweek','created_dayofyear'
                 ,'created_weekofyear','created_hour','created_epoch']
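
As an alternative to calling `apply` row by row, the same kind of features can be derived with the vectorized `.dt` accessor. This is a minimal sketch on two made-up timestamps, not a replacement the notebook depends on.

```python
import pandas as pd

# Two made-up timestamps in the same format as the 'created' column
toy = pd.DataFrame({"created": ["2016-06-24 07:54:24", "2016-04-01 23:05:00"]})
created = pd.to_datetime(toy["created"], format="%Y-%m-%d %H:%M:%S")

toy["created_month"] = created.dt.month
toy["created_dayofweek"] = created.dt.dayofweek  # Monday is 0
toy["created_hour"] = created.dt.hour

# Whole seconds since the Unix epoch
toy["created_epoch"] = (created - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
```

On the full data set the vectorized form is usually much faster than a Python-level lambda per row.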

## Numeric features: basic engineering

In [None]:
full_data['rooms'] = full_data['bedrooms'] + full_data['bathrooms'] 
full_data['num_of_photos'] = full_data['photos'].apply(lambda x:len(x))
full_data['num_of_features'] = full_data['features'].apply(lambda x:len(x))
full_data['len_of_desc'] = full_data['description'].apply(lambda x:len(x))
full_data['words_of_desc'] = full_data['description'].apply(lambda x:len(re.sub('['+string.punctuation+']', '', x).split()))


full_data['nums_of_desc'] = full_data['description']\
        .apply(lambda x:re.sub('['+string.punctuation+']', '', x).split())\
        .apply(lambda x: len([s for s in x if s.isdigit()]))
        
full_data['has_phone'] = full_data['description'].apply(lambda x:re.sub('['+string.punctuation+']', '', x).split())\
        .apply(lambda x: [s for s in x if s.isdigit()])\
        .apply(lambda x: len([s for s in x if len(str(s))==10]))\
        .apply(lambda x: 1 if x>0 else 0)
full_data['has_email'] = full_data['description'].apply(lambda x: 1 if '@renthop.com' in x else 0)

full_data['building_id_is_zero'] = full_data['building_id'].apply(lambda x:1 if x=='0' else 0)

additional_num_vars = ['rooms','num_of_photos','num_of_features','len_of_desc',
                    'words_of_desc','has_phone','has_email','building_id_is_zero']

## Numeric-Numeric interactions

In [None]:
full_data['avg_word_len'] = full_data[['len_of_desc','words_of_desc']]\
                                    .apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)
    
full_data['price_per_room'] = full_data[['price','rooms']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)
full_data['price_per_bedroom'] = full_data[['price','bedrooms']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)
full_data['price_per_bathroom'] = full_data[['price','bathrooms']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)
full_data['price_per_feature'] = full_data[['price','num_of_features']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)
full_data['price_per_photo'] = full_data[['price','num_of_photos']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)
full_data['price_per_word'] = full_data[['price','words_of_desc']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)
full_data['price_by_desc_len'] = full_data[['price','len_of_desc']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)


full_data['photos_per_room'] = full_data[['num_of_photos','rooms']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)
full_data['photos_per_bedroom'] = full_data[['num_of_photos','bedrooms']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)
full_data['photos_per_bathroom'] = full_data[['num_of_photos','bathrooms']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)

full_data['desc_len_per_room'] = full_data[['len_of_desc','rooms']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)
full_data['desc_len_per_bedroom'] = full_data[['len_of_desc','bedrooms']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)
full_data['desc_len_per_bathroom'] = full_data[['len_of_desc','bathrooms']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)
full_data['desc_len_per_word'] = full_data[['len_of_desc','words_of_desc']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)
full_data['desc_len_per_numeric'] = full_data[['len_of_desc','nums_of_desc']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)

full_data['features_per_room'] = full_data[['num_of_features','rooms']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)
full_data['features_per_bedroom'] = full_data[['num_of_features','bedrooms']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)
full_data['features_per_bathroom'] = full_data[['num_of_features','bathrooms']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)
full_data['features_per_photo'] = full_data[['num_of_features','num_of_photos']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)
full_data['features_per_word'] = full_data[['num_of_features','words_of_desc']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)
full_data['features_by_desc_len'] = full_data[['num_of_features','len_of_desc']].apply(lambda x: x[0]/x[1] if x[1]!=0 else 0, axis=1)


interactive_num_vars = ['avg_word_len','price_per_room','price_per_bedroom','price_per_bathroom',
                        'price_per_feature','price_per_photo','price_per_word','price_by_desc_len',
                        'photos_per_room','photos_per_bedroom','photos_per_bathroom',
                        'desc_len_per_room','desc_len_per_bedroom','desc_len_per_bathroom','desc_len_per_word','desc_len_per_numeric',
                        'features_per_room','features_per_bedroom','features_per_bathroom',
                        'features_per_photo','features_per_word','features_by_desc_len']
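
Every ratio above follows the same rule: divide, and fall back to 0 when the denominator is 0. A small helper could express that rule once. The name `safe_ratio` is hypothetical and the sketch below runs on toy values.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator, fill=0.0):
    """Element-wise numerator / denominator, with `fill` where the denominator is 0."""
    num = np.asarray(numerator, dtype=float)
    den = np.asarray(denominator, dtype=float)
    out = np.full(num.shape, fill, dtype=float)
    # Positions where den == 0 are skipped and keep the fill value
    np.divide(num, den, out=out, where=den != 0)
    return out

toy = pd.DataFrame({"price": [3000, 2400, 1800], "bedrooms": [2, 0, 3]})
toy["price_per_bedroom"] = safe_ratio(toy["price"], toy["bedrooms"])
```

Studios (0 bedrooms) get the fill value instead of an infinite price per bedroom.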

## Numeric-categorical interactions

In [None]:
num_cat_vars =[]
price_by_manager = full_data.groupby('manager_id')['price'].agg([np.min,np.max,np.median,np.mean]).reset_index()
price_by_manager.columns = ['manager_id','min_price_by_manager',
                            'max_price_by_manager','median_price_by_manager','mean_price_by_manager']
full_data = pd.merge(full_data,price_by_manager, how='left',on='manager_id')

price_by_building = full_data.groupby('building_id')['price'].agg([np.min,np.max,np.median,np.mean]).reset_index()
price_by_building.columns = ['building_id','min_price_by_building',
                            'max_price_by_building','median_price_by_building','mean_price_by_building']
full_data = pd.merge(full_data,price_by_building, how='left',on='building_id')


full_data['price_percentile_by_manager']=\
            full_data[['price','min_price_by_manager','max_price_by_manager']]\
            .apply(lambda x:(x[0]-x[1])/(x[2]-x[1]) if (x[2]-x[1])!=0 else 0.5,
                  axis=1)
full_data['price_percentile_by_building']=\
            full_data[['price','min_price_by_building','max_price_by_building']]\
            .apply(lambda x:(x[0]-x[1])/(x[2]-x[1]) if (x[2]-x[1])!=0 else 0.5,
                  axis=1)


num_cat_vars.append('price_percentile_by_manager')
num_cat_vars.append('price_percentile_by_building')
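
Despite the name, the "percentile" above is the position of a price inside the group's [min, max] range. The same quantity can be sketched with `groupby(...).transform`, which avoids the intermediate merge. The data below is made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "manager_id": ["a", "a", "a", "b"],
    "price": [1000, 2000, 3000, 2500],
})

grp = toy.groupby("manager_id")["price"]
lo = grp.transform("min")   # group minimum, aligned to each row
hi = grp.transform("max")   # group maximum, aligned to each row
span = hi - lo

# Position of each price inside its manager's price range;
# 0.5 when the range is empty (a single listing or identical prices)
toy["price_percentile_by_manager"] = np.where(
    span != 0, (toy["price"] - lo) / span.replace(0, np.nan), 0.5)
```

Manager "b" has a single listing, so its range is empty and the value falls back to 0.5.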



## Two-way categorical features interactions

In [None]:
for comb in itertools.combinations(cat_vars, 2):
    comb_var_name = comb[0] +'-'+ comb[1]
    full_data [comb_var_name] = full_data [ comb[0]].astype(str) +'_' + full_data [ comb[1]].astype(str)
    cat_vars.append(comb_var_name)

cat_vars    
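
The loop above appends to `cat_vars` while iterating over combinations of it. That is safe because `itertools.combinations` copies its input into a tuple before yielding anything, so the new names do not produce further pairs. A toy check, using a shortened hypothetical list:

```python
import itertools

cats = ["building_id", "manager_id", "display_address"]

# combinations() snapshots `cats` first, so appending inside the loop
# does not change which pairs are generated
for a, b in itertools.combinations(cats, 2):
    cats.append(a + "-" + b)
```

With 3 starting variables the list ends with 3 extra pair names; with the 4 variables in the notebook it ends with 6.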

## listing ID

In theory, an ID variable should not be used to train a model. In this competition, however, listing_id carries information about when the listing was created, so including it improves the model.

In [None]:
min_listing_id = full_data['listing_id'].min()
max_listing_id = full_data['listing_id'].max()
full_data['listing_id_pos']=full_data['listing_id'].apply(lambda x:np.float64((x-min_listing_id+1))/(max_listing_id-min_listing_id+1))
num_vars.append('listing_id')
num_vars.append('listing_id_pos')

## Text features

* Here we use CountVectorizer, but you are encouraged to give TfidfVectorizer a try.

* The max_features parameter should be tuned.

* The outputs are sparse matrices, which can be merged with NumPy arrays using the scipy.sparse.hstack function.


In [None]:
# Join multi-word features with underscores so each feature stays a single token
features_text = full_data["features"].apply(lambda x: " ".join(["_".join(i.split(" ")) for i in x]))

cntvec = CountVectorizer(stop_words='english', max_features=200)
feature_sparse = cntvec.fit_transform(features_text)
# vocabulary_ maps token -> column index, so sort by index to match the column order
feature_vars = ['feature_' + v for v in sorted(cntvec.vocabulary_, key=cntvec.vocabulary_.get)]

cntvec = CountVectorizer(stop_words='english', max_features=100)
desc_sparse = cntvec.fit_transform(full_data["description"])
desc_vars = ['desc_' + v for v in sorted(cntvec.vocabulary_, key=cntvec.vocabulary_.get)]

cntvec = CountVectorizer(stop_words='english', max_features=10)
st_addr_sparse = cntvec.fit_transform(full_data["street_address"])
st_addr_vars = ['st_addr_' + v for v in sorted(cntvec.vocabulary_, key=cntvec.vocabulary_.get)]
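
The sketch below shows, on three made-up documents, how token counts come out as a sparse matrix and how they can be stacked next to a dense array. It also shows one way to recover column names in column order, since `vocabulary_` is a token-to-index mapping rather than an ordered list.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

docs = ["doorman elevator", "elevator laundry", "doorman"]  # made-up feature strings
dense = np.array([[1.0], [2.0], [3.0]])                      # a made-up numeric column

vec = CountVectorizer()
counts = vec.fit_transform(docs)  # sparse matrix, one column per token

# Sort tokens by their column index to get names aligned with the columns
token_names = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Dense and sparse blocks are stacked side by side into one CSR matrix
combined = sparse.hstack([dense, counts]).tocsr()
```

The first column of `combined` is the dense value and the remaining columns are the token counts.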

## Categorical features - label encoding

In [None]:
LBL = preprocessing.LabelEncoder()

LE_vars=[]
LE_map=dict()
for cat_var in cat_vars:
    print ("Label Encoding %s" % (cat_var))
    LE_var=cat_var+'_le'
    full_data[LE_var]=LBL.fit_transform(full_data[cat_var])
    LE_vars.append(LE_var)
    LE_map[cat_var]=LBL.classes_
    
print ("Label-encoded features: %s" % (LE_vars))

## Categorical features - one hot encoding

The output is a sparse matrix

In [None]:
OHE = preprocessing.OneHotEncoder(sparse=True)
start=time.time()
OHE.fit(full_data[['building_id_le', 'manager_id_le']])
OHE_sparse=OHE.transform(full_data[['building_id_le', 'manager_id_le']])
                                   
print ('One-hot-encoding finished in %f seconds' % (time.time()-start))


# OHE_sparse only covers building_id and manager_id, so name only those columns
OHE_vars = [var + '_' + str(level).replace(' ','_')\
                for var in ['building_id', 'manager_id'] for level in LE_map[var] ]

print ("OHE_sparse size :" ,OHE_sparse.shape)
print ("One-hot encoded categorical feature samples : %s" % (OHE_vars[:100]))
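
To see how the one-hot columns line up with the label-encoded inputs, here is a minimal sketch on toy codes. The column labels are illustrative, and the encoder is created with default arguments, which return a sparse matrix.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Two label-encoded columns: the first has 3 levels, the second has 2
codes = np.array([[0, 1],
                  [1, 0],
                  [2, 1]])

ohe = OneHotEncoder()
onehot = ohe.fit_transform(codes)  # sparse output, one column per level

# categories_ lists the levels per input column, in output column order
names = ["{}_{}".format(col, level)
         for col, levels in zip(["building_id", "manager_id"], ohe.categories_)
         for level in levels]
```

The output has 3 + 2 = 5 columns, and `names` follows the same left-to-right order.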

## Categorical features - mean encoding

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, KFold  # KFold is used for regression targets
from itertools import product

class MeanEncoder:
    def __init__(self, categorical_features, n_splits=5, target_type='classification', prior_weight_func=None):
        """
        :param categorical_features: list of str, the name of the categorical columns to encode

        :param n_splits: the number of splits used in mean encoding

        :param target_type: str, 'regression' or 'classification'

        :param prior_weight_func:
        a function that takes in the number of observations, and outputs prior weight
        when a dict is passed, the default exponential decay function will be used:
        k: the number of observations needed for the posterior to be weighted equally as the prior
        f: larger f --> smaller slope
        """

        self.categorical_features = categorical_features
        self.n_splits = n_splits
        self.learned_stats = {}

        if target_type == 'classification':
            self.target_type = target_type
            self.target_values = []
        else:
            self.target_type = 'regression'
            self.target_values = None

        if isinstance(prior_weight_func, dict):
            self.prior_weight_func = eval('lambda x: 1 / (1 + np.exp((x - k) / f))', dict(prior_weight_func, np=np))
        elif callable(prior_weight_func):
            self.prior_weight_func = prior_weight_func
        else:
            self.prior_weight_func = lambda x: 1 / (1 + np.exp((x - 2) / 1))

    @staticmethod
    def mean_encode_subroutine(X_train, y_train, X_test, variable, target, prior_weight_func):
        X_train = X_train[[variable]].copy()
        X_test = X_test[[variable]].copy()

        if target is not None:
            nf_name = '{}_pred_{}'.format(variable, target)
            X_train['pred_temp'] = (y_train == target).astype(int)  # classification
        else:
            nf_name = '{}_pred'.format(variable)
            X_train['pred_temp'] = y_train  # regression
        prior = X_train['pred_temp'].mean()

        col_avg_y = X_train.groupby(by=variable, axis=0)['pred_temp'].agg({'mean': 'mean', 'beta': 'size'})
        col_avg_y['beta'] = prior_weight_func(col_avg_y['beta'])
        col_avg_y[nf_name] = col_avg_y['beta'] * prior + (1 - col_avg_y['beta']) * col_avg_y['mean']
        col_avg_y.drop(['beta', 'mean'], axis=1, inplace=True)

        nf_train = X_train.join(col_avg_y, on=variable)[nf_name].values
        nf_test = X_test.join(col_avg_y, on=variable).fillna(prior, inplace=False)[nf_name].values

        return nf_train, nf_test, prior, col_avg_y

    def fit_transform(self, X, y):
        """
        :param X: pandas DataFrame, n_samples * n_features
        :param y: pandas Series or numpy array, n_samples
        :return X_new: the transformed pandas DataFrame containing mean-encoded categorical features
        """
        X_new = X.copy()
        if self.target_type == 'classification':
            skf = StratifiedKFold(self.n_splits)
        else:
            skf = KFold(self.n_splits)

        if self.target_type == 'classification':
            self.target_values = sorted(set(y))
            self.learned_stats = {'{}_pred_{}'.format(variable, target): [] for variable, target in
                                  product(self.categorical_features, self.target_values)}
            for variable, target in product(self.categorical_features, self.target_values):
                nf_name = '{}_pred_{}'.format(variable, target)
                X_new.loc[:, nf_name] = np.nan
                for large_ind, small_ind in skf.split(y, y):
                    nf_large, nf_small, prior, col_avg_y = MeanEncoder.mean_encode_subroutine(
                        X_new.iloc[large_ind], y.iloc[large_ind], X_new.iloc[small_ind], variable, target, self.prior_weight_func)
                    X_new.iloc[small_ind, -1] = nf_small
                    self.learned_stats[nf_name].append((prior, col_avg_y))
        else:
            self.learned_stats = {'{}_pred'.format(variable): [] for variable in self.categorical_features}
            for variable in self.categorical_features:
                nf_name = '{}_pred'.format(variable)
                X_new.loc[:, nf_name] = np.nan
                for large_ind, small_ind in skf.split(y, y):
                    nf_large, nf_small, prior, col_avg_y = MeanEncoder.mean_encode_subroutine(
                        X_new.iloc[large_ind], y.iloc[large_ind], X_new.iloc[small_ind], variable, None, self.prior_weight_func)
                    X_new.iloc[small_ind, -1] = nf_small
                    self.learned_stats[nf_name].append((prior, col_avg_y))
        return X_new

    def transform(self, X):
        """
        :param X: pandas DataFrame, n_samples * n_features
        :return X_new: the transformed pandas DataFrame containing mean-encoded categorical features
        """
        X_new = X.copy()

        if self.target_type == 'classification':
            for variable, target in product(self.categorical_features, self.target_values):
                nf_name = '{}_pred_{}'.format(variable, target)
                X_new[nf_name] = 0
                for prior, col_avg_y in self.learned_stats[nf_name]:
                    X_new[nf_name] += X_new[[variable]].join(col_avg_y, on=variable).fillna(prior, inplace=False)[
                        nf_name]
                X_new[nf_name] /= self.n_splits
        else:
            for variable in self.categorical_features:
                nf_name = '{}_pred'.format(variable)
                X_new[nf_name] = 0
                for prior, col_avg_y in self.learned_stats[nf_name]:
                    X_new[nf_name] += X_new[[variable]].join(col_avg_y, on=variable).fillna(prior, inplace=False)[
                        nf_name]
                X_new[nf_name] /= self.n_splits

        return X_new

In [None]:
mean_encoder = MeanEncoder(categorical_features=['manager_id','building_id'], prior_weight_func={'k':5, 'f':1})
mean_encoded_train = mean_encoder.fit_transform(train_data, train_data['target'])
mean_encoded_test = mean_encoder.transform(test_data)

mean_coded_vars = list(set(mean_encoded_train.columns) - set(train_data.columns))
mean_coded_vars.append('listing_id')
full_data = pd.merge(full_data, 
                     pd.concat([mean_encoded_train[mean_coded_vars], mean_encoded_test[mean_coded_vars]]),
                     how='left',
                     on='listing_id'
                    )
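
The smoothing inside `MeanEncoder` blends the global prior with the per-category mean, using a weight that decays with the number of observations. The sketch below reproduces that formula in isolation with `k=5` and `f=1`, as in the call above. The helper names are hypothetical.

```python
import numpy as np

def prior_weight(n, k=5, f=1):
    """Weight given to the global prior for a category seen n times."""
    return 1.0 / (1.0 + np.exp((n - k) / f))

def smoothed_mean(category_mean, n, prior, k=5, f=1):
    """Blend of prior and category mean, as in mean_encode_subroutine."""
    beta = prior_weight(n, k, f)
    return beta * prior + (1.0 - beta) * category_mean

# A category seen exactly k times weights prior and category mean equally
half = prior_weight(5)  # 0.5

# Rare categories lean on the prior, frequent ones on their own mean
rare = smoothed_mean(category_mean=1.0, n=1, prior=0.2)
frequent = smoothed_mean(category_mean=1.0, n=20, prior=0.2)
```

A manager with a single listing therefore gets an encoding close to the global rate, which limits overfitting to tiny groups.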

## Listing freshness and listing quality

In [None]:
full_data['disp_is_street'] = (full_data['display_address'] == full_data['street_address'])*1

full_data['num_of_html_tag']=full_data.description.apply(lambda x:x.count('<'))
full_data['num_of_#']=full_data.description.apply(lambda x:x.count('#'))
full_data['num_of_!']=full_data.description.apply(lambda x:x.count('!'))
full_data['num_of_$']=full_data.description.apply(lambda x:x.count('$'))

temp_aggr = full_data.sort_values(['description','building_id','bedrooms',
                                   'bathrooms','price','created_datetime']).\
                groupby(['description','building_id','bedrooms','bathrooms','price'])
full_data['posted_times'] = temp_aggr.created_datetime.rank(method='first', na_option='top',pct=True)

num_vars = num_vars + ['disp_is_street', 'num_of_html_tag','num_of_#','num_of_!','num_of_$', 'posted_times']

## Manager performance

In [None]:
manager_agg_vars = []
aggr_num_vars = ['bathrooms',
                 'bedrooms',
                 'latitude',
                 'longitude',
                 'price',
                 'listing_id_pos',
                 'building_id_is_zero',
                 'rooms',
                 'num_of_photos',
                 'num_of_features',
                 'len_of_desc',
                 'words_of_desc',
                 'price_per_room',
                 'num_of_html_tag',
                 'posted_times'
                ]

mgr_aggr =full_data.groupby('manager_id')[aggr_num_vars].agg([np.size,np.mean,np.median,np.min,np.max,np.std])


# Names must follow the order of the aggregation list above:
# size, mean, median, min, max, std
for v in aggr_num_vars:
    manager_agg_vars.append(v+'_'+'cnt_by_mgr')
    manager_agg_vars.append(v+'_'+'mean_by_mgr')
    manager_agg_vars.append(v+'_'+'median_by_mgr')
    manager_agg_vars.append(v+'_'+'min_by_mgr')
    manager_agg_vars.append(v+'_'+'max_by_mgr')
    manager_agg_vars.append(v+'_'+'std_by_mgr')

mgr_aggr.columns = manager_agg_vars

full_data = pd.merge(full_data, mgr_aggr.reset_index(), how = 'left', on='manager_id')

mgr_aggr = full_data[['manager_id','building_id']].drop_duplicates().groupby('manager_id').count().reset_index()
mgr_aggr.columns=['manager_id','bldn_cnt_by_mgr']
full_data = pd.merge(full_data, mgr_aggr, how = 'left', on='manager_id')

mgr_aggr = full_data[['manager_id','building_id']].drop_duplicates().groupby('building_id').count().reset_index()
mgr_aggr.columns=['building_id','mgr_cnt_by_bldn']
full_data = pd.merge(full_data, mgr_aggr, how = 'left', on='building_id')

manager_agg_vars = manager_agg_vars + ['bldn_cnt_by_mgr','mgr_cnt_by_bldn']
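
Naming aggregate columns by position is easy to get out of sync with the aggregation list. One alternative, sketched here on toy data, is to build the flat names from the column MultiIndex that `agg` returns, so each name always matches its statistic.

```python
import pandas as pd

toy = pd.DataFrame({
    "manager_id": ["a", "a", "b"],
    "price": [1000, 3000, 2000],
    "bedrooms": [1, 3, 2],
})

agg = toy.groupby("manager_id")[["price", "bedrooms"]].agg(["count", "mean", "min", "max"])

# agg.columns is a MultiIndex of (variable, statistic) pairs;
# flatten it into names such as 'price_max_by_mgr'
agg.columns = ["{}_{}_by_mgr".format(var, stat) for var, stat in agg.columns]

toy = toy.merge(agg.reset_index(), how="left", on="manager_id")
```

The naming scheme (`<variable>_<statistic>_by_mgr`) is only an illustration and differs slightly from the `cnt` prefix used above.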

## The magic feature

First mentioned by Grandmaster Silogram:
https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/discussion/31765

Discovered and made public by another Grandmaster, KazAnova:
https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/discussion/31870

It may contain information about when the listing was actually created.

In [None]:
image_date = pd.read_csv("../input/listing_image_time.csv")

image_date.columns = ["listing_id", "image_time_stamp"]
full_data = pd.merge(full_data, image_date, on="listing_id", how="left")
num_vars.append('image_time_stamp')

In [None]:
full_vars = num_vars + date_num_vars + interactive_num_vars\
            + additional_num_vars + manager_agg_vars + LE_vars + mean_coded_vars
    
    
    
train_x = full_data[full_vars][:train_size].values
train_y = full_data['target'][:train_size].values

test_x = full_data[full_vars][train_size:].values


train_x = sparse.hstack([train_x, feature_sparse[:train_size], 
                         desc_sparse[:train_size], st_addr_sparse[:train_size]]).tocsr()

test_x = sparse.hstack([test_x, feature_sparse[train_size:], 
                        desc_sparse[train_size:], st_addr_sparse[train_size:]]).tocsr()


# OHE_sparse is not stacked into train_x above, so OHE_vars is left out of the column names
full_vars = full_vars + feature_vars + desc_vars + st_addr_vars

print ("training data size: ", train_x.shape,"testing data size: ", test_x.shape)

## XGBoost Tuning

### Manual tuning

#### max_depth

In [None]:
# %%time
# scores = []
# for max_depth in [3,4,5,6]:

#     params = dict()
#     params['objective'] = 'multi:softprob'
#     params['num_class'] = 3
#     params['eta'] = 0.1
#     params['max_depth'] = max_depth
#     params['min_child_weight'] = 1
#     params['colsample_bytree'] = 1
#     params['subsample'] = 1
#     params['gamma'] = 0
#     params['seed']=1234

#     cv_results = xgb.cv(params, xgb.DMatrix(train_x, label=train_y.reshape(train_x.shape[0],1)),
#                    num_boost_round=1000000,
#                    nfold=5,
#            metrics={'mlogloss'},
#            seed=1234,
#            callbacks=[xgb.callback.early_stop(50)])
#     best_iteration = len(cv_results)
#     best_score = cv_results['test-mlogloss-mean'].min()
#     print (max_depth,best_score,best_iteration)
#     scores.append([best_score,params['eta'],params['max_depth'],params['min_child_weight'],
#                       params['colsample_bytree'],params['subsample'],params['gamma'],best_iteration])
    
# scores = pd.DataFrame(scores,columns=['score','eta','max_depth','min_child_weight',
#                                    'colsample_bytree','subsample','gamma','best_iteration'])   
# best_max_depth = scores.sort_values(by='score',ascending=True)['max_depth'].values[0]
# print ('best max_depth is', best_max_depth)

#### min_child_weight

In [None]:
# %%time
# scores = []
# for min_child_weight in [1, 10, 50, 100]:

#     params = dict()
#     params['objective'] = 'multi:softprob'
#     params['num_class'] = 3
#     params['eta'] = 0.1
#     params['max_depth'] = best_max_depth
#     params['min_child_weight'] = min_child_weight
#     params['colsample_bytree'] = 1
#     params['subsample'] = 1
#     params['gamma'] = 0
#     params['seed']=1234

#     cv_results = xgb.cv(params, xgb.DMatrix(train_x, label=train_y.reshape(train_x.shape[0],1)),
#                    num_boost_round=1000000,
#                    nfold=5,
#            metrics={'mlogloss'},
#            seed=1234,
#            callbacks=[xgb.callback.early_stop(50)])
#     best_iteration = len(cv_results)
#     best_score = cv_results['test-mlogloss-mean'].min()
#     print (min_child_weight,best_score,best_iteration)
#     scores.append([best_score,params['eta'],params['max_depth'],params['min_child_weight'],
#                       params['colsample_bytree'],params['subsample'],params['gamma'],best_iteration])
    
# scores = pd.DataFrame(scores,columns=['score','eta','max_depth','min_child_weight',
#                                    'colsample_bytree','subsample','gamma','best_iteration'])   
# best_min_child_weight = scores.sort_values(by='score',ascending=True)['min_child_weight'].values[0]
# print ('best min_child_weight is', best_min_child_weight)

#### colsample_bytree

In [None]:
# %%time
# scores = []
# for colsample_bytree in [0.1,0.3,0.5,0.7,0.9]:

#     params = dict()
#     params['objective'] = 'multi:softprob'
#     params['num_class'] = 3
#     params['eta'] = 0.1
#     params['max_depth'] = best_max_depth
#     params['min_child_weight'] = best_min_child_weight
#     params['colsample_bytree'] = colsample_bytree
#     params['subsample'] = 1
#     params['gamma'] = 0
#     params['seed']=1234

#     cv_results = xgb.cv(params, xgb.DMatrix(train_x, label=train_y.reshape(train_x.shape[0],1)),
#                    num_boost_round=1000000,
#                    nfold=5,
#            metrics={'mlogloss'},
#            seed=1234,
#            callbacks=[xgb.callback.early_stop(50)])
#     best_iteration = len(cv_results)
#     best_score = cv_results['test-mlogloss-mean'].min()
#     print (colsample_bytree,best_score,best_iteration)
#     scores.append([best_score,params['eta'],params['max_depth'],params['min_child_weight'],
#                       params['colsample_bytree'],params['subsample'],params['gamma'],best_iteration])
    
# scores = pd.DataFrame(scores,columns=['score','eta','max_depth','min_child_weight',
#                                    'colsample_bytree','subsample','gamma','best_iteration'])   
# best_colsample_bytree = scores.sort_values(by='score',ascending=True)['colsample_bytree'].values[0]
# print ('best colsample_bytree is', best_colsample_bytree)

#### subsample

In [None]:
# %%time
# scores = []
# for subsample in [0.1,0.3,0.5,0.7,0.9]:

#     params = dict()
#     params['objective'] = 'multi:softprob'
#     params['num_class'] = 3
#     params['eta'] = 0.1
#     params['max_depth'] = best_max_depth
#     params['min_child_weight'] = best_min_child_weight
#     params['colsample_bytree'] = best_colsample_bytree
#     params['subsample'] = subsample
#     params['gamma'] = 0
#     params['seed']=1234

#     cv_results = xgb.cv(params, xgb.DMatrix(train_x, label=train_y.reshape(train_x.shape[0],1)),
#                    num_boost_round=1000000,
#                    nfold=5,
#            metrics={'mlogloss'},
#            seed=1234,
#            callbacks=[xgb.callback.early_stop(50)])
#     best_iteration = len(cv_results)
#     best_score = cv_results['test-mlogloss-mean'].min()
#     print (subsample,best_score,best_iteration)
#     scores.append([best_score,params['eta'],params['max_depth'],params['min_child_weight'],
#                       params['colsample_bytree'],params['subsample'],params['gamma'],best_iteration])
    
# scores = pd.DataFrame(scores,columns=['score','eta','max_depth','min_child_weight',
#                                    'colsample_bytree','subsample','gamma','best_iteration'])   
# best_subsample = scores.sort_values(by='score',ascending=True)['subsample'].values[0]
# print ('best subsample is', best_subsample)

#### gamma

In [None]:
# %%time
# scores = []
# for gamma in [0,0.5,1,1.5,2]:

#     params = dict()
#     params['objective'] = 'multi:softprob'
#     params['num_class'] = 3
#     params['eta'] = 0.1
#     params['max_depth'] = best_max_depth
#     params['min_child_weight'] = best_min_child_weight
#     params['colsample_bytree'] = best_colsample_bytree
#     params['subsample'] = best_subsample
#     params['gamma'] = gamma
#     params['seed']=1234

#     cv_results = xgb.cv(params, xgb.DMatrix(train_x, label=train_y.reshape(train_x.shape[0],1)),
#                    num_boost_round=1000000,
#                    nfold=5,
#            metrics={'mlogloss'},
#            seed=1234,
#            callbacks=[xgb.callback.early_stop(50)])
#     best_iteration = len(cv_results)
#     best_score = cv_results['test-mlogloss-mean'].min()
#     print (gamma,best_score,best_iteration)
#     scores.append([best_score,params['eta'],params['max_depth'],params['min_child_weight'],
#                       params['colsample_bytree'],params['subsample'],params['gamma'],best_iteration])
    
# scores = pd.DataFrame(scores,columns=['score','eta','max_depth','min_child_weight',
#                                    'colsample_bytree','subsample','gamma','best_iteration'])   
# best_gamma = scores.sort_values(by='score',ascending=True)['gamma'].values[0]
# print ('best gamma is', best_gamma)

## Automated tuning

We will be using Bayesian optimization for automated parameter tuning.

It works by constructing a posterior distribution of functions (a Gaussian process) that best describes the function you want to optimize. As the number of observations grows, the posterior distribution improves, and the algorithm becomes more certain of which regions of the parameter space are worth exploring and which are not.

* https://github.com/fmfn/BayesianOptimization
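
To make the idea concrete, here is a minimal one-dimensional sketch of the loop: fit a Gaussian process to the points seen so far, then evaluate next wherever the optimistic estimate (mean plus a multiple of the standard deviation) is highest. It uses scikit-learn directly rather than the BayesianOptimization package, and the toy objective, the fixed starting points, and the value of `kappa` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    """Toy function to maximize; its true maximum is 0 at x = 2."""
    return -(x - 2.0) ** 2

bounds = (0.0, 5.0)
candidates = np.linspace(bounds[0], bounds[1], 501).reshape(-1, 1)

# A few fixed starting observations (a real run would sample them randomly)
X_obs = np.array([[0.0], [2.5], [5.0]])
y_obs = objective(X_obs).ravel()

kappa = 2.0  # exploration weight of the upper-confidence-bound rule (assumed)
for _ in range(10):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                  normalize_y=True, random_state=0)
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)
    # Next point: the candidate with the most optimistic prediction
    x_next = candidates[np.argmax(mu + kappa * sigma)]
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next[0]))

best_x = X_obs[np.argmax(y_obs), 0]
best_y = y_obs.max()
```

In the tuning cell below, the objective is the cross-validated score of an XGBoost model and the search space has five dimensions instead of one.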

### XGBoost

In [None]:
xgtrain = xgb.DMatrix(train_x, label=train_y.reshape(train_x.shape[0],1))

def xgb_evaluate(min_child_weight,
                 colsample_bytree,
                 max_depth,
                 subsample,
                 gamma):
    params = dict()
    params['objective'] = 'multi:softprob'
    params['num_class'] = 3
    params['eta'] = 0.1
    params['max_depth'] = int(max_depth )   
    params['min_child_weight'] = int(min_child_weight)
    params['colsample_bytree'] = colsample_bytree
    params['subsample'] = subsample
    params['gamma'] = gamma
    # verbose_eval is an argument of xgb.cv, not a booster parameter


    cv_result = xgb.cv(params, xgtrain,
                       num_boost_round=100000,
                       nfold=5,
                       metrics={'mlogloss'},
                       seed=1234,
                       callbacks=[xgb.callback.early_stop(50)],
                       verbose_eval=True)

    return -cv_result['test-mlogloss-mean'].min()


xgb_BO = BayesianOptimization(xgb_evaluate, 
                             {'max_depth': (3, 10),
                              'min_child_weight': (0, 100),
                              'colsample_bytree': (0.1, 0.7),
                              'subsample': (0.7, 1),
                              'gamma': (0, 2)
                             }
                            )

xgb_BO.maximize(init_points=5, n_iter=40)

#### Show tuning results

In [None]:
xgb_BO_scores = pd.DataFrame(xgb_BO.res['all']['params'])
xgb_BO_scores['score'] = pd.DataFrame(xgb_BO.res['all']['values'])
xgb_BO_scores = xgb_BO_scores.sort_values(by='score',ascending=False)
xgb_BO_scores.head()

#### Plot scores vs parameters

In [None]:
sns.pairplot(xgb_BO_scores)

### Train the model with smaller learning rate

In [None]:
xgb_params = xgb_BO_scores.iloc[0].to_dict()
xgb_params.pop('score', None)  # 'score' is the CV result, not a booster parameter
xgb_params['objective'] = 'multi:softprob'
xgb_params['num_class'] = 3
xgb_params['eta'] = 0.01  # Smaller learning rate

# Bayesian optimization returns floats; these two parameters must be integers
xgb_params['max_depth'] = int(xgb_params['max_depth'])
xgb_params['min_child_weight'] = int(xgb_params['min_child_weight'])
xgb_params['seed'] = 1234

cv_results = xgb.cv(xgb_params, 
                    xgb.DMatrix(train_x, label=train_y.reshape(train_x.shape[0],1)),
                    num_boost_round=1000000, 
                    nfold=5,
                    metrics={'mlogloss'},
                    seed=1234,
                    early_stopping_rounds=50,
                    verbose_eval=50
                   )

best_xgb_score = cv_results['test-mlogloss-mean'].min()
best_xgb_iteration = len(cv_results)

In [None]:
start = time.time()
clf = xgb.XGBClassifier(learning_rate = 0.01
                        , n_estimators =best_xgb_iteration
                        , max_depth = xgb_params['max_depth']
                        , min_child_weight = xgb_params['min_child_weight']
                        , subsample = xgb_params['subsample']
                        , colsample_bytree = xgb_params['colsample_bytree']
                        , gamma = xgb_params['gamma']
                        , seed = 1234
                        , nthread = -1
                       )

clf.fit(train_x, train_y)

print ("Training finished in %d seconds." % (time.time()-start))

preds = clf.predict_proba(test_x)
sub_df = pd.DataFrame(preds,columns = ["low", "medium", "high"])
sub_df["listing_id"] = test_data.listing_id.values
sub_df.to_csv("../output/sub_xgb_tuned.csv", index=False)

### LightGBM

#### Manual tuning

In [None]:
# %%time
# scores = []
# for max_bin in [100, 255, 400,600,800,1000]:

#     params = dict()
#     params['objective'] = 'multiclass'
#     params['num_class'] = 3
#     params['learning_rate'] = 0.1
#     params['max_bin'] = max_bin  

#     cv_results = lgb.cv(params,
#                     lgb.Dataset(train_x, train_y, max_bin=max_bin),
#                     num_boost_round=1000000,
#                     nfold=5,
#                     early_stopping_rounds=100,
#                     metrics='multi_logloss',
#                     shuffle=False,
#                     verbose_eval=100
#                    )
#     cv_results = pd.DataFrame(cv_results)
#     best_iteration = len(cv_results)
#     best_score = cv_results['multi_logloss-mean'].min()
#     print (max_bin,best_iteration, best_score)
#     scores.append([max_bin,best_iteration, best_score])
    
# scores = pd.DataFrame(scores, columns =['max_bin','iteration','score'])
# best_max_bin = scores.sort_values(by='score',ascending=True)['max_bin'].values[0]
# print ('best max_bin is', best_max_bin)


In [None]:
# %%time
# scores = []
# for num_leaves in [3, 10, 30,100,300,1000]:

#     params = dict()
#     params['objective'] = 'multiclass'
#     params['num_class'] = 3
#     params['learning_rate'] = 0.1
#     params['max_bin'] = best_max_bin
#     params['num_leaves'] = num_leaves

#     cv_results = lgb.cv(params,
#                     lgb.Dataset(train_x, train_y, max_bin=best_max_bin),
#                     num_boost_round=1000000,
#                     nfold=5,
#                     early_stopping_rounds=100,
#                     metrics='multi_logloss',
#                     shuffle=False,
#                     verbose_eval=100
#                    )
#     cv_results = pd.DataFrame(cv_results)
#     best_iteration = len(cv_results)
#     best_score = cv_results['multi_logloss-mean'].min()
#     print (num_leaves,best_score,best_iteration)
#     scores.append([num_leaves,best_iteration, best_score])

# scores = pd.DataFrame(scores, columns =['num_leaves','iteration','score'])
# best_num_leaves = scores.sort_values(by='score',ascending=True)['num_leaves'].values[0]
# print ('best num_leaves is', best_num_leaves)


In [None]:
# %%time
# scores = []
# for feature_fraction in [0.2, 0.4, 0.6, 0.8]:

#     params = dict()
#     params['objective'] = 'multiclass'
#     params['num_class'] = 3
#     params['learning_rate'] = 0.1
#     params['max_bin'] = best_max_bin  
#     params['num_leaves'] = best_num_leaves
#     params['feature_fraction'] = feature_fraction

#     cv_results = lgb.cv(params,
#                     lgb.Dataset(train_x, train_y, max_bin=best_max_bin),
#                     num_boost_round=1000000,
#                     nfold=5,
#                     early_stopping_rounds=100,
#                     metrics='multi_logloss',
#                     shuffle=False,
#                     verbose_eval=100
#                    )
#     cv_results = pd.DataFrame(cv_results)
#     best_iteration = len(cv_results)
#     best_score = cv_results['multi_logloss-mean'].min()
#     print (feature_fraction,best_iteration, best_score)
#     scores.append([feature_fraction,best_iteration, best_score])

# scores = pd.DataFrame(scores, columns =['feature_fraction','iteration','score'])
# best_feature_fraction = scores.sort_values(by='score',ascending=True)['feature_fraction'].values[0]
# print ('best feature_fraction is', best_feature_fraction)



In [None]:
# %%time
# scores = []
# for bagging_fraction in [0.3, 0.5, 0.7, 0.9]:

#     params = dict()
#     params['objective'] = 'multiclass'
#     params['num_class'] = 3
#     params['learning_rate'] = 0.1
#     params['max_bin'] = best_max_bin
#     params['num_leaves'] = best_num_leaves
#     params['feature_fraction'] = best_feature_fraction
#     params['bagging_fraction'] = bagging_fraction
#     params['bagging_freq'] = 1

#     cv_results = lgb.cv(params,
#                     lgb.Dataset(train_x, train_y, max_bin=best_max_bin),
#                     num_boost_round=1000000,
#                     nfold=5,
#                     early_stopping_rounds=100,
#                     metrics='multi_logloss',
#                     shuffle=False,
#                     verbose_eval=100
#                    )
#     cv_results = pd.DataFrame(cv_results)
#     best_iteration = len(cv_results)
#     best_score = cv_results['multi_logloss-mean'].min()
#     print (bagging_fraction,best_iteration,best_score)
#     scores.append([bagging_fraction,best_iteration,best_score])

# scores = pd.DataFrame(scores, columns =['bagging_fraction','iteration','score'])
# best_bagging_fraction = scores.sort_values(by='score',ascending=True)['bagging_fraction'].values[0]
# print ('best bagging_fraction is', best_bagging_fraction)

In [None]:
# %%time
# scores = []
# for bagging_fraction in [1]:

#     params = dict()
#     params['objective'] = 'multiclass'
#     params['num_class'] = 3
#     params['learning_rate'] = 0.1
#     params['max_bin'] = best_max_bin
#     params['num_leaves'] = best_num_leaves
#     params['feature_fraction'] = best_feature_fraction
#     params['bagging_fraction'] = bagging_fraction
#     params['bagging_freq'] = 1

#     cv_results = lgb.cv(params,
#                     lgb.Dataset(train_x, train_y, max_bin=best_max_bin),
#                     num_boost_round=1000000,
#                     nfold=5,
#                     early_stopping_rounds=100,
#                     metrics='multi_logloss',
#                     shuffle=False,
#                     verbose_eval=100
#                    )
#     cv_results = pd.DataFrame(cv_results)
#     best_iteration = len(cv_results)
#     best_score = cv_results['multi_logloss-mean'].min()
#     print (bagging_fraction,best_iteration,best_score)
#     scores.append([bagging_fraction,best_iteration,best_score])

# scores = pd.DataFrame(scores, columns =['bagging_fraction','iteration','score'])
# best_bagging_fraction = scores.sort_values(by='score',ascending=True)['bagging_fraction'].values[0]
# print ('best bagging_fraction is', best_bagging_fraction)

In [None]:
# %%time
# scores = []
# for bagging_freq in [1, 3, 5]:

#     params = dict()
#     params['objective'] = 'multiclass'
#     params['num_class'] = 3
#     params['learning_rate'] = 0.1
#     params['max_bin'] = best_max_bin
#     params['num_leaves'] = best_num_leaves
#     params['feature_fraction'] = best_feature_fraction
#     params['bagging_fraction'] = best_bagging_fraction
#     params['bagging_freq'] = bagging_freq

#     cv_results = lgb.cv(params,
#                     lgb.Dataset(train_x, train_y, max_bin=best_max_bin),
#                     num_boost_round=1000000,
#                     nfold=5,
#                     early_stopping_rounds=100,
#                     metrics='multi_logloss',
#                     shuffle=False,
#                     verbose_eval=100
#                    )
#     cv_results = pd.DataFrame(cv_results)
#     best_iteration = len(cv_results)
#     best_score = cv_results['multi_logloss-mean'].min()
#     print (bagging_freq,best_iteration,best_score)
#     scores.append([bagging_freq,best_iteration,best_score])

# scores = pd.DataFrame(scores, columns =['bagging_freq','iteration','score'])
# best_bagging_freq = scores.sort_values(by='score',ascending=True)['bagging_freq'].values[0]
# print ('best bagging_freq is', best_bagging_freq)

In [None]:
# %%time
# scores = []
# for min_gain_to_split in [0, 0.1, 0.5, 1.0, 1.5]:

#     params = dict()
#     params['objective'] = 'multiclass'
#     params['num_class'] = 3
#     params['learning_rate'] = 0.1
#     params['max_bin'] = best_max_bin
#     params['num_leaves'] = best_num_leaves
#     params['feature_fraction'] = best_feature_fraction
#     params['bagging_fraction'] = best_bagging_fraction
#     params['min_gain_to_split'] = min_gain_to_split

#     cv_results = lgb.cv(params,
#                     lgb.Dataset(train_x, train_y, max_bin=best_max_bin),
#                     num_boost_round=1000000,
#                     nfold=5,
#                     early_stopping_rounds=100,
#                     metrics='multi_logloss',
#                     shuffle=False,
#                     verbose_eval=100
#                    )
#     cv_results = pd.DataFrame(cv_results)
#     best_iteration = len(cv_results)
#     best_score = cv_results['multi_logloss-mean'].min()
#     print (min_gain_to_split,best_iteration,best_score)
#     scores.append([min_gain_to_split,best_iteration,best_score])

# scores = pd.DataFrame(scores, columns =['min_gain_to_split','iteration','score'])
# best_min_gain_to_split = scores.sort_values(by='score',ascending=True)['min_gain_to_split'].values[0]
# print ('best min_gain_to_split is', best_min_gain_to_split)


In [None]:
# %%time
# scores = []
# for min_sum_hessian_in_leaf in [0,0.001,1, 3, 10,30,100]:

#     params = dict()
#     params['objective'] = 'multiclass'
#     params['num_class'] = 3
#     params['learning_rate'] = 0.1
#     params['max_bin'] = best_max_bin
#     params['num_leaves'] = best_num_leaves
#     params['feature_fraction'] = best_feature_fraction
#     params['bagging_fraction'] = best_bagging_fraction
#     params['min_gain_to_split'] = best_min_gain_to_split
#     params['min_sum_hessian_in_leaf'] = min_sum_hessian_in_leaf

#     cv_results = lgb.cv(params,
#                     lgb.Dataset(train_x, train_y, max_bin=best_max_bin),
#                     num_boost_round=1000000,
#                     nfold=5,
#                     early_stopping_rounds=100,
#                     metrics='multi_logloss',
#                     shuffle=False,
#                     verbose_eval=100
#                    )
#     cv_results = pd.DataFrame(cv_results)
#     best_iteration = len(cv_results)
#     best_score = cv_results['multi_logloss-mean'].min()
#     print (min_sum_hessian_in_leaf,best_iteration,best_score)
#     scores.append([min_sum_hessian_in_leaf,best_iteration,best_score])

# scores = pd.DataFrame(scores, columns =['min_sum_hessian_in_leaf','iteration','score'])
# best_min_sum_hessian_in_leaf = scores.sort_values(by='score',ascending=True)['min_sum_hessian_in_leaf'].values[0]
# print ('best min_sum_hessian_in_leaf is', best_min_sum_hessian_in_leaf)



In [None]:
# %%time
# scores = []
# for lambda_l2 in [0,0.01,0.1, 1, 10]:

#     params = dict()
#     params['objective'] = 'multiclass'
#     params['num_class'] = 3
#     params['learning_rate'] = 0.1
#     params['max_bin'] = best_max_bin  
#     params['num_leaves'] = best_num_leaves
#     params['feature_fraction'] = best_feature_fraction
#     params['bagging_fraction'] = best_bagging_fraction
#     params['min_gain_to_split'] = best_min_gain_to_split
#     params['min_sum_hessian_in_leaf'] = best_min_sum_hessian_in_leaf
#     params['lambda_l1'] = best_lambda_l1
#     params['lambda_l2'] = lambda_l2

#     cv_results = lgb.cv(params,
#                     lgb.Dataset(train_x, train_y, max_bin=best_max_bin),
#                     num_boost_round=1000000,
#                     nfold=5,
#                     early_stopping_rounds=100,
#                     metrics='multi_logloss',
#                     shuffle=False,
#                     verbose_eval=100
#                    )
#     cv_results = pd.DataFrame(cv_results)
#     best_iteration = len(cv_results)
#     best_score = cv_results['multi_logloss-mean'].min()
#     print (lambda_l2,best_iteration,best_score)
#     scores.append([lambda_l2,best_iteration,best_score])

# scores = pd.DataFrame(scores, columns =['lambda_l2','iteration','score'])
# best_lambda_l2 = scores.sort_values(by='score',ascending=True)['lambda_l2'].values[0]
# print ('best lambda_l2 is', best_lambda_l2)


In [None]:
def lgb_evaluate(max_bin,
                 num_leaves,
                 min_sum_hessian_in_leaf,
                 min_gain_to_split,
                 feature_fraction,
                 bagging_fraction,
                 bagging_freq,
                 lambda_l1,
                 lambda_l2
                 ):
    params = dict()
    params['objective'] = 'multiclass'
    params['num_class'] = 3
    params['learning_rate'] = 0.1
    params['max_bin'] = int(max_bin)
    params['num_leaves'] = int(num_leaves)    
    params['min_sum_hessian_in_leaf'] = int(min_sum_hessian_in_leaf)
    params['min_gain_to_split'] = min_gain_to_split    
    params['feature_fraction'] = feature_fraction
    params['bagging_fraction'] = bagging_fraction
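    params['bagging_freq'] = int(bagging_freq)
    # NOTE: lambda_l1 and lambda_l2 are part of the search space below but are
    # never copied into params, so they do not affect the cross-validation score.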


    cv_results = lgb.cv(params,
                    lgb.Dataset(train_x, train_y, max_bin=int(max_bin)),
                    num_boost_round=1000000,
                    nfold=5,
                    early_stopping_rounds=100,
                    metrics='multi_logloss',
                    stratified=False,
                    shuffle=True,
                    verbose_eval=False
                   )

    return -pd.DataFrame(cv_results)['multi_logloss-mean'].min()


lgb_BO = BayesianOptimization(lgb_evaluate, 
                             {'max_bin': (850, 900),
                              'num_leaves': (10, 20),
                              'min_sum_hessian_in_leaf': (4, 8),
                              'min_gain_to_split': (0,1),
                              'feature_fraction': (0.35, 0.45),
                              'bagging_fraction': (0.8,1),
                              'bagging_freq': (1,1),
                              'lambda_l1': (0,0.5),
                              'lambda_l2': (0,10)
                             }
                            )

lgb_BO.maximize(init_points=5, n_iter=40)


#### Show LightGBM tuning results

In [None]:
lgb_BO_scores = pd.DataFrame(lgb_BO.res['all']['params'])
lgb_BO_scores['score'] = pd.DataFrame(lgb_BO.res['all']['values'])
lgb_BO_scores = lgb_BO_scores.sort_values(by='score',ascending=False)
lgb_BO_scores

### Train the model with smaller learning rate

In [None]:
params = lgb_BO_scores.iloc[0].to_dict()
lgb_params = dict()
lgb_params['objective'] = 'multiclass'
lgb_params['num_class'] = 3
lgb_params['learning_rate'] = 0.01 # Smaller learning rate

lgb_params['max_bin'] = int(params['max_bin'])   
lgb_params['num_leaves'] = int(params['num_leaves'])    
lgb_params['min_sum_hessian_in_leaf'] = int(params['min_sum_hessian_in_leaf'])
lgb_params['min_gain_to_split'] = params['min_gain_to_split']     
lgb_params['feature_fraction'] = params['feature_fraction']
lgb_params['bagging_fraction'] = params['bagging_fraction']
lgb_params['bagging_freq'] = int(params['bagging_freq'])


cv_results = lgb.cv(lgb_params,
                lgb.Dataset(train_x, train_y, max_bin=lgb_params['max_bin']),
                num_boost_round=1000000,
                nfold=5,
                early_stopping_rounds=200, # Bigger stopping rounds
                metrics='multi_logloss',
                shuffle=True, stratified=False,
                verbose_eval=100
               )

cv_results = pd.DataFrame(cv_results)
best_lgb_iteration = len(cv_results)
best_lgb_score = cv_results['multi_logloss-mean'].min()

print (best_lgb_iteration, best_lgb_score)


# [100]	cv_agg's multi_logloss: 0.746109 + 0.0065575
# [200]	cv_agg's multi_logloss: 0.639026 + 0.0120335
# [300]	cv_agg's multi_logloss: 0.596377 + 0.0147712
# [400]	cv_agg's multi_logloss: 0.574696 + 0.0158541
# [500]	cv_agg's multi_logloss: 0.562016 + 0.0162295
# [600]	cv_agg's multi_logloss: 0.553874 + 0.0161019
# [700]	cv_agg's multi_logloss: 0.548233 + 0.0157665
# [800]	cv_agg's multi_logloss: 0.544046 + 0.0156151
# [900]	cv_agg's multi_logloss: 0.540636 + 0.0154918
# [1000]	cv_agg's multi_logloss: 0.537844 + 0.0153443
# [1100]	cv_agg's multi_logloss: 0.535618 + 0.0151677
# [1200]	cv_agg's multi_logloss: 0.533848 + 0.0150771
# [1300]	cv_agg's multi_logloss: 0.53235 + 0.0150731
# [1400]	cv_agg's multi_logloss: 0.531319 + 0.0150225
# [1500]	cv_agg's multi_logloss: 0.530262 + 0.0149934
# [1600]	cv_agg's multi_logloss: 0.529346 + 0.0149552
# [1700]	cv_agg's multi_logloss: 0.528506 + 0.0149328
# [1800]	cv_agg's multi_logloss: 0.527763 + 0.0148955
# [1900]	cv_agg's multi_logloss: 0.527144 + 0.0148148
# [2000]	cv_agg's multi_logloss: 0.526505 + 0.014854
# [2100]	cv_agg's multi_logloss: 0.525989 + 0.0148448
# [2200]	cv_agg's multi_logloss: 0.525539 + 0.0148273
# [2300]	cv_agg's multi_logloss: 0.525226 + 0.0147968
# [2400]	cv_agg's multi_logloss: 0.524854 + 0.0147791
# [2500]	cv_agg's multi_logloss: 0.524553 + 0.0147948
# [2600]	cv_agg's multi_logloss: 0.524289 + 0.014782
# [2700]	cv_agg's multi_logloss: 0.524064 + 0.0147793
# [2800]	cv_agg's multi_logloss: 0.523813 + 0.0147561
# [2900]	cv_agg's multi_logloss: 0.52353 + 0.0147243
# [3000]	cv_agg's multi_logloss: 0.523274 + 0.0146968
# [3100]	cv_agg's multi_logloss: 0.523125 + 0.0146779
# [3200]	cv_agg's multi_logloss: 0.522853 + 0.0146509
# [3300]	cv_agg's multi_logloss: 0.522641 + 0.0146358
# [3400]	cv_agg's multi_logloss: 0.522566 + 0.0146042
# [3500]	cv_agg's multi_logloss: 0.522362 + 0.0145816
# [3600]	cv_agg's multi_logloss: 0.522202 + 0.0145666
# [3700]	cv_agg's multi_logloss: 0.522101 + 0.0145976
# [3800]	cv_agg's multi_logloss: 0.521914 + 0.0146079
# [3900]	cv_agg's multi_logloss: 0.521868 + 0.0145966
# [4000]	cv_agg's multi_logloss: 0.521732 + 0.0145751
# [4100]	cv_agg's multi_logloss: 0.521581 + 0.0145607
# [4200]	cv_agg's multi_logloss: 0.521465 + 0.0145596
# [4300]	cv_agg's multi_logloss: 0.521352 + 0.0145497
# [4400]	cv_agg's multi_logloss: 0.521356 + 0.0145549
# [4500]	cv_agg's multi_logloss: 0.521287 + 0.0145471
# [4600]	cv_agg's multi_logloss: 0.521228 + 0.0145317
# [4700]	cv_agg's multi_logloss: 0.521191 + 0.0145207
# [4800]	cv_agg's multi_logloss: 0.521114 + 0.0145103
# [4900]	cv_agg's multi_logloss: 0.521096 + 0.0144984
# [5000]	cv_agg's multi_logloss: 0.521048 + 0.0145049
# [5100]	cv_agg's multi_logloss: 0.521157 + 0.0145078
# [5200]	cv_agg's multi_logloss: 0.521136 + 0.0144979
# 5021 0.521042308018

In [None]:
start = time.time()
clf = lgb.LGBMClassifier(learning_rate = 0.01
                        , n_estimators =best_lgb_iteration
                        , max_bin = lgb_params['max_bin']   
                        , num_leaves = lgb_params['num_leaves']
                        , min_child_weight = lgb_params['min_sum_hessian_in_leaf']
                        , min_split_gain = lgb_params['min_gain_to_split'] 
                        , colsample_bytree = lgb_params['feature_fraction']
                        , subsample = lgb_params['bagging_fraction']
                        , subsample_freq = lgb_params['bagging_freq']
                        , seed = 1234
                       )

print (clf)

clf.fit(train_x, train_y)

print ("Training finished in %d seconds." % (time.time()-start))

preds = clf.predict_proba(test_x)
sub_df = pd.DataFrame(preds,columns = ["low", "medium", "high"])
sub_df["listing_id"] = test_data.listing_id.values
sub_df.to_csv("../output/sub_lgb_tuned.csv", index=False)

## Model stacking

1. We'll use the tuned parameter sets to train 5 XGBoost models and 5 LightGBM models for level 1.
2. The outputs of the level 1 models will contain 3 (classes) * 10 (models) = 30 features.
3. We'll train an MLP model using these 30 features only.
4. We'll train another LightGBM model using these 30 features plus the original features.
5. The outputs from the two level 2 models can be combined as the final submission.
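
The level 1 features are out-of-fold predictions: each training row is scored by a model that never saw that row, and every model contributes one block of class probabilities. The sketch below illustrates only that bookkeeping. The stand-in "model" simply predicts the class frequencies of its training fold, and the helper names (`kfold_indices`, `class_prior_model`, `out_of_fold_features`) are hypothetical, not part of this notebook.

```python
from collections import Counter


def kfold_indices(n_samples, n_folds):
    """Yield (train_ids, val_ids) for contiguous, unshuffled folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        val_ids = list(range(start, start + size))
        train_ids = [i for i in range(n_samples)
                     if i < start or i >= start + size]
        yield train_ids, val_ids
        start += size


def class_prior_model(labels, num_class):
    """Stand-in model: predict the class frequencies of the training fold."""
    counts = Counter(labels)
    return [counts.get(c, 0) / len(labels) for c in range(num_class)]


def out_of_fold_features(labels, n_models, num_class, n_folds):
    """Build an (n_samples x n_models*num_class) matrix of OOF predictions."""
    n = len(labels)
    blend = [[0.0] * (n_models * num_class) for _ in range(n)]
    for j in range(n_models):
        lo = j * num_class  # first column of model j's block
        for train_ids, val_ids in kfold_indices(n, n_folds):
            probs = class_prior_model([labels[i] for i in train_ids], num_class)
            for i in val_ids:
                # Row i is filled only by the model trained without row i
                blend[i][lo:lo + num_class] = probs
    return blend


labels = [0, 0, 1, 2, 0, 1, 0, 2, 0, 1]
features = out_of_fold_features(labels, n_models=10, num_class=3, n_folds=5)
print(len(features), len(features[0]))  # 10 rows, 30 columns
```

With 10 models and 3 classes the matrix has 10 * 3 = 30 columns, which is the input width seen by the level 2 models.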

#### Let's define a few stacking functions

In [None]:
def blend_lgb_model(params_list, train_x, train_y, test_x, num_class, blend_folds):

    #     skf = model_selection.StratifiedKFold(n_splits=blend_folds,random_state=1234)
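    # random_state has no effect without shuffle=True (and recent scikit-learn
    # versions reject the combination), so it is omitted here
    skf = model_selection.KFold(n_splits=blend_folds)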
    skf_ids = list(skf.split(train_x, train_y))

    train_blend_x = np.zeros((train_x.shape[0], len(params_list) * num_class))
    test_blend_x = np.zeros((test_x.shape[0], len(params_list) * num_class))
    blend_scores = np.zeros((blend_folds, len(params_list)))

    print("Start blending.")
    for j, params in enumerate(params_list):
        print("Blending model", j + 1, params)
        test_blend_x_j = np.zeros((test_x.shape[0], num_class))
        max_bin = params['max_bin']
        num_boost_round = params['num_boost_round']
        for i, (train_ids, val_ids) in enumerate(skf_ids):
            start = time.time()
            print("Model %d fold %d" % (j + 1, i + 1))
            train_x_fold = train_x[train_ids]
            train_y_fold = train_y[train_ids]
            val_x_fold = train_x[val_ids]
            val_y_fold = train_y[val_ids]
            # Train for a fixed number of rounds (params['num_boost_round']); no early stopping here
            print(params)
            model = lgb.train(params,
                              lgb.Dataset(train_x_fold, train_y_fold,
                                          max_bin=max_bin),
                              num_boost_round=num_boost_round
                              )
            val_y_predict_fold = model.predict(val_x_fold)
            score = metrics.log_loss(val_y_fold, val_y_predict_fold)
            print("LOGLOSS: ", score)
            blend_scores[i, j] = score
            train_blend_x[val_ids, j * num_class:j *
                          num_class + num_class] = val_y_predict_fold
            test_blend_x_j = test_blend_x_j + model.predict(test_x)
            print(time.time() - start)
        test_blend_x[:, j * num_class:j * num_class +
                     num_class] = test_blend_x_j / blend_folds
        print("Score for model %d is %f" %
              (j + 1, np.mean(blend_scores[:, j])))
    return train_blend_x, test_blend_x, blend_scores

In [None]:
def blend_xgb_model(params_list, train_x, train_y, test_x, num_class, blend_folds,missing=None):

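    # random_state has no effect without shuffle=True (and recent scikit-learn
    # versions reject the combination), so it is omitted here
    skf = model_selection.KFold(n_splits=blend_folds)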
    skf_ids = list(skf.split(train_x, train_y))


    train_blend_x = np.zeros((train_x.shape[0], len(params_list)*num_class))
    test_blend_x = np.zeros((test_x.shape[0], len(params_list)*num_class))
    blend_scores = np.zeros ((blend_folds,len(params_list)))

    print  ("Start blending.")
    for j, params in enumerate(params_list):
        print ("Blending model",j+1, params)
        test_blend_x_j = np.zeros((test_x.shape[0], num_class))
        for i, (train_ids, val_ids) in enumerate(skf_ids):
            start = time.time()
            print ("Model %d fold %d" %(j+1,i+1))
            train_x_fold = train_x[train_ids]
            train_y_fold = train_y[train_ids]
            val_x_fold = train_x[val_ids]
            val_y_fold = train_y[val_ids]
            # Train for a fixed number of rounds (params['num_boost_round']); no early stopping here
            print (params, params['num_boost_round'], missing)
            model = xgb.train(params,
                                xgb.DMatrix(train_x_fold, 
                                            label=train_y_fold.reshape(train_y_fold.shape[0],1), 
                                            missing=missing),
                                num_boost_round=params['num_boost_round']
                            )
            val_y_predict_fold = model.predict(xgb.DMatrix(val_x_fold,missing=missing),
                                              )
            
            score = metrics.log_loss(val_y_fold,val_y_predict_fold)
            print ("LOGLOSS: ", score)
            blend_scores[i,j]=score
            train_blend_x[val_ids, j*num_class:j*num_class+num_class] = val_y_predict_fold
            test_blend_x_j = test_blend_x_j + model.predict(xgb.DMatrix(test_x,missing=missing))
            print (time.time()-start)
        test_blend_x[:,j*num_class:j*num_class+num_class] = test_blend_x_j/blend_folds
        print ("Score for model %d is %f" % (j+1,np.mean(blend_scores[:,j])))
    return train_blend_x, test_blend_x, blend_scores    

In [None]:
## Another version: using early stopping

# def blend_xgb_model(params_list, train_x, train_y, test_x, num_class, blend_folds,missing=None):

#     skf = model_selection.KFold(n_splits=blend_folds,random_state=1234, shuffle=True)
#     skf_ids = list(skf.split(train_x, train_y))


#     train_blend_x = np.zeros((train_x.shape[0], len(params_list)*num_class))
#     test_blend_x = np.zeros((test_x.shape[0], len(params_list)*num_class))
#     blend_scores = np.zeros ((blend_folds,len(params_list)))

#     print  ("Start blending.")
#     for j, params in enumerate(params_list):
#         print ("Blending model",j+1, params)
#         test_blend_x_j = np.zeros((test_x.shape[0], num_class))
#         for i, (train_ids, val_ids) in enumerate(skf_ids):
#             start = time.time()
#             print ("Model %d fold %d" %(j+1,i+1))
#             train_x_fold = train_x[train_ids]
#             train_y_fold = train_y[train_ids]
#             val_x_fold = train_x[val_ids]
#             val_y_fold = train_y[val_ids]
#             # Set n_estimators to a large number for early_stopping   
#             print (params, params['num_boost_round'], missing)
#             model = xgb.train(params,
#                               xgb.DMatrix(train_x_fold, 
#                                           label=train_y_fold.reshape(train_y_fold.shape[0],1), 
#                                           missing=missing),
#                               evals = [(xgb.DMatrix(val_x_fold, 
#                                                     val_y_fold, 
#                                                     missing=missing), 'valid')],
#                               num_boost_round=10000,
#                               verbose_eval=100, 
#                               early_stopping_rounds=100
#                              )    
#             val_y_predict_fold = model.predict(xgb.DMatrix(val_x_fold,missing=missing), 
#                                                     ntree_limit = model.best_iteration
#                                               )
            
#             score = metrics.log_loss(val_y_fold,val_y_predict_fold)
#             print ("LOGLOSS: ", score)
#             blend_scores[i,j]=score
#             train_blend_x[val_ids, j*num_class:j*num_class+num_class] = val_y_predict_fold
#             test_blend_x_j = test_blend_x_j + model.predict(xgb.DMatrix(test_x,missing=missing))
#             print (time.time()-start)
#         test_blend_x[:,j*num_class:j*num_class+num_class] = test_blend_x_j/blend_folds
#         print ("Score for model %d is %f" % (j+1,np.mean(blend_scores[:,j])))
#     return train_blend_x, test_blend_x, blend_scores    

In [None]:
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.neural_network import MLPClassifier


def search_model(train_x, train_y, est, param_grid, n_jobs, cv, refit=False):
    """Grid search for the best model, scored by negated log loss."""
    model = model_selection.GridSearchCV(estimator=est,
                                         param_grid=param_grid,
                                         scoring='neg_log_loss',
                                         verbose=10,
                                         n_jobs=n_jobs,
                                         refit=refit,
                                         cv=cv)
    # Fit the grid search model
    model.fit(train_x, train_y)
    # Scores are negated log loss, so higher (closer to 0) is better
    print("Best score: %0.3f" % model.best_score_)
    print("Best parameters set:", model.best_params_)
    print("Scores:", model.cv_results_['mean_test_score'])
    return model

### Level 1 models

#### LightGBM

* Train 5 models based on top 5 sets of parameters tuned by Bayesian Optimization
* Save the outputs so they can be used later for model ensembling.

In [None]:
lgb_params_list = []

for p in lgb_BO_scores.head(5).iterrows(): #Top 5 sets of params
    params = dict()
    params['objective'] = 'multiclass'
    params['num_class'] = 3
    params['learning_rate'] = 0.01
    params['max_bin'] = int(p[1].to_dict()['max_bin'])
    params['num_leaves'] = int(p[1].to_dict()['num_leaves'])   
    params['min_sum_hessian_in_leaf'] = int(p[1].to_dict()['min_sum_hessian_in_leaf'])   
    params['min_gain_to_split'] = p[1].to_dict()['min_gain_to_split']    
    params['feature_fraction'] = p[1].to_dict()['feature_fraction']
    params['bagging_fraction'] = p[1].to_dict()['bagging_fraction']
    params['bagging_freq'] = int(p[1].to_dict()['bagging_freq'])
    params['num_boost_round'] = best_lgb_iteration
    print (params)
    lgb_params_list.append(params) 

In [None]:
# lgb_params_list = [{'min_gain_to_split': 0.77275621547590578, 'bagging_fraction': 0.93193799681414446, 'learning_rate': 0.01, 'bagging_freq': 1, 'min_sum_hessian_in_leaf': 4, 'feature_fraction': 0.42634174534518871, 'max_bin': 870, 'num_leaves': 19, 'objective': 'multiclass', 'num_boost_round': 10, 'num_class': 3},
# {'min_gain_to_split': 0.78091813895986428, 'bagging_fraction': 0.98937474204062581, 'learning_rate': 0.01, 'bagging_freq': 1, 'min_sum_hessian_in_leaf': 7, 'feature_fraction': 0.43295950599989869, 'max_bin': 890, 'num_leaves': 19, 'objective': 'multiclass', 'num_boost_round': 10, 'num_class': 3},
# {'min_gain_to_split': 0.87710698158147304, 'bagging_fraction': 0.94705819160185833, 'learning_rate': 0.01, 'bagging_freq': 1, 'min_sum_hessian_in_leaf': 7, 'feature_fraction': 0.36031255532444145, 'max_bin': 899, 'num_leaves': 10, 'objective': 'multiclass', 'num_boost_round': 10, 'num_class': 3},
# {'min_gain_to_split': 0.78341159495834189, 'bagging_fraction': 0.96529206675756984, 'learning_rate': 0.01, 'bagging_freq': 1, 'min_sum_hessian_in_leaf': 7, 'feature_fraction': 0.38724806130539013, 'max_bin': 893, 'num_leaves': 19, 'objective': 'multiclass', 'num_boost_round': 10, 'num_class': 3},
# {'min_gain_to_split': 0.94107377908690526, 'bagging_fraction': 0.85962423257426801, 'learning_rate': 0.01, 'bagging_freq': 1, 'min_sum_hessian_in_leaf': 7, 'feature_fraction': 0.43604871956773872, 'max_bin': 899, 'num_leaves': 19, 'objective': 'multiclass', 'num_boost_round': 10, 'num_class': 3}]

In [None]:
train_blend_x_lgb_01, test_blend_x_lgb_01, blend_scores_lgb_01 = blend_lgb_model(lgb_params_list, 
                                                        train_x, 
                                                        train_y, 
                                                        test_x, 
                                                        num_class=3, 
                                                        blend_folds=5)


np.savetxt('../input/train_blend_x_lgb_01.csv',train_blend_x_lgb_01, delimiter=',')
np.savetxt('../input/test_blend_x_lgb_01.csv',test_blend_x_lgb_01, delimiter=',')

Here we'll create a submission using the output from model 1, which was generated from out-of-sample predictions instead of a single model. Compared with the submission generated by a single model, which one is better?

In [None]:
preds = test_blend_x_lgb_01[:,0:3]
sub_df = pd.DataFrame(preds,columns = ["low", "medium", "high"])
sub_df["listing_id"] = test_data.listing_id.values
sub_df.to_csv("../output/sub_lgb_tuned_oos.csv", index=False)
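
The cell above uses columns 0 to 2, i.e. the first model's block of class probabilities. As a hedged alternative, the blocks of all level 1 models could be averaged instead. The sketch below assumes the blend matrix stores each model's `num_class` probabilities in consecutive columns, as the slicing above suggests; `average_model_blocks` and `toy_blend` are hypothetical names.

```python
import numpy as np

def average_model_blocks(blend_x, num_class):
    """Average the per-model probability blocks of a blend matrix (hypothetical helper)."""
    n_rows, n_cols = blend_x.shape
    if n_cols % num_class != 0:
        raise ValueError("column count must be a multiple of num_class")
    n_models = n_cols // num_class
    # Reshape to (rows, models, classes) and average over the model axis
    return blend_x.reshape(n_rows, n_models, num_class).mean(axis=1)

# Two toy models, three classes each, two listings
toy_blend = np.array([[0.6, 0.3, 0.1, 0.4, 0.4, 0.2],
                      [0.2, 0.5, 0.3, 0.0, 0.5, 0.5]])
toy_first_model = toy_blend[:, 0:3]          # what the cell above uses
toy_avg = average_model_blocks(toy_blend, 3)  # average over both models
print(toy_avg)
```

Whether the averaged block beats the single block on the leaderboard is an empirical question; the sketch only shows the mechanics.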

### Level 1 LightGBM features -> Level 2 MLP

Here we train an MLP model on the features from the Level 1 LightGBM models. How does it score compared to the previous submissions?

In [None]:
param_grid = {
              "hidden_layer_sizes":[50,100,200,(200,100),(200,100,50)]
              }
model = search_model(train_blend_x_lgb_01
                                         , train_y
                                         , MLPClassifier()
                                         , param_grid
                                         , n_jobs=1
                                         , cv=4
                                         , refit=True)   

print ("best params:", model.best_params_)


preds = model.predict_proba(test_blend_x_lgb_01)
sub_df = pd.DataFrame(preds,columns = ["low", "medium", "high"])
sub_df["listing_id"] = test_data.listing_id.values
sub_df.to_csv("../output/sub_lgb_stacked_mlp.csv", index=False)


### Level 1 LightGBM features + original features -> Level 2 LightGBM

Here we train a LightGBM model on the features from the Level 1 LightGBM models along with the original features. How does it score compared to the previous submissions?

In [None]:
params = lgb_BO_scores.iloc[0].to_dict()
lgb_params = dict()
lgb_params['objective'] = 'multiclass'
lgb_params['num_class'] = 3
lgb_params['learning_rate'] = 0.01 # Smaller learning rate

lgb_params['max_bin'] = int(params['max_bin'])   
lgb_params['num_leaves'] = int(params['num_leaves'])    
lgb_params['min_sum_hessian_in_leaf'] = int(params['min_sum_hessian_in_leaf'])
lgb_params['min_gain_to_split'] = params['min_gain_to_split']     
lgb_params['feature_fraction'] = params['feature_fraction']
lgb_params['bagging_fraction'] = params['bagging_fraction']
lgb_params['bagging_freq'] = 1


cv_results = lgb.cv(lgb_params,
                lgb.Dataset(sparse.hstack([train_x,train_blend_x_lgb_01]).tocsr(),
                            train_y, 
                            max_bin=lgb_params['max_bin']
                           ),
                num_boost_round=1000000,
                nfold=5,
                early_stopping_rounds=200, # Bigger stopping rounds
                metrics='multi_logloss',
                shuffle=False,
                verbose_eval=100
               )

cv_results = pd.DataFrame(cv_results)
best_lgb_iteration = len(cv_results)
best_lgb_score = cv_results['multi_logloss-mean'].min()

print (best_lgb_iteration, best_lgb_score)

start = time.time()
clf = lgb.LGBMClassifier(learning_rate = 0.01
                        , n_estimators =best_lgb_iteration
                        , max_bin = lgb_params['max_bin']   
                        , num_leaves = lgb_params['num_leaves']
                        , min_child_weight = lgb_params['min_sum_hessian_in_leaf']
                        , min_split_gain = lgb_params['min_gain_to_split'] 
                        , colsample_bytree = lgb_params['feature_fraction']
                        , subsample = lgb_params['bagging_fraction']
                        , subsample_freq = lgb_params['bagging_freq']
                        , seed = 1234
                       )

print (clf)

clf.fit(sparse.hstack([train_x,train_blend_x_lgb_01]).tocsr(), train_y)

print ("Training finished in %d seconds." % (time.time()-start))

preds = clf.predict_proba(sparse.hstack([test_x, test_blend_x_lgb_01]).tocsr())
sub_df = pd.DataFrame(preds,columns = ["low", "medium", "high"])
sub_df["listing_id"] = test_data.listing_id.values
sub_df.to_csv("../output/sub_lgb_stacked_lgb.csv", index=False)


The result seems promising. But what if we added XGBoost models to the ensemble? Would it help?

### Level 1 models

#### XGBoost

* Train 5 XGBoost models based on top 5 sets of parameters tuned by Bayesian Optimization
* Save the outputs so they can be used later for the model ensemble.

In [None]:
learning_rate = 0.01
stopping_rounds = 100

xgb_param_list = []

for p in xgb_BO_scores[:5].iterrows():
    start = time.time()
    params = dict()
    params['objective'] = 'multi:softprob'
    params['num_boost_round'] = best_xgb_iteration
    params['num_class'] = 3
    params['eta'] = learning_rate
    params['max_depth'] = int(p[1].to_dict()['max_depth'])
    params['min_child_weight'] = int(p[1].to_dict()['min_child_weight'])
    params['subsample'] = p[1].to_dict()['subsample']
    params['colsample_bytree'] = p[1].to_dict()['colsample_bytree']
    params['gamma'] = p[1].to_dict()['gamma']
    params['seed']=1234
    xgb_param_list.append(params)

print(xgb_param_list)

In [None]:
train_blend_x_xgb_01, test_blend_x_xgb_01, blend_scores_xgb_01 = blend_xgb_model(xgb_param_list, 
                                                        train_x, 
                                                        train_y, 
                                                        test_x, 
                                                        num_class=3, 
                                                        blend_folds=5)  
                                                        
                                                        
np.savetxt('../input/train_blend_x_xgb_01.csv',train_blend_x_xgb_01, delimiter=',')
np.savetxt('../input/test_blend_x_xgb_01.csv',test_blend_x_xgb_01, delimiter=',')  

### Level 1 LightGBM features + level 1 XGBoost features -> Level 2 MLP

In [None]:
param_grid = {
              "hidden_layer_sizes":[50,100,200,(200,100),(200,100,50)]
              }
model = search_model(np.hstack((train_blend_x_lgb_01, train_blend_x_xgb_01))
                                         , train_y
                                         , MLPClassifier()
                                         , param_grid
                                         , n_jobs=-1
                                         , cv=4
                                         , refit=True)   

print ("best params:", model.best_params_)


preds_mlp = model.predict_proba(np.hstack((test_blend_x_lgb_01, test_blend_x_xgb_01)))
sub_df_mlp = pd.DataFrame(preds_mlp,columns = ["low", "medium", "high"])
sub_df_mlp["listing_id"] = test_data.listing_id.values
sub_df_mlp.to_csv("../output/sub_lgb_xgb_stacked_mlp.csv", index=False)

### Level 1 LightGBM features + level 1 XGBoost features + original features -> Level 2 LightGBM

In [None]:
params = lgb_BO_scores.iloc[0].to_dict()
lgb_params = dict()
lgb_params['objective'] = 'multiclass'
lgb_params['num_class'] = 3
lgb_params['learning_rate'] = 0.01 # Smaller learning rate

lgb_params['max_bin'] = int(params['max_bin'])   
lgb_params['num_leaves'] = int(params['num_leaves'])    
lgb_params['min_sum_hessian_in_leaf'] = int(params['min_sum_hessian_in_leaf'])
lgb_params['min_gain_to_split'] = params['min_gain_to_split']     
lgb_params['feature_fraction'] = params['feature_fraction']
lgb_params['bagging_fraction'] = params['bagging_fraction']
lgb_params['bagging_freq'] = 1


cv_results = lgb.cv(lgb_params,
                lgb.Dataset(sparse.hstack([train_x,train_blend_x_lgb_01, train_blend_x_xgb_01]).tocsr(),
                            train_y, 
                            max_bin=lgb_params['max_bin']
                           ),
                num_boost_round=1000000,
                nfold=5,
                early_stopping_rounds=200, # Bigger stopping rounds
                metrics='multi_logloss',
                shuffle=False,
                verbose_eval=100
               )

cv_results = pd.DataFrame(cv_results)
best_lgb_iteration = len(cv_results)
best_lgb_score = cv_results['multi_logloss-mean'].min()

print (best_lgb_iteration, best_lgb_score)


start = time.time()
clf = lgb.LGBMClassifier(learning_rate = 0.01
                        , n_estimators =best_lgb_iteration
                        , max_bin = lgb_params['max_bin']   
                        , num_leaves = lgb_params['num_leaves']
                        , min_child_weight = lgb_params['min_sum_hessian_in_leaf']
                        , min_split_gain = lgb_params['min_gain_to_split'] 
                        , colsample_bytree = lgb_params['feature_fraction']
                        , subsample = lgb_params['bagging_fraction']
                        , subsample_freq = lgb_params['bagging_freq']
                        , seed = 1234
                       )

print (clf)

clf.fit(sparse.hstack([train_x,train_blend_x_lgb_01, train_blend_x_xgb_01]).tocsr(), train_y)

print ("Training finished in %d seconds." % (time.time()-start))

preds_lgb = clf.predict_proba(sparse.hstack([test_x, test_blend_x_lgb_01, test_blend_x_xgb_01]).tocsr())
sub_df_lgb = pd.DataFrame(preds_lgb,columns = ["low", "medium", "high"])
sub_df_lgb["listing_id"] = test_data.listing_id.values
sub_df_lgb.to_csv("../output/sub_lgb_xgb_stacked_lgb.csv", index=False)

### Final submission

Let's create our final submission by simply averaging the two stacked submissions above (Level 2 MLP and Level 2 LightGBM).

In [None]:
preds = preds_mlp*0.5+preds_lgb*0.5

sub_df = pd.DataFrame(preds,columns = ["low", "medium", "high"])
sub_df["listing_id"] = test_data.listing_id.values
sub_df.to_csv("../output/sub_lgb_xgb_stacked_mlp_lgb.csv", index=False)
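
The equal-weight average above is a special case of a weighted blend. A minimal sketch, assuming each prediction matrix has one row per listing and one column per class; `blend_predictions` and the toy matrices are hypothetical and not part of the notebook. Normalizing the weights keeps every blended row summing to one.

```python
import numpy as np

def blend_predictions(pred_list, weights):
    """Weighted average of several probability matrices (hypothetical helper)."""
    weights = np.asarray(weights, dtype=float)
    if len(pred_list) != len(weights):
        raise ValueError("need exactly one weight per prediction matrix")
    weights = weights / weights.sum()  # normalize so rows still sum to one
    stacked = np.stack([np.asarray(p, dtype=float) for p in pred_list])
    # Contract the model axis: (k,) x (k, rows, classes) -> (rows, classes)
    return np.tensordot(weights, stacked, axes=1)

toy_preds_a = [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2]]
toy_preds_b = [[0.6, 0.3, 0.1], [0.4, 0.4, 0.2]]
toy_blended = blend_predictions([toy_preds_a, toy_preds_b], [0.5, 0.5])
print(toy_blended)
```

Unequal weights would normally be chosen on held-out data rather than on the public leaderboard, to avoid tuning to the test set.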

# Conclusion

This week we've learnt how to create a model ensemble, particularly with a model stacking scheme. Keep in mind that the key considerations for building a good ensemble solution are diversity and randomness, which can be introduced by:

* Deploying heterogeneous algorithms
* Using modified versions of the training data
* Randomizing learning algorithms with different parameters
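
One rough way to gauge how diverse two models are is the correlation between their predicted probabilities: values close to 1 suggest the second model adds little new information. This check is an assumption of ours rather than something the notebook performs, and `prediction_correlation` and the toy matrices are hypothetical names.

```python
import numpy as np

def prediction_correlation(preds_a, preds_b):
    """Pearson correlation between two flattened probability matrices (hypothetical helper)."""
    a = np.asarray(preds_a, dtype=float).ravel()
    b = np.asarray(preds_b, dtype=float).ravel()
    return float(np.corrcoef(a, b)[0, 1])

toy_model_a = [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2]]
toy_model_b = [[0.6, 0.3, 0.1], [0.4, 0.4, 0.2]]
toy_corr = prediction_correlation(toy_model_a, toy_model_b)
print("correlation between toy models: %0.3f" % toy_corr)
```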