# XGBoost Parameter Tuning for Rental Listing Inquiries

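The Rental Listing Inquiries dataset comes from a classification competition on Kaggle: given an apartment's features, predict how popular the listing is, where user interest falls into three classes (high, medium, low). The feature vector x has 14 dimensions, and the response y is the users' interest level for the apartment. The evaluation metric is logloss. Data link: https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries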
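For reference, multi-class logloss is the mean negative log-probability that the model assigns to the true class. The sketch below computes it by hand on three made-up predictions; the helper name `multiclass_log_loss` and the probabilities are illustrative assumptions, not taken from the competition data.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip the probability to avoid log(0).
        p = min(max(probs[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# Made-up predictions for three listings (illustrative only).
y_true = ["high", "low", "medium"]
y_prob = [
    {"high": 0.6, "medium": 0.3, "low": 0.1},
    {"high": 0.1, "medium": 0.2, "low": 0.7},
    {"high": 0.2, "medium": 0.5, "low": 0.3},
]
loss = multiclass_log_loss(y_true, y_prob)
print(round(loss, 4))
```

A lower value is better; a model that always predicts 1/3 for every class scores ln(3), roughly 1.0986.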

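# Step 3: tune the tree parameter min_child_weight
(Coarse search with a step of 2; the next step would be a fine search with a step of 1 around the best coarse value.)
The fine search is omitted here.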
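Since the fine search is omitted, here is a minimal sketch of how the step-1 grid could be derived from a coarse optimum. The helper `fine_grid` is hypothetical, and the coarse optimum of 5 is only an example input.

```python
def fine_grid(coarse_best, coarse_step, minimum=1):
    """Integer values lying strictly between the coarse neighbours of coarse_best."""
    low = max(minimum, coarse_best - coarse_step + 1)
    high = coarse_best + coarse_step - 1
    return list(range(low, high + 1))

# Example: a coarse search with step 2 whose best value was 5.
print(fine_grid(5, 2))
```

The returned list can be passed as the `min_child_weight` entry of a `param_grid` in the same way as the coarse grid.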

Tuning two parameters at once is too slow, so only one parameter is tuned at a time.
To speed things up, cv=3 is used.
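For a classifier with multi-class labels, scikit-learn turns an integer `cv` into a stratified split, so `cv=3` keeps the class proportions roughly equal across folds. The sketch below checks this with `check_cv` on made-up, imbalanced labels that stand in for `interest_level`.

```python
from sklearn.model_selection import StratifiedKFold, check_cv

# Made-up, imbalanced labels standing in for interest_level.
y = ["low"] * 12 + ["medium"] * 6 + ["high"] * 3

# Resolve cv=3 the same way GridSearchCV does for a classifier.
cv = check_cv(3, y, classifier=True)
print(type(cv).__name__, cv.get_n_splits())

# Count the classes in each validation fold.
X = [[0]] * len(y)
for train_idx, test_idx in cv.split(X, y):
    fold_labels = [y[i] for i in test_idx]
    print({c: fold_labels.count(c) for c in ("low", "medium", "high")})
```

Passing an explicit `StratifiedKFold(n_splits=3, shuffle=True, random_state=...)` instead of the integer would additionally shuffle the rows before splitting.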

First, import the required modules.

In [1]:
from xgboost import XGBClassifier
import xgboost as xgb

import pandas as pd 
import numpy as np

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import log_loss

from matplotlib import pyplot
import seaborn as sns
%matplotlib inline

## Load the data

In [2]:
# path to where the data lies
#dpath = './data/'
train = pd.read_csv("RentListingInquries_FE_train.csv")
#train.head()

In [3]:
y_train = train['interest_level']

train = train.drop([ "interest_level"], axis=1)
X_train = train

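## Best n_estimators from the first tuning round (232), max_depth=6
The remaining parameters are left at their default values for now.

When cross-validation is used to evaluate a model, the scoring parameter defines the evaluation metric. scikit-learn treats a higher score as better, so when a loss function serves as the metric it is negated, e.g. neg_log_loss and neg_mean_squared_error. See the scikit-learn documentation: http://scikit-learn.org/stable/modules/model_evaluation.html#log-loss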
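The sign convention can be checked directly: the `neg_log_loss` scorer returns exactly the negative of `log_loss`. This is a small sketch on made-up labels, using a `DummyClassifier` that always predicts the class priors; none of it comes from the rental data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import get_scorer, log_loss

# Made-up data: the features are irrelevant for a prior-only classifier.
X = np.zeros((6, 1))
y = np.array(["low", "low", "low", "medium", "medium", "high"])

clf = DummyClassifier(strategy="prior").fit(X, y)

# Plain log loss (lower is better) ...
loss = log_loss(y, clf.predict_proba(X), labels=clf.classes_)
# ... versus the scorer used by GridSearchCV (higher is better).
score = get_scorer("neg_log_loss")(clf, X, y)

print(loss, score)
```

Because the score is negated, the largest (least negative) value reported by the grid search corresponds to the smallest logloss.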

In [4]:
#max_depth: suggested range 3-10; min_child_weight = 1/sqrt(ratio_rare_event) = 5.5
#max_depth = range(4,10,2)
min_child_weight = range(1,6,2)
#param_test2_1 = dict(max_depth=max_depth, min_child_weight=min_child_weight)
param_test3 = dict(min_child_weight=min_child_weight)
param_test3

{'min_child_weight': [1, 3, 5]}

In [5]:
xgb3 = XGBClassifier(
        learning_rate =0.1,
        n_estimators=232,  # best n_estimators from the first tuning round
        max_depth=6,
        min_child_weight=1,
        gamma=0,
        subsample=0.5,
        colsample_bytree=0.8,
        colsample_bylevel = 0.7,
        objective= 'multi:softprob',
        seed=3)


gsearch3= GridSearchCV(xgb3, param_grid = param_test3, scoring='neg_log_loss',n_jobs=-1, cv=3)
gsearch3.fit(X_train , y_train)

# Note: grid_scores_ was removed in scikit-learn 0.20; newer versions expose cv_results_ instead
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_



([mean: -0.57868, std: 0.00427, params: {'min_child_weight': 1},
  mean: -0.57752, std: 0.00486, params: {'min_child_weight': 3},
  mean: -0.57747, std: 0.00478, params: {'min_child_weight': 5}],
 {'min_child_weight': 5},
 -0.5774709604798296)
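On scikit-learn 0.20 and later, the same mean/std/params summary can be read from `cv_results_`. The sketch below is an assumption-laden stand-in: it uses synthetic data from `make_classification` and a `DecisionTreeClassifier` tuned over `min_samples_leaf` so that it runs quickly without the competition files; with the notebook's objects the same loop would apply to `gsearch3.cv_results_`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the rental features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(max_depth=4, random_state=0),
    param_grid={"min_samples_leaf": [1, 3, 5]},
    scoring="neg_log_loss",
    cv=3,
)
search.fit(X, y)

# cv_results_ is a dict of arrays, one entry per parameter setting.
results = search.cv_results_
for mean, std, params in zip(results["mean_test_score"],
                             results["std_test_score"],
                             results["params"]):
    print(f"mean: {mean:.5f}, std: {std:.5f}, params: {params}")
print(search.best_params_, search.best_score_)
```

Wrapping `cv_results_` in `pd.DataFrame(...)` gives a table that is easier to sort and plot.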

The best result is at min_child_weight=5, the upper edge of the grid, so larger values of min_child_weight are tested next.
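The "best value sits on the edge, so extend the grid" rule can be written down as a small helper. `next_grid` is a hypothetical function for illustration; it reuses the step of the existing grid and proposes a fixed number of further values.

```python
def next_grid(grid, best, n_new=2):
    """Return further values when `best` sits on the upper edge of `grid`."""
    values = sorted(grid)
    if best != values[-1]:
        return []  # optimum is interior: no need to extend upward
    step = values[-1] - values[-2]
    return [values[-1] + step * k for k in range(1, n_new + 1)]

# First grid was range(1, 6, 2) with its best score at 5.
print(next_grid(range(1, 6, 2), best=5))
```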

In [4]:
min_child_weight = range(7,10,2)
#param_test2_1 = dict(max_depth=max_depth, min_child_weight=min_child_weight)
param_test3_2 = dict(min_child_weight=min_child_weight)
param_test3_2

{'min_child_weight': [7, 9]}

In [5]:
xgb3_2 = XGBClassifier(
        learning_rate =0.1,
        n_estimators=232,  # best n_estimators from the first tuning round
        max_depth=6,
        min_child_weight=1,
        gamma=0,
        subsample=0.5,
        colsample_bytree=0.8,
        colsample_bylevel = 0.7,
        objective= 'multi:softprob',
        seed=3)


gsearch3_2= GridSearchCV(xgb3_2, param_grid = param_test3_2, scoring='neg_log_loss',n_jobs=-1, cv=3)
gsearch3_2.fit(X_train , y_train)

# Note: grid_scores_ was removed in scikit-learn 0.20; newer versions expose cv_results_ instead
gsearch3_2.grid_scores_, gsearch3_2.best_params_, gsearch3_2.best_score_



([mean: -0.57773, std: 0.00461, params: {'min_child_weight': 7},
  mean: -0.57815, std: 0.00403, params: {'min_child_weight': 9}],
 {'min_child_weight': 7},
 -0.5777316363039906)

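Within this second grid the best value is min_child_weight=7 (mean score -0.57773). That is slightly worse than the -0.57747 obtained at min_child_weight=5 in the first search, so across both coarse searches the best value is min_child_weight=5. The gaps between 3, 5 and 7 are much smaller than the fold-to-fold standard deviation (about 0.005).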
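To compare the two searches side by side, the mean scores printed in the outputs above can be merged and ranked. The numbers are copied from those outputs; the variable names are illustrative.

```python
# Mean neg_log_loss per min_child_weight, copied from the two grid searches.
first = {1: -0.57868, 3: -0.57752, 5: -0.57747}
second = {7: -0.57773, 9: -0.57815}

combined = {**first, **second}
# Higher (less negative) is better for neg_log_loss.
best = max(combined, key=combined.get)

for value in sorted(combined):
    marker = " <- best" if value == best else ""
    print(f"min_child_weight={value}: {combined[value]:.5f}{marker}")
```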