In [1]:
import csv
import numpy as np  # needed later for np.sqrt
import pandas as pd

## Random Forest + GridSearch

In [5]:
train = pd.read_csv("preprocess/train.csv")
test = pd.read_csv("preprocess/test.csv")

- Feature Selection

&emsp;&emsp;Thousands of features were created in the earlier steps. Feeding all of them into the model would greatly lengthen training time, and a large number of irrelevant features improves the results only marginally. We therefore use the Pearson correlation coefficient to select the 300 features most strongly correlated with the label for modeling.

In [7]:
# Extract feature names
features = train.columns.tolist()
features.remove("card_id")
features.remove("target")
featureSelect = features[:]

# Calculate the correlation coefficient
corr = []
for fea in featureSelect:
    corr.append(abs(train[[fea, 'target']].fillna(0).corr().values[0][1]))

# Take the top 300 features for modeling; the exact number is adjustable
se = pd.Series(corr, index=featureSelect).sort_values(ascending=False)
feature_select = ['card_id'] + se[:300].index.tolist()

# output result
train = train[feature_select + ['target']]
test = test[feature_select]

>Selecting features with the Pearson correlation coefficient is reasonable here mainly because all features were treated as continuous variables by default when they were created.
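The per-column loop above can also be expressed with `DataFrame.corrwith`, which computes the correlation of every column with the target in one call. The following is only a minimal sketch on synthetic data: the frame `toy`, the columns `f1` to `f3`, and `top_k` are hypothetical names introduced for illustration and are not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real training set.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
    "f3": rng.normal(size=n),
})
# The target depends strongly on f2 and only weakly on f1.
toy["target"] = 3.0 * toy["f2"] + 0.5 * toy["f1"] + rng.normal(scale=0.1, size=n)

feature_cols = [c for c in toy.columns if c != "target"]

# Absolute Pearson correlation of each feature with the target.
abs_corr = toy[feature_cols].fillna(0).corrwith(toy["target"]).abs()

top_k = 2  # illustrative value; the notebook keeps 300
selected = abs_corr.sort_values(ascending=False).head(top_k).index.tolist()
print(selected)
```

Ranking by absolute value keeps strongly negative correlations as well, which matches the `abs(...)` call in the loop.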

- Parameter tuning with grid search

In [30]:
from sklearn.metrics import mean_squared_error #Mean squared error calculation function
from sklearn.ensemble import RandomForestRegressor #random forest estimator
from sklearn.model_selection import GridSearchCV #grid search estimator

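Create a parameter space. The basic parameters of the random forest are as follows:

|Name|Description|
|:--:|:--:|
|criterion|Split evaluation metric or loss function; defaults to the Gini index, with information entropy as an option|
|splitter|How the tree grows; by default it takes the split that reduces the loss fastest, with random splitting on some condition as an option|
|max_depth|Maximum depth of the tree; similar to max_iter, i.e. how many iterations in total|
|min_samples_split|Minimum number of samples required to split an internal node|
|min_samples_leaf|Minimum number of samples a leaf node must contain|
|min_weight_fraction_leaf|Minimum total weight required at a leaf node|
|max_features|Maximum number of features considered when choosing a split rule|
|random_state|Random seed|
|max_leaf_nodes|Maximum number of leaf nodes|
|min_impurity_decrease|Minimum decrease in loss required to split the data further|
|min_impurity_split|Minimum impurity required to split the data further; to be removed in version 0.25|
|class_weight|Weight of each class|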

Select "n_estimators", "min_samples_leaf", "min_samples_split", "max_depth" and "max_features" for parameter search:

> RandomizedSearchCV, HalvingGridSearchCV and HalvingRandomSearchCV can further reduce the computing resources a grid search requires and speed up the search.

&emsp;&emsp;First use RandomizedSearchCV to determine the approximate range, then use GridSearchCV to search that range precisely for specific parameter values.
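The coarse-then-fine idea could be sketched as follows. This is an illustration under stated assumptions, not the search that was actually run for this notebook: the toy data `X`, `y`, the ranges in `coarse_space`, the `n_iter` value, and the helper `around` are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Hypothetical toy regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Stage 1: sample a few combinations from wide ranges.
coarse_space = {
    "n_estimators": [10, 20, 30, 40, 50],
    "min_samples_leaf": [1, 6, 11, 16],
    "max_depth": [2, 4, 6, 8, 10],
}
coarse = RandomizedSearchCV(
    RandomForestRegressor(random_state=22),
    coarse_space,
    n_iter=5,
    cv=2,
    scoring="neg_mean_squared_error",
    random_state=22,
)
coarse.fit(X, y)
best = coarse.best_params_


def around(value, lower=1):
    """Return the neighbours of `value` (hypothetical helper), clipped at `lower`."""
    return sorted({max(lower, value - 1), value, value + 1})


# Stage 2: exhaustive grid in a narrow band around the coarse optimum.
fine_space = {
    "n_estimators": [best["n_estimators"]],
    "min_samples_leaf": around(best["min_samples_leaf"]),
    "max_depth": around(best["max_depth"]),
}
fine = GridSearchCV(
    RandomForestRegressor(random_state=22),
    fine_space,
    cv=2,
    scoring="neg_mean_squared_error",
)
fine.fit(X, y)
print(coarse.best_params_, fine.best_params_)
```

Because the fine grid contains the coarse optimum and both searches use the same folds and seed, the second stage should not score worse than the first.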

In [43]:
features = train.columns.tolist()
features.remove("card_id")
features.remove("target")


parameter_space = {
    "n_estimators": [79, 80, 81], 
    "min_samples_leaf": [29, 30, 31],
    "min_samples_split": [2, 3],
    "max_depth": [9, 10],
    "max_features": ["auto", 80]  # "auto" was removed in scikit-learn 1.3; use 1.0 (all features) there
}
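Before launching the search it helps to estimate its cost. The sketch below counts the candidate combinations in the parameter space above; the value `cv = 2` is taken from the `GridSearchCV` call used later in this notebook, and the variable names `n_candidates` and `n_fits` are introduced here only for illustration.

```python
from math import prod

parameter_space = {
    "n_estimators": [79, 80, 81],
    "min_samples_leaf": [29, 30, 31],
    "min_samples_split": [2, 3],
    "max_depth": [9, 10],
    "max_features": ["auto", 80],
}

# Every combination of values is one candidate model.
n_candidates = prod(len(values) for values in parameter_space.values())

cv = 2  # number of folds, as in the GridSearchCV call below
n_fits = n_candidates * cv  # fits during the search, excluding the final refit

print(n_candidates, n_fits)  # 72 144
```

Adding one more value to any list multiplies the total, which is why a coarse search is worth running first.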

Then build the random forest estimator and set the remaining hyperparameters:

In [44]:
clf = RandomForestRegressor(
    criterion="mse",  # renamed to "squared_error" in scikit-learn 1.0; "mse" was removed in 1.2
    n_jobs=15,
    random_state=22)

Run the grid search:

In [45]:
grid = GridSearchCV(clf, parameter_space, cv=2, scoring="neg_mean_squared_error")
grid.fit(train[features].values, train['target'].values)

GridSearchCV(cv=2, estimator=RandomForestRegressor(n_jobs=15, random_state=22),
             param_grid={'max_depth': [9, 10], 'max_features': ['auto', 80],
                         'min_samples_leaf': [29, 30, 31],
                         'min_samples_split': [2, 3],
                         'n_estimators': [79, 80, 81]},
             scoring='neg_mean_squared_error')

&emsp;&emsp;The optimal parameter combination found in the parameter space:

In [46]:
grid.best_params_

{'max_depth': 10,
 'max_features': 80,
 'min_samples_leaf': 31,
 'min_samples_split': 2,
 'n_estimators': 80}

View or directly use the estimator built with the optimal parameters:

In [47]:
grid.best_estimator_

RandomForestRegressor(max_depth=10, max_features=80, min_samples_leaf=31,
                      n_estimators=80, n_jobs=15, random_state=22)

View the final cross-validated score on the training set, expressed as RMSE (the square root of the negated mean validation score):

In [48]:
np.sqrt(-grid.best_score_)

3.6900889856014247
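The sign flip in `np.sqrt(-grid.best_score_)` is needed because the `"neg_mean_squared_error"` scorer reports the negated MSE, so that a larger score is always better. A small worked sketch with made-up numbers (the arrays `y_true` and `y_pred` are hypothetical):

```python
import numpy as np

# Hypothetical true values and predictions; every error is +/- 0.5.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.5])

mse = np.mean((y_true - y_pred) ** 2)  # mean of four 0.25 terms
neg_mse = -mse                         # the form the scorer reports
rmse = np.sqrt(-neg_mse)               # undo the sign, then take the root

print(mse, neg_mse, rmse)  # 0.25 -0.25 0.5
```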

In [49]:
grid.best_estimator_.predict(test[features].values)

array([-3.42895506, -1.05271922, -0.34647055, ...,  0.71331227,
       -2.40402906,  0.29249733])

Write the CSV file in the required submission format:

In [50]:
test['target'] = grid.best_estimator_.predict(test[features].values)
test[['card_id', 'target']].to_csv("result/submission_randomforest.csv", index=False)