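# Lessons learned

1. Choosing how to handle features depending on the number of categories:
   1. Normalize continuous values.
   2. Encode low-cardinality features with label encoding.
      1. The dividing line is 10 categories.
   3. Encode high-cardinality features with target encoding.
2. Grid search is not recommended for finding the best k-means settings because it tends to raise errors; search manually instead.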
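
Item 2 can be sketched as a plain loop over candidate cluster counts. The synthetic data from `make_blobs`, the candidate range for `k`, and the use of the silhouette score as the selection criterion are all assumptions made for illustration; the note above does not say which criterion was used.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated clusters (an assumption for the demo).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=42)

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    cluster_ids = km.fit_predict(X)
    # Higher silhouette means tighter, better-separated clusters.
    results[k] = silhouette_score(X, cluster_ids)
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={results[k]:.3f}")

best_k = max(results, key=results.get)
print("best k by silhouette:", best_k)
```

Printing both inertia and silhouette per `k` keeps the manual search inspectable, which is the main advantage over wrapping the estimator in a search object.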

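# Problem

## Data description

Bob's mobile phone startup needs a data-driven pricing strategy to compete with giants such as Apple and Samsung. He has collected hardware specifications for more than 2,000 phones and asks you to build a model that predicts the price band. Following industry convention, prices fall into 4 bands:
0 = low-end (< $200) | 1 = mid-range ($200-$400) | 2 = high-end ($400-$600) | 3 = flagship (> $600)

Field descriptions (21 fields)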
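| Field | Type | Description | Example |
|---|---|---|---|
| id | Numeric | Unique phone ID (ignore when modeling) | 3 |
| battery_power | Numeric | Battery capacity (mAh) | 1024 |
| blue | Categorical | Bluetooth support: 0 = no, 1 = yes | 1 |
| clock_speed | Numeric | Processor clock speed (GHz) | 2.5 |
| dual_sim | Categorical | Dual-SIM support: 0 = no, 1 = yes | 0 |
| fc | Numeric | Front camera resolution (MP) | 8 |
| four_g | Categorical | 4G support: 0 = no, 1 = yes | 1 |
| int_memory | Numeric | Internal storage (GB) | 64 |
| m_dep | Numeric | Phone thickness (cm) | 0.8 |
| mobile_wt | Numeric | Phone weight (g) | 188 |
| n_cores | Numeric | Number of processor cores | 8 |
| pc | Numeric | Main camera resolution (MP) | 13 |
| px_height | Numeric | Vertical screen resolution (pixels) | 905 |
| px_width | Numeric | Horizontal screen resolution (pixels) | 1988 |
| ram | Numeric | RAM (MB) | 2632 |
| sc_h | Numeric | Screen height (cm) | 15 |
| sc_w | Numeric | Screen width (cm) | 7 |
| talk_time | Numeric | Talk time (hours) | 12 |
| three_g | Categorical | 3G support: 0 = no, 1 = yes | 1 |
| touch_screen | Categorical | Touch screen: 0 = no, 1 = yes | 1 |
| wifi | Categorical | WiFi support: 0 = no, 1 = yes | 1 |
| price_range | Categorical | Target variable: price band | 2 |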

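Note: the test set does not contain the price_range field.

## Task description
Complete the following workflow:

1. Data processing
   - Handle missing values: px_height and px_width have a few missing values; fill them with the mean.
   - Handle outliers: records in which px_height and px_width are 0 are treated as outliers and deleted.
   - Standardize numeric features: standardize all numeric features with StandardScaler, including but not limited to battery_power, ram, px_height, and px_width.
   - All preprocessing steps (missing-value imputation, outlier handling, standardization) must be applied to both the training set and the test set.
2. Model building
   - Choose a suitable classification model.
   - Use 5-fold cross-validation on the training set to compute the macro-averaged F1-score as the self-evaluation metric.
3. Prediction and saving
   - Predict the price band for the test set test.csv.
   - Generate the submission file predictions.csv in this format:
        ```csv
        id,price_range
        1000,1
        1001,3
        ...
        ```
   
## Deliverables

1. Model file
   - model.pkl
   - Save it with joblib:
   - joblib.dump(model, 'model.pkl')
2. Prediction results file
   - predictions.csv (two columns: id and the predicted price_range)
   - Package the two files above into a single archive named submit.zip and submit it.
3. Enter the self-evaluated macro-averaged F1-score in the answer area.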
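
The packaging step can be done with the standard library. The helper below is a hypothetical sketch: the name `make_submission` and the choice to place both files at the root of the archive are assumptions, since the task only names the archive.

```python
import zipfile
from pathlib import Path

def make_submission(model_path, predictions_path, zip_path="submit.zip"):
    """Bundle the model file and the predictions file into one archive."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in (model_path, predictions_path):
            # arcname drops the directory part so both files sit at the archive root.
            zf.write(path, arcname=Path(path).name)
    return zip_path

# Example call (paths are placeholders):
# make_submission("./Data/model.pkl", "./Data/predictions.csv")
```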

# Import libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.model_selection import train_test_split, KFold, cross_val_score, RandomizedSearchCV, GridSearchCV, StratifiedKFold
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score, roc_curve, make_scorer, confusion_matrix, classification_report, RocCurveDisplay, auc
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.svm import SVC, SVR

# from xgboost import XGBClassifier
import joblib

import matplotlib.pyplot as plt
# import seaborn as sns

# from ModelInformation import KFoldTargetEncoder, numeric_features, low_card_features, categorical_features

import csv
import os
import time

# Determine the model type

1. This is a classification task.

# Load the data

In [2]:
train = pd.read_csv('./Data/2/train.csv')
test = pd.read_csv('./Data/2/test.csv')

# Inspect basic information about the data

In [3]:
train.columns

Index(['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g',
       'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height',
       'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi', 'price_range'],
      dtype='object')

In [4]:
def StatisticFeature(df):
    # print(PeopleInfo_selected['Sex'].value_counts())  # counts the samples in each category
    print('number of blue sort {}'.format(df['blue'].nunique()))
    print('number of dual_sim sort {}'.format(df['dual_sim'].nunique()))
    print('number of four_g sort {}'.format(df['four_g'].nunique()))
    print('number of three_g sort {}'.format(df['three_g'].nunique()))
    print('number of touch_screen sort {}'.format(df['touch_screen'].nunique()))
    print('number of wifi sort {}'.format(df['wifi'].nunique()))
StatisticFeature(train)

number of blue sort 2
number of dual_sim sort 2
number of four_g sort 2
number of three_g sort 2
number of touch_screen sort 2
number of wifi sort 2


In [5]:
numeric_features = ['battery_power', 'clock_speed', 'fc', 'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height', 'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time']
low_cardinality_features = ['blue', 'dual_sim', 'four_g', 'three_g', 'touch_screen', 'wifi']

# print(len(numeric_features))
# print(len(numeric_features)+ len(low_cardinality_features))

In [6]:
# Shape of the data.
print(train.shape, test.shape)

trainset = set(train.columns)
testset = set(test.columns)
# Columns that differ between the two sets.
print(testset.symmetric_difference(trainset))

(2000, 21) (1000, 21)
{'price_range', 'id'}



# Data preprocessing

## Basic data processing

In [9]:
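for col in ['px_height', 'px_width']:
    # Treat 0 as missing, then fill with the column mean.
    # Note: the task statement asks to delete zero records; here they are imputed instead.
    # Assigning the result back avoids the chained inplace fillna that pandas warns about.
    train[col] = train[col].replace(0, np.nan)
    train[col] = train[col].fillna(train[col].mean())

    test[col] = test[col].replace(0, np.nan)
    test[col] = test[col].fillna(test[col].mean())

# Drop the column that carries no predictive information.
test.drop(columns=['id'], inplace=True)

# Split the label column from the feature columns.
labels = train['price_range'].astype(int)
samples = train.drop(columns=['price_range'])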



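The task statement asks for mean imputation of missing values and deletion of zero records, whereas the cell above turns zeros into missing values and imputes them. A minimal sketch of a more literal reading follows, under three assumptions the task does not state: a row counts as an outlier when either column is 0, the fill values are learned on the training set and reused for the test set, and test rows are never dropped because every test `id` needs a prediction. The tiny frames are made-up stand-ins for `train` and `test`.

```python
import numpy as np
import pandas as pd

cols = ["px_height", "px_width"]

# Made-up miniature stand-ins for the real train/test frames.
train_demo = pd.DataFrame({"px_height": [900.0, np.nan, 0.0, 500.0],
                           "px_width": [1200.0, 800.0, 1000.0, np.nan],
                           "price_range": [3, 1, 2, 0]})
test_demo = pd.DataFrame({"id": [1, 2],
                          "px_height": [np.nan, 0.0],
                          "px_width": [640.0, 720.0]})

# 1) Learn fill values from the non-zero, non-missing training entries only.
fill_values = train_demo[cols].replace(0, np.nan).mean()

# 2) Training set: drop rows containing a zero, then fill the remaining gaps.
zero_rows = (train_demo[cols] == 0).any(axis=1)
train_clean = train_demo.loc[~zero_rows].copy()
train_clean[cols] = train_clean[cols].fillna(fill_values)

# 3) Test set: keep every row; treat zeros as missing and reuse the training means.
test_clean = test_demo.copy()
test_clean[cols] = test_clean[cols].replace(0, np.nan).fillna(fill_values)

print(train_clean)
print(test_clean)
```

Reusing the training means on the test set keeps the two sets on the same footing and avoids computing statistics from data the model is later evaluated on.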
## Advanced data processing

In [30]:
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), low_cardinality_features),
#     # ('target_encoder', ce.TargetEncoder(cols=categorical_features), categorical_features),
    # ('target_encoder', KFoldTargetEncoder(cols=categorical_features), categorical_features),
    # ('passthrough', '')
])

pl = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='saga'))
               ])

X_train, X_Validation, y_train, y_Validation = train_test_split(samples, labels, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Note: this task is scored on macro F1.
# price_range has 4 classes, so make_scorer(f1_score) and make_scorer(roc_auc_score)
# with their binary defaults fail and yield nan; use the multiclass scorers instead.
scoring = {'accuracy': 'accuracy',
           'f1': 'f1_macro',
           'roc_auc': 'roc_auc_ovr'}
# scoring = {'f1': 'f1_macro'}


# Stage 1: coarse random search.
parameters_dist = {
    'classifier__penalty': ['l1', 'l2', 'elasticnet'],
    'classifier__C': np.logspace(-4, 4, 20),
    'classifier__l1_ratio': np.linspace(0, 1, 10),  # only used with elasticnet
}

random_search = RandomizedSearchCV(
    pl,
    param_distributions=parameters_dist,
    n_iter=20,
    cv=cv,
    scoring=scoring,
    refit='f1',  # select the final model by macro F1
    verbose=1,
    random_state=42,
    n_jobs=-1
)

random_search.fit(X_train, y_train)

print("Stage 1 best parameters:", random_search.best_params_)

# Stage 2: grid search (fine-tuning)
best_params = random_search.best_params_

penalty = best_params['classifier__penalty']
C = best_params['classifier__C']

# Build the fine-tuning grid around the stage 1 result
if penalty == 'elasticnet':
    param_grid = {
        'classifier__penalty': [penalty],
        'classifier__C': [C * 0.5, C, C * 2],
        'classifier__l1_ratio': np.linspace(0.2, 0.8, 5),
    }
else:
    param_grid = {
        'classifier__penalty': [penalty],
        'classifier__C': [C * 0.5, C, C * 2],
    }

grid_search = GridSearchCV(
    pl,
    param_grid=param_grid,
    cv=cv,
    scoring=scoring,
    refit='f1',  # select the final model by macro F1
    verbose=1,
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
print("Stage 2 best parameters:", grid_search.best_params_)

best_model = grid_search.best_estimator_

Fitting 5 folds for each of 20 candidates, totalling 100 fits
Stage 1 best parameters: {'classifier__penalty': 'l1', 'classifier__l1_ratio': np.float64(0.3333333333333333), 'classifier__C': np.float64(11.288378916846883)}
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Stage 2 best parameters: {'classifier__C': np.float64(5.644189458423441), 'classifier__penalty': 'l1'}




In [31]:
# cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1_scores = cross_val_score(best_model, samples, labels, cv=cv, scoring='f1_macro')

print(f"5-fold CV Macro F1-score: {f1_scores.mean():.4f}")



5-fold CV Macro F1-score: 0.9699
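
For reference, macro-averaged F1 is the unweighted mean of the per-class F1 scores, so each of the four price bands counts equally regardless of its size. The sketch below computes it with the standard library only; the label lists are invented for illustration and are not taken from this dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_per_class = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # The denominator is never 0 because every class appears in y_true or y_pred.
        f1_per_class.append(2 * tp / (2 * tp + fp + fn))
    return sum(f1_per_class) / len(classes)

y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2, 3, 1]
print(round(macro_f1(y_true, y_pred), 4))
```

Here the per-class F1 values are 1.0, 0.5, 0.8 and 2/3, so one weak class pulls the macro average down noticeably even though most samples are classified correctly.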




In [24]:
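# 5. Evaluation on the training set (the test set has no labels).
# Note: this cell's execution count (24) predates the search cell (30),
# so the report below comes from an earlier model.
y_pred = best_model.predict(train)
# y_pred
print("\nTraining-set evaluation:")
print(classification_report(y_true=labels, y_pred=y_pred))


Training-set evaluation: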
              precision    recall  f1-score   support

           0       0.60      0.99      0.75       500
           1       0.53      0.36      0.43       500
           2       0.00      0.00      0.00       500
           3       0.59      0.99      0.74       500

    accuracy                           0.58      2000
   macro avg       0.43      0.58      0.48      2000
weighted avg       0.43      0.58      0.48      2000





In [None]:


y_pred = best_model.predict(train)
y_proba = best_model.predict_proba(train)[:, 1]
print('accuracy: {}'.format(accuracy_score(labels, y_pred)))

scores = cross_val_score(best_model, samples, labels, cv=5)
print('mean accuracy :{}'.format(scores.mean()))

accuracy: 0.5845
mean accuracy :0.531


In [35]:
joblib.dump(best_model, './Data/model.pkl')
# joblib.dump(best_model.named_steps['preprocessor'].get_feature_names_out(), './Data/feature_names.pkl')

['./Data/model.pkl']

In [34]:
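# `id` was dropped from `test` earlier, so re-read it to write the required id,price_range file.
test_ids = pd.read_csv('./Data/2/test.csv')['id']
predictions = pd.DataFrame({'id': test_ids, 'price_range': best_model.predict(test)})
predictions.to_csv('./Data/predictions.csv', index=False)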
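
Before packaging, it can help to confirm that the file matches the required `id,price_range` layout. The checker below is a hypothetical helper and not part of the task's tooling; it assumes the header must be exactly `id,price_range`, that ids are non-negative integers, and that valid classes are 0 to 3.

```python
import csv

def check_predictions(path):
    """Return the number of data rows, raising ValueError on a malformed file."""
    valid_classes = {"0", "1", "2", "3"}
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header != ["id", "price_range"]:
            raise ValueError(f"unexpected header: {header}")
        n_rows = 0
        for row in reader:
            if len(row) != 2 or not row[0].isdigit() or row[1] not in valid_classes:
                raise ValueError(f"bad row at line {reader.line_num}: {row}")
            n_rows += 1
    return n_rows

# Example call (path is a placeholder):
# print(check_predictions("./Data/predictions.csv"))
```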

# Template for comparing multiple models

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

# 1. Load example data
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42, test_size=0.2)

# 2. Define the model dictionary
models = {
    "LogisticRegression": LogisticRegression(solver='saga', max_iter=10000),
    "SVC": SVC(probability=True),
    "RandomForest": RandomForestClassifier(),
    "DecisionTree": DecisionTreeClassifier()
}

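# 3. Define the search space for each model
param_distributions = {
    "LogisticRegression": {
        'clf__penalty': ['l1', 'l2', 'elasticnet', None],  # scikit-learn 1.2+ uses None; the string 'none' is no longer accepted in newer releases
        'clf__C': np.logspace(-4, 4, 10),
        'clf__l1_ratio': np.linspace(0, 1, 5)  # only used with elasticnet
    },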
    "SVC": {
        'clf__C': np.logspace(-3, 2, 6),
        'clf__gamma': ['scale', 'auto'],
        'clf__kernel': ['linear', 'rbf']
    },
    "RandomForest": {
        'clf__n_estimators': [50, 100, 200],
        'clf__max_depth': [None, 5, 10, 20],
        'clf__min_samples_split': [2, 5, 10]
    },
    "DecisionTree": {
        'clf__max_depth': [None, 5, 10, 20],
        'clf__min_samples_split': [2, 5, 10],
        'clf__criterion': ['gini', 'entropy']
    }
}

# 4. Automated tuning and evaluation
best_models = {}
for name, model in models.items():
    print(f"\n🔍 Searching model: {name}")
    
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('clf', model)
    ])
    
    search = RandomizedSearchCV(
        pipeline,
        param_distributions=param_distributions[name],
        n_iter=20,
        scoring='accuracy',
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
        verbose=0,
        n_jobs=-1,
        random_state=42
    )
    search.fit(X_train, y_train)
    best_models[name] = search.best_estimator_
    
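    print(f"✅ Best parameters: {search.best_params_}")
    print(f"📊 Cross-validated accuracy: {search.best_score_:.4f}")

    # Evaluate on the test set
    y_pred = search.predict(X_test)
    print("📋 Test-set results:")
    print(classification_report(y_test, y_pred))

# Optional: compare the final test-set accuracy of the different models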


In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import joblib
import warnings

warnings.filterwarnings("ignore")

# 1. Load data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# 2. Define the models and their tuning spaces
models = {
    "LogisticRegression": LogisticRegression(solver='saga', max_iter=10000),
    "RandomForest": RandomForestClassifier(),
    "SVC": SVC(probability=True)
}

param_distributions = {
    "LogisticRegression": {
        'clf__penalty': ['l1', 'l2', 'elasticnet'],
        'clf__C': np.logspace(-4, 4, 10),
        'clf__l1_ratio': np.linspace(0, 1, 5)
    },
    "RandomForest": {
        'clf__n_estimators': [50, 100, 200],
        'clf__max_depth': [3, 5, 10, None],
        'clf__max_features': ['sqrt', 'log2']
    },
    "SVC": {
        'clf__C': np.logspace(-2, 2, 5),
        'clf__gamma': ['scale', 'auto'],
        'clf__kernel': ['linear', 'rbf']
    }
}

param_grids = {
    "LogisticRegression": {
        'clf__penalty': ['l1', 'elasticnet'],
        'clf__C': [0.1, 1, 10],
        'clf__l1_ratio': [0.5, 0.7]
    },
    "RandomForest": {
        'clf__n_estimators': [100, 200],
        'clf__max_depth': [5, 10],
    },
    "SVC": {
        'clf__C': [1, 10],
        'clf__kernel': ['rbf'],
    }
}

# 3. Search for and keep the best model
best_score = 0
best_model = None
best_model_name = ""
best_pipeline = None

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    print(f"🔍 Searching model: {name}")
    
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('clf', model)
    ])
    
    # 1. Random search
    rand_search = RandomizedSearchCV(pipe, param_distributions[name], n_iter=10,
                                     scoring='accuracy', cv=cv, random_state=42, n_jobs=-1)
    rand_search.fit(X_train, y_train)
    
    # 2. Grid search over the fixed grid in param_grids
    #    (best_params is kept for reference; the grid is not derived from it)
    best_params = rand_search.best_params_
    grid_params = param_grids.get(name, {})
    grid_search = GridSearchCV(pipe, grid_params, scoring='accuracy', cv=cv, n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    acc = accuracy_score(y_test, grid_search.predict(X_test))
    print(f"✅ Test accuracy of model {name}: {acc:.4f}")
    
    if acc > best_score:
        best_score = acc
        best_model = grid_search.best_estimator_
        best_model_name = name
        best_pipeline = grid_search

# 4. Save the best model
print(f"\n🏆 Best model: {best_model_name}, accuracy: {best_score:.4f}")
joblib.dump(best_model, f"{best_model_name}_best_model.pkl")
print(f"💾 Model saved as: {best_model_name}_best_model.pkl")

# 5. Print feature importances
print("\n📊 Feature importances:")
feature_names = X.columns
try:
    if hasattr(best_model.named_steps['clf'], 'coef_'):
        coefs = best_model.named_steps['clf'].coef_
        if coefs.ndim == 2:
            coefs = coefs[0]
        for name, coef in sorted(zip(feature_names, coefs), key=lambda x: abs(x[1]), reverse=True):
            print(f"{name}: {coef:.4f}")
    elif hasattr(best_model.named_steps['clf'], 'feature_importances_'):
        importances = best_model.named_steps['clf'].feature_importances_
        for name, importance in sorted(zip(feature_names, importances), key=lambda x: x[1], reverse=True):
            print(f"{name}: {importance:.4f}")
    else:
        print("❌ This model does not expose feature importances.")
except Exception as e:
    print(f"❌ Failed to extract feature importances: {e}")


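🔍 Searching model: LogisticRegression
✅ Test accuracy of model LogisticRegression: 0.9649
🔍 Searching model: RandomForest
✅ Test accuracy of model RandomForest: 0.9561
🔍 Searching model: SVC
✅ Test accuracy of model SVC: 0.9825

🏆 Best model: SVC, accuracy: 0.9825
💾 Model saved as: SVC_best_model.pkl

📊 Feature importances:
❌ This model does not expose feature importances.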
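
When the selected estimator exposes neither `coef_` nor `feature_importances_` (as with an RBF-kernel `SVC`), permutation importance is one model-agnostic alternative. The sketch below uses `sklearn.inspection.permutation_importance` on the same breast-cancer data; the fixed `SVC` settings, `n_repeats=5`, and the top-5 cutoff are arbitrary choices for illustration and do not reproduce the tuned model from the run above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, test_size=0.2, random_state=42)

model = Pipeline([("scaler", StandardScaler()), ("clf", SVC(kernel="rbf"))])
model.fit(X_train, y_train)

# Shuffle one column at a time on held-out data and record the drop in accuracy.
result = permutation_importance(model, X_test, y_test, scoring="accuracy",
                                n_repeats=5, random_state=42)

# Indices of the five features whose shuffling hurts accuracy the most.
top = result.importances_mean.argsort()[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]}: {result.importances_mean[i]:.4f}")
```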


# <a id='toc1_2_1_'></a>[How to handle different types of data](#toc0_)


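These notes apply to XGBoost models.

One-hot encoding suits features with few categories and is typically paired with deep learning models.

When a feature has 10 or more categories, consider target encoding.

Choosing how to handle features depending on the number of categories:

1. Normalize continuous values.
2. Encode low-cardinality features with label encoding.
   1. The dividing line is 10 categories.
3. Encode high-cardinality features with target encoding.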


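| Feature | Column name | Number of categories | Encoding |
|---|---|---|---|
| Sex code | Sex | 2 | Label Encoding |
| Blood type code | BloodType | 6 | Label Encoding |
| Ethnicity code | NationCode | 48 | Target Encoding |
| Religion code | ReligiousCode | 8 | Label Encoding |
| Education level code | EducationCode | 35 | Target Encoding |
| Marital status code | MaritalStatus | 8 | Label Encoding |
| Military service status code | MilitaryStatus | 6 | Label Encoding |
| Occupation category code | OccupationCode | 462 | Target Encoding |
| Occupation | Occupation | 8746 | Target Encoding |
| Native place: country/region code | OriginCountryCode | 7 | Label Encoding |
| Native place: administrative division code | OriginAreaCode | 3075 | Target Encoding |
| Birthplace: country/region code | BirthCountryCode | 17 | Target Encoding |
| Birthplace: administrative division code | BirthAreaCode | 3157 | Target Encoding |
| Place of employment | ServicePlace | 195248 | Target Encoding |
| Province/city/county (district) | Province | 13 | Target Encoding |
| Personnel type | PersonnelType | 1 | None (left as is) |
| Industry category | IndustryCategory | 202 | Target Encoding |
| District/county | District | 13 | Target Encoding |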
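
The rule behind the table can be written as a small function. This is a sketch of the rule of thumb as stated in these notes (threshold of 10, constant columns left alone); the function name `choose_encoding` and the returned strings are hypothetical.

```python
def choose_encoding(n_categories, threshold=10):
    """Map a category count to the encoding suggested by the rule of thumb."""
    if n_categories <= 1:
        return "none"      # a constant column carries no information
    if n_categories < threshold:
        return "label"     # low cardinality
    return "target"        # high cardinality

# A few category counts copied from the table above.
counts = {"Sex": 2, "ReligiousCode": 8, "Province": 13, "NationCode": 48, "PersonnelType": 1}
for name, n in counts.items():
    print(f"{name:15s} {n:4d} -> {choose_encoding(n)}")
```

In practice the counts would come from `df[col].nunique()` on the training data rather than from a hard-coded dictionary.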

# K-fold processing function

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import KFold
import numpy as np

numeric_features = ['Height', 'Age',
                    'ain_1', 'ain_2', 'ain_3', 'ain_4', 'ain_5', 'ain_6', 
                    'fcp_1', 'fcp_2', 'fcp_3', 'fcp_4', 'fcp_5', 'fcp_6', 
                    'fcn_1', 'fcn_2', 'fcn_3', 'fcn_4', 'fcn_5', 'fcn_6', 
                    'rac', 'rhp', 'rsp', 'sapn', 
                    'spn_1', 'spn_2', 
                    'ncdrppm_1', 'ncdrppm_2', 'ncdrppm_3', 'ncdrppm_4', 'ncdrppm_5', 'ncdrppm_6', 
                    'nsiswsrpm_1', 'nsiswsrpm_2', 'nsiswsrpm_3', 'nsiswsrpm_4', 'nsiswsrpm_5', 'nsiswsrpm_6', 
                    'nswsspm_1', 'nswsspm_2', 'nswsspm_3', 'nswsspm_4', 'nswsspm_5', 'nswsspm_6', 
                    'nsiswsspm_1', 'nsiswsspm_2', 'nsiswsspm_3', 'nsiswsspm_4', 'nsiswsspm_5', 'nsiswsspm_6', 
                    'nswsrpm_1', 'nswsrpm_2', 'nswsrpm_3', 'nswsrpm_4', 'nswsrpm_5', 'nswsrpm_6']
low_card_features = ['Sex', 'BloodType', 'ReligiousCode', 'MaritalStatus', 'MilitaryStatus', 'OriginCountryCode']
categorical_features = ['NationCode', 'EducationCode', 'OccupationCode', 'Occupation', 'OriginAreaCode', 
                        'BirthCountryCode', 'BirthAreaCode', 'ServicePlace', 'Province', 'IndustryCategory', 
                        'PersonnelType', 'District']

# K-fold target encoder: maps each category to a mean target value learned during fit.
class KFoldTargetEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, cols, n_splits=5, smoothing=10, random_state=42):
        self.cols = cols
        self.n_splits = n_splits
        self.smoothing = smoothing
        self.random_state = random_state
        self.maps_ = {}
        self.global_mean = None
        
    def fit(self, X, Y):
        self.maps_ = {}
        self.global_mean = np.mean(Y)
        kf = KFold(n_splits=self.n_splits, shuffle=True, random_state=self.random_state)
        x = X.copy()
        y = Y.copy()
        # df['__target__'] = Y
        
        for col in self.cols:
            out_of_fold_map = {}
            for train_idx, val_idx in kf.split(x):
                x_train, y_train = x.iloc[train_idx], y.iloc[train_idx]
                means = y_train.groupby(x_train[col]).mean()
                out_of_fold_map.update(means.to_dict())
            self.maps_[col] = out_of_fold_map
        return self
    
    def get_feature_names_out(self, input_feature=None):
        return [col + '_te' for col in self.cols]
    
    def transform(self, X):
        X_ = X.copy()
        
        for col in self.cols:
            X_[col + '_te'] = X_[col].map(self.maps_[col]).fillna(self.global_mean)
        return X_[[col + '_te' for col in self.cols]]
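
In the class above, `smoothing` is stored but never used, and each fold's category means overwrite those of the previous fold in a single dictionary. For comparison, here is a minimal sketch of one common out-of-fold variant with additive smoothing. It is not the author's method: the function name `oof_target_encode`, the smoothing formula, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(cat, y, n_splits=5, smoothing=10, random_state=42):
    """Encode each row using target statistics computed on the other folds only."""
    cat = pd.Series(cat).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for fit_idx, enc_idx in kf.split(cat):
        prior = y.iloc[fit_idx].mean()
        stats = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).agg(["sum", "count"])
        # Shrink each category mean toward the fold prior; rare categories shrink most.
        smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories unseen in the fitting folds fall back to the prior.
        encoded.iloc[enc_idx] = cat.iloc[enc_idx].map(smoothed).fillna(prior).to_numpy()
    return encoded

cat = ["a", "a", "b", "b", "c", "a", "b", "c", "a", "b"]
y = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
enc = oof_target_encode(cat, y, n_splits=5, smoothing=2)
print(enc.round(3).tolist())
```

Because a row's own label never contributes to its encoded value, this variant limits target leakage on the training set; at prediction time a mapping fitted on the full training data would still be needed.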


In [None]:
PersonInfo_dw_features[numeric_features] = PersonInfo_dw_features[numeric_features].astype(int)
PersonInfo_dw_features[low_card_features] = PersonInfo_dw_features[low_card_features].astype(str)

# Model training

In [None]:
from xgboost import XGBClassifier

samples = PersonInfo_dw_features[['Sex', 'Height', 'BloodType', 'NationCode',
       'ReligiousCode', 'EducationCode', 'MaritalStatus', 'MilitaryStatus',
       'OccupationCode', 'Occupation', 'OriginCountryCode', 'OriginAreaCode',
       'BirthCountryCode', 'BirthAreaCode', 'ServicePlace', 'Province',
       'IndustryCategory', 'PersonnelType', 'District', 'Age', 'Label',
       'ain_1', 'ain_2', 'ain_3', 'ain_4', 'ain_5', 'ain_6', 'fcp_1', 'fcp_2',
       'fcp_3', 'fcp_4', 'fcp_5', 'fcp_6', 'fcn_1', 'fcn_2', 'fcn_3', 'fcn_4',
       'fcn_5', 'fcn_6', 'rac', 'rhp', 'rsp', 'sapn', 'spn_1', 'spn_2',
       'ncdrppm_1', 'ncdrppm_2', 'ncdrppm_3', 'ncdrppm_4', 'ncdrppm_5',
       'ncdrppm_6', 'nsiswsrpm_1', 'nsiswsrpm_2', 'nsiswsrpm_3', 'nsiswsrpm_4',
       'nsiswsrpm_5', 'nsiswsrpm_6', 'nswsspm_1', 'nswsspm_2', 'nswsspm_3',
       'nswsspm_4', 'nswsspm_5', 'nswsspm_6', 'nsiswsspm_1', 'nsiswsspm_2',
       'nsiswsspm_3', 'nsiswsspm_4', 'nsiswsspm_5', 'nsiswsspm_6', 'nswsrpm_1',
       'nswsrpm_2', 'nswsrpm_3', 'nswsrpm_4', 'nswsrpm_5', 'nswsrpm_6']]
labels = PersonInfo_dw_features['Label']

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), low_card_features),
#     # ('target_encoder', ce.TargetEncoder(cols=categorical_features), categorical_features),
    ('target_encoder', KFoldTargetEncoder(cols=categorical_features), categorical_features),
    # ('passthrough', '')
])

pl = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss'))
               ])

X_train, X_test, y_train, y_test = train_test_split(samples, labels, random_state=42)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

scoring = {'accuracy': make_scorer(accuracy_score),
           'f1': make_scorer(f1_score),
           'roc_auc': make_scorer(roc_auc_score)}

# Stage 1: coarse search.
parameters_dist = {
    'classifier__n_estimators': [50, 100, 150, 200, 300], # number of trees
    'classifier__learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3], # learning rate
    'classifier__max_depth': [3, 4, 5, 6, 7], # maximum tree depth
    'classifier__colsample_bytree': [0.4, 0.6, 0.8, 1], # fraction of columns sampled per tree
    'classifier__min_child_weight': [1, 2, 3, 4] # minimum sum of instance weight (hessian) required in a child
}

random_search = RandomizedSearchCV(
    pl, 
    param_distributions=parameters_dist,
    n_iter=5,
    scoring=scoring,
    refit='f1', # optimize for F1
    cv=cv,
    verbose=1,
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)
print('[stage 1] best parameters :', random_search.best_params_)


# Stage 2: fine-tuning.
best_params = random_search.best_params_
param_grid = {
    # max(1, ...) keeps n_estimators positive when stage 1 picks the smallest value (50).
    'classifier__n_estimators': [max(1, best_params['classifier__n_estimators'] - 50),
                                 best_params['classifier__n_estimators'],
                                 best_params['classifier__n_estimators'] + 50],
    'classifier__max_depth': [max(1, best_params['classifier__max_depth'] - 1),
                              best_params['classifier__max_depth'],
                              best_params['classifier__max_depth'] + 1],
    'classifier__learning_rate': [round(best_params['classifier__learning_rate'], 3)]
}

grid_search = GridSearchCV(
    random_search.best_estimator_, 
    param_grid=param_grid,
    scoring=scoring,
    refit='f1', # optimize for F1
    cv=cv,
    verbose=1,
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print('[stage 2] best parameters : {}'.format(grid_search.best_params_))
print('[stage 2] best score : {}'.format(grid_search.best_score_))

# Prediction

best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]
print('accuracy: {}'.format(accuracy_score(y_test, y_pred)))

scores = cross_val_score(best_model, samples, labels, cv=5)
print('mean accuracy :{}'.format(scores.mean()))

# Took 15.1 s.

# Feature evaluation

TODO: list the available methods one by one, such as correlation-coefficient functions.
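
As a starting point for that list, the sketch below shows three common ways to score features against a class label: Pearson correlation via `DataFrame.corrwith`, the ANOVA F-test via `f_classif`, and mutual information via `mutual_info_classif`. The synthetic data from `make_classification` and the column names `f0` to `f4` are placeholders, not this project's features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, mutual_info_classif

# Synthetic binary problem; with shuffle=False the informative columns come first.
X_arr, y_arr = make_classification(n_samples=500, n_features=5, n_informative=2,
                                   n_redundant=0, n_clusters_per_class=1,
                                   shuffle=False, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
y = pd.Series(y_arr, name="label")

f_stat, p_values = f_classif(X, y)
report = pd.DataFrame({
    "pearson_r": X.corrwith(y),                                  # linear association
    "anova_f": f_stat,                                           # between-class separation
    "mutual_info": mutual_info_classif(X, y, random_state=42),   # also captures non-linear dependence
}, index=X.columns)

print(report.sort_values("mutual_info", ascending=False).round(3))
```

The three scores answer slightly different questions, so features that rank low on correlation can still rank high on mutual information.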

# Confusion matrix

In [None]:
import seaborn as sns  # needed for sns.heatmap below

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.savefig('./pictures/ConfusionMatrix.png')
plt.show()

# ROC

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, label=f'ROC Curve (area = {roc_auc:.2f})', color='darkorange')
plt.title('ROC Curve')
# plt.text(0.5, 0.3, 'ROC curve (area =%0.2f)'%roc_auc)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.tight_layout()
plt.savefig('./pictures/ROCCurve.png')
plt.show()

print('\nClassification Report:\n')
print(classification_report(y_test, y_pred, digits=4))

# Feature importance

In [None]:
feature_names = best_model.named_steps['preprocessor'].get_feature_names_out()
importances = best_model.named_steps['classifier'].feature_importances_
# importances
# Pair each importance with its feature name.
for name, score in zip(feature_names, importances):
    print(f"{name:20s} : {score:.4f}")

# Save the model

In [None]:
joblib.dump(best_model, './data/model_pipeline.pkl')
joblib.dump(best_model.named_steps['preprocessor'].get_feature_names_out(), './data/feature_names.pkl')