I followed the author's code all the way to the end. The AUC printed at the cross-validation step is indeed quite high, reaching 0.79. However, when I split the training and validation sets myself and trained a model with the same parameters, the results were disappointing.
```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_curve, auc, roc_auc_score

voting = VotingClassifier(estimators=estimators, voting='soft')
X_train_new, X_val, y_train_new, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)
voting.fit(X_train_new, y_train_new)
y_train_predit = voting.predict(X_train_new)
y_val_predit = voting.predict(X_val)
print(classification_report(y_train_new, y_train_predit))
print(roc_auc_score(y_train_new, y_train_predit))
print(classification_report(y_val, y_val_predit))
print(roc_auc_score(y_val, y_val_predit))
```
The output is as follows:
```
              precision    recall  f1-score   support

           0       0.94      1.00      0.97     21212
           1       1.00      0.00      0.00      1247

   micro avg       0.94      0.94      0.94     22459
   macro avg       0.97      0.50      0.49     22459
weighted avg       0.95      0.94      0.92     22459

0.5008019246190858

              precision    recall  f1-score   support

           0       0.94      1.00      0.97      5306
           1       0.00      0.00      0.00       309

   micro avg       0.94      0.94      0.94      5615
   macro avg       0.47      0.50      0.49      5615
weighted avg       0.89      0.94      0.92      5615

0.49962306822465136
```
The model's AUC is only about 0.5, and recall is essentially zero. Since this is an imbalanced dataset with few defaulters, the model is probably classifying every sample as 0. Accuracy looks high, but such a model is useless.
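The symptoms described above can be reproduced with a trivial majority-class baseline; a minimal sketch on synthetic imbalanced data (not the dataset from the repo):

```python
# Sketch: a majority-class baseline on imbalanced data shows high
# accuracy but zero minority recall and an AUC of exactly 0.5.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)  # ~5% positives, like a default dataset

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)  # always predicts the majority class 0

print(accuracy_score(y, y_pred))  # high, roughly the majority-class share
print(recall_score(y, y_pred))    # 0.0 — no positive is ever predicted
print(roc_auc_score(y, y_pred))   # 0.5 — no discriminative power at all
```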
So where did things go wrong? Why was the AUC reported by the earlier cross-validation so high? Looking through the code, I found a bug, here:
```python
cv = StratifiedKFold(n_splits=3, shuffle=True)

def estimate(estimator, name='estimator'):
    auc = cross_val_score(estimator, X_train, y_train, scoring='roc_auc', cv=cv).mean()
    accuracy = cross_val_score(estimator, X_train, y_train, scoring='accuracy', cv=cv).mean()
    recall = cross_val_score(estimator, X_train, y_train, scoring='recall', cv=cv).mean()
    print("{}: auc:{:f}, recall:{:f}, accuracy:{:f}".format(name, auc, recall, accuracy))
```
The author passes a StratifiedKFold instance as the cv argument of cross_val_score. Reading the source, if you pass an integer instead, cross_val_score also defaults to StratifiedKFold(cv) to split the dataset, but without shuffle=True. Besides, running three separate cross-validations just to compute three metrics is wasteful. So I tried changing the code to:
```python
def estimate(estimator, name='estimator'):
    scoring = {'roc_auc': 'roc_auc', 'accuracy': 'accuracy', 'recall': 'recall'}
    scoring_result_dict = cross_validate(estimator, X_train, y_train, scoring=scoring, cv=3, return_estimator=True)
    auc = scoring_result_dict['test_roc_auc'].mean()
    accuracy = scoring_result_dict['test_accuracy'].mean()
    recall = scoring_result_dict['test_recall'].mean()
    print(scoring_result_dict)
    print("{}: auc:{:f}, recall:{:f}, accuracy:{:f}".format(name, auc, recall, accuracy))
```
Now the computed AUC is only about 0.5, consistent with the results above. I also tried passing cv = StratifiedKFold(n_splits=3, shuffle=True), and the AUC likewise came out around 0.5. My guess is that shuffle=True is what inflated the AUC, but I haven't found the exact cause yet.
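The claim above about the default splitter can be checked directly; a minimal sketch using scikit-learn's check_cv helper (not part of the original code):

```python
# Sketch: verify that an integer cv is resolved to a StratifiedKFold
# without shuffling when the estimator is a classifier.
import numpy as np
from sklearn.model_selection import check_cv

y = np.array([0, 1] * 10)               # any binary label vector
cv = check_cv(3, y, classifier=True)    # same resolution cross_val_score does

print(type(cv).__name__)  # StratifiedKFold
print(cv.shuffle)         # False — no shuffle by default
```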
The author did a lot of solid work on data cleaning and feature engineering, which is quite instructive; the final model-tuning part, however, feels a bit rough.
The problem is likely your use of y_val_predit: when computing AUC you should pass y_pred[:,1], i.e. the predicted score for class 1. Try changing it to y_val_predit = voting.predict_proba(X_val)[:,1].
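The suggested fix can be sketched on synthetic data (the model and dataset here are stand-ins, not the ones from the repo): scoring='roc_auc' in cross-validation ranks by probabilities, so scoring hard 0/1 labels with roc_auc_score is not comparable.

```python
# Sketch: AUC computed on hard labels vs. on predicted probabilities
# of the positive class, on a synthetic imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Hard labels collapse the ranking to two values, depressing the AUC.
auc_labels = roc_auc_score(y_val, model.predict(X_val))
# Probabilities of class 1: this is what scoring='roc_auc' uses internally.
auc_proba = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(auc_labels, auc_proba)
```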