TST Check correct interactions of `class_weight` and `sample_weight` #21504
Hi, I'm new. I would like to work on this.
Hi @mlant, I am glad you asked; I have just added the list of the interfaces to apply those tests on. I would suggest starting with a simple PR which introduces tests for …
I think we could probably directly add this as a common test (and maybe skip estimators that fail at first)? It's very similar in spirit to `check_class_weights_invariance(name, estimator_orig)`, and we would only run it here if the estimator has the `class_weight` init parameter.
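A minimal sketch of what such a common check could look like (the function name is hypothetical and this is not the actual `estimator_checks` implementation; it assumes the estimator accepts `random_state` and exposes `predict_proba`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def check_class_weight_zero_is_sample_weight_zero(Estimator):
    """Hypothetical common check: giving a class a class_weight of 0 should be
    equivalent to zeroing the sample_weight of that class's samples."""
    if "class_weight" not in Estimator().get_params():
        # Skip estimators without a class_weight init parameter.
        return True

    X, y = make_classification(n_samples=100, n_classes=3, n_informative=6,
                               random_state=0)
    class_weight = {0: 0.0, 1: 1.0, 2: 1.0}
    sample_weight = np.where(y == 0, 0.0, 1.0)

    est_cw = Estimator(random_state=0, class_weight=class_weight).fit(X, y)
    est_sw = Estimator(random_state=0).fit(X, y, sample_weight=sample_weight)
    np.testing.assert_allclose(est_cw.predict_proba(X),
                               est_sw.predict_proba(X))
    return True


result = check_class_weight_zero_is_sample_weight_zero(DecisionTreeClassifier)
```

For tree-based estimators the two fits see bit-for-bit identical effective sample weights, so with a fixed `random_state` the check can compare probabilities directly.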
Thanks for raising this @jjerphan!
Thank you, I think it's clear.
Hi. I would like to work on …
@MrinalTyagi: feel free to start when you want. I'll then add your PR to this issue description so that people will know you're already working on this class and its sub-classes.
Hi, I would like to work on …
@VibhutiBansal-11: Thank you for expressing your interest. I would wait for the changes for …
I would say this is not a good first issue, since it deals with a few complexities that are better tackled once people are more familiar with the code-base. I would suggest that first- or second-time contributors focus on other issues marked as "good first issue" and "help wanted"; we have a few of those around now.
Hello @jjerphan! I want to do a PR that addresses the proposed tests, but I have some questions before I open it.

1- I've written some code that implements the logic envisioned in the first test proposal (as far as I see). Does it make sense?

```python
import numpy as np
from itertools import product

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, RandomTreesEmbedding
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

list_of_estimators = [DecisionTreeClassifier, ExtraTreeClassifier,
                      RandomForestClassifier, ExtraTreesClassifier,
                      LogisticRegression, LogisticRegressionCV]

for Estimator, n_classes in product(list_of_estimators, range(3, 10)):
    X, y = make_classification(n_samples=200, n_classes=n_classes,
                               n_informative=2 * n_classes, random_state=42)
    sample_weight = np.random.RandomState(42).uniform(size=y.shape[0])

    # Giving class 0 a zero class_weight should be equivalent to zeroing
    # the sample_weight of every sample of class 0.
    class_weight_dict = {cls: 1 if cls != 0 else 0 for cls in range(n_classes)}
    sample_weight_exclude_first = np.where(y == 0, 0, sample_weight)

    dtc_cw = (
        Estimator(random_state=42, class_weight=class_weight_dict)
        .fit(X, y, sample_weight=sample_weight)
    )
    dtc_sw = (
        Estimator(random_state=42)
        .fit(X, y, sample_weight=sample_weight_exclude_first)
    )
    assert (dtc_cw.predict_proba(X) == dtc_sw.predict_proba(X)).all()
```

(I'm getting some …)

2- Also, I've implemented the logic of the second test (from what I understand) in the code below:

```python
for Estimator, n_classes in product(list_of_estimators, range(2, 10)):
    X, y = make_classification(n_samples=200, n_classes=n_classes,
                               n_informative=2 * n_classes, random_state=42)
    # Binary sample_weight: dropping the zero-weight samples from the
    # training set should give the same model.
    sample_weight = np.random.RandomState(42).binomial(1, 0.8, size=y.shape[0])
    class_weight_dict = {cls: np.random.RandomState(cls).randint(1, 6)
                         for cls in range(n_classes)}

    dtc_cw_and_sw = (
        Estimator(random_state=42, class_weight=class_weight_dict)
        .fit(X, y, sample_weight=sample_weight)
    )
    dtc_sw = (
        Estimator(random_state=42, class_weight=class_weight_dict)
        .fit(X[sample_weight == 1], y[sample_weight == 1])
    )
    assert (dtc_cw_and_sw.predict_proba(X) == dtc_sw.predict_proba(X)).all()
```

My assert passes for …

3- Also, other estimators have `class_weight` and `sample_weight` simultaneously. Should I check them all or just the ones listed in the PR description? If it's the second option, why?

4- Finally, `RandomTreesEmbedding` does not accept `class_weight`:

```python
RandomTreesEmbedding(class_weight="balanced")
>>> TypeError: __init__() got an unexpected keyword argument 'class_weight'
```
Hello @jjerphan, pinging you again, just in case you overlooked this notification. :) Do you find this issue pertinent to the library? If you believe it isn't relevant at this time, I can look for an alternative issue to focus on. If you find it beneficial for me to submit the pull request but lack the time to address my questions at the moment, I can proceed with the submission and tackle my doubts during the review stage instead of discussing them here.
Hi @vitaliset, thanks for the heads-up. I am quite busy right now, but I will try to get back to you soon.
Hi @vitaliset, can you open a draft pull request with the code you propose? This way it will be easier to discuss and inspect. Thank you!
I created the PR, @jjerphan! :D Some classifiers seem to have failed the tests, such as `Perceptron`, `LinearSVC`, and `SGDClassifier`. Also, I saw an error related to pairwise that might require me to add a similar test check to it:

`scikit-learn/sklearn/utils/estimator_checks.py`, lines 96 to 98 at 265b9aa
In scikit-learn, some estimators support `class_weight` and `sample_weight`. It might be worth testing the correct interaction of those two types of weights, especially asserting that: …

Relevant interfaces:

- `sklearn.tree.BaseDecisionTree` for classification, i.e.:
  - `sklearn.tree.DecisionTreeClassifier`
  - `sklearn.tree.ExtraTreeClassifier`
- `sklearn.ensemble.BaseForest` for classification and embedding, i.e.:
  - `sklearn.ensemble.RandomTreesEmbedding`
  - `sklearn.ensemble.RandomForestClassifier`
  - `sklearn.ensemble.ExtraTreesClassifier`
- `sklearn.linear_model.LogisticRegression`
- `sklearn.linear_model.LogisticRegressionCV`
- `sklearn.calibration.CalibratedClassifierCV`, after the merge of [MRG] Add class_weight parameter to CalibratedClassifierCV #17541
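As a sketch of the expected interaction for the tree-based interfaces listed above: passing `class_weight` and `sample_weight` together should be equivalent to pre-multiplying the per-class weights into `sample_weight`, which can be cross-checked against `sklearn.utils.class_weight.compute_sample_weight` (illustrated here with `DecisionTreeClassifier`; this assumes the multiplicative semantics, it is not an official test):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_sample_weight

X, y = make_classification(n_samples=200, n_classes=3, n_informative=6,
                           random_state=0)
class_weight = {0: 1.0, 1: 2.0, 2: 3.0}
sample_weight = np.random.RandomState(0).uniform(size=y.shape[0])

# Passing class_weight and sample_weight together...
clf_both = DecisionTreeClassifier(random_state=0, class_weight=class_weight)
clf_both.fit(X, y, sample_weight=sample_weight)

# ...should match folding the per-class weights into sample_weight.
clf_mult = DecisionTreeClassifier(random_state=0)
clf_mult.fit(X, y,
             sample_weight=sample_weight * compute_sample_weight(class_weight, y))

agree = np.allclose(clf_both.predict_proba(X), clf_mult.predict_proba(X))
```

With a fixed `random_state`, both fits see the same effective weights, so the two trees should produce matching probabilities.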