
TST Check correct interactions of class_weight and sample_weight #21504

Open · 10 tasks
jjerphan opened this issue Oct 29, 2021 · 15 comments
Labels
help wanted · Meta-issue (General issue associated to an identified list of tasks) · Moderate (Anything that requires some knowledge of conventions and best practices) · module:test-suite (everything related to our tests)

Comments

@jjerphan
Member

jjerphan commented Oct 29, 2021

In scikit-learn, some estimators support class_weight and sample_weight.

It might be worth testing the correct interaction of those two types of weights, especially asserting that:

  • setting a class's weight to zero is equivalent to excluding the samples of that class from the fit, even when using non-uniform sample weights;
  • setting some sample weights to zero is equivalent to excluding those samples from the fit, even when using non-uniform class weights.
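A rough sketch of the first assertion (an editor's illustration, not code from the issue; the choice of LogisticRegression and the tolerance are assumptions):

```python
import numpy as np

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=200, n_classes=3, n_informative=6, random_state=0
)
rng = np.random.RandomState(0)
sample_weight = rng.uniform(size=y.shape[0])

# Give class 0 a zero weight through class_weight ...
clf_cw = LogisticRegression(
    max_iter=1000, class_weight={0: 0.0, 1: 1.0, 2: 1.0}
).fit(X, y, sample_weight=sample_weight)

# ... versus zeroing the sample weights of class 0 directly.
clf_sw = LogisticRegression(max_iter=1000).fit(
    X, y, sample_weight=np.where(y == 0, 0.0, sample_weight)
)

# Class weights act multiplicatively on sample weights, so both fits
# optimize the same weighted objective; the predicted probabilities
# should agree up to solver tolerance.
assert np.allclose(clf_cw.predict_proba(X), clf_sw.predict_proba(X), atol=1e-6)
```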

Relevant interfaces:

  • the main subclasses of sklearn.tree.BaseDecisionTree for classification, i.e.:
    • sklearn.tree.DecisionTreeClassifier
    • sklearn.tree.ExtraTreeClassifier
  • the main subclasses of sklearn.ensemble.BaseForest for classification and embedding, i.e.:
    • sklearn.ensemble.RandomTreesEmbedding
    • sklearn.ensemble.RandomForestClassifier
    • sklearn.ensemble.ExtraTreesClassifier
  • sklearn.linear_model.LogisticRegression
  • sklearn.linear_model.LogisticRegressionCV
  • sklearn.calibration.CalibratedClassifierCV after the merge of [MRG] Add class_weight parameter to CalibratedClassifierCV #17541
@mlant
Contributor

mlant commented Nov 3, 2021

Hi, I'm new, I would like to work on this.
@jjerphan Do you have an estimator in mind that uses both that I can use to start with?

@jjerphan
Member Author

jjerphan commented Nov 4, 2021

Hi @mlant,

I am glad you asked: I have just added the list of interfaces to apply those tests to.

I would suggest starting with a simple PR that introduces tests for sklearn.linear_model.LogisticRegression and possibly sklearn.linear_model.LogisticRegressionCV. Would that work for you?

@rth
Member

rth commented Nov 4, 2021

I think we could add this directly as a common test (and maybe skip estimators that fail at first)? It is very similar in spirit to check_sample_weights_invariance(name, estimator_orig, kind="ones"), so I imagine we could add a similar check_class_weights_invariance(name, estimator_orig) and only run it if the estimator has the "class_weight" init parameter.
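A minimal sketch of the gating described here (the helper name is hypothetical; scikit-learn's real common-test runner in sklearn.utils.estimator_checks wires this up differently):

```python
import inspect

from sklearn.linear_model import LinearRegression, LogisticRegression

def accepts_class_weight(Estimator):
    # Run the class_weight check only for estimators whose __init__
    # signature exposes a "class_weight" parameter.
    return "class_weight" in inspect.signature(Estimator.__init__).parameters

assert accepts_class_weight(LogisticRegression)
assert not accepts_class_weight(LinearRegression)
```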

Thanks for raising this @jjerphan !

@jjerphan
Member Author

jjerphan commented Nov 4, 2021

@rth's proposal shows the correct steps to take. To ease new contributions, should we set up something similar to what has been done for #21406?

@mlant: In any case, let us know if you need more explanations.

@mlant
Contributor

mlant commented Nov 4, 2021

Thank you, I think it's clear.
I'll begin tomorrow; if I run into any issues I'll come back to you.

@MrinalTyagi
Contributor

Hi. I would like to work on sklearn.tree.BaseDecisionTree.

@jjerphan
Member Author

jjerphan commented Nov 6, 2021

@MrinalTyagi: feel free to start when you want.

I'll then add your PR to this issue's description so that people know you're already working on this class and its subclasses.

@VibhutiBansal-11

Hi, I would like to work on sklearn.ensemble.RandomTreesEmbedding.

@jjerphan
Member Author

jjerphan commented Nov 7, 2021

@VibhutiBansal-11: Thank you for expressing your interest.

I would wait for the changes for sklearn.tree.BaseDecisionTree to be submitted before working on sklearn.ensemble.RandomTreesEmbedding.

@adrinjalali
Member

I would say this is not a good first issue, since it deals with a few complexities that are better tackled once people are more familiar with the code base. I would suggest that first- or second-time contributors focus on other issues marked as "good first issue" and "help wanted"; we have a few of those around now.

@cmarmo cmarmo added Moderate Anything that requires some knowledge of conventions and best practices module:test-suite everything related to our tests Meta-issue General issue associated to an identified list of tasks help wanted labels Sep 14, 2022
@vitaliset
Contributor

vitaliset commented Feb 25, 2023

Hello @jjerphan! I want to do a PR that addresses the proposed tests, but I have some questions before I open it.

1- I've written some code that implements the logic envisioned in the first test proposal (as far as I see). Does it make sense?

import numpy as np
from itertools import product

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, RandomTreesEmbedding
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

list_of_estimators = [DecisionTreeClassifier, ExtraTreeClassifier,
                      RandomForestClassifier, ExtraTreesClassifier,
                      LogisticRegression, LogisticRegressionCV]

for Estimator, n_classes in product(list_of_estimators, range(3, 10)):

    X, y = make_classification(n_samples=200, n_classes=n_classes,
                               n_informative=2*n_classes, random_state=42)

    sample_weight = np.random.RandomState(42).uniform(size=y.shape[0])
    class_weight_dict = {cls: 1 if cls != 0 else 0 for cls in range(n_classes)}
    sample_weight_exclude_first = np.where(y == 0, 0, sample_weight)

    dtc_cw = (
        Estimator(random_state=42, class_weight=class_weight_dict)
        .fit(X, y, sample_weight=sample_weight)
    )
    dtc_sw = (
        Estimator(random_state=42)
        .fit(X, y, sample_weight=sample_weight_exclude_first)
    )

    assert (dtc_cw.predict_proba(X) == dtc_sw.predict_proba(X)).all()

(BTW, I'm getting some "Increase the number of iterations (max_iter) or scale the data" convergence warnings for logistic regression.)

2- Also, I've implemented the logic of the second test (from what I understand) in the code below:

for Estimator, n_classes in product(list_of_estimators, range(2, 10)):

    X, y = make_classification(n_samples=200, n_classes=n_classes,
                               n_informative=2*n_classes, random_state=42)

    sample_weight = np.random.RandomState(42).binomial(1, 0.8, size=y.shape[0])    
    class_weight_dict = {cls: np.random.RandomState(cls).randint(1, 6) for cls in range(n_classes)}

    dtc_cw_and_sw = (
        Estimator(random_state=42, class_weight=class_weight_dict)
        .fit(X, y, sample_weight=sample_weight)
    )
    dtc_sw = (
        Estimator(random_state=42, class_weight=class_weight_dict)
        .fit(X[sample_weight == 1], y[sample_weight == 1])
    )

    assert (dtc_cw_and_sw.predict_proba(X) == dtc_sw.predict_proba(X)).all()

My assert passes for DecisionTreeClassifier, ExtraTreeClassifier and ExtraTreesClassifier. Still, it does not pass for RandomForestClassifier, LogisticRegression and LogisticRegressionCV. I imagine that, in the first case, the bootstrap sampling used when building the model interferes. For logistic regression it might have something to do with the initialization of the weights. Does it make sense to restrict this test to just some of the estimators with class_weight/sample_weight?
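One way to see the suspected bootstrap interference (an editor's illustration with plain NumPy, not scikit-learn's actual resampling code): bootstrap indices are drawn over the number of fitted samples, so fitting on the full data with some zero weights and fitting on only the retained subset produce different resamples even with the same seed.

```python
import numpy as np

rng_full = np.random.RandomState(42)
rng_subset = np.random.RandomState(42)

# Bootstrap over all 200 samples (zero-weight samples still counted)
idx_full = rng_full.randint(0, 200, size=200)
# Bootstrap over only the 160 retained samples
idx_subset = rng_subset.randint(0, 160, size=160)

# Same seed, different sample counts: the draws diverge, so the
# trees are grown on different resamples.
assert idx_full[:160].tolist() != idx_subset.tolist()
```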

3- Also, other estimators accept class_weight and sample_weight simultaneously. Should I check them all, or just the ones listed in the issue description? If the latter, why?

4- Finally, RandomTreesEmbedding does not have the class_weight parameter. Why is it on the list?

>>> RandomTreesEmbedding(class_weight="balanced")
TypeError: __init__() got an unexpected keyword argument 'class_weight'

@vitaliset
Contributor

Hello @jjerphan, pinging you again, just in case you overlooked this notification. :)

Do you find this issue pertinent to the library? If you believe it isn't relevant at this time, I can search for an alternative issue to focus on. If you find it beneficial for me to submit the pull request, but lack the time to address my questions at the moment, I can proceed with the submission and tackle my doubts during the review stage instead of discussing them here.

@jjerphan
Member Author

Hi @vitaliset, thanks for the heads-up.

I am quite busy right now, but I will try to get back to you soon.

@jjerphan
Member Author

Hi @vitaliset, can you open a draft pull request with the code you propose? This way it will be easier to discuss and inspect. Thank you!

@vitaliset
Contributor

I created the PR @jjerphan! :D

Some classifiers seem to have failed the tests, such as Perceptron, LinearSVC, and SGDClassifier. Also, I saw an error related to pairwise that might require me to add a similar test check to it:

    if not tags["pairwise"]:
        # We skip pairwise because the data is not pairwise
        yield check_sample_weights_shape
