
TST Check correct interactions of class_weight and sample_weight #21504

Open · 10 tasks
jjerphan opened this issue Oct 29, 2021 · 15 comments
Labels
help wanted · Meta-issue (General issue associated to an identified list of tasks) · Moderate (Anything that requires some knowledge of conventions and best practices) · module:test-suite (everything related to our tests)

Comments

@jjerphan
Member

jjerphan commented Oct 29, 2021

In scikit-learn, some estimators support class_weight and sample_weight.

It might be worth testing the correct interaction of those two types of weights, especially asserting that:

  • setting a class's weight to zero is equivalent to excluding the samples of that class from the fit, even when using non-uniform sample weights;
  • setting some sample weights to zero is equivalent to excluding those samples from the fit, even when using non-uniform class weights.
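A rough sketch of the first assertion (an editor's illustration, not code from the issue; the choice of LogisticRegression and the tolerance are assumptions):

```python
import numpy as np

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=200, n_classes=3, n_informative=6, random_state=0
)
rng = np.random.RandomState(0)
sample_weight = rng.uniform(size=y.shape[0])

# Give class 0 a zero weight through class_weight ...
clf_cw = LogisticRegression(
    max_iter=1000, class_weight={0: 0.0, 1: 1.0, 2: 1.0}
).fit(X, y, sample_weight=sample_weight)

# ... versus zeroing the sample weights of class 0 directly.
clf_sw = LogisticRegression(max_iter=1000).fit(
    X, y, sample_weight=np.where(y == 0, 0.0, sample_weight)
)

# Class weights act multiplicatively on sample weights, so both fits
# optimize the same weighted objective; the predicted probabilities
# should agree up to solver tolerance.
assert np.allclose(clf_cw.predict_proba(X), clf_sw.predict_proba(X), atol=1e-6)
```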

Relevant interfaces:

  • the main subclasses of sklearn.tree.BaseDecisionTree for classification, i.e.:
    • sklearn.tree.DecisionTreeClassifier
    • sklearn.tree.ExtraTreeClassifier
  • the main subclasses of sklearn.ensemble.BaseForest for classification and embedding, i.e.:
    • sklearn.ensemble.RandomTreesEmbedding
    • sklearn.ensemble.RandomForestClassifier
    • sklearn.ensemble.ExtraTreesClassifier
  • sklearn.linear_model.LogisticRegression
  • sklearn.linear_model.LogisticRegressionCV
  • sklearn.calibration.CalibratedClassifierCV after the merge of [MRG] Add class_weight parameter to CalibratedClassifierCV #17541
@mlant
Contributor

mlant commented Nov 3, 2021

Hi, I'm new, I would like to work on this.
@jjerphan Do you have an estimator in mind that uses both that I can use to start with?

@jjerphan
Member Author

jjerphan commented Nov 4, 2021

Hi @mlant,

I am glad you asked: I have just added the list of interfaces to apply those tests to.

I would suggest starting with a simple PR that introduces tests for sklearn.linear_model.LogisticRegression and possibly sklearn.linear_model.LogisticRegressionCV. Would that work for you?

@rth
Member

rth commented Nov 4, 2021

I think we could add this directly as a common test (and maybe skip estimators that fail at first)? It is very similar in spirit to check_sample_weights_invariance(name, estimator_orig, kind="ones"), so I imagine we could add a similar check_class_weights_invariance(name, estimator_orig) and only run it if the estimator has the "class_weight" init parameter.
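A minimal sketch of the gating described here (the helper name is hypothetical; scikit-learn's real common-test runner in sklearn.utils.estimator_checks wires this up differently):

```python
import inspect

from sklearn.linear_model import LinearRegression, LogisticRegression

def accepts_class_weight(Estimator):
    # Run the class_weight check only for estimators whose __init__
    # signature exposes a "class_weight" parameter.
    return "class_weight" in inspect.signature(Estimator.__init__).parameters

assert accepts_class_weight(LogisticRegression)
assert not accepts_class_weight(LinearRegression)
```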

Thanks for raising this @jjerphan !

@jjerphan
Member Author

jjerphan commented Nov 4, 2021

@rth's proposal shows the correct steps to take. To ease new contributions, should we set up something similar to what has been done for #21406?

@mlant: In any case, let us know if you need more explanations.

@mlant
Contributor

mlant commented Nov 4, 2021

Thank you, I think it's clear.
I'll begin tomorrow; if I run into any issues I'll come back to you.

@MrinalTyagi
Contributor

Hi. I would like to work on sklearn.tree.BaseDecisionTree.

@jjerphan
Member Author

jjerphan commented Nov 6, 2021

@MrinalTyagi: feel free to start when you want.

I'll then add your PR to this issue's description so that people know you're already working on this class and its subclasses.

@VibhutiBansal-11

Hi, I would like to work on sklearn.ensemble.RandomTreesEmbedding.

@jjerphan
Member Author

jjerphan commented Nov 7, 2021

@VibhutiBansal-11: Thank you for expressing your interest.

I would wait for the changes for sklearn.tree.BaseDecisionTree to be submitted before working on sklearn.ensemble.RandomTreesEmbedding.

@adrinjalali
Member

I would say this is not a good first issue, since it deals with a few complexities that are better tackled once people are more familiar with the code base. I would suggest that first- or second-time contributors focus on other issues marked as "good first issue" and "help wanted"; we have a few of those around now.

@cmarmo cmarmo added Moderate Anything that requires some knowledge of conventions and best practices module:test-suite everything related to our tests Meta-issue General issue associated to an identified list of tasks help wanted labels Sep 14, 2022
@vitaliset
Contributor

vitaliset commented Feb 25, 2023

Hello @jjerphan! I want to do a PR that addresses the proposed tests, but I have some questions before I open it.

1- I've written some code that implements the logic envisioned in the first test proposal (as far as I see). Does it make sense?

import numpy as np
from itertools import product

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, RandomTreesEmbedding
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

list_of_estimators = [DecisionTreeClassifier, ExtraTreeClassifier,
                      RandomForestClassifier, ExtraTreesClassifier,
                      LogisticRegression, LogisticRegressionCV]

for Estimator, n_classes in product(list_of_estimators, range(3, 10)):

    X, y = make_classification(n_samples=200, n_classes=n_classes,
                               n_informative=2*n_classes, random_state=42)

    sample_weight = np.random.RandomState(42).uniform(size=y.shape[0])
    class_weight_dict = {cls: 1 if cls != 0 else 0 for cls in range(n_classes)}
    sample_weight_exclude_first = np.where(y == 0, 0, sample_weight)

    dtc_cw = (
        Estimator(random_state=42, class_weight=class_weight_dict)
        .fit(X, y, sample_weight=sample_weight)
    )
    dtc_sw = (
        Estimator(random_state=42)
        .fit(X, y, sample_weight=sample_weight_exclude_first)
    )

    assert (dtc_cw.predict_proba(X) == dtc_sw.predict_proba(X)).all()

(BTW, I'm getting some "Increase the number of iterations (max_iter) or scale the data" convergence warnings for logistic regression.)

2- Also, I've implemented the logic of the second test (from what I understand) in the code below:

for Estimator, n_classes in product(list_of_estimators, range(2, 10)):

    X, y = make_classification(n_samples=200, n_classes=n_classes,
                               n_informative=2*n_classes, random_state=42)

    sample_weight = np.random.RandomState(42).binomial(1, 0.8, size=y.shape[0])    
    class_weight_dict = {cls: np.random.RandomState(cls).randint(1, 6) for cls in range(n_classes)}

    dtc_cw_and_sw = (
        Estimator(random_state=42, class_weight=class_weight_dict)
        .fit(X, y, sample_weight=sample_weight)
    )
    dtc_sw = (
        Estimator(random_state=42, class_weight=class_weight_dict)
        .fit(X[sample_weight == 1], y[sample_weight == 1])
    )

    assert (dtc_cw_and_sw.predict_proba(X) == dtc_sw.predict_proba(X)).all()

My assert passes for DecisionTreeClassifier, ExtraTreeClassifier and ExtraTreesClassifier. Still, it does not pass for RandomForestClassifier, LogisticRegression and LogisticRegressionCV. I imagine that, in the first case, the bootstrap sampling used when building the model interferes. For logistic regression it might have something to do with the initialization of the weights. Does it make sense to restrict this test to just some of the estimators with class_weight/sample_weight?
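One way to see the suspected bootstrap interference (an editor's illustration with plain NumPy, not scikit-learn's actual resampling code): bootstrap indices are drawn over the number of fitted samples, so fitting on the full data with some zero weights and fitting on only the retained subset produce different resamples even with the same seed.

```python
import numpy as np

rng_full = np.random.RandomState(42)
rng_subset = np.random.RandomState(42)

# Bootstrap over all 200 samples (zero-weight samples still counted)
idx_full = rng_full.randint(0, 200, size=200)
# Bootstrap over only the 160 retained samples
idx_subset = rng_subset.randint(0, 160, size=160)

# Same seed, different sample counts: the draws diverge, so the
# trees are grown on different resamples.
assert idx_full[:160].tolist() != idx_subset.tolist()
```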

3- Also, other estimators accept class_weight and sample_weight simultaneously. Should I check them all, or just the ones listed in the issue description? If the latter, why?

4- Finally, RandomTreesEmbedding does not have the class_weight parameter. Why is it on the list?

>>> RandomTreesEmbedding(class_weight="balanced")
TypeError: __init__() got an unexpected keyword argument 'class_weight'

@vitaliset
Contributor

Hello @jjerphan, pinging you again, just in case you overlooked this notification. :)

Do you find this issue pertinent to the library? If you believe it isn't relevant at this time, I can search for an alternative issue to focus on. If you find it beneficial for me to submit the pull request, but lack the time to address my questions at the moment, I can proceed with the submission and tackle my doubts during the review stage instead of discussing them here.

@jjerphan
Member Author

Hi @vitaliset, thanks for the heads-up.

I am quite busy right now, but I will try to get back to you soon.

@jjerphan
Member Author

Hi @vitaliset, can you open a draft pull request with the code you propose? This way it will be easier to discuss and inspect. Thank you!

@vitaliset
Contributor

I created the PR @jjerphan! :D

Some classifiers seem to have failed the tests, such as Perceptron, LinearSVC, and SGDClassifier. Also, I saw an error related to pairwise that might require me to add a similar test check to it:

    if not tags["pairwise"]:
        # We skip pairwise because the data is not pairwise
        yield check_sample_weights_shape
