
Function to get scorers for task #12385

Open
jnothman opened this issue Oct 15, 2018 · 9 comments · May be fixed by #17889
Labels
API · Moderate (Anything that requires some knowledge of conventions and best practices) · New Feature

Comments

@jnothman
Member

I would like to see a utility which would construct a set of applicable scorers for a particular task, returning a Mapping from string to callable scorer. It will be hard to design the API of this right the first time. [Maybe this should be initially developed outside this project and contributed to scikit-learn-contrib, but I think it reduces risk of mis-specifying scorers, so it's of benefit to this project.]

The user will be able to select a subset of the scorers, either with a dict comprehension or with some specialised methods or function parameters. Initially it wouldn't be efficient to run all these scorers, but hopefully we can do something to fix #10802 :|.

Let's take for instance a binary classification task. The function get_applicable_scorers(y, pos_label='yes') for binary y might produce something like:

{
    'accuracy': make_scorer(accuracy_score),
    'balanced_accuracy': make_scorer(balanced_accuracy_score),
    'matthews_corrcoef': make_scorer(matthews_corrcoef),
    'cohens_kappa': make_scorer(cohen_kappa_score),
    'precision': make_scorer(precision_score, pos_label='yes'),
    'recall': make_scorer(recall_score, pos_label='yes'),
    'f1': make_scorer(f1_score, pos_label='yes'),
    'f0.5': make_scorer(fbeta_score, pos_label='yes', beta=0.5),
    'f2': make_scorer(fbeta_score, pos_label='yes', beta=2),
    'specificity': ...,
    'miss_rate': ...,
    ...
    'roc_auc': make_scorer(roc_auc_score, needs_threshold=True),
    'average_precision': make_scorer(average_precision_score, needs_threshold=True),
    'neg_log_loss': make_scorer(log_loss, needs_proba=True, greater_is_better=False),
    'neg_brier_score_loss': make_scorer(brier_score_loss, needs_proba=True, greater_is_better=False),
}
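
As a usage sketch (get_applicable_scorers is hypothetical, and estimator, X and y are assumed to be defined), the subsetting mentioned above could be a plain dict comprehension fed to cross_validate:

from sklearn.model_selection import cross_validate

scorers = get_applicable_scorers(y, pos_label='yes')  # hypothetical helper
# Keep only a few threshold-free classification scorers.
selected = {name: scorer for name, scorer in scorers.items()
            if name in {'accuracy', 'precision', 'recall', 'f1'}}
# cross_validate accepts a mapping of names to scorer callables for `scoring`.
results = cross_validate(estimator, X, y, scoring=selected, cv=5)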

Doing the same for multiclass classification would pass labels as appropriate, and would optionally get per-class binary metrics, as well as overall multiclass metrics (a rough sketch of the per-class idea is below).
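
For the per-class part, one possible sketch (not the proposed API; the class labels here are purely illustrative) is to build one scorer per class by restricting labels to that class:

from sklearn.metrics import make_scorer, f1_score, recall_score

classes = ['cat', 'dog', 'bird']
per_class = {}
for c in classes:
    # With `labels=[c]`, the macro average reduces to that single class's score.
    per_class[f'recall_{c}'] = make_scorer(recall_score, labels=[c], average='macro')
    per_class[f'f1_{c}'] = make_scorer(f1_score, labels=[c], average='macro')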

I'm not sure how sample_weight fits in here, but ha! we still don't support weighted scoring in cross validation (#1574), so let's not worry about that.

@jnothman added the New Feature, Moderate, API, and help wanted labels on Oct 15, 2018
@whiletruelearn
Contributor

@jnothman Trying to understand. Can this be implemented by maintaining a separate dict of scorers, as in

SCORERS = dict(explained_variance=explained_variance_scorer,
               r2=r2_scorer,
               max_error=max_error_scorer,
               neg_median_absolute_error=neg_median_absolute_error_scorer,
               neg_mean_absolute_error=neg_mean_absolute_error_scorer,
               neg_mean_squared_error=neg_mean_squared_error_scorer,
               neg_mean_squared_log_error=neg_mean_squared_log_error_scorer,
               accuracy=accuracy_scorer,
               roc_auc=roc_auc_scorer,
               balanced_accuracy=balanced_accuracy_scorer,
               average_precision=average_precision_scorer,
               neg_log_loss=neg_log_loss_scorer,
               brier_score_loss=brier_score_loss_scorer,
               # Cluster metrics that use supervised evaluation
               adjusted_rand_score=adjusted_rand_scorer,
               homogeneity_score=homogeneity_scorer,
               completeness_score=completeness_scorer,
               v_measure_score=v_measure_scorer,
               mutual_info_score=mutual_info_scorer,
               adjusted_mutual_info_score=adjusted_mutual_info_scorer,
               normalized_mutual_info_score=normalized_mutual_info_scorer,
               fowlkes_mallows_score=fowlkes_mallows_scorer)

for different tasks such as classification, regression, clustering, etc.?

@jnothman
Member Author

No, while related to that:

  1. this needs to be a function of other parameters, such as a single or multiple classes of interest.
  2. this needs to exclude inappropriate scorers, such as those for regression or clustering tasks when performing classification.

@baluyotraf
Contributor

baluyotraf commented Jan 21, 2019

Not sure if I really get what you want, but can't we do a class wrapper for the scorers?

For example, a scorer wrapper class would have something like

class Scorer:

    # Class-level metadata describing what the scorer applies to and needs.
    is_binary_scorer = True
    needs_probability = True

    def __init__(self, *args, **kwargs):
        pass

    def compute(self, y_true, y_pred):
        # Return the metric value for the given targets and predictions.
        pass

Then to take binary scorers you can do a list comprehension.

params = {
    'weights': 10,
    'something': 'pass'
}
binary_scorers = [s(**params) for s in ALL_SCORERS if s.is_binary_scorer]

If a scorer needs predict_proba when computing multiple metrics, you can also skip it with a warning if the model used doesn't provide it. Just my 2 cents.

You can also do something like this for the multi-metric case

for s in scorers:
    if s.needs_probability:
        if proba is None:
            try:
                proba = estimator.predict_proba(X_test)
            except AttributeError:
                # The estimator cannot produce probabilities; skip this scorer.
                continue
        score = s.compute(y_true, proba)
    else:
        if y_pred is None:
            # Cache the hard predictions for reuse by later scorers.
            y_pred = estimator.predict(X_test)
        score = s.compute(y_true, y_pred)

Oops, it seems something like this was already posted in the referenced issue.

@jnothman
Member Author

This is not about computing various metrics efficiently, but about helping users find metrics appropriate for a task, and making sure they do not get caught in some traps. For example, the scorers available do not return any per-class metrics in the case of multiclass, multilabel, etc. They also make some sometimes-poor assumptions, including that:

  • y_true and y_pred contain all relevant classes. This makes a big difference for macro-averaged metrics (and for any metric that changes its behaviour depending on whether its input is binary or multiclass). It can be overridden in metrics by passing a labels parameter (see the sketch after this list).
  • in a binary problem, the highest-valued class label is the one you are interested in.
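
To make the first point concrete, a minimal illustration (toy labels only) of how declaring an absent class via labels changes a macro-averaged metric:

from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 1, 0]

# Only the classes present in the data are averaged here.
print(f1_score(y_true, y_pred, average='macro'))
# Declaring class 2 via `labels` makes it contribute a zero to the average,
# giving a noticeably lower score.
print(f1_score(y_true, y_pred, labels=[0, 1, 2], average='macro', zero_division=0))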

@baluyotraf
Contributor

If that's the case, won't subclasses or a separate metrics module do fine? If I'm working on a classification problem and sklearn.metrics.classification exists, it's fairly intuitive that I should use the scorers in there, right? Similar to sklearn.metrics.cluster. Classes might also carry other scoring information that could help users select the metrics they want to use.

@jnothman
Member Author

It still doesn't satisfy the need for specifying labels, nor does it readily create scorers for search.

@chkoar
Contributor

chkoar commented Apr 4, 2022

The function get_applicable_scorers(y, pos_label='yes') for binary y might produce something like:

I suppose that most likely you will also need the model.
Having the model itself, you would be able to exclude/include scorers.

@jnothman
Member Author

jnothman commented Apr 5, 2022

Yes, you're right, the estimator might be useful to determine predict_proba or return_std support.
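
A minimal sketch of that idea, using plain hasattr checks (the helper name and the way scorer requirements are identified here are assumptions, not existing scikit-learn API):

def filter_scorers_for_estimator(scorers, estimator):
    """Drop scorers whose requirements the estimator cannot satisfy."""
    needs_proba = {'neg_log_loss', 'neg_brier_score_loss'}
    needs_threshold = {'roc_auc', 'average_precision'}
    supported = {}
    for name, scorer in scorers.items():
        if name in needs_proba and not hasattr(estimator, 'predict_proba'):
            continue
        if name in needs_threshold and not (
                hasattr(estimator, 'decision_function')
                or hasattr(estimator, 'predict_proba')):
            continue
        supported[name] = scorer
    return supported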

@DCoupry

DCoupry commented May 2, 2024

Bit stale, but still: the sklearn.utils.discovery module could be the best place for a function all_scorers([task_filter]) to live in.
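
A rough sketch of what that could look like, modeled on sklearn.utils.discovery.all_estimators; all_scorers, task_filter and the task registry below are hypothetical, while get_scorer and get_scorer_names are existing scikit-learn API:

from sklearn.metrics import get_scorer, get_scorer_names

# Hypothetical task metadata; scikit-learn does not currently tag scorers by task,
# so a real implementation would need such a registry.
_TASK_OF_SCORER = {
    'accuracy': 'classification',
    'roc_auc': 'classification',
    'r2': 'regression',
    'neg_mean_squared_error': 'regression',
    'adjusted_rand_score': 'clustering',
}

def all_scorers(task_filter=None):
    """Return (name, scorer) pairs, optionally restricted to one task."""
    names = get_scorer_names()
    if task_filter is not None:
        names = [n for n in names if _TASK_OF_SCORER.get(n) == task_filter]
    return [(name, get_scorer(name)) for name in names]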
