Improve tutorial classification on imbalanced data #2169


Conversation

lorentzenchr
Contributor

@lorentzenchr lorentzenchr commented Dec 29, 2022

This PR improves the tutorial for classification on imbalanced data, https://www.tensorflow.org/tutorials/structured_data/imbalanced_data:

  • Add proper scoring rules
  • Add choice of threshold
  • Remove oversampling (considered a bad practice)

See also https://discuss.tensorflow.org/t/improvements-to-the-tutorial-classification-on-imbalanced-data/13520.

@github-actions

Preview

Preview and run these notebook edits with Google Colab: Rendered notebook diffs available on ReviewNB.com.

Format and style

Use the TensorFlow docs notebook tools to format for consistent source diffs and lint for style:
$ python3 -m pip install -U --user git+https://github.com/tensorflow/docs

$ python3 -m tensorflow_docs.tools.nbfmt notebook.ipynb
$ python3 -m tensorflow_docs.tools.nblint --arg=repo:tensorflow/docs notebook.ipynb
If commits are added to the pull request, synchronize your local branch: git pull origin improve_imbalanced_classification

@8bitmp3 8bitmp3 self-assigned this Dec 29, 2022
@8bitmp3 8bitmp3 added the 'review in progress' label Dec 29, 2022
@lorentzenchr
Contributor Author

@8bitmp3 Is there any chance to get some initial feedback?

Member

@MarkDaoust MarkDaoust left a comment


Sorry about the delay, a lot of people were out for the Christmas holidays.

Thanks for taking the time to make the PR. Generally I support these changes; we just have a few little things to discuss.

Mainly: I'm not convinced that removing the resampling example is the right approach here.

Yes, on the training data, resampling/reweighting almost never beats the straight classifier. On the validation set, resampling does show improvements compared to the baseline, and does much better than reweighting. Given that resampling is working better than reweighting, I'm against removing it.

Would it make sense to emphasize the cross entropy / log loss a bit more?

I don't think you should stop at CrossEntropy, because in many applications you do need to return a 0 or 1, and that has real-world values/costs, and those are what you care about. I think the right thing to emphasize is the PRC curve and the relative values/costs of the different types of errors.

"#### Metrics for probability predictions\n",
"\n",
"As we train our network with the cross entropy as a loss function, it is fully capable of predicting class probabilities, i.e. it is a probabilistic classifier.\n",
"Metrics that assess probabilistic predictions and that are, in fact, **proper scoring rules** are:\n",
Member


proper scoring rules

This is the first time I've seen this term; if it's worth mentioning, we should give a brief description of what it means and why it's important.

Contributor Author


I'll try to add a single sentence. Under "Read more" there is a canonical reference. On top of that, I can recommend reading https://arxiv.org/abs/0912.0902 (knowing that scoring rules and scoring functions coincide for binary classification).
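
To make this concrete, a quick sketch of the two proper scoring rules I have in mind, log loss (cross entropy) and the Brier score. Hypothetical numbers; scikit-learn is used here only for illustration and is not part of the tutorial:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

# Hypothetical labels and predicted class probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.40, 0.80, 0.65, 0.20, 0.90])

# Both are proper scoring rules: in expectation they are minimized by the
# true conditional class probabilities, so they reward calibrated
# probabilistic predictions, not just a good ranking.
print("log loss   :", log_loss(y_true, y_prob))
print("Brier score:", brier_score_loss(y_true, y_prob))
```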

"\n",
"#### Other metrices\n",
"\n",
"The following metrics take into account all possible choices of thresholds $t$, but they are not proper scoring rules and only assess the ranking of predictions, not their absolute values.\n",
Member


This sentence is hard to understand without a little more context on "proper scoring rules".

Contributor Author


Do you have a suggestion?
I added one sentence for proper scoring rules above. Then this clearly says that AUC only assesses the ranking of predictions. Put differently: the best AUC does not guarantee that the predicted probabilities are close to the true ones.

Member


Good call introducing proper scoring earlier.

only assess the ranking of predictions, not their absolute values.

I don't understand this. Which ranking & values are we talking about?

The best AUC does not guarantee that the predicted probabilities are close to the true ones.

Right, but if all you want is a deterministic classifier, we don't care about the true probabilities.

@lorentzenchr
Contributor Author

lorentzenchr commented Jan 14, 2023

@MarkDaoust Thanks for looking into this PR and your feedback.

Probabilistic classifier

Would it make sense to emphasize the cross entropy / log loss a bit more?

I don't think you should stop at CrossEntropy, because in many applications you do need to return a 0 or 1, and that has real-world values/costs, and those are what you care about. I think the right thing to emphasize is the PRC curve and the relative values/costs of the different types of errors.

I would divide it into 2 steps. The first is modelling: find a good probabilistic classifier. The statistical forecasting literature clearly states that this is to be preferred over deterministic ones. The second step is then to make a decision, i.e. predict 0 or 1, given the predicted class probability. Note that, given a good probabilistic classifier, there does not exist a (systematically/in expectation) better decision than one based on it.

Without knowing the true costs (or cost ratio), the best one can do is, like in this tutorial, to demonstrate different thresholds and plot ROC curves.
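
To sketch the decision step with hypothetical costs (illustration only, not from the tutorial): if the costs of the two error types were known, the expected-cost-minimizing threshold for a calibrated probabilistic classifier would follow directly from the cost ratio.

```python
import numpy as np

# Hypothetical misclassification costs (illustration only).
cost_fp = 1.0   # cost of flagging a legitimate transaction
cost_fn = 20.0  # cost of missing a fraudulent transaction

# Predicting "positive" has lower expected cost whenever
# cost_fp * (1 - p) <= cost_fn * p, i.e. p >= cost_fp / (cost_fp + cost_fn).
threshold = cost_fp / (cost_fp + cost_fn)

y_prob = np.array([0.01, 0.03, 0.10, 0.60, 0.95])  # predicted probabilities
y_pred = (y_prob >= threshold).astype(int)          # hard 0/1 decisions
print(f"threshold = {threshold:.3f}", y_pred)
```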

In this regard, I noticed that the EarlyStopping callback is monitoring the PRC AUC ('val_prc'). I think this is not the best choice, and indeed, setting it to 'val_loss' gives better results in the end (with a lower patience=5 instead of 10 to prevent overfitting). I would like to change this, too.
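
Roughly, the callback I have in mind looks like this (a sketch; the exact arguments would need to be aligned with the rest of the tutorial):

```python
import tensorflow as tf

# Stop on the validation loss (the proper scoring rule we train on) instead
# of 'val_prc', with a shorter patience to limit overfitting.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,
    mode='min',
    restore_best_weights=True)
```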

I also think that the differences in the final results, in particular for the over-sampling case, are due to estimation uncertainty, i.e. due to chance and not systematic (confidence intervals would prove it).

@lorentzenchr
Contributor Author

@MarkDaoust Any chance to get this merged?

@MarkDaoust
Member

Thanks for the ping, I'll give it a final look and try to get it merged.

@github-actions github-actions bot added the 'lgtm' label Feb 24, 2023
MarkDaoust
MarkDaoust previously approved these changes Feb 24, 2023
@MarkDaoust MarkDaoust added the 'ready to pull' label and removed the 'review in progress' label Feb 24, 2023
@lorentzenchr
Contributor Author

lorentzenchr commented Feb 24, 2023

IMHO, 885aead drops an important piece of information: "AUC and AUPRC only assess the ranking of predictions, not their absolute values", i.e. they are insensitive to (bad) calibration. That's a real deficiency of those metrics.

@MarkDaoust
Member

IMHO, 885aead drops an important piece of information: "they only assess the ranking of predictions, not their absolute values", i.e. they are insensitive to (bad) calibration. That's a real deficiency of those metrics.

Thanks for the feedback. Could you help clarify and give a little more detail here? What do you mean concretely?

"they only assess the ranking of predictions, not their absolute values"

I'm stumbling here on the fact that an AUPRC of 1.0 includes a perfect deterministic classifier, and a random classifier would give .. 0.5? 0.0? Those seem like absolute reference points to me.

insensitive to (bad) calibration

I'm still lost here, can you give an example?

@lorentzenchr
Contributor Author

Let's concentrate on AUC: if you add a constant to (or multiply by a positive constant) the probability predictions of a model, the AUC does not change. More visually: AUC tells nothing about a reliability diagram, which assesses (auto-)calibration.
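
A small numerical sketch of what I mean (hypothetical numbers; scikit-learn only for illustration):

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.30, 0.90])

# Multiplying by a positive constant preserves the ranking, hence the AUC,
# but the absolute probabilities are now badly calibrated.
y_scaled = 0.5 * y_prob

print("AUC      original vs. scaled:",
      roc_auc_score(y_true, y_prob), roc_auc_score(y_true, y_scaled))
print("log loss original vs. scaled:",
      log_loss(y_true, y_prob), log_loss(y_true, y_scaled))
```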
References off the top of my head (you'll quickly notice why they are in my head):

@MarkDaoust MarkDaoust requested a review from a team as a code owner March 7, 2023 12:23
@MarkDaoust MarkDaoust added the 'ready to pull' label and removed the 'ready to pull' label Mar 7, 2023
@copybara-service copybara-service bot merged commit 775470f into tensorflow:master Mar 9, 2023
@lorentzenchr lorentzenchr deleted the improve_imbalanced_classification branch March 12, 2023 10:24