Improve tutorial classification on imbalanced data #2169
Conversation
- Add proper scoring rules
- Add choice of threshold
- Remove oversampling (considered to be a bad practice)
Preview

Preview and run these notebook edits with Google Colab: Rendered notebook diffs available on ReviewNB.com.

Format and style

Use the TensorFlow docs notebook tools to format for consistent source diffs and lint for style:

$ python3 -m pip install -U --user git+https://github.com/tensorflow/docs

If commits are added to the pull request, synchronize your local branch: git pull origin improve_imbalanced_classification
@8bitmp3 Is there any chance to get some initial feedback?
Sorry about the delay, a lot of people were out for the Christmas holidays.
Thanks for taking the time to make the PR. Generally I support these changes, we just have a few little things to discuss.
Mainly: I'm not convinced that removing the resampling example is the right approach here.
Yes, on the training data resampling/reweighting almost never beats the straight classifier. On the validation set, resampling does show improvements compared to the baseline, and does much better than reweighting. Given that resampling is working better than reweighting, I'm against removing it.
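For reference, a minimal sketch of the two approaches being compared: class reweighting via `class_weight` and oversampling via `tf.data` sampling. The data and the commented-out `model.fit` calls are placeholders rather than the tutorial's actual code, and on older TensorFlow versions `sample_from_datasets` lives under `tf.data.experimental`:

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-ins for the training data (illustrative only).
rng = np.random.default_rng(0)
train_features = rng.normal(size=(10_000, 8)).astype("float32")
train_labels = rng.binomial(1, 0.02, size=10_000)

# --- Reweighting: pass per-class weights to Model.fit ---
neg, pos = np.bincount(train_labels, minlength=2)
total = neg + pos
class_weight = {0: total / (2.0 * neg), 1: total / (2.0 * pos)}
# model.fit(train_features, train_labels, class_weight=class_weight, ...)  # model not defined here

# --- Resampling: draw positives and negatives with equal probability ---
pos_ds = tf.data.Dataset.from_tensor_slices(
    (train_features[train_labels == 1], train_labels[train_labels == 1])).repeat()
neg_ds = tf.data.Dataset.from_tensor_slices(
    (train_features[train_labels == 0], train_labels[train_labels == 0])).repeat()
balanced_ds = tf.data.Dataset.sample_from_datasets(
    [pos_ds, neg_ds], weights=[0.5, 0.5]).batch(2048)
# model.fit(balanced_ds, steps_per_epoch=..., ...)  # model not defined here
```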
Would it make sense to emphasize the cross entropy / log loss a bit more?
I don't think you should stop at cross entropy, because in many applications you do need to return a 0 or 1, and that decision has real-world values/costs, and those are what you care about. I think the right thing to emphasize is the precision-recall curve (PRC) and the relative values/costs of the different types of errors.
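As an illustration of weighing the two error types against each other, here is a sketch on synthetic data: it traces the precision-recall curve and picks the threshold that minimizes an assumed, purely illustrative cost ratio. Nothing below is from the tutorial or the PR:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
# Synthetic stand-ins for validation labels and predicted probabilities.
y_true = rng.binomial(1, 0.05, size=10_000)
y_prob = np.clip(0.05 + 0.4 * y_true + 0.1 * rng.normal(size=y_true.shape), 0.0, 1.0)

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# Relative costs of the two error types (illustrative numbers only).
cost_fp, cost_fn = 1.0, 10.0

costs = []
for t in thresholds:
    y_hat = (y_prob >= t).astype(int)
    fp = int(((y_hat == 1) & (y_true == 0)).sum())
    fn = int(((y_hat == 0) & (y_true == 1)).sum())
    costs.append(cost_fp * fp + cost_fn * fn)

best = int(np.argmin(costs))
print(f"cost-minimizing threshold ~ {thresholds[best]:.3f} "
      f"(precision {precision[best]:.2f}, recall {recall[best]:.2f})")
```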
"#### Metrics for probability predictions\n", | ||
"\n", | ||
"As we train our network with the cross entropy as a loss function, it is fully capable of predicting class probabilities, i.e. it is a probabilistic classifier.\n", | ||
"Metrics that assess probabilistic predictions and that are, in fact, **proper scoring rules** are:\n", |
proper scoring rules
This is the first time I've seen this term; if it's worth mentioning, we should give a brief description of what it means and why it's important.
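In short, a scoring rule is proper if its expected value is optimized when the forecast equals the true class probability, so it rewards honest, well-calibrated probability forecasts. A small sketch on synthetic data, using scikit-learn's `log_loss` and `brier_score_loss` for brevity (nothing here is taken from the tutorial diff):

```python
import numpy as np
from sklearn.metrics import log_loss, brier_score_loss

rng = np.random.default_rng(1)
p_true = rng.uniform(0.01, 0.3, size=50_000)   # true event probabilities
y = rng.binomial(1, p_true)                    # observed labels

honest = p_true                 # forecast equals the true probability
overconfident = 2.0 * p_true    # systematically miscalibrated forecast

for name, pred in [("honest", honest), ("overconfident", overconfident)]:
    print(f"{name:>13}: log loss {log_loss(y, pred):.4f}, "
          f"Brier score {brier_score_loss(y, pred):.4f}")
# In expectation, both proper scoring rules favor the honest forecast.
```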
I'll try to add a single sentence. Under "Read more" there is a canonical reference. On top of that, I can recommend reading https://arxiv.org/abs/0912.0902 (knowing that scoring rules and scoring functions coincide for binary classification).
"\n",
"#### Other metrics\n",
"\n",
"The following metrics take into account all possible choices of the threshold $t$, but they are not proper scoring rules and only assess the ranking of predictions, not their absolute values.\n",
This sentence is hard to understand without a little more context on "proper scoring rules".
Do you have a suggestion?
I added one sentence for proper scoring rules above. Then this clearly says that AUC only assesses the ranking of predictions. Put differently: the best AUC does not guarantee predictions close to the true probabilities.
Good call introducing proper scoring earlier.
only assess the ranking of predictions, not their absolute values.
I don't understand this. Which ranking & values are we talking about?
the best AUC does not guarantee predictions close to the true probabilities.
Right, but if all you want is a deterministic classifier, you don't care about the true probabilities.
@MarkDaoust Thanks for looking into this PR and for your feedback.

Probabilistic classifier

I would divide it into two steps. The first is modelling: find a good probabilistic classifier. The statistical forecasting literature clearly states that this is to be preferred over a deterministic one. The second step is then to make a decision, i.e. predict 0 or 1, given the predicted class probability. Note that, given a good probabilistic classifier, there does not exist a (systematically / in expectation) better decision than one based on it. Without knowing the true costs (or the cost ratio), the best one can do is, as in this tutorial, to demonstrate different thresholds and plot ROC curves.

I also think that the differences in the final results, in particular for the over-sampling case, are due to estimation uncertainty, i.e. due to chance rather than systematic (confidence intervals would prove it).
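A sketch of that second, decision step under an assumed cost ratio: for a well-calibrated probabilistic classifier, thresholding the predicted probability at c_fp / (c_fp + c_fn) minimizes the expected cost. All numbers below are illustrative and not part of the PR:

```python
import numpy as np

def decide(p_hat: np.ndarray, c_fp: float, c_fn: float) -> np.ndarray:
    """Turn predicted class probabilities into 0/1 decisions.

    Predict 1 whenever the expected cost of deciding 0 (a potential false
    negative, p_hat * c_fn) exceeds the expected cost of deciding 1 (a
    potential false positive, (1 - p_hat) * c_fp), i.e. whenever
    p_hat >= c_fp / (c_fp + c_fn).
    """
    threshold = c_fp / (c_fp + c_fn)
    return (p_hat >= threshold).astype(int)

# Illustrative use: a missed positive is 20x as costly as a false alarm.
p_hat = np.array([0.01, 0.04, 0.2, 0.7])
print(decide(p_hat, c_fp=1.0, c_fn=20.0))  # threshold = 1/21 ~ 0.048, so this prints [0 0 1 1]
```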
@MarkDaoust Any chance to get this merged?
Thanks for the ping, I'll give it a final look and try to get it merged.
"\n", | ||
"#### Other metrices\n", | ||
"\n", | ||
"The following metrics take into account all possible choices of thresholds $t$, but they are not proper scoring rules and only assess the ranking of predictions, not their absolute values.\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call introducing proper scoring earlier.
only assess the ranking of predictions, not their absolute values.
I don't understand this. Which ranking & values are we talking about?
Best AUC does not guarantee to be close to the true probabilities.
Right, but if all you want is a deterministic classifier, we don't care about the true probabilities.
IMHO, 885aead drops an important piece of information: "AUC and AUPRC only assess the ranking of predictions, not their absolute values", i.e. they are insensitive to (bad) calibration. That's a real deficiency of those metrics.
Thanks for the feedback. Could you help clarify and give a little more detail here? What do you mean, concretely?
I'm stumbling on the fact that an AUPRC of 1.0 corresponds to a perfect deterministic classifier, and a random classifier would give... 0.5? 0.0? Those seem like absolute reference points to me.
I'm still lost here, can you give an example?
Let's concentrate on AUC: if you add a constant to (or multiply by a positive constant) a model's probability predictions, its AUC does not change. Put more visually, AUC tells you nothing about a reliability diagram, which assesses (auto-)calibration.
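A quick numerical illustration of this point on synthetic data: a strictly monotone transformation of the predicted probabilities leaves the ROC AUC unchanged, while a proper scoring rule such as the Brier score flags the resulting miscalibration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(42)
p_true = rng.uniform(0.0, 0.5, size=100_000)
y = rng.binomial(1, p_true)

calibrated = p_true             # well-calibrated predictions
distorted = 0.5 * p_true + 0.4  # strictly monotone shift/scale -> miscalibrated

print("AUC  :", roc_auc_score(y, calibrated), roc_auc_score(y, distorted))
print("Brier:", brier_score_loss(y, calibrated), brier_score_loss(y, distorted))
# The two AUC values are identical; the Brier score exposes the miscalibration.
```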
This PR improves the tutorial for classification on imbalanced data, https://www.tensorflow.org/tutorials/structured_data/imbalanced_data.
See also https://discuss.tensorflow.org/t/improvements-to-the-tutorial-classification-on-imbalanced-data/13520.