
Fixing bug in positive class selection used by makePredObsMatrix #190

Merged (13 commits) — Feb 16, 2016

Conversation

eric-czech
Contributor

This commit contains a fix for an error in makePredObsMatrix (line 241) where the following code was being used to select the positive class for classification problems:

positive <- as.character(unique(modelLibrary$obs)[2]) #IMPROVE THIS!

The problem is that unique returns values in order of first appearance, so a different positive class was potentially being selected for each resample (which seems like a rather large bug if I'm not mistaken). For example:

unique(factor(c('positive', 'negative')))[2] # = 'negative'
unique(factor(c('negative', 'positive')))[2] # = 'positive'
# Given the above, the choice of positive class would be dependent on the order of the response data itself

I made a change to choose that class like this instead: positive <- levels(modelLibrary$obs)[2]
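By contrast, levels() returns the factor's level set, which by default is alphabetically sorted and independent of the order values appear in the data; a quick sketch in plain R:

```r
# levels() is stable under reordering of the data, unlike unique():
levels(factor(c("positive", "negative")))[2]  # "positive"
levels(factor(c("negative", "positive")))[2]  # "positive"
```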

From a broader perspective, it would be nice if caretEnsemble maintained consistency with caret in that the first level in binary outcomes is treated as the positive class by default. I see hard-coded selections for the second class's probability predictions in predict.{caretList, caretStack} which might not be too hard to override with a more explicit selection of the positive class (or to change to the first class), but I'd be curious to hear what your thoughts are on that. Perhaps it could be some sort of global configuration option rather than a parameter that needs to be passed in multiple places? I dunno. I'm happy to help with any of it if you have suggestions.
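As a sketch of what a more explicit selection could look like (fit, test_data, and train_data here are hypothetical placeholders, not the actual caretEnsemble internals):

```r
# Hypothetical sketch: pick the positive class explicitly, then select the
# matching probability column by name rather than via a hard-coded [, 2]:
positive <- levels(train_data$outcome)[1]  # caret treats the first level as positive
probs <- predict(fit, newdata = test_data, type = "prob")
pred <- probs[, positive]
```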

@zachmayer
Owner

From a broader perspective, it would be nice if caretEnsemble maintained consistency with caret in that the first level in binary outcomes

Overall, this seems like a good idea. A few comments:

  1. Why don't you change predict.caretList and predict.caretStack to use the first level as the positive level?
  2. Please add some unit tests to ensure the makePredObsMatrix and the predict functions are using the correct positive levels.
  3. Please fix the lint that the lintr bot found.

Other than that, this seems like a good idea. That todo has been on the list for a while...

@eric-czech
Contributor Author

Alright, sorry for the build errors. I didn't realize lintr wasn't running on my machine during the unit tests (I had to install it first, whoops).

Anyway, the changes to caretList and caretEnsemble were pretty minor, and the only previous test they conflicted with was in test-ensemble.R, where an expected probability prediction is now for a different class (i.e. the new value is approximately 1 minus the old value).

Let me know if you see any issues with those changes or with the new unit test (I'd be happy to move it into an existing test if you don't think it makes sense on its own).

@eric-czech
Contributor Author

@zachmayer What would you think of making the above configurable with a global option like: options(caret.ensemble.def.bin.level=1) # or 2

Now that it's clear what needs to be changed to make the default one or the other, it would be easy to add that switch. But I don't want to do that unless you're OK with the changes so far and don't know of a better way to handle global configuration settings.

@zachmayer
Owner

@eric-czech sure, making it a global option works for me. Make the default be the same one as the caret package uses by default

data(Y.class)

#############################################################################
context("Do classifier predictions use the correct outcome classes?")
Owner

I would run this test twice, once where the positive level is "yes" and once where the positive level is "no", and check that no re-arrangement happens, regardless of the alphabetical sorting of the target.

@zachmayer
Owner

@eric-czech Actually, I'm happy with the PR as-is. No need to add a global option, but I'd like you to consider adding 1 more test.

@eric-czech
Contributor Author

@zachmayer No problem, I added the configuration option as well as some extra tests. I didn't see any reason to test problems with alphabetical ordering at first, but I'm glad you suggested it. I ultimately found that while the alphabetical ordering of the level names doesn't affect the results of caret or caretEnsemble models directly, it does change results indirectly through the stratified CV splits caret creates (those ARE dependent on the factor names ... which is annoying).

In other words, if you let the models use caret::createFolds, they produce different results depending on what names you give the factor levels, because the stratification is based on the output of base::table, which orders factor names alphabetically. The results initially differed because of that (not by much), but by creating the CV splits manually I was able to make sure the factor names don't affect the results in the tests.
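The base::table behavior is easy to see in plain R:

```r
# table() tabulates by factor level, and levels default to alphabetical order
# regardless of the order values appear in the data:
table(factor(c("b", "a", "b")))
# a b
# 1 2
```

Creating the resampling indices once and passing them via trainControl(index = ...) pins the splits down, so relabeling the factor cannot change them.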

#' but that value can be overridden using global options (e.g.
#' \code{options(caret.ensemble.target.bin.level=2)}).
getBinaryLevel <- function() {
  value <- as.numeric(getOption("caret.ensemble.target.bin.level", default = 1))
Owner

I'd do default = 1L.

@zachmayer
Owner

I like getBinaryLevel, thanks for writing that. What do you think about an exported function called setBinaryLevel that's a friendly way to set the option in caretEnsemble? This function could coerce the input to an integer (with a warning if it's not already an integer), and then check that it's equal to 1 or 2.
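A possible sketch of that setter (a proposal as described above, not the merged implementation):

```r
# Proposed setBinaryLevel: coerce to integer (with a warning if coercion was
# needed), check the result is 1 or 2, then store it in the global option.
setBinaryLevel <- function(level) {
  if (!is.integer(level)) {
    warning("Coercing binary target level to an integer")
    level <- as.integer(level)
  }
  stopifnot(length(level) == 1, level %in% c(1L, 2L))
  options(caret.ensemble.target.bin.level = level)
  invisible(level)
}
```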

@zachmayer
Owner

Once we have this merged in here, it might actually make sense to add this option and logic to the caret package itself. I really like the approach.

# order of the factor levels in the response (which is what this test module
# needs to prove invariance to). This happens because createFolds uses
# the base table function to create class frequency counts and that table
# command sorts results alphabetically.
Owner

You should file a bug report with example code on the caret github repo

@zachmayer
Owner

A couple more comments, but this is looking really good.

Do you know how to squash git commits? If so, you should squash this PR into one commit. If not, don't worry about it =D.

@eric-czech
Contributor Author

Ok @zachmayer, I added a getter and setter for the target level, along with coercion of the argument to an integer (as well as a few other things, like docs and using the getter in the test cases to reset the target level).

The stratified sampling issue is up there too so we'll see if anyone has suggestions on it.

I also gave rebasing a try but kind of shot myself in the foot by changing a lot of the same things across those commits (I also kind of screwed it up by repeating some commits .. my git-fu is weak). Anyway, I have another feature enhancement I was going to submit (for using custom models), and assuming that turns into multiple commits as well, I'll be careful to rebase more frequently. My bad.

@zachmayer
Owner

No worries! This PR is fine. In the future, you can "squash" multiple commits into a single commit using an "interactive rebase":
http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html
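For reference, the basic recipe looks like this (the commit count here is illustrative):

```shell
# Interactively rebase the last 4 commits:
git rebase -i HEAD~4
# In the editor, keep "pick" on the first commit and change the rest to
# "squash" (or "s"), then save and close. Since the branch was already
# pushed, replace it on the remote afterwards:
git push --force
```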

#' and that the resulting integer is in \{1, 2\}.
#' @param arg argument to potentially be used as new target level
#' @return Binary target level (as integer equal to 1 or 2)
validateBinaryTargetLevel <- function(arg) {
Owner

This gives weird results if you pass a vector of length > 1, e.g. validateBinaryTargetLevel(1:10) or validateBinaryTargetLevel(letters). You should probably add something like stopifnot(length(arg) == 1). I'll fix this after merging, since it's very minor.
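A minimal sketch of the validator with that guard added (illustrative only; the merged version may differ):

```r
# Reject non-scalar input up front, then coerce and check the allowed values:
validateBinaryTargetLevel <- function(arg) {
  stopifnot(length(arg) == 1)
  value <- suppressWarnings(as.integer(arg))
  stopifnot(!is.na(value), value %in% c(1L, 2L))
  value
}

validateBinaryTargetLevel(2)       # 2L
# validateBinaryTargetLevel(1:10)  # now errors instead of giving weird results
```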

zachmayer added a commit that referenced this pull request Feb 16, 2016
Fixing bug in positive class selection used by makePredObsMatrix
@zachmayer zachmayer merged commit 66956d8 into zachmayer:master Feb 16, 2016
@zachmayer
Owner

Also note that after rebasing, you ALWAYS have to git push -f to replace your branch on github with your new one.

@eric-czech
Contributor Author

Cool, I think that's where I went wrong. I couldn't figure out how to recover from not having force-pushed, and doing another rebase resulted in tons of conflicts.

@zachmayer
Owner

@eric-czech I fixed the rebase on my branch, and then force pushed to my own master.

Unfortunately, this kinda messed up your fork of my repo. You'll have to replace your master branch with mine:
http://scribu.net/blog/resetting-your-github-fork.html

git remote add upstream git@github.com:zachmayer/caretEnsemble.git
git fetch upstream
git branch backup
git checkout upstream/master -B master
git push --force

@eric-czech
Contributor Author

Thanks! Looks much better now and that reset on my fork seems to have worked well.

@zachmayer
Owner

👍
