add checkCondtionalX to preProc of train()? #334

Closed
blueskypie opened this issue Nov 28, 2015 · 10 comments
@blueskypie commented Nov 28, 2015

It would reduce the chance of naive Bayes crashing.

Thanks a lot!

@blueskypie changed the title from "add nzv and checkCondtionalX to preProc of train()?" to "add checkCondtionalX to preProc of train()?" on Nov 28, 2015
@topepo (Owner) commented Nov 29, 2015

nzv is already available. checkConditionalX shouldn't be too hard to add.
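For reference, here is a base-R sketch of the kind of check caret's checkConditionalX() performs: flagging predictors whose distribution is degenerate within at least one class. The helper name and exact criterion below are illustrative, not caret's actual implementation.

```r
# Flag predictors that are constant (zero variance) within at least one class.
# This mimics, in spirit, what caret's checkConditionalX() looks for;
# the function name and the "single unique value" criterion are mine.
degenerate_within_class <- function(x, y) {
  bad <- sapply(x, function(col) {
    any(tapply(col, y, function(v) length(unique(v)) == 1))
  })
  which(bad)
}

d <- data.frame(a = c(1, 1, 2, 3),   # constant within class "u"
                b = c(1, 2, 3, 4))   # varies within both classes
cls <- factor(c("u", "u", "v", "v"))
degenerate_within_class(d, cls)      # flags column 1 ("a")
```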

@blueskypie (Author) commented Nov 30, 2015

Thanks a lot! If this is added to preProc, I hope that predictors with an empty conditional distribution will be automatically removed from the training data. May I get a note when this is implemented?

@topepo (Owner) commented Nov 30, 2015

I'll update this thread as it happens.

topepo added a commit that referenced this issue Nov 30, 2015
@topepo (Owner) commented Nov 30, 2015

That should do it. If you could do some testing, that would be a big help.

@blueskypie (Author) commented Dec 2, 2015

Thanks for the quick response, and sorry for the slow testing! I'm not familiar with GitHub, and it also took me hours to install devtools on my CentOS machine because of many missing dependencies.

Anyway, I finally installed it successfully and verified that the caret version is caret_6.0-63. But here is my question: is the warning message below from nb due to the empty conditional distribution of a predictor? If it is, adding preProc = c("center", "scale", "conditionalX") to train does not remove the warning.

48: In FUN(X[[i]], ...) :
Numerical 0 probability for all classes with observation 5

Here is my code

rm(list = ls())

library(caret)
library(klaR)
data(mdrr)
set.seed(107)

preds1 <- mdrrDescr[1:50, 1:10]
resps1 <- mdrrClass[1:50]
table(resps1)

data1 <- cbind(preds1, Class = resps1)

inTrain <- createDataPartition(y = data1$Class, p = 0.75, list = FALSE)

str(inTrain)

training <- data1[inTrain, ]
testing  <- data1[-inTrain, ]
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 3,
                     number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
nbFit <- train(Class ~ .,
               data = training,
               method = "nb",
               tuneLength = 15,
               trControl = ctrl,
               metric = "ROC",
               preProc = c("center", "scale", "conditionalX"))

nbFit2 <- NaiveBayes(Class ~ ., data = training[-(20:25), ])
@topepo (Owner) commented Dec 2, 2015

That is a warning generated by the underlying naive Bayes code. Basically, it wants you to know that, for one sample, all of the elements that make up the posterior probability (besides the prior) are zero.
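To illustrate in base R (this is not the klaR internals, just the arithmetic behind the warning): the naive Bayes numerator for a class is a product of class-conditional densities, and a single far-out predictor value can drive the whole product to numerical zero. The means and observations below are made up for illustration.

```r
# Class-conditional model: each predictor ~ N(mean, sd) within the class.
means <- c(0, 0, 0)
sds   <- c(1, 1, 1)

# A typical observation: the product of densities is small but positive.
x_ok <- c(0.5, -0.3, 1.1)
prod(dnorm(x_ok, means, sds))

# One extreme coordinate underflows the product to exactly 0 in double
# precision -- the situation behind "Numerical 0 probability for all classes".
x_far <- c(0.5, -0.3, 60)
prod(dnorm(x_far, means, sds))  # 0
```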

@blueskypie (Author) commented Dec 2, 2015

But shouldn't conditionalX remove such a predictor before the data are fed to nb?

@topepo (Owner) commented Dec 2, 2015

No, I don't think that it's related to this. These predictors are all continuous (some with fairly degenerate distributions), and the density is zero where one of the holdout samples falls.

@blueskypie (Author) commented Dec 2, 2015

Thanks for the quick response! I am not very good at statistics, so please bear with me if my questions are naive. If I partition all samples into testing and training (the latter including validation and sub-training):

  1. Are you saying the sample in the warning message comes from the validation set?
  2. Does "the density is zero where one of the holdout samples falls" mean the sample has zero variance?
  3. Is preProc applied to the training set or the sub-training set?
@topepo (Owner) commented Mar 12, 2016

The warning most likely occurs when samples are being predicted (perhaps the holdout during resampling).

There could be a zero value from a perfectly good density. If your data are X ~ N(0,1), the density for X = 10^10 would be numerically zero since it is a highly unlikely value.
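That underflow is easy to see directly in R (the true N(0,1) density is strictly positive everywhere, but it falls below the smallest representable double):

```r
dnorm(0)      # ~0.3989, the N(0,1) density at the mean
dnorm(10)     # ~7.7e-23, tiny but still representable
dnorm(1e10)   # 0 -- the true density is positive but underflows double precision
```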

During resampling, pre-processing statistics are computed on the same data being used to fit the model. It is applied to the same data as well as the holdout. After resampling, the statistics are computed on the entire training set and applied to any data that are subsequently predicted.
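A base-R sketch of that order of operations, with centering/scaling standing in for any preProc step (caret's preProcess() does the equivalent internally; the data here are simulated for illustration):

```r
set.seed(1)
analysis <- matrix(rnorm(20, mean = 5), ncol = 2)  # data used to fit the model
holdout  <- matrix(rnorm(10, mean = 5), ncol = 2)  # resampling holdout

# The pre-processing statistics come from the analysis set only...
ctr <- colMeans(analysis)
scl <- apply(analysis, 2, sd)

# ...and are applied both to the analysis set and to the holdout.
analysis_pp <- scale(analysis, center = ctr, scale = scl)
holdout_pp  <- scale(holdout,  center = ctr, scale = scl)

colMeans(analysis_pp)  # ~0 by construction
colMeans(holdout_pp)   # near 0, but not exactly -- different data, same statistics
```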

I'm going to close this since it is not about this particular bug anymore.

@topepo closed this Mar 12, 2016