Added a module to prune boosted ensemble. #2
Conversation
A module to prune the boosted ensemble to obtain a user-defined number of decision trees/rules.
Sorry for the delay; I'm just now getting to look at this. Question: is there any instance where the number of boosting iterations to be pruned is less than the number of trials? How would that work? Can you
Thanks, Max
I hadn't responded due to some personal reasons. Sorry for the delay.
Ensemble techniques work by generating a set of classifiers and then using a voting scheme to classify the data in the test set. One disadvantage of ensemble techniques, however, is that they require a large amount of memory to store all of the classifiers that are generated, so there is a need to reduce memory consumption. See https://www.lri.fr/~sebag/Examens/Margineantu.pdf for a detailed explanation of the kappa pruning algorithm.

The kappa pruning algorithm has been implemented as follows: the kappa statistic K on the training set is calculated for every pair of classifiers produced by AdaBoost. Once all kappas are calculated, pairs of classifiers are chosen starting with the pair that has the lowest K, considering them in increasing order of K, until M classifiers are obtained.

The figure above shows that pruning improves performance over AdaBoost for the Chess, Adult, Car, Letter, Wine, and Forest-Cover data sets. The only data sets that show very bad behaviour are Adult and Breast-Cancer, which appear to be very unstable. Hence, in many cases, significant pruning does not hurt performance very much, while saving memory by reducing the number of classifiers.
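The selection procedure described above can be sketched in a few lines. This is not code from the PR (the actual implementation lives in the package's C sources); it is a language-agnostic illustration in Python, and the function names `pairwise_kappa` and `kappa_prune` are purely hypothetical.

```python
from itertools import combinations
import numpy as np

def pairwise_kappa(pred_a, pred_b, labels):
    """Cohen's kappa between two classifiers' predictions on the training set."""
    n = len(pred_a)
    idx = {lab: i for i, lab in enumerate(labels)}
    # contingency table: fraction of samples labelled labels[i] by A and labels[j] by B
    C = np.zeros((len(labels), len(labels)))
    for a, b in zip(pred_a, pred_b):
        C[idx[a], idx[b]] += 1
    C /= n
    theta1 = np.trace(C)                                    # observed agreement
    theta2 = float(np.sum(C.sum(axis=1) * C.sum(axis=0)))   # chance agreement
    if theta2 == 1.0:
        return 1.0
    return (theta1 - theta2) / (1.0 - theta2)

def kappa_prune(predictions, labels, m):
    """Keep m classifiers, taking pairs in increasing order of kappa
    (lowest kappa = most diverse pair first)."""
    pairs = [(pairwise_kappa(predictions[i], predictions[j], labels), i, j)
             for i, j in combinations(range(len(predictions)), 2)]
    keep = []
    for _, i, j in sorted(pairs):
        for c in (i, j):
            if c not in keep and len(keep) < m:
                keep.append(c)
        if len(keep) >= m:
            break
    return sorted(keep)
```

For example, given three classifiers where the first two always agree and the third always disagrees with them, pruning to M = 2 keeps one of the agreeing pair plus the dissenting classifier, since that pair has the lowest kappa.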
Thanks & Regards,
Navami, thanks for the follow-up. No big deal about the timing. A few things, though: your description in this thread is helpful, but you need to document it in the man file (or create a vignette with the details). You should also word it for readers who are not technical experts. Please sync before submitting the pull request. This one would remove functionality from the package (including contributors). Please remove these files in the PR:
Also, please make some test cases with the current version of the package and verify that you get the same results when the additional functionality is not used. I'm very interested in these changes but, to be honest, the PR is really sloppy and it makes me hesitant about merging it in.
Hi Max, I have already updated the .Rd man files (C5.0.Rd) in my previous commit. Could you have a look at it and see if I am missing something? I synced the repository to include the update in as.party.C5.0. I also removed the list of files you mentioned from the PR. I realize those files are build-related and specific to my setup, and hence should not be included; sorry for missing that earlier. Which additional functionality? I understand your concerns, and this is my teammates' and my first time raising a PR on GitHub, hence the errors. Any suggestions for improvement that you provide will be valuable to us. Navami
Hi Max, can you please update us on the status? Regards,
Sorry for the delay. I would really like to see more detail about the algorithm in the Rd file or, preferably, in a vignette or other document in the package. Please make them as clear as possible. Here is an example that you can use to understand the level of detail that we would need. The issue is that, in other packages, people have made extensive changes to code and not been available for bug fixes when they occur. You'll need to outline the details somewhere so that I and others do not have to pore through the C code to figure out what it is doing and where the issue is.
After running some examples, I think that a lot more details and explanation are needed.
For example: library(C50)
library(caret)
# data from github/topepo/recipes
library(recipes)
data("credit_data")
set.seed(1)
in_train <- createDataPartition(credit_data$Status, p = .5, list = FALSE)
credit_tr <- credit_data[ in_train,]
credit_te <- credit_data[-in_train,]
no_prune <- C5.0(x = credit_tr[, -1],
y = credit_tr$Status,
trials = 20,
control = C5.0Control(seed = 1))
no_prune_pred <- predict(no_prune, credit_te)
confusionMatrix(no_prune_pred, credit_te$Status)
pruned <- C5.0(x = credit_tr[, -1],
y = credit_tr$Status,
trials = 20,
prunem = 7,
control = C5.0Control(seed = 1))
pruned_pred <- predict(pruned, credit_te)
confusionMatrix(pruned_pred, credit_te$Status)
Closing due to no activity.
A module to prune the boosted ensemble using the kappa pruning algorithm to obtain a user-defined number of decision trees/rules.