
Added a module to prune boosted ensemble. #2

Closed
wants to merge 6 commits into from

Conversation

navamikairanda

A module to prune the boosted ensemble using the kappa pruning algorithm to obtain a user-defined number of decision trees/rules.

A module to prune the boosted ensemble to obtain a user-defined number of decision trees/rules.
@topepo
Owner

topepo commented Jun 27, 2015

Sorry for the delay; I'm just now getting to look at this.

Question: is there any instance where the number of boosting iterations to be pruned is less than the number of trials? How would that work?

Can you

  • update the Rd files with the proper syntax and an explanation as to how the pruning works?
  • provide some test files (either as part of the package or not) to show that the pruning works under different options?

Thanks,

Max

@navamikairanda
Author

I hadn't responded due to some personal reasons. Sorry for the delay.
To answer your queries:

  1. The number of boosting iterations to be pruned is always less than or equal to the number of trials.
    The explanation of the boost pruning process is as follows:

Ensemble techniques work by generating a set of classifiers and then using a voting measure to classify the data in the test set. However, one of the disadvantages of ensemble techniques is that a large amount of memory is required to store all the classifiers that are generated. Hence, there is a need to reduce the memory consumption.
This can be achieved by selecting a subset of classifiers that produces performance comparable to that obtained when all the boosted classifiers are considered. Several algorithms, such as Early Stopping, KL-Divergence Pruning, Kappa Pruning, and Reduced-Error Pruning with backfitting, can be used to find such a subset. In the module that we have developed, Kappa Pruning is the algorithm used to find the subset of classifiers.

Refer to https://www.lri.fr/~sebag/Examens/Margineantu.pdf for a detailed explanation of the Kappa Pruning algorithm.

The Kappa Pruning algorithm has been implemented as follows. The kappa statistic K is calculated on the training set for every pair of classifiers produced by AdaBoost. Once all kappas are calculated, pairs of classifiers are chosen starting with the pair that has the lowest K, considering pairs in increasing order of K, until M classifiers are obtained.
The source code of the C50 package in R was modified to prune the boosted classifiers. The Kappa Pruning algorithm was applied to T boosted classifiers to obtain M pruned boosted classifiers (M <= T).
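For illustration only, the selection step described above can be sketched as follows. This is a minimal Python sketch, not the C code in this PR; the names `kappa` and `kappa_prune` are made up for the example, and predictions are assumed to be plain label lists.

```python
from itertools import combinations

def kappa(pred_a, pred_b, labels):
    """Kappa statistic measuring agreement between two classifiers:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(pred_a)
    # theta1: observed rate of agreement between the two prediction vectors
    theta1 = sum(a == b for a, b in zip(pred_a, pred_b)) / n
    # theta2: agreement expected by chance from each classifier's label rates
    theta2 = sum((pred_a.count(l) / n) * (pred_b.count(l) / n) for l in labels)
    return (theta1 - theta2) / (1 - theta2)

def kappa_prune(predictions, labels, m):
    """Select m classifiers: walk pairs in increasing order of kappa
    (most diverse, i.e. lowest-agreement, pairs first) until m are chosen."""
    pairs = sorted(
        combinations(range(len(predictions)), 2),
        key=lambda ij: kappa(predictions[ij[0]], predictions[ij[1]], labels),
    )
    chosen = []
    for i, j in pairs:
        for idx in (i, j):
            if idx not in chosen and len(chosen) < m:
                chosen.append(idx)
        if len(chosen) == m:
            break
    return chosen

# Three toy classifiers: two identical, one that always disagrees.
preds = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0]]
print(kappa_prune(preds, [0, 1], 2))  # picks the most diverse pair: [0, 2]
```

The pruned ensemble keeps the most diverse members first, which is the intuition behind preferring low-kappa pairs.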
The modified C5.0 (MC5.0) was tested on ten data sets, all taken from the UCI Repository. AdaBoost with MC5.0 was run on each data set to generate 100 classifiers. The Kappa Pruning technique was then evaluated by generating sets of 20, 40, 60, 80, and 100 classifiers, corresponding to 80%, 60%, 40%, 20%, and 0% pruning.
In the plotted overall performance figures, the gain is defined as the difference in percentage points between the performance of fully boosted MC5.0 and the performance of MC5.0 alone (trials = 1).
The gain was positive for all ten of our data sets. The relative performance of Kappa Pruning is defined as the difference between its performance and that of MC5.0 alone, divided by the gain.
Hence, a relative performance of 1.0 indicates that the Kappa Pruning method obtains the same gain as AdaBoost, while a relative performance of 0.0 indicates that Kappa Pruning performs no better than MC5.0 alone.
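To make the metric concrete, here is a tiny Python helper; the accuracy values are made up for the example and do not come from the experiments above.

```python
def relative_performance(pruned_acc, single_tree_acc, full_boost_acc):
    """Relative performance of a pruned ensemble:
    1.0 -> matches full AdaBoost; 0.0 -> no better than a single tree."""
    # Gain: percentage-point improvement of full boosting over one tree
    gain = full_boost_acc - single_tree_acc
    return (pruned_acc - single_tree_acc) / gain

# Hypothetical accuracies: single tree 80%, full boosting 90%, pruned 87.5%
print(relative_performance(87.5, 80.0, 90.0))  # 0.75
```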
[figure omitted]
Fig. Relative performance of Kappa Pruning with various amounts of pruning.

The figure above shows that pruning improves performance over AdaBoost for Chess, Adult, Car, Letter, Wine, and Forest-Cover. The only data sets that show very bad behaviour are Adult and Breast-Cancer, which appear to be very unstable. Hence, in many cases significant pruning does not hurt performance very much, while saving memory by reducing the number of classifiers.

  2. I have attached the data sets as part of the package; you can find them in the folder named 'test data sets'.
    The following is the code used for testing:
library(C50)
#train <- read.csv("adult.csv")
train <- read.csv("haberman.csv")
#train <- read.csv("iris.csv")
#train <- read.csv("krkopt.csv")
#train <- read.csv("forest_cover.csv")

index <- 1:nrow(train)
index <- sample(index)  # shuffle the row indices
fold <- rep(1:10, each = nrow(train)/10)[1:nrow(train)]
folds <- split(index, fold)  # list of row indices for each fold

# Run each fold
accs <- vector(mode = "numeric")
sizes <- vector(mode = "numeric")
for (i in 1:length(folds)) {
  tree <- C5.0(factor(class) ~ ., data = train[-folds[[i]], ], trials = 2)
  sizes[i] <- tree$size
  a <- predict(tree, train[folds[[i]], ])
  b <- train[folds[[i]], ]$class
  accs[i] <- sum(a == b) / length(a)
  print(i)
}

# Report the averages
mean(accs)
mean(sizes)

Thanks & Regards,
Navami

@topepo
Owner

topepo commented Nov 8, 2015

Navami,

Thanks for the follow-up. No big deal about timing. A few things though:

Your description in this thread is helpful, but you need to document it in the man file (or create a vignette with the details). You should also word it for readers who are not technical experts.

Please sync before submitting the pull request. This one would remove functionality from the package (including contributors).

Please remove these files in the PR:

  • pkg/C50_0.1.0-21.tar.gz
  • pkg/C50/.Rproj.user/
  • pkg/C50/.RData
  • pkg/C50.zip
  • pkg/C50/.Rbuildignore
  • pkg/C50/.Rhistory

Also, please make some test cases with the current version of the package and test that you get the same results when the additional functionality is not used.

I'm very interested in these changes but, to be honest, the PR is really sloppy and it makes me hesitant about merging it in.

@navamikairanda
Author

Hi Max,

I have already updated the .Rd man files (C5.0.Rd) in my previous commit. Could you have a look at it and see if I am missing something?

I synced the repository to include the update in as.party.C5.0.

I also removed the files you listed from the PR. I realize those files are build-related and specific to my setup, and hence should not have been included. Sorry for missing that earlier.

Which additional functionality?

Yes, I understand your concerns; this is the first time my teammates and I have raised a PR on GitHub, hence the errors. Any suggestions for improvement that you provide will be valuable to us.

Navami

@navamikairanda
Author

Hi Max,

Can you please update us on the status?

Regards,
Navami

@topepo
Owner

topepo commented May 31, 2016

Sorry for the delay. I would really like to see more detail about the algorithm in the Rd file or, preferably, in a vignette or other document not in the package. Please make them as clear as possible. Here is an example that you can use to understand the level of detail that we would need.

The issue is that, in other packages, people have made extensive changes to the code and then not been available for bug fixes when they occur. You'll need to outline the details somewhere so that I and others do not have to pore through the C code to figure out what it is doing and where an issue lies.

@topepo
Owner

topepo commented Mar 23, 2017

After running some examples, I think that a lot more details and explanation are needed.

  • There is no discussion on how the pruning is done. Looking at the outcome of the function, it appears that the iterations are pruned sequentially backwards (instead of in a global/non-greedy way).
  • The output doesn't reflect how the pruning affected the results. For example, if prunem = 5 and trials = 20:
Evaluation on training data (2227 cases):

Trial	    Decision Tree   
-----	  ----------------  
	  Size      Errors  

   0	    72  342(15.4%)
   1	    33  472(21.2%)
   2	    44  480(21.6%)
   3	    49  449(20.2%)
   4	    44  491(22.0%)
boost	        281(12.6%)   <<
  • Even when pruning is used, the summary function shows all trees from all iterations.
  • The output from the print method isn't consistent. If prunem = 8, the output says "8 used due to early stopping". This isn't correct.

For example:

library(C50)
library(caret)

# data from github/topepo/recipes
library(recipes)
data("credit_data")

set.seed(1)
in_train <- createDataPartition(credit_data$Status, p = .5, list = FALSE)

credit_tr <- credit_data[ in_train,]
credit_te <- credit_data[-in_train,]

no_prune <- C5.0(x = credit_tr[, -1],
                 y = credit_tr$Status,
                 trials = 20,
                 control = C5.0Control(seed = 1))
no_prune_pred <- predict(no_prune, credit_te)
confusionMatrix(no_prune_pred, credit_te$Status)

pruned <- C5.0(x = credit_tr[, -1],
               y = credit_tr$Status,
               trials = 20,
               prunem = 7,
               control = C5.0Control(seed = 1))
pruned_pred <- predict(pruned, credit_te)
confusionMatrix(pruned_pred, credit_te$Status)

@topepo
Owner

topepo commented May 17, 2018

Closing due to no activity.

@topepo topepo closed this May 17, 2018