
Added a module to prune boosted ensemble. #2

Closed
wants to merge 6 commits into from

Conversation

navamikairanda

A module to prune the boosted ensemble using the kappa pruning algorithm to obtain a user-defined number of decision trees/rules.

A module to prune the boosted ensemble to obtain a user-defined number of decision trees/rules.
@topepo
Owner

topepo commented Jun 27, 2015

Sorry for the delay; I'm just now getting to look at this.

Question: is there any instance where the number of boosting iterations to be pruned is less than the number of trials? How would that work?

Can you

  • update the Rd files with the proper syntax and an explanation as to how the pruning works?
  • provide some test files (either as part of the package or not) to show that the pruning works under different options?

Thanks,

Max

@navamikairanda
Author

I hadn't responded due to some personal reasons. Sorry for the delay.
To answer your queries:

  1. The number of boosting iterations to be pruned is always less than or equal to the number of trials.
    The explanation of the boost pruning process is as follows:

Ensemble techniques work by generating a set of classifiers and then using a voting measure to classify the data in the test set. However, one of the disadvantages of ensemble techniques is that a large amount of memory is required to store all the classifiers that are generated. Hence, there is a need to reduce the memory consumption.
This can be achieved by selecting a subset of classifiers that produces performance comparable to that obtained when all the boosted classifiers are considered. Several algorithms, such as Early Stopping, KL-Divergence Pruning, Kappa Pruning, and Reduced-Error Pruning with backfitting, can be used to find such a subset. In the module that we have developed, Kappa Pruning is the algorithm used to find the subset of classifiers.

Refer to https://www.lri.fr/~sebag/Examens/Margineantu.pdf for a detailed explanation of the Kappa Pruning algorithm.

The Kappa Pruning algorithm has been implemented as follows. The kappa statistic K is calculated on the training set for every pair of classifiers produced by AdaBoost. Once all kappas are calculated, pairs of classifiers are chosen starting with the pair that has the lowest K, considering pairs in increasing order of K, until M classifiers are obtained.
The source code of the C50 package in R was modified to prune the boosted classifiers. The Kappa Pruning algorithm was applied to T boosted classifiers to obtain M pruned boosted classifiers (M <= T).
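For illustration only, the selection step described above can be sketched as follows. This is a minimal Python sketch, not the C code in this PR; the names `kappa` and `kappa_prune` are made up for the example, and predictions are assumed to be plain label lists.

```python
from itertools import combinations

def kappa(pred_a, pred_b, labels):
    """Kappa statistic measuring agreement between two classifiers:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(pred_a)
    # theta1: observed rate of agreement between the two prediction vectors
    theta1 = sum(a == b for a, b in zip(pred_a, pred_b)) / n
    # theta2: agreement expected by chance from each classifier's label rates
    theta2 = sum((pred_a.count(l) / n) * (pred_b.count(l) / n) for l in labels)
    return (theta1 - theta2) / (1 - theta2)

def kappa_prune(predictions, labels, m):
    """Select m classifiers: walk pairs in increasing order of kappa
    (most diverse, i.e. lowest-agreement, pairs first) until m are chosen."""
    pairs = sorted(
        combinations(range(len(predictions)), 2),
        key=lambda ij: kappa(predictions[ij[0]], predictions[ij[1]], labels),
    )
    chosen = []
    for i, j in pairs:
        for idx in (i, j):
            if idx not in chosen and len(chosen) < m:
                chosen.append(idx)
        if len(chosen) == m:
            break
    return chosen

# Three toy classifiers: two identical, one that always disagrees.
preds = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0]]
print(kappa_prune(preds, [0, 1], 2))  # picks the most diverse pair: [0, 2]
```

The pruned ensemble keeps the most diverse members first, which is the intuition behind preferring low-kappa pairs.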
The modified C5.0 (MC5.0) was tested on ten data sets, all taken from the UCI Repository. AdaBoost with MC5.0 was run on each data set to generate 100 classifiers. The Kappa Pruning technique was then evaluated by generating sets of 20, 40, 60, 80, and 100 classifiers, corresponding to 80%, 60%, 40%, 20%, and 0% pruning.
In the plotted overall performance figures, the gain is defined as the difference in percentage points between the performance of fully boosted MC5.0 and the performance of MC5.0 alone (trials = 1).
The gain was positive for all ten of our data sets. The relative performance of Kappa Pruning is defined as the difference between its performance and that of MC5.0 alone, divided by the gain.
Hence, a relative performance of 1.0 indicates that the Kappa Pruning method obtains the same gain as AdaBoost, while a relative performance of 0.0 indicates that Kappa Pruning performs no better than MC5.0 alone.
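To make the metric concrete, here is a tiny Python helper; the accuracy values are made up for the example and do not come from the experiments above.

```python
def relative_performance(pruned_acc, single_tree_acc, full_boost_acc):
    """Relative performance of a pruned ensemble:
    1.0 -> matches full AdaBoost; 0.0 -> no better than a single tree."""
    # Gain: percentage-point improvement of full boosting over one tree
    gain = full_boost_acc - single_tree_acc
    return (pruned_acc - single_tree_acc) / gain

# Hypothetical accuracies: single tree 80%, full boosting 90%, pruned 87.5%
print(relative_performance(87.5, 80.0, 90.0))  # 0.75
```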
[figure omitted]
Fig. Relative performance of Kappa Pruning with various amounts of pruning.

The figure above shows that pruning improves performance over AdaBoost for Chess, Adult, Car, Letter, Wine, and Forest-Cover. The only data sets that show very bad behaviour are Adult and Breast-Cancer, which appear to be very unstable. Hence, in many cases significant pruning does not hurt performance very much, while saving memory by reducing the number of classifiers.

  2. I have attached the data sets as part of the package; you can find them in the folder named 'test data sets'.
    The following is the code used for testing:
library(C50)
#train <- read.csv("adult.csv")
train <- read.csv("haberman.csv")
#train <- read.csv("iris.csv")
#train <- read.csv("krkopt.csv")
#train <- read.csv("forest_cover.csv")

index <- 1:nrow(train)
index <- sample(index)  # shuffle the row indices
fold <- rep(1:10, each = nrow(train)/10)[1:nrow(train)]
folds <- split(index, fold)  # list of row indices for each fold

# Run each fold
accs <- vector(mode = "numeric")
sizes <- vector(mode = "numeric")
for (i in 1:length(folds)) {
  tree <- C5.0(factor(class) ~ ., data = train[-folds[[i]], ], trials = 2)
  sizes[i] <- tree$size
  a <- predict(tree, train[folds[[i]], ])
  b <- train[folds[[i]], ]$class
  accs[i] <- sum(a == b) / length(a)
  print(i)
}

# Report the averages
mean(accs)
mean(sizes)

Thanks & Regards,
Navami

@topepo
Owner

topepo commented Nov 8, 2015

Navami,

Thanks for the follow-up. No big deal about timing. A few things though:

Your description in this thread is helpful, but you need to document it in the man file (or create a vignette with the details). You should also word it for readers who are not technical experts.

Please sync before submitting the pull request. This one would remove functionality from the package (including contributors).

Please remove these files in the PR:

  • pkg/C50_0.1.0-21.tar.gz
  • pkg/C50/.Rproj.user/
  • pkg/C50/.RData
  • pkg/C50.zip
  • pkg/C50/.Rbuildignore
  • pkg/C50/.Rhistory

Also, please make some test cases with the current version of the package and test that you get the same results when the additional functionality is not used.

I'm very interested in these changes but, to be honest, the PR is really sloppy and it makes me hesitant about merging it in.

@navamikairanda
Author

Hi Max,

I have already updated the .Rd man files (C5.0.Rd) in my previous commit. Could you have a look at it and see if I am missing something?

I synced the repository to include the update in as.party.C5.0.

I also removed the files you listed from the PR. I realize those files are build-related and specific to my setup, and hence should not have been included. Sorry for missing that earlier.

Which additional functionality?

Yes, I understand your concerns; this is the first time my teammates and I have raised a PR on GitHub, hence the errors. Any suggestions for improvement that you provide will be valuable to us.

Navami

@navamikairanda
Author

Hi Max,

Can you please update us on the status?

Regards,
Navami

@topepo
Owner

topepo commented May 31, 2016

Sorry for the delay. I would really like to see more detail about the algorithm in the Rd file or, preferably, in a vignette or other document not in the package. Please make them as clear as possible. Here is an example that you can use to understand the level of detail that we would need.

The issue is that, in other packages, people have made extensive changes to the code and then not been available for bug fixes when they occur. You'll need to outline the details somewhere so that I and others do not have to pore through the C code to figure out what it is doing and where an issue lies.

@topepo
Owner

topepo commented Mar 23, 2017

After running some examples, I think that a lot more details and explanation are needed.

  • There is no discussion on how the pruning is done. Looking at the outcome of the function, it appears that the iterations are pruned sequentially backwards (instead of in a global/non-greedy way).
  • The output doesn't reflect how the pruning affected the results. For example, if prunem = 5 and trials = 20:
Evaluation on training data (2227 cases):

Trial	    Decision Tree   
-----	  ----------------  
	  Size      Errors  

   0	    72  342(15.4%)
   1	    33  472(21.2%)
   2	    44  480(21.6%)
   3	    49  449(20.2%)
   4	    44  491(22.0%)
boost	        281(12.6%)   <<
  • Even when pruning is used, the summary function shows all trees from all iterations.
  • The output from the print method isn't consistent. If prunem = 8, the output says "8 used due to early stopping". This isn't correct.

For example:

library(C50)
library(caret)

# data from github/topepo/recipes
library(recipes)
data("credit_data")

set.seed(1)
in_train <- createDataPartition(credit_data$Status, p = .5, list = FALSE)

credit_tr <- credit_data[ in_train,]
credit_te <- credit_data[-in_train,]

no_prune <- C5.0(x = credit_tr[, -1],
                 y = credit_tr$Status,
                 trials = 20,
                 control = C5.0Control(seed = 1))
no_prune_pred <- predict(no_prune, credit_te)
confusionMatrix(no_prune_pred, credit_te$Status)

pruned <- C5.0(x = credit_tr[, -1],
               y = credit_tr$Status,
               trials = 20,
               prunem = 7,
               control = C5.0Control(seed = 1))
pruned_pred <- predict(pruned, credit_te)
confusionMatrix(pruned_pred, credit_te$Status)

@topepo
Owner

topepo commented May 17, 2018

Closing due to no activity.

@topepo topepo closed this May 17, 2018