random forest revision #1218

Merged
merged 63 commits into from
Jun 10, 2024
63 commits
2871845
created bagging duplicate for deep dive
manuelhelmerichs May 2, 2024
7c862b3
bagging intro
manuelhelmerichs May 7, 2024
2d569ee
meeting
manuelhelmerichs May 7, 2024
8de340c
bagging revised
manuelhelmerichs May 8, 2024
fb79466
Update fig-bagging-ntree_MSE.R
manuelhelmerichs May 8, 2024
8f228bc
autoplot
manuelhelmerichs May 13, 2024
9057c7f
benchmark updated
manuelhelmerichs May 13, 2024
99cd6fb
cursive
manuelhelmerichs May 13, 2024
44e2877
improving bagging
manuelhelmerichs May 14, 2024
a63ffe2
adjusted fig-bagging-mean
manuelhelmerichs May 14, 2024
23d9e34
averaging-prob
manuelhelmerichs May 15, 2024
ff1a053
first RF-basics draft
manuelhelmerichs May 15, 2024
767200d
random feature sampling graphic
manuelhelmerichs May 16, 2024
69a664a
mtry and maxdepth visualized
manuelhelmerichs May 16, 2024
a78d7d4
added some headers
manuelhelmerichs May 16, 2024
70d7d68
small adjustment
manuelhelmerichs May 16, 2024
0fe9d0b
included benchmark in RF intro
manuelhelmerichs May 16, 2024
1bade56
first draft for oob slides
manuelhelmerichs May 17, 2024
be206de
updated OOB draft
manuelhelmerichs May 17, 2024
def8abc
implemented discussed changes
manuelhelmerichs May 22, 2024
5d1df64
implemented discussed changes
manuelhelmerichs May 22, 2024
96d5718
bagging iteration LB
ludwigbothmann May 27, 2024
be90f8a
oob
manuelhelmerichs May 27, 2024
9270c63
Merge branch 'random-forest-revision' of https://github.com/slds-lmu/…
manuelhelmerichs May 27, 2024
3eb376e
bagging revised
manuelhelmerichs May 27, 2024
0bfbbae
ludwig's changes
manuelhelmerichs May 28, 2024
ac2cc4e
intuition draft
manuelhelmerichs May 28, 2024
3e817e6
intuition draft
manuelhelmerichs May 28, 2024
8f926fa
OOB iteration LB
ludwigbothmann May 28, 2024
5a6b52f
todos from meeting
manuelhelmerichs May 29, 2024
dc12106
overfitting plot
manuelhelmerichs May 29, 2024
bdc71be
bagging algo
manuelhelmerichs May 31, 2024
7fe7efa
merge master into random forest revision
manuelhelmerichs May 31, 2024
1411def
new titlemeta for RFs
manuelhelmerichs May 31, 2024
1cb759d
first draft of fimp and prox
manuelhelmerichs May 31, 2024
80d4b0c
proximity figure
manuelhelmerichs Jun 2, 2024
9e7e42e
proximities
manuelhelmerichs Jun 2, 2024
2db96cb
fimp
manuelhelmerichs Jun 3, 2024
1ce5b78
new figures
manuelhelmerichs Jun 4, 2024
9d9395b
proximities for review
manuelhelmerichs Jun 5, 2024
fa3e307
typo
manuelhelmerichs Jun 5, 2024
a20926d
oob
manuelhelmerichs Jun 5, 2024
2a51cfc
permutation importance
manuelhelmerichs Jun 5, 2024
839e289
edits lb
ludwigbothmann Jun 5, 2024
bf9d436
fimp for review
manuelhelmerichs Jun 5, 2024
733f685
little adjustments
manuelhelmerichs Jun 5, 2024
acb3447
small fix impurity figure
manuelhelmerichs Jun 6, 2024
1ba5c55
iter LB forests-basic
ludwigbothmann Jun 6, 2024
714bf93
Merge branch 'random-forest-revision' of github.com:slds-lmu/lecture_…
ludwigbothmann Jun 6, 2024
b5d1791
update fig bagging pred
ludwigbothmann Jun 6, 2024
aff80e9
bb: rf slides final corrections
berndbischl Jun 6, 2024
b9d8eb5
bb: rf slides final corrections
berndbischl Jun 6, 2024
079ed24
bb: rf slides final corrections
berndbischl Jun 6, 2024
7a7c1c4
prox fig LB
ludwigbothmann Jun 6, 2024
0ed4032
prox figs
ludwigbothmann Jun 6, 2024
f8397a0
bb: rf slides final corrections
berndbischl Jun 6, 2024
0570ac8
prox figs 2
ludwigbothmann Jun 6, 2024
6c58a9c
prox figs 3
ludwigbothmann Jun 6, 2024
b3001be
bb: rf slides final corrections
berndbischl Jun 6, 2024
d5ab343
Merge branch 'random-forest-revision' of github.com:slds-lmu/lecture_…
berndbischl Jun 6, 2024
1fa7e33
proximities aligned figures
manuelhelmerichs Jun 7, 2024
de3db9c
new pdfs
manuelhelmerichs Jun 7, 2024
15a9e38
cleanup RF chapter; recompiled
manuelhelmerichs Jun 10, 2024
Binary file added slides-pdf/slides-forests-bagging-deepdive.pdf
Binary file modified slides-pdf/slides-forests-bagging.pdf
Binary file added slides-pdf/slides-forests-basics.pdf
Binary file removed slides-pdf/slides-forests-benchmark.pdf
Binary file removed slides-pdf/slides-forests-discussion.pdf
Binary file modified slides-pdf/slides-forests-featureimportance.pdf
Binary file removed slides-pdf/slides-forests-intro.pdf
Binary file modified slides-pdf/slides-forests-nutshell.pdf
Binary file added slides-pdf/slides-forests-oob.pdf
Binary file modified slides-pdf/slides-forests-proximities.pdf
81 changes: 0 additions & 81 deletions slides/forests/attic/slides-forests-intro_corrigendum.tex

This file was deleted.

12 changes: 6 additions & 6 deletions slides/forests/chapter-order.tex
@@ -10,17 +10,17 @@
 \subsection{Bagging Ensembles}
 \includepdf[pages=-]{../../slides-pdf/slides-forests-bagging.pdf}

-\subsection{Introduction}
-\includepdf[pages=-]{../../slides-pdf/slides-forests-intro.pdf}
+\subsection{Basics}
+\includepdf[pages=-]{../../slides-pdf/slides-forests-basics.pdf}

-\subsection{Benchmarking Trees, Forests, and Bagging K-NN}
-\includepdf[pages=-]{../../slides-pdf/slides-forests-benchmark.pdf}
+\subsection{Out-of-Bag Error Estimate}
+\includepdf[pages=-]{../../slides-pdf/slides-forests-oob.pdf}

 \subsection{Feature Importance}
 \includepdf[pages=-]{../../slides-pdf/slides-forests-featureimportance.pdf}

 \subsection{Proximities}
 \includepdf[pages=-]{../../slides-pdf/slides-forests-proximities.pdf}

-\subsection{Discussion}
-\includepdf[pages=-]{../../slides-pdf/slides-forests-discussion.pdf}
+% \subsection{Bagging: Deep Dive}
+% \includepdf[pages=-]{../../slides-pdf/slides-forests-bagging-deepdive.pdf}
Binary file added slides/forests/figure/bagging-bench.png
Binary file added slides/forests/figure/bagging-bench_RF.png
Binary file added slides/forests/figure/bagging-mean.png
Binary file removed slides/forests/figure/cart_forest_fimp_1a.png
Binary file removed slides/forests/figure/cart_forest_intro_4.pdf
Binary file added slides/forests/figure/forest-fimp_gini.png
Binary file added slides/forests/figure/forest-fimp_perm.png
Binary file added slides/forests/figure/forest-minnode.png
Binary file added slides/forests/figure/forest-mtry.png
Binary file added slides/forests/figure/forest-ntree.png
Binary file added slides/forests/figure/forest-prox-vis_1.png
Binary file added slides/forests/figure/forest-prox-vis_2.png
Binary file removed slides/forests/figure_man/Proximity_plot.png
Binary file removed slides/forests/figure_man/bagging.pdf
Binary file added slides/forests/figure_man/bagging.png
Binary file removed slides/forests/figure_man/bm_stable_vs_unstable.pdf
Binary file added slides/forests/figure_man/forest-bagg_regr.png
Binary file added slides/forests/figure_man/forest-bagging.png
Binary file added slides/forests/figure_man/forest-fimp_idea.png
Binary file added slides/forests/figure_man/forest-gpt4o.jpg
Binary file added slides/forests/figure_man/forest-oob-error.png
Binary file added slides/forests/figure_man/forest-oob.png
Binary file added slides/forests/figure_man/forest-prox-matrix.png
Binary file removed slides/forests/figure_man/forests-featimp-new.jpg
Binary file removed slides/forests/figure_man/forests-oob-error-2.jpg
Binary file removed slides/forests/figure_man/rF_oob_error_new.pdf
Binary file removed slides/forests/figure_man/rf_majvot_averaging.png
52 changes: 52 additions & 0 deletions slides/forests/references.bib
@@ -0,0 +1,52 @@
@article{STROBL2007,
author = {Strobl, Carolin and Boulesteix, Anne-Laure and Zeileis, Achim and Hothorn, Torsten},
journal = {BMC Bioinformatics},
number = 25,
title = {Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution},
volume = 8,
year = 2007,
url = {https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-25}
}


@book{HASTIE2001,
author = {Hastie, Trevor and Tibshirani, Robert and Friedman, Jerome},
edition = 2,
publisher = {Springer},
title = {The elements of statistical learning: data mining, inference and prediction},
url = {http://www-stat.stanford.edu/~tibs/ElemStatLearn/},
year = 2009
}

@article{BREIMAN2001,
author = {Breiman, Leo},
journal = {Machine Learning},
number = 1,
pages = {5-32},
publisher = {Kluwer Academic Publishers},
title = {Random Forests},
url = {http://dx.doi.org/10.1023/A%3A1010933404324},
volume = 45,
year = 2001
}

@misc{LOUPPE2015,
title={Understanding Random Forests: From Theory to Practice},
author={Gilles Louppe},
year={2015},
eprint={1407.7502},
archivePrefix={arXiv},
primaryClass={stat.ML},
url = {https://arxiv.org/abs/1407.7502}
}

@Article{PROBST2018,
title = {To Tune or Not to Tune the Number of Trees in Random Forest},
author = {Philipp Probst and Anne-Laure Boulesteix},
journal = {Journal of Machine Learning Research},
year = {2018},
volume = {18},
number = {181},
pages = {1-18},
url = {http://jmlr.org/papers/v18/17-269.html},
}
62 changes: 62 additions & 0 deletions slides/forests/rsrc/fig-bagging-bench.R
@@ -0,0 +1,62 @@
# goal here is to visualize the need for unstable learners in bagging
# by using a benchmark_grid from mlr3
# and later show how RFs further improve accuracy!

library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(mlr3viz)
library(ggplot2)
library(data.table)

set.seed(123)

# bagging via a pipeline (taken from mlr3 book)
create_bagging_pipeline <- function(base_learner) {
  # subsampling with replacement and frac = 1 equals bootstrapping
  gr_single_pred = po("subsample", frac = 1, replace = TRUE) %>>% lrn(base_learner)
  gr_pred_set = ppl("greplicate", graph = gr_single_pred, n = 100)
  gr_bagging = gr_pred_set %>>% po("classifavg", innum = 100)
  as_learner(gr_bagging)
}

# setup learners
glrn_bagging_log = create_bagging_pipeline("classif.log_reg")
glrn_bagging_log$id = "bagging_logistic"

glrn_bagging_rpart = create_bagging_pipeline("classif.rpart")
glrn_bagging_rpart$id = "bagging_tree"

glrn_bagging_kknn = create_bagging_pipeline("classif.kknn")
glrn_bagging_kknn$id = "bagging_kknn"

lrn_log_reg = lrn("classif.log_reg")
lrn_rpart = lrn("classif.rpart")

lrn_ranger = lrn("classif.ranger", num.trees = 100)

lrn_kknn = lrn("classif.kknn")
lrn_kknn$param_set$values$k = 7

# benchmark_grid expects a list:
learners = list(glrn_bagging_log, lrn_log_reg, glrn_bagging_rpart, lrn_rpart, glrn_bagging_kknn, lrn_kknn, lrn_ranger)

# tasks to be included in the benchmark
tasks = lapply(c("spam"), tsk)

# run the benchmark!
bmr = benchmark(benchmark_grid(tasks, learners, rsmp("cv", folds = 10)))

# visualization
a <- autoplot(bmr, type = "boxplot") +
  ylab("CE for 10-fold CV") +
  xlab("Learners") +
  scale_x_discrete(labels = c("LR bagged", "LR", "CART bagged", "CART", "7-nn bagged", "7-nn", "RF")) +
  theme_minimal() +
  theme(
    axis.title = element_text(size = 22, face = "bold"),
    axis.text = element_text(size = 20, face = "bold"),
    legend.title = element_text(size = 22, face = "bold"),
    legend.text = element_text(size = 20, face = "bold")
  )
ggsave("../figure/bagging-bench_RF.png", plot = a, width = 20, height = 8, dpi = 300)
# the bagging-bench figure is just a cropped version of this (sorry, had problems with autoplot)
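
The pipeline above relies on the fact that subsampling with frac = 1 and replace = TRUE is a bootstrap draw. As a minimal base-R sketch (not part of the PR; the variable names are illustrative assumptions), the following checks how many distinct observations such a draw contains on average and compares that with the textbook value 1 - (1 - 1/n)^n, which is roughly 0.632 for large n. The remaining observations are the ones left "out of bag" for each base learner.

```r
set.seed(123)

n <- 1000       # size of the (hypothetical) training set
n_draws <- 200  # number of bootstrap samples to simulate

# number of distinct observations in each bootstrap sample of size n
n_unique <- replicate(n_draws, length(unique(sample(n, size = n, replace = TRUE))))

frac_unique <- mean(n_unique) / n  # empirical share of distinct observations
expected <- 1 - (1 - 1 / n)^n      # theoretical share, close to 1 - exp(-1)

print(c(empirical = frac_unique, theoretical = expected))
```

Because each base learner sees only part of the data, the bootstrap samples differ, which is what makes averaging over unstable learners worthwhile.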
84 changes: 84 additions & 0 deletions slides/forests/rsrc/fig-bagging-mean.R
@@ -0,0 +1,84 @@
# we want to visualize how bagging works, i.e.,
# that the predictions of the base learners are averaged
# -> fits 100 trees on bootstrapped toy data; visualizes them together with their average
# additionally, plots the number of trees vs. MSE to show the benefit of larger ensembles

library(ggplot2)
library(rpart)
library(gridExtra)

set.seed(700)

# Generate toy data
n <- 700
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd=0.6)
data <- data.frame(x=x, y=y)

# Fit trees on bootstrapped toy data
num_trees <- 100
trees <- lapply(1:num_trees, function(i) {
  samp <- sample(n, replace = TRUE)
  rpart(y ~ x, data = data[samp, ], method = "anova")
})
x_seq <- seq(0, 10, length.out=250)
preds <- sapply(trees, function(tree) predict(tree, newdata=data.frame(x=x_seq)))

# Mean acquired via bagging
mean_preds <- rowMeans(preds)

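# Combined into a dataframe for plotting
# rbind() appends the bagged mean as an extra row per grid point,
# so that y lines up with x (each = num_trees + 1) and with the recycled Model factor
plot_df <- data.frame(x = rep(x_seq, each = num_trees + 1),
                      y = c(rbind(t(preds), mean_preds)),
                      Model = factor(c(rep("individual trees", num_trees), "bagged mean")))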

# Visualization of toy data
p1 <- ggplot(data, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  stat_function(fun = sin, color = "red") +
  theme_minimal(base_size = 33)

# visualization of the trees' predictions and their mean (bagged)
p2 <- ggplot(plot_df, aes(x = x, y = y, color = Model)) +
  geom_line(alpha = 0.3) +
  geom_line(data = subset(plot_df, Model == "bagged mean"), size = 1.0) +
  theme_minimal(base_size = 33)

# function to calculate MSE for different numbers of trees
bagging_rpart <- function(data, num_trees, sample_size) {
  predictions <- matrix(NA, nrow = nrow(data), ncol = num_trees)

  for (i in 1:num_trees) {
    sample_indices <- sample(nrow(data), size = sample_size, replace = TRUE)
    sample_data <- data[sample_indices, ]

    model <- rpart(y ~ x, data = sample_data)

    predictions[, i] <- predict(model, data)
  }

  mean_predictions <- rowMeans(predictions, na.rm = TRUE)
  error <- mean((mean_predictions - data$y)^2)

  return(error)
}

# calculate MSE for different numbers of trees using the toy data
results <- data.frame(Number_of_Trees = integer(), MSE = numeric())
tree_counts <- seq(1, num_trees, by = 2)

# loop variable named n_trees so it does not overwrite the list `trees` from above
for (n_trees in tree_counts) {
  mse <- bagging_rpart(data, num_trees = n_trees, sample_size = 700)
  results <- rbind(results, data.frame(Number_of_Trees = n_trees, MSE = mse))
}

# plot MSE vs. number of trees
p3 <- ggplot(results, aes(x = Number_of_Trees, y = MSE)) +
  geom_line(color = "blue", size = 1.5) +
  labs(x = "number of decision trees",
       y = "MSE on training data") +
  theme_minimal(base_size = 33)

combined_plot <- grid.arrange(p3, p1, p2, ncol=3, nrow=1, widths=c(1, 1, 1.35)) # so all plots are roughly the same width

ggsave("../figure/bagging-mean.png", plot = combined_plot, width = 30, height = 8, dpi = 300)
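
The script above reports the MSE on the training data. As a complementary sketch (not part of the PR), the same averaging idea can be evaluated on fresh data drawn from the same toy process. The helper names make_data and bagged_predict, the test size of 300, and the ensemble size of 50 are illustrative assumptions; only rpart and base R are used.

```r
library(rpart)

set.seed(1)

# hypothetical helper: toy data as in the script above (sine signal plus Gaussian noise)
make_data <- function(n) {
  x <- runif(n, 0, 10)
  data.frame(x = x, y = sin(x) + rnorm(n, sd = 0.6))
}

train <- make_data(700)
test <- make_data(300)

# hypothetical helper: fit num_trees trees on bootstrap samples of `train`
# and return their averaged predictions on `newdata`
bagged_predict <- function(train, newdata, num_trees) {
  preds <- vapply(seq_len(num_trees), function(i) {
    idx <- sample(nrow(train), replace = TRUE)
    fit <- rpart(y ~ x, data = train[idx, ], method = "anova")
    as.numeric(predict(fit, newdata = newdata))
  }, numeric(nrow(newdata)))
  rowMeans(preds)  # one column per tree, averaged per test point
}

mse_single <- mean((bagged_predict(train, test, 1) - test$y)^2)
mse_bagged <- mean((bagged_predict(train, test, 50) - test$y)^2)

print(c(single_tree = mse_single, bagged_50 = mse_bagged))
```

Comparing the two printed values on held-out data separates the variance reduction from averaging from any optimism in the training-data MSE.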