Feedback from Sebastian on ML notebook #8

Closed · 12 tasks done
rhiever opened this issue Aug 21, 2015 · 13 comments

rhiever (Owner) commented Aug 21, 2015

Feedback from @rasbt:

  • Okay, let me be very nit-picky here. I would either spell all the package names in lower case or use the common convention: NumPy, seaborn, matplotlib, SciPy, scikit-learn
  • In "scikit-learn: The main Machine Learning package in Python." I would suggest replacing "main" by "essential" or so. It is really great for basic stuff, and essential has the positive tone of "important" and also "fundamental" at the same time.
  • About the iris images: They look great! But one question, are they really attribution-free? I am wondering because I looked very hard to find some good ones that meet this criterion.
  • Since this is more of a beginner audience, maybe define "accuracy", e.g., "fraction of correctly classified flower samples"
  • "hand-measuring 100 randomly-sampled flowers of each species" -> Maybe use "50" so that the reader can directly relate to the dataset.
  • Instead of "scatter matrix", maybe consider the term "scatter plot matrix" since "scatter matrix" is typically something else: an "unnormalized" covariance matrix (e.g., in LDA)
  • Maybe mention that random forests are scale-invariant, e.g., you could mention that a typical procedure in the data preprocessing pipeline (required by most ML algos) is to scale the features, which you don't need here because you are using decision trees (I believe this is the only scale-invariant algo that is used in ML) -- maybe also explain what a decision tree is and how it relates to random forests in a few sentences. On a side-note, but you probably already know this: Most gradient-based optimization algos
  • "There are several random forest classifier parameters that we can tune" -- yes there are, but typically, the idea behind random forest is that you don't need to tune any of these except for the number of trees.
  • "It's obviously a problem that our model performs quite differently depending on the data it's trained on." Maybe it would be too much for this intro, but you could mention high variance (overfitting) and high bias (underfitting); I suspect the high variance here comes from the fact that you are only using 10 trees, in RF you typically use hundreds or thousands of trees since it is a special case of bagging with unpruned decision trees after all. Also, Iris may not be the best example for RF since it is a very simple dataset that does not have many features (the random sampling of features is e.g., the advantage of RF over regular bagging). In general, maybe consider starting this section with an unpruned decision tree instead of random forests. And in the end, conclude with random forests and explain why they are typically better (with respect to overfitting). Nice side effect: you can visualize the decision tree with GraphViz. If you decide to stick with RF, consider tuning the n_estimators parameter instead.
  • When you plot the cross-val error, I could also print the standard deviation
  • RandomForestClassifier(n_estimators=10, max_depth=1); I wouldn't recommend showing people this example; it could give them the wrong idea; you don't prune trees in a forest.
  • Maybe also mention the problems with KNN, because people could think that it is typically a great classifier since it performs so well here. It's really susceptible to the curse of dimensionality, and you always have to keep the training set around (lazy learner). In this context, I would also mention that the scale of the features matters (if you use Euclidean distance) and in this case we don't have to worry about it because everything is in cm.
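
For reference, a minimal sketch of the scaling point above, using the current scikit-learn API (in 2015 cross_val_score lived in sklearn.cross_validation); the pipeline and parameter values are illustrative, not taken from the notebook:

    # Standardize features before a distance-based learner such as KNN.
    # For Iris (all features in cm) this is not strictly necessary, but the
    # pipeline pattern is the safe default for other datasets.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    iris = load_iris()

    # The pipeline re-fits the scaler on each training fold only,
    # so no information leaks from the held-out fold.
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    scores = cross_val_score(knn, iris.data, iris.target, cv=10)
    print(scores.mean(), scores.std())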
rhiever (Owner) commented Aug 21, 2015

@rasbt, I pinged you on here so you can see how I respond to each point as I work on it. Thank you again for your feedback!

rhiever (Owner) commented Aug 21, 2015

Regarding the images: I pulled them from another repo that was Public Domain. However, looking at the original sources, it seems that they are not attribution free. I will have to fix that.

https://commons.wikimedia.org/wiki/File:Petal-sepal.jpg

http://www.signa.org/index.pl?Display+Iris-setosa+2

http://www.signa.org/index.pl?Display+Iris-virginica+3

rasbt commented Aug 21, 2015

Oh, I see that I was a little bit sloppy last night ... it seems the sentence "On a side-note, but you probably already know this: Most gradient-based optimization algos" got cut off. What I wanted to say is that even if features are on the same scale (e.g., cm), you still want to standardize the features prior to, e.g., gradient descent; it makes the learning easier because you get more balanced weight updates. Going into this would be way too much detail for the tutorial, but I would at least mention that people should check their features prior to using ML algos other than tree-based ones.

"When you plot the cross-val error, I could also print the standard deviation" I meant "would", not "could" :P

But estimating the variance is actually not that trivial; FYI, have a look at these papers:

  • T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural computation, 10(7):1895–1923, 1998.
  • Y. Bengio and Y. Grandvalet. No unbiased estimator of the variance of k-fold cross-validation. The Journal of Machine Learning Research, 5:1089–1105, 2004.

rhiever (Owner) commented Aug 21, 2015

When you plot the cross-val error, you could also print the standard deviation

Isn't it better to plot the distribution? I showed the mean in the first couple of examples; perhaps I'll just replace those with a distplot.
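
A rough sketch of that distribution idea, reusing the notebook's variable names quoted further down the thread (sns.distplot was the seaborn call at the time; newer seaborn versions use sns.histplot instead):

    import seaborn as sns
    from sklearn.model_selection import cross_val_score

    # Ten cross-validation accuracies, plotted as a distribution rather than
    # summarized by the mean alone.
    cv_scores = cross_val_score(random_forest_classifier, all_inputs, all_classes, cv=10)
    sns.distplot(cv_scores)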

rhiever added a commit that referenced this issue Aug 21, 2015
rhiever (Owner) commented Aug 21, 2015

"It's obviously a problem that our model performs quite differently depending on the data it's trained on." Maybe it would be too much for this intro, but you could mention high variance (overfitting) and high bias (underfitting); I suspect the high variance here comes from the fact that you are only using 10 trees, in RF you typically use hundreds or thousands of trees since it is a special case of bagging with unpruned decision trees after all. Also, Iris may not be the best example for RF since it is a very simple dataset that does not have many features (the random sampling of features is e.g., the advantage of RF over regular bagging). In general, maybe consider starting this section with an unpruned decision tree instead of random forests. And in the end, conclude with random forests and explain why they are typically better (with respect to overfitting). Nice side effect: you can visualize the decision tree with GraphViz. If you decide to stick with RF, consider tuning the n_estimators parameter instead.

I agree that that's a bit more detail than I'd like to go into for this tutorial; I'll leave it to your book to explain that. :-)

rhiever (Owner) commented Aug 21, 2015

Maybe mention that random forests are scale-invariant, e.g., you could mention that a typical procedure in the data preprocessing pipeline (required by most ML algos) is to scale the features, which you don't need here because you are using decision trees (I believe this is the only scale-invariant algo that is used in ML) -- maybe also explain what a decision tree is and how it relates to random forests in a few sentences. On a side-note, but you probably already know this: Most gradient-based optimization algos

This ties in nicely with #7. I'll add a note to that issue and check this one off.

rasbt commented Aug 21, 2015

Isn't it better to plot the distribution? I showed the mean the first couple examples; perhaps I'll just replace those with a distplot.

Yes, that's probably even better in this context. I suggested the stddev because

np.mean(cross_val_score(random_forest_classifier, all_inputs, all_classes, cv=10))
0.95999999999999996

followed by the sentence

Now we have a much more consistent rating of our classifier's general classification accuracy.

The info is basically already contained in the plot, but this would maybe be a nice summary statistic. And it is useful in practice too when you are tuning parameters, e.g., via k-fold CV or in nested CV using grid search, e.g., as some sort of tie-breaker.
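
A one-line version of that summary statistic, with the same variable names as the snippet above (an illustrative sketch, not the notebook's code):

    import numpy as np
    from sklearn.model_selection import cross_val_score

    # With a classifier and an integer cv, cross_val_score stratifies the folds,
    # so class proportions are preserved in each fold.
    cv_scores = cross_val_score(random_forest_classifier, all_inputs, all_classes, cv=10)
    print('accuracy: {:.3f} +/- {:.3f}'.format(np.mean(cv_scores), np.std(cv_scores)))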

I agree that that's a bit more detail than I'd like to go into for this tutorial; I'll leave it to your book to explain that. :-)

Sure, but I think that it would maybe be more worthwhile for the reader to use a basic decision tree instead of the Random Forest ... the hyper-parameter tuning (tree depth) would be more intuitive I guess. You could print an unpruned tree with good training acc. but bad generalization performance, and then show how you can address this with pruning (max_depth). But this is just a thought :)
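
A minimal sketch of that unpruned-vs-pruned comparison (the split size and depth value are assumptions for illustration, not values from the notebook):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.25, random_state=1)

    # An unpruned tree typically fits the training split perfectly; limiting
    # max_depth acts as a simple form of pruning and usually narrows the gap
    # between training and test accuracy.
    for depth in (None, 2):
        tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
        tree.fit(X_train, y_train)
        print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))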

rhiever added a commit that referenced this issue Aug 21, 2015:
Now we start with a decision tree classifier and build up to a random forest classifier.
rhiever (Owner) commented Aug 21, 2015

Alright, check it out now. It starts with a decision tree classifier then builds up to a random forest.

I think this last commit addresses the rest of your points. Please let me know if I missed anything. :-)

rasbt commented Aug 22, 2015

Wow, you seem really determined to turn this IPython notebook into an IPython book :)

Haha, if you are not busy enough, I have another batch for you!

  • Maybe use a table of contents so that people see in the beginning what to expect; also it helps to navigate through the document I think.

     # Table of Contents
     - [Your Markdown Section Header](#Your-Markdown-Section-Header)
     ... 

And then, you could place a little "arrow" or so under each section header to jump back to the overview

[go back](#Table-of-Contents)
  • Maybe mention in a few words that stratified k-fold keeps the class proportions per fold in contrast to regular k-fold

  • Hm, unfortunately the graphviz part is not working (rendering) yet, maybe try png instead of pdf? (One possible PNG export is sketched at the end of this comment.)

  • in general, maybe put a graphviz part directly after your first tree so that people know what a decision tree looks like, and maybe a second one after the hyperparam tuning so that they can see how the model changed?

  • Regarding "around that limitation by creating a whole bunch of shallow decision trees (hence "forest")":

Sorry, that's technically not correct: you use shallow trees (aka decision stumps) in boosting, not in bagging & Random Forests. I would maybe introduce it as (of course with nicer wording):

If we have a decision tree that goes too deep, we saw that it can overinterpret the training data so that it does not perform well on new, unseen data (e.g., test data). (Decision trees are nonparametric models where the number of model params depends on the training set.) It is important that we find the optimal tree depth during grid search. A powerful method to overcome this challenge is to build an ensemble of experts, a large number of deep decision trees, and combine their votes. This is actually how random forests work: we create many unpruned trees based on different subsets of the training data (note that they are bootstrapped) and different feature combinations to let the majority vote decide.
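
For the PNG suggestion above, one possible sketch (assumes the graphviz dot binary is installed; the variable and file names are placeholders, not the notebook's):

    from sklearn.tree import export_graphviz
    import subprocess

    # Export the fitted tree to DOT, convert it to PNG, and embed the image;
    # GitHub's notebook renderer displays embedded PNGs fine.
    export_graphviz(decision_tree_classifier, out_file='iris_dtc.dot',
                    feature_names=['sepal length (cm)', 'sepal width (cm)',
                                   'petal length (cm)', 'petal width (cm)'])
    subprocess.check_call(['dot', '-Tpng', 'iris_dtc.dot', '-o', 'iris_dtc.png'])

    # Then, in a notebook cell:
    # from IPython.display import Image
    # Image('iris_dtc.png')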

rhiever (Owner) commented Aug 22, 2015

Haha... oh dear, what have I gotten myself into? ;-)

Good suggestions - I addressed a couple with some quick fixes and will leave the rest for the weekend.

rasbt commented Aug 22, 2015

With great resources (for the next gen data scientists) come great responsibilities! :D

rhiever added a commit that referenced this issue Aug 25, 2015
rhiever added a commit that referenced this issue Aug 25, 2015
rhiever added a commit that referenced this issue Aug 25, 2015
rhiever (Owner) commented Aug 25, 2015

Alrighty, finally got around to most of these! Thanks again for the feedback.

rhiever closed this as completed Aug 25, 2015
rasbt commented Aug 25, 2015

Wow, looks awesome, and no prob, you are always welcome! Ah, one unfortunate caveat with how the GitHub IPython Nb rendering is implemented is that it doesn't support jumping between sections via internal links (yet) -- but the TOC is still useful anyway :). Haha, I may call you Random F. Olson from now on, but there is maybe one little phrase that you can add to make it technically unambiguous: instead of "-- each trained on a random subset of the features", say something like "-- each trained on random subsets of training samples (drawn with replacement) and features (drawn without replacement)". Otherwise people may think that they'd use the "original" training set for each decision tree in the forest.
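
In scikit-learn terms, that wording maps roughly onto these RandomForestClassifier parameters (values are illustrative, not the notebook's):

    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(
        n_estimators=500,     # many unpruned trees, as suggested earlier in the thread
        bootstrap=True,       # each tree sees training rows drawn with replacement
        max_features='sqrt',  # a random feature subset (without replacement) at each split
    )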

rhiever added a commit that referenced this issue Aug 25, 2015