
more documentation needed #657

Closed
vicpara opened this issue Feb 28, 2012 · 19 comments



vicpara commented Feb 28, 2012

Hi guys,
I'm struggling with parameter tuning for most of the models in scikit-learn. I'm a machine learning student and I'm using scikit-learn in various projects. I know in theory how each ML technique works and what its goal is, but I still have a hard time figuring out what most of the parameters a model takes in its constructor actually mean. I think more detailed documentation should be available. The names of the parameters are rarely self-explanatory, and most of the time their explanations are tautological. Parameter and hyper-parameter tuning for ML models is tricky, but I find it even more obscure when I cannot apply my intuition because I have no idea what a parameter actually does or means.
I think most newcomers and users would find your platform easier to understand and to use efficiently if there were a detailed explanation of the parameters and of the models / algorithms / objective functions that are used under the hood of scikit-learn.
Thanks

@amueller (Member)

Hi.
I'm sorry you find the documentation insufficient.
To address your concerns, it would be good if you could give examples of where you find the parameters of a model are not well explained. Then we can work on that.
From my perspective it is quite hard to improve the docs if I don't know where the actual problems are.

What exactly do you mean by the explanations being "tautological"?

Thanks for your criticism and input!

Cheers,
Andy


ogrisel commented Feb 28, 2012

Indeed. Most estimators have quite extensive docstrings: http://scikit-learn.org/dev/modules/classes.html

Please point us to the specific examples you encountered that lack details.


vicpara commented Feb 28, 2012

In the SVC classifier : http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

  • it is not very clear to me what the shrinking heuristic means or how to tune the parameter
  • is the C / scale_C parameter the soft-margin penalty of the classifier, or does it have something to do with scaling the weight of the classes depending on the number of samples in each class?
  • coef0 - does a polynomial kernel have a c0? I suspect it is a small term added to each polynomial kernel value that does not depend on the sample, but when is it useful to change it from zero (the default)?
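A minimal sketch of how one might probe these two parameters empirically (this uses the present-day scikit-learn API, which differs from the 0.x API discussed here; the dataset is synthetic):

```python
# Sketch: probing SVC's `shrinking` and `coef0` on a toy dataset.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# `shrinking` is purely a libsvm optimization heuristic: it should affect
# training speed, not the fitted model, so predictions should agree.
pred_on = SVC(kernel="rbf", shrinking=True).fit(X, y).predict(X)
pred_off = SVC(kernel="rbf", shrinking=False).fit(X, y).predict(X)
print("shrinking changes predictions:", (pred_on != pred_off).any())

# `coef0` is the additive constant r in the polynomial kernel
# (gamma * <x, x'> + r) ** degree; raising it from 0 reweights the
# lower-order terms of the polynomial.
for c0 in (0.0, 1.0, 10.0):
    acc = SVC(kernel="poly", degree=3, coef0=c0).fit(X, y).score(X, y)
    print(f"coef0={c0}: train accuracy {acc:.3f}")
```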

LinearSVC:

  • should multi_class be set to False when using LinearSVC inside a OneVsRestClassifier, and set to True when using it standalone on data that has multiple classes?
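For reference, a sketch of the two usages being asked about (written against the modern API, where LinearSVC handles multiclass data internally via one-vs-rest; in the 0.x release discussed here, multi_class was a boolean flag):

```python
# Sketch: LinearSVC's built-in one-vs-rest vs. an explicit wrapper.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

built_in = LinearSVC().fit(X, y)                      # internal one-vs-rest
wrapped = OneVsRestClassifier(LinearSVC()).fit(X, y)  # explicit one-vs-rest

print("built-in OvR accuracy:", built_in.score(X, y))
print("wrapped  OvR accuracy:", wrapped.score(X, y))
```

Both routes train one binary classifier per class, so their behaviour should be equivalent up to solver details.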

RandomForestClassifier :

  • min_split and max_features: I understand they influence the way a node is split in the tree. However, I don't understand the magnitude of their influence or what reasonable values would be, e.g. in the case of 100K samples with 100 features. I constantly got back about 41-46 features.
  • with RFs, deciding approximately right values for the tuple (n_estimators, min_split, max_features) is a pain.
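One way to take the pain out of picking that tuple is a small grid search; a sketch with the modern API (min_split was later renamed min_samples_split, and the usual starting point for max_features in classification is sqrt(n_features)):

```python
# Sketch: grid-searching the random-forest knobs mentioned above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [50, 100],          # more trees rarely hurt, just cost time
    "min_samples_split": [2, 10],       # larger values regularize each tree
    "max_features": ["sqrt", 0.5],      # features considered at each split
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best cv score:", round(search.best_score_, 3))
```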

Ridge Regression :

  • I know ridge regression has some lambda parameter which penalizes the size of the estimator. Is the lambda I know from (X'X + lambda*I)**-1 X'y equal to the alpha in the model? In my attempts, most of the values RidgeCV picked for alpha were very small, like 1e-5, which means almost no regularization at all?
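A quick numerical check of this (a sketch with the modern API): with fit_intercept=False, Ridge's alpha is exactly the lambda of the closed form.

```python
# Sketch: sklearn's Ridge `alpha` is the lambda in w = (X'X + lambda*I)^-1 X'y.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(50, 5)
y = X @ rng.randn(5) + 0.1 * rng.randn(50)

alpha = 2.0
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T @ y)
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

print("closed form and sklearn agree:", np.allclose(w_closed, w_sklearn, atol=1e-6))
```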

Sometimes, even if I normalize / scale the data myself (using my own code or preprocessing.Scaler) AND I set fit_intercept=True in the model, there would still be some intercept computed for me. Does RidgeCV use a different algorithm internally to compute the intercept? Is the Scaler's functionality the same as the intercept's?
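This is actually expected: scaling centers X but not y, and the intercept is computed as mean(y) minus the contribution of the (centered) features, so it reduces to roughly mean(y). A sketch with the modern API (preprocessing.Scaler was later renamed StandardScaler):

```python
# Sketch: with standardized features, the fitted intercept is ~mean(y),
# hence non-zero even after scaling X.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(100, 3) * 5 + 2
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0

X_scaled = StandardScaler().fit_transform(X)
model = Ridge(alpha=1.0, fit_intercept=True).fit(X_scaled, y)

# intercept_ = mean(y) - mean(X_scaled) @ coef_, and mean(X_scaled) is ~0
print("intercept:", model.intercept_, " mean(y):", y.mean())
```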

I also admit that I may have been exposed to some very simplistic models of the algorithms you have implemented in scikit-learn, and that I may lack some knowledge. I would gladly read and catch up on whatever pieces of information I am missing if there were references for the models you considered when implementing them. In many cases there have been many papers that improve various aspects of the original algorithm, and it would be nice to know which version you considered.

The way I have worked so far was by trial and error, relying on the pieces of code you provided.

I would also find it very useful to get some information about the computational complexity of fit / predict and the memory footprint, so I know how to adapt the data.


ogrisel commented Feb 28, 2012

Great, thank you very much. I will post this on the mailing list to let the other developers know about this feedback.

@GaelVaroquaux (Member)

I think that the remarks you bring up are partly answered in the narrative documentation (i.e. on the web page), in particular for ridge regression.

This is a problem that keeps coming up: a lot of people will only look at the docstrings, and not at the narrative documentation. There is a limited amount of information we can carry in the docstrings, for lack of formatting and links, and we shouldn't duplicate it. So the real question is: how do we get people to look at the narrative documentation? Do we add a note to each estimator pointing to the relevant section of the docs?


ogrisel commented Feb 28, 2012

We could add links to the narrative doc, but I think the docstrings should at least carry the "operational usage" info: which parameters are important to grid search, and over what ranges (exponential or linear, and at what typical scale)?

Personally, when I experiment with an estimator in an IPython session, I would like to be able to set up a basic grid search in less than 20 seconds, just by glancing at the info the IPython ? magic suffix pulls from the docstring.
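That 20-second workflow might look like this (a sketch with the present-day API; the module layout was different in 2012, and the log-spaced ranges are the kind of hint a docstring could carry):

```python
# Sketch: a basic grid search with exponential (log-spaced) ranges for
# SVC's C and gamma, straight from a glance at the docstring.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {
    "C": np.logspace(-2, 2, 5),      # exponential range around the default C=1
    "gamma": np.logspace(-3, 1, 5),  # likewise for the RBF kernel width
}
search = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)
print("best:", search.best_params_, f"(cv accuracy {search.best_score_:.3f})")
```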

If I want a full mathematical understanding of the model (e.g. if I were a researcher writing a paper and using the scikit-learn implementation to benchmark his work against some baseline models), then I would be OK with opening a new Firefox tab and reading the matching chapter (which is typically a minimum 5-minute operation).

@mblondel (Member)

But the docstrings should always include a brief, intuitive explanation of each parameter and its impact. This is sometimes lacking at the moment.


amueller commented Mar 3, 2012

@vicpara You said it would be nice to know which version of an algorithm we implement. In particular for SVMs and RFs, which you mention, there are explicit references. In the case of SVMs, we even just wrap a well-known library that has multiple papers explaining the algorithm.

Could you say for which algorithms we do not provide information on which papers the implementation is based?


vicpara commented Mar 4, 2012

@amueller In the meantime I've realized how to make better use of the documentation, and that in fact there are a lot of references; I was just not looking for them in the right place. I'm sorry about creating this "issue".

@GaelVaroquaux (Member)

> I'm sorry about creating this "issue".

No need to be sorry; just tell us what you found useful and how you think we can make it easier to find. That way we can improve people's experience.


amueller commented Mar 4, 2012

@vicpara You really shouldn't be sorry. I made a couple of (hopefully) improvements to the docs and parameter naming (see the PR above) based on your suggestions :)


vicpara commented Mar 4, 2012

@amueller Sweet. Nice job!

@GaelVaroquaux :

Regarding Random Forest:

  • It is still not very clear to me what the difference between RandomForestClassifier and RandomForestRegressor is. In their technical descriptions, besides the criterion and the name, everything else is identical. Generally, RFs are used for classification. What exactly do you mean by Regressor? Is it an RF plus some other regression algorithm already available in the toolkit, or a tweaked RF that can do regression and take continuous target variables as input? (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier and http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor)
  • On my dataset, neither the RFC nor the RFR predicted the target label (an integer in my case) very well. However, using the RFC for feature selection and then feeding the features to a ridge regressor improved the outcome a lot.
  • Based on my intuition, I presumed you recycled some of the already implemented algorithms and objects when building the RF, so I decided to look at the narrative description of decision trees, where I managed to get an insight into min_split, max_features and min_density: http://scikit-learn.org/stable/modules/tree.html
  • How does an RFC allocate memory? I could not find this information; everything I've got was from the decision tree description. My problem is that I get out-of-memory errors while fitting or predicting, even though only 65% of the 8 GB of memory is committed. I managed to increase it to this level by switching to 64-bit Python 2.7; on 32-bit I could not exceed 150 trees. Now I train about 750, but I still find it frustrating not to have my entire memory used. I'm using Python 2.7 on Windows 7 x64 with scikit-learn 0.10. Once I have fitted a model using an RFC, does it make sense to get an out-of-memory exception while predicting?
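On the classifier/regressor question, a small sketch of the difference (modern API, synthetic data): it is the same forest machinery with a different output type, not a forest stacked on another regression algorithm.

```python
# Sketch: RandomForestClassifier predicts one of the training class labels;
# RandomForestRegressor averages leaf values, so it can return values
# in between the training targets.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int) + (X[:, 2] > 0).astype(int)  # labels 0, 1, 2

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

print("classifier outputs:", sorted(set(clf.predict(X).tolist())))  # discrete labels only
print("regressor outputs: ", reg.predict(X[:3]))                    # continuous values
```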

Classification using SVM and Kernel alg:

  • I'm still digging through the scikit-learn documentation, and I have to read multiple documents to make sure I don't miss any relevant details. At this point I have to start at the User Guide section and backtrack through the desired sections' narrative and technical documentation to make sure I've got it right.
  • Ideally, it would be nice to start from the kind of problem I have to solve (say classification) and from there go to the available algorithms in the toolkit, split into other relevant categories: say, very large datasets (which can't fit entirely in memory, or where kernel methods cannot be used due to memory constraints) versus small datasets (where you can apply kernel methods, for example). Picking the right form of presenting documentation (which might actually not be just one) is a really hard job. When you are very knowledgeable about your data and your desired algorithm, it is indeed very easy to use your documentation. But when you have not solved the problem yet and are still looking for a viable solution, a better way of exploring the available tools would be of REAL VALUE. This kit looks very promising and has a lot of knowledge embedded in it. If these nifty algorithms don't easily get a chance to touch the problems they were designed for, then it's a great loss for everybody: for you, who have worked long and hard to get them done, and for me, since I don't get my problem solved.
  • In the near future I am planning to do some data exploration, because my results so far show that I need some insights to get better predictions. I will have to find some instruments inside scikit-learn to help me achieve this. I would start from the covariance chapter and afterwards hunt for pieces of code that might hide some gems for, say, data visualization (t-SNE-like, histograms, etc.), or use some decision trees, which already have an export-tree function. It would be really handy to have a chapter dedicated to this problem.
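As one example of the export-tree "gem" mentioned above, here is a sketch using the text export of a fitted decision tree (export_text appeared in later scikit-learn releases; early versions only had export_graphviz, and the feature names below are just illustrative labels for the iris columns):

```python
# Sketch: dump a shallow decision tree as text to see which features
# drive the splits.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

report = export_text(
    tree, feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"]
)
print(report)  # shows the split thresholds and the class at each leaf
```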


amueller commented Mar 4, 2012

@vicpara please take a look at the dev documentation. I think the documentation of the random forests has improved somewhat since the last release. You can find it here: http://scikit-learn.org/dev/


amueller commented Mar 4, 2012

Generally, you make an important point. The question is whether this project is really the right place to give these answers; the docs are not meant to replace a course in machine learning.
Having some general advice on which algorithms to choose would still be a good thing. I recently talked to @alextp about this issue. He is pondering a general FAQ for machine learning, which would be interesting for people who have data and want to find the right algorithms to apply.


ogrisel commented Mar 4, 2012

I am already working on such a FAQ, as explained on the mailing list: https://raw.github.com/ogrisel/scikit-learn/doc-faq/doc/faq.rst

Some of it should move to the narrative doc and be replaced by pointers.

Pull requests welcomed.


vicpara commented Mar 4, 2012

I totally agree. This project's documentation cannot replace an ML course. However, there is some distance between any theoretical framework and the implementation details. Just by implementing and testing an algorithm, the developer gets some insights about the pros and cons, most of the time empirically. These insights would be useful and great to know when starting to work with a specific algorithm.
I'm currently studying ML and I also use various books to get a better grasp of the theoretical side. However, when moving to real-world data, most of the certainty and confidence provided by the science itself fades away. This is why I think getting the developer's perspective, with some technicalities, might help.


ogrisel commented Mar 4, 2012

Very true. Furthermore, theoretical books almost never tell you whether an algorithm will be able to converge on a 10000x10000 dataset before the end of the universe, or whether the intermediate data structures will fit in 4 GB of memory.


vicpara commented Mar 4, 2012

@ogrisel I really like the FAQ section! It is very practical and easy to follow. I also like the workflow-like way of jumping from one question to another, which also narrows down the possibilities.

@amueller (Member)

I'll close this issue for now, as I think we cleared up most of the issues. @vicpara, if you find that any parts of scikit-learn are not well explained, even after considering the user guide, don't hesitate to say so: here, by opening another issue, or on the mailing list.
I think providing good documentation is one of the cornerstones of this project!
