more documentation needed #657
Comments
Hi. What do you mean by "the details covered in the explanation is tautological"? Thanks for your criticism and input! Cheers,
Indeed. Most estimators have quite extensive docstrings: http://scikit-learn.org/dev/modules/classes.html Please point us to the specific examples you encountered that lack details.
In the SVC classifier : http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
LinearSVC:
RandomForestClassifier :
Ridge Regression :
Sometimes, even if I normalize / scale the data myself (using my own code or `preprocessing.Scaler`) and I set `fit_intercept=True` on the model, there is still some intercept computed for me. Does RidgeCV use a different algorithm internally to compute the intercept? Is the Scaler's functionality the same as the intercept's? I also admit that I may have been exposed to some very simplistic models of the algorithms you have implemented in scikit, and I may lack some knowledge. I would gladly read and catch up on whatever information I am missing if there were references for the models you considered when implementing them. In many cases there are several papers that improve various aspects of the original algorithm, and it would be nice to know which version you implemented. So far I have worked by trial and error, relying on the code you provided. I would also find it very useful to have information about the computational complexity of fit / predict and the memory footprint, so I know how to adapt my data.
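A small sketch of the behavior asked about above, using the current scikit-learn API (where `preprocessing.Scaler` has since been renamed `StandardScaler`, and the constructor flag is `fit_intercept`). The point: scaling X says nothing about the mean of y, so an intercept is still estimated even on pre-scaled features.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(100, 3) * 5 + 10                      # un-centered features
y = X @ np.array([1.0, 2.0, -1.0]) + 3 + rng.randn(100) * 0.1

Xs = StandardScaler().fit_transform(X)              # zero-mean, unit-variance X

# Even with pre-scaled X, fit_intercept=True still estimates an intercept:
# it absorbs the mean of y, which scaling X does nothing about.
model = Ridge(alpha=1.0, fit_intercept=True).fit(Xs, y)
print(model.intercept_)                             # close to y.mean(), not 0
```

With `fit_intercept=False` the same model would force the intercept to 0, which only makes sense if y has been centered as well.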
Great, thank you very much. I will post this on the mailing list to let other developers know about this feedback.
I think the remarks you bring up are partly answered in the narrative documentation (i.e. on the web page), in particular for ridge regression. This is a problem that keeps coming up: a lot of people only look at the docstrings, and not at the narrative documentation. There is a limited amount of information we can carry in the docstrings, for lack of formatting and links, and we shouldn't duplicate it. So the real question is: how do we get people to look at the narrative documentation? Do we add a note to each estimator pointing to the relevant section of the docs?
We could add links to the narrative doc, but I think the docstrings should at least carry the "operational usage" info: which parameters are important to grid search, and over what ranges (exponential or linear, and at what typical scale)? Personally, when I experiment with an estimator in an IPython session, I would like to be able to set up a basic grid search quickly, just by glancing at the info available in the docstring from within IPython. If I want a full mathematical understanding of the model (e.g. if I were a researcher writing a paper and using the scikit implementation to benchmark my work against some baseline models), then I would be fine opening a new Firefox tab and reading the matching chapter (typically at least a 5-minute operation).
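For illustration, the kind of "operational usage" workflow described above might look like this with the current API (`GridSearchCV` now lives in `sklearn.model_selection`; the parameter ranges here are illustrative exponential grids, not recommendations from this thread):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The kind of info a docstring should make easy to find: which parameters
# matter, and on what scale to search them (here, log-spaced grids).
param_grid = {
    "C": np.logspace(-2, 3, 6),        # regularization strength, log scale
    "gamma": np.logspace(-4, 1, 6),    # RBF kernel width, log scale
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```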
But the docstrings should always include a brief intuitive explanation of parameters and their impact. This is sometimes lacking, currently. |
@vicpara You said it would be nice to know which version of an algorithm we implement. In particular for SVMs and RF, which you mention, there are explicit references. In the case of SVM, we even just wrap a well-known library that has multiple papers explaining the algorithm. Could you say for which algorithms we do not provide the information on which papers the implementation is based? |
@amueller In the meantime I've realized how to make better use of the documentation, and that there are in fact a lot of references; I just was not looking for them in the right place. I'm sorry about creating this "issue".
No need to be sorry, just tell us what you found useful and how you think |
@vicpara You really shouldn't be sorry. I made a couple of (hopefully) improvements in the docs and parameter naming (see PR above) based on your suggestions :) |
@amueller Sweet. Nice job ! Regarding Random Forest:
Classification using SVM and kernel algorithms:
@vicpara please take a look at the |
Generally you make an important point. The question is whether this project is really the right place to give these answers. The docs are not meant to replace a course in machine learning.
I am already working on such a FAQ, as explained on the mailing list: https://raw.github.com/ogrisel/scikit-learn/doc-faq/doc/faq.rst Some of it should move to the narrative doc and be replaced by pointers. Pull requests welcomed.
I totally agree: this project's documentation cannot replace an ML course. However, there is some distance between any theoretical framework and the implementation details. Just by implementing and testing an algorithm, the developer gains some insight into its pros and cons, most of the time empirically. These insights would be useful and great to know when starting to work with a specific algorithm.
Very true. Furthermore, theoretical books almost never tell you whether an algorithm will be able to converge on a 10000x10000 dataset before the end of the universe, or whether the intermediate data structures will fit in 4 GB of memory.
@ogrisel I really like the FAQ section! It is very practical and easy to follow. I also like the workflow like way of jumping from one question to another which also narrows down the possibilities. |
I'll close this issue for now, as I think we cleared up most of the problems. @vicpara, if you find that any parts of the scikit are not well explained, even when considering the user guide, don't hesitate to say so: here, by opening another issue, or on the mailing list.
Hi guys,
I have a hard time struggling with parameter tuning for most of the models in the scikit. I'm a machine learning student and I'm using scikit on various projects. I know in theory how each ML technique works and what its goal is, but I still have a hard time figuring out what most of the parameters a model takes in its constructor mean. I think more detailed documentation should be available. The names of the parameters are rarely self-explanatory, and most of the time the details covered in their explanation are tautological. Parameter and hyper-parameter tuning for ML models is tricky, but I find it even more obscure when I cannot apply my intuition because I have no idea what a parameter actually does or means.
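For reference, every scikit-learn estimator exposes its constructor parameters programmatically, which can at least reveal what is tunable when the docstring alone is not enough (a minimal sketch using the current API):

```python
from sklearn.svm import SVC

clf = SVC()

# get_params() lists every constructor parameter and its current value,
# so you can see what is tunable even before reading the full docs.
params = clf.get_params()
print(sorted(params))          # parameter names, e.g. 'C', 'gamma', 'kernel', ...
print(params["kernel"])        # the default kernel
```

In IPython, `SVC?` shows the same parameters with their docstring descriptions.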
I think most newcomers and users will find your platform easier to understand and to use efficiently if there is a detailed explanation of the parameters and of the models / algorithms / objective functions used under the hood of scikit.
Thanks