add reset scores, reset parser #43
Comments
We'll need to balance flexibility against usability and correctness, though: every point at which a software system caches state makes it harder to think about, harder to document, and more vulnerable to errors. Errors can crop up both on our side (bugs in the software) and on the user side (science mistakes in the dsc project caused by forgetting when something was updated or cleared).

My experience with users, both the data scientists and consultants in industry and the students here, is that they use organizational software as a replacement for, not a supplement to, scientific record-keeping. I worry about a situation where work becomes even less reproducible and more error-prone than without the tools.

My experiences with maintaining caches have been almost universally negative. They have always involved months of problematic bugs that lead to unhappy users and occasionally wrong results. Obviously some caching is needed (that is the point of dscr: to store the results of expensive computation). But where it is not overwhelmingly needed, always recomputing seems best to me, even if it carries a noticeable penalty in computation. My vote is to keep the pipeline as bone-simple as humanly possible. Happy to discuss, though.
Here's another way to look at it: if we want to be able to reset the dsc at different points of a deep pipeline, dscr needs to represent a tree of dependencies for the workflow. If the user clears a node on that graph, dscr needs to mark all of that node's children as dirty as well and prepare to recompute all of that stuff. This still doesn't protect the user from changing arbitrary code at an arbitrary step in the computation and forgetting to clear the corresponding parts of the dsc.

The very best students/users will screw this up 10-20% of the time, when they're tired or under pressure; the weakest will screw this up basically 100% of the time. These kinds of errors can be extremely pernicious because of people's tendency to think of software as an infallible black box. They'll change some code without clearing the cache... and the results in a figure won't actually reflect what the newest version of the code was written to do... and this may only become clear through some exacting forensic work after the paper is published and a reader has a question.

A truly safe dsc would pick up any change in the code and automatically mark all of the cached results that depended on that code as dirty. That's a very different (and much harder) project; it has been done in slightly different environments (http://pegasus.isi.edu/), but it requires years of effort by a large team of software people to get right. In the absence of such an effort, I hope that we can keep the hierarchy of cached objects as shallow and transparent as possible.
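For concreteness, here is a minimal sketch in R of the invalidation pass described above. The adjacency list `children`, which maps each pipeline node to its direct dependents, is hypothetical; dscr has no such structure today:

```r
# Hypothetical dependency graph: each node maps to its direct dependents.
children <- list(
  datamaker = "method",
  method    = "parser",
  parser    = "score",
  score     = character(0)
)

# Recursively mark a node and all of its descendants as dirty.
mark_dirty <- function(node, children, dirty = character(0)) {
  if (node %in% dirty) return(dirty)  # already visited
  dirty <- c(dirty, node)
  for (child in children[[node]]) {
    dirty <- mark_dirty(child, children, dirty)
  }
  dirty
}

mark_dirty("method", children)
#> [1] "method" "parser" "score"
```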
@mstephens I had an idea about how to steal most of the heavy work in making a "safe" dscr with a deep pipeline of parsers and scorers, etc.: steal it from git. The overall idea is to stop relying on manual resets and instead hash the contents of the code files, git-style, automatically marking as dirty any cached result whose code has changed since it was computed.

The biggest drawback is that this is paranoid: in the worst case, if the user changed one character in a comment line in a monolithic one-size-fits-all datamaker, this method would mark the whole project as dirty and want to recompute everything. Efficient dsc use would then require users to break datamakers/methods/scenarios/etc. down into smaller pieces, ideally one function per file.

My own tastes skew toward rigorous, automated cache correctness even at the expense of paranoia, at least when building tools that less sophisticated users will be touching. But I acknowledge that this is a matter of taste, and also that it would be an adjustment for students. What do you think?
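A rough sketch of the hashing idea in R, using the `digest` package; the `code_files` manifest and `find_dirty` helper are hypothetical illustrations, not part of dscr:

```r
library(digest)

# Hypothetical manifest: the source files each pipeline step depends on.
code_files <- list(
  datamaker = "R/datamaker.R",
  method    = "R/method.R",
  scorer    = "R/scorer.R"
)

# Hash each file's contents, git-style.
hash_code <- function(files) {
  vapply(files, function(f) digest(f, algo = "sha1", file = TRUE),
         character(1))
}

# Steps whose code changed (or is new) since the stored hashes were taken;
# these, plus their descendants in the dependency graph, must be recomputed.
find_dirty <- function(files, stored) {
  current <- hash_code(files)
  old <- stored[names(current)]
  names(current)[is.na(old) | current != old]
}
```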
@road2stat made a related suggestion of using hashes (although I don't recall the details now), so I'm copying him in. Definitely interested in discussing further. |
this is the kind of thing I had in mind: |
That's a good point: proving that a combination of scenario/method/parsers is unchanged would require deeper introspection of the code than I had thought. However, dscr is probably going to need to understand a lot about the dependencies in the code anyway, to be able to execute safe and efficient parallel computations. For the parallel implementation, one could certainly re-run the whole workflow from datamaker to output parser for each desired combination, but that's a lot of duplicated work. To do anything more efficient will require some kind of dependency management and some way to block calculations that are contingent on unfinished phases of the workflow. @mengyin has been doing naive parallelization of her dsc (I believe that this involves …)
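To make the blocking requirement concrete, one simple scheme, sketched here in R with `parallel::mclapply`, is to order the steps into topological levels and only start a level once the previous one has finished. The `levels` list and `run_step` stub are hypothetical:

```r
library(parallel)

# Hypothetical pipeline as topological levels: every step in a level
# depends only on steps in earlier levels.
levels <- list(
  "datamaker",
  c("method_a", "method_b"),  # independent, safe to run in parallel
  "parser",
  "scorer"
)

run_step <- function(step) {
  message("running ", step)   # stand-in for the real computation
}

# Each mclapply call returns only when the whole level is done, so no
# step ever starts before the phases it depends on have finished.
for (level in levels) {
  mclapply(level, run_step, mc.cores = 2)
}
```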
ok, but I think this kind of dependency is reasonably easy to manage: it isn't the most efficient approach, but it also won't be too inefficient in most cases, and it is simple...
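For example, the simple approach might look like the following in R: run each scenario/method combination end to end as an independent job, duplicating the datamaker work but needing no dependency tracking or blocking at all. The names here are illustrative:

```r
library(parallel)

# Illustrative scenario/method grid; each combination is fully independent.
combos <- expand.grid(scenario = c("normal", "t"),
                      method   = c("mean", "median"),
                      stringsAsFactors = FALSE)

run_combo <- function(i) {
  s <- combos$scenario[i]
  m <- combos$method[i]
  # stand-in for: datamaker(s), then method m, then parser, then scorer
  message("running scenario=", s, " method=", m)
}

# No blocking logic needed: every job re-runs its own full workflow.
mclapply(seq_len(nrow(combos)), run_combo, mc.cores = 2)
```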
Analogous to reset_dsc(method, scenario), we need a way to reset scores and parsers.
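A minimal sketch of what the requested functions might look like, assuming scores and parser output are cached as files per scenario/method pair under the dsc's output directory; the file layout here is hypothetical, and the real dscr layout may differ:

```r
# Hypothetical analogues of reset_dsc for cached scores and parser output.
reset_scores <- function(method, scenario, outputdir = "dsc-output") {
  unlink(file.path(outputdir, "scores", scenario,
                   paste0(method, ".RData")))   # ignores missing files
}

reset_parser <- function(parser, scenario, outputdir = "dsc-output") {
  unlink(file.path(outputdir, "parsed", parser, scenario),
         recursive = TRUE)
}

reset_scores("mean", "normal")  # the next run recomputes only this score
```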