Limit length of times arrays in profiling *pdata* #57
Oh, parallel replies :-) Just followed up on the other thread: #56 (comment)
This is a good observation, but I wouldn't worry about it too much. 1. We can do the stats aggregation in another thread. 2. The stats aggregation is pretty cheap and would be done rarely. The amortised time impact should be low (if not negligible) compared to the (presumably quite expensive) profiled calls. The biggest counter-argument to this change, IMO: is the extra code complication worth it if no one's ever going to encounter a problem in realistic use cases? (This is a really well-formulated issue btw - thanks for that!)
Makes sense that you could do the calculation in another thread. Yeah, I'm on the fence about this, too. I could see it being an uncommon need to go beyond 10m iterations. But then, when you need it, you need it. I don't really know. I wonder what a well-established profiling library, maybe in the Java world, does about this… Personally, I'd be satisfied with a note somewhere in the documentation that mentions the limitation and possible workarounds.
That's fair enough :-) Okay, let's do this - let's keep the issue open. I'll continue to think about it. Feel free to add any extra thoughts if you have them. Probably what this will come down to: someone needs to take a stab at implementing the change to see what it'd look like in practice. If it's not too unwieldy, we can pick it up just to be on the safe side. If it turns out to be unwieldy, then we'll need to decide whether it's worth it :-) I've got 15 mins now - will see what I can come up with in that time.
That makes a lot of sense on all counts. Thanks!
Wow, I hit this limitation earlier than I thought. I now have a sequence of about 50 million items and 6 function calls per item to profile. That's 50M items * 6 function calls per item * 8 bytes = about 2.2 GB of times. That's a lot of profiling overhead. The chunked statistics approach would have about 10k times * 6 functions * 8 bytes = about 1/2 MB of overhead.
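For reference, the arithmetic behind those estimates - a minimal sketch assuming 8 bytes per recorded time (boxed longs in a Clojure vector would actually cost somewhat more):

;; Back-of-the-envelope overhead estimate. Assumes 8 bytes per recorded
;; time; boxed longs in a persistent vector carry extra per-element cost.
(defn times-overhead-bytes [items calls-per-item bytes-per-time]
  (* items calls-per-item bytes-per-time))

(/ (times-overhead-bytes 50e6 6 8) 1024 1024 1024) ; ~2.2 GB (accumulate everything)
(/ (times-overhead-bytes 10e3 6 8) 1024 1024)      ; ~0.46 MB (chunked, 10k times per chunk)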
That's a lot of profiling! :-) Even without the memory overhead, that's going to incur a significant performance hit. Is there a reason you need/want to profile such high-rep targets - instead of, say, a smaller but representative test case? Anyway, let me finish up the experiment from earlier and let's compare - I need another ~10-15 mins.
Okay, this could serve as a proof-of-concept: https://github.com/ptaoussanis/timbre/compare/dev?expand=1 (EDIT: Updated URL to reflect most recent dev commit) A few notes:
Feedback welcome!
True: it is a lot of profiling! Good question. The main reason is that the smaller but representative test case is hard to put together. I'm constrained by needing to keep real customer data out of the code base, which means I would need to anonymize the real user ids (in a consistent way, not just stripping them out). And in reality, I would probably need to do a bit of information auditing to be sure no other information should be considered privileged. Which is not to say I couldn't do all that, just that it's a hassle. :-) Haven't looked carefully at your solution yet, but the approach seems sound to me. Let me try it out on my dataset and report back.
I looked through the code, and it looks good for a first pass. I think that, with some care, we can shift some things around to make the code as legible as before. So that source of complexity doesn't seem problematic. The bigger problem in my mind is that the stats merge is included in the reported times. That will, of course, affect the stats, which means that we've traded one kind of overhead for another. (I guess, in theory, some amount of overhead is necessary.) Anyway, the problem (Clojure's STM preferring the shorter
This worked great on my dataset—thanks! One bug I noticed is that the MAD is not being calculated correctly. For any label with >= 50k calls, the MAD is always In other words, if the count-so-far Tell me if this isn't clear; I can explain further.
Ahh, forgot a multiplication - thanks!
That's interesting - hadn't considered that. Figured we'd just accept a loss of accuracy proportional to the variance between timing group means. I'll play around with that tonight - nice idea!
That's an inherent trade-off with accumulating stats unfortunately. This'd be true (due to STM contention) even if we could run the merge op off the main thread. In this case we're trading a small, measurable perf impact against the potential for unbounded memory use. I've bumped the GC limit to kick in only on jobs targeting >100k reps. Even a million would be reasonable I think (~10MB of memory). Really, we're doing this just to account for the rare instances of multi-million rep targets. Will get back to you if I have a chance to play with an improved MAD a bit later. Cheers :-)
Thought about this a little more - I think accounting for mean variance is going to be tough. A naive sum ( But then we have to carry info for the distribution of signed differences, and somehow reconstitute that when recalculating residuals with the new population mean. It's not immediately obvious to me how we'd approach that (any ideas?). A more obvious alternative might be some kind of streaming variance statistic - I recall that there are some with nice properties that are very simple to implement. Not a big fan of variance in this context though. Have limited time to look into this myself atm - are you familiar with any stream-capable MAD methods? Otherwise - practically speaking: the current naive approach produces an error proportional to the variance between timing group means. Given that the groups are large (~100k timings), it seems reasonable to conclude that group mean variance will be low in most real-world applications. It's not ideal, but it might be good enough for our purposes here if no obvious alternatives present themselves?
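For context, this is the kind of streaming variance statistic being alluded to - a minimal sketch of Welford's online algorithm (standard textbook formulation, not Timbre code):

;; Welford's online algorithm: one pass, O(1) state per series.
;; Folds each sample into {:n :mean :m2}; variance = m2/n at the end.
(defn welford-step [{:keys [n mean m2]} x]
  (let [n'     (inc n)
        delta  (- x mean)
        mean'  (+ mean (/ delta n'))
        delta2 (- x mean')]
    {:n n' :mean mean' :m2 (+ m2 (* delta delta2))}))

(defn welford-variance [{:keys [n m2]}]
  (if (pos? n) (/ m2 n) 0))

;; Usage:
(let [state (reduce welford-step {:n 0 :mean 0 :m2 0} [10 100 50 500 8])]
  [(double (:mean state)) (double (welford-variance state))]) ; => [133.6 34683.84]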
I agree with your preference of MAD over standard deviation, but if the
Yeah, that seems to be the one that comes up. (Note that the earlier SO answer is also actually the Welford/Knuth algo). Interestingly, the naive approach (with error proportional to between-group variance) seems to produce a better approximation than Welford/Knuth. This could be explained by the fact that Welford/Knuth was designed to operate on inputs 1-at-a-time whereas the naive approach we're using has batches of n>>1 inputs (100k currently). I'm surprised by that - figured there'd be some more sophisticated technique. Your original suggestion got me thinking though, and I ended up experimenting with a way of correcting for inaccurate prior means by bucketing deltas into -ive and +ive groups:

(defn merge-times>stats!
([] ; -> {<pid> <merged-stats>}
(reduce (fn [m pid] (assoc m pid (merge-times>stats! pid)))
{} (keys (or @*pdata* {}))))
([id] ; -> <merged-stats>
(->
(swap! *pdata*
(fn [m]
(let [{:as m-id
:keys [times ntimes stats]
:or {times []
ntimes 0
stats {}}} (get m id {})]
(if (empty? times) m
(let [ts-count (max 1 ntimes)
ts-time (reduce + times)
ts-mean (/ ts-time ts-count)
;; Non-streaming Mean Absolute Deviation = (sum of |t - ts-mean|) / n.
;; To get streaming we instead collect deltas separated by
;; sign. This seems to be more accurate than Knuth/Welford,
;; Ref. http://goo.gl/mx5eSK, http://goo.gl/QLSfOc - probably
;; due to larger batch sizes.
[ts-mad-s+ ts-mad-s-]
(reduce (fn [[s+ s-] t]
(let [delta (- t ts-mean)]
(if (>= delta 0) [(+ s+ delta) s-]
[s+ (+ s- delta)])))
[0 0] times)
;;
s-count (+ (:count stats 0) ts-count)
s-time (+ (:time stats 0) ts-time)
s-mean (/ s-time s-count)
;;
s-mad-s+ (if-not (:count stats) ts-mad-s+
(+ (:mad-s+ stats) ts-mad-s+
(* (:count stats) (- s-mean (:mean stats)))))
s-mad-s- (if-not (:count stats) ts-mad-s-
(+ (:mad-s- stats) ts-mad-s-
(* (:count stats) (- s-mean (:mean stats)))))
s-mad-sum (+ (Math/abs (long s-mad-s+))
(Math/abs (long s-mad-s-)))
s-mad (/ s-mad-sum s-count)
;;
s-min (apply min (:min stats Double/POSITIVE_INFINITY) times)
s-max (apply max (:max stats 0) times)]
(assoc m id
(assoc m-id
:times []
:ntimes 0
:stats {:count s-count
:min s-min
:max s-max
:mean s-mean
:mad-s+ s-mad-s+
:mad-s- s-mad-s-
:mad s-mad
:time s-time})))))))
(get-in [id :stats]))))

Even with pathological times (very high between-group variances), this seems to approximate the true MAD quite well. Didn't have a lot of time to play with this today, but I'll push what I have now & you can experiment.

ADDENDUM: After noting that Welford/Knuth is generally considered pretty good, and our naive approach is already its equal or superior - am inclined to suggest we call it a day and go with the less convoluted approach. What do you think? The naive approach, for comparison:

(defn merge-times>stats!
([] ; -> {<pid> <merged-stats>}
(reduce (fn [m pid] (assoc m pid (merge-times>stats! pid)))
{} (keys (or @*pdata* {}))))
([id] ; -> <merged-stats>
(->
(swap! *pdata*
(fn [m]
(let [{:as m-id
:keys [times ntimes stats]
:or {times []
ntimes 0
stats {}}} (get m id {})]
(if (empty? times) m
(let [ts-count (max 1 ntimes)
ts-time (reduce + times)
ts-mean (long (/ ts-time ts-count))
ts-mad-sum (long (reduce + (map #(Math/abs (long (- % ts-mean)))
times))) ; Mean absolute deviation
;;
s-count (+ (:count stats 0) ts-count)
s-time (+ (:time stats 0) ts-time)
s-mean (long (/ s-time s-count))
s-mad-sum (long (+ (:mad-sum stats 0) ts-mad-sum))
s-mad (long (/ s-mad-sum s-count))
s-min (apply min (:min stats Double/POSITIVE_INFINITY) times)
s-max (apply max (:max stats 0) times)]
(assoc m id
(assoc m-id
:times []
:ntimes 0
:stats {:count s-count
:min s-min
:max s-max
:mean s-mean
:mad-sum s-mad-sum
:mad s-mad
:time s-time})))))))
(get-in [id :stats]))))
Wow, good catch about the SO answer being an adaptation of the Welford algorithm. I didn't notice that. I don't really understand the purpose of bucketing the deltas. But in any case, yes, the relatively naive approach sounds good to me, at least for now. Nice work. :-) At some point in the future, we could set up an experiment to try out various techniques. We could run some program that will produce several million timing measurements, with heap contention towards the end of the run in increasing severity (maybe ending with an OutOfMemory error), and instead of summarizing the stats online, just log them to a file. Then we could try out various algorithms on the measurements, offline. Just an idea for later.
By definition the sum of the (signed) differences from the mean will always be zero. You'd normally get around this by squaring or taking the absolute value of the differences to get a summable sequence. That kills our ability to compensate for prior inaccuracies in the mean though - so an alternative would be just to keep the +ive and -ive signed differences separate and only apply the sign elimination at the very end of the process. There's still a loss of accuracy this way (depends on the size of the mu compensation) - so it's not 100% accurate, but it can be a lot more accurate than the naive/Welford approach. Anyway, I think in practice our naive approach (basically Welford with large groups instead of individual samples) will be quite resistant to most kinds of real-world behaviour.
This is a good observation. Since our group sizes are large though (currently 100k), it's unlikely that between-group variances will be large. Usually times would start creeping up gradually, so the effect would be continuous between groups. What would trip us up is some kind of large discrete step-wise jump: like the first 100k samples take 100ms each and the next 100k take 1000ms each. If you're curious, I ran some tests on this already - they're at the bottom of the profiling ns:

(let [;; MAD = 154.0ms, natural:
;; n->s {0 10 1 100 2 50 3 500 4 8 5 300 6 32 7 433 8 213 9 48}
;; MAD = 236.0ms, pathological:
n->s {0 10 1 11 2 5 3 18 4 7 5 2 6 300 7 400 8 600 9 700}]
(with-redefs [stats-gc-n 3]
(profile :info :high-sigma (dotimes [n 10] (p :sleep (Thread/sleep (n->s n)))))))

So this is a synthetic test that represents a kind of upper bound on how bad our approximation could be. The (large group) naive method gets a 36ms MAD (85% error) with both pathological samples and pathological group sizes (3 samples per group in this case). The compensated method (bucketed signed differences) gets a ~260ms MAD (~10% error) if I recall correctly. So there's a big difference in this synthetic case, but I feel the difference in real-world use will be much smaller due to: continuous rather than discrete mean changes, much larger sample groups (100k vs 3). EDIT: I didn't test the 1-by-1 Welford, but it'll be strictly worse than the 85% (large group) error.
Heh heh, I never considered that :-) Sure, we could just log (all) times to disk every 100k iterations, then read them back on job completion to generate stats. No approximation would be necessary - we'd have all the exact times. That'd have been easier :-) The in-memory approach is probably (much) faster though, so still a fair tradeoff.
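Something like that log-to-disk variant could look roughly like this (illustrative sketch only; the file path and fn names are made up, not from Timbre):

(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Append each flushed chunk of raw times (one per line) to a file,
;; then read them all back after the run for exact, offline stats.
;; `append-times!`/`read-times` are illustrative names.
(defn append-times! [path times]
  (spit path (str (str/join "\n" times) "\n") :append true))

(defn read-times [path]
  (with-open [rdr (io/reader path)]
    (mapv #(Long/parseLong %) (line-seq rdr))))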
I created a Gist to dump run-times to a file in a head-retention scenario, then run different variance calculations on it: https://gist.github.com/kyptin/9714908 In a REPL with Calling
I'm pretty sure that I haven't gotten your method (which I'm calling
Hi Jeff, sorry - won't have much time for the next couple days so only skimmed this briefly (apologies for any errors!). Looks like your As a degenerate case, imagine if n = the population size. Unless I'm missing something (quite possible!), Welford with chunks should always be >= Welford w/o chunks since you're performing the same operation but with more accurate information. Also, if I understand your data correctly - it looks like your times are uniform from start to finish? That's a perfect case for Welford/chunked since the variance is stable. To better test their limitations, try times that grow rapidly with each iteration. Both should strain. The test I ran with partitioned signed differences should be in the commit history if you're curious to compare against that. (It performed much better in the tests I ran, but I didn't test it thoroughly or think deeply about possible problems). Welford (chunked or individual) & similar algos are optimised for the limit case where the mean sample value eventually converges. In these cases Welford & co. will converge quickly to the exact value. Does that make sense? Anyway, again, please take my commentary with a grain of salt - it's very possible I'm misunderstanding something. Have a good weekend, cheers :-)
Thanks for your comments, Peter. And sorry for the late response. (I generally don't have much time to work on this project except on the weekends.) Sorry, I don't quite get Welford well enough to follow your logic. But, I believe you. :-) And yes, I was surprised to see that my times are uniform. I thought that heap contention would make my times increase as the program ran because of garbage collection. But I think the garbage collection doesn't ever happen in the function being mapped ( However, for now, I yield: please feel free to implement your algorithm, and I'll get out of your way and cheer you on. :-)
No problem Jeff, this turned out to be a fun problem to hack on :-) The current implementation's just pending the next Timbre release to hit master. Closing for now; please feel free to reopen if you have any further thoughts. Enjoy the rest of your weekend, cheers :-)
Real-world use of Timbre rarely exceeds even ~100k calls per logged event, and we were paying for chunked streaming support with significant complexity and overhead. As of the current commit, choosing to emphasize the common case (<2m calls) for significant complexity and overhead savings. Call counts >nmax will display with a "<n>+" calls moniker in the output table to help identify when truncation has kicked in.
The time that a labeled section of code takes during profiling is stored in a vector of times in *pdata*. Currently, all of the times are accumulated first, and the summary statistics (min, max, mean, etc.) are computed after the code has finished. This approach works but can lead to OutOfMemory errors for code that is repeated heavily, e.g. the function being mapped over a large sequence.
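Roughly, the accumulate-then-summarise pattern looks like this (simplified sketch; the names are illustrative, not Timbre's actual internals):

;; Simplified sketch of accumulate-then-summarise profiling.
;; `pdata` and the fn names are illustrative, not Timbre internals.
(def pdata (atom {})) ; {<id> [<elapsed-ns> ...]}

(defn record-time! [id t-ns]
  ;; Every profiled call conj'es onto an ever-growing vector - the
  ;; source of the OutOfMemory risk on multi-million rep jobs.
  (swap! pdata update-in [id] (fnil conj []) t-ns))

(defn summarise [times]
  (let [n    (count times)
        mean (/ (reduce + times) n)]
    {:count n
     :min   (apply min times)
     :max   (apply max times)
     :mean  mean
     :mad   (/ (reduce + (map #(Math/abs (long (- % mean))) times)) n)}))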
An alternate approach is to compute the stats using a "streaming" (or "one-pass") approach, in which the times themselves are not saved, but are used merely to update the current values of the summary statistics. The stats are amenable to such an approach—even the MAD, so long as M stands for "mean" and not "median". One problem that could be introduced by this approach is a loss of precision, unless fractions and big-int math were used, which could get unwieldy after millions of iterations.
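A minimal sketch of that one-pass idea (illustrative only; note the MAD term uses whatever running mean is available when each sample arrives, which is where the precision caveat above comes in):

;; One-pass ("streaming") stats: each sample updates the summary directly
;; and is then discarded. The :mad-sum uses the running mean at the moment
;; each sample arrives, so the final MAD is an approximation.
(defn stream-step [stats t]
  (let [n    (inc (:count stats))
        sum  (+ (:sum stats) t)
        mean (/ sum n)]
    {:count   n
     :sum     sum
     :min     (min (:min stats) t)
     :max     (max (:max stats) t)
     :mad-sum (+ (:mad-sum stats) (Math/abs (long (- t mean))))}))

(defn stream-summary [{:keys [count sum mad-sum] :as stats}]
  (assoc stats :mean (/ sum count) :mad (/ mad-sum count)))

;; Usage:
(stream-summary
  (reduce stream-step
          {:count 0 :sum 0 :min Long/MAX_VALUE :max 0 :mad-sum 0}
          [10 100 50 500 8]))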
Another approach, suggested by @ptaoussanis, is to let the times accumulate to a point, then summarize them into the statistics. For example, for the first 10,000 times, simply accumulate them in the vector. Then, use the next time as a trigger point for summarizing the first 10k times, and add the 10,001st time as the first time in a new accumulator vector. At the 20,001st time, times 10,001 through 20,000 are folded into the summary statistics, and the accumulator vector is emptied.
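A rough sketch of that trigger logic (illustrative only; the real proof-of-concept is the merge-times>stats! code shown earlier in the thread):

;; Sketch of "accumulate, then fold at a threshold": keep at most
;; `chunk-size` raw times, then merge them into the running summary
;; and start a fresh accumulator. Names are illustrative.
(def chunk-size 10000)

(defn fold-chunk [stats times]
  (let [n    (count times)
        sum  (reduce + times)
        mean (/ sum n)]
    {:count   (+ (:count stats 0) n)
     :time    (+ (:time stats 0) sum)
     :min     (apply min (:min stats Long/MAX_VALUE) times)
     :max     (apply max (:max stats 0) times)
     :mad-sum (+ (:mad-sum stats 0)
                 (reduce + (map #(Math/abs (long (- % mean))) times)))}))

(defn record-time [{:keys [times stats] :or {times [] stats {}}} t]
  (let [times (conj times t)]
    (if (< (count times) chunk-size)
      {:times times :stats stats}                   ; still accumulating
      {:times [] :stats (fold-chunk stats times)}))) ; threshold hit: fold & reset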
A significant challenge faced by both approaches is how to do the calculation apart from the profiled code. In other words, how can you prevent the statistics calculations from falsely inflating the amount of time reported on the 10,001st time? I don't have a good solution to this obstacle. If this is a deal-breaker, I suggest documenting the limitation somewhere.