NumberColumn.percentiles #35
Comments
Full disclosure: Never calculated a percentile in my life, but you said 'Newbs welcome.' |
Typo fixed! I've imagined this just returns a list of 100 values: Given that the "10", "20", etc. are implied by the order and the operation, I don't think they need to be specified as dictionary keys. |
Oh and newbs are certainly welcome. Will mark this ticket as "in-progress." |
Stumbled upon this, which looks like a good starting point: http://stackoverflow.com/a/2753343/24608 |
Ah, that's a good find. My next question, after looking at "Quantiles, Percentiles: Why so many ways to calculate them?" (which references at least nine methods here), was going to be: which one to use? Thanks! |
Ha, yeah. I honestly don't know a good answer to that question. Best thing to do (I think) would be to implement one that seems straightforward and then check it against other implementations (R? Excel?). The docstring should probably make note of what algorithm we chose / where we got it from. |
Btw, another hint: I think the right way to structure this is:
That way if you only need one you don't have to compute them all. |
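A minimal sketch of one structure with that property, not necessarily the one proposed above. `SketchColumn`, `percentile` and `percentiles` are hypothetical names used only for illustration, and the nearest-rank rule inside `percentile` is an assumption chosen for brevity:

```python
import math


class SketchColumn:
    """Hypothetical stand-in for a numeric column; not the library's class."""

    def __init__(self, values):
        # Sort once so each percentile lookup is a cheap index operation.
        self._sorted = sorted(values)

    def percentile(self, p):
        """Return a single percentile (nearest-rank rule, assumed here)."""
        rank = max(1, math.ceil(p * len(self._sorted) / 100))
        return self._sorted[rank - 1]

    def percentiles(self):
        """Build the full list of 100 values from the single-value method."""
        return [self.percentile(p) for p in range(1, 101)]
```

A caller who needs only the median calls `percentile(50)` and never pays for the other 99 values; `percentiles()` is just a loop over the same method.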
As to a good answer for the question of which method to use; Yeah, you and me both! FWIW, according to this, Excel and R both use the R-7 method. How those map to the stackoverflow one, I haven't quite figured out ... (1, 3, 4, 5, 5, 5, 5, 6, 7, 8, 8, 9) the stackoverflow
rather than:
Don't we want the second result set rather than the first? |
Hrmm. Looking at how OpenOffice does it (I don't have Excel), it returns values like the first list:
|
Yes, def. We want the exact percentile, not whatever value in the list happens to fall closest. |
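To make the distinction concrete, here is a rough sketch contrasting the two behaviours: picking a value that already exists in the list versus interpolating between neighbours. The function names are hypothetical, and the interpolating version assumes the R-7 rule mentioned earlier in the thread (position `(n - 1) * p / 100` in the sorted, zero-indexed list); it is not necessarily what ended up in the library.

```python
import math


def percentile_nearest_rank(values, p):
    """Return the existing value whose rank is closest above p percent."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def percentile_linear(values, p):
    """Interpolate between neighbours (R-7 style, assumed for this sketch)."""
    ordered = sorted(values)
    h = (len(ordered) - 1) * p / 100  # fractional, zero-based position
    lower = math.floor(h)
    upper = math.ceil(h)
    if lower == upper:
        # The position falls exactly on an element; no interpolation needed.
        return float(ordered[lower])
    fraction = h - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


data = [1, 3, 4, 5, 5, 5, 5, 6, 7, 8, 8, 9]
print(percentile_nearest_rank(data, 10), percentile_linear(data, 10))
```

For the twelve values quoted above, the 10th percentile is 3 under nearest-rank and roughly 3.1 under interpolation, which is the kind of difference being discussed.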
OT: Is there in the docs a use case for a |
It's the latter, though, as documented here: Chris On Wed, Apr 30, 2014 at 10:29 AM, John Heasly notifications@github.com wrote:
|
Hrm, well, I can't really speak from a statistics perspective, as I don't really have one(!). |
Now that I'm spelunking around, this is some finely crafted/workmanlike stuff! |
Thanks! I appreciate that! :) I am sensitive to the "no unnecessary rounding/conversion" issues, however, As best I can discern, Excel, OO, Google, etc all treat numbers internally Chris On Wed, Apr 30, 2014 at 11:03 AM, John Heasly notifications@github.com wrote:
|
Kind of like storing datetimes as UTC and converting to local at presentation time; a tried-and-true approach. |
Exactly. I've been mulling this for a week and the only reason I've been On Wed, Apr 30, 2014 at 11:13 PM, John Heasly notifications@github.com wrote:
|
Got something (finally) to test locally. Tonight I'd like to merge your latest from today into my fork and give my stuff a whirl ... |
Great! I made the changes to kill IntColumn and FloatColumn this morning, C On Thu, May 1, 2014 at 11:08 AM, John Heasly notifications@github.com wrote:
|
(Just putting this here for reference later.) Numpy's implementation of percentile: (It's predictably unintelligible.) |
TIL: the generalized form is actually the |
The numpy implementation unintelligibility did not disappoint. |
I've got the argumentless, return-a-list-of-100 percentile values working pretty good. But in executing the
bit, I can't pass an argument to my Traceback (most recent call last):
File "./test_script.py", line 68, in <module>
pct = states.columns['total'].percentile(5)
File "/Users/jheasly/Development/journalism/journalism/columns.py", line 73, in check
return func(c)
TypeError: percentile() takes exactly 2 arguments (1 given)
(The objectionable, referenced
If I comment out the decorator, it works fine. What to do?
Also, currently there's no error-checking or sanitizing of the integer that gets passed in. I was going to make sure it was a.) an integer and b.) between 1 and 100, inclusive. Anything else I should be sniffing for?
And on a housekeeping note, I won't be able to hit this again until sometime Saturday. It's been a bit more demanding than I'd anticipated when I raised my hand, but I'm learning and it's fun. Thanks fer puttin' up with my nonsense. |
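For reference, a `TypeError` like the one in the traceback is what you get when a decorator's wrapper calls the wrapped function with only the instance (`func(c)`) and drops whatever else the caller passed. The sketch below shows one common way to avoid that by forwarding `*args` and `**kwargs`. Everything in it is hypothetical: the real decorator's purpose and the eventual fix are not shown in this thread, so the empty-column guard and the `SketchColumn` class are assumptions for illustration only.

```python
import functools
import math


def check(func):
    """Hypothetical guard decorator: refuse to compute on an empty column."""

    @functools.wraps(func)
    def wrapper(column, *args, **kwargs):
        if not column.values:
            raise ValueError("cannot compute on an empty column")
        # Forward every argument; calling func(column) alone would drop
        # the percentile number and raise the TypeError seen above.
        return func(column, *args, **kwargs)

    return wrapper


class SketchColumn:
    """Hypothetical stand-in for the column class; not the library's code."""

    def __init__(self, values):
        self.values = list(values)

    @check
    def percentile(self, one_pct):
        # Nearest-rank rule, assumed here only to keep the sketch short.
        ordered = sorted(self.values)
        rank = max(1, math.ceil(one_pct * len(ordered) / 100))
        return ordered[rank - 1]
```

With the arguments forwarded, `SketchColumn([1, 2, 3]).percentile(50)` reaches the method with its argument intact, and methods that take no extra arguments keep working unchanged.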
Awesome! Really excited to have this included.
That's a bug in the decorator brought on by my not having one with an arg to test until now. I'll fix and let you know when it's committed.
You know I did this same sort of thing for That being said, it's perfectly valid to test for value, so I'd check that a.) it's a whole number (something like No worries on the "deadline." It makes me happier/saner to have someone else hacking on it. |
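A small sketch of the two checks discussed here (whole number, and within 1 to 100 inclusive). The helper name `validate_percentile_arg` is hypothetical, and the `% 1` test is just one way to detect a whole number; it is not necessarily the check suggested above.

```python
def validate_percentile_arg(p):
    """Return p as an int if it is a whole number from 1 to 100 inclusive."""
    if isinstance(p, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("percentile must be a number, not a bool")
    if p % 1 != 0:
        # Works for int, float and Decimal: 5 and 5.0 pass, 5.5 does not.
        raise ValueError("percentile must be a whole number")
    p = int(p)
    if not 1 <= p <= 100:
        raise ValueError("percentile must be between 1 and 100, inclusive")
    return p
```

Accepting `5.0` as well as `5` is a design choice in this sketch; a stricter version could insist on `int` only.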
On a whim did some more reading about this. Bugger, this stuff is a lot more complex / less standard than I realized. Didn't realize there was so much disagreement about how to calculate percentiles! |
Cool! Happy to have been of use!
Sounds good. Nifty integer check!
Awesomeness. Happy to do what I can.
Yeah, no kidding. Who knew stats could be such an untamed wilderness? |
Hi John, is this ready for a pull request? |
Hi Chris!
But I can omit the decorator, clean-up and run against my local little test script. As for real testing, is there a recommended bit of |
Doh! That's my fault. I forgot all about it. I've made myself a high-pri ticket and will get to it soon: For your testing I'd probably look at something like |
High-pri ticket(!). Whoa. I'll muck about with some test-making tonight. |
Blocker resolved by @mickaobrien! |
Excellent! Diving into the |
Coupla questions: |
Hi John! I merged this on a flight today so I didn't have access to your comments. I changed it to return a single value when a single value is requested and also made a few other small changes. And I added a few unit tests. I've opened two new tickets for minor outstanding issues, #129 and #130, but I wanted to go ahead and merge it. Thanks again for the contribution! I also added you to the |
Hey Chris! Thanks for the improvements in both the code and the tests. They're educational for me to look at/grok/study. I appreciate that. And thanks for the |
Absolutely! Please let me know if any of the changes don't make sense. Mostly I just pruned back things we weren't using from the original source ( Thanks again! |
Return a list of percentiles for a given column.