NumberColumn.percentiles #35

onyxfish · 2014-04-25T22:36:57Z

Return an list of percentiles for a given column.

jheasly · 2014-04-26T04:40:24Z

Full disclosure: Never calculated a percentile in my life, but you said 'Newbs welcome.'
Is there a vision for the structure of the result? Like mebbe a list of dicts ? i.e., ({10: xx}, {20: xx}, {30: xx}, {40: xx}, {50: xx}, {60: xx}, {70: xx}, {80: xx}, {90: xx})
(I like to start with the result, establish target, then work backward.)
p.s.: In the contrib docs, 'receive' in point 9 is misspelled.

onyxfish · 2014-04-26T15:18:08Z

Typo fixed!

I've imagined this just returns a list of 100 values: [1.2, 3.4, 4.5, ...]. I can see the temptation to make it a dict, but a list allows you to say "gimme the first percentile" by just doing percentiles[0], which seems right. I'd like to follow the same pattern for quartiles (#45) and quintiles (#46), so that they are all consistent.

Given that the "10", "20", etc is implied by the order and the operation, I don't think they need to be specified as dictionary keys..

onyxfish · 2014-04-26T15:18:53Z

Oh and newbs are certainly welcome. Will mark this ticket as "in-progress."

onyxfish · 2014-04-27T02:34:37Z

Stumbled upon this, which looks like a good starting point: http://stackoverflow.com/a/2753343/24608

#45 and #46 can be implemented by reference to this.

jheasly · 2014-04-27T06:07:17Z

Ah, that's a good find, as my next question after looking at this: Quantiles, Percentiles: Why so many ways to calculate them?, which references (at least) nine methods here was going to be which one to use? Thanks!

onyxfish · 2014-04-27T15:29:04Z

Ha, yeah. I honestly don't know a good answer to that question. Best thing to do (I think) would be to implement one that seems straightforward and then check it against other implementations (R? Excel?) Docstring should probably make note of what algorithm we chose / where we got it from.

onyxfish · 2014-04-27T15:30:41Z

Btw, another hint: I think the right way to structure this is:

Column.percentile(n) returns the nth percentile
Column.percentiles() returns an array of 100 percentiles by calling the former 100 times

That way if you only need one you don't have to compute them all.

jheasly · 2014-04-27T21:17:28Z

As to a good answer for the question of which method to use; Yeah, you and me both! FWIW, according to this, Excel and R both use the R-7 method. How those map to the stackoverflow one, I haven't quite figured out ...
Will note the algorithm source in a docstring.
And your structure hint seems a good one.
One thing I'm noticing immediately is the stackoverflow method doesn't return a member of the original set. i.e., if you've got this list of values:

(1, 3, 4, 5, 5, 5, 5, 6, 7, 8, 8, 9)

the stackoverflow percentile() method returns:

percentile  result
==========  ======
0.01        1.22
0.02        1.44
0.03        1.66
0.04        1.88
0.05        2.1
0.06        2.32
0.07        2.54
0.08        2.76
 ...

rather than:

percentile  result
==========  ======
0.01        1
0.02        1
0.03        1
0.04        1
0.05        3
0.06        3
0.07        3
0.08        3
 ...

Don't we want the second result set rather than the first?

jheasly · 2014-04-27T22:10:35Z

Hrmm. Looking at how OpenOffice does it (I don't have Excel), it returns values like the first list:
With a column A:

	A
1	1
2	3
3	4
4	5
5	5
6	5
7	5
8	6
9	7
10	8
11	8
12	9

=PERCENTILE(A1:A12; 0.25) returns 4.75 (which is the same valute returned by our stackoverflow method, yay!) so I guess this is a long way of saying go with the stackoverflow formula, no?

onyxfish · 2014-04-29T15:30:41Z

Yes, def. We want the exact percentile, not whatever value in the list happens to fall closest.

jheasly · 2014-04-30T15:29:07Z

OT: Is there in the docs a use case for a NumberColumn over an IntegerColumn?
Or is it just more of a case that a NumberColumn has characteristics of/is a base class for both an IntegerColumn and DecimalColumn?
Just curious ...

onyxfish · 2014-04-30T15:39:17Z

It's the latter, though, as documented here:
#64 I'm probably going to be
eliminating both IntegerColumn and DecimalColumn. I think it makes more
sense to just have NumberColumn treat everything as Decimal type. I've been
putting off making the changes because it'll be a hard thing to undo and I
want to be certain I'm not shutting out any useful cases by doing that.
Thoughts.

Chris

On Wed, Apr 30, 2014 at 10:29 AM, John Heasly notifications@github.comwrote:

OT: Is there in the docs a use case for an IntergerColumn over a
NumberColumn?
Or was it more of a case that a NumberColumn has characteristics of/is a
base class for both an IntegerColumn and DecimalColumn?
Just curious ...

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/35#issuecomment-41810845
.

jheasly · 2014-04-30T16:00:59Z

Hrm, well, I can't really speak from a statistics perspective, as I don't reallly have one(!).
The only thing that comes to mind is somtimes folk from the not-so-computational side on NICAR list get finicky about the display of things/getting rid of unwanted decimal bits and/or leading/trailing zeroes, which I guess ends up being either an issue of handling in the templating or just having to grok the difference between integer and decimal from the outset.
I guess at the end of the day, it's a really more of an issue of what does the Major Spreadsheet™ (Excel) and its Knock-offs (OpenOffice, Google) offer?

jheasly · 2014-04-30T16:03:46Z

Now that I'm spelunking around, this is some finely crafted/workmanlike stuff!
i.e., Lots of approaches to emulate (i.e., steal) here!

onyxfish · 2014-04-30T16:24:10Z

Thanks! I appreciate that! :)

I am sensitive to the "no unnecessary rounding/conversion" issues, however,
I'm less worried about that with a purely analytical library than I would
be if there was a presentation component to this. Eliminating the extra
type has one major benefit: you couldn't specify the wrong type when
computing a new column. So for instance, if you create a column by dividing
two integers, you need to make sure you specify the new column is Decimal.
I can see people getting this wrong and it creating errors.

As best I can discern, Excel, OO, Google, etc all treat numbers internally
as decimals and it's only at presentation time that something is
"int-ifed". I know from writing csvkit that the Excel file formats only
represent numbers one way, so I presume that's true and runtime as well.

Chris

On Wed, Apr 30, 2014 at 11:03 AM, John Heasly notifications@github.comwrote:

Now that I'm spelunking around, this is some finely crafted/workmanlike
stuff!
i.e., Lots of approaches to emulate (i.e., steal) here!

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/35#issuecomment-41815141
.

jheasly · 2014-05-01T04:13:37Z

Kind of like storing datetimes as UTC and converting to local at presentation time; a tried-and-true approach.

onyxfish · 2014-05-01T12:28:19Z

Exactly. I've been mulling this for a week and the only reason I've been
able to come up with not to eliminate IntColumn is that sometimes it might
be more performant. That's not a good enough reason, so I'll probably pull
the trigger on this today.

On Wed, Apr 30, 2014 at 11:13 PM, John Heasly notifications@github.comwrote:

Kind of like storing datetimes as UTC and converting to local at
presentation time; a tried-and-true approach.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/35#issuecomment-41878639
.

jheasly · 2014-05-01T16:08:43Z

Got something (finally) to test locally. Tonight I'd like to merge your latest from today into my fork and give my stuff a whirl ...

onyxfish · 2014-05-01T17:28:29Z

Great! I made the changes to kill IntColumn and FloatColumn this morning,
which ended up being messier than I had hoped, but I think it's sorted now.
The cast and validate args when creating a table are gone now too. Data is
always cast the first time a table is created.

C

On Thu, May 1, 2014 at 11:08 AM, John Heasly notifications@github.comwrote:

Got something (finally) to test locally. Tonight I'd like to merge your
latest from today into my fork and give my stuff a whirl ...

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/35#issuecomment-41924645
.

onyxfish · 2014-05-01T18:51:32Z

(Just putting this here for reference later.)

Numpy's implementation of percentile:
https://github.com/numpy/numpy/blob/v1.8.1/numpy/lib/function_base.py#L2720

(It's predictably unintelligible.)

onyxfish · 2014-05-02T03:40:32Z

TIL: the generalized form is actually the quantile.

http://en.wikipedia.org/wiki/Quantile

jheasly · 2014-05-02T05:50:23Z

The numpy implementation unintelligibility did not disappoint.
The wikipedia Quantile explanation was a little more friendly/grok-able.

jheasly · 2014-05-02T06:23:42Z

I've got the argumentless, return-a-list-of-100 percentile values working pretty good.

But in executing the

Column.percentile(n) returns the nth percentile

bit, I can't pass an argument to my percentile() function with the @no_null_computations decorator. I get this:

Traceback (most recent call last):
  File "./test_script.py", line 68, in <module>
    pct = states.columns['total'].percentile(5)
  File "/Users/jheasly/Development/journalism/journalism/columns.py", line 73, in check
    return func(c)
TypeError: percentile() takes exactly 2 arguments (1 given)

(The objectionable, referenced line 73 is here.)

If I comment out the decorator, it works fine. What to do?

Also, currently there's no error-checking or sanitizing of the integer that gets passed in. I was going to make sure it was a.) an integer and b.) between 1 and 100, inclusive. Anything else I should be sniffing for?

And on a housekeeping note, I won't be able to hit this again until sometime Saturday. It's been a bit more demanding than I'd anticipated when I raised my hand, but I'm learning and it's fun. Thanks fer puttin' up with my nonsense.

onyxfish · 2014-05-02T13:15:06Z

Awesome! Really excited to have this included.

bit, I can't pass an argument to my percentile() function with the @no_null_computations decorator.

That's a bug in the decorator brought on by my not having one with an arg to test until now. I'll fix and let you know when it's committed.

Also, currently there's no error-checking or sanitizing of the integer that gets passed in. I was going to make sure it was a.) an integer and b.) between 1 and 100, inclusive. Anything else I should be sniffing for?

You know I did this same sort of thing for Table.__init__ last night and was then reminded (after some Googling) that this is actually considered bad juju in Python. Due to it's "duck-typed" nature, you generally shouldn't test for type. Even though you expect an integer, a float, Decimal, or something other invented type could also be valid.

That being said, it's perfectly valid to test for value, so I'd check that it's a.) it's a whole number (something like n % 1 == 0) and b.) it's between 1 and 100. I don't think you need to check anything else.

No worries on the "deadline." It makes me happier/saner to have someone else hacking on it.

onyxfish · 2014-05-03T17:40:50Z

On a whim did some more reading about this. Bugger, this stuff is a lot more complex / less standard than I realized. Didn't realize there was so much disagreement about how to calculate percentiles!

jheasly · 2014-05-03T21:23:10Z

Awesome! Really excited to have this included.

Cool! Happy to have been of use!

so I'd check that it's a.) it's a whole number (something like n % 1 == 0) and b.) it's between 1 and 100. I don't think you need to check anything else.

Sounds good. Nifty integer check!

No worries on the "deadline." It makes me happier/saner to have someone else hacking on it.

Awesomeness. Happy to do what I can.

a lot more complex / less standard than I realized.

Yeah, no kidding. Who knew stats could be such a untamed wilderness?

onyxfish · 2014-05-13T01:04:58Z

Hi John, is this ready for a pull request?

jheasly · 2014-05-13T01:34:17Z

Hi Chris!
I was waiting for this:

That's a bug in the decorator brought on by my not having one with an arg to test until now. I'll fix and let you know when it's committed.

But I can omit the decorator, clean-up and run against my local little test script.

As for real testing, is there a recommended bit of test_columns.py to use as a model?

onyxfish · 2014-05-13T13:58:03Z

Doh! That's my fault. I forgot all about it. I've made myself a high-pri ticket and will get to it soon:

#122

For your testing I'd probably look at something like test_counts. That seems like the most similar method.

jheasly · 2014-05-13T16:47:04Z

High-pri ticket(!). Whoa.

I'll muck about with some test-making tonight.

onyxfish · 2014-05-15T05:13:04Z

Blocker resolved by @mickaobrien!

jheasly · 2014-05-16T04:24:45Z

Excellent! Diving into the unittest module ...

jheasly · 2014-05-16T07:19:46Z

Coupla questions:
• Right now, percentile() returns a list (running the unittests revealed this to me!), whether it's delivering a requested percentile or the whole shebang. A list is okay, right?
• Speaking of the whole shebang (all 100 percentiles), should I make a test for that case too?

onyxfish · 2014-05-31T21:21:04Z

HI John! I merged this on a flight today so I didn't have access to your comments. I changed it return a single value when a single value is requested and also made a few other small changes. And I added a few unit tests. I've opened two new tickets for minor outstanding issues, #129 and #130, but I wanted to go ahead and merge it.

Thanks again for the contribution! I also added you to the AUTHORS file.

jheasly · 2014-06-02T05:39:06Z

Hey Chris! Thanks for the improvements in both the code and the tests. They're educational for me to look at/grok/study. I appreciate that.

And thanks for the AUTHORS addition. Next time I visit, I'm going to show my Mom.

onyxfish · 2014-06-02T12:30:57Z

Absolutely! Please let me know if any of the changes don't make sense. Mostly I just pruned back things we weren't using from the original source (lambda x: x) and tried to clean up the variable names, etc. (Also there was one issue where you were casting float to Decimal, which is bad juju and fails on Python 2.6).

Thanks again!

onyxfish added feature labels Apr 25, 2014

onyxfish changed the title ~~Column.percentile~~ Column.percentiles Apr 26, 2014

onyxfish changed the title ~~Column.percentiles~~ NumberColumn.percentiles Apr 26, 2014

onyxfish added the in-progress label Apr 26, 2014

onyxfish added priority-normal and removed priority-low labels Apr 27, 2014

onyxfish added this to the 0.3.0 milestone May 1, 2014

onyxfish modified the milestones: 0.4.0, 0.3.0 May 5, 2014

onyxfish mentioned this issue May 13, 2014

no_null_computations needs to accept args #122

Closed

jheasly mentioned this issue May 22, 2014

NumberColumn.Percentile #128

Merged

onyxfish closed this as completed May 31, 2014

jpmckinney mentioned this issue Oct 17, 2023

Support "Integer" data types when data has no fractional numbers (floats, decimals) wireservice/agate-sql#43

Closed

NumberColumn.percentiles #35

NumberColumn.percentiles #35

Comments

onyxfish commented Apr 25, 2014

jheasly commented Apr 26, 2014

onyxfish commented Apr 26, 2014

onyxfish commented Apr 26, 2014

onyxfish commented Apr 27, 2014

jheasly commented Apr 27, 2014

onyxfish commented Apr 27, 2014

onyxfish commented Apr 27, 2014

jheasly commented Apr 27, 2014

jheasly commented Apr 27, 2014

onyxfish commented Apr 29, 2014

jheasly commented Apr 30, 2014

onyxfish commented Apr 30, 2014

jheasly commented Apr 30, 2014

jheasly commented Apr 30, 2014

onyxfish commented Apr 30, 2014

jheasly commented May 1, 2014

onyxfish commented May 1, 2014

jheasly commented May 1, 2014

onyxfish commented May 1, 2014

onyxfish commented May 1, 2014

onyxfish commented May 2, 2014

jheasly commented May 2, 2014

jheasly commented May 2, 2014

onyxfish commented May 2, 2014

onyxfish commented May 3, 2014

jheasly commented May 3, 2014

onyxfish commented May 13, 2014

jheasly commented May 13, 2014

onyxfish commented May 13, 2014

jheasly commented May 13, 2014

onyxfish commented May 15, 2014

jheasly commented May 16, 2014

jheasly commented May 16, 2014

onyxfish commented May 31, 2014

jheasly commented Jun 2, 2014

onyxfish commented Jun 2, 2014