Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NumberColumn.percentiles #35

Closed
onyxfish opened this issue Apr 25, 2014 · 36 comments
Closed

NumberColumn.percentiles #35

onyxfish opened this issue Apr 25, 2014 · 36 comments
Labels
Milestone

Comments

@onyxfish
Copy link
Collaborator

Return an list of percentiles for a given column.

@onyxfish onyxfish changed the title Column.percentile Column.percentiles Apr 26, 2014
@onyxfish onyxfish changed the title Column.percentiles NumberColumn.percentiles Apr 26, 2014
@jheasly
Copy link
Contributor

jheasly commented Apr 26, 2014

Full disclosure: Never calculated a percentile in my life, but you said 'Newbs welcome.'
Is there a vision for the structure of the result? Like mebbe a list of dicts ? i.e., ({10: xx}, {20: xx}, {30: xx}, {40: xx}, {50: xx}, {60: xx}, {70: xx}, {80: xx}, {90: xx})
(I like to start with the result, establish target, then work backward.)
p.s.: In the contrib docs, 'receive' in point 9 is misspelled.

@onyxfish
Copy link
Collaborator Author

Typo fixed!

I've imagined this just returns a list of 100 values: [1.2, 3.4, 4.5, ...]. I can see the temptation to make it a dict, but a list allows you to say "gimme the first percentile" by just doing percentiles[0], which seems right. I'd like to follow the same pattern for quartiles (#45) and quintiles (#46), so that they are all consistent.

Given that the "10", "20", etc is implied by the order and the operation, I don't think they need to be specified as dictionary keys..

@onyxfish
Copy link
Collaborator Author

Oh and newbs are certainly welcome. Will mark this ticket as "in-progress."

@onyxfish
Copy link
Collaborator Author

Stumbled upon this, which looks like a good starting point: http://stackoverflow.com/a/2753343/24608

#45 and #46 can be implemented by reference to this.

@jheasly
Copy link
Contributor

jheasly commented Apr 27, 2014

Ah, that's a good find, as my next question after looking at this: Quantiles, Percentiles: Why so many ways to calculate them?, which references (at least) nine methods here was going to be which one to use? Thanks!

@onyxfish
Copy link
Collaborator Author

Ha, yeah. I honestly don't know a good answer to that question. Best thing to do (I think) would be to implement one that seems straightforward and then check it against other implementations (R? Excel?) Docstring should probably make note of what algorithm we chose / where we got it from.

@onyxfish
Copy link
Collaborator Author

Btw, another hint: I think the right way to structure this is:

Column.percentile(n) returns the nth percentile
Column.percentiles() returns an array of 100 percentiles by calling the former 100 times

That way if you only need one you don't have to compute them all.

@jheasly
Copy link
Contributor

jheasly commented Apr 27, 2014

As to a good answer for the question of which method to use; Yeah, you and me both! FWIW, according to this, Excel and R both use the R-7 method. How those map to the stackoverflow one, I haven't quite figured out ...
Will note the algorithm source in a docstring.
And your structure hint seems a good one.
One thing I'm noticing immediately is the stackoverflow method doesn't return a member of the original set. i.e., if you've got this list of values:

(1, 3, 4, 5, 5, 5, 5, 6, 7, 8, 8, 9)

the stackoverflow percentile() method returns:

percentile  result
==========  ======
0.01        1.22
0.02        1.44
0.03        1.66
0.04        1.88
0.05        2.1
0.06        2.32
0.07        2.54
0.08        2.76
 ... 

rather than:

percentile  result
==========  ======
0.01        1
0.02        1
0.03        1
0.04        1
0.05        3
0.06        3
0.07        3
0.08        3
 ... 

Don't we want the second result set rather than the first?

@jheasly
Copy link
Contributor

jheasly commented Apr 27, 2014

Hrmm. Looking at how OpenOffice does it (I don't have Excel), it returns values like the first list:
With a column A:

A
1 1
2 3
3 4
4 5
5 5
6 5
7 5
8 6
9 7
10 8
11 8
12 9

=PERCENTILE(A1:A12; 0.25) returns 4.75 (which is the same valute returned by our stackoverflow method, yay!) so I guess this is a long way of saying go with the stackoverflow formula, no?

@onyxfish
Copy link
Collaborator Author

Yes, def. We want the exact percentile, not whatever value in the list happens to fall closest.

@jheasly
Copy link
Contributor

jheasly commented Apr 30, 2014

OT: Is there in the docs a use case for a NumberColumn over an IntegerColumn?
Or is it just more of a case that a NumberColumn has characteristics of/is a base class for both an IntegerColumn and DecimalColumn?
Just curious ...

@onyxfish
Copy link
Collaborator Author

It's the latter, though, as documented here:
#64 I'm probably going to be
eliminating both IntegerColumn and DecimalColumn. I think it makes more
sense to just have NumberColumn treat everything as Decimal type. I've been
putting off making the changes because it'll be a hard thing to undo and I
want to be certain I'm not shutting out any useful cases by doing that.
Thoughts.

Chris

On Wed, Apr 30, 2014 at 10:29 AM, John Heasly notifications@github.comwrote:

OT: Is there in the docs a use case for an IntergerColumn over a
NumberColumn?
Or was it more of a case that a NumberColumn has characteristics of/is a
base class for both an IntegerColumn and DecimalColumn?
Just curious ...


Reply to this email directly or view it on GitHubhttps://github.com//issues/35#issuecomment-41810845
.

@jheasly
Copy link
Contributor

jheasly commented Apr 30, 2014

Hrm, well, I can't really speak from a statistics perspective, as I don't reallly have one(!).
The only thing that comes to mind is somtimes folk from the not-so-computational side on NICAR list get finicky about the display of things/getting rid of unwanted decimal bits and/or leading/trailing zeroes, which I guess ends up being either an issue of handling in the templating or just having to grok the difference between integer and decimal from the outset.
I guess at the end of the day, it's a really more of an issue of what does the Major Spreadsheet™ (Excel) and its Knock-offs (OpenOffice, Google) offer?

@jheasly
Copy link
Contributor

jheasly commented Apr 30, 2014

Now that I'm spelunking around, this is some finely crafted/workmanlike stuff!
i.e., Lots of approaches to emulate (i.e., steal) here!

@onyxfish
Copy link
Collaborator Author

Thanks! I appreciate that! :)

I am sensitive to the "no unnecessary rounding/conversion" issues, however,
I'm less worried about that with a purely analytical library than I would
be if there was a presentation component to this. Eliminating the extra
type has one major benefit: you couldn't specify the wrong type when
computing a new column. So for instance, if you create a column by dividing
two integers, you need to make sure you specify the new column is Decimal.
I can see people getting this wrong and it creating errors.

As best I can discern, Excel, OO, Google, etc all treat numbers internally
as decimals and it's only at presentation time that something is
"int-ifed". I know from writing csvkit that the Excel file formats only
represent numbers one way, so I presume that's true and runtime as well.

Chris

On Wed, Apr 30, 2014 at 11:03 AM, John Heasly notifications@github.comwrote:

Now that I'm spelunking around, this is some finely crafted/workmanlike
stuff!
i.e., Lots of approaches to emulate (i.e., steal) here!


Reply to this email directly or view it on GitHubhttps://github.com//issues/35#issuecomment-41815141
.

@jheasly
Copy link
Contributor

jheasly commented May 1, 2014

Kind of like storing datetimes as UTC and converting to local at presentation time; a tried-and-true approach.

@onyxfish
Copy link
Collaborator Author

onyxfish commented May 1, 2014

Exactly. I've been mulling this for a week and the only reason I've been
able to come up with not to eliminate IntColumn is that sometimes it might
be more performant. That's not a good enough reason, so I'll probably pull
the trigger on this today.

On Wed, Apr 30, 2014 at 11:13 PM, John Heasly notifications@github.comwrote:

Kind of like storing datetimes as UTC and converting to local at
presentation time; a tried-and-true approach.


Reply to this email directly or view it on GitHubhttps://github.com//issues/35#issuecomment-41878639
.

@jheasly
Copy link
Contributor

jheasly commented May 1, 2014

Got something (finally) to test locally. Tonight I'd like to merge your latest from today into my fork and give my stuff a whirl ...

@onyxfish
Copy link
Collaborator Author

onyxfish commented May 1, 2014

Great! I made the changes to kill IntColumn and FloatColumn this morning,
which ended up being messier than I had hoped, but I think it's sorted now.
The cast and validate args when creating a table are gone now too. Data is
always cast the first time a table is created.

C

On Thu, May 1, 2014 at 11:08 AM, John Heasly notifications@github.comwrote:

Got something (finally) to test locally. Tonight I'd like to merge your
latest from today into my fork and give my stuff a whirl ...


Reply to this email directly or view it on GitHubhttps://github.com//issues/35#issuecomment-41924645
.

@onyxfish onyxfish added this to the 0.3.0 milestone May 1, 2014
@onyxfish
Copy link
Collaborator Author

onyxfish commented May 1, 2014

(Just putting this here for reference later.)

Numpy's implementation of percentile:
https://github.com/numpy/numpy/blob/v1.8.1/numpy/lib/function_base.py#L2720

(It's predictably unintelligible.)

@onyxfish
Copy link
Collaborator Author

onyxfish commented May 2, 2014

TIL: the generalized form is actually the quantile.

http://en.wikipedia.org/wiki/Quantile

@jheasly
Copy link
Contributor

jheasly commented May 2, 2014

The numpy implementation unintelligibility did not disappoint.
The wikipedia Quantile explanation was a little more friendly/grok-able.

@jheasly
Copy link
Contributor

jheasly commented May 2, 2014

I've got the argumentless, return-a-list-of-100 percentile values working pretty good.

But in executing the

Column.percentile(n) returns the nth percentile

bit, I can't pass an argument to my percentile() function with the @no_null_computations decorator. I get this:

Traceback (most recent call last):
  File "./test_script.py", line 68, in <module>
    pct = states.columns['total'].percentile(5)
  File "/Users/jheasly/Development/journalism/journalism/columns.py", line 73, in check
    return func(c)
TypeError: percentile() takes exactly 2 arguments (1 given)

(The objectionable, referenced line 73 is here.)

If I comment out the decorator, it works fine. What to do?

Also, currently there's no error-checking or sanitizing of the integer that gets passed in. I was going to make sure it was a.) an integer and b.) between 1 and 100, inclusive. Anything else I should be sniffing for?

And on a housekeeping note, I won't be able to hit this again until sometime Saturday. It's been a bit more demanding than I'd anticipated when I raised my hand, but I'm learning and it's fun. Thanks fer puttin' up with my nonsense.

@onyxfish
Copy link
Collaborator Author

onyxfish commented May 2, 2014

Awesome! Really excited to have this included.

bit, I can't pass an argument to my percentile() function with the @no_null_computations decorator.

That's a bug in the decorator brought on by my not having one with an arg to test until now. I'll fix and let you know when it's committed.

Also, currently there's no error-checking or sanitizing of the integer that gets passed in. I was going to make sure it was a.) an integer and b.) between 1 and 100, inclusive. Anything else I should be sniffing for?

You know I did this same sort of thing for Table.__init__ last night and was then reminded (after some Googling) that this is actually considered bad juju in Python. Due to it's "duck-typed" nature, you generally shouldn't test for type. Even though you expect an integer, a float, Decimal, or something other invented type could also be valid.

That being said, it's perfectly valid to test for value, so I'd check that it's a.) it's a whole number (something like n % 1 == 0) and b.) it's between 1 and 100. I don't think you need to check anything else.

No worries on the "deadline." It makes me happier/saner to have someone else hacking on it.

@onyxfish
Copy link
Collaborator Author

onyxfish commented May 3, 2014

On a whim did some more reading about this. Bugger, this stuff is a lot more complex / less standard than I realized. Didn't realize there was so much disagreement about how to calculate percentiles!

@jheasly
Copy link
Contributor

jheasly commented May 3, 2014

Awesome! Really excited to have this included.

Cool! Happy to have been of use!

so I'd check that it's a.) it's a whole number (something like n % 1 == 0) and b.) it's between 1 and 100. I don't think you need to check anything else.

Sounds good. Nifty integer check!

No worries on the "deadline." It makes me happier/saner to have someone else hacking on it.

Awesomeness. Happy to do what I can.

a lot more complex / less standard than I realized.

Yeah, no kidding. Who knew stats could be such a untamed wilderness?

@onyxfish onyxfish modified the milestones: 0.4.0, 0.3.0 May 5, 2014
@onyxfish
Copy link
Collaborator Author

Hi John, is this ready for a pull request?

@jheasly
Copy link
Contributor

jheasly commented May 13, 2014

Hi Chris!
I was waiting for this:

That's a bug in the decorator brought on by my not having one with an arg to test until now. I'll fix and let you know when it's committed.

But I can omit the decorator, clean-up and run against my local little test script.

As for real testing, is there a recommended bit of test_columns.py to use as a model?

@onyxfish
Copy link
Collaborator Author

Doh! That's my fault. I forgot all about it. I've made myself a high-pri ticket and will get to it soon:

#122

For your testing I'd probably look at something like test_counts. That seems like the most similar method.

@jheasly
Copy link
Contributor

jheasly commented May 13, 2014

High-pri ticket(!). Whoa.

I'll muck about with some test-making tonight.

@onyxfish
Copy link
Collaborator Author

Blocker resolved by @mickaobrien!

@jheasly
Copy link
Contributor

jheasly commented May 16, 2014

Excellent! Diving into the unittest module ...

@jheasly
Copy link
Contributor

jheasly commented May 16, 2014

Coupla questions:
• Right now, percentile() returns a list (running the unittests revealed this to me!), whether it's delivering a requested percentile or the whole shebang. A list is okay, right?
• Speaking of the whole shebang (all 100 percentiles), should I make a test for that case too?

@onyxfish
Copy link
Collaborator Author

HI John! I merged this on a flight today so I didn't have access to your comments. I changed it return a single value when a single value is requested and also made a few other small changes. And I added a few unit tests. I've opened two new tickets for minor outstanding issues, #129 and #130, but I wanted to go ahead and merge it.

Thanks again for the contribution! I also added you to the AUTHORS file.

@jheasly
Copy link
Contributor

jheasly commented Jun 2, 2014

Hey Chris! Thanks for the improvements in both the code and the tests. They're educational for me to look at/grok/study. I appreciate that.

And thanks for the AUTHORS addition. Next time I visit, I'm going to show my Mom.

@onyxfish
Copy link
Collaborator Author

onyxfish commented Jun 2, 2014

Absolutely! Please let me know if any of the changes don't make sense. Mostly I just pruned back things we weren't using from the original source (lambda x: x) and tried to clean up the variable names, etc. (Also there was one issue where you were casting float to Decimal, which is bad juju and fails on Python 2.6).

Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants