ENH: Add normalize to crosstab #12569

Closed
nickeubank opened this issue Mar 9, 2016 · 8 comments
Labels
API Design Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@nickeubank
Contributor

It'd be great to have a simple normalization option for crosstab to get shares rather than frequencies.

Something that would behave like:

# Share of all observations that fall in a given cell (w_mobile is my DataFrame)
def normalize(x):
    return len(x)/len(w_mobile.language)

pd.crosstab(w_mobile.language,w_mobile.carrier, values=w_mobile.language, aggfunc=normalize)

as just an option.

(The ability to do row normalizations and column normalizations would also be great, so that all entries in a row add to 1 or all entries in a column add to 1.) Similar in behavior (for row normalizations) to:

import pandas as pd

df = pd.DataFrame({'carrier':['a','a','b','b','b'], 'language':['english','spanish', 'english','spanish','spanish']})

# For each carrier, compute the share of each language, then join the results as columns.
shares = []
for carrier in df.carrier.unique():
    temp = df.query('carrier=="{}"'.format(carrier)).language.value_counts(normalize=True)
    temp.name = carrier
    shares.append(temp)

ctab = pd.concat(shares, axis=1)


Out[1]: 
           a         b
english  0.5  0.333333
spanish  0.5  0.666667

But with a command like: pd.crosstab(df.language, df.carrier, normalization='row')
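For reference, a minimal sketch of the three normalizations discussed above, computed by dividing a plain crosstab by its totals. The keyword `normalization='row'` is only the spelling proposed in this post; pandas versions that ship the feature expose it as `normalize`, with the values `'all'`, `'index'` and `'columns'`. Note that the loop's output above has each carrier column summing to 1, which corresponds to `normalize='columns'` when language is on the rows. The variable names below are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "carrier": ["a", "a", "b", "b", "b"],
    "language": ["english", "spanish", "english", "spanish", "spanish"],
})

# Raw frequencies: languages on the rows, carriers on the columns.
counts = pd.crosstab(df.language, df.carrier)

# Shares of the grand total (all cells together sum to 1).
overall = counts / counts.values.sum()

# Shares within each carrier (each column sums to 1), as in the loop above.
within_carrier = counts / counts.sum(axis=0)

# Shares within each language (each row sums to 1).
within_language = counts.div(counts.sum(axis=1), axis=0)

# Built-in equivalent, assuming a pandas version whose crosstab accepts `normalize`.
builtin = pd.crosstab(df.language, df.carrier, normalize="columns")
```

Dividing by totals works on any existing frequency table, so it is a reasonable fallback where the keyword is not available.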

@sinhrks sinhrks added API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Mar 9, 2016
@sinhrks
Member

sinhrks commented Mar 9, 2016

I also use this type of operation in pivot_table. Could it be part of aggfunc, maybe as pct, pct_row and pct_col?
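A rough sketch of what the pivot_table variant could look like when written by hand. The helper `pct_table` and its `how` values are hypothetical and only mirror the suggested pct / pct_row / pct_col idea; they are not pandas options.

```python
import pandas as pd

def pct_table(df, index, columns, how="all"):
    """Hypothetical helper: count rows with pivot_table, then convert to shares."""
    # Count rows per cell by summing a column of ones.
    counts = df.assign(_n=1).pivot_table(
        index=index, columns=columns, values="_n", aggfunc="sum", fill_value=0
    )
    if how == "all":   # every cell as a share of the grand total
        return counts / counts.values.sum()
    if how == "row":   # each row sums to 1
        return counts.div(counts.sum(axis=1), axis=0)
    if how == "col":   # each column sums to 1
        return counts.div(counts.sum(axis=0), axis=1)
    raise ValueError("how must be 'all', 'row' or 'col'")

df = pd.DataFrame({
    "carrier": ["a", "a", "b", "b", "b"],
    "language": ["english", "spanish", "english", "spanish", "spanish"],
})
row_shares = pct_table(df, "language", "carrier", how="row")
```

A true aggfunc only sees one cell's values at a time, so it cannot know the row or column totals on its own; that is why this sketch normalizes after aggregating.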

@jreback jreback added this to the Next Major Release milestone Mar 9, 2016
@nickeubank
Contributor Author

I like it. I'll try a few things and submit a PR.

@nickeubank
Contributor Author

Relatedly, crosstab also has a bug -- it counts np.nan in margin totals even when dropna=True.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a':[1,2,2,2,2,np.nan],'b':[3,3,4,4,4,4]})
pd.crosstab(df.a,df.b, margins=True)
Out[233]: 
b    3  4  All
a             
1.0  1  0    1
2.0  1  3    4
All  2  4    6

I don't think this is related to #12558.
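One way to check for this, sketched under the assumption that the margins should equal the sums of the cells that are actually displayed. It recomputes the totals from a margin-free crosstab and compares them with the reported "All" row. The variable names are illustrative, and the result depends on the pandas version, so no output is shown.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 2, 2, 2, np.nan], "b": [3, 3, 4, 4, 4, 4]})

# Table body only; the row whose `a` is NaN is not counted here.
body = pd.crosstab(df.a, df.b)
expected_col_totals = body.sum(axis=0)
expected_grand_total = int(body.values.sum())

# Totals as reported by crosstab itself.
with_margins = pd.crosstab(df.a, df.b, margins=True)
reported_col_totals = with_margins.loc["All"].drop("All")

# True when the reported column totals match the sums of the displayed cells.
margins_consistent = bool(
    (reported_col_totals.values == expected_col_totals.values).all()
)
print("margins consistent with table body:", margins_consistent)
```

In the output above, the body of column 4 sums to 3 while the margin reports 4, which is the discrepancy this check would flag.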

@jreback
Contributor

jreback commented Mar 9, 2016

might be #4003

@nickeubank
Contributor Author

@jreback don't think so -- that's double counting. This only happens on columns with np.nan, and the total increments by the number of np.nans.

@jreback
Contributor

jreback commented Mar 9, 2016

ok if you can't find a related one, then pls open a new issue

@jreback
Contributor

jreback commented Mar 9, 2016

in fact, if you can, pls open a new issue (i'll tag it as a master issue) with a checkbox list for all of the crosstab issues (each individual one has its own issue and we just reference them), like #11485

@nickeubank
Contributor Author

Posted a PR: #12578. Input welcome, @sinhrks.

@jreback jreback modified the milestones: 0.18.1, Next Major Release Apr 4, 2016
This issue was closed.