[ENH] adorn functions #993

Open
thatlittleboy opened this issue Jan 18, 2022 · 2 comments

@thatlittleboy
Contributor

thatlittleboy commented Jan 18, 2022

Brief Description

There are a few adorn_* functions from R's janitor that are not yet ported over to pyjanitor. Janitor docs here.

I'm specifically looking at:

  • adorn_totals: adds a totals row, a totals column, or both
  • adorn_percentages: converts the cell values into percentages, calculated along either axis or over the entire dataframe. In the R formulation, these are floats between 0 and 1, not the 0-100 percentages.
  • adorn_pct_formatting: formats the 0 to 1 values into the 0 to 100 percentage values, with rounding/formatting options
  • adorn_ns: adds the raw counts back into the cell values (meant to be run after adorn_percentages), so each cell has both percentage & count info, like "56 (24.3%)" for example.
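To make the three percentage-related steps concrete, here is a rough sketch in plain pandas. It does not use any pyjanitor API (none exists yet for these functions); the `counts` table and its numbers are made up for illustration, and the "count (percentage)" layout simply follows the example in the last bullet.

```python
import pandas as pd

# Hypothetical counts table; the numbers are invented for illustration.
counts = pd.DataFrame({"yes": [56, 30], "no": [24, 90]}, index=["g1", "g2"])

# Step 1 (what adorn_percentages does, row-wise): fractions between 0 and 1.
fractions = counts.div(counts.sum(axis=1), axis=0)

# Step 2 (what adorn_pct_formatting does): 0-1 fractions become "xx.x%" strings.
formatted = fractions.apply(lambda col: col.map("{:.1%}".format))

# Step 3 (what adorn_ns does): put the raw counts back next to the percentages.
with_ns = counts.astype(str) + " (" + formatted + ")"

print(with_ns)
```

Note that step 3 needs the original `counts`, which is exactly the information that is lost if only the percentage table is passed along.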

I imagine these might be particularly useful for those doing data reporting.
These should go into the functions module.

Example API

In pyjanitor, I don't think having four separate functions works (how would we enforce that adorn_ns comes after adorn_percentages? where would we get the counts required for adorn_ns? etc.).

Perhaps we could just have an adorn_totals and an adorn_percentages (which encapsulates the behaviour of adorn_pct_formatting and adorn_ns as well, controlled via function parameters).

adorn_totals

This function should mirror the R function almost 1-1.

>>> df = pd.DataFrame({"a": [6, np.nan, 2.5], "b": list("xyz")}); df
     a  b
0  6.0  x
1  NaN  y
2  2.5  z
>>> df.adorn_totals(
...     subset=None,  # or list of index/col names; preferably can take in ranges like `slice("col_a","col_d")` also since `.loc` supports it
...     axis="col",  # index/0/row or column/1/col or both
...     fill_value='-',  # str
...     name='Total',  # str
... )
         a  b
0      6.0  x
1      NaN  y
2      2.5  z
Total  8.5  -

A few points where I disagree(?) with the R implementation:

  • I'm thinking that NaN values will be treated as 0 here by default, so totals won't be affected by presence of NaN -> sum(1, NaN, 2.5) = 3.5. The R janitor function has an na.rm parameter for this, but I somehow feel this isn't necessary.
  • The where parameter, as defined by the R implementation, dictates whether to add a Totals "row" or "col", rather than the direction of the summation. Under the latter reading, "row" would add a new column containing the totals across each row (which to me is more natural). I'm calling this parameter axis here, btw.
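The behaviour described above could be sketched roughly as follows. This is only an illustration under the assumptions in this proposal: `adorn_totals_sketch` is a hypothetical standalone function (not an existing pyjanitor method), `axis` names the direction of summation, NaN is skipped when summing, and `subset` is left out for brevity.

```python
import numpy as np
import pandas as pd


def adorn_totals_sketch(df, axis="col", fill_value="-", name="Total"):
    """Hypothetical sketch of the proposed adorn_totals; not pyjanitor API."""
    if axis not in ("col", "row", "both"):
        raise ValueError("axis must be 'col', 'row', or 'both'")
    out = df.copy()
    numeric_cols = list(df.select_dtypes("number").columns)

    if axis in ("row", "both"):
        # Summing across each row adds a new *column* of totals.
        out[name] = out[numeric_cols].sum(axis=1, skipna=True)
        numeric_cols = numeric_cols + [name]

    if axis in ("col", "both"):
        # Summing down each column adds a new *row* of totals.
        totals = out[numeric_cols].sum(axis=0, skipna=True)
        # Non-numeric columns get the placeholder instead of a sum.
        total_row = {
            col: (totals[col] if col in totals.index else fill_value)
            for col in out.columns
        }
        out = pd.concat([out, pd.DataFrame([total_row], index=[name])])

    return out


df = pd.DataFrame({"a": [6, np.nan, 2.5], "b": list("xyz")})
result = adorn_totals_sketch(df, axis="col")
print(result)
```

With `skipna=True` the NaN in column `a` is ignored, so the `Total` row holds 6 + 2.5 = 8.5 for `a` and the fill value for `b`. With `axis="row"`, a row whose numeric cells are all NaN gets a total of 0.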

adorn_percentages

TBD. Let me have a little think about this over the weekend, I decided against my own implementation idea while writing out the example API.. ><

Original idea
>>> df = pd.DataFrame({"a": [6, np.nan, 2.5], "b": list("xyz")}); df
     a  b
0  6.0  x
1  NaN  y
2  2.5  z
>>> df.adorn_percentages(
...     subset=None,  # similar to `adorn_totals`
...     axis='col',  # similar to `adorn_totals`
...     adorn_count=True,
...     count_position='front',  # ignored if adorn_count=False
...     count_format=0,  # ignored if adorn_count=False
...     percentage_format=2,
... )
            a  b
0  6 (70.59%)  x
1         nan  y
2  3 (29.41%)  z

Parameters:

  • count_position: where the count goes relative to the percentage: "front" gives "56 (23.4%)", "back" gives "23.4% (56)"
  • count_format / percentage_format: if an int, the number of decimal places to round to; otherwise a string format specification like ':,.2f' or whatever.

I'm not that sold on this API yet. It doesn't look too clean / friendly to use. After all, it is an amalgamation of 3 different behaviours in 1 function 😅. Would be happy to hear comments / suggestions to improve, if any.

@ericmjl
Member

ericmjl commented Jan 22, 2022

@thatlittleboy your thoughts on encapsulation to enforce order sound like the right thing to do.

I'll admit I'm not so well-versed in the adorn_* family of functions in janitor, so I'll hold off on commenting on their specific behaviour. That said, I am in favour of adding janitor functionality to pyjanitor, and I'm also in favour of your way of thinking about how to organize the functions in a sane fashion. 😄

@thatlittleboy
Contributor Author

Great, thanks for the affirmation @ericmjl. I'll have a think about the desired API and propose something in a PR when I'm ready. :)
