[WIP] summarize function #1230

Closed
wants to merge 65 commits into from

Conversation

samukweku
Collaborator

@samukweku samukweku commented Jan 10, 2023

PR Description

Please describe the changes proposed in the pull request:

  • aggregation is done with a col class
  • support for multiple column aggregations
  • support for aggregation in the presence of a grouping
  • inspiration from dplyr's across and rdatatable's SD

At its core, it is nothing more than a for loop; all the hard work is passed on to Pandas. It does not supplant agg: users should reach for summarize only if agg does not do the job. Its main addition is flexible grouping on multiple columns via the col class, thanks to the select_columns syntax.
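The "for loop over Pandas" idea can be sketched roughly as follows. This is not the code in this PR: `summarize_sketch` and its `(pattern, function)` spec format are hypothetical, the glob matching uses the standard library's `fnmatch` as a stand-in for the select_columns syntax, and the data is a made-up toy frame rather than mtcars.

```python
from fnmatch import fnmatch

import pandas as pd


def summarize_sketch(df, specs, by):
    """Hypothetical helper, NOT pyjanitor's implementation.

    specs is a list of (glob_pattern, function_name) pairs.
    """
    grouped = df.groupby(by)
    pieces = []
    for pattern, func in specs:
        # Pick the non-grouping columns whose names match the shell-style glob.
        matched = [c for c in df.columns if c != by and fnmatch(c, pattern)]
        # Delegate the actual aggregation to pandas.
        pieces.append(grouped[matched].agg(func))
    # Stitch the per-spec results together, aligned on the group keys.
    return pd.concat(pieces, axis=1)


# Toy data (invented for illustration, not the real mtcars values).
toy = pd.DataFrame(
    {
        "cyl": [4, 4, 6, 6],
        "disp": [100.0, 120.0, 160.0, 258.0],
        "hp": [90, 95, 110, 110],
        "wt": [2.0, 3.0, 3.0, 4.0],
    }
)

result = summarize_sketch(toy, [("*p", "sum"), ("*t", "mean")], by="cyl")
print(result)
```

Each pass through the loop is an ordinary `groupby(...).agg(...)` call, so the only thing the sketch adds on top of pandas is the column selection step.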

This PR resolves #1225.

Examples:

import pandas as pd
import janitor as jn
from janitor import col

In [107]: url = "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv"
     ...: mtcars = pd.read_csv(url)
     ...: mtcars.head()
Out[107]: 
               model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2

cols = [col("*p").compute("sum"), col("*t").compute("mean")]

mtcars.summarize(*cols, by = "cyl")
       disp    hp      drat        wt
cyl                                  
4    1156.5   909  4.070909  2.285727
6    1283.2   856  3.585714  3.117143
8    4943.4  2929  3.229286  3.999214

cols = [col("*p").compute("sum").rename("{_fn}_{_col}"), 
        col("*t").compute("mean").rename("{_col}_{_fn}")]

mtcars.summarize(*cols, by = "cyl")
     sum_disp  sum_hp  drat_mean   wt_mean
cyl                                       
4      1156.5     909   4.070909  2.285727
6      1283.2     856   3.585714  3.117143
8      4943.4    2929   3.229286  3.999214
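One plausible reading of the rename templates, assuming they are expanded in the manner of Python's `str.format` with `_fn` and `_col` as the placeholder names (the PR text does not spell out the mechanism), is:

```python
# Assumption: the template is filled per (function, column) pair via str.format.
prefix_template = "{_fn}_{_col}"
suffix_template = "{_col}_{_fn}"

sum_names = [prefix_template.format(_fn="sum", _col=c) for c in ["disp", "hp"]]
mean_names = [suffix_template.format(_fn="mean", _col=c) for c in ["drat", "wt"]]

print(sum_names)
print(mean_names)
```

Under that assumption the generated names line up with the column headers in the output above.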

summarize can be a useful abstraction for scenarios where agg doesn't quite do the job easily. An example is from this blogpost:

df
   x  y  n
0  1  1  3
1  1  2  2
2  1  3  1
3  2  1  1
4  2  2  2
5  3  1  1


cols = [col("y").compute("sum").rename("freq"), col("y").compute("nth", n=1)]

df.summarize(*cols, by = 'x')
   freq    y
x           
1     6  2.0
2     3  2.0
3     1  NaN
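For comparison, here is a rough plain-pandas route to the same table. It is only a sketch: `second_value` is a hypothetical helper that stands in for `nth` with `n=1`, chosen because it behaves the same across pandas versions, and the frame is rebuilt from the values printed above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": [1, 1, 1, 2, 2, 3], "y": [1, 2, 3, 1, 2, 1], "n": [3, 2, 1, 1, 2, 1]}
)


def second_value(s):
    # Second row of the group (position 1), or NaN if the group is too short.
    return s.iloc[1] if len(s) > 1 else np.nan


# Named aggregation: output column name = (input column, function).
out = df.groupby("x").agg(freq=("y", "sum"), y=("y", second_value))
print(out)
```

The custom function is the part that summarize would hide behind `compute("nth", n=1)`.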

PR Checklist

Please ensure that you have done the following:

  1. PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
  2. If you're not on the contributors list, add yourself to AUTHORS.md.
  3. Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
    • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

  • Building a preview of the docs on Netlify
  • Automatically linting the code
  • Making sure the code is documented
  • Making sure that all tests are passed
  • Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

@ericmjl
Member

ericmjl commented Jan 10, 2023

@codecov

codecov bot commented Jan 10, 2023

Codecov Report

Merging #1230 (6ca01c9) into dev (936fa8b) will decrease coverage by 14.66%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##              dev    #1230       +/-   ##
===========================================
- Coverage   97.71%   83.06%   -14.66%     
===========================================
  Files          78       79        +1     
  Lines        3770     3886      +116     
===========================================
- Hits         3684     3228      -456     
- Misses         86      658      +572     

@samukweku samukweku changed the title from "[ENH] summarize function" to "[WIP] summarize function" Jan 16, 2023
@samukweku samukweku marked this pull request as draft January 29, 2023 04:05
@samukweku samukweku closed this Feb 22, 2023
@samukweku samukweku deleted the samukweku/summarise branch February 22, 2023 20:04
@ericmjl
Member

ericmjl commented Feb 22, 2023

@samukweku apologies for the delay in reviewing. I was originally planning to get to it after this week’s storm of events is over. Was there a reason for closing?

@samukweku
Collaborator Author

haha ... not at all @ericmjl, this was a draft ... I'll resurrect it when I feel I have gotten the logic right ... truth be told, I'm wary of introducing a function that mimics what Pandas already does (in this case agg). summarize is supposed to cover cases agg doesn't touch; still, I want to finesse the idea a bit more before resurrecting the draft, and try out other options that could be passed to agg without having to use summarize

Development

Successfully merging this pull request may close these issues.

summarize
2 participants