# tidypandas

> A **grammar of data manipulation** for [pandas](https://pandas.pydata.org/docs/index.html) inspired by [tidyverse](https://tidyverse.tidyverse.org/) 

The `tidypandas` Python package provides a *minimal, pythonic* API for common data manipulation tasks:
   
   - `tidyframe` class (a wrapper over a pandas dataframe) provides a dataframe with a simplified index structure (no more resetting indexes and multi-indexes)
   - Consistent 'verbs' (`select`, `arrange`, `distinct`, ...) as methods of the `tidyframe` class, which mostly return a `tidyframe`
   - Unified interface for summarizing (aggregation) and mutating (assignment) operations across groups
   - Utilities for pandas dataframes and series
   - Uses simple Python data structures: no esoteric classes, no pipes, no non-standard evaluation
   - No-copy data conversion between `tidyframe` and pandas dataframes
   - An accessor to apply `tidyframe` verbs to simple pandas dataframes
   - ...

#### tidypandas is for you if

- you *frequently* write data manipulation code
- you prefer to stay in the pandas ecosystem (see the accessor)
- you *prefer* to remember a [limited set of methods](https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428)
- you do not want to write, or be surprised by, [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) and [`rename_axis`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename_axis.html) often
- you prefer writing free-flowing, expressive code in [dplyr](https://dplyr.tidyverse.org/) style
 
`tidypandas` does not replace the amazing `pandas` library; rather, it relies on it. It offers a consistent API with a different [philosophy](https://tidyverse.tidyverse.org/articles/manifesto.html).

## A snippet of tidypandas

#### in comparison with pandas

On the penguins dataset:

> Let 'length_depth_ratio' be the ratio of 'bill_length_mm' to 'bill_depth_mm'.  
Among the top 5% of male and female birds by 'length_depth_ratio' per 'species',  
compute the mean 'body_mass_g' per 'species', 'sex' and 'island',  
and display it in wide format with values from 'island' and 'sex' as columns.

In [2]:
from tidypandas import tidyframe
from palmerpenguins import load_penguins
import numpy as np

penguins      = load_penguins() # pandas dataframe
penguins_tidy = tidyframe(penguins) # create a tidyframe from pandas dataframe
penguins_tidy

|     | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|     | `<string>` | `<string>` | `<Float64>` | `<Float64>` | `<Int64>` | `<Int64>` | `<string>` | `<Int64>` |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| 3 | Adelie | Torgersen |  |  |  |  |  | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 |
| 340 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 |
| 341 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 |


#### tidypandas style

In [3]:
(penguins_tidy
    .drop_na('sex')
    .mutate({'length_depth_ratio': (lambda x, y: x/y, ['bill_length_mm', 'bill_depth_mm'])})
    .slice_max(prop = 0.05, order_by_column = 'length_depth_ratio', by = ['sex', 'species'])  
    .summarize({'body_mass_g': (np.mean, )}, by = ['species', 'sex', 'island'])
    .pivot_wider(id_cols = 'species', names_from = ['island', 'sex'], values_from = 'body_mass_g')
    )

|     | species | Biscoe__female | Biscoe__male | Dream__female | Dream__male | Torgersen__female | Torgersen__male |
| --- | --- | --- | --- | --- | --- | --- | --- |
|     | `<string>` | `<Float64>` | `<Float64>` | `<Float64>` | `<Float64>` | `<Float64>` | `<Float64>` |
| 0 | Adelie | 3075.0 |  | 3250.0 | 3725.0 | 3575.0 | 4283.333333 |
| 1 | Chinstrap |  |  | 3512.5 | 3900.0 |  |  |
| 2 | Gentoo | 4625.0 | 5683.333333 |  |  |  |  |


#### pandas style

In [4]:
(penguins
  .dropna(subset = ['sex'])
  .assign(length_depth_ratio = lambda x: x['bill_length_mm'] / x['bill_depth_mm'])
  .groupby(['sex', 'species'])
  .apply(lambda x: x.nlargest(n = int(np.round(0.05 * x.shape[0])),
                               columns = 'length_depth_ratio'
                               )
        )
  .reset_index(drop = True)
  .groupby(['species', 'sex', 'island'])
  .agg({'body_mass_g': np.mean})
  .reset_index()
  .pivot(index = 'species', columns = ['island', 'sex'], values = 'body_mass_g')
  )

| island | Biscoe | Dream | Torgersen | Dream | Torgersen | Biscoe |
| --- | --- | --- | --- | --- | --- | --- |
| **sex** | female | female | female | male | male | male |
| **species** |  |  |  |  |  |  |
| Adelie | 3075.0 | 3250.0 | 3575.0 | 3725.0 | 4283.333333 |  |
| Chinstrap |  | 3512.5 |  | 3900.0 |  |  |
| Gentoo | 4625.0 |  |  |  |  | 5683.333333 |
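
The pandas result keeps `species` as the row index and uses two-level column labels, whereas the tidypandas result has flat labels such as `Biscoe__female`. As a rough, pandas-only sketch of how one might bridge that gap by hand: `flatten_columns` is a hypothetical helper written for this illustration (it is not part of tidypandas), and the `"__"` separator is assumed from the tidypandas output shown earlier.

```python
import pandas as pd


def flatten_columns(df: pd.DataFrame, sep: str = "__") -> pd.DataFrame:
    """Hypothetical helper: flat string column labels and a default row index."""
    out = df.copy()
    # Join the levels of each multi-level label,
    # e.g. ('Biscoe', 'female') -> 'Biscoe__female'.
    out.columns = [
        sep.join(str(part) for part in col) if isinstance(col, tuple) else col
        for col in out.columns
    ]
    # Move a meaningful index (here 'species') back into an ordinary column.
    if not (isinstance(out.index, pd.RangeIndex) and out.index.name is None):
        out = out.reset_index()
    return out


# A small slice of the pandas-style result above, rebuilt by hand.
wide = pd.DataFrame(
    [[3075.0, 3250.0], [4625.0, float("nan")]],
    index=pd.Index(["Adelie", "Gentoo"], name="species"),
    columns=pd.MultiIndex.from_tuples(
        [("Biscoe", "female"), ("Dream", "female")], names=["island", "sex"]
    ),
)

flat = flatten_columns(wide)
print(list(flat.columns))  # ['species', 'Biscoe__female', 'Dream__female']
```

This is the kind of bookkeeping (`reset_index`, renaming multi-level columns) that the `tidyframe` verbs are meant to make unnecessary.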


## Overview of tidypandas

A pandas dataframe is said to be 'simple' if:
    
1. Column names (`x.columns`) are an unnamed `pd.Index` object of unique strings.
2. Row names (`x.index`) are an unnamed `pd.RangeIndex` object with `start = 0` and `step = 1`.
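
The two conditions can be checked directly with plain pandas. The sketch below is only an illustration of the definition: `is_simple` is a hypothetical function name chosen here, and it may differ from whatever check tidypandas performs internally.

```python
import pandas as pd


def is_simple(df: pd.DataFrame) -> bool:
    """Hypothetical check of the two 'simple dataframe' conditions."""
    cols = df.columns
    # Condition 1: unnamed, single-level index of unique strings.
    columns_ok = (
        not isinstance(cols, pd.MultiIndex)
        and cols.name is None
        and cols.is_unique
        and all(isinstance(c, str) for c in cols)
    )
    idx = df.index
    # Condition 2: unnamed RangeIndex starting at 0 with step 1.
    index_ok = (
        isinstance(idx, pd.RangeIndex)
        and idx.name is None
        and idx.start == 0
        and idx.step == 1
    )
    return bool(columns_ok and index_ok)


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

print(is_simple(df))                                  # True
print(is_simple(df.set_index("a")))                   # False: named, non-range index
print(is_simple(df.groupby("a").sum()))               # False: group keys became the index
print(is_simple(df.groupby("a").sum().reset_index())) # True again after reset_index
```

The last two lines show why grouped pandas results so often need a `reset_index` before they are 'simple' again.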

`tidypandas` provides the following utilities:

In [1]:
from tidypandas import tidyframe           # tidyframe class (tidy dataframe class wrapping a pandas dataframe)
from tidypandas.tidy_utils import simplify # simplify attempts to simplify a pandas dataframe
from tidypandas.series_utils import *      # series utils like ifelse, case_when, min_rank
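
The utility names mentioned in the last comment (`ifelse`, `case_when`, `min_rank`) echo R/dplyr functions. For readers who know pandas but not dplyr, here is a rough sketch of the same operations written with plain pandas and NumPy. It does not call tidypandas, and it makes no claim about the signatures or the implementation of the tidypandas functions.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 3, 2])

# ifelse-like: pick one of two values element by element.
size = pd.Series(np.where(s > 2, "big", "small"))

# case_when-like: the first matching condition wins, otherwise a default.
label = pd.Series(np.select([s == 1, s == 2], ["one", "two"], default="many"))

# min_rank-like: tied values share the smallest rank of their group.
rank = s.rank(method="min").astype(int)

print(size.tolist())   # ['big', 'small', 'big', 'small']
print(label.tolist())  # ['many', 'one', 'many', 'two']
print(rank.tolist())   # [3, 1, 3, 2]
```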

In [None]:
# Add a table to show verbs and their nearest pandas equivalents