# 10  Data Aggregation and Group Operations

## 10.1 How to Think About Group Operations

Group operations follow the `split-apply-combine` pattern, which has three stages:

 1. Data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (`axis="index"`) or its columns (`axis="columns"`).
    Each grouping key can take many forms:

        - A list or array of values that is the same length as the axis being grouped

        - A value indicating a column name in a DataFrame

        - A dictionary or Series giving a correspondence between the values on the axis being grouped and the group names

        - A function to be invoked on the axis index or the individual labels in the index
 
 2. Once this is done, a function is applied to each group, producing a new value. 
 3. Finally, the results of all those function applications are combined into a result object. The form of the resulting object will usually depend on what’s being done to the data. 
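
The three stages above can be sketched by hand and compared with a single `groupby` call. This is only an illustration: the toy `df` and the names `pieces`, `means`, and `manual` are assumptions, and pandas does not literally build Python dictionaries internally.

```python
import pandas as pd

# Hypothetical toy data, not from the original text.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
})

# 1. Split: collect the values belonging to each distinct key.
pieces = {key: df.loc[df["key1"] == key, "data1"] for key in df["key1"].unique()}

# 2. Apply: reduce each piece to a single value.
means = {key: piece.mean() for key, piece in pieces.items()}

# 3. Combine: assemble the per-group results into one Series.
manual = pd.Series(means, name="data1").rename_axis("key1").sort_index()

# groupby performs the same three stages in one call.
grouped = df["data1"].groupby(df["key1"]).mean()
```

Both `manual` and `grouped` hold one mean per distinct value of `key1`.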

df["data1"].groupby(df["key1"]).mean() # note: "key1" becomes the index name of the resulting Series

df["data1"].groupby([df["key1"], df["key2"]]).mean() # Use two keys

The group keys do not have to be Series; they can be any arrays or lists of the right length:
states = np.array(["OH", "CA", "CA", "OH", "OH", "CA", "OH"])
years = [2005, 2005, 2006, 2005, 2006, 2005, 2006]
df["data1"].groupby([states, years]).mean()

Frequently, the grouping information is found in the same DataFrame as the data you want to work on. In that case, you can pass column names (whether those are strings, numbers, or other Python objects) as the group keys:

df.groupby("key1").mean()

df.groupby("key2").mean(numeric_only=True) # the "key1" column is not numeric, so it cannot be aggregated with mean

Multiple keys can be passed as a list of column names:
df.groupby(["key1", "key2"]).mean()

Regardless of the objective in using groupby, a generally useful GroupBy method is `size`, which returns a Series containing ***group sizes***:

df.groupby(["key1", "key2"]).size() # rows with a missing value in a group key are dropped by default

To keep the groups whose key is missing, pass `dropna=False`:
df.groupby("key1", dropna=False).size()

df.groupby("key1").count() # non-null values in each group
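
The difference between `size`, `dropna=False`, and `count` is easiest to see on a frame that has both a missing key and a missing value. The following is a minimal sketch; the toy `df` is an assumption and not the frame used elsewhere in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing key and one missing value.
df = pd.DataFrame({
    "key1": ["a", "a", None, "b", "b"],
    "data1": [1.0, np.nan, 3.0, 4.0, 5.0],
})

# Rows per group; the row with the missing key is dropped.
sizes = df.groupby("key1").size()

# The missing key is kept as its own group.
sizes_with_na = df.groupby("key1", dropna=False).size()

# Non-null values per column in each group.
counts = df.groupby("key1").count()
```

Here `sizes` counts rows regardless of missing data values, while `counts` skips the `NaN` in `data1`.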

### Iterating over Groups
The object returned by groupby supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data. 

for name, group in df.groupby("key1"): # name is a distinct value of key1; group is the chunk of rows with that value
    print('-'*20)
    print(name)
    print('-'*20)
    print(group)

for (k1, k2), group in df.groupby(["key1", "key2"]):
    print('-'*20)
    print((k1, k2))
    print('-'*20)
    print(group)


A recipe you may find useful is computing a dictionary of the data pieces as a one-liner:

pieces = {name: group for name, group in df.groupby("key1")}
pieces["b"]

By default groupby groups on `axis="index"`, but you can also group on the columns (note that the `axis` argument of `groupby` is deprecated as of pandas 2.1). For example, we could group the columns of our example df here by whether they start with "key" or "data":

grouped = df.groupby({"key1": "key", "key2": "key", # note: the grouping keys are passed as a dictionary
                      "data1": "data", "data2": "data"}, axis="columns")
# equivalent to
# grouped = df.groupby(["key", "key", "data", "data"], axis="columns")

for group_key, group_values in grouped:
    
    print('-'*30)
    print(group_key)
    print('-'*30)
    print(group_values)
    print('-'*30)
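
On pandas versions where passing `axis="columns"` to `groupby` warns or is unavailable, one possible workaround is to transpose, group on the row index, and transpose each chunk back. The sketch below assumes a small stand-in `df`; the name `column_groups` is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the df used in this section.
df = pd.DataFrame({
    "key1": ["a", "a", "b"],
    "key2": [1, 2, 1],
    "data1": [0.5, 1.5, 2.5],
    "data2": [1.0, 2.0, 3.0],
})

mapping = {"key1": "key", "key2": "key", "data1": "data", "data2": "data"}

# Transpose so that the columns become the row index, group on that index,
# then transpose each chunk back to the original orientation.
column_groups = {name: chunk.T for name, chunk in df.T.groupby(mapping)}

for name, chunk in column_groups.items():
    print(name, list(chunk.columns))
```

Transposing a frame with mixed column types yields `object` dtype, so this approach suits inspection and iteration better than numeric aggregation.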


### Selecting a Column or Subset of Columns
Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation. This means that
```
df.groupby("key1")["data1"]  # SeriesGroupBy object
df.groupby("key1")[["data2"]]  # DataFrameGroupBy object
```
are conveniences for:
```
df["data1"].groupby(df["key1"]) # note: the key is passed as a Series here, because df["data1"] has no "key1" column
df[["data2"]].groupby(df["key1"])
```

### Grouping with Dictionaries and Series
Grouping information may exist in a form other than an array. 

mapping = {"a": "red", "b": "red", "c": "blue",
           "d": "blue", "e": "red", "f" : "orange"}

by_column = people.groupby(mapping, axis="columns") #unused key 'f' is ok. 
by_column.sum()

map_series = pd.Series(mapping)



people.groupby(map_series, axis="columns").count()

### Grouping with Functions
Using Python functions is a more generic way of defining a group mapping compared with a dictionary or Series. Any function passed as a group key will be called once per index value (or once per column value if using `axis="columns"`), with the return values being used as the group names.

people.groupby(len).sum() # len applied to index

Mixing functions with arrays, dictionaries, or Series is not a problem, as everything gets converted to arrays internally:



key_list = ["one", "one", "one", "two", "two"]
people.groupby([len, key_list]).min()
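
To make the function-based grouping concrete, here is a minimal sketch with a small stand-in for `people` (the values are an assumption; only the index labels matter for the grouping).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the people DataFrame used above.
people = pd.DataFrame(
    np.arange(25.0).reshape(5, 5),
    columns=["a", "b", "c", "d", "e"],
    index=["Joe", "Steve", "Wanda", "Jill", "Trey"],
)

# len is called on each index label, so rows are grouped by name length.
by_len = people.groupby(len).sum()

# A function and a list can be combined as two grouping keys.
key_list = ["one", "one", "one", "two", "two"]
by_both = people.groupby([len, key_list]).min()
```

`by_len` is indexed by the name lengths 3, 4, and 5, and `by_both` has a two-level index of (length, label) pairs.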

### Grouping by Index Levels
A final convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index. 

columns = pd.MultiIndex.from_arrays([["US", "US", "US", "JP", "JP"],
                                    [1, 3, 5, 1, 3]],
                                    names=["cty", "tenor"])
hier_df = pd.DataFrame(np.random.standard_normal((4, 5)), columns=columns)

hier_df.groupby(level="cty", axis="columns").count()

## 10.2 Data Aggregation
Aggregations refer to any data transformation that produces scalar values from arrays. 

Table 10.1: Optimized groupby methods

| Function name | Description |
|:---------------|:-------------------------------------------------------------------|
| any, all | Return True if any (one or more values) or all non-NA values are "truthy" |
| count | Number of non-NA values |
| cummin, cummax | Cumulative minimum and maximum of non-NA values |
| cumsum | Cumulative sum of non-NA values |
| cumprod | Cumulative product of non-NA values |
| first, last | First and last non-NA values |
| mean | Mean of non-NA values |
| median | Arithmetic median of non-NA values |
| min, max | Minimum and maximum of non-NA values |
| nth | Retrieve value that would appear at position n with the data in sorted order |
| ohlc | Compute four "open-high-low-close" statistics for time series-like data |
| prod | Product of non-NA values |
| quantile | Compute sample quantile |
| rank | Ordinal ranks of non-NA values, like calling Series.rank |
| size | Compute group sizes, returning result as a Series |
| sum | Sum of non-NA values |
| std, var | Sample standard deviation and variance |

You can use aggregations of your own devising and additionally call any method that is also defined on the object being grouped. For example, the `nsmallest` Series method selects the smallest requested number of values from the data. While `nsmallest` is not explicitly implemented for GroupBy, we can still use it with a nonoptimized implementation. Internally, GroupBy slices up the Series, calls `piece.nsmallest(n)` for each piece, and then assembles those results into the result object:

grouped["data1"].nsmallest(2)

To use your own aggregation functions, pass any function that aggregates an array to the aggregate method or its short alias `agg`:

def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak) # peak_to_peak must reduce each group's array to a single value

grouped.describe()
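
A self-contained sketch of a custom aggregation follows. The toy `df` is an assumption; the comparison with `max() - min()` is included only to show that the custom function and the optimized methods agree.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 4.0, 2.0, 7.0, 3.0],
    "data2": [10.0, 20.0, 30.0, 50.0, 15.0],
})

def peak_to_peak(arr):
    # Reduce one group's values to a single number: its range.
    return arr.max() - arr.min()

grouped = df.groupby("key1")
custom = grouped.agg(peak_to_peak)

# The same result built from optimized methods.
builtin = grouped.max() - grouped.min()
```

Custom functions passed to `agg` are generally slower than the optimized methods in Table 10.1, so the second form is usually preferable when it exists.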

### Column-Wise and Multiple Function Application


grouped_pct.agg(["mean", "std", peak_to_peak]) # aggregate with multiple functions
# column names are taken from the function names

If you pass a list of `(name, function)` tuples, the first element of each tuple will be used as the DataFrame column name:

grouped_pct.agg([("average", "mean"), ("stdev", np.std)])

With a DataFrame you have more options, as you can specify a list of functions to apply to all of the columns or different functions per column. 

A DataFrame will have hierarchical columns only if multiple functions are applied to at least one column.

functions = ["count", "mean", "max"]
result = grouped[["tip_pct", "total_bill"]].agg(functions)

As before, a list of tuples with custom names can be passed:
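
A minimal sketch of passing such tuples to a DataFrame selection, assuming a miniature stand-in for the `tips` data (the values and the names `ftuples` and `result` are hypothetical):

```python
import pandas as pd

# Hypothetical miniature version of the tips data.
tips = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat"],
    "smoker": ["No", "No", "Yes", "Yes"],
    "total_bill": [10.0, 20.0, 30.0, 50.0],
    "tip_pct": [0.10, 0.20, 0.15, 0.25],
})
grouped = tips.groupby(["day", "smoker"])

# Each (name, function) tuple supplies the label for the second column level.
ftuples = [("Average", "mean"), ("Variance", "var")]
result = grouped[["tip_pct", "total_bill"]].agg(ftuples)
```

The result has hierarchical columns: the original column name on the first level and the custom names `Average` and `Variance` on the second.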

Now, suppose you wanted to apply potentially different functions to one or more of the columns. To do this, pass a dictionary to agg that contains a mapping of column names to any of the function specifications listed so far:

grouped.agg({"tip" : np.max, "size" : "sum"})


grouped.agg({"tip_pct" : ["min", "max", "mean", "std"],
             "size" : "sum"})

### Returning Aggregated Data Without Row Indexes
In all of the examples up until now, the aggregated data comes back with an index, potentially hierarchical, composed from the unique group key combinations. Since this isn’t always desirable, you can disable this behavior in most cases by passing `as_index=False` to groupby. 

Of course, it’s always possible to obtain the result in this format by calling reset_index on the result. Using the `as_index=False` argument avoids some unnecessary computations.

grouped = tips.groupby(["day", "smoker"], as_index=False) # the group keys become regular columns
grouped.mean(numeric_only=True)
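
The two routes to a flat result can be compared directly. This sketch assumes a miniature stand-in for `tips`; the names `flat` and `via_reset` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature version of the tips data.
tips = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat"],
    "smoker": ["No", "Yes", "No", "No"],
    "tip": [1.0, 3.0, 2.0, 4.0],
})

# Group keys stay as ordinary columns.
flat = tips.groupby(["day", "smoker"], as_index=False).mean(numeric_only=True)

# Same table, obtained by moving the index back into columns afterwards.
via_reset = tips.groupby(["day", "smoker"]).mean(numeric_only=True).reset_index()
```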

## 10.3 Apply: General split-apply-combine
The most general-purpose GroupBy method is apply, which is the subject of this section. apply splits the object being manipulated into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces.

What occurs inside the function passed is up to you; it must either return a pandas object or a scalar value. 

tips.groupby("smoker").apply(top)

If you pass a function to apply that takes other arguments or keywords, you can pass these after the function:

tips.groupby(["smoker", "day"]).apply(top, n=1, column="total_bill")
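
The notes above call a function `top` without showing it. A plausible definition, together with a miniature stand-in for `tips`, is sketched below; the data values are assumptions. The value columns are selected before calling `apply` so that each piece contains only the data columns, which is a choice made for this sketch rather than something the original requires.

```python
import pandas as pd

# Hypothetical miniature version of the tips data.
tips = pd.DataFrame({
    "smoker": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "total_bill": [10.0, 20.0, 30.0, 15.0, 25.0, 35.0],
    "tip_pct": [0.20, 0.10, 0.15, 0.30, 0.05, 0.25],
})

def top(df, n=5, column="tip_pct"):
    # Return the n rows with the largest values in the given column.
    return df.sort_values(column, ascending=False)[:n]

# Selecting the value columns first keeps the group key out of each piece.
values = tips.groupby("smoker")[["total_bill", "tip_pct"]]

# Uses the default column; n is forwarded to top.
top_tip_pct = values.apply(top, n=2)

# Extra keywords after the function are forwarded as well.
top_bill = values.apply(top, n=1, column="total_bill")
```

Each result is indexed by the group key plus the original row label of the selected rows.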

Inside GroupBy, when you invoke a method like describe, it is actually just a shortcut for:
```
def f(group):
    return group.describe()

grouped.apply(f)
```

### Suppressing the Group Keys
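In the preceding examples, the resulting object has a hierarchical index formed from the group keys, along with the indexes of each piece of the original object. You can disable this by passing `group_keys=False` to groupby: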

tips.groupby("smoker", group_keys=False).apply(top)

### Quantile and Bucket Analysis
Combining `pd.cut` and `pd.qcut` with groupby makes it convenient to perform bucket or quantile analysis on a dataset. 

quartiles = pd.cut(frame["data1"], 4) # or use .qcut
quartiles.head(10)

The Categorical object returned by cut can be passed directly to groupby.

def get_stats(group):
    return pd.DataFrame(
        {"min": group.min(), "max": group.max(),
        "count": group.count(), "mean": group.mean()}
    )

grouped = frame.groupby(quartiles)
grouped.apply(get_stats)

Keep in mind the same result could have been computed more simply with:

grouped.agg(["min", "max", "count", "mean"])

These were equal-length buckets; to compute equal-size buckets based on sample quantiles, use `pandas.qcut`. We can pass `4` as the number of buckets to compute sample quartiles, and pass `labels=False` to obtain just the quartile indices instead of intervals:
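
A minimal sketch of the `qcut` variant, assuming a `frame` of random normal data (the seed, the sample size of 1000, and the name `quartiles_samp` are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical random data standing in for the frame used above.
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "data1": rng.standard_normal(1000),
    "data2": rng.standard_normal(1000),
})

# labels=False returns integer bucket codes 0-3 instead of Interval labels.
quartiles_samp = pd.qcut(frame["data1"], 4, labels=False)

stats = frame.groupby(quartiles_samp).agg(["min", "max", "count", "mean"])
```

Because the buckets are defined by sample quantiles, each of the four groups holds the same number of rows, unlike the equal-length buckets from `pd.cut`.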

### Example: Filling Missing Values with Group-Specific Values

def fill_mean(group):
    return group.fillna(group.mean())

data.groupby(group_key).apply(fill_mean) # fill na using group means

fill_values = {"East": 0.5, "West": -1}
def fill_func(group):
    #print(group.name)
    return group.fillna(fill_values[group.name]) # group has name attribute

data.groupby(group_key).apply(fill_func) 
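
The snippets above rely on `data` and `group_key` being defined elsewhere. A self-contained sketch of the group-mean fill, with made-up values, might look like this; `group_keys=False` is used here so that the result keeps the original state labels as its index.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one value per state, with three values missing.
states = ["Ohio", "New York", "Vermont", "Florida",
          "Oregon", "Nevada", "California", "Idaho"]
group_key = ["East"] * 4 + ["West"] * 4
data = pd.Series([1.0, 2.0, np.nan, 3.0, 10.0, np.nan, 20.0, np.nan],
                 index=states)

def fill_mean(group):
    # group.mean() skips NaN, so the fill value is the mean of observed values.
    return group.fillna(group.mean())

# group_keys=False keeps the original state labels as the index.
filled = data.groupby(group_key, group_keys=False).apply(fill_mean)
```

Each missing value is replaced by the mean of the observed values in its own region, not by the overall mean.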

### Example: Random Sampling and Permutation
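Suppose you want to draw a random sample from each group, for example two cards from each suit of a deck. There are a number of ways to perform the “draws”; here we use the `sample` method for Series.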

deck.groupby(get_suit, group_keys=False).apply(draw, n=2)
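
The call above assumes that `deck`, `draw`, and `get_suit` already exist. One way to build them is sketched below; the card naming scheme (value followed by a one-letter suit) is an assumption chosen so that the suit can be read from the last character.

```python
import pandas as pd

# Build a 52-card deck: the index is the card name, the value is its point worth.
suits = ["H", "S", "C", "D"]  # Hearts, Spades, Clubs, Diamonds
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ["A"] + list(range(2, 11)) + ["J", "K", "Q"]
cards = [str(num) + suit for suit in suits for num in base_names]
deck = pd.Series(card_val, index=cards)

def draw(deck, n=5):
    # Sample n cards without replacement.
    return deck.sample(n)

def get_suit(card):
    # The last character of the card name is the suit.
    return card[-1]

# Two random cards from each suit; group_keys=False keeps a flat index.
hand = deck.groupby(get_suit, group_keys=False).apply(draw, n=2)
```

The draw is random, so the cards in `hand` differ between runs, but there are always two per suit.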

### Example: Group Weighted Average and Correlation
Under the split-apply-combine paradigm of groupby, operations between columns in a DataFrame or two Series, such as a group weighted average, are possible. 

grouped = df.groupby("category")
def get_wavg(group):
    return np.average(group["data"], weights=group["weights"])

grouped.apply(get_wavg)
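
A runnable sketch of the group weighted average with made-up numbers follows. The two value columns are selected before `apply` so that each piece contains only the data it needs; that selection is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "category": ["a", "a", "b", "b"],
    "data": [1.0, 3.0, 2.0, 4.0],
    "weights": [1.0, 3.0, 1.0, 1.0],
})

def get_wavg(group):
    # Weighted mean of "data" within one group, using that group's weights.
    return np.average(group["data"], weights=group["weights"])

# Selecting the two value columns keeps the group key out of each piece.
wavg = df.groupby("category")[["data", "weights"]].apply(get_wavg)
```

For category `a` the result is (1·1 + 3·3) / (1 + 3) = 2.5, and for category `b` the equal weights reduce it to the plain mean, 3.0.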

def spx_corr(group):
    return group.corrwith(group["SPX"])

def get_year(x):
    return x.year

by_year = rets.groupby(get_year)
by_year.apply(spx_corr)

def corr_aapl_msft(group):
    return group["AAPL"].corr(group["MSFT"])
by_year.apply(corr_aapl_msft)
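
The correlation examples assume a DataFrame `rets` of daily returns indexed by date. The sketch below substitutes synthetic data (not real prices) in which `AAPL` and `MSFT` are constructed from `SPX`, so the expected correlations are known in advance.

```python
import numpy as np
import pandas as pd

# Synthetic daily "returns" (an assumption for illustration, not real prices).
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", "2021-12-31", freq="B")
spx = rng.standard_normal(len(dates)) * 0.01
rets = pd.DataFrame(
    {"AAPL": 2 * spx + 0.001,  # perfectly correlated with SPX by construction
     "MSFT": -spx,             # perfectly anti-correlated with SPX
     "SPX": spx},
    index=dates,
)

def get_year(x):
    # Called once per index label (a Timestamp).
    return x.year

def spx_corr(group):
    return group.corrwith(group["SPX"])

by_year = rets.groupby(get_year)

# One row per year, one column per series.
yearly_corr = by_year.apply(spx_corr)

# A scalar per group yields a Series indexed by year.
pair_corr = by_year.apply(lambda g: g["AAPL"].corr(g["MSFT"]))
```

With real return data the correlations would of course vary from year to year.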

### Example: Group-Wise Linear Regression
In the same theme as the previous example, you can use groupby to perform more complex group-wise statistical analysis, as long as the function returns a pandas object or scalar value. 

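import statsmodels.api as sm

def regress(data, yvar=None, xvars=None):
    # Ordinary least squares of yvar on xvars plus a constant term
    Y = data[yvar]
    # assign returns a new DataFrame, so the group passed in is not modified
    X = data[xvars].assign(intercept=1.0)
    result = sm.OLS(Y, X).fit()
    return result.params

by_year.apply(regress, yvar="AAPL", xvars=["SPX"])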

## 10.4 Group Transforms and "Unwrapped" GroupBys

In Section 10.3, Apply: General split-apply-combine, we looked at the `apply` method in grouped operations for performing transformations. There is another built-in method called `transform`, which is similar to `apply` but imposes more constraints on the kind of function you can use:

- It can produce a scalar value to be broadcast to the shape of the group.
- It can produce an object of the same shape as the input group.
- It must not mutate its input.

def get_mean(group):
    return group.mean()
g.transform(get_mean)

g.transform('mean') # pass a built-in aggregate function "mean"

def times_two(group):
    return group * 2
g.transform(times_two)

Built-in aggregate functions like `'mean'` or `'sum'` are often much faster than a general `apply` function. These also have a "fast path" when used with transform. This allows us to perform what is called an unwrapped group operation:

g.transform('mean')

normalized = (df['value'] - g.transform('mean')) / g.transform('std')

Here, we are doing arithmetic between the outputs of multiple GroupBy operations instead of writing a function and passing it to `groupby(...).apply`. That is what is meant by "unwrapped."
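
To see that the wrapped and unwrapped forms agree, here is a minimal sketch. The frame `df`, the grouped column `g`, and the helper `normalize` are assumptions chosen to match the snippets in this section.

```python
import numpy as np
import pandas as pd

# Example frame: three keys repeated four times.
df = pd.DataFrame({"key": ["a", "b", "c"] * 4, "value": np.arange(12.0)})
g = df.groupby("key")["value"]

# transform broadcasts each group's mean back to the shape of the input.
group_means = g.transform("mean")

# "Wrapped": a Python function runs once per group.
def normalize(x):
    return (x - x.mean()) / x.std()

wrapped = g.transform(normalize)

# "Unwrapped": arithmetic between the outputs of two built-in transforms.
unwrapped = (df["value"] - g.transform("mean")) / g.transform("std")
```

Both results have the same index as `df`, and within each group the normalized values have mean 0 and sample standard deviation 1.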

## 10.5 Pivot Tables and Cross-Tabulation

A pivot table is a data summarization tool frequently found in spreadsheet programs and other data analysis software. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns. Pivot tables in Python with pandas are made possible through the groupby facility described in this chapter, combined with reshape operations utilizing hierarchical indexing. DataFrame also has a `pivot_table` method, and there is also a top-level `pandas.pivot_table` function. In addition to providing a convenience interface to `groupby`, `pivot_table` can add partial totals, also known as `margins`.


tips.pivot_table(index=["day", "smoker"],
                 values=["size", "tip", "tip_pct", "total_bill"])

tips.pivot_table(index=["time", "day"], columns="smoker",
                 values=["tip_pct", "size"])

We could augment this table to include partial totals by passing `margins=True`. This has the effect of adding `All` row and column labels, with corresponding values being the group statistics for all the data within a single tier:

tips.pivot_table(index=["time", "day"], columns="smoker",
                 values=["tip_pct", "size"], margins=True) # the default aggfunc is "mean", so these are group means

To use an aggregation function other than mean, pass it to the `aggfunc` keyword argument. For example, `"count"` or `len` will give you a cross-tabulation (count or frequency) of group sizes (though `"count"` will exclude null values from the count within data groups, while `len` will not):

tips.pivot_table(index=["time", "day"], columns="smoker",
                 values=["tip_pct", "size"], aggfunc="count",  margins=True)

If some combinations are empty (or otherwise NA), you may wish to pass a `fill_value`:

tips.pivot_table(index=["time", "size", "smoker"], columns="day",
                 values="tip_pct", fill_value=0)
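
The relationship between `pivot_table`, `groupby`, and margins can be checked on a small example. The miniature `tips` frame below is hypothetical, as are the names `table`, `via_groupby`, and `with_margins`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the tips data.
tips = pd.DataFrame({
    "day":    ["Fri", "Fri", "Fri", "Fri", "Sat", "Sat", "Sat", "Sat"],
    "smoker": ["No", "Yes", "No", "Yes", "No", "Yes", "Yes", "No"],
    "tip_pct": [0.10, 0.20, 0.30, 0.40, 0.10, 0.15, 0.25, 0.20],
})

# Mean tip_pct by day (rows) and smoker (columns).
table = tips.pivot_table(index="day", columns="smoker", values="tip_pct")

# The same numbers from groupby followed by a reshape.
via_groupby = tips.groupby(["day", "smoker"])["tip_pct"].mean().unstack("smoker")

# margins=True appends an "All" row and column holding the partial totals.
with_margins = tips.pivot_table(index="day", columns="smoker",
                                values="tip_pct", margins=True)
```

The `All` entries are computed from the underlying rows, not by averaging the cell means, so the bottom-right cell is the overall mean of `tip_pct`.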

Table 10.2: pivot_table options

| Argument | Description |
|:------------|:-----------------------------------------------------------------------|
| values | Column name or names to aggregate; by default, aggregates all numeric columns |
| index | Column names or other group keys to group on the rows of the resulting pivot table |
| columns | Column names or other group keys to group on the columns of the resulting pivot table |
| aggfunc | Aggregation function or list of functions ("mean" by default); can be any function valid in a groupby context |
| fill_value | Replace missing values in the result table |
| dropna | If True, do not include columns whose entries are all NA |
| margins | Add row/column subtotals and grand total (False by default) |
| margins_name | Name to use for the margin row/column labels when passing margins=True; defaults to "All" |
| observed | With Categorical group keys, if True, show only the observed category values in the keys rather than all categories |

### Cross-Tabulations: Crosstab
A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies. Here is an example:

pd.crosstab(data["Nationality"], data["Handedness"], margins=True) # compute group frequencies

pd.crosstab([tips["time"], tips["day"]], tips["smoker"], margins=True) # note: the first argument is a list of two Series
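
A self-contained sketch of `crosstab`, with a small made-up survey table standing in for `data`, and the equivalent counts obtained from `groupby` for comparison:

```python
import pandas as pd

# Small survey-style data set (hypothetical values).
data = pd.DataFrame({
    "Nationality": ["USA", "Japan", "USA", "Japan", "Japan",
                    "Japan", "USA", "USA", "Japan", "USA"],
    "Handedness": ["Right-handed", "Left-handed", "Right-handed",
                   "Right-handed", "Left-handed", "Right-handed",
                   "Right-handed", "Left-handed", "Right-handed",
                   "Right-handed"],
})

freq = pd.crosstab(data["Nationality"], data["Handedness"], margins=True)

# The same counts (without margins) from groupby plus a reshape.
via_groupby = (data.groupby(["Nationality", "Handedness"])
                   .size()
                   .unstack("Handedness", fill_value=0))
```

The `All` row and column of `freq` hold the row and column totals, and the bottom-right cell is the number of rows in `data`.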