# Data Aggregation (grouping)

Let's first import the dataset 'product.csv' into a DataFrame called `products`:

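| | object | colour | price | quantity |
|---|---|---|---|---|
| 0 | mug | red | 5.67 | 120 |
| 1 | mug | red | 4.28 | 50 |
| 2 | glass | white | 1.28 | 30 |
| 3 | plate | green | 2.29 | 40 |
| 4 | mug | blue | 5.12 | 32 |
| 5 | plate | blue | 2.12 | 18 |
| 6 | plate | green | 1.18 | 30 |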
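
The import step itself is not shown. A minimal sketch of it follows, assuming the file has an unnamed first column holding the row index. The rows are read from an inline string (`csv_text`, a name introduced here for illustration) so the example runs without the file; with the real file, the call would presumably be `pd.read_csv('product.csv', index_col=0)`.

```python
import io
import pandas as pd

# Inline copy of the rows shown above (assumption: 'product.csv' has this layout,
# with an unnamed first column that holds the row index).
csv_text = """,object,colour,price,quantity
0,mug,red,5.67,120
1,mug,red,4.28,50
2,glass,white,1.28,30
3,plate,green,2.29,40
4,mug,blue,5.12,32
5,plate,blue,2.12,18
6,plate,green,1.18,30
"""

# index_col=0 uses the unnamed first column as the row index
products = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(products)
```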


**Question**: How can we define groups?

One common approach is to look at a categorical variable. pandas Series have the methods `unique()` and `value_counts()` that are very useful in this regard.

* `unique()`: returns an array of unique categories in the series
* `value_counts()`: returns a Series giving the frequency of each category
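
As a small illustration of these two methods, the sketch below applies them to the 'object' column. The DataFrame is rebuilt from a dictionary here only to keep the example self-contained; it is assumed to match the contents of 'product.csv'.

```python
import pandas as pd

# Same rows as the dataset above, rebuilt inline so the example runs on its own
products = pd.DataFrame({
    'object': ['mug', 'mug', 'glass', 'plate', 'mug', 'plate', 'plate'],
    'colour': ['red', 'red', 'white', 'green', 'blue', 'blue', 'green'],
    'price': [5.67, 4.28, 1.28, 2.29, 5.12, 2.12, 1.18],
    'quantity': [120, 50, 30, 40, 32, 18, 30],
})

# Distinct categories, in order of first appearance
categories = products['object'].unique()

# Number of rows in each category
counts = products['object'].value_counts()

print(categories)
print(counts)
```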

## Grouping

We can use the method `groupby()` to construct groups for a variable.

* The variable on which the groups are based should be a categorical variable
* If the variable is numerical, it is possible to discretize it by defining ranges of values; each range can then be a group
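
Discretizing a numerical variable can be sketched with `pd.cut()`. The bin edges and labels below are hypothetical, chosen only for illustration, and the DataFrame is rebuilt inline so the example is self-contained.

```python
import pandas as pd

products = pd.DataFrame({
    'object': ['mug', 'mug', 'glass', 'plate', 'mug', 'plate', 'plate'],
    'colour': ['red', 'red', 'white', 'green', 'blue', 'blue', 'green'],
    'price': [5.67, 4.28, 1.28, 2.29, 5.12, 2.12, 1.18],
    'quantity': [120, 50, 30, 40, 32, 18, 30],
})

# Hypothetical price ranges: (0, 2], (2, 4] and (4, 6]
bins = [0, 2, 4, 6]
labels = ['low', 'medium', 'high']
price_range = pd.cut(products['price'], bins=bins, labels=labels)

# Each range now acts as a group; observed=True keeps only ranges that contain rows
mean_quantity = products['quantity'].groupby(price_range, observed=True).mean()
print(mean_quantity)
```

Because `pd.cut()` returns a categorical Series, it can be passed to `groupby()` in the same way as a column such as 'object'.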

### Grouping based on a single variable

Inside the parentheses of `groupby()`, we indicate the variable (i.e., the **column**) on which the grouping is based.

In [66]:
grouped = products.groupby(products['object'])
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x13b8d3710>

The object above (`grouped`) is a *GroupBy* object.

This object has several methods for calculating summary statistics of each group, such as `mean()`:

In [72]:
mean_grouped = grouped.mean(numeric_only=True)
mean_grouped

| object | price | quantity |
|---|---|---|
| glass | 1.28 | 30.0 |
| mug | 5.023333 | 67.333333 |
| plate | 1.863333 | 29.333333 |


Looking at the "price" column, the average prices for glass, mug and plate are 1.28, 5.02 and 1.86, respectively.

Note that `grouped.mean()` returns a DataFrame.
* To confirm, use `type(mean_grouped)`

To show the median of each group, use `grouped.median(numeric_only=True)`. As with `mean()`, `numeric_only=True` leaves out the non-numeric 'colour' column.
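
A short sketch of both points, the return type and the median, is given below. The DataFrame is rebuilt inline so the example is self-contained, and `median_grouped` is a name introduced here for illustration.

```python
import pandas as pd

products = pd.DataFrame({
    'object': ['mug', 'mug', 'glass', 'plate', 'mug', 'plate', 'plate'],
    'colour': ['red', 'red', 'white', 'green', 'blue', 'blue', 'green'],
    'price': [5.67, 4.28, 1.28, 2.29, 5.12, 2.12, 1.18],
    'quantity': [120, 50, 30, 40, 32, 18, 30],
})

grouped = products.groupby(products['object'])

# numeric_only=True restricts the result to the numeric columns
median_grouped = grouped.median(numeric_only=True)

print(type(median_grouped))
print(median_grouped)
```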

What we just did was group the entire DataFrame based on the values of a certain column.

Sometimes we are interested in grouping only a certain column.
* For example, we can do grouping for the variable 'price' only


In [31]:
grouped = products['price'].groupby(products['object'])

Let's calculate the standard deviation of each group:

In [74]:
std_grouped = grouped.std()
std_grouped

object
glass         NaN
mug      0.700024
plate    0.597857
Name: price, dtype: float64


**Question**: Why is the standard deviation for the "glass" group `NaN`?
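
One way to explore this question is to look at the group sizes alongside the standard deviation. The sketch below rebuilds the data inline and relies on the fact that `std()` uses `ddof=1` by default, i.e. it divides by n - 1; the variable names are introduced here for illustration.

```python
import pandas as pd

products = pd.DataFrame({
    'object': ['mug', 'mug', 'glass', 'plate', 'mug', 'plate', 'plate'],
    'colour': ['red', 'red', 'white', 'green', 'blue', 'blue', 'green'],
    'price': [5.67, 4.28, 1.28, 2.29, 5.12, 2.12, 1.18],
    'quantity': [120, 50, 30, 40, 32, 18, 30],
})

grouped = products['price'].groupby(products['object'])

sizes = grouped.size()                 # number of rows in each group
std_sample = grouped.std()             # sample standard deviation (ddof=1, divides by n - 1)
std_population = grouped.std(ddof=0)   # population standard deviation (divides by n)

print(pd.DataFrame({'size': sizes, 'std (ddof=1)': std_sample, 'std (ddof=0)': std_population}))
```

A group with a single row has n - 1 = 0, so the sample standard deviation is undefined, whereas the population version is defined for that group.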

Note that we grouped a single column. The result of the aggregation is therefore a pandas Series instead of a DataFrame:

In [37]:
type(std_grouped)

pandas.core.series.Series

### Grouping based on several variables

One can group a column with respect to more than one variable.

* For example, we will group the variable 'price' based on 'object' as well as 'colour'.
* To do that, use `groupby(list_of_variables_to_define_groups)`

In [77]:
# Perform the grouping for 'price', based on 'object' and 'colour'
grouped = products['price'].groupby([products['object'], products['colour']])

# Show the average prices by group
grouped.mean().to_frame()

| object | colour | price |
|---|---|---|
| glass | white | 1.28 |
| mug | blue | 5.12 |
| mug | red | 4.975 |
| plate | blue | 2.12 |
| plate | green | 1.735 |


Finally, we can group several variables together, based on two or more other variables:

In [81]:
# Let's group 'price' and 'quantity' together, based on 'object' and 'colour'
grouped = products[['price','quantity']].groupby([products['object'],products['colour']])

grouped.mean()

| object | colour | price | quantity |
|---|---|---|---|
| glass | white | 1.28 | 30.0 |
| mug | blue | 5.12 | 32.0 |
| mug | red | 4.975 | 85.0 |
| plate | blue | 2.12 | 18.0 |
| plate | green | 1.735 | 35.0 |
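
The same grouping can also be written by passing column names to `groupby()` and selecting the columns afterwards, a form that is common in pandas code. The sketch below compares the two spellings on data rebuilt inline; `by_name` and `by_series` are names introduced here for illustration.

```python
import pandas as pd

products = pd.DataFrame({
    'object': ['mug', 'mug', 'glass', 'plate', 'mug', 'plate', 'plate'],
    'colour': ['red', 'red', 'white', 'green', 'blue', 'blue', 'green'],
    'price': [5.67, 4.28, 1.28, 2.29, 5.12, 2.12, 1.18],
    'quantity': [120, 50, 30, 40, 32, 18, 30],
})

# Group by column names, then select the columns to aggregate
by_name = products.groupby(['object', 'colour'])[['price', 'quantity']].mean()

# Form used in this section: select the columns first, then group by Series
by_series = products[['price', 'quantity']].groupby(
    [products['object'], products['colour']]
).mean()

print(by_name)
print(by_name.equals(by_series))
```

Both results are indexed by the (object, colour) pairs that occur in the data.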
