# Data Aggregation (grouping)

Let's first import a dataset from 'product.csv'.
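A minimal sketch of the import step. The real 'product.csv' is not reproduced here, so the rows below are made-up stand-ins; only the column names ('object', 'colour', 'price', 'quantity') come from the text that follows.

```python
import io
import pandas as pd

# Made-up rows standing in for the contents of 'product.csv' (assumption)
csv_text = """object,colour,price,quantity
mug,red,5.0,3
mug,blue,6.0,1
plate,red,2.0,4
plate,blue,1.5,2
glass,red,1.25,5
mug,red,4.0,2
"""

# With the real file on disk, this would be: df = pd.read_csv("product.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the dataset
print(df.dtypes)   # data type of each column
```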

**Question**: How can we define groups?

One common approach is to look at a categorical variable. pandas Series have the methods `unique()` and `value_counts()` that are very useful in this regard.

* `unique()`: returns an array of the unique categories in the Series
* `value_counts()`: returns a Series giving the frequency of each category
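As a quick illustration, the two methods can be tried on the 'object' column. The DataFrame below is a made-up stand-in for the product dataset, not the real data.

```python
import pandas as pd

# Made-up stand-in for the product dataset (assumption)
df = pd.DataFrame({
    "object": ["mug", "mug", "plate", "plate", "glass", "mug"],
    "colour": ["red", "blue", "red", "blue", "red", "red"],
    "price": [5.0, 6.0, 2.0, 1.5, 1.25, 4.0],
    "quantity": [3, 1, 4, 2, 5, 2],
})

# Unique categories, in order of first appearance
categories = df["object"].unique()
print(categories)

# Frequency of each category, most frequent first
counts = df["object"].value_counts()
print(counts)
```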

## Grouping

We can use the method `groupby()` to construct groups based on a variable.

* The variable on which the groups are based should be a categorical variable
* If the variable is numerical, it is possible to discretize it by defining ranges of values; each range can then be a group
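One possible way to discretize a numerical variable is `pd.cut()`. The sketch below uses made-up data, and the bin edges and the labels "low", "mid" and "high" are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up stand-in for the product dataset (assumption)
df = pd.DataFrame({
    "object": ["mug", "mug", "plate", "plate", "glass", "mug"],
    "price": [5.0, 6.0, 2.0, 1.5, 1.25, 4.0],
})

# Ranges (0, 2], (2, 5] and (5, 10]; edges and labels are arbitrary
df["price_range"] = pd.cut(
    df["price"], bins=[0, 2, 5, 10], labels=["low", "mid", "high"]
)

# Each range now acts as a group; observed=True keeps only ranges that occur
range_counts = df.groupby("price_range", observed=True)["price"].count()
print(range_counts)
```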

### Grouping based on a single variable

Inside the parentheses of `groupby()`, we indicate the variable (i.e., the **column**) on which the grouping is based.
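A minimal sketch of this step, grouping by the 'object' column. The data is a made-up stand-in for the product dataset.

```python
import pandas as pd

# Made-up stand-in for the product dataset (assumption)
df = pd.DataFrame({
    "object": ["mug", "mug", "plate", "plate", "glass", "mug"],
    "colour": ["red", "blue", "red", "blue", "red", "red"],
    "price": [5.0, 6.0, 2.0, 1.5, 1.25, 4.0],
    "quantity": [3, 1, 4, 2, 5, 2],
})

# Group the whole DataFrame by the values of 'object'
grouped = df.groupby("object")

print(type(grouped))    # the GroupBy object itself, not yet a result
print(grouped.ngroups)  # number of groups
```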

The object above (`grouped`) is a *GroupBy* object.

This object has a few methods to calculate summary statistics of each group.

Note the use of `numeric_only=True` when calculating the means: it restricts the calculation to the numeric columns.
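The group means could be computed along these lines. The data is made up, so the numbers will not match the averages quoted for the real dataset.

```python
import pandas as pd

# Made-up stand-in for the product dataset (assumption)
df = pd.DataFrame({
    "object": ["mug", "mug", "plate", "plate", "glass", "mug"],
    "colour": ["red", "blue", "red", "blue", "red", "red"],
    "price": [5.0, 6.0, 2.0, 1.5, 1.25, 4.0],
    "quantity": [3, 1, 4, 2, 5, 2],
})

grouped = df.groupby("object")

# numeric_only=True leaves out the text column 'colour'
mean_grouped = grouped.mean(numeric_only=True)
print(mean_grouped)
```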

Looking at the "price" column, the average prices for glass, mug and plate are 1.28, 5.02 and 1.86, respectively.

It is important to mention that `grouped.mean()` returns a DataFrame. 
* To confirm, use `type(mean_grouped)`

To show the median of each group, use `grouped.median()`. As with the mean, pass `numeric_only=True` if the DataFrame contains non-numeric columns.
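A short sketch of the median calculation on made-up stand-in data:

```python
import pandas as pd

# Made-up stand-in for the product dataset (assumption)
df = pd.DataFrame({
    "object": ["mug", "mug", "plate", "plate", "glass", "mug"],
    "colour": ["red", "blue", "red", "blue", "red", "red"],
    "price": [5.0, 6.0, 2.0, 1.5, 1.25, 4.0],
    "quantity": [3, 1, 4, 2, 5, 2],
})

grouped = df.groupby("object")

# Median of every numeric column, per group
median_grouped = grouped.median(numeric_only=True)
print(median_grouped)
```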

What we just did was group the entire DataFrame based on the values of a certain column.

Sometimes we are interested in grouping only a certain column.
* For example, we can do grouping for the variable 'price' only
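One way to do this is to select the column right after `groupby()`. The sketch below uses made-up data, and the name `grouped_price` is chosen here for illustration.

```python
import pandas as pd

# Made-up stand-in for the product dataset (assumption)
df = pd.DataFrame({
    "object": ["mug", "mug", "plate", "plate", "glass", "mug"],
    "colour": ["red", "blue", "red", "blue", "red", "red"],
    "price": [5.0, 6.0, 2.0, 1.5, 1.25, 4.0],
    "quantity": [3, 1, 4, 2, 5, 2],
})

# Group only the 'price' column by the values of 'object'
grouped_price = df.groupby("object")["price"]

print(type(grouped_price))   # a GroupBy object over a single column
print(grouped_price.mean())  # average price per group
```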


Let's calculate the standard deviation of each group
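A sketch of the standard deviation per group, on made-up data in which the "glass" group has a single row. Printing the group sizes next to the result may help with the question that follows.

```python
import pandas as pd

# Made-up stand-in for the product dataset (assumption);
# "glass" appears only once
df = pd.DataFrame({
    "object": ["mug", "mug", "plate", "plate", "glass", "mug"],
    "price": [5.0, 6.0, 2.0, 1.5, 1.25, 4.0],
})

grouped_price = df.groupby("object")["price"]

# Sample standard deviation (pandas uses ddof=1 by default)
std_grouped = grouped_price.std()
print(std_grouped)

# Number of rows in each group
group_sizes = grouped_price.size()
print(group_sizes)
```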

**Question**: Why is the standard deviation for the "glass" group `NaN`?

Note that we grouped a single column. The result of the calculation is therefore a pandas Series instead of a DataFrame.

To confirm, find the type of `std_grouped`.
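The type check might look like this, again on made-up stand-in data:

```python
import pandas as pd

# Made-up stand-in for the product dataset (assumption)
df = pd.DataFrame({
    "object": ["mug", "mug", "plate", "plate", "glass", "mug"],
    "price": [5.0, 6.0, 2.0, 1.5, 1.25, 4.0],
})

std_grouped = df.groupby("object")["price"].std()

# Grouping a single column gives a Series, indexed by group label
print(type(std_grouped))
print(std_grouped.index)
```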

### Grouping based on several variables

One can group a column with respect to more than one variable. 

* For example, we will group the variable 'price' based on 'object' as well as 'colour'.
* To do that, use `groupby(list_of_variables_to_define_groups)`

Perform the grouping for 'price', based on 'object' and 'colour', then show the average prices by group.

Finally, we can group several variables together, based on two or more other variables

Let's group 'price' and 'quantity' together, based on 'object' and 'colour'.
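This could be sketched as follows, on made-up data. A list of column names inside the second pair of square brackets selects several columns at once.

```python
import pandas as pd

# Made-up stand-in for the product dataset (assumption)
df = pd.DataFrame({
    "object": ["mug", "mug", "plate", "plate", "glass", "mug"],
    "colour": ["red", "blue", "red", "blue", "red", "red"],
    "price": [5.0, 6.0, 2.0, 1.5, 1.25, 4.0],
    "quantity": [3, 1, 4, 2, 5, 2],
})

# Group two columns by (object, colour) pairs
grouped_two = df.groupby(["object", "colour"])[["price", "quantity"]]

# Average price and quantity for each combination
mean_two = grouped_two.mean()
print(mean_two)
```

Because more than one column is selected, the result is a DataFrame rather than a Series.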

