# 12_Aggregate Data with Groupby
### Syntax:

I found the <code>groupby</code> syntax a little difficult to remember at first, so here it is broken down:

><code>df.groupby([<font color="blue">"col", "optional_col", "etc"</font>])[[<font color="red">"col", "optional_col", "etc"</font>]].<font color="green">function()</font></code>

The column(s) in <font color="blue">blue</font> are the features that you want to group *by*.

The column(s) in <font color="red">red</font> are the features you want to aggregate. To aggregate several columns, pass them as a list (hence the double brackets); a single column can be selected with a plain <code>["col"]</code>.

The function in <font color="green">green</font> is how you want to aggregate the data, for example <code>mean()</code> or <code>sum()</code>.
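
As a minimal sketch of that pattern, the snippet below applies one grouping column, one aggregated column, and one function to a tiny made-up DataFrame. The names `toy`, `team`, `points`, and `assists` are hypothetical and are not part of the dataset used in the example further down.

```python
import pandas as pd

# Hypothetical toy data, invented purely to illustrate the syntax.
toy = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "points": [10, 20, 30, 50],
    "assists": [1, 3, 5, 7],
})

# Blue: group by "team". Red: aggregate "points". Green: take the mean.
mean_points = toy.groupby(["team"])["points"].mean()
print(mean_points)
```

Swapping `.mean()` for `.sum()` changes only how each group is collapsed; the grouping and column selection stay the same.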

### Example:

In [1]:
import pandas as pd

# Load the example timeseries_daily.csv dataset:
df = pd.read_csv("./Dummy datasets/timeseries_daily.csv")
df.head()

         Date  feature_1  feature_2  feature_3  feature_4  categorical_feature    weekday
0  01/02/2017          0          0         37          0                  foo  Wednesday
1  02/02/2017          0          0        168          0                  foo   Thursday
2  03/02/2017          0          0        157          0                other     Friday
3  04/02/2017          0          0        720          0                other   Saturday
4  05/02/2017          0          0        721          0                  bar     Sunday


#### Group 1 feature by 1 feature:

In [2]:
df.groupby(["weekday"])["feature_3"].mean()

weekday
Friday       1484.565217
Monday       1355.822222
Saturday     1855.043478
Sunday       1825.847826
Thursday     1407.826087
Tuesday      1388.400000
Wednesday    1389.652174
Name: feature_3, dtype: float64

#### Group 1 feature by multiple features:

In [3]:
df.groupby(["weekday", "categorical_feature"])["feature_3"].mean()

weekday    categorical_feature
Friday     bar                    1748.272727
           foo                    1253.117647
           other                  1542.000000
Monday     bar                    1412.250000
           foo                    1566.307692
           other                  1185.150000
Saturday   bar                    1716.000000
           foo                    1918.666667
           other                  1860.631579
Sunday     bar                    2013.600000
           foo                    1838.000000
           other                  1678.350000
Thursday   bar                    1263.000000
           foo                    1264.769231
           other                  1631.833333
Tuesday    bar                    1414.235294
           foo                    1302.000000
           other                  1443.428571
Wednesday  bar                    1332.363636
           foo                    1422.700000
           other                  1387.600000
Name: feature_3, dtype: float64

#### Group multiple features by multiple features:

In [4]:
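# Select several columns with a list (double brackets); recent pandas versions reject the bare tuple form.
df.groupby(["weekday", "categorical_feature"])[["feature_3", "feature_1"]].mean()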

                                 feature_3   feature_1
weekday   categorical_feature
Friday    bar                  1748.272727  551.545455
          foo                  1253.117647  271.705882
          other                1542.000000  455.388889
Monday    bar                  1412.250000  376.166667
          foo                  1566.307692  319.538462
          other                1185.150000  160.600000
Saturday  bar                  1716.000000  682.333333
          foo                  1918.666667  528.277778
          other                1860.631579  568.578947
Sunday    bar                  2013.600000  655.466667
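
To make the shape of this result easier to see, here is a small sketch on made-up data. The names `toy`, `day`, `kind`, `x`, and `y` are hypothetical. It assumes a reasonably recent pandas, where several aggregated columns must be passed as a list, and shows that the result is a DataFrame indexed by both grouping columns.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "day": ["Mon", "Mon", "Tue", "Tue"],
    "kind": ["foo", "foo", "foo", "bar"],
    "x": [1.0, 3.0, 5.0, 7.0],
    "y": [10.0, 30.0, 50.0, 70.0],
})

# A list of column names inside the outer brackets selects several columns.
result = toy.groupby(["day", "kind"])[["x", "y"]].mean()

print(type(result).__name__)            # the result is a DataFrame, not a Series
print(result.loc[("Mon", "foo"), "x"])  # look up one group with a (day, kind) tuple
```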


#### Change data type of output:
Chain the <code>astype()</code> method onto the end to change the data type, for example from <code>float</code> to <code>int</code>:

In [5]:
df.groupby(["weekday"])["feature_3"].mean().astype(int)

weekday
Friday       1484
Monday       1355
Saturday     1855
Sunday       1825
Thursday     1407
Tuesday      1388
Wednesday    1389
Name: feature_3, dtype: int32
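
Note that converting to <code>int</code> drops the fractional part rather than rounding: Friday's mean of 1484.565217 becomes 1484. If rounding to the nearest integer is what you want, one option is to call `round()` first. The sketch below rebuilds two of the means from the output above as a standalone Series (the variable names are my own) so that it runs without the CSV file.

```python
import pandas as pd

# Two of the weekday means shown earlier, copied into a standalone Series.
means = pd.Series({"Friday": 1484.565217, "Sunday": 1825.847826})

truncated = means.astype(int)        # fractional part is dropped
rounded = means.round().astype(int)  # round to nearest first, then convert

print(truncated)
print(rounded)
```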

#### Change format of output:
The format of the object returned can also be changed, for example using <code>.to_json()</code>, <code>.to_dict()</code>, <code>.to_csv()</code>, etc.

In [6]:
df.groupby(["weekday"])["feature_3"].mean().astype(int).to_dict()

{'Friday': 1484,
 'Monday': 1355,
 'Saturday': 1855,
 'Sunday': 1825,
 'Thursday': 1407,
 'Tuesday': 1388,
 'Wednesday': 1389}
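
The other converters work the same way. As a rough sketch, the snippet below rebuilds two of the values above as a standalone Series (the variable names are my own, chosen for illustration) and calls `.to_json()` and `.to_csv()`, both of which return a string when no file path is given.

```python
import json

import pandas as pd

# Two of the aggregated values from above, as a named Series with a named index.
means = pd.Series({"Friday": 1484, "Monday": 1355}, name="feature_3")
means.index.name = "weekday"

as_json = means.to_json()  # JSON object keyed by the index labels
as_csv = means.to_csv()    # CSV text with a header row

print(json.loads(as_json))
print(as_csv)
```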