# Aggregation & Grouping

NumPy (and, by extension, Pandas) offers several built-in aggregation functions.  
`sum()`, `mean()`, `median()`, `min()` and `max()` give a quick summary of a large dataset.  
Later in this notebook we will also explore the `groupby()` method that Pandas provides for more sophisticated queries and insights.

In [1]:
import pandas as pd
import numpy as np

We will reuse the same helper function to display several tables side by side.

In [2]:
from IPython.display import display_html
def display_pds(*args):
    html_str=''
    for _df in args:
        html_str += _df.to_html()
    # Match only the opening tag so that the closing </table> tag is left untouched
    display_html(html_str.replace('<table', '<table style="display:inline; margin:5px;"'), raw=True)

### Countries Dataset

We will load the Countries dataset (from Kaggle) and explore it

In [6]:
countries = pd.read_csv("../countries.csv")
countries.head()

Unnamed: 0,Country,Region,Population,Area (sq. mi.),Pop. Density (per sq. mi.),Coastline (coast/area ratio),Net migration,Infant mortality (per 1000 births),GDP ($ per capita),Literacy (%),Phones (per 1000),Arable (%),Crops (%),Other (%),Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,480,0,2306,16307,700.0,360,32,1213,22,8765,1,466,2034,38.0,24.0,38.0
1,Albania,EASTERN EUROPE,3581655,28748,1246,126,-493,2152,4500.0,865,712,2109,442,7449,3,1511,522,232.0,188.0,579.0
2,Algeria,NORTHERN AFRICA,32930091,2381740,138,4,-39,31,6000.0,700,781,322,25,9653,1,1714,461,101.0,6.0,298.0
3,American Samoa,OCEANIA,57794,199,2904,5829,-2071,927,8000.0,970,2595,10,15,75,2,2246,327,,,
4,Andorra,WESTERN EUROPE,71201,468,1521,0,66,405,19000.0,1000,4972,222,0,9778,3,871,625,,,


Using the aggregation functions, let's find the highest and lowest population in the dataset:

In [11]:
countries['Population'].max()

1313973713

In [12]:
countries['Population'].min()

7026

We can also check the sum, median and mean of the population:

In [13]:
countries['Population'].sum()

6524044551

In [14]:
countries['Population'].median()

4786994.0

In [15]:
countries['Population'].mean()

28740284.365638766
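
The five calls above can also be collected in a single step with `Series.agg()`, which accepts a list of aggregation names. A minimal sketch on a small made-up Series (the values are illustrative and are not the full Countries dataset):

```python
import pandas as pd

# Hypothetical population values, used only for illustration
population = pd.Series([7026, 57794, 71201, 3581655], name="Population")

# Passing a list of names returns one value per aggregation,
# indexed by the aggregation name
summary = population.agg(["min", "max", "sum", "median", "mean"])
print(summary)
```

The result is itself a Series, so a single statistic can be read back with, for example, `summary["median"]`.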

Some of these built-in aggregations are bundled into another method, `describe()`, which we used before in the [Introduction section](https://github.com/TomerGoldfeder/data-science-notebooks/blob/main/pandas/pandas_introduction.ipynb).  
Before using `describe()`, we will check whether any columns contain NaN values:

In [18]:
countries.isnull().any()

Country                               False
Region                                False
Population                            False
Area (sq. mi.)                        False
Pop. Density (per sq. mi.)            False
Coastline (coast/area ratio)          False
Net migration                          True
Infant mortality (per 1000 births)     True
GDP ($ per capita)                     True
Literacy (%)                           True
Phones (per 1000)                      True
Arable (%)                             True
Crops (%)                              True
Other (%)                              True
Climate                                True
Birthrate                              True
Deathrate                              True
Agriculture                            True
Industry                               True
Service                                True
dtype: bool

We can see that several columns contain NaN values, so we will drop the rows that contain them before calling `describe()`:

In [24]:
countries.dropna().describe()

Unnamed: 0,Population,Area (sq. mi.),GDP ($ per capita)
count,179.0,179.0,179.0
mean,34214150.0,564183.0,9125.698324
std,131763900.0,1395657.0,9644.123141
min,13477.0,28.0,500.0
25%,1188580.0,19915.0,1800.0
50%,6940432.0,118480.0,5100.0
75%,20860140.0,496441.0,12950.0
max,1313974000.0,9631420.0,37800.0


The `describe()` method gives us some first insights about the dataset. For example, GDP per capita ranges from $500 to $37,800, but the mean is only about $9,126.
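
Note that `dropna()` removes every row that contains at least one NaN, which is why every column in the table reports the same count of 179. `describe()` on its own skips missing values column by column, so the two approaches can produce different statistics. A small sketch with made-up values (the frame and its numbers are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing GDP value
df = pd.DataFrame({
    "Population": [100, 200, 300, 400],
    "GDP": [1000.0, np.nan, 3000.0, 5000.0],
})

# describe() ignores NaN per column: Population still uses all 4 rows
per_column = df.describe()

# dropna() first removes the whole second row, so both columns use 3 rows
complete_rows = df.dropna().describe()

print(per_column.loc["count"])
print(complete_rows.loc["count"])
```

Which variant is appropriate depends on whether the statistics of different columns need to refer to exactly the same set of rows.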

>There are some other built-in aggregation functions:
    <ul>
    <li>`count()` - total number of elements</li>
    <li>`sum()` - sum of elements</li>
    <li>`min()`, `max()` - minimum and maximum</li>
    <li>`mad()` - mean absolute deviation (deprecated in pandas 1.5 and removed in 2.0)</li>
    <li>`prod()` - product of elements</li>
    <li>`first()`, `last()` - first and last elements</li>
    <li>`mean()`, `median()` - mean and median of the elements</li>
    <li>`std()`, `var()` - standard deviation and variance</li>
    </ul>
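
In pandas versions where `mad()` is not available, the mean absolute deviation can be built from the other aggregations in the list. A minimal sketch on made-up values:

```python
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 6.0, 9.0])

# Mean absolute deviation: average distance of each element from the mean
mean_abs_dev = (values - values.mean()).abs().mean()
print(mean_abs_dev)
```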

## GroupBy Aggregation

The `groupby()` operation takes its name from the SQL `GROUP BY` clause, but it has some hidden 'layers' under the hood.   
The user does not need to think about it, but `groupby()` can often compute the requested aggregation in a single pass over the data.  

Calling `groupby()` does not return a `DataFrame` object but a `DataFrameGroupBy` object, which can handle more complex aggregations.

In [25]:
countries.groupby('Climate')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x123487640>

In [48]:
countries.groupby('Climate')['Population']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x1236f42e0>

The `DataFrameGroupBy` object is a kind of 'view' (as in SQL): it does not perform any aggregation until an aggregation function is called. This means the user can apply any aggregation function they need through the object's `agg()` method.

In [41]:
countries.groupby('Climate')['Population'].agg(lambda x: x.mean())

Climate
1      2.150973e+07
1,5    2.059263e+08
2      1.425539e+07
2,5    3.672341e+08
3      2.281059e+07
4      1.565195e+07
Name: Population, dtype: float64
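
Because `agg()` accepts any callable that reduces a group to a single value, aggregations without a built-in name are possible too. A sketch that computes the spread (maximum minus minimum) of each group, using a small hypothetical frame:

```python
import pandas as pd

# Made-up data; Climate is kept as text, as in the Countries dataset
df = pd.DataFrame({
    "Climate": ["1", "1", "2", "2", "2"],
    "Population": [10, 30, 5, 20, 50],
})

# No aggregation is computed yet: this only describes the grouping
grouped = df.groupby("Climate")["Population"]

# The custom function receives one Series per group and returns a scalar
spread = grouped.agg(lambda s: s.max() - s.min())
print(spread)
```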

Under the hood the `groupby()` method has 3 'steps':
<ol>
    <li>Split</li>
    <li>Apply</li>
    <li>Combine</li>
</ol>

`groupby()` splits the data into groups, one for each unique value of the selected column, then applies the requested aggregation function to each group, and finally combines the per-group results into a single Series or DataFrame.

![groupby.png](attachment:groupby.png)
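
The three steps can be written out by hand to make them visible. This is only an illustrative sketch on made-up data, not how pandas implements `groupby()` internally:

```python
import pandas as pd

df = pd.DataFrame({
    "Climate": ["1", "2", "1", "2", "3"],
    "Population": [10, 20, 30, 40, 50],
})

# Split: one sub-frame per unique Climate value
pieces = {key: df[df["Climate"] == key] for key in df["Climate"].unique()}

# Apply: reduce each piece to a single number
sums = {key: piece["Population"].sum() for key, piece in pieces.items()}

# Combine: gather the per-group results into one Series
manual = pd.Series(sums, name="Population").sort_index()

# The same computation in a single call
built_in = df.groupby("Climate")["Population"].sum()

print(manual)
print(built_in)
```

Both Series hold the same values for the same group labels; the built-in version simply does the bookkeeping for us.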

We can iterate over the `DataFrameGroupBy` object with a regular `for` loop, though it is usually faster and easier to use the built-in `apply()` method.

In [47]:
for (climate, group) in countries.groupby('Climate'):
    print("climate: {0:4s}, shape: {1}".format(climate, group.shape))

climate: 1   , shape: (29, 20)
climate: 1,5 , shape: (8, 20)
climate: 2   , shape: (111, 20)
climate: 2,5 , shape: (3, 20)
climate: 3   , shape: (48, 20)
climate: 4   , shape: (6, 20)


Another advantage of `groupby()` is the ability to apply `describe()` to each of the groups.

In [49]:
countries.groupby('Climate')['Population'].describe()

Climate,count,mean,std,min,25%,50%,75%,max
1,29.0,21509730.0,34249470.0,56361.0,2418393.0,7961619.0,27019730.0,165803600.0
"1,5",8.0,205926300.0,450178400.0,4786994.0,15252788.0,31505210.0,113552100.0,1313974000.0
2,111.0,14255390.0,35238350.0,9439.0,185160.5,1641564.0,11609420.0,245452700.0
"2,5",3.0,367234100.0,630571900.0,1136334.0,3175116.0,5213898.0,550282900.0,1095352000.0
3,48.0,22810590.0,47899900.0,65409.0,2414052.75,7454650.5,17944480.0,298444200.0
4,6.0,15651950.0,22817100.0,33987.0,3357023.0,7395993.5,13998190.0,60876140.0


This lets us draw more insights from the data.  
For example, in this dataset we can see that type 2 is by far the most common climate (111 countries).
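
The `count` column is what supports this observation. When only the group sizes are needed, `size()` on the GroupBy object returns them directly. A sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"Climate": ["2", "1", "2", "3", "2", "1"]})

# Number of rows in each group
sizes = df.groupby("Climate").size()

# Label of the largest group
most_common = sizes.idxmax()

print(sizes)
print(most_common)
```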

### Agg, Filter, Transform & Apply

We mentioned before that the `DataFrameGroupBy` object has the `agg()` method. In addition, the object has `filter()`, `transform()` and `apply()` methods for various operations.

**Agg**  
The `agg()` method lets us use more complex aggregation functions. Previously we used it to calculate the mean population of each climate group.  

We can extend the query by passing `agg()` a list of the functions we want. Let's see an example of multiple aggregations in one command:

In [56]:
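# Select the numeric columns explicitly: 'mean' is not defined for the text columns
numeric_cols = ['Population', 'Area (sq. mi.)', 'GDP ($ per capita)']
countries.groupby('Climate')[numeric_cols].agg(['min', 'mean', 'max'])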

,Population,Population,Population,Area (sq. mi.),Area (sq. mi.),Area (sq. mi.),GDP ($ per capita),GDP ($ per capita),GDP ($ per capita)
Climate,min,mean,max,min,mean,max,min,mean,max
1,56361,21509730.0,165803560,665,976558.8,7686850,500.0,8150.0,29000.0
"1,5",4786994,205926300.0,1313973713,121320,2007061.0,9596960,700.0,3237.5,9000.0
2,9439,14255390.0,245452739,21,291563.6,8511965,500.0,6526.126126,36000.0
"2,5",1136334,367234100.0,1095351995,17363,1167818.0,3287590,1600.0,3133.333333,4900.0
3,65409,22810590.0,298444215,78,412812.3,9631420,600.0,16970.833333,37800.0
4,33987,15651950.0,60876136,160,592169.8,2717300,3500.0,12433.333333,27600.0


Another way is to pass a dictionary that maps each column to the aggregation we want to perform on it.  
Each dictionary value can be a single aggregation or a list of aggregations to compute on that column.

In [58]:
# String names select the optimized pandas implementations of each aggregation
countries.groupby('Climate').agg({
    "Population": ['min', 'max'],
    "Area (sq. mi.)": 'max',
    "GDP ($ per capita)": 'mean'
})

,Population,Population,Area (sq. mi.),GDP ($ per capita)
Climate,min,max,max,mean
1,56361,165803560,7686850,8150.0
"1,5",4786994,1313973713,9596960,3237.5
2,9439,245452739,8511965,6526.126126
"2,5",1136334,1095351995,3287590,3133.333333
3,65409,298444215,9631420,16970.833333
4,33987,60876136,2717300,12433.333333
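
The dictionary form produces two-level column labels. If flat, self-describing column names are preferred, pandas also supports keyword-style named aggregation, where each keyword becomes an output column and its value is a `(column, aggregation)` tuple. A sketch on a hypothetical frame (the output names `pop_min`, `pop_max` and `gdp_mean` are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Climate": ["1", "1", "2", "2"],
    "Population": [10, 30, 5, 25],
    "GDP": [500.0, 1500.0, 2000.0, 4000.0],
})

# Each keyword names an output column; the tuple says what to compute
result = df.groupby("Climate").agg(
    pop_min=("Population", "min"),
    pop_max=("Population", "max"),
    gdp_mean=("GDP", "mean"),
)
print(result)
```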


**Filter**  
The `filter()` method lets us drop whole groups based on a criterion. It keeps only the rows of the groups for which the given function returns `True`:

In [81]:
def filter_function(a):
    return a['Population'].std() > 6e8

countries.groupby('Climate').filter(filter_function)

Unnamed: 0,Country,Region,Population,Area (sq. mi.),Pop. Density (per sq. mi.),Coastline (coast/area ratio),Net migration,Infant mortality (per 1000 births),GDP ($ per capita),Literacy (%),Phones (per 1000),Arable (%),Crops (%),Other (%),Climate,Birthrate,Deathrate,Agriculture,Industry,Service
94,India,ASIA (EX. NEAR EAST),1095351995,3287590,3332,21,-7,5629,2900.0,595,454,544,274,4286,25,2201,818,186,276,538
112,Kyrgyzstan,C.W. OF IND. STATES,5213898,198500,263,0,-245,3564,1600.0,970,840,73,35,9235,25,228,708,353,208,439
194,Swaziland,SUB-SAHARAN AFRICA,1136334,17363,655,0,0,6927,4900.0,816,308,1035,7,8895,25,2741,2974,119,515,366
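
The function passed to `filter()` receives each group as a DataFrame and must return a single boolean. Rows that belong to rejected groups are dropped, and the surviving rows keep their original index. A sketch with a hypothetical minimum-size criterion on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "Climate": ["1", "2", "1", "3", "2", "2"],
    "Population": [10, 20, 30, 40, 50, 60],
})

# Keep only the groups that contain at least two rows
kept = df.groupby("Climate").filter(lambda g: len(g) >= 2)
print(kept)
```

Unlike `agg()`, the result is not one row per group: it is a subset of the original rows.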


**Transform**  
In contrast to the `agg()` and `filter()` methods, which reduce our data so we can deduce insights, the `transform()` method adjusts the data to our needs while preserving its shape: the output has the same shape as the input.  

For example, when training machine learning models we usually want to normalize the data, and `transform()` is a natural fit for that. Below we center the population of each climate group by subtracting the group mean. (In a moment we will see that `apply()` can be used for this as well.)

In [137]:
countries.dropna().groupby('Climate')['Population'].transform(lambda x: x - x.mean())

0      7.966155e+06
1     -2.417326e+07
2      9.839249e+06
6     -1.627429e+07
7     -1.621866e+07
           ...     
218    9.442669e+06
219    6.811520e+07
224   -1.634654e+06
225   -4.785756e+06
226   -4.050961e+06
Name: Population, Length: 179, dtype: float64
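
The cell above only centers each group on its mean. A fuller standardization (z-score) also divides by the group's standard deviation. A sketch on made-up values; `zscore` is a hypothetical helper name:

```python
import pandas as pd

df = pd.DataFrame({
    "Climate": ["1", "1", "1", "2", "2", "2"],
    "Population": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})

def zscore(s):
    # Center on the group mean, then scale by the group standard deviation
    # (pandas uses the sample standard deviation, ddof=1, by default)
    return (s - s.mean()) / s.std()

# The result has one value per input row, aligned with the original index
standardized = df.groupby("Climate")["Population"].transform(zscore)
print(standardized)
```

Although the two groups have very different scales, both end up on the same standardized scale after the transformation.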

**Apply**  
