# Group Operations

A very common workflow in data analysis is split-apply-combine:

Split the data into multiple groups >
Perform some operation on each group >
Combine the results

This workflow can be implemented in Python using the `groupby` facility provided by pandas.
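To make the three steps concrete, the following is a minimal sketch that performs them by hand on a small, made-up data set and then compares the outcome with a single `groupby` call. The names `scores`, `labels`, `manual` and `auto` are illustrative only and are not used elsewhere in this section.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
scores = pd.Series([10, 20, 30, 40])
labels = ['a', 'b', 'a', 'b']

# Split: collect the values that belong to each label
groups = {}
for label, value in zip(labels, scores):
    groups.setdefault(label, []).append(value)

# Apply: compute the mean of each group
means = {label: sum(values) / len(values) for label, values in groups.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(means).sort_index()

# The same three steps expressed as one groupby call
auto = scores.groupby(labels).mean()

print(manual)
print(auto)
```

Both Series should hold the same group means; `groupby` simply hides the bookkeeping.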

Suppose the `Series` object `Pmarks` contains the marks obtained by students in an examination. We want to compute the mean marks for students grouped by their gender. The gender data is available in a `Series` object `gender`.

To do this, we call the `groupby` method on `Pmarks` as shown below:

In [1]:
import pandas as pd
Pmarks = pd.Series([25, 23, 18, 16, 20, 9, 11, 16, 24, 29])
gender = pd.Series(['Male', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Female'], name = 'Gender')
byGender = Pmarks.groupby(gender)

In [2]:
type(byGender)

pandas.core.groupby.generic.SeriesGroupBy

Here, `byGender` is an object of class `SeriesGroupBy`.

Suppose we want to compute the mean of the `Pmarks` Series for each group. To do this, call the `mean` method on the groupby object created from `Pmarks`:

In [3]:
byGender.mean()

Gender
Female    18.4
Male      19.8
dtype: float64

Note that the result is a Series object indexed by the distinct values of the `gender` Series. The `name` attribute of `gender` is used as the name of the index in the result.

One can also group by a key that is a list object or a 1-D array. In that case, however, the index of the result Series has no name.
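As a small sketch of this point, the example below groups the same marks by a plain list and inspects the name of the resulting index. The variable names `gender_list`, `unnamed` and `named` are chosen here for illustration; `rename_axis` is one way to attach a name afterwards.

```python
import pandas as pd

Pmarks = pd.Series([25, 23, 18, 16, 20, 9, 11, 16, 24, 29])
gender_list = ['Male', 'Male', 'Female', 'Male', 'Female',
               'Female', 'Male', 'Female', 'Male', 'Female']

# Grouping by a plain list: the key carries no name
unnamed = Pmarks.groupby(gender_list).mean()
print(unnamed.index.name)

# A name can be attached to the index afterwards if one is wanted
named = unnamed.rename_axis('Gender')
print(named)
```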

Next, we consider the DataFrame `Marks` that contains marks of candidates in multiple courses.

In [4]:
import pandas as pd
Pmarks = pd.Series([25, 23, 18, 16, 20, 9, 11, 16, 24, 29])
Dmarks = pd.Series([22, 25, 20, 12, 18, 12, 12, 18, 22, 27])
Marks = pd.DataFrame({'Python':Pmarks, 'Database':Dmarks})
gender = pd.Series(['Male', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Female'], name = 'Gender')
Marks

   Python  Database
0      25        22
1      23        25
2      18        20
3      16        12
4      20        18
5       9        12
6      11        12
7      16        18
8      24        22
9      29        27


Next, we apply the `groupby` method to the DataFrame `Marks` to create a `DataFrameGroupBy` object, as shown below.

In [5]:
byGender = Marks.groupby(gender)

In [6]:
type(byGender)

pandas.core.groupby.generic.DataFrameGroupBy

Now we apply the `mean` method to the groupby object.

In [7]:
byGender.mean()

        Python  Database
Gender
Female    18.4      19.0
Male      19.8      18.6


The result is now a DataFrame whose rows are indexed by the distinct values of `gender` and whose columns are the same as those of the `Marks` DataFrame.

We can also use the `agg` method (short for `aggregate`) on the `byGender` object to compute several statistics at once, as shown below.

In [8]:
byGender.agg(['mean', 'std'])

        Python            Database
          mean       std      mean       std
Gender
Female    18.4  7.231874      19.0  5.385165
Male      19.8  6.058052      18.6  6.148170
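The `agg` method also accepts a dictionary when different columns need different statistics. The sketch below assumes the same marks data as above and asks for the mean of `Python` but the minimum and maximum of `Database`; the name `summary` is illustrative.

```python
import pandas as pd

Marks = pd.DataFrame({'Python': [25, 23, 18, 16, 20, 9, 11, 16, 24, 29],
                      'Database': [22, 25, 20, 12, 18, 12, 12, 18, 22, 27]})
gender = pd.Series(['Male', 'Male', 'Female', 'Male', 'Female',
                    'Female', 'Male', 'Female', 'Male', 'Female'], name='Gender')

# Keys are column names; values are lists of aggregation names for that column
summary = Marks.groupby(gender).agg({'Python': ['mean'],
                                     'Database': ['min', 'max']})
print(summary)
```

The columns of `summary` are hierarchical, pairing each column name with the statistic computed on it.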


### Column of DataFrame as key

When the `groupby` method is applied to a DataFrame object, you can use one of the columns of the same DataFrame as the grouping key. To do this, pass the name of the column as the key. For example,

In [9]:
Marks['Gender'] = gender

In [10]:
Marks

   Python  Database  Gender
0      25        22    Male
1      23        25    Male
2      18        20  Female
3      16        12    Male
4      20        18  Female
5       9        12  Female
6      11        12    Male
7      16        18  Female
8      24        22    Male
9      29        27  Female


In [11]:
byGender2 = Marks.groupby('Gender')
byGender2.mean()

        Python  Database
Gender
Female    18.4      19.0
Male      19.8      18.6


In general, a groupby key can be any sequence, such as a list or an array, having the _same length_ as the axis along which we want to group.

We can also create a groupby object with multiple keys. For example, see the following code.

In [12]:
Marks['Res'] = ['Home', 'Home', 'PG', 'Hostel', 'PG', 'Home', 'Home', 'PG','Hostel', 'Hostel']
byGenRes = Marks.groupby(['Gender', 'Res'])
byGenRes.mean()

                  Python   Database
Gender Res
Female Home     9.000000  12.000000
       Hostel  29.000000  27.000000
       PG      18.000000  18.666667
Male   Home    19.666667  19.666667
       Hostel  20.000000  17.000000


Note that the result has a hierarchical index, with the two keys forming its two levels.
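A hierarchically indexed result can be queried with tuples of labels, or reshaped so that one level becomes the columns. The following sketch rebuilds the same data and shows both; the names `means` and `wide` are illustrative.

```python
import pandas as pd

Marks = pd.DataFrame({
    'Python': [25, 23, 18, 16, 20, 9, 11, 16, 24, 29],
    'Database': [22, 25, 20, 12, 18, 12, 12, 18, 22, 27],
    'Gender': ['Male', 'Male', 'Female', 'Male', 'Female',
               'Female', 'Male', 'Female', 'Male', 'Female'],
    'Res': ['Home', 'Home', 'PG', 'Hostel', 'PG',
            'Home', 'Home', 'PG', 'Hostel', 'Hostel'],
})

means = Marks.groupby(['Gender', 'Res']).mean()

# Select one (Gender, Res) combination with a tuple of labels
print(means.loc[('Female', 'PG')])

# Select every row for one value of the outer level
print(means.loc['Male'])

# Move the inner level 'Res' into the columns; missing combinations become NaN
wide = means['Python'].unstack('Res')
print(wide)
```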

A groupby object can be indexed by a column name (or a list of column names). The result is again a `SeriesGroupBy` (or `DataFrameGroupBy`) object.

In [13]:
byGender['Database'].mean()

Gender
Female    19.0
Male      18.6
Name: Database, dtype: float64

In [14]:
byGender[['Database', 'Python']].mean()

        Database  Python
Gender
Female      19.0    18.4
Male        18.6    19.8
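The types mentioned above can be checked directly. The sketch below also compares selecting a column before aggregation with selecting it afterwards; the two give the same numbers, although selecting first avoids aggregating columns that are not needed. The names `before` and `after` are illustrative.

```python
import pandas as pd

Marks = pd.DataFrame({'Python': [25, 23, 18, 16, 20, 9, 11, 16, 24, 29],
                      'Database': [22, 25, 20, 12, 18, 12, 12, 18, 22, 27]})
gender = pd.Series(['Male', 'Male', 'Female', 'Male', 'Female',
                    'Female', 'Male', 'Female', 'Male', 'Female'], name='Gender')
byGender = Marks.groupby(gender)

# A single column name gives a SeriesGroupBy, a list gives a DataFrameGroupBy
print(type(byGender['Database']).__name__)
print(type(byGender[['Database', 'Python']]).__name__)

# Selecting the column before or after aggregating gives the same values
before = byGender['Database'].mean()
after = byGender.mean()['Database']
print((before == after).all())
```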


### Using a Series as groupby key

When a Series is used as a groupby key, groups are formed by matching the index labels of the key with the index labels of the Series/DataFrame being grouped, not by position. This is demonstrated in the following example.

In [15]:
x = pd.Series({'a':12, 'b':22, 'c':15, 'd':27})
y = pd.Series({'a':'M', 'c':'M', 'b':'F', 'd':'F'})
x.groupby(y).mean()

F    24.5
M    13.5
dtype: float64
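To see that the matching is done on labels, the sketch below repeats the example and then strips the index from the key by converting it to a NumPy array, which forces positional matching instead. The names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

x = pd.Series({'a': 12, 'b': 22, 'c': 15, 'd': 27})
y = pd.Series({'a': 'M', 'c': 'M', 'b': 'F', 'd': 'F'})

# Series key: aligned on labels, so 'a' and 'c' form group M
by_label = x.groupby(y).mean()

# Array key: matched by position, so the first two values form group M
by_position = x.groupby(y.to_numpy()).mean()

print(by_label)
print(by_position)
```

With label alignment the means are 24.5 for F and 13.5 for M; with positional matching they become 21.0 for F and 17.0 for M.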

### Index as groupby key

When applying `groupby` to a Series/DataFrame, the index can also be used as the key. For this purpose, we use the `level` parameter as shown below.

In [16]:
x = pd.Series([12, 22, 25,17], index = ['a', 'b', 'b', 'a'])
x.groupby(level=0).agg(['mean', 'std'])

   mean       std
a  14.5  3.535534
b  23.5  2.121320
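When the index is hierarchical, `level` can also be given as the name of an index level. The following sketch reuses the four values above with a made-up two-level index; the level names `outer` and `inner` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical two-level index built for this illustration
idx = pd.MultiIndex.from_tuples(
    [('a', 'x'), ('b', 'x'), ('b', 'y'), ('a', 'y')],
    names=['outer', 'inner'])
s = pd.Series([12, 22, 25, 17], index=idx)

# Group by the level called 'outer' (equivalent to level=0 here)
by_outer = s.groupby(level='outer').mean()

# Group by the level called 'inner' (equivalent to level=1 here)
by_inner = s.groupby(level='inner').mean()

print(by_outer)
print(by_inner)
```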


### Iterating over GroupBy object

In [17]:
for sex, data in byGender:
    print('Average for ', sex, ' is:')
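    # numeric_only=True skips the non-numeric 'Gender' and 'Res' columns added earlier
    print(data.mean(numeric_only=True))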

Average for  Female  is:
Python      18.4
Database    19.0
dtype: float64
Average for  Male  is:
Python      19.8
Database    18.6
dtype: float64




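Iterating over a GroupBy object yields the group name together with the data belonging to that group. Note that the body of the `for` loop can perform any complex task.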
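When only one group is of interest, or when the groups are needed for later lookup, a loop is not strictly necessary. The sketch below uses `get_group` to fetch a single group and a dictionary comprehension to keep all of them; the names `female` and `pieces` are illustrative.

```python
import pandas as pd

Marks = pd.DataFrame({'Python': [25, 23, 18, 16, 20, 9, 11, 16, 24, 29],
                      'Database': [22, 25, 20, 12, 18, 12, 12, 18, 22, 27]})
gender = pd.Series(['Male', 'Male', 'Female', 'Male', 'Female',
                    'Female', 'Male', 'Female', 'Male', 'Female'], name='Gender')
byGender = Marks.groupby(gender)

# Fetch the rows of a single group by its name
female = byGender.get_group('Female')
print(female)

# Store every group in a dictionary keyed by group name
pieces = {name: group for name, group in byGender}
print(sorted(pieces))
```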

## Pivot Tables

A pivot table aggregates data in tabular form by one or more keys. Spreadsheet-like pivot tables can be computed using the `pivot_table` method as shown below.

In [18]:
Marks.pivot_table(values = 'Python', index = 'Gender')

        Python
Gender
Female    18.4
Male      19.8


The values computed in the cells are the arithmetic means (`np.mean`) by default. Other aggregation functions can be specified through the `aggfunc` argument.

In [19]:
Marks.pivot_table(values = 'Python', index = 'Gender', aggfunc = 'sum')

        Python
Gender
Female      92
Male        99
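A pivot table with a single row key is closely related to a `groupby` aggregation. As a rough check, the sketch below computes the same sums both ways; the names `pivot` and `grouped` are illustrative.

```python
import pandas as pd

Marks = pd.DataFrame({
    'Python': [25, 23, 18, 16, 20, 9, 11, 16, 24, 29],
    'Database': [22, 25, 20, 12, 18, 12, 12, 18, 22, 27],
    'Gender': ['Male', 'Male', 'Female', 'Male', 'Female',
               'Female', 'Male', 'Female', 'Male', 'Female'],
})

# Sum of Python marks per gender, computed as a pivot table
pivot = Marks.pivot_table(values='Python', index='Gender', aggfunc='sum')

# The same numbers obtained through groupby
grouped = Marks.groupby('Gender')[['Python']].sum()

print(pivot)
print(grouped)
```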


In [20]:
Marks.pivot_table(values = ['Python', 'Database'],
                  index = 'Gender', aggfunc = ['count', 'mean'])

           count              mean
        Database  Python  Database  Python
Gender
Female         5       5      19.0    18.4
Male           5       5      18.6    19.8


In [21]:
Marks

   Python  Database  Gender     Res
0      25        22    Male    Home
1      23        25    Male    Home
2      18        20  Female      PG
3      16        12    Male  Hostel
4      20        18  Female      PG
5       9        12  Female    Home
6      11        12    Male    Home
7      16        18  Female      PG
8      24        22    Male  Hostel
9      29        27  Female  Hostel


The example given below uses keys for both rows and columns.

In [22]:
Marks.pivot_table(values = ['Python', 'Database'], index = 'Gender', columns = 'Res')

         Database                        Python
Res          Home  Hostel         PG       Home  Hostel    PG
Gender
Female  12.000000    27.0  18.666667   9.000000    29.0  18.0
Male    19.666667    17.0        NaN  19.666667    20.0   NaN


In [23]:
Marks.pivot_table(index = 'Gender', columns = 'Res')

         Database                        Python
Res          Home  Hostel         PG       Home  Hostel    PG
Gender
Female  12.000000    27.0  18.666667   9.000000    29.0  18.0
Male    19.666667    17.0        NaN  19.666667    20.0   NaN


Note that omitting the `values` parameter results in the inclusion of both numeric columns, which appear as level 0 of the column index in the result.
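The missing entries in the two tables above arise because no male student has the residence `PG`. The sketch below shows two optional arguments of `pivot_table` that are often useful here: `fill_value` replaces missing cells and `margins` adds an `All` row and column. The names `filled` and `with_totals` are illustrative.

```python
import pandas as pd

Marks = pd.DataFrame({
    'Python': [25, 23, 18, 16, 20, 9, 11, 16, 24, 29],
    'Database': [22, 25, 20, 12, 18, 12, 12, 18, 22, 27],
    'Gender': ['Male', 'Male', 'Female', 'Male', 'Female',
               'Female', 'Male', 'Female', 'Male', 'Female'],
    'Res': ['Home', 'Home', 'PG', 'Hostel', 'PG',
            'Home', 'Home', 'PG', 'Hostel', 'Hostel'],
})

# Replace cells with no data by 0 instead of NaN
filled = Marks.pivot_table(values='Python', index='Gender',
                           columns='Res', fill_value=0)
print(filled)

# Add an 'All' row and column holding the means over each margin
with_totals = Marks.pivot_table(values='Python', index='Gender',
                                columns='Res', margins=True)
print(with_totals)
```

Whether 0 is a sensible replacement depends on the analysis; for mean marks it may be misleading, so treat `fill_value=0` here as a demonstration of the argument only.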

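A cross-tabulation, which counts the number of rows for each combination of keys, can be computed using the `crosstab` function.

In [24]: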
pd.crosstab(Marks.Gender, Marks.Res)

Res     Home  Hostel  PG
Gender
Female     1       1   3
Male       3       2   0
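The `crosstab` function takes a few optional arguments that are handy for frequency tables. The sketch below shows `margins`, which adds row and column totals, and `normalize='index'`, which converts each row into proportions. The names `counts` and `shares` are illustrative.

```python
import pandas as pd

Marks = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Male', 'Female',
               'Female', 'Male', 'Female', 'Male', 'Female'],
    'Res': ['Home', 'Home', 'PG', 'Hostel', 'PG',
            'Home', 'Home', 'PG', 'Hostel', 'Hostel'],
})

# Frequencies with an extra 'All' row and column of totals
counts = pd.crosstab(Marks['Gender'], Marks['Res'], margins=True)
print(counts)

# Row-wise proportions: each row sums to 1
shares = pd.crosstab(Marks['Gender'], Marks['Res'], normalize='index')
print(shares)
```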
