# 2.8 – Aggregation and Grouping

An essential piece of analysis of large data is efficient summarization: computing aggregations like ``sum()``, ``mean()``, ``median()``, ``min()``, and ``max()``, in which a single number gives insight into the nature of a potentially large dataset.
In this section, we'll explore aggregations in Pandas, from simple operations akin to what we've seen on NumPy arrays, to more sophisticated operations based on the concept of a ``groupby``.

For convenience, we'll use the same ``display`` helper class that we've seen in previous sections:

In [1]:
import numpy as np
import pandas as pd
import random

class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

## Planets Data

Here we will use the Planets dataset, available via the [Seaborn package](http://seaborn.pydata.org/) (see [Visualization With Seaborn](L313_Visualization_with_Seaborn.ipynb)).
It gives information on planets that astronomers have discovered around other stars (known as *extrasolar planets* or *exoplanets* for short). It can be downloaded with a simple Seaborn command:

In [2]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape

(1035, 6)

In [3]:
planets.head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


This has some details on the 1,000+ extrasolar planets discovered up to 2014.

## Simple Aggregation in Pandas

Earlier, we explored some of the data aggregations available for NumPy arrays (["Aggregations: Min, Max, and Everything In Between"](L14_Computation_on_Arrays_Aggregates.ipynb)).
As with a one-dimensional NumPy array, for a Pandas ``Series`` the aggregates return a single value:

In [4]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser

0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64

In [5]:
ser.sum()

2.811925491708157

In [6]:
ser.mean()

0.5623850983416314

For a ``DataFrame``, by default the aggregates return results within each column:

In [7]:
df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
df

Unnamed: 0,A,B
0,0.155995,0.020584
1,0.058084,0.96991
2,0.866176,0.832443
3,0.601115,0.212339
4,0.708073,0.181825


In [8]:
df.mean()

A    0.477888
B    0.443420
dtype: float64

By specifying the ``axis`` argument, you can instead aggregate within each row:

In [9]:
df.mean(axis='columns')

0    0.088290
1    0.513997
2    0.849309
3    0.406727
4    0.444949
dtype: float64

Pandas ``Series`` and ``DataFrame``s include all of the common aggregates mentioned in [Aggregations: Min, Max, and Everything In Between](L14_Computation_on_Arrays_Aggregates.ipynb); in addition, there is a convenience method ``describe()`` that computes several common aggregates for each column and returns the result.
Let's use this on the Planets data, for now dropping rows with missing values:

In [10]:
planets.dropna().describe()

Unnamed: 0,number,orbital_period,mass,distance,year
count,498.0,498.0,498.0,498.0,498.0
mean,1.73494,835.778671,2.50932,52.068213,2007.37751
std,1.17572,1469.128259,3.636274,46.596041,4.167284
min,1.0,1.3283,0.0036,1.35,1989.0
25%,1.0,38.27225,0.2125,24.4975,2005.0
50%,1.0,357.0,1.245,39.94,2009.0
75%,2.0,999.6,2.8675,59.3325,2011.0
max,6.0,17337.5,25.0,354.0,2014.0


This can be a useful way to begin understanding the overall properties of a dataset.
For example, we see in the ``year`` column that although exoplanets were discovered as far back as 1989, half of the exoplanets in this table (which excludes rows with missing values) were not discovered until 2009 or after.
This is largely thanks to the *Kepler* mission, which is a space-based telescope specifically designed for finding eclipsing planets around other stars.

The following table summarizes some other built-in Pandas aggregations:

| Aggregation              | Description                     |
|--------------------------|---------------------------------|
| ``count()``              | Total number of items           |
| ``first()``, ``last()``  | First and last item             |
| ``mean()``, ``median()`` | Mean and median                 |
| ``min()``, ``max()``     | Minimum and maximum             |
| ``std()``, ``var()``     | Standard deviation and variance |
| ``mad()``                | Mean absolute deviation         |
| ``prod()``               | Product of all items            |
| ``sum()``                | Sum of all items                |

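These are all methods of ``DataFrame`` and ``Series`` objects, with two caveats: ``mad()`` was removed in Pandas 2.0, and ``first()`` and ``last()`` are mainly useful on ``GroupBy`` objects, where they return the first and last entry of each group.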
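
As a quick check of the table above, here is a minimal sketch that applies several of these aggregates to a small ``Series``. The values are made up for illustration and are not taken from the Planets data:

```python
import pandas as pd

# A tiny, hand-written Series so the results are easy to verify by eye
ser_demo = pd.Series([1.0, 2.0, 3.0, 4.0])

# Each aggregate collapses the whole Series to a single number
summary = {
    'count': ser_demo.count(),    # number of non-missing items
    'sum': ser_demo.sum(),
    'prod': ser_demo.prod(),
    'mean': ser_demo.mean(),
    'median': ser_demo.median(),
    'min': ser_demo.min(),
    'max': ser_demo.max(),
    'std': ser_demo.std(),        # sample standard deviation (ddof=1)
    'var': ser_demo.var(),        # sample variance (ddof=1)
}

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

Note that ``std()`` and ``var()`` in Pandas use the sample definition (dividing by ``n - 1``) by default.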

To go deeper into the data, however, simple aggregates are often not enough.
The next level of data summarization is the ``groupby`` operation, which allows you to quickly and efficiently compute aggregates on subsets of data.

## GroupBy: Split, Apply, Combine

Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called ``groupby`` operation.
The name "group by" comes from a command in the SQL database language, but it is perhaps more illuminating to think of it in the terms first coined by Hadley Wickham of Rstats fame: *split, apply, combine*.

### Split, apply, combine

A canonical example of this split-apply-combine operation, where the "apply" is a summation aggregation, is illustrated in this figure:

![](figures/split-apply-combine.png)

This makes clear what the ``groupby`` accomplishes:

- The *split* step involves breaking up and grouping a ``DataFrame`` depending on the value of the specified key.
- The *apply* step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
- The *combine* step merges the results of these operations into an output array.

While this could certainly be done manually using some combination of the masking, aggregation, and merging commands covered earlier, an important realization is that *the intermediate splits do not need to be explicitly instantiated*. Rather, the ``GroupBy`` can (often) do this in a single pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way.
The power of the ``GroupBy`` is that it abstracts away these steps: the user need not think about *how* the computation is done under the hood, but rather thinks about the *operation as a whole*.
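
To make the three steps tangible, the sketch below performs them by hand with masking and then compares the result with a single ``groupby`` call. The frame ``toy`` and its values are illustrative assumptions:

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data': range(6)})

# Manual version:
#   split   -> one Boolean mask per distinct key
#   apply   -> sum the 'data' values selected by that mask
#   combine -> collect the per-key sums into a single Series
manual = pd.Series({k: toy.loc[toy['key'] == k, 'data'].sum()
                    for k in sorted(toy['key'].unique())})

# GroupBy version: the same three steps in one expression
grouped = toy.groupby('key')['data'].sum()

print(manual.to_dict())
print(grouped.to_dict())
```

The manual version scans the data once per key, whereas the ``groupby`` version leaves the bookkeeping to Pandas.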

As a concrete example, let's take a look at using Pandas for the computation shown in this diagram.
We'll start by creating the input ``DataFrame``:

In [11]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df

Unnamed: 0,key,data
0,A,0
1,B,1
2,C,2
3,A,3
4,B,4
5,C,5


The most basic split-apply-combine operation can be computed with the ``groupby()`` method of ``DataFrame``s, passing the name of the desired key column:

In [12]:
df.groupby('key')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f7eeeea8e50>

Notice that what is returned is not a set of ``DataFrame``s, but a ``DataFrameGroupBy`` object.
This object is where the magic is: you can think of it as a special view of the ``DataFrame``, which is poised to dig into the groups but does no actual computation until the aggregation is applied.
This "lazy evaluation" approach means that common aggregates can be implemented very efficiently in a way that is almost transparent to the user.
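
If you want to peek inside a ``GroupBy`` object without aggregating, a few attributes and methods expose the grouping itself. The following is a small sketch on a made-up frame (``toy`` and ``gb`` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data': range(6)})
gb = toy.groupby('key')

# Number of distinct groups
print(gb.ngroups)

# Mapping from each key to the row labels that belong to it
for key, labels in gb.groups.items():
    print(key, list(labels))

# Pull out the rows of one group as an ordinary DataFrame
group_a = gb.get_group('A')
print(group_a)
```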

To produce a result, we can apply an aggregate to this ``DataFrameGroupBy`` object, which will perform the appropriate apply/combine steps to produce the desired result:

In [13]:
df.groupby('key').sum()

Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
A,3
B,5
C,7


The ``sum()`` method is just one possibility here; you can apply virtually any common Pandas or NumPy aggregation function, as well as virtually any valid ``DataFrame`` operation, as we will see in the following discussion.

**Your turn.** Find the mean of each key of ``df``.

In [14]:
# write your code here



### The GroupBy object

The ``GroupBy`` object is a very flexible abstraction.
In many ways, you can simply treat it as if it's a collection of ``DataFrame``s, and it does the difficult things under the hood. Let's see some examples using the Planets data.

Perhaps the most important operations made available by a ``GroupBy`` are *aggregate*, *filter*, *transform*, and *apply*.
We'll discuss each of these more fully in ["Aggregate, Filter, Transform, Apply"](#Aggregate,-Filter,-Transform,-Apply), but before that let's introduce some of the other functionality that can be used with the basic ``GroupBy`` operation.

#### Column indexing

The ``GroupBy`` object supports column indexing in the same way as the ``DataFrame``, and returns a modified ``GroupBy`` object.
For example:

In [15]:
planets.groupby('method')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f7eeee8f730>

In [16]:
planets.groupby('method')['orbital_period']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f7eeee33310>

Here we've selected a particular ``Series`` group from the original ``DataFrame`` group by reference to its column name.
As with the ``GroupBy`` object, no computation is done until we call some aggregate on the object:

In [17]:
planets.groupby('method')['orbital_period'].median()

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64

This gives an idea of the general scale of orbital periods (in days) that each method is sensitive to.

**Your turn.** Count the number of planets discovered each year.

In [18]:
# write your code here



#### Iteration over groups

The ``GroupBy`` object supports direct iteration over the groups, returning each group as a ``Series`` or ``DataFrame``:

In [19]:
for (method, group) in planets.groupby('method'):
    print("{0:30s} shape={1}".format(method, group.shape))

Astrometry                     shape=(2, 6)
Eclipse Timing Variations      shape=(9, 6)
Imaging                        shape=(38, 6)
Microlensing                   shape=(23, 6)
Orbital Brightness Modulation  shape=(3, 6)
Pulsar Timing                  shape=(5, 6)
Pulsation Timing Variations    shape=(1, 6)
Radial Velocity                shape=(553, 6)
Transit                        shape=(397, 6)
Transit Timing Variations      shape=(4, 6)


Here ``group`` is a sub-DataFrame of the ``planets`` DataFrame obtained by selecting a particular value of ``method``. After the loop finishes, ``method`` and ``group`` still refer to the last group in the list above:

In [20]:
print("{0:30s} shape={1}".format(method, group.shape), "\n\n", group)

Transit Timing Variations      shape=(4, 6) 

                         method  number  orbital_period  mass  distance  year
680  Transit Timing Variations       2        160.0000   NaN    2119.0  2011
736  Transit Timing Variations       2         57.0110   NaN     855.0  2012
749  Transit Timing Variations       3             NaN   NaN       NaN  2014
813  Transit Timing Variations       2         22.3395   NaN     339.0  2013


This can be useful for doing certain things manually, though it is often much faster to use the built-in ``apply`` functionality, which we will discuss momentarily.
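
As an illustration of doing things manually, the sketch below loops over the groups to collect the earliest year per method, and then obtains the same numbers from a built-in aggregate. The data are invented for the example and are not rows of the Planets dataset:

```python
import pandas as pd

toy = pd.DataFrame({
    'method': ['Transit', 'Imaging', 'Transit', 'Imaging', 'Transit'],
    'year': [2010, 2008, 2012, 2011, 2013],
})

# Manual approach: iterate over (key, sub-DataFrame) pairs
first_year = {}
for method, group in toy.groupby('method'):
    first_year[method] = int(group['year'].min())

# Built-in approach: one aggregate call on the grouped column
first_year_builtin = toy.groupby('method')['year'].min()

print(first_year)
print(first_year_builtin.to_dict())
```

For a simple reduction like this one, the built-in aggregate is shorter; the explicit loop is mostly useful when each group needs custom handling such as plotting or writing to a file.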

#### Dispatch methods

Through some Python class magic, any method not explicitly implemented by the ``GroupBy`` object will be passed through and called on the groups, whether they are ``DataFrame`` or ``Series`` objects.
For example, you can use the ``describe()`` method of ``DataFrame``s to perform a set of aggregations that describe each group in the data:

In [21]:
planets.groupby('method')['year'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Astrometry,2.0,2011.5,2.12132,2010.0,2010.75,2011.5,2012.25,2013.0
Eclipse Timing Variations,9.0,2010.0,1.414214,2008.0,2009.0,2010.0,2011.0,2012.0
Imaging,38.0,2009.131579,2.781901,2004.0,2008.0,2009.0,2011.0,2013.0
Microlensing,23.0,2009.782609,2.859697,2004.0,2008.0,2010.0,2012.0,2013.0
Orbital Brightness Modulation,3.0,2011.666667,1.154701,2011.0,2011.0,2011.0,2012.0,2013.0
Pulsar Timing,5.0,1998.4,8.38451,1992.0,1992.0,1994.0,2003.0,2011.0
Pulsation Timing Variations,1.0,2007.0,,2007.0,2007.0,2007.0,2007.0,2007.0
Radial Velocity,553.0,2007.518987,4.249052,1989.0,2005.0,2009.0,2011.0,2014.0
Transit,397.0,2011.236776,2.077867,2002.0,2010.0,2012.0,2013.0,2014.0
Transit Timing Variations,4.0,2012.5,1.290994,2011.0,2011.75,2012.5,2013.25,2014.0


Looking at this table helps us to better understand the data: for example, the vast majority of planets have been discovered by the Radial Velocity and Transit methods, though the latter only became common (due to new, more accurate telescopes) in the last decade.
The newest methods seem to be Transit Timing Variations and Orbital Brightness Modulation, which were not used to discover a new planet until 2011.

This is just one example of the utility of dispatch methods.
Notice that they are applied *to each individual group*, and the results are then combined within ``GroupBy`` and returned.
Again, any valid ``DataFrame``/``Series`` method can be used on the corresponding ``GroupBy`` object, which allows for some very flexible and powerful operations!

### Aggregate, filter, transform, apply

The preceding discussion focused on aggregation for the combine operation, but there are more options available.
In particular, ``GroupBy`` objects have ``aggregate()``, ``filter()``, ``transform()``, and ``apply()`` methods that efficiently implement a variety of useful operations before combining the grouped data.

For the purpose of the following subsections, we'll use this ``DataFrame``:

In [22]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(1,7),
                   'data2': rng.choice(range(1,11), 6, replace=False)},
                   columns = ['key', 'data1', 'data2'])
df

Unnamed: 0,key,data1,data2
0,A,1,3
1,B,2,9
2,C,3,5
3,A,4,10
4,B,5,2
5,C,6,7


#### Aggregation

We're now familiar with ``GroupBy`` aggregations with ``sum()``, ``median()``, and the like, but the ``aggregate()`` method allows for even more flexibility.
It can take a string, a function, or a list thereof, and compute all the aggregates at once.
Here is a quick example combining all these:

In [23]:
df.groupby('key').aggregate(['min', np.median, max])

Unnamed: 0_level_0,data1,data1,data1,data2,data2,data2
Unnamed: 0_level_1,min,median,max,min,median,max
key,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
A,1,2.5,4,3,6.5,10
B,2,3.5,5,2,5.5,9
C,3,4.5,6,5,6.0,7


Another useful pattern is to pass a dictionary mapping column names to operations to be applied on that column:

In [24]:
df.groupby('key').aggregate({'data1': 'min',
                             'data2': 'max'})

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,1,10
B,2,9
C,3,7
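
A related pattern, sketched here under the assumption that you are running Pandas 0.25 or later, is *named aggregation*: each keyword argument names an output column and gives a ``(column, function)`` pair. The frame is written out explicitly so that the snippet runs on its own, and the output column names are illustrative choices:

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data1': [1, 2, 3, 4, 5, 6],
                    'data2': [3, 9, 5, 10, 2, 7]})

# keyword = (input column, aggregation) -> output column named by the keyword
named = toy.groupby('key').agg(
    data1_min=('data1', 'min'),
    data2_max=('data2', 'max'),
)
print(named)
```

Compared with the dictionary form, this yields flat, descriptive column names and allows several aggregates of the same input column.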


**Your turn.** Consider the DataFrame below:

In [25]:
rng = np.random.RandomState(0)
df1 = pd.DataFrame({'key1': ['A', 'B', 'A', 'B', 'A', 'B'],
                   'key2': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'data1': rng.choice(range(1,11), 6, replace=False),
                   'data2': rng.choice(range(1,11), 6, replace=False)})
df1

Unnamed: 0,key1,key2,data1,data2
0,A,a,3,4
1,B,a,9,6
2,A,a,5,2
3,B,b,10,3
4,A,b,2,10
5,B,b,7,9


Group by both ``key1`` and ``key2`` and compute the means of ``data1`` and ``data2``. This is a natural generalisation of what you have learned above.

In [26]:
# write your code here



#### Filtering

A filtering operation allows you to drop data based on the group properties.
For example, we might want to keep all groups in which the standard deviation is larger than some critical value:

In [27]:
def filter_func(x):
    return x['data2'].std() > 4

display('df', "df.groupby('key').std()", "df.groupby('key').filter(filter_func)")

Unnamed: 0,key,data1,data2
0,A,1,3
1,B,2,9
2,C,3,5
3,A,4,10
4,B,5,2
5,C,6,7

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,2.12132,4.949747
B,2.12132,4.949747
C,2.12132,1.414214

Unnamed: 0,key,data1,data2
0,A,1,3
1,B,2,9
3,A,4,10
4,B,5,2


The filter function should return a Boolean value specifying whether the group passes the filtering. Here, because group C does not have a standard deviation greater than 4, it is dropped from the result.
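
The filter function does not have to be a named function. As a minimal sketch on hand-written data (the threshold of 9 is an arbitrary choice for illustration), a ``lambda`` can express the condition inline:

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data2': [3, 9, 5, 10, 2, 7]})

# Keep every row that belongs to a group whose largest data2 value is >= 9.
# The lambda receives one sub-DataFrame per group and returns a single Boolean.
kept = toy.groupby('key').filter(lambda g: g['data2'].max() >= 9)
print(kept)
```

Note that ``filter()`` returns rows of the original frame, in their original order and with their original index, rather than one row per group.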

**Your turn.** Compute the standard deviation of key groups having mean of ``data1`` less than 4. <br> Hint: you will need to use ``groupby()`` twice.

In [28]:
# write your code here



#### Transformation

While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine.
For such a transformation, the output is the same shape as the input.
A common example is to center the data by subtracting the group-wise mean:

In [29]:
display("df",
        "df.groupby('key').mean()",
        "df.groupby('key').transform(lambda x: x - x.mean())")

Unnamed: 0,key,data1,data2
0,A,1,3
1,B,2,9
2,C,3,5
3,A,4,10
4,B,5,2
5,C,6,7

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,2.5,6.5
B,3.5,5.5
C,4.5,6.0

Unnamed: 0,data1,data2
0,-1.5,-3.5
1,-1.5,3.5
2,-1.5,-1.0
3,1.5,3.5
4,1.5,-3.5
5,1.5,1.0
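
Another use of ``transform()`` is filling missing values with a group-specific value. The sketch below, on made-up data, first broadcasts each group's mean back to the rows of that group and then uses the same idea to fill a ``NaN``:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                    'value': [1.0, np.nan, 3.0, 4.0]})

# Passing an aggregate name broadcasts the group result to every row
group_mean = toy.groupby('key')['value'].transform('mean')

# Fill each missing entry with the mean of its own group
filled = toy.groupby('key')['value'].transform(lambda s: s.fillna(s.mean()))

print(group_mean.tolist())
print(filled.tolist())
```

In both cases the result has one entry per input row, which is what distinguishes a transformation from an aggregation.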


**Your turn.** Standardise data by dividing each entry by its group's range, i.e. x.max() - x.min().

In [30]:
# write your code here



#### The apply() method

The ``apply()`` method lets you apply an arbitrary function to the group results.
The function should take a ``DataFrame``, and return either a Pandas object (e.g., ``DataFrame``, ``Series``) or a scalar; the combine operation will be tailored to the type of output returned.

For example, here is an ``apply()`` that normalizes the first column by the sum of the second:

In [31]:
pd.DataFrame(df.groupby('key').data2.sum())

Unnamed: 0_level_0,data2
key,Unnamed: 1_level_1
A,13
B,11
C,12


In [32]:
rng = np.random.RandomState(0)

def norm_by_data2(x):
    # x is a DataFrame of group values
    x['data1'] /= x['data2'].sum()
    return x

df_group_sums = pd.DataFrame(df.groupby('key')['data2'].sum())
# here pd.DataFrame() is needed to turn Series object into a DataFrame

display("df",
        "df_group_sums",
        "df.groupby('key').apply(norm_by_data2)")

Unnamed: 0,key,data1,data2
0,A,1,3
1,B,2,9
2,C,3,5
3,A,4,10
4,B,5,2
5,C,6,7

Unnamed: 0_level_0,data2
key,Unnamed: 1_level_1
A,13
B,11
C,12

Unnamed: 0,key,data1,data2
0,A,0.076923,3
1,B,0.181818,9
2,C,0.25,5
3,A,0.307692,10
4,B,0.454545,2
5,C,0.5,7


Here, for instance, 0.076923 = 1/13, where 1 is the ``data1`` entry in row 0 and 13 is the sum of ``data2`` over ``key`` group A.

``apply()`` within a ``GroupBy`` is quite flexible: the only criterion is that the function takes a ``DataFrame`` and returns a Pandas object or scalar; what you do in the middle is up to you!
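
To illustrate the scalar case, here is a sketch in which the applied function returns one number per group, so the combined result is a ``Series`` indexed by key. The function name ``weighted_mean`` and the choice of a weighted mean are assumptions made for the example; the data columns are selected explicitly so that the function only ever sees ``data1`` and ``data2``:

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data1': [1, 2, 3, 4, 5, 6],
                    'data2': [3, 9, 5, 10, 2, 7]})

def weighted_mean(g):
    # g is a DataFrame holding the data1 and data2 values of one group;
    # return the mean of data1 weighted by data2 (a single scalar)
    return (g['data1'] * g['data2']).sum() / g['data2'].sum()

result = toy.groupby('key')[['data1', 'data2']].apply(weighted_mean)
print(result)
```

For group C, for example, the value is (3 × 5 + 6 × 7) / (5 + 7) = 57 / 12 = 4.75.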

**Your turn.** Consider the ``df`` DataFrame. Multiply each entry of ``data2`` column by the product of the ``data1`` column grouped by ``key`` values. 

In [33]:
# write your code here



### Specifying the split key

In the simple examples presented before, we split the ``DataFrame`` on a single column name.
This is just one of many options by which the groups can be defined, and we'll go through some other options for group specification here.

#### A list, array, series, or index providing the grouping keys

The key can be any series or list with a length matching that of the ``DataFrame``. For example:

In [34]:
L = [0, 1, 2, 0, 1, 2]                 # A=0, B=1, C=2
display('df', 'df.groupby(L).sum()')

Unnamed: 0,key,data1,data2
0,A,1,3
1,B,2,9
2,C,3,5
3,A,4,10
4,B,5,2
5,C,6,7

Unnamed: 0,data1,data2
0,5,13
1,7,11
2,9,12


Of course, this means there's another, more verbose way of accomplishing the ``df.groupby('key')`` from before:

In [35]:
display('df', "df.groupby(df['key']).sum()")

Unnamed: 0,key,data1,data2
0,A,1,3
1,B,2,9
2,C,3,5
3,A,4,10
4,B,5,2
5,C,6,7

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,5,13
B,7,11
C,9,12


#### A dictionary or series mapping index to group

Another method is to provide a dictionary that maps index values to the group keys:

In [36]:
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
display('df2', 'df2.groupby(mapping).sum()')

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,1,3
B,2,9
C,3,5
A,4,10
B,5,2
C,6,7

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
consonant,16,23
vowel,5,13


Note that setting the ``key`` column as the index is necessary for this to work, since the mapping is applied to the index values:

In [37]:
df.groupby(mapping).sum() # df has index 0, 1, ..., 5, hence the result is not the wanted one

Unnamed: 0,data1,data2
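
If you would rather not change the index, one alternative (a sketch, not the only approach) is to translate the ``key`` column through the dictionary with ``Series.map()`` and group on the resulting series. The frame is rebuilt here with its values written out so that the snippet runs on its own:

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data1': [1, 2, 3, 4, 5, 6],
                    'data2': [3, 9, 5, 10, 2, 7]})
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}

# Translate each key into its group label; the result is aligned with toy's rows
labels = toy['key'].map(mapping)

# Group on the translated labels and sum only the numeric columns
by_label = toy.groupby(labels)[['data1', 'data2']].sum()
print(by_label)
```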


#### Any Python function

Similar to mapping, you can pass any Python function that will input the index value and output the group:

In [38]:
display('df2', 'df2.groupby(str.lower).mean()')

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,1,3
B,2,9
C,3,5
A,4,10
B,5,2
C,6,7

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
a,2.5,6.5
b,3.5,5.5
c,4.5,6.0


#### A list of valid keys

Further, any of the preceding key choices can be combined to group on a multi-index:

In [39]:
df2.groupby([str.lower, mapping]).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key,key,Unnamed: 2_level_1,Unnamed: 3_level_1
a,vowel,2.5,6.5
b,consonant,3.5,5.5
c,consonant,4.5,6.0


### Grouping example

As an example of this, in a couple of lines of Python code we can put all these together and count discovered planets by method and by decade:

In [40]:
planets.head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


In [41]:
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0) 
# try commenting out .unstack().fillna(0) to see how this works

decade,1980s,1990s,2000s,2010s
method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Astrometry,0.0,0.0,0.0,2.0
Eclipse Timing Variations,0.0,0.0,5.0,10.0
Imaging,0.0,0.0,29.0,21.0
Microlensing,0.0,0.0,12.0,15.0
Orbital Brightness Modulation,0.0,0.0,0.0,5.0
Pulsar Timing,0.0,9.0,1.0,1.0
Pulsation Timing Variations,0.0,0.0,1.0,0.0
Radial Velocity,1.0,52.0,475.0,424.0
Transit,0.0,0.0,64.0,712.0
Transit Timing Variations,0.0,0.0,0.0,9.0


This shows the power of combining many of the operations we've discussed up to this point when looking at realistic datasets.
We immediately gain a coarse understanding of when and how planets have been discovered over the past several decades!

Here I would suggest digging into these few lines of code, and evaluating the individual steps to make sure you understand exactly what they are doing to the result.
It's certainly a somewhat complicated example, but understanding these pieces will give you the means to similarly explore your own data.
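
One way to evaluate the individual steps is to run them on a frame small enough to check by hand. The sketch below uses four invented rows (not taken from the Planets data) and keeps each intermediate result in its own variable:

```python
import pandas as pd

toy = pd.DataFrame({'method': ['Transit', 'Transit', 'Imaging', 'Transit'],
                    'year': [2008, 2011, 2012, 2013],
                    'number': [1, 2, 1, 3]})

# Step 1: turn each year into a decade label such as '2010s'
decade = (10 * (toy['year'] // 10)).astype(str) + 's'
decade.name = 'decade'

# Step 2: group by a column name and an external Series at the same time;
# the result is a Series with a (method, decade) MultiIndex
stacked = toy.groupby(['method', decade])['number'].sum()

# Step 3: move the decade level into the columns; combinations that never
# occur become NaN, which fillna(0) replaces with zero
table = stacked.unstack().fillna(0)

print(stacked)
print(table)
```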

### Groupby on Multi-Indices

We've previously seen that Pandas has built-in data aggregation methods, such as ``mean()``, ``sum()``, and ``max()``.
For hierarchically indexed data, ``groupby()`` accepts a ``level`` parameter that selects which index level defines the groups, and hence which subset of the data each aggregate is computed on.

For example, let's return to the health data from [Hierarchical Indexing](L25_Hierarchical_Indexing.ipynb):

In [42]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,39.0,37.8,36.0,36.5,36.0,35.4
2013,2,47.0,38.7,39.0,38.1,50.0,37.8
2014,1,25.0,35.3,38.0,39.5,38.0,37.6
2014,2,39.0,39.1,25.0,37.1,26.0,37.3


Perhaps we'd like to average-out the measurements in the two visits each year. We can do this by naming the index level we'd like to explore, in this case the year:

In [43]:
data_mean = health_data.groupby(level='year').mean()
data_mean

subject,Bob,Bob,Guido,Guido,Sue,Sue
type,HR,Temp,HR,Temp,HR,Temp
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2013,43.0,38.25,37.5,37.3,43.0,36.6
2014,32.0,37.2,31.5,38.3,32.0,37.45


By further making use of the ``axis`` keyword, we can take the mean among levels on the columns as well:

In [44]:
data_mean.groupby(axis=1, level='type').mean()

type,HR,Temp
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2013,41.166667,37.383333
2014,31.833333,37.65


Thus in two lines, we've been able to find the average heart rate and temperature measured among all subjects in all visits each year. While this is a toy example, many real-world datasets have similar hierarchical structure.
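
Because the mock data above are random, the numbers change from run to run. The sketch below repeats the two steps on fixed, made-up values so the result can be checked by hand. It also groups the columns by transposing rather than passing ``axis=1``, since, to my understanding, the ``axis`` keyword of ``groupby()`` is deprecated in recent Pandas releases:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Fixed illustrative measurements: rows are (year, visit) pairs
values = np.array([[60, 36.5, 70, 37.0],
                   [64, 37.5, 74, 37.2],
                   [58, 36.9, 66, 36.8],
                   [62, 37.1, 70, 37.0]])
health = pd.DataFrame(values, index=index, columns=columns)

# Average the two visits within each year (groups defined by an index level)
by_year = health.groupby(level='year').mean()

# Average over subjects for each measurement type: transpose so the column
# levels become index levels, group on 'type', then transpose back
by_type = by_year.T.groupby(level='type').mean().T
print(by_type)
```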

---

## Exercises

**Exercise 2.8.1** The dataset [data/auto.csv](data/auto.csv) records various characteristics of cars, such as body-style, wheel-base, engine-type, price, mileage, horsepower, etc.

- Read the data and remove rows that have missing data 

In [45]:
auto = pd.read_csv("data/auto.csv")
auto.head()

Unnamed: 0,index,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
0,0,alfa-romero,convertible,88.6,168.8,dohc,four,111,21,13495.0
1,1,alfa-romero,convertible,88.6,168.8,dohc,four,111,21,16500.0
2,2,alfa-romero,hatchback,94.5,171.2,ohcv,six,154,19,16500.0
3,3,audi,sedan,99.8,176.6,ohc,four,102,24,13950.0
4,4,audi,sedan,99.4,176.6,ohc,five,115,18,17450.0


In [46]:
# write your solution here



In [47]:
# write your solution here



- Find the most expensive car company name

In [48]:
# write your solution here



- Print the details of all Toyota cars

In [49]:
# write your solution here



- Count total cars per company

In [50]:
# write your solution here



- Find the highest price (of all car prices) for each company

In [51]:
# write your solution here



- Find the average mileage of each car company

In [52]:
# write your solution here



- Sort all cars by the company and price columns

In [53]:
# write your solution here



- Find the min, median and max prices for each car-making company

In [54]:
# write your solution here



- Sort body-styles of cars by their average prices

In [55]:
# write your solution here



- Find the average horsepower per num-of-cylinders and visualise your result

In [56]:
# write your solution here



In [57]:
# write your solution here



---

**Exercise 2.8.2** Consider data files: [data/surveys2001.csv](data/surveys2001.csv) and [data/surveys2002.csv](data/surveys2002.csv).

- Read the data into Python and combine the files to make one new DataFrame. Export your results as a CSV and make sure it reads back into Python properly.

In [58]:
# write your solution here - read CSV files



In [59]:
# write your solution here



- Create a plot of average weight by year grouped by sex.

In [60]:
# write your solution here



In [61]:
# write your solution here



---

**Exercise 2.8.3** Consider [data/drinks.csv](data/drinks.csv) dataset.

In [62]:
drinks = pd.read_csv("data/drinks.csv")
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


- Which continent drinks the most beer on average? Answer: EU

In [63]:
# write your solution here



- For each continent print the statistics for wine consumption. Hint: use the method .describe()

In [64]:
# write your solution here



- Print the min, mean and max alcohol consumption per continent for every column

In [65]:
# write your solution here



- Plot the mean alcohol consumption per continent for every column

In [66]:
# write your solution here



---

**Exercise 2.8.4** Import a user occupation dataset from https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user and assign it to a variable called users.

In [67]:
users = pd.read_table('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', 
                      sep='|', index_col='user_id')
users.head()

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213


- Find the mean age per occupation

In [68]:
# write your solution here



- Find the male ratio per occupation and sort it in increasing order. Hint: create a new column ``gender_n`` as a copy of the ``gender`` column, but with M replaced by 1 and F replaced by 0.

In [69]:
# write your solution here



In [70]:
# write your solution here



- For each occupation and gender, calculate the minimum and maximum ages

In [71]:
# write your solution here



- For each occupation, present the percentage of women and men

In [72]:
# write your solution here



---

**Exercise 2.8.5** Frank Mulligrew is the algebra coordinator for Washington, DC public schools. He is required by the school board to gather some. Using the information about his class given in [data/algebradata.csv](data/algebradata.csv), calculate the following:

In [73]:
algebra = pd.read_csv("data/algebradata.csv")
algebra.head()

Unnamed: 0,Fname,Lname,Gender,Grade,Hours of Study
0,Mary,Ettienne,F,B,16
1,Charles,Looner,M,F,8
2,Betty,Franklin,F,A,24
3,Roger,Withers,M,C,5
4,John,Mulgrew,M,A,5


- The percentage of each grade

In [74]:
# write your solution here



- The percentage of each grade among men and women

In [75]:
# write your solution here



- The percentage of students with a passing grade

In [76]:
# write your solution here



- The percentage of men and women with a passing grade

In [77]:
# write your solution here



- The average hours of study for all students

In [78]:
# write your solution here



- The average hours of study for students with a passing grade

In [79]:
# write your solution here



---

**Exercise 2.8.6** Carlos Hugens is the sales manager for Axis Auto Sales, a low-cost regional chain of used car lots. Carlos is getting ready for his annual sales meeting and is looking for the best way to improve his sales group's performance. His data [data/axisdata.csv](data/axisdata.csv) includes the gender, years of experience, sales training, and hours worked per week for each team member. It also includes the average cars sold per month by each salesperson. Find out the following:

In [80]:
axis = pd.read_csv("data/axisdata.csv")
axis.head()

Unnamed: 0,Fname,Lname,Gender,Hours Worked,SalesTraining,Years Experience,Cars Sold
0,Jada,Walters,F,39,N,3,2
1,Nicole,Henderson,F,46,N,3,6
2,Tanya,Moore,F,42,Y,4,6
3,Ronelle,Jackson,F,38,Y,5,3
4,Brad,Sears,M,33,N,4,2


- The average number of cars sold per month

In [81]:
# write your solution here



- The maximum number of cars sold per month

In [82]:
# write your solution here



- The minimum number of cars sold per month

In [83]:
# write your solution here



- The average number of cars sold per month by gender

In [84]:
# write your solution here



- The average number of hours worked by people selling more than three cars per month
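
Questions of the form "the average of X for rows where Y holds" can be approached with a Boolean mask followed by an aggregation. A minimal sketch on made-up numbers (the ``toy`` frame is an illustrative assumption, not ``axis``):

```python
import pandas as pd

# Hypothetical toy data; not the exercise dataset.
toy = pd.DataFrame({
    "Hours Worked": [30, 40, 45, 50],
    "Cars Sold": [2, 3, 5, 6],
})

# Boolean mask: True for rows that satisfy the condition.
mask = toy["Cars Sold"] > 3

# Select the masked rows and one column, then aggregate.
avg_hours = toy.loc[mask, "Hours Worked"].mean()
print(avg_hours)
```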

In [85]:
# write your solution here



- The average number of years of experience

In [86]:
# write your solution here



- The average number of years of experience for people selling more than three cars per month

In [87]:
# write your solution here



- The average number of cars sold per month, grouped by whether the salesperson has had sales training

In [88]:
# write your solution here



---

**Exercise 2.8.7** Consider the [data/adult.csv](data/adult.csv) dataset.

In [89]:
data = pd.read_csv('data/adult.csv')
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


- How many men and women (sex feature) are represented in this dataset?

In [90]:
# write your solution here



- What is the average age (age feature) of women?

In [91]:
# write your solution here



- What is the percentage of German citizens (native-country feature)?

In [92]:
# write your solution here



- What are the mean and standard deviation of age for those who earn more than 50K per year (salary feature) and for those who earn 50K or less per year?

In [93]:
# write your solution here



- Is it true that all people who earn more than 50K have at least a high school education (education feature equal to Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters or Doctorate)?

In [94]:
# write your solution here



- Display age statistics for each race (race feature) and each gender (sex feature). Use groupby() and describe(). Find the maximum age of men of Amer-Indian-Eskimo race.
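
As a rough illustration of combining ``groupby()`` with ``describe()``, the sketch below groups a made-up table by two keys and then reads a single statistic out of the result. The ``toy`` frame and its values are assumptions for illustration, not the adult dataset.

```python
import pandas as pd

# Hypothetical toy data; not the exercise dataset.
toy = pd.DataFrame({
    "race": ["White", "White", "Black", "Black", "Black"],
    "sex": ["Male", "Female", "Male", "Male", "Female"],
    "age": [39, 50, 28, 53, 37],
})

# One row per (race, sex) group; columns are count, mean, std, min, ..., max.
stats = toy.groupby(["race", "sex"])["age"].describe()
print(stats)

# Pick one statistic for one group: row label is the (race, sex) tuple.
max_age_black_men = stats.loc[("Black", "Male"), "max"]
print(max_age_black_men)
```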

In [95]:
# write your solution here



- Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (marital-status feature)? Consider as married those whose marital-status starts with Married (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse); the rest are considered bachelors.
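
The "starts with Married" rule can be expressed with the ``.str.startswith()`` string method, and a proportion can be read off as the mean of a Boolean column. This is a sketch under assumptions: the ``toy`` frame and the ``"married"``/``"single"`` labels are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; not the exercise dataset.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Male", "Female"],
    "marital-status": ["Married-civ-spouse", "Never-married",
                       "Married-AF-spouse", "Divorced", "Married-civ-spouse"],
    "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K"],
})

men = toy[toy["sex"] == "Male"]

# Classify each man by the prefix of his marital status.
is_married = men["marital-status"].str.startswith("Married")
status = is_married.map({True: "married", False: "single"})

# Mean of a Boolean column = proportion of True values in each group.
high_earner = men["salary"] == ">50K"
share = high_earner.groupby(status).mean()
print(share)
```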

In [96]:
# write your solution here



In [97]:
# write your solution here



In [98]:
# write your solution here



- What is the maximum number of hours a person works per week (hours-per-week feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?

In [99]:
# write your solution here



- Compute the average working time (hours-per-week) for those who earn a little and those who earn a lot (salary) in each country (native-country). What are these values for Japan?
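
Grouping by two keys gives one average per (country, salary) pair, and selecting on the first key then isolates a single country. A minimal sketch on made-up values (the ``toy`` frame is an illustrative assumption, not the adult dataset):

```python
import pandas as pd

# Hypothetical toy data; not the exercise dataset.
toy = pd.DataFrame({
    "native-country": ["Japan", "Japan", "Japan", "Cuba", "Cuba"],
    "salary": ["<=50K", ">50K", ">50K", "<=50K", "<=50K"],
    "hours-per-week": [40, 50, 60, 35, 45],
})

# Average hours for every (country, salary) combination present in the data.
avg_hours = toy.groupby(["native-country", "salary"])["hours-per-week"].mean()
print(avg_hours)

# Selecting on the outer index level leaves a Series indexed by salary.
japan = avg_hours.loc["Japan"]
print(japan)
```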

In [100]:
# write your solution here



In [101]:
# write your solution here



---



*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; also available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*