Objectives

- Calculate summary statistics for column(s) of interest
- Calculate summary statistics for column(s) of interest grouped by given column
- Introduce the split-apply-combine approach
- Get the value counts for a given column

Content to cover

- min/mean/max/corr
- groupby() + min/max/mean…
- value_counts


In [78]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [36]:
titanic = pd.read_csv("../data/titanic.csv")
titanic.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Calculate statistics

### Aggregating statistics

![](../schemas/06_aggregate.svg)

> What is the average age of the Titanic passengers?

In [37]:
titanic["Age"].mean()

29.69911764705882

Different statistics are available and can be applied to columns with numerical data. Operations in general exclude missing data and operate across rows by default.
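To make the missing-data behaviour concrete, here is a minimal sketch on a made-up `ages` Series (hypothetical values, not the Titanic data). It uses the `skipna` parameter of `mean` to switch the default exclusion off:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the titanic data set
ages = pd.Series([22.0, 38.0, np.nan, 35.0])

# Default behaviour: the NaN is skipped, so this is (22 + 38 + 35) / 3
mean_default = ages.mean()

# With skipna=False the missing value propagates into the result
mean_with_nan = ages.mean(skipna=False)

print(mean_default)
print(mean_with_nan)
```

With the default setting the result is based on the three available values only, whereas `skipna=False` yields `NaN` as soon as one value is missing.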

![](../schemas/06_reduction.svg)

> What is the median age and ticket fare price of the Titanic passengers?

In [38]:
titanic[["Age", "Fare"]].median()

Age     28.0000
Fare    14.4542
dtype: float64

The aggregating statistic can be calculated for multiple columns at the same time. Remember the `describe` function from the [first tutorial](1_table_oriented.ipynb)?

In [39]:
titanic[["Age", "Fare"]].describe()

,Age,Fare
count,714.0,891.0
mean,29.699118,32.204208
std,14.526497,49.693429
min,0.42,0.0
25%,20.125,7.9104
50%,28.0,14.4542
75%,38.0,31.0
max,80.0,512.3292


Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined using the `agg` method:

In [42]:
titanic.agg({'Age' : ['min', 'max', 'median', 'skew'], 
             'Fare' : ['min', 'max', 'median', 'mean']})

,Age,Fare
max,80.0,512.3292
mean,,32.204208
median,28.0,14.4542
min,0.42,0.0
skew,0.389108,
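The empty cells in the table above arise because `skew` was requested only for `Age` and `mean` only for `Fare`. The following sketch reproduces this on a small made-up DataFrame (the values are hypothetical; only the column names follow the tutorial):

```python
import pandas as pd

# Hypothetical toy data standing in for the Age and Fare columns
df = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 20.0, 60.0]})

# Each column gets its own list of statistics
summary = df.agg({"Age": ["min", "max"], "Fare": ["max", "mean"]})

# Statistics that were not requested for a column show up as NaN
print(summary)
```

The rows of the result are the union of all requested statistics, so a statistic requested for only one column leaves `NaN` in the other.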


__To user guide:__ Further details about descriptive statistics are provided in :ref:`basics.stats`.

### Aggregating statistics grouped by category

To introduce the concept, let's focus on the `Age` and `Sex` columns of the Titanic data set:

In [61]:
titanic_subset = titanic[["Age", "Sex"]]
titanic_subset.head()

,Age,Sex
0,22.0,male
1,38.0,female
2,26.0,female
3,35.0,female
4,35.0,male


![](../schemas/06_groupby.svg)

> What is the average age for male versus female Titanic passengers?

In [62]:
titanic_subset.groupby("Sex").mean()

Sex,Age
female,27.915709
male,30.726645


Calculating a given statistic (e.g. `mean` age) _for each category in a column_ (e.g. male/female in the `Sex` column) is a common pattern. The `groupby` method supports this type of operation. More generally, it fits the `split-apply-combine` pattern:

* __Split__ the data into groups
* __Apply__ a function to each group independently
* __Combine__ the results into a data structure

The apply and combine steps are typically done together in pandas.
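The three steps can also be written out by hand, which may help to see what `groupby` does internally. This is only an illustrative sketch on made-up data (hypothetical values, with the same column names as the subset above), not how pandas implements it:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the tutorial subset
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Split: iterating over a groupby object yields (key, sub-DataFrame) pairs
# Apply: compute the mean age within each sub-DataFrame
means = {key: group["Age"].mean() for key, group in df.groupby("Sex")}

# Combine: collect the per-group results into a single Series
manual = pd.Series(means, name="Age")

# The usual one-liner performs all three steps at once
direct = df.groupby("Sex")["Age"].mean()

print(manual)
print(direct)
```

Both approaches give the same values per group; the one-liner is shorter and avoids the explicit Python loop.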

In the previous example, a subset of the data was used. To apply the pattern on the entire DataFrame, the selection of columns is supported on the grouped data as well:

In [63]:
titanic.groupby("Sex")["Age"].mean()

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64
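As a hedged sketch on made-up data (hypothetical values and names), the example below contrasts selecting one column on the grouped data with averaging all numeric columns at once. `numeric_only=True` is passed explicitly here on the assumption that, depending on the pandas version, text columns such as `Name` may otherwise cause an error:

```python
import pandas as pd

# Hypothetical toy data mixing text and numeric columns
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# Select the column of interest on the grouped data: returns a Series
age_only = df.groupby("Sex")["Age"].mean()

# Average every numeric column per group: returns a DataFrame
all_numeric = df.groupby("Sex").mean(numeric_only=True)

print(age_only)
print(all_numeric)
```

Selecting the column first keeps the result focused on the statistic of interest and avoids computing aggregates that are not needed.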

![](../schemas/06_groupby_select_detail.svg)

> What is the mean ticket fare price for each of the sex and cabin class combinations?

In [65]:
titanic.groupby(["Sex", "Pclass"])["Fare"].mean()

Sex     Pclass
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
Name: Fare, dtype: float64

Grouping can be done by multiple columns at the same time. Provide the column names as a list to the `groupby` method.
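Grouping by several columns returns a result with one index level per grouping column. The sketch below, on made-up data (hypothetical fares), shows two common ways to work with such a result: selecting one combination with a tuple, and pivoting the inner level into columns with `unstack`:

```python
import pandas as pd

# Hypothetical toy data using the tutorial's column names
df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Pclass": [1, 3, 1, 3, 3],
    "Fare": [80.0, 10.0, 60.0, 8.0, 12.0],
})

grouped = df.groupby(["Sex", "Pclass"])["Fare"].mean()

# Select a single combination with a (Sex, Pclass) tuple
male_third = grouped.loc[("male", 3)]

# Move the inner index level (Pclass) into columns for a compact table
table = grouped.unstack()

print(male_third)
print(table)
```

The unstacked table has one row per `Sex` value and one column per `Pclass` value, which is often easier to read than the nested index.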

__To user guide:__ More information on groupby and the split-apply-combine approach is provided in :ref:`api.groupby`.

### Count number of records by category

![](../schemas/06_valuecounts.svg)

> What is the number of passengers in each of the cabin classes?

In [77]:
titanic["Pclass"].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

The `value_counts` method counts the number of records for each category in a column. It is a shortcut for a `groupby` operation combined with counting the number of records within each group:

In [72]:
titanic.groupby("Pclass")["Pclass"].count()

Pclass
1    216
2    184
3    491
Name: Pclass, dtype: int64
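The two outputs above contain the same counts but in a different order: `value_counts` sorts by frequency, whereas `groupby` sorts by the group label. A small sketch on made-up data (hypothetical class labels) makes the comparison explicit:

```python
import pandas as pd

# Hypothetical toy data using the tutorial's column name
df = pd.DataFrame({"Pclass": [3, 1, 3, 2, 3, 1]})

# Most frequent class first
by_frequency = df["Pclass"].value_counts()

# Sorted by class label
by_label = df.groupby("Pclass")["Pclass"].count()

# After sorting on the index, both hold the same counts
same = by_frequency.sort_index().tolist() == by_label.tolist()

print(by_frequency)
print(by_label)
print(same)
```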

<div class="alert alert-info">
    
__Note__: Both `size` and `count` can be used in combination with `groupby`. Whereas `size` includes `NaN` values and simply returns the number of rows in each group, `count` excludes missing values. In the `value_counts` method, use the `dropna` argument to include or exclude `NaN` values.

</div>
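The difference described in the note can be checked on a small made-up DataFrame with missing ages (hypothetical values; only the column names follow the tutorial):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing ages
df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Age": [38.0, np.nan, 22.0, 26.0, np.nan],
})

# size: number of rows per group, missing values included
sizes = df.groupby("Pclass")["Age"].size()

# count: number of non-missing values per group
counts = df.groupby("Pclass")["Age"].count()

# value_counts drops NaN by default; dropna=False keeps it as a category
age_counts = df["Age"].value_counts(dropna=False)

print(sizes)
print(counts)
print(age_counts)
```

Here each class has one missing age, so `count` is one lower than `size` in both groups, and `NaN` appears as its own entry in `age_counts`.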

__To user guide:__ For more information about `value_counts`, see :ref:`basics.discretization`.

## REMEMBER

- Aggregation statistics can be calculated on entire columns or rows
- `groupby` provides the power of the _split-apply-combine_ pattern
- `value_counts` is a convenient shortcut to count the number of entries in each category of a variable

__To user guide:__ More information on groupby and the split-apply-combine approach is provided in :ref:`api.groupby`.