# Computing Descriptive Statistics with Pandas

Data analysis projects often begin with descriptive statistics, which give a sense of a dataset's properties. Fortunately, these statistics are easy to obtain from a Pandas `DataFrame`.

I illustrate by computing various descriptive statistics for the classic [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set).

## Dataset

The following code imports the packages we will need and loads the `iris` dataset.

In [1]:
import pandas as pd
from pandas import DataFrame
from sklearn.datasets import load_iris    # sklearn.datasets includes common example datasets

In [2]:
iris_obj = load_iris()    # A function to load in the iris dataset
iris_obj.data    # Dataset preview

array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2],
       [ 5.4,  3.9,  1.7,  0.4],
       [ 4.6,  3.4,  1.4,  0.3],
       [ 5. ,  3.4,  1.5,  0.2],
       [ 4.4,  2.9,  1.4,  0.2],
       [ 4.9,  3.1,  1.5,  0.1],
       [ 5.4,  3.7,  1.5,  0.2],
       [ 4.8,  3.4,  1.6,  0.2],
       [ 4.8,  3. ,  1.4,  0.1],
       [ 4.3,  3. ,  1.1,  0.1],
       [ 5.8,  4. ,  1.2,  0.2],
       [ 5.7,  4.4,  1.5,  0.4],
       [ 5.4,  3.9,  1.3,  0.4],
       [ 5.1,  3.5,  1.4,  0.3],
       [ 5.7,  3.8,  1.7,  0.3],
       [ 5.1,  3.8,  1.5,  0.3],
       [ 5.4,  3.4,  1.7,  0.2],
       [ 5.1,  3.7,  1.5,  0.4],
       [ 4.6,  3.6,  1. ,  0.2],
       [ 5.1,  3.3,  1.7,  0.5],
       [ 4.8,  3.4,  1.9,  0.2],
       [ 5. ,  3. ,  1.6,  0.2],
       [ 5. ,  3.4,  1.6,  0.4],
       [ 5.2,  3.5,  1.5,  0.2],
       [ 5.2,  3.4,  1.4,  0.2],
       [ 4.7,  3.2,  1.6,  0.2],
       [ 4

In [3]:
iris_obj.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [4]:
iris_obj.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [5]:
iris_obj.target_names

array(['setosa', 'versicolor', 'virginica'],
      dtype='<U10')

`load_iris()` loads in an object containing the iris dataset, which I stored in `iris_obj`. I now turn this into a `DataFrame`.

In [6]:
iris = DataFrame(iris_obj.data, columns=iris_obj.feature_names)    # Default integer index 0, 1, 2, ...
iris["species"] = iris_obj.target    # Integer class labels: 0, 1, 2
iris

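| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | 0 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | 0 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | 0 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | 0 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | 0 |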


In [7]:
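# Assign the result back to the column; an in-place edit through attribute access may not update the DataFrame in newer Pandas
iris["species"] = iris["species"].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})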
iris

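| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |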


Because the observations in this dataset are grouped by species, it makes sense to compute descriptive statistics within each group. We create the groups like so.

In [8]:
iris_grps = iris.groupby("species")

for name, data in iris_grps:
    print(name)
    print("---------------------\n\n")
    print(data.iloc[:, 0:4])
    print("\n\n\n")

setosa
---------------------


    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                 5.1               3.5                1.4               0.2
1                 4.9               3.0                1.4               0.2
2                 4.7               3.2                1.3               0.2
3                 4.6               3.1                1.5               0.2
4                 5.0               3.6                1.4               0.2
5                 5.4               3.9                1.7               0.4
6                 4.6               3.4                1.4               0.3
7                 5.0               3.4                1.5               0.2
8                 4.4               2.9                1.4               0.2
9                 4.9               3.1                1.5               0.1
10                5.4               3.7                1.5               0.2
11                4.8               3.4      

A lot of the methods for getting summary statistics for a `DataFrame` also work for group objects.
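
As a small illustration of that point, the sketch below builds a toy frame (hypothetical values, not the full iris data) and calls two methods on its group object: `size()` for the number of rows per group and `mean()` on a single selected column. The variable names are my own.

```python
import pandas as pd

# Hypothetical toy data with the same layout as the iris DataFrame
toy = pd.DataFrame({
    "petal length (cm)": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

groups = toy.groupby("species")

sizes = groups.size()                       # rows in each group
means = groups["petal length (cm)"].mean()  # per-group mean of one column

print(sizes)
print(means)
```

Selecting a column before calling the summary method returns a `Series` indexed by group name, which is often easier to read than a full table.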

## Getting the Basics

Let's compute some basic statistics.

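I use $n$ to denote the sample size, which is the number of rows in the dataset. The method `count()` returns the number of non-missing values in each column, which equals $n$ here because the iris data has no missing values.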

In [9]:
iris.count()

sepal length (cm)    150
sepal width (cm)     150
petal length (cm)    150
petal width (cm)     150
species              150
dtype: int64
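
The distinction between rows and non-missing values only shows up when data is missing. The following sketch uses a toy frame (hypothetical values) with one missing entry to show `len()` and `count()` disagreeing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one sepal length is missing
toy = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9, np.nan, 4.6],
    "sepal width (cm)": [3.5, 3.0, 3.2, 3.1],
})

n_rows = len(toy)          # number of rows, missing values included
non_missing = toy.count()  # non-missing values in each column

print(n_rows)
print(non_missing)
```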

The **sample mean** is the arithmetic mean of the dataset.

$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$$

In [10]:
iris.mean(numeric_only=True)    # Sample mean for every numeric column (skips the string-valued species column)

sepal length (cm)    5.843333
sepal width (cm)     3.054000
petal length (cm)    3.758667
petal width (cm)     1.198667
dtype: float64
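
To connect the formula with the method, here is a minimal check on the first five sepal lengths from the preview above. The name `mean_by_hand` is my own.

```python
import pandas as pd

# First five sepal lengths from the dataset preview
x = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0])

n = x.count()
mean_by_hand = x.sum() / n   # (1/n) * sum of the x_i

print(mean_by_hand, x.mean())
```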

The **sample median** is the "middle" data point, after ordering the dataset. Let $x_{(i)}$ represent ordered data ($x_{(1)}$ is smallest, $x_{(n)}$ largest).

$$\tilde{x} = \begin{cases}
x_{\left(\frac{n+1}{2}\right)} & \text{ if } n \text{ is odd} \\
\frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2} + 1\right)}\right) & \text{ if } n \text{ is even} \\
\end{cases}$$

In [11]:
iris.median(numeric_only=True)

sepal length (cm)    5.80
sepal width (cm)     3.00
petal length (cm)    4.35
petal width (cm)     1.30
dtype: float64
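
The two cases of the median formula can be written out directly. This is only a sketch for illustration (the helper `median_by_hand` is hypothetical and not part of Pandas); in practice `median()` is the method to use.

```python
import pandas as pd

def median_by_hand(s):
    """Sample median from the ordered data (hypothetical helper)."""
    x = s.dropna().sort_values().reset_index(drop=True)
    n = len(x)
    if n % 2 == 1:
        # x_((n+1)/2), shifted by one because positions are 0-based
        return x.iloc[(n + 1) // 2 - 1]
    # Average of x_(n/2) and x_(n/2 + 1), again 0-based
    return 0.5 * (x.iloc[n // 2 - 1] + x.iloc[n // 2])

odd = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0])        # n = 5
even = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0, 5.4])  # n = 6

print(median_by_hand(odd), odd.median())
print(median_by_hand(even), even.median())
```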

The **sample variance** is a measure of dispersion, roughly the "average" squared distance of a data point from the mean. The **standard deviation** is the square root of the variance and is interpreted as the "average" distance of a data point from the mean.

$$s^2 = \frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar{x})^2$$
$$s = \sqrt{s^2}$$

In [12]:
iris.var(numeric_only=True)

sepal length (cm)    0.685694
sepal width (cm)     0.188004
petal length (cm)    3.113179
petal width (cm)     0.582414
dtype: float64

In [13]:
iris.std(numeric_only=True)

sepal length (cm)    0.828066
sepal width (cm)     0.433594
petal length (cm)    1.764420
petal width (cm)     0.763161
dtype: float64
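
The $n - 1$ divisor in the formula matches the Pandas default (`ddof=1`), whereas NumPy's `np.var` divides by $n$ unless told otherwise. A minimal sketch on the first five sepal lengths, with variable names of my own choosing:

```python
import numpy as np
import pandas as pd

# First five sepal lengths from the dataset preview
x = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0])
n = x.count()

# Sample variance and standard deviation, following the formulas above
var_by_hand = ((x - x.mean()) ** 2).sum() / (n - 1)
std_by_hand = np.sqrt(var_by_hand)

# NumPy's default divisor is n (ddof=0), so it gives a smaller number
var_numpy_default = np.var(x.to_numpy())

print(var_by_hand, x.var())
print(std_by_hand, x.std())
print(var_numpy_default)
```

Passing `ddof=1` to `np.var` reproduces the Pandas result.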

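The **$p$th percentile** is a number such that roughly $p$% of the data is less than it; it is also referred to as a quantile. The method `quantile()` takes the fraction $p/100$ as its argument and, by default, interpolates linearly between data points, so the result need not be a value that appears in the dataset.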

In [14]:
iris.quantile(.1, numeric_only=True)   # The 10th percentile

sepal length (cm)    4.8
sepal width (cm)     2.5
petal length (cm)    1.4
petal width (cm)     0.2
Name: 0.1, dtype: float64

In [15]:
iris.quantile(.95, numeric_only=True)    # The 95th percentile

sepal length (cm)    7.255
sepal width (cm)     3.800
petal length (cm)    6.100
petal width (cm)     2.300
Name: 0.95, dtype: float64

In [16]:
iris.quantile(.75, numeric_only=True)    # Commonly known as the third quartile

sepal length (cm)    6.4
sepal width (cm)     3.3
petal length (cm)    5.1
petal width (cm)     1.8
Name: 0.75, dtype: float64

In [17]:
iris.quantile(.25, numeric_only=True)    # Commonly known as the first quartile

sepal length (cm)    5.1
sepal width (cm)     2.8
petal length (cm)    1.6
petal width (cm)     0.3
Name: 0.25, dtype: float64
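
The 95th percentile of sepal length above, 7.255, has more decimal places than any measurement in the data. That is because `quantile()` interpolates linearly between neighbouring ordered values by default. The sketch below reproduces that rule on toy data (hypothetical values); the helper `quantile_by_hand` is my own and only meant as an illustration.

```python
import numpy as np
import pandas as pd

def quantile_by_hand(s, q):
    """Linearly interpolated quantile (hypothetical helper)."""
    xs = s.dropna().sort_values().to_numpy()
    pos = (len(xs) - 1) * q              # fractional position in the sorted data
    lower = int(np.floor(pos))           # index of the value just below
    upper = min(lower + 1, len(xs) - 1)  # index of the value just above
    frac = pos - lower
    return xs[lower] + frac * (xs[upper] - xs[lower])

x = pd.Series([1.0, 2.0, 3.0, 4.0])   # hypothetical toy data

print(quantile_by_hand(x, 0.25), x.quantile(0.25))
print(quantile_by_hand(x, 0.95), x.quantile(0.95))
```

Other rules are available through the `interpolation` parameter of `quantile()`, for example `interpolation="nearest"`.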

If $Q_i$ denotes the $i$th quartile, the **interquartile range** (**IQR**) is the difference between the third quartile and the first quartile.

$$\text{IQR} = Q_3 - Q_1$$

In [18]:
# Pandas has no dedicated IQR method, but the IQR is nevertheless easy to obtain
iris.quantile(.75, numeric_only=True) - iris.quantile(.25, numeric_only=True)

sepal length (cm)    1.3
sepal width (cm)     0.5
petal length (cm)    3.5
petal width (cm)     1.5
dtype: float64
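
A variation on the same idea: `quantile()` also accepts a list of fractions, so both quartiles can be requested in one call and subtracted afterwards. A minimal sketch on the first five rows of the sepal measurements, with variable names of my own choosing:

```python
import pandas as pd

# First five rows of the sepal columns from the dataset preview
toy = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal width (cm)": [3.5, 3.0, 3.2, 3.1, 3.6],
})

# One row per requested quantile, one column per variable
quartiles = toy.quantile([0.25, 0.75])

# Third quartile minus first quartile, column by column
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]

print(quartiles)
print(iqr)
```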

Other interesting quantities include the maximum and minimum values.

In [19]:
iris.max()

sepal length (cm)          7.9
sepal width (cm)           4.4
petal length (cm)          6.9
petal width (cm)           2.5
species              virginica
dtype: object

In [20]:
iris.min()

sepal length (cm)       4.3
sepal width (cm)          2
petal length (cm)         1
petal width (cm)        0.1
species              setosa
dtype: object

Many of these summaries work for grouped data as well.

In [21]:
iris_grps.mean()

| species | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| setosa | 5.006 | 3.418 | 1.464 | 0.244 |
| versicolor | 5.936 | 2.77 | 4.26 | 1.326 |
| virginica | 6.588 | 2.974 | 5.552 | 2.026 |


In [22]:
iris_grps.std()

| species | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| setosa | 0.35249 | 0.381024 | 0.173511 | 0.10721 |
| versicolor | 0.516171 | 0.313798 | 0.469911 | 0.197753 |
| virginica | 0.63588 | 0.322497 | 0.551895 | 0.27465 |


In [23]:
iris_grps.quantile(.75)

| species | petal length (cm) | petal width (cm) | sepal length (cm) | sepal width (cm) |
|---|---|---|---|---|
| setosa | 1.575 | 0.3 | 5.2 | 3.675 |
| versicolor | 4.6 | 1.5 | 6.3 | 3.0 |
| virginica | 5.875 | 2.3 | 6.9 | 3.175 |


In [24]:
iris_grps.quantile(.75) - iris_grps.quantile(.25)

| species | petal length (cm) | petal width (cm) | sepal length (cm) | sepal width (cm) |
|---|---|---|---|---|
| setosa | 0.175 | 0.1 | 0.4 | 0.55 |
| versicolor | 0.6 | 0.3 | 0.7 | 0.475 |
| virginica | 0.775 | 0.5 | 0.675 | 0.375 |


## Other Useful Methods

The method `describe()` gets a number of useful summaries for a dataset.

In [25]:
iris.describe()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |
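
The quartiles shown by `describe()` are only its default choice. The `percentiles` parameter takes a list of fractions between 0 and 1 to report instead. A minimal sketch on the first five sepal lengths (the variable names are my own):

```python
import pandas as pd

# First five sepal lengths from the dataset preview
toy = pd.DataFrame({"sepal length (cm)": [5.1, 4.9, 4.7, 4.6, 5.0]})

# Ask for the 10th and 90th percentiles instead of the quartiles
summary = toy.describe(percentiles=[0.1, 0.9])

print(summary)
```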


In [26]:
# This also works well for grouped data.
iris_grps.describe()

| species | petal length (cm) count | petal length (cm) mean | petal length (cm) std | petal length (cm) min | petal length (cm) 25% | petal length (cm) 50% | petal length (cm) 75% | petal length (cm) max | petal width (cm) count | petal width (cm) mean | ... | sepal length (cm) 75% | sepal length (cm) max | sepal width (cm) count | sepal width (cm) mean | sepal width (cm) std | sepal width (cm) min | sepal width (cm) 25% | sepal width (cm) 50% | sepal width (cm) 75% | sepal width (cm) max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| setosa | 50.0 | 1.464 | 0.173511 | 1.0 | 1.4 | 1.5 | 1.575 | 1.9 | 50.0 | 0.244 | ... | 5.2 | 5.8 | 50.0 | 3.418 | 0.381024 | 2.3 | 3.125 | 3.4 | 3.675 | 4.4 |
| versicolor | 50.0 | 4.26 | 0.469911 | 3.0 | 4.0 | 4.35 | 4.6 | 5.1 | 50.0 | 1.326 | ... | 6.3 | 7.0 | 50.0 | 2.77 | 0.313798 | 2.0 | 2.525 | 2.8 | 3.0 | 3.4 |
| virginica | 50.0 | 5.552 | 0.551895 | 4.5 | 5.1 | 5.55 | 5.875 | 6.9 | 50.0 | 2.026 | ... | 6.9 | 7.9 | 50.0 | 2.974 | 0.322497 | 2.2 | 2.8 | 3.0 | 3.175 | 3.8 |


If we want custom numerical summaries, we can write functions to compute them for Pandas `Series` then apply them to the columns of a `DataFrame`.

I demonstrate by writing a function that computes the **range**, which is the difference between the maximum and minimum of a dataset.

$$\text{range} = x_{(n)} - x_{(1)}$$

In [27]:
# Compute the range of a dataset
def range_stat(s):
    return s.max() - s.min()

iris.iloc[:, 0:4].apply(range_stat)

sepal length (cm)    3.6
sepal width (cm)     2.4
petal length (cm)    5.9
petal width (cm)     2.4
dtype: float64

In [28]:
# Use aggregate() for groups
iris_grps.aggregate(range_stat)

| species | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| setosa | 1.5 | 2.1 | 0.9 | 0.5 |
| versicolor | 2.1 | 1.4 | 2.1 | 0.8 |
| virginica | 3.0 | 1.6 | 2.4 | 1.1 |
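
`aggregate()` (or its alias `agg()`) also accepts a list, so a custom function can sit next to built-in summaries named by string. The sketch below uses toy data (hypothetical values) and repeats `range_stat` so that it runs on its own.

```python
import pandas as pd

# Compute the range of a dataset (same function as above)
def range_stat(s):
    return s.max() - s.min()

# Hypothetical toy data with the same layout as the iris DataFrame
toy = pd.DataFrame({
    "petal length (cm)": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# One output column per entry in the list; the function's name labels its column
summary = toy.groupby("species")["petal length (cm)"].agg(["min", "max", range_stat])

print(summary)
```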
