# Descriptive statistics

pandas objects are equipped with a number of common mathematical and statistical methods. Most of them are reductions or summary statistics: methods that extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Unlike the equivalent methods of NumPy arrays, they handle missing data by default.
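As a small illustration of this difference in missing-data handling, the following sketch compares a NumPy reduction with its pandas counterpart. The values are made up for the example and do not come from the DataFrame used below.

```python
import numpy as np
import pandas as pd

values = [1.0, np.nan, 3.0]  # illustrative data with one missing value

numpy_total = np.sum(np.array(values))  # NaN propagates through the NumPy reduction
pandas_total = pd.Series(values).sum()  # pandas skips NaN by default (skipna=True)

print(numpy_total)   # nan
print(pandas_total)  # 4.0
```

NumPy does offer NaN-aware variants such as `np.nansum`, but they have to be chosen explicitly, whereas pandas skips missing values unless told otherwise.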

In [1]:
import pandas as pd
import numpy as np

rng = np.random.default_rng()
df = pd.DataFrame(rng.normal(size=(7, 3)), index=pd.date_range("2022-02-02", periods=7))
new_index = pd.date_range("2022-02-03", periods=7)
df2 = df.reindex(new_index)

df2

| | 0 | 1 | 2 |
| :--- | ---: | ---: | ---: |
| 2022-02-03 | -1.394921 | 0.464538 | 0.42878 |
| 2022-02-04 | 0.050359 | 1.324827 | 0.688342 |
| 2022-02-05 | -0.769438 | 1.76499 | 0.835642 |
| 2022-02-06 | -0.637301 | -0.974347 | 0.958251 |
| 2022-02-07 | 0.484021 | -0.780501 | -1.367992 |
| 2022-02-08 | 0.283652 | -0.531962 | -1.587233 |
| 2022-02-09 | NaN | NaN | NaN |


Calling the `pandas.DataFrame.sum` method returns a series containing column totals:

In [2]:
df2.sum()

0   -1.983628
1    1.267546
2   -0.044210
dtype: float64

Passing `axis='columns'` or `axis=1` instead sums across the columns, returning one total per row:

In [3]:
df2.sum(axis='columns')

2022-02-03   -0.501603
2022-02-04    2.063528
2022-02-05    1.831195
2022-02-06   -0.653397
2022-02-07   -1.664472
2022-02-08   -1.835543
2022-02-09    0.000000
Freq: D, dtype: float64

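If a row or column consists entirely of NA values, its sum is `0`, because missing values are skipped by default. Passing `skipna=False` turns this off, so that any row or column containing at least one NA value yields `NaN`: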

In [4]:
df2.sum(axis='columns', skipna=False)

2022-02-03   -0.501603
2022-02-04    2.063528
2022-02-05    1.831195
2022-02-06   -0.653397
2022-02-07   -1.664472
2022-02-08   -1.835543
2022-02-09         NaN
Freq: D, dtype: float64
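Between these two behaviours sits the `min_count` parameter of `sum` (and `prod`), which keeps skipping missing values but returns `NaN` when fewer than `min_count` valid values are present. The following is a minimal sketch on made-up data; the frame and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative data: a complete row, a partly missing row and an all-NA row
frame = pd.DataFrame({"a": [1.0, 1.0, np.nan], "b": [2.0, np.nan, np.nan]})

default_sums = frame.sum(axis="columns")                 # 3.0, 1.0, 0.0
strict_sums = frame.sum(axis="columns", skipna=False)    # 3.0, NaN, NaN
min_count_sums = frame.sum(axis="columns", min_count=1)  # 3.0, 1.0, NaN

print(pd.DataFrame({"default": default_sums,
                    "skipna=False": strict_sums,
                    "min_count=1": min_count_sums}))
```

With `min_count=1`, only the row without any valid value becomes `NaN`, while the partly missing row still gets a total.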

Some aggregations, such as `mean`, require at least one non-`NaN` value to produce a meaningful result; otherwise they return `NaN`:

In [5]:
df2.mean(axis='columns')

2022-02-03   -0.167201
2022-02-04    0.687843
2022-02-05    0.610398
2022-02-06   -0.217799
2022-02-07   -0.554824
2022-02-08   -0.611848
2022-02-09         NaN
Freq: D, dtype: float64

## Options for reduction methods

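Option | Description
:----- | :----------
`axis` | the axis to reduce over: `0` (or `'index'`) reduces down the rows and returns one value per column, `1` (or `'columns'`) reduces across the columns and returns one value per row
`skipna` | exclude missing values; `True` by default
`level` | reduce grouped by level if the axis is hierarchically indexed (MultiIndex); deprecated in pandas 1.3 and removed in pandas 2.0 in favour of `groupby(level=...)`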
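The options in the table can be sketched on a small hierarchically indexed frame. The data and the names `group` and `item` are invented for this example; because recent pandas versions no longer accept `level` in reductions, the grouped reduction is written with `groupby(level=...)`.

```python
import pandas as pd

# Hypothetical two-level index, used only for illustration
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2)], names=["group", "item"]
)
frame = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]}, index=index)

column_totals = frame.sum(axis=0)  # one value per column: x=10, y=100
row_totals = frame.sum(axis=1)     # one value per row: 11, 22, 33, 44

# Reduce within each value of the outer index level
group_totals = frame.groupby(level="group").sum()  # a: x=3, y=30; b: x=7, y=70

print(group_totals)
```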

Some methods, such as `idxmin` and `idxmax`, provide indirect statistics such as the index value at which the minimum or maximum value is reached:

In [6]:
df2.idxmax()

0   2022-02-07
1   2022-02-05
2   2022-02-06
dtype: datetime64[ns]
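The difference between index labels and integer positions is easiest to see on a Series with a non-numeric index. A minimal sketch with made-up values:

```python
import pandas as pd

# Illustrative readings labelled by weekday
temperatures = pd.Series([12.5, 17.0, 9.5], index=["mon", "tue", "wed"])

warmest_label = temperatures.idxmax()     # index label of the maximum: "tue"
coldest_label = temperatures.idxmin()     # index label of the minimum: "wed"
warmest_position = temperatures.argmax()  # integer position of the maximum: 1

print(warmest_label, coldest_label, warmest_position)
```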

Other methods are accumulations:

In [7]:
df2.cumsum()

| | 0 | 1 | 2 |
| :--- | ---: | ---: | ---: |
| 2022-02-03 | -1.394921 | 0.464538 | 0.42878 |
| 2022-02-04 | -1.344563 | 1.789365 | 1.117123 |
| 2022-02-05 | -2.114 | 3.554356 | 1.952765 |
| 2022-02-06 | -2.751301 | 2.580009 | 2.911016 |
| 2022-02-07 | -2.26728 | 1.799508 | 1.543024 |
| 2022-02-08 | -1.983628 | 1.267546 | -0.04421 |
| 2022-02-09 | NaN | NaN | NaN |
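Accumulations also skip missing values by default: a `NaN` stays `NaN` at its own position, but the running result continues afterwards. The following sketch on made-up data contrasts this with `skipna=False`.

```python
import numpy as np
import pandas as pd

series = pd.Series([1.0, np.nan, 2.0, 3.0])  # illustrative data

running_total = series.cumsum()                     # 1.0, NaN, 3.0, 6.0
running_max = series.cummax()                       # 1.0, NaN, 2.0, 3.0
propagated_total = series.cumsum(skipna=False)      # 1.0, NaN, NaN, NaN

print(pd.DataFrame({"cumsum": running_total,
                    "cummax": running_max,
                    "cumsum(skipna=False)": propagated_total}))
```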


Other methods are neither reductions nor accumulations. `describe` is one such example; it produces several summary statistics in one go:

In [8]:
df2.describe()

| | 0 | 1 | 2 |
| :--- | ---: | ---: | ---: |
| count | 6.0 | 6.0 | 6.0 |
| mean | -0.330605 | 0.211258 | -0.007368 |
| std | 0.721868 | 1.154114 | 1.154521 |
| min | -1.394921 | -0.974347 | -1.587233 |
| 25% | -0.736403 | -0.718366 | -0.918799 |
| 50% | -0.293471 | -0.033712 | 0.558561 |
| 75% | 0.225329 | 1.109755 | 0.798817 |
| max | 0.484021 | 1.76499 | 0.958251 |


For non-numeric data, `describe` generates alternative summary statistics:

In [9]:
data = {'Code': ['U+0000', 'U+0001', 'U+0002', 'U+0003', 'U+0004', 'U+0005'],
        'Octal': ['001', '002', '003', '004', '004', '005']}
df3 = pd.DataFrame(data)

df3.describe()

| | Code | Octal |
| :--- | :--- | :--- |
| count | 6 | 6 |
| unique | 6 | 5 |
| top | U+0000 | 004 |
| freq | 1 | 2 |
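The two kinds of summary can be seen side by side on a frame with one numeric and one text column. This is a minimal sketch on invented data; `percentiles` is an existing `describe` parameter, the column names are illustrative.

```python
import pandas as pd

# Illustrative frame with a numeric and a text column
mixed = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0],
                      "label": ["x", "y", "y", "z"]})

numeric_summary = mixed.describe()          # only the numeric column is summarised
label_summary = mixed["label"].describe()   # count, unique, top, freq

# Request other percentiles than the default quartiles
custom_summary = mixed["value"].describe(percentiles=[0.1, 0.9])

print(numeric_summary)
print(label_summary)
print(custom_summary)
```

By default, `describe` on a mixed DataFrame reports only the numeric columns, which is why the text column is summarised separately here.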


Descriptive and summary statistics:

Method | Description
:----- | :----------
`count` | number of non-NA values
`describe` | calculation of a set of summary statistics for series or each DataFrame column
`min`, `max` | calculation of minimum and maximum values
`argmin`, `argmax` | calculation of the integer positions at which the minimum or maximum value is reached
`idxmin`, `idxmax` | calculation of the index labels at which the minimum or maximum value is reached
`quantile` | calculation of the sample quantile for a given fraction between 0 and 1
`sum` | sum of the values
`mean` | arithmetic mean of the values
`median` | median (50% quantile) of the values
`mad` | mean absolute deviation from the mean value; deprecated in pandas 1.5 and removed in pandas 2.0
`prod` | product of all values
`var` | sample variance of the values
`std` | sample standard deviation of the values
`skew` | sample skewness (third moment) of the values
`kurt` | sample kurtosis (fourth moment) of the values
`cumsum` | cumulative sum of the values
`cummin`, `cummax` | cumulative minimum and maximum of the values respectively
`cumprod` | cumulative product of the values
`diff` | calculation of the first arithmetic difference (useful for time series)
`pct_change` | calculation of the percentage changes
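A few of the methods from the table can be tried on a short series. The prices below are made up for this sketch. Since `mad` is no longer available in recent pandas versions, the mean absolute deviation is computed by hand here, which is one possible replacement and not the only one.

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0, 120.0])  # illustrative data

median_price = prices.quantile(0.5)      # 105.0, same as prices.median()
step_changes = prices.diff()             # NaN, 10.0, -11.0, 21.0
relative_changes = prices.pct_change()   # NaN, then the change relative to the previous value

# Mean absolute deviation from the mean, written out manually
mean_abs_deviation = (prices - prices.mean()).abs().mean()  # 7.75

print(median_price, mean_abs_deviation)
print(step_changes)
print(relative_changes)
```

The first entry of `diff` and `pct_change` is `NaN` because there is no previous value to compare with.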