# Descriptive statistics

pandas objects are equipped with a number of common mathematical and statistical methods. Most of them are reductions or summary statistics: methods that extract a single value (such as the sum or mean) from a series, or a series of values from the rows or columns of a DataFrame. Unlike the corresponding NumPy array methods, they skip missing data by default.
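To make the difference in missing-data handling concrete, here is a minimal sketch with a made-up three-element list (`values` is purely illustrative and not part of the example data used later):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value
values = [1.0, np.nan, 3.0]

# A plain NumPy reduction propagates NaN
numpy_total = np.sum(np.array(values))

# The pandas reduction skips NaN by default (skipna=True)
pandas_total = pd.Series(values).sum()

print(numpy_total)   # nan
print(pandas_total)  # 4.0
```

NumPy also offers NaN-aware variants such as `np.nansum`, but they have to be chosen explicitly.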

In [1]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(7, 3), index=pd.date_range("2022-02-02", periods=7))
new_index = pd.date_range("2022-02-03", periods=7)
df2 = df.reindex(new_index)

df2

| | 0 | 1 | 2 |
| :--- | ---: | ---: | ---: |
| 2022-02-03 | -0.013938 | -2.260412 | 2.315637 |
| 2022-02-04 | -0.499093 | 1.407191 | -0.361063 |
| 2022-02-05 | -0.236897 | -1.1647 | 0.785671 |
| 2022-02-06 | -0.133355 | -0.754735 | -0.185206 |
| 2022-02-07 | 0.259555 | -0.61245 | 0.133201 |
| 2022-02-08 | -1.132296 | 0.513303 | -1.023171 |
| 2022-02-09 | NaN | NaN | NaN |


Calling the `pandas.DataFrame.sum` method returns a series containing column totals:

In [2]:
df2.sum()

0   -1.756023
1   -2.871803
2    1.665069
dtype: float64

Passing `axis='columns'` or `axis=1` instead sums across the columns, returning one total per row:

In [3]:
df2.sum(axis='columns')

2022-02-03    0.041287
2022-02-04    0.547036
2022-02-05   -0.615925
2022-02-06   -1.073296
2022-02-07   -0.219694
2022-02-08   -1.642164
2022-02-09    0.000000
Freq: D, dtype: float64

If a row or column consists entirely of NA values, its sum is `0`, because missing values are skipped by default. Passing `skipna=False` disables this, so any row or column containing an NA value yields `NaN`:

In [4]:
df2.sum(axis='columns', skipna=False)

2022-02-03    0.041287
2022-02-04    0.547036
2022-02-05   -0.615925
2022-02-06   -1.073296
2022-02-07   -0.219694
2022-02-08   -1.642164
2022-02-09         NaN
Freq: D, dtype: float64
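Note that `skipna=False` also turns partially filled rows into `NaN`. If only all-NA rows should yield `NaN`, the `min_count` parameter of `sum` may be a better fit. The following sketch uses a small made-up frame (`frame` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: a complete row, a partially missing row, an all-NA row
frame = pd.DataFrame({"a": [1.0, 5.0, np.nan], "b": [2.0, np.nan, np.nan]})

# Default: NA values are skipped, so the all-NA row sums to 0.0
default_sum = frame.sum(axis="columns")

# skipna=False: any NA in a row makes the result NaN
strict_sum = frame.sum(axis="columns", skipna=False)

# min_count=1: require at least one non-NA value, otherwise return NaN
min_count_sum = frame.sum(axis="columns", min_count=1)

print(default_sum.tolist())    # [3.0, 5.0, 0.0]
print(strict_sum.tolist())     # [3.0, nan, nan]
print(min_count_sum.tolist())  # [3.0, 5.0, nan]
```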

Some aggregations, such as `mean`, require at least one non-`NaN` value to produce a meaningful result, so they return `NaN` for an all-NA row or column:

In [5]:
df2.mean(axis='columns')

2022-02-03    0.013762
2022-02-04    0.182345
2022-02-05   -0.205308
2022-02-06   -0.357765
2022-02-07   -0.073231
2022-02-08   -0.547388
2022-02-09         NaN
Freq: D, dtype: float64

## Options for reduction methods

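Option | Description
:----- | :----------
`axis` | the axis to reduce over: `0` (or `'index'`) reduces down the rows and returns one value per column; `1` (or `'columns'`) reduces across the columns and returns one value per row
`skipna` | exclude missing values; by default `True`.
`level` | reduce grouped by level if the axis is hierarchically indexed (MultiIndex); deprecated in pandas 1.3 and removed in pandas 2.0 in favour of `groupby(level=...)`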
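A per-level reduction can also be written with `groupby(level=...)`, which is the form newer pandas releases expect. The sketch below assumes a small made-up series with a two-level index; the level names `group` and `item` are illustrative:

```python
import pandas as pd

# Hypothetical hierarchically indexed series
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2)], names=["group", "item"]
)
series = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Sum the values separately for each label of the outer level
per_group = series.groupby(level="group").sum()

print(per_group.to_dict())  # {'a': 3.0, 'b': 7.0}
```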

Some methods, such as `idxmin` and `idxmax`, provide indirect statistics such as the index value at which the minimum or maximum value is reached:

In [6]:
df2.idxmax()

0   2022-02-07
1   2022-02-04
2   2022-02-03
dtype: datetime64[ns]
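The difference between index labels and integer positions is easiest to see on a series with a non-numeric index. A minimal sketch, assuming a made-up series `temps`:

```python
import pandas as pd

# Hypothetical series with string labels
temps = pd.Series([12.5, 17.0, 9.0], index=["mon", "tue", "wed"])

label_of_max = temps.idxmax()     # index label of the maximum
position_of_max = temps.argmax()  # integer position of the maximum

print(label_of_max)     # tue
print(position_of_max)  # 1
```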

Other methods are accumulations:

In [7]:
df2.cumsum()

| | 0 | 1 | 2 |
| :--- | ---: | ---: | ---: |
| 2022-02-03 | -0.013938 | -2.260412 | 2.315637 |
| 2022-02-04 | -0.51303 | -0.85322 | 1.954574 |
| 2022-02-05 | -0.749927 | -2.01792 | 2.740245 |
| 2022-02-06 | -0.883282 | -2.772656 | 2.555039 |
| 2022-02-07 | -0.623727 | -3.385106 | 2.68824 |
| 2022-02-08 | -1.756023 | -2.871803 | 1.665069 |
| 2022-02-09 | NaN | NaN | NaN |
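Accumulations also accept `skipna`. With the default, a missing value stays `NaN` in the result but does not interrupt the running total. A short sketch with a made-up series `s`:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a missing value in the middle
s = pd.Series([1.0, np.nan, 2.0])

# Default (skipna=True): the NaN position stays NaN, the total continues
running = s.cumsum()

# skipna=False: everything from the first NaN onwards becomes NaN
running_strict = s.cumsum(skipna=False)

print(running.tolist())         # [1.0, nan, 3.0]
print(running_strict.tolist())  # [1.0, nan, nan]
```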


Some methods are neither reductions nor accumulations. `describe` is one example; it produces several summary statistics in one go:

In [8]:
df2.describe()

| | 0 | 1 | 2 |
| :--- | ---: | ---: | ---: |
| count | 6.0 | 6.0 | 6.0 |
| mean | -0.292671 | -0.478634 | 0.277511 |
| std | 0.481398 | 1.286844 | 1.161608 |
| min | -1.132296 | -2.260412 | -1.023171 |
| 25% | -0.433544 | -1.062209 | -0.317099 |
| 50% | -0.185126 | -0.683593 | -0.026003 |
| 75% | -0.043792 | 0.231865 | 0.622554 |
| max | 0.259555 | 1.407191 | 2.315637 |


For non-numeric data, `describe` generates alternative summary statistics:

In [9]:
data = {'Code': ['U+0000', 'U+0001', 'U+0002', 'U+0003', 'U+0004', 'U+0005'],
        'Octal': ['001', '002', '003', '004', '004', '005']}
df3 = pd.DataFrame(data)

df3.describe()

| | Code | Octal |
| :--- | :--- | :--- |
| count | 6 | 6 |
| unique | 6 | 5 |
| top | U+0000 | 004 |
| freq | 1 | 2 |
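When a DataFrame mixes numeric and non-numeric columns, `describe` reports only the numeric ones by default; passing `include="all"` covers every column. A minimal sketch, assuming a made-up frame `mixed`:

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column
mixed = pd.DataFrame({"code": ["a", "b", "b"], "value": [1.0, 2.0, 3.0]})

# Default: only the numeric column is summarised
numeric_only = mixed.describe()

# include="all": both kinds of statistics, with NaN where one does not apply
everything = mixed.describe(include="all")

print(list(numeric_only.columns))     # ['value']
print(everything.loc["top", "code"])  # b
```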


Descriptive and summary statistics:

Method | Description
:----- | :----------
`count` | number of non-NA values
`describe` | calculation of a set of summary statistics for series or each DataFrame column
`min`, `max` | calculation of minimum and maximum values
`argmin`, `argmax` | calculation of the integer positions at which the minimum or maximum value was reached (available on series)
`idxmin`, `idxmax` | calculation of the index labels at which the minimum or maximum values were reached
`quantile` | calculation of the sample quantile in the range from 0 to 1
`sum` | sum of the values
`mean` | arithmetic mean of the values
`median` | arithmetic median (50% quantile) of the values
`mad` | mean absolute deviation from the mean value (deprecated in pandas 1.5 and removed in pandas 2.0)
`prod` | product of all values
`var` | sample variance of the values
`std` | sample standard deviation of the values
`skew` | sample skewness (third moment) of the values
`kurt` | sample kurtosis (fourth moment) of the values
`cumsum` | cumulative sum of the values
`cummin`, `cummax` | cumulative minimum and maximum of the values respectively
`cumprod` | cumulative product of the values
`diff` | calculation of the first arithmetic difference (useful for time series)
`pct_change` | calculation of the percentage changes
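A few of these methods applied to a made-up price series are sketched below (`prices` and its values are assumptions for illustration). Since `mad` is not available in newer pandas releases, the mean absolute deviation is computed by hand here:

```python
import pandas as pd

# Hypothetical price series
prices = pd.Series([100.0, 110.0, 99.0])

median_price = prices.quantile(0.5)  # 50% quantile, same as prices.median()
changes = prices.diff()              # first differences: NaN, 10.0, -11.0
returns = prices.pct_change()        # relative changes: NaN, ~0.10, ~-0.10

# Mean absolute deviation from the mean, written out explicitly
mean_abs_dev = (prices - prices.mean()).abs().mean()

print(median_price)
print(changes.tolist())
print(returns.round(2).tolist())
print(round(mean_abs_dev, 3))
```

The first element of `diff` and `pct_change` is always `NaN` because there is no preceding value to compare against.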