<h1>Summarizing and Computing Descriptive Statistics</h1>

pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data. Consider a small DataFrame:

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                    [np.nan, np.nan], [0.75, -1.3]],
                    index=["a", "b", "c", "d"],
                    columns=["one", "two"])

df

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


In [3]:
# df.sum() would return the column sums instead

# sum across the columns, producing one value per row
df.sum(axis="columns")

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

When an entire row or column contains only NA values, the sum is 0, whereas if any value is not NA, the NA values are skipped and the sum is computed from the remaining values. This skipping can be disabled by passing `skipna=False`, in which case any NA value in a row or column makes the corresponding result NA:

In [4]:
# df.sum(axis="index", skipna=False)

df.sum(axis="columns", skipna=False)

a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64
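
As a small sketch of the NA handling described above, the following rebuilds the same DataFrame and contrasts the default behavior with the `min_count` argument of `sum`, which is not used elsewhere in this section. The variable names `default_sums`, `strict_sums`, and `column_sums` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

# Default: NA values are skipped, so the all-NA row "c" sums to 0.0
default_sums = df.sum(axis="columns")

# min_count=1 requires at least one non-NA value per row; row "c"
# becomes NaN while rows with partial data are still summed
strict_sums = df.sum(axis="columns", min_count=1)

# Column sums (the default axis) skip NA values in the same way
column_sums = df.sum()
```

Using `min_count=1` is one way to tell an empty row apart from a row whose values happen to sum to zero.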

Some aggregations, like `mean`, require at least one non-NA value to yield a non-NA result, so here we have:

In [5]:
df.mean(axis="columns")

a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64

Some methods, like `idxmin` and `idxmax`, return indirect statistics, like the index value where the minimum or maximum values are attained:

In [6]:
df.idxmax()

one    b
two    d
dtype: object
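
To make the idea of an indirect statistic concrete, here is a minimal sketch using the same DataFrame. It looks each returned label back up to recover the column maximum. The names `max_labels`, `min_labels`, and `recovered` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

# Row label of the largest and smallest value in each column
max_labels = df.idxmax()
min_labels = df.idxmin()

# Looking each label back up recovers the column maximum itself
recovered = pd.Series({col: df.loc[max_labels[col], col]
                       for col in df.columns})
```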

Other methods are accumulations:

In [7]:
df.cumsum()

    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8
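
The output above shows that an NA entry stays NA while the running total carries on past it. As a rough sketch, the following compares that default with `skipna=False` on a single column. The names `running` and `strict` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

# Default: position "c" stays NaN, but the total continues at "d"
running = df["one"].cumsum()

# skipna=False: once an NA is encountered, every later value is NaN
strict = df["one"].cumsum(skipna=False)
```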


Some methods are neither reductions nor accumulations. `describe` is one such example, producing multiple summary statistics in one shot:

In [8]:
df.describe()

            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
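
The quantile rows in this summary can be adjusted. As a brief sketch, the `percentiles` argument of `describe` asks for other cut points; the choice of 10% and 90% here is arbitrary and only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

# Request the 10th and 90th percentiles instead of the default quartiles
custom = df.describe(percentiles=[0.1, 0.9])

# Each row of the summary can be selected by its label
tenth = custom.loc["10%"]
```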


On nonnumeric data, `describe` produces alternative summary statistics:

In [9]:
obj = pd.Series(["a", "a", "b", "c"] * 4)

obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

<h3>Correlation and Covariance</h3>

Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let’s consider some DataFrames of stock prices and volumes originally obtained from Yahoo! Finance and available in binary Python pickle files:

In [10]:
price = pd.read_pickle("../data/yahoo_price.pkl")

volume = pd.read_pickle("../data/yahoo_volume.pkl")

I now compute percent changes of the prices, a time series operation that will be explored further in the chapter on time series:

In [11]:
returns = price.pct_change()

returns.tail()

                AAPL      GOOG       IBM      MSFT
Date
2016-10-17 -0.000680  0.001837  0.002072 -0.003483
2016-10-18 -0.000681  0.019616 -0.026168  0.007690
2016-10-19 -0.002979  0.007846  0.003583 -0.002255
2016-10-20 -0.000512 -0.005652  0.001719 -0.004867
2016-10-21 -0.003930  0.003011 -0.012474  0.042096
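
To show what `pct_change` computes, here is a minimal sketch on made-up prices. The tickers `AAA` and `BBB` and their values are hypothetical and are not the Yahoo! Finance data loaded above.

```python
import pandas as pd

# Hypothetical closing prices, invented for illustration only
toy_price = pd.DataFrame({"AAA": [100.0, 102.0, 99.96],
                          "BBB": [50.0, 49.0, 51.45]},
                         index=pd.date_range("2020-01-01", periods=3))

toy_returns = toy_price.pct_change()

# Equivalent by hand: each price divided by the previous row's price, minus 1
manual = toy_price / toy_price.shift(1) - 1
```

The first row of `toy_returns` is NA because there is no earlier price to compare against.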


The `corr` method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, `cov` computes the covariance:

In [12]:
returns["MSFT"].corr(returns["IBM"])

0.49976361144151155

In [13]:
returns["MSFT"].cov(returns["IBM"])

8.870655479703546e-05
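
The two numbers above are related: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this on synthetic data, since the pickle files are not assumed to be available. The Series `x` and `y` are random draws, not stock returns.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration; y is loosely related to x
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
y = 0.5 * x + pd.Series(rng.normal(size=200))
y.iloc[::25] = np.nan  # inject missing values; corr drops these pairs

pearson = x.corr(y)

# Recompute by hand using only the positions where both values are present
valid = y.notna()
manual = x[valid].cov(y[valid]) / (x[valid].std() * y[valid].std())
```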

DataFrame’s `corr` and `cov` methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, respectively:

In [14]:
returns.corr()

          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.407919  0.386817  0.389695
GOOG  0.407919  1.000000  0.405099  0.465919
IBM   0.386817  0.405099  1.000000  0.499764
MSFT  0.389695  0.465919  0.499764  1.000000


In [15]:
returns.cov()

          AAPL      GOOG       IBM      MSFT
AAPL  0.000277  0.000107  0.000078  0.000095
GOOG  0.000107  0.000251  0.000078  0.000108
IBM   0.000078  0.000078  0.000146  0.000089
MSFT  0.000095  0.000108  0.000089  0.000215


Using DataFrame’s `corrwith` method, you can compute pairwise correlations between a DataFrame’s columns or rows and another Series or DataFrame. Passing a Series returns a Series with the correlation value computed for each column:

In [16]:
returns.corrwith(returns["IBM"])

AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64
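
`corrwith` also accepts a DataFrame, in which case the correlations are computed between columns with matching names. The following sketch uses synthetic frames; `toy_returns` and `toy_volume` are random data standing in for the real ones.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for illustration only
rng = np.random.default_rng(1)
toy_returns = pd.DataFrame(rng.normal(size=(100, 3)), columns=["A", "B", "C"])
toy_volume = pd.DataFrame(rng.normal(size=(100, 3)), columns=["A", "B", "C"])

# Series argument: every column is correlated with that one Series
by_series = toy_returns.corrwith(toy_returns["B"])

# DataFrame argument: column "A" with "A", "B" with "B", and so on
by_frame = toy_returns.corrwith(toy_volume)

# The same result built one column at a time with Series.corr
expected = pd.Series({col: toy_returns[col].corr(toy_volume[col])
                      for col in toy_returns.columns})
```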

<h3>Unique Values, Value Counts, and Membership</h3>

Another class of related methods extracts information about the values contained in a one-dimensional Series. To illustrate these, consider this example:

In [17]:
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

The first function is `unique`, which gives you an array of the unique values in a Series:

In [18]:
uniques = obj.unique()

uniques

array(['c', 'a', 'd', 'b'], dtype=object)

The unique values are returned in the order in which they first appear, not in sorted order, but they can be sorted after the fact if needed (`uniques.sort()`). Relatedly, `value_counts` computes a Series containing value frequencies:

In [19]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

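The resulting Series is sorted by count in descending order as a convenience. `value_counts` is also available as a top-level pandas function that can be used with NumPy arrays or other Python sequences (recent pandas releases deprecate this form in favor of calling `value_counts` on a Series):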

In [20]:
pd.value_counts(obj.to_numpy(), sort=False)

c    3
a    3
d    1
b    2
dtype: int64
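
As a short sketch of a few variations on the calls above, the following sorts the unique values, orders the counts by label rather than by frequency, and converts counts to proportions with `normalize=True`. The names `sorted_uniques`, `by_label`, and `shares` are illustrative.

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

uniques = obj.unique()            # order of first appearance
sorted_uniques = sorted(uniques)  # a sorted Python list

# Order the counts by label instead of by frequency
by_label = obj.value_counts().sort_index()

# Proportions rather than raw counts
shares = obj.value_counts(normalize=True)
```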

`isin` performs a vectorized set membership check and can be useful in filtering a dataset down to a subset of values in a Series or column in a DataFrame:

In [21]:
mask = obj.isin(["b", "c"])

mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
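
The Boolean mask is most often used for selection. As a minimal sketch, indexing with `mask` keeps the matching entries, and negating it with `~` keeps the rest. The names `selected` and `excluded` are illustrative.

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])
mask = obj.isin(["b", "c"])

# Keep only the entries equal to "b" or "c"; original labels are preserved
selected = obj[mask]

# Invert the mask to keep everything else
excluded = obj[~mask]
```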

To compute value counts for every column of a DataFrame, pass `pandas.value_counts` to the DataFrame’s `apply` method:

In [22]:
data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})

result = data.apply(pd.value_counts).fillna(0)

result

   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0


Here, the row labels in the result are the distinct values occurring in any of the columns. The values are the respective counts of these values in each column, with 0 where a value does not occur in that column.
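
One possible variant, sketched here under the assumption that integer counts are preferred, calls the Series `value_counts` method through a lambda and casts the filled result back to integers. The name `counts` is illustrative.

```python
import pandas as pd

data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})

# Count values column by column; values absent from a column come back
# as NaN, so fill them with 0 and restore an integer dtype
counts = (data.apply(lambda col: col.value_counts())
              .fillna(0)
              .astype(int)
              .sort_index())
```

Each column of `counts` sums to 5, the number of rows in `data`, which is a quick sanity check on the result.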