# Measures of Central Tendency

In this notebook, we will discuss ways to summarize a set of data using a single number. The goal is to capture information about the distribution of data.

In [1]:
import numpy as np
import scipy.stats as stats

## Arithmetic Mean

The *arithmetic mean* is used very frequently to summarize numerical data, and is usually the one assumed to be meant by the word "average." It is defined as the sum of the observations divided by the number of observations:
$$\mu = \frac{\sum_{i=1}^N X_i}{N}$$

where $X_1, X_2, \ldots , X_N$ are our observations.

In [3]:
# We'll use these two datasets as examples
x1 = [1, 2, 2, 3, 4, 5, 5, 7]
x2 = x1 + [100]

print(f"Mean of x1: {sum(x1)} / {len(x1)} = {np.mean(x1)}")
print(f"Mean of x2: {sum(x2)} / {len(x2)} = {np.mean(x2)}")

Mean of x1: 29 / 8 = 3.625
Mean of x2: 129 / 9 = 14.333333333333334


We can also define a *weighted* arithmetic mean, which is useful for explicitly specifying the number of times each observation should be counted. For example, in computing the average value of a portfolio, it is more convenient to say that 70% of our stocks are of type $X$ rather than making a list of every share we hold.

The weighted arithmetic mean is defined as
$$\sum_{i=1}^n w_i X_i $$

where $\sum_{i=1}^n w_i = 1$. In the usual arithmetic mean, we have $w_i = 1/n$ for all $i$.
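As a minimal sketch of the definition above, the cell below computes a weighted mean for a small made-up portfolio. The values, weights, and share counts are illustrative assumptions, not data from this notebook. `np.average` normalizes whatever weights it is given, so raw share counts can be passed instead of fractions.

```python
import numpy as np

# Hypothetical portfolio: three asset types with illustrative per-share values
values = np.array([10.0, 20.0, 50.0])
weights = np.array([0.7, 0.2, 0.1])   # fraction of holdings in each type; sums to 1

# Weighted mean straight from the definition: sum of w_i * X_i
weighted_mean = np.sum(weights * values)

# np.average divides by the sum of the weights, so unnormalized counts also work
share_counts = np.array([70, 20, 10])  # hypothetical share counts in the same proportions
weighted_mean_np = np.average(values, weights=share_counts)

print(f"Weighted mean (definition): {weighted_mean}")
print(f"Weighted mean (np.average): {weighted_mean_np}")
```

Both calculations should agree, since the share counts are proportional to the weights.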

## Median

The *median* of a set of data is the number which appears in the middle of the list when it is sorted in increasing or decreasing order. When we have an odd number $n$ of data points, this is simply the value in position $(n + 1)/2$. When we have an even number of data points, the list splits in half and there is no item in the middle, so we define the median as the average of the values in positions $n/2$ and $n/2 + 1$.

The median is less affected by extreme values in the data than the arithmetic mean. It tells us the value that splits the dataset in half, but not how much smaller or larger the other values are.

In [4]:
print(f"Median of x1: {np.median(x1)}")
print(f"Median of x2: {np.median(x2)}")

Median of x1: 3.5
Median of x2: 4.0
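To make the positional definition concrete, here is a small sketch that computes the median by sorting and indexing. The helper name `median_by_position` is hypothetical and introduced only for illustration. Note the shift from the 1-based positions used in the text to Python's 0-based indices.

```python
def median_by_position(data):
    """Median computed directly from the positional definition (illustrative helper)."""
    s = sorted(data)
    n = len(s)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:
        # 1-based position (n + 1) / 2 corresponds to 0-based index (n - 1) // 2
        return float(s[(n - 1) // 2])
    # 1-based positions n/2 and n/2 + 1 correspond to 0-based indices n//2 - 1 and n//2
    return (s[n // 2 - 1] + s[n // 2]) / 2

x1 = [1, 2, 2, 3, 4, 5, 5, 7]
x2 = x1 + [100]

print(f"Median of x1 by position: {median_by_position(x1)}")
print(f"Median of x2 by position: {median_by_position(x2)}")
```

The results should match `np.median` on the same lists.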


## Mode

The *mode* is the most frequently occurring value in a dataset. It can be applied to non-numerical data too, unlike the mean and the median. One situation in which it is useful is for data whose possible values are independent. For example, in the outcomes of a weighted die, coming up 6 often does not mean it is likely to come up 5, so knowing that the dataset has a mode of 6 is more useful than knowing it has a mean of 4.5.

In [8]:
# Scipy has a built-in mode function, but it will return only one value
# even if two or more values occur the same number of times, or if no value appears more than once
print(f"Mode of x1: {stats.mode(x1)[0]}")

Mode of x1: 2


In [9]:
# We can write our own function to return all modes
def mode(l):
    # Count the number of times each element appears in the list
    counts = {}
    for e in l:
        if e in counts:
            counts[e] += 1
        else:
            counts[e] = 1
        
    # Return the element(s) that appear the most times
    maxcount = 0
    modes = set()
    for (key, value) in counts.items():
        if value > maxcount:
            maxcount = value
            modes = {key}
        elif value == maxcount:
            modes.add(key)

    if (maxcount > 1) or (len(l) == 1):
        return list(modes)
    return "No mode"

# Then use this function to find the modes of our datasets
print(f"Mode of x1: {mode(x1)}")

Mode of x1: [2, 5]
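The weighted-die example from the start of this section can be simulated. The sketch below assumes a hypothetical die that lands on 6 half the time and on each other face 10% of the time, which gives a theoretical mean of 4.5. The probabilities, seed, and sample size are assumptions chosen for illustration.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical weighted die: 6 comes up half the time, every other face 10% of the time
faces = np.arange(1, 7)
probs = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]
rolls = rng.choice(faces, size=10_000, p=probs)

die_mean = rolls.mean()
# Counter.most_common(1) returns a list with the single most frequent (value, count) pair
die_mode = Counter(rolls.tolist()).most_common(1)[0][0]

print(f"Mean of rolls: {die_mean:.3f}")
print(f"Mode of rolls: {die_mode}")
```

The mean lands near 4.5, a value the die can never show, while the mode identifies the face that actually dominates.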


For data that can take on many different values, such as returns data, there may not be any values that appear more than once. In this case, we can bin values, like we do when constructing a histogram, and then find the mode of the dataset where each value is replaced with the name of its bin. That is, we find which bin elements fall into most often.

In [10]:
# Mock implementation of get_pricing function
import pandas as pd
import yfinance as yf

_FIELD_MAP = {
    "price": "Close",        # Quantopian "price" ~= daily close
    "open_price": "Open",
    "high": "High",
    "low": "Low",
    "close_price": "Close",
    "volume": "Volume",
    "adj_close": "Adj Close",
}

def get_pricing(symbol, start_date, end_date, fields="price", adjusted=False):
    if isinstance(symbol, str):
        tickers = [symbol]
    else:
        tickers = list(symbol)

    col = _FIELD_MAP.get(fields, fields)

    df = yf.download(
        tickers=tickers,
        start=start_date,
        end=end_date,
        interval="1d",
        auto_adjust=adjusted,
        actions=False,
        progress=False,
        group_by="ticker",
    )

    # Case 1: MultiIndex columns: (ticker, field)
    if isinstance(df.columns, pd.MultiIndex):
        # return Series for single ticker, DataFrame for multi tickers
        if len(tickers) == 1:
            t = tickers[0]
            if (t, col) not in df.columns:
                raise KeyError(f"Missing {(t, col)}. Available: {list(df.columns)}")
            out = df[(t, col)].copy()
            out.name = t
            return out
        else:
            out = {}
            for t in tickers:
                if (t, col) not in df.columns:
                    raise KeyError(f"Missing {(t, col)}. Available: {list(df.columns)[:10]} ...")
                out[t] = df[(t, col)]
            return pd.DataFrame(out)

    # Case 2: Flat columns: "Open", "High", ...
    else:
        if col not in df.columns:
            raise KeyError(f"Field '{fields}' mapped to '{col}' not found. Available: {list(df.columns)}")
        if len(tickers) == 1:
            out = df[col].copy()
            out.name = tickers[0]
            return out
        else:
            # In flat-column case with multiple tickers, yfinance usually returns MultiIndex,
            # but handle defensively anyway.
            return df[col].copy()



In [None]:
# Get the data for an asset
start   = "2014-01-01"
end     = "2015-01-01"
pricing = get_pricing("SPY", start_date=start, end_date=end, fields="price", adjusted=False) 

# Calculate daily returns
returns = pricing.pct_change()[1:]

# Since all returns are unique, there is no mode
print(f"Mode of returns: {mode(returns)}")

Mode of returns: No mode


In [None]:
# We instead use a frequency distribution to get an alternative mode
# np.histogram returns the frequency distribution over the bins as well as the endpoints of the bins
hist, bins = np.histogram(returns, bins=20)   # Break data up into 20 bins
maxfreq = max(hist)

# Find all the bins that are hit with frequency maxfreq
# Then print the interval(s) corresponding to those bins
print(f"Mode of bins: {[(bins[i], bins[i + 1]) for i, j in enumerate(hist) if j == maxfreq]}")

Mode of bins: [(-0.0012499981123169877, 0.0011117022955209332)]


## Geometric Mean

While the arithmetic mean averages using addition, the *geometric mean* uses multiplication:
$$ G = \sqrt[n]{X_1X_2\ldots X_n} $$

for observations $X_i \geq 0$. When all of the observations are strictly positive, we can also rewrite it as an arithmetic mean using logarithms:
$$ \ln G = \frac{\sum_{i=1}^n \ln X_i}{n} $$

The geometric mean is always less than or equal to the arithmetic mean (when working with nonnegative observations), with equality only when all of the observations are the same.



In [14]:
# Use Scipy's gmean to compute the geometric mean
print(f"Geometric mean of x1: {stats.gmean(x1)}")
print(f"Geometric mean of x2: {stats.gmean(x2)}")

Geometric mean of x1: 3.0941040249774403
Geometric mean of x2: 4.552534587620071
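As a quick check of the two formulas for the geometric mean, the sketch below computes it for `x1` once from the product definition and once from the logarithmic form, then compares both with the arithmetic mean. The variable names are introduced here for illustration only.

```python
import numpy as np
import scipy.stats as stats

x1 = [1, 2, 2, 3, 4, 5, 5, 7]

# Product definition: n-th root of the product of the observations
g_product = np.prod(x1) ** (1 / len(x1))

# Logarithmic form: exponentiate the arithmetic mean of the logs
g_logs = np.exp(np.mean(np.log(x1)))

print(f"Geometric mean (product form): {g_product}")
print(f"Geometric mean (log form):     {g_logs}")
print(f"Arithmetic mean:               {np.mean(x1)}")
```

For long series the log form is usually the safer choice, because the raw product can overflow or underflow.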


What if we want to compute the geometric mean when we have negative observations? This problem is easy to solve in the case of asset returns, where our values are always at least $-1$. We can add 1 to a return $R_t$ to get $1 + R_t$, which is the ratio of the price of the asset for two consecutive periods (as opposed to the percent change between the prices, $R_t$). This quantity will always be nonnegative. So we can compute the geometric mean return,
$$ R_G = \sqrt[T]{(1 + R_1)\ldots (1 + R_T)} - 1$$

In [15]:
# Add 1 to every value in the returns array then compute the geometric mean
ratios = returns + np.ones(len(returns))
R_G = stats.gmean(ratios) - 1
print(f"Geometric mean of returns: {R_G}")

Geometric mean of returns: 0.00046461682827958484


The geometric mean is defined so that if the rate of return over the whole time period were constant and equal to $R_G$, the final price of the security would be the same as in the case of returns $R_1, \ldots, R_T$.

In [16]:
T = len(returns)
init_price  = pricing.iloc[0]   # use .iloc for positional access on a date-indexed Series
final_price = pricing.iloc[T]
print(f"Initial price: {init_price}")
print(f"Final price: {final_price}")
print(f"Final price as computed with geometric mean: {init_price * (1 + R_G) ** T}")

Initial price: 182.9199981689453
Final price: 205.5399932861328
Final price as computed with geometric mean: 205.53999328613




## Harmonic Mean

The *harmonic mean* is less commonly used than the other types of means. It is defined as
$$ H = \frac{n}{\sum_{i=1}^n \frac{1}{X_i}} $$

As with the geometric mean, we can rewrite the harmonic mean to look like an arithmetic mean. The reciprocal of the harmonic mean is the arithmetic mean of the reciprocals of the observations:
$$ \frac{1}{H} = \frac{\sum_{i=1}^n \frac{1}{X_i}}{n} $$

The harmonic mean for positive numbers $X_i$ is always at most the geometric mean (which is at most the arithmetic mean), and they are equal only when all of the observations are equal.

In [17]:
print(f"Harmonic mean of x1: {stats.hmean(x1)}")
print(f"Harmonic mean of x2: {stats.hmean(x2)}")

Harmonic mean of x1: 2.5590251332825593
Harmonic mean of x2: 2.869723656240511
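The reciprocal form and the ordering of the three means can be verified numerically. The sketch below recomputes the harmonic mean of `x1` from its definition and checks that harmonic ≤ geometric ≤ arithmetic. The variable names are illustrative.

```python
import numpy as np
import scipy.stats as stats

x = np.asarray([1, 2, 2, 3, 4, 5, 5, 7], dtype=float)

# Harmonic mean from the definition: n divided by the sum of reciprocals
h_manual = len(x) / np.sum(1 / x)

# Equivalent form: reciprocal of the arithmetic mean of the reciprocals
h_reciprocal = 1 / np.mean(1 / x)

g = stats.gmean(x)
a = np.mean(x)

print(f"Harmonic mean:   {h_manual}")
print(f"Geometric mean:  {g}")
print(f"Arithmetic mean: {a}")
print(f"H <= G <= A holds: {h_manual <= g <= a}")
```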


The harmonic mean can be used when the data can be naturally phrased in terms of ratios. For example, in the dollar-cost averaging strategy, a fixed amount is spent on shares of a stock at regular intervals. The higher the price of the stock, the fewer shares an investor following this strategy buys. The average price per share they end up paying (total amount spent divided by total shares bought) is the harmonic mean of the prices.
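The dollar-cost averaging claim can be checked with a small sketch. The prices and the per-period budget below are made-up numbers, and fractional shares are assumed to be allowed.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical share prices on three purchase dates
prices = np.array([10.0, 20.0, 40.0])
budget_per_period = 100.0  # fixed amount invested each period (assumption)

# Shares bought each period: fewer shares when the price is higher
shares_bought = budget_per_period / prices

# Average cost per share = total money spent / total shares acquired
total_spent = budget_per_period * len(prices)
avg_cost_per_share = total_spent / shares_bought.sum()

print(f"Average cost per share:    {avg_cost_per_share}")
print(f"Harmonic mean of prices:   {stats.hmean(prices)}")
print(f"Arithmetic mean of prices: {prices.mean()}")
```

The average cost per share matches the harmonic mean and sits below the arithmetic mean of the prices, because more shares are bought when the price is low.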

## Point Estimates Can Be Deceiving

Means by nature hide a lot of information, as they collapse the entire distribution into one number. As a result, "point estimates," or metrics that use a single number, can often disguise large problems in our data. Thus, we should be careful that we are not losing key information by summarizing our data, and we should rarely, if ever, use a mean without also referring to a measure of spread.
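A minimal illustration: the two made-up datasets below share the same mean but behave very differently. Both arrays are assumptions chosen for the example, and standard deviation is used here as one possible measure of spread.

```python
import numpy as np

# Two hypothetical datasets constructed to have the same arithmetic mean
steady = np.array([9.0, 10.0, 11.0, 10.0, 10.0])
volatile = np.array([-30.0, 50.0, 10.0, 0.0, 20.0])

for name, data in [("steady", steady), ("volatile", volatile)]:
    # The mean alone cannot tell these apart; the spread can
    print(f"{name:>8}: mean = {data.mean():.2f}, "
          f"std = {data.std():.2f}, "
          f"range = [{data.min():.1f}, {data.max():.1f}]")
```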

## Underlying Distribution Can Be Wrong

Even when we are using the right metrics for mean and spread, they can make no sense if our underlying distribution is not what we think it is. For instance, using the standard deviation to estimate how frequently an event occurs usually assumes normality. We should try not to assume a distribution unless we have to, in which case we should rigorously check that the data do fit the distribution that we are assuming.
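One possible way to check a normality assumption is a statistical test such as `scipy.stats.normaltest` (the D'Agostino-Pearson test). The sketch below applies it to two synthetic samples; the distributions, scale, seed, and sample size are all assumptions made for illustration and are not the returns data used earlier.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)  # fixed seed for reproducibility

# Synthetic "returns": one normal sample, one heavy-tailed (Student's t, 3 degrees of freedom)
samples = {
    "normal": rng.normal(loc=0.0, scale=0.01, size=5000),
    "heavy-tailed": 0.01 * rng.standard_t(df=3, size=5000),
}

pvalues = {}
for name, sample in samples.items():
    # Null hypothesis: the sample was drawn from a normal distribution
    statistic, pvalue = stats.normaltest(sample)
    pvalues[name] = pvalue
    print(f"{name:>12}: statistic = {statistic:.2f}, p-value = {pvalue:.4g}")
```

A very small p-value is evidence against normality. A large p-value does not prove the data are normal, so a test like this is best combined with visual checks such as a histogram or a Q-Q plot.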