# Statistics

### Describing a Single Set of Data

* A simple approach is to plot a histogram
* you can also look at the number of data points: **len(data)**
* you may be interested in the smallest and largest values: **min(data), max(data)**
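
The three summaries above can be tried on a small list. This is only a sketch: the values in `data` and the bucket width are made up for illustration, and the "histogram" is printed as text rather than plotted.

```python
from collections import Counter
from typing import List

# Hypothetical sample data, chosen only for illustration.
data: List[int] = [1, 3, 3, 4, 7, 8, 8, 8, 12, 15]

num_points = len(data)
smallest, largest = min(data), max(data)

# Text histogram: assign each value to a bucket of width 5,
# keyed by the bucket's lower bound (0, 5, 10, ...).
bucket_width = 5
buckets = Counter((x // bucket_width) * bucket_width for x in data)

for start in sorted(buckets):
    label = f"{start}-{start + bucket_width - 1}"
    print(f"{label:<6} {'#' * buckets[start]}")
```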

### Central Tendencies

* the *mean* (or average) is the sum of the data divided by its count
* if you have 2 data points, the mean is simply the point halfway between them
* the *median* is the middle-most value (if the number of data points is odd) or the average of the two middle-most values (if the number is even)

In [12]:
from typing import List

def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)

def _median_odd(xs: List[float]) -> float:
    return sorted(xs)[len(xs) // 2]

def _median_even(xs: List[float]) -> float:
    sorted_xs = sorted(xs)
    hi_midpoint = len(xs) // 2
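    return (sorted_xs[hi_midpoint - 1] + sorted_xs[hi_midpoint]) / 2

def median(xs: List[float]) -> float:
    """Finds the middle-most value of xs"""
    return _median_even(xs) if len(xs) % 2 == 0 else _median_odd(xs)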

* if we have *n* data points, and one of them increases by a small amount *e*, the mean will increase by *e / n*
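
The claim about the mean can be checked numerically. The sketch below restates `mean` so it runs on its own; the values in `data` and the size of `e` are hypothetical.

```python
import math
from typing import List

def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)

data = [1.0, 2.0, 3.0, 4.0, 10.0]   # hypothetical values
n = len(data)
e = 0.5

bumped = data.copy()
bumped[-1] += e                      # increase one data point by e

shift = mean(bumped) - mean(data)
print(math.isclose(shift, e / n))    # the mean moved by e / n

# The middle value of the sorted data, by contrast, is unchanged here.
median_before = sorted(data)[n // 2]
median_after = sorted(bumped)[n // 2]
```

In this example the median stays put because the changed point is not the middle one, while the mean depends on every value.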

* a generalization of the median is the *quantile*, which is the value below which a given fraction of the data lies (the median is the value below which 50% of the data lies)

In [8]:
def quantile(xs: List[float], p: float) -> float:
    """Returns the pth-percentile value in xs (p is a fraction in [0, 1))"""
    p_index = int(p * len(xs))
    return sorted(xs)[p_index]
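
A short usage sketch for `quantile`, restated here so the block runs on its own. The sample values are hypothetical.

```python
from typing import List

def quantile(xs: List[float], p: float) -> float:
    """Same definition as above: pick the element at index int(p * n)."""
    p_index = int(p * len(xs))
    return sorted(xs)[p_index]

data = [15, 3, 9, 1, 12, 6, 18, 21]   # sorted: 1, 3, 6, 9, 12, 15, 18, 21

q25 = quantile(data, 0.25)   # index 2 of the sorted list
q50 = quantile(data, 0.50)   # index 4
q75 = quantile(data, 0.75)   # index 6
print(q25, q50, q75)
```

Because this definition always returns an element of the data, `quantile(data, 0.5)` gives 12 here, while averaging the two middle values would give 10.5. Note also that `p = 1.0` would index past the end of the list and raise an `IndexError`.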

In [9]:
from collections import Counter

def mode(x: List[float]) -> List[float]:
    """ Returns a list, since there may be more than one mode """
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items()
            if count == max_count]
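
The *mode* is the most common value (or values). A minimal usage sketch, restating `mode` so it is self-contained; the input list is hypothetical.

```python
from collections import Counter
from typing import List

def mode(xs: List[float]) -> List[float]:
    """Returns every value that occurs the maximum number of times."""
    counts = Counter(xs)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items() if count == max_count]

# 2 and 3 each appear twice, so both are modes.
modes = mode([1, 2, 2, 3, 3, 4])
print(modes)
```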

### Dispersion

*Dispersion* refers to measures of how spread out our data is. 
* A simple measure is the *range*, which is just the difference between the max and the min
* A more complex measure is the *variance* which is computed as:

In [11]:
from ipynb.fs.full.ch4 import sum_of_squares

In [13]:
def de_mean(xs: List[float]) -> List[float]:
    """Translate xs by subtracting its mean (so the result has mean 0)"""
    x_bar = mean(xs)
    return [x - x_bar for x in xs]

In [14]:
def variance(xs: List[float]) -> float:
    """Almost the average squared deviation from the mean"""
    assert len(xs) >= 2, "variance requires at least two elements"
    
    n = len(xs)
    deviations = de_mean(xs)
    return sum_of_squares(deviations) / (n - 1)

This is almost the average squared deviation from the mean, except that we divide by **n - 1** instead of **n**. When we're dealing with a sample from a larger population, **x_bar** is only an *estimate* of the actual mean, so on average **(x_i - x_bar)^2** underestimates **x_i**'s squared deviation from the true mean. Dividing by **n - 1** instead of **n** corrects for this (wikipedia: unbiased estimation of standard deviation).
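
To see the two divisors side by side, the sketch below computes both versions on a hypothetical data set and cross-checks them against Python's standard `statistics` module, where `statistics.variance` divides by n - 1 and `statistics.pvariance` divides by n.

```python
import math
import statistics
from typing import List

def variance(xs: List[float]) -> float:
    """Sample variance as defined above (divide by n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical; mean is 5
n = len(data)

sample_var = variance(data)                  # sum of squares 32, divided by 7
population_var = sample_var * (n - 1) / n    # same sum of squares, divided by 8

print(math.isclose(sample_var, statistics.variance(data)))
print(math.isclose(population_var, statistics.pvariance(data)))
```

The n - 1 version is always slightly larger, and the gap shrinks as n grows.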

* the *variance* is in units squared, which can be hard to interpret
* for that reason, we often use the *standard deviation*

In [15]:
import math

def standard_deviation(xs: List[float]) -> float:
    """The standard deviation is the square root of the variance"""
    return math.sqrt(variance(xs))

* an alternative measure more robust to extreme outliers is the **interquartile range**

In [16]:
def interquartile_range(xs: List[float]) -> float:
    """Returns the difference between the 75%-ile and the 25%-ile"""
    return quantile(xs, 0.75) - quantile(xs, 0.25)
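
A sketch of why the interquartile range is more robust than the range. The helper `data_range` and both data sets are hypothetical; `quantile` and `interquartile_range` restate the definitions above.

```python
from typing import List

def quantile(xs: List[float], p: float) -> float:
    return sorted(xs)[int(p * len(xs))]

def interquartile_range(xs: List[float]) -> float:
    return quantile(xs, 0.75) - quantile(xs, 0.25)

def data_range(xs: List[float]) -> float:
    """Hypothetical helper: max minus min."""
    return max(xs) - min(xs)

clean = [1, 2, 3, 4, 5, 6, 7, 8]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 800]   # one extreme value

# The range is driven entirely by the extreme point...
print(data_range(clean), data_range(with_outlier))
# ...while the interquartile range ignores it.
print(interquartile_range(clean), interquartile_range(with_outlier))
```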

### Correlation

*Covariance* is the paired analogue of variance. Whereas variance measures how a single variable deviates from its mean, covariance measures how two variables vary in tandem from their means:

In [18]:
from ipynb.fs.full.ch4 import dot

In [19]:
def covariance(xs: List[float], ys: List[float]) -> float:
    assert len(xs) == len(ys), "xs and ys must have the same number of elements"
    
    return dot(de_mean(xs), de_mean(ys)) / (len(xs) - 1)

Recall that **dot** sums up the products of corresponding pairs of elements. When corresponding elements of **x** and **y** are either both above their means or both below their means, a positive number enters the sum. When one is above its mean and the other below, a negative number enters the sum. Accordingly, a "large" positive covariance means that **x** tends to be large when **y** is large and small when **y** is small. A "large" negative covariance means the opposite. A covariance close to zero means that no such relationship exists.
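
The sign behaviour can be illustrated with a self-contained sketch. It inlines the dot product instead of importing **dot**, and the three lists are hypothetical.

```python
import math
from typing import List

def de_mean(xs: List[float]) -> List[float]:
    x_bar = sum(xs) / len(xs)
    return [x - x_bar for x in xs]

def covariance(xs: List[float], ys: List[float]) -> float:
    assert len(xs) == len(ys), "xs and ys must have the same number of elements"
    # Inline dot product of the two de-meaned lists.
    return sum(dx * dy for dx, dy in zip(de_mean(xs), de_mean(ys))) / (len(xs) - 1)

xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 5, 4, 5]      # tends to rise with xs
ys_down = [5, 4, 5, 4, 2]    # tends to fall as xs rises

cov_up = covariance(xs, ys_up)       # positive
cov_down = covariance(xs, ys_down)   # negative
print(cov_up, cov_down)
```

Here the two covariances have the same magnitude and opposite signs, because `ys_down` is just `ys_up` reversed.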

This number can be hard to interpret for a couple of reasons:
* its units are the product of the inputs' units, which can be hard to make sense of
* it scales with the data: if each user had twice as many friends (with everything else unchanged), the covariance would be twice as large, even though the variables would be just as interrelated

For this reason, it's more common to look at the *correlation*, which divides out the standard deviations of both variables:

In [20]:
def correlation(xs: List[float], ys: List[float]) -> float:
    """Measures how much xs and ys vary in tandem about their means"""
    stdev_x = standard_deviation(xs)
    stdev_y = standard_deviation(ys)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / stdev_x / stdev_y
    else:
        return 0  # if there is no variation, correlation is zero

* the correlation is unitless and always lies between -1 (perfect anticorrelation) and 1 (perfect correlation)
* a number like 0.25 represents a relatively weak positive correlation
* correlations can be very sensitive to outliers; if you remove them, you may find a stronger correlation
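
A sketch of these properties on made-up data. The names `friends` and `minutes` are hypothetical, and the helpers compactly restate this section's definitions (the variance of `xs` is computed as `covariance(xs, xs)`).

```python
import math
from typing import List

def covariance(xs: List[float], ys: List[float]) -> float:
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs: List[float], ys: List[float]) -> float:
    stdev_x = math.sqrt(covariance(xs, xs))   # covariance(xs, xs) is the variance
    stdev_y = math.sqrt(covariance(ys, ys))
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / stdev_x / stdev_y
    return 0.0

friends = [1, 2, 3, 4, 5]    # hypothetical values
minutes = [2, 4, 5, 4, 5]
doubled = [2 * f for f in friends]

# Doubling one variable doubles the covariance but leaves the correlation alone.
print(covariance(friends, minutes), covariance(doubled, minutes))
print(math.isclose(correlation(friends, minutes), correlation(doubled, minutes)))

# A single extreme point can change the picture completely.
corr_clean = correlation(friends, minutes)
corr_outlier = correlation(friends + [20], minutes + [0])
print(corr_clean > 0, corr_outlier < 0)
```

In this example one outlier is enough to flip a positive correlation to a negative one.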

### Simpson's Paradox

* *Simpson's Paradox* says correlations can be misleading when *confounding* variables are ignored
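* correlation measures the relationship between two variables *all else being equal*; if some other variable differs systematically between groups of data points, the overall correlation can differ from, or even reverse, the correlation within each group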
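
A minimal sketch of the effect with entirely hypothetical data: within each of two groups x and y are perfectly positively correlated, yet pooling the groups gives a strongly negative correlation, because group membership (the confounder) shifts both variables.

```python
import math
from typing import List

def covariance(xs: List[float], ys: List[float]) -> float:
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs: List[float], ys: List[float]) -> float:
    stdev_x = math.sqrt(covariance(xs, xs))
    stdev_y = math.sqrt(covariance(ys, ys))
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / stdev_x / stdev_y
    return 0.0

# Hypothetical groups: group A has low x and high y, group B the reverse.
group_a_x, group_a_y = [1, 2, 3], [10, 11, 12]
group_b_x, group_b_y = [11, 12, 13], [1, 2, 3]

corr_a = correlation(group_a_x, group_a_y)          # positive within group A
corr_b = correlation(group_b_x, group_b_y)          # positive within group B
corr_pooled = correlation(group_a_x + group_b_x,
                          group_a_y + group_b_y)    # negative once pooled
print(corr_a, corr_b, corr_pooled)
```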

### Other Correlational Caveats

* a correlation of zero indicates that there is no linear relationship between the two variables
* however, there may be other sorts of relationships. For example, if:

In [23]:
x = [-2, -1, 0, 1, 2]
y = [2, 1, 0, 1, 2]

* then x and y have zero correlation
* but they certainly have a relationship: **y = abs(x)**

* in addition, correlation tells you nothing about how large the relationship is. The variables:

In [24]:
x = [-2, -1, 0, 1, 2]
y = [99.98, 99.99, 100, 100.01, 100.02]

* are perfectly correlated, but (depending on what you're measuring) it's possible this relationship isn't very interesting
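
Both caveats in this section can be checked numerically. The sketch below compactly restates `covariance` and `correlation` so it runs on its own; the variable names `y_abs` and `y_flat` are hypothetical labels for the two `y` lists above.

```python
import math
from typing import List

def covariance(xs: List[float], ys: List[float]) -> float:
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs: List[float], ys: List[float]) -> float:
    stdev_x = math.sqrt(covariance(xs, xs))
    stdev_y = math.sqrt(covariance(ys, ys))
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / stdev_x / stdev_y
    return 0.0

x = [-2, -1, 0, 1, 2]
y_abs = [2, 1, 0, 1, 2]                          # y = abs(x)
y_flat = [99.98, 99.99, 100, 100.01, 100.02]     # barely moves with x

corr_abs = correlation(x, y_abs)     # zero, up to floating-point rounding
corr_flat = correlation(x, y_flat)   # one, up to floating-point rounding
print(corr_abs, corr_flat)
```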

### Correlation and Causation

If **x** and **y** are strongly correlated, this might mean:
* **x** causes **y**
* **y** causes **x**
* each causes the other
* some other third factor causes both
* nothing at all (the correlation is a coincidence)