In [None]:
#: the usual imports
import babypandas as bpd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import warnings; warnings.simplefilter('ignore')

plt.style.use('fivethirtyeight')

# Lecture 21

## Center and Spread

## Questions 
* How can we quantify natural concepts like “center” and “variability”?
* Why do many of the empirical distributions that we generate come out bell shaped?
* How is sample size related to the accuracy of an estimate?

## The Average (or Mean)

Given Data: $2, 3, 3, 9$, the average (or mean) is:
$$\text{Average} = \frac{2 + 3 + 3 + 9}{4} = 4.25$$

## The Average (or Mean)

* Need not be a value in the collection
* Need not be an integer even if the data are integers
* Somewhere between min and max, but not necessarily halfway in between
* Same units as the data.
* Smoothing operator: collect all the contributions in one big pot, then split evenly
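
The properties above can be checked on the example data $2, 3, 3, 9$. This is a minimal sketch using NumPy; the variable names are illustrative and are not part of the lecture notebook.

```python
import numpy as np

# Example data from the slide above
data = np.array([2, 3, 3, 9])

# The mean: pool all the values, then split the total evenly
average = data.sum() / len(data)

# The mean need not be one of the values in the collection
in_collection = average in data

# The mean lies between min and max, but not necessarily halfway between
halfway = (data.min() + data.max()) / 2

print(average, in_collection, halfway)
```

Here the mean differs from the halfway point between the smallest and largest values because the single large value 9 pulls it up only partway.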

## The Median

- A median is a number in the "middle" of the data.
- Sort the data, then pick the number in the middle.
    - If there are two middle numbers, pick either (or any number between).
- Example: Median(1, 4, 7, 12, 32) is 7.
- Example: Median(1, 4, 7, 12) is 4 or 7 or any number between, like 4.2
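
As a quick check of the two examples above, here is a sketch using `np.median`. When the number of values is even, NumPy reports the midpoint of the two middle values, which is one of the acceptable choices under the definition above.

```python
import numpy as np

odd = np.array([1, 4, 7, 12, 32])
even = np.array([1, 4, 7, 12])

median_odd = np.median(odd)    # single middle value
median_even = np.median(even)  # midpoint of the two middle values, 4 and 7

print(median_odd, median_even)
```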

### Example

Create a data set that has this histogram. (You can do it with a short list of whole numbers.) 

![image.png](attachment:image.png)

What are its median and mean?

In [None]:
bpd.DataFrame().assign(data=[1, 2, 2, 3, 3, 3, 4, 4, 5]).plot(kind='hist', bins=np.arange(.5, 12.5), density=True)

### Discussion Question

Are the medians of these two distributions the same or different? Are the means the same or different? If you say “different,” then which is bigger?

![image.png](attachment:image.png)

- A) All the same
- B) Means are different, medians are same
- C) Means are same, medians are different
- D) Everything is different

### Answer

In [None]:
#:
tbl1 = bpd.DataFrame().assign(data=[1, 2, 2, 3, 3, 3, 4, 4, 5])
tbl2 = bpd.DataFrame().assign(data=[1, 2, 2, 3, 3, 3, 4, 4, 10])

In [None]:
#:
print(
    'median #1:\t%f' % tbl1.get('data').median(),
    'median #2:\t%f' % tbl2.get('data').median(),
    'mean #1:\t%f' % tbl1.get('data').mean(),
    'mean #2:\t%f' % tbl2.get('data').mean(),
    sep='\n'
)

## Discussion Question

Which is bigger:

- A) The mean
- B) The median

In [None]:
delays = bpd.read_csv('data/delays.csv').get('Delay')
delays.plot(kind='hist', bins=np.arange(-.5, 210, 10), density=True)
plt.title('Flight Delays')
plt.xlabel('Delay (minutes)')

In [None]:
print('mean:\t%f' % delays.mean())
print('median:\t%f' % delays.median())
delays.plot(kind='hist', bins=np.arange(-.5, 210, 10), density=True)
plt.title('Flight Delays')
plt.xlabel('Delay (minutes)')

## Comparing Mean and Median
* Mean: Balance point of the histogram
* Median: Halfway point of the data; half the area of the histogram lies on each side of the median
* If the distribution is symmetric about a value, then that value is both the average and the median.
* If the histogram is skewed, then the mean is pulled away from the median in the direction of the tail.
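
A small sketch of the last two bullets, using made-up data (not taken from the lecture's data files): a symmetric list, a copy with its largest value dragged out to form a right tail, and a copy with its smallest value dragged out to form a left tail.

```python
import numpy as np

symmetric = np.array([1, 2, 3, 4, 5, 6, 7])
right_skewed = np.array([1, 2, 3, 4, 5, 6, 28])   # long tail to the right
left_skewed = np.array([-20, 2, 3, 4, 5, 6, 7])   # long tail to the left

for name, values in [('symmetric', symmetric),
                     ('right-skewed', right_skewed),
                     ('left-skewed', left_skewed)]:
    print(name, 'mean:', np.mean(values), 'median:', np.median(values))
```

The median stays at 4 in all three lists, while the mean moves toward whichever side has the tail.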

## The Mean vs the Median

- The median is more **robust** (less **sensitive**) to **outliers**.

## Example

- Suppose we have the net worth of all UCSD students (simulated here with a lognormal distribution):

In [None]:
worths = np.random.lognormal(5.5, 1.25, 20_000)

In [None]:
worths.mean()

In [None]:
np.median(worths)

In [None]:
plt.hist(worths, bins=np.arange(0, 2_000, 100))

## Example

- Now Jeff Bezos (net worth: 127 billion dollars) enrolls as a Data Science major

In [None]:
new_worths = np.append(worths, 127e9)

In [None]:
np.mean(new_worths)

In [None]:
np.median(new_worths)

# Standard Deviation

## Measuring Width

- How **wide** is the distribution?

**Plan A:** “biggest value - smallest value” (the **range**)
* Doesn’t tell us much about the shape of the distribution

**Plan B:** "the standard deviation"
* Measures variability around the mean
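
To illustrate why the range alone says little about shape, here is a sketch with two made-up lists that have the same range but different standard deviations. The data and variable names are hypothetical; `np.std` computes the SD.

```python
import numpy as np

clustered = np.array([0, 5, 5, 5, 10])     # most values sit at the mean
spread_out = np.array([0, 0, 5, 10, 10])   # most values sit at the extremes

# Plan A: the range is identical for both lists
range_clustered = clustered.max() - clustered.min()
range_spread_out = spread_out.max() - spread_out.min()

# Plan B: the SD distinguishes them
sd_clustered = np.std(clustered)
sd_spread_out = np.std(spread_out)

print(range_clustered, range_spread_out)
print(sd_clustered, sd_spread_out)
```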

### Deviations from the mean

In [None]:
data = np.array([2, 3, 3, 9])

In [None]:
mean = ...
mean

In [None]:
deviations = ...
deviations

## The average deviation?
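
A sketch of why the plain average of the deviations does not work as a measure of spread: the positive and negative deviations cancel, so their average is zero. The data are the same four values used above; the variable names are illustrative.

```python
import numpy as np

data = np.array([2, 3, 3, 9])

deviations = data - data.mean()         # signed distance of each value from the mean
average_deviation = deviations.mean()   # positives and negatives cancel

print(deviations, average_deviation)
```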

## Average **squared** deviation

a.k.a. the **variance**

In [None]:
# square all the deviations

In [None]:
variance = ...
variance

## Standard deviation

- Suppose our data has units (say, minutes)
- Variance has units of minutes^2
- Correction: take square root

In [None]:
# Standard Deviation (SD) is the square root of the variance
sd = ...
sd

## Standard Deviation
* numpy function: `np.std`
* Standard deviation (SD) measures roughly how far the data are from their average
* SD has the same units as the data

In [None]:
np.std(data)
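
Putting the steps together, here is a minimal sketch of the full computation for the four data values, compared with `np.std`. By default `np.std` divides by the number of values (`ddof=0`), which matches the "average squared deviation" definition used here.

```python
import numpy as np

data = np.array([2, 3, 3, 9])

mean = data.mean()                      # center of the data
deviations = data - mean                # deviations from the mean
squared_deviations = deviations ** 2    # squaring removes the signs
variance = squared_deviations.mean()    # average squared deviation
sd = np.sqrt(variance)                  # square root restores the original units

print(variance, sd, np.std(data))
```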

## Why use the SD?

No matter what the shape of the distribution, the bulk of the data are in the range “average ± a few SDs”

## Chebyshev’s Inequality

No matter what the shape of the distribution, the proportion of values in the range “average ± z SDs” is at least 

$$1 - \frac{1}{z^2}$$

## Chebyshev's Bounds

|Range|Proportion|
|---|---|
|average ± 2 SDs|at least 1 - 1/4 (75%)|
|average ± 3 SDs|at least 1 - 1/9 (88.888…%)|
|average ± 4 SDs|at least 1 - 1/16 (93.75%)|
|average ± 5 SDs|at least 1 - 1/25 (96%)|

No matter what the distribution is!
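
The bounds in the table can be checked empirically. The sketch below uses a deliberately skewed, made-up sample (an exponential draw with a fixed seed) rather than any of the lecture's data sets; `chebyshev_bound` is a hypothetical helper name.

```python
import numpy as np

def chebyshev_bound(z):
    """Lower bound on the proportion of values within z SDs of the mean."""
    return 1 - 1 / z**2

# A skewed sample, nothing like a bell curve (fixed seed for reproducibility)
rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=10_000)

mean = sample.mean()
sd = np.std(sample)

for z in [2, 3, 4, 5]:
    # Proportion of values that fall in the range mean ± z SDs
    within = np.abs(sample - mean) <= z * sd
    print(z, chebyshev_bound(z), within.mean())
```

The observed proportions are typically well above the bounds: Chebyshev's inequality is a guarantee, not an estimate.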

## Example

In [None]:
nba = bpd.read_csv('data/nba2013.csv')
nba.plot(kind='hist', y='Height', density=True, bins=np.arange(69.5, 90, 1))
nba.get('Height').describe()

In [None]:
mean = nba.get('Height').mean()
sd = np.std(nba.get('Height'))
print('mean:\t\t\t%f' % mean, 
      '\nstandard deviation:\t%f' % sd)

## Example

- Mean $\approx$ 79
- SD $\approx$ 3.5
- *At least* 75% of the data is in $[79 - 2\times 3.5,\, 79 + 2\times 3.5] = [72, 86]$
- *At least* 88% of the data is in $[79 - 3\times 3.5,\, 79 +3\times 3.5] = [68.5, 89.5]$
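
The interval arithmetic above can be written as a small function. This is a sketch using the rounded values from the slide; `chebyshev_interval` is a hypothetical helper, not part of the lecture code.

```python
def chebyshev_interval(mean, sd, z):
    """Return the interval mean ± z SDs as a (low, high) pair."""
    return (mean - z * sd, mean + z * sd)

# Rounded values from the slide above
approx_mean = 79
approx_sd = 3.5

two_sd = chebyshev_interval(approx_mean, approx_sd, 2)     # holds at least 75% of the data
three_sd = chebyshev_interval(approx_mean, approx_sd, 3)   # holds at least 1 - 1/9 of the data

print(two_sd, three_sd)
```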

In [None]:
nba.plot(kind='hist', y='Height', density=True, bins=np.arange(69.5, 90, 1))
plt.plot([mean-2*sd, mean+2*sd], [0, 0], color='lime', linewidth=10)

## Chebyshev's Inequality

- Chebyshev works no matter the distribution
- But if we know the type of the distribution (e.g., Normal), we can say more!

# The Normal Curve (Bell Curve)

In [None]:
#: data set of height/weight of 5000 adult males
height_and_weight = bpd.read_csv('data/height_and_weight.csv')
height_and_weight

## Distribution of heights

In [None]:
#: height histogram
defaults = dict(bins=20, linewidth=0, density=True, alpha=.8)
plt.hist(height_and_weight.get('Height'), **defaults);

## Distribution of weights

In [None]:
#: weight histogram
plt.hist(height_and_weight.get('Weight'), color='C1', **defaults);

## A familiar shape

- We've seen this bell-like shape frequently.
- These bells are different in two key aspects: center and spread.

In [None]:
#: draw histograms on same scale
defaults = dict(bins=20, linewidth=0, density=True, alpha=.8)
plt.hist(height_and_weight.get('Height'), label='Height', **defaults);
plt.hist(height_and_weight.get('Weight'), label='Weight', **defaults);
plt.legend(loc='upper right')

## Centering

- Subtracting the mean centers the distribution at zero
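
A minimal sketch of centering on a tiny made-up array (hypothetical heights, not read from `height_and_weight.csv`): the mean moves to zero while the spread stays the same.

```python
import numpy as np

# Hypothetical heights in inches
heights = np.array([64.0, 66.0, 68.0, 70.0, 72.0])

centered = heights - heights.mean()

print(centered.mean())                     # the center moves to zero
print(np.std(heights), np.std(centered))   # the spread is unchanged
```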

In [None]:
#: compute mean height
mean_height = height_and_weight.get('Height').mean()
mean_height

In [None]:
#: compute mean weight
mean_weight = height_and_weight.get('Weight').mean()
mean_weight

In [None]:
#: insert them into table
centered_height_and_weight = height_and_weight.assign(
    Height=height_and_weight.get('Height') - mean_height,
    Weight=height_and_weight.get('Weight') - mean_weight,
)

## Centering

In [None]:
#: plot centered distributions
plt.hist(centered_height_and_weight.get('Height'), label='Height', **defaults);
plt.hist(centered_height_and_weight.get('Weight'), label='Weight', **defaults);
plt.legend(loc='upper right')

## Scaling

- Want distributions to have the same width.
- So we divide by standard deviation.
- Data that is centered and scaled is *standardized*.
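
Centering and scaling can be combined into one step. The sketch below uses a hypothetical `standardize` helper and made-up measurements on very different scales; after standardizing, both arrays have mean 0 and SD 1.

```python
import numpy as np

def standardize(values):
    """Convert an array to standard units: subtract the mean, divide by the SD."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / np.std(values)

# Hypothetical measurements (not from height_and_weight.csv)
heights_in = np.array([64.0, 66.0, 68.0, 70.0, 72.0])         # inches
weights_lb = np.array([120.0, 150.0, 180.0, 210.0, 240.0])    # pounds

z_heights = standardize(heights_in)
z_weights = standardize(weights_lb)

print(z_heights.mean(), np.std(z_heights))
print(z_weights.mean(), np.std(z_weights))
```

Because standardized values are unitless, histograms of the two variables can be drawn on the same axes and compared directly.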

In [None]:
import pandas as pd

In [None]:
std_height = np.std(height_and_weight.get('Height'))
std_height

In [None]:
std_weight = np.std(height_and_weight.get('Weight'))
std_weight

In [None]:
standardized = centered_height_and_weight.assign(
    Height=centered_height_and_weight.get('Height') / std_height,
    Weight=centered_height_and_weight.get('Weight') / std_weight
)

## Standardized Histograms

In [None]:
plt.hist(standardized.get('Height'), label='Height', **defaults);
plt.hist(standardized.get('Weight'), label='Weight', **defaults);
plt.legend(loc='upper right')

## The (standard) normal curve

- The bell curves we've seen look essentially the same once standardized.
- This shape is called the **standard normal curve**.

$$
\phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}z^2}
$$


## The standard normal curve

In [None]:
# define normal_curve using numpy
def normal_curve(x):
    return 1 / np.sqrt(2*np.pi) * np.exp(-x**2/2)

In [None]:
#: plot the curve
x = np.linspace(-4, 4, 1000)
y = normal_curve(x)

plt.plot(x, y, color='black')
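
Two properties of the curve can be checked numerically: the total area under it is 1 (so it can be compared with density histograms), and it is symmetric about 0. This sketch repeats the `normal_curve` definition so that it runs on its own, and approximates the area with the trapezoid rule on the interval from -4 to 4.

```python
import numpy as np

def normal_curve(x):
    return 1 / np.sqrt(2 * np.pi) * np.exp(-x**2 / 2)

x = np.linspace(-4, 4, 1000)
y = normal_curve(x)

# Trapezoid rule: average of adjacent heights times the width of each strip
area = np.sum((y[:-1] + y[1:]) / 2 * np.diff(x))

# Height at the center of the curve, equal to 1 / sqrt(2 * pi)
peak = normal_curve(0)

# The left half of the curve mirrors the right half
is_symmetric = np.allclose(y, y[::-1])

print(area, peak, is_symmetric)
```

The area comes out slightly below 1 because the thin tails beyond ±4 are left out.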

## Heights/weights are approximately normal

In [None]:
#: plot against normal curve
plt.hist(standardized.get('Height'), label='Height', **defaults);
plt.hist(standardized.get('Weight'), label='Weight', **defaults);
plt.plot(x, y, color='black', linestyle='--', label='Normal')
plt.legend(loc='upper right')