# Summary statistics (and the lies they tell)

It's hard to mentally analyze (let alone communicate) large groups of numbers, so we often summarize them mathematically. You're probably already familiar with the concepts of the mean (average) and median of a set of numbers; these are two "summary statistics", each of which summarizes a collection of numbers with a single value.

Computing these with NumPy is easy, using the appropriately-named functions `mean()` and `median()`.

In [None]:
import numpy as np
x = np.array([1, 2, 5, 7, 20])

mean = np.mean(x)
median = np.median(x)

print("Mean: {}".format(mean))
print("Median: {}".format(median))

Standard deviation can be a little trickier. Intuitively, the standard deviation measures how far the data points typically lie from the mean. A low standard deviation means that most of the data points are clustered around the mean, while a large standard deviation means that the values are more spread out.
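As a quick illustration, here is a minimal sketch with two made-up arrays (they are not part of the lecture data). Both have the same mean, but their standard deviations differ by a factor of ten. It uses NumPy's `std()` function, which is introduced properly below.

```python
import numpy as np

# Two hypothetical datasets with the same mean (10) but different spread
tight = np.array([9, 10, 11])
spread = np.array([0, 10, 20])

print("Means:", np.mean(tight), np.mean(spread))
print("Standard deviations:", np.std(tight), np.std(spread))
```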

Mathematically, we:
* compute the difference between each value and the mean
* square these differences (this gets rid of negative values and has nicer mathematical properties than taking the absolute value)
* average the squared values together
* take the square root of the average.
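Written as a formula, the steps above look like this. The symbols are introduced here only for illustration: $x_1, \dots, x_N$ are the values, $\mu$ is their mean, and $\sigma$ is the standard deviation.

$$
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(x_i - \mu\right)^2}
$$

This is the "population" form, which divides by $N$; it is what `np.std()` computes with its default arguments.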

The NumPy function `std()` computes the standard deviation:

In [None]:
x = np.array([1, 2, 5, 7, 20])

stddev = np.std(x)
print("Standard deviation: {}".format(stddev))

# Here's the mathematical way to calculate the standard deviation, just for fun:
manual_stddev = np.sqrt(np.mean((x - np.mean(x)) ** 2))
print(manual_stddev)

## Summary stats on "real" data

We've prepared a JSON file containing several sets of X-Y data points.  For each of the data sets, compute the **mean** and the **standard deviation** in both X and Y.  Make some hypotheses about these datasets --- how similar are they?

In [None]:
# Run this cell to (re)load the data from the JSON file
import json
with open('lecture15_data.json') as f:  # The "with" block closes the file for us
    data = json.load(f)

print(data.keys()) # The top-level dictionary has four keys corresponding to the four datasets
print(data['set0'].keys()) # Each sub-dictionary has keys 'x' and 'y' corresponding to the x and y coordinates

In [None]:
# Your code here...
# Compute the mean and standard deviation for the X and Y coordinates of all 4 datasets

Now try plotting the data using matplotlib's `scatter()` function. It takes its data the same way as the `plot()` function you're used to --- you just pass in arrays of X and Y coordinates.
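If you haven't used `scatter()` before, here is a minimal sketch of the call. The points are made up for illustration (they are not from the JSON file), and the axis labels are optional.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical points for illustration only (not from the JSON file)
xs = np.array([1, 2, 3, 4, 5])
ys = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# One marker per (x, y) pair, with no connecting line
points = plt.scatter(xs, ys)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```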

In [None]:
import matplotlib.pyplot as plt

# Your code here...
# Create a scatter plot of each of the datasets

# Histograms

Hopefully the previous exercise illustrated the importance of visualizing your data instead of relying on summary statistics!  However, sometimes we have too much data for a scatter plot, or perhaps it's 1-dimensional data that isn't really appropriate for a scatter plot.
This is where histograms shine: telling a more complete story than a simple summary statistic, but not overwhelming viewers with all of the data.

Suppose we have the following four sets of grades on an exam:

In [None]:
grades1 = [98, 90, 12, 92, 90, 91, 93, 99, 93, 93, 99, 94]
grades2 = [87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87]
grades3 = [81, 82, 83, 91, 92, 93, 81, 82, 83, 91, 92, 93]
grades4 = [98, 76, 74, 99, 99, 73, 76, 98, 77, 97, 99, 78]

We could compute summary statistics on them, but it turns out that isn't very helpful:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

print("Average grade on 1 is", np.mean(grades1))
print("Average grade on 2 is", np.mean(grades2))
print("Average grade on 3 is", np.mean(grades3))
print("Average grade on 4 is", np.mean(grades4))

It would be nice to get more detail.  The basic idea of a histogram is to create "bins" --- ranges of data --- and then count how many samples fall into each bin.

For example, we could create bins corresponding to letter grades, and count how many scores fall into each:

    A (90-100) | 11
    B (80-89)  | 0
    C (70-79)  | 0
    D (60-69)  | 0
    F (< 60)   | 1
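
A tally like this doesn't have to be done by hand. As a sketch, NumPy's `histogram()` function (not used elsewhere in this notebook) counts how many values of `grades1` fall between each pair of bin edges. The letter labels are added here just for printing.

```python
import numpy as np

grades1 = [98, 90, 12, 92, 90, 91, 93, 99, 93, 93, 99, 94]

# Bin edges for F, D, C, B, A. Each bin includes its left edge and excludes
# its right edge, except the last bin, which also includes 100.
edges = [0, 60, 70, 80, 90, 100]
counts, _ = np.histogram(grades1, bins=edges)

for letter, count in zip(["F", "D", "C", "B", "A"], counts):
    print(letter, count)
```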
    
If we plot these counts as a bar graph, it's called a histogram. The X-axis shows the bins (ranges of values), and the Y-axis shows the number of samples that fall into each bin.

In [None]:
plt.hist(grades1)
plt.show()

You'll notice that this didn't come out exactly the same way as our manual count above.  This is because matplotlib is picking the bins for us automatically, and it chose differently than we did.  We can fix this by specifying the bins manually:

In [None]:
bins = [0, 60, 70, 80, 90, 100] # This specifies 5 bins: 0-60, 60-70, 70-80, 80-90, 90-100
plt.hist(grades1, bins)
plt.show()

But the wide 0-60 bar gives the impression that we had 1 student each at 10, 20, 30, 40 and 50, when in reality only one student scored below 60. There are several ways to deal with this, but one is just to label our bins more carefully using the `xticks` function.

In [None]:
plt.hist(grades1, bins)
plt.xticks([30, 65, 75, 85, 95], ['F', 'D', 'C', 'B', 'A'])
plt.show()

Try making histograms with the other grades.

*Challenge*: Figure out how to make the bins all the same width, so as to keep a consistent data-ink ratio. You may want to look at the return values from `hist()` and the `bar()` function.

In [None]:
# Your code here

Instead of specifying the bins manually, we can simply specify the number of bins.

Suppose hundreds of students take a really hard exam, producing the grades in `biggrades`.  We can draw this with 5 evenly spaced bins, simply by passing `5` as the second argument to `hist()`.

* Take some time to play around with different numbers of bins.  Try small numbers (like 2) and really big numbers (like 500).  Is there a number that feels "about right"?

* When you use a number of bins greater than about 50, some bins suddenly get far larger than others.  Why do you think this is?  How might you avoid it?  (The effect becomes more obvious if you increase the number of students to 5000 or more.)

*Challenge*: What sort of methods could you use to automatically select a "good" number of histogram bins?  See [this article](https://docs.astropy.org/en/stable/visualization/histogram.html) for some discussion and examples.
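As a starting point for the challenge, NumPy ships with several bin-selection rules. The sketch below asks `np.histogram_bin_edges()` how many bins a few of them would choose for the same simulated grades that the cell below generates. Which rule is "best" depends on the data, so treat the results as suggestions rather than answers.

```python
import numpy as np

# Same simulated exam data as in the cell below
np.random.seed(0)
N = 500
biggrades = np.floor(np.maximum(0, np.random.randn(N) * 20 + 50))

# Ask NumPy how many bins each built-in rule would pick
n_bins = {}
for rule in ["sqrt", "sturges", "fd", "auto"]:
    edges = np.histogram_bin_edges(biggrades, bins=rule)
    n_bins[rule] = len(edges) - 1
    print(rule, "->", n_bins[rule], "bins")
```

matplotlib's `hist()` accepts the same rule names, so `plt.hist(biggrades, bins='auto')` lets you see the result directly.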

In [None]:
np.random.seed(0) # Seed the random number generator so we get the same "random" numbers every time
N = 500 # Number of students
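# Scores drawn from a normal distribution (mean 50, std 20), with negative scores raised to 0, then rounded down
biggrades = np.floor(np.maximum(0, np.random.randn(N) * 20 + 50))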
plt.hist(biggrades, 5)
plt.show()

(Side note: I once took an exam that had a distribution like this... and I was well below the mean.)