# Summary Statistics

Summary statistics are values calculated from sample data that measure some characteristic of the data. Most people use the **average** of the data as the standard summary statistic; for instance, when receiving their exam scores, students will usually ask what the class average was for the exam. The average is also the summary statistic most commonly used by data scientists. Most statisticians use the term **sample mean** for this statistic and often refer to it as simply the **mean**, but this can create confusion with another type of mean for random data: the **ensemble mean**, which is also usually just called the **mean**, and which is introduced in Chapter ZZ.

Because of this ambiguity in the **mean**ing of the word **mean**, I will use the term *average* to refer to the value that is computed from data. 

When teaching a class on this subject, I usually ask the class: "What does the **average** or **sample mean** mean?" Here are some of the common answers:

1. The value that most of the data is centered around
2. The value that has the minimum distance from every value
3. The value most likely to occur
4. The value that divides the data into two sets of equal size

It turns out that none of these is a very accurate description of the average. To understand the average, we first have to understand that representing data by a summary statistic results in errors, and we can use these errors to help choose a "good" summary statistic.
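To see why answers such as "the most likely value" or "the value that divides the data in half" do not describe the average in general, it can help to try a small example. The sketch below uses a made-up set of six exam scores (the values and variable names are assumptions for illustration, not data from this chapter) and compares the average with the median and the most common value.

```python
import numpy as np
from collections import Counter

# Hypothetical exam scores, chosen only for illustration
scores = np.array([55, 60, 60, 65, 70, 100])

average = scores.mean()        # sum of the values divided by the number of values
median = np.median(scores)     # value that splits the sorted data into equal halves
mode = Counter(scores.tolist()).most_common(1)[0][0]   # most frequent value

# Count how many scores fall on each side of the average
below = int(np.sum(scores < average))
above = int(np.sum(scores > average))

print("average:", round(average, 2))
print("median: ", median)
print("mode:   ", mode)
print("scores below/above the average:", below, "/", above)
```

For this particular data, the single high score pulls the average above both the median and the most common value, and the average does not split the scores into two groups of equal size.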


Let $d_0, d_1, \ldots, d_{N-1}$ be our data points, and let $\nu$ be a summary statistic.

The *error* $e_i$ between data point $d_i$ and $\nu$ is simply $e_i= d_i-\nu$. Note that the error may be positive or negative.
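As a minimal sketch of this definition, the errors for a candidate value of $\nu$ can be computed for all data points at once with NumPy. The data values and the candidate `nu` below are hypothetical and chosen only to show that the errors can have either sign.

```python
import numpy as np

d = np.array([2.0, 4.0, 9.0])   # hypothetical data points d_0, d_1, d_2
nu = 4.0                        # hypothetical candidate summary statistic

errors = d - nu                 # e_i = d_i - nu, computed for every data point
print(errors.tolist())          # [-2.0, 0.0, 5.0]
```

Here one error is negative, one is zero, and one is positive, so simply adding the errors would let them partially cancel.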

Intuitively, we should try to choose $\nu$ to minimize the errors to the data. However, how to do this is not entirely clear because we 

<span style="color:red">JMS: Working here</span>


**Definition (average):** The average is the value of $\nu$ that minimizes the total squared error to the data, $\sum_{i=0}^{N-1} e_i^2 = \sum_{i=0}^{N-1} (d_i-\nu)^2$.
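This property can be checked in two ways. Analytically, setting the derivative of $\sum_{i=0}^{N-1} (d_i-\nu)^2$ with respect to $\nu$ equal to zero gives $-2\sum_{i=0}^{N-1} (d_i-\nu)=0$, so $\nu = \frac{1}{N}\sum_{i=0}^{N-1} d_i$. Numerically, the sketch below searches a grid of candidate values for the one with the smallest total squared error. The data, the helper name `total_squared_error`, and the grid spacing are all assumptions made for illustration.

```python
import numpy as np

d = np.array([2.0, 4.0, 9.0])   # hypothetical data points

def total_squared_error(nu, data):
    """Return the sum of squared errors (d_i - nu)^2 over all data points."""
    return np.sum((data - nu) ** 2)

# Candidate values of nu on a grid from the smallest to the largest data value
# (701 points gives a spacing of 0.01 for this data)
candidates = np.linspace(d.min(), d.max(), 701)
tse = np.array([total_squared_error(nu, d) for nu in candidates])

best = candidates[np.argmin(tse)]   # candidate with the smallest total squared error
print("grid minimizer:", round(best, 2))
print("average:       ", d.mean())
```

For this data, the grid minimizer agrees with the value returned by `d.mean()`, up to the grid spacing.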


Both pandas and NumPy provide methods to calculate the average:




In [85]:
import numpy as np
import pandas as pd
df=pd.read_csv(
 "http://wireless.ece.ufl.edu/jshea/idse/data/firearms-combined.csv")
rate2005=np.array(df["RATE-2005"])
rate2014=np.array(df["RATE-2014"])

In [11]:
df["RATE-2005"].mean()

10.81

In [12]:
df["RATE-2014"].mean()

11.440000000000003

In [13]:
rate2005.mean()

10.809999999999997

In [14]:
np.mean(rate2014)

11.44

The average of the 2014 data set is larger than that of the 2005 data set. This may indicate that the expiration of the federal assault weapons ban (in 2004) is associated with an increase in firearms mortality.

However, the difference is relatively small, as are the sample sizes (50).

In [15]:
diff=rate2014.mean()-rate2005.mean()
diff

0.6300000000000026