## Descriptive Statistics

We often need to summarize large data sets using a few numbers. Descriptive statistics enable us to do that.

## Measures of location

You are probably familiar with the **mean**, or arithmetic average, used as a summary number. If we have $n$ numbers $x_1,x_2,\dots,x_n$, then the mean $\bar{x}$ is given by
\begin{equation}
\bar{x} = \frac{x_1+x_2+\dots+x_n}{n} = \frac{1}{n}\sum_{i=1}^n x_i
\end{equation}

The **median** is another measure of location. The word 'median' refers to the middle: we sort the numbers $x_1,x_2,\dots,x_n$ in ascending order and pick the middle value. Writing the sorted values as $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$ (positions counted from 1), the middle value is obvious if $n$ is odd: it is the $(\frac{n+1}{2})$-th sorted value. If $n$ is even, we take the average of the two values in the middle:
\begin{equation}
\text{median}=
    \begin{cases}
    x_{\left(\frac{n+1}{2}\right)} & \text{if}\ n\ \text{is odd}\\
    \frac{x_{\left(\frac{n}{2}\right)}+x_{\left(\frac{n}{2}+1\right)}}{2} & \text{if}\ n\ \text{is even}
    \end{cases}
\end{equation}
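
As a quick worked example (the numbers are made up for illustration), take the four values $8, 2, 5, 3$. Sorted, they read $2, 3, 5, 8$. Since $n=4$ is even, the median is the average of the 2nd and 3rd sorted values:
\begin{equation}
\text{median} = \frac{3+5}{2} = 4
\end{equation}
If we drop the value $8$, the sorted values are $2, 3, 5$ with $n=3$, and the median is the 2nd sorted value, namely $3$.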

A third measure of location for the data is the **mode**. The mode is the data value which occurs most frequently in the data. Clearly, there might be several values which occur with the highest frequency. In such cases, the mode is not unique. In the simple case when there is only one mode, the data are said to have a **unimodal** distribution.

In [36]:
data = [9,6,3,2,2,4,3,4,7,8,4]

n = len(data)
mean = sum(data)/n

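sdata = sorted(data)                  # sorted copy of the list, needed for the median

if n % 2 == 1:                        # '%' is the modulo (remainder) operator
    median = sdata[n // 2]            # middle element; Python lists are 0-indexed
else:
    # average of the two middle elements (0-based positions n/2 - 1 and n/2)
    median = (sdata[n // 2 - 1] + sdata[n // 2]) / 2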

print("sorted data", sdata)
print("mean: ", round(mean,2), "median: ",median)

sorted data [2, 2, 3, 3, 4, 4, 4, 6, 7, 8, 9]
mean:  4.73 median:  4
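
The sample data has an odd number of values, so the even branch of the median rule is never exercised above. The following is a minimal sketch that covers both cases with 0-based list indices and compares the result with `statistics.median` from Python's standard library. The helper name `median_of` and the shortened list `even_data` are assumptions introduced here for illustration; they are not part of the original notebook.

```python
import statistics

def median_of(values):
    """Return the median of a non-empty list of numbers (illustrative helper)."""
    s = sorted(values)
    n = len(s)
    mid = n // 2                       # 0-based index of the (upper) middle element
    if n % 2 == 1:
        return s[mid]                  # odd n: the single middle element
    return (s[mid - 1] + s[mid]) / 2   # even n: average of the two middle elements

odd_data = [9, 6, 3, 2, 2, 4, 3, 4, 7, 8, 4]   # the data used above, n = 11
even_data = odd_data[:6]                        # first six values only, n = 6

print(median_of(odd_data), statistics.median(odd_data))     # 4 4
print(median_of(even_data), statistics.median(even_data))   # 3.5 3.5
```

Note that the 1-based positions $\frac{n}{2}$ and $\frac{n}{2}+1$ from the formula become the list indices `n // 2 - 1` and `n // 2` in Python.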


In [47]:
# Mode values

# 1. Build a dictionary to record frequencies for each value in the data

frequency = {}
for x in data:
    if x not in frequency:              # Not encountered before
        frequency[x] = 1
    else:
        frequency[x] += 1
        
print("frequency: ",frequency)

# 2. Find the keys in the dict with the largest value

vmax = max(frequency.values())
modes = [k for k,v in frequency.items() if v==vmax]

print("modes: ",modes)

frequency:  {9: 1, 6: 1, 3: 2, 2: 2, 4: 3, 7: 1, 8: 1}
modes:  [4]
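
The same frequency table can also be built with `collections.Counter` from the standard library. The sketch below is one possible alternative to the hand-written loop, not a replacement for it. The helper name `modes_of` and the second data set `bimodal` are made up here to show the case where the mode is not unique. `statistics.multimode` is assumed to be available, which requires Python 3.8 or later.

```python
from collections import Counter
import statistics

data = [9, 6, 3, 2, 2, 4, 3, 4, 7, 8, 4]   # the data used above
bimodal = [1, 2, 2, 3, 3, 5]               # hypothetical data with two modes

def modes_of(values):
    """Return all values that occur with the highest frequency."""
    counts = Counter(values)               # maps each value to its frequency
    top = max(counts.values())             # the highest frequency
    return [v for v, c in counts.items() if c == top]

print(modes_of(data))                   # [4]
print(modes_of(bimodal))                # [2, 3]
print(statistics.multimode(bimodal))    # [2, 3]
```

Both approaches return a list, so unimodal data gives a one-element list and multimodal data gives every tied value.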


## Measures of spread