# Describing Data with Statistics

With statistics we can study, describe, and better understand sets of data.

## The Mean

The mean is a common and intuitive way to summarize a set of numbers. It’s what we might simply call the __“average”__ in everyday use, although as we’ll see, there are other kinds of averages as well. Let’s take a sample set of numbers and calculate the mean.

We’ll write a program that calculates and prints the mean for a collection of numbers. To calculate the mean, we’ll need to take the sum of the list of numbers and divide it by the number of items in the list. We have two Python functions that make both of these operations very easy: sum() and len(). So our code will look like:

In [5]:
'''
Calculating the mean
'''

def calculate_mean(numbers):
    s = sum(numbers)
    N = len(numbers)

    mean = s/N
    return mean

donations = [100, 60, 70, 900, 100, 200, 500, 500, 503, 600, 1000, 1200]
mean = calculate_mean(donations)

print("The mean of our set of data is {:.2f}.".format(mean))

The mean of our set of data is 477.75.


The calculate_mean() function will calculate the sum and length of any list, so we can reuse it to calculate the mean for other sets of numbers, too.
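As a minimal sketch of that reuse, the block below applies the same function to a second list and cross-checks the donations result against `statistics.mean()` from the Python standard library. The `temperatures` list is a made-up data set used only for illustration.

```python
import statistics

def calculate_mean(numbers):
    # Same computation as above: the total divided by the number of items.
    return sum(numbers) / len(numbers)

donations = [100, 60, 70, 900, 100, 200, 500, 500, 503, 600, 1000, 1200]
# Hypothetical second data set, not part of the original example.
temperatures = [21.5, 23.0, 19.5, 22.0]

print(calculate_mean(donations))     # 477.75
print(calculate_mean(temperatures))  # 21.5
print(statistics.mean(donations))    # 477.75
```

Note that `calculate_mean()` raises a `ZeroDivisionError` when given an empty list, so callers may want to check for that case first.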

## The Median

The median of a collection of numbers is another kind of average. To find the median, we sort the numbers in ascending order. If the length of the list of numbers is odd, the number in the middle of the list is the median. If the length of the list of numbers is even, we get the median by taking the mean of the two middle numbers. Let’s find the median of the previous list of donations, and then of the same list with a 13th donation total added just for this example.

Before we write a program to find the median of a list of numbers, let’s think about how we could automatically calculate the middle elements of a list in either case. Counting positions from 1, if the length of a list ($N$) is odd, the middle number is the one in position $(N + 1)/2$. If $N$ is even, the two middle elements are in positions $N/2$ and $(N/2) + 1$. Because Python lists are indexed from 0, we subtract 1 from each position to get the list index.
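The position arithmetic can be sketched on its own before it is built into a median function. The helper name `middle_positions` is hypothetical. It returns positions counted from 1, which are then shifted by one to obtain zero-based list indices.

```python
def middle_positions(n):
    """Return the 1-based position(s) of the middle element(s) of a sorted list of length n."""
    if n % 2 == 1:
        # Odd length: a single middle element at (n + 1) / 2.
        return [(n + 1) // 2]
    # Even length: two middle elements at n / 2 and (n / 2) + 1.
    return [n // 2, n // 2 + 1]

for n in (12, 13):
    positions = middle_positions(n)
    indices = [p - 1 for p in positions]  # shift to zero-based list indices
    print(n, positions, indices)
# 12 [6, 7] [5, 6]
# 13 [7] [6]
```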

In order to write a function that calculates the median, we’ll also need to sort a list in ascending order. Luckily, the sort() method does just that, so our program will look like:

In [10]:
'''
Calculating the median
'''

def calculate_median(numbers):
    N = len(numbers)
    numbers.sort()

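    if (N % 2) == 0:
        # N is even: the middle elements are at positions N/2 and (N/2) + 1
        m1 = N/2
        m2 = (N/2) + 1

        # convert to integers and shift to zero-based list indices
        m1 = int(m1) - 1
        m2 = int(m2) - 1
        median = (numbers[m1] + numbers[m2])/2
    else:
        # N is odd: the middle element is at position (N + 1)/2
        m = (N+1)/2
        # convert to an integer and shift to a zero-based list index
        m = int(m) - 1
        median = numbers[m]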

    return median

donations = [100, 60, 70, 900, 100, 200, 500, 500, 503, 600, 1000, 1200]
donations2 = [100, 60, 70, 900, 100, 200, 500, 500, 503, 600, 1000, 1200, 300]

print('Median with 12 items: ', calculate_median(donations))
print('Median with 13 items: ', calculate_median(donations2))

Median with 12 items:  500.0
Median with 13 items:  500


As you can see, the mean (477.75) and the median (500) are pretty close in this particular list, but the median is a little higher.
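One way to sanity-check a hand-written median function is to compare it against `statistics.median()` from the standard library, as in the sketch below. The `scores` list is a hypothetical extra test case whose two middle values differ. That matters because in the donations list both middle values are 500, so a function that mistakenly averaged the same element twice would go unnoticed there.

```python
import statistics

donations = [100, 60, 70, 900, 100, 200, 500, 500, 503, 600, 1000, 1200]
donations2 = donations + [300]
# Hypothetical list whose two middle values (20 and 30) differ.
scores = [10, 40, 20, 30]

print(statistics.median(donations))   # 500.0
print(statistics.median(donations2))  # 500
print(statistics.median(scores))      # 25.0
```

Unlike a function that calls the list’s `sort()` method, `statistics.median()` works on a sorted copy and leaves the original list in its original order.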

## The Mode

Instead of finding the mean value or the median value of a set of numbers, what if you wanted to find the number that occurs most frequently? This number is called the __mode__. 
There’s no symbolic formula for calculating the mode—you simply count how many times each unique number occurs and find the one that occurs the most.
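The counting procedure just described can be sketched with a plain dictionary, before reaching for any library helper. The function name `count_occurrences` is hypothetical, and the printed dictionary order assumes Python 3.7 or later, where dictionaries preserve insertion order.

```python
def count_occurrences(numbers):
    """Map each distinct number to how many times it appears."""
    counts = {}
    for number in numbers:
        counts[number] = counts.get(number, 0) + 1
    return counts

simplelist = [4, 2, 1, 3, 4]
counts = count_occurrences(simplelist)
highest = max(counts.values())
# Keep every number whose count equals the highest count.
modes = [number for number, count in counts.items() if count == highest]

print(counts)  # {4: 2, 2: 1, 1: 1, 3: 1}
print(modes)   # [4]
```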

Finding the most common number in a data set can be thought of as a subproblem of finding an arbitrary number of most common numbers. For instance, instead of the single most common number, what if you wanted to know the five most common numbers? The most_common() method of the Counter class allows us to answer such questions easily.

In [11]:
from collections import Counter

simplelist = [4, 2, 1, 3, 4]
c = Counter(simplelist)
print(c.most_common())


[(4, 2), (2, 1), (1, 1), (3, 1)]


Each member of the list is a tuple. The first element of the first tuple is the number that occurs most frequently, and the second element is the number of times it occurs. The second, third, and fourth tuples contain the other numbers along with the count of the number of times they appear.
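Because each entry is a `(number, count)` pair, tuple unpacking gives readable access to both parts. The following is a small sketch. The variable names `value`, `count`, `top_value`, and `top_count` are chosen here for illustration.

```python
from collections import Counter

simplelist = [4, 2, 1, 3, 4]
c = Counter(simplelist)

# Unpack each (number, count) pair while looping over the result.
for value, count in c.most_common():
    print('{0} occurs {1} time(s)'.format(value, count))

# The first pair holds the most frequent number and its count.
top_value, top_count = c.most_common()[0]
print(top_value, top_count)  # 4 2
```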

When you call the most_common() method, you can also provide an argument telling it the number of most common elements you want it to return. For example, if we just wanted to find the most common element, we would call it with the argument 1:

In [12]:
print(c.most_common(1))

[(4, 2)]


The most_common() method returns both the numbers and the number of times they occur. What if we want only the numbers and we don’t care about the number of times they occur? Here’s how we can retrieve that information:

In [13]:
mode = c.most_common(1)
print(mode[0][0])

4


We’re ready to write a program that finds the mode for a list of numbers:

In [17]:
'''
Calculating the mode
'''

from collections import Counter

def calculate_mode(numbers):
    c = Counter(numbers)
    mode = c.most_common(1)
    return mode[0][0]

number_list = [7, 8, 9, 2, 10, 9, 9, 9, 9, 4, 5, 6, 1, 5, 6, 7, 8, 6, 1, 10]
mode = calculate_mode(number_list)

print('The mode of the list of numbers is {0}'.format(mode))

The mode of the list of numbers is 9


The calculate_mode() function finds and returns the mode of the numbers passed to it as a parameter.
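One property of this version is worth seeing in a sketch: when several values tie for the highest count, `most_common(1)` reports only one of them. On Python 3.7 and later, values with equal counts are ordered by first appearance, so the value returned depends on the order of the input. The `tied` list below is an illustrative example.

```python
from collections import Counter

def calculate_mode(numbers):
    # Single-mode version: returns only the first of the most common values.
    c = Counter(numbers)
    mode = c.most_common(1)
    return mode[0][0]

tied = [5, 5, 5, 4, 4, 4, 9, 1, 3]  # 5 and 4 both occur three times

print(calculate_mode(tied))                  # 5
print(calculate_mode(list(reversed(tied))))  # 4
```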

What if you have a set of data where two or more numbers occur the same maximum number of times? In such cases, the list of numbers is said to have multiple modes, and our program should find and print all the modes. The revised function returns a list of modes:

In [18]:
'''
Having multiple modes
'''
from collections import Counter

def calculate_mode(numbers):
    c = Counter(numbers)
    numbers_freq = c.most_common()
    max_count = numbers_freq[0][1]

    modes = []
    for num in numbers_freq:
        if num[1] == max_count:
            modes.append(num[0])

    return modes

number_list = [5, 5, 5, 4, 4, 4, 9, 1, 3]
modes = calculate_mode(number_list)
print('The mode(s) of the list of numbers are:')

for mode in modes:
    print(mode)
    

The mode(s) of the list of numbers are:
5
4
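
For comparison, Python 3.8 and later ship `statistics.multimode()` in the standard library, which also returns every mode as a list, in the order first encountered. The sketch below assumes that Python version and reuses the same data.

```python
import statistics

number_list = [5, 5, 5, 4, 4, 4, 9, 1, 3]

# multimode() returns all values that share the highest count.
print(statistics.multimode(number_list))   # [5, 4]
# With a single most common value, the list has one element.
print(statistics.multimode([4, 2, 1, 3, 4]))  # [4]
# An empty input yields an empty list rather than an error.
print(statistics.multimode([]))            # []
```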


## Topics to Dive Into

- [Frequency Table](frequency_table.ipynb)

- [The Dispersion](dispersion.ipynb)

---
[Main Page](../README.md)