# 1. Mean, Median, and Mode Workbook

The mean, median, and mode of a dataset are all known as **_measures of central tendency_**. They are called this because each one summarizes a dataset with a single value that describes its center, or typical value, in a different way. You probably already know what the mean, median, and mode are, but in this notebook we will explore their meaningful differences and why we might choose one over another when doing exploratory data analysis.

As a review:

**mean**: the sum of all items divided by the total number of items in the dataset.

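**median**: the item in the exact center of the distribution when the data is placed in order. With an even number of items, it is the average of the two middle items.

**mode**: the item that occurs most frequently. A dataset can have more than one mode if several items tie for the highest count.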
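The three definitions above can be checked by hand before reaching for any library. The following is a minimal sketch in plain Python; the list `values` is made up for illustration and is not part of the notebook's data.

```python
from collections import Counter

# Hypothetical sample, chosen only to illustrate the definitions
values = [7, 1, 1, 8, 5, 2]

# mean: sum of all items divided by the number of items
mean = sum(values) / len(values)

# median: middle item of the sorted data, or the average of the
# two middle items when the number of items is even
ordered = sorted(values)
n = len(ordered)
mid = n // 2
median = ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

# mode: every item that reaches the highest count (ties are possible)
counts = Counter(values)
highest = max(counts.values())
modes = sorted(v for v, c in counts.items() if c == highest)

print(mean, median, modes)  # 4.0 3.5 [1]
```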

Let's start by importing `pandas` and creating a DataFrame with some sample data. 

In [None]:
# necessary imports and preamble
%matplotlib inline
import pandas as pd

# create sample data
data = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
                   [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

## Calculating central tendency measures

Calculating the mean, median, and mode using `pandas` is straightforward. Just call the following DataFrame methods:

In [None]:
print("\n----------- Calculate Mean -----------\n")
print(data.mean())

print("\n----------- Calculate Median -----------\n")
print(data.median())

print("\n----------- Calculate Mode -----------\n")
print(data.mode())

As you can see, the previous code displays the mean, median, and mode for every column. To calculate the measures for a single column only, select that column first:

In [None]:
print("\n----------- Calculate Mean of Apples -----------\n")
print(data['Apple'].mean())

print("\n----------- Calculate Median of Apples -----------\n")
print(data['Apple'].median())

print("\n----------- Calculate Mode of Apples -----------\n")
print(data['Apple'].mode())
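One detail worth knowing: `mean()` and `median()` on a column return a single number, but `mode()` returns a Series, because a column can have several values tied for most frequent. The sketch below uses a small made-up Series (`s` is an illustrative name, not part of the notebook's data) to show the tie case and one way to pull out a single value.

```python
import pandas as pd

# Hypothetical column in which 1 and 2 both occur twice
s = pd.Series([1, 1, 2, 2, 3])

modes = s.mode()          # a Series holding every tied mode, in sorted order
print(modes.tolist())     # [1, 2]

# If a single value is needed, one option is to take the first entry.
# This silently drops the other ties, so use it with care.
first_mode = modes.iloc[0]
print(first_mode)         # 1
```

Calling `mode()` on a whole DataFrame behaves similarly: columns with fewer modes than the others are padded with `NaN`.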

## Practice

Try displaying only the mean, median, and mode for the column labeled 'Pear' by editing the code block below:

In [None]:
print("\n----------- Calculate Mean of Pear -----------\n")
print()

print("\n----------- Calculate Median of Pear -----------\n")
print()

print("\n----------- Calculate Mode of Pear -----------\n")
print()

## Resistant Statistics

All three of these central tendency statistics are easy to calculate, but if you were to report just one of these, which should you choose? It depends. What is important to remember is that **the mean is very sensitive to extreme values, while the median is not.** The median is known as a **resistant statistic**.
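Here is a minimal sketch of what "resistant" means in practice. The salary figures and variable names are assumptions made up for illustration; they are not the values in the CSV file used in this section.

```python
import pandas as pd

# Hypothetical starting salaries without any extreme value
salaries = pd.Series([40_000, 42_000, 45_000, 47_000, 50_000])

# The same salaries plus one extreme value
with_outlier = pd.concat([salaries, pd.Series([5_000_000])], ignore_index=True)

print(salaries.mean(), salaries.median())          # 44800.0 45000.0
print(with_outlier.mean(), with_outlier.median())  # mean is about 870,667; median is 46000.0
```

A single extreme value moves the mean by more than 800,000, while the median shifts by only 1,000.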

To illustrate this, read in (made up) data from a .csv file that represents the first-year annual starting salary for geography majors from UNC-Chapel Hill. In the following code block, change the name of the file being read in to `'data/first_year_annual_salaries.csv'`.

In [None]:
# read in data from file. -- Change file name!
df = pd.read_csv('filename')

Now, calculate and display the mean and the median, and note the difference!

In [None]:
print("\n----------- Calculate Mean of Salary Data -----------\n")
print()

print("\n----------- Calculate Median of Salary Data -----------\n")
print()

Whoa, check out the difference! The mean should be far larger than the median. Why is this? We can explore the question by displaying the data directly and plotting a histogram.

In [None]:
# display the items in the dataset
print(df)

# display the histogram
df.hist()

Notice the outlier! Legend has it that the starting salary data for geography majors at UNC-Chapel Hill was skewed by a single graduate named Michael Jordan.

**The lesson**: If the median and the mean are extremely different, then it might be a warning sign of a skewed distribution with outliers.
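The same check can be applied to the fruit basket data from the start of this notebook. The sketch below rebuilds that DataFrame so it runs on its own, then puts the mean and median side by side. The `gap` column is an informal heuristic introduced here for illustration, not a formal test for skew.

```python
import pandas as pd

# Same sample data as at the top of the notebook
data = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
                     [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
                    columns=['Apple', 'Orange', 'Banana', 'Pear'],
                    index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                           'Basket5', 'Basket6'])

# One row per column of the original data: its mean, its median,
# and the difference between the two
summary = pd.DataFrame({'mean': data.mean(), 'median': data.median()})
summary['gap'] = summary['mean'] - summary['median']

print(summary)
```

The `Apple` column has the largest gap (mean 16.5, median 8.5), which traces back to the 55 apples in `Basket3`.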

This is why the median is called **resistant** while the mean is not. Resistance usually makes the median a better choice for exploratory data analysis, because you don't want outliers to give you a skewed perspective on the data. Of course, this does not mean that the mean is useless or that you should always use the median. If you care about outliers and are actually interested in the average of the whole dataset, rather than in the "average value of an average item", then the mean might make more sense.