<a href="https://colab.research.google.com/github/vectrlab/apex-stats-modules/blob/main/Central_Tendency.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# APEX STATS Central Tendency
Module by David Schuster, based on code by Andy Qui Le

Licensed under CC BY-NC-SA

Release 1.0

<img src="https://www.publicdomainpictures.net/pictures/260000/velka/soccer-football-player-american.jpg" width="400"/>

Image credit: ["Soccer, Football Player, American"](https://www.publicdomainpictures.net/en/view-image.php?image=257368&picture=soccer-football-player-american) in the public domain

## I. Start Here
Welcome back! This module will review ways to summarize distributions using numbers, which are called measures of central tendency.

Along the way, we will show you how you can use Python to do statistics. You can work right in this notebook.

Arrows (➡) indicate something for you to do as you work through this notebook. The first thing you should do is to save a copy.

###➡ Save a Copy###
This notebook is "view only," meaning that you can view it, but you cannot save any changes. To create your own editable copy, look towards the top of the notebook and click on `Copy to Drive`. This will cause a new tab to open with your own personal copy of the notebook. 

If you want to refer back to your copy in the future, you can find it in Google Drive in a folder called `Colab Notebooks`. Specifically, your copy of this notebook will be called: "Copy of Central_Tendency.ipynb"

Once finished, share your completed notebook with your instructor using the *Share* button at the upper right.

###➡ No, Really, Save a Copy###

Hi again. We are double-checking that you saved a copy of this notebook before you continue. We want your work to be safely saved!

---

## II. Background

The following background is recommended before starting this module:

1. **Descriptive stats in your course.** This module will work best if you have already covered distributions and histograms in your course. You may want to review your notes or textbook before starting this module.

2. **Welcome to Colab** If you have not worked through the <a href="https://colab.research.google.com/drive/1zTk_n6BL8Tvdhaufq_FYNXK0dbvGKnxX?usp=sharing">Welcome to Colab notebook yet</a>, you should do that first.

### Learning Outcomes

These exercises map onto a learning objective in the C-ID descriptor for [Introduction to Statistics](https://c-id.net/descriptors/final/show/365). Upon successful completion of the course, you will be able to:  

* LO 3: Calculate measures of central tendency and variation for a given data set

Next, read through the activity and follow the steps indicated by the arrows.


## III. Activity

The next section of this module involves a series of hands-on activities that use data on real soccer players in the International Federation of Association Football (FIFA). The data are from FIFA 19, a soccer videogame.


### 1. Get data

Before you can begin these exercises, you need to run the code cell below, which will import the FIFA file and create a dataframe (i.e., spreadsheet) named `data`. You will preview the dataframe in the step after that. Take note that it contains several columns (these will be described in the exercise).

###➡ Run the cell###

To run the cell, click on it, and then you can either simultaneously hit `Shift` + `Enter`, or you can click the play button to the left of the cell.

After you click it, you should see the text "The data were loaded." If you see that, continue to the next section. If you come back to this notebook later, you will need to rerun this cell to load the data again.

If you see the text "There was a problem loading the data," then the most likely explanation is a bug that is our fault. Let your instructor know the notebook is not working properly.

In [None]:
# Setup Example Data

# import library
import pandas as pd

# read the datafile and handle data loading errors
# (the read must happen inside the try block so a failed download is caught)
try:
    data = pd.read_csv("https://raw.githubusercontent.com/vectrlab/apex-stats-datasets/main/fifa19/example.csv")
    print("The data were loaded")
except Exception:
    print("There was a problem loading the data.")

###➡ Run the cell###

Next, we will generate a quick preview of the data you just loaded. Run the cell that follows. To run the cell, click on it, and then you can either simultaneously hit `Shift` + `Enter`, or you can click the play button to the left of the cell.

After you run it, you should see a table with rows and columns. The rows are numbered starting with zero, and the columns are labeled y, x, x1, and so on. 


In [None]:
data

### 2. Explore the data

Now that you have seen a preview, it is time to explore the dataframe! We will be using the following format to refer to each variable: 
`name of dataframe["column name"]` 

Because our FIFA dataframe is named `data`, we will use notation like: `data["x7"]`

- `data["y"]`: Wage in thousands of Euros
- `data["x"]`: Age in years
- `data["x1"]`: Heading Accuracy (0-100, with higher numbers indicating more accuracy) 
- `data["x2"]`: Dribbling rating (0-100, with higher numbers indicating better dribbling)
- `data["x3"]`: Agility rating (0-100, with higher numbers indicating more agility)
- `data["x4"]`: Shot Power rating (0-100, with higher numbers indicating more shot power)
- `data["x5"]`: Jersey Number
- `data["x6"]`: Position (abbreviated)
- `data["x7"]`: Name
- `data["x8"]`: Club

It is very typical to have more variables in your dataframe than you plan to explore in a given sitting. In this module, we will focus on players' ages, so the only variable we will need for now is `x`. We can ignore the other variables for the moment.
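As a minimal sketch of the bracket notation, the cell below builds a tiny, made-up dataframe and selects one column from it. The name `toy` and all of its values are invented for illustration; they are not part of the FIFA data.

```python
import pandas as pd

# A tiny, made-up dataframe (hypothetical values, not the FIFA data)
toy = pd.DataFrame({
    "y": [50, 120, 8],   # wage in thousands of Euros
    "x": [24, 31, 19],   # age in years
})

# Select a single column using: name of dataframe["column name"]
ages = toy["x"]
print(ages.tolist())
```

The same pattern, `data["x"]`, pulls the age column out of the full FIFA dataframe.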

The collection of all the scores in variable `x` forms our **population distribution**, or collection of scores from all members of our population of interest. Here, our population is players who appeared in FIFA 19. 

### ➡ Paste the code and run the cell###
To focus specifically on values in `data["x"]`, copy and paste `data["x"]` into the code cell below. **Important!** Make sure that you use lowercase x and not uppercase X, and that you include the brackets and quote marks, as well.

The result will show you the first few players' ages (rows 0-4) as well as the last few players' ages (18202 - 18206). If you're curious, the first value displayed is for Lionel Messi, who was 31 in 2019.


### 3. Central tendency

Measures of central tendency are averages. They summarize the scores in the distribution in a single number. In statistics, there are different kinds of averages. If we wanted to give a single number to summarize the players' ages, what number would we choose? We want to find a middle or typical value. There are three measures to choose from: mode, median, and mean. 



### 4. Mode

The score with the highest frequency is called the **mode**. More than one score can have the highest frequency, so distributions with two modes are called **bimodal**, and distributions with more than two modes are called **multimodal**. These terms also apply to binned data. We can summarize binned (grouped) data by talking about the *bin* with the highest frequency, which we will also call the mode. When reporting the mode of a bin, give the interval that is included in that bin (e.g., "the mode is 2-3 people").

One strength of the mode is that it can be calculated for data at any level of measurement. This makes it useful for cases in which the mean or median is inappropriate.
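To make these ideas concrete, here is a small sketch using made-up scores (the variable names and values are invented for illustration). It shows that pandas reports every score tied for the highest frequency, and that the mode can be found even for labels that are not numbers.

```python
import pandas as pd

# Made-up scores for illustration only
one_mode = pd.Series([1, 2, 2, 3])
two_modes = pd.Series([1, 1, 2, 3, 3])   # a bimodal distribution

# .mode() returns every score tied for the highest frequency
print(one_mode.mode().tolist())
print(two_modes.mode().tolist())

# The mode also works for nominal data, such as made-up position labels
positions = pd.Series(["GK", "ST", "ST", "CB"])
print(positions.mode().tolist())
```

Because more than one mode is possible, `mode()` gives back a list-like result rather than a bare number, even when there is only one mode.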

You can find the mode by looking for the tallest bar on the histogram, or you can calculate the mode from the following code.


###➡ Run the cell###

Calculate the mode in the cell that follows. To run the cell, click on it, and then you can either simultaneously hit `Shift` + `Enter`, or you can click the play button to the left of the cell.

After running it, the result shown is the mode, which is the most common player age.

In [None]:
# mode
data["x"].mode()

### ➡ Answer the following questions ###

Text like this is also in a cell. Double click right here to edit this text cell. Then, type your answers to each question below the question. When finished, click outside the cell, and you will see your answers in the notebook.


-Q1. What is the most common age for FIFA players?

-Q2. What would happen to the mode if one additional player, aged 18, joined a team?

### 5. Mean

The mean is probably the most common measure of central tendency. It is the balance point of the distribution.

In APA format, which is used in some disciplines, the mean is expressed with an italicized _M_ (_M_ = 3.5). We will use the symbols for the mean found across disciplines: $\bar{X}$ (commonly called “x-bar”) for the sample mean and the Greek letter $\mu$ (“mu,” which is pronounced like “mew”) for the population mean.

Calculation of the mean requires a quantitative score at the interval or ratio level of measurement. This is because the differences between values are assumed to have consistent meaning.

You can estimate the mean by determining the balancing point on the histogram, the point where you could balance the bars if the histogram were on a pivot. Calculating the mean will give you the same location. 

You can also calculate the mean from a histogram:

- Multiply the midpoint of each bin by the frequency for that bin. If 50 people scored 90 points on an exam, you would multiply 90 points (the midpoint of the bin) by 50 (the frequency of the bin) to get 4500.

- Sum all the numbers calculated in the prior step.

- Divide the sum by the total number of scores.
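The three steps above can be sketched in plain Python. The bins below are made up for illustration; only the last bin (50 people scoring 90 points) comes from the example in the steps.

```python
# Made-up binned exam data: bin midpoints and the frequency of each bin
midpoints = [70, 80, 90]
frequencies = [20, 30, 50]

# Step 1: multiply the midpoint of each bin by its frequency
products = [m * f for m, f in zip(midpoints, frequencies)]

# Step 2: sum all the numbers from step 1
total = sum(products)

# Step 3: divide the sum by the total number of scores
n = sum(frequencies)
grouped_mean = total / n

print(products)
print(grouped_mean)
```

When a bin covers a range of scores, using its midpoint gives an estimate of the mean rather than the exact value, because the individual scores inside the bin are not known.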

Now, it is your turn to write the code to calculate the mean. Python makes it easy to do in one line of code. The code is the same as for the mode, except we will use `mean()` instead of `mode()`. Once you calculate the mean, find that value on the histogram you created earlier.


###➡ Complete the line and run the cell###

Delete the hash symbol (#) and fix the cell so that it calculates the mean. To run the cell, click on it, and then you can either simultaneously hit `Shift` + `Enter`, or you can click the play button to the left of the cell.

Need a hint? This code will look exactly like the code for the mode in the prior cell, except that `mode` gets replaced with `mean`.

In [None]:
# .mean()

### 6. Outliers

The mean can be affected by **outliers**, which are low-frequency, extreme scores. To illustrate this, we will add an extreme score to our distribution and see what happens.



###➡ Run the cell###

In the code below, we add a fictional soccer prodigy aged 8 to our distribution, then we calculate the mean. To run the cell, click on it, and then you can either simultaneously hit `Shift` + `Enter`, or you can click the play button to the left of the cell. You will see a single number.

In [None]:
# Add an outlier score

# Create a copy of column 'x'
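with_outlier = data["x"].copy()
# Create a new single score
new_score = pd.Series([8])
# Add the outlier to our copy (pd.concat replaces Series.append, which was removed in pandas 2.0)
with_outlier = pd.concat([with_outlier, new_score], ignore_index=True)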
# Display the mean of the new variable
with_outlier.mean()

You will notice that the mean has shifted down, but not by very much. The impact of an outlier is greater when there are fewer scores in the distribution. Here, we have a distribution that is so large, even a score as low as 8 does not make much of an impact.

If you are dealing with four exam scores in a course, for example, then a single extreme number, either low or high, can have more of an impact.
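Here is a small sketch of that idea, using made-up exam scores (all names and values are invented for illustration). It adds the same extreme score, a zero, to a set of four scores and to a set of 1,000 scores with the same mean, then compares how far each mean drops.

```python
import pandas as pd

# Made-up data: four exam scores, and 1,000 scores with the same mean
small = pd.Series([80, 85, 90, 85])
large = pd.Series([80, 85, 90, 85] * 250)

# The same extreme score is added to each distribution
outlier = pd.Series([0])
small_with = pd.concat([small, outlier], ignore_index=True)
large_with = pd.concat([large, outlier], ignore_index=True)

# How far did each mean drop?
small_shift = small.mean() - small_with.mean()
large_shift = large.mean() - large_with.mean()

print(small.mean(), large.mean())
print(small_shift, round(large_shift, 2))
```

With only four scores, the zero pulls the mean down by 17 points. With 1,000 scores, the same zero moves the mean by less than a tenth of a point.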

While the size of the distribution affects the outlier's impact, so does the distance of the outlier from the mean. That is, a very extreme score can shift the mean by quite a bit. To see what we mean (and that is the last pun, we promise), we will try the same code block again with an unrealistic, but extreme, score (99999). While this is not a realistic age, it is a score you might encounter in real data due to a typo.



###➡ Run the cell###

In the code below, we add an impossible player, aged 99999, to our distribution, and then we calculate the mean. 

To run the cell, click on it, and then you can either simultaneously hit `Shift` + `Enter`, or you can click the play button to the left of the cell. You will see a single number.

In [None]:
# Add an outlier score

# Create a copy of column 'x'
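with_outlier = data["x"].copy()
# Create a new single score
new_score = pd.Series([99999])
# Add the outlier to our copy (pd.concat replaces Series.append, which was removed in pandas 2.0)
with_outlier = pd.concat([with_outlier, new_score], ignore_index=True)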
# Display the mean of the new variable
with_outlier.mean()

###➡ Run the cell###

In the code below, we add an impossible player, aged 99999, to our distribution, and then we calculate the *mode*. Can you anticipate the result?

To run the cell, click on it, and then you can either simultaneously hit `Shift` + `Enter`, or you can click the play button to the left of the cell. You will see a single number.

In [None]:
# Add an outlier score

# Create a copy of column 'x'
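with_outlier = data["x"].copy()
# Create a new single score
new_score = pd.Series([99999])
# Add the outlier to our copy (pd.concat replaces Series.append, which was removed in pandas 2.0)
with_outlier = pd.concat([with_outlier, new_score], ignore_index=True)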
# Display the MODE of the new variable
with_outlier.mode()

This was a bit of a hack to prove the point that the mean can be affected by outliers. When we added a single extreme score (the outlier) to our distribution, it pulled the mean toward itself: downward for the 8-year-old and upward for the 99999-year-old. The number of scores in the distribution and the distance of the outlier score from the mean both affected how much the mean was pulled in the outlier's direction. Because our distribution is very large, its mean is somewhat resistant to realistic outliers. The mode is not pulled by outliers in this way: adding one extreme score did not change the most common age. In all, you should be skeptical of the mean whenever the distribution includes outliers.
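The contrast between the mean and the mode is easier to see in a very small distribution. The sketch below uses five made-up ages (not the FIFA data; all names are invented for illustration) and the same impossible score of 99999.

```python
import pandas as pd

# Made-up ages for illustration only
ages = pd.Series([21, 21, 23, 25, 30])

# Add one impossible score, such as a typo
with_typo = pd.concat([ages, pd.Series([99999])], ignore_index=True)

# The mean is pulled far toward the outlier
print(ages.mean(), with_typo.mean())

# The mode stays where it was
print(ages.mode().tolist(), with_typo.mode().tolist())
```

Here the mean jumps from 24 to 16686.5, while the mode remains 21 in both cases.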

### ➡ Answer the following questions ###

Text like this is also in a cell. Double click right here to edit this text cell. Then, type your answers to each question below the question. When finished, click outside the cell, and you will see your answers in the notebook.

-Q3. Under what circumstances can a single score have a big impact on the mean?

-Q4. With the outlier included, is the new mean still a good summary of the data?

### 7. Median

Finally, our third measure of central tendency is the median. First, we order the scores from smallest to largest. The median is the middle score in this list. Half of the scores will always be below the median, and the other half will be above the median. Another term for the median is the 50th percentile. Percentile is the percent of scores in a distribution at or below a particular score.

In APA format, the median is expressed with an italicized Mdn (*Mdn* = 3.5). 

Because the first step in finding the median is to list the scores in order, we need ordinal, interval, or ratio-level measurement.

It can be challenging to find the median from a histogram. The manual method is to list every score in order and then count from both ends to reach the middle score. Python is much faster.

Probably the most common question about the median is what happens when there is an even number of scores, such as { 1, 1, 2, 2 }. In this case, you (or Python) will end up with two middle scores, 1 and 2. When this happens, the mean of the two middle scores is found. For this example, the median would be 1.5. Therefore, while the median describes the middle score, it could end up being not exactly the same as any score in the distribution.
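You can check this example yourself with a short sketch. The first set of scores is the one from the paragraph above; the second, with an odd number of scores, is made up for comparison.

```python
import pandas as pd

# An even number of scores: the two middle scores are 1 and 2
even_count = pd.Series([1, 1, 2, 2])
print(even_count.median())

# An odd number of scores (made up): there is a single middle score
odd_count = pd.Series([3, 1, 2])
print(odd_count.median())
```

The first result is 1.5, the mean of the two middle scores. The second is 2.0, and notice that pandas handles the ordering for you, so the scores do not need to be sorted first.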

###➡ Complete the line and run the cell###

Delete the hash symbol (#) and fix the cell so that it calculates the median. To run the cell, click on it, and then you can either simultaneously hit `Shift` + `Enter`, or you can click the play button to the left of the cell.

In [None]:
# .median()

### ➡ Answer the following questions ###

Text like this is also in a cell. Double click right here to edit this text cell. Then, type your answers to each question below the question. When finished, click outside the cell, and you will see your answers in the notebook.


-Q5. Were all measures of central tendency the same? How did they differ?


-Q6. "FIFA players are usually 25 years old." Is this a fair statement? Why or why not?


-Q7. Do the measures of central tendency provide redundant or complementary information? That is, is it useful to report more than one measure of central tendency?

----
## IV. Summary

- In this module, we introduced three ways to summarize the values of scores in a distribution: mode, mean, and median.
- You saw how you can use Python code to load data and compute a variety of descriptive statistics.

---
## V. All done, congrats! 

Today, not only did you learn about describing data, but you also learned how to write some Python code. High five!

<img src="https://live.staticflickr.com/3471/3904325807_8ab0190152_b.jpg" alt="High-five!" width="100"/>

["High-five!"](https://live.staticflickr.com/3471/3904325807_8ab0190152_b.jpg) by Nick J Webb is licensed under CC BY 2.0