<a href="https://colab.research.google.com/github/vectrlab/apex-stats-modules/blob/main/Frequency_Distributions_Mini.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# APEX STATS Frequency Distributions Mini Module
Mini Module by David Schuster, based on APEX STATS code by Andy Qui Le

Licensed under CC BY-NC-SA

<img src="https://www.publicdomainpictures.net/pictures/260000/velka/soccer-football-player-american.jpg" width="400"/>

Image credit: ["Soccer, Football Player, American"](https://www.publicdomainpictures.net/en/view-image.php?image=257368&picture=soccer-football-player-american) in the public domain

## I. Start Here
Welcome to the first APEX STATS Mini Module! This module will review some of the most fundamental and useful tools for making sense out of distributions of data.

Along the way, we will show you how you can use Python to do statistics. You can work right in this notebook.

Arrows (➡) indicate something for you to do as you work through this notebook. The first thing you should do is to save a copy.

### ➡ Save a Copy ###

To save your work, you will need to save a copy of this notebook. In the *File* menu near the top of this window, find the *Save a Copy in Drive* menu item, and click it. A new window will open, and you will need to sign in with your Google account. From there, you can use *File* > *Save* to save your work to your own Google Drive. Once finished, share your completed notebook with your instructor using the *Share* button at the upper right.

### ➡ No, Really, Save a Copy ###

Hi again. We are double-checking that you saved a copy of this notebook before you continue. We want your work to be safely saved!

---

## II. Background

This module works best if you already have reviewed the following:

1. **Descriptive stats in your course.** You should already have covered distributions and histograms in your course. You may want to review your notes or textbook before starting this module.

2. **APEX Welcome Notebook.** If you have not worked through the APEX Welcome Notebook yet, you should do that first.

### Learning Outcomes

These exercises map onto a learning objective from the C-ID descriptor for [Introduction to Statistics](https://c-id.net/descriptors/final/show/365). Upon successful completion of the course, you will be able to:

* LO 1: Interpret data displayed in tables and graphically  

Next, read through the activity and follow the steps indicated by the arrows.


## III. Activity

The next section of this module involves a series of hands-on activities that use data on real soccer players in the International Federation of Association Football (FIFA). The data are from FIFA 19, a soccer videogame.

Before you can begin these exercises, you need to run the code cell below, which will import the FIFA file and create a dataframe (i.e., spreadsheet) named `data`. Once you run the cell, you'll be able to see a preview of the dataframe and note that it contains several columns (these will be described in Exercise 1).

### ➡ Run the cell ###

To run the cell, click on it, and then you can either simultaneously hit `Shift` + `Enter`, or you can click the play button to the left of the cell.

After you click it, you should see the text "The data were loaded." If you see that, continue to the next section. If you come back to this notebook later, you will need to rerun this cell to load the data again.

If you see the text "There was a problem loading the data," then the most likely explanation is a bug that is our fault. Let your instructor know the notebook is not working properly.

In [None]:
#Setup Example Data
import pandas as pd # import library
url = "https://raw.githubusercontent.com/vectrlab/apex-python-datasets/main/fifa19/example.csv"
try:
    data = pd.read_csv(url) # read the datafile into a dataframe named data
    print("The data were loaded.")
    print(data.head()) # preview the first five rows
except Exception: # e.g., a network problem or a missing file
    print("There was a problem loading the data.")

### 1. Explore the Population Distribution

Now that you've seen a preview of the dataframe, let's explore what's inside! We'll be using the following format to refer to each variable: 
`name of dataframe["column name"]` 

Because our FIFA dataframe is named `data`, we'll use notation like: `data["X7"]`

- `data["Y"]`: Wage in thousands of Euros
- `data["X"]`: Age in years
- `data["X1"]`: Heading Accuracy (0-100, with higher numbers indicating more accuracy) 
- `data["X2"]`: Dribbling rating (0-100, with higher numbers indicating better dribbling)
- `data["X3"]`: Agility rating (0-100, with higher numbers indicating more agility)
- `data["X4"]`: Shot Power rating (0-100, with higher numbers indicating more power)
- `data["X5"]`: Jersey Number
- `data["X6"]`: Position (abbreviated)
- `data["X7"]`: Name
- `data["X8"]`: Club

It is very typical to have more variables in your dataframe than you plan to explore in a given sitting. In this module we'll focus on players' ages, so the only variable we will need for now is `X`. We can ignore the other variables for the moment.
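
To make the bracket notation concrete, here is a minimal sketch that uses a tiny made-up dataframe. The name `toy` and its values are hypothetical and are not part of the FIFA file; only the column labels `X` and `Y` mirror the ones listed above.

```python
import pandas as pd

# Hypothetical toy data: three made-up players (not taken from the FIFA file)
toy = pd.DataFrame({
    "Y": [50, 120, 8],   # wage in thousands of Euros
    "X": [24, 31, 19],   # age in years
})

ages = toy["X"]            # brackets plus the quoted column name select one column
print(ages.tolist())       # the selected ages as a plain Python list
print(list(toy.columns))   # the column names in the dataframe
```

The same pattern, with `data` in place of `toy`, is what you will use in the rest of this module.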

The collection of all the values in variable `X` forms our **population distribution**, or collection of values from all members of our population of interest. Here, our population is players who appeared in FIFA 19. 

#### ➡ Paste the code and run the cell ####
To focus specifically on values in `data["X"]`, copy and paste this variable into the code cell below. **Important!** Make sure that you use capital X and not lower case x, and that you include the brackets and quote marks, as well.

The result will show you the first few players' ages (rows 0-4) as well as the last few players' ages (18202 - 18206). If you're curious, the first value displayed is for Lionel Messi, who was 31 in 2019.


In [None]:
# enter data["X"] on its own line in this box, and run it to see a list of player ages



### 2. Population size

Python only showed you the first and last few rows of the data. How many players are in this data file? While we could open the data file in a spreadsheet program and scroll to the bottom to look, we have a faster way. We can use Python to find the **population size**, or the number of scores in the distribution. Python calls this the length of the variable, so we use a function called `len()`.

#### ➡ Run the cell ####

Run the cell below to find the answer! Click on the cell, and then either press `Shift` + `Enter` at the same time or click the play button to the left of the cell.


In [None]:
#@title Population size
len(data["X"])
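
As a rough sketch of how `len()` counts scores, the example below uses a small made-up set of ages. The name `toy_ages` and its values are hypothetical. It also prints `.shape`, a pandas attribute that reports the same count.

```python
import pandas as pd

# Hypothetical toy ages, not taken from the FIFA file
toy_ages = pd.Series([24, 31, 19, 24, 27])

n = len(toy_ages)        # number of scores in the distribution
print(n)
print(toy_ages.shape)    # a tuple whose first entry is the same count
```

With only five values you could count by hand; with thousands of rows, `len()` saves you the scrolling.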

#### ➡ Answer the following questions ####

Text like this is also in a cell. Double click right here to edit this text cell. Then, type your answers to each question below the question. When finished, click outside the cell, and you will see your answers in the notebook.

Q1. Was the population size more or less than you expected? By how much?


Q2. When would it be faster to use the `len()` function, and when would it be faster to open the data file in a spreadsheet? 


Q3. Does the population size tell you anything about the typical age of a player?


Q4. Based only on the preview you saw, what would you estimate is the typical age of a player?

### 3. Frequency

As you saw, we have several thousand players in the distribution. If we want to summarize the ages of players, we cannot simply list all the values. We need to summarize the distribution. The first way we will summarize the distribution is by reporting the **frequency distribution**. **Frequency** is a simple concept; it means a count of the number of times a value occurs. In our case, we want to count how many times each age occurs.

A list of the frequencies for all values in a distribution is called the frequency distribution. Whenever you see the word frequency, think count.

#### ➡ Run the cell ####

Run the cell below: click on it, and then either press `Shift` + `Enter` at the same time or click the play button to the left of the cell.

When you run this code, a frequency table with three columns will be generated. The left column is just a row index, so we only need to look at the middle and right columns. The middle column lists each value in the distribution. The right column lists the frequency of each value.

In [None]:
#@title Generate Frequency Table 
import numpy as np

unique_vals, occurrences = np.unique(data['X'], return_counts=True) # create two arrays: one holds each distinct value, and the other holds the number of occurrences of each value
freq_dist_dict = { # create a python dictionary whose keys are column names and whose values are pandas Series built from the arrays above
    "Value": pd.Series(unique_vals),
    "Frequency": pd.Series(occurrences),
} 
freq_table = pd.DataFrame(freq_dist_dict) # frequency distribution table
freq_table # display the table
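
pandas also offers a shorter route to the same kind of table. The sketch below is an alternative to the cell above, not a replacement for it. It uses made-up ages (`toy_ages` is a hypothetical name) so that you can check the counts by hand.

```python
import pandas as pd

# Hypothetical toy ages, not taken from the FIFA file
toy_ages = pd.Series([24, 31, 19, 24, 27, 24, 31])

# value_counts() counts how many times each value occurs;
# sort_index() orders the result by value (age) rather than by count
toy_freq = toy_ages.value_counts().sort_index()
print(toy_freq)
```

The frequencies always add up to the number of scores in the distribution, which is a quick way to check a frequency table.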

#### ➡ Answer the following questions ####

Text like this is also in a cell. Double click right here to edit this text cell. Then, type your answers to each question below the question. When finished, click outside the cell, and you will see your answers in the notebook.


Q5. What was the frequency of 28 years?


Q6. Do all ages have the same frequency?


Q7. Which age(s) had the highest frequency? 


Q8. Which age(s) had the lowest frequency?



### 4. Visualize the population

Exploring a data set through frequency tables can certainly be useful, but we've all heard the phrase "a picture is worth a thousand words." Indeed, visualizing data can be hugely helpful, and this is exactly what you'll do for the next few exercises.

A **histogram** gives us a visual representation of a frequency distribution. The key to understanding histograms is to remember that they always have **frequency** plotted on the y-axis (vertical) and the values plotted on the x-axis (horizontal). 

For this exercise, you'll create a histogram representing the distribution of player ages in our data set. You can even choose which color you'd like the histogram to be!

Run the cell below, and enter the name of a color when prompted. You'll see player age represented on the x-axis and counts on the y-axis.

In [None]:
#@title Histogram with automatic binning and custom color
# most color names listed at https://matplotlib.org/stable/gallery/color/named_colors.html should work
import seaborn as sns # import library
custom_color = input("Type the name of a color : ") # get user input for color
sns.histplot(data["X"], color = custom_color, binwidth = 5) # display the histogram
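
To see what a bin width does to the counts behind the bars, here is a minimal sketch that groups made-up ages into 5-year bins with NumPy's `np.histogram`. The ages and the bin edges are hypothetical choices for illustration; seaborn chooses its own bin edges, so the edges in your plot may differ.

```python
import numpy as np

# Hypothetical toy ages, not taken from the FIFA file
toy_ages = np.array([17, 19, 21, 22, 24, 24, 26, 28, 31, 33])

# Bin edges 5 years apart: 15, 20, 25, 30, 35
edges = np.arange(15, 40, 5)
counts, edges = np.histogram(toy_ages, bins=edges)

# Each bin includes its left edge; only the last bin also includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left} to {right}: {count}")   # frequency of ages falling in each bin
```

Each count corresponds to the height of one bar in a histogram drawn with these bins.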

#### ➡ Answer the following questions ####

Text like this is also in a cell. Double click right here to edit this text cell. Then, type your answers to each question below the question. When finished, click outside the cell, and you will see your answers in the notebook.

Q9. How would you describe the shape of this distribution?


Q10. What is the **bin width** or **bin size**? In other words, what is the interval (i.e., span of years) represented by the width of one bar?


Q11. How would the histogram change if the bin width were set to 50 years?


Q12. Which age group had the highest frequency? How did you know?



----
## IV. Summary

- In this module, we introduced fundamental and useful tools for making sense out of distributions.
- You saw how you can use Python code to load data, generate a frequency table, and then represent the frequencies in a histogram.

---
## V. All done, congrats! 

Today you've not only learned about describing and visualizing data, but you've also learned how to write some Python code. High five!

<img src="https://live.staticflickr.com/3471/3904325807_8ab0190152_b.jpg" alt="High-five!" width="100"/>

["High-five!"](https://live.staticflickr.com/3471/3904325807_8ab0190152_b.jpg) by Nick J Webb is licensed under CC BY 2.0