<a href="https://colab.research.google.com/github/vectrlab/apex-stats-modules/blob/main/Descriptive_Statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# APEX STATS Descriptive Statistics
Module by David Schuster, based on APEX STATS code by Andy Qui Le

Licensed under CC BY-NC-SA

<img src="https://www.publicdomainpictures.net/pictures/260000/velka/soccer-football-player-american.jpg" width="400"/>

Image credit: ["Soccer, Football Player, American"](https://www.publicdomainpictures.net/en/view-image.php?image=257368&picture=soccer-football-player-american) in the public domain

## I. Intro and Learning Objectives
In this module, we will introduce some of the most fundamental and useful tools for making sense out of distributions of data. We call these **descriptive statistics**, and they are methods of summarizing a distribution of numbers in a single value. Imagine summarizing your performance in college in a single number, your GPA. While your GPA is informative, it is not the whole picture of your college performance. In the same way, we will see that single number summaries give us an informative, but incomplete, summary of data. However, when we examine several descriptive statistics, we are able to gain a much clearer summary of our data.

We focus on quantitative distributions in this module, which have three properties that can be summarized. Distributions have **central tendency**, which is the typical or middle value of the scores. Distributions have **spread** or **variability**, which is how different scores are from other scores in the distribution. Finally, distributions have a **shape**, which becomes apparent when they are graphed using a histogram.


These exercises map onto several learning objectives for the C-ID descriptor for [Introduction to Statistics](https://c-id.net/descriptors/final/show/365). Upon successful completion of the course, you will be able to:  

* LO 1: Interpret data displayed in tables and graphically  
* LO 3: Calculate measures of central tendency and variation for a given data set
* LO 5: Summarize and describe discrete distributions
* LO 6: Calculate probabilities using normal and t-distributions; Apply the empirical rule to normal distributions
* LO 7: Distinguish the difference between sample and population distributions


---

## II. Background Reading

Read through the descriptive statistics chapter in your textbook (see, for example, [a sample chapter](https://www.davidschuster.info/books/statistics-legacy/descriptive-statistics-and-data-visualization.html)) before starting this module and consider these questions: 


### Discussion Questions

1. How does visualizing data help us understand distributions?
2. What is the difference between measures of central tendency and measures of variability?
3. What is the difference between the mean and the median?
4. Why is it possible to have more than one mode but not more than one median?

## III. Activity

The next section of this module involves a series of hands-on activities that use data on real soccer players in the International Federation of Association Football (FIFA). The data are from FIFA 19, a soccer videogame.

Before you can begin these exercises, you need to run the code cell below, which will import the FIFA file and create a dataframe (i.e., spreadsheet) named `data`. Once you run the cell, you'll be able to see a preview of the dataframe and note that it contains several columns (these will be described in Exercise 1).

Reminder: to run the cell, click on it, and then you can either simultaneously hit `Shift` + `Enter`, or you can click the play button to the left of the cell.

In [None]:
#Setup Example Data
import pandas as pd # import library
data = pd.read_csv("https://raw.githubusercontent.com/vectrlab/apex-python-datasets/main/fifa19/example.csv") # read the datafile
data # display the data

### 1. Explore the Population Distribution

Now that you've seen a preview of the dataframe, let's explore what's inside! We'll be using the following format to refer to each variable: 
`name of dataframe["column name"]` 

Because our FIFA dataframe is named `data`, we'll use notation like: `data["X7"]`

- `data["Y"]`: Wage in thousands of Euros
- `data["X"]`: Age in years
- `data["X1"]`: Heading Accuracy (0-100, with higher numbers indicating more accuracy) 
- `data["X2"]`: Dribbling rating (0-100, with higher numbers indicating a better rating)
- `data["X3"]`: Agility rating (0-100, with higher numbers indicating a better rating)
- `data["X4"]`: Shot Power rating (0-100, with higher numbers indicating a better rating)
- `data["X5"]`: Jersey Number
- `data["X6"]`: Position (abbreviated)
- `data["X7"]`: Name
- `data["X8"]`: Club

It is very typical to have more variables in your dataframe than you plan to explore in a given sitting. In this module we'll focus on players' ages, so the only variable we will need for now is `X`. We can ignore the other variables for the moment.

The collection of all the values in variable `X` forms our **population distribution**, or collection of values from all members of our population of interest. Here, our population is players who appeared in FIFA 19. 

To focus specifically on values in `data["X"]`, copy and paste this variable into the code cell below. **Important!** Make sure that you use capital X and not lower case x, and that you include the brackets and quote marks, as well.

The result will show you the first few players' ages (rows 0-4) as well as the last few players' ages (18202 - 18206). If you're curious, the first value displayed is for Lionel Messi, who was 31 in 2019.


In [None]:
# enter data["X"] on its own line in this box, and run it to see a list of player ages



### 2. Population size

Python only showed you the first and last few rows of the data. How many players are in this data file? While we could open the data file in a spreadsheet program and scroll to the bottom to look, we have a faster way. We can use Python to find the **population size**, or the number of scores in the distribution. Python calls this the length of the variable, so we use a function called `len()`.

Run the cell below to find the answer!

In [None]:
#@title Population size
len(data["X"])

**⍰ Consider the following questions:**

- Was the population size more or less than you expected? By how much?
- When would it be faster to use the `len()` function, and when would it be faster to open the data file in a spreadsheet? 
- Does the population size tell you anything about the typical age of a player?
- Based only on the preview you saw, what would you estimate is the typical age of a player?

### 3. Frequency

As you saw, we have several thousand players in the distribution. If we want to summarize the ages of players, we cannot simply list all the values. We need to summarize the distribution. The first way we will summarize the distribution is by reporting the **frequency distribution**. **Frequency** is a simple concept; it means a count of the number of times a value occurs. In our case, we want to count how many times each age occurs.

A list of the frequencies for all values in a distribution is called the frequency distribution. Whenever you see the word frequency, think count.

Run the cell below by clicking on it, and then you can either simultaneously hit `Shift` + `Enter`, or you can click the play button to the left of the cell.

When you run this code, a frequency table will be generated with three columns. We only need to look at the middle and right columns. The middle column will list all of the values in the distribution. The right column will list the frequency of each value.

In [None]:
#@title Generate Frequency Table 
import numpy as np

unique_vals, occurrences = np.unique(data['X'], return_counts=True) # create two arrays: one holds the unique values, and the other holds the number of occurrences of each value
freq_dist_dict = { # create a python dictionary object whose keys are names of columns and values are the pandas series above
    "Value": pd.Series(unique_vals),
    "Frequency": pd.Series(occurrences),
} 
freq_table = pd.DataFrame(freq_dist_dict) # frequency distribution table
freq_table # display the table
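
To see what a frequency count does without the full dataframe, here is a minimal sketch on a made-up list that is small enough to check by hand. The list `toy_ages` is hypothetical (it is not the FIFA data), and `Counter` from Python's standard library is used here instead of the NumPy approach in the cell above.

```python
from collections import Counter

# Hypothetical toy ages, small enough to count by hand (not the FIFA data)
toy_ages = [16, 17, 17, 18, 18, 18, 19]

# Counter tallies how many times each value occurs
toy_freq = Counter(toy_ages)

# Print the frequency distribution in order of value
for value in sorted(toy_freq):
    print(value, toy_freq[value])
```

Adding up the frequencies should give back the number of scores in the list, which is a quick way to check a frequency table.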

**⍰ Consider the following questions:**

- What was the frequency of 28 years?
- Do all ages have the same frequency?
- Which age(s) had the highest frequency? 
- Which age(s) had the lowest frequency?

### 4. Relative Frequency

The frequencies you generated indicated how many players were a particular age. Of course, to interpret whether an age is common or uncommon, you need to know the population size. For example, we can see that 42 players were age 16. But 42 out of how many total players?

We can pull these two numbers together to create a frequency table with **relative frequency**. Relative frequency is the proportion (or a percentage) of scores in the distribution that match a particular value.

Run the cell below. You will get the same frequency table as before except with a new column called `Percent`. The `Percent` column gives the relative frequency of each value as a percentage.

In [None]:
#@title Generate Frequency Table with Relative Frequency
import numpy as np

def calculate_percentage(counts, total): # calculate the percentage of the frequencies 
  percentages = [] # result list to be returned  
  for each in counts: # iterate through counts array
    percentages.append((each/total)*100) # calculate percentage and append it to our list
  return percentages

unique_vals, occurrences = np.unique(data['X'], return_counts=True) # create two arrays: one holds the unique values, and the other holds the number of occurrences of each value

rel_freq = calculate_percentage(occurrences, len(data['X'])) # find relative frequency using customized function, calculate_percentage

s1 = pd.Series(unique_vals) # create pandas series objects for each column in dataframe to be created later
s2 = pd.Series(occurrences)
s3 = pd.Series(rel_freq)

freq_dist_dict = { # create a python dictionary object whose keys are names of columns and values are the pandas series above
    "Value": s1,
    "Frequency": s2,
    "Percent": s3,
} 

freq_table_w_rel_freq = pd.DataFrame(freq_dist_dict) # frequency distribution table
freq_table_w_rel_freq # display the table
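
The arithmetic behind relative frequency is just frequency divided by the total number of scores. The sketch below shows it on a hypothetical eight-score list (`toy_ages` is made up for illustration), so each percentage can be verified by hand.

```python
from collections import Counter

# Hypothetical toy ages (not the FIFA data): eight scores in total
toy_ages = [16, 17, 17, 18, 18, 18, 18, 19]

toy_freq = Counter(toy_ages)   # frequency of each value
n = len(toy_ages)              # total number of scores

# Relative frequency as a percentage: (frequency / total) * 100
toy_rel_freq = {value: count / n * 100 for value, count in toy_freq.items()}

for value in sorted(toy_rel_freq):
    print(value, toy_freq[value], toy_rel_freq[value])
```

Because every score is counted exactly once, the percentages add up to 100.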

**⍰ Consider the following questions:**

- What proportion of players were 28 years old in 2019?
- Approximately what proportion of players were aged 18 or younger?

### 5. Visualize the population

Exploring a data set through frequency tables can certainly be useful, but we've all heard the phrase "a picture is worth a thousand words." Indeed, visualizing data can be hugely helpful, and this is exactly what you'll do for the next few exercises.

A **histogram** gives us a visual representation of a frequency distribution. The key to understanding histograms is to remember that they always have **frequency** plotted on the y-axis (vertical) and the values plotted on the x-axis (horizontal). 

For this exercise, you'll create a histogram representing the distribution of player ages in our data set. You can even choose which color you'd like the histogram to be!

Run the cell below, and enter the name of a color when prompted. You'll see player age represented on the x-axis and counts on the y-axis.

In [None]:
#@title Histogram with automatic binning and custom color
# color names that work should include https://matplotlib.org/stable/gallery/color/named_colors.html
import seaborn as sns # import library
custom_color = input("Type the name of a color : ") # get user input for color
sns.histplot(data["X"], color = custom_color, binwidth = 5) # display the histogram

**⍰ Consider the following questions:**

- How would you describe the shape of this distribution?
- What is the **bin width** or **bin size**? In other words, what is the interval (i.e., span of years) represented by the width of one bar?
- How would the histogram change if the bin width was set to 50 years?
- Which age group had the highest frequency? How did you know?

The histogram has a number of useful features:

1. The filled in bars reflect observations. The higher the bar, the more observations of scores in that bin. That allows for quick conclusions about frequency.
2. When constructed properly for quantitative data, the distribution has a shape. We will classify distributions according to their shape, and there are some really interesting uses for that later in this course.
3. We do not have one bar for each year. Instead, the years are grouped into **bins**. The bars have equal width, which is called the **bin size**. In this example, we used a bin size of 5. It is important that the bin size is consistent throughout the histogram, and the bars in the histogram are touching with no space between them. Bin size allows us to group the values so that we can better see the shape of the distribution. An analogy is a low-resolution picture. If it has too few pixels, the image is blurry and hard to see. If it has a lot of pixels and we are zoomed in too much, it is hard to know what the picture is about. In the same way, we want to choose a bin size that gives us enough 'resolution' to summarize the data.

Try experimenting with different bin sizes to see how the histogram changes. Notice that it only has one shape, but it gets very hard to see it when the bin size is too low or too high.


In [None]:
#@title Histogram with custom binning and custom color
# color names that work should include https://matplotlib.org/stable/gallery/color/named_colors.html
import seaborn as sns # import library

custom_color = input("Type the name of a color : ") # get user input for color
custom_binwidth = int(input("Enter the width of the bins : ")) # get user input for bins
sns.histplot(data["X"], color = custom_color, binwidth = custom_binwidth) # display the histogram
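
To make the idea of binning concrete, here is a rough sketch of how scores can be grouped into bins of width 5 by hand. The list `toy_ages` is hypothetical, and the rule used here (bins start at multiples of the bin width) is an assumption made for simplicity; a plotting library may place the bin edges differently, for example starting at the smallest score.

```python
from collections import Counter

# Hypothetical toy ages (not the FIFA data)
toy_ages = [17, 19, 21, 22, 24, 25, 26, 29, 31, 34]
bin_width = 5

# Assumption: each bin starts at a multiple of bin_width,
# so an age of 22 falls in the bin that starts at 20
bin_counts = Counter((age // bin_width) * bin_width for age in toy_ages)

# Print each bin as an interval with its frequency
for start in sorted(bin_counts):
    print(f"{start}-{start + bin_width - 1}: {bin_counts[start]}")
```

Changing `bin_width` to 1 gives one bin per age, and changing it to 50 puts every score in a single bin, which mirrors what happens to the histogram when the bin size is too low or too high.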

### 6. When is the histogram meaningful?

Histograms are visualizations of continuous, quantitative data. You can see how some of the meaning of the histogram is lost when we use qualitative data. We will create a histogram of player positions.

In [None]:
#@title Histogram with automatic binning and color
import seaborn as sns # import library
sns.histplot(data["X6"]) # display the histogram

When we tried to make a histogram of player positions, there were a few problems. First, we cannot use binning, so there are a lot of values along the x-axis, which makes it hard to read them. Next, the order of the positions is arbitrary. As it turns out, the order of the values matters in the histogram, because this is what gives the graph a shape. Later on, we will find it useful to talk about the shape of a distribution. In all, histograms are useful for quantitative data only.

### 7. Summarizing Discrete Distributions

Age is a continuous, quantitative variable. We can also summarize and visualize discrete variables, like the club and jersey number. Club is a nominal variable because the values (the club names) are category labels. Because the values are words, it seems obvious that the club is not a quantitative variable. However, this also applies to the jersey number. Even though jerseys are labeled with numbers, they are still nominal variables. Higher jersey numbers do not indicate more of anything. Adding two jersey numbers together does not result in a meaningful number.

Discrete distributions are not appropriate for histograms, especially when the order of the scores is arbitrary (which club goes first? second? last?). Such a graph does not have a shape.

If the variable is discrete, a **bar graph** is constructed the same way but includes an equal space in between each bar. This emphasizes the discrete nature of the data. Bar graphs are useful summaries, but they do not have a shape like a histogram would. A variation on the bar graph is a **Pareto chart**, which orders the bars from tallest (highest frequency) to smallest (lowest frequency), leaving an equal space between each bar. But even if we follow this convention, the graph is not technically a histogram. That said, it shows frequencies, just like a histogram.

Next, we generate bar plots for player position (`data["X6"]`) and Jersey Number (`data["X5"]`).

In [None]:
#@title Bar Plot aka Count Plot (Vertical)
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(15,8))
sns.set_theme(style="darkgrid")
ax = sns.countplot(x="X6", data=data,
                   palette="Set1") 

# style parameter of set_theme method can be darkgrid, whitegrid, dark, white, and ticks as args
# palette can be Set1, Set2, or Set3

In [None]:
#@title Bar Plot aka Count Plot (Vertical)
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(15,8))
sns.set_theme(style="darkgrid")
ax = sns.countplot(x="X5", data=data,
                   palette="Set1") 

# style parameter of set_theme method can be darkgrid, whitegrid, dark, white, and ticks as args
# palette can be Set1, Set2, or Set3

**⍰ Consider the following questions:**

- What do the bar chart and the histogram have in common?
- Which is the most frequent jersey number?
- Which team had the most players in FIFA 19?

### 8. Central Tendency

Measures of central tendency are averages. They summarize the scores in the distribution in a single number. If we wanted to give a single number to summarize the players' ages, what number would we choose? We want to find a middle or typical value. There are three measures to choose from: mode, median, and mean.

### 9. Mode

The value with the highest frequency is called the **mode**. More than one score can have the highest frequency, so distributions with two modes are called **bimodal**, and distributions with more than two modes are called **multimodal**. These terms also apply to binned data. We can summarize binned data by talking about the *bin* with the highest frequency, which we will also call the mode. When reporting the mode of a bin, give the interval that is included in that bin (e.g., "the mode is 2-3 people").

One strength of the mode is that it can be calculated for data at any level of measurement. This makes it useful for cases in which the mean or median are inappropriate.

You can find the mode by looking for the tallest bar on the histogram, or you can calculate the mode from the following code:

In [None]:
# mode
data["X"].mode()
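
A distribution can have more than one mode. The sketch below uses a hypothetical list, `toy_ages`, in which two ages tie for the highest frequency. It relies on `statistics.multimode` from Python's standard library (available in Python 3.8 and later) rather than the pandas method used above.

```python
from statistics import multimode

# Hypothetical toy ages (not the FIFA data): 21 and 25 each occur twice
toy_ages = [18, 21, 21, 25, 25, 30]

# multimode returns every value that shares the highest frequency
toy_modes = multimode(toy_ages)
print(toy_modes)
```

Because two values are returned, this toy distribution would be described as bimodal.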

### 10. Mean

The mean is probably the most common measure of central tendency. It is the balance point of the distribution.

In the APA format used in some disciplines, the mean is expressed with an italicized _M_ (_M_ = 3.5). We will use the symbols for the mean found across disciplines: $\bar{X}$ (commonly called “x-bar”) for the sample mean and the Greek letter $\mu$ (“mu”, which is pronounced like “mew”) for the population mean.

Calculation of the mean requires a quantitative score at the interval or ratio level of measurement. This is because the differences between values are assumed to have consistent meaning.

You can estimate the mean by determining the balancing point on the histogram, the point where you could balance the bars if the histogram was on a pivot. Calculating the mean will give you the same location. 

You can also calculate the mean from a histogram:

- Multiply the midpoint of each bin by the frequency for that bin. If 50 people scored 90 points on an exam, you would multiply 90 points (the midpoint of the bin) by 50 (the frequency of the bin) to get 4500.

- Sum all the numbers calculated in the prior step.

- Divide the sum by the total number of scores.
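
The three steps above can be sketched in a few lines. The bin midpoints and frequencies below are hypothetical exam scores chosen for illustration; they include the 50 people who scored 90 points from the example in the first step.

```python
# Hypothetical histogram: bin midpoint -> frequency of that bin
bin_freq = {70: 20, 80: 30, 90: 50}

# Step 1: multiply each midpoint by its frequency (90 * 50 = 4500, and so on)
products = {midpoint: midpoint * freq for midpoint, freq in bin_freq.items()}

# Step 2: sum the products
total = sum(products.values())

# Step 3: divide by the total number of scores
n_scores = sum(bin_freq.values())
mean_from_bins = total / n_scores

print(products[90], total, n_scores, mean_from_bins)
```

When the bins are wider than a single value, this gives an approximation of the mean, because every score in a bin is treated as if it were equal to the midpoint.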

Now, it is your turn to write the code to calculate the mean. Python makes it easy to do in one line of code. The code is the same as for the mode, except we will use the `mean()` function instead of `mode()`. Once you calculate the mean, find that value on the histogram you created earlier.

In [None]:
# .mean()

The mean is affected by **outliers**, which are low-frequency, extreme scores. In the code below, we add a single fictional player whose age is three times the mean age to our distribution, then we calculate the mean.

**⍰ Consider the following questions:**

- Can a single score have a big impact on the mean?
- With the outlier included, is the new mean still a good summary of the data?

In [None]:
#@title Add an outlier score
with_outlier = data["X"].copy() # Create a true copy of one variable from the distribution
new_score = data["X"].mean() * 3 # Create a new single score equal to three times the mean
with_outlier = pd.concat([with_outlier, pd.Series([new_score])], ignore_index=True) # Add the outlier to our copy (Series.append was removed in pandas 2.0)
with_outlier.mean() # Show the mean of the new variable
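
The effect of one extreme score is easier to see in a tiny distribution. This sketch repeats the same idea (adding one score equal to three times the mean) on a hypothetical five-score list, `toy_ages`, so the arithmetic can be followed by hand.

```python
# Hypothetical toy ages (not the FIFA data)
toy_ages = [20, 22, 24, 26, 28]

# Mean before the outlier: 120 / 5
mean_before = sum(toy_ages) / len(toy_ages)

# One extreme score equal to three times the mean
outlier = mean_before * 3

# Build a new list so the original stays unchanged
toy_with_outlier = toy_ages + [outlier]

# Mean after the outlier: 192 / 6
mean_after = sum(toy_with_outlier) / len(toy_with_outlier)

print(mean_before, outlier, mean_after)
```

In a distribution this small, a single score moves the mean a long way. In a distribution with thousands of scores, the same outlier would move the mean much less.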


### 11. Median

Finally, our third measure of central tendency is the median. First, we order the scores from smallest to largest. The median is the middle score in this list. Half of the scores will always be below the median, and the other half will be above the median. Another term for the median is the 50th percentile. Percentile is the percent of scores in a distribution at or below a particular score.

In APA format, the median is expressed with an italicized Mdn (*Mdn* = 3.5). 

Because the first step in finding the median is to list the scores in order, we need ordinal, interval, or ratio-level measurement.

It can be challenging to find the median from a histogram. The manual method is to list every score in order and then count from both ends to reach the middle score. Python is much faster.

Probably the most common question about the median is what happens when there is an even number of scores, such as { 1, 1, 2, 2 }. In this case, you (or Python) will end up with two middle scores, 1 and 2. When this happens, the mean of the two middle scores is found. For this example, the median would be 1.5. Therefore, while the median describes the middle score, it could end up being not exactly the same as any score in the distribution.
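
The even-count case can be checked directly on the example above. This sketch uses `statistics.median` from Python's standard library; the second list, `odd_scores`, is a hypothetical example added for comparison.

```python
from statistics import median

# The example from the text: an even number of scores
even_scores = [1, 1, 2, 2]
# The two middle scores are 1 and 2, so the median is their mean
even_median = median(even_scores)

# Hypothetical comparison: an odd number of scores has a single middle score
odd_scores = [1, 2, 2]
odd_median = median(odd_scores)

print(even_median, odd_median)
```

Notice that 1.5 does not appear anywhere in `even_scores`, while the median of `odd_scores` is one of the actual scores.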

In [None]:
# .median()

**⍰ Consider the following questions:**

- Were all measures of central tendency the same? Why did they differ?
- Which of these measures provides the best summary of the data?
- Do the measures of central tendency provide redundant or complementary information? That is, is it useful to report more than one measure of central tendency?

### 12. Variability

You're doing great! So far, you have visualized (using the histogram) the distribution and computed measures of central tendency. Each measure of central tendency summarizes the distribution in a single number.

The last piece of descriptive statistics is measuring variability, or differences among the scores.

Think of the difference in ages on a youth sports team, which tend to require players to be within a couple years of each other. That team will have low diversity of ages, which we could label low variability. Compare this to our FIFA population, which has a lot more age diversity, or variability. In this section, we will learn to get a bit more precise by computing quantitative measures of variability. As with central tendency, there are several ways to do it.



### 13. Range

Range is the difference between the highest and lowest score. A larger range suggests more variability, and a smaller range suggests less.

Range is the width of the histogram.

Range is measured in the same units as the variable.

In [None]:
# find range
# if data["X"].max() gives the maximum, and .min() gives the minimum, can you write a formula to find the range?

### 14. Interquartile Range

Interquartile range (IQR) is a way of measuring width near the middle of a distribution. The IQR is the difference between the first (Q1) and third (Q3) **quartiles**, cutoff points that divide the distribution into four groups of equal frequency. The second quartile (Q2) is the median.

It can be helpful to imagine the process of finding the median: listing all the scores in the distribution in order, and then finding the value in the middle. If we do this, we have two groups with equal frequency. Imagine doing the same process again on just the lower half of the distribution. It is like slicing a cake into two equal-sized pieces, and then slicing one of those pieces into two more pieces. The first quartile, Q1, is exactly that. It divides the values below the median into two groups. Therefore, 25% of the scores in the distribution will be below Q1, and 75% of scores in the distribution will be above Q1. We can do the same thing with values above the median. That would create Q3, with 75% of the scores in the distribution below Q3 and 25% of scores above Q3.

The proportion of scores below a score has another name, the **percentile**. Q1 is the same thing as the 25th percentile (because 25% of scores are at or below Q1). Q2 is the 50th percentile, and Q3 is the 75th percentile.

In the code below, Q1 is calculated and saved in a variable called `q1`. Write another line of code to calculate Q3 and save it in `q3` (hint: modify the percentile in the code!). The last line then subtracts `q3 - q1` to find the IQR.

In [None]:
#@title Quartiles
q1 = data.describe()["X"]["25%"] # get Q1

# Write one line of code here to calculate q3
# q3 = 

q3 - q1 # display the IQR

### 15. Box Plots

**Box plots** are a handy visualization that bring together the median and the quartiles. Together, the quartiles (Q1, Q2, Q3), the minimum value, and the maximum value are called the **five number summary**.
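
As a rough illustration, the five number summary can be worked out on a small hypothetical list. The sketch uses `statistics.quantiles` from Python's standard library with `method="inclusive"`, which is an assumption: different tools use slightly different rules for locating quartiles, so results on other data may differ a little between programs.

```python
from statistics import quantiles

# Hypothetical toy scores (not the FIFA data)
toy_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points that split the data into four groups
toy_q1, toy_q2, toy_q3 = quantiles(toy_scores, n=4, method="inclusive")

# Five number summary: minimum, Q1, median (Q2), Q3, maximum
five_number = (min(toy_scores), toy_q1, toy_q2, toy_q3, max(toy_scores))

# The box in a box plot spans Q1 to Q3, so its width is the IQR
toy_iqr = toy_q3 - toy_q1

print(five_number, toy_iqr)
```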

In [None]:
#@title Box Plot
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(15,8))
sns.boxplot(x="X", data=data)


### 16. Sum of squares

Measuring how spread out the data are in terms of width, whether it is the whole histogram (range) or the middle part of the histogram (the IQR) is one helpful way to describe variability. In the next few sections, we introduce a complementary way to talk about variability: we can measure how spread out scores tend to be from each other. These additional measures of variability let us distinguish between a distribution where all the values are clustered in the middle and one where the values are spread out from the middle, even when they may have the same range.

The first of these measures is the **sum of squares**, which has more steps than it sounds. A better name would be the *sum of squared deviations from the mean*. Sum of squares involves the following process:

1. Find the mean of the distribution.

2. Subtract each value from the mean, generating a list of **mean deviations**.

3. Square each of the mean deviations.

4. Sum the squared deviations. That is the sum of squares.
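
The steps for the sum of squares can be sketched on a small hypothetical list before applying them to the full dataframe. The name `toy_scores` and its values are made up for illustration.

```python
# Hypothetical toy scores (not the FIFA data)
toy_scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Find the mean of the distribution
toy_mean = sum(toy_scores) / len(toy_scores)

# Subtract each value from the mean to get the mean deviations
toy_deviations = [toy_mean - score for score in toy_scores]

# Square each mean deviation
toy_squared = [dev ** 2 for dev in toy_deviations]

# Sum the squared deviations to get the sum of squares
toy_sum_of_squares = sum(toy_squared)

print(toy_mean, toy_squared, toy_sum_of_squares)
```

Squaring removes the sign, so it does not matter whether the deviation is computed as the mean minus the score or the score minus the mean.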


First, we will generate a list of mean deviations. Because we need to determine mean deviation for every score, this process will create a new distribution. We will create a new variable, `data["Z"]`, to hold the mean deviation scores.

In [None]:
#@title Mean deviation
import numpy as np # import library

data["Z"] = np.mean(data["X"]) - data['X'] # Make a new column and assign mean deviation values
data # Display the data

Scroll to the right to see our new variable, `data["Z"]`.

Mean deviation, or distance from the mean, gives us a way to measure the tendency for a score to be away from the mean. To summarize this property for all of the scores, we will try adding them together to get the **sum of mean deviations**. What happens when we sum all of the mean deviation scores?

In [None]:
#@title Sum of mean deviations

round(data["Z"].sum(), 5) # we will round to five decimal places to avoid misleading small computational error

The problem is that summing mean deviations will always result in that number.

**⍰ Consider the following question:**

- Why does the sum of mean deviations always equal the same number?

To overcome this problem, we will square each mean deviation before taking the sum. This process is the **sum of squares**. For clarity, we will include all the steps, including the mean deviation.

In [None]:
#@title Sum of squares
import numpy as np # import library

data["Z"] = np.mean(data["X"]) - data['X'] # Make a new column and assign mean deviation values
data["Z1"] = data['Z']**2 # Make another new column with squared deviations
data["Z1"].sum() # This is now the sum of squared deviations, or the sum of squares

### 17. Variance

By squaring each deviation and summing, we get a measure of the amount of spread present in a distribution. There are some limitations of this method. For one, the sum of squares is in *squared units*, so it is difficult to interpret. Second, this measure of variability is affected by the number of scores in the distribution. 

Variance solves the second problem. We divide the sum of squares by a quantity with a special name, the **degrees of freedom**. The degrees of freedom is equal to the population size, `N`, whenever we have a population distribution. The degrees of freedom is equal to `N - 1` whenever we have a sample distribution.
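
To show how the choice of degrees of freedom changes the result, the sketch below divides the same sum of squares by `N` and by `N - 1`. The list `toy_scores` is hypothetical. The `statistics` functions `pvariance` (population) and `variance` (sample) from Python's standard library are included only as a cross-check on the manual arithmetic.

```python
import statistics

# Hypothetical toy scores (not the FIFA data)
toy_scores = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(toy_scores)

# Sum of squared deviations from the mean
toy_mean = sum(toy_scores) / n
sum_sq = sum((toy_mean - score) ** 2 for score in toy_scores)

# Population variance: degrees of freedom = N
toy_pop_variance = sum_sq / n

# Sample variance: degrees of freedom = N - 1
toy_sample_variance = sum_sq / (n - 1)

# Cross-check against the standard library
print(toy_pop_variance, statistics.pvariance(toy_scores))
print(toy_sample_variance, statistics.variance(toy_scores))
```

Dividing by `N - 1` always gives a slightly larger value than dividing by `N`, and the gap shrinks as the number of scores grows.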

In [None]:
#@title Population variance, calculated manually for this example
import numpy as np # import library
sum_of_squares = data["Z1"].sum() # This is now the sum of squared deviations, or the sum of squares
pop_variance = sum_of_squares / data["Z1"].count()
pop_variance

Python helps us out by having a function we can use to find population variance. Going forward, we do not have to calculate sum of squares first. Notice how both methods give us the same value.

In [None]:
#@title Population variance
import numpy as np # import library
np.var(data["X"], ddof=0) # display the population variance (ddof=0 divides by N)

### 18. Standard deviation

Finally, standard deviation solves the problem of squared units. We simply take the square root of variance to get standard deviation.

Because of that, you can also square the standard deviation in order to get the variance.


In [None]:
#@title Population standard deviation
import numpy as np # import library
np.std(data["X"], ddof=0) # display the population standard deviation (ddof=0 divides by N)

### 19. Standard deviation is useful

Compared to sum of squares and variance, standard deviation is the most useful single-number summary of variability. Unlike variance and sum of squares, standard deviation is in the same units as the variable.

The mean is often a good single-number summary of the central tendency. The standard deviation gives us a measure of variability. Together, we can make an inference about where most of the scores lie on the histogram (the mean) and how spread out the distribution is from the mean (the standard deviation).

If we assume a distribution is normal, meaning that it follows a bell-shaped curve, we can make further inferences by combining the mean and standard deviation.

**In normally distributed data, a majority of the scores (about 68%) will be within 1 standard deviation of the mean.**

This is called the **empirical rule** or the **68-95-99.7 rule**. When data are normally distributed, we know where scores will fall. You can determine the interval of scores that include about 68% of the scores this way:

In [None]:
#@title Calculate Intervals Around Mean
mean = data['X'].mean()
sdev = np.std(data["X"], ddof=0) # find the population standard deviation
(mean - 1 * sdev, mean + 1 * sdev) # set to one standard deviation

**In normally distributed data, most of the scores (about 95%) will be within 2 standard deviations of the mean.**

This uses the same code as before but adds and subtracts the standard deviation a second time. This gives us an interval of scores that include 95% of the distribution.

In [None]:
#@title Calculate Intervals Around Mean
mean = data['X'].mean()
sdev = np.std(data["X"], ddof=0) # find the population standard deviation
(mean - 2 * sdev, mean + 2 * sdev) # set to two standard deviations

**In normally distributed data, nearly all of the scores (about 99.7%) will be within 3 standard deviations of the mean.**

This uses the same code as before but now lets you input the number of standard deviations to add and subtract. Try entering 2 to confirm the same answer as before. Then, try entering 3. When you use 3 standard deviations, this gives us an interval of scores that include 99.7% of the distribution.

In [None]:
#@title Calculate Intervals Around Mean
mean = data['X'].mean()
sdev = np.std(data["X"], ddof=0) # find the population standard deviation
num_sd = int(input("Number of standard deviations away from the mean: "))
(mean - num_sd * sdev, mean + num_sd * sdev)
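
The 68-95-99.7 figures come from the normal curve itself, not from any particular data set. As a quick check, the sketch below uses `statistics.NormalDist` from Python's standard library (Python 3.8 and later) to compute the share of a standard normal distribution that lies within 1, 2, and 3 standard deviations of the mean. The variable names are chosen for illustration.

```python
from statistics import NormalDist

# A standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

# Share of scores within k standard deviations of the mean:
# area below +k minus area below -k
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in coverage.items():
    print(f"within {k} SD: {share:.1%}")
```

These percentages apply only when the distribution is normal. For a distribution with a different shape, the share of scores inside each interval can be noticeably different.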

----
## IV. What's Next

In this section we mentioned the normal distribution but did not explain its meaning or importance. In the next section, we will continue to explore distributions with a focus on their shape. We will introduce the concept of the normal distribution and when we are likely to encounter it.

#### Discussion questions


1. How does visualizing data help us to understand distributions?
2. What is the difference between measures of central tendency and measures of variability?
3. What is the difference between the mean and the median?
4. Why is it possible to have more than one mode but not more than one median?

----
## V. Summary

- In this module, we introduced fundamental and useful tools for making sense out of distributions. We call these **descriptive statistics**.
- We saw that we can summarize and visualize both qualitative and quantitative distributions. 
- Quantitative distributions have **central tendency**, which is the typical or middle value of the scores. Distributions have **spread** or **variability**, which is how different scores are from other scores in the distribution. Finally, distributions have a **shape**, which becomes apparent when they are graphed using a histogram.

---
## VI. All done, congrats! 

Today you've not only learned about describing and visualizing data, but you've also learned how to write some Python code. High five!

<img src="https://live.staticflickr.com/3471/3904325807_8ab0190152_b.jpg" alt="High-five!" width="100"/>

["High-five!"](https://live.staticflickr.com/3471/3904325807_8ab0190152_b.jpg) by Nick J Webb is licensed under CC BY 2.0