<img src="./images/banner.png" width="800">

# Describing Variability

Variability refers to the amount by which scores in a distribution are dispersed or scattered. It is a measure of how much the scores differ from each other and from the center of the distribution.


Measuring variability is crucial because it provides valuable information about the spread of the data, which is not captured by measures of central tendency alone. Understanding variability helps us:
- Describe the distribution of scores more accurately
- Identify unusual or extreme scores
- Compare the spread of different distributions
- Make informed decisions based on the data


*For example, knowing only the average water depth of a stream is not enough to decide whether it is safe to cross. The variability of the water depth is equally important for making a well-informed decision.*


Central tendency and variability are two fundamental aspects of describing a distribution of scores:
- *Central tendency* measures the center or typical value of a distribution (e.g., mean, median, mode)
- *Variability* measures the spread or dispersion of scores around the center (e.g., range, variance, standard deviation)

Both aspects are essential for understanding the nature of the distribution and should be considered together.


To develop an intuitive understanding of variability, consider the following three distributions, each with the same mean ($\mu = 10$) but different levels of variability:

Distribution A: 10, 10, 10, 10, 10, 10, 10

Distribution B: 9, 10, 10, 10, 10, 10, 11

Distribution C: 7, 9, 9, 10, 11, 11, 13

- Distribution A has the least variability (no variation among scores)
- Distribution B has intermediate variability (scores vary slightly from the mean)
- Distribution C has the most variability (scores vary more from the mean)

By visually inspecting the distributions and noting the differences among individual scores, we can intuitively grasp the concept of variability and its relationship to the spread of the data.
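
To make the comparison concrete, the following minimal Python sketch lists each score's deviation from the mean for the three distributions above. The variable names are illustrative only.

```python
# Deviations from the mean for the three example distributions.
distributions = {
    "A": [10, 10, 10, 10, 10, 10, 10],
    "B": [9, 10, 10, 10, 10, 10, 11],
    "C": [7, 9, 9, 10, 11, 11, 13],
}

deviations = {}
for name, scores in distributions.items():
    mean = sum(scores) / len(scores)  # the mean is 10 for all three
    deviations[name] = [x - mean for x in scores]
    print(name, "mean =", mean, "deviations =", deviations[name])
```

The deviations are all zero for A, mostly zero for B, and largest for C, which matches the ordering described above.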


In the following sections, we will explore various measures of variability, such as the range, variance, standard deviation, and interquartile range, which quantify the spread of a distribution and provide a more precise understanding of variability.

<img src="./images/measures-of-variability.png" width="800">

**Table of contents**<a id='toc0_'></a>    
- [Range](#toc1_)    
  - [Calculation of Range](#toc1_1_)    
  - [Advantages and Disadvantages of Using Range](#toc1_2_)    
- [Variance](#toc2_)    
  - [Weakness of Variance](#toc2_1_)    
  - [Sum of Squares (SS)](#toc2_2_)    
    - [Sum of Squares Formulas for Population](#toc2_2_1_)    
    - [Sum of Squares Formulas for Sample](#toc2_2_2_)    
  - [Variance Formula for Population](#toc2_3_)    
  - [Variance Formula for Sample](#toc2_4_)    
- [Standard Deviation](#toc3_)    
  - [Standard Deviation: An Interpretation](#toc3_1_)    
  - [Standard Deviation as a Measure of Distance (Unlike the Mean)](#toc3_2_)    
  - [Standard Deviation Formula for Population](#toc3_3_)    
  - [Standard Deviation Formula for Sample](#toc3_4_)    
- [Degrees of Freedom (df)](#toc4_)    
  - [n-1 in Sample Variance and Standard Deviation Formulas](#toc4_1_)    
  - [Mathematical Restrictions and Their Impact on Degrees of Freedom](#toc4_2_)    
- [Interquartile Range (IQR)](#toc5_)    
  - [Calculation of IQR](#toc5_1_)    
  - [Advantages of Using IQR](#toc5_2_)    
  - [Boxplots and the Relationship Between IQR and Boxplots](#toc5_3_)    
  - [IQR's Resistance to the Distorting Effect of Extreme Scores or Outliers](#toc5_4_)    
- [Measures of Variability for Qualitative and Ranked Data](#toc6_)    
  - [Qualitative Data: Noting the Division of Scores Among Classes](#toc6_1_)    
  - [Ordered Qualitative and Ranked Data: Identifying Extreme Scores or Ranks](#toc6_2_)    
- [Worked Examples and Practice Problems](#toc7_)    
  - [Worked Examples](#toc7_1_)    
    - [Example 1: Calculating Range](#toc7_1_1_)    
    - [Example 2: Calculating Variance and Standard Deviation](#toc7_1_2_)    
    - [Example 3: Calculating Interquartile Range (IQR)](#toc7_1_3_)    
  - [Practice Problems](#toc7_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_'></a>[Range](#toc0_)

The **range** is the simplest measure of variability. It is defined as the difference between the largest (maximum) and smallest (minimum) scores in a distribution.

$Range = X_{max} - X_{min}$

where $X_{max}$ is the largest score and $X_{min}$ is the smallest score.


<img src="./images/range.jpeg" width="800">

### <a id='toc1_1_'></a>[Calculation of Range](#toc0_)

To calculate the range, follow these steps:

1. Identify the largest score ($X_{max}$) in the distribution.
2. Identify the smallest score ($X_{min}$) in the distribution.
3. Subtract the smallest score from the largest score: $Range = X_{max} - X_{min}$


For example, consider the following distribution:
10, 7, 12, 9, 11, 8, 13

$X_{max} = 13$

$X_{min} = 7$

$Range = 13 - 7 = 6$
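
The same calculation can be sketched in a few lines of Python. The helper name `score_range` is an illustrative choice, not a standard function.

```python
def score_range(scores):
    """Return the largest score minus the smallest score."""
    return max(scores) - min(scores)

scores = [10, 7, 12, 9, 11, 8, 13]
print(score_range(scores))  # 13 - 7 = 6
```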


### <a id='toc1_2_'></a>[Advantages and Disadvantages of Using Range](#toc0_)


Advantages:
1. *Simplicity:* The range is easy to calculate and understand, making it a quick way to get a rough idea of the variability in a distribution.
2. *Interpretation:* The range provides a clear measure of the total spread of scores in a distribution.

Disadvantages:
1. *Sensitivity to extreme values:* The range is sensitive to outliers or extreme values because it only considers the largest and smallest scores. A single extreme value can greatly affect the range, even if the rest of the scores are clustered closely together.
2. *Lack of information about the middle of the distribution:* The range does not provide any information about the variability of scores between the minimum and maximum values.
3. *Dependence on sample size:* The range tends to increase as the sample size increases because larger samples are more likely to include extreme values.


Due to these disadvantages, the range is not the most reliable or informative measure of variability. Other measures, such as the variance and standard deviation, are often preferred because they take into account all scores in the distribution, so they do not depend solely on the two most extreme values. (They are still affected by outliers; the interquartile range, discussed later, is more resistant.)

## <a id='toc2_'></a>[Variance](#toc0_)

The **variance** is a more sophisticated measure of variability that takes into account all scores in a distribution. To understand the variance, let's reconstruct it step by step:

1. Calculate the mean of the distribution.
2. Subtract the mean from each score to obtain the deviation scores.
3. Square each deviation score to eliminate negative values.
4. Sum the squared deviation scores to obtain the sum of squares.
5. Divide the sum of squares by the number of scores (for populations) or the number of scores minus one (for samples) to obtain the variance.
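
The five steps above can be sketched directly in Python. The scores are Distribution C from the introduction; whether to treat them as a population or a sample is an assumption made only to show both denominators.

```python
scores = [7, 9, 9, 10, 11, 11, 13]  # Distribution C from the introduction

# Step 1: mean of the distribution
mean = sum(scores) / len(scores)
# Step 2: deviation scores
deviations = [x - mean for x in scores]
# Step 3: squared deviation scores
squared = [d ** 2 for d in deviations]
# Step 4: sum of squares
ss = sum(squared)
# Step 5: divide by N (population) or by n - 1 (sample)
population_variance = ss / len(scores)
sample_variance = ss / (len(scores) - 1)

print(ss, population_variance, sample_variance)
```

Here the sum of squares is 22, so the population variance is 22/7 and the sample variance is 22/6.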


<img src="./images/variance.png" width="800">

### <a id='toc2_1_'></a>[Weakness of Variance](#toc0_)

The main weakness of the variance is that it is expressed in squared units of the original scale, which can be difficult to interpret. For example, if the original scores are in inches, the variance would be in square inches, which is not as intuitive as the original unit of measurement.


### <a id='toc2_2_'></a>[Sum of Squares (SS)](#toc0_)


The **sum of squares (SS)** is a key component in calculating the variance. It is the sum of the squared deviation scores and represents the total amount of variability in a distribution. The sum of squares is important because it is used in many statistical calculations, including the variance and standard deviation.


#### <a id='toc2_2_1_'></a>[Sum of Squares Formulas for Population](#toc0_)


*Definition Formula:*
$SS = \sum(X - \mu)^2$

where $X$ is a score, $\mu$ is the population mean, and $\sum$ denotes the sum of all squared deviation scores.


*Computation Formula:*
$SS = \sum X^2 - \frac{(\sum X)^2}{N}$

where $\sum X^2$ is the sum of squared scores, $\sum X$ is the sum of scores, and $N$ is the number of scores in the population.


#### <a id='toc2_2_2_'></a>[Sum of Squares Formulas for Sample](#toc0_)


*Definition Formula:*
$SS = \sum(X - \bar{X})^2$

where $X$ is a score, $\bar{X}$ is the sample mean, and $\sum$ denotes the sum of all squared deviation scores.


*Computation Formula:*
$SS = \sum X^2 - \frac{(\sum X)^2}{n}$

where $\sum X^2$ is the sum of squared scores, $\sum X$ is the sum of scores, and $n$ is the number of scores in the sample.
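
As a quick check that the definition and computation formulas agree, here is a minimal Python sketch using a small illustrative set of scores (the data are an assumption chosen for easy arithmetic).

```python
scores = [4, 7, 3, 6, 5]  # illustrative scores
n = len(scores)
mean = sum(scores) / n

# Definition formula: sum of squared deviations from the mean
ss_definition = sum((x - mean) ** 2 for x in scores)

# Computation formula: sum of squared scores minus (sum of scores)^2 / n
ss_computation = sum(x ** 2 for x in scores) - sum(scores) ** 2 / n

print(ss_definition, ss_computation)
```

Both give 10 for these scores. In floating-point arithmetic the computation formula can lose precision when the scores are large relative to their spread, so the definition formula is usually the safer choice in code.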


### <a id='toc2_3_'></a>[Variance Formula for Population](#toc0_)

$\sigma^2 = \frac{SS}{N}$

where $\sigma^2$ is the population variance, $SS$ is the sum of squares, and $N$ is the number of scores in the population.


### <a id='toc2_4_'></a>[Variance Formula for Sample](#toc0_)

$s^2 = \frac{SS}{n - 1}$

where $s^2$ is the sample variance, $SS$ is the sum of squares, and $n$ is the number of scores in the sample. Note that the denominator is $n - 1$ (degrees of freedom) instead of $n$ to correct for bias in estimating the population variance from a sample.

Intuitively, when calculating the sample variance, we use $n - 1$ in the denominator because it helps us obtain a more accurate estimate of the population variance. When we calculate the variance of a sample, we are typically trying to estimate the variance of the larger population from which the sample was drawn.

By using $n - 1$ instead of $n$, we are essentially "stretching out" the variance to account for the fact that the sample variance tends to underestimate the true population variance. This correction is necessary because the sample mean, which is used to calculate the deviation scores, is itself an estimate based on the sample data and is likely to be closer to the sample scores than the true population mean.


Using $n - 1$ in the denominator increases the value of the sample variance slightly, making it a better estimate of the population variance. This concept is related to the idea of degrees of freedom, which will be discussed in more detail in later sections.
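
The underestimation can be checked exhaustively on a tiny example. The sketch below assumes a hypothetical population of four scores and enumerates every possible sample of size 3 drawn with replacement. It then averages the two candidate estimates over all samples, using exact fractions to avoid rounding.

```python
from fractions import Fraction
from itertools import product

population = [2, 4, 6, 8]  # hypothetical population
N = len(population)
mu = Fraction(sum(population), N)
pop_var = sum((x - mu) ** 2 for x in population) / N  # SS / N

n = 3  # sample size
samples = list(product(population, repeat=n))  # all ordered samples, with replacement

def sum_of_squares(sample):
    """Sum of squared deviations from the sample's own mean."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample)

# Average of each estimate over every possible sample
avg_div_n = sum(sum_of_squares(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_of_squares(s) / (n - 1) for s in samples) / len(samples)

print("population variance:   ", pop_var)
print("average of SS / n:     ", avg_div_n)
print("average of SS / (n-1): ", avg_div_n_minus_1)
```

Under these assumptions, dividing by $n$ averages to less than the population variance, while dividing by $n - 1$ averages to exactly the population variance.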

## <a id='toc3_'></a>[Standard Deviation](#toc0_)

The **standard deviation** is a measure of the average amount by which scores in a distribution deviate from the mean. It is the most commonly used measure of variability and is often preferred over the variance because it is expressed in the original units of measurement.


The standard deviation is calculated by taking the square root of the variance:

*Population Standard Deviation:*
$\sigma = \sqrt{\frac{SS}{N}} = \sqrt{\sigma^2}$

*Sample Standard Deviation:*
$s = \sqrt{\frac{SS}{n - 1}} = \sqrt{s^2}$

where $\sigma$ is the population standard deviation, $s$ is the sample standard deviation, $SS$ is the sum of squares, $N$ is the number of scores in the population, and $n$ is the number of scores in the sample.
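
A minimal Python sketch of both formulas follows. The scores are illustrative, and the same data are treated first as a population and then as a sample purely to contrast the two denominators.

```python
import math

scores = [4, 7, 3, 6, 5]  # illustrative scores
mean = sum(scores) / len(scores)
ss = sum((x - mean) ** 2 for x in scores)  # sum of squares

sigma = math.sqrt(ss / len(scores))    # population formula: sqrt(SS / N)
s = math.sqrt(ss / (len(scores) - 1))  # sample formula: sqrt(SS / (n - 1))

print(round(sigma, 3), round(s, 3))
```

For the same scores, the sample formula always gives the larger value because its denominator is smaller.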


<img src="./images/standard-deviation.png" width="800">

### <a id='toc3_1_'></a>[Standard Deviation: An Interpretation](#toc0_)


The standard deviation can be thought of as a rough measure of the average amount by which scores deviate from the mean. However, it is important to note that the standard deviation is not exactly equal to the average deviation because it is calculated using the squared deviations, which gives more weight to larger deviations.


In approximately normal (bell-shaped) distributions, about 68% of the scores fall within one standard deviation of the mean. This means that if you know the mean and standard deviation of such a distribution, you can get a rough idea of where most of the scores lie.


In these distributions, only a small minority of scores (approximately 5%) deviate more than two standard deviations from the mean. Scores that fall more than two standard deviations from the mean are often considered unusual or extreme.
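
For the normal distribution specifically, these percentages can be computed rather than memorized. The sketch below uses `NormalDist` from Python's standard `statistics` module. It assumes an exactly normal distribution, so real data will only approximate these figures.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Proportion of scores within one standard deviation of the mean
within_one_sd = z.cdf(1) - z.cdf(-1)
# Proportion of scores more than two standard deviations from the mean
beyond_two_sd = 1 - (z.cdf(2) - z.cdf(-2))

print(f"within 1 SD: {within_one_sd:.3f}")
print(f"beyond 2 SD: {beyond_two_sd:.3f}")
```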


### <a id='toc3_2_'></a>[Standard Deviation as a Measure of Distance (Unlike the Mean)](#toc0_)

While the mean is a measure of central tendency and represents a specific point in the distribution, the standard deviation is a measure of variability and represents a distance from the mean. This means that the standard deviation can be used to describe the typical distance of scores from the mean.


Because the standard deviation is calculated by taking the square root of the variance, it can never be negative. A standard deviation of zero indicates that there is no variability in the distribution (i.e., all scores are equal to the mean).


### <a id='toc3_3_'></a>[Standard Deviation Formula for Population](#toc0_)

$\sigma = \sqrt{\frac{SS}{N}}$

where $\sigma$ is the population standard deviation, $SS$ is the sum of squares, and $N$ is the number of scores in the population.


### <a id='toc3_4_'></a>[Standard Deviation Formula for Sample](#toc0_)

$s = \sqrt{\frac{SS}{n - 1}}$

where $s$ is the sample standard deviation, $SS$ is the sum of squares, and $n$ is the number of scores in the sample.


As a general rule, the standard deviation should be no larger than about one-half of the range. If the calculated standard deviation is much larger than this, it likely indicates an error in the calculations, so double-check your work.

## <a id='toc4_'></a>[Degrees of Freedom (df)](#toc0_)

**Degrees of freedom (df)** is a concept in statistics that refers to the number of independent values or quantities that can vary in an analysis without violating any constraints or restrictions. In other words, it is the number of values that are free to vary while estimating a statistical parameter.

- **Example 1:** Consider a data sample consisting of five positive integers. The values of the five integers must have an average of six. If four items within the data set are {3, 8, 5, and 4}, the fifth number must be 10. Because the first four numbers can be chosen at random, the degree of freedom is four.
- **Example 2:** Consider a data sample consisting of five positive integers. The values could be any number with no known relationship between them. Because all five can be chosen at random with no limitations, the degree of freedom is five.
- **Example 3:** Consider a data sample consisting of one integer. That integer must be odd. Because there are constraints on the single item within the data set, the degree of freedom is zero.


### <a id='toc4_1_'></a>[n-1 in Sample Variance and Standard Deviation Formulas](#toc0_)

When calculating the sample variance and sample standard deviation, we use $n - 1$ in the denominator instead of $n$. This is because of the concept of degrees of freedom.


In the case of the sample variance and standard deviation, we are using the sample mean $\bar{X}$ to estimate the population mean $\mu$. The sample mean is calculated from the same data that we are using to calculate the variance and standard deviation. As a result, the sample mean introduces a constraint or restriction on the variability of the data.

Specifically, the sum of the deviations of scores from their sample mean is always zero:

$\sum(X - \bar{X}) = 0$


This constraint means that if we know the values of $n - 1$ deviations, the value of the $n$th deviation is automatically determined (it must be the negative sum of the other $n - 1$ deviations). Thus, there are only $n - 1$ independent deviations or degrees of freedom.
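
A short Python sketch makes the constraint visible. It reuses the five integers with a mean of six from Example 1 above; the variable names are illustrative.

```python
scores = [3, 8, 5, 4, 10]  # five scores with a mean of 6
mean = sum(scores) / len(scores)
deviations = [x - mean for x in scores]

# The deviations from the sample mean always sum to zero
# (up to floating-point rounding when the mean is not exact).
print(sum(deviations))

# Knowing the first n - 1 deviations fixes the last one.
implied_last = -sum(deviations[:-1])
print(implied_last, deviations[-1])
```

Only four of the five deviations are free to vary; the fifth is forced to be whatever value brings the total back to zero.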


By using $n - 1$ in the denominator of the sample variance and standard deviation formulas, we are accounting for the fact that we have lost one degree of freedom due to the constraint introduced by the sample mean.


### <a id='toc4_2_'></a>[Mathematical Restrictions and Their Impact on Degrees of Freedom](#toc0_)

The concept of degrees of freedom applies whenever we are estimating population parameters from sample statistics. In general, the number of degrees of freedom is equal to the number of independent values minus the number of parameters estimated from the data.


For example, when estimating the parameters of a linear regression model with one predictor variable, we estimate two parameters (the slope and the intercept) from the data. As a result, the degrees of freedom for the residuals (the differences between the observed and predicted values) is $n - 2$, where $n$ is the number of observations.


Similarly, when conducting a t-test (with pooled variance) to compare the means of two independent samples, the degrees of freedom is $n_1 + n_2 - 2$, where $n_1$ and $n_2$ are the sample sizes of the two groups. This is because two parameters (the two population means) are estimated from the data by the two sample means, and each estimate costs one degree of freedom.


Understanding the concept of degrees of freedom is crucial for correctly interpreting the results of statistical analyses and for making valid inferences about populations based on sample data.

## <a id='toc5_'></a>[Interquartile Range (IQR)](#toc0_)

**Quartiles** are values that divide a ranked dataset into four equal parts. The first quartile (Q1) is the middle value between the smallest value and the median. The second quartile (Q2) is the median. The third quartile (Q3) is the middle value between the median and the highest value.

The **Interquartile Range (IQR)** is a measure of variability that is based on quartiles. It is defined as the difference between the third quartile (Q3) and the first quartile (Q1):

$IQR = Q3 - Q1$


<img src="./images/interquartile-range.png" width="800">

### <a id='toc5_1_'></a>[Calculation of IQR](#toc0_)

To calculate the IQR, follow these steps:

1. Arrange the data in ascending order.
2. Determine the median (Q2) of the dataset.
3. Calculate the median of the lower half of the dataset (values below Q2) to find Q1.
4. Calculate the median of the upper half of the dataset (values above Q2) to find Q3.
5. Subtract Q1 from Q3 to obtain the IQR.
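
The steps above can be sketched as a small Python function. The function name and the example scores are illustrative. The sketch follows the median-of-halves convention described here, excluding the median from both halves when the number of scores is odd. Other quartile conventions exist and can give slightly different values.

```python
from statistics import median

def quartiles(scores):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(scores)   # Step 1: ascending order
    half = len(ordered) // 2
    q2 = median(ordered)       # Step 2: median of the whole dataset
    lower = ordered[:half]     # values below the median position
    upper = ordered[-half:]    # values above the median position
    return median(lower), q2, median(upper)  # Steps 3 and 4

scores = [9, 1, 7, 3, 5, 11, 13]  # illustrative scores
q1, q2, q3 = quartiles(scores)
iqr = q3 - q1                     # Step 5
print(q1, q2, q3, iqr)
```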


### <a id='toc5_2_'></a>[Advantages of Using IQR](#toc0_)

The IQR has several advantages as a measure of variability:

1. It is less sensitive to extreme values or outliers compared to the range because it only considers the middle 50% of the data.
2. It is a good measure of variability for skewed distributions, where the mean and standard deviation may not be appropriate.
3. It is easy to calculate and interpret, especially in conjunction with the median as a measure of central tendency.


### <a id='toc5_3_'></a>[Boxplots and the Relationship Between IQR and Boxplots](#toc0_)

A **boxplot** (also known as a box-and-whisker plot) is a graphical representation of a dataset that shows the distribution of the data based on quartiles. The IQR is a key component of a boxplot.

In a boxplot:
- The box represents the IQR, with the bottom edge of the box representing Q1 and the top edge representing Q3.
- The line inside the box represents the median (Q2).
- The whiskers extend from the box to the minimum and maximum values, excluding outliers.
- Outliers (if any) are represented as individual points beyond the whiskers.


The IQR is directly related to the size of the box in a boxplot. A larger IQR indicates greater variability in the middle 50% of the data and will result in a longer box in the boxplot.


### <a id='toc5_4_'></a>[IQR's Resistance to the Distorting Effect of Extreme Scores or Outliers](#toc0_)

One of the main advantages of the IQR is its resistance to the distorting effect of extreme scores or outliers. Because the IQR only considers the middle 50% of the data, it is not affected by changes in the minimum or maximum values, as long as those changes do not affect the quartiles.


For example, consider a dataset with values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100. The range of this dataset is 99 (100 - 1), which is heavily influenced by the extreme value of 100. However, with the median-of-halves method described above, Q1 = 3 and Q3 = 9, so the IQR is only 6 (9 - 3), which reflects the variability in the middle 50% of the data and is not affected by the extreme value.


This resistance to outliers makes the IQR a robust measure of variability, particularly for datasets with extreme values or skewed distributions.
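
To see this resistance in action, the sketch below compares the range and the IQR for the dataset above with and without its extreme value. The helper `iqr` is an illustrative function that uses the median-of-halves method from this section, and replacing 100 with 11 is an assumption made only for the comparison.

```python
from statistics import median

def iqr(scores):
    """IQR via the median-of-halves method (median excluded when n is odd)."""
    ordered = sorted(scores)
    half = len(ordered) // 2
    return median(ordered[-half:]) - median(ordered[:half])

with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]
without_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # 100 replaced by 11

for label, data in [("with outlier", with_outlier),
                    ("without outlier", without_outlier)]:
    print(label, "range =", max(data) - min(data), "IQR =", iqr(data))
```

The range changes dramatically between the two datasets, while the IQR stays the same.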

## <a id='toc6_'></a>[Measures of Variability for Qualitative and Ranked Data](#toc0_)

When dealing with qualitative (categorical) data or ranked data, the measures of variability discussed earlier, such as range, variance, and standard deviation, are not applicable. However, there are still ways to describe the variability in these types of data.


### <a id='toc6_1_'></a>[Qualitative Data: Noting the Division of Scores Among Classes](#toc0_)

For qualitative data, variability can be described by observing how the scores are divided among the different categories or classes. There are three main ways to describe the variability in qualitative data:

1. **Maximum variability**: When the scores are evenly divided among all categories, the data is said to have maximum variability or diversity. This indicates that each category has roughly the same number of observations.

2. **Minimum variability**: When most of the scores fall into a single category, with only a few scores in the other categories, the data is said to have minimum variability or diversity. This indicates that one category is dominant, and there is little variation in the data.

3. **Intermediate variability**: When the scores are unevenly divided among the categories, but there is no single dominant category, the data is said to have intermediate variability or diversity. This indicates that some categories are more common than others, but there is still a fair amount of variation in the data.


For example, consider a dataset on the color of cars in a parking lot, with categories: red, blue, green, and yellow. If the number of cars in each color category is roughly equal, the data has maximum variability. If most cars are blue, with only a few cars in the other color categories, the data has minimum variability. If the number of cars in each color category is uneven, but no single color dominates, the data has intermediate variability.
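
One simple way to note how scores are divided among categories is to tabulate the proportion in each class. The sketch below does this with `collections.Counter`. The color categories come from the example above, but the counts are made up purely for illustration.

```python
from collections import Counter

# Hypothetical parking lots; the counts are invented for illustration.
lots = {
    "evenly divided": ["red"] * 25 + ["blue"] * 25 + ["green"] * 25 + ["yellow"] * 25,
    "one dominant color": ["red"] * 5 + ["blue"] * 85 + ["green"] * 5 + ["yellow"] * 5,
    "uneven, none dominant": ["red"] * 40 + ["blue"] * 30 + ["green"] * 20 + ["yellow"] * 10,
}

proportions = {}
for label, cars in lots.items():
    counts = Counter(cars)  # tally the cars in each color category
    proportions[label] = {color: count / len(cars) for color, count in counts.items()}
    print(label, proportions[label])
```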


### <a id='toc6_2_'></a>[Ordered Qualitative and Ranked Data: Identifying Extreme Scores or Ranks](#toc0_)

For ordered qualitative data (data with categories that have a natural order) and ranked data (data where observations are assigned ranks based on some criteria), variability can be described by identifying the extreme scores or ranks.

To describe the variability in these types of data:

1. Identify the lowest and highest categories or ranks in the dataset.
2. Report the range of categories or ranks present in the data.

For example, consider a dataset on the education level of employees in a company, with categories: high school, bachelor's degree, master's degree, and doctorate. If the dataset includes employees with education levels ranging from high school to doctorate, you can report that the data covers the full range of education levels. If the dataset only includes employees with bachelor's and master's degrees, you can report that the variability in education levels is limited to these two categories.


Similarly, for ranked data, you can report the lowest and highest ranks present in the dataset to describe the variability. For instance, in a dataset of employee performance rankings from 1 to 100, if the lowest rank is 25 and the highest rank is 90, you can report that the variability in performance rankings spans from 25 to 90.


While these methods of describing variability for qualitative and ranked data are not as precise as the measures used for quantitative data, they still provide valuable information about the spread and diversity of the data.

## <a id='toc7_'></a>[Worked Examples and Practice Problems](#toc0_)

In this section, we work through step-by-step examples for each measure of variability discussed in the previous sections, followed by practice problems you can use to apply and reinforce these concepts.


### <a id='toc7_1_'></a>[Worked Examples](#toc0_)


#### <a id='toc7_1_1_'></a>[Example 1: Calculating Range](#toc0_)

Consider the following dataset: 12, 15, 9, 11, 13, 8, 16.

Step 1: Identify the minimum value (8) and the maximum value (16).

Step 2: Calculate the range by subtracting the minimum value from the maximum value.
- Range = 16 - 8 = 8


#### <a id='toc7_1_2_'></a>[Example 2: Calculating Variance and Standard Deviation](#toc0_)

Consider the following dataset: 4, 7, 3, 6, 5.


Step 1: Calculate the mean of the dataset.
- Mean: $\bar{X} = (4 + 7 + 3 + 6 + 5) \div 5 = 5$

Step 2: Calculate the deviations from the mean and square them.
- $(4 - 5)^2 = (-1)^2 = 1$
- $(7 - 5)^2 = 2^2 = 4$
- $(3 - 5)^2 = (-2)^2 = 4$
- $(6 - 5)^2 = 1^2 = 1$
- $(5 - 5)^2 = 0^2 = 0$

Step 3: Calculate the sum of squared deviations (SS).
- $SS = 1 + 4 + 4 + 1 + 0 = 10$

Step 4: Calculate the variance by dividing the SS by $(n - 1)$.
- Variance: $s^2 = 10 \div (5 - 1) = 2.5$

Step 5: Calculate the standard deviation by taking the square root of the variance.
- Standard Deviation: $s = \sqrt{2.5} \approx 1.58$
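
The hand calculation can be cross-checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the $n - 1$ denominator. This is a minimal sketch for verification only.

```python
import statistics

scores = [4, 7, 3, 6, 5]

sample_variance = statistics.variance(scores)  # SS / (n - 1)
sample_sd = statistics.stdev(scores)           # square root of the sample variance

print(sample_variance, round(sample_sd, 2))
```

If the scores were a complete population rather than a sample, `statistics.pvariance` and `statistics.pstdev` would apply instead, since they divide by $N$.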


#### <a id='toc7_1_3_'></a>[Example 3: Calculating Interquartile Range (IQR)](#toc0_)

Consider the following dataset: 2, 5, 8, 12, 15, 18, 20.


Step 1: Arrange the data in ascending order.
- 2, 5, 8, 12, 15, 18, 20

Step 2: Identify the median (Q2).
- Median (Q2) = 12

Step 3: Identify the first quartile (Q1) and the third quartile (Q3).
- Q1 = 5 (median of the lower half of the data: 2, 5, 8)
- Q3 = 18 (median of the upper half of the data: 15, 18, 20)

Step 4: Calculate the IQR by subtracting Q1 from Q3.
- IQR = Q3 - Q1 = 18 - 5 = 13
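
Software does not always use the median-of-halves convention, so computed quartiles can differ from a hand calculation. The sketch below uses `statistics.quantiles` from Python's standard library on the same dataset. For these particular scores the `"exclusive"` method happens to reproduce the quartiles found above, while the `"inclusive"` method interpolates differently.

```python
from statistics import quantiles

scores = [2, 5, 8, 12, 15, 18, 20]

# "exclusive" method (the default for statistics.quantiles)
q1, q2, q3 = quantiles(scores, n=4, method="exclusive")
print("exclusive:", q1, q2, q3, "IQR =", q3 - q1)

# "inclusive" method treats the data as spanning the full range
q1_inc, q2_inc, q3_inc = quantiles(scores, n=4, method="inclusive")
print("inclusive:", q1_inc, q2_inc, q3_inc, "IQR =", q3_inc - q1_inc)
```

When reporting an IQR, it helps to state which quartile method was used.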


### <a id='toc7_2_'></a>[Practice Problems](#toc0_)

1. Calculate the range for the following dataset: 23, 17, 32, 19, 25, 28.
2. Calculate the variance and standard deviation for the following dataset: 10, 12, 8, 14, 11, 13.
3. Calculate the interquartile range (IQR) for the following dataset: 6, 3, 9, 12, 7, 10, 5, 8.


Solutions:
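1. Range = 32 - 17 = 15
2. Treating the data as a sample: SS ≈ 23.33, Variance ≈ 23.33 ÷ (6 - 1) ≈ 4.67, Standard Deviation ≈ 2.16
3. Ordered data: 3, 5, 6, 7, 8, 9, 10, 12. Q1 = 5.5 and Q3 = 9.5, so IQR = 9.5 - 5.5 = 4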


These worked examples and practice problems are meant to show how to calculate and apply the different measures of variability. Work through the practice problems on your own before checking your answers against the solutions above.