In the previous article in this series, we explored the concept of central tendency. Central tendency lets us grasp the "middle" of the data, but it tells us nothing about the data's variability, that is, how the data is spread out, or its <b>dispersion</b>.

For example, the mean/median/mode of [20, 30, 40, 40, 40, 50, 60] is 40, but the mean/median/mode of [38, 39, 40, 40, 40, 41, 42] is also 40, even though the first set is far more spread out than the second.
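To make this concrete, here is a minimal sketch using only Python's standard `statistics` module. The variable names `wide` and `narrow` are illustrative labels for the two lists above, and the range (maximum minus minimum) is used as a quick stand-in for spread.

```python
from statistics import mean, median, mode

# The two example lists from the text (names are illustrative).
wide = [20, 30, 40, 40, 40, 50, 60]
narrow = [38, 39, 40, 40, 40, 41, 42]

# Central tendency is identical for both lists...
wide_center = (mean(wide), median(wide), mode(wide))
narrow_center = (mean(narrow), median(narrow), mode(narrow))

# ...but the spread (here: max - min) is very different.
wide_range = max(wide) - min(wide)
narrow_range = max(narrow) - min(narrow)

print('center (wide):   {}'.format(wide_center))
print('center (narrow): {}'.format(narrow_center))
print('range (wide):    {}'.format(wide_range))
print('range (narrow):  {}'.format(narrow_range))
```

Both lists report the same three "centers", so only a dispersion measure can tell them apart.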

Studying dispersion is very important when working with statistical data. Let's say we are pursuing a job in data science after seeing all the hype around the field. We are interested in the big money. Well, we can check Glassdoor and find that the average base pay is ~$120,000. But is that really the pay someone should expect as a junior data scientist? Probably not. The central tendency of the data doesn't tell us the whole story. This is why it is important to also study the dispersion in data.

In this exercise, we are going to use the bike sharing dataset from the UCI Machine Learning Repository again. For a full description of the dataset, click [here](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline

data = pd.read_csv('bike_rental_hour.csv')

### Measuring Dispersion

There are two categories of dispersion measures, and within these two categories there are five measures I want to talk about.

1) Absolute Measures of Dispersion 
    + Range
    + Quartile Deviation
    + Mean Deviation
    + Standard Deviation
    + Variance
    
2) Relative Measures of Dispersion
    + Coefficient of Range
    + Coefficient of Quartile Deviation
    + Coefficient of Mean Deviation
    + Coefficient of Standard Deviation
    + Coefficient of Variation

I won't go into detail about the differences between the two categories, but I will go into detail on each measure. For more details, click [here](https://www.emathzone.com/tutorials/basic-statistics/measures-of-dispersion.html).


In summary, the absolute measures of dispersion give answers in the same units as the original observations. 
The relative measures of dispersion are ratios and do not give answers in the same units as the original observations.

#### The Range

Definition:

$$ \text{Range} = x_{max} - x_{min} $$

Where $x_{max}$ is the maximum value in a given set of observations and $x_{min}$ is the minimum value. The range only tells us about the total spread of the data by taking the difference between the maximum and minimum values. The practical uses for the range are very limited in statistics. However, this measure often appears in its equation form in specification sheets. For example, clothes are separated into various size categories, and each size category generally has to fit a certain range of measurements.

The Coefficient of Range is defined as:

$$ \text{Coefficient of Range} = \frac{x_{max} - x_{min}} {x_{max} + x_{min}}$$

This equation is the standardized form of $x_{max} - x_{min}$. The coefficient of range is a bit more useful because it allows for comparison across different sets of data.
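As a rough illustration of why the standardized form helps with comparison, the sketch below computes both measures for the same made-up heights expressed in centimetres and in metres. The data and the helper names `value_range` and `coefficient_of_range` are hypothetical and only follow the two formulas above.

```python
# Hypothetical data: the same five heights in two different units.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
heights_m = [h / 100 for h in heights_cm]

def value_range(values):
    """Range = x_max - x_min (same units as the data)."""
    return max(values) - min(values)

def coefficient_of_range(values):
    """Coefficient of Range = (x_max - x_min) / (x_max + x_min) (unitless)."""
    x_max, x_min = max(values), min(values)
    return (x_max - x_min) / (x_max + x_min)

range_cm = value_range(heights_cm)
range_m = value_range(heights_m)
cor_cm = coefficient_of_range(heights_cm)
cor_m = coefficient_of_range(heights_m)

# The range changes with the unit; the coefficient does not.
print('Range: {:.2f} cm vs {:.2f} m'.format(range_cm, range_m))
print('COR:   {:.4f} vs {:.4f}'.format(cor_cm, cor_m))
```

The range differs by a factor of 100 between the two lists, while the coefficient of range is the same up to floating-point rounding, because the unit cancels in the ratio.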


In [2]:
print('Range_cnt: {}'.format((data['cnt'].max() - data['cnt'].min())))
print('COR_cnt: {}'.format((data['cnt'].max() - data['cnt'].min())/(data['cnt'].max() + data['cnt'].min())))
print('-----')
print('Range_casual: {}'.format((data['casual'].max() - data['casual'].min())))
print('COR_casual: {}'.format((data['casual'].max() - data['casual'].min())/(data['casual'].max() + data['casual'].min())))

Range_cnt: 976
COR_cnt: 0.9979550102249489
-----
Range_casual: 367
COR_casual: 1.0


In the code above, we calculated the range and the coefficient of range for the 'cnt' column and the 'casual' column. The range for the 'cnt' column is 976, and the range for the 'casual' column is 367. This does <b>not</b> imply that the dispersion is greater in the 'cnt' column. The range is an absolute measure specific to the column itself.

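We have to compare the dispersion using the coefficient of range. By this measure, both columns show similar levels of dispersion. Keep in mind, though, that the coefficient is exactly 1 whenever the minimum is 0, as it is for the 'casual' column, so values this close to 1 mostly tell us that the minimums are at or near zero.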
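The effect of a zero minimum on the coefficient of range can be checked on small made-up lists. This is only a sketch: the lists and the helper name `coefficient_of_range` are hypothetical, and the helper simply follows the formula given earlier.

```python
def coefficient_of_range(values):
    """Coefficient of Range = (x_max - x_min) / (x_max + x_min)."""
    x_max, x_min = max(values), min(values)
    return (x_max - x_min) / (x_max + x_min)

# Hypothetical lists: two start at 0 with very different spreads,
# one has a small spread but a minimum well above 0.
tight_from_zero = [0, 1, 2, 3]
wide_from_zero = [0, 100, 500, 1000]
tight_from_ten = [10, 11, 12, 13]

cor_tight_zero = coefficient_of_range(tight_from_zero)
cor_wide_zero = coefficient_of_range(wide_from_zero)
cor_tight_ten = coefficient_of_range(tight_from_ten)

print('COR (tight, min 0):  {}'.format(cor_tight_zero))
print('COR (wide, min 0):   {}'.format(cor_wide_zero))
print('COR (tight, min 10): {:.4f}'.format(cor_tight_ten))
```

Both lists that start at 0 give a coefficient of 1 regardless of how spread out they are, which is worth remembering when reading coefficients for count data such as rentals per hour.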

#### The Mean Absolute Deviation

The arithmetic mean of absolute deviations from the central tendency in a set of observations is defined as the mean deviation. The central tendency, or the "middle" of the data can be the mean, the median, or the mode.

$$ \text{Mean Deviation} = \frac{\sum \lvert X - X_{center} \rvert} {n} $$

Logically speaking, this formula represents the average of the absolute distances from the "center" in a given set of data. This value provides some indication of the variability of the data.
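The formula can be sketched in a few lines of plain Python. The function name `mean_deviation` and its `center` parameter are assumptions made for illustration, and the two lists are the small examples from the introduction rather than the bike sharing data.

```python
from statistics import mean, median, mode

# Map each supported "center" name to the function that computes it.
CENTERS = {'mean': mean, 'median': median, 'mode': mode}

def mean_deviation(values, center='mean'):
    """Average absolute distance of each value from the chosen center."""
    if center not in CENTERS:
        raise ValueError('center must be one of {}'.format(sorted(CENTERS)))
    c = CENTERS[center](values)
    return sum(abs(x - c) for x in values) / len(values)

wide = [20, 30, 40, 40, 40, 50, 60]
narrow = [38, 39, 40, 40, 40, 41, 42]

md_wide = mean_deviation(wide)                       # center = mean
md_narrow = mean_deviation(narrow, center='median')  # center = median

print('Mean deviation (wide):   {:.3f}'.format(md_wide))
print('Mean deviation (narrow): {:.3f}'.format(md_narrow))
```

Although both lists share the same center of 40, the first has a mean deviation ten times larger than the second (60/7 versus 6/7), which matches the intuition that it is more spread out.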