# Descriptive Statistics

When we are first presented with a sizable amount of data, we need to be able to describe it quickly and concisely. This is where **descriptive statistics** comes in: it lets us summarize data with a few numbers, using familiar tools like the mean, median, and mode. After all, we cannot scroll through hundreds, thousands, or millions of data records and expect to get meaningful information.


Let's first get set up. We are going to bring in Pandas and NumPy so we can work with data in Python easily. Let's then bring in a dataset containing lifetime data (in hours) for a given type of lightbulb. There are 150 lightbulb lifetimes in this sample. Let's take a look.


<div hidden>
# Source Python for generating lightbulb life data (in hours) 
# with mean of 675 hours of life and standard deviation 
# of 50 hours 
for v in np.random.normal(loc=675,scale=50,size=150):
    print(round(v))
</div>

In [None]:
import pandas as pd
import numpy as np 

data_url = r"https://raw.githubusercontent.com/thomasnield/machine-learning-demo-data/master/distribution/lightbulb_data.csv"
df = pd.read_csv(data_url)
df

## Initial Assessment of Data

It is very easy to dive right into a dataset and start doing quantitative analysis, creating descriptive statistics like the mean and standard deviation as well as creating distributions. However, we need to refrain from that and focus on the *qualitative* data analysis first. This is critical. 

First, we need to clarify where this data came from. Is this data for one specific type of lightbulb of a specific brand? Or various brands and models of lightbulbs? Do we interpret the number of hours as the time the lightbulb lasted before burning out?

There are no stupid questions in understanding data! **It is far more important to ask not just what the data says, but also where it came from.** We need to get context for the problem being solved as well as what produced the data. 

Next, we need to assess what objective we are aligning the data to, and if it fits this objective.  Let's say we are evaluating the life of a new light bulb model, and this data reflects the result of testing 150 units of these light bulbs. We want to analyze this data so we can assess what consumers can expect out of this lightbulb's lifetime. 

<pre>
  ..---..
 /       \
|         |
:         ;
 \  \~/  /
  `, Y ,'
   |_|_|
   |===|
   |===|
    \_/
</pre>


Perhaps we should also ask if these tests were in tightly controlled environments with consistent temperatures, humidity etc. Maybe we want the environment to be tightly controlled and consistent. Maybe we want more varied conditions to reflect how the bulbs will perform in different climates. It depends! But you want this well-defined so you have context going into the data.

Something else you want to watch out for is **bias** in the data, where the sample systematically differs from the population it is supposed to represent. If the light bulbs were created in a tightly controlled and consistent environment, this may bias our data and inflate their performance compared to when they are put out in the real world. Or what if, due to supply chain constraints, a quick and accessible material was used to build these 150 light bulbs, and that material happens to perform better than the material that will be deployed to consumers? These are all examples of bias, which can mislead us in our conclusions about the population based on the sample.

Speaking of... let's address samples and populations. 

## Samples and Populations

A **population** is a particular group of interest we want to study, such as all teenagers in North America or all light bulbs of this type that will be released to the public. However, it can be impractical to access an entire population, let alone survey or test it. This is why we often rely on a **sample**, which is ideally a randomly selected part of the population. Sometimes a population can be highly abstract and theoretical, and we have to treat any data we gather as a sample. If we are studying flight delays at a particular airport at a certain time of day, but we do not have many flights at that time of day, we treat the data we do have as a sample of all theoretical flights that could happen at that time.

Since this is a new light bulb model and we only have 150 units, we treat it as a sample of all light bulbs that have yet to be built.
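
To make the relationship between a sample and a population concrete, here is a minimal sketch that simulates a population and draws a random sample from it. The population size of 100,000, the seed, and the normal distribution with a mean of 675 hours and a standard deviation of 50 hours are all assumptions chosen for illustration. In reality, the population of bulbs that have yet to be built cannot be observed.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 simulated bulb lifetimes (hours)
population = [random.gauss(675, 50) for _ in range(100_000)]

# A sample is a randomly selected part of the population
sample = random.sample(population, 150)

population_mean = statistics.fmean(population)  # plays the role of mu
sample_mean = statistics.fmean(sample)          # plays the role of x-bar

print(f"population mean: {population_mean:.1f}")
print(f"sample mean:     {sample_mean:.1f}")
```

The sample mean should land near the population mean without matching it exactly, which is why we keep track of whether a number describes a sample or a population.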


Now that we understand the ideas of a sample's relationship to a population, let's start learning some tools to describe data in both samples and populations.

> This idea of sampling and bias extends to generative AI applications like ChatGPT and Midjourney too! If ChatGPT only has a small and biased sample of training data pertaining to a certain subject, that is going to affect the quality of its output when prompted on that subject. 

## Mean and Weighted Mean

The **mean** (also known as the **average**) is where the “center of gravity” exists for an observed set of values. It sums up the data points and divides by the number of data points $ n $. 

### Sample Mean

$
\Large{\bar{x} = \frac{x_1 + x_2 + x_3 + ... + x_n}{n} = \frac{\sum{x_i}}{n}}
$

### Population Mean

$
\Large{\mu = \frac{x_1 + x_2 + x_3 + ... + x_N}{N} = \frac{\sum{x_i}}{N}}
$

It may be disconcerting to see samples and populations using different symbols for the same things. For example, the mean is denoted by $\bar{x}$ for a sample but $\mu$ for a population. Then the number of elements is $n$ for a sample but $N$ for a population. This is to remind you which you are working with! Otherwise they are mathematically the same. 

You can calculate this from scratch using simple Python loops, generators, and comprehensions. 

In [None]:
mean = sum(v[0] for v in df.values) / len(df.values)
mean 

[In Pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html), you can simply use the `mean()` function to calculate the mean for each column in a `DataFrame`. 

In [None]:
df.mean()

[NumPy arrays](https://numpy.org/doc/stable/reference/generated/numpy.mean.html) unsurprisingly work in the same way. Below, we extract the `ndarray` and then calculate the `mean()`.

In [None]:
df.values.mean()

You are likely well-acquainted with the mean, so here is something lesser known about it. The mean is actually a weighted mean giving equal weight to each item. 

To take the weighted mean $ W $ of all elements $ x_i $ with their respective weights $ w_i $, multiply each element by its weight, add the products together, and divide by the sum of all weights.

$
\Large{W = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + ... + w_n x_n}{w_1 + w_2 + w_3 + ... + w_n}}
$

Here's a familiar example of a weighted mean. If a professor gives her students three exams each with a weight of .20, and a final exam with a weight of .40, you can weight the grades accordingly.

In [None]:
# 3 exams of .20 weight each and final exam of .40 weight
sample = [75, 89, 94, 80]
weights = [.20, .20, .20, .40]

weighted_mean = sum(s * w for s,w in zip(sample, weights)) / sum(weights)

print(weighted_mean) # prints 83.6

The weights do not necessarily have to add up to 1.0. They can be any arbitrary numbers, as long as they are also summed in the denominator, which scales them into proportions.

Below, we have all four exams equally weighted by having a value of  $ 1 $. This is no different than a typical average. 

In [None]:
# All four exams have equal weight, no different than typical average
sample = [75, 89, 94, 80]
weights = [1, 1, 1, 1]

weighted_mean = sum(s * w for s,w in zip(sample, weights)) / sum(weights)

print(weighted_mean) # prints 84.5
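
As a quick check on the claim that weights need not add up to 1.0, the sketch below wraps the same calculation in a helper function and rescales the weights. The function name `weighted_mean` and the alternative weight lists are illustrative choices, not part of any library.

```python
import math

def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

sample = [75, 89, 94, 80]

fractional = weighted_mean(sample, [0.20, 0.20, 0.20, 0.40])
rescaled = weighted_mean(sample, [1, 1, 1, 2])  # same proportions, sums to 5
equal = weighted_mean(sample, [7, 7, 7, 7])     # any equal weights give the plain mean

print(fractional, rescaled, equal)
print(math.isclose(fractional, rescaled))           # same proportions, same result
print(math.isclose(equal, sum(sample) / len(sample)))
```

Only the proportions between the weights matter. NumPy offers the same calculation through `np.average`, which accepts a `weights` argument.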

## Median 

Let's say it is the 1980s and we are analyzing data on students who graduated from a geography program at a decent university. Here is the dataset. We are looking specifically at how much salary each student made at their first job upon graduation.

In [None]:
incomes = pd.read_csv(r"https://raw.githubusercontent.com/thomasnield/machine-learning-demo-data/master/distribution/extreme_outlier.csv")

incomes

Let's take the mean of these 221 students.

In [None]:
incomes.mean()

WHOA! The average graduate from this program made $248,192! This must be a fantastically lucrative geography degree that gets high-paying jobs! 

Or is that the case? What's really going on here? Let's sort the values in descending order first, with the highest incomes on top.


In [None]:
incomes.sort_values(by=['income'],ascending=False)

Ah okay, here's what's happening. Some lucky graduate made it big and is making \\$50 million! The rest of his peers are making modest incomes around \\$22K give or take. That one graduate is distorting the average and making the graduate program look much more lucrative than it actually is. He must have started a wildly successful company, a hedge fund, or... perhaps became a famous sports figure? 


This is exactly what happened to the University of North Carolina at Chapel Hill. Michael Jordan, one of the most famous NBA players of all time, graduated from UNC with a geography degree.

To make sense of this, we are going to need another statistic called the median. The **median** is the middlemost value in a set of ordered values. You sort the values, and the median is the centermost one. If you have an even number of values, you average the two centermost values.

<pre>
1     5     7     11     105

            ^
         Median! 
</pre>
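
The median rule can be sketched from scratch in a few lines. The function name `median_from_scratch` is a hypothetical helper for illustration; in practice you would use the Pandas `median()` function.

```python
def median_from_scratch(values):
    """Middlemost value of the sorted data.

    With an even count, average the two centermost values.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_median = median_from_scratch([1, 5, 7, 11, 105])  # single middle value
even_median = median_from_scratch([11, 1, 7, 5])      # average of 5 and 7

print(odd_median, even_median)
```

Notice that the input does not need to be sorted beforehand, because the function sorts a copy of the values first.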

<div hidden="true">
import numpy as np

incomes = np.random.normal(loc=22000,scale=1000,size=220)
incomes = np.append(incomes, 50_000_000)

print(np.mean(incomes), np.median(incomes))

for v in incomes:
    print(round(v))

</div>

Let's take a look at the median instead of the mean for this dataset. We summarize the data with something much more reasonable now. 

In [None]:
incomes.median()

When your median is far from your mean, that is a sign of a skewed dataset, often one with outliers. This is why the median is so useful for outlier-heavy data (such as income-related data). It is less sensitive to outliers because it cuts the data strictly down the middle based on relative order, regardless of how extreme the largest and smallest values are.
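
To see this sensitivity on a small scale, the sketch below reuses the five values from the median illustration and compares the mean and median with and without the outlier `105`. It uses Python's standard `statistics` module rather than Pandas to stay self-contained.

```python
import statistics

values = [1, 5, 7, 11, 105]    # 105 is the outlier
without_outlier = values[:-1]  # [1, 5, 7, 11]

mean_with = statistics.mean(values)
median_with = statistics.median(values)
mean_without = statistics.mean(without_outlier)
median_without = statistics.median(without_outlier)

print(f"with outlier:    mean={mean_with}, median={median_with}")
print(f"without outlier: mean={mean_without}, median={median_without}")
```

Removing one value moves the mean from 25.8 down to 6, while the median only moves from 7 to 6.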

## Mode 

The **mode** is the most frequently occurring value (or values) in a dataset. It is most useful when values are discrete rather than continuous, so we often use the mode on boolean, integer, or categorical variables.

Here is an example. A meteorologist is looking at incidents and whether a tornado was present (1) or not (0). We can use the Pandas `mode()` function to determine whether a tornado was present more often than not. Below, we see `1` is the mode, so a tornado was present more often than not.

In [None]:
tornado_present = pd.Series([1,1,1,1,1,1,0,1,0,0,0])

tornado_present.mode()

Some datasets can be **bimodal**, meaning that two values are tied for the most occurrences, and therefore there are two modes. Here is a series of body weight measurements somebody took with different scales. Below, we see that `183` and `185` both occur 3 times, so both are reported by the `mode()` function.

In [None]:
weight_measurements = pd.Series([185,183,182,183,187,185,185,186,183,182])

weight_measurements.mode()

In practice, you will not use the mode that often unless your data is repetitive. 
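
For intuition on how ties produce more than one mode, here is a minimal from-scratch sketch using `collections.Counter` from the standard library. The helper name `modes` is hypothetical, and this is not necessarily how Pandas implements `mode()` internally.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest count, sorted ascending."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

tornado_modes = modes([1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0])
weight_modes = modes([185, 183, 182, 183, 187, 185, 185, 186, 183, 182])

print(tornado_modes)  # one mode
print(weight_modes)   # two modes, since 183 and 185 are tied
```

This sketch assumes a non-empty list, since `max()` raises an error on empty input.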

## Variance

It is not enough to just know the mean and/or median of the data to get an idea of where it is centered. We also want to capture how diverse and spread out the data is. This is where the idea of **variance** comes into play, which measures how "spread out" our data is.

Let's revisit our lightbulb data from earlier. 

In [None]:
# df.mean() returns a Series labeled by column name, so select by position with iloc
print("MEAN: ", df.mean().iloc[0])
df

We have a mean of 672.2 hours across the 150 lightbulbs in our sample. But how "spread out" are the values? To calculate the variance, we first take the difference between each value and the mean.

In [None]:
value_minus_mean = (df - df.mean()).rename(columns={"lightbulb_life_hours": "value - mean"})
value_minus_mean

To aggregate the "total" spread, we need to average these differences somehow. If we averaged them directly, the negatives would cancel out the positives. We could take absolute values to deal with the negatives, but it is better to square the differences first, as this removes the negative signs and also amplifies larger differences.

In [None]:
squared_differences = value_minus_mean**2
squared_differences

After that, we can average all those squared differences and that will give us the variance.


In [None]:
# select the single column's result by position with iloc
variance = squared_differences.mean().iloc[0]
variance

However, there is an important nuance between samples and populations regarding the calculation of variance. We just calculated the population variance, not the sample. We denote sample variance as $ s^2 $ and population variance as $ \sigma^2 $. Here are the formulas. Beyond the symbols, see if you can spot the difference. 

**Sample Variance**

$
\Large{s^2 = \sum{\frac{(x_i - \bar{x})^2}{n-1}}}
$


**Population Variance**

$
\Large{\sigma^2 = \sum{\frac{(x_i - \mu)^2}{N}}}
$


The big difference between these two formulas is the $ n - 1 $ in the denominator for sample variance, as opposed to just $ N $ for the population. We subtract $ 1 $ from the sample size because the sample mean $ \bar{x} $ is itself calculated from the same data, so the squared differences around it tend to slightly understate the true spread. Dividing by $ n - 1 $ bumps the variance up slightly to compensate for that added uncertainty. When you use Pandas to calculate variance, it will by default treat the data as a sample.

In [None]:
df.var()

To force Pandas to treat it as a population, set the delta degrees of freedom (`ddof`) parameter to `0` instead of its default of `1`. You will see this now matches our "from scratch" calculation.

In [None]:
df.var(ddof=0)

In practice, you will often treat your data as a sample and therefore use the sample variance. 
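
To tie the two formulas together, here is a minimal from-scratch sketch in plain Python. The function name `variance_from_scratch` is a hypothetical helper, and its `ddof` argument is modeled on the Pandas parameter of the same name. The data is the list of body weight measurements from the mode section, and the results are checked against Python's standard `statistics` module.

```python
import math
import statistics

def variance_from_scratch(values, ddof=1):
    """Sum of squared differences from the mean, divided by (n - ddof).

    ddof=1 divides by n - 1 (sample); ddof=0 divides by n (population).
    """
    n = len(values)
    mean = sum(values) / n
    squared_differences = [(x - mean) ** 2 for x in values]
    return sum(squared_differences) / (n - ddof)

measurements = [185, 183, 182, 183, 187, 185, 185, 186, 183, 182]

sample_variance = variance_from_scratch(measurements)              # n - 1
population_variance = variance_from_scratch(measurements, ddof=0)  # n

print(sample_variance, population_variance)
print(math.isclose(sample_variance, statistics.variance(measurements)))
print(math.isclose(population_variance, statistics.pvariance(measurements)))
```

Because the only difference is the denominator, the sample variance is always the larger of the two, by a factor of $ n / (n - 1) $.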

## Standard Deviation

You might have found it odd that variance is defined by $ \sigma^2 $ and $ s^2 $. Where did the square come from and what does it mean?

**Sample Variance**

$
\Large{s^2 = \sum{\frac{(x_i - \bar{x})^2}{n-1}}}
$


**Population Variance**

$
\Large{\sigma^2 = \sum{\frac{(x_i - \mu)^2}{N}}}
$

The square is a nagging reminder that you squared all the differences, and we are still in "squaring land". The variance is difficult to interpret because of all the squaring that occurred. However, we can undo the squaring by taking the square root of the variance, creating the **standard deviation**.

**Sample Standard Deviation**

$
\Large{s = \sqrt{\sum{\frac{(x_i - \bar{x})^2}{n-1}}}}
$


**Population Standard Deviation**

$
\Large{\sigma = \sqrt{\sum{\frac{(x_i - \mu)^2}{N}}}}
$

This is why the variance is written as $ \sigma^2 $ and $ s^2 $: taking the square root gives the standard deviation, which is much easier to interpret. The same rule of treating samples and populations differently in the denominator still applies.

Below, we take the standard deviation using Pandas' `std()` function. 

In [None]:
df.std()

Describing the spread as `51.480628` hours (the standard deviation) is much more interpretable than `2650.255034` squared hours (the variance). It will make even more sense when we talk about the normal distribution in the next section, which defines spread in terms of the standard deviation.

To treat the data as a population rather than a sample, set the `ddof` parameter to `0`.

In [None]:
df.std(ddof=0)
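
The square-root relationship can be verified from scratch as well. The sketch below is a minimal illustration in plain Python, again using the body weight measurements from the mode section and checking the results against the standard `statistics` module. The variable names are illustrative.

```python
import math
import statistics

measurements = [185, 183, 182, 183, 187, 185, 185, 186, 183, 182]

n = len(measurements)
mean = sum(measurements) / n
sum_of_squares = sum((x - mean) ** 2 for x in measurements)

# Standard deviation is the square root of the variance
sample_std = math.sqrt(sum_of_squares / (n - 1))  # s, like ddof=1
population_std = math.sqrt(sum_of_squares / n)    # sigma, like ddof=0

print(sample_std, population_std)
print(math.isclose(sample_std, statistics.stdev(measurements)))
print(math.isclose(population_std, statistics.pstdev(measurements)))
```

One thing to watch for: NumPy's `np.std` and `np.var` default to `ddof=0`, the opposite of Pandas, so the same data can give slightly different answers depending on which library you call.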

## Exercise

Below is a series of seven temperature readings taken from different thermometers. Calculate the mean, median, variance, and standard deviation by completing the code below (replacing the question marks "?").

Ask yourself: is the data skewed? How spread out is it?

In [None]:
df = pd.Series([63, 67, 65, 66, 65, 65, 64])

mean = ? 
median = ? 
variance = ? 
std = ?

print(f"MEAN: {mean}, MEDIAN: {median}, VARIANCE: {variance}, STD: {std}")

### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

In [73]:
df = pd.Series([63, 67, 65, 66, 65, 65, 64])

mean = df.mean()
median = df.median()
variance = df.var()
std = df.std()

print(f"MEAN: {mean}, MEDIAN: {median}, VARIANCE: {variance}, STD: {std}")

MEAN: 65.0, MEDIAN: 65.0, VARIANCE: 1.6666666666666667, STD: 1.2909944487358056


The data is not skewed, as the mean and median are the same, and the spread is small: the standard deviation is only about 1.29 degrees. Think of this as saying "the temperature is 65 degrees give or take 1.29 degrees."