# Using Statistics with Jupyter Notebooks

## Introduction to Statistical Averaging

**Statistics** *(noun)* - a branch of mathematics dealing with the *collection*, *analysis*, *interpretation*, and *presentation* of masses of numerical data.
> It allows one to make **informed** decisions as a result.

## Averages

### Mean

**Mean** - summarizes an entire data set with a **single number** representing the data's *center point*.

#### Formula
$$mean = \frac{sum\ of\ the\ terms}{number\ of\ terms}$$


$$mean = \frac{\sum_{i=1}^{n} x_i}{n}$$

Where $x_i$ is each term and $n$ is the number of terms.

#### Example

Data: $[88, 5, 4, 16, 72, 22, 75, 74, 27, 9]$

Count: $[10]$

Mean: $[39.2]$

$$mean = \frac{(88 + 5 + 4 + 16 + 72 + 22 + 75 + 74 + 27 + 9)}{10} $$
$$mean = 39.2$$


In [1]:
my_nums = [88, 5, 4, 16, 72, 22, 75, 74, 27, 9]
count = len(my_nums)
mean = sum(my_nums)/count
print(mean)

39.2
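As a cross-check, the manual calculation can be compared with Python's standard-library `statistics.mean`. This is a minimal sketch; the names `manual_mean` and `library_mean` are illustrative and not part of the original notebook.

```python
import statistics

my_nums = [88, 5, 4, 16, 72, 22, 75, 74, 27, 9]

# Manual calculation: sum of the terms divided by the number of terms
manual_mean = sum(my_nums) / len(my_nums)

# The same calculation using the standard library
library_mean = statistics.mean(my_nums)

print(manual_mean)   # 39.2
print(library_mean)  # 39.2
```

Both approaches should agree; the library call mainly saves typing and rejects empty input with a clear error.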


### Median

**Median** - The middle value of an ordered data set.

#### Formula

Odd: $$median = \left(\frac{n+1}{2}\right)^{th}\ term$$

Even: $$median = \frac{\left(\frac{n}{2}\right)^{th}\ term + \left(\frac{n}{2}+1\right)^{th}\ term}{2}$$


#### Example

Data: $[4, 5, 9, 16, 22, 27, 72, 74, 75, 88]$

Find the middle value(s):

$[4, 5, 9, 16, \textbf{22}, \textbf{27}, 72, 74, 75, 88]$

$$ median = \frac{22 + 27}{2} = 24.5 $$


In [2]:
sorted_my_nums = sorted(my_nums)
print(sorted_my_nums)
if count % 2 == 0:
    # Even count: average the two middle values
    median = (sorted_my_nums[count // 2 - 1] + sorted_my_nums[count // 2]) / 2
else:
    # Odd count: take the single middle value
    median = sorted_my_nums[count // 2]
print(median)

[4, 5, 9, 16, 22, 27, 72, 74, 75, 88]
24.5
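The example above only exercises the even case. The sketch below wraps the same logic in a function and runs it on both an even-length and an odd-length list, then compares the results with the standard-library `statistics.median`. The function name `median_of` and the list `odd_data` are illustrative assumptions, not part of the original notebook.

```python
import statistics


def median_of(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 0:
        # Even count: average the two middle values
        return (ordered[mid - 1] + ordered[mid]) / 2
    # Odd count: the single middle value
    return ordered[mid]


even_data = [88, 5, 4, 16, 72, 22, 75, 74, 27, 9]  # 10 values
odd_data = even_data[:-1]                           # 9 values (last one dropped)

print(median_of(even_data))  # 24.5
print(median_of(odd_data))   # 27
print(statistics.median(even_data), statistics.median(odd_data))
```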


### Mode

**Mode** - the value(s) that occur most often in a data set, i.e. the value(s) with the highest frequency. A data set can have more than one mode.

#### Example
Data: $[2, 4, 5, 5, 7, 8, 8, 9, 10]$

|Value|2|4|5|7|8|9|10|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|**Frequency**|1|1|2|1|2|1|1|

$$ Mode = [5, 8]$$

In [3]:
# Version 1
my_nums = [2, 4, 5, 5, 7, 8, 8, 9, 10]
my_dict = {num: my_nums.count(num) for num in my_nums}  # value -> frequency
max_count = max(my_dict.values())  # highest frequency, computed once
mode = [k for k, v in my_dict.items() if v == max_count]
print(mode)

[5, 8]


In [4]:
# Version 2
my_dict = {}
for num in my_nums:
    my_dict[num] = my_nums.count(num)
mode = []
for k, v in my_dict.items():
    if v == max(my_dict.values()):
        mode.append(k)
print(mode)

[5, 8]
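Both versions above call `list.count` once per element, which rescans the list each time. A possible third version, sketched here, counts everything in a single pass with `collections.Counter`; `statistics.multimode` (available in Python 3.8 and later) gives the same answer directly. The names `counts` and `highest` are illustrative.

```python
from collections import Counter
import statistics

my_nums = [2, 4, 5, 5, 7, 8, 8, 9, 10]

# Count every value in one pass: value -> frequency
counts = Counter(my_nums)
highest = max(counts.values())

# Keep every value whose frequency equals the highest frequency
mode = [value for value, freq in counts.items() if freq == highest]
print(mode)  # [5, 8]

# Standard-library equivalent
print(statistics.multimode(my_nums))  # [5, 8]
```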


## Spread of Data Practice

### Domain

**Domain** - the entire range of values that data points can assume.

In [5]:
import numpy as np

In [6]:
data = np.random.randint(15_000, 200_000, 40)  # 1-D array of 40 random integers in [15000, 200000)
print(data)
max_value = np.max(data)
min_value = np.min(data)
print(f'Max: ${max_value}, Min: ${min_value}')
domain = {'min': min_value, 'max': max_value}
print(f'Domain: ${domain.get("min")} -> ${domain.get("max")}')

[134453 156224 191952 189597 149875  38491 186924  16233  44821 178487
 104030 146520  32151 178908 182201  98203 142744 147286  74155 154059
  21041  21123 125854  21594  62415  45643 152340  21365 186332  75444
 101805  26525 169343  48483  46115  94176 138143  84468 173403 181848]
Max: $191952, Min: $16233
Domain: $16233 -> $191952



### Range

**Range** - the spread between the maximum and minimum values.

$$range = max - min$$
$$min = \$16233,\ max = \$191952$$
$$range = \$191952 - \$16233 = \$175719$$

In [7]:
value_range = max_value - min_value  # avoid shadowing the built-in range()
print(f'Range: ${value_range}')

Range: $175719
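NumPy also offers `np.ptp` ("peak to peak"), which returns the maximum minus the minimum in one call. The sketch below uses a small hypothetical array, `sample`, built from three values in the output above (including the minimum and maximum), so the result can be checked against the printed range.

```python
import numpy as np

# Hypothetical subset of the data above; includes its min and max
sample = np.array([16233, 104030, 191952])

manual_range = sample.max() - sample.min()  # max - min, as in the cell above
ptp_range = np.ptp(sample)                  # same calculation in one call

print(manual_range)  # 175719
print(ptp_range)     # 175719
```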


## Higher Concepts

### Variance

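**Variance** - measures how far the values in the data set are spread from the mean. It is the average of the squared differences from the mean.

$$Variance = \frac{\sum (x_i - \mu)^2}{n}$$

Where $x_i$ is each value, $\mu$ is the mean, and $n$ is the number of values. This is the *population* variance, which is what NumPy's `var()` computes by default. The *sample* variance divides by $n-1$ instead (`ddof=1` in NumPy).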

In [8]:
variance = data.var()
print(f"Variance: {variance}")

Variance: 3638378347.777499
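To make the definition concrete, the sketch below computes the variance step by step on a small hand-checkable array and compares it with `np.var`. It also shows the difference between dividing by $n$ (population variance, NumPy's default) and dividing by $n-1$ (sample variance, `ddof=1`). The array `values` is made up for illustration.

```python
import numpy as np

# Small illustrative data set: mean is 5, squared differences sum to 32
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mu = values.mean()
squared_diffs = (values - mu) ** 2

# Population variance: divide by n (NumPy's default, ddof=0)
population_variance = squared_diffs.sum() / len(values)

# Sample variance: divide by n - 1 (ddof=1)
sample_variance = squared_diffs.sum() / (len(values) - 1)

print(population_variance, np.var(values))      # both 4.0
print(sample_variance, np.var(values, ddof=1))  # both 32/7, about 4.571
```

With 40 data points the two versions differ only slightly, but the gap grows as the data set gets smaller.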


### Standard Deviation

**Standard Deviation** - measures the amount of variation or dispersion in a set of values. It is the square root of the variance.

$$Standard\ Deviation = \sqrt{Variance}$$

A low standard deviation means values are close to the mean, while a high standard deviation means values are spread out.

In [9]:

stdev = data.std()
print(f"Standard deviation: {stdev}")

Standard deviation: 60318.971706897486
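The contrast between a low and a high standard deviation is easier to see on two small data sets that share the same mean. The arrays `tight` and `wide` below are made up for illustration; the sketch also confirms that `np.std` matches the square root of `np.var`.

```python
import numpy as np

# Two illustrative data sets, both with a mean of 50
tight = np.array([48, 49, 50, 51, 52])  # values close to the mean
wide = np.array([10, 30, 50, 70, 90])   # values spread far from the mean

tight_std = np.std(tight)
wide_std = np.std(wide)

print(tight.mean(), wide.mean())  # 50.0 50.0
print(f"tight: {tight_std:.2f}")  # tight: 1.41
print(f"wide:  {wide_std:.2f}")   # wide:  28.28

# Standard deviation is the square root of the variance
print(np.isclose(wide_std, np.sqrt(np.var(wide))))  # True
```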


### Practice Prompt
Try generating a new random dataset and calculate its mean, variance, and standard deviation. Visualize the results with a histogram.
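One possible starting point is sketched below. It assumes a seeded generator (the seed `42` is arbitrary) so the run is repeatable, and it prints a text histogram from `np.histogram` rather than drawing a plot. If matplotlib is installed, `plt.hist(new_data, bins=5)` would draw the same bins as a chart. All variable names here are illustrative.

```python
import numpy as np

# Seeded generator so the run is repeatable (the seed value is arbitrary)
rng = np.random.default_rng(42)
new_data = rng.integers(15_000, 200_000, size=40)

new_mean = new_data.mean()
new_variance = new_data.var()  # population variance (ddof=0)
new_stdev = new_data.std()     # square root of the variance above

print(f"Mean: {new_mean:.2f}")
print(f"Variance: {new_variance:.2f}")
print(f"Standard deviation: {new_stdev:.2f}")

# Split the values into 5 equal-width bins and print one '#' per value
counts, edges = np.histogram(new_data, bins=5)
for n_in_bin, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:>9.0f} - {right:>9.0f} | {'#' * int(n_in_bin)}")
```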