# Demo: Let's Make Some Data!

In [1]:
# range from 20 to 59
ages = range(20, 60)
ages

range(20, 60)

In [2]:
import random
random_ages = [random.choice(ages) for _ in range(100)]

In [3]:
print(random_ages) # using print eschews pretty-printing

[24, 35, 26, 36, 55, 51, 33, 50, 34, 54, 54, 56, 23, 25, 23, 25, 58, 51, 57, 54, 29, 29, 40, 51, 53, 26, 58, 22, 32, 46, 26, 59, 52, 42, 56, 55, 20, 41, 34, 24, 33, 45, 35, 35, 46, 55, 21, 27, 29, 21, 32, 25, 26, 50, 47, 21, 57, 49, 46, 23, 43, 24, 34, 21, 31, 32, 33, 21, 35, 34, 36, 48, 24, 50, 47, 27, 44, 55, 27, 38, 56, 24, 59, 50, 27, 21, 26, 31, 41, 23, 37, 31, 49, 38, 46, 30, 47, 58, 39, 33]


In [4]:
max(random_ages)

59

In [5]:
min(random_ages)

20

## Range: How Wide is the Dispersion of the Data?

In [6]:
def my_range(x):
    '''Python would let us call this function range,
       but if we did that, we would lose access to
       the builtin function range'''
    return max(x) - min(x)

In [7]:
my_range(random_ages)

39

In [8]:
nums = [10, 10, 100, 100]

In [9]:
my_range(nums)

90

In [10]:
nums = [10, 50, 50, 50, 50, 50, 50, 50, 50, 100]

In [11]:
my_range(nums)

90

In [12]:
# NumPy's equivalent of my_range is np.ptp
import numpy as np
np.ptp(random_ages) # "peak to peak"

39

## Mean: Average Value

In [13]:
def mean(x):
    return sum(x) / len(x)

In [16]:
mean(random_ages)

38.12

In [17]:
np.mean(random_ages)

38.12

## Median: Mid-Point of Values

In [18]:
def median(x):
    n = len(x)
    sorted_x = sorted(x)
    mid = n // 2
    if n % 2 == 0:
        return (sorted_x[mid - 1] + sorted_x[mid]) / 2
    else:
        return sorted_x[mid]

In [19]:
median(random_ages)

35.0

In [20]:
np.median(random_ages)


35.0

## Percentile: How Much Data Falls Below?

In [21]:
np.percentile(random_ages, 40)

33.0

In [24]:
np.percentile(random_ages, 75)

50.0

In [25]:
np.percentile(random_ages, 25)

26.75

## Interquartile Range (IQR)
* $ IQR = Q_3 - Q_1 $
* 75th percentile - 25th percentile 

In [26]:
from scipy import stats
stats.iqr(random_ages)

23.25
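To see what the percentile and IQR calls are doing, here is a minimal from-scratch sketch. It assumes linear interpolation between the two nearest sorted values, which is NumPy's default method. The names `percentile` and `iqr` are illustrative helpers, not part of the notebook.

```python
def percentile(x, p):
    """Percentile p (0..100) using linear interpolation between neighbors."""
    s = sorted(x)
    pos = (len(s) - 1) * p / 100      # fractional index into the sorted data
    lo = int(pos)                     # index of the value just below
    hi = min(lo + 1, len(s) - 1)      # index of the value just above
    frac = pos - lo                   # how far we sit between the two
    return s[lo] + (s[hi] - s[lo]) * frac

def iqr(x):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return percentile(x, 75) - percentile(x, 25)

print(percentile([1, 2, 3, 4], 25))   # 1.75
print(percentile([1, 2, 3, 4], 75))   # 3.25
print(iqr([1, 2, 3, 4]))              # 1.5
```

Because the IQR ignores the top and bottom quarters of the data, it is much less sensitive to extreme values than the plain range.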

## Mode

In [27]:
stats.mode(random_ages)

ModeResult(mode=array([21]), count=array([6]))
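The mode is the most frequent value, so it can also be sketched with `collections.Counter`. The `mode` helper below is a hypothetical name; it returns the smallest value when several are tied, which is an assumption made here to keep the result deterministic.

```python
from collections import Counter

def mode(x):
    """Return (most frequent value, its count); ties go to the smallest value."""
    counts = Counter(x)
    top = max(counts.values())
    value = min(v for v, c in counts.items() if c == top)
    return value, top

print(mode([21, 24, 21, 35, 24, 21]))   # (21, 3)
print(mode([3, 1, 3, 1]))               # (1, 2)
```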

## Consider the spread of data in two hypothetical datasets

<img src="images/skew-2.png" width=400 height=400>

* how can we identify/quantify different spreads?
* focus on the <span style="color:blue;font-weight:bold;">mean</span>, <span style="color:green;font-weight:bold;">median</span>, and <span style="color:red;font-weight:bold;">mode</span>
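As a rough illustration of why these three measures matter, the sketch below uses a small made-up list with a long right tail (the values are hypothetical and not taken from the figure). In data like this the mean is pulled toward the tail, while the median and mode stay near the bulk of the values.

```python
import statistics

# Hypothetical right-skewed data: most values are small, a few are large.
skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

print(statistics.mean(skewed))     # 4.5
print(statistics.median(skewed))   # 3.0
print(statistics.mode(skewed))     # 2
```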

## Variance: How much spread is there in the dataset?
* $Var(X) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2$
* Why is it squared? **Squaring makes every deviation non-negative, so values above and below the mean cannot cancel out. Without it, the deviations from the mean always sum to zero.**
* What are the units of variance, assuming our dataset from above? **Squared units (here, years squared), which is why we take the square root (the standard deviation) to get back to years.**

In [28]:
np.var(random_ages)

148.68560000000002
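In the same spirit as `mean` and `median` above, here is a minimal sketch of the population variance, which is what `np.var` computes by default. The `variance` helper and the `wide`/`tight` names are illustrative; the two lists are the `nums` examples from the range section.

```python
def variance(x):
    """Population variance: the mean squared deviation from the mean."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

# Both lists have a range of 90, but very different spreads.
wide = [10, 10, 100, 100]
tight = [10, 50, 50, 50, 50, 50, 50, 50, 50, 100]

print(variance(wide))    # 2025.0
print(variance(tight))   # 409.0
```

The range cannot tell these two lists apart, but the variance can, because it uses every value rather than only the two extremes.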

## Standard Deviation
* $\sqrt {Var(X)}$
* puts the units back into those of the original data (years, for our ages)
* the "standard" (typical) deviation from the mean

In [29]:
np.std(random_ages)

12.19367048923334
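One detail worth knowing: `np.std` divides by $n$ by default (the population standard deviation), while the sample standard deviation divides by $n - 1$ (in NumPy, `ddof=1`). The sketch below shows the difference using the standard library's `statistics` module on one of the small lists from earlier.

```python
import statistics

wide = [10, 10, 100, 100]

# Population standard deviation: divide by n (the np.std default).
population_sd = statistics.pstdev(wide)

# Sample standard deviation: divide by n - 1 (like np.std(..., ddof=1)).
sample_sd = statistics.stdev(wide)

print(population_sd)
print(sample_sd)
```

For this list the population value is 45 and the sample value is a little larger (about 51.96). The gap shrinks as the dataset grows.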

<img style="height: 350px;" src="images/ss-01.png">

## In a normal distribution...
* about 68% of the data will fall within 1 standard deviation of the mean
* about 95% of the data will fall within 2 standard deviations of the mean
* about 99.7% of the data will fall within 3 standard deviations of the mean
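These proportions can be checked empirically. The sketch below draws normally distributed samples with the standard library and counts how many land within 1, 2, and 3 standard deviations of the mean. The sample size, the seed, and the `share_within` helper are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable
samples = [random.gauss(0, 1) for _ in range(100_000)]

mu = statistics.fmean(samples)
sigma = statistics.pstdev(samples, mu)

def share_within(k):
    """Fraction of samples within k standard deviations of the mean."""
    return sum(abs(v - mu) <= k * sigma for v in samples) / len(samples)

for k in (1, 2, 3):
    print(k, round(share_within(k), 3))
```

With a sample this large the printed fractions should land close to 0.68, 0.95, and 0.997, though the exact digits depend on the random draw.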

## Skewness
* if we assume our sample follows a normal distribution and then generalize from it to the population at large, our conclusions will be wrong when the sample is actually skewed
* e.g., polling only people who have landlines

<img style="height: 200px;" src="images/skew-1.png">

In [None]:
stats.skew(random_ages)
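
For intuition about what `stats.skew` reports, here is a minimal sketch of the moment-based skewness coefficient, the third central moment divided by the variance raised to the power 1.5. This is what SciPy computes with its default settings, as far as I know. The `skewness` helper name is illustrative.

```python
def skewness(x):
    """Moment-based skewness: m3 / m2**1.5, using population moments."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n   # second central moment (variance)
    m3 = sum((v - m) ** 3 for v in x) / n   # third central moment
    return m3 / m2 ** 1.5

print(skewness([1, 2, 3, 4, 5]))    # 0.0, symmetric data
print(skewness([1, 1, 1, 10]) > 0)  # True, long right tail
print(skewness([1, 10, 10, 10]) < 0)  # True, long left tail
```

A value near zero suggests a roughly symmetric sample, a positive value a longer right tail, and a negative value a longer left tail.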