## Statistics

## Descriptive Statistics 

- numbers which are used to **summarize** and **describe** data
  - we work with all items of interest - **statistical population**
  - we don't try to make predictions, just describe what we're seeing
- not very useful on their own
  - but an important part of other methods
- example: pet shop sales
  - 100 pets in one month: 40 dogs, 30 cats, 30 other
  - what percent of all pets are dogs?
  - what's the mean number of cats sold per month?
- we can also represent the information graphically
  - what does the distribution of dog sales per day look like?
  - what does the cumulative distribution of sales look like?
  - how do sales compare? 
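
The pet shop question above can be answered with a few lines of Python. This is a minimal sketch; the dictionary name `sales` and its keys are just illustrative choices.

```python
# Pet shop sales for one month (numbers from the example above).
sales = {"dogs": 40, "cats": 30, "other": 30}

total = sum(sales.values())

# Share of each category as a percentage of all pets sold.
percentages = {pet: 100 * count / total for pet, count in sales.items()}

print(percentages["dogs"])  # 40.0
```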

## Inferential Statistics

- in many cases the **population** is too large (or even infinite)
  - we represent the population by a subset - **sample**
  - the population characteristics can be estimated by using the sample
    - we have to be extremely careful how to choose the sample
  - in most cases we need **random sampling** of the population
- examples
  - voting predictions
    - we ask a small number of people and we draw inferences about the entire country
  - mean salary by age
    - we divide people into age groups (e.g. < 20, 20-25, 25-30, 30-35, ...) and ask several people within each age group
    - this also makes the continuous variable "age" easier to work with 


## Sampling

- the process of selecting a sample from the population
- steps in the sampling process
  - define the population
  - specify the **sampling frame** - a set of items from the population
  - specify the **sampling method** - how to select items from the frame
  - determine the sample size
  - implement the sampling and collect data
- badly done sampling can introduce biases and errors
  - selection bias - selecting a non-random sample
    - e.g. asking only CEOs of companies when sampling data for salaries by age
  - random sampling error - random variations in the results

## Sampling Methods

- non-random sampling
  - can be biased
  - not representative of the population
- random sampling
  - every member of the population has an equal chance of being chosen
  - example: insect population in trees
    - trees are numbered 1-200, 10 trees are chosen at random
    - all insects are counted on the 10 random trees
- stratified sampling
  - divide the population into categories (subpopulations)
  - for each category, sample at random
  - example: foot measurement study $\Rightarrow$ male / female; age groups
    - select samples for each combination {gender; age}
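
The two random schemes above can be sketched with Python's `random` module. The population records, the age-group labels, the seed and the per-stratum size are all made up for illustration; only the "200 trees, choose 10" numbers come from the example.

```python
import random

rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable

# Simple random sampling: trees are numbered 1-200, pick 10 without replacement.
trees = list(range(1, 201))
chosen_trees = rng.sample(trees, 10)

# Stratified sampling on a hypothetical population of (gender, age group) records.
genders = ["male", "female"]
age_groups = ["<20", "20-40", "40+"]
population = [
    {"id": i, "gender": rng.choice(genders), "age_group": rng.choice(age_groups)}
    for i in range(1000)
]

# Split the population into strata, one per {gender; age} combination.
strata = {}
for person in population:
    key = (person["gender"], person["age_group"])
    strata.setdefault(key, []).append(person)

# Sample at random inside each stratum (at most 5 people per stratum here).
per_stratum = 5
stratified_sample = {
    key: rng.sample(members, min(per_stratum, len(members)))
    for key, members in strata.items()
}
```

Every combination that occurs in the population is represented in `stratified_sample`, which plain random sampling does not guarantee for small samples.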


## Properties of distributions

## Summarizing Distributions

- a histogram is a complete description of the sample distribution
- we often summarize it using a few descriptive statistics
  - **central tendency**
    - do the values tend to cluster around a center?
  - **modes**
    - how many clusters are there? Where are they?
  - **variance**
    - how much variability is there (how "spread out" is the distribution)?
  - **tails**
    - how quickly do probabilities drop off as we move away from the center(s)?
  - **outliers**
    - are there extreme values, far from the center(s)?
- these are also called **summary statistics**

## Measures of Central Tendency

- **average** - a number which describes a typical data point
  - can be calculated in many ways
- **arithmetic mean**
  - the sum of all measurements divided by the number of observations
- **median**
  - the middle value of the distribution
  - to calculate it, the numbers must be sorted in ascending order
  - examples:
    - Me({1, 2, 2, 3, 4}) = 2
    - Me({1, 2, 2, 3, 4, 10}) = 2.5
- **mode**
  - the most frequent item
  - Mo({1, 3, 2, 3, 4, 3}) = 3
  - several "most frequent items" $\Rightarrow$ multimodal distribution
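
The examples above can be checked with Python's built-in `statistics` module. This is only a quick sketch; the last list is an extra, made-up example of a multimodal data set.

```python
import statistics

mean_value = statistics.mean([1, 2, 2, 3, 4])           # 12 / 5 = 2.4
median_odd = statistics.median([1, 2, 2, 3, 4])         # middle value: 2
median_even = statistics.median([1, 2, 2, 3, 4, 10])    # mean of 2 and 3: 2.5
mode_value = statistics.mode([1, 3, 2, 3, 4, 3])        # most frequent item: 3

# With several equally frequent items, multimode returns all of them.
modes = statistics.multimode([1, 1, 2, 2, 3])           # [1, 2]
```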

## Variance

- describes how far the values in a sample are spread out from the sample mean
  - the distances from the mean can be positive or negative
  - they always sum up to 0 (this follows from the definition of the mean)
  - so we square them to make them positive
  - sample variance: $S^2(x) = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$
  - standard deviation: $S(x) = \sqrt{S^2(x)}$
- in the sample variance formula, there is $n - 1$ in the denominator
  - it refers to "degrees of freedom"
    - the number of values that are free to vary
  - because all distances sum up to 0, if we know $n - 1$ of them, we can find the last one
  - gives us an unbiased estimator of the population variance
- why bother to take the standard deviation?
  - instead of using variance directly
- it's all about units
- example:
  - let's say we're measuring length in m
  - by definition, the variance will have units of $m^2$
  - we want to see how far a certain point is from the center, but the units don't match
    - compare $d = 2m, S^2 = 0.25m^2$ to $d = (2 \pm 0.5)m$
  - in order to make units match, we take the square root
  - so we can say "This measurement is located 1.5 standard deviations above the mean"
    - in our example, such a measurement would be 2.75 m
    - comparisons like these are very useful in statistics 
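
A minimal sketch of the sample variance with $n - 1$ in the denominator, written out by hand. The three lengths are made-up values chosen so that the mean is 2 m and the variance is 0.25 m², matching the example above.

```python
import math

def sample_variance(values):
    """Sum of squared distances from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

lengths = [1.5, 2.0, 2.5]            # hypothetical measurements in m
mean_length = sum(lengths) / len(lengths)

# The signed distances from the mean cancel out.
deviation_sum = sum(x - mean_length for x in lengths)

variance = sample_variance(lengths)   # units: m^2
std_dev = math.sqrt(variance)         # units: m, same as the data

# A measurement located 1.5 standard deviations above the mean.
measurement = mean_length + 1.5 * std_dev

print(variance, std_dev, measurement)  # 0.25 0.5 2.75
```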

## Population vs. Sample: Measures

- there are differences between a population and samples from that population $\Rightarrow$ we have different statistics
- notation
  - sample statistics - sample mean, sample variance, etc. - Latin letters
  - population statistics - Greek letters
- population mean $\mu$
  - also called expected value
  - $\mu = \frac{1}{N} \sum x_i$, where N is the population size
- population variance $\sigma^2$
  - $\sigma^2(x) = \frac{1}{N} \sum (x_i - \mu)^2$
  - note how since we know the entire population, there is no estimation going on
  - so there is N in the denominator
- population standard deviation
  - $\sigma(x) = \sqrt{\sigma^2(x)}$
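
The difference between the two denominators is easy to see with the `statistics` module, which provides both variants. The data set below is made up; it is treated first as a whole population and then as a sample.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values; mean is 5

# Treating the data as the entire population: N in the denominator.
pop_variance = statistics.pvariance(data)   # 32 / 8 = 4
pop_std = statistics.pstdev(data)           # square root of 4 = 2

# Treating the same data as a sample: n - 1 in the denominator.
sample_variance = statistics.variance(data) # 32 / 7, about 4.57
sample_std = statistics.stdev(data)
```

The sample versions are always slightly larger, and the gap shrinks as the number of observations grows.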

## Five-Number Summary

- conveys similar information to a histogram
  - what percentage of the data is less than or equal to a specified number
    - minimum (0%); first quartile (25%); median (50%); third quartile (75%); maximum (100%)
    - generalization: quantiles - divide the frequency distribution into equal groups
    - 100 groups = percentiles
- visualization: boxplot
  - middle line - median
  - box - quartiles
  - whiskers - extend to the most extreme "non-outliers", i.e. points within 1.5 times the interquartile range of the box
  - points - outliers 
![Boxplot](boxplot.png)
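
A sketch of the five-number summary and the 1.5 × IQR rule using `statistics.quantiles`. The data are made up, and note that several conventions for computing quartiles exist; the `"inclusive"` method is just one of them, so other tools may report slightly different values.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 20]   # hypothetical values with one extreme point

# Quartiles: the three cut points that split the data into four groups.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number_summary = (min(data), q1, median, q3, max(data))

# Boxplot convention: points farther than 1.5 * IQR from the box are outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# The whiskers end at the most extreme points that are not outliers.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))

print(five_number_summary)  # (1, 3.0, 5.0, 7.0, 20)
print(outliers, whiskers)   # [20] (1, 8)
```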


## Moments of Distributions

- $r^{th}$ central moment: $\mu_r = E[(X - \mu)^r]$
  - defined for discrete and continuous variables
  - measures the shape of the probability distribution
- zeroth moment: 1 (**total probability**)
- first moment (about zero): **arithmetic mean** $\mu$
  - the first central moment is always 0
- second central moment: **variance** $\sigma^2$
- third moment (standardized, i.e. divided by $\sigma^3$): **skewness** $\gamma$
  - asymmetry in the distribution
- fourth moment (standardized, i.e. divided by $\sigma^4$): **kurtosis** $\beta$
  - "heaviness" of the tails
  - normal distribution: $\beta = 3$
  - excess kurtosis: $\beta - 3$
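
The moments can be computed directly from data. The sketch below assumes the common convention that skewness and kurtosis are the third and fourth central moments divided by $\sigma^3$ and $\sigma^4$; the helper names and both data sets are made up for illustration.

```python
def central_moment(values, r):
    """r-th central moment: the mean of (x - mean) ** r."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** r for x in values) / len(values)

def skewness(values):
    # Third central moment, scaled by the standard deviation cubed.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    # Fourth central moment, scaled by the variance squared.
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print(skewness(symmetric))      # 0.0: no asymmetry
print(skewness(right_skewed))   # positive: long tail to the right
print(kurtosis(symmetric) - 3)  # excess kurtosis, negative here (light tails)
```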


## Moments of the Gaussian Distribution

- continuous limit of the binomial distribution (for a large number of trials)
- mean: $\mu$
- median: $\mu$
- mode: $\mu$
- variance: $\sigma^2$
- skewness: 0
- excess kurtosis: 0 

## Standard Score

- in order to compare different Gaussian distributions, we can "normalize" them
- change their parameters to get a "standard" Gaussian distribution with $\mu = 0$ and $\sigma = 1$
- we need to "shift" the distribution left or right and "squish" or "stretch" to achieve the required standard deviation
- the transformed value is called the standard score (or z-score): $Z(x) = \frac{x - \mu}{\sigma}$
- example: 50 student scores
  - normal distribution, mean 60 (out of 100) and standard deviation 15
  - how well did a student perform if they had 70 / 100?
    - top 25% of the class
  - what marks do the top 10% of the class have?
    - 79 and up 
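
The student-score example can be reproduced with `statistics.NormalDist`. This is a sketch that assumes the scores follow the stated normal distribution exactly; the variable names are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=60, sigma=15)

# Standard score of a student with 70 / 100.
z = (70 - scores.mean) / scores.stdev      # about 0.67

# Fraction of the class expected to score below 70.
fraction_below = scores.cdf(70)            # about 0.75, i.e. top 25%

# Mark that separates the top 10% from the rest (the 90th percentile).
top_10_cutoff = scores.inv_cdf(0.90)       # about 79.2

print(round(z, 2), round(fraction_below, 2), round(top_10_cutoff, 1))
```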


## Many Variables

## Covariance
- up to now, we've been looking at variables on their own
  - but in many cases they interact with each other
- covariance is a measure of the joint variability of two variables 
  - $cov(x, y) =\frac{1}{n} \sum (x_i - \bar{x})(y_i - \bar{y})$
  - positive: as one variable increases, the other also increases
  - negative: as one variable increases, the other decreases
  - zero: there is no linear relationship between the two variables
- we can see that $cov(x, x) = \sigma^2(x)$
- in higher dimensions, we calculate a covariance matrix
  - the same idea: element (i,j) is equal to the covariance of the $i^{th}$ and $j^{th}$ dimensions: $A_{ij} = cov(x_i,x_j)$ 
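
A minimal sketch of the covariance formula above (with $\frac{1}{n}$ in front) and of a covariance matrix built from it. The three variables are made-up lists chosen so the results are easy to verify by hand.

```python
def covariance(x, y):
    """Mean of the products of the deviations from each variable's mean."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # increases with x
z = [5, 4, 3, 2, 1]    # decreases as x increases

print(covariance(x, y))  # 4.0  (positive)
print(covariance(x, z))  # -2.0 (negative)
print(covariance(x, x))  # 2.0  (the variance of x)

# Covariance matrix: element (i, j) is the covariance of variables i and j.
variables = [x, y, z]
cov_matrix = [[covariance(a, b) for b in variables] for a in variables]
```

The matrix is symmetric, and its diagonal holds the variances of the individual variables.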


## Correlation

- like the variance, covariance is in "weird" units
  - we divide by the standard deviations to normalize them $\Rightarrow$ standard scores (similar to z-scores)
  - $p_i = \frac{x_i - \bar{x}}{S_x} \frac{y_i - \bar{y}}{S_y}$
  - the mean value can be calculated as
  - $p = \frac{1}{n} \sum p_i = \frac{cov(x, y)}{S_x S_y}$
  - this is called Pearson's correlation coefficient
- the correlation coefficient can be in [-1; 1]
  - high absolute value $\Rightarrow$ strong correlation
  - measures the linearity of a relationship between two variables
  - cannot express other, more complex relationships 
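
The sketch below computes Pearson's coefficient by hand and shows its blind spot: a perfectly quadratic relationship gets a coefficient of 0. It assumes the same $\frac{1}{n}$ factor for the covariance and both standard deviations, so the factor cancels; the data are made up.

```python
import math

def pearson(x, y):
    """Covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)

# Perfectly linear relationship: coefficient of (almost exactly) 1.
linear = pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# Perfect but non-linear relationship (y = x^2): coefficient of 0.
xs = [-2, -1, 0, 1, 2]
quadratic = pearson(xs, [v ** 2 for v in xs])

print(round(linear, 6), round(quadratic, 6))
```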

## Scatter Plots 

- the easiest way to see how two variables are correlated
- two versions:
  - "Independent" variable - x-axis, "dependent" variable - y-axis
  - two correlated variables (we can't say which is "independent")
- in addition, outliers usually become easily visible
- best practices
  - label your axes; if needed, include a legend
  - scale / transform the variables if needed
    - simplifies the relationship
  - add trendlines if needed
    - you can also plot line charts if that's what your data suggests 


## Common Pitfalls

## Correlation Does Not Imply Causation!

- if two variables are correlated, this does not necessarily mean that the first causes the second
- example: height and weight
  - does a greater weight cause a greater height?
- we can still describe them
- we can predict height from weight and vice versa
- but that still does not say anything about one causing the other 

## Correlation vs. Causation 

- reverse causation
  - the faster the windmills rotate, the more wind there is $\Rightarrow$ Windmills cause wind
- lurking variable
  - the more firefighters there are to put out a fire, the greater the damage caused $\Rightarrow$ Firefighters being present at fires cause more damage
  - the lurking variable here is the size of the fire
- bidirectional relationship
  - predator numbers affect prey numbers, but prey numbers (amount of food) also affect predator numbers
- coincidence
  - http://tylervigen.com/spurious-correlations
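
The lurking-variable case can be illustrated with a small simulation. Everything here is an assumption made for the sketch: the fire sizes, the coefficients 2 and 5, the noise level and the seed are invented, and the size of the fire is the only thing that drives both firefighter numbers and damage.

```python
import math
import random

def pearson(x, y):
    """Pearson's correlation coefficient, computed by hand."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(42)  # fixed seed (arbitrary) so the run is repeatable

# Lurking variable: the size of each fire.
fire_size = [rng.uniform(1, 10) for _ in range(500)]

# Both quantities depend on the fire size, not on each other.
firefighters = [2 * s + rng.gauss(0, 1) for s in fire_size]
damage = [5 * s + rng.gauss(0, 1) for s in fire_size]

# Strong correlation, although neither variable causes the other.
raw_correlation = pearson(firefighters, damage)

# Once the effect of fire size is removed, the correlation disappears.
firefighters_rest = [f - 2 * s for f, s in zip(firefighters, fire_size)]
damage_rest = [d - 5 * s for d, s in zip(damage, fire_size)]
rest_correlation = pearson(firefighters_rest, damage_rest)
```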