# Statistics for hackers

Jake Vanderplas

https://speakerdeck.com/jakevdp/statistics-for-hackers



# What do we mean by hacker?
A person whose natural approach to problem solving involves writing code

# If you can write a `for` loop, you can do statistics

> Sometimes the questions are complicated and the answers are simple
  - Dr. Seuss
 
# Warm up: coin toss

You flip a coin several times and notice a very skewed result (say, 22 heads in 30 tosses)... Is it a fair coin?
  

In statistics:
* Assume the skeptic is correct: test the `Null Hypothesis` that the coin is fair
* What is the chance of a fair coin landing heads at least `X` times in `N` tosses?

Visually, you can plot the distribution of the number of heads, locate `X`, and find the probability that the result is `X` or more
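For a coin toss that tail probability can also be computed exactly from the binomial distribution. The snippet below is a minimal sketch, not code from the talk: `prob_at_least` is a hypothetical helper name, and the numbers (22 heads in 30 tosses) are borrowed from the simulation code in these notes.

```python
from math import comb

def prob_at_least(k, n, p_heads=0.5):
    """Probability of k or more heads in n tosses of a coin with P(heads) = p_heads."""
    return sum(
        comb(n, j) * p_heads**j * (1 - p_heads) ** (n - j)
        for j in range(k, n + 1)
    )

# Illustrative numbers: 22 or more heads in 30 tosses of a fair coin.
p_value = prob_at_least(22, 30)
print(f"P(at least 22 heads in 30 tosses) = {p_value:.4f}")
```

A value this small (well under 0.05) is what would lead you to reject the hypothesis that the coin is fair.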


# Recipes for approaching stats with code

## Simulation

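In code you can just simulate the sampling distribution directly: repeat the experiment many times with a fair coin and count how often the result is at least as extreme as the one observed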

In [None]:
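from numpy.random import randint

M = 0  # number of simulated experiments with at least 22 heads
for i in range(10000):
    trials = randint(2, size=30)  # 30 tosses of a fair coin (0 = tails, 1 = heads)
    if trials.sum() >= 22:
        M += 1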
        
#...

## Shuffling
Say you took a sample (like some test results) and you have two groups to compare.

Is the difference you see between the groups significant?

The traditional approach here is to do a t-test (from stats 101)

> insert "i have no idea what i'm doing" meme here



### You could import some `statsmodels` stuff to let a computer do the t-test
But still... this is for stats people, and we're not all stats people


### How to simulate
Unlike the coin toss, we can't simulate the "students" who took the test: we have no model that generates their scores

Instead, what we can do is shuffle the "labels" across the results

__motivation__: under the null hypothesis, there's no difference between the groups, so the labels don't matter


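So in code, you repeatedly:
* shuffle the labels
* rearrange the results into two groups under the new labels
* compute the mean of each group and the difference between them

The fraction of shuffles whose difference is at least as large as the observed one estimates the p-value.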
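The shuffling steps above can be sketched as follows. This is a minimal illustration using only the standard library, not the code from the slides: `permutation_test` is a hypothetical helper, the scores are made up, and the test is one-sided (group A minus group B).

```python
import random
from statistics import fmean

def permutation_test(group_a, group_b, n_shuffles=10000, seed=0):
    """Return the observed difference in means and the fraction of label
    shuffles that give a difference at least as large (one-sided)."""
    rng = random.Random(seed)
    observed = fmean(group_a) - fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)                         # shuffle the labels
        new_a, new_b = pooled[:n_a], pooled[n_a:]   # rearrange into two groups
        if fmean(new_a) - fmean(new_b) >= observed: # compare the means
            count += 1
    return observed, count / n_shuffles

# Made-up test scores, for illustration only.
group_a = [78, 85, 92, 88, 76, 95, 89, 84]
group_b = [72, 80, 69, 75, 83, 70, 77, 74, 68, 79]
diff, p_value = permutation_test(group_a, group_b)
print(f"observed difference = {diff:.2f}, estimated p-value = {p_value:.4f}")
```

Shuffling the pooled scores and slicing them back into two groups of the original sizes is equivalent to shuffling the labels.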

This works well when the `Null Hypothesis` assumes that the groups are identical

__note__: you still have to beware of sampling bias (as with all methods). You have to have representative samples


## Bootstrapping

Given a series of observations, what is the true mean (the average you would get from infinitely many observations)?
How reliable is our estimate of it?


* We don't have a model that generates the data
* We don't have groups to compare


Idea: simulate the sampling distribution by drawing samples from the observed data (with replacement)

Motivation: The data itself estimates its own distribution


Repeat this several thousand times, computing the statistic of interest (e.g. the mean) on each resample

In [None]:
# see sample code on slides
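As a stand-in for the slide code, here is a minimal sketch of bootstrapping the mean with the standard library. `bootstrap_mean` is a hypothetical helper and the observations are made up; the spread of the resampled means is used as the uncertainty of the estimate.

```python
import random
from statistics import fmean, stdev

def bootstrap_mean(data, n_resamples=10000, seed=0):
    """Resample the data with replacement and return the mean of the
    resampled means together with their standard deviation."""
    rng = random.Random(seed)
    n = len(data)
    means = [fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]
    return fmean(means), stdev(means)

# Made-up observations, for illustration only.
observations = [12.1, 9.8, 11.4, 10.3, 13.7, 8.9, 10.9, 12.6, 9.5, 11.8, 10.1, 14.2,
                9.2, 11.1, 12.9, 10.6, 8.4, 13.1, 11.6, 10.4, 9.9, 12.3, 11.0, 10.7]
estimate, error = bootstrap_mean(observations)
print(f"mean = {estimate:.2f} +/- {error:.2f}")
```

`random.choices` draws with replacement, so each resample has the same size as the data but repeats some observations and omits others.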


Bootstrapping can also be applied to linear regression: resample the (x, y) pairs and refit the line each time to see how uncertain the relationship between two variables is
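A minimal sketch of that idea, assuming an ordinary least-squares line and synthetic data (both are assumptions for illustration; `fit_line` and `bootstrap_slope` are hypothetical names, not from the talk):

```python
import random
from statistics import fmean, stdev

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def bootstrap_slope(xs, ys, n_resamples=5000, seed=0):
    """Resample (x, y) pairs with replacement and refit the line each time."""
    if len(set(xs)) < 2:
        raise ValueError("need at least two distinct x values")
    rng = random.Random(seed)
    pairs = list(zip(xs, ys))
    slopes = []
    while len(slopes) < n_resamples:
        sample = rng.choices(pairs, k=len(pairs))  # keep each x with its y
        sx, sy = zip(*sample)
        if len(set(sx)) < 2:  # degenerate resample: no line can be fit
            continue
        slopes.append(fit_line(sx, sy)[0])
    return fmean(slopes), stdev(slopes)

# Synthetic data: a line with slope 2 plus Gaussian noise.
noise = random.Random(1)
xs = list(range(25))
ys = [2.0 * x + 1.0 + noise.gauss(0, 3) for x in xs]
slope, slope_error = bootstrap_slope(xs, ys)
print(f"slope = {slope:.2f} +/- {slope_error:.2f}")
```

Resampling whole pairs, rather than x and y separately, preserves whatever relationship exists between the two variables.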


Notes on bootstrapping:

* often doesn't work well for rank-based statistics (e.g. the maximum value)
* works poorly with very few samples (N > 20 is a good rule of thumb)
* again, beware of sampling bias

## Cross Validation
(like machine learning)

Say you've found some correlation between two variables (e.g. temperature and sales).

How do you choose which model (e.g. linear, quadratic, etc.) fits the data best?

In general, a more flexible/complex model will have a lower RMS error on the data it was fit to

however... that RMS error doesn't tell the whole story: the extra flexibility may just be fitting noise


### Classic method

chi-square distribution (fancy maths we forgot and don't want to have to re-learn)

### The hacker way

* split the data into subsets, and fit each model you are comparing on one subset
* calculate the RMS error of each fit on the subset it was not fit to
* switch the subsets around and do the same again
* repeat this over and over

this way we can see which model fits best, while avoiding the risk of "overfitting" the data
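A minimal sketch of 2-fold cross-validation for polynomial models, using only the standard library. Everything here is an assumption for illustration: the data is synthetic (a noisy quadratic), and `polyfit`, `rms_error` and `two_fold_cv` are hypothetical helpers, with the fit done through the normal equations rather than any particular library.

```python
import random
from math import sqrt

def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients [c0, c1, ...] via the normal equations."""
    m = degree + 1
    # Augmented matrix [A^T A | A^T y] for the Vandermonde matrix A.
    aug = [[sum(x ** (i + j) for x in xs) for j in range(m)]
           + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(m)]
    for col in range(m):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, m):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, m + 1):
                aug[row][k] -= factor * aug[col][k]
    coeffs = [0.0] * m
    for i in reversed(range(m)):  # back substitution
        rest = sum(aug[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (aug[i][m] - rest) / aug[i][i]
    return coeffs

def rms_error(coeffs, xs, ys):
    """Root-mean-square difference between the polynomial and the data."""
    sq = [(sum(c * x ** i for i, c in enumerate(coeffs)) - y) ** 2
          for x, y in zip(xs, ys)]
    return sqrt(sum(sq) / len(sq))

def two_fold_cv(xs, ys, degree, seed=0):
    """Fit on one half, score on the other, swap the halves, average the errors."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    folds = [idx[:half], idx[half:]]
    errors = []
    for train, test in (folds, folds[::-1]):
        coeffs = polyfit([xs[i] for i in train], [ys[i] for i in train], degree)
        errors.append(rms_error(coeffs, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(errors) / 2

# Synthetic data: a quadratic trend plus a little Gaussian noise.
noise = random.Random(1)
xs = [i / 30 for i in range(30)]
ys = [1 + 2 * x - 4 * x ** 2 + noise.gauss(0, 0.05) for x in xs]
for degree in (1, 2, 3):
    print(f"degree {degree}: cross-validated RMS error = {two_fold_cv(xs, ys, degree):.3f}")
```

Because each error is measured on data the model never saw, a model that only looks better by fitting noise gains nothing from its extra flexibility.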


### Notes on cross validation
* 2-fold cross-validation: our example
* Other methods exist
  * check the scikit-learn docs
* this is the go-to method for evaluating models in machine learning


# Beyond this talk

* Bayesian methods (Cam Davidson-Pilon)
* Selection bias (Chris Fonnesbeck's SciPy 2015 talk)
* Detailed considerations
  * *Statistics is Easy!* (Shasha & Wilson)
 
