# week 1
## probability pt. 1: frequency distributions and simulation 

dr. tomomi parins-fukuchi

### what is stochasticity?
  
  - behavior described by an underlying probability distribution
  - characterized by random 'noise'
  - need to think a bit about **probabilities**


### there are debates about what probabilities actually are
  
  - one view: "probabilities are relative frequency of an outcome in repeated trials"
  - another: "probabilities reflect degree of belief"
  - in general, we'll just think of them as **reflecting uncertainty in a 'thing'**
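
The "relative frequency" view can be tried out directly. The sketch below is a minimal illustration, not part of the course code: the function name `relative_frequency`, the coin with probability 0.5, and the fixed seed are all assumptions made for the example. It flips a simulated coin many times and reports the fraction of heads, which should settle near 0.5 as the number of trials grows.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable (an assumption for this example)

def relative_frequency(n_trials, p_heads=0.5):
    """Fraction of simulated coin flips that come up heads."""
    heads = 0
    for _ in range(n_trials):
        # random.random() is uniform on [0, 1), so this is true with probability p_heads
        if random.random() < p_heads:
            heads += 1
    return heads / n_trials

# the relative frequency wanders for few trials and stabilizes for many
for n in [10, 100, 10000]:
    print(n, relative_frequency(n))
```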


### probability
  
  - number between zero and one
  - an event with probability 0 is impossible
  - an event with probability 1 is certain


### random probability distributions

  - describe the probabilities of different "things"  
  - 'parametric' and 'nonparametric'

### **'parametric' distributions**

  - e.g., normal, bernoulli, exponential 
  - mathematical function that reflects probabilities of 'things' given some parameters
  - will get into this next week


### **'nonparametric' (empirical) distributions**

  - estimated from some data ('empirical' == 'observed')
  - what proportion of the time does an event happen? 
  - data could be real or simulated
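
As a rough sketch of what "estimated from some data" means, the block below builds an empirical distribution by hand from the Ovechkin season totals used in this notebook's histogram cell. The 50-goal threshold and the bin width of 10 are arbitrary choices made for illustration.

```python
from collections import Counter

# ovechkin's goals per season, copied from the notebook's data cell
ovi_goals = [52,46,65,56,50,32,38,32,51,53,50,33,49,51,48,24,50,42,31]

# "what proportion of the time does an event happen?"
# here the event is "scored at least 50 goals in a season"
n_50 = sum(1 for g in ovi_goals if g >= 50)
prop_50 = n_50 / len(ovi_goals)
print("proportion of 50+ goal seasons:", prop_50)

# group seasons into bins of width 10 (20-29, 30-39, ...) and count them
bins = Counter((g // 10) * 10 for g in ovi_goals)

# divide counts by the number of seasons to get proportions that sum to 1
empirical = {b: count / len(ovi_goals) for b, count in sorted(bins.items())}
for b, p in empirical.items():
    print(b, "to", b + 9, ":", round(p, 3))
```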


In [None]:
## a histogram is a simple frequency distribution

import numpy as np
import matplotlib.pyplot as plt

ovi_goals =     [52,46,65,56,50,32,38,32,51,53,50,33,49,51,48,24,50,42,31]
gretzky_goals = [51,55,92,71,87,73,52,62,40,54,40,41,31,16,38,11,15,8 ,25,23,9]

print(sum(ovi_goals))
print(sum(gretzky_goals))

In [None]:
plt.hist(ovi_goals,label="ovechkin",alpha=0.5)
plt.hist(gretzky_goals,label="gretzky",alpha=0.5)
plt.legend()
plt.xlabel("number of goals")
plt.ylabel("number of seasons")
plt.show()

In [None]:
## give every season a weight of 1/n so that bar heights are proportions rather than counts
weights = [i / len(ovi_goals) for i in [1.0] * len(ovi_goals)]
plt.hist(ovi_goals,label="ovechkin",alpha=0.5,weights=weights)
weights = [i / len(gretzky_goals) for i in [1.0] * len(gretzky_goals)]
plt.hist(gretzky_goals,label="gretzky",alpha=0.5,weights=weights)
plt.xlabel("number of goals")
plt.ylabel("proportion of seasons")
plt.legend()

plt.show()

### **simulation**
  - stochastic simulation will be a big focus of this course
      - what distribution of outcomes do we see if we simulate an event many times?
  - this requires some kernel of randomness

In [None]:
import random

In [None]:
print(random.random())

### **computation of stochasticity**

  - general approach is to draw lots of random numbers over and over
  - combined, these small random draws build up an instance of a distribution or stochastic process
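
One way to picture this, as a sketch under assumptions: each call to `random.random()` decides a single yes/no outcome, and adding up several of them gives one draw from a larger distribution. The helper name `count_successes`, the success probability 0.3, and the 10 draws per repeat are all hypothetical choices for the example.

```python
import random
from collections import Counter

random.seed(2)  # fixed seed so the sketch is repeatable (an assumption for this example)

def count_successes(n_draws, p):
    """Add up n_draws yes/no outcomes, each decided by one uniform random number."""
    return sum(1 for _ in range(n_draws) if random.random() < p)

# repeat the whole process many times to see the distribution of outcomes
outcomes = [count_successes(10, 0.3) for _ in range(5000)]

# tabulate how often each total occurred, as a proportion of all repeats
freqs = Counter(outcomes)
for k in sorted(freqs):
    print(k, freqs[k] / len(outcomes))
```

Each repeat on its own is just one number; the distribution only shows up once many repeats are collected.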


In [None]:
r = [random.random() for _ in range(30)]
#plt.hist(r)

### **simulation**
  - **bootstrap**: resample the data with replacement a bunch of times and see how the outcome varies
  - simulates the act of collecting a dataset many times
      - mimics the process of sampling a dataset many times in the real world 
      - we are "replaying the tape" or "rerolling the dice" 
    

In [None]:
## resampling data can be done several ways

## WITHOUT replacement

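## copy the data so that removing picked values leaves ovi_goals untouched
pool = list(ovi_goals)

## let's say we want to subsample 3 values:

subsamp = []
for _ in range(3):
    pick_i = int(random.random() * len(pool))
    pick = pool.pop(pick_i)  # remove the picked season so it cannot be drawn again
    subsamp.append(pick)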
    
print(subsamp)

In [None]:
## resampling data can be done several ways

## with replacement

n_vals = len(ovi_goals)

## let's say we want to subsample 3 values:

subsamp = []
for _ in range(3):
    pick_i = int(random.random() * n_vals)
    pick = ovi_goals[pick_i]
    subsamp.append(pick)
    
print(subsamp)
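
For comparison, Python's standard library offers both kinds of resampling directly: `random.sample` draws without replacement and `random.choices` draws with replacement. The block below is a small sketch of the two side by side; the variable names are made up for the example.

```python
import random

# ovechkin's goals per season, copied from the notebook's data cell
ovi_goals = [52,46,65,56,50,32,38,32,51,53,50,33,49,51,48,24,50,42,31]

# WITHOUT replacement: each season (list position) can be picked at most once
no_repl = random.sample(ovi_goals, k=3)

# WITH replacement: the same season may be drawn more than once
with_repl = random.choices(ovi_goals, k=3)

# a full bootstrap replicate: as many draws as there are seasons, with replacement
boot = random.choices(ovi_goals, k=len(ovi_goals))

print(no_repl)
print(with_repl)
print(boot)
```

Note that `random.sample` picks by position, so a value that occurs in several seasons (50 appears three times here) can still show up more than once in its output.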

### simulation

- we can use this technique to simulate random careers for both hockey players
- what would the frequency distributions of number of goals scored for each player look like in an alternate universe?

In [None]:
## resampling data can be done several ways

## with replacement

n_vals = len(ovi_goals)

# let's resample ovechkin's goals
subsamp_ovi = []
for _ in range(len(ovi_goals)):
    pick_i = int(random.random() * n_vals)
    pick = ovi_goals[pick_i]
    subsamp_ovi.append(pick)

n_vals = len(gretzky_goals)

# let's resample gretzky's goals
subsamp_gretzky = []
for _ in range(len(gretzky_goals)):
    pick_i = int(random.random() * n_vals)
    pick = gretzky_goals[pick_i]
    subsamp_gretzky.append(pick)

In [None]:
weights = [i / len(ovi_goals) for i in [1.0] * len(ovi_goals)]
plt.hist(subsamp_ovi,label="ovechkin",alpha=0.5,weights=weights)

weights = [i / len(gretzky_goals) for i in [1.0] * len(gretzky_goals)]
plt.hist(subsamp_gretzky,label="gretzky",alpha=0.5,weights=weights)
plt.xlabel("number of goals")
plt.ylabel("proportion of seasons")
plt.legend()

plt.show()

### statistical predictions

- can use stochastic simulations to make predictions
- ovechkin has fewer total goals over his career than gretzky, but has also played two fewer full seasons
- how often does ovechkin surpass gretzky if playing the same number of seasons?

In [None]:
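sums_ovi = []

## number of extra seasons ovechkin needs to match gretzky's career length
n_extra = len(gretzky_goals) - len(ovi_goals)

for _ in range(1000):
    subsamp = []
    for _ in range(n_extra):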
        pick_i = int(random.random() * len(ovi_goals))
        pick   = ovi_goals[pick_i]
        subsamp.append(pick)
    sums_ovi.append(sum(ovi_goals + subsamp))

In [None]:
weights = [i / len(sums_ovi) for i in [1.0] * len(sums_ovi)]

plt.hist(sums_ovi,label="ovechkin",alpha=0.5,weights=weights)
plt.vlines(x=sum(gretzky_goals),ymin=0,ymax=0.23)
plt.xlabel("number of goals")
plt.ylabel("proportion of replicates")
plt.show()

### statistical predictions

- we are **simulating** two additional seasons for Ovechkin and seeing how often he outscores Gretzky
    - we do this by drawing two seasons at random from Ovechkin's career many times
    - we are essentially _replaying_ these two seasons for Ovechkin many times and seeing what happens
- our **prediction** is that Ovechkin should outscore Gretzky, given enough playing time
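
The "how often" question can also be answered with a number rather than read off a plot. The sketch below is one possible way to count it, with made-up names (`n_extra`, `n_surpass`, `prop`) and `random.choice` standing in for the index-based picking used in the notebook cells.

```python
import random

random.seed(3)  # fixed seed so the sketch is repeatable (an assumption for this example)

# goals per season, copied from the notebook's data cell
ovi_goals =     [52,46,65,56,50,32,38,32,51,53,50,33,49,51,48,24,50,42,31]
gretzky_goals = [51,55,92,71,87,73,52,62,40,54,40,41,31,16,38,11,15,8 ,25,23,9]

target = sum(gretzky_goals)                      # gretzky's actual career total in this data
n_extra = len(gretzky_goals) - len(ovi_goals)    # seasons ovechkin is "missing"

n_reps = 1000
n_surpass = 0
for _ in range(n_reps):
    # simulate the extra seasons by drawing from ovechkin's observed seasons
    extra = [random.choice(ovi_goals) for _ in range(n_extra)]
    if sum(ovi_goals) + sum(extra) > target:
        n_surpass += 1

prop = n_surpass / n_reps
print("proportion of replicates where ovechkin surpasses gretzky:", prop)
```

With these data the proportion is 1.0 in every run: Ovechkin's total is 853, Gretzky's is 894, and even drawing his lowest season (24 goals) twice gives 901. The simulation can only produce seasons that have already been observed, so it cannot represent a season worse than his worst.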

### statistical predictions

- we may wish to know how many goals we _expect_ Ovechkin to score
- one measure of this is simply the **mean** over all the simulations

In [None]:
def calc_mean(values):
    tally = 0
    for i in values:
        tally += i
    mean = tally / float(len(values))
    return mean

In [None]:
print("ovi expected: ",calc_mean(sums_ovi))
print("gretzky actual: ", sum(gretzky_goals))
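
As a sanity check on the simulated mean, the expected total can also be worked out directly: each resampled season contributes, on average, the mean of Ovechkin's observed seasons, so two extra seasons should add about twice that mean. The sketch below compares the two; the names `analytic` and `sim_mean` are made up for the example, and it rebuilds the simulation itself rather than reusing `sums_ovi`.

```python
import random

random.seed(4)  # fixed seed so the sketch is repeatable (an assumption for this example)

# ovechkin's goals per season, copied from the notebook's data cell
ovi_goals = [52,46,65,56,50,32,38,32,51,53,50,33,49,51,48,24,50,42,31]

# analytic expectation: observed total plus two "average" seasons
mean_season = sum(ovi_goals) / len(ovi_goals)
analytic = sum(ovi_goals) + 2 * mean_season

# simulated expectation: mean over many replicates of two resampled seasons
sims = [sum(ovi_goals) + random.choice(ovi_goals) + random.choice(ovi_goals)
        for _ in range(1000)]
sim_mean = sum(sims) / len(sims)

print("analytic expectation: ", analytic)
print("simulated expectation:", sim_mean)
```

The two numbers should agree closely, and the gap should shrink as the number of replicates grows.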

### why does this work?

- we are using randomness to see what range of values we might expect to encounter based on those we have seen
- sampling randomly over ovechkin's career uses stochasticity to simulate additional seasons

### statistical predictions

- what if we "replay the tape" of both Ovechkin's and Gretzky's careers many times?
- what is the resulting distribution of all of their goal tallies?

In [None]:
# let's make a general function to resample a list many times and collect the sum of each resampled list

def resample_sum(vals, n_draws, n_rep=1000): # resample with replacement
    sums = []
    for _ in range(n_rep):  # build n_rep resampled lists and calculate the sum of goals for each
        subsamp = []
        for _ in range(n_draws):  # each resampled list holds n_draws values
            pick_i = int(random.random() * len(vals))
            pick = vals[pick_i]
            subsamp.append(pick)
        sums.append(sum(subsamp))
    return sums

In [None]:
sums_ovi = resample_sum(ovi_goals,len(gretzky_goals))
sums_gretzky = resample_sum(gretzky_goals, len(gretzky_goals))
weights = [i / len(sums_ovi) for i in [1.0] * len(sums_ovi)]

plt.hist(sums_gretzky,label="gretzky",alpha=0.5,weights=weights)
plt.hist(sums_ovi,label="ovechkin",alpha=0.5,weights=weights)
plt.xlabel("number of goals")
plt.ylabel("proportion of simulations")
plt.legend()
plt.vlines(x=sum(gretzky_goals),ymin=0,ymax=0.3)
plt.show()
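
The overlap between the two histograms can be summarized as a single proportion: in what fraction of paired replays does Ovechkin's resampled career total beat Gretzky's? The block below is a sketch of that comparison, with a hypothetical helper `resample_total` built on `random.choices` rather than the notebook's `resample_sum`, and an arbitrary choice of 10000 replicates.

```python
import random

random.seed(5)  # fixed seed so the sketch is repeatable (an assumption for this example)

# goals per season, copied from the notebook's data cell
ovi_goals =     [52,46,65,56,50,32,38,32,51,53,50,33,49,51,48,24,50,42,31]
gretzky_goals = [51,55,92,71,87,73,52,62,40,54,40,41,31,16,38,11,15,8 ,25,23,9]

def resample_total(vals, n_draws):
    """Career total for one replayed career: n_draws seasons drawn with replacement."""
    return sum(random.choices(vals, k=n_draws))

n_seasons = len(gretzky_goals)  # replay both careers over the same number of seasons
n_reps = 10000

wins = 0
for _ in range(n_reps):
    # pair one replayed ovechkin career with one replayed gretzky career
    if resample_total(ovi_goals, n_seasons) > resample_total(gretzky_goals, n_seasons):
        wins += 1

prop_ovi_ahead = wins / n_reps
print("proportion of replays where ovechkin finishes ahead:", prop_ovi_ahead)
```

Unlike the earlier comparison against Gretzky's fixed total, both careers vary here, so the answer is a proportion strictly between 0 and 1 rather than a certainty.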