# week 2
## probability pt. 1: frequency distributions and simulation 

dr. tomomi parins-fukuchi

### what is stochasticity?
  
  - behavior described by an underlying probability distribution
  - characterized by random 'noise'
  - need to think a bit about **probabilities**


### there are debates about what probabilities actually are
  
  - one view: "probabilities are the relative frequency of an outcome in repeated trials"
  - another: "probabilities reflect degree of belief"
  - in general, we'll just think of them as **reflecting uncertainty in a 'thing'**
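
The relative-frequency view can be tried out directly by simulation. The sketch below is only an illustration: the function name `estimate_probability`, the coin-flip setup, and the default seed are assumptions made for this example, not part of the course code. It estimates the probability of 'heads' as the fraction of simulated flips that come up heads.

```python
import random

def estimate_probability(n_trials, p_true=0.5, seed=1):
    """Estimate the probability of 'heads' as its relative frequency."""
    rng = random.Random(seed)  # private generator so results are repeatable
    n_heads = 0
    for _ in range(n_trials):
        if rng.random() < p_true:  # one simulated coin flip
            n_heads += 1
    return n_heads / n_trials

# the estimate should settle near p_true as the number of trials grows
for n in [10, 1000, 100000]:
    print(n, estimate_probability(n))
```

With few trials the estimate can be far from the true value; it tends to tighten as trials accumulate.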


### probability
  
  - number between zero and one
  - an event with probability 0 is impossible
  - an event with probability 1 is certain


### random probability distributions

  - describe the probabilities of different "things"  
  - 'parametric' and 'nonparametric'

### **'parametric' distributions**

  - e.g., normal, bernoulli, exponential 
  - mathematical function that reflects probabilities of 'things' given some parameters
  - will get into this next week


### **'nonparametric' (empirical) distributions**

  - estimated from some data ('empirical' == 'observed')
  - what proportion of the time does an event happen? 
  - data could be real or simulated
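
As a rough sketch of what "estimated from some data" means, the proportions can be tallied directly from a list of observations. The helper `empirical_distribution` and the `diets` list are hypothetical, made up purely for illustration.

```python
from collections import Counter

def empirical_distribution(observations):
    """Return the proportion of observations falling on each distinct value."""
    counts = Counter(observations)
    total = len(observations)
    return {value: count / total for value, count in counts.items()}

# hypothetical observations, made up purely for illustration
diets = ["herbivore", "carnivore", "herbivore", "omnivore", "herbivore"]
print(empirical_distribution(diets))
```

The proportions always sum to one, so the result behaves like a probability distribution over the observed values.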


In [None]:
## a histogram is a simple frequency distribution

import numpy as np
import matplotlib.pyplot as plt


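# read every line, closing the file automatically when done
with open("mam_bs.tab","r") as fl:
    lines = fl.readlines()

masses = []
for line in lines[1:]: # skip the first line
    spls = line.strip().split(",")
    mass = spls[-1]
    try:
        mass = float(mass)
    except ValueError: # skip entries that are not numbers (e.g., "NA")
        continue
    masses.append(mass)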

plt.hist(masses)
plt.show()

### **simulation**
  - we can simulate probabilities and probability distributions
  - this requires some kernel of randomness
  - typically, we use 'pseudo-randomness'
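
One way to see that the randomness is 'pseudo' is to fix the generator's seed: the same seed replays the same sequence of draws. A minimal sketch, where the seed value 42 is an arbitrary choice:

```python
import random

random.seed(42)   # 42 is an arbitrary choice of seed
first = [random.random() for _ in range(3)]

random.seed(42)   # resetting the seed replays the same sequence
second = [random.random() for _ in range(3)]

print(first == second)
```

Fixing a seed is also a handy way to make a simulation reproducible.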

In [None]:
import random

In [None]:
print(random.random())

### **computation of stochasticity**

  - general approach is to make lots of small random draws over and over
  - together, these draws build up an instance of a distribution or stochastic process
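
As one possible illustration, many yes/no draws can be added up into a single count, and repeating that many times builds up a whole distribution of counts. The function `count_successes` and the numbers used (10 draws, success probability 0.3, 5000 repeats) are assumptions chosen for this sketch.

```python
import random
from collections import Counter

def count_successes(n_draws, p):
    """Add up n_draws yes/no draws, each a success with probability p."""
    total = 0
    for _ in range(n_draws):
        if random.random() < p:
            total += 1
    return total

# repeat the whole experiment many times to build up a distribution
outcomes = [count_successes(10, 0.3) for _ in range(5000)]
freqs = Counter(outcomes)
for k in sorted(freqs):
    print(k, freqs[k] / len(outcomes))
```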


In [None]:
r = [random.random() for _ in range(30)]
#plt.hist(r)

### **statistical uncertainty**
  - can use this approach to estimate certainty of a statistical estimate
  - **bootstrap**: see how an estimate varies over many repeated resamplings of your data
  - mimics the process of sampling a dataset many times in the real world 

In [None]:
## resampling data can be done several ways

## with replacement

values = ["a","b","c","d","e"]
n_vals = len(values)

## let's say we want to subsample 3 values:

subsamp = []
for _ in range(3):
    pick_i = int(random.random() * n_vals)
    pick = values[pick_i]
    subsamp.append(pick)
    
print(subsamp)

In [None]:
## resampling data can be done several ways

## WITHOUT replacement

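values = ["a","b","c","d","e"]

## let's say we want to subsample 3 values:

remaining = list(values) # copy, so the original list is left untouched
subsamp = []
for _ in range(3):
    pick_i = int(random.random() * len(remaining))
    pick = remaining.pop(pick_i) # remove the pick so it cannot be drawn again
    subsamp.append(pick)
    
print(subsamp)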
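
For comparison, Python's standard library covers both kinds of sampling: `random.choices` draws with replacement and `random.sample` draws without. This is only a sketch of the built-in shortcuts, not a replacement for writing the loops by hand.

```python
import random

values = ["a", "b", "c", "d", "e"]

with_repl = random.choices(values, k=3)    # duplicates are possible
without_repl = random.sample(values, 3)    # every pick is distinct

print(with_repl)
print(without_repl)
```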

In [None]:
## local vs global variables, lists, without replacement



### statistical uncertainty

- gauge uncertainty around estimate of body mass means for mammal orders
- first need a function to calculate means:

In [None]:
def calc_mean(vals):
    x = 0.
    for i in vals:
        x += i
    av = x / float( len(vals) )
    return av

### statistical uncertainty

- next can create dictionary to store data by order
    - orders as keys, list of log-transformed body masses as values

In [None]:
with open("mam_bs.tab","r") as fl:
    lines = fl.readlines()

orders = {}
for line in lines:
    spls = line.strip().split(",")
    order = spls[0]
    mass = spls[-1]
    try:
        # log-transform the mass after dividing by 1000
        log_mass = np.log(float(mass)/1000.)
    except ValueError: # skip non-numeric entries such as "NA"
        continue
    # create the list for this order the first time we see it
    if order not in orders:
        orders[order] = []
    orders[order].append(log_mass)


In [None]:
## let's focus on 'Artiodactyla'

keep_ord = "Artiodactyla"
mean_mass = calc_mean(orders[keep_ord])
plt.hist(orders[keep_ord])
plt.axvline(mean_mass,color="black")
plt.show()

### statistical uncertainty

- now we can resample (bootstrap) our means to estimate uncertainty around this estimate
- first, write a function to randomly sample our data **with** replacement

In [None]:
def resample_mean(vals): # resample with replacement
    resamp_vals = []
    n = len(vals)
    for _ in range(n):
        r = random.random()
        ind = int(r * n)
        pick = vals[ind]
        resamp_vals.append(pick)
    mean = calc_mean(resamp_vals)
    return mean

### statistical uncertainty

- next, need to create another function to repeat this resampling many times
- **how does the estimate change every time the data is resampled?**
- we are simulating the effects of sampling a dataset

In [None]:
def bootstrap_mean(vals, reps):
    means = []
    for _ in range(reps):
        sim_mean = resample_mean(vals)
        means.append(sim_mean)
    return means

In [None]:
# put it all together

import seaborn as sns

simmeans = bootstrap_mean(orders[keep_ord],5000)
#sns.histplot(simmeans,bins=20,stat="probability",kde=True)
plt.hist(simmeans,density=True)
plt.axvline(mean_mass,color="black")
plt.show()
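
One common way to summarize the spread of bootstrapped means is a percentile interval: sort them and read off the values that cut off the lower and upper tails. The helper `percentile_interval` below is a hypothetical name, and picking indices by truncation is a crude simplification, so treat this as a rough sketch rather than a definitive method.

```python
def percentile_interval(samples, level=0.95):
    """Return (lower, upper) bounds enclosing the central `level` of samples."""
    ordered = sorted(samples)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lo_i = int(tail * n)                      # index cutting off the lower tail
    hi_i = min(int((1.0 - tail) * n), n - 1)  # index cutting off the upper tail
    return ordered[lo_i], ordered[hi_i]

# stand-in values so the sketch runs alone; in the notebook, pass `simmeans`
example_means = [i / 100.0 for i in range(1000)]
print(percentile_interval(example_means))
```

In the notebook, `percentile_interval(simmeans)` would give an approximate 95% interval around the mean log mass.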

### significance

- can use simulated resampling to create underlying distribution for significance tests
- are mean masses significantly different for artiodactyla and carnivora?

In [None]:
artio_masses = orders["Artiodactyla"]
carn_masses = orders["Carnivora"]
artio_mean = calc_mean(artio_masses)
carn_mean  = calc_mean(carn_masses)

emp_diff = artio_mean-carn_mean
print(emp_diff)

### significance

- our approach will be to repeatedly resample mass values from the larger sample of all mammals to estimate the distribution of mean differences we would expect if both groups were drawn at random from the same pool

In [None]:
def resample_mean_n(vals,n): # draw n values from vals with replacement
    resamp_vals = []
    n_vals = len(vals)
    for _ in range(n):
        r = random.random()
        ind = int(r * n_vals) # index over the whole list, not just the first n values
        pick = vals[ind]
        resamp_vals.append(pick)
    mean = calc_mean(resamp_vals)
    return mean

In [None]:
# log-transform all of the mass values

log_masses = [np.log(i) for i in masses]

differences = []
for _ in range(10000):
    artio_resamp = resample_mean_n(log_masses,len(artio_masses))
    carn_resamp = resample_mean_n(log_masses, len(carn_masses))
    diff = artio_resamp-carn_resamp
    differences.append(diff)

sns.histplot(differences,stat="probability")
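
To turn the simulated differences into a significance statement, one option is to ask what proportion of them are at least as extreme as the observed difference. The function `two_sided_p` is a hypothetical helper, and comparing absolute values assumes a two-sided test against a distribution centered near zero.

```python
def two_sided_p(null_diffs, observed):
    """Proportion of simulated differences at least as extreme as `observed`."""
    n_extreme = 0
    for d in null_diffs:
        if abs(d) >= abs(observed):
            n_extreme += 1
    return n_extreme / len(null_diffs)

# stand-in values so the sketch runs alone; in the notebook, use
# two_sided_p(differences, emp_diff)
example_null = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
print(two_sided_p(example_null, 1.5))
```

A small proportion suggests the observed difference would be unusual if both groups were random draws from the pooled masses.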