# Resampling Techniques

## The Bootstrap

The bootstrap is a widely applicable and extremely powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.

 - As an example: the bootstrap can be used to estimate the standard errors of the coefficients from a linear regression fit.
 
 - Similarly, we will see later that random forests use bootstrapping to reduce variance.
 

__Bootstrapping:__ In bootstrapping, we create new samples. Rather than repeatedly obtaining independent data sets from the population, we obtain distinct data sets by repeatedly sampling observations from the original data set. __Important point:__ the sampling is performed with replacement.
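
As a minimal sketch of what sampling with replacement looks like, the block below draws one bootstrap sample from a small toy array. The array `data` and the seed are assumptions made purely for illustration; they are not the data used later in this notebook.

```python
import numpy as np

# Hypothetical toy data standing in for the original sample.
data = np.arange(10)
n = len(data)

rng = np.random.default_rng(0)

# Draw n indices uniformly at random, with replacement,
# and use them to pick observations from the original sample.
idx = rng.integers(0, n, size=n)
boot = data[idx]

# Because we sample with replacement, some observations usually
# appear more than once and others do not appear at all.
n_unique = len(np.unique(boot))
print(boot, n_unique)
```

The bootstrap sample has the same size as the original sample, but it typically contains fewer distinct observations.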

### Scenario

Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of X and Y, respectively, where X and Y are random quantities. We will invest a fraction $\alpha$ of our money in X and the remaining $1-\alpha$ in Y. Our goal is to minimize the risk, that is, the variance of our investment, $\text{Var}(\alpha X + (1-\alpha)Y)$.


- The value of $\alpha$ that minimizes the risk is given by:

$$ \alpha = \frac{\sigma^{2}_{Y} - \sigma_{XY}}{\sigma^{2}_{X} + \sigma^{2}_{Y} - 2\sigma_{XY}}$$

Here $\sigma^{2}_{X} = \text{Var}(X)$, $\sigma^{2}_{Y} = \text{Var}(Y)$, and $\sigma_{XY} = \text{Cov}(X,Y)$.
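
The expression for $\alpha$ follows from a short calculation, sketched here under the assumption that $\text{Var}(X - Y) = \sigma^{2}_{X} + \sigma^{2}_{Y} - 2\sigma_{XY} > 0$. Expanding the variance of the portfolio gives

$$ \text{Var}(\alpha X + (1-\alpha)Y) = \alpha^{2}\sigma^{2}_{X} + (1-\alpha)^{2}\sigma^{2}_{Y} + 2\alpha(1-\alpha)\sigma_{XY} $$

Setting the derivative with respect to $\alpha$ to zero:

$$ 2\alpha\sigma^{2}_{X} - 2(1-\alpha)\sigma^{2}_{Y} + 2(1-2\alpha)\sigma_{XY} = 0 $$

Collecting the terms in $\alpha$:

$$ \alpha\left(\sigma^{2}_{X} + \sigma^{2}_{Y} - 2\sigma_{XY}\right) = \sigma^{2}_{Y} - \sigma_{XY} $$

Dividing by the bracketed term recovers the formula. The second derivative is $2(\sigma^{2}_{X} + \sigma^{2}_{Y} - 2\sigma_{XY})$, which is positive under the stated assumption, so this stationary point is a minimum.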


__In reality:__ These quantities are not known, since they are parameters of the population.

__In practice:__ We can estimate $\alpha$ from a sample.

In [1]:
import numpy as np
import pickle

with open('sample_np.pickle', 'rb') as handle:
    sample = pickle.load(handle)

In [6]:
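## write a function that returns
## alpha for a given sample
## Then use this function to give 
## an estimate for alpha using sample above.

# Exploratory checks: the 2x2 sample covariance matrix, and the sample
# variance of X, which should match its top-left entry.
np.cov(sample.T)

x = sample[:, 0]
np.var(x, ddof=1)

def alpha_hat(sample):
    """Estimate alpha from an (n, 2) array whose columns are X and Y."""
    # cov[0, 0] = Var(X), cov[1, 1] = Var(Y), cov[0, 1] = Cov(X, Y)
    cov = np.cov(sample.T)
    num = cov[1, 1] - cov[0, 1]
    denom = cov[0, 0] + cov[1, 1] - 2 * cov[0, 1]
    return num / denom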
# %load -r 1-3 supplement.py



In [7]:
alpha_hat(sample)

0.539010703369548

<img src="img/bootstrap.png" alt="alt text" width="400"/>

In [8]:
## apply bootstrapping 1000 times 
## find alpha_hat for each of the resamples

In [9]:
from sklearn.utils import resample

## resample documentation

## https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html
    
    
bootstrap1 = resample(sample, 
                      replace = True,
                      n_samples = 100, random_state = 111119)

In [10]:
# %load -r 6-9 supplement.py
# Draw 1000 bootstrap resamples and compute alpha_hat on each one.
a = [alpha_hat(resample(sample, 
                    replace = True, 
                    n_samples = 100)) for i in range(1000)]


In [None]:
## Here find the standard error of alpha_hat

## --your answer here

Note that in the scenario above the true values were:

$\alpha = 0.6$

and $\text{SE}(\hat{\alpha}) = 0.083$
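
One way to see how a bootstrap standard error can be obtained is the sketch below. It does not use `sample_np.pickle`. Instead it generates a synthetic stand-in data set (an assumption of this example) with a covariance matrix chosen so that the formula gives $\alpha = 0.6$. The helper `bootstrap_se` and its parameters `n_boot` and `seed` are hypothetical names, and `alpha_hat` is redefined here only so that the block runs on its own.

```python
import numpy as np

def alpha_hat(sample):
    """Plug-in estimate of alpha from an (n, 2) array with columns X and Y."""
    cov = np.cov(sample.T)
    return (cov[1, 1] - cov[0, 1]) / (cov[0, 0] + cov[1, 1] - 2 * cov[0, 1])

def bootstrap_se(sample, n_boot=1000, seed=0):
    """Standard deviation of alpha_hat over n_boot bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Resample whole rows with replacement, keeping each (X, Y) pair together.
        idx = rng.integers(0, n, size=n)
        estimates[b] = alpha_hat(sample[idx])
    # The bootstrap standard error is the spread of the resampled estimates.
    return estimates.std(ddof=1)

# Synthetic stand-in data (assumption): Var(X) = 1, Var(Y) = 1.25, Cov(X, Y) = 0.5,
# so the formula gives alpha = (1.25 - 0.5) / (1 + 1.25 - 1) = 0.6.
rng = np.random.default_rng(42)
toy = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.25]], size=100)

estimate = alpha_hat(toy)
se = bootstrap_se(toy)
print(estimate, se)
```

Rows are resampled as pairs so that the dependence between X and Y is preserved in every bootstrap sample.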

__Your turn__: 

1. What is the probability that the first bootstrap observation is not the jth observation from the original sample? Justify your answer.

---- your answer here


2. What is the probability that the second bootstrap observation is not the jth observation from the original sample?

--- your answer here

3. Argue that the probability that the jth observation is not in the bootstrap sample is $(1-\frac{1}{n})^{n}$.

-- Your answer here

4. When n = 100, what is the value of this probability?

-- your answer here

5. Using the sample given below, draw 1000 bootstrap samples and count how many of them contain observation 4.


In [65]:
# use this data as the initial sample

np.random.seed(111119)
X = np.random.normal(loc = 10, scale =3, size = 100)


In [92]:
#%load -r 12-15 supplement.py


## Exit Ticket

[Exit Ticket for resampling Methods](https://docs.google.com/forms/d/1Nbpknpsr2X8k4L6CxNdbhLqB8F2hcWkGv3hqEVpnavA/viewform?edit_requested=true)

## Further readings and resources

[Introduction to Statistical Learning - 5.2](http://faculty.marshall.usc.edu/gareth-james/ISL/)

[A Gentle Introduction to Bootstrapping](https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/)

[Monte Carlo Wikipedia](https://en.wikipedia.org/wiki/Monte_Carlo_method)

[Python-for-Probability-Statistics-And-Machine-Learning - ch: 2.8](https://www.amazon.com/Python-Probability-Statistics-Machine-Learning/dp/3319307150)