# Data, Learning, and Algorithms

## Week 2 Resampling

Trent Potter - '25 MBA

## Overview

1. Dataset & "Eigenratio" generation
2. Bootstrap to Calculate $\hat{se}_{boot}(\hat{\theta})$
3. Jackknife to Calculate $\hat{se}_{jack}(\hat{\theta})$


### 1) Dataset & Eigenratio Generation

This section reproduces Table 10.2 from CASI (*Computer Age Statistical Inference*). The data are available here: [student_score.txt](https://hastie.su.domains/CASI_files/DATA/student_score.txt)


In [2]:
import pandas as pd
import numpy as np

pd.set_option('display.precision', 2)

In [3]:
# Load data
scores_df = pd.read_csv('student_score.txt',sep=' ')
scores_df.head()

| | mech | vecs | alg | analy | stat |
|---|---|---|---|---|---|
| 0 | 7 | 51 | 43 | 17 | 22 |
| 1 | 44 | 69 | 53 | 53 | 53 |
| 2 | 49 | 41 | 61 | 49 | 64 |
| 3 | 59 | 70 | 68 | 62 | 56 |
| 4 | 34 | 42 | 50 | 47 | 29 |


In [4]:
# Matches the textbook table
corr = scores_df.corr()
corr.style.background_gradient(cmap='coolwarm', vmin=-1, vmax=1)

| | mech | vecs | alg | analy | stat |
|---|---|---|---|---|---|
| mech | 1.0 | 0.497807 | 0.756036 | 0.653476 | 0.535774 |
| vecs | 0.497807 | 1.0 | 0.592262 | 0.507135 | 0.378604 |
| alg | 0.756036 | 0.592262 | 1.0 | 0.762755 | 0.669825 |
| analy | 0.653476 | 0.507135 | 0.762755 | 1.0 | 0.737671 |
| stat | 0.535774 | 0.378604 | 0.669825 | 0.737671 | 1.0 |


In [5]:
eigenvalues, eigenvectors = np.linalg.eig(corr)
print("Eigenvalues:", eigenvalues)
print("Eigen Ratio:", eigenvalues / sum(eigenvalues))

Eigenvalues: [3.46267658 0.6599222  0.44717458 0.19665198 0.23357465]
Eigen Ratio: [0.69253532 0.13198444 0.08943492 0.0393304  0.04671493]
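Note that the eigenvalues printed above are not sorted (0.197 comes before 0.234): `np.linalg.eig` makes no promise about ordering. As a minimal sketch, assuming the rounded correlation values shown in the output above, `np.linalg.eigvalsh` (intended for symmetric matrices) returns real eigenvalues in ascending order, so the largest is always the last entry. The variable names here are illustrative.

```python
import numpy as np

# Correlation matrix copied from the output above (rounded to 6 decimals).
corr = np.array([
    [1.0,      0.497807, 0.756036, 0.653476, 0.535774],
    [0.497807, 1.0,      0.592262, 0.507135, 0.378604],
    [0.756036, 0.592262, 1.0,      0.762755, 0.669825],
    [0.653476, 0.507135, 0.762755, 1.0,      0.737671],
    [0.535774, 0.378604, 0.669825, 0.737671, 1.0],
])

# eigvalsh assumes a symmetric matrix and returns eigenvalues in ascending order.
eigenvalues = np.linalg.eigvalsh(corr)

# Reverse to descending order so the leading eigenvalue comes first.
eigenvalues_desc = eigenvalues[::-1]
ratios = eigenvalues_desc / eigenvalues_desc.sum()

print("Eigenvalues (descending):", eigenvalues_desc)
print("Leading eigenratio:", ratios[0])
```

With sorted eigenvalues, indexing the leading one no longer depends on the order the solver happens to return.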


In [6]:
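# Wrap the steps above into a function
def eigen_ratio(observations: np.ndarray) -> float:
    """Return the largest eigenvalue of the correlation matrix divided by the sum of all eigenvalues."""
    corr = np.corrcoef(observations.T)
    eigenvalues, _ = np.linalg.eig(corr)
    # np.linalg.eig does not guarantee any ordering, so take the maximum
    # explicitly rather than assuming eigenvalues[0] is the largest.
    return eigenvalues.max() / sum(eigenvalues)

# Validate against the first pass
eigen_ratio(scores_df.to_numpy())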

np.float64(0.6925353153076885)
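
For reference, the statistic that `eigen_ratio` is meant to compute can be written as follows, where $\hat{\lambda}_1 \ge \dots \ge \hat{\lambda}_5$ denote the eigenvalues of the sample correlation matrix (this notation is assumed here, not taken from the notebook):

$$
\hat{\theta} = \frac{\hat{\lambda}_1}{\sum_{i=1}^{5} \hat{\lambda}_i}
$$

Because a correlation matrix has ones on its diagonal, its eigenvalues sum to 5, so $\hat{\theta} = \hat{\lambda}_1 / 5 \approx 3.4627 / 5 \approx 0.6925$, consistent with the value above.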

### 2) Bootstrap to Calculate $\hat{se}_{boot}(\hat{\theta})$


In [7]:
# Choose B=2000, matching the text example
B = 2000

def generate_bootstrap_samples(observations: np.ndarray, B: int) -> np.ndarray:
  """Generate B bootstrap samples by resampling rows with replacement,
  each matching the original number of observations.
  """
  n = len(observations)
  return np.array([observations[np.random.choice(n, n, replace=True)] for _ in range(B)])

bootstrap_samples = generate_bootstrap_samples(scores_df.to_numpy(), B)
bootstrap_samples.shape

(2000, 22, 5)
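
`np.random.choice` draws from NumPy's global random state, so the samples above change on every run. A hedged alternative, assuming reproducibility is wanted, is to use a seeded `np.random.default_rng` generator and draw all row indices in one call. The function name `generate_bootstrap_samples_seeded`, the seed, and the synthetic stand-in data are illustrative and not part of the notebook.

```python
import numpy as np

def generate_bootstrap_samples_seeded(observations: np.ndarray, B: int, seed: int = 0) -> np.ndarray:
    """Return B bootstrap samples of shape (B, n, p), resampling rows with replacement."""
    rng = np.random.default_rng(seed)
    n = len(observations)
    # One (B, n) array of row indices, each drawn uniformly from 0..n-1.
    idx = rng.integers(0, n, size=(B, n))
    # Fancy indexing on the first axis yields an array of shape (B, n, p).
    return observations[idx]

# Synthetic stand-in for the 22 x 5 score matrix (not the real data).
demo = np.arange(22 * 5, dtype=float).reshape(22, 5)
samples = generate_bootstrap_samples_seeded(demo, B=2000, seed=42)
print(samples.shape)  # (2000, 22, 5)
```

Calling the function twice with the same seed returns identical samples, which makes the downstream standard error repeatable.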

In [8]:
bootstrap_eigen_ratios = np.array([eigen_ratio(sample) for sample in bootstrap_samples])
se_bootstrap_eigen_ratio = bootstrap_eigen_ratios.std(ddof=1)  # ddof=1 divides by B - 1, since the B replicates are a sample from the bootstrap distribution
print(se_bootstrap_eigen_ratio)  # close to the 0.075 reported in the text; varies between runs because no seed is set

0.07600552178803079
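
The quantity computed in the cell above corresponds to the usual bootstrap standard error estimate. Writing $\hat{\theta}^{*b}$ for the eigenratio of the $b$-th bootstrap sample (notation assumed here), it is the sample standard deviation of the $B$ replicates:

$$
\hat{se}_{boot}(\hat{\theta}) = \left[ \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{\theta}^{*b} - \hat{\theta}^{*\cdot} \right)^2 \right]^{1/2},
\qquad
\hat{\theta}^{*\cdot} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{*b}
$$

The division by $B - 1$ is what `ddof=1` requests in the call to `std`.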


### 3) Jackknife to Calculate $\hat{se}_{jack}(\hat{\theta})$


In [9]:
def generate_jackknife_samples(observations: np.ndarray) -> np.ndarray:
  n = len(observations)
  return np.array([np.delete(observations, i, axis=0) for i in range(n)])

jackknife_samples = generate_jackknife_samples(scores_df.to_numpy())
jackknife_samples.shape

(22, 21, 5)

In [10]:
jackknife_eigen_ratios = np.array([eigen_ratio(sample) for sample in jackknife_samples])

n = len(jackknife_eigen_ratios)
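# np.var divides by n (ddof=0), so multiplying by (n - 1) gives (n - 1)/n times the sum of squared deviations
jackknife_var = np.var(jackknife_eigen_ratios) * (n - 1)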
se_jackknife_eigen_ratio = np.sqrt(jackknife_var)
print(se_jackknife_eigen_ratio) # ~0.083 matches the text example

0.08300211310624547
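
The cell above implements the jackknife standard error $\hat{se}_{jack} = \left[ \frac{n-1}{n} \sum_{i=1}^{n} (\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)})^2 \right]^{1/2}$, where $\hat{\theta}_{(i)}$ is the statistic with observation $i$ removed and $\hat{\theta}_{(\cdot)}$ is the mean of those values. As a sanity-check sketch on made-up data (the helper `jackknife_se` and the sample values are illustrative), applying the same formula to the sample mean should reproduce the familiar $s / \sqrt{n}$:

```python
import numpy as np

def jackknife_se(observations: np.ndarray, statistic) -> float:
    """Jackknife standard error of `statistic`, leaving out one row at a time."""
    n = len(observations)
    leave_one_out = np.array(
        [statistic(np.delete(observations, i, axis=0)) for i in range(n)]
    )
    # np.var divides by n (ddof=0), so (n - 1) * var equals
    # (n - 1) / n * sum of squared deviations from the leave-one-out mean.
    return float(np.sqrt((n - 1) * np.var(leave_one_out)))

# Made-up sample, used only to check the formula.
x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0])

se_jack = jackknife_se(x, np.mean)
se_classic = x.std(ddof=1) / np.sqrt(len(x))
print(se_jack, se_classic)
```

The two numbers agree for the mean. For a nonlinear statistic such as the eigenratio, no comparably simple formula is available, which is where resampling earns its keep.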
