<a href="https://colab.research.google.com/github/swopnimghimire-123123/Maths_For_ML/blob/main/13_Sampling_%26_Resampling_Methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Colab 13: Sampling & Resampling Methods
## Learning Goals

* Understand sampling: why and how we take subsets of data.
* Learn resampling methods: Bootstrapping and Cross-Validation.
* Apply these methods in Python to estimate uncertainty and evaluate models.

## 1. Essence + Use Case + Example
### Sampling

*   **Essence:** Take a subset of the population to make inferences.
*   **Use Case:** We can’t always measure the whole population; samples save time and cost.
*   **Example:** Survey 100 students to estimate the average study hours of 10,000 students.
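
The survey example can be sketched in a few lines. The population below is simulated (normally distributed study hours with a mean of 3 and a standard deviation of 1 are assumptions made purely for illustration), so the numbers are not real survey data. The point is only that the mean of 100 randomly chosen students lands close to the mean of all 10,000.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: daily study hours for 10,000 students
population = rng.normal(loc=3.0, scale=1.0, size=10_000).clip(min=0)

# Simple random sample of 100 students, drawn without replacement
sample = rng.choice(population, size=100, replace=False)

print("Population mean:", population.mean())
print("Sample mean:    ", sample.mean())
# Standard error: typical distance between a sample mean and the population mean
print("Standard error: ", sample.std(ddof=1) / np.sqrt(len(sample)))
```

The standard error shrinks with the square root of the sample size, so quadrupling the sample roughly halves the uncertainty.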

### Bootstrapping

*   **Essence:** Randomly resample your data (with replacement) many times to estimate the variability of a statistic such as the mean or median, and to build confidence intervals.
*   **Use Case:** Estimate variability when the population distribution is unknown.
*   **Example:** Resample student exam scores 1000 times to get a confidence interval for the mean.

### Cross-Validation

*   **Essence:** Split the data into train/test sets multiple times to check model performance on data the model has not seen.
*   **Use Case:** Detect overfitting; check the robustness of ML models.
*   **Example:** 5-fold CV → the model is trained 5 times, each time holding out a different fold for testing; the average performance gives a more reliable estimate than a single split.

## 2. Theory

### Bootstrapping:

*   Randomly sample n points with replacement from the dataset, where n is the dataset size.
*   Calculate statistic (mean, median, etc.).
*   Repeat many times → distribution of the statistic.
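
The three steps above can be wrapped in a small reusable function. This is a minimal sketch of the percentile bootstrap; `bootstrap_ci` and its parameters are hypothetical names introduced here, not part of NumPy. It is applied to the median to show that the same recipe works for statistics other than the mean.

```python
import numpy as np

def bootstrap_ci(data, statistic=np.mean, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    # Step 1: draw n_boot resamples of size n, with replacement (as index arrays)
    idx = rng.integers(0, n, size=(n_boot, n))
    # Step 2: compute the statistic on each resample (one row per resample)
    stats = statistic(data[idx], axis=1)
    # Step 3: percentiles of the resampled statistics give the interval
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper, stats

scores = np.array([60, 65, 70, 75, 80])
lo, hi, stats = bootstrap_ci(scores, statistic=np.median)
print(f"95% CI for the median: [{lo}, {hi}]")
```

Drawing all resamples at once as an index matrix avoids the Python loop; any `statistic` that accepts an `axis` argument can be passed in.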

### Cross-validation:

*   Split dataset into k folds.
*   Train on k-1 folds, test on 1 fold.
*   Repeat k times → average performance metric (accuracy, MSE, etc.).
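
To make the k-fold steps concrete, here is a from-scratch sketch using only NumPy. The helper `kfold_mse` is a hypothetical name, the folds are contiguous and unshuffled, and the six-point dataset is made up for illustration. The model is a straight-line fit via `np.polyfit`, standing in for any regressor.

```python
import numpy as np

def kfold_mse(X, y, k):
    """Manual k-fold CV for a straight-line fit (hypothetical helper, no shuffling)."""
    indices = np.arange(len(X))
    folds = np.array_split(indices, k)  # k nearly equal, contiguous folds
    fold_mse = []
    for test_idx in folds:
        train_idx = np.setdiff1d(indices, test_idx)  # the other k-1 folds
        # Fit y = slope * x + intercept on the training folds only
        slope, intercept = np.polyfit(X[train_idx], y[train_idx], deg=1)
        pred = slope * X[test_idx] + intercept
        # Score on the held-out fold
        fold_mse.append(np.mean((y[test_idx] - pred) ** 2))
    return np.array(fold_mse)

# Hypothetical data for illustration
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([60.0, 70.0, 75.0, 80.0, 85.0, 88.0])

mse = kfold_mse(X, y, k=3)
print("Per-fold MSE:", mse)
print("Average MSE: ", mse.mean())
```

Every sample is used for testing exactly once and for training k-1 times, which is what distinguishes k-fold CV from repeated random splits.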

In [None]:
### 3. Code Demonstration
import numpy as np

# Sample data: exam scores
scores = np.array([60, 65, 70, 75, 80])

# Bootstrap resampling
np.random.seed(42)
n_iterations = 1000
bootstrap_means = []

for _ in range(n_iterations):
    sample = np.random.choice(scores, size=len(scores), replace=True)
    bootstrap_means.append(np.mean(sample))

print("Bootstrap Mean Estimate:", np.mean(bootstrap_means))
print("95% Confidence Interval:", np.percentile(bootstrap_means, [2.5, 97.5]))


Bootstrap Mean Estimate: 70.054
95% Confidence Interval: [64. 76.]


In [None]:
### Cross-Validation Example (for a linear regression model)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

# Dataset
X = np.array([1, 2, 3, 4, 5]).reshape(-1,1)
y = np.array([60, 70, 75, 80, 85])

model = LinearRegression()

# 2-fold cross-validation (R² score)
scores = cross_val_score(model, X, y, cv=2)
print("R² scores for each fold:", scores)
print("Average R²:", np.mean(scores))

R² scores for each fold: [ 0.78571429 -2.61111111]
Average R²: -0.9126984126984027
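
The negative R² for the second fold deserves a closer look. The sketch below recomputes both scores by hand, assuming the unshuffled split that `cross_val_score` uses for an integer `cv` with a regressor (test folds `[0, 1, 2]` and `[3, 4]` for five samples). It should reproduce values of roughly 0.786 and -2.611.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([60, 70, 75, 80, 85], dtype=float)

# Unshuffled 2-fold split of 5 samples: the first fold gets the extra sample
folds = [np.array([0, 1, 2]), np.array([3, 4])]

r2_scores = []
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
    # Least-squares line fitted on the training part only
    slope, intercept = np.polyfit(X[train_idx], y[train_idx], deg=1)
    pred = slope * X[test_idx] + intercept
    ss_res = np.sum((y[test_idx] - pred) ** 2)               # model error
    ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)  # error of predicting the fold mean
    r2_scores.append(1 - ss_res / ss_tot)

print("R² per fold:", np.round(r2_scores, 4))
```

R² compares the model against simply predicting the mean of the test fold. In the second fold the model is trained on three points and extrapolates to two, while `ss_tot` is computed from only two test values, so a modest prediction error is enough to push the score well below zero. With so few samples, per-fold R² is too unstable to interpret.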


# Task
Solve the practice problems in the notebook section "4. Practice Problems".

## Bootstrap confidence interval

### Subtask:
Use bootstrapping on a numeric dataset to calculate the 95% confidence interval of the mean.


**Reasoning**:
The first step is to create a sample numeric dataset and set the number of bootstrap iterations. Then, create an empty list to store the bootstrap means and perform the bootstrapping loop. Finally, calculate and print the 95% confidence interval.



In [None]:
# Create a sample numeric dataset
data = np.array([10, 12, 15, 18, 20, 22, 25, 28, 30, 35])

# Set the number of bootstrap iterations
n_iterations = 1000

# Create an empty list to store the bootstrap means
bootstrap_means = []

# Loop through the specified number of iterations
for _ in range(n_iterations):
    # Randomly sample the dataset with replacement
    sample = np.random.choice(data, size=len(data), replace=True)
    # Calculate the mean of the resampled data
    bootstrap_means.append(np.mean(sample))

# Calculate the 95% confidence interval
confidence_interval = np.percentile(bootstrap_means, [2.5, 97.5])

# Print the calculated 95% confidence interval
print("95% Confidence Interval:", confidence_interval)

95% Confidence Interval: [17.     26.4025]
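
As a sanity check on the bootstrap interval, it can be compared with the textbook normal-approximation interval, mean ± 1.96 standard errors. This is only a rough cross-check: with 10 samples a t-based interval would be somewhat wider, and the 1.96 multiplier assumes an approximately normal sampling distribution.

```python
import numpy as np

data = np.array([10, 12, 15, 18, 20, 22, 25, 28, 30, 35])

mean = data.mean()
# Standard error of the mean, using the sample standard deviation (ddof=1)
se = data.std(ddof=1) / np.sqrt(len(data))

# Normal-approximation 95% interval: mean ± 1.96 standard errors
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"Sample mean: {mean}")
print(f"Normal-approximation 95% CI: [{lower:.2f}, {upper:.2f}]")
```

The result, about [16.5, 26.5], is close to the bootstrap interval above, which suggests the bootstrap is behaving sensibly on this dataset.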


## Cross-validation with 5 folds

### Subtask:
Apply 5-fold cross-validation to a small dataset and calculate the Mean Squared Error (MSE) or R².


**Reasoning**:
Fit a linear regression with 5-fold cross-validation on a 10-sample dataset, so that each test fold holds two samples, and report the per-fold and average MSE.



In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

# Create a 10-sample dataset; y = X + 5 exactly, so a linear model fits it perfectly
X = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100]).reshape(-1, 1)
y = np.array([15, 25, 35, 45, 55, 65, 75, 85, 95, 105])

# Instantiate a LinearRegression model
model = LinearRegression()

# Perform 5-fold cross-validation with neg_mean_squared_error scoring
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')

# Negate the scores to get positive MSE
mse_scores = -scores

# Print the scores for each fold and the average score
print("Mean Squared Error scores for each fold:", mse_scores)
print("Average Mean Squared Error:", np.mean(mse_scores))

Mean Squared Error scores for each fold: [1.34106354e-28 2.01948392e-28 0.00000000e+00 0.00000000e+00
 4.03896783e-28]
Average Mean Squared Error: 1.479903058194618e-28


**Reasoning**:
The 10-sample dataset above replaces an earlier attempt on a 5-sample dataset, which was too small for 5 folds: each test fold held a single sample, and R² needs at least two test samples to be defined. With 10 samples each fold holds two test samples, and MSE is used as the scoring metric.


## Compare cross-validation with different folds

### Subtask:
Repeat the cross-validation with 3 and 10 folds and compare the results to see the effect on the stability of the performance metric.


**Reasoning**:
Perform 3-fold cross-validation, calculate the positive MSE, and print the results.



In [None]:
# 3-fold cross-validation (Negative Mean Squared Error)
scores_3fold = cross_val_score(model, X, y, cv=3, scoring='neg_mean_squared_error')

# Calculate positive MSE
mse_scores_3fold = -scores_3fold

# Print the scores for each fold and the average score
print("Mean Squared Error scores for each fold (3-fold CV):", mse_scores_3fold)
print("Average Mean Squared Error (3-fold CV):", np.mean(mse_scores_3fold))

Mean Squared Error scores for each fold (3-fold CV): [9.22967259e-29 6.73161306e-29 2.69264522e-28]
Average Mean Squared Error (3-fold CV): 1.4295912626838557e-28


**Reasoning**:
Perform 10-fold cross-validation, calculate the positive MSE, and print the results, then compare the average and range of the per-fold MSE for 3 and 10 folds. The 5-fold results remain available in `mse_scores` from the earlier cell.



In [None]:
# 10-fold cross-validation (Negative Mean Squared Error)
scores_10fold = cross_val_score(model, X, y, cv=10, scoring='neg_mean_squared_error')

# Calculate positive MSE
mse_scores_10fold = -scores_10fold

# Print the scores for each fold and the average score
print("Mean Squared Error scores for each fold (10-fold CV):", mse_scores_10fold)
print("Average Mean Squared Error (10-fold CV):", np.mean(mse_scores_10fold))

# Comparison of the 3-fold and 10-fold results.
# The 5-fold scores are still available in `mse_scores` from the earlier cell and
# could be compared the same way (mean and range of the per-fold MSE).

print("\nComparison of Average MSE across folds:")
print(f"Average MSE (3-fold CV): {np.mean(mse_scores_3fold)}")
print(f"Average MSE (10-fold CV): {np.mean(mse_scores_10fold)}")

print("\nComparison of MSE Range across folds:")
print(f"Range of MSE (3-fold CV): {np.max(mse_scores_3fold) - np.min(mse_scores_3fold)}")
print(f"Range of MSE (10-fold CV): {np.max(mse_scores_10fold) - np.min(mse_scores_10fold)}")

Mean Squared Error scores for each fold (10-fold CV): [0.00000000e+00 1.26217745e-29 5.04870979e-29 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00]
Average Mean Squared Error (10-fold CV): 6.310887241768095e-30

Comparison of Average MSE across folds:
Average MSE (3-fold CV): 1.4295912626838557e-28
Average MSE (10-fold CV): 6.310887241768095e-30

Comparison of MSE Range across folds:
Range of MSE (3-fold CV): 2.0194839173657902e-28
Range of MSE (10-fold CV): 5.048709793414476e-29


## Summary:

### Data Analysis Key Findings

*   The 95% confidence interval for the mean of the sample dataset calculated using bootstrapping is approximately \[17.0, 26.4].
*   With a 5-sample dataset, 5-fold cross-validation leaves one test sample per fold. MSE can still be computed on a single sample, but R² (the default score for regressors) is undefined with fewer than two test samples, so the first attempt failed. Increasing the dataset to 10 samples gives two test samples per fold and resolved this issue.
*   Comparing cross-validation with different numbers of folds (3-fold vs. 10-fold) on the 10-sample dataset gave a lower average Mean Squared Error (approximately 6.31e-30) and a smaller range of MSE values across folds (approximately 5.05e-29) for 10-fold than for 3-fold (average MSE approximately 1.43e-28, range approximately 2.02e-28). Because y = X + 5 exactly, the linear model fits every fold perfectly, so all of these values are floating-point rounding error (effectively zero) and the difference does not demonstrate an effect of the number of folds.
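
Seeing how the number of folds affects the estimate requires data with noise. The sketch below repeats the 3-, 5-, and 10-fold comparison on a hypothetical noisy version of the same line (30 points, Gaussian noise with a standard deviation of 5, both chosen arbitrarily). Shuffled folds via `KFold` are an addition here; the cells above use the default unshuffled split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical noisy dataset: the line y = x + 5 plus Gaussian noise (sd = 5)
X = np.linspace(10, 100, 30).reshape(-1, 1)
y = X.ravel() + 5 + rng.normal(scale=5.0, size=30)

model = LinearRegression()
results = {}
for k in (3, 5, 10):
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    # scikit-learn returns negated MSE, so flip the sign
    mse = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    results[k] = mse
    print(f"{k:>2}-fold: mean MSE = {mse.mean():.2f}, std across folds = {mse.std():.2f}")
```

Here the MSE values are on the order of the noise variance rather than 1e-28, so differences between fold counts reflect the data and not rounding error. A single run with one seed is still only one draw; repeating with several seeds gives a fairer picture.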

### Insights or Next Steps

*   The number of folds affects the performance estimate. More folds mean each model trains on a larger share of the data, which tends to reduce the pessimistic bias of the estimate, but each test fold is smaller, so per-fold scores are noisier and the procedure costs more compute. Observing this trade-off requires noisy data, which the perfectly linear dataset used here lacks.
*   When performing cross-validation, ensure that each fold contains enough samples for the chosen performance metric (for example, at least two test samples per fold for R²).
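
The samples-per-fold check can be done before running any model. This is a small sketch: `fold_sizes` and `enough_for_r2` are hypothetical helper names, and the size rule follows the documented behaviour of scikit-learn's unshuffled `KFold` (the first `n_samples % k` folds get one extra sample).

```python
import numpy as np

def fold_sizes(n_samples, k):
    """Test-fold sizes for a k-fold split (hypothetical helper)."""
    sizes = np.full(k, n_samples // k)
    sizes[: n_samples % k] += 1  # the first folds absorb the remainder
    return sizes

def enough_for_r2(n_samples, k):
    # R² is only defined with at least two test samples per fold
    return bool(fold_sizes(n_samples, k).min() >= 2)

print(fold_sizes(5, 5), enough_for_r2(5, 5))    # one sample per fold
print(fold_sizes(10, 5), enough_for_r2(10, 5))  # two samples per fold
```

Two samples per fold is only the minimum for R² to be computable; a stable score needs considerably more.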
