<a href="https://colab.research.google.com/github/yiri20/CS180/blob/main/06_sampling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://colab.research.google.com/github/rhodes-byu/cs180-winter25/blob/main/notebooks/06-sampling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a><p><b>After clicking the "Open in Colab" link, copy the notebook to your own Google Drive before getting started, or it will not save your work</b></p>

# Data Sampling

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## 1. Random Sampling

### Random Sampling from a List or Array (Using NumPy)
* `replace = False` means sampling without replacement (no repeated elements).
* `replace = True` would allow repeated elements in the sample.

In [None]:
# Using NumPy
np.random.seed(42) # Set for reproducibility

data = np.array([10, 20, 30, 40, 50, 60, 70, 80])

# Randomly sample 3 elements from the array
sample = np.random.choice(data, size=3, replace=False)  # Without replacement

print("Sample:", sample)
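
To make the difference between the two `replace` settings concrete, here is a minimal sketch. It uses NumPy's `np.random.default_rng` generator instead of the global seed set above; the seed value and sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for illustration
data = np.array([10, 20, 30, 40, 50, 60, 70, 80])

# Without replacement: every drawn element is distinct
no_repeat = rng.choice(data, size=5, replace=False)

# With replacement: 12 draws from 8 values, so repeats must occur
with_repeat = rng.choice(data, size=12, replace=True)

print("Without replacement:", no_repeat)
print("With replacement:   ", with_repeat)

# Asking for more distinct elements than exist raises a ValueError
try:
    rng.choice(data, size=12, replace=False)
    oversample_failed = False
except ValueError:
    oversample_failed = True
print("Oversampling without replacement fails:", oversample_failed)
```

Sampling with replacement is the only option when the requested sample is larger than the data.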

### Random Sampling with Pandas DataFrame
* `n` specifies the number of rows to sample.
* Alternatively, `frac` specifies the fraction of rows to sample, e.g., `df.sample(frac = 0.5)` for 50% of rows. Pass either `n` or `frac`, not both.

In [None]:
# Sampling using Pandas
data = {'A': np.arange(1, 101),
        'B': np.arange(10, 1010, 10)}

df = pd.DataFrame(data)
df.head()


In [None]:
# Sample by number of samples
sample_df = df.sample(n = 10).sort_index()

# By fraction of samples
# sample_df = df.sample(frac = 0.1).sort_index()


print(sample_df)
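
The calls above return different rows on each run. As a small sketch, passing `random_state` to `DataFrame.sample` should make the draw repeatable; the seed value `0` here is an arbitrary choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(1, 101),
                   'B': np.arange(10, 1010, 10)})

# frac=0.1 draws 10% of the 100 rows; random_state fixes the draw
first = df.sample(frac=0.1, random_state=0).sort_index()
second = df.sample(frac=0.1, random_state=0).sort_index()

print(len(first))            # number of sampled rows
print(first.equals(second))  # same seed, same rows
```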

## 3. Stratified Sampling

Stratified sampling ensures that each class or group within the data is proportionally represented in the sample. This is particularly useful for imbalanced datasets and for classification problems where you want to maintain the same class proportions. In scikit-learn, you can use `StratifiedShuffleSplit` or the `stratify` argument of `train_test_split`.

### Stratified Sampling with Sklearn
Here, `stratify=y` ensures that the train and test sets have approximately the same class proportions as the original dataset.

In [None]:
from sklearn.model_selection import train_test_split

# Example dataset with labels (class imbalance)
X = np.random.normal(loc = 70, scale = 10, size = 100)
y = np.where(X >= 85, 1, 0)  # class 1 = values at least 1.5 standard deviations above the mean

print('Proportion of class 0:', len(y[y == 0]) / len(y))

In [None]:
print(np.round(X))

In [None]:
print(y)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .1, random_state = 42)

In [None]:
print("Percent of class 1 in y_train:", np.mean(y_train) * 100)
print("Percent of class 1 in y_test:", np.mean(y_test) * 100)

In [None]:
# Stratified sampling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .1, random_state = 42, stratify = y)

In [None]:
print("Percent of class 1 in y_train:", np.mean(y_train) * 100)
print("Percent of class 1 in y_test:", np.mean(y_test) * 100)

In [None]:
print('Proportion of 0s in population: ', len(y[y == 0]) / len(y))
print('Proportion of 0s in train: ', len(y_train[y_train == 0]) / len(y_train))
print('Proportion of 0s in test: ', len(y_test[y_test == 0]) / len(y_test))
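
The `StratifiedShuffleSplit` class mentioned at the start of this section can produce a comparable split. The sketch below is one possible usage, not part of the original walkthrough; the arrays `X_demo` and `y_demo` are hypothetical data with an exact 90/10 class split.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature column

# One stratified split that holds out 10% of the samples
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_idx, test_idx = next(splitter.split(X_demo, y_demo))

print("Class 1 share in train:", y_demo[train_idx].mean())
print("Class 1 share in test: ", y_demo[test_idx].mean())
```

Unlike `train_test_split`, the splitter yields index arrays, and setting `n_splits` above 1 gives several independent stratified splits.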

### Stratified sampling with Pandas
Here, `groupby('label')` ensures that the sampling is stratified by the label, and `sample(frac=0.3)` samples 30% from each group.

In [None]:
# An imbalanced dataframe
df = pd.DataFrame({
    'feature': np.random.randn(100),
    'label': np.random.choice([0, 1], size=100, p=[0.9, 0.1])  # 90% of class 0, 10% of class 1
})
# Stratified sampling: draw 30% of the rows from each label group.
# groupby(...).sample keeps the 'label' column, which the plots below rely on.
stratified_sample = df.groupby('label').sample(frac=0.3, random_state=42).sort_index()

In [None]:
stratified_sample

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

sns.countplot(ax=axes[0], x='label', data=df)
axes[0].set_title('Countplot of Full Dataset')

sns.countplot(ax=axes[1], x='label', data=stratified_sample)
axes[1].set_title('Countplot of Stratified Sample')

plt.tight_layout()
plt.show()
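
Besides plotting, one way to check that a stratified sample preserved the class balance is to compare normalized class counts. The sketch below assumes a hypothetical frame `demo` with an exact 90/10 split and uses `DataFrameGroupBy.sample` (available in pandas 1.1 and later) to draw within each group.

```python
import pandas as pd

# Hypothetical frame with a known 90/10 class split
demo = pd.DataFrame({'feature': range(100),
                     'label': [0] * 90 + [1] * 10})

# Draw 30% of the rows within each label group
demo_sample = demo.groupby('label').sample(frac=0.3, random_state=0)

# Class shares before and after sampling
full_share = demo['label'].value_counts(normalize=True).sort_index()
sample_share = demo_sample['label'].value_counts(normalize=True).sort_index()
print(pd.DataFrame({'full': full_share, 'sample': sample_share}))
```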


## 4. Systematic Sampling

You can manually implement systematic sampling by selecting every $k$-th element from a dataset.

In [None]:
# Systematic sampling (every kth element)
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Define step size k
k = 2
systematic_sample = data[::k]  # Select every kth element

print("Systematic Sample:", systematic_sample)
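
The slice above always starts at the first element. A common variant picks the starting position at random from the first `k` elements, so that every element has the same chance of selection. Here is a minimal sketch of that idea; the helper name `systematic_sample` and the seed are hypothetical choices for illustration.

```python
import numpy as np

def systematic_sample(values, k, rng):
    """Return every k-th element, starting from a random offset in [0, k)."""
    start = rng.integers(0, k)  # upper bound is exclusive
    return values[start::k]

rng = np.random.default_rng(0)  # arbitrary seed for illustration
data = np.arange(1, 21)
sample = systematic_sample(data, k=5, rng=rng)

print("Systematic sample with random start:", sample)
```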