# STA 2101 - Milestone 02
## Implementing Probability Sampling Methods in Python

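This notebook applies four probability sampling techniques (simple random, systematic, stratified, and cluster) to the 'academic Stress level' dataset and compares each sample mean with the population mean.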

In [None]:
import pandas as pd
import numpy as np
from IPython.display import display

# --- Configuration ---
FILE_PATH = 'academic Stress level - maintainance 1.csv'
# The column we will use for calculating means and comparing samples
ANALYSIS_COL = 'stress_index'
# The column we will use for stratification
STRATA_COL = 'academic_stage'
SAMPLE_SIZE = 50

In [None]:
# Load your dataset and clean up column names
df = pd.read_csv(FILE_PATH)

# --- Start of robust column renaming ---
# Get a mapping of original column names to their lowercase, stripped versions
# to find the intended columns robustly.
column_mapping = {col.lower().strip(): col for col in df.columns}

# Identify the original column name for ANALYSIS_COL
original_analysis_col_name = None
for key, val in column_mapping.items():
    if 'stress index' in key:
        original_analysis_col_name = val
        break

# Identify the original column name for STRATA_COL
original_strata_col_name = None
for key, val in column_mapping.items():
    if 'academic stage' in key:
        original_strata_col_name = val
        break

# Ensure columns were found before renaming
if original_analysis_col_name is None:
    raise KeyError("Could not find a column related to 'stress index' in the dataset.")
if original_strata_col_name is None:
    raise KeyError("Could not find a column related to 'academic stage' in the dataset.")

# Rename long columns for cleaner code
df = df.rename(columns={
    original_analysis_col_name: ANALYSIS_COL,
    original_strata_col_name: STRATA_COL
})
# --- End of robust column renaming ---

# Display first few rows and check data types
print(f"Loaded file: {FILE_PATH}")
display(df.head())

## Part A — Setup
- Report dataset size (rows, columns)
- Calculate population mean

In [None]:
print("Dataset size (rows, columns):", df.shape)

# Calculate the population mean for the analysis column
population_mean = df[ANALYSIS_COL].mean()
print(f"Population mean of '{ANALYSIS_COL}': {population_mean:.4f}")

## Part B — Simple Random Sampling (SRS)

In [None]:
srs = df.sample(n=SAMPLE_SIZE, random_state=42)
print(f"\n--- Simple Random Sample (n={SAMPLE_SIZE}) ---")
display(srs.head(3))

srs_mean = srs[ANALYSIS_COL].mean()
print(f"SRS Sample mean of '{ANALYSIS_COL}': {srs_mean:.4f}")

## Part C — Systematic Sampling

In [None]:
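n = SAMPLE_SIZE
N = len(df)
if N < n:
    raise ValueError(f"Sample size {n} exceeds population size {N}.")
k = N // n  # Sampling interval: take every k-th row
# Seeded generator so the random start is reproducible, like the other samples
rng = np.random.default_rng(42)
start = int(rng.integers(0, k))  # Random starting row in [0, k)
sys_sample = df.iloc[start::k][:n]  # Truncate to n rows when N is not a multiple of n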

print(f"\n--- Systematic Sample (k={k}, n={n}) ---")
display(sys_sample.head(3))

sys_mean = sys_sample[ANALYSIS_COL].mean()
print(f"Systematic Sample mean of '{ANALYSIS_COL}': {sys_mean:.4f}")

## Part D — Stratified Sampling

In [None]:
print(f"\n--- Stratified Sample (Stratifying by '{STRATA_COL}') ---")
# Check the number of records per stratum
print("Records per stratum:")
print(df[STRATA_COL].value_counts())

# Sampling fraction shared by every stratum (proportional allocation)
frac = SAMPLE_SIZE / N

# Draw the same fraction of rows from each stratum so every group is
# represented in proportion to its size; sampled rows keep their original index
stratified_sample = df.groupby(STRATA_COL, group_keys=False).sample(frac=frac, random_state=42)

# Each stratum's sample size is rounded separately, so the total may differ
# slightly from SAMPLE_SIZE. Print it to check.
print(f"Stratified sample size: {len(stratified_sample)}")
display(stratified_sample.head(3))

strat_mean = stratified_sample[ANALYSIS_COL].mean()
print(f"Stratified Sample mean of '{ANALYSIS_COL}': {strat_mean:.4f}")
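
The rounding issue noted above can be avoided by fixing each stratum's sample size before drawing. The sketch below is one possible approach, not part of the assignment: it uses largest-remainder rounding so the allocations sum to exactly the target. The helper names (`proportional_allocation`, `stratified_sample_exact`) and the toy data are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd


def proportional_allocation(counts: pd.Series, n: int) -> pd.Series:
    """Split n across strata in proportion to counts (largest-remainder rounding)."""
    exact = counts / counts.sum() * n        # ideal, possibly fractional, sizes
    alloc = np.floor(exact).astype(int)      # round every stratum down first
    shortfall = n - int(alloc.sum())         # units still to hand out
    # Give one extra unit to the strata with the largest fractional parts.
    top_up = (exact - alloc).sort_values(ascending=False).index[:shortfall]
    alloc.loc[top_up] += 1
    return alloc


def stratified_sample_exact(data: pd.DataFrame, strata_col: str, n: int, seed: int = 42) -> pd.DataFrame:
    """Draw a proportional stratified sample with exactly n rows."""
    alloc = proportional_allocation(data[strata_col].value_counts(), n)
    parts = [
        data[data[strata_col] == stratum].sample(n=int(size), random_state=seed)
        for stratum, size in alloc.items()
        if size > 0
    ]
    return pd.concat(parts)


# Toy data (hypothetical values, not the real dataset)
demo = pd.DataFrame({
    "academic_stage": ["undergraduate"] * 70 + ["high school"] * 20 + ["post-graduate"] * 13,
    "stress_index": np.arange(103) % 5 + 1,
})
demo_alloc = proportional_allocation(demo["academic_stage"].value_counts(), 50)
demo_sample = stratified_sample_exact(demo, "academic_stage", 50)
print(demo_alloc.to_dict())
print("Sample size:", len(demo_sample))
```

In the notebook, the equivalent call would be `stratified_sample_exact(df, STRATA_COL, SAMPLE_SIZE)`. Very small strata can still receive zero units under this scheme.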

## Part E — Cluster Sampling

In [None]:
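# Create 10 arbitrary clusters of consecutive rows, each holding about N / 10 records
NUM_CLUSTERS = 10
# Assign cluster IDs by row position; this yields exactly NUM_CLUSTERS
# near-equal groups even when the row count is not a multiple of NUM_CLUSTERS
df['cluster_id'] = np.arange(len(df)) * NUM_CLUSTERS // len(df)

# Randomly select 2 whole clusters (adjust NUM_SELECTED_CLUSTERS for the desired coverage)
NUM_SELECTED_CLUSTERS = 2
cluster_rng = np.random.default_rng(42)  # Seeded so the selection is reproducible
selected_clusters = cluster_rng.choice(
    df['cluster_id'].unique(),
    size=NUM_SELECTED_CLUSTERS,
    replace=False
)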
cluster_sample = df[df['cluster_id'].isin(selected_clusters)].copy()
cluster_sample_size = len(cluster_sample)

print(f"\n--- Cluster Sample (Selected {NUM_SELECTED_CLUSTERS} clusters) ---")
print("Selected clusters:", selected_clusters)
print(f"Cluster sample size: {cluster_sample_size}")
display(cluster_sample.head(3))

cluster_mean = cluster_sample[ANALYSIS_COL].mean()
print(f"Cluster Sample mean of '{ANALYSIS_COL}': {cluster_mean:.4f}")

# Clean up the cluster_id column from the main DataFrame
df = df.drop(columns=['cluster_id'])

## Part F — Comparison & Reflection
Compare sample means vs population mean, then write your reflection.

In [None]:
comparison = pd.DataFrame({
    'Method': ['Simple Random', 'Systematic', 'Stratified', 'Cluster'],
    'Sample Mean': [srs_mean, sys_mean, strat_mean, cluster_mean],
    f'Population Mean ({ANALYSIS_COL})': [population_mean]*4,
    'Absolute Difference': [abs(srs_mean - population_mean),
                            abs(sys_mean - population_mean),
                            abs(strat_mean - population_mean),
                            abs(cluster_mean - population_mean)]
})

# Sort by difference to easily see the most accurate method
comparison = comparison.sort_values(by='Absolute Difference')

print("\n--- Sample Mean Comparison ---")
value_cols = ['Sample Mean', f'Population Mean ({ANALYSIS_COL})', 'Absolute Difference']
styled = (
    comparison.reset_index(drop=True)
    .style
    .format({col: "{:.2f}" for col in value_cols})
    .set_properties(subset=['Method'], **{'text-align': 'left'})
    .set_properties(subset=value_cols, **{'text-align': 'right'})
)
display(styled)

### Reflection

The analysis focused on the **'academic stress index'** score. The goal was to see which sampling method yields a sample mean closest to the true population mean.

- Typically, **Stratified Sampling** performs well when the stratification column (Academic Stage) is strongly associated with the target variable (Stress Index), as it ensures all key subgroups are proportionally represented.
- **Simple Random Sampling** provides an unbiased estimate, but the error of any single sample depends on chance; on average it shrinks as the sample size grows.
- **Systematic Sampling** is often nearly as good as SRS, provided there is no underlying periodic pattern in the row order that aligns with the sampling interval (k).
- **Cluster Sampling** (selecting only 2 of 10 clusters) often results in the largest difference because the sample is concentrated in a few groups. Here the clusters are blocks of consecutive rows, so if neighboring rows resemble each other, the selected blocks may not represent the overall diversity of the population.
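
The caveat about periodic patterns in systematic sampling can be seen in a small, deliberately extreme toy case. The sketch below uses made-up values that repeat every `period` rows, with `period` equal to the sampling interval. None of the names or numbers come from the dataset.

```python
import numpy as np

period = 10    # both the sampling interval and the length of the repeating pattern
n_rows = 500   # toy population size (a multiple of period for simplicity)

# Hypothetical population whose values cycle 1, 2, ..., 10, 1, 2, ...
toy_values = np.tile(np.arange(1, period + 1), n_rows // period).astype(float)
toy_mean = toy_values.mean()

# Every possible systematic sample: one per starting offset 0..period-1.
# Each sample contains a single repeated value, so its mean equals that value.
systematic_means = np.array([toy_values[start::period].mean() for start in range(period)])

# Simple random samples of the same size, for comparison
demo_rng = np.random.default_rng(0)
srs_means = np.array([
    toy_values[demo_rng.choice(n_rows, size=n_rows // period, replace=False)].mean()
    for _ in range(200)
])

print("Population mean:", toy_mean)
print("Systematic sample means by start:", systematic_means)
print("Spread of systematic means:", systematic_means.std())
print("Spread of SRS means:", srs_means.std())
```

The systematic estimator is still unbiased across starting points here, but any single systematic sample can land far from the population mean. This is why the row order should be checked before this method is trusted.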

Based on the generated comparison table, the method with the smallest 'Absolute Difference' is the most accurate for this specific sample run. A single run says little about the methods themselves, though: to compare them reliably, the entire process would need to be repeated many times (a simulation) and each method's error averaged across runs.
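
A repeated-sampling comparison of this kind could look like the sketch below. It runs on a synthetic stand-in population so that it is self-contained. The population, the stage labels, the effect sizes, and all helper names (`draw_srs`, `draw_systematic`, and so on) are assumptions made for illustration, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

sim_rng = np.random.default_rng(0)
N_POP, N_SAMPLE, N_REPS = 500, 50, 1000

# Synthetic population (hypothetical): stress depends mildly on academic stage
stages = sim_rng.choice(["high school", "undergraduate", "post-graduate"],
                        size=N_POP, p=[0.3, 0.5, 0.2])
shift = pd.Series(stages).map(
    {"high school": 0.0, "undergraduate": 0.5, "post-graduate": 1.0}
).to_numpy()
y = np.clip(np.round(3 + shift + sim_rng.normal(0, 1, N_POP)), 1, 5)
true_mean = y.mean()


def draw_srs(rng):
    return y[rng.choice(N_POP, size=N_SAMPLE, replace=False)].mean()


def draw_systematic(rng):
    k = N_POP // N_SAMPLE
    start = rng.integers(0, k)
    return y[start::k][:N_SAMPLE].mean()


def draw_stratified(rng):
    # Proportional allocation, rounded separately within each stratum
    picks = []
    for stage in np.unique(stages):
        idx = np.flatnonzero(stages == stage)
        n_stage = int(round(len(idx) * N_SAMPLE / N_POP))
        picks.append(rng.choice(idx, size=n_stage, replace=False))
    return y[np.concatenate(picks)].mean()


def draw_cluster(rng, num_clusters=10, num_selected=2):
    # Blocks of consecutive rows; the sample holds 2 of 10 blocks, so it is
    # larger than N_SAMPLE, mirroring the notebook's cluster cell
    cluster_id = np.arange(N_POP) * num_clusters // N_POP
    chosen = rng.choice(num_clusters, size=num_selected, replace=False)
    return y[np.isin(cluster_id, chosen)].mean()


methods = {
    "Simple Random": draw_srs,
    "Systematic": draw_systematic,
    "Stratified": draw_stratified,
    "Cluster": draw_cluster,
}
rows = []
for name, draw in methods.items():
    estimates = np.array([draw(sim_rng) for _ in range(N_REPS)])
    rows.append({
        "Method": name,
        "Bias": estimates.mean() - true_mean,
        "RMSE": np.sqrt(np.mean((estimates - true_mean) ** 2)),
    })
summary = pd.DataFrame(rows).sort_values("RMSE").reset_index(drop=True)
print(f"True mean: {true_mean:.4f}")
print(summary.round(4))
```

To apply the same idea to the real data, `y` and `stages` would be taken from `df[ANALYSIS_COL]` and `df[STRATA_COL]`. RMSE across repetitions is a more stable basis for ranking the methods than the absolute difference from a single draw.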