# Sampling Assignment
Implementing Probability Sampling Methods in Python

## Instructions
Upload your dataset (minimum 200 rows), then complete all parts A–F.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# --- CONFIGURATION ---
FILE_NAME = 'academic stress level.csv'
# The numeric column for mean comparisons (Part B, C, D, E, F)
NUMERIC_COL = 'Rate your academic stress index'
# The categorical column for stratification (Part D)
CATEGORICAL_COL = 'Your Academic Stage'
SAMPLE_SIZE = 50 # Target sample size for all methods
# --- END CONFIGURATION ---

# Load your dataset
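try:
    df = pd.read_csv(FILE_NAME)
except FileNotFoundError as err:
    # Re-raise with a clearer message; calling exit() inside a notebook would stop the kernel.
    raise FileNotFoundError(
        f"The file '{FILE_NAME}' was not found. Please ensure it is in the correct directory."
    ) from err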

print("Dataset loaded successfully.")
print(f"Target Column for Mean Comparison: {NUMERIC_COL}")
df.head()

Dataset loaded successfully.
Target Column for Mean Comparison: Rate your academic stress index


| | Timestamp | Your Academic Stage | Peer pressure | Academic pressure from your home | Study Environment | What coping strategy you use as a student? | Do you have any bad habits like smoking, drinking on a daily basis? | What would you rate the academic competition in your student life | Rate your academic stress index |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 24/07/2025 22:05:39 | undergraduate | 4 | 5 | Noisy | Analyze the situation and handle it with intel... | No | 3 | 5 |
| 1 | 24/07/2025 22:05:52 | undergraduate | 3 | 4 | Peaceful | Analyze the situation and handle it with intel... | No | 3 | 3 |
| 2 | 24/07/2025 22:06:39 | undergraduate | 1 | 1 | Peaceful | Social support (friends, family) | No | 2 | 4 |
| 3 | 24/07/2025 22:06:45 | undergraduate | 3 | 2 | Peaceful | Analyze the situation and handle it with intel... | No | 4 | 3 |
| 4 | 24/07/2025 22:08:06 | undergraduate | 3 | 3 | Peaceful | Analyze the situation and handle it with intel... | No | 4 | 5 |


## Part A — Setup
- Report dataset size (rows, columns)

In [23]:
# Report dataset size (rows, columns)
print("Dataset size (rows, columns):", df.shape)
N, M = df.shape

print(f"\nPopulation Size (N): {N} rows, {M} columns")

# Column analysed in Parts B–F (alias of NUMERIC_COL from the configuration cell)
ANALYSIS_COL = NUMERIC_COL

# Calculate the population mean for the analysis column
population_mean = df[ANALYSIS_COL].mean()
print(f"Population mean of '{ANALYSIS_COL}': {population_mean:.4f}")

# Dictionary to store sample means for Part F
sample_means = {}

Dataset size (rows, columns): (140, 9)

Population Size (N): 140 rows, 9 columns
Population mean of 'Rate your academic stress index': 3.7214


## Part B — Simple Random Sampling

In [24]:
sample_size = SAMPLE_SIZE # Target sample size from the configuration cell (50)
srs = df.sample(n=sample_size, random_state=42)
print("--- Simple Random Sample (SRS) Head ---")
print(srs.head())

# We use the ANALYSIS_COL variable defined in Part A
print("\nPopulation mean:", df[ANALYSIS_COL].mean())
print("Sample mean:", srs[ANALYSIS_COL].mean())

# Store the sample mean for later comparison (Part F)
sample_means['Simple Random Sample (SRS)'] = srs[ANALYSIS_COL].mean()

--- Simple Random Sample (SRS) Head ---
               Timestamp Your Academic Stage  Peer pressure  \
108  26/07/2025 10:38:24         high school              3   
67   25/07/2025 00:21:30       undergraduate              3   
31   24/07/2025 22:23:15       undergraduate              3   
119  30/07/2025 06:43:55         high school              4   
42   24/07/2025 22:32:37       undergraduate              3   

     Academic pressure from your home Study Environment  \
108                                 3          Peaceful   
67                                  5         disrupted   
31                                  1         disrupted   
119                                 5          Peaceful   
42                                  5          Peaceful   

            What coping strategy you use as a student?  \
108  Analyze the situation and handle it with intel...   
67                  Emotional breakdown (crying a lot)   
31                  Emotional breakdown (crying a lo

## Part C — Systematic Sampling

In [25]:
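sample_size = SAMPLE_SIZE # Target sample size from the configuration cell (50)
N = len(df) # Total population size
k = N // sample_size # Integer sampling interval (k), rounded down

# Choose a random starting point (index) between 0 and k-1.
# np.random.randint is unseeded here, so the start changes on every run.
start = np.random.randint(0, k)

# Select the systematic sample:
# start at the random index, take every k-th row, and truncate to n = sample_size rows.
# Caveat: because k is rounded down, the last selected row can fall well short of N.
# With N = 140 and n = 50, k = 2, so only the first 100 rows can ever be selected.
sys_sample = df.iloc[start::k][:sample_size]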

print("--- Systematic Sample Head ---")
print(f"Sampling Interval (k): {k}")
print(f"Random Start Index: {start}")
print(sys_sample.head())

# Calculate and report the mean using the ANALYSIS_COL variable
print(f"\nPopulation mean: {df[ANALYSIS_COL].mean():.4f}")
print(f"Sample mean: {sys_sample[ANALYSIS_COL].mean():.4f}")

# Store the sample mean for later comparison (Part F)
sample_means['Systematic Sample'] = sys_sample[ANALYSIS_COL].mean()

--- Systematic Sample Head ---
Sampling Interval (k): 2
Random Start Index: 1
             Timestamp Your Academic Stage  Peer pressure  \
1  24/07/2025 22:05:52       undergraduate              3   
3  24/07/2025 22:06:45       undergraduate              3   
5  24/07/2025 22:08:13       undergraduate              3   
7  24/07/2025 22:10:06       undergraduate              3   
9  24/07/2025 22:11:19       undergraduate              2   

   Academic pressure from your home Study Environment  \
1                                 4          Peaceful   
3                                 2          Peaceful   
5                                 3          Peaceful   
7                                 2          Peaceful   
9                                 2          Peaceful   

          What coping strategy you use as a student?  \
1  Analyze the situation and handle it with intel...   
3  Analyze the situation and handle it with intel...   
5  Analyze the situation and handle it with 
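
When N is not a multiple of the sample size, rounding the interval down leaves the tail of the dataset unsampled. One common workaround is a fractional interval, sketched below under some assumptions: it runs on a synthetic 140-row frame rather than the survey data, and `systematic_sample`, `demo`, and the seed are illustrative names and choices, not part of the assignment.

```python
import numpy as np
import pandas as pd


def systematic_sample(frame, n, rng):
    """Fractional-interval systematic sample of n rows from frame."""
    N = len(frame)
    if not 0 < n <= N:
        raise ValueError("n must be between 1 and len(frame)")
    k = N / n                      # fractional interval, e.g. 140 / 50 = 2.8
    start = rng.uniform(0, k)      # random start in [0, k)
    # Row positions floor(start + i*k) for i = 0..n-1 span the whole frame
    positions = np.floor(start + k * np.arange(n)).astype(int)
    positions = np.minimum(positions, N - 1)  # guard against float rounding at the edge
    return frame.iloc[positions]


# Synthetic stand-in for the survey data (hypothetical values)
rng = np.random.default_rng(42)
demo = pd.DataFrame({"stress": rng.integers(1, 6, size=140)})

sample = systematic_sample(demo, n=50, rng=rng)
print(len(sample), sample.index.min(), sample.index.max())
```

Because the interval is at least 1, consecutive positions never repeat, and the last selected row always lies within one interval of the end of the frame.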

## Part D — Stratified Sampling

In [26]:
strata_col = CATEGORICAL_COL # 'Your Academic Stage', set in the configuration cell
sample_size = SAMPLE_SIZE

# Sampling fraction applied to every stratum (proportional allocation)
frac = sample_size / len(df)

# Create the stratified sample:
# .groupby(strata_col) splits the data by stratum and .sample(frac=frac) draws
# the same fraction from each group, so strata stay proportionally represented.
# Each stratum's count is rounded, so the total may differ slightly from sample_size.
stratified_sample = df.groupby(strata_col, group_keys=False).sample(frac=frac, random_state=42)

print("--- Stratified Sample Head ---")
print(f"Stratification Column: '{strata_col}'")
print(stratified_sample.head())

# Calculate and report the sample mean
print(f"\nPopulation mean: {df[ANALYSIS_COL].mean():.4f}")
print(f"Sample mean: {stratified_sample[ANALYSIS_COL].mean():.4f}")

# Store the sample mean for later comparison (Part F)
sample_means['Stratified Sample'] = stratified_sample[ANALYSIS_COL].mean()

--- Stratified Sample Head ---
Stratification Column: 'Your Academic Stage'
               Timestamp Your Academic Stage  Peer pressure  \
126  12/08/2025 08:49:45         high school              4   
107  26/07/2025 10:04:32         high school              2   
103  26/07/2025 09:36:09         high school              1   
114  26/07/2025 18:45:13         high school              1   
99   26/07/2025 08:27:10         high school              4   

     Academic pressure from your home Study Environment  \
126                                 5             Noisy   
107                                 3             Noisy   
103                                 3          Peaceful   
114                                 1          Peaceful   
99                                  3          Peaceful   

            What coping strategy you use as a student?  \
126                 Emotional breakdown (crying a lot)   
107                   Social support (friends, family)   
103             
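
Proportional allocation can be made explicit by computing each stratum's quota before sampling. The sketch below uses largest-remainder rounding so that the quotas sum to exactly the target size. The helper names `proportional_allocation` and `stratified_sample` are hypothetical, and the stratum sizes in the demo are made up, not taken from the survey.

```python
import numpy as np
import pandas as pd


def proportional_allocation(stratum_sizes, n):
    """Split n across strata in proportion to size (largest-remainder rounding)."""
    sizes = pd.Series(stratum_sizes, dtype=float)
    exact = sizes * n / sizes.sum()        # ideal, possibly fractional, quotas
    quota = np.floor(exact).astype(int)
    shortfall = n - int(quota.sum())       # units still to hand out
    # Give the remaining units to the strata with the largest fractional parts
    top = (exact - quota).sort_values(ascending=False).index[:shortfall]
    quota.loc[top] += 1
    return quota


def stratified_sample(frame, strata_col, n, seed=42):
    """Draw exactly n rows, allocated proportionally across strata."""
    quota = proportional_allocation(frame[strata_col].value_counts(), n)
    parts = [
        group.sample(n=int(quota[name]), random_state=seed)
        for name, group in frame.groupby(strata_col)
    ]
    return pd.concat(parts)


# Synthetic demo with hypothetical stratum sizes (100 / 29 / 11 rows)
demo = pd.DataFrame({
    "stage": ["undergraduate"] * 100 + ["high school"] * 29 + ["post-graduate"] * 11,
    "stress": np.random.default_rng(0).integers(1, 6, size=140),
})
sample = stratified_sample(demo, "stage", n=50)
print(sample["stage"].value_counts())
```

Comparing `value_counts()` of the sample with that of the full frame is a quick check that each stratum is represented in proportion to its size.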

## Part E — Cluster Sampling

In [27]:
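# Set parameters for clustering
num_clusters = 10
clusters_to_select = 2

# 1. Create a cluster_id column from the row index: each block of cluster_size
#    consecutive rows forms one cluster. This assumes the default 0..N-1 index and
#    yields exactly num_clusters clusters only when N is divisible by num_clusters
#    (here 140 / 10 = 14 rows per cluster).
#    Note: this adds a column to df, so re-running Part A afterwards reports 10 columns.
cluster_size = len(df) // num_clusters
df['cluster_id'] = df.index // cluster_size

# 2. Randomly select the clusters (unseeded, so the selection changes on every run)
selected_clusters = np.random.choice(df['cluster_id'].unique(), size=clusters_to_select, replace=False)

# 3. Select all rows from the chosen clusters.
#    This gives clusters_to_select * cluster_size = 28 rows, not the 50 used in Parts B–D.
cluster_sample = df[df['cluster_id'].isin(selected_clusters)]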

print("--- Cluster Sample Head ---")
print(f"Total Clusters Created: {num_clusters}")
print(f"Number of Clusters Selected: {clusters_to_select}")
print("Selected clusters:", selected_clusters)
print(cluster_sample.head())

# Calculate and report the mean using the ANALYSIS_COL variable
print(f"\nPopulation mean: {df[ANALYSIS_COL].mean():.4f}")
print(f"Sample mean: {cluster_sample[ANALYSIS_COL].mean():.4f}")

# Store the sample mean for later comparison (Part F)
sample_means['Cluster Sample'] = cluster_sample[ANALYSIS_COL].mean()

--- Cluster Sample Head ---
Total Clusters Created: 10
Number of Clusters Selected: 2
Selected clusters: [1 5]
              Timestamp Your Academic Stage  Peer pressure  \
14  24/07/2025 22:14:16       undergraduate              3   
15  24/07/2025 22:15:06       undergraduate              5   
16  24/07/2025 22:15:39       undergraduate              3   
17  24/07/2025 22:16:10       undergraduate              3   
18  24/07/2025 22:16:53       undergraduate              3   

    Academic pressure from your home Study Environment  \
14                                 4         disrupted   
15                                 5             Noisy   
16                                 3          Peaceful   
17                                 5         disrupted   
18                                 3          Peaceful   

           What coping strategy you use as a student?  \
14                   Social support (friends, family)   
15  Analyze the situation and handle it with intel...
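
Dividing the index by a fixed cluster size works here because 140 rows split evenly into 10 blocks. A more general sketch, which assigns cluster labels by row position and leaves the original frame unmodified, might look like the following. The function name `cluster_sample`, the seed, and the 143-row demo frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def cluster_sample(frame, num_clusters, clusters_to_select, rng):
    """One-stage cluster sample over contiguous, near-equal blocks of rows."""
    N = len(frame)
    # Row position -> cluster label in 0..num_clusters-1, even if N % num_clusters != 0
    labels = np.arange(N) * num_clusters // N
    chosen = rng.choice(num_clusters, size=clusters_to_select, replace=False)
    mask = np.isin(labels, chosen)
    # assign() returns a copy, so the caller's frame gains no extra column
    return frame[mask].assign(cluster_id=labels[mask]), np.sort(chosen)


rng = np.random.default_rng(7)
demo = pd.DataFrame({"stress": rng.integers(1, 6, size=143)})  # 143 is not divisible by 10

sample, chosen = cluster_sample(demo, num_clusters=10, clusters_to_select=2, rng=rng)
print("Selected clusters:", chosen)
print("Rows sampled:", len(sample))
```

Passing the random generator in explicitly makes the selection reproducible, which the unseeded `np.random.choice` call is not.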

## Part F — Comparison & Reflection

The analysis focused on sampling the **'academic stress index'** score. The goal was to see which sampling method most accurately estimates the true population mean.
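
One way to make this comparison concrete is a table listing each method's sample mean next to the population mean. The sketch below assumes a `sample_means` dictionary like the one filled in Parts B–E. The helper name `compare_to_population` is hypothetical, and the sample means in the demo are placeholder values, not results from this notebook; only the population mean of 3.7214 comes from Part A.

```python
import pandas as pd


def compare_to_population(sample_means, population_mean):
    """Tabulate each method's sample mean and its absolute error."""
    table = pd.DataFrame({"Sample Mean": pd.Series(sample_means, dtype=float)})
    table["Population Mean"] = population_mean
    table["Absolute Difference"] = (table["Sample Mean"] - population_mean).abs()
    # Smallest error first, so the closest method is on the top row
    return table.sort_values("Absolute Difference")


# Placeholder sample means, for illustration only
demo_means = {"Method A": 3.60, "Method B": 3.75, "Method C": 3.90}
comparison = compare_to_population(demo_means, population_mean=3.7214)
print(comparison.round(4))
print("Closest method in this run:", comparison.index[0])
```

In the notebook itself, the call would presumably be `compare_to_population(sample_means, population_mean)`.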

- **Stratified Sampling** typically performs well when the stratification column (Academic Stage) is strongly associated with the target variable (Stress Index), because every subgroup is represented in proportion to its size.
- **Simple Random Sampling** provides an unbiased estimate, but any single sample can land above or below the population mean by chance; its precision depends on the sample size and the spread of the data.
- **Systematic Sampling** is often nearly as good as SRS, provided there is no periodic pattern in the row order that aligns with the sampling interval (k).
- **Cluster Sampling** (only 2 of 10 clusters selected) often shows the largest difference, because the sample is concentrated in a few groups. Here each cluster is a block of consecutive rows, which may not reflect the diversity of the whole population.

Based on the generated comparison table, the method with the smallest 'Absolute Difference' is the most accurate for this specific run only. A single draw says little about a method in general: to compare the methods reliably, the entire process would need to be repeated many times (a simulation) and the average error of each method compared.
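
A rough sketch of such a simulation follows. It uses a synthetic population in which stress differs by stage (all values hypothetical), repeats simple random and proportional stratified sampling many times, and compares the mean absolute error of each. Systematic and cluster sampling could be added to the loop in the same way.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic population: 140 rows whose stress level depends on stage (hypothetical)
stage = np.repeat(["high school", "undergraduate", "post-graduate"], [40, 80, 20])
base = pd.Series(stage).map(
    {"high school": 2.5, "undergraduate": 3.5, "post-graduate": 4.5}
)
pop = pd.DataFrame({"stage": stage, "stress": base + rng.normal(0, 0.5, size=140)})
true_mean = pop["stress"].mean()

n, runs = 50, 500
errors = {"SRS": [], "Stratified": []}
for seed in range(runs):
    # A different seed per run gives an independent draw for each method
    srs = pop.sample(n=n, random_state=seed)
    strat = pop.groupby("stage").sample(frac=n / len(pop), random_state=seed)
    errors["SRS"].append(abs(srs["stress"].mean() - true_mean))
    errors["Stratified"].append(abs(strat["stress"].mean() - true_mean))

# Mean absolute error of each method across all runs
summary = pd.Series({name: np.mean(errs) for name, errs in errors.items()})
print(summary.round(4))
```

The size of the gap between the two methods depends on how strongly the strata differ; with strata that have similar means, the two errors would be expected to be close.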