# Sampling Assignment
Implementing Probability Sampling Methods in Python



In [31]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Dataset location and the columns used throughout the notebook
FILE_NAME = 'academic stress level.csv'
NUMERIC_COL = 'Rate your academic stress index'   # target for the mean comparison
CATEGORICAL_COL = 'Your Academic Stage'           # stratification variable
SAMPLE_SIZE = 50

try:
    df = pd.read_csv(FILE_NAME)
except FileNotFoundError:
    print(f"Error: The file '{FILE_NAME}' was not found. Please ensure it is in the correct directory.")
    raise  # stop this cell; exit() would shut down the notebook kernel

print("Dataset loaded successfully.")
print(f"Target Column for Mean Comparison: {NUMERIC_COL}")
df.head()

Dataset loaded successfully.
Target Column for Mean Comparison: Rate your academic stress index


Unnamed: 0,Timestamp,Your Academic Stage,Peer pressure,Academic pressure from your home,Study Environment,What coping strategy you use as a student?,"Do you have any bad habits like smoking, drinking on a daily basis?",What would you rate the academic competition in your student life,Rate your academic stress index
0,24/07/2025 22:05:39,undergraduate,4,5,Noisy,Analyze the situation and handle it with intel...,No,3,5
1,24/07/2025 22:05:52,undergraduate,3,4,Peaceful,Analyze the situation and handle it with intel...,No,3,3
2,24/07/2025 22:06:39,undergraduate,1,1,Peaceful,"Social support (friends, family)",No,2,4
3,24/07/2025 22:06:45,undergraduate,3,2,Peaceful,Analyze the situation and handle it with intel...,No,4,3
4,24/07/2025 22:08:06,undergraduate,3,3,Peaceful,Analyze the situation and handle it with intel...,No,4,5


## Part A — Setup
- Report dataset size (rows, columns)

In [11]:
print("Dataset size (rows, columns):", df.shape)

N, M = df.shape
print(f"\nPopulation Size (N): {N} rows, {M} columns")


Dataset size (rows, columns): (140, 9)

Population Size (N): 140 rows, 9 columns


## Part B — Simple Random Sampling

In [15]:
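import difflib
import pandas as pd

EXPECTED_COL, sample_size = 'Rate your academic stress index', 50

# Use the exact column name if present; otherwise fall back to the closest
# match (the CSV header may differ slightly, e.g. by trailing whitespace).
if EXPECTED_COL in df.columns:
    ANALYSIS_COL = EXPECTED_COL
else:
    matches = difflib.get_close_matches(EXPECTED_COL, df.columns, n=1, cutoff=0.5)
    if not matches:
        raise KeyError(f"No column resembling '{EXPECTED_COL}' was found.")
    ANALYSIS_COL = matches[0]

# Coerce to numeric so non-numeric entries become NaN and are skipped by mean()
df[ANALYSIS_COL] = pd.to_numeric(df[ANALYSIS_COL], errors='coerce')

# Simple random sample without replacement (fixed seed for reproducibility)
srs = df.sample(sample_size, random_state=42)

pop_mean, sample_mean = df[ANALYSIS_COL].mean(), srs[ANALYSIS_COL].mean()
sample_means = {'Simple Random Sample (SRS)': sample_mean}

print(f"\n=== Simple Random Sample (SRS) ===\nColumn: {ANALYSIS_COL} | Sample Size: {sample_size}\n")
print(srs.head(), "\n")
print(f"Population Mean: {pop_mean:.4f} | Sample Mean: {sample_mean:.4f}\n")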



=== Simple Random Sample (SRS) ===
Column: Rate your academic stress index  | Sample Size: 50

               Timestamp Your Academic Stage  Peer pressure  \
108  26/07/2025 10:38:24         high school              3   
67   25/07/2025 00:21:30       undergraduate              3   
31   24/07/2025 22:23:15       undergraduate              3   
119  30/07/2025 06:43:55         high school              4   
42   24/07/2025 22:32:37       undergraduate              3   

     Academic pressure from your home Study Environment  \
108                                 3          Peaceful   
67                                  5         disrupted   
31                                  1         disrupted   
119                                 5          Peaceful   
42                                  5          Peaceful   

            What coping strategy you use as a student?  \
108  Analyze the situation and handle it with intel...   
67                  Emotional breakdown (crying a lot)
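
To gauge how far a single simple random sample mean might plausibly fall from the population mean, one option is to estimate its standard error with the finite population correction, since 50 of 140 rows is a sizeable share of the population. The sketch below is a minimal illustration and not part of the original notebook; the helper name `srs_standard_error` and the toy values are hypothetical.

```python
import numpy as np

def srs_standard_error(sample_values, population_size):
    """Estimated standard error of an SRS mean (sampling without replacement)."""
    sample_values = np.asarray(sample_values, dtype=float)
    n = len(sample_values)
    s2 = sample_values.var(ddof=1)      # sample variance
    fpc = 1 - n / population_size       # finite population correction
    return float(np.sqrt(s2 / n * fpc))

# Toy values for illustration only (not taken from the survey data)
se = srs_standard_error([1, 2, 3, 4, 5], population_size=10)
print(se)
```

In this notebook the call could look like `srs_standard_error(srs[ANALYSIS_COL].dropna(), len(df))`.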

## Part C — Systematic Sampling

In [16]:
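import numpy as np

sample_size = 50
N = len(df)
k = N // sample_size               # integer sampling interval (140 // 50 = 2)
start = np.random.randint(0, k)    # random start in [0, k); unseeded, so it varies between runs

# Take every k-th row from the start, then keep the first sample_size rows.
# Because k is rounded down, the kept rows all come from roughly the first
# k * sample_size rows of the dataset; later rows cannot be selected.
sys_sample = df.iloc[start::k][:sample_size]

pop_mean = df[ANALYSIS_COL].mean()
sample_mean = sys_sample[ANALYSIS_COL].mean()
sample_means['Systematic Sample'] = sample_mean

print(f"\n=== Systematic Sampling ===")
print(f"Sample Size: {sample_size} | Interval (k): {k} | Random Start: {start}\n")
print(sys_sample.head(), "\n")
print(f"Population Mean : {pop_mean:.4f}")
print(f"Sample Mean     : {sample_mean:.4f}\n")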



=== Systematic Sampling ===
Sample Size: 50 | Interval (k): 2 | Random Start: 1

             Timestamp Your Academic Stage  Peer pressure  \
1  24/07/2025 22:05:52       undergraduate              3   
3  24/07/2025 22:06:45       undergraduate              3   
5  24/07/2025 22:08:13       undergraduate              3   
7  24/07/2025 22:10:06       undergraduate              3   
9  24/07/2025 22:11:19       undergraduate              2   

   Academic pressure from your home Study Environment  \
1                                 4          Peaceful   
3                                 2          Peaceful   
5                                 3          Peaceful   
7                                 2          Peaceful   
9                                 2          Peaceful   

          What coping strategy you use as a student?  \
1  Analyze the situation and handle it with intel...   
3  Analyze the situation and handle it with intel...   
5  Analyze the situation and handle it w
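
Because 140 / 50 = 2.8 is not a whole number, rounding the interval down to 2 and keeping the first 50 rows covers only about the first 100 rows of the dataset. One common alternative, sketched here under the assumption that rows are addressed by position, uses a fractional interval so the picks span the whole population. The function name `systematic_positions` and the seed are illustrative choices, not part of the original notebook.

```python
import numpy as np

def systematic_positions(population_size, sample_size, rng):
    """Row positions for a systematic sample using a fractional interval."""
    step = population_size / sample_size           # e.g. 140 / 50 = 2.8
    start = rng.uniform(0, step)                   # random start in [0, step)
    positions = np.floor(start + step * np.arange(sample_size)).astype(int)
    # Guard against floating-point edge cases at the very end of the range
    return np.minimum(positions, population_size - 1)

rng = np.random.default_rng(42)                    # seeded for reproducibility
positions = systematic_positions(140, 50, rng)
print(positions.min(), positions.max(), len(set(positions.tolist())))
```

The resulting positions could then be used as `df.iloc[positions]`, which reaches rows near the end of the dataset as well as the beginning.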

## Part D — Stratified Sampling

In [17]:
strata_col = 'Your Academic Stage'
sample_size = 50
frac = sample_size / len(df)   # common sampling fraction applied to every stratum

# Proportional allocation: draw the same fraction of rows within each academic stage.
# Each stratum's count is rounded to whole rows, so the total may differ slightly
# from sample_size.
stratified_sample = df.groupby(strata_col, group_keys=False).sample(frac=frac, random_state=42)

pop_mean = df[ANALYSIS_COL].mean()
sample_mean = stratified_sample[ANALYSIS_COL].mean()
sample_means['Stratified Sample'] = sample_mean

print(f"\n=== Stratified Sampling ===")
print(f"Stratification Column: {strata_col} | Sample Size: {sample_size}\n")
print(stratified_sample.head(), "\n")
print(f"Population Mean : {pop_mean:.4f}")
print(f"Sample Mean     : {sample_mean:.4f}\n")



=== Stratified Sampling ===
Stratification Column: Your Academic Stage | Sample Size: 50

               Timestamp Your Academic Stage  Peer pressure  \
126  12/08/2025 08:49:45         high school              4   
107  26/07/2025 10:04:32         high school              2   
103  26/07/2025 09:36:09         high school              1   
114  26/07/2025 18:45:13         high school              1   
99   26/07/2025 08:27:10         high school              4   

     Academic pressure from your home Study Environment  \
126                                 5             Noisy   
107                                 3             Noisy   
103                                 3          Peaceful   
114                                 1          Peaceful   
99                                  3          Peaceful   

            What coping strategy you use as a student?  \
126                 Emotional breakdown (crying a lot)   
107                   Social support (friends, family)   
1
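
Sampling a fixed fraction per stratum rounds each stratum's count independently, so the total can land slightly off the target of 50. A possible way to hit the target exactly is proportional allocation with largest-remainder rounding, sketched below. The helper names and the toy stratum sizes (100, 29, 11) are assumptions for illustration and are not the dataset's actual counts.

```python
import numpy as np
import pandas as pd

def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata proportionally, summing exactly to sample_size."""
    exact = stratum_sizes * sample_size / stratum_sizes.sum()
    alloc = np.floor(exact).astype(int)
    shortfall = int(sample_size - alloc.sum())
    # Give the leftover units to the strata with the largest fractional parts
    order = (exact - alloc).sort_values(ascending=False, kind="stable").index
    alloc.loc[order[:shortfall]] += 1
    return alloc

def stratified_sample(df, strata_col, sample_size, seed=42):
    """Draw a stratified sample whose stratum sizes follow proportional_allocation."""
    alloc = proportional_allocation(df[strata_col].value_counts(), sample_size)
    parts = [
        group.sample(n=int(alloc[name]), random_state=seed)
        for name, group in df.groupby(strata_col)
    ]
    return pd.concat(parts)

# Toy data with made-up stratum sizes (not the real dataset)
toy = pd.DataFrame({
    "stage": ["stage_a"] * 100 + ["stage_b"] * 29 + ["stage_c"] * 11,
    "score": np.arange(140) % 5 + 1,
})
sample = stratified_sample(toy, "stage", 50)
print(sample["stage"].value_counts().to_dict())
print(len(sample))
```

With the notebook's variables, the equivalent call would be `stratified_sample(df, strata_col, sample_size)`.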

## Part E — Cluster Sampling

In [18]:
import numpy as np

num_clusters, clusters_to_select = 10, 2
cluster_size = len(df) // num_clusters   # 140 // 10 = 14 rows per cluster

# Assign consecutive blocks of rows to clusters (relies on the default 0..N-1 index).
# Note that this adds a 'cluster_id' column to df.
df['cluster_id'] = df.index // cluster_size

# Randomly pick whole clusters (unseeded) and keep every row in them
selected_clusters = np.random.choice(df['cluster_id'].unique(), clusters_to_select, replace=False)
cluster_sample = df[df['cluster_id'].isin(selected_clusters)]

pop_mean = df[ANALYSIS_COL].mean()
sample_mean = cluster_sample[ANALYSIS_COL].mean()
sample_means['Cluster Sample'] = sample_mean

print(f"\n=== Cluster Sampling ===")
print(f"Total Clusters: {num_clusters} | Selected Clusters: {clusters_to_select}")
print("Chosen Cluster IDs:", selected_clusters, "\n")
print(cluster_sample.head(), "\n")
print(f"Population Mean : {pop_mean:.4f}")
print(f"Sample Mean     : {sample_mean:.4f}\n")



=== Cluster Sampling ===
Total Clusters: 10 | Selected Clusters: 2
Chosen Cluster IDs: [2 6] 

              Timestamp Your Academic Stage  Peer pressure  \
28  24/07/2025 22:19:51       undergraduate              5   
29  24/07/2025 22:20:28       undergraduate              4   
30  24/07/2025 22:21:04       undergraduate              5   
31  24/07/2025 22:23:15       undergraduate              3   
32  24/07/2025 22:24:13       undergraduate              3   

    Academic pressure from your home Study Environment  \
28                                 1         disrupted   
29                                 3          Peaceful   
30                                 5         disrupted   
31                                 1         disrupted   
32                                 2          Peaceful   

           What coping strategy you use as a student?  \
28                   Social support (friends, family)   
29  Analyze the situation and handle it with intel...   
30         
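
`df.index // cluster_size` yields exactly 10 clusters only when the row count divides evenly and the index runs from 0 to N-1, and the cluster draw above is unseeded. A variant that assigns clusters by row position, uses a seeded generator, and leaves `df` unmodified might look like the following. The function name `cluster_sample` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def cluster_sample(df, num_clusters, clusters_to_select, seed=42):
    """Pick whole clusters of consecutive rows; returns (sample, chosen cluster ids)."""
    rng = np.random.default_rng(seed)
    n_rows = len(df)
    # Position-based ids: always exactly num_clusters blocks of near-equal size
    cluster_ids = np.arange(n_rows) * num_clusters // n_rows
    chosen = np.sort(rng.choice(num_clusters, size=clusters_to_select, replace=False))
    return df[np.isin(cluster_ids, chosen)], chosen

# Toy frame with 145 rows, a count that is not divisible by 10
toy = pd.DataFrame({"score": np.arange(145) % 5 + 1})
sample, chosen = cluster_sample(toy, num_clusters=10, clusters_to_select=2)
print(chosen, len(sample))
```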

## Part F — Comparison & Reflection

The analysis focused on sampling the **'academic stress index'** score. The goal was to see which sampling method most accurately estimates the true population mean.

- Typically, **Stratified Sampling** performs well when the stratification column (Academic Stage) is highly correlated with the target variable (Stress Index), as it ensures all key subgroups are proportionally represented.
- **Simple Random Sampling** gives an unbiased estimate of the mean, but any single sample can miss the true value by chance; the typical error shrinks as the sample size grows.
- **Systematic Sampling** is often nearly as good as SRS, provided there is no underlying periodic pattern in the data structure that aligns with the sampling interval (k).
- **Cluster Sampling** (here, only 2 of 10 clusters, about 28 rows) often shows the largest difference because the sample is concentrated in a few blocks of consecutive rows, which may not reflect the diversity of the whole population.
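
The sample means collected in `sample_means` can be set against the population mean in one small table. The sketch below assumes that dictionary and `pop_mean` from the earlier cells; the helper name `comparison_table` is hypothetical, and the demo numbers are placeholders rather than results from the dataset.

```python
import pandas as pd

def comparison_table(pop_mean, sample_means):
    """Tabulate each method's sample mean and its absolute difference from pop_mean."""
    table = pd.DataFrame({
        "Method": list(sample_means.keys()),
        "Sample Mean": list(sample_means.values()),
    })
    table["Population Mean"] = pop_mean
    table["Absolute Difference"] = (table["Sample Mean"] - pop_mean).abs()
    # Smallest difference first, i.e. the closest estimate in this run
    return table.sort_values("Absolute Difference").reset_index(drop=True)

# Placeholder values for illustration only
demo = comparison_table(3.0, {"Method A": 3.2, "Method B": 2.9, "Method C": 3.5})
print(demo)
```

In the notebook itself the call would be `comparison_table(pop_mean, sample_means)`.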

In a single run, the method with the smallest absolute difference between its sample mean and the population mean is the most accurate, but that ranking reflects only one random draw. For a reliable comparison, the whole process should be repeated many times (a simulation) and each method's error averaged across runs.
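
Repeating the draws can be sketched as a small simulation that records each method's absolute error over many runs and averages it. The version below works on a synthetic population (random scores from 1 to 5 with two made-up strata), so its numbers say nothing about the real survey; the function and parameter names are illustrative, and the systematic step uses a fractional interval rather than the integer interval from Part C.

```python
import numpy as np

def simulate_mean_abs_error(values, strata, sample_size=50, num_clusters=10,
                            clusters_to_select=2, runs=1000, seed=0):
    """Average |sample mean - population mean| per method over repeated draws."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    strata = np.asarray(strata)
    n_pop = len(values)
    pop_mean = values.mean()
    step = n_pop / sample_size                               # fractional systematic interval
    cluster_ids = np.arange(n_pop) * num_clusters // n_pop   # contiguous blocks of rows
    strata_positions = [np.flatnonzero(strata == label) for label in np.unique(strata)]
    errors = {"SRS": [], "Systematic": [], "Stratified": [], "Cluster": []}

    for _ in range(runs):
        srs = rng.choice(n_pop, size=sample_size, replace=False)
        start = rng.uniform(0, step)
        systematic = np.minimum(
            np.floor(start + step * np.arange(sample_size)).astype(int), n_pop - 1)
        # Same sampling fraction in every stratum (counts rounded per stratum)
        stratified = np.concatenate([
            rng.choice(pos, size=round(len(pos) * sample_size / n_pop), replace=False)
            for pos in strata_positions
        ])
        chosen = rng.choice(num_clusters, size=clusters_to_select, replace=False)
        cluster = np.flatnonzero(np.isin(cluster_ids, chosen))
        for name, picked in [("SRS", srs), ("Systematic", systematic),
                             ("Stratified", stratified), ("Cluster", cluster)]:
            errors[name].append(abs(values[picked].mean() - pop_mean))

    return {name: float(np.mean(errs)) for name, errs in errors.items()}

# Synthetic population: 140 random scores from 1 to 5 and two made-up strata
demo_rng = np.random.default_rng(1)
scores = demo_rng.integers(1, 6, size=140)
labels = np.array(["a"] * 100 + ["b"] * 40)
results = simulate_mean_abs_error(scores, labels, runs=200)
for method, err in results.items():
    print(f"{method:<11} mean absolute error: {err:.4f}")
```

For the survey data, `values` would be the stress index column (with missing values dropped) and `strata` the matching academic stage labels.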