
**Phase 1: The Danger of Randomness (Manual Split)**

In [2]:
import seaborn as sns
import pandas as pd
import numpy as np

# 1. Data Ingestion (The Population)
df = sns.load_dataset('titanic')
print(f"Total Population: {len(df)}")
print(f"Population Survival Rate: {df['survived'].mean():.4f}")

# 2. Manual Shuffle (Simulation of Sampling)
# We set a seed to ensure reproducibility for the lesson,
# but in production, this variance happens naturally.
np.random.seed(2026)
indices = np.random.permutation(df.index)

Total Population: 891
Population Survival Rate: 0.3838


**Step 2: The Split and The Bias Check**


*   We split the data 80/20, then compute the survival rate in each group. Ideally the two rates match closely; a large gap means the split itself introduced sampling bias.



In [3]:
# 3. Cut the deck (80/20 Split)
split_point = int(len(df.index) * 0.8)

# Slicing the shuffled indices
train_idx = indices[:split_point]
test_idx = indices[split_point:]

# Creating the subsets
train_set = df.loc[train_idx]
test_set = df.loc[test_idx]

# 4. Bias Check (The Delta)
train_surv = train_set['survived'].mean()
test_surv = test_set['survived'].mean()
delta = abs(train_surv - test_surv)

print(f"Train Survival Rate: {train_surv:.4f}")
print(f"Test Survival Rate:  {test_surv:.4f}")
print(f"Sampling Bias (Delta): {delta:.4f}")

Train Survival Rate: 0.3736
Test Survival Rate:  0.4246
Sampling Bias (Delta): 0.0510
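
The delta above comes from a single seed. A minimal sketch of how much it can move from one shuffle to the next is below. It does not reload the Titanic data: it assumes 342 survivors out of 891 (reconstructed from the printed rate of 0.3838), uses `np.random.default_rng` rather than `np.random.seed`, and the helper `split_delta` and the seed range are hypothetical, so the numbers will not match the cell above exactly.

```python
import numpy as np

# Assumption: 342 survivors out of 891, reconstructed from the printed
# population survival rate (0.3838 * 891 ≈ 342). Row order is synthetic.
survived = np.zeros(891, dtype=int)
survived[:342] = 1

def split_delta(labels, seed, train_frac=0.8):
    """Shuffle, cut at train_frac, and return |train mean - test mean|."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(labels)
    cut = int(len(shuffled) * train_frac)
    return abs(shuffled[:cut].mean() - shuffled[cut:].mean())

# Hypothetical seeds 0..199, chosen only for illustration.
deltas = np.array([split_delta(survived, seed) for seed in range(200)])

print(f"Mean delta: {deltas.mean():.4f}")
print(f"Max delta:  {deltas.max():.4f}")
print(f"Share of splits with delta > 0.05: {(deltas > 0.05).mean():.2%}")
```

If a noticeable share of seeds produces a delta as large as the one printed above, the gap is ordinary sampling noise rather than a quirk of seed 2026.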


**Step 3: Fixing Covariate Shift**
*   We suspect that passenger class (`pclass`) is a major driver of survival. A purely random split can over- or under-represent First Class passengers in the training set. Stratifying on `pclass` forces the class distribution to be nearly identical in both sets.




In [7]:
from sklearn.model_selection import train_test_split

# Stratifying by 'pclass' keeps the class proportions nearly identical in both sets.
# random_state makes the split reproducible.
X_train, X_test = train_test_split(df, test_size=0.2, stratify=df['pclass'], random_state=2026)

print("\n--- Stratified Split ---")
print("Train Class Dist:\n", X_train['pclass'].value_counts(normalize=True))
print("Test Class Dist:\n", X_test['pclass'].value_counts(normalize=True))


--- Stratified Split ---
Train Class Dist:
 pclass
3    0.550562
1    0.242978
2    0.206461
Name: proportion, dtype: float64
Test Class Dist:
 pclass
3    0.553073
1    0.240223
2    0.206704
Name: proportion, dtype: float64
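
To make the idea behind `stratify` concrete, here is a rough sketch of a stratified split done by hand: shuffle within each class, then take the same fraction of every class for the test set. This is not scikit-learn's internal implementation. The function `manual_stratified_split` is hypothetical, and the frame is synthetic, built from class counts of 216, 184 and 491 (an assumption reconstructed from the proportions the notebook prints for 891 passengers).

```python
import numpy as np
import pandas as pd

def manual_stratified_split(frame, column, test_frac=0.2, seed=0):
    """Take test_frac of the rows from every group of `column` as the test set."""
    rng = np.random.default_rng(seed)
    test_idx = []
    for _, group in frame.groupby(column):
        shuffled = rng.permutation(group.index.to_numpy())
        n_test = int(round(len(shuffled) * test_frac))
        test_idx.extend(shuffled[:n_test])
    test = frame.loc[test_idx]
    train = frame.drop(index=test_idx)
    return train, test

# Synthetic stand-in for the Titanic 'pclass' column (assumed counts).
toy = pd.DataFrame({"pclass": np.repeat([1, 2, 3], [216, 184, 491])})

toy_train, toy_test = manual_stratified_split(toy, "pclass")
train_dist = toy_train["pclass"].value_counts(normalize=True).sort_index()
test_dist = toy_test["pclass"].value_counts(normalize=True).sort_index()

print("Largest gap in class share:", (train_dist - test_dist).abs().max())
```

Because each class is cut separately, the class shares in the two sets can differ only by rounding, which is why the proportions above agree to about two decimal places.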


In [10]:
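# Population proportions of each passenger class, ordered 1, 2, 3
population_probs = df['pclass'].value_counts(normalize=True).sort_index()

# Draw a plain random sample (no stratification)
sample_size = 100
bad_sample = df.sample(n=sample_size, random_state=2026)

# Align sample counts to the population's class order; a class missing from the sample counts as 0
observed = bad_sample['pclass'].value_counts().reindex(population_probs.index, fill_value=0).values
expected = (population_probs * sample_size).values

print(f"observed counts (sample):{observed}")
print(f"expected counts (ideal):{expected}")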

observed counts (sample):[21 22 57]
expected counts (ideal):[24.24242424 20.65095398 55.10662177]
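
One way to judge whether these observed counts are "too far" from the expected ones is a chi-square goodness-of-fit test. The sketch below hard-codes the counts printed above instead of reloading the data. The class counts 216, 184 and 491 are an assumption reconstructed from the printed expected values, and the 0.01 threshold is borrowed from the SRM check in this notebook rather than being a fixed rule.

```python
import numpy as np
from scipy.stats import chisquare

# Observed class counts (pclass 1, 2, 3) copied from the output above.
observed = np.array([21, 22, 57])

# Assumption: population class counts that reproduce the printed expected values.
class_counts = np.array([216, 184, 491])
expected = class_counts / class_counts.sum() * observed.sum()

# Goodness-of-fit test: does the sample follow the population proportions?
chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)

alpha = 0.01  # same threshold as the SRM check in this notebook
print(f"Chi-square statistic: {chi2_stat:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < alpha:
    print("Sample class mix differs from the population more than chance would suggest.")
else:
    print("Sample class mix is consistent with the population.")
```

With three classes the test has two degrees of freedom, and deviations of only two or three passengers per class give a small statistic, so this particular sample is not flagged.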


In [8]:
import numpy as np
from scipy.stats import chisquare

# 1. Observed and expected counts
observed = np.array([450, 550])        # Control, Treatment
expected = np.array([500, 500])        # Planned 50/50 split

# 2. Chi-square test
chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)

# 3. Print results
print(f"Chi-square statistic: {chi2_stat:.4f}")
print(f"P-value: {p_value:.6f}")

if p_value < 0.01:
    print("CRITICAL FAILURE: Sample Ratio Mismatch (SRM) Detected. Check Load Balancer.")
else:
    print("Variance is within natural limits.")


Chi-square statistic: 10.0000
P-value: 0.001565
CRITICAL FAILURE: Sample Ratio Mismatch (SRM) Detected. Check Load Balancer.
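
The same check can be wrapped in a small helper so it works for any planned allocation, not only 50/50. This is a sketch under assumptions: the name `srm_check`, its parameters, and the 90/10 example counts are hypothetical and do not come from the notebook.

```python
import numpy as np
from scipy.stats import chisquare

def srm_check(observed_counts, planned_ratios, alpha=0.01):
    """Return (chi2_stat, p_value, mismatch) for a planned traffic allocation."""
    observed = np.asarray(observed_counts, dtype=float)
    ratios = np.asarray(planned_ratios, dtype=float)
    # Scale the planned ratios so the expected counts sum to the observed total.
    expected = ratios / ratios.sum() * observed.sum()
    chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)
    return chi2_stat, p_value, bool(p_value < alpha)

# The 450/550 example from the cell above, planned as 50/50.
stat, p, mismatch = srm_check([450, 550], [0.5, 0.5])
print(f"50/50 plan: chi2={stat:.4f}, p={p:.6f}, mismatch={mismatch}")

# Hypothetical 90/10 rollout with made-up counts, for illustration only.
stat2, p2, mismatch2 = srm_check([9050, 950], [0.9, 0.1])
print(f"90/10 plan: chi2={stat2:.4f}, p={p2:.6f}, mismatch={mismatch2}")
```

Deriving the expected counts from ratios keeps their total equal to the observed total, which `chisquare` requires.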
