### Detect Data Drift in ML Models
**Objective**: Monitor and detect changes in data distributions that impact ML model performance.

**Task**: Categorical Feature Drift

**Steps**:
1. Load the baseline distribution for a categorical feature (e.g., gender) from your training dataset.
2. Load the same feature from your current production data.
3. Use a chi-squared test to compare the two distributions of the categorical feature.
4. If significant drift is detected, investigate the cause and update the model as needed.
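
As a reminder of what the chi-squared test in step 3 computes, here is the statistic in standard textbook notation. The symbols are introduced only for this note and do not appear in the notebook: $O_{ij}$ is the observed count for sample $i$ (baseline or current) and category $j$, $R_i$ and $C_j$ are the row and column totals, and $N$ is the grand total.

$$
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\text{dof} = (r - 1)(c - 1)
$$

With $r = 2$ samples and $c$ categories, the degrees of freedom reduce to $c - 1$. A small p-value for $\chi^2$ suggests the category proportions differ between the two samples.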

In [1]:
import pandas as pd
from scipy.stats import chi2_contingency

# Step 1: Load the baseline distribution for a categorical feature
baseline_data = pd.Series(['Male', 'Female', 'Male', 'Other', 'Female', 'Male', 'Male'])
baseline_counts = baseline_data.value_counts()
print("Baseline Distribution:")
print(baseline_counts)

# Step 2: Load the same feature from your current production data
current_data = pd.Series(['Male', 'Female', 'Female', 'Male', 'Unknown', 'Female'])
current_counts = current_data.value_counts()
print("\nCurrent Distribution:")
print(current_counts)

# Step 3: Use a chi-squared test to compare the distributions
# Build the contingency table on a shared, sorted set of categories so that
# every column lines up with its label; a category missing from one sample counts as 0.
categories = baseline_counts.index.union(current_counts.index)
contingency = pd.DataFrame([
    baseline_counts.reindex(categories, fill_value=0),
    current_counts.reindex(categories, fill_value=0),
])
contingency.index = ['Baseline', 'Current']

# Perform the Chi-Squared test
chi2, p_value, dof, expected = chi2_contingency(contingency.values)

print(f"\nChi-Squared Statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of Freedom: {dof}")
print("\nExpected Frequencies:")
# Reuse the contingency table's own labels so values and categories cannot get out of sync
print(pd.DataFrame(expected, index=contingency.index, columns=contingency.columns))

# Step 4: If significant drift is detected, investigate the cause and update the model as needed.
alpha = 0.05  # Significance level

if p_value < alpha:
    print("\nSignificant categorical feature drift detected (p-value < alpha).")
    print("Investigate the cause and consider updating the model.")
else:
    print("\nNo significant categorical feature drift detected (p-value >= alpha).")

Baseline Distribution:
Male      4
Female    2
Other     1
Name: count, dtype: int64

Current Distribution:
Female     3
Male       2
Unknown    1
Name: count, dtype: int64

Chi-Squared Statistic: 2.8063
P-value: 0.4225
Degrees of Freedom: 3

Expected Frequencies:
            Female      Male     Other   Unknown
Baseline  2.692308  3.230769  0.538462  0.538462
Current   2.307692  2.769231  0.461538  0.461538

No significant categorical feature drift detected (p-value >= alpha).
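
The same steps can be wrapped in a reusable helper, sketched below. The function name `categorical_drift_report` and the `min_expected` parameter are hypothetical and not part of the original notebook. The threshold of 5 reflects a common rule of thumb that the chi-squared approximation becomes unreliable when expected counts are small; treat it as an assumption rather than a fixed requirement.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def categorical_drift_report(baseline, current, alpha=0.05, min_expected=5):
    """Chi-squared drift check for one categorical feature (hypothetical helper)."""
    baseline_counts = pd.Series(baseline).value_counts()
    current_counts = pd.Series(current).value_counts()

    # Align both samples on the same sorted set of categories;
    # a category absent from one sample gets a count of 0.
    categories = baseline_counts.index.union(current_counts.index)
    observed = [
        baseline_counts.reindex(categories, fill_value=0).to_numpy(),
        current_counts.reindex(categories, fill_value=0).to_numpy(),
    ]

    chi2, p_value, dof, expected = chi2_contingency(observed)
    return {
        "chi2": chi2,
        "p_value": p_value,
        "dof": dof,
        "drift_detected": bool(p_value < alpha),
        # Assumed rule of thumb: flag the result when any expected count
        # falls below min_expected, since the test is then only approximate.
        "low_expected_counts": bool((expected < min_expected).any()),
    }


report = categorical_drift_report(
    ['Male', 'Female', 'Male', 'Other', 'Female', 'Male', 'Male'],
    ['Male', 'Female', 'Female', 'Male', 'Unknown', 'Female'],
)
print(report)
```

On the toy data used in this task, every expected count is below 5, so `low_expected_counts` is set and the "no drift" verdict should be read as illustrative only. In practice the comparison would be run on samples large enough for the test's approximation to hold.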
