### Detect Data Drift in ML Models
**Objective**: Monitor and detect changes in data distributions that impact ML model performance.

**Task**: Feature Correlation Drift

**Steps**:
1. Compute the correlation matrix of features in your training dataset.
2. Compute the correlation matrix of the same features in your production data.
3. Compare the two correlation matrices (for example, element-wise differences checked against a threshold) to identify significant deviations.
4. Investigate any significant changes in correlation, as they may indicate issues in the data collection process or violated model assumptions.
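
Step 3 hinges on what counts as a "significant" deviation. One possible criterion, sketched here as an assumption rather than as part of the original task, is a Fisher z-test on the difference between two correlation coefficients. It assumes independent samples and roughly bivariate-normal features. The names `fisher_z_pvalue` and `make_pair` and the columns `feature_a`/`feature_b` are hypothetical and used only for illustration.

```python
import math

import numpy as np
import pandas as pd


def fisher_z_pvalue(r_train, r_prod, n_train, n_prod):
    """Two-sided p-value for H0: both samples share the same correlation."""
    # Fisher transform: atanh(r) is approximately normal with variance 1 / (n - 3).
    z_train, z_prod = math.atanh(r_train), math.atanh(r_prod)
    std_err = math.sqrt(1.0 / (n_train - 3) + 1.0 / (n_prod - 3))
    z_stat = (z_train - z_prod) / std_err
    # Two-sided standard-normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z_stat) / math.sqrt(2.0))


def make_pair(rho, n, rng):
    """Hypothetical helper: two standard-normal columns with population correlation rho."""
    x = rng.normal(size=n)
    noise = rng.normal(size=n)
    y = rho * x + math.sqrt(1.0 - rho**2) * noise
    return pd.DataFrame({"feature_a": x, "feature_b": y})


rng = np.random.default_rng(0)
n = 1000
train = make_pair(0.8, n, rng)  # strongly correlated at training time
prod = make_pair(0.3, n, rng)   # correlation has weakened in production

r_train = train["feature_a"].corr(train["feature_b"])
r_prod = prod["feature_a"].corr(prod["feature_b"])
p_value = fisher_z_pvalue(r_train, r_prod, len(train), len(prod))
print(f"train r = {r_train:.3f}, production r = {r_prod:.3f}, p-value = {p_value:.3g}")
```

A small p-value suggests the change in correlation is unlikely to be sampling noise alone. When many feature pairs are tested, some small p-values are expected by chance, so a multiple-comparison correction such as Bonferroni is worth considering.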

```python
import pandas as pd
import numpy as np

# Step 1: Compute the correlation matrix of features in your training dataset
# Note: the synthetic features below are drawn independently of each other,
# so the off-diagonal correlations are expected to be close to zero.
np.random.seed(42)
n_samples = 1000
train_data = pd.DataFrame({
    'feature_1': np.random.normal(0, 1, n_samples),
    'feature_2': 2 * np.random.normal(0, 1, n_samples) + 0.5 * np.random.rand(n_samples),
    'feature_3': np.random.rand(n_samples),
    'feature_4': -1 * np.random.normal(0, 1, n_samples) + 0.2 * np.random.randn(n_samples)
})
train_corr_matrix = train_data.corr(method='pearson')  # You can choose other methods like 'spearman' or 'kendall'
print("Training Data Correlation Matrix:")
print(train_corr_matrix)

# Step 2: Compute the correlation matrix of the same features in your production data
# The means and scales are shifted, but the features are still independent of each other.
production_data = pd.DataFrame({
    'feature_1': np.random.normal(0.2, 1.1, n_samples),
    'feature_2': 1.5 * np.random.normal(0.1, 1.3, n_samples) + 0.8 * np.random.rand(n_samples),
    'feature_3': np.random.rand(n_samples) - 0.2,
    'feature_4': -0.8 * np.random.normal(0.3, 0.9, n_samples) + 0.5 * np.random.randn(n_samples)
})
production_corr_matrix = production_data.corr(method='pearson')
print("\nProduction Data Correlation Matrix:")
print(production_corr_matrix)

# Step 3: Compare the two correlation matrices to identify significant deviations.

# Method 1: Element-wise comparison of correlation coefficients
correlation_diff = train_corr_matrix - production_corr_matrix
print("\nDifference in Correlation Matrices (Train - Production):")
print(correlation_diff)

# Method 2: Calculate the absolute difference and its mean
# Note: this mean runs over every entry of the matrix, including the diagonal
# (always 0) and both copies of each symmetric pair, so the zeros pull it down.
abs_correlation_diff = np.abs(correlation_diff)
mean_abs_diff = abs_correlation_diff.unstack().mean()
print(f"\nMean Absolute Difference in Correlation Coefficients: {mean_abs_diff:.4f}")

# Method 3: Define a threshold for significant change (you'll need to determine an appropriate threshold)
drift_threshold = 0.1  # Example threshold

# Visit each unique feature pair once (upper triangle, excluding the diagonal).
drifted_pairs = []
for i in range(train_corr_matrix.shape[0]):
    for j in range(i + 1, train_corr_matrix.shape[1]):
        feature_i = train_corr_matrix.columns[i]
        feature_j = train_corr_matrix.columns[j]
        diff = abs(train_corr_matrix.iloc[i, j] - production_corr_matrix.iloc[i, j])
        if diff > drift_threshold:
            drifted_pairs.append(((feature_i, feature_j), diff))

if drifted_pairs:
    print(f"\nSignificant changes in correlation (above threshold of {drift_threshold}):")
    for pair, diff in drifted_pairs:
        print(f"Correlation between {pair[0]} and {pair[1]} changed by {diff:.4f}")
else:
    print(f"\nNo significant changes in correlation detected (above threshold of {drift_threshold}).")

# Step 4: Investigate any significant changes in correlation, as they may indicate issues
# in the data collection process or model assumptions.
# (Further investigation would depend on the specific drifted feature pairs and your domain knowledge)
```

```text
Training Data Correlation Matrix:
           feature_1  feature_2  feature_3  feature_4
feature_1   1.000000  -0.036173  -0.034252   0.037764
feature_2  -0.036173   1.000000   0.042969  -0.024759
feature_3  -0.034252   0.042969   1.000000  -0.000325
feature_4   0.037764  -0.024759  -0.000325   1.000000

Production Data Correlation Matrix:
           feature_1  feature_2  feature_3  feature_4
feature_1   1.000000   0.027374  -0.018005  -0.061529
feature_2   0.027374   1.000000   0.011889  -0.008166
feature_3  -0.018005   0.011889   1.000000   0.032642
feature_4  -0.061529  -0.008166   0.032642   1.000000

Difference in Correlation Matrices (Train - Production):
           feature_1  feature_2  feature_3  feature_4
feature_1   0.000000  -0.063547  -0.016247   0.099293
feature_2  -0.063547   0.000000   0.031079  -0.016593
feature_3  -0.016247   0.031079   0.000000  -0.032967
feature_4   0.099293  -0.016593  -0.032967   0.000000

Mean Absolute Difference in Correlation Coefficients: 0.0325
```
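
The mean of 0.0325 is taken over all 16 matrix entries, four of which are diagonal zeros. As a hedged alternative, the sketch below averages only the six unique feature pairs. It uses the difference matrix copied from the printed output above, and the variable names `pair_diffs`, `mean_all`, and `mean_pairs` are illustrative.

```python
import numpy as np
import pandas as pd

features = ["feature_1", "feature_2", "feature_3", "feature_4"]

# Difference matrix copied from the printed output (train - production).
correlation_diff = pd.DataFrame(
    [
        [0.000000, -0.063547, -0.016247, 0.099293],
        [-0.063547, 0.000000, 0.031079, -0.016593],
        [-0.016247, 0.031079, 0.000000, -0.032967],
        [0.099293, -0.016593, -0.032967, 0.000000],
    ],
    index=features,
    columns=features,
)

abs_diff = correlation_diff.abs().to_numpy()

# Strict upper triangle (k=1 skips the diagonal): one entry per unique feature pair.
rows, cols = np.triu_indices(len(features), k=1)
pair_diffs = abs_diff[rows, cols]

mean_all = abs_diff.mean()      # includes the diagonal zeros
mean_pairs = pair_diffs.mean()  # unique pairs only
max_idx = pair_diffs.argmax()

print(f"Mean over all entries:  {mean_all:.4f}")
print(f"Mean over unique pairs: {mean_pairs:.4f}")
print(
    f"Largest change: {features[rows[max_idx]]} / {features[cols[max_idx]]} "
    f"= {pair_diffs[max_idx]:.4f}"
)
```

On these numbers the pair-only mean is about 0.0433, and the largest change (0.0993, between `feature_1` and `feature_4`) sits just under the example threshold of 0.1, so the threshold check in the cell flags no pair as drifted.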