# Data Cleaning and Preparation for MLflow Demo

This notebook demonstrates the data cleaning and preparation process for the MLflow demo project. We'll perform the following steps:

1. Set up the environment and MLflow tracking with DagsHub
2. Load data from scikit-learn datasets
3. Explore and analyze the data
4. Preprocess the data (scaling, handling outliers, etc.)
5. Split the data into training and testing sets
6. Save raw and processed data
7. Create baseline statistics for drift detection
8. Log data preprocessing steps to MLflow

## Prerequisites

- Python 3.11+
- Required packages (scikit-learn, pandas, numpy, matplotlib, seaborn, joblib, mlflow, dagshub)
- DagsHub account and repository

# Data Cleaning and Preprocessing

This notebook demonstrates data cleaning and preprocessing for the MLflow demo project.
We'll work with the wine classification dataset from sklearn.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

## 1. Data Loading

In [None]:
# Load wine dataset
wine = load_wine()

# Create DataFrame
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df['target'] = wine.target
df['target_name'] = df['target'].map({i: name for i, name in enumerate(wine.target_names)})

print(f"Dataset shape: {df.shape}")
print(f"Features: {len(wine.feature_names)}")
print(f"Classes: {wine.target_names}")
print(f"Samples per class: {df['target'].value_counts().sort_index()}")

## 2. Exploratory Data Analysis

In [None]:
# Basic information about the dataset
print("Dataset Info:")
df.info()
print("\nBasic Statistics:")
df.describe()

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values per column:")
print(missing_values[missing_values > 0])
if missing_values.sum() == 0:
    print("No missing values found!")

In [None]:
# Target distribution
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
df['target'].value_counts().sort_index().plot(kind='bar')
plt.title('Class Distribution (Numeric)')
plt.xlabel('Class')
plt.ylabel('Count')

plt.subplot(1, 2, 2)
df['target_name'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.title('Class Distribution (Names)')
plt.ylabel('')

plt.tight_layout()
plt.show()

## 3. Feature Analysis

In [None]:
# Feature correlation matrix
plt.figure(figsize=(16, 12))
correlation_matrix = df.select_dtypes(include=[np.number]).corr()
mask = np.triu(np.ones_like(correlation_matrix))
sns.heatmap(correlation_matrix, mask=mask, annot=False, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
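
The `mask` argument hides every cell where the mask is true. A small NumPy sketch (made-up correlation values) of what `np.triu` produces: with the default offset the diagonal is hidden along with the upper triangle, while `k=1` would keep the diagonal visible.

```python
import numpy as np

corr = np.array([[1.0, 0.8, 0.1],
                 [0.8, 1.0, 0.3],
                 [0.1, 0.3, 1.0]])  # made-up correlation values

# Default offset: upper triangle AND diagonal are masked (hidden)
mask_with_diag = np.triu(np.ones_like(corr, dtype=bool))
# k=1: only the strictly upper triangle is masked, the diagonal stays visible
mask_keep_diag = np.triu(np.ones_like(corr, dtype=bool), k=1)

print(mask_with_diag.sum(), mask_keep_diag.sum())
```

Either mask removes the redundant mirrored half of the symmetric matrix; which one reads better is a matter of taste.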

In [None]:
# Feature distributions by class
numerical_features = wine.feature_names[:6]  # First 6 features for visualization

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

for i, feature in enumerate(numerical_features):
    for target in df['target'].unique():
        subset = df[df['target'] == target][feature]
        axes[i].hist(subset, alpha=0.7, label=f'Class {target}', bins=20)
    
    axes[i].set_title(f'{feature} Distribution by Class')
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Frequency')
    axes[i].legend()

plt.tight_layout()
plt.show()

## 4. Outlier Detection

In [None]:
# Box plots for outlier detection
numerical_features = wine.feature_names[:8]  # First 8 features

fig, axes = plt.subplots(2, 4, figsize=(20, 10))
axes = axes.ravel()

for i, feature in enumerate(numerical_features):
    axes[i].boxplot(df[feature])
    axes[i].set_title(f'{feature} - Outliers')
    axes[i].set_ylabel(feature)

plt.tight_layout()
plt.show()

In [None]:
# Identify outliers using IQR method
def identify_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

# Check outliers for each numerical feature
outlier_summary = {}
for feature in wine.feature_names:
    outliers, lower, upper = identify_outliers(df, feature)
    outlier_summary[feature] = {
        'count': len(outliers),
        'percentage': len(outliers) / len(df) * 100,
        'bounds': (lower, upper)
    }

# Display outlier summary
outlier_df = pd.DataFrame(outlier_summary).T
outlier_df['count'] = outlier_df['count'].astype(int)
outlier_df['percentage'] = outlier_df['percentage'].round(2)
print("Outlier Summary:")
print(outlier_df[['count', 'percentage']].sort_values('count', ascending=False))
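
As a quick sanity check of the 1.5 × IQR rule used in `identify_outliers`, the sketch below applies it to a small made-up list using only the standard library. The helper name `iqr_bounds` and the sample values are illustrative, not part of the project; `method="inclusive"` is chosen because it interpolates linearly between data points, which should match the pandas `quantile` default.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences for the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

sample = [1, 2, 3, 4, 100]  # made-up values; 100 is the obvious outlier
lower, upper = iqr_bounds(sample)
outliers = [v for v in sample if v < lower or v > upper]
print(lower, upper, outliers)
```

Here Q1 = 2 and Q3 = 4, so the fences are -1 and 7 and only the value 100 falls outside them.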

## 5. Data Cleaning

In [None]:
# Create a copy for cleaning
df_cleaned = df.copy()

print(f"Original dataset shape: {df.shape}")

# Remove outliers using IQR method (conservative approach)
features_to_clean = ['total_phenols', 'flavanoids', 'proanthocyanins']  # Features chosen for cleaning; compare with the outlier summary above

for feature in features_to_clean:
    Q1 = df_cleaned[feature].quantile(0.25)
    Q3 = df_cleaned[feature].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    initial_count = len(df_cleaned)
    df_cleaned = df_cleaned[(df_cleaned[feature] >= lower_bound) & (df_cleaned[feature] <= upper_bound)]
    removed_count = initial_count - len(df_cleaned)
    
    print(f"Removed {removed_count} outliers from {feature}")

print(f"Cleaned dataset shape: {df_cleaned.shape}")
print(f"Data retention rate: {len(df_cleaned)/len(df)*100:.2f}%")
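
The loop above filters one feature at a time, so the quartiles for later features are computed on rows that survived the earlier filters. A possible variant, sketched here on toy data, computes all fences on the unfiltered frame and applies a single combined mask. The helper `iqr_mask` and the toy frame are hypothetical.

```python
import pandas as pd

def iqr_mask(frame, columns, k=1.5):
    """Boolean mask of rows inside the k * IQR fences for every listed column."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparing a DataFrame with a Series aligns the Series on column names
    inside = (frame[columns] >= lower) & (frame[columns] <= upper)
    return inside.all(axis=1)

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 100],     # 100 lies outside the fences for column "a"
    "b": [10, 11, 12, 13, 14],  # no outliers
})
kept = toy[iqr_mask(toy, ["a", "b"])]
print(len(toy), len(kept))
```

With this approach the result does not depend on the order in which features are listed.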

## 6. Feature Engineering

In [None]:
# Create new features
df_engineered = df_cleaned.copy()

# Ratio features
# Note: load_wine has no 'total_acidity' column, so malic_acid serves as the acidity measure here
df_engineered['alcohol_to_acidity_ratio'] = df_engineered['alcohol'] / (df_engineered['malic_acid'] + 1e-6)
df_engineered['phenols_to_flavanoids_ratio'] = df_engineered['total_phenols'] / (df_engineered['flavanoids'] + 1e-6)

# Polynomial features (squares)
df_engineered['alcohol_squared'] = df_engineered['alcohol'] ** 2
df_engineered['flavanoids_squared'] = df_engineered['flavanoids'] ** 2

# Interaction features
df_engineered['alcohol_flavanoids_interaction'] = df_engineered['alcohol'] * df_engineered['flavanoids']
df_engineered['phenols_proanthocyanins_interaction'] = df_engineered['total_phenols'] * df_engineered['proanthocyanins']

# Binned features
df_engineered['alcohol_level'] = pd.cut(df_engineered['alcohol'], 
                                       bins=[0, 11, 12.5, 15], 
                                       labels=['Low', 'Medium', 'High'])

print(f"Original features: {len(wine.feature_names)}")
new_features = [col for col in df_engineered.columns if col not in df.columns]
print(f"New features created: {len(new_features)}")
print(f"New features: {new_features}")
print(f"Total features: {len(df_engineered.columns) - 2}")  # Excluding target columns
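
`pd.cut` builds right-closed intervals by default, so with `bins=[0, 11, 12.5, 15]` a value of exactly 11 falls in 'Low' and anything above 15 becomes NaN. A small illustration with made-up alcohol values:

```python
import pandas as pd

alcohol = pd.Series([10.5, 11.0, 11.01, 12.5, 14.8, 15.2])  # made-up values
levels = pd.cut(alcohol, bins=[0, 11, 12.5, 15], labels=["Low", "Medium", "High"])

# Intervals are (0, 11], (11, 12.5], (12.5, 15]; 15.2 is outside all of them
print(levels.tolist())
```

On the real data it may be worth running `df_engineered['alcohol_level'].value_counts()` to confirm that each bin actually receives samples.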

## 7. Feature Scaling

In [None]:
# Prepare features for scaling
feature_columns = [col for col in df_engineered.columns 
                  if col not in ['target', 'target_name', 'alcohol_level']]

X = df_engineered[feature_columns]
y = df_engineered['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=42, stratify=y)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"Class distribution in training set: {pd.Series(y_train).value_counts().sort_index()}")
print(f"Class distribution in test set: {pd.Series(y_test).value_counts().sort_index()}")

In [None]:
# Apply StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrames
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print("Scaling completed!")
print(f"Scaled training set mean: {X_train_scaled.mean().mean():.6f}")
print(f"Scaled training set std: {X_train_scaled.std().mean():.6f}")
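
The scaled training std printed above will typically come out slightly above 1 rather than exactly 1: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `std()` defaults to the sample version (`ddof=1`). The NumPy sketch below reproduces the arithmetic on made-up numbers and also shows the training statistics being reused for test rows.

```python
import numpy as np

train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up training values
test = np.array([3.0, 6.0])                  # made-up test values

mean = train.mean()
std = train.std()  # ddof=0, the convention StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse training statistics, never refit on test

print(train_scaled.std(ddof=0))  # population std of the scaled data
print(train_scaled.std(ddof=1))  # sample std, which is what pandas .std() reports
print(test_scaled)
```

The gap between the two std values shrinks as the number of training rows grows, since the ratio is sqrt(n / (n - 1)).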

## 8. Visualization of Scaled Features

In [None]:
# Compare original vs scaled features
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
features_to_show = X_train.columns[:6]

for i, feature in enumerate(features_to_show):
    row = i // 3
    col = i % 3
    
    # Original feature
    axes[row, col].hist(X_train[feature], alpha=0.7, label='Original', bins=20)
    # Scaled feature
    axes[row, col].hist(X_train_scaled[feature], alpha=0.7, label='Scaled', bins=20)
    
    axes[row, col].set_title(f'{feature}')
    axes[row, col].legend()
    axes[row, col].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 9. Save Processed Data

In [None]:
# Create data directories
import os
os.makedirs('../data/raw', exist_ok=True)
os.makedirs('../data/processed', exist_ok=True)

# Save raw data
df.to_csv('../data/raw/wine_dataset.csv', index=False)
print("Raw data saved to ../data/raw/wine_dataset.csv")

# Save processed training data
train_processed = X_train_scaled.copy()
train_processed['target'] = y_train
train_processed.to_csv('../data/processed/train_processed.csv', index=False)
print("Processed training data saved to ../data/processed/train_processed.csv")

# Save processed test data
test_processed = X_test_scaled.copy()
test_processed['target'] = y_test
test_processed.to_csv('../data/processed/test_processed.csv', index=False)
print("Processed test data saved to ../data/processed/test_processed.csv")

# Save basic train/test splits (unscaled, for drift detection)
train_basic = X_train.copy()
train_basic['target'] = y_train
train_basic.to_csv('../data/processed/wine_train.csv', index=False)

test_basic = X_test.copy()
test_basic['target'] = y_test
test_basic.to_csv('../data/processed/wine_test.csv', index=False)

print("\nData preprocessing completed successfully!")
print(f"Final dataset shape: {train_processed.shape[0] + test_processed.shape[0]} samples, {len(feature_columns)} features")
print(f"Training samples: {train_processed.shape[0]}")
print(f"Test samples: {test_processed.shape[0]}")
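
Assigning `y_train` as a column in the save cell works because the scaled frames were rebuilt with `index=X_train.index`; pandas aligns a Series to the frame's index on assignment rather than by position. A toy illustration (made-up values) of what would happen if the index had been lost along the way:

```python
import pandas as pd

# Shuffled index labels, as they look after a train/test split
y = pd.Series([0, 1, 2], index=[7, 3, 5], name="target")

aligned = pd.DataFrame({"x": [0.1, 0.2, 0.3]}, index=[7, 3, 5])
aligned["target"] = y  # index labels match, so targets land on the right rows

reset = pd.DataFrame({"x": [0.1, 0.2, 0.3]})  # default RangeIndex 0..2
reset["target"] = y    # no shared labels, so every target becomes NaN

print(aligned["target"].tolist(), reset["target"].isna().all())
```

If the index is ever dropped, assigning `y_train.to_numpy()` instead places the values by position.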

## Summary

In this notebook, we:

1. **Loaded** the wine dataset from sklearn
2. **Explored** the data structure and distributions
3. **Identified** and handled outliers
4. **Engineered** new features (ratios, polynomials, interactions)
5. **Split** data into training and test sets
6. **Scaled** features using StandardScaler (fit on the training set only)
7. **Saved** processed data for use in subsequent steps

The data is now ready for model training and drift detection analysis.

**Next steps:**
- Proceed to `02_drift.ipynb` for drift detection analysis
- Use `03_model_training.ipynb` for model training experiments
- Run the complete pipeline with `python main.py`

## 1. Environment Setup

Let's start by importing the necessary libraries and setting up MLflow tracking with DagsHub.

In [None]:
# Import necessary libraries
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer
import joblib

# For visualizations
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_theme(style="whitegrid")

# Add parent directory to path to import project modules
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

# Create the data directories if they do not already exist
data_dir = os.path.join('..', 'data')
raw_dir = os.path.join(data_dir, 'raw')
processed_dir = os.path.join(data_dir, 'processed')
drift_baseline_dir = os.path.join(data_dir, 'drift_baseline')

os.makedirs(raw_dir, exist_ok=True)
os.makedirs(processed_dir, exist_ok=True)
os.makedirs(drift_baseline_dir, exist_ok=True)

print("Environment setup completed.")

In [None]:
# Setup MLflow tracking with DagsHub
try:
    import mlflow
    import dagshub
    
    # Initialize DagsHub with MLflow tracking
    dagshub.init(
        repo_owner='yahiaehab10', 
        repo_name='MLflow_demo_MF', 
        mlflow=True
    )
    
    # Set experiment name
    mlflow.set_experiment("data_preparation")
    
    print("MLflow tracking with DagsHub initialized.")
except Exception as e:
    print(f"Warning: Could not initialize DagsHub MLflow tracking: {e}")
    print("Continuing with default MLflow tracking.")
    import mlflow
    mlflow.set_experiment("data_preparation")

## 2. Data Loading

Let's load a dataset from scikit-learn. For this demo, we'll use the diabetes dataset, which is a regression problem.

In [None]:
# Load the diabetes dataset from sklearn
diabetes = datasets.load_diabetes()

# Create dataframes for features and target
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = pd.Series(diabetes.target)

# Save dataset information for reference
dataset_info = {
    'name': 'diabetes',
    'description': diabetes.DESCR,
    'feature_names': diabetes.feature_names,
    'target_names': None  # Diabetes dataset doesn't have target names
}

# Let's look at the data
print(f"Dataset: {dataset_info['name']}")
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print("\nFeature names:")
for feature in diabetes.feature_names:
    print(f"- {feature}")

# Save raw data
X.to_csv(os.path.join(raw_dir, "diabetes_features.csv"), index=False)
y.to_csv(os.path.join(raw_dir, "diabetes_target.csv"), index=False)
joblib.dump(dataset_info, os.path.join(raw_dir, "diabetes_info.joblib"))

# Log to MLflow
with mlflow.start_run(run_name="data_loading"):
    mlflow.log_param("dataset", "diabetes")
    mlflow.log_param("data_shape", X.shape)
    mlflow.log_param("feature_count", X.shape[1])
    mlflow.log_param("sample_count", X.shape[0])
    
    # Log dataset description as a text artifact
    with open(os.path.join(raw_dir, "dataset_description.txt"), "w") as f:
        f.write(diabetes.DESCR)
    mlflow.log_artifact(os.path.join(raw_dir, "dataset_description.txt"))
    
    print("Data loaded and logged to MLflow.")

## 3. Data Exploration

Let's explore the dataset to understand its characteristics, check for missing values, and analyze distributions.

In [None]:
# Display basic statistics for features
print("Feature Summary Statistics:")
print(X.describe())

# Check for missing values
print("\nMissing Values:")
print(X.isnull().sum())

# Display histogram of target variable
plt.figure(figsize=(10, 6))
plt.hist(y, bins=30, color='blue', alpha=0.7)
plt.title('Distribution of Target Variable (Disease Progression)')
plt.xlabel('Disease Progression')
plt.ylabel('Frequency')
plt.savefig(os.path.join(raw_dir, "target_distribution.png"))
plt.show()

# Distribution of features
plt.figure(figsize=(15, 10))
for i, col in enumerate(X.columns):
    plt.subplot(3, 4, i+1)
    sns.histplot(X[col], kde=True)
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.savefig(os.path.join(raw_dir, "feature_distributions.png"))
plt.show()

# Correlation matrix
plt.figure(figsize=(12, 10))
correlation_matrix = X.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Feature Correlation Matrix')
plt.savefig(os.path.join(raw_dir, "correlation_matrix.png"))
plt.show()

# Check for outliers using box plots
plt.figure(figsize=(15, 10))
for i, col in enumerate(X.columns):
    plt.subplot(3, 4, i+1)
    sns.boxplot(y=X[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.savefig(os.path.join(raw_dir, "feature_boxplots.png"))
plt.show()

# Log exploratory analysis to MLflow
with mlflow.start_run(run_name="data_exploration"):
    # Log statistics as parameters
    for col in X.columns:
        mlflow.log_param(f"mean_{col}", X[col].mean())
        mlflow.log_param(f"std_{col}", X[col].std())
    
    # Log correlation with target as metrics
    for col in X.columns:
        correlation = np.corrcoef(X[col], y)[0, 1]
        mlflow.log_metric(f"target_corr_{col}", correlation)
    
    # Log visualization artifacts
    mlflow.log_artifact(os.path.join(raw_dir, "target_distribution.png"))
    mlflow.log_artifact(os.path.join(raw_dir, "feature_distributions.png"))
    mlflow.log_artifact(os.path.join(raw_dir, "correlation_matrix.png"))
    mlflow.log_artifact(os.path.join(raw_dir, "feature_boxplots.png"))
    
    # Save and log correlation matrix as CSV
    correlation_df = pd.DataFrame(correlation_matrix)
    correlation_df.to_csv(os.path.join(raw_dir, "correlation_matrix.csv"))
    mlflow.log_artifact(os.path.join(raw_dir, "correlation_matrix.csv"))
    
    print("Data exploration completed and logged to MLflow.")

## 4. Data Preprocessing

Now let's preprocess the data:
1. Handle any outliers
2. Scale the features
3. Create a preprocessing pipeline

In [None]:
# Create a preprocessing pipeline
from sklearn.pipeline import Pipeline

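# Function to detect outliers using the standard deviation (n-sigma) rule
def detect_outliers(df, n_std=3):
    """
    Flag values more than n_std standard deviations from the column mean
    and replace them with NaN so they can be imputed later.
    """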
    data_clean = df.copy()
    outliers_dict = {}
    
    for col in df.columns:
        # Calculate mean and standard deviation
        mean = df[col].mean()
        std = df[col].std()
        
        # Find outliers
        outliers = df[(df[col] < mean - n_std * std) | (df[col] > mean + n_std * std)][col]
        outliers_dict[col] = len(outliers)
        
        # Replace outliers with NaN (to be imputed later)
        data_clean.loc[(data_clean[col] < mean - n_std * std) | (data_clean[col] > mean + n_std * std), col] = np.nan
    
    return data_clean, outliers_dict

# Detect and handle outliers
X_clean, outliers_dict = detect_outliers(X, n_std=3)
print("Outliers detected in each feature:")
for col, count in outliers_dict.items():
    print(f"{col}: {count} outliers")

# Create preprocessing pipeline
scaler_type = 'standard'  # Can be 'standard' or 'minmax'

steps = []
# Add imputer to handle missing values and potential NaNs from outlier removal
steps.append(('imputer', SimpleImputer(strategy='median')))

# Add scaler
if scaler_type == 'standard':
    steps.append(('scaler', StandardScaler()))
elif scaler_type == 'minmax':
    steps.append(('scaler', MinMaxScaler()))
else:
    raise ValueError(f"Scaler type {scaler_type} not supported.")

preprocessing_pipeline = Pipeline(steps)

# Fit the pipeline on the data and transform it
X_processed = pd.DataFrame(
    preprocessing_pipeline.fit_transform(X_clean),
    columns=X_clean.columns
)

# Show the processed data
print("\nProcessed Data (first 5 rows):")
print(X_processed.head())

# Save the preprocessing pipeline
preprocessing_pipeline_path = os.path.join(processed_dir, "preprocessing_pipeline.joblib")
joblib.dump(preprocessing_pipeline, preprocessing_pipeline_path)

# Log preprocessing to MLflow
with mlflow.start_run(run_name="data_preprocessing"):
    # Log preprocessing parameters
    mlflow.log_param("scaler_type", scaler_type)
    mlflow.log_param("imputer_strategy", "median")
    mlflow.log_param("outlier_handling", "replace_with_median")
    
    # Log outlier counts as metrics
    for col, count in outliers_dict.items():
        mlflow.log_metric(f"outliers_{col}", count)
    
    # Log processed data statistics
    for col in X_processed.columns:
        mlflow.log_metric(f"processed_mean_{col}", X_processed[col].mean())
        mlflow.log_metric(f"processed_std_{col}", X_processed[col].std())
    
    # Log preprocessing pipeline as artifact
    mlflow.log_artifact(preprocessing_pipeline_path)
    
    # Compare original vs processed data with a visualization
    plt.figure(figsize=(15, 10))
    for i, col in enumerate(X.columns):
        plt.subplot(3, 4, i+1)
        plt.hist(X[col], bins=20, alpha=0.5, label='Original')
        plt.hist(X_processed[col], bins=20, alpha=0.5, label='Processed')
        plt.title(f'{col}')
        plt.legend()
    plt.tight_layout()
    
    # Save and log the comparison
    comparison_path = os.path.join(processed_dir, "original_vs_processed.png")
    plt.savefig(comparison_path)
    mlflow.log_artifact(comparison_path)
    plt.close()
    
    print("Preprocessing completed and logged to MLflow.")
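
`detect_outliers` flags values more than `n_std` standard deviations from the column mean and sets them to NaN, which the `SimpleImputer(strategy='median')` step then fills. The standard-library sketch below walks through the same two steps on a made-up list. `flag_and_impute` is a hypothetical helper, not part of the project; it takes the median over the non-flagged values, mirroring how the imputer ignores NaN entries.

```python
from statistics import mean, median, stdev

def flag_and_impute(values, n_std=3.0):
    """Replace values beyond n_std sample standard deviations with the median of the rest."""
    mu, sigma = mean(values), stdev(values)  # stdev uses ddof=1, like pandas .std()
    is_outlier = [abs(v - mu) > n_std * sigma for v in values]
    fill = median(v for v, bad in zip(values, is_outlier) if not bad)
    cleaned = [fill if bad else v for v, bad in zip(values, is_outlier)]
    return cleaned, sum(is_outlier)

values = [10.0] * 10 + [12.0] * 10 + [100.0]  # made-up data with one extreme value
cleaned, n_flagged = flag_and_impute(values)
print(n_flagged, cleaned[-1], max(cleaned))
```

Note that the extreme value itself inflates the mean and standard deviation used for the threshold, so with very few rows a single outlier can escape a 3-sigma rule.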

## 5. Data Splitting

Now we'll split the data into training and testing sets.

In [None]:
# Split the data into training and testing sets
test_size = 0.2
random_state = 42

X_train, X_test, y_train, y_test = train_test_split(
    X_processed, y, test_size=test_size, random_state=random_state
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Testing set: {X_test.shape[0]} samples")

# Save the split data
X_train.to_csv(os.path.join(processed_dir, "diabetes_X_train.csv"), index=False)
X_test.to_csv(os.path.join(processed_dir, "diabetes_X_test.csv"), index=False)
y_train.to_csv(os.path.join(processed_dir, "diabetes_y_train.csv"), index=False)
y_test.to_csv(os.path.join(processed_dir, "diabetes_y_test.csv"), index=False)

# Log data splitting to MLflow
with mlflow.start_run(run_name="data_splitting"):
    mlflow.log_param("test_size", test_size)
    mlflow.log_param("random_state", random_state)
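    # Sample counts use their own keys: MLflow raises an error if "test_size"
    # is logged again with a different value in the same run
    mlflow.log_param("n_train", X_train.shape[0])
    mlflow.log_param("n_test", X_test.shape[0])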
    
    # Log train/test split sizes
    mlflow.log_metric("train_samples", X_train.shape[0])
    mlflow.log_metric("test_samples", X_test.shape[0])
    
    # Log training and testing data paths
    split_info = {
        "X_train_path": os.path.join(processed_dir, "diabetes_X_train.csv"),
        "X_test_path": os.path.join(processed_dir, "diabetes_X_test.csv"),
        "y_train_path": os.path.join(processed_dir, "diabetes_y_train.csv"),
        "y_test_path": os.path.join(processed_dir, "diabetes_y_test.csv")
    }
    
    # Save split info as JSON
    import json
    with open(os.path.join(processed_dir, "split_info.json"), "w") as f:
        json.dump(split_info, f)
    
    mlflow.log_artifact(os.path.join(processed_dir, "split_info.json"))
    
    print("Data splitting completed and logged to MLflow.")
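
The cells above fit the preprocessing pipeline on every row and split afterwards, so the imputer medians and scaler statistics also reflect the test rows. A common variant, sketched here on the assumption that this leakage matters for the demo, splits first and fits on the training rows only. The `_raw` and `_prepared` variable names are illustrative, and outlier handling is left out for brevity.

```python
from sklearn.datasets import load_diabetes
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Split first so that no test-set statistics reach the fitted pipeline
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

X_train_prepared = pipeline.fit_transform(X_train_raw)  # learn medians, means, stds
X_test_prepared = pipeline.transform(X_test_raw)        # reuse them unchanged

print(X_train_prepared.shape, X_test_prepared.shape)
```

The fitted `pipeline` object can be saved with `joblib.dump` in the same way as before, and it then encodes training-set statistics only.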

## 6. Create Baseline for Drift Detection

We'll create a baseline of the current data statistics that can be used later to detect data drift.

In [None]:
# Create baseline statistics for drift detection
baseline_stats = {
    'mean': X_train.mean().to_dict(),
    'std': X_train.std().to_dict(),
    'min': X_train.min().to_dict(),
    'max': X_train.max().to_dict(),
    'median': X_train.median().to_dict(),
    'shape': X_train.shape
}

# Save baseline statistics
baseline_stats_path = os.path.join(drift_baseline_dir, "diabetes_baseline_stats.joblib")
joblib.dump(baseline_stats, baseline_stats_path)

# Also save a sample of the baseline data
sample_size = min(1000, len(X_train))
X_train_sample = X_train.sample(sample_size, random_state=42)
X_train_sample.to_csv(os.path.join(drift_baseline_dir, "diabetes_baseline_sample.csv"), index=False)

# Print baseline statistics
print("Baseline Statistics:")
for stat_name, stat_values in baseline_stats.items():
    if stat_name != 'shape':
        print(f"\n{stat_name.capitalize()}:")
        for feature, value in stat_values.items():
            print(f"  {feature}: {value:.4f}")
    else:
        print(f"\nShape: {stat_values}")

# Log drift baseline to MLflow
with mlflow.start_run(run_name="drift_baseline_creation"):
    # Log baseline creation parameters
    mlflow.log_param("baseline_source", "training_data")
    mlflow.log_param("baseline_sample_size", sample_size)
    
    # Log baseline stats as metrics
    for stat_name, stat_values in baseline_stats.items():
        if stat_name != 'shape':
            for feature, value in stat_values.items():
                mlflow.log_metric(f"baseline_{stat_name}_{feature}", value)
    
    # Log baseline files as artifacts
    mlflow.log_artifact(baseline_stats_path)
    mlflow.log_artifact(os.path.join(drift_baseline_dir, "diabetes_baseline_sample.csv"))
    
    # Create and log distributions of baseline data
    plt.figure(figsize=(15, 10))
    for i, col in enumerate(X_train.columns):
        plt.subplot(3, 4, i+1)
        sns.histplot(X_train[col], kde=True)
        plt.title(f'Baseline Distribution: {col}')
    plt.tight_layout()
    
    baseline_dist_path = os.path.join(drift_baseline_dir, "baseline_distributions.png")
    plt.savefig(baseline_dist_path)
    mlflow.log_artifact(baseline_dist_path)
    plt.close()
    
    print("Drift baseline created and logged to MLflow.")
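
One simple way such a baseline might be consumed later (an assumption; the drift notebook may use a different test) is to express the shift of a new batch's mean in units of the baseline standard deviation. The function `mean_shift_report`, the 0.5 threshold, and all numbers below are hypothetical.

```python
from statistics import mean

def mean_shift_report(baseline, batch, threshold=0.5):
    """Flag features whose batch mean moved more than `threshold` baseline stds."""
    report = {}
    for feature, values in batch.items():
        shift = (mean(values) - baseline["mean"][feature]) / baseline["std"][feature]
        report[feature] = {"shift_in_std": shift, "drifted": abs(shift) > threshold}
    return report

# Made-up baseline in the same layout as `baseline_stats` (only mean/std shown)
baseline = {"mean": {"bmi": 0.0, "bp": 0.0}, "std": {"bmi": 1.0, "bp": 2.0}}
batch = {"bmi": [0.9, 1.1, 1.0], "bp": [0.2, -0.2, 0.3]}

report = mean_shift_report(baseline, batch)
print({name: entry["drifted"] for name, entry in report.items()})
```

In this toy batch the `bmi` mean sits one baseline standard deviation above the baseline mean and is flagged, while `bp` moves by only 0.05 standard deviations and is not.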

## 7. Summary

In this notebook, we've:

1. Set up the environment and MLflow tracking with DagsHub
2. Loaded the diabetes dataset from scikit-learn
3. Explored and analyzed the data
4. Preprocessed the data (handled outliers and scaled features)
5. Split the data into training and testing sets
6. Created a baseline for drift detection
7. Logged all steps and artifacts to MLflow

All the processed data and artifacts are now available in the respective directories and can be used for model training and evaluation in the next notebooks.

In [None]:
# Print a summary of the work done
print("Data Cleaning and Preparation Summary:")
print("  - Dataset: diabetes")
print(f"  - Original data shape: {X.shape}")
print(f"  - Processed data shape: {X_processed.shape}")
print(f"  - Training set size: {X_train.shape[0]} samples")
print(f"  - Testing set size: {X_test.shape[0]} samples")
print("\nFiles created:")
print(f"  - Raw data: {os.path.join(raw_dir, 'diabetes_features.csv')}")
print(f"  - Processed training data: {os.path.join(processed_dir, 'diabetes_X_train.csv')}")
print(f"  - Processed testing data: {os.path.join(processed_dir, 'diabetes_X_test.csv')}")
print(f"  - Preprocessing pipeline: {os.path.join(processed_dir, 'preprocessing_pipeline.joblib')}")
print(f"  - Drift baseline: {os.path.join(drift_baseline_dir, 'diabetes_baseline_stats.joblib')}")
print("\nMLflow runs created (experiment: data_preparation):")
print("  - data_loading")
print("  - data_exploration")
print("  - data_preprocessing")
print("  - data_splitting")
print("  - drift_baseline_creation")
print("\nNext steps:")
print("  - Proceed to drift detection notebook (02_drift.ipynb)")
print("  - Proceed to model training notebook (03_model_training.ipynb)")