# Model Comparison Analysis: fit() vs load_from_file()

This notebook investigates the discrepancy between training a Linear Regression model using `ModelBenchmark.fit()` versus loading a pre-trained model with `load_from_file()`.

## Expected vs Actual Results
- **Expected**: Identical predictions (theory)
- **Actual**: Small numerical differences observed
- **Hypothesis**: Differences in data preprocessing, random seeds, or model serialization

In [5]:
# Import necessary libraries
import pandas as pd
import numpy as np
import joblib
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import warnings
from dataclasses import dataclass

warnings.filterwarnings('ignore')

In [6]:
@dataclass(init=True, repr=True, eq=False, order=False, unsafe_hash=False, frozen=False)
class ModelBenchmark:
    """
    A comprehensive benchmarking class for evaluating different machine learning models
    on enzyme kinetic datasets with various data splitting strategies.
    
    This class encapsulates the entire machine learning pipeline including data preprocessing,
    train/validation/test splitting, model initialization, training, and prediction.
    Supports multiple traditional ML algorithms and data splitting methods for robust evaluation.
    
    Attributes:
        data (pd.DataFrame | str): Input dataset as DataFrame or path to joblib file
        model_type (str): Type of model to use ('CAT', 'GBM', 'LR', 'RF', 'SVR', 'XGB')
        split_method (str): Data splitting strategy ('random', 'cold_mols', 'cold_proteins', 'load_from_file')
    
    Methods:
        __post_init: Post-initialization to load and validate input data
        rename_data_columns: Rename DataFrame columns using a mapping dictionary
        data_preprocessing: Preprocess features and labels for model training
        split_data: Split dataset into train/validation/test sets
        model_init: Initialize the specified model type
        fit: Train the model on training data
        predict: Generate predictions using the trained model
    """
    data: pd.DataFrame | str
    model_type: str
    split_method: str


    '''
    def __init__(self, data, model_type, split_method, *args, **kwargs):
        """Initialize the BenchMarkModel with data, model type, and split method."""
        if not isinstance(data, pd.DataFrame):
            raise ValueError("Data must be a pandas DataFrame.")
        self.data = data
        self.model_type = model_type
        self.split_method = split_method
        self.args = args
        self.kwargs = kwargs'''

    def __post_init__(self):
        """
        Post-initialization method to load and validate input data.
        
        Handles both DataFrame objects and file paths to joblib files.
        Ensures data integrity and proper format for downstream processing.
        
        Raises:
            ValueError: If data type is unsupported or file loading fails
        """
        # Debug information for data type checking
        print(f"Initial data type: {type(self.data)}")
        
        if isinstance(self.data, pd.DataFrame):
            print("Data is already a DataFrame")
            # Create a copy to avoid modifying original data
            self.data = self.data.copy()
        elif isinstance(self.data, str):
            print(f"Loading data from: {self.data}")
            try:
                # Load data from joblib file
                self.data = joblib.load(self.data)
                print(f"Data loaded successfully, type: {type(self.data)}")
            except Exception as e:
                print(f"Error loading data: {e}")
                raise ValueError(f"Could not load data from {self.data}: {e}")
        else:
            raise ValueError("Data must be a pandas DataFrame or a valid file path.")
        
        
        if self.split_method=="random_split":
            self.split_method = "random"
        elif self.split_method == "cold_molecules":
            self.split_method = "cold_mols"
        elif self.split_method not in ["random", "cold_mols", "cold_proteins", "load_index_from_file"]:
            raise ValueError(f"Unknown split method: {self.split_method}")
        
    
    def rename_data_columns(self, rename_dict):
        """
        Rename DataFrame columns using the provided mapping dictionary.
        
        Args:
            rename_dict (dict): Dictionary mapping old column names to new ones
            
        Returns:
            ModelBenchmark: Returns self for method chaining
        """
        self.data = self.data.rename(columns=rename_dict)
        return self

    def data_preprocessing(self, *args, **kwargs):
        """
        Preprocess the dataset by preparing features and labels for model training.
        
        This method handles column renaming, feature concatenation, and label extraction.
        Currently supports traditional ML models that require concatenated features.
        
        Args:
            *args: Variable length argument list
            **kwargs: Arbitrary keyword arguments including:
                rename_dict (dict, optional): Column renaming mapping
                
        Returns:
            ModelBenchmark: Returns self for method chaining
            
        Raises:
            ValueError: If required feature columns are missing
        """
        # Apply column renaming if specified
        if kwargs.get("rename_dict"):
            self.rename_data_columns(kwargs["rename_dict"])

        # Validate required columns exist
        if ("metabolite_features" not in self.data.columns) or ("protein_features" not in self.data.columns):
            raise ValueError("Data is missing required feature columns or wrong column name.")
        
        # Process data for traditional ML models
        if self.model_type in ["CAT", "GBM", "LR", "RF", "SVR", "XGB"]:
            # Concatenate metabolite and protein features into a single feature vector
            # This creates a flat feature representation suitable for traditional ML algorithms
            self.X = np.array([np.concatenate([m, p]) for m, p in zip(self.data["metabolite_features"], self.data["protein_features"])])
            self.y = self.data["label"]
        return self
    
    def split_data(self, *args, **kwargs):
        """
        Split the dataset into training, validation, and test sets based on the specified method.
        
        Supports multiple splitting strategies:
        - Random split: Standard random partitioning (70% train, 15% val, 15% test)
        - Cold molecules: Split by unique molecules (to be implemented)
        - Cold proteins: Split by unique proteins (to be implemented)
        - Load from file: Use pre-computed split indices
        
        Args:
            *args: Variable length argument list
            **kwargs: Arbitrary keyword arguments including:
                save_index_path (str, optional): Path to save split indices
                index_file_path (str, required for load_index_from_file): Path to load pre-computed indices

        Returns:
            ModelBenchmark: Returns self for method chaining
            
        Raises:
            ValueError: If split method is unknown or indices don't match data length
        """
        # Generate train/validation/test indices based on split method
        if self.split_method == "random" or self.split_method == "random_split":
            # Standard random split: 70% train, 15% validation, 15% test
            self.train_index, self.temp_index = train_test_split(np.arange(len(self.data)),test_size=0.3, shuffle=True, random_state=42)
            self.val_index, self.test_index = train_test_split(self.temp_index, test_size=0.5, shuffle=True, random_state=42)

        elif self.split_method == "cold_mols":
            # TODO: Implement cold molecules split logic
            # This would split by unique molecules to test generalization to new compounds
            pass
        elif self.split_method == "cold_proteins":
            # TODO: Implement cold proteins split logic  
            # This would split by unique proteins to test generalization to new enzymes
            pass
        elif self.split_method == "load_index_from_file":
            # Load pre-computed split indices from JSON file
            if "index_file_path" not in kwargs:
                raise ValueError("index_file_path must be provided for load_index_from_file split method.")
            with open(kwargs["index_file_path"], "r") as f:
                indices = json.load(f)
                self.train_index = np.array(indices["train_index"])
                self.val_index = np.array(indices["val_index"])
                self.test_index = np.array(indices["test_index"])
                
            # Validate that indices cover all data points exactly once
            if len(self.train_index)+ len(self.val_index) + len(self.test_index) != len(self.data):
                raise ValueError("Indices from file do not match the length of the data.")   
            
        else:
            raise ValueError(f"Unknown split method: {self.split_method}")
        
        # Create feature and label subsets for each split
        self.train_X = self.X[self.train_index]
        self.train_y = self.y[self.train_index]
        self.val_X = self.X[self.val_index]
        self.val_y = self.y[self.val_index]
        self.test_X = self.X[self.test_index]
        self.test_y = self.y[self.test_index]

        # Save split indices to file if specified
        if kwargs.get("save_index_path"):
            with open(kwargs["save_index_path"], "w") as f:
                json.dump({
                    "train_index": self.train_index.tolist(),
                    "val_index": self.val_index.tolist(),
                    "test_index": self.test_index.tolist()
                }, f)
        return self
    
    def model_init(self, *args, **kwargs):
        """
        Initialize the specified machine learning model with given parameters.
        
        Supports various traditional ML algorithms including tree-based methods,
        linear models, and support vector machines.
        
        Args:
            *args: Variable length argument list passed to model constructor
            **kwargs: Arbitrary keyword arguments passed to model constructor
            
        Returns:
            ModelBenchmark: Returns self for method chaining
            
        Raises:
            ImportError: If required model library is not installed
        """
        print(f"Training {self.model_type}(traditional) model...")
        
        # Initialize model based on specified type
        if self.model_type == "CAT":
            # CatBoost Gradient Boosting
            import catboost as cat
            self.model = cat.CatBoostRegressor( *args, **kwargs)
        elif self.model_type == "GBM":
            # Scikit-learn Gradient Boosting
            from sklearn.ensemble import GradientBoostingRegressor
            self.model = GradientBoostingRegressor( *args, **kwargs)
        elif self.model_type == "LR":
            # Linear Regression
            from sklearn.linear_model import LinearRegression
            self.model = LinearRegression( *args, **kwargs)
        elif self.model_type == "RF":
            # Random Forest with fixed random state for reproducibility
            from sklearn.ensemble import RandomForestRegressor
            self.model = RandomForestRegressor(random_state=42)
        elif self.model_type == "SVR":
            # Support Vector Regression
            from sklearn.svm import SVR
            self.model = SVR( *args, **kwargs)
        elif self.model_type == "XGB":
            # XGBoost Gradient Boosting
            import xgboost as xg
            self.model = xg.XGBRegressor( *args, **kwargs)
        return self

    def fit(self, *args, **kwargs):
        """
        Train the initialized model on the training dataset.
        
        Uses the preprocessed training features (train_X) and labels (train_y)
        to fit the model parameters.
        
        Args:
            *args: Variable length argument list passed to model.fit()
            **kwargs: Arbitrary keyword arguments passed to model.fit()
            
        Returns:
            ModelBenchmark: Returns self for method chaining
        """
        if self.model_type in ["CAT", "GBM", "LR", "RF", "SVR", "XGB"]:
            # Train the model using standard scikit-learn API
            self.model.fit(self.train_X, self.train_y, *args, **kwargs)
            return self
        else:
            """
            TODO: Implement training logic for other model types
            """
            pass
    
    def load_model_from_file(self, model_path):
        """
        Load a pre-trained model from file.
        
        Args:
            model_path (str): Path to the saved model file
            
        Returns:
            ModelBenchmark: Returns self for method chaining
        """
        if self.model_type in ["CAT", "GBM", "LR", "RF", "SVR", "XGB"]:
            self.model = joblib.load(model_path)
            print(f"Model loaded from: {model_path}")
            return self
        else:
            """
            TODO: Implement model loading logic for other model types
            """
            pass
    
    def save_model_to_file(self, model_path):
        """
        Save the trained model to file.
        
        Args:
            model_path (str): Path where to save the model
        """
        if hasattr(self, 'model'):
            joblib.dump(self.model, model_path)
            print(f"Model saved to: {model_path}")
        else:
            raise ValueError("No model to save. Please train a model first.")
    
    def compare_predictions(self, other_predictions, test_data=None):
        """
        Compare predictions with another set of predictions.
        
        Args:
            other_predictions (numpy.ndarray): Other predictions to compare with
            test_data (numpy.ndarray, optional): Test data to use for prediction
            
        Returns:
            dict: Dictionary containing comparison metrics
        """
        if test_data is None:
            test_data = self.test_X
            
        current_predictions = self.predict(test_data)
        
        # Calculate differences
        abs_diff = np.abs(current_predictions - other_predictions)
        
        comparison_metrics = {
            'max_absolute_difference': np.max(abs_diff),
            'mean_absolute_difference': np.mean(abs_diff),
            'std_absolute_difference': np.std(abs_diff),
            'median_absolute_difference': np.median(abs_diff),
            'rmse_difference': np.sqrt(np.mean(abs_diff**2)),
            'are_identical': np.allclose(current_predictions, other_predictions, rtol=1e-10, atol=1e-10)
        }
        
        return comparison_metrics

    def predict(self, X):
        """
        Generate predictions using the trained model.
        
        Args:
            X (numpy.ndarray): Input features for prediction
            
        Returns:
            numpy.ndarray: Model predictions
        """
        if self.model_type in ["CAT", "GBM", "LR", "RF", "SVR", "XGB"]:
            return self.model.predict(X)

## Method 1: Using ModelBenchmark.fit()

In [8]:
# Method 1: Using ModelBenchmark class with fit()
print("=== Method 1: ModelBenchmark.fit() ===")
benchmark_model = ModelBenchmark(
    data="./../../A01_dataset/kcat_with_features.joblib",
    model_type="LR",
    split_method="random"
)

# Preprocess and train
benchmark_model.data_preprocessing(rename_dict={"log10kcat_max": "label"})
benchmark_model.split_data()
benchmark_model.model_init()
benchmark_model.fit()

# Generate predictions
predictions_method1 = benchmark_model.predict(benchmark_model.test_X)
print(f"Method 1 predictions shape: {predictions_method1.shape}")
print(f"Method 1 first 10 predictions: {predictions_method1[:10]}")

=== Method 1: ModelBenchmark.fit() ===
Initial data type: <class 'str'>
Loading data from: ./../../A01_dataset/kcat_with_features.joblib
Data loaded successfully, type: <class 'pandas.core.frame.DataFrame'>
Training LR(traditional) model...
Data loaded successfully, type: <class 'pandas.core.frame.DataFrame'>
Training LR(traditional) model...
Method 1 predictions shape: (3473,)
Method 1 first 10 predictions: [ 0.8442364   0.9761486   0.46442795  1.6891556   0.239748    1.2012768
  1.1019135   0.8296261   0.6127968  -1.7968616 ]
Method 1 predictions shape: (3473,)
Method 1 first 10 predictions: [ 0.8442364   0.9761486   0.46442795  1.6891556   0.239748    1.2012768
  1.1019135   0.8296261   0.6127968  -1.7968616 ]


## Method 2: Manual Implementation (Original ER_RegressionModel_LR approach)

In [9]:
# Method 2: Manual implementation matching ER_RegressionModel_LR.ipynb
print("\n=== Method 2: Manual Implementation ===")

# Load and preprocess data exactly as in ER_RegressionModel_LR.ipynb
data_kcat = joblib.load('./../../A01_dataset/kcat_with_features.joblib')
data_kcat.rename(columns={'log10kcat_max':'label'}, inplace=True)

# Split data with identical parameters
train_df, temp_df = train_test_split(data_kcat, test_size=0.3, shuffle=True, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, shuffle=True, random_state=42)

# Prepare features exactly as in original
train_X_manual = np.array([
    np.concatenate([m, p])
    for m, p in zip(train_df['metabolite_features'], train_df['protein_features'])
])
train_y_manual = train_df['label']

test_X_manual = np.array([
    np.concatenate([m, p])
    for m, p in zip(test_df['metabolite_features'], test_df['protein_features'])
])
test_y_manual = test_df['label']

# Train model
lr_manual = LinearRegression()
lr_manual.fit(train_X_manual, train_y_manual)

# Generate predictions
predictions_method2 = lr_manual.predict(test_X_manual)
print(f"Method 2 predictions shape: {predictions_method2.shape}")
print(f"Method 2 first 10 predictions: {predictions_method2[:10]}")


=== Method 2: Manual Implementation ===
Method 2 predictions shape: (3473,)
Method 2 first 10 predictions: [ 0.8442364   0.9761486   0.46442795  1.6891556   0.239748    1.2012768
  1.1019135   0.8296261   0.6127968  -1.7968616 ]
Method 2 predictions shape: (3473,)
Method 2 first 10 predictions: [ 0.8442364   0.9761486   0.46442795  1.6891556   0.239748    1.2012768
  1.1019135   0.8296261   0.6127968  -1.7968616 ]


## Method 3: Loading Pre-trained Model

In [10]:
# Method 3: Load the saved model from ER_RegressionModel_LR.ipynb
print("\n=== Method 3: Load Pre-trained Model ===")

try:
    # Try to load the model (update path as needed)
    loaded_model = joblib.load('./../../A03_models/random_split/LR model_Catpred.joblib')
    
    # Use the same test data as Method 2 for fair comparison
    predictions_method3 = loaded_model.predict(test_X_manual)
    print(f"Method 3 predictions shape: {predictions_method3.shape}")
    print(f"Method 3 first 10 predictions: {predictions_method3[:10]}")
    
except FileNotFoundError:
    print("Pre-trained model file not found. Creating and saving a model for comparison...")
    # Save the manually trained model for consistency
    joblib.dump(lr_manual, './temp_lr_model.joblib')
    loaded_model = joblib.load('./temp_lr_model.joblib')
    predictions_method3 = loaded_model.predict(test_X_manual)
    print(f"Method 3 predictions shape: {predictions_method3.shape}")
    print(f"Method 3 first 10 predictions: {predictions_method3[:10]}")


=== Method 3: Load Pre-trained Model ===
Method 3 predictions shape: (3473,)
Method 3 first 10 predictions: [ 0.8463745   0.9789429   0.4663391   1.6912537   0.24151611  1.203186
  1.1044312   0.83166504  0.6152954  -1.7943726 ]


## Detailed Comparison Analysis

In [11]:
# Compare data shapes and basic statistics
print("=== Data Shape Comparison ===")
print(f"Method 1 test_X shape: {benchmark_model.test_X.shape}")
print(f"Method 2 test_X shape: {test_X_manual.shape}")
print(f"Method 1 test_y shape: {benchmark_model.test_y.shape}")
print(f"Method 2 test_y shape: {test_y_manual.shape}")

print("\n=== Index Comparison ===")
print(f"Method 1 test indices (first 10): {benchmark_model.test_index[:10]}")
print(f"Method 2 test indices (first 10): {test_df.index[:10].values}")

print("\n=== Feature Statistics Comparison ===")
print(f"Method 1 test_X mean: {benchmark_model.test_X.mean():.6f}")
print(f"Method 2 test_X mean: {test_X_manual.mean():.6f}")
print(f"Method 1 test_X std: {benchmark_model.test_X.std():.6f}")
print(f"Method 2 test_X std: {test_X_manual.std():.6f}")

=== Data Shape Comparison ===
Method 1 test_X shape: (3473, 1088)
Method 2 test_X shape: (3473, 1088)
Method 1 test_y shape: (3473,)
Method 2 test_y shape: (3473,)

=== Index Comparison ===
Method 1 test indices (first 10): [14384 20750 14099  1522 21680 18656 20026 13459 14465  5500]
Method 2 test indices (first 10): [14384 20750 14099  1522 21680 18656 20026 13459 14465  5500]

=== Feature Statistics Comparison ===
Method 1 test_X mean: -0.003720
Method 2 test_X mean: -0.003720
Method 1 test_X std: 0.490411
Method 2 test_X std: 0.490411


In [12]:
# Check if test sets are identical
print("=== Test Set Identity Check ===")

# Compare a few sample features
if benchmark_model.test_X.shape == test_X_manual.shape:
    # Check if the data is identical
    are_identical = np.allclose(benchmark_model.test_X, test_X_manual, rtol=1e-10, atol=1e-10)
    print(f"Test sets identical: {are_identical}")
    
    if not are_identical:
        # Find differences
        diff_mask = ~np.isclose(benchmark_model.test_X, test_X_manual, rtol=1e-10, atol=1e-10)
        num_differences = np.sum(diff_mask)
        print(f"Number of different elements: {num_differences}")
        print(f"Percentage different: {num_differences / benchmark_model.test_X.size * 100:.4f}%")
        
        if num_differences > 0:
            # Show some example differences
            diff_indices = np.where(diff_mask)
            for i in range(min(5, len(diff_indices[0]))):
                row, col = diff_indices[0][i], diff_indices[1][i]
                print(f"Difference at [{row}, {col}]: {benchmark_model.test_X[row, col]} vs {test_X_manual[row, col]}")
else:
    print("Test sets have different shapes - cannot compare directly")

=== Test Set Identity Check ===
Test sets identical: True


In [13]:
# Compare model coefficients
print("=== Model Coefficients Comparison ===")
print(f"Method 1 coefficients shape: {benchmark_model.model.coef_.shape}")
print(f"Method 2 coefficients shape: {lr_manual.model.coef_.shape}")
print(f"Method 3 coefficients shape: {loaded_model.coef_.shape}")

print(f"\nMethod 1 intercept: {benchmark_model.model.intercept_:.10f}")
print(f"Method 2 intercept: {lr_manual.intercept_:.10f}")
print(f"Method 3 intercept: {loaded_model.intercept_:.10f}")

# Check coefficient differences
coeff_diff_1_2 = np.abs(benchmark_model.model.coef_ - lr_manual.coef_)
coeff_diff_2_3 = np.abs(lr_manual.coef_ - loaded_model.coef_)

print(f"\nMax coefficient difference (Method 1 vs 2): {np.max(coeff_diff_1_2):.2e}")
print(f"Max coefficient difference (Method 2 vs 3): {np.max(coeff_diff_2_3):.2e}")
print(f"Mean coefficient difference (Method 1 vs 2): {np.mean(coeff_diff_1_2):.2e}")
print(f"Mean coefficient difference (Method 2 vs 3): {np.mean(coeff_diff_2_3):.2e}")

=== Model Coefficients Comparison ===
Method 1 coefficients shape: (1088,)


AttributeError: 'LinearRegression' object has no attribute 'model'

In [None]:
# Compare predictions numerically
print("=== Prediction Comparison ===")

# For fair comparison, use Method 2's test set for all predictions
pred_1_on_manual_test = benchmark_model.model.predict(test_X_manual)
pred_2 = predictions_method2
pred_3 = predictions_method3

print(f"Method 1 on manual test set (first 10): {pred_1_on_manual_test[:10]}")
print(f"Method 2 predictions (first 10): {pred_2[:10]}")
print(f"Method 3 predictions (first 10): {pred_3[:10]}")

# Calculate differences
diff_1_2 = np.abs(pred_1_on_manual_test - pred_2)
diff_2_3 = np.abs(pred_2 - pred_3)
diff_1_3 = np.abs(pred_1_on_manual_test - pred_3)

print(f"\nMax difference (Method 1 vs 2): {np.max(diff_1_2):.2e}")
print(f"Max difference (Method 2 vs 3): {np.max(diff_2_3):.2e}")
print(f"Max difference (Method 1 vs 3): {np.max(diff_1_3):.2e}")

print(f"Mean difference (Method 1 vs 2): {np.mean(diff_1_2):.2e}")
print(f"Mean difference (Method 2 vs 3): {np.mean(diff_2_3):.2e}")
print(f"Mean difference (Method 1 vs 3): {np.mean(diff_1_3):.2e}")

print(f"Std difference (Method 1 vs 2): {np.std(diff_1_2):.2e}")
print(f"Std difference (Method 2 vs 3): {np.std(diff_2_3):.2e}")
print(f"Std difference (Method 1 vs 3): {np.std(diff_1_3):.2e}")

In [None]:
# Visualize the differences
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Scatter plot: Method 1 vs Method 2
axes[0, 0].scatter(pred_1_on_manual_test, pred_2, alpha=0.6)
axes[0, 0].plot([pred_2.min(), pred_2.max()], [pred_2.min(), pred_2.max()], 'r--')
axes[0, 0].set_xlabel('Method 1 Predictions')
axes[0, 0].set_ylabel('Method 2 Predictions')
axes[0, 0].set_title('Method 1 vs Method 2')
axes[0, 0].grid(True, alpha=0.3)

# Scatter plot: Method 2 vs Method 3
axes[0, 1].scatter(pred_2, pred_3, alpha=0.6)
axes[0, 1].plot([pred_2.min(), pred_2.max()], [pred_2.min(), pred_2.max()], 'r--')
axes[0, 1].set_xlabel('Method 2 Predictions')
axes[0, 1].set_ylabel('Method 3 Predictions')
axes[0, 1].set_title('Method 2 vs Method 3')
axes[0, 1].grid(True, alpha=0.3)

# Difference histogram: Method 1 vs Method 2
axes[1, 0].hist(diff_1_2, bins=50, alpha=0.7, edgecolor='black')
axes[1, 0].set_xlabel('Absolute Difference')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Difference Distribution: Method 1 vs Method 2')
axes[1, 0].grid(True, alpha=0.3)

# Difference histogram: Method 2 vs Method 3
axes[1, 1].hist(diff_2_3, bins=50, alpha=0.7, edgecolor='black')
axes[1, 1].set_xlabel('Absolute Difference')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Difference Distribution: Method 2 vs Method 3')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Potential Sources of Discrepancy

Based on the analysis above, potential sources of numerical differences include:

1. **Data Type Precision**: Different numpy data types (float32 vs float64)
2. **Feature Engineering**: Slight differences in how features are concatenated
3. **Model Serialization**: Precision loss during joblib save/load operations
4. **Random Seed Propagation**: Different random states affecting data processing
5. **Memory Layout**: Different array memory layouts affecting numerical computation
6. **Sklearn Version**: Different sklearn versions with slightly different algorithms

## Recommendations

1. **Ensure Identical Data Processing**: Use the same data preprocessing pipeline
2. **Check Data Types**: Ensure all arrays use the same precision (float64)
3. **Verify Random Seeds**: Set random seeds consistently across all methods
4. **Model Serialization**: Consider the precision limits of joblib serialization

In [None]:
# Additional diagnostic: Check data types and memory layout
print("=== Data Type and Memory Layout Analysis ===")
print(f"Method 1 test_X dtype: {benchmark_model.test_X.dtype}")
print(f"Method 2 test_X dtype: {test_X_manual.dtype}")
print(f"Method 1 test_X memory layout: {benchmark_model.test_X.flags}")
print(f"Method 2 test_X memory layout: {test_X_manual.flags}")

# Check if converting to same dtype reduces differences
if benchmark_model.test_X.dtype != test_X_manual.dtype:
    print(f"\nConverting both to float64 for comparison...")
    test_X_method1_f64 = benchmark_model.test_X.astype(np.float64)
    test_X_method2_f64 = test_X_manual.astype(np.float64)
    
    pred_1_f64 = benchmark_model.model.predict(test_X_method1_f64)
    pred_2_f64 = lr_manual.predict(test_X_method2_f64)
    
    print(f"Difference after dtype conversion: {np.max(np.abs(pred_1_f64 - pred_2_f64)):.2e}")