# Scikit-learn Advanced Techniques: A Mathematical Deep Dive into Production-Ready Machine Learning

This notebook explores advanced machine learning techniques essential for building robust, production-ready systems. We'll dive deep into the mathematical foundations of preprocessing, feature engineering, hyperparameter optimization, ensemble methods, and model deployment strategies.

## Advanced Machine Learning: Beyond Basic Algorithms

While understanding individual algorithms is crucial, **real-world machine learning** requires mastery of the entire pipeline - from raw data to deployed models. This notebook bridges the gap between academic understanding and practical implementation.

### The Production ML Challenge

**Research vs. Production**: Key differences that matter

**Research Setting**:
- Clean, well-curated datasets
- Focus on algorithmic novelty
- Single metric optimization
- Unlimited computational resources

**Production Setting**:
- Messy, real-world data
- Multiple competing objectives
- Strict latency and resource constraints
- Need for reliability and interpretability

### Mathematical Framework for Advanced ML

**Multi-Objective Optimization**: Real-world ML involves multiple, often conflicting objectives:
$$\min_{\theta} \mathbf{f}(\theta) = \begin{bmatrix} 
\text{Prediction Error}(\theta) \\
\text{Complexity}(\theta) \\
\text{Latency}(\theta) \\
\text{Memory}(\theta)
\end{bmatrix}$$

**Pareto Optimality**: A solution $\theta^*$ is Pareto optimal if no other solution exists that improves one objective without worsening another.
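
As a minimal sketch of the definition above, the following filters a list of candidates down to its Pareto-optimal subset, assuming every objective is minimized. The `(error, latency_ms)` pairs are hypothetical values chosen for illustration, not measurements.

```python
from typing import List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (prediction error, latency in ms) pairs for five candidate models
candidates = [(0.10, 50.0), (0.12, 20.0), (0.15, 25.0), (0.08, 90.0), (0.10, 60.0)]
front = pareto_front(candidates)
print(front)  # [(0.1, 50.0), (0.12, 20.0), (0.08, 90.0)]
```

The two dropped candidates are each beaten on both objectives by another model; the three survivors represent genuine trade-offs between error and latency.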

**Robust Optimization**: Account for uncertainty in data distribution:
$$\min_{\theta} \max_{P \in \mathcal{U}} \mathbb{E}_{(x,y) \sim P}[\mathcal{L}(y, f_{\theta}(x))]$$

where $\mathcal{U}$ is an uncertainty set around the training distribution.

### Advanced Concepts We'll Master

**1. Sophisticated Preprocessing**:
- Handling mixed data types mathematically
- Advanced imputation techniques
- Feature engineering automation
- Pipeline optimization

**2. Hyperparameter Optimization**:
- Bayesian optimization theory
- Multi-fidelity optimization
- Evolutionary strategies
- Early stopping mathematics

**3. Advanced Ensemble Methods**:
- Stacking theory and practice
- Dynamic ensemble selection
- Meta-learning approaches
- Uncertainty quantification

**4. Model Interpretation**:
- SHAP value mathematics
- Counterfactual explanations
- Causal inference integration
- Global vs. local explanations

**5. Robust Evaluation**:
- Cross-validation variants
- Statistical significance testing
- A/B testing for ML
- Fairness metrics

**6. Production Deployment**:
- Model versioning
- Drift detection algorithms
- Online learning systems
- MLOps pipeline design

### Why Advanced Techniques Matter

**Statistical Considerations**:
- **Multiple Testing**: When trying many models, we need correction for multiple comparisons
- **Selection Bias**: Hyperparameter tuning can introduce bias if not done carefully
- **Distribution Shift**: Models must handle changing data distributions

**Computational Considerations**:
- **Scalability**: Algorithms must work with large datasets
- **Resource Constraints**: Memory and time limitations in production
- **Real-time Requirements**: Sub-second prediction latency

**Business Considerations**:
- **Interpretability**: Stakeholders need to understand model decisions
- **Fairness**: Models must not discriminate against protected groups
- **Reliability**: Systems must be robust to failures and edge cases

### Mathematical Foundations for Robustness

**Generalization Theory**: Understanding when models will work on new data

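**PAC Learning**: A hypothesis class is PAC learnable if, for every $\epsilon, \delta \in (0, 1)$, a learner given sufficiently many samples returns a hypothesis satisfying: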
$$P(\text{Error} \leq \epsilon) \geq 1 - \delta$$

**VC Dimension**: Measure of model complexity affecting generalization

**Rademacher Complexity**: Data-dependent generalization bound. For losses bounded in $[0, 1]$, with probability at least $1 - \delta$:
$$\text{Generalization Gap} \leq 2\mathcal{R}_n(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}}$$

**Domain Adaptation**: When training and test distributions differ:
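$$\epsilon_{\text{target}} \leq \epsilon_{\text{source}} + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(D_s, D_t) + \lambda$$

where $d_{\mathcal{H}\Delta\mathcal{H}}$ is the $\mathcal{H}\Delta\mathcal{H}$-distance between domains and $\lambda$ is the combined source and target error of the best single hypothesis in $\mathcal{H}$.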

### The Advanced ML Mindset

- **Think in Systems**: Models are components in larger systems
- **Embrace Uncertainty**: Quantify and communicate prediction uncertainty
- **Plan for Failure**: Build robust systems that degrade gracefully
- **Iterate Rapidly**: Use techniques that enable fast experimentation
- **Monitor Continuously**: Implement systems to detect when models fail

This notebook will transform your understanding from basic algorithm application to sophisticated machine learning system design, providing both theoretical foundations and practical tools for building production-ready ML systems.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets, metrics, preprocessing, model_selection, pipeline, compose
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, Ridge, Lasso, ElasticNet
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.feature_selection import SelectKBest, RFE, SelectFromModel
from sklearn.impute import SimpleImputer, KNNImputer
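# plot_partial_dependence was removed in scikit-learn 1.2 (the version that introduced
# OneHotEncoder(sparse_output=...), used below); PartialDependenceDisplay replaces it
from sklearn.inspection import permutation_importance, PartialDependenceDisplay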
import warnings
warnings.filterwarnings('ignore')

# Set style and random seed
plt.style.use('seaborn-v0_8')
np.random.seed(42)
plt.rcParams['figure.figsize'] = (12, 8)

## Advanced Preprocessing and Feature Engineering: The Mathematical Art of Data Transformation

Preprocessing is often considered mundane, but it's where the most critical decisions are made. Poor preprocessing can doom even the best algorithms, while sophisticated preprocessing can make simple algorithms work remarkably well.

### The Mathematics of Mixed Data Types

Real-world datasets rarely contain only numerical features. We must handle **heterogeneous data** systematically, and this requires understanding the mathematical properties of different data types.

**Data Type Taxonomy**:
1. **Numerical**: Continuous (real-valued) or discrete (integer-valued)
2. **Categorical**: Nominal (no order) or ordinal (ordered)
3. **Text**: Unstructured language data
4. **Temporal**: Time-series or date features
5. **Spatial**: Geographic coordinates, images

### Advanced Imputation Theory

Missing data is inevitable, and naive approaches can introduce significant bias. We need sophisticated mathematical frameworks for handling missingness.

**Missingness Mechanisms**:

**1. Missing Completely at Random (MCAR)**:
$$P(\text{Missing} | \text{Observed}, \text{Unobserved}) = P(\text{Missing})$$
- Missingness is independent of all variables
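- Complete-case analysis and simple imputation give unbiased estimates of means, although mean imputation still shrinks variances and correlations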

**2. Missing at Random (MAR)**:
$$P(\text{Missing} | \text{Observed}, \text{Unobserved}) = P(\text{Missing} | \text{Observed})$$
- Missingness depends only on observed variables
- Can be handled with proper imputation

**3. Missing Not at Random (MNAR)**:
$$P(\text{Missing} | \text{Observed}, \text{Unobserved}) \neq P(\text{Missing} | \text{Observed})$$
- Missingness depends on unobserved values
- Requires domain knowledge or specialized methods
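
To make the three mechanisms concrete, here is a small simulation sketch. The data-generating process, the logistic missingness models, and the sample size are all illustrative assumptions. It compares the mean of the observed values under each mechanism, then shows that under MAR a regression on the observed covariate can recover the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)              # fully observed covariate
y = 2.0 * x + rng.normal(size=n)    # variable that will lose entries

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# True marks a missing entry of y
missing = {
    "MCAR": rng.random(n) < 0.3,              # independent of x and y
    "MAR": rng.random(n) < sigmoid(2 * x),    # depends only on the observed x
    "MNAR": rng.random(n) < sigmoid(2 * y),   # depends on the missing value itself
}
observed_mean = {name: y[~mask].mean() for name, mask in missing.items()}

# Under MAR, regressing y on x using observed rows and predicting for every row
# recovers the mean, because missingness is explained by x
mask = missing["MAR"]
slope, intercept = np.polyfit(x[~mask], y[~mask], 1)
mar_regression_mean = (slope * x + intercept).mean()

print(f"true mean of y:          {y.mean():.3f}")
for name, value in observed_mean.items():
    print(f"observed mean under {name:4s}: {value:.3f}")
print(f"MAR, regression-adjusted: {mar_regression_mean:.3f}")
```

In this setup the naive observed mean stays close to the truth only under MCAR. The MAR bias disappears once `x` is used, whereas no adjustment based on observed data alone fixes the MNAR case.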

### K-Nearest Neighbors Imputation Mathematics

**Algorithm**: For missing value in sample $i$, feature $j$:
1. **Distance calculation**: Find $k$ most similar samples with observed $x_j$
$$d(i, l) = \sqrt{\sum_{m \neq j} (x_{im} - x_{lm})^2}$$

2. **Weighted imputation**: 
$$\hat{x}_{ij} = \frac{\sum_{l \in \mathcal{N}_k(i)} w_{il} x_{lj}}{\sum_{l \in \mathcal{N}_k(i)} w_{il}}$$

where weights can be:
- **Uniform**: $w_{il} = 1$
- **Distance-based**: $w_{il} = \frac{1}{d(i,l) + \epsilon}$
- **Gaussian**: $w_{il} = \exp(-d(i,l)^2/2\sigma^2)$

**Advantages**:
- Preserves local data structure
- Handles non-linear relationships
- Extends to mixed data types when paired with a suitable distance (for example, Gower distance)

**Challenges**:
- Computational complexity: $O(n^2 d)$
- Sensitive to scaling and distance metric choice
- Can propagate errors when many features are missing
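
The two steps above can be sketched directly in NumPy. This is a simplified illustration that assumes all other features are observed in the rows involved; `knn_impute_entry` is a hypothetical helper, not scikit-learn's `KNNImputer`, which additionally handles missing values inside the distance computation.

```python
import numpy as np

def knn_impute_entry(X, i, j, k=2, distance_weighted=False, eps=1e-8):
    """Impute X[i, j] from the k nearest rows in which feature j is observed.

    Simplifying assumption: every other feature is observed in the rows involved.
    """
    others = [m for m in range(X.shape[1]) if m != j]
    donors = [l for l in range(X.shape[0]) if l != i and not np.isnan(X[l, j])]
    # Euclidean distance over all features except j
    dists = np.array([np.linalg.norm(X[i, others] - X[l, others]) for l in donors])
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps) if distance_weighted else np.ones(len(nearest))
    values = np.array([X[donors[t], j] for t in nearest])
    return float(np.sum(weights * values) / np.sum(weights))

X_toy = np.array([
    [1.0, 2.0, np.nan],
    [1.0, 3.0, 10.0],
    [2.0, 2.0, 20.0],
    [9.0, 9.0, 90.0],
])
imputed = knn_impute_entry(X_toy, i=0, j=2, k=2)
print(imputed)  # 15.0: mean of the two rows at distance 1
```

The distant fourth row is excluded by `k=2`, which is the "preserves local structure" property in action.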

### Multiple Imputation Framework

**Theoretical Foundation**: Generate $m$ plausible imputed datasets, analyze each, then combine results.

**Rubin's Rules**: For parameter estimate $\hat{\theta}$:
$$\hat{\theta}_{\text{pooled}} = \frac{1}{m}\sum_{i=1}^{m} \hat{\theta}_i$$

**Variance calculation**:
$$\text{Var}(\hat{\theta}_{\text{pooled}}) = \bar{W} + \left(1 + \frac{1}{m}\right)B$$

where:
- $\bar{W} = \frac{1}{m}\sum_{i=1}^{m} W_i$ (within-imputation variance)
- $B = \frac{1}{m-1}\sum_{i=1}^{m}(\hat{\theta}_i - \hat{\theta}_{\text{pooled}})^2$ (between-imputation variance)
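
Rubin's rules are short enough to write out directly. The sketch below pools hypothetical estimates and within-imputation variances from $m = 3$ imputed datasets; the numbers are made up for illustration.

```python
from statistics import mean, variance

def pool_rubin(estimates, within_variances):
    """Pool per-imputation estimates and variances with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)                 # pooled point estimate
    w_bar = mean(within_variances)           # within-imputation variance
    b = variance(estimates)                  # between-imputation variance (divides by m - 1)
    total_variance = w_bar + (1 + 1 / m) * b
    return pooled, total_variance

# Hypothetical results from three imputed datasets
pooled_estimate, pooled_variance = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
print(f"{pooled_estimate:.3f}, {pooled_variance:.4f}")
```

Here $\bar{W} = 0.04$ and $B = 0.04$, so the total variance is $0.04 + \frac{4}{3} \cdot 0.04 \approx 0.0933$, more than double what any single imputed dataset would report.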

### Advanced Categorical Encoding

**One-Hot Encoding Issues**:
- **Curse of dimensionality**: Creates $k$ features for $k$ categories
- **Sparsity**: Most values are zero
- **Memory explosion**: For high-cardinality categorical features

**Target Encoding (Mean Encoding)**:
$$\text{encoded}_c = \frac{\sum_{i: x_i = c} y_i}{\sum_{i: x_i = c} 1}$$

**Regularized Target Encoding**: Prevent overfitting with smoothing:
$$\text{encoded}_c = \frac{n_c \cdot \bar{y}_c + \alpha \cdot \bar{y}_{\text{global}}}{n_c + \alpha}$$

where $\alpha$ is smoothing parameter, $n_c$ is count of category $c$.
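
A minimal sketch of the smoothed encoding, using plain Python and a hypothetical five-row dataset. In practice the encoding should be computed on training folds only, which this sketch does not show.

```python
from collections import defaultdict

def smoothed_target_encoding(categories, targets, alpha=1.0):
    """Return {category: smoothed mean target}, shrinking rare categories to the global mean."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    # sums[c] equals n_c * mean_c, so this matches the formula above
    return {c: (sums[c] + alpha * global_mean) / (counts[c] + alpha) for c in counts}

cats = ["a", "a", "a", "a", "b"]
y_toy = [1, 1, 0, 1, 1]
encoding = smoothed_target_encoding(cats, y_toy, alpha=1.0)
print(encoding)
```

Category `b` has a raw mean of 1.0 from a single sample, and smoothing pulls it to 0.9. Category `a`, with four samples, moves only from 0.75 to 0.76.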

**Leave-One-Out Encoding**: Prevent data leakage:
$$\text{encoded}_{c,i} = \frac{\sum_{j \neq i, x_j = c} y_j}{\sum_{j \neq i, x_j = c} 1}$$

**Binary Encoding**: Logarithmic space complexity
- Convert categories to binary representation
- Creates $\lceil \log_2(k) \rceil$ features instead of $k$

### Advanced Scaling Techniques

**Quantile Uniform Transformation**:
Transform to uniform distribution over $[0,1]$:
$$F(x) = \frac{\text{rank}(x) - 0.5}{n}$$

**Quantile Normal Transformation**:
Transform to standard normal distribution:
$$\Phi^{-1}(F(x))$$

where $\Phi^{-1}$ is inverse CDF of standard normal.

**Power Transformations**:
**Box-Cox** (requires $x > 0$): $y(\lambda) = \begin{cases} \frac{x^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(x) & \text{if } \lambda = 0 \end{cases}$

**Yeo-Johnson**: Extension that handles negative values:
$$y(\lambda) = \begin{cases} 
\frac{(x+1)^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, x \geq 0 \\
\log(x+1) & \text{if } \lambda = 0, x \geq 0 \\
-\frac{((-x)+1)^{2-\lambda} - 1}{2-\lambda} & \text{if } \lambda \neq 2, x < 0 \\
-\log((-x)+1) & \text{if } \lambda = 2, x < 0
\end{cases}$$
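
The four Yeo-Johnson cases translate directly into code. This scalar sketch takes $\lambda$ as given; estimating $\lambda$ by maximum likelihood, as scikit-learn's `PowerTransformer` does, is outside its scope.

```python
import math

def yeo_johnson(x: float, lam: float) -> float:
    """Yeo-Johnson transform of a single value for a fixed lambda."""
    if x >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(x)                      # lambda = 0 branch
        return ((x + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-x)                        # lambda = 2 branch
    return -(((-x) + 1.0) ** (2.0 - lam) - 1.0) / (2.0 - lam)

for value in (-3.0, 0.0, 3.0):
    print(value, yeo_johnson(value, 1.0), yeo_johnson(value, 0.5))
```

A useful sanity check is that $\lambda = 1$ leaves every value unchanged, for both positive and negative inputs.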

### Automated Feature Engineering

**Polynomial Feature Generation**:
For features $\mathbf{x} = [x_1, x_2, ..., x_d]$, generate all terms up to degree $p$:
$$\phi(\mathbf{x}) = \{x_1^{a_1} x_2^{a_2} \cdots x_d^{a_d} : a_1 + a_2 + \cdots + a_d \leq p\}$$

**Total number of features** (including the constant term): $\binom{d + p}{p}$
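
To see how quickly this count grows, the sketch below enumerates the monomials explicitly and checks the result against the binomial formula. `monomials` is an illustrative helper that returns index tuples, with the empty tuple standing for the constant term.

```python
import math
from itertools import combinations_with_replacement

def monomials(d: int, p: int):
    """Index tuples of all monomials of total degree <= p in d variables."""
    return [combo
            for degree in range(p + 1)
            for combo in combinations_with_replacement(range(d), degree)]

for d, p in [(3, 2), (10, 3), (50, 3)]:
    print(d, p, len(monomials(d, p)), math.comb(d + p, p))
```

With 50 input features and degree 3, the expansion already exceeds 23,000 columns, which is why interaction-only expansions or restricting to a feature subset are common.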

**Interaction Detection**: Use statistical tests or model-based approaches:
- **Pearson correlation**: For linear interactions
- **Mutual information**: For non-linear interactions
- **ANOVA F-test**: For categorical-numerical interactions

### Advanced Pipeline Design

**Column Transformer Mathematics**: Apply different transformations to different feature subsets:
$$\mathbf{X}_{\text{transformed}} = [\mathbf{T}_1(\mathbf{X}_1), \mathbf{T}_2(\mathbf{X}_2), ..., \mathbf{T}_k(\mathbf{X}_k)]$$

where $\mathbf{X}_i$ are feature subsets and $\mathbf{T}_i$ are transformations.

**Feature Union**: Combine multiple feature extraction methods:
$$\phi(\mathbf{x}) = [\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), ..., \phi_k(\mathbf{x})]$$

### Handling Imbalanced Data

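**Mathematical Framework**: When classes are highly imbalanced, accuracy becomes misleading: a classifier that always predicts the majority class scores highly while never detecting the minority class.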

**SMOTE (Synthetic Minority Oversampling Technique)**:
For minority sample $\mathbf{x}_i$:
1. Find $k$ nearest minority neighbors
2. Generate synthetic sample: $\mathbf{x}_{\text{new}} = \mathbf{x}_i + \lambda(\mathbf{x}_{\text{neighbor}} - \mathbf{x}_i)$, where $\lambda \sim \text{Uniform}[0,1]$
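
The interpolation step can be sketched in a few lines of NumPy. `smote_sample` is a hypothetical helper that generates a single synthetic point and is not the `imbalanced-learn` implementation. The toy minority set is an assumption chosen so the result is easy to check by eye.

```python
import numpy as np

def smote_sample(X_min, i, k, rng):
    """Create one synthetic point between X_min[i] and one of its k nearest minority neighbors."""
    dists = np.linalg.norm(X_min - X_min[i], axis=1)
    dists[i] = np.inf                          # exclude the point itself
    neighbors = np.argsort(dists)[:k]
    neighbor = X_min[rng.choice(neighbors)]
    lam = rng.uniform(0.0, 1.0)
    return X_min[i] + lam * (neighbor - X_min[i])

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
new_point = smote_sample(X_min, i=0, k=2, rng=np.random.default_rng(1))
print(new_point)
```

Because the two nearest neighbors of the origin lie on the axes, the synthetic point always falls on one of the two unit segments leaving the origin and never near the outlier at (10, 10).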

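- **Borderline-SMOTE**: Focuses on minority samples near the decision boundary
- **ADASYN**: Generates more synthetic samples for minority points that are harder to learn (those with more majority-class neighbors)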

**Cost-Sensitive Learning**: Modify loss function:
$$\mathcal{L}_{\text{weighted}} = \sum_{i=1}^{n} w_{y_i} \mathcal{L}(y_i, \hat{y}_i)$$

where $w_c$ is cost for misclassifying class $c$.

### Feature Selection Mathematics

**Filter Methods**: Statistical relationships between features and target

**Mutual Information**: 
$$I(X; Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$$

**Chi-Square Test**: For categorical features
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
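
Both filter statistics can be computed from a contingency table of counts. The sketch below uses plain Python and two hypothetical 2x2 tables, one for independent variables and one for perfectly dependent variables.

```python
import math

def mutual_information(table):
    """Mutual information (in nats) from a contingency table of counts."""
    total = sum(sum(row) for row in table)
    p_x = [sum(row) / total for row in table]
    p_y = [sum(table[i][j] for i in range(len(table))) / total for j in range(len(table[0]))]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:                      # 0 * log(0) is treated as 0
                p_xy = count / total
                mi += p_xy * math.log(p_xy / (p_x[i] * p_y[j]))
    return mi

def chi_square(table):
    """Pearson chi-square statistic; expected counts come from the table margins."""
    total = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(table[i][j] for i in range(len(table))) for j in range(len(table[0]))]
    return sum((table[i][j] - rows[i] * cols[j] / total) ** 2 / (rows[i] * cols[j] / total)
               for i in range(len(table)) for j in range(len(table[0])))

independent = [[25, 25], [25, 25]]
dependent = [[50, 0], [0, 50]]
print(mutual_information(independent), chi_square(independent))
print(mutual_information(dependent), chi_square(dependent))
```

For the dependent table, the mutual information equals $\log 2$ (one full bit) and $\chi^2 = 100$, the sample size.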

**Wrapper Methods**: Use model performance

**Recursive Feature Elimination**: Iteratively remove least important features
1. Train model on all features
2. Rank features by importance
3. Remove least important feature
4. Repeat until desired number of features

**Embedded Methods**: Feature selection during model training
- **L1 Regularization**: Automatic feature selection through sparsity
- **Tree-based importance**: Use feature importance from tree models

### Advanced Text Feature Engineering

**TF-IDF Mathematics**:
$$\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t)$$

where:
- $\text{TF}(t,d) = \frac{\text{count}(t,d)}{\sum_{t' \in d} \text{count}(t',d)}$
- $\text{IDF}(t) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$
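
A literal implementation of these two definitions, applied to a hypothetical two-document corpus. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this textbook form.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Textbook TF-IDF for a list of tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = [["model", "data", "model"], ["data", "drift"]]
scores = tf_idf(docs)
print(scores)
```

The term `data` appears in every document, so its IDF is $\log(2/2) = 0$ and it contributes nothing, while `model` scores $\frac{2}{3}\log 2$ in the first document.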

**Word Embeddings**: Dense vector representations
- **Word2Vec**: Skip-gram and CBOW
- **GloVe**: Global vectors for word representation
- **FastText**: Subword information

### Time Series Feature Engineering

**Lag Features**: $X_{t-1}, X_{t-2}, ..., X_{t-p}$

**Rolling Statistics**:
- **Moving average**: $\bar{X}_t^{(w)} = \frac{1}{w}\sum_{i=0}^{w-1} X_{t-i}$
- **Moving standard deviation**: $\sigma_t^{(w)} = \sqrt{\frac{1}{w}\sum_{i=0}^{w-1}(X_{t-i} - \bar{X}_t^{(w)})^2}$
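
Lag features and rolling statistics can be sketched without any libraries. The helpers below are illustrative and use only past and current values at each time step, which is what keeps these features free of look-ahead leakage.

```python
import math

def lag_features(series, p):
    """Rows [x_{t-1}, ..., x_{t-p}] for every t that has p predecessors."""
    return [[series[t - k] for k in range(1, p + 1)] for t in range(p, len(series))]

def rolling_mean_std(series, w):
    """(mean, std) over the window x_{t-w+1..t}, using the 1/w normalization above."""
    out = []
    for t in range(w - 1, len(series)):
        window = series[t - w + 1 : t + 1]
        mean = sum(window) / w
        std = math.sqrt(sum((v - mean) ** 2 for v in window) / w)
        out.append((mean, std))
    return out

series = [1.0, 2.0, 3.0, 4.0, 5.0]
lags = lag_features(series, p=2)
rolling = rolling_mean_std(series, w=3)
print(lags)      # [[2.0, 1.0], [3.0, 2.0], [4.0, 3.0]]
print(rolling)
```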

**Fourier Features**: Extract frequency components
$$X(f) = \sum_{t=0}^{N-1} x_t e^{-2\pi i ft/N}$$

### Feature Engineering Best Practices

**Domain Knowledge Integration**: 
- Understand business context
- Create features that match domain expertise
- Validate feature engineering with subject matter experts

**Computational Considerations**:
- **Memory complexity**: Track feature explosion
- **Computation time**: Consider real-time constraints
- **Storage requirements**: Balance richness vs. efficiency

**Validation Strategy**:
- **Time-based splits**: For temporal data
- **Group-based splits**: When samples are not independent
- **Nested cross-validation**: When doing feature selection
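
One practical consequence of these validation strategies is that preprocessing should be fit inside each training fold rather than on the full dataset. The sketch below shows one way to arrange that with a scikit-learn `Pipeline`. The synthetic data, the 5% missingness rate, and the choice of imputer and model are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 5% of entries removed at random
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Imputer and scaler statistics are re-estimated on each training fold,
# so the held-out fold never influences preprocessing
clf = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="roc_auc")
print(scores.round(3))
```

Passing the whole pipeline to `cross_val_score` (or to a search object such as `GridSearchCV`) is what prevents leakage; calling `fit_transform` on the full matrix first would not.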

This foundation sets the stage for sophisticated preprocessing that can dramatically improve model performance while maintaining computational efficiency.

In [None]:
# Create complex synthetic dataset with various data types
from sklearn.datasets import make_classification

# Generate base classification data
X_base, y_base = make_classification(n_samples=1000, n_features=10, n_informative=5, 
                                   n_redundant=2, n_clusters_per_class=1, 
                                   random_state=42)

# Create a more complex dataset with mixed data types
np.random.seed(42)
n_samples = 1000

dataset = pd.DataFrame({
    # Numerical features
    'age': np.random.normal(35, 10, n_samples),
    'income': np.random.lognormal(10, 1, n_samples),
    'score1': X_base[:, 0],
    'score2': X_base[:, 1],
    'score3': X_base[:, 2],
    
    # Categorical features
    'category': np.random.choice(['A', 'B', 'C', 'D'], n_samples, p=[0.4, 0.3, 0.2, 0.1]),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_samples),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], 
                                n_samples, p=[0.3, 0.4, 0.2, 0.1]),
    
    # Ordinal feature
    'rating': np.random.choice([1, 2, 3, 4, 5], n_samples, p=[0.1, 0.2, 0.4, 0.2, 0.1]),
    
    # Text feature (simplified)
    'text_length': np.random.poisson(50, n_samples),
    
    # Target variable
    'target': y_base
})

# Introduce missing values
missing_mask = np.random.random((n_samples, len(dataset.columns))) < 0.05
for col in ['age', 'income', 'score1', 'category', 'education']:
    mask = missing_mask[:, dataset.columns.get_loc(col)]
    dataset.loc[mask, col] = np.nan

print("Dataset overview:")
print(dataset.head())
print(f"\nDataset shape: {dataset.shape}")
print(f"\nMissing values per column:")
print(dataset.isnull().sum())

# Advanced preprocessing pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Separate features and target
X = dataset.drop('target', axis=1)
y = dataset['target']

# Define column types
numeric_features = ['age', 'income', 'score1', 'score2', 'score3', 'text_length']
categorical_features = ['category', 'region', 'education']
ordinal_features = ['rating']

print(f"\nFeature types:")
print(f"Numeric: {numeric_features}")
print(f"Categorical: {categorical_features}")
print(f"Ordinal: {ordinal_features}")

# Create preprocessing pipelines for different feature types
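numeric_transformer = Pipeline(steps=[
    # Scale before imputing: KNNImputer relies on Euclidean distances, so the
    # unscaled 'income' column would otherwise dominate the neighbor search.
    # RobustScaler ignores NaNs during fit and passes them through in transform.
    ('scaler', RobustScaler()),  # More robust to outliers than StandardScaler
    ('imputer', KNNImputer(n_neighbors=5))
])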

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
])

ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('ordinal', OrdinalEncoder())
])

# Combine all transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('ord', ordinal_transformer, ordinal_features)
    ],
    remainder='drop'  # Drop any remaining columns
)

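# Apply preprocessing
# NOTE: fitting on the full dataset before the train/test split leaks test-set
# statistics into the imputers and scalers. It is done here for brevity only;
# in production, wrap the preprocessor and model in one Pipeline and fit on training data.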
X_preprocessed = preprocessor.fit_transform(X)

# Get feature names after preprocessing
numeric_feature_names = numeric_features
categorical_feature_names = list(preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(categorical_features))
ordinal_feature_names = ordinal_features

feature_names = numeric_feature_names + categorical_feature_names + ordinal_feature_names

print(f"\nPreprocessed data shape: {X_preprocessed.shape}")
print(f"Number of features after preprocessing: {len(feature_names)}")
print(f"\nFirst few feature names: {feature_names[:10]}")

# Feature engineering: polynomial features for numeric data
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features for a subset of numeric features
poly_features = ['score1', 'score2', 'score3']
poly_indices = [numeric_features.index(feat) for feat in poly_features]

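poly_transformer = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
X_poly = poly_transformer.fit_transform(X_preprocessed[:, poly_indices])

# The transformer output starts with the original linear terms, which are already
# present in X_preprocessed; keep only the pairwise interaction columns
X_interactions = X_poly[:, len(poly_indices):]

# Combine original features with the interaction features
X_final = np.hstack([X_preprocessed, X_interactions])

print(f"\nFinal feature matrix shape: {X_final.shape}")
print(f"Added {X_interactions.shape[1]} interaction features")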

## Advanced Hyperparameter Optimization: The Mathematics of Intelligent Search

Hyperparameter optimization is one of the most crucial yet under-appreciated aspects of machine learning. The difference between random search and sophisticated optimization can mean the difference between a mediocre model and state-of-the-art performance.

### The Hyperparameter Optimization Problem

**Mathematical Formulation**: 
$$\lambda^* = \arg\min_{\lambda \in \Lambda} \mathcal{L}_{\text{val}}(\mathcal{A}(\mathcal{D}_{\text{train}}, \lambda))$$

Where:
- $\lambda$ represents hyperparameters
- $\Lambda$ is the hyperparameter space
- $\mathcal{A}$ is the learning algorithm
- $\mathcal{L}_{\text{val}}$ is validation loss

**Key Challenges**:
1. **Expensive objective function**: Each evaluation requires training a model
2. **No gradients**: Can't use gradient-based optimization
3. **Noisy evaluations**: Cross-validation introduces variance
4. **Mixed variable types**: Continuous, discrete, categorical hyperparameters
5. **High dimensionality**: Many hyperparameters to optimize simultaneously

### Traditional Approaches and Their Limitations

**Grid Search**: Exhaustive search over predefined grid
$$\mathcal{G} = \{\lambda_1^{(1)}, \lambda_1^{(2)}, ...\} \times \{\lambda_2^{(1)}, \lambda_2^{(2)}, ...\} \times ...$$

**Computational Complexity**: $O(|V_1| \times |V_2| \times ... \times |V_d|)$ where $|V_i|$ is number of values for parameter $i$.

**Curse of Dimensionality**: With $d$ parameters and $n$ values each, we need $n^d$ evaluations.

**Random Search**: Sample hyperparameters randomly
$$\lambda^{(i)} \sim p(\lambda)$$

**Bergstra & Bengio (2012) Result**: Random search is more efficient than grid search when only a few hyperparameters matter.

**Mathematical Intuition**: If only $k$ out of $d$ hyperparameters are important, random search effectively searches over the important subspace, while grid search wastes evaluations on irrelevant dimensions.
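
A tiny experiment illustrates this intuition. Suppose only the first of two hyperparameters matters, and both searches get the same budget of nine evaluations. The grid values and the seed are arbitrary choices for the sketch.

```python
import itertools
import random

grid_values = [0.1, 0.5, 0.9]
grid = list(itertools.product(grid_values, grid_values))          # 3 x 3 = 9 evaluations

rng = random.Random(0)
random_points = [(rng.random(), rng.random()) for _ in range(len(grid))]

# Count how many distinct values of the important (first) hyperparameter were tried
distinct_grid = len({important for important, _ in grid})
distinct_random = len({important for important, _ in random_points})
print(distinct_grid, distinct_random)  # 3 9
```

Grid search spends its nine evaluations on just three settings of the parameter that matters, while random search probes nine.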

### Bayesian Optimization: The Principled Approach

**Key Insight**: Use all previous evaluations to inform where to search next.

**Mathematical Framework**:
1. **Surrogate Model**: $p(f(\lambda) | \mathcal{D}_{1:t})$ - probability model of objective function
2. **Acquisition Function**: $\alpha(\lambda | \mathcal{D}_{1:t})$ - utility of evaluating point $\lambda$
3. **Optimization**: $\lambda_{t+1} = \arg\max_{\lambda} \alpha(\lambda | \mathcal{D}_{1:t})$

### Gaussian Process Surrogate Models

**Gaussian Process**: A collection of random variables, any finite subset of which has a joint Gaussian distribution.

**GP Regression**: Given observations $\mathcal{D} = \{(\lambda_i, y_i)\}_{i=1}^t$:

**Prior**: $f(\lambda) \sim \mathcal{GP}(m(\lambda), k(\lambda, \lambda'))$

**Posterior Mean**:
$$\mu_t(\lambda) = m(\lambda) + \mathbf{k}_t(\lambda)^T (\mathbf{K}_t + \sigma^2\mathbf{I})^{-1} (\mathbf{y}_t - \mathbf{m}_t)$$

**Posterior Variance**:
$$\sigma_t^2(\lambda) = k(\lambda, \lambda) - \mathbf{k}_t(\lambda)^T (\mathbf{K}_t + \sigma^2\mathbf{I})^{-1} \mathbf{k}_t(\lambda)$$

Where:
- $\mathbf{K}_t$ is kernel matrix: $[\mathbf{K}_t]_{ij} = k(\lambda_i, \lambda_j)$
- $\mathbf{k}_t(\lambda) = [k(\lambda, \lambda_1), ..., k(\lambda, \lambda_t)]^T$
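
The posterior equations can be sketched in NumPy as follows, assuming a zero prior mean $m(\lambda) = 0$, an RBF kernel with unit length scale, and a small noise term for numerical stability. The function names and the toy objective ($\sin$) are illustrative choices.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return signal_var * np.exp(-np.maximum(sq, 0.0) / (2.0 * length_scale**2))

def gp_posterior(lam_obs, y_obs, lam_query, noise_var=1e-6):
    """Posterior mean and variance at lam_query for a zero-mean GP."""
    K = rbf_kernel(lam_obs, lam_obs) + noise_var * np.eye(len(lam_obs))
    k_star = rbf_kernel(lam_obs, lam_query)               # shape (t, q)
    mean = k_star.T @ np.linalg.solve(K, y_obs)
    v = np.linalg.solve(K, k_star)
    var = np.diag(rbf_kernel(lam_query, lam_query)) - np.sum(k_star * v, axis=0)
    return mean, np.maximum(var, 0.0)

lam_obs = np.array([[0.0], [1.0], [2.0]])
y_obs = np.sin(lam_obs).ravel()
lam_query = np.array([[0.0], [1.0], [2.0], [10.0]])
mean, var = gp_posterior(lam_obs, y_obs, lam_query)
print(mean.round(3), var.round(3))
```

At the observed points the posterior mean reproduces the data and the variance collapses toward zero; far away (at 10) the variance returns to the prior value of 1. That gap is the signal that acquisition functions exploit.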

**Kernel Functions**:

**RBF Kernel**: $k(\lambda, \lambda') = \sigma_f^2 \exp\left(-\frac{\|\lambda - \lambda'\|^2}{2\ell^2}\right)$
- Smooth functions
- Infinite differentiability

**Matérn Kernel**: $k(\lambda, \lambda') = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\|\lambda - \lambda'\|}{\ell}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu}\|\lambda - \lambda'\|}{\ell}\right)$
- Controls smoothness via $\nu$
- More flexible than RBF

### Acquisition Functions

**Expected Improvement (EI)**:
$$\alpha_{\text{EI}}(\lambda) = \mathbb{E}[\max(f(\lambda) - f(\lambda^+), 0)]$$

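where $f(\lambda^+)$ is the current best value. This form assumes $f$ is being maximized (for example, $f = -\mathcal{L}_{\text{val}}$ or a validation score); for minimization, the sign of the improvement flips.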

**Closed Form**: 
$$\alpha_{\text{EI}}(\lambda) = (\mu(\lambda) - f(\lambda^+))\Phi(Z) + \sigma(\lambda)\phi(Z)$$

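where $Z = \frac{\mu(\lambda) - f(\lambda^+)}{\sigma(\lambda)}$ for $\sigma(\lambda) > 0$ (with $\alpha_{\text{EI}} = \max(\mu(\lambda) - f(\lambda^+), 0)$ when $\sigma(\lambda) = 0$), $\Phi$ is the CDF and $\phi$ is the PDF of the standard normal.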
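
The closed form is easy to evaluate with the standard library. This sketch assumes the maximization convention used in the formula and takes the posterior mean and standard deviation as plain numbers; in practice they would come from the surrogate model.

```python
import math

def norm_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu: float, sigma: float, f_best: float) -> float:
    """Closed-form EI for maximization."""
    if sigma <= 0.0:
        return max(mu - f_best, 0.0)           # no uncertainty left at this point
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm_cdf(z) + sigma * norm_pdf(z)

def probability_of_improvement(mu: float, sigma: float, f_best: float) -> float:
    if sigma <= 0.0:
        return 1.0 if mu > f_best else 0.0
    return norm_cdf((mu - f_best) / sigma)

# Same predicted mean (below the incumbent), increasing uncertainty
for sigma in (0.0, 1.0, 2.0):
    print(sigma, round(expected_improvement(0.0, sigma, 1.0), 4))
```

Even when the predicted mean is worse than the incumbent, EI grows with $\sigma$, which is how the acquisition function rewards exploring uncertain regions.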

**Upper Confidence Bound (UCB)**:
$$\alpha_{\text{UCB}}(\lambda) = \mu(\lambda) + \beta_t \sigma(\lambda)$$

**Theoretical Guarantees**: With appropriate $\beta_t$, UCB has sublinear regret bounds.

**Probability of Improvement (PI)**:
$$\alpha_{\text{PI}}(\lambda) = P(f(\lambda) > f(\lambda^+)) = \Phi\left(\frac{\mu(\lambda) - f(\lambda^+)}{\sigma(\lambda)}\right)$$

**Trade-offs**:
- **EI**: Balances exploitation and exploration well
- **UCB**: Strong theoretical guarantees
- **PI**: Simple but tends to over-exploit, sampling very close to the current best with too little exploration

### Multi-Fidelity Optimization

**Problem**: Full model evaluation is expensive. Can we use cheaper approximations?

**Hyperband Algorithm**: Successive halving with multiple brackets

**Mathematical Framework**:
- Start with many configurations, small budgets
- Eliminate poor performers
- Increase budget for survivors

**Successive Halving**: Given $n$ configurations, a per-round budget $B$, and a reduction factor $\eta$:
```
for i = 0 to log_η(n)-1:
    n_i = floor(n / η^i)
    B_i = B * η^i / n
    Train each configuration for B_i budget
    Keep top n_{i+1} = floor(n_i / η) configurations
```
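
A runnable sketch of the loop above, treating $B$ as the budget spent per round and dividing it evenly among the surviving configurations. The `evaluate` callback and the toy "quality" values are hypothetical stand-ins for training a model under a given budget.

```python
import math

def successive_halving(configs, evaluate, budget_per_round, eta=2):
    """Repeatedly score survivors with a growing per-config budget and keep the top 1/eta."""
    survivors = list(configs)
    schedule = []                                   # (number of configs, budget per config)
    while len(survivors) > 1:
        per_config = budget_per_round / len(survivors)
        schedule.append((len(survivors), per_config))
        ranked = sorted(survivors, key=lambda c: evaluate(c, per_config), reverse=True)
        survivors = ranked[: max(1, len(survivors) // eta)]
    return survivors[0], schedule

# Toy stand-in for training: score approaches the config's true quality as budget grows
def evaluate(quality, budget):
    return quality * (1.0 - math.exp(-budget))

qualities = [0.3, 0.9, 0.5, 0.1, 0.7, 0.2, 0.8, 0.4]
winner, schedule = successive_halving(qualities, evaluate, budget_per_round=8.0, eta=2)
print(winner, schedule)  # 0.9 [(8, 1.0), (4, 2.0), (2, 4.0)]
```

With eight configurations and $\eta = 2$, the loop runs $\log_2 8 = 3$ rounds, and each survivor's budget doubles every round. Real learning curves can cross, so a noisy `evaluate` may eliminate a late bloomer, which is the failure mode Hyperband's multiple brackets hedge against.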

**BOHB (Bayesian Optimization + Hyperband)**:
- Use BO to select configurations for Hyperband
- Combine theoretical guarantees of Hyperband with practical efficiency of BO

### Advanced Acquisition Functions

**Knowledge Gradient**: Value of information from next evaluation
$$\alpha_{\text{KG}}(\lambda) = \mathbb{E}[\max_{\lambda'} \mu_{t+1}(\lambda') - \max_{\lambda'} \mu_t(\lambda') | \lambda_{t+1} = \lambda]$$

**Entropy Search**: Reduce uncertainty about optimum location
$$\alpha_{\text{ES}}(\lambda) = H[p^*] - \mathbb{E}[H[p^* | y]]$$

where $p^*$ is distribution over optimum location.

**Parallel Acquisition**: For batch evaluation
$$\alpha_{\text{batch}}(\{\lambda_1, ..., \lambda_q\}) = \mathbb{E}\left[\max\left(\max_i f(\lambda_i) - f(\lambda^+), 0\right)\right]$$

### Handling Categorical and Conditional Variables

**Mixed Variable Spaces**: Combine continuous, discrete, and categorical hyperparameters.

**Kernel Design**: 
$$k(\lambda, \lambda') = k_{\text{cont}}(\lambda_{\text{cont}}, \lambda'_{\text{cont}}) \times k_{\text{cat}}(\lambda_{\text{cat}}, \lambda'_{\text{cat}})$$

**Categorical Kernel**: 
$$k_{\text{cat}}(\lambda, \lambda') = \begin{cases} 1 & \text{if } \lambda = \lambda' \\ 0 & \text{otherwise} \end{cases}$$

**Conditional Variables**: Some hyperparameters only matter for certain algorithms
- Use indicator variables
- Specialized kernels for hierarchical spaces

### Population-Based Methods

**Genetic Algorithms**: Evolutionary optimization

**Selection**: Choose parents based on fitness
$$p(\text{select } i) \propto \text{fitness}(i)$$

**Crossover**: Combine parent hyperparameters
$$\lambda_{\text{child}} = \alpha \lambda_{\text{parent1}} + (1-\alpha) \lambda_{\text{parent2}}$$

**Mutation**: Random perturbations
$$\lambda' = \lambda + \mathcal{N}(0, \sigma^2)$$

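**Differential Evolution** (current-to-best mutation, with scale factor $F$ and two distinct random population members $r_1$, $r_2$):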
$$\lambda_{\text{trial}} = \lambda_{\text{target}} + F(\lambda_{\text{best}} - \lambda_{\text{target}}) + F(\lambda_{\text{r1}} - \lambda_{\text{r2}})$$

**Population-Based Training (PBT)**:
- Train population of models simultaneously
- Periodically copy weights from better performers
- Mutate hyperparameters of copied models

### Multi-Objective Optimization

**Real-world scenarios**: Often optimize multiple conflicting objectives

**Pareto Dominance**: $\lambda_1$ dominates $\lambda_2$ if:
$$f_i(\lambda_1) \leq f_i(\lambda_2) \text{ for all } i \text{ and } f_j(\lambda_1) < f_j(\lambda_2) \text{ for some } j$$

**Hypervolume**: Measure of Pareto front quality
$$HV = \text{Volume}\left(\bigcup_{\lambda \in \text{Pareto}} [\mathbf{f}(\lambda), \mathbf{r}]\right)$$

where $\mathbf{r}$ is reference point.
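
For two objectives the hypervolume reduces to a sum of rectangle areas. The sketch below assumes both objectives are minimized and that the reference point is worse than every point of interest; the example front is hypothetical.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (both objectives minimized)."""
    inside = sorted(p for p in points if p[0] <= ref[0] and p[1] <= ref[1])
    volume = 0.0
    best_y = ref[1]
    for x, y in inside:                 # ascending in the first objective
        if y < best_y:                  # otherwise the point is dominated and adds nothing
            volume += (ref[0] - x) * (best_y - y)
            best_y = y
    return volume

front_2d = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
print(hypervolume_2d(front_2d, ref=(4.0, 4.0)))                  # 6.0
print(hypervolume_2d(front_2d + [(3.0, 3.0)], ref=(4.0, 4.0)))   # 6.0, dominated point adds nothing
```

A larger hypervolume means the front pushes further toward the ideal corner, which is why the multi-objective acquisition functions score candidates by the hypervolume they would add.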

**Multi-Objective Expected Improvement**:
$$\alpha_{\text{MOEI}}(\lambda) = \mathbb{E}[HV(\text{Pareto} \cup \{\mathbf{f}(\lambda)\}) - HV(\text{Pareto})]$$

### Early Stopping and Learning Curves

**Mathematical Formulation**: Predict final performance from partial learning curves.

**Exponential Model**: 
$$\text{performance}(t) = \alpha - \beta e^{-\gamma t}$$

**Power Law Model**:
$$\text{performance}(t) = \alpha - \beta t^{-\gamma}$$

**Bayesian Learning Curve Extrapolation**:
- Fit probabilistic model to partial curves
- Predict probability that run will achieve good final performance
- Stop runs with low probability early

### Practical Implementation Strategies

**Warm Starting**: Use previous experiments to initialize BO
- Transfer learning between related problems
- Meta-learning for hyperparameter initialization

**Multi-Task Optimization**: Share information across related tasks
$$f_t(\lambda) \sim \mathcal{GP}(0, k_t(\lambda, \lambda') + k_{\text{task}}(t, t') k_{\text{shared}}(\lambda, \lambda'))$$

**Asynchronous Optimization**: Handle variable evaluation times
- Use acquisition functions that account for pending evaluations
- Thompson sampling for parallelization

### Cost-Aware Optimization

**Variable Cost Hyperparameters**: Some configurations are more expensive to evaluate

**Cost Model**: $c(\lambda)$ - predicted cost of evaluating $\lambda$

**Cost-Aware EI**:
$$\alpha_{\text{EI/cost}}(\lambda) = \frac{\alpha_{\text{EI}}(\lambda)}{c(\lambda)^\beta}$$

**Multi-Armed Bandit Formulation**: Trade-off between reward and cost
$$\text{Utility}(\lambda) = \frac{\text{Expected Reward}(\lambda)}{\text{Expected Cost}(\lambda)}$$

### Hyperparameter Optimization Best Practices

**Search Space Design**:
- Use log-scale for learning rates: $\log_{10}(\text{lr}) \sim \text{Uniform}(-5, -1)$
- Consider parameter interactions
- Include sensible bounds
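
Sampling on a log scale can be sketched as follows, assuming the exponent range refers to base-10 logarithms (learning rates between $10^{-5}$ and $10^{-1}$). `sample_log_uniform` is an illustrative helper; `scipy.stats.loguniform` offers an equivalent distribution that can be passed to `RandomizedSearchCV`.

```python
import random

def sample_log_uniform(low_exp: float, high_exp: float, rng: random.Random) -> float:
    """Draw a value whose base-10 logarithm is uniform on [low_exp, high_exp]."""
    return 10.0 ** rng.uniform(low_exp, high_exp)

rng = random.Random(42)
learning_rates = [sample_log_uniform(-5, -1, rng) for _ in range(1000)]

# About half of the draws fall below the geometric midpoint 1e-3
share_below_midpoint = sum(lr < 1e-3 for lr in learning_rates) / len(learning_rates)
print(min(learning_rates), max(learning_rates), share_below_midpoint)
```

Sampling the learning rate uniformly on $[10^{-5}, 10^{-1}]$ instead would place about 99% of the draws above $10^{-3}$, leaving the small-rate region almost unexplored.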

**Evaluation Strategy**:
- Use consistent random seeds
- Proper cross-validation setup
- Account for computational budget

**Multi-Level Optimization**:
1. **Coarse search**: Broad exploration
2. **Fine search**: Local optimization around promising regions
3. **Final validation**: Careful evaluation of best configurations

This mathematical foundation enables intelligent, efficient hyperparameter optimization that can dramatically improve model performance while minimizing computational cost.

In [None]:
from sklearn.model_selection import (
    GridSearchCV, RandomizedSearchCV, cross_val_score, 
    StratifiedKFold, learning_curve, validation_curve
)
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from scipy.stats import uniform, randint

# Split data
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_final, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"Class distribution in training set: {np.bincount(y_train)}")

# Define models to compare
models = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'SVM': SVC(random_state=42, probability=True),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'K-Nearest Neighbors': KNeighborsClassifier()
}

# Quick model comparison with cross-validation
cv_scores = {}
cv_folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("\n=== Initial Model Comparison (5-fold CV) ===")
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv_folds, scoring='roc_auc')
    cv_scores[name] = scores
    print(f"{name:20s}: {scores.mean():.4f} ± {scores.std():.4f}")

# Advanced hyperparameter tuning with RandomizedSearchCV
print("\n=== Hyperparameter Tuning (Random Forest) ===")

# Define parameter distributions for RandomizedSearchCV
rf_param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(3, 20),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False],
    'class_weight': [None, 'balanced']
}

# Randomized search
rf_random = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=rf_param_dist,
    n_iter=50,  # Number of parameter settings to sample
    cv=3,  # Reduced for speed
    scoring='roc_auc',
    random_state=42,
    n_jobs=-1,
    verbose=1
)

rf_random.fit(X_train, y_train)

print(f"Best parameters: {rf_random.best_params_}")
print(f"Best cross-validation score: {rf_random.best_score_:.4f}")

# Grid search for fine-tuning around best parameters
print("\n=== Fine-tuning with GridSearchCV ===")

# Define a smaller grid around the best parameters.
# Sets drop duplicate candidates when a value is clipped at its lower bound
# (e.g. n_estimators=50 would otherwise yield [50, 50, 70]).
best_params = rf_random.best_params_
rf_param_grid = {
    'n_estimators': sorted({max(50, best_params['n_estimators'] - 20),
                            best_params['n_estimators'],
                            best_params['n_estimators'] + 20}),
    'max_depth': sorted({max(3, best_params['max_depth'] - 2),
                         best_params['max_depth'],
                         best_params['max_depth'] + 2}),
    'min_samples_split': [best_params['min_samples_split']],
    'min_samples_leaf': [best_params['min_samples_leaf']],
    'max_features': [best_params['max_features']],
    'bootstrap': [best_params['bootstrap']],
    'class_weight': [best_params['class_weight']]
}

rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=rf_param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1
)

rf_grid.fit(X_train, y_train)

print(f"Fine-tuned best parameters: {rf_grid.best_params_}")
print(f"Fine-tuned best score: {rf_grid.best_score_:.4f}")

# Learning curves
print("\n=== Learning Curves ===")

best_rf = rf_grid.best_estimator_

train_sizes, train_scores, val_scores = learning_curve(
    best_rf, X_train, y_train, cv=3, 
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='roc_auc', n_jobs=-1
)

# Plot learning curves
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training score')
plt.fill_between(train_sizes, train_scores.mean(axis=1) - train_scores.std(axis=1),
                train_scores.mean(axis=1) + train_scores.std(axis=1), alpha=0.3)
plt.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='Validation score')
plt.fill_between(train_sizes, val_scores.mean(axis=1) - val_scores.std(axis=1),
                val_scores.mean(axis=1) + val_scores.std(axis=1), alpha=0.3)
plt.xlabel('Training set size')
plt.ylabel('ROC AUC Score')
plt.title('Learning Curves')
plt.legend()
plt.grid(True, alpha=0.3)

# Validation curves (hyperparameter impact)
param_range = range(10, 101, 10)
train_scores_val, test_scores_val = validation_curve(
    RandomForestClassifier(random_state=42), X_train, y_train,
    param_name='n_estimators', param_range=param_range,
    cv=3, scoring='roc_auc', n_jobs=-1
)

plt.subplot(1, 2, 2)
plt.plot(param_range, train_scores_val.mean(axis=1), 'o-', label='Training score')
plt.fill_between(param_range, train_scores_val.mean(axis=1) - train_scores_val.std(axis=1),
                train_scores_val.mean(axis=1) + train_scores_val.std(axis=1), alpha=0.3)
plt.plot(param_range, test_scores_val.mean(axis=1), 'o-', label='Validation score')
plt.fill_between(param_range, test_scores_val.mean(axis=1) - test_scores_val.std(axis=1),
                test_scores_val.mean(axis=1) + test_scores_val.std(axis=1), alpha=0.3)
plt.xlabel('Number of estimators')
plt.ylabel('ROC AUC Score')
plt.title('Validation Curve (n_estimators)')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
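
The `best_score_` values printed above come from the same folds that were used to pick the hyperparameters, so they tend to be optimistic. One common way to get a less biased estimate is nested cross-validation: the search runs inside each outer fold. The sketch below is a minimal illustration on synthetic data; `X_demo`, `y_demo`, the small search space, and the fold counts are assumptions chosen for speed, not the notebook's settings.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Inner folds choose hyperparameters; outer folds estimate generalization
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_distributions={
        'max_depth': randint(2, 8),
        'min_samples_leaf': randint(1, 10),
    },
    n_iter=5,
    cv=inner_cv,
    scoring='roc_auc',
    random_state=0,
)

# Each outer fold reruns the full search on its own training portion
nested_scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv, scoring='roc_auc')
print(f"Nested CV ROC AUC: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
```

The nested estimate costs one full search per outer fold, which is why the search space here is kept deliberately small.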

## Feature Selection and Importance Analysis

In [None]:
from sklearn.feature_selection import (
    SelectKBest, f_classif, mutual_info_classif, RFE, 
    SelectFromModel, SequentialFeatureSelector
)
from sklearn.inspection import permutation_importance

# GridSearchCV already refit best_rf on X_train (refit=True by default);
# the explicit fit below is redundant but harmless
best_rf.fit(X_train, y_train)

print("=== Feature Importance Analysis ===")

# 1. Built-in feature importance (Random Forest)
feature_importance_builtin = best_rf.feature_importances_

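# 2. Permutation importance (less biased toward high-cardinality features than
#    impurity-based importance). It is computed on the training set here;
#    computing it on held-out data better reflects what matters for generalization.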
perm_importance = permutation_importance(
    best_rf, X_train, y_train, n_repeats=5, random_state=42, n_jobs=-1
)

# Create feature importance DataFrame
importance_df = pd.DataFrame({
    'feature': [f'Feature_{i}' for i in range(X_train.shape[1])],
    'builtin_importance': feature_importance_builtin,
    'permutation_importance': perm_importance.importances_mean,
    'permutation_std': perm_importance.importances_std
})

# Sort by permutation importance
importance_df = importance_df.sort_values('permutation_importance', ascending=False)

print("Top 10 most important features:")
print(importance_df.head(10))

# Plot feature importance comparison
plt.figure(figsize=(15, 10))

# Plot top 20 features
top_features = importance_df.head(20)

plt.subplot(2, 2, 1)
plt.barh(range(len(top_features)), top_features['builtin_importance'])
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Built-in Importance')
plt.title('Random Forest Built-in Feature Importance')
plt.gca().invert_yaxis()

plt.subplot(2, 2, 2)
plt.barh(range(len(top_features)), top_features['permutation_importance'])
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Permutation Importance')
plt.title('Permutation Feature Importance')
plt.gca().invert_yaxis()

plt.subplot(2, 2, 3)
plt.scatter(top_features['builtin_importance'], top_features['permutation_importance'])
plt.xlabel('Built-in Importance')
plt.ylabel('Permutation Importance')
plt.title('Importance Methods Comparison')
plt.plot([0, max(top_features['builtin_importance'])], 
         [0, max(top_features['builtin_importance'])], 'r--', alpha=0.5)

# Feature selection methods comparison
print("\n=== Feature Selection Methods Comparison ===")

# 1. Univariate feature selection
selector_univariate = SelectKBest(score_func=f_classif, k=20)
X_train_univariate = selector_univariate.fit_transform(X_train, y_train)
X_test_univariate = selector_univariate.transform(X_test)

# 2. Recursive Feature Elimination
selector_rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=42), n_features_to_select=20)
X_train_rfe = selector_rfe.fit_transform(X_train, y_train)
X_test_rfe = selector_rfe.transform(X_test)

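# 3. Model-based selection
# threshold=-np.inf makes max_features the only criterion, so the top 20 features
# by importance are kept; with the default threshold, fewer may be selected.
selector_model = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=42), 
                                max_features=20, threshold=-np.inf)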
X_train_model = selector_model.fit_transform(X_train, y_train)
X_test_model = selector_model.transform(X_test)

# 4. Sequential Feature Selection (forward)
selector_sequential = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=50, random_state=42),
    n_features_to_select=20, direction='forward', cv=3
)
X_train_sequential = selector_sequential.fit_transform(X_train, y_train)
X_test_sequential = selector_sequential.transform(X_test)

# Compare performance of different feature selection methods
feature_selection_results = {}
datasets = {
    'All Features': (X_train, X_test),
    'Univariate (F-test)': (X_train_univariate, X_test_univariate),
    'RFE': (X_train_rfe, X_test_rfe),
    'Model-based': (X_train_model, X_test_model),
    'Sequential': (X_train_sequential, X_test_sequential)
}

for method_name, (X_tr, X_te) in datasets.items():
    # Train and evaluate model
    rf_temp = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_temp.fit(X_tr, y_train)
    
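    # Cross-validation score. Note: the selectors were fit on all of X_train
    # before this CV, so scores for the selected subsets are optimistic.
    cv_score = cross_val_score(rf_temp, X_tr, y_train, cv=3, scoring='roc_auc').mean()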
    
    # Test score
    test_pred = rf_temp.predict_proba(X_te)[:, 1]
    test_score = roc_auc_score(y_test, test_pred)
    
    feature_selection_results[method_name] = {
        'n_features': X_tr.shape[1],
        'cv_score': cv_score,
        'test_score': test_score
    }
    
    print(f"{method_name:20s}: {X_tr.shape[1]:3d} features, "
          f"CV: {cv_score:.4f}, Test: {test_score:.4f}")

# Plot feature selection comparison
plt.subplot(2, 2, 4)
methods = list(feature_selection_results.keys())
test_scores = [feature_selection_results[method]['test_score'] for method in methods]
n_features = [feature_selection_results[method]['n_features'] for method in methods]

colors = plt.cm.viridis(np.linspace(0, 1, len(methods)))
for i, (method, score, n_feat) in enumerate(zip(methods, test_scores, n_features)):
    plt.scatter(n_feat, score, s=100, c=[colors[i]], label=method)

plt.xlabel('Number of Features')
plt.ylabel('Test ROC AUC')
plt.title('Feature Selection Methods Comparison')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
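
In the comparison above, each selector is fit on the full training set before cross-validation, so every validation fold has already influenced which features were kept. A common remedy is to put the selector inside a `Pipeline`, so that it is refit on the training portion of each fold. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and `k=5` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=0
)

# The selector is a pipeline step, so cross_val_score refits it per fold
selection_pipeline = Pipeline([
    ('select', SelectKBest(score_func=f_classif, k=5)),
    ('clf', RandomForestClassifier(n_estimators=50, random_state=0)),
])

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
pipeline_scores = cross_val_score(
    selection_pipeline, X_demo, y_demo, cv=cv, scoring='roc_auc'
)
print(f"Leak-free CV ROC AUC: {pipeline_scores.mean():.4f} ± {pipeline_scores.std():.4f}")

# Fit once on all data to inspect which features the selector keeps
selection_pipeline.fit(X_demo, y_demo)
n_selected = int(selection_pipeline.named_steps['select'].get_support().sum())
print(f"Features kept: {n_selected}")
```

The same pattern should work with `RFE`, `SelectFromModel`, or `SequentialFeatureSelector` in place of `SelectKBest`.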

## Advanced Model Evaluation and Interpretation

In [None]:
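# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import (
    precision_recall_curve, average_precision_score,
    ConfusionMatrixDisplay, classification_report,
    confusion_matrix, roc_curve, roc_auc_score
)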
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
import matplotlib.patches as patches

# Get predictions from the best model
y_pred = best_rf.predict(X_test)
y_pred_proba = best_rf.predict_proba(X_test)[:, 1]

print("=== Comprehensive Model Evaluation ===")

# Basic metrics
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# ROC AUC
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"\nROC AUC Score: {roc_auc:.4f}")

# Average Precision (better for imbalanced datasets)
avg_precision = average_precision_score(y_test, y_pred_proba)
print(f"Average Precision Score: {avg_precision:.4f}")

# Comprehensive evaluation plots
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# 1. Confusion Matrix
from sklearn.metrics import ConfusionMatrixDisplay
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Class 0', 'Class 1'])
disp.plot(ax=axes[0, 0], cmap='Blues')
axes[0, 0].set_title('Confusion Matrix')

# 2. ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
axes[0, 1].plot(fpr, tpr, linewidth=2, label=f'ROC Curve (AUC = {roc_auc:.3f})')
axes[0, 1].plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random')
axes[0, 1].set_xlabel('False Positive Rate')
axes[0, 1].set_ylabel('True Positive Rate')
axes[0, 1].set_title('ROC Curve')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 3. Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
axes[0, 2].plot(recall, precision, linewidth=2, label=f'PR Curve (AP = {avg_precision:.3f})')
baseline = np.sum(y_test) / len(y_test)
axes[0, 2].axhline(y=baseline, color='k', linestyle='--', linewidth=1, label=f'Baseline ({baseline:.3f})')
axes[0, 2].set_xlabel('Recall')
axes[0, 2].set_ylabel('Precision')
axes[0, 2].set_title('Precision-Recall Curve')
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)

# 4. Prediction Distribution
axes[1, 0].hist(y_pred_proba[y_test == 0], bins=20, alpha=0.5, label='Class 0', density=True)
axes[1, 0].hist(y_pred_proba[y_test == 1], bins=20, alpha=0.5, label='Class 1', density=True)
axes[1, 0].axvline(x=0.5, color='red', linestyle='--', label='Threshold')
axes[1, 0].set_xlabel('Predicted Probability')
axes[1, 0].set_ylabel('Density')
axes[1, 0].set_title('Prediction Distribution')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# 5. Calibration Plot
fraction_of_positives, mean_predicted_value = calibration_curve(y_test, y_pred_proba, n_bins=10)
axes[1, 1].plot(mean_predicted_value, fraction_of_positives, "s-", linewidth=2, label='Random Forest')
axes[1, 1].plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
axes[1, 1].set_xlabel('Mean Predicted Probability')
axes[1, 1].set_ylabel('Fraction of Positives')
axes[1, 1].set_title('Calibration Plot')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

# 6. Threshold Analysis
thresholds = np.linspace(0, 1, 101)
precisions = []
recalls = []
f1_scores = []

for threshold in thresholds:
    y_pred_thresh = (y_pred_proba >= threshold).astype(int)
    if np.sum(y_pred_thresh) > 0:  # Avoid division by zero
        prec = metrics.precision_score(y_test, y_pred_thresh, zero_division=0)
        rec = metrics.recall_score(y_test, y_pred_thresh)
        f1 = metrics.f1_score(y_test, y_pred_thresh)
    else:
        prec = rec = f1 = 0
    
    precisions.append(prec)
    recalls.append(rec)
    f1_scores.append(f1)

axes[1, 2].plot(thresholds, precisions, label='Precision', linewidth=2)
axes[1, 2].plot(thresholds, recalls, label='Recall', linewidth=2)
axes[1, 2].plot(thresholds, f1_scores, label='F1-Score', linewidth=2)
axes[1, 2].axvline(x=0.5, color='red', linestyle='--', alpha=0.7, label='Default Threshold')
axes[1, 2].set_xlabel('Threshold')
axes[1, 2].set_ylabel('Score')
axes[1, 2].set_title('Threshold Analysis')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Model Interpretation: Partial Dependence Plots
print("\n=== Model Interpretation ===")

# Select the top 4 features (by permutation importance) for partial dependence plots.
# importance_df keeps its original integer index after sorting, so the index values
# are the column positions in X_train.
top_feature_indices = importance_df.head(4).index.tolist()
all_feature_names = [f'Feature_{i}' for i in range(X_train.shape[1])]

# Note: PartialDependenceDisplay.from_estimator expects feature_names to cover every column
try:
    from sklearn.inspection import PartialDependenceDisplay
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    axes = axes.ravel()
    
    for i, feature_idx in enumerate(top_feature_indices):
        PartialDependenceDisplay.from_estimator(
            best_rf, X_train, features=[feature_idx], 
            ax=axes[i], feature_names=all_feature_names
        )
        axes[i].set_title(f'Partial Dependence: Feature {feature_idx}')
    
    plt.tight_layout()
    plt.show()
except Exception as e:
    print(f"Could not create partial dependence plots: {e}")

# SHAP values for model explanation (if SHAP is available)
try:
    import shap
    print("\n=== SHAP Analysis ===")
    
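    # Create SHAP explainer
    explainer = shap.TreeExplainer(best_rf)
    X_shap = X_test[:100]  # Subset for speed
    shap_values = explainer.shap_values(X_shap)
    
    # Depending on the SHAP version, the result is either a list with one array
    # per class or a single (n_samples, n_features, n_classes) array
    if isinstance(shap_values, list):
        shap_values_pos = shap_values[1]
    elif np.ndim(shap_values) == 3:
        shap_values_pos = shap_values[..., 1]
    else:
        shap_values_pos = shap_values
    
    # Summary plot
    shap.summary_plot(shap_values_pos, X_shap, 
                     feature_names=[f'Feature_{i}' for i in range(X_test.shape[1])],
                     show=False)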
    plt.title('SHAP Summary Plot (Class 1)')
    plt.tight_layout()
    plt.show()
    
except ImportError:
    print("SHAP not available. Install with: pip install shap")
except Exception as e:
    print(f"SHAP analysis failed: {e}")

# Model reliability analysis
print("\n=== Model Reliability Analysis ===")

# Prediction confidence analysis
confidence = np.abs(y_pred_proba - 0.5) * 2  # Distance from decision boundary
high_confidence = confidence > 0.8
medium_confidence = (confidence > 0.4) & (confidence <= 0.8)
low_confidence = confidence <= 0.4

print(f"High confidence predictions: {np.sum(high_confidence)} ({np.mean(high_confidence):.1%})")
print(f"Medium confidence predictions: {np.sum(medium_confidence)} ({np.mean(medium_confidence):.1%})")
print(f"Low confidence predictions: {np.sum(low_confidence)} ({np.mean(low_confidence):.1%})")

# Accuracy by confidence level
if np.sum(high_confidence) > 0:
    high_conf_accuracy = metrics.accuracy_score(y_test[high_confidence], y_pred[high_confidence])
    print(f"\nAccuracy on high confidence predictions: {high_conf_accuracy:.3f}")

if np.sum(low_confidence) > 0:
    low_conf_accuracy = metrics.accuracy_score(y_test[low_confidence], y_pred[low_confidence])
    print(f"Accuracy on low confidence predictions: {low_conf_accuracy:.3f}")
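
The calibration plot above only diagnoses miscalibration. `CalibratedClassifierCV`, imported at the top of this cell, can also be used to correct it. The sketch below compares Brier scores before and after sigmoid (Platt) calibration on synthetic data. The data, split, and model sizes are assumptions for illustration, and calibration is not guaranteed to lower the Brier score on every dataset.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Uncalibrated baseline
raw_model = RandomForestClassifier(n_estimators=50, random_state=0)
raw_model.fit(X_fit, y_fit)

# Calibrated variant: internal 3-fold CV fits the forest and the sigmoid mapping
# on disjoint parts of the training data
calibrated_model = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    method='sigmoid',
    cv=3,
)
calibrated_model.fit(X_fit, y_fit)

raw_proba = raw_model.predict_proba(X_eval)[:, 1]
cal_proba = calibrated_model.predict_proba(X_eval)[:, 1]

# Brier score: mean squared error of the predicted probabilities (lower is better)
brier_raw = brier_score_loss(y_eval, raw_proba)
brier_cal = brier_score_loss(y_eval, cal_proba)
print(f"Brier score (raw):        {brier_raw:.4f}")
print(f"Brier score (calibrated): {brier_cal:.4f}")
```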

## Advanced Ensemble Methods and Model Stacking

In [None]:
from sklearn.ensemble import VotingClassifier, BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier  # used by the bagging and AdaBoost variants below

print("=== Advanced Ensemble Methods ===")

# Define base models for ensemble
base_models = {
    'rf': RandomForestClassifier(n_estimators=100, random_state=42),
    'gb': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'svm': SVC(probability=True, random_state=42),
    'lr': LogisticRegression(random_state=42, max_iter=1000)
}

# Train individual models and collect predictions
individual_scores = {}
for name, model in base_models.items():
    scores = cross_val_score(model, X_train, y_train, cv=3, scoring='roc_auc')
    individual_scores[name] = scores.mean()
    print(f"{name:20s}: {scores.mean():.4f} ± {scores.std():.4f}")

# 1. Voting Classifier (Hard and Soft Voting)
print("\n=== Voting Classifiers ===")

# Hard voting
voting_hard = VotingClassifier(
    estimators=list(base_models.items()),
    voting='hard'
)

# Soft voting
voting_soft = VotingClassifier(
    estimators=list(base_models.items()),
    voting='soft'
)

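# Evaluate voting classifiers.
# Hard voting outputs class labels only (no predict_proba), so ROC AUC cannot be
# computed for it; it is scored with accuracy and kept out of the AUC comparison.
voting_scores = {}
soft_scores = cross_val_score(voting_soft, X_train, y_train, cv=3, scoring='roc_auc')
voting_scores['Soft Voting'] = soft_scores.mean()
print(f"{'Soft Voting (AUC)':20s}: {soft_scores.mean():.4f} ± {soft_scores.std():.4f}")

hard_scores = cross_val_score(voting_hard, X_train, y_train, cv=3, scoring='accuracy')
print(f"{'Hard Voting (acc.)':20s}: {hard_scores.mean():.4f} ± {hard_scores.std():.4f}")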

# 2. Model Stacking
print("\n=== Model Stacking ===")

# Generate cross-validation predictions for stacking
cv_folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
stacking_features = np.zeros((len(X_train), len(base_models)))

for i, (name, model) in enumerate(base_models.items()):
    cv_predictions = cross_val_predict(model, X_train, y_train, cv=cv_folds, method='predict_proba')
    stacking_features[:, i] = cv_predictions[:, 1]  # Use probability for class 1

# Train meta-learner
meta_learner = LogisticRegression(random_state=42)
meta_learner.fit(stacking_features, y_train)

# Evaluate stacking on test set
# First, get base model predictions on test set
test_stacking_features = np.zeros((len(X_test), len(base_models)))
for i, (name, model) in enumerate(base_models.items()):
    model.fit(X_train, y_train)
    test_pred_proba = model.predict_proba(X_test)[:, 1]
    test_stacking_features[:, i] = test_pred_proba

# Meta-learner prediction
stacking_pred_proba = meta_learner.predict_proba(test_stacking_features)[:, 1]
stacking_score = roc_auc_score(y_test, stacking_pred_proba)
print(f"Stacking Score: {stacking_score:.4f}")

# 3. Bagging with different base estimators
print("\n=== Bagging Variants ===")

# Note: scikit-learn >= 1.2 names this parameter `estimator`
# (the older `base_estimator` was removed in 1.4)
bagging_variants = {
    'Bagging + Decision Tree': BaggingClassifier(
        estimator=DecisionTreeClassifier(random_state=42),
        n_estimators=100, random_state=42
    ),
    'Bagging + SVM': BaggingClassifier(
        estimator=SVC(probability=True, random_state=42),
        n_estimators=10, random_state=42  # Fewer estimators due to SVM cost
    )
}

# Keep the mean scores so the comparison plot does not have to rerun cross-validation
bagging_scores = {}
for name, model in bagging_variants.items():
    scores = cross_val_score(model, X_train, y_train, cv=3, scoring='roc_auc')
    bagging_scores[name] = scores.mean()
    print(f"{name:30s}: {scores.mean():.4f} ± {scores.std():.4f}")

# 4. AdaBoost (decision stumps as weak learners)
# `estimator` replaces `base_estimator` in scikit-learn >= 1.2
ada_boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1, random_state=42),
    n_estimators=100, random_state=42
)
ada_scores = cross_val_score(ada_boost, X_train, y_train, cv=3, scoring='roc_auc')
print(f"AdaBoost:                     {ada_scores.mean():.4f} ± {ada_scores.std():.4f}")

# Ensemble comparison visualization
plt.figure(figsize=(15, 10))

# Compile all results.
# Caveat: 'Stacking' is a test-set score, while the other entries are 3-fold CV
# means on the training set, so the comparison is indicative only.
all_results = {**individual_scores, **voting_scores, 'Stacking': stacking_score}

# Add bagging and AdaBoost results (reusing the CV scores computed above)
all_results.update(bagging_scores)
all_results['AdaBoost'] = ada_scores.mean()

# Plot comparison
plt.subplot(2, 2, 1)
methods = list(all_results.keys())
scores = list(all_results.values())
colors = plt.cm.viridis(np.linspace(0, 1, len(methods)))

bars = plt.bar(range(len(methods)), scores, color=colors)
plt.xticks(range(len(methods)), methods, rotation=45, ha='right')
plt.ylabel('ROC AUC Score')
plt.title('Ensemble Methods Comparison')
plt.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, score in zip(bars, scores):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
             f'{score:.3f}', ha='center', va='bottom', fontsize=8)

# Learning curves for best ensemble
best_ensemble = voting_soft  # or choose the best performing one
train_sizes, train_scores, val_scores = learning_curve(
    best_ensemble, X_train, y_train, cv=3,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring='roc_auc', n_jobs=-1
)

plt.subplot(2, 2, 2)
plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training score')
plt.fill_between(train_sizes, train_scores.mean(axis=1) - train_scores.std(axis=1),
                train_scores.mean(axis=1) + train_scores.std(axis=1), alpha=0.3)
plt.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='Validation score')
plt.fill_between(train_sizes, val_scores.mean(axis=1) - val_scores.std(axis=1),
                val_scores.mean(axis=1) + val_scores.std(axis=1), alpha=0.3)
plt.xlabel('Training set size')
plt.ylabel('ROC AUC Score')
plt.title('Ensemble Learning Curves')
plt.legend()
plt.grid(True, alpha=0.3)

# Feature importance from stacking meta-learner
plt.subplot(2, 2, 3)
meta_coefs = np.abs(meta_learner.coef_[0])
model_names = list(base_models.keys())
plt.bar(model_names, meta_coefs)
plt.ylabel('Meta-learner Coefficient (abs)')
plt.title('Model Importance in Stacking')
plt.grid(True, alpha=0.3, axis='y')

# Prediction correlation between models
plt.subplot(2, 2, 4)
correlation_matrix = np.corrcoef(stacking_features.T)
im = plt.imshow(correlation_matrix, cmap='RdBu_r', vmin=-1, vmax=1)
plt.colorbar(im)
plt.xticks(range(len(model_names)), model_names)
plt.yticks(range(len(model_names)), model_names)
plt.title('Model Prediction Correlation')

# Add correlation values
for i in range(len(model_names)):
    for j in range(len(model_names)):
        plt.text(j, i, f'{correlation_matrix[i, j]:.2f}',
                ha='center', va='center', fontweight='bold')

plt.tight_layout()
plt.show()

print(f"\nBest individual model: {max(individual_scores, key=individual_scores.get)} "
      f"({max(individual_scores.values()):.4f})")
print(f"Best ensemble method: {max(all_results, key=all_results.get)} "
      f"({max(all_results.values()):.4f})")
print(f"Improvement: {max(all_results.values()) - max(individual_scores.values()):.4f}")
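
The manual stacking above (out-of-fold predictions from `cross_val_predict`, then a logistic-regression meta-learner) can also be expressed with scikit-learn's built-in `StackingClassifier`, which handles the out-of-fold bookkeeping internally and can be cross-validated like any other estimator. This is a minimal sketch on synthetic data with only two base models; the data and model choices are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('lr', LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),
    cv=3,                          # folds used to build out-of-fold meta-features
    stack_method='predict_proba',  # feed class probabilities to the meta-learner
)

# Because the stack is a regular estimator, it gets a CV score that is directly
# comparable with the other CV-scored models
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
stack_scores = cross_val_score(stack, X_demo, y_demo, cv=outer_cv, scoring='roc_auc')
print(f"Stacking (CV) ROC AUC: {stack_scores.mean():.4f} ± {stack_scores.std():.4f}")
```

Scoring the stack with cross-validation avoids mixing a single test-set score with CV means in one comparison chart.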

## Model Deployment Preparation and Production Considerations

In [None]:
import os  # needed for os.path.getsize in the serialization step
import pickle
import joblib
import json
from sklearn.base import BaseEstimator, TransformerMixin
from datetime import datetime
import time

print("=== Model Deployment Preparation ===")

# 1. Create a complete pipeline for production
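class ProductionPipeline(BaseEstimator):
    """
    Complete pipeline for production deployment.

    Wraps preprocessing, feature selection, and a classifier behind
    predict/predict_proba. It does not implement transform(), so
    TransformerMixin is not inherited.
    """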
    
    def __init__(self, preprocessor, feature_selector, model):
        self.preprocessor = preprocessor
        self.feature_selector = feature_selector
        self.model = model
        self.feature_names_ = None
        self.training_date_ = None
        self.version_ = "1.0"
    
    def fit(self, X, y):
        """Fit the complete pipeline"""
        # Store training metadata
        self.training_date_ = datetime.now()
        self.feature_names_ = list(X.columns)
        
        # Fit preprocessing
        X_preprocessed = self.preprocessor.fit_transform(X)
        
        # Fit feature selection
        X_selected = self.feature_selector.fit_transform(X_preprocessed, y)
        
        # Fit model
        self.model.fit(X_selected, y)
        
        return self
    
    def predict(self, X):
        """Make predictions"""
        X_processed = self._transform(X)
        return self.model.predict(X_processed)
    
    def predict_proba(self, X):
        """Get prediction probabilities"""
        X_processed = self._transform(X)
        return self.model.predict_proba(X_processed)
    
    def _transform(self, X):
        """Internal transformation method"""
        # Validate input features
        if hasattr(X, 'columns'):
            missing_features = set(self.feature_names_) - set(X.columns)
            if missing_features:
                raise ValueError(f"Missing features: {missing_features}")
            
            # Reorder columns to match training
            X = X[self.feature_names_]
        
        # Apply transformations
        X_preprocessed = self.preprocessor.transform(X)
        X_selected = self.feature_selector.transform(X_preprocessed)
        
        return X_selected
    
    def get_metadata(self):
        """Get model metadata"""
        return {
            'version': self.version_,
            'training_date': self.training_date_.isoformat() if self.training_date_ else None,
            'feature_names': self.feature_names_,
            'n_features': len(self.feature_names_) if self.feature_names_ else None,
            'model_type': type(self.model).__name__,
            'preprocessor_steps': list(self.preprocessor.named_transformers_.keys())
        }

# Create production pipeline
production_pipeline = ProductionPipeline(
    preprocessor=preprocessor,
    feature_selector=selector_model,  # Model-based selector (SelectFromModel) defined earlier
    model=best_rf
)

# Train the production pipeline
production_pipeline.fit(X.iloc[:800], y.iloc[:800])  # Use subset for training

print("Production pipeline trained successfully")
print(f"Metadata: {production_pipeline.get_metadata()}")

# 2. Model validation and testing
print("\n=== Model Validation ===")

# Test on holdout data
X_holdout = X.iloc[800:]
y_holdout = y.iloc[800:]

# Performance on holdout
holdout_pred = production_pipeline.predict(X_holdout)
holdout_pred_proba = production_pipeline.predict_proba(X_holdout)[:, 1]
holdout_score = roc_auc_score(y_holdout, holdout_pred_proba)

print(f"Holdout AUC Score: {holdout_score:.4f}")
print(f"Holdout Accuracy: {metrics.accuracy_score(y_holdout, holdout_pred):.4f}")

# Prediction latency test
print("\n=== Performance Testing ===")

# Single prediction latency.
# time.perf_counter() is a monotonic, high-resolution clock intended for timing;
# a single call is noisy, so treat these numbers as rough indications.
single_sample = X_holdout.iloc[:1]
start_time = time.perf_counter()
single_pred = production_pipeline.predict_proba(single_sample)
single_latency = time.perf_counter() - start_time

print(f"Single prediction latency: {single_latency*1000:.2f} ms")

# Batch prediction latency
batch_sample = X_holdout.iloc[:100]
start_time = time.perf_counter()
batch_pred = production_pipeline.predict_proba(batch_sample)
batch_latency = time.perf_counter() - start_time

print(f"Batch prediction latency ({len(batch_sample)} samples): {batch_latency*1000:.2f} ms")
print(f"Average per sample: {batch_latency/len(batch_sample)*1000:.2f} ms")

# 3. Model serialization
print("\n=== Model Serialization ===")

# Save with joblib (recommended for scikit-learn)
model_filename = 'production_model.joblib'
joblib.dump(production_pipeline, model_filename)
model_size = os.path.getsize(model_filename) / (1024 * 1024)  # MB
print(f"Model saved as {model_filename} ({model_size:.2f} MB)")

# Save metadata separately
metadata = production_pipeline.get_metadata()
with open('model_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

print("Metadata saved as model_metadata.json")

# Load and test
loaded_pipeline = joblib.load(model_filename)
test_pred = loaded_pipeline.predict_proba(single_sample)
print(f"Model loaded successfully. Test prediction: {test_pred[0]}")

# 4. Model monitoring setup
print("\n=== Model Monitoring Setup ===")

class ModelMonitor:
    """
    Basic model monitoring class
    """
    
    def __init__(self, reference_data, reference_predictions):
        self.reference_data = reference_data
        self.reference_predictions = reference_predictions
        self.reference_stats = self._calculate_stats(reference_data)
    
    def _calculate_stats(self, data):
        """Calculate basic statistics"""
        return {
            'mean': np.mean(data, axis=0),
            'std': np.std(data, axis=0),
            'min': np.min(data, axis=0),
            'max': np.max(data, axis=0)
        }
    
    def detect_drift(self, new_data, threshold=2.0):
        """Simple drift check: flag features whose mean moved by more than
        `threshold` reference standard deviations"""
        new_stats = self._calculate_stats(new_data)
        
        # Shift in means, in units of the reference standard deviation
        # (the small constant guards against zero-variance features)
        z_scores = np.abs((new_stats['mean'] - self.reference_stats['mean']) / 
                         (self.reference_stats['std'] + 1e-8))
        
        # Cast to built-in types so the result can be logged or JSON-serialized
        drift_detected = bool(np.any(z_scores > threshold))
        
        return {
            'drift_detected': drift_detected,
            'max_z_score': float(np.max(z_scores)),
            'affected_features': np.where(z_scores > threshold)[0].tolist()
        }
    
    def prediction_drift(self, new_predictions, threshold=0.1):
        """Detect drift in prediction distribution"""
        ref_mean = np.mean(self.reference_predictions)
        new_mean = np.mean(new_predictions)
        
        drift = abs(new_mean - ref_mean) > threshold
        
        return {
            'prediction_drift': drift,
            'reference_mean': ref_mean,
            'new_mean': new_mean,
            'difference': abs(new_mean - ref_mean)
        }

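# Set up monitoring with training data.
# The reference must live in the same feature space as the data checked later
# (preprocessed and feature-selected), so it goes through the pipeline's transform.
monitor = ModelMonitor(
    reference_data=production_pipeline._transform(X.iloc[:800]),
    reference_predictions=production_pipeline.predict_proba(X.iloc[:800])[:, 1]
)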

# Test drift detection with holdout data
holdout_processed = production_pipeline._transform(X_holdout)
drift_result = monitor.detect_drift(holdout_processed)
pred_drift_result = monitor.prediction_drift(holdout_pred_proba)

print(f"Data drift detected: {drift_result['drift_detected']}")
print(f"Max Z-score: {drift_result['max_z_score']:.3f}")
print(f"Prediction drift detected: {pred_drift_result['prediction_drift']}")
print(f"Prediction difference: {pred_drift_result['difference']:.3f}")

# 5. Production checklist
print("\n=== Production Deployment Checklist ===")
checklist = [
    "✓ Model pipeline created and tested",
    "✓ Performance validated on holdout data",
    "✓ Latency benchmarked",
    "✓ Model serialized and can be loaded",
    "✓ Metadata saved for tracking",
    "✓ Monitoring system implemented",
    "TODO: Set up A/B testing framework",
    "TODO: Implement logging for predictions",
    "TODO: Set up automated retraining",
    "TODO: Configure alerting for drift detection"
]

for item in checklist:
    print(item)

# Cleanup
import os
for file in [model_filename, 'model_metadata.json']:
    if os.path.exists(file):
        os.remove(file)
        print(f"Cleaned up {file}")

print("\n=== Production Considerations Summary ===")
print("1. Always validate on truly unseen data")
print("2. Monitor for data and prediction drift")
print("3. Log predictions for continuous learning")
print("4. Implement proper error handling")
print("5. Set up A/B testing for model updates")
print("6. Plan for model retraining schedule")
print("7. Document model assumptions and limitations")
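
The `ModelMonitor` above compares feature means only, so it can miss changes in spread or shape. One possible complement is a per-feature two-sample Kolmogorov-Smirnov test, sketched below with `scipy.stats.ks_2samp`. The helper name `ks_drift`, the significance level, the Bonferroni correction, and the synthetic data are assumptions for illustration, not part of the notebook's pipeline.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_drift(reference, new, alpha=0.01):
    """Hypothetical helper: flag features whose distribution differs
    between `reference` and `new` according to a two-sample KS test."""
    reference = np.asarray(reference)
    new = np.asarray(new)
    n_features = reference.shape[1]

    # Bonferroni correction: testing many features inflates false alarms
    corrected_alpha = alpha / n_features

    p_values = np.array([
        ks_2samp(reference[:, j], new[:, j]).pvalue
        for j in range(n_features)
    ])
    drifted = np.where(p_values < corrected_alpha)[0].tolist()
    return {'p_values': p_values, 'drifted_features': drifted}

# Synthetic demonstration: feature 2 is shifted by one standard deviation
rng = np.random.default_rng(0)
reference_demo = rng.normal(size=(500, 3))
new_demo = rng.normal(size=(500, 3))
new_demo[:, 2] += 1.0

drift_report = ks_drift(reference_demo, new_demo)
print(f"Drifted features: {drift_report['drifted_features']}")
```

With large samples the KS test flags even tiny, practically irrelevant shifts, so in practice it is usually paired with an effect-size threshold.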