# Module 4 - Lesson 4: Building Production Pipelines

## From Notebook to Production: Professional Workflows

### What You'll Learn:
- Building end-to-end scikit-learn pipelines
- Creating custom transformers for your specific needs
- Using ColumnTransformer for mixed data types
- Preventing data leakage with proper pipeline design
- Cross-validation with pipelines
- Saving and deploying pipelines
- Performance optimization techniques
- Production best practices

### Why This Matters:
Moving from experimental notebooks to production systems requires reproducible, maintainable code. A pipeline applies every transformation consistently and in a fixed order, and because each step is fitted only on the data passed to `fit`, it guards against data leakage as long as all fitted preprocessing lives inside it. This is often the difference between a proof-of-concept and a real-world ML system.

## Setup: Import Required Libraries

Let's import everything we'll need for building production pipelines:

In [66]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, TransformerMixin
import joblib
import pickle
import time
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
np.random.seed(42)

print("Libraries imported successfully!")
print("Scikit-learn pipeline tools ready for production!")

Libraries imported successfully!
Scikit-learn pipeline tools ready for production!


## 1. Understanding Pipeline Architecture

### What is a Pipeline?

A pipeline chains together multiple data transformation steps and a final estimator (model). Each step is applied in sequence, and the entire pipeline can be treated as a single object.

**Benefits:**
- **Reproducibility**: Same transformations applied consistently
- **No data leakage**: Transformations inside the pipeline are fitted only on the data passed to `fit` (the training data)
- **Convenience**: Single object to fit, predict, and deploy
- **Maintainability**: Clean, organized code

Let's start with a simple example:

In [67]:
# Create sample dataset
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    random_state=42
)

# Convert to DataFrame for clarity
feature_names = [f'feature_{i}' for i in range(20)]
X_df = pd.DataFrame(X, columns=feature_names)

print("Sample dataset created:")
print(f"Shape: {X_df.shape}")
print(f"Target distribution: {pd.Series(y).value_counts().to_dict()}")
print("\nFirst few rows:")
print(X_df.head())

Sample dataset created:
Shape: (1000, 20)
Target distribution: {0: 502, 1: 498}

First few rows:
   feature_0  feature_1  feature_2  feature_3  feature_4  feature_5  \
0  -4.906442   3.442789   0.558964  -0.976764  -1.568805  -4.271982   
1   2.162610  -5.286651   2.609846  -1.803898  -1.831216   1.450757   
2  -4.784844  -3.744827   4.657592  -1.408806  -5.444758  -2.416013   
3  10.465024   1.070944  -3.562432  -0.849062   2.183860  -0.609893   
4   5.599516  -1.776412  -1.304322  -0.720074   5.859373  -3.292432   

   feature_6  feature_7  feature_8  feature_9  feature_10  feature_11  \
0  -3.727921   0.111868   2.119795  -2.522812    3.352281   -7.492478   
1   2.648709   2.152307   0.524552   0.493548   -1.401809    6.680603   
2   3.556495  -1.572119  -0.730549   3.447661   -2.609052    7.961059   
3   0.946327  -1.046141  -2.057053  -2.056650   -2.215455   -1.449095   
4   3.152205   7.099882  -3.321076   3.245486   -0.336178    6.608729   

   feature_12  feature_13  feature_14

### Building Your First Pipeline

Let's build a simple pipeline that scales features and applies a classifier:

In [68]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Method 1: Using Pipeline class
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_df, y, test_size=0.2, random_state=42
)

# Fit entire pipeline on training data
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

# Evaluate
from sklearn.metrics import accuracy_score, classification_report
accuracy = accuracy_score(y_test, y_pred)

print("Simple Pipeline Results:")
print(f"Accuracy: {accuracy:.3f}")
print("\n✅ The pipeline:")
print("1. Scales features (fit on train, transform both)")
print("2. Trains classifier on scaled data")
print("3. Makes predictions on scaled test data")
print("All in one fit() and predict() call!")

Simple Pipeline Results:
Accuracy: 0.900

✅ The pipeline:
1. Scales features (fit on train, transform both)
2. Trains classifier on scaled data
3. Makes predictions on scaled test data
All in one fit() and predict() call!
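To make the single `fit()` and `predict()` call less of a black box, here is a minimal sketch comparing a pipeline with the same steps written out by hand. It uses `LogisticRegression` instead of the random forest and a smaller synthetic dataset; both are assumptions chosen to keep the example small and deterministic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Pipeline version: one fit, one predict
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Manual version: the scaler learns its statistics from the training split only,
# and the same fitted scaler is reused to transform the test split
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
manual_pred = clf.predict(scaler.transform(X_te))

same_predictions = np.array_equal(pipe_pred, manual_pred)
print("Pipeline and manual predictions identical:", same_predictions)
```

The manual version works, but every extra preprocessing step adds another place to accidentally call `fit` on the test data. The pipeline removes that opportunity.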


### Alternative: Using make_pipeline

For simpler syntax, use `make_pipeline`, which names each step automatically after its lowercased class name:

In [69]:
from sklearn.pipeline import make_pipeline

# Method 2: Using make_pipeline (automatic naming)
simple_pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(random_state=42)
)

# Fit and evaluate
simple_pipeline.fit(X_train, y_train)
accuracy = simple_pipeline.score(X_test, y_test)

print("Make_pipeline Results:")
print(f"Accuracy: {accuracy:.3f}")
print("\nPipeline steps (auto-named):")
for name, step in simple_pipeline.named_steps.items():
    print(f"  {name}: {step.__class__.__name__}")

Make_pipeline Results:
Accuracy: 0.900

Pipeline steps (auto-named):
  standardscaler: StandardScaler
  randomforestclassifier: RandomForestClassifier
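The step names matter because a step's parameters are addressed as `<step name>__<parameter>`, both in `set_params` and in a `GridSearchCV` parameter grid. The sketch below illustrates this on a small synthetic dataset; the choice of `LogisticRegression` and the candidate values for `C` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Change one hyperparameter of one step in place
pipe.set_params(logisticregression__C=0.5)
current_c = pipe.named_steps["logisticregression"].C

# Tune the same hyperparameter; the whole pipeline is refitted inside each CV split
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
best_c = search.best_params_["logisticregression__C"]

print("C after set_params:", current_c)
print("Best C from grid search:", best_c)
```

With an explicitly named `Pipeline`, the prefix is whatever name you chose, for example `classifier__n_estimators` for the first pipeline in this lesson.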


## 2. Handling Mixed Data Types with ColumnTransformer

### The Real-World Challenge

Real datasets have mixed types: numeric features need scaling, categorical features need encoding, and some features might need custom transformations. `ColumnTransformer` handles this elegantly:

In [70]:
# Create a realistic dataset with mixed types
np.random.seed(42)
n_samples = 1000

df = pd.DataFrame({
    # Numeric features
    'age': np.random.uniform(18, 70, n_samples),
    'income': np.random.lognormal(10, 1, n_samples),
    'credit_score': np.random.uniform(300, 850, n_samples),
    
    # Categorical features
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_samples),
    'employment': np.random.choice(['Full-time', 'Part-time', 'Self-employed'], n_samples),
    'city': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'], n_samples),
    
    # Binary feature
    'has_mortgage': np.random.choice(['Yes', 'No'], n_samples),
    
    # Target
    'approved': np.random.choice([0, 1], n_samples, p=[0.3, 0.7])
})

print("Mixed-type dataset:")
print(df.head())
print("\nData types:")
print(df.dtypes)
print("\n📝 We need different preprocessing for each type!")

Mixed-type dataset:
         age        income  credit_score    education     employment  \
0  37.476086  26309.911254    834.342980     Bachelor      Full-time   
1  67.437144   5794.448838    482.240854  High School      Full-time   
2  56.063685  32215.334697    565.122582          PhD  Self-employed   
3  49.130241  40561.951268    407.853742     Bachelor      Full-time   
4  26.112969  38553.048228    635.929039       Master  Self-employed   

       city has_mortgage  approved  
0   Houston          Yes         1  
1   Chicago           No         1  
2   Houston          Yes         0  
3   Phoenix           No         1  
4  New York          Yes         1  

Data types:
age             float64
income          float64
credit_score    float64
education        object
employment       object
city             object
has_mortgage     object
approved          int64
dtype: object

📝 We need different preprocessing for each type!


### Building a ColumnTransformer

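Let's create a preprocessing pipeline that handles each data type appropriately. Note that `approved` was generated at random, independently of the features, so the accuracy reported below only confirms that the pipeline runs end to end; always predicting the majority class would score about 0.70.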

In [71]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Identify column types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['education', 'employment', 'city']
binary_features = ['has_mortgage']

# Create preprocessing pipelines for each type
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

binary_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
])

# Combine into ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipeline, numeric_features),
        ('cat', categorical_pipeline, categorical_features),
        ('bin', binary_pipeline, binary_features)
    ])

# Create full pipeline with model
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Prepare data
X = df.drop('approved', axis=1)
y = df['approved']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and evaluate
full_pipeline.fit(X_train, y_train)
accuracy = full_pipeline.score(X_test, y_test)

print("Full Pipeline with ColumnTransformer:")
print(f"Accuracy: {accuracy:.3f}")
print("\n✅ Each column type processed appropriately:")
print("  - Numeric: Imputed (median) → Scaled")
print("  - Categorical: Imputed (constant) → One-hot encoded")
print("  - Binary: Imputed (mode) → Binary encoded")

Full Pipeline with ColumnTransformer:
Accuracy: 0.655

✅ Each column type processed appropriately:
  - Numeric: Imputed (median) → Scaled
  - Categorical: Imputed (constant) → One-hot encoded
  - Binary: Imputed (mode) → Binary encoded


## 3. Creating Custom Transformers

### When Built-in Transformers Aren't Enough

Sometimes you need custom transformations. Here's how to create transformers that integrate seamlessly with scikit-learn pipelines:

In [72]:
from sklearn.base import BaseEstimator, TransformerMixin

class LogTransformer(BaseEstimator, TransformerMixin):
    """Apply log transformation to specified columns"""
    
    def __init__(self, columns=None):
        self.columns = columns
        
    def fit(self, X, y=None):
        # Nothing to learn
        return self
    
    def transform(self, X):
        # Check if input is a DataFrame or numpy array
        if hasattr(X, 'columns'):
            # It's a DataFrame
            X_copy = X.copy()
            
            if self.columns:
                for col in self.columns:
                    if col in X_copy.columns:
                        X_copy[col] = np.log1p(X_copy[col])  # log1p handles zeros
            else:
                # Apply to all numeric columns
                numeric_cols = X_copy.select_dtypes(include=[np.number]).columns
                X_copy[numeric_cols] = np.log1p(X_copy[numeric_cols])
        else:
            # It's a numpy array
            X_copy = X.copy() if hasattr(X, 'copy') else np.array(X)
            
            if self.columns is not None:
                # If columns specified as indices
                if isinstance(self.columns, (list, tuple)):
                    for col_idx in self.columns:
                        if isinstance(col_idx, int) and col_idx < X_copy.shape[1]:
                            X_copy[:, col_idx] = np.log1p(X_copy[:, col_idx])
            else:
                # Apply to all columns
                X_copy = np.log1p(X_copy)
            
        return X_copy


class OutlierCapper(BaseEstimator, TransformerMixin):
    """Cap outliers at specified percentiles"""
    
    def __init__(self, lower_percentile=1, upper_percentile=99):
        self.lower_percentile = lower_percentile
        self.upper_percentile = upper_percentile
        
    def fit(self, X, y=None):
        # Learned state is created in fit (not __init__) so refitting starts clean
        self.bounds_ = {}
        # Check if input is a DataFrame or numpy array
        if hasattr(X, 'columns'):
            # It's a DataFrame
            numeric_cols = X.select_dtypes(include=[np.number]).columns
            
            for col in numeric_cols:
                self.bounds_[col] = {
                    'lower': np.percentile(X[col], self.lower_percentile),
                    'upper': np.percentile(X[col], self.upper_percentile)
                }
        else:
            # It's a numpy array
            n_features = X.shape[1] if len(X.shape) > 1 else 1
            
            for i in range(n_features):
                if len(X.shape) > 1:
                    col_data = X[:, i]
                else:
                    col_data = X
                    
                self.bounds_[i] = {
                    'lower': np.percentile(col_data, self.lower_percentile),
                    'upper': np.percentile(col_data, self.upper_percentile)
                }
                
                if len(X.shape) == 1:
                    break
                    
        return self
    
    def transform(self, X):
        # Check if input is a DataFrame or numpy array
        if hasattr(X, 'columns'):
            # It's a DataFrame
            X_copy = X.copy()
            
            for col, bounds in self.bounds_.items():
                if col in X_copy.columns:
                    X_copy[col] = X_copy[col].clip(
                        lower=bounds['lower'],
                        upper=bounds['upper']
                    )
        else:
            # It's a numpy array
            X_copy = X.copy() if hasattr(X, 'copy') else np.array(X)
            
            for col_idx, bounds in self.bounds_.items():
                if isinstance(col_idx, int):
                    if len(X_copy.shape) > 1 and col_idx < X_copy.shape[1]:
                        X_copy[:, col_idx] = np.clip(
                            X_copy[:, col_idx],
                            bounds['lower'],
                            bounds['upper']
                        )
                    elif len(X_copy.shape) == 1:
                        X_copy = np.clip(X_copy, bounds['lower'], bounds['upper'])
                        
        return X_copy


# Test custom transformers
print("Testing custom transformers:")
print("\nOriginal income stats:")
print(df['income'].describe())

# Apply log transformation
log_transformer = LogTransformer(columns=['income'])
df_log = log_transformer.fit_transform(df)
print("\nAfter log transformation:")
print(df_log['income'].describe())

# Apply outlier capping
outlier_capper = OutlierCapper(lower_percentile=5, upper_percentile=95)
df_capped = outlier_capper.fit_transform(df)
print("\nAfter outlier capping (5th-95th percentile):")
print(df_capped['income'].describe())

Testing custom transformers:

Original income stats:
count      1000.000000
mean      39250.174494
std       46037.236602
min        1186.365269
25%       12529.975968
50%       23961.790988
75%       46149.098972
max      536653.314326
Name: income, dtype: float64

After log transformation:
count    1000.000000
mean       10.098963
std         0.988865
min         7.079492
25%         9.435959
50%        10.084256
75%        10.739653
max        13.193109
Name: income, dtype: float64

After outlier capping (5th-95th percentile):
count      1000.000000
mean      35978.686333
std       32650.627898
min        4836.202374
25%       12529.975968
50%       23961.790988
75%       46149.098972
max      123870.609145
Name: income, dtype: float64
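For a stateless transformation such as the log transform, a full custom class is not always necessary. As a lighter alternative, this sketch wraps `np.log1p` in scikit-learn's `FunctionTransformer`; the sample income values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Wrap a plain function so it can be used as a pipeline step;
# inverse_func is optional and lets us map values back to the original scale
log_tf = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

income = np.array([[0.0], [1200.0], [45000.0], [536000.0]])  # illustrative values
income_log = log_tf.fit_transform(income)
income_back = log_tf.inverse_transform(income_log)

print(income_log.ravel())
print(income_back.ravel())
```

A custom class remains the better fit when the transformer must learn something in `fit`, as `OutlierCapper` does with its percentile bounds.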


### Advanced Custom Transformer: Feature Engineering

In [73]:
class FeatureEngineer(BaseEstimator, TransformerMixin):
    """Create interaction and ratio features"""
    
    def __init__(self, create_interactions=True, create_ratios=True, 
                 age_col=None, income_col=None, credit_col=None):
        self.create_interactions = create_interactions
        self.create_ratios = create_ratios
        # For DataFrame: column names, for array: column indices
        self.age_col = age_col
        self.income_col = income_col
        self.credit_col = credit_col
        
    def fit(self, X, y=None):
        # Auto-detect column positions if not specified
        if hasattr(X, 'columns'):
            # It's a DataFrame - find column positions
            if self.age_col is None and 'age' in X.columns:
                self.age_col = 'age'
            if self.income_col is None and 'income' in X.columns:
                self.income_col = 'income'
            if self.credit_col is None and 'credit_score' in X.columns:
                self.credit_col = 'credit_score'
        return self
    
    def transform(self, X):
        if hasattr(X, 'columns'):
            # It's a DataFrame
            X_copy = X.copy()
            
            if self.create_interactions:
                # Age-income interaction
                if self.age_col in X_copy.columns and self.income_col in X_copy.columns:
                    X_copy['age_income_interaction'] = X_copy[self.age_col] * X_copy[self.income_col]
                
                # Credit score age interaction
                if self.credit_col in X_copy.columns and self.age_col in X_copy.columns:
                    X_copy['credit_age_interaction'] = X_copy[self.credit_col] * X_copy[self.age_col]
            
            if self.create_ratios:
                # Income to age ratio
                if self.income_col in X_copy.columns and self.age_col in X_copy.columns:
                    X_copy['income_per_year'] = X_copy[self.income_col] / (X_copy[self.age_col] + 1)
                
                # Credit to income ratio
                if self.credit_col in X_copy.columns and self.income_col in X_copy.columns:
                    X_copy['credit_income_ratio'] = X_copy[self.credit_col] / (X_copy[self.income_col] + 1)
            
            return X_copy
        else:
            # It's a numpy array - we need to work with indices
            X_copy = X.copy() if hasattr(X, 'copy') else np.array(X)
            new_features = []
            
            # Map column identifiers to indices if they're integers
            age_idx = self.age_col if isinstance(self.age_col, int) else None
            income_idx = self.income_col if isinstance(self.income_col, int) else None
            credit_idx = self.credit_col if isinstance(self.credit_col, int) else None
            
            # For production pipeline, we know the order: age=0, income=1, credit_score=2
            # This is based on the numeric_features list order
            if age_idx is None and X.shape[1] >= 3:
                age_idx = 0
            if income_idx is None and X.shape[1] >= 3:
                income_idx = 1
            if credit_idx is None and X.shape[1] >= 3:
                credit_idx = 2
            
            if self.create_interactions:
                # Age-income interaction
                if age_idx is not None and income_idx is not None and \
                   age_idx < X_copy.shape[1] and income_idx < X_copy.shape[1]:
                    new_features.append(X_copy[:, age_idx] * X_copy[:, income_idx])
                
                # Credit score age interaction
                if credit_idx is not None and age_idx is not None and \
                   credit_idx < X_copy.shape[1] and age_idx < X_copy.shape[1]:
                    new_features.append(X_copy[:, credit_idx] * X_copy[:, age_idx])
            
            if self.create_ratios:
                # Income to age ratio
                if income_idx is not None and age_idx is not None and \
                   income_idx < X_copy.shape[1] and age_idx < X_copy.shape[1]:
                    new_features.append(X_copy[:, income_idx] / (X_copy[:, age_idx] + 1))
                
                # Credit to income ratio
                if credit_idx is not None and income_idx is not None and \
                   credit_idx < X_copy.shape[1] and income_idx < X_copy.shape[1]:
                    new_features.append(X_copy[:, credit_idx] / (X_copy[:, income_idx] + 1))
            
            # Concatenate new features if any were created
            if new_features:
                new_features_array = np.column_stack(new_features)
                X_copy = np.hstack([X_copy, new_features_array])
            
            return X_copy


# Test feature engineering
feature_engineer = FeatureEngineer(age_col='age', income_col='income', credit_col='credit_score')
df_engineered = feature_engineer.fit_transform(df)

print("Original columns:", df.shape[1])
print("After feature engineering:", df_engineered.shape[1])
print("\nNew features created:")
new_features = set(df_engineered.columns) - set(df.columns)
for feature in new_features:
    print(f"  - {feature}")

Original columns: 8
After feature engineering: 12

New features created:
  - income_per_year
  - credit_age_interaction
  - credit_income_ratio
  - age_income_interaction


## 4. Complete Production Pipeline

### Putting It All Together

Let's build a complete pipeline that includes, in execution order:
1. Custom feature engineering
2. Missing value imputation
3. Outlier capping and log transformation
4. Scaling and encoding
5. Feature selection
6. Model training

In [74]:
# Add some missing values to make it realistic
df_prod = df.copy()
missing_indices = np.random.choice(df_prod.index, size=50, replace=False)
df_prod.loc[missing_indices, 'income'] = np.nan
missing_indices = np.random.choice(df_prod.index, size=30, replace=False)
df_prod.loc[missing_indices, 'education'] = np.nan

print("Production dataset with missing values:")
print(df_prod.isnull().sum())

# Complete production pipeline
from sklearn.feature_selection import SelectKBest, f_classif

# Create the feature engineering step with column specifications
feature_engineer = FeatureEngineer(age_col='age', income_col='income', credit_col='credit_score')

# Apply feature engineering to get column names
df_temp = feature_engineer.fit_transform(df_prod.drop('approved', axis=1))
all_numeric = df_temp.select_dtypes(include=[np.number]).columns.tolist()
all_categorical = df_temp.select_dtypes(exclude=[np.number]).columns.tolist()

# Numeric preprocessing
# Note: SimpleImputer returns a NumPy array, so later steps address columns by position.
# all_numeric starts with ['age', 'income', 'credit_score', ...], so income is at index 1.
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('outlier_capper', OutlierCapper(lower_percentile=5, upper_percentile=95)),
    ('log_transformer', LogTransformer(columns=[1])),  # Apply log to income column (index 1)
    ('scaler', StandardScaler())
])

# Categorical preprocessing
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Preprocessing with dynamic column selection
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, all_numeric),
        ('cat', categorical_transformer, all_categorical)
    ])

# Complete pipeline
production_pipeline = Pipeline([
    ('feature_engineer', FeatureEngineer(age_col='age', income_col='income', credit_col='credit_score')),
    ('preprocessor', preprocessor),
    ('feature_selection', SelectKBest(f_classif, k=20)),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Prepare data
X_prod = df_prod.drop('approved', axis=1)
y_prod = df_prod['approved']
X_train, X_test, y_train, y_test = train_test_split(X_prod, y_prod, test_size=0.2, random_state=42)

# Fit pipeline
production_pipeline.fit(X_train, y_train)

# Evaluate
train_score = production_pipeline.score(X_train, y_train)
test_score = production_pipeline.score(X_test, y_test)

print("\n" + "="*50)
print("COMPLETE PRODUCTION PIPELINE RESULTS")
print("="*50)
print(f"Training accuracy: {train_score:.3f}")
print(f"Testing accuracy: {test_score:.3f}")
print("\n✅ Pipeline steps executed:")
for i, (name, step) in enumerate(production_pipeline.named_steps.items(), 1):
    print(f"{i}. {name}: {step.__class__.__name__}")

Production dataset with missing values:
age              0
income          50
credit_score     0
education       30
employment       0
city             0
has_mortgage     0
approved         0
dtype: int64

COMPLETE PRODUCTION PIPELINE RESULTS
Training accuracy: 1.000
Testing accuracy: 0.645

✅ Pipeline steps executed:
1. feature_engineer: FeatureEngineer
2. preprocessor: ColumnTransformer
3. feature_selection: SelectKBest
4. classifier: RandomForestClassifier
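The pipeline above runs the feature engineer once up front just to collect the numeric and categorical column names. One possible alternative, sketched here on a made-up three-column DataFrame, is `make_column_selector`, which picks columns by dtype at fit time so the lists never need to be precomputed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, invented for this illustration
toy = pd.DataFrame({
    "age": [25.0, 40.0, 31.0, 58.0],
    "income": [30000.0, 52000.0, 41000.0, 88000.0],
    "city": ["Chicago", "Houston", "Chicago", "Phoenix"],
})

# Selectors are callables that receive the DataFrame and return column names
numeric_selector = make_column_selector(dtype_include=np.number)
categorical_selector = make_column_selector(dtype_exclude=np.number)

selector_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_selector),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_selector),
])
out = selector_preprocessor.fit_transform(toy)

numeric_cols = numeric_selector(toy)
categorical_cols = categorical_selector(toy)
print(numeric_cols, categorical_cols, out.shape)
```

The trade-off is that dtype-based selection silently picks up any new column of a matching dtype, so explicit column lists can still be preferable when the input schema should be strict.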


## 5. Cross-Validation with Pipelines

### Preventing Data Leakage

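One of the biggest advantages of pipelines is that they prevent data leakage during cross-validation. When the whole pipeline is passed to `cross_val_score`, it is refitted from scratch on each set of training folds, so the held-out fold never influences imputation values, scaling statistics, or the selected features: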

In [75]:
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Cross-validation with pipeline (CORRECT - no data leakage)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Note: This might take a moment to run
cv_scores = cross_val_score(
    production_pipeline, 
    X_prod, 
    y_prod, 
    cv=cv, 
    scoring='accuracy',
    n_jobs=-1
)

print("Cross-validation with Pipeline (No Data Leakage):")
print(f"CV Scores: {cv_scores}")
print(f"Mean CV Score: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

print("\n✅ Why this prevents data leakage:")
print("1. For each fold, the pipeline is fitted ONLY on training data")
print("2. Scaler means/stds calculated from training fold only")
print("3. Encoder categories learned from training fold only")
print("4. Test fold transformed using training parameters")
print("5. This mimics real-world deployment perfectly!")

Cross-validation with Pipeline (No Data Leakage):
CV Scores: [0.665 0.65  0.675 0.685 0.675]
Mean CV Score: 0.670 (+/- 0.024)

✅ Why this prevents data leakage:
1. For each fold, the pipeline is fitted ONLY on training data
2. Scaler means/stds calculated from training fold only
3. Encoder categories learned from training fold only
4. Test fold transformed using training parameters
5. This mimics real-world deployment perfectly!
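To see what leakage looks like in practice, consider this sketch on pure noise, where no model should beat chance. It compares feature selection done once on the full dataset with the same selection done inside a pipeline. The dataset sizes, `k=20`, and `LogisticRegression` are assumptions chosen to make the effect easy to see.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))   # pure noise features
y_noise = rng.integers(0, 2, size=100)   # labels unrelated to the features

# WRONG: selection sees every label, including those of the future test folds
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5
).mean()

# RIGHT: selection is part of the pipeline and is refitted on each training fold
honest_pipe = make_pipeline(
    SelectKBest(f_classif, k=20),
    LogisticRegression(max_iter=1000),
)
honest_score = cross_val_score(honest_pipe, X_noise, y_noise, cv=5).mean()

print(f"Selection outside CV (leaky): {leaky_score:.3f}")
print(f"Selection inside pipeline:    {honest_score:.3f}")
```

The leaky estimate typically lands well above chance even though the labels are random, while the pipeline estimate stays near 0.5, which is the honest answer for this data.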


## 6. Saving and Loading Pipelines

### Deployment Ready

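Once your pipeline is trained, save it for deployment. Both methods below use Python's pickle format, so the custom transformer classes (`FeatureEngineer`, `OutlierCapper`, `LogTransformer`) must be importable in the environment that loads the file: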

In [77]:
import joblib
import pickle
import os

# Method 1: Using joblib (recommended for scikit-learn)
joblib.dump(production_pipeline, 'production_pipeline.joblib')
print("Pipeline saved with joblib")

# Method 2: Using pickle
with open('production_pipeline.pkl', 'wb') as f:
    pickle.dump(production_pipeline, f)
print("Pipeline saved with pickle")

# Check file sizes
joblib_size = os.path.getsize('production_pipeline.joblib') / 1024  # KB
pickle_size = os.path.getsize('production_pipeline.pkl') / 1024  # KB

print(f"\nFile sizes:")
print(f"Joblib: {joblib_size:.1f} KB")
print(f"Pickle: {pickle_size:.1f} KB")
print("\n💡 Joblib is usually more efficient for NumPy arrays")

Pipeline saved with joblib
Pipeline saved with pickle

File sizes:
Joblib: 3041.5 KB
Pickle: 3031.4 KB

💡 Joblib is usually more efficient for NumPy arrays
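A pickled pipeline is not guaranteed to load correctly under a different scikit-learn version, and unpickling a file can execute arbitrary code, so only load artifacts from sources you trust. One hedged way to make version mismatches visible is to store the library version next to the model, as sketched below. The bundle layout and the file name `pipeline_bundle.joblib` are assumptions, not a scikit-learn convention.

```python
import os
import tempfile

import joblib
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

# Bundle the fitted pipeline with the version it was trained under
artifact = {"pipeline": pipe, "sklearn_version": sklearn.__version__}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline_bundle.joblib")  # hypothetical file name
    joblib.dump(artifact, path)
    loaded = joblib.load(path)

# At load time, compare versions before trusting the predictions
version_matches = loaded["sklearn_version"] == sklearn.__version__
same_output = np.array_equal(loaded["pipeline"].predict(X), pipe.predict(X))
print("Version matches:", version_matches)
print("Predictions identical after reload:", same_output)
```

In a real deployment the version check would typically raise an error or log a warning when the versions differ, rather than just printing a flag.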


### Loading and Using in Production

In [78]:
# Simulate production environment
# Load the saved pipeline
loaded_pipeline = joblib.load('production_pipeline.joblib')

# Simulate new data coming in
new_data = pd.DataFrame({
    'age': [25, 45, 33],
    'income': [45000, 85000, np.nan],  # Note: missing value!
    'credit_score': [720, 650, 800],
    'education': ['Bachelor', 'PhD', 'Master'],
    'employment': ['Full-time', 'Self-employed', 'Full-time'],
    'city': ['Chicago', 'New York', 'Los Angeles'],
    'has_mortgage': ['No', 'Yes', 'Yes']
})

print("New data for prediction:")
print(new_data)

# Make predictions
predictions = loaded_pipeline.predict(new_data)
probabilities = loaded_pipeline.predict_proba(new_data)

print("\nPredictions:")
for i, (pred, prob) in enumerate(zip(predictions, probabilities[:, 1])):
    status = "Approved" if pred == 1 else "Rejected"
    print(f"Application {i+1}: {status} (Probability: {prob:.2%})")

print("\n✅ Note: The pipeline handled the missing value automatically!")

# Clean up
os.remove('production_pipeline.joblib')
os.remove('production_pipeline.pkl')

New data for prediction:
   age   income  credit_score education     employment         city  \
0   25  45000.0           720  Bachelor      Full-time      Chicago   
1   45  85000.0           650       PhD  Self-employed     New York   
2   33      NaN           800    Master      Full-time  Los Angeles   

  has_mortgage  
0           No  
1          Yes  
2          Yes  

Predictions:
Application 1: Rejected (Probability: 47.00%)
Application 2: Approved (Probability: 65.00%)
Application 3: Rejected (Probability: 36.00%)

✅ Note: The pipeline handled the missing value automatically!
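The pipeline copes with missing values, but it still expects the same columns it was trained on. A small guard in front of `predict` can fail fast with a clear message when the incoming data does not match. The helper `validate_input` below is hypothetical, and the column list is copied from the training data used in this lesson.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "age", "income", "credit_score",
    "education", "employment", "city", "has_mortgage",
]


def validate_input(frame, expected_columns=EXPECTED_COLUMNS):
    """Hypothetical helper: reject missing columns, drop extras, fix the order."""
    missing = [col for col in expected_columns if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return frame[expected_columns]


request = pd.DataFrame({
    "age": [25], "income": [45000.0], "credit_score": [720],
    "education": ["Bachelor"], "employment": ["Full-time"],
    "city": ["Chicago"], "has_mortgage": ["No"],
    "request_id": ["abc-123"],  # extra column that the model never saw
})

clean = validate_input(request)
print(list(clean.columns))

try:
    validate_input(request.drop(columns=["income"]))
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

The validated frame can then be passed to `loaded_pipeline.predict` as in the cell above.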
