# 🚀 Machine Learning Pipelines: Complete Preprocessing & Model Evaluation

## 📋 Overview
This notebook demonstrates a **complete machine learning pipeline** from raw data to model evaluation. We'll use the famous **Tips dataset** to predict meal timing (Lunch vs Dinner) based on customer and meal characteristics.

## 🎯 Learning Objectives
By the end of this notebook, you'll understand:
- ✅ **Data Loading & Exploration**: Loading datasets and understanding structure
- ✅ **Target Variable Encoding**: Converting categorical targets to numerical
- ✅ **Train-Test Splitting**: Proper data partitioning with stratification
- ✅ **Missing Value Analysis**: Identifying and handling missing data
- ✅ **Preprocessing Pipelines**: Automated feature transformation workflows
- ✅ **Column Transformers**: Applying different preprocessing to different column types
- ✅ **Model Training**: Training multiple algorithms on preprocessed data
- ✅ **Model Evaluation**: Comparing model performance systematically

## 🛠️ Technical Stack
- **pandas**: Data manipulation and analysis
- **scikit-learn**: Machine learning algorithms and preprocessing
- **seaborn**: Dataset loading and visualization
- **numpy**: Numerical computing

## 📊 Dataset: Restaurant Tips
- **Source**: Seaborn built-in dataset
- **Samples**: 244 restaurant visits
- **Features**: Bill amount, tip, party size, customer demographics
- **Target**: Meal time (Lunch vs Dinner)
- **Task**: Binary classification

## 🔄 Pipeline Workflow
```
Raw Data → EDA → Target Encoding → Train/Test Split → 
Missing Value Check → Feature Pipeline Design → 
Data Preprocessing → Model Training → Performance Evaluation
```

---

In [None]:
# 📚 ESSENTIAL LIBRARIES FOR MACHINE LEARNING PIPELINES
# ===================================================

# Core data manipulation and analysis
import pandas as pd          # DataFrame operations and data manipulation
import numpy as np           # Numerical computing and array operations
import matplotlib.pyplot as plt  # Basic plotting and visualization
import seaborn as sns        # Advanced statistical visualizations and built-in datasets

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings("ignore")

print("✅ All libraries imported successfully!")
print("🎯 Ready for: Data loading, EDA, preprocessing, and model training")

In [None]:
# 🍽️ LOADING THE TIPS DATASET
# ============================

# Load the famous 'tips' dataset from seaborn
# This dataset contains information about restaurant tips, meals, and customer demographics
df = sns.load_dataset("tips")

print("📊 DATASET LOADED SUCCESSFULLY!")
print(f"Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns")
print("\n🔍 First few rows:")
df.head()

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [None]:
# 🎯 DEFINING OUR PREDICTION TARGET
# =================================

# We'll predict 'time' - whether the meal was during Lunch or Dinner
print("🎯 PREDICTION OBJECTIVE: Classify meal time (Lunch vs Dinner)")
print("=" * 55)

# Explore unique values in the target variable
unique_times = df.time.unique()
print(f"📋 Unique time periods: {unique_times}")
print(f"📊 Number of classes: {len(unique_times)}")

# Count distribution
time_counts = df.time.value_counts()
print(f"\n📈 Class distribution:")
for time_period, count in time_counts.items():
    percentage = (count / len(df)) * 100
    print(f"   {time_period}: {count} samples ({percentage:.1f}%)")
    
print(f"\n💡 This is a binary classification problem!")
print(f"   We'll use customer features to predict meal timing")

['Dinner', 'Lunch']
Categories (2, object): ['Lunch', 'Dinner']

In [None]:
# 🔍 EXPLORATORY DATA ANALYSIS (EDA)
# ===================================

print("🕵️ DATASET OVERVIEW:")
print("=" * 30)

# Get dataset information
print("📋 Dataset Info:")
print(f"   • Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"   • Memory usage: {df.memory_usage(deep=True).sum()} bytes")
print()

# Display data types and non-null counts
print("🏗️ Column Details:")
df.info()

print(f"\n📊 SUMMARY STATISTICS:")
print("=" * 25)
df.describe()

<bound method DataFrame.info of      total_bill   tip     sex smoker   day    time  size
0         16.99  1.01  Female     No   Sun  Dinner     2
1         10.34  1.66    Male     No   Sun  Dinner     3
2         21.01  3.50    Male     No   Sun  Dinner     3
3         23.68  3.31    Male     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
..          ...   ...     ...    ...   ...     ...   ...
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2

[244 rows x 7 columns]>

In [72]:
# 🏷️ TARGET VARIABLE ENCODING
# ============================

print("🏷️ ENCODING TARGET VARIABLE (TIME)")
print("=" * 40)

# Since 'time' is categorical (nominal), we use LabelEncoder
# This converts text labels to numerical values for ML algorithms
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
print(f"📝 Original values: {df['time'].unique()}")

# Transform categorical time values to numerical
df['time'] = encoder.fit_transform(df['time'])

print(f"🔢 Encoded values: {df['time'].unique()}")
print(f"📊 Encoding mapping:")
for i, label in enumerate(encoder.classes_):
    print(f"   '{label}' → {i}")

print(f"\n✅ Target variable successfully encoded!")
print(f"📈 Class distribution after encoding:")
print(df['time'].value_counts().sort_index())

# Display the updated dataset
print(f"\n🔍 Updated dataset:")
df

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,0,2
1,10.34,1.66,Male,No,Sun,0,3
2,21.01,3.50,Male,No,Sun,0,3
3,23.68,3.31,Male,No,Sun,0,2
4,24.59,3.61,Female,No,Sun,0,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,0,3
240,27.18,2.00,Female,Yes,Sat,0,2
241,22.67,2.00,Male,Yes,Sat,0,2
242,17.82,1.75,Male,No,Sat,0,2
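
The mapping in this cell follows from how `LabelEncoder` works: it sorts the distinct labels and numbers them in that order, which is why `Dinner` becomes 0 and `Lunch` becomes 1. A minimal sketch on a hypothetical list of labels (the `toy_*` names are illustrative and not part of the notebook):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the 'time' column
toy_labels = ["Dinner", "Lunch", "Dinner", "Dinner", "Lunch"]

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)

# classes_ holds the labels in sorted order; position = encoded value
toy_mapping = {str(label): int(code) for code, label in enumerate(toy_encoder.classes_)}
print(toy_mapping)

# inverse_transform recovers the original text labels
toy_decoded = toy_encoder.inverse_transform(toy_encoded)
print(list(toy_decoded))
```

Because the numbering depends only on sort order, it carries no ranking between the classes; it is simply a compact way to hand the target to scikit-learn.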


In [None]:
# ✅ VERIFICATION OF TARGET ENCODING
# ==================================

print("🔍 VERIFYING TARGET VARIABLE ENCODING:")
print("=" * 42)

# Verify the encoding worked correctly
encoded_values = df.time.unique()
print(f"📊 Encoded time values: {sorted(encoded_values)}")
print(f"📈 Value counts:")

for value in sorted(encoded_values):
    count = (df.time == value).sum()
    original_label = encoder.inverse_transform([value])[0]
    print(f"   {value} ('{original_label}'): {count} samples")

print(f"\n✅ Target encoding verification complete!")
print(f"💡 Ready for train-test split and model training")

array([0, 1])

In [None]:
# 🎯 PREPARING FEATURES AND TARGET VARIABLES
# ===========================================

print("🎯 SEPARATING FEATURES AND TARGET:")
print("=" * 38)

# Separate features (X) and target variable (y)
X = df.drop('time', axis=1)  # Features: all columns except 'time'
y = df.time                  # Target: 'time' column (0=Dinner, 1=Lunch)

print(f"📊 DATASET SPLIT SUMMARY:")
print(f"   Features (X): {X.shape[0]} samples × {X.shape[1]} features")
print(f"   Target (y):   {y.shape[0]} samples")
print()

print(f"📋 FEATURE COLUMNS:")
print(f"   {list(X.columns)}")
print()

print(f"🎯 TARGET DISTRIBUTION:")
for value in sorted(y.unique()):
    count = (y == value).sum()
    original_label = encoder.inverse_transform([value])[0]
    percentage = (count / len(y)) * 100
    print(f"   {value} ('{original_label}'): {count} samples ({percentage:.1f}%)")

print(f"\n✅ Features and target successfully separated!")
print(f"💡 Ready for train-test split")

In [None]:
# 🔄 TRAIN-TEST SPLIT
# ===================

print("🔄 CREATING TRAIN-TEST SPLIT:")
print("=" * 32)

# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split the data: 80% training, 20% testing
# random_state=1 ensures reproducible results
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% for testing
    random_state=1,     # For reproducibility
    stratify=y          # Maintain class distribution in both splits
)

print(f"📊 SPLIT SUMMARY:")
print(f"   Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"   Test set:     {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print()

print(f"🎯 CLASS DISTRIBUTION CHECK:")
print(f"   Training target distribution:")
for value in sorted(y_train.unique()):
    count = (y_train == value).sum()
    percentage = (count / len(y_train)) * 100
    original_label = encoder.inverse_transform([value])[0]
    print(f"     {value} ('{original_label}'): {count} samples ({percentage:.1f}%)")

print(f"   Test target distribution:")
for value in sorted(y_test.unique()):
    count = (y_test == value).sum()
    percentage = (count / len(y_test)) * 100
    original_label = encoder.inverse_transform([value])[0]
    print(f"     {value} ('{original_label}'): {count} samples ({percentage:.1f}%)")

print(f"\n✅ Train-test split completed successfully!")
print(f"💡 Stratified split preserves the class proportions in both sets")
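
The effect of `stratify=y` is easiest to see next to an unstratified split. The sketch below uses a hypothetical 70/30 target (not the Tips data) and compares the share of the minority class in the two test sets; all `toy_*` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 70% class 0, 30% class 1
toy_y = np.array([0] * 70 + [1] * 30)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

# Same split settings, with and without stratification
_, _, _, y_test_strat = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1, stratify=toy_y
)
_, _, _, y_test_plain = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1
)

strat_share = y_test_strat.mean()  # share of class 1 with stratification
plain_share = y_test_plain.mean()  # share of class 1 without it
print(f"stratified: {strat_share:.2f}, unstratified: {plain_share:.2f}")
```

With stratification the test share stays close to the overall 30%, while the unstratified share can drift with the random seed. That drift matters more on small datasets such as this one.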

In [76]:
# 🔍 EXAMINING TRAINING DATA STRUCTURE
# ====================================

print("🔍 TRAINING DATA EXAMINATION:")
print("=" * 33)

# Display first few rows of training features
print(f"📊 Training set shape: {X_train.shape}")
print(f"📋 Feature columns: {list(X_train.columns)}")
print()

print(f"🔍 First 5 training samples:")
display(X_train.head())

print(f"\n📈 DATA TYPES:")
for col in X_train.columns:
    dtype = X_train[col].dtype
    unique_count = X_train[col].nunique()
    print(f"   {col:<12}: {dtype} ({unique_count} unique values)")

print(f"\n💡 OBSERVATIONS:")
print(f"   • Numerical features: total_bill, tip, size")  
print(f"   • Categorical features: sex, smoker, day")
print(f"   • Mixed data types require preprocessing pipeline!")
print(f"   • Next: Check for missing values and apply preprocessing")

,total_bill,tip,sex,smoker,day,size
0,16.99,1.01,Female,No,Sun,2
154,19.77,2.0,Male,No,Sun,4
167,31.71,4.5,Male,No,Sun,4
110,14.0,3.0,Male,No,Sat,2
225,16.27,2.5,Female,Yes,Fri,2


In [None]:
# 🔍 MISSING VALUE ANALYSIS
# =========================

print("🔍 MISSING VALUE ANALYSIS:")
print("=" * 29)

# Check for missing values in the dataset
missing_values = df.isna().sum()
total_samples = len(df)

print(f"📊 Missing Value Report:")
print(f"   Total samples: {total_samples}")
print()

if missing_values.sum() == 0:
    print("✅ EXCELLENT! No missing values found in any column!")
    print("   This simplifies our preprocessing pipeline")
else:
    print("⚠️  Missing values detected:")
    for column, missing_count in missing_values.items():
        if missing_count > 0:
            percentage = (missing_count / total_samples) * 100
            print(f"   {column}: {missing_count} missing ({percentage:.2f}%)")

print(f"\n📋 Column-wise missing value summary:")
missing_summary = pd.DataFrame({
    'Column': missing_values.index,
    'Missing_Count': missing_values.values,
    'Missing_Percentage': (missing_values.values / total_samples * 100).round(2)
})
display(missing_summary)

print(f"\n💡 PREPROCESSING STRATEGY:")
if missing_values.sum() == 0:
    print("   • No missing value imputation needed")
    print("   • Focus on encoding categorical variables")
    print("   • Apply feature scaling to numerical variables")
else:
    print("   • Will handle missing values in preprocessing pipeline")
    print("   • Use median imputation for numerical features")
    print("   • Use most frequent imputation for categorical features")

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

In [None]:
# 🛠️ IMPORTING PREPROCESSING TOOLS
# =================================

print("🛠️ IMPORTING PREPROCESSING COMPONENTS:")
print("=" * 42)

# Missing value handling
from sklearn.impute import SimpleImputer
print("✅ SimpleImputer: Handle missing values")

# Feature encoding and scaling
from sklearn.preprocessing import OneHotEncoder
print("✅ OneHotEncoder: Convert categorical to binary features")

from sklearn.preprocessing import StandardScaler  
print("✅ StandardScaler: Normalize numerical features")

# Pipeline construction tools
from sklearn.pipeline import Pipeline
print("✅ Pipeline: Chain preprocessing steps")

from sklearn.compose import ColumnTransformer
print("✅ ColumnTransformer: Apply different preprocessing to different columns")

print(f"\n📋 PREPROCESSING TOOLKIT READY:")
print(f"   🔧 Missing Values → SimpleImputer")
print(f"   🔧 Categorical Data → OneHotEncoder") 
print(f"   🔧 Numerical Data → StandardScaler")
print(f"   🔧 Pipeline Management → Pipeline & ColumnTransformer")

print(f"\n💡 NEXT STEPS:")
print(f"   1. Define numerical and categorical columns")
print(f"   2. Create separate pipelines for each data type")
print(f"   3. Combine pipelines using ColumnTransformer")
print(f"   4. Apply preprocessing to training and test data")

In [None]:
# 📊 DEFINING COLUMN TYPES FOR PREPROCESSING
# ==========================================

print("📊 CATEGORIZING FEATURES BY DATA TYPE:")
print("=" * 41)

# Define categorical columns (non-numerical features)
cat_cols = ["sex", "smoker", "day"]
print(f"🏷️  CATEGORICAL COLUMNS: {cat_cols}")
print(f"   • These contain text/category values")
print(f"   • Need OneHotEncoder to convert to numbers")
print(f"   • Will expand into multiple binary columns")

# Define numerical columns (already numerical features)  
num_col = ["total_bill", "tip", "size"]
print(f"\n🔢 NUMERICAL COLUMNS: {num_col}")
print(f"   • These are already numeric values")
print(f"   • Need StandardScaler for normalization")
print(f"   • Will maintain same number of columns")

print(f"\n📋 PREPROCESSING REQUIREMENTS:")
print(f"   Categorical ({len(cat_cols)} cols) → OneHot Encoding")
print(f"   Numerical ({len(num_col)} cols)   → Standard Scaling")

print(f"\n💡 EXPECTED TRANSFORMATION:")
print(f"   Original features: {len(cat_cols) + len(num_col)} columns")
print(f"   After preprocessing: {len(num_col)} + (binary features from encoding)")
print(f"   Final feature count will be determined by unique categorical values")
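
The column lists above are written by hand. As an alternative, scikit-learn's `make_column_selector` can derive them from the dtypes. A minimal sketch on a hypothetical frame with the same column layout (`toy_features`, `auto_num_cols` and `auto_cat_cols` are illustrative names):

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical frame with the same column layout as the Tips features
toy_features = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "sex": ["Female", "Male", "Male"],
    "smoker": ["No", "No", "Yes"],
    "day": ["Sun", "Sun", "Sat"],
    "size": [2, 3, 3],
})

# Selectors are callables: given a DataFrame they return matching column names
select_numeric = make_column_selector(dtype_include="number")
select_other = make_column_selector(dtype_exclude="number")

auto_num_cols = select_numeric(toy_features)
auto_cat_cols = select_other(toy_features)
print(auto_num_cols)
print(auto_cat_cols)
```

Hand-written lists remain the safer choice when a numeric-looking column is really a category, such as an ID code.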

In [None]:
# 🏗️ BUILDING PREPROCESSING PIPELINES
# ====================================

print("🏗️ CONSTRUCTING FEATURE PREPROCESSING PIPELINES:")
print("=" * 49)

# NUMERICAL PIPELINE: Handle numerical features
print("🔢 NUMERICAL PIPELINE:")
num_pipeline = Pipeline(steps=[
    ('imputation', SimpleImputer(strategy="median")),  # Fill missing with median
    ('scaling', StandardScaler())                      # Normalize to mean=0, std=1
])
print("   Step 1: Impute missing values with median")
print("   Step 2: Standardize features (mean=0, std=1)")

# CATEGORICAL PIPELINE: Handle categorical features  
print(f"\n🏷️  CATEGORICAL PIPELINE:")
cat_pipeline = Pipeline(steps=[
    ('imputation', SimpleImputer(strategy="most_frequent")),  # Fill missing with mode
    ('encoding', OneHotEncoder())                            # Convert to binary features
])
print("   Step 1: Impute missing values with most frequent value")
print("   Step 2: One-hot encode categories into binary columns")

print(f"\n✅ PIPELINE CONSTRUCTION COMPLETE!")
print(f"\n📋 PIPELINE SUMMARY:")
print(f"   • Numerical Pipeline: {len(num_pipeline.steps)} steps")
print(f"   • Categorical Pipeline: {len(cat_pipeline.steps)} steps")
print(f"   • Both pipelines handle missing values + transformations")

print(f"\n💡 WHAT EACH PIPELINE DOES:")
print(f"   📈 Numerical: missing → median, then scale to zero mean and unit variance")
print(f"   🔤 Categorical: missing → most frequent, then binary encoding") 
print(f"   🎯 Result: All features become numerical for ML algorithms")
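
Since the Tips data has no missing values, the imputation steps never visibly act in this notebook. The sketch below runs the same two pipeline shapes on hypothetical columns that each contain one gap, so the effect of each step can be seen; the `toy_*` arrays are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical columns with one missing entry each
toy_num = np.array([[1.0], [np.nan], [3.0], [5.0]])
toy_cat = np.array([["Sun"], [np.nan], ["Sun"], ["Sat"]], dtype=object)

toy_num_pipeline = Pipeline(steps=[
    ("imputation", SimpleImputer(strategy="median")),
    ("scaling", StandardScaler()),
])
toy_cat_pipeline = Pipeline(steps=[
    ("imputation", SimpleImputer(strategy="most_frequent")),
    ("encoding", OneHotEncoder()),
])

# Median of the observed values (1, 3, 5) is 3, so the gap becomes 3 before scaling
num_out = toy_num_pipeline.fit_transform(toy_num)
# The gap becomes "Sun" (most frequent); OneHotEncoder returns a sparse matrix
cat_out = toy_cat_pipeline.fit_transform(toy_cat).toarray()

print(num_out.ravel())
print(cat_out)  # columns follow the sorted categories: Sat, Sun
```

Note that `StandardScaler` only shifts and rescales each column. It does not change the shape of the distribution.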

In [None]:
# 🔗 COMBINING PIPELINES WITH COLUMNTRANSFORMER
# =============================================

print("🔗 CREATING UNIFIED PREPROCESSING PIPELINE:")
print("=" * 44)

# Combine both pipelines using ColumnTransformer
# This applies different preprocessing to different column types
preprocessor = ColumnTransformer([
    ("num_pipeline", num_pipeline, num_col),    # Apply num_pipeline to numerical columns
    ("cat_pipeline", cat_pipeline, cat_cols)    # Apply cat_pipeline to categorical columns
])

print("✅ UNIFIED PREPROCESSOR CREATED!")
print(f"\n📊 PREPROCESSING CONFIGURATION:")
print(f"   🔢 Numerical columns ({len(num_col)}): {num_col}")
print(f"      → Pipeline: median imputation + standard scaling")
print(f"   🏷️  Categorical columns ({len(cat_cols)}): {cat_cols}")  
print(f"      → Pipeline: mode imputation + one-hot encoding")

print(f"\n🎯 TRANSFORMATION WORKFLOW:")
print(f"   1. Input: Raw data with mixed types")
print(f"   2. Split: Route columns to appropriate pipelines")
print(f"   3. Process: Apply pipeline transformations")
print(f"   4. Combine: Merge processed features")
print(f"   5. Output: Fully numerical feature matrix")

print(f"\n💡 EXPECTED OUTPUT STRUCTURE:")
print(f"   • Numerical features: {len(num_col)} columns (scaled)")
print(f"   • Categorical features: Multiple binary columns (one-hot)")
print(f"   • Total: More columns than original due to encoding")
print(f"   • All values: Numerical and ready for ML algorithms!")

print(f"\n🚀 PREPROCESSOR READY FOR TRAINING DATA!")

In [82]:
# 🔄 TRANSFORM TRAINING DATA
# =========================

# Apply preprocessing pipeline to training data
print("🔄 Applying preprocessing to training data...")
X_train_transformed = preprocessor.fit_transform(X_train)

print("✅ Training data transformation complete!")
print(f"Original training shape: {X_train.shape}")
print(f"Transformed training shape: {X_train_transformed.shape}")
print(f"Features expanded from {X_train.shape[1]} to {X_train_transformed.shape[1]} due to one-hot encoding")
print()
print("💡 What happened:")
print("   • Categorical variables (sex, smoker, day) converted to binary features")
print("   • Numerical variables standardized to zero mean and unit variance")
print("   • Missing values imputed (if any)")
print("   • Data is now ready for machine learning models!")

🔄 Applying preprocessing to training data...
✅ Training data transformation complete!
Original training shape: (195, 6)
Transformed training shape: (195, 11)
Features expanded from 6 to 11 due to one-hot encoding

💡 What happened:
   • Categorical variables (sex, smoker, day) converted to binary features
   • Numerical variables standardized to zero mean and unit variance
   • Missing values imputed (if any)
   • Data is now ready for machine learning models!


In [83]:
# 🔄 TRANSFORM TEST DATA
# ======================

# Apply preprocessing pipeline to test data (using already fitted preprocessor)
print("🔄 Applying preprocessing to test data...")
X_test_transformed = preprocessor.transform(X_test)

print("✅ Test data transformation complete!")
print(f"Original test shape: {X_test.shape}")
print(f"Transformed test shape: {X_test_transformed.shape}")
print()
print("💡 Important notes:")
print("   • Used transform() not fit_transform() to avoid data leakage")
print("   • Same preprocessing applied as learned from training data")
print("   • Both datasets now have consistent feature structure")
print("   • Ready for model training and evaluation!")

🔄 Applying preprocessing to test data...
✅ Test data transformation complete!
Original test shape: (49, 6)
Transformed test shape: (49, 11)

💡 Important notes:
   • Used transform() not fit_transform() to avoid data leakage
   • Same preprocessing applied as learned from training data
   • Both datasets now have consistent feature structure
   • Ready for model training and evaluation!


In [None]:
# 👀 EXAMINING RAW TRAINING DATA (BEFORE PREPROCESSING)
# ====================================================

print("👀 RAW TRAINING DATA INSPECTION:")
print("=" * 36)

print(f"📊 Training data shape: {X_train.shape}")
print(f"📋 Data types before preprocessing:")

for col in X_train.columns:
    dtype = X_train[col].dtype
    unique_vals = X_train[col].nunique()
    sample_vals = X_train[col].unique()[:3]  # Show first 3 unique values
    print(f"   {col:<12}: {dtype} | {unique_vals} unique | sample: {sample_vals}")

print(f"\n🔍 Raw training data:")
display(X_train)

print(f"\n⚠️  NOTICE: This data contains mixed types!")
print(f"   • Categorical columns have string values ('Male', 'Female', etc.)")
print(f"   • Numerical columns have float/int values")
print(f"   • ML algorithms can't handle strings directly")
print(f"   • This is WHY we need preprocessing!")

print(f"\n🔄 RECALL: The fitted preprocessor converts all of this to numbers")

,total_bill,tip,sex,smoker,day,size
0,16.99,1.01,Female,No,Sun,2
154,19.77,2.00,Male,No,Sun,4
167,31.71,4.50,Male,No,Sun,4
110,14.00,3.00,Male,No,Sat,2
225,16.27,2.50,Female,Yes,Fri,2
...,...,...,...,...,...,...
137,14.15,2.00,Female,No,Thur,2
72,26.86,3.14,Female,Yes,Sat,2
140,17.47,3.50,Female,No,Thur,2
235,10.07,1.25,Male,No,Sat,2


In [None]:
# 🤖 IMPORTING MACHINE LEARNING MODELS
# ====================================

print("🤖 SETTING UP MACHINE LEARNING MODELS:")
print("=" * 40)

# Import classification algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

print("✅ Imported classification algorithms:")
print("   🌳 DecisionTreeClassifier: Tree-based learning")
print("   🎯 SVC (Support Vector Classifier): Margin-based learning")

# Create model dictionary for easy iteration
models = {
    "Support Vector Classifier": SVC(),
    "Decision Tree Classifier": DecisionTreeClassifier()
}

print(f"\n📊 MODEL COLLECTION CREATED:")
print(f"   Total models: {len(models)}")
for name, model in models.items():
    print(f"   • {name}: {type(model).__name__}")

print(f"\n🎯 MODEL CHARACTERISTICS:")
print(f"   📈 SVC: Finds optimal decision boundary with maximum margin")
print(f"   🌲 Decision Tree: Creates if-then rules for classification")
print(f"   🔄 Both: Will be trained and evaluated on preprocessed data")

print(f"\n💡 EVALUATION STRATEGY:")
print(f"   • Train both models on same preprocessed training data")
print(f"   • Test both models on same preprocessed test data") 
print(f"   • Compare accuracy scores to find best performer")
print(f"   • Use standardized evaluation function for consistency")

print(f"\n🚀 MODELS READY FOR TRAINING!")

In [None]:
# 📏 MODEL TRAINING AND EVALUATION FUNCTION
# =========================================

print("📏 CREATING MODEL EVALUATION FUNCTION:")
print("=" * 40)

from sklearn.metrics import accuracy_score

def model_train_eval(X_train, y_train, X_test, y_test, models):
    """
    🎯 COMPREHENSIVE MODEL TRAINING & EVALUATION
    
    This function:
    1. Trains each model on training data
    2. Makes predictions on test data  
    3. Calculates accuracy scores
    4. Returns performance comparison
    
    Parameters:
    -----------
    X_train : array-like, shape (n_samples, n_features)
        Training feature matrix (preprocessed)
    y_train : array-like, shape (n_samples,)
        Training target labels
    X_test : array-like, shape (n_samples, n_features)  
        Test feature matrix (preprocessed)
    y_test : array-like, shape (n_samples,)
        Test target labels
    models : dict
        Dictionary of model_name: model_object pairs
        
    Returns:
    --------
    evaluation : dict
        Dictionary of model_name: accuracy_score pairs
    """
    evaluation = {}
    
    print("🔄 Training and evaluating models...")
    
    # Iterate over (name, model) pairs directly
    for model_name, model in models.items():
        
        # Train the model
        model.fit(X_train, y_train)
        
        # Make predictions
        y_pred = model.predict(X_test)
        
        # Calculate accuracy
        model_score = accuracy_score(y_test, y_pred)
        
        # Store result
        evaluation[model_name] = model_score
        
        print(f"   ✅ {model_name}: {model_score:.4f}")
    
    return evaluation

print("✅ MODEL EVALUATION FUNCTION CREATED!")
print(f"\n🔧 FUNCTION CAPABILITIES:")
print(f"   • Trains multiple models automatically")
print(f"   • Standardized evaluation process")
print(f"   • Calculates accuracy for each model")
print(f"   • Returns organized results dictionary")
print(f"   • Handles any number of models")

print(f"\n💡 USAGE WORKFLOW:")
print(f"   1. Pass preprocessed training/test data")
print(f"   2. Function trains each model") 
print(f"   3. Function tests each model")
print(f"   4. Function returns accuracy comparison")
print(f"   5. Easy to identify best performing model!")

print(f"\n🚀 READY TO EVALUATE MODELS ON PREPROCESSED DATA!")

In [87]:
# 🚀 TRAIN AND EVALUATE MODELS WITH PREPROCESSED DATA
# ==================================================

# Use the TRANSFORMED data for model training and evaluation
print("🚀 Training models on preprocessed data...")
print("=" * 50)

evaluation_results = model_train_eval(
    X_train_transformed,  # Use transformed training data (NOT X_train)
    y_train, 
    X_test_transformed,   # Use transformed test data (NOT X_test)
    y_test, 
    models
)

print("📊 MODEL PERFORMANCE RESULTS:")
print("=" * 40)
for model_name, accuracy in evaluation_results.items():
    print(f"{model_name:<25}: {accuracy:.4f} ({accuracy*100:.2f}%)")

print(f"\n🎯 Best performing model: {max(evaluation_results, key=evaluation_results.get)}")
print(f"🏆 Best accuracy: {max(evaluation_results.values()):.4f} ({max(evaluation_results.values())*100:.2f}%)")

print(f"\n💡 Success! Models trained on properly preprocessed data:")
print(f"   ✅ No more 'string to float' conversion errors")
print(f"   ✅ Categorical variables properly encoded as numbers")
print(f"   ✅ Numerical features properly scaled")
print(f"   ✅ All features ready for machine learning algorithms")

🚀 Training models on preprocessed data...
📊 MODEL PERFORMANCE RESULTS:
Support Vector Classifier: 0.9184 (91.84%)
Decision Tree Classifier : 0.9184 (91.84%)

🎯 Best performing model: Support Vector Classifier
🏆 Best accuracy: 0.9184 (91.84%)

💡 Success! Models trained on properly preprocessed data:
   ✅ No more 'string to float' conversion errors
   ✅ Categorical variables properly encoded as numbers
   ✅ Numerical features properly scaled
   ✅ All features ready for machine learning algorithms
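
Accuracy is easier to judge against a trivial baseline. Dinner accounts for roughly 72% of the Tips rows, so a model that always answers Dinner would already score about that much. A minimal sketch with made-up labels in a similar ratio (`DummyClassifier` is scikit-learn's baseline estimator; the arrays below are hypothetical, not the notebook's split):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels with a 72/28 split (0 = Dinner, 1 = Lunch)
y_train_toy = np.array([0] * 72 + [1] * 28)
y_test_toy = np.array([0] * 18 + [1] * 7)

# DummyClassifier ignores the features, so placeholders are enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)
baseline_pred = baseline.predict(X_test_toy)

baseline_accuracy = accuracy_score(y_test_toy, baseline_pred)
baseline_confusion = confusion_matrix(y_test_toy, baseline_pred)
print(f"baseline accuracy: {baseline_accuracy:.2f}")
print(baseline_confusion)  # rows = true class, columns = predicted class
```

In the notebook the same calls could be made with `X_train_transformed`, `y_train`, `X_test_transformed` and `y_test`. The confusion matrix also shows how the errors are divided between Lunch and Dinner, which a single accuracy number hides.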


## 🎉 Notebook Completion Summary

### ✅ What We Accomplished
1. **📊 Data Exploration**: Loaded and analyzed the Tips dataset (244 samples, 7 columns: 6 features plus the `time` target)
2. **🏷️ Target Encoding**: Converted categorical time labels to numerical values
3. **🔄 Data Splitting**: Created stratified train-test split (80/20) preserving class proportions
4. **🔍 Missing Value Analysis**: Confirmed no missing values in dataset
5. **🛠️ Pipeline Construction**: Built separate preprocessing pipelines for numerical and categorical features
6. **🔗 Pipeline Integration**: Combined pipelines using ColumnTransformer
7. **🔄 Data Transformation**: Applied preprocessing to both training and test sets
8. **🤖 Model Training**: Trained Support Vector Classifier and Decision Tree
9. **📈 Model Evaluation**: Compared model performance with standardized metrics

### 🏆 Key Results
- **Best Model**: Support Vector Classifier and Decision Tree Classifier tied in the recorded run
- **Best Accuracy**: 0.9184 (91.84%) on the 49-sample test set
- **Feature Expansion**: Original 6 features → 11 after one-hot encoding
- **Data Quality**: No missing values, clean preprocessing pipeline

### 💡 Key Learning Points
1. **Preprocessing Importance**: Raw categorical data must be converted to numerical for ML
2. **Pipeline Benefits**: Automated, reproducible, and helps prevent data leakage when fitted on training data only
3. **ColumnTransformer Power**: Apply different preprocessing to different column types
4. **Evaluation Consistency**: Standardized evaluation ensures fair model comparison
5. **Data Leakage Prevention**: Use `transform()` on test data, not `fit_transform()`
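
Learning point 5 can be enforced structurally by putting the preprocessor and the estimator into a single `Pipeline`. `fit` then learns the scaling and encoding statistics from the training rows only, and `score` and `predict` reuse them on the test rows. A minimal sketch on synthetic data (the frame, the target rule and the names `toy` and `clf` are assumptions for illustration, not the notebook's code):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Tips features
rng = np.random.default_rng(0)
n_rows = 120
toy = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n_rows),
    "size": rng.integers(1, 7, n_rows),
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], n_rows),
})
# Made-up target rule: bills above 25 are class 1
toy_target = (toy["total_bill"] > 25).astype(int)

clf = Pipeline(steps=[
    ("preprocess", ColumnTransformer([
        ("num", StandardScaler(), ["total_bill", "size"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["day"]),
    ])),
    ("model", DecisionTreeClassifier(random_state=0)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_target, test_size=0.2, random_state=1, stratify=toy_target
)
clf.fit(X_tr, y_tr)                        # preprocessing statistics come from X_tr only
pipeline_accuracy = clf.score(X_te, y_te)  # X_te is transformed, never fitted
print(f"accuracy: {pipeline_accuracy:.3f}")
```

With this layout there is no separate `fit_transform`/`transform` pair to get wrong, and the whole object can be passed to cross-validation utilities as one estimator.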

### 🚀 Next Steps
- **Ensemble Methods**: Combine multiple models for better performance
- **Hyperparameter Tuning**: Optimize model parameters using GridSearchCV
- **Cross-Validation**: More robust performance estimation
- **Feature Engineering**: Create new features from existing ones
- **Model Interpretability**: Understand which features drive predictions
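
The cross-validation and tuning steps listed above combine naturally, since `GridSearchCV` runs k-fold cross-validation for each candidate setting. A minimal sketch on synthetic data; the grid values, the target rule and the variable names are illustrative assumptions, not tuned recommendations:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic numeric features and a made-up binary target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

svc_pipeline = Pipeline(steps=[
    ("scaling", StandardScaler()),
    ("model", SVC()),
])

# 5-fold cross-validated accuracy; the scaler is refitted inside every fold
cv_scores = cross_val_score(svc_pipeline, X_toy, y_toy, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Step parameters are addressed as <step name>__<parameter>
param_grid = {"model__C": [0.1, 1, 10], "model__kernel": ["linear", "rbf"]}
search = GridSearchCV(svc_pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_toy, y_toy)
print(search.best_params_, f"{search.best_score_:.3f}")
```

On a dataset as small as Tips (49 test rows), the spread across folds is often as informative as the mean, because a single train-test split can swing by several percentage points.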

### 🔧 Pipeline Advantages Demonstrated
- ✅ **Reproducibility**: Same preprocessing applied consistently
- ✅ **Maintainability**: Easy to modify and extend
- ✅ **Scalability**: The same pipeline code applies unchanged to larger datasets with the same columns
- ✅ **Flexibility**: Different preprocessing for different feature types
- ✅ **Error Prevention**: Automated process reduces manual mistakes

---

## 📚 Further Reading
- **Scikit-learn User Guide**: [Preprocessing Data](https://scikit-learn.org/stable/modules/preprocessing.html)
- **Pipeline Documentation**: [ML Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
- **Column Transformer**: [Heterogeneous Data](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)

**Happy Machine Learning! 🚀🤖📊**