# SentinalX - Hybrid Fraud Detection Model Training
## Interactive Notebook for Training & Debugging

This notebook breaks down the training pipeline into separate executable blocks for easy debugging and experimentation.

**Pipeline Overview:**
1. 📦 Imports & Setup
2. 📊 Load & Explore Data
3. 🔍 Test Hard Rule Filter
4. ⚙️ Feature Preparation
5. 🤖 Initialize Model
6. 🚀 Train Isolation Forest
7. 🔮 Test Predictions
8. 📈 Evaluate Metrics
9. 💾 Save Model

## 1. 📦 Import Required Libraries

Import all necessary libraries for data processing, machine learning, and visualization.

In [2]:
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    classification_report, 
    confusion_matrix, 
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)
import joblib
import json
from datetime import datetime
from typing import Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!")
print(f"   Pandas version: {pd.__version__}")
print(f"   Numpy version: {np.__version__}")

✅ All libraries imported successfully!
   Pandas version: 3.0.0
   Numpy version: 2.4.2


## 2. 📊 Load and Explore Training Data

Load the training dataset and display basic statistics to understand the data distribution.

In [3]:
# Load training data
print("📂 Loading training data...")
train_df = pd.read_csv('Data/training_dataset.csv')

print(f"✅ Loaded {len(train_df):,} training samples\n")

# Display dataset info
print("📊 Dataset Overview:")
print(f"   Shape: {train_df.shape}")
print(f"   Features: {train_df.columns.tolist()}")

print("\n📊 Label Distribution:")
label_counts = train_df['label'].value_counts()
for label, count in label_counts.items():
    pct = count / len(train_df) * 100
    print(f"   {label}: {count:,} ({pct:.1f}%)")

print("\n📊 User Type Distribution:")
print(train_df['userType'].value_counts().sort_index())

# Display first few rows
print("\n📋 First 5 rows:")
train_df.head()

📂 Loading training data...
✅ Loaded 13,000 training samples

📊 Dataset Overview:
   Shape: (13000, 13)
   Features: ['phoneNumber', 'avgDuration', 'callFrequency', 'uniqueContacts', 'avgCallDistance', 'circleDiversity', 'label', 'userType', 'call_intensity', 'distance_per_call', 'contact_circle_ratio', 'high_freq_long_distance', 'delivery_pattern']

📊 Label Distribution:
   LEGITIMATE: 10,238 (78.8%)
   FRAUD: 2,762 (21.2%)

📊 User Type Distribution:
userType
BUSINESS_USER             3000
DELIVERY_PARTNER          3000
DIGITAL_ARREST_BOT        1500
LOW_VOLUME_SCAMMER         262
REGULAR_USER              4000
TRADITIONAL_SCAMMER       1000
TRAVELING_PROFESSIONAL     238
Name: count, dtype: int64

📋 First 5 rows:

| | phoneNumber | avgDuration | callFrequency | uniqueContacts | avgCallDistance | circleDiversity | label | userType | call_intensity | distance_per_call | contact_circle_ratio | high_freq_long_distance | delivery_pattern |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 916698198780 | 4.168521 | 84 | 78 | 5.924908 | 1 | LEGITIMATE | DELIVERY_PARTNER | 16.252232 | 0.069705 | 39.0 | 0 | 1 |
| 1 | 917859421937 | 215.629137 | 19 | 15 | 174.310282 | 1 | LEGITIMATE | REGULAR_USER | 0.087707 | 8.715514 | 7.5 | 0 | 0 |
| 2 | 918673614047 | 73.762206 | 26 | 39 | 559.37804 | 1 | LEGITIMATE | REGULAR_USER | 0.347769 | 20.717705 | 19.5 | 0 | 0 |
| 3 | 918971147766 | 54.947743 | 62 | 111 | 1010.016874 | 4 | FRAUD | TRADITIONAL_SCAMMER | 1.108177 | 16.032014 | 22.2 | 1 | 0 |
| 4 | 917277642649 | 25.195583 | 44 | 64 | 1613.389214 | 3 | FRAUD | TRADITIONAL_SCAMMER | 1.679672 | 35.853094 | 16.0 | 0 | 0 |


## 3. 🔍 Test Hard Rule Filter (Stage 1)

Test the hard rule logic that protects delivery partners:
- **Rule**: `callFrequency > 50 AND avgCallDistance < 10`
- **Goal**: Zero false positives for delivery partners
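
As a minimal sketch, the rule can be wrapped in a small helper so the two thresholds are easy to vary while debugging. The function name `hard_rule_mask`, its parameter names, and the toy records are illustrative assumptions and do not appear elsewhere in this notebook.

```python
import pandas as pd

def hard_rule_mask(df: pd.DataFrame,
                   min_frequency: float = 50,
                   max_distance: float = 10) -> pd.Series:
    """Return True for records that match the delivery-partner rule."""
    return (df["callFrequency"] > min_frequency) & (df["avgCallDistance"] < max_distance)

# Toy records with made-up values (not taken from the training set)
toy = pd.DataFrame({
    "callFrequency":   [84, 19, 62, 50],
    "avgCallDistance": [5.9, 174.3, 1010.0, 3.0],
})

mask = hard_rule_mask(toy)
print(mask.tolist())  # [True, False, False, False]
```

Both comparisons are strict, so a record with `callFrequency` exactly 50 is not protected by the rule.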

In [4]:
# Apply hard rule filter
hard_rule_mask = (train_df['callFrequency'] > 50) & (train_df['avgCallDistance'] < 10)

rule_safe = train_df[hard_rule_mask].copy()
remaining = train_df[~hard_rule_mask].copy()

print("📋 STAGE 1: Hard Rule Filter Results")
print("=" * 60)
print(f"✓ Hard rule protected: {len(rule_safe):,} records")
print(f"✓ Remaining for ML: {len(remaining):,} records")

# Check accuracy of hard rule
if len(rule_safe) > 0:
    rule_accuracy = (rule_safe['label'] == 'LEGITIMATE').sum() / len(rule_safe)
    print(f"✓ Hard rule accuracy: {rule_accuracy*100:.2f}%")
    
    # Check if any fraud slipped through
    fraud_in_safe = (rule_safe['label'] == 'FRAUD').sum()
    print(f"✓ Fraud cases in safe zone: {fraud_in_safe}")
    
    # Display user types protected by hard rule
    print(f"\n📊 User types protected by hard rule:")
    print(rule_safe['userType'].value_counts())

print(f"\n📊 Remaining user types for ML evaluation:")
print(remaining['userType'].value_counts())

📋 STAGE 1: Hard Rule Filter Results
✓ Hard rule protected: 2,554 records
✓ Remaining for ML: 10,446 records
✓ Hard rule accuracy: 100.00%
✓ Fraud cases in safe zone: 0

📊 User types protected by hard rule:
userType
DELIVERY_PARTNER    2554
Name: count, dtype: int64

📊 Remaining user types for ML evaluation:
userType
REGULAR_USER              4000
BUSINESS_USER             3000
DIGITAL_ARREST_BOT        1500
TRADITIONAL_SCAMMER       1000
DELIVERY_PARTNER           446
LOW_VOLUME_SCAMMER         262
TRAVELING_PROFESSIONAL     238
Name: count, dtype: int64


## 4. ⚙️ Configure Features

Define the feature columns that will be used for the Isolation Forest model.
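
The notebook loads the engineered features ready-made and does not show how they were derived. The sketch below uses formulas that are an assumption: they are inferred from the five preview rows in section 2, which they reproduce to the displayed precision, and they are not confirmed by the dataset's author. The helper name `add_ratio_features` is hypothetical, and the thresholds behind the binary flags (`high_freq_long_distance`, `delivery_pattern`) cannot be recovered from the preview, so they are left out.

```python
import numpy as np
import pandas as pd

def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add three ratio features using ASSUMED formulas (inferred, not confirmed)."""
    out = df.copy()
    # The +1 in each denominator is part of the inferred formula; it avoids division by zero
    out["call_intensity"] = out["callFrequency"] / (out["avgDuration"] + 1)
    out["distance_per_call"] = out["avgCallDistance"] / (out["callFrequency"] + 1)
    out["contact_circle_ratio"] = out["uniqueContacts"] / (out["circleDiversity"] + 1)
    return out

# Base features of the first preview row in section 2
sample = pd.DataFrame({
    "avgDuration": [4.168521],
    "callFrequency": [84],
    "uniqueContacts": [78],
    "avgCallDistance": [5.924908],
    "circleDiversity": [1],
})

derived = add_ratio_features(sample)
print(derived[["call_intensity", "distance_per_call", "contact_circle_ratio"]].round(6))
```

If the real feature pipeline differs, the model should be fed features from that pipeline, since the scaler and forest were fitted on the stored columns.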

In [5]:
# Define feature columns for the model
feature_columns = [
    'avgDuration',           # Base feature
    'callFrequency',         # Base feature
    'uniqueContacts',        # Base feature
    'avgCallDistance',       # Base feature
    'circleDiversity',       # Base feature
    'call_intensity',        # Engineered feature
    'distance_per_call',     # Engineered feature
    'contact_circle_ratio',  # Engineered feature
    'high_freq_long_distance'# Engineered feature (binary)
]

print("📋 Feature Configuration")
print("=" * 60)
print(f"Total features: {len(feature_columns)}")
print("\nFeatures:")
for i, feat in enumerate(feature_columns, 1):
    print(f"  {i}. {feat}")

# Check if all features exist in the dataset
print("\n✓ Checking feature availability in dataset...")
missing_features = [f for f in feature_columns if f not in remaining.columns]
if missing_features:
    print(f"❌ Missing features: {missing_features}")
else:
    print(f"✅ All {len(feature_columns)} features available!")
    
# Display feature statistics
print("\n📊 Feature Statistics (for ML evaluation set):")
remaining[feature_columns].describe()

📋 Feature Configuration
Total features: 9

Features:
  1. avgDuration
  2. callFrequency
  3. uniqueContacts
  4. avgCallDistance
  5. circleDiversity
  6. call_intensity
  7. distance_per_call
  8. contact_circle_ratio
  9. high_freq_long_distance

✓ Checking feature availability in dataset...
✅ All 9 features available!

📊 Feature Statistics (for ML evaluation set):

| | avgDuration | callFrequency | uniqueContacts | avgCallDistance | circleDiversity | call_intensity | distance_per_call | contact_circle_ratio | high_freq_long_distance |
|---|---|---|---|---|---|---|---|---|---|
| count | 10446.0 | 10446.0 | 10446.0 | 10446.0 | 10446.0 | 10446.0 | 10446.0 | 10446.0 | 10446.0 |
| mean | 127.899011 | 37.891346 | 64.112579 | 704.572528 | 2.940551 | 2.406145 | 19.232193 | 15.733499 | 0.180069 |
| std | 114.933723 | 25.331774 | 50.973598 | 674.492148 | 2.05161 | 5.009234 | 14.209815 | 7.680529 | 0.384263 |
| min | 2.01007 | 5.0 | 8.0 | 0.511443 | 1.0 | 0.011884 | 0.010709 | 2.666667 | 0.0 |
| 25% | 33.413501 | 20.0 | 31.0 | 228.378355 | 2.0 | 0.090071 | 9.246434 | 10.333333 | 0.0 |
| 50% | 93.39661 | 33.0 | 50.0 | 470.317602 | 2.0 | 0.314689 | 16.930586 | 14.666667 | 0.0 |
| 75% | 181.985217 | 48.0 | 77.0 | 985.404791 | 3.0 | 1.258456 | 26.035091 | 20.0 | 0.0 |
| max | 419.877452 | 119.0 | 249.0 | 2798.549468 | 9.0 | 37.065292 | 99.714048 | 47.0 | 1.0 |


## 5. 🔄 Prepare and Scale Features

Extract features from the remaining dataset and apply StandardScaler normalization.
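
For intuition, `StandardScaler` replaces each value with its z-score, `(x - mean) / std`, computed per column with the population standard deviation (`ddof=0`). The sketch below checks this against a manual calculation on a toy frame; the values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (illustrative values only)
X = pd.DataFrame({
    "avgDuration":   [4.2, 215.6, 73.8, 54.9],
    "callFrequency": [84.0, 19.0, 26.0, 62.0],
})

scaled = StandardScaler().fit_transform(X)

# Manual z-score; ddof=0 matches the population std used by StandardScaler
manual = (X - X.mean()) / X.std(ddof=0)

matches = np.allclose(scaled, manual.to_numpy())
print(matches)  # True
```

Note that `DataFrame.describe()` reports the sample standard deviation (`ddof=1`), so recomputing a scaled value from the statistics table in section 4 gives a slightly different number: for the first sample's `avgDuration`, (215.629137 - 127.899011) / 114.933723 is about 0.76331, while the scaler's output is 0.763347.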

In [6]:
# Prepare features for training
X_train = remaining[feature_columns].copy()
y_train = (remaining['label'] == 'FRAUD').astype(int)

# Handle any missing values
X_train = X_train.fillna(X_train.mean())

print("⚙️ Feature Preparation")
print("=" * 60)
print(f"✓ Training samples: {len(X_train):,}")
print(f"✓ Features: {X_train.shape[1]}")
print(f"✓ Target distribution: {y_train.value_counts().to_dict()}")

# Initialize and fit scaler
print("\n⚙️ Fitting StandardScaler...")
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

print(f"✅ Feature scaling complete!")
print(f"\n📊 Scaled feature statistics:")
print(f"   Mean (should be ~0): {X_train_scaled.mean():.6f}")
print(f"   Std (should be ~1): {X_train_scaled.std():.6f}")
print(f"   Shape: {X_train_scaled.shape}")

# Display original vs scaled comparison for first sample
print(f"\n📊 Example: First sample comparison")
print("\nOriginal values:")
print(X_train.iloc[0])
print("\nScaled values:")
print(pd.Series(X_train_scaled[0], index=feature_columns))

⚙️ Feature Preparation
✓ Training samples: 10,446
✓ Features: 9
✓ Target distribution: {0: 7684, 1: 2762}

⚙️ Fitting StandardScaler...
✅ Feature scaling complete!

📊 Scaled feature statistics:
   Mean (should be ~0): -0.000000
   Std (should be ~1): 1.000000
   Shape: (10446, 9)

📊 Example: First sample comparison

Original values:
avgDuration                215.629137
callFrequency               19.000000
uniqueContacts              15.000000
avgCallDistance            174.310282
circleDiversity              1.000000
call_intensity               0.087707
distance_per_call            8.715514
contact_circle_ratio         7.500000
high_freq_long_distance      0.000000
Name: 1, dtype: float64

Scaled values:
avgDuration                0.763347
callFrequency             -0.745793
uniqueContacts            -0.963537
avgCallDistance           -0.786203
circleDiversity           -0.945913
call_intensity            -0.462855
distance_per_call         -0.740135
contact_c

## 6. 🤖 Initialize Isolation Forest Model (Stage 2)

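Configure the Isolation Forest for fraud detection. `contamination` sets the share of records the model flags as anomalous, and `max_samples` caps the number of rows drawn per tree to keep training fast.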
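
The effect of `contamination` can be seen directly: scikit-learn places the decision threshold at the matching percentile of the training scores, so roughly that share of training rows is flagged regardless of the true fraud rate. This is consistent with the recorded run in section 8, where the ML stage flags 2,612 of 10,446 records (25.0%). The sketch below uses synthetic data and a smaller forest, both chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))  # synthetic data, not the notebook's features

model = IsolationForest(n_estimators=50, contamination=0.25, random_state=42).fit(X)
pred = model.predict(X)
scores = model.decision_function(X)

# Share of training rows flagged as anomalies (close to the contamination value)
flagged_share = (pred == -1).mean()
print(f"share flagged: {flagged_share:.3f}")

# predict() returns -1 exactly where decision_function() is negative
sign_rule_holds = np.array_equal(pred == -1, scores < 0)

# decision_function() is score_samples() shifted by the fitted offset_
offset_rule_holds = np.allclose(scores, model.score_samples(X) - model.offset_)
print(sign_rule_holds, offset_rule_holds)
```

Because the threshold is a quantile, `contamination` works as a flagging budget: setting it above the true fraud rate tends to add false positives, and setting it below tends to add false negatives.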

In [14]:
# Initialize Isolation Forest with optimized parameters
n_estimators = 200      # Number of trees
contamination = 0.25    # Expected fraud rate (25%)
max_samples = 512       # Samples per tree (speed optimization)
random_state = 42       # For reproducibility

isolation_forest = IsolationForest(
    n_estimators=n_estimators,
    contamination=contamination,
    max_samples=max_samples,
    random_state=random_state,
    n_jobs=-1,  # Use all CPU cores
    verbose=0
)

print("🤖 Isolation Forest Configuration")
print("=" * 60)
print(f"  • n_estimators: {n_estimators} (number of trees)")
print(f"  • contamination: {contamination} (expected fraud rate)")
print(f"  • max_samples: {max_samples} (samples per tree)")
print(f"  • random_state: {random_state}")
print(f"  • n_jobs: -1 (use all CPU cores)")
print(f"\n✅ Model initialized!")

🤖 Isolation Forest Configuration
  • n_estimators: 200 (number of trees)
  • contamination: 0.25 (expected fraud rate)
  • max_samples: 512 (samples per tree)
  • random_state: 42
  • n_jobs: -1 (use all CPU cores)

✅ Model initialized!


## 7. 🚀 Train the Isolation Forest

Train the model on the scaled feature data. With `max_samples=512` each tree sees only a small subsample, so training is quick (0.37 seconds in the run recorded below).

In [15]:
import time

print("🌲 Training Isolation Forest...")
print("=" * 60)

# Track training time
start_time = time.time()

# Train the model
isolation_forest.fit(X_train_scaled)

training_time = time.time() - start_time

print(f"✅ Training complete!")
print(f"   Training time: {training_time:.2f} seconds")
print(f"   Samples used: {len(X_train_scaled):,}")
print(f"   Model ready for predictions!")

🌲 Training Isolation Forest...
✅ Training complete!
   Training time: 0.37 seconds
   Samples used: 10,446
   Model ready for predictions!


## 8. 🔮 Make Predictions on Training Set

Apply the hybrid system (hard rule + ML) to make predictions on the entire training dataset.

In [16]:
# Make predictions on the remaining dataset (ML predictions)
print("🔮 Making predictions...")
print("=" * 60)

# Predict on the remaining dataset
ml_predictions = isolation_forest.predict(X_train_scaled)
anomaly_scores = isolation_forest.decision_function(X_train_scaled)

# Convert predictions (-1 = fraud, 1 = legitimate)
remaining['prediction'] = np.where(ml_predictions == -1, 'FRAUD', 'LEGITIMATE')
remaining['anomaly_score'] = anomaly_scores
remaining['detection_stage'] = 'ML_ISOLATION_FOREST'

# Add hard rule predictions back
rule_safe['prediction'] = 'LEGITIMATE'
rule_safe['anomaly_score'] = 0.0  # Not applicable for rule-based
rule_safe['detection_stage'] = 'RULE_BASED'

# Combine all predictions
all_predictions = pd.concat([rule_safe, remaining], ignore_index=False).sort_index()

print(f"✅ Predictions complete!")
print(f"\n📊 Prediction Distribution:")
print(all_predictions['prediction'].value_counts())
print(f"\n📊 Detection Stage Distribution:")
print(all_predictions['detection_stage'].value_counts())

# Show some example predictions
print(f"\n📋 Example predictions:")
all_predictions[['phoneNumber', 'label', 'prediction', 'detection_stage', 'anomaly_score']].head(10)

🔮 Making predictions...
✅ Predictions complete!

📊 Prediction Distribution:
prediction
LEGITIMATE    10388
FRAUD          2612
Name: count, dtype: int64

📊 Detection Stage Distribution:
detection_stage
ML_ISOLATION_FOREST    10446
RULE_BASED              2554
Name: count, dtype: int64

📋 Example predictions:

| | phoneNumber | label | prediction | detection_stage | anomaly_score |
|---|---|---|---|---|---|
| 0 | 916698198780 | LEGITIMATE | LEGITIMATE | RULE_BASED | 0.0 |
| 1 | 917859421937 | LEGITIMATE | LEGITIMATE | ML_ISOLATION_FOREST | 0.075877 |
| 2 | 918673614047 | LEGITIMATE | LEGITIMATE | ML_ISOLATION_FOREST | 0.072277 |
| 3 | 918971147766 | FRAUD | FRAUD | ML_ISOLATION_FOREST | -0.033204 |
| 4 | 917277642649 | FRAUD | LEGITIMATE | ML_ISOLATION_FOREST | 0.022571 |
| 5 | 919064934039 | LEGITIMATE | LEGITIMATE | RULE_BASED | 0.0 |
| 6 | 919189098308 | LEGITIMATE | LEGITIMATE | ML_ISOLATION_FOREST | 0.098215 |
| 7 | 919588990580 | LEGITIMATE | LEGITIMATE | ML_ISOLATION_FOREST | 0.072636 |
| 8 | 917490230771 | LEGITIMATE | LEGITIMATE | ML_ISOLATION_FOREST | 0.075809 |
| 9 | 917321115495 | LEGITIMATE | LEGITIMATE | ML_ISOLATION_FOREST | 0.085086 |


## 9. 📈 Calculate Performance Metrics

Evaluate the model's performance using standard classification metrics.
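
For reference, each metric reduces to simple arithmetic on the four confusion-matrix counts. The sketch below plugs in the counts from this notebook's recorded run (TP = 2,124, FP = 488, FN = 638, TN = 9,750); the variable names are illustrative.

```python
# Confusion-matrix counts from the recorded run in this section
tp, fp, fn, tn = 2124, 488, 638, 9750

accuracy = (tp + tn) / (tp + fp + fn + tn)            # share of all records classified correctly
precision = tp / (tp + fp)                            # share of fraud flags that are real fraud
recall = tp / (tp + fn)                               # share of real fraud that gets flagged
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall
fpr = fp / (fp + tn)                                  # share of legitimate records flagged as fraud

print(f"accuracy={accuracy:.2%} precision={precision:.2%} "
      f"recall={recall:.2%} f1={f1:.2%} fpr={fpr:.2%}")
# accuracy=91.34% precision=81.32% recall=76.90% f1=79.05% fpr=4.77%
```

Accuracy looks high partly because 78.8% of the records are legitimate; recall shows that about one in four fraud records still goes undetected.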

In [17]:
# Calculate metrics
y_true = (train_df['label'] == 'FRAUD').astype(int)
y_pred = (all_predictions['prediction'] == 'FRAUD').astype(int)

# Overall metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print("📈 TRAINING PERFORMANCE METRICS")
print("=" * 60)
print(f"\n🎯 Overall Performance:")
print(f"  • Accuracy:  {accuracy*100:.2f}%")
print(f"  • Precision: {precision*100:.2f}% (Low false positives)")
print(f"  • Recall:    {recall*100:.2f}% (Catch fraudsters)")
print(f"  • F1-Score:  {f1*100:.2f}%")

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(f"\n📊 Confusion Matrix:")
print(f"  • True Negatives (Legit → Legit):  {tn:,}")
print(f"  • False Positives (Legit → Fraud): {fp:,}")
print(f"  • False Negatives (Fraud → Legit): {fn:,}")
print(f"  • True Positives (Fraud → Fraud):  {tp:,}")

# False Positive Rate
fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
print(f"\n  • Overall False Positive Rate: {fpr*100:.2f}%")

📈 TRAINING PERFORMANCE METRICS

🎯 Overall Performance:
  • Accuracy:  91.34%
  • Precision: 81.32% (Low false positives)
  • Recall:    76.90% (Catch fraudsters)
  • F1-Score:  79.05%

📊 Confusion Matrix:
  • True Negatives (Legit → Legit):  9,750
  • False Positives (Legit → Fraud): 488
  • False Negatives (Fraud → Legit): 638
  • True Positives (Fraud → Fraud):  2,124

  • Overall False Positive Rate: 4.77%


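## 10. 🛡️ Critical Metric: Delivery Partner FPR

Verify that records matching the hard rule are never flagged as fraud. Note that this check selects records by the rule itself (`callFrequency > 50` and `avgCallDistance < 10`), not by `userType`, so it covers only the 2,554 rule-protected records. The 446 delivery partners that fall outside the rule are scored by the ML stage and are not included here.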
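
A complementary check groups by `userType` rather than by the rule mask, which also counts delivery partners that the ML stage scored. In the recorded run, section 11 reports 2,759 of 3,000 delivery partners classified correctly; since every delivery partner is labeled legitimate, that corresponds to 241 false positives, or about 8.03%. The helper name `fpr_by_user_type` and the toy frame below are illustrative assumptions.

```python
import pandas as pd

def fpr_by_user_type(df: pd.DataFrame) -> pd.Series:
    """False positive rate per user type, computed over legitimate records only."""
    legit = df[df["label"] == "LEGITIMATE"]
    flagged = legit["prediction"].eq("FRAUD")
    return flagged.groupby(legit["userType"]).mean()

# Toy predictions with made-up values
toy = pd.DataFrame({
    "userType":   ["DELIVERY_PARTNER"] * 4 + ["REGULAR_USER"] * 2,
    "label":      ["LEGITIMATE"] * 6,
    "prediction": ["LEGITIMATE", "LEGITIMATE", "LEGITIMATE", "FRAUD",
                   "LEGITIMATE", "LEGITIMATE"],
})

rates = fpr_by_user_type(toy)
print(rates)
```

In the notebook this could be called as `fpr_by_user_type(all_predictions)` once section 8 has run.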

In [18]:
# Check Delivery Partner False Positive Rate (CRITICAL METRIC)
delivery_partners = train_df[
    (train_df['callFrequency'] > 50) & 
    (train_df['avgCallDistance'] < 10) &
    (train_df['label'] == 'LEGITIMATE')
]

print("🛡️ CRITICAL METRIC: Delivery Partner FPR")
print("=" * 60)
print(f"Total Delivery Partners: {len(delivery_partners)}")

if len(delivery_partners) > 0:
    delivery_predictions = all_predictions.loc[delivery_partners.index]
    delivery_fp = (delivery_predictions['prediction'] == 'FRAUD').sum()
    delivery_fpr = delivery_fp / len(delivery_partners)
    
    print(f"Incorrectly Flagged as Fraud: {delivery_fp}")
    print(f"False Positive Rate: {delivery_fpr*100:.4f}%")
    
    if delivery_fpr == 0:
        print(f"\n✅ PERFECT! Zero false positives on delivery partners!")
        print(f"   Hard rule is working correctly!")
    else:
        print(f"\n⚠️ WARNING: {delivery_fp} delivery partners misclassified!")
        print(f"   This should be 0! Check hard rule implementation.")
        
    # Show delivery partner predictions
    print(f"\n📊 Delivery Partner Predictions:")
    print(delivery_predictions[['userType', 'label', 'prediction', 'detection_stage']].value_counts())
else:
    print("ℹ️ No delivery partners found in training set")

🛡️ CRITICAL METRIC: Delivery Partner FPR
Total Delivery Partners: 2554
Incorrectly Flagged as Fraud: 0
False Positive Rate: 0.0000%

✅ PERFECT! Zero false positives on delivery partners!
   Hard rule is working correctly!

📊 Delivery Partner Predictions:
userType          label       prediction  detection_stage
DELIVERY_PARTNER  LEGITIMATE  LEGITIMATE  RULE_BASED         2554
Name: count, dtype: int64


## 11. 📊 Performance by User Type

Analyze model performance for each user profile type.

In [19]:
# Performance by user type
print("📊 PERFORMANCE BY USER TYPE")
print("=" * 60)

user_types = train_df['userType'].unique()
results = []

for user_type in sorted(user_types):
    mask = train_df['userType'] == user_type
    subset_true = y_true[mask]
    subset_pred = y_pred[mask]
    
    if len(subset_true) > 0:
        acc = accuracy_score(subset_true, subset_pred)
        n_samples = len(subset_true)
        n_correct = (subset_true == subset_pred).sum()
        
        results.append({
            'User Type': user_type,
            'Samples': n_samples,
            'Correct': n_correct,
            'Accuracy': f"{acc*100:.2f}%"
        })
        
        icon = '✅' if acc >= 0.95 else '⚠️'
        print(f"  {icon} {user_type:30s}: {acc*100:5.2f}% ({n_correct}/{n_samples})")

# Create summary DataFrame
results_df = pd.DataFrame(results)
print("\n📋 Summary Table:")
results_df

📊 PERFORMANCE BY USER TYPE
  ✅ BUSINESS_USER                 : 100.00% (3000/3000)
  ⚠️ DELIVERY_PARTNER              : 91.97% (2759/3000)
  ✅ DIGITAL_ARREST_BOT            : 98.27% (1474/1500)
  ⚠️ LOW_VOLUME_SCAMMER            : 29.39% (77/262)
  ✅ REGULAR_USER                  : 95.45% (3818/4000)
  ⚠️ TRADITIONAL_SCAMMER           : 57.30% (573/1000)
  ⚠️ TRAVELING_PROFESSIONAL        : 72.69% (173/238)

📋 Summary Table:

| | User Type | Samples | Correct | Accuracy |
|---|---|---|---|---|
| 0 | BUSINESS_USER | 3000 | 3000 | 100.00% |
| 1 | DELIVERY_PARTNER | 3000 | 2759 | 91.97% |
| 2 | DIGITAL_ARREST_BOT | 1500 | 1474 | 98.27% |
| 3 | LOW_VOLUME_SCAMMER | 262 | 77 | 29.39% |
| 4 | REGULAR_USER | 4000 | 3818 | 95.45% |
| 5 | TRADITIONAL_SCAMMER | 1000 | 573 | 57.30% |
| 6 | TRAVELING_PROFESSIONAL | 238 | 173 | 72.69% |


## 12. 💾 Save Trained Model

Save the model, scaler, and configuration for deployment.
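
Before relying on saved artifacts, it can be worth confirming that they load back and reproduce the in-memory predictions. The sketch below does this round trip in a temporary directory with a small synthetic model; the data, the feature names `f0` to `f2`, and the reduced forest size are assumptions made only for illustration.

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # synthetic stand-in for the feature matrix

scaler = StandardScaler().fit(X)
model = IsolationForest(n_estimators=20, random_state=42).fit(scaler.transform(X))

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    # Save model, scaler, and a minimal config (same file names as this notebook)
    joblib.dump(model, model_dir / "isolation_forest.pkl")
    joblib.dump(scaler, model_dir / "scaler.pkl")
    (model_dir / "config.json").write_text(
        json.dumps({"feature_columns": ["f0", "f1", "f2"]})
    )

    # Load everything back, as a deployment script would
    loaded_model = joblib.load(model_dir / "isolation_forest.pkl")
    loaded_scaler = joblib.load(model_dir / "scaler.pkl")
    loaded_config = json.loads((model_dir / "config.json").read_text())

# The reloaded pipeline should reproduce the original predictions exactly
same = np.array_equal(
    model.predict(scaler.transform(X)),
    loaded_model.predict(loaded_scaler.transform(X)),
)
print(same, loaded_config["feature_columns"])
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, so recording the versions alongside `config.json` may help at deployment time.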

In [20]:
import os

# Create models directory
model_dir = 'models'
os.makedirs(model_dir, exist_ok=True)

print("💾 Saving model...")
print("=" * 60)

# Save Isolation Forest
joblib.dump(isolation_forest, f'{model_dir}/isolation_forest.pkl')
print(f"✓ Saved: {model_dir}/isolation_forest.pkl")

# Save scaler
joblib.dump(scaler, f'{model_dir}/scaler.pkl')
print(f"✓ Saved: {model_dir}/scaler.pkl")

# Save configuration and statistics
config = {
    'feature_columns': feature_columns,
    'n_estimators': n_estimators,
    'contamination': contamination,
    'max_samples': max_samples,
    'random_state': random_state,
    'training_stats': {
        'timestamp': datetime.now().isoformat(),
        'total_samples': len(train_df),
        'legitimate_samples': int((train_df['label'] == 'LEGITIMATE').sum()),
        'fraud_samples': int((train_df['label'] == 'FRAUD').sum()),
        'rule_based_protected': len(rule_safe),
        'ml_evaluated': len(remaining),
        'metrics': {
            'accuracy': float(accuracy),
            'precision': float(precision),
            'recall': float(recall),
            'f1_score': float(f1),
            'true_positives': int(tp),
            'true_negatives': int(tn),
            'false_positives': int(fp),
            'false_negatives': int(fn),
            'delivery_partner_fpr': float(delivery_fpr) if 'delivery_fpr' in locals() else 0.0
        }
    }
}

with open(f'{model_dir}/config.json', 'w') as f:
    json.dump(config, f, indent=2)
print(f"✓ Saved: {model_dir}/config.json")

print(f"\n✅ Model saved successfully!")
print(f"\n💡 Next steps:")
print(f"  1. Run 'python evaluate_model.py' to test on holdout data")
print(f"  2. Run 'python predict.py' for real-time predictions")

💾 Saving model...
✓ Saved: models/isolation_forest.pkl
✓ Saved: models/scaler.pkl
✓ Saved: models/config.json

✅ Model saved successfully!

💡 Next steps:
  1. Run 'python evaluate_model.py' to test on holdout data
  2. Run 'python predict.py' for real-time predictions


## 🎉 Training Complete!

### Summary

You've successfully trained a hybrid fraud detection model with:
- **Stage 1**: Hard rule filter for delivery partners
- **Stage 2**: Isolation Forest ML for anomaly detection

### Key Variables Available

You can now debug and explore these variables:
- `train_df` - Original training data
- `rule_safe` - Records protected by hard rule
- `remaining` - Records evaluated by ML
- `X_train` - Feature matrix (unscaled)
- `X_train_scaled` - Feature matrix (scaled)
- `scaler` - StandardScaler object
- `isolation_forest` - Trained model
- `all_predictions` - All predictions (rule + ML)
- `y_true`, `y_pred` - True and predicted labels

### Debugging Tips

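1. **Check specific predictions**: `all_predictions[all_predictions['phoneNumber'] == 916698198780]` (`phoneNumber` is loaded as an integer, so compare against a number rather than a `'+91...'` string)
2. **Find misclassifications**: `train_df[(y_true != y_pred)]`
3. **Inspect anomaly scores**: `remaining[['phoneNumber', 'anomaly_score', 'prediction']].sort_values('anomaly_score')`
4. **Test hard rule**: Modify the thresholds in section 3 and re-run from there so that predictions and metrics pick up the change (the check in section 10 hard-codes the same thresholds)
5. **Tune model**: Change `n_estimators` or `contamination` in section 6 and re-run from there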
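
One way to explore tip 5 without retraining is to sweep the cut-off applied to the anomaly scores and watch precision and recall move. This is a minimal sketch under stated assumptions: the helper `sweep_thresholds` and the toy scores are hypothetical, and a record is treated as fraud when its score falls below the threshold (the default cut-off used by `predict()` is 0).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

def sweep_thresholds(scores, y_true, thresholds):
    """Return (threshold, precision, recall) for each candidate cut-off."""
    rows = []
    for t in thresholds:
        y_pred = (scores < t).astype(int)  # lower score = more anomalous
        rows.append((
            t,
            precision_score(y_true, y_pred, zero_division=0),
            recall_score(y_true, y_pred, zero_division=0),
        ))
    return rows

# Toy scores and labels (1 = fraud); values are made up for illustration
scores = np.array([-0.10, -0.03, 0.02, 0.05, 0.08, 0.09])
y_true = np.array([1, 1, 1, 0, 0, 0])

sweep = sweep_thresholds(scores, y_true, [-0.05, 0.0, 0.03])
for t, p, r in sweep:
    print(f"threshold={t:+.2f}  precision={p:.2f}  recall={r:.2f}")
```

In the notebook, `remaining['anomaly_score']` and `y_train` could be passed in after section 8 has run. Any threshold picked this way is tuned on training labels, so it should be confirmed on holdout data.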

### Next Steps

- Run all cells: `Kernel > Restart & Run All`
- Test on new data: Load a test dataset, apply the hard rule, then pass the remaining rows through `scaler.transform()` before calling `isolation_forest.predict()`
- Deploy: Use the saved artifacts in the `models/` directory
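
Scoring new data means repeating both stages in order: the hard rule first, then scaling and the Isolation Forest for whatever the rule does not cover. The sketch below shows one way to do that; the function name `predict_hybrid`, the two-feature toy model, and the sample rows are assumptions for illustration, and missing values are not handled (the notebook fills them with training means before scaling).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

def predict_hybrid(df, scaler, model, feature_columns):
    """Apply the hard rule, then score the remaining rows with the ML stage."""
    out = df.copy()
    rule = (out["callFrequency"] > 50) & (out["avgCallDistance"] < 10)
    out["prediction"] = "LEGITIMATE"  # rule-protected rows keep this value
    out["detection_stage"] = np.where(rule, "RULE_BASED", "ML_ISOLATION_FOREST")
    if (~rule).any():
        # Reuse the fitted scaler; never refit it on new data
        X = scaler.transform(out.loc[~rule, feature_columns])
        flagged = model.predict(X) == -1
        out.loc[~rule, "prediction"] = np.where(flagged, "FRAUD", "LEGITIMATE")
    return out

# Toy training data and model (illustrative only)
rng = np.random.default_rng(0)
features = ["callFrequency", "avgCallDistance"]
train = pd.DataFrame({
    "callFrequency": rng.integers(5, 50, 300),
    "avgCallDistance": rng.uniform(20, 500, 300),
})
scaler = StandardScaler().fit(train[features])
model = IsolationForest(n_estimators=50, contamination=0.1, random_state=42)
model.fit(scaler.transform(train[features]))

new = pd.DataFrame({
    "callFrequency": [84, 20, 45],
    "avgCallDistance": [5.9, 150.0, 9000.0],
})
result = predict_hybrid(new, scaler, model, features)
print(result)
```

With the saved artifacts, `scaler` and `model` would come from `joblib.load()` on the files in `models/`, and `feature_columns` from `config.json`.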