# Feature Engineering and Modelling

---

1. Import packages
2. Load data
3. Modelling

---

## 1. Import packages

In [1]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
from datetime import datetime
import matplotlib.pyplot as plt

# Shows plots in jupyter notebook
%matplotlib inline

# Set plot style
sns.set(color_codes=True)

In [3]:
import os
os.chdir(os.path.expanduser('~'))  # Changes to your home directory
print("Changed directory to:", os.getcwd())


Changed directory to: /Users/yifanwang


---
## 2. Load data

In [4]:
df = pd.read_csv('Documents/GitHub/BCG-Data-Science-Project/data_for_predictions.csv')
df.drop(columns=["Unnamed: 0"], inplace=True)
df.head()

Unnamed: 0,id,cons_12m,cons_gas_12m,cons_last_month,forecast_cons_12m,forecast_discount_energy,forecast_meter_rent_12m,forecast_price_energy_off_peak,forecast_price_energy_peak,forecast_price_pow_off_peak,...,months_modif_prod,months_renewal,channel_MISSING,channel_ewpakwlliwisiwduibdlfmalxowmwpci,channel_foosdfpfkusacimwkcsosbicdxkicaua,channel_lmkebamcaaclubfxadlmueccxoimlema,channel_usilxuppasemubllopkaafesmlibmsdf,origin_up_kamkkxfxxuwbdslkwifmmcsiusiuosws,origin_up_ldkssxwpmemidmecebumciepifcamkci,origin_up_lxidpiddsbxsbosboudacockeimpuepw
0,24011ae4ebbe3035111d65fa7c15bc57,0.0,4.739944,0.0,0.0,0.0,0.444045,0.114481,0.098142,40.606701,...,2,6,0,0,1,0,0,0,0,1
1,d29c2c54acc38ff3c0614d0a653813dd,3.668479,0.0,0.0,2.28092,0.0,1.237292,0.145711,0.0,44.311378,...,76,4,1,0,0,0,0,1,0,0
2,764c75f661154dac3a6c254cd082ea7d,2.736397,0.0,0.0,1.689841,0.0,1.599009,0.165794,0.087899,44.311378,...,68,8,0,0,1,0,0,1,0,0
3,bba03439a292a1e166f80264c16191cb,3.200029,0.0,0.0,2.382089,0.0,1.318689,0.146694,0.0,44.311378,...,69,9,0,0,0,1,0,1,0,0
4,149d57cf92fc41cf94415803a877cb4b,3.646011,0.0,2.721811,2.650065,0.0,2.122969,0.1169,0.100015,40.606701,...,71,9,1,0,0,0,0,1,0,0


---

## 3. Modelling

We now have a dataset containing features that we have engineered and we are ready to start training a predictive model. Remember, we only need to focus on training a `Random Forest` classifier.

In [5]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

### Data sampling

The first thing we want to do is split our dataset into training and test samples. We do this so that we can simulate a real-life situation by generating predictions for our test sample without showing the predictive model these data points. This lets us see how well our model generalises to new data, which is critical.

A typical share of the data to dedicate to testing is between 20% and 30%. For this example we will use a 75-25% split between train and test respectively.

In [6]:
# Work on a copy so the original dataframe stays untouched
train_df = df.copy()

# Separate target variable from independent variables
y = train_df['churn']
X = train_df.drop(columns=['id', 'churn'])
print(X.shape)
print(y.shape)

(14606, 61)
(14606,)


In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(10954, 61)
(10954,)
(3652, 61)
(3652,)
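
When the target is imbalanced, a purely random split can leave the train and test samples with slightly different class proportions. The following is a minimal sketch, using synthetic labels rather than this notebook's data, of how the `stratify` argument of `train_test_split` keeps the proportions aligned. The 10% positive rate and the names `X_demo` and `y_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows with 10% positives (assumed for illustration)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo asks for the same class ratio in both samples
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print("Positive rate, train:", y_tr.mean())
print("Positive rate, test: ", y_te.mean())
```

Whether stratification matters in practice depends on how rare the positive class is and how small the test sample is; with a rare class it is usually a cheap safeguard.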


### Model training

Once again, we are using a `Random Forest` classifier in this example. A Random Forest sits within the category of `ensemble` algorithms because internally the `Forest` refers to a collection of `Decision Trees` which are tree-based learning algorithms. As the data scientist, you can control how large the forest is (that is, how many decision trees you want to include).

The reason why an `ensemble` algorithm is powerful comes down to averaging, weak learners and the central limit theorem: averaging many imperfect but different models cancels out much of their individual error. If we take a single decision tree and give it a sample of data and some parameters, it will learn patterns from the data. It may overfit or it may underfit, and with only that single model we have nothing else to fall back on.

With `ensemble` methods, instead of banking on a single trained model, we can train thousands of decision trees, each using a different sample of the data and learning different patterns. It would be like asking 1,000 people to learn how to code: you would end up with 1,000 people with different answers, methods and styles. The weak learner notion applies here too. It has been found that if you train your learners not to overfit, but to learn weak patterns within the data, and you have a lot of these weak learners, together they form a highly predictive pool of knowledge. This is a real-life application of the idea that many brains are better than one.

Now, instead of relying on a single decision tree for prediction, the random forest defers to the combined view of the entire collection of decision trees. Some ensemble algorithms use a voting approach to decide which prediction is best, others use averaging.

As we increase the number of learners, the idea is that the random forest's performance should converge to its best possible solution.
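
To make the combination step concrete: scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by its individual trees and picks the class with the highest mean. The sketch below checks this on synthetic data; the dataset and the names `X_demo`, `y_demo` and `forest` are assumptions for illustration, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (assumed for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_demo, y_demo)

# Collect each tree's class-probability estimate, then average across trees
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# The forest's own probabilities should match the average of its trees
probas_match = np.allclose(averaged, forest.predict_proba(X_demo))
print("Forest probabilities equal the tree average:", probas_match)

# The predicted class is the one with the highest averaged probability
manual_pred = forest.classes_[averaged.argmax(axis=1)]
preds_match = bool((manual_pred == forest.predict(X_demo)).all())
print("Manual prediction equals forest.predict:", preds_match)
```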

Some additional advantages of the random forest classifier include:

- The random forest uses a rule-based approach instead of a distance calculation, so features do not need to be scaled
- It is able to handle non-linear relationships between features and the target better than linear models
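
The scaling point can be checked empirically. The sketch below, on synthetic data, trains one forest on the raw features and another on the same features multiplied by a constant, then compares their test predictions. The dataset, the factor `scale` and the variable names are assumptions for illustration; the agreement is expected to be at or very close to 1.0 because tree splits depend on the ordering of feature values, not their magnitude.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumed for illustration)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Rescale every feature by a constant factor (a stand-in for changing units)
scale = 1024.0
rf_raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr * scale, y_tr)

# Share of test rows where both forests predict the same class
agreement = np.mean(rf_raw.predict(X_te) == rf_scaled.predict(X_te * scale))
print(f"Share of identical test predictions: {agreement:.3f}")
```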

On the flip side, some disadvantages of the random forest classifier include:

- The computational power needed to train a random forest on a large dataset is high, since we need to build a whole ensemble of estimators.
- Training time can be longer due to the increased complexity and size of the ensemble.

In [8]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
import numpy as np

# Prepare X and y ('Unnamed: 0' was already dropped when df was loaded)
X = df.drop(columns=['id', 'churn'])
y = df['churn']

# Define balancing methods
balancers = {
    "RandomOverSampler": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
    "RandomUnderSampler": RandomUnderSampler(random_state=42)
}

results = []

for name, balancer in balancers.items():
    # Balance the data
    X_bal, y_bal = balancer.fit_resample(X, y)
    
    # Train/test split evaluation
    X_train, X_test, y_train, y_test = train_test_split(
        X_bal, y_bal, test_size=0.2, random_state=42, stratify=y_bal
    )
    
    model = RandomForestClassifier(random_state=42, n_estimators=200)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    split_metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred)
    }
    
    # Cross-validation evaluation
    cv_scores = cross_val_score(model, X_bal, y_bal, cv=5, scoring='f1')
    cv_mean = np.mean(cv_scores)
    
    results.append({
        "method": name,
        **split_metrics,
        "cv_f1_mean": cv_mean
    })

# Collect the results and show the methods ranked by cross-validated F1
results_df = pd.DataFrame(results).sort_values(by="cv_f1_mean", ascending=False)
results_df


ImportError: cannot import name '_safe_tags' from 'sklearn.utils._tags' (/opt/miniconda3/envs/myenv/lib/python3.12/site-packages/sklearn/utils/_tags.py)

In [9]:
# Enhanced ML Pipeline with Comprehensive Error Handling
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                           f1_score, classification_report, confusion_matrix)
from sklearn.utils import resample
import logging

# Set up logging for error tracking
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class MLPipelineWithErrorHandling:
    """
    A robust ML pipeline with comprehensive error handling for churn prediction.
    """
    
    def __init__(self, random_state=42):
        self.random_state = random_state
        self.model = None
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None
        self.best_balancing_method = None
        self.results = {}
        
    def safe_data_loading(self, filepath):
        """Safely load data with error handling"""
        try:
            df = pd.read_csv(filepath)
            logger.info(f"Successfully loaded data with shape: {df.shape}")
            
            # Handle common data issues
            if "Unnamed: 0" in df.columns:
                df = df.drop(columns=["Unnamed: 0"])
                logger.info("Removed unnamed index column")
                
            return df
            
        except FileNotFoundError:
            logger.error(f"File not found: {filepath}")
            raise
        except pd.errors.EmptyDataError:
            logger.error("CSV file is empty")
            raise
        except Exception as e:
            logger.error(f"Unexpected error loading data: {str(e)}")
            raise
    
    def safe_feature_target_split(self, df, target_col='churn', id_col='id'):
        """Safely split features and target with validation"""
        try:
            # Validate required columns exist
            if target_col not in df.columns:
                raise ValueError(f"Target column '{target_col}' not found in dataframe")
            
            # Prepare features and target
            cols_to_drop = [col for col in [id_col, target_col] if col in df.columns]
            X = df.drop(columns=cols_to_drop)
            y = df[target_col]
            
            # Validate data
            if X.empty or y.empty:
                raise ValueError("Features or target is empty")
                
            if X.isnull().any().any():
                logger.warning("Missing values found in features - consider handling them")
                
            logger.info(f"Features shape: {X.shape}, Target shape: {y.shape}")
            logger.info(f"Target distribution:\n{y.value_counts()}")
            
            return X, y
            
        except Exception as e:
            logger.error(f"Error in feature-target split: {str(e)}")
            raise
    
    def safe_train_test_split(self, X, y, test_size=0.25):
        """Safely perform train-test split with validation"""
        try:
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=test_size, random_state=self.random_state, 
                stratify=y
            )
            
            self.X_train, self.X_test = X_train, X_test
            self.y_train, self.y_test = y_train, y_test
            
            logger.info(f"Train set: {X_train.shape}, Test set: {X_test.shape}")
            logger.info(f"Train target distribution:\n{y_train.value_counts()}")
            
            return X_train, X_test, y_train, y_test
            
        except Exception as e:
            logger.error(f"Error in train-test split: {str(e)}")
            raise
    
    def manual_balance_data(self, X, y, method='oversample'):
        """
        Manual data balancing without imblearn dependency
        """
        try:
            df_combined = pd.concat([X, y], axis=1)
            
            if method == 'oversample':
                # Oversample minority class
                class_counts = y.value_counts()
                majority_class = class_counts.idxmax()
                minority_class = class_counts.idxmin()
                max_count = class_counts.max()
                
                majority_df = df_combined[df_combined[y.name] == majority_class]
                minority_df = df_combined[df_combined[y.name] == minority_class]
                
                # Oversample minority class
                minority_oversampled = resample(
                    minority_df, replace=True, n_samples=max_count, 
                    random_state=self.random_state
                )
                
                balanced_df = pd.concat([majority_df, minority_oversampled])
                
            elif method == 'undersample':
                # Undersample majority class
                class_counts = y.value_counts()
                majority_class = class_counts.idxmax()
                minority_class = class_counts.idxmin()
                min_count = class_counts.min()
                
                majority_df = df_combined[df_combined[y.name] == majority_class]
                minority_df = df_combined[df_combined[y.name] == minority_class]
                
                # Undersample majority class
                majority_undersampled = resample(
                    majority_df, replace=False, n_samples=min_count, 
                    random_state=self.random_state
                )
                
                balanced_df = pd.concat([majority_undersampled, minority_df])
            
            else:
                balanced_df = df_combined
            
            # Shuffle the balanced dataset
            balanced_df = balanced_df.sample(frac=1, random_state=self.random_state).reset_index(drop=True)
            
            X_balanced = balanced_df.drop(columns=[y.name])
            y_balanced = balanced_df[y.name]
            
            logger.info(f"Balanced data using {method}: {X_balanced.shape}")
            logger.info(f"New target distribution:\n{y_balanced.value_counts()}")
            
            return X_balanced, y_balanced
            
        except Exception as e:
            logger.error(f"Error in data balancing: {str(e)}")
            raise
    
    def compare_balancing_methods(self, X, y):
        """Compare different balancing methods"""
        methods = ['none', 'oversample', 'undersample']
        results = []
        
        for method in methods:
            try:
                logger.info(f"Testing balancing method: {method}")
                
                if method == 'none':
                    X_bal, y_bal = X, y
                else:
                    X_bal, y_bal = self.manual_balance_data(X, y, method)
                
                # Split balanced data
                X_train_bal, X_test_bal, y_train_bal, y_test_bal = train_test_split(
                    X_bal, y_bal, test_size=0.2, random_state=self.random_state, 
                    stratify=y_bal
                )
                
                # Train model
                model = RandomForestClassifier(
                    random_state=self.random_state, 
                    n_estimators=100,  # Reduced for faster training
                    max_depth=10
                )
                model.fit(X_train_bal, y_train_bal)
                y_pred = model.predict(X_test_bal)
                
                # Calculate metrics
                metrics = {
                    'method': method,
                    'accuracy': accuracy_score(y_test_bal, y_pred),
                    'precision': precision_score(y_test_bal, y_pred, zero_division=0),
                    'recall': recall_score(y_test_bal, y_pred, zero_division=0),
                    'f1': f1_score(y_test_bal, y_pred, zero_division=0)
                }
                
                # Cross-validation
                try:
                    cv_scores = cross_val_score(model, X_bal, y_bal, cv=5, scoring='f1')
                    metrics['cv_f1_mean'] = np.mean(cv_scores)
                    metrics['cv_f1_std'] = np.std(cv_scores)
                except Exception as cv_e:
                    logger.warning(f"Cross-validation failed for {method}: {str(cv_e)}")
                    metrics['cv_f1_mean'] = metrics['f1']
                    metrics['cv_f1_std'] = 0
                
                results.append(metrics)
                logger.info(f"Method {method} - F1: {metrics['f1']:.4f}")
                
            except Exception as e:
                logger.error(f"Error testing method {method}: {str(e)}")
                continue
        
        if not results:
            raise ValueError("No balancing methods completed successfully")
            
        # Find best method
        results_df = pd.DataFrame(results)
        best_method = results_df.loc[results_df['cv_f1_mean'].idxmax(), 'method']
        self.best_balancing_method = best_method
        
        logger.info(f"Best balancing method: {best_method}")
        return results_df
    
    def train_final_model(self, X, y):
        """Train final model with best balancing method"""
        try:
            # Apply best balancing method
            if self.best_balancing_method == 'none':
                X_final, y_final = X, y
            else:
                X_final, y_final = self.manual_balance_data(X, y, self.best_balancing_method)
            
            # Final train-test split
            X_train, X_test, y_train, y_test = self.safe_train_test_split(X_final, y_final)
            
            # Train final model with optimized parameters
            self.model = RandomForestClassifier(
                random_state=self.random_state,
                n_estimators=200,
                max_depth=15,
                min_samples_split=5,
                min_samples_leaf=2,
                class_weight='balanced'  # Additional balancing
            )
            
            logger.info("Training final Random Forest model...")
            self.model.fit(X_train, y_train)
            
            # Make predictions
            y_pred = self.model.predict(X_test)
            y_pred_proba = self.model.predict_proba(X_test)[:, 1]
            
            # Store results
            self.results = {
                'predictions': y_pred,
                'probabilities': y_pred_proba,
                'y_test': y_test,
                'accuracy': accuracy_score(y_test, y_pred),
                'precision': precision_score(y_test, y_pred),
                'recall': recall_score(y_test, y_pred),
                'f1': f1_score(y_test, y_pred),
                'confusion_matrix': confusion_matrix(y_test, y_pred),
                'classification_report': classification_report(y_test, y_pred)
            }
            
            logger.info("Final model training completed successfully!")
            return self.results
            
        except Exception as e:
            logger.error(f"Error in final model training: {str(e)}")
            raise
    
    def evaluate_model(self):
        """Comprehensive model evaluation"""
        if not self.results:
            raise ValueError("Model must be trained first")
        
        print("=== MODEL EVALUATION RESULTS ===")
        print(f"Accuracy:  {self.results['accuracy']:.4f}")
        print(f"Precision: {self.results['precision']:.4f}")
        print(f"Recall:    {self.results['recall']:.4f}")
        print(f"F1-Score:  {self.results['f1']:.4f}")
        print(f"Best Balancing Method: {self.best_balancing_method}")
        
        print("\n=== CONFUSION MATRIX ===")
        print(self.results['confusion_matrix'])
        
        print("\n=== DETAILED CLASSIFICATION REPORT ===")
        print(self.results['classification_report'])
        
        # Feature importance
        if hasattr(self.model, 'feature_importances_'):
            feature_importance = pd.DataFrame({
                'feature': self.X_train.columns,
                'importance': self.model.feature_importances_
            }).sort_values('importance', ascending=False)
            
            print("\n=== TOP 10 FEATURE IMPORTANCES ===")
            print(feature_importance.head(10))
        
        return self.results

# Usage Example
def run_ml_pipeline(data_path):
    """
    Complete ML pipeline execution with error handling
    """
    try:
        # Initialize pipeline
        pipeline = MLPipelineWithErrorHandling(random_state=42)
        
        # Load and prepare data
        df = pipeline.safe_data_loading(data_path)
        X, y = pipeline.safe_feature_target_split(df)
        
        # Compare balancing methods
        print("Comparing balancing methods...")
        balancing_results = pipeline.compare_balancing_methods(X, y)
        print("\nBalancing Method Comparison:")
        print(balancing_results.round(4))
        
        # Train final model
        print(f"\nTraining final model with {pipeline.best_balancing_method} balancing...")
        final_results = pipeline.train_final_model(X, y)
        
        # Evaluate model
        pipeline.evaluate_model()
        
        return pipeline, balancing_results, final_results
        
    except Exception as e:
        logger.error(f"Pipeline execution failed: {str(e)}")
        raise

# For your specific case - replace with your actual data path
if __name__ == "__main__":
    
    data_path = 'Documents/GitHub/BCG-Data-Science-Project/data_for_predictions.csv'
    
    try:
        pipeline, balancing_results, final_results = run_ml_pipeline(data_path)
        print("\n🎉 Pipeline completed successfully!")
        
    except Exception as e:
        print(f"❌ Pipeline failed: {str(e)}")
        print("Please check the data path and ensure all dependencies are installed.")


2025-08-09 23:18:58,079 - INFO - Successfully loaded data with shape: (14606, 64)
2025-08-09 23:18:58,086 - INFO - Removed unnamed index column
2025-08-09 23:18:58,089 - INFO - Features shape: (14606, 61), Target shape: (14606,)
2025-08-09 23:18:58,091 - INFO - Target distribution:
churn
0    13187
1     1419
Name: count, dtype: int64
2025-08-09 23:18:58,092 - INFO - Testing balancing method: none


Comparing balancing methods...


2025-08-09 23:19:10,778 - INFO - Method none - F1: 0.0209
2025-08-09 23:19:10,779 - INFO - Testing balancing method: oversample
2025-08-09 23:19:10,833 - INFO - Balanced data using oversample: (26374, 61)
2025-08-09 23:19:10,835 - INFO - New target distribution:
churn
0    13187
1    13187
Name: count, dtype: int64
2025-08-09 23:19:30,214 - INFO - Method oversample - F1: 0.8376
2025-08-09 23:19:30,215 - INFO - Testing balancing method: undersample
2025-08-09 23:19:30,231 - INFO - Balanced data using undersample: (2838, 61)
2025-08-09 23:19:30,232 - INFO - New target distribution:
churn
0    1419
1    1419
Name: count, dtype: int64
2025-08-09 23:19:33,003 - INFO - Method undersample - F1: 0.6082
2025-08-09 23:19:33,009 - INFO - Best balancing method: oversample
2025-08-09 23:19:33,055 - INFO - Balanced data using oversample: (26374, 61)
2025-08-09 23:19:33,057 - INFO - New target distribution:
churn
0    13187
1    13187
Name: count, dtype: int64
2025-08-09 23:19:33,080 - INFO - Train s


Balancing Method Comparison:
        method  accuracy  precision  recall      f1  cv_f1_mean  cv_f1_std
0         none    0.9038     1.0000  0.0106  0.0209      0.0181     0.0055
1   oversample    0.8347     0.8231  0.8525  0.8376      0.8385     0.0044
2  undersample    0.5986     0.5940  0.6232  0.6082      0.6038     0.0227

Training final model with oversample balancing...


2025-08-09 23:19:41,238 - INFO - Final model training completed successfully!


=== MODEL EVALUATION RESULTS ===
Accuracy:  0.9598
Precision: 0.9420
Recall:    0.9800
F1-Score:  0.9606
Best Balancing Method: oversample

=== CONFUSION MATRIX ===
[[3098  199]
 [  66 3231]]

=== DETAILED CLASSIFICATION REPORT ===
              precision    recall  f1-score   support

           0       0.98      0.94      0.96      3297
           1       0.94      0.98      0.96      3297

    accuracy                           0.96      6594
   macro avg       0.96      0.96      0.96      6594
weighted avg       0.96      0.96      0.96      6594
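
The headline metrics can be recomputed by hand from the confusion matrix printed above, which is a useful sanity check on how they relate. This is a small worked example; it assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix as printed above: rows are true classes, columns are predictions
cm = np.array([[3098, 199],
               [66, 3231]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()          # correct predictions over all predictions
precision = tp / (tp + fp)               # predicted churners that really churned
recall = tp / (tp + fn)                  # actual churners that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```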


=== TOP 10 FEATURE IMPORTANCES ===
                        feature  importance
12           margin_net_pow_ele    0.058319
11         margin_gross_pow_ele    0.055808
0                      cons_12m    0.054146
5       forecast_meter_rent_12m    0.043959
2               cons_last_month    0.039856
14                   net_margin    0.039397
3             forecast_cons_12m    0.037870
49                 months_activ    0.033377
16  var_y
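
The pipeline above balances the data before calling `train_test_split`, so with oversampling the same duplicated minority rows can appear in both the training and test samples, which may make the reported scores optimistic. Below is a minimal sketch, on synthetic data, of the alternative order: split first, oversample only the training sample, and score on the untouched test sample. The dataset, the class ratio and all variable names are assumptions for illustration, not this notebook's data or results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data, roughly 10% positives (assumed for illustration)
X_arr, y_arr = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(X_arr.shape[1])])
y_demo = pd.Series(y_arr, name="churn")

# 1. Split first, so the test sample keeps the original class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# 2. Oversample the minority class within the training sample only
train = pd.concat([X_tr, y_tr], axis=1)
majority = train[train["churn"] == 0]
minority = train[train["churn"] == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
train_bal = pd.concat([majority, minority_up]).sample(frac=1, random_state=42)

# 3. Fit on the balanced training sample, evaluate on the untouched test sample
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(train_bal.drop(columns="churn"), train_bal["churn"])
test_f1 = f1_score(y_te, model.predict(X_te), zero_division=0)

print("Training class counts after oversampling:")
print(train_bal["churn"].value_counts())
print(f"Test F1 on original class balance: {test_f1:.4f}")
```

Scores measured this way are typically lower than those measured on a resampled test sample, but they are closer to what the model would see on new customers.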