In [None]:
# Setup: common Python libraries mirroring typical R data science stacks
import sys, os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stats / ML (install if needed)
try:
    import statsmodels.api as sm
    import statsmodels.formula.api as smf
except ImportError:
    print("statsmodels not found; install with: pip install statsmodels")

try:
    from sklearn import model_selection, metrics, preprocessing, linear_model, ensemble, tree
except ImportError:
    print("scikit-learn not found; install with: pip install scikit-learn")

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 120)


# Advanced Predictive Models in R: Lasso, Trees, Random Forests, and Boosting

---
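This notebook demonstrates how to use more advanced machine learning models for predicting customer behavior in banking. It was written for R: each code cell keeps the original R code as comments and follows it with a Python translation. We'll cover: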

- **Lasso regression**: A linear model that automatically selects important features
- **Decision Trees**: Easy-to-interpret models that make decisions like a flowchart
- **Random Forests**: Combines many decision trees for better predictions
- **Boosting**: Builds models sequentially, learning from previous mistakes

## Learning Objectives
By the end of this notebook, you will:
1. Understand when to use each type of model
2. Know how to prepare data for machine learning
3. Be able to train, evaluate, and compare different models
4. Interpret model results and feature importance

## Required Packages
Please ensure you have the required packages installed:

```r
install.packages(c("gamlr", "rpart", "randomForest", "xgboost", "caret", "pROC"))
```


In [None]:
# --- Original R code (commented) ---
# # Load libraries with error handling
# required_packages <- c("gamlr", "rpart", "randomForest", "xgboost", "caret", "pROC", "data.table", "ggplot2")
# 
# for (pkg in required_packages) {
#   if (!require(pkg, character.only = TRUE)) {
#     cat("Installing package:", pkg, "\n")
#     install.packages(pkg)
#     library(pkg, character.only = TRUE)
#   }
# }
# 
# cat("All packages loaded successfully!\n")
# 

# --- Naive Python translation (manual review recommended) ---
# Check that the Python counterparts of the R packages are installed.
# Assumed mapping: gamlr/rpart/randomForest/caret/pROC -> scikit-learn,
# data.table -> pandas, ggplot2 -> matplotlib, xgboost -> xgboost.
import importlib.util

required_packages = ["numpy", "pandas", "matplotlib", "sklearn", "xgboost"]

missing = [pkg for pkg in required_packages if importlib.util.find_spec(pkg) is None]
if missing:
    # Note: the pip name for "sklearn" is "scikit-learn"
    print("Missing packages, install with pip:", ", ".join(missing))
else:
    print("All packages loaded successfully!")

In [None]:
# --- Original R code (commented) ---
# # Load and explore the banking dataset
# data <- fread("banking.csv")
# 
# cat("Dataset dimensions:", nrow(data), "rows and", ncol(data), "columns\n\n")
# 
# # Display structure
# str(data)
# 
# # Summary statistics
# summary(data)
# 
# # Check for missing values
# cat("\nMissing values per column:\n")
# sapply(data, function(x) sum(is.na(x)))
# 
# # Look at the target variable distribution
# cat("\nTarget variable (y) distribution:\n")
# table(data$y)
# prop.table(table(data$y))
# 

# --- Naive Python translation (manual review recommended) ---
# Load and explore the banking dataset
# sep=None lets pandas detect the delimiter, similar to data.table::fread
data = pd.read_csv("banking.csv", sep=None, engine="python")

print("Dataset dimensions:", len(data), "rows and", data.shape[1], "columns\n")

# Display structure
data.info()

# Summary statistics
print(data.describe(include="all"))

# Check for missing values
print("\nMissing values per column:")
print(data.isna().sum())

# Look at the target variable distribution
print("\nTarget variable (y) distribution:")
print(data["y"].value_counts())
print(data["y"].value_counts(normalize=True))

## Data Preparation

Before building models, we need to prepare our data properly.


In [None]:
# --- Original R code (commented) ---
# # Convert categorical variables to factors
# categorical_vars <- c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome", "y")
# data[, (categorical_vars) := lapply(.SD, as.factor), .SDcols = categorical_vars]
# 
# # Create train/test split (80/20)
# set.seed(123)  # For reproducibility
# train_idx <- sample(seq_len(nrow(data)), size = 0.8 * nrow(data))
# train_data <- data[train_idx, ]
# test_data <- data[-train_idx, ]
# 
# cat("Training set:", nrow(train_data), "observations\n")
# cat("Test set:", nrow(test_data), "observations\n")
# 
# # Check that target variable is balanced in both sets
# cat("\nTarget distribution in training set:\n")
# prop.table(table(train_data$y))
# cat("\nTarget distribution in test set:\n")
# prop.table(table(test_data$y))
# 

# --- Naive Python translation (manual review recommended) ---
# Convert categorical variables to pandas categoricals (the analogue of R factors)
categorical_vars = ["job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome", "y"]
data = data.astype({col: "category" for col in categorical_vars})

# Create train/test split (80/20)
rng = np.random.default_rng(123)  # For reproducibility; the rows drawn will differ from R's set.seed(123)
train_idx = rng.choice(len(data), size=int(0.8 * len(data)), replace=False)
train_data = data.iloc[train_idx]
test_data = data.drop(index=data.index[train_idx])

print("Training set:", len(train_data), "observations")
print("Test set:", len(test_data), "observations")

# Check that the target class proportions are similar in both sets
print("\nTarget distribution in training set:")
print(train_data["y"].value_counts(normalize=True))
print("\nTarget distribution in test set:")
print(test_data["y"].value_counts(normalize=True))

## 1. Lasso Regression

**What is Lasso?** Lasso (Least Absolute Shrinkage and Selection Operator) is a linear model that automatically selects the most important features by setting less important coefficients to zero. This helps prevent overfitting and makes the model easier to interpret.

**When to use Lasso:**
- When you have many features and want automatic feature selection
- When you need an interpretable model
- When you suspect many features are irrelevant
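
The selection behaviour described above can be seen on a small synthetic example. The sketch below is an illustration under stated assumptions, not part of the original analysis: the data are simulated so that only the first two of ten features drive the outcome, and scikit-learn's L1-penalized `LogisticRegression` stands in for `gamlr`. The penalty value `C=0.01` is an arbitrary choice made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Simulated data (assumption): 10 features, only features 0 and 1 matter
rng = np.random.default_rng(0)
n, p = 2000, 10
X = rng.normal(size=(n, p))
logits = 2.0 * X[:, 0] - 2.0 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

# Standardize so the L1 penalty treats all features on the same scale
X_std = StandardScaler().fit_transform(X)

# A small C means a strong L1 penalty (C is the inverse of the penalty strength)
lasso = LogisticRegression(penalty="l1", C=0.01, solver="liblinear")
lasso.fit(X_std, y)

coefs = lasso.coef_.ravel()
selected = np.flatnonzero(coefs != 0)
print("Indices of features with non-zero coefficients:", selected)
print("Coefficients:", np.round(coefs, 3))
```

With a strong enough penalty the coefficients of the irrelevant features are exactly zero, which is what "automatic feature selection" means in practice. Weakening the penalty (a larger `C`) lets more features back into the model.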

### Logistic Lasso: Predicting customer subscription (y)


In [None]:
# --- Original R code (commented) ---
# # Prepare data for logistic lasso (training set)
# X_train <- model.matrix(y ~ ., train_data)[, -1]  # Remove intercept
# y_train <- train_data$y
# 
# # Prepare test set
# X_test <- model.matrix(y ~ ., test_data)[, -1]
# y_test <- test_data$y
# 
# # Fit logistic lasso with cross-validation to find optimal lambda
# cv_lasso <- cv.gamlr(X_train, y_train, family = "binomial")
# 
# # Plot the cross-validation curve
# plot(cv_lasso, main = "Lasso Cross-Validation")
# cat("Optimal lambda:", cv_lasso$lambda.min, "\n")
# 
# # Get coefficients for the optimal model
# lasso_coef <- coef(cv_lasso, select = "min")
# cat("\nNumber of selected features:", sum(lasso_coef != 0) - 1, "\n")  # -1 for intercept
# 
# # Show non-zero coefficients (selected features)
# selected_features <- lasso_coef[lasso_coef != 0, , drop = FALSE]
# print(selected_features)
# 
# # Make predictions on test set
# lasso_pred_prob <- predict(cv_lasso, X_test, type = "response")
# lasso_pred_class <- ifelse(lasso_pred_prob > 0.5, "yes", "no")
# 
# # Evaluate performance
# lasso_accuracy <- mean(lasso_pred_class == y_test)
# cat("\nLasso Test Accuracy:", round(lasso_accuracy, 3), "\n")
# 
# # Confusion matrix
# lasso_cm <- table(Predicted = lasso_pred_class, Actual = y_test)
# print(lasso_cm)
# 

# --- Naive Python translation (manual review recommended) ---
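# Prepare data for logistic lasso: dummy-code the predictors, the analogue of
# model.matrix(y ~ ., ...)[, -1]
X_train = pd.get_dummies(train_data.drop(columns="y"), drop_first=True, dtype=float)
y_train = train_data["y"].astype(str)

# Prepare test set (columns aligned to the training columns)
X_test = pd.get_dummies(test_data.drop(columns="y"), drop_first=True, dtype=float)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0.0)
y_test = test_data["y"].astype(str)

# Fit logistic lasso with cross-validation to find the optimal penalty.
# Substitution: scikit-learn's L1-penalized LogisticRegressionCV stands in for cv.gamlr.
# Its C is the inverse penalty strength, so a large lambda corresponds to a small C.
scaler = preprocessing.StandardScaler().fit(X_train)
cv_lasso = linear_model.LogisticRegressionCV(Cs=20, cv=5, penalty="l1", solver="liblinear")
cv_lasso.fit(scaler.transform(X_train), y_train)

# The cross-validation curve plotted by cv.gamlr has no one-line equivalent here
print("Optimal C (inverse of the penalty strength):", np.ravel(cv_lasso.C_)[0])

# Get coefficients for the optimal model (on standardized features, intercept excluded)
lasso_coef = pd.Series(cv_lasso.coef_.ravel(), index=X_train.columns)
print("\nNumber of selected features:", int((lasso_coef != 0).sum()))

# Show non-zero coefficients (selected features)
selected_features = lasso_coef[lasso_coef != 0]
print(selected_features)

# Make predictions on test set
yes_col = list(cv_lasso.classes_).index("yes")
lasso_pred_prob = cv_lasso.predict_proba(scaler.transform(X_test))[:, yes_col]
lasso_pred_class = np.where(lasso_pred_prob > 0.5, "yes", "no")

# Evaluate performance
lasso_accuracy = np.mean(lasso_pred_class == y_test.to_numpy())
print("\nLasso Test Accuracy:", round(lasso_accuracy, 3))

# Confusion matrix
lasso_cm = pd.crosstab(lasso_pred_class, y_test.to_numpy(), rownames=["Predicted"], colnames=["Actual"])
print(lasso_cm)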

## 2. Decision Trees

**What are Decision Trees?** Decision trees make predictions by asking a series of yes/no questions about the features. They're like a flowchart that leads to a prediction.

**When to use Decision Trees:**
- When you need a highly interpretable model
- When relationships between features are non-linear
- When you want to understand the decision-making process

**Pros:** Easy to interpret, handles non-linear relationships
**Cons:** Can overfit, unstable (small data changes can create very different trees)
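
To make the flowchart idea concrete, the sketch below fits a shallow tree on simulated data and prints its rules as text. It is an illustration only: the `age` and `balance` columns and the subscription rule are invented for the example, and scikit-learn's `DecisionTreeClassifier` is used rather than `rpart`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Simulated customers (assumption): two made-up features
rng = np.random.default_rng(0)
age = rng.integers(18, 80, size=500)
balance = rng.normal(1500, 800, size=500)

# Hypothetical rule that generates the label
subscribed = ((age > 60) | (balance > 2500)).astype(int)

X = np.column_stack([age, balance])
flow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, subscribed)

# Each line of the output is one yes/no question or a leaf with the predicted class
rules = export_text(flow, feature_names=["age", "balance"])
print(rules)
print("Training accuracy:", flow.score(X, subscribed))
```

Reading the printed rules from top to bottom reproduces the sequence of questions the tree asks before it reaches a prediction.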


In [None]:
# --- Original R code (commented) ---
# # Fit a decision tree for classification
# tree_mod <- rpart(y ~ ., data = train_data, method = "class", 
#                   control = rpart.control(cp = 0.01, minsplit = 20))
# 
# # Plot the tree
# plot(tree_mod, uniform=TRUE, margin=0.1)
# text(tree_mod, use.n=TRUE, all=TRUE, cex=.8)
# title("Decision Tree for Customer Subscription Prediction")
# 
# # Print complexity parameter table
# printcp(tree_mod)
# 
# # Make predictions
# tree_pred_class <- predict(tree_mod, test_data, type = "class")
# tree_pred_prob <- predict(tree_mod, test_data, type = "prob")[, "yes"]
# 
# # Evaluate performance
# tree_accuracy <- mean(tree_pred_class == test_data$y)
# cat("\nDecision Tree Test Accuracy:", round(tree_accuracy, 3), "\n")
# 
# # Confusion matrix
# tree_cm <- table(Predicted = tree_pred_class, Actual = test_data$y)
# print(tree_cm)
# 
# # Feature importance
# cat("\nFeature Importance (Decision Tree):\n")
# importance_tree <- tree_mod$variable.importance
# print(sort(importance_tree, decreasing = TRUE)[1:10])  # Top 10
# 

# --- Naive Python translation (manual review recommended) ---
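# Fit a decision tree for classification.
# Substitution: scikit-learn's DecisionTreeClassifier stands in for rpart. It needs numeric
# inputs, so it reuses the dummy-coded X_train/X_test from the lasso cell. ccp_alpha is a
# cost-complexity penalty in the spirit of rpart's cp, but on a different scale.
tree_mod = tree.DecisionTreeClassifier(min_samples_split=20, ccp_alpha=0.001, random_state=123)
tree_mod.fit(X_train, y_train)

# Plot the tree (top levels only, for readability)
plt.figure(figsize=(14, 7))
tree.plot_tree(tree_mod, feature_names=list(X_train.columns),
               class_names=[str(c) for c in tree_mod.classes_],
               max_depth=3, filled=True, fontsize=8)
plt.title("Decision Tree for Customer Subscription Prediction")
plt.show()

# Print tree size (stands in for rpart's complexity parameter table)
print("Tree depth:", tree_mod.get_depth(), "| leaves:", tree_mod.get_n_leaves())

# Make predictions
tree_pred_class = tree_mod.predict(X_test)
tree_pred_prob = tree_mod.predict_proba(X_test)[:, list(tree_mod.classes_).index("yes")]

# Evaluate performance
tree_accuracy = np.mean(tree_pred_class == y_test.to_numpy())
print("\nDecision Tree Test Accuracy:", round(tree_accuracy, 3))

# Confusion matrix
tree_cm = pd.crosstab(tree_pred_class, y_test.to_numpy(), rownames=["Predicted"], colnames=["Actual"])
print(tree_cm)

# Feature importance
print("\nFeature Importance (Decision Tree):")
importance_tree = pd.Series(tree_mod.feature_importances_, index=X_train.columns)
print(importance_tree.sort_values(ascending=False).head(10))  # Top 10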

## 3. Random Forests

**What are Random Forests?** Random Forests combine many decision trees. Each tree is trained on a bootstrap sample of the rows and considers only a random subset of the features at each split. The final prediction is the majority vote of all trees (or their average, for regression).

**When to use Random Forests:**
- When you want better accuracy than a single decision tree
- When you have enough data (works well with large datasets)
- When you want feature importance rankings

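**Pros:** Usually more accurate than single trees, provides feature importance, can handle missing values in some implementations (R's `randomForest` needs imputation first, for example with `na.roughfix` or `rfImpute`)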
**Cons:** Less interpretable than single trees, can overfit with very noisy data
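
The mechanism described above (bootstrap samples, random feature subsets, majority vote) can be written out by hand. The sketch below is a simplified illustration on data simulated with `make_classification`. It is not how `randomForest` or scikit-learn implement forests internally, and the choice of 50 trees is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Simulated binary classification data (assumption)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_trees = 50
votes = np.zeros((n_trees, len(X_te)), dtype=int)

for b in range(n_trees):
    # Bootstrap sample: draw rows with replacement
    boot = rng.integers(0, len(X_tr), size=len(X_tr))
    # max_features="sqrt": each split considers a random subset of the features
    member = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    member.fit(X_tr[boot], y_tr[boot])
    votes[b] = member.predict(X_te)

# Majority vote across trees
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
forest_acc = (forest_pred == y_te).mean()
single_acc = (votes[0] == y_te).mean()
print("Single tree accuracy:", round(single_acc, 3))
print("Hand-rolled forest accuracy:", round(forest_acc, 3))
```

Averaging many decorrelated trees reduces the variance of a single deep tree, which is why the ensemble usually scores higher than its individual members.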


In [None]:
# --- Original R code (commented) ---
# # Fit a random forest for classification
# set.seed(123)
# rf_mod <- randomForest(y ~ ., data = train_data, ntree = 500, importance = TRUE, 
#                        mtry = sqrt(ncol(train_data) - 1))  # Standard mtry for classification
# 
# print(rf_mod)
# 
# # Plot variable importance
# varImpPlot(rf_mod, main = "Random Forest Feature Importance")
# 
# # Make predictions
# rf_pred_class <- predict(rf_mod, test_data)
# rf_pred_prob <- predict(rf_mod, test_data, type = "prob")[, "yes"]
# 
# # Evaluate performance
# rf_accuracy <- mean(rf_pred_class == test_data$y)
# cat("\nRandom Forest Test Accuracy:", round(rf_accuracy, 3), "\n")
# 
# # Confusion matrix
# rf_cm <- table(Predicted = rf_pred_class, Actual = test_data$y)
# print(rf_cm)
# 
# # Feature importance (top 10)
# importance_rf <- importance(rf_mod)[, "MeanDecreaseGini"]
# cat("\nTop 10 Most Important Features (Random Forest):\n")
# print(sort(importance_rf, decreasing = TRUE)[1:10])
# 

# --- Naive Python translation (manual review recommended) ---
# Fit a random forest for classification (scikit-learn stands in for randomForest;
# reuses the dummy-coded X_train/X_test from the lasso cell)
rf_mod = ensemble.RandomForestClassifier(
    n_estimators=500, max_features="sqrt",  # Standard choice for classification
    oob_score=True, random_state=123, n_jobs=-1)
rf_mod.fit(X_train, y_train)

print(rf_mod)
print("Out-of-bag accuracy:", round(rf_mod.oob_score_, 3))

# Feature importance (impurity-based, the analogue of MeanDecreaseGini)
importance_rf = pd.Series(rf_mod.feature_importances_, index=X_train.columns).sort_values(ascending=False)

# Plot variable importance
importance_rf.head(15).sort_values().plot.barh(title="Random Forest Feature Importance")
plt.tight_layout()
plt.show()

# Make predictions
rf_pred_class = rf_mod.predict(X_test)
rf_pred_prob = rf_mod.predict_proba(X_test)[:, list(rf_mod.classes_).index("yes")]

# Evaluate performance
rf_accuracy = np.mean(rf_pred_class == y_test.to_numpy())
print("\nRandom Forest Test Accuracy:", round(rf_accuracy, 3))

# Confusion matrix
rf_cm = pd.crosstab(rf_pred_class, y_test.to_numpy(), rownames=["Predicted"], colnames=["Actual"])
print(rf_cm)

# Feature importance (top 10)
print("\nTop 10 Most Important Features (Random Forest):")
print(importance_rf.head(10))

## 4. Boosting (using XGBoost)

**What is Boosting?** Boosting builds models sequentially, where each new model tries to correct the mistakes of the previous models. XGBoost is a popular and powerful implementation of gradient boosting.

**When to use Boosting:**
- When you want state-of-the-art predictive performance
- When you have sufficient data and computational resources
- When accuracy is more important than interpretability

**Pros:** Often achieves the best predictive performance, handles missing values, provides feature importance
**Cons:** More complex to tune, less interpretable, can overfit easily
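
The "learning from previous mistakes" idea can be shown in a few lines. The sketch below is a bare-bones gradient boosting loop for squared-error regression on simulated data: each small tree is fitted to the residuals of the current ensemble. It is an illustration of the principle, not XGBoost's algorithm, which adds regularization, second-order information and a logistic objective for classification. The learning rate, depth and number of rounds are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Simulated regression data (assumption): noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=400)

learning_rate = 0.1
pred = np.full_like(y, y.mean())          # start from a constant prediction
mse_path = [np.mean((y - pred) ** 2)]

for _ in range(100):
    residual = y - pred                   # the current ensemble's mistakes
    weak = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += learning_rate * weak.predict(X)   # take a small step toward fixing them
    mse_path.append(np.mean((y - pred) ** 2))

print("Training MSE at start:", round(mse_path[0], 4))
print("Training MSE after 100 rounds:", round(mse_path[-1], 4))
```

The training error falls with every round, which is also why boosting can overfit: the number of rounds and the learning rate need to be chosen on held-out data, as the cross-validation step in the next cell does.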


In [None]:
# --- Original R code (commented) ---
# # Prepare data for xgboost (needs numeric labels: 0/1)
# Xmat_train <- model.matrix(y ~ ., train_data)[, -1]
# Xmat_test <- model.matrix(y ~ ., test_data)[, -1]
# y_train_numeric <- as.numeric(train_data$y) - 1  # Convert to 0/1
# y_test_numeric <- as.numeric(test_data$y) - 1
# 
# # Create DMatrix objects
# dtrain <- xgb.DMatrix(data = Xmat_train, label = y_train_numeric)
# dtest <- xgb.DMatrix(data = Xmat_test, label = y_test_numeric)
# 
# # Set parameters
# params <- list(
#   objective = "binary:logistic",
#   eval_metric = "auc",
#   max_depth = 4,
#   eta = 0.1,
#   subsample = 0.8,
#   colsample_bytree = 0.8
# )
# 
# # Train with cross-validation to find optimal number of rounds
# cv_result <- xgb.cv(
#   params = params,
#   data = dtrain,
#   nrounds = 200,
#   nfold = 5,
#   early_stopping_rounds = 10,
#   verbose = 0
# )
# 
# best_nrounds <- cv_result$best_iteration
# cat("Optimal number of rounds:", best_nrounds, "\n")
# 
# # Fit final model
# xgb_mod <- xgboost(
#   params = params,
#   data = dtrain,
#   nrounds = best_nrounds,
#   verbose = 0
# )
# 
# # Feature importance
# importance <- xgb.importance(model = xgb_mod)
# xgb.plot.importance(importance[1:15, ], main = "XGBoost Feature Importance (Top 15)")
# 
# # Make predictions
# xgb_pred_prob <- predict(xgb_mod, dtest)
# xgb_pred_class <- ifelse(xgb_pred_prob > 0.5, "yes", "no")
# 
# # Evaluate performance
# xgb_accuracy <- mean(xgb_pred_class == test_data$y)
# cat("\nXGBoost Test Accuracy:", round(xgb_accuracy, 3), "\n")
# 
# # Confusion matrix
# xgb_cm <- table(Predicted = xgb_pred_class, Actual = test_data$y)
# print(xgb_cm)
# 
# # ROC curve and AUC
# roc_obj <- roc(y_test_numeric, xgb_pred_prob)
# plot(roc_obj, main = "ROC Curve (XGBoost)")
# cat("\nAUC:", round(auc(roc_obj), 3), "\n")
# 

# --- Naive Python translation (manual review recommended) ---
import xgboost as xgb

# Prepare data for xgboost (needs numeric features and 0/1 labels)
Xmat_train = pd.get_dummies(train_data.drop(columns="y"), drop_first=True, dtype=float)
Xmat_test = pd.get_dummies(test_data.drop(columns="y"), drop_first=True, dtype=float)
Xmat_test = Xmat_test.reindex(columns=Xmat_train.columns, fill_value=0.0)
y_train_numeric = (train_data["y"] == "yes").astype(int).to_numpy()  # Convert to 0/1
y_test_numeric = (test_data["y"] == "yes").astype(int).to_numpy()

# Create DMatrix objects
dtrain = xgb.DMatrix(Xmat_train, label=y_train_numeric)
dtest = xgb.DMatrix(Xmat_test, label=y_test_numeric)

# Set parameters
params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "max_depth": 4,
    "eta": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "seed": 123,
}

# Train with cross-validation to find optimal number of rounds
cv_result = xgb.cv(
    params,
    dtrain,
    num_boost_round=200,
    nfold=5,
    early_stopping_rounds=10,
    verbose_eval=False,
    seed=123,
)

# With early stopping, the returned history is truncated at the best iteration
best_nrounds = len(cv_result)
print("Optimal number of rounds:", best_nrounds)

# Fit final model
xgb_mod = xgb.train(params, dtrain, num_boost_round=best_nrounds)

# Feature importance
importance = xgb_mod.get_score(importance_type="gain")
xgb.plot_importance(xgb_mod, max_num_features=15, importance_type="gain",
                    title="XGBoost Feature Importance (Top 15)")
plt.show()

# Make predictions
xgb_pred_prob = xgb_mod.predict(dtest)
xgb_pred_class = np.where(xgb_pred_prob > 0.5, "yes", "no")

# Evaluate performance
y_test_labels = test_data["y"].astype(str).to_numpy()
xgb_accuracy = np.mean(xgb_pred_class == y_test_labels)
print("\nXGBoost Test Accuracy:", round(xgb_accuracy, 3))

# Confusion matrix
xgb_cm = pd.crosstab(xgb_pred_class, y_test_labels, rownames=["Predicted"], colnames=["Actual"])
print(xgb_cm)

# ROC curve and AUC
fpr, tpr, _ = metrics.roc_curve(y_test_numeric, xgb_pred_prob)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC Curve (XGBoost)")
plt.show()
print("\nAUC:", round(metrics.roc_auc_score(y_test_numeric, xgb_pred_prob), 3))

## 5. Model Comparison and Summary

Let's compare all our models to see which performs best.
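
When reading an accuracy comparison, it helps to know what a trivial model would score. The sketch below uses made-up labels (90 "no", 10 "yes"), not the banking data, to show that always predicting the majority class can reach high accuracy while finding none of the positive cases. The class ratio is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels (assumption): 90% "no", 10% "yes"
y_true = np.array(["no"] * 90 + ["yes"] * 10)

# A "model" that always predicts the majority class
baseline_pred = np.array(["no"] * 100)

baseline_acc = accuracy_score(y_true, baseline_pred)
baseline_recall = recall_score(y_true, baseline_pred, pos_label="yes", zero_division=0)

print("Baseline accuracy:", baseline_acc)
print("Baseline recall for 'yes':", baseline_recall)
```

If the target in the real data turns out to be imbalanced (the class proportions printed during data exploration show this), the accuracies in the comparison table are best judged against this kind of baseline, alongside the confusion matrices and the AUC.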


In [None]:
# --- Original R code (commented) ---
# # Create a comparison table
# model_comparison <- data.frame(
#   Model = c("Lasso", "Decision Tree", "Random Forest", "XGBoost"),
#   Accuracy = c(lasso_accuracy, tree_accuracy, rf_accuracy, xgb_accuracy),
#   stringsAsFactors = FALSE
# )
# 
# model_comparison$Accuracy <- round(model_comparison$Accuracy, 3)
# model_comparison <- model_comparison[order(model_comparison$Accuracy, decreasing = TRUE), ]
# 
# cat("Model Performance Comparison (Test Set Accuracy):\n")
# print(model_comparison)
# 
# # Plot comparison
# ggplot(model_comparison, aes(x = reorder(Model, Accuracy), y = Accuracy)) +
#   geom_col(fill = "steelblue", alpha = 0.7) +
#   geom_text(aes(label = Accuracy), hjust = -0.1) +
#   coord_flip() +
#   labs(title = "Model Performance Comparison", 
#        x = "Model", y = "Test Accuracy") +
#   theme_minimal() +
#   ylim(0, 1)
# 

# --- Naive Python translation (manual review recommended) ---
# Create a comparison table
model_comparison = pd.DataFrame({
    "Model": ["Lasso", "Decision Tree", "Random Forest", "XGBoost"],
    "Accuracy": [lasso_accuracy, tree_accuracy, rf_accuracy, xgb_accuracy],
})

model_comparison["Accuracy"] = model_comparison["Accuracy"].round(3)
model_comparison = model_comparison.sort_values("Accuracy", ascending=False)

print("Model Performance Comparison (Test Set Accuracy):")
print(model_comparison)

# Plot comparison (horizontal bars, best model on top)
ordered = model_comparison.sort_values("Accuracy")
fig, ax = plt.subplots()
bars = ax.barh(ordered["Model"], ordered["Accuracy"], color="steelblue", alpha=0.7)
ax.bar_label(bars, padding=3)
ax.set(title="Model Performance Comparison", xlabel="Test Accuracy", ylabel="Model", xlim=(0, 1))
plt.show()

## Key Takeaways

### Model Characteristics Summary:

1. **Lasso Regression**
   - ✅ Automatic feature selection
   - ✅ Highly interpretable
   - ✅ Fast to train and predict
   - ❌ Assumes linear relationships

2. **Decision Trees**
   - ✅ Most interpretable (visual flowchart)
   - ✅ Handles non-linear relationships
   - ✅ No assumptions about data distribution
   - ❌ Prone to overfitting
   - ❌ Unstable (small changes → different tree)

3. **Random Forest**
   - ✅ Usually more accurate than single trees
   - ✅ Relatively robust to overfitting
   - ✅ Provides feature importance
   - ✅ Handles missing values in some implementations
   - ❌ Less interpretable than single trees

4. **XGBoost**
   - ✅ Often achieves best predictive performance
   - ✅ Handles missing values
   - ✅ Built-in regularization
   - ❌ More complex to tune
   - ❌ Least interpretable
   - ❌ Can overfit if not tuned properly

### Choosing the Right Model:

- **Need interpretability?** → Decision Tree or Lasso
- **Want good performance with minimal tuning?** → Random Forest
- **Need maximum accuracy?** → XGBoost (with proper tuning)
- **Have limited data?** → Lasso or Decision Tree
- **Have lots of irrelevant features?** → Lasso or XGBoost

### Next Steps:
1. Try different hyperparameters for each model
2. Use cross-validation for more robust evaluation
3. Consider ensemble methods (combining multiple models)
4. Analyze feature importance to gain business insights
5. Test models on new, unseen data before deployment

---

# 6. Causal Machine Learning for Financial Decision Making

## Why Causal Inference Matters in Finance

**The Problem with Predictive Models:** Traditional ML models excel at prediction but struggle with **causation**. In finance, we often need to answer questions like:

- "What would happen if we **change** our interest rates?"
- "What is the **causal effect** of a marketing campaign on customer acquisition?"
- "How much **additional revenue** would we generate from a new product feature?"

**Correlation ≠ Causation:** Predictive models find patterns but can't tell us what happens when we intervene. Causal ML bridges this gap.

## Key Concepts in Causal Machine Learning

### 1. **Treatment Effects**
- **Average Treatment Effect (ATE)**: Average impact of treatment across population
- **Conditional Average Treatment Effect (CATE)**: Treatment effect for specific subgroups
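- **Individual Treatment Effect (ITE)**: Treatment effect for a single individual; it is never observed directly, so it is usually approximated by the CATE at that individual's features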

### 2. **Confounding**
Variables that affect both treatment assignment and outcome, creating spurious correlations.
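
A short simulation makes the problem visible. In the sketch below, everything is invented for illustration: `income` is a hypothetical confounder that raises both the chance of treatment and the outcome, and the true treatment effect is fixed at 1.0. The naive difference in means is compared with a regression that adjusts for the confounder.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Hypothetical confounder: affects both treatment assignment and outcome
income = rng.normal(size=n)
p_treat = 1.0 / (1.0 + np.exp(-2.0 * income))
treated = rng.binomial(1, p_treat)

true_ate = 1.0
outcome = true_ate * treated + 3.0 * income + rng.normal(size=n)

# Naive estimate: difference in mean outcomes between treated and untreated
naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()

# Adjusted estimate: least squares of outcome on treatment and the confounder
design = np.column_stack([np.ones(n), treated, income])
beta, *_ = np.linalg.lstsq(design, outcome, rcond=None)
adjusted = beta[1]

print("True ATE:", true_ate)
print("Naive difference in means:", round(naive, 3))
print("Adjusted for confounder:", round(adjusted, 3))
```

The naive estimate mixes the treatment effect with the effect of income, because treated units have higher income on average. Adjustment works here only because the confounder is observed and the outcome model is linear by construction.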

### 3. **Identification Strategies**
- **Randomized Experiments**: Gold standard but often impractical
- **Natural Experiments**: Exploit random variation in real-world settings
- **Instrumental Variables**: Use variables that shift treatment but affect the outcome only through the treatment
- **Regression Discontinuity**: Exploit arbitrary cutoffs in treatment assignment

## Modern Causal ML Methods

### 1. **Double/Debiased Machine Learning (DML)**
Combines ML flexibility with causal inference rigor by:
- Using ML to model nuisance parameters
- Applying cross-fitting to avoid overfitting bias
- Providing valid confidence intervals
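
The three ingredients above can be sketched for the partially linear model, where the outcome is a constant treatment effect times the treatment plus an unknown function of the covariates. The example below is a minimal illustration on simulated data with a known effect of 1.0. The random forest nuisance models, the five folds and the data-generating process are all assumptions made for the sketch, and `cross_val_predict` is used as a simple form of cross-fitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Simulated data (assumption): covariates confound a continuous treatment
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
W = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=n)
true_effect = 1.0
Y = true_effect * W + np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(size=n)

def make_rf():
    return RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)

# Nuisance models with cross-fitting: each prediction comes from a model
# that did not see that observation during training
y_hat = cross_val_predict(make_rf(), X, Y, cv=5)
w_hat = cross_val_predict(make_rf(), X, W, cv=5)

# Residualize, then regress outcome residuals on treatment residuals
y_res = Y - y_hat
w_res = W - w_hat
theta = np.sum(w_res * y_res) / np.sum(w_res ** 2)

# Standard error from the estimating equation, and a 95% confidence interval
psi = (y_res - theta * w_res) * w_res
se = np.sqrt(np.mean(psi ** 2) / n) / np.mean(w_res ** 2)

# For comparison: slope from regressing Y on W without any controls
naive = np.cov(W, Y)[0, 1] / np.var(W, ddof=1)

print("Naive slope:", round(naive, 3))
print("DML estimate:", round(theta, 3), "+/-", round(1.96 * se, 3))
```

Residualizing both the treatment and the outcome removes the part of each that the covariates explain, so the final regression uses only the variation in treatment that is not predictable from the covariates.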

### 2. **Causal Forests**
Extension of Random Forests for heterogeneous treatment effects:
- Estimates personalized treatment effects
- Provides uncertainty quantification
- Handles high-dimensional data

### 3. **Meta-Learners**
- **T-Learner**: Separate models for treated/control groups
- **S-Learner**: Single model with treatment as feature
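- **X-Learner**: Starts from T-learner models, imputes individual effects within each group, and combines the two resulting effect models using propensity-score weights
- **R-Learner**: Regresses outcome residuals on treatment residuals, after both have been predicted from the covariates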
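
The two simplest meta-learners fit in a few lines. The sketch below is an illustration on simulated data where treatment is randomized and the true effect is `1 + X[:, 0]`. The random forest base learner and all data-generating choices are assumptions for the example, and the X- and R-learners are omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Simulated data (assumption): randomized treatment, effect varies with the first feature
rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 3))
W = rng.binomial(1, 0.5, size=n)
tau = 1.0 + X[:, 0]                                   # true CATE
Y = X[:, 1] + tau * W + rng.normal(scale=0.5, size=n)

def make_rf():
    return RandomForestRegressor(n_estimators=100, min_samples_leaf=10, random_state=0)

# T-learner: one outcome model per treatment arm, CATE = difference of predictions
m1 = make_rf().fit(X[W == 1], Y[W == 1])
m0 = make_rf().fit(X[W == 0], Y[W == 0])
cate_t = m1.predict(X) - m0.predict(X)

# S-learner: a single model with treatment as an extra feature,
# CATE = prediction with W set to 1 minus prediction with W set to 0
s_model = make_rf().fit(np.column_stack([X, W]), Y)
cate_s = (s_model.predict(np.column_stack([X, np.ones(n)]))
          - s_model.predict(np.column_stack([X, np.zeros(n)])))

print("True ATE:", round(tau.mean(), 3))
print("T-learner ATE:", round(cate_t.mean(), 3))
print("S-learner ATE:", round(cate_s.mean(), 3))
print("Correlation of T-learner CATE with truth:", round(np.corrcoef(cate_t, tau)[0, 1], 3))
```

The S-learner tends to shrink effects toward zero when the base learner gives the treatment indicator little weight, while the T-learner can be noisy when one arm has few observations.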

## Financial Applications

### 1. **Credit Risk & Lending**
- **Question**: "What's the causal effect of credit limit increases on default risk?"
- **Challenge**: Customers who get increases may be systematically different
- **Solution**: Use causal ML to control for selection bias

### 2. **Marketing & Customer Acquisition**
- **Question**: "Which customers benefit most from promotional offers?"
- **Application**: Personalized treatment effects for targeted campaigns
- **Business Value**: Optimize marketing spend and ROI

### 3. **Product Pricing**
- **Question**: "How do price changes affect demand across customer segments?"
- **Application**: Heterogeneous price elasticity estimation
- **Business Value**: Dynamic pricing strategies

### 4. **Regulatory Compliance**
- **Question**: "Do our algorithms create disparate impact across protected groups?"
- **Application**: Causal fairness analysis
- **Business Value**: Ensure fair lending practices


## R Packages for Causal Machine Learning

Let's explore the latest R packages for causal inference:


In [None]:
# --- Original R code (commented) ---
# # Install and load causal ML packages
# causal_packages <- c(
#   "grf",           # Generalized Random Forests (Causal Forests)
#   "DoubleML",      # Double/Debiased Machine Learning
#   "causalTree",    # Causal Trees
#   "hdm",           # High-dimensional econometrics
#   "ATE",           # Average Treatment Effects
#   "causaldrf",     # Causal Dose Response Functions
#   "BART",          # Bayesian Additive Regression Trees
#   "tmle",          # Targeted Maximum Likelihood Estimation
#   "SuperLearner"   # Ensemble methods for causal inference
# )
# 
# for (pkg in causal_packages) {
#   if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
#     cat("Installing package:", pkg, "\n")
#     install.packages(pkg, quiet = TRUE)
#     suppressPackageStartupMessages(library(pkg, character.only = TRUE))
#   }
# }
# 
# cat("Causal ML packages loaded successfully!\n")
# 

# --- Naive Python translation (manual review recommended) ---
# Check for Python causal ML packages.
# Assumed counterparts (not a one-to-one mapping of the R list above):
import importlib.util

causal_packages = {
    "econml": "causal forests, DML and meta-learners (stands in for grf, causalTree)",
    "doubleml": "Double/Debiased Machine Learning (Python counterpart of DoubleML)",
}

for pkg, purpose in causal_packages.items():
    if importlib.util.find_spec(pkg) is None:
        print(f"Package not found: {pkg} ({purpose}); install with: pip install {pkg}")

print("Causal ML package check complete.")

## Practical Example: Causal Analysis of Marketing Campaign

Let's demonstrate causal ML using our banking dataset. We'll treat a successful previous marketing campaign (`poutcome == "success"`) as the treatment and estimate its effect on the subscription decision (`y`).


In [None]:
# --- Original R code (commented) ---
# # Prepare data for causal analysis
# # Create a binary treatment variable: previous campaign success
# causal_data <- data.table::copy(data)
# causal_data[, treatment := ifelse(poutcome == "success", 1, 0)]
# causal_data[, outcome := ifelse(y == "yes", 1, 0)]
# 
# # Remove the original variables to avoid data leakage
# causal_data[, c("poutcome", "y") := NULL]
# 
# # Convert factors to numeric for some methods
# factor_cols <- sapply(causal_data, is.factor)
# causal_data_numeric <- causal_data
# causal_data_numeric[, (names(factor_cols)[factor_cols]) := lapply(.SD, as.numeric), .SDcols = names(factor_cols)[factor_cols]]
# 
# cat("Treatment distribution:\n")
# table(causal_data$treatment)
# cat("\nOutcome by treatment:\n")
# table(causal_data$treatment, causal_data$outcome)
# 
# # Simple difference in means (biased estimate)
# naive_ate <- mean(causal_data[treatment == 1, outcome]) - mean(causal_data[treatment == 0, outcome])
# cat("\nNaive ATE (difference in means):", round(naive_ate, 3), "\n")
# cat("This is likely biased due to confounding!\n")
# 

# --- Naive Python translation (manual review recommended) ---
# Prepare data for causal analysis
# Create a binary treatment variable: previous campaign success
causal_data = data.copy()
causal_data["treatment"] = (causal_data["poutcome"] == "success").astype(int)
causal_data["outcome"] = (causal_data["y"] == "yes").astype(int)

# Remove the original variables to avoid data leakage
causal_data = causal_data.drop(columns=["poutcome", "y"])

# Convert categoricals to integer codes for some methods
# (1-based, like as.numeric on an R factor)
causal_data_numeric = causal_data.copy()
factor_cols = causal_data_numeric.select_dtypes(exclude="number").columns
for col in factor_cols:
    causal_data_numeric[col] = causal_data_numeric[col].astype("category").cat.codes + 1

print("Treatment distribution:")
print(causal_data["treatment"].value_counts())
print("\nOutcome by treatment:")
print(pd.crosstab(causal_data["treatment"], causal_data["outcome"]))

# Simple difference in means (biased estimate)
is_treated = causal_data["treatment"] == 1
naive_ate = causal_data.loc[is_treated, "outcome"].mean() - causal_data.loc[~is_treated, "outcome"].mean()
print("\nNaive ATE (difference in means):", round(naive_ate, 3))
print("This is likely biased due to confounding!")

### Method 1: Causal Forest (Generalized Random Forest)

Causal Forests estimate heterogeneous treatment effects while adjusting for observed confounders.


In [None]:
# --- Original R code (commented) ---
# # Prepare data for causal forest
# X <- as.matrix(causal_data_numeric[, !c("treatment", "outcome")])
# Y <- causal_data_numeric$outcome
# W <- causal_data_numeric$treatment
# 
# # Fit causal forest
# set.seed(123)
# cf <- causal_forest(X, Y, W, num.trees = 2000)
# 
# # Get average treatment effect
# ate_cf <- average_treatment_effect(cf)
# cat("Causal Forest ATE:", round(ate_cf[1], 3), "\n")
# cat("95% CI: [", round(ate_cf[1] - 1.96 * ate_cf[2], 3), ", ", 
#     round(ate_cf[1] + 1.96 * ate_cf[2], 3), "]\n")
# 
# # Get individual treatment effects
# tau_hat <- predict(cf)$predictions
# 
# # Analyze heterogeneity
# cat("\nTreatment Effect Heterogeneity:\n")
# cat("Mean ITE:", round(mean(tau_hat), 3), "\n")
# cat("Std Dev ITE:", round(sd(tau_hat), 3), "\n")
# cat("Min ITE:", round(min(tau_hat), 3), "\n")
# cat("Max ITE:", round(max(tau_hat), 3), "\n")
# 
# # Plot distribution of treatment effects
# hist(tau_hat, breaks = 30, main = "Distribution of Individual Treatment Effects",
#      xlab = "Treatment Effect", col = "lightblue", border = "white")
# abline(v = mean(tau_hat), col = "red", lwd = 2, lty = 2)
# legend("topright", "Mean ITE", col = "red", lty = 2, lwd = 2)
# 

# --- Naive Python translation (manual review recommended) ---
# Prepare data for causal forest
X = causal_data_numeric.drop(columns=["treatment", "outcome"])
Y = causal_data_numeric["outcome"].to_numpy()
W = causal_data_numeric["treatment"].to_numpy()

# Fit causal forest.
# Substitution: econml's CausalForestDML stands in for grf::causal_forest
# (assumes `pip install econml`; estimates will not match grf exactly)
from econml.dml import CausalForestDML

cf = CausalForestDML(discrete_treatment=True, n_estimators=2000, random_state=123)
cf.fit(Y, W, X=X)

# Get average treatment effect (average of the estimated effects over the sample)
ate_cf = float(np.ravel(cf.ate(X))[0])
ci_low, ci_high = (float(np.ravel(bound)[0]) for bound in cf.ate_interval(X, alpha=0.05))
print("Causal Forest ATE:", round(ate_cf, 3))
print("95% CI: [", round(ci_low, 3), ", ", round(ci_high, 3), "]")

# Get individual treatment effects
tau_hat = np.ravel(cf.effect(X))

# Analyze heterogeneity
print("\nTreatment Effect Heterogeneity:")
print("Mean ITE:", round(tau_hat.mean(), 3))
print("Std Dev ITE:", round(tau_hat.std(ddof=1), 3))
print("Min ITE:", round(tau_hat.min(), 3))
print("Max ITE:", round(tau_hat.max(), 3))

# Plot distribution of treatment effects
plt.hist(tau_hat, bins=30, color="lightblue", edgecolor="white")
plt.axvline(tau_hat.mean(), color="red", linewidth=2, linestyle="--", label="Mean ITE")
plt.title("Distribution of Individual Treatment Effects")
plt.xlabel("Treatment Effect")
plt.legend(loc="upper right")
plt.show()

In [None]:
# --- Original R code (commented) ---
# # Variable importance for treatment effect heterogeneity
# var_imp <- variable_importance(cf)
# var_names <- colnames(X)
# 
# # Create importance data frame
# importance_df <- data.frame(
#   Variable = var_names,
#   Importance = var_imp,
#   stringsAsFactors = FALSE
# )
# importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE), ]
# 
# cat("\nTop 10 Variables for Treatment Effect Heterogeneity:\n")
# print(head(importance_df, 10))
# 
# # Plot variable importance
# top_vars <- head(importance_df, 10)
# barplot(top_vars$Importance, names.arg = top_vars$Variable, 
#         main = "Variable Importance for Treatment Effect Heterogeneity",
#         las = 2, cex.names = 0.8, col = "steelblue")
# 

# --- Naive Python translation (manual review recommended) ---
# Variable importance for treatment effect heterogeneity
# Assumes `cf` is a fitted estimator exposing `feature_importances_`
# (for example EconML's CausalForestDML)
var_imp = np.ravel(cf.feature_importances_)
var_names = causal_data_numeric.columns.drop(["treatment", "outcome"]).tolist()

# Create importance data frame, sorted from most to least important
importance_df = pd.DataFrame({"Variable": var_names, "Importance": var_imp})
importance_df = importance_df.sort_values("Importance", ascending=False)

print("\nTop 10 Variables for Treatment Effect Heterogeneity:")
print(importance_df.head(10))

# Plot variable importance
top_vars = importance_df.head(10)
plt.bar(top_vars["Variable"], top_vars["Importance"], color="steelblue")
plt.title("Variable Importance for Treatment Effect Heterogeneity")
plt.xticks(rotation=90, fontsize=8)
plt.tight_layout()
plt.show()

### Method 2: Double Machine Learning (DML)

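Double machine learning (DML) uses flexible ML models to predict both the treatment and the outcome from the covariates, then estimates the treatment effect from the residuals of those two predictions. Because the covariate-driven part of each variable is removed first, the estimate is less sensitive to observed confounding than a naive comparison of treated and control customers.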
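
A common refinement of this residual-on-residual idea is cross-fitting, where each observation's predictions come from models trained on the other folds. The sketch below illustrates it on synthetic data; every name in it (`X_sim`, `W_sim`, `Y_sim`, `TRUE_EFFECT`) and the data-generating process are assumptions made for illustration and are not part of the banking dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Synthetic data (hypothetical): X0 drives both treatment and outcome (confounding)
rng = np.random.default_rng(0)
n = 2000
X_sim = rng.normal(size=(n, 5))
propensity = 1.0 / (1.0 + np.exp(-X_sim[:, 0]))
W_sim = rng.binomial(1, propensity)
TRUE_EFFECT = 2.0
Y_sim = TRUE_EFFECT * W_sim + X_sim[:, 0] + 0.5 * X_sim[:, 1] + rng.normal(size=n)

# Naive difference in means, biased by confounding through X0
naive_sim = Y_sim[W_sim == 1].mean() - Y_sim[W_sim == 0].mean()

# Cross-fitted nuisance predictions: each row is predicted by a model
# trained on the other folds
clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=20, random_state=0)
reg = RandomForestRegressor(n_estimators=100, min_samples_leaf=20, random_state=0)
w_hat = cross_val_predict(clf, X_sim, W_sim, cv=5, method="predict_proba")[:, 1]
y_hat_sim = cross_val_predict(reg, X_sim, Y_sim, cv=5)

# Residual-on-residual (partialling-out) estimate of the treatment effect
w_res = W_sim - w_hat
y_res = Y_sim - y_hat_sim
ate_dml_sim = np.sum(w_res * y_res) / np.sum(w_res ** 2)

print(f"Naive: {naive_sim:.3f}  DML: {ate_dml_sim:.3f}  Truth: {TRUE_EFFECT}")
```

In this simulation the naive estimate absorbs the effect of `X0`, while the residualized estimate should land much closer to the true effect.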


In [None]:
# --- Original R code (commented) ---
# # Note: DoubleML package requires Python backend
# # Here we'll demonstrate the concept using manual implementation
# 
# # Step 1: Predict treatment (propensity score)
# ps_model <- randomForest(as.factor(treatment) ~ ., 
#                         data = causal_data_numeric[, !"outcome"], 
#                         ntree = 500)
# ps_hat <- predict(ps_model, type = "prob")[, "1"]
# 
# # Step 2: Predict outcome
# outcome_model <- randomForest(outcome ~ ., 
#                              data = causal_data_numeric[, !"treatment"], 
#                              ntree = 500)
# y_hat <- predict(outcome_model)
# 
# # Step 3: Compute residuals
# y_residual <- Y - y_hat
# w_residual <- W - ps_hat
# 
# # Step 4: Estimate ATE using residuals
# ate_dml <- sum(w_residual * y_residual) / sum(w_residual^2)
# 
# cat("Double ML ATE estimate:", round(ate_dml, 3), "\n")
# 
# # Compare propensity scores by treatment group
# cat("\nPropensity Score Balance Check:\n")
# cat("Mean PS (Treated):", round(mean(ps_hat[W == 1]), 3), "\n")
# cat("Mean PS (Control):", round(mean(ps_hat[W == 0]), 3), "\n")
# 
# # Plot propensity score distribution
# hist(ps_hat[W == 0], breaks = 20, col = rgb(1,0,0,0.5), 
#      main = "Propensity Score Distribution", xlab = "Propensity Score",
#      xlim = c(0, 1))
# hist(ps_hat[W == 1], breaks = 20, col = rgb(0,0,1,0.5), add = TRUE)
# legend("topright", c("Control", "Treated"), fill = c(rgb(1,0,0,0.5), rgb(0,0,1,0.5)))
# 

# --- Naive Python translation (manual review recommended) ---
# Note: the DoubleML package also has a Python implementation.
# Here we demonstrate the concept with a manual implementation.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Step 1: Predict treatment (propensity score)
# Out-of-bag probabilities mirror R's predict(ps_model, type = "prob") without newdata
ps_model = RandomForestClassifier(n_estimators=500, oob_score=True)
ps_model.fit(X, W)
ps_hat = ps_model.oob_decision_function_[:, 1]  # column 1 = P(treatment == 1)

# Step 2: Predict outcome (out-of-bag predictions)
outcome_model = RandomForestRegressor(n_estimators=500, oob_score=True)
outcome_model.fit(X, Y)
y_hat = outcome_model.oob_prediction_

# Step 3: Compute residuals
y_residual = Y - y_hat
w_residual = W - ps_hat

# Step 4: Estimate ATE using residuals
ate_dml = np.sum(w_residual * y_residual) / np.sum(w_residual ** 2)

print(f"Double ML ATE estimate: {ate_dml:.3f}")

# Compare propensity scores by treatment group
print("\nPropensity Score Balance Check:")
print(f"Mean PS (Treated): {ps_hat[W == 1].mean():.3f}")
print(f"Mean PS (Control): {ps_hat[W == 0].mean():.3f}")

# Plot propensity score distribution
plt.hist(ps_hat[W == 0], bins=20, color=(1, 0, 0, 0.5), label="Control")
plt.hist(ps_hat[W == 1], bins=20, color=(0, 0, 1, 0.5), label="Treated")
plt.title("Propensity Score Distribution")
plt.xlabel("Propensity Score")
plt.xlim(0, 1)
plt.legend(loc="upper right")
plt.show()

### Method 3: Meta-Learners (T-Learner and X-Learner)

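Meta-learners combine off-the-shelf ML models to estimate treatment effects. The T-learner fits one outcome model on treated units and another on control units, then takes the difference of their predictions as the individual treatment effect. The X-learner adds a second stage that models imputed individual effects within each group and blends the two resulting estimates using the propensity score.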
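
The comparison in this notebook uses the T-learner only. As a complement, here is a minimal X-learner sketch on synthetic data. The data-generating process and all names (`X_sim`, `tau_true`, `forest`, and so on) are hypothetical, and random forests are just one possible choice of base learner.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic data (hypothetical): the true effect is 1 + X0; X1 confounds treatment and outcome
rng = np.random.default_rng(1)
n = 2000
X_sim = rng.normal(size=(n, 4))
e_true = 1.0 / (1.0 + np.exp(-0.5 * X_sim[:, 1]))
W_sim = rng.binomial(1, e_true)
tau_true = 1.0 + X_sim[:, 0]
Y_sim = X_sim[:, 1] + tau_true * W_sim + rng.normal(scale=0.5, size=n)

def forest():
    """Fresh base learner; any regressor with fit/predict would work here."""
    return RandomForestRegressor(n_estimators=100, min_samples_leaf=10, random_state=0)

treated, control = W_sim == 1, W_sim == 0

# Stage 1: one outcome model per group (this is the T-learner step)
mu1 = forest().fit(X_sim[treated], Y_sim[treated])
mu0 = forest().fit(X_sim[control], Y_sim[control])

# Stage 2: impute individual effects, then model them within each group
d_treated = Y_sim[treated] - mu0.predict(X_sim[treated])
d_control = mu1.predict(X_sim[control]) - Y_sim[control]
tau_from_treated = forest().fit(X_sim[treated], d_treated)
tau_from_control = forest().fit(X_sim[control], d_control)

# Stage 3: blend the two effect models, weighting by the estimated propensity score
ps = RandomForestClassifier(n_estimators=100, min_samples_leaf=20, random_state=0)
g = ps.fit(X_sim, W_sim).predict_proba(X_sim)[:, 1]
tau_xlearner = (g * tau_from_control.predict(X_sim)
                + (1 - g) * tau_from_treated.predict(X_sim))

print(f"X-learner ATE: {tau_xlearner.mean():.3f} (true ATE: {tau_true.mean():.3f})")
```

The propensity weighting gives more weight to the effect model built from the treated group where treated units are rare, and vice versa, which is the usual motivation for the X-learner when the groups are unbalanced.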


In [None]:
# --- Original R code (commented) ---
# # T-Learner: Separate models for treated and control
# treated_data <- causal_data_numeric[treatment == 1, !c("treatment")]
# control_data <- causal_data_numeric[treatment == 0, !c("treatment")]
# 
# # Fit separate models
# model_treated <- randomForest(outcome ~ ., data = treated_data, ntree = 500)
# model_control <- randomForest(outcome ~ ., data = control_data, ntree = 500)
# 
# # Predict for all observations
# X_all <- causal_data_numeric[, !c("treatment", "outcome")]
# mu1_hat <- predict(model_treated, X_all)  # Potential outcome if treated
# mu0_hat <- predict(model_control, X_all)  # Potential outcome if control
# 
# # Individual treatment effects
# tau_tlearner <- mu1_hat - mu0_hat
# ate_tlearner <- mean(tau_tlearner)
# 
# cat("T-Learner ATE:", round(ate_tlearner, 3), "\n")
# cat("T-Learner ITE std dev:", round(sd(tau_tlearner), 3), "\n")
# 
# # Compare methods
# comparison_df <- data.frame(
#   Method = c("Naive", "Causal Forest", "Double ML", "T-Learner"),
#   ATE = c(naive_ate, ate_cf[1], ate_dml, ate_tlearner),
#   stringsAsFactors = FALSE
# )
# comparison_df$ATE <- round(comparison_df$ATE, 3)
# 
# cat("\nMethod Comparison:\n")
# print(comparison_df)
# 
# # Plot comparison
# barplot(comparison_df$ATE, names.arg = comparison_df$Method,
#         main = "Average Treatment Effect Estimates",
#         ylab = "ATE", col = c("red", "green", "blue", "orange"),
#         las = 2)
# abline(h = 0, lty = 2)
# 

# --- Naive Python translation (manual review recommended) ---
# T-Learner: Separate models for treated and control
from sklearn.ensemble import RandomForestRegressor

treated_data = causal_data_numeric[causal_data_numeric["treatment"] == 1].drop(columns=["treatment"])
control_data = causal_data_numeric[causal_data_numeric["treatment"] == 0].drop(columns=["treatment"])

# Fit separate models
model_treated = RandomForestRegressor(n_estimators=500)
model_treated.fit(treated_data.drop(columns=["outcome"]), treated_data["outcome"])
model_control = RandomForestRegressor(n_estimators=500)
model_control.fit(control_data.drop(columns=["outcome"]), control_data["outcome"])

# Predict for all observations
X_all = causal_data_numeric.drop(columns=["treatment", "outcome"])
mu1_hat = model_treated.predict(X_all)  # Potential outcome if treated
mu0_hat = model_control.predict(X_all)  # Potential outcome if control

# Individual treatment effects
tau_tlearner = mu1_hat - mu0_hat
ate_tlearner = tau_tlearner.mean()

print(f"T-Learner ATE: {ate_tlearner:.3f}")
print(f"T-Learner ITE std dev: {tau_tlearner.std(ddof=1):.3f}")

# Compare methods (ate_cf[0] is the causal forest point estimate)
comparison_df = pd.DataFrame({
    "Method": ["Naive", "Causal Forest", "Double ML", "T-Learner"],
    "ATE": [naive_ate, ate_cf[0], ate_dml, ate_tlearner],
})
comparison_df["ATE"] = comparison_df["ATE"].round(3)

print("\nMethod Comparison:")
print(comparison_df)

# Plot comparison
plt.bar(comparison_df["Method"], comparison_df["ATE"],
        color=["red", "green", "blue", "orange"])
plt.axhline(0, color="black", linestyle="--")
plt.title("Average Treatment Effect Estimates")
plt.ylabel("ATE")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

## Business Insights and Recommendations

### Interpreting the Results

1. **Naive vs. Causal Estimates**: The difference between naive and causal estimates reveals the extent of confounding bias

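2. **Treatment Effect Heterogeneity**: A wide spread of individual treatment effects (see the ITE standard deviation and histogram) suggests that personalized strategies could be valuable; a narrow spread suggests a uniform policy may be sufficient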

3. **Key Drivers**: Variables important for treatment effect heterogeneity identify customer segments that respond differently to campaigns

### Financial Decision Making Applications

#### 1. **Personalized Marketing**
```r
# Identify high-response customers
high_responders <- which(tau_hat > quantile(tau_hat, 0.8))
# Target future campaigns to these customers
```
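
For readers following the Python cells, a rough equivalent of the R snippet above is sketched here. `tau_hat_demo` is a hypothetical stand-in for the estimated individual effects; in the notebook you would use `tau_hat` instead.

```python
import numpy as np

# Hypothetical stand-in for the estimated individual treatment effects
rng = np.random.default_rng(42)
tau_hat_demo = rng.normal(loc=0.05, scale=0.02, size=1000)

# Customers whose estimated effect exceeds the 80th percentile
threshold = np.quantile(tau_hat_demo, 0.8)
high_responders = np.flatnonzero(tau_hat_demo > threshold)

print(f"Threshold: {threshold:.4f}, customers selected: {high_responders.size}")
```

Note that `np.flatnonzero` returns 0-based row positions, whereas R's `which()` returns 1-based indices.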

#### 2. **Resource Allocation**
- **Budget Optimization**: Focus marketing spend on customers with highest predicted treatment effects
- **Channel Selection**: Use causal analysis to determine most effective communication channels
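
One way to turn the budget idea into a rule is sketched below. It assumes that each estimated effect is an incremental conversion probability and that value per conversion and cost per contact are known constants; the function `select_targets` and all of its parameters are hypothetical.

```python
import numpy as np

def select_targets(tau, value_per_conversion, cost_per_contact, budget):
    """Return indices of customers to contact, highest expected net gain first."""
    tau = np.asarray(tau, dtype=float)
    net_gain = tau * value_per_conversion - cost_per_contact
    order = np.argsort(-net_gain)              # descending expected net gain
    profitable = order[net_gain[order] > 0]    # skip customers expected to lose money
    max_contacts = int(budget // cost_per_contact)
    return profitable[:max_contacts]

# Toy example: five customers, budget for two contacts
tau_demo = np.array([0.10, 0.02, 0.30, -0.05, 0.15])
chosen = select_targets(tau_demo, value_per_conversion=100.0,
                        cost_per_contact=5.0, budget=12.0)
print(chosen)  # → [2 4]
```

Here customers 2 and 4 have the highest expected net gain; customer 0 is also profitable but falls outside the budget.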

#### 3. **Risk Management**
- **Adverse Selection**: Identify if certain interventions attract riskier customers
- **Regulatory Compliance**: Ensure marketing strategies don't create unfair disparate impact

#### 4. **Product Development**
- **Feature Testing**: Use causal ML to evaluate impact of new product features
- **Pricing Strategy**: Estimate causal price elasticity across customer segments

### Implementation Checklist

✅ **Data Requirements**
- Sufficient variation in treatment assignment
- Rich set of pre-treatment covariates
- Clear temporal ordering (cause before effect)

✅ **Model Validation**
- Check overlap in propensity scores
- Test for balance in covariates
- Validate using randomized experiments when possible
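
The first two validation checks can be sketched as small helpers. The functions `standardized_mean_differences` and `overlap_share`, the 0.05 and 0.95 cut-offs, and the synthetic data are illustrative assumptions rather than part of the original analysis.

```python
import numpy as np
import pandas as pd

def standardized_mean_differences(X, w):
    """SMD per covariate: difference in group means over the pooled standard deviation."""
    X = pd.DataFrame(X)
    w = np.asarray(w)
    treated, control = X[w == 1], X[w == 0]
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2.0)
    return (treated.mean() - control.mean()) / pooled_sd

def overlap_share(ps, low=0.05, high=0.95):
    """Share of units whose propensity score lies inside [low, high]."""
    ps = np.asarray(ps)
    return float(np.mean((ps >= low) & (ps <= high)))

# Synthetic illustration: only "age" influences treatment assignment
rng = np.random.default_rng(7)
demo = pd.DataFrame({"age": rng.normal(size=5000), "balance": rng.normal(size=5000)})
ps_demo = 1.0 / (1.0 + np.exp(-demo["age"].to_numpy()))
w_demo = rng.binomial(1, ps_demo)

smd = standardized_mean_differences(demo, w_demo)
print(smd.round(3))
print(f"Share with 0.05 <= PS <= 0.95: {overlap_share(ps_demo):.3f}")
```

A common rule of thumb treats an absolute SMD below about 0.1 as acceptable balance, so in this toy data `age` would be flagged and `balance` would not.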

✅ **Business Integration**
- Translate causal estimates into business metrics (ROI, CLV)
- Build decision support tools for stakeholders
- Monitor performance of causal-ML-driven decisions

### Latest Developments (2024/25)

1. **Causal Representation Learning**: Deep learning methods for causal inference
2. **Federated Causal Inference**: Privacy-preserving causal analysis across institutions
3. **Causal Fairness**: Ensuring algorithmic decisions are causally fair
4. **Dynamic Treatment Regimes**: Optimal sequential decision making

### Recommended Reading

- **Books**: "Causal Inference: The Mixtape" by Scott Cunningham
- **Papers**: Athey & Imbens (2019) on Machine Learning for Causal Inference
- **R Packages**: `grf`, `DoubleML`, `causalTree` documentation

