# Advanced Predictive Models in R: Lasso, Trees, Random Forests, and Boosting

---
This notebook demonstrates how to use more advanced machine learning models in R for predicting customer behavior in banking. We'll cover:

- **Lasso regression**: A linear model that automatically selects important features
- **Decision Trees**: Easy-to-interpret models that make decisions like a flowchart
- **Random Forests**: An ensemble that combines many decision trees for better predictions
- **Boosting**: An ensemble built sequentially, with each new model learning from previous mistakes

## Learning Objectives
By the end of this notebook, you will:
1. Understand when to use each type of model
2. Know how to prepare data for machine learning
3. Be able to train, evaluate, and compare different models
4. Interpret model results and feature importance

## Required Packages
Please ensure you have the required packages installed:

```r
install.packages(c("gamlr", "rpart", "randomForest", "xgboost", "caret", "pROC", "data.table", "ggplot2"))
```


In [ ]:
# Load libraries, installing any that are missing
required_packages <- c("gamlr", "rpart", "randomForest", "xgboost", "caret", "pROC", "data.table", "ggplot2")

for (pkg in required_packages) {
  if (!require(pkg, character.only = TRUE)) {
    cat("Installing package:", pkg, "\n")
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}

cat("All packages loaded successfully!\n")


In [ ]:
# Load and explore the banking dataset
data <- fread("banking.csv")

cat("Dataset dimensions:", nrow(data), "rows and", ncol(data), "columns\n\n")

# Display structure
str(data)

# Summary statistics
summary(data)

# Check for missing values
cat("\nMissing values per column:\n")
sapply(data, function(x) sum(is.na(x)))

# Look at the target variable distribution
cat("\nTarget variable (y) distribution:\n")
table(data$y)
prop.table(table(data$y))


## Data Preparation

Before building models, we need to prepare our data properly.
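
A plain random split usually keeps the class proportions close in both sets, but when the positive class is rare a stratified split guarantees it. The sketch below shows one possible way to do this in base R. The helper `stratified_split` and the toy target are hypothetical and are not part of this notebook's pipeline.

```r
# Hypothetical helper: sample the same fraction of rows within each class.
stratified_split <- function(y, train_frac = 0.8, seed = 123) {
  set.seed(seed)
  idx_by_class <- split(seq_along(y), y)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    # Index via sample.int() so a class with a single row is handled correctly
    idx[sample.int(length(idx), size = floor(train_frac * length(idx)))]
  }), use.names = FALSE)
  sort(train_idx)
}

# Toy target with a 10% positive class (made-up data)
y_toy <- factor(rep(c("no", "yes"), times = c(900, 100)))
train_idx_toy <- stratified_split(y_toy)

prop.table(table(y_toy[train_idx_toy]))   # class proportions in the training rows
prop.table(table(y_toy[-train_idx_toy]))  # class proportions in the held-out rows
```

Both tables show the same class shares because the sampling fraction is applied separately within each class.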


In [ ]:
# Convert categorical variables to factors
categorical_vars <- c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome", "y")
data[, (categorical_vars) := lapply(.SD, as.factor), .SDcols = categorical_vars]

# Create train/test split (80/20)
set.seed(123)  # For reproducibility
train_idx <- sample(seq_len(nrow(data)), size = floor(0.8 * nrow(data)))
train_data <- data[train_idx, ]
test_data <- data[-train_idx, ]

cat("Training set:", nrow(train_data), "observations\n")
cat("Test set:", nrow(test_data), "observations\n")

# Check that the class proportions are similar in both sets
cat("\nTarget distribution in training set:\n")
prop.table(table(train_data$y))
cat("\nTarget distribution in test set:\n")
prop.table(table(test_data$y))


## 1. Lasso Regression

**What is Lasso?** Lasso (Least Absolute Shrinkage and Selection Operator) is a linear model fitted with an L1 penalty on the coefficients. The penalty shrinks all coefficients toward zero and sets the least useful ones to exactly zero, so the model selects features automatically. This helps prevent overfitting and makes the model easier to interpret.

**When to use Lasso:**
- When you have many features and want automatic feature selection
- When you need an interpretable model
- When you suspect many features are irrelevant
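
To see why the lasso produces exact zeros, it helps to look at the simplest case: linear regression with a single standardized feature, where the lasso estimate is the least-squares coefficient passed through a soft-thresholding function. The snippet below is a minimal illustration in base R. The function name `soft_threshold` and the coefficient values are made up for demonstration, and this is not how `gamlr` is called in this notebook.

```r
# Soft-thresholding: shrink a coefficient toward zero by lambda,
# and set it to exactly zero if its magnitude is at most lambda.
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}

# Made-up least-squares coefficients for four hypothetical features
ols_coef <- c(x1 = 2.0, x2 = -0.3, x3 = 0.8, x4 = 0.1)

lasso_like_coef <- soft_threshold(ols_coef, lambda = 0.5)
lasso_like_coef
```

With `lambda = 0.5`, the two small coefficients become exactly zero while the two larger ones are each pulled 0.5 closer to zero. A larger `lambda` removes more features.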

### Logistic Lasso: Predicting customer subscription (y)


In [ ]:
# Prepare data for logistic lasso (training set)
X_train <- model.matrix(y ~ ., train_data)[, -1]  # Remove intercept
y_train <- as.numeric(train_data$y == "yes")  # numeric 0/1 response (1 = "yes") for gamlr

# Prepare test set
X_test <- model.matrix(y ~ ., test_data)[, -1]
y_test <- test_data$y

# Fit logistic lasso with cross-validation to find optimal lambda
cv_lasso <- cv.gamlr(X_train, y_train, family = "binomial")

# Plot the cross-validation curve
plot(cv_lasso, main = "Lasso Cross-Validation")
cat("Optimal lambda:", cv_lasso$lambda.min, "\n")

# Get coefficients for the optimal model
lasso_coef <- coef(cv_lasso, select = "min")
cat("\nNumber of selected features:", sum(lasso_coef != 0) - 1, "\n")  # -1 for intercept

# Show non-zero coefficients (selected features)
nonzero <- which(lasso_coef[, 1] != 0)  # row indices work reliably on the sparse matrix
selected_features <- lasso_coef[nonzero, , drop = FALSE]
print(selected_features)

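# Make predictions on test set, using the same lambda as the coefficients above
# (predict() on a cv.gamlr object defaults to select = "1se")
lasso_pred_prob <- drop(predict(cv_lasso, X_test, select = "min", type = "response"))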
lasso_pred_class <- ifelse(lasso_pred_prob > 0.5, "yes", "no")

# Evaluate performance
lasso_accuracy <- mean(lasso_pred_class == y_test)
cat("\nLasso Test Accuracy:", round(lasso_accuracy, 3), "\n")

# Confusion matrix
lasso_cm <- table(Predicted = lasso_pred_class, Actual = y_test)
print(lasso_cm)


## 2. Decision Trees

**What are Decision Trees?** Decision trees make predictions by asking a series of yes/no questions about the features. They're like a flowchart that leads to a prediction.

**When to use Decision Trees:**
- When you need a highly interpretable model
- When relationships between features are non-linear
- When you want to understand the decision-making process

**Pros:** Easy to interpret, handles non-linear relationships
**Cons:** Can overfit, unstable (small data changes can create very different trees)
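
The yes/no questions are chosen so that the resulting groups are as pure as possible. As a rough sketch of that idea (not `rpart`'s actual implementation), the base R code below scores candidate thresholds on one numeric feature by weighted Gini impurity. The function names and the toy data are hypothetical.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

# Weighted impurity of the two groups created by the question "x < threshold?"
split_impurity <- function(x, labels, threshold) {
  left <- labels[x < threshold]
  right <- labels[x >= threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}

# Toy data (made up): longer calls tend to end in a subscription
call_minutes <- c(1, 2, 3, 4, 6, 7, 8, 9)
subscribed   <- c("no", "no", "no", "no", "yes", "no", "yes", "yes")

# Candidate thresholds: midpoints between consecutive sorted values
sorted_x <- sort(unique(call_minutes))
candidates <- (head(sorted_x, -1) + tail(sorted_x, -1)) / 2
scores <- sapply(candidates, split_impurity, x = call_minutes, labels = subscribed)

best_threshold <- candidates[which.min(scores)]
best_threshold
```

A full tree repeats this search over every feature, splits on the best one, and then recurses into each group until a stopping rule such as `cp` or `minsplit` applies.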


In [ ]:
# Fit a decision tree for classification
tree_mod <- rpart(y ~ ., data = train_data, method = "class", 
                  control = rpart.control(cp = 0.01, minsplit = 20))

# Plot the tree
plot(tree_mod, uniform=TRUE, margin=0.1)
text(tree_mod, use.n=TRUE, all=TRUE, cex=.8)
title("Decision Tree for Customer Subscription Prediction")

# Print complexity parameter table
printcp(tree_mod)

# Make predictions
tree_pred_class <- predict(tree_mod, test_data, type = "class")
tree_pred_prob <- predict(tree_mod, test_data, type = "prob")[, "yes"]

# Evaluate performance
tree_accuracy <- mean(tree_pred_class == test_data$y)
cat("\nDecision Tree Test Accuracy:", round(tree_accuracy, 3), "\n")

# Confusion matrix
tree_cm <- table(Predicted = tree_pred_class, Actual = test_data$y)
print(tree_cm)

# Feature importance
cat("\nFeature Importance (Decision Tree):\n")
importance_tree <- tree_mod$variable.importance
print(head(sort(importance_tree, decreasing = TRUE), 10))  # Top 10 (or fewer if the tree uses fewer variables)


## 3. Random Forests

**What are Random Forests?** Random Forests combine many decision trees. Each tree is trained on a bootstrap sample of the rows, and each split considers only a random subset of the features. The final prediction is the majority vote of the trees for classification, or their average for regression.

**When to use Random Forests:**
- When you want better accuracy than a single decision tree
- When you have enough data (works well with large datasets)
- When you want feature importance rankings

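**Pros:** Usually more accurate than single trees, provides feature importance, works well with little tuning
**Cons:** Less interpretable than single trees, can overfit with very noisy data, and the `randomForest` package rejects missing values unless you impute them (e.g. with `na.roughfix`)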
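
The bootstrap-and-vote idea can be sketched in a few lines of base R. The example below uses one-split "stumps" as a deliberately weak learner on made-up data, and it leaves out the random feature subset at each split that distinguishes a random forest from plain bagging. All names here are hypothetical and unrelated to the `randomForest` internals.

```r
set.seed(42)

# Toy data (made up): class depends on x, with some label noise
n <- 200
x <- runif(n)
y <- ifelse(x + rnorm(n, sd = 0.2) > 0.5, "yes", "no")

# Weak learner: a one-split stump that predicts "yes" above a cutoff.
# The cutoff is the candidate with the highest training accuracy.
fit_stump <- function(x, y) {
  candidates <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))
  acc <- sapply(candidates, function(cutoff) mean(ifelse(x > cutoff, "yes", "no") == y))
  unname(candidates[which.max(acc)])
}
predict_stump <- function(cutoff, x) ifelse(x > cutoff, "yes", "no")

# Bagging: fit each stump on a bootstrap sample (rows drawn with replacement)
n_trees <- 25
stump_cutoffs <- replicate(n_trees, {
  boot_idx <- sample.int(n, size = n, replace = TRUE)
  fit_stump(x[boot_idx], y[boot_idx])
})

# Majority vote across all stumps for three new observations
x_new <- c(0.1, 0.45, 0.9)
votes <- sapply(stump_cutoffs, predict_stump, x = x_new)  # rows = observations, cols = stumps
vote_share_yes <- rowMeans(votes == "yes")
bagged_pred <- ifelse(vote_share_yes > 0.5, "yes", "no")
bagged_pred
```

Observations far from the class boundary get unanimous votes, while the one near 0.5 can split the vote. Averaging over many resampled learners is what smooths out the instability of a single tree.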


In [ ]:
# Fit a random forest for classification
set.seed(123)
rf_mod <- randomForest(y ~ ., data = train_data, ntree = 500, importance = TRUE,
                       mtry = floor(sqrt(ncol(train_data) - 1)))  # Standard mtry for classification

print(rf_mod)

# Plot variable importance
varImpPlot(rf_mod, main = "Random Forest Feature Importance")

# Make predictions
rf_pred_class <- predict(rf_mod, test_data)
rf_pred_prob <- predict(rf_mod, test_data, type = "prob")[, "yes"]

# Evaluate performance
rf_accuracy <- mean(rf_pred_class == test_data$y)
cat("\nRandom Forest Test Accuracy:", round(rf_accuracy, 3), "\n")

# Confusion matrix
rf_cm <- table(Predicted = rf_pred_class, Actual = test_data$y)
print(rf_cm)

# Feature importance (top 10)
importance_rf <- importance(rf_mod)[, "MeanDecreaseGini"]
cat("\nTop 10 Most Important Features (Random Forest):\n")
print(head(sort(importance_rf, decreasing = TRUE), 10))


## 4. Boosting (using XGBoost)

**What is Boosting?** Boosting builds models sequentially, where each new model tries to correct the mistakes of the previous models. XGBoost is a popular and powerful implementation of gradient boosting.

**When to use Boosting:**
- When you want state-of-the-art predictive performance
- When you have sufficient data and computational resources
- When accuracy is more important than interpretability

**Pros:** Often achieves the best predictive performance, handles missing values, provides feature importance
**Cons:** More complex to tune, less interpretable, can overfit easily
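
To make "correcting the mistakes of the previous models" concrete, here is a simplified sketch of gradient boosting for squared-error regression in base R. Each round fits a one-split stump to the current residuals and adds a fraction `eta` of it to the prediction. The data and function names are made up. XGBoost itself adds regularization, uses second-order gradient information, and in this notebook optimizes a logistic loss, so treat this as an illustration of the idea only.

```r
set.seed(1)

# Toy regression data (made up): a smooth nonlinear signal plus noise
n <- 200
x <- sort(runif(n, 0, 10))
y <- sin(x) + rnorm(n, sd = 0.2)

# Weak learner: a one-split stump fitted to the current residuals by least squares
fit_stump <- function(x, r) {
  candidates <- quantile(x, probs = seq(0.05, 0.95, by = 0.05))
  sse <- sapply(candidates, function(cutpoint) {
    left <- r[x <= cutpoint]
    right <- r[x > cutpoint]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cutpoint <- unname(candidates[which.min(sse)])
  list(cutpoint = cutpoint, left = mean(r[x <= cutpoint]), right = mean(r[x > cutpoint]))
}
predict_stump <- function(stump, x) ifelse(x <= stump$cutpoint, stump$left, stump$right)

eta <- 0.1                 # learning rate: how much of each correction to keep
n_rounds <- 100
pred <- rep(mean(y), n)    # round 0: predict the overall mean
train_mse <- numeric(n_rounds)

for (m in seq_len(n_rounds)) {
  residuals_m <- y - pred                       # what the current ensemble gets wrong
  stump <- fit_stump(x, residuals_m)            # new learner targets those errors
  pred <- pred + eta * predict_stump(stump, x)  # small step toward correcting them
  train_mse[m] <- mean((y - pred)^2)
}

round(train_mse[c(1, 10, 100)], 3)
```

Training error keeps falling as rounds are added, which is exactly why the notebook uses cross-validation with early stopping to choose `nrounds` rather than relying on training error.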


In [ ]:
# Prepare data for xgboost (needs numeric labels: 0/1)
Xmat_train <- model.matrix(y ~ ., train_data)[, -1]
Xmat_test <- model.matrix(y ~ ., test_data)[, -1]
y_train_numeric <- as.numeric(train_data$y) - 1  # Convert to 0/1
y_test_numeric <- as.numeric(test_data$y) - 1

# Create DMatrix objects
dtrain <- xgb.DMatrix(data = Xmat_train, label = y_train_numeric)
dtest <- xgb.DMatrix(data = Xmat_test, label = y_test_numeric)

# Set parameters
params <- list(
  objective = "binary:logistic",
  eval_metric = "auc",
  max_depth = 4,
  eta = 0.1,
  subsample = 0.8,
  colsample_bytree = 0.8
)

# Train with cross-validation to find optimal number of rounds
set.seed(123)  # fold assignment and row/column subsampling are random
cv_result <- xgb.cv(
  params = params,
  data = dtrain,
  nrounds = 200,
  nfold = 5,
  early_stopping_rounds = 10,
  verbose = 0
)

best_nrounds <- cv_result$best_iteration
cat("Optimal number of rounds:", best_nrounds, "\n")

# Fit final model on the full training set
# (xgb.train takes the same params list and DMatrix that xgb.cv used)
xgb_mod <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = best_nrounds,
  verbose = 0
)

# Feature importance (named xgb_importance so it does not mask randomForest::importance)
xgb_importance <- xgb.importance(model = xgb_mod)
xgb.plot.importance(xgb_importance, top_n = 15, main = "XGBoost Feature Importance (Top 15)")

# Make predictions
xgb_pred_prob <- predict(xgb_mod, dtest)
xgb_pred_class <- ifelse(xgb_pred_prob > 0.5, "yes", "no")

# Evaluate performance
xgb_accuracy <- mean(xgb_pred_class == test_data$y)
cat("\nXGBoost Test Accuracy:", round(xgb_accuracy, 3), "\n")

# Confusion matrix
xgb_cm <- table(Predicted = xgb_pred_class, Actual = test_data$y)
print(xgb_cm)

# ROC curve and AUC
roc_obj <- roc(y_test_numeric, xgb_pred_prob)
plot(roc_obj, main = "ROC Curve (XGBoost)")
cat("\nAUC:", round(auc(roc_obj), 3), "\n")


## 5. Model Comparison and Summary

Let's compare all four models by their accuracy on the test set.
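
Accuracy alone can flatter a model when one class is much more common than the other, because always predicting the majority class already scores well. As a hedged sketch, the base R helper below reports accuracy next to that majority-class baseline, plus sensitivity and specificity. The function `classification_summary` and the toy labels are hypothetical additions, not part of the original analysis.

```r
# Hypothetical helper: accuracy, majority-class baseline, sensitivity, specificity
classification_summary <- function(predicted, actual, positive = "yes") {
  predicted <- as.character(predicted)
  actual <- as.character(actual)
  c(
    accuracy    = mean(predicted == actual),
    baseline    = max(prop.table(table(actual))),  # always predict the majority class
    sensitivity = mean(predicted[actual == positive] == positive),
    specificity = mean(predicted[actual != positive] != positive)
  )
}

# Toy labels (made up): 90 "no" and 10 "yes"; the model finds only 3 of the 10
actual_toy    <- rep(c("no", "yes"), times = c(90, 10))
predicted_toy <- c(rep("no", 88), rep("yes", 2), rep("yes", 3), rep("no", 7))

toy_summary <- classification_summary(predicted_toy, actual_toy)
round(toy_summary, 3)
```

In the toy example, accuracy is only one point above the baseline even though the model misses most positives. The same helper could be applied to any of this notebook's predictions, for example `classification_summary(rf_pred_class, test_data$y)`.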


In [ ]:
# Create a comparison table
model_comparison <- data.frame(
  Model = c("Lasso", "Decision Tree", "Random Forest", "XGBoost"),
  Accuracy = c(lasso_accuracy, tree_accuracy, rf_accuracy, xgb_accuracy),
  stringsAsFactors = FALSE
)

model_comparison$Accuracy <- round(model_comparison$Accuracy, 3)
model_comparison <- model_comparison[order(model_comparison$Accuracy, decreasing = TRUE), ]

cat("Model Performance Comparison (Test Set Accuracy):\n")
print(model_comparison)

# Plot comparison
ggplot(model_comparison, aes(x = reorder(Model, Accuracy), y = Accuracy)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_text(aes(label = Accuracy), hjust = -0.1) +
  coord_flip() +
  labs(title = "Model Performance Comparison", 
       x = "Model", y = "Test Accuracy") +
  theme_minimal() +
  ylim(0, 1)


## Key Takeaways

### Model Characteristics Summary:

1. **Lasso Regression**
   - ✅ Automatic feature selection
   - ✅ Highly interpretable
   - ✅ Fast to train and predict
   - ❌ Assumes linear relationships

2. **Decision Trees**
   - ✅ Most interpretable (visual flowchart)
   - ✅ Handles non-linear relationships
   - ✅ No assumptions about data distribution
   - ❌ Prone to overfitting
   - ❌ Unstable (small changes → different tree)

3. **Random Forest**
   - ✅ Usually more accurate than single trees
   - ✅ Robust to overfitting
   - ✅ Provides feature importance
   - ❌ `randomForest()` rejects missing values unless they are imputed first (e.g. with `na.roughfix`)
   - ❌ Less interpretable than single trees

4. **XGBoost**
   - ✅ Often achieves best predictive performance
   - ✅ Handles missing values
   - ✅ Built-in regularization
   - ❌ More complex to tune
   - ❌ Least interpretable
   - ❌ Can overfit if not tuned properly

### Choosing the Right Model:

- **Need interpretability?** → Decision Tree or Lasso
- **Want good performance with minimal tuning?** → Random Forest
- **Need maximum accuracy?** → XGBoost (with proper tuning)
- **Have limited data?** → Lasso or Decision Tree
- **Have lots of irrelevant features?** → Lasso or XGBoost

### Next Steps:
1. Try different hyperparameters for each model
2. Use cross-validation for more robust evaluation
3. Consider ensemble methods (combining multiple models)
4. Analyze feature importance to gain business insights
5. Test models on new, unseen data before deployment
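
For the cross-validation step, a single 80/20 split gives only one accuracy estimate, while k-fold cross-validation gives k of them and a sense of their spread. The sketch below shows the mechanics in base R with a plain logistic regression on made-up data. The data frame `toy` and all other names are assumptions for illustration, and the same loop structure could wrap any of the models above.

```r
set.seed(123)

# Toy data (made up): binary outcome driven by one numeric feature
n <- 500
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(-1 + 1.5 * toy$x))

# Assign each row to one of k folds at random (equal-sized folds)
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: train on the other k - 1 folds, evaluate on the held-out fold
fold_accuracy <- sapply(seq_len(k), function(fold) {
  train <- toy[fold_id != fold, ]
  valid <- toy[fold_id == fold, ]
  fit <- glm(y ~ x, data = train, family = binomial())
  pred <- as.integer(predict(fit, newdata = valid, type = "response") > 0.5)
  mean(pred == valid$y)
})

cv_mean <- mean(fold_accuracy)
cv_se <- sd(fold_accuracy) / sqrt(k)
c(mean = cv_mean, se = cv_se)
```

Comparing models on the mean and standard error across folds makes it easier to tell a real difference from noise due to one particular split.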