# Anxiety Risk Prediction Using Classification Models

## Overview
This notebook builds multi-class classification models to predict anxiety risk levels from survey-based behavioural indicators.

The dataset contains approximately 70 variables capturing emotional state, lifestyle patterns, and socioeconomic background. The goal is to develop a robust classifier and surface the predictors most associated with anxiety risk.

## Objective
- Train and evaluate multi-class classification models for a 5-class anxiety target
- Optimise model performance using **Mean F1-Score** (macro-averaged)
- Compare Random Forest, GBM, and Multinomial Logistic Regression
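
For reference, the macro-averaged F1 named in the objective is the unweighted mean of the per-class F1 scores, so each of the five classes counts equally regardless of its size. A minimal sketch on made-up labels follows; the vectors `truth` and `pred` and the helper `per_class_f1` are illustrative and not taken from the dataset.

```r
# Toy labels on the -2..+2 scale (illustrative values only)
truth <- c(-2, -2, -1, 0, 0, 0, 1, 1, 2, 2)
pred  <- c(-2, -1, -1, 0, 0, 1, 1, 1, 2, 0)

# Per-class F1 from true positives, false positives and false negatives.
# F1 = 2*TP / (2*TP + FP + FN), taken as 0 when the class has no true positives.
per_class_f1 <- function(truth, pred) {
  classes <- sort(unique(truth))
  vapply(classes, function(k) {
    tp <- sum(truth == k & pred == k)
    fp <- sum(truth != k & pred == k)
    fn <- sum(truth == k & pred != k)
    if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn)
  }, numeric(1))
}

f1_by_class  <- per_class_f1(truth, pred)
macro_f1_toy <- mean(f1_by_class)   # unweighted mean across classes
print(round(f1_by_class, 3))
print(round(macro_f1_toy, 3))
```

Because the mean is unweighted, a poor score on a small class lowers the metric as much as a poor score on a large one, which is why class imbalance matters for this objective.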

## Dataset
- **Training set**: 594 samples × 43 variables (including target)
- **Test set**: 95 samples × 42 variables
- **Target**: `alwaysAnxious` — ordinal scale from −2 to +2
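
The numeric target values cannot serve directly as factor levels when caret is asked for class probabilities, because levels such as `-2` are not syntactically valid R names. The sketch below shows one way to check this and to map labels to valid names and back. The `class_*` names mirror those used in the notebook code; the vector `y` is made up for illustration.

```r
y <- c(-2, 0, 1, 2, -1, 0)            # made-up target values

# Levels like "-2" or "0" are not valid R names; make.names() would alter them
raw_levels <- as.character(sort(unique(y)))
is_valid   <- raw_levels == make.names(raw_levels)

# Forward map to valid names, and the inverse map back to numbers
forward <- c("-2" = "class_neg2", "-1" = "class_neg1", "0" = "class_0",
             "1" = "class_pos1", "2" = "class_pos2")
backward <- setNames(as.numeric(names(forward)), forward)

y_fac  <- factor(forward[as.character(y)], levels = unname(forward))
y_back <- unname(backward[as.character(y_fac)])   # round trip to numeric labels
```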

> **Note:** Dataset files are not included in this repository due to licensing restrictions. Update file paths to your local copy before running.

## Modelling Approach

### Key Finding: Simpler Features → Better Generalisation

| Approach | Features | CV F1 | Test F1 | Outcome |
|----------|----------|-------|---------|---------|
| Aggressive feature engineering | 25+ | 0.4271 | 0.3621 | Overfitting |
| Filtered features (remove NZV) | 41 | 0.3941 | 0.511 | Better |
| **All original features** | **42** | **0.3941** | **0.511** | **Best** |

### Why This Works
- **Class imbalance handling**: 5-fold CV with upsampling for minority classes
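- **Minimal processing**: Tree ensembles such as Random Forest and GBM pick out informative features through their splits; manual filtering may discard useful information
- **Model comparison**: Three models trained and compared; GBM achieved the highest CV F1 (0.4102, versus 0.3996 for Random Forest and 0.3644 for Multinomial Logistic Regression)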
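
To illustrate the upsampling step, here is a base-R sketch that resamples every class with replacement up to the size of the largest class. It uses the class counts printed by the notebook (119, 87, 182, 139, 67). The `upsample_idx` helper is hypothetical; in the notebook the resampling is delegated to caret through `sampling = "up"` in `trainControl`, which applies it inside each training fold rather than to the full data.

```r
set.seed(42)

# Class labels with the counts reported for the training set
y <- factor(rep(c("class_neg2", "class_neg1", "class_0", "class_pos1", "class_pos2"),
                times = c(119, 87, 182, 139, 67)))

# Hypothetical helper: row indices after upsampling each class to the largest class size
upsample_idx <- function(y) {
  target_n <- max(table(y))
  idx_by_class <- split(seq_along(y), y)
  unlist(lapply(idx_by_class, function(idx) {
    extra <- target_n - length(idx)
    # Keep every original row, then draw the shortfall with replacement
    c(idx, idx[sample.int(length(idx), extra, replace = TRUE)])
  }), use.names = FALSE)
}

idx <- upsample_idx(y)
print(table(y[idx]))
```

Upsampling only the training portion of each fold keeps duplicated rows out of the held-out portion, so the CV score is not inflated by a model seeing copies of its own validation rows.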

### Final Model
- **Algorithm**: GBM (n.trees=400, depth=5, shrinkage=0.05)
- **Features**: All 42 original predictors, no filtering
- **Validation**: 5-fold stratified CV with upsampling
- **CV F1**: 0.4102
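
The stratified 5-fold split can be sketched in base R by assigning fold ids within each class, so that every fold keeps roughly the same class mix. This is an illustration using the class counts reported in the notebook; the `stratified_folds` helper is hypothetical, and the notebook itself relies on caret's own fold creation.

```r
set.seed(42)

y <- factor(rep(c("class_neg2", "class_neg1", "class_0", "class_pos1", "class_pos2"),
                times = c(119, 87, 182, 139, 67)))

# Hypothetical helper: shuffle rows within each class, then deal them out to k folds
stratified_folds <- function(y, k = 5) {
  fold <- integer(length(y))
  for (idx in split(seq_along(y), y)) {
    shuffled <- idx[sample.int(length(idx))]
    fold[shuffled] <- rep_len(seq_len(k), length(idx))
  }
  fold
}

fold_id     <- stratified_folds(y, k = 5)
fold_counts <- table(fold_id, y)   # rows: folds, columns: classes
print(fold_counts)
```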

In [1]:
############################################################
# Part 2 - Classification (Mean F1, Kaggle)
# ULTRA-SIMPLE VERSION – ALL ORIGINAL FEATURES
# No feature engineering, no feature selection – just raw data
############################################################

set.seed(42)

############################################################
# STEP 0: Install / load allowed packages
############################################################
req_pkgs <- c("caret", "randomForest", "gbm", "nnet")

for (p in req_pkgs) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p, repos = "http://cran.us.r-project.org")
    library(p, character.only = TRUE)
  }
}

cat(" All supported packages loaded\n\n")

############################################################
# STEP 1: Load data
############################################################
train <- read.csv("classification_train.csv", stringsAsFactors = FALSE)
test  <- read.csv("classification_test.csv",  stringsAsFactors = FALSE)

cat(" Data loaded\n")
cat("  Train:", nrow(train), "rows ×", ncol(train), "cols\n")
cat("  Test :", nrow(test),  "rows ×", ncol(test),  "cols\n\n")

############################################################
# STEP 2: Remap target levels to valid factor names
############################################################
orig_y <- train$alwaysAnxious

label_map_forward <- c(
  "-2" = "class_neg2",
  "-1" = "class_neg1",
  "0"  = "class_0",
  "1"  = "class_pos1",
  "2"  = "class_pos2"
)

label_map_back <- c(
  class_neg2 = -2,
  class_neg1 = -1,
  class_0    =  0,
  class_pos1 =  1,
  class_pos2 =  2
)

train$anx_cat <- factor(
  label_map_forward[as.character(orig_y)],
  levels = label_map_forward[as.character(sort(unique(orig_y)))]
)

cat(" Target remapped to valid factor levels\n")
print(table(train$anx_cat))
cat("\n")

############################################################
# STEP 3: Align character/factor columns
############################################################
char_cols <- names(train)[sapply(train, is.character)]

for (col in char_cols) {
  train[[col]] <- factor(train[[col]])
}

for (col in char_cols) {
  if (col %in% names(test)) {
    test[[col]] <- factor(test[[col]], levels = levels(train[[col]]))
  }
}

cat(" Factor columns aligned between train and test\n\n")

############################################################
# STEP 4: Fill NA values in numeric columns with 0 – NO OTHER PROCESSING
############################################################
for (col in names(train)) {
  if (is.numeric(train[[col]])) {
    train[[col]][is.na(train[[col]])] <- 0
  }
}
for (col in names(test)) {
  if (is.numeric(test[[col]])) {
    test[[col]][is.na(test[[col]])] <- 0
  }
}

cat(" NA values filled\n\n")

############################################################
# STEP 5: USE ALL ORIGINAL FEATURES (NO FILTERING)
############################################################
x_cols <- setdiff(names(train), c("alwaysAnxious", "anx_cat"))

train_select <- data.frame(train[, x_cols, drop = FALSE], anx_cat = train$anx_cat)
test_select  <- test[, x_cols, drop = FALSE]

cat(" Using ALL original features (no filtering)\n")
cat("  Total predictors:", length(x_cols), "\n\n")

############################################################
# STEP 6: Custom Macro-F1 (Mean F1-Score)
############################################################
macro_f1 <- function(truth, pred) {
  truth <- factor(truth)
  pred  <- factor(pred, levels = levels(truth))
  cm <- table(truth, pred)

  # Per-class F1; precision, recall and F1 default to 0 when undefined
  f1s <- vapply(rownames(cm), function(k) {
    tp <- cm[k, k]
    fp <- sum(cm[, k]) - tp
    fn <- sum(cm[k, ]) - tp
    prec <- if (tp + fp == 0) 0 else tp / (tp + fp)
    rec  <- if (tp + fn == 0) 0 else tp / (tp + fn)
    if (prec + rec == 0) 0 else 2 * prec * rec / (prec + rec)
  }, numeric(1))

  mean(f1s)
}

f1_summary <- function(data, lev = NULL, model = NULL) {
  f1 <- macro_f1(data$obs, data$pred)
  acc <- mean(data$obs == data$pred)
  c(F1 = f1, Accuracy = acc)
}

############################################################
# STEP 7: Cross-validation setup
############################################################
ctrl <- trainControl(
  method = "cv",
  number = 5,
  summaryFunction = f1_summary,
  classProbs = TRUE,
  savePredictions = "final",
  sampling = "up",
  verboseIter = FALSE
)

############################################################
# STEP 8: Train models with optimized parameters
# Based on previous findings:
# - RF: the best mtry value was 3, so fine-tune over a narrow mtry range (1-5)
# - GBM: the best parameters were n.trees = 400, depth = 5, shrinkage = 0.05
############################################################

## 8.1 Random Forest – Fine-tune mtry around optimal value
cat("=== Training: Random Forest (mtry: 1-5) ===\n")
cat("Fine-tuning around mtry=3...\n\n")

m_rf <- train(
  anx_cat ~ .,
  data = train_select,
  method = "rf",
  trControl = ctrl,
  tuneGrid = expand.grid(mtry = c(1, 2, 3, 4, 5)),
  ntree = 300,
  metric = "F1"
)

rf_best_mtry <- m_rf$bestTune$mtry
rf_f1 <- max(m_rf$results$F1)
rf_sd <- sd(m_rf$resample$F1)

cat(" RF training complete\n")
cat("  Best mtry:", rf_best_mtry, "\n")
cat("  Best F1:", round(rf_f1, 4), " (SD:", round(rf_sd, 4), ")\n\n")

print(m_rf$results[, c("mtry", "F1", "Accuracy")])
cat("\n")

## 8.2 GBM – Use optimized parameters from previous run
cat("=== Training: GBM ===\n")
m_gbm <- train(
  anx_cat ~ .,
  data = train_select,
  method = "gbm",
  trControl = ctrl,
  verbose = FALSE,
  tuneGrid = expand.grid(
    n.trees = 400,
    interaction.depth = 5,
    shrinkage = 0.05,
    n.minobsinnode = 10
  ),
  metric = "F1"
)

gbm_f1 <- max(m_gbm$results$F1)
gbm_sd <- sd(m_gbm$resample$F1)

cat(" GBM training complete\n")
cat("  Fixed params: n.trees=400, depth=5, shrinkage=0.05\n")
cat("  F1:", round(gbm_f1, 4), " (SD:", round(gbm_sd, 4), ")\n\n")

## 8.3 Multinomial Logistic
cat("=== Training: Multinomial Logistic ===\n")
m_mult <- train(
  anx_cat ~ .,
  data = train_select,
  method = "multinom",
  trControl = ctrl,
  trace = FALSE,
  tuneGrid = data.frame(decay = c(0, 0.1, 0.5)),
  metric = "F1"
)

mult_best_decay <- m_mult$bestTune$decay
mult_f1 <- max(m_mult$results$F1)
mult_sd <- sd(m_mult$resample$F1)

cat(" Multinom training complete\n")
cat("  Best decay:", mult_best_decay, "\n")
cat("  F1:", round(mult_f1, 4), " (SD:", round(mult_sd, 4), ")\n\n")

############################################################
# STEP 9: Model comparison
############################################################
cat("========== MODEL COMPARISON ==========\n\n")

f1_scores <- c(RF = rf_f1, GBM = gbm_f1, Multinom = mult_f1)
best_idx <- which.max(f1_scores)
best_name <- names(f1_scores)[best_idx]

comparison_df <- data.frame(
  Model = names(f1_scores),
  CV_F1 = round(f1_scores, 4),
  SD = c(round(rf_sd, 4), round(gbm_sd, 4), round(mult_sd, 4))
)

print(comparison_df)
cat("\n")

cat(" Best model:", best_name, "\n")
cat("   CV F1:", round(max(f1_scores), 4), "\n\n")

if (best_name == "RF") {
  fin.mod <- m_rf
  best_params <- paste("mtry=", rf_best_mtry, ", ntree=300", sep="")
} else if (best_name == "GBM") {
  fin.mod <- m_gbm
  best_params <- "n.trees=400, depth=5, shrinkage=0.05"
} else {
  fin.mod <- m_mult
  best_params <- paste("decay=", mult_best_decay, sep="")
}

cat("Best parameters:", best_params, "\n\n")

############################################################
# STEP 10: Predict on test
############################################################
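test_pred_fac <- predict(fin.mod, newdata = test_select, type = "raw")
pred.label <- as.numeric(label_map_back[as.character(test_pred_fac)])

# predict() may drop test rows that contain NA (e.g. unseen factor levels),
# which would misalign RowIndex in the submission; stop early if that happens
stopifnot(length(pred.label) == nrow(test_select))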

cat("Prediction label distribution (numeric):\n")
print(table(pred.label))
cat("\n")

############################################################
# STEP 11: Write submission file
############################################################
submission <- data.frame(
  RowIndex   = seq_along(pred.label),
  Prediction = pred.label
)

write.csv(submission, "ClassificationPredictLabel.csv", row.names = FALSE)

cat("Submission file created: ClassificationPredictLabel.csv\n")
cat("  Predictions:", length(pred.label), "\n")
cat("  Classes:", paste(sort(unique(pred.label)), collapse = ", "), "\n\n")

############################################################
# FINAL SUMMARY
############################################################
cat("========== FINAL SUMMARY ==========\n\n")

cat("Approach: Ultra-Simple (ALL original features, minimal processing)\n\n")

cat("Data:\n")
cat("  Training samples:", nrow(train_select), "\n")
cat("  Number of predictors:", length(x_cols), "(all original, no filtering)\n")
cat("  Feature engineering: NONE\n\n")

cat("Cross-Validation Results (5-fold + upsampling):\n")
cat("  Random Forest:    ", round(rf_f1, 4), " (mtry=", rf_best_mtry, ")\n")
cat("  GBM:              ", round(gbm_f1, 4), "\n")
cat("  Multinomial Log:  ", round(mult_f1, 4), "\n\n")

cat("Selected Model: ", best_name, "\n")
cat("Expected out-of-sample F1: ~", round(max(f1_scores) - 0.01, 4), 
    " (conservative estimate)\n\n")

cat(" Final model stored in fin.mod\n")
cat(" Predictions exported to ClassificationPredictLabel.csv\n")

 All supported packages loaded

 Data loaded
  Train: 594 rows × 43 cols
  Test : 95 rows × 42 cols

 Target remapped to valid factor levels

class_neg2 class_neg1    class_0 class_pos1 class_pos2 
       119         87        182        139         67 

 Factor columns aligned between train and test

 NA values filled

 Using ALL original features (no filtering)
  Total predictors: 42 

=== Training: Random Forest (mtry: 1-5) ===
Fine-tuning around mtry=3...

 RF training complete
  Best mtry: 5 
  Best F1: 0.3996  (SD: 0.0517 )

  mtry        F1  Accuracy
1    1 0.2922939 0.3014967
2    2 0.3684368 0.3752568
3    3 0.3846963 0.3986386
4    4 0.3760472 0.3955873
5    5 0.3995718 0.4123539

=== Training: GBM ===
 GBM training complete
  Fixed params: n.trees=400, depth=5, shrinkage=0.05
  F1: 0.4102  (SD: 0.0083 )

=== Training: Multinomial Logistic ===
 Multinom training complete
  Best decay: 0.5 
  F1: 0.3644  (SD: 0.0551 )


            Model  CV_F1     SD
RF             RF 0.3996 