# Individual Planning Report: Predicting Video Game Server Usage

**Date:** November 2025  
**Course:** Data Science Project

This report analyzes player and session data from a Minecraft research server to address predictive questions about player behavior and server usage patterns.

**GitHub Repository:** [https://github.com/thisis77/DS](https://github.com/thisis77/DS)

---

In [None]:
# Import necessary libraries
library(tidyverse)  # includes dplyr, ggplot2, tidyr, readr
library(readxl)     # for reading Excel files
library(lubridate)  # for date/time handling

# Set display options
options(width = 200)
options(digits = 2)

# Set visualization theme
theme_set(theme_gray())

## XLSX to CSV Transformation

First, we convert the sessions Excel file to CSV format for easier processing.

In [None]:
# Read the sessions xlsx file
sessions_df <- read_excel('sessions (2).xlsx')

# Display the first few rows
cat("Sessions data shape:", nrow(sessions_df), "x", ncol(sessions_df), "\n")
head(sessions_df)

In [None]:
# Convert to CSV
write.csv(sessions_df, 'sessions.csv', row.names = FALSE)
cat("✓ sessions.csv created successfully\n")

---

## 1. Data Description

This section loads the Minecraft research server dataset, which consists of player profiles and session logs, and reports its basic dimensions.

### 1.1 Loading the Datasets

In [None]:
# Load sessions data
sessions_df <- read.csv('sessions.csv')
cat("Sessions dataset shape:", nrow(sessions_df), "x", ncol(sessions_df), "\n")
cat("Number of unique players in sessions:", n_distinct(sessions_df$hashedEmail), "\n\n")
cat("First 5 rows of sessions data:\n")
head(sessions_df, 5)

---

## 2. Data Wrangling and Cleaning

This section performs the minimum necessary data wrangling to convert the data into tidy format and address quality issues.

### 2.1 Data Type Optimization and Conversion

In [None]:
# Load players data and reload sessions data to start from a clean state
players_df <- read.csv('players.csv')
sessions_df <- read.csv('sessions.csv')

cat("Original data types:\n")
cat("\nPlayers dataset:\n")
str(players_df)
cat("Shape:", nrow(players_df), "x", ncol(players_df), "\n")

cat("\nSessions dataset:\n")
str(sessions_df) 
cat("Shape:", nrow(sessions_df), "x", ncol(sessions_df), "\n")

### 2.2 Feature Engineering and Derived Variables

In [None]:
# Ensure start_time and end_time are datetimes (dmy_hm for format like '30/06/2024 18:12')
sessions_df$start_time <- dmy_hm(sessions_df$start_time)
sessions_df$end_time <- dmy_hm(sessions_df$end_time)

# Report parsing issues if any
n_start_na <- sum(is.na(sessions_df$start_time))
n_end_na <- sum(is.na(sessions_df$end_time))
if (n_start_na > 0 || n_end_na > 0) {
    cat(sprintf("Warning: %d start_time(s) and %d end_time(s) could not be parsed and are set to NA\n", 
                n_start_na, n_end_na))
}

# Calculate session duration in minutes (will be NA when either timestamp is missing)
sessions_df$session_duration_minutes <- as.numeric(difftime(sessions_df$end_time, sessions_df$start_time, units = "mins"))

# Create time-based features (will be NA for rows with NA start_time)
sessions_df$start_hour <- hour(sessions_df$start_time)
sessions_df$start_day_of_week <- wday(sessions_df$start_time, label = TRUE, abbr = FALSE)
sessions_df$start_month <- month(sessions_df$start_time)

# Categorize session times (handle missing hours)
categorize_time <- function(hour) {
  if (is.na(hour)) {
    return(NA)
  }
  hour <- as.integer(hour)
  if (hour >= 6 && hour < 12) {
    return('Morning')
  } else if (hour >= 12 && hour < 18) {
    return('Afternoon')
  } else if (hour >= 18 && hour < 24) {
    return('Evening')
  } else {
    return('Night')
  }
}

sessions_df$time_period <- sapply(sessions_df$start_hour, categorize_time)
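# Set the level order explicitly so tables and plots follow the day rather than the alphabet
sessions_df$time_period <- factor(sessions_df$time_period,
                                  levels = c('Morning', 'Afternoon', 'Evening', 'Night'))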

cat("Session duration statistics (minutes):\n")
print(summary(sessions_df$session_duration_minutes))
cat(sprintf("\nNegative durations (data quality issue): %d\n", 
            sum(sessions_df$session_duration_minutes < 0, na.rm = TRUE)))
cat(sprintf("Zero duration sessions: %d\n", 
            sum(sessions_df$session_duration_minutes == 0, na.rm = TRUE)))

cat("\nTime period distribution:\n")
print(table(sessions_df$time_period, useNA = "ifany"))
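
The row-by-row `sapply` call above can also be expressed as one vectorized step. The following is a minimal sketch using base R `cut()`; the `hours` vector is invented for illustration and is not taken from the session log.

```r
# Hypothetical start hours (0-23), including one missing value
hours <- c(3, 6, 11, 12, 17, 18, 23, NA)

# Left-closed intervals: [0,6) Night, [6,12) Morning, [12,18) Afternoon, [18,24) Evening
time_period <- cut(
  hours,
  breaks = c(0, 6, 12, 18, 24),
  labels = c("Night", "Morning", "Afternoon", "Evening"),
  right  = FALSE
)

# NA hours stay NA, matching the behaviour of categorize_time()
print(table(time_period, useNA = "ifany"))
```

Because `cut()` returns a factor whose levels follow the order of `labels`, the resulting tables and plots are ordered by time of day without a separate `factor()` call.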

### 2.3 Player-Level Aggregations

In [None]:
# Create player-level summary statistics from sessions
player_session_stats <- sessions_df %>%
  group_by(hashedEmail) %>%
  summarise(
    total_sessions = n(),
    total_playtime_minutes = sum(session_duration_minutes, na.rm = TRUE),
    avg_session_duration = mean(session_duration_minutes, na.rm = TRUE),
    session_duration_std = sd(session_duration_minutes, na.rm = TRUE),
    first_session = min(start_time, na.rm = TRUE),
    last_session = max(start_time, na.rm = TRUE),
    .groups = 'drop'
  )

# Calculate days between first and last session
player_session_stats$engagement_days <- as.numeric(difftime(
  player_session_stats$last_session, 
  player_session_stats$first_session, 
  units = "days"
)) + 1

# Fill NA std with 0 for players with only one session
player_session_stats$session_duration_std[is.na(player_session_stats$session_duration_std)] <- 0

cat("Player session statistics shape:", nrow(player_session_stats), "x", ncol(player_session_stats), "\n\n")
cat("Player session statistics summary:\n")
print(summary(player_session_stats))

### 2.4 Data Integration and Tidy Format

In [None]:
# Merge players data with session statistics
# Use left join to keep all players (even those without sessions)
players_complete <- players_df %>%
  left_join(player_session_stats, by = "hashedEmail")

# Fill missing values for players without sessions
session_cols <- c('total_sessions', 'total_playtime_minutes', 'avg_session_duration', 
                  'session_duration_std', 'engagement_days')
for (col in session_cols) {
  if (col %in% names(players_complete)) {
    players_complete[[col]][is.na(players_complete[[col]])] <- 0
  }
}

# Create engagement categories based on total sessions
categorize_engagement <- function(total_sessions) {
  if (total_sessions == 0) {
    return('No Activity')
  } else if (total_sessions <= 5) {
    return('Low')
  } else if (total_sessions <= 20) {
    return('Medium')
  } else {
    return('High')
  }
}

players_complete$engagement_level <- sapply(players_complete$total_sessions, categorize_engagement)
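# Order the levels from least to most active so plots and tables are not sorted alphabetically
players_complete$engagement_level <- factor(players_complete$engagement_level,
                                            levels = c('No Activity', 'Low', 'Medium', 'High'))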

cat("Complete dataset shape:", nrow(players_complete), "x", ncol(players_complete), "\n")
cat("Players without sessions:", sum(players_complete$total_sessions == 0), "\n\n")
cat("Engagement level distribution:\n")
print(table(players_complete$engagement_level))
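
A left join keeps every player but silently drops any session whose key has no match in the players table. The sketch below shows one way to check key coverage in both directions with base R `setdiff()`. The two toy tables and their keys are invented; only the column name `hashedEmail` comes from the data.

```r
# Hypothetical miniature versions of the two tables
players_toy  <- data.frame(hashedEmail = c("a", "b", "c"))
sessions_toy <- data.frame(hashedEmail = c("a", "a", "c", "d"))

# Players that never appear in the session log (their session stats are filled with 0)
no_sessions <- setdiff(players_toy$hashedEmail, sessions_toy$hashedEmail)

# Session keys with no player record (these rows vanish in a left join from players)
orphan_sessions <- setdiff(sessions_toy$hashedEmail, players_toy$hashedEmail)

cat("Players without sessions:", no_sessions, "\n")
cat("Session keys without a player record:", orphan_sessions, "\n")
```

If the second set is non-empty on the real data, it is worth reporting alongside the other data quality issues.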

### 2.5 Data Quality Issues Documentation

In [None]:
# Document all data quality issues found during wrangling
cat("DATA QUALITY ASSESSMENT SUMMARY\n")
cat(paste(rep("=", 50), collapse=""), "\n")

cat("\n1. MISSING VALUES:\n")
cat(sprintf("   - Age missing in players dataset: %d records\n", sum(is.na(players_df$Age))))
cat(sprintf("   - Rows with any missing value in sessions dataset: %d\n", sum(!complete.cases(sessions_df))))

cat("\n2. DATA CONSISTENCY:\n")
cat(sprintf("   - Players in players.csv: %d\n", nrow(players_df)))
cat(sprintf("   - Players with sessions: %d\n", nrow(player_session_stats)))
cat(sprintf("   - Players without sessions: %d\n", nrow(players_df) - nrow(player_session_stats)))

cat("\n3. DATA RANGE VALIDATION:\n")
cat(sprintf("   - Age range: %.0f to %.0f years\n", 
            min(players_df$Age, na.rm = TRUE), max(players_df$Age, na.rm = TRUE)))
cat(sprintf("   - Played hours range: %.1f to %.1f hours\n", 
            min(players_df$played_hours, na.rm = TRUE), max(players_df$played_hours, na.rm = TRUE)))
cat(sprintf("   - Session duration range: %.1f to %.1f minutes\n", 
            min(sessions_df$session_duration_minutes, na.rm = TRUE), 
            max(sessions_df$session_duration_minutes, na.rm = TRUE)))

cat("\n4. POTENTIAL OUTLIERS:\n")
outliers_age <- sum((players_df$Age < 10 | players_df$Age > 60), na.rm = TRUE)
outliers_hours <- sum(players_df$played_hours > 100, na.rm = TRUE)
outliers_session <- sum(sessions_df$session_duration_minutes > 300, na.rm = TRUE)

cat(sprintf("   - Age outliers (< 10 or > 60): %d\n", outliers_age))
cat(sprintf("   - High playtime outliers (> 100 hours): %d\n", outliers_hours))
cat(sprintf("   - Long session outliers (> 5 hours): %d\n", outliers_session))

cat("\n5. DATA INTEGRITY:\n")
cat(sprintf("   - Duplicate player records: %d\n", sum(duplicated(players_df$hashedEmail))))
cat(sprintf("   - Duplicate session records: %d\n", sum(duplicated(sessions_df))))
cat(sprintf("   - Sessions with negative duration: %d\n", 
            sum(sessions_df$session_duration_minutes < 0, na.rm = TRUE)))
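
The outlier counts above rely on fixed cut-offs (for example, sessions longer than 300 minutes). As a data-driven cross-check, the sketch below applies the common 1.5 x IQR rule. The `durations` vector is made up, and the rule is an assumed alternative rather than part of the original assessment.

```r
# Made-up session durations in minutes
durations <- c(5, 12, 20, 25, 30, 35, 40, 45, 60, 400)

# Quartiles and interquartile range
q   <- unname(quantile(durations, c(0.25, 0.75), na.rm = TRUE))
iqr <- q[2] - q[1]

# Tukey fences: values beyond 1.5 * IQR from the quartiles are flagged
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
is_outlier <- durations < lower | durations > upper

cat(sprintf("Fences: %.2f to %.2f minutes\n", lower, upper))
cat("Flagged values:", durations[is_outlier], "\n")
```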

### 2.6 Final Cleaned Dataset Summary

In [None]:
# Display final cleaned datasets
cat("FINAL CLEANED DATASETS\n")
cat(paste(rep("=", 50), collapse=""), "\n")

cat(sprintf("\nComplete Players Dataset: %d x %d\n", nrow(players_complete), ncol(players_complete)))
cat("Columns:", paste(names(players_complete), collapse=", "), "\n")
cat("\nFirst 3 rows of complete dataset:\n")
print(head(players_complete, 3))

cat(sprintf("\nSessions Dataset: %d x %d\n", nrow(sessions_df), ncol(sessions_df)))
cat("Columns:", paste(names(sessions_df), collapse=", "), "\n")
cat("\nSample of sessions data:\n")
print(head(sessions_df[, c('hashedEmail', 'session_duration_minutes', 'time_period', 'start_day_of_week')], 3))

cat("\nData is now in tidy format and ready for analysis!\n")
cat("Key transformations completed:\n")
cat("✓ Derived categorical variables (time_period, engagement_level) stored as factors\n")
cat("✓ Timestamp data converted to datetime\n")
cat("✓ Session durations calculated\n")
cat("✓ Player-level aggregations created\n")
cat("✓ Session statistics set to 0 for players without sessions (missing Age left for imputation)\n")
cat("✓ Engagement categories defined\n")
cat("✓ Time-based features engineered\n")

---

## 3. Question Formulation and Research Objective

This section identifies the specific predictive question from the three broad research areas.

### 3.1 Research Question Selection

**Broad Question Selected:** Question 1 - What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter?

**Specific Research Question:**  
*Can player experience level, age, gender, and engagement metrics (total playtime hours and session frequency) predict newsletter subscription status in the Minecraft research server dataset?*

**Justification:**
- Newsletter subscription is a clear binary outcome variable suitable for classification
- Player characteristics and behavioral data are well-represented in our dataset
- Understanding subscription drivers can help target recruitment efforts
- Addresses practical stakeholder needs for player engagement optimization

### 3.2 Variables for Analysis

**Response Variable:**
- `subscribe` (boolean): Newsletter subscription status

**Explanatory Variables:**
- `experience` (categorical): Player experience level (Amateur, Regular, Pro, Veteran)
- `Age` (numerical): Player age in years
- `gender` (categorical): Player gender (Male, Female)
- `played_hours` (numerical): Total hours played on server
- `total_sessions` (numerical, derived): Total number of play sessions
- `avg_session_duration` (numerical, derived): Average session length in minutes
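
Before modeling, the variables listed above need explicit types: factors for the categorical columns and a two-level factor for the response. The sketch below shows this on a toy data frame. The column names follow the list above, but every value is invented, and the level sets are taken from the descriptions in this section rather than checked against the data.

```r
# Toy rows; values are invented for illustration
toy <- data.frame(
  subscribe            = c(TRUE, FALSE, TRUE),
  experience           = c("Pro", "Amateur", "Veteran"),
  Age                  = c(21, 17, NA),
  gender               = c("Male", "Female", "Male"),
  played_hours         = c(30.3, 0.1, 3.8),
  total_sessions       = c(27, 1, 3),
  avg_session_duration = c(74.8, 5.0, 28.3)
)

# Categorical predictors as factors with a fixed level set (assumed from Section 3.2)
toy$experience <- factor(toy$experience,
                         levels = c("Amateur", "Regular", "Pro", "Veteran"))
toy$gender <- factor(toy$gender)

# Response as a labelled two-level factor; "No" is the reference class
toy$subscribe <- factor(toy$subscribe, levels = c(FALSE, TRUE), labels = c("No", "Yes"))

str(toy)
```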

---

## 4. Exploratory Data Analysis and Visualization

This section provides comprehensive visualizations to understand the relationships between predictor variables and newsletter subscription status.

### 4.1 Response Variable Distribution

In [None]:
# Load gridExtra for arranging plots
library(gridExtra)

# Subscription distribution visualization
# Create data frame for plotting
sub_df <- data.frame(
  status = c('Not Subscribed', 'Subscribed'),
  count = c(sum(!players_complete$subscribe), sum(players_complete$subscribe))
)
sub_df$percent <- sub_df$count / sum(sub_df$count) * 100

# Define colors
colors <- c('#ff7f0e', '#1f77b4')

# Bar plot
p1 <- ggplot(sub_df, aes(x = status, y = count, fill = status)) +
  geom_bar(stat = 'identity', alpha = 0.7) +
  scale_fill_manual(values = colors) +
  labs(title = 'Newsletter Subscription Distribution',
       x = '', y = 'Number of Players') +
  theme_minimal() +
  theme(legend.position = 'none') +
  geom_text(aes(label = count), vjust = -0.5, fontface = 'bold')

# Pie chart
p2 <- ggplot(sub_df, aes(x = "", y = percent, fill = status)) +
  geom_bar(stat = 'identity', width = 1) +
  coord_polar("y", start = 0) +
  scale_fill_manual(values = colors) +
  labs(title = 'Newsletter Subscription Percentage') +
  theme_void() +
  geom_text(aes(label = paste0(round(percent, 1), '%')),
            position = position_stack(vjust = 0.5), fontface = 'bold')

# Arrange plots side by side
grid.arrange(p1, p2, ncol = 2)

cat("Newsletter Subscription Summary:\n")
cat(sprintf("Total players: %d\n", nrow(players_complete)))
cat(sprintf("Subscribed: %d (%.1f%%)\n", 
            sum(players_complete$subscribe), 
            sum(players_complete$subscribe)/nrow(players_complete)*100))
cat(sprintf("Not subscribed: %d (%.1f%%)\n", 
            sum(!players_complete$subscribe), 
            sum(!players_complete$subscribe)/nrow(players_complete)*100))

### 4.2 Categorical Variables vs Subscription

In [None]:
# Categorical variables analysis
colors <- c('#ff7f0e', '#1f77b4')

# Prepare percentage data for experience
exp_pct <- players_complete %>%
  group_by(experience, subscribe) %>%
  summarise(n = n(), .groups = 'drop') %>%
  group_by(experience) %>%
  mutate(pct = n / sum(n) * 100) %>%
  ungroup()

# Plot 1: Experience level vs subscription
p1 <- ggplot(exp_pct, aes(x = experience, y = pct, fill = subscribe)) +
  geom_bar(stat = 'identity', position = 'dodge', alpha = 0.7) +
  scale_fill_manual(values = colors, labels = c('Not Subscribed', 'Subscribed')) +
  labs(title = 'Subscription by Experience Level (%)',
       x = 'Experience Level', y = 'Percentage') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.title = element_blank())

# Prepare percentage data for gender
gender_pct <- players_complete %>%
  group_by(gender, subscribe) %>%
  summarise(n = n(), .groups = 'drop') %>%
  group_by(gender) %>%
  mutate(pct = n / sum(n) * 100) %>%
  ungroup()

# Plot 2: Gender vs subscription
p2 <- ggplot(gender_pct, aes(x = gender, y = pct, fill = subscribe)) +
  geom_bar(stat = 'identity', position = 'dodge', alpha = 0.7) +
  scale_fill_manual(values = colors, labels = c('Not Subscribed', 'Subscribed')) +
  labs(title = 'Subscription by Gender (%)',
       x = 'Gender', y = 'Percentage') +
  theme_minimal() +
  theme(legend.title = element_blank())

# Prepare percentage data for engagement level
eng_pct <- players_complete %>%
  group_by(engagement_level, subscribe) %>%
  summarise(n = n(), .groups = 'drop') %>%
  group_by(engagement_level) %>%
  mutate(pct = n / sum(n) * 100) %>%
  ungroup()

# Plot 3: Engagement level vs subscription
p3 <- ggplot(eng_pct, aes(x = engagement_level, y = pct, fill = subscribe)) +
  geom_bar(stat = 'identity', position = 'dodge', alpha = 0.7) +
  scale_fill_manual(values = colors, labels = c('Not Subscribed', 'Subscribed')) +
  labs(title = 'Subscription by Engagement Level (%)',
       x = 'Engagement Level', y = 'Percentage') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.title = element_blank())

# Plot 4: Count plot
p4 <- ggplot(players_complete, aes(x = experience, fill = subscribe)) +
  geom_bar(position = 'dodge', alpha = 0.7) +
  scale_fill_manual(values = colors, labels = c('Not Subscribed', 'Subscribed')) +
  labs(title = 'Player Count by Experience and Subscription',
       x = 'Experience Level', y = 'Count') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.title = element_blank())

# Arrange plots in 2x2 grid
grid.arrange(p1, p2, p3, p4, ncol = 2)

# Print crosstabs
cat("Experience vs Subscription:\n")
print(addmargins(table(players_complete$experience, players_complete$subscribe)))

### 4.3 Numerical Variables vs Subscription

In [None]:
# Numerical variables analysis
colors <- c('#ff7f0e', '#1f77b4')

# Prepare data
age_data <- players_complete %>% filter(!is.na(Age))

# Create subscribe labels
players_complete$subscribe_label <- ifelse(players_complete$subscribe, 'Yes', 'No')
age_data$subscribe_label <- ifelse(age_data$subscribe, 'Yes', 'No')

# Plot 1: Age box plot
p1 <- ggplot(age_data, aes(x = subscribe_label, y = Age, fill = subscribe)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = colors) +
  labs(title = 'Age by Subscription Status',
       x = 'Subscribed', y = 'Age (years)') +
  theme_minimal() +
  theme(legend.position = 'none')

# Plot 2: Played hours box plot
p2 <- ggplot(players_complete, aes(x = subscribe_label, y = played_hours, fill = subscribe)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = colors) +
  labs(title = 'Played Hours by Subscription Status',
       x = 'Subscribed', y = 'Played Hours') +
  theme_minimal() +
  theme(legend.position = 'none')

# Plot 3: Total sessions box plot
p3 <- ggplot(players_complete, aes(x = subscribe_label, y = total_sessions, fill = subscribe)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = colors) +
  labs(title = 'Total Sessions by Subscription Status',
       x = 'Subscribed', y = 'Total Sessions') +
  theme_minimal() +
  theme(legend.position = 'none')

# Plot 4: Avg session duration box plot
active_players <- players_complete %>% filter(total_sessions > 0)
active_players$subscribe_label <- ifelse(active_players$subscribe, 'Yes', 'No')

p4 <- ggplot(active_players, aes(x = subscribe_label, y = avg_session_duration, fill = subscribe)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = colors) +
  labs(title = 'Avg Session Duration by Subscription',
       x = 'Subscribed', y = 'Avg Duration (min)') +
  theme_minimal() +
  theme(legend.position = 'none')

# Arrange plots in 2x2 grid
grid.arrange(p1, p2, p3, p4, ncol = 2)

# Statistical summary
cat("Numerical Statistics by Subscription Status:\n\n")
cat("Age:\n")
print(age_data %>% group_by(subscribe) %>% summarise(
  count = n(),
  mean = mean(Age, na.rm = TRUE),
  std = sd(Age, na.rm = TRUE),
  min = min(Age, na.rm = TRUE),
  q25 = quantile(Age, 0.25, na.rm = TRUE),
  median = median(Age, na.rm = TRUE),
  q75 = quantile(Age, 0.75, na.rm = TRUE),
  max = max(Age, na.rm = TRUE)
))

cat("\nPlayed Hours:\n")
print(players_complete %>% group_by(subscribe) %>% summarise(
  count = n(),
  mean = mean(played_hours, na.rm = TRUE),
  std = sd(played_hours, na.rm = TRUE),
  min = min(played_hours, na.rm = TRUE),
  q25 = quantile(played_hours, 0.25, na.rm = TRUE),
  median = median(played_hours, na.rm = TRUE),
  q75 = quantile(played_hours, 0.75, na.rm = TRUE),
  max = max(played_hours, na.rm = TRUE)
))

### 4.4 Key Insights from Exploratory Analysis

**Primary Findings:**

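1. **Subscription Distribution**: The dataset contains more subscribers than non-subscribers, so the response is imbalanced toward the subscribed class. This suggests interest in the newsletter is common among these players, although the counts alone do not measure engagement with its content.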

2. **Experience Level Patterns**: Different experience levels show varying subscription rates, suggesting experience is a meaningful predictor of newsletter engagement.

3. **Engagement Metrics**: Players with higher total sessions and playtime hours show different subscription patterns, indicating behavioral metrics are important predictive features.

4. **Demographics**: Age and gender show some variation between subscriber groups, though relationships may require statistical modeling to fully understand.

**Observations for Modeling:**
- Class imbalance in subscription status may require sampling techniques
- Missing age values need careful handling in preprocessing
- Some numerical variables show skewed distributions
- Categorical variables (experience, gender) show clear patterns worth investigating

---

## 5. Methods and Implementation Plan

This section proposes the predictive modeling approach and justifies the methodology chosen to answer the research question.

### 5.1 Proposed Method

**Selected Method:** Logistic Regression for Binary Classification

**Why Logistic Regression:**
- Well-suited for binary classification (subscribe: Yes/No)
- Provides interpretable coefficients showing predictor impact
- Handles mixed categorical and numerical features
- Robust baseline for our dataset size (196 players)
- Outputs probabilities for interpretable predictions

**Comparison Model:** K-Nearest Neighbors (KNN) to evaluate non-linear patterns
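
As a rough illustration of the proposed method, the sketch below fits a logistic regression with base R `glm()` on simulated data. The simulated relationship (log-odds rising with played hours) and all values are assumptions made for the example; they are not results from the server data.

```r
set.seed(42)
n <- 200

# Simulated predictors; names mirror the planned variables
sim <- data.frame(
  played_hours = runif(n, min = 0, max = 10),
  experience   = factor(sample(c("Amateur", "Regular", "Pro", "Veteran"),
                               n, replace = TRUE))
)

# Assumed data-generating process: log-odds increase by 0.3 per played hour
sim$subscribe <- rbinom(n, size = 1, prob = plogis(-1 + 0.3 * sim$played_hours))

# Logistic regression: binomial family with the default logit link
fit <- glm(subscribe ~ played_hours + experience, data = sim, family = binomial)

# Exponentiated coefficients are odds ratios
print(round(exp(coef(fit)), 2))

# Predicted subscription probabilities for the training rows
pred_prob <- predict(fit, type = "response")
summary(pred_prob)
```

An odds ratio above 1 for `played_hours` means each additional hour multiplies the odds of subscribing by that factor, which is the kind of interpretable output the method is chosen for.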

### 5.2 Method Justification

**Why This Method is Appropriate:**

1. **Problem Alignment:** Binary outcome (subscription status) perfectly suits logistic regression
2. **Data Characteristics:** Mixed variable types with moderate sample size
3. **Interpretability:** Stakeholders need to understand which features drive subscriptions
4. **Proven Approach:** Standard method for customer behavior and subscription prediction

### 5.3 Required Assumptions

**Key Assumptions for Logistic Regression:**

1. **Binary Outcome:** Response is binary (subscribe: True/False) ✓ *Met*
2. **Independence of Observations:** Players make individual subscription decisions
3. **Linear Log-Odds:** Log-odds of outcome linearly related to predictors
4. **No Perfect Multicollinearity:** Predictors not perfectly correlated
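5. **Sufficient Sample Size:** 196 players, ~130 of them subscribers, with 6 predictors. The limiting count for logistic regression is the smaller class (~66 non-subscribers), which gives roughly 10 observations per predictor before dummy coding; this is adequate but leaves little room for extra terms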
6. **Limited Outliers:** Will address outliers identified in EDA
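
The multicollinearity assumption can be checked numerically with variance inflation factors (VIF), where each predictor is regressed on the others and VIF = 1 / (1 - R²). The sketch below computes this by hand in base R on simulated predictors. The strong link between `played_hours` and `total_sessions` is built into the simulation as an assumption, to show what a high VIF looks like.

```r
set.seed(1)
n <- 100

# Simulated predictors; played_hours is constructed to track total_sessions closely
total_sessions <- rpois(n, lambda = 10)
played_hours   <- 0.8 * total_sessions + rnorm(n, sd = 1)
age            <- rnorm(n, mean = 20, sd = 5)
X <- data.frame(total_sessions, played_hours, age)

# VIF for each predictor: regress it on all the others and use 1 / (1 - R^2)
vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

print(round(vif, 2))
```

Values near 1 indicate little overlap with the other predictors; a frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign.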

### 5.4 Potential Limitations and Weaknesses

**Method Limitations:**
- Linear decision boundary may miss complex non-linear patterns
- Requires manual creation of interaction terms
- One-hot encoding increases feature dimensionality

**Data Limitations:**
- Missing age values (40+ missing observations)
- Class imbalance favoring subscribers
- Moderate sample size limits model complexity

**Mitigation Strategies:**
- Compare with KNN to capture non-linear patterns
- Test multiple imputation strategies for missing values
- Use stratified sampling and class weight adjustments

### 5.5 Model Comparison and Selection Strategy

**Evaluation Metrics:**
- Accuracy: Overall correct predictions
- Precision: Among predicted subscribers, % actually subscribed
- Recall: Among actual subscribers, % correctly identified
- F1-Score: Harmonic mean of precision and recall
- ROC-AUC: Area under ROC curve
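
The threshold-based metrics above follow directly from the four cells of a confusion matrix. A small worked example in base R, using invented labels and predictions (1 = subscribed):

```r
# Invented ground truth and predictions for ten players
actual    <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 1, 0, 0, 1, 0, 0, 0)

# Confusion matrix cells
tp <- sum(predicted == 1 & actual == 1)  # subscribers correctly identified
fp <- sum(predicted == 1 & actual == 0)  # non-subscribers predicted as subscribers
fn <- sum(predicted == 0 & actual == 1)  # subscribers that were missed
tn <- sum(predicted == 0 & actual == 0)  # non-subscribers correctly identified

accuracy  <- (tp + tn) / length(actual)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

cat(sprintf("Accuracy %.3f | Precision %.3f | Recall %.3f | F1 %.3f\n",
            accuracy, precision, recall, f1))
# Accuracy 0.700 | Precision 0.800 | Recall 0.667 | F1 0.727
```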

**Models to Compare:**
1. Baseline: Majority class prediction
2. Logistic Regression (all features)
3. Logistic Regression (with feature selection)
4. K-Nearest Neighbors (k=[3,5,7,9,11])
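
The sketch below shows how the majority-class baseline and the KNN candidates could be scored side by side. It uses `knn()` from the `class` package, which ships with standard R installations. The data are simulated with a deliberately non-linear class boundary, so the accuracies are illustrative only.

```r
library(class)  # provides knn()

set.seed(7)
n <- 120

# Two simulated, already-standardized features and a circular class boundary
x <- matrix(rnorm(n * 2), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 < 1.5, "Yes", "No"))

# Simple train/validation split for the illustration
train_idx <- sample(n, 80)
x_train <- x[train_idx, ]
y_train <- y[train_idx]
x_val   <- x[-train_idx, ]
y_val   <- y[-train_idx]

# Baseline: always predict the most frequent training class
majority     <- names(which.max(table(y_train)))
baseline_acc <- mean(y_val == majority)

# KNN accuracy for each candidate k
ks <- c(3, 5, 7, 9, 11)
knn_acc <- sapply(ks, function(k) {
  pred <- knn(train = x_train, test = x_val, cl = y_train, k = k)
  mean(pred == y_val)
})
names(knn_acc) <- paste0("k=", ks)

cat(sprintf("Baseline accuracy: %.3f\n", baseline_acc))
print(round(knn_acc, 3))
```

Any model that cannot beat the baseline on the chosen metric adds no predictive value, which is why the baseline is listed first.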

**Selection Criteria:**
- Primary: Highest cross-validated F1-score
- Secondary: Model interpretability (favor logistic regression if scores similar)
- Tertiary: Smallest train-validation performance gap (the test set is reserved for the final evaluation and is not used for model selection)

### 5.6 Data Processing and Validation Plan

**Data Splitting Strategy:**
- Training Set: 70% (~137 players) - model training and cross-validation
- Validation Set: 15% (~30 players) - hyperparameter tuning
- Test Set: 15% (~29 players) - final evaluation only
- All splits stratified by subscription status so that each set preserves the overall subscribed/not-subscribed proportions
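
A stratified 70/15/15 split can be sketched in base R by shuffling row indices within each class and cutting each class separately. The label vector below is invented (196 players, 130 marked as subscribed) purely to show the mechanics, and `assign_split` is a hypothetical helper name.

```r
set.seed(123)

# Invented labels: 196 players, 130 subscribed and 66 not subscribed
subscribe <- rep(c(TRUE, FALSE), times = c(130, 66))

# Hypothetical helper: shuffle one class and cut it 70/15/15
assign_split <- function(idx) {
  idx     <- idx[sample.int(length(idx))]
  n       <- length(idx)
  n_train <- round(0.70 * n)
  n_val   <- round(0.15 * n)
  data.frame(
    row   = idx,
    split = rep(c("train", "validation", "test"),
                times = c(n_train, n_val, n - n_train - n_val))
  )
}

# Apply the helper to each class, then restore the original row order
splits <- do.call(rbind, lapply(split(seq_along(subscribe), subscribe), assign_split))
splits <- splits[order(splits$row), ]

# Class counts per split
print(table(split = splits$split, subscribe = subscribe[splits$row]))
```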

**Preprocessing Steps:**
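1. **Handle Missing Values:** Median imputation for the Age variable, with the median computed on the training set only
2. **Encode Categorical Variables:**
   - Experience: One-hot encoding with one reference level dropped (4 levels give 3 binary features), which avoids perfect multicollinearity with the intercept
   - Gender: Binary encoding (Male=1, Female=0)
3. **Feature Scaling:** Standardize numerical features (mean=0, std=1) using training-set means and standard deviations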
   - Applied to: Age, played_hours, total_sessions, avg_session_duration
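
The preprocessing steps can be sketched as a small function whose parameters (median, means, standard deviations) are learned from the training rows and then reused unchanged on new rows, so no information leaks from validation or test data. The data frames and the `impute_age` and `prepare` helpers below are hypothetical; `model.matrix()` is used for dummy coding and drops the first experience level as the reference.

```r
# Invented training rows and new rows
exp_levels <- c("Amateur", "Regular", "Pro", "Veteran")
train <- data.frame(
  Age          = c(17, 21, NA, 25, 30),
  played_hours = c(0.5, 3, 10, 1, 40),
  experience   = factor(c("Amateur", "Pro", "Regular", "Veteran", "Pro"),
                        levels = exp_levels)
)
new_rows <- data.frame(
  Age          = c(NA, 19),
  played_hours = c(2, 8),
  experience   = factor(c("Regular", "Amateur"), levels = exp_levels)
)

num_cols <- c("Age", "played_hours")

# Hypothetical helper: replace missing ages with a supplied value
impute_age <- function(df, value) {
  df$Age[is.na(df$Age)] <- value
  df
}

# Learn all preprocessing parameters from the training rows only
age_median <- median(train$Age, na.rm = TRUE)
train_imp  <- impute_age(train, age_median)
centers    <- colMeans(train_imp[num_cols])
scales     <- sapply(train_imp[num_cols], sd)

# Hypothetical helper: apply the learned parameters to any data frame
prepare <- function(df) {
  df <- impute_age(df, age_median)
  df[num_cols] <- as.data.frame(scale(df[num_cols], center = centers, scale = scales))
  dummies <- model.matrix(~ experience, data = df)[, -1, drop = FALSE]  # drop intercept column
  cbind(df[num_cols], dummies)
}

train_ready <- prepare(train)
new_ready   <- prepare(new_rows)
print(new_ready)
```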

**Cross-Validation Approach:**
- Method: Stratified 5-Fold Cross-Validation
- Maintains subscription ratio in each fold
- Reports mean ± standard deviation for all metrics
- Used for robust performance estimation and model comparison on the training set
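
Stratified k-fold cross-validation can be written in a few lines of base R: fold labels are assigned separately within each class, then each fold is held out in turn. The sketch below uses simulated data and a single-predictor logistic regression, so the reported accuracy is illustrative and says nothing about the real dataset.

```r
set.seed(2025)
n <- 150

# Simulated data: one predictor and a binary outcome
dat   <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(0.8 + 1.2 * dat$x))

# Assign fold labels within each class so every fold keeps the class ratio
k    <- 5
fold <- integer(n)
for (cls in unique(dat$y)) {
  rows <- which(dat$y == cls)
  fold[rows] <- sample(rep(seq_len(k), length.out = length(rows)))
}

# Fit on k - 1 folds, score accuracy on the held-out fold
cv_acc <- sapply(seq_len(k), function(i) {
  fit  <- glm(y ~ x, data = dat[fold != i, ], family = binomial)
  prob <- predict(fit, newdata = dat[fold == i, ], type = "response")
  mean((prob > 0.5) == (dat$y[fold == i] == 1))
})

cat(sprintf("CV accuracy: %.3f +/- %.3f\n", mean(cv_acc), sd(cv_acc)))
```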

**Evaluation Protocol:**
1. Train models with 5-fold CV on training set
2. Tune hyperparameters using validation set
3. Evaluate final model once on held-out test set
4. Report cross-validation scores and test performance
5. Analyze feature importance and coefficients
6. Generate confusion matrix and ROC curve
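
For the final step, the confusion matrix and ROC-AUC can both be computed without extra packages. The sketch below uses invented labels and predicted probabilities, and obtains AUC from the rank-sum (Mann-Whitney) identity: the probability that a randomly chosen subscriber receives a higher score than a randomly chosen non-subscriber.

```r
# Invented labels (1 = subscribed) and predicted probabilities
actual <- c(1, 1, 1, 0, 1, 0, 0, 1, 0, 0)
score  <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.35, 0.2, 0.1)

# Confusion matrix at a 0.5 threshold
confusion <- table(Predicted = score >= 0.5, Actual = actual == 1)
print(confusion)

# AUC via the rank-sum formula
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
r     <- rank(score)
auc   <- (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

cat(sprintf("ROC-AUC: %.2f\n", auc))
# ROC-AUC: 0.84
```

Here 21 of the 25 subscriber/non-subscriber pairs are ranked correctly, which gives 21 / 25 = 0.84.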