# Practical on Real Financial Data (R version)

---
This notebook covers an end-to-end unsupervised and supervised learning task on real financial data, focusing on SMEs applying for loans at a P2P lending platform. The workflow mirrors the Python version, but uses idiomatic R and tidyverse approaches for clarity and comparison.

## Topics covered

* Data description and pre-processing
* K-means clustering with different k values
* Cluster evaluation and selection of best k
* Logistic regression classifier for loan default prediction (full dataset)
* Separate models for each identified cluster


In [ ]:
# Load required libraries
# tidyverse: Collection of R packages for data manipulation and visualization
# cluster: Functions for clustering analysis
# factoextra: Extract and visualize results of multivariate data analyses
# caret: Classification and Regression Training
# ggplot2: Grammar of graphics for creating plots
# gridExtra: Arrange multiple plots in a grid
# broom: Convert statistical analysis objects into tidy tibbles
# pROC: Tools for ROC curve analysis
# ROCR: Visualizing classifier performance
# scales: Scale functions for visualization

library(tidyverse)
library(cluster)
library(factoextra)
library(caret)
library(ggplot2)
library(gridExtra)
library(broom)
library(pROC)
library(ROCR)
library(scales)


In [ ]:
# Set seed for reproducibility
# This ensures that random processes (like clustering) produce the same results each time
set.seed(42)

## Data Import

We use a CSV file `borrower_companies.csv` with financial ratios and a `status` column (1 = default, 0 = paid back).

In [ ]:
# Read the data (assume file is in working directory)
# read_csv() loads CSV files into a tibble (modern data frame)
# glimpse() shows the structure of the data - like str() but more readable
dataset <- read_csv('borrower_companies.csv')
glimpse(dataset)

## Data Exploration

Let's check the structure, missing values, and summary statistics.

In [ ]:
# Check dimensions and missing values
# dim() returns number of rows and columns
# is.na() checks for missing values, colSums() counts them by column
cat("Dataset dimensions (rows, columns):\n")
dim(dataset)
cat("\nMissing values per column:\n")
colSums(is.na(dataset))

In [ ]:
# Summary statistics for all variables
# summary() provides min, max, median, mean, and quartiles for numeric variables
summary(dataset)

## Visualize Feature Distributions

Boxplots (standardized) for all features except `status`.

In [ ]:
# Standardize features (excluding status column)
# Standardization converts all features to have mean=0 and standard deviation=1
# This is important for clustering as it prevents features with larger scales from dominating

features <- dataset %>% select(-status)  # Remove the target variable
features_scaled <- as_tibble(scale(features))  # Standardize all features

# Reshape data for plotting (convert from wide to long format)
# This allows us to plot all features in one chart
features_scaled_long <- features_scaled %>% 
  mutate(row = row_number()) %>%  # Add row numbers for identification
  pivot_longer(-row, names_to = 'variable', values_to = 'value')  # Reshape to long format

# Create boxplots to visualize the distribution of each standardized feature
# Boxplots show median, quartiles, and outliers for each variable
ggplot(features_scaled_long, aes(x = value, y = variable)) +
  geom_boxplot(fill = 'skyblue', outlier.alpha = 0.2) +
  labs(title = 'Standardized Feature Distributions', 
       subtitle = 'All features now have mean=0 and std=1',
       x = 'Standardized Value', y = 'Financial Ratios') +
  theme_minimal()

## Outlier Removal (Z-score method)

Remove rows where any feature has |z| > 4.

In [ ]:
# Remove extreme outliers using Z-score method
# Z-score measures how many standard deviations away from the mean a value is
# We remove rows where ANY feature has |z-score| > 4 (very extreme values)

z_scores <- as_tibble(scale(features))  # Calculate z-scores for all features
# Create a mask: TRUE if ALL z-scores in a row are < 4 in absolute value
outlier_mask <- apply(abs(z_scores), 1, function(x) all(x < 4))
dataset_o <- dataset[outlier_mask, ]  # Keep only non-outlier rows

cat("Original dataset size:", nrow(dataset), "rows\n")
cat("After outlier removal:", nrow(dataset_o), "rows\n")
cat("Removed", nrow(dataset) - nrow(dataset_o), "outlier rows\n")

In [ ]:
# Boxplot after outlier removal
features_o <- dataset_o %>% select(-status)
features_o_scaled <- as_tibble(scale(features_o))
features_o_scaled_long <- features_o_scaled %>% 
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = 'variable', values_to = 'value')

ggplot(features_o_scaled_long, aes(x = value, y = variable)) +
  geom_boxplot(fill = 'lightgreen', outlier.alpha = 0.2) +
  labs(title = 'Standardized Feature Distributions (Outliers Removed)', x = '', y = '') +
  theme_minimal()

## Prepare Data for Clustering

Standardize features for clustering.

In [ ]:
# Prepare data for clustering
# X contains the features (financial ratios) for clustering
# y contains the target variable (loan status: 0=paid back, 1=default)
# Note: Clustering is unsupervised, so we don't use y for clustering itself

X <- dataset_o %>% select(-status)  # Features only (remove target variable)
X_scaled <- scale(X)  # Standardize features for clustering
y <- dataset_o$status  # Target variable (for later supervised learning)

## Principal Component Analysis (PCA)

Visualize explained variance to understand dimensionality.

In [ ]:
# Principal Component Analysis (PCA)
# PCA reduces dimensionality by finding the directions of maximum variance
# This helps us understand how much information each dimension contains

# X_scaled is already standardized, so center/scale. are redundant here (but harmless)
pca <- prcomp(X_scaled, center = TRUE, scale. = TRUE)
# Calculate the proportion of variance explained by each principal component
explained_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(explained_var)  # Cumulative variance explained

# Create a visualization showing individual and cumulative explained variance
tibble(PC = seq_along(explained_var),
       Explained = explained_var,
       Cumulative = cum_var) %>%
  ggplot(aes(x = PC)) +
  geom_bar(aes(y = Explained), stat = 'identity', fill = 'steelblue', alpha = 0.6) +
  geom_line(aes(y = Cumulative), color = 'red', linewidth = 1) +  # linewidth needs ggplot2 >= 3.4.0
  geom_point(aes(y = Cumulative), color = 'red', size = 2) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(title = 'PCA: Explained Variance', 
       subtitle = 'Bars show individual variance, red line shows cumulative',
       y = 'Proportion of Variance Explained', 
       x = 'Principal Component') +
  theme_minimal()
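
A common follow-up to this plot is to read off how many components are needed to reach a given share of the variance. The sketch below does this on simulated data; the 90% threshold and the names `toy`, `toy_pca`, `toy_cum_var`, and `n_components` are illustrative assumptions, not part of the original analysis.

```r
# Simulated data standing in for the standardized financial ratios (assumption)
set.seed(42)
toy <- matrix(rnorm(200 * 5), ncol = 5)
toy[, 2] <- toy[, 1] + 0.1 * rnorm(200)  # make two columns strongly correlated

toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)
toy_cum_var <- cumsum(toy_pca$sdev^2) / sum(toy_pca$sdev^2)

# Smallest number of components whose cumulative variance reaches the threshold
threshold <- 0.90  # illustrative choice
n_components <- which(toy_cum_var >= threshold)[1]
cat("Components needed for", threshold * 100, "% of variance:", n_components, "\n")
```

With the notebook's own objects, the equivalent line would be `which(cum_var >= 0.90)[1]`.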

## K-means Clustering: Try Different k

Evaluate clusters using silhouette and WCSS (within-cluster sum of squares).
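
For reference, both metrics can be written out using their standard definitions (the notation is introduced here and does not come from the original notebook). For clusters $C_1, \dots, C_k$ with centroids $\mu_1, \dots, \mu_k$:

$$
\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2
$$

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
$$

Here $a(i)$ is the mean distance from point $i$ to the other points in its own cluster, and $b(i)$ is the smallest mean distance from $i$ to the points of any other cluster. Each $s(i)$ lies in $[-1, 1]$, and the silhouette score for a given k is the mean of $s(i)$ over all points. WCSS always decreases as k grows, which is why it is read through its "elbow" rather than its minimum.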

In [ ]:
# K-means clustering: Try different numbers of clusters (k)
# We'll evaluate each k using two metrics:
# 1. Silhouette Score: measures how well-separated clusters are (higher = better)
# 2. WCSS (Within-Cluster Sum of Squares): measures compactness (lower = better)

max_k <- 7  # Test up to 7 clusters
silhouette_scores <- numeric(max_k - 1)  # Store silhouette scores
wcss <- numeric(max_k - 1)  # Store WCSS values
labels_list <- list()  # Store cluster assignments for each k

cat("Testing different numbers of clusters...\n")
# The distance matrix does not depend on k, so compute it once outside the loop
dist_mat <- dist(X_scaled)
for (k in 2:max_k) {
  cat("k =", k, "\n")
  # Run k-means with multiple random starts to find best solution
  km <- kmeans(X_scaled, centers = k, nstart = 25)
  
  # Calculate the average silhouette width (measures cluster quality)
  ss <- silhouette(km$cluster, dist_mat)
  silhouette_scores[k-1] <- mean(ss[, "sil_width"])
  
  # Store WCSS (within-cluster sum of squares)
  wcss[k-1] <- km$tot.withinss
  
  # Save cluster labels for later visualization
  labels_list[[as.character(k)]] <- km$cluster
}

# Display results table
results_table <- tibble(Clusters = 2:max_k, 
                       Silhouette = round(silhouette_scores, 3), 
                       WCSS = round(wcss, 1))
print(results_table)
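
One way to turn such a table into a choice of k is to take the k with the highest average silhouette width. The sketch below does this on the built-in `iris` measurements as stand-in data; `toy_X`, `toy_dist`, `avg_sil`, and `best_k` are illustrative names. In practice the WCSS elbow and the interpretability of the clusters should inform the choice as well.

```r
library(cluster)  # silhouette()

set.seed(42)
toy_X <- scale(iris[, 1:4])   # stand-in for the standardized financial ratios
toy_dist <- dist(toy_X)       # computed once, reused for every k
k_values <- 2:5

# Average silhouette width for each candidate k
avg_sil <- sapply(k_values, function(k) {
  km <- kmeans(toy_X, centers = k, nstart = 25)
  mean(silhouette(km$cluster, toy_dist)[, "sil_width"])
})

best_k <- k_values[which.max(avg_sil)]
cat("k with the highest average silhouette:", best_k, "\n")
```

Applied to the notebook's objects, the equivalent is `results_table$Clusters[which.max(results_table$Silhouette)]`.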

In [ ]:
# Visualize clusters in 2D using the first two principal components
# This helps us see how well the clusters separate in the most important dimensions

pca_scores <- as_tibble(pca$x[, 1:2])  # Get first two principal components
plots <- list()

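# Create a separate plot for each value of k
for (k in 2:max_k) {
  # Store the labels as a column of the plot data. A loop variable referenced
  # inside aes() is evaluated lazily, when the plot is drawn, so every panel
  # would otherwise be colored with the labels from the last iteration.
  plot_data <- pca_scores %>%
    mutate(cluster = factor(labels_list[[as.character(k)]]))
  plots[[k-1]] <- ggplot(plot_data, aes(x = PC1, y = PC2, color = cluster)) +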
    geom_point(alpha = 0.6, size = 1.5) +
    labs(title = paste('k =', k, 'clusters'), 
         color = 'Cluster',
         x = 'First Principal Component',
         y = 'Second Principal Component') +
    theme_minimal() +
    theme(legend.position = 'bottom')
}

# Arrange all plots in a grid for easy comparison
do.call(grid.arrange, c(plots, ncol = 2))

## Elbow and Silhouette Plots

Choose the optimal number of clusters.

In [ ]:
# Create elbow plot to help choose optimal number of clusters
# - Silhouette score: higher is better (blue line)
# - WCSS: look for "elbow" where improvement slows down (red line)

elbow_df <- tibble(Clusters = 2:max_k, Silhouette = silhouette_scores, WCSS = wcss)

# Create dual-axis plot (this is complex but shows both metrics together)
ggplot(elbow_df, aes(x = Clusters)) +
  # Silhouette score (blue line) - higher is better
  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_line(aes(y = Silhouette), color = 'blue', linewidth = 1) +
  geom_point(aes(y = Silhouette), color = 'blue', size = 3) +
  # WCSS (red line, rescaled) - look for elbow
  geom_line(aes(y = rescale(WCSS, to = range(Silhouette))), color = 'red', linewidth = 1) +
  geom_point(aes(y = rescale(WCSS, to = range(Silhouette))), color = 'red', size = 3) +
  scale_y_continuous(
    name = 'Silhouette Score (Blue)',
    sec.axis = sec_axis(~ rescale(., from = range(elbow_df$Silhouette), to = range(elbow_df$WCSS)), 
                       name = 'WCSS (Red)', labels = comma)
  ) +
  labs(x = 'Number of Clusters',
       title = 'Cluster Evaluation: Silhouette Score vs WCSS',
       subtitle = 'Higher silhouette is better; look for WCSS elbow') +
  theme_minimal() +
  theme(axis.title.y.left = element_text(color = 'blue'),
        axis.title.y.right = element_text(color = 'red'))

## Inspect Cluster Feature Distributions

Pick k = 3 for illustration.

In [ ]:
# Inspect how features differ between clusters
# This helps us understand what makes each cluster unique

chosen_k <- 3  # k = 3 is used for illustration; use the value supported by the plots above
cat("Analyzing clusters for k =", chosen_k, "\n")

cluster_labels <- labels_list[[as.character(chosen_k)]]
X_labeled <- X_scaled %>% as_tibble() %>% mutate(cluster = factor(cluster_labels))

# Show cluster sizes
cat("Cluster sizes:\n")
print(table(cluster_labels))

# Reshape data for plotting density curves
X_long <- X_labeled %>%
  pivot_longer(-cluster, names_to = 'variable', values_to = 'value')

# Create density plots showing how each feature differs between clusters
# Each panel shows one financial ratio, with different colors for each cluster
ggplot(X_long, aes(x = value, fill = cluster)) +
  geom_density(alpha = 0.4) +  # Semi-transparent density curves
  facet_wrap(~ variable, scales = 'free', ncol = 2) +  # Separate panel per feature
  labs(title = 'Feature Distributions by Cluster', 
       subtitle = 'Each panel shows how one financial ratio differs between clusters',
       x = 'Standardized Value', 
       y = 'Density',
       fill = 'Cluster') +
  theme_minimal()

## Supervised Learning: Logistic Regression

Compare a model trained on the full dataset vs. one per cluster.

In [ ]:
# Rebalance the dataset for supervised learning
# Many real-world datasets have imbalanced classes (more non-defaults than defaults)
# We undersample the majority class to at most twice the minority class size,
# so the result has a ratio of at most 2:1 rather than a perfect 1:1 balance

library(rsample)  # For data splitting
library(recipes)  # For data preprocessing

dataset_o$status <- as.factor(dataset_o$status)  # Convert to factor for classification

# Separate majority and minority classes
minority <- dataset_o %>% filter(status == 1)  # Defaults (usually fewer)
majority <- dataset_o %>% filter(status == 0)  # Non-defaults (usually more)

cat("Original class distribution:\n")
print(table(dataset_o$status))

# Undersample majority class to balance the dataset
set_size <- nrow(minority) * 2  # Take 2x minority class size from majority
majority_down <- majority %>% sample_n(min(set_size, nrow(majority)))
balanced <- bind_rows(minority, majority_down)
balanced <- balanced %>% sample_frac(1)  # Shuffle the combined dataset

cat("\nBalanced class distribution:\n")
print(table(balanced$status))

In [ ]:
# Split data into training and testing sets
# Training set: used to build the model
# Testing set: used to evaluate model performance on unseen data

set.seed(42)  # For reproducible splits
# Use stratified sampling to maintain class balance in both sets
split <- initial_split(balanced, prop = 0.8, strata = status)
train <- training(split)  # 80% for training
test <- testing(split)    # 20% for testing

cat("Training set size:", nrow(train), "\n")
cat("Testing set size:", nrow(test), "\n")

# Create preprocessing recipe
# This standardizes features (mean=0, sd=1) using training data statistics
rec <- recipe(status ~ ., data = train) %>%
  step_center(all_predictors()) %>%  # Subtract mean
  step_scale(all_predictors()) %>%   # Divide by standard deviation
  prep()  # Calculate the preprocessing parameters

# Apply preprocessing to both training and testing sets
X_train <- bake(rec, new_data = train) %>% select(-status)
y_train <- train$status
X_test <- bake(rec, new_data = test) %>% select(-status)
y_test <- test$status
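
The recipe is prepared on `train` only because the test set must be scaled with the training means and standard deviations; otherwise information from the test set leaks into the model. A minimal base-R sketch of the same idea on simulated data (the names `toy_train`, `toy_test`, `mu`, and `sigma` are illustrative):

```r
set.seed(42)
toy_train <- matrix(rnorm(80 * 3, mean = 10, sd = 2), ncol = 3)
toy_test  <- matrix(rnorm(20 * 3, mean = 10, sd = 2), ncol = 3)

# Estimate the scaling parameters on the training set only
mu    <- colMeans(toy_train)
sigma <- apply(toy_train, 2, sd)

# Apply the same parameters to both sets
train_scaled <- scale(toy_train, center = mu, scale = sigma)
test_scaled  <- scale(toy_test,  center = mu, scale = sigma)

# Training columns have mean 0 (up to rounding); test columns are only close to 0
round(colMeans(train_scaled), 3)
round(colMeans(test_scaled), 3)
```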

In [ ]:
# Fit logistic regression model
# Logistic regression predicts the probability of default (status = 1)
# It uses a logistic function to map any real number to a probability (0-1)

model <- glm(status ~ ., data = cbind(X_train, status = y_train), family = binomial())

cat("Logistic Regression Model Summary:\n")
cat("=====================================\n")
summary(model)

In [ ]:
# Make predictions and evaluate model performance
# The model outputs probabilities, which we convert to class predictions

# Get predicted probabilities of default
pred_probs <- predict(model, newdata = X_test, type = 'response')

# Convert probabilities to class predictions using 0.5 threshold
pred_class <- ifelse(pred_probs > 0.5, 1, 0)

# Create confusion matrix to see prediction accuracy
conf_mat <- table(Predicted = pred_class, Actual = as.numeric(as.character(y_test)))
cat("Confusion Matrix:\n")
cat("(Rows = Predicted, Columns = Actual)\n")
print(conf_mat)
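
The 0.5 threshold is a convention rather than a requirement: lowering it catches more defaults at the cost of more false alarms. The sketch below illustrates the trade-off on made-up probabilities; `toy_probs`, `toy_actual`, and `threshold_table` are invented for illustration and are not model output.

```r
# Made-up predicted probabilities and true labels (1 = default, 0 = paid back)
toy_probs  <- c(0.05, 0.20, 0.35, 0.45, 0.55, 0.60, 0.80, 0.95)
toy_actual <- c(0,    0,    1,    0,    0,    1,    1,    1)

thresholds <- c(0.3, 0.5, 0.7)

# Sensitivity and specificity at each threshold
threshold_table <- t(sapply(thresholds, function(th) {
  pred <- as.integer(toy_probs > th)
  c(threshold   = th,
    sensitivity = sum(pred == 1 & toy_actual == 1) / sum(toy_actual == 1),
    specificity = sum(pred == 0 & toy_actual == 0) / sum(toy_actual == 0))
}))
print(threshold_table)
```

In this toy example sensitivity falls from 1 to 0.5 as the threshold rises from 0.3 to 0.7, while specificity rises from 0.5 to 1.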

In [ ]:
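# Detailed classification metrics
# mode = 'everything' reports precision, recall, and F1 alongside sensitivity and specificity
cat("\nDetailed Classification Report:\n")
cat("==============================\n")
# Fixing the factor levels keeps the call valid even if only one class is predicted
caret::confusionMatrix(factor(pred_class, levels = levels(y_test)), y_test,
                       positive = '1', mode = 'everything')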
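
To make these metrics concrete, the sketch below computes precision, recall, and F1 by hand from a small made-up set of labels. The vectors `actual` and `predicted` are invented for illustration and are not model output.

```r
# Made-up labels for illustration (1 = default, 0 = paid back)
actual    <- c(1, 1, 1, 0, 0, 0, 0, 1)
predicted <- c(1, 0, 1, 0, 1, 0, 0, 1)

tp <- sum(predicted == 1 & actual == 1)  # defaults correctly flagged
fp <- sum(predicted == 1 & actual == 0)  # repaid loans flagged as defaults
fn <- sum(predicted == 0 & actual == 1)  # defaults that were missed

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)              # also called sensitivity
f1        <- 2 * precision * recall / (precision + recall)

# With tp = 3, fp = 1, fn = 1, all three metrics equal 0.75
cat("Precision:", precision, "Recall:", recall, "F1:", f1, "\n")
```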

In [ ]:
# ROC Curve and AUC Score
# ROC curve shows trade-off between true positive rate and false positive rate
# AUC (Area Under Curve) summarizes performance: 1.0 = perfect, 0.5 = random

# levels/direction are set explicitly so pROC does not choose the direction itself
roc_obj <- roc(as.numeric(as.character(y_test)), pred_probs,
               levels = c(0, 1), direction = '<')
# legacy.axes = TRUE labels the x-axis as 1 - specificity, matching xlab below.
# plot.roc() already draws the random-classifier diagonal; its x-axis runs from
# 1 to 0 in specificity, so abline(a = 0, b = 1) would draw the wrong diagonal.
plot(roc_obj, col = 'blue', main = 'ROC Curve (Full Dataset)', legacy.axes = TRUE,
     xlab = 'False Positive Rate (1 - Specificity)',
     ylab = 'True Positive Rate (Sensitivity)')

auc_score <- auc(roc_obj)
cat("\nAUC Score:", round(as.numeric(auc_score), 3), "\n")
cat("(1.0 = perfect classifier, 0.5 = random guessing)\n")
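
The AUC can also be read as the probability that a randomly chosen defaulter receives a higher predicted probability than a randomly chosen non-defaulter. The sketch below computes it from ranks (the Mann-Whitney formulation) on four made-up scores; `rank_auc`, `toy_labels`, and `toy_scores` are illustrative names, and this is an alternative calculation rather than a description of what `pROC` does internally.

```r
# Rank-based AUC: assumes labels are coded 0/1 with 1 as the positive class
rank_auc <- function(labels, scores) {
  r <- rank(scores)                 # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

toy_labels <- c(0, 0, 1, 1)
toy_scores <- c(0.10, 0.40, 0.35, 0.80)
rank_auc(toy_labels, toy_scores)
# 0.75: three of the four (defaulter, non-defaulter) pairs are ranked correctly
```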

## Per-Cluster Logistic Regression

Repeat the above for each cluster. Clusters with fewer than five companies in either class are skipped.

In [ ]:
# Build separate logistic regression models for each cluster
# Hypothesis: Companies in different clusters might have different default patterns
# This allows us to create specialized models for each group

cat("Building cluster-specific models...\n")
cat("===================================\n")

for (cl in 1:chosen_k) {
  cat('\n--- CLUSTER', cl, '---\n')
  
  # Get data for this cluster only
  idx <- which(cluster_labels == cl)
  cluster_data <- dataset_o[idx, ]
  cluster_data$status <- as.factor(cluster_data$status)
  
  cat('Cluster size:', nrow(cluster_data), 'companies\n')
  
  # Check if we have enough data for both classes
  minority <- cluster_data %>% filter(status == 1)
  majority <- cluster_data %>% filter(status == 0)
  
  cat('Defaults:', nrow(minority), ', Non-defaults:', nrow(majority), '\n')
  
  if (nrow(minority) < 5 || nrow(majority) < 5) {  # || is the scalar form for if() conditions
    cat('Too few samples in one class, skipping this cluster\n')
    next
  }
  
  # Balance the cluster data
  set_size <- nrow(minority) * 2
  majority_down <- majority %>% sample_n(min(set_size, nrow(majority)))
  balanced <- bind_rows(minority, majority_down) %>% sample_frac(1)
  
  # Split and preprocess
  split <- initial_split(balanced, prop = 0.8, strata = status)
  train <- training(split)
  test <- testing(split)
  
  # Standardize features
  rec <- recipe(status ~ ., data = train) %>%
    step_center(all_predictors()) %>%
    step_scale(all_predictors()) %>%
    prep()
  
  X_train <- bake(rec, new_data = train) %>% select(-status)
  y_train <- train$status
  X_test <- bake(rec, new_data = test) %>% select(-status)
  y_test <- test$status
  
  # Fit cluster-specific model
  model <- glm(status ~ ., data = cbind(X_train, status = y_train), family = binomial())
  
  # Make predictions
  pred_probs <- predict(model, newdata = X_test, type = 'response')
  pred_class <- ifelse(pred_probs > 0.5, 1, 0)
  
  # Evaluate performance
  conf_mat <- table(Predicted = pred_class, Actual = as.numeric(as.character(y_test)))
  cat('Confusion Matrix:\n')
  print(conf_mat)
  
  cat('\nDetailed Metrics:\n')
  # Fixing the factor levels keeps the call valid even if only one class is predicted
  print(caret::confusionMatrix(factor(pred_class, levels = levels(y_test)), y_test, positive = '1'))
  
  # ROC curve (plot.roc() draws the random-classifier diagonal itself, so no abline())
  roc_obj <- roc(as.numeric(as.character(y_test)), pred_probs,
                 levels = c(0, 1), direction = '<')
  plot(roc_obj, col = 'red', main = paste0('ROC Curve (Cluster ', cl, ')'), legacy.axes = TRUE,
       xlab = 'False Positive Rate', ylab = 'True Positive Rate')
  
  auc_score <- auc(roc_obj)
  cat('AUC Score:', round(as.numeric(auc_score), 3), '\n')
}


----

## Summary

This notebook demonstrates a complete machine learning workflow combining:

1. **Unsupervised Learning (Clustering)**: We used k-means to identify groups of similar companies based on their financial ratios
2. **Supervised Learning (Classification)**: We built logistic regression models to predict loan defaults

**Key Insights:**
- Clustering helps identify distinct company profiles in the financial data
- Cluster-specific models may perform differently than a single global model
- The combination of clustering and classification can provide both interpretability and predictive power

**R Programming Notes:**
- The code relies mainly on the tidyverse, with base R, `caret`, and `pROC` where they are the simpler fit (e.g. `glm()`, `scale()`, `table()`)
- The workflow closely parallels Python implementations for easy comparison
- Comments explain both the statistical concepts and R-specific syntax

This approach is valuable in finance for risk assessment, where different types of companies may have different risk profiles that warrant specialized models.