# Clustering & Company Valuation using Clustering (R Version)

---

This notebook demonstrates unsupervised learning and company valuation using clustering, implemented in R. The structure and flow mirror the Python version, but use idiomatic R and tidyverse approaches for clarity and best practice.

We cover:
- **Introduction to clustering**: Simulate data, apply k-means and DBSCAN, and evaluate clustering.
- **Clustering for company valuation**: Use clustering to value companies using the multiples method, with real data.


# Introduction to Clustering in R

## Step 0: Load Required Libraries

We start by loading the necessary libraries. If you are new to R, you may need to install some of these packages using `install.packages("package_name")` in your R console. These libraries provide tools for data manipulation, clustering, and visualization.

In [ ]:
library(tidyverse)
library(cluster)
library(dbscan)
library(ggplot2)
set.seed(50)

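## Step 1: Simulate a Dataset with 4 Clusters (aligned with Python make_blobs)

In [ ]:
# Mirror Python's make_blobs: 200 samples, 2 features, 4 centers, cluster_std=1.6, random_state=50.
# R's random number generator differs from NumPy's, so the points will not match the Python output exactly.
set.seed(50)
n <- 200
centers <- matrix(c(2,2, 8,3, 3,6, 7,7), ncol=2, byrow=TRUE)
cluster_sd <- 1.6  # named cluster_sd to avoid masking stats::sd()
# Draw m points from an isotropic Gaussian centred at mu
gen_cluster <- function(mu, m) {
  tibble(x1 = rnorm(m, mean = mu[1], sd = cluster_sd),
         x2 = rnorm(m, mean = mu[2], sd = cluster_sd))
}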
X <- bind_rows(
  gen_cluster(centers[1,], n/4) %>% mutate(label = factor(1)),
  gen_cluster(centers[2,], n/4) %>% mutate(label = factor(2)),
  gen_cluster(centers[3,], n/4) %>% mutate(label = factor(3)),
  gen_cluster(centers[4,], n/4) %>% mutate(label = factor(4))
)
head(X)

## Step 2: Visualize the Simulated Data

Let's plot the data to see the clusters. Each color represents a different true cluster.

In [ ]:
ggplot(X, aes(x=x1, y=x2, color=label)) +
  geom_point(size=2) +
  theme_minimal() +
  labs(title="Simulated Data with 4 Clusters")

## Step 3: Apply K-Means Clustering

Now, let's use the k-means algorithm to find clusters in the data. We specify 4 clusters (since we know the true number here).

In [ ]:
kmeans_result <- kmeans(X %>% select(x1, x2), centers=4, nstart=25)
X$cluster <- as.factor(kmeans_result$cluster)
ggplot(X, aes(x=x1, y=x2, color=cluster)) +
  geom_point(size=2) +
  theme_minimal() +
  labs(title="K-means Clusters (k=4)")

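## Step 4: Evaluate Clustering Quality

We use two common metrics:
- **Silhouette score**: Measures how well each point fits within its cluster. It ranges from -1 to 1: values near 1 mean the point sits well inside its cluster, values near 0 mean it lies between clusters, and negative values suggest it may be assigned to the wrong cluster.
- **Within-Cluster Sum of Squares (WCSS)**: Lower values mean tighter clusters. WCSS tends to fall as the number of clusters grows, so it is most useful for comparing different values of k on the same data.

In [ ]:
# Column 3 of the silhouette object holds the per-point silhouette width
sil <- silhouette(kmeans_result$cluster, dist(X %>% select(x1, x2)))
mean_sil <- mean(sil[,3])
wcss <- kmeans_result$tot.withinss
cat("Silhouette (mean):", round(mean_sil,3), "\nWCSS:", round(wcss,1), "\n")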
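
To make the silhouette definition concrete, the sketch below recomputes it by hand on a made-up six-point dataset and compares the result with `cluster::silhouette`. The points, labels, and the helper `manual_silhouette` are illustrative assumptions, not part of the notebook's data. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the lowest mean distance to any other cluster, and the score is `(b - a) / max(a, b)`.

```r
library(cluster)

# Toy data (invented for illustration): two well-separated groups of three points
pts <- matrix(c(0, 0,  0, 1,  1, 0,
                5, 5,  5, 6,  6, 5), ncol = 2, byrow = TRUE)
labels <- rep(1:2, each = 3)
d <- as.matrix(dist(pts))

# Hypothetical helper: silhouette width of point i computed from the definition
manual_silhouette <- function(i) {
  own <- setdiff(which(labels == labels[i]), i)
  a <- mean(d[i, own])                                   # cohesion: own cluster
  others <- setdiff(unique(labels), labels[i])
  b <- min(sapply(others, function(g) mean(d[i, labels == g])))  # nearest other cluster
  (b - a) / max(a, b)
}

manual <- sapply(seq_len(nrow(pts)), manual_silhouette)
pkg <- as.numeric(silhouette(labels, dist(pts))[, "sil_width"])
all.equal(manual, pkg)
```

If the two vectors agree, the package value can be read as exactly this ratio, averaged over all points to obtain the mean silhouette reported above.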

## Step 5: Explore Different Numbers of Clusters

It's important to test different cluster counts to find the best fit. Here, we loop through 2 to 7 clusters and record the metrics.

In [ ]:
max_n_clusters <- 7
results <- tibble(Clusters=integer(), Silhouette=numeric(), WCSS=numeric())
for (k in 2:max_n_clusters) {
  km <- kmeans(X %>% select(x1, x2), centers=k, nstart=25)
  sil <- silhouette(km$cluster, dist(X %>% select(x1, x2)))
  results <- results %>% add_row(Clusters=k, Silhouette=mean(sil[,3]), WCSS=km$tot.withinss)
}
results

## Step 6: Visualize Cluster Evaluation Metrics

We plot both the Silhouette score and the (scaled) WCSS to help decide the optimal number of clusters. Look for the 'elbow' in the WCSS curve and the peak in Silhouette.

In [ ]:
# linewidth replaces size for line geoms in ggplot2 >= 3.4.0
ggplot(results, aes(x=Clusters)) +
  geom_line(aes(y=Silhouette, color="Silhouette"), linewidth=1.2) +
  geom_point(aes(y=Silhouette, color="Silhouette"), size=2) +
  # scale() returns a one-column matrix, so convert the z-scored WCSS to a plain vector
  geom_line(aes(y=as.numeric(scale(WCSS)), color="WCSS (scaled)"), linewidth=1.2, linetype="dashed") +
  geom_point(aes(y=as.numeric(scale(WCSS)), color="WCSS (scaled)"), size=2, shape=17) +
  scale_color_manual(values=c("Silhouette"="blue", "WCSS (scaled)"="red")) +
  labs(y="Metric Value", color="Metric", title="Cluster Evaluation Metrics") +
  theme_minimal()
## Step 7: Try DBSCAN on a Non-Spherical Dataset

Some datasets have clusters that are not round. DBSCAN is a clustering algorithm that can find clusters of arbitrary shape. Let's generate a 'moons' dataset and visualize it.

In [ ]:
set.seed(0)
n <- 200
noise <- 0.05
theta <- runif(n/2, 0, pi)
nx1 <- rnorm(n/2, 0, noise); ny1 <- rnorm(n/2, 0, noise)
nx2 <- rnorm(n/2, 0, noise); ny2 <- rnorm(n/2, 0, noise)
x1 <- c(cos(theta) + nx1, 1 - cos(theta) + nx2)
x2 <- c(sin(theta) + ny1, -sin(theta) + 0.5 + ny2)
moon_df <- tibble(x1 = x1, x2 = x2)
ggplot(moon_df, aes(x=x1, y=x2)) + geom_point(size=2) + theme_minimal() + labs(title="Moons Data (aligned with Python)")

## Step 8: K-Means on Moons Data

Let's see how k-means performs on this non-spherical data.

In [ ]:
km_moon <- kmeans(moon_df %>% select(x1, x2), centers=2, nstart=25)
moon_df$cluster <- as.factor(km_moon$cluster)
ggplot(moon_df, aes(x=x1, y=x2, color=cluster)) +
  geom_point(size=2) +
  theme_minimal() +
  labs(title="K-means on Moons Data")
# moon_df now also holds the factor column `cluster`, so compute distances on the coordinates only
sil <- silhouette(km_moon$cluster, dist(moon_df %>% select(x1, x2)))
mean_sil <- mean(sil[,3])
wcss <- km_moon$tot.withinss
cat("Silhouette:", round(mean_sil,3), "\nWCSS:", round(wcss,1), "\n")

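## Step 9: DBSCAN on Moons Data

Now, let's use DBSCAN, which can find clusters of arbitrary shape and is robust to noise.

In [ ]:
# Pass only the coordinate columns; earlier steps may have added label columns to moon_df
db_moon <- dbscan(moon_df %>% select(x1, x2), eps=0.3, minPts=5)
# In the dbscan package, cluster 0 marks points classified as noise
moon_df$dbscan <- as.factor(db_moon$cluster)
ggplot(moon_df, aes(x=x1, y=x2, color=dbscan)) +
  geom_point(size=2) +
  theme_minimal() +
  labs(title="DBSCAN on Moons Data")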
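
The noise handling mentioned above can be seen on a small example. The sketch below uses invented data (two dense blobs plus one far-away point) and the same `eps` and `minPts` values as this step. These values suit the toy data and are assumptions, not a general recommendation.

```r
library(dbscan)

set.seed(0)
# Hypothetical data: two dense blobs of 50 points and a single distant outlier (last row)
pts <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.1), ncol = 2),
  c(10, 10)
)

db <- dbscan(pts, eps = 0.3, minPts = 5)

# Cluster 0 collects points that belong to no dense region
table(db$cluster)
outlier_label <- db$cluster[nrow(pts)]
outlier_label
```

Unlike k-means, which would force the outlier into one of the clusters, DBSCAN leaves it unassigned, which is what "robust to noise" means in practice.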

# Clustering for Company Valuation

In this section, we use clustering to help value a company using the multiples method. We will walk through each step, explaining the purpose and R code for each part.

## Step 1: Data Collection

Let's load company financial data. Make sure the file `financialdata_original.csv` is in your working directory. If not, set the correct path or upload the file.

In [ ]:
library(readr)
dataset <- read_csv("financialdata_original.csv")
head(dataset, 12)

## Step 2: Data Preprocessing

Let's inspect the data, check for missing values, and remove any incomplete rows. This ensures our clustering is not affected by missing data.

In [ ]:
glimpse(dataset)
summary(dataset)
sum(is.na(dataset))  # total number of missing cells
dataset <- dataset %>% drop_na()
summary(dataset)

## Step 3: Model Selection - PCA Explained Variance (aligned with Python)

In [ ]:
# Standardize the numeric columns so that no single variable dominates the distances
num_data <- dataset %>% select(where(is.numeric))
num_data_scaled <- scale(num_data)
# Data are already centered and scaled, so prcomp should not do it again
pca <- prcomp(num_data_scaled, center=FALSE, scale.=FALSE)
exp_var <- (pca$sdev^2) / sum(pca$sdev^2)
cum_exp <- cumsum(exp_var)
df_pca <- tibble(PC = seq_along(exp_var), ExplainedVar = exp_var, CumExplainedVar = cum_exp)
# Bars show the individual explained variance ratio, the step line shows the cumulative ratio
ggplot(df_pca, aes(x=PC)) +
  geom_col(aes(y=ExplainedVar), alpha=0.5) +
  geom_step(aes(y=CumExplainedVar), direction="mid", color="steelblue") +
  labs(y="Explained Variance Ratio (bars: individual, line: cumulative)", x="Number of Principal Components") +
  theme_minimal()
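
One way to read the explained-variance plot is to pick the smallest number of components whose cumulative ratio reaches a chosen threshold. The sketch below illustrates this on R's built-in `USArrests` data, which stands in for the company financials. The 90% threshold and the `_demo` names are assumptions made for illustration.

```r
# Built-in USArrests data stands in for the company financials (illustration only)
scaled_demo <- scale(USArrests)
pca_demo <- prcomp(scaled_demo, center = FALSE, scale. = FALSE)

# Share of total variance carried by each principal component
exp_var_demo <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
cum_demo <- cumsum(exp_var_demo)

threshold <- 0.90  # hypothetical cut-off; choose what suits the analysis
# Smallest number of components whose cumulative share reaches the threshold
n_components <- which(cum_demo >= threshold)[1]
n_components
```

A small `n_components` relative to the number of variables suggests that the variables are strongly correlated and that the clustering effectively works in a lower-dimensional space.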

## Step 4: Clustering and Validation (aligned with Python)

In [ ]:
# Evaluate clusters for k = 2..20 using silhouette and WCSS
# (Python's range(2, 21) stops at 20, so the upper bound here is 20, not 21)
library(cluster)
max_n_clusters <- 20
tab <- tibble(Clusters=integer(), Silhouette=numeric(), WCSS=numeric())
for (k in 2:max_n_clusters) {
  km <- kmeans(num_data_scaled, centers=k, nstart=25)
  sil <- silhouette(km$cluster, dist(num_data_scaled))
  tab <- tab %>% add_row(Clusters=k, Silhouette=mean(sil[,3]), WCSS=km$tot.withinss)
}
tab

# Elbow-style plot: Silhouette and scaled WCSS
ggplot(tab, aes(x=Clusters)) +
  geom_line(aes(y=Silhouette, color="Silhouette"), linewidth=1.2) +
  geom_point(aes(y=Silhouette, color="Silhouette"), size=2) +
  geom_line(aes(y=as.numeric(scale(WCSS)), color="WCSS (scaled)"), linewidth=1.2, linetype="dashed") +
  geom_point(aes(y=as.numeric(scale(WCSS)), color="WCSS (scaled)"), size=2, shape=17) +
  scale_color_manual(values=c("blue", "red")) +
  labs(y="Metric Value", color="Metric", title="Cluster Evaluation Metrics") +
  theme_minimal()

# Choose k (e.g., 8 as in Python example) and assign clusters
set.seed(42)
k <- 8
km <- kmeans(num_data_scaled, centers=k, nstart=25)
dataset$Cluster <- as.factor(km$cluster)

## Step 5: Identify Closest Companies

Suppose we want to value `Company_11`. We find its cluster and select all companies in the same cluster as its peers.

In [ ]:
company11_cluster <- dataset %>%
  filter(shortName == "Company_11") %>%
  pull(Cluster)
similar_companies <- dataset %>% filter(Cluster == company11_cluster)
similar_companies

## Step 6: Merge with Market Data

Remove `Company_11` from the peer group and merge with additional market data (e.g., market cap, enterprise value) from `financialdata_extra.csv`.

In [ ]:
similar_companies <- similar_companies %>% filter(shortName != "Company_11")
data_extra <- read_csv("financialdata_extra.csv")
merged_data <- left_join(similar_companies, data_extra, by="shortName")
head(merged_data)
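
A `left_join` keeps every peer even when the extra file has no matching row, filling the new columns with `NA`. The sketch below shows one way to spot such peers with `anti_join`. The company names and market caps are invented, and `peers_demo`, `extra_demo`, and `unmatched` are hypothetical names.

```r
library(dplyr)

# Hypothetical peer list and market data; Company_2 is deliberately missing from the latter
peers_demo <- data.frame(shortName = c("Company_1", "Company_2", "Company_3"))
extra_demo <- data.frame(shortName = c("Company_1", "Company_3"),
                         marketCap = c(100, 300))

# All three peers survive the join; the unmatched one gets NA for marketCap
merged_demo <- left_join(peers_demo, extra_demo, by = "shortName")

# Peers with no market data at all
unmatched <- anti_join(peers_demo, extra_demo, by = "shortName")
unmatched
```

Checking `unmatched` before computing averages shows how many peers will silently drop out of any statistic computed with `na.rm=TRUE`.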

## Step 7: Valuation - Market Cap

We can use the average market capitalization of the peer group as a simple valuation for `Company_11`.

In [ ]:
avg_market_cap <- mean(merged_data$marketCap, na.rm=TRUE)
avg_market_cap

## Step 8: Calculate EV/EBITDA Multiple

A more robust method is to use the average EV/EBITDA multiple of the peer group.

In [ ]:
merged_data <- merged_data %>% mutate(EV_to_ebitda = enterpriseValue / ebitda)
average_EV_ebitda <- mean(merged_data$EV_to_ebitda, na.rm=TRUE)
average_EV_ebitda

## Step 9: Estimate Company_11 Enterprise Value

Finally, we estimate the value of `Company_11` by multiplying its EBITDA by the average EV/EBITDA multiple from its peer group.

In [ ]:
ebitda_value_company11 <- dataset %>%
  filter(shortName == "Company_11") %>%
  pull(ebitda)
Company11_EV <- average_EV_ebitda * ebitda_value_company11
Company11_EV
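
The arithmetic of the multiples method can be checked on a tiny worked example. Every figure below is invented for illustration, and the names `peers_val`, `avg_multiple`, `target_ebitda`, and `target_ev` are hypothetical.

```r
# Hypothetical peer group: all figures are invented (think of them as millions)
peers_val <- data.frame(
  shortName       = c("Peer_A", "Peer_B", "Peer_C"),
  enterpriseValue = c(1200, 900, 2000),
  ebitda          = c(150, 100, 200)
)

# Per-peer multiple: 1200/150 = 8, 900/100 = 9, 2000/200 = 10
peers_val$EV_to_ebitda <- peers_val$enterpriseValue / peers_val$ebitda

avg_multiple  <- mean(peers_val$EV_to_ebitda)   # (8 + 9 + 10) / 3 = 9
target_ebitda <- 120                            # hypothetical EBITDA of the company being valued
target_ev     <- avg_multiple * target_ebitda   # 9 * 120 = 1080
target_ev
```

Because a single peer with an unusual multiple can shift the mean noticeably, comparing it with `median(peers_val$EV_to_ebitda)` is a cheap sanity check on the estimate.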