# Data Science Project - Cluster Analysis

In [None]:
library(dplyr)
library(ggplot2)  # loaded explicitly: ggplot(), theme_bw() etc. are used below
library(GGally)
library(dbscan)

In [None]:
insurance = na.omit(read.csv("data/insurance_final.csv"))

## Business Question

Your boss wants to revise the current insurance premium system, which depends solely on the age of the policyholder.
He asks you to search the data for groups of customers that differ in their amount of `charges`.
He suggests that looking at the customers' `age` in combination with their `charges` might be a good starting point for the cluster analysis.


## EDA

As a first step, we therefore look at the pairs plot of both variables.

In [None]:
ggpairs(insurance[, c("age", "charges")]) + theme_bw()

The plot indicates three different groups within the data, with low, medium and high charges.
Moreover, all three groups show charges increasing with age.

## DBSCAN

After tuning the `eps` and `minPts` parameters, DBSCAN offers the best results.
We get the following output:

In [None]:
# Standardise both variables so that age and charges contribute equally to the distances
db1 = dbscan(as.data.frame(scale(insurance[, c("age", "charges")])), eps = .4, minPts = 20)
db1
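
A common heuristic for choosing `eps` is to sort the distances of all points to their k-th nearest neighbour, with k = `minPts` - 1, and to pick a value near the "knee" of the resulting curve. The following is only a minimal sketch of that idea, not necessarily how the values above were chosen. The helper name `sorted_knn_dist` and the simulated matrix `toy` are made up for illustration.

```r
library(dbscan)

# Hypothetical helper: sorted distance of every point to its
# (min_pts - 1)-th nearest neighbour. dbscan counts the point itself
# towards minPts, hence k = min_pts - 1.
sorted_knn_dist = function(x, min_pts) {
  sort(as.numeric(kNNdist(x, k = min_pts - 1)))
}

# Simulated two-blob data, for illustration only (not the insurance data).
set.seed(42)
toy = rbind(
  matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(200, mean = 3, sd = 0.3), ncol = 2)
)

knn_d = sorted_knn_dist(toy, min_pts = 20)

# The "knee" of this curve is a candidate value for eps.
plot(knn_d, type = "l",
     xlab = "Points sorted by distance", ylab = "19-NN distance")
```

With the notebook's data, the call would be `sorted_knn_dist(scale(insurance[, c("age", "charges")]), min_pts = 20)`. The `kNNdistplot()` function from the `dbscan` package draws the same kind of plot directly.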

Let's have a look at the resulting scatterplot:

In [None]:
# dbscan labels noise points as cluster 0, so the factor levels are "0", "1", "2", "3"
insurance$db_cluster = as.factor(db1$cluster)
ggplot(insurance, aes(age, charges, col = db_cluster)) +
	geom_point() + theme_bw() + guides(col = guide_legend(title = "Cluster")) +
	scale_color_hue(labels = c("Noise", "1", "2", "3"))
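
To put numbers on what the scatterplot shows, the clusters can be summarised by size, age and charges. This is a sketch under assumptions: the helper name `summarise_clusters` is hypothetical, and the `toy` rows are made up so that the block runs on its own. It expects the columns `db_cluster`, `age` and `charges` used in this notebook.

```r
library(dplyr)

# Hypothetical helper: cluster size plus median age and charges per cluster.
summarise_clusters = function(data) {
  data %>%
    group_by(db_cluster) %>%
    summarise(
      n = n(),
      median_age = median(age),
      median_charges = median(charges),
      .groups = "drop"
    )
}

# Made-up toy rows, for illustration only (cluster 0 = noise).
toy = data.frame(
  db_cluster = factor(c(0, 1, 1, 1, 2, 2)),
  age = c(60, 20, 30, 40, 35, 45),
  charges = c(50000, 2000, 4000, 6000, 20000, 24000)
)

summarise_clusters(toy)
```

Calling `summarise_clusters(insurance)` after the cell above would give the same table for the actual clusters.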

## Exploring the final Clusters

### Smokers in each cluster
As a last step, we explore the final clusters using some demographic variables from the original data.
We start by plotting the number of smokers within each cluster:

In [None]:
ggplot(insurance) + geom_bar(aes(db_cluster, fill = smoker)) +
	theme_bw() + guides(fill = guide_legend(title = "Smoker")) +
	xlab("") + ylab("Count") +
	scale_x_discrete(labels = c("0" = "Noise", "1" = "Cluster 1", "2" = "Cluster 2", "3" = "Cluster 3"))

Cluster 1, the low-charges cluster, contains only non-smokers, while cluster 3 contains only smokers.
The medium-charges cluster 2 contains both smokers and non-smokers.
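
The bar chart shows counts, so the share of smokers per cluster can be easier to read from a table of row-wise proportions. The sketch below uses base R only. The helper name `smoker_share` is hypothetical, and the `toy` data, including the `"yes"`/`"no"` coding of `smoker`, is an assumption made for illustration.

```r
# Hypothetical helper: proportion of each smoker status within each cluster
# (margin = 1 makes every row sum to 1).
smoker_share = function(data) {
  prop.table(table(data$db_cluster, data$smoker), margin = 1)
}

# Made-up toy rows, for illustration only (not the insurance data).
toy = data.frame(
  db_cluster = factor(c(1, 1, 2, 2, 3, 3)),
  smoker = c("no", "no", "yes", "no", "yes", "yes")
)

share = smoker_share(toy)
round(share, 2)
```

With the notebook's data, `smoker_share(insurance)` would show whether clusters 1 and 3 are indeed pure non-smoker and smoker groups.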

### BMI Distribution within the Clusters
Boxplots of the BMI for each cluster show that the people in the second cluster have a lower BMI than the people in the other groups.

In [None]:
ggplot(insurance) + geom_boxplot(aes(x = db_cluster, y = bmi, fill = db_cluster)) +
	theme_bw() + guides(fill = guide_legend(title = "Cluster")) +
	theme(axis.text.y = element_blank(), axis.ticks = element_blank()) +
	xlab("") + ylab("BMI") + coord_flip() +
	scale_fill_hue(labels = c("Noise", "1", "2", "3"))