# K-MEANS CLUSTERING WITH SOCIAL NETWORKING DATASET

Adapted from Lantz (2015) Chapter 9

For this analysis, we will use a dataset representing a random sample of 30,000 U.S. high school students who had profiles on a well-known social networking service (SNS) in 2006.

To protect the users' anonymity, the SNS will remain unnamed.

The data was sampled evenly across four high school graduation years (2006 through 2009) representing the senior, junior, sophomore, and freshman classes at the time of data collection. Using an automated web crawler, the full text of the SNS profiles was downloaded, and each teen's gender, age, and number of SNS friends were recorded.

A text mining tool was used to divide the remaining SNS page content into words. From the top 500 words appearing across all the pages, 36 words were chosen to represent five categories of interests: namely extracurricular activities, fashion, religion, romance, and antisocial behavior. The 36 words include terms such as football, sexy, kissed, bible, shopping, death, and drugs. The final dataset indicates, for each person, how many times each word appeared in the person's SNS profile.

## Load libraries and data

In [None]:
library(data.table) # to handle the data in a more convenient manner
library(tidyverse) # for a better work flow and more tools to wrangle and visualize the data
library(plotly) # for interactive visualizations
library(fastDummies) # to create dummies from categoric variables
library(mice) # for missing values
library(VIM) # for visualizing missing values
library(tidyimpute) # imputing missing values
library(BBmisc) # for standardization
library(formattable) # for number formatting
library(pheatmap) # heatmap
library(d3heatmap) # heatmap
library(heatmaply) # heatmap
library(factoextra) # visualizing distances, cluster, heatmap
library(knitr) # pretty tables
library(kableExtra) # pretty tables
library(IRdisplay) # pretty tables
library(NbClust) # cluster metrics
library(vegan) # cluster metrics
library(listviewer) # view list object
library(plyr) # for lapply on matrix objects

Load the data:

In [None]:
teens_dt <- data.table::fread("../data/csv/11_01_snsdata.csv", stringsAsFactors = T)

## Explore data

View the structure:

In [None]:
str(teens_dt)

The data include 30,000 teenagers with four variables indicating personal characteristics and 36 words indicating interests.

See factor levels:

In [None]:
teens_dt %>% purrr::keep(is.factor) %>% purrr::map(levels)

And the distribution:

In [None]:
teens_factors <- teens_dt %>% purrr::keep(is.factor) %>% # select factor columns
    tidyr::gather() %>% # convert into long format for faceting
    ggplot(aes(x = value)) + # plot value
    facet_wrap(~ key, scales = "free") + # divide into separate plots by key
    geom_bar()

plotly::ggplotly(teens_factors)

See the numeric variables' distributions (excluding NAs):

In [None]:
teens_dt %>% purrr::keep(is.numeric) %>% sapply(quantile, na.rm = T) %>% t()

In [None]:
teens_dt[,quantile(age, na.rm = T)]

### Cleaning age variable

The age variable has NA values:

In [None]:
teens_dt %>%
    purrr::keep(is.numeric) %>%
    sapply(function(x) sum(is.na(x))) %>%
    "["(. > 0)

And the age distribution is not plausible for high school students:

In [None]:
teens_dt[,quantile(age, na.rm = T)]

We should keep only ages from 13 up to (but not including) 20 and set the rest to NA:

In [None]:
teens_dt[!dplyr::between(age, 13, 20 - 1e-10), age := NA]
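
The same rule can be sketched in base R on a toy vector. The ages below are invented for illustration and are not taken from the dataset; the point is only that values outside the half-open interval [13, 20) become NA while existing NAs stay NA.

```r
# Toy ages, invented for illustration (not from the SNS data)
age <- c(12.9, 13, 17.4, 19.99, 20, 106.9, NA)

# Keep ages in [13, 20); everything else, including NA, ends up as NA
age_clean <- ifelse(age >= 13 & age < 20, age, NA)
age_clean
```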

Now more reasonable:

In [None]:
teens_dt[,quantile(age, na.rm = T)]

### Dummify gender

Gender variable has NA values:

In [None]:
teens_dt[,table(gender,useNA = "ifany")]

Now let's create dummy variables for female and missing gender values:

In [None]:
gender_dummy <- teens_dt[,fastDummies::dummy_cols(.(gender = gender),
                            remove_first_dummy = T)] %>% dplyr::select(-gender)
gender_dummy

And insert these columns after the original gender variable:

In [None]:
teens_dt <- teens_dt %>%
    append(unclass(gender_dummy), after = 2) %>% 
    as.data.table

In [None]:
teens_dt
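
For comparison, indicator columns like the ones created above can also be built by hand in base R. This is a minimal sketch on a toy factor; the names `female` and `no_gender` are illustrative and do not appear in the dataset.

```r
# Toy gender factor with a missing value (illustrative data)
gender <- factor(c("F", "M", NA, "F"))

# 1 only when gender is known and female; rows with NA get 0
female <- ifelse(!is.na(gender) & gender == "F", 1, 0)

# 1 when gender is missing
no_gender <- ifelse(is.na(gender), 1, 0)

data.frame(gender, female, no_gender)
```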

## Data imputation

Let's look at missing values and distribution of complete and incomplete cases:

In [None]:
mdpat <- mice::md.pattern(teens_dt)
mdpat

In this output 1 means data exists and 0 means it is missing (NA). So:

- 23602 cases are complete
- In 3674 cases age is missing
- In 875 cases gender is missing
- In 1849 cases both age and gender are missing

Let's filter for only those columns that have missing values (using the last row of the output):

In [None]:
mdcols <- mdpat %>%
            as.data.table %>%
            .[.N, names(.)[.SD > 0]] %>%
            head(-1)
mdcols

See whether we got it:

In [None]:
teens_dt[,.SD, .SDcols = mdcols]

Now let's visualize the distribution of missing values in these columns:

First using tidyverse notation:

In [None]:
teens_dt %>%
    dplyr::select(mdcols) %>%
    VIM::aggr(numbers = T)

And then in pure data.table:

In [None]:
teens_dt[,VIM::aggr(.SD, numbers = T), .SDcols = mdcols]

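### Imputation using mice package

"pmm" stands for predictive mean matching. For each missing entry, a regression on the other columns gives a predicted value; the observed cases whose predictions are closest serve as donors, and the observed value of one randomly chosen donor fills the gap. Imputed values are therefore always values that actually occur in the data.

The mice() function returns the imputed values separately from the original data; they are merged into a completed dataset in a later step.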
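
The matching idea can be illustrated with a deliberately simplified base R sketch. It is not what mice does internally (mice also draws the regression parameters at random, which is omitted here); the toy data, the helper name `pmm_sketch` and the choice of 3 donors are all assumptions made for illustration.

```r
set.seed(42)
# Toy data (invented): y depends on x, and some y values are missing
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2.1, 3.9, NA, 8.2, 9.8, NA, 14.1, 16.0, NA, 20.3)
obs <- !is.na(y)

# 1. Fit a regression on the observed cases and predict for every case
fit  <- lm(y ~ x, subset = obs)
pred <- predict(fit, newdata = data.frame(x = x))

# 2. For each missing case, find the observed cases with the closest
#    predictions (the donors) and borrow one donor's observed value at random
pmm_sketch <- function(i, donors = 3) {
  d <- order(abs(pred[obs] - pred[i]))[seq_len(donors)]
  pool <- y[obs][d]
  pool[sample.int(length(pool), 1)]
}

y_imp <- y
y_imp[!obs] <- vapply(which(!obs), pmm_sketch, numeric(1))
y_imp
```

Because every imputed value is borrowed from an observed case, the filled-in values stay within the range of the real data.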

In this example, missing age values are imputed using gradyear:

In [None]:
datatemp <- mice::mice(teens_dt[,.(gradyear, age)], method = "pmm", m = 1)

In [None]:
summary(datatemp)

And we can see the imputed values for age:

In [None]:
datatemp$imp$age

Fill the imputed values into a completed dataset:

In [None]:
completed <- mice::complete(datatemp,1)

For the originally missing values, let's see what ages were imputed for each gradyear. First, only the averages:

In [None]:
cbind(teens_dt[,.(age_original = age)], completed)[is.na(age_original), mean(age), by = gradyear]

And then the distribution of the imputed values for each gradyear:

In [None]:
colnms <- quantile(1) %>% names
colnms

In [None]:
cbind(teens_dt[,.(age_original = age)], completed)[
    is.na(age_original),
    .(colnms = colnms, age = quantile(age)),
    by = gradyear] %>%
tidyr::spread(colnms, age) %>%
select(c("gradyear", colnms))

#data.table::setcolorder(c("gradyear", colnms)) %>%
#.[]

We see that mice does not use a single fixed value for each gradyear; the imputed values follow a distribution within each year.

### Imputation using tidyimpute

The tidyimpute package is much simpler: it fills in a fixed value computed with a common summary such as the mean.

In [None]:
cbind(teens_dt[,.(gradyear, age)], age2 = teens_dt[, tidyimpute::impute_mean(.(age))])[
    is.na(age),
    .(colnms = colnms, age = quantile(age2.V1)),
    by = gradyear] %>%
tidyr::spread(colnms, age) %>%
select(c("gradyear", colnms))


It uses the same value both across and within gradyears.

To get a different value for each gradyear, we have to group explicitly:

In [None]:
teens_dt[,age2 := tidyimpute::impute_mean(.(age)), by = gradyear]
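
As a cross-check of what group-wise mean imputation does, here is a minimal base R sketch using `ave()`. The toy data frame and the column name `age_imputed` are invented for illustration.

```r
# Toy data (invented for illustration)
toy <- data.frame(
  gradyear = c(2006, 2006, 2006, 2009, 2009, 2009),
  age      = c(18.5, NA, 19.5, 15.0, 16.0, NA)
)

# Mean of the observed ages within each gradyear, repeated for every row
group_mean <- ave(toy$age, toy$gradyear,
                  FUN = function(v) mean(v, na.rm = TRUE))

# Fill only the missing entries with their group mean
toy$age_imputed <- ifelse(is.na(toy$age), group_mean, toy$age)
toy
```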

In [None]:
teens_dt[is.na(age),
            .(colnms = colnms, age = quantile(age2)),
            by = gradyear] %>%
tidyr::spread(colnms, age) %>%
select(c("gradyear", colnms))


Now the values differ across gradyears but are homogeneous within each year.

Now save the values in age2 into age and delete age2 column:

In [None]:
teens_dt[,c("age", "age2") := .(age2, NULL)]

In [None]:
teens_dt

Now let's see whether values are imputed:

In [None]:
teens_dt %>% dplyr::select(mdcols) %>% VIM::aggr(numbers = T)

## Data normalization

First remember the column names:

In [None]:
names(teens_dt)

We will save the columns related to the 36 keywords separately:

In [None]:
interests <- teens_dt[,basketball:drugs]
interests

In [None]:
interests_z <- interests[,BBmisc::normalize(.SD)]

In [None]:
interests_z %>% sapply(quantile, na.rm = T) %>% t()

In [None]:
interests_z[,lapply(.SD, function(x) c(mean(x), sd(x)))] %>%
                    t %>%
                    round(3) %>%
                    magrittr::set_colnames(c("mean", "sd"))
                    

We see that all means are 0 and all standard deviations are 1. However, since the values are highly skewed (many 0s and few other values), the z-scores of the non-zero values are extreme.
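
A small sketch shows why skew produces extreme z-scores. The vector below is invented: 95 profiles never mention a word and 5 mention it once.

```r
# Toy count variable (invented): 95 zeros and 5 ones
x <- c(rep(0, 95), rep(1, 5))

# Standardize: (x - mean(x)) / sd(x)
z <- as.numeric(scale(x))

round(c(mean = mean(z), sd = sd(z)), 3)

# The few non-zero profiles sit more than 4 standard deviations above the mean
range(z)
```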

In [None]:
props <- lapply(interests, table) %>%
    lapply(prop.table) %>%
    lapply(formattable::percent) %>%
    lapply(round, 3)

props

In [None]:
sapply(props, "[", 1) %>%
    formattable::percent() %>%
    sort(decreasing = T)

We see that for many variables, zero values (meaning the word is never mentioned in the SNS profile) make up more than 90% of all cases.

## Visualizing distances

The Euclidean distances among rows can be visualized as follows (for the first 100 rows only):

In [None]:
distancex <- factoextra::get_dist(interests_z[1:100])

In [None]:
factoextra::fviz_dist(distancex)

Cells closer to red show more proximate observations while cells closer to blue show more distant observations
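
To make the notion of row-to-row distance concrete, here is a minimal sketch with base R's `dist()` on three invented points. It is offered as an illustration of Euclidean distance itself, not as a description of factoextra's internals.

```r
# Three toy observations in two dimensions (invented for illustration)
m <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))

# Pairwise Euclidean distances between the rows
d <- dist(m, method = "euclidean")
dm <- as.matrix(d)
dm
```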

## Build and train model

We fit the model with 5 clusters:

In [None]:
set.seed(2345)
teen_clusters <- kmeans(interests_z, centers = 5)

In [None]:
summary(teen_clusters)

Sizes of each cluster are:

In [None]:
teen_clusters$size

The center values of each variable for each cluster are:

In [None]:
centers <- teen_clusters$centers %>% t %>% round(2)
centers

Let's use kableExtra to highlight the cells in each row with markedly high or low values (more than one standard deviation from the row mean):

In [None]:
apply(centers,
      1,
      function(x)
        {
          zs <- (x - mean(x)) / sd(x);
          cell_spec(x,
                    color = ifelse(abs(zs) > 1, "white", "black"),
                            background = ifelse(zs > 1, "navy", ifelse(zs < -1, "red", "white"))
                   )
        }
    ) %>%
t %>%
magrittr::set_colnames(1:5) %>%      
knitr::kable(escape = F) %>%
kableExtra::kable_styling() %>%
as.character() %>%
IRdisplay::display_html()

We can also visualize distinctive cluster and variable matchings with a heatmap:

In [None]:
d3heatmap::d3heatmap(centers, Rowv = F, Colv = F)

Another method of visualizing the centers data with heatmap:

In [None]:
pheatmap::pheatmap(centers, cluster_rows = F, cluster_cols = F)

For each cluster, let's select those variables for which the cluster is above some level: 

In [None]:
namesx <- rownames(centers)

apply(centers, 1, BBmisc::normalize) %>%
    t %>%
    plyr::alply(2, function(x) namesx[x > 1])
    #%>%
    #as.data.frame %>%
    #lapply(function(x) namesx[x > 1])

- The first cluster is above the mean for cheerleading, hollister, shopping and abercrombie. This cluster can be named as "princesses" (as per Lantz)
- The second cluster is above the mean for band and marching. This cluster can be named as "musicians" (Lantz named it as "brains")
- The 3rd cluster is above the mean for many of the sport types. This cluster can be named as "athletes"
- The 4th cluster is above the mean for hair, dress, clothes, die, death, drunk, drugs. This cluster can be named as "punks" (Lantz named it as "criminals")
- The 5th cluster is not distinctive in any of the terms. These are called "basket cases": users who did not post any interests. It is the largest cluster of all

We can also visualize the clusters' borders across dimensions using factoextra's fviz_cluster.

Note that when there are more than 2 dimensions, this function automatically conducts a PCA and plots the first two principal components, which explain the most variance:

In [None]:
factoextra::fviz_cluster(teen_clusters, data = interests_z, labelsize = 0)

The second component on the y axis is probably related to the intensity of sport related interests

And destructive keywords like "death" or "drugs" may be captured with the first component on the x axis

Especially "athletes" and "punks" are wide apart
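
The idea behind such a plot can be approximated by hand: run a PCA, keep the first two components and colour the points by cluster. The sketch below uses invented toy data and base R only; it is an approximation of the idea rather than factoextra's exact procedure, which may, for instance, standardize the data first.

```r
set.seed(1)
# Toy data: two shifted groups in four dimensions (invented for illustration)
x <- rbind(matrix(rnorm(80), ncol = 4),
           matrix(rnorm(80, mean = 3), ncol = 4))

km <- kmeans(x, centers = 2, nstart = 10)

# Project onto the first two principal components and colour by cluster
pc <- prcomp(x)
plot(pc$x[, 1:2], col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")
```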

## Enhance data exploration with cluster information

Now we can add the cluster info back into the original dataset

In [None]:
teens_dt[,cluster := teen_clusters$cluster]
teens_dt

Given this new data, we can start to examine how the cluster assignment relates to individual characteristics

In [None]:
teens_dt[,c("cluster", "gender", "age", "friends")]

We can also look at the demographic characteristics of the clusters. For example mean ages across clusters:

In [None]:
aggcols <- c("age", "gender_F", "friends")

teens_dt[,lapply(.SD, mean) ,.SDcols = aggcols, by = cluster][order(cluster)] %>%
magrittr::set_rownames(c("princesses", "musicians", "athletes", "punks", "basket cases")) %>%
round(2)

The mean age does not vary much by cluster, which is not too surprising as these teen identities are often determined before high school. However, the average age of "princesses" is slightly below, and the average age of "musicians" slightly above, the average ages of the other clusters.

The percentage of females is highest in the princesses and athletes clusters and lowest in the musicians and basket cases clusters.

The connection between a teen's number of friends and their predicted cluster is remarkable, given that we did not use the friendship data as an input to the clustering algorithm. Also interesting is the fact that the number of friends seems to be related to the stereotype of each cluster's high school popularity; the stereotypically popular groups tend to have more friends (highest in the "princesses" and "athletes" clusters, lowest in the "punks" and "basket cases" clusters).


## Improve model performance

While conducting k-means analysis, what value should be provided as "k" - the number of clusters?

### Manual simulation

First let's dig into the model output:

In [None]:
teen_clusters %>% listviewer::jsonedit(mode = "form")

The critical values are:
- totss (total sum of squares)
- tot.withinss (total within groups sum of squares)
- betweenss (between groups sum of squares)

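These values satisfy totss = tot.withinss + betweenss. Since totss is fixed for a given dataset, as k goes up tot.withinss should shrink and betweenss should grow by the same amount.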
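
The decomposition of the total sum of squares can be checked on a small example. The data below is random toy data, and `nstart = 10` is an arbitrary choice to make the fits more stable.

```r
set.seed(1)
# Toy data (invented for illustration)
x <- matrix(rnorm(200), ncol = 2)

# Fit k-means for k = 1..6 and collect the three sums of squares
ss <- sapply(1:6, function(k) {
  km <- kmeans(x, centers = k, nstart = 10)
  c(k = k, totss = km$totss,
    tot.withinss = km$tot.withinss, betweenss = km$betweenss)
})
ss <- t(ss)
round(ss, 2)
```

In every row, the within and between columns add up to the same total.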

In [None]:
withinss <- sapply(1:15,
       function(x) kmeans(interests_z, centers = x) %>%
       "["(c("totss", "tot.withinss", "betweenss")) %>% unlist
       ) %>%
t %>%
as.data.table

rownames(withinss) <- 1:15

In [None]:
withinss %>% round

In [None]:
p1 <- withinss %>%
ggplot(aes(x = withinss[,.I], y = tot.withinss)) +
geom_line() +
xlab("Number of clusters") +
ylab("Within group sum of squares")

plotly::ggplotly(p1)

We cannot detect a clear elbow point to cut the number of clusters

### Optimal k with vegan package

The vegan package also runs k-means over a range of k values and picks the optimal k based on the Calinski-Harabasz criterion:

In [None]:
modelx <- vegan::cascadeKM(interests_z, 1, 10, iter = 3)

In [None]:
modelx$results

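The Calinski-Harabasz criterion is the ratio of between-cluster to within-cluster variance, each scaled by its degrees of freedom; higher values indicate better separated clusters.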
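
The criterion can be computed directly from a `kmeans` fit. The sketch below uses invented toy data; the helper name `calinski` is hypothetical and is not a function from vegan.

```r
set.seed(1)
# Toy data: two well separated groups (invented for illustration)
x <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
n <- nrow(x)

# Calinski-Harabasz index: (between SS / (k - 1)) / (within SS / (n - k))
calinski <- function(km, n) {
  k <- length(km$size)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

ch <- sapply(2:6, function(k) calinski(kmeans(x, centers = k, nstart = 10), n))
names(ch) <- 2:6
round(ch, 1)
```

The index is undefined for k = 1 because the numerator divides by k - 1.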

A plot method exists for this object, however with 30K cases, the plot takes too much time so it is avoided here. A simpler plot is as such:

In [None]:
p2 <- modelx$results %>%
t %>%
as.data.table %>%
ggplot(aes(x = 1:10, y = calinski)) +
geom_line()

plotly::ggplotly(p2)

The k with max calinski value should be selected:

In [None]:
which.max(modelx$results[2,])

Let's run the model with that:

In [None]:
teen_clusters2 <- kmeans(interests_z, centers = 2)

And see the center values:

In [None]:
centers2 <- teen_clusters2$centers %>% t %>% round(2)
centers2

And emphasize values over and above average:

In [None]:
apply(centers2,
      1,
      function(x)
        {
          zs <- (x - mean(x)) / sd(x);
          cell_spec(x,
                    color = ifelse(abs(zs) > 0.5, "white", "black"),
                            background = ifelse(zs > 0.5, "navy", ifelse(zs < -0.5, "red", "white"))
                   )
        }
    ) %>%
t %>%
magrittr::set_colnames(1:2) %>%      
knitr::kable(escape = F) %>%
kableExtra::kable_styling() %>%
as.character() %>%
IRdisplay::display_html()

Get cluster sizes:

In [None]:
teen_clusters2$size

Here we see that, on the Calinski criterion alone, the data is split so that one cluster holds the users who did not post many interests in their profiles and the other cluster holds everyone else.

This kind of clustering does not provide any insight. The reason for such an outcome is the highly skewed nature of the dataset: for many keywords, at least 90% of users have no mentions at all. So the optimal clustering (based on how variance is split between and within groups) separates zero values from all other values.

The data exploration step is important in these situations: knowing the specifics of the data keeps us from the pitfall of deciding on "numbers" alone.

- With too few clusters, there is no pattern to interpret
- With too many clusters, individual clusters may not yield a distinctive insight to act on

### Optimal k with NbClust

NbClust package provides 30 indexes for determining the optimal number of clusters in a data set and offers the best clustering scheme from different results to the user.

However, running this function on a larger set with too large a dimension (too many variables for distance calculation) consumes too much memory

In [None]:
dim(interests_z)

This issue is also mentioned here:

https://stats.stackexchange.com/questions/270751/nbclust-with-large-data-sets-sampling

So we will select a sample of 700 observations:

In [None]:
samp <- teens_dt[,sample(.N, 700)]
teen_nb <- NbClust::NbClust(interests_z[samp,], min.nc = 2, max.nc = 8, index = "all", method = "kmeans")

By majority rule across the computed indices, 2 clusters are selected.

The model output:

In [None]:
teen_nb

The vote across the criteria can also be tallied manually:

In [None]:
teen_nb$Best.nc[1,] %>% table
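
The majority rule itself is just a frequency count. A minimal sketch with invented votes (these are not NbClust results):

```r
# Hypothetical "best k" proposals from a handful of indices (invented values)
votes <- c(2, 2, 3, 2, 5, 2, 3, 8, 2)

# Count how many indices propose each k
tally <- table(votes)
tally

# The most frequently proposed k wins
best_k <- as.integer(names(tally)[which.max(tally)])
best_k
```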

So NbClust falls into the same pitfall as vegan did: 2 clusters are not meaningful for this dataset but are an outcome of the highly skewed nature (towards 0) of the variables.

### Exclude cases with no interests

The high number of cases with only a few non-zero interest values makes the analysis harder. Let's try to exclude them.

Let's first determine the number of non-zero values in interests for all cases:

In [None]:
nonzeros <- apply(interests, 1, function(x) sum(x > 0))

In [None]:
table(nonzeros) %>% cumsum

It seems that if we keep only those cases that have more than 7 non-zero values, the remaining ~4000 cases carry more information on interests:

In [None]:
include_ind <- which(nonzeros > 7)
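
The row-wise count of non-zero values can also be obtained with `rowSums()`, which is usually faster than `apply()` on large tables. The matrix below is invented and the threshold of 1 is chosen only to suit the toy data.

```r
# Toy word-count matrix: 4 profiles x 3 keywords (invented for illustration)
counts <- rbind(c(0, 0, 0),
                c(1, 0, 2),
                c(0, 3, 0),
                c(2, 1, 1))

# Number of keywords mentioned at least once per profile
nonzeros_toy <- rowSums(counts > 0)
nonzeros_toy

# Profiles mentioning more than one keyword
keep_toy <- which(nonzeros_toy > 1)
keep_toy
```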

Let's exclude the cases with too many zeros and repeat all cleaning steps (since the imputation should be done on data with fewer zero values):

In [None]:
teens_dt <- data.table::fread("../data/csv/11_01_snsdata.csv", stringsAsFactors = T)

In [None]:
teens_dt2 <- teens_dt[include_ind]
dim(teens_dt2)

Let's see the distribution across genders:

In [None]:
teens_dt2[,table(gender)]

The set is highly imbalanced towards female cases. Let's get a sample more balanced across genders (2 F to 1 M):

In [None]:
ind_m <- teens_dt2[, .(.I[gender == "M"])] %>% na.omit %>% .[,V1]
length(ind_m)

set.seed(1)
ind_f <- teens_dt2[, .(.I[gender == "F"])] %>% na.omit %>% .[,V1] %>% sample(length(ind_m) * 2)

In [None]:
inds <- c(ind_m, ind_f)
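
The same 2-to-1 sampling logic can be sketched in base R on a toy vector (invented values). Note that `which()` silently drops the NA comparisons, so no separate `na.omit` step is needed in this form.

```r
set.seed(1)
# Toy gender vector with an imbalance and some missing values (invented)
gender <- c(rep("F", 20), rep("M", 4), NA, NA)

# All male rows, and twice as many randomly chosen female rows
idx_m <- which(gender == "M")
idx_f <- sample(which(gender == "F"), 2 * length(idx_m))

idx <- c(idx_m, idx_f)
tb <- table(gender[idx])
tb
```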

Now let's see whether they are balanced:

In [None]:
teens_dt2[inds,table(gender)]

And keep this smaller sample:

In [None]:
teens_dt2 <- teens_dt2[inds]

Repeat the other steps:

In [None]:
teens_dt2[!dplyr::between(age, 13, 20 - 1e-10), age := NA]

In [None]:
gender_dummy2 <- teens_dt2[,fastDummies::dummy_cols(.(gender = gender),
                            remove_first_dummy = F)] %>% dplyr::select(-gender)
gender_dummy2 %>% dim

In [None]:
teens_dt2 <- teens_dt2 %>%
    append(unclass(gender_dummy2), after = 2) %>% 
    as.data.table

In [None]:
teens_dt2[,age := tidyimpute::impute_mean(.(age)), by = gradyear]

In [None]:
interests2 <- teens_dt2[,basketball:drugs]
interests2

In [None]:
interests_z2 <- interests2[,BBmisc::normalize(.SD)]

In [None]:
interests_z2 %>% sapply(quantile, na.rm = T) %>% t()

### NbClust with clean data set

Let's see the NbClust results:

In [None]:
set.seed(123)
samp <- teens_dt2[,sample(.N, min(.N, 700))]
teen_nb2 <- NbClust::NbClust(interests_z2[samp,], min.nc = 2, max.nc = 8, index = "all", method = "kmeans")

### PCA for fewer dimensions

We have many variables. Reducing the dimensionality may give better results.

First let's run a PCA:

In [None]:
pca <- interests_z2 %>%
    prcomp(center = T,
           scale. = T, # full argument names instead of partial matching
           rank. = 8)

And the summary:

In [None]:
summary(pca)
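
The proportions of variance reported by `summary()` can be reproduced from the `sdev` element of a `prcomp` object. A minimal sketch on invented toy data:

```r
set.seed(1)
# Toy data with two correlated columns (invented for illustration)
base <- rnorm(100)
x <- cbind(a = base + rnorm(100, sd = 0.3),
           b = base + rnorm(100, sd = 0.3),
           c = rnorm(100),
           d = rnorm(100))

p <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of variance per component and its running total
prop_var <- p$sdev^2 / sum(p$sdev^2)
round(rbind(proportion = prop_var, cumulative = cumsum(prop_var)), 3)
```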

Let's visualize the loadings as heatmaps:

In [None]:
pca$rotation %>% pheatmap::pheatmap(cluster_rows = F, cluster_cols = F)

Using a better method with interactivity:

In [None]:
pca$rotation %>% heatmaply::heatmaply(Rowv = F, Colv = F)

Let's see, for each PC, the ten variables with the highest loadings:

In [None]:
pca$rotation %>%
plyr::alply(2, function(x) rank(-x) %>%
                            sort %>%
                            names %>%
                            "["(1:10)) %>%
rlist::list.cbind()

And for each variable, let's find the PC on which it has the highest loading, and group the variables by those PCs:

In [None]:
dt1 <- pca$rotation %>%
apply(1, which.max) %>%
as.data.table(keep.rownames = T)

split(dt1[,rn], dt1[,"."])
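
One caveat, offered as an observation rather than a change to the analysis above: `which.max` on signed loadings ignores strongly negative loadings. The sketch below, using an invented loading matrix, contrasts the signed maximum with the largest absolute loading.

```r
# Toy loading matrix: 3 variables x 2 components (invented values)
rot <- rbind(football = c(PC1 = 0.80, PC2 = 0.10),
             bible    = c(PC1 = 0.05, PC2 = -0.90),
             shopping = c(PC1 = 0.30, PC2 = 0.20))

# Signed maximum versus largest absolute loading per variable
by_signed <- apply(rot, 1, which.max)
by_abs    <- apply(abs(rot), 1, which.max)

rbind(by_signed, by_abs)

# Group the variables by the component with the largest absolute loading
split(rownames(rot), colnames(rot)[by_abs])
```

Here "bible" is assigned to PC1 by the signed rule even though its loading on PC2 is far larger in magnitude.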

Let's get the PC scores:

In [None]:
pcas <- pca$x
colnames(pcas) <- c("fast life", "shopper", "apparel", "hard sports",
                    "partying", "religious", "match", "soft sporting")
dim(pcas)

### Build and train model with smaller dataset and PCs

Let's build a model on the balanced and clean dataset and on PCs.

First let's confirm whether the set is balanced across genders:

In [None]:
teens_factors2 <- teens_dt2 %>% purrr::keep(is.factor) %>% # select factor columns
    tidyr::gather() %>% # convert into long format for faceting
    ggplot(aes(x = value)) + # plot value
    facet_wrap(~ key, scales = "free") + # divide into separate plots by key
    geom_bar()

plotly::ggplotly(teens_factors2)

Now let's see the optimal cluster size from NbClust:

In [None]:
set.seed(1)
samp <- teens_dt2[,sample(.N, min(.N, 700))]
teen_nb2 <- NbClust::NbClust(pcas[samp,], min.nc = 2, max.nc = 8, index = "all", method = "kmeans")

Let's plot the within-group sum of squares manually for k from 1 to 15:

In [None]:
withinss2 <- sapply(1:15,
       function(x) kmeans(pcas, centers = x) %>%
       "["(c("totss", "tot.withinss", "betweenss")) %>% unlist
       ) %>%
t %>%
as.data.table

rownames(withinss2) <- 1:15

In [None]:
p1 <- withinss2 %>%
ggplot(aes(x = withinss2[,.I], y = tot.withinss)) +
geom_line() +
xlab("Number of clusters") +
ylab("Within group sum of squares")

plotly::ggplotly(p1)

In [None]:
modelx <- vegan::cascadeKM(pcas, 1, 10, iter = 3)

In [None]:
plot(modelx)

Although

- NbClust yields 5,
- Manual inspection (WSS) yields 14,
- vegan (Calinski) yields 8

as the optimal k, intuitively such a high number of clusters may not be meaningful: we should be able to gain an insight from each cluster. So the k=8 and k=14 cases are left as exercises.

Now let's try the k=6 case:

In [None]:
set.seed(2345)
kval <- 6
teen_clusters2 <- kmeans(pcas, centers = kval)

In [None]:
teen_clusters2

In [None]:
summary(teen_clusters2)

See the cluster sizes:

In [None]:
teen_clusters2$size

Third cluster is rather a small one

Let's view the center values:

In [None]:
centers2 <- teen_clusters2$centers %>% t %>% round(2)
centers2

And highlight significant centers:

In [None]:
apply(centers2,
      1,
      function(x)
        {
          zs <- (x - mean(x)) / sd(x);
          cell_spec(x,
                    color = ifelse(abs(zs) > 1, "white", "black"),
                            background = ifelse(zs > 1, "navy", ifelse(zs < -1, "red", "white"))
                   )
        }
    ) %>%
t %>%
magrittr::set_colnames(1:kval) %>%      
knitr::kable(escape = F) %>%
kableExtra::kable_styling() %>%
as.character() %>%
IRdisplay::display_html()

Visualize centers as an interactive heatmap after normalizing each row (PC):

In [None]:
centers2 %>% 
apply(1, BBmisc::normalize) %>%
t %>%
heatmaply::heatmaply(Rowv = F, Colv = F)

Now let's normalize centers across rows (PCs) and then for each cluster, return the PCs with highest two positive center values:

In [None]:
namesx2 <- rownames(centers2)

apply(centers2, 1, BBmisc::normalize) %>%
    t %>%
    plyr::alply(2, function(x) namesx2[order(x, decreasing = T)][sort(x, decreasing = T) > 0][1:2])
#    plyr::alply(2, function(x) namesx2[order(x, decreasing = T)])
    #%>%
    #as.data.frame %>%
    #lapply(function(x) namesx[x > 1])

Let's visualize the clusters across axes:

In [None]:
factoextra::fviz_cluster(teen_clusters2, data = pcas, labelsize = 0)

Let's add cluster numbers back to the data:

In [None]:
teens_dt2[,cluster := teen_clusters2$cluster]
teens_dt2

Before naming the clusters intuitively, let's add the PC scores to the dataset.

This converts the matrix columns into list items and appends them to the data.table:

In [None]:
teens_dt2 <- teens_dt2 %>%
    append(pcas %>% plyr::alply(2) %>% purrr::set_names(colnames(pcas))) %>% 
    as.data.table

View the data with clusters and PCs:

In [None]:
selected_cols <- c("cluster", "gender", "age", "friends", colnames(pcas))

In [None]:
teens_dt2[,.SD, .SDcols = selected_cols]

Now let's combine those two PC names for each cluster:

In [None]:
namesx2 <- rownames(centers2)

clusters <- apply(centers2, 1, BBmisc::normalize) %>%
    t %>%
    plyr::alply(2, function(x) namesx2[order(x, decreasing = T)][sort(x, decreasing = T) > 0][1:2] %>%
    paste(collapse = "&")) %>%
    unlist
                
clusters

It is better that we have more intuitive names for those clusters.

In [None]:
clusters2 <- c("athletes", "fast lifers", "shoppers", "matchers", "religious", "partiers")

Now let's aggregate selected column values across clusters:

In [None]:
centers2 %>%
t %>%
magrittr::set_rownames(clusters2)

In [None]:
aggcols <- c("age", "gender_F", "gender_M", "friends", colnames(pcas))

teens_dt2[,c(lapply(.SD, mean), .N) ,.SDcols = aggcols, by = cluster][order(cluster)] %>%
magrittr::set_rownames(clusters2) %>%
round(2)

- Athletes are mostly concerned with hard sports and are balanced across genders. They avoid unhealthy topics
- Fast lifers are mostly concerned with fast life topics and are balanced across genders. They are less popular in terms of friends (likely outcasts)
- Shoppers are mostly concerned with shopping and apparel and are mostly female. They are more popular in terms of friends
- Matchers are mostly concerned with match related topics and are mostly male (a small group)
- Religious are mostly concerned with religion related topics and are mostly male (a small group)
- Partiers are mostly concerned with partying issues (and less concerned with shopping) and are mostly female (largest group of all)