## JupyterHub: R notebook

In this notebook we will demonstrate a use case for R in Jupyter. We will perform a simple k-means clustering using the USArrests dataset. 

### Reading and checking data

First, we load the necessary packages. Note that, unlike the IPython kernel, the IRkernel does not provide a method to pass bash commands to the terminal. This also means that you cannot install any extra packages from here. If any packages are missing, install them before launching JupyterHub.
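
A quick way to see up front whether any of the packages loaded below are missing is to test for them with `requireNamespace()`. This is only a minimal sketch; the variable names `needed`, `installed` and `missing_pkgs` are illustrative and not part of the original notebook.

```r
# Packages this notebook loads further down
needed <- c("ggplot2", "stats", "factoextra")

# requireNamespace() returns FALSE (instead of raising an error)
# when a package is not installed
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- needed[!installed]
if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are available.")
}
```

If the message lists a package, install it outside the notebook first, as described above.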

In [None]:
library("ggplot2")
library("stats")
library("factoextra")

As an example, we use a locally stored copy of the dataset (csv file):

In [None]:
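# Assumes the notebook was started in the parent directory of "R_notebook_ex"
setwd(file.path(getwd(), "R_notebook_ex"))
# Assumes a comma-separated file with a header row and the state names in the first column
USArrests <- read.csv("USArrests.csv", row.names = 1)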
head(USArrests, n = 10)
print(paste("The dataset contains", nrow(USArrests), "states"))
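
If the local csv file is not at hand, a comparable file can be produced from the copy of `USArrests` that ships with R's `datasets` package. The sketch below writes that copy to a temporary file and reads it back; the temporary path and the names `csv_path` and `arrests_demo` are assumptions made for illustration, and the layout of the original csv file may differ.

```r
# Assumption: a temporary file stands in for "USArrests.csv"
csv_path <- tempfile(fileext = ".csv")

# write.csv() stores the row names (the states) as the first column
write.csv(datasets::USArrests, csv_path)

# row.names = 1 turns that first column back into row names
arrests_demo <- read.csv(csv_path, row.names = 1)

head(arrests_demo, n = 3)
all.equal(arrests_demo, datasets::USArrests)
```

Keeping the state names as row names matters later on, because the plots use `row.names()` for the axis labels and `scale()` expects numeric columns only.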

Visualizing data works the same way as in a regular R session. For example, we can plot the number of murder arrests per state. With the default plot size, the graph becomes quite unreadable because of the many states.

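To adjust the size of the graph, we use the 'repr' package. This package is installed together with the IRkernel package and is used to create readable text and images in tools like Jupyter Notebook. The first cell below draws the plot at the default size; the second cell sets the width and height through the `repr.plot.width` and `repr.plot.height` options. More info can be found here: https://github.com/IRkernel/repr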

In [None]:
ggplot(data=USArrests, aes(x=row.names(USArrests), y=Murder)) +
  geom_bar(stat="identity")+
  ggtitle("Murder arrests per state")+
  ylab("Murder arrests")+
  xlab("State")+
  coord_flip()

In [None]:
options(repr.plot.width=15, repr.plot.height=8)

ggplot(data=USArrests, aes(x=row.names(USArrests), y=Murder)) +
  geom_bar(stat="identity")+
  ggtitle("Murder arrests per state")+
  ylab("Murder arrests")+
  xlab("State")+
  coord_flip()
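
Because `options()` changes a session-wide setting, the new plot size applies to every later plot in the notebook. If that is not wanted, the previous values can be saved and restored. A minimal sketch using only base R; `old_opts` is an illustrative name.

```r
# options() returns the previous values, so they can be restored later
old_opts <- options(repr.plot.width = 15, repr.plot.height = 8)

getOption("repr.plot.width")   # 15

# ... plots drawn here use the larger size ...

# Restore whatever was set before
options(old_opts)
```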


If we take a closer look at the values, we see that the variables are on very different scales, so we standardize them before clustering.

In [None]:
scaled_df <- scale(USArrests)
head(scaled_df, n = 10)
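
By default, `scale()` centers each column to mean 0 and divides it by the column's standard deviation. The short check below illustrates this. As an assumption, it uses the copy of `USArrests` from R's `datasets` package rather than the local csv file, and the names `check_df` and `manual` are only illustrative.

```r
# Assumption: use the built-in dataset so the check runs without the csv file
check_df <- scale(datasets::USArrests)

# After scaling, every column has mean 0 and standard deviation 1 (up to rounding)
round(colMeans(check_df), 10)
apply(check_df, 2, sd)

# The same result computed by hand for one column
manual <- (datasets::USArrests$Murder - mean(datasets::USArrests$Murder)) /
  sd(datasets::USArrests$Murder)
all.equal(unname(check_df[, "Murder"]), manual)
```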

### Computing the clusters

First, let's set a seed so that we can reproduce our results. We use the kmeans function of the stats package, with the number of clusters set to 4. The argument `nstart=25` makes kmeans try 25 random sets of starting centers and keep the best result. We can then take a look at the output of the kmeans function:

In [None]:
set.seed(41)
km_clusters <- kmeans(scaled_df, 4, nstart=25)
print(km_clusters)
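
Besides the printed summary, the object returned by `kmeans()` is a list whose components can be accessed directly. A minimal sketch, again assuming the built-in `datasets::USArrests` in place of the local file; `km_demo` is an illustrative name.

```r
set.seed(41)
km_demo <- kmeans(scale(datasets::USArrests), centers = 4, nstart = 25)

km_demo$size           # number of states in each cluster
km_demo$centers        # cluster centers, in scaled units
head(km_demo$cluster)  # cluster assignment per state
km_demo$tot.withinss   # total within-cluster sum of squares

# Average of the original (unscaled) variables per cluster
aggregate(datasets::USArrests, by = list(cluster = km_demo$cluster), FUN = mean)
```

The per-cluster averages of the unscaled variables are often easier to interpret than the scaled centers.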

We can visualize the clusters easily using the "factoextra" package. As you can see, the graph size that was previously set is kept here as well.

In [None]:
fviz_cluster(km_clusters, data=scaled_df)