If you are using Google Colab, run the cell below to mount your Google Drive in your Colab instance. Adjust the path if your files are stored in a different folder of your Google Drive.

If you are not using Google Colab, the cell simply does nothing, so you can run it safely.

In [None]:
try:
    from google.colab import drive
    drive.mount('/content/drive/')
    %cd 'drive/My Drive/Colab Notebooks/02_Clustering'
except ImportError:
    # not running on Google Colab, so there is nothing to mount
    pass

## Exercise 8: Cluster Analysis

### 8.1. Analyzing the Customer Data Set

#### 8.1.1 Load the customers dataset from the Excel file provided in ILIAS.
Load the Excel file into a dataframe and inspect the first few records.
Remember to import the pandas package first! Then, call the ```read_excel()``` function to load the file.

In [None]:
# import pandas


# load the file using the read_excel() function


# show the first few records


#### 8.1.2 Cluster the dataset using K-Means clustering. 

8.1.2.1. Experiment with different K values. Which values make sense?

8.1.2.2. What does the clustering tell you concerning your product portfolio? 

8.1.2.3. What does the clustering tell you concerning your marketing efforts in different regions?

#####  8.1.2.1 Cluster the dataset using K-Means clustering. Experiment with different K values. 

The dataset contains five attributes: a customer ID, a zip code, and a product, as well as the number of items bought and the number of items returned.
For our analysis, we must first think about the meaning of these attributes and how we should use them.

The customer ID identifies individual customers but otherwise carries no meaning, so we exclude it.
The questions ask for insights about our product portfolio and our performance in different regions, so we use zip code and product to interpret the results.
That leaves the two attributes items bought and items returned, which contain the factual data about our business.
It therefore seems reasonable to cluster on these two attributes (using only two attributes also lets us plot everything in this exercise).

Before using the selected attributes, we normalise their values into the same range to make sure that each attribute has the same importance when calculating the distance between the records.

To solve the task, we do the following:
- Cluster on the attributes ```ItemsBought``` and ```ItemsReturned```
- Visualize the clustering in a scatter plot, using ```ItemsBought``` and ```ItemsReturned``` for the x- and y-axes and the cluster id for the color of the data points.
- Repeat the clustering for different K values
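
As a rough illustration of these steps, the sketch below normalises two made-up columns and runs K-Means for a few K values. The toy values and the variable names (`toy`, `toy_scaled`, `inertias`, `labels`) are assumptions for illustration and are not taken from the customer file; `MinMaxScaler` is one possible normaliser, not necessarily the one intended by the exercise.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# made-up records for illustration only (not the ILIAS data)
toy = pd.DataFrame({
    "ItemsBought":   [2, 3, 4, 20, 22, 25, 40, 42, 45],
    "ItemsReturned": [1, 2, 1, 3, 2, 4, 30, 28, 33],
})
features = ["ItemsBought", "ItemsReturned"]

# rescale both attributes to the range [0, 1] so neither dominates the distance
scaler = MinMaxScaler()
toy_scaled = pd.DataFrame(scaler.fit_transform(toy[features]),
                          columns=features, index=toy.index)

# cluster with several K values and keep the within-cluster sum of squares
inertias = {}
labels = {}
for k in range(1, 5):
    clusterer = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels[k] = clusterer.fit_predict(toy_scaled)
    inertias[k] = clusterer.inertia_
    print(k, round(inertias[k], 3), labels[k])
```

Note that `inertia_` generally shrinks as K grows, so it is only one hint next to the scatter plots when judging which K makes sense.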

In [None]:
# import KMeans


# import matplotlib


# import preprocessing


# create the normaliser


# copy the dataframe before preprocessing so we can access the original values later


# preprocess the features ItemsBought and ItemsReturned


# setup a figure
plt.figure(figsize=(10,10))

# iterate over all values that we want to test for K
for i in range(1,7):
    # create the clusterer

    
    # create the clustering


    # add a subplot
    plt.subplot(3,2,i)
    plt.tight_layout()
    
    # setup the labels of the subplot
    plt.title("#clusters (K) = {}".format(i))
    plt.xlabel('ItemsBought')
    plt.ylabel('ItemsReturned')
    
    # create the scatter plot


# show the figure
plt.show()

Which K value(s) make sense and how would you label the resulting clusters?

Answer:

##### 8.1.2.2 What does the clustering tell you concerning your product portfolio?

Run the clustering again with ```K=3```. Add the product ids to the plot using the ```annotate()``` function and interpret the results.
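
A minimal sketch of how ```annotate()``` can place a text label next to each point is shown here. The data, the column name `Product`, and the hard-coded `cluster_ids` are made up for illustration; in the exercise the ids come from your K-Means run.

```python
import matplotlib.pyplot as plt
import pandas as pd

# made-up records; the column name "Product" is an assumption
toy = pd.DataFrame({
    "Product":       [101, 102, 103, 104],
    "ItemsBought":   [5, 20, 40, 42],
    "ItemsReturned": [1, 3, 30, 28],
})
cluster_ids = [0, 1, 2, 2]   # stand-in for the labels returned by K-Means

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(toy["ItemsBought"], toy["ItemsReturned"], c=cluster_ids)

# write the product id next to each point, shifted by 5 pixels so it does not cover the marker
annotations = []
for _, row in toy.iterrows():
    annotations.append(
        ax.annotate(str(row["Product"]),
                    (row["ItemsBought"], row["ItemsReturned"]),
                    xytext=(5, 5), textcoords="offset points"))

ax.set_xlabel("ItemsBought")
ax.set_ylabel("ItemsReturned")
```

When running outside a notebook, call ```plt.show()``` afterwards to display the figure.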

In [None]:
# create the clusterer for K = 3


# create the clustering


# create a scatter plot


# annotate each data point with its product id

    
# setup the labels of the plot
plt.xlabel('ItemsBought')
plt.ylabel('ItemsReturned')

# show the plot
plt.show()

Answer:

##### 8.1.2.3 What does the clustering tell you concerning your marketing efforts in different regions?

To answer this question, we simply plot again, but this time annotate the clusters with the zip code of the respective customers.
Note that we can use the original dataset instead of the preprocessed one for plotting and still use the clusters created on the preprocessed data, as both datasets have the same ordering of records.
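
The point about the shared row order can be sketched as follows: cluster ids computed on the normalised copy are attached to the original rows by position. The records and the column name `ZipCode` are assumptions made for this illustration, and the cross table is just one possible way to summarise the result besides the plot.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# made-up records; "ZipCode" is an assumed column name
original = pd.DataFrame({
    "ZipCode":       [68159, 68159, 68161, 68161, 68163, 68163],
    "ItemsBought":   [2, 3, 20, 22, 40, 42],
    "ItemsReturned": [1, 2, 3, 2, 30, 28],
})
features = ["ItemsBought", "ItemsReturned"]

# cluster on the normalised values ...
scaled = MinMaxScaler().fit_transform(original[features])
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)

# ... and attach the ids to the original rows: row i of `scaled` is row i of `original`
labelled = original.assign(Cluster=cluster_ids)

# count how many customers of each zip code fall into each cluster
overview = pd.crosstab(labelled["ZipCode"], labelled["Cluster"])
print(overview)
```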

In [None]:
# create the scatter plot


# annotate each data point with the zip code value

    
# setup the plot labels
plt.xlabel('ItemsBought')
plt.ylabel('ItemsReturned')

# show the plot
plt.show()

Answer: 

#### 8.1.3 Cluster the data set using Agglomerative Hierarchical Clustering. What does the dendrogram tell you concerning your customer groups?

To plot a dendrogram, we need to use the ```linkage()``` function from scipy instead of the clusterer from scikit-learn.
After creating the clustering with the ```linkage()``` function, we can plot it using the ```dendrogram()``` function.
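
The following sketch shows the two calls on a handful of made-up, already normalised points. The point values, the `customer_ids` labels, and the choice of `method="complete"` are assumptions for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# made-up, already normalised 2-D points (not the customer data)
points = np.array([[0.0, 0.0], [0.1, 0.05], [0.5, 0.1],
                   [0.55, 0.05], [0.95, 1.0], [1.0, 0.9]])
customer_ids = [1, 2, 3, 4, 5, 6]   # hypothetical ids used as leaf labels

# each row of Z describes one merge: the two merged clusters,
# the distance between them, and the size of the new cluster
Z = linkage(points, method="complete", metric="euclidean")

fig, ax = plt.subplots(figsize=(6, 4))
tree = dendrogram(Z, labels=customer_ids, ax=ax)
ax.set_xlabel("Customer IDs")
ax.set_ylabel("distance")
```

The height at which two branches join is the distance stored in the third column of `Z`, so large vertical gaps indicate groups that are far apart.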

In [None]:
# import linkage and dendrogram from scipy


# create the clustering


# plot the dendrogram


# setup the labels
plt.xlabel('Customer IDs')
plt.ylabel('distance')

# show the plot
plt.show()

Judging by the dendrogram, the customers in the "bad" group (IDs 8, 9 and 14) differ far more from all other customers than the "good" and "average" groups differ from each other (compare the merge distances on the y-axis).

#### 8.1.4 Flatten the hierarchical clustering so that you get 3 or 4 customer groups. Name these groups with appropriate labels.

To create a partitional clustering from a hierarchical clustering, we have to cut the hierarchy.
In the plot, you can show the cut hierarchy using the ```truncate_mode``` parameter of the ```dendrogram()``` function.
To create cluster ids as in KMeans, use the [```fcluster()``` function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html).
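
A small sketch of both ideas, under the assumption of made-up points and complete linkage: `fcluster()` with `criterion="maxclust"` returns flat cluster ids, and `dendrogram()` with `truncate_mode="lastp"` condenses the tree to the last `p` merged clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# made-up, already normalised 2-D points (not the customer data)
points = np.array([[0.0, 0.0], [0.1, 0.05], [0.5, 0.1],
                   [0.55, 0.05], [0.95, 1.0], [1.0, 0.9]])
Z = linkage(points, method="complete")

# cut the hierarchy so that at most 3 flat clusters remain (ids start at 1)
cluster_ids = fcluster(Z, t=3, criterion="maxclust")
print(cluster_ids)

# condensed view of the same cut: keep only the last 3 merged clusters;
# no_plot=True skips drawing and only returns the plot description
tree = dendrogram(Z, truncate_mode="lastp", p=3, no_plot=True)
print(tree["ivl"])
```

The array returned by `fcluster()` has one entry per record, so it can be passed to the `c` argument of a scatter plot in the same way as the K-Means labels.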

In [None]:
# import fcluster


# setup a figure
plt.figure(figsize=(10,10))

# iterate over the different numbers of clusters that we want to consider (here: 3 and 4)
counter = 1
for i in [3,4]:
    # add a sub plot
    plt.subplot(2,2,counter)
    counter += 1
    
    # setup the layout of the plot
    plt.tight_layout()
    plt.title('Dendrogram - {} clusters'.format(i))
    plt.xlabel('Count of Customers')
    plt.ylabel('distance')
    
    # plot the dendrogram
    
    
    # add a second sub plot
    plt.subplot(2,2,counter)
    counter += 1
    
    # create the clusters by cutting the hierarchy
    
    
    # create a scatter plot coloured according to the clusters
    
    
    # setup the plot labels
    plt.xlabel('ItemsBought')
    plt.ylabel('ItemsReturned')
    
# show the figure
plt.show()

Answer:

### 8.2. Analyzing the Students Data Set

#### 8.2.1. Aggregate the students data set by student and calculate the average mark and the average number of attended classes
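
As a hedged illustration of the aggregation step, the sketch below groups a few made-up records by student and takes the mean of each numeric column. The column names `Name`, `Mark`, and `AttendedClasses` are assumptions and may differ from those in the provided file.

```python
import pandas as pd

# made-up records; the column names are assumptions
records = pd.DataFrame({
    "Name":            ["Anna", "Anna", "Ben", "Ben", "Ben"],
    "Mark":            [1.3, 1.7, 3.0, 4.0, 3.5],
    "AttendedClasses": [12, 10, 4, 6, 5],
})

# one row per student with the mean of the selected numeric columns
per_student = records.groupby("Name")[["Mark", "AttendedClasses"]].mean()
print(per_student)
```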



In [None]:
# load the excel file into a dataframe


# show the first few records


# group the dataframe by student name and calculate the mean values


# show the first few records


#### 8.2.2 Cluster the data set using the K-Means algorithm. Does one attribute dominate the clustering? What can you do about this? Assign suitable labels to your clusters.

Run a KMeans clusterer on the data and plot the result in a scatter plot. It's a good idea to annotate the data points with the names of the students.

In [None]:
# create the clusterer


# create the clustering


# create the scatter plot


# setup the labels
plt.xlabel("Attended classes")
plt.ylabel("Mark")

# annotate each data point with the name of the student


# show the figure
plt.show()

Answer:

#### 8.2.3. Cluster the data set using Agglomerative Hierarchical Clustering. Experiment with different settings for calculating the cluster similarity. What is a good setting?

We first define which settings we want to test (the different linkage modes) and then iterate over these values in a for loop.
Inside the loop, we create the clustering with the respective settings and plot the dendrogram.
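
To get a feeling for how the linkage mode changes the merge distances, the sketch below runs the same loop on five made-up points and prints the distance of every merge. The points and the dictionary `final_heights` are assumptions for illustration; no plot is drawn here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# made-up points: two close pairs and one distant point
points = np.array([[0.0, 0.0], [0.0, 1.0],
                   [4.0, 0.0], [4.0, 1.0],
                   [10.0, 0.5]])

final_heights = {}
for mode in ["single", "average", "complete"]:
    Z = linkage(points, method=mode)
    # column 2 of the linkage matrix holds the distance at which each merge happens
    final_heights[mode] = Z[-1, 2]
    print(mode, np.round(Z[:, 2], 3))
```

On this toy data the last merge happens at the smallest distance for single linkage and at the largest for complete linkage, which is why sharing the y-axis across the subplots makes the dendrograms easier to compare.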

In [None]:
# define the different linkage modes that we want to test
modes = ['single', 'average', 'complete']

# create a figure
plt.figure(figsize=(20,5))
y_axis = None

# iterate over all linkage modes
for i, mode in enumerate(modes):
    
    # add a subplot (one column per linkage mode)
    y_axis = plt.subplot(1, len(modes), i + 1, sharey=y_axis)
    
    # setup the labels
    plt.title('Dendrogram - linkage mode: {}'.format(mode))
    plt.xlabel('ID of student')
    plt.ylabel('distance')
    
    # create the clustering
    
    
    # plot the dendrogram

    
# show the plots
plt.show()

#### 8.2.4. What does the dendrogram tell you about the distances between the different groups of students?

Answer:

### 8.3. Clustering the Iris Data Set
#### 8.3. Cluster the Iris data set using different algorithms and parameter settings. Does it make sense to normalise the data before applying the algorithms? Try to choose an algorithm and parameter setting that reproduces the original division into the three different species.
Load the dataset as seen in the last exercise.

In [None]:
# load the file into a dataframe


# show the first few records


#### Does it make sense to normalise the data before applying the algorithms?
Have a look at basic statistics of the dataset to check if you should apply normalisation.

In [None]:
# calculate statistics for the iris dataset


Answer:

Then we create clusterings with KMeans, Agglomerative Clustering and DBSCAN using different parameter settings. We compare the results using plots.

As we know the correct assignment from the dataset, we can calculate the overlap between clusters and the types of flowers.
For this calculation, we add the cluster ids to the dataframe (using the ```join()``` function) and then group by the name of the flower and the cluster id.
Using the ```size()``` function, we get the number of records in each of these groups, which corresponds to the overlap of the cluster with the respective flower type.

In [None]:
# import Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering

# import DBSCAN
from sklearn.cluster import DBSCAN

# show frequency of each type of flower
display(iris.groupby('Name').size())

# plot the correct assignment
plt.figure(figsize=(5,5))

# create one series per type of flower
for name, group in iris.groupby('Name'):
    plt.scatter(group['PetalLength'], group['PetalWidth'], label=name)
plt.xlabel("Petal length")
plt.ylabel("Petal width")
plt.title("Ground Truth")
plt.legend()
plt.show()


# ***************************
# KMeans
# ***************************
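# fix n_init and random_state so that repeated runs give the same clustering
estimator = KMeans(n_clusters=3, n_init=10, random_state=42)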
estimator.fit(iris[['PetalLength', 'PetalWidth']])

# show the frequency of each type of flower in every cluster
display(iris.join(pd.Series(estimator.labels_, name="KMeans")).groupby(['Name', 'KMeans']).size())

# plot the clusters
plt.figure(figsize=(5,5))
plt.title("KMeans")
plt.xlabel('Petal length')
plt.ylabel('Petal width')
plt.scatter(iris['PetalLength'], iris['PetalWidth'], c=estimator.labels_)
plt.show()

# ***************************
# DBSCAN
# ***************************


# ***************************
# Agglomerative
# ***************************


Answer: 

### 8.4. Clustering the Geo Data Set

#### 8.4.1. The geo data set (provided in ILIAS) contains the coordinates (x & y) of housings in a certain area. Have a look at the data and visualize it with a scatter plot, using the ```area``` feature as colour.

In [None]:
# read the dataset into a dataframe


# show the first few lines


In [None]:
# create a scatter plot


#### 8.4.2. Cluster the data using k-Means (k=3). Do the clusters represent the original areas?
We cluster the dataset and plot again, this time using the cluster ids as colour.

In [None]:
# create the clustering


# plot again


Answer:

#### 8.4.3. Apply DBSCAN and play around with the epsilon. Can you reproduce the original areas using this cluster algorithm?

We run DBSCAN in its default configuration first.
Then, we systematically test different parameter settings for ```min_samples``` and ```eps```.
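
A systematic test of the two parameters might look like the sketch below, which counts clusters and noise points for each combination. The points and the tested values are made up for illustration; suitable ranges for the geo data set depend on the scale of its coordinates.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# made-up points: two tight groups and one isolated point
points = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5],
                   [10.0, 10.0], [10.5, 10.0], [10.0, 10.5], [10.5, 10.5],
                   [5.0, 20.0]])

summary = {}
for eps in [0.1, 1.0, 30.0]:
    for min_samples in [2, 3]:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
        n_clusters = len(set(labels) - {-1})   # DBSCAN marks noise with -1
        n_noise = int((labels == -1).sum())
        summary[(eps, min_samples)] = (n_clusters, n_noise)
        print(eps, min_samples, n_clusters, n_noise)
```

On this toy data a very small ```eps``` turns every point into noise, while a very large one merges everything into a single cluster; the interesting settings lie in between.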

In [None]:
# show the result of running DBSCAN with default configuration


# test different parameter settings and plot the results


Answer:

### 8.5. Clustering the Zoo Data Set
#### 8.5.1. The Zoo data set describes 101 animals using 18 different attributes. The data set is provided in ILIAS as an ARFF file. Load this dataset.

In [None]:
# import arff
from scipy.io import arff

# load the file and create a dataframe
zoo_arff_data, zoo_arff_meta = arff.loadarff('zoo.arff')
zoo_data = pd.DataFrame(zoo_arff_data)

# solve the encoding issue in the data
columns_with_binary_strings = zoo_data.select_dtypes('object').columns.values
zoo_data[columns_with_binary_strings] = zoo_data[columns_with_binary_strings].apply(lambda x: x.str.decode("utf-8"))

zoo_data.head()

#### 8.5.2. Cluster the data set using Agglomerative Hierarchical Clustering. Experiment with different parameter settings in order to generate a nice species tree.

We first have to encode the non-numerical features. Note that the ```type``` feature already contains a classification of the species, so we exclude it and use it to see if our results make sense.
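
One possible way to encode string-valued attributes is sketched below with scikit-learn's `OrdinalEncoder`. The animals and attribute names are made up and only modelled on a zoo-like table, and the choice of encoder is an assumption; the exercise may intend a different one.

```python
import pandas as pd
from sklearn import preprocessing

# made-up animals; the attribute names are assumptions
animals = pd.DataFrame({
    "animal":   ["bear", "dove", "carp"],
    "hair":     ["true", "false", "false"],
    "feathers": ["false", "true", "false"],
    "type":     ["mammal", "bird", "fish"],
})

# keep the known classification ("type") out of the features
selected = ["hair", "feathers"]

# OrdinalEncoder maps the sorted categories of each column to 0, 1, ...
encoder = preprocessing.OrdinalEncoder()
encoded = pd.DataFrame(encoder.fit_transform(animals[selected]),
                       columns=selected, index=animals["animal"])
print(encoded)
```

For binary attributes this yields 0/1 values that work well with distance calculations. For nominal attributes with more than two values, an ordinal code implies an order that may not exist, so a one-hot encoding could be the safer choice.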

In [None]:
# import preprocessing from sklearn


# specify which attributes you want to use


# create the encoder


# encode the selected attributes


# show the result


Then we can create a clustering and look at the dendrogram.

In [None]:
# create the clustering

# plot the dendrogram
