In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Visualisation with PCA

![](../images/countries.png)

This assignment is a continuation of the k-means assignment notebook. The goal is to cluster the countries using socio-economic and health factors that determine the overall development of each country.

In the previous assignment, you immediately applied the k-means algorithm to this problem. In this assignment, we will see how PCA can help us visualise our results.

In [None]:
data = pd.read_csv('../data/country-data.csv')  # read_csv already returns a DataFrame
data.head()

In [None]:
X = data.drop(['country'], axis=1)
y = data['country']

The data was preprocessed.

Before we move on to visualise our clustering results, we will first visualise the data. Without any dimensionality reduction techniques, we would have to choose which features to display on the x- and y-axis. However, with PCA, we can choose to simply reduce our data to two components. 
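
As a rough illustration of the idea, the sketch below scales a small, made-up dataset and reduces it to two components. The array `toy` and its feature scales are hypothetical and only stand in for real data; this is one possible way to combine scikit-learn's `StandardScaler` and `PCA`, not the required solution to the exercise.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 100 samples, 5 features on very different scales.
rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 5.0])

# Standardise each feature to zero mean and unit variance,
# so that no single feature dominates the principal components.
toy_scaled = StandardScaler().fit_transform(toy)

# Project the scaled data onto the first two principal components.
toy_2d = PCA(n_components=2).fit_transform(toy_scaled)

print(toy_2d.shape)
```

The two columns of `toy_2d` can then be used directly as the x- and y-axis of a scatter plot.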


### Exercise  1
Implement PCA on the data for visualisation purposes. Don't forget to scale your data! 

In [None]:
# Your code here. 

For visualisation purposes, we're going to use a package called _Altair_. It provides us with tooltip functionality that lets us explore the data a little better. In this piece of code, we assume the output of your PCA transformation on the data is called `pca_data`. 

In [None]:
import altair as alt

# Combine PCA results with original dataframe
pca_df = pd.DataFrame({'pc1': pca_data[:, 0], 'pc2': pca_data[:, 1]})
df = data.merge(pca_df, left_index=True, right_index=True)
df.head()

In [None]:
# Create chart.
alt.Chart(df).mark_circle().encode(
    x = 'pc1', 
    y = 'pc2', 
    tooltip=data.columns.tolist(), 
    color='country'
)

This visualises your data on the principal components.

What we are now going to do is recreate the clustering from the previous k-means assignment. This will give us labels for each of the countries. 

### Exercise 2
Recreate your k-means model from the previous assignment here. 

In [None]:
# Your code here.

Now let's recreate the chart again!

In [None]:
# Combine PCA results with original dataframe
pca_df = pd.DataFrame({'pc1': pca_data[:, 0], 'pc2': pca_data[:, 1], 'labels': best_model.labels_})
df = data.merge(pca_df, left_index=True, right_index=True)

First we recreate the dataframe, which now not only contains the original data and principal components, but also the labels. 

Next up, we recreate the chart. The color is now no longer determined by the column `country`, but by the column `labels`. The `:N` is added to encode that the data is nominal: a discrete, unordered category. You can remove it to see how that changes your visualisation. 

In [None]:
# Create chart.
alt.Chart(df).mark_circle().encode(
    x = 'pc1', 
    y = 'pc2', 
    tooltip=data.columns.tolist(), 
    color='labels:N'
)

### Exercise 3
Inspect the chart! Does the clustering look good to you? 

In the previous exercise, there were originally 5 clusters with very few data points. Can you find the data points that originally belonged to the small clusters in the chart? 

# Clustering with PCA 

PCA is not only a helpful tool for visualising your clusters. 

A downside of the k-means algorithm is that it suffers from the "curse of dimensionality". In high-dimensional spaces, Euclidean distances tend to become inflated, which makes nearby and distant points harder to tell apart. In such cases, running a dimensionality reduction algorithm such as principal component analysis prior to k-means clustering can alleviate this problem and speed up computations. 
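
The effect can be made visible with a small experiment. The sketch below draws uniformly random points (an assumption chosen purely for illustration) and measures the distances from one point to all others as the number of dimensions grows. The helper `distance_stats` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_stats(n_dims, n_points=200):
    """Return (mean distance, relative contrast) from one random point to all others."""
    points = rng.uniform(size=(n_points, n_dims))
    # Euclidean distances from the first point to every other point.
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest.
    contrast = (dists.max() - dists.min()) / dists.min()
    return dists.mean(), contrast

results = {d: distance_stats(d) for d in (2, 10, 100, 1000)}
for d, (mean_dist, contrast) in results.items():
    print(f"{d:>4} dims: mean distance {mean_dist:.2f}, relative contrast {contrast:.2f}")
```

The mean distance should grow with the number of dimensions while the relative contrast shrinks, so "near" and "far" neighbours look increasingly alike to a distance-based algorithm such as k-means.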

![](../images/wine.png)



In [None]:
from sklearn.datasets import load_wine

wines = load_wine()
wine_df = pd.DataFrame(wines.data, columns=wines.feature_names)
wine_df.head()

Our wine dataset has 13 distinct features - hardly what we would call 'high-dimensional', but good enough for our purposes. For clustering with k-means, this dataset might benefit from a reduction in dimensionality. 

### Exercise 4
Scale your data and perform PCA to visualise the current dataset. 

In [None]:
# Your code here. 

### Exercise 5
Perform PCA with the number of components equal to the number of features, then use the explained variance of each component to choose the right number of components. 
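
One common way to make this choice is to look at the cumulative explained variance. The sketch below does so on hypothetical toy data (`toy`, built from two latent factors); the 95% threshold is an assumption for illustration, not a rule prescribed by this assignment.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 6 features driven by 2 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
toy = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))

# Fit PCA with as many components as there are features.
toy_scaled = StandardScaler().fit_transform(toy)
full_pca = PCA(n_components=toy_scaled.shape[1]).fit(toy_scaled)

# Cumulative share of variance explained by the first 1, 2, ... components.
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components that reaches the assumed 95% threshold.
n_keep = int(np.argmax(cumulative >= 0.95)) + 1
print(cumulative.round(3), n_keep)
```

Plotting `cumulative` against the component index gives the same information visually and often makes the cut-off easier to judge.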

In [None]:
# Your code here. 

### Exercise 6
Use the Elbow method to determine the right number of clusters for this dataset. 
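
As a reminder of how the Elbow method works, the sketch below fits k-means for a range of `k` values and records the inertia of each model. It uses synthetic blobs from `make_blobs` rather than the wine data, and the range of 1 to 8 clusters is an arbitrary assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical toy data with three well-separated groups.
toy, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# Fit one model per candidate k and store its inertia
# (the within-cluster sum of squared distances).
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Plotting the inertia against `k` typically shows a sharp drop followed by a flattening; the "elbow" where the curve bends is a candidate for the number of clusters.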

In [None]:
# Your code here. 

### Exercise 7
Find the best model with K-means.

In [None]:
# Your code here.

### Exercise 8
Plot the labels on the principal components. 

In [None]:
# Your code here. 

### Exercise 9
Determine the following for the model you just created.
- Inertia
- Silhouette score
- Calinski-Harabasz score [[documentation]](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html)
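
For reference, the sketch below shows where each of these three values comes from in scikit-learn. It runs on synthetic blobs (`toy` is a hypothetical stand-in, not the wine data), so the numbers themselves carry no meaning for this assignment.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Hypothetical toy data and a fitted k-means model.
toy, _ = make_blobs(n_samples=300, centers=3, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy)

# Inertia is stored on the fitted model: lower means tighter clusters.
inertia = model.inertia_

# Silhouette score lies in [-1, 1]: higher means better separated clusters.
silhouette = silhouette_score(toy, model.labels_)

# Calinski-Harabasz score: higher means denser, better separated clusters.
ch_score = calinski_harabasz_score(toy, model.labels_)

print(inertia, silhouette, ch_score)
```

Both metric functions take the data the model was fitted on as their first argument and the cluster labels as their second.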

In [None]:
# Inertia

In [None]:
# Silhouette score

In [None]:
# Calinski-Harabasz score 

### Bonus Exercise 

Create a k-means model without PCA. Compare the inertia, silhouette score and Calinski-Harabasz score with those of the PCA-based model. Are the results what you expect? 

In [None]:
# Your code here.