# HW 4: Unsupervised Learning, K-Means Clustering
### CS 4824 / ECE 4484, Spring '21

Code inspired by notebooks for the [Credit Card Dataset for Clustering](https://www.kaggle.com/arjunbhasin2013/ccdata) on Kaggle.

---

In this assignment, you're tasked with...
1. Implementing the K-Means clustering algorithm in `custom_kmeans.py`.
2. Choosing the best value of $K$ for this dataset.
3. Interpreting the demographics within each of your $K$ clusters.
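
As a refresher for task 1, here is a minimal sketch of a single K-Means pass (assign every point to its nearest centroid, then move each centroid to the mean of its points). It assumes NumPy arrays and Euclidean distance; the function name `kmeans_step` and the toy data are hypothetical, and this is not the interface that `custom_kmeans.py` is expected to expose.

```python
import numpy as np

def kmeans_step(X, centroids):
    """One assignment + update pass of K-Means on an (n, d) array."""
    # Squared Euclidean distance from every point to every centroid: shape (n, k)
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)  # index of the nearest centroid per point

    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        members = X[labels == j]
        if len(members) > 0:  # keep the old centroid if a cluster is empty
            new_centroids[j] = members.mean(axis=0)
    return labels, new_centroids

# Toy data (illustrative only): two well-separated groups in 2-D
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

labels, centroids = kmeans_step(X, centroids)
print(labels)     # [0 0 1 1]
print(centroids)  # centroids move to the mean of each group
```

A full implementation would repeat this step until the labels (or centroids) stop changing, and would also need a strategy for picking the initial centroids.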

In [None]:
###### standard imports ######
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')

###### special from sklearn ######
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

In [None]:
###### Import our data and check out its dimensions ######
data = pd.read_csv("creditcards.csv")
print(f"{data.shape[0]} rows, {data.shape[1]} columns")

In [None]:
###### Check out the dataset ######
data.head(10)

## 1. Test and time your solution!

Run your solution here to see how well it performs. Toggle the comments below to see the results from scikit-learn's `KMeans` implementation; they give you a reference for how your algorithm should perform.

In [None]:
import time
###### Import and run your solution! ######
from custom_kmeans import CustomKMeans

K = 5
tic = time.perf_counter()

# ===== Toggle the comment below to see sklearn's implementation =====
# fit() returns the fitted model, not the labels; read them from model.labels_
model = CustomKMeans(K).fit(data, True)  # True turns on by-timestep graphing
# model = KMeans(K).fit(data)
# ====================================================================

toc = time.perf_counter()

print(f"Clustered {data.shape[0]} datapoints into {K} clusters in {toc - tic:0.4f} seconds")

## 2. Choose the best K!

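Use the elbow method to choose the $K$ that best balances a small number of clusters against a low cost. The cost plotted below is each model's `inertia_`, the sum of squared distances from every point to its nearest cluster center. Again, toggle scikit-learn's implementation for reference.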
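
If the elbow is hard to pick out by eye, one common heuristic is to look for the $K$ where the cost curve bends most sharply, i.e. where the second difference of the cost is largest. The sketch below assumes that heuristic; the helper name `elbow_by_second_difference` is hypothetical and the cost values are invented purely for illustration. Treat its answer as a sanity check on the plot, not a replacement for it.

```python
def elbow_by_second_difference(ks, costs):
    """Return the k at which the cost curve bends most sharply."""
    best_k, best_bend = None, float("-inf")
    # The endpoints have no neighbor on one side, so skip them
    for i in range(1, len(costs) - 1):
        # Discrete second difference: large when the drop in cost slows abruptly
        bend = costs[i - 1] - 2 * costs[i] + costs[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Invented costs for illustration only (not results from the real dataset)
ks = [2, 3, 4, 5, 6, 7]
costs = [1000.0, 600.0, 300.0, 250.0, 220.0, 200.0]

print(elbow_by_second_difference(ks, costs))  # 4
```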

In [None]:
###### For choosing best K ######
sum_of_distances = []
max_k = 20
for k in range(2, max_k):
    # ===== Toggle the comment below to see sklearn's implementation =====
    kmean = CustomKMeans(k).fit(data)
    # kmean = KMeans(k).fit(data)
    # ====================================================================
    sum_of_distances.append(kmean.inertia_)

###### Plot the cost vs number of clusters ######
fig = plt.figure(figsize=(9,6))
plt.plot(range(2, max_k), sum_of_distances, '--x')
plt.title("Cost vs # Clusters")
plt.xlabel("# Clusters")
plt.ylabel('Cost')
plt.show()

## 3. Interpret your groups!

Now that you've chosen the best $K$, cluster with that value. Use the seaborn `FacetGrid` histograms to help interpret the meaning of each cluster.
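
Alongside the histograms, a table of per-cluster feature means can make differences between groups easier to name. The following is a small sketch on a hypothetical toy frame; the column names `balance` and `purchases` and all values are made up for illustration and are not taken from `creditcards.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for a labeled dataset
toy = pd.DataFrame({
    "balance":   [100.0, 200.0, 5000.0, 7000.0],
    "purchases": [10.0, 30.0, 900.0, 1100.0],
    "cluster":   [0, 0, 1, 1],
})

# One row per cluster, one column per feature, each cell a mean
cluster_means = toy.groupby("cluster").mean()
print(cluster_means)
```

The same `groupby("cluster").mean()` call should work on any frame that carries a `cluster` column next to numeric features.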

In [None]:
best_k = 1  # TODO: replace this placeholder with the K you chose in part 2
labels = CustomKMeans(best_k).fit(data).labels_

# ================ Uncomment for accuracy reference ================ 
# labels = KMeans(best_k).fit(data).labels_
# ==================================================================

pd.DataFrame(labels).to_csv('labels.csv', index=False) # Saves to local file for plot_3d.py

In [None]:
###### Generate by-cluster feature breakdowns to aid in interpretation ######
labeled_data = data.copy()
labeled_data['cluster'] = labels

for c in labeled_data:
    if c == 'cluster':
        continue
    grid = sns.FacetGrid(labeled_data, col='cluster')
    grid.map(plt.hist, c)

## Replace cluster labels

Now that you've seen the feature breakdowns, describe and explain each cluster below. 2-3 sentences should be sufficient.

1. **Foo**: ...
2. **Bar**: ...
3. ...

Now replace the dummy strings in the `interpretations` dict below with your cluster names.

In [None]:
######  extract top two principal components ######
data_pca = pd.DataFrame(PCA(2).fit_transform(data))
data_pca.columns = ['PC1', 'PC2']
data_pca['cluster'] = labels

###### Interpret the meanings of your K clusters ######
interpretations = {
    0: "foo",
    1: "bar",
    2: "fizz",
    3: 'buzz',
#   ...
}
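# Swap numeric labels for names in one pass. Assigning the result back avoids
# a chained in-place replace, which newer pandas versions warn about.
data_pca['cluster'] = data_pca['cluster'].replace(interpretations)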

### Inspect your clusters!

See the divisions between your clusters, as projected along the first two principal components, below. Some questions you should be asking yourself:
- Do the intersections and overlaps between the groups make sense?
- Are there distinct boundaries between clusters?
- Do the outliers' labels make sense?
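
When judging overlaps, it helps to know how much of the data's variance the two plotted components actually capture: clusters that look merged in 2-D may be well separated along dimensions the plot discards. Below is a rough sketch of the projection using NumPy's SVD on centered data. The helper `project_top2` and the random toy data are hypothetical, and component signs may differ from what scikit-learn's `PCA` returns.

```python
import numpy as np

def project_top2(X):
    """Project rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of Vt are the principal directions, ordered by singular value
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()  # share of variance per component
    return Xc @ Vt[:2].T, explained[:2]

# Random toy data (illustrative only): 50 points, 4 features of unequal spread
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) * np.array([5.0, 2.0, 0.5, 0.1])

proj, explained = project_top2(X)
print(proj.shape)       # (50, 2)
print(explained.sum())  # fraction of total variance kept by PC1 + PC2
```

If that fraction is low for your data, treat apparent overlaps in the scatter plot with some caution.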

In [None]:
###### graph the data with seaborn ######
sns.set_style("whitegrid")
fig = plt.figure(figsize=(12,12))

sns.scatterplot(data=data_pca, x='PC1', y='PC2', hue='cluster', palette='deep')

###### label and display! ######
plt.title("Clusters on first two principal components")
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()