# K-Means Clustering with Sklearn

This notebook shows how to train and evaluate a K-Means clustering model.

* Method: [K-Means Clustering](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
* Dataset: Stock market data

## Imports

In [None]:
import pandas as pd
import numpy as np

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn import metrics

import seaborn as sb
import matplotlib.pyplot as plt
from matplotlib import rcParams

%matplotlib inline
rcParams['figure.figsize'] = 10, 8
sb.set_style('whitegrid')

## Load and Prepare the Data

In [None]:
DATA_FILE = "/Users/robert.dempsey/Dev/daamlobd/data/sample_stocks.csv"

In [None]:
# Import the data
data = pd.read_csv(DATA_FILE)
data.head(5)

In [None]:
# Check the data types
data.dtypes

In [None]:
# Select the two columns of interest: dividend yield (X) and returns (y)
X = data[['dividendyield']]
y = data[['returns']]

## Identify the Number of Clusters to Use

In [None]:
# Define the cluster range
cluster_range = range(1, 20)

# Create a list of KMeans models with differing numbers of clusters
kmeans_models = [KMeans(n_clusters=i) for i in cluster_range]

# Let's take a look
kmeans_models[12]

**Cluster score**

* An internal evaluation criterion
* Defined as the negative of the K-Means objective on the data: the sum of squared distances between each sample and its nearest cluster center (the model's inertia).
* Scores are therefore at most zero, and a score closer to zero means tighter clusters. Because the score keeps improving as clusters are added, look for the point where the gain levels off rather than for the maximum.
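
The relationship between `score` and the K-Means objective can be checked directly. The following is a minimal sketch on synthetic data: the `make_blobs` data and the `demo_` names are illustrative assumptions, not part of this notebook's dataset. It compares `score` with the fitted model's `inertia_` attribute and with the same quantity recomputed by hand.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption: not the stock data used in this notebook)
demo_X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

demo_model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(demo_X)

# score() returns the negative sum of squared distances to the nearest center
demo_score = demo_model.score(demo_X)

# Recompute the objective by hand from the fitted centers and labels
assigned_centers = demo_model.cluster_centers_[demo_model.labels_]
manual_objective = ((demo_X - assigned_centers) ** 2).sum()

print("score:    %.4f" % demo_score)
print("-inertia: %.4f" % -demo_model.inertia_)
print("-manual:  %.4f" % -manual_objective)
```

All three printed values should agree up to floating-point error, which is why the score can never be positive.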

In [None]:
# Fit each model and record its score (the negative K-Means objective)
cluster_scores = [m.fit(y).score(y) for m in kmeans_models]
cluster_scores[12]

In [None]:
# Plot an elbow curve of the scores to find the optimal number of clusters
plt.plot(cluster_range, cluster_scores)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()

**Interpretation**: the curve appears to flatten after 3 clusters, so adding more clusters improves the score only slightly.
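
Reading the elbow off a plot can be backed up with numbers by looking at how much the score improves with each added cluster. The sketch below assumes synthetic data with three well-separated groups (the `make_blobs` call, the chosen centers, and the `demo_` names are hypothetical, for illustration only) and prints the gain for each step.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (assumption: illustration only)
demo_X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

demo_range = range(1, 8)
demo_scores = np.array([
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(demo_X).score(demo_X)
    for k in demo_range
])

# Gain in score from adding one more cluster: gains[i] is the step k -> k + 1
gains = np.diff(demo_scores)

for k, gain in zip(demo_range, gains):
    print("%d -> %d clusters: score improves by %.2f" % (k, k + 1, gain))
```

On data like this, the gain should drop sharply once the true number of groups is reached. The same `np.diff` call can be applied to `cluster_scores` from this notebook.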

In [None]:
# Print the scores around the elbow (cluster_scores[k - 1] holds the score for k clusters)
for k in range(2, 8):
    print("%d clusters: %.2f" % (k, cluster_scores[k - 1]))

## Fit a K-Means Clustering Model

### Train the Model

In [None]:
# Create an instance of the model using the number of clusters we previously found
model = KMeans(n_clusters=3)
model.fit(y)
model

### Plot the Clusters

In [None]:
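# pca_c and pca_d are not defined earlier in the notebook; as an assumption,
# build them as one-component PCA projections of X and y. With a single
# column each, PCA only centers the values (possibly flipping their sign).
pca_c = PCA(n_components=1).fit_transform(X)
pca_d = PCA(n_components=1).fit_transform(y)

# Plot the clusters using the PCA-transformed data
plt.figure('3 Cluster K-Means')
plt.scatter(pca_c[:, 0], pca_d[:, 0], c=model.labels_)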
plt.xlabel('Dividend Yield')
plt.ylabel('Returns')
plt.title('3 Cluster K-Means')
plt.show()

## Model Evaluation

### Silhouette Score

Shows how well defined the clusters are.

Scores
* 1: Best (dense, well-separated clusters)
* 0: Overlapping clusters
* -1: Worst (observations are likely assigned to the wrong cluster)

Details
* Silhouette Coefficient (per observation)
  * a: mean distance between an observation and all other points in its cluster.
  * b: mean distance between an observation and all points in the next nearest cluster.
  * The coefficient is (b - a) / max(a, b).
* Silhouette Score in Sklearn
  * Mean of the silhouette coefficients of all observations
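
To make the definition concrete, the coefficient can be computed by hand and compared with Sklearn's result. This is a sketch under stated assumptions: the data is synthetic (`make_blobs` with hand-picked centers), and `silhouette_of` is a hypothetical helper written for this illustration, not a library function. It also assumes no cluster contains a single point, a case Sklearn handles separately.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples, silhouette_score

# Synthetic data (assumption: illustration only, not the stock data)
demo_X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=1.0,
    random_state=0,
)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(demo_X)
dist = pairwise_distances(demo_X)  # Euclidean by default


def silhouette_of(i):
    """Silhouette coefficient of observation i (hypothetical helper)."""
    own = labels == labels[i]
    own[i] = False  # exclude the observation itself from its own cluster
    a = dist[i, own].mean()
    # Mean distance to each other cluster; the smallest is the nearest cluster
    b = min(
        dist[i, labels == k].mean()
        for k in np.unique(labels)
        if k != labels[i]
    )
    return (b - a) / max(a, b)


manual = np.array([silhouette_of(i) for i in range(len(demo_X))])

print("manual mean:      %.4f" % manual.mean())
print("silhouette_score: %.4f" % silhouette_score(demo_X, labels))
```

The per-observation values correspond to `silhouette_samples`, and their mean corresponds to `silhouette_score`.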

In [None]:
# Score the labels against the same data the model was fit on
metrics.silhouette_score(y, model.labels_, metric='euclidean')