# Old Faithful Geyser

Here we apply *k*-means clustering to the Old Faithful geyser data. The data set is available [here](http://www.stat.cmu.edu/~larry/all-of-statistics/=data/faithful.dat); the cells below read it from a local CSV file, `data/old-faithful-geyser.csv`.

![Old Faithful Geyser](images/wyoming-old-faithful.jpg "Old Faithful Geyser")
<div style="text-align: center;">
Credit: http://www.destination360.com/north-america/us/wyoming/yellowstone-national-park/old-faithful
</div>

In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
geyser = pd.read_csv('data/old-faithful-geyser.csv')

In [None]:
geyser.head()

Plot the data to get a first look at its structure.

In [None]:
geyser.plot.scatter(x='eruptions', y='waiting')

In [None]:
import seaborn as sns

In [None]:
sns.regplot(data=geyser, x='eruptions', y='waiting', fit_reg=False)

From the plot, the data fall into two main groups: short eruptions followed by short waits, and long eruptions followed by long waits. Therefore, we will use `k = 2` for our *k*-means model.

In [None]:
# import
from sklearn import cluster

# instantiate
k = 2
kmeans = cluster.KMeans(n_clusters=k)

# fit
kmeans.fit(geyser)
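
The choice of `k = 2` above was made by eye. One rough way to sanity-check it is to compare the model's inertia (the sum of squared distances from each point to its nearest centroid) for several values of `k` and look for the point where adding clusters stops helping much. The sketch below is illustrative only: it uses synthetic two-blob data as a stand-in for the geyser measurements, and the names `X`, `inertias`, and the blob parameters are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn import cluster

# Synthetic stand-in for the geyser data (NOT the real measurements):
# two separated blobs in (eruptions, waiting)-like units.
rng = np.random.default_rng(0)
short = rng.normal(loc=[2.0, 55.0], scale=[0.3, 6.0], size=(100, 2))
long_ = rng.normal(loc=[4.3, 80.0], scale=[0.4, 6.0], size=(150, 2))
X = np.vstack([short, long_])

# Fit one model per candidate k and record its inertia.
inertias = {}
for n in range(1, 6):
    model = cluster.KMeans(n_clusters=n, n_init=10, random_state=0).fit(X)
    inertias[n] = model.inertia_

for n, value in inertias.items():
    print(n, round(value, 1))
```

On data with two clear groups, the drop in inertia from `k = 1` to `k = 2` should be much larger than any later drop. To try this on the real data, `X` could be replaced with the `eruptions` and `waiting` columns of `geyser`.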

After fitting the model, we can get the centroid of each cluster as follows:

In [None]:
centroids = kmeans.cluster_centers_
print(centroids)
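
Each centroid is the mean of the points assigned to its cluster. To make that concrete, here is a minimal NumPy sketch of the assign-then-average loop (Lloyd's algorithm) that underlies *k*-means. It is not scikit-learn's implementation, which adds smarter initialization and multiple restarts; the function name `lloyd_kmeans` and the toy data are hypothetical.

```python
import numpy as np

def lloyd_kmeans(X, init_centroids, n_iter=100):
    """Plain Lloyd iterations: assign points to the nearest centroid, then average."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Squared distance from every point to every centroid, shape (n_points, k)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Toy data (not the geyser data): two tight groups of three points each
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]])
toy_centroids, toy_labels = lloyd_kmeans(X, init_centroids=X[[0, 3]])
print(toy_centroids)
print(toy_labels)
```

Here the initial centroids are simply the first point of each group, so the loop converges after one update.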

From the *k*-means model we just fitted, we can get the cluster label assigned to each data point.

In [None]:
# predict
labels = kmeans.predict(geyser)
print(labels)
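
Two details about labels are worth knowing. A fitted `KMeans` model also stores the training labels in its `labels_` attribute, and `predict` can assign new observations to the nearest existing centroid. The sketch below shows both on toy data; the arrays `X` and `new_points` are made up for illustration and are not geyser measurements.

```python
import numpy as np
from sklearn import cluster

# Toy data (not the geyser data): two tight groups
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]])
model = cluster.KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Labels stored during fitting match predict() on the training data
train_labels = model.labels_
same = np.array_equal(train_labels, model.predict(X))
print(same)

# New observations are assigned to the nearest centroid
new_points = np.array([[1.5, 1.5], [8.5, 8.5]])
new_labels = model.predict(new_points)
print(new_labels)
```

The label numbers themselves (0 or 1) are arbitrary and can swap between runs, so only the grouping is meaningful.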

We can now attach the labels to the data frame and use them to visualize the clusters.

In [None]:
geyser['labels'] = labels

In [None]:
geyser.head()

In [None]:
geyser[geyser['labels'] == 0]

Plot with Pandas

In [None]:
f, ax = plt.subplots(1, 1, figsize=(8, 6))
color = ['blue', 'green']
for each in range(k):
    # points assigned to this cluster
    selected_data = geyser[geyser['labels'] == each]
    selected_data.plot.scatter(x='eruptions', y='waiting', ax=ax, color=color[each])
    # mark the cluster centroid with a black cross
    lines = ax.plot(centroids[each, 0], centroids[each, 1], 'kx')
    plt.setp(lines, markersize=15.0, markeredgewidth=2.0)

Plot with Seaborn

In [None]:
f, ax = plt.subplots(1, 1, figsize=(8, 6))
for each in range(k):
    # points assigned to this cluster, drawn on the shared axes
    selected_data = geyser[geyser['labels'] == each]
    sns.regplot(data=selected_data, x='eruptions', y='waiting', fit_reg=False, ax=ax)
    # mark the cluster centroid with a black cross
    lines = ax.plot(centroids[each, 0], centroids[each, 1], 'kx')
    plt.setp(lines, markersize=15.0, markeredgewidth=2.0)