Clustering is an unsupervised learning technique that groups similar samples into clusters. The goal is for the samples within a cluster to be as similar as possible while differing as much as possible from the samples in other clusters.

The k-means algorithm is a basic clustering method. Given the number of clusters (*k*), it selects initial cluster centres (known as centroids), assigns each sample to its closest centroid, then moves each centroid to the mean of the samples assigned to it. The assignment and update steps repeat until the centroids no longer change location or some other stopping criterion (such as a maximum number of iterations) is reached.
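
The assign-then-update loop described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, the random initialisation, and the synthetic two-blob data are all hypothetical and chosen only to show the steps.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=100, tol=1e-6, seed=0):
    """Illustrative k-means loop (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k distinct samples at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every sample to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned samples
        # (a centroid with no samples keeps its previous position)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids have effectively stopped moving
        converged = np.linalg.norm(new_centroids - centroids) < tol
        centroids = new_centroids
        if converged:
            break
    return centroids, labels


# Synthetic data for illustration only: two well-separated blobs
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
demo_centroids, demo_labels = kmeans_sketch(X_demo, k=2)
print(demo_centroids.round(2))
```

Library implementations typically add refinements such as smarter initialisation and multiple restarts, so results from this sketch should not be expected to match `KMeans` exactly.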

> Can the California Housing data be clustered into economic regions based on location and `median_income`?

## Load data

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from itertools import cycle, islice
from pandas.plotting import parallel_coordinates

In [None]:
# Load data
data = pd.read_csv('../input/california-housing-prices/housing.csv')

In [None]:
data.info()

In [None]:
data.head()

In [None]:
data.describe().transpose()

## Scale features

In [None]:
features = ['longitude', 'latitude', 'median_income']
select_df = data[features]
select_df.columns

In [None]:
# Scale the features
X = StandardScaler().fit_transform(select_df)
X[:5]
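
Scaling matters here because k-means relies on distances, so a feature with a larger numeric range would otherwise dominate the clustering. As a rough check of what `StandardScaler` does, the sketch below compares its output with a manual z-score on a small synthetic array (the values and the names `raw`, `scaled` and `manual` are made up for illustration and do not come from the housing data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic values for illustration only (not taken from the housing data)
raw = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 700.0], [6.0, 800.0]])

scaled = StandardScaler().fit_transform(raw)

# Manual z-score per column; StandardScaler uses the population
# standard deviation (ddof=0), which is also NumPy's default
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# The two results should agree, with each column having mean 0 and std 1
print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```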

## Train the model

In [None]:
# The number of clusters must be specified; fixing random_state makes the run reproducible
kmeans = KMeans(n_clusters=6, n_init=10, random_state=42)
model = kmeans.fit(X)
model

In [None]:
centers = model.cluster_centers_
centers[:5]

In [None]:
centers.shape
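
Because the model is fitted on standardised features, the centroids are expressed in standardised units rather than degrees or income units. One possible way to read them in the original units is to keep a reference to the fitted scaler and call its `inverse_transform` method. The sketch below assumes this approach and uses a synthetic stand-in for the three selected columns; the names `demo_df`, `scaler`, `demo_model` and `centers_original` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three selected columns (illustration only)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'longitude': rng.uniform(-124, -114, 300),
    'latitude': rng.uniform(32, 42, 300),
    'median_income': rng.uniform(0.5, 15, 300),
})

# Keep a reference to the fitted scaler so the transform can be undone later
scaler = StandardScaler()
X_demo = scaler.fit_transform(demo_df)

demo_model = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X_demo)

# Map the centroids from standardised units back to the original units
centers_original = pd.DataFrame(
    scaler.inverse_transform(demo_model.cluster_centers_),
    columns=demo_df.columns,
)
print(centers_original.round(2))
```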

In [None]:
# Function that creates a DataFrame with a column for Cluster Number
def pd_centers(featuresUsed, centers):
	colNames = list(featuresUsed)
	colNames.append('prediction')

	# Zip with a column called 'prediction' (index)
	Z = [np.append(A, index) for index, A in enumerate(centers)]

	# Convert to pandas data frame for plotting
	P = pd.DataFrame(Z, columns=colNames)
	P['prediction'] = P['prediction'].astype(int)
	return P

In [None]:
# Function that creates Parallel Plots
def parallel_plot(data):
	# One colour per cluster; six colours are listed so that six clusters do not share a colour
	my_colors = list(islice(cycle(['b', 'r', 'g', 'y', 'k', 'm']), None, len(data)))
	plt.figure(figsize=(15,8)).gca().set_ylim([-3,+3])
	parallel_coordinates(data, 'prediction', color=my_colors, marker='o')

In [None]:
P = pd_centers(features, centers)

In [None]:
# Shows the six clusters and the (scaled) feature values of their centroids
P

In [None]:
# Shows how different each cluster is across all features
parallel_plot(P)

In [None]:
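# Create cluster label, reusing the labels from the model fitted above
# (refitting could renumber the clusters so they no longer match the centroids)
data['econ_region'] = model.labels_
data['econ_region'] = data['econ_region'].astype("category")
data.head()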

In [None]:
sns.set_style('whitegrid')
sns.relplot(x='longitude', y='latitude', hue='econ_region', data=data, kind='scatter');

In [None]:
median_attributes = ['econ_region', 'median_house_value', 'median_income', 'housing_median_age']
income_house = data[median_attributes]
# observed=True restricts the output to categories that actually occur in the data
income_house.groupby('econ_region', observed=True).describe()

## Summary

* The economic regions identified by the k-means algorithm appear to differ somewhat in median house value and median income.