## California Housing

As spatial features, California Housing's 'Latitude' and 'Longitude' make natural candidates for k-means clustering. 
In this example we'll cluster these with 'MedInc' (median income) to create economic segments in different regions of California.

Since k-means clustering is sensitive to scale,
it can be a good idea to rescale or normalize data with extreme values.
Our features are already roughly on the same scale, so we'll leave them as-is.
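
When features do sit on very different scales, rescaling before clustering might look like the sketch below. It is a minimal illustration, not part of this example's pipeline: the toy columns `small` and `large` are made up, and `StandardScaler` is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical toy data: one feature in single digits, one in the hundreds of thousands
toy = pd.DataFrame({
    "small": rng.normal(5.0, 2.0, size=200),
    "large": rng.normal(200_000.0, 50_000.0, size=200),
})

# Standardize each column to zero mean and unit variance so that
# neither feature dominates the Euclidean distances k-means relies on
scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

# Cluster on the standardized features
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
```

Without the scaling step, distances in this toy frame would be driven almost entirely by `large`.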

In [None]:
! conda install -c conda-forge xgboost

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

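
The cells that follow use a dataframe `df` holding the full dataset and a dataframe `X` holding the features to cluster, neither of which is defined in this section. The sketch below shows one way `X` could be built, assuming `df` has the columns named in the introduction; the helper name `select_cluster_features` is hypothetical.

```python
import pandas as pd

# The three features this example clusters on
CLUSTER_FEATURES = ["MedInc", "Latitude", "Longitude"]


def select_cluster_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return an independent copy of the clustering features.

    Copying lets us add a "Cluster" column later without modifying `df`.
    """
    return df.loc[:, CLUSTER_FEATURES].copy()


# Possible usage (an assumption, not necessarily how the original data was loaded).
# scikit-learn's copy of the dataset is downloaded on first use:
# from sklearn.datasets import fetch_california_housing
# df = fetch_california_housing(as_frame=True).frame
# X = select_cluster_features(df)
```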

In [None]:
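# Create cluster feature.
# X is assumed to hold only the "MedInc", "Latitude", and "Longitude" columns.
# n_init and random_state are set explicitly so the labels are reproducible.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0)
X["Cluster"] = kmeans.fit_predict(X)
# The integer labels are arbitrary IDs, not an ordering, so store them as a categorical
X["Cluster"] = X["Cluster"].astype("category")

X.head()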

Now let's look at a couple of plots to see how effective this was.
First, a scatter plot that shows the geographic distribution of the clusters.
It seems like the algorithm has created separate segments for higher-income areas on the coasts.

In [None]:
sns.relplot(
    x="Longitude", y="Latitude", hue="Cluster", data=X, height=6,
);
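
To check the visual impression numerically, one option is to summarize each cluster by its average feature values. The helper below is a hypothetical sketch (the name `cluster_profile` and the `n_rows` column are not from the original notebook); it assumes the frame has a `Cluster` column and a `MedInc` column, as `X` does above.

```python
import pandas as pd


def cluster_profile(frame: pd.DataFrame, by: str = "Cluster",
                    sort_by: str = "MedInc") -> pd.DataFrame:
    """Mean of each numeric feature per cluster, plus the cluster size."""
    # observed=True skips categories that have no rows
    grouped = frame.groupby(by, observed=True)
    profile = grouped.mean(numeric_only=True)
    profile["n_rows"] = grouped.size()
    # Highest-income clusters first
    return profile.sort_values(sort_by, ascending=False)


# Possible usage with the frame built above:
# cluster_profile(X)
```

Clusters near the top of the result with coastal mean coordinates would support the reading of the scatter plot.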

The target in this dataset is MedHouseVal (median house value).
These boxen plots show the distribution of the target within each cluster.
If the clustering is informative, 
these distributions should, for the most part, separate across MedHouseVal, which is indeed what we see.

In [None]:
X["MedHouseVal"] = df["MedHouseVal"]
sns.catplot(x="MedHouseVal", y="Cluster", data=X, kind="boxen", height=6);