# Geodemographic Analysis with PySAL and scikit-learn

Here, we'll examine geodemographic clustering in Los Angeles County.

In [None]:
%load_ext watermark

In [None]:
%watermark -v -a "author: eli knaap" -d -u -p segregation,libpysal,geopandas

In [None]:
import geopandas as gpd
from libpysal import weights
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering, KMeans
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import geoviews as gv
import hvplot.pandas
import seaborn as sns
import matplotlib.pyplot as plt
gv.extension('matplotlib', 'bokeh')
gv.output(backend='bokeh')

## Data Prep

In [None]:
scag = gpd.read_file("data/scag_region.gpkg", layer="tracts")

In [None]:
scag = scag.fillna(0)

In [None]:
scag.plot()

In [None]:
la = scag[scag.geoid.str[:5]=='06037']

In [None]:
wq = weights.Queen.from_dataframe(la)

In [None]:
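# keep only the tracts in connected component 0 (assumed here to be the mainland),
# dropping disconnected tracts such as islands; .copy() avoids pandas'
# SettingWithCopyWarning when cluster labels are assigned as new columns later
la = la.iloc[wq.component_labels==0].copy()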

In [None]:
la.plot()

## Geodemographic Clusters

[Geodemographic analysis](https://en.wikipedia.org/wiki/Geodemographic_segmentation) applies unsupervised learning to demographic and socioeconomic data, then examines the resulting clusters spatially.

In [None]:
columns = ['median_household_income', 'median_home_value', 'p_asian_persons', 'p_hispanic_persons', 'p_nonhisp_black_persons', 'p_nonhisp_white_persons']

In [None]:
scaler = StandardScaler()

In [None]:
la_kmeans = KMeans(n_clusters=6).fit(scaler.fit_transform(la[columns]))

In [None]:
la_kmeans.labels_

In [None]:
la['kmeans'] = la_kmeans.labels_

In [None]:
la.hvplot(c='kmeans', cmap='tab10', line_width=0.1, alpha=0.7,  geo=True, tiles='CartoLight',  xaxis=False, yaxis=False, height=500, colorbar=False)

There are some obvious spatial patterns (which we might expect, given the results of our prior esda and segregation analyses). But what do these clusters mean? What kinds of demographic features do they represent?

In [None]:
la.groupby('kmeans')[columns].mean()

This table is a lot to interpret at once, so a visualization would be handy. Violin plots are a nice way of examining how each of the input variables is distributed within each of the resulting clusters.

In [None]:
sns.set_style('whitegrid')
fig, ax = plt.subplots(3,2, figsize=(16,8))
ax=ax.flatten()
for i, col in enumerate(columns):
    sns.violinplot(data=la, y=col, x=la.kmeans, ax=ax[i])
    ax[i].set_title(col.replace("_", " ").title())
plt.tight_layout()


We can also use a statistic to tell us how well this model fits the data. To do so, we can use scikit-learn's [silhouette score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html), which its documentation describes as follows:

> The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that Silhouette Coefficient is only defined if number of labels is 2 <= n_labels <= n_samples - 1.

> This function returns the mean Silhouette Coefficient over all samples. To obtain the values for each sample, use silhouette_samples.

> The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
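To make the definition above concrete, here is a minimal sketch that computes (b - a) / max(a, b) by hand and compares it with scikit-learn's `silhouette_samples`. It uses synthetic blobs rather than the LA tracts, and `manual_silhouette` is a hypothetical helper written only for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples, silhouette_score

# Synthetic stand-in data (not the LA tracts)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)


def manual_silhouette(X, labels, i):
    """Silhouette coefficient for sample i, computed from the definition."""
    d = pairwise_distances(X[i:i + 1], X).ravel()
    own = labels == labels[i]
    if own.sum() == 1:
        # scikit-learn assigns 0 to samples in single-member clusters
        return 0.0
    # a: mean distance to the *other* members of the sample's own cluster
    # (the distance to itself is 0, so it does not affect the sum)
    a = d[own].sum() / (own.sum() - 1)
    # b: smallest mean distance to the members of any other cluster
    b = min(d[labels == k].mean() for k in np.unique(labels) if k != labels[i])
    return (b - a) / max(a, b)


manual = np.array([manual_silhouette(X, labels, i) for i in range(len(X))])
per_sample = silhouette_samples(X, labels)

print(np.allclose(manual, per_sample, atol=1e-6))
print(manual.mean(), silhouette_score(X, labels))
```

The mean of the per-sample values is what `silhouette_score` reports, which is why a single number can hide clusters that fit poorly.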

In [None]:
from sklearn.metrics import silhouette_score

In [None]:
silhouette_score(scaler.fit_transform(la[columns]), la_kmeans.labels_)

What about other clustering algorithms or other numbers for *k*? Might we get a better model fit?
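One way to explore other values of *k* is to fit a model for each candidate and compare silhouette scores. The sketch below does this on synthetic, standardized data (a stand-in for `la[columns]`); the range of 2 to 10 and the fixed `random_state` are assumptions made for reproducibility, not choices taken from this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six tract-level variables
X_raw, _ = make_blobs(n_samples=300, centers=4, n_features=6, random_state=42)
X = StandardScaler().fit_transform(X_raw)

# Fit one model per candidate k and record its mean silhouette score
scores = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = silhouette_score(X, model.labels_)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: {s:.3f}")
print(f"best k by silhouette: {best_k}")
```

With real tract data, replacing `X` with `scaler.fit_transform(la[columns])` should give the analogous comparison, though the highest-scoring *k* is not always the most interpretable one.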

In [None]:
la_affprop = AffinityPropagation(damping=0.8, preference=-1000,).fit(scaler.fit_transform(la[columns]))

In [None]:
la_affprop.labels_

In [None]:
import pandas
pandas.Series(la_affprop.labels_).unique()

In [None]:
silhouette_score(scaler.fit_transform(la[columns]), la_affprop.labels_)

In [None]:
la['affprop'] = la_affprop.labels_

This will create a linked holoviews plot so we can zoom in on both maps together (**click the "wheel zoom" button on the bokeh plot so you can zoom in**).

In [None]:
la.hvplot(c='affprop', cmap='tab10', line_width=0.1, alpha=0.7,  geo=True, tiles='CartoLight',  xaxis=False, yaxis=False,  colorbar=False, title='Affinity Prop') + \
la.hvplot(c='kmeans', cmap='tab10', line_width=0.1, alpha=0.7,  geo=True, tiles='CartoLight',  xaxis=False, yaxis=False, colorbar=False, title='K-Means')

In this run, the silhouette score suggests that the affinity propagation clusterer provided a better solution. Nonetheless, the two maps show similar spatial patterns.

## Spatially-Constrained Geodemographics (Regionalization)

Above, we noticed some obvious spatial patterns in the neighborhood clusters. They arise from the underlying spatial autocorrelation in the race and class indicators we used to develop the clusters. Instead of allowing this autocorrelation to "fall out" of the results, we can leverage it to create spatially-contiguous clusters.

`scikit-learn`'s agglomerative clustering algorithm accepts a connectivity constraint, which we can supply from the sparse representation of a pysal `W` object (`w.sparse`). Let's compare solutions with and without the constraint.
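To see what the constraint does in isolation, the following sketch clusters random values on a 10 x 10 lattice with and without a connectivity matrix. It uses scikit-learn's `grid_to_graph` as a stand-in for `w.sparse`, and `n_pieces` is a hypothetical helper that counts how many disconnected pieces each cluster is split into.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph

rng = np.random.default_rng(0)
side = 10
# Random attributes for 100 lattice cells (stand-in for scaled tract variables)
X = rng.normal(size=(side * side, 3))
# Sparse adjacency of the lattice (stand-in for `w.sparse`)
connectivity = grid_to_graph(side, side)


def n_pieces(labels, connectivity):
    """Count the connected pieces that make up each cluster."""
    adj = connectivity.tocsr()
    pieces = {}
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        sub = adj[idx][:, idx]  # adjacency restricted to this cluster
        pieces[int(k)] = int(connected_components(sub, directed=False)[0])
    return pieces


free = AgglomerativeClustering(n_clusters=5, linkage="ward").fit(X)
constrained = AgglomerativeClustering(
    n_clusters=5, linkage="ward", connectivity=connectivity
).fit(X)

print("unconstrained:", n_pieces(free.labels_, connectivity))
print("constrained:  ", n_pieces(constrained.labels_, connectivity))
```

Because the constrained algorithm only merges clusters that are adjacent in the graph, each constrained cluster forms a single connected piece, whereas unconstrained clusters are free to scatter across the lattice.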

In [None]:
w = weights.Queen.from_dataframe(la)

In [None]:
la_ward = AgglomerativeClustering(n_clusters=8, linkage='ward').fit(scaler.fit_transform(la[columns]))

In [None]:
la['ward'] = la_ward.labels_

In [None]:
la.groupby('ward')[columns].median()

In [None]:
sns.set_style('white')

In [None]:
la.plot('ward', categorical=True)

In [None]:
la_ward_spatial = AgglomerativeClustering(n_clusters=8, linkage='ward', connectivity=w.sparse).fit(scaler.fit_transform(la[columns]))

In [None]:
la['ward_spatial'] = la_ward_spatial.labels_

In [None]:
la.hvplot(c='ward', cmap='tab10', line_width=0.1, alpha=0.7,  geo=True, tiles='CartoLight',  xaxis=False, yaxis=False, frame_height=450, colorbar=False) + \
la.hvplot(c='ward_spatial', cmap='tab10', line_width=0.1, alpha=0.7,  geo=True, tiles='CartoLight',  xaxis=False, yaxis=False, frame_height=450, colorbar=False)

In [None]:
silhouette_score(scaler.fit_transform(la[columns]), la_ward.labels_)

In [None]:
silhouette_score(scaler.fit_transform(la[columns]), la_ward_spatial.labels_)

Why is the silhouette score higher for the first solution?

## Exercise

1. Create two geodemographic typologies for Orange County using the same race and class variables as above
    - for the first, use 5 clusters
    - for the second, use 8 clusters
    - which solution is better?

2. Create a geodemographic typology for Riverside County using Affinity Propagation with `damping=0.8` and `preference=-100`
    - How many unique clusters do you find?
    - What is the average home price for tracts in Cluster 3?

3. What would happen if you created a spatially-constrained geodemographic typology using **DistanceBand**  spatial weights?

In [None]:
# %load solutions/05.py
##### 1)

# create orange county data
oc = scag[scag.geoid.str[:5] == '06059']

# create cluster models where k==5,8
oc5 = KMeans(n_clusters=5).fit(scaler.fit_transform(oc[columns]))
oc8 = KMeans(n_clusters=8).fit(scaler.fit_transform(oc[columns]))

# calculate silhouette coefs and print them
sil5 = silhouette_score(scaler.fit_transform(oc[columns]), oc5.labels_)
sil8 = silhouette_score(scaler.fit_transform(oc[columns]), oc8.labels_)

print(f'5-cluster solution: {sil5}')
print(f'8-cluster solution: {sil8}')


##### 2)

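# copy the subset so adding a column does not trigger SettingWithCopyWarning
rside = scag[scag.geoid.str[:5] == '06065'].copy()

rside['affprop'] = AffinityPropagation(damping=0.8, preference=-100,).fit(scaler.fit_transform(rside[columns])).labels_
print(f'There are {len(rside.affprop.unique())} unique clusters in Riverside')

# select the column before averaging so non-numeric columns (e.g. geometry) are skipped
cluster3_value = rside.groupby('affprop')['median_home_value'].mean().loc[3]
print(f"The average home price in cluster 3 is ${int(cluster3_value)}")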

##### 3)

print("With distance band weights, the solution will be spatially-influenced but the clusters are not guaranteed to be contiguous")

