# Clustering model selection matrix

This notebook compares the performance of several clustering models in order to select the best one for delineating spatial signatures.

## Dimensions

Dimensions of models to be tested.

### Algorithms

- K-Means
- K-Medoid
- SOM
    - different architectures
        - grid dimensions
        - parameter selection
- GMM

### Data normalisation

- MinMax stretch
- Standardise
- RobustScaler?
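
As a rough illustration of how the three candidate transformations differ, the sketch below applies each of them to a small made-up column containing one outlier. The toy values and the column name are assumptions for illustration only; the robust variant follows the median and interquartile-range definition that scikit-learn's `RobustScaler` uses by default.

```python
import pandas as pd

# Toy data (hypothetical): one feature with a single large outlier.
df = pd.DataFrame({"area": [10.0, 12.0, 11.0, 13.0, 200.0]})

# MinMax stretch: rescale to the [0, 1] range.
min_max = (df - df.min()) / (df.max() - df.min())

# Standardise: zero mean, unit (sample) standard deviation.
standardized = (df - df.mean()) / df.std()

# Robust scaling: centre on the median, scale by the interquartile range.
iqr = df.quantile(0.75) - df.quantile(0.25)
robust = (df - df.median()) / iqr

summary = pd.concat(
    [min_max["area"], standardized["area"], robust["area"]],
    axis=1,
    keys=["min_max", "standardized", "robust"],
)
print(summary.round(3))
```

In the MinMax and standardised columns the outlier compresses the remaining values into a narrow band, whereas the robust variant keeps them spread out, which is the usual argument for considering it.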

### Dimensionality reduction

- PCA
- t-SNE?
- K-means with k>200?
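
For the PCA option, a minimal sketch of what the reduction step might look like is shown below. The synthetic data and the 95% variance threshold are assumptions chosen for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): 10 columns generated from 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(500, 10))

# Keep as many components as are needed to explain 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(X)

print(reduced.shape, pca.explained_variance_ratio_.sum())
```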

### Number of clusters

- n -> m

### Input data

- Form
- Function
- Form & Function

## Comparison

### Quantitative data

- Mean sampled silhouette score
- Calinski-Harabasz
- Davies-Bouldin
- BIC
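
The metrics listed above can all be computed with scikit-learn. The following is a minimal sketch on synthetic blobs (the data, the cluster centres and the sample size are assumptions, not values from this notebook). Note the direction of each score: silhouette and Calinski-Harabasz are better when higher, Davies-Bouldin and BIC are better when lower, and BIC is only available for probabilistic models such as GMM.

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic, well-separated blobs stand in for the real data (assumption).
X, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.5,
    random_state=0,
)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

scores = {
    # Mean silhouette on a random subsample; higher is better, range [-1, 1].
    "silhouette": metrics.silhouette_score(X, labels, sample_size=300, random_state=0),
    # Calinski-Harabasz: higher is better.
    "calinski": metrics.calinski_harabasz_score(X, labels),
    # Davies-Bouldin: lower is better, minimum 0.
    "davies": metrics.davies_bouldin_score(X, labels),
}

# BIC is defined for the fitted mixture model; lower is better.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
scores["bic"] = gmm.bic(X)

print(scores)
```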

### Qualitative data

- label frequencies
- cross tabulation
    - postcode classification
    - modum
    - worldpop
- N-S, E-W distribution of cluster centers
    - weighted by area?
- Signatures
    - polygon areas
    - distances between signatures of the same kind
    - number of polygons/components (how many times we see the signature type)
- maps for a few cities
    - Liverpool
    - Glasgow
    - London

- clustergram for every algorithm

## First phase

- Each algorithm on a few similar 

## Data normalisation

Since we'll be reusing normalised/standardised data repeatedly, we do the transformation once and store results in chunked parquet files.

In [1]:
import dask.dataframe

In [2]:
form = dask.dataframe.read_parquet("../../urbangrammar_samba/spatial_signatures/morphometrics/convolutions/conv_*.pq")

In [4]:
standardized = (form - form.mean()) / form.std()

In [5]:
%%time
standardized.to_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/form/standardized/")

CPU times: user 3min 43s, sys: 1min 14s, total: 4min 57s
Wall time: 3min 47s


In [6]:
min_max = (form - form.min()) / (form.max() - form.min())

In [7]:
%%time
min_max.to_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/form/normalized/")

CPU times: user 3min 1s, sys: 1min 7s, total: 4min 8s
Wall time: 3min 48s


### Harmonize chunks

Some chunks are missing columns as certain land use types are not present. We need to harmonize our chunks to have the same columns in each of them.

In [24]:
import pandas as pd
import pyarrow.parquet as pq

# collect the union of column names across all chunks
columns = set()
for i in range(103):
    schema = pq.read_schema(f"../../urbangrammar_samba/spatial_signatures/functional/functional/func_{i}.pq")
    for c in schema.names:
        columns.add(c)

# add the missing columns, filled with zeros, and overwrite each chunk in place
for i in range(103):
    df = pd.read_parquet(f"../../urbangrammar_samba/spatial_signatures/functional/functional/func_{i}.pq")
    missing = [c for c in columns if c not in df.columns]
    df[missing] = 0
    df.to_parquet(f"../../urbangrammar_samba/spatial_signatures/functional/functional/func_{i}.pq")
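
The same harmonisation can be sketched on small in-memory frames. The example below is an alternative formulation using `DataFrame.reindex`, not the code used above, and the column names are hypothetical.

```python
import pandas as pd

# Two toy chunks (hypothetical) where each lacks a column the other has.
chunks = [
    pd.DataFrame({"residential": [0.2, 0.5], "industrial": [0.1, 0.0]}),
    pd.DataFrame({"residential": [0.9], "parks": [0.3]}),
]

# Union of all column names, sorted so that every chunk gets the same order.
columns = sorted(set().union(*(chunk.columns for chunk in chunks)))

# reindex adds the missing columns filled with 0 and aligns the column order.
harmonized = [chunk.reindex(columns=columns, fill_value=0) for chunk in chunks]

for chunk in harmonized:
    print(chunk)
```

Compared with the loop above, which only guarantees the same set of columns, `reindex` also gives every chunk the same column order.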

In [25]:
%%time
function = dask.dataframe.read_parquet("../../urbangrammar_samba/spatial_signatures/functional/functional/func_*.pq")
standardized = (function - function.mean()) / function.std()
min_max = (function - function.min()) / (function.max() - function.min())
standardized.to_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/function/standardized/")
min_max.to_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/function/normalized/")

CPU times: user 5min 35s, sys: 1min 39s, total: 7min 15s
Wall time: 3min 3s


Ensure that each observation has `hindex`.

In [27]:
stand_fn = dask.dataframe.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/function/standardized/")

In [29]:
standardized_form = dask.dataframe.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/form/standardized/")
standardized_form['hindex'] = stand_fn.index.values
standardized_form.to_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/form/standardized/")

normalized_form = dask.dataframe.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/form/normalized/")
normalized_form['hindex'] = stand_fn.index.values
normalized_form.to_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/form/normalized/")

## Test cases

Test cases (with the number of observations in each):
- Chunk 68 (Glasgow): 155,609
- Chunk 51 (Merseyside): 121,188
- Random sample: 250,000

In [32]:
standardized_form = dask.dataframe.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/form/standardized/").compute().set_index('hindex')

In [33]:
sample = standardized_form.sample(n=250_000, random_state=42)
sample.to_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/sample/form_standardized.pq")

In [34]:
normalized_form = dask.dataframe.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/form/normalized/").compute().set_index('hindex')
sample_norm = normalized_form.loc[sample.index]
sample_norm.to_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/sample/form_normalized.pq")

In [37]:
stand_fn = dask.dataframe.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/function/standardized/").compute()
sample_stand_fn = stand_fn.loc[sample.index]
sample_stand_fn.to_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/sample/function_standardized.pq")

In [38]:
norm_fn = dask.dataframe.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/function/normalized/").compute()
sample_norm_fn = norm_fn.loc[sample.index]
sample_norm_fn.to_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/sample/function_normalized.pq")

In [7]:
geoms = dask.dataframe.read_parquet("../../urbangrammar_samba/spatial_signatures/tessellation/tess_*.pq").compute().set_index("hindex")
sample_geoms = geoms.loc[sample.index]
sample_geoms.to_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/sample/geometry.pq")
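
The cells above draw one reproducible sample and then reuse its index to pull the matching rows from every other table. A minimal sketch of that pattern on toy frames follows; the identifiers and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy tables (hypothetical) sharing the same `hindex` identifiers.
hindex = pd.Index([f"id_{i}" for i in range(100)], name="hindex")
rng = np.random.default_rng(0)
form = pd.DataFrame({"f1": rng.normal(size=100)}, index=hindex)
function = pd.DataFrame({"g1": rng.normal(size=100)}, index=hindex)

# Draw the sample once, with a fixed seed so that it can be reproduced.
sample = form.sample(n=10, random_state=42)

# Reuse the sampled index to select the same observations elsewhere.
sample_function = function.loc[sample.index]

print(sample_function.index.equals(sample.index))
```

Selecting with `.loc` keeps the rows in the order of the sampled index, so the sampled tables stay aligned row by row.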

## Evaluation

Link auxiliary data.

In [26]:
import geopandas as gpd
import pandas as pd
import tobler
import rioxarray
import rasterstats
import numpy as np

parts = {}
parts["chunk51"] = gpd.read_parquet("../../urbangrammar_samba/spatial_signatures/tessellation/tess_51.pq")
parts["chunk68"] = gpd.read_parquet("../../urbangrammar_samba/spatial_signatures/tessellation/tess_68.pq")
parts["sample"] = pd.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/sample/geometry.pq", columns=["tessellation"])
parts["sample"] = gpd.GeoDataFrame(parts["sample"])
parts["sample"]["tessellation"] = gpd.GeoSeries.from_wkb(parts["sample"].tessellation, crs=27700)
parts["sample"] = parts["sample"].set_geometry("tessellation")


for key, gdf in parts.items():
    murray = gpd.read_file("../../urbangrammar_samba/spatial_signatures/clustering_data/validation/murray.gpkg", bbox=tuple(gdf.total_bounds))
    murray.geometry = murray.buffer(80, cap_style=3)
    joined = tobler.area_weighted.area_join(murray, gdf, variables=["ward"])
    joined.reset_index()[["hindex", 'ward']].to_parquet(f"../../urbangrammar_samba/spatial_signatures/clustering_data/validation/murray_{key}.pq")
    
    modum = gpd.read_file("../../urbangrammar_samba/spatial_signatures/clustering_data/validation/modumew2016.zip", bbox=tuple(gdf.total_bounds))
    joined = tobler.area_weighted.area_join(modum, gdf, variables=["CLUSTER_LA"])
    joined.reset_index()[["hindex", 'CLUSTER_LA']].to_parquet(f"../../urbangrammar_samba/spatial_signatures/clustering_data/validation/modum_{key}.pq")
    
    foot = rioxarray.open_rasterio("../../urbangrammar_samba/spatial_signatures/clustering_data/validation/jochem.tif")
    foot_osgb = foot.rio.reproject("EPSG:27700")
    clipped = foot_osgb.rio.clip_box(*gdf.total_bounds)
    arr = clipped.values
    affine = clipped.rio.transform()
    stats = rasterstats.zonal_stats(
        gdf.representative_point(), 
        raster=arr[0],
        affine=affine,
        stats=['mean'],
        nodata = np.nan,
    )
    gdf['jochem'] = [x["mean"] for x in stats]
    gdf.reset_index()[["hindex", 'jochem']].to_parquet(f"../../urbangrammar_samba/spatial_signatures/clustering_data/validation/jochem_{key}.pq")
    print(f"Part {key} done.")

Part sample done.


In [60]:
def evaluation(data, labels, case, identifier, murray, modum, jochem, geom, sample_size=None):
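    """Get evaluation metrics for a given clustering.

    Parameters:
        data : array of input features, one row per observation
        labels : array of cluster labels aligned with ``data``
        case : string {"chunks", "sample"}
        identifier : ID of clustering model (also used in the names of saved maps)
        murray, modum, jochem : DataFrames of validation data, one row per
            observation, holding the "ward", "CLUSTER_LA" and "jochem" columns;
            a "labels" column is added to each of them in place
        geom : GeoDataFrame of tessellation geometry (used only for "chunks")
        sample_size : int (silhouette_score sample size)

    Returns:
        dict of metrics, label frequencies and cross tabulations
    """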
    from sklearn import metrics
    import pandas as pd
    import scipy as sp
    import matplotlib.pyplot as plt
    import contextily as ctx
    import urbangrammar_graphics as ugg
    import dask_geopandas
    from utils.dask_geopandas import dask_dissolve
    
    
    results = {}
    
    try:
        results['silhouette'] = metrics.silhouette_score(data, labels, sample_size=sample_size, random_state=42)
    except ValueError:
        results['silhouette'] = np.nan
    results['calinski'] = metrics.calinski_harabasz_score(data, labels)
    results['davies'] = metrics.davies_bouldin_score(data, labels)

    results['frequencies'] = pd.Series(labels).value_counts()
    
    # cross tabulation

    modum['labels'] = labels
    mod_crosstab = pd.crosstab(modum.dropna()['labels'], modum.dropna()["CLUSTER_LA"])
    results['mod_chi'], results['mod_p'], results['mod_dof'], results['mod_exp'] = sp.stats.chi2_contingency(mod_crosstab)
    results['mod_cramers_'] = cramers_v(mod_crosstab)
    results['mod_crosstab'] = mod_crosstab

    murray['labels'] = labels
    mur_crosstab = pd.crosstab(murray.dropna()['labels'], murray.dropna()["ward"])
    results['mur_chi'], results['mur_p'], results['mur_dof'], results['mur_exp'] = sp.stats.chi2_contingency(mur_crosstab)
    results['mur_cramers_v'] = cramers_v(mur_crosstab)
    results['mur_crosstab'] = mur_crosstab


    jochem['labels'] = labels
    joc_crosstab = pd.crosstab(jochem.dropna()['labels'], jochem.dropna()["jochem"])
    results['joc_chi'], results['joc_p'], results['joc_dof'], results['joc_exp'] = sp.stats.chi2_contingency(joc_crosstab)
    results['joc_cramers_v'] = cramers_v(joc_crosstab)
    results['joc_crosstab'] = joc_crosstab
    
    
    if case == "chunks":
        # signatures
        
        geom['labels'] = labels        
        ddf = dask_geopandas.from_geopandas(geom.sort_values('labels'), npartitions=64)
        spsig = dask_dissolve(ddf, by='labels').compute().reset_index(drop=True).explode()
        
        results['signature_abundance'] = spsig.labels.value_counts()
        results['signature_areas'] = spsig.area
        
        
        cmap = ugg.get_colormap(spsig.labels.nunique(), randomize=True)
        token = "pk.eyJ1IjoibWFydGluZmxlaXMiLCJhIjoiY2tsNmhlemtxMmlicTJubXN6and5aTc2NCJ9.l7nSUXM7ZRjAWTB7oXiswQ"
        
        ax = spsig.cx[332971:361675, 379462:404701].plot("labels", figsize=(20, 20), zorder=1, linewidth=.3, edgecolor='w', alpha=1, legend=True, cmap=cmap, categorical=True)
        ctx.add_basemap(ax, crs=27700, source=ugg.get_tiles('roads', token), zorder=2, alpha=.3)
        ctx.add_basemap(ax, crs=27700, source=ugg.get_tiles('labels', token), zorder=3, alpha=1)
        ctx.add_basemap(ax, crs=27700, source=ugg.get_tiles('background', token), zorder=-1, alpha=1)
        ax.set_axis_off()

        plt.savefig(f"../../urbangrammar_samba/spatial_signatures/clustering_data/validation/maps/{identifier}_lpool.png")
        plt.close()   

        ax = spsig.cx[218800:270628, 645123:695069].plot("labels", figsize=(20, 20), zorder=1, linewidth=.3, edgecolor='w', alpha=1, legend=True, cmap=cmap, categorical=True)
        ctx.add_basemap(ax, crs=27700, source=ugg.get_tiles('roads', token), zorder=2, alpha=.3)
        ctx.add_basemap(ax, crs=27700, source=ugg.get_tiles('labels', token), zorder=3, alpha=1)
        ctx.add_basemap(ax, crs=27700, source=ugg.get_tiles('background', token), zorder=-1, alpha=1)
        ax.set_axis_off()
        plt.savefig(f"../../urbangrammar_samba/spatial_signatures/clustering_data/validation/maps/{identifier}_gla.png")
        plt.close()    
    
#     else:
        
#         geom['labels'] = labels        
#         ddf = dask_geopandas.from_geopandas(geom.sort_values('labels'), npartitions=64)
#         spsig = dask_dissolve(ddf, by='labels').compute().reset_index().explode()
#         centroid = spsig.centroid
#         results['x_coords'] = centroid.x
#         results['y_coords ']= centroid.y
        
    return results


def cramers_v(confusion_matrix):
    """Bias-corrected Cramér's V for a contingency table (pandas crosstab)."""
    import numpy as np
    from scipy import stats

    chi2 = stats.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    # subtract the expected bias from phi2 and shrink the table dimensions accordingly
    phi2corr = max(0, phi2-((k-1)*(r-1))/(n-1))
    rcorr = r-((r-1)**2)/(n-1)
    kcorr = k-((k-1)**2)/(n-1)
    return np.sqrt(phi2corr/min((kcorr-1),(rcorr-1)))
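
To give a feel for the scale of the bias-corrected Cramér's V used in the cross tabulations, the sketch below restates the function in a self-contained form and applies it to two made-up 3×3 tables: one where cluster labels map one-to-one onto the reference classes and one where they are unrelated. The tables are illustrative assumptions, not results from this notebook.

```python
import numpy as np
from scipy import stats


def cramers_v(confusion_matrix):
    """Bias-corrected Cramér's V, restating the function above for arrays."""
    table = np.asarray(confusion_matrix)
    chi2 = stats.chi2_contingency(table)[0]
    n = table.sum()
    phi2 = chi2 / n
    r, k = table.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))


# Labels that map one-to-one onto the reference classes.
perfect = np.array([[30, 0, 0], [0, 30, 0], [0, 0, 30]])
# Labels that are spread evenly, i.e. unrelated to the reference classes.
unrelated = np.array([[10, 10, 10], [10, 10, 10], [10, 10, 10]])

v_perfect = cramers_v(perfect)
v_unrelated = cramers_v(unrelated)
print(v_perfect, v_unrelated)
```

Values close to 1 indicate that the clustering largely reproduces the reference classification, values close to 0 that the two are unrelated.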

In [68]:
for transformation in ["normalized", "standardized"]:
    c51 = pd.read_parquet(f"../../urbangrammar_samba/spatial_signatures/clustering_data/form/{transformation}/part.51.parquet")
    c68 = pd.read_parquet(f"../../urbangrammar_samba/spatial_signatures/clustering_data/form/{transformation}/part.68.parquet")
    form = pd.concat([c51, c68]).reset_index(drop=True).drop(columns="hindex")

    c51f = pd.read_parquet(f"../../urbangrammar_samba/spatial_signatures/clustering_data/function/{transformation}/part.51.parquet")
    c68f = pd.read_parquet(f"../../urbangrammar_samba/spatial_signatures/clustering_data/function/{transformation}/part.68.parquet")
    fn = pd.concat([c51f, c68f]).reset_index(drop=True)
    
    data = pd.concat([form, fn], axis=1).replace([np.inf, -np.inf], np.nan).fillna(0)
    data.to_parquet(f"../../urbangrammar_samba/spatial_signatures/clustering_data/sample/chunks_{transformation}_data.pq")
    
    form = pd.read_parquet(f"../../urbangrammar_samba/spatial_signatures/clustering_data/sample/form_{transformation}.pq")
    fn = pd.read_parquet(f"../../urbangrammar_samba/spatial_signatures/clustering_data/sample/function_{transformation}.pq")
    data = pd.concat([form, fn], axis=1).replace([np.inf, -np.inf], np.nan).fillna(0)
    data.to_parquet(f"../../urbangrammar_samba/spatial_signatures/clustering_data/sample/sample_{transformation}_data.pq")

## Test matrix

In [None]:
# !pip install scikit-learn-extra
# !pip install minisom

In [None]:
from itertools import product
from time import time

import pandas as pd
import numpy as np
import geopandas as gpd

from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.mixture import GaussianMixture
from sklearn_extra.cluster import KMedoids
from minisom import MiniSom


labels = {}
times = {}
evaluations = {}
quant_errors = {}

In [41]:
import numpy as np

In [62]:
# for case in ["chunks", "sample"]:
for case in ["sample"]:
    
    if case == "chunks":
        mod51 = pd.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/validation/modum_chunk51.pq")
        mod68 = pd.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/validation/modum_chunk68.pq")
        modum = pd.concat([mod51, mod68]).reset_index(drop=True)

        mur51 = pd.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/validation/murray_chunk51.pq")
        mur68 = pd.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/validation/murray_chunk68.pq")
        murray = pd.concat([mur51, mur68]).reset_index(drop=True)

        joc51 = pd.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/validation/jochem_chunk51.pq")
        joc68 = pd.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/validation/jochem_chunk68.pq")
        jochem = pd.concat([joc51, joc68]).reset_index(drop=True)

        geom51 = gpd.read_parquet("../../urbangrammar_samba/spatial_signatures/tessellation/tess_51.pq", columns=["tessellation", "hindex"])
        geom68 = gpd.read_parquet("../../urbangrammar_samba/spatial_signatures/tessellation/tess_68.pq", columns=["tessellation", "hindex"])
        geom = pd.concat([geom51, geom68]).reset_index(drop=True).rename_geometry("geometry")

    else:
        modum = pd.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/validation/modum_sample.pq")
        murray = pd.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/validation/murray_sample.pq")
        jochem = pd.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/validation/jochem_sample.pq")
        geom = pd.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/sample/geometry.pq", columns=["tessellation", "hindex"])
        geom = gpd.GeoDataFrame(geom, geometry=gpd.GeoSeries.from_wkb(geom.tessellation))

#     for transformation in ["normalized", "standardized"]:
    for transformation in ["standardized"]:
        
        # load data and prepare numpy.array
        if case == "chunks":
            data = pd.read_parquet(f"../../urbangrammar_samba/spatial_signatures/clustering_data/sample/chunks_{transformation}_data.pq").values
        else:
            data = pd.read_parquet(f"../../urbangrammar_samba/spatial_signatures/clustering_data/sample/sample_{transformation}_data.pq").values 

        for k in [10, 15, 20, 30]:
            # KMeans
            identifier = f"{case}_{transformation}_KMeans_{k}"
            s = time()
            km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(data)
            times[identifier] = time() - s
            labels_ = km.labels_
            labels[identifier] = labels_
            
            evaluations[identifier] = evaluation(data, labels_, case, identifier, murray, modum, jochem, geom, sample_size=10_000)
            
            print(f"{identifier} done. Time to fit the model: {times[identifier]} seconds.")
            
        for k in [10, 15, 20, 30]:
            # MiniBatchKMeans
            identifier = f"{case}_{transformation}_MiniBatchKMeans_{k}"
            s = time()
            km = MiniBatchKMeans(n_clusters=k, batch_size=25_000, n_init=10, random_state=42).fit(data)
            times[identifier] = time() - s
            labels_ = km.labels_
            labels[identifier] = labels_
            
            evaluations[identifier] = evaluation(data, labels_, case, identifier, murray, modum, jochem, geom, sample_size=10_000)
            
            print(f"{identifier} done. Time to fit the model: {times[identifier]} seconds.")
            
#         for k in [10, 15, 20, 30]:
#             # K-Medoid
#             identifier = f"{case}_{transformation}_KMedoid_{k}"
#             s = time()
#             km = KMedoids(n_clusters=k, random_state=42).fit(data)
#             times[identifier] = time() - s
#             labels_ = km.labels_
#             labels[identifier] = labels_
            
#             evaluations[identifier] = evaluation(data, labels_, case, identifier, murray, modum, jochem, geom, sample_size=10_000)
            
#             print(f"{identifier} done. Time to fit the model: {times[identifier]} seconds.")
        
        for k in [10, 15, 20, 30]:
            # GMM
            identifier = f"{case}_{transformation}_GMM_{k}"
            s = time()
            gmm = GaussianMixture(n_components=k, n_init=10, random_state=42, covariance_type="full", max_iter=500).fit(data)
            times[identifier] = time() - s
            labels_ = gmm.predict(data)
            labels[identifier] = labels_
            
            evaluations[identifier] = evaluation(data, labels_, case, identifier, murray, modum, jochem, geom, sample_size=10_000)
            
            print(f"{identifier} done. Time to fit the model: {times[identifier]} seconds.")

        # SOM
        for som_shape in [(3, 3), (2, 5), (3, 4), (2, 6), (3, 5), (4, 5), (5, 5), (6, 5)]:
            for (sigma, rate) in product([.01, .1, .25, .5, 1], [.01, .05, .1, .2, .5, 1]):
                identifier = f"{case}_{transformation}_SOM_{som_shape}_sigma-{sigma}_rate-{rate}"
                s = time()
                som = MiniSom(som_shape[0], som_shape[1], data.shape[1], sigma=sigma, learning_rate=rate,
                              topology="hexagonal", random_seed=42)
                som.train_batch(data, 50000, verbose=False)
                winner_coordinates = np.array([som.winner(x) for x in data])
                labels_ = np.apply_along_axis(lambda x: str(tuple(x)), 1, winner_coordinates)
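                if len(np.unique(labels_)) > 1:
                    times[identifier] = time() - s
                    labels[identifier] = labels_

                    evaluations[identifier] = evaluation(data, labels_, case, identifier, murray, modum, jochem, geom, sample_size=10_000)
                    quant_errors[identifier] = som.quantization_error(data)

                    print(f"{identifier} done. Time to fit the model: {times[identifier]} seconds.")
                else:
                    # All observations were mapped to a single node, so there is nothing
                    # to evaluate and no entry in `times` to report.
                    print(f"{identifier} skipped: only one cluster found.")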

sample_standardized_KMeans_10 done. Time to fit the model: 33.980648040771484 seconds.
sample_standardized_KMeans_15 done. Time to fit the model: 36.58458423614502 seconds.
sample_standardized_KMeans_20 done. Time to fit the model: 50.62095856666565 seconds.
sample_standardized_KMeans_30 done. Time to fit the model: 88.40085291862488 seconds.
sample_standardized_MiniBatchKMeans_10 done. Time to fit the model: 17.53959631919861 seconds.
sample_standardized_MiniBatchKMeans_15 done. Time to fit the model: 19.377883672714233 seconds.
sample_standardized_MiniBatchKMeans_20 done. Time to fit the model: 19.961794137954712 seconds.
sample_standardized_MiniBatchKMeans_30 done. Time to fit the model: 27.228529930114746 seconds.
sample_standardized_GMM_10 done. Time to fit the model: 8817.877690076828 seconds.
sample_standardized_GMM_15 done. Time to fit the model: 8004.2522094249725 seconds.
sample_standardized_GMM_20 done. Time to fit the model: 11593.263016223907 seconds.
sample_standardized_G

In [46]:
import pickle

In [63]:
with open("all_data.pickle", "wb") as f:
    pickle.dump((labels, times, evaluations, quant_errors), f)

In [None]:
labels = {}
times = {}
evaluations = {}
quant_errors = {}

In [66]:
options = evaluations.keys()

In [69]:
len(options)

1008

In [71]:
one_eval = evaluations[list(options)[0]]

In [93]:
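useless = []

# flag clusterings in which a single cluster holds more than 90% of observations
for op in list(options):
    if 'chunks' in op:
        # 276797 = 155609 (chunk 68) + 121188 (chunk 51) observations
        if ((evaluations[op]["frequencies"] / 276797) > .9).any():
            useless.append(op)
    else:
        # 250000 observations in the random sample
        if ((evaluations[op]["frequencies"] / 250000) > .9).any():
            useless.append(op)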
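
The filter above discards clusterings in which one cluster swallows more than 90% of the observations. A minimal sketch of the same idea follows, with hypothetical frequencies and a hypothetical helper `is_dominated`; it divides by the sum of the frequencies instead of a hard-coded total, which is equivalent as long as every observation carries a label.

```python
import pandas as pd

# Hypothetical label frequencies for two clusterings of 1,000 observations.
evaluations = {
    "sample_a": {"frequencies": pd.Series({0: 950, 1: 30, 2: 20})},
    "sample_b": {"frequencies": pd.Series({0: 400, 1: 350, 2: 250})},
}


def is_dominated(frequencies, threshold=0.9):
    """Return True if a single cluster holds more than `threshold` of all observations."""
    shares = frequencies / frequencies.sum()
    return bool((shares > threshold).any())


useless = [key for key, ev in evaluations.items() if is_dominated(ev["frequencies"])]
print(useless)
```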

In [95]:
len(useless)

388

In [96]:
options = [o for o in list(evaluations.keys()) if o not in useless]

In [98]:
silhouettes_chunks = pd.Series(dtype=float)
silhouettes_sample = pd.Series(dtype=float)


for op in options:
    # op[7:] strips the 7-character "chunks_" or "sample_" prefix
    if 'chunks' in op:
        silhouettes_chunks[op[7:]] = evaluations[op]['silhouette']
    else:
        silhouettes_sample[op[7:]] = evaluations[op]['silhouette']
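
The same gathering loop recurs for every metric in this section. One possible way to factor it out is sketched below; `collect_metric` is a hypothetical helper, and the evaluation values are made up for illustration.

```python
import pandas as pd

# Hypothetical evaluation results keyed as "<case>_<transformation>_<model>".
evaluations = {
    "chunks_standardized_KMeans_10": {"silhouette": 0.12, "davies": 1.9},
    "sample_standardized_KMeans_10": {"silhouette": 0.10, "davies": 2.1},
    "sample_normalized_GMM_10": {"silhouette": 0.05, "davies": 2.4},
}


def collect_metric(evaluations, metric, case):
    """Gather one metric for one test case into a Series indexed by model name."""
    prefix = f"{case}_"
    return pd.Series(
        {
            key[len(prefix):]: result[metric]
            for key, result in evaluations.items()
            if key.startswith(prefix)
        },
        dtype=float,
    )


silhouettes_sample = collect_metric(evaluations, "silhouette", "sample")
print(silhouettes_sample.sort_values(ascending=False))
```

Sorting in descending order puts the best silhouette scores first; for Davies-Bouldin, where lower is better, the default ascending order does the same.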


In [191]:
silhouettes_chunks.sort_values()[:60]

standardized_SOM_(5, 5)_sigma-1_rate-1         -0.024278
standardized_SOM_(6, 5)_sigma-0.1_rate-0.01    -0.017866
standardized_SOM_(6, 5)_sigma-0.01_rate-0.01   -0.017866
standardized_SOM_(6, 5)_sigma-0.25_rate-0.01   -0.017779
standardized_SOM_(3, 4)_sigma-1_rate-0.5       -0.017356
standardized_SOM_(2, 5)_sigma-1_rate-0.5       -0.015143
standardized_SOM_(3, 4)_sigma-1_rate-0.2       -0.015097
standardized_SOM_(3, 4)_sigma-1_rate-1         -0.014982
standardized_SOM_(6, 5)_sigma-1_rate-1         -0.012165
standardized_SOM_(2, 6)_sigma-1_rate-0.5       -0.007150
standardized_SOM_(6, 5)_sigma-0.5_rate-0.01    -0.005978
standardized_SOM_(3, 5)_sigma-1_rate-1         -0.005481
standardized_SOM_(2, 6)_sigma-1_rate-1         -0.004917
standardized_SOM_(6, 5)_sigma-1_rate-0.2       -0.004576
standardized_SOM_(5, 5)_sigma-0.25_rate-0.01   -0.003037
standardized_SOM_(5, 5)_sigma-0.1_rate-0.01    -0.003037
standardized_SOM_(5, 5)_sigma-0.01_rate-0.01   -0.003037
standardized_SOM_(4, 5)_sigma-1

In [100]:
silhouettes_sample.sort_values()[:20]

standardized_MiniBatchKMeans_15                 0.003868
standardized_MiniBatchKMeans_30                 0.006452
standardized_GMM_30                             0.006930
standardized_SOM_(6, 5)_sigma-0.25_rate-0.01    0.007282
standardized_SOM_(6, 5)_sigma-0.5_rate-0.01     0.007616
standardized_SOM_(6, 5)_sigma-1_rate-0.01       0.011388
standardized_SOM_(6, 5)_sigma-0.01_rate-0.01    0.012175
standardized_SOM_(6, 5)_sigma-0.1_rate-0.01     0.012175
standardized_SOM_(4, 5)_sigma-1_rate-0.01       0.012226
standardized_SOM_(5, 5)_sigma-1_rate-0.01       0.012382
normalized_GMM_30                               0.012945
standardized_SOM_(5, 5)_sigma-0.1_rate-0.01     0.013606
standardized_SOM_(5, 5)_sigma-0.01_rate-0.01    0.013606
standardized_SOM_(5, 5)_sigma-0.25_rate-0.01    0.015820
standardized_SOM_(5, 5)_sigma-0.5_rate-0.01     0.018615
standardized_SOM_(4, 5)_sigma-0.5_rate-0.01     0.020928
standardized_SOM_(5, 5)_sigma-1_rate-0.05       0.024459
normalized_GMM_20              

In [101]:
calinski_chunks = pd.Series(dtype=float)
calinski_sample = pd.Series(dtype=float)


for op in options:
    if 'chunks' in op:
        calinski_chunks[op[7:]] = evaluations[op]['calinski']
    else:
        calinski_sample[op[7:]] = evaluations[op]['calinski']


In [182]:
calinski_chunks.sort_values()[:40]

standardized_SOM_(6, 5)_sigma-0.1_rate-0.01     6959.598231
standardized_SOM_(6, 5)_sigma-0.01_rate-0.01    6959.598231
standardized_SOM_(6, 5)_sigma-0.25_rate-0.01    6960.295718
standardized_SOM_(6, 5)_sigma-1_rate-1          6990.208635
standardized_SOM_(5, 5)_sigma-0.01_rate-1       7421.136225
standardized_SOM_(5, 5)_sigma-0.1_rate-1        7421.136225
standardized_SOM_(5, 5)_sigma-0.25_rate-1       7424.063839
standardized_SOM_(6, 5)_sigma-0.1_rate-1        7430.500504
standardized_SOM_(6, 5)_sigma-0.01_rate-1       7430.500504
standardized_SOM_(6, 5)_sigma-0.25_rate-1       7433.264305
standardized_SOM_(6, 5)_sigma-0.5_rate-0.01     7748.246318
standardized_SOM_(5, 5)_sigma-1_rate-1          7825.409446
standardized_SOM_(4, 5)_sigma-1_rate-1          8107.615230
standardized_SOM_(5, 5)_sigma-0.01_rate-0.01    8112.822740
standardized_SOM_(5, 5)_sigma-0.1_rate-0.01     8112.822740
standardized_SOM_(5, 5)_sigma-0.25_rate-0.01    8112.861026
standardized_SOM_(6, 5)_sigma-1_rate-0.1

In [103]:
calinski_sample.sort_values()[:20]

standardized_SOM_(6, 5)_sigma-1_rate-0.01       4824.010917
standardized_SOM_(6, 5)_sigma-0.01_rate-0.01    4889.160249
standardized_SOM_(6, 5)_sigma-0.1_rate-0.01     4889.160249
standardized_SOM_(6, 5)_sigma-0.25_rate-0.01    4915.344589
standardized_SOM_(6, 5)_sigma-0.5_rate-0.01     4980.880058
standardized_SOM_(6, 5)_sigma-0.01_rate-0.5     5014.890213
standardized_SOM_(6, 5)_sigma-0.1_rate-0.5      5014.890213
standardized_SOM_(6, 5)_sigma-0.25_rate-0.5     5014.890213
standardized_MiniBatchKMeans_30                 5134.759793
standardized_SOM_(6, 5)_sigma-1_rate-0.05       5153.592227
standardized_SOM_(6, 5)_sigma-0.5_rate-1        5286.382837
standardized_SOM_(6, 5)_sigma-1_rate-0.1        5364.270909
standardized_SOM_(4, 5)_sigma-0.5_rate-0.5      5555.249455
standardized_SOM_(5, 5)_sigma-1_rate-0.01       5559.596842
standardized_SOM_(5, 5)_sigma-0.01_rate-0.01    5678.900502
standardized_SOM_(5, 5)_sigma-0.1_rate-0.01     5678.900502
standardized_SOM_(5, 5)_sigma-0.25_rate-

In [104]:
davies_chunks = pd.Series(dtype=float)
davies_sample = pd.Series(dtype=float)


for op in options:
    if 'chunks' in op:
        davies_chunks[op[7:]] = evaluations[op]['davies']
    else:
        davies_sample[op[7:]] = evaluations[op]['davies']


In [105]:
davies_chunks.sort_values()[:20]

standardized_SOM_(3, 4)_sigma-0.5_rate-0.2      1.327454
standardized_SOM_(3, 4)_sigma-0.5_rate-0.5      1.328461
standardized_SOM_(3, 5)_sigma-0.5_rate-1        1.336137
standardized_SOM_(3, 5)_sigma-0.1_rate-0.2      1.351700
standardized_SOM_(3, 5)_sigma-0.01_rate-0.2     1.351700
standardized_SOM_(3, 5)_sigma-0.25_rate-0.2     1.354574
standardized_SOM_(2, 6)_sigma-0.5_rate-1        1.363487
standardized_SOM_(2, 5)_sigma-0.5_rate-0.5      1.381653
standardized_SOM_(3, 4)_sigma-0.5_rate-1        1.393961
standardized_SOM_(2, 5)_sigma-0.5_rate-1        1.394819
standardized_SOM_(4, 5)_sigma-0.5_rate-1        1.411874
normalized_SOM_(2, 6)_sigma-0.5_rate-0.01       1.426189
standardized_SOM_(2, 6)_sigma-0.5_rate-0.2      1.426299
standardized_SOM_(5, 5)_sigma-0.25_rate-1       1.437903
standardized_SOM_(5, 5)_sigma-0.01_rate-1       1.440224
standardized_SOM_(5, 5)_sigma-0.1_rate-1        1.440224
standardized_SOM_(3, 5)_sigma-0.5_rate-0.2      1.448240
standardized_SOM_(3, 3)_sigma-0

In [106]:
davies_sample.sort_values()[:20]

normalized_SOM_(5, 5)_sigma-0.25_rate-1        0.803414
normalized_SOM_(6, 5)_sigma-0.25_rate-1        0.803414
normalized_SOM_(3, 4)_sigma-0.25_rate-1        0.962934
normalized_SOM_(2, 5)_sigma-0.25_rate-1        0.962934
normalized_SOM_(4, 5)_sigma-0.25_rate-1        0.962934
normalized_SOM_(3, 5)_sigma-0.25_rate-1        0.962934
normalized_SOM_(2, 6)_sigma-0.25_rate-1        0.962934
standardized_SOM_(6, 5)_sigma-0.5_rate-1       1.144874
standardized_SOM_(5, 5)_sigma-0.25_rate-0.5    1.193206
standardized_SOM_(6, 5)_sigma-0.01_rate-0.5    1.198918
standardized_SOM_(6, 5)_sigma-0.1_rate-0.5     1.198918
standardized_SOM_(6, 5)_sigma-0.25_rate-0.5    1.198918
standardized_SOM_(5, 5)_sigma-0.01_rate-0.5    1.212181
standardized_SOM_(5, 5)_sigma-0.1_rate-0.5     1.212181
standardized_SOM_(4, 5)_sigma-0.25_rate-0.5    1.212815
standardized_SOM_(4, 5)_sigma-0.01_rate-0.5    1.212815
standardized_SOM_(4, 5)_sigma-0.1_rate-0.5     1.212815
standardized_SOM_(4, 5)_sigma-0.5_rate-0.5     1

In [108]:
one_eval.keys()

dict_keys(['silhouette', 'calinski', 'davies', 'frequencies', 'mod_chi', 'mod_p', 'mod_dof', 'mod_exp', 'mod_cramers_', 'mod_crosstab', 'mur_chi', 'mur_p', 'mur_dof', 'mur_exp', 'mur_cramers_v', 'mur_crosstab', 'joc_chi', 'joc_p', 'joc_dof', 'joc_exp', 'joc_cramers_v', 'joc_crosstab', 'signature_abundance', 'signature_areas'])

1    67888
2    48507
6    42341
8    39074
7    31447
0    12142
3    11888
5     8989
4     7895
9     6626
dtype: int64

In [113]:
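fragmentation = pd.Series(dtype="int64")

# number of polygons per clustering: the more polygons, the more fragmented the signatures
for op in options:
    if 'chunks' in op:
        fragmentation[op[7:]] = evaluations[op]['signature_abundance'].sum()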


In [136]:
fragmentation_area = pd.Series(dtype=float)

# median polygon area per clustering
for op in options:
    if 'chunks' in op:
        fragmentation_area[op[7:]] = evaluations[op]['signature_areas'].median()


In [186]:
fragmentation.loc[fragmentation.index.str.contains('stand')].sort_values()[:50]

standardized_SOM_(3, 3)_sigma-0.1_rate-0.05      838
standardized_SOM_(3, 3)_sigma-0.01_rate-0.05     838
standardized_SOM_(3, 3)_sigma-0.25_rate-0.05     840
standardized_SOM_(2, 5)_sigma-0.5_rate-1         908
standardized_SOM_(3, 3)_sigma-0.5_rate-0.01      927
standardized_SOM_(3, 3)_sigma-0.25_rate-0.01     931
standardized_SOM_(2, 5)_sigma-0.1_rate-0.05      937
standardized_SOM_(2, 5)_sigma-0.01_rate-0.05     937
standardized_SOM_(2, 6)_sigma-0.5_rate-1         940
standardized_SOM_(2, 5)_sigma-0.25_rate-0.05     940
standardized_SOM_(3, 3)_sigma-0.1_rate-0.01      941
standardized_SOM_(3, 3)_sigma-0.01_rate-0.01     941
standardized_SOM_(3, 4)_sigma-0.5_rate-1         953
standardized_SOM_(3, 4)_sigma-0.5_rate-0.2       964
standardized_SOM_(2, 6)_sigma-0.5_rate-0.2       967
standardized_SOM_(2, 5)_sigma-0.01_rate-0.1     1023
standardized_SOM_(2, 5)_sigma-0.1_rate-0.1      1023
standardized_SOM_(2, 5)_sigma-0.25_rate-0.1     1023
standardized_SOM_(3, 5)_sigma-0.5_rate-1      

In [121]:
fragmentation.loc[fragmentation.index.str.contains('KMeans')]

normalized_KMeans_10               2547
normalized_KMeans_15               3197
normalized_KMeans_20               3771
normalized_KMeans_30               4779
normalized_MiniBatchKMeans_10      2364
normalized_MiniBatchKMeans_15      3414
normalized_MiniBatchKMeans_20      4120
normalized_MiniBatchKMeans_30      5145
standardized_KMeans_10             1661
standardized_KMeans_15             1994
standardized_KMeans_20             2365
standardized_KMeans_30             3174
standardized_MiniBatchKMeans_10    1654
standardized_MiniBatchKMeans_15    2163
standardized_MiniBatchKMeans_20    2735
standardized_MiniBatchKMeans_30    3389
dtype: int64

In [122]:
fragmentation.loc[fragmentation.index.str.contains('GMM')]

normalized_GMM_10      1409
normalized_GMM_15      1453
normalized_GMM_20      1229
normalized_GMM_30      1379
standardized_GMM_10    1520
standardized_GMM_15    1373
standardized_GMM_20    1531
standardized_GMM_30    1369
dtype: int64

In [173]:
fragmentation_area.loc[fragmentation_area.index.str.contains('stand')].sort_values(ascending=False)[:40]

standardized_GMM_30                             2700.965074
standardized_SOM_(2, 6)_sigma-1_rate-0.1        2391.975084
standardized_SOM_(2, 5)_sigma-1_rate-0.05       2374.398935
standardized_SOM_(3, 3)_sigma-1_rate-0.1        2368.246530
standardized_SOM_(4, 5)_sigma-0.5_rate-0.2      2356.992203
standardized_SOM_(6, 5)_sigma-0.01_rate-0.05    2344.947563
standardized_SOM_(6, 5)_sigma-0.1_rate-0.05     2344.947563
standardized_SOM_(6, 5)_sigma-0.25_rate-0.05    2307.287479
standardized_SOM_(3, 5)_sigma-1_rate-0.01       2284.809471
standardized_SOM_(4, 5)_sigma-0.1_rate-0.1      2281.160246
standardized_SOM_(4, 5)_sigma-0.25_rate-0.1     2281.160246
standardized_SOM_(4, 5)_sigma-0.01_rate-0.1     2281.160246
standardized_SOM_(2, 5)_sigma-1_rate-0.2        2270.651929
standardized_SOM_(3, 3)_sigma-1_rate-0.2        2266.443011
standardized_MiniBatchKMeans_30                 2244.820890
standardized_GMM_15                             2239.217112
standardized_SOM_(2, 6)_sigma-1_rate-0.0

In [188]:
evaluations["chunks_standardized_GMM_30"]['frequencies']

17    24262
18    21483
12    20069
28    19235
13    17831
5     17506
11    16281
6     16210
3     15925
27    13116
19    12653
1      8949
15     8100
16     8015
9      7612
2      7590
14     6497
0      6306
24     5686
23     5089
29     4582
20     3989
21     3505
26     3180
4      1669
25      610
8       361
10      221
22      179
7        86
dtype: int64

In [124]:
# Cramér's V against the postcode classification (`mur_*` keys), chunk data only.
postcode = pd.Series({
    op[7:]: evaluations[op]['mur_cramers_v']
    for op in options
    if 'chunks' in op
})


In [184]:
postcode.sort_values()[-20:]

standardized_SOM_(6, 5)_sigma-0.01_rate-0.01    0.193043
standardized_SOM_(6, 5)_sigma-0.25_rate-0.01    0.193126
standardized_SOM_(6, 5)_sigma-1_rate-0.01       0.193576
standardized_GMM_15                             0.195398
standardized_SOM_(6, 5)_sigma-0.5_rate-0.2      0.195544
standardized_SOM_(5, 5)_sigma-0.5_rate-0.5      0.195676
normalized_GMM_20                               0.196112
standardized_SOM_(6, 5)_sigma-1_rate-0.05       0.197000
standardized_SOM_(6, 5)_sigma-0.5_rate-0.05     0.197602
standardized_SOM_(6, 5)_sigma-0.25_rate-0.05    0.197952
standardized_SOM_(6, 5)_sigma-0.01_rate-0.05    0.198055
standardized_SOM_(6, 5)_sigma-0.1_rate-0.05     0.198055
standardized_GMM_20                             0.200421
normalized_KMeans_30                            0.201362
standardized_KMeans_20                          0.202131
normalized_MiniBatchKMeans_30                   0.204752
normalized_GMM_30                               0.204883
standardized_MiniBatchKMeans_30
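
The `*_cramers_v` values measure how strongly the cluster labels are associated with a reference classification. How they were computed is not shown in this section; the sketch below assumes the standard, uncorrected definition $V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$ applied to a crosstab, and it is written without dependencies so the arithmetic is visible. The function name and the two tiny tables are made up for illustration.

```python
import math


def cramers_v(table):
    """Uncorrected Cramér's V for a contingency table (list of rows of counts).

    Assumes every row and every column has a non-zero total.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Pearson chi-squared statistic from observed and expected counts.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


# Hypothetical crosstabs: rows are cluster labels, columns are reference classes.
perfect = [[50, 0], [0, 50]]        # labels match the reference exactly
independent = [[25, 25], [25, 25]]  # labels carry no information

v_perfect = cramers_v(perfect)
v_independent = cramers_v(independent)
```

Values close to 1 indicate that the clustering reproduces the reference classes, values close to 0 that the two are unrelated, which is why higher values rank better in the comparison.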

In [126]:
# Cramér's V against the `joc_*` reference classification, chunk data only.
jochem = pd.Series({
    op[7:]: evaluations[op]['joc_cramers_v']
    for op in options
    if 'chunks' in op
})


In [127]:
jochem.sort_values()[-20:]

standardized_SOM_(6, 5)_sigma-1_rate-0.05      0.293012
standardized_SOM_(5, 5)_sigma-1_rate-0.5       0.293537
standardized_SOM_(6, 5)_sigma-1_rate-0.2       0.293610
normalized_MiniBatchKMeans_30                  0.293773
standardized_GMM_15                            0.295652
standardized_SOM_(6, 5)_sigma-1_rate-0.01      0.295684
standardized_SOM_(5, 5)_sigma-1_rate-0.01      0.296311
standardized_SOM_(6, 5)_sigma-1_rate-0.5       0.296498
standardized_SOM_(6, 5)_sigma-0.5_rate-0.05    0.299157
normalized_GMM_30                              0.299246
standardized_MiniBatchKMeans_15                0.299745
standardized_SOM_(5, 5)_sigma-0.5_rate-0.5     0.302494
standardized_SOM_(6, 5)_sigma-0.5_rate-0.2     0.304802
standardized_GMM_20                            0.308456
standardized_MiniBatchKMeans_20                0.311887
standardized_KMeans_15                         0.313963
standardized_KMeans_20                         0.315366
standardized_GMM_30                            0

In [129]:
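# Cramér's V against the `mod_*` reference classification, chunk data only.
# The key is stored as 'mod_cramers_' (no trailing "v") in `evaluations`.
modum = pd.Series({
    op[7:]: evaluations[op]['mod_cramers_']
    for op in options
    if 'chunks' in op
})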


In [130]:
modum.sort_values()[-20:]

standardized_SOM_(5, 5)_sigma-1_rate-0.1        0.298132
standardized_SOM_(4, 5)_sigma-1_rate-0.5        0.298227
standardized_SOM_(6, 5)_sigma-0.01_rate-0.1     0.301453
standardized_SOM_(6, 5)_sigma-0.1_rate-0.1      0.301453
standardized_SOM_(6, 5)_sigma-0.25_rate-0.1     0.301463
standardized_SOM_(6, 5)_sigma-0.5_rate-0.05     0.301649
standardized_SOM_(5, 5)_sigma-1_rate-0.5        0.302658
standardized_SOM_(6, 5)_sigma-0.25_rate-0.05    0.303264
standardized_SOM_(6, 5)_sigma-0.1_rate-0.05     0.303293
standardized_SOM_(6, 5)_sigma-0.01_rate-0.05    0.303293
standardized_KMeans_30                          0.304986
standardized_GMM_30                             0.305577
standardized_SOM_(4, 5)_sigma-1_rate-0.05       0.305997
standardized_SOM_(6, 5)_sigma-1_rate-0.05       0.307816
standardized_SOM_(6, 5)_sigma-1_rate-0.1        0.308840
standardized_SOM_(5, 5)_sigma-1_rate-0.01       0.308846
standardized_SOM_(6, 5)_sigma-1_rate-0.01       0.311538
standardized_MiniBatchKMeans_30

In [131]:
score = pd.DataFrame(index=modum.index)

In [134]:
score["modum"] = pd.Series(range(1, 312), index=modum.sort_values(ascending=False).index)

In [135]:
score["postcode_class"] = pd.Series(range(1, 312), index=postcode.sort_values(ascending=False).index)
score["jochem"] = pd.Series(range(1, 312), index=jochem.sort_values(ascending=False).index)

In [138]:
score["fragmentation_count"] = pd.Series(range(1, 312), index=fragmentation.sort_values(ascending=True).index)
score["fragmentation_area"] = pd.Series(range(1, 312), index=fragmentation_area.sort_values(ascending=False).index)
score["davies"] = pd.Series(range(1, 312), index=davies_chunks.sort_values(ascending=True).index)
score["silhouette"] = pd.Series(range(1, 312), index=silhouettes_chunks.sort_values(ascending=True).index)
score["calinski"] = pd.Series(range(1, 312), index=calinski_chunks.sort_values(ascending=True).index)

In [141]:
score["total"] = score.sum(axis=1)
score["comparative"] = score.modum + score.postcode_class + score.jochem
score["internal"] = score.davies + score.silhouette + score.calinski
score["fragmentation"] = score.fragmentation_count + score.fragmentation_area
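
The score table replaces each metric with the model's position in a sorted list and adds those positions up, so a lower total is better. The sketch below restates that idea without pandas and with an explicit direction flag per metric; the model names, values, and the helper `rank_positions` are hypothetical. In scikit-learn's conventions the silhouette and Calinski-Harabasz scores are higher-is-better and Davies-Bouldin is lower-is-better, so the sort direction of each series is worth checking against how that series was computed.

```python
def rank_positions(values, higher_is_better):
    """Map each key to its 1-based position after sorting by value (1 = best)."""
    ordered = sorted(values, key=values.get, reverse=higher_is_better)
    return {name: position for position, name in enumerate(ordered, start=1)}


# Hypothetical metric values for three hypothetical models.
silhouette = {"model_a": 0.30, "model_b": 0.10, "model_c": 0.20}  # higher is better
davies = {"model_a": 1.2, "model_b": 0.9, "model_c": 1.5}         # lower is better

ranks = {
    "silhouette": rank_positions(silhouette, higher_is_better=True),
    "davies": rank_positions(davies, higher_is_better=False),
}

# Sum of positions per model; the smallest sum is the best overall.
total = {name: sum(r[name] for r in ranks.values()) for name in silhouette}
best_first = sorted(total, key=total.get)
```

Summing positions weights every metric equally and ignores how far apart the underlying values are, which keeps metrics on very different scales comparable.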

In [144]:
score.total.sort_values()[:20]

standardized_GMM_30                             459
normalized_GMM_30                               553
standardized_SOM_(5, 5)_sigma-0.01_rate-0.2     637
standardized_SOM_(6, 5)_sigma-0.5_rate-0.2      640
standardized_SOM_(5, 5)_sigma-0.1_rate-0.2      641
standardized_GMM_20                             683
standardized_SOM_(6, 5)_sigma-0.5_rate-0.5      683
standardized_SOM_(5, 5)_sigma-0.25_rate-0.05    694
standardized_SOM_(5, 5)_sigma-0.5_rate-0.5      694
standardized_SOM_(6, 5)_sigma-0.1_rate-0.05     698
standardized_SOM_(6, 5)_sigma-0.01_rate-0.05    700
standardized_GMM_15                             709
standardized_SOM_(5, 5)_sigma-0.01_rate-0.05    709
standardized_SOM_(6, 5)_sigma-0.1_rate-0.5      709
standardized_SOM_(5, 5)_sigma-0.1_rate-0.05     711
standardized_SOM_(6, 5)_sigma-0.01_rate-0.5     715
standardized_SOM_(4, 5)_sigma-0.1_rate-0.1      715
standardized_SOM_(6, 5)_sigma-0.25_rate-0.05    716
standardized_SOM_(4, 5)_sigma-0.25_rate-0.1     717
standardized

In [145]:
score.comparative.sort_values()[:20]

standardized_MiniBatchKMeans_30                  7
standardized_GMM_30                             13
standardized_KMeans_30                          14
standardized_KMeans_20                          32
standardized_SOM_(6, 5)_sigma-1_rate-0.01       37
standardized_SOM_(6, 5)_sigma-0.5_rate-0.05     39
standardized_SOM_(6, 5)_sigma-1_rate-0.05       40
standardized_SOM_(6, 5)_sigma-1_rate-0.2        42
standardized_SOM_(6, 5)_sigma-0.5_rate-0.2      45
standardized_SOM_(6, 5)_sigma-0.1_rate-0.05     45
standardized_SOM_(6, 5)_sigma-0.25_rate-0.05    48
standardized_SOM_(6, 5)_sigma-0.01_rate-0.05    48
standardized_SOM_(6, 5)_sigma-1_rate-0.5        49
standardized_SOM_(5, 5)_sigma-1_rate-0.01       57
standardized_SOM_(5, 5)_sigma-0.5_rate-0.5      57
standardized_SOM_(5, 5)_sigma-1_rate-0.5        70
standardized_MiniBatchKMeans_20                 79
standardized_SOM_(6, 5)_sigma-0.5_rate-0.01     80
standardized_SOM_(6, 5)_sigma-1_rate-0.1        83
standardized_KMeans_15         

In [180]:
score.internal.sort_values()[:60]

standardized_SOM_(4, 5)_sigma-0.25_rate-0.5     163
standardized_SOM_(4, 5)_sigma-0.01_rate-0.5     177
standardized_SOM_(4, 5)_sigma-0.1_rate-0.5      178
standardized_SOM_(4, 5)_sigma-0.5_rate-1        185
standardized_SOM_(5, 5)_sigma-0.01_rate-0.5     188
standardized_SOM_(5, 5)_sigma-0.1_rate-0.5      189
standardized_SOM_(6, 5)_sigma-0.1_rate-0.5      199
standardized_SOM_(6, 5)_sigma-0.01_rate-0.5     200
standardized_SOM_(5, 5)_sigma-0.5_rate-1        202
standardized_SOM_(5, 5)_sigma-0.25_rate-0.5     211
standardized_SOM_(6, 5)_sigma-0.01_rate-1       218
standardized_SOM_(6, 5)_sigma-0.1_rate-1        219
standardized_SOM_(6, 5)_sigma-0.25_rate-1       220
standardized_SOM_(6, 5)_sigma-0.25_rate-0.5     223
standardized_SOM_(6, 5)_sigma-0.5_rate-1        237
standardized_SOM_(6, 5)_sigma-0.5_rate-0.5      247
standardized_SOM_(4, 5)_sigma-0.5_rate-0.5      259
standardized_SOM_(3, 4)_sigma-1_rate-0.5        267
standardized_SOM_(5, 5)_sigma-0.25_rate-0.2     269
standardized

In [189]:
score.internal.loc[score.index.str.contains("GMM")].sort_values()[:60]

standardized_GMM_30    389
normalized_GMM_30      397
standardized_GMM_20    439
standardized_GMM_15    472
normalized_GMM_20      500
standardized_GMM_10    538
normalized_GMM_15      556
normalized_GMM_10      586
Name: internal, dtype: int64

In [147]:
score.fragmentation.sort_values()[:20]

standardized_GMM_30                              57
normalized_GMM_30                                58
standardized_GMM_15                              84
normalized_GMM_15                                86
standardized_SOM_(2, 5)_sigma-0.01_rate-0.01     89
standardized_SOM_(2, 5)_sigma-0.25_rate-0.01     91
standardized_SOM_(3, 3)_sigma-1_rate-0.2         91
standardized_SOM_(2, 5)_sigma-0.5_rate-0.01      92
standardized_SOM_(2, 5)_sigma-0.1_rate-0.01      93
normalized_SOM_(3, 4)_sigma-0.5_rate-0.01       104
normalized_SOM_(2, 6)_sigma-0.5_rate-0.01       105
standardized_SOM_(3, 3)_sigma-1_rate-0.1        111
normalized_GMM_20                               113
standardized_SOM_(2, 6)_sigma-1_rate-0.1        114
normalized_SOM_(2, 5)_sigma-0.5_rate-0.01       116
standardized_SOM_(2, 5)_sigma-1_rate-0.2        120
standardized_SOM_(4, 5)_sigma-0.5_rate-0.2      125
standardized_SOM_(2, 5)_sigma-1_rate-0.05       125
standardized_GMM_20                             134
standardized

In [149]:
# Same three Cramér's V series for the models fitted on the sample data.
# op[7:] drops the leading "sample_" from the option name.
sample_ops = [op for op in options if 'sample' in op]

postcode_sample = pd.Series({op[7:]: evaluations[op]['mur_cramers_v'] for op in sample_ops})
jochem_sample = pd.Series({op[7:]: evaluations[op]['joc_cramers_v'] for op in sample_ops})
modum_sample = pd.Series({op[7:]: evaluations[op]['mod_cramers_'] for op in sample_ops})


In [151]:
score_sample = pd.DataFrame(index=modum_sample.index)
score_sample["modum"] = pd.Series(range(1, 310), index=modum_sample.sort_values(ascending=False).index)
score_sample["postcode_class"] = pd.Series(range(1, 310), index=postcode_sample.sort_values(ascending=False).index)
score_sample["jochem"] = pd.Series(range(1, 310), index=jochem_sample.sort_values(ascending=False).index)
score_sample["davies"] = pd.Series(range(1, 310), index=davies_sample.sort_values(ascending=True).index)
score_sample["silhouette"] = pd.Series(range(1, 310), index=silhouettes_sample.sort_values(ascending=True).index)
score_sample["calinski"] = pd.Series(range(1, 310), index=calinski_sample.sort_values(ascending=True).index)
score_sample["total"] = score_sample.sum(axis=1)
score_sample["comparative"] = score_sample.modum + score_sample.postcode_class + score_sample.jochem
score_sample["internal"] = score_sample.davies + score_sample.silhouette + score_sample.calinski

In [152]:
score_sample.total.sort_values()[:20]

standardized_KMeans_30                          340
standardized_MiniBatchKMeans_30                 341
standardized_SOM_(6, 5)_sigma-1_rate-0.01       346
standardized_SOM_(6, 5)_sigma-1_rate-0.05       354
standardized_SOM_(6, 5)_sigma-1_rate-0.1        354
standardized_SOM_(6, 5)_sigma-0.01_rate-0.05    361
standardized_SOM_(6, 5)_sigma-0.1_rate-0.05     361
standardized_SOM_(6, 5)_sigma-1_rate-0.2        371
standardized_SOM_(6, 5)_sigma-0.5_rate-0.01     376
standardized_SOM_(5, 5)_sigma-1_rate-0.05       387
standardized_SOM_(6, 5)_sigma-0.25_rate-0.05    391
standardized_SOM_(6, 5)_sigma-0.5_rate-0.05     392
standardized_SOM_(6, 5)_sigma-0.01_rate-0.01    398
standardized_SOM_(6, 5)_sigma-0.1_rate-0.01     400
standardized_SOM_(6, 5)_sigma-0.5_rate-0.1      401
standardized_SOM_(5, 5)_sigma-1_rate-0.1        402
standardized_SOM_(6, 5)_sigma-0.25_rate-0.01    403
standardized_SOM_(5, 5)_sigma-1_rate-0.01       403
standardized_SOM_(5, 5)_sigma-0.25_rate-0.05    424
standardized

In [178]:
score_sample.comparative.loc[score_sample.index.str.contains('stand')].sort_values()[:20]

standardized_SOM_(6, 5)_sigma-1_rate-0.05       36
standardized_SOM_(6, 5)_sigma-1_rate-0.01       37
standardized_MiniBatchKMeans_30                 39
standardized_SOM_(6, 5)_sigma-0.01_rate-0.05    41
standardized_SOM_(6, 5)_sigma-1_rate-0.1        42
standardized_SOM_(6, 5)_sigma-0.1_rate-0.05     42
standardized_SOM_(5, 5)_sigma-1_rate-0.05       56
standardized_SOM_(6, 5)_sigma-1_rate-0.2        63
standardized_SOM_(4, 5)_sigma-0.1_rate-0.05     63
standardized_SOM_(4, 5)_sigma-1_rate-0.05       65
standardized_SOM_(4, 5)_sigma-0.01_rate-0.05    66
standardized_SOM_(6, 5)_sigma-0.25_rate-0.05    69
standardized_SOM_(6, 5)_sigma-0.5_rate-0.01     69
standardized_SOM_(6, 5)_sigma-0.5_rate-0.05     72
standardized_SOM_(5, 5)_sigma-1_rate-0.01       79
standardized_SOM_(4, 5)_sigma-0.5_rate-0.05     87
standardized_SOM_(5, 5)_sigma-0.01_rate-0.05    88
standardized_SOM_(5, 5)_sigma-0.25_rate-0.05    89
standardized_SOM_(5, 5)_sigma-0.5_rate-0.05     90
standardized_SOM_(5, 5)_sigma-1

In [154]:
score_sample.internal.sort_values()[:20]

standardized_SOM_(6, 5)_sigma-0.1_rate-0.2      208
standardized_SOM_(6, 5)_sigma-0.01_rate-0.2     209
standardized_SOM_(6, 5)_sigma-0.25_rate-0.2     210
standardized_KMeans_30                          218
standardized_SOM_(6, 5)_sigma-0.5_rate-0.5      235
standardized_SOM_(5, 5)_sigma-0.5_rate-0.2      242
standardized_SOM_(5, 5)_sigma-0.25_rate-0.2     258
standardized_SOM_(6, 5)_sigma-0.5_rate-0.2      259
standardized_SOM_(5, 5)_sigma-0.1_rate-0.2      259
standardized_SOM_(5, 5)_sigma-0.01_rate-0.2     263
standardized_SOM_(5, 5)_sigma-0.5_rate-0.5      264
standardized_SOM_(6, 5)_sigma-0.25_rate-0.01    284
standardized_SOM_(6, 5)_sigma-0.01_rate-0.01    289
standardized_SOM_(6, 5)_sigma-0.1_rate-0.01     290
standardized_SOM_(6, 5)_sigma-0.5_rate-1        295
standardized_SOM_(5, 5)_sigma-0.01_rate-0.1     297
standardized_SOM_(5, 5)_sigma-0.1_rate-0.1      298
standardized_SOM_(5, 5)_sigma-1_rate-0.5        298
standardized_SOM_(6, 5)_sigma-1_rate-1          298
standardized

In [160]:
(score.total + score_sample.total).loc[score.index.intersection(score_sample.index)].sort_values()[:40]

standardized_SOM_(6, 5)_sigma-0.1_rate-0.05     1059.0
standardized_MiniBatchKMeans_30                 1059.0
standardized_SOM_(6, 5)_sigma-0.01_rate-0.05    1061.0
standardized_SOM_(6, 5)_sigma-1_rate-0.01       1095.0
standardized_KMeans_30                          1103.0
standardized_SOM_(6, 5)_sigma-0.25_rate-0.05    1107.0
standardized_SOM_(5, 5)_sigma-0.25_rate-0.05    1118.0
standardized_SOM_(6, 5)_sigma-0.5_rate-0.01     1133.0
standardized_SOM_(5, 5)_sigma-0.01_rate-0.05    1144.0
standardized_SOM_(6, 5)_sigma-1_rate-0.05       1147.0
standardized_SOM_(5, 5)_sigma-0.1_rate-0.05     1148.0
standardized_SOM_(6, 5)_sigma-0.5_rate-0.05     1154.0
standardized_SOM_(6, 5)_sigma-1_rate-0.2        1162.0
standardized_SOM_(6, 5)_sigma-0.25_rate-0.01    1184.0
standardized_SOM_(6, 5)_sigma-0.01_rate-0.1     1189.0
standardized_SOM_(5, 5)_sigma-1_rate-0.01       1194.0
standardized_SOM_(6, 5)_sigma-0.1_rate-0.1      1194.0
standardized_SOM_(6, 5)_sigma-0.1_rate-0.01     1196.0
standardiz

In [161]:
(score.comparative + score_sample.comparative).loc[score.index.intersection(score_sample.index)].sort_values()[:40]

standardized_MiniBatchKMeans_30                  46.0
standardized_SOM_(6, 5)_sigma-1_rate-0.01        74.0
standardized_SOM_(6, 5)_sigma-1_rate-0.05        76.0
standardized_SOM_(6, 5)_sigma-0.1_rate-0.05      87.0
standardized_SOM_(6, 5)_sigma-0.01_rate-0.05     89.0
standardized_SOM_(6, 5)_sigma-1_rate-0.2        105.0
standardized_SOM_(6, 5)_sigma-0.5_rate-0.05     111.0
standardized_SOM_(6, 5)_sigma-0.25_rate-0.05    117.0
standardized_SOM_(6, 5)_sigma-1_rate-0.1        125.0
standardized_SOM_(5, 5)_sigma-1_rate-0.01       136.0
standardized_KMeans_30                          136.0
standardized_SOM_(6, 5)_sigma-0.5_rate-0.01     149.0
standardized_SOM_(6, 5)_sigma-0.5_rate-0.1      187.0
standardized_SOM_(5, 5)_sigma-0.01_rate-0.05    188.0
standardized_SOM_(5, 5)_sigma-0.25_rate-0.05    188.0
standardized_SOM_(5, 5)_sigma-0.1_rate-0.05     192.0
standardized_SOM_(4, 5)_sigma-1_rate-0.05       197.0
standardized_SOM_(5, 5)_sigma-1_rate-0.1        209.0
standardized_SOM_(5, 5)_sigm

In [162]:
(score.internal + score_sample.internal).loc[score.index.intersection(score_sample.index)].sort_values()[:40]

standardized_SOM_(6, 5)_sigma-0.5_rate-0.5      482.0
standardized_SOM_(4, 5)_sigma-0.25_rate-0.5     505.0
standardized_SOM_(6, 5)_sigma-0.01_rate-0.5     512.0
standardized_SOM_(6, 5)_sigma-0.1_rate-0.5      514.0
standardized_SOM_(5, 5)_sigma-0.01_rate-0.5     514.0
standardized_SOM_(5, 5)_sigma-0.1_rate-0.5      516.0
standardized_SOM_(4, 5)_sigma-0.01_rate-0.5     520.0
standardized_SOM_(4, 5)_sigma-0.1_rate-0.5      522.0
standardized_SOM_(5, 5)_sigma-0.25_rate-0.5     526.0
standardized_SOM_(5, 5)_sigma-0.25_rate-0.2     527.0
standardized_SOM_(6, 5)_sigma-0.5_rate-1        532.0
standardized_SOM_(6, 5)_sigma-0.1_rate-0.2      532.0
standardized_SOM_(6, 5)_sigma-0.01_rate-0.2     534.0
standardized_SOM_(6, 5)_sigma-0.25_rate-0.2     536.0
standardized_SOM_(6, 5)_sigma-0.25_rate-0.5     541.0
standardized_SOM_(5, 5)_sigma-0.1_rate-0.2      550.0
standardized_SOM_(5, 5)_sigma-0.01_rate-0.2     551.0
standardized_SOM_(5, 5)_sigma-0.5_rate-0.5      553.0
standardized_SOM_(5, 5)_sigm

In [164]:
evaluations["chunks_standardized_GMM_30"]['frequencies']

17    24262
18    21483
12    20069
28    19235
13    17831
5     17506
11    16281
6     16210
3     15925
27    13116
19    12653
1      8949
15     8100
16     8015
9      7612
2      7590
14     6497
0      6306
24     5686
23     5089
29     4582
20     3989
21     3505
26     3180
4      1669
25      610
8       361
10      221
22      179
7        86
dtype: int64

In [165]:
evaluations["chunks_standardized_KMeans_30"]['frequencies']

23    32317
0     28309
15    25717
4     23572
18    18582
7     17673
29    16669
11    15973
20    15864
1     14967
5     14199
13    10040
22     7694
25     7138
9      5783
3      4641
21     3207
19     3031
27     2957
14     2012
16     1952
12     1629
28      776
2       559
17      409
24      339
26      302
6       221
10      179
8        86
dtype: int64

In [198]:
labels["chunks_standardized_KMeans_30"]

array([13, 13, 13, ..., 21, 21, 21], dtype=int32)

In [202]:
from sklearn.cluster import KMeans

# Fit on the sample, then label the chunk data with the fitted model.
data = pd.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/sample/sample_standardized_data.pq").values
chunk_data = pd.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/sample/chunks_standardized_data.pq").values

%time km = KMeans(n_clusters=30, n_init=10, random_state=42).fit(data)
%time labels_ = km.predict(chunk_data)

# geom51 = gpd.read_parquet("../../urbangrammar_samba/spatial_signatures/tessellation/tess_51.pq", columns=["tessellation", "hindex"])
# geom68 = gpd.read_parquet("../../urbangrammar_samba/spatial_signatures/tessellation/tess_68.pq", columns=["tessellation", "hindex"])
# geom = pd.concat([geom51, geom68]).reset_index(drop=True).rename_geometry("geometry")

geom['labels'] = labels_        

CPU times: user 17min 16s, sys: 4min 24s, total: 21min 41s
Wall time: 1min 28s
CPU times: user 5.76 s, sys: 0 ns, total: 5.76 s
Wall time: 605 ms


In [203]:
data.shape

(250000, 331)

In [204]:
chunk_data.shape

(276797, 331)
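
Cell In [202] fits K-Means on the 250,000-row sample and then labels the 276,797 chunk rows with `predict`, which assigns each row to its nearest fitted centroid. A dependency-free sketch of that assignment step follows; the 2-D centroids and points are made up, and `assign_to_nearest` is a hypothetical helper rather than part of scikit-learn.

```python
def assign_to_nearest(points, centroids):
    """Return, for each point, the index of the closest centroid (squared Euclidean)."""
    labels = []
    for point in points:
        distances = [
            sum((p - c) ** 2 for p, c in zip(point, centroid))
            for centroid in centroids
        ]
        labels.append(distances.index(min(distances)))
    return labels


# Hypothetical centroids, standing in for the `cluster_centers_` of a fitted model.
centroids = [(0.0, 0.0), (10.0, 10.0)]
# Hypothetical unseen rows, standing in for `chunk_data`.
new_points = [(1.0, -1.0), (9.0, 11.0), (4.0, 4.0)]

new_labels = assign_to_nearest(new_points, centroids)
```

Because the centroids come only from the sample, labels obtained this way can differ from those of a model fitted directly on the chunk data, which is what the two sets of maps make it possible to compare.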

In [193]:
from sklearn import metrics
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
import contextily as ctx
import urbangrammar_graphics as ugg
import dask_geopandas
from utils.dask_geopandas import dask_dissolve

In [205]:
ddf = dask_geopandas.from_geopandas(geom.sort_values('labels'), npartitions=64)
spsig = dask_dissolve(ddf, by='labels').compute().reset_index(drop=True).explode()

cmap = ugg.get_colormap(spsig.labels.nunique(), randomize=True)
token = "pk.eyJ1IjoibWFydGluZmxlaXMiLCJhIjoiY2tsNmhlemtxMmlicTJubXN6and5aTc2NCJ9.l7nSUXM7ZRjAWTB7oXiswQ"

ax = spsig.cx[332971:361675, 379462:404701].plot("labels", figsize=(20, 20), zorder=1, linewidth=.3, edgecolor='w', alpha=1, legend=True, cmap=cmap, categorical=True)
ctx.add_basemap(ax, crs=27700, source=ugg.get_tiles('roads', token), zorder=2, alpha=.3)
ctx.add_basemap(ax, crs=27700, source=ugg.get_tiles('labels', token), zorder=3, alpha=1)
ctx.add_basemap(ax, crs=27700, source=ugg.get_tiles('background', token), zorder=-1, alpha=1)
ax.set_axis_off()

plt.savefig("../../urbangrammar_samba/spatial_signatures/clustering_data/validation/maps/KMeans_predicted_lpool.png")
plt.close()

ax = spsig.cx[218800:270628, 645123:695069].plot("labels", figsize=(20, 20), zorder=1, linewidth=.3, edgecolor='w', alpha=1, legend=True, cmap=cmap, categorical=True)
ctx.add_basemap(ax, crs=27700, source=ugg.get_tiles('roads', token), zorder=2, alpha=.3)
ctx.add_basemap(ax, crs=27700, source=ugg.get_tiles('labels', token), zorder=3, alpha=1)
ctx.add_basemap(ax, crs=27700, source=ugg.get_tiles('background', token), zorder=-1, alpha=1)
ax.set_axis_off()
plt.savefig("../../urbangrammar_samba/spatial_signatures/clustering_data/validation/maps/KMeans_predicted_gla.png")
plt.close()

In [195]:
%time km = KMeans(n_clusters=30, n_init=100, random_state=42).fit(chunk_data)
%time labels_ = km.labels_

CPU times: user 2h 42min 15s, sys: 44min 14s, total: 3h 26min 30s
Wall time: 14min 20s
CPU times: user 104 µs, sys: 0 ns, total: 104 µs
Wall time: 9.3 µs


In [196]:
geom['labels'] = labels_        

ddf = dask_geopandas.from_geopandas(geom.sort_values('labels'), npartitions=64)
spsig = dask_dissolve(ddf, by='labels').compute().reset_index(drop=True).explode()

cmap = ugg.get_colormap(spsig.labels.nunique(), randomize=True)
token = "pk.eyJ1IjoibWFydGluZmxlaXMiLCJhIjoiY2tsNmhlemtxMmlicTJubXN6and5aTc2NCJ9.l7nSUXM7ZRjAWTB7oXiswQ"

ax = spsig.cx[332971:361675, 379462:404701].plot("labels", figsize=(20, 20), zorder=1, linewidth=.3, edgecolor='w', alpha=1, legend=True, cmap=cmap, categorical=True)
ctx.add_basemap(ax, crs=27700, source=ugg.get_tiles('roads', token), zorder=2, alpha=.3)
ctx.add_basemap(ax, crs=27700, source=ugg.get_tiles('labels', token), zorder=3, alpha=1)
ctx.add_basemap(ax, crs=27700, source=ugg.get_tiles('background', token), zorder=-1, alpha=1)
ax.set_axis_off()

plt.savefig("../../urbangrammar_samba/spatial_signatures/clustering_data/validation/maps/KMeans30_100_lpool.png")
plt.close()

ax = spsig.cx[218800:270628, 645123:695069].plot("labels", figsize=(20, 20), zorder=1, linewidth=.3, edgecolor='w', alpha=1, legend=True, cmap=cmap, categorical=True)
ctx.add_basemap(ax, crs=27700, source=ugg.get_tiles('roads', token), zorder=2, alpha=.3)
ctx.add_basemap(ax, crs=27700, source=ugg.get_tiles('labels', token), zorder=3, alpha=1)
ctx.add_basemap(ax, crs=27700, source=ugg.get_tiles('background', token), zorder=-1, alpha=1)
ax.set_axis_off()
plt.savefig("../../urbangrammar_samba/spatial_signatures/clustering_data/validation/maps/KMeans30_100_gla.png")
plt.close()

## Full scale

In [5]:
import numpy as np

In [2]:
standardized_form = dask.dataframe.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/form/standardized/").set_index('hindex')
stand_fn = dask.dataframe.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/function/standardized/")

In [6]:
# Join form and function column-wise; infinite values become NaN and all NaN become 0.
data = dask.dataframe.concat([standardized_form, stand_fn], axis=1).replace([np.inf, -np.inf], np.nan).fillna(0)

In [8]:
%time data = data.compute()

CPU times: user 2min 37s, sys: 1min 25s, total: 4min 2s
Wall time: 2min 44s


In [9]:
from sklearn.cluster import KMeans, MiniBatchKMeans

In [10]:
data

hindex,sdbAre_q1,sdbAre_q2,sdbAre_q3,sdbPer_q1,sdbPer_q2,sdbPer_q3,sdbCoA_q1,sdbCoA_q2,sdbCoA_q3,ssbCCo_q1,...,Code_18_521_q2,Code_18_334_q3,Code_18_244_q1,Code_18_244_q2,Code_18_331_q3,Code_18_132_q2,Code_18_132_q3,Code_18_521_q1,Code_18_222_q2,Code_18_521_q3
c000e094707t0000,-0.947406,-0.371977,0.020285,-0.901199,-0.237045,-0.023143,-0.000419,-0.001515,-0.010221,-0.046170,...,0.0,0.0,0.0,0.0,-0.008758,0.0,-0.000679,0.0,-0.009142,0.0
c000e094763t0000,-0.913567,-0.420861,-0.271703,-0.903627,-0.428003,-0.336729,-0.000419,-0.001515,-0.010221,-0.035325,...,0.0,0.0,0.0,0.0,-0.008758,0.0,-0.000679,0.0,-0.009142,0.0
c000e094763t0001,-0.878137,-0.411587,-0.284021,-0.900393,-0.416250,-0.350010,-0.000419,-0.001515,-0.010221,-0.034917,...,0.0,0.0,0.0,0.0,-0.008758,0.0,-0.000679,0.0,-0.009142,0.0
c000e094763t0002,-0.952475,-0.421566,-0.283919,-0.968400,-0.429947,-0.343165,-0.000419,-0.001515,-0.010221,-0.065649,...,0.0,0.0,0.0,0.0,-0.008758,0.0,-0.000679,0.0,-0.009142,0.0
c000e094764t0000,-0.964878,-0.420861,-0.271703,-0.972440,-0.420006,-0.315861,-0.000419,-0.001515,-0.010221,-0.066832,...,0.0,0.0,0.0,0.0,-0.008758,0.0,-0.000679,0.0,-0.009142,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
c102e644989t0111,-0.311466,-0.431706,-0.373463,-0.082269,-0.459270,-0.389532,-0.000419,-0.001515,-0.010221,0.132837,...,0.0,0.0,0.0,0.0,-0.008758,0.0,-0.000679,0.0,-0.009142,0.0
c102e644989t0112,-0.326671,-0.461825,-0.371855,-0.149873,-0.528701,-0.386678,-0.000419,-0.001515,-0.010221,0.136559,...,0.0,0.0,0.0,0.0,-0.008758,0.0,-0.000679,0.0,-0.009142,0.0
c102e644989t0113,-0.094236,-0.364761,-0.304254,0.024972,-0.347371,-0.283669,-0.000419,-0.001515,-0.010221,0.021411,...,0.0,0.0,0.0,0.0,-0.008758,0.0,-0.000679,0.0,-0.009142,0.0
c102e644989t0114,-0.477667,-0.568464,-0.390033,-0.600170,-0.646516,-0.472676,-0.000419,-0.001515,-0.010221,0.424887,...,0.0,0.0,0.0,0.0,-0.008758,0.0,-0.000679,0.0,-0.009142,0.0


In [11]:
%time km = KMeans(n_clusters=20, n_init=1, random_state=42).fit(data)

CPU times: user 27min 19s, sys: 1min 9s, total: 28min 29s
Wall time: 5min 11s


In [12]:
%time kmb = MiniBatchKMeans(n_clusters=20, n_init=1, random_state=42, batch_size=1_000_000).fit(data)

CPU times: user 5min 13s, sys: 3min 53s, total: 9min 7s
Wall time: 1min 40s
