# Constrained K-Means demo

## H2O K-Means algorithm

K-Means falls in the general category of clustering algorithms. Clustering is a form of unsupervised learning that tries to find structure in the data without using any labels or target values. Clustering partitions a set of observations into separate groups such that an observation in a given group is more similar to the other observations in the same group than to observations in a different group.

![kmeans](https://media0.giphy.com/media/12vVAGkaqHUqCQ/giphy.gif?cid=790b7611178aaedddb5b58de2ef94d55dc6c3feecd2d02f2&rid=giphy.gif)

More about H2O K-means Clustering: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/k-means.html
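
For reference, the standard (unconstrained) procedure is Lloyd's iteration: assign every point to its nearest centroid, then move each centroid to the mean of its members, and repeat. The following is a minimal pure-Python sketch of that idea, not H2O's implementation; the function names `nearest_labels` and `lloyd_kmeans` are made up for this illustration.

```python
def nearest_labels(points, centers):
    """Assignment step: index of the closest center for every point."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def lloyd_kmeans(points, centers, max_iterations=10):
    """Illustrative Lloyd iteration (hypothetical helper, not the H2O code)."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iterations):
        labels = nearest_labels(points, centers)
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                # Update step: the centroid becomes the mean of its members.
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Keep an empty cluster's centroid where it was.
                new_centers.append(old)
        if new_centers == centers:
            break  # converged: nothing moved
        centers = new_centers
    return centers, nearest_labels(points, centers)


centers, labels = lloyd_kmeans([(0.0,), (1.0,), (9.0,), (10.0,)], [(0.0,), (10.0,)])
print(centers, labels)
```

Note that nothing in this loop controls how many points end up in each cluster, which is the gap the constrained variant below addresses.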

## Constrained K-Means algorithm in H2O

Using the `cluster_size_constraints` parameter, a user can set the minimum size of each cluster during training by passing an array of numbers. The length of the array must be equal to the `k` parameter.
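
A quick sanity check of the constraint array before training can save a long wait. The sketch below is a hypothetical helper (`check_cluster_size_constraints` is not part of the H2O API). Only the length rule comes from the text above; the non-negativity and sum checks are assumptions that follow from the values being minimum sizes, and H2O's own validation may differ.

```python
def check_cluster_size_constraints(constraints, k, n_rows):
    """Return a list of problems; an empty list means the constraints look usable."""
    problems = []
    if len(constraints) != k:
        problems.append(f"expected {k} values, got {len(constraints)}")
    # Assumption: a minimum size cannot be negative.
    if any(c < 0 for c in constraints):
        problems.append("minimum sizes must be non-negative")
    # Assumption: the minimum sizes cannot require more rows than exist.
    if sum(constraints) > n_rows:
        problems.append(
            f"minimum sizes sum to {sum(constraints)}, but there are only {n_rows} rows"
        )
    return problems


# Values used later in this notebook for the Chicago weather data.
print(check_cluster_size_constraints([1000, 2000, 1000], k=3, n_rows=5162))
```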

To satisfy the user-defined minimum cluster sizes, the calculation of clusters is converted to a Minimum Cost Flow problem. Instead of using Lloyd's iteration, a graph is constructed from the distances between points and centroids and from the constraints. The goal is to go iteratively through the input edges and create an optimal spanning tree that satisfies the constraints.

![mcf](https://adared.ch/wp-content/uploads/2015/11/mcf.png)

How to convert the standard K-means algorithm to a Minimum Cost Flow problem is described in this paper: https://pdfs.semanticscholar.org/ecad/eb93378d7911c2f7b9bd83a8af55d7fa9e06.pdf.
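
To make the flow formulation more concrete, here is a small pure-Python sketch of one constrained assignment step. It is an illustration under stated assumptions, not the H2O implementation: every point sends one unit of flow to a cluster at a cost equal to the squared distance, each cluster must pass its minimum size directly to the sink, and any surplus goes through an extra "overflow" node. The solver is a plain successive-shortest-path routine, and the function name `assign_with_min_sizes` is hypothetical.

```python
from collections import deque


def assign_with_min_sizes(points, centers, min_sizes):
    """Cheapest assignment of points to centers with minimum cluster sizes."""
    n, k = len(points), len(centers)
    # Node ids: 0 = source, 1..n = points, n+1..n+k = clusters, then overflow, sink.
    source, overflow, sink = 0, n + k + 1, n + k + 2
    graph = [[] for _ in range(n + k + 3)]
    to, cap, cost = [], [], []

    def add_edge(u, v, capacity, weight):
        # Forward edges get even indices, their reverse edges the next odd index.
        graph[u].append(len(to)); to.append(v); cap.append(capacity); cost.append(weight)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-weight)

    for i, p in enumerate(points):
        add_edge(source, 1 + i, 1, 0.0)
        for j, c in enumerate(centers):
            add_edge(1 + i, n + 1 + j, 1, sum((a - b) ** 2 for a, b in zip(p, c)))
    for j, minimum in enumerate(min_sizes):
        add_edge(n + 1 + j, sink, minimum, 0.0)   # must be filled: the constraint
        add_edge(n + 1 + j, overflow, n, 0.0)     # surplus members
    add_edge(overflow, sink, n - sum(min_sizes), 0.0)

    while True:
        # Shortest path in the residual graph (queue-based Bellman-Ford).
        dist = [float("inf")] * len(graph)
        prev_edge = [-1] * len(graph)
        in_queue = [False] * len(graph)
        dist[source] = 0.0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            in_queue[u] = False
            for e in graph[u]:
                if cap[e] > 0 and dist[u] + cost[e] < dist[to[e]] - 1e-12:
                    dist[to[e]] = dist[u] + cost[e]
                    prev_edge[to[e]] = e
                    if not in_queue[to[e]]:
                        in_queue[to[e]] = True
                        queue.append(to[e])
        if prev_edge[sink] == -1:
            break  # no augmenting path left: every point is assigned
        v = sink
        while v != source:  # each path carries exactly one point
            e = prev_edge[v]
            cap[e] -= 1
            cap[e ^ 1] += 1
            v = to[e ^ 1]

    # A saturated point-to-cluster edge marks the chosen cluster.
    return [to[e] - n - 1
            for i in range(n)
            for e in graph[1 + i]
            if e % 2 == 0 and n + 1 <= to[e] <= n + k and cap[e] == 0]


points = [(0.0,), (1.0,), (2.0,), (10.0,)]
print(assign_with_min_sizes(points, [(0.0,), (10.0,)], [1, 2]))
```

In this toy example the second cluster would naturally get only one point; the minimum size of 2 forces the cheapest additional point over to it. The sketch assumes the minimum sizes sum to at most the number of points.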

**The Minimum Cost Flow problem can be solved efficiently in polynomial time. However, the current implementation of the Constrained K-means algorithm is slow because of many repeated calculations that cannot be parallelized or further optimized in the H2O backend.**

Expected training time for datasets of various sizes:
* 5 000 rows, 5 features   ~ 0h  4m  3s
* 10 000 rows, 5 features  ~ 0h  9m 21s
* 15 000 rows, 5 features  ~ 0h 22m 25s
* 20 000 rows, 5 features  ~ 0h 39m 27s
* 25 000 rows, 5 features  ~ 1h 06m  8s
* 30 000 rows, 5 features  ~ 1h 26m 43s
* 35 000 rows, 5 features  ~ 1h 44m  7s
* 40 000 rows, 5 features  ~ 2h 13m 31s
* 45 000 rows, 5 features  ~ 2h  4m 29s
* 50 000 rows, 5 features  ~ 4h  4m 18s

(OS debian 10.0 (x86-64), processor Intel© Core™ i7-7700HQ CPU @ 2.80GHz × 4, RAM 23.1 GiB)
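
As a rough reading of the list above, the growth rate can be estimated from the first and last measurements. This is only back-of-the-envelope arithmetic on two data points (the intermediate timings are noisy), not a complexity analysis of the algorithm.

```python
import math

# First and last entries of the timing list above.
rows_small, seconds_small = 5_000, 4 * 60 + 3                 # 0h 4m 3s
rows_large, seconds_large = 50_000, 4 * 3600 + 4 * 60 + 18    # 4h 4m 18s

# If time ~ rows^p, then p = log(time ratio) / log(row ratio).
exponent = math.log(seconds_large / seconds_small) / math.log(rows_large / rows_small)
print(f"time grows roughly like rows^{exponent:.2f}")
```

On these two measurements the exponent lands between linear and quadratic, which is why halving the number of rows pays off more than proportionally.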

## Shorter time using Aggregator Model

To solve Constrained K-means in a shorter time, you can use the H2O Aggregator model to reduce the data to a smaller size first and then pass the aggregated data to the Constrained K-means model to calculate the final centroids used for scoring. The results won't be as accurate as those from a model trained on the whole dataset. However, this approach should help with huge datasets.

There are some assumptions and caveats:
* the large dataset has to consist of many similar data points; if it does not, the insensitive aggregation can break the structure of the dataset
* the resulting clustering may not meet the initial constraints exactly when scoring (this also applies to the Constrained K-means model itself: scoring uses only the resulting centroids, not the constraints defined for training)
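
Because scoring ignores the constraints, it can be worth counting the predicted labels afterwards. The sketch below is a hypothetical helper (`unmet_constraints` is not an H2O function). It assumes the labels are 0-based cluster indices in the same order as the constraint array, for example the `predict` column of `model.predict(frame).as_data_frame()` as used later in this notebook.

```python
from collections import Counter


def unmet_constraints(predicted_labels, min_sizes):
    """Map cluster index -> (actual size, required minimum) for violated constraints."""
    counts = Counter(predicted_labels)
    return {
        cluster: (counts.get(cluster, 0), minimum)
        for cluster, minimum in enumerate(min_sizes)
        if counts.get(cluster, 0) < minimum
    }


# Toy labels: cluster 1 has a single member but needs at least two.
print(unmet_constraints([0, 0, 0, 1, 2, 2], [2, 2, 2]))
```

An empty dictionary means every cluster reached its minimum size on the scored data.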

The H2O Aggregator method is a clustering-based method for reducing a numerical/categorical dataset into a dataset with fewer rows. Aggregator maintains outliers as outliers but lumps together dense clusters into exemplars with an attached count column showing the member points.

More about H2O Aggregator: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/aggregator.html
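
The idea of exemplars with member counts can be sketched in a few lines. This is a deliberately simplified illustration with a fixed radius; the real Aggregator differs (for example, in this notebook it is driven by `target_num_exemplars` rather than a radius), and the function name `aggregate` is made up here.

```python
def aggregate(points, radius):
    """Greedy exemplar aggregation: a point joins the first exemplar within
    `radius`, otherwise it becomes a new exemplar with a count of 1."""
    exemplars, counts = [], []
    for p in points:
        for idx, e in enumerate(exemplars):
            if sum((a - b) ** 2 for a, b in zip(p, e)) <= radius ** 2:
                counts[idx] += 1
                break
        else:
            # No exemplar is close enough, so the point stays as its own exemplar.
            exemplars.append(p)
            counts.append(1)
    return exemplars, counts


exemplars, counts = aggregate([(0, 0), (0.1, 0), (0, 0.1), (5, 5)], radius=0.5)
print(exemplars, counts)
```

The dense group near the origin collapses into one exemplar with a count of 3, while the isolated point stays an exemplar of its own, which mirrors the "outliers stay outliers" behaviour described above.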

In [2]:
# Import the h2o library and the K-means estimator
import h2o
from h2o.estimators import H2OKMeansEstimator

# init h2o cluster
h2o.init(strict_version_check=False)

versionFromGradle='3.29.0',projectVersion='3.29.0.99999',branch='maurever_PUBDEV-6447_constrained_kmeans_improvement',lastCommitHash='8c5a57d89b9a99dbd0decc4703f1d48854f8af79',gitDescribe='jenkins-master-4911-2-g8c5a57d89b-dirty',compiledOn='2020-02-04 17:04:53',compiledBy='mori'
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.3" 2019-04-16; OpenJDK Runtime Environment (build 11.0.3+7-post-Debian-5); OpenJDK 64-Bit Server VM (build 11.0.3+7-post-Debian-5, mixed mode, sharing)
  Starting server from /home/mori/Documents/h2o/code/h2o-3/build/h2o.jar
  Ice root: /tmp/tmpkw494e74
  JVM stdout: /tmp/tmpkw494e74/h2o_mori_started_from_python.out
  JVM stderr: /tmp/tmpkw494e74/h2o_mori_started_from_python.err
  Server is running at http://127.0.0.1:54325
Connecting to H2O server at http://127.0.0.1:54325 ... successful.


0,1
H2O cluster uptime:,01 secs
H2O cluster timezone:,Europe/Berlin
H2O data parsing timezone:,UTC
H2O cluster version:,3.29.0.99999
H2O cluster version age:,18 hours and 32 minutes
H2O cluster name:,H2O_from_python_mori_zoid7w
H2O cluster total nodes:,1
H2O cluster free memory:,5.768 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,8


## Data - Chicago Weather dataset

- 5162 rows
- 5 features (month, day, year, maximal temperature, mean temperature)

In [3]:
# load data
import pandas as pd

data = pd.read_csv("../../smalldata/chicago/chicagoAllWeather.csv")
data = data.iloc[:,[1, 2, 3, 4, 5]]
print(data.shape)
data.head()

(5162, 5)


Unnamed: 0,month,day,year,maxTemp,meanTemp
0,1,1,2001,23.0,14.0
1,1,2,2001,18.0,12.0
2,1,3,2001,28.0,18.0
3,1,4,2001,30.0,24.0
4,1,5,2001,36.0,30.0


In [4]:
# import timer utilities to measure elapsed training time
from timeit import default_timer as timer
from datetime import timedelta

start = timer()
end = timer()
print("Time:", timedelta(seconds=end-start))

Time: 0:00:00.000011


## Traditional K-means

In [5]:
data_h2o = h2o.H2OFrame(data)

# run h2o Kmeans
h2o_km = H2OKMeansEstimator(k=3, init="furthest")

start = timer()
h2o_km.train(training_frame=data_h2o)
end = timer()

# show details
h2o_km.show()
time_km = timedelta(seconds=end-start)
print("Time:", time_km)

Parse progress: |█████████████████████████████████████████████████████████| 100%
kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1580898365102_1


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_rows,number_of_clusters,number_of_categorical_columns,number_of_iterations,within_cluster_sum_of_squares,total_sum_of_squares,between_cluster_sum_of_squares
0,,5162.0,3.0,0.0,10.0,13948.963612,25779.0,11830.036388




ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 13948.765713780385
Total Sum of Square Error to Grand Mean: 25778.999972296842
Between Cluster Sum of Square Error: 11830.234258516457

Centroid Statistics: 


Unnamed: 0,Unnamed: 1,centroid,size,within_cluster_sum_of_squares
0,,1.0,1135.0,3050.489645
1,,2.0,2500.0,6556.670899
2,,3.0,1527.0,4341.60517



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,iterations,number_of_reassigned_observations,within_cluster_sum_of_squares
0,,2020-02-05 11:26:08,0.074 sec,0.0,,
1,,2020-02-05 11:26:09,0.351 sec,1.0,5162.0,27011.092942
2,,2020-02-05 11:26:09,0.372 sec,2.0,817.0,16391.948801
3,,2020-02-05 11:26:09,0.382 sec,3.0,413.0,15616.684815
4,,2020-02-05 11:26:09,0.390 sec,4.0,328.0,15290.133936
5,,2020-02-05 11:26:09,0.396 sec,5.0,363.0,14887.276306
6,,2020-02-05 11:26:09,0.402 sec,6.0,297.0,14369.427876
7,,2020-02-05 11:26:09,0.409 sec,7.0,162.0,14036.249143
8,,2020-02-05 11:26:09,0.414 sec,8.0,56.0,13958.558922
9,,2020-02-05 11:26:09,0.420 sec,9.0,26.0,13950.311099


Time: 0:00:00.767539


## Constrained K-means

In [6]:
data_h2o = h2o.H2OFrame(data)

# run h2o Kmeans
h2o_km_co = H2OKMeansEstimator(k=3, init="furthest", cluster_size_constraints=[1000, 2000, 1000], standardize=True)
start = timer()
h2o_km_co.train(training_frame=data_h2o)
end = timer()

# show details
h2o_km_co.show()
time_km_co = timedelta(seconds=end-start)
print("Time:", time_km_co)

Parse progress: |█████████████████████████████████████████████████████████| 100%
kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1580898365102_2


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_rows,number_of_clusters,number_of_categorical_columns,number_of_iterations,within_cluster_sum_of_squares,total_sum_of_squares,between_cluster_sum_of_squares
0,,5162.0,3.0,0.0,8.0,14652.189098,25779.0,11126.810902




ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 14652.189098266594
Total Sum of Square Error to Grand Mean: 25778.999999999396
Between Cluster Sum of Square Error: 11126.810901732802

Centroid Statistics: 


Unnamed: 0,Unnamed: 1,centroid,size,within_cluster_sum_of_squares
0,,1.0,2015.0,4892.601649
1,,2.0,2000.0,6659.594311
2,,3.0,1147.0,3099.993138



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,iterations,number_of_reassigned_observations,within_cluster_sum_of_squares
0,,2020-02-05 11:26:09,0.010 sec,0.0,,
1,,2020-02-05 11:26:49,39.186 sec,1.0,5162.0,38012.380888
2,,2020-02-05 11:27:12,1 min 2.138 sec,2.0,893.0,17374.936164
3,,2020-02-05 11:27:34,1 min 24.316 sec,3.0,512.0,15135.316601
4,,2020-02-05 11:27:56,1 min 46.839 sec,4.0,173.0,14697.735262
5,,2020-02-05 11:28:21,2 min 12.075 sec,5.0,53.0,14656.618077
6,,2020-02-05 11:28:42,2 min 32.806 sec,6.0,16.0,14652.600795
7,,2020-02-05 11:29:04,2 min 54.947 sec,7.0,3.0,14652.197408
8,,2020-02-05 11:29:27,3 min 17.237 sec,8.0,0.0,14652.189098


Time: 0:03:17.919621


## Constrained K-means on data reduced with Aggregator - 1/2 of original data size
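
When the data is aggregated, the minimum cluster sizes have to shrink with it. The cells in this notebook pick the scaled values by hand; the sketch below shows one way to derive them from the actual number of aggregated rows. The helper `scale_constraints` is hypothetical, and rounding down is a design choice made here so that the scaled minimums never sum to more than the aggregated row count.

```python
def scale_constraints(constraints, original_rows, aggregated_rows):
    """Scale minimum cluster sizes proportionally, rounding down."""
    # Integer arithmetic avoids floating-point surprises at exact boundaries.
    return [c * aggregated_rows // original_rows for c in constraints]


# Row counts reported in the model summaries of this notebook.
print(scale_constraints([1000, 2000, 1000], original_rows=5162, aggregated_rows=2564))
print(scale_constraints([1000, 2000, 1000], original_rows=5162, aggregated_rows=1298))
```

The Aggregator only approximates `target_num_exemplars`, so scaling by the actual row count of the aggregated frame is safer than scaling by the target.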

In [7]:
from h2o.estimators.aggregator import H2OAggregatorEstimator

# original data size 5162, constraints 1000, 2000, 1000
# aggregated data size ~ 2581, constraints 500, 1000, 500

params = {
    "target_num_exemplars": 2581,
    "rel_tol_num_exemplars": 0.01,
    "categorical_encoding": "eigen"
}
agg = H2OAggregatorEstimator(**params)

start = timer()
agg.train(training_frame=data_h2o)
data_agg = agg.aggregated_frame

# run h2o Kmeans
h2o_km_co_agg = H2OKMeansEstimator(k=3, init="furthest", cluster_size_constraints=[500, 1000, 500], standardize=True)

h2o_km_co_agg.train(x=["month", "day", "year", "maxTemp", "meanTemp"],training_frame=data_agg)
end = timer()

# show details
h2o_km_co_agg.show()
time_km_co_12 = timedelta(seconds=end-start)
print("Time:", time_km_co_12)

aggregator Model Build progress: |████████████████████████████████████████| 100%
kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1580898365102_4


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_rows,number_of_clusters,number_of_categorical_columns,number_of_iterations,within_cluster_sum_of_squares,total_sum_of_squares,between_cluster_sum_of_squares
0,,2564.0,3.0,0.0,10.0,7545.012316,12799.0,5253.987684




ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 7545.012315760988
Total Sum of Square Error to Grand Mean: 12798.99999999936
Between Cluster Sum of Square Error: 5253.987684238372

Centroid Statistics: 


Unnamed: 0,Unnamed: 1,centroid,size,within_cluster_sum_of_squares
0,,1.0,911.0,2239.748747
1,,2.0,1000.0,3542.688204
2,,3.0,653.0,1762.575365



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,iterations,number_of_reassigned_observations,within_cluster_sum_of_squares
0,,2020-02-05 11:29:29,0.001 sec,0.0,,
1,,2020-02-05 11:29:40,11.399 sec,1.0,2564.0,18526.052055
2,,2020-02-05 11:29:48,19.015 sec,2.0,258.0,7998.650613
3,,2020-02-05 11:29:55,26.824 sec,3.0,150.0,7776.433697
4,,2020-02-05 11:30:04,35.062 sec,4.0,115.0,7639.470489
5,,2020-02-05 11:30:13,44.247 sec,5.0,64.0,7579.246065
6,,2020-02-05 11:30:21,52.570 sec,6.0,44.0,7563.020728
7,,2020-02-05 11:30:29,1 min 0.629 sec,7.0,30.0,7555.507641
8,,2020-02-05 11:30:37,1 min 8.067 sec,8.0,28.0,7550.636731
9,,2020-02-05 11:30:44,1 min 15.520 sec,9.0,18.0,7546.768955


Time: 0:01:24.811844


## Constrained K-means on data reduced with Aggregator - 1/4 of original data size

In [8]:
from h2o.estimators.aggregator import H2OAggregatorEstimator

# original data size 5162, constraints 1000, 2000, 1000
# aggregated data size ~ 1290, constraints 240, 480, 240

params = {
    "target_num_exemplars": 1290,
    "rel_tol_num_exemplars": 0.01,
    "categorical_encoding": "eigen"
}
agg_14 = H2OAggregatorEstimator(**params)

start = timer()
agg_14.train(training_frame=data_h2o)
data_agg_14 = agg_14.aggregated_frame

# run h2o Kmeans
h2o_km_co_agg_14 = H2OKMeansEstimator(k=3, init="furthest", cluster_size_constraints=[240, 480, 240], standardize=True)

h2o_km_co_agg_14.train(x=list(range(5)),training_frame=data_agg_14)
end = timer()

# show details
h2o_km_co_agg_14.show()
time_km_co_14 = timedelta(seconds=end-start)
print("Time:", time_km_co_14)

aggregator Model Build progress: |████████████████████████████████████████| 100%
kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1580898365102_6


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_rows,number_of_clusters,number_of_categorical_columns,number_of_iterations,within_cluster_sum_of_squares,total_sum_of_squares,between_cluster_sum_of_squares
0,,1298.0,3.0,0.0,10.0,3979.432652,6477.0,2497.567348




ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 3979.432652060333
Total Sum of Square Error to Grand Mean: 6476.999999999918
Between Cluster Sum of Square Error: 2497.567347939585

Centroid Statistics: 


Unnamed: 0,Unnamed: 1,centroid,size,within_cluster_sum_of_squares
0,,1.0,398.0,1020.89153
1,,2.0,480.0,1842.661718
2,,3.0,420.0,1115.879403



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,iterations,number_of_reassigned_observations,within_cluster_sum_of_squares
0,,2020-02-05 11:30:53,0.001 sec,0.0,,
1,,2020-02-05 11:30:55,2.502 sec,1.0,1298.0,6623.872891
2,,2020-02-05 11:30:57,4.728 sec,2.0,110.0,4091.15154
3,,2020-02-05 11:30:59,6.905 sec,3.0,40.0,4047.569761
4,,2020-02-05 11:31:02,9.053 sec,4.0,32.0,4035.384656
5,,2020-02-05 11:31:04,11.295 sec,5.0,32.0,4023.391339
6,,2020-02-05 11:31:06,13.506 sec,6.0,34.0,4008.612977
7,,2020-02-05 11:31:08,15.724 sec,7.0,37.0,3994.415378
8,,2020-02-05 11:31:11,18.140 sec,8.0,21.0,3985.89078
9,,2020-02-05 11:31:13,20.475 sec,9.0,20.0,3982.754609


Time: 0:00:23.186029


## Results

## Time 

| Data | Number of rows | Time |
|---|---|---|
| Original data | 5162 | 0:03:17.919621 |
| Aggregated data, 1/2 of original size | 2564 | 0:01:24.811844 |
| Aggregated data, 1/4 of original size | 1298 | 0:00:23.186029 |
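
The speed-up can be worked out directly from the times printed by the three training cells above. The values below are copied from those outputs; for the aggregated runs the measured time includes the Aggregator step, since the timer is started before `agg.train`.

```python
from datetime import timedelta

# Times printed by the Chicago weather training cells above.
time_full = timedelta(minutes=3, seconds=17.919621)
time_half = timedelta(minutes=1, seconds=24.811844)
time_quarter = timedelta(seconds=23.186029)

# Dividing two timedeltas yields a plain float ratio.
speedup_half = time_full / time_half
speedup_quarter = time_full / time_quarter
print(f"1/2 size: {speedup_half:.1f}x faster, 1/4 size: {speedup_quarter:.1f}x faster")
```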

## Accuracy

In [9]:
centers_km_co = h2o_km_co.centers()
centers_km_co_agg_12 = h2o_km_co_agg.centers()
centers_km_co_agg_14 = h2o_km_co_agg_14.centers()
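# sort each model's centroids by the first coordinate so that rows 0-2, 3-5 and 6-8 can be compared pairwise
centers_all = pd.concat([pd.DataFrame(centers_km_co).sort_values(by=[0]), pd.DataFrame(centers_km_co_agg_12).sort_values(by=[0]), pd.DataFrame(centers_km_co_agg_14).sort_values(by=[0])])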

### Difference between coordinates of original data and aggregated data 

In [10]:
diff_first_cluster = pd.concat([centers_all.iloc[0,:] - centers_all.iloc[3,:], centers_all.iloc[0,:] - centers_all.iloc[6,:]], axis=1, ignore_index=True).transpose()
diff_first_cluster.index = ["1/2", "1/4"]
diff_first_cluster.style.bar(subset=[0,1,2,3,4], align='mid', color=['#d65f5f', '#5fba7d'])

Unnamed: 0,0,1,2,3,4
1/2,0.88972,0.717531,-0.265645,9.65953,8.87158
1/4,-1.32517,0.259583,-0.0909167,10.2001,9.06186


In [11]:
diff_second_cluster = pd.concat([centers_all.iloc[1,:] - centers_all.iloc[4,:], centers_all.iloc[1,:] - centers_all.iloc[7,:]], axis=1, ignore_index=True).transpose()
diff_second_cluster.index = ["1/2", "1/4"]
diff_second_cluster.style.bar(subset=[0,1,2,3,4], align='mid', color=['#d65f5f', '#5fba7d'])

Unnamed: 0,0,1,2,3,4
1/2,0.706606,0.571464,0.144082,0.299513,0.744391
1/4,0.2494,-0.471812,-3.35344,9.70779,9.17911


In [12]:
diff_third_cluster = pd.concat([centers_all.iloc[2,:] - centers_all.iloc[5,:], centers_all.iloc[2,:] - centers_all.iloc[8,:]], axis=1, ignore_index=True).transpose()
diff_third_cluster.index = ["1/2", "1/4"]
#diff_third_cluster.style.background_gradient(cmap='Reds')
diff_third_cluster.style.bar(subset=[0,1,2,3,4], align='mid', color=['#d65f5f', '#5fba7d'])

Unnamed: 0,0,1,2,3,4
1/2,2.23488,0.557486,0.129252,-6.84713,-5.83347
1/4,3.28316,1.58377,3.73468,-23.86,-21.2804
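
The comparisons above pair centroids by sorting each model's centers on the first coordinate, which can mismatch clusters when that coordinate is similar across centroids. A possible alternative, sketched here as an assumption rather than as what the notebook does, is to pick the pairing that minimises the total squared distance. The helper `match_centroids` is hypothetical and uses brute force, which is fine for k=3 but too slow for larger k (an assignment solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice there).

```python
from itertools import permutations


def match_centroids(reference, other):
    """Return, for each reference centroid, the index of its partner in `other`
    such that the total squared distance over all pairs is minimal."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    k = len(reference)
    best = min(
        permutations(range(k)),
        key=lambda perm: sum(sq_dist(reference[i], other[perm[i]]) for i in range(k)),
    )
    return list(best)


reference = [(0, 0), (10, 10), (5, 0)]
other = [(5.2, 0.1), (0.1, -0.1), (9.8, 10.3)]
print(match_centroids(reference, other))
```

With the pairing in hand, the per-coordinate differences can be computed row by row exactly as in the cells above.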


## Data - Cluto-t7.10k

source: G. Karypis, "CLUTO A Clustering Toolkit," Dept. of Computer Science, University of Minnesota, Tech. Rep. 02-017, 2002, available at http://www.cs.umn.edu/~cluto. Karypis, George, Eui-Hong Han, and Vipin Kumar.

- 10 000 rows
- 3 features (x, y, class {0,1,2,3,4,5,6,7,8,noise})


In [13]:
cluto = pd.read_csv("../../smalldata/cluto/cluto_t7_10k.csv", header=None)
cluto.columns = ["x", "y", "class"]
cluto.loc[cluto["class"] == "noise", "class"] = 9
cluto["class"] = cluto["class"].astype("category")
cluto

Unnamed: 0,x,y,class
0,539.512024,411.975006,1
1,542.241028,147.626007,2
2,653.468994,370.727997,0
3,598.585999,284.882996,1
4,573.062988,294.562988,1
...,...,...,...
9995,451.783997,372.544006,6
9996,550.674988,327.447998,1
9997,474.742004,161.518005,3
9998,535.835022,375.765991,1


In [36]:
import plotly.express as px
fig = px.scatter(cluto, x="x", y="y", color="class", title="Original Cluto Dataset")
fig.show()

In [15]:
# load data to h2o
data_h2o_cluto = h2o.H2OFrame(cluto)

# run h2o Kmeans to estimate good start points
h2o_km_cluto = H2OKMeansEstimator(k=10, init="furthest", standardize=True)

start = timer()
h2o_km_cluto.train(x=["x", "y"], training_frame=data_h2o_cluto)
end = timer()

# show details
h2o_km_cluto.show()
print("Time:", timedelta(seconds=end-start))

Parse progress: |█████████████████████████████████████████████████████████| 100%
kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1580898365102_7


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_rows,number_of_clusters,number_of_categorical_columns,number_of_iterations,within_cluster_sum_of_squares,total_sum_of_squares,between_cluster_sum_of_squares
0,,10000.0,10.0,0.0,10.0,1805.215286,19998.0,18192.784714




ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 1802.9785769795371
Total Sum of Square Error to Grand Mean: 19998.000020727988
Between Cluster Sum of Square Error: 18195.02144374845

Centroid Statistics: 


Unnamed: 0,Unnamed: 1,centroid,size,within_cluster_sum_of_squares
0,,1.0,632.0,90.110234
1,,2.0,1158.0,248.530603
2,,3.0,1077.0,135.80662
3,,4.0,1144.0,201.030712
4,,5.0,767.0,118.214755
5,,6.0,1089.0,231.116331
6,,7.0,930.0,135.158163
7,,8.0,1480.0,371.581913
8,,9.0,1065.0,184.486618
9,,10.0,658.0,86.942629



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,iterations,number_of_reassigned_observations,within_cluster_sum_of_squares
0,,2020-02-05 11:31:17,0.003 sec,0.0,,
1,,2020-02-05 11:31:17,0.037 sec,1.0,10000.0,3172.544287
2,,2020-02-05 11:31:17,0.047 sec,2.0,1423.0,2105.87904
3,,2020-02-05 11:31:17,0.060 sec,3.0,632.0,1934.305159
4,,2020-02-05 11:31:17,0.073 sec,4.0,388.0,1854.254107
5,,2020-02-05 11:31:17,0.085 sec,5.0,237.0,1827.444092
6,,2020-02-05 11:31:17,0.098 sec,6.0,152.0,1819.567955
7,,2020-02-05 11:31:17,0.109 sec,7.0,129.0,1814.603401
8,,2020-02-05 11:31:17,0.121 sec,8.0,116.0,1810.912379
9,,2020-02-05 11:31:17,0.131 sec,9.0,102.0,1807.78761


Time: 0:00:00.217796


In [16]:
# run h2o constrained Kmeans
h2o_km_co_cluto = H2OKMeansEstimator(k=10, user_points=h2o.H2OFrame(h2o_km_cluto.centers()), cluster_size_constraints=[100, 200, 100, 200, 100, 100, 100, 100, 100, 100], standardize=True)

start = timer()
h2o_km_co_cluto.train(x=["x", "y"], training_frame=data_h2o_cluto)
end = timer()

# show details
h2o_km_co_cluto.show()
time_h2o_km_co_cluto = timedelta(seconds=end-start)
print("Time:", time_h2o_km_co_cluto)

Parse progress: |█████████████████████████████████████████████████████████| 100%
kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1580898365102_8


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_rows,number_of_clusters,number_of_categorical_columns,number_of_iterations,within_cluster_sum_of_squares,total_sum_of_squares,between_cluster_sum_of_squares
0,,10000.0,10.0,0.0,10.0,1799.79873,19998.0,18198.20127




ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 1799.7987301078294
Total Sum of Square Error to Grand Mean: 19997.999999999996
Between Cluster Sum of Square Error: 18198.201269892168

Centroid Statistics: 


Unnamed: 0,Unnamed: 1,centroid,size,within_cluster_sum_of_squares
0,,1.0,651.0,94.106054
1,,2.0,1162.0,250.644337
2,,3.0,1048.0,126.654039
3,,4.0,1150.0,203.640064
4,,5.0,731.0,110.560217
5,,6.0,1088.0,230.431351
6,,7.0,929.0,132.741268
7,,8.0,1477.0,370.345078
8,,9.0,1067.0,186.353638
9,,10.0,697.0,94.322683



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,iterations,number_of_reassigned_observations,within_cluster_sum_of_squares
0,,2020-02-05 11:31:18,0.002 sec,0.0,,
1,,2020-02-05 11:34:37,3 min 19.779 sec,1.0,10000.0,1802.978577
2,,2020-02-05 11:37:42,6 min 23.954 sec,2.0,49.0,1801.495266
3,,2020-02-05 11:40:50,9 min 32.042 sec,3.0,31.0,1800.767735
4,,2020-02-05 11:44:04,12 min 46.856 sec,4.0,18.0,1800.422318
5,,2020-02-05 11:47:21,16 min 3.367 sec,5.0,16.0,1800.287319
6,,2020-02-05 11:50:49,19 min 31.059 sec,6.0,17.0,1800.14756
7,,2020-02-05 11:53:58,22 min 40.286 sec,7.0,9.0,1800.04148
8,,2020-02-05 11:56:51,25 min 33.335 sec,8.0,9.0,1799.990942
9,,2020-02-05 11:59:40,28 min 22.658 sec,9.0,8.0,1799.926388


Time: 0:31:11.393294


In [17]:
from h2o.estimators.aggregator import H2OAggregatorEstimator

# original data size 10000, constraints [100, 200, 100, 200, 100, 100, 100, 100, 100, 100]
# target aggregated data size 5000, constraints [50, 100, 50, 100, 50, 50, 50, 50, 50, 50]

params = {
    "target_num_exemplars": 5000,
    "rel_tol_num_exemplars": 0.5,
    "categorical_encoding": "eigen"
}
agg = H2OAggregatorEstimator(**params)

start = timer()
agg.train(x=["x","y","class"], training_frame=data_h2o_cluto)
data_agg_12_cluto = agg.aggregated_frame

# run h2o Kmeans
h2o_km_co_agg_12_cluto = H2OKMeansEstimator(k=10, user_points=h2o.H2OFrame(h2o_km_cluto.centers()), cluster_size_constraints=[50, 100, 50, 100, 50, 50, 50, 50, 50, 50], standardize=True)

h2o_km_co_agg_12_cluto.train(x=["x", "y"],training_frame=data_agg_12_cluto)
end = timer()

# show details
h2o_km_co_agg_12_cluto.show()
time_h2o_km_co_agg_12_cluto = timedelta(seconds=end-start)
print("Time:", time_h2o_km_co_agg_12_cluto)

aggregator Model Build progress: |████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1580898365102_10


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_rows,number_of_clusters,number_of_categorical_columns,number_of_iterations,within_cluster_sum_of_squares,total_sum_of_squares,between_cluster_sum_of_squares
0,,4704.0,10.0,0.0,10.0,871.555135,9406.0,8534.444865




ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 871.5551353295284
Total Sum of Square Error to Grand Mean: 9406.000000000002
Between Cluster Sum of Square Error: 8534.444864670473

Centroid Statistics: 


Unnamed: 0,Unnamed: 1,centroid,size,within_cluster_sum_of_squares
0,,1.0,393.0,89.003454
1,,2.0,507.0,105.224237
2,,3.0,485.0,64.172292
3,,4.0,513.0,92.889068
4,,5.0,410.0,70.703834
5,,6.0,511.0,109.64941
6,,7.0,432.0,65.993512
7,,8.0,574.0,121.380033
8,,9.0,514.0,93.042981
9,,10.0,365.0,59.496316



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,iterations,number_of_reassigned_observations,within_cluster_sum_of_squares
0,,2020-02-05 12:02:30,0.001 sec,0.0,,
1,,2020-02-05 12:03:25,55.295 sec,1.0,4704.0,890.838995
2,,2020-02-05 12:04:19,1 min 49.678 sec,2.0,33.0,888.636147
3,,2020-02-05 12:05:19,2 min 49.330 sec,3.0,25.0,888.008486
4,,2020-02-05 12:06:13,3 min 43.614 sec,4.0,27.0,887.250681
5,,2020-02-05 12:07:08,4 min 38.533 sec,5.0,31.0,886.11855
6,,2020-02-05 12:08:04,5 min 33.803 sec,6.0,32.0,884.601397
7,,2020-02-05 12:09:00,6 min 30.267 sec,7.0,41.0,882.673705
8,,2020-02-05 12:10:01,7 min 30.738 sec,8.0,54.0,879.886931
9,,2020-02-05 12:10:53,8 min 23.581 sec,9.0,64.0,876.269105


Time: 0:09:17.259652


In [19]:
# original data size 10000, constraints [100, 200, 100, 200, 100, 100, 100, 100, 100, 100]
# target aggregated data size 2500, constraints [25, 50, 25, 50, 25, 25, 25, 25, 25, 25]

params = {
    "target_num_exemplars": 2500,
    "rel_tol_num_exemplars": 0.5,
    "categorical_encoding": "eigen"
}
agg_14 = H2OAggregatorEstimator(**params)

start = timer()
agg_14.train(x=["x","y","class"], training_frame=data_h2o_cluto)
data_agg_14_cluto = agg_14.aggregated_frame

# run h2o Kmeans
h2o_km_co_agg_14_cluto = H2OKMeansEstimator(k=10, user_points=h2o.H2OFrame(h2o_km_cluto.centers()), cluster_size_constraints=[25, 50, 25, 50, 25, 25, 25, 25, 25, 25], standardize=True)

h2o_km_co_agg_14_cluto.train(x=["x","y"],training_frame=data_agg_14_cluto)
end = timer()

# show details
h2o_km_co_agg_14_cluto.show()
time_h2o_km_co_agg_14_cluto = timedelta(seconds=end-start)
print("Time:", time_h2o_km_co_agg_14_cluto)

aggregator Model Build progress: |████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1580898365102_12


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_rows,number_of_clusters,number_of_categorical_columns,number_of_iterations,within_cluster_sum_of_squares,total_sum_of_squares,between_cluster_sum_of_squares
0,,1998.0,10.0,0.0,10.0,395.157417,3994.0,3598.842583




ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 395.1574173501684
Total Sum of Square Error to Grand Mean: 3993.999999999999
Between Cluster Sum of Square Error: 3598.8425826498305

Centroid Statistics: 


Unnamed: 0,Unnamed: 1,centroid,size,within_cluster_sum_of_squares
0,,1.0,187.0,40.562536
1,,2.0,210.0,44.977765
2,,3.0,206.0,31.980904
3,,4.0,214.0,43.494947
4,,5.0,184.0,36.410856
5,,6.0,214.0,47.388271
6,,7.0,186.0,32.383887
7,,8.0,244.0,51.936723
8,,9.0,199.0,38.169969
9,,10.0,154.0,27.851559



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,iterations,number_of_reassigned_observations,within_cluster_sum_of_squares
0,,2020-02-05 12:46:43,0.003 sec,0.0,,
1,,2020-02-05 12:47:01,18.272 sec,1.0,1998.0,416.633854
2,,2020-02-05 12:47:18,35.063 sec,2.0,53.0,409.029713
3,,2020-02-05 12:47:35,51.951 sec,3.0,31.0,406.393712
4,,2020-02-05 12:47:52,1 min 9.604 sec,4.0,25.0,405.095004
5,,2020-02-05 12:48:09,1 min 26.763 sec,5.0,27.0,404.029171
6,,2020-02-05 12:48:26,1 min 43.776 sec,6.0,26.0,402.553729
7,,2020-02-05 12:48:43,2 min 0.783 sec,7.0,25.0,400.987244
8,,2020-02-05 12:49:06,2 min 23.610 sec,8.0,25.0,399.175594
9,,2020-02-05 12:49:24,2 min 40.941 sec,9.0,27.0,397.020557


Time: 0:02:57.746907


In [28]:
fig = px.scatter(cluto, x="x", y="y", color="class", title="Original Cluto Dataset")
fig.show()

In [29]:
data_agg_df_12_cluto = data_agg_12_cluto.as_data_frame()
data_agg_df_12_cluto["class"] = data_agg_df_12_cluto["class"].astype("category")
fig = px.scatter(data_agg_df_12_cluto, x="x", y="y", color="class", title="Aggregated (1/2 size) Cluto Dataset")
fig.show()

In [30]:
data_agg_df_14_cluto = data_agg_14_cluto.as_data_frame()
data_agg_df_14_cluto["class"] = data_agg_df_14_cluto["class"].astype("category")
fig = px.scatter(data_agg_df_14_cluto, x="x", y="y", color="class", title="Aggregated (1/4 size) Cluto Dataset")
fig.show()

In [31]:
cluto["km_t_pred"] = h2o_km_cluto.predict(data_h2o_cluto).as_data_frame()['predict'].astype("category")
fig = px.scatter(cluto, x="x", y="y", color="km_t_pred", title="Predictions of standard K-means")
fig.show()

kmeans prediction progress: |█████████████████████████████████████████████| 100%


In [32]:
cluto["km_co_pred"] = h2o_km_co_cluto.predict(data_h2o_cluto).as_data_frame()['predict'].astype("category")
fig = px.scatter(cluto, x="x", y="y", color="km_co_pred", title="Predictions of Constrained K-means trained with whole Cluto Dataset")
fig.show()

kmeans prediction progress: |█████████████████████████████████████████████| 100%


In [33]:
cluto["km_co_pred_1/2"] = h2o_km_co_agg_12_cluto.predict(data_h2o_cluto).as_data_frame()['predict'].astype("category")
fig = px.scatter(cluto, x="x", y="y", color="km_co_pred_1/2", title="Predictions of Constrained K-means trained with aggregated (1/2 of size) Cluto Dataset")
fig.show()

kmeans prediction progress: |█████████████████████████████████████████████| 100%


In [34]:
cluto["km_co_pred_1/4"] = h2o_km_co_agg_14_cluto.predict(data_h2o_cluto).as_data_frame()['predict'].astype("category")
fig = px.scatter(cluto, x="x", y="y", color="km_co_pred_1/4", title="Predictions of Constrained K-means trained with aggregated (1/4 of size) Cluto Dataset")
fig.show()

kmeans prediction progress: |█████████████████████████████████████████████| 100%


## Difference between the resulting centroids calculated from all data and from aggregated data

In [35]:
centers_km_co_cluto = pd.DataFrame(h2o_km_co_cluto.centers())
centers_km_co_cluto["algo"] =  "km_co"
centers_km_co_agg_12_cluto = pd.DataFrame(h2o_km_co_agg_12_cluto.centers())
centers_km_co_agg_12_cluto["algo"] =  "km_co_agg_12"
centers_km_co_agg_14_cluto = pd.DataFrame(h2o_km_co_agg_14_cluto.centers())
centers_km_co_agg_14_cluto["algo"] =  "km_co_agg_14"

centers_all_cluto = pd.concat([centers_km_co_cluto, centers_km_co_agg_12_cluto, centers_km_co_agg_14_cluto])

fig = px.scatter(centers_all_cluto, x=0, y=1, color="algo", title="Centroids")
fig.show()