In [1]:
import os

# Change the working directory to the parent directory.
# Run this cell only once: each additional run moves up one more level.
os.chdir("../")

This notebook demonstrates how to use the ``AutoCluster`` class for clustering.

### Import packages

In [2]:
# we will be using sample datasets in sklearn
from sklearn import datasets
from collections import Counter
import pandas as pd

# autocluster functionalities
from autocluster import AutoCluster
from evaluators import get_evaluator
from utils.metafeatures import MetafeatureMapper

%load_ext autoreload
%autoreload 2

### Load sklearn digits dataset

In [None]:
digits_df = pd.DataFrame(datasets.load_digits()['data'])
digits_df.head(5)

- Note that we convert the dataset from a ``numpy`` array to a ``pandas DataFrame``.
- This is because the ``AutoCluster.fit()`` function only accepts a ``DataFrame`` as input.
- There are 64 columns in this dataset, named ``0``, ``1``, ``2``, and so on up to ``63``.

In [None]:
print("Shape of this dataframe is {}".format(digits_df.shape))

### Finding an optimal clustering model using Bayesian Optimization (SMAC)

In [None]:
cluster = AutoCluster()
fit_params = {
    "df": digits_df, 
    "cluster_alg_ls": [
        'KMeans', 'GaussianMixture', 'MiniBatchKMeans'
    ], 
    "dim_reduction_alg_ls": [
        'PCA', 'IncrementalPCA', 
        'KernelPCA', 'FastICA', 'TruncatedSVD', 'NullModel'
    ],
    "optimizer": 'smac',
    "n_evaluations": 100,
    "run_obj": 'quality',
    "seed": 27,
    "cutoff_time": 50,
    "preprocess_dict": {
        "numeric_cols": list(range(64)),
        "categorical_cols": [],
        "ordinal_cols": [],
        "y_col": []
    },
    "evaluator": get_evaluator(evaluator_ls = ['silhouetteScore', 
                                               'daviesBouldinScore', 
                                               'calinskiHarabaszScore'], 
                               weights = [1, 1, 1], 
                               clustering_num = None, 
                               min_proportion = .01, 
                               min_relative_proportion='default'),
    "n_folds": 5,
    "warmstart": False
}
result_dict = cluster.fit(**fit_params)

There is a lot going on here, so let's go through some of the parameters used in the example above:
- ``cluster_alg_ls``: The list of clustering algorithms to include in the search space.
- ``dim_reduction_alg_ls``: The list of dimension reduction algorithms to include in the search space. Dimension reduction is performed **before** the clustering step.
- ``optimizer``: There are two options, ``"smac"`` and ``"random"``. ``"smac"`` performs Bayesian Optimization using the SMAC library, while ``"random"`` performs random search.
- ``n_evaluations``: The number of optimization iterations to run. Larger values generally yield better results, at the cost of a longer run time.
- ``cutoff_time``: If evaluating a configuration takes longer than this value (in seconds), the evaluation is terminated.
- ``preprocess_dict``: This is important. ``AutoCluster.fit()`` uses this dictionary to preprocess the dataset. For instance, categorical columns will be one-hot encoded, while ordinal columns will be encoded as integers.
- ``evaluator``: This is important. It tells ``AutoCluster.fit()`` how to evaluate a clustering result.
    - ``evaluator_ls``: The list of metrics to include in a linear combination. The available choices are ``["silhouetteScore", "daviesBouldinScore", "calinskiHarabaszScore"]``.
    - ``weights``: The weight given to each metric in the linear combination.
    - ``clustering_num``: A tuple giving the allowed range for the number of clusters. If a clustering result has a number of clusters outside this range, the evaluator returns ``float('inf')``. The example above passes ``None`` instead of a tuple.
    - ``min_proportion``: The proportion of points in each cluster must be at least this value.
    - ``min_relative_proportion``: The ratio of the number of points in the smallest cluster to the number of points in the largest cluster must be at least this value. With ``'default'``, ``min_relative_proportion`` is set to ``5 * min_proportion`` (``0.05`` in the example above).
- ``warmstart``: Whether or not to use warmstart. Examples of how to use this are shown below.
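
To make the evaluator constraints more concrete, here is a minimal sketch of the checks described above. It is not the actual code behind ``get_evaluator``; the function names ``passes_size_constraints`` and ``combined_score`` are hypothetical. It also assumes that the ``clustering_num`` range is inclusive, that ``None`` means no restriction on the number of clusters, and that the metric values passed in are already oriented so that lower is better.

```python
from collections import Counter


def passes_size_constraints(labels, clustering_num=None, min_proportion=0.01,
                            min_relative_proportion="default"):
    """Hypothetical version of the cluster-size checks described above."""
    if min_relative_proportion == "default":
        min_relative_proportion = 5 * min_proportion

    sizes = Counter(labels)          # cluster id -> number of points
    n_points = len(labels)

    # Assumption: clustering_num is an inclusive (low, high) range,
    # and None disables this check.
    if clustering_num is not None:
        low, high = clustering_num
        if not (low <= len(sizes) <= high):
            return False

    smallest = min(sizes.values())
    largest = max(sizes.values())

    # Every cluster must hold at least min_proportion of all points;
    # checking the smallest cluster is sufficient.
    if smallest / n_points < min_proportion:
        return False

    # The smallest cluster must not be too small relative to the largest.
    if smallest / largest < min_relative_proportion:
        return False

    return True


def combined_score(metric_values, weights, labels, **constraints):
    """Weighted linear combination, or inf if a constraint is violated."""
    if not passes_size_constraints(labels, **constraints):
        return float("inf")
    return sum(w * v for w, v in zip(weights, metric_values))
```

For example, a result with 99 points in one cluster and a single point in another meets ``min_proportion = .01`` but fails the default relative check, because 1/99 is below ``5 * .01``.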

In [None]:
predictions = cluster.predict(digits_df)

In [None]:
print(result_dict["optimal_cfg"])
print(Counter(predictions))
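
``Counter(predictions)`` only shows how many points fall into each cluster. As an optional sanity check, the sketch below summarizes which true label dominates each cluster. The helper ``majority_label_per_cluster`` is hypothetical and not part of ``autocluster``, and the lists are toy data. In this notebook one could pass ``predictions`` together with ``datasets.load_digits()['target']`` instead.

```python
from collections import Counter, defaultdict


def majority_label_per_cluster(predictions, true_labels):
    """Map each cluster id to its (most common true label, count) pair."""
    per_cluster = defaultdict(Counter)
    for cluster_id, label in zip(predictions, true_labels):
        per_cluster[cluster_id][label] += 1
    return {
        cluster_id: counts.most_common(1)[0]
        for cluster_id, counts in per_cluster.items()
    }


# Toy data for illustration only.
toy_predictions = [0, 0, 0, 1, 1, 2]
toy_labels = [7, 7, 3, 3, 3, 9]
summary = majority_label_per_cluster(toy_predictions, toy_labels)
print(summary)
```

Cluster ids are arbitrary, so they should not be expected to match the digit labels themselves; only the grouping is meaningful.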