# Examples for ML2DAC

In this notebook, we show examples of how to use our approach, in particular how to set its parameters and how to apply it to a custom dataset. Note that we use the MetaKnowledgeRepository (MKR) that we created with the LearningPhase.py script. Have a look at that script to see how to build or extend the MKR.

In [1]:
from MetaLearning.ApplicationPhase import ApplicationPhase
from MetaLearning import MetaFeatureExtractor
from pathlib import Path
from pandas.core.common import SettingWithCopyWarning
import warnings
warnings.filterwarnings(category=RuntimeWarning, action="ignore")
warnings.filterwarnings(category=SettingWithCopyWarning, action="ignore")
import numpy as np
np.random.seed(0)
# Specify where to find our MKR
mkr_path = Path("../MetaKnowledgeRepository/")

# Specify meta-feature set to use. This is the set General+Stats+Info 
mf_set = MetaFeatureExtractor.meta_feature_sets[4]

## Example on a simple synthetic dataset

First create a simple synthetic dataset.

In [2]:
# Create simple synthetic dataset
from sklearn.datasets import make_blobs
# We expect the data as numpy arrays
X, y = make_blobs(n_samples=1000, n_features=10, random_state=0)

# We also use a name to describe/identify this dataset
dataset_name = "simple_blobs_n1000_f10"
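
To apply the approach to a custom dataset instead of synthetic blobs, the data first has to be brought into the same form, i.e., a NumPy array plus a descriptive name. The following is a minimal sketch, assuming the data is tabular and held in a pandas DataFrame. The column names, values, and the names `X_custom` and `custom_dataset_name` are hypothetical; in practice the frame could come from a call such as `pd.read_csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a custom dataset (illustrative values only)
df = pd.DataFrame({
    "height": [1.2, 3.4, 5.6, 7.8],
    "width": [0.5, 0.1, 0.9, 0.3],
    "category": ["a", "b", "a", "b"],  # non-numeric column, dropped below
})

# Keep the numeric columns only and convert them to a float NumPy array
X_custom = df.select_dtypes(include="number").to_numpy(dtype=float)

# Hypothetical name to describe/identify the dataset
custom_dataset_name = "my_custom_dataset"

print(type(X_custom).__name__, X_custom.shape)
```

The resulting `X_custom` and `custom_dataset_name` would then take the place of `X` and `dataset_name` in the cells below.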

Specify some parameter settings of our approach.

In [3]:
# Parameters of our approach. These can be customized
n_warmstarts = 5 # Number of warmstart configurations (has to be smaller than n_loops)
n_loops = 10 # Total number of optimizer loops, i.e., n_loops = n_warmstarts + x further loops
limit_cs = True # Reduces the search space to suitable algorithms, depending on the warmstart configurations
time_limit = 120 * 60 # Time limit of the overall optimization --> Aborts early if time_limit is reached before n_loops are finished
cvi = "predict" # We want to predict a CVI (cluster validity index) based on our meta-knowledge
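
The relation between the two loop parameters can be checked before starting a long run. The helper below is a hypothetical sketch and not part of ML2DAC; it only encodes the constraint stated in the comments above, namely that `n_warmstarts` has to be smaller than `n_loops`, and returns the number of loops left after the warmstarts.

```python
def remaining_optimizer_loops(n_warmstarts: int, n_loops: int) -> int:
    """Hypothetical helper: number of loops left after the warmstart configurations."""
    if n_warmstarts < 0 or n_warmstarts >= n_loops:
        raise ValueError(
            f"n_warmstarts ({n_warmstarts}) has to be non-negative and smaller than n_loops ({n_loops})"
        )
    return n_loops - n_warmstarts


# With the settings used in this notebook
print(remaining_optimizer_loops(n_warmstarts=5, n_loops=10))
```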

Instantiate our ML2DAC approach.

In [4]:
ML2DAC = ApplicationPhase(mkr_path=mkr_path, mf_set=mf_set)

Run the optimization procedure.

In [5]:
optimizer_result, additional_info = ML2DAC.optimize_with_meta_learning(X, n_warmstarts=n_warmstarts,
                                                                       n_optimizer_loops=n_loops, 
                                                                       limit_cs=limit_cs,
                                                                       cvi=cvi, time_limit=time_limit,
                                                                       dataset_name=dataset_name)

----------------------------------
Most similar dataset is: type=gaussian-k=10-n=1000-d=10-noise=0
--
Selected CVI: Calinski-Harabasz (CH)
--
Selected Warmstart Configs:
28    {'algorithm': 'affinity_propagation', 'damping...
74              {'algorithm': 'ward', 'n_clusters': 10}
64              {'algorithm': 'ward', 'n_clusters': 12}
24          {'algorithm': 'spectral', 'n_clusters': 12}
25    {'algorithm': 'dbscan', 'eps': 0.7585251766955...
Name: config, dtype: object
--
Selected Algorithms: ['affinity_propagation', 'ward', 'spectral', 'dbscan']
--
----------------------------------
Starting the optimization
Executing Configuration: Configuration:
  algorithm, Value: 'affinity_propagation'
  damping, Value: 0.9009756450229847

Obtained CVI score for CH: -651.9920584630048
----
Executing Configuration: Configuration:
  algorithm, Value: 'ward'
  n_clusters, Value: 10

Obtained CVI score for CH: -870.9795439649938
----
Executing Configuration: Configuration:
  algorithm, Value: 'aff

The result contains two parts: (1) optimizer_result, which contains the history of the executed configurations in their order of execution, together with their runtimes and their scores for the selected CVI, and (2) additional_info, which holds some basic information about our meta-learning procedure, i.e., how long the meta-feature extraction took, the selected CVI, the algorithms that we used in the configuration space, and the dataset from the MKR that was most similar to the new dataset.

In [6]:
optimizer_result.get_runhistory_df()

,runtime,CH,config,labels
0,0.899647,-651.9921,"{'algorithm': 'affinity_propagation', 'damping...","[10, 8, 8, 9, 7, 2, 2, 6, 15, 1, 5, 6, 5, 9, 1..."
1,0.055966,-870.9795,"{'algorithm': 'ward', 'n_clusters': 10}","[6, 5, 5, 1, 4, 5, 5, 8, 9, 7, 0, 0, 1, 1, 4, ..."
2,0.505862,-652.9966,"{'algorithm': 'affinity_propagation', 'damping...","[9, 7, 7, 8, 6, 2, 2, 5, 15, 1, 4, 5, 4, 8, 10..."
3,0.054794,-755.9523,"{'algorithm': 'ward', 'n_clusters': 12}","[6, 2, 2, 5, 4, 2, 2, 8, 9, 3, 10, 7, 5, 5, 4,..."
4,0.365247,-725.3935,"{'algorithm': 'spectral', 'n_clusters': 12}","[3, 11, 7, 0, 5, 7, 7, 9, 7, 5, 6, 9, 6, 9, 1,..."
5,0.04683,2147484000.0,"{'algorithm': 'dbscan', 'eps': 0.7585251766955...","[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -..."
6,0.277159,-1023.55,"{'algorithm': 'spectral', 'n_clusters': 9}","[8, 5, 5, 0, 2, 5, 5, 4, 5, 2, 3, 4, 3, 3, 6, ..."
7,0.344897,-1522.253,"{'algorithm': 'spectral', 'n_clusters': 5}","[3, 4, 4, 0, 2, 4, 4, 0, 4, 2, 0, 0, 0, 0, 2, ..."
8,0.351404,-871.4639,"{'algorithm': 'spectral', 'n_clusters': 10}","[9, 7, 7, 0, 6, 7, 7, 5, 7, 4, 3, 5, 3, 5, 4, ..."
9,0.273288,-2628.857,"{'algorithm': 'spectral', 'n_clusters': 3}","[1, 1, 1, 0, 2, 1, 1, 0, 1, 2, 0, 0, 0, 0, 2, ..."
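
Two details of this run history are worth keeping in mind when reading it. First, the incumbent reported further down is the run with the lowest CH value, so the CH scores appear to be negated such that lower is better. Second, the DBSCAN run, whose displayed labels are all -1, received a very large value that looks like a penalty for an invalid clustering. Both points are inferences from the output shown here, not documented behaviour. The sketch below copies the runtime and CH columns from the table by hand and picks the best run; the threshold `PENALTY_THRESHOLD` is an assumption chosen only for illustration.

```python
import pandas as pd

# Values copied by hand from the run history shown above
history = pd.DataFrame({
    "algorithm": ["affinity_propagation", "ward", "affinity_propagation", "ward",
                  "spectral", "dbscan", "spectral", "spectral", "spectral", "spectral"],
    "runtime": [0.899647, 0.055966, 0.505862, 0.054794, 0.365247,
                0.04683, 0.277159, 0.344897, 0.351404, 0.273288],
    "CH": [-651.9921, -870.9795, -652.9966, -755.9523, -725.3935,
           2147484000.0, -1023.55, -1522.253, -871.4639, -2628.857],
})

# Assumption: anything above this threshold is a penalty value, not a real score
PENALTY_THRESHOLD = 1e9
valid_runs = history[history["CH"] < PENALTY_THRESHOLD]

# Lower is better under the assumed negation, so the best run has the minimal CH
best_run = valid_runs.loc[valid_runs["CH"].idxmin()]
print(best_run["algorithm"], best_run["CH"])
```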


In [7]:
additional_info

{'dataset': 'simple_blobs_n1000_f10',
 'mf time': 1.043367862701416,
 'similar dataset': 'type=gaussian-k=10-n=1000-d=10-noise=0',
 'CVI': 'CH',
 'algorithms': ['affinity_propagation', 'ward', 'spectral', 'dbscan']}

Now we retrieve the best configuration with its predicted clustering labels and compare it against the ground-truth clustering.

In [8]:
best_config_stats = optimizer_result.get_incumbent_stats()
best_config_stats



{'runtime': 0.2732877731323242,
 'CH': -2628.8572356039226,
 'config': {'algorithm': 'spectral', 'n_clusters': 3},
 'labels': array([1, 1, 1, 0, 2, 1, 1, 0, 1, 2, 0, 0, 0, 0, 2, 1, 1, 1, 2, 2, 0, 0,
        2, 0, 0, 0, 0, 1, 1, 2, 1, 2, 0, 2, 1, 2, 1, 0, 2, 0, 2, 0, 1, 2,
        0, 0, 2, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 2, 1, 1, 0, 2, 0, 2, 1,
        1, 0, 0, 2, 2, 1, 0, 0, 0, 0, 2, 0, 0, 0, 2, 1, 1, 1, 1, 0, 1, 0,
        2, 1, 2, 0, 2, 0, 2, 1, 0, 1, 2, 2, 0, 0, 1, 0, 1, 1, 0, 0, 2, 0,
        0, 0, 2, 1, 1, 1, 2, 0, 2, 0, 0, 1, 0, 2, 1, 1, 1, 1, 0, 2, 1, 0,
        2, 0, 0, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 0, 0, 2,
        0, 2, 2, 0, 1, 2, 1, 1, 0, 2, 1, 0, 0, 1, 0, 0, 0, 2, 2, 2, 1, 2,
        0, 0, 0, 0, 1, 0, 2, 1, 0, 2, 0, 1, 1, 1, 2, 0, 0, 0, 1, 2, 1, 0,
        2, 1, 2, 2, 0, 2, 2, 1, 1, 0, 0, 1, 2, 2, 0, 1, 2, 2, 2, 2, 2, 2,
        1, 0, 1, 2, 1, 2, 1, 1, 0, 1, 0, 0, 2, 2, 0, 2, 1, 0, 0, 2, 0, 1,
        1, 0, 1, 2, 1, 2, 2, 0, 2, 2, 2, 0, 1, 0, 1, 0, 2, 2,

In [9]:
predicted_labels = best_config_stats["labels"]

In [10]:
from sklearn.metrics import adjusted_rand_score
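# Ground-truth labels first, predicted labels second (the score is symmetric, so the result is unchanged)
adjusted_rand_score(y, predicted_labels)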

1.0
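
A score of 1.0 does not require the predicted cluster ids to be identical to the ground-truth ids, because the adjusted Rand index only compares how the samples are grouped. The following small sketch illustrates this with made-up label vectors (not taken from the run above): a relabelled but otherwise identical partition still scores 1.0, whereas merging two clusters lowers the score.

```python
from sklearn.metrics import adjusted_rand_score

# Made-up label vectors for illustration
y_true = [0, 0, 1, 1, 2, 2]
y_pred_permuted = [2, 2, 0, 0, 1, 1]  # same grouping, different cluster ids
y_pred_merged = [0, 0, 0, 0, 1, 1]    # two ground-truth clusters merged into one

ari_permuted = adjusted_rand_score(y_true, y_pred_permuted)
ari_merged = adjusted_rand_score(y_true, y_pred_merged)

print(ari_permuted)
print(ari_merged)
```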