Basically, there are 3 different types of tasks:

1. Regression
2. Binary Classification
3. Multiclass Classification

However, we can split them further. Multiclass classification can be split by the number of classes:

- 3-class
- 4-class
- ...

All three task types can also be split by the kinds of features a dataset contains:
- only numerical
- only categorical
- mixed
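
A minimal sketch of how the feature-based split could be derived from the metadata counts. The helper name `feature_type` is hypothetical, and it assumes the counts passed in describe input features only. In the tables further down, a categorical target appears to be included in the categorical feature count, so it would have to be subtracted first.

```python
def feature_type(n_numeric: int, n_categorical: int) -> str:
    """Label a dataset by the kinds of input features it contains."""
    if n_numeric < 0 or n_categorical < 0:
        raise ValueError("feature counts must be non-negative")
    if n_numeric == 0 and n_categorical == 0:
        raise ValueError("dataset has no input features")
    if n_categorical == 0:
        return "numerical"
    if n_numeric == 0:
        return "categorical"
    return "mixed"


print(feature_type(5, 0))   # only numerical features
print(feature_type(0, 10))  # only categorical features
print(feature_type(4, 1))   # both kinds
```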

In [1]:
import openml

import pandas as pd

## Filtering available datasets

- Active datasets
- No missing values
    - If the original dataset already has missing values, we cannot trust the measured changes in downstream task performance
- At least 3000 instances (the threshold used in the filter code)
- At least 5 features
- Drop duplicated datasets
- Drop datasets with no information about the number of classes
- (At the end) we only use 50 datasets for each task (regression, binary, multiclass)

In [2]:
# First get all available tasks; each row also carries the metadata of its dataset
all_datasets = openml.tasks.list_tasks(output_format="dataframe")

# TODO

We probably want to filter on `source_data` here to catch all duplicated/augmented datasets.

In [30]:
datasets = all_datasets.copy()

# Datasets without missing values
datasets = datasets[datasets["NumberOfInstancesWithMissingValues"] == 0]

# Active datasets
datasets = datasets[datasets["status"] == "active"]

# Rename 
datasets = datasets.rename(columns={"NumberOfSymbolicFeatures": "NumberOfCategoricalFeatures"})

# Only look at datasets with at least 3000 instances and at least 5 features
datasets = datasets[datasets["NumberOfInstances"] >= 3000]
datasets = datasets[datasets["NumberOfFeatures"] >= 5]

# drop datasets with no information about the number of classes
datasets = datasets[~datasets["NumberOfClasses"].isna()]

# drop some unused columns
datasets = datasets.drop(columns=[
    "tid", "ttid", "task_type", "estimation_procedure", "evaluation_measures",
    "cost_matrix", "MaxNominalAttDistinctValues", "status", "target_value",
    "NumberOfMissingValues", "target_feature", "source_data", "number_samples",
    "source_data_labeled", "target_feature_event", "target_feature_left",
    "target_feature_right", "quality_measure", "NumberOfInstancesWithMissingValues"
])

# drop datasets that are very similar, i.e., datasets that have the same "shape"
columns_to_check_duplicates = datasets.columns.tolist()
columns_to_check_duplicates.remove("did")
columns_to_check_duplicates.remove("name")
columns_to_check_duplicates.remove("MinorityClassSize")
columns_to_check_duplicates.remove("MajorityClassSize")
# NOTE: the result of this cast is not assigned back, so it only checks
# that the remaining columns can be converted to int
datasets[columns_to_check_duplicates].astype(int)
datasets = datasets.drop_duplicates(subset=columns_to_check_duplicates)
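
To make the "same shape" rule concrete, here is a small sketch on made-up metadata rows (the ids, names, and counts are hypothetical, not taken from OpenML). Rows that agree on every column except the identifying ones are treated as duplicates, and only the first is kept.

```python
import pandas as pd

# Hypothetical metadata rows: the first two differ only in id and name.
toy = pd.DataFrame({
    "did": [1, 2, 3],
    "name": ["a", "a_copy", "b"],
    "NumberOfFeatures": [5, 5, 5],
    "NumberOfInstances": [4000, 4000, 9000],
    "NumberOfClasses": [2, 2, 2],
})

# Compare every column except the identifying ones.
shape_columns = [c for c in toy.columns if c not in ("did", "name")]
deduplicated = toy.drop_duplicates(subset=shape_columns, keep="first")

print(deduplicated["did"].tolist())
```

Note that this is only a heuristic: two unrelated datasets that happen to share the same counts would also be collapsed into one.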

### Regression datasets

Regression datasets are datasets with `0` classes.

In [31]:
regression = datasets[datasets["NumberOfClasses"] == 0].copy()

# drop some unused columns
regression = regression.drop(columns=[
    "MajorityClassSize", "MinorityClassSize", "NumberOfClasses"
])

# Sort
regression = regression.sort_values(["NumberOfFeatures", "NumberOfInstances"])

regression = regression.reset_index(drop=True)

In [32]:
regression[:50]

| | did | name | NumberOfFeatures | NumberOfInstances | NumberOfNumericFeatures | NumberOfCategoricalFeatures |
|---|---|---|---|---|---|---|
| 0 | 529 | pollen | 5.0 | 3848.0 | 5.0 | 0.0 |
| 1 | 1433 | svmguide1 | 5.0 | 7089.0 | 5.0 | 0.0 |
| 2 | 688 | visualizing_soil | 5.0 | 8641.0 | 4.0 | 1.0 |
| 3 | 23395 | COMET_MC_SAMPLE | 6.0 | 89640.0 | 6.0 | 0.0 |
| 4 | 23397 | COMET_MC_SAMPLE | 6.0 | 761940.0 | 6.0 | 0.0 |
| 5 | 5648 | COMET_MC | 6.0 | 7619400.0 | 6.0 | 0.0 |
| 6 | 507 | space_ga | 7.0 | 3107.0 | 7.0 | 0.0 |
| 7 | 42545 | stock_fardamento02 | 7.0 | 6277.0 | 6.0 | 1.0 |
| 8 | 198 | delta_elevators | 7.0 | 9517.0 | 7.0 | 0.0 |
| 9 | 23515 | sulfur | 7.0 | 10081.0 | 7.0 | 0.0 |


### Classification datasets

#### Binary Classification datasets

Binary Classification datasets are datasets with `2` classes.

In [33]:
binary_classification = datasets[datasets["NumberOfClasses"] == 2].copy()

# drop some unused columns
binary_classification = binary_classification.drop(columns=[
    "NumberOfClasses"
])

# Sort
binary_classification = binary_classification.sort_values(["NumberOfFeatures", "NumberOfInstances"])

binary_classification = binary_classification.reset_index(drop=True)

In [34]:
binary_classification[:50]

| | did | name | MajorityClassSize | MinorityClassSize | NumberOfFeatures | NumberOfInstances | NumberOfNumericFeatures | NumberOfCategoricalFeatures |
|---|---|---|---|---|---|---|---|---|
| 0 | 923 | visualizing_soil | 4753.0 | 3888.0 | 5.0 | 8641.0 | 3.0 | 2.0 |
| 1 | 871 | pollen | 1924.0 | 1924.0 | 6.0 | 3848.0 | 5.0 | 1.0 |
| 2 | 40983 | wilt | 4578.0 | 261.0 | 6.0 | 4839.0 | 5.0 | 1.0 |
| 3 | 1489 | phoneme | 3818.0 | 1586.0 | 6.0 | 5404.0 | 5.0 | 1.0 |
| 4 | 803 | delta_ailerons | 3783.0 | 3346.0 | 6.0 | 7129.0 | 5.0 | 1.0 |
| 5 | 1046 | mozilla4 | 10437.0 | 5108.0 | 6.0 | 15545.0 | 5.0 | 1.0 |
| 6 | 737 | space_ga | 1566.0 | 1541.0 | 7.0 | 3107.0 | 6.0 | 1.0 |
| 7 | 819 | delta_elevators | 4785.0 | 4732.0 | 7.0 | 9517.0 | 6.0 | 1.0 |
| 8 | 310 | mammography | 10923.0 | 260.0 | 7.0 | 11183.0 | 6.0 | 1.0 |
| 9 | 40922 | Run_or_walk_information | 44365.0 | 44223.0 | 7.0 | 88588.0 | 6.0 | 1.0 |


#### Multiclass Classification datasets

Multiclass Classification datasets are datasets with more than `2` classes.

In [35]:
multiclass_classification = datasets[datasets["NumberOfClasses"] > 2].copy()

# Sort
multiclass_classification = multiclass_classification.sort_values(["NumberOfClasses", "NumberOfFeatures", "NumberOfInstances"])

# reset indices
multiclass_classification = multiclass_classification.reset_index(drop=True)

In [36]:
multiclass_classification[:50]

| | did | name | MajorityClassSize | MinorityClassSize | NumberOfClasses | NumberOfFeatures | NumberOfInstances | NumberOfNumericFeatures | NumberOfCategoricalFeatures |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 41027 | jungle_chess_2pcs_raw_endgame_complete | 23062.0 | 4335.0 | 3.0 | 7.0 | 44819.0 | 6.0 | 1.0 |
| 1 | 1557 | abalone | 1447.0 | 1323.0 | 3.0 | 9.0 | 4177.0 | 7.0 | 2.0 |
| 2 | 119 | BNG(cmc,nominal,55296) | 23655.0 | 12555.0 | 3.0 | 10.0 | 55296.0 | 0.0 | 10.0 |
| 3 | 255 | BNG(cmc) | 23567.0 | 12447.0 | 3.0 | 10.0 | 55296.0 | 2.0 | 8.0 |
| 4 | 1226 | Click_prediction_small | 399482.0 | 67089.0 | 3.0 | 10.0 | 798964.0 | 9.0 | 1.0 |
| 5 | 1179 | BNG(solar-flare) | 994382.0 | 1393.0 | 3.0 | 13.0 | 1000000.0 | 0.0 | 13.0 |
| 6 | 1185 | BNG(wine) | 401055.0 | 277674.0 | 3.0 | 14.0 | 1000000.0 | 13.0 | 1.0 |
| 7 | 1222 | letter-challenge-unlabeled.arff | 10000.0 | 2760.0 | 3.0 | 17.0 | 20000.0 | 16.0 | 1.0 |
| 8 | 40497 | thyroid-ann | 3488.0 | 93.0 | 3.0 | 22.0 | 3772.0 | 21.0 | 1.0 |
| 9 | 1044 | eye_movements | 4262.0 | 2870.0 | 3.0 | 28.0 | 10936.0 | 24.0 | 4.0 |
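
If we wanted to split the multiclass datasets further into 3-class, 4-class, and so on, a grouping on `NumberOfClasses` would be enough. The sketch below uses a made-up stand-in frame (`toy`, with hypothetical ids) rather than the real `multiclass_classification` frame.

```python
import pandas as pd

# Hypothetical stand-in for `multiclass_classification`.
toy = pd.DataFrame({
    "did": [10, 11, 12, 13],
    "NumberOfClasses": [3.0, 3.0, 4.0, 10.0],
})

# Number of candidate datasets per class count.
per_class_count = toy.groupby("NumberOfClasses").size()
print(per_class_count)

# One frame per k-class problem, keyed by k.
by_k = {int(k): group for k, group in toy.groupby("NumberOfClasses")}
print(sorted(by_k))
```

Looking at `per_class_count` first shows quickly which class counts have enough datasets to be worth a separate group.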


# Discussion / Would like to hear your opinion

**I think we should**
1. Cap the number of instances and features.

    E.g., with a maximum of 50k instances and 20 features we only get 26 regression, 50 binary, and 13 multiclass datasets.
    What do you think? Should we go with these (fewer than the planned 150 datasets in total) or increase the caps so that we get 50 datasets for each task?
    
2. Try to remove these (probably) very similar datasets.

    E.g., for the binary tasks we get a lot of `FOREX_*` datasets, which are probably very similar.
    
 
3. Because the number of available datasets dropped very quickly, we should not divide them further, e.g., depending on their features (numerical, categorical, mixed).

    However, it could be interesting to look at this later on if we find outliers after imputation.
    An outlier here means that all experimental settings are the same, but an imputation method performs much worse on dataset `A` than on dataset `B`.
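
As a starting point for points 1 and 2, here is a rough sketch of how the caps and a name-based family filter could look. Everything in it is an assumption for illustration: the `toy` frame and its values are made up, `family_key` is a hypothetical helper, and the list of prefixes would have to be curated by hand.

```python
import pandas as pd

# Hypothetical stand-in for one of the per-task frames.
toy = pd.DataFrame({
    "name": ["FOREX_a", "FOREX_b", "phoneme", "mozilla4", "huge"],
    "NumberOfInstances": [40_000, 41_000, 5_404, 15_545, 1_000_000],
    "NumberOfFeatures": [11, 11, 6, 6, 14],
})

MAX_INSTANCES = 50_000  # caps from the example in point 1
MAX_FEATURES = 20

capped = toy[
    (toy["NumberOfInstances"] <= MAX_INSTANCES)
    & (toy["NumberOfFeatures"] <= MAX_FEATURES)
]


def family_key(name: str, prefixes=("FOREX_",)) -> str:
    """Map names with a known family prefix to that prefix, others to themselves."""
    for prefix in prefixes:
        if name.startswith(prefix):
            return prefix
    return name


# Keep only the first dataset of each family.
family = capped["name"].map(family_key)
one_per_family = capped[~family.duplicated(keep="first")]

print(one_per_family["name"].tolist())
```

An explicit prefix list is used instead of splitting every name at the first underscore, because otherwise unrelated pairs such as `delta_ailerons` and `delta_elevators` would be merged as well.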