# Tutorial 1: TabZilla Datasets

This notebook demonstrates how to use TabZilla to download & preprocess datasets for analysis.

### Requirements

1. You need a Python environment with the packages listed below. We recommend following the instructions in our [README](https://github.com/naszilla/tabzilla/blob/main/README.md) to prepare a virtual environment with `venv`. Required packages:

- [`openml`](https://pypi.org/project/openml/)
- [`argparse`](https://pypi.org/project/argparse/)
- [`pandas`](https://pypi.org/project/pandas/)
- [`scikit-learn`](https://pypi.org/project/scikit-learn/)

2. Make an [OpenML](https://www.openml.org) account. You might need to authenticate your account using an API key.

3. Like all of our code, this notebook must be run from the TabZilla directory. Run the following cell to `cd` one level up:

In [1]:
%cd ../

/Users/duncan/research/active_projects/tabzilla/TabZilla


In [2]:
# make sure you can import the openml package
import openml

### If you can't import `openml`

Please read [this guide](https://docs.openml.org/Python-guide/), and make sure you can import `openml` before proceeding.

### If you *can* import `openml`!

Use TabZilla code to pre-process a dataset! To pre-process a dataset, pass a valid dataset name that follows the TabZilla naming convention: `openml__<dataset-name>__<dataset-id>`. The code below prepares the OpenML Audiology dataset and writes it to the folder `TabZilla/datasets`. Because we set `overwrite=False`, the function first checks whether the dataset has already been pre-processed:

In [3]:
from tabzilla_data_preprocessing import preprocess_dataset

dataset_path = preprocess_dataset('openml__audiology__7', overwrite=False)



openml__audiology__7                    | Found existing folder. Skipping.
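
Names such as `openml__audiology__7` can be checked for the expected shape before they are passed to `preprocess_dataset`. The sketch below is only an illustration: `parse_tabzilla_name` and `ParsedName` are hypothetical helpers, not part of TabZilla, and the pattern (source, dataset name, and numeric ID separated by double underscores) is inferred from the names shown in this notebook.

```python
from typing import NamedTuple


class ParsedName(NamedTuple):
    source: str        # e.g. "openml"
    dataset_name: str  # e.g. "audiology"
    dataset_id: int    # e.g. 7


def parse_tabzilla_name(name: str) -> ParsedName:
    """Split '<source>__<dataset-name>__<dataset-id>' into its parts.

    Hypothetical helper: the pattern is inferred from the example names,
    not taken from the TabZilla code base.
    """
    # The source is everything before the first "__", the ID is
    # everything after the last "__".
    source, first_sep, rest = name.partition("__")
    dataset_name, last_sep, id_str = rest.rpartition("__")
    if not (first_sep and last_sep and source and dataset_name and id_str.isdigit()):
        raise ValueError(f"not a TabZilla-style dataset name: {name!r}")
    return ParsedName(source, dataset_name, int(id_str))


parsed = parse_tabzilla_name("openml__audiology__7")
print(parsed.dataset_name, parsed.dataset_id)  # audiology 7
```

A name with a single underscore after the source, such as `openml_audiology__7`, raises `ValueError` here. Whether a well-formed name is actually supported still depends on the preprocessors that TabZilla registers.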


### Which datasets can I pre-process with TabZilla?

To see a list of all valid TabZilla dataset names, look at the keys of `tabzilla_data_preprocessing.preprocessors`:

In [4]:
from tabzilla_data_preprocessing import preprocessors

valid_dataset_names = list(preprocessors.keys())

for n in valid_dataset_names[:5]: 
    print(n)

print("...")

openml__sick__3021
openml__kr-vs-kp__3
openml__letter__6
openml__balance-scale__11
openml__mfeat-factors__12
...


## Read the pre-processed dataset

Now that we have prepared the Audiology dataset, we can read it using the TabZilla dataset interface. To read a pre-processed dataset, pass the local path where it is stored. Our pre-processing code writes pre-processed datasets to `TabZilla/datasets`:

In [5]:
from tabzilla_datasets import TabularDataset
from pathlib import Path

dataset = TabularDataset.read(Path("./datasets/openml__audiology__7"))

This dataset object contains lots of useful information, such as the number of categorical features, the dataset size, and so on:

In [6]:
help(type(dataset))

Help on class TabularDataset in module tabzilla_datasets:

class TabularDataset(builtins.object)
 |  TabularDataset(name: str, X: numpy.ndarray, y: numpy.ndarray, cat_idx: list, target_type: str, num_classes: int, num_features: Optional[int] = None, num_instances: Optional[int] = None, cat_dims: Optional[list] = None, split_indeces: Optional[list] = None, split_source: Optional[str] = None) -> None
 |  
 |  Methods defined here:
 |  
 |  __init__(self, name: str, X: numpy.ndarray, y: numpy.ndarray, cat_idx: list, target_type: str, num_classes: int, num_features: Optional[int] = None, num_instances: Optional[int] = None, cat_dims: Optional[list] = None, split_indeces: Optional[list] = None, split_source: Optional[str] = None) -> None
 |      name: name of the dataset
 |      X: matrix of shape (num_instances x num_features)
 |      y: array of length (num_instances)
 |      cat_idx: indices of categorical features
 |      target_type: {"regression", "classification", "binary"}
 |      n

In [7]:
print(f"dataset name: {dataset.name}")
print(f"target type: {dataset.target_type}")
print(f"number of target classes: {dataset.num_classes}")
print(f"number of features: {dataset.num_features}")
print(f"number of instances: {dataset.num_instances}")
print(f"indices of categorical features: {dataset.cat_idx}")

dataset name: openml__audiology__7
target type: classification
number of target classes: 24
number of features: 69
number of instances: 226
indices of categorical features: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68]
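
The `cat_idx` list together with `num_features` is enough to work out which columns are numerical. The following is a minimal sketch under that assumption; `split_feature_indices` is a hypothetical helper, not a TabZilla function, and the values passed in are the ones printed above for Audiology.

```python
def split_feature_indices(num_features, cat_idx):
    """Return (categorical, numerical) column indices, both sorted.

    Hypothetical helper: numerical columns are assumed to be every
    column index that does not appear in cat_idx.
    """
    cat = sorted(set(int(i) for i in cat_idx))
    if cat and (cat[0] < 0 or cat[-1] >= num_features):
        raise ValueError("cat_idx contains an index outside [0, num_features)")
    cat_set = set(cat)
    num = [i for i in range(num_features) if i not in cat_set]
    return cat, num


# Values printed above: 69 features, all of them categorical.
cat, num = split_feature_indices(69, list(range(69)))
print(len(cat), len(num))  # 69 0
```

With the real object, the call would be `split_feature_indices(dataset.num_features, dataset.cat_idx)`.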


### Where is the actual data?

All features are stored in attribute `X`, and targets are stored in attribute `y`:

In [8]:
print(f"X.shape: {dataset.X.shape}")
print(f"y.shape: {dataset.y.shape}")

X.shape: (226, 69)
y.shape: (226,)


In [9]:
# first instance in the dataset:
print("X[0, :]:")
print(dataset.X[0, :])

# first target:
print("y[0]:")
print(dataset.y[0])

X[0, :]:
[1 0 0 2 2 3 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 1 0 0 0 0]
y[0]:
2
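
For a classification dataset, a quick look at how often each label occurs in `y` can be useful before training. The sketch below uses only the standard library and a small stand-in label list; `class_counts` is a hypothetical helper, and the toy labels are not the real Audiology targets.

```python
from collections import Counter


def class_counts(y):
    """Count how often each integer class label occurs, sorted by label.

    Hypothetical helper; works on any iterable of integer-like labels,
    which includes a 1-D NumPy array such as dataset.y.
    """
    counts = Counter(int(label) for label in y)
    return dict(sorted(counts.items()))


# Stand-in labels for illustration only.
toy_y = [2, 7, 7, 2, 19]
print(class_counts(toy_y))  # {2: 2, 7: 2, 19: 1}
```

With the real object, `class_counts(dataset.y)` would return one entry per label present in the data. This only makes sense for classification targets, because `int()` would truncate regression values.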


### Dataset splits

To maintain consistency between experiments, we use the fixed dataset splits defined in the OpenML task. These splits are also stored in the dataset object. The attribute `dataset.split_indeces` is a list of 10 dictionaries, each containing the indices of the training, validation, and test instances under the keys `train`, `val`, and `test`:

In [10]:
dataset.split_indeces[0]

{'train': array([ 96, 163, 200,  70,  38,  31,  30,  37,  98,  20, 170, 112, 138,
         78, 132, 119,  24,  47, 141, 108, 123,  28, 143,  34, 208, 158,
        206,  73, 196, 203, 155,  84, 189, 152, 166, 182,  57, 171, 201,
        184,  94,  72, 137, 114, 202,  54, 150, 118, 205,  27,  61, 225,
         40, 168,  41, 125, 221, 116, 101, 133, 177, 178, 165, 154, 160,
        107, 211, 135, 120,  58, 180,  45,  55,  32, 153, 161,  99,  35,
         36, 144, 145, 115, 162,  56,  85, 218,  88, 159,  17, 121,  71,
        142, 105,  82,  33, 136, 173, 109,   0, 187,  68, 188,  29, 223,
         97, 176, 169, 172,  62,  87, 214,  23, 127, 134, 131, 207,  95,
        149,  52,  83,  18, 157, 209, 181, 140,  16,  39,  63,   8,  48,
         25,   2, 156,  10, 106,  64, 213, 197, 199,  91,   1,  66, 147,
          7, 148,  15, 222,  67,  49,  44, 151,  13, 179,  93, 215, 117,
         60,  79,  26, 198, 124,  59, 130,  14,  65,   4, 190,   6, 126,
        216,  90,   9,  53, 183, 100,  46,
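
Before using a split, it can be worth checking that its three index sets do not overlap. The following is a minimal sketch with a toy split; `check_split` is a hypothetical helper, and the keys `train`, `val`, and `test` are the ones used elsewhere in this notebook. Whether the real splits cover every instance is not stated here, so the function only reports what it finds.

```python
def check_split(split, num_instances):
    """Summarise one split dictionary with 'train', 'val' and 'test' keys.

    Hypothetical helper: reports the number of unique indices per part,
    whether the parts are pairwise disjoint, and whether together they
    cover every index in range(num_instances).
    """
    train, val, test = (
        set(int(i) for i in split[key]) for key in ("train", "val", "test")
    )
    disjoint = not (train & val or train & test or val & test)
    covers_all = (train | val | test) == set(range(num_instances))
    return {
        "train": len(train),
        "val": len(val),
        "test": len(test),
        "disjoint": disjoint,
        "covers_all": covers_all,
    }


# Toy split for illustration only.
toy_split = {"train": [0, 1, 2, 3, 4, 5], "val": [6, 7], "test": [8, 9]}
summary = check_split(toy_split, num_instances=10)
print(summary)
```

With the real object, the call would be `check_split(dataset.split_indeces[0], dataset.num_instances)`.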

## Processing Datasets for ML pipelines

The pre-processing described above simply downloads the OpenML datasets and saves them in a TabZilla-readable format. Before passing these datasets into an ML pipeline, we need to run some additional processing steps: scaling and cleaning the features, encoding categorical features, and encoding the target. All of these are handled by the function [`tabzilla_data_processing.process_data`](tabzilla/blob/main/TabSurvey/tabzilla_data_processing.py).

The function `process_data` takes as input the dataset and the indices of the training, validation, and test instances, in that order. These indices can be read from the dataset attribute `split_indeces`.

Here is an example using our pre-processed Audiology dataset:

In [11]:
from tabzilla_data_processing import process_data

processed_dataset_fold0 = process_data(
    dataset,
    dataset.split_indeces[0]['train'],
    dataset.split_indeces[0]['val'],
    dataset.split_indeces[0]['test'],
    impute=True,
)

The processed dataset object contains (X, y) pairs for the training, testing, and validation data:

In [12]:
print(processed_dataset_fold0["data_test"])

(array([[0, 0, 0, ..., 0, 0, 0],
       [0, 2, 0, ..., 0, 0, 0],
       [0, 2, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 2, 0, ..., 0, 0, 0]], dtype=object), array([ 7,  7,  7,  7,  7,  2,  2,  2,  2,  2,  2, 19, 18, 18,  6,  6,  3,
        3,  3, 14, 22, 17,  5]))
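
The same call can be repeated for every entry in `split_indeces` to obtain one processed dataset per fold. The sketch below shows the loop with stand-ins so that it runs without TabZilla installed: `process_all_folds`, `fake_process`, and `toy_dataset` are hypothetical, and the argument order (dataset, train, val, test) mirrors the `process_data` call shown above.

```python
from types import SimpleNamespace


def process_all_folds(dataset, process_fn, **kwargs):
    """Apply process_fn to every split stored in dataset.split_indeces.

    Hypothetical helper: process_fn is expected to accept
    (dataset, train_idx, val_idx, test_idx, **kwargs), matching the
    process_data call used in this notebook.
    """
    return [
        process_fn(dataset, split["train"], split["val"], split["test"], **kwargs)
        for split in dataset.split_indeces
    ]


def fake_process(dataset, train_idx, val_idx, test_idx, impute=True):
    # Stand-in for process_data: only reports the size of each part.
    return {"n_train": len(train_idx), "n_val": len(val_idx), "n_test": len(test_idx)}


# Toy dataset with two folds, for illustration only.
toy_dataset = SimpleNamespace(
    split_indeces=[
        {"train": [0, 1, 2], "val": [3], "test": [4]},
        {"train": [2, 3, 4], "val": [0], "test": [1]},
    ]
)
folds = process_all_folds(toy_dataset, fake_process, impute=True)
print(len(folds), folds[0])
```

With TabZilla importable, the equivalent call would be `process_all_folds(dataset, process_data, impute=True)`, and each result could be unpacked as `X_test, y_test = result["data_test"]`.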
