# Tutorial 1: Working with the `TabularDataset` Class

Welcome to this comprehensive tutorial on using the `TabularDataset` class from TabCamel! This tutorial will guide you through:

- **Dataset Creation**: Loading datasets from OpenML for classification and regression tasks
- **Dataset Properties**: Understanding key attributes and methods
- **Sampling**: Creating stratified and uniform subsamples
- **Splitting**: Dividing datasets into train/test sets
- **Complex Operations**: Chaining operations for advanced data preprocessing workflows

By the end of this tutorial, you'll have a solid understanding of how to effectively use `TabularDataset` for your machine learning projects.


## 1. Environment Setup

First, let's prepare our Python environment by loading the necessary extensions and configuring automatic reloading for development.


In [1]:
# Enable auto-reloading of modules when they change
%load_ext autoreload
%autoreload 2

# Configure matplotlib to display plots inline
%matplotlib inline

In [2]:
# Import the main TabularDataset class
from tabcamel.data.dataset import TabularDataset

## 2. Creating Datasets

The `TabularDataset` class provides an easy way to load and work with datasets from OpenML, UCI, bnlearn, sklearn, pgmpy, and local sources. Let's explore how to create datasets for different types of machine learning tasks.


### 2.1 Loading OpenML Datasets

[OpenML](https://www.openml.org/) is a collaborative platform for machine learning that provides access to thousands of datasets. The `TabularDataset` class can automatically download and prepare these datasets for your machine learning experiments.


#### Classification Tasks

For classification problems, we need to specify `task_type="classification"`. Let's load the famous "adult" dataset (also known as the Census Income dataset) for binary classification:


In [3]:
# Create a classification dataset from OpenML
dataset_openml = TabularDataset(
    dataset_name="adult",           # OpenML dataset name
    task_type="classification",     # Specify this is a classification task
)

# Display basic information about the dataset
dataset_openml

TabularDataset(dataset_name=adult, task_type=classification, target_col=target, is_tensor=False)

In [4]:
# Print detailed information about the dataset
print(dataset_openml)

Dataset: adult
Task type: classification
Status (is_tensor): False
Number of samples: 48842
Number of features: 14 (Numerical: 2, Categorical: 12)
Number of classes: 2
Class distribution: {'<=50K': 0.7607182343065395, '>50K': 0.23928176569346055}


In [5]:
# Preview the first few rows of the dataset
dataset_openml.data_df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capitalgain | capitalloss | hoursperweek | native-country | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 1 | 0 | 2 | United-States | <=50K |
| 1 | 3 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 0 | United-States | <=50K |
| 2 | 2 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 2 | United-States | <=50K |
| 3 | 3 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 2 | United-States | <=50K |
| 4 | 1 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 2 | Cuba | <=50K |


#### Regression Tasks

For regression problems, we specify `task_type="regression"`. Let's load the "liver-disorders" dataset for continuous target prediction:


In [6]:
# Create a regression dataset from OpenML
dataset_openml = TabularDataset(
    dataset_name="liver-disorders",  # OpenML dataset name
    task_type="regression",  # Specify this is a regression task
)

# Display basic information about the regression dataset
dataset_openml

TabularDataset(dataset_name=liver-disorders, task_type=regression, target_col=target, is_tensor=False)

In [7]:
# Print detailed information about the regression dataset
print(dataset_openml)

Dataset: liver-disorders
Task type: regression
Status (is_tensor): False
Number of samples: 345
Number of features: 5 (Numerical: 5, Categorical: 0)
Number of classes: None
Class distribution: None


In [8]:
# Preview the first 10 rows of the regression dataset
dataset_openml.data_df.head(10)

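| | mcv | alkphos | sgpt | sgot | gammagt | target |
|---|---|---|---|---|---|---|
| 0 | 85 | 92 | 45 | 27 | 31 | 0.0 |
| 1 | 85 | 64 | 59 | 32 | 23 | 0.0 |
| 2 | 86 | 54 | 33 | 16 | 54 | 0.0 |
| 3 | 91 | 78 | 34 | 24 | 36 | 0.0 |
| 4 | 87 | 70 | 12 | 28 | 10 | 0.0 |
| 5 | 98 | 55 | 13 | 17 | 17 | 0.0 |
| 6 | 88 | 62 | 20 | 17 | 9 | 0.5 |
| 7 | 88 | 67 | 21 | 11 | 11 | 0.5 |
| 8 | 92 | 54 | 22 | 20 | 7 | 0.5 |
| 9 | 90 | 60 | 25 | 19 | 5 | 0.5 |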


## 3. Understanding Dataset Properties

The `TabularDataset` class provides several useful properties and methods to inspect and understand your data. Let's explore the key attributes and functionalities.


In [9]:
# Create a dataset instance to explore its properties
dataset = TabularDataset(
    dataset_name="adult",
    task_type="classification",
    metafeature_dict={},  # Optional: dictionary for storing metadata
    data_df=None,         # Optional: provide your own DataFrame
    target_col=None,      # Optional: specify custom target column
)

# Display all available methods and attributes
print("Available methods and attributes:")
print(dir(dataset))

Available methods and attributes:
['X_df', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_categorical_feature_list', '_class2distribution', '_class2samples', '_col2type', '_data_df', '_data_id', '_data_source', '_dataset_name', '_infer_column_types', '_init_data_df', '_init_dataset_properties', '_is_tensor', '_metafeature_dict', '_num_categorical_features', '_num_classes', '_num_features', '_num_numerical_features', '_num_samples', '_numerical_feature_list', '_parse_metafeatures', '_sanity_check', '_target_col', '_task_type', '_update_data_df', '_update_dataset_properties', 'categorical_feature_list', 'class2distribution', 'class2samples', 'class_list', 'col2type', 'data_df', 'data
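
The raw `dir()` listing mixes dunder methods and private `_`-prefixed attributes with the public API, which makes it hard to scan. A minimal sketch of filtering it down to public names follows; `ToyDataset` and `public_members` are hypothetical stand-ins written for this example and are not part of TabCamel.

```python
class ToyDataset:
    """Hypothetical stand-in with one private attribute and two public members."""

    def __init__(self):
        self._num_samples = 3  # private storage, hidden by the filter below

    @property
    def num_samples(self):
        return self._num_samples

    def sample(self):
        return []


def public_members(obj):
    """Return attribute names that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


print(public_members(ToyDataset()))  # ['num_samples', 'sample']
```

The same helper can be called as `public_members(dataset)` on the `TabularDataset` instance created above to list only its public properties and methods.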

In [10]:
# Get the total number of samples in the dataset
print(f"Total number of samples: {len(dataset)}")

Total number of samples: 48842


In [11]:
# Get the number of data indices (should match the total length)
# data_indices tracks which rows from the original dataset are included
print(f"Number of data indices: {len(dataset.data_indices)}")

Number of data indices: 48842


## 4. Dataset Subsampling

Subsampling is useful when you want to work with a smaller portion of your dataset for faster experimentation or when dealing with computational constraints. The `TabularDataset` class supports both stratified and uniform sampling.


In [12]:
# Create a fresh dataset for sampling experiments
dataset = TabularDataset(
    dataset_name="adult",
    task_type="classification",
)

print("Original dataset:")
print(dataset)

Original dataset:
Dataset: adult
Task type: classification
Status (is_tensor): False
Number of samples: 48842
Number of features: 14 (Numerical: 2, Categorical: 12)
Number of classes: 2
Class distribution: {'<=50K': 0.7607182343065395, '>50K': 0.23928176569346055}


In [13]:
# Perform stratified sampling to maintain class distribution
# sample_mode="stratified" ensures proportional representation of all classes
sample_dict = dataset.sample(
    sample_mode="stratified",  # Maintain class proportions
    sample_size=1000,  # Take 1000 samples
)

print("Sample dictionary keys:", sample_dict.keys())
print("\nSampled dataset:")
print(sample_dict["dataset_sampled"])

Sample dictionary keys: dict_keys(['dataset_sampled', 'sample_indices'])

Sampled dataset:
Dataset: adult
Task type: classification
Status (is_tensor): False
Number of samples: 1000
Number of features: 14 (Numerical: 2, Categorical: 12)
Number of classes: 2
Class distribution: {'<=50K': 0.76, '>50K': 0.24}
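
To make the idea of stratified sampling concrete, here is a minimal pure-Python sketch that allocates sample slots to each class in proportion to its size and then draws rows within each class. It is an illustration on toy labels only: the function name `stratified_sample`, the `seed` argument, and the largest-remainder rounding are assumptions for this example, and TabCamel's internal allocation may differ.

```python
import random
from collections import defaultdict


def stratified_sample(labels, sample_size, seed=0):
    """Return sorted row indices whose class mix approximates that of `labels`."""
    rng = random.Random(seed)
    rows_by_class = defaultdict(list)
    for row, label in enumerate(labels):
        rows_by_class[label].append(row)

    total = len(labels)
    # Ideal (fractional) quota per class, rounded down to whole rows.
    quotas = {c: sample_size * len(rows) / total for c, rows in rows_by_class.items()}
    counts = {c: int(q) for c, q in quotas.items()}
    # Hand leftover slots to the classes with the largest fractional remainders.
    leftover = sample_size - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1

    picked = []
    for c, rows in rows_by_class.items():
        picked.extend(rng.sample(rows, counts[c]))
    return sorted(picked)


# Toy labels with a 76% / 24% class mix.
labels = ["<=50K"] * 76 + [">50K"] * 24
sample = stratified_sample(labels, sample_size=10)
sampled_labels = [labels[i] for i in sample]
print(sampled_labels.count("<=50K"), sampled_labels.count(">50K"))  # 8 2
```

With 10 slots, the ideal quotas are 7.6 and 2.4 rows, so the sketch rounds to 8 and 2, the closest whole-row approximation of the original proportions.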


In [14]:
# View the first 10 sample indices (row numbers from original dataset)
print("First 10 sample indices:")
sample_dict["sample_indices"][:10]

First 10 sample indices:


[20745, 1127, 14826, 8235, 22110, 28885, 21041, 2248, 1003, 3237]

In [15]:
# Test stratified sampling consistency
# With the same seed, smaller samples should be subsets of larger ones
subsample20 = dataset.sample(
    sample_size=20,
    sample_mode="stratified",
)

subsample40 = dataset.sample(
    sample_size=40,
    sample_mode="stratified",
)

# Check if the 20-sample subset is contained in the 40-sample subset
is_subset = set(subsample20["sample_indices"]).issubset(subsample40["sample_indices"])
print(f"Is 20-sample subset contained in 40-sample subset? {is_subset}")
is_subset

Is 20-sample subset contained in 40-sample subset? True


True

In [16]:
# Test uniform sampling consistency
# Uniform sampling also maintains the subset property with the same seed
subsample20 = dataset.sample(
    sample_size=20,
    sample_mode="uniform",
)

subsample40 = dataset.sample(
    sample_size=40,
    sample_mode="uniform",
)

subsample100 = dataset.sample(
    sample_size=100,
    sample_mode="uniform",
)

# Check subset relationships for uniform sampling
subset_20_in_40 = set(subsample20["sample_indices"]).issubset(subsample40["sample_indices"])
subset_40_in_100 = set(subsample40["sample_indices"]).issubset(subsample100["sample_indices"])

print(f"20 ⊆ 40: {subset_20_in_40}, 40 ⊆ 100: {subset_40_in_100}")

20 ⊆ 40: True, 40 ⊆ 100: True
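
One common way to obtain this nested-subset property is to shuffle the row indices once with a fixed seed and take a prefix of the required length. The sketch below illustrates that technique only; `prefix_sample` and its `seed` parameter are hypothetical, and this tutorial does not show whether TabCamel implements sampling this way.

```python
import random


def prefix_sample(num_rows, sample_size, seed=42):
    """Take the first `sample_size` rows of a seed-fixed shuffle."""
    order = list(range(num_rows))
    random.Random(seed).shuffle(order)  # same seed, same order on every call
    return order[:sample_size]


s20 = prefix_sample(1000, 20)
s40 = prefix_sample(1000, 40)
s100 = prefix_sample(1000, 100)

# Each smaller sample is a prefix of the larger one, hence a subset.
print(set(s20) <= set(s40) <= set(s100))  # True
```

Because every call reproduces the same shuffled order, growing the sample size only appends rows and never replaces earlier ones.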


## 5. Dataset Splitting

Dataset splitting is crucial for machine learning to separate data into training and testing sets. The `TabularDataset` class provides flexible splitting options with stratification support.


In [17]:
# Create a fresh dataset for splitting experiments
dataset = TabularDataset(
    dataset_name="adult",
    task_type="classification",
)

print("Original dataset for splitting:")
print(dataset)

Original dataset for splitting:
Dataset: adult
Task type: classification
Status (is_tensor): False
Number of samples: 48842
Number of features: 14 (Numerical: 2, Categorical: 12)
Number of classes: 2
Class distribution: {'<=50K': 0.7607182343065395, '>50K': 0.23928176569346055}


In [18]:
# Perform stratified train-test split
# test_size=0.2 means 20% for testing, 80% for training
split_dict = dataset.split(
    test_size=0.2,  # 20% of data for testing
    split_mode="stratified",  # Maintain class proportions in both sets
)

print("Split dictionary keys:", split_dict.keys())
print("\nTraining set:")
print(split_dict["train_set"])
print("\nTest set:")
print(split_dict["test_set"])

Split dictionary keys: dict_keys(['train_set', 'test_set', 'indices_train', 'indices_test'])

Training set:
Dataset: adult
Task type: classification
Status (is_tensor): False
Number of samples: 39073
Number of features: 14 (Numerical: 2, Categorical: 12)
Number of classes: 2
Class distribution: {'<=50K': 0.7607299157986334, '>50K': 0.23927008420136667}

Test set:
Dataset: adult
Task type: classification
Status (is_tensor): False
Number of samples: 9769
Number of features: 14 (Numerical: 2, Categorical: 12)
Number of classes: 2
Class distribution: {'<=50K': 0.7606715119254785, '>50K': 0.23932848807452145}


In [19]:
# View the first 10 training indices
print("First 10 training indices:")
split_dict["indices_train"][:10]

First 10 training indices:


[34495, 18591, 12562, 552, 3479, 40259, 23462, 30226, 44653, 18603]

In [20]:
# Compare the training set DataFrame index with the training indices
# These should correspond to the same rows from the original dataset
print("Training set DataFrame index (first 10 rows):")
split_dict["train_set"].data_df.head(10).index

Training set DataFrame index (first 10 rows):


Index([34495, 18591, 12562, 552, 3479, 40259, 23462, 30226, 44653, 18603], dtype='int64')
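
Beyond matching indices, it is worth confirming that a split is a true partition: the train and test indices should not overlap, and together they should cover every original row. A small sketch of such a check on toy index lists follows; `check_partition` is a hypothetical helper written for this example.

```python
def check_partition(all_indices, train_indices, test_indices):
    """Report whether train/test indices form a clean partition of all_indices."""
    train, test = set(train_indices), set(test_indices)
    return {
        "disjoint": train.isdisjoint(test),
        "complete": train | test == set(all_indices),
        "no_duplicates": len(train) == len(train_indices)
        and len(test) == len(test_indices),
    }


all_indices = list(range(10))
report = check_partition(all_indices, [7, 2, 9, 0, 4, 5, 1, 8], [3, 6])
print(report)  # {'disjoint': True, 'complete': True, 'no_duplicates': True}
```

On the real split, the same helper could be called with `dataset.data_indices`, `split_dict["indices_train"]`, and `split_dict["indices_test"]`.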

## 6. Advanced Operations: Chaining Sampling and Splitting

Real-world machine learning workflows often require combining multiple operations. The `TabularDataset` class allows you to chain operations like sampling and splitting in various orders to create sophisticated data preprocessing pipelines.


### 6.1 Sample Then Split

This approach first reduces the dataset size through sampling, then splits the sampled data. This is useful when you want to work with a smaller dataset while maintaining proper train-test separation.


In [21]:
# Start with a fresh dataset for chaining operations
dataset = TabularDataset(
    dataset_name="adult",
    task_type="classification",
)

print("Original dataset:")
print(dataset)

Original dataset:
Dataset: adult
Task type: classification
Status (is_tensor): False
Number of samples: 48842
Number of features: 14 (Numerical: 2, Categorical: 12)
Number of classes: 2
Class distribution: {'<=50K': 0.7607182343065395, '>50K': 0.23928176569346055}


In [22]:
# Step 1: Stratified subsample 20% of the original data
# This maintains class distribution while reducing dataset size
dataset_subsample = dataset.sample(
    sample_size=0.2,  # Take 20% of original data
    sample_mode="stratified",
)["dataset_sampled"]

# Step 2: Stratified split the subsampled data into Train and Test sets (4:1 ratio)
# This ensures both training and testing maintain the original class distribution
split_dict = dataset_subsample.split(
    test_size=0.2,  # 20% of subsample for testing
    split_mode="stratified",
)
train_set = split_dict["train_set"]
test_set = split_dict["test_set"]

# Display the progression: Original → Subsample → Train/Test
print(
    f"Sample sizes - Original: {dataset.num_samples}, "
    f"Subsample: {dataset_subsample.num_samples}, "
    f"Train: {train_set.num_samples}, "
    f"Test: {test_set.num_samples}"
)

print("\nSubsampled dataset (20% of original):")
print(dataset_subsample)
print("\nTraining set (80% of subsample):")
print(train_set)
print("\nTest set (20% of subsample):")
print(test_set)

Sample sizes - Original: 48842, Subsample: 9768, Train: 7814, Test: 1954

Subsampled dataset (20% of original):
Dataset: adult
Task type: classification
Status (is_tensor): False
Number of samples: 9768
Number of features: 14 (Numerical: 2, Categorical: 12)
Number of classes: 2
Class distribution: {'<=50K': 0.7606470106470107, '>50K': 0.23935298935298935}

Training set (80% of subsample):
Dataset: adult
Task type: classification
Status (is_tensor): False
Number of samples: 7814
Number of features: 14 (Numerical: 2, Categorical: 12)
Number of classes: 2
Class distribution: {'<=50K': 0.7606859482979268, '>50K': 0.2393140517020732}

Test set (20% of subsample):
Dataset: adult
Task type: classification
Status (is_tensor): False
Number of samples: 1954
Number of features: 14 (Numerical: 2, Categorical: 12)
Number of classes: 2
Class distribution: {'<=50K': 0.7604912998976459, '>50K': 0.23950870010235414}


### 6.2 Split Then Sample

This approach first splits the dataset, then samples from the training portion. This is common in active learning scenarios where you want to simulate having access to a large "oracle" set but only use a small portion for training.


In [23]:
# Create a fresh dataset for the split-then-sample workflow
dataset = TabularDataset(
    dataset_name="adult",
    task_type="classification",
    metafeature_dict={},  # Empty metadata dictionary
    data_df=None,  # Let TabularDataset load the data
    target_col=None,  # Use default target column
)

print("Original dataset for split-then-sample:")
print(dataset)

Original dataset for split-then-sample:
Dataset: adult
Task type: classification
Status (is_tensor): False
Number of samples: 48842
Number of features: 14 (Numerical: 2, Categorical: 12)
Number of classes: 2
Class distribution: {'<=50K': 0.7607182343065395, '>50K': 0.23928176569346055}


In [24]:
# Step 1: Stratified split the data into Oracle and Test sets (4:1 ratio)
# The "oracle" set represents all available training data
split_dict = dataset.split(
    split_mode="stratified",
    test_size=0.2,  # 20% for final testing
)
oracle_set = split_dict["train_set"]  # 80% available for training
test_set = split_dict["test_set"]  # 20% reserved for final evaluation

print(f"Oracle set samples: {oracle_set.num_samples}, Test set samples: {test_set.num_samples}")

# Step 2: Stratified subsample 20% of the Oracle data for actual training
# This simulates having limited labeling budget in active learning
train_set = oracle_set.sample(
    sample_mode="stratified",
    sample_size=0.2,  # Use only 20% of available training data
)["dataset_sampled"]

print("\nFinal training set:")
print(train_set)

Oracle set samples: 39073, Test set samples: 9769

Final training set:
Dataset: adult
Task type: classification
Status (is_tensor): False
Number of samples: 7814
Number of features: 14 (Numerical: 2, Categorical: 12)
Number of classes: 2
Class distribution: {'<=50K': 0.7606859482979268, '>50K': 0.2393140517020732}


### 6.3 Sequential Splitting

Sometimes you need to create multiple splits, such as train/validation/test sets. This can be achieved by splitting twice in sequence.


In [25]:
# Start with a fresh dataset for sequential splitting
dataset = TabularDataset(
    dataset_name="adult",
    task_type="classification",
)

print("Original dataset for sequential splitting:")
print(dataset)

Original dataset for sequential splitting:
Dataset: adult
Task type: classification
Status (is_tensor): False
Number of samples: 48842
Number of features: 14 (Numerical: 2, Categorical: 12)
Number of classes: 2
Class distribution: {'<=50K': 0.7607182343065395, '>50K': 0.23928176569346055}


In [26]:
# Step 1: Initial split - divide data into training+validation and test sets
split_dict_1 = dataset.split(
    train_size=0.8,  # 80% for training+validation; the remaining 20% becomes the test set
    split_mode="stratified",  # Maintain class proportions in both sets
)

train_val_set = split_dict_1["train_set"]  # Training + validation set
test_set_1 = split_dict_1["test_set"]  # Test set

print("After initial split - Training + Validation set:")
print(train_val_set)
print(f"\nSplit sizes - Train+Val: {train_val_set.num_samples}, Test: {test_set_1.num_samples}")

# Step 2: Second split - divide training+validation into separate sets
split_dict_2 = train_val_set.split(
    train_size=0.8,  # 80% of train_val_set for training; the remaining 20% becomes the validation set
    split_mode="stratified",  # Maintain class proportions in both sets
)

final_train_set = split_dict_2["train_set"]  # Final training set
validation_set = split_dict_2["test_set"]  # Validation set

print("After second split - Final training set:")
print(final_train_set)
print(
    f"\nFinal split sizes - Train: {final_train_set.num_samples}, "
    f"Validation: {validation_set.num_samples}, "
    f"Test: {test_set_1.num_samples}"
)

After initial split - Training + Validation set:
Dataset: adult
Task type: classification
Status (is_tensor): False
Number of samples: 39073
Number of features: 14 (Numerical: 2, Categorical: 12)
Number of classes: 2
Class distribution: {'<=50K': 0.7607299157986334, '>50K': 0.23927008420136667}

Split sizes - Train+Val: 39073, Test: 9769
After second split - Final training set:
Dataset: adult
Task type: classification
Status (is_tensor): False
Number of samples: 31258
Number of features: 14 (Numerical: 2, Categorical: 12)
Number of classes: 2
Class distribution: {'<=50K': 0.7607332522874144, '>50K': 0.23926674771258558}

Final split sizes - Train: 31258, Validation: 7815, Test: 9769
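
The sizes printed above can be reproduced with a little arithmetic. The sketch below assumes that the held-out share is rounded up and the training side receives the remainder; this rounding rule is an assumption that happens to match the counts shown here, not a documented TabCamel guarantee. `split_sizes` is a hypothetical helper, and a 0.2 held-out fraction corresponds to `train_size=0.8` in the cell above.

```python
import math


def split_sizes(num_samples, test_fraction):
    """Assumed rounding: held-out share rounded up, training side gets the rest."""
    num_test = math.ceil(num_samples * test_fraction)
    return num_samples - num_test, num_test


# First split: 48842 rows into train+val and test.
train_val, test = split_sizes(48842, 0.2)
# Second split: train+val into train and validation.
train, val = split_sizes(train_val, 0.2)

print(train, val, test)  # 31258 7815 9769
# Overall shares of the original dataset: roughly 64% / 16% / 20%.
print(round(train / 48842, 3), round(val / 48842, 3), round(test / 48842, 3))
```

Applying an 80/20 split twice therefore yields approximately a 64/16/20 train/validation/test partition of the original data.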


In [27]:
# Verify that the final training set is a subset of the initial training+validation set
# This should be True since we split train_val_set to create final_train_set
is_subset = set(train_val_set.data_indices).issuperset(final_train_set.data_indices)
print(f"Is final training set a subset of train+val set? {is_subset}")

Is final training set a subset of train+val set? True


### 6.4 Data Cleaning Then Splitting

Real datasets often contain classes with very few samples, which makes stratified splitting unreliable. The `TabularDataset` class provides `drop_low_sample_class` to remove such classes before splitting.


In [28]:
# Load a dataset that may have imbalanced classes
dataset = TabularDataset(
    dataset_name="collins",
    task_type="classification",
)

print("Data indices before cleaning (sample from middle range):")
print(dataset.data_indices[480:500])

Data indices before cleaning (sample from middle range):
[480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499]


In [29]:
# Remove classes with fewer than 10 samples to ensure reliable splitting
# This is important for stratified splitting to work properly
dataset.drop_low_sample_class(min_sample_per_class=10)
print("Dataset after removing classes with < 10 samples:")
print(dataset)

Dataset after removing classes with < 10 samples:
Dataset: collins
Task type: classification
Status (is_tensor): False
Number of samples: 970
Number of features: 19 (Numerical: 19, Categorical: 0)
Number of classes: 26
Class distribution: {'109': 0.08247422680412371, '209': 0.08247422680412371, '207': 0.07731958762886598, '107': 0.07731958762886598, '106': 0.049484536082474224, '206': 0.049484536082474224, '201': 0.04536082474226804, '101': 0.04536082474226804, '105': 0.03711340206185567, '205': 0.03711340206185567, '108': 0.030927835051546393, '208': 0.030927835051546393, '214': 0.029896907216494847, '110': 0.029896907216494847, '210': 0.029896907216494847, '213': 0.029896907216494847, '114': 0.029896907216494847, '113': 0.029896907216494847, '102': 0.027835051546391754, '202': 0.027835051546391754, '111': 0.024742268041237112, '211': 0.024742268041237112, '104': 0.01752577319587629, '103': 0.01752577319587629, '204': 0.01752577319587629, '203': 0.01752577319587629}


In [30]:
# Examine data indices after cleaning (may have gaps due to removed classes)
print("Data indices after cleaning (sample from middle range):")
print(dataset.data_indices[480:500])

Data indices after cleaning (sample from middle range):
[486, 487, 488, 489, 490, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514]
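
The gaps in the indices above appear because rows belonging to removed classes disappear while the surviving rows keep their original positions. A minimal sketch of this filtering on toy labels follows; `drop_low_sample_classes` is a hypothetical function for illustration and is not the TabCamel implementation.

```python
from collections import Counter


def drop_low_sample_classes(labels, min_samples_per_class):
    """Return (kept_row_indices, kept_labels) after removing rare classes."""
    counts = Counter(labels)
    kept = [i for i, y in enumerate(labels) if counts[y] >= min_samples_per_class]
    return kept, [labels[i] for i in kept]


labels = ["a", "a", "rare", "b", "b", "a", "rare", "b"]
kept_indices, kept_labels = drop_low_sample_classes(labels, min_samples_per_class=3)

# Rows 2 and 6 (class "rare") are gone; the others keep their original indices.
print(kept_indices)  # [0, 1, 3, 4, 5, 7]
```

Keeping the original row numbers is what makes it possible to trace any later sample or split back to the unfiltered dataset.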


In [31]:
# Split the cleaned dataset - stratification will work properly now
split_dict = dataset.split(split_mode="stratified", test_size=0.2)
train_set = split_dict["train_set"]
test_set = split_dict["test_set"]

In [32]:
# Verify that training indices are a subset of the cleaned dataset indices
is_train_subset = set(train_set.data_indices).issubset(dataset.data_indices)
print(f"Are training indices a subset of cleaned dataset? {is_train_subset}")

Are training indices a subset of cleaned dataset? True


In [33]:
# Verify that test indices are also a subset of the cleaned dataset indices
is_test_subset = set(test_set.data_indices).issubset(dataset.data_indices)
print(f"Are test indices a subset of cleaned dataset? {is_test_subset}")

Are test indices a subset of cleaned dataset? True


### 6.5 Handling Very Small Datasets

When a sample is barely larger than the number of classes, stratified sampling has very little room to work with. Let's see how `TabularDataset` handles this edge case by drawing 19 rows from the 19-class soybean dataset.


In [34]:
# Load a small dataset and create a very small sample
data = TabularDataset(
    dataset_name="soybean",
    task_type="classification",
)

# Sample only 19 instances, which equals the number of classes in soybean
data_small = data.sample(sample_mode="stratified", sample_size=19)["dataset_sampled"]
print("Very small dataset after sampling:")
print(data_small)
print("\nThis demonstrates that TabularDataset can handle very small samples")
print("while maintaining stratification as much as possible.")

Very small dataset after sampling:
Dataset: soybean
Task type: classification
Status (is_tensor): False
Number of samples: 19
Number of features: 35 (Numerical: 0, Categorical: 35)
Number of classes: 19
Class distribution: {'2-4-d-injury': 0.05263157894736842, 'alternarialeaf-spot': 0.05263157894736842, 'anthracnose': 0.05263157894736842, 'bacterial-blight': 0.05263157894736842, 'bacterial-pustule': 0.05263157894736842, 'brown-spot': 0.05263157894736842, 'brown-stem-rot': 0.05263157894736842, 'charcoal-rot': 0.05263157894736842, 'cyst-nematode': 0.05263157894736842, 'diaporthe-pod-&-stem-blight': 0.05263157894736842, 'diaporthe-stem-canker': 0.05263157894736842, 'downy-mildew': 0.05263157894736842, 'frog-eye-leaf-spot': 0.05263157894736842, 'herbicide-injury': 0.05263157894736842, 'phyllosticta-leaf-spot': 0.05263157894736842, 'phytophthora-rot': 0.05263157894736842, 'powdery-mildew': 0.05263157894736842, 'purple-seed-stain': 0.05263157894736842, 'rhizoctonia-root-rot': 0.0526315789473
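
With 19 samples and 19 classes, the only allocation that keeps every class represented is one row per class, which is what the output above shows. One way to get this behavior is to reserve a slot for each class first and then distribute the remaining slots proportionally. The sketch below illustrates that idea on made-up class counts; `allocate_with_minimum` is hypothetical, and TabCamel may use a different strategy.

```python
def allocate_with_minimum(class_counts, sample_size):
    """Give every class one slot, then hand out the rest proportionally."""
    num_classes = len(class_counts)
    total = sum(class_counts.values())
    if not num_classes <= sample_size <= total:
        raise ValueError("sample_size must lie between #classes and #rows")

    allocation = dict.fromkeys(class_counts, 1)
    for _ in range(sample_size - num_classes):
        # Pick the class furthest below its proportional target that still has rows.
        neediest = max(
            (c for c in class_counts if allocation[c] < class_counts[c]),
            key=lambda c: class_counts[c] / total * sample_size - allocation[c],
        )
        allocation[neediest] += 1
    return allocation


# Made-up class sizes for illustration only.
class_counts = {"common": 90, "medium": 8, "rare": 2}
print(allocate_with_minimum(class_counts, 3))   # {'common': 1, 'medium': 1, 'rare': 1}
print(allocate_with_minimum(class_counts, 10))  # {'common': 8, 'medium': 1, 'rare': 1}
```

The trade-off is that rare classes end up over-represented in very small samples, so the sampled class distribution can differ noticeably from the original one.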

## 7. Summary and Best Practices

Congratulations! You've learned how to use the `TabularDataset` class effectively. Here's a summary of key concepts:

### Key Features Covered:
- **Dataset Loading**: Automatic download and preparation of OpenML datasets
- **Task Types**: Support for both classification and regression tasks
- **Sampling Methods**: Stratified and uniform sampling with consistent seeding
- **Splitting Options**: Flexible train-test splitting with stratification
- **Operation Chaining**: Combining sampling, splitting, and cleaning operations

### Best Practices:
1. **Use Stratified Methods**: For classification tasks, prefer `sample_mode="stratified"` and `split_mode="stratified"` so that subsamples and splits preserve the original class proportions
2. **Clean Before Splitting**: Remove problematic classes or samples before splitting to ensure reliable stratification
3. **Understand Index Tracking**: The `data_indices` attribute tracks which samples from the original dataset are included
4. **Progressive Sampling**: Smaller samples are subsets of larger samples when using the same seed
5. **Handle Edge Cases**: Stratified sampling still works for very small samples; for example, drawing 19 rows from the 19-class soybean dataset returned one row per class

### Next Steps:
- Explore other TabCamel modules for data transformation and feature engineering
- Experiment with different OpenML datasets for your specific use cases
- Integrate `TabularDataset` into your machine learning pipelines

Happy machine learning with TabCamel! 🐪