## Dataset Generation for Benchmarking

In this section, we generate synthetic datasets to evaluate the performance differences between pandas and polars when handling DataFrames. A systematic benchmarking approach requires datasets of varying sizes and characteristics, allowing us to measure the performance impact of different data operations under various conditions.

### Objectives

The datasets generated here are designed to:

1. **Simulate Realistic Data Patterns**: We include numerical, textual, and date columns to mimic the diverse types of data typically handled in data processing tasks.
2. **Scale Across Different Sizes**: By generating datasets with increasing numbers of rows (from thousands to millions), we can observe how well each library handles different data volumes.
3. **Provide Incremental Samples per Size Category**: For each predefined size (e.g., 10,000 rows, 1 million rows), we create multiple datasets with a slightly increasing number of rows. This approach allows us to evaluate performance on a finer scale, capturing how incremental data increases affect each library’s efficiency.
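
The incremental sampling in item 3 can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function name `incremental_row_counts`, the number of samples per category, and the 2% step are all assumptions chosen for the example.

```python
def incremental_row_counts(base_rows: int, n_samples: int = 5, step_percent: int = 2) -> list[int]:
    """Return n_samples row counts, starting at base_rows.

    Each successive count adds step_percent percent of the base size.
    Both defaults are hypothetical; the real values live in the project config.
    """
    return [base_rows + (base_rows * step_percent * i) // 100 for i in range(n_samples)]


# Row counts for two example size categories
for base in (10_000, 1_000_000):
    print(base, incremental_row_counts(base))
```

With these assumed defaults, the 10,000-row category yields datasets of 10,000, 10,200, 10,400, 10,600, and 10,800 rows.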

### Dataset Structure

Each dataset has:

- A combination of numeric, string, and datetime columns to represent different data types.
- Rows increasing progressively by a percentage of the base size, allowing for a more granular benchmark.

These datasets are saved as CSV files named in the format `{number_of_rows}_{number_of_columns}.csv` and stored in the `datasets` directory for easy access during the benchmarking process.
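
To make the structure and naming convention concrete, here is a minimal sketch that builds a small mixed-type dataset and writes it using the `{number_of_rows}_{number_of_columns}.csv` pattern. It uses only the Python standard library so that it does not depend on either library under test. The column names (`id`, `value`, `category`, `timestamp`) and the helpers `make_rows` and `save_dataset` are hypothetical and are not taken from the project's `DataGenerator`.

```python
import csv
import random
import string
from datetime import datetime, timedelta
from pathlib import Path


def make_rows(n_rows: int, seed: int = 0) -> list[dict]:
    """Build n_rows records with numeric, string, and datetime columns (assumed schema)."""
    rng = random.Random(seed)  # fixed seed keeps the data reproducible
    start = datetime(2020, 1, 1)
    rows = []
    for i in range(n_rows):
        rows.append({
            "id": i,                                                         # integer column
            "value": rng.gauss(0.0, 1.0),                                    # float column
            "category": "".join(rng.choices(string.ascii_lowercase, k=5)),   # string column
            "timestamp": (start + timedelta(minutes=rng.randrange(1_000_000))).isoformat(),
        })
    return rows


def save_dataset(rows: list[dict], output_dir: str = "datasets") -> Path:
    """Write rows to <output_dir>/<n_rows>_<n_columns>.csv and return the path.

    Assumes rows is non-empty, since the column names are read from the first record.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    columns = list(rows[0].keys())
    path = out / f"{len(rows)}_{len(columns)}.csv"
    with path.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
    return path
```

Calling `save_dataset(make_rows(10_000))` under these assumptions would produce `datasets/10000_4.csv`, which either library can then read back for benchmarking.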

In [None]:
from data_generation.data_generator import DataGenerator
from data_generation.config import DATASET_SIZES

generator = DataGenerator(library='pandas')  # Change to 'polars' to use Polars

# Generate every series of datasets
all_datasets = generator.generate_all_datasets(DATASET_SIZES)

# Save all datasets to the specified folder
generator.save_datasets(all_datasets, output_dir="datasets")