
**Smoke Test Dataset Description**

This synthetic dataset is built from randomly sampled numbers and colors. Each sample has:

- **ID**: A unique identifier.
- **Number**: A random value drawn uniformly from the interval [0, 10).
- **Color**: A categorical attribute chosen randomly from the set { 'Red', 'Green', 'Blue' }.

The target value for each sample is derived using the equation:  

$$\text{target} = a \times \text{number} + b \times \text{number}^2$$

A color-dependent offset is then added:
- 50 for samples whose color is 'Green'.
- 100 for samples whose color is 'Blue'.
- Nothing for samples whose color is 'Red'.
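
As a worked illustration of the rule above, here is a minimal sketch. The helper name `compute_target` is hypothetical, and the defaults `a=1`, `b=2` are assumed from the constants this notebook sets when it creates the data.

```python
def compute_target(number, color, a=1, b=2):
    """Quadratic base value plus a color-dependent offset (hypothetical helper)."""
    # Offsets follow the description: Green adds 50, Blue adds 100, Red adds nothing.
    offsets = {'Red': 0, 'Green': 50, 'Blue': 100}
    return a * number + b * number ** 2 + offsets[color]


# 1 * 2 + 2 * 2**2 + 50 = 60.0
print(compute_target(2.0, 'Green'))
```
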

The dataset also contains missing values:
- About 10% of the samples have a missing number and, drawn independently, about 10% have a missing color, so a sample can lack either attribute or both.
- In addition, the first row is guaranteed to have a missing value in either the number or the color column.

The target is computed before any values are removed, so it is never missing.
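
If the number and color columns each lose 10% of their values independently, the expected share of rows missing at least one of the two is 0.1 + 0.1 - 0.1 × 0.1 = 0.19. The sketch below shows one way to measure these rates on a DataFrame. The helper `missing_summary` and the toy rows are hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(df, columns=('number', 'color')):
    """Per-column missing rate plus the share of rows missing any listed column."""
    cols = list(columns)
    rates = df[cols].isna().mean().to_dict()
    rates['any'] = float(df[cols].isna().any(axis=1).mean())
    return rates


# Tiny hand-made frame (illustrative values only).
toy = pd.DataFrame({
    'number': [np.nan, 1.0, 2.0, 3.0],
    'color': ['Red', None, 'Blue', 'Green'],
})
summary = missing_summary(toy)
print(summary)
```
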

This dataset serves as a smoke test of an algorithm's robustness to missing data and to a non-linear (quadratic) relationship combined with a categorical offset.

---

In [4]:
import pandas as pd
import numpy as np
import random
import string
import os

In [5]:
dataset_name = "smoke_test_regression"

In [6]:
output_dir = f'./../../processed/{dataset_name}/'
outp_fname = os.path.join(output_dir, f'{dataset_name}.csv')

# Generation functions

In [13]:
def set_seed(seed_value=10):
    np.random.seed(seed_value)
    random.seed(seed_value)

In [14]:
def generate_id(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

In [15]:
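# Generate the dataset.
# Defaults are literal values (matching N_SAMPLES, A and B in the "Create Data" cell)
# because default arguments are evaluated when the function is defined.
def generate_dataset(n_samples=200, a=1, b=2):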
    ids = [generate_id() for _ in range(n_samples)]
    numbers = np.random.uniform(0, 10, n_samples)
    colors = np.random.choice(['Red', 'Green', 'Blue'], n_samples)
    
    # Compute the target based on the provided formula
    target = a * numbers + b * numbers**2
    target[colors == 'Green'] += 50
    target[colors == 'Blue'] += 100

    # Create a DataFrame
    df = pd.DataFrame({
        'id': ids,
        'number': numbers,
        'color': colors,
        'target': target
    })

    # Introduce missing values
    n_missing = int(0.1 * n_samples)
    missing_indices_number = np.random.choice(n_samples, n_missing, replace=False)
    missing_indices_color = np.random.choice(n_samples, n_missing, replace=False)

    df.loc[missing_indices_number, 'number'] = np.nan
    df.loc[missing_indices_color, 'color'] = np.nan
    
    # Ensure first row has a missing value in either number or color
    if random.choice([True, False]):
        df.loc[0, 'number'] = np.nan
    else:
        df.loc[0, 'color'] = np.nan

    return df

# Create Data

In [16]:
# Constants
N_SAMPLES = 200
A = 1
B = 2

# Set the seed for reproducibility
set_seed()

# Generate the dataset, passing the constants explicitly
df = generate_dataset(n_samples=N_SAMPLES, a=A, b=B)
df.head()


       id    number  color      target
0  C14AN3       NaN  Green  176.700313
1  5RKC75  0.207519    NaN    0.293648
2  UEPXC0  6.336482    Red   86.638499
3  IWY0SQ  7.488039    Red   119.62949
4  3LTXI3   4.98507  Green  104.686918


In [17]:
df.shape

(200, 4)
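
One way to sanity-check the generated targets is to recompute them on rows where both `number` and `color` are present. The sketch below does this for the five rows shown in the `df.head()` output above. The helper `targets_consistent` is hypothetical, and `a=1`, `b=2` are assumed to match the constants `A` and `B`.

```python
import numpy as np
import pandas as pd


def targets_consistent(df, a=1, b=2, atol=1e-3):
    """Recompute targets on complete rows and compare them with the stored values."""
    offsets = {'Red': 0, 'Green': 50, 'Blue': 100}
    # Rows with a missing number or color cannot be recomputed, so skip them.
    complete = df.dropna(subset=['number', 'color'])
    expected = (a * complete['number']
                + b * complete['number'] ** 2
                + complete['color'].map(offsets))
    return bool(np.allclose(expected, complete['target'], atol=atol))


# The five rows from the head() output, with missing cells as NaN.
sample = pd.DataFrame({
    'number': [np.nan, 0.207519, 6.336482, 7.488039, 4.98507],
    'color': ['Green', np.nan, 'Red', 'Red', 'Green'],
    'target': [176.700313, 0.293648, 86.638499, 119.62949, 104.686918],
})
print(targets_consistent(sample))
```

The tolerance is loose on purpose, since the displayed values are rounded to six decimals.
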

# Save Main Data File

In [20]:
os.makedirs(output_dir, exist_ok=True)  # make sure the output directory exists
df.to_csv(outp_fname, index=False, float_format="%.4f")
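
With `float_format="%.4f"`, floats are rounded to four decimals on write, and missing values are written as empty fields that `pd.read_csv` reads back as NaN. Below is a minimal round-trip sketch under those assumptions. It uses made-up rows and a temporary directory rather than the real output path.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up rows for illustration; not taken from the real dataset.
toy = pd.DataFrame({
    'id': ['AAA111', 'BBB222'],
    'number': [np.nan, 1.23456789],
    'color': ['Green', np.nan],
    'target': [50.0, 4.28271],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.csv')
    toy.to_csv(path, index=False, float_format="%.4f")
    restored = pd.read_csv(path)

# Missing cells survive the round trip; floats come back rounded to 4 decimals.
print(restored)
```
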