# Smoke-Test Dataset for Multiclass Classification

This synthetic dataset poses a multiclass classification problem with two features and a three-class target. Each sample has an identifier, two features (color, number), and a target class label.

The first feature, 'color', is categorical, with three possible values: Red, Green, and Blue. Each value is drawn independently and uniformly at random, so the three colors appear in roughly equal proportions.

The second feature, 'number', is numeric, with integer values drawn uniformly from 1 to 100 inclusive. The two features are sampled independently of each other.

The class label for each sample is determined by the combination of both the categorical and numerical features. If the color is 'Red' and the number is greater than 50, the sample is labeled as Class 2. If the color is 'Green' and the number is less than or equal to 50, the sample is labeled as Class 1. All other combinations are labeled as Class 0. This forms a multiclass classification problem, where the task is to classify samples based on the combination of their color and number features.
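The labeling rule above can also be expressed without an explicit loop. The following is a minimal sketch using `np.select`; the helper name `label_samples` is hypothetical and is not part of this notebook.

```python
import numpy as np

def label_samples(colors, numbers):
    """Hypothetical helper: apply the labeling rule to arrays of colors and numbers."""
    colors = np.asarray(colors)
    numbers = np.asarray(numbers)
    conditions = [
        (colors == 'Red') & (numbers > 50),     # Class 2
        (colors == 'Green') & (numbers <= 50),  # Class 1
    ]
    # Any sample matching neither condition falls through to Class 0.
    return np.select(conditions, [2, 1], default=0)

labels = label_samples(['Red', 'Green', 'Blue', 'Red'], [51, 50, 99, 50])
print(labels)  # [2 1 0 0]
```

The boundary cases are included on purpose: a Green sample with number 50 is Class 1, while a Red sample with number 50 is Class 0.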

The identifiers are randomly generated 6-character alphanumeric strings (uppercase letters and digits). They serve to identify each sample. Uniqueness is not explicitly enforced, although collisions are very unlikely at this dataset size.
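If strict uniqueness matters, one option is to keep drawing identifiers until enough distinct values have been collected. This is only a sketch under that assumption; `generate_unique_ids` and its `rng` parameter are hypothetical names, not functions defined in this notebook.

```python
import random
import string

def generate_unique_ids(n, size=6, chars=string.ascii_uppercase + string.digits, rng=None):
    """Hypothetical helper: return n distinct random alphanumeric ids."""
    rng = rng or random.Random()
    if n > len(chars) ** size:
        raise ValueError("not enough distinct ids of this size")
    seen = set()
    ids = []
    while len(ids) < n:
        candidate = ''.join(rng.choice(chars) for _ in range(size))
        if candidate not in seen:  # skip the rare duplicate and draw again
            seen.add(candidate)
            ids.append(candidate)
    return ids

ids = generate_unique_ids(200, rng=random.Random(0))
print(len(ids), len(set(ids)))  # 200 200
```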

Additionally, to add complexity, approximately 10% of the values in each feature are replaced with missing values. The first row always has a missing value in one of the two features, chosen at random, so that missing data is present even in small datasets. Class labels are assigned before values are removed, so a row with a missing feature keeps the label derived from its original values.
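To check how much of each feature is actually missing, the per-column fraction of missing values can be computed directly. The sketch below uses a small hand-made frame rather than the generated dataset; `missing_fraction` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def missing_fraction(df, columns=('color', 'number')):
    """Hypothetical helper: fraction of missing values in each feature column."""
    return df[list(columns)].isna().mean()

# Toy data for illustration only; not taken from the generated dataset.
toy = pd.DataFrame({
    'color': ['Red', np.nan, 'Blue', 'Green'],
    'number': [10.0, 60.0, np.nan, np.nan],
    'target': [0, 0, 0, 1],
})
report = missing_fraction(toy)
print(report)
```

On the generated dataset, each value would be expected to sit near 0.10, possibly slightly higher because of the forced missing value in the first row.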

In summary, this dataset poses a multiclass classification problem in which samples are classified from the combination of a categorical and a numeric feature in the presence of missing data. The relationship between the features and the target is known. The missing values are what add difficulty for multiclass classification algorithms.


In [25]:
import numpy as np
import pandas as pd
import os
import sys
import random
import string

In [26]:
dataset_name = "smoke_test_mc"

In [27]:
output_dir = f'./../../processed/{dataset_name}/'
outp_fname = os.path.join(output_dir, f'{dataset_name}.csv')

# Generation functions

In [28]:
def set_seed(seed_value):
    np.random.seed(seed_value)
    random.seed(seed_value)

In [29]:
def generate_id(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

In [30]:
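def create_dataset(num_rows, seed_value=None):
    # Optionally reseed; by default the seed set earlier via set_seed() is kept
    if seed_value is not None:
        set_seed(seed_value)

    # Generating random ids (uniqueness is not enforced)
    ids = [generate_id() for _ in range(num_rows)]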

    # Generating categorical feature with 3 categories
    colors = np.random.choice(['Red', 'Green', 'Blue'], size=num_rows)

    # Generating numeric feature from 1 to 100
    numbers = np.random.choice(list(range(1, 101)), size=num_rows)

    # Creating target based on rules
    # color is Red and number > 50 - class 2
    # color is Green and number <= 50 - class 1
    # otherwise - class 0
    target = []
    for color, number in zip(colors, numbers):
        if color == 'Red' and number > 50:
            target.append(2)
        elif color == 'Green' and number <= 50:
            target.append(1)
        else:
            target.append(0)

    # Create DataFrame
    df = pd.DataFrame({'id': ids, 'color': colors, 'number': numbers, 'target': target})

    # Introduce missing values by setting 10% of each feature to NaN
    for column in ['color', 'number']:
        missing_rows = np.random.choice(df.index, size=int(num_rows * 0.1), replace=False)
        df.loc[missing_rows, column] = np.nan

    # Make sure the first row has a null value in either of the two features
    first_missing_feature = np.random.choice(['color', 'number'])
    df.loc[0, first_missing_feature] = np.nan

    return df

# Create Data

In [31]:
set_seed(seed_value=2)
data = create_dataset(200)
data.head(10)

       id  color  number  target
0  DFFXKT    NaN     NaN       2
1  QNCK1Z  Green    53.0       0
2  6X826R    Red    63.0       2
3  CBX3UY   Blue    27.0       0
4  17K9LP   Blue    84.0       0
5  OBLULI    Red    26.0       0
6  66X69L   Blue    37.0       0
7  207XWX  Green     NaN       1
8  2KZ37P  Green    86.0       0
9  5R566W   Blue    19.0       0


In [33]:
data.shape

(200, 4)

# Save Main Data File

In [34]:
os.makedirs(output_dir, exist_ok=True)  # create the output directory if it does not exist
data.to_csv(outp_fname, index=False)
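
When the saved file is read back, missing values written as empty fields are expected to come back as NaN. The following sketch illustrates the round trip with a toy frame and a temporary directory; the file name `toy.csv` and the toy rows are assumptions for illustration, and passing `dtype={'id': str}` is a precaution so that identifiers are always read as strings.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy rows for illustration only; not taken from the generated dataset.
toy = pd.DataFrame({
    'id': ['A1B2C3', 'D4E5F6'],
    'color': ['Red', np.nan],
    'number': [63.0, np.nan],
    'target': [2, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.csv')
    toy.to_csv(path, index=False)                    # NaN is written as an empty field
    restored = pd.read_csv(path, dtype={'id': str})  # empty fields are read back as NaN

print(restored.isna().sum())
```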