# Modulus8 Dataset

This script generates a synthetic dataset, named "Modulus8", designed to challenge multi-class classification models in a high-dimensional feature space. The dataset comprises records with 100 features: 50 of them influence the determination of the target class, while the remaining 50 introduce noise, thereby adding ambiguity to the classification task.

Dataset generation starts by drawing random integers between 0 and 100 (inclusive) for both the predictor and the dummy features. To determine the target class, the values of the first 50 features are summed and the sum is reduced modulo 8. The result of this modulus operation is the class label, which ranges from 0 to 7. As a consequence, classes are not defined by distinct value ranges of individual features but by a sum across many features that is then wrapped around by the modulus.
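The labeling rule described above can be sketched for a single record. This is a minimal illustration, not the script's own code: the seed, the use of `np.random.default_rng`, and the variable names are assumptions made for the example.

```python
import numpy as np

# Illustrative generator and seed (assumptions; the notebook seeds NumPy's global state instead).
rng = np.random.default_rng(0)

# One record: 50 predictor values drawn from 0..100 inclusive.
row = rng.integers(0, 101, size=50)
label = int(row.sum() % 8)

# Raising a single predictor by 1 shifts the record to the next class (mod 8).
# The smallest value is bumped so the record stays within the 0..100 range.
bumped = row.copy()
bumped[int(np.argmin(row))] += 1
bumped_label = int(bumped.sum() % 8)

print(label, bumped_label)
```

A change of one unit in any predictor is enough to change the class, which is what separates this target from one defined by value ranges.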

Every record in the "Modulus8" dataset includes a unique ID, the 100 features (50 predictors and 50 noise features), and the target class, an integer between 0 and 7 given by the modulus result. The script first draws a pool of 25,000 random records, which yields roughly 3,125 records per class on average. It then subsamples the pool to reach an imbalanced class distribution: 100, 200, 300, 500, 800, 1,200, 2,000, and 3,000 samples for classes 0 through 7, for 8,100 records in total.
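Before subsampling, it can help to check that the random pool holds enough records of every class. The sketch below assumes the pool size of 25,000 and the per-class targets used in `generate_dataset`; the names `pool`, `available`, and `enough` are hypothetical.

```python
import numpy as np

np.random.seed(0)

# Pool of predictor rows, sized like the one drawn in generate_dataset.
pool = np.random.randint(0, 101, size=(25000, 50))
labels = pool.sum(axis=1) % 8

# Number of pool records available for each class 0..7.
available = np.bincount(labels, minlength=8)

# Per-class sample counts requested by the script.
desired = np.array([100, 200, 300, 500, 800, 1200, 2000, 3000])

# True only if every class has at least as many records as requested.
enough = bool((available >= desired).all())
print(available, enough)
```

Because the largest class needs 3,000 records and the pool provides about 3,125 per class on average, the margin is fairly small; a larger pool would make a shortfall less likely for other seeds.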

The very architecture of this dataset creates a formidable classification challenge. With the entangled summation and modulus operations and the added noise features, pinpointing class boundaries becomes a non-trivial task for models.
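One way to see why class boundaries are hard to find: the label depends only on each predictor's remainder modulo 8, not on its magnitude. The following sketch checks that identity on random data; the generator, seed, and array sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed
X = rng.integers(0, 101, size=(1000, 50))

# Label as defined for the dataset: sum of predictors, modulo 8.
labels_direct = X.sum(axis=1) % 8

# Same label computed from the per-feature remainders only.
labels_from_residues = (X % 8).sum(axis=1) % 8

print(np.array_equal(labels_direct, labels_from_residues))
```

Since only the low-order part of each value matters, models that split on whether a feature is large or small may find little signal in any single predictor.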

The resulting "Modulus8" dataset offers a demanding test bed for classification algorithms. Although synthetic, it combines difficulties that also arise in real-world data: a high-dimensional feature space, uninformative features, and an imbalanced class distribution.


In [49]:
import os
import numpy as np
import pandas as pd
import random
from typing import List

In [50]:
dataset_name = 'modulus8'

In [51]:
output_dir = f'./../../processed/{dataset_name}/'
outp_fname = os.path.join(output_dir, f'{dataset_name}.csv')
outp_chart_fname = os.path.join(output_dir, f'{dataset_name}_plot.png')

# Generate Data

In [52]:
def set_seed(seed_value=0):
    np.random.seed(seed_value)
    random.seed(seed_value)

In [53]:
def generate_samples(n_samples: int, n_predictors: int, n_dummy: int) -> tuple:
    """
    Generate random predictor and dummy values for a specified number of samples.
    
    Parameters:
    - n_samples (int): Number of samples to generate.
    - n_predictors (int): Number of predictor features.
    - n_dummy (int): Number of dummy/noise features.

    Returns:
    - tuple: Generated predictor values, dummy values, computed class labels.
    """
    predictors = np.random.randint(0, 101, size=(n_samples, n_predictors))
    dummy_features = np.random.randint(0, 101, size=(n_samples, n_dummy))
    labels = (predictors.sum(axis=1) % 8)
    
    return predictors, dummy_features, labels

In [54]:
def subsample_class_data(
        predictors: np.array,
        dummy_features: np.array,
        labels: np.array,
        class_label: int,
        n_samples: int
    ) -> tuple:
    """
    Subsample predictor and dummy values to match a specified class label and number of samples.
    
    Parameters:
    - predictors (np.array): Predictor values.
    - dummy_features (np.array): Dummy values.
    - labels (np.array): Computed class labels.
    - class_label (int): Desired class label.
    - n_samples (int): Number of samples to subsample.

    Returns:
    - tuple: Subsampled predictor values, dummy values, target values.
    """
    mask = (labels == class_label)
    
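    selected_predictors = predictors[mask][:n_samples]
    selected_dummies = dummy_features[mask][:n_samples]
    # Size the targets from the rows actually selected, so the arrays stay
    # aligned even if fewer than n_samples rows carry this label.
    selected_targets = np.full(len(selected_predictors), class_label)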
    
    return selected_predictors, selected_dummies, selected_targets

In [55]:
def generate_dataset(
        class_samples: list = [100, 200, 300, 500, 800, 1200, 2000, 3000],
        n_predictors: int = 50,
        n_dummy: int = 50
    ) -> pd.DataFrame:
    """
    Generate a synthetic dataset for multiclass classification.
    
    Parameters:
    - class_samples (list): A list containing the number of samples desired for each class.
    - n_predictors (int): Number of actual predictor features.
    - n_dummy (int): Number of dummy/noise features.

    Returns:
    - pd.DataFrame: A pandas DataFrame containing the generated dataset.
    """   
    all_predictors, all_dummies, all_targets = [], [], []
    
    predictors, dummy_features, labels = generate_samples(25000, n_predictors, n_dummy)
    
    for class_label in range(8):
        selected_predictors, selected_dummies, selected_targets = subsample_class_data(
            predictors, dummy_features, labels, class_label, class_samples[class_label]
        )
        
        all_predictors.append(selected_predictors)
        all_dummies.append(selected_dummies)
        all_targets.append(selected_targets)

    # Concatenate the results
    predictors = np.vstack(all_predictors)
    dummy_features = np.vstack(all_dummies)
    target = np.concatenate(all_targets)

    # Create a dataframe
    df = pd.DataFrame(
        np.hstack([predictors, dummy_features]),
        columns=[f'predictor_{i+1}' for i in range(n_predictors)] + \
        [f'dummy_{i+1}' for i in range(n_dummy)]
    )
    
    # Add the target column
    df['target'] = target
    
    # Shuffle data
    df = df.sample(frac=1.0, replace=False)

    # Add an ID column
    df['id'] = range(len(df))
    
    # Arrange the columns
    df = df[['id'] + [col for col in df if col != 'id']]

    return df

In [56]:
# Set seed for reproducibility
set_seed()

# Generate original dataset
data = generate_dataset()

print(data.shape)
print(data.head())

(8100, 102)
      id  predictor_1  predictor_2  predictor_3  predictor_4  predictor_5  \
7451   0           92           34           12           95           19   
6586   1           23           60           17           90           82   
4081   2           46           46           12           48           63   
2655   3           81           11           12           63            5   
3339   4           47           25           72           91           40   

      predictor_6  predictor_7  predictor_8  predictor_9  ...  dummy_42  \
7451            2           55           87            6  ...        81   
6586           40           80           82           70  ...        96   
4081           71           91           52           61  ...        67   
2655           35           92            1           54  ...        37   
3339           93           27           71           59  ...        96   

      dummy_43  dummy_44  dummy_45  dummy_46  dummy_47  dummy_48  dummy_49

# Verify data was correctly generated

In [57]:
def verify_target_calculation(df: pd.DataFrame, n_predictors: int = 50) -> bool:
    """
    Verify that the target is correctly calculated based on the predictor features.

    Parameters:
    - df (pd.DataFrame): The generated dataset.
    - n_predictors (int): Number of predictor features.

    Returns:
    - bool: True if all targets are correctly calculated, False otherwise.
    """
    # Calculate the target based on the predictor features
    calculated_targets = df.iloc[:, 1:n_predictors+1].sum(axis=1) % 8

    # Check if the calculated targets match the actual targets in the dataframe
    matches = (calculated_targets == df['target']).all()

    # Check the counts per class (reindex so a missing class counts as 0
    # instead of shifting the comparison out of alignment)
    desired_counts = [100, 200, 300, 500, 800, 1200, 2000, 3000]
    class_counts = df['target'].value_counts().reindex(range(8), fill_value=0)

    correct_counts = class_counts.tolist() == desired_counts

    return bool(matches and correct_counts)

print(verify_target_calculation(data))


True


# Save Main Data File

In [58]:
os.makedirs(output_dir, exist_ok=True)  # to_csv does not create missing directories
data.to_csv(outp_fname, index=False, float_format="%.4f")
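
A saved file can be sanity-checked by reading it back and recomputing the target. The sketch below uses a tiny stand-in frame and a temporary directory rather than the real output path; the frame size, the three-predictor layout, and the names `demo` and `round_trip_ok` are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Tiny stand-in frame with the same kind of columns (hypothetical sizes).
rng = np.random.default_rng(0)
predictors = rng.integers(0, 101, size=(20, 3))
demo = pd.DataFrame(predictors, columns=[f'predictor_{i+1}' for i in range(3)])
demo['target'] = predictors.sum(axis=1) % 8
demo.insert(0, 'id', range(len(demo)))

# Write with the same to_csv arguments as the notebook, then read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.csv')
    demo.to_csv(path, index=False, float_format="%.4f")
    reloaded = pd.read_csv(path)

# float_format only applies to float columns, so the integer values are unchanged.
recomputed = reloaded.filter(like='predictor_').sum(axis=1) % 8
round_trip_ok = bool((recomputed == reloaded['target']).all())
print(round_trip_ok)
```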