# Modulus8 Dataset

This script generates a synthetic dataset, named "Modulus8", designed to challenge multi-class classification models: eight imbalanced classes whose labels depend jointly on five integer features.

Generation starts by drawing random integers between 0 and 100 (inclusive) for each of five predictor features. The target class is the sum of the five features modulo 8, so the label ranges from 0 to 7. Classes therefore do not correspond to distinct value ranges of any single feature; they depend on the sum across all features, wrapped by the modulus. The class distribution is deliberately uneven: the script draws a pool of 25,000 candidate rows and keeps the following number of rows per class:
- 100 samples for class 0
- 200 for class 1
- 300 for class 2
- 500 for class 3
- 800 for class 4
- 1,200 for class 5
- 2,000 for class 6
- 3,000 for class 7

Every record in the "Modulus8" dataset includes a unique ID (`id`), the five features (`predictor_1` to `predictor_5`), and the target class (`target`), an integer between 0 and 7 given by the result of the modulus operation.
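As a quick illustration of the labelling rule, the sketch below recomputes the class for a single record. The feature values are those of the first row shown in the printed output of the generation cell; the variable names are chosen only for this example.

```python
import numpy as np

# Feature values taken from the first row of the notebook's printed output.
row = np.array([13, 17, 30, 82, 81])

total = row.sum()    # 13 + 17 + 30 + 82 + 81 = 223
label = total % 8    # 223 = 27 * 8 + 7, so the remainder is 7
print(total, label)  # prints: 223 7
```

The result agrees with the `target` value of 7 listed for that row.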

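The construction makes classification hard by design. Because the label depends on the sum of all five features modulo 8, a single feature carries almost no information about the class on its own, and changing any one feature by 1 moves the record to a different class. Combined with the uneven class distribution, this makes class boundaries difficult for models to learn.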
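To see why class boundaries are hard to locate, it helps to look at what a one-unit change in a single feature does to the label. The following is a minimal sketch; `label_of` is a hypothetical helper introduced for illustration and is not part of the generation code.

```python
import numpy as np

def label_of(features: np.ndarray) -> int:
    """Hypothetical helper: class label as the feature sum modulo 8."""
    return int(features.sum() % 8)

base = np.array([9, 24, 2, 27, 62])   # sums to 124, so the label is 4
neighbors = []
for i in range(len(base)):
    bumped = base.copy()
    bumped[i] += 1                     # move one step along feature i
    neighbors.append(label_of(bumped))

print(label_of(base), neighbors)       # prints: 4 [5, 5, 5, 5, 5]
```

Every immediate neighbor in feature space falls into a different class, which suggests that models relying on local smoothness have little to work with here.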

Although synthetic, the resulting "Modulus8" dataset combines two difficulties that also occur in real-world data, interactions between features and class imbalance, which makes it a demanding test case for classification algorithms.


In [238]:
import os
import numpy as np
import pandas as pd
import random
from typing import List

In [239]:
dataset_name = 'modulus8'

In [240]:
output_dir = f'./../../processed/{dataset_name}/'
outp_fname = os.path.join(output_dir, f'{dataset_name}.csv')
outp_chart_fname = os.path.join(output_dir, f'{dataset_name}_plot.png')

# Generate Data

In [241]:
def set_seed(seed_value=0):
    np.random.seed(seed_value)
    random.seed(seed_value)

In [242]:
def generate_samples(n_samples: int, n_predictors: int) -> tuple:
    """
    Generate random predictors and their class labels for a number of samples.
    
    Parameters:
    - n_samples (int): Number of samples to generate.
    - n_predictors (int): Number of predictor features.

    Returns:
    - tuple: Generated predictor values, computed class labels.
    """
    # The upper bound is exclusive, so values fall in 0..100 inclusive
    predictors = np.random.randint(0, 101, size=(n_samples, n_predictors))
    # The class label is the row sum modulo 8
    labels = predictors.sum(axis=1) % 8
    
    return predictors, labels

In [243]:
def subsample_class_data(
        predictors: np.ndarray,
        labels: np.ndarray,
        class_label: int,
        n_samples: int
    ) -> tuple:
    """
    Subsample predictors to match a specified class label and number of samples.
    
    Parameters:
    - predictors (np.ndarray): Predictor values.
    - labels (np.ndarray): Computed class labels.
    - class_label (int): Desired class label.
    - n_samples (int): Number of samples to subsample.

    Returns:
    - tuple: Subsampled predictor values, target values.
    """
    mask = (labels == class_label)
    
    # Keep the first n_samples rows of the requested class
    selected_predictors = predictors[mask][:n_samples]
    if len(selected_predictors) < n_samples:
        # Without this check the predictors and targets would differ in length
        raise ValueError(
            f'Only {len(selected_predictors)} samples of class {class_label} '
            f'available; {n_samples} requested.'
        )
    selected_targets = np.full(n_samples, class_label)
    
    return selected_predictors, selected_targets

In [244]:
def generate_dataset(
        class_samples: list = [100, 200, 300, 500, 800, 1200, 2000, 3000],
        n_predictors: int = 5,
    ) -> pd.DataFrame:
    """
    Generate a synthetic dataset for multiclass classification.
    
    Parameters:
    - class_samples (list): A list containing the number of samples desired for each class.
    - n_predictors (int): Number of actual predictor features.

    Returns:
    - pd.DataFrame: A pandas DataFrame containing the generated dataset.
    """   
    all_predictors, all_targets = [], []
    
    predictors, labels = generate_samples(25000, n_predictors)
    
    for class_label in range(8):
        selected_predictors, selected_targets = subsample_class_data(
            predictors, labels, class_label, class_samples[class_label]
        )
        
        all_predictors.append(selected_predictors)
        all_targets.append(selected_targets)

    # Concatenate the results
    predictors = np.vstack(all_predictors)
    target = np.concatenate(all_targets)

    # Create a dataframe
    df = pd.DataFrame(
        predictors,
        columns=[f'predictor_{i+1}' for i in range(n_predictors)]
    )
    
    # Add the target column
    df['target'] = target
    
    # Shuffle data
    df = df.sample(frac=1.0, replace=False)

    # Add an ID column
    df['id'] = range(len(df))
    
    # Arrange the columns
    df = df[['id'] + [col for col in df if col != 'id']]

    return df
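
The pool of 25,000 candidate rows drawn inside `generate_dataset` has to contain at least 3,000 rows of class 7. The sketch below estimates how much headroom that leaves by computing the exact distribution of the label for five independent features drawn uniformly from 0 to 100. It is an illustrative calculation under that assumption, not part of the generation code, and the variable names are hypothetical.

```python
import numpy as np

# Residue distribution (mod 8) of one feature drawn uniformly from 0..100.
single = np.bincount(np.arange(101) % 8, minlength=8) / 101

# Cyclic convolution gives the distribution of the sum of five features mod 8.
dist = np.zeros(8)
dist[0] = 1.0  # an empty sum is 0 with certainty
for _ in range(5):
    new = np.zeros(8)
    for a in range(8):
        for b in range(8):
            new[(a + b) % 8] += dist[a] * single[b]
    dist = new

pool_size = 25000
expected = pool_size * dist                       # expected rows per class
std = np.sqrt(pool_size * dist * (1 - dist))      # binomial standard deviation

print(expected.round(1))
print(std.round(1))
```

Under these assumptions each class is expected to appear about 3,125 times in the pool, with a standard deviation of about 52. The largest quota of 3,000 therefore sits only a little over two standard deviations below the expected count, so a different seed could occasionally leave class 7 short.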

In [245]:
# Set seed for reproducibility
set_seed()

n_predictors = 5

# Generate original dataset
data = generate_dataset(
    class_samples = [100, 200, 300, 500, 800, 1200, 2000, 3000],
    n_predictors = n_predictors
)

print(data.shape)
print(data.head())

(8100, 7)
      id  predictor_1  predictor_2  predictor_3  predictor_4  predictor_5  \
5679   0           13           17           30           82           81   
1894   1            9           24            2           27           62   
1787   2            0           65           83            7           57   
7330   3           57           86           29           77           46   
8053   4           45           24           27           76           19   

      target  
5679       7  
1894       4  
1787       4  
7330       7  
8053       7  


In [248]:
data['target'].value_counts()

7    3000
6    2000
5    1200
4     800
3     500
2     300
1     200
0     100
Name: target, dtype: int64

# Verify data was correctly generated

In [246]:
def verify_target_calculation(df: pd.DataFrame, n_predictors: int = 5) -> bool:
    """
    Verify that the target is correctly calculated based on the predictor features.

    Parameters:
    - df (pd.DataFrame): The generated dataset.
    - n_predictors (int): Number of predictor features.

    Returns:
    - bool: True if all targets are correctly calculated, False otherwise.
    """
    # Calculate the target based on the predictor features
    calculated_targets = df.iloc[:, 1:n_predictors+1].sum(axis=1) % 8

    # Check if the calculated targets match the actual targets in the dataframe
    matches = (calculated_targets == df['target']).all()

    # Check the counts per class (reindex so that a missing class counts as 0)
    desired_counts = [100, 200, 300, 500, 800, 1200, 2000, 3000]
    class_counts = (
        df['target'].value_counts()
        .reindex(range(len(desired_counts)), fill_value=0)
        .tolist()
    )

    correct_counts = class_counts == desired_counts

    return matches and correct_counts

print(verify_target_calculation(data, n_predictors))


True


# Save Main Data File

In [247]:
# Create the output directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)
data.to_csv(outp_fname, index=False, float_format="%.4f")
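
To confirm that a saved file can be read back unchanged, a round-trip check along the following lines may help. This sketch writes a small hand-made frame to a temporary directory rather than to `outp_fname`; the two rows are copied from the printed output above, and the file name `modulus8_sample.csv` is made up for the example.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the notebook's printed output.
sample = pd.DataFrame({
    'id': [0, 1],
    'predictor_1': [13, 9],
    'predictor_2': [17, 24],
    'predictor_3': [30, 2],
    'predictor_4': [82, 27],
    'predictor_5': [81, 62],
    'target': [7, 4],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'modulus8_sample.csv')
    # float_format only affects floating-point columns; integers are written as-is.
    sample.to_csv(path, index=False, float_format="%.4f")
    restored = pd.read_csv(path)

round_trip_ok = restored.equals(sample)
print(round_trip_ok)
```

Since all columns are integers, the `float_format` argument has no visible effect on this dataset.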