# NeuralSimReg Dataset
### A Synthetic Regression Dataset Simulating a Neural Network-Driven Process with Missing Values


This synthetic dataset represents a regression problem, simulating a scenario where data is generated by an unknown real-world process, represented here by a neural network. The dataset contains 10,000 samples. Each sample in the dataset has a unique identifier, a set of 16 features, and a continuous target value.

The features, originally 16-dimensional data points, are generated from a standard normal distribution (mean = 0 and standard deviation = 1). These features can be conceptualized as random inputs fed into a neural network, which simulates our unknown data generation process, producing a target output.

The neural network modeling this process comprises three layers: an input layer of 16 neurons, a first hidden layer of 8 neurons with a ReLU activation, and a second hidden layer of 4 neurons, also with a ReLU activation. The output layer has a single neuron without any activation function, generating a continuous target value for each input sample.

Post this neural network generation, as an additional artifact to mirror real-world complications, approximately 2% of the feature values have been made missing, represented as NaN values. This simulates scenarios where datasets might have gaps or missing entries, not due to the data generation process but as a result of data collection, storage, or other external factors.

The target value for each sample is the output from this neural network given the original 16-dimensional input features, reflecting the true value produced by our simulated data generation process. The central task is a regression problem, aiming to predict this true target value based on the features, even with their missing entries.

The unique challenge posed by this dataset arises from the intricacies introduced by the neural network. With the weights and biases of this neural network initialized randomly from a normal distribution (mean = 0 and standard deviation = 1), the relationship between the input features and the target value is intricate and potentially highly non-linear.

Each sample's identifier is a sequential integer, beginning from 1, serving to uniquely identify every sample in the dataset.

In essence, this dataset delineates a regression challenge where the objective is to predict the output of an intricate, simulated data generation process based on given input features. The non-linear transformations executed by the neural network, combined with high-dimensionality and introduced missing values in the input, amplify the challenge for regression algorithms.


In [113]:
import numpy as np
import os
import pandas as pd
from typing import Tuple

In [114]:
dataset_name = "neural_sim_reg"

In [115]:
output_dir = f'./../../processed/{dataset_name}/'
outp_fname = os.path.join(output_dir, f'{dataset_name}.csv')

# Generation functions


In [116]:
def set_random_seeds(seed: int = 42) -> None:
    """
    Set seeds for reproducibility.

    Args:
        seed (int): The seed value to set for numpy's random operations.
    """
    np.random.seed(seed)

In [117]:
def relu(x: np.ndarray) -> np.ndarray:
    """
    Rectified Linear Unit (ReLU) activation function.

    Args:
        x (np.ndarray): Input array.

    Returns:
        np.ndarray: Output after applying the ReLU function element-wise.
    """
    return np.maximum(0, x)

In [118]:
def initialize_weights_and_biases(
        input_size: int, hidden1_size: int, 
        hidden2_size: int, output_size: int = 1
    ) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """
    Initialize weights and biases for the neural network.
    
    Args:
        input_size (int): Size of the input layer.
        hidden1_size (int): Size of the first hidden layer.
        hidden2_size (int): Size of the second hidden layer.
        output_size (int, optional): Size of the output layer. Defaults to 1.
    
    Returns:
        Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
        W1, b1, W2, b2, W3, b3: Weights and biases for the layers of the neural network.
    """
    W1 = np.random.normal(0, 1, (input_size, hidden1_size))
    b1 = np.random.normal(0, 1, (hidden1_size))

    W2 = np.random.normal(0, 1, (hidden1_size, hidden2_size))
    b2 = np.random.normal(0, 1, (hidden2_size))

    W3 = np.random.normal(0, 1, (hidden2_size, output_size))
    b3 = np.random.normal(0, 1, (output_size))
    
    return W1, b1, W2, b2, W3, b3

In [119]:
def neural_network(
        X: np.ndarray, W1: np.ndarray, b1: np.ndarray, W2: np.ndarray, 
        b2: np.ndarray, W3: np.ndarray, b3: np.ndarray
    ) -> np.ndarray:
    """
    Neural network function.
    
    Args:
        X (np.ndarray): Input data.
        W1, b1, W2, b2, W3, b3: Weights and biases for the layers of the
                                neural network.
        
    Returns:
        np.ndarray: Output from the neural network.
    """
    # First hidden layer
    Z1 = X.dot(W1) + b1
    A1 = relu(Z1)
    
    # Second hidden layer
    Z2 = A1.dot(W2) + b2
    A2 = relu(Z2)
    
    # Output layer
    Z3 = A2.dot(W3) + b3
    
    return Z3

In [120]:
def generate_synthetic_data(
        input_size: int, hidden1_size: int, hidden2_size: int,
        num_samples: int = 10000
    ) -> pd.DataFrame:
    """
    Generate synthetic data using a randomly initialized neural network.
    
    Args:
        input_size (int): Size of the input layer.
        hidden1_size (int): Size of the first hidden layer.
        hidden2_size (int): Size of the second hidden layer.
        num_samples (int, optional): Number of data samples to generate.
                                     Defaults to 10000.

    Returns:
        pd.DataFrame: DataFrame containing the synthetic data.
    """
    # Initialize weights and biases
    W1, b1, W2, b2, W3, b3 = initialize_weights_and_biases(
        input_size, hidden1_size, hidden2_size
    )

    # Generate random input data
    X = np.random.randn(num_samples, input_size)

    # Get the output of the neural network
    y = neural_network(X, W1, b1, W2, b2, W3, b3)

    # Create a pandas dataframe
    df = pd.DataFrame(
        X, columns=[f'feature_{i}' for i in range(1, input_size+1)]
    )
    df['target'] = y
    df.insert(0, 'sample_id', np.arange(1, num_samples + 1))

    return df

In [121]:
def introduce_missing_values(df: pd.DataFrame, percentage: float = 0.02) -> pd.DataFrame:
    """
    Introduce missing values in the feature columns of the dataframe.
    
    Args:
        df (pd.DataFrame): Original dataframe.
        percentage (float): Percentage of values to be replaced with NaN.
        
    Returns:
        pd.DataFrame: DataFrame with missing values.
    """
    num_rows, num_cols = df.shape
    num_missing = int(percentage * num_rows * num_cols)

    # Generate random row and column indices
    row_indices = np.random.randint(0, num_rows, num_missing)
    col_indices = np.random.randint(0, num_cols - 2, num_missing)  # Excluding 'target' and 'sample_id' columns

    # Introduce missing values
    for row, col in zip(row_indices, col_indices):
        df.iat[row, col] = np.nan

    return df

# Create Data

In [122]:
set_random_seeds(66)
# Generate the synthetic dataset with custom input size and hidden layer sizes
orig_data = generate_synthetic_data(num_samples=10000, input_size=16, hidden1_size=8, hidden2_size=4)
# Introduce missing values
data = introduce_missing_values(orig_data, percentage=0.02)
print(data.head())
data.shape

   sample_id  feature_1  feature_2  feature_3  feature_4  feature_5  \
0        1.0   1.530773  -0.880927   0.147362  -0.521976  -0.577191   
1        2.0   0.439518   0.219240  -0.763957  -0.612097   0.370604   
2        3.0  -0.227886  -0.859937   1.987863  -1.604954   0.784807   
3        4.0   0.064003   1.426716   1.019955  -1.007368   0.928673   
4        5.0  -1.445058  -0.029162   0.299268  -0.444360  -0.990869   

   feature_6  feature_7  feature_8  feature_9  feature_10  feature_11  \
0   0.034991  -0.845358   1.742985   0.195153   -0.950402    0.611827   
1   0.283830   0.082267   0.570342   1.197955   -0.201476    0.156258   
2   0.241413  -0.014350  -0.798338  -0.682943    0.983022    0.108191   
3  -0.693904   0.808132  -2.569986  -0.178805    0.365941   -0.613898   
4  -0.143019   0.999241   0.967716   0.557419    0.667822   -2.228937   

   feature_12  feature_13  feature_14  feature_15  feature_16     target  
0    0.858426    1.778379    1.570481    0.556079    0.5670

(10000, 18)

# Save Main Data File

In [123]:
data.to_csv(outp_fname, index=False, float_format="%.4f")