# Fluid Dynamics Pressure Estimation Dataset

This synthetic dataset encapsulates a regression problem rooted in fluid dynamics principles, specifically Bernoulli's equation for fluid flow. Each sample in the dataset carries a unique identifier, a set of fluid properties, and a target pressure value.

The fluid properties encompass:
- Fluid density, which represents the mass of the fluid per unit volume.
- Fluid velocity, indicating the speed at which the fluid is flowing.
- Height, representing the elevation or depth at which the pressure is measured.
- A categorical feature named 'body', denoting the celestial body: one of Earth, Moon, Mars, Jupiter, Venus, or Saturn. It serves as a proxy for the gravitational acceleration acting on the fluid.
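
As an illustration of how the 'body' feature stands in for gravity, the sketch below maps body names to gravitational accelerations in m/s². The values are the ones used by the generation code later in this notebook; the small `sample` frame and the `gravity` column name are hypothetical and not part of the dataset.

```python
import pandas as pd

# Gravitational accelerations in m/s^2, as used by the generator in this notebook.
GRAVITY = {
    "Earth": 9.81,
    "Moon": 1.625,
    "Mars": 3.71,
    "Jupiter": 24.79,
    "Venus": 8.87,
    "Saturn": 10.44,
}

# Hypothetical mini-sample; None stands for a missing 'body' entry.
sample = pd.DataFrame({"body": ["Earth", "Mars", None, "Jupiter"]})

# Series.map looks up each name in the dict; missing entries become NaN.
sample["gravity"] = sample["body"].map(GRAVITY)
print(sample)
```

A numeric gravity column like this is one possible encoding; one-hot encoding the six body names is another.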

The target value for each sample is the pressure in bars, computed from Bernoulli's equation: a constant term minus the kinetic-energy term and the gravitational potential-energy term.
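
Written out with the symbols used by the generation code in this notebook (ρ for fluid density, v for velocity, g for gravitational acceleration, h for height, and C for the constant term in bars), the target is:

$$
P_{\text{bar}} = C - \frac{1}{10^{5}}\left(\tfrac{1}{2}\rho v^{2} + \rho g h\right)
$$

The factor 10^5 converts pascals to bars. Because both energy terms are subtracted from the constant, the target is frequently negative, as the sample rows printed later in the notebook show.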

To simulate real-world imperfections, the dataset contains missing values: 5% of the entries in each feature column (density, velocity, height, and body) are blanked out at random, while the identifier and the target are always present. Algorithms that cannot handle missing data natively will need imputation or row filtering.
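
One simple way to handle the gaps is sketched below: median imputation for the numeric features and most-frequent-value imputation for 'body'. This is only one reasonable baseline, not a recommended method for this dataset, and the `sample` rows are hypothetical values that merely reuse the dataset's column names.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with the same feature columns as the dataset.
sample = pd.DataFrame({
    "fluid_density": [1000.0, np.nan, 13600.0, 5000.0],
    "fluid_velocity": [10.0, 20.0, np.nan, 40.0],
    "height": [np.nan, 50.0, 75.0, 100.0],
    "body": ["Earth", "Mars", None, "Earth"],
})

numeric_cols = ["fluid_density", "fluid_velocity", "height"]

imputed = sample.copy()
# Numeric features: fill each column with its own median.
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())
# Categorical feature: fill with the most frequent value.
imputed["body"] = imputed["body"].fillna(imputed["body"].mode().iloc[0])

print(imputed)
```

In a real train/test workflow the medians and the mode would be computed on the training split only and then reused on the test split.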

The sample identifiers are sequential integers, beginning from 0, ensuring a unique reference for each data entry.

In short, the task is to estimate fluid pressure from a set of fluid properties and conditions. The nonlinear physical relationships between features, the indirect encoding of gravity through a categorical feature, and the missing values together make it a non-trivial problem for regression models.
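
To make the physical structure concrete, the sketch below generates a small noise-free sample with the same formula and value ranges as the generator in this notebook, then fits ordinary least squares on two engineered features, ρv² and ρgh (both scaled to bars). The seed, the sample size, and the choice of `numpy.linalg.lstsq` are assumptions made for illustration. With complete data the fit should recover coefficients close to 1, -0.5, and -1, which shows that the difficulty lies in the raw feature representation and the missing values rather than in the target itself.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only
n = 200

# Same ranges and gravity values as the generator's defaults.
rho = rng.uniform(1000, 13600, n)
v = rng.uniform(0, 50, n)
h = rng.uniform(0, 100, n)
g = rng.choice([9.81, 1.625, 3.71, 24.79, 8.87, 10.44], n)

# Target in bars, using the same formula as the generator.
P = 1 - 0.5 * rho * v**2 / 1e5 - rho * g * h / 1e5

# Design matrix: intercept plus the two engineered terms, scaled from Pa to bars.
X = np.column_stack([np.ones(n), rho * v**2 / 1e5, rho * g * h / 1e5])

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, P, rcond=None)
print(coef)
```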


In [1]:
from typing import Tuple, Optional
import pandas as pd
import numpy as np
import os

In [2]:
dataset_name = "fluid_pressure"

In [3]:
output_dir = f'./../../processed/{dataset_name}/'
outp_fname = os.path.join(output_dir, f'{dataset_name}.csv')

# Generation functions

In [4]:
def set_random_seeds(seed: int = 7) -> None:
    """
    Set seeds for reproducibility.

    Args:
        seed (int): The seed value to set for numpy's random operations.
    """
    np.random.seed(seed)

In [5]:
def generate_fluid_dataset(
    N: int = 1000,
    density_range: Tuple[float, float] = (1000, 13600),
    velocity_range: Tuple[float, float] = (0, 50),
    height_range: Tuple[float, float] = (0, 100),
    bodies: Tuple[str, ...] = ('Earth', 'Moon', 'Mars', 'Jupiter', 'Venus', 'Saturn'),
    constant_term: float = 1,  # in bars, equivalent to 1e5 Pa
    missing_percentage: Optional[float] = None
) -> pd.DataFrame:
    """
    Generate a synthetic dataset based on Bernoulli's equation for fluid flow with pressure in bars.
    
    Args:
    - N (int): Number of data points.
    - density_range (tuple): Min and max fluid density in kg/m^3.
    - velocity_range (tuple): Min and max fluid velocity in m/s.
    - height_range (tuple): Min and max height in meters.
    - bodies (tuple): Names of celestial bodies used for gravitational acceleration.
    - constant_term (float): Constant term used in Bernoulli's equation in bars.
    - missing_percentage (float, optional): Fraction (0 to 1) of rows to set to NaN in each feature column; 'id' and the target are never blanked.
    
    Returns:
    - pd.DataFrame: Generated dataset.
    """
    
    # Gravitational accelerations mapping
    gravitational_map = {
        'Earth': 9.81,
        'Moon': 1.625,
        'Mars': 3.71,
        'Jupiter': 24.79,
        'Venus': 8.87,
        'Saturn': 10.44
    }
    
    # Fluid density
    rho = np.random.uniform(*density_range, N)

    # Fluid velocity
    v = np.random.uniform(*velocity_range, N)

    # Celestial body and its gravitational acceleration
    body = np.random.choice(bodies, N)
    g = np.array([gravitational_map[b] for b in body])

    # Height
    h = np.random.uniform(*height_range, N)

    # Compute pressure using Bernoulli's equation; the energy terms are in Pa, so divide by 1e5 to get bars
    P = constant_term - 0.5 * rho * v**2 / 1e5 - rho * g * h / 1e5

    # Create DataFrame
    df = pd.DataFrame({
        'id': np.arange(N),
        'fluid_density': rho,
        'fluid_velocity': v,
        'body': body,
        'height': h,
        'pressure_bars': P
    })

    # Introduce missing values if required (excluding target variable)
    if missing_percentage:
        num_missing = int(missing_percentage * N)
        for column in df.columns.difference(['id', 'pressure_bars']):
            missing_indices = np.random.choice(df.index, num_missing, replace=False)
            df.loc[missing_indices, column] = np.nan

    return df


# Create dataset

In [6]:
# Generate and display the dataset
set_random_seeds()
data = generate_fluid_dataset(N=2400, missing_percentage=0.05)
print(data.shape)
data.head()


(2400, 6)


| | id | fluid_density | fluid_velocity | body | height | pressure_bars |
|---|---|---|---|---|---|---|
| 0 | 0 | 1961.484446 | 14.243012 | Mars | 26.358794 | -2.907725 |
| 1 | 1 | NaN | 27.875967 | Saturn | 75.420882 | -126.317536 |
| 2 | 2 | 6523.956316 | 26.177745 | Saturn | 3.149047 | -23.498318 |
| 3 | 3 | 10115.661241 | 26.224657 | Saturn | 32.559694 | -68.169832 |
| 4 | 4 | 13322.667851 | 17.199644 | Earth | 80.11854 | -123.417298 |
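
As a sanity check, the first row of the output above can be recomputed by hand with the generator's formula. The sketch below copies the printed (rounded) values for row 0 and uses the Mars gravity from `gravitational_map`. The result should agree with the tabulated -2.907725 up to rounding of the inputs.

```python
# Values copied from row 0 of the output above (rounded to six decimals).
rho = 1961.484446   # fluid density, kg/m^3
v = 14.243012       # fluid velocity, m/s
h = 26.358794       # height, m
g_mars = 3.71       # m/s^2, from gravitational_map
constant_term = 1.0  # bars

# Each energy term is in Pa; dividing by 1e5 converts it to bars.
kinetic_bar = 0.5 * rho * v**2 / 1e5
potential_bar = rho * g_mars * h / 1e5

pressure = constant_term - kinetic_bar - potential_bar
print(pressure)
```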


# Save Main Data File

In [7]:
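# Create the output directory if needed; to_csv does not create missing folders
os.makedirs(output_dir, exist_ok=True)
data.to_csv(outp_fname, index=False, float_format="%.6f")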