The goal of this exercise is to work with statistical notions such as mean, standard deviation, and correlation.

Generate a numerical dataset with 300 datapoints (i.e. lines) and at least 6 columns and save it to a CSV file named artificial_dataset.csv. This dataset must represent physical quantities of your choice, with units. The statistical relationships between the columns must make sense.

The columns must satisfy the following requirements:
- They must all have a different mean.
- They must all have a different standard deviation.
- At least one column should contain integers.
- At least one column should contain floats.
- Some columns must be positively correlated (at least one pair of columns must have a correlation > 0.2).
- Some columns must be negatively correlated (at least one pair of columns must have a correlation < -0.4).
- Some columns must have a correlation close to 0.

In [1]:
import numpy as np
import pandas as pd

np.random.seed(42)  # for reproducibility

n = 300

# Independent base variables
temperature_C = np.random.normal(loc=15, scale=5, size=n)              # float
altitude_m = np.random.randint(0, 3000, size=n)                        # integer

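# Pressure decreases linearly with altitude, plus noise
pressure_kPa = 101.3 - 0.011 * altitude_m + np.random.normal(0, 1.5, n)
# Speed is drawn independently; engine RPM is proportional to speed, plus noise
speed_kmh = np.random.normal(loc=90, scale=20, size=n)
engine_rpm = (speed_kmh * 35 + np.random.normal(0, 300, n)).astype(int)  # integer (truncated)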

# Negative correlation with speed
fuel_consumption_L100km = (
    12 - 0.03 * speed_kmh + np.random.normal(0, 0.8, n)
)

# Create DataFrame
df = pd.DataFrame({
    "temperature_C": temperature_C,
    "pressure_kPa": pressure_kPa,
    "altitude_m": altitude_m,
    "speed_kmh": speed_kmh,
    "fuel_consumption_L100km": fuel_consumption_L100km,
    "engine_rpm": engine_rpm
})

# Save to CSV
df.to_csv("artificial_dataset.csv", index=False)

# Display basic statistics
df.describe()


| | temperature_C | pressure_kPa | altitude_m | speed_kmh | fuel_consumption_L100km | engine_rpm |
|---|---|---|---|---|---|---|
| count | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 |
| mean | 14.972257 | 84.778664 | 1493.05 | 92.97238 | 9.260729 | 3271.626667 |
| std | 4.920968 | 9.910785 | 881.378364 | 19.076486 | 0.963161 | 708.553117 |
| min | -1.206337 | 66.4238 | 9.0 | 41.522413 | 5.647737 | 1571.0 |
| 25% | 11.58377 | 76.529123 | 747.25 | 79.210357 | 8.578614 | 2776.0 |
| 50% | 15.296097 | 85.493774 | 1393.0 | 93.441532 | 9.283086 | 3275.5 |
| 75% | 18.133289 | 93.120427 | 2183.25 | 104.170365 | 9.964203 | 3729.25 |
| max | 34.263657 | 103.851088 | 2995.0 | 142.647641 | 11.73574 | 5286.0 |
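
The summary statistics alone do not show whether every requirement of the exercise is met. The sketch below is one possible way to check them programmatically. It rebuilds the dataset with the same calls and seed as the notebook cell so that it runs on its own. The helper name `check_requirements` and the `near_zero` threshold of 0.1 are assumptions made for illustration; the exercise does not define "close to 0" numerically.

```python
import numpy as np
import pandas as pd

# Rebuild the dataset with the same calls, in the same order, as the notebook
# cell, so this sketch runs without artificial_dataset.csv on disk.
np.random.seed(42)
n = 300
temperature_C = np.random.normal(loc=15, scale=5, size=n)
altitude_m = np.random.randint(0, 3000, size=n)
pressure_kPa = 101.3 - 0.011 * altitude_m + np.random.normal(0, 1.5, n)
speed_kmh = np.random.normal(loc=90, scale=20, size=n)
engine_rpm = (speed_kmh * 35 + np.random.normal(0, 300, n)).astype(int)
fuel_consumption_L100km = 12 - 0.03 * speed_kmh + np.random.normal(0, 0.8, n)

df = pd.DataFrame({
    "temperature_C": temperature_C,
    "pressure_kPa": pressure_kPa,
    "altitude_m": altitude_m,
    "speed_kmh": speed_kmh,
    "fuel_consumption_L100km": fuel_consumption_L100km,
    "engine_rpm": engine_rpm,
})


def check_requirements(data: pd.DataFrame, near_zero: float = 0.1) -> dict:
    """Hypothetical helper: one boolean per requirement of the exercise.

    `near_zero` is an assumed threshold for "correlation close to 0".
    """
    corr = data.corr().to_numpy()
    # Keep each pair of columns once: upper triangle, diagonal excluded.
    pairs = corr[np.triu_indices(corr.shape[0], k=1)]
    return {
        "300_rows": len(data) == 300,
        "at_least_6_columns": data.shape[1] >= 6,
        "distinct_means": bool(data.mean().round(6).is_unique),
        "distinct_stds": bool(data.std().round(6).is_unique),
        "has_integer_column": any(pd.api.types.is_integer_dtype(t) for t in data.dtypes),
        "has_float_column": any(pd.api.types.is_float_dtype(t) for t in data.dtypes),
        "pair_above_0.2": bool((pairs > 0.2).any()),
        "pair_below_-0.4": bool((pairs < -0.4).any()),
        "pair_near_zero": bool((np.abs(pairs) < near_zero).any()),
    }


report = check_requirements(df)
for name, ok in report.items():
    print(f"{name}: {ok}")
```

Any entry printed as `False` points to the requirement that needs attention, for example by adjusting a slope or a noise level in the generating code.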


# 🧪 Exercise 1: Statistical Properties of an Artificial Dataset

## 📊 Dataset Description

The dataset consists of 300 samples representing physical measurements related to a vehicle operating in an environment. The variables include temperature, altitude, pressure, speed, fuel consumption, and engine rotation speed. Each feature is expressed in meaningful physical units (°C, meters, kPa, km/h, L/100 km, RPM).

The dataset was constructed to ensure realistic statistical relationships between the variables while satisfying the imposed constraints on means, standard deviations, and correlations.

---

## 📈 Statistical Interpretation

### Mean and Standard Deviation

Each column has a distinct mean and standard deviation, reflecting different physical scales and natural variability:

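- Fuel consumption has the smallest standard deviation (about 0.96 L/100 km).
- Environmental variables (temperature, pressure) have relatively small standard deviations (about 4.9 °C and 9.9 kPa).
- Mechanical variables (speed, engine RPM) exhibit larger dispersion (about 19 km/h and 709 RPM).
- Altitude has the largest standard deviation (about 881 m) because it is drawn uniformly between 0 and 3000 m.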

This diversity ensures that the dataset is suitable for studying normalization and scaling techniques in machine learning.
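
As an illustration of why these differing scales matter, here is a minimal sketch of z-score standardization. It uses a small stand-in frame rather than the real dataset: the `demo` frame, its seed, and the `standardize` helper are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

# Stand-in data with scales roughly similar to three of the dataset's columns.
# The seed and the parameters are arbitrary and chosen only for illustration.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "altitude_m": rng.integers(0, 3000, size=300),
    "temperature_C": rng.normal(15, 5, size=300),
    "fuel_consumption_L100km": rng.normal(9, 1, size=300),
})


def standardize(data: pd.DataFrame) -> pd.DataFrame:
    """Z-score each column: subtract its mean, divide by its sample std."""
    return (data - data.mean()) / data.std()


scaled = standardize(demo)

# Before scaling the standard deviations differ by orders of magnitude;
# after scaling every column has mean 0 and standard deviation 1.
print(demo.std().round(2))
print(scaled.mean().round(6))
print(scaled.std().round(6))
```

Standardization changes the units and the scale of each column but leaves the correlations between columns unchanged, since Pearson correlation is invariant under positive linear rescaling.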

---

### Correlation Structure

The dataset exhibits three types of correlations:

#### Positive correlations
- **Speed vs. Engine RPM**: As vehicle speed increases, engine rotation speed increases.

#### Negative correlations
- **Altitude vs. Pressure**: Atmospheric pressure decreases with altitude.
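- **Speed vs. Fuel Consumption**: In the generating model, fuel consumption decreases linearly with speed (0.03 L/100 km less per additional km/h). This is a simplification that applies only to the modeled regime.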

#### Near-zero correlations
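- **Temperature vs. all other variables**: Temperature is generated independently of every other column (altitude, pressure, speed, engine RPM, fuel consumption), so its sample correlations with them should be close to 0.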
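
The strength of the non-zero correlations follows from the generating code. For a column built as `y = slope * x + noise`, with noise independent of `x`, the Pearson correlation is `slope * sd_x / sqrt((slope * sd_x)**2 + sd_noise**2)`. The sketch below evaluates this formula with the nominal parameters from the notebook cell. The helper name `linear_model_corr` is hypothetical, and the calculation ignores the integer truncation of `engine_rpm` and treats altitude as a continuous uniform variable.

```python
import math


def linear_model_corr(slope: float, sd_x: float, sd_noise: float) -> float:
    """Correlation between x and y = slope * x + noise (noise independent of x)."""
    signal = slope * sd_x
    return signal / math.sqrt(signal ** 2 + sd_noise ** 2)


# Standard deviation of a uniform variable on [0, 3000): range / sqrt(12).
# Assumption: the discreteness of the integer altitude is ignored.
sd_altitude = 3000 / math.sqrt(12)

expected = {
    "speed vs engine_rpm": linear_model_corr(35, 20, 300),
    "altitude vs pressure": linear_model_corr(-0.011, sd_altitude, 1.5),
    "speed vs fuel_consumption": linear_model_corr(-0.03, 20, 0.8),
}

for pair, r in expected.items():
    print(f"{pair}: {r:+.2f}")
# speed vs engine_rpm: +0.92
# altitude vs pressure: -0.99
# speed vs fuel_consumption: -0.60
```

These are population values. The correlations measured on the 300 generated rows will differ slightly because of sampling noise.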

---

## 🧠 Conclusion

This dataset demonstrates how statistical properties can be controlled while preserving physical realism. It provides a solid foundation for introducing concepts such as feature scaling, correlation analysis, and exploratory data analysis in machine learning.
