# Notebook 03: Data Aggregation

This notebook demonstrates how raw Array of Things (AoT) sensor data  
is aggregated into **10-minute, 30-minute, and 1-hour intervals**.

- The **research paper** analyzed the full AoT dataset (~500 nodes, 30-second sampling).  
- In this **repository**, we illustrate the aggregation process on the included  
  **sample raw trace** (`/data/raw/sample_raw_trace.csv`) for reproducibility.

The output sample aggregations are stored in `/data/aggregated/` as:

- `sample_aggregated_10min.csv`
- `sample_aggregated_30min.csv`
- `sample_aggregated_1hour.csv`

> ⚠️ Note: The provided precomputed files (`aot_aggregated_10min.csv`, `aot_aggregated_1hour.csv`)  
> belong to the broader analysis and must not be overwritten.

In [1]:
# Imports needed

import pandas as pd
import os

In [16]:
# Load the sample raw trace
raw_path = "../data/raw/sample_raw_trace.csv"
df_raw = pd.read_csv(raw_path)

# Ensure timestamp column is datetime and set as index
df_raw["timestamp"] = pd.to_datetime(df_raw["timestamp"])
df_raw = df_raw.set_index("timestamp").sort_index()

print("Raw trace shape:", df_raw.shape)
print("Time range:", df_raw.index.min(), "to", df_raw.index.max())
df_raw.head()

Raw trace shape: (504, 4)
Time range: 2020-01-12 00:00:00 to 2020-01-12 23:00:00


timestamp,node_id,sensor,parameter,value_hrf
2020-01-12,001e0610ee36,hih6130,humidity,44.981605
2020-01-12,001e0610ee36,hih6130,temperature,24.507143
2020-01-12,001e0610ee36,htu21d,humidity,76.599697
2020-01-12,001e0610ee36,htu21d,temperature,15.986585
2020-01-12,001e0610ee36,co,concentration,-0.687963
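
Before aggregating, it can help to check how many series the trace contains and how often each one reports. The following is a minimal sketch on a made-up stand-in for `df_raw`; the toy values, the `toy_raw` name, and the `keys` list are assumptions for illustration, not part of the repository.

```python
import pandas as pd

# Toy stand-in for df_raw (values are made up): two series, one reading per hour.
times = pd.to_datetime(["2020-01-12 00:00", "2020-01-12 01:00", "2020-01-12 02:00"])
toy_raw = (
    pd.DataFrame(
        {
            "timestamp": list(times) * 2,
            "node_id": ["n1"] * 6,
            "sensor": ["hih6130"] * 3 + ["co"] * 3,
            "parameter": ["humidity"] * 3 + ["concentration"] * 3,
            "value_hrf": [44.9, 45.1, 45.3, -0.6, 0.5, -0.3],
        }
    )
    .set_index("timestamp")
    .sort_index()
)

keys = ["node_id", "sensor", "parameter"]

# Number of distinct (node, sensor, parameter) series in the trace.
n_series = toy_raw.groupby(keys).ngroups

# Median gap between consecutive readings within each series.
cadence = (
    toy_raw.reset_index()
    .sort_values(keys + ["timestamp"])
    .groupby(keys)["timestamp"]
    .apply(lambda s: s.diff().median())
)

print("Number of series:", n_series)
print(cadence)
```

If the median gap is larger than the target aggregation interval, the finer aggregates will contain empty bins.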


## Aggregation Function

To aggregate the data, we group by **node ID**, **sensor type**, and **parameter** (e.g., humidity, temperature, concentration).  
We then resample based on the desired time frequency.

The function below can be reused to generate 10-minute, 30-minute, or hourly aggregates.

In [18]:
def aggregate_data(df, freq="10min"):
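    """
    Aggregate raw sensor data to a given frequency.

    Groups by node_id, sensor, and parameter, then takes the mean of the
    numeric columns within each time bin. Bins with no readings become NaN.

    df: DataFrame indexed by timestamp (DatetimeIndex)
    freq: str, pandas offset alias, e.g., '10min', '30min', '1h'
    """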
    return (
        df.groupby(["node_id", "sensor", "parameter"])
          .resample(freq)         
          .mean(numeric_only=True)
          .reset_index()
    )
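
To make the behavior of group-then-resample concrete, here is a small sketch on made-up data. It uses a variant that selects the `value_hrf` column before resampling; `aggregate_values` and the toy readings are hypothetical names and values chosen for illustration, not the notebook's own function or data.

```python
import pandas as pd

# Toy series (made-up values): readings at 00:00, 00:05, and 00:30.
toy = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2020-01-12 00:00", "2020-01-12 00:05", "2020-01-12 00:30"]
        ),
        "node_id": "n1",
        "sensor": "co",
        "parameter": "concentration",
        "value_hrf": [1.0, 3.0, 5.0],
    }
).set_index("timestamp")


def aggregate_values(df, freq="10min"):
    """Hypothetical variant: mean of value_hrf per series and time bin."""
    keys = ["node_id", "sensor", "parameter"]
    return df.groupby(keys)["value_hrf"].resample(freq).mean().reset_index()


toy_10min = aggregate_values(toy, "10min")
print(toy_10min)
```

The 00:00 bin averages its two readings, the 00:10 and 00:20 bins contain no readings and come out as `NaN`, and the 00:30 bin keeps its single value.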


## Generate Aggregated Data

We now produce three datasets: 10-minute, 30-minute, and 1-hour aggregates.

In [19]:
df_10min = aggregate_data(df_raw, "10min")
df_30min = aggregate_data(df_raw, "30min")
df_1hour = aggregate_data(df_raw, "1h")  # the uppercase "H" alias is deprecated since pandas 2.2

print("10-min aggregated shape:", df_10min.shape)
print("30-min aggregated shape:", df_30min.shape)
print("1-hour aggregated shape:", df_1hour.shape)

display(df_10min.head())
display(df_30min.head())
display(df_1hour.head())

10-min aggregated shape: (2919, 5)
30-min aggregated shape: (987, 5)
1-hour aggregated shape: (504, 5)
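
The row counts above can be cross-checked with a short calculation. This sketch assumes that every series spans the full time range printed earlier and that the 504 raw rows are 21 series with one reading per hour; both are inferences from the printed shapes, not facts stated by the dataset.

```python
import pandas as pd

# Time range printed when the raw trace was loaded.
start = pd.Timestamp("2020-01-12 00:00:00")
end = pd.Timestamp("2020-01-12 23:00:00")

# Assumption: 504 raw rows = 21 series x 24 hourly readings.
n_series = 504 // 24

expected_rows = {}
for freq in ["10min", "30min", "1h"]:
    # Number of time bins per series between the first and last reading.
    bins = len(pd.date_range(start, end, freq=freq))
    expected_rows[freq] = n_series * bins
    print(freq, "bins per series:", bins, "expected rows:", expected_rows[freq])
```

Under these assumptions the calculation gives 139, 47, and 24 bins per series, which reproduces the 2919, 987, and 504 rows reported above.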




,node_id,sensor,parameter,timestamp,value_hrf
0,001e0610ee36,co,concentration,2020-01-12 00:00:00,-0.687963
1,001e0610ee36,co,concentration,2020-01-12 00:10:00,
2,001e0610ee36,co,concentration,2020-01-12 00:20:00,
3,001e0610ee36,co,concentration,2020-01-12 00:30:00,
4,001e0610ee36,co,concentration,2020-01-12 00:40:00,


,node_id,sensor,parameter,timestamp,value_hrf
0,001e0610ee36,co,concentration,2020-01-12 00:00:00,-0.687963
1,001e0610ee36,co,concentration,2020-01-12 00:30:00,
2,001e0610ee36,co,concentration,2020-01-12 01:00:00,0.570352
3,001e0610ee36,co,concentration,2020-01-12 01:30:00,
4,001e0610ee36,co,concentration,2020-01-12 02:00:00,-0.376578


,node_id,sensor,parameter,timestamp,value_hrf
0,001e0610ee36,co,concentration,2020-01-12 00:00:00,-0.687963
1,001e0610ee36,co,concentration,2020-01-12 01:00:00,0.570352
2,001e0610ee36,co,concentration,2020-01-12 02:00:00,-0.376578
3,001e0610ee36,co,concentration,2020-01-12 03:00:00,0.604394
4,001e0610ee36,co,concentration,2020-01-12 04:00:00,0.774425


## Save Aggregated Data

The aggregated data is saved to the `../data/aggregated/` directory.  
These files serve as ready-to-use inputs for later stages of analysis, such as clustering and annotation.

In [20]:
df_10min.to_csv("../data/aggregated/sample_aggregated_10min.csv", index=False)
df_30min.to_csv("../data/aggregated/sample_aggregated_30min.csv", index=False)
df_1hour.to_csv("../data/aggregated/sample_aggregated_1hour.csv", index=False)

print("Aggregated CSVs saved to ../data/aggregated/")

Aggregated CSVs saved to ../data/aggregated/
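
Because the precomputed `aot_aggregated_*.csv` files must not be overwritten, a small guard around the save step may be useful. The sketch below is one possible approach: `save_sample` and `PROTECTED` are hypothetical names, and the demo writes to a temporary directory rather than the repository.

```python
import os
import tempfile

import pandas as pd

# File names taken from the note at the top of this notebook.
PROTECTED = {"aot_aggregated_10min.csv", "aot_aggregated_1hour.csv"}


def save_sample(df, out_dir, filename):
    """Hypothetical helper: write df to out_dir/filename unless the name is protected."""
    if filename in PROTECTED:
        raise ValueError(f"Refusing to overwrite precomputed file: {filename}")
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    path = os.path.join(out_dir, filename)
    df.to_csv(path, index=False)
    return path


# Demo with a throwaway frame and a temporary directory.
demo = pd.DataFrame({"value_hrf": [1.0, 2.0]})
out_dir = os.path.join(tempfile.mkdtemp(), "aggregated")
saved_path = save_sample(demo, out_dir, "sample_aggregated_10min.csv")
print("Saved:", saved_path)
```

Calling `save_sample(demo, out_dir, "aot_aggregated_10min.csv")` raises a `ValueError` instead of writing the file.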


## Handling Missing Data (Optional)

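Real-world sensor data often contains missing values due to outages or sensor faults.  
In our aggregated files, these appear as `NaN` entries.  
The sample trace appears to hold one reading per hour per series (504 raw rows, 504 hourly rows), so every 10-minute or 30-minute bin that falls between two readings is also `NaN`.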

Below, we demonstrate two common strategies:
- **Forward Fill** – carry the last known value forward.  
- **Interpolation** – estimate missing values using linear interpolation.  

These examples are for illustration only; the saved datasets preserve `NaN` values to maintain fidelity with raw traces.

In [21]:
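# Forward fill the value column (fillna(method="ffill") is deprecated in recent pandas)
df_10min_ffill = df_10min.copy()
df_10min_ffill["value_hrf"] = df_10min["value_hrf"].ffill()

# Linear interpolation of the value column only; the other columns are non-numeric
df_10min_interp = df_10min.copy()
df_10min_interp["value_hrf"] = df_10min["value_hrf"].interpolate()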

print("Original with NaNs:")
display(df_10min.head(10))

print("After forward fill:")
display(df_10min_ffill.head(10))

print("After interpolation:")
display(df_10min_interp.head(10))

Original with NaNs:




,node_id,sensor,parameter,timestamp,value_hrf
0,001e0610ee36,co,concentration,2020-01-12 00:00:00,-0.687963
1,001e0610ee36,co,concentration,2020-01-12 00:10:00,
2,001e0610ee36,co,concentration,2020-01-12 00:20:00,
3,001e0610ee36,co,concentration,2020-01-12 00:30:00,
4,001e0610ee36,co,concentration,2020-01-12 00:40:00,
5,001e0610ee36,co,concentration,2020-01-12 00:50:00,
6,001e0610ee36,co,concentration,2020-01-12 01:00:00,0.570352
7,001e0610ee36,co,concentration,2020-01-12 01:10:00,
8,001e0610ee36,co,concentration,2020-01-12 01:20:00,
9,001e0610ee36,co,concentration,2020-01-12 01:30:00,


After forward fill:


,node_id,sensor,parameter,timestamp,value_hrf
0,001e0610ee36,co,concentration,2020-01-12 00:00:00,-0.687963
1,001e0610ee36,co,concentration,2020-01-12 00:10:00,-0.687963
2,001e0610ee36,co,concentration,2020-01-12 00:20:00,-0.687963
3,001e0610ee36,co,concentration,2020-01-12 00:30:00,-0.687963
4,001e0610ee36,co,concentration,2020-01-12 00:40:00,-0.687963
5,001e0610ee36,co,concentration,2020-01-12 00:50:00,-0.687963
6,001e0610ee36,co,concentration,2020-01-12 01:00:00,0.570352
7,001e0610ee36,co,concentration,2020-01-12 01:10:00,0.570352
8,001e0610ee36,co,concentration,2020-01-12 01:20:00,0.570352
9,001e0610ee36,co,concentration,2020-01-12 01:30:00,0.570352


After interpolation:


,node_id,sensor,parameter,timestamp,value_hrf
0,001e0610ee36,co,concentration,2020-01-12 00:00:00,-0.687963
1,001e0610ee36,co,concentration,2020-01-12 00:10:00,-0.478244
2,001e0610ee36,co,concentration,2020-01-12 00:20:00,-0.268525
3,001e0610ee36,co,concentration,2020-01-12 00:30:00,-0.058805
4,001e0610ee36,co,concentration,2020-01-12 00:40:00,0.150914
5,001e0610ee36,co,concentration,2020-01-12 00:50:00,0.360633
6,001e0610ee36,co,concentration,2020-01-12 01:00:00,0.570352
7,001e0610ee36,co,concentration,2020-01-12 01:10:00,0.41253
8,001e0610ee36,co,concentration,2020-01-12 01:20:00,0.254709
9,001e0610ee36,co,concentration,2020-01-12 01:30:00,0.096887
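
A column-wide fill treats the whole table as one sequence, so a value from one series can be carried into the next if a series starts or ends with `NaN`. The following is a minimal sketch of a per-series alternative with a cap on how far values are carried. The toy frame and the column names `ffill_global`, `ffill_grouped`, and `interp_grouped` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy aggregated frame (made-up values): two series stacked one after the other.
toy = pd.DataFrame(
    {
        "node_id": ["n1"] * 8,
        "sensor": ["co"] * 4 + ["htu21d"] * 4,
        "parameter": ["concentration"] * 4 + ["humidity"] * 4,
        "value_hrf": [1.0, np.nan, np.nan, np.nan, np.nan, 50.0, np.nan, 53.0],
    }
)
keys = ["node_id", "sensor", "parameter"]
grouped = toy.groupby(keys)["value_hrf"]

# Column-wide forward fill: the last "co" value leaks into the first "htu21d" row.
ffill_global = toy["value_hrf"].ffill()

# Per-series forward fill, carrying a value across at most two empty bins.
ffill_grouped = grouped.ffill(limit=2)

# Per-series linear interpolation, filling only gaps bounded by readings on both sides.
interp_grouped = grouped.transform(lambda s: s.interpolate(limit_area="inside"))

result = toy.assign(
    ffill_global=ffill_global,
    ffill_grouped=ffill_grouped,
    interp_grouped=interp_grouped,
)
print(result)
```

In the toy output, the column-wide fill assigns the `co` value 1.0 to the first `htu21d` row, while the grouped variants leave that row as `NaN`. Whether to cap the fill length, and at what value, depends on the analysis; the limit of two bins here is arbitrary.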


## Summary

- We demonstrated how to aggregate AoT sensor data into multiple time granularities.  
- Aggregated outputs were saved in the `../data/aggregated/` folder.  
- We preserved missing values (`NaN`) in the saved outputs to stay consistent with real-world traces.  
- Examples of handling missing values were provided for reference but not applied to the saved datasets.
