# Notebook 01: Data Description

This notebook describes the dataset used in this project.  
It addresses the following questions:

1. What do the aggregated files contain (10-minute, 30-minute, 1-hour)? Only the 10-minute and 1-hour files are loaded in this notebook.  
2. Which sensors and nodes are included?  
3. What are the start and end timestamps for each file?  
4. What is the overall sensing duration across the datasets?

In [7]:
# Imports

import pandas as pd

# File paths (relative to repo structure)
tenmin_path = "../data/aggregated/aot_aggregated_10min.csv"
hourly_path = "../data/aggregated/aot_aggregated_1hour.csv"
nodes_path = "../data/metadata/nodes.csv"
sensors_path = "../data/metadata/sensors.csv"

In [8]:
# Load metadata
nodes = pd.read_csv(nodes_path)
sensors = pd.read_csv(sensors_path)

print("Nodes shape:", nodes.shape)
print("Sensors shape:", sensors.shape)

nodes.head(), sensors.head()

Nodes shape: (126, 9)
Sensors shape: (193, 8)


(        node_id   project_id  vsn  \
 0  001e0610ba46  AoT_Chicago  004   
 1  001e0610ba3b  AoT_Chicago  006   
 2  001e0610f02f  AoT_Chicago  00A   
 3  001e0610ba8f  AoT_Chicago  00D   
 4  001e0610ba16  AoT_Chicago  010   
 
                                        address        lat        lon  \
 0           State St & Jackson Blvd Chicago IL  41.878377 -87.627678   
 1           18th St & Lake Shore Dr Chicago IL  41.858136 -87.616055   
 2  Lake Shore Drive & Fullerton Ave Chicago IL  41.926261 -87.630758   
 3                 Cornell & 47th St Chicago IL  41.810342 -87.590228   
 4          Homan Ave & Roosevelt Rd Chicago IL  41.866349 -87.710543   
 
             description      start_timestamp end_timestamp  
 0   AoT Chicago (S) [C]  2017/10/09 00:00:00           NaN  
 1       AoT Chicago (S)  2017/08/08 00:00:00           NaN  
 2  AoT Chicago (S) [CA]  2018/05/07 00:00:00           NaN  
 3       AoT Chicago (S)  2017/08/08 00:00:00           NaN  
 4   AoT Chicago (S)

> **Note:** Some fields in `nodes.csv` (such as `end_timestamp`) are missing (`NaN`).  
> This is expected, as not all nodes had a recorded shutdown when the dataset was exported.  
> For consistency and reproducibility, the actual sensing start and end times are derived  
> from the aggregated CSV files rather than relying on node metadata.
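
As a minimal sketch of that derivation, the first and last observation per node can be read from the data itself. The rows below are illustrative stand-ins, not values from the dataset, and `observed_span` is a hypothetical name; only the `timestamp` and `node_id` column names come from the aggregated files.

```python
import pandas as pd

# Illustrative rows only; the real data comes from the aggregated CSV files.
sample = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2020-01-12 00:00:00", "2020-01-12 00:10:00", "2020-01-13 00:00:00"]
        ),
        "node_id": ["node_a", "node_a", "node_b"],
    }
)

# First and last observation per node, taken from the readings
# rather than from start_timestamp / end_timestamp in nodes.csv.
observed_span = sample.groupby("node_id")["timestamp"].agg(["min", "max"])
print(observed_span)
```

Deriving the span this way sidesteps the missing `end_timestamp` values, at the cost of describing only what the exported file contains.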

In [9]:
# Load aggregated datasets
df_10min = pd.read_csv(tenmin_path)
df_hour = pd.read_csv(hourly_path)

print("10min rows:", len(df_10min), "columns:", df_10min.columns.tolist())
print("1hour rows:", len(df_hour), "columns:", df_hour.columns.tolist())

df_10min.head()

10min rows: 58356 columns: ['timestamp', 'node_id', 'sensor', 'parameter', 'value_hrf']
1hour rows: 10091 columns: ['timestamp', 'node_id', 'sensor', 'parameter', 'value_hrf']


|   | timestamp           | node_id      | sensor  | parameter     | value_hrf |
|---|---------------------|--------------|---------|---------------|-----------|
| 0 | 2020-01-12 00:00:00 | 001e0610ee36 | hih6130 | humidity      | 100.0     |
| 1 | 2020-01-12 00:00:00 | 001e0610ee36 | hih6130 | temperature   | 125.01    |
| 2 | 2020-01-12 00:00:00 | 001e0610ee36 | htu21d  | humidity      | 118.99    |
| 3 | 2020-01-12 00:00:00 | 001e0610ee36 | htu21d  | temperature   | 128.86    |
| 4 | 2020-01-12 00:00:00 | 001e0610ee43 | co      | concentration | -0.44551  |
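
The files are in long format: each row is one reading for a given timestamp, node, sensor, and parameter. A minimal sketch of turning one sensor/parameter pair into a wide table with one column per node follows. The values and node names are made up for illustration, and `long_df` and `wide` are hypothetical names.

```python
import pandas as pd

# Made-up readings that mimic the column layout of the aggregated files
long_df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2020-01-12 00:00:00"] * 2 + ["2020-01-12 00:10:00"] * 2
        ),
        "node_id": ["node_a", "node_b", "node_a", "node_b"],
        "sensor": ["htu21d"] * 4,
        "parameter": ["temperature"] * 4,
        "value_hrf": [1.5, 2.0, 1.7, 2.1],
    }
)

# Keep a single sensor/parameter pair, then spread nodes across columns
mask = (long_df["sensor"] == "htu21d") & (long_df["parameter"] == "temperature")
wide = long_df[mask].pivot(index="timestamp", columns="node_id", values="value_hrf")
print(wide)
```

`pivot` raises an error if a (timestamp, node) pair appears more than once, which makes it a useful check that the selection is unique per node and time step.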


In [10]:
# Unique nodes and sensors in each file
nodes_10min = df_10min["node_id"].nunique()
sensors_10min = df_10min["sensor"].nunique()

nodes_1hour = df_hour["node_id"].nunique()
sensors_1hour = df_hour["sensor"].nunique()

print(f"10min file covers {nodes_10min} nodes and {sensors_10min} sensors.")
print(f"1hour file covers {nodes_1hour} nodes and {sensors_1hour} sensors.")

10min file covers 10 nodes and 6 sensors.
1hour file covers 10 nodes and 6 sensors.
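
The counts above say how many distinct nodes and sensors appear, but not whether every node is described in the metadata. A minimal sketch of that cross-check, using hypothetical stand-ins for `nodes.csv` and an aggregated file (the node IDs are invented; the sensor names are ones that appear in the data):

```python
import pandas as pd

# Hypothetical stand-ins for nodes.csv and an aggregated file
nodes_meta = pd.DataFrame({"node_id": ["node_a", "node_b", "node_c"]})
readings = pd.DataFrame(
    {
        "node_id": ["node_a", "node_a", "node_b", "node_x"],
        "sensor": ["htu21d", "hih6130", "htu21d", "co"],
    }
)

# Node IDs present in the readings but absent from the metadata
known = readings["node_id"].isin(nodes_meta["node_id"])
unmatched = sorted(readings.loc[~known, "node_id"].unique())

# Number of distinct sensors reporting per node
sensors_per_node = readings.groupby("node_id")["sensor"].nunique()

print("Unmatched node IDs:", unmatched)
print(sensors_per_node)
```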


In [11]:
# Convert timestamps to datetime
df_10min["timestamp"] = pd.to_datetime(df_10min["timestamp"])
df_hour["timestamp"] = pd.to_datetime(df_hour["timestamp"])

# Start and end times
print("10min dataset time span:")
print("Start:", df_10min["timestamp"].min())
print("End:  ", df_10min["timestamp"].max())

print("\n1hour dataset time span:")
print("Start:", df_hour["timestamp"].min())
print("End:  ", df_hour["timestamp"].max())

10min dataset time span:
Start: 2020-01-12 00:00:00
End:   2020-02-08 23:50:00

1hour dataset time span:
Start: 2020-01-12 00:00:00
End:   2020-02-08 23:00:00


In [12]:
# Calculate duration in whole days (.days drops the partial final day)
duration_10min = (df_10min["timestamp"].max() - df_10min["timestamp"].min()).days
duration_1hour = (df_hour["timestamp"].max() - df_hour["timestamp"].min()).days

print(f"10min dataset duration: {duration_10min} days")
print(f"1hour dataset duration: {duration_1hour} days")

10min dataset duration: 27 days
1hour dataset duration: 27 days
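
The `.days` attribute of a time difference returns whole elapsed days only, so a span that ends at 23:50 on the last date reports one day fewer than the number of calendar dates it touches. A small sketch with the standard library, using the 10-minute start and end timestamps printed above (the variable names are illustrative):

```python
from datetime import datetime

# Start and end of the 10-minute file, as printed earlier in the notebook
start = datetime(2020, 1, 12, 0, 0)
end = datetime(2020, 2, 8, 23, 50)

elapsed = end - start

whole_days = elapsed.days                               # truncated to whole days
fractional_days = elapsed.total_seconds() / 86400       # elapsed time in days
calendar_dates = (end.date() - start.date()).days + 1   # inclusive date count

print(whole_days, round(fractional_days, 2), calendar_dates)
```

Which figure to report depends on the question: elapsed time between the first and last reading, or the number of dates with data.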


## Summary

| File                     | Nodes | Sensors | Start Date | End Date   | Duration (days) |
|--------------------------|-------|---------|------------|------------|-----------------|
| aot_aggregated_10min.csv | 10    | 6       | 2020-01-12 | 2020-02-08 | 27              |
| aot_aggregated_1hour.csv | 10    | 6       | 2020-01-12 | 2020-02-08 | 27              |

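> Both the 10-minute and 1-hour datasets cover 10 nodes and 6 sensors  
> over the same time span (Jan 12 – Feb 8, 2020). The reported 27 days counts whole  
> elapsed days between the first and last timestamps; the span covers 28 calendar dates.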
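
The distance between the first and last timestamps does not by itself show that every interval in between has data. A minimal sketch of a gap check for a 10-minute series follows; the timestamps are illustrative, with one step deliberately missing, and `after_gap` is a hypothetical name.

```python
import pandas as pd

# Illustrative timestamps; the 00:20 step is deliberately missing
ts = pd.Series(
    pd.to_datetime(
        [
            "2020-01-12 00:00:00",
            "2020-01-12 00:10:00",
            "2020-01-12 00:30:00",
            "2020-01-12 00:40:00",
        ]
    )
)

expected_step = pd.Timedelta(minutes=10)

# Compare consecutive distinct timestamps against the expected spacing
unique_ts = ts.drop_duplicates().sort_values().reset_index(drop=True)
steps = unique_ts.diff()

# Timestamps that follow a gap longer than the expected step
after_gap = unique_ts[steps > expected_step]
print(after_gap)
```

Applied to the `timestamp` column of an aggregated file, an empty result would indicate that no time step is missing across the file as a whole; checking each node separately would need the same logic inside a `groupby`.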

> ⚠️ **Note on dataset scope**  
> The research presented in the associated paper was conducted on the **full AoT dataset**  
> (≈500 deployed nodes across Chicago, with raw data collected at 30-second intervals).  
>  
> For reproducibility in this repository, we provide a **subset** covering 10 nodes,  
> 6 sensors, and a continuous window of just under 28 days (Jan 12 – Feb 8, 2020; 27 whole elapsed days).  
>  
> This subset is included to demonstrate the workflows and code without requiring  
> access to the full dataset. The same methods apply directly to the full-scale data.
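
To illustrate how 30-second readings relate to the 10-minute layout used here, the sketch below aggregates synthetic data per node, sensor, and parameter. It assumes aggregation by mean, which the notebook does not state; the values, the node name, and `agg_10min` are likewise illustrative.

```python
import pandas as pd

# Synthetic 30-second readings for one node/sensor/parameter (20 minutes of data)
raw = pd.DataFrame(
    {
        "timestamp": pd.date_range(
            "2020-01-12 00:00:00", periods=40, freq=pd.Timedelta(seconds=30)
        ),
        "node_id": "node_a",
        "sensor": "htu21d",
        "parameter": "temperature",
        "value_hrf": [float(i) for i in range(40)],
    }
)

# Assumption: mean aggregation. The statistic used for the provided
# aggregated files is not documented in this notebook.
agg_10min = (
    raw.groupby(
        ["node_id", "sensor", "parameter", pd.Grouper(key="timestamp", freq="10min")]
    )["value_hrf"]
    .mean()
    .reset_index()
)
print(agg_10min)
```

Grouping on the identifier columns before binning by time keeps readings from different nodes or sensors from being averaged together.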