## Measuring Air Quality
We downloaded data from one sensor over a 24-hour period and selected three half-hour intervals, spread throughout the day, during which the readings were roughly constant. This gave us three sets of 15 two-minute averages, for a total of 45 measurements.

In [4]:
import pandas as pd
import numpy as np

In [5]:
# This is the original data from the sensor
data = pd.read_csv("3.4/purple_air_2min_sample.csv")
data.head()

|   | aq2.5 | time | hour | diffs | meds |
|---|---|---|---|---|---|
| 0 | 6.14 | 2022-04-01 00:01:10 UTC | 0 | 0.765 | 5.375 |
| 1 | 5.0 | 2022-04-01 00:03:10 UTC | 0 | -0.375 | 5.375 |
| 2 | 5.29 | 2022-04-01 00:05:10 UTC | 0 | -0.085 | 5.375 |
| 3 | 4.73 | 2022-04-01 00:07:10 UTC | 0 | -0.645 | 5.375 |
| 4 | 4.31 | 2022-04-01 00:09:10 UTC | 0 | -1.065 | 5.375 |


In [6]:
# Data attributes:
print("Dataset Overview:")
print(f"Total rows: {len(data)}")
print(f"Columns: {list(data.columns)}")
print(f"Time range: {data['time'].iloc[0]} to {data['time'].iloc[-1]}")

Dataset Overview:
Total rows: 150
Columns: ['aq2.5', 'time', 'hour', 'diffs', 'meds']
Time range: 2022-04-01 00:01:10 UTC to 2022-04-01 20:01:20 UTC


In [7]:
data['hour'].unique()

array([ 0,  6, 11, 16, 19])

In [8]:
data['meds'].unique()

array([ 5.375, 10.545,  6.6  , 10.43 ,  8.555])

The dataset contains 5 distinct time periods, labelled by the `hour` column, each with its own median (`meds`).
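Before ranking the periods, it may help to confirm what `diffs` represents. The sketch below reuses the five rows shown in the `data.head()` output and checks the assumption that `diffs` is each reading minus its period median; the names `sample`, `diffs_match` and `rows_per_period` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# First five rows of the sensor data, copied from the data.head() output above
sample = pd.DataFrame({
    "aq2.5": [6.14, 5.0, 5.29, 4.73, 4.31],
    "hour": [0, 0, 0, 0, 0],
    "diffs": [0.765, -0.375, -0.085, -0.645, -1.065],
    "meds": [5.375, 5.375, 5.375, 5.375, 5.375],
})

# Assumption being checked: diffs is the reading minus the median of its period
diffs_match = np.allclose(sample["aq2.5"] - sample["meds"], sample["diffs"])
print(f"diffs == aq2.5 - meds: {diffs_match}")

# Row count per period; run on the full data, this would list all five periods
rows_per_period = sample.groupby("hour").size()
print(rows_per_period)
```

Running the same two checks on the full `data` frame would show whether the relationship holds for every row and how many measurements each period contains.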
#### How to select 45 records (3 periods x 15 measurements):
1. Select periods where readings were "roughly constant over 30 minutes"
2. Choose 3 half-hour intervals spread throughout the day
3. Take 15 two-minute averages from each period (total 45 measurements)

### STEP1: Analyse stability of each time period

In [9]:
# Calculate ONLY variance for each period (sufficient for ranking stability)
period_stability = []

In [10]:
for meds_value in sorted(data['meds'].unique()):
    period_data = data[data['meds'] == meds_value]
    # All records in a period share the same hour label, so read it from the first row
    hour = period_data['hour'].iloc[0]

    # Variance of diffs; np.var defaults to ddof=0 (population variance)
    diffs_variance = np.var(period_data['diffs']) # Lower = more stable

    period_stability.append({
        'hour': hour, 
        'meds': meds_value,
        'stability': diffs_variance,
        'time_start': period_data['time'].iloc[0][11:16]
    })

    print(f"Hour {hour:2d}: diffs_variance={diffs_variance:.3f}")

Hour  0: diffs_variance=0.285
Hour 11: diffs_variance=0.876
Hour 19: diffs_variance=1.000
Hour 16: diffs_variance=0.806
Hour  6: diffs_variance=1.378
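The same per-period variance can probably be computed without an explicit loop. The following is a minimal sketch on synthetic data (the values in `toy` are made up for illustration and are not sensor readings). It groups on `hour` rather than `meds`, which avoids relying on every period having a unique median.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sensor data (hypothetical values, not real readings):
# two periods labelled by hour, each with deviations from the period median
toy = pd.DataFrame({
    "hour":  [0, 0, 0, 0, 6, 6, 6, 6],
    "diffs": [0.5, -0.5, 0.1, -0.1, 1.5, -1.5, 0.8, -0.8],
})

# ddof=0 matches np.var's default (population variance);
# pandas' own default is ddof=1 (sample variance)
stability = toy.groupby("hour")["diffs"].var(ddof=0).sort_values()
print(stability)
```

Sorting the result puts the most stable period first, which corresponds to the ordering used in the next step.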


### STEP2: Sort by stability (lowest variance first)

In [11]:
period_stability.sort(key=lambda x: x['stability'])

In [12]:
# Select 3 periods: most stable + good time spread
selected = []

#### Strategy: pick most stable from early/mid/late day

In [13]:
early = [p for p in period_stability if p['hour'] <= 8]
mid = [p for p in period_stability if 9 <= p['hour'] <= 15]
late = [p for p in period_stability if p['hour'] >= 16]

In [14]:
# Each list is already sorted by variance, so element 0 is its most stable period
for part_of_day in (early, mid, late):
    if part_of_day:
        selected.append(part_of_day[0])

In [15]:
selected

[{'hour': np.int64(0),
  'meds': np.float64(5.375),
  'stability': np.float64(0.2847755555555555),
  'time_start': '00:01'},
 {'hour': np.int64(11),
  'meds': np.float64(6.6),
  'stability': np.float64(0.8756712222222222),
  'time_start': '11:03'},
 {'hour': np.int64(16),
  'meds': np.float64(10.43),
  'stability': np.float64(0.8060623333333334),
  'time_start': '16:03'}]
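As a cross-check, the early/mid/late selection can also be expressed with `pd.cut` and a grouped `idxmin`. This is only a sketch: it uses the rounded variances printed in STEP1, and the names `summary`, `part_of_day`, `best_rows` and `selected_hours` are introduced here for illustration.

```python
import pandas as pd

# Per-period variances copied from the printed output above (rounded to 3 decimals)
summary = pd.DataFrame({
    "hour": [0, 11, 19, 16, 6],
    "stability": [0.285, 0.876, 1.000, 0.806, 1.378],
})

# Same cut-offs as the early/mid/late lists: hour <= 8, 9 to 15, >= 16
summary["part_of_day"] = pd.cut(
    summary["hour"], bins=[-1, 8, 15, 23], labels=["early", "mid", "late"]
)

# idxmin returns the row label of the lowest variance within each bin
best_rows = summary.groupby("part_of_day", observed=True)["stability"].idxmin()
selected_hours = summary.loc[best_rows, "hour"].tolist()
print(selected_hours)
```

With these inputs the sketch picks hours 0, 11 and 16, the same periods as the list-based selection.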

### STEP3: Extract first 15 measurements from each selected period

In [16]:
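period_frames = []

In [17]:
for i, period in enumerate(selected, start=1):
    # Keep the first 15 two-minute averages (30 minutes) of the period
    period_data = data[data['meds'] == period['meds']].head(15)
    period_frames.append(period_data)

    print(f"Period {i}: Hour {period['hour']} - {len(period_data)} measurements")

# Concatenate once, after the loop, instead of growing a DataFrame on every iteration
final_dataset = pd.concat(period_frames, ignore_index=True)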

Period 1: Hour 0 - 15 measurements
Period 2: Hour 11 - 15 measurements
Period 3: Hour 16 - 15 measurements


In [18]:
final_dataset

|   | aq2.5 | time | hour | diffs | meds |
|---|---|---|---|---|---|
| 0 | 6.14 | 2022-04-01 00:01:10 UTC | 0 | 0.765 | 5.375 |
| 1 | 5.0 | 2022-04-01 00:03:10 UTC | 0 | -0.375 | 5.375 |
| 2 | 5.29 | 2022-04-01 00:05:10 UTC | 0 | -0.085 | 5.375 |
| 3 | 4.73 | 2022-04-01 00:07:10 UTC | 0 | -0.645 | 5.375 |
| 4 | 4.31 | 2022-04-01 00:09:10 UTC | 0 | -1.065 | 5.375 |
| 5 | 5.66 | 2022-04-01 00:11:10 UTC | 0 | 0.285 | 5.375 |
| 6 | 4.41 | 2022-04-01 00:13:10 UTC | 0 | -0.965 | 5.375 |
| 7 | 5.55 | 2022-04-01 00:15:10 UTC | 0 | 0.175 | 5.375 |
| 8 | 5.63 | 2022-04-01 00:17:10 UTC | 0 | 0.255 | 5.375 |
| 9 | 5.97 | 2022-04-01 00:19:10 UTC | 0 | 0.595 | 5.375 |
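One way to sanity-check the extracted rows is to confirm that consecutive readings within a period are two minutes apart. The sketch below does this for the ten timestamps shown above; the names `times` and `gaps` are introduced here, and stripping the trailing `UTC` label before parsing is a simplifying assumption.

```python
import pandas as pd

# Timestamps of the ten rows shown above, copied from the time column
raw_times = pd.Series([
    "2022-04-01 00:01:10 UTC", "2022-04-01 00:03:10 UTC",
    "2022-04-01 00:05:10 UTC", "2022-04-01 00:07:10 UTC",
    "2022-04-01 00:09:10 UTC", "2022-04-01 00:11:10 UTC",
    "2022-04-01 00:13:10 UTC", "2022-04-01 00:15:10 UTC",
    "2022-04-01 00:17:10 UTC", "2022-04-01 00:19:10 UTC",
])

# Drop the literal " UTC" suffix so an explicit format string can be used
times = pd.to_datetime(
    raw_times.str.replace(" UTC", "", regex=False),
    format="%Y-%m-%d %H:%M:%S",
)

# Gap between each reading and the previous one (the first row has no predecessor)
gaps = times.diff().dropna()
evenly_spaced = (gaps == pd.Timedelta(minutes=2)).all()
print(f"All gaps are two minutes: {evenly_spaced}")
```

Applied per period to the full `final_dataset` (for example after grouping by `hour`), the same check would flag any interval with missing or irregular readings.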
