# Raw data cleaning and analysis using pandas and NumPy

Keywords: data cleaning, energy metrics, baseline

This notebook demonstrates how to use the Python libraries pandas and NumPy to clean a set of building meter data and weather data for baseline model development and validation.

# Building Energy Data Analysis
This notebook demonstrates the process of cleaning, processing, and analyzing building energy data and weather data to generate performance metrics.

## Datasets
1. **Meter Data**: Electrical power consumption data at 15-min intervals.
2. **Weather Data**: Weather observations at 15-min intervals.
3. **Site Descriptions**: Metadata about building sites.

## Objectives
- Clean and preprocess the data.
- Merge datasets for integrated analysis.
- Calculate building energy performance metrics.
- Export the metrics in JSON format.

## Imports

In [2]:
import pandas as pd
import numpy as np
import sys
import os
import json
import matplotlib.pyplot as plt


## Step 1: Load the Datasets
We load the meter data, weather data, and site descriptions for analysis.

In [3]:
# Load the data
meter_data = pd.read_csv('data/chapter2/meter-data/TwoCarnegiePlaza.csv')
weather_data = pd.read_csv('data/chapter2/SanBernadino_2018-01-01_2020-01-01_Weather.csv')
site_description = pd.read_csv('data/chapter2/sites-desc.csv')

# Display the first few rows of each dataset
meter_data.head(), weather_data.head(), site_description.head()

(      datetime           site_id  power
 0  6/1/08 0:00  TwoCarnegiePlaza  36.00
 1  6/1/08 0:15  TwoCarnegiePlaza  37.44
 2  6/1/08 0:30  TwoCarnegiePlaza  37.92
 3  6/1/08 0:45  TwoCarnegiePlaza  37.44
 4  6/1/08 1:00  TwoCarnegiePlaza  37.44,
           time  apparentTemperature  cloudCover  dewPoint  humidity  \
 0  1/1/08 0:00                 8.22         0.0  -10.5800      0.24   
 1  1/1/08 0:15                 8.34         0.0  -10.6125      0.24   
 2  1/1/08 0:30                 8.46         0.0  -10.6450      0.24   
 3  1/1/08 0:45                 8.58         0.0  -10.6775      0.24   
 4  1/1/08 1:00                 8.70         0.0  -10.7100      0.24   
 
           icon  precipIntensity  precipProbability precipType   pressure  \
 0  clear-night              0.0                0.0        NaN  1024.1900   
 1          NaN              0.0                0.0        NaN  1024.0975   
 2          NaN              0.0                0.0        NaN  1024.0050   
 3         
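The preview above shows that the files do not share a schema: the meter file keeps its timestamp in `datetime`, while the weather file uses `time` and has missing values in several columns. A quick per-column summary before cleaning can make such differences visible. The following is a minimal sketch on a small synthetic frame rather than the real files; the helper name `summarize_frame` is hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_frame(df):
    """Return one row per column with its dtype and missing-value count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_missing': df.isna().sum(),
    })


# Synthetic stand-in for a few weather rows (illustrative values only)
sample = pd.DataFrame({
    'time': ['1/1/08 0:00', '1/1/08 0:15', '1/1/08 0:30'],
    'humidity': [0.24, np.nan, 0.24],
    'icon': ['clear-night', None, None],
})

summary = summarize_frame(sample)
print(summary)
```

Running the same helper on `meter_data`, `weather_data`, and `site_description` would show which columns need renaming or gap filling in Step 2.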

## Step 2: Clean and Preprocess the Data
### Meter Data
- Convert `datetime` to a proper timestamp.
- Drop rows with missing or invalid power values.

### Weather Data
- Select relevant weather attributes.
- Fill missing values with column means.

### Site Descriptions
- Ensure `site_id` values match across datasets.

In [None]:
# Clean meter data
# Timestamps look like '6/1/08 0:15': two-digit year (%y) and no seconds
meter_data['datetime'] = pd.to_datetime(meter_data['datetime'], format='%m/%d/%y %H:%M')
meter_data = meter_data.dropna(subset=['power'])
meter_data = meter_data[meter_data['power'] >= 0].copy()

# Clean weather data
# The weather file stores its timestamp in 'time'; rename it to match the meter data
weather_data = weather_data.rename(columns={'time': 'datetime'})
weather_data['datetime'] = pd.to_datetime(weather_data['datetime'], format='%m/%d/%y %H:%M')
relevant_weather_columns = ['datetime', 'temperature', 'humidity', 'windSpeed', 'precipIntensity']
weather_data = weather_data[relevant_weather_columns].copy()
# Fill gaps in the numeric columns with each column's mean
weather_data = weather_data.fillna(weather_data.mean(numeric_only=True))

# Keep only meter rows whose site_id appears in the site descriptions
valid_site_ids = site_description['site_id'].unique()
meter_data = meter_data[meter_data['site_id'].isin(valid_site_ids)]
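The sample output in Step 1 shows timestamps such as `6/1/08 0:15`, with a two-digit year and no seconds, so a format string of `%m/%d/%y %H:%M` matches them. The cleaning steps can be tried on a few synthetic rows first. This is a minimal sketch under that assumption; the values are illustrative, and negative power is treated as invalid as in the cell above.

```python
import numpy as np
import pandas as pd

# Synthetic meter rows in the same layout as the sample output (illustrative values)
meter = pd.DataFrame({
    'datetime': ['6/1/08 0:00', '6/1/08 0:15', '6/1/08 0:30', '6/1/08 0:45'],
    'power': [36.0, np.nan, -5.0, 37.44],
})

# Two-digit year and no seconds: use '%y' (not '%Y') and '%H:%M'
meter['datetime'] = pd.to_datetime(meter['datetime'], format='%m/%d/%y %H:%M')

# Drop missing readings, then negative ones
meter = meter.dropna(subset=['power'])
meter = meter[meter['power'] >= 0].reset_index(drop=True)

# Synthetic weather rows with a gap in 'humidity'
weather = pd.DataFrame({
    'temperature': [8.0, 9.0, 10.0],
    'humidity': [0.2, np.nan, 0.4],
})

# Fill each numeric column's gaps with that column's mean
weather = weather.fillna(weather.mean(numeric_only=True))

print(meter)
print(weather)
```

Mean filling is simple but flattens short-term variation; for 15-minute weather series, time-based interpolation may be a better fit, depending on how long the gaps are.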


## Step 3: Merge Datasets
Merge the meter data, weather data, and site descriptions for integrated analysis.

In [5]:
# Merge meter and weather data
merged_data = pd.merge_asof(
    meter_data.sort_values('datetime'), 
    weather_data.sort_values('datetime'), 
    on='datetime', 
    direction='nearest'
)

# Add site descriptions
final_data = pd.merge(merged_data, site_description, on='site_id', how='left')
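With `direction='nearest'` and no `tolerance`, `pd.merge_asof` pairs every meter row with the closest weather row, however far apart in time they are. The sketch below uses synthetic data to show how a `tolerance` leaves distant matches empty instead. The 30-minute limit is a hypothetical choice, not a value taken from this notebook.

```python
import pandas as pd

# Synthetic meter readings; the last one is hours away from any weather row
meter = pd.DataFrame({
    'datetime': pd.to_datetime(['2008-06-01 00:00', '2008-06-01 00:15', '2008-06-01 03:00']),
    'power': [36.0, 37.44, 40.0],
})

# Synthetic weather observations, slightly offset from the meter timestamps
weather = pd.DataFrame({
    'datetime': pd.to_datetime(['2008-06-01 00:01', '2008-06-01 00:14']),
    'temperature': [18.0, 18.5],
})

merged = pd.merge_asof(
    meter.sort_values('datetime'),
    weather.sort_values('datetime'),
    on='datetime',
    direction='nearest',
    tolerance=pd.Timedelta('30min'),  # hypothetical limit; tune to the data
)

print(merged)
```

The first two meter rows pick up the weather observation one minute away, while the 03:00 row gets a missing `temperature` because no observation falls within 30 minutes.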

## Step 4: Calculate Energy Performance Metrics
Metrics include:
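- **Daily Energy Usage**: Sum of the 15-minute power readings per day. This sum is a proxy for energy; if `power` is in kW, multiplying by 0.25 h per interval converts it to kWh.
- **Energy Use Intensity (EUI)**: Daily energy usage divided by floor area.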

In [6]:
# Calculate daily metrics
final_data['date'] = final_data['datetime'].dt.date
daily_metrics = final_data.groupby(['site_id', 'date']).agg({
    'power': 'sum',
    'temperature': 'mean',
    'humidity': 'mean',
    'windSpeed': 'mean',
    'precipIntensity': 'mean',
    'floor_area': 'first'
}).reset_index()

# Calculate EUI
daily_metrics['EUI'] = daily_metrics['power'] / daily_metrics['floor_area']
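Summing power readings gives a quantity proportional to energy, not energy itself. Assuming `power` is an average kW value over each 15-minute interval (the source does not state the unit), each reading corresponds to 0.25 h, so daily energy in kWh is the daily sum times 0.25. A minimal sketch on synthetic data follows; the floor area value and its unit are hypothetical.

```python
import pandas as pd

INTERVAL_HOURS = 0.25  # 15-minute readings, assumed to be average kW per interval

# One day of synthetic readings: 96 intervals at a constant 40 kW
readings = pd.DataFrame({
    'datetime': pd.date_range('2008-06-01', periods=96, freq='15min'),
    'power': 40.0,
})
floor_area = 1000.0  # hypothetical floor area

readings['date'] = readings['datetime'].dt.date

# Energy (kWh) = sum of power (kW) x interval length (h)
daily_energy_kwh = readings.groupby('date')['power'].sum() * INTERVAL_HOURS

# Daily EUI in kWh per unit of floor area
daily_eui = daily_energy_kwh / floor_area

print(daily_energy_kwh)
print(daily_eui)
```

A constant 40 kW load over 96 intervals yields 960 kWh for the day, whereas the raw sum of readings is 3840.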

## Step 5: Export Metrics
Save the metrics as a JSON file for further use or sharing.

In [8]:
# Export metrics to JSON
output_json_path = 'data/chapter2/building_energy_metrics.json'
daily_metrics.set_index(['site_id', 'date']).to_json(output_json_path, orient='index')

print(f"Metrics exported to {output_json_path}")

Metrics exported to data/chapter2/building_energy_metrics.json
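Exporting with a `(site_id, date)` MultiIndex and `orient='index'` leaves the shape of the JSON keys up to pandas. An alternative that keeps the keys explicit is to write one record per row with the date stored as an ISO string. This is a sketch on synthetic data, not a replacement for the cell above; it returns the JSON as a string instead of writing a file.

```python
import datetime as dt
import json

import pandas as pd

# Synthetic daily metrics (illustrative values only)
daily = pd.DataFrame({
    'site_id': ['TwoCarnegiePlaza', 'TwoCarnegiePlaza'],
    'date': [dt.date(2008, 6, 1), dt.date(2008, 6, 2)],
    'power': [3600.0, 3700.0],
})

# Store dates as ISO strings so the output does not depend on pandas' date encoding
export = daily.assign(date=daily['date'].astype(str))

# With no path argument, to_json returns the JSON text
json_text = export.to_json(orient='records')

# Any JSON reader can load the records back
records = json.loads(json_text)
print(records[0])
```

Passing a file path as the first argument to `to_json` writes the same records to disk.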
