# Predicting Energy and Gas Savings

# Part 4: Statistical testing

Welcome to this Jupyter notebook, where we will embark on a statistical testing journey to assess the impact of energy efficiency projects in residential homes participating in the Home Performance with ENERGY STAR® Program from 2007 to 2012.

**Background:**
The Home Performance with ENERGY STAR® Program, overseen by the U.S. Environmental Protection Agency (EPA) and U.S. Department of Energy (DOE), focuses on promoting energy efficiency. Our analysis involves comparing estimated savings against normalized values from an open-source energy efficiency meter.

**About the Home Performance with ENERGY STAR® Program:**
"Home Performance with ENERGY STAR® is a national collaborative program between the U.S. Department of Energy and the U.S. Environmental Protection Agency. It includes a network of 32 utility and nonprofit sponsors, along with 1,300 home performance contractors. Since 2001, Home Performance with ENERGY STAR has been the trusted source for contractors and energy programs delivering home energy upgrades. These upgrades make American homes safer, healthier, and more energy-efficient. The program offers a comprehensive evaluation, with recommended work performed by trained and qualified contractors. A cornerstone of the program is a set of rigorous quality assurance requirements." (source: [Department of Energy](https://www.energy.gov/eere/buildings/home-performance-energy-starr), retrieved 29.01.2024)

**Dataset Overview:**
This dataset backcasts estimated modeled savings for completed projects in the State of New York (US) from 2007 to 2012. These projects are part of the Home Performance with ENERGY STAR® Program under Residential Existing Homes (One to Four Units) Predicted First Year Savings for Energy Efficiency Measures: 2007 – 2012. The analysis compares the estimated savings against normalized savings calculated by an open-source energy efficiency meter. In this part of the series, we use the cleaned data to run statistical tests on energy usage before and after the projects.

**Datasource:**
[Data - New York State](https://data.world/data-ny-gov/jtrr-tvq4) (Retrieved on 29.01.2024)

**Project Goal:**
Our primary goal is to develop machine learning models capable of predicting gas and energy usage in residential homes as well as the project costs. As a final step, covered in this notebook, we conduct statistical tests to assess whether gas and electricity usage changed significantly between the periods before and after the energy efficiency projects.

Let's dive into the world of statistical testing! 



In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

## Load data

In [26]:
data = pd.read_csv('../data/cleaned_data/cleaned_data.csv')

In [27]:
data.head()

Unnamed: 0.1,Unnamed: 0,project_id,contractor_id,project_county,project_city,project_zip,climate_zone,weather_station,weather_station-normalization,project_completion_date,...,consolidated_edison,lipa,national_grid,national_fuel_gas,nyseg,orange_and_rockland,rochester_gas_and_electric,all_false,coordinates,age_category
0,0,P00000034473,CY0000000014,Onondaga,Fabius,13063,5,725190,725190,2007-08-17,...,0,0,1,0,0,0,0,False,"42.850323, -75.979919",Moderate
1,1,P00000110370,CY0000000014,Onondaga,Nedrow,13120,5,725190,725190,2007-10-04,...,0,0,1,0,0,0,0,False,"42.950373, -76.163321",Very Old
2,2,P00000182080,CY0000000014,Onondaga,Jamesville,13078,5,725190,725190,2008-02-27,...,0,0,1,0,0,0,0,False,"42.976691, -76.069719",Moderate
3,3,P00000196191,CY0000000261,Albany,Albany,12203,5,725180,725180,2008-02-20,...,0,0,1,0,0,0,0,False,"42.680815, -73.836193",Old
4,4,P00000327900,CY0000000004,Erie,Buffalo,14221,5,725280,725280,2008-06-18,...,0,0,1,0,0,0,0,False,"42.980424, -78.728009",Moderate


In [28]:
data = data.drop(columns = ['Unnamed: 0'])

In [29]:
data = data.drop(columns = ['all_false'])

# Statistical test on gas usage

As part of assessing the impact of energy efficiency projects, it is hypothesized that gas usage in residential homes changed substantially, as indicated by the columns baseline_gas (gas usage before the project) and reporting_gas (gas usage after the project). To test this hypothesis, a paired t-test is conducted to assess whether the difference in gas usage before and after the energy efficiency projects is statistically significant.

## Paired t-test

Null Hypothesis (H0): The mean difference between baseline_gas and reporting_gas is equal to zero (no change).

Alternative Hypothesis (H1): The mean difference between baseline_gas and reporting_gas is not equal to zero (there is a significant change).
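
To make the sign convention of the test concrete, the following minimal sketch computes the paired t-statistic by hand and compares it with `scipy.stats.ttest_rel`. The arrays `before` and `after` are synthetic stand-ins, not values from the dataset. The statistic is built from the per-home differences `before - after`, so a positive value means the first argument has the higher mean.

```python
import numpy as np
from scipy.stats import ttest_rel

# Synthetic stand-ins for baseline and reporting usage (not real project data)
rng = np.random.default_rng(0)
before = rng.normal(loc=100.0, scale=10.0, size=50)
after = before - rng.normal(loc=5.0, scale=3.0, size=50)

# Paired t-statistic: mean of the differences divided by its standard error
diff = before - after
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))

# SciPy applies the same formula to the paired differences before - after
t_scipy, p_scipy = ttest_rel(before, after)

print(f"manual t: {t_manual:.4f}, scipy t: {t_scipy:.4f}")
```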

In [33]:
from scipy.stats import ttest_rel

# ttest_rel(a, b) tests the mean of the paired differences a - b
t_statistic, p_value = ttest_rel(data['baseline_gas'], data['reporting_gas'])

print(f'T-statistic: {t_statistic}')
print(f'P-value: {p_value}')

# Check if the result is statistically significant at a chosen significance level (here: 0.05)

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference.")

    # Determine the direction of the difference (a positive t means baseline > reporting)
    if t_statistic > 0:
        print("The mean of 'baseline_gas' is higher than 'reporting_gas'.")
    else:
        print("The mean of 'baseline_gas' is lower than 'reporting_gas'.")
else:
    print("Fail to reject the null hypothesis. No significant difference found.")



T-statistic: 25.32654973741271
P-value: 1.641024407155816e-130
Reject the null hypothesis. There is a significant difference.
The mean of 'baseline_gas' is higher than 'reporting_gas'.


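This leads to the conclusion that gas usage changed between the baseline and reporting periods. We reject H0 in favor of the alternative hypothesis (H1): the mean difference between baseline_gas and reporting_gas is not equal to zero.

Because ttest_rel computes the paired differences as baseline_gas minus reporting_gas, the positive t-statistic indicates that mean gas usage was lower after the projects than before.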
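
With several thousand paired observations, even a small mean difference can yield a very small p-value, so it can be useful to report the size of the change alongside the test result. The following sketch, which uses made-up values rather than the dataset, computes the mean of the paired differences and a t-based 95% confidence interval. The helper name `paired_mean_difference` is hypothetical.

```python
import numpy as np
from scipy import stats

def paired_mean_difference(before, after, confidence=0.95):
    """Return the mean of (before - after) and a t-based confidence interval."""
    diff = np.asarray(before, dtype=float) - np.asarray(after, dtype=float)
    diff = diff[~np.isnan(diff)]  # drop pairs with a missing value
    n = diff.size
    mean = diff.mean()
    sem = diff.std(ddof=1) / np.sqrt(n)  # standard error of the mean difference
    t_crit = stats.t.ppf(0.5 + confidence / 2, df=n - 1)
    return mean, (mean - t_crit * sem, mean + t_crit * sem)

# Made-up example values (hypothetical, not taken from the dataset)
before = [120.0, 98.0, 143.0, 110.0, 87.0, 131.0]
after = [111.0, 95.0, 130.0, 108.0, 80.0, 125.0]

mean_diff, (ci_low, ci_high) = paired_mean_difference(before, after)
print(f"Mean reduction: {mean_diff:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

On the real data, the same function could be called with data['baseline_gas'] and data['reporting_gas'], assuming both columns are numeric.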

# Statistical test on electricity usage

As part of assessing the impact of energy efficiency projects, it is hypothesized that electricity usage in residential homes changed substantially, as indicated by the columns baseline_electric (electricity usage before the project) and reporting_electric (electricity usage after the project). To test this hypothesis, a paired t-test is conducted to assess whether the difference in electricity usage before and after the energy efficiency projects is statistically significant.

## Paired t-test

Null Hypothesis (H0): The mean difference between baseline_electric and reporting_electric is equal to zero (no change).

Alternative Hypothesis (H1): The mean difference between baseline_electric and reporting_electric is not equal to zero (there is a significant change).

In [44]:
from scipy.stats import ttest_rel

# Paired test on the 'baseline_electric' and 'reporting_electric' columns;
# ttest_rel(a, b) tests the mean of the paired differences a - b
t_statistic, p_value = ttest_rel(data['baseline_electric'], data['reporting_electric'])

# Print the results
print(f'T-statistic: {t_statistic}')
print(f'P-value: {p_value}')

# Check if the result is statistically significant at a chosen significance level (here: 0.05)
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference.")

    # Determine the direction of the difference (a positive t means baseline > reporting)
    if t_statistic > 0:
        print("The mean of 'baseline_electric' is higher than 'reporting_electric'.")
    else:
        print("The mean of 'baseline_electric' is lower than 'reporting_electric'.")
else:
    print("Fail to reject the null hypothesis. No significant difference found.")


T-statistic: 7.280835163901055
P-value: 4.042923029585936e-13
Reject the null hypothesis. There is a significant difference.
The mean of 'baseline_electric' is higher than 'reporting_electric'.
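
Since the gas and electricity tests follow the same steps, the logic could be collected in one small function. This is a sketch under assumptions: the helper name `paired_ttest_report` is hypothetical, the demo values are made up, and rows with a missing value in either column are dropped before testing, which the original cells do not do.

```python
import pandas as pd
from scipy.stats import ttest_rel

def paired_ttest_report(df, before_col, after_col, alpha=0.05):
    """Run a paired t-test on two columns and describe the outcome in words."""
    pairs = df[[before_col, after_col]].dropna()  # keep complete pairs only
    t_stat, p_val = ttest_rel(pairs[before_col], pairs[after_col])
    if not p_val < alpha:  # also covers a NaN p-value
        verdict = "no significant difference"
    elif t_stat > 0:
        verdict = f"'{before_col}' is significantly higher than '{after_col}'"
    else:
        verdict = f"'{before_col}' is significantly lower than '{after_col}'"
    return {"t_statistic": t_stat, "p_value": p_val, "verdict": verdict}

# Tiny made-up frame for illustration only
demo = pd.DataFrame({
    "baseline_gas": [900.0, 850.0, 1020.0, 760.0, 980.0],
    "reporting_gas": [820.0, 800.0, 950.0, 700.0, 930.0],
})
result = paired_ttest_report(demo, "baseline_gas", "reporting_gas")
print(result)
```

Returning a dictionary instead of printing makes it easier to collect the results for several column pairs in one table.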


# The question of seasonality and other external factors

Gas and electricity usage are likely influenced by the seasonal changes in the US state of New York, which experiences four distinct seasons, with January being the coldest month.

The paired test on gas usage yields a positive t-statistic for baseline_gas minus reporting_gas, meaning that mean baseline_gas exceeds mean reporting_gas. This raises the question of how much of the difference reflects the projects themselves rather than seasonal influences, particularly during colder months.

In [48]:
data['project_completion_date'].head()

0    2007-08-17
1    2007-10-04
2    2008-02-27
3    2008-02-20
4    2008-06-18
Name: project_completion_date, dtype: object

In [52]:
data['month'] = pd.to_datetime(data['project_completion_date']).dt.month

data['month']

0        8
1       10
2        2
3        2
4        6
        ..
3647     8
3648    10
3649    10
3650     9
3651    10
Name: month, Length: 3652, dtype: int32

In [55]:
data['month'].value_counts()

month
1     400
12    368
10    341
2     332
11    329
3     311
9     304
8     301
7     259
4     251
6     245
5     211
Name: count, dtype: int64
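
The counts above can be summarized for the colder half of the year. The sketch below copies the month counts from the output into a Series and computes the share of projects completed from October to March. The variable names are illustrative.

```python
import pandas as pd

# Month counts copied from the value_counts() output above
month_counts = pd.Series(
    {1: 400, 12: 368, 10: 341, 2: 332, 11: 329, 3: 311,
     9: 304, 8: 301, 7: 259, 4: 251, 6: 245, 5: 211},
    name="count",
)

cold_months = [10, 11, 12, 1, 2, 3]  # October to March
cold_total = month_counts.loc[cold_months].sum()
share = cold_total / month_counts.sum()

print(f"{cold_total} of {month_counts.sum()} projects ({share:.1%})")
# prints: 2081 of 3652 projects (57.0%)
```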

Most projects, 2,081 of 3,652 (about 57%), were completed during the colder months from October to March. Because these months matter most for energy efficiency due to lower temperatures, the current data, which only includes baseline and reporting values for 9 months, might not fully capture the impact of the projects.

To address this, it would be prudent to remodel the data by extending the baseline and reporting values to cover all 12 months. This adjustment would give a more complete view of energy usage patterns throughout the year and ensure that the analysis accounts for the winter months, when energy efficiency measures are likely to have the greatest impact.

Gas and electricity usage are influenced by various external factors, including seasonality, weather conditions, and other variables. To gain a more precise understanding of the impact of energy efficiency projects, it is essential to consider long-term data. This broader dataset would enable a more comprehensive analysis, allowing for a nuanced exploration of how different factors contribute to fluctuations in gas and electricity consumption over time.
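
One way to probe the seasonal question with the columns already available, as a rough sketch rather than a substitute for full-year data, is to run the paired test separately for projects completed in the colder and warmer halves of the year. The helper `paired_ttest_by_season` is hypothetical, the frame below is synthetic, and splitting by completion month is an assumption about what a useful grouping might be.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_rel

def paired_ttest_by_season(df, before_col, after_col, month_col="month"):
    """Run a paired t-test separately for cold-season and warm-season completions."""
    cold = df[month_col].isin([10, 11, 12, 1, 2, 3])
    rows = []
    for label, mask in (("Oct-Mar", cold), ("Apr-Sep", ~cold)):
        subset = df.loc[mask, [before_col, after_col]].dropna()
        t_stat, p_val = ttest_rel(subset[before_col], subset[after_col])
        rows.append({"season": label, "n": len(subset),
                     "t_statistic": t_stat, "p_value": p_val})
    return pd.DataFrame(rows)

# Synthetic data for illustration only; real values would come from `data`
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({"month": rng.integers(1, 13, size=n)})
demo["baseline_gas"] = rng.normal(900, 100, size=n)
demo["reporting_gas"] = demo["baseline_gas"] - rng.normal(60, 40, size=n)

season_results = paired_ttest_by_season(demo, "baseline_gas", "reporting_gas")
print(season_results)
```

If the two groups showed clearly different results on the real data, that would support the concern that completion timing affects the comparison.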