This assignment is about understanding the ECO dataset and using it to create interactive and animated visualizations that help explore and analyze household electricity consumption and appliance-level consumption patterns.

## Data

The Electricity Consumption and Occupancy (ECO) dataset contains power consumption data for three households at both the overall household level (smart meters) and the appliance level (plugs). The dataset covers various appliances like fridge, freezer, microwave, coffee machine, kettle, stereo, laptop, entertainment systems, and other devices. The smart meter data contains 16 columns with various power-related measurements, while the plug data provides appliance-level consumption. The data is organized into daily CSV files spanning from June 27, 2012, to January 31, 2013.

The most relevant portions of the dataset for our questions will be the 'powerallphases' variable from the smart meter data, which represents the total power consumed over all power phases in the household, and the plug data, which will help us analyze individual appliance consumption patterns.

## Data-science Questions

+ How does the overall power consumption of a household vary over time, and can we identify any patterns or trends in the data (e.g., daily or weekly cycles, seasonal variations)?  
Visualization 1 will display a time-series plot of the overall power consumption for each household. This visualization will help us identify any patterns or trends in the data and enable users to interactively explore consumption over different time frames.

+ What is the contribution of each appliance to the total power consumption of a household? Can we identify any high-consumption appliances that could be targeted for energy efficiency improvements?  
Visualization 2 will display a stacked bar chart or pie chart showing the proportion of total power consumption attributed to each appliance for each household. This visualization will help us understand the contribution of each appliance to the overall consumption and identify high-consumption appliances for potential energy efficiency improvements.

By creating these visualizations, we aim to understand the patterns of household electricity consumption and the role of individual appliances in overall power usage. The interactive nature of the visualizations will allow users to explore the data and gain insights that could be used to inform energy-saving strategies or policies.

## Data preparation and EDA

In this section, we'll prepare the data for analysis and create some initial exploratory visualizations. First, we need to load and preprocess the data by combining daily CSV files, converting units, and aggregating the data as required. Then, we'll perform some exploratory data analysis (EDA) to understand the data and inform our later visualizations. The whole implementation can be divided into the following steps:

### 1. Load the necessary libraries and define helper functions

The full dataset is very large and slow to load, so I added a variable called `time_point` and only read the data dated before it. To use the full dataset, set `time_point = '2013-01-23'`.

In [8]:
# Cutoff date: only files dated before this are loaded
time_point = '2012-07-01'

In [9]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots

# Helper function: load every daily CSV in a folder that is dated before `time_point`
def load_data(folder_path):
    data = []
    cutoff = pd.to_datetime(time_point)
    for file in sorted(os.listdir(folder_path)):
        if not file.endswith(".csv"):
            continue
        file_date = pd.to_datetime(file[:-4])  # the file name (minus .csv) is the date
        if file_date >= cutoff:
            continue  # skip files past the cutoff before reading them, to save time
        df = pd.read_csv(os.path.join(folder_path, file), header=None)
        df['date'] = file_date
        data.append(df)
    return pd.concat(data, ignore_index=True)
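
`load_data` stamps every row with only the date taken from the file name, so all readings from one day share a single timestamp. If finer time resolution is needed, one option is to derive a per-row timestamp from the row position. The sketch below assumes each daily file holds one reading per second, in order, starting at midnight. That is an assumption about the files, and `add_second_timestamps` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def add_second_timestamps(df, day, interval_s=1):
    """Return a copy of `df` with a 'timestamp' column.

    Assumes row i was recorded i * interval_s seconds after midnight of `day`.
    """
    out = df.reset_index(drop=True).copy()
    offsets = pd.to_timedelta(np.arange(len(out)) * interval_s, unit="s")
    out["timestamp"] = pd.Timestamp(day) + offsets
    return out


# Toy example: three readings from one day
toy = pd.DataFrame({"power": [10.0, 12.0, 11.0]})
stamped = add_second_timestamps(toy, "2012-06-27")
print(stamped)
```

With a column like this, the time axis of a plot would spread readings across the day instead of stacking them at midnight.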


### 2. Load and preprocess the smart meter data

In [10]:
# Load smart meter data for each household
sm_data_04 = load_data("eco/04_sm_csv")
sm_data_05 = load_data("eco/05_sm_csv")
sm_data_06 = load_data("eco/06_sm_csv")

# Rename columns and convert power consumption to kWh
for sm_data in [sm_data_04, sm_data_05, sm_data_06]:
    sm_data.columns = ['powerallphases', 'powerphase1', 'powerphase2', 'powerphase3', 'voltagephase1', 'voltagephase2', 'voltagephase3', 'currentphase1', 'currentphase2', 'currentphase3', 'pfphase1', 'pfphase2', 'pfphase3', 'freq', 'extra_col1', 'extra_col2', 'date']
    sm_data['powerallphases'] = sm_data['powerallphases'] / (3600 * 1000)  # W over a 1 s reading -> kWh per reading (assumes one reading per second)
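
Dividing a power reading by `3600 * 1000` gives energy in kWh only if each reading covers one second: P watts held for one second is P joules, and 1 kWh is 3.6 million joules. A small worked sketch, assuming one reading per second (the helper name `watts_to_kwh` and the `interval_s` parameter are illustrative, not from the notebook):

```python
JOULES_PER_KWH = 3600 * 1000  # 1 kWh = 3.6e6 J


def watts_to_kwh(readings_w, interval_s=1.0):
    """Total energy in kWh for power readings (W) taken every `interval_s` seconds."""
    joules = sum(readings_w) * interval_s
    return joules / JOULES_PER_KWH


# A constant 1000 W load for one hour (3600 one-second readings) uses 1 kWh
one_hour = [1000.0] * 3600
print(watts_to_kwh(one_hour))
```

If the readings were taken at a different interval, the same division would need the extra `interval_s` factor to stay correct.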

### 3. Load and preprocess the plug data

To keep the input consistent, I deleted the file `2012-06-25 15.50.57.csv` from `eco/04_plugs_csv/08`; its name is not a plain date, which is what `load_data` expects.
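
An alternative to deleting the file is to skip any file whose name is not a plain date. The sketch below is one way to do that, assuming daily files are named `YYYY-MM-DD.csv`; `is_daily_csv` is a hypothetical helper, not part of the original notebook.

```python
import re

# Matches names such as 2012-06-27.csv and nothing else
DAILY_CSV = re.compile(r"\d{4}-\d{2}-\d{2}\.csv")


def is_daily_csv(file_name):
    """True if `file_name` looks like YYYY-MM-DD.csv."""
    return DAILY_CSV.fullmatch(file_name) is not None


names = ["2012-06-27.csv", "2012-06-25 15.50.57.csv", "notes.txt"]
print([n for n in names if is_daily_csv(n)])
```

Inside `load_data`, the `file.endswith(".csv")` check could be replaced by `is_daily_csv(file)`, leaving the raw dataset untouched.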

In [12]:
# Load plug data for each household and appliance
plugs_data_04 = {appliance: load_data(f"eco/04_plugs_csv/{appliance}") for appliance in ['01', '02', '03', '04', '05', '06', '07', '08']}
plugs_data_05 = {appliance: load_data(f"eco/05_plugs_csv/{appliance}") for appliance in ['01', '02', '03', '04', '05', '06', '07', '08']}
plugs_data_06 = {appliance: load_data(f"eco/06_plugs_csv/{appliance}") for appliance in ['01', '02', '03', '04', '05', '06', '07']}

# Convert power consumption to kWh
for household_plugs_data in [plugs_data_04, plugs_data_05, plugs_data_06]:
    for appliance, df in household_plugs_data.items():
        df.columns = ['power', 'date']
        df['power'] = df['power'] / (3600 * 1000)  # W over a 1 s reading -> kWh per reading (assumes one reading per second)

In [14]:
# Combine the data from all three households and add a 'household' column to differentiate them
sm_data_04['household'] = 'Household 4'
sm_data_05['household'] = 'Household 5'
sm_data_06['household'] = 'Household 6'
combined_sm_data = pd.concat([sm_data_04, sm_data_05, sm_data_06])

# Do the same for the plug data, also tagging each row with its appliance (plug) ID
combined_plugs_data = pd.concat([
    pd.concat([df.assign(appliance=appliance) for appliance, df in plugs_data_04.items()]).assign(household='Household 4'),
    pd.concat([df.assign(appliance=appliance) for appliance, df in plugs_data_05.items()]).assign(household='Household 5'),
    pd.concat([df.assign(appliance=appliance) for appliance, df in plugs_data_06.items()]).assign(household='Household 6')
])

## Results and Analysis

### Rationale for design

+ Visual Encodings: We chose line charts to represent overall power consumption over time for each household because they are an intuitive and effective way to show trends and patterns in time-series data. For the second visualization, we could use a stacked bar chart or a heatmap to display the distribution of power consumption among appliances. These chart types clearly show the proportion of power consumed by each appliance and make it easy to compare households. Alternatives we considered include scatter plots and area charts. Scatter plots could reveal more detail about individual data points, but they are less effective at communicating trends. Area charts can also show trends over time, but they are harder to interpret when comparing multiple datasets.
+ Interaction: For interactivity, we plan to use dropdown menus or sliders to let users switch between households or select a specific time range for analysis. This enables users to compare power consumption patterns between households or zoom in on periods of interest. An alternative is to provide clickable legends or other on-chart elements for interacting with the data. However, dropdown menus and sliders give users a more consistent and accessible way to control what the visualization displays.
+ Animation: Animation can help show changes in data over time or highlight specific patterns. For example, we could use animation to show the progression of power consumption throughout a day or to highlight peak consumption periods. An alternative is to provide static visualizations that focus on specific time periods or patterns. However, animation can make the data more engaging and help users better understand the trends.
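
As a rough illustration of the animation idea, the sketch below builds an hourly profile per day and passes it to Plotly Express with `animation_frame`, so each frame shows one day. It runs on synthetic data; the column names `timestamp`, `power`, `day` and `hour` and both helper functions are assumptions for illustration, not the notebook's actual code.

```python
import numpy as np
import pandas as pd


def hourly_profile(df):
    """Mean power per (day, hour) from a frame with 'timestamp' and 'power' columns."""
    out = df.assign(
        day=df["timestamp"].dt.strftime("%Y-%m-%d"),
        hour=df["timestamp"].dt.hour,
    )
    return out.groupby(["day", "hour"], as_index=False)["power"].mean()


def animated_day_figure(profile):
    """Bar chart with one animation frame per day.

    plotly is imported here so the aggregation above can be used without it.
    """
    import plotly.express as px

    return px.bar(
        profile,
        x="hour",
        y="power",
        animation_frame="day",
        range_y=[0, profile["power"].max()],  # fixed axis so frames are comparable
    )


# Synthetic data: two days of minute-level readings
rng = np.random.default_rng(0)
n = 2 * 24 * 60
stamps = pd.Timestamp("2012-06-27") + pd.to_timedelta(np.arange(n) * 60, unit="s")
toy = pd.DataFrame({"timestamp": stamps, "power": rng.uniform(50, 500, n)})
profile = hourly_profile(toy)
print(profile.head())
# To view the animation: animated_day_figure(profile).show()
```

Aggregating to hourly means before animating keeps each frame small, which also helps with figure size.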

### Question 1: Overall power consumption over time (Plotly)

In [6]:
# Create a subplot with a range slider
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(go.Scatter(x=[0, 0], y=[0, 0], name='placeholder'), secondary_y=False)

# Update layout with dropdown menu and range slider
fig.update_layout(
    updatemenus=[
        dict(
            buttons=list([
                dict(
                    args=[{"y": [combined_sm_data.loc[combined_sm_data['household'] == household, 'powerallphases'].values],
                           "x": [combined_sm_data.loc[combined_sm_data['household'] == household, 'date'].values],
                           "name": [household]},
                          {"title": f"Overall Power Consumption: {household}"}],
                    label=household,
                    method="update"
                )
                for household in ['Household 4', 'Household 5', 'Household 6']
            ]),
            direction="down",
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.1,
            xanchor="left",
            y=1.1,
            yanchor="top"
        ),
    ],
    xaxis=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=1, label="1d", step="day", stepmode="backward"),
                dict(count=7, label="1w", step="day", stepmode="backward"),
                dict(count=1, label="1m", step="month", stepmode="backward"),
                dict(count=6, label="6m", step="month", stepmode="backward"),
                dict(count=1, label="YTD", step="year", stepmode="todate"),
                dict(count=1, label="1y", step="year", stepmode="backward"),
                dict(step="all")
            ])
        ),
        rangeslider=dict(visible=True),
        type="date"
    )
)

fig.show()

The Plotly figure is too large to display in the notebook, and the generated HTML file is too large to upload to Canvas (over 300 MB) even when I use only one week of data. Therefore, I have pasted a screenshot of the plot here. In the interactive plot, you can choose the household and the time range you want to see. The plot shows the overall power consumption of each household over time. The code runs without errors, so you can run it to see the plot if you want.
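
One common way to shrink a figure like this is to aggregate the readings into coarser time bins before plotting. The sketch below does this with `resample` on synthetic data. It assumes a per-reading `timestamp` column and an `energy_kwh` column, neither of which exists under those names in the notebook, and the 15-minute bin width is an arbitrary choice.

```python
import numpy as np
import pandas as pd


def downsample(df, rule="15min"):
    """Sum per-reading energy into time bins of width `rule`.

    Expects columns 'timestamp' (datetime) and 'energy_kwh' (energy per reading).
    """
    return (
        df.set_index("timestamp")["energy_kwh"]
        .resample(rule)
        .sum()
        .reset_index()
    )


# Synthetic day of one-second readings: 86,400 rows in, 96 rows out
stamps = pd.Timestamp("2012-06-27") + pd.to_timedelta(np.arange(86_400), unit="s")
toy = pd.DataFrame({"timestamp": stamps, "energy_kwh": np.full(86_400, 1e-4)})
small = downsample(toy)
print(len(toy), "->", len(small))
```

Summing is used because each value is already an energy amount; for raw power readings in watts, a mean per bin would be the more natural aggregate.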

The interactive plot shows clear periodicity for each household. At the daily scale, consumption peaks at 00:00 every day. At the weekly scale, the same peak recurs once per day, and consumption is stable outside the peaks. Note that `load_data` gives every reading the date of its file with no time of day, so all of a day's readings are plotted at 00:00; the midnight peak may therefore reflect this rather than actual usage at that hour.

### Question 2: Appliance-level power consumption (Altair)


In [None]:
import altair as alt

# Disable Altair's default row limit (5,000 rows)
alt.data_transformers.disable_max_rows()

# Calculate the total power consumption of each appliance for each household
total_appliance_power = combined_plugs_data.groupby(['household', 'appliance']).agg({'power': 'sum'}).reset_index()

# Calculate the total power consumption for each household
total_household_power = combined_sm_data.groupby(['household']).agg({'powerallphases': 'sum'}).reset_index()

# Combine the two dataframes
total_power = total_appliance_power.merge(total_household_power, on='household')

# Create a new column for the percentage contribution of each appliance
total_power['percentage'] = total_power['power'] / total_power['powerallphases'] * 100

# Create a selection for the household dropdown
household_selection = alt.selection_single(
    fields=['household'],
    bind=alt.binding_select(options=['Household 4', 'Household 5', 'Household 6']),
    name='Select',
    init={'household': 'Household 4'}
)

# Create the bar chart
bars = alt.Chart(total_power).mark_bar().encode(
    x=alt.X('appliance:N', title='Appliance'),
    y=alt.Y('percentage:Q', title='Percentage of Total Power Consumption'),
    color=alt.Color('appliance:N', legend=alt.Legend(title='Appliance')),
    tooltip=['appliance', 'power', 'percentage']
).add_selection(
    household_selection
).transform_filter(
    household_selection
)

bars

At the appliance level, plug 05 (Freezer) dominates in household 4, at close to 30% of total consumption, while every other appliance is below 5%. In household 5, plug 05 (Fridge) consumes the most, followed by plug 07 (PC), at about 8% and 5% respectively. Household 6 is more balanced; its largest consumer is plug 05 (Entertainment).
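
Plug IDs such as `05` mean different things in different households, so readable labels help when reading the chart. The sketch below attaches labels with a lookup keyed on household and plug ID. It lists only the names mentioned in the analysis above; the remaining plugs are left unlabeled rather than guessed, and the `label` column and `PLUG_LABELS` table are illustrative additions.

```python
import pandas as pd

# Only the plug names stated in the analysis; others are intentionally omitted
PLUG_LABELS = {
    ("Household 4", "05"): "Freezer",
    ("Household 5", "05"): "Fridge",
    ("Household 5", "07"): "PC",
    ("Household 6", "05"): "Entertainment",
}


def add_labels(df):
    """Add a 'label' column; unknown plugs fall back to 'Plug <id>'."""
    keys = zip(df["household"], df["appliance"])
    labels = [PLUG_LABELS.get(key, f"Plug {key[1]}") for key in keys]
    return df.assign(label=labels)


toy = pd.DataFrame({
    "household": ["Household 4", "Household 4", "Household 5"],
    "appliance": ["05", "01", "07"],
})
print(add_labels(toy))
```

A `label` column like this could then replace `appliance` in the chart's tooltip or x-axis encoding.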

