# Chicago Energy Usage 2010 Analysis
This project analyzes the Chicago Energy Usage 2010 dataset available on Kaggle (https://www.kaggle.com/chicago/chicago-energy-usage-2010). The goal is to surface interesting patterns in electricity and gas consumption in Chicago during 2010. The analysis examines selected features of the dataset, states some assumptions about them, and checks those assumptions with summary statistics and visualizations.



## Downloading the Dataset

This section describes the steps needed to download the dataset into the notebook.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Let's begin by downloading the data, and listing the files within the dataset.

## Data Preparation and Cleaning

This step covers loading the dataset, inspecting its overall characteristics, and identifying missing or invalid values.



We will load the dataset, create a copy of the dataframe, and take a first look at it.




In [None]:
energy_usage_df = pd.read_csv("../input/chicago-energy-usage-2010/energy-usage-2010.csv")

In [None]:
energy_usage_df_copy = energy_usage_df.copy()
#convert_census_block_int = {"CENSUS BLOCK" : int}
#convert_census_block_str = {"CENSUS BLOCK" : str}
#energy_usage_df_copy = energy_usage_df_copy.astype(convert_census_block_int)
#energy_usage_df_copy = energy_usage_df_copy.astype(convert_census_block_str)
energy_usage_df_copy 

Let's look at the dataset before it is cleaned.

In [None]:
energy_usage_df_copy.describe()

In [None]:
energy_usage_df_copy



Next, we look at missing values: which columns have them, and how many rows are affected per column.

Many columns have a very high variance, so imputing missing values with the mean would not make sense; we will drop missing values from the dataframe instead. The electricity and gas consumption columns may also contain outliers, but it is difficult to tell whether a given value is one, since the dataset has no information on the electrical appliances in use.

In [None]:
nan_count = energy_usage_df_copy.isnull().sum()
nan_count.head(60)

In [None]:
nan_count.tail(13)
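Reading raw counts with `head` and `tail` works, but a sorted summary with percentages can make the worst columns easier to spot. The following is a minimal sketch on a toy frame; `missing_summary` is a hypothetical helper, and the toy values are made up and only stand in for `energy_usage_df_copy`.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return per-column missing counts and percentages, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy data for illustration only (not values from the dataset).
toy = pd.DataFrame({
    "KWH JANUARY 2010": [100.0, np.nan, 300.0, 400.0],
    "THERM JANUARY 2010": [np.nan, np.nan, 30.0, 40.0],
    "TOTAL POPULATION": [10, 20, 30, 40],
})
summary = missing_summary(toy)
print(summary)
```

In the notebook, the same helper could be called as `missing_summary(energy_usage_df_copy)`.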

The aim of this work is not to perform any prediction (classification or regression), but we will still drop columns that contain a lot of missing data. Some of them could be recalculated from other columns if needed.
We will drop the following columns in their entirety: "KWH STANDARD DEVIATION 2010", "THERMS SQFT STANDARD DEVIATION 2010", "THERM STANDARD DEVIATION 2010", and "KWH SQFT STANDARD DEVIATION 2010".

In [None]:
energy_usage_df_copy.drop(columns=["KWH STANDARD DEVIATION 2010", "THERMS SQFT STANDARD DEVIATION 2010", "THERM STANDARD DEVIATION 2010", "KWH SQFT STANDARD DEVIATION 2010"], inplace=True)
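By default, `DataFrame.drop` raises a `KeyError` when a requested label is not found, so a column name that differs by a single character or a trailing space stops the cell. A small check before dropping can report such names. This is a sketch only: `split_known_columns` is a hypothetical helper and the toy frame is empty apart from its column labels.

```python
import pandas as pd

def split_known_columns(df, wanted):
    """Separate requested column names into those present in df and those absent."""
    present = [c for c in wanted if c in df.columns]
    absent = [c for c in wanted if c not in df.columns]
    return present, absent

# Toy frame with column labels only, for illustration.
toy = pd.DataFrame(columns=[
    "KWH STANDARD DEVIATION 2010",
    "THERM STANDARD DEVIATION 2010",
    "TOTAL KWH",
])

# The second name has a stray "s" and a trailing space, so it will not match.
wanted = ["KWH STANDARD DEVIATION 2010", "THERMs STANDARD DEVIATION 2010 "]
present, absent = split_known_columns(toy, wanted)

toy = toy.drop(columns=present)
print("dropped:", present)
print("not found:", absent)
```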

In [None]:
nan_count = energy_usage_df_copy.isnull().sum()
nan_count.head(60)

In [None]:
nan_count.tail(9)

We then drop rows that have missing values in any of the following columns.

In [None]:
energy_usage_df_copy.dropna(subset=["CENSUS BLOCK", "THERM FEBRUARY 2010","KWH JANUARY 2010","KWH FEBRUARY 2010","KWH MARCH 2010",
"KWH APRIL 2010","KWH MAY 2010","KWH JUNE 2010","KWH JULY 2010","KWH AUGUST 2010","KWH SEPTEMBER 2010","KWH OCTOBER 2010","KWH NOVEMBER 2010",
"KWH DECEMBER 2010","TOTAL KWH","THERM JANUARY 2010","THERM MARCH 2010","TERM APRIL 2010","THERM MAY 2010","THERM JUNE 2010","THERM JULY 2010",
"THERM AUGUST 2010","THERM SEPTEMBER 2010","THERM OCTOBER 2010","THERM NOVEMBER 2010","THERM DECEMBER 2010", "KWH TOTAL SQFT",
"THERMS SQFT MEAN 2010","ELECTRICITY ACCOUNTS", "TOTAL POPULATION", "RENTER-OCCUPIED HOUSING PERCENTAGE"], inplace=True)


In [None]:
nan_count_1 = energy_usage_df_copy.isnull().sum()
nan_count_1.head(60)

In [None]:
nan_count_1.tail(9)

Let's see how much data is left after removing the rows with missing values.

In [None]:
all_count = energy_usage_df_copy.count()
all_count.head(60)

In [None]:
all_count.tail(9)

We argue that relatively good inferences are still possible, since plenty of data remains. The approach used above has two steps: first drop entire columns that have close to 10000 or more missing values, then drop rows with missing values in the remaining, less sparse columns. This is better than dropping every row that has any missing value, because the latter would also discard many rows whose other columns are complete.
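The difference between the two strategies can be seen on a small example. The sketch below uses made-up data with hypothetical column names (`sparse_col`, `dense_a`, `dense_b`); it only illustrates the argument and says nothing about the actual row counts in the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: one sparse column and two mostly complete ones.
toy = pd.DataFrame({
    "sparse_col": [np.nan, np.nan, np.nan, 4.0, np.nan, np.nan],
    "dense_a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "dense_b": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Strategy 1: drop every row that has any missing value.
rows_any = len(toy.dropna())

# Strategy 2: drop the sparse column first, then drop rows with gaps
# in the remaining columns.
rows_two_step = len(
    toy.drop(columns=["sparse_col"]).dropna(subset=["dense_a", "dense_b"])
)

print(f"drop rows with any NaN: {rows_any} rows left")
print(f"drop sparse column, then rows: {rows_two_step} rows left")
```

Here the two-step approach keeps five of the six rows, while dropping on any missing value keeps only one.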

Now that the missing values are handled, we remove some other columns that this analysis does not use: ELECTRICITY ACCOUNTS, GAS ACCOUNTS, and ZERO KWH ACCOUNTS.


In [None]:
energy_usage_df_copy.drop(columns=["ELECTRICITY ACCOUNTS", "GAS ACCOUNTS", "ZERO KWH ACCOUNTS"], inplace=True)

Next, we remove obvious outliers. We only remove upper outliers, since a building may legitimately consume little energy despite housing more people or being larger. The upper limit is set at a high percentile of each column (the call below uses the 0.99 quantile).

The CENSUS BLOCK, BUILDING TYPE, and BUILDING_SUBTYPE columns are ignored since they hold codes or labels rather than real numerical values.

In [None]:
energy_usage_df_copy_c = energy_usage_df_copy.copy()
#energy_usage_df_copy_c = energy_usage_df_copy_c.drop

#energy_usage_df_copy_c = energy_usage_df_copy_c[energy_usage_df_copy_c.columns[4:63]]
energy_usage_df_copy_c
#energy_usage_df_copy_c.dtypes.head(50)

In [None]:
def remove_outliers_IQR(df, out_cols, T=1.5):
    """Drop rows above Q3 + T * IQR for each column in out_cols (upper outliers only)."""
    new_df = df.copy()

    for c in out_cols:
        # Quartiles are recomputed on the rows that survived the previous columns
        q1 = new_df[c].quantile(.25)
        q3 = new_df[c].quantile(.75)
        col_iqr = q3 - q1
        col_max = q3 + T * col_iqr
        # Keep only rows at or below the upper fence (no lower fence is applied)
        new_df = new_df[new_df[c] <= col_max]

    return new_df

In [None]:
def remove_outliers_percentile(df, out_cols, percentile_val):
    """Drop rows above the given quantile (e.g. 0.99) for each column in out_cols."""
    new_df = df.copy()

    for c in out_cols:
        # The cutoff is recomputed on the rows that survived the previous columns
        col_max = new_df[c].quantile(percentile_val)
        # Keep only rows at or below the cutoff (upper outliers only)
        new_df = new_df[new_df[c] <= col_max]

    return new_df

In [None]:
#energy_usage_df_copy_c = remove_outliers_IQR(energy_usage_df_copy_c, energy_usage_df_copy_c.columns[4:63],  1.5)
energy_usage_df_copy_c = remove_outliers_percentile(energy_usage_df_copy_c, energy_usage_df_copy_c.columns[4:63], .99)
energy_usage_df_copy_c

In [None]:
energy_usage_df_copy_c.describe()
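One consequence of filtering column by column is worth checking: the cutoff is recomputed on the rows that remain after each column, so every column can trim up to about 1% of the remaining rows again. If the slice `columns[4:63]` covers 59 numeric columns with few tied values (an assumption, since ties at the cutoff are kept), the share of rows retained could be as low as roughly 0.99 to the power 59. The sketch below reproduces the effect on random toy data; `sequential_upper_trim` is a hypothetical stand-in for the helper above.

```python
import numpy as np
import pandas as pd

def sequential_upper_trim(df, cols, q):
    """Drop rows above the q-th quantile of each column, one column at a time."""
    out = df
    for c in cols:
        out = out[out[c] <= out[c].quantile(q)]
    return out

# Random toy data: 10 independent columns of continuous values.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((10_000, 10)),
                   columns=[f"col_{i}" for i in range(10)])

trimmed = sequential_upper_trim(toy, toy.columns, 0.99)
retained_share = len(trimmed) / len(toy)
expected_share = 0.99 ** 10

# Same arithmetic for 59 columns, as an approximate lower bound on retention.
share_59 = 0.99 ** 59

print(f"retained after 10 columns: {retained_share:.4f} (0.99**10 = {expected_share:.4f})")
print(f"0.99**59 = {share_59:.4f}")
```

Comparing `len(energy_usage_df_copy)` with `len(energy_usage_df_copy_c)` would show how much data the real call removed.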

Now we look at the columns that are non-numerical in nature: COMMUNITY AREA NAME, BUILDING TYPE, and BUILDING_SUBTYPE. Let's see which values they take.

In [None]:
energy_usage_df_copy_c["COMMUNITY AREA NAME"].value_counts()

In [None]:
energy_usage_df_copy_c["BUILDING TYPE"].value_counts()

In [None]:
energy_usage_df_copy_c["BUILDING_SUBTYPE"].value_counts()

It seems that no row carries more than one category value in these columns, so we don't need to clean them.

## Exploratory Analysis and Visualization

Next, we will try to understand what the demographics of the customer accounts look like.



Let's begin by importing `matplotlib.pyplot` and `seaborn`.

In [None]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

**COMMUNITY AREA** 

Let's investigate the distribution of consumers according to their community area

In [None]:
energy_usage_df_copy_c["COMMUNITY AREA NAME"].nunique()

There are 75 areas where our customers are located.
Let's see in which community areas most of our customers are located.

In [None]:
top_30_areas = energy_usage_df_copy_c["COMMUNITY AREA NAME"].value_counts().head(30)

Let's see the distribution of customers across the 30 areas with the most customers.

In [None]:
plt.figure(figsize=(20,8))
plt.xticks(rotation=75)
#plt.title = "Area where customers are located"
ax = sns.barplot(x=top_30_areas.index, y=top_30_areas)
ax.set_title("Area where customers are located")
ax.set_xlabel("Community Area Name")
ax.set_ylabel("Number of Customer Accounts")

To get a better idea of where these areas sit on the map, we can use GeoPy, a Python geocoding client that converts a physical address into latitude and longitude. We will also import Folium so we can display a map.

First, we create a new dataframe that will hold each area name with its latitude and longitude.

In [None]:
combined_area_name_loc_lat_lon = pd.DataFrame(energy_usage_df_copy_c["COMMUNITY AREA NAME"].value_counts().head(30))
combined_area_name_loc_lat_lon

In [None]:

combined_area_name_loc_lat_lon = combined_area_name_loc_lat_lon.reset_index()

# Assign names by position: the labels that reset_index() produces differ between pandas versions
combined_area_name_loc_lat_lon.columns = ["COMMUNITY AREA NAME", "Number of Accounts"]
combined_area_name_loc_lat_lon

In [None]:
import folium

In [None]:
import geopy as geo
from geopy.geocoders import Nominatim

In [None]:
geolocator = Nominatim(user_agent="chicago-energy-usage-2010-data-analysis")

In [None]:
#top_30_areas.index.to_list()
loc_lat = list()
loc_lon = list()
for loc in top_30_areas.index.to_list():
  location = geolocator.geocode(loc+", Chicago")
  loc_lat.append(location.latitude)
  loc_lon.append(location.longitude)
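
A geocoder can fail to resolve a query, in which case `geocode` returns `None` and the attribute access in the loop above raises an error. The sketch below shows one way to guard against that. It takes the geocoding function as a parameter so it can be tried without network access; `geocode_areas`, `FakeLocation`, and `fake_geocode` are hypothetical names, and the coordinates are made up.

```python
import math

def geocode_areas(area_names, geocode, suffix=", Chicago"):
    """Return (latitudes, longitudes) for area_names using the supplied geocode callable.

    Areas that cannot be resolved get NaN coordinates instead of raising.
    """
    lats, lons = [], []
    for name in area_names:
        location = geocode(name + suffix)
        if location is None:
            lats.append(math.nan)
            lons.append(math.nan)
        else:
            lats.append(location.latitude)
            lons.append(location.longitude)
    return lats, lons

# Stand-in geocoder for illustration only; the coordinates are invented.
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude

def fake_geocode(query):
    known = {"Austin, Chicago": FakeLocation(41.9, -87.8)}
    return known.get(query)

lats, lons = geocode_areas(["Austin", "Nowhere"], fake_geocode)
print(lats, lons)
```

In the notebook, `geolocator.geocode` could be passed in place of `fake_geocode`, optionally wrapped in `geopy.extra.rate_limiter.RateLimiter(geolocator.geocode, min_delay_seconds=1)` to space out the requests. Rows with NaN coordinates would need to be dropped before they are placed on a map.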


Let's create a dataframe of locations with their latitude and longitude

In [None]:
combined_area_name_loc_lat_lon["Latitude"] = loc_lat
combined_area_name_loc_lat_lon["Longitude"] = loc_lon
combined_area_name_loc_lat_lon

Let's show the map of Chicago

In [None]:
chicago_loc = geolocator.geocode("Chicago")

In [None]:
chicago_map = folium.Map(location=[chicago_loc.latitude, chicago_loc.longitude], zoom_start=12)
chicago_map

Now let's add circle markers to pinpoint the community areas on the map. Each marker has a popup showing the number of customer accounts, which appears when the marker is clicked.

In [None]:
location_points = folium.map.FeatureGroup()

for lat, lon, area, label in zip(combined_area_name_loc_lat_lon.Latitude, combined_area_name_loc_lat_lon.Longitude, combined_area_name_loc_lat_lon["COMMUNITY AREA NAME"], combined_area_name_loc_lat_lon["Number of Accounts"]):
  location_points.add_child(
      folium.CircleMarker(
          [lat,lon],
          radius=5,
          color="yellow",
          fill=True,
          fill_color='blue',
          #popup=label,
          fill_opacity=0.6,
          popup = area+",\n"+str(label)+" accounts"
      )
  )

chicago_map.add_child(location_points)


As expected, the areas where most account owners are located appear to lie just outside downtown Chicago.

**BUILDING TYPE** 

This section looks at the building types. There are three categories in the data: Residential, Commercial, and Industrial.

In [None]:
acc_build_types_counts = energy_usage_df_copy_c["BUILDING TYPE"].value_counts()
acc_build_types_counts

Let's visualize the distribution in two chart types: a bar chart and a pie chart.

In [None]:
plt.figure(figsize=(8,5))
plt.xticks(rotation=75)
#plt.title = "Distribution of type of buildings"
ax = sns.barplot(x=acc_build_types_counts.index, y=acc_build_types_counts)
ax.set_title("Distribution of type of buildings")
ax.set_xlabel("Building Type")
ax.set_ylabel("Number of Customer Accounts")

In [None]:
# create pie chart, percentage label shows two digit decimals

plt.figure(figsize=(10,8))

plt.pie(acc_build_types_counts, labels=acc_build_types_counts.index, autopct='%1.2f%%', startangle=180)
plt.title("Ratio of type of buildings")
plt.show()

Most account owners clearly fall under the Residential type. Industrial customers are far fewer than both residential and commercial ones, so their bar is not visible in the bar chart. In the pie chart the slice can be identified through its label, but without the label it would not be visible either.

**BUILDING SUB-TYPE**

Here, we look at the distribution of customers by building sub-type, which defines more categories than the building type. Again we use two charts, a bar chart and a pie chart.

In [None]:
acc_subtype_count = energy_usage_df_copy_c['BUILDING_SUBTYPE'].value_counts()
acc_subtype_count_cp = acc_subtype_count.copy()
acc_list_ind = acc_subtype_count_cp.index.to_list()
acc_list_val = acc_subtype_count_cp.to_list()


In [None]:
plt.figure(figsize=(12,6))
#plt.title("Distribution of Customer according to Building sub-type")
plt.xticks (rotation=75)
ax = sns.barplot(x=acc_subtype_count.index, y=acc_subtype_count)
ax.set_title("Distribution of Customer according to Building sub-type")
ax.set_xlabel("Building Sub-Type")
ax.set_ylabel("Number of Buildings")

Again, the bar chart does not show Municipal and Industrial, because their counts are far smaller than those of the other categories. We therefore look at them in a pie chart.

In [None]:
plt.figure(figsize=(12,8))
plt.title("Ratio between Building Subtype Customers")

label_list = acc_subtype_count_cp.index.to_list()
subtype_total = acc_subtype_count_cp.sum()  # derive the total instead of hard-coding it

for i in range(len(label_list)):
  label_list[i] = label_list[i]+" --- "+"{:.2f}".format((acc_subtype_count_cp.iloc[i]/subtype_total)*100)+" %"


pie_ch = plt.pie(acc_subtype_count_cp, explode=(0.05,0.1,0.15,0.15,0.05,0.05), startangle=180, pctdistance=1.2)
plt.legend(pie_ch[0], labels=label_list, loc="best", bbox_to_anchor = (1,1))


From the charts, most customers fall under the "Single Family" and "Multi < 7" categories. These are presumably landed properties and low-rise multi-unit buildings rather than high-rise buildings.

## Asking and Answering Questions

In this section, we will critically review the data and try to identify certain interesting characteristics of the data.



#### Q1: Which months see the most energy (electric) consumption?

To answer this question, we plot the total electricity consumption for each month of the year.

In [None]:
electric_cons_months = energy_usage_df_copy_c[["KWH JANUARY 2010","KWH FEBRUARY 2010","KWH MARCH 2010",
"KWH APRIL 2010","KWH MAY 2010","KWH JUNE 2010","KWH JULY 2010","KWH AUGUST 2010","KWH SEPTEMBER 2010","KWH OCTOBER 2010","KWH NOVEMBER 2010",
"KWH DECEMBER 2010"]]

tot_elec_cons = list()

for i in range(12):
  tot_elec_cons.append(electric_cons_months[electric_cons_months.columns[i]].sum())

months = ["JANUARY", "FEBRUARY", "MARCH",
"APRIL","MAY","JUNE","JULY","AUGUST","SEPTEMBER","OCTOBER","NOVEMBER",
"DECEMBER"]

tot_elec_cons_series = pd.Series(
    data=tot_elec_cons,
    index=months
)

tot_elec_cons_series
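
The explicit loop over twelve columns can also be written as a single column-wise sum. The sketch below shows the idea on a made-up three-month sample that follows the dataset's column naming pattern; the values are invented.

```python
import pandas as pd

# Toy sample for illustration only.
toy = pd.DataFrame({
    "KWH JANUARY 2010": [100, 200],
    "KWH FEBRUARY 2010": [150, 250],
    "KWH MARCH 2010": [120, 80],
})

months = ["JANUARY", "FEBRUARY", "MARCH"]
# Build the column names from the month list, so that other "KWH ..." columns
# (means, totals, and so on) are not picked up by accident.
kwh_cols = [f"KWH {m} 2010" for m in months]

# DataFrame.sum() adds up each column, giving one total per month.
monthly_totals = toy[kwh_cols].sum()
monthly_totals.index = months
print(monthly_totals)
```

The same pattern would need one exception for gas, since the April column is spelled "TERM APRIL 2010" in the dataset.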

In [None]:
plt.figure(figsize=(10,8))
plt.xticks(rotation=75)

#plt.title = "Total electricity consumption each month during year 2010"
ax = sns.barplot(x=tot_elec_cons_series.index, y=tot_elec_cons_series/1000)
ax.set(xlabel="Month", ylabel = "Total Electricity Consumption (MWH)", title="Total electricity consumption each month during year 2010")

The chart shows that electricity consumption peaks in July, and that consumption in the months just before and after July is also rather high. December is comparably high as well. Although this hypothesis still needs to be verified, the pattern could stem from air conditioning in summer and from the holiday season. Since a high proportion of the customers are residential, occupants would spend more time at home during holidays and would therefore be more likely to consume electricity there.

#### Q2: Which months see the most thermal (gas) consumption?

To answer this question, we plot the total gas consumption for each month of the year.

In [None]:
gas_cons_months = energy_usage_df_copy_c[["THERM JANUARY 2010","THERM FEBRUARY 2010","THERM MARCH 2010","TERM APRIL 2010","THERM MAY 2010","THERM JUNE 2010","THERM JULY 2010",
"THERM AUGUST 2010","THERM SEPTEMBER 2010","THERM OCTOBER 2010","THERM NOVEMBER 2010","THERM DECEMBER 2010"]]

tot_gas_cons = list()

for i in range(12):
  tot_gas_cons.append(gas_cons_months[gas_cons_months.columns[i]].sum())


tot_gas_cons_series = pd.Series(
    data=tot_gas_cons,
    index=months
)

tot_gas_cons_series

In [None]:
plt.figure(figsize=(10,8))
plt.xticks(rotation=75)

#plt.title = "Total thermal (gas) consumption each month during year 2010"
ax = sns.barplot(x=tot_gas_cons_series.index, y=tot_gas_cons_series/1000)
ax.set(xlabel="Month", ylabel = "Total Gas Consumption (thousand therms)", title="Total thermal (gas) consumption each month during year 2010")

The chart shows that gas usage peaks in January and is rather high overall from December through March. This is probably strongly correlated with, and may even be caused by, heating during the winter season. The hypothesis is supported by the much lower gas usage during late spring, summer, and early fall.
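To put a number on the seasonal contrast, each month's share of the annual total can be computed and summed by season. The sketch below uses invented monthly totals in arbitrary units, not values from the dataset; in the notebook, `tot_gas_cons_series` could be used in place of `monthly`.

```python
import pandas as pd

# Made-up monthly totals for illustration only.
monthly = pd.Series(
    [90, 80, 60, 40, 20, 10, 10, 10, 15, 30, 55, 80],
    index=["JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE",
           "JULY", "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER"],
)

# Share of the annual total per month, in percent.
share = monthly / monthly.sum() * 100

winter_share = share[["DECEMBER", "JANUARY", "FEBRUARY", "MARCH"]].sum()
summer_share = share[["JUNE", "JULY", "AUGUST"]].sum()

print(f"December to March: {winter_share:.1f} % of the annual total")
print(f"June to August:    {summer_share:.1f} % of the annual total")
```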

#### Q3: Which community area has the highest mean of "average building age" among the customers?

This dataset also contains some interesting demographics regarding the buildings themselves. Here we attempt to look at the average building age of each location, and attempt to identify which area has high average building age.

In [None]:
building_age_comm = energy_usage_df_copy_c[["COMMUNITY AREA NAME", "AVERAGE BUILDING AGE"]]
building_age_mean = energy_usage_df_copy_c[["COMMUNITY AREA NAME", "AVERAGE BUILDING AGE"]].groupby(by="COMMUNITY AREA NAME").mean()
#building_age_std = energy_usage_df_copy[["COMMUNITY AREA NAME", "AVERAGE BUILDING AGE"]].groupby(by="COMMUNITY AREA NAME").std()


In [None]:
sorted_building_age_mean_10 = building_age_mean.sort_values(by="AVERAGE BUILDING AGE", ascending=False).head(10)
sorted_building_age_mean_10

In [None]:
plt.figure(figsize=(10,8))
plt.xticks(rotation=75)
#plt.title = "Top 10 Community Area with the highest mean of Average Building Age"
ax = sns.barplot(x=sorted_building_age_mean_10.index, y= sorted_building_age_mean_10["AVERAGE BUILDING AGE"])
ax.set_title("Top 10 Community Area with the highest mean of Average Building Age")

We identified the 10 areas with the highest mean of average building age. In the bar chart, Englewood appears to have the highest value. However, the other areas are not far behind, so we take a closer look at how the values are distributed. We create a new dataframe containing only the rows from these areas and show the building ages in a boxplot.

In [None]:

# Areas taken from the top-10 table above
top_10_age_areas = ["Lower West Side", "McKinley Park", "Englewood", "New City", "South Lawndale",
                    "Bridgeport", "West Englewood", "Edgewater", "Logan Square", "Avondale"]
building_age_comm_10 = building_age_comm[building_age_comm["COMMUNITY AREA NAME"].isin(top_10_age_areas)]
building_age_comm_10

In [None]:
plt.xticks(rotation=75)
ax = sns.boxplot(x=building_age_comm_10["COMMUNITY AREA NAME"], y=building_age_comm_10["AVERAGE BUILDING AGE"])
ax.set_title("Distribution of Average Building Age in the 10 Community Areas with the Highest Mean Building Age")

From the boxplot, the average building age is very widely spread within each area. We cannot confidently infer that Englewood has the oldest buildings, since the spread is rather large in some areas.
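A boxplot displays quartiles rather than the standard deviation, so it can help to compute both next to the mean. The sketch below does this on made-up data with two hypothetical areas, "A" and "B", that share the same mean but differ in spread.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "COMMUNITY AREA NAME": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "AVERAGE BUILDING AGE": [10.0, 50.0, 90.0, 130.0, 68.0, 70.0, 70.0, 72.0],
})

def iqr(s):
    """Interquartile range: the height of the box in a boxplot."""
    return s.quantile(0.75) - s.quantile(0.25)

spread = toy.groupby("COMMUNITY AREA NAME")["AVERAGE BUILDING AGE"].agg(
    ["mean", "median", "std", iqr]
)
print(spread)
```

Both areas have a mean of 70 here, yet the spread of "A" is far larger, which is the situation in which a ranking by mean alone is not very informative.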

#### Q4: What is the energy consumption profile of households in different area?

Here we investigate the electricity consumption of households in the 3 areas with the most entries in the data. Let's check again which areas these are.


In [None]:
energy_usage_df_copy_c["COMMUNITY AREA NAME"].value_counts().head(3)

##### A. Austin area

First we filter the dataframe so that it contains only rows from the Austin area, and then extract only the columns of interest.

In [None]:
# Build the mask from the dataframe being filtered
austin_energy_user = energy_usage_df_copy_c[energy_usage_df_copy_c["COMMUNITY AREA NAME"] == "Austin"]

# .copy() lets us add a column later without a SettingWithCopyWarning
austin_ener_housesize = austin_energy_user[["COMMUNITY AREA NAME","TOTAL KWH", "AVERAGE HOUSESIZE"]].copy()
austin_ener_housesize

We round the average household size to the nearest integer so that the buildings fall into a small number of groups, which gives a smoother chart.

In [None]:
austin_ener_housesize["AVERAGE HOUSESIZE NORMALIZED"] = np.round(austin_ener_housesize["AVERAGE HOUSESIZE"])
austin_ener_housesize
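
The grouping depends on how values are rounded. Two details are worth seeing on a few numbers: `np.round` rounds exact halves to the nearest even value, and dividing by a step before rounding gives bins of that step width. The sample sizes below are made up.

```python
import numpy as np

# Made-up average household sizes for illustration.
sizes = np.array([1.2, 2.5, 2.74, 2.75, 3.5])

# Nearest integer: note that 2.5 goes to 2.0 and 3.5 goes to 4.0 (round half to even).
nearest_int = np.round(sizes)

# Nearest 0.5: scale, round, then scale back.
nearest_half = np.round(sizes / 0.5) * 0.5

print(nearest_int)
print(nearest_half)
```

Bins of 0.5 keep more detail along the x axis at the cost of fewer rows per group.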

In [None]:
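# numeric_only=True skips the text column COMMUNITY AREA NAME, which cannot be averaged
avg_austin = austin_ener_housesize.groupby(by="AVERAGE HOUSESIZE NORMALIZED").mean(numeric_only=True)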


In [None]:
plt.figure(figsize=(20,8))
plt.title("Electricity Consumption Profile of Different Household in Austin Area, Chicago")


ax = sns.lineplot(x=avg_austin.index, y=avg_austin["TOTAL KWH"])
ax.set_xlabel("Average Household")
ax.set_ylabel("Average of Total Energy Consumption (KWH)")

In Austin, average consumption appears slightly higher for household sizes 3 - 4 than for household sizes 1, 2, and 5.

##### B. Belmont Cragin area

In [None]:
# Build the mask from the dataframe being filtered
belmont_energy_user = energy_usage_df_copy_c[energy_usage_df_copy_c["COMMUNITY AREA NAME"] == "Belmont Cragin"]

# .copy() lets us add a column later without a SettingWithCopyWarning
belmont_ener_housesize = belmont_energy_user[["COMMUNITY AREA NAME","TOTAL KWH", "AVERAGE HOUSESIZE"]].copy()
belmont_ener_housesize

In [None]:
belmont_ener_housesize["AVERAGE HOUSESIZE NORMALIZED"] = np.round(belmont_ener_housesize["AVERAGE HOUSESIZE"])


In [None]:
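# numeric_only=True skips the text column COMMUNITY AREA NAME, which cannot be averaged
avg_belmont = belmont_ener_housesize.groupby(by="AVERAGE HOUSESIZE NORMALIZED").mean(numeric_only=True)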


In [None]:
plt.figure(figsize=(12,8))
plt.title("Electricity Consumption Profile of Different Household in Belmont Cragin Area, Chicago")

ax = sns.lineplot(x=avg_belmont.index, y=avg_belmont["TOTAL KWH"])
ax.set_xlabel("Average Household")
ax.set_ylabel("Average of Total Energy Consumption (KWH)")

In Belmont Cragin, consumption appears higher for household sizes 3 - 5 than for household sizes 1 - 2.

##### C. West Town area


In [None]:
# Build the mask from the dataframe being filtered
west_town_energy_user = energy_usage_df_copy_c[energy_usage_df_copy_c["COMMUNITY AREA NAME"] == "West Town"]

# .copy() lets us add a column later without a SettingWithCopyWarning
west_town_ener_housesize = west_town_energy_user[["COMMUNITY AREA NAME","TOTAL KWH", "AVERAGE HOUSESIZE"]].copy()
west_town_ener_housesize

In [None]:
west_town_ener_housesize["AVERAGE HOUSESIZE NORMALIZED"] = np.round(west_town_ener_housesize["AVERAGE HOUSESIZE"])


In [None]:
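# numeric_only=True skips the text column COMMUNITY AREA NAME, which cannot be averaged
avg_west_town = west_town_ener_housesize.groupby(by="AVERAGE HOUSESIZE NORMALIZED").mean(numeric_only=True)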

In [None]:
plt.figure(figsize=(12,8))
plt.title("Electricity Consumption Profile of Different Household in West Town Area, Chicago")

ax = sns.lineplot(x=avg_west_town.index, y=avg_west_town["TOTAL KWH"])
ax.set_xlabel("Average Household")
ax.set_ylabel("Average of Total Energy Consumption (KWH)")

In West Town, buildings with a household size of 4 appear to consume the most. Buildings with household sizes of 2 and 3 have relatively similar consumption.

So the three areas considered appear to have quite different energy consumption profiles, which is interesting given that they are all within Chicago.
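The three area sections repeat the same filter, round, and group steps. One possible way to avoid the repetition is a small helper, sketched below on made-up rows; `kwh_by_household_size` and the column name `HOUSESIZE BIN` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def kwh_by_household_size(df, area, step=1.0):
    """Mean TOTAL KWH per rounded AVERAGE HOUSESIZE bin for one community area."""
    subset = df.loc[
        df["COMMUNITY AREA NAME"] == area, ["TOTAL KWH", "AVERAGE HOUSESIZE"]
    ].copy()
    # Round household size to the nearest multiple of `step`.
    subset["HOUSESIZE BIN"] = np.round(subset["AVERAGE HOUSESIZE"] / step) * step
    return subset.groupby("HOUSESIZE BIN")["TOTAL KWH"].mean()

# Toy rows for illustration only.
toy = pd.DataFrame({
    "COMMUNITY AREA NAME": ["Austin", "Austin", "Austin", "West Town"],
    "TOTAL KWH": [1000.0, 3000.0, 5000.0, 7000.0],
    "AVERAGE HOUSESIZE": [1.9, 2.2, 3.1, 2.0],
})

austin_profile = kwh_by_household_size(toy, "Austin")
print(austin_profile)
```

Calling the helper once per area on `energy_usage_df_copy_c` would produce the three series plotted above, and plotting them on shared axes would make the areas easier to compare.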

#### Q5: What is the energy consumption profile among building subtypes on various household size?

For this, we need a separate dataframe for each building subtype. Let's first list the categories.

In [None]:
energy_usage_df_copy_c["BUILDING_SUBTYPE"].value_counts()

##### A. Single Family

In [None]:
energy_df_single_fam = energy_usage_df_copy_c[energy_usage_df_copy_c["BUILDING_SUBTYPE"] == "Single Family"]


In [None]:
energy_df_single_fam = energy_df_single_fam[["AVERAGE HOUSESIZE", "TOTAL KWH"]]

# Round the average household size to the nearest 0.5
energy_df_single_fam["AVERAGE HOUSESIZE NORMALIZED"] = np.round(energy_df_single_fam["AVERAGE HOUSESIZE"] / 0.5, 0) * 0.5

In [None]:
avg_energy_df_single_fam = energy_df_single_fam.groupby(by="AVERAGE HOUSESIZE NORMALIZED").mean()


In [None]:
plt.figure(figsize=(12,8))
plt.title("Energy Consumption of single family of various household size")

ax = sns.lineplot(x=avg_energy_df_single_fam.index, y=avg_energy_df_single_fam["TOTAL KWH"])
ax.set_xlabel("Average Household Size")
ax.set_ylabel("Average Total Energy Consumption (KWH)")

For single-family buildings, consumption increases with household size up to about 3. Beyond that, from 3 up to 5 occupants, little further increase is observed.

##### B. Multi < 7

In [None]:
energy_df_multi_small = energy_usage_df_copy_c[energy_usage_df_copy_c["BUILDING_SUBTYPE"] == "Multi < 7"]

In [None]:
energy_df_multi_small = energy_df_multi_small[["AVERAGE HOUSESIZE", "TOTAL KWH"]]

#energy_df_multi_small["AVERAGE HOUSESIZE NORMALIZED"] = np.round(energy_df_multi_small["AVERAGE HOUSESIZE"])
energy_df_multi_small["AVERAGE HOUSESIZE NORMALIZED"] = np.round(energy_df_multi_small["AVERAGE HOUSESIZE"] / 0.5, 0) * 0.5

In [None]:
energy_df_multi_small = energy_df_multi_small.groupby(by="AVERAGE HOUSESIZE NORMALIZED").mean()

In [None]:
plt.figure(figsize=(12,8))
plt.title("Energy Consumption of multi < 7, of various household size")

ax = sns.lineplot(x=energy_df_multi_small.index, y=energy_df_multi_small["TOTAL KWH"])
ax.set_xlabel("Average Household Size")
ax.set_ylabel("Average Total Energy Consumption (KWH)")

For Multi < 7, consumption appears highest for household sizes 2 - 3 and declines for household sizes of 4 and above.

##### C. Commercial

In [None]:
energy_df_commercial = energy_usage_df_copy_c[energy_usage_df_copy_c["BUILDING_SUBTYPE"] == "Commercial"]

In [None]:
energy_df_commercial = energy_df_commercial[["AVERAGE HOUSESIZE", "TOTAL KWH"]]

#energy_df_commercial["AVERAGE HOUSESIZE NORMALIZED"] = np.round(energy_df_commercial["AVERAGE HOUSESIZE"])
energy_df_commercial["AVERAGE HOUSESIZE NORMALIZED"] = np.round(energy_df_commercial["AVERAGE HOUSESIZE"] / 0.5, 0) * 0.5


In [None]:
energy_df_commercial = energy_df_commercial.groupby(by="AVERAGE HOUSESIZE NORMALIZED").mean()

In [None]:
plt.figure(figsize=(12,8))
plt.title("Energy Consumption of commercial, various household size")

ax = sns.lineplot(x=energy_df_commercial.index, y=energy_df_commercial["TOTAL KWH"])
ax.set_xlabel("Average Household Size")
ax.set_ylabel("Average Total Energy Consumption (KWH)")

In the commercial subtype, the highest consumption appears to come from buildings with few occupants (2 or fewer). Consumption is significantly lower for household sizes above 2.

##### D. Multi 7+

In [None]:
energy_df_multi_big = energy_usage_df_copy_c[energy_usage_df_copy_c["BUILDING_SUBTYPE"] == "Multi 7+"]

In [None]:
energy_df_multi_big = energy_df_multi_big[["AVERAGE HOUSESIZE", "TOTAL KWH"]]

energy_df_multi_big["AVERAGE HOUSESIZE NORMALIZED"] = np.round(energy_df_multi_big["AVERAGE HOUSESIZE"] / 0.5, 0) * 0.5
#energy_df_multi_big["AVERAGE HOUSESIZE NORMALIZED"] = np.round(energy_df_multi_big["AVERAGE HOUSESIZE"])

energy_df_multi_big = energy_df_multi_big.groupby(by="AVERAGE HOUSESIZE NORMALIZED").mean()

In [None]:
plt.figure(figsize=(12,8))
plt.title("Energy Consumption of Multi 7+, various household size")

ax = sns.lineplot(x=energy_df_multi_big.index, y=energy_df_multi_big["TOTAL KWH"])
ax.set_xlabel("Average Household Size")
ax.set_ylabel("Average Total Energy Consumption (KWH)")

In the Multi 7+ subtype, consumption increases with household size, at least up to household size 4.

##### E. Municipal

In [None]:
energy_df_municipal = energy_usage_df_copy_c[energy_usage_df_copy_c["BUILDING_SUBTYPE"] == "Municipal"]

In [None]:
energy_df_municipal = energy_df_municipal[["AVERAGE HOUSESIZE", "TOTAL KWH"]]

energy_df_municipal["AVERAGE HOUSESIZE NORMALIZED"] = np.round(energy_df_municipal["AVERAGE HOUSESIZE"])


energy_df_municipal = energy_df_municipal.groupby(by="AVERAGE HOUSESIZE NORMALIZED").mean()

In [None]:
plt.figure(figsize=(12,8))
plt.title("Energy Consumption of Municipal, various household size")

ax = sns.lineplot(x=energy_df_municipal.index, y=energy_df_municipal["TOTAL KWH"])
ax.set_xlabel("Average Household Size")
ax.set_ylabel("Average Total Energy Consumption (KWH)")

Unfortunately, the consumption of municipal and industrial buildings cannot be profiled as accurately as that of the previous subtypes, due to a lack of data (only 20 municipal entries remain after missing values are dropped and outliers are removed). There is some indication that household size 2 has the highest consumption, but more data is needed.
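When a subtype has few rows, a mean per household-size group may rest on one or two buildings. Reporting the group size next to the mean makes this visible. The sketch below uses made-up values; the column name `HOUSESIZE BIN` and the threshold `min_rows` are hypothetical choices.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "HOUSESIZE BIN": [1.0, 2.0, 2.0, 2.0, 3.0],
    "TOTAL KWH": [4000.0, 9000.0, 11000.0, 10000.0, 2000.0],
})

# Mean consumption and number of rows behind each mean.
profile = toy.groupby("HOUSESIZE BIN")["TOTAL KWH"].agg(["mean", "count"])

# Keep only the groups backed by at least `min_rows` rows.
min_rows = 3
reliable = profile[profile["count"] >= min_rows]

print(profile)
print(reliable)
```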

## Inferences and Conclusion

This analysis set out to find interesting information about the electricity and gas consumption of households in Chicago in 2010.
Over the course of the analysis, the following inferences were made:

- For single-family buildings, consumption increases with household size up to about 3. Above household size 3, consumption does not differ much.
- For the Multi < 7 subtype, consumption appears highest for household sizes 2 - 3 and stagnates or declines for household sizes of 4 and above.
- In the commercial subtype, the highest consumption appears to come from buildings with few occupants (2 or fewer). Consumption is significantly lower for household sizes above 2.
- In the Multi 7+ subtype, consumption increases with household size, at least up to household size 4.
- The consumption of municipal and industrial buildings cannot be profiled as accurately as that of the other subtypes, due to a lack of data (only 20 municipal entries remain after missing values are dropped and outliers are removed). There is some indication that household size 2 has the highest consumption, but more data is needed.
- The energy consumption profiles of buildings in different community areas seem to differ. This might be correlated with the demographics of the population there, but further analysis involving other demographic datasets (e.g., population datasets) is needed to verify this assumption.
- Many buildings in Chicago are old. The generated chart does not conclusively show which area has the oldest buildings, since building age is very widely spread within each area.
- Electricity consumption is higher during summer, perhaps due to air conditioning and to people staying at home more during holidays. Gas consumption, on the other hand, is highest during the winter months, likely due to the use of gas for heating.

## References and Future Work

In future work, other population datasets could be included to draw further inferences and to check whether electricity and gas consumption correlate with certain demographic properties. Some features of the dataset were left untouched, so other interesting questions remain for further analysis. For example: is there a correlation between building size or occupancy rate and energy consumption, or between building age and energy consumption?
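As a starting point for such questions, pairwise Pearson correlations can be computed with `DataFrame.corr`. The sketch below uses invented values under column names that appear in the dataset; it assumes "KWH TOTAL SQFT" can serve as a measure of building size, which would need to be checked against the dataset description.

```python
import pandas as pd

# Toy values for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "AVERAGE BUILDING AGE": [10.0, 30.0, 50.0, 70.0, 90.0],
    "KWH TOTAL SQFT": [1200.0, 1500.0, 900.0, 2000.0, 1100.0],
    "TOTAL KWH": [8000.0, 9500.0, 7000.0, 12500.0, 7600.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr(method="pearson")

age_vs_kwh = corr.loc["AVERAGE BUILDING AGE", "TOTAL KWH"]
sqft_vs_kwh = corr.loc["KWH TOTAL SQFT", "TOTAL KWH"]

print(corr.round(3))
```

A correlation coefficient only measures linear association, so a scatter plot of the same pairs would be a useful companion.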