# ASHRAE - Household energy prediction
## The goal of this project is to predict household energy consumption using the provided data files. 
### First, the given data must be overviewed, analyzed and compared, which will be done in this exploratory data analysis notebook.



### Here, packages are imported

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

#Importing plotting tool
import matplotlib.pyplot as plt

### It can be seen that the complete data source consists of 6 data files. The train.csv file is the one that will be used to train the machine learning algorithm, and it is the first file to be analyzed:

# File "train.csv"
#### The "train.csv" looks like this:

In [None]:
df_train = pd.read_csv("../input/ashrae-energy-prediction/train.csv")
df_train.head(5)


The first column, "building_id", contains the building identifiers. Each number represents one building, and there are presumably as many unique building IDs as in building_metadata.

The second column, "meter", contains 4 different numbers: 0, 1, 2 and 3. Each number corresponds to one source of energy consumption.

* 0 = electricity
* 1 = chilled water
* 2 = steam
* 3 = hot water
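
The meter codes above can be turned into readable labels before plotting. The following is a minimal sketch on a small made-up frame; `METER_NAMES` and `toy_train` are hypothetical names introduced only for illustration and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical lookup table built from the meter codes listed above
METER_NAMES = {0: "electricity", 1: "chilled water", 2: "steam", 3: "hot water"}

# Small stand-in for df_train (toy values, not real data)
toy_train = pd.DataFrame({"building_id": [0, 0, 1, 2], "meter": [0, 1, 0, 3]})

# Series.map replaces each code with its label; unknown codes would become NaN
toy_train["meter_name"] = toy_train["meter"].map(METER_NAMES)
print(toy_train)
```

The same `map` call could presumably be applied to `df_train["meter"]` so that chart legends show names instead of codes.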

First, the number of readings per building_id can be analyzed. It is always a good idea to check for NaN values first. The following bar chart illustrates the number of meter readings for each building ID, ordered from the highest number of meter readings to the lowest. From this bar chart, it can be concluded that almost none of the buildings contain all possible meter readings. This might be because not all meter types are installed in all buildings, which means that not all buildings use the same sources of energy.


In [None]:
print("Number of unique buildings: ", len(df_train.drop_duplicates(subset = "building_id")))
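
The assumption that train.csv and building_metadata contain the same building IDs can be checked directly with set differences. This is only a sketch on toy data; `toy_train` and `toy_metadata` are hypothetical stand-ins for `df_train` and `df_building_metadata`.

```python
import pandas as pd

# Toy stand-ins: building 3 exists in the metadata but has no readings
toy_train = pd.DataFrame({"building_id": [0, 0, 1, 2, 2]})
toy_metadata = pd.DataFrame({"building_id": [0, 1, 2, 3]})

train_ids = set(toy_train["building_id"].unique().tolist())
metadata_ids = set(toy_metadata["building_id"].unique().tolist())

# IDs that appear in one file but not in the other
missing_from_train = metadata_ids - train_ids
missing_from_metadata = train_ids - metadata_ids

print("In metadata but not in train:", missing_from_train)
print("In train but not in metadata:", missing_from_metadata)
```

If both sets come back empty on the real data, the two files describe the same buildings.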

In [None]:
print("Total number of rows: ", len(df_train))
print("Number of rows without NaN values: ", len(df_train.dropna()))
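
Comparing row counts before and after `dropna()` shows whether NaN values exist, but not where they are. A per-column count is one possible complement. The sketch below uses a made-up frame (`toy_train`) with NaN values inserted deliberately.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train with two missing meter readings
toy_train = pd.DataFrame({
    "building_id": [0, 1, 2, 3],
    "meter": [0, 0, 1, 2],
    "meter_reading": [10.5, np.nan, 3.2, np.nan],
})

# Number of NaN values in each column
nan_per_column = toy_train.isna().sum()

# Number of rows that contain at least one NaN value
rows_with_nan = int(toy_train.isna().any(axis=1).sum())

print(nan_per_column)
print("Rows with at least one NaN:", rows_with_nan)
```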

In [None]:
#Bar chart for the overall number of meter readings for each building ID
#value_counts() sorts the building IDs from the most readings to the fewest

df_train["building_id"].value_counts().plot.bar(figsize = (15,6), xlabel = "Building ID", ylabel = "Number of meter readings", fontsize = 10, xticks = [0], title = "Number of meter readings for each building ID")


#Alternative plt bar chart that I cannot get to work, so I just saved it here for now
#plt.bar(df_train["building_id"].value_counts(), height = 50000)
#plt.show()

## Next, the values of the meter readings will be analyzed

In [None]:
df_train["meter"].value_counts().plot.bar(xlabel = "meter type", ylabel = "number of readings", title = "Number of readings for each meter type")
#plt.bar((df_train["meter"].value_counts()), height = 10000)

df_train.drop(["building_id", "timestamp"], axis = 1).groupby(["meter"]).mean().plot.bar(xlabel = "meter type", ylabel = "mean value of the meter readings", title = "Mean meter reading for each meter type")

### The first bar chart shows how many meter readings are made by each meter type. It can be seen that most of the readings are made by 0 = electricity. The second bar chart illustrates the mean meter reading (i.e. the energy consumption) for each meter type. For 2 = steam, the mean is significantly higher, meaning that the steam meters record the highest average consumption.
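
A mean can be pulled upwards by a few very large readings, so it may help to look at the median and maximum next to it. The following is a sketch on invented numbers; `toy_train` is a hypothetical stand-in for `df_train` and the values are not taken from the data set.

```python
import pandas as pd

# Toy readings: meter 2 contains one extreme value
toy_train = pd.DataFrame({
    "meter": [0, 0, 0, 2, 2, 2],
    "meter_reading": [10.0, 12.0, 14.0, 20.0, 30.0, 4000.0],
})

# Mean, median and maximum of the readings for each meter type
summary = toy_train.groupby("meter")["meter_reading"].agg(["mean", "median", "max"])
print(summary)
```

A mean far above the median, as for meter 2 in this toy example, would suggest that a small number of readings dominates the average.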

### The following graph shows all meter readings during the year

In [None]:
xtickx = [0]

#for i in range(len(df_train)):
    #xtickx += [df_train["timestamp"][i]]
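#Note: timestamp is still a text column here, so it is not plotted and the x-axis shows the row index
df_train.filter(["meter_reading", "timestamp"]).plot(figsize = (20,7), title = "all meter readings for each building id" )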

### This graph is not very good, since it displays every meter reading for every building. This means that there are multiple data points for each time stamp.  

### In addition to this, it is difficult to distinguish the different meter types, since they have different mean values. Therefore, the different meter types will be illustrated individually.
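
One way to get a single data point per timestamp and meter type is a pivot table of mean readings. The sketch below assumes timestamps formatted like "2016-01-01 00:00:00" and uses a small invented frame (`toy_train`); it is an illustration, not the approach used in the cells that follow.

```python
import pandas as pd

# Toy stand-in for df_train: two electricity meters and one steam meter
toy_train = pd.DataFrame({
    "building_id": [0, 1, 0, 1, 2, 2],
    "meter": [0, 0, 0, 0, 2, 2],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 00:00:00",
                  "2016-01-01 01:00:00", "2016-01-01 01:00:00",
                  "2016-01-01 00:00:00", "2016-01-01 01:00:00"],
    "meter_reading": [1.0, 3.0, 2.0, 6.0, 100.0, 200.0],
})

# Parse the text timestamps so that they can serve as a time axis
toy_train["timestamp"] = pd.to_datetime(toy_train["timestamp"])

# One row per timestamp, one column per meter type, mean over all buildings
mean_by_meter = toy_train.pivot_table(
    index="timestamp", columns="meter", values="meter_reading", aggfunc="mean"
)
print(mean_by_meter)
```

Calling `mean_by_meter.plot(subplots=True)` would then draw one panel per meter type, each with its own y-axis scale.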

In [None]:
#Creating one dataframe for each meter type

df_train_0 = df_train.loc[df_train["meter"] == 0]
df_train_1 = df_train.loc[df_train["meter"] == 1]
df_train_2 = df_train.loc[df_train["meter"] == 2]
df_train_3 = df_train.loc[df_train["meter"] == 3]

#Calculating the mean readings over all buildings for each timestamp
#Note that the building_id column is no longer meaningful after this step,
#since the mean of the IDs is calculated as well. This does not matter, since only the mean meter_reading is of interest

df_train_0.groupby(by = "timestamp").mean().filter(["timestamp", "meter_reading"]).plot(figsize =(15,7), ylabel = "mean meter readings", title = "mean hourly electricity readings (index 0)")

### In this graph, it can be noted that electricity meter readings increase during the summer, and slightly decrease during fall, winter and spring. This might be due to increased usage of air conditioning during the summer. Note that there are momentary peaks that differ vastly from the mean values.
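
Momentary peaks like these can be damped with a rolling median, which makes the seasonal trend easier to see. This is a sketch on an invented hourly series; `hourly_mean` is a hypothetical stand-in for the per-timestamp mean plotted above, and the window of 3 hours is an arbitrary choice.

```python
import pandas as pd

# Toy hourly series with one momentary peak
hourly_mean = pd.Series(
    [10.0, 11.0, 500.0, 12.0, 13.0],
    index=pd.date_range("2016-01-01", periods=5, freq=pd.Timedelta(hours=1)),
)

# Centered rolling median over 3 hours; min_periods=1 keeps the edge values
smoothed = hourly_mean.rolling(window=3, center=True, min_periods=1).median()
print(smoothed)
```

On the real data a much wider window, for example one day or one week of hourly values, would probably be more appropriate.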

In [None]:
df_train_1.groupby(by = "timestamp").mean().filter(["timestamp", "meter_reading"]).plot(figsize =(15,7), ylabel = "mean meter readings", title = "mean hourly chilled water readings (index 1)")

In [None]:
df_train_2.groupby(by = "timestamp").mean().filter(["timestamp", "meter_reading"]).plot(figsize =(15,7), ylabel = "mean meter readings", title = "mean hourly steam readings (index 2)")

### It can be noted that there is something strange about the steam meter readings. The graph looks very similar to the overall meter reading graph presented before. Also notice that the values are very high compared to the other meter readings. Perhaps there are one or more buildings that use a lot of energy for steam? This could explain why the bar chart indicated that the steam meter readings were so much higher than the rest of the meter readings. This must be analyzed further and potentially processed or removed, if it is an anomaly.
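
Whether a few buildings dominate the steam readings could be checked by computing each building's share of the total. The following sketch uses invented values; `toy_steam` is a hypothetical stand-in for `df_train_2`.

```python
import pandas as pd

# Toy steam readings: building 12 is far larger than the others
toy_steam = pd.DataFrame({
    "building_id": [10, 10, 11, 11, 12, 12],
    "meter_reading": [50.0, 70.0, 40.0, 60.0, 9000.0, 11000.0],
})

# Total reading per building, largest first
total_per_building = (
    toy_steam.groupby("building_id")["meter_reading"]
    .sum()
    .sort_values(ascending=False)
)

# Fraction of the overall total that each building accounts for
share = total_per_building / total_per_building.sum()
print(share.head())
```

A single building with a share close to 1, as in this toy example, would be a candidate for closer inspection before deciding whether it is an anomaly.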

In [None]:
df_train_3.groupby(by = "timestamp").mean().filter(["timestamp", "meter_reading"]).plot(figsize =(15,7), ylabel = "mean meter readings", title = "mean hourly hot water readings (index 3)")

### The hot water usage decreases during the summer and increases during the winter. This is expected if cold weather leads people to take hotter and longer showers. There might also be some buildings that use hot water and radiators for heating.

# File "building_metadata"
This file contains information about each building ID. The first column, "site_id" contains an index which corresponds to a building site. In total there are 16 different sites that contain the building IDs.

The first thing that is investigated is the number of buildings on each site. It can be seen that site 3 has almost double the number of buildings compared to the second biggest site.

It is plausible that the primary use affects the number of buildings, so this will be investigated next.


In [None]:
#print("Amount of sites: ",len(df_building_metadata.drop_duplicates(subset = "site_id")))
df_building_metadata = pd.read_csv("../input/ashrae-energy-prediction/building_metadata.csv")
df_building_metadata.head()


In [None]:
df_building_metadata["site_id"].value_counts(sort = False).plot.bar(figsize = (15, 7), xlabel = "Site IDs", ylabel = "Amount of building IDs", fontsize = 10, title = "Number of buildings on each site")

In [None]:
df_building_metadata["primary_use"].value_counts().plot.bar(figsize = (14,5), xlabel = "Primary use", ylabel = "Number of buildings", fontsize = 10, rot = 90, title = "Number of buildings for each primary use")

In [None]:
#print(df_building_metadata.set_index(keys=["site_id", "building_id"]))
#print(df_building_metadata.filter(["site_id", "primary_use"]))

#Count the buildings for each (site_id, primary_use) pair and stack the primary uses for each site
#groupby + size + unstack does not depend on how value_counts names its count column, which differs between pandas versions
df_building_metadata.groupby(["site_id", "primary_use"]).size().unstack(fill_value = 0).plot.bar(stacked = True, figsize = (20,10), xlabel = "Site ID", ylabel = "Number of buildings", fontsize = 10, rot = 90, title = "Number of buildings for each primary use on each site")

## What can be concluded from this? There are many types of use on each site, which means that energy usage patterns will most likely vary depending on the primary use. However, it can be seen that education accounts for a large part of many sites.
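
Because the sites differ a lot in size, shares per site can be easier to compare than raw counts. The sketch below uses `pd.crosstab` on an invented frame; `toy_metadata` is a hypothetical stand-in for `df_building_metadata` and the category values are only examples.

```python
import pandas as pd

# Toy stand-in for df_building_metadata
toy_metadata = pd.DataFrame({
    "site_id": [0, 0, 0, 1, 1, 1, 1],
    "primary_use": ["Education", "Education", "Office",
                    "Education", "Office", "Office", "Lodging/residential"],
})

# Number of buildings for each site and primary use (0 where a use is absent)
counts = pd.crosstab(toy_metadata["site_id"], toy_metadata["primary_use"])

# Divide each row by its total to get the share of each use within a site
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(2))
```

Each row of `shares` sums to 1, so `shares.plot.bar(stacked=True)` would give bars of equal height that show only the mix of uses.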

# File "weather_train.csv"
### This file looks like the following dataframe. It can be seen that for each site ID and timestamp, numerous recordings of different weather data have been made.

### First of all, which recordings might have the greatest impact on the meter readings? Temperature might affect the electricity usage because of AC usage. Similarly, steam meter readings will probably decrease as the temperature rises.

In [None]:
df_weather = pd.read_csv("../input/ashrae-energy-prediction/weather_train.csv")
df_weather.head()

### The air temperature is shown in the following graph

In [None]:
df_weather.filter(["timestamp", "air_temperature"]).groupby(by = "timestamp").mean().plot(figsize = (15,7), xlabel = "timestamp", ylabel = "mean air temperature", title = "Mean air temperature for all sites")

### It can be seen that the air temperature increases substantially during the summer, and is somewhat low during the winter. This information tells us that the climate is relatively warm, which can help to decide which factors are important. Energy usage also differs depending on the climate. For example, people living in cold areas will probably not even have air conditioning installed, which means that their electricity usage differs from that of people who live in warmer areas. It would be good to know if the buildings have automatic adjustment of the indoor temperature, and in that case which parameters control it.

### Correlation between temperature and different meter values:

In [None]:

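#Series.corr aligns the two series on their index, so both are first averaged per timestamp.
#This compares the mean meter reading with the mean air temperature (over all sites) at the same point in time
mean_air_temperature = df_weather.groupby("timestamp")["air_temperature"].mean()

print("Correlation hot water meter and air temperature: ", mean_air_temperature.corr(df_train_3.groupby("timestamp")["meter_reading"].mean()))
print("Correlation steam meter and air temperature: ", mean_air_temperature.corr(df_train_2.groupby("timestamp")["meter_reading"].mean()))
print("Correlation chilled water meter and air temperature: ", mean_air_temperature.corr(df_train_1.groupby("timestamp")["meter_reading"].mean()))
print("Correlation electricity meter and air temperature: ", mean_air_temperature.corr(df_train_0.groupby("timestamp")["meter_reading"].mean()))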
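
For a correlation on the level of individual readings, each meter reading has to be paired with the air temperature at the same site and timestamp. One possible way is to join the three files first. The sketch below uses small invented frames (`toy_train`, `toy_metadata`, `toy_weather`) and assumes the key columns `building_id`, `site_id` and `timestamp` described above.

```python
import pandas as pd

# Toy stand-ins for df_train, df_building_metadata and df_weather
toy_train = pd.DataFrame({
    "building_id": [0, 0, 0, 1, 1, 1],
    "meter": [3, 3, 3, 3, 3, 3],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00", "2016-01-01 02:00:00"] * 2,
    "meter_reading": [30.0, 20.0, 10.0, 60.0, 40.0, 20.0],
})
toy_metadata = pd.DataFrame({"building_id": [0, 1], "site_id": [0, 1]})
toy_weather = pd.DataFrame({
    "site_id": [0, 0, 0, 1, 1, 1],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00", "2016-01-01 02:00:00"] * 2,
    "air_temperature": [0.0, 10.0, 20.0, 5.0, 15.0, 25.0],
})

# Attach the site to every reading, then the weather at that site and time
merged = (
    toy_train
    .merge(toy_metadata, on="building_id", how="left")
    .merge(toy_weather, on=["site_id", "timestamp"], how="left")
)

# Pearson correlation between reading and temperature for each meter type
corr_per_meter = {
    meter: group["meter_reading"].corr(group["air_temperature"])
    for meter, group in merged.groupby("meter")
}
print(corr_per_meter)
```

In this toy example the readings fall as the temperature rises, so the correlation for meter 3 is negative. On the full data set the merged frame is large, so selecting only the needed columns before merging may help with memory.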