Hello

This Kernel loads the data and performs a simple exploratory data analysis (EDA) for the ASHRAE - Great Energy Predictor III competition.

I will keep adding further tasks later, and I hope this helps all of you.

## 0. Config

In [None]:
#-*- coding: CP949 -*-
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

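import warnings
warnings.filterwarnings('ignore')  # silence library warnings to keep the output readable

import matplotlib

# Font that can render Korean text (assumes 'Malgun Gothic' is installed)
matplotlib.rc('font', family='Malgun Gothic')

# Draw minus signs correctly when a non-default font is used
matplotlib.rc('axes', unicode_minus=False)

# High-resolution (retina) inline figures
%config InlineBackend.figure_format = 'retina'

# Display the result of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%config IPCompleter.greedy=True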

## 1. Loading Data

### 1.1. Loading Train Data

In [None]:
train = pd.read_csv("../input/ashrae-energy-prediction/train.csv")
train.shape
train.head()

In [None]:
print("Building ID Count : ",train['building_id'].nunique())
print("Building ID Min Value : ",train['building_id'].min())
print("Building ID Max Value : ",train['building_id'].max())

* 'building_id' is the unique ID of each building, and the Train Data covers 1449 buildings.

In [None]:
tmp = train['meter'].value_counts(dropna=False)
tmp.plot.bar()
del tmp

* The 'meter' feature is an integer between 0 and 3, and each number represents one of the meter types below.
  - 0 : electricity
  - 1 : chilledwater
  - 2 : steam
  - 3 : hotwater
* 'electricity' has the most rows and 'hotwater' the fewest.
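
To make tables and plots easier to read, the integer codes can be mapped to their names. This is only a small sketch on made-up rows: the `toy` frame and the `meter_name` column are assumptions for illustration and are not part of the competition files.

```python
import pandas as pd

# Mapping taken from the list above
METER_NAMES = {0: "electricity", 1: "chilledwater", 2: "steam", 3: "hotwater"}

# Toy stand-in for the 'meter' column of train.csv (values are made up)
toy = pd.DataFrame({"meter": [0, 0, 0, 1, 2, 3, 0, 1]})

# Add a readable label next to the integer code
toy["meter_name"] = toy["meter"].map(METER_NAMES)

# Row count per meter type, the same quantity the bar plot shows
meter_counts = toy["meter_name"].value_counts()
print(meter_counts)
```

On the real data the same `map` call would be applied to `train['meter']`.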

In [None]:
tmp = train.sort_values(by = 'timestamp' )
tmp.head()
tmp.tail()
del tmp

* The 'timestamp' column is the time at which each measurement was taken.
* The data runs from January 1, 2016 00:00 to December 31, 2016 23:00, so it covers the full year 2016.
* The 'meter_reading' column is the target value we want to predict: the energy consumption for a given building ID and meter at that timestamp.
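
Since `pd.read_csv` was called without date parsing, 'timestamp' is loaded as text. Converting it to datetime makes the range check explicit. A minimal sketch on made-up rows (the `toy` frame is hypothetical and only mirrors the column names of the train data):

```python
import pandas as pd

# Toy stand-in for train.csv with the same column names (values are made up)
toy = pd.DataFrame({
    "building_id": [0, 0, 1, 1],
    "meter": [0, 0, 0, 0],
    "timestamp": ["2016-01-01 00:00:00", "2016-12-31 23:00:00",
                  "2016-01-01 00:00:00", "2016-06-15 12:00:00"],
    "meter_reading": [0.0, 12.5, 3.2, 7.1],
})

# Convert the text column to datetime64 so min/max and date parts work
toy["timestamp"] = pd.to_datetime(toy["timestamp"])

first, last = toy["timestamp"].min(), toy["timestamp"].max()
print(first, "->", last)

# Number of distinct calendar years covered by the data
n_years = toy["timestamp"].dt.year.nunique()
print("Years covered:", n_years)
```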

### 1.2. Loading Building Metadata

In [None]:
building_metadata = pd.read_csv("../input/ashrae-energy-prediction/building_metadata.csv")

building_metadata.shape
building_metadata.head()

print("Site ID Count : ",building_metadata['site_id'].nunique())
print("Site ID Min Value : ",building_metadata['site_id'].min())
print("Site ID Max Value : ",building_metadata['site_id'].max())

* 'site_id' is the ID of the site where each building is located.
* There are 16 sites.

In [None]:
print("Building ID Count : ",building_metadata['building_id'].nunique())
print("Building ID Min Value : ",building_metadata['building_id'].min())
print("Building ID Max Value : ",building_metadata['building_id'].max())

* The building metadata covers 1449 buildings, with the same 'building_id' values as the Train Data.
* Using this column, we can merge the building information into the Train Data later.
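
The merge on 'building_id' could look roughly like this. It is a sketch on two tiny made-up frames (`toy_train` and `toy_meta` are hypothetical stand-ins for `train` and `building_metadata`), not necessarily how the later Kernel will do it.

```python
import pandas as pd

# Toy stand-ins for train.csv and building_metadata.csv (values are made up)
toy_train = pd.DataFrame({
    "building_id": [0, 0, 1],
    "meter": [0, 1, 0],
    "meter_reading": [10.0, 4.0, 7.5],
})
toy_meta = pd.DataFrame({
    "building_id": [0, 1],
    "site_id": [0, 0],
    "primary_use": ["Education", "Office"],
    "square_feet": [7432, 2720],
})

# A left join keeps every reading and attaches its building's attributes.
# validate="many_to_one" raises if building_id is duplicated in the metadata.
merged = toy_train.merge(toy_meta, on="building_id", how="left",
                         validate="many_to_one")
print(merged)
```

A left join is used so that no reading is dropped even if a building were missing from the metadata.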

In [None]:
plt.rcParams["figure.figsize"] = (18,8)
tmp = building_metadata['primary_use'].value_counts(dropna=False)
tmp
tmp.plot.bar(rot = 45)
del tmp

* 'primary_use' is the primary use of each building.
* It can serve as a feature because how a building is used can affect its energy consumption.
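
Because 'primary_use' is text, it has to be encoded before most models can use it. One simple option is pandas' categorical codes. The sketch below uses made-up rows, and the `primary_use_code` column name is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for building_metadata.csv (values are made up)
toy_meta = pd.DataFrame({
    "building_id": [0, 1, 2, 3],
    "primary_use": ["Education", "Office", "Education", "Lodging/residential"],
})

# Convert to the categorical dtype; categories are sorted alphabetically
toy_meta["primary_use"] = toy_meta["primary_use"].astype("category")

# Integer code of each category, usable as a numeric model input
toy_meta["primary_use_code"] = toy_meta["primary_use"].cat.codes
print(toy_meta)
```

The codes carry no meaningful order, so this kind of encoding suits tree-based models better than linear ones.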

In [None]:
pd.isnull( building_metadata["square_feet"] ).sum()
building_metadata["square_feet"].describe()

sns.distplot( building_metadata["square_feet"] , bins = 500 )
plt.show()

* 'square_feet' is the gross floor area of each building.
* Most buildings are smaller than 200,000 square feet.
* I think the gross floor area will also be an important feature.
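
If the distribution has a long right tail, as the histogram suggests, a log transform is a common way to compress it before modelling. This is a hedged sketch on made-up values: the `log_square_feet` column and the 200,000 threshold check are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'square_feet' column (values are made up)
toy_meta = pd.DataFrame({"square_feet": [283, 7432, 57673, 200000, 875000]})

# log1p(x) = log(1 + x) compresses the long right tail
toy_meta["log_square_feet"] = np.log1p(toy_meta["square_feet"])

# Share of buildings strictly below 200,000 square feet
share_below = (toy_meta["square_feet"] < 200_000).mean()
print(toy_meta)
print("Share below 200,000 sq ft:", share_below)
```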

In [None]:
print( "No Data Row Count :" , pd.isnull( building_metadata["year_built"] ).sum() )
building_metadata["year_built"].describe()

In [None]:
tmp = building_metadata["year_built"].dropna()
sns.distplot( tmp , bins = 120 , kde=False)
plt.show()

* 'year_built' is the year the building was built.
* Many of the buildings were built in the 1960s and 1970s.
* You can calculate the age of the building from the year it was built and use it as a feature.
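
One way to turn 'year_built' into a feature is the building's age relative to a reference year. In this sketch `REFERENCE_YEAR` and the `building_age` column are assumptions (2016 is chosen only because the train data is from 2016), and the rows are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for building_metadata.csv (values are made up)
toy_meta = pd.DataFrame({
    "building_id": [0, 1, 2],
    "year_built": [1968.0, np.nan, 2008.0],
})

REFERENCE_YEAR = 2016  # assumption: the year covered by the train data

# Missing years stay NaN, so the gap remains visible after the transform
toy_meta["building_age"] = REFERENCE_YEAR - toy_meta["year_built"]
print(toy_meta)
```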

In [None]:
print( "No Data Row Count :" , pd.isnull( building_metadata["floor_count"] ).sum() )
building_metadata["floor_count"].describe()

building_metadata["floor_count"].value_counts()

In [None]:
tmp = building_metadata["floor_count"].dropna()
sns.distplot( tmp  , kde=False)
plt.show()

* Many buildings have no 'floor_count' value.
* Let's check the characteristics of the buildings without a floor count later.
* Most buildings have 10 floors or fewer.
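
A first look at the buildings without a floor count could compare them with the rest on another column. The sketch below does this for 'square_feet' on made-up rows, and the `floor_count_status` column is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Toy stand-in for building_metadata.csv (values are made up)
toy_meta = pd.DataFrame({
    "building_id": [0, 1, 2, 3],
    "square_feet": [7432, 2720, 116607, 23685],
    "floor_count": [np.nan, 2.0, np.nan, 5.0],
})

# Label each building by whether its floor count is available
toy_meta["floor_count_status"] = np.where(
    toy_meta["floor_count"].isna(), "missing", "present"
)

# Compare the two groups on floor area
summary = toy_meta.groupby("floor_count_status")["square_feet"].agg(["count", "median"])
print(summary)
```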

### 1.3. Loading Weather Data

In [None]:
weather_test = pd.read_csv("../input/ashrae-energy-prediction/weather_test.csv")
weather_train = pd.read_csv("../input/ashrae-energy-prediction/weather_train.csv")

weather_train.shape
weather_train.head()

weather_test.shape
weather_test.head()

In [None]:
print("Site ID Count : " , weather_train['site_id'].nunique() )
weather_train['site_id'].value_counts(dropna=False)

print("Site ID Count : " , weather_test['site_id'].nunique() )
weather_test['site_id'].value_counts(dropna=False)

* There are 16 Site IDs, the same as in the Building Metadata above.
* The data contains hourly climate measurements for each Site ID, and the train weather data covers one year.
* Climate can play an important role in energy consumption, so this data can be an important feature.
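
Weather rows can be attached to the meter readings by joining on 'site_id' and 'timestamp'. The sketch below assumes 'site_id' has already been added to each reading from the building metadata. Both frames are made-up stand-ins, and 'air_temperature' is used as one example weather column.

```python
import pandas as pd

# Toy readings that already carry the site_id of their building (made up)
toy_train = pd.DataFrame({
    "building_id": [0, 0, 105],
    "site_id": [0, 0, 1],
    "timestamp": pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 00:00"]),
    "meter_reading": [0.0, 1.5, 23.3],
})

# Toy stand-in for weather_train.csv (values are made up)
toy_weather = pd.DataFrame({
    "site_id": [0, 0, 1],
    "timestamp": pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 00:00"]),
    "air_temperature": [25.0, 24.4, 3.8],
})

# One weather row per (site_id, timestamp); the left join keeps all readings
merged = toy_train.merge(toy_weather, on=["site_id", "timestamp"], how="left")
print(merged)
```

Both 'timestamp' columns must have the same dtype for the join keys to match, which is why they are converted with `pd.to_datetime` first.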

### 1.4. Loading Test & Submission Data

In [None]:
sample_submission = pd.read_csv("../input/ashrae-energy-prediction/sample_submission.csv")
test = pd.read_csv("../input/ashrae-energy-prediction/test.csv")

In [None]:
test.shape
test.head()

In [None]:
tmp = test.sort_values(by = 'timestamp')
tmp.head()
tmp.tail()
del tmp

In [None]:
sample_submission.shape
sample_submission.head()

* The 'test' data holds the building ID, meter and timestamp of every reading we have to predict in this competition.
* It has the same structure as the 'train' data we saw earlier but, of course, there is no 'meter_reading' column, and the timestamp ranges from January 1, 2017 00:00 to December 31, 2018 23:00.
* We have to predict two years of energy consumption based on the train data and the other data.

In [None]:
print("Test Data Building ID Count : ",test['building_id'].nunique())
print("Test Data Building ID Min Value : ",test['building_id'].min())
print("Test Data Building ID Max Value : ",test['building_id'].max())

* As expected, the 'building_id' values in the 'test' data are the same as in the 'train' data.
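
Matching counts, minimum and maximum do not strictly prove that the two ID sets are identical, so a set comparison is a cheap extra check. A minimal sketch on made-up frames (`toy_train` and `toy_test` are hypothetical stand-ins for `train` and `test`):

```python
import pandas as pd

# Toy stand-ins for the building_id columns of train.csv and test.csv
toy_train = pd.DataFrame({"building_id": [0, 0, 1, 2]})
toy_test = pd.DataFrame({"building_id": [0, 1, 1, 2]})

train_ids = set(toy_train["building_id"].unique())
test_ids = set(toy_test["building_id"].unique())

# IDs that appear on only one side; both should be empty
only_in_train = train_ids - test_ids
only_in_test = test_ids - train_ids
same_ids = train_ids == test_ids

print("Same building IDs:", same_ids)
print("Only in train:", only_in_train, "| Only in test:", only_in_test)
```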

The next Kernel will carry out Data Cleaning and Feature Engineering.

Please look forward to the next Kernel.

Thank you for reading.