# Overview

A competition to predict the readings of four meter types (electricity, chilled water, steam, and hot water) for 1,449 buildings.

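The motivation is that accurate predictions of energy use (the readings of the four meter types) should encourage investment in energy savings and thereby help with environmental problems.

Building owners who invest in improving a building's energy efficiency can reduce their costs by the amount that efficiency improves.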

# Module

In [1]:
import gc
import optuna
import datetime
import warnings
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
import lightgbm as lgb
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from pandas.api.types import is_datetime64_any_dtype as is_datetime
from pandas.api.types import is_categorical_dtype

# Datasets

In [2]:
train = pd.read_csv('../input/ashrae-energy-prediction/train.csv')

print(train.shape)
train.head()

(20216100, 4)


Unnamed: 0,building_id,meter,timestamp,meter_reading
0,0,0,2016-01-01 00:00:00,0.0
1,1,0,2016-01-01 00:00:00,0.0
2,2,0,2016-01-01 00:00:00,0.0
3,3,0,2016-01-01 00:00:00,0.0
4,4,0,2016-01-01 00:00:00,0.0


Table of hourly meter readings (training data)<br>
・building_id: building ID<br>
・meter: 0 is electricity, 1 is chilled water, 2 is steam, 3 is hot water<br>
・timestamp: date and time<br>
・meter_reading: meter reading (target variable)<br>

In [3]:
test = pd.read_csv('../input/ashrae-energy-prediction/test.csv')

print(test.shape)
test.head()

(41697600, 4)


Unnamed: 0,row_id,building_id,meter,timestamp
0,0,0,0,2017-01-01 00:00:00
1,1,1,0,2017-01-01 00:00:00
2,2,2,0,2017-01-01 00:00:00
3,3,3,0,2017-01-01 00:00:00
4,4,4,0,2017-01-01 00:00:00


Table of hourly meter rows to predict (test data)<br>
・row_id: row ID<br>
・building_id: building ID<br>
・meter: 0 is electricity, 1 is chilled water, 2 is steam, 3 is hot water<br>
・timestamp: date and time<br>

In [4]:
weather_train = pd.read_csv('../input/ashrae-energy-prediction/weather_train.csv')

print(weather_train.shape)
weather_train.head()

(139773, 9)


Unnamed: 0,site_id,timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
0,0,2016-01-01 00:00:00,25.0,6.0,20.0,,1019.7,0.0,0.0
1,0,2016-01-01 01:00:00,24.4,,21.1,-1.0,1020.2,70.0,1.5
2,0,2016-01-01 02:00:00,22.8,2.0,21.1,0.0,1020.2,0.0,0.0
3,0,2016-01-01 03:00:00,21.1,2.0,20.6,0.0,1020.1,0.0,0.0
4,0,2016-01-01 04:00:00,20.0,2.0,20.0,-1.0,1020.0,250.0,2.6


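Table of weather data (matching the training data)<br>
・site_id: identifier of the building's physical location<br>
・timestamp: date and time<br>
・air_temperature: air temperature (degrees Celsius)<br>
・cloud_coverage: portion of the sky covered by cloud (oktas)<br>
・dew_temperature: dew point temperature (degrees Celsius)<br>
・precip_depth_1_hr: precipitation depth over one hour (millimeters)<br>
・sea_level_pressure: sea level pressure (millibar/hectopascal)<br>
・wind_direction: wind direction (compass degrees, 0-360)<br>
・wind_speed: wind speed (meters per second)<br>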

In [5]:
weather_test = pd.read_csv('../input/ashrae-energy-prediction/weather_test.csv')

print(weather_test.shape)
weather_test.head()

(277243, 9)


Unnamed: 0,site_id,timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
0,0,2017-01-01 00:00:00,17.8,4.0,11.7,,1021.4,100.0,3.6
1,0,2017-01-01 01:00:00,17.8,2.0,12.8,0.0,1022.0,130.0,3.1
2,0,2017-01-01 02:00:00,16.1,0.0,12.8,0.0,1021.9,140.0,3.1
3,0,2017-01-01 03:00:00,17.2,0.0,13.3,0.0,1022.2,140.0,3.1
4,0,2017-01-01 04:00:00,16.7,2.0,13.3,0.0,1022.3,130.0,2.6


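Table of weather data (matching the test data)<br>
・site_id: identifier of the building's physical location<br>
・timestamp: date and time<br>
・air_temperature: air temperature (degrees Celsius)<br>
・cloud_coverage: portion of the sky covered by cloud (oktas)<br>
・dew_temperature: dew point temperature (degrees Celsius)<br>
・precip_depth_1_hr: precipitation depth over one hour (millimeters)<br>
・sea_level_pressure: sea level pressure (millibar/hectopascal)<br>
・wind_direction: wind direction (compass degrees, 0-360)<br>
・wind_speed: wind speed (meters per second)<br>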

In [6]:
building = pd.read_csv('../input/ashrae-energy-prediction/building_metadata.csv')

print(building.shape)
building.head()

(1449, 6)


Unnamed: 0,site_id,building_id,primary_use,square_feet,year_built,floor_count
0,0,0,Education,7432,2008.0,
1,0,1,Education,2720,2004.0,
2,0,2,Education,5376,1991.0,
3,0,3,Education,23685,2002.0,
4,0,4,Education,116607,1975.0,


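Table of building metadata<br>
・site_id: identifier of the building's physical location<br>
・building_id: building ID<br>
・primary_use: primary use of the building<br>
・square_feet: floor area in square feet<br>
・year_built: year the building was constructed<br>
・floor_count: number of floors<br>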

# Join

In [7]:
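def reduce_mem_usage(df, use_float16=False):
    """Downcast numeric columns to the smallest dtype that holds their range
    and convert object columns to category. Modifies df in place and returns it."""
    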
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        if is_datetime(df[col]) or is_categorical_dtype(df[col]):
            continue
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if use_float16 and c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
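
As a rough illustration of what the downcasting achieves, the sketch below applies the same idea to a small made-up frame. It uses `pd.to_numeric(..., downcast=...)`, a pandas built-in, instead of the hand-written range checks above; the column values are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values) with the default 64-bit dtypes.
toy = pd.DataFrame({
    'building_id': np.arange(1000, dtype=np.int64),
    'meter': np.tile(np.array([0, 1, 2, 3], dtype=np.int64), 250),
    'meter_reading': np.linspace(0.0, 500.0, 1000),
})
before = toy.memory_usage(index=False).sum()

# Downcast each column to the smallest dtype that can hold its values.
small = toy.copy()
small['building_id'] = pd.to_numeric(small['building_id'], downcast='integer')
small['meter'] = pd.to_numeric(small['meter'], downcast='integer')
small['meter_reading'] = pd.to_numeric(small['meter_reading'], downcast='float')
after = small.memory_usage(index=False).sum()

print(small.dtypes)
print('bytes before: {}, after: {}'.format(before, after))
```

The integer columns shrink the most here because their value ranges are tiny compared with what a 64-bit integer can hold.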

In [8]:
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)

weather_train = reduce_mem_usage(weather_train)
weather_test = reduce_mem_usage(weather_test)

Memory usage of dataframe is 616.95 MB
Memory usage after optimization is: 173.84 MB
Decreased by 71.8%
Memory usage of dataframe is 1272.51 MB
Memory usage after optimization is: 358.53 MB
Decreased by 71.8%
Memory usage of dataframe is 9.60 MB
Memory usage after optimization is: 4.45 MB
Decreased by 53.6%
Memory usage of dataframe is 19.04 MB
Memory usage after optimization is: 8.83 MB
Decreased by 53.6%


In [9]:
train = train.merge(building, on='building_id', how='left')
test = test.merge(building, on='building_id', how='left')

print(train.shape)
train.head()

(20216100, 9)


Unnamed: 0,building_id,meter,timestamp,meter_reading,site_id,primary_use,square_feet,year_built,floor_count
0,0,0,2016-01-01 00:00:00,0.0,0,Education,7432,2008.0,
1,1,0,2016-01-01 00:00:00,0.0,0,Education,2720,2004.0,
2,2,0,2016-01-01 00:00:00,0.0,0,Education,5376,1991.0,
3,3,0,2016-01-01 00:00:00,0.0,0,Education,23685,2002.0,
4,4,0,2016-01-01 00:00:00,0.0,0,Education,116607,1975.0,


In [10]:
del building
gc.collect()

151

In [11]:
train = train.merge(weather_train, on=['site_id', 'timestamp'], how='left')
test = test.merge(weather_test, on=['site_id', 'timestamp'], how='left')

print(train.shape)
train.head()

(20216100, 16)


Unnamed: 0,building_id,meter,timestamp,meter_reading,site_id,primary_use,square_feet,year_built,floor_count,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
0,0,0,2016-01-01 00:00:00,0.0,0,Education,7432,2008.0,,25.0,6.0,20.0,,1019.700012,0.0,0.0
1,1,0,2016-01-01 00:00:00,0.0,0,Education,2720,2004.0,,25.0,6.0,20.0,,1019.700012,0.0,0.0
2,2,0,2016-01-01 00:00:00,0.0,0,Education,5376,1991.0,,25.0,6.0,20.0,,1019.700012,0.0,0.0
3,3,0,2016-01-01 00:00:00,0.0,0,Education,23685,2002.0,,25.0,6.0,20.0,,1019.700012,0.0,0.0
4,4,0,2016-01-01 00:00:00,0.0,0,Education,116607,1975.0,,25.0,6.0,20.0,,1019.700012,0.0,0.0
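
A left join keeps every meter row even when no weather record exists for that site and hour, which is why the row count stays at 20,216,100 while the weather columns may contain NaN. The sketch below shows this on invented data. `validate='many_to_one'` is an optional pandas safeguard, not used in this notebook, that raises an error if the right-hand keys are duplicated.

```python
import pandas as pd

# Invented meter rows: two hours for site 0, one hour for site 1.
readings = pd.DataFrame({
    'site_id': [0, 0, 1],
    'timestamp': ['2016-01-01 00:00:00', '2016-01-01 01:00:00', '2016-01-01 00:00:00'],
    'meter_reading': [0.0, 5.2, 3.1],
})
# Invented weather rows: the 01:00 record for site 0 is missing.
weather = pd.DataFrame({
    'site_id': [0, 1],
    'timestamp': ['2016-01-01 00:00:00', '2016-01-01 00:00:00'],
    'air_temperature': [25.0, 3.8],
})

merged = readings.merge(weather, on=['site_id', 'timestamp'], how='left',
                        validate='many_to_one')
print(merged)
print('rows without weather:', merged['air_temperature'].isna().sum())
```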


In [12]:
del weather_train, weather_test
gc.collect()

40

In [13]:
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)

Memory usage of dataframe is 1639.08 MB
Memory usage after optimization is: 1137.81 MB
Decreased by 30.6%
Memory usage of dataframe is 3380.74 MB
Memory usage after optimization is: 2346.83 MB
Decreased by 30.6%


# Visualization

In [14]:
train['timestamp'] = pd.to_datetime(train['timestamp'])
test['timestamp'] = pd.to_datetime(test['timestamp'])

In [15]:
# train[train['site_id'] == 0].plot('timestamp', 'meter_reading')

In [16]:
# train[train['site_id'] == 1].plot('timestamp', 'meter_reading')

In [17]:
# train[train['site_id'] == 2].plot('timestamp', 'meter_reading')
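
Plotting every hourly row of a site at once is slow and hard to read. One option, sketched below on synthetic data, is to aggregate to a daily mean per site first and plot the much smaller result. The frame is generated for illustration and only stands in for `train`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_hours = 24 * 14  # two weeks of hourly timestamps
timestamps = pd.Timestamp('2016-01-01') + pd.to_timedelta(np.arange(n_hours), unit='h')

# Synthetic stand-in for train: two sites with random positive readings.
toy = pd.DataFrame({
    'site_id': np.repeat([0, 1], n_hours),
    'timestamp': np.tile(timestamps.values, 2),
    'meter_reading': rng.gamma(shape=2.0, scale=50.0, size=2 * n_hours),
})

# Daily mean reading per site: one row per day, one column per site.
daily = (toy.groupby(['site_id', toy['timestamp'].dt.normalize()])['meter_reading']
            .mean()
            .unstack('site_id'))
print(daily.head())
# daily.plot() would then draw one line per site (requires matplotlib).
```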

# Feature Engineering

In [18]:
# train['hour'] = train['timestamp'].dt.hour
train['day'] = train['timestamp'].dt.day
train['weekend'] = train['timestamp'].dt.weekday
# train['month'] = train['timestamp'].dt.month

print(train.shape)
train.head()

(20216100, 18)


Unnamed: 0,building_id,meter,timestamp,meter_reading,site_id,primary_use,square_feet,year_built,floor_count,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed,day,weekend
0,0,0,2016-01-01,0.0,0,Education,7432,2008.0,,25.0,6.0,20.0,,1019.700012,0.0,0.0,1,4
1,1,0,2016-01-01,0.0,0,Education,2720,2004.0,,25.0,6.0,20.0,,1019.700012,0.0,0.0,1,4
2,2,0,2016-01-01,0.0,0,Education,5376,1991.0,,25.0,6.0,20.0,,1019.700012,0.0,0.0,1,4
3,3,0,2016-01-01,0.0,0,Education,23685,2002.0,,25.0,6.0,20.0,,1019.700012,0.0,0.0,1,4
4,4,0,2016-01-01,0.0,0,Education,116607,1975.0,,25.0,6.0,20.0,,1019.700012,0.0,0.0,1,4
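
Note that `dt.weekday` returns the day of the week as an integer from 0 (Monday) to 6 (Sunday), so the `weekend` column above holds a weekday index rather than a yes/no flag. If a true weekend indicator is wanted, it could be derived as in this sketch; the `is_weekend` name is hypothetical and not part of the notebook.

```python
import pandas as pd

# Friday through Monday, starting at the first date in the training data.
ts = pd.Series(pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04']))

weekday = ts.dt.weekday                        # Monday=0 ... Sunday=6
is_weekend = (weekday >= 5).astype('int8')     # hypothetical extra feature: 1 on Sat/Sun

print(pd.DataFrame({'timestamp': ts, 'weekday': weekday, 'is_weekend': is_weekend}))
```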


# Preprocess

In [19]:
drop_col = ['timestamp',
            'site_id',
            'precip_depth_1_hr',
            'sea_level_pressure',
            'wind_direction',
            'wind_speed',
            'floor_count']
            
train = train.drop(drop_col, axis = 1)

gc.collect()

131

In [20]:
le = LabelEncoder()
train['primary_use'] = le.fit_transform(train['primary_use'])
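
`LabelEncoder` assigns integer codes in sorted order of the classes seen during `fit`. To guarantee that a category receives the same code at prediction time, the fitted encoder can be reused with `transform` rather than refitted. A minimal sketch with invented category lists:

```python
from sklearn.preprocessing import LabelEncoder

train_use = ['Education', 'Office', 'Retail', 'Office']
test_use = ['Retail', 'Office']     # 'Education' happens to be absent here

le_demo = LabelEncoder()
train_codes = le_demo.fit_transform(train_use)
test_codes = le_demo.transform(test_use)                # reuses the training mapping
refit_codes = LabelEncoder().fit_transform(test_use)    # builds a new, different mapping

print(dict(zip(le_demo.classes_, range(len(le_demo.classes_)))))
print(test_codes, refit_codes)
```

Refitting only gives the same codes when both sets contain exactly the same categories, which is easy to break by accident.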

In [21]:
target = np.log1p(train['meter_reading'])
train = train.drop(['meter_reading'], axis=1)

print(train.shape)
train.head()

(20216100, 10)


Unnamed: 0,building_id,meter,primary_use,square_feet,year_built,air_temperature,cloud_coverage,dew_temperature,day,weekend
0,0,0,0,7432,2008.0,25.0,6.0,20.0,1,4
1,1,0,0,2720,2004.0,25.0,6.0,20.0,1,4
2,2,0,0,5376,1991.0,25.0,6.0,20.0,1,4
3,3,0,0,23685,2002.0,25.0,6.0,20.0,1,4
4,4,0,0,116607,1975.0,25.0,6.0,20.0,1,4
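
Training on `log1p(meter_reading)` means that an ordinary RMSE on the transformed target equals the root mean squared logarithmic error (RMSLE) on the original scale; RMSLE is assumed here to be the competition metric. The sketch below checks the identity on invented numbers.

```python
import numpy as np

y_true = np.array([0.0, 10.0, 250.0, 4000.0])   # invented meter readings
y_pred = np.array([1.0, 12.0, 200.0, 3500.0])   # invented predictions

# RMSE computed in log1p space ...
rmse_log = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
# ... matches the RMSLE written out on the original scale.
rmsle = np.sqrt(np.mean(np.log((y_pred + 1.0) / (y_true + 1.0)) ** 2))

# expm1 inverts log1p, which is how predictions are mapped back later.
round_trip = np.expm1(np.log1p(y_true))
print(rmse_log, rmsle, round_trip)
```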


# Modeling

In [22]:
'''
X_tr, X_val, y_tr, y_val = train_test_split(train, target, test_size=0.2, random_state=666)

def create_model(trial):
    num_leaves = trial.suggest_int('num_leaves', 2, 31)
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    learning_rate = trial.suggest_uniform('learning_rate', 0.0001, 0.99)
    max_depth = trial.suggest_int('max_depth', 3, 8)
    min_child_samples = trial.suggest_int('min_child_samples', 100, 1200)
    min_data_in_leaf = trial.suggest_int('min_data_in_leaf', 5, 90)
    bagging_freq = trial.suggest_int('bagging_freq', 1, 7)
    bagging_fraction = trial.suggest_uniform('bagging_fraction', 0.0001, 1.0)
    feature_fraction = trial.suggest_uniform('feature_fraction', 0.0001, 1.0)
    subsample = trial.suggest_uniform('subsample', 0.1, 1.0)
    colsample_bytree = trial.suggest_uniform('colsample_bytree', 0.1, 1.0)
    
    model = lgb.LGBMRegressor(
        num_leaves=num_leaves,
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        max_depth=max_depth, 
        min_child_samples=min_child_samples, 
        min_data_in_leaf=min_data_in_leaf,
        bagging_freq=bagging_freq,
        bagging_fraction=bagging_fraction,
        feature_fraction=feature_fraction,
        subsample=subsample,
        colsample_bytree=colsample_bytree,
        metric='rmse',
        random_state=666)
    return model

def objective(trial):
    model = create_model(trial)
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_val)
    score = 1 / np.sqrt(mean_squared_error(y_pred, y_val))
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=40)
params = study.best_params
print(params)
'''
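
The disabled search maximizes `1 / RMSE`. Because that score is a strictly decreasing function of RMSE, it ranks candidates exactly as minimizing RMSE would, so the objective could equally return the RMSE itself with `direction='minimize'` in `optuna.create_study`. The sketch below checks the ranking on invented predictions using NumPy only; `rmse` is a helper defined for this example.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_val = np.array([1.0, 2.0, 3.0, 4.0])          # invented validation targets
candidates = {                                   # invented predictions of two models
    'model_a': np.array([1.1, 2.1, 2.9, 4.2]),
    'model_b': np.array([0.5, 2.5, 3.5, 3.0]),
}

errors = {name: rmse(y_val, pred) for name, pred in candidates.items()}
best_by_min_rmse = min(errors, key=errors.get)
best_by_max_inverse = max(errors, key=lambda name: 1.0 / errors[name])
print(errors, best_by_min_rmse, best_by_max_inverse)
```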

In [23]:
params = {'num_leaves': 11,
          'n_estimators': 292,
          'learning_rate': 0.8640663767978793,
          'max_depth': 7,
          'min_child_samples': 597,
          'min_data_in_leaf': 39,
          'bagging_freq': 3,
          'bagging_fraction': 0.849866403131204,
          'feature_fraction': 0.015663657484555338,
          'subsample': 0.6047292499624366,
          'colsample_bytree': 0.9865686659484679,
          'metric': 'rmse',
          'random_state': 666}

In [24]:
%%time

cls = lgb.LGBMRegressor(**params)
cls.fit(train, target)

CPU times: user 11min 18s, sys: 3.53 s, total: 11min 22s
Wall time: 3min 1s


LGBMRegressor(bagging_fraction=0.849866403131204, bagging_freq=3,
              colsample_bytree=0.9865686659484679,
              feature_fraction=0.015663657484555338,
              learning_rate=0.8640663767978793, max_depth=7, metric='rmse',
              min_child_samples=597, min_data_in_leaf=39, n_estimators=292,
              num_leaves=11, random_state=666, subsample=0.6047292499624366)

In [25]:
del train, target, params
gc.collect()

43
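
Several keys in the parameter dictionary used for the final fit are, per the LightGBM parameter documentation, aliases of one another: `min_child_samples` and `min_data_in_leaf`, `subsample` and `bagging_fraction`, `colsample_bytree` and `feature_fraction`. Passing both names with different values makes it unclear which one is applied, so it may be worth keeping a single name per setting. The check below is a plain-Python sketch; the `ALIAS_GROUPS` table and the `find_alias_conflicts` helper are written for this example and are not part of LightGBM, and the values are rounded from the notebook's dictionary.

```python
# Alias pairs taken from the LightGBM parameter list (only the ones used here).
ALIAS_GROUPS = [
    ('min_data_in_leaf', 'min_child_samples'),
    ('bagging_fraction', 'subsample'),
    ('feature_fraction', 'colsample_bytree'),
]

def find_alias_conflicts(params):
    """Return one dict per alias group whose members carry different values."""
    conflicts = []
    for group in ALIAS_GROUPS:
        present = {name: params[name] for name in group if name in params}
        if len(set(present.values())) > 1:
            conflicts.append(present)
    return conflicts

demo_params = {
    'min_child_samples': 597, 'min_data_in_leaf': 39,
    'bagging_fraction': 0.85, 'subsample': 0.60,
    'feature_fraction': 0.016, 'colsample_bytree': 0.987,
    'num_leaves': 11,
}
for conflict in find_alias_conflicts(demo_params):
    print('conflicting aliases:', conflict)
```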

# Submit

In [26]:
row_id = test['row_id']
test = test.drop(['row_id'], axis = 1)

In [27]:
# test['hour'] = test['timestamp'].dt.hour
test['day'] = test['timestamp'].dt.day
test['weekend'] = test['timestamp'].dt.weekday
# test['month'] = test['timestamp'].dt.month

In [28]:
test = test.drop(drop_col, axis=1)
gc.collect()

101

In [29]:
test['primary_use'] = le.transform(test['primary_use'])  # reuse the encoder fitted on train

print(test.shape)
test.head()

(41697600, 10)


Unnamed: 0,building_id,meter,primary_use,square_feet,year_built,air_temperature,cloud_coverage,dew_temperature,day,weekend
0,0,0,0,7432,2008.0,17.799999,4.0,11.7,1,6
1,1,0,0,2720,2004.0,17.799999,4.0,11.7,1,6
2,2,0,0,5376,1991.0,17.799999,4.0,11.7,1,6
3,3,0,0,23685,2002.0,17.799999,4.0,11.7,1,6
4,4,0,0,116607,1975.0,17.799999,4.0,11.7,1,6


In [30]:
test = reduce_mem_usage(test)

Memory usage of dataframe is 2187.13 MB
Memory usage after optimization is: 1352.04 MB
Decreased by 38.2%


In [31]:
import sys

print(sys.getsizeof(test))
print(sys.getsizeof(row_id))

1417718432
500371232
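
`sys.getsizeof` on a DataFrame reports a single total. For a per-column breakdown, `DataFrame.memory_usage(deep=True)` is an alternative, sketched here on a small invented frame.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'building_id': np.arange(1000, dtype=np.int16),
    'primary_use': ['Education'] * 1000,                 # string column
    'air_temperature': np.zeros(1000, dtype=np.float32),
})

# deep=True also counts the memory held by the string values themselves.
per_column = toy.memory_usage(deep=True, index=False)
print(per_column.sort_values(ascending=False))
print('total bytes:', per_column.sum())
```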


In [32]:
print(pd.DataFrame([[val for val in dir()], [sys.getsizeof(eval(val)) for val in dir()]],
                   index=['name','size']).T.sort_values('size', ascending=False).reset_index(drop=True))

del _11, _9, _18
gc.collect()

           name        size
0          test  1417718432
1        row_id   500371232
2           _11      932563
3            _9      932423
4           _18      336598
..          ...         ...
87          _25          28
88          ___          28
89     __spec__          16
90   __loader__          16
91  __package__          16

[92 rows x 2 columns]


0

In [33]:
target = np.expm1(cls.predict(test))

submission = pd.DataFrame(target, index=row_id, columns=['meter_reading'])
submission.head(10)

row_id,meter_reading
0,20.904493
1,4.635884
2,6.671875
3,29.810936
4,140.282104
5,7.963854
6,40.785556
7,88.987123
8,66.383996
9,40.094845
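
Because the model was trained on `log1p(meter_reading)`, predictions are mapped back with `expm1`. A regressor can return slightly negative values in log space, which would become negative readings; clipping at zero is one way to guard against that. The sketch uses invented numbers in place of `cls.predict(test)` and builds a small frame in the same `row_id`, `meter_reading` layout.

```python
import numpy as np
import pandas as pd

log_pred = np.array([3.09, 1.73, -0.02, 4.95])   # invented model outputs in log1p space
row_id_demo = pd.Series([0, 1, 2, 3], name='row_id')

# Invert log1p, then clip so that no reading is negative.
readings = np.clip(np.expm1(log_pred), 0, None)

submission_demo = pd.DataFrame({'meter_reading': readings},
                               index=pd.Index(row_id_demo, name='row_id'))
csv_text = submission_demo.to_csv()   # no path given, so the CSV is returned as a string
print(csv_text)
```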


In [34]:
del row_id
gc.collect()

44

In [35]:
submission.to_csv('ASHRAE_submit.csv')
submission.head()

row_id,meter_reading
0,20.904493
1,4.635884
2,6.671875
3,29.810936
4,140.282104


Format of the submission file<br>
・row_id: row ID<br>
・meter_reading: meter reading (target variable)<br>