# ⚡DACON: Electricity Consumption Prediction AI Competition

---
#### Overview
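[Predicting electricity consumption from building and weather information](https://dacon.io/competitions/official/235736/overview/description)
```
Topics
1. Discover efficient AI algorithms through electricity demand forecasting simulation
2. Discover new power-convergence services and develop business models that use them
3. Promote the convergence and spread of artificial intelligence (AI) for the success of the Digital New Deal
```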

Entering the competition as a way to study machine-learning regression!

## Import Library (step. 1)

In [1]:
import pandas as pd
import numpy as np

## Load Data (step. 2)

In [2]:
train = pd.read_csv("C:/Users/User/Downloads/data/energy/train_utf.csv")

# train.shape: 122400 x 10
# 60 buildings x 85 days x 24 hours = 122400
print(train.shape)
train.head()

(122400, 10)


Unnamed: 0,num,date_time,전력사용량(kWh),기온(°C),풍속(m/s),습도(%),강수량(mm),일조(hr),비전기냉방설비운영,태양광보유
0,1,2020-06-01 00,8179.056,17.6,2.5,92.0,0.8,0.0,0.0,0.0
1,1,2020-06-01 01,8135.64,17.7,2.9,91.0,0.3,0.0,0.0,0.0
2,1,2020-06-01 02,8107.128,17.5,3.2,91.0,0.0,0.0,0.0,0.0
3,1,2020-06-01 03,8048.808,17.1,3.2,91.0,0.0,0.0,0.0,0.0
4,1,2020-06-01 04,8043.624,17.0,3.3,92.0,0.0,0.0,0.0,0.0


In [3]:
test = pd.read_csv("C:/Users/User/Downloads/data/energy/test_utf.csv")

# test.shape: 10080 x 9
# 60 buildings x 7 days x 24 hours = 10080
print(test.shape)
test.head()

(10080, 9)


Unnamed: 0,num,date_time,기온(°C),풍속(m/s),습도(%),"강수량(mm, 6시간)","일조(hr, 3시간)",비전기냉방설비운영,태양광보유
0,1,2020-08-25 00,27.8,1.5,74.0,0.0,0.0,,
1,1,2020-08-25 01,,,,,,,
2,1,2020-08-25 02,,,,,,,
3,1,2020-08-25 03,27.3,1.1,78.0,,0.0,,
4,1,2020-08-25 04,,,,,,,
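
The row counts in the shape comments can be reproduced by counting hourly timestamps. A minimal sketch, assuming the training data ends at 2020-08-24 23 and the test week ends at 2020-08-31 23 (both end points are inferred from the 85-day and 7-day figures in the comments, not read from the files):

```python
from datetime import datetime

N_BUILDINGS = 60  # from the notebook comments

def hours_between(start, end_exclusive):
    """Number of hourly timestamps in [start, end_exclusive)."""
    return int((end_exclusive - start).total_seconds() // 3600)

# Assumed ranges: train 2020-06-01 00 .. 2020-08-24 23, test 2020-08-25 00 .. 2020-08-31 23
train_hours = hours_between(datetime(2020, 6, 1), datetime(2020, 8, 25))
test_hours = hours_between(datetime(2020, 8, 25), datetime(2020, 9, 1))

train_rows = N_BUILDINGS * train_hours
test_rows = N_BUILDINGS * test_hours
print(train_hours, train_rows)  # hours per building, total train rows
print(test_hours, test_rows)    # hours per building, total test rows
```

If the two totals differ from `train.shape[0]` and `test.shape[0]`, some building is missing hours or the assumed date range is off.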


## Preprocessing (step. 3)

In [5]:
# Find Null
train.isnull().sum()

num           0
date_time     0
전력사용량(kWh)    0
기온(°C)        0
풍속(m/s)       0
습도(%)         0
강수량(mm)       0
일조(hr)        0
비전기냉방설비운영     0
태양광보유         0
dtype: int64

In [6]:
test.isnull().sum()

num                0
date_time          0
기온(°C)          6720
풍속(m/s)         6720
습도(%)           6720
강수량(mm, 6시간)    8400
일조(hr, 3시간)     6720
비전기냉방설비운영       7784
태양광보유           8456
dtype: int64
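
These counts are consistent with the weather columns being reported only at fixed intervals, as the `6시간` (6 hours) and `3시간` (3 hours) labels in the column names suggest. A small sketch of the arithmetic, assuming one observation every 3 hours for temperature, wind speed, humidity, and sunshine, and one every 6 hours for precipitation (the 3-hour interval for the unlabeled columns is inferred from the `test.head()` rows, not stated in the source):

```python
N_TEST_ROWS = 10080  # 60 buildings x 7 days x 24 hours

def expected_missing(n_rows, interval_hours):
    """Rows left empty when only one row in every `interval_hours` is observed."""
    observed = n_rows // interval_hours
    return n_rows - observed

missing_3h = expected_missing(N_TEST_ROWS, 3)  # temperature, wind speed, humidity, sunshine
missing_6h = expected_missing(N_TEST_ROWS, 6)  # precipitation
print(missing_3h, missing_6h)
```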

In [8]:
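# For each building, look up '비전기냉방설비운영' (non-electric cooling in operation) and
# '태양광보유' (has solar panels) in train and use them to fill the missing values in the test set.
ice={}   # building number -> '비전기냉방설비운영'
hot={}   # building number -> '태양광보유'
count=0
# Assumes train is ordered by building with an equal number of rows per building,
# so stepping by len(train)//60 lands on the first row of each building.
for i in range(0, len(train), len(train)//60):
    count +=1
    ice[count]=train.loc[i,'비전기냉방설비운영']
    hot[count]=train.loc[i,'태양광보유']
    
for i in range(len(test)):
    test.loc[i, '비전기냉방설비운영']=ice[test['num'][i]]
    test.loc[i, '태양광보유']=hot[test['num'][i]]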

print(test.shape)
test[['비전기냉방설비운영','태양광보유']].head()

(10080, 9)


Unnamed: 0,비전기냉방설비운영,태양광보유
0,0.0,0.0
1,0.0,0.0
2,0.0,0.0
3,0.0,0.0
4,0.0,0.0
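
The same lookup can be written without explicit loops. A minimal sketch on toy data, assuming (as the loop above does) that each building has a single value per attribute in `train`; `toy_train`, `toy_test`, and `building_info` are illustrative names, not part of the notebook:

```python
import numpy as np
import pandas as pd

cols = ['비전기냉방설비운영', '태양광보유']

# Toy stand-ins for train/test: two buildings, attributes constant per building
toy_train = pd.DataFrame({
    'num': [1, 1, 2, 2],
    '비전기냉방설비운영': [0.0, 0.0, 1.0, 1.0],
    '태양광보유': [1.0, 1.0, 0.0, 0.0],
})
toy_test = pd.DataFrame({
    'num': [1, 1, 2, 2],
    '비전기냉방설비운영': [0.0, np.nan, np.nan, 1.0],
    '태양광보유': [np.nan, np.nan, 0.0, np.nan],
})

# One row per building: the first recorded value of each attribute
building_info = toy_train.groupby('num')[cols].first()

# Look up each test row's building and overwrite both columns
for col in cols:
    toy_test[col] = toy_test['num'].map(building_info[col])

print(toy_test)
```

Unlike the stride-based loop, this version does not depend on the row order of `train`.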


In [9]:
# Add hour-of-day and day-of-week features.
def time(x):
    return int(x[-2:])   # the last two characters of 'YYYY-MM-DD HH' are the hour
train['time']=train['date_time'].apply(time)
test['time']=test['date_time'].apply(time)

def weekday(x):
    return pd.to_datetime(x[:10]).weekday()   # Monday=0 ... Sunday=6
train['weekday']=train['date_time'].apply(weekday)
test['weekday']=test['date_time'].apply(weekday)

print(train.shape)
train[['time', 'weekday']].head()

(122400, 12)


Unnamed: 0,time,weekday
0,0,0
1,1,0
2,2,0
3,3,0
4,4,0
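
As an alternative to row-wise `apply`, the whole column can be parsed once and both features read from the `.dt` accessor. A sketch on a few sample strings in the notebook's `YYYY-MM-DD HH` format; `sample` and `parsed` are illustrative names:

```python
import pandas as pd

sample = pd.DataFrame({'date_time': ['2020-06-01 00', '2020-06-01 13', '2020-08-25 03']})

# Parse the 'YYYY-MM-DD HH' strings once, then derive both features
parsed = pd.to_datetime(sample['date_time'], format='%Y-%m-%d %H')
sample['time'] = parsed.dt.hour
sample['weekday'] = parsed.dt.weekday  # Monday=0 ... Sunday=6

print(sample)
```

Parsing once avoids calling `pd.to_datetime` separately for every row, which is likely to matter on the 122,400-row training set.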


In [11]:
test.isnull().sum()

num                0
date_time          0
기온(°C)          6720
풍속(m/s)         6720
습도(%)           6720
강수량(mm, 6시간)    8400
일조(hr, 3시간)     6720
비전기냉방설비운영          0
태양광보유              0
time               0
weekday            0
dtype: int64

In [15]:
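# Interpolate missing values ('https://teddylee777.github.io/pandas/pandas-interpolation')
# Note: interpolate() returns a new DataFrame and the result is not assigned here,
# so test itself is unchanged and the NaNs remain in the output below.
# To keep the filled values, assign the result: test = test.interpolate(method='values')
test.interpolate(method='values')
test.head()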

Unnamed: 0,num,date_time,기온(°C),풍속(m/s),습도(%),"강수량(mm, 6시간)","일조(hr, 3시간)",비전기냉방설비운영,태양광보유,time,weekday
0,1,2020-08-25 00,27.8,1.5,74.0,0.0,0.0,0.0,0.0,0,1
1,1,2020-08-25 01,,,,,,0.0,0.0,1,1
2,1,2020-08-25 02,,,,,,0.0,0.0,2,1
3,1,2020-08-25 03,27.3,1.1,78.0,,0.0,0.0,0.0,3,1
4,1,2020-08-25 04,,,,,,0.0,0.0,4,1
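
A hedged sketch of how the interpolated values could be kept, shown on toy data. It also interpolates within each building separately so that values are not blended across building boundaries; the per-building grouping and `limit_direction='both'` are choices made for this illustration, not something the notebook does:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'num': [1, 1, 1, 1, 2, 2, 2, 2],
    '기온(°C)': [27.8, np.nan, np.nan, 27.2, np.nan, 20.0, np.nan, 22.0],
})
weather_cols = ['기온(°C)']

# Interpolate linearly inside each building; limit_direction='both' also fills
# leading/trailing gaps with the nearest observed value.
filled = toy.groupby('num')[weather_cols].transform(
    lambda s: s.interpolate(limit_direction='both')
)
toy[weather_cols] = filled  # assign back so the result is kept

print(toy)
```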


## Model (step. 4)

In [22]:
import math
import os
import matplotlib.pyplot as plt

from sklearn.metrics import mean_absolute_error

from lightgbm import LGBMRegressor
from sklearn.model_selection import KFold

In [17]:
train.columns

Index(['num', 'date_time', '전력사용량(kWh)', '기온(°C)', '풍속(m/s)', '습도(%)',
       '강수량(mm)', '일조(hr)', '비전기냉방설비운영', '태양광보유', 'time', 'weekday'],
      dtype='object')

In [20]:
feature_names = ['num', 
           '기온(°C)', 
           '풍속(m/s)', 
           '습도(%)',
           '강수량(mm)', 
           '일조(hr)', 
           '비전기냉방설비운영', 
           '태양광보유', 
           'time', 
           'weekday']

label_name = ['전력사용량(kWh)']

In [21]:
X_train = train[feature_names]
y_train = train[label_name]

In [24]:
cross=KFold(n_splits=5, shuffle=True, random_state=42)
folds=[]
for train_idx, valid_idx in cross.split(X_train, y_train):
    folds.append((train_idx, valid_idx))
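
To see what each entry of `folds` holds, here is a small sketch on ten dummy rows: every row lands in exactly one validation fold, and the training indices are the remainder. `X_demo` is a stand-in array, not the competition data:

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # ten dummy samples

cross = KFold(n_splits=5, shuffle=True, random_state=42)
folds = list(cross.split(X_demo))  # list of (train_idx, valid_idx) pairs

for fold, (train_idx, valid_idx) in enumerate(folds, start=1):
    print(fold, train_idx, valid_idx)

# Collect all validation indices to confirm they cover every row once
all_valid = np.sort(np.concatenate([valid_idx for _, valid_idx in folds]))
```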

In [26]:
models={}
for fold in range(5):
    print(f'===================={fold+1}=======================')
    train_idx, valid_idx=folds[fold]
    
    X_trn, X_val = X_train.iloc[train_idx], X_train.iloc[valid_idx]
    y_trn, y_val = y_train.iloc[train_idx], y_train.iloc[valid_idx]
    
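    model=LGBMRegressor(n_estimators = 400, learning_rate = 0.1)
    # early_stopping_rounds and verbose are accepted by fit() in the LightGBM version used here;
    # LightGBM 4.0+ removed them in favor of callbacks
    # (lightgbm.early_stopping and lightgbm.log_evaluation).
    model.fit(X_trn, y_trn, eval_set=[(X_trn, y_trn), (X_val, y_val)], 
             early_stopping_rounds=30, verbose=100)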
    models[fold]=model
    
    print(f'================================================\n\n')

Training until validation scores don't improve for 30 rounds
[100]	training's l2: 110589	valid_1's l2: 110225
[200]	training's l2: 75584	valid_1's l2: 78493.7
[300]	training's l2: 61181.3	valid_1's l2: 66570.1
[400]	training's l2: 51936.5	valid_1's l2: 59641.1
Did not meet early stopping. Best iteration is:
[400]	training's l2: 51936.5	valid_1's l2: 59641.1


Training until validation scores don't improve for 30 rounds
[100]	training's l2: 105803	valid_1's l2: 118195
[200]	training's l2: 71673	valid_1's l2: 85873.9
[300]	training's l2: 58798.5	valid_1's l2: 74705.2
[400]	training's l2: 51118.7	valid_1's l2: 68760.2
Did not meet early stopping. Best iteration is:
[400]	training's l2: 51118.7	valid_1's l2: 68760.2


Training until validation scores don't improve for 30 rounds
[100]	training's l2: 110710	valid_1's l2: 110163
[200]	training's l2: 74558	valid_1's l2: 77577.6
[300]	training's l2: 60377.4	valid_1's l2: 65985.7
[400]	training's l2: 52284.1	valid_1's l2: 60599.1
Did not meet ea
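
The `l2` values in the log are mean squared errors, so taking the square root puts them back in the unit of the target (kWh). A small sketch using the final validation scores printed above for the three folds shown:

```python
import math

# Final valid_1 l2 (MSE) values copied from the log above
valid_l2 = [59641.1, 68760.2, 60599.1]

# RMSE is the square root of MSE, expressed in the target's unit (kWh)
valid_rmse = [math.sqrt(v) for v in valid_l2]
for fold, rmse in enumerate(valid_rmse, start=1):
    print(f'fold {fold}: RMSE = {rmse:.1f}')
```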

## Submission (step. 5)

In [29]:
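# Drop date_time so the remaining 10 columns line up positionally with feature_names
# (two weather column names differ between train and test).
test.drop('date_time', axis=1, inplace=True)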

In [30]:
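submission = pd.read_csv("C:/Users/User/Downloads/data/energy/sample_submission.csv")
# Average the predictions of the 5 fold models
# (this assumes 'answer' is initialized to 0 in the sample file).
for i in range(5):
    submission['answer'] += models[i].predict(test)/5 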
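
The loop averages the five fold models by adding one fifth of each prediction to `answer`. The sketch below shows the same idea with `np.mean` on stand-in models; `ConstantModel` is a hypothetical placeholder with a `predict` method, used only so the example runs without training anything:

```python
import numpy as np

class ConstantModel:
    """Hypothetical stand-in for a fitted regressor: always predicts `value`."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value, dtype=float)

demo_models = {i: ConstantModel(100.0 + 10 * i) for i in range(5)}
X_demo = np.zeros((3, 10))  # three dummy rows with ten features

# Incremental form, as in the notebook: start from zero and add prediction / 5
answer_loop = np.zeros(len(X_demo))
for i in range(5):
    answer_loop += demo_models[i].predict(X_demo) / 5

# Equivalent vectorized form: stack the predictions and average over models
answer_mean = np.mean([m.predict(X_demo) for m in demo_models.values()], axis=0)
```

The vectorized form does not rely on the starting value of `answer`, which the incremental form does.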

In [33]:
# Save the submission file
submission.to_csv('C:/Users/User/Downloads/data/energy/baseline_submission_0529.csv', index=False)