# Predictive Consumption Model

This document describes how the consumption model is trained and how consumption is predicted for each module fitted with a current sensor.


## Star-Schema Data Model

### Dimension (lookup) tables

**lk_module**
- MOD_ID: int (PK)
- DEV_ID: int
- DEV_NAME: varchar
- MOD_NAME: varchar
- MOD_TYPE_ID: int
- MOD_TYPE_NAME: varchar

**lk_time**
- TIMESTAMP: int (PK)
- MINUTE: int
- MINDAY: int
- HOUR: int
- DAY: int
- WEEKDAY: int
- WEEKDAY_NAME: varchar
- MONTH: int
- MONTH_HOUR: int
- YEARDAY: int
- YEAR: int

**lk_sensor**
- SENSOR_TYPE_ID: int (PK)
- SENSOR_TYPE_NAME: varchar
- UNIT: varchar

### Fact tables

**agg_power_consumption**
- id_power_consumption: int (SK)
- TIMESTAMP: int (fk)
- MOD_ID: int (fk)
- WATT_HOUR: float
- WATT_HOUR_ACC: int
- TOTAL_WATT_PER: float

**bt_events**
- id_event: int (SK)
- TIMESTAMP: int (fk)
- MOD_ID: int (fk)
- SENSOR_TYPE_ID: int (fk)
- MODULE_STATE: int
- SENSED_VALUE: float

## Library imports

In [43]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
from datetime import datetime, timedelta
import MySQLdb
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

%matplotlib inline

pd.options.mode.chained_assignment = None  # default='warn'

## Configuration
### Database connections

In [44]:
# Transactional database
dbname_trx = "ratio_dev_eric"
dbhost_trx = "localhost"
dbport_trx = 3306
dbuser_trx = "root"
dbpass_trx = "root"

# Historical database
dbname_hist = "ratio_dwh"
dbhost_hist = "localhost"
dbport_hist = 3306
dbuser_hist = "root"
dbpass_hist = "root"
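
`create_engine` is imported above, but the cells below connect through `MySQLdb` directly. As a minimal sketch, assuming the `mysql+mysqldb` SQLAlchemy dialect and a hypothetical helper named `build_db_url` (written for Python 3), the settings above could be turned into a connection URL like this. Credentials are percent-encoded so that special characters survive:

```python
from urllib.parse import quote_plus


def build_db_url(user, password, host, port, dbname):
    """Build a SQLAlchemy-style MySQL URL (hypothetical helper, not part of the notebook)."""
    return "mysql+mysqldb://{0}:{1}@{2}:{3}/{4}".format(
        quote_plus(user), quote_plus(password), host, port, dbname
    )


# Values copied from the historical database settings above
hist_url = build_db_url("root", "root", "localhost", 3306, "ratio_dwh")
print(hist_url)
```

The resulting string could then be handed to `create_engine(hist_url)`, and the engine passed as `con` to `pd.read_sql`.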

### Lookup tables (data sources)
The data sources come from the transactional database: the events recorded there in real time are picked up and added to the historical database.

In [45]:
devices_tbl = "devices"
device_modules_tbl = "device_modules"
module_types_tbl = "module_types"
sensor_types_tbl = "sensor_types"
device_events_tbl = "device_events"
device_event_sensors_tbl = "device_event_sensors"

### DWH schema

In [46]:
agg_power_consumption = "agg_power_consumption"
lk_module = "lk_module"
lk_time = "lk_time"
lk_sensor = "lk_sensor"
bt_events = "bt_events"

### Global variables

In [47]:
# Mains voltage (in V)
voltage = 230
# sensor_type_id of the energy (current) sensor: 1
sensor_type_id = 1
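
The mains voltage is what turns current readings into energy. A minimal sketch of that conversion, assuming per-minute readings expressed in amperes and a purely resistive load (power factor 1). The function name `current_to_watt_hours` and the sample values are illustrative and are not taken from the ETL code:

```python
VOLTAGE = 230  # mains voltage in V, same value as `voltage` above


def current_to_watt_hours(currents_amp, sample_minutes=1.0):
    """Integrate power (V * I) over equally spaced current samples.

    Assumption: each sample holds for `sample_minutes` minutes.
    """
    hours_per_sample = sample_minutes / 60.0
    return sum(VOLTAGE * amps * hours_per_sample for amps in currents_amp)


# Illustrative data: one hour of a constant 0.01 A draw, sampled every minute
one_hour = [0.01] * 60
print(round(current_to_watt_hours(one_hour), 3))
```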

## Training the ANN regressor

### Extracting data from the DWH database

In [48]:
conn = MySQLdb.connect(host=dbhost_hist, port=dbport_hist, user=dbuser_hist, passwd=dbpass_hist, db=dbname_hist)

df_modules = pd.read_sql('SELECT * FROM ' + lk_module, con=conn)
df_time = pd.read_sql('SELECT * FROM ' + lk_time, con=conn)
df_sensors = pd.read_sql('SELECT * FROM ' + lk_sensor, con=conn)
df_events = pd.read_sql('SELECT * FROM ' + bt_events, con=conn)
df_agg_power = pd.read_sql('SELECT * FROM ' + agg_power_consumption, con=conn)

conn.close()
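
The queries above are built by concatenating table names into SQL. That is fine while the names are constants defined in the notebook. In case they ever come from outside, a guard could look like the following sketch, where `select_all_sql` is a hypothetical helper and the allowed names are the ones assigned in the cell that defines the DWH tables:

```python
# Table names used by the notebook for the DWH
DWH_TABLES = {"agg_power_consumption", "lk_module", "lk_time", "lk_sensor", "bt_events"}


def select_all_sql(table_name):
    """Return a SELECT * statement, refusing names outside the known DWH schema."""
    if table_name not in DWH_TABLES:
        raise ValueError("Unknown DWH table: {0!r}".format(table_name))
    return "SELECT * FROM `{0}`".format(table_name)


print(select_all_sql("lk_module"))
```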

The df_modules dataset represents the lk_module table:

In [49]:
df_modules

Unnamed: 0,MOD_ID,MOD_NAME,MOD_TYPE_ID,MOD_TYPE_NAME,DEV_ID,DEV_NAME
0,51,Lux Cocina mod,1,LUX,5,Lux Cocina
1,52,Pot Cocina,2,POTENTIA,5,Lux Cocina
2,61,Pot 1,2,POTENTIA,6,Potentia Heladera
3,71,Omni 1,3,OMNI,7,Omni Cocina


The df_sensors dataset represents the lk_sensor table:

In [50]:
df_sensors

Unnamed: 0,SENSOR_TYPE_ID,SENSOR_TYPE_NAME,UNIT
0,1,CURRENT,mAmp
1,2,LUMINOSITY,lum
2,3,MOVEMENT,
3,4,SOUND,dB
4,5,TEMPERATURE,C


The time dimension dataset:

In [72]:
df_time.head(5)

Unnamed: 0,TIMESTAMP,YEAR,YEARDAY,MONTH,WEEKDAY,WEEKDAY_NAME,DAY,HOUR,MONTH_HOUR,MINUTE,MINDAY
0,2017-01-01 00:00:00,2017,1,1,6,Sunday,1,0,0,0,0
1,2017-01-01 00:01:00,2017,1,1,6,Sunday,1,0,0,1,1
2,2017-01-01 00:02:00,2017,1,1,6,Sunday,1,0,0,2,2
3,2017-01-01 00:03:00,2017,1,1,6,Sunday,1,0,0,3,3
4,2017-01-01 00:04:00,2017,1,1,6,Sunday,1,0,0,4,4
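
The derived columns of this dimension can be reproduced from a plain timestamp. The sketch below infers the column semantics from the sample rows in this notebook (for example, `MONTH_HOUR` as hours elapsed since the start of the month and `WEEKDAY` with Monday as 0). The actual ETL that fills `lk_time` is not shown here, so treat the formulas as assumptions:

```python
from datetime import datetime


def time_dimension_row(ts):
    """Derive lk_time-style columns from a datetime (column semantics are inferred)."""
    return {
        "TIMESTAMP": ts,
        "YEAR": ts.year,
        "YEARDAY": ts.timetuple().tm_yday,
        "MONTH": ts.month,
        "WEEKDAY": ts.weekday(),                    # Monday=0 ... Sunday=6
        "WEEKDAY_NAME": ts.strftime("%A"),          # depends on the active locale
        "DAY": ts.day,
        "HOUR": ts.hour,
        "MONTH_HOUR": (ts.day - 1) * 24 + ts.hour,  # hours since the month started
        "MINUTE": ts.minute,
        "MINDAY": ts.hour * 60 + ts.minute,         # minutes since midnight
    }


row = time_dimension_row(datetime(2017, 8, 20, 13, 0))
print(row["MONTH_HOUR"], row["YEARDAY"], row["MINDAY"])
```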


The df_events dataset will be used to generate the fact tables.

In [51]:
print('Events retrieved: {0}'.format(df_events.shape[0]))
df_events.head(5)

Events retrieved: 2330730


Unnamed: 0,id,MOD_ID,TIMESTAMP,MODULE_STATE,SENSOR_TYPE_ID,SENSED_VALUE
0,0,51,2017-01-01 00:00:00,0,2,2.0
1,1,51,2017-01-01 00:00:00,0,4,34.952475
2,2,51,2017-01-01 00:00:00,0,3,0.0
3,3,51,2017-01-01 00:00:00,0,1,0.009875
4,4,51,2017-01-01 00:01:00,0,2,4.0


In [52]:
print('Consumption records retrieved: {0}'.format(df_agg_power.shape[0]))
df_agg_power.head(5)

Consumption records retrieved: 15540


Unnamed: 0,id,TIMESTAMP,MOD_ID,WATT_HOUR,WATT_HOUR_ACC,TOTAL_WATT_PER
0,0,2017-01-01 00:00:00,51,2.459562,2.0,0.0003
1,1,2017-01-01 01:00:00,51,4.477683,6.0,0.0009
2,2,2017-01-01 02:00:00,51,4.222319,11.0,0.001651
3,3,2017-01-01 03:00:00,51,4.788357,15.0,0.002251
4,4,2017-01-01 04:00:00,51,3.241812,19.0,0.002852


#### Testing the integration of the dimension tables with the fact table
The cube is built for the current sensor.

In [55]:
df_power_cube = df_agg_power.merge(df_modules,on='MOD_ID').merge(df_time,on='TIMESTAMP')
if df_power_cube.duplicated().any():
    print('Removing duplicates (keeping the first occurrence)')
    df_power_cube.drop_duplicates(inplace=True, keep='first')
else:
    print('No duplicate events found')
df_power_cube.head(5)

No duplicate events found


Unnamed: 0,id,TIMESTAMP,MOD_ID,WATT_HOUR,WATT_HOUR_ACC,TOTAL_WATT_PER,MOD_NAME,MOD_TYPE_ID,MOD_TYPE_NAME,DEV_ID,...,YEAR,YEARDAY,MONTH,WEEKDAY,WEEKDAY_NAME,DAY,HOUR,MONTH_HOUR,MINUTE,MINDAY
0,0,2017-01-01 00:00:00,51,2.459562,2.0,0.0003,Lux Cocina mod,1,LUX,5,...,2017,1,1,6,Sunday,1,0,0,0,0
1,744,2017-01-01 00:00:00,52,78.473049,78.0,0.001305,Pot Cocina,2,POTENTIA,5,...,2017,1,1,6,Sunday,1,0,0,0,0
2,1,2017-01-01 01:00:00,51,4.477683,6.0,0.0009,Lux Cocina mod,1,LUX,5,...,2017,1,1,6,Sunday,1,1,1,0,60
3,745,2017-01-01 01:00:00,52,53.816835,132.0,0.002209,Pot Cocina,2,POTENTIA,5,...,2017,1,1,6,Sunday,1,1,1,0,60
4,2,2017-01-01 02:00:00,51,4.222319,11.0,0.001651,Lux Cocina mod,1,LUX,5,...,2017,1,1,6,Sunday,1,2,2,0,120


### Verifying dataset completeness
Building a predictive model requires a complete dataset with no gaps in the measurements. It is therefore necessary to verify that data exists for every hour of the month under analysis.

In [56]:
df_power_cube.isnull().any()

id                False
TIMESTAMP         False
MOD_ID            False
WATT_HOUR         False
WATT_HOUR_ACC     False
TOTAL_WATT_PER    False
MOD_NAME          False
MOD_TYPE_ID       False
MOD_TYPE_NAME     False
DEV_ID            False
DEV_NAME          False
YEAR              False
YEARDAY           False
MONTH             False
WEEKDAY           False
WEEKDAY_NAME      False
DAY               False
HOUR              False
MONTH_HOUR        False
MINUTE            False
MINDAY            False
dtype: bool
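
Note that `isnull()` only detects missing values in rows that exist; an hour with no row at all would go unnoticed. A minimal sketch of a gap check, assuming one reading per hour for a given module. `missing_hours` is a hypothetical helper and the data is synthetic:

```python
import calendar
from datetime import datetime, timedelta


def missing_hours(observed, year, month):
    """Return the hourly timestamps of (year, month) that are absent from `observed`."""
    days_in_month = calendar.monthrange(year, month)[1]
    start = datetime(year, month, 1)
    expected = [start + timedelta(hours=h) for h in range(days_in_month * 24)]
    seen = set(observed)
    return [ts for ts in expected if ts not in seen]


# Synthetic data: January 2017 with the 02:00 reading of day 1 removed
january = [datetime(2017, 1, 1) + timedelta(hours=h) for h in range(31 * 24)]
january.remove(datetime(2017, 1, 1, 2))
print(missing_hours(january, 2017, 1))
```

In the notebook, `observed` would be the `TIMESTAMP` values of a single module, converted to Python datetimes if needed.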

#### Fixing null values
Several options are available:
- pad: fill with the last known value
- interpolate: fill with values interpolated from the column (probably the most appropriate for continuous values).
    The interpolation can be:
    - Linear
    - Quadratic
    - Cubic
- scalar: fill with a specific number
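
To make the difference between these options concrete, here is a pure-Python sketch of what each one does to a series with gaps. The helper functions are illustrative and are not the pandas implementations; in particular, `fill_linear` only fills gaps that have a known value on both sides:

```python
def fill_pad(values):
    """Replace each None with the last known value (leading gaps stay None)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def fill_scalar(values, scalar):
    """Replace each None with a fixed number."""
    return [scalar if v is None else v for v in values]


def fill_linear(values):
    """Linearly interpolate interior runs of None between known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / float(right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


readings = [2.0, None, None, 5.0, None]  # illustrative hourly WATT_HOUR values
print(fill_pad(readings))
print(fill_scalar(readings, 0.0))
print(fill_linear(readings))
```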

In [59]:
#df_test_interp = pd.DataFrame(df_power_cube.WATT_HOUR.interpolate(method='linear'), columns=['WATT_HOUR_L'] )
#df_test_interp['WATT_HOUR_Q'] = df_power_cube.WATT_HOUR.interpolate(method='quadratic')
#df_test_interp['WATT_HOUR_C'] = df_power_cube.WATT_HOUR.interpolate(method='cubic')
#df_test_interp.plot()
df_power_cube['WATT_HOUR'] = df_power_cube.WATT_HOUR.interpolate(method='linear')

### Generating the target variable
The monthly consumption value is defined.

In [90]:
df_predict = df_power_cube.groupby(['YEAR','MONTH','MOD_ID']).WATT_HOUR.sum().astype(int)
df_predict = df_predict.reset_index()
df_predict = df_predict.rename(columns={'WATT_HOUR': 'WATT_HOUR_MONTH'})
df_predict = df_predict.merge(df_power_cube,on=['YEAR','MONTH','MOD_ID'])
#df_predict.set_index('TIMESTAMP',inplace=True)
df_predict['TOTAL_WATT_PER'] = df_predict.WATT_HOUR_ACC / df_predict.WATT_HOUR_MONTH

In [91]:
df_predict.head(5)

Unnamed: 0,YEAR,MONTH,MOD_ID,WATT_HOUR_MONTH,id,TIMESTAMP,WATT_HOUR,WATT_HOUR_ACC,TOTAL_WATT_PER,MOD_NAME,...,DEV_ID,DEV_NAME,YEARDAY,WEEKDAY,WEEKDAY_NAME,DAY,HOUR,MONTH_HOUR,MINUTE,MINDAY
0,2017,1,51,6663,0,2017-01-01 00:00:00,2.459562,2.0,0.0003,Lux Cocina mod,...,5,Lux Cocina,1,6,Sunday,1,0,0,0,0
1,2017,1,51,6663,1,2017-01-01 01:00:00,4.477683,6.0,0.0009,Lux Cocina mod,...,5,Lux Cocina,1,6,Sunday,1,1,1,0,60
2,2017,1,51,6663,2,2017-01-01 02:00:00,4.222319,11.0,0.001651,Lux Cocina mod,...,5,Lux Cocina,1,6,Sunday,1,2,2,0,120
3,2017,1,51,6663,3,2017-01-01 03:00:00,4.788357,15.0,0.002251,Lux Cocina mod,...,5,Lux Cocina,1,6,Sunday,1,3,3,0,180
4,2017,1,51,6663,4,2017-01-01 04:00:00,3.241812,19.0,0.002852,Lux Cocina mod,...,5,Lux Cocina,1,6,Sunday,1,4,4,0,240
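
The same target can be sketched without pandas to make the arithmetic explicit: sum the hourly readings of a month, accumulate them, and divide each running total by the monthly sum. The function name and the data are illustrative. Unlike the notebook cell, this sketch computes the running total itself instead of reading `WATT_HOUR_ACC` from the DWH, and it applies no integer truncation:

```python
from itertools import accumulate


def accumulated_fraction(hourly_wh):
    """Return, for each hour, the share of the month's consumption used so far."""
    month_total = sum(hourly_wh)
    return [running / month_total for running in accumulate(hourly_wh)]


# Illustrative data: four hourly readings standing in for a whole month
hourly = [2.0, 4.0, 4.0, 10.0]
print(accumulated_fraction(hourly))
```

The last element is always 1.0, which is the property the model relies on: the accumulated value at the end of the month equals the monthly total.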


### Predictive model
- Target variable: accumulated consumption percentage

Since the accumulated-value variable reaches, at its maximum, the same value as the quantity we ultimately want (the monthly consumption), it is valid to build a model that predicts how it grows over the month.

The goal is to predict the rate at which accumulated consumption grows at each hour of the month. To that end, a new variable called TOTAL_WATT_PER is generated, which represents the share of the monthly consumption accumulated so far (stored as a fraction between 0 and 1). This makes it possible to train a bivariate regression model between the relative hour of the month and the consumption percentage. It is the simplest way to obtain a regression that represents how consumption grows hour by hour within each month.
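
In symbols, using notation introduced here only for illustration: let $A(h)$ be the consumption accumulated up to month hour $h$ (`WATT_HOUR_ACC`), $M$ the monthly total (`WATT_HOUR_MONTH`) and $\hat{p}(h)$ the regressor's output. The target, and the way a prediction is turned back into a monthly estimate $\hat{M}$, can then be written as:

```latex
p(h) = \frac{A(h)}{M},
\qquad
\hat{M} = \frac{A(h)}{\hat{p}(h)}
```

At the last hour of the month $A(h) = M$, so $p(h) = 1$.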

In [67]:
# Builds and evaluates the model for a single module
def create_module_regressor(mod_id):
    df_predict_mod = df_predict.query('MOD_ID == ' + str(mod_id))

    if df_predict_mod.shape[0] > 0:
        # Data preparation (70/30 split)
        features_names = ['MONTH_HOUR', 'MONTH']
        target_names = ['TOTAL_WATT_PER']

        features_data = df_predict_mod[features_names].values   # Predictor columns
        target_data = df_predict_mod[target_names].values       # Value to predict
        split_test_size = 0.30                                   # 0.30 means 30% is held out for testing

        features_train, features_test, target_train, target_test = train_test_split(features_data, target_data, test_size=split_test_size)

        # Training and evaluation
        # Note: shuffle only applies to the 'sgd' and 'adam' solvers; it has no effect with 'lbfgs'
        mlp = MLPRegressor(solver='lbfgs', hidden_layer_sizes=(100,), max_iter=300, shuffle=True, activation='identity')
        mlp.fit(features_train, target_train.ravel())

        # score() returns R^2, computed here on the held-out test split
        print('Test score for module {0}: {1:.4f}'.format(mod_id, mlp.score(features_test, target_test)))

        return mlp
    else:
        # No data for this module: nothing to train
        return None

In [68]:
# Try each module
for mod_id in df_modules.MOD_ID:
    mlp = create_module_regressor(mod_id)

Test score for module 51: 0.9657
Test score for module 52: 0.9630
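
Note that with `activation='identity'` every layer of the network is a linear map, so the fitted model is itself linear in `MONTH_HOUR` and `MONTH`. As a rough illustration of that idea, and not a replacement for the `MLPRegressor` call above, the sketch below fits a one-feature ordinary least-squares line to synthetic data and reports R², the same metric that `score` returns for regressors. All names and data are illustrative:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = float(len(xs))
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / float(len(ys))
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic month of 744 hours whose accumulated share grows perfectly linearly
hours = list(range(744))
share = [(h + 1) / 744.0 for h in hours]
slope, intercept = fit_line(hours, share)
print(round(slope * 744, 6), round(r_squared(hours, share, slope, intercept), 6))
```

Real consumption is not perfectly linear over the month, which is why the scores reported by the notebook are below 1.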


## Consumption prediction
The prediction example uses the following data:
- module: 52
- date: 2017-08-20 13:00
- accumulated consumption: 37818 Wh
- monthly consumption to predict: 60357 Wh

In [69]:
df_predict_test = df_predict.query('MOD_ID == 52 & TIMESTAMP == "2017-08-20 13:00:00"')
df_predict_test

TIMESTAMP,YEAR,MONTH,MOD_ID,WATT_HOUR_MONTH,id,WATT_HOUR,WATT_HOUR_ACC,TOTAL_WATT_PER,MOD_NAME,MOD_TYPE_ID,...,DEV_ID,DEV_NAME,YEARDAY,WEEKDAY,WEEKDAY_NAME,DAY,HOUR,MONTH_HOUR,MINUTE,MINDAY
2017-08-20 13:00:00,2017,8,52,60357,11389,111.922743,37818.0,0.626572,Pot Cocina,2,...,5,Lux Cocina,232,6,Sunday,20,13,469,0,780


In [70]:
mlp = create_module_regressor(52)
total_watt_per = mlp.predict(df_predict_test[['MONTH_HOUR','MONTH']])[0]
print('Predicted accumulated share of the monthly consumption: {0:.2f}%'.format(total_watt_per * 100))

Test score for module 52: 0.9701
Predicted accumulated share of the monthly consumption: 66.37%


In [111]:
df_predict_test.WATT_HOUR_ACC

TIMESTAMP
2017-08-20 13:00:00    37818.0
Name: WATT_HOUR_ACC, dtype: float64

In [71]:
# .iloc[0] selects the first row by position, regardless of the index in use
consumo_mes_pred = df_predict_test.WATT_HOUR_ACC.iloc[0] / total_watt_per
print('Predicted consumption for the month: approximately {0} Wh'.format(int(consumo_mes_pred)))

Predicted consumption for the month: approximately 56982 Wh
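
The arithmetic of this example can be checked by hand. The sketch below uses only the figures printed above. The predicted share is entered as the rounded 66.37%, so the projection differs slightly from the notebook's output, which uses the unrounded value. Function names are illustrative:

```python
def project_month_total(accumulated_wh, predicted_share):
    """Scale the consumption so far up to a full month: M_hat = A / p_hat."""
    return accumulated_wh / predicted_share


def relative_error(predicted, actual):
    """Signed relative error of the prediction."""
    return (predicted - actual) / float(actual)


accumulated = 37818.0   # WATT_HOUR_ACC at 2017-08-20 13:00
actual_month = 60357    # WATT_HOUR_MONTH for module 52, August 2017
projected = project_month_total(accumulated, 0.6637)

print(int(projected))
print("{0:.1%}".format(relative_error(projected, actual_month)))
```

The model predicts a share of 66.37% where the observed share at that hour is 62.66% (37818 / 60357), so it expects less consumption still to come and the monthly total is underestimated by roughly 5.6%.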
