# Build complete DataFrame

Our first goal is to build the complete DataFrame of a building: an hourly time series that starts at the first hour recorded in the database and ends at the last one. Wherever an hour in between is missing, we fill it with `NaN` for later processing.

#### Directory structure

    ./
    notebook/
        |--- data-preprocessing
                |--- complete_dataframe.ipynb
    out/

In [1]:
import pandas as pd
import numpy as np
import pymongo as pm
import datetime

In [2]:
HOST = '161.67.142.141'
PORT = 27017
DB = 'differential_uclm_db'
DB_COUNTERRAW = 'CounterRawConsumption'

START_DAY = 5 # Day starts at 5:00 am

### Database connection

In [3]:
def connectDB() -> pm.database.Database:
    # Indexing a MongoClient by name returns a Database handle, not a client
    return pm.MongoClient(host=HOST, port=PORT)[DB]

In [4]:
db = connectDB()

## 1. Create hour index

First, we obtain the building's first and last registered hours, then build the hourly index between those two dates.

### First and last registered hours
Find the first and last registered hours for the specified building ID.

In [5]:
def firstHour(db: pm.database.Database, counter_id: int) -> datetime.datetime:
    # Oldest record of the counter: sort by timestamp ascending and keep one
    return list(db[DB_COUNTERRAW].find({'counterinfo_id': counter_id}).sort('timestamp', pm.ASCENDING).limit(1))[0]['timestamp']

def lastHour(db: pm.database.Database, counter_id: int) -> datetime.datetime:
    # Newest record of the counter: sort by timestamp descending and keep one
    return list(db[DB_COUNTERRAW].find({'counterinfo_id': counter_id}).sort('timestamp', pm.DESCENDING).limit(1))[0]['timestamp']

In [6]:
counter_id = 27 # Building ID example
# Snap both bounds to the day boundary so the index covers whole 24h days (05:00 to 04:00)
start = firstHour(db, counter_id).replace(hour=START_DAY)
end = lastHour(db, counter_id).replace(hour=START_DAY - 1)

start, end

(datetime.datetime(2011, 7, 26, 5, 0), datetime.datetime(2020, 3, 28, 4, 0))
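The effect of fixing the hours can be checked without a database. The sketch below is only an illustration: `first_seen` and `last_seen` are assumed raw bounds (chosen to resemble the readings shown later in this notebook), not values queried from MongoDB. It shows that snapping the start to 05:00 and the end to 04:00 gives a number of hours that is an exact multiple of 24.

```python
import datetime

START_DAY = 5  # same convention as the notebook: a day runs from 05:00 to 04:00

# Assumed raw bounds, standing in for what the database queries would return
first_seen = datetime.datetime(2011, 7, 26, 17, 0)
last_seen = datetime.datetime(2020, 3, 28, 22, 0)

start = first_seen.replace(hour=START_DAY)    # 05:00 on the first day
end = last_seen.replace(hour=START_DAY - 1)   # 04:00 on the last day

# Number of hourly steps between start and end, both ends included
n_hours = int((end - start).total_seconds() // 3600) + 1

print(start, end)
print(n_hours, n_hours % 24)
```

Note the side effects of this choice: the hours between 05:00 and the first real reading are added to the range (they will be `NaN` later), and any reading after 04:00 on the last day falls outside it.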

### Build hour index
From the first to the last registered hour, with a 1-hour step.

In [7]:
def createIndex(first: datetime.datetime, last: datetime.datetime) -> pd.DatetimeIndex:
    return pd.date_range(start=first, end=last, freq='1H')

In [8]:
index = createIndex(start, end)

index

DatetimeIndex(['2011-07-26 05:00:00', '2011-07-26 06:00:00',
               '2011-07-26 07:00:00', '2011-07-26 08:00:00',
               '2011-07-26 09:00:00', '2011-07-26 10:00:00',
               '2011-07-26 11:00:00', '2011-07-26 12:00:00',
               '2011-07-26 13:00:00', '2011-07-26 14:00:00',
               ...
               '2020-03-27 19:00:00', '2020-03-27 20:00:00',
               '2020-03-27 21:00:00', '2020-03-27 22:00:00',
               '2020-03-27 23:00:00', '2020-03-28 00:00:00',
               '2020-03-28 01:00:00', '2020-03-28 02:00:00',
               '2020-03-28 03:00:00', '2020-03-28 04:00:00'],
              dtype='datetime64[ns]', length=76032, freq='H')

## 2. Build complete DataFrame
Now we rebuild the complete DataFrame with the consumption for every hour in the index. Hours that are not found in the database become `NaN` when the DataFrame is reindexed with the hour index obtained above; negative consumptions are also meant to be treated as `NaN`.

In [9]:
def getDataFrame(db: pm.database.Database, counter_id: int) -> pd.DataFrame:
    cursor = db[DB_COUNTERRAW].find({'counterinfo_id': counter_id})
    df = pd.DataFrame(list(cursor))
    df = df.drop(columns=['_id', 'counterinfo_id']) # Drop the identifier columns
    
    df = df.set_index('timestamp') # Indexing dataframe by timestamp
    
    return df

In [10]:
df = getDataFrame(db, counter_id)
df

timestamp,consumption
2011-07-26 17:00:00,111.000000
2011-07-26 18:00:00,43.348334
2011-07-26 19:00:00,41.846246
2011-07-26 20:00:00,22.805419
2011-07-26 21:00:00,20.887574
...,...
2020-03-28 19:00:00,10.270344
2020-03-28 20:00:00,11.665155
2020-03-28 21:00:00,10.967742
2020-03-28 22:00:00,10.302608


### Reindex DataFrame

In [11]:
df = df.reindex(index=index)
df

,consumption
2011-07-26 05:00:00,
2011-07-26 06:00:00,
2011-07-26 07:00:00,
2011-07-26 08:00:00,
2011-07-26 09:00:00,
...,...
2020-03-28 00:00:00,11.066098
2020-03-28 01:00:00,10.978488
2020-03-28 02:00:00,10.858585
2020-03-28 03:00:00,10.967692
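The reindex above only covers hours that are absent from the database. As a minimal sketch on toy data (the readings and the `raw`, `full_index` and `filled` names are made up for illustration), the block below shows the same reindex plus one possible way to turn negative consumptions into `NaN` with `Series.where`. The masking step is an assumption about how that rule could be applied, not code taken from this notebook.

```python
import pandas as pd

# Toy readings: 02:00 is missing and 03:00 holds a negative value
raw = pd.DataFrame(
    {'consumption': [10.0, 12.0, -1.0, 11.0]},
    index=pd.to_datetime(['2020-01-01 00:00', '2020-01-01 01:00',
                          '2020-01-01 03:00', '2020-01-01 04:00']),
)

# Full hourly index between the first and last reading
full_index = pd.date_range(start=raw.index[0], end=raw.index[-1],
                           freq=pd.Timedelta(hours=1))

# Hours absent from `raw` appear as NaN after reindexing
filled = raw.reindex(index=full_index)

# Assumed extra step: keep values >= 0 and replace everything else with NaN
filled['consumption'] = filled['consumption'].where(filled['consumption'] >= 0)

print(filled)
```

After both steps, the 02:00 row (missing) and the 03:00 row (negative) are `NaN`, while the other three readings are unchanged.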


### Calculate day
The day must be recalculated because, as defined by `START_DAY`, each day starts at 5:00 am, so the hours from 00:00 to 04:00 belong to the previous day.

In [12]:
def calcDay(df: pd.DataFrame) -> pd.DataFrame:
    # Shift every timestamp back by START_DAY hours, then truncate to midnight
    df['day'] = (df.index - pd.Timedelta(hours=START_DAY)).normalize()
    
    return df

In [13]:
df = calcDay(df)
df

,consumption,day
2011-07-26 05:00:00,,2011-07-26
2011-07-26 06:00:00,,2011-07-26
2011-07-26 07:00:00,,2011-07-26
2011-07-26 08:00:00,,2011-07-26
2011-07-26 09:00:00,,2011-07-26
...,...,...
2020-03-28 00:00:00,11.066098,2020-03-27
2020-03-28 01:00:00,10.978488,2020-03-27
2020-03-28 02:00:00,10.858585,2020-03-27
2020-03-28 03:00:00,10.967692,2020-03-27
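To make the 5:00 am boundary concrete, here is a small sketch on four hand-picked timestamps (the `hours` and `days` names are illustrative). It applies the shift in vectorized form: subtract `START_DAY` hours, then truncate to midnight.

```python
import pandas as pd

START_DAY = 5  # a day starts at 05:00, as in the notebook

hours = pd.to_datetime(['2020-03-27 23:00', '2020-03-28 00:00',
                        '2020-03-28 04:00', '2020-03-28 05:00'])

# Shift back by START_DAY hours and drop the time of day
days = (hours - pd.Timedelta(hours=START_DAY)).normalize()

for hour, day in zip(hours, days):
    print(hour, '->', day.date())
```

Only the last timestamp, 05:00 on 28 March, is assigned to 2020-03-28; the other three belong to 2020-03-27.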


## 3. Reshape DataFrame into TimeSeries
Get a new DataFrame indexed by `day`, with each day's 24 hourly consumptions.

In [14]:
consumption = np.asarray(df['consumption'])
consumption = consumption.reshape((len(df['day']) // 24, 24)) # Reshape each day with its 24 consumptions

consumptions = pd.DataFrame({'consumptions': consumption.tolist()})

consumptions

,consumptions
0,"[nan, nan, nan, nan, nan, nan, nan, nan, nan, ..."
1,"[17.0, 19.0, 18.3507946535444, 35.846312818818..."
2,"[18.8887041808661, 18.8030088936913, 18.845892..."
3,"[20.0, 21.0, 20.0, 37.7887789876153, 45.845704..."
4,"[17.2981132075472, 17.0, 17.2396974482587, 17...."
...,...
3163,"[10.1170330737468, 10.9676878955827, 10.967739..."
3164,"[10.8590892649269, 10.9677121385118, 10.967712..."
3165,"[10.9677235262438, 10.3559814000229, 10.579513..."
3166,"[10.9677623111933, 10.6143072999924, 10.321141..."
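The reshape relies on the hourly series being an exact multiple of 24 long and ordered in time, so that each row of the reshaped array is one day. A minimal sketch on two toy days (the `hourly`, `matrix` and `per_day` names are illustrative):

```python
import numpy as np
import pandas as pd

# Two toy days of hourly values: 0..23 for the first day, 24..47 for the second
hourly = pd.Series(np.arange(48, dtype=float))
hourly.iloc[3] = np.nan  # a missing hour is carried through the reshape

# The reshape only makes sense when the length is a multiple of 24
assert len(hourly) % 24 == 0

matrix = hourly.to_numpy().reshape(-1, 24)  # one row per day, one column per hour
per_day = pd.DataFrame({'consumptions': matrix.tolist()})

print(matrix.shape)
print(per_day['consumptions'].iloc[1][:3])
```

Here `reshape(-1, 24)` lets NumPy infer the number of days, which is equivalent to computing `len(...) // 24` by hand.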


### Index by day

In [15]:
days = df['day'].drop_duplicates().tolist()

weekdays = [day.weekday() for day in days] # Monday is 0, Sunday is 6

consumptions = pd.concat([pd.DataFrame({'day': days, 'weekday': weekdays}), consumptions], axis=1)
consumptions = consumptions.set_index(['day'])

consumptions.insert(0, 'building_id', counter_id)

consumptions

day,building_id,weekday,consumptions
2011-07-26,27,1,"[nan, nan, nan, nan, nan, nan, nan, nan, nan, ..."
2011-07-27,27,2,"[17.0, 19.0, 18.3507946535444, 35.846312818818..."
2011-07-28,27,3,"[18.8887041808661, 18.8030088936913, 18.845892..."
2011-07-29,27,4,"[20.0, 21.0, 20.0, 37.7887789876153, 45.845704..."
2011-07-30,27,5,"[17.2981132075472, 17.0, 17.2396974482587, 17...."
...,...,...,...
2020-03-23,27,0,"[10.1170330737468, 10.9676878955827, 10.967739..."
2020-03-24,27,1,"[10.8590892649269, 10.9677121385118, 10.967712..."
2020-03-25,27,2,"[10.9677235262438, 10.3559814000229, 10.579513..."
2020-03-26,27,3,"[10.9677623111933, 10.6143072999924, 10.321141..."
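Since the `NaN` values are kept for later processing, a natural next step is to see how many hours each day is missing. The sketch below uses a toy frame shaped like the result above; the `missing_hours` column is a hypothetical helper, not part of the notebook. It also confirms the weekday convention: 2011-07-26 was a Tuesday, which is `1` when Monday is `0`.

```python
import numpy as np
import pandas as pd

# Toy frame with the same layout: one list of 24 hourly values per day
toy = pd.DataFrame(
    {'consumptions': [[np.nan] * 12 + [1.0] * 12, [2.0] * 24]},
    index=pd.to_datetime(['2011-07-26', '2011-07-27']),
)
toy.index.name = 'day'

# Weekday from the day index (Monday is 0, Sunday is 6)
toy['weekday'] = toy.index.weekday

# Hypothetical helper column: number of NaN hours in each day
toy['missing_hours'] = toy['consumptions'].apply(
    lambda values: int(np.isnan(values).sum())
)

print(toy[['weekday', 'missing_hours']])
```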
