# Obtain weekday prototypes
The main objective of this task is to create a prototype for every weekday. We model two types of day, based on the consumption activity of each building:
- **Active** day.
- **Inactive** day.

Thus, for each counter in the database, we'll get 13 day prototypes (6 working days * 2 types of day + 1 inactive day corresponding to Sundays). Moreover, there are 97 different buildings, so we expect to get 13 * 97 = 1261 prototypical days.
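
As a quick check on this count, the (weekday, active) combinations can be enumerated directly. This is only an illustrative sketch: the names `prototype_keys` and `N_BUILDINGS` are assumptions introduced here, and weekdays follow the 0 = Monday to 6 = Sunday encoding of the `weekday` column.

```python
from itertools import product

N_BUILDINGS = 97  # number of buildings stated in the text (hypothetical constant name)

# Monday (0) to Saturday (5) get an active and an inactive prototype;
# Sunday (6) only gets an inactive one.
prototype_keys = list(product(range(6), (True, False)))
prototype_keys.append((6, False))

print(len(prototype_keys))                # 13
print(len(prototype_keys) * N_BUILDINGS)  # 1261
```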

This activity is defined from the mean of the Sundays' consumption for each building: a day whose consumption is greater than this value plus some margin is labelled as active, while a day whose consumption is lower than or equal to this value plus the margin is labelled as inactive.

#### Directory structure
./<br></br>
notebook/<br></br>
    &emsp;|--- data-preprocessing<br></br>
    &emsp;&emsp;&emsp;&emsp;|--- weekday_prototypes.ipynb<br></br>
out/<br></br>
    &emsp;|--- raw_consumptions.zip

In [1]:
CONS_PATH = 'C:/Users/thmas/OneDrive - Universidad de Castilla-La Mancha/Informática/TFG/out/'

In [2]:
import pandas as pd
import numpy as np

In [3]:
counter_id = 487 # Counter ID example

raw = pd.read_pickle(CONS_PATH + 'raw_consumptions.zip')
raw

day,building_id,weekday,consumptions
2012-02-24,89,4,"[nan, nan, nan, nan, 0.0, 25.9682072759303, 34..."
2012-02-25,89,5,"[8.0, 8.56965980289508, 7.83041664589254, 7.83..."
2012-02-26,89,6,"[9.0, 9.0, 8.47872481882854, 8.52127518117146,..."
2012-02-27,89,0,"[9.93594069444675, 9.0, 10.0, 18.4133936140153..."
2012-02-28,89,1,"[15.0, 15.0, 15.0, 23.0, 41.3474893206788, 39...."
...,...,...,...
2020-03-28,2233,5,"[8.96294314928535, 9.1999884489703, 9.22916758..."
2020-03-29,2233,6,"[9.05122649923577, 9.10856876843712, 9.0668798..."
2020-03-30,2233,0,"[9.14786320617928, 9.46424320377272, 12.979311..."
2020-03-31,2233,1,"[9.09777728991234, 9.49875136817228, 13.959012..."


In [4]:
raw_df = raw[raw['building_id'] == counter_id]
raw_df

day,building_id,weekday,consumptions
2013-12-17,487,1,"[nan, nan, nan, nan, nan, nan, nan, 4.49733527..."
2013-12-18,487,2,"[12.1029321298894, 12.1029321298894, 12.102932..."
2013-12-19,487,3,"[11.264909064798, 11.264909064798, 11.26490906..."
2013-12-20,487,4,"[10.9838823956164, 10.9838823956164, 10.983882..."
2013-12-21,487,5,"[6.93115242178077, 7.59915393780765, 7.5991539..."
...,...,...,...
2020-03-28,487,5,"[8.0, 9.0, 8.0, 8.0, 9.0, 9.0, 9.0, 9.0, 9.0, ..."
2020-03-29,487,6,"[9.0, 8.0, 8.0, 9.0, 8.0, 8.0, 9.0, 9.0, 8.089..."
2020-03-30,487,0,"[10.0, 10.0, 11.0, 12.8805883330563, 12.119411..."
2020-03-31,487,1,"[8.0, 7.0, 7.0, 10.0, 11.0, 14.0, 14.0, 11.154..."


### Obtaining prototype measures
To get the required measures for every day, we first obtain them for Sundays, which are assumed to be inactive days. Each remaining day is then labelled as active or inactive by comparing its total daily consumption with a threshold derived from the Sundays' daily totals:

- **Active days** &rarr; daily total within (sundays.mean + sundays.std, +$\infty$)
- **Inactive days** &rarr; daily total within [0, sundays.mean + sundays.std]

We'll store all these prototypical days (every building has 13, as previously discussed) in a pandas DataFrame for later use.
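
The labelling rule can be sketched on a few made-up numbers. The totals below are hypothetical, not taken from the dataset, and `n_std` is an assumed name for the margin multiplier; it is set to 1 to match what the `get_threshold` function in this notebook computes.

```python
import pandas as pd

# Hypothetical daily totals (one value per day) for a single building.
sunday_totals = pd.Series([200.0, 220.0, 210.0, 230.0])
weekday_totals = pd.Series([215.0, 400.0, 220.0, 390.0])

n_std = 1  # margin multiplier (assumption: one standard deviation, as in get_threshold)

# Threshold: mean of the Sunday totals plus the margin (pandas' sample std, ddof=1).
threshold = sunday_totals.mean() + n_std * sunday_totals.std()

# A day is active when its total is strictly above the threshold,
# and inactive when it is lower than or equal to it.
is_active = weekday_totals > threshold
print(is_active.tolist())  # [False, True, False, True]
```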

In [5]:
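def dropNan(df: pd.DataFrame) -> pd.DataFrame:
    """Return df without the days whose consumptions contain at least one NaN."""
    nan_rows = []

    for date in df.index:
        consumptions = df['consumptions'].loc[date]

        # np.isnan works element-wise; .any() flags a day with any missing reading
        if np.isnan(consumptions).any():
            nan_rows.append(date)

    return df.drop(index=nan_rows)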
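
The same filtering can also be written without an explicit loop. The following is a minimal sketch on a hypothetical three-day frame (`demo`, `has_nan` and `clean_demo` are names made up for the illustration), not a replacement for the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same layout as raw_df: one list of readings per day.
demo = pd.DataFrame(
    {'consumptions': [[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]]},
    index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']),
)

# True for every day whose readings contain at least one NaN.
has_nan = demo['consumptions'].apply(lambda values: bool(np.isnan(values).any()))

# Keep only the complete days.
clean_demo = demo[~has_nan]
print(clean_demo.index.strftime('%Y-%m-%d').tolist())  # ['2020-01-01', '2020-01-03']
```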

In [6]:
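def get_threshold(df: pd.DataFrame) -> float:
    # Total consumption of each day, kept in a local Series so the input
    # DataFrame is not modified (this avoids pandas' SettingWithCopyWarning)
    daily = df['consumptions'].apply(np.sum)

    # Mean of the daily totals plus one sample standard deviation as margin
    return daily.mean() + daily.std()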
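
One detail worth keeping in mind: pandas' `Series.std` defaults to the sample standard deviation (`ddof=1`), whereas NumPy's `np.std`, which `get_prototype` uses, defaults to the population standard deviation (`ddof=0`). The short sketch below shows the difference on hypothetical values.

```python
import numpy as np
import pandas as pd

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical daily totals

sample_std = pd.Series(values).std()  # pandas default: ddof=1
population_std = np.std(values)       # NumPy default: ddof=0

print(population_std)                                   # 2.0
print(np.isclose(sample_std, np.std(values, ddof=1)))   # True
print(sample_std > population_std)                      # True
```

Passing `ddof` explicitly to either call is one way to make both computations follow the same convention.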

In [7]:
def get_prototype(df: pd.DataFrame, counter_id: int, weekday: int, active: bool, type: str = 'mean') -> pd.DataFrame:
    """Build a one-row DataFrame with the element-wise 'mean' or 'std' of the 24 daily readings in df."""
    cons = []
    for i in range(24): # One value per reading position within the day
        # Gather the i-th reading of every day in df
        i_consumptions = [df['consumptions'].iloc[j][i] for j in range(df.shape[0])]

        if type == 'std':
            cons.append(np.std(i_consumptions)) # Population std (ddof=0)
        else:
            cons.append(np.mean(i_consumptions))

    return pd.DataFrame({'building_id': counter_id, 'weekday': weekday, 'active': active, 'consumptions': [cons]})
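
The per-position aggregation can also be expressed with a single NumPy matrix. This is a sketch under two assumptions: every day holds a list of the same length, and the frame contains at least one day. The `demo` frame and its three readings per day are made up for brevity.

```python
import numpy as np
import pandas as pd

# Hypothetical days with three readings each (the notebook uses 24 per day).
demo = pd.DataFrame({'consumptions': [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]})

# Turn the per-day lists into a (days, readings) matrix.
matrix = np.array(demo['consumptions'].tolist())

mean_profile = matrix.mean(axis=0)  # mean of each reading position across days
std_profile = matrix.std(axis=0)    # population std (ddof=0), same default as np.std

print(mean_profile.tolist())  # [2.0, 3.0, 4.0]
print(std_profile.tolist())   # [1.0, 1.0, 1.0]
```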

In [8]:
clean_df = dropNan(raw_df)

sundays = clean_df[clean_df['weekday'] == 6] # Select Sundays
sundays

day,building_id,weekday,consumptions
2013-12-22,487,6,"[7.59915393780765, 7.59915393780765, 7.5991539..."
2014-01-05,487,6,"[7.0, 6.0, 7.90981467630682, 6.09018532369318,..."
2014-01-12,487,6,"[9.58157567594984, 9.58157567594984, 9.5815756..."
2014-01-19,487,6,"[6.0, 6.98327570155606, 7.01672429844394, 6.0,..."
2014-01-26,487,6,"[7.0, 7.0, 8.0, 7.0, 8.0, 8.0, 7.0, 8.0, 8.0, ..."
...,...,...,...
2020-03-01,487,6,"[11.0982415902141, 9.56461691873026, 9.8314508..."
2020-03-08,487,6,"[9.46796523126527, 10.0, 9.0, 9.0, 9.0, 10.0, ..."
2020-03-15,487,6,"[10.0, 11.0, 10.0, 10.0, 11.0, 10.0, 11.0, 11...."
2020-03-22,487,6,"[9.59392398331422, 9.40607601668578, 9.0, 10.0..."


In [9]:
threshold = get_threshold(sundays)
threshold


313.9949861311162

In [10]:
mean_proto = get_prototype(sundays, counter_id, 6, False, type='mean') # Get Sundays prototype
mean_proto

,building_id,weekday,active,consumptions
0,487,6,False,"[9.873605754955125, 9.87921448850105, 9.841854..."


In [11]:
std_proto = get_prototype(sundays, counter_id, 6, False, type='std') # Get Sundays prototype
std_proto

,building_id,weekday,active,consumptions
0,487,6,False,"[2.8104372129285107, 2.829071572895955, 2.8767..."


In [12]:
for i in range(0, 6): # Monday (0) to Saturday (5)
    # .copy() avoids pandas' SettingWithCopyWarning when adding the 'daily' column
    df = clean_df[clean_df['weekday'] == i].copy()
    df['daily'] = df['consumptions'].apply(np.sum)

    df_a = df.loc[df['daily'] > threshold] # Select active days
    df_i = df.loc[df['daily'] <= threshold] # Select inactive days

    # DataFrame.append was removed in pandas 2.0, so rows are added with pd.concat
    mean_proto = pd.concat([mean_proto,
                            get_prototype(df_a, counter_id, i, True, type='mean'),
                            get_prototype(df_i, counter_id, i, False, type='mean')])
    std_proto = pd.concat([std_proto,
                           get_prototype(df_a, counter_id, i, True, type='std'),
                           get_prototype(df_i, counter_id, i, False, type='std')])

mean_proto = mean_proto.reset_index(drop=True)
std_proto = std_proto.reset_index(drop=True)
mean_proto



,building_id,weekday,active,consumptions
0,487,6,False,"[9.873605754955125, 9.87921448850105, 9.841854..."
1,487,0,True,"[12.917540552534398, 12.846034435792236, 14.24..."
2,487,0,False,"[9.095493753498717, 9.137945673316095, 9.27321..."
3,487,1,True,"[14.138973848987886, 14.053810325536725, 15.34..."
4,487,1,False,"[9.429585882553527, 9.45685387208661, 9.522455..."
5,487,2,True,"[13.713544406563944, 13.719862107675102, 15.05..."
6,487,2,False,"[9.495998437211393, 9.502974940595934, 9.60752..."
7,487,3,True,"[13.918727156282095, 13.8660766138161, 15.2461..."
8,487,3,False,"[9.511928484158586, 9.491681189045083, 9.67152..."
9,487,4,True,"[13.872169929046645, 14.031764085183449, 15.11..."
