# Rebuild data
Rebuilding the data consists of filling in the invalid values (`NaN` in our case) in each building's data. This is done by applying one of two operations to each day:
- **Substitution**. Any day where all data is missing is replaced by the corresponding prototype.
- **Transformation**. Otherwise, the invalid values are replaced by a transformation of the corresponding prototype.

#### Directory structure
- `./`
  - `notebook/`
    - `data_preprocessing/`
      - `rebuild_data.ipynb`
  - `out/`
    - `consumptions_byday/`
    - `consumptions/`

In [1]:
OUT_PATH = 'C:/Users/thmas/OneDrive - Universidad de Castilla-La Mancha/Informática/TFG/out/'
PROTOS_PATH = OUT_PATH + 'prototypesMEAN.zip'

In [2]:
import pandas as pd
import numpy as np

In [3]:
protos = pd.read_pickle(PROTOS_PATH)
protos

,building_id,weekday,active,consumptions
0,89,6,False,"[20.062732473523376, 20.184259312415204, 20.22..."
1,89,0,True,"[26.567371347361295, 30.674557723872955, 32.76..."
2,89,0,False,"[16.72172803991431, 18.2722835870672, 19.09931..."
3,89,1,True,"[28.81085671076993, 32.64046573059663, 35.0222..."
4,89,1,False,"[18.730723885029384, 20.195585432027936, 22.38..."
...,...,...,...,...
1256,2233,3,False,"[10.69264612522166, 10.555827976698279, 15.201..."
1257,2233,4,True,"[15.789085622247345, 16.47521905710764, 23.141..."
1258,2233,4,False,"[10.568740121060749, 10.532340160478352, 15.35..."
1259,2233,5,True,"[20.619978839935616, 19.963428469117698, 19.39..."


In [4]:
counter_id = 487 # Building ID example

# Keep only this building's prototypes; reset_index returns a new frame, so the slice is not modified in place
mean_proto = protos[protos['building_id'] == counter_id].reset_index(drop=True)
mean_proto

,building_id,weekday,active,consumptions
0,487,6,False,"[9.873605754955125, 9.87921448850105, 9.841854..."
1,487,0,True,"[12.917540552534398, 12.846034435792236, 14.24..."
2,487,0,False,"[9.095493753498717, 9.137945673316095, 9.27321..."
3,487,1,True,"[14.138973848987886, 14.053810325536725, 15.34..."
4,487,1,False,"[9.429585882553527, 9.45685387208661, 9.522455..."
5,487,2,True,"[13.713544406563944, 13.719862107675102, 15.05..."
6,487,2,False,"[9.495998437211393, 9.502974940595934, 9.60752..."
7,487,3,True,"[13.918727156282095, 13.8660766138161, 15.2461..."
8,487,3,False,"[9.511928484158586, 9.491681189045083, 9.67152..."
9,487,4,True,"[13.872169929046645, 14.031764085183449, 15.11..."


In [5]:
raw_df = pd.read_pickle(OUT_PATH + 'raw_consumptions.zip')
raw_df

day,building_id,weekday,consumptions
2012-02-24,89,4,"[nan, nan, nan, nan, 0.0, 25.9682072759303, 34..."
2012-02-25,89,5,"[8.0, 8.56965980289508, 7.83041664589254, 7.83..."
2012-02-26,89,6,"[9.0, 9.0, 8.47872481882854, 8.52127518117146,..."
2012-02-27,89,0,"[9.93594069444675, 9.0, 10.0, 18.4133936140153..."
2012-02-28,89,1,"[15.0, 15.0, 15.0, 23.0, 41.3474893206788, 39...."
...,...,...,...
2020-03-28,2233,5,"[8.96294314928535, 9.1999884489703, 9.22916758..."
2020-03-29,2233,6,"[9.05122649923577, 9.10856876843712, 9.0668798..."
2020-03-30,2233,0,"[9.14786320617928, 9.46424320377272, 12.979311..."
2020-03-31,2233,1,"[9.09777728991234, 9.49875136817228, 13.959012..."


In [6]:
# Work on an explicit copy so that later assignments do not target a slice of raw_df
df = raw_df[raw_df['building_id'] == counter_id].copy()
df

day,building_id,weekday,consumptions
2013-12-17,487,1,"[nan, nan, nan, nan, nan, nan, nan, 4.49733527..."
2013-12-18,487,2,"[12.1029321298894, 12.1029321298894, 12.102932..."
2013-12-19,487,3,"[11.264909064798, 11.264909064798, 11.26490906..."
2013-12-20,487,4,"[10.9838823956164, 10.9838823956164, 10.983882..."
2013-12-21,487,5,"[6.93115242178077, 7.59915393780765, 7.5991539..."
...,...,...,...
2020-03-28,487,5,"[8.0, 9.0, 8.0, 8.0, 9.0, 9.0, 9.0, 9.0, 9.0, ..."
2020-03-29,487,6,"[9.0, 8.0, 8.0, 9.0, 8.0, 8.0, 9.0, 9.0, 8.089..."
2020-03-30,487,0,"[10.0, 10.0, 11.0, 12.8805883330563, 12.119411..."
2020-03-31,487,1,"[8.0, 7.0, 7.0, 10.0, 11.0, 14.0, 14.0, 11.154..."


In [7]:
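def get_threshold(df: pd.DataFrame) -> float:
    # Total consumption of each day, kept in a local Series so the input frame is not modified
    daily = df['consumptions'].apply(np.sum)
    
    # Mean plus one standard deviation of the daily totals
    return daily.mean() + daily.std()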

Obtain the threshold for the current building from its Sundays: the mean daily consumption plus one standard deviation.

In [8]:
sundays = df[df['weekday'] == 6]
threshold = get_threshold(sundays)
threshold

313.9949861311162
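
As a small illustration of the rule behind `get_threshold`, the sketch below computes the mean plus one standard deviation of a few made-up daily totals. The data and the helper name `threshold_from_daily_totals` are hypothetical and only serve the example. It uses `ddof=1` because pandas' `Series.std` defaults to the sample standard deviation.

```python
import numpy as np

def threshold_from_daily_totals(daily_totals):
    """Mean plus one sample standard deviation, ignoring NaN totals."""
    totals = np.asarray(daily_totals, dtype=float)
    totals = totals[~np.isnan(totals)]  # pandas skips NaN in mean() and std()
    return totals.mean() + totals.std(ddof=1)

# Made-up readings for three Sundays (two hours each, for brevity)
sundays = [[50.0, 50.0], [100.0, 100.0], [150.0, 150.0]]
daily = [np.sum(day) for day in sundays]  # daily totals: 100, 200, 300

example_threshold = threshold_from_daily_totals(daily)
print(example_threshold)  # mean 200 + sample std 100
```

Days whose total consumption exceeds this value are later labelled as active.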

### Substitution and Transformation
To get a "clean" dataset, that is, one without invalid values (`NaN`, negatives...), a `transform(X, Y)` operation is applied to the whole dataset, day by day, where $X$ is the prototype and $Y$ is the day's data. If a whole day is invalid, it is simply replaced by the corresponding prototype (chosen by activity and weekday). Otherwise, the transformation is applied. It consists of three steps:

- **Normalization**. $X$ and $Y$ are normalized, each with its own mean and standard deviation (for $Y$, computed over its valid values only).
- **Substitution**. Any invalid value in $Y$ is replaced with its corresponding $X$ value (same hour consumption).
- **Denormalization**. $Y$ is denormalized, to obtain the real consumptions.
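
Written out, the three steps reduce to a single expression per hour. The notation is introduced here for illustration only: $\mu_X, \sigma_X$ are the mean and standard deviation of the prototype $X$, $\mu_Y, \sigma_Y$ are those of the valid values of $Y$, and $h$ indexes the hour of the day.

$$
\hat{y}_h =
\begin{cases}
y_h & \text{if } y_h \text{ is valid} \\
\mu_Y + \sigma_Y \, \dfrac{x_h - \mu_X}{\sigma_X} & \text{otherwise}
\end{cases}
$$

Valid readings pass through unchanged, while missing readings take the shape of the prototype rescaled to the level and spread of the day's valid readings.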

In [9]:
def normalize(x: np.ndarray, mean: float, std: float) -> np.ndarray:
    # Standard score: center on the mean and scale by the standard deviation
    return (x - mean) / std

In [10]:
def denormalize(x: np.ndarray, mean: float, std: float) -> np.ndarray:
    # Inverse of normalize: back to the original units
    return x * std + mean

In [11]:
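def transform(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    # x is the prototype, y is the day to repair; both are converted to float arrays
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    
    nans = np.isnan(y)
    
    if not np.any(nans): # If all are valid values, return the original
        return y
    elif np.all(nans): # If all are invalid values, return the prototype
        return x
    
    # Normalize the prototype with its own statistics
    x_mean, x_std = np.mean(x), np.std(x)
    x_norm = normalize(x, x_mean, x_std)
    
    # Normalize the day with the statistics of its valid values only
    y_clean = y[~nans]
    y_mean, y_std = np.mean(y_clean), np.std(y_clean)
    y_norm = normalize(y, y_mean, y_std)
    
    # Fill the gaps with the normalized prototype, then go back to real units
    y_norm[nans] = x_norm[nans]
    
    return denormalize(y_norm, y_mean, y_std)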
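
To make the effect of the transformation concrete, here is a minimal self-contained sketch on made-up numbers. The helper `fill_from_prototype` is a hypothetical, compact restatement of the steps above (it leaves valid readings untouched instead of round-tripping them through normalization), and the four-value "day" is invented for brevity.

```python
import numpy as np

def fill_from_prototype(proto, day):
    """Compact restatement of the transform step, for illustration only."""
    proto = np.asarray(proto, dtype=float)
    day = np.asarray(day, dtype=float)
    missing = np.isnan(day)
    if not missing.any():   # nothing to repair
        return day
    if missing.all():       # whole day missing: use the prototype as is
        return proto
    proto_norm = (proto - proto.mean()) / proto.std()
    valid = day[~missing]
    filled = day.copy()
    # Rescale the prototype's shape to the mean and std of the valid readings
    filled[missing] = proto_norm[missing] * valid.std() + valid.mean()
    return filled

proto = [1.0, 2.0, 3.0, 4.0]          # made-up prototype
day = [10.0, np.nan, 30.0, np.nan]    # made-up day with two gaps

result = fill_from_prototype(proto, day)
print(np.round(result, 3))  # the gaps become roughly 15.528 and 33.416
```

The two valid readings (10 and 30) are kept, and the gaps follow the rising shape of the prototype at the scale of the valid readings.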

In [12]:
for date in df.index:
    weekday = df.loc[date, 'weekday']
    
    weekend = weekday in (5, 6) # It is a Saturday or a Sunday
    august = (date.month == 8) # August is an inactive month
    christmas = (date.month == 12 and date.day >= 23) or (date.month == 1 and date.day <= 6) # Christmas holidays
    
    active = not (weekend or august or christmas)
    
    # Pick the prototype that matches this weekday and activity state
    proto = mean_proto.loc[(mean_proto['weekday'] == weekday) & (mean_proto['active'] == active), 'consumptions'].iloc[0]
    
    # .at writes a single cell directly and avoids chained assignment
    df.at[date, 'consumptions'] = transform(proto, df.loc[date, 'consumptions'])

df

day,building_id,weekday,consumptions
2013-12-17,487,1,"[9.91793884357676, 9.879582778638838, 10.46261..."
2013-12-18,487,2,"[12.1029321298894, 12.1029321298894, 12.102932..."
2013-12-19,487,3,"[11.264909064798, 11.264909064798, 11.26490906..."
2013-12-20,487,4,"[10.9838823956164, 10.9838823956164, 10.983882..."
2013-12-21,487,5,"[6.93115242178077, 7.59915393780765, 7.5991539..."
...,...,...,...
2020-03-28,487,5,"[8.0, 9.0, 8.0, 8.0, 9.0, 9.0, 9.0, 9.0, 9.0, ..."
2020-03-29,487,6,"[9.0, 8.0, 8.0, 9.0, 8.0, 8.0, 9.0, 9.0, 8.089..."
2020-03-30,487,0,"[10.0, 10.0, 11.0, 12.8805883330563, 12.119411..."
2020-03-31,487,1,"[8.0, 7.0, 7.0, 10.0, 11.0, 14.0, 14.0, 11.154..."
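
The calendar rule that decides which prototype (active or inactive) is used for a day can be isolated in a small function. This is a sketch: the name `is_inactive_day` is hypothetical, and it assumes the dataset's `weekday` column follows the same convention as `datetime.date.weekday()` (Monday is 0, Sunday is 6), which matches the rows shown above.

```python
from datetime import date

def is_inactive_day(day: date) -> bool:
    """Weekends, August and the Christmas holidays count as inactive."""
    weekend = day.weekday() in (5, 6)  # Saturday or Sunday
    august = day.month == 8
    christmas = (day.month == 12 and day.day >= 23) or (day.month == 1 and day.day <= 6)
    return weekend or august or christmas

print(is_inactive_day(date(2013, 12, 17)))  # Tuesday outside the holidays
print(is_inactive_day(date(2013, 12, 24)))  # Christmas holidays
print(is_inactive_day(date(2020, 3, 29)))   # Sunday
```

Note that this rule only selects the prototype used for filling gaps; the final activity label is computed from the consumption itself in the next section.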


### Activity recalculation
Only one step is left to get the final dataset: calculating each day's activity. Sundays are inactive by definition; any other day is active if its total daily consumption exceeds the previously calculated threshold, and inactive otherwise.

In [13]:
# Total consumption of each day
daily_cons = df['consumptions'].apply(np.sum)

# Sundays are inactive by definition; any other day is active only above the threshold
df.insert(2, 'active', (df['weekday'] != 6) & (daily_cons > threshold))

df

day,building_id,weekday,active,consumptions
2013-12-17,487,1,False,"[9.91793884357676, 9.879582778638838, 10.46261..."
2013-12-18,487,2,False,"[12.1029321298894, 12.1029321298894, 12.102932..."
2013-12-19,487,3,False,"[11.264909064798, 11.264909064798, 11.26490906..."
2013-12-20,487,4,False,"[10.9838823956164, 10.9838823956164, 10.983882..."
2013-12-21,487,5,False,"[6.93115242178077, 7.59915393780765, 7.5991539..."
...,...,...,...,...
2020-03-28,487,5,False,"[8.0, 9.0, 8.0, 8.0, 9.0, 9.0, 9.0, 9.0, 9.0, ..."
2020-03-29,487,6,False,"[9.0, 8.0, 8.0, 9.0, 8.0, 8.0, 9.0, 9.0, 8.089..."
2020-03-30,487,0,False,"[10.0, 10.0, 11.0, 12.8805883330563, 12.119411..."
2020-03-31,487,1,False,"[8.0, 7.0, 7.0, 10.0, 11.0, 14.0, 14.0, 11.154..."
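
The labelling rule can also be checked in isolation on a few made-up days. In this sketch the function name `label_activity`, the weekdays and the daily totals are all hypothetical; only the rule itself (Sundays inactive, other days active above the threshold) comes from the text above.

```python
import numpy as np

def label_activity(weekdays, daily_totals, threshold):
    """True where the day is not a Sunday and its total exceeds the threshold."""
    weekdays = np.asarray(weekdays)
    daily_totals = np.asarray(daily_totals, dtype=float)
    return (weekdays != 6) & (daily_totals > threshold)

weekdays = [4, 5, 6, 0]                      # Friday, Saturday, Sunday, Monday
daily_totals = [350.0, 200.0, 400.0, 314.0]  # made-up daily consumptions
labels = label_activity(weekdays, daily_totals, threshold=313.99)
print(labels.tolist())  # [True, False, False, True]
```

The Sunday stays inactive even though its total is the highest of the four, which is the "inactive by definition" part of the rule.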
