# Rebuild data
Rebuilding data consists of filling in the invalid values (`NaN` or negatives, in our case) in each building's data. This is done by applying one of two operations:
- **Substitution**. Any day in which more than 1/3 of the data is missing (more than 8 consumptions) is replaced by the corresponding prototype.
- **Transformation**. Otherwise, the invalid values are replaced by a transformation of the corresponding prototype.
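
The choice between the two operations can be sketched as follows. This is a minimal illustration, assuming 24 hourly readings per day and assuming that negatives count as missing data just like `NaN`; `count_invalid` and `choose_operation` are hypothetical helper names that do not appear in the notebook.

```python
import numpy as np

MAX_INVALID = 8  # one third of an assumed 24 hourly readings


def count_invalid(day) -> int:
    """Count the NaN or negative readings in one day's consumptions."""
    day = np.asarray(day, dtype=float)
    # NaN < 0 evaluates to False, so NaNs are only caught by isnan
    with np.errstate(invalid="ignore"):
        invalid = np.isnan(day) | (day < 0)
    return int(invalid.sum())


def choose_operation(day) -> str:
    """Return which rebuild operation the rule above selects for a day."""
    n_invalid = count_invalid(day)
    if n_invalid == 0:
        return "keep"            # nothing to rebuild
    if n_invalid > MAX_INVALID:
        return "substitution"    # replace the whole day by the prototype
    return "transformation"      # fill only the invalid readings
```

Note that this sketch follows the one-third rule as stated in the list above; it is not a description of how the cells below decide.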

#### Directory structure
./<br></br>
notebook/<br></br>
    &emsp;|--- data-preprocessing<br></br>
    &emsp;&emsp;&emsp;&emsp;|--- rebuild_data.ipynb<br></br>
out/<br></br>
    &emsp;|--- consumptions_byday/<br></br>
    &emsp;|--- consumptions/<br></br>

In [1]:
OUT_PATH = 'C:/Users/thmas/OneDrive - Universidad de Castilla-La Mancha/Informática/TFG/out/'
PROTOS_PATH = OUT_PATH + 'prototypesMEAN.zip'

In [2]:
import pandas as pd
import numpy as np

In [3]:
protos = pd.read_pickle(PROTOS_PATH)
protos

| | building_id | weekday | active | consumptions |
|---|---|---|---|---|
| 0 | 89 | 6 | False | [19.924767774500992, 20.044887155872043, 20.07... |
| 1 | 89 | 0 | True | [29.02426937511563, 33.83383962106855, 36.3533... |
| 2 | 89 | 0 | False | [18.80906478381575, 20.889957003428506, 21.949... |
| 3 | 89 | 1 | True | [31.603164166692242, 36.111243184484685, 38.31... |
| 4 | 89 | 1 | False | [20.74464915249005, 22.705367673055445, 25.013... |
| ... | ... | ... | ... | ... |
| 1256 | 2233 | 3 | False | [11.559058932408963, 11.65321968245459, 18.233... |
| 1257 | 2233 | 4 | True | [20.295181986036788, 21.70307015406495, 26.496... |
| 1258 | 2233 | 4 | False | [11.251968907271063, 11.277439875585463, 17.49... |
| 1259 | 2233 | 5 | True | [21.780315206987908, 21.510202326332823, 21.37... |


In [4]:
counter_id = 27 # Building ID example

# Prototypes of the example building, re-indexed from 0
mean_proto = protos[protos['building_id'] == counter_id].reset_index(drop=True)
mean_proto

| | building_id | weekday | active | consumptions |
|---|---|---|---|---|
| 0 | 27 | 6 | False | [21.745047367917923, 21.641706434645418, 21.64... |
| 1 | 27 | 0 | True | [21.9737123842489, 23.53765488309698, 31.45017... |
| 2 | 27 | 0 | False | [22.086298148730886, 22.063824145285825, 22.99... |
| 3 | 27 | 1 | True | [23.78626031114435, 25.229923672493, 32.973833... |
| 4 | 27 | 1 | False | [22.073689723127114, 22.17109445997468, 22.575... |
| 5 | 27 | 2 | True | [24.29696168208666, 25.699496872988096, 33.641... |
| 6 | 27 | 2 | False | [21.9154797375671, 21.885308243496592, 22.5045... |
| 7 | 27 | 3 | True | [24.204502262148857, 25.56229407924669, 33.674... |
| 8 | 27 | 3 | False | [21.84557905726422, 21.83394763076296, 22.4347... |
| 9 | 27 | 4 | True | [24.07414854421836, 25.331514277365237, 33.542... |


In [5]:
raw_df = pd.read_pickle(OUT_PATH + 'raw_consumptions.zip')
raw_df

| day | building_id | weekday | consumptions |
|---|---|---|---|
| 2012-02-24 | 89 | 4 | [nan, nan, nan, nan, 0.0, 25.9682072759303, 34... |
| 2012-02-25 | 89 | 5 | [8.0, 8.56965980289508, 7.83041664589254, 7.83... |
| 2012-02-26 | 89 | 6 | [9.0, 9.0, 8.47872481882854, 8.52127518117146,... |
| 2012-02-27 | 89 | 0 | [9.93594069444675, 9.0, 10.0, 18.4133936140153... |
| 2012-02-28 | 89 | 1 | [15.0, 15.0, 15.0, 23.0, 41.3474893206788, 39.... |
| ... | ... | ... | ... |
| 2020-03-23 | 2233 | 0 | [9.23106212240757, 15.7486593144312, 19.269211... |
| 2020-03-24 | 2233 | 1 | [9.5668751135038, 15.3377513735436, 20.2699453... |
| 2020-03-25 | 2233 | 2 | [9.22960071120226, 15.9943441867809, 20.594844... |
| 2020-03-26 | 2233 | 3 | [9.49747602519851, 15.1675192485798, 20.593217... |


In [6]:
df = raw_df[raw_df['building_id'] == counter_id].copy() # Copy, so later assignments do not target a slice of raw_df
df

| day | building_id | weekday | consumptions |
|---|---|---|---|
| 2011-07-26 | 27 | 1 | [nan, nan, nan, nan, nan, nan, nan, nan, nan, ... |
| 2011-07-27 | 27 | 2 | [17.0, 19.0, 18.3507946535444, 35.846312818818... |
| 2011-07-28 | 27 | 3 | [18.8887041808661, 18.8030088936913, 18.845892... |
| 2011-07-29 | 27 | 4 | [20.0, 21.0, 20.0, 37.7887789876153, 45.845704... |
| 2011-07-30 | 27 | 5 | [17.2981132075472, 17.0, 17.2396974482587, 17.... |
| ... | ... | ... | ... |
| 2020-03-23 | 27 | 0 | [10.1170330737468, 10.9676878955827, 10.967739... |
| 2020-03-24 | 27 | 1 | [10.8590892649269, 10.9677121385118, 10.967712... |
| 2020-03-25 | 27 | 2 | [10.9677235262438, 10.3559814000229, 10.579513... |
| 2020-03-26 | 27 | 3 | [10.9677623111933, 10.6143072999924, 10.321141... |


In [7]:
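def get_threshold(df: pd.DataFrame) -> float:
    # Mean of the daily totals. Computed without adding a column to `df`,
    # which may be a slice of another DataFrame
    daily = df['consumptions'].apply(np.sum)
    
    return daily.mean()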

Obtain the threshold for the current building: the mean daily consumption over its Sundays.

In [8]:
sundays = df[df['weekday'] == 6]
threshold = get_threshold(sundays)
threshold

535.2051244012839

### Substitution and Transformation
In order to get a "clean" dataset, that is, a dataset without any invalid values (`NaN`, negatives...), a `transform(X, Y)` operation is applied to the whole dataset, day by day, where $X$ is the prototype and $Y$ is the day's consumptions. If a whole day is invalid, it is simply replaced by the corresponding prototype (based on activity and weekday). Otherwise, the transformation is applied. It consists of three steps:

- **Normalization**. $X$ and $Y$ are each normalized with their own mean and standard deviation.
- **Substitution**. Any invalid value in $Y$ is replaced with its corresponding $X$ value (the consumption at the same hour).
- **Denormalization**. $Y$ is denormalized to recover the real consumptions.
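
Written out, the three steps amount to the following, where $\mu$ and $\sigma$ denote mean and standard deviation. This reading is taken from the `transform` code below: $X$ is the prototype, $Y$ is the day being rebuilt, and $\mu_Y$, $\sigma_Y$ are computed without the missing readings of $Y$.

$$
\hat{x}_h = \frac{x_h - \mu_X}{\sigma_X},
\qquad
y_h^{\text{rebuilt}} = \mu_Y + \sigma_Y \, \hat{x}_h
\quad \text{for every invalid hour } h
$$

Valid hours are left unchanged, because normalizing and then denormalizing a value with the same $\mu_Y$ and $\sigma_Y$ returns the original value.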

In [9]:
def normalize(x: np.ndarray, mean: float, std: float) -> np.ndarray:
    return (x - mean) / std

In [10]:
def denormalize(x: np.ndarray, mean: float, std: float) -> np.ndarray:
    return x * std + mean

In [11]:
def transform(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Fill the invalid values of day `y` using prototype `x`."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    
    # NaN < 0 is False, so NaNs and negatives are flagged separately and combined
    invalid = np.isnan(y) | np.less(y, 0)
    
    if not np.any(invalid): # If all are valid values, return the original
        return y
    elif np.all(invalid): # If all are invalid values, return the prototype
        return x
    
    x_mean, x_std = np.mean(x), np.std(x)
    x_norm = normalize(x, x_mean, x_std)
    
    # Statistics of the day are computed from its valid values only
    y_clean = y[~invalid]
    y_mean, y_std = np.mean(y_clean), np.std(y_clean)
    
    y_norm = normalize(y, y_mean, y_std)
    
    # Replace each invalid hour with the prototype's normalized value for that hour
    y_norm[invalid] = x_norm[invalid]
           
    return denormalize(y_norm, y_mean, y_std)
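
A small worked example may help to see the effect of the transformation. The sketch below re-implements the three steps in compact form on made-up four-value arrays (real days have more readings). `rebuild_day` is a hypothetical name, the numbers are illustrative only, and the day's statistics are taken over its valid readings, which is an assumption of this sketch.

```python
import numpy as np


def rebuild_day(proto, day):
    """Fill the invalid readings of `day` with the rescaled shape of `proto`."""
    proto = np.asarray(proto, dtype=float)
    day = np.asarray(day, dtype=float)

    # Invalid readings are NaN or negative
    with np.errstate(invalid="ignore"):
        invalid = np.isnan(day) | (day < 0)
    valid = day[~invalid]

    # Normalize the prototype with its own mean and standard deviation
    proto_norm = (proto - proto.mean()) / proto.std()

    # Denormalize the prototype's shape with the statistics of the valid readings
    rebuilt = day.copy()
    rebuilt[invalid] = proto_norm[invalid] * valid.std() + valid.mean()
    return rebuilt


proto = [10.0, 20.0, 30.0, 40.0]   # made-up prototype
day = [np.nan, 4.0, 6.0, -1.0]     # made-up day with two invalid readings
print(rebuild_day(proto, day))
```

The two valid readings (4.0 and 6.0) are kept as they are. The first and last positions are filled with roughly 3.66 and 6.34: the prototype's low-to-high shape, rescaled to the level and spread of the valid readings.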

In [12]:
for date in df.index:
    weekday = df.loc[date, 'weekday']
    
    weekend = weekday in (5, 6) # It is a Saturday or a Sunday
    august = (date.month == 8) # August is an inactive month
    christmas = (date.month == 12 and date.day >= 23) or (date.month == 1 and date.day <= 6) # Christmas holidays
    
    active = not (weekend or august or christmas)
    
    # Prototype for this weekday and activity state
    proto = mean_proto.loc[(mean_proto['weekday'] == weekday) & (mean_proto['active'] == active), 'consumptions'].iloc[0]
    
    # Assign through the DataFrame itself, not through a chained df['consumptions'].loc[...]
    df.at[date, 'consumptions'] = transform(proto, df.at[date, 'consumptions'])

df

| day | building_id | weekday | consumptions |
|---|---|---|---|
| 2011-07-26 | 27 | 1 | [4.909419378934345, 6.2781544639955555, 13.620... |
| 2011-07-27 | 27 | 2 | [17.0, 19.0, 18.3507946535444, 35.846312818818... |
| 2011-07-28 | 27 | 3 | [18.8887041808661, 18.8030088936913, 18.845892... |
| 2011-07-29 | 27 | 4 | [20.0, 21.0, 20.0, 37.7887789876153, 45.845704... |
| 2011-07-30 | 27 | 5 | [17.2981132075472, 17.0, 17.2396974482587, 17.... |
| ... | ... | ... | ... |
| 2020-03-23 | 27 | 0 | [10.1170330737468, 10.9676878955827, 10.967739... |
| 2020-03-24 | 27 | 1 | [10.8590892649269, 10.9677121385118, 10.967712... |
| 2020-03-25 | 27 | 2 | [10.9677235262438, 10.3559814000229, 10.579513... |
| 2020-03-26 | 27 | 3 | [10.9677623111933, 10.6143072999924, 10.321141... |
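
The calendar rule used in the loop above to pick the active or inactive prototype can also be isolated in a small function, which makes it easier to check on individual dates. This is only a sketch: `is_inactive_by_calendar` is a hypothetical name, and the rule is copied from the conditions in the loop (weekends, August, and 23 December to 6 January).

```python
from datetime import date


def is_inactive_by_calendar(day: date) -> bool:
    """Weekends, August and the Christmas holidays count as inactive."""
    weekend = day.weekday() in (5, 6)  # Monday is 0, so 5 and 6 are Saturday and Sunday
    august = day.month == 8
    christmas = (day.month == 12 and day.day >= 23) or (day.month == 1 and day.day <= 6)
    return weekend or august or christmas


print(is_inactive_by_calendar(date(2011, 7, 26)))  # a Tuesday outside any holiday period
print(is_inactive_by_calendar(date(2011, 7, 30)))  # a Saturday
```

Since `pd.Timestamp` is a subclass of `datetime`, the same function can be called on the values of `df.index`.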


### Activity recalculation
Only one step remains to get the final dataset: recalculating each day's activity. Sundays are inactive by definition; every other day is marked active if its daily consumption exceeds the previously calculated threshold, and inactive otherwise.

In [13]:
daily_cons = df['consumptions'].apply(np.sum) # Calculate daily consumption

# Sundays are always inactive; any other day is active only above the threshold
df.insert(2, 'active', (df['weekday'] != 6) & (daily_cons > threshold))

df

| day | building_id | weekday | active | consumptions |
|---|---|---|---|---|
| 2011-07-26 | 27 | 1 | True | [4.909419378934345, 6.2781544639955555, 13.620... |
| 2011-07-27 | 27 | 2 | True | [17.0, 19.0, 18.3507946535444, 35.846312818818... |
| 2011-07-28 | 27 | 3 | True | [18.8887041808661, 18.8030088936913, 18.845892... |
| 2011-07-29 | 27 | 4 | True | [20.0, 21.0, 20.0, 37.7887789876153, 45.845704... |
| 2011-07-30 | 27 | 5 | False | [17.2981132075472, 17.0, 17.2396974482587, 17.... |
| ... | ... | ... | ... | ... |
| 2020-03-23 | 27 | 0 | False | [10.1170330737468, 10.9676878955827, 10.967739... |
| 2020-03-24 | 27 | 1 | False | [10.8590892649269, 10.9677121385118, 10.967712... |
| 2020-03-25 | 27 | 2 | False | [10.9677235262438, 10.3559814000229, 10.579513... |
| 2020-03-26 | 27 | 3 | False | [10.9677623111933, 10.6143072999924, 10.321141... |
