Anticipating the electricity consumption needs of buildings
===========================================================

![logo-seattle](../reports/figures/logo-seattle.png)


Explanation of the variables:
[City of Seattle](https://data.seattle.gov/dataset/2015-Building-Energy-Benchmarking/h7rm-fz6m)

In [None]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split

from src.utils.univar import UnivariateAnalysis
from src.utils.bivar import BivariateAnalysis


In [None]:
data = pd.read_pickle('../data/interim/full_data.pickle')

In [None]:
RANDOM_STATE = 5022020

In [None]:
data.columns

Variables to predict (targets)

    * SiteEnergyUse/WN (weather normalized?)
    * TotalGHGEmissions

Variables selected as model inputs
    
    * Building floor area
    * Year of construction (or of the latest renovation)
        ==> to be converted into a "building age"
    * Number of buildings
    * Number of floors
    * Parking floor area
    * Typical use of the building
    

In [None]:
data = data[data['SiteEnergyUse_kBtu'].notna()]
data = data[data['TotalGHGEmissions'].notna()]

In [None]:
data['year'] = data.index.get_level_values('year')

In [None]:
data['BuildingAge'] = data['year'] - data['YearBuilt']

In [None]:
model_data = data[['PropertyGFATotal',
                   'BuildingAge',
                   'NumberofFloors',
                   'LargestPropertyUseType',
                   'LargestPropertyUseTypeGFA',
                   'SecondLargestPropertyUseType',
                   'SecondLargestPropertyUseTypeGFA',
                   'ThirdLargestPropertyUseType',
                   'ThirdLargestPropertyUseTypeGFA',
                   # targets
                   'SiteEnergyUse_kBtu',
                   'TotalGHGEmissions']].copy()  # copy, so the .loc assignments below do not write to a view of `data`

In [None]:
model_data.loc[model_data['NumberofFloors'].isnull(), 'NumberofFloors'] = 0
model_data.loc[model_data['LargestPropertyUseTypeGFA'].isnull(),
               'LargestPropertyUseTypeGFA'] = 0
model_data.loc[model_data['SecondLargestPropertyUseTypeGFA'].isnull(),
               'SecondLargestPropertyUseTypeGFA'] = 0
model_data.loc[model_data['ThirdLargestPropertyUseTypeGFA'].isnull(),
               'ThirdLargestPropertyUseTypeGFA'] = 0

In [None]:
model_data.describe()

In [None]:
# Fix the seed so the split is reproducible from one run to the next.
data_train, data_test = train_test_split(model_data, random_state=RANDOM_STATE)

In [None]:
data_train

In [None]:
data_test

In [None]:
predict_features = ['SiteEnergyUse_kBtu', 'TotalGHGEmissions']
y = model_data[predict_features].copy()
# drop() keeps the original column order; indexing with a set is unordered
# and is rejected by recent pandas versions.
X = model_data.drop(columns=predict_features).copy()

In [None]:
y.describe()

In [None]:
y[y.notna()]

## Data preparation

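The output data are standardized (scaling): each target is centered on a mean
of 0 and rescaled to a standard deviation of 1. The resulting values are not
bounded, so they can fall outside the interval from -1 to 1.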
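
To make the effect of standardization concrete, here is a minimal sketch on a toy array. The values in `y_toy` are hypothetical and are not taken from the Seattle dataset. It checks that each column ends up with mean 0 and standard deviation 1, that an outlier still lands above 1, and that `inverse_transform` restores the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy target matrix (hypothetical values): two columns on very different scales.
y_toy = np.array([[1.0e5, 10.0],
                  [2.0e5, 20.0],
                  [3.0e5, 30.0],
                  [1.0e7, 900.0]])  # the last row is an outlier

scaler = StandardScaler().fit(y_toy)
y_toy_scaled = scaler.transform(y_toy)

# Each column now has a mean close to 0 and a standard deviation close to 1.
col_means = y_toy_scaled.mean(axis=0)
col_stds = y_toy_scaled.std(axis=0)

# The values are not bounded: the outlier row lies above 1.
max_scaled = y_toy_scaled.max()

# inverse_transform maps standardized values back to the original units.
y_toy_back = scaler.inverse_transform(y_toy_scaled)
```

Keeping the fitted scaler around, as the notebook does with `std_scale`, is what later allows predictions to be expressed in the original units.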

In [None]:
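# Fit and transform on the same NumPy array, so scikit-learn does not warn
# about feature names seen at transform time but not at fit time.
std_scale = preprocessing.StandardScaler().fit(y.values)
y_scaled = std_scale.transform(y.values)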

In [None]:
y_scaled

For the input data, categorical variables are converted to binary indicator columns (one-hot encoding) and continuous variables are standardized (scaling).

In [None]:
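scalers = dict()
# Iterate over a snapshot of the column names, since X is modified in the loop.
for col in list(X.columns):
    if X[col].dtype.name == 'category':
        # Categorical column: replace it with one indicator column per category
        # (dummy_na=True adds an extra indicator for missing values).
        print(f'processing {col}: One Hot Encoding the categories')
        X = pd.concat([X, pd.get_dummies(X[col], prefix=col, dummy_na=True)],
                      axis=1)
        X.drop(col, axis=1, inplace=True)
    elif X[col].dtype.name in ('float64', 'int64'):
        # Continuous column: standardize it and keep the scaler for later reuse.
        print(f'processing {col}: Scaling')
        x = X[col].values.reshape(-1, 1)
        scalers[col] = preprocessing.StandardScaler().fit(x)
        X[col] = scalers[col].transform(x)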
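
As a possible alternative to the loop above, the same two steps can be sketched with scikit-learn's `ColumnTransformer`. This is not the notebook's method. The frame `toy` and its values are hypothetical, and the category column is stored as plain strings here for simplicity. One design difference is that the scaler and the encoder are fitted on the training rows only and then applied to the test rows.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (hypothetical values) mimicking two of the notebook's columns.
toy = pd.DataFrame({
    'PropertyGFATotal': [20000.0, 55000.0, 31000.0, 120000.0, 48000.0, 76000.0],
    'LargestPropertyUseType': ['Office', 'Hotel', 'Office',
                               'Hotel', 'Retail Store', 'Office'],
})

toy_train, toy_test = train_test_split(toy, test_size=0.33, random_state=0)

preprocess = ColumnTransformer(
    [
        ('num', StandardScaler(), ['PropertyGFATotal']),
        # handle_unknown='ignore' encodes categories unseen during fit as all zeros.
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['LargestPropertyUseType']),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Scaling statistics and category lists are learned from the training rows only.
X_toy_train = preprocess.fit_transform(toy_train)
X_toy_test = preprocess.transform(toy_test)
```

Because the transformer is fitted once, training and test matrices are guaranteed to have the same columns, even when a category is absent from one of the two subsets.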

In [None]:
X

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y_scaled, test_size=0.2,
                                                    random_state=RANDOM_STATE)

In [None]:
X_train

### Linear regression

In [None]:
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)

In [None]:
# Predict once and reuse the result for every metric.
y_pred = lr.predict(X_test)
baseline_error = metrics.mean_squared_error(y_test, y_pred)
print("Mean squared error: %.5f" % baseline_error)
r2_score = metrics.r2_score(y_test, y_pred)
print("R² score: %.5f" % r2_score)
explained_var = metrics.explained_variance_score(y_test, y_pred)
print("Explained variance: %.5f" % explained_var)
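
Since the model predicts two targets at once, each score above is by default a uniform average over both columns. The sketch below uses hypothetical toy arrays, not the notebook's data, to show how `multioutput='raw_values'` returns one score per target, which can reveal that one target is predicted much better than the other.

```python
import numpy as np
from sklearn import metrics

# Toy two-target arrays (hypothetical values): column 0 is predicted well,
# column 1 is predicted poorly.
y_true_toy = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
y_pred_toy = np.array([[1.1, 4.0], [1.9, 3.0], [3.2, 2.0], [3.8, 1.0]])

# Default behavior: a single number, the uniform average over both targets.
r2_avg = metrics.r2_score(y_true_toy, y_pred_toy)

# multioutput='raw_values' returns one score per target column.
r2_per_target = metrics.r2_score(y_true_toy, y_pred_toy,
                                 multioutput='raw_values')
mse_per_target = metrics.mean_squared_error(y_true_toy, y_pred_toy,
                                            multioutput='raw_values')
```

In the notebook, the same argument could be passed with `y_test` and `lr.predict(X_test)`, the first column being `SiteEnergyUse_kBtu` and the second `TotalGHGEmissions`.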


In [None]:
metrics.mean_absolute_error(y_test, lr.predict(X_test))
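
Because the targets were standardized, the mean absolute error above is expressed in standard deviations rather than in kBtu or emission units. Here is a minimal sketch, on hypothetical toy values, of how the error might be converted back to the original units with `inverse_transform`.

```python
import numpy as np
from sklearn import metrics
from sklearn.preprocessing import StandardScaler

# Toy targets in their original units (hypothetical values).
y_orig = np.array([[100.0, 1.0], [200.0, 2.0], [300.0, 3.0], [400.0, 4.0]])
scaler = StandardScaler().fit(y_orig)
y_std = scaler.transform(y_orig)

# Pretend predictions in standardized space: the truth shifted by 0.1.
y_std_pred = y_std + 0.1

# Error measured in standardized units, one value per target.
mae_std = metrics.mean_absolute_error(y_std, y_std_pred,
                                      multioutput='raw_values')

# Map truth and predictions back to the original units before measuring.
mae_orig = metrics.mean_absolute_error(
    scaler.inverse_transform(y_std),
    scaler.inverse_transform(y_std_pred),
    multioutput='raw_values')
```

In the notebook, the corresponding calls would presumably be `std_scale.inverse_transform(y_test)` and `std_scale.inverse_transform(lr.predict(X_test))`.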