Anticipate the electricity consumption needs of buildings
=========================================================

![logo-seattle](https://www.seattle.gov/Documents/Departments/Arts/Downloads/Logo/Seattle_logo_landscape_blue-black.png)


Description of the variables:
[City of Seattle](https://data.seattle.gov/dataset/2015-Building-Energy-Benchmarking/h7rm-fz6m)

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from sklearn import preprocessing
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn import decomposition

from src.utils.bivar import BivariateAnalysis

sns.set()

In [None]:
data = pd.read_pickle('../data/interim/full_data.pickle')

In [None]:
data.columns

Variables to predict (targets)

   * SiteEnergyUse/WN (WN = weather normalized)
   * TotalGHGEmissions

Variables selected as model inputs

   * Building floor area
   * Primary use of the building
   * Floor area dedicated to the building's primary use
    

## Preparing the variables for modeling

Drop the rows whose consumption is zero or negative.

In [None]:
print(data.shape)
data = data[data["SiteEnergyUseWN_kBtu"] > 0]
print(data.shape)

We also drop the rows for which the building's primary use is missing.

In [None]:
print(data.shape)
data = data[data['LargestPropertyUseType'].notna()]
print(data.shape)

We drop the outliers (conveniently, they are already flagged in the `Outlier` column).

In [None]:
data = data[data['Outlier'].isna()]
data.shape
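
The three cleaning rules above can be sketched on a toy frame to see how many rows each one rejects. This is a minimal illustration: the rows and the `"High outlier"` label are made up, and only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names match the real dataset.
toy = pd.DataFrame({
    "SiteEnergyUseWN_kBtu": [1200.0, 0.0, 540.0, 980.0, np.nan],
    "LargestPropertyUseType": ["Office", "Hotel", None, "Office", "Hotel"],
    "Outlier": [None, None, None, "High outlier", None],
})

# One boolean mask per cleaning rule.
has_energy = toy["SiteEnergyUseWN_kBtu"] > 0        # NaN compares as False
has_use_type = toy["LargestPropertyUseType"].notna()
not_flagged = toy["Outlier"].isna()

# Number of rows each rule would reject on its own.
rejected = pd.Series({
    "no positive consumption": int((~has_energy).sum()),
    "missing use type": int((~has_use_type).sum()),
    "flagged as outlier": int((~not_flagged).sum()),
})
print(rejected)

# Rows that survive all three rules.
clean = toy[has_energy & has_use_type & not_flagged]
print(clean.shape)
```

Counting rejections per rule makes it easier to spot a filter that removes far more rows than expected. Note that `> 0` also discards rows with a missing consumption value.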

## Which variables are correlated with the target?

In [None]:
bivar = BivariateAnalysis(data)
bivar.anova('SiteEnergyUseWN_kBtu', 'LargestPropertyUseType', orient='h')
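
`BivariateAnalysis` is a project-local helper, so its internals are not shown here. As a rough idea of what a one-way ANOVA measures, the sketch below computes the correlation ratio (eta squared) by hand on made-up data; the `correlation_ratio` function is hypothetical and is not claimed to match what `bivar.anova` does.

```python
import pandas as pd

# Hypothetical consumption values for three use types.
toy = pd.DataFrame({
    "LargestPropertyUseType": ["Office"] * 4 + ["Hotel"] * 4 + ["Warehouse"] * 4,
    "SiteEnergyUseWN_kBtu": [50.0, 55.0, 60.0, 52.0,
                             90.0, 95.0, 88.0, 92.0,
                             20.0, 22.0, 25.0, 21.0],
})


def correlation_ratio(df, target, group):
    """Return eta squared: between-group sum of squares / total sum of squares."""
    overall_mean = df[target].mean()
    grouped = df.groupby(group)[target]
    ss_between = (grouped.count() * (grouped.mean() - overall_mean) ** 2).sum()
    ss_total = ((df[target] - overall_mean) ** 2).sum()
    return ss_between / ss_total


eta2 = correlation_ratio(toy, "SiteEnergyUseWN_kBtu", "LargestPropertyUseType")
print(f"eta squared = {eta2:.3f}")
```

A value close to 1 means the use type explains most of the variance in consumption, while a value close to 0 means the group means barely differ.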

In [None]:
bivar.regression(['PropertyGFATotal', 'SiteEnergyUseWN_kBtu'])

**Note**: the columns PropertyGFATotal and SiteEnergyUseWN_kBtu are log-transformed.

In [None]:
# log1p(x) = log(1 + x), applied to the whole column at once
data['logPropertyGFATotal'] = np.log1p(data['PropertyGFATotal'])
data['logSiteEnergyUseWN_kBtu'] = np.log1p(data['SiteEnergyUseWN_kBtu'])

In [None]:
bivar = BivariateAnalysis(data)
bivar.regression(['logPropertyGFATotal', 'logSiteEnergyUseWN_kBtu'])


The correlation is stronger between the logs of the two variables.
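
To illustrate why a log transform can strengthen a linear correlation, here is a small sketch on synthetic data. It assumes, purely for illustration, a square-root power law between floor area and consumption with a deterministic multiplicative perturbation; neither assumption comes from the Seattle data.

```python
import numpy as np
import pandas as pd

n = 200
# Floor areas spread over three orders of magnitude.
area = np.geomspace(1e3, 1e6, n)
# Deterministic multiplicative "noise": alternate between x1.3 and /1.3.
factor = np.where(np.arange(n) % 2 == 0, 1.3, 1 / 1.3)
# Assumed power law (hypothetical): energy grows like the square root of area.
energy = 3.0 * np.sqrt(area) * factor

toy = pd.DataFrame({"PropertyGFATotal": area, "SiteEnergyUseWN_kBtu": energy})
log_toy = np.log1p(toy)  # same transform as in the notebook

# Pearson correlation before and after the transform.
r_raw = toy["PropertyGFATotal"].corr(toy["SiteEnergyUseWN_kBtu"])
r_log = log_toy["PropertyGFATotal"].corr(log_toy["SiteEnergyUseWN_kBtu"])
print(f"raw: r = {r_raw:.3f}, log: r = {r_log:.3f}")
```

A power law becomes a straight line in log-log space, and multiplicative noise becomes additive, which is what the Pearson coefficient is designed to pick up.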

In [None]:
fig = px.scatter(data, x='logPropertyGFATotal',
                 y='logSiteEnergyUseWN_kBtu',
                 hover_data=['PropertyName'],
                 color='LargestPropertyUseType')
fig.show()

# Save the interactive figure as a standalone HTML file.
fig.write_html('test.html')


In [None]:
_, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(data=data, x='logPropertyGFATotal',
                y='logSiteEnergyUseWN_kBtu',
                hue='LargestPropertyUseType',
                legend=False, ax=ax)  # too many categories for a readable legend

In [None]:
data.to_pickle('../data/interim/full_dataV2.pickle')

In [None]:
data.head()

In [None]:
data

### Update:

Some building use types are under-represented and seem likely to cause problems in the following steps.
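
One way to check which use types are under-represented is to count the rows per category. The sketch below uses a made-up series, and the `min_rows` threshold is a hypothetical choice, not a value taken from the notebook.

```python
import pandas as pd

# Hypothetical distribution of use types.
use_types = pd.Series(
    ["Multifamily housing"] * 6 + ["Office"] * 3 + ["Hotel"] * 1,
    name="LargestPropertyUseType",
)

counts = use_types.value_counts()   # rows per category, most frequent first
min_rows = 3                        # hypothetical minimum sample size
rare = counts[counts < min_rows].index.tolist()
most_common = counts.idxmax()

print(counts)
print("under-represented:", rare)
print("most common:", most_common)
```

On the real data the same pattern would be `data['LargestPropertyUseType'].value_counts()`.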


In [None]:
# Keep only the most common building use type
mfh = data.loc[data['LargestPropertyUseType'] == 'Multifamily housing'].copy()

In [None]:
bivar = BivariateAnalysis(mfh)
bivar.regression(['PropertyGFATotal', 'SiteEnergyUseWN_kBtu'])

In [None]:
# Inverse transform, computed on the whole column at once
mfh['PropertyGFATotal_inv'] = 1 / mfh['PropertyGFATotal']
mfh['SiteEnergyUseWN_kBtu_inv'] = 1 / mfh['SiteEnergyUseWN_kBtu']

In [None]:
bivar = BivariateAnalysis(mfh)
bivar.regression(['PropertyGFATotal_inv',
                  'SiteEnergyUseWN_kBtu_inv'])

In [None]:
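# Switch to base-10 logs; this overwrites the log1p columns inherited from `data`.
mfh['logPropertyGFATotal'] = np.log10(mfh['PropertyGFATotal'])
mfh['logSiteEnergyUseWN_kBtu'] = np.log10(mfh['SiteEnergyUseWN_kBtu'])
# Rebuild the helper so that it works on the updated columns.
bivar = BivariateAnalysis(mfh)
bivar.regression(['logPropertyGFATotal', 'logSiteEnergyUseWN_kBtu'])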

In [None]:
sns.jointplot(x='logPropertyGFATotal', y='logSiteEnergyUseWN_kBtu', data=mfh)

In [None]:
reg = linear_model.LinearRegression()

In [None]:
# Fit on the 2015 rows (mfh.loc[2015] assumes the year is the first index level).
reg.fit(mfh.loc[2015]['logPropertyGFATotal'].values.reshape(-1, 1),
        mfh.loc[2015]['logSiteEnergyUseWN_kBtu'].values.reshape(-1, 1))

In [None]:
# R² on the 2016 rows, used here as a hold-out year.
reg.score(mfh.loc[2016]['logPropertyGFATotal'].values.reshape(-1, 1),
          mfh.loc[2016]['logSiteEnergyUseWN_kBtu'].values.reshape(-1, 1))