## My very first EDA for ASHRAE

Start here --> [Discussion: Energy prediction - small summary](https://www.kaggle.com/c/ashrae-energy-prediction/discussion/112872#latest-670279)

### Useful EDAs

[EDA for ASHRAE from @nroman](https://www.kaggle.com/nroman/eda-for-ashrae)

NaNs in columns (share of non-NaN values):
* floor_count --> under 20% available (train and test)
* year_built --> only 40%
* cloud_coverage --> approx. 55%
* precip_depth_1_hr --> approx. 80%
* wind_direction, sea_level_pressure --> 90%
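
A minimal sketch of how availability figures like the ones above could be computed. The helper name `available_share` and the toy frame are made up for illustration; on the real data the function would be called on `building_metadata`, `weather_train` or a merged train frame.

```python
import numpy as np
import pandas as pd

def available_share(df: pd.DataFrame) -> pd.Series:
    """Percentage of non-NaN values per column, lowest first."""
    return (df.notna().mean() * 100).sort_values()

# Tiny made-up frame standing in for building_metadata (values are not real).
toy = pd.DataFrame({
    "floor_count": [np.nan, np.nan, np.nan, 3.0],
    "year_built": [1990.0, np.nan, 2005.0, np.nan],
    "square_feet": [1000, 2500, 800, 1200],
})

share = available_share(toy)
print(share)
```

Sorting ascending puts the sparsest columns first, which is usually where the imputation decisions start.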

site_id 0 starts from March

outliers
* building_id = 1099

meter 
* 0: electricity
* 1: chilledwater
* 2: steam
* 3: hotwater
* Not every building has all meter types
* Different meters == different units? --> is steam an outlier?
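
A rough sketch for checking both points: how many meter types each building has, and whether the reading scale differs by meter. The toy frame and its readings are invented; only the column names `building_id`, `meter` and `meter_reading` come from `train`.

```python
import pandas as pd

METER_NAMES = {0: "electricity", 1: "chilledwater", 2: "steam", 3: "hotwater"}

# Made-up rows in the shape of train.csv (values are not real).
toy_train = pd.DataFrame({
    "building_id":   [1, 1, 1, 2, 2, 3],
    "meter":         [0, 0, 1, 0, 3, 2],
    "meter_reading": [10.0, 12.0, 50.0, 8.0, 30.0, 900.0],
})

# Number of distinct meter types per building.
meters_per_building = toy_train.groupby("building_id")["meter"].nunique()

# Median reading per meter type, to compare scales between meters.
median_by_meter = (
    toy_train.assign(meter_name=toy_train["meter"].map(METER_NAMES))
             .groupby("meter_name")["meter_reading"]
             .median()
)

print(meters_per_building)
print(median_by_meter)
```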

primary_use
* education --> 40%
* services --> >35% mean meter reading

sea_level_pressure
* site_id 5 --> NaN

datetime
* day of week
* hour of day
* holidays?
* non-working days?
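
One possible way to derive these datetime features, as a sketch. The function name `add_time_features` is hypothetical, and using `USFederalHolidayCalendar` is an assumption: not every site has to be in the US, so the holiday flag would need a per-site calendar to be trustworthy.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

def add_time_features(df: pd.DataFrame, ts_col: str = "timestamp") -> pd.DataFrame:
    """Add hour, day of week, weekend and (US federal) holiday flags."""
    out = df.copy()
    ts = out[ts_col]
    out["hour"] = ts.dt.hour
    out["dayofweek"] = ts.dt.dayofweek          # Monday = 0, Sunday = 6
    out["is_weekend"] = (out["dayofweek"] >= 5).astype("int8")
    # Assumption: US federal holidays only; other countries need another calendar.
    holidays = USFederalHolidayCalendar().holidays(
        start=ts.min().normalize(), end=ts.max().normalize()
    )
    out["is_holiday"] = ts.dt.normalize().isin(holidays).astype("int8")
    return out

# Made-up timestamps: a Friday holiday, a Saturday, a regular Monday.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2016-01-01 10:00", "2016-01-02 15:00", "2016-01-04 08:00"]
    )
})
features = add_time_features(toy)
print(features)
```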

### Other EDAs

[ASHRAE Heatmaps from @jtrotman](https://www.kaggle.com/jtrotman/ashrae-heatmaps)
* See --> first heatmap: count of building types at each site
  
[Locate cities according weather temperature from @patrick0302](https://www.kaggle.com/patrick0302/locate-cities-according-weather-temperature)
* Very useful for filling NaNs
  
[ASHRAE WeatheR Analysis with OpenAir from @nicapotato](https://www.kaggle.com/nicapotato/ashrae-weather-analysis-with-openair)
* Nice EDA in R

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import datetime
import os

%matplotlib inline

In [None]:
def reduce_mem_usage(df, verbose=True):
    """Downcast numeric columns to the smallest dtype that holds their value range."""
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Inclusive bounds, so values equal to a type's min/max still fit
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                # otherwise the column stays int64
            else:
                # Note: float16 keeps only about 3 significant digits, so this step is lossy
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
building_metadata = pd.read_csv("../input/ashrae-energy-prediction/building_metadata.csv")
train = pd.read_csv("../input/ashrae-energy-prediction/train.csv",parse_dates=['timestamp'])
test = pd.read_csv("../input/ashrae-energy-prediction/test.csv",parse_dates=['timestamp'])
weather_train = pd.read_csv("../input/ashrae-energy-prediction/weather_train.csv",parse_dates=['timestamp'])
weather_test = pd.read_csv("../input/ashrae-energy-prediction/weather_test.csv",parse_dates=['timestamp'])

In [None]:
building_metadata = reduce_mem_usage(building_metadata)
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)
weather_train = reduce_mem_usage(weather_train)
weather_test = reduce_mem_usage(weather_test)

### Buildings by primary_use and site_id

In [None]:
building_metadata.groupby(['primary_use','site_id']).size().unstack().fillna(0).style.background_gradient(axis=None)

### NaN in buildings by site_id

In [None]:
by_site_id = building_metadata.groupby('site_id')

# count() skips NaN, so each row shows how many values are present per site
ind = ['total', 'square_feet', 'year_built', 'floor_count']
df = pd.DataFrame([by_site_id.building_id.count(),
                   by_site_id.square_feet.count(),
                   by_site_id.year_built.count(),
                   by_site_id.floor_count.count()
                  ],
                  index=ind)

fig, axes = plt.subplots(1, 1, figsize=(14, 6), dpi=100)

sns.heatmap(df, cmap='Blues', linewidths=1.5, annot=True, fmt="d", ax=axes)
      

### NaN in weather_train by site_id

In [None]:
by_site_id = weather_train.groupby('site_id')

ind = ['total',
       'air_temperature', 
       'dew_temperature', 
       'wind_speed',
       'wind_direction', 
       'sea_level_pressure', 
       'precip_depth_1_hr',
       'cloud_coverage'
      ]
df = pd.DataFrame([by_site_id.timestamp.count(),
                   by_site_id.air_temperature.count(),
                   by_site_id.dew_temperature.count(),
                   by_site_id.wind_speed.count(),
                   by_site_id.wind_direction.count(),
                   by_site_id.sea_level_pressure.count(),
                   by_site_id.precip_depth_1_hr.count(),
                   by_site_id.cloud_coverage.count()
                  ],
                  index=ind)

fig, axes = plt.subplots(1, 1, figsize=(14, 6), dpi=100)

sns.heatmap(df, cmap='Blues', linewidths=1.5, annot=True, fmt="d", ax=axes)



### NaN in weather_test by site_id

In [None]:
by_site_id = weather_test.groupby('site_id')

ind = ['total',
       'air_temperature', 
       'dew_temperature', 
       'wind_speed',
       'wind_direction', 
       'sea_level_pressure', 
       'precip_depth_1_hr',
       'cloud_coverage'
      ]
df = pd.DataFrame([by_site_id.timestamp.count(),
                   by_site_id.air_temperature.count(),
                   by_site_id.dew_temperature.count(),
                   by_site_id.wind_speed.count(),
                   by_site_id.wind_direction.count(),
                   by_site_id.sea_level_pressure.count(),
                   by_site_id.precip_depth_1_hr.count(),
                   by_site_id.cloud_coverage.count()
                  ],
                  index=ind)

fig, axes = plt.subplots(1, 1, figsize=(14, 6), dpi=100)

sns.heatmap(df, cmap='Blues', linewidths=1.5, annot=True, fmt="d", ax=axes)




### Air / Dew T sample

In [None]:
w = pd.concat([weather_train,weather_test])[["site_id","timestamp","air_temperature","dew_temperature"]]

In [None]:
def plot_weather_site(weather, site_id):
    
    lw = weather.query(f"site_id == {site_id}").copy()
    lw["day"] = lw["timestamp"].dt.floor("1d")  # floor, not ceil: keeps hours 0-23 on the same calendar day
    lw["hour"] = lw["timestamp"].dt.hour
    lw['air_temperature'] = lw['air_temperature'].astype(np.float32)
    lw['dew_temperature'] = lw['dew_temperature'].astype(np.float32)

    p1 = lw[['day','hour','air_temperature']].pivot_table(values='air_temperature', index=['day'], columns=['hour']).copy()
    p2 = lw[['day','hour','dew_temperature']].pivot_table(values='dew_temperature', index=['day'], columns=['hour']).copy()


    fig = go.Figure(data=[
        go.Surface(z=p1, colorscale='YlOrRd', opacity=0.9, showscale=False),
        go.Surface(z=p2, colorscale='RdBu', opacity=0.2, showscale=False)
    ])

    fig.update_layout(title_text=f'site_id {site_id}',
                      height=1000,
                      width=1000)
    fig.show()

In [None]:
plot_weather_site(w,1)

In [None]:
plot_weather_site(w,13)

### Building sample

In [None]:
t = train.merge(building_metadata, on='building_id', how='left')

In [None]:
def plot_building_meter_reading(train, building_id):
    fig = make_subplots(rows=2, 
                        cols=2,
                        specs=[[{'type': 'surface'}, {'type': 'surface'}],[{'type': 'surface'}, {'type': 'surface'}]],
                        subplot_titles=("meter = 0", "meter = 2", "meter = 1", "meter = 3"))
    t = train.query(f'building_id == {building_id}').copy()
    t["day"] = t["timestamp"].dt.floor("1d")  # floor, not ceil: keeps hours 0-23 on the same calendar day
    t["hour"] = t["timestamp"].dt.hour
    for m in range(4):
        p = t.query(f'meter == {m}')[['day','hour','meter_reading']].pivot_table(values='meter_reading', index=['day'], columns=['hour']).copy()
        fig.add_trace( go.Surface(z=p, colorscale='YlOrRd', showscale=False), row=1+m%2, col=1+m//2)
    
    fig.update_layout(title_text=f'Building {building_id}',
                      height=1000,
                      width=1000)
    fig.show()

In [None]:
plot_building_meter_reading(t,801)

In [None]:
plot_building_meter_reading(t,1230)

### Strategy

* year_built --> convert to decade (category). NaN = mode by site_id (ok)
* floor_count --> NaN = mode by primary_use (ok)
* precip_depth_1_hr --> replace -1 with 0; NaN = 0
* air_temperature, dew_temperature --> NaN = interpolate
* wind_direction --> NaN = mode by site_id
* precip_depth_1_hr --> NaN = 1 when cloud_coverage > 7, 5 when cloud_coverage > 8
* cloud_coverage --> NaN = 7 when precip_depth_1_hr > 1, 5 when precip_depth_1_hr > 5
* Removing weird data on site_id 0 (see https://www.kaggle.com/corochann/ashrae-training-lgbm-by-meter-type#Removing-weired-data-on-site_id-0) (ok)
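
Some of the fills listed above can be sketched as follows. This is only one way to do it, on invented toy frames; `fill_with_group_mode` is a hypothetical helper, and the interpolation assumes rows are sorted by timestamp within each site with roughly hourly spacing.

```python
import numpy as np
import pandas as pd

def fill_with_group_mode(df: pd.DataFrame, col: str, by: str) -> pd.Series:
    """Fill NaNs in `col` with the most frequent value of its `by` group."""
    def _fill(s: pd.Series) -> pd.Series:
        mode = s.mode()
        # A group with only NaNs has no mode and is left untouched.
        return s.fillna(mode.iloc[0]) if not mode.empty else s
    return df.groupby(by)[col].transform(_fill)

# Made-up building metadata (values are not real).
bm = pd.DataFrame({
    "site_id":     [0, 0, 0, 1, 1],
    "primary_use": ["Education", "Education", "Office", "Office", "Office"],
    "year_built":  [1995.0, 1995.0, np.nan, 1978.0, np.nan],
    "floor_count": [2.0, np.nan, 5.0, 5.0, np.nan],
})
bm["year_built"] = fill_with_group_mode(bm, "year_built", "site_id")
bm["decade"] = bm["year_built"] // 10 * 10
bm["floor_count"] = fill_with_group_mode(bm, "floor_count", "primary_use")

# Made-up weather rows for a single site.
wt = pd.DataFrame({
    "site_id": [0, 0, 0, 0],
    "air_temperature": [10.0, np.nan, 14.0, 15.0],
    "precip_depth_1_hr": [-1.0, np.nan, 0.0, 3.0],
})
# -1 --> 0, then NaN --> 0
wt["precip_depth_1_hr"] = wt["precip_depth_1_hr"].replace(-1, 0).fillna(0)
# Linear interpolation inside each site.
wt["air_temperature"] = wt.groupby("site_id")["air_temperature"].transform(
    lambda s: s.interpolate(limit_direction="both")
)

print(bm)
print(wt)
```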

### Ideas

* Weather is an inertial system --> use lags (see https://www.kaggle.com/corochann/ashrae-training-lgbm-by-meter-type)
  * precip_depth_1_hr, air_temperature, dew_temperature
* time to the hour of maximum temperature (by site_id)
* week of year instead of month and day (weeks to summer)
* day of week --> is it weekend?
* wind --> cos(wind_direction) --> wind_direction = 0° means north wind
* has_(electricity|chilledwater|steam|hotwater)_meter
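
A sketch of a few of these ideas on toy data. The function names are hypothetical, the lag sizes are arbitrary, and adding `wind_sin` next to `wind_cos` is my own assumption (the cosine alone cannot tell east from west). `shift()` moves by rows, so the lags are only true hourly lags if every site has one row per hour without gaps.

```python
import numpy as np
import pandas as pd

METER_NAMES = {0: "electricity", 1: "chilledwater", 2: "steam", 3: "hotwater"}

def add_weather_features(weather, lag_cols, lags=(1, 3)):
    """Per-site lags, wind direction components and a weekend flag."""
    out = weather.sort_values(["site_id", "timestamp"]).reset_index(drop=True)
    for col in lag_cols:
        for lag in lags:
            # Row-based shift inside each site (assumes gap-free hourly rows).
            out[f"{col}_lag{lag}"] = out.groupby("site_id")[col].shift(lag)
    rad = np.deg2rad(out["wind_direction"])   # degrees --> radians
    out["wind_cos"] = np.cos(rad)             # 1 for north, -1 for south
    out["wind_sin"] = np.sin(rad)             # assumption: added to separate east/west
    out["is_weekend"] = (out["timestamp"].dt.dayofweek >= 5).astype("int8")
    return out

def meter_flags(train):
    """One row per building with has_<meter>_meter columns (0/1)."""
    flags = (
        pd.crosstab(train["building_id"], train["meter"])
          .reindex(columns=list(range(4)), fill_value=0)
          .gt(0)
          .astype("int8")
    )
    flags.columns = [f"has_{METER_NAMES[m]}_meter" for m in flags.columns]
    return flags.reset_index()

# Made-up rows (values are not real); 2016-01-02 is a Saturday.
toy_weather = pd.DataFrame({
    "site_id": [0, 0, 0],
    "timestamp": pd.to_datetime(
        ["2016-01-02 00:00", "2016-01-02 01:00", "2016-01-02 02:00"]
    ),
    "air_temperature": [10.0, 11.0, 12.0],
    "wind_direction": [0.0, 90.0, 180.0],
})
toy_train = pd.DataFrame({"building_id": [1, 1, 2], "meter": [0, 1, 0]})

wf = add_weather_features(toy_weather, ["air_temperature"], lags=(1,))
mf = meter_flags(toy_train)
print(wf)
print(mf)
```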

### LGBM

* This --> https://www.kaggle.com/vbmokin/very-significant-safe-memory-lightgbm
* This --> https://www.kaggle.com/aitude/ashrae-kfold-lightgbm-without-leak-1-08

### Testing FE

In [None]:
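# ignore_index gives unique labels; with duplicated labels .loc would return extra rows
w = pd.concat([weather_train,weather_test], ignore_index=True)
w["hour"] = w["timestamp"].dt.hour
w["dmy"] = w["timestamp"].dt.floor('D')
# idxmax needs at least one non-NaN temperature per (site_id, day) group
w = w.dropna(subset=["air_temperature"])
w = w.loc[w.groupby(["site_id", "dmy"])["air_temperature"].idxmax()] 
w = w.groupby(["site_id"])
# most frequent hour of the daily maximum, per site
w.hour.apply(lambda x: x.mode())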

# Hour of T max by site_id
htmax = [19,14,0,19,0,12,20,0,19,21,0,0,14,0,20,20]

In [None]:
# Hour of T max by site_id
w = pd.concat([weather_train,weather_test])
w["hour"] = w["timestamp"].dt.hour

htmax = [19,14,0,19,0,12,20,0,19,21,0,0,14,0,20,20]
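w["htmax"] = w.site_id.apply(lambda x: htmax[x])
# absolute hour difference to the site's peak hour (0..23)
w["w_htmax"] = w.hour.sub(w.htmax).abs()
# 12 minus the circular distance to the peak: 12 at the peak hour, 0 twelve hours away
w["w_htmax"] = w.w_htmax.apply(lambda x: (12 - x) if x<12 else (x%12))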
del w["htmax"], w["hour"]
w.head(25)
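
The `w_htmax` feature above can also be written without `apply`. The sketch below assumes it is meant as "12 minus the circular hour distance to the peak" and checks that against the lambda from the cell above for every hour pair; `closeness_to_peak` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def closeness_to_peak(hour: pd.Series, peak_hour: pd.Series) -> pd.Series:
    """12 at the peak hour, falling to 0 twelve hours away (wraps at midnight)."""
    diff = (hour - peak_hour).abs()
    circular = np.minimum(diff, 24 - diff)   # distance on a 24-hour clock
    return 12 - circular

# Every (hour, peak) combination, to compare against the original formula.
grid = pd.DataFrame(
    [(h, p) for h in range(24) for p in range(24)], columns=["hour", "htmax"]
)
grid["vectorized"] = closeness_to_peak(grid["hour"], grid["htmax"])
grid["original"] = (
    grid["hour"].sub(grid["htmax"]).abs()
        .apply(lambda x: (12 - x) if x < 12 else (x % 12))
)
print((grid["vectorized"] == grid["original"]).all())
```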