## Divide and Conquer
This notebook is to explore features and optimize models for each site_id. The idea is to resolve some data discrepancies that are present by dividing the data rather than cleaning.   

Note that this is just another approach, need not necessarily be better or worse, but probably can add some value to ensembles irrespective of its CV or public LB scores.

In [1]:
import gc
import os

import lightgbm as lgb
import numpy as np
import pandas as pd

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder

from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")


path_data = "../input/"
path_train = path_data + "train.csv"
path_test = path_data + "test.csv"
path_building = path_data + "building_metadata.csv"
path_weather_train = path_data + "weather_train.csv"
path_weather_test = path_data + "weather_test.csv"

myfavouritenumber = 13
seed = myfavouritenumber

## Preparing data
There are two files with features that need to be merged with the data. One is building metadata that has information on the buildings and the other is weather data that has information on the weather.

In [2]:
df_train = pd.read_csv(path_train)
df_test = pd.read_csv(path_test)

building = pd.read_csv(path_building)
le = LabelEncoder()
building.primary_use = le.fit_transform(building.primary_use)

weather_train = pd.read_csv(path_weather_train)
weather_test = pd.read_csv(path_weather_test)

weather_train.drop(["sea_level_pressure", "wind_direction", "wind_speed"], axis=1, inplace=True) #???
weather_test.drop(["sea_level_pressure", "wind_direction", "wind_speed"], axis=1, inplace=True) #???

weather_train = weather_train.groupby("site_id").apply(lambda group: group.interpolate(limit_direction="both"))
weather_test = weather_test.groupby("site_id").apply(lambda group: group.interpolate(limit_direction="both"))

df_train = df_train.merge(building, on="building_id")
df_train = df_train.merge(weather_train, on=["site_id", "timestamp"], how="left")
df_train = df_train[~((df_train.site_id==0) & (df_train.meter==0) & (df_train.building_id <= 104) & (df_train.timestamp < "2016-05-21"))]
df_train.reset_index(drop=True, inplace=True)
df_train.timestamp = pd.to_datetime(df_train.timestamp, format='%Y-%m-%d %H:%M:%S')
df_train["log_meter_reading"] = np.log1p(df_train.meter_reading)

df_test = df_test.merge(building, on="building_id")
df_test = df_test.merge(weather_test, on=["site_id", "timestamp"], how="left")
df_test.reset_index(drop=True, inplace=True)
df_test.timestamp = pd.to_datetime(df_test.timestamp, format='%Y-%m-%d %H:%M:%S')

del building, le
gc.collect()

0

In [3]:
## Memory Optimization

# Original code from https://www.kaggle.com/gemartin/load-data-reduce-memory-usage by @gemartin
# Modified to support timestamp type, categorical type
# Modified to add option to use float16

from pandas.api.types import is_datetime64_any_dtype as is_datetime
from pandas.api.types import is_categorical_dtype

def reduce_mem_usage(df, use_float16=False):
    """
    Iterate through all the columns of a dataframe and modify the data type to reduce memory usage.        
    """
    
    start_mem = df.memory_usage().sum() / 1024**2
    print("Memory usage of dataframe is {:.2f} MB".format(start_mem))
    
    for col in df.columns:
        if is_datetime(df[col]) or is_categorical_dtype(df[col]):
            continue
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if use_float16 and c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype("category")

    end_mem = df.memory_usage().sum() / 1024**2
    print("Memory usage after optimization is: {:.2f} MB".format(end_mem))
    print("Decreased by {:.1f}%".format(100 * (start_mem - end_mem) / start_mem))
    
    return df

In [4]:
df_train = reduce_mem_usage(df_train, use_float16=True)
df_test = reduce_mem_usage(df_test, use_float16=True)

weather_train.timestamp = pd.to_datetime(weather_train.timestamp, format='%Y-%m-%d %H:%M:%S')
weather_test.timestamp = pd.to_datetime(weather_test.timestamp, format='%Y-%m-%d %H:%M:%S')
weather_train = reduce_mem_usage(weather_train, use_float16=True)
weather_test = reduce_mem_usage(weather_test, use_float16=True)

Memory usage of dataframe is 2122.08 MB
Memory usage after optimization is: 663.15 MB
Decreased by 68.7%
Memory usage of dataframe is 4135.66 MB
Memory usage after optimization is: 1312.28 MB
Decreased by 68.3%
Memory usage of dataframe is 6.40 MB
Memory usage after optimization is: 2.27 MB
Decreased by 64.6%
Memory usage of dataframe is 12.69 MB
Memory usage after optimization is: 4.49 MB
Decreased by 64.6%


## Feature Engineering: Time
Creating time-based features.

In [5]:
df_train["hour"] = df_train.timestamp.dt.hour
df_train["weekday"] = df_train.timestamp.dt.weekday

df_test["hour"] = df_test.timestamp.dt.hour
df_test["weekday"] = df_test.timestamp.dt.weekday

## Feature Engineering: Aggregation
Creating aggregate features for buildings at various levels.

In [6]:
df_building_meter = df_train.groupby(["building_id", "meter"]).agg(
    mean_building_meter=("log_meter_reading", "mean"),
    median_building_meter=("log_meter_reading", "median")).reset_index()

df_train = df_train.merge(df_building_meter, on=["building_id", "meter"])
df_test = df_test.merge(df_building_meter, on=["building_id", "meter"])

df_building_meter_hour = df_train.groupby([
    "building_id", "meter", "hour"
]).agg(mean_building_meter=("log_meter_reading", "mean"),
       median_building_meter=("log_meter_reading", "median")).reset_index()

df_train = df_train.merge(df_building_meter_hour,
                          on=["building_id", "meter", "hour"])
df_test = df_test.merge(df_building_meter_hour,
                        on=["building_id", "meter", "hour"])

## Feature Engineering: Lags
Creating lag-based features. These are statistics of available features looking back in time by fixed intervals.   
These features are created in the weather data itself and then merged with the train and test data.

In [7]:
def create_lag_features(df, window):
    """
    Creating lag-based features looking back in time.
    """
    
    feature_cols = ["air_temperature", "cloud_coverage", "dew_temperature", "precip_depth_1_hr"]
    df_site = df.groupby("site_id")
    
    df_rolled = df_site[feature_cols].rolling(window=window, min_periods=0)
    
    df_mean = df_rolled.mean().reset_index().astype(np.float16)
    df_median = df_rolled.median().reset_index().astype(np.float16)
    df_min = df_rolled.min().reset_index().astype(np.float16)
    df_max = df_rolled.max().reset_index().astype(np.float16)
    df_std = df_rolled.std().reset_index().astype(np.float16)
    df_skew = df_rolled.skew().reset_index().astype(np.float16)
    
    for feature in feature_cols:
        df[f"{feature}_mean_lag{window}"] = df_mean[feature]
        df[f"{feature}_median_lag{window}"] = df_median[feature]
        df[f"{feature}_min_lag{window}"] = df_min[feature]
        df[f"{feature}_max_lag{window}"] = df_max[feature]
        df[f"{feature}_std_lag{window}"] = df_std[feature]
        df[f"{feature}_skew_lag{window}"] = df_std[feature]
        
    return df

## Features
Creating and selecting all the features.

In [8]:
weather_train = create_lag_features(weather_train, 3)
weather_train = create_lag_features(weather_train, 18)
weather_train = create_lag_features(weather_train, 72)
weather_train.drop(["air_temperature", "cloud_coverage", "dew_temperature", "precip_depth_1_hr"], axis=1, inplace=True)

df_train = df_train.merge(weather_train, on=["site_id", "timestamp"], how="left")

del weather_train
gc.collect()

38

In [9]:
categorical_features = [
    "building_id",
    "primary_use",
    "meter",
    "weekday",
    "hour"
]

all_features = [col for col in df_train.columns if col not in ["timestamp", "site_id", "meter_reading", "log_meter_reading"]]

## KFold Cross Validation with LGBM
Since the test data is out of time and longer than train data, creating a reliable validation strategy is going to be a major challenge. Just using a simple KFold CV here.

The folds are applied to each site individually, thus building 16 sites x 3 folds = 48 models in total.

In [11]:
df_train.head()

Unnamed: 0,building_id,meter,timestamp,meter_reading,site_id,primary_use,square_feet,year_built,floor_count,air_temperature,...,dew_temperature_min_lag72,dew_temperature_max_lag72,dew_temperature_std_lag72,dew_temperature_skew_lag72,precip_depth_1_hr_mean_lag72,precip_depth_1_hr_median_lag72,precip_depth_1_hr_min_lag72,precip_depth_1_hr_max_lag72,precip_depth_1_hr_std_lag72,precip_depth_1_hr_skew_lag72
0,0,0,2016-05-21,249.817001,0,0,7432,2008.0,,26.703125,...,18.296875,22.796875,0.956055,0.956055,5.375,0.0,-1.0,310.0,36.84375,36.84375
1,0,0,2016-05-22,251.182007,0,0,7432,2008.0,,28.90625,...,17.203125,22.796875,1.150391,1.150391,0.277832,0.0,-1.0,13.0,2.042969,2.042969
2,0,0,2016-05-23,237.531006,0,0,7432,2008.0,,26.703125,...,13.296875,22.796875,2.123047,2.123047,0.194458,0.0,-1.0,13.0,1.658203,1.658203
3,0,0,2016-05-24,223.197006,0,0,7432,2008.0,,26.09375,...,12.796875,21.703125,2.623047,2.623047,0.041656,0.0,-1.0,5.0,0.6152344,0.6152344
4,0,0,2016-05-25,208.863007,0,0,7432,2008.0,,26.09375,...,12.796875,21.703125,2.228516,2.228516,0.0,0.0,0.0,0.0,6.556511e-07,6.556511e-07


0

In [19]:
cv = 2
models = {}
site_model = {}
cv_scores = {"site_id": [], "meter_id":[], "cv_score": []}

for site_id in tqdm(range(16), desc="site_id"):
    
    models[site_id] = []
    site_model = {}
    for meter_id in range(4):
    
        print(cv, "fold CV for site_id:", site_id, "meter:", meter_id)
        
        kf = KFold(n_splits=cv, random_state=seed)
        
        site_model[meter_id] = []
        X_train_site = df_train[(df_train.site_id==site_id) & (df_train.meter==meter_id)].reset_index(drop=True)
        y_train_site = X_train_site.log_meter_reading
        y_pred_train_site = np.zeros(X_train_site.shape[0])

        score = 0
        
        if(len(X_train_site)==0):
            print("Site Id:", site_id, "meter Id:", meter_id, "nooooooo data!")
            continue

        for fold, (train_index, valid_index) in enumerate(kf.split(X_train_site, y_train_site)):
            X_train, X_valid = X_train_site.loc[train_index, all_features], X_train_site.loc[valid_index, all_features]
            y_train, y_valid = y_train_site.iloc[train_index], y_train_site.iloc[valid_index]

            dtrain = lgb.Dataset(X_train, label=y_train, categorical_feature=categorical_features)
            dvalid = lgb.Dataset(X_valid, label=y_valid, categorical_feature=categorical_features)

            watchlist = [dtrain, dvalid]

            params = {"objective": "regression",
                      "num_leaves": 31,
                      "learning_rate": 0.049,
                      "bagging_fraction": 0.93,
                      "feature_fraction": 0.87,
                      "metric": "rmse"
                      }

            model_lgb = lgb.train(params, train_set=dtrain, num_boost_round=2300, valid_sets=watchlist, verbose_eval=101, early_stopping_rounds=101)
            site_model[meter_id].append(model_lgb)

            y_pred_valid = model_lgb.predict(X_valid, num_iteration=model_lgb.best_iteration)
            y_pred_train_site[valid_index] = y_pred_valid

            rmse = np.sqrt(mean_squared_error(y_valid, y_pred_valid))
            print("Site Id:", site_id, "meter Id:", meter_id, ", Fold:", fold+1, ", RMSE:", rmse)
            score += rmse / cv

            gc.collect()
        
        cv_scores["site_id"].append(site_id)
        cv_scores["meter_id"].append(meter_id)
        cv_scores["cv_score"].append(score)
        
        print("\nSite Id:", site_id, "meter Id:", meter_id, ", CV RMSE:", np.sqrt(mean_squared_error(y_train_site, y_pred_train_site)), "\n")
    models[site_id].append(site_model)
        
    

site_id:   0%|          | 0/16 [00:00<?, ?it/s]

2 fold CV for site_id: 0 meter: 0
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.30277	valid_1's rmse: 0.327049
Early stopping, best iteration is:
[76]	training's rmse: 0.328625	valid_1's rmse: 0.31452
Site Id: 0 , Fold: 1 , RMSE: 0.3147001252094329
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.201057	valid_1's rmse: 0.443401
Early stopping, best iteration is:
[78]	training's rmse: 0.215826	valid_1's rmse: 0.440946
Site Id: 0 , Fold: 2 , RMSE: 0.44150648244136476

Site Id: 0 meter Id: 0 , CV RMSE: 0.3833822971676108 

2 fold CV for site_id: 0 meter: 1
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.89065	valid_1's rmse: 1.81455
Early stopping, best iteration is:
[65]	training's rmse: 1.00247	valid_1's rmse: 1.7975
Site Id: 0 , Fold: 1 , RMSE: 1.7727346342464745
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.972413	valid_1's rmse

site_id:   6%|▋         | 1/16 [00:08<02:11,  8.78s/it]

Site Id: 0 , Fold: 2 , RMSE: 1.7592474418217607

Site Id: 0 meter Id: 1 , CV RMSE: 1.7660039535824579 

2 fold CV for site_id: 0 meter: 2
2 fold CV for site_id: 0 meter: 3
2 fold CV for site_id: 1 meter: 0
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.224667	valid_1's rmse: 0.722135
[202]	training's rmse: 0.184048	valid_1's rmse: 0.721451
Early stopping, best iteration is:
[162]	training's rmse: 0.19548	valid_1's rmse: 0.720765
Site Id: 1 , Fold: 1 , RMSE: 0.7195955596849142
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.208585	valid_1's rmse: 0.474162
[202]	training's rmse: 0.156646	valid_1's rmse: 0.471913
[303]	training's rmse: 0.138676	valid_1's rmse: 0.470877
[404]	training's rmse: 0.12753	valid_1's rmse: 0.469931
[505]	training's rmse: 0.119763	valid_1's rmse: 0.46963
[606]	training's rmse: 0.113206	valid_1's rmse: 0.469143
[707]	training's rmse: 0.107616	valid_1's rmse: 0.469007
[808]	training's rms

site_id:  12%|█▎        | 2/16 [00:24<02:33, 10.94s/it]

Site Id: 1 , Fold: 2 , RMSE: 2.0926559691816355

Site Id: 1 meter Id: 3 , CV RMSE: 1.874528805456746 

2 fold CV for site_id: 2 meter: 0
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.320697	valid_1's rmse: 0.404735
Early stopping, best iteration is:
[84]	training's rmse: 0.342614	valid_1's rmse: 0.402645
Site Id: 2 , Fold: 1 , RMSE: 0.4025428043073553
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.270667	valid_1's rmse: 0.521056
Early stopping, best iteration is:
[88]	training's rmse: 0.281825	valid_1's rmse: 0.520581
Site Id: 2 , Fold: 2 , RMSE: 0.5204660480613595

Site Id: 2 meter Id: 0 , CV RMSE: 0.46525563751782684 

2 fold CV for site_id: 2 meter: 1
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.854272	valid_1's rmse: 1.03255
[202]	training's rmse: 0.763506	valid_1's rmse: 1.03283
Early stopping, best iteration is:
[143]	training's rmse: 0.804456	valid_1's rmse:

site_id:  19%|█▉        | 3/16 [00:55<03:37, 16.77s/it]

Site Id: 2 , Fold: 2 , RMSE: 1.0132434338199099

Site Id: 2 meter Id: 3 , CV RMSE: 0.9538066011595626 

2 fold CV for site_id: 3 meter: 0
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.295744	valid_1's rmse: 0.44017
[202]	training's rmse: 0.260695	valid_1's rmse: 0.435123
[303]	training's rmse: 0.241672	valid_1's rmse: 0.43355
[404]	training's rmse: 0.229379	valid_1's rmse: 0.432694
[505]	training's rmse: 0.220224	valid_1's rmse: 0.432441
[606]	training's rmse: 0.213449	valid_1's rmse: 0.432026
[707]	training's rmse: 0.207673	valid_1's rmse: 0.43185
Early stopping, best iteration is:
[667]	training's rmse: 0.210151	valid_1's rmse: 0.431761
Site Id: 3 , Fold: 1 , RMSE: 0.4328283510679374
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.308004	valid_1's rmse: 0.39228
[202]	training's rmse: 0.26934	valid_1's rmse: 0.394624
Early stopping, best iteration is:
[114]	training's rmse: 0.300584	valid_1's rmse: 0.39147

site_id:  25%|██▌       | 4/16 [01:47<05:31, 27.59s/it]

Site Id: 3 , Fold: 2 , RMSE: 0.3916030102285219

Site Id: 3 meter Id: 0 , CV RMSE: 0.41273073149957273 

2 fold CV for site_id: 3 meter: 1
2 fold CV for site_id: 3 meter: 2
2 fold CV for site_id: 3 meter: 3
2 fold CV for site_id: 4 meter: 0
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.140345	valid_1's rmse: 0.264287
Early stopping, best iteration is:
[89]	training's rmse: 0.14535	valid_1's rmse: 0.263483
Site Id: 4 , Fold: 1 , RMSE: 0.2634982170730112
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.204197	valid_1's rmse: 0.311672
[202]	training's rmse: 0.178917	valid_1's rmse: 0.29625
[303]	training's rmse: 0.168088	valid_1's rmse: 0.295243
[404]	training's rmse: 0.160665	valid_1's rmse: 0.294797
Early stopping, best iteration is:
[402]	training's rmse: 0.160812	valid_1's rmse: 0.29479


site_id:  31%|███▏      | 5/16 [02:03<04:24, 24.04s/it]

Site Id: 4 , Fold: 2 , RMSE: 0.3046612025821676

Site Id: 4 meter Id: 0 , CV RMSE: 0.28482429562755734 

2 fold CV for site_id: 4 meter: 1
2 fold CV for site_id: 4 meter: 2
2 fold CV for site_id: 4 meter: 3
2 fold CV for site_id: 5 meter: 0
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.493899	valid_1's rmse: 0.713113
Early stopping, best iteration is:
[76]	training's rmse: 0.533979	valid_1's rmse: 0.708026
Site Id: 5 , Fold: 1 , RMSE: 0.7062101098585115
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.511019	valid_1's rmse: 0.649477
[202]	training's rmse: 0.439769	valid_1's rmse: 0.626741
[303]	training's rmse: 0.400323	valid_1's rmse: 0.61077
[404]	training's rmse: 0.373508	valid_1's rmse: 0.60494
[505]	training's rmse: 0.354664	valid_1's rmse: 0.60138
[606]	training's rmse: 0.337614	valid_1's rmse: 0.597013
[707]	training's rmse: 0.323702	valid_1's rmse: 0.595049
[808]	training's rmse: 0.311084	valid_1's r

site_id:  38%|███▊      | 6/16 [02:35<04:23, 26.37s/it]

Site Id: 5 , Fold: 2 , RMSE: 0.610098823043413

Site Id: 5 meter Id: 0 , CV RMSE: 0.659906543817126 

2 fold CV for site_id: 5 meter: 1
2 fold CV for site_id: 5 meter: 2
2 fold CV for site_id: 5 meter: 3
2 fold CV for site_id: 6 meter: 0
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.113847	valid_1's rmse: 0.567571
[202]	training's rmse: 0.094923	valid_1's rmse: 0.561375
[303]	training's rmse: 0.0880039	valid_1's rmse: 0.560313
[404]	training's rmse: 0.0835327	valid_1's rmse: 0.559666
[505]	training's rmse: 0.0805015	valid_1's rmse: 0.55904
[606]	training's rmse: 0.0778016	valid_1's rmse: 0.558582
[707]	training's rmse: 0.0758442	valid_1's rmse: 0.558186
[808]	training's rmse: 0.0739801	valid_1's rmse: 0.558103
[909]	training's rmse: 0.0720982	valid_1's rmse: 0.557936
Early stopping, best iteration is:
[882]	training's rmse: 0.0726904	valid_1's rmse: 0.557873
Site Id: 6 , Fold: 1 , RMSE: 0.5527633570477961
Training until validation scores don't i

site_id:  44%|████▍     | 7/16 [02:57<03:44, 24.97s/it]


Site Id: 6 meter Id: 2 , CV RMSE: 1.7088259902990603 

2 fold CV for site_id: 6 meter: 3
2 fold CV for site_id: 7 meter: 0
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.941404	valid_1's rmse: 1.36793
Early stopping, best iteration is:
[68]	training's rmse: 1.1009	valid_1's rmse: 1.33736
Site Id: 7 , Fold: 1 , RMSE: 1.337364981346566
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.132864	valid_1's rmse: 3.47238
[202]	training's rmse: 0.11461	valid_1's rmse: 3.46767
[303]	training's rmse: 0.104778	valid_1's rmse: 3.46606
[404]	training's rmse: 0.0979614	valid_1's rmse: 3.46471
[505]	training's rmse: 0.0916536	valid_1's rmse: 3.46412
[606]	training's rmse: 0.0862576	valid_1's rmse: 3.46323
[707]	training's rmse: 0.081571	valid_1's rmse: 3.46312
[808]	training's rmse: 0.0775651	valid_1's rmse: 3.46286
[909]	training's rmse: 0.0743621	valid_1's rmse: 3.46257
Early stopping, best iteration is:
[858]	training's r

site_id:  50%|█████     | 8/16 [03:06<02:43, 20.41s/it]

Early stopping, best iteration is:
[67]	training's rmse: 1.23594	valid_1's rmse: 1.85632
Site Id: 7 , Fold: 2 , RMSE: 1.8563176675907185

Site Id: 7 meter Id: 3 , CV RMSE: 2.014611538654603 

2 fold CV for site_id: 8 meter: 0
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.410066	valid_1's rmse: 0.506266
[202]	training's rmse: 0.357235	valid_1's rmse: 0.50328
[303]	training's rmse: 0.328279	valid_1's rmse: 0.504354
Early stopping, best iteration is:
[215]	training's rmse: 0.352665	valid_1's rmse: 0.5031
Site Id: 8 , Fold: 1 , RMSE: 0.5025125012604219
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.381708	valid_1's rmse: 0.657619
[202]	training's rmse: 0.336434	valid_1's rmse: 0.660456
Early stopping, best iteration is:
[115]	training's rmse: 0.37231	valid_1's rmse: 0.656668


site_id:  56%|█████▋    | 9/16 [03:15<01:56, 16.69s/it]

Site Id: 8 , Fold: 2 , RMSE: 0.6666477206290783

Site Id: 8 meter Id: 0 , CV RMSE: 0.5903124848076743 

2 fold CV for site_id: 8 meter: 1
2 fold CV for site_id: 8 meter: 2
2 fold CV for site_id: 8 meter: 3
2 fold CV for site_id: 9 meter: 0
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.423325	valid_1's rmse: 0.597729
[202]	training's rmse: 0.304588	valid_1's rmse: 0.568622
[303]	training's rmse: 0.256189	valid_1's rmse: 0.564312
[404]	training's rmse: 0.232351	valid_1's rmse: 0.564327
Early stopping, best iteration is:
[378]	training's rmse: 0.237391	valid_1's rmse: 0.564033
Site Id: 9 , Fold: 1 , RMSE: 0.5589464861936153
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.387288	valid_1's rmse: 0.729175
[202]	training's rmse: 0.277369	valid_1's rmse: 0.705899
[303]	training's rmse: 0.234168	valid_1's rmse: 0.701277
[404]	training's rmse: 0.208842	valid_1's rmse: 0.69938
[505]	training's rmse: 0.193204	valid_1's

site_id:  62%|██████▎   | 10/16 [04:27<03:21, 33.51s/it]

Site Id: 9 , Fold: 2 , RMSE: 1.2031874570909684

Site Id: 9 meter Id: 2 , CV RMSE: 1.2086131997967986 

2 fold CV for site_id: 9 meter: 3
2 fold CV for site_id: 10 meter: 0
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.313768	valid_1's rmse: 0.65656
Early stopping, best iteration is:
[76]	training's rmse: 0.332779	valid_1's rmse: 0.640458
Site Id: 10 , Fold: 1 , RMSE: 0.6381965606793799
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.300208	valid_1's rmse: 0.546822
Early stopping, best iteration is:
[35]	training's rmse: 0.413539	valid_1's rmse: 0.53662
Site Id: 10 , Fold: 2 , RMSE: 0.5366201014738451

Site Id: 10 meter Id: 0 , CV RMSE: 0.5896000711479777 

2 fold CV for site_id: 10 meter: 1
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.668243	valid_1's rmse: 1.66893
Early stopping, best iteration is:
[99]	training's rmse: 0.670946	valid_1's rmse: 1.6689
Site Id: 10 

site_id:  69%|██████▉   | 11/16 [04:34<02:07, 25.49s/it]


Site Id: 10 meter Id: 3 , CV RMSE: 2.673855941222364 

2 fold CV for site_id: 11 meter: 0
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.102099	valid_1's rmse: 0.723679
Early stopping, best iteration is:
[78]	training's rmse: 0.112748	valid_1's rmse: 0.723341
Site Id: 11 , Fold: 1 , RMSE: 0.7199553315115751
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.0693934	valid_1's rmse: 0.368463
[202]	training's rmse: 0.0552492	valid_1's rmse: 0.363466
[303]	training's rmse: 0.0497373	valid_1's rmse: 0.362794
[404]	training's rmse: 0.0458841	valid_1's rmse: 0.362716
Early stopping, best iteration is:
[333]	training's rmse: 0.0484052	valid_1's rmse: 0.36267
Site Id: 11 , Fold: 2 , RMSE: 0.36218842120675926

Site Id: 11 meter Id: 0 , CV RMSE: 0.5698754828154072 

2 fold CV for site_id: 11 meter: 1
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.49242	valid_1's rmse: 0.875848
Earl

site_id:  75%|███████▌  | 12/16 [04:39<01:16, 19.22s/it]

[202]	training's rmse: 0.403717	valid_1's rmse: 1.02352
Early stopping, best iteration is:
[146]	training's rmse: 0.429488	valid_1's rmse: 1.02318
Site Id: 11 , Fold: 2 , RMSE: 1.0179036616912231

Site Id: 11 meter Id: 3 , CV RMSE: 1.0687559890275677 

2 fold CV for site_id: 12 meter: 0
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.350443	valid_1's rmse: 0.416814
[202]	training's rmse: 0.297694	valid_1's rmse: 0.398329
[303]	training's rmse: 0.270204	valid_1's rmse: 0.393472
[404]	training's rmse: 0.251765	valid_1's rmse: 0.390886
[505]	training's rmse: 0.23749	valid_1's rmse: 0.389723
[606]	training's rmse: 0.226398	valid_1's rmse: 0.388789
[707]	training's rmse: 0.218169	valid_1's rmse: 0.388401
[808]	training's rmse: 0.209894	valid_1's rmse: 0.387505
[909]	training's rmse: 0.2022	valid_1's rmse: 0.387039
[1010]	training's rmse: 0.195344	valid_1's rmse: 0.386518
[1111]	training's rmse: 0.18928	valid_1's rmse: 0.386226
[1212]	training's rmse: 0

site_id:  81%|████████▏ | 13/16 [04:57<00:56, 18.99s/it]

Site Id: 12 , Fold: 2 , RMSE: 0.40750042758112076

Site Id: 12 meter Id: 0 , CV RMSE: 0.3948678438191429 

2 fold CV for site_id: 12 meter: 1
2 fold CV for site_id: 12 meter: 2
2 fold CV for site_id: 12 meter: 3
2 fold CV for site_id: 13 meter: 0
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.33312	valid_1's rmse: 0.382939
[202]	training's rmse: 0.270349	valid_1's rmse: 0.387057
Early stopping, best iteration is:
[140]	training's rmse: 0.297055	valid_1's rmse: 0.382113
Site Id: 13 , Fold: 1 , RMSE: 0.382088665879786
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.228423	valid_1's rmse: 0.582526
Early stopping, best iteration is:
[83]	training's rmse: 0.24631	valid_1's rmse: 0.581284
Site Id: 13 , Fold: 2 , RMSE: 0.5803833076207529

Site Id: 13 meter Id: 0 , CV RMSE: 0.49134332821286986 

2 fold CV for site_id: 13 meter: 1
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.

site_id:  88%|████████▊ | 14/16 [06:24<01:18, 39.34s/it]

Site Id: 13 , Fold: 2 , RMSE: 1.6135501785114292

Site Id: 13 meter Id: 2 , CV RMSE: 1.6579690013768107 

2 fold CV for site_id: 13 meter: 3
2 fold CV for site_id: 14 meter: 0
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.625953	valid_1's rmse: 0.892536
[202]	training's rmse: 0.507264	valid_1's rmse: 0.884477
[303]	training's rmse: 0.444396	valid_1's rmse: 0.884135
Early stopping, best iteration is:
[227]	training's rmse: 0.487544	valid_1's rmse: 0.883735
Site Id: 14 , Fold: 1 , RMSE: 0.887660638763436
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.603045	valid_1's rmse: 0.998262
[202]	training's rmse: 0.484157	valid_1's rmse: 0.995518
[303]	training's rmse: 0.422044	valid_1's rmse: 0.993954
[404]	training's rmse: 0.380528	valid_1's rmse: 0.994003
Early stopping, best iteration is:
[343]	training's rmse: 0.401006	valid_1's rmse: 0.993229
Site Id: 14 , Fold: 2 , RMSE: 0.9644229214639962

Site Id: 14 meter I

site_id:  94%|█████████▍| 15/16 [07:07<00:40, 40.56s/it]

Site Id: 14 , Fold: 2 , RMSE: 1.6282911222079761

Site Id: 14 meter Id: 3 , CV RMSE: 1.7661661193845997 

2 fold CV for site_id: 15 meter: 0
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.221824	valid_1's rmse: 0.359303
[202]	training's rmse: 0.190301	valid_1's rmse: 0.348319
[303]	training's rmse: 0.180261	valid_1's rmse: 0.346346
[404]	training's rmse: 0.173324	valid_1's rmse: 0.345801
[505]	training's rmse: 0.168149	valid_1's rmse: 0.345663
[606]	training's rmse: 0.163726	valid_1's rmse: 0.345657
[707]	training's rmse: 0.159843	valid_1's rmse: 0.345516
[808]	training's rmse: 0.156029	valid_1's rmse: 0.345518
Early stopping, best iteration is:
[763]	training's rmse: 0.15768	valid_1's rmse: 0.345474
Site Id: 15 , Fold: 1 , RMSE: 0.34909907369190696
Training until validation scores don't improve for 101 rounds
[101]	training's rmse: 0.19721	valid_1's rmse: 0.307189
[202]	training's rmse: 0.164492	valid_1's rmse: 0.305814
[303]	training's rmse: 0.

site_id: 100%|██████████| 16/16 [07:57<00:00, 29.83s/it]

[101]	training's rmse: 0.238522	valid_1's rmse: 2.45703
Early stopping, best iteration is:
[65]	training's rmse: 0.26749	valid_1's rmse: 2.4528
Site Id: 15 , Fold: 2 , RMSE: 2.4527998292054445

Site Id: 15 meter Id: 3 , CV RMSE: 1.8966646577015345 






In [20]:
pd.DataFrame.from_dict(cv_scores)

Unnamed: 0,site_id,meter_id,cv_score
0,0,0,0.378103
1,0,1,1.765991
2,1,0,0.592588
3,1,3,1.860042
4,2,0,0.461504
5,2,1,1.134675
6,2,3,0.951827
7,3,0,0.412216
8,4,0,0.28408
9,5,0,0.658154


In [None]:
del df_train, X_train_site, y_train_site, X_train, y_train, dtrain, X_valid, y_valid, dvalid, y_pred_train_site, y_pred_valid, rmse, score, cv_scores
gc.collect()

## Scoring on test data
The test data for each site is scored individually using the 3 models, one from each fold. The final prediction is the average of the 3 models.

In [30]:
weather_test = create_lag_features(weather_test, 3)
weather_test = create_lag_features(weather_test, 18)
weather_test = create_lag_features(weather_test, 72)
weather_test.drop(["air_temperature", "cloud_coverage", "dew_temperature", "precip_depth_1_hr"], axis=1, inplace=True)

In [31]:
df_test_sites = []

for site_id in tqdm(range(16), desc="site_id"):
    
    for meter_id in range(4):
    
        print("Preparing test data for site_id", site_id, "and meter", meter_id)

        X_test_site = df_test[(df_test.site_id==site_id) & (df_test.meter==meter_id)]
        weather_test_site = weather_test[weather_test.site_id==site_id]
        
        X_test_site = X_test_site.merge(weather_test_site, on=["site_id", "timestamp"], how="left")

        row_ids_site = X_test_site.row_id

        X_test_site = X_test_site[all_features]
        y_pred_test_site = np.zeros(X_test_site.shape[0])
        
        if(len(X_test_site)==0):
            print("no data.. skip")
            continue

        print("Scoring for site_id", site_id, "meter_id", meter_id)    
        for fold in range(cv):
            model_lgb = models[site_id][0][meter_id][fold]
            y_pred_test_site += model_lgb.predict(X_test_site, num_iteration=model_lgb.best_iteration) / cv
            gc.collect()

        df_test_site = pd.DataFrame({"row_id": row_ids_site, "meter_reading": y_pred_test_site})
        df_test_sites.append(df_test_site)

        print("Scoring for site_id", site_id, "meter_id", meter_id,"completed\n")
        gc.collect()

site_id:   0%|          | 0/16 [00:00<?, ?it/s]

Preparing test data for site_id 0 and meter 0
Scoring for site_id 0 meter_id 0
Scoring for site_id 0 meter_id 0 completed

Preparing test data for site_id 0 and meter 1
Scoring for site_id 0 meter_id 1


site_id:   6%|▋         | 1/16 [00:04<01:05,  4.36s/it]

Scoring for site_id 0 meter_id 1 completed

Preparing test data for site_id 0 and meter 2
no data.. skip
Preparing test data for site_id 0 and meter 3
no data.. skip
Preparing test data for site_id 1 and meter 0
Scoring for site_id 1 meter_id 0
Scoring for site_id 1 meter_id 0 completed

Preparing test data for site_id 1 and meter 1
no data.. skip
Preparing test data for site_id 1 and meter 2
no data.. skip
Preparing test data for site_id 1 and meter 3
Scoring for site_id 1 meter_id 3


site_id:  12%|█▎        | 2/16 [00:11<01:13,  5.22s/it]

Scoring for site_id 1 meter_id 3 completed

Preparing test data for site_id 2 and meter 0
Scoring for site_id 2 meter_id 0
Scoring for site_id 2 meter_id 0 completed

Preparing test data for site_id 2 and meter 1
Scoring for site_id 2 meter_id 1
Scoring for site_id 2 meter_id 1 completed

Preparing test data for site_id 2 and meter 2
no data.. skip
Preparing test data for site_id 2 and meter 3
Scoring for site_id 2 meter_id 3


site_id:  19%|█▉        | 3/16 [00:21<01:24,  6.52s/it]

Scoring for site_id 2 meter_id 3 completed

Preparing test data for site_id 3 and meter 0
Scoring for site_id 3 meter_id 0
Scoring for site_id 3 meter_id 0 completed

Preparing test data for site_id 3 and meter 1
no data.. skip
Preparing test data for site_id 3 and meter 2
no data.. skip
Preparing test data for site_id 3 and meter 3


site_id:  25%|██▌       | 4/16 [00:45<02:23, 11.96s/it]

no data.. skip
Preparing test data for site_id 4 and meter 0
Scoring for site_id 4 meter_id 0


site_id:  31%|███▏      | 5/16 [00:52<01:52, 10.24s/it]

Scoring for site_id 4 meter_id 0 completed

Preparing test data for site_id 4 and meter 1
no data.. skip
Preparing test data for site_id 4 and meter 2
no data.. skip
Preparing test data for site_id 4 and meter 3
no data.. skip
Preparing test data for site_id 5 and meter 0
Scoring for site_id 5 meter_id 0


site_id:  38%|███▊      | 6/16 [01:10<02:06, 12.66s/it]

Scoring for site_id 5 meter_id 0 completed

Preparing test data for site_id 5 and meter 1
no data.. skip
Preparing test data for site_id 5 and meter 2
no data.. skip
Preparing test data for site_id 5 and meter 3
no data.. skip
Preparing test data for site_id 6 and meter 0
Scoring for site_id 6 meter_id 0
Scoring for site_id 6 meter_id 0 completed

Preparing test data for site_id 6 and meter 1
Scoring for site_id 6 meter_id 1
Scoring for site_id 6 meter_id 1 completed

Preparing test data for site_id 6 and meter 2
Scoring for site_id 6 meter_id 2


site_id:  44%|████▍     | 7/16 [01:19<01:43, 11.54s/it]

Scoring for site_id 6 meter_id 2 completed

Preparing test data for site_id 6 and meter 3
no data.. skip
Preparing test data for site_id 7 and meter 0
Scoring for site_id 7 meter_id 0
Scoring for site_id 7 meter_id 0 completed

Preparing test data for site_id 7 and meter 1
Scoring for site_id 7 meter_id 1
Scoring for site_id 7 meter_id 1 completed

Preparing test data for site_id 7 and meter 2
Scoring for site_id 7 meter_id 2
Scoring for site_id 7 meter_id 2 completed

Preparing test data for site_id 7 and meter 3
Scoring for site_id 7 meter_id 3


site_id:  50%|█████     | 8/16 [01:22<01:11,  8.97s/it]

Scoring for site_id 7 meter_id 3 completed

Preparing test data for site_id 8 and meter 0
Scoring for site_id 8 meter_id 0


site_id:  56%|█████▋    | 9/16 [01:25<00:51,  7.32s/it]

Scoring for site_id 8 meter_id 0 completed

Preparing test data for site_id 8 and meter 1
no data.. skip
Preparing test data for site_id 8 and meter 2
no data.. skip
Preparing test data for site_id 8 and meter 3
no data.. skip
Preparing test data for site_id 9 and meter 0
Scoring for site_id 9 meter_id 0
Scoring for site_id 9 meter_id 0 completed

Preparing test data for site_id 9 and meter 1
Scoring for site_id 9 meter_id 1
Scoring for site_id 9 meter_id 1 completed

Preparing test data for site_id 9 and meter 2
Scoring for site_id 9 meter_id 2


site_id:  62%|██████▎   | 10/16 [02:00<01:32, 15.44s/it]

Scoring for site_id 9 meter_id 2 completed

Preparing test data for site_id 9 and meter 3
no data.. skip
Preparing test data for site_id 10 and meter 0
Scoring for site_id 10 meter_id 0
Scoring for site_id 10 meter_id 0 completed

Preparing test data for site_id 10 and meter 1
Scoring for site_id 10 meter_id 1
Scoring for site_id 10 meter_id 1 completed

Preparing test data for site_id 10 and meter 2
no data.. skip
Preparing test data for site_id 10 and meter 3
Scoring for site_id 10 meter_id 3


site_id:  69%|██████▉   | 11/16 [02:02<00:57, 11.42s/it]

Scoring for site_id 10 meter_id 3 completed

Preparing test data for site_id 11 and meter 0
Scoring for site_id 11 meter_id 0
Scoring for site_id 11 meter_id 0 completed

Preparing test data for site_id 11 and meter 1
Scoring for site_id 11 meter_id 1
Scoring for site_id 11 meter_id 1 completed

Preparing test data for site_id 11 and meter 2
no data.. skip
Preparing test data for site_id 11 and meter 3
Scoring for site_id 11 meter_id 3


site_id:  75%|███████▌  | 12/16 [02:03<00:33,  8.33s/it]

Scoring for site_id 11 meter_id 3 completed

Preparing test data for site_id 12 and meter 0
Scoring for site_id 12 meter_id 0


site_id:  81%|████████▏ | 13/16 [02:12<00:25,  8.64s/it]

Scoring for site_id 12 meter_id 0 completed

Preparing test data for site_id 12 and meter 1
no data.. skip
Preparing test data for site_id 12 and meter 2
no data.. skip
Preparing test data for site_id 12 and meter 3
no data.. skip
Preparing test data for site_id 13 and meter 0
Scoring for site_id 13 meter_id 0
Scoring for site_id 13 meter_id 0 completed

Preparing test data for site_id 13 and meter 1
Scoring for site_id 13 meter_id 1
Scoring for site_id 13 meter_id 1 completed

Preparing test data for site_id 13 and meter 2
Scoring for site_id 13 meter_id 2


site_id:  88%|████████▊ | 14/16 [02:53<00:36, 18.37s/it]

Scoring for site_id 13 meter_id 2 completed

Preparing test data for site_id 13 and meter 3
no data.. skip
Preparing test data for site_id 14 and meter 0
Scoring for site_id 14 meter_id 0
Scoring for site_id 14 meter_id 0 completed

Preparing test data for site_id 14 and meter 1
Scoring for site_id 14 meter_id 1
Scoring for site_id 14 meter_id 1 completed

Preparing test data for site_id 14 and meter 2
Scoring for site_id 14 meter_id 2
Scoring for site_id 14 meter_id 2 completed

Preparing test data for site_id 14 and meter 3
Scoring for site_id 14 meter_id 3


site_id:  94%|█████████▍| 15/16 [03:11<00:18, 18.24s/it]

Scoring for site_id 14 meter_id 3 completed

Preparing test data for site_id 15 and meter 0
Scoring for site_id 15 meter_id 0
Scoring for site_id 15 meter_id 0 completed

Preparing test data for site_id 15 and meter 1
Scoring for site_id 15 meter_id 1
Scoring for site_id 15 meter_id 1 completed

Preparing test data for site_id 15 and meter 2
Scoring for site_id 15 meter_id 2
Scoring for site_id 15 meter_id 2 completed

Preparing test data for site_id 15 and meter 3
Scoring for site_id 15 meter_id 3


site_id: 100%|██████████| 16/16 [03:35<00:00, 13.48s/it]

Scoring for site_id 15 meter_id 3 completed






## Submission
Preparing final file for submission.

In [32]:
submit = pd.concat(df_test_sites)
submit.meter_reading = np.clip(np.expm1(submit.meter_reading), 0, a_max=None)

In [34]:
# should be (41697600, 2)
print(submit.shape)
submit.to_csv("../submission/sub_site_meter_base.csv", index=False)
submit.head()

(41697600, 2)


Unnamed: 0,row_id,meter_reading
0,0,183.095036
1,3096,183.081411
2,6192,184.294072
3,9288,189.648076
4,12384,190.205822
