## Introduction
This kernel was created for the Kaggle competition [ASHRAE - Great Energy Predictor III](https://www.kaggle.com/c/ashrae-energy-prediction). In this competition we need to build models that predict metered building energy usage for four meter types: chilled water, electricity, hot water, and steam.
![bulb](http://yesofcorsa.com/wp-content/uploads/2017/04/Lamp-Light-Wallpaper-Download-Free.jpg)<br><br>

In this kernel, we perform an **Exploratory Data Analysis (EDA)** of the entire dataset and train a model with the **LightGBM Regressor**. Before training the model, we also reduce the data-frame memory usage, pre-process the data (dealing with the NaN values), and engineer additional features.

## Load Packages and Data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import gc
from sklearn.model_selection import train_test_split
from sklearn.metrics import explained_variance_score,mean_absolute_error,mean_squared_error,r2_score
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor
from sklearn.model_selection import KFold, StratifiedKFold
import lightgbm as lgb
from tqdm.notebook import tqdm  # replaces the deprecated tqdm_notebook import
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
BASE_PATH = '/kaggle/input/ashrae-energy-prediction/'

In [None]:
def Pre_process_data(df,col):
    '''
    Input: Data-frame and column name.
    Operation: Fills the NaN values in the given column. "dew_temperature" and
               "air_temperature" are filled with a forward fill followed by a
               backward fill; any other column is filled with its minimum value.
    Output: Returns the pre-processed data-frame.
    '''
    #df['primary_use'] = df['primary_use'].astype("category").cat.codes
    print("Name of column with NaN: "+str(col))
    print(df[col].value_counts(dropna=False, normalize=True).head())
    if col=='dew_temperature' or col=='air_temperature':
        df[col] = df[col].ffill(axis = 0) 
        df[col] = df[col].bfill(axis = 0)
    else:
        df[col] = df[col].fillna(df[col].min())
    #df.drop(['building_id'], axis=1, inplace=True)
    return df
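
The fill strategy used in `Pre_process_data` can be tried on a toy frame. The sketch below is only an illustration: the helper name `fill_missing` and the sample values are made up, and it mirrors the logic above without the printing.

```python
import numpy as np
import pandas as pd

# Hypothetical helper mirroring the fill logic above (the name is illustrative).
def fill_missing(series, use_neighbours):
    if use_neighbours:
        # forward fill, then backward fill to cover a leading NaN
        return series.ffill().bfill()
    # otherwise fill with the smallest observed value in the column
    return series.fillna(series.min())

toy = pd.DataFrame({
    "air_temperature": [np.nan, 20.0, np.nan, 22.5],
    "cloud_coverage": [np.nan, 2.0, 6.0, np.nan],
})
toy["air_temperature"] = fill_missing(toy["air_temperature"], use_neighbours=True)
toy["cloud_coverage"] = fill_missing(toy["cloud_coverage"], use_neighbours=False)
print(toy)
```

Filling with the column minimum is a simple default; whether it suits a particular weather column is a modelling choice that this sketch does not evaluate.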

In [None]:
#Based on this great kernel https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65
def reduce_mem_usage(df):
    '''
    Input - data-frame.
    Operation - Reduce memory usage of the data-frame.
    '''
    start_mem_usg = df.memory_usage().sum() / 1024**2 
    print("Memory usage of properties dataframe is :",start_mem_usg," MB")
    #NAlist = [] # Keeps track of columns that have missing values filled in. 
    for col in df.columns:
        if df[col].dtype != object:  # Exclude strings            
            # Print current column type
            print("******************************")
            print("Column: ",col)
            print("dtype before: ",df[col].dtype)            
            # make variables for Int, max and min
            IsInt = False
            mx = df[col].max()
            mn = df[col].min()
            print("min for this col: ",mn)
            print("max for this col: ",mx)
            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(df[col]).all(): 
                #NAlist.append(col)
                df = Pre_process_data(df,col)
                   
            # test if column can be converted to an integer:
            # every value must equal its truncated integer form
            # (absolute differences, so positive and negative fractions cannot cancel)
            asint = df[col].fillna(0).astype(np.int64)
            result = (df[col] - asint).abs().sum()
            if result < 0.01:
                IsInt = True            
            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        df[col] = df[col].astype(np.uint8)
                    elif mx < 65535:
                        df[col] = df[col].astype(np.uint16)
                    elif mx < 4294967295:
                        df[col] = df[col].astype(np.uint32)
                    else:
                        df[col] = df[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        df[col] = df[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        df[col] = df[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        df[col] = df[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        df[col] = df[col].astype(np.int64)    
            # Make float datatypes 32 bit
            else:
                df[col] = df[col].astype(np.float32)
            
            # Print new column type
            print("dtype after: ",df[col].dtype)
            print("******************************")
    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = df.memory_usage().sum() / 1024**2 
    print("Memory usage is: ",mem_usg," MB")
    print("This is ",100*mem_usg/start_mem_usg,"% of the initial size")
    return df
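
The dtype selection inside `reduce_mem_usage` boils down to finding the narrowest integer type whose range contains the column's minimum and maximum. A minimal sketch of that idea follows; `smallest_int_dtype` is a hypothetical helper, not part of the kernel, and it uses inclusive bounds from `np.iinfo`, whereas the function above uses strict comparisons and is therefore slightly more conservative.

```python
import numpy as np

# Hypothetical helper: pick the narrowest integer dtype whose range contains
# [mn, mx], with inclusive bounds taken from np.iinfo.
def smallest_int_dtype(mn, mx):
    if mn >= 0:
        candidates = (np.uint8, np.uint16, np.uint32, np.uint64)
    else:
        candidates = (np.int8, np.int16, np.int32, np.int64)
    for dtype in candidates:
        info = np.iinfo(dtype)
        if info.min <= mn and mx <= info.max:
            return dtype
    raise ValueError("range does not fit in a 64-bit integer")

print(smallest_int_dtype(0, 255))     # fits in uint8
print(smallest_int_dtype(0, 256))     # needs uint16
print(smallest_int_dtype(-128, 127))  # fits in int8
```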

In [None]:
weather_train = pd.read_csv(BASE_PATH+'weather_train.csv')
weather_train = weather_train.set_index('timestamp')
weather_train.head()

In [None]:
weather_test = pd.read_csv(BASE_PATH+'weather_test.csv')
weather_test = weather_test.set_index('timestamp')
weather_test.head()

## EDA for weather data-set

In [None]:
sns.set(rc={'figure.figsize':(17,15)})
fig, ax = plt.subplots(2, 1)
weather_train['air_temperature'].plot(marker='.',color='r', alpha=0.1,ax=ax[0])
weather_test['air_temperature'].plot(marker='.', alpha=0.1,ax=ax[1])
ax[0].set_ylabel('Air Temperature')
ax[1].set_ylabel('Air Temperature')
#weather_test['air_temperature'].plot(marker='.', alpha=0.1);
#weather_train['air_temperature'].plot(marker='.', alpha=0.3);

In [None]:
sns.set(rc={'figure.figsize':(17,15)})
fig, ax = plt.subplots(2, 1)
weather_train['cloud_coverage'].plot(marker='o',color='r',alpha=0.3,linestyle='-',ax=ax[0])
weather_test['cloud_coverage'].plot(marker='o',linestyle='-', alpha=0.3,ax=ax[1])
ax[0].set_ylabel('Cloud Coverage')
ax[1].set_ylabel('Cloud Coverage')

In [None]:
sns.set(rc={'figure.figsize':(17,15)})
fig, ax = plt.subplots(2, 1)
weather_train['dew_temperature'].plot(marker='.',color='r',alpha=0.3,ax=ax[0])
weather_test['dew_temperature'].plot(marker='.', alpha=0.3,ax=ax[1])
ax[0].set_ylabel('Dew Temperature')
ax[1].set_ylabel('Dew Temperature')

In [None]:
sns.set(rc={'figure.figsize':(16,9)})
weather_train['air_temperature'].plot(marker='o',color='r',alpha=0.3,label='air_temperature')
weather_train['dew_temperature'].plot(marker='.',color='y', alpha=0.1,label='dew_temperature')
plt.legend()
#ax[0].set_ylabel('Dew Temperature')
#ax[1].set_ylabel('Dew Temperature')

In [None]:
sns.set(rc={'figure.figsize':(16,9)})
weather_test['air_temperature'].plot(marker='o',color='r',alpha=0.3,label='air_temperature')
weather_test['dew_temperature'].plot(marker='.',color='y', alpha=0.1,label='dew_temperature')
plt.legend()

In [None]:
sns.set(rc={'figure.figsize':(17,15)})
# pass x and y as keywords; recent seaborn versions no longer accept them positionally
k1 = sns.lineplot(x=weather_train['wind_direction'], y=weather_train['wind_speed'], label='2016-2017')
k2 = sns.lineplot(x=weather_test['wind_direction'], y=weather_test['wind_speed'], label='2017-2018')
plt.show()

In [None]:
train = pd.read_csv(BASE_PATH+'train.csv')
#train = train.set_index('timestamp')
train.head()

In [None]:
building = pd.read_csv(BASE_PATH+'building_metadata.csv')
building.head()

## EDA for our Training Data-frame

In [None]:
train_final = train.merge( building, left_on = "building_id",right_on = "building_id", how = "left")
train_final = reduce_mem_usage(train_final)
#train_final.head()

In [None]:
train_final = train_final.merge( weather_train, on=['site_id', 'timestamp'], how='left')
train_final = reduce_mem_usage(train_final)
train_final.head()

In [None]:
plt.figure(figsize=(20,20))
sns.countplot(y='primary_use', data=train_final)

In [None]:
#plt.figure(figsize=(20,20))
#N = 10000
#k=sns.barplot(train_final['meter_reading'][0:N], train_final['primary_use'][0:N])

In [None]:
train_final = train_final.set_index('timestamp')
sns.set(rc={'figure.figsize':(20,15)})
train_final['meter_reading'].plot(marker='o',c='r',alpha=0.3,label='2016-2017')
plt.legend()

> **I plan to add more EDA for the training data frame. Some of the plots take a very long time to render, sometimes an hour or more, so I will look for a faster way to produce them and then add them to the kernel.**
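
One possible way to speed up such plots, offered as an assumption rather than a tested solution for this kernel, is to aggregate the readings to a coarser time step before plotting. The sketch uses a synthetic hourly series; `hourly` and `daily` are illustrative names.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series standing in for meter readings (illustrative data).
index = pd.date_range("2016-01-01", periods=24 * 14, freq=pd.Timedelta(hours=1))
hourly = pd.Series(np.arange(len(index), dtype=float), index=index)

# Daily means: 24 times fewer points to draw, same overall shape.
daily = hourly.resample("D").mean()
print(len(hourly), len(daily))
```

Calling `daily.plot()` then draws far fewer markers than plotting every hourly row, at the cost of hiding short spikes.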

In [None]:
del train, k1, k2
gc.collect()

In [None]:
train_final = train_final.reset_index()
#train_final.drop(['timestamp'], axis=1, inplace=True)
train_final.head()

In [None]:
del building, weather_train
gc.collect()

## Feature Engineering
We derive new features from the columns already present in the data frame and convert categorical values into integer codes.
Feature engineering steps:
* primary_use - The "primary_use" column is converted into integer category codes.
* month - Only the month is extracted from the "timestamp" column.<br>
  Logic: Weather changes over the course of months, so energy consumption also differs month-wise.
* wind_load - The generic formula for wind load is F = A x P x Cd, where F is the force or wind load, A is the projected area of the object, P is the wind pressure, and Cd is the drag coefficient. The code does not use A or Cd; it uses sea_level_pressure x wind_speed^2 as a rough proxy.
* windspeed_mean and windspeed_diff_std - wind_speed divided by the mean wind speed for the same wind_direction, and wind_speed minus the standard deviation of wind speed for the same wind_direction.
* temp_diff - Difference between dew_temperature and air_temperature.
* Floor_Area - Square footage of the building multiplied by its number of floors.
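
To see what the group-relative wind feature and `temp_diff` look like on concrete numbers, here is a small sketch on made-up data. The column names follow the competition files; the values and the intermediate name `group_mean` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "wind_direction": [0, 0, 90, 90],
    "wind_speed": [2.0, 4.0, 1.0, 3.0],
    "air_temperature": [10.0, 12.0, 20.0, 25.0],
    "dew_temperature": [5.0, 6.0, 15.0, 18.0],
})

# Mean wind speed of each row's wind_direction group, aligned back to the rows.
group_mean = toy.groupby("wind_direction")["wind_speed"].transform("mean")
toy["windspeed_mean"] = toy["wind_speed"] / group_mean
toy["temp_diff"] = toy["dew_temperature"] - toy["air_temperature"]
print(toy)
```

`transform` returns a result with the same index as the input, which is why it can be divided into `wind_speed` row by row without a merge.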

In [None]:
def Feature_Engineering(df):
    '''
    Input: Data-frame
    Operation: Adding features to the data-frame
    Output: Improved data-frame.
    '''
    df['primary_use'] = df['primary_use'].astype("category").cat.codes
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df["month"] = df["timestamp"].dt.month.astype(np.uint8)
    df.drop(['building_id','timestamp'], axis=1, inplace=True)
    df['wind_load'] = df['sea_level_pressure']*(df['wind_speed']**2)
    df['windspeed_mean'] = df['wind_speed'] / df.groupby(['wind_direction'])['wind_speed'].transform('mean')
    df['windspeed_diff_std'] = df['wind_speed'] - df.groupby(['wind_direction'])['wind_speed'].transform('std')
    df['temp_diff'] = df['dew_temperature'] - df['air_temperature']
    df['Floor_Area'] = df['floor_count']*df['square_feet']
    
    return df

In [None]:
train_final = Feature_Engineering(train_final)
gc.collect()

In [None]:
train_final.head()

## Correlation matrix (Heat map)

In [None]:
f,ax = plt.subplots(figsize=(18,18))
sns.heatmap(train_final.corr(), annot=True, linewidths=.5, fmt= '.2f',ax=ax,cmap="YlGnBu")
plt.show()

## Lesson Learned
As far as I knew, data pre-processing and feature engineering are applied only to the input features, while the target variable is kept as it is.<br>I assumed the same here, but no matter how many times I fit the model and validated its performance, the RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) stayed very high. I tried different regressor algorithms, and the same thing happened again and again.<br>Then I went through some of the notebooks in the Notebooks panel of the competition and found that other users transform the target variable to obtain better accuracy. It was completely new to me that the target variable can also be transformed to get better results.<br>In my opinion, the main reason for the problem is that the model could not find a usable pattern for the raw meter readings: they are non-negative and can differ by orders of magnitude, so a squared-error loss on the raw values is dominated by the few largest readings. Once the target is transformed, that range is compressed, the model can relate the input features to the target, and the prediction errors drop.
<br> In this case we transform the target with **log(1 + x)** and map the predictions back with its inverse, exp(x) - 1.
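
A short numeric sketch of the transform and its inverse, on made-up values. It also illustrates a point not stated above: the competition is scored with RMSLE, and RMSE measured on log(1 + x) values is the RMSLE of the raw values by definition.

```python
import numpy as np

y_true = np.array([0.0, 10.0, 100.0, 1000.0])
y_pred = np.array([1.0, 12.0, 90.0, 1100.0])

# Train-time transform and its inverse.
y_log = np.log1p(y_true)
restored = np.expm1(y_log)

# RMSE computed on log1p values ...
rmse_on_log = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
# ... matches RMSLE written out with plain logarithms on the raw values.
rmsle = np.sqrt(np.mean((np.log(1 + y_pred) - np.log(1 + y_true)) ** 2))
print(rmse_on_log, rmsle)
```

`np.log1p` also handles readings of exactly zero, where a plain logarithm would return negative infinity.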

In [None]:
y = np.log1p(train_final["meter_reading"])
train_final.drop("meter_reading", axis=1, inplace=True)
X = train_final

In [None]:
#x_train, x_val, y_train, y_val = train_test_split(X, y,test_size=0.3,random_state=42)
del train_final
gc.collect()

## Model Fit
I tried to fit and validate the data with many regressor algorithms, and the LightGBM and CatBoost regressors turned out to perform best on this problem. With the CatBoost regressor we achieved an **explained variance score** of **0.711**, **MAE** of **0.734**, **MSE** of **1.33**, and **R² score** of **0.711** over 2000 iterations. I have commented out the portion where the CatBoost regressor is used for model fitting and evaluation.<br>I am thinking of creating another kernel where I would evaluate and show the performance of CatBoost with a higher number of iterations.<br> With the LightGBM regressor we achieved an **explained variance score** of **0.87**, **MAE** of **0.448**, **MSE** of **0.594**, and **R² score** of **0.87** over 2000 iterations.<br>
All of these metrics are computed on the log(1 + x)-transformed target. The performance of LightGBM can be seen in the output of the cell below.
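
In both results the explained variance and the R² score coincide. One plausible reading is that the residuals average out to roughly zero, because the two metrics differ only when the predictions carry a constant bias. The sketch below re-implements both metrics with NumPy under their standard definitions; these are illustrative helpers, not the scikit-learn functions used in the kernel.

```python
import numpy as np

def explained_variance(y_true, y_pred):
    # 1 - Var(residuals) / Var(y_true)
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)

def r2(y_true, y_pred):
    # 1 - SS_res / SS_tot
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
unbiased = y_true + np.array([0.5, -0.5, 0.5, -0.5])  # residuals average to zero
biased = unbiased + 1.0                                # same errors plus a constant offset

print(explained_variance(y_true, unbiased), r2(y_true, unbiased))
print(explained_variance(y_true, biased), r2(y_true, biased))
```

The constant offset leaves the explained variance unchanged but lowers R², so reporting both gives a quick check for systematic over- or under-prediction.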

In [None]:
folds = 4
seed = 42
kfold = KFold(n_splits = folds, shuffle = True, random_state = seed)
models = []
for train_index, val_index in kfold.split(X,y):
    # KFold yields positional indices, so select rows by position
    X_train, X_val = X.iloc[train_index,:], X.iloc[val_index,:]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]
    print({'train size':len(X_train), 'eval size':len(X_val)})

    clf = lgb.LGBMRegressor(metric='rmse',
                            learning_rate=0.4,
                            #feature_fraction= 0.9,
                            n_estimators= 600,
                            subsample=0.3,  # each iteration samples 30% of the rows
                            subsample_freq=1
                           )
    # Log the validation RMSE every 200 iterations. lgb.log_evaluation is assumed
    # to be available; older LightGBM releases used fit(..., verbose=200) instead.
    clf.fit(X_train, y_train,
                eval_set=[(X_val, y_val)],
                #early_stopping_rounds=50,
                callbacks=[lgb.log_evaluation(200)])
    #clf = CatBoostRegressor(iterations=2000,depth= 9,random_seed = 23,
    #                       task_type = "GPU")
    #clf.fit(X_train,y_train)
    y_pred = clf.predict(X_val)
    # metrics are computed on the log1p-transformed target
    print("\nVariance_Score\t:"+str(explained_variance_score(y_val,y_pred)))
    print("Mean_Absolute_Error\t:"+str(mean_absolute_error(y_val,y_pred)))
    print("Mean_Squared_Error\t:"+str(mean_squared_error(y_val,y_pred)))
    print("R2-Score\t:"+str(r2_score(y_val,y_pred)))
    models.append(clf)
    del X_train, X_val, y_train, y_val, clf
    gc.collect()

    print (20*'---')

In [None]:
plt.figure(figsize=(15,15))
plt.bar(range(len(models[0].feature_importances_)), models[0].feature_importances_)
plt.title("Feature Importance")
plt.xticks(np.arange(len(X.columns)),X.columns, rotation=90)
plt.show()

In [None]:
building = pd.read_csv(BASE_PATH+'building_metadata.csv')
test = pd.read_csv(BASE_PATH+'test.csv')
#test = test.set_index('timestamp')
#test.head()

In [None]:
test_final = test.merge( building, left_on = "building_id",right_on = "building_id", how = "left")
test_final = reduce_mem_usage(test_final)

In [None]:
test_final = test_final.merge( weather_test, on=['site_id', 'timestamp'], how='left')
test_final = reduce_mem_usage(test_final)
test_final.drop(['row_id'], axis=1, inplace=True)
#test_final.head()

In [None]:
del building,test,X,y,weather_test
gc.collect()

In [None]:
test_final = Feature_Engineering(test_final)
test_final.head()

## Submission
Preparing for submission. The submission code is inspired by the kernel [Starter EDA and Feature selection ASHRAE3](https://www.kaggle.com/hmendonca/starter-eda-and-feature-selection-ashrae3)

In [None]:
# split test data into batches
set_size = len(test_final)
iterations = 100
batch_size = set_size // iterations

print(set_size, iterations, batch_size)
assert set_size == iterations * batch_size

In [None]:
meter_reading = []
for i in tqdm(range(iterations)):
    pos = i*batch_size
    # predict one batch with every fold model and undo the log1p transform
    fold_preds = [np.expm1(model.predict(test_final.iloc[pos : pos+batch_size])) for model in models]
    meter_reading.extend(np.mean(fold_preds, axis=0))

#print(len(meter_reading))
assert len(meter_reading) == set_size
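
The loop above relies on the number of test rows being an exact multiple of the batch count. A variant that also covers a leftover partial batch could look like the sketch below; `batch_bounds`, `MeanModel`, `toy_X`, and `toy_models` are hypothetical stand-ins so that the example runs without the trained LightGBM models.

```python
import numpy as np

# Hypothetical helper: yield (start, stop) pairs that cover n_rows even when
# n_rows is not a multiple of batch_size.
def batch_bounds(n_rows, batch_size):
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

# Stand-in for a trained model: any object with a predict(X) method would do.
class MeanModel:
    def predict(self, X):
        return X.mean(axis=1)

toy_X = np.arange(20, dtype=float).reshape(10, 2)
toy_models = [MeanModel(), MeanModel()]

preds = []
for start, stop in batch_bounds(len(toy_X), batch_size=4):
    # average the predictions of all models for this batch
    fold_preds = [m.predict(toy_X[start:stop]) for m in toy_models]
    preds.extend(np.mean(fold_preds, axis=0))

print(len(preds))
```

With 10 rows and a batch size of 4, the last batch holds only 2 rows, and every row still receives exactly one prediction.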

The **numpy.clip()** function clips (limits) the values in an array. We use it here because a meter reading cannot be negative.

Given an interval, values outside the interval are clipped to the interval edges. For example, if an interval of [0, 1] is specified, values smaller than 0 become 0, and values larger than 1 become 1.
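
A minimal example of one-sided clipping, with made-up prediction values. Negative values can appear because `np.expm1` of a negative model output is negative.

```python
import numpy as np

raw_predictions = np.array([-0.3, 0.0, 2.5, 140.0])
# a_max=None leaves the upper side unbounded, so only negative values change
clipped = np.clip(raw_predictions, a_min=0, a_max=None)
print(clipped)
```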

In [None]:
submission = pd.read_csv(BASE_PATH+'sample_submission.csv')
submission['meter_reading'] = np.clip(meter_reading, a_min=0, a_max=None) # clip min at zero
del meter_reading, test_final
gc.collect()

In [None]:
submission.to_csv('submission2.csv', index=False)
submission.head(9)

## References
The following kernels helped me a lot in understanding the problem. I recommend exploring them to see the different kinds of EDA that can be done with the data and the different approaches to solving the prediction problem.
* [Starter EDA and Feature selection ASHRAE3](https://www.kaggle.com/hmendonca/starter-eda-and-feature-selection-ashrae3)
* [Simple LightGBM LB 1.24](https://www.kaggle.com/isaienkov/simple-lightgbm-lb-1-24)
* [Starter ⚡ Great Energy Predictor](https://www.kaggle.com/jesucristo/starter-great-energy-predictor)
* [ASHRAE - Energy prediction](https://www.kaggle.com/allunia/ashrae-energy-prediction)

![meme](https://pics.me.me/im-not-lazy-im-just-in-energy-saving-mode-7936534.png)

## Thank you
### Thank you for staying with me throughout the kernel. Please **up-vote** if you found this kernel informative or if you liked it.
### Also, comment below on how much you liked this kernel or what improvements could be added to it.