[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1RN93yQ0kqAwlgAQZLVs9fAezk_5DryAd?usp=sharing)

# Set Up

In [None]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
from numpy import percentile
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import RidgeCV

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Data Analysis and Preprocessing

In [None]:
# pandas couldn't identify the type of the following two columns
recs_df = pd.read_csv("../input/residential-energy-consumption-survey/recs2009_public.csv",dtype={"NOCRCASH":"object","NKRGALNC":"object"})

We use pandas profiling to get an overview of the data and a quick analysis. Because the full report is very large, I chose the minimal report, which skips the correlation calculations. I commented out the following lines, since the report was too large for Google Colab to save.

In [None]:
# profile = ProfileReport(recs_df, title="RECS Dataset Profile",minimal=True)
# profile.to_notebook_iframe()

In [None]:
# profile.to_file("recs_report.html")

Some important insights from a glance at the report:

* over 900 columns, high dimensional data --> curse-of-dimensionality a great concern

* many columns with almost constant values (zeros for example) --> redundant columns

* no missing values reported (missing values were most likely already imputed by the center that ran the survey)

* many columns with skewed distribution, must keep an eye on these when doing regression

## Target Variable Analysis

In [None]:
recs_df['KWH'].hist()

We can see that the distribution is a bit skewed. Since our end goal is to predict this value and it is continuous, we need a regression model. Strictly speaking, linear regression assumes normally distributed residuals rather than normally distributed variables, but heavily skewed targets and predictors often produce skewed residuals, so we ideally want them to be close to normal.
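One common way to handle a skewed target, sketched here on synthetic data rather than on the RECS columns, is to fit the regressor on a log-transformed target and map predictions back to the original scale. The data, the coefficients and the names `X_demo`, `y_demo` are assumptions made purely for illustration; this is not what the notebook does below.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, strictly positive and skewed target (illustration only, not RECS data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = np.exp(1.0 + X_demo @ np.array([0.5, -0.3, 0.2])
                + rng.normal(scale=0.1, size=500))

# The regressor is fitted on log1p(y); predictions are mapped back with expm1
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
model.fit(X_demo, y_demo)
pred_demo = model.predict(X_demo)
print(pred_demo[:5])
```

The wrapper keeps the transformation and its inverse together, so scores are still reported in the original units of the target.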

## Correlations

We will have a look at the correlation among variables in our dataset.

In [None]:
def get_high_corr_df(df,positive_threshold=0.4):
  # correlations are only defined for numeric columns, so skip object columns
  corr = df.select_dtypes(include='number').corr().stack().reset_index()
  corr.columns = ['FEATURE_1', 'FEATURE_2', 'CORRELATION']
  # keep off-diagonal pairs whose absolute correlation reaches the threshold
  high_corr = corr[(corr['FEATURE_1'] != corr['FEATURE_2']) & (corr['CORRELATION'].abs() >= positive_threshold)]
  return high_corr

In [None]:
high_corr = get_high_corr_df(recs_df)
high_corr[((high_corr['FEATURE_1'] == 'KWH'))]
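To make the shape of this pairwise table concrete, here is a minimal sketch on a toy frame. The columns `a`, `b`, `c` and the frame name `toy_corr` are made up for illustration; the 0.4 threshold matches the default used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy_corr = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # strongly tied to "a"
    "c": rng.normal(size=200),                      # independent noise
})

# Long format: one row per ordered pair of columns
pairs = toy_corr.corr().stack().reset_index()
pairs.columns = ["FEATURE_1", "FEATURE_2", "CORRELATION"]

# Keep off-diagonal pairs whose absolute correlation reaches the threshold
strong = pairs[(pairs["FEATURE_1"] != pairs["FEATURE_2"])
               & (pairs["CORRELATION"].abs() >= 0.4)]
print(strong)
```

Each strongly correlated pair appears twice, once in each order, which is why filtering on `FEATURE_1 == 'KWH'` is enough to list everything correlated with the target.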

There is a perfect correlation of 1 between KWH and BTUEL. According to the codebook provided on the website, both represent the same quantity (total site electricity usage) in different units: KWH in kilowatt-hours and BTUEL in thousand BTU. Furthermore, some columns hold components of KWH, such as electricity usage for air conditioning. I think we should drop these columns before training a prediction model; otherwise they would leak the target into the features and let the model cheat, since they are part of our target variable.

In [None]:
KWH_cheat_columns = [c for c in recs_df.columns if (("KWH" in c and len(c)>3) or "BTUEL" in c) ]
recs_df.drop(columns=KWH_cheat_columns,inplace=True)

Drop the ID column: it is unique for every row and carries no predictive information.

In [None]:
recs_df.drop(columns=["DOEID"],inplace=True)

Drop the imputation flag columns (names starting with "Z"). I think this extra information is irrelevant to prediction; moreover, these columns are mostly zeros and show no noticeable correlation with "KWH".

In [None]:
imputation_columns = [c for c in recs_df.columns if c.startswith("Z")]
recs_df.drop(columns=imputation_columns,inplace=True)

Some columns have an almost constant value, such as DOLKEROTH (mostly zeros) or AGEHHMEMCAT11 (mostly -2).

First, we'll find these columns.

In [None]:
columns_with_constant_values = []
for c in recs_df.columns:
  value_frequencies = recs_df[c].value_counts(normalize=True)
  if value_frequencies.max()>=0.85:
    columns_with_constant_values.append(c)
print(columns_with_constant_values)
print(len(columns_with_constant_values))
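The same check can be illustrated on a toy frame. The column names and the frame name `toy_const` are hypothetical; the 0.85 cut-off mirrors the one used above.

```python
import pandas as pd

toy_const = pd.DataFrame({
    "mostly_zero": [0] * 9 + [5],   # 90% of the rows share one value
    "varied": list(range(10)),      # every value is different
})

# Share of rows taken by the most frequent value in each column
top_share = toy_const.apply(lambda col: col.value_counts(normalize=True).max())
near_constant = top_share[top_share >= 0.85].index.tolist()
print(near_constant)
```

Only the column whose most frequent value covers at least 85% of the rows is flagged.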

Then, see if any of them have a meaningful correlation with our target variable.

In [None]:
high_corr[((high_corr['FEATURE_1'] == 'KWH') & (high_corr['FEATURE_2'].isin(columns_with_constant_values)))]

We'll drop the redundant columns.

In [None]:
recs_df.drop(columns=columns_with_constant_values,inplace=True)

Now we look for highly correlated pairs (correlation close to 1 or -1). If such pairs exist, we can perhaps keep only one variable of each pair and reduce the number of columns used to build the model.

In [None]:
high_corr = get_high_corr_df(recs_df) #calculate again, after deleting so many columns
redundant_sets=[]
for i,row in high_corr[(high_corr['CORRELATION']>=0.90)|(high_corr['CORRELATION']<=-0.90)].iterrows():
  f1 = row['FEATURE_1']
  f2 = row['FEATURE_2']
  fset = {f1,f2}
  belongs_to_sets = []
  for j in range(len(redundant_sets)):
    if len(redundant_sets[j].intersection(fset))!=0:  
      belongs_to_sets.append(j)

  if len(belongs_to_sets)==0:
    redundant_sets.append(fset)
  elif len(belongs_to_sets)==1:
    redundant_sets[belongs_to_sets[0]].update(fset)
  else:
    sets_to_merge = [redundant_sets[j] for j in belongs_to_sets]
    for sm in sets_to_merge:
      redundant_sets.remove(sm)
      fset.update(sm)
    redundant_sets.append(fset)
  

In [None]:
for s in redundant_sets:
  print(s)
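Merging overlapping pairs into groups amounts to finding connected components. As an alternative to the set-merging loop above, the grouping could be sketched with a small union-find; the function name `group_pairs` and the feature names are hypothetical and serve only as an illustration.

```python
def group_pairs(pairs):
    """Group items linked by pairs into connected components (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for f1, f2 in pairs:
        parent[find(f1)] = find(f2)  # link the two components

    groups = {}
    for item in parent:
        groups.setdefault(find(item), set()).add(item)
    return list(groups.values())

# Hypothetical feature names, for illustration only
demo_groups = group_pairs([("A", "B"), ("C", "D"), ("B", "C"), ("E", "F")])
print(demo_groups)
```

Here the pair `("B", "C")` joins two groups that were separate until then, which is the case the `else` branch of the loop above handles.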

The names in each group show that the variables are closely related, so we can use a single one as the representative of its group. For example, in the {'PELHOTWA', 'ELWATER'} pair, most respondents chose "not applicable" for the first and 0 (electricity not used for heating water) for the second: when they don't use it, they don't pay for it. Both columns carry the same information, so we can keep just one.

In [None]:
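for s in redundant_sets:
  # keep one deterministic representative per group (set.pop() is arbitrary)
  # and never drop the target variable
  candidates = sorted(s - {'KWH'})
  recs_df.drop(columns=candidates[1:],inplace=True)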

In [None]:
sns.heatmap(recs_df.select_dtypes(include='number').corr())

Even after removing redundant columns, there is still noticeable correlation between some columns. When using linear regression, we ideally want the predictors to be uncorrelated with each other.

## Noise and Outliers

Linear Regression is sensitive to outliers and we must take care of them before building a prediction model.

Strategy for dealing with outliers: if the outliers found in a column make up less than 1% of the rows, remove those rows; otherwise replace the outliers with the column mean.

Initially the cut-off was higher (I tried a range of values from 20% down to 1%). I noticed that even with a cut-off as low as 2%, we would still lose about 25% of the rows after outlier removal. So, in order not to lose too much data, I settled on the aforementioned strategy. It is not ideal, but it is a quick and general solution for now.

In [None]:
for c in list(recs_df.columns):
  # skip the non-numeric (object) columns
  if recs_df[c].dtype=='object':
    continue

  column = recs_df[c]
  # values further than 1.5*IQR from the quartiles count as outliers
  q25, q75 = percentile(column, 25), percentile(column, 75)
  iqr = q75 - q25
  cut_off = iqr * 1.5
  lower, upper = q25 - cut_off, q75 + cut_off
  is_outlier = (column < lower) | (column > upper)
  n_outliers = is_outlier.sum()
  if n_outliers>0:
    percentage = n_outliers/recs_df.shape[0]*100
    print('%s #outliers: %d %.2f%%' % (c,n_outliers,percentage))
    if percentage<1:
      # rare outliers: remove the affected rows
      recs_df.drop(index=column[is_outlier].index,inplace=True)
    else:
      # frequent outliers: replace them with the column mean
      # (mask() replaces the entries where the condition is True)
      recs_df[c] = column.mask(is_outlier,column.mean())

In [None]:
recs_df.shape[0]
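The replacement step is easy to get backwards, because `Series.where` keeps the values where the condition is True and `Series.mask` replaces them. A minimal sketch on a made-up series (the numbers are illustrative, not RECS values) shows the IQR fences and the mean replacement:

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 100])  # 100 is an obvious outlier

# IQR fences: 1.5 * IQR below the first and above the third quartile
q25, q75 = values.quantile(0.25), values.quantile(0.75)
iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# mask() replaces entries where the condition is True;
# where() would do the opposite and replace the non-outliers.
cleaned = values.mask(is_outlier, values.mean())
print(cleaned.tolist())
```

Note that the mean used here still includes the outlier itself, as in the strategy described above; a median would be less affected by it.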

## Skewness

Pandas profiling revealed that the distributions of many columns are highly skewed, so the normality assumption for linear regression will be violated. We will detect skewed columns and try to bring them closer to a normal distribution with a log transformation. The rule of thumb is that a distribution is skewed if its skewness falls outside the [-1, 1] range. But in the profiling report I noticed that almost no column has a perfectly normal distribution and most have a skewness of around 2-3. So I set the threshold a bit higher, in order not to transform too many columns and completely change the data.

In [None]:
for c in recs_df.columns:
  if recs_df[c].dtype=='object':
    continue
  skew = recs_df[c].skew()
  if skew>2 or skew<-2:
    print('%s skew before transformation: %f' % (c,skew))
    # shift the column so its minimum becomes 1, which keeps the log defined
    recs_df[c] = np.log(recs_df[c] + 1 - recs_df[c].min())
    print('%s skew after log transformation: %f' % (c,recs_df[c].skew()))
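The effect of the shifted log transform can be checked on a synthetic right-skewed sample. The lognormal data and the names `skewed` and `transformed` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))  # right-skewed sample

# Shift so the minimum maps to 1, then take the log (the minimum becomes ~0)
transformed = np.log(skewed + 1 - skewed.min())

print("skew before: %.2f" % skewed.skew())
print("skew after:  %.2f" % transformed.skew())
```

The shift by the column minimum is what keeps the logarithm defined for columns that contain zeros or negative codes such as -2.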

## Scaling

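The units of the columns are not uniform. Some come from multiple-choice questions and so hold only categorical codes. Some hold continuous values with large magnitudes.

However, we will not use neural networks or other models that are strongly sensitive to unscaled data, so we don't have to worry much about this for now. (Ordinary linear regression makes no assumption about the scale of the variables. A penalized model such as ridge regression does depend on it, but RidgeCV serves below mainly as the final estimator of the stack, where its inputs are the base models' predictions and share the scale of KWH.)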
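The claim about scale can be checked with a small experiment on synthetic data (the data and the penalty `alpha=10.0` are assumptions chosen for illustration). Ordinary least squares gives the same predictions after a column is rescaled, while a ridge penalty does not:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X_small = rng.normal(size=(200, 2))
y_small = X_small @ np.array([3.0, -2.0]) + rng.normal(scale=0.5, size=200)

# Same data with the first column expressed in a unit 1000 times smaller
X_rescaled = X_small * np.array([1000.0, 1.0])

ols_a = LinearRegression().fit(X_small, y_small).predict(X_small)
ols_b = LinearRegression().fit(X_rescaled, y_small).predict(X_rescaled)

ridge_a = Ridge(alpha=10.0).fit(X_small, y_small).predict(X_small)
ridge_b = Ridge(alpha=10.0).fit(X_rescaled, y_small).predict(X_rescaled)

print("OLS predictions match:  ", np.allclose(ols_a, ols_b))
print("Ridge predictions match:", np.allclose(ridge_a, ridge_b))
```

The ridge penalty acts on the size of the coefficients, and rescaling a column changes the coefficient needed to express the same effect, so the penalized fit changes with the units.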



## Train Test Sets

In [None]:
X = recs_df.loc[:,recs_df.columns!='KWH'].copy()
y = recs_df['KWH'].copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

## Dealing with Categorical Values

For categorical values we'll use label encoding. We fit the encoder only on the training set, so that no information from the test set leaks into training and helps the models cheat!

In [None]:
categorical_columns = [c for c in recs_df.columns if recs_df[c].dtype=='object']
for c in categorical_columns:
  print(c)
  encoder = LabelEncoder()
  encoder.fit(X_train[c].values)
  X_train[c] = encoder.transform(X_train[c])
  X_test[c] = encoder.transform(X_test[c])
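One caveat of fitting on the training set only is that `LabelEncoder.transform` raises an error when the test set contains a category it never saw. A possible alternative, sketched here on made-up values and assuming scikit-learn 0.24 or newer, is `OrdinalEncoder` with an explicit code for unknown categories:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy columns (hypothetical values, not taken from the RECS data)
train_cat = pd.DataFrame({"fuel": ["gas", "electric", "gas"]})
test_cat = pd.DataFrame({"fuel": ["electric", "wood"]})  # "wood" never seen in training

# Unseen categories are mapped to -1 instead of raising an error
demo_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
demo_encoder.fit(train_cat)
encoded_test = demo_encoder.transform(test_cat)
print(encoded_test.ravel().tolist())
```

Whether -1 is a sensible code for unseen categories depends on the model; tree-based models usually cope with it, while linear models would treat it as an ordinary number.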

# Prediction 

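We'll first try the regression models separately to predict KWH: gradient boosting, linear regression, and ridge regression. Then we'll see if we can improve the prediction by stacking the first two, with ridge regression as the final estimator, to build an ensemble model.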

In [None]:
from sklearn.linear_model import LinearRegression

# base learners first; the last entry serves as the stack's final estimator
estimators = [('gbr', GradientBoostingRegressor(random_state=42,max_depth=5)),
              ('lr', LinearRegression()),
              ('rc', RidgeCV())]

stack = StackingRegressor(estimators=estimators[:-1],final_estimator=estimators[-1][1])
print("Individual results:")
for estimator in estimators:
  # score() returns the R^2 on the held-out test set
  print("model: %s, score: %f" %(estimator[0],estimator[1].fit(X_train, y_train).score(X_test, y_test)))

print("################\nEnsemble result")
stack.fit(X_train, y_train).score(X_test, y_test)
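The scores printed above are R² values, since that is what `score()` returns for scikit-learn regressors. A minimal sketch on synthetic data (generated with `make_regression`, an assumption for illustration) confirms the equivalence and adds an error measure in the target's own units:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustration only, not the RECS data)
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_syn, y_syn, random_state=42)

reg = LinearRegression().fit(Xs_train, ys_train)
pred = reg.predict(Xs_test)

# score() on a regressor is the R^2 of its predictions
print("R^2 via score():  %.4f" % reg.score(Xs_test, ys_test))
print("R^2 via r2_score: %.4f" % r2_score(ys_test, pred))
# An error in the target's own units complements R^2
print("MAE:              %.2f" % mean_absolute_error(ys_test, pred))
```

Reporting an absolute error next to R² makes it easier to judge whether the predictions are accurate enough in practical terms.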