## Pre-processing Data

In this section, we prepare our data for a machine learning model. We create dummy variables for the discrete variables, rescale the numerical features to a common range, and split the data into training and testing sets.


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import os
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [2]:
#Loading Energy Efficiency dataframe
#Printing first five rows
df = pd.read_excel('../data/cleaned/data_cleaned.xlsx')
df.head()


,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Orientation,Glazing Area,Glazing Area Distribution,Heating Load
0,0.98,514.5,294.0,110.25,7.0,2,0.0,0,15.55
1,0.98,514.5,294.0,110.25,7.0,3,0.0,0,15.55
2,0.98,514.5,294.0,110.25,7.0,4,0.0,0,15.55
3,0.98,514.5,294.0,110.25,7.0,5,0.0,0,15.55
4,0.9,563.5,318.5,122.5,7.0,2,0.0,0,20.84


This analysis focuses only on the relationship between building features and Heating Load. For that reason, Cooling Load, the second y variable, has been removed from the dataframe.

In [3]:
#Checking dataframe size and data type
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Relative Compactness       768 non-null    float64
 1   Surface Area               768 non-null    float64
 2   Wall Area                  768 non-null    float64
 3   Roof Area                  768 non-null    float64
 4   Overall Height             768 non-null    float64
 5   Orientation                768 non-null    int64  
 6   Glazing Area               768 non-null    float64
 7   Glazing Area Distribution  768 non-null    int64  
 8   Heating Load               768 non-null    float64
dtypes: float64(7), int64(2)
memory usage: 54.1 KB


Most of our variables are float values, with the exception of Orientation and Glazing Area Distribution, which hold integer values.

In [4]:
#Checking for null values
df.isna().sum()

Relative Compactness         0
Surface Area                 0
Wall Area                    0
Roof Area                    0
Overall Height               0
Orientation                  0
Glazing Area                 0
Glazing Area Distribution    0
Heating Load                 0
dtype: int64

There are no missing values present in our dataset. 

## Creating Dummy Variables

We'll generate dummy variables for our discrete features, namely Orientation and Glazing Area Distribution. Although both are stored as integers, they act as categorical variables. We'll pass these two columns to the pandas.get_dummies() function to transform them into dummy variables.

In [5]:
#Selecting the integer values (Orientation and Glazing Area Distribution)
df_int = df.select_dtypes(include = 'int').columns
df_int

Index(['Orientation', 'Glazing Area Distribution'], dtype='object')

In [6]:
df_dummies = pd.get_dummies(df, columns = df_int, drop_first= True)
df_dummies.head()

,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area,Heating Load,Orientation_3,Orientation_4,Orientation_5,Glazing Area Distribution_1,Glazing Area Distribution_2,Glazing Area Distribution_3,Glazing Area Distribution_4,Glazing Area Distribution_5
0,0.98,514.5,294.0,110.25,7.0,0.0,15.55,False,False,False,False,False,False,False,False
1,0.98,514.5,294.0,110.25,7.0,0.0,15.55,True,False,False,False,False,False,False,False
2,0.98,514.5,294.0,110.25,7.0,0.0,15.55,False,True,False,False,False,False,False,False
3,0.98,514.5,294.0,110.25,7.0,0.0,15.55,False,False,True,False,False,False,False,False
4,0.9,563.5,318.5,122.5,7.0,0.0,20.84,False,False,False,False,False,False,False,False
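Because of `drop_first=True`, the first level of each variable (Orientation 2 and Glazing Area Distribution 0) gets no column of its own and acts as the baseline, which is why the first row above is False in every dummy column. A minimal sketch on a hypothetical toy frame (the `toy` data is made up for illustration and is not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: one row per Orientation level
toy = pd.DataFrame({"Orientation": [2, 3, 4, 5]})

# drop_first=True removes the column for the first level (2),
# so a row with no active dummy column corresponds to Orientation == 2
toy_dummies = pd.get_dummies(toy, columns=["Orientation"], drop_first=True)

print(toy_dummies.columns.tolist())

# Number of active dummy columns per row: none for the baseline, one otherwise
active_per_row = toy_dummies.astype(int).sum(axis=1).tolist()
print(active_per_row)
```

Dropping one level per variable avoids perfectly collinear dummy columns, which matters for linear models in particular.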


In [7]:
#Using astype(int) so that the Orientation and Glazing Area Distribution dummy columns show up as integer values (0/1) instead of Boolean values.

bool_to_int = ["Orientation_3", "Orientation_4", "Orientation_5", "Glazing Area Distribution_1", "Glazing Area Distribution_2", "Glazing Area Distribution_3", "Glazing Area Distribution_4", "Glazing Area Distribution_5"]

for column in bool_to_int:
    df_dummies[column] = df_dummies[column].astype(int)
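As an alternative to converting the columns afterwards, `pd.get_dummies` accepts a `dtype` argument that produces integer dummies directly. The sketch below uses a hypothetical two-row stand-in (`toy`) rather than the notebook's dataframe:

```python
import pandas as pd

# Hypothetical stand-in for two rows of the notebook's dataframe
toy = pd.DataFrame({"Orientation": [2, 3], "Glazing Area Distribution": [0, 1]})

# dtype=int makes get_dummies emit 0/1 integers instead of booleans
toy_dummies = pd.get_dummies(
    toy,
    columns=["Orientation", "Glazing Area Distribution"],
    drop_first=True,
    dtype=int,
)

print(toy_dummies.dtypes)
print(toy_dummies)
```

Either approach gives the same 0/1 columns; this one simply avoids maintaining a hand-written list of column names.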


In [8]:
#Checking data types in new dataframe
df_dummies.dtypes

#Checking first few rows of data in dataframe
df_dummies.head()

,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area,Heating Load,Orientation_3,Orientation_4,Orientation_5,Glazing Area Distribution_1,Glazing Area Distribution_2,Glazing Area Distribution_3,Glazing Area Distribution_4,Glazing Area Distribution_5
0,0.98,514.5,294.0,110.25,7.0,0.0,15.55,0,0,0,0,0,0,0,0
1,0.98,514.5,294.0,110.25,7.0,0.0,15.55,1,0,0,0,0,0,0,0
2,0.98,514.5,294.0,110.25,7.0,0.0,15.55,0,1,0,0,0,0,0,0
3,0.98,514.5,294.0,110.25,7.0,0.0,15.55,0,0,1,0,0,0,0,0
4,0.9,563.5,318.5,122.5,7.0,0.0,20.84,0,0,0,0,0,0,0,0


In [9]:
df_dummies.rename(columns = 
                  {"Orientation_3": "Orientation of 3", "Orientation_4": "Orientation of 4", "Orientation_5": "Orientation of 5",
                  "Glazing Area Distribution_1": "Glazing Area Distribution of 1", "Glazing Area Distribution_2": "Glazing Area Distribution of 2", 
                  "Glazing Area Distribution_3": "Glazing Area Distribution of 3", "Glazing Area Distribution_4": "Glazing Area Distribution of 4", 
                  "Glazing Area Distribution_5": "Glazing Area Distribution of 5"}, inplace = True
                  )
df_dummies

,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area,Heating Load,Orientation of 3,Orientation of 4,Orientation of 5,Glazing Area Distribution of 1,Glazing Area Distribution of 2,Glazing Area Distribution of 3,Glazing Area Distribution of 4,Glazing Area Distribution of 5
0,0.98,514.5,294.0,110.25,7.0,0.0,15.55,0,0,0,0,0,0,0,0
1,0.98,514.5,294.0,110.25,7.0,0.0,15.55,1,0,0,0,0,0,0,0
2,0.98,514.5,294.0,110.25,7.0,0.0,15.55,0,1,0,0,0,0,0,0
3,0.98,514.5,294.0,110.25,7.0,0.0,15.55,0,0,1,0,0,0,0,0
4,0.90,563.5,318.5,122.50,7.0,0.0,20.84,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
763,0.64,784.0,343.0,220.50,3.5,0.4,17.88,0,0,1,0,0,0,0,1
764,0.62,808.5,367.5,220.50,3.5,0.4,16.54,0,0,0,0,0,0,0,1
765,0.62,808.5,367.5,220.50,3.5,0.4,16.44,1,0,0,0,0,0,0,1
766,0.62,808.5,367.5,220.50,3.5,0.4,16.48,0,1,0,0,0,0,0,1


In [10]:
#Checking the number of rows and columns in the df_dummies dataframe
df_dummies.shape

(768, 15)

In [11]:
#Double checking value type
df_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 15 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Relative Compactness            768 non-null    float64
 1   Surface Area                    768 non-null    float64
 2   Wall Area                       768 non-null    float64
 3   Roof Area                       768 non-null    float64
 4   Overall Height                  768 non-null    float64
 5   Glazing Area                    768 non-null    float64
 6   Heating Load                    768 non-null    float64
 7   Orientation of 3                768 non-null    int32  
 8   Orientation of 4                768 non-null    int32  
 9   Orientation of 5                768 non-null    int32  
 10  Glazing Area Distribution of 1  768 non-null    int32  
 11  Glazing Area Distribution of 2  768 non-null    int32  
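 12  Glazing Area Distribution of 3  768 non-null    int32  
 13  Glazing Area Distribution of 4  768 non-null    int32  
 14  Glazing Area Distribution of 5  768 non-null    int32  
dtypes: float64(7), int32(8)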

In [12]:
#Double checking for missing values
df_dummies.isna().sum()

Relative Compactness              0
Surface Area                      0
Wall Area                         0
Roof Area                         0
Overall Height                    0
Glazing Area                      0
Heating Load                      0
Orientation of 3                  0
Orientation of 4                  0
Orientation of 5                  0
Glazing Area Distribution of 1    0
Glazing Area Distribution of 2    0
Glazing Area Distribution of 3    0
Glazing Area Distribution of 4    0
Glazing Area Distribution of 5    0
dtype: int64

In [13]:
#Re-checking the shape of the dataframe after creating and renaming the dummy columns
df_dummies.shape

(768, 15)

## Scaling Data

In this section we use MinMaxScaler, which rescales each feature to the range 0 to 1. Our dataset contains variables with different units and very different scales (e.g., Surface Area versus Relative Compactness). We split the data into training and test sets using a 75/25 split, then fit the scaler on the training data only, so that no information from the test set leaks into pre-processing and inflates the test results.
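To make the transformation concrete, here is a small sketch of what MinMaxScaler computes, namely (x - min) / (max - min) with the minimum and maximum taken from the training rows. The `train` and `test` arrays are hypothetical values chosen for illustration, not the notebook's actual split:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical Surface Area values, made up for illustration
train = np.array([[514.5], [563.5], [808.5]])
test = np.array([[661.5], [833.0]])

scaler = MinMaxScaler()
scaler.fit(train)  # learns min and max from the training rows only

# Manual version of the same transform: (x - min) / (max - min)
train_min, train_max = train.min(axis=0), train.max(axis=0)
manual_test = (test - train_min) / (train_max - train_min)

scaled_test = scaler.transform(test)
print(np.allclose(scaled_test, manual_test))
print(scaled_test.ravel())
```

Note that a test value larger than the training maximum (833.0 here) is mapped above 1. That is expected when the scaler is fit on the training data alone.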


In [14]:
df_dummies.head()

,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area,Heating Load,Orientation of 3,Orientation of 4,Orientation of 5,Glazing Area Distribution of 1,Glazing Area Distribution of 2,Glazing Area Distribution of 3,Glazing Area Distribution of 4,Glazing Area Distribution of 5
0,0.98,514.5,294.0,110.25,7.0,0.0,15.55,0,0,0,0,0,0,0,0
1,0.98,514.5,294.0,110.25,7.0,0.0,15.55,1,0,0,0,0,0,0,0
2,0.98,514.5,294.0,110.25,7.0,0.0,15.55,0,1,0,0,0,0,0,0
3,0.98,514.5,294.0,110.25,7.0,0.0,15.55,0,0,1,0,0,0,0,0
4,0.9,563.5,318.5,122.5,7.0,0.0,20.84,0,0,0,0,0,0,0,0


In [15]:
#X - all feature columns (everything except the target)
#y - our target variable, Heating Load
X = df_dummies.drop('Heating Load', axis = 1)
y = df_dummies['Heating Load']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 30)
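With 768 rows and `test_size = 0.25`, the split should yield 576 training rows and 192 test rows. The sketch below checks this on placeholder arrays that only share the row count with the dataset (the values themselves are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (768)
X = np.arange(768 * 2).reshape(768, 2)
y = np.arange(768)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=30
)

print(X_train.shape, X_test.shape)  # (576, 2) (192, 2)
```

Fixing `random_state` makes the shuffle reproducible, so the modeling notebook can recreate the same split.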


In [16]:
X_scaling = ["Relative Compactness", "Surface Area", "Wall Area", "Roof Area", "Overall Height"]

After the feature selection process, we edited the X_scaling list to remove Glazing Area (correlation of 0.26 with the target). We will not use this feature in our model because of its low correlation with Heating Load.
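A correlation figure like the one quoted above can be read off the correlation matrix. The following is only a sketch of that check on hypothetical numbers (the `toy` values are invented and will not reproduce 0.26):

```python
import pandas as pd

# Hypothetical values, made up for illustration only
toy = pd.DataFrame({
    "Overall Height": [3.5, 3.5, 7.0, 7.0],
    "Glazing Area": [0.0, 0.4, 0.1, 0.25],
    "Heating Load": [12.0, 16.0, 24.0, 30.0],
})

# Pearson correlation of every feature with the target column
corr_with_target = toy.corr()["Heating Load"].drop("Heating Load")
print(corr_with_target.sort_values(ascending=False))
```

Pearson correlation only captures linear association, so a low value is a reasonable screening signal rather than proof that a feature carries no information.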

In [17]:
#Storing Scaler into X_scale variable
X_scale = MinMaxScaler()

#Fitting X data 
X_scale.fit(X_train[X_scaling])

#Transforming X data for training and testing
X_train_scale = X_scale.transform(X_train[X_scaling])
X_test_scale = X_scale.transform(X_test[X_scaling])
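`transform` returns a NumPy array containing only the columns listed in X_scaling, so the dummy columns are not part of X_train_scale or X_test_scale. One possible way to recombine them, sketched on a hypothetical three-row frame (the name `X_train_ready` is an assumption and does not appear in the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training rows with one scaled feature and one dummy column
X_train = pd.DataFrame(
    {"Surface Area": [514.5, 661.5, 808.5], "Orientation of 3": [0, 1, 0]},
    index=[10, 4, 7],
)
X_scaling = ["Surface Area"]

X_scale = MinMaxScaler().fit(X_train[X_scaling])

# Rebuild a DataFrame from the scaled array, keeping the original row index
scaled = pd.DataFrame(
    X_scale.transform(X_train[X_scaling]),
    columns=X_scaling,
    index=X_train.index,
)

# Attach the columns that were not scaled (here, the dummy column)
X_train_ready = pd.concat([scaled, X_train.drop(columns=X_scaling)], axis=1)
print(X_train_ready)
```

Keeping the original index matters because train_test_split shuffles the rows; concatenating on a reset index would misalign features and dummies.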


In [18]:
#Saving the dataframe for the modeling notebook
%store df_dummies


Stored 'df_dummies' (DataFrame)
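In the modeling notebook, the stored variable can be restored with the IPython magic `%store -r df_dummies`. Outside IPython, a plain pickle file is one alternative. The sketch below assumes a hypothetical file name and uses a small stand-in frame rather than the real df_dummies:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for df_dummies
df_dummies = pd.DataFrame(
    {"Surface Area": [514.5, 563.5], "Orientation of 3": [0, 1]}
)

# Hypothetical location; a temporary directory keeps the sketch self-contained
path = os.path.join(tempfile.mkdtemp(), "df_dummies.pkl")

df_dummies.to_pickle(path)       # would run in the pre-processing notebook
restored = pd.read_pickle(path)  # would run in the modeling notebook

print(restored.equals(df_dummies))
```

Note that only df_dummies is saved here, so the train/test split and the fitted scaler have to be recreated in the modeling notebook with the same `random_state`.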


## Summary

In this notebook we loaded our cleaned dataframe, in which each column is named after its building feature (e.g., Relative Compactness rather than X1). The Cooling Load y variable is excluded, since we are specifically interested in the relationship between building features and Heating Load. We created dummy variables for the Orientation and Glazing Area Distribution columns, as they hold discrete values that act as categorical variables.

We applied the astype(int) method to the Orientation and Glazing Area Distribution dummy columns so that our output displays 0's and 1's instead of Boolean values. Additionally, we split our data into training and testing sets and used MinMaxScaler, fit on the training set only, to scale the selected numerical variables.

We can now begin the modeling and testing process.