## Feature Selection

In this section, we look for relevant, statistically significant features so that we can improve our model while reducing its complexity.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns
import os
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score



In [1]:
#Retrieving stored dataframe 
%store -r df_dummies
df_dummies

Unnamed: 0,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area,Heating Load,Orientation of 3,Orientation of 4,Orientation of 5,Glazing Area Distribution of 1,Glazing Area Distribution of 2,Glazing Area Distribution of 3,Glazing Area Distribution of 4,Glazing Area Distribution of 5,Height of 3.5,Height of 7
0,0.98,514.5,294.0,110.25,7.0,0.0,15.55,0,0,0,0,0,0,0,0,0,1
1,0.98,514.5,294.0,110.25,7.0,0.0,15.55,1,0,0,0,0,0,0,0,0,1
2,0.98,514.5,294.0,110.25,7.0,0.0,15.55,0,1,0,0,0,0,0,0,0,1
3,0.98,514.5,294.0,110.25,7.0,0.0,15.55,0,0,1,0,0,0,0,0,0,1
4,0.90,563.5,318.5,122.50,7.0,0.0,20.84,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
763,0.64,784.0,343.0,220.50,3.5,0.4,17.88,0,0,1,0,0,0,0,1,1,0
764,0.62,808.5,367.5,220.50,3.5,0.4,16.54,0,0,0,0,0,0,0,1,1,0
765,0.62,808.5,367.5,220.50,3.5,0.4,16.44,1,0,0,0,0,0,0,1,1,0
766,0.62,808.5,367.5,220.50,3.5,0.4,16.48,0,1,0,0,0,0,0,1,1,0


In [5]:
from sklearn.feature_selection import chi2 #chi-squared test for non-negative features (e.g. dummy columns) against a class target

#Selecting all columns with discrete data
discrete_vars = df_dummies.select_dtypes(include = "int")

#Discretizing y variable for chi2 test (astype(int) truncates the decimals)
y = df_dummies['Heating Load'].astype(int)

discrete_vars.head()



Unnamed: 0,Orientation of 3,Orientation of 4,Orientation of 5,Glazing Area Distribution of 1,Glazing Area Distribution of 2,Glazing Area Distribution of 3,Glazing Area Distribution of 4,Glazing Area Distribution of 5,Height of 3.5,Height of 7
0,0,0,0,0,0,0,0,0,0,1
1,1,0,0,0,0,0,0,0,0,1
2,0,1,0,0,0,0,0,0,0,1
3,0,0,1,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,1


Here we examine how the X variables that function as categorical variables relate to our target variable (Heating Load). The table above shows the dummy-encoded columns for Orientation, Glazing Area Distribution, and Overall Height.
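
For readers unfamiliar with dummy encoding, the following is a minimal sketch of how columns like these could be produced and then selected by dtype. The toy values, the `prefix_sep` setting, and the use of `drop_first=True` are assumptions for illustration; the notebook does not show how `df_dummies` was actually built.

```python
import pandas as pd

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "Orientation": [2, 3, 4, 5],
    "Heating Load": [15.55, 15.55, 20.84, 21.46],
})

# One 0/1 column per category. drop_first=True drops the first level (2),
# which is an assumption about how the original columns were created.
dummies = pd.get_dummies(
    toy, columns=["Orientation"], prefix_sep=" of ", drop_first=True, dtype="int64"
)

# Keep only the integer-typed (dummy) columns; the float target is excluded.
discrete = dummies.select_dtypes(include="int64")
print(list(discrete.columns))
```

Passing an explicit `dtype` matters here because, depending on the pandas version, `get_dummies` may otherwise return boolean or `uint8` columns that an integer dtype filter would not pick up.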

In [4]:
#Performing chi2 test
chi = chi2(discrete_vars, y)
print("Chi2 statistics and p-values: ")
print(chi)

Chi2 statistics and p-values: 
(array([ 11.7140531 ,   8.71391864,   7.62687059,  29.77837921,
        21.71365983,  34.91779404,  19.68589057,  21.18022182,
       362.31731188, 362.31731188]), array([9.99957896e-01, 9.99999172e-01, 9.99999875e-01, 7.58177867e-01,
       9.71076613e-01, 5.19928030e-01, 9.87619468e-01, 9.76494040e-01,
       1.59314906e-55, 1.59314906e-55]))


We use the chi-squared (chi2) test to evaluate which variables are associated with the target variable (Heating Load). The chi2 function returns two arrays: the chi-squared statistic and the p-value for each feature. A high chi-squared statistic indicates that the X variable is strongly associated with the target variable.
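
To make the two returned arrays concrete, here is a small sketch on invented data (not the building dataset). The first feature is identical to the class label and the second is unrelated to it, so the first should receive the larger statistic and the smaller p-value.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy data: 8 samples, 2 binary features, 2 classes (all values invented).
# Feature 0 matches the class label exactly; feature 1 is independent of it.
X_toy = np.array([[1, 0], [1, 1], [1, 0], [1, 1],
                  [0, 0], [0, 1], [0, 0], [0, 1]])
y_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# chi2 returns (chi-squared statistics, p-values), one entry per feature.
stats, pvals = chi2(X_toy, y_toy)
print("chi2 statistics:", stats)
print("p-values:", pvals)
```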

In [6]:
#Checking p-values
p_values = pd.Series(chi[1])
p_values.index = discrete_vars.columns
p_values

Orientation of 3                  9.999579e-01
Orientation of 4                  9.999992e-01
Orientation of 5                  9.999999e-01
Glazing Area Distribution of 1    7.581779e-01
Glazing Area Distribution of 2    9.710766e-01
Glazing Area Distribution of 3    5.199280e-01
Glazing Area Distribution of 4    9.876195e-01
Glazing Area Distribution of 5    9.764940e-01
Height of 3.5                     1.593149e-55
Height of 7                       1.593149e-55
dtype: float64

Using an alpha of 0.05, the p-values of the Orientation and Glazing Area Distribution columns are too high to reject the null hypothesis of independence, so we find no evidence of a relationship between these variables and Heating Load. The Height columns, however, have extremely low p-values (1.593149e-55), which indicates a highly significant relationship with the target variable. The two Height columns are complementary encodings of the same two-level feature, so they carry the same information. We will conduct additional tests to explore the relationship between the other building features and Heating Load.
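
The comparison against alpha can also be done programmatically. The sketch below rebuilds the p-value Series from the rounded values printed above (so it runs on its own) and keeps the columns below the threshold; the names `ALPHA`, `p_values_demo`, and `significant` are illustrative, not from the notebook.

```python
import pandas as pd

ALPHA = 0.05  # significance level used in the text

# p-values copied from the printed output above (as displayed, rounded).
p_values_demo = pd.Series({
    "Orientation of 3": 9.999579e-01,
    "Orientation of 4": 9.999992e-01,
    "Orientation of 5": 9.999999e-01,
    "Glazing Area Distribution of 1": 7.581779e-01,
    "Glazing Area Distribution of 2": 9.710766e-01,
    "Glazing Area Distribution of 3": 5.199280e-01,
    "Glazing Area Distribution of 4": 9.876195e-01,
    "Glazing Area Distribution of 5": 9.764940e-01,
    "Height of 3.5": 1.593149e-55,
    "Height of 7": 1.593149e-55,
})

# Keep only the features whose p-value falls below alpha.
significant = p_values_demo[p_values_demo < ALPHA].index.tolist()
print(significant)
```

In the notebook itself, the same filter could be applied directly to `p_values`.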

In [7]:
#Selecting all columns with continuous variables
cont_vars = df_dummies.select_dtypes(include = "float")
cont_vars.head()

Unnamed: 0,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area,Heating Load
0,0.98,514.5,294.0,110.25,7.0,0.0,15.55
1,0.98,514.5,294.0,110.25,7.0,0.0,15.55
2,0.98,514.5,294.0,110.25,7.0,0.0,15.55
3,0.98,514.5,294.0,110.25,7.0,0.0,15.55
4,0.9,563.5,318.5,122.5,7.0,0.0,20.84


In [8]:
#Correlation between X variables and target
cont_vars.corr()

Unnamed: 0,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area,Heating Load
Relative Compactness,1.0,-0.9919015,-0.2037817,-0.8688234,0.8277473,-2.960552e-15,0.622272
Surface Area,-0.9919015,1.0,0.1955016,0.8807195,-0.8581477,3.636925e-15,-0.65812
Wall Area,-0.2037817,0.1955016,1.0,-0.2923165,0.2809757,-8.567455e-17,0.455671
Roof Area,-0.8688234,0.8807195,-0.2923165,1.0,-0.9725122,-1.759011e-15,-0.861828
Overall Height,0.8277473,-0.8581477,0.2809757,-0.9725122,1.0,1.489134e-17,0.88943
Glazing Area,-2.960552e-15,3.636925e-15,-8.567455e-17,-1.759011e-15,1.489134e-17,1.0,0.269842
Heating Load,0.6222719,-0.6581199,0.4556714,-0.8618281,0.8894305,0.2698417,1.0


The correlation matrix above shows the pairwise correlations among the building features (X variables) and Heating Load (y variable). Each value lies between -1 and 1, where 1 represents a perfect positive correlation, 0 represents no linear correlation, and -1 represents a perfect negative correlation.
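
When only the relationship with the target matters, it can help to pull out that single column and rank it by absolute value. The following is a minimal sketch on invented data; the helper name `rank_by_target_correlation` and the toy columns are hypothetical.

```python
import pandas as pd

def rank_by_target_correlation(df, target):
    """Return Pearson correlations with `target`, strongest (by absolute value) first."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy frame with one perfectly correlated, one negatively correlated,
# and one uncorrelated column (all values invented).
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [5.0, 3.0, 4.0, 1.0, 2.0],
    "c": [1.0, 3.0, 2.0, 3.0, 1.0],
    "target": [2.0, 4.0, 6.0, 8.0, 10.0],
})

ranked = rank_by_target_correlation(demo, "target")
print(ranked)
```

Applied to the notebook's data, the call would be `rank_by_target_correlation(cont_vars, "Heating Load")`.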

We are interested in the correlation of each building feature with Heating Load (see the Heating Load column on the far right of the table).

Glazing Area (correlation 0.27) has only a weak positive correlation with Heating Load. We will drop this column, as it is unlikely to help our linear regression model predict Heating Load. Wall Area (correlation 0.46) shows some positive correlation with Heating Load.

There is a moderate positive correlation between Relative Compactness (0.62) and Heating Load, and a very strong positive correlation between Overall Height (0.89) and Heating Load. Additionally, the matrix shows a moderate negative correlation between Surface Area (-0.66) and Heating Load. Lastly, there is a strong negative correlation between Roof Area (-0.86) and Heating Load.

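Thus, we will use Relative Compactness, Overall Height, Surface Area, and Roof Area for our linear regression model. Note that some of these features are strongly correlated with each other (for example, Relative Compactness and Surface Area at -0.99), which is worth keeping in mind when interpreting the fitted coefficients.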
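
Because the correlation matrix above also reports how the selected features relate to one another, a quick check for highly correlated pairs may be useful before fitting. The sketch below copies the relevant pairwise values from that matrix (rounded to four places); the 0.9 cut-off is an arbitrary assumption, not a rule from the notebook.

```python
# Pairwise correlations among the four selected features,
# copied from the correlation matrix above (rounded to 4 decimal places).
pair_corr = {
    ("Relative Compactness", "Surface Area"): -0.9919,
    ("Relative Compactness", "Roof Area"): -0.8688,
    ("Relative Compactness", "Overall Height"): 0.8277,
    ("Surface Area", "Roof Area"): 0.8807,
    ("Surface Area", "Overall Height"): -0.8581,
    ("Roof Area", "Overall Height"): -0.9725,
}

THRESHOLD = 0.9  # hypothetical cut-off for "highly correlated"

# Flag pairs whose absolute correlation exceeds the cut-off.
flagged = [pair for pair, r in pair_corr.items() if abs(r) > THRESHOLD]
for first, second in flagged:
    print(f"{first} / {second}: r = {pair_corr[(first, second)]}")
```

Pairs flagged this way carry largely overlapping information, so one possible follow-up is to compare model performance with and without one member of each pair.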