# Exercise 1.2
Let's look at a multivariate linear regression problem with a dataset: https://archive.ics.uci.edu/ml/datasets/Energy+efficiency

> The dataset contains eight attributes (or features, denoted by X1...X8) and two responses (or outcomes, denoted by y1 and y2). The aim is to use the eight features to predict each of the two responses.
> 
> Specifically:
> - X1 Relative Compactness
> - X2 Surface Area
> - X3 Wall Area
> - X4 Roof Area
> - X5 Overall Height
> - X6 Orientation
> - X7 Glazing Area
> - X8 Glazing Area Distribution
> - y1 Heating Load
> - y2 Cooling Load

1. Create a correlation matrix and pick the best two features for modelling using linear regression. What do you observe about the dataset in general?
2. Develop a linear regression model for estimating y1 (heating load) using 60 percent of data picked randomly for training and remaining for testing.  Visualise your model prediction using appropriate plots. Report the RMSE and R-squared score. 
3. Try the approach with all input features, i) without normalising input data, ii) with normalising input data.
4. Run 30 experiments each and report the mean and std of the RMSE and R-squared score of the train and test datasets. Write a paragraph to compare your results of the different approaches taken.

In [1]:
import pandas as pd
import altair as alt

# First, let's load the data
df = pd.read_excel('data/ENB2012_data.xlsx')
df

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,Y1,Y2
0,0.98,514.5,294.0,110.25,7.0,2,0.0,0,15.55,21.33
1,0.98,514.5,294.0,110.25,7.0,3,0.0,0,15.55,21.33
2,0.98,514.5,294.0,110.25,7.0,4,0.0,0,15.55,21.33
3,0.98,514.5,294.0,110.25,7.0,5,0.0,0,15.55,21.33
4,0.90,563.5,318.5,122.50,7.0,2,0.0,0,20.84,28.28
...,...,...,...,...,...,...,...,...,...,...
763,0.64,784.0,343.0,220.50,3.5,5,0.4,5,17.88,21.40
764,0.62,808.5,367.5,220.50,3.5,2,0.4,5,16.54,16.88
765,0.62,808.5,367.5,220.50,3.5,3,0.4,5,16.44,17.11
766,0.62,808.5,367.5,220.50,3.5,4,0.4,5,16.48,16.61


In [2]:
# First, let's create a correlation matrix to determine which features might be best for modelling
df.corr().style.background_gradient(cmap='coolwarm')

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,Y1,Y2
X1,1.0,-0.991901,-0.203782,-0.868823,0.827747,0.0,-0.0,-0.0,0.622272,0.634339
X2,-0.991901,1.0,0.195502,0.88072,-0.858148,-0.0,0.0,0.0,-0.65812,-0.672999
X3,-0.203782,0.195502,1.0,-0.292316,0.280976,-0.0,-0.0,0.0,0.455671,0.427117
X4,-0.868823,0.88072,-0.292316,1.0,-0.972512,-0.0,-0.0,-0.0,-0.861828,-0.862547
X5,0.827747,-0.858148,0.280976,-0.972512,1.0,0.0,0.0,-0.0,0.88943,0.895785
X6,0.0,-0.0,-0.0,-0.0,0.0,1.0,-0.0,-0.0,-0.002587,0.01429
X7,-0.0,0.0,-0.0,-0.0,0.0,-0.0,1.0,0.212964,0.269842,0.207505
X8,-0.0,0.0,0.0,-0.0,-0.0,-0.0,0.212964,1.0,0.087368,0.050525
Y1,0.622272,-0.65812,0.455671,-0.861828,0.88943,-0.002587,0.269842,0.087368,1.0,0.975862
Y2,0.634339,-0.672999,0.427117,-0.862547,0.895785,0.01429,0.207505,0.050525,0.975862,1.0


Now let's interpret the above correlation matrix.

First, let's take a look at what the different values mean:
- A value of `1.0` means that the 2 columns are perfectly **positively** correlated.
- A value of `-1.0` means that the 2 columns are perfectly **negatively** correlated.
- A value of `0.0` means that the 2 columns are not correlated.
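
To make the three cases concrete, here is a minimal NumPy sketch on made-up toy arrays (not taken from the dataset). It also shows that `0.0` only rules out a *linear* relationship: in the last case `y` is fully determined by `x`, yet the Pearson correlation is zero.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# An increasing linear function of x is perfectly positively correlated with x
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# A decreasing linear function of x is perfectly negatively correlated with x
r_neg = np.corrcoef(x, -3 * x + 10)[0, 1]

# A pattern that is symmetric around the mean of x has no linear correlation with x
r_zero = np.corrcoef(x, (x - x.mean()) ** 2)[0, 1]

print(r_pos, r_neg, r_zero)  # approximately 1, -1 and 0 (up to floating-point error)
```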

From the above we can first observe that the diagonal values (where each column is measured against itself) are all `1.0`. This is expected and of no use to us.

The correlation matrix is symmetric ("mirrored" across the diagonal), so we only need to look at the upper triangle of the matrix.
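
As a rough sketch of reading only the upper triangle, the snippet below rebuilds the `X1` to `X5` block of the matrix from the values printed above and lists each feature pair once. The `0.8` cut-off is an arbitrary threshold chosen for illustration, not something the exercise prescribes.

```python
import numpy as np
import pandas as pd

# Correlations among X1..X5, copied from the matrix above
cols = ["X1", "X2", "X3", "X4", "X5"]
corr = pd.DataFrame(
    [
        [1.0, -0.991901, -0.203782, -0.868823, 0.827747],
        [-0.991901, 1.0, 0.195502, 0.88072, -0.858148],
        [-0.203782, 0.195502, 1.0, -0.292316, 0.280976],
        [-0.868823, 0.88072, -0.292316, 1.0, -0.972512],
        [0.827747, -0.858148, 0.280976, -0.972512, 1.0],
    ],
    index=cols,
    columns=cols,
)

# Boolean mask that is True strictly above the diagonal (k=1 skips the diagonal)
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Blank out everything else, then flatten to one entry per feature pair
pairs = corr.where(upper_mask).stack().dropna()

# Pairs whose absolute correlation exceeds the illustrative 0.8 cut-off
strong_pairs = pairs[pairs.abs() > 0.8]
print(strong_pairs)
```

On these values, every pair among `X1`, `X2`, `X4` and `X5` passes the cut-off, while no pair involving `X3` does.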

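From here, the correlations between feature columns could be used for feature selection or dimensionality reduction. Several features are strongly correlated with each other (for example `X1` and `X2` at -0.99, and `X4` and `X5` at -0.97), so they carry largely redundant information. A principal component analysis (PCA) would show how much additional variance each component explains.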

For our case though we're mainly interested in how the features correlate with our labels.

To pick the "best" two features for modelling, we want the features most strongly correlated with the labels. The correlation can be either positive or negative; it is the magnitude that matters.

For our modelling use case, `X5` (0.89) and `X4` (-0.86) are the most strongly correlated with `Y1`, so let's choose them.
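
The ranking can be sketched in a few lines. The values are copied from the `Y1` row of the matrix above; in the notebook they would come from something like `df.corr()["Y1"].drop(["Y1", "Y2"])`.

```python
import pandas as pd

# Correlation of each feature with Y1, copied from the matrix above
corr_with_y1 = pd.Series(
    {
        "X1": 0.622272, "X2": -0.65812, "X3": 0.455671, "X4": -0.861828,
        "X5": 0.88943, "X6": -0.002587, "X7": 0.269842, "X8": 0.087368,
    }
)

# Rank by magnitude so strong negative correlations count as much as positive ones
ranked = corr_with_y1.abs().sort_values(ascending=False)
best_two = ranked.index[:2].tolist()
print(best_two)  # ['X5', 'X4']
```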

<p style="height: 320px;" align="center">
  <img src="assets/exercise_1_2_correlation_matrix.png">
</p>

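- The model below uses `X1` = **Relative Compactness** and `X5` = **Overall Height**. (`X4` = **Roof Area** is more strongly correlated with `Y1` than `X1` is, so it is a candidate to swap in for `X1`.)
- Now let's use these 2 features to create a linear regression model to predict `y1` (Heating Load).
- We want to use 60% of the data for training and the remaining 40% for testing
- Visualise the model predictions
- Report the RMSE and R-squared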

In [3]:
# First select the relevant columns from our dataset
df_X = df[['X1', 'X5']]
df_y = df[['Y1']]

# Now let's split our data into training (60%) and testing (40%) sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.4, random_state=1)

# Now let's fit a linear regression model
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_train, y_train)

# Now let's make predictions for our training and testing sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Now let's calculate the RMSE and R^2 for our training and testing sets.
# mean_squared_error returns the MSE, so take the square root to get the RMSE.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
r2_train = r2_score(y_train, y_train_pred)

rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
r2_test = r2_score(y_test, y_test_pred)

print(f"RMSE Training: {rmse_train:.3f}, R^2 Training: {r2_train:.3f}")
print(f"RMSE Testing: {rmse_test:.3f}, R^2 Testing: {r2_test:.3f}")

RMSE Training: 3.997, R^2 Training: 0.845
RMSE Testing: 4.329, R^2 Testing: 0.811
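
As a reference for the two metrics: scikit-learn's `mean_squared_error` returns the mean squared error by default, so RMSE is its square root and is in the same units as `Y1`, while R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal NumPy sketch of both definitions follows; the helper names `rmse` and `r_squared` and the toy values are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return float(1.0 - ss_res / ss_tot)


# Toy values (not from the dataset): errors are 1, 0, -1, 0
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.0, 5.0, 8.0, 9.0]

print(rmse(y_true_demo, y_pred_demo))       # sqrt(2 / 4), about 0.707
print(r_squared(y_true_demo, y_pred_demo))  # 1 - 2 / 20 = 0.9
```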


In [18]:
# Now let's visualise how well the model did
# by plotting the actual vs predicted values
plot_df = X_test.copy()
plot_df['Y1'] = y_test
plot_df['Y1_pred'] = y_test_pred

# Create our scatter plot
scatter_plot = alt.Chart(plot_df).mark_point().encode(
  x='Y1',
  y='Y1_pred',
)

# Create the reference line y = x between the min and max values (perfect predictions fall on it)
line = alt.Chart(pd.DataFrame({
  'Y1': [plot_df['Y1'].min(), plot_df['Y1'].max()],
  'Y1_pred': [plot_df['Y1'].min(), plot_df['Y1'].max()]
})).mark_line(color='red').encode(
  x='Y1',
  y='Y1_pred'
)

scatter_plot + line

In [19]:
# Now let's try this again with all features

df_X = df[['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8']]
df_y = df[['Y1']]

# Now let's split our data into training (60%) and testing (40%) sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.4, random_state=1)

# Now let's fit a linear regression model
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_train, y_train)

# Now let's make predictions for our training and testing sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Now let's calculate the RMSE and R^2 for our training and testing sets.
# mean_squared_error returns the MSE, so take the square root to get the RMSE.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
r2_train = r2_score(y_train, y_train_pred)

rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
r2_test = r2_score(y_test, y_test_pred)

print(f"RMSE Training: {rmse_train:.3f}, R^2 Training: {r2_train:.3f}")
print(f"RMSE Testing: {rmse_test:.3f}, R^2 Testing: {r2_test:.3f}")

RMSE Training: 2.810, R^2 Training: 0.923
RMSE Testing: 3.103, R^2 Testing: 0.903


In [20]:
# Now let's visualise how well the model did
# by plotting the actual vs predicted values
plot_df = X_test.copy()
plot_df['Y1'] = y_test
plot_df['Y1_pred'] = y_test_pred

# Create our scatter plot
scatter_plot = alt.Chart(plot_df).mark_point().encode(
  x='Y1',
  y='Y1_pred',
)

# Create the reference line y = x between the min and max values (perfect predictions fall on it)
line = alt.Chart(pd.DataFrame({
  'Y1': [plot_df['Y1'].min(), plot_df['Y1'].max()],
  'Y1_pred': [plot_df['Y1'].min(), plot_df['Y1'].max()]
})).mark_line(color='red').encode(
  x='Y1',
  y='Y1_pred'
)

scatter_plot + line

In [35]:
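# Now try it with all the features normalised (standardised to zero mean and unit variance).
# Note: the scaler is fit here on the full dataset, before the train/test split.
from sklearn.preprocessing import StandardScaler

df_X = df[['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8']]
df_X = pd.DataFrame(StandardScaler().fit_transform(df_X), columns=df_X.columns)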
df_y = df[['Y1']]

# Now let's split our data into training (60%) and testing (40%) sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.4, random_state=1)

# Now let's fit a linear regression model
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_train, y_train)

# Now let's make predictions for our training and testing sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Now let's calculate the RMSE and R^2 for our training and testing sets.
# mean_squared_error returns the MSE, so take the square root to get the RMSE.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
r2_train = r2_score(y_train, y_train_pred)

rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
r2_test = r2_score(y_test, y_test_pred)

print(f"RMSE Training: {rmse_train:.3f}, R^2 Training: {r2_train:.3f}")
print(f"RMSE Testing: {rmse_test:.3f}, R^2 Testing: {r2_test:.3f}")

RMSE Training: 2.899, R^2 Training: 0.918
RMSE Testing: 3.135, R^2 Testing: 0.901
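
One design point worth considering when normalising: fitting the scaler on the whole dataset lets statistics from the test rows influence the training inputs. A common alternative is to wrap the scaler and the model in a scikit-learn pipeline so that the scaler is fit on the training split only. The sketch below uses synthetic stand-in data (not the energy dataset), and the coefficients and feature scales are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 3 independent features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y = X @ np.array([2.0, 0.05, 300.0]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# The pipeline fits the scaler on the training rows only, then reuses the
# training mean and std for anything later passed to predict()
scaled_model = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)
raw_model = LinearRegression().fit(X_train, y_train)

# The scaler's statistics come from the training split, not the full dataset
scaler = scaled_model.named_steps["standardscaler"]
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))

# Largest gap between the scaled and unscaled models' test predictions
max_diff = np.max(np.abs(scaled_model.predict(X_test) - raw_model.predict(X_test)))
print(max_diff)
```

On this synthetic data the scaled and unscaled models agree to within floating-point error, which is what one would expect from ordinary least squares when the features are not collinear.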


In [36]:
# Now let's visualise how well the model did
# by plotting the actual vs predicted values
plot_df = X_test.copy()
plot_df['Y1'] = y_test
plot_df['Y1_pred'] = y_test_pred

# Create our scatter plot
scatter_plot = alt.Chart(plot_df).mark_point().encode(
  x='Y1',
  y='Y1_pred',
)

# Create the reference line y = x between the min and max values (perfect predictions fall on it)
line = alt.Chart(pd.DataFrame({
  'Y1': [plot_df['Y1'].min(), plot_df['Y1'].max()],
  'Y1_pred': [plot_df['Y1'].min(), plot_df['Y1'].max()]
})).mark_line(color='red').encode(
  x='Y1',
  y='Y1_pred'
)

scatter_plot + line
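
For step 4 of the exercise (30 experiments per approach), one possible sketch is shown below. The helper `run_experiments` is a hypothetical name, and the demo call uses synthetic stand-in data so that the snippet runs on its own. Each run uses a different `random_state`, so each gets a different 60/40 split.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def run_experiments(X, y, n_runs=30, test_size=0.4, normalise=False):
    """Repeat split/fit/score n_runs times; return mean and std of each metric."""
    rows = []
    for seed in range(n_runs):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        # Optionally standardise inside the pipeline (scaler fit on the training split only)
        steps = [StandardScaler()] if normalise else []
        model = make_pipeline(*steps, LinearRegression()).fit(X_train, y_train)
        train_pred, test_pred = model.predict(X_train), model.predict(X_test)
        rows.append({
            "rmse_train": np.sqrt(mean_squared_error(y_train, train_pred)),
            "r2_train": r2_score(y_train, train_pred),
            "rmse_test": np.sqrt(mean_squared_error(y_test, test_pred)),
            "r2_test": r2_score(y_test, test_pred),
        })
    # One row for the mean and one for the standard deviation across runs
    return pd.DataFrame(rows).agg(["mean", "std"])


# Synthetic stand-in data (not the energy dataset)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
y_demo = 2 * X_demo["a"] - X_demo["b"] + rng.normal(scale=0.5, size=300)

summary = run_experiments(X_demo, y_demo, n_runs=30, normalise=True)
print(summary)
```

In the notebook, the same helper could be called three times: with `df[['X1', 'X5']]` and `df['Y1']`, with all eight features and `normalise=False`, and with all eight features and `normalise=True`. The three summaries can then be compared side by side.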