# Answers to Questions 12 to 20 of the Stage-B Assessment

# DATASET CONTENT 

- 1 response
- 28 predictors 

- Dataset attributes:
    - date, time in year-month-day hour:minute:second format
    - Appliances, energy use in Wh
    - lights, energy use of light fixtures in the house in Wh
    - T1, Temperature in kitchen area, in Celsius
    - RH_1, Humidity in kitchen area, in %
    - T2, Temperature in living room area, in Celsius
    - RH_2, Humidity in living room area, in %
    - T3, Temperature in laundry room area, in Celsius
    - RH_3, Humidity in laundry room area, in %
    - T4, Temperature in office room, in Celsius
    - RH_4, Humidity in office room, in %
    - T5, Temperature in bathroom, in Celsius
    - RH_5, Humidity in bathroom, in %
    - T6, Temperature outside the building (north side), in Celsius
    - RH_6, Humidity outside the building (north side), in %
    - T7, Temperature in ironing room, in Celsius
    - RH_7, Humidity in ironing room, in %
    - T8, Temperature in teenager room 2, in Celsius
    - RH_8, Humidity in teenager room 2, in %
    - T9, Temperature in parents room, in Celsius
    - RH_9, Humidity in parents room, in %
    - T_out, Temperature outside (from Chievres weather station), in Celsius
    - Pressure (from Chievres weather station), in mm Hg
    - RH_out, Humidity outside (from Chievres weather station), in %
    - Wind speed (from Chievres weather station), in m/s
    - Visibility (from Chievres weather station), in km
    - Tdewpoint (from Chievres weather station), in °C
    - rv1, Random variable 1, nondimensional
    - rv2, Random variable 2, nondimensional

# Workflow from top to bottom 

# 1. Importing Libraries 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats

# 2. Import the Dataset 

In [2]:
energydata = pd.read_csv("energydata_complete.csv")

In [3]:
# reading data 

energydata

Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.890000,47.596667,19.200000,44.790000,19.790000,44.730000,19.000000,...,17.033333,45.5300,6.600000,733.5,92.000000,7.000000,63.000000,5.300000,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.890000,46.693333,19.200000,44.722500,19.790000,44.790000,19.000000,...,17.066667,45.5600,6.483333,733.6,92.000000,6.666667,59.166667,5.200000,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.890000,46.300000,19.200000,44.626667,19.790000,44.933333,18.926667,...,17.000000,45.5000,6.366667,733.7,92.000000,6.333333,55.333333,5.100000,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.890000,46.066667,19.200000,44.590000,19.790000,45.000000,18.890000,...,17.000000,45.4000,6.250000,733.8,92.000000,6.000000,51.500000,5.000000,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.890000,46.333333,19.200000,44.530000,19.790000,45.000000,18.890000,...,17.000000,45.4000,6.133333,733.9,92.000000,5.666667,47.666667,4.900000,10.084097,10.084097
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19730,2016-05-27 17:20:00,100,0,25.566667,46.560000,25.890000,42.025714,27.200000,41.163333,24.700000,...,23.200000,46.7900,22.733333,755.2,55.666667,3.333333,23.666667,13.333333,43.096812,43.096812
19731,2016-05-27 17:30:00,90,0,25.500000,46.500000,25.754000,42.080000,27.133333,41.223333,24.700000,...,23.200000,46.7900,22.600000,755.2,56.000000,3.500000,24.500000,13.300000,49.282940,49.282940
19732,2016-05-27 17:40:00,270,10,25.500000,46.596667,25.628571,42.768571,27.050000,41.690000,24.700000,...,23.200000,46.7900,22.466667,755.2,56.333333,3.666667,25.333333,13.266667,29.199117,29.199117
19733,2016-05-27 17:50:00,420,10,25.500000,46.990000,25.414000,43.036000,26.890000,41.290000,24.700000,...,23.200000,46.8175,22.333333,755.2,56.666667,3.833333,26.166667,13.233333,6.322784,6.322784


- From the output above, the dataset has:
    - 29 columns 
    - 19735 rows

# 3. Checking the Dataset Info 

In [4]:
# computing the number and the fraction of
# missing values in each column

total_missing = energydata.isnull().sum().sort_values(ascending=False)
percent_missing = (energydata.isnull().sum()/energydata.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total_missing, percent_missing], axis=1, keys=['Total_missing', 'Percentage_missing'])
missing_data

Unnamed: 0,Total_missing,Percentage_missing
date,0,0.0
T7,0,0.0
rv1,0,0.0
Tdewpoint,0,0.0
Visibility,0,0.0
Windspeed,0,0.0
RH_out,0,0.0
Press_mm_hg,0,0.0
T_out,0,0.0
RH_9,0,0.0


The dataset has no missing values.

In [5]:
# checking the descriptive statistics summary
# of the dataset

energydata.describe()

Unnamed: 0,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
count,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,...,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0
mean,97.694958,3.801875,21.686571,40.259739,20.341219,40.42042,22.267611,39.2425,20.855335,39.026904,...,19.485828,41.552401,7.411665,755.522602,79.750418,4.039752,38.330834,3.760707,24.988033,24.988033
std,102.524891,7.935988,1.606066,3.979299,2.192974,4.069813,2.006111,3.254576,2.042884,4.341321,...,2.014712,4.151497,5.317409,7.399441,14.901088,2.451221,11.794719,4.194648,14.496634,14.496634
min,10.0,0.0,16.79,27.023333,16.1,20.463333,17.2,28.766667,15.1,27.66,...,14.89,29.166667,-5.0,729.3,24.0,0.0,1.0,-6.6,0.005322,0.005322
25%,50.0,0.0,20.76,37.333333,18.79,37.9,20.79,36.9,19.53,35.53,...,18.0,38.5,3.666667,750.933333,70.333333,2.0,29.0,0.9,12.497889,12.497889
50%,60.0,0.0,21.6,39.656667,20.0,40.5,22.1,38.53,20.666667,38.4,...,19.39,40.9,6.916667,756.1,83.666667,3.666667,40.0,3.433333,24.897653,24.897653
75%,100.0,0.0,22.6,43.066667,21.5,43.26,23.29,41.76,22.1,42.156667,...,20.6,44.338095,10.408333,760.933333,91.666667,5.5,40.0,6.566667,37.583769,37.583769
max,1080.0,70.0,26.26,63.36,29.856667,56.026667,29.236,50.163333,26.2,51.09,...,24.5,53.326667,26.1,772.3,100.0,14.0,66.0,15.5,49.99653,49.99653


In [6]:
# checking the column dtypes and non-null counts of the dataset

energydata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19735 entries, 0 to 19734
Data columns (total 29 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         19735 non-null  object 
 1   Appliances   19735 non-null  int64  
 2   lights       19735 non-null  int64  
 3   T1           19735 non-null  float64
 4   RH_1         19735 non-null  float64
 5   T2           19735 non-null  float64
 6   RH_2         19735 non-null  float64
 7   T3           19735 non-null  float64
 8   RH_3         19735 non-null  float64
 9   T4           19735 non-null  float64
 10  RH_4         19735 non-null  float64
 11  T5           19735 non-null  float64
 12  RH_5         19735 non-null  float64
 13  T6           19735 non-null  float64
 14  RH_6         19735 non-null  float64
 15  T7           19735 non-null  float64
 16  RH_7         19735 non-null  float64
 17  T8           19735 non-null  float64
 18  RH_8         19735 non-null  float64
 19  T9  

# 4. Workflow to answering Question 12 - 20

- NORMALISING THE DATASET

In [7]:
# importing from sklearn

from sklearn.preprocessing import MinMaxScaler

In [8]:
Scaler = MinMaxScaler()

In [9]:
date_dropped = energydata.drop(['date'], axis=1)

date_dropped = pd.DataFrame(Scaler.fit_transform(date_dropped), columns = date_dropped.columns)
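
As a sanity check on the normalisation, min-max scaling maps each column to [0, 1] with x' = (x - min) / (max - min). The sketch below applies that formula by hand to the first T2 reading (19.2), using the T2 minimum and maximum from the `describe()` output. `min_max_scale` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def min_max_scale(x, col_min, col_max):
    """Scale one value to [0, 1] using the min-max formula."""
    return (x - col_min) / (col_max - col_min)

# T2 minimum and maximum as printed by energydata.describe()
t2_min, t2_max = 16.1, 29.856667

# first T2 reading in the dataset (row 0)
first_t2 = 19.2

scaled = min_max_scale(first_t2, t2_min, t2_max)
print(round(scaled, 4))
```

The result is approximately 0.2253, which agrees with the value 0.225345 that the scaled T2 column shows for row 0.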

# Question 12

- From the dataset, fit a linear model on the relationship between the temperature in the living room in Celsius (x=T2) and the temperature outside the building (y=T6). What is the R^2 value to two d.p.?

In [11]:
# taking T2 from the normalised dataset date_dropped
# as the independent variable X

X = date_dropped[['T2']]
X

Unnamed: 0,T2
0,0.225345
1,0.225345
2,0.225345
3,0.225345
4,0.225345
...,...
19730,0.711655
19731,0.701769
19732,0.692651
19733,0.677054


In [12]:
# taking T6 as the target variable y; note that it comes from the
# original (unscaled) energydata, not from date_dropped

y = energydata['T6']
y

0         7.026667
1         6.833333
2         6.560000
3         6.433333
4         6.366667
           ...    
19730    24.796667
19731    24.196667
19732    23.626667
19733    22.433333
19734    21.026667
Name: T6, Length: 19735, dtype: float64

In [13]:
# importing the dataset split module from sklearn

from sklearn.model_selection import train_test_split

In [14]:
# Splitting the Dataset for training and testing 

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3)

In [15]:
# inspecting the training split

X_train

Unnamed: 0,T2
14523,0.545190
8361,0.104192
973,0.152653
6900,0.290768
16974,0.815847
...,...
1075,0.130119
17614,0.537921
11270,0.268234
16414,0.552459


In [16]:
y_train

14523    19.463333
8361      6.190000
973      -1.838889
6900      6.300000
16974    27.982857
           ...    
1075     -5.030000
17614    14.890000
11270     9.145000
16414    17.260000
12601     4.433333
Name: T6, Length: 13814, dtype: float64
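
The training length of 13814 follows from the 70-30 split of 19735 rows. The sketch below reproduces the arithmetic, assuming scikit-learn's behaviour of rounding the test count up when `test_size` is a float and giving the remainder to the training set. Note also that the call in In [14] sets no `random_state`, so which rows land in each split changes from run to run even though the sizes do not.

```python
import math

n_samples = 19735   # rows in energydata
test_size = 0.3     # fraction passed to train_test_split

# with a float test_size the test count is rounded up,
# and the training set receives the remaining rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```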

In [17]:
# importing the model from sklearn 

from sklearn.linear_model import LinearRegression

In [18]:
# training the model on the training split

regressor = LinearRegression(fit_intercept=True)
regressor.fit(X_train, y_train)

LinearRegression()

In [19]:
# predicting with model 
# by supplying it with testing data

y_predict = regressor.predict(X_test)

In [20]:
# import the r2 performance metric from sklearn 

from sklearn.metrics import r2_score

In [21]:
# checking the performance of the model through
# r2 performance metric

r2 = r2_score(y_test, y_predict)

In [22]:
# rounding the outcome to 3 d.p

round(r2, 3)

0.633

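- Answer to Question 12 is:
    - 0.633 (0.63 to 2 d.p.)
    - The split in In [14] sets no random_state, so this value can vary slightly between runs.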
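
In this question X is min-max scaled while y is left in Celsius. For a linear model with an intercept this mix does not affect R^2, because rescaling the predictor only rescales the slope. The sketch below illustrates that with a from-scratch one-predictor least-squares fit. The helper names and the five data points are made up for illustration and are not taken from the dataset, and R^2 is computed in-sample rather than on a held-out test set.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS / TSS."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    tss = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - rss / tss

# made-up illustrative readings, not dataset values
x_raw = [18.0, 19.5, 21.0, 22.5, 25.0]
y = [2.0, 6.5, 7.0, 12.0, 20.0]

# min-max scale x only, mirroring the setup of Question 12
lo, hi = min(x_raw), max(x_raw)
x_scaled = [(v - lo) / (hi - lo) for v in x_raw]

def r2_for(x):
    """Fit the line on (x, y) and return its in-sample R^2."""
    a, b = fit_line(x, y)
    return r_squared(y, [a + b * v for v in x])

r2_raw = r2_for(x_raw)
r2_scaled = r2_for(x_scaled)

# the two values agree up to floating-point error
print(abs(r2_raw - r2_scaled) < 1e-9)
```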

# Question 13

- Normalise the dataset using the MinMaxScaler after removing the following columns: ["date", "lights"]. The target variable is "Appliances". Use a 70-30 train-test set split with a random state of 42 (for reproducibility). Run a multiple linear regression using the training set and evaluate your model on the test set. Answer the following questions.


- What is the Mean Absolute Error in 2 d.p?

In [23]:
# date has already been dropped, so only the lights
# column still needs to be dropped from the dataset

new_energy = date_dropped.drop(['lights'], axis=1)
new_energy

Unnamed: 0,Appliances,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,0.046729,0.327350,0.566187,0.225345,0.684038,0.215188,0.746066,0.351351,0.764262,0.175506,...,0.223032,0.677290,0.372990,0.097674,0.894737,0.500000,0.953846,0.538462,0.265449,0.265449
1,0.046729,0.327350,0.541326,0.225345,0.682140,0.215188,0.748871,0.351351,0.782437,0.175506,...,0.226500,0.678532,0.369239,0.100000,0.894737,0.476190,0.894872,0.533937,0.372083,0.372083
2,0.037383,0.327350,0.530502,0.225345,0.679445,0.215188,0.755569,0.344745,0.778062,0.175506,...,0.219563,0.676049,0.365488,0.102326,0.894737,0.452381,0.835897,0.529412,0.572848,0.572848
3,0.037383,0.327350,0.524080,0.225345,0.678414,0.215188,0.758685,0.341441,0.770949,0.175506,...,0.219563,0.671909,0.361736,0.104651,0.894737,0.428571,0.776923,0.524887,0.908261,0.908261
4,0.046729,0.327350,0.531419,0.225345,0.676727,0.215188,0.758685,0.341441,0.762697,0.178691,...,0.219563,0.671909,0.357985,0.106977,0.894737,0.404762,0.717949,0.520362,0.201611,0.201611
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19730,0.084112,0.926786,0.537657,0.711655,0.606309,0.830841,0.579374,0.864865,0.765258,0.752031,...,0.864724,0.729443,0.891747,0.602326,0.416667,0.238095,0.348718,0.901961,0.861981,0.861981
19731,0.074766,0.919747,0.536006,0.701769,0.607836,0.825302,0.582178,0.864865,0.765258,0.754897,...,0.864724,0.729443,0.887460,0.602326,0.421053,0.250000,0.361538,0.900452,0.985726,0.985726
19732,0.242991,0.919747,0.538666,0.692651,0.627198,0.818378,0.603988,0.864865,0.771233,0.754897,...,0.864724,0.729443,0.883173,0.602326,0.425439,0.261905,0.374359,0.898944,0.583979,0.583979
19733,0.383178,0.919747,0.549491,0.677054,0.634717,0.805085,0.585294,0.864865,0.773794,0.752031,...,0.864724,0.730581,0.878885,0.602326,0.429825,0.273810,0.387179,0.897436,0.126371,0.126371


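The dataset has already been normalised (In [9]). MinMaxScaler scales each column independently, so dropping lights after scaling gives the same values as dropping it before.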

In [24]:
# taking every column except Appliances from the
# normalised data as the predictors X


X = new_energy.drop(['Appliances'], axis=1)
X

Unnamed: 0,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,RH_5,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,0.327350,0.566187,0.225345,0.684038,0.215188,0.746066,0.351351,0.764262,0.175506,0.381691,...,0.223032,0.677290,0.372990,0.097674,0.894737,0.500000,0.953846,0.538462,0.265449,0.265449
1,0.327350,0.541326,0.225345,0.682140,0.215188,0.748871,0.351351,0.782437,0.175506,0.381691,...,0.226500,0.678532,0.369239,0.100000,0.894737,0.476190,0.894872,0.533937,0.372083,0.372083
2,0.327350,0.530502,0.225345,0.679445,0.215188,0.755569,0.344745,0.778062,0.175506,0.380037,...,0.219563,0.676049,0.365488,0.102326,0.894737,0.452381,0.835897,0.529412,0.572848,0.572848
3,0.327350,0.524080,0.225345,0.678414,0.215188,0.758685,0.341441,0.770949,0.175506,0.380037,...,0.219563,0.671909,0.361736,0.104651,0.894737,0.428571,0.776923,0.524887,0.908261,0.908261
4,0.327350,0.531419,0.225345,0.676727,0.215188,0.758685,0.341441,0.762697,0.178691,0.380037,...,0.219563,0.671909,0.357985,0.106977,0.894737,0.404762,0.717949,0.520362,0.201611,0.201611
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19730,0.926786,0.537657,0.711655,0.606309,0.830841,0.579374,0.864865,0.765258,0.752031,0.339590,...,0.864724,0.729443,0.891747,0.602326,0.416667,0.238095,0.348718,0.901961,0.861981,0.861981
19731,0.919747,0.536006,0.701769,0.607836,0.825302,0.582178,0.864865,0.765258,0.754897,0.338487,...,0.864724,0.729443,0.887460,0.602326,0.421053,0.250000,0.361538,0.900452,0.985726,0.985726
19732,0.919747,0.538666,0.692651,0.627198,0.818378,0.603988,0.864865,0.771233,0.754897,0.337585,...,0.864724,0.729443,0.883173,0.602326,0.425439,0.261905,0.374359,0.898944,0.583979,0.583979
19733,0.919747,0.549491,0.677054,0.634717,0.805085,0.585294,0.864865,0.773794,0.752031,0.336583,...,0.864724,0.730581,0.878885,0.602326,0.429825,0.273810,0.387179,0.897436,0.126371,0.126371


In [25]:
# taking Appliances from the normalised dataset
# as the target y

y = new_energy['Appliances']
y


0        0.046729
1        0.046729
2        0.037383
3        0.037383
4        0.046729
           ...   
19730    0.084112
19731    0.074766
19732    0.242991
19733    0.383178
19734    0.392523
Name: Appliances, Length: 19735, dtype: float64

In [26]:
# splitting the dataset as instructed in Question 13
# (70-30 split, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=42)

In [27]:
# training the model on the training split

linear_model = LinearRegression(fit_intercept=True)
linear_model.fit(X_train, y_train)

LinearRegression()

In [28]:
# predicting on the test data with the newly trained model

y_predict = linear_model.predict(X_test)

In [29]:
# import the MAE performance metric from sklearn 

from sklearn.metrics import mean_absolute_error

In [30]:
# checking the performance of the model through MAE

MAE = mean_absolute_error(y_test, y_predict)

In [31]:
# rounding the outcome to 2d.p

round(MAE, 2)

0.05

- the Mean Absolute Error is:
    - 0.05 (to 2 d.p., in min-max scaled units)
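
Because the target was min-max scaled, this error is expressed in scaled units rather than in Wh. A rough conversion back to Wh multiplies by the Appliances range from the `describe()` output (minimum 10, maximum 1080). This is only a sketch: it starts from the rounded value 0.05, so the result is an approximation.

```python
# Appliances range as printed by energydata.describe(), in Wh
appliances_min, appliances_max = 10.0, 1080.0

# rounded MAE reported above, in min-max scaled units
mae_scaled = 0.05

# min-max scaling divides by (max - min), so multiplying undoes it;
# the offset cancels because an error is a difference of two values
mae_wh = mae_scaled * (appliances_max - appliances_min)
print(round(mae_wh, 1))
```

This suggests a typical absolute error of roughly 50 Wh, to be read against a mean Appliances value of about 98 Wh.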

# Question 14

- What is the Residual Sum of Squares (in two decimal places)?

In [32]:
# Checking performance of the model through
# Residual Sum of Squares


RSS = ((y_test-y_predict)**2).sum()
round(RSS, 2)

45.35

- the RSS of the model is:
    - 45.35

# Question 15

- What is the Root Mean Squared Error (in 3 d.p)?

In [34]:
# importing MSE from the sklearn

from sklearn.metrics import mean_squared_error

In [35]:
# Checking performance of the model through
# Root Mean Squared Error 

RMSE = np.sqrt(mean_squared_error(y_test,y_predict))
round(RMSE, 3)

0.088

- the RMSE of the model is:
    - 0.088
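
The RSS and RMSE answers can be cross-checked against each other, since RMSE = sqrt(RSS / n). The sketch below assumes the Question 13 test set has 5921 rows (19735 total minus 13814 training rows, the sizes a 70-30 split produces) and uses the rounded RSS of 45.35.

```python
import math

rss = 45.35     # residual sum of squares reported for Question 14
n_test = 5921   # assumed test rows: 19735 total minus 13814 training rows

# RMSE is the square root of the mean squared residual
rmse = math.sqrt(rss / n_test)
print(round(rmse, 3))
```

The result rounds to 0.088, consistent with the value obtained from `mean_squared_error`.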

# Question 16

- What is the Coefficient of Determination (in two decimal places)?

In [36]:
# importing r2 from the sklearn

from sklearn.metrics import r2_score

In [37]:
# Checking performance of the model through
# r2

r2 = r2_score(y_test, y_predict)
round(r2, 2)

0.15

- the Coefficient of Determination of the model is:
    - 0.15

# Question 17

- Obtain the feature weights from your linear model above. Which features have the lowest and highest weights respectively?

In [38]:
# defining a function to obtain the feature weights of
# the linear model

def weights_df(model, feat, col_name):
    """
    Return a DataFrame of the model's feature weights,
    sorted from lowest to highest.
    """
    weights = pd.Series(model.coef_, feat.columns).sort_values()
    df = pd.DataFrame(weights).reset_index()
    df.columns = ['Features', col_name]
    return df

In [39]:
# Obtaining the Linear model weights

linear_model_weights = weights_df(linear_model,
                                  X_train,
                                  'Linear_Model_Weight')

In [40]:
# printing the weights

linear_model_weights

Unnamed: 0,Features,Linear_Model_Weight
0,RH_2,-0.456698
1,T_out,-0.32186
2,T2,-0.236178
3,T9,-0.189941
4,RH_8,-0.157595
5,RH_out,-0.077671
6,RH_7,-0.044614
7,RH_9,-0.0398
8,T5,-0.015657
9,T1,-0.003281


In [43]:
linear_model_weights['Linear_Model_Weight'].max()

0.5535465998386391

In [44]:
linear_model_weights['Linear_Model_Weight'].min()

-0.4566979483385004

- From the printed linear model weights, the features with the lowest and highest weights are:
    - RH_2 (lowest), with weight -0.456698, at index 0
    - RH_1 (highest), with weight 0.553547, at index 25
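
Rather than reading the extremes off the sorted table, the feature names can be looked up directly. The sketch below does this in plain Python on a handful of the weights printed above (the full table has more rows). With pandas, `Series.idxmin()` and `Series.idxmax()` on a Series indexed by feature name would serve the same purpose.

```python
# a few of the weights printed above; the full table has more rows
weights = {
    "RH_2": -0.456698,
    "T_out": -0.32186,
    "T2": -0.236178,
    "T1": -0.003281,
    "RH_1": 0.553547,
}

# pick the feature names whose weights are smallest and largest
lowest = min(weights, key=weights.get)
highest = max(weights, key=weights.get)
print(lowest, highest)
```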

# Question 18

- Train a ridge regression model with an alpha value of 0.4. Is there any change to the root mean squared error (RMSE) when evaluated on the test set?

In [45]:
# import ridge from sklearn

from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=0.4)
ridge_reg.fit(X_train, y_train)

Ridge(alpha=0.4)

In [46]:
# predicting with ridge model with test data

y_ridge_predict = ridge_reg.predict(X_test)
y_ridge_predict

array([0.03321872, 0.24043824, 0.03461337, ..., 0.06872351, 0.10025536,
       0.05851175])

In [47]:
# checking the performance of the ridge model
# through RMSE

RMSE = np.sqrt(mean_squared_error(y_test,y_ridge_predict))
round(RMSE, 3)

0.088

- The answer to the question is no: the RMSE does not change (to 3 d.p.) when the model is trained with ridge regression.
    - The RMSE of both models is 0.088.
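
One way to build intuition for why such a small penalty barely matters is the one-predictor, no-intercept case, where ridge has the closed form w = sum(x*y) / (sum(x*x) + alpha). With thousands of rows, sum(x*x) dwarfs alpha = 0.4. The sketch below uses synthetic data sized like the training set. It is a simplified illustration, not the notebook's multi-feature model, and `ridge_slope` is a hypothetical helper.

```python
def ridge_slope(x, y, alpha):
    """Closed-form ridge solution for one predictor and no intercept:
    w = sum(x*y) / (sum(x*x) + alpha)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)

# synthetic data in [0, 1], sized like the training set (13814 rows)
n = 13814
x = [i / (n - 1) for i in range(n)]
y = [0.3 * xi for xi in x]   # noiseless line with true slope 0.3

w_ols = ridge_slope(x, y, alpha=0.0)    # alpha = 0 is ordinary least squares
w_ridge = ridge_slope(x, y, alpha=0.4)  # the alpha used in Question 18

print(round(w_ols, 3), round(w_ridge, 3))
```

The ridge slope is shrunk slightly towards zero, but the change is far too small to show up at three decimal places.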

# Question 19


- Train a lasso regression model with an alpha value of 0.001 and obtain the new feature weights with it. How many of the features have non-zero feature weights?

In [48]:
# importing lasso from sklearn

from sklearn.linear_model import Lasso

In [49]:
# training the lasso model with train data set
# in accordance with the instruction in the question

lasso_reg = Lasso(alpha = 0.001)
lasso_reg.fit(X_train, y_train)

Lasso(alpha=0.001)

In [50]:
# defining a function to obtain the feature weights of
# the lasso model

def lasso_weights(model, feat, col_name):
    """
    Return a DataFrame of the model's feature weights,
    sorted from lowest to highest.
    """
    weights = pd.Series(model.coef_, feat.columns).sort_values()
    df = pd.DataFrame(weights).reset_index()
    df.columns = ['Features', col_name]
    return df

In [52]:
# Obtaining the lasso model weights 

lasso_weight_df = lasso_weights(lasso_reg, X_train,
                                'Lasso_Weight')

In [53]:
# Printing lasso weight 

lasso_weight_df

Unnamed: 0,Features,Lasso_Weight
0,RH_out,-0.049557
1,RH_8,-0.00011
2,T1,0.0
3,Tdewpoint,0.0
4,Visibility,0.0
5,Press_mm_hg,-0.0
6,T_out,0.0
7,RH_9,-0.0
8,T9,-0.0
9,T8,0.0


- From the printed lasso model weights, only 4 features have non-zero weights:
    1. RH_out feature with -0.049557 weight at Index 0
    2. RH_8 feature with -0.000110 weight at Index 1
    3. Windspeed feature with 0.002912 weight at Index 24
    4. RH_1 feature with 0.017880 weight at Index 25
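
The count can also be computed instead of read off the table. The sketch below uses the weights named above plus a few of the zero entries from the printed output, and it assumes every feature not listed here has a zero weight. Note that lasso reports some coefficients as -0.0. These compare equal to zero, so they are correctly excluded.

```python
# weights named in the lasso output above; all other features assumed zero
lasso_weights_shown = {
    "RH_out": -0.049557,
    "RH_8": -0.00011,
    "T1": 0.0,
    "Press_mm_hg": -0.0,
    "RH_9": -0.0,
    "Windspeed": 0.002912,
    "RH_1": 0.017880,
}

# -0.0 compares equal to 0, so signed zeros are not counted
non_zero = [name for name, w in lasso_weights_shown.items() if w != 0]
print(len(non_zero), non_zero)
```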

# Question 20

- What is the new RMSE with lasso regression? in 3 d.p

In [54]:
# predicting on the test data with the lasso model

y_lasso_predict = lasso_reg.predict(X_test)
y_lasso_predict

array([0.07370267, 0.08143458, 0.07716072, ..., 0.07792848, 0.09034412,
       0.08359255])

In [55]:
RMSE = np.sqrt(mean_squared_error(y_test,y_lasso_predict))
round(RMSE, 3)

0.094

- the RMSE of the lasso model is:
    - 0.094