## Assignments

As in previous checkpoints, please submit links to two Jupyter notebooks (one for each assignment below).

Please submit links to all your work below. This is not a graded checkpoint, but you should discuss your solutions with your mentor. Also, when you're done, compare your work to [these example solutions](https://github.com/Thinkful-Ed/machine-learning-regression-problems/blob/master/notebooks/5.solution_evaluating_goodness_of_fit.ipynb).



### 1. Weather model

For this assignment, you'll revisit the historical temperature dataset. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* First, load the dataset from the **weatherinszeged** table from Thinkful's database.
* As in the previous checkpoint, build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why?
* Next, add the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Now, what is the R-squared of this model? Does this model improve upon the previous one?
* Add *visibility* as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the contributions of the interaction term and *visibility* in terms of the improvement in adjusted R-squared. Which one is more useful?
* Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import linear_model
import statsmodels.formula.api as smf
from sqlalchemy import create_engine
import statsmodels.api as sm

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings(action="ignore")

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

weather = pd.read_sql_query('select * from weatherinszeged',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()

weather.head()

Unnamed: 0,date,summary,preciptype,temperature,apparenttemperature,humidity,windspeed,windbearing,visibility,loudcover,pressure,dailysummary
0,2006-03-31 22:00:00+00:00,Partly Cloudy,rain,9.472,7.389,0.89,14.12,251.0,15.826,0.0,1015.13,Partly cloudy throughout the day.
1,2006-03-31 23:00:00+00:00,Partly Cloudy,rain,9.356,7.228,0.86,14.265,259.0,15.826,0.0,1015.63,Partly cloudy throughout the day.
2,2006-04-01 00:00:00+00:00,Mostly Cloudy,rain,9.378,9.378,0.89,3.928,204.0,14.957,0.0,1015.94,Partly cloudy throughout the day.
3,2006-04-01 01:00:00+00:00,Partly Cloudy,rain,8.289,5.944,0.83,14.104,269.0,15.826,0.0,1016.41,Partly cloudy throughout the day.
4,2006-04-01 02:00:00+00:00,Mostly Cloudy,rain,8.756,6.978,0.83,11.045,259.0,15.826,0.0,1016.51,Partly cloudy throughout the day.


In [2]:
# Y is the target variable
Y = weather['apparenttemperature'] - weather['temperature']

# X is the feature set 
X = weather[['humidity','windspeed']]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.288
Model:,OLS,Adj. R-squared:,0.288
Method:,Least Squares,F-statistic:,19490.0
Date:,"Sat, 04 Jan 2020",Prob (F-statistic):,0.0
Time:,13:34:53,Log-Likelihood:,-170460.0
No. Observations:,96453,AIC:,340900.0
Df Residuals:,96450,BIC:,340900.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.4381,0.021,115.948,0.000,2.397,2.479
humidity,-3.0292,0.024,-126.479,0.000,-3.076,-2.982
windspeed,-0.1193,0.001,-176.164,0.000,-0.121,-0.118

0,1,2,3
Omnibus:,3935.747,Durbin-Watson:,0.267
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4613.311
Skew:,-0.478,Prob(JB):,0.0
Kurtosis:,3.484,Cond. No.,88.1


- The R-squared value is 0.288 and the adjusted R-squared value is 0.288.
    - These values are fairly low: they indicate that our model explains just 28.8% of the variance in the difference between apparent temperature and temperature. We would want to add other explanatory variables to try to increase these values.
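
For reference, R-squared and adjusted R-squared can be computed by hand from the residuals of a fit. The following is a minimal sketch on synthetic data; the generated arrays and the helper name `r_squared_stats` are illustrative assumptions, not part of the assignment, and it uses NumPy's least-squares solver instead of statsmodels.

```python
import numpy as np

def r_squared_stats(y, X):
    """Return (R-squared, adjusted R-squared) for an OLS fit of y on X.

    Hypothetical helper: X must already contain a constant column.
    """
    n, p = X.shape                          # p counts the constant too
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = np.sum(resid ** 2)             # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variation
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p)
    return r2, adj_r2

# Synthetic data for illustration only (not the Szeged dataset)
rng = np.random.default_rng(0)
n = 1000
humidity = rng.uniform(0, 1, n)
windspeed = rng.uniform(0, 30, n)
y = 2.4 - 3.0 * humidity - 0.12 * windspeed + rng.normal(0, 1.5, n)

X = np.column_stack([np.ones(n), humidity, windspeed])
r2, adj_r2 = r_squared_stats(y, X)
print(f"R-squared: {r2:.3f}, adjusted R-squared: {adj_r2:.3f}")
```

Adjusted R-squared is always a little below R-squared because it penalizes each additional parameter; with many observations and few parameters the two are nearly identical, as in the summary above.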
    
Let's include the interaction of humidity and windspeed to see if we can improve the explanatory power of our model.

### Interaction of Humidity and Windspeed

In [3]:
# This is the interaction between humidity and windspeed
weather["humidity_windspeed"] = weather.humidity * weather.windspeed

# X is the feature set
X = weather[['humidity','windspeed', 'humidity_windspeed']]

# We add a constant to the model as it's a best practice
# to do so every time!
X = sm.add_constant(X)

# We fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# We print the summary results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.341
Model:                            OLS   Adj. R-squared:                  0.341
Method:                 Least Squares   F-statistic:                 1.666e+04
Date:                Sat, 04 Jan 2020   Prob (F-statistic):               0.00
Time:                        13:39:09   Log-Likelihood:            -1.6669e+05
No. Observations:               96453   AIC:                         3.334e+05
Df Residuals:                   96449   BIC:                         3.334e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                  0.0839      0

- Adding the interaction of humidity and windspeed increased R-squared to 0.341, meaning our model now explains 34.1% of the variance in our target. There is still room for improvement in our model.
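
To see why an interaction term can raise R-squared, here is a small sketch on synthetic data in which the effect of windspeed is assumed to depend on humidity. The coefficients, the noise level, and the helper `ols_r2` are made up for illustration and do not come from the weather dataset.

```python
import numpy as np

def ols_r2(y, X):
    """R-squared of an OLS fit; X must include a constant column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
n = 2000
humidity = rng.uniform(0, 1, n)
windspeed = rng.uniform(0, 30, n)

# Assumed data-generating process: the windspeed effect grows with humidity
y = 0.1 - 0.3 * humidity - 0.15 * humidity * windspeed + rng.normal(0, 1, n)

const = np.ones(n)
X_base = np.column_stack([const, humidity, windspeed])
X_inter = np.column_stack([X_base, humidity * windspeed])

r2_base = ols_r2(y, X_base)
r2_inter = ols_r2(y, X_inter)
print(f"without interaction: {r2_base:.3f}")
print(f"with interaction:    {r2_inter:.3f}")
```

Because the second model nests the first, its R-squared can never be lower; whether the gain is worth an extra parameter is what adjusted R-squared, AIC, and BIC help decide.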

Let's add visibility as another explanatory variable to the first model.

In [4]:
# X is the feature set
X = weather[['humidity','windspeed', 'visibility']]

# We add a constant to the model as it's a best practice
# to do so every time!
X = sm.add_constant(X)

# We fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# We print the summary results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.304
Model:                            OLS   Adj. R-squared:                  0.303
Method:                 Least Squares   F-statistic:                 1.401e+04
Date:                Sat, 04 Jan 2020   Prob (F-statistic):               0.00
Time:                        13:42:15   Log-Likelihood:            -1.6938e+05
No. Observations:               96453   AIC:                         3.388e+05
Df Residuals:                   96449   BIC:                         3.388e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.5756      0.028     56.605      0.0

- Adding visibility to our first model increased R-squared to 30.4%, a gain of 1.6 percentage points in explained variance over the first model.
- Adjusted R-squared increased slightly less, to 30.3%.
- Adding the interaction term in the 2nd model increased adjusted R-squared to 34.1%, a gain of 5.3 percentage points, versus 1.5 percentage points for the inclusion of visibility in the 3rd model.
    - Based on this information I would choose the 2nd model because it explains more of the variance in the target variable than the 1st or 3rd.
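
The adjusted R-squared comparison can be reproduced approximately from the reported summaries. This sketch plugs the rounded R-squared values into the standard formula; because the inputs are rounded to three decimals, the last digit may differ from the statsmodels output. The dictionary labels are illustrative names only.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

n_obs = 96453  # "No. Observations" from the summaries

# (R-squared as printed, number of predictors excluding the constant)
models = {
    "base": (0.288, 2),
    "interaction": (0.341, 3),
    "visibility": (0.304, 3),
}

adj = {name: adjusted_r2(r2, n_obs, k) for name, (r2, k) in models.items()}
gain_interaction = adj["interaction"] - adj["base"]
gain_visibility = adj["visibility"] - adj["base"]

for name, value in adj.items():
    print(f"{name:12s} adjusted R-squared = {value:.4f}")
print(f"gain from interaction: {gain_interaction:.4f}")
print(f"gain from visibility:  {gain_visibility:.4f}")
```

With roughly 96,000 observations the penalty for one extra parameter is tiny, so the ranking by adjusted R-squared matches the ranking by R-squared here.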
    
Let's add visibility to the 2nd model to try to increase our explained variance.

In [5]:
# X is the feature set
X = weather[['humidity','windspeed', 'humidity_windspeed', 'visibility']]

# We add a constant to the model as it's a best practice
# to do so every time!
X = sm.add_constant(X)

# We fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# We print the summary results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.364
Model:                            OLS   Adj. R-squared:                  0.363
Method:                 Least Squares   F-statistic:                 1.377e+04
Date:                Sat, 04 Jan 2020   Prob (F-statistic):               0.00
Time:                        13:50:23   Log-Likelihood:            -1.6504e+05
No. Observations:               96453   AIC:                         3.301e+05
Df Residuals:                   96448   BIC:                         3.301e+05
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                 -1.1006      0

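- Adding visibility to the 2nd model increased adjusted R-squared by 2.2 percentage points, from 34.1% to 36.3%.

Comparing AIC and BIC scores, the 2nd model is the best of the first three: its AIC and BIC (3.334e+05) are lower than those of the 1st model (340900.0) and the 3rd model (3.388e+05).

However, the 4th model improves on the 2nd: its AIC and BIC are lower still (3.301e+05).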
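
The AIC and BIC values can be recomputed from the log-likelihoods printed in the four summaries. The sketch below uses the usual definitions, AIC = 2k - 2·llf and BIC = k·ln(n) - 2·llf, where k counts the constant; the function name and dictionary labels are illustrative, and the results only match the printed values up to the rounding of the log-likelihoods.

```python
import math

def aic_bic(log_likelihood, n_obs, n_params):
    """AIC and BIC of a fitted model; n_params includes the constant."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * math.log(n_obs) - 2 * log_likelihood
    return aic, bic

n_obs = 96453

# (log-likelihood as printed in the summaries, parameters incl. constant)
models = {
    "1: humidity + windspeed": (-170460.0, 3),
    "2: + interaction": (-1.6669e5, 4),
    "3: + visibility": (-1.6938e5, 4),
    "4: + interaction + visibility": (-1.6504e5, 5),
}

scores = {name: aic_bic(llf, n_obs, k) for name, (llf, k) in models.items()}
for name, (aic, bic) in scores.items():
    print(f"{name:32s} AIC = {aic:,.0f}  BIC = {bic:,.0f}")

# Lower is better for both criteria
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
print("best by AIC:", best_by_aic)
print("best by BIC:", best_by_bic)
```

BIC penalizes extra parameters more heavily than AIC (ln(n) is about 11.5 here versus 2), but the differences in log-likelihood between these models are in the thousands, so both criteria agree.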


### 2. House prices model

In this exercise, you'll work on your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Run your house prices model again and assess the goodness of fit of your model using F-test, R-squared, adjusted R-squared, AIC and BIC.
* Do you think your model is satisfactory? If so, why?
* In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables. 
* For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?

In [8]:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
from sqlalchemy import create_engine
import seaborn as sns
import statsmodels.api as sm

import warnings
warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
house_prices_df = pd.read_sql_query('select * from houseprices',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()


house_prices_df.head()

Unnamed: 0,id,mssubclass,mszoning,lotfrontage,lotarea,street,alley,lotshape,landcontour,utilities,...,poolarea,poolqc,fence,miscfeature,miscval,mosold,yrsold,saletype,salecondition,saleprice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [9]:
# One-hot encode the categorical columns once, dropping the first level of each.
# dtype=int keeps the dummies numeric; recent pandas versions return booleans by default.
mszoning_dummies = pd.get_dummies(house_prices_df.mszoning, prefix="mszoning", drop_first=True, dtype=int)
street_dummies = pd.get_dummies(house_prices_df.street, prefix="street", drop_first=True, dtype=int)
house_prices_df = pd.concat([house_prices_df, mszoning_dummies, street_dummies], axis=1)
dummy_column_names = list(mszoning_dummies.columns) + list(street_dummies.columns)

In [10]:
# Y is the target variable
Y = house_prices_df['saleprice']
# X is the feature set
X = house_prices_df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalbsmtsf'] + dummy_column_names]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

0,1,2,3
Dep. Variable:,saleprice,R-squared:,0.769
Model:,OLS,Adj. R-squared:,0.767
Method:,Least Squares,F-statistic:,482.0
Date:,"Sat, 04 Jan 2020",Prob (F-statistic):,0.0
Time:,14:00:42,Log-Likelihood:,-17475.0
No. Observations:,1460,AIC:,34970.0
Df Residuals:,1449,BIC:,35030.0
Df Model:,10,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.173e+05,1.8e+04,-6.502,0.000,-1.53e+05,-8.19e+04
overallqual,2.333e+04,1088.506,21.430,0.000,2.12e+04,2.55e+04
grlivarea,45.6344,2.468,18.494,0.000,40.794,50.475
garagecars,1.345e+04,2990.453,4.498,0.000,7584.056,1.93e+04
garagearea,16.4082,10.402,1.577,0.115,-3.997,36.813
totalbsmtsf,28.3816,2.931,9.684,0.000,22.633,34.131
mszoning_FV,2.509e+04,1.37e+04,1.833,0.067,-1761.679,5.19e+04
mszoning_RH,1.342e+04,1.58e+04,0.847,0.397,-1.77e+04,4.45e+04
mszoning_RL,2.857e+04,1.27e+04,2.246,0.025,3612.782,5.35e+04

0,1,2,3
Omnibus:,415.883,Durbin-Watson:,1.979
Prob(Omnibus):,0.0,Jarque-Bera (JB):,41281.526
Skew:,-0.115,Prob(JB):,0.0
Kurtosis:,29.049,Cond. No.,55300.0


- Based on the p-value of the F-statistic (reported as 0.0), we can reject the hypothesis that all coefficients are jointly zero, so our model contributes something statistically significant to the explanation of saleprice.
- An R-squared of 0.769 and an adjusted R-squared of 0.767 suggest that our explanatory variables explain approximately 77% of the variance in saleprice. This is a good result and suggests that our model is useful; however, we may be able to improve on it by adding features.
- AIC and BIC for the first model don't tell us much on their own because we need another model to compare against.
- The model is satisfactory: it explains a lot of the variance and is statistically significant in explaining our target, saleprice. However, it can probably be improved upon.
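
As a sanity check, the overall F-statistic can be recovered from R-squared, the number of observations, and the number of predictors. This sketch uses the standard formula with the rounded values from the summary; `f_statistic` is a hypothetical helper name.

```python
def f_statistic(r2, n_obs, n_predictors):
    """Overall F-statistic of an OLS model with an intercept."""
    explained = r2 / n_predictors                          # per model df
    unexplained = (1.0 - r2) / (n_obs - n_predictors - 1)  # per residual df
    return explained / unexplained

# Values read from the first house prices summary:
# R-squared 0.769, 1460 observations, Df Model 10
f_stat = f_statistic(r2=0.769, n_obs=1460, n_predictors=10)
print(f"F-statistic: {f_stat:.1f}")
```

The result should land close to the 482.0 reported by statsmodels; any small gap comes from R-squared being rounded to three decimals.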

Let's add additional features that had strong correlation values to see if we can improve our model.

In [11]:
# fit new model
X = house_prices_df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalbsmtsf', 'yearbuilt'] + dummy_column_names]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

0,1,2,3
Dep. Variable:,saleprice,R-squared:,0.773
Model:,OLS,Adj. R-squared:,0.771
Method:,Least Squares,F-statistic:,447.3
Date:,"Sat, 04 Jan 2020",Prob (F-statistic):,0.0
Time:,14:20:43,Log-Likelihood:,-17463.0
No. Observations:,1460,AIC:,34950.0
Df Residuals:,1448,BIC:,35010.0
Df Model:,11,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-5.615e+05,9.23e+04,-6.086,0.000,-7.43e+05,-3.81e+05
overallqual,2.113e+04,1168.784,18.081,0.000,1.88e+04,2.34e+04
grlivarea,49.4035,2.566,19.255,0.000,44.371,54.436
garagecars,1.012e+04,3043.388,3.326,0.001,4153.097,1.61e+04
garagearea,18.6952,10.331,1.810,0.071,-1.570,38.960
totalbsmtsf,26.6874,2.928,9.114,0.000,20.944,32.431
yearbuilt,234.9417,47.874,4.908,0.000,141.033,328.851
mszoning_FV,1.453e+04,1.37e+04,1.057,0.291,-1.24e+04,4.15e+04
mszoning_RH,9651.6551,1.57e+04,0.613,0.540,-2.12e+04,4.05e+04

0,1,2,3
Omnibus:,432.09,Durbin-Watson:,1.985
Prob(Omnibus):,0.0,Jarque-Bera (JB):,47875.845
Skew:,-0.185,Prob(JB):,0.0
Kurtosis:,31.051,Cond. No.,258000.0


By adding yearbuilt we got a slight improvement in our model: adjusted R-squared increased by 0.004 (from 0.767 to 0.771), with slight reductions in AIC and BIC.

Let's add lotarea.

In [12]:
# fit new model
X = house_prices_df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalbsmtsf', 'yearbuilt', 'lotarea'] + dummy_column_names]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

0,1,2,3
Dep. Variable:,saleprice,R-squared:,0.777
Model:,OLS,Adj. R-squared:,0.775
Method:,Least Squares,F-statistic:,421.0
Date:,"Sat, 04 Jan 2020",Prob (F-statistic):,0.0
Time:,14:24:58,Log-Likelihood:,-17448.0
No. Observations:,1460,AIC:,34920.0
Df Residuals:,1447,BIC:,34990.0
Df Model:,12,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-6.318e+05,9.22e+04,-6.851,0.000,-8.13e+05,-4.51e+05
overallqual,2.164e+04,1160.714,18.645,0.000,1.94e+04,2.39e+04
grlivarea,46.9059,2.580,18.181,0.000,41.845,51.967
garagecars,9777.1130,3013.501,3.244,0.001,3865.816,1.57e+04
garagearea,17.9118,10.228,1.751,0.080,-2.152,37.975
totalbsmtsf,23.9611,2.940,8.149,0.000,18.193,29.729
yearbuilt,262.0698,47.647,5.500,0.000,168.605,355.535
lotarea,0.6034,0.109,5.523,0.000,0.389,0.818
mszoning_FV,1.031e+04,1.36e+04,0.757,0.449,-1.64e+04,3.71e+04

0,1,2,3
Omnibus:,462.782,Durbin-Watson:,1.987
Prob(Omnibus):,0.0,Jarque-Bera (JB):,59111.177
Skew:,-0.315,Prob(JB):,0.0
Kurtosis:,34.166,Cond. No.,1370000.0


Adding lotarea gave another small improvement in the model. Let's take out the street dummy variable.

In [17]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
house_prices_df = pd.read_sql_query('select * from houseprices',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()


house_prices_df.head()

Unnamed: 0,id,mssubclass,mszoning,lotfrontage,lotarea,street,alley,lotshape,landcontour,utilities,...,poolarea,poolqc,fence,miscfeature,miscval,mosold,yrsold,saletype,salecondition,saleprice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [18]:
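# One-hot encode mszoning only (street is left out of this model).
# dtype=int keeps the dummies numeric; recent pandas versions return booleans by default.
mszoning_dummies = pd.get_dummies(house_prices_df.mszoning, prefix="mszoning", drop_first=True, dtype=int)
house_prices_df = pd.concat([house_prices_df, mszoning_dummies], axis=1)
dummy_column_names = list(mszoning_dummies.columns)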

In [19]:
# fit new model
X = house_prices_df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalbsmtsf', 'yearbuilt', 'lotarea'] + dummy_column_names]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

0,1,2,3
Dep. Variable:,saleprice,R-squared:,0.777
Model:,OLS,Adj. R-squared:,0.775
Method:,Least Squares,F-statistic:,458.8
Date:,"Sat, 04 Jan 2020",Prob (F-statistic):,0.0
Time:,14:28:56,Log-Likelihood:,-17448.0
No. Observations:,1460,AIC:,34920.0
Df Residuals:,1448,BIC:,34980.0
Df Model:,11,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-6.103e+05,9.08e+04,-6.724,0.000,-7.88e+05,-4.32e+05
overallqual,2.168e+04,1160.704,18.675,0.000,1.94e+04,2.4e+04
grlivarea,47.1491,2.574,18.318,0.000,42.100,52.198
garagecars,9839.7613,3013.860,3.265,0.001,3927.763,1.58e+04
garagearea,17.1985,10.216,1.683,0.092,-2.841,37.238
totalbsmtsf,24.0665,2.940,8.186,0.000,18.299,29.834
yearbuilt,259.8229,47.628,5.455,0.000,166.396,353.250
lotarea,0.5715,0.107,5.365,0.000,0.363,0.780
mszoning_FV,1.453e+04,1.32e+04,1.097,0.273,-1.15e+04,4.05e+04

0,1,2,3
Omnibus:,460.624,Durbin-Watson:,1.985
Prob(Omnibus):,0.0,Jarque-Bera (JB):,58239.341
Skew:,-0.307,Prob(JB):,0.0
Kurtosis:,33.935,Cond. No.,1350000.0


After taking out street, R-squared, adjusted R-squared, and AIC are unchanged, while BIC drops slightly (from 34990 to 34980) because the model has one fewer parameter.

Let's log transform the target variable, add a summation variable (totalsf), and add an interaction variable of totalsf and overallqual to see if we can improve our model.

In [23]:
house_prices_df['totalsf'] = house_prices_df['totalbsmtsf'] + house_prices_df['firstflrsf'] + house_prices_df['secondflrsf']
house_prices_df['int_qual_sf'] = house_prices_df['totalsf'] * house_prices_df['overallqual']

Y = np.log1p(house_prices_df['saleprice'])
# fit new model
X = house_prices_df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'yearbuilt', 'lotarea', 'int_qual_sf'] + dummy_column_names]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

0,1,2,3
Dep. Variable:,saleprice,R-squared:,0.829
Model:,OLS,Adj. R-squared:,0.828
Method:,Least Squares,F-statistic:,640.2
Date:,"Sat, 04 Jan 2020",Prob (F-statistic):,0.0
Time:,14:38:10,Log-Likelihood:,559.83
No. Observations:,1460,AIC:,-1096.0
Df Residuals:,1448,BIC:,-1032.0
Df Model:,11,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,6.7570,0.404,16.745,0.000,5.965,7.549
overallqual,0.1082,0.007,15.734,0.000,0.095,0.122
grlivarea,0.0002,1.61e-05,12.940,0.000,0.000,0.000
garagecars,0.0671,0.013,5.014,0.000,0.041,0.093
garagearea,0.0001,4.54e-05,2.262,0.024,1.36e-05,0.000
yearbuilt,0.0018,0.000,8.697,0.000,0.001,0.002
lotarea,3.049e-06,4.67e-07,6.528,0.000,2.13e-06,3.97e-06
int_qual_sf,2.857e-06,1.49e-06,1.922,0.055,-5.88e-08,5.77e-06
mszoning_FV,0.4517,0.058,7.727,0.000,0.337,0.566

0,1,2,3
Omnibus:,785.499,Durbin-Watson:,2.012
Prob(Omnibus):,0.0,Jarque-Bera (JB):,23028.443
Skew:,-1.932,Prob(JB):,0.0
Kurtosis:,22.069,Cond. No.,2060000.0


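We increased the adjusted R-squared value to 0.828, which means that 17.2% of the variance of our (now log-transformed) target variable is unexplained by our model.

Additionally, AIC and BIC are much lower than in the previous models. However, because the target is now log(1 + saleprice) rather than saleprice, the likelihoods are on different scales, so these AIC and BIC values are not directly comparable with those of the earlier models.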
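
One way to put a log-target model and a level-target model on the same footing is to add the change-of-variables (Jacobian) term to the log model's log-likelihood before computing AIC. The sketch below illustrates this on synthetic, log-linear "prices"; the data, the helper names, and the use of NumPy instead of statsmodels are all assumptions made for illustration, not the notebook's method.

```python
import numpy as np

def gaussian_ols_llf(y, X):
    """Maximized Gaussian log-likelihood of an OLS fit (X includes a constant)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = np.sum((y - X @ beta) ** 2)
    return -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1.0)

def aic(llf, n_params):
    return 2 * n_params - 2 * llf

# Synthetic data for illustration only (not the house prices table)
rng = np.random.default_rng(42)
n = 500
x = rng.uniform(0, 2, n)
price = np.exp(10 + 1.0 * x + rng.normal(0, 0.3, n))
X = np.column_stack([np.ones(n), x])

# Model A: price in levels
llf_level = gaussian_ols_llf(price, X)

# Model B: log1p(price); its raw log-likelihood lives on the log scale
log_price = np.log1p(price)
llf_log_raw = gaussian_ols_llf(log_price, X)

# Change of variables: d/dy log1p(y) = 1 / (1 + y), so the log-likelihood
# of the original prices is the raw value minus sum(log1p(price))
llf_log_adjusted = llf_log_raw - np.sum(log_price)

aic_level = aic(llf_level, X.shape[1])
aic_log_raw = aic(llf_log_raw, X.shape[1])
aic_log_adjusted = aic(llf_log_adjusted, X.shape[1])

print(f"AIC, level model:           {aic_level:,.0f}")
print(f"AIC, log model (raw):       {aic_log_raw:,.0f}")
print(f"AIC, log model (adjusted):  {aic_log_adjusted:,.0f}")
```

The raw AIC of the log model looks dramatically smaller only because the log scale compresses the residuals; the adjusted value is the one that can be compared with the level model. Applying the same adjustment to the notebook's results would require the fitted log-likelihood and the sum of `np.log1p(saleprice)`.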