As in previous lessons, please submit links to two Jupyter Notebooks (one for each assessment below).

Please submit links to all your work after the assessment questions.

#### 1. Weather model
For this assessment, you'll revisit the *historical temperature* dataset. To complete this assessment, submit a link to a Jupyter Notebook containing your solutions to the following tasks:

* First, load the dataset from the weatherinszeged table from Thinkful's database.
* As in the previous lesson, build a linear regression model where your target variable is the difference between the `apparenttemperature` and the `temperature`. As explanatory variables, use `humidity` and `windspeed`. Now, estimate your model using OLS. What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why?
* Next, add the interaction of `humidity` and `windspeed` to the model above and estimate the model using OLS. Now, what is the R-squared of this model? Does this model improve upon the previous one?
* Add `visibility` as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the improvement in adjusted R-squared contributed by the interaction term with the improvement contributed by `visibility`. Which one is more useful?
* Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.
#### 2. House prices model
In this exercise, you'll work on your house prices model. To complete this assessment, submit a link to a Jupyter Notebook containing your solutions to the following tasks:

* Load the houseprices data from Thinkful's database.
* Run your *house prices* model again and assess the goodness of fit of your model using an F-test, R-squared, adjusted R-squared, AIC, and BIC.
* Do you think that your model is satisfactory? If so, why?
* In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables.
* For each model that you try, get the goodness-of-fit metrics and compare your models with each other. Which model is the best and why?

# 1. Weather model

In [1]:
# For convenience, import all libraries here.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sqlalchemy import create_engine

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings(action="ignore")

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

In [3]:
# First, load the dataset from the weatherinszeged table from Thinkful's database.
engine_w = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

weather_df = pd.read_sql_query('select * from weatherinszeged',con=engine_w)

# No need for an open connection, because you're only doing a single query.
engine_w.dispose()

weather_df.info()
weather_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96453 entries, 0 to 96452
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype              
---  ------               --------------  -----              
 0   date                 96453 non-null  datetime64[ns, UTC]
 1   summary              96453 non-null  object             
 2   preciptype           96453 non-null  object             
 3   temperature          96453 non-null  float64            
 4   apparenttemperature  96453 non-null  float64            
 5   humidity             96453 non-null  float64            
 6   windspeed            96453 non-null  float64            
 7   windbearing          96453 non-null  float64            
 8   visibility           96453 non-null  float64            
 9   loudcover            96453 non-null  float64            
 10  pressure             96453 non-null  float64            
 11  dailysummary         96453 non-null  object             
dtypes: datetime64[ns, 

|   | date | summary | preciptype | temperature | apparenttemperature | humidity | windspeed | windbearing | visibility | loudcover | pressure | dailysummary |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2006-03-31 22:00:00+00:00 | Partly Cloudy | rain | 9.472 | 7.389 | 0.89 | 14.12 | 251.0 | 15.826 | 0.0 | 1015.13 | Partly cloudy throughout the day. |
| 1 | 2006-03-31 23:00:00+00:00 | Partly Cloudy | rain | 9.356 | 7.228 | 0.86 | 14.265 | 259.0 | 15.826 | 0.0 | 1015.63 | Partly cloudy throughout the day. |
| 2 | 2006-04-01 00:00:00+00:00 | Mostly Cloudy | rain | 9.378 | 9.378 | 0.89 | 3.928 | 204.0 | 14.957 | 0.0 | 1015.94 | Partly cloudy throughout the day. |
| 3 | 2006-04-01 01:00:00+00:00 | Partly Cloudy | rain | 8.289 | 5.944 | 0.83 | 14.104 | 269.0 | 15.826 | 0.0 | 1016.41 | Partly cloudy throughout the day. |
| 4 | 2006-04-01 02:00:00+00:00 | Mostly Cloudy | rain | 8.756 | 6.978 | 0.83 | 11.045 | 259.0 | 15.826 | 0.0 | 1016.51 | Partly cloudy throughout the day. |


In [4]:
# Build a linear regression model
# where your target variable is the difference between the apparenttemperature and the temperature. 
weather_df['temperature_difference'] = weather_df['apparenttemperature'] - weather_df['temperature'] 
Y_1 = weather_df['temperature_difference']

# As explanatory variables, use humidity and windspeed. 
X_1 = weather_df[['humidity', 'windspeed']]
X_1 = sm.add_constant(X_1)

# Now, estimate your model using OLS. 
results_1 = sm.OLS(Y_1, X_1).fit()
print(results_1.summary())

                              OLS Regression Results                              
Dep. Variable:     temperature_difference   R-squared:                       0.288
Model:                                OLS   Adj. R-squared:                  0.288
Method:                     Least Squares   F-statistic:                 1.949e+04
Date:                    Sat, 04 Dec 2021   Prob (F-statistic):               0.00
Time:                            15:23:56   Log-Likelihood:            -1.7046e+05
No. Observations:                   96453   AIC:                         3.409e+05
Df Residuals:                       96450   BIC:                         3.409e+05
Df Model:                               2                                         
Covariance Type:                nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       

### What are the R-squared and adjusted R-squared values? 
The R-squared value is 0.288, and the adjusted R-squared value is also 0.288.

### Do you think they are satisfactory? Why?
The model is not satisfactory. An R-squared of 0.288 means that humidity and wind speed explain only about 29% of the variance in the temperature difference, leaving roughly 71% unexplained.
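The R-squared and adjusted R-squared coincide here because the sample is very large relative to the number of predictors. A minimal sketch of the standard adjustment formula, using the rounded values from the summary above (the function name is illustrative, not part of statsmodels):

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared for a model that includes an intercept."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values read from the summary above; R-squared is rounded to three decimals.
adj = adjusted_r_squared(r_squared=0.288, n_obs=96453, n_predictors=2)
print(f"{adj:.3f}")  # the penalty only shows up beyond the third decimal
```

With 96,453 observations and two predictors, the correction factor is so close to 1 that the penalty is invisible at three decimals.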

In [5]:
# Next, include the interaction of humidity and windspeed to the model above and estimate the model using OLS.
weather_df['humidity_windspeed'] = weather_df['humidity'] * weather_df['windspeed']

Y_2 = weather_df['temperature_difference']
X_2 = weather_df[['humidity', 'windspeed', 'humidity_windspeed']]
X_2 = sm.add_constant(X_2)
results_2 = sm.OLS(Y_2, X_2).fit()
print(results_2.summary())

                              OLS Regression Results                              
Dep. Variable:     temperature_difference   R-squared:                       0.341
Model:                                OLS   Adj. R-squared:                  0.341
Method:                     Least Squares   F-statistic:                 1.666e+04
Date:                    Sat, 04 Dec 2021   Prob (F-statistic):               0.00
Time:                            15:28:39   Log-Likelihood:            -1.6669e+05
No. Observations:                   96453   AIC:                         3.334e+05
Df Residuals:                       96449   BIC:                         3.334e+05
Df Model:                               3                                         
Covariance Type:                nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------

### Now, what is the R-squared of this model? 
The R-squared of the model is now 0.341.

### Does this model improve upon the previous one?
Yes, the model improved on the previous one: R-squared rose from 0.288 to 0.341, so the interaction term explains about five more percentage points of the variance. However, 0.341 is still low, and most of the variance remains unexplained.
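One way to check whether the gain from the interaction term is more than noise is a partial F-test between the two nested models. The sketch below is an assumption-laden illustration: it works from the rounded R-squared values printed in the two summaries, so the result is approximate, and the helper name is hypothetical.

```python
def nested_f_statistic(r2_reduced: float, r2_full: float,
                       n_added: int, df_resid_full: int) -> float:
    """Partial F-statistic for adding n_added terms to a nested OLS model."""
    gain_per_term = (r2_full - r2_reduced) / n_added
    unexplained_per_df = (1 - r2_full) / df_resid_full
    return gain_per_term / unexplained_per_df


# Rounded values from the summaries: 0.288 without and 0.341 with the interaction.
# The full model has 96449 residual degrees of freedom.
f_stat = nested_f_statistic(r2_reduced=0.288, r2_full=0.341,
                            n_added=1, df_resid_full=96449)

# For one added term and a very large sample, the 5% critical value is about 3.84.
print(f"partial F = {f_stat:.0f}, exceeds 3.84: {f_stat > 3.84}")
```

The statistic is in the thousands, so the interaction term is statistically significant even though the resulting R-squared remains modest.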

In [7]:
# Add visibility as an additional explanatory variable to the first model and estimate it. 
Y_3 = weather_df['temperature_difference']
X_3 = weather_df[['humidity', 'windspeed', 'visibility']]
X_3 = sm.add_constant(X_3)
results_3 = sm.OLS(Y_3, X_3).fit()
print(results_3.summary())

                              OLS Regression Results                              
Dep. Variable:     temperature_difference   R-squared:                       0.304
Model:                                OLS   Adj. R-squared:                  0.303
Method:                     Least Squares   F-statistic:                 1.401e+04
Date:                    Sat, 04 Dec 2021   Prob (F-statistic):               0.00
Time:                            15:32:29   Log-Likelihood:            -1.6938e+05
No. Observations:                   96453   AIC:                         3.388e+05
Df Residuals:                       96449   BIC:                         3.388e+05
Df Model:                               3                                         
Covariance Type:                nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       

### Did R-squared increase? 
Yes. R-squared increased from 0.288 in the first model to 0.304.

### What about adjusted R-squared? 
It increased as well, from 0.288 to 0.303.

### Compare the differences put on the table by the interaction term and the visibility in terms of the improvement in the adjusted R-squared. Which one is more useful?
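Relative to the first model's adjusted R-squared of 0.288, the interaction term raises it to 0.341 (a gain of 0.053), while `visibility` raises it to only 0.303 (a gain of 0.015). The interaction term is therefore more useful in this model.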

### Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.
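With respect to the AIC and BIC scores, the model with the interaction term is the best of the three. Its AIC and BIC (both about 3.334e+05) are the lowest, compared with 3.409e+05 for the first model and 3.388e+05 for the model with `visibility`.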
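The AIC and BIC values can be reproduced, approximately, from the log-likelihoods printed in the three summaries. This is a sketch under two assumptions: the log-likelihoods are the rounded values shown above, and the parameter count includes the intercept (which matches the reported AIC values). The model labels are illustrative.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2*logL."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: k*ln(n) - 2*logL."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


n_obs = 96453
# (rounded log-likelihood from the summary, number of parameters incl. intercept)
models = {
    "humidity + windspeed": (-1.7046e5, 3),
    "with interaction": (-1.6669e5, 4),
    "with visibility": (-1.6938e5, 4),
}

scores = {name: (aic(ll, k), bic(ll, k, n_obs)) for name, (ll, k) in models.items()}
for name, (a, b) in scores.items():
    print(f"{name:22s} AIC={a:,.0f}  BIC={b:,.0f}")

# Lowest AIC wins; BIC acts as the tie-breaker because tuples compare in order.
best = min(scores, key=lambda name: scores[name])
print("best:", best)
```

Because the inputs are rounded, the last digits may differ slightly from the summary tables, but the ordering of the three models is unaffected.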

# 2. House prices model

In [8]:
# Load the houseprices data from Thinkful's database.
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

# Create the engine for the houseprices database.
engine_h = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

houseprices_df = pd.read_sql_query('select * from houseprices',con=engine_h)

# No need for an open connection, because you're only doing a single query.
engine_h.dispose()

houseprices_df.info()
houseprices_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             1460 non-null   int64  
 1   mssubclass     1460 non-null   int64  
 2   mszoning       1460 non-null   object 
 3   lotfrontage    1201 non-null   float64
 4   lotarea        1460 non-null   int64  
 5   street         1460 non-null   object 
 6   alley          91 non-null     object 
 7   lotshape       1460 non-null   object 
 8   landcontour    1460 non-null   object 
 9   utilities      1460 non-null   object 
 10  lotconfig      1460 non-null   object 
 11  landslope      1460 non-null   object 
 12  neighborhood   1460 non-null   object 
 13  condition1     1460 non-null   object 
 14  condition2     1460 non-null   object 
 15  bldgtype       1460 non-null   object 
 16  housestyle     1460 non-null   object 
 17  overallqual    1460 non-null   int64  
 18  overallc

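|   | id | mssubclass | mszoning | lotfrontage | lotarea | street | alley | lotshape | landcontour | utilities | ... | poolarea | poolqc | fence | miscfeature | miscval | mosold | yrsold | saletype | salecondition | saleprice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  |  |  | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  |  |  | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 12 | 2008 | WD | Normal | 250000 |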


In [27]:
# Run your house prices model again and assess the goodness of fit of your model 
# using an F-test, R-squared, adjusted R-squared, AIC, and BIC.
Y_4 = houseprices_df['saleprice']
X_4 = houseprices_df[['lotarea', 'overallqual', 'yearbuilt', 'garagecars']]
X_4 = sm.add_constant(X_4)
results_4 = sm.OLS(Y_4, X_4).fit()
print(results_4.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.695
Model:                            OLS   Adj. R-squared:                  0.694
Method:                 Least Squares   F-statistic:                     829.6
Date:                Sat, 04 Dec 2021   Prob (F-statistic):               0.00
Time:                        16:01:50   Log-Likelihood:                -17677.
No. Observations:                1460   AIC:                         3.536e+04
Df Residuals:                    1455   BIC:                         3.539e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const       -3.604e+05   9.29e+04     -3.881      

* F-test: The F-statistic is 829.6, and the associated p-value is close to 0. I can therefore reject the hypothesis that all slope coefficients are zero: my features add information beyond an intercept-only (reduced) model, so the model is useful in explaining the sale price.
* R-squared: The R-squared value is 0.695, which has some room to improve. This means that my model explains 69.5% of the variance in the sale price.
* Adjusted R-squared: My adjusted R-squared value is 0.694, which is very similar to the R-squared value. This indicates that the penalty for the number of variables is negligible with four features and 1,460 observations.
* AIC and BIC: My AIC value is about 35,360, and my BIC value is about 35,390. These scores are meaningful only in comparison with other models fit to the same data, where lower values indicate a better model, so on their own they neither confirm nor rule out room for improvement.
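The overall F-statistic is tied directly to R-squared, which gives a quick sanity check on the value read from the summary. A minimal sketch, assuming a model with an intercept and using the rounded R-squared of 0.695 (the function name is illustrative):

```python
def overall_f_statistic(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """F-statistic for H0: all slope coefficients are zero (model has an intercept)."""
    df_resid = n_obs - n_predictors - 1
    return (r_squared / n_predictors) / ((1 - r_squared) / df_resid)


# Values from the summary above: R-squared 0.695, 1460 observations, 4 predictors.
f_stat = overall_f_statistic(r_squared=0.695, n_obs=1460, n_predictors=4)
print(f"{f_stat:.1f}")
```

The result lands close to the 829.6 shown in the summary output; the small gap comes from R-squared being rounded to three decimals.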

### Do you think that your model is satisfactory? If so, why?
My model is not fully satisfactory. The R-squared and adjusted R-squared leave about 30% of the variance in the sale price unexplained, and alternative specifications may lower the AIC and BIC.

In [28]:
# In order to improve the goodness of fit of your model, try different model specifications 
# by adding or removing some variables.
Y_5 = houseprices_df['saleprice']
X_5 = houseprices_df[['lotarea', 'overallqual', 'yearbuilt', 'garagecars', 'yearremodadd']]
X_5 = sm.add_constant(X_5)
results_5 = sm.OLS(Y_5, X_5).fit()
print(results_5.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.699
Model:                            OLS   Adj. R-squared:                  0.698
Method:                 Least Squares   F-statistic:                     674.0
Date:                Sat, 04 Dec 2021   Prob (F-statistic):               0.00
Time:                        16:02:08   Log-Likelihood:                -17669.
No. Observations:                1460   AIC:                         3.535e+04
Df Residuals:                    1454   BIC:                         3.538e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const        -7.768e+05   1.38e+05     -5.628   

In [29]:
Y_6 = houseprices_df['saleprice']
X_6 = houseprices_df[['lotarea', 'overallqual', 'garagecars', 'yearremodadd']]
X_6 = sm.add_constant(X_6)
results_6 = sm.OLS(Y_6, X_6).fit()
print(results_6.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.698
Model:                            OLS   Adj. R-squared:                  0.698
Method:                 Least Squares   F-statistic:                     842.2
Date:                Sat, 04 Dec 2021   Prob (F-statistic):               0.00
Time:                        16:02:16   Log-Likelihood:                -17669.
No. Observations:                1460   AIC:                         3.535e+04
Df Residuals:                    1455   BIC:                         3.537e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const        -7.275e+05    1.3e+05     -5.591   

In [31]:
houseprices_df['lotarea_overallqual'] = houseprices_df.lotarea * houseprices_df.overallqual

Y_7 = houseprices_df['saleprice']
X_7 = houseprices_df[['lotarea', 'overallqual', 'garagecars', 'yearremodadd', 'lotarea_overallqual']]
X_7 = sm.add_constant(X_7)
results_7 = sm.OLS(Y_7, X_7).fit()
print(results_7.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.702
Model:                            OLS   Adj. R-squared:                  0.701
Method:                 Least Squares   F-statistic:                     684.8
Date:                Sat, 04 Dec 2021   Prob (F-statistic):               0.00
Time:                        16:04:12   Log-Likelihood:                -17660.
No. Observations:                1460   AIC:                         3.533e+04
Df Residuals:                    1454   BIC:                         3.536e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const               -7.152e+05   1

In [33]:
Y_8 = houseprices_df['saleprice']
X_8 = houseprices_df[['overallqual', 'garagecars', 'yearremodadd', 'lotarea_overallqual']]
X_8 = sm.add_constant(X_8)
results_8 = sm.OLS(Y_8, X_8).fit()
print(results_8.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.701
Model:                            OLS   Adj. R-squared:                  0.700
Method:                 Least Squares   F-statistic:                     853.3
Date:                Sat, 04 Dec 2021   Prob (F-statistic):               0.00
Time:                        16:05:00   Log-Likelihood:                -17662.
No. Observations:                1460   AIC:                         3.533e+04
Df Residuals:                    1455   BIC:                         3.536e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const               -7.245e+05   1

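All things considered, the last model seems to be the best so far. It has the highest F-statistic (853.3) and, at the displayed precision, shares the lowest AIC and BIC (about 35,330 and 35,360) with the previous model. Its R-squared (0.701) is marginally below the previous model's 0.702, but it uses one fewer variable, and its overall p-values are the best.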
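A comparison like this is easier to audit when the metrics sit side by side. The sketch below uses a stand-in `Fit` tuple filled with the values printed in the five summaries; its field names mirror attributes on fitted statsmodels results (`rsquared`, `rsquared_adj`, `fvalue`, `aic`, `bic`, `df_model`), so the fitted `results_4` to `results_8` objects could be passed instead. The tie-breaking rule (fewer predictors wins when AIC and BIC match) is an assumption, not something the assignment prescribes.

```python
from collections import namedtuple

# Stand-in for fitted results; values are copied from the summaries above.
Fit = namedtuple("Fit", ["rsquared", "rsquared_adj", "fvalue", "aic", "bic", "df_model"])

fits = {
    "results_4": Fit(0.695, 0.694, 829.6, 3.536e4, 3.539e4, 4),
    "results_5": Fit(0.699, 0.698, 674.0, 3.535e4, 3.538e4, 5),
    "results_6": Fit(0.698, 0.698, 842.2, 3.535e4, 3.537e4, 4),
    "results_7": Fit(0.702, 0.701, 684.8, 3.533e4, 3.536e4, 5),
    "results_8": Fit(0.701, 0.700, 853.3, 3.533e4, 3.536e4, 4),
}


def rank_models(fits):
    """Sort model names by AIC, then BIC, then number of predictors (fewer first)."""
    return sorted(fits, key=lambda name: (fits[name].aic, fits[name].bic,
                                          fits[name].df_model))


ranking = rank_models(fits)
for name in ranking:
    f = fits[name]
    print(f"{name}: R2={f.rsquared:.3f} adjR2={f.rsquared_adj:.3f} "
          f"F={f.fvalue:.1f} AIC={f.aic:.0f} BIC={f.bic:.0f} k={f.df_model:.0f}")
```

With the rounded summary values, `results_7` and `results_8` tie on AIC and BIC, and the parsimony tie-breaker puts `results_8` first.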