# Assignment 

For this assignment, you'll revisit the historical temperature dataset. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* First, load the dataset from the **weatherinszeged** table from Thinkful's database.
* As in the previous checkpoint, build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why?
* Next, add the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Now, what is the R-squared of this model? Does this model improve upon the previous one?
* Add *visibility* as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the interaction term and *visibility* in terms of the improvement each brings to adjusted R-squared. Which one is more useful?
* Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.

# Weather model 

We will use the **weatherinszeged** dataset from the previous assignments to build linear regression models that predict the difference between the apparent temperature and the measured temperature.

Then, we will examine the __R-squared__ and __adjusted R-squared__ to see how the explanatory variables perform in each model. R-squared is a statistical measure of how close the data are to the fitted regression line: it is the proportion of the target's variance that the linear model explains. A value of zero indicates that the model explains none of the variability of the response data around its mean. In general, the higher the R-squared, the better the model fits the data. Because R-squared never decreases when a variable is added, adjusted R-squared, which penalizes additional explanatory variables, is the fairer measure when comparing models of different sizes.
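To make the two definitions concrete, here is a minimal sketch that computes both measures by hand. The function names and the toy data are made up for illustration and are not part of the assignment.

```python
def r_squared(y, y_hat):
    """Share of the variance in y that the fitted values y_hat explain."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)              # total variation around the mean
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_features):
    """Penalize R-squared for the number of explanatory variables (constant excluded)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)


# Toy data, invented for illustration only
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(y, y_hat)
adj_r2 = adjusted_r_squared(r2, n_obs=len(y), n_features=1)
print(round(r2, 3), round(adj_r2, 3))
```

With tens of thousands of observations and only a handful of features, the correction factor is very close to 1, which is why the two values can be nearly identical on a large dataset.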

We will also look at the __Akaike Information Criterion (AIC)__ and __Bayesian Information Criterion (BIC)__ when deciding which model performs the best. AIC estimates the relative amount of information lost by a given model: the less information a model loses, the higher the quality of that model. Both BIC and AIC attempt to resolve the problem of overfitting by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC. For both AIC and BIC, the lower the value the better. Hence, we choose the model with the lowest AIC or BIC value.
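As a rough sketch of how the two criteria are built, the snippet below applies the textbook formulas AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L). Here k is assumed to count the estimated regression coefficients including the constant; other conventions also count the error variance, which shifts every model's score by the same amount. The example inputs are hypothetical round numbers.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: a fixed penalty of 2 per parameter."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: a penalty of ln(n) per parameter."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical inputs: a log-likelihood of -1000, 3 parameters, 500 observations
example_aic = aic(-1000.0, n_params=3)
example_bic = bic(-1000.0, n_params=3, n_obs=500)
print(example_aic, round(example_bic, 2))
```

BIC's per-parameter penalty ln(n) exceeds AIC's penalty of 2 whenever n is larger than e^2, or about 7.4 observations, so on any realistically sized dataset BIC penalizes extra parameters more heavily.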

## Iteration 1 

We will start by building a linear regression model where the target variable is the difference between the apparenttemperature and the temperature. Humidity and windspeed will be our explanatory variables. 

In [1]:
# Libraries 

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as stats
from sklearn.decomposition import PCA 
from sklearn import linear_model
from sqlalchemy import create_engine
import warnings
import statsmodels.api as sm


# Import data

warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

weatherinszeged_df = pd.read_sql_query('select * from weatherinszeged',con=engine)

# no need for an open connection, 
# as we're only doing a single query
engine.dispose()

In [2]:
# Y is the target variable
Y = weatherinszeged_df['apparenttemperature'] - weatherinszeged_df['temperature']
# X is the feature set
X = weatherinszeged_df[['humidity','windspeed']]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.288 |
| Model: | OLS | Adj. R-squared: | 0.288 |
| Method: | Least Squares | F-statistic: | 19490.0 |
| Date: | Sun, 11 Aug 2019 | Prob (F-statistic): | 0.0 |
| Time: | 23:14:32 | Log-Likelihood: | -170460.0 |
| No. Observations: | 96453 | AIC: | 340900.0 |
| Df Residuals: | 96450 | BIC: | 340900.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.4381 | 0.021 | 115.948 | 0.000 | 2.397 | 2.479 |
| humidity | -3.0292 | 0.024 | -126.479 | 0.000 | -3.076 | -2.982 |
| windspeed | -0.1193 | 0.001 | -176.164 | 0.000 | -0.121 | -0.118 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 3935.747 | Durbin-Watson: | 0.267 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 4613.311 |
| Skew: | -0.478 | Prob(JB): | 0.0 |
| Kurtosis: | 3.484 | Cond. No. | 88.1 |


$$ \text{apparenttemperature} - \text{temperature} = 2.4381 - 3.0292 \times \text{humidity} - 0.1193 \times \text{windspeed} $$

In the tables above, we see that for our first model both R-squared and adjusted R-squared are 0.288. The model explains approximately 29% of the variance in the target, which is relatively low. About 71% of the target variance is left unexplained, so the model would not be satisfactory if we needed reasonably precise predictions.

However, a low R-squared is not inherently bad, since the gap between apparent and measured temperature can be difficult to predict. Because the coefficients are statistically significant, each one still estimates the mean change in the target variable for a one-unit change in the corresponding explanatory variable, holding the other constant. This still allows us to draw useful conclusions about how changes in the explanatory variables affect the target variable.
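To illustrate what the coefficients mean in practice, here is a small sketch that plugs values into the fitted equation of the first model. The coefficients are copied from the summary above. The input values are hypothetical, and we assume humidity is recorded as a fraction between 0 and 1.

```python
# Coefficients copied from the OLS summary of the first model
CONST, B_HUMIDITY, B_WINDSPEED = 2.4381, -3.0292, -0.1193


def predicted_difference(humidity, windspeed):
    """Predicted apparenttemperature minus temperature for the given inputs."""
    return CONST + B_HUMIDITY * humidity + B_WINDSPEED * windspeed


# Hypothetical inputs: same humidity, windspeed one unit apart
base = predicted_difference(humidity=0.5, windspeed=10.0)
more_wind = predicted_difference(humidity=0.5, windspeed=11.0)

print(round(base, 4))
print(round(more_wind - base, 4))  # equals the windspeed coefficient
```

Holding humidity fixed, one extra unit of windspeed changes the prediction by exactly the windspeed coefficient, which is the "mean change per unit" interpretation described above.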
 

## Iteration 2 


Let's add the interaction between humidity and windspeed to see if it will improve the model.

In [3]:
weatherinszeged_df['humidity_windspeed_interaction'] = weatherinszeged_df.humidity * weatherinszeged_df.windspeed

# Y is the target variable
Y = weatherinszeged_df['apparenttemperature'] - weatherinszeged_df['temperature']
# X is the feature set
X = weatherinszeged_df[['humidity','windspeed', 'humidity_windspeed_interaction']]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.341 |
| Model: | OLS | Adj. R-squared: | 0.341 |
| Method: | Least Squares | F-statistic: | 16660.0 |
| Date: | Sun, 11 Aug 2019 | Prob (F-statistic): | 0.0 |
| Time: | 23:23:43 | Log-Likelihood: | -166690.0 |
| No. Observations: | 96453 | AIC: | 333400.0 |
| Df Residuals: | 96449 | BIC: | 333400.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.0839 | 0.033 | 2.511 | 0.012 | 0.018 | 0.149 |
| humidity | 0.1775 | 0.043 | 4.133 | 0.000 | 0.093 | 0.262 |
| windspeed | 0.0905 | 0.002 | 36.797 | 0.000 | 0.086 | 0.095 |
| humidity_windspeed_interaction | -0.2971 | 0.003 | -88.470 | 0.000 | -0.304 | -0.291 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 4849.937 | Durbin-Watson: | 0.265 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9295.404 |
| Skew: | -0.378 | Prob(JB): | 0.0 |
| Kurtosis: | 4.32 | Cond. No. | 193.0 |


After adding the interaction between humidity and windspeed, the R-squared and adjusted R-squared for the second model are both 0.341. This model explains 5.3 percentage points more of the variance in the target variable than the first model, so the interaction term does improve upon it.
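With an interaction term in the model, the effect of windspeed is no longer a single number: it depends on the level of humidity. The sketch below uses the coefficients reported in the summary above. The humidity values are hypothetical and assume humidity is recorded as a fraction between 0 and 1.

```python
# Coefficients copied from the OLS summary of the second model
B_WINDSPEED, B_INTERACTION = 0.0905, -0.2971


def windspeed_effect(humidity):
    """Change in the predicted target per one-unit increase in windspeed."""
    return B_WINDSPEED + B_INTERACTION * humidity


# Hypothetical humidity levels
for humidity in (0.0, 0.5, 1.0):
    print(humidity, round(windspeed_effect(humidity), 4))
```

The slope on windspeed is slightly positive in completely dry air and turns negative as humidity rises, so the positive windspeed coefficient in the summary should not be read in isolation.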
 

## Iteration 3 

Now, we will add visibility as an explanatory variable to the first model to see if there is an improvement.

In [4]:
# Y is the target variable
Y = weatherinszeged_df['apparenttemperature'] - weatherinszeged_df['temperature']
# X is the feature set
X = weatherinszeged_df[['humidity','windspeed','visibility']]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.304 |
| Model: | OLS | Adj. R-squared: | 0.303 |
| Method: | Least Squares | F-statistic: | 14010.0 |
| Date: | Sun, 11 Aug 2019 | Prob (F-statistic): | 0.0 |
| Time: | 23:33:27 | Log-Likelihood: | -169380.0 |
| No. Observations: | 96453 | AIC: | 338800.0 |
| Df Residuals: | 96449 | BIC: | 338800.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.5756 | 0.028 | 56.605 | 0.000 | 1.521 | 1.630 |
| humidity | -2.6066 | 0.025 | -102.784 | 0.000 | -2.656 | -2.557 |
| windspeed | -0.1199 | 0.001 | -179.014 | 0.000 | -0.121 | -0.119 |
| visibility | 0.0540 | 0.001 | 46.614 | 0.000 | 0.052 | 0.056 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 3833.895 | Durbin-Watson: | 0.282 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 4584.022 |
| Skew: | -0.459 | Prob(JB): | 0.0 |
| Kurtosis: | 3.545 | Cond. No. | 131.0 |


From the summary, we can see that adding visibility increased R-squared from 0.288 to 0.304 and adjusted R-squared from 0.288 to 0.303. This model captures about 1.6 percentage points more of the variance in the target variable than the first model. By comparison, the interaction term raised adjusted R-squared by 0.053, against 0.015 for visibility, so the interaction term is the more useful addition. The second model captures the most variation in the target variable, with an R-squared of 0.341.



### Which weather model performs best?

Overall, the second model, which adds the interaction between humidity and windspeed, outperforms the other models. In terms of R-squared, the second model explains 34.1% of the variation in the target variable, while the first and third explain only 28.8% and 30.4%, respectively. The second model also has the smallest AIC and BIC, about 3.334e+05 each, compared with 3.409e+05 for the first model and 3.388e+05 for the third. This indicates that it loses the least information of the three, even after the penalty for its extra parameter.
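The comparison can also be done mechanically. The sketch below collects the adjusted R-squared, AIC, and BIC values reported in the three summaries and picks the best model under each criterion. The dictionary keys are labels chosen here for illustration.

```python
# Metrics copied from the three OLS summaries above (rounded as reported)
reported = {
    "humidity + windspeed": {"adj_r2": 0.288, "aic": 340900.0, "bic": 340900.0},
    "with interaction": {"adj_r2": 0.341, "aic": 333400.0, "bic": 333400.0},
    "with visibility": {"adj_r2": 0.303, "aic": 338800.0, "bic": 338800.0},
}

# Lower is better for AIC and BIC; higher is better for adjusted R-squared
best_by_aic = min(reported, key=lambda name: reported[name]["aic"])
best_by_bic = min(reported, key=lambda name: reported[name]["bic"])
best_by_adj_r2 = max(reported, key=lambda name: reported[name]["adj_r2"])

print(best_by_aic, best_by_bic, best_by_adj_r2, sep="\n")
```

All three criteria agree here, which makes the choice easy. When they disagree, BIC tends to favor the smaller model because of its heavier penalty.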

# House prices model

In this exercise, we will work on our house prices model. We will complete the following tasks: 

* Load the **houseprices** data from Thinkful's database.
* Run your house prices model again and assess the goodness of fit of your model using F-test, R-squared, adjusted R-squared, AIC and BIC.
* Do you think your model is satisfactory? If so, why?
* In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables. 
* For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?

## Iteration 1

We will reload our linear regression model, where the target variable is sale price and the explanatory variables are overall quality, above-grade (ground) living area, garage area, total basement area, and first-floor area.

In [5]:
# Libraries 

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as stats
from sklearn.decomposition import PCA 
from sklearn import linear_model
from sqlalchemy import create_engine
import warnings
import statsmodels.api as sm


# Import data

warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

houseprices_df = pd.read_sql_query('select * from houseprices',con=engine)

# no need for an open connection, 
# as we're only doing a single query
engine.dispose()

In [6]:
# Y is the target variable
Y = houseprices_df['saleprice']
# X is the feature set
X = houseprices_df[['overallqual','grlivarea','garagearea','totalbsmtsf','firstflrsf']]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.758 |
| Model: | OLS | Adj. R-squared: | 0.757 |
| Method: | Least Squares | F-statistic: | 911.5 |
| Date: | Mon, 12 Aug 2019 | Prob (F-statistic): | 0.0 |
| Time: | 00:22:33 | Log-Likelihood: | -17508.0 |
| No. Observations: | 1460 | AIC: | 35030.0 |
| Df Residuals: | 1454 | BIC: | 35060.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.034e+05 | 4938.326 | -20.948 | 0.000 | -1.13e+05 | -9.38e+04 |
| overallqual | 2.532e+04 | 1055.957 | 23.976 | 0.000 | 2.32e+04 | 2.74e+04 |
| grlivarea | 43.3833 | 2.699 | 16.074 | 0.000 | 38.089 | 48.678 |
| garagearea | 56.6798 | 6.126 | 9.252 | 0.000 | 44.662 | 68.697 |
| totalbsmtsf | 22.9518 | 4.340 | 5.289 | 0.000 | 14.439 | 31.465 |
| firstflrsf | 11.2909 | 5.070 | 2.227 | 0.026 | 1.345 | 21.237 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 507.298 | Durbin-Watson: | 1.98 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 52609.779 |
| Skew: | -0.602 | Prob(JB): | 0.0 |
| Kurtosis: | 32.383 | Cond. No. | 11400.0 |


In the tables above, we see that for our first model, R-squared is 0.758, adjusted R-squared is 0.757, the F-statistic is 911.5, AIC is 3.503e+04, and BIC is 3.506e+04. This is a pretty good start, since the model explains about 76% of the variation in the target variable.
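The F-statistic is closely tied to R-squared: it compares the explained variation per feature with the unexplained variation per residual degree of freedom. As a rough check, the sketch below rebuilds it from the rounded values in the summary. The function name is made up for illustration, and because R-squared is rounded to three decimals, the result only approximates the reported 911.5.

```python
def f_statistic(r2, n_obs, n_features):
    """Overall F-statistic of an OLS model with a constant, derived from R-squared."""
    explained = r2 / n_features                        # explained share per feature
    unexplained = (1 - r2) / (n_obs - n_features - 1)  # unexplained share per residual df
    return explained / unexplained


# Values copied from the house prices summary above
f_stat = f_statistic(r2=0.758, n_obs=1460, n_features=5)
print(round(f_stat, 1))
```

A value this large, with a p-value reported as 0.0, says the five features are jointly significant. It does not say the model is the best specification, which is why we also track adjusted R-squared, AIC, and BIC.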

Let's modify the explanatory variables to see if we can increase the R-squared and F-statistic and decrease the AIC and BIC values.

## Iteration 2 