### 1. Interpretation and significance

Suppose that we would like to know how much families in the US are spending on recreation annually. We've estimated the following model:

$$ expenditure = 873 + 0.0012 \cdot annual\_income + 0.00002 \cdot annual\_income^2 - 223.57 \cdot have\_kids $$

*expenditure* is the annual spending on recreation in US dollars, *annual_income* is the annual income in US dollars, and *have_kids* is a dummy variable indicating families with children. Interpret the estimated coefficients. What additional statistics should be given in order to make sure that your interpretations make sense statistically?

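Based on the estimated model, 873 is the intercept: the predicted annual recreation spending, in dollars, for a family with zero income and no children. The relationship between annual income and recreational expenditure is quadratic, so the effect of income cannot be read from a single coefficient. Holding *have_kids* fixed, a one-dollar increase in annual income changes expected expenditure by approximately 0.0012 + 2 × 0.00002 × *annual_income* dollars. Because the coefficient on the squared term is positive, each additional dollar of income raises recreation spending by more at higher income levels. Families with children spend, on average, \\$223.57 less per year on recreation than families without children at the same income. Since *have_kids* is a dummy variable, this is a difference between the two groups, not a per-child amount.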
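Because of the squared term, the effect of one more dollar of income depends on the income level. The following is a minimal sketch of that calculation using the coefficients from the equation above. The function names and the example income of 50,000 are illustrative assumptions, not part of the exercise.

```python
# Coefficients copied from the estimated model above.
INTERCEPT = 873
B_INCOME = 0.0012
B_INCOME_SQ = 0.00002
B_HAVE_KIDS = -223.57


def predicted_expenditure(annual_income, have_kids):
    """Predicted annual recreation spending in US dollars."""
    return (INTERCEPT
            + B_INCOME * annual_income
            + B_INCOME_SQ * annual_income ** 2
            + B_HAVE_KIDS * have_kids)


def marginal_effect_of_income(annual_income):
    """Derivative of the prediction with respect to annual_income."""
    return B_INCOME + 2 * B_INCOME_SQ * annual_income


# Hypothetical income level, chosen only for illustration.
income = 50_000
print(marginal_effect_of_income(income))
print(predicted_expenditure(income, 0) - predicted_expenditure(income, 1))
```

The marginal effect is 0.0012 at zero income and about 2.0012 at an income of 50,000, while the gap between families with and without children stays at 223.57 at every income level.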

To determine whether the coefficients are statistically significant, we need their standard errors, from which the t-statistics, p-values, and confidence intervals follow. In statsmodels, the `summary()` method of a fitted model reports all of these. The p-value associated with a t-test is the probability of observing a t-statistic at least as extreme as the estimated one if the true coefficient in the population were zero. The lower the p-value, the stronger the evidence against a zero coefficient. As a general rule of thumb, a coefficient with a p-value less than or equal to 0.1 is often called statistically significant, although the more common and stricter threshold is 0.05.


### 2. Weather model

In this exercise, we'll work with the historical temperature data from the previous checkpoint. We will complete the following tasks:

* First, load the dataset from the **weatherinszeged** table from Thinkful's database.
* Build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables? 
* Next, add the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for *humidity* and *windspeed* change? Interpret the estimated coefficients.

In [5]:
# Libraries 

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as stats
from sklearn.decomposition import PCA 
from sklearn import linear_model
from sqlalchemy import create_engine
import warnings
import statsmodels.api as sm


# Import data

warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

weatherinszeged_df = pd.read_sql_query('select * from weatherinszeged',con=engine)

# no need for an open connection, 
# as we're only doing a single query
engine.dispose()

In [6]:
# Y is the target variable
Y = weatherinszeged_df['apparenttemperature'] - weatherinszeged_df['temperature']
# X is the feature set
X = weatherinszeged_df[['humidity','windspeed']]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.288 |
| Model: | OLS | Adj. R-squared: | 0.288 |
| Method: | Least Squares | F-statistic: | 19490.0 |
| Date: | Sun, 11 Aug 2019 | Prob (F-statistic): | 0.0 |
| Time: | 19:22:35 | Log-Likelihood: | -170460.0 |
| No. Observations: | 96453 | AIC: | 340900.0 |
| Df Residuals: | 96450 | BIC: | 340900.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.4381 | 0.021 | 115.948 | 0.000 | 2.397 | 2.479 |
| humidity | -3.0292 | 0.024 | -126.479 | 0.000 | -3.076 | -2.982 |
| windspeed | -0.1193 | 0.001 | -176.164 | 0.000 | -0.121 | -0.118 |

| | | | |
|---|---|---|---|
| Omnibus: | 3935.747 | Durbin-Watson: | 0.267 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 4613.311 |
| Skew: | -0.478 | Prob(JB): | 0.0 |
| Kurtosis: | 3.484 | Cond. No. | 88.1 |


$$ difference\_of\_apparenttemperature\_and\_temperature = 2.4381 - 3.0292 \cdot humidity - 0.1193 \cdot windspeed $$

As shown in the summary, all the coefficients are statistically significant since their p-values are less than 0.05. The intercept (constant) of the model is 2.4381. I expected a positive relationship between humidity and the target variable. However, there appears to be an inverse relationship between humidity and the difference between apparent temperature and temperature. Holding windspeed constant, a one-unit increase in humidity decreases the difference by 3.029 on average. The same holds for windspeed: holding humidity constant, a one-unit increase in windspeed decreases the difference by 0.119.
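To make the coefficient sizes concrete, the fitted equation can be evaluated at a few points. This is a minimal sketch: the function name and the example humidity and windspeed values are hypothetical, chosen only for illustration, and the values assume humidity is recorded as a fraction between 0 and 1.

```python
# Coefficients copied from the first model's summary above.
CONST = 2.4381
B_HUMIDITY = -3.0292
B_WINDSPEED = -0.1193


def predicted_gap(humidity, windspeed):
    """Predicted apparenttemperature minus temperature."""
    return CONST + B_HUMIDITY * humidity + B_WINDSPEED * windspeed


# Hypothetical inputs: humidity as a fraction (assumption), windspeed of 10.
base = predicted_gap(0.5, 10)
more_humid = predicted_gap(0.6, 10)
more_wind = predicted_gap(0.5, 11)

print(more_humid - base)  # effect of +0.1 humidity, windspeed fixed
print(more_wind - base)   # effect of +1 windspeed, humidity fixed
```

Raising humidity by 0.1 lowers the predicted gap by about 0.303, and raising windspeed by one unit lowers it by about 0.119, whatever the starting values, because this model has no interaction term.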

Next, we will explore the interaction between humidity and windspeed.

In [7]:
weatherinszeged_df['humidity_windspeed_interaction'] = weatherinszeged_df.humidity * weatherinszeged_df.windspeed

# Y is the target variable
Y = weatherinszeged_df['apparenttemperature'] - weatherinszeged_df['temperature']
# X is the feature set
X = weatherinszeged_df[['humidity','windspeed', 'humidity_windspeed_interaction']]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.341 |
| Model: | OLS | Adj. R-squared: | 0.341 |
| Method: | Least Squares | F-statistic: | 16660.0 |
| Date: | Sun, 11 Aug 2019 | Prob (F-statistic): | 0.0 |
| Time: | 19:50:39 | Log-Likelihood: | -166690.0 |
| No. Observations: | 96453 | AIC: | 333400.0 |
| Df Residuals: | 96449 | BIC: | 333400.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.0839 | 0.033 | 2.511 | 0.012 | 0.018 | 0.149 |
| humidity | 0.1775 | 0.043 | 4.133 | 0.000 | 0.093 | 0.262 |
| windspeed | 0.0905 | 0.002 | 36.797 | 0.000 | 0.086 | 0.095 |
| humidity_windspeed_interaction | -0.2971 | 0.003 | -88.470 | 0.000 | -0.304 | -0.291 |

| | | | |
|---|---|---|---|
| Omnibus: | 4849.937 | Durbin-Watson: | 0.265 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9295.404 |
| Skew: | -0.378 | Prob(JB): | 0.0 |
| Kurtosis: | 4.32 | Cond. No. | 193.0 |


Estimated coefficients of model 1:
$$ difference\_of\_apparenttemperature\_and\_temperature = 2.4381 - 3.0292 \cdot humidity - 0.1193 \cdot windspeed $$

Estimated coefficients of model 2:
$$ difference\_of\_apparenttemperature\_and\_temperature = 0.0839 + 0.1775 \cdot humidity + 0.0905 \cdot windspeed - 0.2971 \cdot humidity \cdot windspeed $$

As in the first model, all coefficients have p-values less than 0.05 and are statistically significant. After adding the interaction between humidity and windspeed, the signs of the coefficients on humidity and windspeed changed from negative to positive. However, in a model with an interaction term these coefficients describe the effect of one variable only when the other is zero: 0.1775 is the effect of humidity when windspeed is 0, and 0.0905 is the effect of windspeed when humidity is 0. The interaction coefficient of -0.2971 means that the effect of each variable on the target falls by 0.2971 for every one-unit increase in the other.

We can also interpret the effect of a one-point increase in humidity or windspeed on the target as:

- humidity: a (0.1775 - 0.2971 × windspeed) point change in the target
- windspeed: a (0.0905 - 0.2971 × humidity) point change in the target
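The two expressions above can be evaluated directly to see where each effect changes sign. This is a minimal sketch using the coefficients of the second model; the function names and the example values passed to them are hypothetical.

```python
# Coefficients copied from the second model's summary above.
B_HUMIDITY = 0.1775
B_WINDSPEED = 0.0905
B_INTERACTION = -0.2971


def humidity_effect(windspeed):
    """Change in the target per one-unit increase in humidity."""
    return B_HUMIDITY + B_INTERACTION * windspeed


def windspeed_effect(humidity):
    """Change in the target per one-unit increase in windspeed."""
    return B_WINDSPEED + B_INTERACTION * humidity


# Values of the other variable at which each effect crosses zero.
windspeed_break_even = -B_HUMIDITY / B_INTERACTION
humidity_break_even = -B_WINDSPEED / B_INTERACTION

print(humidity_effect(0), humidity_effect(10))
print(windspeed_effect(0), windspeed_effect(0.5))
print(windspeed_break_even, humidity_break_even)
```

The humidity effect is positive only while windspeed is below roughly 0.60, and the windspeed effect is positive only while humidity is below roughly 0.30. Above those values both effects are negative, which is consistent with the negative signs in the first model.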

### 3. House prices model

In this exercise, we'll interpret our house prices model. We will complete the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Run your house prices model again and interpret the results. Which features are statistically significant, and which are not?
* Now, exclude the insignificant features from your model. Did anything change?
* Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have a more prominent effect on house prices?
* Do the results sound reasonable to you? If not, try to explain the potential reasons.

In [8]:
# Libraries 

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as stats
from sklearn.decomposition import PCA 
from sklearn import linear_model
from sqlalchemy import create_engine
import warnings
import statsmodels.api as sm


# Import data

warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

houseprices_df = pd.read_sql_query('select * from houseprices',con=engine)

# no need for an open connection, 
# as we're only doing a single query
engine.dispose()

In [13]:
# Y is the target variable
Y = houseprices_df['saleprice']
# X is the feature set
X = houseprices_df[['overallqual','grlivarea','garagearea','totalbsmtsf','firstflrsf']]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.758 |
| Model: | OLS | Adj. R-squared: | 0.757 |
| Method: | Least Squares | F-statistic: | 911.5 |
| Date: | Sun, 11 Aug 2019 | Prob (F-statistic): | 0.0 |
| Time: | 20:19:02 | Log-Likelihood: | -17508.0 |
| No. Observations: | 1460 | AIC: | 35030.0 |
| Df Residuals: | 1454 | BIC: | 35060.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.034e+05 | 4938.326 | -20.948 | 0.000 | -1.13e+05 | -9.38e+04 |
| overallqual | 2.532e+04 | 1055.957 | 23.976 | 0.000 | 2.32e+04 | 2.74e+04 |
| grlivarea | 43.3833 | 2.699 | 16.074 | 0.000 | 38.089 | 48.678 |
| garagearea | 56.6798 | 6.126 | 9.252 | 0.000 | 44.662 | 68.697 |
| totalbsmtsf | 22.9518 | 4.340 | 5.289 | 0.000 | 14.439 | 31.465 |
| firstflrsf | 11.2909 | 5.070 | 2.227 | 0.026 | 1.345 | 21.237 |

| | | | |
|---|---|---|---|
| Omnibus: | 507.298 | Durbin-Watson: | 1.98 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 52609.779 |
| Skew: | -0.602 | Prob(JB): | 0.0 |
| Kurtosis: | 32.383 | Cond. No. | 11400.0 |


Our estimated model for sale price:

$$ saleprice = -103{,}449.38 + 25{,}317.47 \cdot overallqual + 43.38 \cdot grlivarea + 56.68 \cdot garagearea + 22.95 \cdot totalbsmtsf + 11.29 \cdot firstflrsf $$

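All of the selected explanatory variables are statistically significant, since their p-values are less than 0.05. The weakest is *firstflrsf*, with a p-value of 0.026. Because no feature is insignificant at the 5% level, there is nothing to exclude and the model stays as it is.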
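The `t` and `P>|t|` columns in the summary above can be cross-checked by hand, since each t-statistic is the coefficient divided by its standard error. Below is a minimal sketch. The helper names are hypothetical, and the p-value uses a standard normal approximation, which is an assumption that should be close to the t-distribution statsmodels uses given the 1,454 residual degrees of freedom.

```python
import math

# (coef, std err) pairs copied from the summary table above.
coef_table = {
    "overallqual": (2.532e+04, 1055.957),
    "grlivarea": (43.3833, 2.699),
    "garagearea": (56.6798, 6.126),
    "totalbsmtsf": (22.9518, 4.340),
    "firstflrsf": (11.2909, 5.070),
}


def t_statistic(coef, std_err):
    """t-statistic for the null hypothesis that the coefficient is zero."""
    return coef / std_err


def two_sided_p_value(t):
    """Two-sided p-value under a standard normal approximation (assumption)."""
    return math.erfc(abs(t) / math.sqrt(2))


for name, (coef, std_err) in coef_table.items():
    t = t_statistic(coef, std_err)
    print(f"{name}: t = {t:.3f}, p = {two_sided_p_value(t):.3f}")
```

For *firstflrsf* this gives a t-statistic of about 2.227 and a p-value of about 0.026, in line with the table. Small differences in other rows come from rounding in the printed coefficients.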

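We can interpret a one-unit increase in each explanatory variable, holding the others constant, as the following change in the sale price:
 - a one-point increase in overall quality increases the sale price by \$25,317.47. This makes sense since higher quality houses will cost more than poor quality houses.

 - each additional square foot of above grade (ground) living area adds \$43.38

 - each additional square foot of garage area adds \$56.68

 - each additional square foot of basement area adds \$22.95

 - each additional square foot of first floor area adds \$11.29

Overall quality has by far the largest effect per unit, although a quality point is not directly comparable to a square foot. Among the area features, garage area has the largest per-square-foot effect.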
 
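I'm not fully sure why a square foot of garage is worth more than a square foot of basement. I would also have assumed that the prices for above ground living area and first floor area would be more similar. One potential reason is that these area features overlap: first floor area is part of above ground living area, and basement and first floor areas tend to move together, so the model cannot cleanly separate their individual contributions. The large condition number in the summary (11,400) is consistent with this kind of multicollinearity.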