
### 1. Interpretation and significance

Suppose that we would like to know how much families in the US are spending on recreation annually. We've estimated the following model:

$$ expenditure = 873 + 0.0012annual\_income + 0.00002annual\_income^2 - 223.57have\_kids $$

*expenditure* is the annual spending on recreation in US dollars, *annual_income* is the annual income in US dollars, and *have_kids* is a dummy variable indicating the families with children. Interpret the estimated coefficients. What additional statistics should be given in order to make sure that your interpretations make sense statistically? Write up your answer.

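The bias term (intercept) is 873: the predicted annual recreation spending, in dollars, of a family with no kids and zero income. On average, holding income constant, families with kids spend \$223.57 less per year on recreation than families without children. 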

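The relationship between recreation expenditure and annual income is quadratic, so the effect of income depends on the income level. Holding *have_kids* fixed, a \$1 increase in annual income changes predicted expenditure by approximately 0.0012 + 2 x 0.00002 x annual_income = 0.0012 + 0.00004 x annual_income dollars. The first piece comes from the linear term and the second piece from the square of the annual income. Both coefficients are positive, so expenditure rises with income at an increasing rate. 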
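
As a rough check of this interpretation, the sketch below evaluates the fitted equation directly. Only the coefficients come from the model above; the function names and the two income levels are assumptions introduced here for illustration.

```python
# Coefficients copied from the estimated model above.
INTERCEPT = 873.0
B_INCOME = 0.0012
B_INCOME_SQ = 0.00002
B_KIDS = -223.57


def predicted_expenditure(annual_income, have_kids):
    """Predicted annual recreation spending (USD) from the fitted equation."""
    return (INTERCEPT
            + B_INCOME * annual_income
            + B_INCOME_SQ * annual_income ** 2
            + B_KIDS * have_kids)


def marginal_effect_of_income(annual_income):
    """Derivative of predicted expenditure with respect to annual income."""
    return B_INCOME + 2 * B_INCOME_SQ * annual_income


# Hypothetical income levels, chosen only for illustration.
for income in (10_000, 50_000):
    per_dollar = marginal_effect_of_income(income)
    exact_1k = predicted_expenditure(income + 1_000, 0) - predicted_expenditure(income, 0)
    print(f"income={income}: {per_dollar:.4f} per extra dollar, "
          f"{exact_1k:.2f} for an extra 1,000")

# The have_kids dummy shifts the prediction by the same amount at every income.
kids_gap = predicted_expenditure(50_000, 1) - predicted_expenditure(50_000, 0)
print(f"with kids vs. without: {kids_gap:.2f}")
```

The derivative is a per-dollar approximation, while the loop also prints the exact change for a 1,000 dollar raise, which includes the extra contribution of the squared term.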

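To ensure my interpretations make sense statistically, I need to know the p-value of each regression coefficient, or equivalently its standard error and t-statistic. A coefficient that is not statistically significant should not be interpreted as a real effect. 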

### 2. Weather model

In this exercise, you'll work with the historical temperature data from the previous checkpoint. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* First, load the dataset from the **weatherinszeged** table from Thinkful's database.
* Build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables? 
* Next, include the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for *humidity* and *windspeed* change? Interpret the estimated coefficients.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sqlalchemy import create_engine
import statsmodels.api as sm

import warnings 
warnings.filterwarnings('ignore')

%matplotlib inline

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
weather_df = pd.read_sql_query('select * from weatherinszeged',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()


weather_df.head()

Unnamed: 0,date,summary,preciptype,temperature,apparenttemperature,humidity,windspeed,windbearing,visibility,loudcover,pressure,dailysummary
0,2006-03-31 22:00:00+00:00,Partly Cloudy,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13,Partly cloudy throughout the day.
1,2006-03-31 23:00:00+00:00,Partly Cloudy,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63,Partly cloudy throughout the day.
2,2006-04-01 00:00:00+00:00,Mostly Cloudy,rain,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94,Partly cloudy throughout the day.
3,2006-04-01 01:00:00+00:00,Partly Cloudy,rain,8.288889,5.944444,0.83,14.1036,269.0,15.8263,0.0,1016.41,Partly cloudy throughout the day.
4,2006-04-01 02:00:00+00:00,Mostly Cloudy,rain,8.755556,6.977778,0.83,11.0446,259.0,15.8263,0.0,1016.51,Partly cloudy throughout the day.


In [10]:
Y = weather_df['apparenttemperature'] - weather_df['temperature']
X = weather_df[['humidity', 'windspeed']]

print(Y.head())
print(X.head())

0   -2.083333
1   -2.127778
2    0.000000
3   -2.344444
4   -1.777778
dtype: float64
   humidity  windspeed
0      0.89    14.1197
1      0.86    14.2646
2      0.89     3.9284
3      0.83    14.1036
4      0.83    11.0446


In [11]:
X = sm.add_constant(X)

lin_reg = sm.OLS(Y, X).fit()
lin_reg.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.288
Model:,OLS,Adj. R-squared:,0.288
Method:,Least Squares,F-statistic:,19490.0
Date:,"Wed, 16 Oct 2019",Prob (F-statistic):,0.0
Time:,14:53:49,Log-Likelihood:,-170460.0
No. Observations:,96453,AIC:,340900.0
Df Residuals:,96450,BIC:,340900.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.4381,0.021,115.948,0.000,2.397,2.479
humidity,-3.0292,0.024,-126.479,0.000,-3.076,-2.982
windspeed,-0.1193,0.001,-176.164,0.000,-0.121,-0.118

0,1,2,3
Omnibus:,3935.747,Durbin-Watson:,0.267
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4613.311
Skew:,-0.478,Prob(JB):,0.0
Kurtosis:,3.484,Cond. No.,88.1


$$ apparent\_temperature - temperature = 2.4381 - 3.0292 humidity - 0.1193 windspeed $$

+ All regression coefficients are statistically significant, with p-values below 0.05.

+ The bias term (intercept) is 2.4381. 

+ On average, holding windspeed constant, a one-unit increase in humidity decreases the difference between apparent temperature and temperature by 3.03 units; holding humidity constant, a one-unit increase in windspeed decreases the difference by 0.12 units. 
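
To make the interpretation concrete, the fitted equation can be evaluated by hand for the first rows printed earlier. This is only a sketch: the helper name `predict_gap` is an assumption introduced here, and the humidity, windspeed, and observed values are copied from the `X.head()` and `Y.head()` output above.

```python
# Coefficients from the OLS summary above.
CONST, B_HUMIDITY, B_WINDSPEED = 2.4381, -3.0292, -0.1193


def predict_gap(humidity, windspeed):
    """Fitted value of apparenttemperature - temperature."""
    return CONST + B_HUMIDITY * humidity + B_WINDSPEED * windspeed


# (humidity, windspeed, observed gap) for the first three rows shown above.
rows = [
    (0.89, 14.1197, -2.083333),
    (0.86, 14.2646, -2.127778),
    (0.89, 3.9284, 0.000000),
]

for humidity, windspeed, observed in rows:
    fitted = predict_gap(humidity, windspeed)
    print(f"fitted={fitted:.3f}  observed={observed:.3f}  "
          f"residual={observed - fitted:.3f}")
```

The fitted values will not match the observations exactly, which is consistent with the R-squared of 0.288 reported in the summary.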

In [12]:
weather_df['hum_wind'] = weather_df['humidity'] * weather_df['windspeed']
Y = weather_df['apparenttemperature'] - weather_df['temperature']
X = weather_df[['humidity', 'windspeed', 'hum_wind']]

In [13]:
X = sm.add_constant(X)

lin_reg = sm.OLS(Y, X).fit()
lin_reg.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.341
Model:,OLS,Adj. R-squared:,0.341
Method:,Least Squares,F-statistic:,16660.0
Date:,"Wed, 16 Oct 2019",Prob (F-statistic):,0.0
Time:,15:05:43,Log-Likelihood:,-166690.0
No. Observations:,96453,AIC:,333400.0
Df Residuals:,96449,BIC:,333400.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0839,0.033,2.511,0.012,0.018,0.149
humidity,0.1775,0.043,4.133,0.000,0.093,0.262
windspeed,0.0905,0.002,36.797,0.000,0.086,0.095
hum_wind,-0.2971,0.003,-88.470,0.000,-0.304,-0.291

0,1,2,3
Omnibus:,4849.937,Durbin-Watson:,0.265
Prob(Omnibus):,0.0,Jarque-Bera (JB):,9295.404
Skew:,-0.378,Prob(JB):,0.0
Kurtosis:,4.32,Cond. No.,193.0


$$ apparent\_temperature - temperature = 0.0839 + 0.1775 humidity + 0.0905 windspeed - 0.2971 (humidity \times windspeed) $$

+ All regression coefficients are statistically significant, with p-values below 0.05. 
+ After adding the interaction between humidity and windspeed, the signs of the humidity and windspeed coefficients changed from negative to positive. 
+ These two coefficients no longer describe average effects. The 0.18 on humidity is the effect of a 1 unit increase in humidity only when windspeed is zero, and the 0.09 on windspeed is the effect of a 1 unit increase in windspeed only when humidity is zero. 
+ According to the model, the coefficient of the interaction term is -0.30. We can interpret it as follows. At a given windspeed, a 1 unit increase in humidity changes the target by 0.18 - 0.30 x windspeed units. The higher the windspeed, the lower this effect, and it turns negative once windspeed exceeds about 0.6. Similarly, at a given humidity, a 1 unit increase in windspeed changes the target by 0.09 - 0.30 x humidity units, which turns negative once humidity exceeds about 0.3. So each variable dampens, and eventually reverses, the positive effect of the other on the target. 
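
The conditional effects in the interaction model can be tabulated with a few lines of arithmetic. The sketch below uses only the coefficients from the summary above; the function names are assumptions made for this example, and the two windspeed values are taken from the rows printed earlier.

```python
# Coefficients from the interaction model's OLS summary above.
B_HUMIDITY, B_WINDSPEED, B_INTERACTION = 0.1775, 0.0905, -0.2971


def humidity_effect(windspeed):
    """Change in the target per unit of humidity at a given windspeed."""
    return B_HUMIDITY + B_INTERACTION * windspeed


def windspeed_effect(humidity):
    """Change in the target per unit of windspeed at a given humidity."""
    return B_WINDSPEED + B_INTERACTION * humidity


# Values at which each marginal effect changes sign.
windspeed_break_even = -B_HUMIDITY / B_INTERACTION
humidity_break_even = -B_WINDSPEED / B_INTERACTION
print(f"humidity effect is negative above windspeed {windspeed_break_even:.2f}")
print(f"windspeed effect is negative above humidity {humidity_break_even:.2f}")

# Windspeed values taken from the rows printed earlier.
for windspeed in (3.9284, 14.1197):
    print(f"windspeed={windspeed}: humidity effect={humidity_effect(windspeed):.3f}")
```

For both sample windspeeds the humidity effect is already negative, which agrees in sign with the humidity coefficient of the model without the interaction term.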

### 3. House prices model

In this exercise, you'll interpret your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Run your house prices model again and interpret the results. Which features are statistically significant, and which are not?
* Now, exclude the insignificant features from your model. Did anything change?
* Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have a more prominent effect on house prices?
* Do the results sound reasonable to you? If not, try to explain the potential reasons.

In [15]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
house_df = pd.read_sql_query('select * from houseprices',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()


house_df.head()

Unnamed: 0,id,mssubclass,mszoning,lotfrontage,lotarea,street,alley,lotshape,landcontour,utilities,...,poolarea,poolqc,fence,miscfeature,miscval,mosold,yrsold,saletype,salecondition,saleprice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [16]:
subcat = ['mszoning', 'street', 'exterqual']

# Convert the categorical variables to dummies and remember the new column names.
# The list is created once, outside the loop, and extended (not appended to),
# so it ends up as one flat list instead of being reset on every iteration.
dummy_column_names = []

for col in subcat:
    dummies = pd.get_dummies(house_df[col], prefix=col, drop_first=True)
    house_df = pd.concat([house_df, dummies], axis=1)
    dummy_column_names.extend(dummies.columns)

house_df.head()

Unnamed: 0,id,mssubclass,mszoning,lotfrontage,lotarea,street,alley,lotshape,landcontour,utilities,...,salecondition,saleprice,mszoning_FV,mszoning_RH,mszoning_RL,mszoning_RM,street_Pave,exterqual_Fa,exterqual_Gd,exterqual_TA
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,Normal,208500,0,0,1,0,1,0,1,0
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,Normal,181500,0,0,1,0,1,0,0,1
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,Normal,223500,0,0,1,0,1,0,1,0
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,Abnorml,140000,0,0,1,0,1,0,0,1
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,Normal,250000,0,0,1,0,1,0,1,0


In [17]:
cat_column_names = list(house_df.columns[-8:])
cat_column_names

['mszoning_FV',
 'mszoning_RH',
 'mszoning_RL',
 'mszoning_RM',
 'street_Pave',
 'exterqual_Fa',
 'exterqual_Gd',
 'exterqual_TA']
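
Slicing the last eight columns works here, but it silently depends on column order. A slightly more defensive sketch selects the dummy columns by prefix instead. The list `columns` below is a shortened, hand-written stand-in for `house_df.columns`, used only so the example runs on its own.

```python
# Categorical features that were expanded into dummies.
subcat = ['mszoning', 'street', 'exterqual']

# Stand-in for list(house_df.columns): a few original columns plus the dummies.
columns = [
    'id', 'mszoning', 'street', 'exterqual', 'saleprice',
    'mszoning_FV', 'mszoning_RH', 'mszoning_RL', 'mszoning_RM',
    'street_Pave', 'exterqual_Fa', 'exterqual_Gd', 'exterqual_TA',
]

# str.startswith accepts a tuple, so one pass covers every prefix.
prefixes = tuple(f'{name}_' for name in subcat)
cat_column_names = [col for col in columns if col.startswith(prefixes)]

print(cat_column_names)
```

With the real frame, `columns` would be `house_df.columns`. The approach assumes that no unrelated column name starts with one of the prefixes.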

In [18]:
Y = house_df['saleprice']

X = house_df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalbsmtsf', 'firstflrsf',
              'fullbath', 'totrmsabvgrd', 'yearbuilt', 'yearremodadd'] + cat_column_names]

X.head()

Unnamed: 0,overallqual,grlivarea,garagecars,garagearea,totalbsmtsf,firstflrsf,fullbath,totrmsabvgrd,yearbuilt,yearremodadd,mszoning_FV,mszoning_RH,mszoning_RL,mszoning_RM,street_Pave,exterqual_Fa,exterqual_Gd,exterqual_TA
0,7,1710,2,548,856,856,2,8,2003,2003,0,0,1,0,1,0,1,0
1,6,1262,2,460,1262,1262,2,6,1976,1976,0,0,1,0,1,0,0,1
2,7,1786,2,608,920,920,2,6,2001,2002,0,0,1,0,1,0,1,0
3,7,1717,3,642,756,961,1,7,1915,1970,0,0,1,0,1,0,0,1
4,8,2198,3,836,1145,1145,2,9,2000,2000,0,0,1,0,1,0,1,0


In [19]:
X = sm.add_constant(X)

lin_reg = sm.OLS(Y, X).fit()

lin_reg.summary()

0,1,2,3
Dep. Variable:,saleprice,R-squared:,0.795
Model:,OLS,Adj. R-squared:,0.792
Method:,Least Squares,F-statistic:,309.5
Date:,"Wed, 16 Oct 2019",Prob (F-statistic):,0.0
Time:,15:27:15,Log-Likelihood:,-17389.0
No. Observations:,1460,AIC:,34820.0
Df Residuals:,1441,BIC:,34920.0
Df Model:,18,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-8.281e+05,1.39e+05,-5.961,0.000,-1.1e+06,-5.56e+05
overallqual,1.637e+04,1237.233,13.234,0.000,1.39e+04,1.88e+04
grlivarea,53.8024,4.077,13.198,0.000,45.806,61.799
garagecars,1.206e+04,2936.522,4.107,0.000,6300.127,1.78e+04
garagearea,8.8562,9.975,0.888,0.375,-10.710,28.423
totalbsmtsf,16.9508,4.128,4.106,0.000,8.852,25.049
firstflrsf,7.0145,4.825,1.454,0.146,-2.450,16.479
fullbath,-5485.6729,2587.432,-2.120,0.034,-1.06e+04,-410.136
totrmsabvgrd,-1431.8145,1089.432,-1.314,0.189,-3568.857,705.228

0,1,2,3
Omnibus:,644.161,Durbin-Watson:,1.996
Prob(Omnibus):,0.0,Jarque-Bera (JB):,92540.27
Skew:,-0.988,Prob(JB):,0.0
Kurtosis:,41.953,Cond. No.,529000.0


According to the model, the regression coefficients of all features except garagearea, firstflrsf, totrmsabvgrd, mszoning, and street are statistically significant. I will exclude these insignificant features from the new model. 
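
The same screening can be expressed in code. The sketch below hard-codes the p-values of the rows visible in the summary above so that it runs without the database; with the fitted model at hand, `lin_reg.pvalues` would supply the same numbers. The 0.05 threshold is the one used throughout this notebook, and the variable names are assumptions made for this example.

```python
ALPHA = 0.05

# P>|t| values copied from the coefficient rows shown above (constant omitted).
p_values = {
    'overallqual': 0.000,
    'grlivarea': 0.000,
    'garagecars': 0.000,
    'garagearea': 0.375,
    'totalbsmtsf': 0.000,
    'firstflrsf': 0.146,
    'fullbath': 0.034,
    'totrmsabvgrd': 0.189,
}

significant = [name for name, p in p_values.items() if p < ALPHA]
insignificant = [name for name, p in p_values.items() if p >= ALPHA]

print('keep:', significant)
print('drop:', insignificant)
```

Only the features listed in the visible part of the summary are covered here; the remaining rows would need to be read from the full output.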

In [27]:
Y = house_df['saleprice']

X = house_df[['overallqual', 'grlivarea', 'garagecars',
              'fullbath', 'yearbuilt', 'yearremodadd', 'mszoning_RL']]

X = sm.add_constant(X)

lin_reg = sm.OLS(Y, X).fit()

lin_reg.summary()

0,1,2,3
Dep. Variable:,saleprice,R-squared:,0.762
Model:,OLS,Adj. R-squared:,0.761
Method:,Least Squares,F-statistic:,664.4
Date:,"Wed, 16 Oct 2019",Prob (F-statistic):,0.0
Time:,15:46:49,Log-Likelihood:,-17496.0
No. Observations:,1460,AIC:,35010.0
Df Residuals:,1452,BIC:,35050.0
Df Model:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.184e+06,1.31e+05,-9.021,0.000,-1.44e+06,-9.27e+05
overallqual,2.258e+04,1184.126,19.071,0.000,2.03e+04,2.49e+04
grlivarea,59.3148,2.979,19.910,0.000,53.471,65.159
garagecars,1.587e+04,1842.753,8.610,0.000,1.23e+04,1.95e+04
fullbath,-9395.5710,2687.441,-3.496,0.000,-1.47e+04,-4123.889
yearbuilt,260.0664,52.302,4.972,0.000,157.471,362.662
yearremodadd,300.8563,65.392,4.601,0.000,172.584,429.129
mszoning_RL,1.8e+04,2627.373,6.852,0.000,1.28e+04,2.32e+04

0,1,2,3
Omnibus:,476.027,Durbin-Watson:,2.0
Prob(Omnibus):,0.0,Jarque-Bera (JB):,17945.068
Skew:,0.817,Prob(JB):,0.0
Kurtosis:,20.097,Cond. No.,412000.0


According to the new model, all regression coefficients are statistically significant, with p-values below 0.05. 
On average, holding the other features constant, 
+ a 1 unit increase in overallqual results in a \$22,580 increase in sale price.
+ a 1 unit increase in grlivarea results in a \$59 increase in sale price. 
+ a 1 unit increase in garagecars results in a \$15,870 increase in sale price. 
+ a 1 unit increase in fullbath results in a \$9,396 decrease in sale price. 
+ a 1 unit increase in yearbuilt results in a \$260 increase in sale price.
+ a 1 unit increase in yearremodadd results in a \$301 increase in sale price.
+ a house in the RL zone (mszoning_RL = 1) sells for \$18,000 more than a house in another zone.

As we can see, overall quality and residential low-density zoning have the most prominent effects on house prices. Both make sense to me. 
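
Raw coefficients are measured in different units, so one rough way to check such a ranking is to multiply each coefficient by a plausible change in its feature. In the sketch below the coefficients come from the summary above, while the changes are hypothetical values picked for illustration, not estimates from the data. It also assumes that grlivarea is measured in square feet.

```python
# Coefficients from the reduced model's OLS summary above.
coefficients = {
    'overallqual': 22580.0,
    'grlivarea': 59.3148,
    'garagecars': 15870.0,
    'fullbath': -9395.5710,
    'yearbuilt': 260.0664,
    'yearremodadd': 300.8563,
    'mszoning_RL': 18000.0,
}

# Hypothetical changes, chosen only for illustration.
assumed_changes = {
    'overallqual': 1,      # one quality point
    'grlivarea': 100,      # 100 more square feet (assumed unit)
    'garagecars': 1,       # one more garage space
    'fullbath': 1,         # one more full bathroom
    'yearbuilt': 10,       # built ten years later
    'yearremodadd': 10,    # remodelled ten years later
    'mszoning_RL': 1,      # RL zone versus any other zone
}

effects = {name: coefficients[name] * assumed_changes[name] for name in coefficients}

# Largest absolute effect first.
for name, effect in sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:>13}: {effect:>12,.0f} USD")
```

The ordering depends on the assumed changes. A larger step in grlivarea, for example, would move that feature up the list.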