### 1 - Interpretation and significance:
------------------------------------

The model is: 

 *expenditure* = 873 + 0.0012*annual_income* + 0.00002*annual_income^2* - 223.57*have kids*
 
What the question does not provide is the statistical significance of the coefficients. Although the estimates appear different from zero, a coefficient that is statistically insignificant cannot be distinguished from zero, so we should not interpret it as an effect. The t-statistics or the associated p-values should therefore be provided.


Assuming that all the estimated coefficients are statistically significant, we can interpret the model as follows: 

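* The intercept (bias term) is 873: the predicted expenditure for a family with zero income and no kids.
* On average, families with kids spend \$223.57 less than families without kids, holding income constant.
* The relationship between recreation expenditure and income is quadratic, so the effect of income is not constant. The marginal effect of one additional unit of annual income is 0.0012 + 2 x 0.00002 x annual_income = 0.0012 + 0.00004 x annual_income. If income is measured in dollars, an extra \$1,000 of income raises expenditure by roughly \$1.20 plus 0.04 times the current income level. Because the coefficient of the squared term is positive, the effect of income grows as the level of income increases.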
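
The interpretation above can be checked numerically. The following is a minimal sketch that hard-codes the coefficients of the stated model; the function names and the income levels are hypothetical, and the unit of income is whatever the original question uses.

```python
def expenditure(annual_income, have_kids):
    """Predicted recreation expenditure from the stated model."""
    return (873
            + 0.0012 * annual_income
            + 0.00002 * annual_income ** 2
            - 223.57 * have_kids)


def marginal_effect_of_income(annual_income):
    """Derivative of expenditure with respect to annual_income."""
    return 0.0012 + 2 * 0.00002 * annual_income


# The marginal effect grows with the income level (hypothetical levels):
for income in (100, 500, 1000):
    print(income, marginal_effect_of_income(income))

# The kids effect is a constant shift, independent of income:
print(expenditure(1000, 0) - expenditure(1000, 1))
```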

### 2 - Weather model:
--------------------------------

#### Load the dataset:

In [1]:
# Import libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import statsmodels.api as sm
from scipy import stats
from sklearn import linear_model
from sqlalchemy import create_engine

warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
# Edit pandas display option to show more rows and columns:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [3]:
# Query the database to extract dataset:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

df = pd.read_sql_query('SELECT * FROM weatherinszeged', con=engine)

# Dispose the connection, as we're only doing a single query:
engine.dispose()

In [4]:
df['target'] =  df['temperature']-df['apparenttemperature']
df.head()

Unnamed: 0,date,summary,preciptype,temperature,apparenttemperature,humidity,windspeed,windbearing,visibility,loudcover,pressure,dailysummary,target
0,2006-03-31 22:00:00+00:00,Partly Cloudy,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13,Partly cloudy throughout the day.,2.083333
1,2006-03-31 23:00:00+00:00,Partly Cloudy,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63,Partly cloudy throughout the day.,2.127778
2,2006-04-01 00:00:00+00:00,Mostly Cloudy,rain,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94,Partly cloudy throughout the day.,0.0
3,2006-04-01 01:00:00+00:00,Partly Cloudy,rain,8.288889,5.944444,0.83,14.1036,269.0,15.8263,0.0,1016.41,Partly cloudy throughout the day.,2.344444
4,2006-04-01 02:00:00+00:00,Mostly Cloudy,rain,8.755556,6.977778,0.83,11.0446,259.0,15.8263,0.0,1016.51,Partly cloudy throughout the day.,1.777778


#### Build a linear regression model where your target variable is the difference between the apparenttemperature and the temperature. As explanatory variables, use humidity and windspeed. Now, estimate your model using OLS.

In [5]:
# Define the target variable and the explanatory variables:
Y = df['target']
X = df[['humidity', 'windspeed']]

# Manually add a constant in statsmodels' sm
X = sm.add_constant(X)

# Fit the variables to the regression model
results = sm.OLS(Y, X).fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | target | R-squared: | 0.288 |
| Model: | OLS | Adj. R-squared: | 0.288 |
| Method: | Least Squares | F-statistic: | 19490.0 |
| Date: | Tue, 07 Jan 2020 | Prob (F-statistic): | 0.0 |
| Time: | 17:56:03 | Log-Likelihood: | -170460.0 |
| No. Observations: | 96453 | AIC: | 340900.0 |
| Df Residuals: | 96450 | BIC: | 340900.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -2.4381 | 0.021 | -115.948 | 0.000 | -2.479 | -2.397 |
| humidity | 3.0292 | 0.024 | 126.479 | 0.000 | 2.982 | 3.076 |
| windspeed | 0.1193 | 0.001 | 176.164 | 0.000 | 0.118 | 0.121 |

| | | | |
|---|---|---|---|
| Omnibus: | 3935.747 | Durbin-Watson: | 0.267 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 4613.311 |
| Skew: | 0.478 | Prob(JB): | 0.0 |
| Kurtosis: | 3.484 | Cond. No. | 88.1 |


* All of the estimated coefficients are statistically significant, as their p-values are far below 0.05.
* The regression model is: *target* = -2.4381 + 3.0292*humidity* + 0.1193*windspeed*. We can interpret the model as:
> The intercept of the model is -2.4381. Holding windspeed constant, a one-unit increase in humidity raises the target variable (computed above as temperature minus apparenttemperature) by 3.0292 degrees. Since humidity is recorded as a fraction (e.g. 0.89), a 0.01 increase in humidity corresponds to a rise of about 0.03 degrees. Holding humidity constant, a one-unit increase in windspeed raises the target by 0.1193 degrees.
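
To make the fitted equation concrete, here is a small sketch that plugs the first row shown by `df.head()` into the estimated coefficients. The helper name `predict_target` is hypothetical; the coefficients and the row values are copied from the outputs above.

```python
def predict_target(humidity, windspeed):
    """Fitted first model: target = -2.4381 + 3.0292*humidity + 0.1193*windspeed."""
    return -2.4381 + 3.0292 * humidity + 0.1193 * windspeed


# First row of df.head(): humidity=0.89, windspeed=14.1197, observed target=2.083333
predicted = predict_target(0.89, 14.1197)
residual = 2.083333 - predicted
print(round(predicted, 4), round(residual, 4))

# Effect of a 0.1 increase in humidity at fixed windspeed (0.1 * 3.0292):
humidity_step = predict_target(0.99, 14.1197) - predict_target(0.89, 14.1197)
print(round(humidity_step, 4))
```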

#### Include the interaction of humidity and windspeed to the model above and estimate the model using OLS. Are the coefficients statistically significant? Interpret the estimated coefficients

In [6]:
# Define the target variable and the explanatory variables:
df['hum_wind_interaction'] = df['humidity'] * df['windspeed']
Y1 = df['target']
X1 = df[['humidity', 'windspeed', 'hum_wind_interaction']]

# Manually add a constant in statsmodels' sm
X1 = sm.add_constant(X1)

# Fit the variables to the regression model
results = sm.OLS(Y1, X1).fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | target | R-squared: | 0.341 |
| Model: | OLS | Adj. R-squared: | 0.341 |
| Method: | Least Squares | F-statistic: | 16660.0 |
| Date: | Tue, 07 Jan 2020 | Prob (F-statistic): | 0.0 |
| Time: | 17:56:03 | Log-Likelihood: | -166690.0 |
| No. Observations: | 96453 | AIC: | 333400.0 |
| Df Residuals: | 96449 | BIC: | 333400.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.0839 | 0.033 | -2.511 | 0.012 | -0.149 | -0.018 |
| humidity | -0.1775 | 0.043 | -4.133 | 0.000 | -0.262 | -0.093 |
| windspeed | -0.0905 | 0.002 | -36.797 | 0.000 | -0.095 | -0.086 |
| hum_wind_interaction | 0.2971 | 0.003 | 88.470 | 0.000 | 0.291 | 0.304 |

| | | | |
|---|---|---|---|
| Omnibus: | 4849.937 | Durbin-Watson: | 0.265 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9295.404 |
| Skew: | 0.378 | Prob(JB): | 0.0 |
| Kurtosis: | 4.32 | Cond. No. | 193.0 |


* Similar to the previous model, all of the coefficients are statistically significant as their p-values are less than 0.05.
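* Interestingly, the signs of both the humidity and windspeed coefficients change from positive to negative when we include the interaction term.
* The model is: *target* = -0.0839 - 0.1775*humidity* - 0.0905*windspeed* + 0.2971*hum_wind_interaction*
> Because of the interaction term, the humidity and windspeed coefficients are no longer standalone effects: -0.1775 is the effect of a one-unit increase in humidity only when windspeed is zero, and -0.0905 is the effect of a one-unit increase in windspeed only when humidity is zero. The coefficient of the interaction term is 0.2971. We can interpret it as follows. At a given windspeed, a one-unit increase in humidity changes the target by (-0.1775 + 0.2971 x windspeed), and at a given humidity, a one-unit increase in windspeed changes the target by (-0.0905 + 0.2971 x humidity). Since the interaction coefficient is positive, a higher windspeed strengthens the effect of humidity on the target rather than mitigating it, and vice versa; the humidity effect turns positive once windspeed exceeds about 0.6.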
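
The conditional effects implied by the interaction model can be sketched as follows. The constant and function names are hypothetical; the coefficient values are copied from the summary table above.

```python
B_HUMIDITY = -0.1775
B_WINDSPEED = -0.0905
B_INTERACTION = 0.2971


def humidity_effect(windspeed):
    """Change in target per one-unit increase in humidity at a given windspeed."""
    return B_HUMIDITY + B_INTERACTION * windspeed


def windspeed_effect(humidity):
    """Change in target per one-unit increase in windspeed at a given humidity."""
    return B_WINDSPEED + B_INTERACTION * humidity


# Levels at which each conditional effect changes sign:
breakeven_windspeed = -B_HUMIDITY / B_INTERACTION
breakeven_humidity = -B_WINDSPEED / B_INTERACTION
print(round(breakeven_windspeed, 3), round(breakeven_humidity, 3))

# Conditional effects at the values of the first row of df.head():
print(humidity_effect(14.1197), windspeed_effect(0.89))
```

Above the break-even levels both conditional effects are positive, which is consistent with the positive coefficients of the model without the interaction term.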

### 3 - House prices model:
------------------------------
#### Load the dataset

In [7]:
# Query the database to extract dataset:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

houseprices_df = pd.read_sql_query('SELECT * FROM houseprices', con=engine)

# Dispose the connection, as we're only doing a single query:
engine.dispose()

# Print out the head of the dataset:
houseprices_df.head()

Unnamed: 0,id,mssubclass,mszoning,lotfrontage,lotarea,street,alley,lotshape,landcontour,utilities,lotconfig,landslope,neighborhood,condition1,condition2,bldgtype,housestyle,overallqual,overallcond,yearbuilt,yearremodadd,roofstyle,roofmatl,exterior1st,exterior2nd,masvnrtype,masvnrarea,exterqual,extercond,foundation,bsmtqual,bsmtcond,bsmtexposure,bsmtfintype1,bsmtfinsf1,bsmtfintype2,bsmtfinsf2,bsmtunfsf,totalbsmtsf,heating,heatingqc,centralair,electrical,firstflrsf,secondflrsf,lowqualfinsf,grlivarea,bsmtfullbath,bsmthalfbath,fullbath,halfbath,bedroomabvgr,kitchenabvgr,kitchenqual,totrmsabvgrd,functional,fireplaces,fireplacequ,garagetype,garageyrblt,garagefinish,garagecars,garagearea,garagequal,garagecond,paveddrive,wooddecksf,openporchsf,enclosedporch,threessnporch,screenporch,poolarea,poolqc,fence,miscfeature,miscval,mosold,yrsold,saletype,salecondition,saleprice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [8]:
# Convert categorical variables to dummy variables
# (dtype=int keeps the dummies as 0/1 integers; newer pandas versions default to bool):
houseprices_df['street'] = pd.get_dummies(houseprices_df['street'], drop_first=True, dtype=int)
houseprices_df['centralair'] = pd.get_dummies(houseprices_df['centralair'], drop_first=True, dtype=int)
houseprices_df.head()

Unnamed: 0,id,mssubclass,mszoning,lotfrontage,lotarea,street,alley,lotshape,landcontour,utilities,lotconfig,landslope,neighborhood,condition1,condition2,bldgtype,housestyle,overallqual,overallcond,yearbuilt,yearremodadd,roofstyle,roofmatl,exterior1st,exterior2nd,masvnrtype,masvnrarea,exterqual,extercond,foundation,bsmtqual,bsmtcond,bsmtexposure,bsmtfintype1,bsmtfinsf1,bsmtfintype2,bsmtfinsf2,bsmtunfsf,totalbsmtsf,heating,heatingqc,centralair,electrical,firstflrsf,secondflrsf,lowqualfinsf,grlivarea,bsmtfullbath,bsmthalfbath,fullbath,halfbath,bedroomabvgr,kitchenabvgr,kitchenqual,totrmsabvgrd,functional,fireplaces,fireplacequ,garagetype,garageyrblt,garagefinish,garagecars,garagearea,garagequal,garagecond,paveddrive,wooddecksf,openporchsf,enclosedporch,threessnporch,screenporch,poolarea,poolqc,fence,miscfeature,miscval,mosold,yrsold,saletype,salecondition,saleprice
0,1,60,RL,65.0,8450,1,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,1,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,1,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,1,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,1,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,1,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,1,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,1,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,1,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,1,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


#### Build linear regression model and discuss results

In [9]:
# Define the target variable and the explanatory variables:
Y2 = houseprices_df['saleprice']
X2 = houseprices_df[['overallqual', 'grlivarea', 'garagecars', 'fullbath', 'street', 'centralair']]

# Manually add a constant in statsmodels' sm
X2 = sm.add_constant(X2)

# Fit the variables to the regression model
result = sm.OLS(Y2, X2).fit()
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.741 |
| Model: | OLS | Adj. R-squared: | 0.74 |
| Method: | Least Squares | F-statistic: | 693.9 |
| Date: | Tue, 07 Jan 2020 | Prob (F-statistic): | 0.0 |
| Time: | 17:56:06 | Log-Likelihood: | -17557.0 |
| No. Observations: | 1460 | AIC: | 35130.0 |
| Df Residuals: | 1453 | BIC: | 35170.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.085e+05 | 1.71e+04 | -6.359 | 0.000 | -1.42e+05 | -7.5e+04 |
| overallqual | 2.642e+04 | 1113.049 | 23.740 | 0.000 | 2.42e+04 | 2.86e+04 |
| grlivarea | 52.0425 | 2.829 | 18.394 | 0.000 | 46.492 | 57.593 |
| garagecars | 2.07e+04 | 1839.002 | 11.255 | 0.000 | 1.71e+04 | 2.43e+04 |
| fullbath | -907.6381 | 2614.198 | -0.347 | 0.728 | -6035.644 | 4220.368 |
| street | -564.8670 | 1.67e+04 | -0.034 | 0.973 | -3.33e+04 | 3.22e+04 |
| centralair | 1.583e+04 | 4520.327 | 3.501 | 0.000 | 6960.230 | 2.47e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 436.868 | Durbin-Watson: | 1.997 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 10276.944 |
| Skew: | 0.834 | Prob(JB): | 0.0 |
| Kurtosis: | 15.89 | Cond. No. | 35600.0 |


The statistically significant variables are overallqual, grlivarea, garagecars, and centralair. The remaining variables, fullbath and street, are statistically insignificant (p-values of 0.728 and 0.973), so we can drop them, as they add little to the model. According to the estimation results, holding the other variables constant:
* A 1-point increase in overallqual is associated with a \$26,420 increase in sale price.
* A 1-unit increase in grlivarea is associated with a \$52 increase in sale price.
* A 1-unit increase in garagecars is associated with a \$20,700 increase in sale price.
* A house with central air sells for \$15,830 more than a comparable house without central air.
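
As a sanity check on these magnitudes, the sketch below evaluates the fitted equation for the first house shown by `houseprices_df.head()`. The dictionary and function names are hypothetical; the coefficients are the rounded values printed in the summary (scientific notation such as `2.642e+04` means 26,420), so the result is only approximate.

```python
# Rounded coefficients copied from the summary table above
coefs = {
    'const': -1.085e+05,
    'overallqual': 2.642e+04,
    'grlivarea': 52.0425,
    'garagecars': 2.07e+04,
    'fullbath': -907.6381,
    'street': -564.8670,
    'centralair': 1.583e+04,
}


def predict_saleprice(features):
    """Intercept plus the sum of coefficient * feature value."""
    return coefs['const'] + sum(coefs[name] * value
                                for name, value in features.items())


# First row of houseprices_df.head(); the observed saleprice is 208500
first_house = {'overallqual': 7, 'grlivarea': 1710, 'garagecars': 2,
               'fullbath': 2, 'street': 1, 'centralair': 1}
predicted_price = predict_saleprice(first_house)
print(round(predicted_price, 2))

# Raising overallqual by one point shifts the prediction by its coefficient
one_point_better = dict(first_house, overallqual=8)
print(predict_saleprice(one_point_better) - predicted_price)
```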

#### Now, exclude the insignificant features from your model. Did anything change?

In [10]:
# Define the target variable and the explanatory variables:
Y2 = houseprices_df['saleprice']
X2 = houseprices_df[['overallqual', 'grlivarea', 'garagecars', 'centralair']]

# Manually add a constant in statsmodels' sm
X2 = sm.add_constant(X2)

# Fit the variables to the regression model
result = sm.OLS(Y2, X2).fit()
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.741 |
| Model: | OLS | Adj. R-squared: | 0.741 |
| Method: | Least Squares | F-statistic: | 1042.0 |
| Date: | Tue, 07 Jan 2020 | Prob (F-statistic): | 0.0 |
| Time: | 18:01:14 | Log-Likelihood: | -17557.0 |
| No. Observations: | 1460 | AIC: | 35120.0 |
| Df Residuals: | 1455 | BIC: | 35150.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.092e+05 | 5658.917 | -19.305 | 0.000 | -1.2e+05 | -9.81e+04 |
| overallqual | 2.635e+04 | 1089.520 | 24.182 | 0.000 | 2.42e+04 | 2.85e+04 |
| grlivarea | 51.6199 | 2.556 | 20.196 | 0.000 | 46.606 | 56.634 |
| garagecars | 2.061e+04 | 1810.725 | 11.381 | 0.000 | 1.71e+04 | 2.42e+04 |
| centralair | 1.586e+04 | 4505.540 | 3.521 | 0.000 | 7024.428 | 2.47e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 439.032 | Durbin-Watson: | 1.996 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 10183.651 |
| Skew: | 0.846 | Prob(JB): | 0.0 |
| Kurtosis: | 15.827 | Cond. No. | 9690.0 |


The results resemble the previous model's results. The R^2 and adjusted R^2 are almost the same. The estimated coefficients are close to the previous model except for the 2 variables that we removed. 

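Judging by coefficient size, the most prominent factors affecting the sale price seem to be overallqual and garagecars. This ranking should be read with care, however, because the variables are measured in different units: grlivarea has a small per-unit coefficient but the second-largest t-statistic after overallqual.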
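
A common way to compare features measured on different scales is to standardize the variables before fitting. The sketch below illustrates this with synthetic data and plain NumPy least squares; the variables `quality`, `area`, `price` and the helper `ols_coefs` are hypothetical stand-ins, not the actual house prices columns or results.

```python
import numpy as np

# Synthetic data, assumed purely for illustration:
rng = np.random.default_rng(1)
n = 2000
quality = rng.integers(1, 11, size=n).astype(float)  # small scale (1 to 10)
area = rng.normal(1500, 500, size=n)                 # large scale
price = 20000 * quality + 150 * area + rng.normal(0, 10000, size=n)


def ols_coefs(X, y):
    """Least-squares slopes (intercept fitted but not returned)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta[1:]


X = np.column_stack([quality, area])
raw = ols_coefs(X, price)

# Standardize features and target to zero mean and unit variance
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (price - price.mean()) / price.std()
standardized = ols_coefs(Xz, yz)

print(raw)           # per-unit effects, not comparable across features
print(standardized)  # effects per standard deviation, comparable
```

In this synthetic example the raw coefficient of `quality` is far larger than that of `area`, yet `area` has the larger standardized coefficient, which shows why raw coefficient size alone is a weak guide to importance.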