To close out this lesson, you're going to do three exercises. For the first exercise, you'll write up a short answer to a question. For the second and third exercises, you'll do your work in Jupyter Notebooks.

Please submit links to all your work at the end of this assessment.

1. Interpretation and significance

Suppose that you would like to know how much families in the US are spending on recreation annually. Use the following estimated model:

```
expenditure = 873 + 0.0012 * annual_income + 0.00002 * annual_income^2 - 223.57 * have_kids
```

Here, `expenditure` is the annual spending on recreation in US dollars, `annual_income` is the annual income in US dollars, and `have_kids` is a dummy variable indicating the families with children. Interpret the estimated coefficients. What additional statistics can be given in order to ensure that your interpretations make sense statistically? Write up your answer.

2. Weather model

In this exercise, you'll work with the historical temperature data from the previous lesson. To complete this exercise, submit a link to a Jupyter Notebook containing your solutions to the following tasks:
* First, load the dataset from the `weatherinszeged` table from Thinkful's database.
* Build a linear regression model where your target variable is the difference between the `apparenttemperature` and the `temperature`. As explanatory variables, use `humidity` and `windspeed`. Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables?
* Next, add the interaction of `humidity` and `windspeed` to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for `humidity` and `windspeed` change? Interpret the estimated coefficients.

3. House prices model

In this exercise, you'll interpret your house prices model. To complete this exercise, submit a link to a Jupyter Notebook containing your solutions to the following tasks:
* Load the _houseprices_ data from Thinkful's database.
* Run your house prices model again and interpret the results. Which features are statistically significant, and which are not?
* Now, exclude the insignificant features from your model. Did anything change?
* Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have a more prominent effect on house prices?
* Do the results sound reasonable to you? If not, try to explain the potential reasons.

# 1. Interpretation and significance
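* The constant, 873, is the predicted annual recreation spending in dollars for a family with zero income and no children.
* `annual_income` is a continuous variable. Because the model also contains its square, the coefficient 0.0012 cannot be read in isolation: a one-dollar increase in income is associated with a change in expenditure of 0.0012 + 2 × 0.00002 × `annual_income` dollars, holding `have_kids` constant. Without a p-value (or the standard error and t-statistic behind it), its statistical significance cannot be judged.
* `annual_income^2` is the quadratic term, that is, income interacted with itself. Its positive coefficient, 0.00002, means that the effect of income on expenditure grows as income rises. Its p-value is also needed to judge statistical significance.
* `have_kids` is a dummy variable. Families with children are predicted to spend \$223.57 less on recreation per year than families without children at the same income. This is a difference between the two groups, not an effect per child. Its p-value is needed as well.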
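
Because the model includes a squared income term, the effect of income on spending is not a single number. The following is a minimal sketch of the estimated equation from the exercise; the function names and the income level of 10,000 are chosen here purely for illustration. It evaluates the prediction and the marginal effect of income, `0.0012 + 2 * 0.00002 * annual_income`:

```python
# Coefficients taken from the estimated model in the exercise.
INTERCEPT = 873.0
B_INCOME = 0.0012
B_INCOME_SQ = 0.00002
B_HAVE_KIDS = -223.57


def predicted_expenditure(annual_income, have_kids):
    """Predicted annual recreation spending in US dollars."""
    return (INTERCEPT
            + B_INCOME * annual_income
            + B_INCOME_SQ * annual_income ** 2
            + B_HAVE_KIDS * have_kids)


def marginal_effect_of_income(annual_income):
    """Change in predicted spending per extra dollar of income."""
    return B_INCOME + 2 * B_INCOME_SQ * annual_income


income = 10_000  # illustrative income level, not part of the exercise
print(predicted_expenditure(income, have_kids=0))
print(predicted_expenditure(income, have_kids=1))
print(marginal_effect_of_income(income))
```

The gap between the two predictions is always the `have_kids` coefficient, while the marginal effect of income changes with the income level at which it is evaluated.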

# 2. Weather model

In [1]:
# For convenience, import all libraries here.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from sqlalchemy import create_engine

# Display preferences
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings(action="ignore")

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

In [2]:
# First, load the dataset from the weatherinszeged table from Thinkful's database.
engine_w = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

weather_df = pd.read_sql_query('select * from weatherinszeged',con=engine_w)

# No need for an open connection, because you're only doing a single query
engine_w.dispose()

weather_df.head()

Unnamed: 0,date,summary,preciptype,temperature,apparenttemperature,humidity,windspeed,windbearing,visibility,loudcover,pressure,dailysummary
0,2006-03-31 22:00:00+00:00,Partly Cloudy,rain,9.472,7.389,0.89,14.12,251.0,15.826,0.0,1015.13,Partly cloudy throughout the day.
1,2006-03-31 23:00:00+00:00,Partly Cloudy,rain,9.356,7.228,0.86,14.265,259.0,15.826,0.0,1015.63,Partly cloudy throughout the day.
2,2006-04-01 00:00:00+00:00,Mostly Cloudy,rain,9.378,9.378,0.89,3.928,204.0,14.957,0.0,1015.94,Partly cloudy throughout the day.
3,2006-04-01 01:00:00+00:00,Partly Cloudy,rain,8.289,5.944,0.83,14.104,269.0,15.826,0.0,1016.41,Partly cloudy throughout the day.
4,2006-04-01 02:00:00+00:00,Mostly Cloudy,rain,8.756,6.978,0.83,11.045,259.0,15.826,0.0,1016.51,Partly cloudy throughout the day.


In [5]:
# Build a linear regression model where your target variable is the difference 
# between the `apparenttemperature` and the `temperature`. 
Y_w = weather_df['temperature'] - weather_df['apparenttemperature']

# As explanatory variables, use `humidity` and `windspeed`. 
X_w = weather_df[['humidity', 'windspeed']]
X_w = sm.add_constant(X_w)

# Now, estimate your model using OLS. 
results_w = sm.OLS(Y_w, X_w).fit()
print(results_w.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.288
Model:                            OLS   Adj. R-squared:                  0.288
Method:                 Least Squares   F-statistic:                 1.949e+04
Date:                Fri, 03 Dec 2021   Prob (F-statistic):               0.00
Time:                        17:33:19   Log-Likelihood:            -1.7046e+05
No. Observations:               96453   AIC:                         3.409e+05
Df Residuals:                   96450   BIC:                         3.409e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.4381      0.021   -115.948      0.0

### Are the estimated coefficients statistically significant?
Based on the p-values reported in the summary, both estimated coefficients appear to be statistically significant.

### Are the signs of the estimated coefficients in line with your previous expectations? 
Yes, they are in line with my previous expectations. 

### Interpret the estimated coefficients. What are the relations between the target and the explanatory variables?
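Holding windspeed constant, a one-unit increase in humidity is associated with an increase of about 3.03 degrees in the difference between the temperature and the apparent temperature. Because humidity is recorded as a fraction (for example, 0.89), a more realistic change of 0.01 corresponds to about 0.03 degrees. Likewise, holding humidity constant, a one-unit increase in windspeed is associated with an increase of about 0.12 degrees in that difference.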
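
The exercise also asks for a model that includes the interaction of `humidity` and `windspeed`. With an interaction term, the effect of one variable depends on the level of the other. The sketch below uses placeholder coefficients (hypothetical values, not estimates from this notebook) and illustrative function names to show how the marginal effects would be read off such a model:

```python
# Placeholder coefficients for illustration only. They are NOT results
# from the weather data; replace them with the fitted values.
B_HUMIDITY = 1.0
B_WINDSPEED = 0.1
B_INTERACTION = 0.2  # coefficient on humidity * windspeed


def humidity_effect(windspeed):
    """Marginal effect of humidity on the target at a given windspeed."""
    return B_HUMIDITY + B_INTERACTION * windspeed


def windspeed_effect(humidity):
    """Marginal effect of windspeed on the target at a given humidity."""
    return B_WINDSPEED + B_INTERACTION * humidity


# The effect of humidity is no longer constant: it shifts with windspeed.
for ws in (0, 10, 20):
    print(ws, humidity_effect(ws))
```

In the notebook itself, the interaction regressor would be the element-wise product of the `humidity` and `windspeed` columns, added to `X_w` before fitting.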

# 3. House prices

In [8]:
# Load the houseprices data from Thinkful's database.
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine_h = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

houseprices_df = pd.read_sql_query('select * from houseprices',con=engine_h)

# No need for an open connection, because you're only doing a single query
engine_h.dispose()

houseprices_df.info()
houseprices_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             1460 non-null   int64  
 1   mssubclass     1460 non-null   int64  
 2   mszoning       1460 non-null   object 
 3   lotfrontage    1201 non-null   float64
 4   lotarea        1460 non-null   int64  
 5   street         1460 non-null   object 
 6   alley          91 non-null     object 
 7   lotshape       1460 non-null   object 
 8   landcontour    1460 non-null   object 
 9   utilities      1460 non-null   object 
 10  lotconfig      1460 non-null   object 
 11  landslope      1460 non-null   object 
 12  neighborhood   1460 non-null   object 
 13  condition1     1460 non-null   object 
 14  condition2     1460 non-null   object 
 15  bldgtype       1460 non-null   object 
 16  housestyle     1460 non-null   object 
 17  overallqual    1460 non-null   int64  
 18  overallc

Unnamed: 0,id,mssubclass,mszoning,lotfrontage,lotarea,street,alley,lotshape,landcontour,utilities,...,poolarea,poolqc,fence,miscfeature,miscval,mosold,yrsold,saletype,salecondition,saleprice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [10]:
# Run your house prices model again and interpret the results. 
Y_h = houseprices_df['saleprice']
X_h = houseprices_df[['lotarea', 'overallqual', 'yearbuilt', 'garagecars']]
X_h = sm.add_constant(X_h)
results_h = sm.OLS(Y_h, X_h).fit()
print(results_h.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.695
Model:                            OLS   Adj. R-squared:                  0.694
Method:                 Least Squares   F-statistic:                     829.6
Date:                Fri, 03 Dec 2021   Prob (F-statistic):               0.00
Time:                        17:47:26   Log-Likelihood:                -17677.
No. Observations:                1460   AIC:                         3.536e+04
Df Residuals:                    1455   BIC:                         3.539e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const       -3.604e+05   9.29e+04     -3.881      

### Which features are statistically significant, and which are not?
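The p-values for all variables seem to indicate statistical significance. However, the effect of `lotarea` is small per unit: each additional square foot of lot area is associated with an increase of only about \$1.31 in the sale price, holding the other features constant.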

In [11]:
# Now, exclude the insignificant features from your model. 
Y_1 = houseprices_df['saleprice']
X_1 = houseprices_df[['overallqual', 'yearbuilt', 'garagecars']]
X_1 = sm.add_constant(X_1)
results_1 = sm.OLS(Y_1, X_1).fit()
print(results_1.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.669
Model:                            OLS   Adj. R-squared:                  0.668
Method:                 Least Squares   F-statistic:                     981.1
Date:                Fri, 03 Dec 2021   Prob (F-statistic):               0.00
Time:                        17:50:25   Log-Likelihood:                -17737.
No. Observations:                1460   AIC:                         3.548e+04
Df Residuals:                    1456   BIC:                         3.550e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const       -2.532e+05   9.62e+04     -2.631      

### Did anything change?
After dropping `lotarea`, R-squared fell from 0.695 to 0.669. The significance of `yearbuilt` dropped considerably, but it is still statistically significant.
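
One way to check whether the model changed for the better is to compare the fit statistics printed in the two summaries. The sketch below copies those values by hand (the dictionary keys are labels chosen here for illustration) and picks the model with the lower AIC and the higher adjusted R-squared:

```python
# Fit statistics copied from the two OLS summaries above.
models = {
    "with_lotarea": {"adj_r2": 0.694, "aic": 3.536e4, "bic": 3.539e4},
    "without_lotarea": {"adj_r2": 0.668, "aic": 3.548e4, "bic": 3.550e4},
}

# Lower AIC/BIC and higher adjusted R-squared indicate a better fit.
best_by_aic = min(models, key=lambda name: models[name]["aic"])
best_by_bic = min(models, key=lambda name: models[name]["bic"])
best_by_adj_r2 = max(models, key=lambda name: models[name]["adj_r2"])

print("Best by AIC:", best_by_aic)
print("Best by BIC:", best_by_bic)
print("Best by adjusted R-squared:", best_by_adj_r2)
```

On these numbers, all three criteria favor the model that keeps `lotarea`.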


### Interpret the statistically significant coefficients by quantifying their relations with the house prices. 
* A 1-point increase in the overall quality of the house is associated with an increase of \$35,780 in the sale price, holding the other features constant.
* A house built 1 year later is associated with an increase of \$85.81 in the sale price, holding the other features constant.
* Each additional car that the garage can hold is associated with an increase of \$26,440 in the sale price, holding the other features constant.

### Which features have a more prominent effect on house prices?
Per unit change, the overall quality of the house has the most prominent effect on the sale price, followed by garage capacity.
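
Raw coefficients are measured per unit of each feature, and the units differ (quality points, years, garage spaces). A rough way to compare them is to multiply each coefficient by a plausible change in that feature. In the sketch below, the coefficients are the ones quoted above, while the example changes are assumptions picked for illustration rather than values derived from the data:

```python
# Coefficients quoted in the interpretation above (US dollars per unit).
coefficients = {
    "overallqual": 35780.0,
    "yearbuilt": 85.81,
    "garagecars": 26440.0,
}

# Illustrative changes chosen for this sketch (assumptions, not from the data).
example_changes = {
    "overallqual": 1,   # one quality point
    "yearbuilt": 10,    # ten years newer
    "garagecars": 1,    # one more garage space
}

# Predicted change in sale price for each example change.
effects = {
    name: coefficients[name] * example_changes[name] for name in coefficients
}
ranked = sorted(effects, key=effects.get, reverse=True)

for name in ranked:
    print(f"{name}: {effects[name]:,.2f}")
```

Even with a ten-year change in `yearbuilt`, overall quality and garage capacity dominate under these assumptions.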

### Do the results sound reasonable to you? If not, try to explain the potential reasons.
Yes, the results sound reasonable to me.