# Interpreting Estimated Coefficients

# Exercise 1

Baseline expenditure, for a family with zero annual income and no kids, is 873 dollars on average. This is the intercept (bias term). Income enters the model twice: through a linear term with coefficient 0.0012 and through a squared term with coefficient 0.00002. The effect of one additional dollar of income therefore depends on the income level: it raises annual recreation spending by roughly 0.0012 + 2 × 0.00002 × income dollars. Spending on recreation thus grows quadratically with income, and each extra dollar adds more for higher-income families. Finally, holding income fixed, a family with kids will spend, on average, 223.57 dollars less on recreation annually than a family without kids. I will display the annual recreation expenditures for families with and without kids in a graph below.


In [1]:
import numpy as np
import matplotlib.pyplot as plt

X = np.arange(10000, 45000, 1000)
Y = 873 + 0.0012*X + 0.00002*(X**2)
Y_kids = 873 + (0.0012*X) + (0.00002*(X**2)) - 223.57

plt.plot(X, Y, label ='No kids')
plt.plot(X, Y_kids, label= 'Kids')
plt.xlabel('Income')
plt.ylabel('Expenditure')
plt.legend()
plt.title('Annual expenditure on recreation')
plt.show()

# Predicted expenditure for a family with kids and an annual income of 47,001 dollars
print(873 + (0.0012*47001) + (0.00002*(47001**2)) - 223.57)

<Figure size 640x480 with 1 Axes>

44887.711220000005
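To make the quadratic income effect concrete, here is a minimal sketch that evaluates the derivative of the fitted equation with respect to income. The coefficients are the ones used in the plot code above; the function names `expenditure` and `marginal_effect` are illustrative and not part of the original exercise.

```python
# Coefficients taken from the Exercise 1 equation used in the plot above.
INTERCEPT = 873
B_INCOME = 0.0012
B_INCOME_SQ = 0.00002
B_KIDS = -223.57


def expenditure(income, has_kids=False):
    """Predicted annual recreation expenditure in dollars."""
    return (INTERCEPT + B_INCOME * income
            + B_INCOME_SQ * income ** 2 + B_KIDS * has_kids)


def marginal_effect(income):
    """Derivative of expenditure with respect to income:
    the extra spending per additional dollar of income."""
    return B_INCOME + 2 * B_INCOME_SQ * income


# The marginal effect grows linearly with income, so the curve is quadratic.
for income in (10_000, 25_000, 40_000):
    print(income, round(marginal_effect(income), 4))
```

Because the kids term is a constant shift, the marginal effect of income is the same for families with and without kids; only the level of the curve differs.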


# Exercise 2

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from sqlalchemy import create_engine

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings(action="ignore")

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

weather_df = pd.read_sql_query('select * from weatherinszeged',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()



First, load the dataset from the weatherinszeged table from Thinkful's database.

Build a linear regression model where your target variable is the difference between the apparenttemperature and the temperature. As explanatory variables, use humidity and windspeed. Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables?

In [3]:
weather_df.head()

Unnamed: 0,date,summary,preciptype,temperature,apparenttemperature,humidity,windspeed,windbearing,visibility,loudcover,pressure,dailysummary
0,2006-03-31 22:00:00+00:00,Partly Cloudy,rain,9.472,7.389,0.89,14.12,251.0,15.826,0.0,1015.13,Partly cloudy throughout the day.
1,2006-03-31 23:00:00+00:00,Partly Cloudy,rain,9.356,7.228,0.86,14.265,259.0,15.826,0.0,1015.63,Partly cloudy throughout the day.
2,2006-04-01 00:00:00+00:00,Mostly Cloudy,rain,9.378,9.378,0.89,3.928,204.0,14.957,0.0,1015.94,Partly cloudy throughout the day.
3,2006-04-01 01:00:00+00:00,Partly Cloudy,rain,8.289,5.944,0.83,14.104,269.0,15.826,0.0,1016.41,Partly cloudy throughout the day.
4,2006-04-01 02:00:00+00:00,Mostly Cloudy,rain,8.756,6.978,0.83,11.045,259.0,15.826,0.0,1016.51,Partly cloudy throughout the day.


In [4]:
Y = weather_df['temp_differential'] = weather_df['apparenttemperature'] - weather_df['temperature']

X = weather_df[['humidity', 'windspeed']]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.288
Model:,OLS,Adj. R-squared:,0.288
Method:,Least Squares,F-statistic:,19490.0
Date:,"Sat, 16 May 2020",Prob (F-statistic):,0.0
Time:,22:27:50,Log-Likelihood:,-170460.0
No. Observations:,96453,AIC:,340900.0
Df Residuals:,96450,BIC:,340900.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.4381,0.021,115.948,0.000,2.397,2.479
humidity,-3.0292,0.024,-126.479,0.000,-3.076,-2.982
windspeed,-0.1193,0.001,-176.164,0.000,-0.121,-0.118

0,1,2,3
Omnibus:,3935.747,Durbin-Watson:,0.267
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4613.311
Skew:,-0.478,Prob(JB):,0.0
Kurtosis:,3.484,Cond. No.,88.1


* The coefficients are both statistically significant, as their p-values are < 0.05.
* Interpreting, a 1-point increase in humidity is associated with a 3.0292 decrease in our temperature differential, and a 1-point increase in windspeed is associated with a 0.1193 decrease in our temperature differential, in each case holding the other variable constant.
* Both signs are negative, so the higher the humidity and windspeed, the further the apparent temperature falls below the actual temperature. This makes sense because these two weather features can affect how warm or cold the air feels.
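As a quick check on the interpretation of the first model, the sketch below plugs the first row of `weather_df.head()` into the fitted equation. The coefficients are copied from the summary table; the helper name `predicted_differential` is illustrative, and the input values are the rounded ones displayed in the table, so the result is only approximate.

```python
# Coefficients copied from the OLS summary of the model without interaction.
CONST = 2.4381
B_HUMIDITY = -3.0292
B_WINDSPEED = -0.1193


def predicted_differential(humidity, windspeed):
    """Predicted (apparent temperature - temperature) from the fitted equation."""
    return CONST + B_HUMIDITY * humidity + B_WINDSPEED * windspeed


# First row of weather_df.head(): humidity 0.89, windspeed 14.12,
# temperature 9.472, apparent temperature 7.389 (values as displayed).
pred = predicted_differential(0.89, 14.12)
observed = 7.389 - 9.472
print(f"predicted {pred:.3f}, observed {observed:.3f}")

# Raising windspeed by one unit lowers the prediction by |B_WINDSPEED|.
print(predicted_differential(0.89, 15.12) - pred)
```

Both the predicted and the observed differential are negative for this row, which is consistent with the negative signs on humidity and windspeed.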

Next, include the interaction of humidity and windspeed to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for humidity and windspeed change? Interpret the estimated coefficients.

In [5]:
weather_df['humidity_windspeed_interaction'] = weather_df.humidity * weather_df.windspeed

Y = weather_df['temp_differential']

X = weather_df[['humidity', 'windspeed', 'humidity_windspeed_interaction']]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

0,1,2,3
Dep. Variable:,temp_differential,R-squared:,0.341
Model:,OLS,Adj. R-squared:,0.341
Method:,Least Squares,F-statistic:,16660.0
Date:,"Sat, 16 May 2020",Prob (F-statistic):,0.0
Time:,22:27:51,Log-Likelihood:,-166690.0
No. Observations:,96453,AIC:,333400.0
Df Residuals:,96449,BIC:,333400.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0839,0.033,2.511,0.012,0.018,0.149
humidity,0.1775,0.043,4.133,0.000,0.093,0.262
windspeed,0.0905,0.002,36.797,0.000,0.086,0.095
humidity_windspeed_interaction,-0.2971,0.003,-88.470,0.000,-0.304,-0.291

0,1,2,3
Omnibus:,4849.937,Durbin-Watson:,0.265
Prob(Omnibus):,0.0,Jarque-Bera (JB):,9295.404
Skew:,-0.378,Prob(JB):,0.0
Kurtosis:,4.32,Cond. No.,193.0


temp_differential = 0.08 + 0.18(humidity) + 0.09(windspeed) - 0.30(humidity × windspeed)

* The coefficients are all statistically significant, as the p-value < 0.05.
* The signs of the humidity and windspeed coefficients changed from negative to positive.
* Interpreting, the humidity coefficient now describes the effect of humidity when windspeed is zero: in that case, a 1-point increase in humidity is associated with a 0.1775 increase in our target variable. Likewise, when humidity is zero, a 1-point increase in windspeed is associated with a 0.0905 increase in our target variable. The interaction coefficient of -0.2971 means that each additional point of windspeed lowers the effect of humidity on the target by 0.2971, and each additional point of humidity lowers the effect of windspeed by the same amount.
* This is interesting: when the other variable is zero, humidity and windspeed are each associated with a slightly higher apparent temperature relative to the actual temperature, yet when both increase together the negative interaction term dominates and the apparent temperature falls below the actual one.
* **From answers: According to the model, the coefficient of the interaction term is -0.30. We can interpret it as follows. For a given windspeed level, a 1-point increase in humidity results in a (0.18 - 0.30 × windspeed) point increase in the target. This means that the increase in the target is lower for high values of windspeed than for low values of windspeed. So, windspeed mitigates the effect of a humidity increase on the target. Similarly, for a given humidity level, a 1-point increase in windspeed results in a (0.09 - 0.30 × humidity) point increase in the target. So, humidity also mitigates the effect of windspeed on the target.**
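The conditional slopes described above can be sketched in a few lines. The coefficients come from the interaction model's summary table; the function names `humidity_slope` and `windspeed_slope` are illustrative and not part of the original notebook.

```python
# Coefficients copied from the OLS summary of the interaction model.
B_HUMIDITY = 0.1775
B_WINDSPEED = 0.0905
B_INTERACTION = -0.2971


def humidity_slope(windspeed):
    """Change in the target per 1-point increase in humidity at a given windspeed."""
    return B_HUMIDITY + B_INTERACTION * windspeed


def windspeed_slope(humidity):
    """Change in the target per 1-point increase in windspeed at a given humidity."""
    return B_WINDSPEED + B_INTERACTION * humidity


for w in (0, 5, 10, 15):
    print(f"windspeed={w:>2}: humidity slope = {humidity_slope(w):.4f}")

# Windspeed at which the humidity slope changes sign.
break_even = -B_HUMIDITY / B_INTERACTION
print(f"humidity slope is zero at windspeed = {break_even:.3f}")
```

Under this fitted equation the humidity slope is positive only at very low windspeed; at the windspeeds shown in the sample rows it is negative, which fits the negative humidity coefficient in the model without the interaction term.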

# Exercise 3

Load the houseprices data from Thinkful's database.
Run your house prices model again and interpret the results. Which features are statistically significant, and which are not?
Now, exclude the insignificant features from your model. Did anything change?
Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have a more prominent effect on house prices?
Do the results sound reasonable to you? If not, try to explain the potential reasons.

In [6]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

house_prices_df = pd.read_sql_query('select * from houseprices',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()

In [7]:
house_prices_df = pd.concat([house_prices_df,pd.get_dummies(house_prices_df.mszoning, prefix="mszoning", drop_first=True)], axis=1)
house_prices_df = pd.concat([house_prices_df,pd.get_dummies(house_prices_df.street, prefix="street", drop_first=True)], axis=1)
dummy_column_names = list(pd.get_dummies(house_prices_df.mszoning, prefix="mszoning", drop_first=True).columns)
dummy_column_names = dummy_column_names + list(pd.get_dummies(house_prices_df.street, prefix="street", drop_first=True).columns)

In [8]:
# Y is the target variable
Y = house_prices_df['saleprice']
# X is the feature set
X = house_prices_df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalbsmtsf'] + dummy_column_names]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

0,1,2,3
Dep. Variable:,saleprice,R-squared:,0.769
Model:,OLS,Adj. R-squared:,0.767
Method:,Least Squares,F-statistic:,482.0
Date:,"Sat, 16 May 2020",Prob (F-statistic):,0.0
Time:,22:27:53,Log-Likelihood:,-17475.0
No. Observations:,1460,AIC:,34970.0
Df Residuals:,1449,BIC:,35030.0
Df Model:,10,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.173e+05,1.8e+04,-6.502,0.000,-1.53e+05,-8.19e+04
overallqual,2.333e+04,1088.506,21.430,0.000,2.12e+04,2.55e+04
grlivarea,45.6344,2.468,18.494,0.000,40.794,50.475
garagecars,1.345e+04,2990.453,4.498,0.000,7584.056,1.93e+04
garagearea,16.4082,10.402,1.577,0.115,-3.997,36.813
totalbsmtsf,28.3816,2.931,9.684,0.000,22.633,34.131
mszoning_FV,2.509e+04,1.37e+04,1.833,0.067,-1761.679,5.19e+04
mszoning_RH,1.342e+04,1.58e+04,0.847,0.397,-1.77e+04,4.45e+04
mszoning_RL,2.857e+04,1.27e+04,2.246,0.025,3612.782,5.35e+04

0,1,2,3
Omnibus:,415.883,Durbin-Watson:,1.979
Prob(Omnibus):,0.0,Jarque-Bera (JB):,41281.526
Skew:,-0.115,Prob(JB):,0.0
Kurtosis:,29.049,Cond. No.,55300.0


* garagearea, mszoning_FV, mszoning_RH, mszoning_RM and street_Pave are all statistically insignificant because their p-value > 0.05. The rest of the coefficients are statistically significant.
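Picking out the insignificant features can also be done programmatically rather than by reading the table. The sketch below uses only the p-values that appear in the summary output above (rows not shown there are left out); the helper `insignificant_features` and the 0.05 threshold argument are illustrative.

```python
# p-values copied from the rows displayed in the summary table above.
pvalues = {
    "const": 0.000,
    "overallqual": 0.000,
    "grlivarea": 0.000,
    "garagecars": 0.000,
    "garagearea": 0.115,
    "totalbsmtsf": 0.000,
    "mszoning_FV": 0.067,
    "mszoning_RH": 0.397,
    "mszoning_RL": 0.025,
}


def insignificant_features(pvalues, alpha=0.05, skip=("const",)):
    """Return the feature names whose p-value exceeds alpha, ignoring the constant."""
    return [name for name, p in pvalues.items()
            if name not in skip and p > alpha]


to_drop = insignificant_features(pvalues)
print(to_drop)
```

With a fitted statsmodels model, the same helper could presumably be fed `results.pvalues.to_dict()`, and the returned list passed to `X.drop(columns=...)`.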

In [9]:
# Y is the target variable
Y = house_prices_df['saleprice']
# X is the feature set
X = house_prices_df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalbsmtsf'] + dummy_column_names]

X = X.drop(columns=['garagearea', 'mszoning_FV', 'mszoning_RH', 'mszoning_RM', 'street_Pave'])

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

0,1,2,3
Dep. Variable:,saleprice,R-squared:,0.767
Model:,OLS,Adj. R-squared:,0.766
Method:,Least Squares,F-statistic:,956.8
Date:,"Sat, 16 May 2020",Prob (F-statistic):,0.0
Time:,22:27:53,Log-Likelihood:,-17481.0
No. Observations:,1460,AIC:,34970.0
Df Residuals:,1454,BIC:,35010.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.083e+05,4804.236,-22.540,0.000,-1.18e+05,-9.89e+04
overallqual,2.396e+04,1060.549,22.588,0.000,2.19e+04,2.6e+04
grlivarea,45.4093,2.452,18.517,0.000,40.599,50.220
garagecars,1.763e+04,1731.766,10.183,0.000,1.42e+04,2.1e+04
totalbsmtsf,28.8729,2.862,10.088,0.000,23.259,34.487
mszoning_RL,1.596e+04,2558.589,6.238,0.000,1.09e+04,2.1e+04

0,1,2,3
Omnibus:,402.656,Durbin-Watson:,1.979
Prob(Omnibus):,0.0,Jarque-Bera (JB):,35429.68
Skew:,-0.08,Prob(JB):,0.0
Kurtosis:,27.133,Cond. No.,9530.0


Did anything change? Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have a more prominent effect on house prices? Do the results sound reasonable to you? If not, try to explain the potential reasons.

* The coefficient values have changed, but all of them are still statistically significant. Most coefficients are close to their previous values. The exceptions are garagecars, which rose from about 13,450 to 17,630, and our remaining dummy variable mszoning_RL, which fell from about 28,570 to 15,960.
* A 1-point increase in the overall quality score of the house is associated with a 23,960 dollar increase in the house's sale price.
* A 1-square-foot increase in grlivarea (above-ground living area) is associated with a 45.41 dollar increase in the house's sale price.
* For every additional car a home's garage can accommodate, there is an associated 17,630 dollar increase in the value of the house in question.
* A 1-square-foot increase in totalbsmtsf (total basement area) is associated with a 28.87 dollar increase in the house's sale price.
* Because mszoning_RL is a dummy variable, its coefficient compares groups: a house zoned RL is associated with a sale price 15,960 dollars higher than a house in any other zone, all else equal.

Considering the results, the features with the largest per-unit coefficients are overallqual, garagecars and mszoning_RL. Each coefficient is measured per unit of its own feature, though, so the raw values are not directly comparable: grlivarea's coefficient looks small only because it is per square foot, and its t-statistic (18.5) is second only to that of overallqual (22.6).
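To see how the coefficients combine, and why their raw sizes are not directly comparable, here is a small sketch that evaluates the reduced model's equation. The coefficients are the rounded values from the last summary table; the example house and the sizes of the illustrative changes are hypothetical and chosen only for demonstration.

```python
# Rounded coefficients copied from the reduced model's summary table.
COEFS = {
    "const": -1.083e5,
    "overallqual": 2.396e4,
    "grlivarea": 45.4093,
    "garagecars": 1.763e4,
    "totalbsmtsf": 28.8729,
    "mszoning_RL": 1.596e4,
}


def predict_price(house):
    """Predicted sale price in dollars for a dict of feature values."""
    return COEFS["const"] + sum(COEFS[name] * value for name, value in house.items())


# Hypothetical house, not taken from the dataset.
example_house = {
    "overallqual": 7,
    "grlivarea": 1500,
    "garagecars": 2,
    "totalbsmtsf": 1000,
    "mszoning_RL": 1,
}
print(f"predicted price: {predict_price(example_house):,.2f}")

# Price change implied by an illustrative change in each feature.
changes = {"overallqual": 1, "grlivarea": 500, "garagecars": 1, "totalbsmtsf": 500}
for name, delta in changes.items():
    print(f"{name} +{delta}: {COEFS[name] * delta:,.2f} dollars")
```

Under these assumed changes, 500 extra square feet of living area moves the predicted price by almost as much as one extra quality point, even though the grlivarea coefficient is far smaller per unit.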