#### Import packages and datasets

In [1]:
import pandas as pd
import numpy as np
import sqlite3
import sys, os

path_to_src = os.path.join('..', '..', 'src')
sys.path.insert(1, path_to_src)
from custom_functions import produce_model  # only helper used in this notebook

In [2]:
path_to_db = os.path.join('..', '..', 'data', 'processed', 'main.db')
conn = sqlite3.connect(path_to_db)

query_df = '''SELECT * FROM step3_final_df'''
query_performance = '''SELECT * FROM step3_performance_metrics'''

df_final = pd.read_sql(query_df, conn, index_col='index')
df_final.reset_index(drop=True, inplace=True)

performance_metrics = pd.read_sql(query_performance, conn, index_col='index')
performance_metrics.reset_index(drop=True, inplace=True)

conn.close()

#### Show final model statistics

In [3]:
x = list(df_final.drop('SalePrice_log', axis=1).columns)
model, _ = produce_model(df_final, x, 'SalePrice_log')
print(model.summary())

Modeling: SalePrice_log ~ Heating_ElecBB+Heating_FloorWall+Heating_HeatPump+Heating_HotWater+Heating_Radiant+SqFtTotLiving_log+Basement_Finished+Porch_Open+Porch_Closed+Porch_Both
                            OLS Regression Results                            
Dep. Variable:          SalePrice_log   R-squared:                       0.404
Model:                            OLS   Adj. R-squared:                  0.403
Method:                 Least Squares   F-statistic:                     1217.
Date:                Sun, 14 Mar 2021   Prob (F-statistic):               0.00
Time:                        00:43:38   Log-Likelihood:                -9173.4
No. Observations:               17986   AIC:                         1.837e+04
Df Residuals:                   17975   BIC:                         1.845e+04
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                        coef  
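
The source of `produce_model` is not shown here, but the `Modeling: ...` line it prints suggests that it joins the input names into a formula string before fitting. A minimal sketch of that step, assuming this is roughly what happens (`build_formula` is a hypothetical name, not a function from `custom_functions`):

```python
def build_formula(target, features):
    """Join feature names into a 'target ~ a+b+c' formula string.

    Hypothetical helper: the real produce_model code is not shown.
    """
    return f"{target} ~ " + "+".join(features)

# A shortened feature list, for illustration only.
features = ['Heating_ElecBB', 'SqFtTotLiving_log', 'Porch_Open']
formula = build_formula('SalePrice_log', features)
print(f"Modeling: {formula}")
```

A string in this shape can be passed to a formula-based fitter such as `statsmodels.formula.api.ols(formula, data=df)`; whether `produce_model` does exactly that is an assumption.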

#### Quantify impact of coefficients given log-scaled output

In [4]:
results = pd.DataFrame(model.params).reset_index()
results.columns = ['attribute', 'coeff']
# Flag inputs that were log-transformed (their names contain '_log').
results['log_transformed?'] = results['attribute'].str.contains('_log', regex=False)

In [5]:
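# Fractional change in SalePrice for a one-unit increase in an untransformed input.
# 10**x assumes SalePrice_log is a base-10 log; np.exp(x) would be the inverse for a natural log.
exp_func = lambda x: np.round(10**x, 2) - 1
unit_col = '% change in SalePrice per *unit* input increase'
# Cast to object so the 'NA' placeholders can sit alongside the numeric values.
results[unit_col] = results['coeff'].apply(exp_func).astype(object)
results.iloc[0, -1] = 'NA'  # the intercept is not an input
results.loc[results['log_transformed?'], unit_col] = 'NA'  # logged inputs are handled in the next cell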
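
The unit-change column comes from inverting the log on the target: a coefficient `b` on an untransformed input multiplies the predicted price by `base**b`. The sketch below is an illustration rather than part of the original analysis. `pct_change_per_unit` is a hypothetical helper, and because the base used to build `SalePrice_log` is not shown in this notebook, it reports both the base-10 reading used in the cell above and the natural-log reading.

```python
import math

def pct_change_per_unit(coeff, base=10.0):
    """Fractional change in the un-logged target for a one-unit input increase.

    Assumes the target was transformed with a log of the given base
    (hypothetical helper, not part of custom_functions).
    """
    return base ** coeff - 1

# Coefficient for Heating_ElecBB, taken from the results table.
coeff = -0.058457
as_log10 = pct_change_per_unit(coeff, base=10.0)   # reading used in the notebook
as_ln = pct_change_per_unit(coeff, base=math.e)    # reading if the log is natural
print(f"log10 target: {as_log10:+.1%}, natural-log target: {as_ln:+.1%}")
```

The two readings differ by more than a factor of two for this coefficient, so it is worth confirming the base against the step that created `SalePrice_log` before quoting percentages.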

In [6]:
# For a log-transformed input, a 1% increase multiplies SalePrice by 1.01**coeff.
results['% change in SalePrice per *percent* input increase'] = [
    1.01**coeff - 1 if is_logged else 'NA'
    for coeff, is_logged in zip(results['coeff'], results['log_transformed?'])
]
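
The `1.01**coeff` step follows from the fitted equation itself. As a short derivation, write $P$ for SalePrice, $S$ for SqFtTotLiving, and $\beta$ for the coefficient on `SqFtTotLiving_log` (symbols introduced here for illustration), and hold every other input fixed:

```latex
\log P = \alpha + \beta \log S
\quad\Longrightarrow\quad
\frac{P_{\text{new}}}{P_{\text{old}}}
  = \left(\frac{S_{\text{new}}}{S_{\text{old}}}\right)^{\beta}
\quad\Longrightarrow\quad
\left.\frac{P_{\text{new}}}{P_{\text{old}}} - 1\right|_{S_{\text{new}} = 1.01\,S_{\text{old}}}
  = 1.01^{\beta} - 1
```

With $\beta = 0.710611$ this gives about $0.0071$, a 0.71% increase in price for a 1% increase in square footage. The result does not depend on the log base, provided both variables were transformed with the same one.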

In [7]:
results

| | attribute | coeff | log_transformed? | % change in SalePrice per *unit* input increase | % change in SalePrice per *percent* input increase |
|---|---|---|---|---|---|
| 0 | Intercept | 7.999294 | False | NA | NA |
| 1 | Heating_ElecBB | -0.058457 | False | -0.13 | NA |
| 2 | Heating_FloorWall | 0.07805 | False | 0.2 | NA |
| 3 | Heating_HeatPump | 0.112877 | False | 0.3 | NA |
| 4 | Heating_HotWater | 0.267768 | False | 0.85 | NA |
| 5 | Heating_Radiant | 0.273241 | False | 0.88 | NA |
| 6 | SqFtTotLiving_log | 0.710611 | True | NA | 0.00709587 |
| 7 | Basement_Finished | 0.071071 | False | 0.18 | NA |
| 8 | Porch_Open | 0.033703 | False | 0.08 | NA |
| 9 | Porch_Closed | 0.10285 | False | 0.27 | NA |


## Findings

### Enclosing the porch
Take note of the **Porch_** rows in the *Results* table. These are one-hot encoded, with no porch at all as the baseline. Homes with *open* porches tend to sell for roughly 8% more than homes without a porch, while homes with *enclosed* porches (**Porch_Closed**) sell for roughly 27% more. Relative to each other, that puts enclosed-porch homes about 18% above open-porch homes (1.27 / 1.08 ≈ 1.18). It looks like it's time to enclose that porch of yours!
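
Both porch premiums are measured against the same "no porch" baseline, so comparing the two porch types means dividing their price multipliers rather than subtracting the percentages. A small sketch using the rounded figures from the *Results* table (`relative_premium` is a hypothetical helper introduced for illustration):

```python
def relative_premium(pct_a, pct_b):
    """Premium of group A over group B, given each group's fractional
    premium over a shared baseline (hypothetical helper)."""
    return (1 + pct_a) / (1 + pct_b) - 1

# Fractional premiums over "no porch" from the results table.
closed_vs_none = 0.27
open_vs_none = 0.08
closed_vs_open = relative_premium(closed_vs_none, open_vs_none)
print(f"Closed vs. open porch: {closed_vs_open:.1%}")
```

The subtraction (27 minus 8) gives the gap in percentage points relative to no-porch homes; the ratio gives the premium relative to open-porch homes, which is the smaller number.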

### Finishing the basement
Refer to the **Basement_Finished** row in the *Results* table. This is a binary indicator of whether a basement is finished; homes with no basement at all are excluded. Homes with finished basements typically sell for about 18% more than homes with unfinished basements, so it may be worth the time and money to get yours finished.

### Choosing a heating system
Refer to the **Heating_** rows in the *Results* table. These are one-hot encoded, with Forced Air as the baseline because it is by far the most common system, used in over 75% of the roughly 18,000 homes analyzed. The estimated effects here are unrealistically large (for example, +85% for Hot Water and +88% for Radiant), so rather than reading them as the amount a heating system adds to a home's price, let's view them as a *trend*. The heating sources in the most expensive sales are Radiant and Hot Water. Heat Pump and Floor-Wall are also associated with higher prices than Forced Air. Coming in last, Electric Baseboard heating is associated with the lowest sale prices.

### Consider an add-on
Among the strongest predictors of sale price is the **livable square footage** of a home (**SqFtTotLiving_log**), which is not surprising. Because both the input and the sale price are log-transformed, the coefficient reads as an elasticity: a 1% increase in square footage is associated with about a 0.71% increase in sale price, so a 10% increase corresponds to roughly 7%. That's substantial. It is also a possible explanation for why finishing the basement can have such a positive impact.
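
The 10% figure can be reproduced from the coefficient on **SqFtTotLiving_log** in the *Results* table. The sketch below is illustrative only: `pct_change_for_input_increase` is a hypothetical helper, and it assumes the input and the sale price were logged with the same base.

```python
def pct_change_for_input_increase(coeff, pct_increase):
    """Fractional change in price when a log-transformed input grows by
    pct_increase percent, in a log-log model (hypothetical helper)."""
    return (1 + pct_increase / 100) ** coeff - 1

# Coefficient on SqFtTotLiving_log from the results table.
coeff = 0.710611
for pct in (1, 10, 25):
    change = pct_change_for_input_increase(coeff, pct)
    print(f"{pct}% more living area -> {change:.2%} change in sale price")
```

Because the relationship is a power law, the effect is not exactly linear in the percentage: larger additions yield slightly less than 0.71% per added percent of area.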