### 1) Load the houseprices data

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm


In [2]:
df = pd.read_csv('houseprices.csv')

In [3]:
num_col = ['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalbsmtsf', 'firstflrsf']
cat_col = ['exterqual', 'kitchenqual']

In [4]:
df2 = pd.concat([df[num_col], df[cat_col], df['saleprice']], axis = 1)

In [5]:
for col in cat_col:
    df2 = pd.concat([df2, pd.get_dummies(df[col], drop_first=True, prefix = col)], axis = 1)

In [6]:
df2.head()

Unnamed: 0,overallqual,grlivarea,garagecars,garagearea,totalbsmtsf,firstflrsf,exterqual,kitchenqual,saleprice,exterqual_Fa,exterqual_Gd,exterqual_TA,kitchenqual_Fa,kitchenqual_Gd,kitchenqual_TA
0,7,1710,2,548,856,856,Gd,Gd,208500,0,1,0,0,1,0
1,6,1262,2,460,1262,1262,TA,TA,181500,0,0,1,0,0,1
2,7,1786,2,608,920,920,Gd,Gd,223500,0,1,0,0,1,0
3,7,1717,3,642,756,961,TA,Gd,140000,0,0,1,0,1,0
4,8,2198,3,836,1145,1145,Gd,Gd,250000,0,1,0,0,1,0


### 2) Run your house prices model again and assess the goodness of fit of your model using F-test, R-squared, adjusted R-squared, AIC and BIC.

The R-squared and adjusted R-squared values are similar, at 0.792 and 0.790.

The F-statistic is 459.0 with a p-value near 0, so we reject the null hypothesis that all slope coefficients are zero: taken together, the features explain some of the variation in the target variable.

Lastly, the AIC and BIC are 3.482e+04 and 3.489e+04.
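
As a sanity check, these figures can be reproduced from the numbers printed in the model summary. The sketch below is an illustration and not part of the original notebook: the helper name `fit_metrics` is hypothetical, and it plugs in the rounded summary values (1460 observations, 12 regressors, R-squared 0.792, log-likelihood -17398), so it agrees with the summary only up to rounding.

```python
import math

def fit_metrics(n_obs, n_regressors, r_squared, log_likelihood):
    """Recompute OLS goodness-of-fit metrics from summary values.

    Hypothetical helper for illustration; n_regressors excludes the constant.
    """
    k = n_regressors + 1                 # estimated parameters, incl. constant
    df_resid = n_obs - k                 # residual degrees of freedom
    adj_r2 = 1 - (1 - r_squared) * (n_obs - 1) / df_resid
    f_stat = (r_squared / n_regressors) / ((1 - r_squared) / df_resid)
    aic = 2 * k - 2 * log_likelihood
    bic = k * math.log(n_obs) - 2 * log_likelihood
    return {"adj_r2": adj_r2, "f_stat": f_stat, "aic": aic, "bic": bic}

# Rounded values as printed in the model summary
metrics = fit_metrics(n_obs=1460, n_regressors=12,
                      r_squared=0.792, log_likelihood=-17398)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these inputs the adjusted R-squared comes out at about 0.790 and the AIC and BIC at about 3.482e+04 and 3.489e+04, in line with the summary. The F-statistic lands near 459, with the small gap coming from the rounded R-squared.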

In [7]:
Y = df2['saleprice']
X = df2.drop(['saleprice', 'exterqual', 'kitchenqual'], axis = 1)
X = sm.add_constant(X)
results = sm.OLS(Y,X).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.792
Model:                            OLS   Adj. R-squared:                  0.790
Method:                 Least Squares   F-statistic:                     459.0
Date:                Mon, 09 Sep 2019   Prob (F-statistic):               0.00
Time:                        22:23:36   Log-Likelihood:                -17398.
No. Observations:                1460   AIC:                         3.482e+04
Df Residuals:                    1447   BIC:                         3.489e+04
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const           3.758e+04   1.12e+04      3.

### 3) Do you think your model is satisfactory? If so, why?
As a first model, an R-squared of 0.792 isn't bad: about 79% of the variance in sale price is explained. There is still room for improvement, since the remaining ~21% of the variance is not explained.
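
To make "variance explained" concrete: R-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The minimal sketch below uses made-up numbers, not the house price data, and the helper name `r_squared` is hypothetical.

```python
def r_squared(actual, predicted):
    """R-squared = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared error left over after the model's predictions
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of the baseline that always predicts the mean
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Toy values, invented for illustration only
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 190.0, 260.0]
print(r_squared(actual, predicted))
```

A value close to 1 means the predictions leave little of the spread around the mean unexplained; a value of 0 means the model does no better than predicting the mean everywhere.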

### 4) In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables.

I created new features that add up or multiply the exterior and kitchen quality dummies, plus one that sums the square footage around the property. None of these made much difference to the R-squared and adjusted R-squared values, which is expected for the two sums: `sumqual` and `totalarea` are linear combinations of columns already in the model, so they add no new information to a linear regression. The biggest improvement came from taking the log of the sale price.

In [8]:
df2['sumqual'] = df2['exterqual_Fa'] + df2['exterqual_Gd'] + df2['exterqual_TA'] + df2['kitchenqual_Fa'] + df2['kitchenqual_Gd'] + df2['kitchenqual_TA']

In [9]:
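# Note: only the first operator is +, so this evaluates to
# exterqual_Fa + (product of the other five dummy columns)
df2['prodqual'] = df2['exterqual_Fa'] + df2['exterqual_Gd'] * df2['exterqual_TA'] * df2['kitchenqual_Fa'] * df2['kitchenqual_Gd'] * df2['kitchenqual_TA']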

In [10]:
df2['totalarea'] = df2['grlivarea'] + df2['totalbsmtsf'] + df2['firstflrsf'] + df2['garagearea']

In [18]:
Y2 = np.log(df2['saleprice'])
X2 = df2.drop(['saleprice', 'exterqual', 'kitchenqual'], axis = 1)
X2 = sm.add_constant(X2)
results2 = sm.OLS(Y2,X2).fit()
print(results2.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.822
Model:                            OLS   Adj. R-squared:                  0.820
Method:                 Least Squares   F-statistic:                     475.2
Date:                Mon, 09 Sep 2019   Prob (F-statistic):               0.00
Time:                        22:29:10   Log-Likelihood:                 526.73
No. Observations:                1460   AIC:                            -1023.
Df Residuals:                    1445   BIC:                            -944.2
Df Model:                          14                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                10.4310      0.22

### 5)  For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?
The table below summarizes the adjusted R-squared, the F-statistic with its p-value, and the AIC and BIC for each model.

For adjusted R-squared, the second model has the higher value.

For the F-statistic, both models have a p-value of effectively 0 (well under 0.1), so the null hypothesis is rejected: both models are useful and contribute something to the explanation of the target.

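Lastly, lower AIC and BIC values are better, and both are lower for the second model. Because the second model's target is log(saleprice) rather than saleprice, however, these two criteria are not on the same scale across the models, so this part of the comparison is only indicative.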

Overall, it seems that the second model is better than the first.

| Model | Adjusted R-squared | F-statistic & p-value | AIC | BIC |
|------|------|------|------|------|
| 1 | 0.790 | 459.0 & 0 | 3.482e+04 | 3.489e+04 |
| 2 | 0.820 | 475.2 & 0 | -1023 | -944.2 |
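
One caveat on the AIC and BIC columns: the first model is fit to `saleprice` and the second to `np.log(saleprice)`, and likelihood-based criteria are only directly comparable when both models describe the same response. A common correction, sketched here as an assumption and not something the original notebook does, is to move the log model's criterion back to the price scale by adding the Jacobian term 2 * sum(log(y)). The helper name and the toy inputs below are hypothetical.

```python
import math

def aic_on_original_scale(aic_log_model, prices):
    """Convert an AIC computed for log(price) to the price scale.

    If z = log(y), the density of y is f_z(log y) / y, so the log-likelihood
    on the price scale is loglik_z - sum(log y). Since AIC = 2k - 2*loglik,
    the adjustment adds 2 * sum(log y). The same shift applies to BIC.
    """
    return aic_log_model + 2 * sum(math.log(p) for p in prices)

# Toy inputs, invented for illustration (not the notebook's data)
toy_prices = [100000.0, 200000.0, 300000.0]
toy_aic_log = -5.0
print(aic_on_original_scale(toy_aic_log, toy_prices))
```

In the notebook this would amount to something like `results2.aic + 2 * np.log(df2['saleprice']).sum()`, which could then be compared against `results.aic` on equal footing.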
