## Model Benchmarks

#### External Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import r2_score

%matplotlib inline



#### Read in Data

In [2]:
clean_train = pd.read_csv('../data/EDA_Preprocessing_train.csv')

In [3]:
clean_test = pd.read_csv('../data/EDA_Preprocessing_test.csv')

### First Multiple Linear Regression Model

- Running a train/test split on my selected features

In [4]:
features = ['overall_qual', 'total_sf', 'neighborhood', 'exter_qual', 'bsmt_qual', 'kitchen_qual', 'gr_liv_area',
            'garage_cars', 'garage_finish', 'fireplace_qu', 'full_bath', 'foundation', 'garage_type', 'mas_vnr_area']
X = clean_train[features]
y = clean_train[['saleprice']]

# Hold out 30% of the rows for evaluation; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=69)

- Fitting the linear regression model
- Producing the 10-fold cross-validation score (mean R²)

In [5]:
lr = LinearRegression()
lr.fit(X_train, y_train)
# Mean R^2 over 10 folds of the training split (cross_val_score refits a clone on each fold)
lr_cv = cross_val_score(lr, X_train, y_train, cv=10).mean()
print('Linear Regression Cross Value Score {}'.format(lr_cv))

Linear Regression Cross Value Score 0.8609544499575532
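
For reference, `cross_val_score` on a regressor reports R² by default (the estimator's own `score` method), so the figure above is a mean R² over 10 folds. The sketch below makes the scoring and fold layout explicit and also reports the spread across folds. It runs on synthetic data from `make_regression` as a stand-in for the housing features; the shuffled `KFold`, the seeds, and the `_demo` names are assumptions for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X_train / y_train (assumption: not the housing data)
X_demo, y_demo = make_regression(n_samples=500, n_features=14, noise=25.0, random_state=0)

# Explicit 10-fold splitter; shuffling guards against any ordering in the rows
folds = KFold(n_splits=10, shuffle=True, random_state=0)

# scoring='r2' matches the default score used for regressors
fold_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=folds, scoring='r2')

print('Mean R^2: {:.4f}'.format(fold_scores.mean()))
print('Std  R^2: {:.4f}'.format(fold_scores.std()))
```

A small standard deviation across folds suggests the mean is a stable estimate; a large one suggests the score depends heavily on which rows land in each fold.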


- This is a relatively good baseline score for the first and simplest model.

In [6]:
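# Same 10-fold procedure on the held-out split; note the model is refit on folds of the test rows here
lr_cv = cross_val_score(lr, X_test, y_test, cv=10).mean()
print('Linear Regression Cross Value Score {}'.format(lr_cv))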

Linear Regression Cross Value Score 0.8716947879843181


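- Comparing these two cross-validation scores, the test score (0.872) is slightly higher than the training score (0.861), so there is no sign of overfitting. Because `cross_val_score` refits the model on folds of the test split, this second number measures how well a linear model fits the test rows, not how the model fitted on the training rows generalizes.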
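
A common alternative to cross-validating on the held-out rows is to fit once on the training split and score that same fitted model on the test split, so the test data never takes part in fitting. A minimal sketch on synthetic data (the `make_regression` data, the seeds, and all `_demo`, `_tr`, and `_te` names are illustrative assumptions, not the notebook's data):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in for the housing features and target
X_demo, y_demo = make_regression(n_samples=600, n_features=14, noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

lr_demo = LinearRegression()

# Cross-validated R^2 estimated from the training split only
cv_r2 = cross_val_score(lr_demo, X_tr, y_tr, cv=10).mean()

# Fit once on the training split, then score on untouched test rows
lr_demo.fit(X_tr, y_tr)
test_r2 = lr_demo.score(X_te, y_te)

print('Training CV R^2: {:.4f}'.format(cv_r2))
print('Held-out R^2:    {:.4f}'.format(test_r2))
print('Gap (CV - test): {:.4f}'.format(cv_r2 - test_r2))
```

A held-out R² far below the training cross-validation R² is the usual sign of overfitting.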

In [7]:
# statsmodels does not add an intercept automatically; OLS is fit on all rows of clean_train
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.875 |
| Model: | OLS | Adj. R-squared: | 0.874 |
| Method: | Least Squares | F-statistic: | 1016.0 |
| Date: | Fri, 07 Dec 2018 | Prob (F-statistic): | 0.0 |
| Time: | 09:45:44 | Log-Likelihood: | -23892.0 |
| No. Observations: | 2049 | AIC: | 47810.0 |
| Df Residuals: | 2034 | BIC: | 47900.0 |
| Df Model: | 14 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.156e+05 | 3966.477 | -29.135 | 0.000 | -1.23e+05 | -1.08e+05 |
| overall_qual | 7108.2063 | 817.077 | 8.700 | 0.000 | 5505.812 | 8710.601 |
| total_sf | 32.1675 | 1.839 | 17.488 | 0.000 | 28.560 | 35.775 |
| neighborhood | 5.876e+04 | 5756.250 | 10.208 | 0.000 | 4.75e+04 | 7e+04 |
| exter_qual | 5.131e+04 | 7168.525 | 7.157 | 0.000 | 3.72e+04 | 6.54e+04 |
| bsmt_qual | 5.712e+04 | 5895.392 | 9.688 | 0.000 | 4.56e+04 | 6.87e+04 |
| kitchen_qual | 5.97e+04 | 5860.648 | 10.187 | 0.000 | 4.82e+04 | 7.12e+04 |
| gr_liv_area | 19.8876 | 2.713 | 7.330 | 0.000 | 14.567 | 25.208 |
| garage_cars | 6417.3347 | 1140.778 | 5.625 | 0.000 | 4180.120 | 8654.549 |

| | | | |
|---|---|---|---|
| Omnibus: | 434.925 | Durbin-Watson: | 2.026 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 3843.862 |
| Skew: | 0.741 | Prob(JB): | 0.0 |
| Kurtosis: | 9.544 | Cond. No. | 39200.0 |
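
As a quick consistency check on the summary, adjusted R² can be recomputed from the reported R², the number of observations, and the model degrees of freedom using the standard formula `1 - (1 - R²)(n - 1)/(n - p - 1)`. The inputs below are the rounded values printed in the summary, so the result can only be expected to agree to about three decimals.

```python
# Values copied from the OLS summary (R-squared is rounded to 3 decimals there)
r_squared = 0.875
n_obs = 2049      # No. Observations
df_model = 14     # Df Model (predictors, excluding the constant)

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - df_model - 1)

print(round(adj_r_squared, 3))  # 0.874, matching the reported Adj. R-squared
```

The denominator `n - p - 1` is 2034, the same figure reported as Df Residuals. With roughly 2,000 rows and 14 predictors the penalty is small, which is why the two R² values differ only in the third decimal.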


### Understanding of Key Metrics:
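- `neighborhood` has one of the largest coefficients (5.876e+04; `kitchen_qual` is slightly larger at 5.97e+04). A coefficient is the expected change in sale price for a one-unit increase in that feature with the other features held fixed, so magnitudes are only directly comparable between features measured on the same scale.
- One p-value is above 0.1 (`garage_finish`), which means there is no statistically significant evidence that this feature affects sale price once the other features are accounted for.
- Interestingly, the `full_bath` and `foundation` coefficients are negative, which would imply that these features decrease sale price when the others are held fixed. This may reflect correlation among the predictors rather than a real negative effect.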
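
Because raw coefficients depend on each feature's units, ranking features by coefficient size is more meaningful after putting them on a common scale. The sketch below illustrates this with `StandardScaler` on a small synthetic dataset; the two columns (`sqft_demo`, `quality_demo`), their ranges, and the price formula are invented for illustration and are not taken from the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n = 1000

# Hypothetical features on very different scales
sqft_demo = rng.uniform(500, 4000, size=n)    # square feet
quality_demo = rng.uniform(0, 1, size=n)      # score in [0, 1]
X_demo = np.column_stack([sqft_demo, quality_demo])

# Invented price formula plus noise
y_demo = 50 * sqft_demo + 40000 * quality_demo + rng.normal(0, 5000, size=n)

# Coefficients in the original units of each feature
raw_coefs = LinearRegression().fit(X_demo, y_demo).coef_

# After standardizing, each coefficient is the price change per one standard deviation
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_coefs = LinearRegression().fit(X_scaled, y_demo).coef_

print('Raw coefficients:         ', np.round(raw_coefs, 1))
print('Standardized coefficients:', np.round(scaled_coefs, 1))
```

In this toy setup the quality score has the larger raw coefficient, yet square footage has the larger standardized one, which is why raw magnitudes alone can mislead.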
    

### Saving Data

In [8]:
clean_train.to_csv('../data/Model_Benchmarks_train.csv', index=False)

In [9]:
clean_test.to_csv('../data/Model_Benchmarks_test.csv', index=False)