# Using Regression to predict housing prices in King County (R2 adj. 84%)

# Content

In [None]:
## to do

# Introduction

Hi all, welcome to this kernel. So far, it includes everything I wanted to include, except interactions between variables. <br>
Comments are very welcome, I would love to hear your ideas for improvements!<br>
<br>
The goal of this kernel is to find a regression model that can predict housing prices for houses in King County, USA. Of course there are other types of models that can do that, but since I am not an expert, I chose regression. <br>
Regression helps us forecast a response using a set of predictors. Here, we forecast the price of a house given its square footage, location, and other variables. <br>
In this kernel we will use all the provided variables, transform them if necessary, create a linear regression model, check the performance of this model, and check whether the regression assumptions hold.

# Imports

First import all libraries we need, and the dataset with house data.

In [None]:
# import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn import datasets, linear_model, metrics
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# import data

path = "/kaggle/input/housesalesprediction/kc_house_data.csv"

df_house = pd.read_csv(path)
df_house

# Explore data Part 1

Now that we have all the data we need, let's check which variables we have, what they look like, and what data types they are.

In [None]:
display(df_house.describe())
df_house.info() # info() prints its report itself and returns None, so display() is not needed

<b>Missing values</b><br>
In the .info() results we can see that there are no NaN values. <br>
In the .describe() results we can see if there are other values than NaN used as missing values, e.g. 0, -1, -99, blanks, etc.<br>
It looks like there are no missing values, so no need to transform missing values.<br>
<br>
<b>Qualitative variables</b><br>
Since regression models only take quantitative variables, we need to check which qualitative variables we have. In this dataset, this is only the date column. Since we won't use the date variable in the regression model anyway, there is no need to transform this variable.

# Create New Variables

<b>Change variables that make no sense</b><br>
The sqft_basement variable indicates how big the basement is. I believe that the actual size of the basement does not add much more to the price of a house, than simply having a basement or not. <br>
So instead, create a new variable 'has_basement' that indicates whether a house has a basement (1) or not (0).

In [None]:
df_house['has_basement'] = np.where(df_house['sqft_basement'] > 0, 1, 0)

df_house.head()

<b>Distance to city center</b><br>
In many cities, housing prices are higher in the city center, and lower outside the city. <br>
So let's calculate the distance of each house to the city center of Seattle, based on longitude and latitude, and put the results in a new variable 'distance_to_city'.

In [None]:
from math import radians, sin, cos, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

latitude_city = 47.610515
longitude_city = -122.33465413

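# haversine() expects (lon1, lat1, lon2, lat2), so pass longitude before latitude; the result is in km
df_house['distance_to_city'] = df_house.apply(lambda row: haversine(longitude_city, latitude_city, row['long'], row['lat']), axis=1)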

df_house

<b>Good/bad neighborhoods</b><br>
It often happens that a similar house will cost more in a 'good' neighborhood and less in a 'bad' neighborhood. <br>
It is difficult to define a good or bad neighborhood with the available data. Instead, we assume that the median house price is higher in good neighborhoods and lower in bad neighborhoods. <br>
First, calculate the top 30 zipcodes by median house price, and do the same for the bottom 30 zipcodes. <br>
Then create a new variable 'is_topx_zipcode' which contains 1 for houses whose zipcode is in the top 30, and 0 otherwise. <br>
Also create a new variable 'is_bottomx_zipcode' which contains 1 for houses whose zipcode is in the bottom 30, and 0 otherwise.<br>
The number 30 was chosen by trial and error.

In [None]:
# check if prices are higher or lower for certain zipcodes

f, axe = plt.subplots(1, 1,figsize=(25,5))
sns.boxplot(x=df_house['zipcode'],y=df_house['price'], ax=axe)
sns.despine(left=True, bottom=True)
axe.yaxis.tick_left()
axe.set(xlabel='Zipcode', ylabel='Price')

In [None]:
# create new variables for top x and bottom x zipcodes by price
# trial and error to get a better x

med_price_zip = df_house.groupby(['zipcode']).agg({'price': 'median', 'id': "count"}).sort_values('price', ascending = False)

zipcode_topx = med_price_zip['price'].nlargest(30).index.values
zipcode_bottomx = med_price_zip['price'].nsmallest(30).index.values

print(zipcode_topx)
print(zipcode_bottomx)

df_house["is_topx_zipcode"] = [1 if x in list(zipcode_topx) else 0 for x in df_house["zipcode"]]
df_house["is_bottomx_zipcode"] = [1 if x in list(zipcode_bottomx) else 0 for x in df_house["zipcode"]]

df_house

# Explore data Part 2

Now that we have our new variables, let's do some more data exploration:
- Univariate: histograms for all variables
- Univariate: distribution and normal probability plot for all variables
- Bivariate: jointplots for all x variables in relation to price
- Bivariate: correlation heatmap to visualize all correlations between all variables

In [None]:
# histograms for all variables
# copied from: https://www.kaggle.com/burhanykiyakoglu/predicting-house-prices

h = df_house.drop(['id', 'date'], axis = 1).hist(bins = 25, figsize = (16,16), xlabelsize = '10', ylabelsize = '10', xrot = -15)
sns.despine(left = True, bottom = True)
[x.title.set_size(12) for x in h.ravel()];
[x.yaxis.tick_left() for x in h.ravel()];

In [None]:
# graph distribution

for col in list(df_house.drop(['id', 'date'], axis = 1).columns):    
    plt.figure(figsize=(12,4))
    plt.subplot(1,2,1)
    sns.distplot(df_house[col].dropna(), fit=stats.norm);
    plt.subplot(1,2,2)
    _=stats.probplot(df_house[col].dropna(), plot=plt)

plt.show()

In [None]:
# jointplots for all x variables in relation to price, to visualize the bivariate distribution

for col in list(df_house.drop(['id', 'date','price'], axis = 1).columns):
    sns.jointplot(x = col, y = "price", data = df_house, kind = 'reg', height = 5) # 'size' was renamed to 'height' in seaborn

plt.show()

In [None]:
# create correlation matrix
corr_matrix = df_house.drop(columns = ['date']).corr() # date is a string column, so leave it out of the correlations


# set up mask to hide upper triangle

mask = np.zeros_like(corr_matrix, dtype = bool) # np.bool was removed from NumPy, the builtin bool works everywhere
mask[np.triu_indices_from(mask)] = True


# create seaborn heatmap

f, ax = plt.subplots(figsize = (16, 10)) 

heatmap = sns.heatmap(corr_matrix, 
                      mask = mask,
                      #square = True, # Makes each cell square-shaped
                      linewidths = .5, # set width of the lines that will divide each cell to .5
                      cmap = "coolwarm", # map data values to the coolwarm color space
                      cbar_kws = {'shrink': .4, # shrink the legend size and label tick marks at [-1, -.5, 0, 0.5, 1]
                                "ticks" : [-1, -.5, 0, 0.5, 1]},
                      vmin = -1, # Set min value for color bar
                      vmax = 1, # Set max value for color bar
                      annot = True, # Turn on annotations for the correlation values
                      fmt='.2f', # String formatting code to use when adding annotations
                      annot_kws = {"size": 12}) # Set annotations to size 12

# add title
plt.title('House Sales King County - Correlation Heatmap', 
              fontsize=14, 
              fontweight='bold')

# add the column names as labels
ax.set_xticklabels(corr_matrix.columns, rotation = 90) # Add column names to the x labels and rotate text to 90 degrees
ax.set_yticklabels(corr_matrix.columns, rotation = 0) # Add column names to the y labels and rotate text to 0 degrees

sns.set_style({'xtick.bottom': True}, {'ytick.left': True}) # Show tickmarks on bottom and left of heatmap

# Transform Variables

<b>Handle nonlinear relationships</b><br>
While many machine learning models inherently have the flexibility to detect and model non-linear relationships between a predictor variable and the target variable, linear regression, by default, models the relationship as linear. <br>
You can, however, set up your regression dataset to give regression the flexibility to model non-linear relationships if they exist. <br>

The jointplot graphs that were made for each variable already show some information about what the relationship looks like. <br>
Additionally, we can calculate the skewness to see if there are variables skewed to the right (skew > 1) or left (skew < -1).

In [None]:
# check for skewed data

skew = df_house.skew(axis = 0, skipna = True, numeric_only = True) # skip the non-numeric date column

# show all variables which are skewed to the right (skew > 1)
print('Variables which are skewed to the right:')
print(skew[skew > 1].sort_values(ascending = False))
print()

# show all variables which are skewed to the left (skew < -1)
print('Variables which are skewed to the left:')
print(skew[skew < -1].sort_values())

<b>Reduce skewness</b><br>
The results show a number of variables which are skewed to the right, and none that are skewed to the left. <br>
To handle variables skewed to the right, take the log of the variable. This only works if there are no zero or negative values. Otherwise, take the cube root or square root. <br>
For the variables price, sqft_lot, and distance_to_city, a new variable is created with the log of the variable.
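For a right-skewed variable that does contain zeros, the log is undefined at zero. Below is a minimal sketch of the alternatives mentioned above, on a small made-up sample rather than the real data. `np.log1p` (the log of 1 + x) is an extra option that is not used elsewhere in this kernel.

```python
import numpy as np
from scipy import stats

# made-up right-skewed sample that contains zeros (illustration only, not the housing data)
rng = np.random.default_rng(0)
x = np.concatenate([np.zeros(200), rng.lognormal(mean=6, sigma=1, size=800)])

skew_raw = stats.skew(x)
skew_sqrt = stats.skew(np.sqrt(x))    # square root: defined at zero
skew_cbrt = stats.skew(np.cbrt(x))    # cube root: defined at zero and for negative values
skew_log1p = stats.skew(np.log1p(x))  # log(1 + x): defined at zero

print('raw  :', round(skew_raw, 2))
print('sqrt :', round(skew_sqrt, 2))
print('cbrt :', round(skew_cbrt, 2))
print('log1p:', round(skew_log1p, 2))
```

Which transformation works best depends on the variable, so it is worth comparing the skewness after each one.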

In [None]:
df_house['price_log'] = np.log(df_house['price'])
df_house['sqft_lot_log'] = np.log(df_house['sqft_lot'])
df_house['distance_to_city_log'] = np.log(df_house['distance_to_city'])

df_house

<b>Binning:</b><br>
Data binning is a preprocessing technique used to reduce the effects of minor observation errors. Often we see this with a variable like age. <br>
We will do the same for the 'age' of the houses. <br>
Step 1: partition the age into bins. <br>
Step 2: create dummy variables for every bin.

In [None]:
# taken from https://www.kaggle.com/burhanykiyakoglu/predicting-house-prices

# add the age of the buildings when the houses were sold as a new column
age = df_house['date'].astype(str).str[:4].astype(int) - df_house['yr_built']

# partition the age into bins
bins = [-2, 0, 5, 10, 25, 50, 75, 100, 100000]
labels = ['<1', '1-5', '6-10', '11-25', '26-50', '51-75', '76-100', '>100']
df_house['age_binned'] = pd.cut(age, bins = bins, labels = labels)

# histograms for the binned columns
plot = sns.countplot(x = df_house['age_binned'])
for p in plot.patches:
    height = p.get_height()
    plot.text(p.get_x() + p.get_width() / 2, height + 50, height, ha = "center")   

# label the axes of the countplot itself (ax still refers to the earlier heatmap)
plot.set(xlabel='Age')
plot.yaxis.tick_left()

# transform the factor values to be able to use in the model
df_house = pd.get_dummies(df_house, columns=['age_binned'], dtype = int) # 0/1 integers, so statsmodels can use them directly

df_house

Now that some important variables are transformed, including the y variable, let's do the jointplots and the correlation heatmap again.

In [None]:
# jointplots for all x variables in relation to price_log, to visualize the bivariate distribution

for col in list(df_house.drop(['id', 'date','price','price_log'], axis = 1).columns):
    sns.jointplot(x = col, y = "price_log", data = df_house, kind = 'reg', height = 5) # 'size' was renamed to 'height' in seaborn

plt.show()

In [None]:
# create correlation matrix
corr_matrix = df_house.drop(columns = ['date']).corr() # date is a string column, so leave it out of the correlations
#display(corr_matrix)


# set up mask to hide upper triangle

mask = np.zeros_like(corr_matrix, dtype = bool) # np.bool was removed from NumPy, the builtin bool works everywhere
mask[np.triu_indices_from(mask)] = True


# create seaborn heatmap

f, ax = plt.subplots(figsize = (16, 10)) 

heatmap = sns.heatmap(corr_matrix, 
                      mask = mask,
                      #square = True, # Makes each cell square-shaped
                      linewidths = .5, # set width of the lines that will divide each cell to .5
                      cmap = "coolwarm", # map data values to the coolwarm color space
                      cbar_kws = {'shrink': .4, # shrink the legend size and label tick marks at [-1, -.5, 0, 0.5, 1]
                                "ticks" : [-1, -.5, 0, 0.5, 1]},
                      vmin = -1, # Set min value for color bar
                      vmax = 1, # Set max value for color bar
                      annot = True, # Turn on annotations for the correlation values
                      fmt='.2f', # String formatting code to use when adding annotations
                      annot_kws = {"size": 12}) # Set annotations to size 12

# add title
plt.title('House Sales King County - Correlation Heatmap', 
              fontsize=14, 
              fontweight='bold')

# add the column names as labels
ax.set_xticklabels(corr_matrix.columns, rotation = 90) # Add column names to the x labels and rotate text to 90 degrees
ax.set_yticklabels(corr_matrix.columns, rotation = 0) # Add column names to the y labels and rotate text to 0 degrees

sns.set_style({'xtick.bottom': True}, {'ytick.left': True}) # Show tickmarks on bottom and left of heatmap

# Create model and fit with training dataset

<b>Split dataset in train and test datasets</b>
<br>
The dataset is split at random: 80% goes into a training dataset used to train the model, and the remaining 20% goes into a test dataset. This split allows us to validate the model’s predictive power on a different set of data than the model was trained on.<br>

In [None]:
# split in train and test dataset, and x and y

from sklearn.model_selection import train_test_split

X = df_house.drop(['id', 'date', # variables not used from start
                   'price', 'price_log', # exclude dependent variables
                   'sqft_lot', 'distance_to_city', # exclude original variables after creating log transformation
                   'sqft_basement','yr_built', 'zipcode', # exclude original variables after transformation
                   'sqft_above'], # remove to avoid multicollinearity
                  axis = 1)
y = df_house['price_log']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

print(X_train.shape)
print(X_test.shape)

<b>Create model</b><br>
Now we can finally create a model ! Sklearn is used to create the model and for further steps. Statsmodels is also used to show the p-values.

In [None]:
# create linear model

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lrmodel1 = lr.fit(X_train, y_train)

In [None]:
# print model coefficients

coef_lr = pd.Series(lrmodel1.coef_, index = X_train.columns)
print('Intercept:', lrmodel1.intercept_)
print()
print('Coefficients:')
print(coef_lr.round(4))

In [None]:
# get p values for coefficients with statsmodel

import statsmodels.api as sm

#Fitting sm.OLS model
X_1 = sm.add_constant(X_train)
model = sm.OLS(y_train,X_1).fit()
print(model.summary())

<b>Interpreting model coefficients</b><br>


<i>Coef</i>: this is the weight assigned to that input variable. Positive indicates a positive relationship between the input variable and the target variable (controlling for all other input variables in the model). A higher absolute value indicates a larger impact on the target variable (although be sure to consider the potentially different scale of each of the input variables when comparing across input variables). <br>
<br>
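<i>P>|t|</i>: this is the p-value: the probability of seeing a coefficient estimate at least this far from zero if there were in fact no relationship between the input variable and the target variable (controlling for the other input variables). Usually, variables are excluded if their p-value is above a certain threshold (such as 0.05).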

<b>Interpreting coefficients with log-transformed Variables</b><br>
<br>
<i>Only the dependent/response variable is log-transformed.</i> <br>
Exponentiate the coefficient, subtract one from this number, and multiply by 100. This gives the percent increase (or decrease) in the response for every one-unit increase in the independent variable. <br>
Example: the coefficient is 0.198. (exp(0.198) – 1) x 100 = 21.9. For every one-unit increase in the independent variable, our dependent variable increases by about 22%.<br>
In python: (np.exp(0.198) - 1) * 100<br>
<br>
<i>Only independent/predictor variable(s) is log-transformed.</i> <br>
Divide the coefficient by 100. This tells us that a 1% increase in the independent variable increases (or decreases) the dependent variable by (coefficient/100) units. <br>
Example: the coefficient is 0.198. 0.198/100 = 0.00198. For every 1% increase in the independent variable, our dependent variable increases by about 0.002. For x percent increase, multiply the coefficient by log(1.x). <br>
Example: For every 10% increase in the independent variable, our dependent variable increases by about 0.198 x log(1.10) = 0.02.<br>
<br>
<i>Both dependent/response variable and independent/predictor variable(s) are log-transformed.</i> <br>
Interpret the coefficient as the percent increase in the dependent variable for every 1% increase in the independent variable. <br>
Example: the coefficient is 0.198. For every 1% increase in the independent variable, our dependent variable increases by about 0.20%. For an x percent increase, calculate 1.x to the power of the coefficient, subtract 1, and multiply by 100. <br>
Example: For every 20% increase in the independent variable, our dependent variable increases by about (1.20^0.198 – 1) x 100 = 3.7 percent.<br>
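The three rules above can be checked with a few lines of Python. This is only a sketch using the example coefficient 0.198 from the text, not a coefficient taken from the fitted model, and the variable names are made up for the illustration.

```python
import numpy as np

coef = 0.198  # example coefficient from the text above (not from the fitted model)

# 1) only the response is log-transformed:
#    percent change in y for a one-unit increase in x
pct_log_y = (np.exp(coef) - 1) * 100

# 2) only the predictor is log-transformed:
#    change in y (in units of y) for a 10% increase in x
delta_log_x = coef * np.log(1.10)

# 3) both response and predictor are log-transformed:
#    percent change in y for a 20% increase in x
pct_log_log = (1.20 ** coef - 1) * 100

print(round(pct_log_y, 1), round(delta_log_x, 3), round(pct_log_log, 1))
```

In this kernel price is log-transformed, so rule 1 applies to untransformed predictors such as has_basement, and rule 3 applies to sqft_lot_log and distance_to_city_log.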

# Predict values on test dataset

Assessing a model means measuring the predictive power of the trained model on an independent dataset that the model was not trained on.<br>
This is the best indicator of how the model will actually perform when applied to new data points.

In [None]:
# predict y values based on model coefficients

pred_train = lrmodel1.predict(X_train)
pred_test = lrmodel1.predict(X_test)

plt.scatter(y_test,pred_test)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], "r")
plt.xlabel('y actual')
plt.ylabel('y predicted')

In [None]:
# return predictions obtained for each element when it was in the test set

from sklearn.model_selection import cross_val_predict

y_cross = cross_val_predict(lrmodel1, X, y, cv = 5)

plt.scatter(y, y_cross)
plt.plot([min(y), max(y)], [min(y), max(y)], "r") # diagonal over the full range of y, since all observations are plotted
plt.xlabel('y actual')
plt.ylabel('y predicted (cross)')

# Check regression assumptions

<b>While aspects of the regression model rely on these assumptions to be truly robust/accurate, a regression model can often be predictive/useful even when certain assumptions are not met!</b><br>
This model is used for prediction, and not for determining the significance of particular variables. For this reason, I regard the results of the assumption checks not as 'the model is wrong and useless', but as nothing more than suggestions for model improvement.

<b>Residuals/errors (actual value – model’s predicted value for each observation) are approximately normally distributed.</b> <br>
Problem:
    - If the error terms are not normally distributed, confidence intervals may become too wide or too narrow, and the p-values of the coefficients become unreliable, especially in small samples. The least squares coefficient estimates themselves do not require normal errors.
Check:
    - After the model is created, plot the distribution of the residuals to see whether it looks normal.
    - QQ (quantile-quantile) / probability plot. If the data come from a normal distribution, the plot shows a fairly straight line.
    - You can also perform statistical tests of normality, such as the Kolmogorov-Smirnov test or the Shapiro-Wilk test.
Fix:
    - If the errors are not normally distributed, a nonlinear transformation of the variables (response or predictors) can improve the model.

In [None]:
# create probability plot

residuals = y_train - pred_train.reshape(-1)

plt.figure(figsize=(7,7))
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Normal Q-Q Plot")
print("If the residuals (blue dots) fall on the red line, the residuals are approximately normally distributed")
plt.show()
print()


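# Kolmogorov-Smirnov test
# if the test is significant, this indicates that the model’s residuals are not normally distributed
# 'norm' is the standard normal N(0, 1), so standardize the residuals before testing

residuals_std = (residuals - residuals.mean()) / residuals.std()
kstest = stats.kstest(residuals_std, 'norm')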

print("Kolmogorov-Smirnov:")
print(kstest)
if kstest[1] < 0.05:
    print("Evidence that the residuals are not normally distributed")
    print('Assumption not satisfied')
else:
    print("No evidence that the residuals are not normally distributed")
    print('Assumption satisfied')
print()


# Shapiro Wilk test
# if the test is significant, this indicates that the model’s residuals are not normally distributed
# note: scipy warns that the p-value may not be accurate for more than 5000 observations

shapiro = stats.shapiro(residuals)

print("Shapiro Wilk:")
print(shapiro)
if shapiro[1] < 0.05:
    print("Evidence that the residuals are not normally distributed")
    print('Assumption not satisfied')
else:
    print("No evidence that the residuals are not normally distributed")
    print('Assumption satisfied')

The residuals are NOT normally distributed. <br>
The model predicts too low for low house prices, and too high for high house prices. <br>
Since we already did variable transformation to capture non linear relationships, and in the mid section the results look good, no further action is taken. <br>
In a future version I might take a look at what is happening in the tails.

<b>No autocorrelation: residuals are independent across observations</b><br>
Problem:
	- Correlation in the error terms makes the model look more precise than it is. 
	If the error terms are correlated, the estimated standard errors tend to underestimate the true standard errors. 
	As a result, confidence intervals and prediction intervals become too narrow: a nominal 95% confidence interval contains the actual value of the coefficient with a probability lower than 0.95. 
	Lower standard errors also make the associated p-values lower than they should be, which can lead us to incorrectly conclude that a parameter is statistically significant. 
Check:
	- Look at the Durbin-Watson (DW) statistic. It always lies between 0 and 4. DW = 2 implies no autocorrelation, 0 < DW < 2 implies positive autocorrelation, and 2 < DW < 4 indicates negative autocorrelation. 
	- You can also plot the residuals against time and look for a seasonal or correlated pattern in the residual values.
Fix:
    - A Durbin-Watson result far from 2 suggests that either you fitted a linear function to data with a nonlinear relationship, or you left an important variable out of the model. Try to transform your data so that the relationship becomes linear, and consider whether an important variable is missing.
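To see where the 0 to 4 range comes from, here is a small sketch that computes the Durbin-Watson statistic by hand: the sum of squared differences between consecutive residuals, divided by the sum of squared residuals. The function name `durbin_watson_stat` and the simulated series are made up for this illustration. Note that the statistic depends on the order of the rows.

```python
import numpy as np

def durbin_watson_stat(resid):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# simulated residuals (illustration only, not the model residuals)
rng = np.random.default_rng(0)
independent = rng.normal(size=2000)

# positively autocorrelated series: each value carries over 80% of the previous one
correlated = np.zeros(2000)
for t in range(1, 2000):
    correlated[t] = 0.8 * correlated[t - 1] + independent[t]

dw_independent = durbin_watson_stat(independent)
dw_correlated = durbin_watson_stat(correlated)

print('independent residuals:', round(dw_independent, 2))
print('correlated residuals :', round(dw_correlated, 2))
```

The independent series should land close to 2, and the positively autocorrelated one well below 2.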

In [None]:
import statsmodels.stats.api as sms

dw = sms.durbin_watson(residuals)

print('Durbin-Watson: {:.3f}'.format(dw))
if dw < 1.5:
    print('Signs of positive autocorrelation')
    print('Assumption not satisfied')
elif dw > 2.5:
    print('Signs of negative autocorrelation')
    print('Assumption not satisfied')
else:
    print('Little to no autocorrelation')
    print('Assumption satisfied')

There appears to be no autocorrelation. This is good, so there is no need for action.

<b>Homoscedasticity: error terms must have constant variance</b><br>
The variance of residuals is not correlated with model’s predicted values.<br>
Problem:
	- The standard errors (which are used to conduct significance tests, and calculate the confidence intervals) will be biased
Check:
	- Plot residuals versus predicted values and check whether residuals are generally evenly distributed regardless of whether the predicted value is high or low.
	There should be no pattern. 
	If there is any pattern (for example, a parabolic shape) in this plot, consider it a sign of non-linearity in the data. It means that the model doesn’t capture non-linear effects.
	If a funnel shape is evident in the plot, consider it a sign of non-constant variance, i.e. heteroskedasticity.
	- This can be tested using a few different statistical tests, including the Brown-Forsythe test, Levene’s test, the Breusch-Pagan test, and the Cook-Weisberg test.
Fix:
	- To overcome the issue of non-linearity, you can apply a non-linear transformation to the predictors, such as log(X), √X or X², or transform the dependent variable. 
	- To overcome heteroskedasticity, a possible way is to transform the response variable such as log(Y) or √Y. 
	- Also, you can use weighted least square method to tackle heteroskedasticity.
	- Fix outliers

In [None]:
# residual plot

sns.residplot(x = pred_train, y = y_train, lowess = True, # recent seaborn versions require x and y as keywords
                                  line_kws = {'color': 'red', 'lw': 1, 'alpha': 1})
plt.xlabel("Predicted Y")
plt.ylabel("Residual")
plt.title('Residual plot')
plt.show()

If the red line on the graph deviates much from the horizontal zero line, the residuals show a pattern, which is a sign of non-linearity or non-constant variance of the error terms. <br>
It is not an exact match, but the deviation is not extreme, so no further action is taken.

<b>No multicollinearity: input variables in the final model are not highly correlated with one another</b><br>
Problem:
	- It makes it hard to determine which input variable is really driving the target variable
	- With presence of correlated predictors, the standard errors tend to increase. And, with large standard errors, the confidence interval becomes wider leading to less precise estimates of slope parameters.
	- Also, when predictors are correlated, the estimated regression coefficient of a correlated variable depends on which other predictors are in the model. Dropping or adding one correlated variable changes the coefficients of the others, so you can end up incorrectly concluding that a variable strongly or weakly affects the target variable. That’s not good!
Check:
	- Plot correlation coefficient matrix for all of the input variables and look for cases where variables are highly correlated. Should be < 0.8
	- Check variance inflation factor (VIF) of each input variable and ensure that it is less than 5 or 10. VIF value <= 4 suggests no multicollinearity whereas a value of >= 10 implies serious multicollinearity. 
Fix:
    - If there is high multicollinearity between 2 variables, you could recode the variables to have more unique information or exclude the input variable that is less correlated with the target variable from the model
    - Sometimes it is unavoidable, e.g. with dummy variables
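To make the VIF less of a black box: for each input variable, regress it on all the other input variables and compute 1 / (1 - R²). The sketch below does this by hand with NumPy on made-up data. The helper `vif_manual` is hypothetical and is not part of statsmodels.

```python
import numpy as np

def vif_manual(X):
    """VIF per column: 1 / (1 - R^2) from regressing that column on the others (with intercept)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1 - resid.var() / target.var()
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

# made-up predictors (illustration only)
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # nearly a copy of a
c = rng.normal(size=500)                 # unrelated to a and b

vifs = vif_manual(np.column_stack([a, b, c]))
print(vifs.round(1))
```

The two nearly identical columns get a very high VIF, while the unrelated column stays close to 1.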

In [None]:
# show all correlations above 0.8

cor = X_train.corr().round(2).abs() # create correlation df with rounded absolute values
cor = cor.unstack() # unstack to one row per correlation
cor = cor.reset_index(drop=False) # create new index instead of variable names as index
cor = cor[cor['level_0'] != cor['level_1']] # remove correlation for each variable with itself
cor = cor[cor[0] >= 0.80] # show only correlations above this level
cor = cor.iloc[0:-1:2] # there are double rows for each correlation. take only one row. this is not fool proof !
cor.sort_values([0], ascending = False, inplace = True) # sort from high correlation to low
print(cor)

In [None]:
# calculate vif
# situations in which a high VIF is not a problem and can be safely ignored:
# 1. The variables with high VIFs are control variables, and the variables of interest do not have high VIFs.
# 2. The high VIFs are caused by the inclusion of powers or products of other variables. 
# 3. The variables with high VIFs are indicator (dummy) variables that represent a categorical variable with three+ categories.

vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif["features"] = X_train.columns
vif["VIF Factor"] = vif["VIF Factor"].round(1)
vif.sort_values(['VIF Factor'], ascending = False, inplace = True)
vif["VIF Factor"] = vif["VIF Factor"].astype(str)
vif

Some variables show multicollinearity. <br>
With dummy and binned variables this is unavoidable. <br>
By trial and error, variables that relate too strongly to another variable were removed from the model. <br>
In the end, only removing the variable sqft_above resulted in an improvement of the model. 

# Assess model performance

Because the goal of the regression is to use the model for predictions, we need to know how well the model predicts. To determine the model performance, RMSE, R2, and R2 adjusted are used. <br>
These metrics are calculated on the training data and on the test data; for cross validation, R2 is reported.<br>
<br>
<b>RMSE</b><br>
The RMSE is the square root of the variance of the residuals. It indicates the absolute fit of the model to the data–how close the observed data points are to the model’s predicted values. As the square root of a variance, RMSE can be interpreted as the standard deviation of the unexplained variance, and has the useful property of being in the same units as the response variable. Lower values of RMSE indicate better fit.<br>
<br>
<b>R Squared (R²) and Adjusted R Squared</b><br>
R² and adjusted R² describe what percentage of the variability in your dependent variable is explained by your selected independent variable(s). <br>
Adjusted R² takes into account the marginal improvement added by an additional term in your model: it increases if you add useful terms and decreases if you add less useful predictors. Plain R², in contrast, never decreases when terms are added, even if the model is not actually improving.
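A small experiment makes the difference between R² and adjusted R² visible. The sketch below uses simulated data and a hypothetical helper `r2_and_adjusted`: it fits one useful predictor, then adds 20 columns of pure noise and compares both metrics.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Fit OLS with an intercept and return (R^2, adjusted R^2)."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, r2_adj

# simulated data (illustration only): one useful predictor plus 20 noise columns
rng = np.random.default_rng(0)
n = 200
x_useful = rng.normal(size=(n, 1))
y = 1.5 * x_useful[:, 0] + rng.normal(size=n)
noise = rng.normal(size=(n, 20))

r2_base, adj_base = r2_and_adjusted(x_useful, y)
r2_more, adj_more = r2_and_adjusted(np.column_stack([x_useful, noise]), y)

print('1 predictor   - R2: {:.3f}, R2 adjusted: {:.3f}'.format(r2_base, adj_base))
print('21 predictors - R2: {:.3f}, R2 adjusted: {:.3f}'.format(r2_more, adj_more))
```

R² goes up after adding the noise columns, while the gap between R² and adjusted R² widens because of the penalty for the extra terms.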

In [None]:
# calculate RMSE, R2, and R2 adjusted, for train and test model

from sklearn.metrics import mean_squared_error, r2_score 
from sklearn.model_selection import cross_val_score

print("Train model scores")

rmse = np.sqrt(mean_squared_error(y_train, pred_train)) # square root of the MSE, in units of log(price)
print('Root mean square error:', rmse) 

r2 = r2_score(y_train, pred_train)
print('R2: {:.2%}'.format(r2))

n = X_train.shape[0]
p = X_train.shape[1]
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print('R2 Adjusted: {:.2%}'.format(r2_adj))
print()


print("Test model scores")

rmse = np.sqrt(mean_squared_error(y_test,pred_test)) # square root of the MSE, in units of log(price)
print('Root mean square error:', rmse) 

r2 = r2_score(y_test, pred_test)
print('R2: {:.2%}'.format(r2))

n = X_test.shape[0]
p = X_test.shape[1]
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print('R2 Adjusted: {:.2%}'.format(r2_adj))
print()


print("Cross validation scores")

scores = cross_val_score(lrmodel1, X, y, cv = 5)
print("All scores:",(scores * 100).round(2))
print('Average cross validation score: {:.2%}'.format(np.mean(scores)))

# Feature Selection

Adding another input variable to the regression will almost always increase the model’s fit (the r-squared value) on the training dataset by at least a nominal amount.<br>
However, adding a variable can actually result in lower predictive power on a holdout dataset, because the model starts to fit noise in the training data (this is called overfitting; see the bias-variance tradeoff).<br>
Hence, it is important to be sure that the additional variable is statistically significant in the training dataset, and to check that the adjusted r-squared also increases.<br>
There are several methods for feature selection; below, RFE is used. Recursive Feature Elimination (RFE) fits a model, removes the least important feature(s), and repeats this on the remaining features until the requested number of features is left.

In [None]:
# feature selection with RFE

from sklearn.feature_selection import RFE

# choose optimal # of features

#no of features
columns_count = len(X_train.columns)
nof_list = np.arange(1, columns_count + 1)
high_score = 0

#Variable to store the optimum features
nof = 0
score_list = []

for n in range(len(nof_list)):
    #X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 0)
    model = LinearRegression()
    rfe = RFE(model, n_features_to_select = nof_list[n]) # must be passed as a keyword in recent scikit-learn
    X_train_rfe = rfe.fit_transform(X_train, y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe, y_train)
    score = model.score(X_test_rfe, y_test)
    score_list.append(score)
    if(score > high_score):
        high_score = score
        nof = nof_list[n]
print("Optimum number of features: %d" %nof)
print("Score with %d features: %f" % (nof, high_score))
print()


# run RFE 

cols = list(X.columns)
model = LinearRegression()

#Initializing RFE model
rfe = RFE(model, n_features_to_select = nof) # must be passed as a keyword in recent scikit-learn

#Transforming data using RFE
X_rfe = rfe.fit_transform(X,y)

#Fitting the data to model
model.fit(X_rfe,y)
temp = pd.Series(rfe.support_,index = cols)
selected_features_rfe = temp[temp==True].index
print("Selected features:", list(selected_features_rfe))

The results are similar to the results with all features. Since the p-values of all features are significant, this is no surprise.