# House Sales in King County, USA

The purpose of this notebook is to gain a deeper understanding of linear regression by applying it to the "House Sales in King County" dataset. I will aim to achieve the highest possible prediction score by checking the underlying assumptions of a linear regression model and taking appropriate actions where needed.

In [None]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [None]:
df = pd.read_csv('../input/housesalesprediction/kc_house_data.csv')

# 1.) Data Preparation

## 1.1.) Missing Values

In [None]:
df.isna().sum()

We don't have any missing values in our dataset.

## 1.2.) Categorical Variables

In [None]:
df.dtypes

In [None]:
df.drop(columns=['id', 'date'], inplace=True)

All our variables are numerical except for 'date'. Although they are numerical, variables like view, condition, etc. can be considered categorical because they take only a few discrete values rather than being continuous. For those variables we could use an encoding method (e.g. dummy variables). However, I've decided not to treat them as categorical. Since I also decided not to include 'date' in the regression, I drop it along with 'id'.
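For reference, this is roughly what the dummy-variable encoding mentioned above could look like. It is only a sketch on a tiny made-up frame (the values and the `sample` name are my own, not taken from the dataset), and it is not used in the rest of the notebook.

```python
import pandas as pd

# Tiny made-up sample; the real dataset has many more rows and columns.
sample = pd.DataFrame({
    'sqft_living': [1180, 2570, 770, 1960],
    'condition': [3, 3, 4, 5],
    'view': [0, 0, 2, 4],
})

# One-hot encode the discrete columns. drop_first=True drops one level per
# variable so the dummies are not perfectly collinear with the intercept.
encoded = pd.get_dummies(sample, columns=['condition', 'view'], drop_first=True)
print(encoded.columns.tolist())
```

The trade-off is more columns (one per level minus one) in exchange for not forcing a linear effect across the levels.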

# 2.) Regression

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn import preprocessing

In [None]:
X = df.drop(columns=['price'])
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

In [None]:
model_lr = LinearRegression()
model_lr.fit(X_train, y_train)
y_pred = model_lr.predict(X_test)
print("Training set score: {:.7f}".format(model_lr.score(X_train, y_train)))
print("Test set score: {:.7f}".format(model_lr.score(X_test, y_test)))
print("RMSE: {:.7f}".format(np.sqrt(metrics.mean_squared_error(y_test, y_pred))))

With just a simple multiple regression we can already achieve an r2 score of ~0.7. This is not optimal but not too bad for now.
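To make the two metrics above more concrete, here is a small sketch that computes r2 and RMSE by hand with NumPy. The four prices and predictions are made up for illustration; the formulas are the standard definitions that `model.score` and `mean_squared_error` are based on.

```python
import numpy as np

# Made-up actual prices and predictions, for illustration only.
toy_actual = np.array([300000.0, 450000.0, 520000.0, 800000.0])
toy_pred = np.array([320000.0, 430000.0, 560000.0, 700000.0])

# r2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((toy_actual - toy_pred) ** 2)
ss_tot = np.sum((toy_actual - toy_actual.mean()) ** 2)
toy_r2 = 1 - ss_res / ss_tot

# RMSE = square root of the mean squared residual, in the same unit as the price
toy_rmse = np.sqrt(np.mean((toy_actual - toy_pred) ** 2))

print("r2: {:.4f}".format(toy_r2))
print("RMSE: {:.2f}".format(toy_rmse))
```

r2 is unit-free, while RMSE is expressed in price units, which is why it is useful to report both.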

# 3.) Check Model Adequacy

To improve our score we should take a look at the assumptions our model is making. We are using linear regression so we should check our data regarding:

- Linearity
- Outliers
- Homoscedasticity
- Normality
- Multicollinearity


## 3.1.) Linearity

### Actual vs. Predicted Plot

In [None]:
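def calculate_residuals(model, features, label):
    # Use the model that was passed in (not the global model_lr) to predict
    predictions = model.predict(features)
    df_result = pd.DataFrame({'Actual': label, 'Predicted': predictions})
    # Residual = actual - predicted; the sign shows under- vs. over-prediction
    df_result['Residuals'] = df_result['Actual'] - df_result['Predicted']
    return df_result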

In [None]:
def linear_assumption(model, features, label):
    df_result = calculate_residuals(model, features, label)
    fig1, ax1 = plt.subplots(figsize=(12,8))
    sns.regplot(x='Actual', y='Predicted', data=df_result, color='steelblue', ax=ax1)
    # Diagonal reference line (perfect predictions), spanning actual and predicted values only
    low = df_result[['Actual', 'Predicted']].min().min()
    high = df_result[['Actual', 'Predicted']].max().max()
    ax1.plot([low, high], [low, high], color='indianred')
    plt.show()

In [None]:
linear_assumption(model_lr, X_test, y_test)

### Residual Plot

In [None]:
df_result = calculate_residuals(model_lr, X_test, y_test)
fig2, ax2 = plt.subplots(figsize=(12,8))
ax2.scatter(x=df_result['Predicted'], y=df_result['Residuals'], color='steelblue')
plt.axhline(y=0, color='indianred')
ax2.set_ylabel('Residuals', fontsize=12)
ax2.set_xlabel('Predicted', fontsize=12)
plt.show()

We can check the linearity of our model by looking at the actual vs. predicted plot or the residuals vs. predicted plot. In the former, the data points should be symmetrically distributed around the diagonal line if the relationship is linear. This is not the case here: our predictions are biased, especially for higher values. Similarly, in the latter, the data points should be symmetrically distributed around the horizontal line. Instead, the residual variance increases with higher values. Both observations indicate violations of the underlying assumptions, which are dealt with in later sections.
(I found the functions on this blog: https://jeffmacaluso.github.io/post/LinearRegressionAssumptions/)
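The widening spread in a residual plot can also be checked numerically by comparing the standard deviation of the residuals across bins of the predicted value. The sketch below uses synthetic data whose noise grows with the prediction by construction; the `toy_*` names and the choice of four quantile bins are my own assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
toy_predicted = rng.uniform(100000, 1000000, 2000)
# Noise whose spread grows with the predicted value (heteroscedastic by construction).
toy_residuals = rng.normal(0, 0.2 * toy_predicted)

toy_check = pd.DataFrame({'Predicted': toy_predicted, 'Residuals': toy_residuals})
# Split the predictions into four equally sized bins, from lowest to highest.
toy_check['bin'] = pd.qcut(toy_check['Predicted'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
spread_by_bin = toy_check.groupby('bin', observed=True)['Residuals'].std()
print(spread_by_bin)
```

If the spread were roughly the same in every bin, the variance would look constant; a clear increase from Q1 to Q4 matches the fan shape seen in a residual plot.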

## Outliers

Before we get back to the linearity assumption, we will focus on a different problem that is apparent in the plots: outliers. Our model missed certain values by a significant amount, so it is worth having a look at the price distribution.

In [None]:
plt.style.use('ggplot')
fig3, ax3 = plt.subplots(figsize=(15,4))
ax3 = sns.boxplot(x=df['price'], color='steelblue')

In [None]:
df1 = df[~(df['price']>4000000)]
df1

We can see from the boxplot that there are just a few data points where the price exceeds 4,000,000. After examining the data points I reached the conclusion that these "outliers" are not due to a mistake (false entry, etc.) because the high prices seem plausible to some degree given the underlying attributes (sqft_living, grade, etc.). However, I will omit all the data where the price exceeds 4,000,000 as there are only 11 entries and the regression is affected (as shown in the residual and the actual vs. predicted plots). This implies that our model only applies to a certain price range (price < 4,000,000).

## Regression without outliers

In [None]:
X1 = df1.drop(columns=['price'])
y1 = df1['price']
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.3, random_state=10)

In [None]:
model_lr1 = LinearRegression()
model_lr1.fit(X1_train, y1_train)
y1_pred = model_lr1.predict(X1_test)
print("Training set score: {:.7f}".format(model_lr1.score(X1_train, y1_train)))
print("Test set score: {:.7f}".format(model_lr1.score(X1_test, y1_test)))
print("RMSE: {:.7f}".format(np.sqrt(metrics.mean_squared_error(y1_test, y1_pred))))

In [None]:
linear_assumption(model_lr1, X1_test, y1_test)

In [None]:
df_result = calculate_residuals(model_lr1, X1_test, y1_test)
fig4, ax4 = plt.subplots(figsize=(12,8))
ax4.scatter(x=df_result['Predicted'], y=df_result['Residuals'], color='steelblue')
ax4.set_ylabel('Residuals', fontsize=12)
ax4.set_xlabel('Predicted', fontsize=12)
plt.axhline(y=0, color='indianred')
plt.show()

Omitting the "outliers" in the regression model led to a slight improvement of the score. We also got rid of the outliers from our actual vs. predicted and residual plots. However, the problem of non-linearity still remains. Let's dive deeper into that.

In [None]:
plt.style.use('ggplot')
sns.pairplot(df[['price', 'bedrooms', 'bathrooms', 'sqft_living', 'view', 'grade']],  
             y_vars=['price'], x_vars=['bedrooms', 'bathrooms', 'sqft_living', 'view', 'grade'], 
             height=5, plot_kws={'color':'steelblue'}) 
plt.show()

In [None]:
plt.style.use('ggplot')
sns.pairplot(df[['price','sqft_above', 'sqft_basement', 'lat', 'sqft_living15']],  
             y_vars=['price'], x_vars=['sqft_above', 'sqft_basement', 'lat', 'sqft_living15'], height=5,
             plot_kws={'color':'steelblue'}) 
plt.show()

Based on the correlation plot I chose to plot the relationship between price and the variables shown above. None of the independent variables shows perfect linearity. The variables sqft_living, sqft_above and sqft_living15 show a certain degree of linearity. However, this is not optimal for our regression model. To deal with the non-linearity we will try a polynomial regression.
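One way to pick candidate variables like this is to rank the features by their correlation with price. The sketch below shows the idea on a synthetic frame; the `toy` data and its coefficients are made up, so the resulting numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
toy_sqft = rng.uniform(500, 4000, n)
toy = pd.DataFrame({
    'sqft_living': toy_sqft,
    'bedrooms': np.round(toy_sqft / 800 + rng.normal(0, 1, n)),
    'yr_built': rng.integers(1900, 2015, n),
    'price': 200 * toy_sqft + rng.normal(0, 100000, n),
})

# Pearson correlation of every feature with price, strongest first.
corr_with_price = toy.corr()['price'].drop('price').sort_values(ascending=False)
print(corr_with_price)
```

Keep in mind that Pearson correlation only measures linear association, so a low value does not rule out a non-linear relationship.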

## Polynomial Regression

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
poly = PolynomialFeatures(2)
# Fit the transformer on the training data only, then reuse it for the test data
X_train_poly = poly.fit_transform(X1_train)
X_test_poly = poly.transform(X1_test)

In [None]:
model_lr_poly = LinearRegression()
model_lr_poly.fit(X_train_poly, y1_train)
y_pred_poly = model_lr_poly.predict(X_test_poly)
print("Training set score: {:.7f}".format(model_lr_poly.score(X_train_poly, y1_train)))
print("Test set score: {:.7f}".format(model_lr_poly.score(X_test_poly, y1_test)))
print("RMSE: {:.7f}".format(np.sqrt(metrics.mean_squared_error(y1_test, y_pred_poly))))

In [None]:
fig10, ax10 = plt.subplots(figsize=(12,8))
ax10 = sns.regplot(x=y1_test, y=y_pred_poly, color='steelblue')
# Diagonal reference line based on this model's own actual and predicted values
low = min(y1_test.min(), y_pred_poly.min())
high = max(y1_test.max(), y_pred_poly.max())
plt.plot([low, high], [low, high], color='indianred')
ax10.set_ylabel('Predicted', fontsize=12)
ax10.set_xlabel('Actual', fontsize=12)
plt.show()

In [None]:
fig11, ax11 = plt.subplots(figsize=(12,8))
# Residuals of the polynomial model: actual minus predicted
ax11.scatter(x=y_pred_poly, y=y1_test-y_pred_poly, color='steelblue')
ax11.set_ylabel('Residuals', fontsize=12)
ax11.set_xlabel('Predicted', fontsize=12)
plt.axhline(y=0, color='indianred')
plt.show()

A polynomial regression (degree: 2), often considered a special case of linear regression because it is still linear in its coefficients, significantly improves our r2 score! The distribution of the data points in our actual vs. predicted and residual plots looks better too. However, we can still observe what seem to be outliers and inaccuracies. As mentioned earlier, the residual plot in particular still shows increasing variance at higher values. This is an indicator of heteroscedasticity.
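As a side note, the feature expansion and the regression can be bundled into one scikit-learn pipeline, so the transformer is always fitted on the training data only. This is a minimal sketch on synthetic one-feature data with a quadratic relationship; all `toy_*` names and coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
toy_x = rng.uniform(0, 10, size=(300, 1))
# Quadratic ground truth plus noise (made-up coefficients).
toy_y = 3.0 + 2.0 * toy_x[:, 0] + 1.5 * toy_x[:, 0] ** 2 + rng.normal(0, 2.0, 300)

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.3, random_state=10)

# Plain linear fit vs. degree-2 polynomial fit wrapped in a pipeline.
toy_linear = LinearRegression().fit(toy_x_train, toy_y_train)
toy_poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
toy_poly.fit(toy_x_train, toy_y_train)

r2_linear = toy_linear.score(toy_x_test, toy_y_test)
r2_poly = toy_poly.score(toy_x_test, toy_y_test)
print("Linear r2: {:.4f}".format(r2_linear))
print("Polynomial r2: {:.4f}".format(r2_poly))
```

With a pipeline, `fit` and `score` handle the transformation internally, so there is no separate `*_poly` matrix to keep in sync with the model.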

## 3.2.) Homoscedasticity

Homoscedasticity describes constant variance in the residuals as can be observed in a residual plot. Ideally, the residuals should be distributed evenly around the horizontal line without any increasing or decreasing trend. This is not the case in our residual plot so we will try to log-transform the dependent variable in order to tackle heteroscedasticity.

In [None]:
X = df1.drop(columns=['price'])
price_trans = np.log1p(df1['price'])
X2_train, X2_test, y2_train, y2_test = train_test_split(X, price_trans, test_size=0.3, random_state=10)

In [None]:
poly = PolynomialFeatures(2)
# Fit the transformer on the training data only, then reuse it for the test data
X2_train_poly = poly.fit_transform(X2_train)
X2_test_poly = poly.transform(X2_test)

In [None]:
model_lr_poly2 = LinearRegression()
model_lr_poly2.fit(X2_train_poly, y2_train)
y2_pred_poly = model_lr_poly2.predict(X2_test_poly)
print("Training set score: {:.7f}".format(model_lr_poly2.score(X2_train_poly, y2_train)))
print("Test set score: {:.7f}".format(model_lr_poly2.score(X2_test_poly, y2_test)))

In [None]:
fig13, ax13 = plt.subplots(figsize=(12,8))
ax13.scatter(x=y2_pred_poly, y=y2_test-y2_pred_poly, color='steelblue')
ax13.set_ylabel('Residuals', fontsize=12)
ax13.set_xlabel('Predicted', fontsize=12)
plt.axhline(y=0, color='indianred')
plt.show()

Transforming our dependent variable increased our r2 score slightly. Note that this score is computed on log prices, so it is not directly comparable to the earlier scores on the raw price scale. More importantly, the residuals seem to be more evenly distributed around the horizontal line than before. However, we can still observe many outliers.
(Any suggestions on how I should proceed with the outlier problem are welcome.)
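To report an error in the original price units for a model trained on log prices, the predictions can be transformed back with `np.expm1`, the inverse of `np.log1p`. The sketch below uses three made-up prices and made-up log-scale errors instead of the notebook's actual predictions.

```python
import numpy as np

# Made-up log-scale targets and predictions, for illustration only.
toy_log_actual = np.log1p(np.array([300000.0, 450000.0, 800000.0]))
toy_log_pred = toy_log_actual + np.array([0.05, -0.10, 0.02])

# Undo the log1p transform before computing the error.
toy_actual_price = np.expm1(toy_log_actual)
toy_pred_price = np.expm1(toy_log_pred)

toy_rmse_price = np.sqrt(np.mean((toy_actual_price - toy_pred_price) ** 2))
print("RMSE on the price scale: {:.2f}".format(toy_rmse_price))
```

An RMSE computed this way can be compared with the RMSE values printed for the earlier models.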

## 3.3.) Normality

The normality assumption can be checked by plotting a histogram or a QQ-plot of the residuals.

In [None]:
fig15, ax15 = plt.subplots(figsize=(12,8))
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(y2_test-y2_pred_poly, kde=True, stat='density', color='steelblue', ax=ax15)
plt.show()

In [None]:
from scipy import stats
fig16, ax16 = plt.subplots(figsize=(8,5))
stats.probplot(y2_test-y2_pred_poly, plot=plt)
plt.show()

Both plots suggest that the normality assumption is approximately met. We achieved this by log-transforming our dependent variable in the previous step.
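A numeric complement to the two plots is the sample skewness, which is close to 0 for a symmetric distribution. The sketch below is not run on the notebook's residuals; it uses a synthetic right-skewed series (`toy_values`, log-normal by construction) to show how a log transform can pull the skewness towards 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
# Synthetic right-skewed values (log-normal by construction), not real data.
toy_values = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=2000))

skew_raw = toy_values.skew()
skew_log = np.log1p(toy_values).skew()
print("Skewness before log transform: {:.3f}".format(skew_raw))
print("Skewness after log transform: {:.3f}".format(skew_log))
```

The same `.skew()` call could be applied to a series of residuals as a quick sanity check next to the histogram and the QQ-plot.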

## 3.4.) Multicollinearity

### VIF

One method to identify which variables are affected by multicollinearity is the Variance Inflation Factor (VIF). A value of >10 is a common rule of thumb for indicating multicollinearity. Let's check the VIFs for our X_test from the very first regression.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
# from statsmodels.tools.tools import add_constant

vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(X_test.values, i) for i in range(X_test.shape[1])]
vif["features"] = X_test.columns

In [None]:
vif['VIF'] = vif['VIF'].apply(lambda x: "{:.2f}".format(x))
vif

Apparently, most of our variables are affected by multicollinearity. This is a problem especially when interpreting the coefficients and the individual effects the independent variables have on the dependent variable. However, our goal here is primarily prediction accuracy, so we don't have to worry about collinearity too much.
(This blog, among others, provides an overview of when multicollinearity needs to be tackled and when not: https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/).
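One caveat: as far as I understand, statsmodels' `variance_inflation_factor` does not add an intercept on its own, so VIFs computed on the raw feature matrix can come out larger than the textbook definition suggests. The sketch below computes VIF_j = 1 / (1 - R_j^2) by hand, with an intercept in each auxiliary regression. The helper `vif_with_intercept` and the synthetic `toy_features` frame are my own and hypothetical, not part of the analysis above.

```python
import numpy as np
import pandas as pd

def vif_with_intercept(frame):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    values = {}
    for col in frame.columns:
        target = frame[col].to_numpy(dtype=float)
        others = frame.drop(columns=[col]).to_numpy(dtype=float)
        design = np.column_stack([np.ones(len(frame)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - resid.var() / target.var()
        values[col] = 1 / (1 - r2)
    return pd.Series(values, name='VIF')

rng = np.random.default_rng(2)
n = 500
toy_above = rng.uniform(500, 3000, n)
toy_features = pd.DataFrame({
    'sqft_above': toy_above,
    # Nearly a copy of sqft_above, so these two columns are strongly collinear.
    'sqft_living': toy_above + rng.normal(0, 50, n),
    # Independent of the other two columns.
    'yr_built': rng.integers(1900, 2015, n).astype(float),
})

toy_vif = vif_with_intercept(toy_features)
print(toy_vif)
```

In this toy example only the two nearly identical columns should get a high VIF, while the independent column stays close to 1.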

# 4.) Conclusion

After checking the underlying assumptions of a linear regression model and taking the appropriate actions, we achieved an r2 score of ~0.82. There is definitely room for improvement, and we should also consider different regression models. Any suggestions on how I could improve or change the model would be highly appreciated.