# King County House Sales Regression Analysis
## Data Modeling

* Student name: Spencer Hadel
* Student pace: Flex
* Scheduled project review date/time: 5/5/2022, 11:00am EST
* Instructor name: Claude Fried

#### Objective

To help a new real estate company in King County, we analyze past house sales in the region and build a linear regression model that shows which factors contribute to the price of a given home. We import over 20 thousand data points from recent sales in the King County area, then clean, preprocess, and model the data so that the company can appropriately assess the value of a home when helping a client buy or sell.

### Import Modules

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
import scipy.stats as stats
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate, ShuffleSplit

from sklearn.metrics import mean_squared_error

### Import Prepared Data from kc_preprocessing_exploring.ipynb

We have already preprocessed our data in the kc_preprocessing_exploring notebook:

[Preprocessing Notebook](./kc_preprocessing_exploring.ipynb)

In [None]:
pre_df = pd.read_csv('./data/preprocessed.csv', index_col = 0)

untransformed_df = pd.read_csv('./data/untransformed.csv', index_col = 0)

In [None]:
df = untransformed_df

In [None]:
df.info()

In [None]:
subs = [(' ', '_'),('.','_'),("'",""),('™', ''), ('®',''),
        ('+','plus'), ('½','half'), ('-','_')
       ]
def col_formatting(col):
    for old, new in subs:
        col = col.replace(old,new)
    return col

df.columns = [col_formatting(col) for col in df.columns]

list(df.columns)

## Split, Train and Test Data

Now that we have a complete preprocessed dataset, we need to split the data into train and test datasets, as well as identify the target we are predicting: price.

In [None]:
X = df.drop('price', axis=1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#check size of each
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
X_1 = X_train
y = y_train

model_1 = sm.OLS(y, sm.add_constant(X_1)).fit()
model_1.summary()

Purely for exploration, we next fit a scikit-learn model on the training data and score it on the test data.

In [None]:
orig_model = LinearRegression()

# Fit the model on X_train and y_train
orig_model.fit(X_train, y_train)

orig_model.score(X_test, y_test)

### Remove Uninfluential Features

The first issue with our model is the number of features. This much potential noise is likely not helping our model properly train itself on the relevant data.

In order to reduce the number of features, we will first use scikit-learn's feature_selection submodule to select only the most important features.

In [None]:
# Importances are based on coefficient magnitude, so
# we need to scale the data to normalize the coefficients
X_train_for_RFECV = StandardScaler().fit_transform(X_1)

# Instantiate and fit the selector
selector = RFECV(LinearRegression(), cv=ShuffleSplit(n_splits=3, test_size=0.25, random_state=0))
selector.fit(X_train_for_RFECV, y_train)

selected_features = []

# Relevant features: keep the columns flagged by the selector
for index, col in enumerate(X_1.columns):
    if selector.support_[index]:
        selected_features.append(col)

print(selected_features)

Now we rerun the model with only the feature_selector's most important features.

In [None]:
X_2 = X_train[selected_features]

model_2 = sm.OLS(y, sm.add_constant(X_2)).fit()
model_2.summary()

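This yielded approximately the same R-squared score, which is good because it means the features we removed had very little effect on the model's fit. The remaining features also all have p-values below the 0.05 threshold, so we do not need to remove insignificant features manually. Note that RFECV selects features on cross-validated score rather than on p-values, so this is a property of this run and not a guarantee.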
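
To see how this kind of selector behaves when the answer is known in advance, here is a minimal sketch on synthetic data. The column names (`signal_*`, `noise_*`) and the data itself are made up for illustration and are not part of the King County dataset; only the selector settings mirror the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): three informative columns, three pure-noise columns
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame(
    rng.normal(size=(n, 6)),
    columns=["signal_a", "signal_b", "signal_c", "noise_a", "noise_b", "noise_c"],
)
y_demo = (
    5 * X_demo["signal_a"]
    - 4 * X_demo["signal_b"]
    + 3 * X_demo["signal_c"]
    + rng.normal(scale=0.5, size=n)
)

# Scale first so coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X_demo)

selector_demo = RFECV(
    LinearRegression(),
    cv=ShuffleSplit(n_splits=3, test_size=0.25, random_state=0),
)
selector_demo.fit(X_scaled, y_demo)

# support_ is a boolean mask, so it can index the column labels directly
selected_demo = list(X_demo.columns[selector_demo.support_])
print(selected_demo)
print(dict(zip(X_demo.columns, selector_demo.ranking_)))
```

The informative columns should always survive; whether some noise columns are also kept depends on the cross-validation scores, which is why the selected set is worth inspecting rather than trusting blindly.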

### Investigate Multicollinearity

Because the Cond. No. is above 30 (which indicates strong multicollinearity), the next step is to check our features for multicollinearity and remove any features that are correlated strongly enough to distort the fitted coefficients.

We can start by investigating multicollinearity the same way as the preprocessing step.
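
As a rough illustration of what the condition number measures, the sketch below builds a small synthetic design matrix with NumPy and compares the value with and without a nearly duplicated column. The data is invented, and equating `np.linalg.cond` of the design matrix with the figure statsmodels reports is an assumption based on my understanding of how statsmodels computes it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical features: x1 and x2 are independent, x3 is almost a copy of x1
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + rng.normal(scale=0.01, size=n)


def condition_number(*columns):
    """Condition number of a design matrix with an intercept column prepended."""
    design = np.column_stack([np.ones(len(columns[0])), *columns])
    return np.linalg.cond(design)


cond_independent = condition_number(x1, x2)
cond_collinear = condition_number(x1, x2, x3)

print(f"independent features: {cond_independent:.1f}")
print(f"with near-duplicate:  {cond_collinear:.1f}")
```

A large condition number can also come from features measured on very different scales (for example square footage next to 0/1 dummy columns), so a high value on unscaled data does not by itself prove that features are collinear.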

In [None]:
corr = X_2.corr()
corr

In [None]:
sns.set(rc = {'figure.figsize':(15,15)})

In [None]:
sns.heatmap(corr, annot = True);

There are still a lot of features, making this hard to look at and understand at a glance. So we will use statsmodels' variance_inflation_factor to look at this information more clearly.

In [None]:
vif = [variance_inflation_factor(X_2.values, i) for i in range(X_2.shape[1])]
vif_scores = list(zip(X_2, vif))
vif_scores

In [None]:
new_features = [x for x,vif in vif_scores if vif < 5]
new_features
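
For reference, the variance inflation factor of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below computes this by hand with NumPy on synthetic data; the column meanings and values are hypothetical. It includes an intercept in each auxiliary regression, whereas statsmodels' `variance_inflation_factor` uses the matrix exactly as passed, so the two can differ when no constant column is included.

```python
import numpy as np


def vif_manual(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r_squared = 1 - resid.var() / target.var()
        vifs.append(1 / (1 - r_squared))
    return np.array(vifs)


# Hypothetical data: the second column is mostly determined by the first
rng = np.random.default_rng(0)
n = 2000
sqft_living = rng.normal(size=n)
sqft_above = 0.9 * sqft_living + rng.normal(scale=0.2, size=n)
floors = rng.normal(size=n)

vifs = vif_manual(np.column_stack([sqft_living, sqft_above, floors]))
print(dict(zip(["sqft_living", "sqft_above", "floors"], vifs.round(2))))
```

With the cutoff of 5 used above, the two related columns would be flagged and the independent one kept.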

Now that we have checked for uninfluential features as well as features potentially causing multicollinearity, we run the tests again.

In [None]:
X_3 = X_train[new_features]

model_3 = sm.OLS(y, sm.add_constant(X_3)).fit()
model_3.summary()

This has actually reduced our R-squared value, which is the opposite of what we would hope for. Nonetheless, the multicollinear features had to be removed, because they make the individual coefficient estimates unstable and hard to interpret.

## Final Model Interpretation

In [None]:
X_train_final = X_train[new_features]
X_test_final = X_test[new_features]

In [None]:
final_model = LinearRegression()

# Fit the model on X_train_final and y_train
final_model.fit(X_train_final, y_train)

# Score the model on X_test_final and y_test
# (use the built-in .score method)
final_model.score(X_test_final, y_test)

The final model's score is an R-squared value: the model explains about 59% of the variance in price on the test data. This is not the same as being 59% accurate.
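
To make the meaning of this score concrete, here is a small sketch that computes R-squared by hand and, as a complementary error measure in the units of the target, the root mean squared error. The arrays are toy values chosen for easy arithmetic, not results from this notebook.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: the share of variance in y_true explained by y_pred."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


# Toy values (hypothetical): every prediction is off by exactly 1
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([4.0, 4.0, 8.0, 8.0])

r2_demo = r_squared(y_true_demo, y_pred_demo)    # SS_res = 4, SS_tot = 20
rmse_demo = rmse(y_true_demo, y_pred_demo)
print(r2_demo, rmse_demo)
```

Applied to `y_test` and the model's predictions, the same two functions would report the explained variance alongside a typical prediction error, which is often easier to communicate than R-squared alone.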

Next we investigate if our model violates each of the assumptions of linear regression:

### Linearity

In [None]:
sns.set(rc = {'figure.figsize':(5,5)})

In [None]:
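preds = final_model.predict(X_test_final)
fig, ax = plt.subplots()

# Reference line where the predicted value equals the actual value
perfect_line = np.linspace(y_test.min(), y_test.max(), 100)
ax.plot(perfect_line, perfect_line, linestyle="--", color="orange", label="Perfect Fit")
ax.scatter(y_test, preds, alpha=0.5)
ax.set_xlabel("Actual Value")
ax.set_ylabel("Predicted Value")
ax.legend();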

This actually looks like it has a decently linear relationship, with no drastic outliers.

### Normality

In [None]:
residuals = (y_test - preds)
sm.graphics.qqplot(residuals, dist=stats.norm, line='45', fit=True);

Our model also does not violate the Normality assumption.
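
A Q-Q plot is a visual check, so it can help to pair it with a number. The sketch below is one possible complement, not something the original analysis ran: it applies `scipy.stats.normaltest` and `scipy.stats.skew` to two synthetic residual samples, one drawn from a normal distribution and one deliberately skewed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical residuals: one well-behaved sample, one right-skewed sample
normal_resid = rng.normal(size=2000)
skewed_resid = rng.exponential(size=2000) - 1.0

for name, resid in [("normal", normal_resid), ("skewed", skewed_resid)]:
    statistic, p_value = stats.normaltest(resid)
    print(f"{name}: skew={stats.skew(resid):.2f}, p-value={p_value:.3g}")

p_normal = stats.normaltest(normal_resid).pvalue
p_skewed = stats.normaltest(skewed_resid).pvalue
skew_skewed = stats.skew(skewed_resid)
```

A very small p-value suggests the residuals are not normally distributed. With thousands of rows, such tests flag even minor departures, so the Q-Q plot remains the more informative view.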

### Multicollinearity
We already addressed multicollinearity by removing highly collinear features in the modeling phase. But we check again in the interest of good practice.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = [variance_inflation_factor(X_train_final.values, i) for i in range(X_train_final.shape[1])]

pd.Series(vif, index=X_train_final.columns, name="Variance Inflation Factor")

None of these values are above 5, so our model does not violate the assumption of Multicollinearity, as expected.

### Homoscedasticity

In [None]:
fig, ax = plt.subplots()

ax.scatter(preds, residuals, alpha=0.5)
ax.axhline(0)  # zero-residual reference line
ax.set_xlabel("Predicted Value")
ax.set_ylabel("Actual - Predicted Value");

Unfortunately, the residuals do not look homoscedastic at all: their spread is not constant across the range of predicted values. This could be caused by many different factors, and is not surprising considering that our model only reaches an R-squared of about 0.59.

## Conclusions

This is not the strongest Linear Regression Model ever made. But it could certainly be used as a baseline predictor for assessing the value of homes in King County.

In [None]:
print(pd.Series(final_model.coef_, index=X_train_final.columns, name="Coefficients"))
print()
print("Intercept:", final_model.intercept_)

### Interpretation

# ########view results in actual values!###############

The above shows how our algorithm uses each feature to make determinations about the target price. 

Some of the largest positive coefficients in the model belong to sqft_living, waterfront, and view_excellent (0.66 for view_excellent, 0.63 for waterfront, and 0.50 for sqft_living). Each coefficient describes the influence that feature has on the value of a house, expressed in the units of the data after it was transformed and standardized for modeling.

This holds true to common assumptions of what would be of value in a house. 

However, features such as grade_5_Fair and bedrooms_6 have negative coefficients, while the other grade and bedroom categories have positive ones. This could be due to errors in the way our model was trained.
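
Because standardized coefficients are hard to read, it can help to convert them back to original units. For a feature and target that were only centered and divided by their standard deviations, the slope in original units is the standardized slope times sigma_y / sigma_x. The sketch below checks this on synthetic data; the numbers are invented, and it assumes plain standardization only. If a log transform was also applied during preprocessing, an additional back-transformation would be needed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: price rises by about 150 per square foot, plus noise
sqft = rng.uniform(500, 4000, size=200)
price = 150.0 * sqft + rng.normal(scale=50_000, size=200)


def slope(x, y):
    """Slope of the least-squares line of y on x."""
    return np.polyfit(x, y, deg=1)[0]


# Standardize both variables, as a preprocessing step might
sqft_std = (sqft - sqft.mean()) / sqft.std()
price_std = (price - price.mean()) / price.std()

beta_std = slope(sqft_std, price_std)             # unitless, hard to interpret
beta_raw = slope(sqft, price)                     # price units per square foot
beta_back = beta_std * price.std() / sqft.std()   # converted back to raw units

print(beta_std, beta_raw, beta_back)
```

The converted slope matches the slope fitted directly on the raw data, so either route yields a "price change per unit of the feature" figure.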

### Next Steps
The reality is that there is a very broad range of factors that can influence any individual house sale. That being said, we could also explore more features based on commonplace observations. For example, our data contains information on when (in the case of our model, whether) each house was renovated, but not what elements of the house were renovated or what was changed about them. 

Additionally, we could run another analysis of the data using features like the grade, bathrooms, bedrooms, floors, etc. as continuous variables rather than categorical ones. This could lead to fewer cases like the one in which our model subtracts more value from a house for a 5 ("Fair") rating than for a 4 ("Low") rating.
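
As a sketch of what treating grade as a numeric feature might look like, the code below pulls the leading number out of grade labels with pandas. It assumes the raw column holds strings such as "5 Fair"; the column name `grade` and the example rows are hypothetical, so the parsing would need to match the actual raw data.

```python
import pandas as pd

# Hypothetical raw labels in the form "<number> <description>"
demo = pd.DataFrame({"grade": ["4 Low", "5 Fair", "7 Average", "11 Excellent"]})

# Keep the leading number as an integer, preserving the grade's ordering
demo["grade_num"] = demo["grade"].str.split().str[0].astype(int)

print(demo)
```

A single numeric column forces the model to fit one consistent slope across grades, instead of an independent coefficient for each category.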

Furthermore, features like rating could use more exploration, and perhaps be removed from the dataset completely in future analyses.

We could also opt for a ground-up approach, analyzing models trained on feature sets based on commonplace assumptions about house values, instead of a purely data-driven approach, which is prone to different kinds of errors.

# ########################################################### 
## put in appropriate spot

### Multicollinearity
We check for multicollinearity between our predictive features by observing the pairwise correlation coefficients and visualizing them in a heatmap.

We will combine all our standardized continuous variables for this test.

In [None]:
test_df = normalized_cont.drop(['price'], axis=1)
test_df.head()

In [None]:
corr = test_df.corr()
corr

In [None]:
sns.heatmap(corr, center=0, annot=True);

We can see that there is a strong correlation between square footage and the number of bedrooms or bathrooms in a house. This makes sense, as a larger house has more room for such amenities.
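
As a complement to reading the heatmap, strongly correlated pairs can be listed directly from the correlation matrix. The sketch below uses a tiny invented table and a cutoff of 0.75; both the values and the threshold are assumptions for illustration.

```python
from itertools import combinations

import pandas as pd

# Hypothetical values: bathrooms tracks sqft_living closely, floors does not
demo_df = pd.DataFrame({
    "sqft_living": [1.0, 2.0, 3.0, 4.0, 5.0],
    "bathrooms": [2.0, 4.0, 6.0, 8.0, 10.5],
    "floors": [5.0, 1.0, 4.0, 2.0, 3.0],
})


def correlated_pairs(frame, threshold=0.75):
    """Return (col_a, col_b, corr) for each pair whose absolute correlation exceeds threshold."""
    corr = frame.corr()
    return [
        (a, b, corr.loc[a, b])
        for a, b in combinations(corr.columns, 2)
        if abs(corr.loc[a, b]) > threshold
    ]


high_pairs = correlated_pairs(demo_df)
print(high_pairs)
```

Each pair appears once, which makes it easier to decide which member of a pair to drop than scanning a large annotated heatmap.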