# King County, Washington home prices...

This kernel is the result of a class project where we were asked to apply a set of models covered in class to a dataset of our choosing. The class was loosely based on the text *An Introduction to Statistical Learning with Applications in R* (ISLR), found [here in hardcopy](https://amzn.to/2KgQJPY) or [here in soft copy form for free](https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf). I wanted to reproduce some of those models in Python, because that's what I'm familiar with. I make no claim that these are the best models to use for this dataset; this is only meant as a demo.
### Skip to the AdaBoost model if you want to see the best performing model. There is a chart at the end that shows the error in terms of actual sale price. I think this is the best measure of performance for a ML model in this context, because it allows you to see the error across the entire market. This model performs more predictably the more expensive a property is.

# Table of Contents:
## --------- [Data Wrangling](#Data-Wrangling:) -----------------
## --------- Model Building: ------------------
## 0. [A Baseline Model](#0.-Baseline-Model:)
## 1. [Multiple Linear Regression](#1.-Multiple-Linear-Regression:)
## 2. [Best Subset Regression](#2.-Best-Subset-Regression:)
## 3. [Ridge Regression](#3.-Ridge-Regression:)
## 4. [Lasso Regression](#4.-Lasso-Regression:)
## 5. [SVM for Regression](#5.-Support-Vector-Machine-for-Regression:)
## 6. [K Nearest Neighbors](#6.-K-Nearest-Neighbors:)
## 7. [Classification and Regression Tree](#7.-CART:)
## 8. [Random Forest](#8.-Random-Forest:)
## 9. [AdaBoost](#9.-AdaBoost:)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

%matplotlib inline

In [None]:
df = pd.read_csv("../input/kc_house_data.csv")
df.head()

In [None]:
df.info()

In [None]:
## Look at useful stats for each variable:
print('There are',len(df.columns.tolist()),'columns, including the response variable "price".')
df.describe()

# Data Wrangling:
### Before we begin building models with these data, we should do several things to make the data more meaningful:

1. Convert **`date`** to Python datetime objects
2. Check for empty (NaN) values
3. Figure out which predictors should be categorical, and transform them to that data type (so Pandas will play nicer with them)
4. Transform the dependent variable, **`price`**, to a log scale, as is common with widely varying financial data (e.g. **`hitters`** from ISLR)
5. Transform **`yr_built`** to a more apt **`yrs_old`** (at sale)
6. Plot the values of all variables against **`logPrice`** to see what anomalies we may have
7. Remove remaining meaningless or superfluous data (such as **`id`**)

### 1. Convert date column to datetime64 dtype:

In [None]:
df['date'] = pd.to_datetime(df['date'])  # more robust than .astype('datetime64') across pandas versions
df['date'].describe()
# The describe method still tells me that the dtype is "object", but if you index to a specific value such as (type(df.loc[0,'date'])), it says "pandas._libs.tslibs.timestamps.Timestamp"

### 2. Check for empty (NaN) values (there are none!)

In [None]:
df.isnull().sum()

### 3. Figure out which predictors should be categorical
Notice when we called `df.info()` above, all of the columns are read in from the csv as either `int64`, `float64`, or `object`.  Pandas has more datatypes that may be of use to us, particularly the categorical data type: http://pbpython.com/pandas_dtypes.html

In [None]:
# Which are categorical? Let's check all non-float values, screening for features with fewer than 50 unique values
## (50 is arbitrary, but I checked several other values, like 500, to make sure this result is reasonable)
predNames = ['waterfront','bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',  'view', 'condition', 'grade',
            'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'sqft_living15','sqft_lot15']

uniqueValsList = []
seriesNames = [] # Keep only the columns that have fewer than 50 unique values.
for each in predNames:
    if str(df[each].dtype) == "int64":
        uniqueVals = list(df[each].unique())
        if len(uniqueVals) < 50:
            uniqueValsList.append(pd.Series(data=uniqueVals,name=each))
            seriesNames.append(each)

listOfLists = []
for i in range(len(uniqueValsList)):
    thisList = []
    thisList.append(seriesNames[i])
    for each in uniqueValsList[i].tolist():
        thisList.append(each)
    listOfLists.append(thisList)

from IPython.display import HTML, display
import tabulate
display(HTML(tabulate.tabulate(listOfLists, tablefmt='html')))

### Based on this, it looks like `bedrooms`, `waterfront`, `view`, `condition`, and `grade` are the only candidates for categorical variables. Addressing each of them on their own:

- **`bedrooms`**: this should be kept as an integer, since order matters, and simple math applies
- **`waterfront`**: this is basically binary; it either is on the water or it isn't. I will translate this to a categorical variable
- **`view`**: this appears to contain more than binary information, maybe based on how many sides of the property have a view? There is no useful detail in the data documentation. I'll map this to a binary categorical variable--either it has a view or it doesn't.
- **`condition`**: Appears to be a 1-5 scale, so order matters. We will leave it as an `int`
- **`grade`**: There doesn't seem to be an obvious range to the scale here. Plotting will help.

The last two in particular warrant further examination. Should we not include both condition and grade, as they seem to contain similar information? I'll take the low hanging fruit first (converting **`waterfront`** and **`view`** to categorical variables), and then do some basic plotting to look more closely at **`condition`** and **`grade`**.
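
As an aside, the unique-value screen used above can probably be written more compactly with `Series.nunique()`. The following is only a sketch on a made-up toy frame (the values and the threshold of 3 are illustrative assumptions, not the real data or the threshold of 50 used above):

```python
import pandas as pd

# Toy frame standing in for the housing data (values are invented for illustration)
toy = pd.DataFrame({
    "waterfront": [0, 1, 0, 0],
    "view": [0, 2, 4, 0],
    "sqft_living": [1180, 2570, 770, 1960],
    "floors": [1.0, 2.0, 1.0, 1.5],
})

max_unique = 3  # arbitrary threshold, playing the role of the 50 used above

# Keep only integer-typed columns, then count distinct values in each
int_cols = [c for c in toy.columns if pd.api.types.is_integer_dtype(toy[c])]
unique_counts = toy[int_cols].nunique()

# Columns with few distinct values are candidates for a categorical dtype
candidates = unique_counts[unique_counts <= max_unique].index.tolist()
print(candidates)
```

The float column `floors` is skipped by the integer check, mirroring the `int64` test in the loop above.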

In [None]:
df['waterfront']= pd.Series(pd.Categorical(df['waterfront'], 
                                           categories=[0,1],
                                           ordered=True))
df['waterfront'].describe()

In [None]:
## Map all non-zero values in "view" to 1
y = pd.Series([0,1,1,1,1], index=[0,1,2,3,4])
df['view'] = df['view'].map(y)
df['view'].unique()

In [None]:
df['view']= pd.Series(pd.Categorical(df['view'], 
                                        categories=[0,1],
                                        ordered=True))
df['view'].describe()

### Now for some plotting:

In [None]:
plotColNames = ['condition','grade']

plt.figure(1,figsize = (5,6))
    
for i in range(len(plotColNames)):
    plt.subplot(100*len(plotColNames)+10+1+i)
    sns.distplot(df[plotColNames[i]],kde=0,label=plotColNames[i],color='blue')
    
plt.subplots_adjust(hspace = 0.8)
plt.show()

### My assessment:
- **`condition`**: Looks like no-one gives below a 3.0. Has less variation than `grade`, so I may only keep grade.
- **`grade`**: Yep, this looks like more information, so I'm going to drop `condition` and keep `grade`.

In [None]:
## Remove "condition"
df = df.drop(columns=['condition'])

### 4. Transform the dependent variable, **`price`**, to a log scale, as is common with widely varying financial data (e.g. **`hitters`** from ISLR)

In [None]:
## First, plot it as it is:
sns.distplot(df['price'],kde=1,color='darkblue',hist_kws={'alpha':0.8})
plt.title('Sale Price (USD)')
plt.show()

In [None]:
# Now transform it, but don't overwrite it:
logPrice = df['price'].apply(np.log)
sns.distplot(logPrice,kde=1,color='darkblue',hist_kws={'alpha':0.8})
plt.title('Log Transform of Sale Price')
plt.show()

### The log transform definitely looks better, but I'll keep them both in case I want to try different models with different versions of the dependent variable.

In [None]:
df['logPrice'] = logPrice
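
To make the effect of the log transform concrete, here is a small sketch on invented prices (not taken from the dataset). It shows that `np.exp` undoes `np.log`, and that the log-scale values are less skewed than the raw ones:

```python
import numpy as np
import pandas as pd

# Made-up sale prices with one very expensive property (illustrative only)
prices = pd.Series([75_000, 320_000, 450_000, 645_000, 7_700_000], dtype=float)

log_prices = np.log(prices)     # same transform as df['price'].apply(np.log)
recovered = np.exp(log_prices)  # mapping back to the dollar scale

# Sample skewness before and after the transform
print("skew (raw):", prices.skew())
print("skew (log):", log_prices.skew())
```

Because the transform is invertible, nothing is lost by modeling `logPrice`; predictions can always be mapped back to dollars with `np.exp`.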

### 5. Transform year-related data 
It could be useful to transform the `yr_built` to something that gives us a better sense of how old the house is. It may be that there will be no difference in the resulting models, but it seems to make more sense to me. 

In [None]:
df["yrs_old"] = pd.to_datetime(df['date']).dt.year - df['yr_built']

### 5.1 Because `yr_renovated` includes values of `0` for houses that have never been renovated, a similar transformation would not be useful. I think it will be best to look at this variable as binary--either it has or has not been renovated. I'll map all the non-zero numbers to 1, then make it categorical.

In [None]:
plt.figure()
#Let's make sure we ignore the '0' values.
renovations = df['yr_renovated'][df['yr_renovated']>100]
sns.distplot(renovations,kde=0,color='blue')
    
plt.subplots_adjust(hspace = 0.8)
print('There have been only:',len(renovations),'renovations.')
plt.show()

In [None]:
## Look at "yr_renovated" before mapping all non-zero values to 1
df['yr_renovated'].describe()

In [None]:
df.loc[df['yr_renovated']>100, 'yr_renovated'] = 1  # .loc avoids chained-assignment problems
df['yr_renovated']= pd.Series(pd.Categorical(df['yr_renovated'], 
                                        categories=[0,1],
                                        ordered=True))
df['yr_renovated'].describe()
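
The same renovated/not-renovated mapping can also be sketched with a boolean comparison instead of an in-place assignment. This is just an alternative on made-up values; the name `renovated` is hypothetical and not used elsewhere in this kernel:

```python
import pandas as pd

# Made-up values: 0 means "never renovated", otherwise the renovation year
yr_renovated = pd.Series([0, 1991, 0, 2005, 0], name="yr_renovated")

# True/False for "has been renovated", converted to 1/0
renovated_flag = (yr_renovated > 0).astype(int)

# Same ordered categorical encoding as used for waterfront and view
renovated = pd.Series(
    pd.Categorical(renovated_flag, categories=[0, 1], ordered=True),
    name="renovated",
)
print(renovated.tolist())
```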

### 6. Plot the explanatory variables against price

Double click on a graph to enlarge them all.

In [None]:
colNames = ['bedrooms','bathrooms','sqft_living','sqft_lot','floors','waterfront','view','grade','sqft_above',
            'sqft_basement','yr_renovated','sqft_living15','sqft_lot15','logPrice','yrs_old']

sns.pairplot(df,x_vars=colNames,y_vars="logPrice",height=10,aspect=1.0,kind = 'reg')  # 'size' was renamed to 'height' in seaborn 0.9
    
plt.show()

In [None]:
pd.plotting.scatter_matrix(df, alpha=0.2, figsize=(20, 20), diagonal='kde');

### 7. Remove remaining meaningless or superfluous data (such as **`id`**)

In [None]:
# There are several variables that we don't need:
#  id
#  date (all are within about a year of each other, making it difficult to establish seasonality)
#  lat/long (this should be captured in zip code, and I'm not sure how to deal with it otherwise)
#  yr_built (replaced by yrs_old above)
df = df.drop(columns=['id', 'date', 'lat','long','yr_built'])

### 8. Check for collinearity

In [None]:
#Source: https://stackoverflow.com/questions/39409866/correlation-heatmap?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
cmap = sns.diverging_palette(5, 250, as_cmap=True)
corr = df.corr()
corr.style.background_gradient(cmap, axis=1)\
    .set_properties(**{'max-width': '50px', 'font-size': '9pt'})\
    .format(precision=2)  # Styler.set_precision was removed in pandas 2.0

### Interestingly, `sqft_*15` variables seem to be sufficiently different from their counterparts that I will keep them.

## Before we do model-building, it will be useful to split the data into training/test data.

There is a lot of data, so I'm going to do an 80/20 split. I have no idea if this is optimal!

In [None]:
from sklearn.model_selection import train_test_split

# All the data:
y = pd.Series(df['logPrice'])
X = df.drop(columns=['logPrice','price'])

# How many observations are there?
numObs = df.shape[0]
numTrain = int(round(numObs*0.8,0))
# numTrain = 17290 in this case

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=numObs-numTrain, random_state=0)

In [None]:
# Store column names for later:
columnHeaders = X.columns.tolist()

print(X_train.shape,' ',X_test.shape)
print(y_train.shape,' ',y_test.shape)

### And set up a dataframe to store performance data:

In [None]:
perfDFColumns = ['Model Name','Test MSE','Test R^2']
perfDF = pd.DataFrame(columns=perfDFColumns)

# 0. Baseline Model:
## How does a naive average work? We'll use its performance metrics as a "floor" to compare everything else against.

In [None]:
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

In [None]:
dummy = DummyRegressor(strategy='mean')
dummy.fit(X_train,y_train)
dummy_MSE = mean_squared_error(y_test,dummy.predict(X_test))
dummy_R2  = dummy.score(X_test,y_test)
print(dummy_MSE,dummy_R2)
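
Why is the mean model a sensible "floor"? Since R^2 = 1 - SSE/SST, a model that always predicts the training mean scores exactly 0 on the training data and at most 0 on any other data. A minimal sketch on synthetic log-prices (all numbers below are invented):

```python
import numpy as np
from sklearn.dummy import DummyRegressor

rng = np.random.default_rng(0)

# Synthetic "log price" targets; the features are irrelevant to a mean model
y_tr = rng.normal(loc=13.0, scale=0.5, size=500)
y_te = rng.normal(loc=13.0, scale=0.5, size=100)
X_tr = np.zeros((500, 1))
X_te = np.zeros((100, 1))

dummy_demo = DummyRegressor(strategy="mean").fit(X_tr, y_tr)

r2_train = dummy_demo.score(X_tr, y_tr)  # 0 up to floating-point error
r2_test = dummy_demo.score(X_te, y_te)   # never above 0 for a constant prediction
print(r2_train, r2_test)
```

Any model worth keeping should therefore show a clearly positive test R^2.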

In [None]:
newRow = [('Naive Mean Model',round(dummy_MSE,3),round(dummy_R2*100,2))]
perfDF = pd.concat([perfDF, pd.DataFrame(data=newRow,columns=perfDFColumns)],ignore_index=True)  # DataFrame.append was removed in pandas 2.0
perfDF.drop_duplicates(inplace=True) # So we don't accidentally append multiple times.
perfDF

# 1. Multiple Linear Regression:
## This will also be used as a baseline of sorts, because it is the simplest statistical learning model we have. If other ML models don't beat linear regression, then use linear regression!

In [None]:
from sklearn.linear_model import LinearRegression
linRegr = LinearRegression()
linRegr.fit(X_train, y_train)
r2 = linRegr.score(X_test, y_test)

linRegr_MSE = mean_squared_error(y_test, linRegr.predict(X_test))
print('Linear Regression Test MSE: {}'.format(round(linRegr_MSE,3)))
print('Linear Regression model Test R^2 is: {}%'.format(round(r2*100,2)))

In [None]:
newRow = [('Linear Regression',round(linRegr_MSE,3),round(r2*100,2))]
perfDF = pd.concat([perfDF, pd.DataFrame(data=newRow,columns=perfDFColumns)],ignore_index=True)  # DataFrame.append was removed in pandas 2.0
perfDF.drop_duplicates(inplace=True) # So we don't accidentally append multiple times.
perfDF

# 2. Best Subset Regression:

### Best Subset. Scikit-learn does not implement the exhaustive best subset search described in ISLR. Instead, `SelectKBest` scores each predictor with a univariate statistical test (`f_regression` in this case) and keeps the k best of p predictors.
Like best subset, this is a dimension-reducing methodology. Since there aren't that many predictors in this data set, I'm not expecting it to be that useful.

In [None]:
#SYNTAX: X_train, X_test, y_train, y_test
from sklearn.feature_selection import SelectKBest, f_regression
# X_train.shape[1] = 15; There are 15 possible predictors.
subSetter = SelectKBest(f_regression, k=5).fit(X_train, y_train)
X_bestSubset = subSetter.transform(X_train)
scores = subSetter.scores_
X_bestSubsetScoresDF = pd.DataFrame(data=scores,columns=['scores'],index=X_train.columns.tolist())
X_bestSubsetScoresDF.sort_values(by='scores',ascending=False,inplace=True)
X_bestSubsetScoresDF.astype(float).round(2)
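
To see which columns `SelectKBest` actually keeps (rather than only their scores), `get_support()` returns a boolean mask over the input columns. A sketch on synthetic data, where the column names and coefficients are made up and only `a` and `c` drive the response:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# Four synthetic predictors; the response depends only on "a" and "c"
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
y_demo = 3.0 * X_demo["a"] - 2.0 * X_demo["c"] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)

# Boolean mask of kept columns, mapped back to column names
selected = X_demo.columns[selector.get_support()].tolist()
print(selected)
```

Note that each predictor is scored on its own, so this filter can miss predictors that only matter in combination with others.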

In [None]:
linRegrSubsetPerfDF = pd.DataFrame(columns=perfDFColumns,index=range(1,16))
linRegrSubsetPerfDF['#Var'] = range(1,16)
linRegrSubsetPerfDF.set_index('#Var',inplace=True)

In [None]:
# Get the R^2 of a model with 1 through all 15 predictors, in order of their score:
linRegrSubsetPerfDF = pd.DataFrame(columns=['R-squared'],index=range(1,16))

for k in range(1,16):
    linRegr = LinearRegression().fit(X_train[X_bestSubsetScoresDF.index[0:k].values], y_train)
    #linRegr.fit(X_train[X_bestSubsetScoresDF.index[0:k].values], y_train)
    linRegrSubsetPerfDF.loc[k,'R-squared'] = linRegr.score(X_test[X_bestSubsetScoresDF.index[0:k].values], y_test)
    linRegrSubsetPerfDF.loc[k,'test MSE'] = mean_squared_error(y_test, linRegr.predict(X_test[X_bestSubsetScoresDF.index[0:k].values]))
    
# Example of stringing methods together in Python:
# Show the whole DataFrame, after rounding it to two decimal places, after transposing it.
linRegrSubsetPerfDF.astype(float).round(2).transpose()

In [None]:
## Creating a function because we'll graph MSE and R^2 for each candidate model:

def plotPerformance(x,mse,r2,title='Title Goes Here',xlabel='x label goes here'):
    # Informed by: https://matplotlib.org/gallery/api/two_scales.html
    fig, ax1 = plt.subplots()

    # plot R^2 with scale on left hand side of plot
    xaxis = x
    ax1.plot(x, r2,'b.-')
    ax1.set_xlabel(xlabel)
    ax1.set_ylabel('R-Squared', color='b')
    ax1.tick_params(axis='y',labelcolor='b')

    # plot test MSE with scale on right hand side of plot
    ax2 = ax1.twinx()
    ax2.plot(x, mse,'r.-')
    ax2.set_ylabel('test MSE', color='r')
    ax2.tick_params(labelcolor='r')

    plt.title(title)  
    plt.show()

In [None]:
plotPerformance(x=linRegrSubsetPerfDF.index.values, 
               mse=linRegrSubsetPerfDF['test MSE'],
               r2=linRegrSubsetPerfDF['R-squared'],
               title='Performance based on # features included',
               xlabel='# features')

In [None]:
newRow = [('Best Subset Regression',round(linRegrSubsetPerfDF.loc[14]['test MSE'],3),round(linRegrSubsetPerfDF.loc[14]['R-squared']*100,3))]
perfDF = pd.concat([perfDF, pd.DataFrame(data=newRow,columns=perfDFColumns)],ignore_index=True)  # DataFrame.append was removed in pandas 2.0
perfDF.drop_duplicates(inplace=True) # So we don't accidentally append multiple times.
perfDF

### Note that the highest test R^2 is with 14 predictors. Using all 15 predictors gives us the same test performance as the plain linear regression model, because they are the same model! As expected, best subset, which is a dimensionality reduction tool, is not useful with this dataset.

# 3. Ridge Regression:

### Ridge and Lasso Regression: setting up the $\lambda$ and finding the best $\lambda$ by cross-validation.

In [None]:
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV
# Generate a range of alpha values:
alphas = 10**np.linspace(10,-4,100)
ridge = Ridge(normalize = True)
coefs = []

for a in alphas:
    ridge.set_params(alpha = a)
    ridge.fit(X_train, y_train)
    coefs.append(ridge.coef_)
    
np.shape(coefs)
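
A note for anyone re-running this on a recent scikit-learn release: as far as I know, the `normalize` argument is no longer accepted by `Ridge`, `RidgeCV`, `Lasso`, or `LassoCV`. A common substitute is to standardize inside a pipeline. The sketch below uses synthetic data and a shorter alpha grid (both are assumptions for illustration). `StandardScaler` is not numerically identical to the old `normalize=True`, so the selected alpha values are not directly comparable with the ones in this kernel:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic predictors on very different scales (illustrative only)
X_demo = rng.normal(size=(300, 3)) * np.array([1.0, 100.0, 0.01])
y_demo = X_demo @ np.array([2.0, 0.03, 50.0]) + rng.normal(scale=0.5, size=300)

alphas_demo = 10 ** np.linspace(3, -4, 50)

# Standardize the predictors, then pick alpha by cross-validation
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas_demo))
model.fit(X_demo, y_demo)

best_alpha = model.named_steps["ridgecv"].alpha_
print(best_alpha, model.score(X_demo, y_demo))
```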

In [None]:
# This chart is similar to Figure 6.4 in ISLR (p 216)
ax = plt.gca()
ax.plot(alphas, coefs);
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('Standardized Coefficients');

In [None]:
# Use cross-validation to find the alpha value with the lowest error:
ridgecv = RidgeCV(alphas=alphas, scoring = 'neg_mean_squared_error',normalize=True)
ridgecv.fit(X_train, y_train)
ridgecv.alpha_

In [None]:
# Find the test MSE associated with the best alpha:
ridge = Ridge(alpha = ridgecv.alpha_, normalize = True)
ridge.fit(X_train, y_train)
ridge_MSE = mean_squared_error(y_test, ridge.predict(X_test))
ridge_accuracy = ridge.score(X_test, y_test)
print('Ridge Regression MSE: {}'.format(round(ridge_MSE,3)))
print('Ridge Regression model accuracy is: {}%'.format(round(ridge_accuracy*100,2)))

In [None]:
ridgePerfDF = pd.DataFrame(columns=['R-squared','test MSE'])
ridgePerfDF['alpha'] = ridgecv.alphas
ridgePerfDF.set_index('alpha',inplace=True)

for a in ridgecv.alphas:
    ridge = Ridge(alpha = a, normalize = True)
    ridge.fit(X_train, y_train)
    # Record test MSE and test R^2 for this alpha
    ridgePerfDF.loc[a,'test MSE']  = mean_squared_error(y_test, ridge.predict(X_test))
    ridgePerfDF.loc[a,'R-squared'] = ridge.score(X_test, y_test)

ridgePerfDF

In [None]:
# Plot performance based on alpha value. Does it match the automated RidgeCV alpha?
# Note: the highest alpha values skew the graph, so we zoom in with indexing
plotPerformance(x=ridgePerfDF.index.values[60:101], 
               mse=ridgePerfDF['test MSE'].values[60:101],
               r2=ridgePerfDF['R-squared'].values[60:101],
               title='Performance based on alpha value',
               xlabel='alpha')

In [None]:
newRow = [('Ridge Regression',round(ridge_MSE,3),round(ridge_accuracy*100,2))]
perfDF = pd.concat([perfDF, pd.DataFrame(data=newRow,columns=perfDFColumns)],ignore_index=True)  # DataFrame.append was removed in pandas 2.0
perfDF.drop_duplicates(inplace=True) # So we don't accidentally append multiple times.
perfDF

In [None]:
ridgeCoefDF = pd.DataFrame(data=ridgecv.coef_,columns=['Coeff Val'],index=columnHeaders)
print('Best Alpha: ',ridgecv.alpha_)
ridgeCoefDF

### Scikit-Learn's Ridge Regression CV feature finds a best $\lambda$ of 0.000977, which is the smallest $\lambda$ I have tried. This checks out with the manual search I conducted above. I'm not sure if it's appropriate to continue looking at lower $\lambda$ values. (I already lowered it once; previously the lowest $\lambda$ was 0.01.)
### It does not appear that ridge regression produces any zero coefficients. That is expected: the ridge penalty shrinks coefficients toward zero without setting them exactly to zero, so it's not actually doing any variable selection. Furthermore, the Ridge Regression model doesn't even outperform the baseline Linear Regression.
# 4. Lasso Regression:

In [None]:
## Thanks to: http://www.science.smith.edu/~jcrouser/SDS293/labs/lab10-py.html

# Basic model: (no cross-validation)
lasso = Lasso(max_iter = 100000, normalize = True)
coefs = []

for a in alphas:
    lasso.set_params(alpha=a)
    lasso.fit(X_train, y_train)
    coefs.append(lasso.coef_)
    
ax = plt.gca()
ax.plot(alphas*2, coefs)
ax.set_xscale('log')
plt.axis('tight')
plt.xlabel('alpha')
plt.ylabel('weights')

In [None]:
# Cross-validated model:
lassocv = LassoCV(alphas = None, cv = 10, max_iter = 100000, normalize = True)
lassocv.fit(X_train, y_train)

# Get MSE
lasso.set_params(alpha=lassocv.alpha_)
lasso.fit(X_train, y_train)
lassoCV_MSE = mean_squared_error(y_test, lasso.predict(X_test))

# Get model accuracy
lassoCV_accuracy = lasso.score(X_test, y_test)

lassocvCoefDF = pd.DataFrame(data=lasso.coef_,columns=['Coeff Val'],index=columnHeaders)
print('lassoCV_MSE',lassoCV_MSE)
print('lassoCV_accuracy',lassoCV_accuracy)
lassocvCoefDF

In [None]:
lassoPerfDF = pd.DataFrame(columns=['R-squared','test MSE'])
lassoPerfDF['alpha'] = alphas  # same alpha grid that was used for ridge above
lassoPerfDF.set_index('alpha',inplace=True)
#lassoPerfDF

for a in alphas:
    lasso = Lasso(alpha = a, normalize = True)
    lasso.fit(X_train, y_train)
    # Record test MSE and test R^2 for this alpha
    lassoPerfDF.loc[a,'test MSE']  = mean_squared_error(y_test, lasso.predict(X_test))
    lassoPerfDF.loc[a,'R-squared'] = lasso.score(X_test, y_test)

print("Note: Best performance is with lowest alphas")
lassoPerfDF.tail()

In [None]:
print(min(lassoPerfDF['test MSE']))
print(max(lassoPerfDF['R-squared']))
lassoPerfDF[lassoPerfDF['R-squared']>0.6]

In [None]:
lasso_MSE = lassoPerfDF.loc[0.0001,'test MSE']
lasso_R2 = lassoPerfDF.loc[0.0001,'R-squared']

In [None]:
# Plot performance based on alpha value. Does it match the automated LassoCV alpha?
# Note: the highest alpha values skew the graph, so we zoom in with indexing
plotPerformance(x=lassoPerfDF.index.values[85:101], 
               mse=lassoPerfDF['test MSE'].values[85:101],
               r2=lassoPerfDF['R-squared'].values[85:101],
               title='Lasso performance based on alpha value',
               xlabel='alpha')

In [None]:
newRow = [('Lasso Regression',round(lasso_MSE,3),round(lasso_R2*100,2))]
perfDF = pd.concat([perfDF, pd.DataFrame(data=newRow,columns=perfDFColumns)],ignore_index=True)  # DataFrame.append was removed in pandas 2.0
perfDF.drop_duplicates(inplace=True) # So we don't accidentally append multiple times.
perfDF

### Although a lot of the coefficients are close to zero, probably due to the log scale of the dependent variable, only one variable is left out. Strangely, the variable that's left out is **`sqft_above`**.

### Most importantly, the Lasso model does not improve upon linear regression with this data set.

# 5. Support Vector Machine for Regression:

### In class, we've used SVMs for binary classification (default or not, survive or not), but it is possible to use them for numerical regression of a continuous response variable. Scikit-Learn uses the `SVR` model object instead of the `SVC`, which is used for classification. It will be interesting to see how it performs against models that are typically better suited for regression with continuous response variables.

In [None]:
from sklearn.svm import SVR
svr = SVR()
svr.fit(X_train,y_train)

In [None]:
# Get MSE
svr_MSE = mean_squared_error(y_test, svr.predict(X_test))
# Get model accuracy
svr_accuracy = svr.score(X_test, y_test)
print('Test MSE:',svr_MSE,' & Accuracy:',svr_accuracy)

In [None]:
## GridSearchCV also works with regressors such as SVR, but here I'll simply
##  loop over a few values of C manually:

for c in [0.001, 0.01, 0.1, 1, 5, 10, 100]:
    svr = SVR(C=c,kernel='rbf') # radial basis function kernel is highly flexible
    svr.fit(X_train,y_train)
    svr_MSE = mean_squared_error(y_test, svr.predict(X_test))
    svr_accuracy = svr.score(X_test, y_test)
    print('For c=',c,'Test MSE:',svr_MSE,' & Accuracy:',svr_accuracy)
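
For comparison, here is roughly how the same search over `C` could look with `GridSearchCV`, which selects `C` by cross-validation on the training data rather than by test-set performance. This is a sketch on synthetic data. The scaling step is my own addition (RBF-kernel SVMs are generally sensitive to feature scale) and was not part of the run above:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic regression problem (illustrative only)
X_demo = rng.uniform(-3, 3, size=(200, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Standardize predictors, then fit an RBF-kernel SVR
pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {"svr__C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

print(search.best_params_, -search.best_score_)
```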

In [None]:
## Refit with best C:
svr = SVR(C=1)
svr.fit(X_train,y_train)
svr_MSE = mean_squared_error(y_test, svr.predict(X_test))
svr_accuracy = svr.score(X_test, y_test)
print('Test MSE:',svr_MSE,' & Accuracy:',svr_accuracy)

In [None]:
newRow = [('SVM Regression',round(svr_MSE,3),round(svr_accuracy*100,2))]
perfDF = pd.concat([perfDF, pd.DataFrame(data=newRow,columns=perfDFColumns)],ignore_index=True)  # DataFrame.append was removed in pandas 2.0
perfDF.drop_duplicates(inplace=True) # So we don't accidentally append multiple times.
perfDF

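### Even the best SVM model does quite poorly with this data, barely achieving a test R^2 over 1%. One likely contributor is that the predictors were not standardized before fitting, and RBF-kernel SVMs are sensitive to feature scale.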

# 6. K Nearest Neighbors:
Testing for k = 1,2,3,...38,39,40, as in our lab.

We will use the Scikit-Learn `scale` function, which standardizes all predictor variables to have a mean of zero and standard deviation of one. This way, we don't have problems with different scales across different variables. **A comparable rescaling was already applied in the Lasso and Ridge Regression models via the `normalize=True` parameter.**

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import scale

In [None]:
knnDF = pd.DataFrame(columns=['k','test MSE','R-squared'],index=range(1,41))
knnDF['k'] = range(1,41)
knnDF['test MSE'] = range(1,41)
knnDF.set_index('k',inplace=True)

In [None]:
for k in range(1,41):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(scale(X_train), y_train) 
    knnDF.loc[k,'test MSE'] = mean_squared_error(y_test, knn.predict(scale(X_test)))
    knnDF.loc[k,'R-squared'] = knn.score(scale(X_test), y_test)
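
One caveat about the loop above: calling `scale` separately on `X_train` and `X_test` standardizes each set with its own mean and standard deviation. A more conventional pattern is to fit a `StandardScaler` on the training data and reuse it for the test data. A sketch on synthetic data (the numbers and `n_neighbors=5` are arbitrary assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic predictors on very different scales
X_demo = rng.normal(size=(400, 2)) * np.array([1.0, 1000.0])
y_demo = X_demo[:, 0] + X_demo[:, 1] / 1000.0 + rng.normal(scale=0.1, size=400)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler().fit(Xtr)
Xtr_scaled = scaler.transform(Xtr)
Xte_scaled = scaler.transform(Xte)  # same transform, no peeking at test statistics

knn_demo = KNeighborsRegressor(n_neighbors=5).fit(Xtr_scaled, ytr)
r2_demo = knn_demo.score(Xte_scaled, yte)
print(r2_demo)
```

With a test set as large as the one in this kernel, the two approaches will probably give similar numbers, but the fitted-scaler version is the one that carries over to scoring a single new house.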

In [None]:
# Plot performance based on k value. 
plotPerformance(x=knnDF.index.values,#[85:101], 
               mse=knnDF['test MSE'].values,#[85:101],
               r2=knnDF['R-squared'].values,#[85:101],
               title='KNN Performance',
               xlabel='k neighbors')

In [None]:
newRow = [('KNN Regression',round(knnDF.loc[20,'test MSE'],3),round(knnDF.loc[20,'R-squared']*100,2))]
perfDF = pd.concat([perfDF, pd.DataFrame(data=newRow,columns=perfDFColumns)],ignore_index=True)  # DataFrame.append was removed in pandas 2.0
perfDF.drop_duplicates(inplace=True) # So we don't accidentally append multiple times.
perfDF

## The KNN model's performance improves as k increases from 1 and plateaus at about k = 20. Since referencing more neighbors at runtime will slow down processing of future predictions, I would use k=20 over k=40, even though they have about the same performance.

## So far, KNN is the best model.

# 7. CART:
Building a decision tree using the **`sklearn.tree.DecisionTreeRegressor()`** model, and plotting it with **`graphviz`** (spoiler: I couldn't get graphviz to work).

In [None]:
from sklearn import tree

In [None]:
cart = tree.DecisionTreeRegressor()
cart = cart.fit(X_train,y_train)

cart_MSE = mean_squared_error(y_test, cart.predict(X_test))
cart_R2 = cart.score(X_test, y_test)
print(cart_MSE,cart_R2)
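
If graphviz is not available, one possible fallback is `sklearn.tree.export_text`, which prints the split rules as plain text. The sketch below fits a deliberately shallow tree on synthetic data. The feature names are hypothetical, and the `max_depth=2` limit is only there to keep the printout short (the tree above is unrestricted and would print far too many lines):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)

# Synthetic data: the response jumps when the first feature exceeds 5
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = np.where(X_demo[:, 0] > 5, 2.0, 1.0) + rng.normal(scale=0.05, size=200)

small_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Plain-text rendering of the fitted tree's split rules
rules = export_text(small_tree, feature_names=["sqft_like", "age_like"])
print(rules)
```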

In [None]:
newRow = [('CART Regression',round(cart_MSE,3),round(cart_R2*100,2))]
perfDF = pd.concat([perfDF, pd.DataFrame(data=newRow,columns=perfDFColumns)],ignore_index=True)  # DataFrame.append was removed in pandas 2.0
perfDF.drop_duplicates(inplace=True) # So we don't accidentally append multiple times.
perfDF

# 8. Random Forest:

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
%%time
randFor = RandomForestRegressor(max_depth=500,random_state=2)
randFor.fit(X_train,y_train)
RF_MSE = mean_squared_error(y_test, randFor.predict(X_test))
RF_R2 = randFor.score(X_test,y_test)
print(RF_MSE,RF_R2)

In [None]:
newRow = [('Random Forest Regression',round(RF_MSE,3),round(RF_R2*100,2))]
perfDF = pd.concat([perfDF, pd.DataFrame(data=newRow,columns=perfDFColumns)],ignore_index=True)  # DataFrame.append was removed in pandas 2.0
perfDF.drop_duplicates(inplace=True) # So we don't accidentally append multiple times.
perfDF

# 9. AdaBoost:
## Using the CART and boosting it.

In [None]:
from sklearn.ensemble import AdaBoostRegressor

In [None]:
adaBoost = AdaBoostRegressor(cart,n_estimators=300,random_state=np.random.RandomState(1))
adaBoost.fit(X_train,y_train)
adaPredict = adaBoost.predict(X_test)
adaBoost_MSE = mean_squared_error(y_test,adaPredict)
adaBoost_R2  = adaBoost.score(X_test,y_test)
print(adaBoost_MSE,adaBoost_R2)

In [None]:
newRow = [('AdaBoost Regression',round(adaBoost_MSE,3),round(adaBoost_R2*100,2))]
perfDF = pd.concat([perfDF, pd.DataFrame(data=newRow,columns=perfDFColumns)],ignore_index=True)  # DataFrame.append was removed in pandas 2.0
perfDF.drop_duplicates(inplace=True) # So we don't accidentally append multiple times.
perfDF

## But who cares if we can predict on a log scale... potential investors would want to know how accurate the model is in real terms.

So let's map the predicted and true (test) response values back onto the normal scale:

In [None]:
yNoLog=np.exp(y_test)
yNoLog.describe()

In [None]:
percError = ((yNoLog - np.exp(adaPredict)) / yNoLog) * 100
percError.describe()

In [None]:
fig, ax1 = plt.subplots()
sns.regplot(x=yNoLog,y=percError,fit_reg=False);

linex = np.linspace(0, max(yNoLog+100000), 5)
line1 = ax1.plot(linex, 0*linex, '--', linewidth=1,color='black')

fig.set_size_inches(10,6)
ax1.set_xlabel('Sale Price')
ax1.set_ylabel('Estimator % Error')
plt.title('% Error By Sale Price');

In the chart above, points above the line were **underestimated**, and points below the line were **overestimated** by the model. 
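
To make the sign convention concrete, here is the same percent-error formula applied to two hand-picked cases. The helper name `percent_error` is hypothetical, and the \$500k/\$450k pair is invented; the \$130k/\$390k pair matches the outlier discussed in the conclusion:

```python
import numpy as np

def percent_error(true_price, predicted_price):
    """Same formula as percError above: positive means the model underestimated."""
    return (true_price - predicted_price) / true_price * 100

# Predictions come out of the model on the log scale, so map them back first
pred_low = np.exp(np.log(450_000.0))   # model says $450k
pred_high = np.exp(np.log(390_000.0))  # model says $390k

under = percent_error(500_000.0, pred_low)   # sold for $500k: underestimated
over = percent_error(130_000.0, pred_high)   # sold for $130k: overestimated
print(under, over)
```

Note the asymmetry: an underestimate can never exceed +100% (the prediction would have to be zero), while an overestimate is unbounded below, which is part of why the cheapest properties dominate the bottom of the chart.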

# Conclusion
The AdaBoost model seems to work the best with this test/training split. In order to improve my model further, I would spend more time on feature engineering, specifically with the lat/long values. I assume it's possible to deduce school ratings, import census data, etc... with this location information.

I did look at an outlier in the bottom left of the chart, `df.iloc[18332]`, which was predicted to sell for \$390k when in fact it sold for \$130k. The prediction was about 300% of the sale price, which is roughly a -200% error on the chart's scale. I tried to resolve the lat/long to an address, but it appears as though the data have been generalized so as to not allow that. I could place the neighborhood, but not the address. So I don't think it would be possible to gather address information.

# Future Work
I think the most interesting measure of performance is the chart showing errors in terms of the actual sale price. I'm wondering if some models that didn't perform as well as AdaBoost, when judged by Test MSE and Test R-squared, may perform better in *certain segments* of the market. It could be that different models perform better in a noisy segment of the data, such as the lower end of the sale price spectrum, even though AdaBoost performs better overall. **I'd be interested to know how real estate analysts would typically split up a market**, and I would re-run these models, segmenting the test data to only those that apply. I may have to increase my percentage of data allocated to the test data so that my test set for each segment is sufficient.
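
As a starting point for that segmentation, the test set could be cut into price bands with `pd.qcut` and the error summarized per band. This sketch uses invented prices and fake predictions, and quartiles are just one arbitrary way to define segments (a real estate analyst would likely choose differently):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Invented sale prices and fake predictions with multiplicative noise
true_price = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=1000))
pred_price = true_price * np.exp(rng.normal(scale=0.15, size=1000))

perc_error = (true_price - pred_price) / true_price * 100

# Split the market into four equally sized price bands
segment = pd.qcut(true_price, q=4, labels=["Q1 (lowest)", "Q2", "Q3", "Q4 (highest)"])

# Number of sales and median absolute % error within each band
summary = perc_error.abs().groupby(segment, observed=True).agg(["count", "median"])
print(summary)
```

Running each candidate model through a table like this would show whether any of them beats AdaBoost within a particular band.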