# Introduction

We already know from previous analysis:

- Some features are categorical: ['waterfront', 'view','condition','grade']
- Some features don't have a clear relationship with "Price" ['yr_built', 'yr_renovated', 'long', 'sqft_lot', 'sqft_lot15']

In [96]:
# importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.metrics import mean_squared_error


In [97]:
# Load the dataset from .csv, split it into X and y,
# and drop the columns I don't want or need: ['id', 'date', 'zipcode']

dataset = pd.read_csv('kc_house_data.csv')
y = dataset[['price']]
X = dataset.drop(['price', 'id','date','zipcode'],axis=1)


In [98]:
X.shape

(21613, 17)

# Function to run my model:


In [99]:
def run_model(X_input, y_input):
    """Fit LinearRegression on a random 80/20 split and print train/test R^2 and RMSE."""
    # No random_state is set, so every call draws a different split.
    x_train, x_test, y_train, y_test = train_test_split(X_input, y_input, test_size=0.2)
    lr = LinearRegression()
    lr.fit(x_train, y_train)

    # Scores on the data the model was fitted on
    train_score = lr.score(x_train, y_train)
    train_preds = lr.predict(x_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, train_preds))
    print("Train :: R^2 = " + "%.2f" % train_score + " :: RMSE = " + "%.2f" % train_rmse)

    # Scores on the held-out 20%
    test_score = lr.score(x_test, y_test)
    test_preds = lr.predict(x_test)
    test_rmse = np.sqrt(mean_squared_error(y_test, test_preds))
    print("Test  :: R^2 = " + "%.2f" % test_score + " :: RMSE = " + "%.2f" % test_rmse)

I will run the model on several variants of my dataset:

- All data (after dropping ['price', 'id','date','zipcode'])
- Without categorical_features ['waterfront', 'view','condition','grade']
- Without low_linearity_features ['yr_built', 'yr_renovated', 'long', 'sqft_lot', 'sqft_lot15']
- Without categorical_features and low_linearity_features

In [100]:
# From the previous analysis, we know:
categorical_features = ['waterfront', 'view','condition','grade']
low_linearity_features = ['yr_built', 'yr_renovated', 'long', 'sqft_lot', 'sqft_lot15']

In [101]:
print("\nAll data")
run_model(X,y)

print("\nNo categorical_features")
run_model(X.drop(categorical_features, axis=1),y)

print("\nNo low_linearity_features")
run_model(X.drop(low_linearity_features, axis=1),y)

print("\nNo categorical_features + low_linearity_features")
run_model(X.drop((categorical_features + low_linearity_features), axis=1),y)


All data
Train :: R^2 = 0.70 :: RMSE = 200361.55
Test  :: R^2 = 0.68 :: RMSE = 211855.94

No categorical_features
Train :: R^2 = 0.63 :: RMSE = 220356.23
Test  :: R^2 = 0.61 :: RMSE = 243711.29

No low_linearity_features
Train :: R^2 = 0.67 :: RMSE = 209689.28
Test  :: R^2 = 0.65 :: RMSE = 221804.27

No categorical_features + low_linearity_features
Train :: R^2 = 0.59 :: RMSE = 237873.28
Test  :: R^2 = 0.56 :: RMSE = 234618.77


# Conclusions

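- Removing low_linearity_features had only a small negative impact (test R^2 went from 0.68 to 0.65), so these features contribute little. Each run uses a different random split, so small differences like this should be read with care.
- Removing categorical_features had a much larger negative impact (test R^2 went from 0.68 to 0.61). I will use dummy encoding so the linear model can fit these features better.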
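
Because `run_model` draws a new random split on every call, part of the difference between variants may come from the split rather than from the features. The following is a minimal sketch of one way to hold the split fixed while comparing variants. The helper name `compare_feature_sets`, the `seed` argument, and the synthetic demo data are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def compare_feature_sets(X_input, y_input, feature_sets, seed=0):
    """Score LinearRegression on several column subsets using one shared split.

    feature_sets maps a label to the list of columns to drop.
    Returns a DataFrame with test R^2 and test RMSE per label.
    """
    # Split the row positions once so every variant sees the same rows.
    train_idx, test_idx = train_test_split(
        np.arange(len(X_input)), test_size=0.2, random_state=seed
    )
    rows = []
    for label, dropped in feature_sets.items():
        X_sub = X_input.drop(columns=dropped)
        lr = LinearRegression().fit(X_sub.iloc[train_idx], y_input.iloc[train_idx])
        preds = lr.predict(X_sub.iloc[test_idx])
        rows.append({
            "features": label,
            "test_r2": lr.score(X_sub.iloc[test_idx], y_input.iloc[test_idx]),
            "test_rmse": np.sqrt(mean_squared_error(y_input.iloc[test_idx], preds)),
        })
    return pd.DataFrame(rows).set_index("features")


# Synthetic stand-in data (hypothetical, not the King County dataset).
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "noise"])
demo_y = pd.DataFrame(
    {"price": 3 * demo_X["a"] + 2 * demo_X["b"] + rng.normal(scale=0.1, size=200)}
)
demo_results = compare_feature_sets(
    demo_X, demo_y, {"all": [], "no a": ["a"], "no noise": ["noise"]}
)
print(demo_results)
```

With the real data, the same call could take `X`, `y`, and a dictionary built from `categorical_features` and `low_linearity_features`.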

# Dummy encoding

These are the categorical features we need to encode:
- waterfront - A dummy variable for whether the property overlooks the waterfront or not
- view - An index from 0 to 4 of how good the view of the property is
- condition - An index from 1 to 5 on the condition of the property
- grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design

Note: I'm using 'drop_first=True', so a categorical variable with n distinct values is encoded as n-1 dummy columns (the dropped first level acts as the baseline).
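
To make the effect of 'drop_first=True' concrete, here is a small sketch on a toy frame. The `toy` DataFrame and its values are made up for illustration; only `pd.get_dummies` and its `columns`, `prefix`, and `drop_first` parameters come from the notebook.

```python
import pandas as pd

# Toy frame (hypothetical values): 'condition' takes 3 distinct values.
toy = pd.DataFrame({
    "condition": [1, 2, 3, 2],
    "sqft_living": [800, 1200, 1500, 1100],
})

# Without drop_first: one dummy column per distinct value (3 columns).
full = pd.get_dummies(toy, columns=["condition"], prefix=["condition"])

# With drop_first: the first level (condition == 1) is dropped (2 columns).
reduced = pd.get_dummies(
    toy, columns=["condition"], prefix=["condition"], drop_first=True
)

print(list(full.columns))
print(list(reduced.columns))
```

In the reduced frame, a row with all condition dummies equal to 0 represents the dropped level. This avoids a dummy column that is fully determined by the others and the intercept.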

In [102]:
X_dummies = pd.get_dummies(X, columns=categorical_features, prefix=categorical_features, drop_first=True)
X_dummies.head()

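| | bedrooms | bathrooms | sqft_living | sqft_lot | floors | sqft_above | sqft_basement | yr_built | yr_renovated | lat | ... | grade_4 | grade_5 | grade_6 | grade_7 | grade_8 | grade_9 | grade_10 | grade_11 | grade_12 | grade_13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 1180 | 0 | 1955 | 0 | 47.5112 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 3 | 2.25 | 2570 | 7242 | 2.0 | 2170 | 400 | 1951 | 1991 | 47.721 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 1.0 | 770 | 10000 | 1.0 | 770 | 0 | 1933 | 0 | 47.7379 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 3.0 | 1960 | 5000 | 1.0 | 1050 | 910 | 1965 | 0 | 47.5208 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 3 | 2.0 | 1680 | 8080 | 1.0 | 1680 | 0 | 1987 | 0 | 47.6168 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |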


In [103]:
X_dummies.columns

Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'lat',
       'long', 'sqft_living15', 'sqft_lot15', 'waterfront_1', 'view_1',
       'view_2', 'view_3', 'view_4', 'condition_2', 'condition_3',
       'condition_4', 'condition_5', 'grade_3', 'grade_4', 'grade_5',
       'grade_6', 'grade_7', 'grade_8', 'grade_9', 'grade_10', 'grade_11',
       'grade_12', 'grade_13'],
      dtype='object')

In [104]:
print("\nAll data with categorical dummies")
run_model(X_dummies,y)


All data with categorical dummies
Train :: R^2 = 0.73 :: RMSE = 188033.65
Test  :: R^2 = 0.71 :: RMSE = 207635.77


In [105]:
lr = LinearRegression()
cross_validate(lr, X_dummies, y, cv=5, return_train_score=True)

{'fit_time': array([0.05485249, 0.0428853 , 0.06482744, 0.03989339, 0.03989387]),
 'score_time': array([0.00698161, 0.01296496, 0.00897503, 0.00897646, 0.00598311]),
 'test_score': array([0.71397152, 0.73161878, 0.72308369, 0.71181043, 0.71728635]),
 'train_score': array([0.72957632, 0.72488619, 0.72684743, 0.72895723, 0.7265156 ])}

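# Conclusion
Encoding the categorical features as dummies improved the model: test R^2 rose from 0.68 to 0.71 on a single split, and 5-fold cross-validation gives test R^2 between 0.71 and 0.73.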

# GridSearchCV, Pipeline and PolynomialFeatures

- I'll try to improve my model by applying a PolynomialFeatures transformation.
- I'll use GridSearchCV to find the best polynomial degree.

Note: This took 1 minute to run on my computer
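
One likely reason the search is slow is that the number of generated columns grows quickly with the degree. For n input features, PolynomialFeatures with the default bias column produces C(n + d, d) columns at degree d. The sketch below assumes 28 input columns (the 33 dummy-encoded columns minus the 5 low_linearity_features). The helper name `n_poly_features` is hypothetical.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_features(n_inputs, degree):
    """Number of columns PolynomialFeatures emits, bias column included."""
    return comb(n_inputs + degree, degree)


# Assumption: 28 = 33 dummy-encoded columns minus the 5 low_linearity_features.
n_inputs = 28
counts = {degree: n_poly_features(n_inputs, degree) for degree in (1, 2, 3)}
print(counts)

# Cross-check the formula against scikit-learn on a tiny placeholder array.
check = PolynomialFeatures(degree=3).fit(np.zeros((1, 4)))
assert check.n_output_features_ == n_poly_features(4, 3)
```

Under that assumption, the counts are 29, 435, and 4495 columns for degrees 1, 2, and 3, so a degree-3 fit works with thousands of mostly correlated columns.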


In [106]:
pipe = Pipeline([('poly', PolynomialFeatures()),('lr', LinearRegression())])
params = {'poly__degree' : [1,2,3]}

In [107]:
grid = GridSearchCV(pipe, param_grid=params, cv=5, scoring='neg_root_mean_squared_error')

In [108]:
# Remove low_linearity_features to improve speed.
# A new name is used so that re-running this cell doesn't try to drop the columns twice.
X_reduced = X_dummies.drop(low_linearity_features, axis=1)
grid.fit(X_reduced, y)

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('poly',
                                        PolynomialFeatures(degree=2,
                                                           include_bias=True,
                                                           interaction_only=False,
                                                           order='C')),
                                       ('lr',
                                        LinearRegression(copy_X=True,
                                                         fit_intercept=True,
                                                         n_jobs=None,
                                                         normalize=False))],
                                verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'poly__degree': [1, 2, 3]}, pre_dispatch='2*n_jobs',
             refit=True, return_train_score=False,
        

In [109]:
grid.best_params_

{'poly__degree': 1}

In [110]:
grid.cv_results_

{'mean_fit_time': array([5.34570217e-02, 1.00342236e+00, 9.42506470e+01]),
 'std_fit_time': array([6.07014395e-03, 4.33382558e-02, 2.77550928e+01]),
 'mean_score_time': array([0.00558496, 0.05445466, 0.3707098 ]),
 'std_score_time': array([0.00048862, 0.00421232, 0.12954889]),
 'param_poly__degree': masked_array(data=[1, 2, 3],
              mask=[False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'poly__degree': 1}, {'poly__degree': 2}, {'poly__degree': 3}],
 'split0_test_score': array([  -216307.27595099,  -3444563.38952802, -12921968.98179559]),
 'split1_test_score': array([  -201893.99365607,  -1635731.93880393, -10855103.8906067 ]),
 'split2_test_score': array([  -189147.29820089,   -172562.8344946 , -32314444.06725904]),
 'split3_test_score': array([ -199870.32849201,  -239660.31387549, -8279656.80102861]),
 'split4_test_score': array([  -196456.37761207, -11319070.98157596, -24250814.48264351]),
 'mean_test_score': array([  -200735.05478241,  

In [111]:
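# scoring='neg_root_mean_squared_error' returns the negated RMSE, so flip the sign.
# Index 0 corresponds to degree 1, the best parameter found above.
print("Test :: RMSE = " + "%.2f" % (-grid.cv_results_['mean_test_score'][0]))

Test :: RMSE = 200735.05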
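
GridSearchCV with scoring='neg_root_mean_squared_error' stores negated RMSE values, so that a higher score is always better. A minimal sketch of turning `cv_results_` into a per-degree table of positive RMSE values follows. It runs on synthetic data, and the names `demo_X`, `demo_y`, `demo_pipe`, `demo_grid`, and `rmse_by_degree` are assumptions for illustration, not the notebook's objects.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data with a purely linear signal (not the housing data).
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(150, 3))
demo_y = demo_X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=150)

demo_pipe = Pipeline([("poly", PolynomialFeatures()), ("lr", LinearRegression())])
demo_grid = GridSearchCV(
    demo_pipe,
    param_grid={"poly__degree": [1, 2, 3]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
demo_grid.fit(demo_X, demo_y)

# Flip the sign so the table reads as a plain RMSE (lower is better).
rmse_by_degree = pd.DataFrame({
    "degree": [p["poly__degree"] for p in demo_grid.cv_results_["params"]],
    "mean_rmse": -demo_grid.cv_results_["mean_test_score"],
    "std_rmse": demo_grid.cv_results_["std_test_score"],
}).set_index("degree")
print(rmse_by_degree)
```

The same table could be built from the notebook's `grid` object to compare all three degrees side by side instead of reading the raw dictionary.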


# Conclusion

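Sadly, PolynomialFeatures of degrees 2 and 3 didn't improve my model. GridSearchCV selected degree 1, and the cross-validated RMSE for the higher degrees was far worse on most folds (in the millions for degree 3, against roughly 200,000 for degree 1), which suggests the higher-degree models overfit.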

# Final

#### Best model:

- LinearRegression
- Dummy categorical features

#### Notes

- PolynomialFeatures didn't improve the model
- Including low_linearity_features improves the model only slightly: ['yr_built', 'yr_renovated', 'long', 'sqft_lot', 'sqft_lot15']