## Introduction

### Approach
- **Models**: Linear regression, Lasso and Ridge methods
- **Feature Selection**: Based on Pearson correlation with the target feature (`price`) and p-value <= 0.05

### Definitions of the Variables
- **id** - Unique ID for each home sold
- **date** - Date of the home sale
- **price** - Price of each home sold
- **bedrooms** - Number of bedrooms
- **bathrooms** - Number of bathrooms, where .5 accounts for a room with a toilet but no shower
- **sqft_living** - Square footage of the apartment's interior living space
- **sqft_lot** - Square footage of the land space
- **floors** - Number of floors
- **waterfront** - A dummy variable for whether the apartment was overlooking the waterfront or not
- **view** - An index from 0 to 4 of how good the view of the property was
- **condition** - An index from 1 to 5 on the condition of the apartment
- **grade** - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.
- **sqft_above** - The square footage of the interior housing space that is above ground level
- **sqft_basement** - The square footage of the interior housing space that is below ground level
- **yr_built** - The year the house was initially built
- **yr_renovated** - The year of the house's last renovation
- **zipcode** - What zipcode area the house is in
- **lat** - Latitude
- **long** - Longitude
- **sqft_living15** - The square footage of interior housing living space for the nearest 15 neighbors
- **sqft_lot15** - The square footage of the land lots of the nearest 15 neighbors

*Thanks to user Nova19 for the column definitions; https://www.kaggle.com/harlfoxem/housesalesprediction/discussion/207885*

## Package & Data Imports

In [None]:
# Programming
import pandas as pd
import numpy as np
import warnings

# Modeling
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from scipy import stats

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Importing Dataset
df = pd.read_csv('/kaggle/input/housesalesprediction/kc_house_data.csv').drop('id', axis=1)

# Settings
pd.set_option('display.max_columns', None)
warnings.filterwarnings("ignore")

## EDA DataFrame
- No missing values, so no imputation step is needed in the pipelines I'll create down the line
- Mix of categorical and continuous variables
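
One quick way to back up the no-missing-values observation is to count nulls per column. The sketch below runs on a small hypothetical frame (`toy`, with one missing value injected on purpose) rather than on the actual dataset, so the names and values are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df (the real data comes from kc_house_data.csv)
toy = pd.DataFrame({
    'price': [221900.0, 538000.0, 180000.0],
    'bedrooms': [3, 3, 2],
    'sqft_living': [1180, 2570, np.nan],  # one missing value injected on purpose
})

# Count missing values per column; all zeros would mean no imputation step is needed
missing_per_column = toy.isna().sum()
needs_imputation = bool(missing_per_column.any())

print(missing_per_column)
print('Imputation needed: {}'.format(needs_imputation))
```

Running the same two lines against `df` should report zero for every column if the observation above holds.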

In [None]:
print('# Observations: {}'.format(df.shape[0]))
print('# Variables: {}'.format(df.shape[1]))
print('')
print(df.info())
df.head()

## EDA Continuous Variables
- Continuous features selected based on ***Pearson Correlation Coefficients (>=0.3)*** and ***p-values (<0.05)***:   ['sqft_living', 'sqft_above', 'sqft_basement', 'sqft_living15', 'lat']
- The p-values of the selected features are all reported as 0.0 (effectively zero at floating-point precision), which gives a high degree of confidence that their correlations with price are statistically significant
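
The selection rule above could also be applied programmatically instead of reading values off the heatmap. The following is a minimal sketch, not the method used in this notebook: `select_by_pearson` is a hypothetical helper, the data is synthetic, and taking the absolute value of the coefficient (so that strong negative correlations also qualify) is an assumption that goes beyond the stated >=0.3 rule.

```python
import numpy as np
import pandas as pd
from scipy import stats

def select_by_pearson(frame, target_col, min_abs_corr=0.3, max_p=0.05):
    '''Returns the columns whose |Pearson r| with target_col is >= min_abs_corr and whose p-value is < max_p'''
    selected = []
    for col in frame.columns.drop(target_col):
        coeff, p_value = stats.pearsonr(frame[col], frame[target_col])
        if abs(coeff) >= min_abs_corr and p_value < max_p:
            selected.append(col)
    return selected

# Synthetic data (assumption): 'related' drives the target, 'noise' is independent of it
rng = np.random.default_rng(11)
n = 500
signal = rng.normal(size=n)
toy = pd.DataFrame({
    'related': signal,
    'noise': rng.normal(size=n),
    'price': 3.0 * signal + rng.normal(size=n),
})

selected_features = select_by_pearson(toy, 'price')
print(selected_features)
```

On the real data, something like `select_by_pearson(df_cont, 'price')` would be the analogous call, assuming `df_cont` holds only numeric columns.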

In [None]:
# Creating a DataFrame of the continuous variables (plus the target)
cont_var = ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'sqft_living15', 'sqft_lot15', 'zipcode', 'lat', 'long', 'price']
df_cont = df[cont_var]

# Correlation Heatmap
sns.set(rc={'figure.figsize':(20,10)})
sns.heatmap(df_cont.corr(), vmin=-1, vmax=1, cmap="Spectral", annot=True)
plt.show()
plt.close()

# Pearson Corr Coeff & p-Values
def pearsoncorr_pval(feature, target):
    '''Computes the Pearson coefficient and p-value of a given pair of arrays.
    Function requires: from scipy import stats'''
    pearson_coeff, p_value = stats.pearsonr(feature, target)
    return pearson_coeff, p_value

# Printing the results
print('')
for name in ['sqft_living', 'sqft_above', 'sqft_basement', 'sqft_living15', 'lat']:
    coeff, p_val = pearsoncorr_pval(df[name], df.price)
    print('{} | Pearson Coeff: {} |'.format(name, coeff), 'p-value: {}'.format(p_val))
print('')

# Final Continuous Variables list
cont_var_selected = ['sqft_living', 'sqft_above', 'sqft_living15', 'sqft_basement', 'lat']

## EDA Categorical Variables
- Categorical features selected:   ['bedrooms', 'bathrooms', 'waterfront', 'view', 'grade']

In [None]:
cat_var = ['bedrooms', 'bathrooms', 'floors', 'waterfront', 'view', 'condition', 'grade', 'yr_built', 'yr_renovated', 'price']
df_cat = df[cat_var]

# Pairplot
sns.set(rc={'figure.figsize':(20,10)})
sns.pairplot(df_cat)
plt.show()
plt.close()

# Boxplots (price is the y-axis, so it is skipped as an x variable)
sns.set(rc={'figure.figsize':(20,10)})
for feature in cat_var[:-1]:
    sns.boxplot(x=feature, y='price', data=df_cat)
    plt.show()
    plt.close()
    
# Final categorical variables list
cat_var_selected = ['bedrooms', 'bathrooms', 'waterfront', 'view', 'grade']

## Model

#### Model/Hold-Out and Training/Test Splits
- Step 0: Select only the desired variables based on EDA
- Step 1: Arrange all features and the target as numpy arrays
- Step 2: Create *Model* and *Hold-Out* sets. The *Hold-Out* set (20% of the data, never seen during fitting or tuning) provides a fair test of model performance on unseen data
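
As a rough illustration of Step 2, the sketch below splits a small synthetic array with the same `test_size` and `random_state` used in this notebook and checks that the two sets do not share rows. The arrays and the `model_X` / `holdout_X` names are placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and target (assumption: 100 rows, 2 features)
features = np.arange(200).reshape(100, 2)
target = np.arange(100)

# Same split settings as the notebook: 80% model set, 20% hold-out set
model_X, holdout_X, model_y, holdout_y = train_test_split(
    features, target, test_size=0.2, random_state=11)

print(model_X.shape, holdout_X.shape)

# Every target value is unique here, so an empty intersection means no row is in both sets
overlap = set(model_y) & set(holdout_y)
print('Rows shared between sets: {}'.format(len(overlap)))
```

Because the hold-out rows never reach `fit`, any score computed on them reflects data the tuned model has not seen.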

In [None]:
# Features based on EDA
features_selected = cont_var_selected + cat_var_selected

# Features/Target in numpy arrays
features = df[features_selected].values
target = df.price.values

# Model/Hold-out sets
model_features, holdout_features, model_target, holdout_target = \
    train_test_split(features, target, test_size=0.2, random_state=11)

#### Lasso Regression
- Step 0: Create a pipeline that scales the data, adds polynomial features (which help capture non-linear relationships in a linear method) and runs the desired model
- Step 1: Use GridSearchCV to tune the hyperparameter alpha and apply cross-validation (5-fold)
- Step 2: Obtain the cross-validated score on the training (model) data and the score on the test (holdout) data, then plot the results
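
To make the polynomial-features step in Step 0 more concrete, the sketch below shows how the default degree-2 expansion grows a 10-column input, which matches the number of features selected in the EDA. The input values are random placeholders; only the column counts matter here.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Random placeholder matrix with 10 columns, the same count as features_selected
X = np.random.default_rng(0).normal(size=(5, 10))

# Default degree is 2; include_bias=False drops the constant column, as in the pipeline
poly = PolynomialFeatures(include_bias=False)
X_poly = poly.fit_transform(X)

# 10 original columns + 10 squared terms + 45 pairwise interactions = 65
print('{} -> {}'.format(X.shape[1], X_poly.shape[1]))
```

The regularized model is therefore fitted on 65 columns rather than 10, which is part of why tuning alpha matters in the steps that follow.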

In [None]:
# Pipeline for Lasso
pipe_steps = [('Scaling', StandardScaler()), ('Polynomial Features', PolynomialFeatures(include_bias=False)), ('Lasso Linear Regression', Lasso())]
pipeline = Pipeline(pipe_steps)
# Tuning
# Used print(pipeline.get_params().keys()) to get the name for the parameter to tune
# Started testing the model with the following list of options to quickly check the ballpark of the optimal hyperparameter -> [0.0001, 0.001, 0.01, 0.1, 0, 1, 10, 100, 1000, 10000]
param_grid = {'Lasso Linear Regression__alpha': np.arange(1000,1500, step=10)}
lasso_grid = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5, scoring='r2')
# Fitting
lasso_grid.fit(model_features, model_target)
alpha = lasso_grid.best_params_
r2 = lasso_grid.best_score_ 
r2_unseen = lasso_grid.score(holdout_features, holdout_target)
prediction = lasso_grid.predict(holdout_features)
# Printing Results
print('LASSO PERFORMANCE')
print('Optimal Hyperparameter Alpha: {}'.format(alpha['Lasso Linear Regression__alpha']))
# best_score_ is the mean cross-validated r2 on the model data
print('r2 Score on Training Data (mean CV): {}'.format(round(r2, 5)))
print('r2 Score on Hold-Out Data: {}'.format(round(r2_unseen, 5)))
# Plotting Results
sns.set(rc={'figure.figsize':(10,5)})
ax = sns.kdeplot(holdout_target)
ax = sns.kdeplot(prediction)
ax.legend(['target', 'prediction'])
plt.show()
plt.close()
# Deleting Variables
del pipe_steps, pipeline, param_grid, alpha, r2, r2_unseen, prediction

#### Ridge Regression
- Step 0: Create a pipeline that scales the data, adds polynomial features (which help capture non-linear relationships in a linear method) and runs the desired model
- Step 1: Use GridSearchCV to tune the hyperparameter alpha and apply cross-validation (5-fold)
- Step 2: Obtain the cross-validated score on the training (model) data and the score on the test (holdout) data, then plot the results
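
The coarse first pass over alpha described in Step 1 could be sketched with a log-spaced grid, and the hold-out error can be reported as RMSE alongside r2. This is an illustration on synthetic data, not the notebook's actual run: the data-generating formula, the `coarse_grid` range and the variable names are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (assumption): a linear term plus a squared term and a little noise
rng = np.random.default_rng(11)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)
X_model, X_hold, y_model, y_hold = train_test_split(X, y, test_size=0.2, random_state=11)

# Same pipeline layout and step names as the Ridge cell
pipeline = Pipeline([('Scaling', StandardScaler()),
                     ('Polynomial Features', PolynomialFeatures(include_bias=False)),
                     ('Ridge Linear Regression', Ridge())])

# Coarse pass: one candidate per order of magnitude, from 1e-4 to 1e4
coarse_grid = {'Ridge Linear Regression__alpha': np.logspace(-4, 4, 9)}
search = GridSearchCV(estimator=pipeline, param_grid=coarse_grid, cv=5, scoring='r2')
search.fit(X_model, y_model)

best_alpha = search.best_params_['Ridge Linear Regression__alpha']
rmse = np.sqrt(mean_squared_error(y_hold, search.predict(X_hold)))
print('Best alpha from coarse grid: {}'.format(best_alpha))
print('Hold-Out RMSE: {}'.format(round(rmse, 5)))
```

Once the coarse pass identifies the right order of magnitude, a finer grid around it (such as the `np.arange` range used in this notebook) can narrow the choice. RMSE is in the units of the target, which can be easier to interpret than r2 alone.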

In [None]:
# Pipeline for Ridge
pipe_steps = [('Scaling', StandardScaler()), ('Polynomial Features', PolynomialFeatures(include_bias=False)), ('Ridge Linear Regression', Ridge())]
pipeline = Pipeline(pipe_steps)
# Tuning
# Used print(pipeline.get_params().keys()) to get the name for the parameter to tune
# Started testing the model with the following list of options to quickly check the ballpark of the optimal hyperparameter -> [0.0001, 0.001, 0.01, 0.1, 0, 1, 10, 100, 1000, 10000]
param_grid = {'Ridge Linear Regression__alpha': np.arange(1000,1500, step=10)}
ridge_grid = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5, scoring='r2')
# Fitting
ridge_grid.fit(model_features, model_target)
alpha = ridge_grid.best_params_
r2 = ridge_grid.best_score_
r2_unseen = ridge_grid.score(holdout_features, holdout_target)
prediction = ridge_grid.predict(holdout_features)
# Printing Results
print('RIDGE PERFORMANCE')
print('Optimal Hyperparameter Alpha: {}'.format(alpha['Ridge Linear Regression__alpha']))
# best_score_ is the mean cross-validated r2 on the model data
print('r2 Score on Training Data (mean CV): {}'.format(round(r2, 5)))
print('r2 Score on Hold-Out Data: {}'.format(round(r2_unseen, 5)))
# Plotting Results
sns.set(rc={'figure.figsize':(10,5)})
ax = sns.kdeplot(holdout_target)
ax = sns.kdeplot(prediction)
ax.legend(['target', 'prediction'])
plt.show()
plt.close()
# Deleting Variables
del pipe_steps, pipeline, param_grid, alpha, r2, r2_unseen, prediction