# Predicting House Prices
----------------------------------------------

In this notebook, I will create a model to predict house prices in King County using the King County house sales dataset.

# Table of Contents

* [Setup](#1)
* [Exploratory Analysis](#2)
    - [Summary](#3)
    - [Finding categorical and continuous variables](#4)
    - [Continuous](#5)
        + [Statistical Significance](#6)
        + [Conclusion: Continuous](#7)
    - [Categorical](#8)
        + [Conclusion: Categorical](#9)
    - [Exploratory Conclusion](#10)
* [Model Development and Evaluation](#11)
* [Setup](#12)
* [Multiple Linear Regression Model](#13)
    - [Building the model](#14)
    - [Testing and evaluating the model](#15)
* [Polynomial Regression and Normalization](#16)
    - [Creating a pipeline](#17)
* [Ridge Regression](#18)
* [Cross-validation](#19)
* [Conclusion](#20)
* [Author](#21)

Without further ado, let's get started.

# Setup <a id='1'></a>

Importing the libraries and getting the data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats
%matplotlib inline

In [None]:
sns.set(color_codes=True)
pd.set_option('display.max_columns', None)

In [None]:
file_path = '../input/housesalesprediction/kc_house_data.csv'
df = pd.read_csv(file_path)

# Exploratory Analysis <a id='2'></a>
----------------------

In [None]:
df.head()

## Summary <a id='3'></a>

Now let's see the column types in the dataframe.

In [None]:
df.dtypes

Summary of the data

In [None]:
df.shape

In [None]:
df.describe()

## Finding categorical and continuous variables <a id='4'></a>

We can see that the columns _bedrooms, bathrooms, floors, waterfront, view, condition,_ and _grade_ take only a few distinct values and can be treated as categorical. Thus, we should use boxplots to explore the relationship between each of them and price.

As for the rest, a scatterplot would be a good idea.

In [None]:
# The columns
categorical = ['bedrooms', 'bathrooms', 'floors', 'waterfront', 'view', 'condition', 'grade']
continous = ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']
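
One way to make this split less subjective is to count the distinct values in each column. The sketch below uses a small synthetic frame (the values and the cut-off of 5 are assumptions for illustration, not taken from the dataset) and flags low-cardinality columns as categorical candidates.

```python
import pandas as pd

# Synthetic stand-in for the housing data; values are made up for illustration.
toy = pd.DataFrame({
    'floors': [1, 2, 1, 3, 2, 1, 2, 1],
    'waterfront': [0, 0, 1, 0, 0, 0, 1, 0],
    'sqft_living': [1180, 2570, 770, 1960, 1680, 5420, 1715, 1060],
})

# Hypothetical cut-off: columns with at most this many distinct values
# are treated as categorical candidates.
max_unique = 5

unique_counts = toy.nunique()
categorical_candidates = unique_counts[unique_counts <= max_unique].index.tolist()
continuous_candidates = unique_counts[unique_counts > max_unique].index.tolist()

print(unique_counts)
print(categorical_candidates, continuous_candidates)
```

On the real data, the cut-off would need to be chosen by looking at the counts, so this is a starting point rather than a rule.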

## Continuous <a id='5'></a>

Let's explore the relationship between the continuous variables and price.

We will use two methods:
* The Pearson coefficient
* Scatter plot

In [None]:
# Pearson coefficient
continous_df = df[continous]
continous_df.corrwith(df['price'])
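
To make clear what `corrwith` is computing, here is a minimal sketch that calculates the Pearson coefficient by hand and compares it with pandas. The numbers are made up for illustration and are not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up living areas and prices, for illustration only.
toy = pd.DataFrame({
    'sqft_living': [1000, 1500, 2000, 2500, 3000],
    'price': [200000, 310000, 390000, 520000, 600000],
})

x = toy['sqft_living'].to_numpy(dtype=float)
y = toy['price'].to_numpy(dtype=float)

# Pearson r = covariance(x, y) / (std(x) * std(y)), written with centred sums.
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# pandas computes the same quantity column by column.
r_pandas = toy[['sqft_living']].corrwith(toy['price'])['sqft_living']

print(r_manual, r_pandas)
```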

In [None]:
for column_name in continous:
    plt.figure(figsize=(10,8))
    sns.scatterplot(x=column_name, y='price', data=df)

The features that appear most correlated with price are _sqft_living, sqft_above,_ and _sqft_living15._ _lat_ and _sqft_basement_ show a moderate correlation as well.

### Statistical Significance <a id='6'></a>

Let's find out whether these correlations are statistically significant. We will use the p-value for that.

In [None]:
# Pearson coefficient and p-value for each candidate feature
for column_name in ['sqft_living', 'sqft_living15', 'sqft_above', 'sqft_basement', 'lat']:
    pearson_coeff, p_value = stats.pearsonr(df[column_name], df['price'])
    print(column_name, "-- Pearson's Coefficient is: ", pearson_coeff, " and the p-value is: ", p_value)

The p-values are extremely low, to the point that they look like 0.

With p-values this small, we can confidently say that the correlations between these variables and price are statistically significant.
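
As a rough illustration of how the p-value separates a real correlation from noise, the sketch below runs `stats.pearsonr` on synthetic data. The variables `area`, `price`, and `noise` are assumptions generated for this example, not columns from the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200

# Synthetic data: 'area' drives 'price', 'noise' is unrelated to it.
area = rng.normal(2000, 500, size=n)
price = 200 * area + rng.normal(0, 100000, size=n)
noise = rng.normal(0, 1, size=n)

r_area, p_area = stats.pearsonr(area, price)
r_noise, p_noise = stats.pearsonr(noise, price)

# Scientific notation shows small but non-zero p-values
# that rounded output might display as 0.
print(f"area:  r = {r_area:.3f}, p = {p_area:.3e}")
print(f"noise: r = {r_noise:.3f}, p = {p_noise:.3e}")
```

Keep in mind that with a large number of rows even a weak correlation can produce a tiny p-value, so the coefficient itself still matters when choosing features.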

### Conclusion: Continuous <a id='7'></a>

The features that can predict price are _sqft_living, sqft_living15, sqft_above, sqft_basement,_ and _lat._

## Categorical <a id='8'></a>

Let's figure out the relationship between categorical variables and price. We will first plot them. 

In [None]:
# Categorical
for column_name in categorical:
    plt.figure(figsize=(10,8))
    sns.boxplot(x=column_name, y='price', data=df)

It seems that everything except condition would be a good predictor. 
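
The boxplots can be complemented with a numeric summary per category. The sketch below groups a small made-up frame by `grade` and reports the median price and row count for each group; the rows are invented for illustration.

```python
import pandas as pd

# Made-up rows for illustration; 'grade' and 'price' mirror the dataset's column names.
toy = pd.DataFrame({
    'grade': [6, 6, 7, 7, 7, 8, 8, 9],
    'price': [250000, 270000, 380000, 400000, 420000, 560000, 600000, 850000],
})

# Median and count per category: the median is the centre line of each box in a boxplot.
summary = toy.groupby('grade')['price'].agg(['median', 'count'])
print(summary)
```

A steady change in the median across categories suggests the variable carries information about price, while the count shows how much data supports each box.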

### Conclusion: Categorical <a id='9'></a>

We concluded that the variables of interest are _bedrooms, bathrooms, floors, waterfront, view,_ and _grade._

## Exploratory Conclusion <a id='10'></a>

The features that we found that we can use to predict the house price are:
* sqft_above
* sqft_living15
* sqft_living
* sqft_basement
* bedrooms
* bathrooms
* waterfront
* floors
* lat
* view 
* grade

# Model Development and Evaluation <a id='11'></a>
----------------------------------

Now, let's move onto creating a model.

I will be building a linear regression model, since we are trying to predict a continuous variable that has a roughly linear relationship with its independent variables (the features).

# Setup <a id='12'></a>

First, we need to import the libraries and set some options.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [None]:
%matplotlib inline
sns.set(color_codes=True)
pd.set_option('display.max_columns', None)

# Multiple Linear Regression Model <a id='13'></a>

## Building the model <a id='14'></a>

We will first build a multiple linear regression (MLR) model. From the exploratory analysis, we know that the important features are:
* sqft_above
* sqft_living15
* sqft_living
* sqft_basement
* bedrooms
* bathrooms
* waterfront
* floors
* lat
* view 
* grade

The first step would be to create X and y. 

In [None]:
X = df[['sqft_above', 'sqft_living15', 'sqft_living', 'sqft_basement', 'bedrooms', 'bathrooms', 'waterfront', 'floors', 'lat', 'view' , 'grade']]
y = df[['price']]

Then, we need to split the data. 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
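
To show what `test_size` and `random_state` do, here is a small sketch on a synthetic frame with 100 rows. The frame and the names `toy_X` and `toy_y` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with 100 rows, for illustration only.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({'sqft_living': rng.normal(2000, 500, size=100)})
toy_y = pd.DataFrame({'price': 200 * toy_X['sqft_living'] + rng.normal(0, 50000, size=100)})

X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.2, random_state=0)

# With test_size=0.2, 20 of the 100 rows are held out for testing.
print(X_tr.shape, X_te.shape)

# The same random_state reproduces the same split.
X_tr2, _, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=0)
print(X_tr.index.equals(X_tr2.index))
```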

Now, let's create the model.

In [None]:
lm = LinearRegression()
lm.fit(X_train, y_train)

## Testing and evaluating the model <a id='15'></a>

Let's test the model by getting prediction values and making a distribution plot.

In [None]:
yhat_test = lm.predict(X_test)
yhat_test_df = pd.DataFrame(yhat_test, columns=['predicted_price'])

In [None]:
plt.figure(figsize=(12,10))
ax = sns.kdeplot(y_test['price'])
ax = sns.kdeplot(yhat_test_df['predicted_price'], ax=ax)
ax.legend(['y', 'y_hat'], fontsize=13);

That looks like a really good estimate. We should also get some numeric values. Let's get the R-squared value.

In [None]:
lm.score(X_test, y_test)

So, there's room for improvement. 

If we take a look at the graphs in the exploratory analysis, we can see that some features (say, grade) have a non-linear relationship with price. So, we need a polynomial regression model to take those into account properly.

Furthermore, it would be helpful to normalize the variables so that a few variables with large scales do not dominate the model.
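
As a side note on the R-squared value returned by `score`: it is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this by hand on synthetic data (the data and the names `X_toy`, `y_toy`, and `model` are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)

# Fit on the first 150 rows and hold out the last 50.
model = LinearRegression().fit(X_toy[:150], y_toy[:150])
X_hold, y_hold = X_toy[150:], y_toy[150:]

# R-squared = 1 - SS_res / SS_tot
y_pred = model.predict(X_hold)
ss_res = ((y_hold - y_pred) ** 2).sum()
ss_tot = ((y_hold - y_hold.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_hold, y_hold))
```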

# Polynomial Regression and Normalization <a id='16'></a>

The best way to go at it would be to create a pipeline.

Let's import the modules first.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

## Creating a pipeline <a id='17'></a>

We will create a pipeline that first normalizes the values (using `StandardScaler`), then creates polynomial features (degree 2, the default for `PolynomialFeatures`), and finally uses a linear model to predict the price.

Let's create the pipeline.

In [None]:
pipe_info = [('Normalize', StandardScaler()), ('Polynomial Features', PolynomialFeatures(include_bias=False)), ('Linear Model', LinearRegression())]
pipe = Pipeline(pipe_info)
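
To see what the `PolynomialFeatures` step adds, the sketch below expands a tiny two-column array and then counts the output columns for 11 input features. The array `X_toy` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two toy features per row, for illustration only.
X_toy = np.array([[2.0, 3.0],
                  [1.0, 4.0]])

poly = PolynomialFeatures(include_bias=False)  # degree defaults to 2
X_poly = poly.fit_transform(X_toy)

# Columns are: x0, x1, x0^2, x0*x1, x1^2
print(X_poly)

# With 11 input features, the same expansion gives 11 + 66 = 77 columns.
n_out = PolynomialFeatures(include_bias=False).fit(np.zeros((1, 11))).n_output_features_
print(n_out)
```

The growth in the number of columns is one reason regularization can help in a pipeline like this.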

Now let's use the pipeline.

In [None]:
pipe.fit(X_train, y_train)

In [None]:
yhat_test_pipe = pipe.predict(X_test)
yhat_test_pipe_df = pd.DataFrame(yhat_test_pipe, columns=['predicted_price'])

Now let's plot it and see what we get.

In [None]:
plt.figure(figsize=(12,10))
ax = sns.kdeplot(y_test['price'])
ax = sns.kdeplot(yhat_test_pipe_df['predicted_price'], ax=ax)
ax.legend(['y', 'y_hat'], fontsize=13);

In [None]:
pipe.score(X_test, y_test)

A better score. But there's further room for improvement. 

Let's try Ridge Regression because the features could be correlated (for example, having more bedrooms is likely to imply more bathrooms).

# Ridge Regression <a id='18'></a>

We will create a pipeline that first normalizes the values, then creates polynomial features, and finally applies ridge regression.

Mandatory Imports:

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

In [None]:
pipe_info_ridge = [('Normalize', StandardScaler()), ('Polynomial Features', PolynomialFeatures(include_bias=False)), ('Regression', Ridge())]
ridge_pipe = Pipeline(pipe_info_ridge)

For ridge regression, we would be optimizing the hyper-parameter $\alpha$ by grid search. Let's get the parameters.

In [None]:
ridge_pipe.get_params().keys()

So, we would be using the `Regression__alpha` parameter then. Let's create the dict with parameters.

In [None]:
hyper_params_dict = {'Regression__alpha': [0.0001, 0.001, 0.01, 0.1, 0, 1, 10, 100, 1000, 10000]}
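
The `Regression__alpha` key follows scikit-learn's `<step name>__<parameter name>` convention for pipelines. Here is a minimal sketch with a hypothetical two-step pipeline (the step name `Scale` is an assumption for this example) showing how the key is found and used.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step names here are illustrative; any name works as the prefix.
toy_pipe = Pipeline([('Scale', StandardScaler()), ('Regression', Ridge())])

# Parameters of a step are addressed as '<step name>__<parameter name>'.
alpha_keys = [key for key in toy_pipe.get_params() if key.endswith('__alpha')]
print(alpha_keys)

# The same key can be used to set the value directly.
toy_pipe.set_params(Regression__alpha=10)
print(toy_pipe.named_steps['Regression'].alpha)
```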

Time to use the grid search.

In [None]:
grid = GridSearchCV(estimator=ridge_pipe, param_grid=hyper_params_dict, scoring='r2', n_jobs=-1, cv=4)

In [None]:
grid.fit(X_train, y_train)

Let's find the best estimator and its parameter.

In [None]:
grid.best_params_

In [None]:
best_ridge = grid.best_estimator_
best_ridge

The distribution plot:

In [None]:
yhat_ridge = best_ridge.predict(X_test)
yhat_ridge_df = pd.DataFrame(yhat_ridge, columns=['predicted_price'])
plt.figure(figsize=(12,10))
ax = sns.kdeplot(y_test['price'])
ax = sns.kdeplot(yhat_ridge_df['predicted_price'], ax=ax)
ax.legend(['y', 'y_hat'], fontsize=13);

Finally, the best mean cross-validated R-squared score found by the grid search:

In [None]:
grid.best_score_
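
Note that `best_score_` is averaged over the cross-validation folds of the training data, so it is not the same number as a score on the held-out test set. The sketch below shows both on synthetic data; the data, the grid values, and the names `toy_pipe` and `toy_grid` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

toy_pipe = Pipeline([('Scale', StandardScaler()), ('Regression', Ridge())])
toy_grid = GridSearchCV(toy_pipe, {'Regression__alpha': [0.01, 1, 100]}, scoring='r2', cv=4)
toy_grid.fit(X_tr, y_tr)

# Mean R-squared across the CV folds of the training data.
cv_r2 = toy_grid.best_score_
# R-squared on rows the search never saw.
test_r2 = toy_grid.best_estimator_.score(X_te, y_te)
print(cv_r2, test_r2)
```

Reporting both numbers makes it easier to compare the ridge model with the earlier models, which were scored on the test set.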

# Cross-validation <a id='19'></a>

Although cross-validation was already taken into account during the grid search for the ridge regression, let's run a k-fold cross-validation with 5 folds regardless. We will be getting the R-squared values.

Let's get started. 

Mandatory Imports.

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
cv_scores = cross_val_score(estimator=best_ridge, X=X, y=y, cv=5)
cv_scores

Finally, let's get a summary of the array.

In [None]:
pd.Series(cv_scores).describe()

That seems like a decent result. 
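
One variation worth trying is shuffling the rows before splitting, since an integer `cv` splits a regression dataset in its stored order. The sketch below uses an explicit `KFold` on synthetic data; the data and the choice of `alpha=1.0` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(250, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.5, size=250)

# cv=5 splits the rows in their stored order; an explicit KFold can shuffle them first.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(alpha=1.0), X_toy, y_toy, cv=folds, scoring='r2')

print(scores)
print(scores.mean(), scores.std())
```

The spread of the fold scores gives a rough sense of how stable the mean is.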

# Conclusion <a id='20'></a>

There we go: a linear regression model to predict house prices.

The best model was a Ridge Regression model with $\alpha = 1000$, which achieved a mean cross-validation score of 0.73.

# Author <a id='21'></a>
By Abhinav Garg