 House price prediction using regression

*********************************************************************

The steps followed to predict house prices are:

1. Data Ingestion 

2. Data Exploration 

3. Data Transformation 

4. Feature Selection

5. Train-test split

6. Prediction

In [None]:
import pandas as pd 
%matplotlib inline

In [None]:
# Data Ingestion using pandas
contents = pd.read_csv('../input/kc_house_data.csv')

**Data Exploration using Pandas**

In [None]:
# Data Exploration 
contents.head()

In [None]:
# Features 
len(contents.columns)

*Understanding continuous and categorical attributes*

In [None]:
contents.info()

In [None]:
# get_dtype_counts() was removed in pandas 1.0; count the columns per dtype instead
contents.dtypes.value_counts()

**Data Exploration**

Here we perform:

Univariate analysis: analyzing each feature individually.

Bivariate analysis: analyzing pairs of features together.

***Univariate Analysis***
--------------------------------------------------

For continuous variables, we can compute the mean, median and interquartile range (IQR).

Histograms, box plots and violin plots are used for visualization.
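As a small illustration of these summary statistics, the sketch below computes the mean, median and IQR with pandas. The five prices are made-up toy values, not rows from the dataset.

```python
import pandas as pd

# Hypothetical toy prices (assumption: not taken from kc_house_data.csv)
prices = pd.Series([100, 200, 300, 400, 1000], name="price")

mean_price = prices.mean()
median_price = prices.median()

# IQR is the distance between the 75th and 25th percentiles
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1

print(mean_price, median_price, iqr)
```

The single large value pulls the mean above the median, which is the kind of asymmetry the plots in this section are meant to reveal.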

In [None]:
# data for mean house price 
contents.describe()

In [None]:
# get the mean price for the house 
target = contents['price'].tolist()
mean_price = sum(target)/len(target)
print(mean_price)

In [None]:
# Data for the mean,high and low sales price 
meanrange = contents[(contents.price > 540000) & (contents.price <= 550000) ]
lowrange = contents[(contents.price > 70000) & (contents.price <= 75000) ]
highrange = contents[(contents.price > 7000000) & (contents.price <= 7700000 ) ]

In [None]:
low_price = min(target)
print(low_price)
high_price = max(target)
print(high_price)

In [None]:
# Use the actual row count instead of a hard-coded number
print("Out of", len(contents), "records")
print("The records in mean range", len(meanrange))
print("The records in high range", len(highrange))
print("The records in low range", len(lowrange))

In [None]:
len(contents)

In [None]:
# Bar plot of the 'bedrooms' feature in the given dataset
contents.bedrooms.value_counts().plot(kind = 'bar')

In [None]:
contents.boxplot(['lat'])

In [None]:
contents.boxplot(['long'])

In [None]:
contents.boxplot([ 'sqft_lot', 'sqft_living'])

The box plots above show outliers in these features, so outlier removal should be part of the data cleaning step.

------------------------------------------------------------------

**Violin Plots**

In [None]:
import seaborn as sns
sns.set(color_codes=True)

In [None]:
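# Pass the data by keyword; recent seaborn releases no longer treat the first positional argument as x
sns.violinplot(x=contents['yr_renovated'], color = 'cyan')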

In [None]:
sns.violinplot(x=contents['yr_built'], color = 'cyan')

**Skewness and Kurtosis analysis**

In [None]:
from scipy import stats
stats.skew(contents.sqft_living, bias=False)

In [None]:
stats.skew(contents.sqft_lot15, bias = False)

In [None]:
stats.kurtosis(contents.sqft_living15, bias=False)

In [None]:
stats.kurtosis(contents.sqft_lot15, bias=False)
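To help read the numbers returned by the cells above, the sketch below compares a roughly symmetric sample with a right-skewed one. Both samples are synthetic and chosen only for illustration. Note that `stats.kurtosis` reports excess kurtosis by default, so a normal distribution scores close to 0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples (assumption: illustrative only, not dataset columns)
symmetric = rng.normal(loc=2000, scale=300, size=5000)
right_skewed = rng.lognormal(mean=9, sigma=1, size=5000)

skew_sym = stats.skew(symmetric, bias=False)
skew_right = stats.skew(right_skewed, bias=False)

# Excess kurtosis: about 0 for a normal sample, large for heavy tails
kurt_sym = stats.kurtosis(symmetric, bias=False)
kurt_right = stats.kurtosis(right_skewed, bias=False)

print("skewness:", skew_sym, skew_right)
print("kurtosis:", kurt_sym, kurt_right)
```

A large positive skewness indicates a long right tail, which is typical of area and price features.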

**BiVariate Analysis**
-----------------------------------------------------

Here we use scatter plots for our analysis.

**Linear correlation between features**

In [None]:
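# Restrict to numeric columns: 'date' is still a string at this point
lin_cor = contents.select_dtypes(include='number').corr(method = 'pearson')['price']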
lin_cor = lin_cor.sort_values(ascending=False)
print(lin_cor)
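For reference, the Pearson coefficient used above is the covariance of two features divided by the product of their standard deviations. The sketch below computes it by hand on a made-up five-row frame and checks the result against pandas. The values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (assumption: not rows from the dataset)
toy = pd.DataFrame({
    "sqft_living": [1000, 1500, 2000, 2500, 3000],
    "price": [200000, 310000, 390000, 520000, 600000],
})

# Center both columns on their means
x = toy["sqft_living"] - toy["sqft_living"].mean()
y = toy["price"] - toy["price"].mean()

# Pearson r = sum(x*y) / sqrt(sum(x^2) * sum(y^2))
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = toy["sqft_living"].corr(toy["price"])

print(r_manual, r_pandas)
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little linear relationship.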

**Visualization of linear correlation**

In [None]:
import matplotlib.pyplot as plt
plt.scatter(target,contents.sqft_living)

In [None]:
plt.scatter(target,contents.sqft_lot15)

In [None]:
plt.scatter(target,contents.yr_renovated)

In [None]:
plt.scatter(target,contents.grade)

In [None]:
plt.scatter(target, contents.long)

In [None]:
plt.scatter(target, contents.zipcode)

 Data Cleaning and transformation 
-------------------------------------------------
This stage handles:

1. Removal of missing values

2. Feature transformation: extract the year from the date attribute and encode it as a binary flag

3. Removal of the id column, as it has no impact on the price

4. Removal of outliers using z-scores
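The z-score filter in step 4 can be sketched on toy data as follows. The frame below is synthetic, with one deliberately extreme lot size planted in row 0. The threshold of 3 standard deviations matches the one used later in this notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic toy frame (assumption: illustrative values only)
toy = pd.DataFrame({
    "sqft_living": rng.normal(2000, 300, size=200),
    "sqft_lot": rng.normal(8000, 1000, size=200),
})
toy.loc[0, "sqft_lot"] = 100000.0  # planted outlier

# z-score: distance from the column mean in units of standard deviation
z = np.abs(stats.zscore(toy))

# Keep a row only if every column is within 3 standard deviations
cleaned = toy[(z < 3).all(axis=1)]

print(len(toy), "rows before,", len(cleaned), "rows after")
```

Because a row is dropped when any single column is extreme, this filter can remove a noticeable share of the data when there are many columns.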

In [None]:
contents.isnull().values.any()

In [None]:
# Convert date to year 
date_posted = pd.DatetimeIndex(contents['date']).year

In [None]:
conv_dates = [1 if values == 2014 else 0 for values in date_posted ]
contents['date'] = conv_dates
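The same encoding can also be written without a Python-level loop. The sketch below is one possible vectorized variant on a toy frame. It assumes date strings of the form `20141013T000000`, and the column name `sold_in_2014` is a hypothetical choice for illustration.

```python
import pandas as pd

# Toy frame (assumption: date strings formatted as YYYYMMDDTHHMMSS)
toy = pd.DataFrame({
    "date": ["20141013T000000", "20150225T000000", "20141209T000000"],
})

# Parse the strings and pull out the year
years = pd.to_datetime(toy["date"], format="%Y%m%dT%H%M%S").dt.year

# 1 if the sale happened in 2014, otherwise 0
toy["sold_in_2014"] = (years == 2014).astype(int)

print(toy)
```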

In [None]:
contents.date.value_counts().plot(kind = 'bar')

In [None]:
contents = contents.drop('id', axis = 1)

In [None]:
contents.describe()

**Removing outliers**

In [None]:
import numpy as np
from scipy import stats
# Keep only rows where every column lies within 3 standard deviations of its mean
contents = contents[(np.abs(stats.zscore(contents)) < 3).all(axis=1)]

In [None]:
contents.boxplot([ 'sqft_lot', 'sqft_living'])

In [None]:
contents.boxplot(['long'])

 Feature Selection
---------------------------------------------

For dimensionality reduction, we have used:

1. Principal Component Analysis (PCA)
2. Stability selection

In [None]:
predictors = contents.drop('price', axis = 1)
price = contents['price'].tolist()

**1. Using PCA**

In [None]:
#Standardize the data to input to PCA
from sklearn.preprocessing import scale
std_inputs = scale(predictors)
res_inputs = std_inputs.reshape((-1,19))
std_df = pd.DataFrame(data=std_inputs,columns= predictors.columns)

In [None]:
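# 1. Principal Component Analysis (PCA)
from sklearn.decomposition import PCA
# Project the standardized inputs onto all principal components;
# pca holds the transformed data (component scores), not the fitted model
pca = PCA().fit_transform(std_inputs)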

**Kaiser's criterion**

In [None]:
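# Standard deviation of each principal component's scores
a = list(np.std(pca, axis=0))
summary = pd.DataFrame([a])
summary = summary.transpose()
summary.columns = ['sdev']
# Label rows by component rather than by original feature, since each component mixes all features
summary.index = ['PC%d' % (i + 1) for i in range(len(a))]
# Squared standard deviation = variance (eigenvalue) of each component
kaiser = summary.sdev ** 2
print(kaiser)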
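Kaiser's criterion keeps the components whose eigenvalue (variance) exceeds 1, meaning they explain more variance than a single standardized feature. The sketch below applies the rule on synthetic data using the fitted model's `explained_variance_` attribute. The data and the variable names are assumptions made for illustration. `explained_variance_` uses an n - 1 denominator, so it differs very slightly from the `np.std` based values above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)

# Synthetic data (assumption): three correlated columns plus two noise columns
base = rng.normal(size=(500, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(500, 3)),
    rng.normal(size=(500, 2)),
])

X_std = scale(X)
pca_model = PCA().fit(X_std)

# Eigenvalues of the covariance matrix, sorted from largest to smallest
eigenvalues = pca_model.explained_variance_

# Kaiser's criterion: retain components with eigenvalue > 1
n_keep = int((eigenvalues > 1).sum())

print(eigenvalues.round(3))
print("components retained:", n_keep)
```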

**Scree Plot**

In [None]:
y = np.std(pca, axis=0)**2
x = np.arange(len(y)) + 1
plt.plot(x, y, "o-")
plt.show()

**2. Stability selection**

In [None]:
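import time 
# Note: RandomizedLasso was deprecated in scikit-learn 0.19 and removed in 0.21,
# so this cell requires an older scikit-learn release
from sklearn.linear_model import RandomizedLasso
rlasso = RandomizedLasso(alpha=0.025)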

In [None]:
%time rlasso.fit(predictors, price)

In [None]:
names = predictors.columns
print(sorted(zip(map(lambda x: round(x, 4), rlasso.scores_), 
                 names), reverse=True))
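RandomizedLasso is not available in recent scikit-learn releases. The sketch below shows the core idea of stability selection using a plain Lasso refitted on random subsamples, scoring each feature by how often it receives a non-zero coefficient. It is a simplified stand-in, not the original algorithm: RandomizedLasso also randomly rescaled the per-feature penalties, which is omitted here. The function name `subsampled_lasso_scores`, its defaults and the toy data are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import scale


def subsampled_lasso_scores(X, y, alpha=0.1, n_rounds=50,
                            sample_fraction=0.75, seed=0):
    """Fraction of subsample fits in which each feature is selected."""
    rng = np.random.default_rng(seed)
    X = scale(np.asarray(X, dtype=float))
    y = np.asarray(y, dtype=float)
    y = (y - y.mean()) / y.std()  # put the target on a unit scale

    n_samples, n_features = X.shape
    n_sub = int(sample_fraction * n_samples)
    counts = np.zeros(n_features)

    for _ in range(n_rounds):
        # Draw a random subsample without replacement
        idx = rng.choice(n_samples, size=n_sub, replace=False)
        model = Lasso(alpha=alpha).fit(X[idx], y[idx])
        # A feature counts as selected when its coefficient is non-zero
        counts += (model.coef_ != 0)

    return counts / n_rounds


# Toy data (assumption): only the first two features drive the target
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(300, 4))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(scale=0.5, size=300)

scores = subsampled_lasso_scores(X_toy, y_toy)
print(scores.round(2))
```

Features with scores near 1 are selected consistently across subsamples, while features with scores near 0 are candidates for removal.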

In [None]:
final_predictors = predictors.drop(['yr_renovated', 'waterfront'], axis = 1)

Train-test split
----------------------------------------

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(final_predictors, price, test_size=0.33, random_state=42)

 Prediction
----------------------------------------
The regression algorithms used are:

1. Linear Regression
2. Gradient Boosting Machine (GBM)

In [None]:
#Linear regression 
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X_train,y_train)

In [None]:
#r2 score 
regr.score(X_test,y_test)

In [None]:
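#GBM model
from sklearn import ensemble
# The default loss is least squares (named 'ls' in older scikit-learn releases and
# 'squared_error' in newer ones), so it is left unset for compatibility
params = {'n_estimators': 200, 'max_depth': 5, 'min_samples_split': 2,
          'learning_rate': 0.1}
clf = ensemble.GradientBoostingRegressor(**params)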

In [None]:
clf.fit(X_train, y_train)

In [None]:
#r^2 score
clf.score(X_test,y_test)
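The `score` method of a scikit-learn regressor returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below computes it by hand on four made-up predictions and also reports the root mean squared error, which is in the same units as the price. The numbers are hypothetical and are not results from the models above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted prices (assumption: illustrative only)
y_true = np.array([300000.0, 450000.0, 520000.0, 610000.0])
y_pred = np.array([320000.0, 430000.0, 500000.0, 650000.0])

# Residual and total sums of squares
ss_res = ((y_true - y_pred) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot

# RMSE expresses the typical error in price units
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(r2_manual, r2_score(y_true, y_pred), rmse)
```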

**Training and test deviance**

In [None]:
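from sklearn.metrics import mean_squared_error
test_score = np.zeros((params['n_estimators'],), dtype=np.float64)

for i, y_pred in enumerate(clf.staged_predict(X_test)):
    # clf.loss_ is gone in recent scikit-learn; for least squares it equals the MSE
    test_score[i] = mean_squared_error(y_test, y_pred)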

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title('Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, clf.train_score_, 'b-',
         label='Training Set Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, test_score, 'r-',
         label='Test Set Deviance')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Deviance')

**Variable Importances**

In [None]:
feature_importance = clf.feature_importances_
# make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, final_predictors.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()