## Introduction

### **Problem Statement**

> Model house prices as a function of a property's features, such as size, location, and condition.

### **Success Metrics**

    1. An accuracy score above 80%
    2. The lowest possible RMSE value
    3. Identification of the best-performing model
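The first two metrics can be computed as in the minimal sketch below. It assumes the "accuracy score" for a regression model is read as the coefficient of determination (R²), which the notebook does not state explicitly. The function name `regression_report` and the price values are hypothetical and used only for illustration.

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Return R^2 and RMSE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    # RMSE: square root of the mean squared residual
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    # R^2: 1 - (residual sum of squares / total sum of squares)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "rmse": rmse}


# Hypothetical prices, for illustration only (not taken from the dataset)
y_true = [200000.0, 350000.0, 500000.0, 650000.0]
y_pred = [210000.0, 340000.0, 520000.0, 630000.0]
report = regression_report(y_true, y_pred)
print(report)
```

Under this reading, a model meets the first metric when `report["r2"]` exceeds 0.8.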

### **Understanding the context**

Hass Consulting Company is a real estate leader with over 25 years of experience. The company wishes to understand the factors that affect the price of a house and to build a model that predicts the price of a house given a set of predictor variables.

### **Recording the Experimental Design**

* Read and explore the given dataset.
* Find and deal with outliers, anomalies, and missing data within the dataset.
* Perform univariate, bivariate, and multivariate analysis, recording your observations.
* Perform regression analysis using the following techniques:

    1. Linear Regression
    2. Quantile Regression
    3. Ridge Regression
    4. Lasso Regression
    5. Elastic Net Regression
* Check for multicollinearity
* Create residual plots for your models, and assess heteroskedasticity using Bartlett's test.
* Provide a recommendation based on your analysis.
* Challenge the solution by suggesting ways to improve the model.
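The model-comparison step in the design above could be sketched roughly as follows. This is only an illustration: it uses synthetic stand-in data rather than the housing data, the regularisation strengths are arbitrary placeholders rather than tuned values, and quantile regression is left out for brevity. The names `models` and `results` are introduced here for the example.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration, not the housing data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Regularisation strengths are arbitrary placeholders, not tuned values
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

# Use the same shuffled 5-fold split for every model so scores are comparable
cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    # Scale features inside the pipeline so each fold is scaled on its own training part
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(
        pipeline, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    # scikit-learn returns negated RMSE; flip the sign to report a positive error
    results[name] = float(-scores.mean())

# Print the models from lowest to highest cross-validated RMSE
for name, rmse in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {rmse:.3f}")
```

Scoring every model on the same folds keeps the RMSE values comparable, which is what the "best model" success metric relies on.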

### **Data Relevance**

The data contains many of the factors that buyers consider when purchasing a house. Each of these features comes at a cost: an extra room, for example, implies a higher price.

## Loading Libraries and Data

In [11]:
# Importing the Necessary Libraries
import pandas as pd


import matplotlib.gridspec as gridspec
from datetime import datetime
from scipy.stats import skew  # for some statistics
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error
from mlxtend.regressor import StackingCVRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import matplotlib.pyplot as plt
import scipy.stats as stats
import sklearn.linear_model as linear_model
import matplotlib.style as style
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import missingno as msno

In [12]:
# Loading the Dataset
df = pd.read_csv('house_data.csv')
df.head()

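|   | id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |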


In [13]:
# Data Size
print('Rows: ', df.shape[0])
print('Columns: ', df.shape[1])

Rows:  21613
Columns:  20


In [14]:
# Confirming the datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   price          21613 non-null  float64
 2   bedrooms       21613 non-null  int64  
 3   bathrooms      21613 non-null  float64
 4   sqft_living    21613 non-null  int64  
 5   sqft_lot       21613 non-null  int64  
 6   floors         21613 non-null  float64
 7   waterfront     21613 non-null  int64  
 8   view           21613 non-null  int64  
 9   condition      21613 non-null  int64  
 10  grade          21613 non-null  int64  
 11  sqft_above     21613 non-null  int64  
 12  sqft_basement  21613 non-null  int64  
 13  yr_built       21613 non-null  int64  
 14  yr_renovated   21613 non-null  int64  
 15  zipcode        21613 non-null  int64  
 16  lat            21613 non-null  float64
 17  long           21613 non-null  float64
 18  sqft_l

* The **id** column is not useful in this analysis.
* There are no missing values in this data.
* The **price** column is our dependent variable: we want to determine which factors affect it and build a model that predicts it from the remaining features.
* Although the columns **bedrooms, bathrooms, floors, waterfront, yr_built** and **zipcode** are stored as numbers, they represent nominal values.
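The observations above can be checked programmatically. The sketch below is one possible approach, not necessarily what the rest of the notebook does: it rebuilds a few rows from the preview into a small frame named `sample` (a name introduced here), counts missing values, and casts two of the columns to pandas' `category` dtype so they are not treated as continuous quantities. The choice of `waterfront` and `zipcode` is only an example.

```python
import pandas as pd

# A few rows copied from the preview; the full notebook would use `df` directly
sample = pd.DataFrame({
    "id": [7129300520, 6414100192, 5631500400],
    "price": [221900.0, 538000.0, 180000.0],
    "bedrooms": [3, 3, 2],
    "waterfront": [0, 0, 0],
    "yr_built": [1955, 1951, 1933],
    "zipcode": [98178, 98125, 98028],
})

# Total number of missing cells across the whole frame
missing_total = int(sample.isnull().sum().sum())
print("Missing values:", missing_total)

# Example choice of columns to treat as categories (an assumption for this sketch)
categorical_cols = ["waterfront", "zipcode"]
typed = sample.astype({col: "category" for col in categorical_cols})
print(typed.dtypes)
```

Casting to `category` only changes how pandas stores and describes the column. Linear models would still need an encoding step, such as one-hot encoding, before such columns can be used as predictors.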

In [15]:
# Dropping the 'id' column
df = df.drop(['id'], axis=1)

In [16]:
# Previewing the Data
df.sample(5)

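|   | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3897 | 876650.0 | 3 | 3.25 | 2170 | 12508 | 1.5 | 0 | 0 | 5 | 9 | 1650 | 520 | 1928 | 1970 | 98040 | 47.5665 | -122.229 | 2720 | 21070 |
| 13123 | 1180000.0 | 5 | 5.0 | 3960 | 94089 | 2.0 | 0 | 0 | 3 | 10 | 3960 | 0 | 1998 | 0 | 98038 | 47.38 | -122.011 | 2240 | 64468 |
| 19927 | 360000.0 | 2 | 1.0 | 880 | 1165 | 2.0 | 0 | 0 | 3 | 8 | 880 | 0 | 2005 | 0 | 98122 | 47.6192 | -122.297 | 1640 | 3825 |
| 19894 | 472000.0 | 3 | 2.5 | 1860 | 415126 | 2.0 | 0 | 0 | 3 | 7 | 1860 | 0 | 2006 | 0 | 98038 | 47.3974 | -122.005 | 2070 | 54014 |
| 13871 | 640000.0 | 3 | 2.5 | 1690 | 1553 | 2.5 | 0 | 0 | 3 | 8 | 1690 | 0 | 2007 | 0 | 98199 | 47.6443 | -122.385 | 1910 | 1553 |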
