# Prediction of House Prices Using Advanced Regression Methods

<img src="https://miro.medium.com/max/1400/0*YMZOAO8QE4bZ4_Rk.jpg"  Width="800">

Predicting a house price can be quite a difficult task, as numerous features may affect the price of a house. Even though some house characteristics have a greater effect on the price, those with a weaker effect still need to be considered, because determining the price is a crucial process, especially for real estate agents and customers.

This notebook first explores the house price data in detail. After that, the raw data is cleaned and an advanced regression model is built on the provided train dataset. Lastly, the model is applied to the test dataset to predict the prices of the given houses.

Any feedback for this kernel is appreciated and please feel free to <b>upvote</b> or leave a <b>comment</b> if you liked the work or if you want to criticize.

## 1. Importing Libraries and Datasets

In [None]:
#Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Lasso
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from scipy.stats import norm, skew
from sklearn.base import BaseEstimator
from sklearn.base import RegressorMixin
from sklearn.base import TransformerMixin
from sklearn.metrics import mean_squared_error

import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)

In [None]:
#Importing the datasets (1 for training and 1 for predicting)
train = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
test = pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")

The data is split into two parts: training data and testing data. We are going to use the train data to train our models and the test data to predict the house prices.

## 2. Glimpse of the Dataset

In [None]:
#Train data overview (first five rows)
train.head()

In [None]:
#Test data overview (first five rows)
test.head()

The first difference that catches our eye is that the test data doesn't have a SalePrice column, as it is the dataset to which we will apply the model we construct. We can also see that there are so many columns that they don't all fit in the cell output; we will inspect them more closely in the upcoming section.

Now, we will check the size of the datasets before we inspect deeper.

In [None]:
print ("The shape of the train data is (row, column):"+ str(train.shape))
print ("The shape of the test data is (row, column):"+ str(test.shape))

The sizes of the datasets show us that we have 1460 houses to train the model and 1459 houses whose prices we will predict. As mentioned, the train dataset has one extra column, which is the SalePrice of the houses.

Next, we will check what these 81 columns are and whether they have missing values.

In [None]:
#More information on train data
train.info()

In [None]:
#More information on test data
test.info()

### Column Labels: (taken from Data fields of the competition page)
- **SalePrice**: The property's sale price in dollars. This is the target variable that you're trying to predict.
- **MSSubClass**: The building class
- **MSZoning**: The general zoning classification
- **LotFrontage**: Linear feet of street connected to property
- **LotArea**: Lot size in square feet
- **Street**: Type of road access
- **Alley**: Type of alley access
- **LotShape**: General shape of property
- **LandContour**: Flatness of the property
- **Utilities**: Type of utilities available
- **LotConfig**: Lot configuration
- **LandSlope**: Slope of property
- **Neighborhood**: Physical locations within Ames city limits
- **Condition1**: Proximity to main road or railroad
- **Condition2**: Proximity to main road or railroad (if a second is present)
- **BldgType**: Type of dwelling
- **HouseStyle**: Style of dwelling
- **OverallQual**: Overall material and finish quality
- **OverallCond**: Overall condition rating
- **YearBuilt**: Original construction date
- **YearRemodAdd**: Remodel date
- **RoofStyle**: Type of roof
- **RoofMatl**: Roof material
- **Exterior1st**: Exterior covering on house
- **Exterior2nd**: Exterior covering on house (if more than one material)
- **MasVnrType**: Masonry veneer type
- **MasVnrArea**: Masonry veneer area in square feet
- **ExterQual**: Exterior material quality
- **ExterCond**: Present condition of the material on the exterior
- **Foundation**: Type of foundation
- **BsmtQual**: Height of the basement
- **BsmtCond**: General condition of the basement
- **BsmtExposure**: Walkout or garden level basement walls
- **BsmtFinType1**: Quality of basement finished area
- **BsmtFinSF1**: Type 1 finished square feet
- **BsmtFinType2**: Quality of second finished area (if present)
- **BsmtFinSF2**: Type 2 finished square feet
- **BsmtUnfSF**: Unfinished square feet of basement area
- **TotalBsmtSF**: Total square feet of basement area
- **Heating**: Type of heating
- **HeatingQC**: Heating quality and condition
- **CentralAir**: Central air conditioning
- **Electrical**: Electrical system
- **1stFlrSF**: First Floor square feet
- **2ndFlrSF**: Second floor square feet
- **LowQualFinSF**: Low quality finished square feet (all floors)
- **GrLivArea**: Above grade (ground) living area square feet
- **BsmtFullBath**: Basement full bathrooms
- **BsmtHalfBath**: Basement half bathrooms
- **FullBath**: Full bathrooms above grade
- **HalfBath**: Half baths above grade
- **Bedroom**: Number of bedrooms above basement level
- **Kitchen**: Number of kitchens
- **KitchenQual**: Kitchen quality
- **TotRmsAbvGrd**: Total rooms above grade (does not include bathrooms)
- **Functional**: Home functionality rating
- **Fireplaces**: Number of fireplaces
- **FireplaceQu**: Fireplace quality
- **GarageType**: Garage location
- **GarageYrBlt**: Year garage was built
- **GarageFinish**: Interior finish of the garage
- **GarageCars**: Size of garage in car capacity
- **GarageArea**: Size of garage in square feet
- **GarageQual**: Garage quality
- **GarageCond**: Garage condition
- **PavedDrive**: Paved driveway
- **WoodDeckSF**: Wood deck area in square feet
- **OpenPorchSF**: Open porch area in square feet
- **EnclosedPorch**: Enclosed porch area in square feet
- **3SsnPorch**: Three season porch area in square feet
- **ScreenPorch**: Screen porch area in square feet
- **PoolArea**: Pool area in square feet
- **PoolQC**: Pool quality
- **Fence**: Fence quality
- **MiscFeature**: Miscellaneous feature not covered in other categories
- **MiscVal**: $Value of miscellaneous feature
- **MoSold**: Month Sold
- **YrSold**: Year Sold
- **SaleType**: Type of sale
- **SaleCondition**: Condition of sale


For more detailed explanation of the columns please see data_description.txt.

Some basic statistics of the numerical columns in the train data:

In [None]:
train.describe().T

In [None]:
#Counts of different types
train.dtypes.value_counts()

The information above tells us that there are 43 features of type object (probably categorical values) and 38 numerical features (35 integer and 3 float).

## 3. Cleaning of Data

From the above information, we can see that the columns in our datasets have unequal numbers of non-null entries, which implies that some values are missing. We have to replace or get rid of these missing values so that they don't cause discrepancies in our calculations. Let's check how many NaN values are present in each dataset.

In [None]:
#A function that calculates the count and percentage of missing data per column
def missing_percentage(df):
    #Count the missing values once and keep only the columns that have any
    total = df.isnull().sum().sort_values(ascending = False)
    total = total[total != 0]
    percent = round(total / len(df) * 100, 2)
    return pd.concat([total, percent], axis=1, keys=['Total','Percent'])

In [None]:
#Train data
missing_percentage(train)

In [None]:
#Test data
missing_percentage(test)

As we can see, the train dataset has 19 features with at least one missing value, while the test dataset has 33. Some of the features have a considerable number of missing values, while others have a negligible number. Therefore we will use different methods to handle different features that have missing values.

Now we will combine test and train datasets to do all the cleaning at once. We will drop the SalePrice as there isn't a SalePrice column in the test dataset.

In [None]:
salesprice = train['SalePrice']
all_data = pd.concat((train, test)).reset_index(drop = True)
all_data.drop(['SalePrice'], axis = 1, inplace = True)

After reading the description of the columns, we can figure out that some of the missing values are not really missing but rather stand for a 'None' value in categorical features and a '0' value in numerical features. Let's fix those columns first.

In [None]:
missing_value_0 = ['BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','BsmtFullBath', 'BsmtHalfBath', 'GarageYrBlt','GarageArea','GarageCars','MasVnrArea']

for i in missing_value_0:
    all_data[i] = all_data[i].fillna(0)
    
missing_value_none = ['Alley','PoolQC','MiscFeature','Fence','FireplaceQu','GarageType','GarageFinish','GarageQual','GarageCond','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','MasVnrType']

for i in missing_value_none:
    all_data[i] = all_data[i].fillna('None')

In [None]:
missing_percentage(all_data)

After fixing the 'None' and '0' values, the number of features with missing values is reduced to only 9. Only one column, LotFrontage, has a considerable number of missing values. We are going to estimate LotFrontage from LotArea, and we are going to use the mode of each remaining column for the other missing values.

For LotFrontage, we will compare the square root of the lot area to the existing lot frontage values to see the relationship between them.

In [None]:
lot_frontage_data = all_data['LotFrontage']
sqr_lot_area_data = np.sqrt(all_data['LotArea'])

In [None]:
#Pass x and y by keyword; recent seaborn versions no longer accept them positionally
ax = sns.regplot(x=sqr_lot_area_data, y=lot_frontage_data)
ax.set_ylabel('LotFrontage')
ax.set_xlabel('sqrt(LotArea)')
ax.set_title('Square Root of Lot Area vs Lot Frontage')

The relationship between the square root of the lot area and the lot frontage looks fairly linear. Therefore, we can approximate each lot as a square and estimate each lot frontage as the square root of the lot area.

In [None]:
#Note: this overwrites every LotFrontage value with the estimate, not only the missing ones
all_data['LotFrontage'] = sqr_lot_area_data.round(2)
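
If we wanted to keep the recorded frontage values and apply the square-lot estimate only where LotFrontage is missing, `fillna` could be used instead. The following is a minimal sketch on a made-up toy frame (the values and the `LotFrontage_filled` column name are assumptions for illustration, not part of the competition data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's all_data frame (values are made up).
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0, 10000.0],
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
})

# Square-lot estimate: frontage is approximated by sqrt(area).
estimate = np.sqrt(toy["LotArea"]).round(2)

# fillna aligns on the index, so only the NaN rows take the estimate.
toy["LotFrontage_filled"] = toy["LotFrontage"].fillna(estimate)
print(toy)
```

Whether to overwrite or only fill is a modelling choice; filling keeps the measured values but changes the correlations reported later in this notebook.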

For the remaining missing values we will use the most frequent value of each column.

In [None]:
all_data['MSZoning'] = all_data.groupby('MSSubClass')['MSZoning'].transform(lambda x: x.fillna(x.mode()[0]))
all_data['Utilities'] = all_data['Utilities'].fillna(all_data['Utilities'].mode()[0]) 
all_data['Functional'] = all_data['Functional'].fillna(all_data['Functional'].mode()[0]) 
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0]) 
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0]) 
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0]) 
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
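
One thing to keep in mind with the group-wise fill used for MSZoning: if every value in a group is missing, `mode()` returns an empty result and indexing it fails. Below is a hedged sketch of a variant that falls back to the overall mode in that case; the toy frame and the helper name `fill_with_group_mode` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: the MSSubClass 70 group has no known zoning at all.
toy = pd.DataFrame({
    "MSSubClass": [20, 20, 20, 60, 60, 70],
    "MSZoning": ["RL", "RL", np.nan, "RM", np.nan, np.nan],
})

# Most frequent value over the whole column, used as a fallback.
global_mode = toy["MSZoning"].mode().iloc[0]

def fill_with_group_mode(s):
    """Fill NaNs with the group's mode, or the global mode if the group is all NaN."""
    modes = s.mode()
    fill = modes.iloc[0] if not modes.empty else global_mode
    return s.fillna(fill)

toy["MSZoning"] = toy.groupby("MSSubClass")["MSZoning"].transform(fill_with_group_mode)
print(toy)
```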

In [None]:
missing_percentage(all_data)

There are no missing values left, and the dataset is now clean.

Lastly, we will separate the train and test datasets.

In [None]:
#Copy the slices so that adding a column does not write into a view of all_data
train = all_data[:1460].copy()
test = all_data[1460:].copy()
train['SalePrice'] = salesprice

## 4. Correlation and Visualization of Data

Before we visualize some of the house features, we first need to check how the features correlate with each other and with our target value, SalePrice.

Let's check the correlation between each feature.

In [None]:
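#Restrict to numeric columns; recent pandas versions raise an error on text columns
train.select_dtypes(include='number').corr()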

As our dataset is large, we cannot see the correlations between features clearly in this table. A better way to do that is to create a heatmap of all the features.

In [None]:
plt.figure(figsize=(30, 20))
#Compute the correlation matrix once, on numeric columns only
corr_matrix = train.select_dtypes(include='number').corr()
# define the mask to set the values in the upper triangle to True (np.bool was removed from NumPy)
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
heatmap = sns.heatmap(corr_matrix, mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Heatmap of Train data', fontdict={'fontsize':18}, pad=16);
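
Reading pairs off a heatmap this large is tedious, so it can help to list the strongest pairs programmatically. Here is a small sketch; the helper name `top_correlated_pairs` and the synthetic toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, k=5):
    """Return the k feature pairs with the largest absolute correlation."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.stack().dropna().sort_values(ascending=False).head(k)

# Synthetic data: b is almost a multiple of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
print(top_correlated_pairs(toy, k=2))
```

On the notebook's data this could be called as `top_correlated_pairs(train, k=10)`.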

Let's inspect the feature pairs that have high correlations. As noted earlier, LotArea and LotFrontage are strongly correlated (0.91). We used this relationship to estimate the LotFrontage values by treating each lot as a square and taking the square root of LotArea. Another highly correlated pair is GarageArea and GarageCars. As we could guess, they almost directly affect each other, so the high correlation is expected. Some other highly correlated pairs are TotRmsAbvGrd and GrLivArea, 1stFlrSF and TotalBsmtSF, and SalePrice and OverallQual. Let's visualize them!

In [None]:
#TotRmsAbvGrd and GrLivArea regression plot
ax = sns.regplot(x=train['TotRmsAbvGrd'], y=train['GrLivArea'])
ax.set_ylabel('GrLivArea')
ax.set_xlabel('TotRmsAbvGrd')
ax.set_title('TotRmsAbvGrd vs GrLivArea')

In [None]:
#GarageArea and GarageCars regression plot
ax = sns.regplot(x=train['GarageArea'], y=train['GarageCars'])
ax.set_ylabel('GarageCars')
ax.set_xlabel('GarageArea')
ax.set_title('GarageArea vs GarageCars')

In [None]:
#1stFlrSF and TotalBsmtSF regression plot
ax = sns.regplot(x=train['1stFlrSF'], y=train['TotalBsmtSF'])
ax.set_ylabel('TotalBsmtSF')
ax.set_xlabel('1stFlrSF')
ax.set_title('1stFlrSF vs TotalBsmtSF')

In [None]:
#OverallQual vs SalePrice regression plot
ax = sns.regplot(x=train['OverallQual'], y=train['SalePrice'])
ax.set_xlabel('OverallQual')
ax.set_ylabel('SalePrice')
ax.set_title('OverallQual vs SalePrice')

These features are also highly dependent on each other, so the correlation is expected. Notice that one of these highly correlated pairs involves SalePrice, which is our target value. Now let's focus on our target value and try to find the features that affect SalePrice the most.

In [None]:
#Correlation between SalePrice and other features
pd.DataFrame(train.select_dtypes(include='number').corr()['SalePrice'].sort_values(ascending = False))

The features that have a stronger correlation with SalePrice will have a bigger effect on the model we will create. We want those data to be as clean as possible to get the best outcome from our model. So, let's visualize the features with the highest correlation (>0.6) and check whether they have any discontinuities or outliers.

In [None]:
#OverallQual vs SalePrice regression plot
ax = sns.regplot(x=train['OverallQual'], y=train['SalePrice'])
ax.set_xlabel('OverallQual')
ax.set_ylabel('SalePrice')
ax.set_title('OverallQual vs SalePrice')

OverallQual is a categorical rating rather than a continuous measurement, so dropping outliers wouldn't affect our result much. Also, there aren't any critical outliers.

In [None]:
#GrLivArea vs SalePrice regression plot
ax = sns.regplot(x=train['GrLivArea'], y=train['SalePrice'])
ax.set_xlabel('GrLivArea')
ax.set_ylabel('SalePrice')
ax.set_title('GrLivArea vs SalePrice')

This plot has two obvious outliers in the bottom-right corner. We will get rid of them soon.

In [None]:
#GarageCars vs SalePrice regression plot
ax = sns.regplot(x=train['GarageCars'], y=train['SalePrice'])
ax.set_xlabel('GarageCars')
ax.set_ylabel('SalePrice')
ax.set_title('GarageCars vs SalePrice')

In [None]:
#GarageArea vs SalePrice regression plot
ax = sns.regplot(x=train['GarageArea'], y=train['SalePrice'])
ax.set_xlabel('GarageArea')
ax.set_ylabel('SalePrice')
ax.set_title('GarageArea vs SalePrice')

This one seems to have 4 outliers on the right bottom corner.

In [None]:
#TotalBsmtSF vs SalePrice regression plot
ax = sns.regplot(x=train['TotalBsmtSF'], y=train['SalePrice'])
ax.set_xlabel('TotalBsmtSF')
ax.set_ylabel('SalePrice')
ax.set_title('TotalBsmtSF vs SalePrice')

There is also an obvious outlier in this one.

In [None]:
#1stFlrSF vs SalePrice regression plot
ax = sns.regplot(x=train['1stFlrSF'], y=train['SalePrice'])
ax.set_xlabel('1stFlrSF')
ax.set_ylabel('SalePrice')
ax.set_title('1stFlrSF vs SalePrice')

There is also an obvious outlier in this one. Let's get rid of them all together.

In [None]:
train = train[train.GrLivArea < 4500]
train = train[train.GarageArea < 1250]
train = train[train.TotalBsmtSF < 3500]
train = train[train['1stFlrSF'] < 3500]
train.reset_index(drop = True, inplace = True)
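
When several thresholds are applied in a row, it is easy to lose track of how many rows were removed. The sketch below builds one combined mask and reports the count. The toy frame and the `limits` dictionary are assumptions for illustration; only two of the thresholds used above are shown.

```python
import pandas as pd

# Toy frame with one GrLivArea outlier and one GarageArea outlier (made-up values).
toy = pd.DataFrame({
    "GrLivArea": [1500, 4700, 2000, 1800],
    "GarageArea": [400, 500, 1400, 450],
})

# Upper limits per column, mirroring the style of the thresholds above.
limits = {"GrLivArea": 4500, "GarageArea": 1250}

# Start by keeping every row, then require each column to be under its limit.
keep = pd.Series(True, index=toy.index)
for col, upper in limits.items():
    keep &= toy[col] < upper

filtered = toy[keep].reset_index(drop=True)
print("Dropped {} of {} rows".format(len(toy) - len(filtered), len(toy)))
```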

This way we got rid of some outliers that we don't want in our dataset.

## 5. Machine Learning Preprocessing

We have to do some preprocessing before we apply the machine learning methods. We need numbers instead of categorical values, as most machine learning algorithms cannot work with categorical values directly. We will also split our train dataset for consistency checks, and after that we will standardize the data.

Let's start with dropping unnecessary variables.

In [None]:
#Save old train dataframe in case we need it again 
old_train = train.copy()
old_test= test.copy()
salesprice = train['SalePrice']

#Dropping unnecessary features
train.drop(['Id'],axis=1, inplace=True)
test.drop(['Id'],axis=1, inplace=True)
train.drop(['SalePrice'],axis=1, inplace=True)

Next, to get rid of categorical values we will assign dummy values to categorical variables. But, to get the same number of columns for the train and test data we will merge these two datasets and get dummies for the whole dataset.

In [None]:
#Merge train and test data
all_data = pd.concat((train, test)).reset_index(drop = True)

In [None]:
#Assign dummy values
all_data_dummies = pd.get_dummies(all_data).reset_index(drop=True)
all_data_dummies.shape

In [None]:
#Split the all_data dataset
train = all_data_dummies[:1456]
test = all_data_dummies[1456:]
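
The split position 1456 is the number of train rows left after the outliers were dropped. A slightly more robust pattern is to record the train length before concatenating, so the split stays correct if the outlier thresholds change. A minimal sketch with made-up toy frames (all names here are assumptions for illustration):

```python
import pandas as pd

# Toy train/test frames; category "C" appears only in the test part.
train_toy = pd.DataFrame({"Neighborhood": ["A", "B", "A"], "LotArea": [1, 2, 3]})
test_toy = pd.DataFrame({"Neighborhood": ["B", "C"], "LotArea": [4, 5]})

# Remember the boundary before concatenating.
n_train = len(train_toy)
combined = pd.concat((train_toy, test_toy)).reset_index(drop=True)

# Encoding the combined frame gives both parts the same dummy columns.
dummies = pd.get_dummies(combined)
train_enc = dummies.iloc[:n_train]
test_enc = dummies.iloc[n_train:]
print(train_enc.shape, test_enc.shape)
```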

Next, we will split our train dataset for future validation and then standardize each dataset.

In [None]:
X = train
y = salesprice

#Splitting the train data 
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size = .33, random_state=0)

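Lastly, we would scale our datasets. The scaling cell is left commented out here; the linear models in the advanced approach apply RobustScaler inside their pipelines instead.

In [None]:
#Fit the scaler on the training split only, then reuse it for the other sets
#scaler = preprocessing.StandardScaler().fit(X_train)
#X_train = scaler.transform(X_train)
#X_test = scaler.transform(X_test)
#test_scaled = scaler.transform(test)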

## 6. Modelling the Data

### Simple Approach

For a simple approach we will fit a linear regression model to the dataset using a single feature. We will use GrLivArea, the continuous feature most correlated with SalePrice. We won't use OverallQual, as it is basically a categorical value and it's hard to fit a regression to it.

In [None]:
regr = linear_model.LinearRegression()
train_x = np.asanyarray(X_train['GrLivArea'])
train_y = np.asanyarray(y_train)
regr.fit(train_x.reshape(-1, 1), train_y)
# The coefficients
print ('Coefficients: ', regr.coef_)
print ('Intercept: ',regr.intercept_)

In [None]:
#GrLivArea vs SalePrice regression plot
ax = sns.regplot(x=train_x, y=train_y)
ax.set_xlabel('GrLivArea')
ax.set_ylabel('SalePrice')
ax.set_title('GrLivArea vs SalePrice')

On the plot, the linear regression looks like a good approximation. Let's evaluate the model by comparing its predictions to the test values.

In [None]:
test_x = np.asanyarray(X_test['GrLivArea'])
test_y = np.asanyarray(y_test)
test_y_ = regr.predict(test_x.reshape(-1, 1))

print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))
print("Mean squared error (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y , test_y_) )

An R2-score of 0.53 shows us that linear regression on GrLivArea alone is not a good approximation of SalePrice. Therefore we need a better approach to approximate our target value.
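
For reference, the three numbers printed in the evaluation cell can be computed by hand. The sketch below spells out the standard definitions of mean absolute error, mean squared error and R2; the function name `regression_metrics` and the toy numbers are assumptions for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R2) for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    mse = np.mean(residuals ** 2)
    # R2 compares the residual sum of squares to the variation around the mean.
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, r2

mae, mse, r2 = regression_metrics([100, 200, 300], [110, 190, 310])
print(mae, mse, r2)
```

An R2 of 1 means the predictions explain all of the variation in the target, while 0 means they do no better than always predicting the mean.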

### Multiple Linear Regression

Let's use every feature in our train dataset to construct a linear model and predict the SalePrice of our houses.

In [None]:
regr.fit(X_train, y_train)
# The coefficients
print ('Coefficients: ', regr.coef_)

In [None]:
test_x = np.asanyarray(X_test)
test_y = np.asanyarray(y_test)
test_y_ = regr.predict(X_test)

print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))
print("Mean squared error (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y , test_y_) )

These scores show us that these simple models perform poorly, so improvements are needed in both the datasets and the model we are going to use.

### Advanced Approach

First of all, we are going to do some feature engineering to show how it affects the performance of our final model compared to the previous simple models.

### Feature Engineering

In [None]:
all_data

Let's create some new features by combining existing features to add some context to our dataset.

In [None]:
all_data = all_data.drop([ 'Utilities','Street', 'PoolQC',], axis=1)

# Adding total sqfootage feature 
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']

all_data['Total_Bathrooms'] = (all_data['FullBath'] + (0.5 * all_data['HalfBath']) +
                               all_data['BsmtFullBath'] + (0.5 * all_data['BsmtHalfBath']))

all_data['Total_porch_sf'] = (all_data['OpenPorchSF'] + all_data['3SsnPorch'] +
                              all_data['EnclosedPorch'] + all_data['ScreenPorch'] +
                              all_data['WoodDeckSF'])

all_data['haspool'] = all_data['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
all_data['has2ndfloor'] = all_data['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)
all_data['hasgarage'] = all_data['GarageArea'].apply(lambda x: 1 if x > 0 else 0)
all_data['hasbsmt'] = all_data['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
all_data['hasfireplace'] = all_data['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)

In [None]:
#MSSubClass=The building class
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)

#Changing OverallCond into a categorical variable
all_data['OverallCond'] = all_data['OverallCond'].astype(str)

#Year and month sold are transformed into categorical features.
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)

Next, we will apply LabelEncoder to our categorical values.

In [None]:
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC',  'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(all_data[c].values)) 
    all_data[c] = lbl.transform(list(all_data[c].values))

# shape        
print('Shape all_data: {}'.format(all_data.shape))
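
Note that LabelEncoder assigns codes in sorted (alphabetical) order of the labels, which does not follow the quality ranking of columns such as KitchenQual. If we wanted the codes to reflect the ranking, an explicit mapping is one option. The sketch below assumes the Ex/Gd/TA/Fa/Po scale from data_description.txt plus the 'None' filler introduced during cleaning; the variable names are assumptions for illustration.

```python
import pandas as pd

# Quality scale from worst to best; 'None' is the filler used for missing values.
quality_order = ["None", "Po", "Fa", "TA", "Gd", "Ex"]
quality_codes = {label: code for code, label in enumerate(quality_order)}

toy = pd.Series(["Gd", "TA", "Ex", "None", "Fa"], name="KitchenQual")
encoded = toy.map(quality_codes)
print(encoded.tolist())

# For comparison: alphabetical order, which is what a sorted label encoding follows.
print(sorted(quality_order))
```

Tree-based models are less sensitive to this ordering than linear models, so the benefit depends on the model used.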

Another improvement we can do is fixing the skewness of our numerical values.

In [None]:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness

In [None]:
from scipy.special import boxcox1p

#Keep only the rows whose absolute skew exceeds 0.75
#(masking the whole DataFrame would keep every row and only blank out values)
skewness = skewness[abs(skewness['Skew']) > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))

skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    #all_data[feat] += 1
    all_data[feat] = boxcox1p(all_data[feat], lam)
    
#all_data[skewed_features] = np.log1p(all_data[skewed_features])
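
To make the transform less of a black box: `boxcox1p(x, lam)` computes the Box-Cox transform of `1 + x`, that is `((1 + x)**lam - 1) / lam` for a non-zero `lam` and `log(1 + x)` when `lam` is 0. The sketch below re-implements that formula in plain NumPy; the function name `boxcox1p_manual` is an assumption for illustration and is not part of SciPy.

```python
import numpy as np

def boxcox1p_manual(x, lam):
    """Box-Cox transform of 1 + x, written out explicitly."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log1p(x)
    return ((1.0 + x) ** lam - 1.0) / lam

values = np.array([0.0, 1.0, 10.0, 1000.0])
transformed = boxcox1p_manual(values, 0.15)
print(transformed)
```

Zero stays at zero and large values are compressed, which is what reduces the right skew of area-like features.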

Creating our dummy variables just like before.

In [None]:
all_data = pd.get_dummies(all_data)
print(all_data.shape)

In [None]:
train = all_data[:1456]
test = all_data[1456:]
y_train = np.log1p(salesprice)

### Constructing Our Model

Before we construct our advanced model, let's define our validation function.

In [None]:
#Validation function
n_folds = 10

def rmsle_cv(model):
    #Pass the KFold object itself so that shuffling and the seed are actually used
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train, scoring="neg_mean_squared_error", cv=kf))
    return rmse
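
A detail worth checking when building the cross-validation splitter: `get_n_splits` returns the number of folds as a plain integer, so passing its result as `cv` drops the shuffle and seed settings (an integer `cv` for a regressor means unshuffled K-fold). The sketch below shows the difference on a tiny made-up array.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)

splitter = KFold(n_splits=5, shuffle=True, random_state=42)

# get_n_splits only reports how many folds there are.
n = splitter.get_n_splits(X)
print(type(n).__name__, n)

# The splitter object itself yields the (train, validation) index pairs.
folds = list(splitter.split(X))
for train_idx, valid_idx in folds:
    print(valid_idx)
```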

Now, we are going to construct 6 different models: LASSO Regression, Kernel Ridge Regression, Elastic Net Regression, Gradient Boosting Regression, XGBoost and LightGBM. After that we are going to check their RMSE scores and use the one with the best score. We will use RobustScaler() in the LASSO and Elastic Net pipelines because this dataset may be sensitive to outliers.

In [None]:
#LASSO Regression
lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))

#Kernel Ridge Regression
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)

#Elastic Net Regression
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))

#Gradient Boosting Regression
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =5)

#XGBoost
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, verbosity=0,
                             random_state =7, n_jobs = -1)

#LightGBM
model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)

Let's check the score of each model.

In [None]:
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(model_xgb)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(model_lgb)
print("LGBM score: {:.4f} ({:.4f})\n" .format(score.mean(), score.std()))

In [None]:
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))
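
The helper above computes a plain RMSE. It acts as a root mean squared logarithmic error here only because `y_train` was already transformed with `np.log1p`, and the predictions are brought back to prices with `np.expm1`. A small sketch with made-up prices shows the round trip and the log-space error:

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error between two sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Made-up prices and predictions, in dollars.
prices = np.array([100000.0, 200000.0, 300000.0])
preds = np.array([110000.0, 190000.0, 330000.0])

# RMSE on log1p-transformed values is the RMSLE of the raw values.
log_space_error = rmse(np.log1p(prices), np.log1p(preds))

# expm1 undoes log1p, which is how predictions are mapped back to prices.
recovered = np.expm1(np.log1p(prices))
print(log_space_error, recovered)
```

Working in log space means a 10% miss on a cheap house counts about as much as a 10% miss on an expensive one.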

In [None]:
model_xgb.fit(train, y_train)
xgb_train_pred = model_xgb.predict(train)
xgb_pred = np.expm1(model_xgb.predict(test))
print(rmsle(y_train, xgb_train_pred))

In [None]:
ENet.fit(train,y_train)
ENet_train_pred = ENet.predict(train)
ENet_pred = np.expm1(ENet.predict(test))
print(rmsle(y_train, ENet_train_pred))

In [None]:
model_lgb.fit(train, y_train)
lgb_train_pred = model_lgb.predict(train)
lgb_pred = np.expm1(model_lgb.predict(test))
print(rmsle(y_train, lgb_train_pred))

After checking the scores we can see that Elastic Net Regression is the best scoring model, so we are going to use it to predict our price values.

In [None]:
#Use the Elastic Net predictions, as decided above
y_hat = ENet_pred

## 7. Appending Predictions

In [None]:
#Convert the prediction array to a plain list
sale_pred = list(y_hat)

In [None]:
predictions_df = pd.DataFrame(
    {'Id': old_test['Id'].values.tolist(),
     'SalePrice': sale_pred
    })

In [None]:
predictions_df

In [None]:
predictions_df.to_csv('predictions.csv', index=False)

If you have come this far, Congratulations!!

### If this notebook helped you in any way or you liked it, please upvote and/or leave a comment!! :) 