# **PREDICTING HOUSE PRICES USING MACHINE LEARNING**

![](https://ichef.bbci.co.uk/wwfeatures/live/976_549/images/live/p0/7d/9z/p07d9znv.jpg)

Many factors determine house prices. Some are logical, based on economic theory and population density; others are more intangible, like the feel of a neighbourhood and expectations of future growth.

Some general key factors that affect property prices are as follows:

1. Supply and demand
Put simply, if demand for houses increases faster than supply, house prices go up. For house prices to fall, demand needs to fall.

2. Interest rates
When interest rates rise, mortgage lenders generally increase the cost of variable mortgage payments. These higher interest rates in turn make home buying less attractive. Since the majority of Australian homeowners have variable mortgages, even a small change in interest rates can have a big impact on the affordability of buying a house.

3. Economic growth
As the economy grows and wages increase, more people can afford to buy a house. This in turn increases overall demand, which increases prices. See number 1.

4. Demographics
As levels of migration increase, so does the population, and more people means more demand for homes. Another factor is changing demographics; for example, rising divorce rates have increased the number of single people living alone, and our old friend demand is an issue again.

5. Location, location, location
This is an obvious one. Homes that are closer to the beach, closer to the CBD or closer to transport tend to sell at a higher price.
Australia is a vast and varied country, but if you look at any map you'll see a high concentration of housing around the city centres. The majority of people want to live close to where they work, shop and go out to enjoy themselves, and this naturally causes higher demand for property, and therefore higher prices, in these areas.

6. Room to move
The potential for growth is a key issue in determining the value of a property. This relates to the potential to add a second storey, increase the number of bedrooms, or add a room above a garage or in the garden. Increasing the floor area will increase the value. This relates back to the value of location and land size in determining house prices.

7. A second bathroom
If two identical properties were for sale in the same street, the one with the extra bathroom would sell for more. Simple. However, the value of the bathrooms relates directly to the number of rooms in the property. For example, a second bathroom in a two-bedroom house would be less desirable than in a five-bedroom house.

8. Parking
We all know that parking is at a premium in our big cities, so if a home has parking or even a garage, this can significantly increase its value.

9. Home improvements
Updating kitchens, replacing flooring, repainting walls and adding landscaping can add to the value of a home. However, homeowners often spend too much and don't get the return on investment when they sell the house. Before making drastic improvements to your house, be sure to talk with your real estate agent so that you spend your money wisely on your investment.

**So let's explore which factors affect the sale price of our houses with the help of machine learning!**


**For my final prediction I have used LightGBM**

**Why is LightGBM gaining so much popularity?**

**Data volumes keep growing, and traditional algorithms struggle to deliver results quickly. LightGBM is called ‘Light’ because of its speed: it handles large datasets and needs less memory to run. It is also popular because it focuses on accuracy of results, and it supports GPU learning, so data scientists use it widely for data science applications.**

# LET'S GO!!!

**Importing important libraries**

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import os
import matplotlib.pyplot as plt

In [None]:
train=pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
test=pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")


**Saving IDs column**

In [None]:
train_ID = train['Id']
test_ID = test['Id']
y_train = train['SalePrice']

**Dropping ID from the dataset**

In [None]:
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)

In [None]:
train.select_dtypes(include=['int64','float64'])

In [None]:
train.select_dtypes(include=['object'])

**Counting the categorical and numerical features and adding them up to get the total number of features**

In [None]:
categorical=len(train.select_dtypes(include=['object']).columns)
numbers=len(train.select_dtypes(include=['float64','int64']).columns)
print("Total number of Categorical Data is:",categorical)
print("Total number of Numerical Data is:",numbers)
print("Total Features are:",categorical+numbers)

**Checking out the shape of our training and testing dataset**

In [None]:
train.shape

In [None]:
test.shape

**Looking at the density of the sale price**

In [None]:
plt.figure(figsize=(10,5))
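# distplot is deprecated in recent seaborn; histplot with kde=True draws the same kind of plot
sns.histplot(train['SalePrice'], kde=True, color='salmon')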

**Since this problem has many features, we shall use a correlation matrix, drawn as a heatmap, to find the most strongly correlated features**

In [None]:
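# numeric_only=True skips the object (categorical) columns, which newer pandas no longer drops silently
corrmat = train.corr(numeric_only=True)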
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);

**Now we shall find the ten features most correlated with the sale price**

In [None]:
k = 10 #number of variables for heatmap
c = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train[c].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=c.values, xticklabels=c.values)
plt.show()

**Most Correlated features**

In [None]:
most_cor=pd.DataFrame(c)
most_cor
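As a rough illustration of the ranking step, the sketch below repeats it on a tiny made-up frame (all values are invented, not taken from the competition data). It assumes every column is numeric. Note that `nlargest` ranks by the most positive correlation, so a strongly negative feature would not show up in this list.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "SalePrice":   [100, 150, 200, 250, 300],
    "OverallQual": [4, 5, 6, 8, 9],
    "GrLivArea":   [900, 1100, 1500, 1400, 2000],
    "MoSold":      [6, 1, 9, 3, 6],
})

# Pearson correlation between every pair of columns.
toy_corr = toy.corr()

# Names of the k columns most positively correlated with the target
# (the target itself always comes first, with correlation 1.0).
k = 3
top_cols = toy_corr.nlargest(k, "SalePrice")["SalePrice"].index

# np.corrcoef on the transposed values reproduces the same sub-matrix.
sub = np.corrcoef(toy[top_cols].values.T)
print(list(top_cols))
print(sub.round(2))
```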

# Feature exploration

OverallQual: Rates the overall material and finish of the house

GrLivArea: Above grade (ground) living area square feet

GarageCars: Size of garage in car capacity

GarageArea: Size of garage in square feet

TotalBsmtSF: Total square feet of basement area

1stFlrSF: First Floor square feet

FullBath: Full bathrooms above grade

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

YearBuilt: Original construction date

**Now we shall see how they affect the sale price through easy yet informative visualisations**

In [None]:
sns.jointplot(x=train['OverallQual'], y=train['SalePrice'], kind='reg',color='skyblue',height=7)

An OverallQual of 10 means excellent quality, and higher quality goes with a higher sale price.


In [None]:

sns.jointplot(x=train['GrLivArea'], y=train['SalePrice'], kind='hex',color='violet',height=7)

In [None]:
train = train.drop(train[(train['GrLivArea']>4000) 
                         & (train['SalePrice']<300000)].index).reset_index(drop=True)

Houses with more than 4,000 sq ft of living area that sold for under 300,000 deviate from the overall trend, so they are treated as outliers and removed.
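To make the filtering step concrete, here is a minimal sketch on a made-up toy frame (the rows are invented for illustration and are not from the competition data). It builds the same kind of boolean mask and keeps the complement, which should be equivalent to dropping the matching rows by index.

```python
import pandas as pd

# Invented rows: the last one is very large but cheap, like the outliers targeted above.
toy = pd.DataFrame({
    "GrLivArea": [1200, 1800, 2500, 4500, 4700],
    "SalePrice": [140000, 210000, 320000, 750000, 180000],
})

# True for rows that are both very large and unusually cheap.
is_outlier = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 300000)

# Keep everything that is NOT flagged, then renumber the rows.
cleaned = toy[~is_outlier].reset_index(drop=True)
print(cleaned)
```

Note that the large, expensive house (4,500 sq ft) stays in: only rows matching both conditions are removed.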

In [None]:

sns.jointplot(x=train['GrLivArea'], y=train['SalePrice'], kind='hex',color='violet',height=7)

In [None]:
sns.boxplot(x=train['GarageCars'], y=train['SalePrice'])

In [None]:
train = train.drop(train[(train['GarageCars']>3) 
                         & (train['SalePrice']<300000)].index).reset_index(drop=True)

Again there is some deviation: houses with garage space for more than three cars that sold for under 300,000 are removed as outliers.

In [None]:
sns.boxplot(x=train['GarageCars'], y=train['SalePrice'])

In [None]:
sns.jointplot(x=train['GarageArea'], y=train['SalePrice'], kind='reg')

Seems fine.

In [None]:
sns.jointplot(x=train['GarageArea'], y=train['SalePrice'], kind='reg',color='coral',height=7)

Most houses cluster in the 0-1000 sq ft range of GarageArea.

In [None]:
sns.jointplot(x=train['1stFlrSF'], y=train['SalePrice'], kind='hex',color='gold',height=7)

Pretty clean: the points cluster around 1,000 sq ft of first-floor area and a sale price of about 100,000, and thin out beyond that.

In [None]:
sns.boxplot(x=train['TotRmsAbvGrd'], y=train['SalePrice'])

Looks good and well defined for the different numbers of rooms, except for houses with 11 rooms.

In [None]:
sns.jointplot(x=train['YearBuilt'], y=train['SalePrice'], kind='reg',color='green',height=7)

Recently built houses sell for more than older ones.

**Concatenating Test and Train to Make Imputing and Cleaning the Data Easier**

In [None]:
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
total=pd.concat((train,test)).reset_index(drop=True)
total.drop(['SalePrice'], axis=1, inplace=True)
print("Combined dataset size is : ",total.shape)
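The combine-then-split pattern can be sketched on two tiny made-up frames (column values are invented). The key assumption is that no rows are added, removed or reordered while the frames are combined, so the first `n_toy_train` rows are still the training rows.

```python
import pandas as pd

# Invented stand-ins for the real train and test frames.
toy_train = pd.DataFrame({"LotArea": [8000, 9500, 7200],
                          "SalePrice": [150000, 210000, 130000]})
toy_test = pd.DataFrame({"LotArea": [8800, 10100]})

n_toy_train = toy_train.shape[0]
toy_y = toy_train["SalePrice"].values        # keep the target aside

# Stack the two frames and drop the target so both parts share one schema.
combined = pd.concat((toy_train, toy_test)).reset_index(drop=True)
combined = combined.drop(columns=["SalePrice"])

# ... shared cleaning and feature engineering would happen here ...

# Slice by position to recover the two parts.
back_train = combined[:n_toy_train]
back_test = combined[n_toy_train:]
print(back_train.shape, back_test.shape)
```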


**Let's find the percentage of missing values per feature, which will guide the imputation and data cleaning that follow**

In [None]:
totalnull=(total.isnull().sum())/len(total)*100
totalnull=totalnull.drop(totalnull[totalnull == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Values' :totalnull})
missing_data
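For intuition, here is the same calculation on a small invented frame. It uses `isnull().mean()`, which should give the same result as dividing the null count by the number of rows.

```python
import numpy as np
import pandas as pd

# Invented values for illustration.
toy = pd.DataFrame({
    "PoolQC":      [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "LotArea":     [8450, 9600, 11250, 9550],
})

# Fraction of nulls per column, expressed as a percentage.
pct_missing = toy.isnull().mean() * 100

# Keep only columns that actually have gaps, worst first.
pct_missing = pct_missing[pct_missing > 0].sort_values(ascending=False)
print(pct_missing)
```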

**Missing Data Percentage Visualization for Clarity**

In [None]:
f, ax = plt.subplots(figsize=(13, 5))
plt.xticks(rotation=90)  # rotation must be a number (or 'vertical'), not the string '90'
sns.barplot(x=totalnull.index, y=totalnull)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)

# **Imputation of data and deep clean to fill up missing values!!**

Assigning None to missing categorical values

In [None]:

total['PoolQC']=total['PoolQC'].fillna('None')
total['MiscFeature']=total['MiscFeature'].fillna('None')
total['Alley']=total['Alley'].fillna('None')
total['Fence']=total['Fence'].fillna('None')
total['FireplaceQu']=total['FireplaceQu'].fillna('None')


Since houses in the same neighbourhood tend to have similar lot frontage, we fill LotFrontage's NaN values with the median LotFrontage of each Neighborhood.

In [None]:
lot= total.groupby("Neighborhood")["LotFrontage"]
print(lot.median())

In [None]:
total.loc[total.LotFrontage.isnull(),'LotFrontage']=total.groupby("Neighborhood").LotFrontage.transform('median')
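A small sketch of group-wise median imputation, using invented neighbourhood labels and values. `transform('median')` returns one value per row (the median of that row's group), so it lines up with the original column and can be passed straight to `fillna`.

```python
import numpy as np
import pandas as pd

# Invented data: two neighbourhoods, one gap in each.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage":  [60.0, 70.0, np.nan, 40.0, np.nan, 50.0],
})

# Per-row median of the row's own neighbourhood (NaN values are ignored).
group_median = toy.groupby("Neighborhood")["LotFrontage"].transform("median")

# Only the missing entries are replaced; existing values are untouched.
toy["LotFrontage"] = toy["LotFrontage"].fillna(group_median)
print(toy)
```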

Filling the remaining missing values with 'None' (categorical) or 0 (numerical).
For features with a fixed set of categories, we fill with the mode (or the most typical category) instead.

In [None]:
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    total[col] = total[col].fillna('None')
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    total[col] = total[col].fillna(0)
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    total[col] = total[col].fillna(0)
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    total[col] = total[col].fillna('None')
total["MasVnrType"] = total["MasVnrType"].fillna("None")
total["MasVnrArea"] =total["MasVnrArea"].fillna(0)
total['MSZoning'] = total['MSZoning'].fillna(total['MSZoning'].mode()[0])
total["Functional"] = total["Functional"].fillna("Typ")
total['Electrical'] = total['Electrical'].fillna("SBrkr")
total['KitchenQual'] = total['KitchenQual'].fillna('TA')
total['Exterior1st'] = total['Exterior1st'].fillna(total['Exterior1st'].mode()[0])
total['Exterior2nd'] = total['Exterior2nd'].fillna(total['Exterior2nd'].mode()[0])
total['SaleType'] = total['SaleType'].fillna(total['SaleType'].mode()[0])
total['MSSubClass'] = total['MSSubClass'].fillna("None")

These columns are more informative as categorical features than as numbers, so we convert them to strings.

In [None]:
total['MSSubClass'] = total['MSSubClass'].apply(str)

total['OverallCond'] = total['OverallCond'].astype(str)

total['YrSold'] = total['YrSold'].astype(str)
total['MoSold'] = total['MoSold'].astype(str)



# **Adding New Features**

In [None]:
total['TotalSF'] = total['TotalBsmtSF'] + total['1stFlrSF'] + total['2ndFlrSF']
total['Bathrooms']=total['BsmtHalfBath']+total['BsmtFullBath']+total['HalfBath']+total['FullBath']
total['TotalSqu'] = (total['BsmtFinSF1'] + total['BsmtFinSF2'] +total['1stFlrSF'] + total['2ndFlrSF'])
total['pool'] = total['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
total['2ndfloor'] = total['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)
total['garage'] = total['GarageArea'].apply(lambda x: 1 if x > 0 else 0)
total['Basement'] = total['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
total['Fireplace'] = total['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)
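The 0/1 indicator features can also be written without `apply`. The sketch below, on invented values, compares the two forms; the vectorised comparison is an alternative spelling, not a change in what the feature means.

```python
import pandas as pd

# Invented values for illustration.
toy = pd.DataFrame({"PoolArea": [0, 512, 0], "2ndFlrSF": [854, 0, 866]})

# Element-by-element version, as used in the cell above.
flag_apply = toy["PoolArea"].apply(lambda x: 1 if x > 0 else 0)

# Vectorised version: a boolean comparison cast to integers.
flag_vec = (toy["PoolArea"] > 0).astype(int)

print(flag_apply.tolist(), flag_vec.tolist())
```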

                                 

**Dropping some columns that don't seem important or significant, or that have many missing values**

In [None]:

total.drop(['Condition1','Condition2','Exterior1st','Exterior2nd'], axis=1, inplace=True)    
total=total.drop(['Utilities','Street','PoolQC'],axis=1)

In [None]:
missing=total.isnull().sum()
missing

In [None]:
total.select_dtypes(include=['object']).columns

# Preprocessing (LabelEncoder)

**Preprocessing (LabelEncoder): encodes target labels with values between 0 and n_classes-1. It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.**

In [None]:
from sklearn.preprocessing import LabelEncoder
c= ('Alley', 'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
       'BsmtFinType2', 'BsmtQual', 'CentralAir', 'Electrical', 'ExterCond',
       'ExterQual', 'Fence', 'FireplaceQu', 'Foundation', 'Functional',
       'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType', 'Heating',
       'HeatingQC', 'HouseStyle', 'KitchenQual', 'LandContour', 'LandSlope',
       'LotConfig', 'LotShape', 'MSSubClass', 'MSZoning', 'MasVnrType',
       'MiscFeature', 'MoSold', 'Neighborhood', 'OverallCond', 'PavedDrive',
        'RoofMatl', 'RoofStyle', 'SaleCondition', 'SaleType',
        'YrSold')
for i in c:
    l=LabelEncoder()
    l.fit(list(total[i].values))
    total[i]=l.transform(list(total[i].values))
total.shape    
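To see what the encoder actually produces, here is a minimal sketch on a handful of made-up quality codes. The integers follow the sorted order of the labels, not any notion of quality, so "Ex" (excellent) ends up with the smallest code here. Tree-based models can usually work with such arbitrary codes, but that is an assumption worth keeping in mind.

```python
from sklearn.preprocessing import LabelEncoder

# Invented sample of quality labels.
quality = ["TA", "Gd", "None", "Ex", "TA"]

enc = LabelEncoder()
codes = enc.fit_transform(quality)

print(list(enc.classes_))                    # labels in sorted order
print(codes.tolist())                        # integer code of each input label
print(list(enc.inverse_transform(codes)))    # back to the original labels
```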

**Fixing skewness: the skew() function returns the unbiased skew over the requested axis, normalized by N-1. Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.**

In [None]:
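# Log-transform the SalePrice column to reduce its right skew.
# Note: y_train was extracted earlier and stays on the original price scale.
train["SalePrice"] = np.log1p(train["SalePrice"])
plt.figure(figsize=(10,5))
# distplot is deprecated in recent seaborn; histplot with kde=True draws the same kind of plot
sns.histplot(train['SalePrice'], kde=True, color='coral');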

In [None]:
print("Skewness: %f" % train['SalePrice'].skew())
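The effect of `log1p` on skewness can be checked on a few invented prices with one expensive outlier. The sketch also shows `expm1`, the inverse transform, which would be needed to turn predictions back into prices if a model were trained on the log-transformed target (an assumption about usage, not something this notebook does).

```python
import numpy as np
import pandas as pd

# Invented prices: mostly moderate, one very expensive house.
prices = pd.Series([90000, 120000, 135000, 150000, 180000, 250000, 755000],
                   dtype=float)

log_prices = np.log1p(prices)        # log(1 + x), safe even for zeros

print("skew before:", prices.skew())
print("skew after: ", log_prices.skew())

# expm1 undoes log1p, returning values on the original scale.
restored = np.expm1(log_prices)
```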

**Splitting data into train and test again**

In [None]:
train = total[:ntrain]
test = total[ntrain:]

In [None]:
train.head()

In [None]:
test.head()

# Modeling and Predicting

In [None]:
from sklearn.model_selection import train_test_split

x_train = train.values

# Hold out 20% of the training rows for validation; the Kaggle test set stays in `test`
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=0)

In [None]:
from sklearn import preprocessing
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

import lightgbm as lgbm


import warnings
warnings.filterwarnings(action='ignore')

In [None]:
kfold = KFold(n_splits=10, random_state = 77, shuffle = True)
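As a quick illustration of what the splitter does, the sketch below runs `KFold` on ten toy rows with five folds (fewer than the ten folds used above, purely to keep the output short). Each row should land in the validation part of exactly one fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)     # ten toy rows, values are arbitrary
kf = KFold(n_splits=5, shuffle=True, random_state=77)

val_rows = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # train_idx and val_idx are row positions, not the data itself
    val_rows.extend(val_idx.tolist())
    print(fold, "train:", train_idx, "validate:", val_idx)
```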

**In the GridSearchCV approach, a machine learning model is evaluated over a range of hyperparameter values. It is called grid search because it looks for the best set of hyperparameters in a grid of hyperparameter values, scoring each combination with cross-validation.**
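To get a feel for the cost of a grid search, this sketch enumerates the grid used later in this notebook with plain `itertools` (it mimics the enumeration only, not the fitting). With ten folds, each candidate is trained ten times, plus one final refit on all the training rows when `refit` is left at its default.

```python
from itertools import product

# Same grid as the LightGBM cell further down.
param_grid = {
    "learning_rate": [0.1],
    "feature_fraction": [0.5, 0.8],
    "num_leaves": [31, 63, 127],
}

# Every combination of one value per hyperparameter.
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[n] for n in names))]

n_splits = 10
total_fits = len(candidates) * n_splits
print(len(candidates), "candidates,", total_fits, "cross-validation fits")
```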

**LightGBM is a gradient boosting framework that uses tree-based learning algorithms.**

**LightGBM grows trees leaf-wise (vertically), while many other algorithms grow trees level-wise (horizontally). It chooses the leaf with the maximum delta loss to grow. For the same number of leaves, a leaf-wise algorithm can reduce more loss than a level-wise one.**
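The idea of leaf-wise growth can be sketched with a toy one-feature regression tree. This is a simplified illustration, not LightGBM's implementation (which works on gradient histograms); the helper names `sse`, `best_split` and `grow_leaf_wise` are made up for this example. At each step the leaf whose best split removes the most squared error is the one that gets split.

```python
import numpy as np

def sse(y):
    """Sum of squared errors around the mean (0 for an empty array)."""
    return float(((y - y.mean()) ** 2).sum()) if len(y) else 0.0

def best_split(x, y):
    """Return (gain, threshold) of the best single split, or (0.0, None)."""
    best_gain, best_thr = 0.0, None
    for thr in np.unique(x)[:-1]:            # skip the max so the right side is non-empty
        left, right = y[x <= thr], y[x > thr]
        gain = sse(y) - sse(left) - sse(right)
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_gain, best_thr

def grow_leaf_wise(x, y, num_leaves):
    """Greedy leaf-wise growth: always split the leaf with the largest gain."""
    leaves = [(x, y)]
    while len(leaves) < num_leaves:
        gains = [best_split(lx, ly) for lx, ly in leaves]
        i = max(range(len(leaves)), key=lambda j: gains[j][0])
        gain, thr = gains[i]
        if thr is None:                      # no split reduces the loss any further
            break
        lx, ly = leaves.pop(i)
        leaves.append((lx[lx <= thr], ly[lx <= thr]))
        leaves.append((lx[lx > thr], ly[lx > thr]))
    return leaves

# Toy data: flat on the left half, noisy on the right half.
x = np.arange(8, dtype=float)
y = np.array([1.0, 1.0, 1.0, 1.0, 5.0, 9.0, 5.0, 9.0])
leaves = grow_leaf_wise(x, y, num_leaves=3)
print([ly.tolist() for _, ly in leaves])
```

With a budget of three leaves, the flat left half is left alone and the second split goes to the noisy right half. A level-wise strategy would instead split every leaf at the current depth before going deeper, including leaves where nothing is gained.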

In [None]:
# LightGBM Grid Search
params = {
    'task' : 'train',
    'objective' : 'regression',
    'subsample' : 0.8,
    'max_depth' : 7
}

param_grid = {
    'learning_rate': [0.1],
    'feature_fraction' : [0.5, 0.8],
    'num_leaves':[31, 63, 127]
}

lgbm_model = lgbm.LGBMRegressor(**params, verbose=-1)

lgbm_grid  = GridSearchCV(lgbm_model, 
                          param_grid, 
                          cv=kfold, 
                          scoring='neg_mean_squared_error', 
                          return_train_score=True)

lgbm_grid.fit(x_train, y_train)

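# r2_score expects (y_true, y_pred), in that order
print(r2_score(y_train, lgbm_grid.predict(x_train)))

# Use the best configuration found by the grid search (already refit on x_train)
lgbm_model = lgbm_grid.best_estimator_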
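A note on the R² score: it is not symmetric in its arguments, and scikit-learn's `r2_score` takes the true values first and the predictions second. The sketch below writes the formula out by hand on invented numbers (the helper `r2` is a hypothetical stand-in, not the library function) to show that swapping the arguments changes the result.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()          # unexplained variation
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total variation of the truth
    return 1.0 - ss_res / ss_tot

# Invented values for illustration.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [150.0, 200.0, 300.0, 350.0]

correct = r2(y_true, y_pred)
swapped = r2(y_pred, y_true)
print(correct, swapped)
```

The denominator is computed from whichever array is passed first, which is why the order matters.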

# **Submission**

In [None]:
ids = test_ID
predictions =lgbm_model.predict(test)
output = pd.DataFrame({ 'Id' : ids, 'SalePrice': predictions })  # the competition expects the column name 'Id'
output.to_csv('submission.csv', index=False)