# Predicting House Prices - Using Gradient Boosting and Random Forest Regressors

# **The steps I have taken for this task:**

# 1. EDA
# 2. Feature Engineering
# 3. Fitting the Models

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.options.mode.chained_assignment = None
import seaborn as sns
import matplotlib.pyplot as plt
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


**The first part of our EDA will be to look at the data we have, check the data types and see how many null values there are.**

In [None]:
train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.shape, test.shape

In [None]:
train.info()

In [None]:
test.info()

**We can clearly see that in both the training and test datasets the columns Alley, PoolQC, Fence and MiscFeature have a very large number of null values: over 50% of their values are missing. I am going to drop these straight away, as filling in that many values will not help us.**
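
To make the "over 50%" check concrete, here is a minimal sketch of how the share of missing values per column could be measured. The toy frame and the 0.5 `threshold` are my own assumptions for illustration; they are not values taken from the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train; the column names mirror the real data,
# but the values are made up for illustration.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "Fence": [np.nan, "MnPrv", np.nan, np.nan],
})

# isnull().mean() gives the fraction of missing values in each column.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Hypothetical cut-off: flag columns with more than half of their values missing.
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(to_drop)
```

The same calculation could be run on `train` and `test` directly to confirm which columns cross the cut-off.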

**I am also going to drop the Id column from both, as we do not need it for modelling; however, the test Ids are saved because they are required at the end for the submission.**

In [None]:
train = train.drop(['Id','Alley','PoolQC','Fence','MiscFeature'],axis=1)
testID = test.Id
#print(testID)
test = test.drop(['Id','Alley','PoolQC','Fence','MiscFeature'],axis=1)

# Now let's take a look at correlations!
**Looking at correlations will help us see which features are going to be most important in our model.**

In [None]:
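# numeric_only=True restricts the correlation to numeric columns; recent pandas versions raise an error on string columns otherwise
corr = train.corr(numeric_only=True)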
f, ax = plt.subplots(figsize=(16, 12))
ax = sns.heatmap(corr,linewidths=.5,annot=True)

**Now let's refine this graph so we can see it more clearly, keeping only the features that have a strong correlation with sale price.**

In [None]:
corr = train.corr(numeric_only=True)
# keep the features whose absolute correlation with SalePrice is above 0.5
strong_corr = corr.index[abs(corr["SalePrice"])>0.5]
plt.figure(figsize=(10,10))
sns.heatmap(train[strong_corr].corr(),annot=True,cmap="RdYlGn")

**As we can see above, some variables are themselves highly correlated with one another, such as GarageCars and GarageArea. This can be detrimental to our model, since two such features carry largely the same information.**
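
Reading pairs off a heatmap by eye gets tedious, so here is a rough sketch of how strongly correlated feature pairs could be listed programmatically. The data is synthetic (a made-up GarageArea that closely tracks GarageCars), and the 0.8 cut-off is an assumption of mine, not a value used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
garage_cars = rng.integers(0, 4, size=200)

# Synthetic stand-in data: GarageArea follows GarageCars plus a little noise,
# while YearBuilt is unrelated to both.
toy = pd.DataFrame({
    "GarageCars": garage_cars,
    "GarageArea": garage_cars * 250 + rng.normal(0, 30, size=200),
    "YearBuilt": rng.integers(1900, 2010, size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature_a, feature_b) -> correlation and keep the strong pairs.
pairs = upper.stack()
high_pairs = pairs[pairs > 0.8]
print(high_pairs)
```

From each pair that shows up, one feature can then be kept and the other left out of plots or models.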

**Below I will plot some of these so we can get a better look at how they are related; some have been left out for the reason stated above.**

In [None]:
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
#I have not chosen them all: as seen above, some columns are highly correlated with one another (e.g. GarageCars and GarageArea), so I only use one from each such pair.
sns.pairplot(train[cols], height = 2.5)

# Let's investigate our target variable: Sale Price

In [None]:
sns.histplot(train.SalePrice,kde=True)

In [None]:
from scipy.stats import skew
skew(train.SalePrice)

**As we can see, sale price is not normally distributed; it is right-skewed. The models that we will be using here will work better with normally distributed data, so in order to fix this we will use numpy to log-transform this variable.**
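
As a small illustration of what the log transform does, here is a sketch on synthetic right-skewed "prices" (drawn from a lognormal distribution, not the real SalePrice). It uses pandas' `Series.skew`, which applies a slightly different correction from scipy's `skew`, and it also shows that `np.expm1` undoes `np.log1p`, which is what lets predictions be converted back to prices later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed values standing in for SalePrice.
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))

log_prices = np.log1p(prices)    # log(1 + x), also defined at x = 0
restored = np.expm1(log_prices)  # exp(x) - 1, the inverse of log1p

skew_before = prices.skew()
skew_after = log_prices.skew()

print("skew before:", skew_before)
print("skew after: ", skew_after)
print("round trip ok:", np.allclose(prices, restored))
```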

**Below you will also see me combine the training and test data; I add a new column, train, to tell the two apart.**

In [None]:
#LOG OF SALE PRICE
#DROP SALE PRICE BUT SAVE IT 
#CONCATENATE TWO DATAFRAMES BUT ADD COLUMN TRAIN
train['train']  = 1
test['train']  = 0
all_data = pd.concat([train, test], axis=0,sort=False)

train.SalePrice = np.log1p(train.SalePrice)
#print(all_data.head())

In [None]:
#Lets have a look at sale price now
sns.histplot(train.SalePrice,kde=True)

**Now it looks more normally distributed!**

# We are now going to plot some graphs to understand more about our dataset

In [None]:
sns.barplot(x=all_data.OverallQual,y=all_data.SalePrice)

We can see a very linear pattern between these two variables; this visualisation could have been predicted from the strong correlation value we saw earlier.

In [None]:
#does the neighbourhood affect the price?
plt.figure(figsize=(10,10))
plt.xticks(rotation = 45)
sns.barplot(x=all_data.Neighborhood,y=all_data.SalePrice)

In [None]:
#GrLiving area is the area above ground in square feet.
sns.scatterplot(x=all_data.GrLivArea,y=all_data.SalePrice)

In [None]:
#are modern houses going for more or less?
sns.scatterplot(x=all_data.YearBuilt,y=all_data.SalePrice)

In [None]:
#number of bathrooms imply bigger house?
sns.stripplot(x=all_data.FullBath,y=all_data.SalePrice)

**Now we split our data into numerical and categorical data to understand and pre-process both independently, before combining them again for use in our model.**

In [None]:
#Dropping but saving the SalePrice for later Use as it is the dependent variable
salePrice = train.SalePrice
all_data = all_data.drop(['SalePrice'],axis=1)

cat = all_data.select_dtypes(include=['object'])
num = all_data.select_dtypes(exclude=['object'])
#print(all_data)
#print(cat.columns)

In [None]:
#Let's see how many there are
print("There are " + str(num.shape[1]) + " numerical columns")
print("There are " + str(cat.shape[1]) + " categorical columns")

**Now let's look at some of these categorical variables**

In [None]:
#for i in cat.columns:
#    plt.figure()
#    sns.countplot(x=cat[i])

#After carrying out the previous code it is clear that some of the variables have an extremely low variance - and should be dropped from the dataset, these columns are:

#'Heating','RoofMatl','Condition2','Street','Utilities'

cols = ['Heating','RoofMatl','Condition2','Street','Utilities']
for i in cols:
    plt.figure()
    sns.countplot(x=cat[i])

**We do not need the columns above due to their extremely low variance: the vast majority of rows take on the same value, so the model may weigh the few rare values too heavily because of coincidences.**
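
Instead of judging every count plot by eye, the share of rows taken by the most common level can be computed for each column. This is only a sketch on made-up data, and the 0.95 cut-off is a hypothetical choice of mine rather than a rule I applied above.

```python
import pandas as pd

# Toy categorical data: Street is almost constant, HouseStyle is not.
toy = pd.DataFrame({
    "Street": ["Pave"] * 99 + ["Grvl"],
    "HouseStyle": ["1Story"] * 50 + ["2Story"] * 30 + ["1.5Fin"] * 20,
})

# value_counts(normalize=True) is sorted in descending order,
# so the first entry is the share of the most frequent level.
top_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Hypothetical cut-off: treat a column as near-constant above 95%.
near_constant = top_share[top_share > 0.95].index.tolist()

print(top_share)
print(near_constant)
```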

In [None]:
cat = cat.drop(['Heating','RoofMatl','Condition2','Street','Utilities'],axis=1)

**Now let's look at our numerical data.**

In [None]:
print(num.columns)
print(num.shape[1])

**Here we are going to use our list of strongly correlated features from earlier to plot and get an idea of some of our numerical variables**

In [None]:
print(len(strong_corr))

#plotting the top correlated features; SalePrice is dropped from the list by name
#because it has already been removed from num (this avoids relying on its position).
for col in strong_corr.drop('SalePrice'):
    plt.figure()
    sns.histplot(num[col],kde=True)

**We see here that some of these are skewed; we will fix this in feature engineering.**

# First let's fix null values

In [None]:
#Lets look at Categorical First
#cat.info()
#Lets look at how many null values there are
for i in cat.columns:
    print(str(i) + ' has ' + str(cat[i].isnull().sum()) + ' null values')

**Looking at the data description provided allows us to know how to handle null values for certain features**
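
The two kinds of fill used here can be sketched on a toy frame. The values are invented; the idea is that a null in a column like GarageType means the feature is absent (so it becomes its own 'None' level), while a null in a column like KitchenQual is a genuine gap (so it gets the most common value).

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the real data.
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd", "Attchd"],
    "KitchenQual": ["TA", "Gd", np.nan, "TA"],
})

# Null means "no garage": keep that information as an explicit category.
toy["GarageType"] = toy["GarageType"].fillna("None")

# Null is a genuine gap: fill with the mode (the most frequent value).
kitchen_mode = toy["KitchenQual"].mode()[0]
toy["KitchenQual"] = toy["KitchenQual"].fillna(kitchen_mode)

print(toy)
```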

In [None]:
#From looking at the data description we can fill some of these values in already.

#If null then it has no fireplace
#MasVnrType and Functional have the following values if null from data description
cat['FireplaceQu'] = cat['FireplaceQu'].fillna('None')
cat['MasVnrType'] = cat['MasVnrType'].fillna('None')
cat['Functional'] = cat['Functional'].fillna('Typ')

#If null then no garage
for col in ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']:
    cat[col] = cat[col].fillna('None')

#If null then no basement
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    cat[col] = cat[col].fillna('None')


# Replace the missing values in each of the columns below with their mode
cat['Electrical'] = cat['Electrical'].fillna("SBrkr")
cat['KitchenQual'] = cat['KitchenQual'].fillna("TA")
cat['Exterior1st'] = cat['Exterior1st'].fillna(cat['Exterior1st'].mode()[0])
cat['Exterior2nd'] = cat['Exterior2nd'].fillna(cat['Exterior2nd'].mode()[0])
cat['SaleType'] = cat['SaleType'].fillna(cat['SaleType'].mode()[0])
cat['MSZoning'] = cat['MSZoning'].fillna(cat['MSZoning'].mode()[0])

**We can now see how many nulls are left**

In [None]:
#Now lets see how many null values remain.
for i in cat.columns:
    print(str(i) + ' has ' + str(cat[i].isnull().sum()) + ' null values')

**We now repeat this process for numerical data**

In [None]:
#Now for filling in numerical data
#num.info()
#Lets look at how many null values there are
for i in num.columns:
    print(str(i) + ' has ' + str(num[i].isnull().sum()) + ' null values')

In [None]:
#no garage then 0
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    num[col] = num[col].fillna(0)

#plt.hist(num.LotFrontage)
#Plotting lot frontage to see whether to fill with mean or median
#As it is skewed we are going to replace it with the median.
num['LotFrontage'] = num['LotFrontage'].fillna(num['LotFrontage'].median())

#fill the rest with 0
for col in ('BsmtHalfBath', 'BsmtFullBath','TotalBsmtSF','BsmtUnfSF','BsmtFinSF2','BsmtFinSF1','MasVnrArea'):
    num[col] = num[col].fillna(0)


In [None]:
#Lets see how many null values remain
for i in num.columns:
    print(str(i) + ' has ' + str(num[i].isnull().sum()) + ' null values')

Next comes some feature engineering; after that we will look into the skewed data and fix it!

# Feature engineering time!

In [None]:
#Total surface area = TotalBsmtSF + 1stFlrSF + 2ndFlrSF
#Total bathrooms = FullBath + 0.5*HalfBath + BsmtFullBath + 0.5*BsmtHalfBath
#Total home quality = OverallQual + OverallCond
num['TotalSF'] = num['TotalBsmtSF'] + num['1stFlrSF'] + num['2ndFlrSF']
num['Total_Home_Quality'] = num['OverallQual'] + num['OverallCond']
num['Total_Bathrooms'] = (num['FullBath'] + (0.5 * num['HalfBath']) +
                               num['BsmtFullBath'] + (0.5 * num['BsmtHalfBath']))

num['haspool'] = num['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
num['has2ndfloor'] = num['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)
num['hasgarage'] = num['GarageArea'].apply(lambda x: 1 if x > 0 else 0)
num['hasbsmt'] = num['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
num['hasfireplace'] = num['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)

**Now let's encode our categorical data**
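
One reason for encoding the combined data rather than train and test separately is that one-hot encoding creates a column per level it actually sees. The toy example below is a sketch of that effect; the `keys=` argument is just my shortcut for telling the two parts apart here, whereas the notebook itself uses the `train` flag column.

```python
import pandas as pd

# Toy frames: "Hip" only appears in train, "Flat" only appears in test.
train_toy = pd.DataFrame({"RoofStyle": ["Gable", "Hip", "Gable"]})
test_toy = pd.DataFrame({"RoofStyle": ["Gable", "Flat"]})

# Encoding separately gives each frame its own, mismatched set of columns.
sep_train = pd.get_dummies(train_toy)
sep_test = pd.get_dummies(test_toy)
print(list(sep_train.columns))
print(list(sep_test.columns))

# Encoding the stacked frame gives one shared set of columns.
stacked = pd.concat([train_toy, test_toy], keys=["train", "test"])
combined = pd.get_dummies(stacked)
enc_train = combined.loc["train"]
enc_test = combined.loc["test"]
print(list(enc_train.columns))
```

With matching columns, a model fitted on the train part can be applied to the test part without any re-alignment.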

In [None]:
cat = pd.get_dummies(cat)
#print(cat)
print(cat.shape)

# Fixing skewed data

In [None]:
#Lets look at how skewed our data is on a box plot.
#original plot is too skewed to see so we will scale it down!
f, ax = plt.subplots(figsize=(8, 7))
ax.set_xscale("log")
ax = sns.boxplot(data=num , orient="h", palette="Set1")

In [None]:
#Lets look at the skew values
skew_features = num.apply(lambda x: skew(x)).sort_values(ascending=False)
high_skew = skew_features.apply(lambda x: abs(x) > 0.5)
high_skew = high_skew[high_skew == True]
print(high_skew)

In [None]:
#don't want to transform our new feature-engineered binary columns that show up as highly skewed
transform_cols = [x for x in high_skew.index if x not in ['hasgarage','hasbsmt','haspool']]
#print(transform_cols)

for i in transform_cols:
    num[i] = np.log1p(num[i])

print(num.head())


# Putting back together and making Test and Train

In [None]:
#Let's put our numerical and categorical data back together
data = pd.concat([num, cat], axis=1,sort=False)
print(num.shape, cat.shape, data.shape)
print(data.head())

In [None]:
#Split these into our test and train
train_Y = salePrice
train_X = data[data.train == 1]
test_X = data[data.train == 0]
#test_X.shape
train_X = train_X.drop(['train'],axis=1)
test_X = test_X.drop(['train'],axis=1)
#drop train column in both

# **MODEL**

**Here I use a gradient boosting regressor and a random forest.**
**Tuning the hyperparameters was something I have had trouble understanding 100%, but after reading and watching content on it and playing around with them for a while, the values below seem to work well.**
**However, they could be improved.**
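
One way to make the tuning less of a guessing game is k-fold cross-validation. The sketch below is not what I ran for this notebook: it uses synthetic data from `make_regression`, a deliberately small `n_estimators` so that it runs quickly, and two arbitrary learning rates chosen only to show the mechanics of comparing settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for train_X / train_Y.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

results = {}
for lr in (0.01, 0.1):  # arbitrary candidate values, for illustration only
    model = GradientBoostingRegressor(n_estimators=100,
                                      learning_rate=lr,
                                      max_depth=3,
                                      random_state=42)
    # scikit-learn maximises scores, so RMSE is reported as a negative number.
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    results[lr] = -scores.mean()

for lr, rmse in results.items():
    print("learning_rate=" + str(lr) + " mean CV RMSE: " + str(round(rmse, 2)))
```

The setting with the lowest mean RMSE across the folds would be the one to prefer, and the same loop could be pointed at `train_X` and `train_Y`.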

In [None]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
#from xgboost import XGBRegressor

gbr = GradientBoostingRegressor(n_estimators=6000,
                                learning_rate=0.01,
                                max_depth=4,
                                max_features='sqrt',
                                min_samples_leaf=15,
                                min_samples_split=10,
                                loss='huber',
                                random_state=42)  

# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=1200,
                          max_depth=15,
                          min_samples_split=5,
                          min_samples_leaf=5,
                          max_features=None,
                          oob_score=True,
                          random_state=42)

In [None]:
print(train_X)
gbr.fit(train_X, train_Y)
rf.fit(train_X, train_Y)

In [None]:
predictions = gbr.predict(test_X)
predictions = np.expm1(predictions) 
predictions = predictions.round(0)
print(predictions)

#predictions2 = rf.predict(test_X)
#predictions2 = np.expm1(predictions2)
#predictions2 = predictions2.round(0)
#print(predictions2)

In [None]:
#Submission
submission = pd.read_csv("../input/house-prices-advanced-regression-techniques/sample_submission.csv")
submission.Id = testID
submission.SalePrice = predictions
submission.head()
submission.to_csv("submission.csv", index=False)