<a href="https://colab.research.google.com/github/younesabdolmalaky/home-price/blob/main/Home_Price.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import required packages
We use pandas and NumPy for data pre-processing, and scikit-learn's ready-made functions for train/test splitting, label encoding, data scaling, and the R² and MAE metrics. SciPy is used to reduce the skew of the numerical variables, and CatBoost is the regression algorithm used to solve the problem.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from scipy import stats
from scipy.stats import norm, skew
from scipy.special import boxcox1p, inv_boxcox1p
from catboost import CatBoostRegressor

In [3]:
df = pd.read_csv('../datasets/train.csv')
df.head()

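| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |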


# Feature Engineering 
## Imputing missing values
We impute them by proceeding sequentially through features with missing values

### PoolQC : 
The data description says NA means "No Pool". That makes sense, given the huge ratio of missing values (over 99%) and that the majority of houses have no pool at all.
### MiscFeature : 
The data description says NA means "no misc feature".
### Alley :
The data description says NA means "no alley access".
### Fence : 
The data description says NA means "no fence".
### FireplaceQu :
The data description says NA means "no fireplace".
### MasVnrArea and MasVnrType :
NA most likely means no masonry veneer for these houses. We fill 0 for the area. In the cell below the type, like the other categorical features in this group, is also filled with 0, which serves as a placeholder category.

In [4]:
df["PoolQC"] = df["PoolQC"].fillna(0)
df["MiscFeature"] = df["MiscFeature"].fillna(0)
df["Alley"] = df["Alley"].fillna(0)
df["Fence"] = df["Fence"].fillna(0)
df["FireplaceQu"] = df["FireplaceQu"].fillna(0)
df["MasVnrType"] = df["MasVnrType"].fillna(0)
df["MasVnrArea"] = df["MasVnrArea"].fillna(0)

### LotFrontage :
Since the length of street frontage of a property is most likely similar to that of other houses in its neighborhood, we fill in missing values with the median LotFrontage of the neighborhood.

In [5]:
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
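
To make the group-wise imputation concrete, here is a plain-Python sketch of the same idea as the `groupby(...).transform(...)` call above. The neighborhood names and frontage values are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy rows: (Neighborhood, LotFrontage); None marks a missing value.
rows = [
    ("CollgCr", 65.0), ("CollgCr", None), ("CollgCr", 75.0),
    ("Veenker", 80.0), ("Veenker", 90.0), ("Veenker", None),
]

# Collect the observed (non-missing) values per neighborhood.
observed = defaultdict(list)
for hood, frontage in rows:
    if frontage is not None:
        observed[hood].append(frontage)

# Median per neighborhood, computed from observed values only.
group_median = {hood: median(vals) for hood, vals in observed.items()}

# Fill each missing value with the median of its own neighborhood.
filled = [(hood, group_median[hood] if f is None else f) for hood, f in rows]
print(filled)
```

This sketch assumes every neighborhood has at least one observed value. A neighborhood with only missing values has no median to fall back on and would need separate handling.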

### GarageType, GarageFinish, GarageQual and GarageCond :
Replacing missing data with 'None'.
### GarageYrBlt, GarageArea and GarageCars :
Replacing missing data with 0 (no garage means no cars in such a garage).
### BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath and BsmtHalfBath :
Missing values are likely zero because there is no basement.
### BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1 and BsmtFinType2 : 
For all these categorical basement-related features, NaN means that there is no basement, so we fill in 'None'.

In [6]:
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    df[col] = df[col].fillna('None')
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    df[col] = df[col].fillna(0)

for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    df[col] = df[col].fillna(0)

for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    df[col] = df[col].fillna('None')

### MSZoning (The general zoning classification) :
'RL' is by far the most common value, so we fill in missing values with the mode ('RL').
### Utilities :
For this categorical feature all records are "AllPub", except for one "NoSeWa" and 2 NA. Since the house with "NoSeWa" is in the training set, this feature won't help in predictive modelling, so we can safely remove it.
### Functional :
The data description says NA means typical ("Typ").
### Electrical :
It has one NA value. Since this feature is mostly 'SBrkr', we set that for the missing value.
### KitchenQual:
Only one NA value. As with Electrical, we set 'TA' (the most frequent value) for the missing value in KitchenQual.
### Exterior1st and Exterior2nd :
Both Exterior1st and Exterior2nd have only one missing value each. We substitute the most common string.
### SaleType :
Fill in again with the most frequent value, which is "WD".
### MSSubClass : 
NA most likely means no building class. We replace missing values with 0, which becomes its own category once the column is converted to a string.


In [7]:
df['MSZoning'] = df['MSZoning'].fillna(df['MSZoning'].mode()[0])
df = df.drop(['Utilities'], axis=1)
df["Functional"] = df["Functional"].fillna("Typ")
df['Electrical'] = df['Electrical'].fillna(df['Electrical'].mode()[0])
df['KitchenQual'] = df['KitchenQual'].fillna(df['KitchenQual'].mode()[0])
df['Exterior1st'] = df['Exterior1st'].fillna(df['Exterior1st'].mode()[0])
df['Exterior2nd'] = df['Exterior2nd'].fillna(df['Exterior2nd'].mode()[0])
df['SaleType'] = df['SaleType'].fillna(df['SaleType'].mode()[0])
df['MSSubClass'] = df['MSSubClass'].fillna(0)

Transforming some numerical variables that are really categorical

In [8]:
df['MSSubClass'] = df['MSSubClass'].apply(str)
df['OverallCond'] = df['OverallCond'].astype(str)
df['YrSold'] = df['YrSold'].astype(str)
df['MoSold'] = df['MoSold'].astype(str)

Label encoding some categorical variables that may contain information in their ordering



In [9]:
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(df[c].values)) 
    df[c] = lbl.transform(list(df[c].values))

print('Shape df: {}'.format(df.shape))

Shape df: (1460, 80)
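
One point worth keeping in mind: `LabelEncoder` assigns integer codes in sorted order of the labels, not in order of quality. The plain-Python sketch below mimics that behavior on a made-up quality column and contrasts it with an explicit ordinal mapping. The `quality_order` list is an assumption for illustration, based on the usual Po < Fa < TA < Gd < Ex reading of these codes.

```python
# Hypothetical quality column with a 'None' placeholder for missing values.
values = ["TA", "Gd", "Ex", "None", "Fa", "Gd"]

# Codes in sorted label order, which is how LabelEncoder orders its classes.
alpha_codes = {label: i for i, label in enumerate(sorted(set(values)))}
alpha_encoded = [alpha_codes[v] for v in values]

# Explicit ordinal mapping (assumed order, worst to best).
quality_order = ["None", "Po", "Fa", "TA", "Gd", "Ex"]
ordinal_codes = {label: i for i, label in enumerate(quality_order)}
ordinal_encoded = [ordinal_codes[v] for v in values]

print(alpha_codes)      # 'Ex' gets the smallest code, 'TA' the largest
print(alpha_encoded)
print(ordinal_encoded)  # codes now increase with quality
```

Tree-based models such as CatBoost can still split on alphabetically assigned codes, so this is a design choice rather than an error. An explicit mapping simply makes the ordering meaningful.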


### TotalSF : 
Since area-related features are very important in determining house prices, we add one more feature: the total area of the basement, first floor and second floor of each house.
### Skewed features : 
Check the skew of all numerical features

In [10]:
df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']
numeric_feats = df.dtypes[df.dtypes != "object"].index
skewed_feats = df[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head(10)



Skew in numerical features: 



| Feature | Skew |
|---|---|
| MiscVal | 24.45164 |
| PoolQC | 16.718999 |
| PoolArea | 14.813135 |
| LotArea | 12.195142 |
| 3SsnPorch | 10.293752 |
| LowQualFinSF | 9.00208 |
| LandSlope | 4.808735 |
| KitchenAbvGr | 4.483784 |
| Alley | 4.284964 |
| BsmtFinSF2 | 4.250888 |
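
For readers who want to see what the skew numbers measure, the following sketch computes the sample skewness as the third central moment divided by the second central moment raised to the power 1.5, which is the definition `scipy.stats.skew` uses with its default settings. The two input lists are made-up examples.

```python
def sample_skewness(xs):
    """Biased sample skewness: m3 / m2**1.5, using population moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]      # no tail on either side
right_tailed = [1.0, 1.0, 1.0, 2.0, 10.0]  # one large value pulls the tail right

print(sample_skewness(symmetric))
print(sample_skewness(right_tailed))
```

A symmetric sample has skewness 0, while a long right tail gives a positive value. Features such as MiscVal or PoolArea, where most houses have 0 and a few have large values, show this pattern strongly.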


### Box Cox Transformation of (highly) skewed features

We use the scipy function boxcox1p, which computes the Box-Cox transformation of 1 + x.

Note that setting λ = 0 is equivalent to log1p. SalePrice is among the numerical features, so the target is transformed in this step as well and is inverted with inv_boxcox1p after prediction.

See the SciPy documentation for boxcox1p for more details on the Box-Cox transformation.
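
As a small worked example, the sketch below writes out the Box-Cox formula for 1 + x and its inverse for a single number, using only the standard library. The function names `boxcox1p_scalar` and `inv_boxcox1p_scalar` are hypothetical helpers for illustration, not SciPy functions.

```python
import math

def boxcox1p_scalar(x, lam):
    """Box-Cox transform of 1 + x: ((1 + x)**lam - 1) / lam, or log1p(x) when lam == 0."""
    if lam == 0:
        return math.log1p(x)
    return ((1.0 + x) ** lam - 1.0) / lam

def inv_boxcox1p_scalar(y, lam):
    """Inverse of boxcox1p_scalar."""
    if lam == 0:
        return math.expm1(y)
    return (lam * y + 1.0) ** (1.0 / lam) - 1.0

lam = 0.15
price = 208500.0  # SalePrice of the first row shown earlier

transformed = boxcox1p_scalar(price, lam)
restored = inv_boxcox1p_scalar(transformed, lam)
print(transformed, restored)
```

With λ = 0.15 the first house's SalePrice of 208500 maps to roughly 35.19, which agrees with the transformed Price value displayed for that row later in the notebook.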

In [11]:
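# Note: masking a DataFrame with a boolean DataFrame keeps every row (cells that
# fail the test become NaN), so all numeric features are transformed below.
# To keep only the highly skewed ones, filter on the column instead:
# skewness = skewness[skewness['Skew'].abs() > 0.75]
skewness = skewness[abs(skewness) > 0.75]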
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))

skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    df[feat] = boxcox1p(df[feat], lam)

There are 61 skewed numerical features to Box Cox transform


### Getting dummy categorical features

In [12]:
df = pd.get_dummies(df)
print(df.shape)
df.head()

(1460, 224)


| | Id | MSSubClass | LotFrontage | LotArea | Street | Alley | LotShape | LandSlope | OverallQual | OverallCond | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.730463 | 2.75025 | 5.831328 | 19.212182 | 0.730463 | 0.0 | 1.540963 | 0.0 | 2.440268 | 1.820334 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1.194318 | 1.820334 | 6.221214 | 19.712205 | 0.730463 | 0.0 | 1.540963 | 0.0 | 2.259674 | 2.440268 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1.540963 | 2.75025 | 5.91494 | 20.347241 | 0.730463 | 0.0 | 0.0 | 0.0 | 2.440268 | 1.820334 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 1.820334 | 2.885846 | 5.684507 | 19.691553 | 0.730463 | 0.0 | 0.0 | 0.0 | 2.440268 | 1.820334 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2.055642 | 2.75025 | 6.314735 | 21.32516 | 0.730463 | 0.0 | 0.0 | 0.0 | 2.602594 | 1.820334 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |


In [13]:
df.isnull().sum().sum()

0

In [14]:
df['Price'] = df['SalePrice']
df = df.drop("SalePrice", axis='columns')
df.head()

| | Id | MSSubClass | LotFrontage | LotArea | Street | Alley | LotShape | LandSlope | OverallQual | OverallCond | ... | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.730463 | 2.75025 | 5.831328 | 19.212182 | 0.730463 | 0.0 | 1.540963 | 0.0 | 2.440268 | 1.820334 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 35.190995 |
| 1 | 1.194318 | 1.820334 | 6.221214 | 19.712205 | 0.730463 | 0.0 | 1.540963 | 0.0 | 2.259674 | 2.440268 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 34.329249 |
| 2 | 1.540963 | 2.75025 | 5.91494 | 20.347241 | 0.730463 | 0.0 | 0.0 | 0.0 | 2.440268 | 1.820334 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 35.629466 |
| 3 | 1.820334 | 2.885846 | 5.684507 | 19.691553 | 0.730463 | 0.0 | 0.0 | 0.0 | 2.440268 | 1.820334 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 32.763482 |
| 4 | 2.055642 | 2.75025 | 6.314735 | 21.32516 | 0.730463 | 0.0 | 0.0 | 0.0 | 2.602594 | 1.820334 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 36.34636 |


### MinMaxScaler
Min-max scaling, also known as min-max normalization, is the simplest rescaling method: it maps the range of each feature to [0, 1] or [−1, 1]. The choice of target range depends on the nature of the data. MinMaxScaler uses [0, 1] by default.

In [15]:
scaler = MinMaxScaler()
data = scaler.fit_transform(df)
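
The arithmetic behind min-max scaling to [0, 1] is short enough to write out by hand. The sketch below applies it to a single column and then inverts it. The helper names and the sample values are illustrative assumptions, loosely modeled on the Box-Cox transformed prices shown above.

```python
def minmax_fit(column):
    """Return the (min, max) pair that defines the scaling."""
    return min(column), max(column)

def minmax_transform(column, lo, hi):
    """Map each value to (x - min) / (max - min), so the result lies in [0, 1]."""
    return [(x - lo) / (hi - lo) for x in column]

def minmax_inverse(scaled, lo, hi):
    """Undo the scaling: x = s * (max - min) + min."""
    return [s * (hi - lo) + lo for s in scaled]

# Illustrative values only.
prices = [32.76, 34.33, 35.19, 35.63, 36.35]

lo, hi = minmax_fit(prices)
scaled = minmax_transform(prices, lo, hi)
restored = minmax_inverse(scaled, lo, hi)
print(scaled)
print(restored)
```

The smallest value maps to 0 and the largest to 1. This sketch does not handle a constant column, where max equals min and the division would fail.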

Identify independent and dependent variables

In [16]:
y = data[:, df.columns.get_loc("Price")]
X = np.delete(data, df.columns.get_loc("Price"), axis=1)
y = y.astype('float')

Separation of training and test data

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### CatBoost :
CatBoost builds upon the theory of decision trees and gradient boosting. The main idea of boosting is to sequentially combine many weak models (a model performing slightly better than random chance) and thus through greedy search create a strong competitive predictive model.
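
To illustrate the boosting idea in the paragraph above, here is a toy gradient boosting loop for squared error that uses one-split decision stumps as the weak models. It is a teaching sketch, not CatBoost's actual algorithm, which adds techniques such as ordered boosting and symmetric trees. The data and all function names are made up for this example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]

def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        # Move each prediction a small step toward the stump's output.
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Made-up 1-D regression data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1, 6.8, 8.3]

base, stumps, preds = boost(xs, ys)
baseline_mse = mse(ys, [base] * len(ys))
boosted_mse = mse(ys, preds)
print(baseline_mse, boosted_mse)
```

Each stump on its own is a poor model, but because every round fits what the previous rounds left unexplained, the training error of the combined model falls below that of the constant baseline.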

In [18]:
model = CatBoostRegressor(eval_metric='R2', iterations=2000, task_type='GPU')
model.fit(X_train, y_train, eval_set=(X_test, y_test))

Learning rate set to 0.051833
0:	learn: 0.0681209	test: 0.0585518	best: 0.0585518 (0)	total: 18.8ms	remaining: 37.7s
1:	total: 34.5ms	remaining: 34.5s
2:	total: 50.1ms	remaining: 33.4s
3:	total: 72.6ms	remaining: 36.2s
4:	total: 91.6ms	remaining: 36.6s


Default metric period is 5 because R2 is/are not implemented for GPU
Metric R2 is not implemented on GPU. Will use CPU for metric computation, this could significantly affect learning time


5:	learn: 0.3425496	test: 0.3211612	best: 0.3211612 (5)	total: 118ms	remaining: 39.2s
6:	total: 133ms	remaining: 37.9s
7:	total: 149ms	remaining: 37.2s
8:	total: 176ms	remaining: 39s
9:	total: 194ms	remaining: 38.5s
10:	learn: 0.5184029	test: 0.4909500	best: 0.4909500 (10)	total: 210ms	remaining: 38s
11:	total: 227ms	remaining: 37.7s
12:	total: 242ms	remaining: 36.9s
13:	total: 256ms	remaining: 36.3s
14:	total: 271ms	remaining: 35.9s
15:	learn: 0.6292222	test: 0.6015536	best: 0.6015536 (15)	total: 287ms	remaining: 35.6s
16:	total: 302ms	remaining: 35.2s
17:	total: 320ms	remaining: 35.2s
18:	total: 335ms	remaining: 34.9s
19:	total: 350ms	remaining: 34.7s
20:	learn: 0.7075071	test: 0.6784942	best: 0.6784942 (20)	total: 364ms	remaining: 34.3s
21:	total: 376ms	remaining: 33.8s
22:	total: 388ms	remaining: 33.3s
23:	total: 399ms	remaining: 32.9s
24:	total: 411ms	remaining: 32.4s
25:	learn: 0.7610857	test: 0.7307230	best: 0.7307230 (25)	total: 423ms	remaining: 32.1s
26:	total: 435ms	remaining

<catboost.core.CatBoostRegressor at 0x7f0248f2d850>

Predict on the test data

In [19]:
y_pred = model.predict(X_test)

In [20]:
data_test_true = pd.DataFrame(X_test)
data_test_pred = pd.DataFrame(X_test)

In [21]:
data_test_true['Price'] = y_test
data_test_pred['Price'] = y_pred

Invert the min-max scaling

In [22]:
inversed_data_true = scaler.inverse_transform(data_test_true)
inversed_data_pred = scaler.inverse_transform(data_test_pred)

In [23]:
# Price was appended as the last column, so its index equals the number of feature columns
y_test = inversed_data_true[:, X.shape[1]]
y_pred = inversed_data_pred[:, X.shape[1]]

Invert the Box-Cox transformation (boxcox1p) of the target

In [24]:
y_test = inv_boxcox1p(y_test , lam)
y_pred = inv_boxcox1p(y_pred , lam)
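
The two inversions have to be applied in the reverse order of the forward transforms: first undo the min-max scaling, then undo Box-Cox. The sketch below shows that round trip for the target column alone. The values of `price_min` and `price_max` are hypothetical. With a fitted MinMaxScaler they would correspond to the `data_min_` and `data_max_` entries at the Price column index.

```python
LAM = 0.15  # same lambda as used for the Box-Cox step in the notebook

# Hypothetical min and max of the Box-Cox transformed Price column.
price_min, price_max = 25.0, 45.0

def to_model_space(price):
    """Forward path: Box-Cox of 1 + price, then min-max scaling."""
    transformed = ((1.0 + price) ** LAM - 1.0) / LAM
    return (transformed - price_min) / (price_max - price_min)

def to_dollars(scaled):
    """Backward path: undo min-max scaling first, then undo Box-Cox."""
    transformed = scaled * (price_max - price_min) + price_min
    return (LAM * transformed + 1.0) ** (1.0 / LAM) - 1.0

scaled = to_model_space(208500.0)
restored = to_dollars(scaled)
print(scaled, restored)
```

Inverting only the target column this way is an alternative to rebuilding the full feature matrix for `scaler.inverse_transform`. Both routes should give the same prices, since min-max scaling treats each column independently.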

In [25]:
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))

Mean Absolute Error: 15461.226413118018


In [26]:
r2_score(y_test ,y_pred)

0.906410120928585
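
For reference, both metrics can be written out in a few lines. The sketch below uses made-up prices and hypothetical function names; it follows the standard definitions that `mean_absolute_error` and `r2_score` implement for a single target.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score_manual(y_true, y_pred):
    """1 - SS_res / SS_tot: share of the target's variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up prices in dollars.
y_true_demo = [200000.0, 150000.0, 250000.0, 300000.0]
y_pred_demo = [210000.0, 140000.0, 240000.0, 310000.0]

mae_demo = mean_absolute_error_manual(y_true_demo, y_pred_demo)
r2_demo = r2_score_manual(y_true_demo, y_pred_demo)
print(mae_demo, r2_demo)
```

MAE is expressed in the same unit as the target, so the notebook's value of about 15461 reads as an average error of roughly 15,000 dollars per house, while R² is unitless and compares the model against always predicting the mean.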