# House Price Prediction Using a Random Forest

This notebook walks through using a Random Forest regressor to predict house prices from the dataset provided in the competition.

Decision Forests are a family of tree-based models that includes Random Forests and Gradient Boosted Trees. They are a strong starting point for tabular data and will often outperform neural networks, or at least give you a solid baseline before you begin experimenting with them.
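As a rough illustration of the two families named above, the sketch below fits scikit-learn's `RandomForestRegressor` and `GradientBoostingRegressor` on the same synthetic data. The synthetic dataset and the hyperparameters are assumptions chosen for illustration; they are not the competition data or tuned settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic tabular data (assumption: stands in for a real dataset).
X, y = make_regression(n_samples=500, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(n_estimators=100, random_state=0),
}

# Fit each model and record its R^2 on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_val, y_val)
    print(f"{name}: validation R^2 = {scores[name]:.3f}")
```

Both estimators share the same `fit`/`predict`/`score` interface, so swapping one for the other in the rest of this notebook would only require changing the constructor.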

# 1. Importing necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import r2_score, mean_squared_error
from xgboost import XGBRegressor

# Show every row when printing a DataFrame
pd.set_option('display.max_rows', None)

# 2. Read Training data and testing data from file

In [2]:
df_train = pd.read_csv("data/train.csv")
df_test = pd.read_csv("data/test.csv")

In [3]:
print(f"df_train shape is {df_train.shape}")
print(f"df_test shape is {df_test.shape}")
print(df_train.head())

df_train shape is (1460, 81)
df_test shape is (1459, 80)
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold  S

# 3. Separate target and features from data

In [4]:
# Features are every column except the target
x_train = df_train.drop('SalePrice', axis=1)
y_train = df_train['SalePrice']
x_test = df_test.copy()
# Keep the test Ids for the submission file
test_ids = x_test['Id']

# 4. Preprocessing the features

In [5]:
# Combine train and test so both receive identical preprocessing
all_data = pd.concat([x_train, x_test], axis=0)

# Remove the "Id" column, which carries no predictive information
all_data.drop('Id', axis=1, inplace=True)

# Handle missing values
for col in all_data.columns:
    if all_data[col].dtype == 'object':  # categorical
        all_data[col] = all_data[col].fillna('None')
    else:  # numeric: fill with the median of the combined data
        all_data[col] = all_data[col].fillna(all_data[col].median())

# One-hot encode categorical variables
all_data = pd.get_dummies(all_data)

# Split back by position: the first len(df_train) rows are the training set
x_train = all_data.iloc[:len(df_train), :]
x_test = all_data.iloc[len(df_train):, :]
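One reason to concatenate before encoding is that `pd.get_dummies` creates one column per category it actually sees, so encoding train and test separately can produce mismatched columns. The sketch below shows this on two toy frames; the frames and their values are hypothetical, made up only for illustration.

```python
import pandas as pd

# Toy frames (hypothetical values): 'C (all)' appears only in the test split.
toy_train = pd.DataFrame({"MSZoning": ["RL", "RM", "RL"],
                          "LotArea": [8450, 9600, 11250]})
toy_test = pd.DataFrame({"MSZoning": ["RL", "C (all)"],
                         "LotArea": [9550, 14260]})

# Encoding each split separately yields different column sets.
sep_train = pd.get_dummies(toy_train)
sep_test = pd.get_dummies(toy_test)
print(sorted(set(sep_test.columns) - set(sep_train.columns)))

# Encoding the concatenation, then splitting by position, keeps them aligned.
combined = pd.get_dummies(pd.concat([toy_train, toy_test], axis=0))
enc_train = combined.iloc[:len(toy_train), :]
enc_test = combined.iloc[len(toy_train):, :]
print(list(enc_train.columns) == list(enc_test.columns))
```

The trade-off is that statistics such as the imputation median are then computed over train and test together, which is convenient here but would not be possible if the test data arrived later.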


# 5. Random Forest Model

In [6]:
rf_model = RandomForestRegressor(n_estimators=300,
                                 max_depth=12,
                                 n_jobs = -1,
                                 random_state=42,
                                 oob_score=True,
                                 bootstrap=True)

# Hold out 20% of the training data for validation
x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=42)
model_rf = rf_model.fit(x_tr, y_tr)
# score() returns R^2; this is a single hold-out split, not cross-validation
print(f"Random Forest R^2 (x100) on the hold-out validation set is {model_rf.score(x_val, y_val) * 100}")

Random Forest R^2 (x100) on the hold-out validation set is 89.46324967259923
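The `score` method of a scikit-learn regressor returns the coefficient of determination, R^2, so the figure printed above is R^2 multiplied by 100. A minimal sketch of what that number measures is given below; the helper name `r2_by_hand` and the price values are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical sale prices and predictions, for illustration only.
y_true = [200000, 150000, 320000, 180000]
y_pred = [210000, 140000, 300000, 185000]

print(r2_by_hand(y_true, y_pred), r2_score(y_true, y_pred))
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean price.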


In [7]:

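# 5-fold cross-validation with shuffling
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Refit on the full training set; this is the model used for the OOB score and the final predictions
rf_model.fit(x_train, y_train)

# Training-set metrics (optimistic, since the model has seen these rows; not printed here)
rf_train_preds = rf_model.predict(x_train)
rf_train_r2 = r2_score(y_train, rf_train_preds)
rf_train_rmse = np.sqrt(mean_squared_error(y_train, rf_train_preds))

# cross_val_score fits a fresh clone per fold, so rf_model itself is left unchanged
rf_cv_scores = cross_val_score(rf_model, x_train, y_train, cv=kf, scoring='r2')
print(f"Random Forest CV R^2 scores {rf_cv_scores}")

# Out-of-bag R^2, estimated from the samples each tree did not see during bootstrapping
print(f"Random Forest OOB score: {rf_model.oob_score_:.4f}")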

Random Forest CV R^2 scores [0.8884113  0.89827578 0.63663484 0.87623559 0.89295746]
Random Forest OOB score: 0.8607
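The five fold scores are easier to compare with a single hold-out figure once they are summarized. The sketch below copies the fold scores from the output above and reports their mean, spread, and weakest fold. The noticeably lower third fold may point to a few unusual houses in that split, though that is only a guess without inspecting the data.

```python
import numpy as np

# Fold scores copied from the cross-validation output above.
rf_cv_scores = np.array([0.8884113, 0.89827578, 0.63663484, 0.87623559, 0.89295746])

mean = rf_cv_scores.mean()
std = rf_cv_scores.std()
worst = int(rf_cv_scores.argmin())  # zero-based index of the weakest fold

print(f"mean R^2 = {mean:.4f}, std = {std:.4f}, weakest fold index = {worst}")
```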


# 6. Generate predictions on the test data

In [10]:
predictions = rf_model.predict(x_test)
output = pd.DataFrame({'Id':test_ids,
                       'SalePrice':predictions})
output.to_csv("submission.csv", index=False)
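Before uploading, it can help to check the submission frame for obvious problems. The sketch below is one possible check; the helper `check_submission` and the toy Ids and prices are hypothetical, and the rules it enforces (two columns, Ids in the original order, positive non-missing prices) are assumptions about what a valid submission looks like.

```python
import pandas as pd

def check_submission(output, expected_ids):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(output.columns) != ["Id", "SalePrice"]:
        problems.append("unexpected columns")
        return problems
    if list(output["Id"]) != list(expected_ids):
        problems.append("Ids do not match the test set")
    if output["SalePrice"].isna().any():
        problems.append("missing predictions")
    if (output["SalePrice"] <= 0).any():
        problems.append("non-positive predictions")
    return problems

# Toy data (hypothetical values) standing in for test_ids and predictions.
toy_ids = [1461, 1462, 1463]
good = pd.DataFrame({"Id": toy_ids, "SalePrice": [120000.0, 155000.0, 180000.0]})
bad = pd.DataFrame({"Id": toy_ids, "SalePrice": [120000.0, float("nan"), 180000.0]})

print(check_submission(good, toy_ids))
print(check_submission(bad, toy_ids))
```

In this notebook the call would be `check_submission(output, test_ids)`.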