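Train a decision tree model to predict the house price from the other variables in the dataset below. Because Price is continuous, this is a regression problem, so a decision tree regressor is used rather than a classifier. Use 5-fold cross-validation for scoring. Which variables do you think are categorical? How good is the prediction?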

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeRegressor, export_text
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load data
housing = pd.read_excel('C:/Users/Karthik.Iyer/Downloads/AccelerateAI/Tree-Based-Models-main/07_House_Price_Data.xlsx', sheet_name='Data')
housing.head()

Unnamed: 0,Home No,Nbhd,Offers,SqFt,Brick,Bedrooms,Bathrooms,Price
0,1,0,2,1790,0,2,2,114300
1,2,0,3,2030,0,4,2,114200
2,3,0,1,1740,0,3,2,114800
3,4,0,3,1980,0,3,2,94700
4,5,0,3,2130,0,3,3,119800


In [3]:
# Check info
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128 entries, 0 to 127
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Home No    128 non-null    int64
 1   Nbhd       128 non-null    int64
 2   Offers     128 non-null    int64
 3   SqFt       128 non-null    int64
 4   Brick      128 non-null    int64
 5   Bedrooms   128 non-null    int64
 6   Bathrooms  128 non-null    int64
 7   Price      128 non-null    int64
dtypes: int64(8)
memory usage: 8.1 KB


In [4]:
# Let's check Nbhd
housing['Nbhd'].value_counts()

0    89
1    39
Name: Nbhd, dtype: int64

Looking at the values, Nbhd is a binary categorical variable. It may indicate whether a neighbour has already occupied or bought the house, although the dataset does not document its meaning.

In [5]:
# Let's check Offers
housing['Offers'].value_counts()

3    46
2    36
1    23
4    19
5     3
6     1
Name: Offers, dtype: int64

Looking at the values, Offers is discrete and takes only a few specific values (1 to 6). It could be the number of offers the builder made to the buyer at the time of purchase, or a category of offering defined by the builder, such as premium apartments.

In [6]:
# Let's check Brick
housing['Brick'].value_counts()

0    86
1    42
Name: Brick, dtype: int64

The values indicate a binary categorical variable, possibly whether the house is under construction (Brick = 0) or fully built (Brick = 1); the dataset does not document the coding.
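One quick way to shortlist categorical candidates is to count the distinct values in each column. The sketch below does this on the five rows printed by housing.head() above. The helper name low_cardinality_columns and its max_levels threshold are hypothetical, chosen only for illustration, and the threshold is a judgment call rather than a rule.

```python
import pandas as pd

# First five rows of the dataset, copied from the housing.head() output above.
sample = pd.DataFrame({
    "Home No": [1, 2, 3, 4, 5],
    "Nbhd": [0, 0, 0, 0, 0],
    "Offers": [2, 3, 1, 3, 3],
    "SqFt": [1790, 2030, 1740, 1980, 2130],
    "Brick": [0, 0, 0, 0, 0],
    "Bedrooms": [2, 4, 3, 3, 3],
    "Bathrooms": [2, 2, 2, 2, 3],
    "Price": [114300, 114200, 114800, 94700, 119800],
})


def low_cardinality_columns(df, max_levels=6):
    """Hypothetical helper: columns with at most max_levels distinct values."""
    counts = df.nunique()
    return counts[counts <= max_levels].index.tolist()


# A small threshold is used here because the sample has only five rows.
candidates = low_cardinality_columns(sample, max_levels=3)
print(candidates)
```

On these five rows Bedrooms and Bathrooms also pass the threshold. They are counts, so whether to treat them as categorical is a modelling choice; a low number of distinct values only flags a column for a closer look.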

In [7]:
# Let's convert Nbhd, Offers and Brick into dummies
housing_onehot = pd.get_dummies(housing, columns=['Nbhd','Offers','Brick'])
housing_onehot.sample(5)

Unnamed: 0,Home No,SqFt,Bedrooms,Bathrooms,Price,Nbhd_0,Nbhd_1,Offers_1,Offers_2,Offers_3,Offers_4,Offers_5,Offers_6,Brick_0,Brick_1
22,23,1690,3,2,91700,1,0,0,0,1,0,0,0,1,0
91,92,2150,3,2,116500,1,0,0,1,0,0,0,0,1,0
10,11,2030,3,2,132500,1,0,0,0,1,0,0,0,0,1
33,34,2280,5,3,139600,1,0,0,0,0,1,0,0,0,1
67,68,2040,4,3,151900,1,0,0,0,1,0,0,0,1,0


In [8]:
# Let's drop Home No (a row identifier, not a predictor)
housing_onehot.drop('Home No', axis=1, inplace=True)

In [9]:
# Split the data
X = housing_onehot.drop('Price', axis=1)
y =  housing_onehot['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [10]:
# Let's fit the decision tree
params = {'min_samples_split' : [5,10,15,20],
          'min_samples_leaf' : [10,15,20],
          'max_depth' : [5,10,15]}

# Create GridSearchCV object (5-fold CV; the default score for a regressor is R²)
clf_gs = GridSearchCV(DecisionTreeRegressor(), cv=5, param_grid=params)

# Fit
clf_gs.fit(X_train, y_train)

# Print best params and the best mean cross-validated R²
print(clf_gs.best_params_)
print(clf_gs.best_score_)

{'max_depth': 5, 'min_samples_leaf': 10, 'min_samples_split': 5}
0.42087384105873615
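The second number printed above is best_score_, the mean of the validation scores across the 5 folds for the best parameter setting, not a score on the full training set. A minimal sketch of that relationship follows, using synthetic stand-in data from make_regression (an assumption, not the housing data) and a smaller grid than the one in the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the housing data): 96 rows, the size of a 75% split of 128.
X_demo, y_demo = make_regression(n_samples=96, n_features=5, noise=10.0, random_state=0)

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [3, 5], "min_samples_leaf": [5, 10]},
    cv=5,
)
grid.fit(X_demo, y_demo)

results = grid.cv_results_
best = grid.best_index_

# Collect the validation R² of the best setting on each of the 5 folds.
fold_scores = np.array([results[f"split{i}_test_score"][best] for i in range(5)])

# best_score_ is the mean of these per-fold scores.
print(fold_scores.round(3), round(float(fold_scores.mean()), 3), round(float(grid.best_score_), 3))
```

Looking at the per-fold scores also shows how much the estimate varies from fold to fold, which matters with so few rows per fold.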


In [11]:
# Check R² on the held-out test set
clf_gs.score(X_test, y_test)

0.5752963809407762

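The test R² of about 0.58 is higher than the mean cross-validated R² of about 0.42 obtained on the training data, but the prediction is still poor: the tree explains only about 58% of the variance in price. Note that 0.42 is a cross-validation score, not a training-set score, and that with only 32 test rows the test estimate is itself noisy.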
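R² alone does not say how large the errors are in price units. One way to put it in context, sketched below, is to report the mean absolute error next to R² and to compare against a baseline that always predicts the training mean. The data here is synthetic (an assumed linear relation between size and price plus noise), so the printed numbers say nothing about the housing data; only the pattern of the comparison carries over.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the housing data): price driven by size plus noise.
rng = np.random.default_rng(0)
sqft = rng.uniform(1500, 2500, size=128)
price = 50 * sqft + rng.normal(0, 15000, size=128)
X_demo = sqft.reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, price, test_size=0.25, random_state=42)

# Tree with the best parameters reported by the grid search above.
tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=10, random_state=0).fit(X_tr, y_tr)
# Baseline that ignores the features and predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)

for name, model in [("tree", tree), ("mean baseline", baseline)]:
    pred = model.predict(X_te)
    print(name, "R2 =", round(r2_score(y_te, pred), 3),
          "MAE =", round(mean_absolute_error(y_te, pred)))
```

A constant prediction can never score above 0 in R² on the test set, so the baseline row gives a floor against which the tree's R² and error can be read.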