Tree Based Models - Q14 - 20/July
===================================
The infamous house price prediction problem:

07_House_Price_Data.xlsx contains house price data along with a few relevant variables.
https://drive.google.com/drive/folders/1Jl8iDu7nGmrqCECbrLqmVafgwE5PYfiU

Train a decision tree model to predict the house price from the other variables in the dataset (price is continuous, so this calls for a regression tree rather than a classifier). Use 5-fold CV for scoring. Which variables do you think are categorical? How good is the prediction?

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeRegressor, export_text
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Read and display the data file
df = pd.read_excel('/Users/riteshturlapaty/ai-ml-learning/AccelerateAI/7.DecisionTree/DailyQuiz/07_House_Price_Data.xlsx',sheet_name='Data')
df.head(5)

,Home No,Nbhd,Offers,SqFt,Brick,Bedrooms,Bathrooms,Price
0,1,0,2,1790,0,2,2,114300
1,2,0,3,2030,0,4,2,114200
2,3,0,1,1740,0,3,2,114800
3,4,0,3,1980,0,3,2,94700
4,5,0,3,2130,0,3,3,119800


In [3]:
# Get the datatypes of the data frame columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128 entries, 0 to 127
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Home No    128 non-null    int64
 1   Nbhd       128 non-null    int64
 2   Offers     128 non-null    int64
 3   SqFt       128 non-null    int64
 4   Brick      128 non-null    int64
 5   Bedrooms   128 non-null    int64
 6   Bathrooms  128 non-null    int64
 7   Price      128 non-null    int64
dtypes: int64(8)
memory usage: 8.1 KB


In [4]:
# Describe the data frame
df.describe()

,Home No,Nbhd,Offers,SqFt,Brick,Bedrooms,Bathrooms,Price
count,128.0,128.0,128.0,128.0,128.0,128.0,128.0,128.0
mean,64.5,0.304688,2.578125,2000.9375,0.328125,3.023438,2.445312,130427.34375
std,37.094474,0.462084,1.069324,211.572431,0.471376,0.725951,0.514492,26868.770371
min,1.0,0.0,1.0,1450.0,0.0,2.0,2.0,69100.0
25%,32.75,0.0,2.0,1880.0,0.0,3.0,2.0,111325.0
50%,64.5,0.0,3.0,2000.0,0.0,3.0,2.0,125950.0
75%,96.25,1.0,3.0,2140.0,1.0,3.0,3.0,148250.0
max,128.0,1.0,6.0,2590.0,1.0,5.0,4.0,211200.0


### Check for categorical variables in the dataframe

In [6]:
df['Nbhd'].value_counts()

0    89
1    39
Name: Nbhd, dtype: int64

As seen, Nbhd (neighborhood) takes only the values 0 and 1, so it can be treated as a binary categorical variable.

In [7]:
df['Offers'].value_counts()

3    46
2    36
1    23
4    19
5     3
6     1
Name: Offers, dtype: int64

Offers takes only six distinct integer values (1 to 6), so it is treated as categorical here. It is an ordered count, though, so keeping it numeric would also be defensible.

In [8]:
df['Brick'].value_counts()

0    86
1    42
Name: Brick, dtype: int64

Brick (whether the house is built of brick or not) takes only the values 0 and 1, so it can be treated as a binary categorical variable.
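
The three `value_counts()` checks can also be folded into a single pass over all columns. The sketch below is only an illustration: `low_cardinality_columns` is a hypothetical helper and the cutoff of 3 distinct values is an arbitrary assumption, neither comes from the notebook. The sample frame holds the five rows shown by `df.head(5)`.

```python
import pandas as pd

# The five rows displayed by df.head(5) earlier in the notebook.
sample = pd.DataFrame({
    "Home No": [1, 2, 3, 4, 5],
    "Nbhd": [0, 0, 0, 0, 0],
    "Offers": [2, 3, 1, 3, 3],
    "SqFt": [1790, 2030, 1740, 1980, 2130],
    "Brick": [0, 0, 0, 0, 0],
    "Bedrooms": [2, 4, 3, 3, 3],
    "Bathrooms": [2, 2, 2, 2, 3],
    "Price": [114300, 114200, 114800, 94700, 119800],
})

def low_cardinality_columns(frame, max_unique):
    """Hypothetical helper: map each column with at most
    `max_unique` distinct values to its distinct-value count."""
    counts = frame.nunique()
    return counts[counts <= max_unique].to_dict()

# Assumed cutoff of 3, chosen only because the sample has five rows.
candidates = low_cardinality_columns(sample, max_unique=3)
print(candidates)
```

On the full 128 rows a cutoff of 6 would be needed to catch Offers, and it would also flag Bedrooms (2 to 5) and Bathrooms (2 to 4), going by the `describe()` output. A check like this only lists candidates; whether a flagged column is really categorical remains a judgment call.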

In [10]:
# Create dummy values for the identified categorical variables
housing_data = pd.get_dummies(df, columns=['Nbhd','Offers','Brick'])
housing_data.sample(5)

,Home No,SqFt,Bedrooms,Bathrooms,Price,Nbhd_0,Nbhd_1,Offers_1,Offers_2,Offers_3,Offers_4,Offers_5,Offers_6,Brick_0,Brick_1
4,5,2130,3,3,119800,1,0,0,0,1,0,0,0,1,0
45,46,1810,3,2,103200,1,0,0,0,1,0,0,0,1,0
106,107,2130,3,2,108500,1,0,0,0,0,1,0,0,1,0
56,57,2190,3,2,140900,1,0,0,0,1,0,0,0,0,1
127,128,2250,3,3,124600,1,0,0,0,0,1,0,0,1,0
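
Because Nbhd and Brick are already coded as 0/1, expanding them produces complementary column pairs (Nbhd_0/Nbhd_1, Brick_0/Brick_1) that carry the same information twice. A possible alternative, shown here on a tiny made-up frame and not what the notebook does, is to expand only Offers and leave the binary columns as they are:

```python
import pandas as pd

# Tiny illustrative frame (values are made up for the example).
demo = pd.DataFrame({
    "Nbhd": [0, 1, 0],
    "Offers": [2, 3, 1],
    "Brick": [0, 0, 1],
})

# Only Offers is expanded; Nbhd and Brick stay as single 0/1 columns.
encoded = pd.get_dummies(demo, columns=["Offers"], dtype=int)
print(encoded.columns.tolist())
```

For a decision tree the redundant columns do no harm to the fit, so this is mainly about keeping the feature table smaller and easier to read.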


In [11]:
# Let's drop Home No (a row identifier, not a predictor)
housing_data.drop('Home No', axis=1, inplace=True)

In [12]:
# Split the data
X = housing_data.drop('Price', axis=1)
y =  housing_data['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [13]:
# Let's fit the decision tree
params = {'min_samples_split' : [5,10,15,20],
          'min_samples_leaf' : [10,15,20],
          'max_depth' : [5,10,15]}

# Create GridSearchCV object
clf_gs = GridSearchCV(DecisionTreeRegressor(), cv=5, param_grid=params)

# Fit
clf_gs.fit(X_train, y_train)

# Print best params and best score
print(clf_gs.best_params_)
print(clf_gs.best_score_)

{'max_depth': 5, 'min_samples_leaf': 10, 'min_samples_split': 5}
0.42087384105873615
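
`best_score_` is the mean cross-validated score of the best parameter combination. Since no `scoring` argument was passed, GridSearchCV falls back on the regressor's own `score` method, which is R². The sketch below shows how the same kind of search could report RMSE instead. It runs on synthetic data: the feature ranges and coefficients are assumptions made up for illustration, not values from the workbook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data: a size-like feature and a 0/1 flag.
sqft = rng.uniform(1450, 2590, size=96)
brick = rng.integers(0, 2, size=96)
X_demo = np.column_stack([sqft, brick])
y_demo = 50 * sqft + 20000 * brick + rng.normal(0, 10000, size=96)

params = {"min_samples_leaf": [10, 15, 20], "max_depth": [5, 10, 15]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid=params,
    cv=5,
    # scikit-learn negates error metrics so that "higher is better" still holds.
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

# Flip the sign back to get the cross-validated RMSE in target units.
cv_rmse = -search.best_score_
print(search.best_params_, round(cv_rmse))
```

An RMSE in price units is often easier to interpret than R² when the question is "how far off is a typical prediction?".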


In [14]:
# Check score on Test
clf_gs.score(X_test, y_test)

0.5752963809407762

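As observed, the test-set score of about 58% (R² ≈ 0.58) is not that high, although it is better than the mean 5-fold cross-validated score of about 42% (R² ≈ 0.42) obtained on the training set.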
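
R² alone does not say how far the predictions are off in price units, which is what "how good is the prediction?" usually means in practice. A minimal sketch of the complementary error metrics, using hypothetical actual and predicted prices (illustrative numbers, not notebook output):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted prices, for illustration only.
y_true = np.array([100000, 120000, 140000, 160000])
y_pred = np.array([110000, 115000, 150000, 145000])

mae = mean_absolute_error(y_true, y_pred)            # average absolute miss
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # penalizes large misses more
r2 = r2_score(y_true, y_pred)                        # share of variance explained

print(f"MAE:  {mae:.0f}")
print(f"RMSE: {rmse:.0f}")
print(f"R^2:  {r2:.3f}")
```

In the notebook the same calls would take `y_test` and `clf_gs.predict(X_test)`. Note also that with 128 rows and `test_size=0.25` the test set has only 32 houses, so any single test score should be read as a rough estimate.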