## Data cleaning

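The columns `Alley`, `PoolQC`, `Fence`, and `MiscFeature` were dropped because more than 80% of their values were null. In the remaining columns, null values were filled with the column mean for numeric columns and with the column mode otherwise. (Note that the `preprocessing` method in the code below fills non-numeric nulls with the placeholder string `"others"` rather than the mode.)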
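
The cleaning rule described above can be sketched as follows. This is a minimal illustration, not the notebook's own code: the function name `clean`, the `null_threshold` parameter, and the toy frame are assumptions made for the example, and it uses mode imputation as stated in the text.

```python
import pandas as pd

def clean(df: pd.DataFrame, null_threshold: float = 0.8) -> pd.DataFrame:
    """Hypothetical helper: drop mostly-null columns, then impute the rest."""
    df = df.copy()
    # Fraction of missing values in each column
    null_fraction = df.isna().mean()
    df = df.drop(columns=null_fraction[null_fraction > null_threshold].index)
    for col in df.columns:
        if df[col].isna().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                # Numeric column: fill gaps with the column mean
                df[col] = df[col].fillna(df[col].mean())
            else:
                # Non-numeric column: fill gaps with the most frequent value
                df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df

# Toy data (made up for illustration): 'Alley' is 5/6 null and gets dropped
toy = pd.DataFrame({
    "Alley": [None, None, None, None, None, "Grvl"],
    "LotFrontage": [60.0, None, 80.0, 70.0, 70.0, 70.0],
    "GarageType": ["Attchd", None, "Attchd", "Detchd", "Attchd", "Attchd"],
})
cleaned = clean(toy)
print(cleaned)
```

Computing the null fraction from the data, rather than hard-coding the four column names, keeps the rule reusable if the dataset changes.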

## Experimentation

#### Tree termination conditions

Three conditions determine when tree growth terminates:

1. The predefined maximum depth has been reached.
2. The number of remaining samples is below a prespecified threshold.
3. After splitting on the best value found, one of the two partitions is empty.
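
As a rough sketch, the three conditions can be written as two small predicates. The function names are hypothetical; the default limits of 20 match the `maximum_depth` and `threshold_samples` defaults used by `build_tree` in the code cells.

```python
def should_stop(depth, n_samples, max_depth=20, min_samples=20):
    """Conditions 1 and 2: checked before searching for a split."""
    return depth == max_depth or n_samples < min_samples

def split_is_degenerate(n_left, n_right):
    """Condition 3: checked after the best split has been applied."""
    return n_left == 0 or n_right == 0

print(should_stop(depth=20, n_samples=500))   # depth limit reached
print(should_stop(depth=3, n_samples=12))     # too few samples
print(split_is_degenerate(n_left=0, n_right=40))
```

When any of these holds, the node becomes a leaf and predicts the mean `SalePrice` of the samples it received.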

#### Categorical to numerical conversion

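Several attributes have values with a hierarchical (ordinal) relation, for example the quality ratings `Ex` > `Gd` > `TA` > `Fa` > `Po`, so they were converted to numerical type (the full mapping is the `convert_to_num` dictionary in the code). However, the mean squared error of the predictions improved only by a small margin, roughly 4% with the MSE split metric (see the table below).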
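
A small example of why the conversion can matter, assuming a toy `KitchenQual` column (the values are made up; the mapping matches the one used for that column in the code). Once the levels are numbers, a single threshold can separate a whole range of levels, whereas an equality split on strings can only isolate one level at a time.

```python
import pandas as pd

# Ordinal mapping for a quality column
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

kitchen = pd.Series(["TA", "Gd", "Ex", "Fa", "TA"], name="KitchenQual")
encoded = kitchen.map(quality_order)

# A numeric threshold at 3.5 groups {Po, Fa, TA} against {Gd, Ex}
low_group = kitchen[encoded <= 3.5].tolist()
high_group = kitchen[encoded > 3.5].tolist()
print(encoded.tolist())
print(low_group, high_group)
```

Note that `Series.map` yields `NaN` for any value missing from the dictionary, so the mapping has to cover every level that occurs in the data.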

### Metrics obtained

| Configuration | Split criterion | Mean absolute error | Mean squared error | R2 score |
|---------------|-----------------|---------------------|--------------------|----------|
| Ordinal categoricals converted to numerical | MSE | 23999.73 | 1265841364.03 | 0.76 |
| Ordinal categoricals left intact | MSE | 24590.42 | 1324704484.92 | 0.75 |
| Ordinal categoricals left intact | MAE | 24373.75 | 1367824704.77 | 0.74 |
| Ordinal categoricals converted to numerical | MAE | 24954.17 | 1412145253.86 | 0.73 |
| Ordinal categoricals converted to numerical, kept used column | MSE | 24704.59 | 1319736600.78 | 0.75 |
| Ordinal categoricals converted to numerical, kept used column | MAE | 23747.55 | 1176833682.77 | 0.77 |
| Always predicting the mean | - | 54656.09 | 5322462690.05 | -0.0071 |
| Always predicting the median | - | 51893.63 | 5447828300.60 | -0.030 |
| scikit-learn decision tree regressor | MSE | 27144.21 | 1396748683.13 | 0.73 |
| scikit-learn decision tree regressor | MAE | 27959.46 | 1785757014.06 | 0.66 |
| scikit-learn decision tree regressor | Friedman-MSE | 26593.20 | 1356746873.59 | 0.74 |
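
The two constant baselines in the table have slightly negative R2 scores. The sketch below, on made-up prices, shows how that can happen: R2 is exactly 0 only when the constant equals the mean of the labels being scored, so a constant taken from other data (presumably the training prices here) lands a little below zero. The metric functions are plain NumPy re-implementations written for this example, not the scikit-learn calls used in the notebook.

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Toy prices, invented for illustration
train_prices = np.array([100000.0, 150000.0, 200000.0, 250000.0])
test_prices = np.array([120000.0, 180000.0, 210000.0])

# Baseline: always predict the training mean
train_mean_pred = np.full_like(test_prices, train_prices.mean())
# Reference: always predict the test mean (gives R2 of 0 by definition)
test_mean_pred = np.full_like(test_prices, test_prices.mean())

print(mae(test_prices, train_mean_pred), mse(test_prices, train_mean_pred))
print(r2(test_prices, train_mean_pred), r2(test_prices, test_mean_pred))
```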

In [3]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
import numpy as np
import math
import pandas as pd

In [2]:
import numpy as np
import math
import pandas as pd

class node:
    """A tree node: internal nodes hold a split, leaves hold a prediction."""
    def __init__(self, l, r, attr, val):
        self.left = l
        self.right = r
        self.attribute = attr
        self.value = val
        self.answer = 0

# Variant 1: ordinal categorical columns converted to numerical, MSE split metric
class DecisionTree1:
    def __init__(self):
        self.root_node = None

    def build_tree(self, x_f, current_depth, maximum_depth=20, threshold_samples=20):
        # Candidate split columns: every feature except the target
        col_list = list(x_f.columns)
        col_list.remove('SalePrice')
        tree_node = node(None, None, None, None)
        # Stop at the depth limit or when too few samples remain; the leaf predicts the mean price
        if current_depth == maximum_depth or len(x_f) < threshold_samples:
            tree_node.answer = x_f['SalePrice'].mean()
            return tree_node
        best_attr = []
        ser = x_f.dtypes
        # print("series ",ser)
        for attr in col_list:
            # print("attribute ",attr, end=' ')
            if ser[attr] == object:
                # print("object")
                attr_values_set = np.unique(x_f[attr])
                attr_value_list = list(attr_values_set)
                for split_val in attr_value_list:
                    less_frame = x_f[x_f[attr] == split_val].copy()
                    left_error = 0
                    if len(less_frame):
                        less_array = less_frame['SalePrice'].to_numpy()
                        less_array = less_array.astype('float64')
                        less_array -= less_array.mean()
                        less_array = np.square(less_array)
                        left_error = less_array.sum()*(len(less_array)/len(x_f))
                    more_frame = x_f[x_f[attr] != split_val].copy()
                    right_error = 0
                    if len(more_frame):
                        more_array = more_frame['SalePrice'].to_numpy()
                        more_array = more_array.astype('float64')
                        more_array -= more_array.mean()
                        more_array = np.square(more_array)
                        right_error = more_array.sum()*(len(more_array)/len(x_f))
                    # Score: each side's sum of squared errors, weighted by its share of the samples
                    mean_sq_error = left_error + right_error
                    if len(best_attr):
                        if best_attr[2] > mean_sq_error:
                            best_attr = [attr, split_val, mean_sq_error]
                    else:
                        best_attr = [attr, split_val, mean_sq_error]
            else:
                # print("numerical")
                attr_values_set = np.unique(x_f[attr])
                attr_value_list = list(attr_values_set)
                attr_value_list.sort()
                split_list = [((attr_value_list[iv] + attr_value_list[iv+1])/2) for iv in range(len(attr_value_list)-1)]
                # print(split_list)
                for split_val in split_list:
                    less_frame = x_f[x_f[attr] <= split_val].copy()
                    left_error = 0
                    if len(less_frame):
                        less_array = less_frame['SalePrice'].to_numpy()
                        less_array = less_array.astype('float64')
                        less_array -= less_array.mean()
                        less_array = np.square(less_array)
                        left_error = less_array.sum()*(len(less_array)/len(x_f))
                    more_frame = x_f[x_f[attr] > split_val].copy()
                    right_error = 0
                    if len(more_frame):
                        more_array = more_frame['SalePrice'].to_numpy()
                        more_array = more_array.astype('float64')
                        more_array -= more_array.mean()
                        more_array = np.square(more_array)
                        right_error = more_array.sum()*(len(more_array)/len(x_f))
                    mean_sq_error = left_error + right_error
                    if len(best_attr):
                        if best_attr[2] > mean_sq_error:
                            best_attr = [attr, split_val, mean_sq_error]
                    else:
                        best_attr = [attr, split_val, mean_sq_error]    
    
        tree_node.value = best_attr[1]
        tree_node.attribute = best_attr[0]
        # print("splitting on "+tree_node.attribute+" at value "+str(tree_node.value))
        if isinstance(tree_node.value, str):
            left_x = x_f[x_f[tree_node.attribute] == tree_node.value].copy()
            right_x = x_f[x_f[tree_node.attribute] != tree_node.value].copy()
        else:
            left_x = x_f[x_f[tree_node.attribute] <= tree_node.value].copy()
            right_x = x_f[x_f[tree_node.attribute] > tree_node.value].copy()
        # Termination condition 3: an empty partition turns this node into a leaf
        if len(left_x) == 0:
            tree_node.answer = right_x['SalePrice'].mean()
            return tree_node
        if len(right_x) == 0:
            tree_node.answer = left_x['SalePrice'].mean()
            return tree_node
        left_x.drop(columns=[best_attr[0]], inplace=True)
        right_x.drop(columns=[best_attr[0]], inplace=True)
        tree_node.left = self.build_tree(left_x, current_depth+1, maximum_depth, threshold_samples)
        tree_node.right = self.build_tree(right_x, current_depth+1, maximum_depth, threshold_samples)
        return tree_node

    def preprocessing(self, df):
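        # numeric_only=True restricts the mean to numeric columns (required by recent pandas versions)
        df.fillna(df.mean(numeric_only=True), inplace=True)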
        df.fillna(value="others", inplace=True)
        df.drop(columns=['Alley', 'PoolQC', 'Fence', 'MiscFeature'], inplace=True)
        convert_to_num = {'LotShape':{'Reg':4, 'IR1':3, 'IR2':2, 'IR3':1}, 
                   'LandContour':{'Lvl':4, 'Bnk':3, 'HLS':2, 'Low':1},
                   'Utilities':{'AllPub':4, 'NoSewr':3, 'NoSeWa':2, 'ELO':1},
                   'LandSlope':{'Gtl':3, 'Mod':2, 'Sev':1},
                   'ExterQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
                   'ExterCond':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
                   'BsmtQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0},
                   'BsmtCond':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0},
                   'BsmtExposure':{'Gd':5, 'Av':4, 'Mn':3, 'No':2, 'others':1},
                   'BsmtFinType1':{'GLQ':7, 'ALQ':6, 'BLQ':5, 'Rec':4, 'LwQ':3, 'Unf':2, 'others':1},
                   'BsmtFinType2':{'GLQ':7, 'ALQ':6, 'BLQ':5, 'Rec':4, 'LwQ':3, 'Unf':2, 'others':1},
                   'HeatingQC':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
                   'KitchenQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1}, 
                   'Functional':{'Typ':8, 'Min1':7, 'Min2':6, 'Mod':5, 'Maj1':4, 'Maj2':3, 'Sev':2, 'Sal':1}, 
                   'FireplaceQu':{'Ex':6, 'Gd':5, 'TA':4, 'Masonry':3, 'Fa':2, 'Po':1, 'others':0}, 
                   'GarageFinish':{'Fin':4, 'RFn':3, 'Unf':2, 'others':1}, 
                   'GarageQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0}, 
                   'GarageCond':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0}, 
                   'PavedDrive':{'Y':3, 'P':2, 'N':1}}
        for attribute in df.columns:
            if attribute in convert_to_num.keys():
                df[attribute] = df[attribute].map(convert_to_num[attribute])
        return df

    def train(self, train_dataframe_path):
        train_df = pd.read_csv(train_dataframe_path, index_col="Id")
        train_df = self.preprocessing(train_df)
        self.root_node = self.build_tree(train_df, 1)

    def predict(self, test_dataframe_path):
        test_df = pd.read_csv(test_dataframe_path, index_col="Id")
        test_df = self.preprocessing(test_df)
        pred_list = []
        ser = test_df.dtypes
        for test_index in range(len(test_df)):
            current_node = self.root_node
            # Walk down until a leaf (a node without children) is reached
            while current_node.left is not None and current_node.right is not None:
                if ser[current_node.attribute] == object:
                    if test_df.iloc[test_index][current_node.attribute] == current_node.value:
                        current_node = current_node.left
                    else:
                        current_node = current_node.right
                else:
                    if test_df.iloc[test_index][current_node.attribute] <= current_node.value:
                        current_node = current_node.left
                    else:
                        current_node = current_node.right
            pred_list.append(current_node.answer)
        return pred_list

 
dtree_regressor = DecisionTree1()
dtree_regressor.train('./Datasets/q3/train.csv')
predictions = dtree_regressor.predict('./Datasets/q3/test.csv')
test_labels = []
with open("./Datasets/q3/test_labels.csv") as f:
    for line in f:
        # The second comma-separated field is the true sale price
        test_labels.append(float(line.split(',')[1]))
print(mean_squared_error(test_labels, predictions))
print(mean_absolute_error(test_labels, predictions))
print(r2_score(test_labels, predictions))

1265841364.038237
23999.733693732094
0.7604752954801323
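
The split search in `build_tree` can be hard to follow inside the nested loops, so here is a condensed sketch of the same idea for a single numeric column. It uses the same score as the code above (each side's sum of squared errors weighted by its share of the samples) and the same midpoint thresholds. `best_numeric_split` is a hypothetical helper written for this illustration, and the toy arrays are made up.

```python
import numpy as np

def best_numeric_split(values, target):
    """Return (threshold, score) with the lowest size-weighted squared error."""
    values = np.asarray(values, dtype=float)
    target = np.asarray(target, dtype=float)
    uniq = np.unique(values)                     # sorted unique feature values
    thresholds = (uniq[:-1] + uniq[1:]) / 2      # midpoints between neighbours
    best = (None, np.inf)
    for t in thresholds:
        score = 0.0
        for side in (target[values <= t], target[values > t]):
            sse = np.sum((side - side.mean()) ** 2)
            score += sse * len(side) / len(target)
        if score < best[1]:
            best = (float(t), float(score))
    return best

# Toy example: prices jump between feature values 3 and 10
feature = [1, 2, 3, 10, 11, 12]
prices = [10.0, 10.0, 10.0, 50.0, 50.0, 50.0]
threshold, score = best_numeric_split(feature, prices)
print(threshold, score)
```

Because thresholds are midpoints between distinct values, both sides of a numeric split always contain at least one sample; a column with a single unique value yields no candidate thresholds at all.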


In [3]:
import numpy as np
import math
import pandas as pd

# Variant 2: ordinal categorical columns left intact, MSE split metric

class DecisionTree2:
    def __init__(self):
        self.root_node = None

    def build_tree(self, x_f, current_depth, maximum_depth=20, threshold_samples=20):
        # print("buildtree")
        col_list = list(x_f.columns)
        col_list.remove('SalePrice')
#         print("depth = "+str(current_depth))
        # print(" got "+str(len(x_f))+" rows", end = ' ')
        # print(" got "+str(len(col_list))+" columns")
        tree_node = node(None, None, None, None)
        if current_depth == maximum_depth or len(x_f) < threshold_samples:
            tree_node.answer = x_f['SalePrice'].mean()
            return tree_node
        best_attr = []
        ser = x_f.dtypes
        # print("series ",ser)
        for attr in col_list:
            # print("attribute ",attr, end=' ')
            if ser[attr] == object:
                # print("object")
                attr_values_set = np.unique(x_f[attr])
                attr_value_list = list(attr_values_set)
                for split_val in attr_value_list:
                    less_frame = x_f[x_f[attr] == split_val].copy()
                    left_error = 0
                    if len(less_frame):
                        less_array = less_frame['SalePrice'].to_numpy()
                        less_array = less_array.astype('float64')
                        less_array -= less_array.mean()
                        less_array = np.square(less_array)
                        left_error = less_array.sum()*(len(less_array)/len(x_f))
                    more_frame = x_f[x_f[attr] != split_val].copy()
                    right_error = 0
                    if len(more_frame):
                        more_array = more_frame['SalePrice'].to_numpy()
                        more_array = more_array.astype('float64')
                        more_array -= more_array.mean()
                        more_array = np.square(more_array)
                        right_error = more_array.sum()*(len(more_array)/len(x_f))
                    mean_sq_error = left_error + right_error
                    if len(best_attr):
                        if best_attr[2] > mean_sq_error:
                            best_attr = [attr, split_val, mean_sq_error]
                    else:
                        best_attr = [attr, split_val, mean_sq_error]                 
            else:
                # print("numerical")
                attr_values_set = np.unique(x_f[attr])
                attr_value_list = list(attr_values_set)
                attr_value_list.sort()
                split_list = [((attr_value_list[iv] + attr_value_list[iv+1])/2) for iv in range(len(attr_value_list)-1)]
                # print(split_list)
                for split_val in split_list:
                    less_frame = x_f[x_f[attr] <= split_val].copy()
                    left_error = 0
                    if len(less_frame):
                        less_array = less_frame['SalePrice'].to_numpy()
                        less_array = less_array.astype('float64')
                        less_array -= less_array.mean()
                        less_array = np.square(less_array)
                        left_error = less_array.sum()*(len(less_array)/len(x_f))
                    more_frame = x_f[x_f[attr] > split_val].copy()
                    right_error = 0
                    if len(more_frame):
                        more_array = more_frame['SalePrice'].to_numpy()
                        more_array = more_array.astype('float64')
                        more_array -= more_array.mean()
                        more_array = np.square(more_array)
                        right_error = more_array.sum()*(len(more_array)/len(x_f))
                    mean_sq_error = left_error + right_error
                    if len(best_attr):
                        if best_attr[2] > mean_sq_error:
                            best_attr = [attr, split_val, mean_sq_error]
                    else:
                        best_attr = [attr, split_val, mean_sq_error]    
    
        tree_node.value = best_attr[1]
        tree_node.attribute = best_attr[0]
        # print("splitting on "+tree_node.attribute+" at value "+str(tree_node.value))
        if isinstance(tree_node.value, str):
            left_x = x_f[x_f[tree_node.attribute] == tree_node.value].copy()
            right_x = x_f[x_f[tree_node.attribute] != tree_node.value].copy()
        else:
            left_x = x_f[x_f[tree_node.attribute] <= tree_node.value].copy()
            right_x = x_f[x_f[tree_node.attribute] > tree_node.value].copy()
        if(len(left_x)==0):
            tree_node.answer = right_x['SalePrice'].mean()
            return tree_node
        if(len(right_x)==0):
            tree_node.answer = left_x['SalePrice'].mean()
            return tree_node
        left_x.drop(columns=[best_attr[0]], inplace=True)
        right_x.drop(columns=[best_attr[0]], inplace=True)
        tree_node.left = self.build_tree(left_x, current_depth+1, maximum_depth, threshold_samples)
        tree_node.right = self.build_tree(right_x, current_depth+1, maximum_depth, threshold_samples)
        return tree_node

    def preprocessing(self, df):
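        # numeric_only=True restricts the mean to numeric columns (required by recent pandas versions)
        df.fillna(df.mean(numeric_only=True), inplace=True)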
        df.fillna(value="others", inplace=True)
        df.drop(columns=['Alley', 'PoolQC', 'Fence', 'MiscFeature'], inplace=True)
        # The ordinal mapping (convert_to_num in DecisionTree1.preprocessing) is deliberately
        # not applied in this variant: categorical columns stay as strings.
        return df

    def train(self, train_dataframe_path):
        train_df = pd.read_csv(train_dataframe_path, index_col="Id")
        train_df = self.preprocessing(train_df)
        self.root_node = self.build_tree(train_df, 1)

    def predict(self, test_dataframe_path):
        test_df = pd.read_csv(test_dataframe_path, index_col="Id")
        test_df = self.preprocessing(test_df)
        pred_list = []
        ser = test_df.dtypes
        for test_index in range(len(test_df)):
            current_node = self.root_node
            # Walk down until a leaf (a node without children) is reached
            while current_node.left is not None and current_node.right is not None:
                if ser[current_node.attribute] == object:
                    if test_df.iloc[test_index][current_node.attribute] == current_node.value:
                        current_node = current_node.left
                    else:
                        current_node = current_node.right
                else:
                    if test_df.iloc[test_index][current_node.attribute] <= current_node.value:
                        current_node = current_node.left
                    else:
                        current_node = current_node.right
            pred_list.append(current_node.answer)
        return pred_list


dtree_regressor = DecisionTree2()
dtree_regressor.train('./Datasets/q3/train.csv')
predictions = dtree_regressor.predict('./Datasets/q3/test.csv')
test_labels = []
with open("./Datasets/q3/test_labels.csv") as f:
    for line in f:
        # The second comma-separated field is the true sale price
        test_labels.append(float(line.split(',')[1]))
print(mean_squared_error(test_labels, predictions))
print(mean_absolute_error(test_labels, predictions))
print(r2_score(test_labels, predictions))

1324704484.9292908
24590.42097664111
0.749337113367353
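
The MAE-based trees in this notebook measure absolute deviation around the node mean (`less_array.mean()`), and their leaves predict the mean. The constant that minimises absolute error is the median, which is also what scikit-learn's absolute-error criterion uses. The toy comparison below, with invented prices and a hypothetical helper `abs_dev_about`, shows the gap between the two centres when an outlier is present. It is only an illustration and does not reproduce any number from the table.

```python
import numpy as np

def abs_dev_about(values, centre):
    """Sum of absolute deviations of `values` from a chosen centre."""
    return float(np.sum(np.abs(values - centre)))

# Toy prices with one outlier (made up for illustration)
prices = np.array([100.0, 110.0, 120.0, 400.0])

around_mean = abs_dev_about(prices, prices.mean())        # centre = 182.5
around_median = abs_dev_about(prices, np.median(prices))  # centre = 115.0
print(around_mean, around_median)
```

Switching the centre to the median would change which splits are selected, so the reported MAE-criterion results apply only to the mean-centred version implemented here.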


In [4]:
import numpy as np
import math
import pandas as pd

# Variant 3: ordinal categorical columns left intact, MAE split metric

class DecisionTree3:
    def __init__(self):
        self.root_node = None

    def build_tree(self, x_f, current_depth, maximum_depth=20, threshold_samples=20):
        # print("buildtree")
        col_list = list(x_f.columns)
        col_list.remove('SalePrice')
#         print("depth = "+str(current_depth))
        # print(" got "+str(len(x_f))+" rows", end = ' ')
        # print(" got "+str(len(col_list))+" columns")
        tree_node = node(None, None, None, None)
        if current_depth == maximum_depth or len(x_f) < threshold_samples:
            tree_node.answer = x_f['SalePrice'].mean()
            return tree_node
        best_attr = []
        ser = x_f.dtypes
        # print("series ",ser)
        for attr in col_list:
            # print("attribute ",attr, end=' ')
            if ser[attr] == object:
                # print("object")
                attr_values_set = np.unique(x_f[attr])
                attr_value_list = list(attr_values_set)
                for split_val in attr_value_list:
                    less_frame = x_f[x_f[attr] == split_val].copy()
                    left_error = 0
                    if len(less_frame):
                        less_array = less_frame['SalePrice'].to_numpy()
                        less_array = less_array.astype('float64')
                        less_array -= less_array.mean()
                        less_array = np.absolute(less_array)
                        left_error = less_array.sum()*(len(less_array)/len(x_f))
                    more_frame = x_f[x_f[attr] != split_val].copy()
                    right_error = 0
                    if len(more_frame):
                        more_array = more_frame['SalePrice'].to_numpy()
                        more_array = more_array.astype('float64')
                        more_array -= more_array.mean()
                        more_array = np.absolute(more_array)
                        right_error = more_array.sum()*(len(more_array)/len(x_f))
                    # Score: each side's sum of absolute deviations, weighted by its share of the samples
                    split_error = left_error + right_error
                    if len(best_attr):
                        if best_attr[2] > split_error:
                            best_attr = [attr, split_val, split_error]
                    else:
                        best_attr = [attr, split_val, split_error]
            else:
                # print("numerical")
                attr_values_set = np.unique(x_f[attr])
                attr_value_list = list(attr_values_set)
                attr_value_list.sort()
                split_list = [((attr_value_list[iv] + attr_value_list[iv+1])/2) for iv in range(len(attr_value_list)-1)]
                # print(split_list)
                for split_val in split_list:
                    less_frame = x_f[x_f[attr] <= split_val].copy()
                    left_error = 0
                    if len(less_frame):
                        less_array = less_frame['SalePrice'].to_numpy()
                        less_array = less_array.astype('float64')
                        less_array -= less_array.mean()
                        less_array = np.absolute(less_array)
                        left_error = less_array.sum()*(len(less_array)/len(x_f))
                    more_frame = x_f[x_f[attr] > split_val].copy()
                    right_error = 0
                    if len(more_frame):
                        more_array = more_frame['SalePrice'].to_numpy()
                        more_array = more_array.astype('float64')
                        more_array -= more_array.mean()
                        more_array = np.absolute(more_array)
                        right_error = more_array.sum()*(len(more_array)/len(x_f))
                    # Score: each side's sum of absolute deviations, weighted by its share of the samples
                    split_error = left_error + right_error
                    if len(best_attr):
                        if best_attr[2] > split_error:
                            best_attr = [attr, split_val, split_error]
                    else:
                        best_attr = [attr, split_val, split_error]
    
        tree_node.value = best_attr[1]
        tree_node.attribute = best_attr[0]
        # print("splitting on "+tree_node.attribute+" at value "+str(tree_node.value))
        if isinstance(tree_node.value, str):
            left_x = x_f[x_f[tree_node.attribute] == tree_node.value].copy()
            right_x = x_f[x_f[tree_node.attribute] != tree_node.value].copy()
        else:
            left_x = x_f[x_f[tree_node.attribute] <= tree_node.value].copy()
            right_x = x_f[x_f[tree_node.attribute] > tree_node.value].copy()
        if(len(left_x)==0):
            tree_node.answer = right_x['SalePrice'].mean()
            return tree_node
        if(len(right_x)==0):
            tree_node.answer = left_x['SalePrice'].mean()
            return tree_node
        left_x.drop(columns=[best_attr[0]], inplace=True)
        right_x.drop(columns=[best_attr[0]], inplace=True)
        tree_node.left = self.build_tree(left_x, current_depth+1, maximum_depth, threshold_samples)
        tree_node.right = self.build_tree(right_x, current_depth+1, maximum_depth, threshold_samples)
        return tree_node

    def preprocessing(self, df):
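        # numeric_only=True restricts the mean to numeric columns (required by recent pandas versions)
        df.fillna(df.mean(numeric_only=True), inplace=True)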
        df.fillna(value="others", inplace=True)
        df.drop(columns=['Alley', 'PoolQC', 'Fence', 'MiscFeature'], inplace=True)
        # The ordinal mapping (convert_to_num in DecisionTree1.preprocessing) is deliberately
        # not applied in this variant: categorical columns stay as strings.
        return df

    def train(self, train_dataframe_path):
        train_df = pd.read_csv(train_dataframe_path, index_col="Id")
        train_df = self.preprocessing(train_df)
        self.root_node = self.build_tree(train_df, 1)

    def predict(self, test_dataframe_path):
        test_df = pd.read_csv(test_dataframe_path, index_col="Id")
        test_df = self.preprocessing(test_df)
        pred_list = []
        ser = test_df.dtypes
        for test_index in range(len(test_df)):
            current_node = self.root_node
            # Walk down until a leaf (a node without children) is reached
            while current_node.left is not None and current_node.right is not None:
                if ser[current_node.attribute] == object:
                    if test_df.iloc[test_index][current_node.attribute] == current_node.value:
                        current_node = current_node.left
                    else:
                        current_node = current_node.right
                else:
                    if test_df.iloc[test_index][current_node.attribute] <= current_node.value:
                        current_node = current_node.left
                    else:
                        current_node = current_node.right
            pred_list.append(current_node.answer)
        return pred_list


dtree_regressor = DecisionTree3()
dtree_regressor.train('./Datasets/q3/train.csv')
predictions = dtree_regressor.predict('./Datasets/q3/test.csv')
test_labels = []
with open("./Datasets/q3/test_labels.csv") as f:
    for line in f:
        # The second comma-separated field is the true sale price
        test_labels.append(float(line.split(',')[1]))
print(mean_squared_error(test_labels, predictions))
print(mean_absolute_error(test_labels, predictions))
print(r2_score(test_labels, predictions))

1367824704.772412
24373.755359157414
0.7411778303717284


In [5]:
import numpy as np
import math
import pandas as pd

# Variant 4: ordinal categorical columns converted to numerical, MAE split metric

class DecisionTree4:
    def __init__(self):
        self.root_node = None

    def build_tree(self, x_f, current_depth, maximum_depth=20, threshold_samples=20):
        # print("buildtree")
        col_list = list(x_f.columns)
        col_list.remove('SalePrice')
#         print("depth = "+str(current_depth))
        # print(" got "+str(len(x_f))+" rows", end = ' ')
        # print(" got "+str(len(col_list))+" columns")
        tree_node = node(None, None, None, None)
        if current_depth == maximum_depth or len(x_f) < threshold_samples:
            tree_node.answer = x_f['SalePrice'].mean()
            return tree_node
        best_attr = []
        ser = x_f.dtypes
        # print("series ",ser)
        for attr in col_list:
            # print("attribute ",attr, end=' ')
            if ser[attr] == object:
                # print("object")
                attr_values_set = np.unique(x_f[attr])
                attr_value_list = list(attr_values_set)
                for split_val in attr_value_list:
                    less_frame = x_f[x_f[attr] == split_val].copy()
                    left_error = 0
                    if len(less_frame):
                        less_array = less_frame['SalePrice'].to_numpy()
                        less_array = less_array.astype('float64')
                        less_array -= less_array.mean()
                        less_array = np.absolute(less_array)
                        left_error = less_array.sum()*(len(less_array)/len(x_f))
                    more_frame = x_f[x_f[attr] != split_val].copy()
                    right_error = 0
                    if len(more_frame):
                        more_array = more_frame['SalePrice'].to_numpy()
                        more_array = more_array.astype('float64')
                        more_array -= more_array.mean()
                        more_array = np.absolute(more_array)
                        right_error = more_array.sum()*(len(more_array)/len(x_f))
                    # Score: each side's sum of absolute deviations, weighted by its share of the samples
                    split_error = left_error + right_error
                    if len(best_attr):
                        if best_attr[2] > split_error:
                            best_attr = [attr, split_val, split_error]
                    else:
                        best_attr = [attr, split_val, split_error]
            else:
                # print("numerical")
                attr_values_set = np.unique(x_f[attr])
                attr_value_list = list(attr_values_set)
                attr_value_list.sort()
                split_list = [((attr_value_list[iv] + attr_value_list[iv+1])/2) for iv in range(len(attr_value_list)-1)]
                # print(split_list)
                for split_val in split_list:
                    less_frame = x_f[x_f[attr] <= split_val].copy()
                    left_error = 0
                    if len(less_frame):
                        less_array = less_frame['SalePrice'].to_numpy()
                        less_array = less_array.astype('float64')
                        less_array -= less_array.mean()
                        less_array = np.absolute(less_array)
                        left_error = less_array.sum()*(len(less_array)/len(x_f))
                    more_frame = x_f[x_f[attr] > split_val].copy()
                    right_error = 0
                    if len(more_frame):
                        more_array = more_frame['SalePrice'].to_numpy()
                        more_array = more_array.astype('float64')
                        more_array -= more_array.mean()
                        more_array = np.absolute(more_array)
                        right_error = more_array.sum()*(len(more_array)/len(x_f))
                    # Weighted absolute error of this split (MAE criterion).
                    split_error = left_error + right_error
                    if len(best_attr):
                        if best_attr[2] > split_error:
                            best_attr = [attr, split_val, split_error]
                    else:
                        best_attr = [attr, split_val, split_error]
    
        tree_node.value = best_attr[1]
        tree_node.attribute = best_attr[0]
        # print("splitting on "+tree_node.attribute+" at value "+str(tree_node.value))
        if isinstance(tree_node.value, str):
            left_x = x_f[x_f[tree_node.attribute] == tree_node.value].copy()
            right_x = x_f[x_f[tree_node.attribute] != tree_node.value].copy()
        else:
            left_x = x_f[x_f[tree_node.attribute] <= tree_node.value].copy()
            right_x = x_f[x_f[tree_node.attribute] > tree_node.value].copy()
        if(len(left_x)==0):
            tree_node.answer = right_x['SalePrice'].mean()
            return tree_node
        if(len(right_x)==0):
            tree_node.answer = left_x['SalePrice'].mean()
            return tree_node
        left_x.drop(columns=[best_attr[0]], inplace=True)
        right_x.drop(columns=[best_attr[0]], inplace=True)
        tree_node.left = self.build_tree(left_x, current_depth+1, maximum_depth, threshold_samples)
        tree_node.right = self.build_tree(right_x, current_depth+1, maximum_depth, threshold_samples)
        return tree_node

    def preprocessing(self, df):
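        # numeric_only=True keeps the mean restricted to numeric columns (required by pandas >= 2.0).
        df.fillna(df.mean(numeric_only=True), inplace=True)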
        df.fillna(value="others", inplace=True)
        df.drop(columns=['Alley', 'PoolQC', 'Fence', 'MiscFeature'], inplace=True)
        convert_to_num = {'LotShape':{'Reg':4, 'IR1':3, 'IR2':2, 'IR3':1}, 
                   'LandContour':{'Lvl':4, 'Bnk':3, 'HLS':2, 'Low':1},
                   'Utilities':{'AllPub':4, 'NoSewr':3, 'NoSeWa':2, 'ELO':1},
                   'LandSlope':{'Gtl':3, 'Mod':2, 'Sev':1},
                   'ExterQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
                   'ExterCond':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
                   'BsmtQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0},
                   'BsmtCond':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0},
                   'BsmtExposure':{'Gd':5, 'Av':4, 'Mn':3, 'No':2, 'others':1},
                   'BsmtFinType1':{'GLQ':7, 'ALQ':6, 'BLQ':5, 'Rec':4, 'LwQ':3, 'Unf':2, 'others':1},
                   'BsmtFinType2':{'GLQ':7, 'ALQ':6, 'BLQ':5, 'Rec':4, 'LwQ':3, 'Unf':2, 'others':1},
                   'HeatingQC':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
                   'KitchenQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1}, 
                   'Functional':{'Typ':8, 'Min1':7, 'Min2':6, 'Mod':5, 'Maj1':4, 'Maj2':3, 'Sev':2, 'Sal':1}, 
                   'FireplaceQu':{'Ex':6, 'Gd':5, 'TA':4, 'Masonry':3, 'Fa':2, 'Po':1, 'others':0}, 
                   'GarageFinish':{'Fin':4, 'RFn':3, 'Unf':2, 'others':1}, 
                   'GarageQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0}, 
                   'GarageCond':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0}, 
                   'PavedDrive':{'Y':3, 'P':2, 'N':1}}
        for attribute in df.columns:
            if attribute in convert_to_num.keys():
                df[attribute] = df[attribute].map(convert_to_num[attribute])
        return df

    def train(self, train_dataframe_path):
        train_df = pd.read_csv(train_dataframe_path, index_col="Id")
        train_df = self.preprocessing(train_df)
        self.root_node = self.build_tree(train_df, 1)

    def predict(self, test_dataframe_path):
        test_df = pd.read_csv(test_dataframe_path, index_col="Id")
        test_df = self.preprocessing(test_df)
        pred_list = []
        ser = test_df.dtypes
        for test_index in range(len(test_df)):
            current_node = self.root_node
            while current_node.left is not None and current_node.right is not None:
                if ser[current_node.attribute] == object:
                    if test_df.iloc[test_index][current_node.attribute] == current_node.value:
                        current_node = current_node.left
                    else:
                        current_node = current_node.right
                else:
                    if test_df.iloc[test_index][current_node.attribute] <= current_node.value:
                        current_node = current_node.left
                    else:
                        current_node = current_node.right
            pred_list.append(current_node.answer)
        return pred_list


dtree_regressor = DecisionTree4()
dtree_regressor.train('./Datasets/q3/train.csv')
predictions = dtree_regressor.predict('./Datasets/q3/test.csv')
test_labels = list()
with open("./Datasets/q3/test_labels.csv") as f:
  for line in f:
    test_labels.append(float(line.split(',')[1]))
print (mean_squared_error(test_labels, predictions))
print (mean_absolute_error(test_labels, predictions))
print (r2_score(test_labels, predictions))

1412145253.8675234
24954.179043743232
0.7327914189873678


In [6]:
import numpy as np
import math
import pandas as pd

# converted hierarchical categorical attributes to numerical, with MSE metric, not dropping the used column

class DecisionTree5:
    def __init__(self):
        self.root_node = None

    def build_tree(self, x_f, current_depth, maximum_depth=20, threshold_samples=20):
        # print("buildtree")
        col_list = list(x_f.columns)
        col_list.remove('SalePrice')
#         print("depth = "+str(current_depth))
        # print(" got "+str(len(x_f))+" rows", end = ' ')
        # print(" got "+str(len(col_list))+" columns")
        tree_node = node(None, None, None, None)
        if current_depth == maximum_depth or len(x_f) < threshold_samples:
            tree_node.answer = x_f['SalePrice'].mean()
            return tree_node
        best_attr = []
        ser = x_f.dtypes
        # print("series ",ser)
        for attr in col_list:
            # print("attribute ",attr, end=' ')
            if ser[attr] == object:
                # print("object")
                attr_values_set = np.unique(x_f[attr])
                attr_value_list = list(attr_values_set)
                for split_val in attr_value_list:
                    less_frame = x_f[x_f[attr] == split_val].copy()
                    left_error = 0
                    if len(less_frame):
                        less_array = less_frame['SalePrice'].to_numpy()
                        less_array = less_array.astype('float64')
                        less_array -= less_array.mean()
                        less_array = np.square(less_array)
                        left_error = less_array.sum()*(len(less_array)/len(x_f))
                    more_frame = x_f[x_f[attr] != split_val].copy()
                    right_error = 0
                    if len(more_frame):
                        more_array = more_frame['SalePrice'].to_numpy()
                        more_array = more_array.astype('float64')
                        more_array -= more_array.mean()
                        more_array = np.square(more_array)
                        right_error = more_array.sum()*(len(more_array)/len(x_f))
                    mean_sq_error = left_error + right_error
                    if len(best_attr):
                        if best_attr[2] > mean_sq_error:
                            best_attr = [attr, split_val, mean_sq_error]
                    else:
                        best_attr = [attr, split_val, mean_sq_error]                 
            else:
                # print("numerical")
                attr_values_set = np.unique(x_f[attr])
                attr_value_list = list(attr_values_set)
                attr_value_list.sort()
                split_list = [((attr_value_list[iv] + attr_value_list[iv+1])/2) for iv in range(len(attr_value_list)-1)]
                # print(split_list)
                for split_val in split_list:
                    less_frame = x_f[x_f[attr] <= split_val].copy()
                    left_error = 0
                    if len(less_frame):
                        less_array = less_frame['SalePrice'].to_numpy()
                        less_array = less_array.astype('float64')
                        less_array -= less_array.mean()
                        less_array = np.square(less_array)
                        left_error = less_array.sum()*(len(less_array)/len(x_f))
                    more_frame = x_f[x_f[attr] > split_val].copy()
                    right_error = 0
                    if len(more_frame):
                        more_array = more_frame['SalePrice'].to_numpy()
                        more_array = more_array.astype('float64')
                        more_array -= more_array.mean()
                        more_array = np.square(more_array)
                        right_error = more_array.sum()*(len(more_array)/len(x_f))
                    mean_sq_error = left_error + right_error
                    if len(best_attr):
                        if best_attr[2] > mean_sq_error:
                            best_attr = [attr, split_val, mean_sq_error]
                    else:
                        best_attr = [attr, split_val, mean_sq_error]    
    
        tree_node.value = best_attr[1]
        tree_node.attribute = best_attr[0]
        # print("splitting on "+tree_node.attribute+" at value "+str(tree_node.value))
        if isinstance(tree_node.value, str):
            left_x = x_f[x_f[tree_node.attribute] == tree_node.value].copy()
            right_x = x_f[x_f[tree_node.attribute] != tree_node.value].copy()
        else:
            left_x = x_f[x_f[tree_node.attribute] <= tree_node.value].copy()
            right_x = x_f[x_f[tree_node.attribute] > tree_node.value].copy()
        if(len(left_x)==0):
            tree_node.answer = right_x['SalePrice'].mean()
            return tree_node
        if(len(right_x)==0):
            tree_node.answer = left_x['SalePrice'].mean()
            return tree_node
#         left_x.drop(columns=[best_attr[0]], inplace=True)
#         right_x.drop(columns=[best_attr[0]], inplace=True)
        tree_node.left = self.build_tree(left_x, current_depth+1, maximum_depth, threshold_samples)
        tree_node.right = self.build_tree(right_x, current_depth+1, maximum_depth, threshold_samples)
        return tree_node

    def preprocessing(self, df):
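        # numeric_only=True keeps the mean restricted to numeric columns (required by pandas >= 2.0).
        df.fillna(df.mean(numeric_only=True), inplace=True)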
        df.fillna(value="others", inplace=True)
        df.drop(columns=['Alley', 'PoolQC', 'Fence', 'MiscFeature'], inplace=True)
        convert_to_num = {'LotShape':{'Reg':4, 'IR1':3, 'IR2':2, 'IR3':1}, 
                   'LandContour':{'Lvl':4, 'Bnk':3, 'HLS':2, 'Low':1},
                   'Utilities':{'AllPub':4, 'NoSewr':3, 'NoSeWa':2, 'ELO':1},
                   'LandSlope':{'Gtl':3, 'Mod':2, 'Sev':1},
                   'ExterQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
                   'ExterCond':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
                   'BsmtQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0},
                   'BsmtCond':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0},
                   'BsmtExposure':{'Gd':5, 'Av':4, 'Mn':3, 'No':2, 'others':1},
                   'BsmtFinType1':{'GLQ':7, 'ALQ':6, 'BLQ':5, 'Rec':4, 'LwQ':3, 'Unf':2, 'others':1},
                   'BsmtFinType2':{'GLQ':7, 'ALQ':6, 'BLQ':5, 'Rec':4, 'LwQ':3, 'Unf':2, 'others':1},
                   'HeatingQC':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
                   'KitchenQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1}, 
                   'Functional':{'Typ':8, 'Min1':7, 'Min2':6, 'Mod':5, 'Maj1':4, 'Maj2':3, 'Sev':2, 'Sal':1}, 
                   'FireplaceQu':{'Ex':6, 'Gd':5, 'TA':4, 'Masonry':3, 'Fa':2, 'Po':1, 'others':0}, 
                   'GarageFinish':{'Fin':4, 'RFn':3, 'Unf':2, 'others':1}, 
                   'GarageQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0}, 
                   'GarageCond':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0}, 
                   'PavedDrive':{'Y':3, 'P':2, 'N':1}}
        for attribute in df.columns:
            if attribute in convert_to_num.keys():
                df[attribute] = df[attribute].map(convert_to_num[attribute])
        return df

    def train(self, train_dataframe_path):
        train_df = pd.read_csv(train_dataframe_path, index_col="Id")
        train_df = self.preprocessing(train_df)
        self.root_node = self.build_tree(train_df, 1)

    def predict(self, test_dataframe_path):
        test_df = pd.read_csv(test_dataframe_path, index_col="Id")
        test_df = self.preprocessing(test_df)
        pred_list = []
        ser = test_df.dtypes
        for test_index in range(len(test_df)):
            current_node = self.root_node
            while current_node.left is not None and current_node.right is not None:
                if ser[current_node.attribute] == object:
                    if test_df.iloc[test_index][current_node.attribute] == current_node.value:
                        current_node = current_node.left
                    else:
                        current_node = current_node.right
                else:
                    if test_df.iloc[test_index][current_node.attribute] <= current_node.value:
                        current_node = current_node.left
                    else:
                        current_node = current_node.right
            pred_list.append(current_node.answer)
        return pred_list


dtree_regressor = DecisionTree5()
dtree_regressor.train('./Datasets/q3/train.csv')
predictions = dtree_regressor.predict('./Datasets/q3/test.csv')
test_labels = list()
with open("./Datasets/q3/test_labels.csv") as f:
  for line in f:
    test_labels.append(float(line.split(',')[1]))
print (mean_squared_error(test_labels, predictions))
print (mean_absolute_error(test_labels, predictions))
print (r2_score(test_labels, predictions))

1319736600.7859044
24704.595260616108
0.7502771450453647


In [7]:
import numpy as np
import math
import pandas as pd

# converted hierarchical categorical attributes to numerical, with MAE metric, not dropping the used column

class DecisionTree6:
    def __init__(self):
        self.root_node = None

    def build_tree(self, x_f, current_depth, maximum_depth=20, threshold_samples=20):
        # print("buildtree")
        col_list = list(x_f.columns)
        col_list.remove('SalePrice')
#         print("depth = "+str(current_depth))
        # print(" got "+str(len(x_f))+" rows", end = ' ')
        # print(" got "+str(len(col_list))+" columns")
        tree_node = node(None, None, None, None)
        if current_depth == maximum_depth or len(x_f) < threshold_samples:
            tree_node.answer = x_f['SalePrice'].mean()
            return tree_node
        best_attr = []
        ser = x_f.dtypes
        # print("series ",ser)
        for attr in col_list:
            # print("attribute ",attr, end=' ')
            if ser[attr] == object:
                # print("object")
                attr_values_set = np.unique(x_f[attr])
                attr_value_list = list(attr_values_set)
                for split_val in attr_value_list:
                    less_frame = x_f[x_f[attr] == split_val].copy()
                    left_error = 0
                    if len(less_frame):
                        less_array = less_frame['SalePrice'].to_numpy()
                        less_array = less_array.astype('float64')
                        less_array -= less_array.mean()
                        less_array = np.absolute(less_array)
                        left_error = less_array.sum()*(len(less_array)/len(x_f))
                    more_frame = x_f[x_f[attr] != split_val].copy()
                    right_error = 0
                    if len(more_frame):
                        more_array = more_frame['SalePrice'].to_numpy()
                        more_array = more_array.astype('float64')
                        more_array -= more_array.mean()
                        more_array = np.absolute(more_array)
                        right_error = more_array.sum()*(len(more_array)/len(x_f))
                    # Weighted absolute error of this split (MAE criterion).
                    split_error = left_error + right_error
                    if len(best_attr):
                        if best_attr[2] > split_error:
                            best_attr = [attr, split_val, split_error]
                    else:
                        best_attr = [attr, split_val, split_error]
            else:
                # print("numerical")
                attr_values_set = np.unique(x_f[attr])
                attr_value_list = list(attr_values_set)
                attr_value_list.sort()
                split_list = [((attr_value_list[iv] + attr_value_list[iv+1])/2) for iv in range(len(attr_value_list)-1)]
                # print(split_list)
                for split_val in split_list:
                    less_frame = x_f[x_f[attr] <= split_val].copy()
                    left_error = 0
                    if len(less_frame):
                        less_array = less_frame['SalePrice'].to_numpy()
                        less_array = less_array.astype('float64')
                        less_array -= less_array.mean()
                        less_array = np.absolute(less_array)
                        left_error = less_array.sum()*(len(less_array)/len(x_f))
                    more_frame = x_f[x_f[attr] > split_val].copy()
                    right_error = 0
                    if len(more_frame):
                        more_array = more_frame['SalePrice'].to_numpy()
                        more_array = more_array.astype('float64')
                        more_array -= more_array.mean()
                        more_array = np.absolute(more_array)
                        right_error = more_array.sum()*(len(more_array)/len(x_f))
                    # Weighted absolute error of this split (MAE criterion).
                    split_error = left_error + right_error
                    if len(best_attr):
                        if best_attr[2] > split_error:
                            best_attr = [attr, split_val, split_error]
                    else:
                        best_attr = [attr, split_val, split_error]
    
        tree_node.value = best_attr[1]
        tree_node.attribute = best_attr[0]
        # print("splitting on "+tree_node.attribute+" at value "+str(tree_node.value))
        if isinstance(tree_node.value, str):
            left_x = x_f[x_f[tree_node.attribute] == tree_node.value].copy()
            right_x = x_f[x_f[tree_node.attribute] != tree_node.value].copy()
        else:
            left_x = x_f[x_f[tree_node.attribute] <= tree_node.value].copy()
            right_x = x_f[x_f[tree_node.attribute] > tree_node.value].copy()
        if(len(left_x)==0):
            tree_node.answer = right_x['SalePrice'].mean()
            return tree_node
        if(len(right_x)==0):
            tree_node.answer = left_x['SalePrice'].mean()
            return tree_node
#         left_x.drop(columns=[best_attr[0]], inplace=True)
#         right_x.drop(columns=[best_attr[0]], inplace=True)
        tree_node.left = self.build_tree(left_x, current_depth+1, maximum_depth, threshold_samples)
        tree_node.right = self.build_tree(right_x, current_depth+1, maximum_depth, threshold_samples)
        return tree_node

    def preprocessing(self, df):
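        # numeric_only=True keeps the mean restricted to numeric columns (required by pandas >= 2.0).
        df.fillna(df.mean(numeric_only=True), inplace=True)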
        df.fillna(value="others", inplace=True)
        df.drop(columns=['Alley', 'PoolQC', 'Fence', 'MiscFeature'], inplace=True)
        convert_to_num = {'LotShape':{'Reg':4, 'IR1':3, 'IR2':2, 'IR3':1}, 
                   'LandContour':{'Lvl':4, 'Bnk':3, 'HLS':2, 'Low':1},
                   'Utilities':{'AllPub':4, 'NoSewr':3, 'NoSeWa':2, 'ELO':1},
                   'LandSlope':{'Gtl':3, 'Mod':2, 'Sev':1},
                   'ExterQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
                   'ExterCond':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
                   'BsmtQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0},
                   'BsmtCond':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0},
                   'BsmtExposure':{'Gd':5, 'Av':4, 'Mn':3, 'No':2, 'others':1},
                   'BsmtFinType1':{'GLQ':7, 'ALQ':6, 'BLQ':5, 'Rec':4, 'LwQ':3, 'Unf':2, 'others':1},
                   'BsmtFinType2':{'GLQ':7, 'ALQ':6, 'BLQ':5, 'Rec':4, 'LwQ':3, 'Unf':2, 'others':1},
                   'HeatingQC':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
                   'KitchenQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1}, 
                   'Functional':{'Typ':8, 'Min1':7, 'Min2':6, 'Mod':5, 'Maj1':4, 'Maj2':3, 'Sev':2, 'Sal':1}, 
                   'FireplaceQu':{'Ex':6, 'Gd':5, 'TA':4, 'Masonry':3, 'Fa':2, 'Po':1, 'others':0}, 
                   'GarageFinish':{'Fin':4, 'RFn':3, 'Unf':2, 'others':1}, 
                   'GarageQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0}, 
                   'GarageCond':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0}, 
                   'PavedDrive':{'Y':3, 'P':2, 'N':1}}
        for attribute in df.columns:
            if attribute in convert_to_num.keys():
                df[attribute] = df[attribute].map(convert_to_num[attribute])
        return df

    def train(self, train_dataframe_path):
        train_df = pd.read_csv(train_dataframe_path, index_col="Id")
        train_df = self.preprocessing(train_df)
        self.root_node = self.build_tree(train_df, 1)

    def predict(self, test_dataframe_path):
        test_df = pd.read_csv(test_dataframe_path, index_col="Id")
        test_df = self.preprocessing(test_df)
        pred_list = []
        ser = test_df.dtypes
        for test_index in range(len(test_df)):
            current_node = self.root_node
            while current_node.left is not None and current_node.right is not None:
                if ser[current_node.attribute] == object:
                    if test_df.iloc[test_index][current_node.attribute] == current_node.value:
                        current_node = current_node.left
                    else:
                        current_node = current_node.right
                else:
                    if test_df.iloc[test_index][current_node.attribute] <= current_node.value:
                        current_node = current_node.left
                    else:
                        current_node = current_node.right
            pred_list.append(current_node.answer)
        return pred_list


dtree_regressor = DecisionTree6()
dtree_regressor.train('./Datasets/q3/train.csv')
predictions = dtree_regressor.predict('./Datasets/q3/test.csv')
test_labels = list()
with open("./Datasets/q3/test_labels.csv") as f:
  for line in f:
    test_labels.append(float(line.split(',')[1]))
print (mean_squared_error(test_labels, predictions))
print (mean_absolute_error(test_labels, predictions))
print (r2_score(test_labels, predictions))

1176833682.7755482
23747.555894256842
0.7773174837354058
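
The decision-tree classes above score a candidate split the same way and differ only in whether the residuals are squared (MSE) or taken in absolute value (MAE) before summing. A minimal, dependency-free sketch of that shared scoring step follows. The names `side_error`, `split_score`, and `criterion` are hypothetical and do not appear in the notebook, and plain Python lists stand in for the `SalePrice` arrays.

```python
def side_error(values, total, criterion):
    """Weighted error for one side of a split, mirroring the loops above."""
    if not values:
        return 0.0  # an empty side contributes no error
    mean = sum(values) / len(values)
    if criterion == "mse":
        deviations = [(v - mean) ** 2 for v in values]
    elif criterion == "mae":
        deviations = [abs(v - mean) for v in values]
    else:
        raise ValueError("criterion must be 'mse' or 'mae'")
    # Summed deviations scaled by the fraction of samples on this side.
    return sum(deviations) * (len(values) / total)


def split_score(left, right, criterion="mse"):
    """Score of a candidate split; lower is better."""
    total = len(left) + len(right)
    return side_error(left, total, criterion) + side_error(right, total, criterion)


# Made-up prices, only for illustration.
left_prices = [100.0, 110.0]
right_prices = [200.0, 220.0, 240.0]
print(split_score(left_prices, right_prices, "mse"))
print(split_score(left_prices, right_prices, "mae"))
```

Factoring the criterion out this way would let one class cover both the MSE and MAE experiments, at the cost of an extra branch per call.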


In [5]:
def preprocessing(df):
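        # numeric_only=True keeps the mean restricted to numeric columns (required by pandas >= 2.0).
        df.fillna(df.mean(numeric_only=True), inplace=True)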
        df.fillna(value="others", inplace=True)
        df.drop(columns=['Alley', 'PoolQC', 'Fence', 'MiscFeature'], inplace=True)
#         convert_to_num = {'LotShape':{'Reg':4, 'IR1':3, 'IR2':2, 'IR3':1}, 
#                    'LandContour':{'Lvl':4, 'Bnk':3, 'HLS':2, 'Low':1},
#                    'Utilities':{'AllPub':4, 'NoSewr':3, 'NoSeWa':2, 'ELO':1},
#                    'LandSlope':{'Gtl':3, 'Mod':2, 'Sev':1},
#                    'ExterQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
#                    'ExterCond':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
#                    'BsmtQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0},
#                    'BsmtCond':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0},
#                    'BsmtExposure':{'Gd':5, 'Av':4, 'Mn':3, 'No':2, 'others':1},
#                    'BsmtFinType1':{'GLQ':7, 'ALQ':6, 'BLQ':5, 'Rec':4, 'LwQ':3, 'Unf':2, 'others':1},
#                    'BsmtFinType2':{'GLQ':7, 'ALQ':6, 'BLQ':5, 'Rec':4, 'LwQ':3, 'Unf':2, 'others':1},
#                    'HeatingQC':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
#                    'KitchenQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1}, 
#                    'Functional':{'Typ':8, 'Min1':7, 'Min2':6, 'Mod':5, 'Maj1':4, 'Maj2':3, 'Sev':2, 'Sal':1}, 
#                    'FireplaceQu':{'Ex':6, 'Gd':5, 'TA':4, 'Masonry':3, 'Fa':2, 'Po':1, 'others':0}, 
#                    'GarageFinish':{'Fin':4, 'RFn':3, 'Unf':2, 'others':1}, 
#                    'GarageQual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0}, 
#                    'GarageCond':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'others':0}, 
#                    'PavedDrive':{'Y':3, 'P':2, 'N':1}}
#         for attribute in df.columns:
#             if attribute in convert_to_num.keys():
#                 df[attribute] = df[attribute].map(convert_to_num[attribute])
        return df

In [25]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor

le = LabelEncoder()  # must be defined: it is used in the encoding loop in this cell
train_df = pd.read_csv('./Datasets/q3/train.csv', index_col="Id")
train_df = preprocessing(train_df)
train_res = train_df['SalePrice']
train_df.drop(columns=['SalePrice'], inplace=True)
test_df = pd.read_csv('./Datasets/q3/test.csv', index_col="Id")
test_df = preprocessing(test_df)
ser = test_df.dtypes
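# Label-encode every categorical column.
# NOTE: the encoder is refit on each frame, so the integer codes agree between
# train and test only when both frames contain the same set of categories.
for attr in test_df.columns:
    if ser[attr] == object:
        train_df[attr] = le.fit_transform(train_df[attr])
        test_df[attr] = le.fit_transform(test_df[attr])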

# categorical_feature_mask = train_df.dtypes==object
# onehotencoder = OneHotEncoder() 
# train_data = onehotencoder.fit_transform(train_df).toarray()
# print(train_data.shape)
# test_data = onehotencoder.transform(test_df).toarray()
# print(test_data.shape)
# 'mse' was renamed 'squared_error' in scikit-learn 1.0 and the old name was removed in 1.2.
clf = DecisionTreeRegressor(criterion='mse')
clf.fit(train_df, train_res)
predictions = clf.predict(test_df)
test_labels = list()
with open("./Datasets/q3/test_labels.csv") as f:
  for line in f:
    test_labels.append(float(line.split(',')[1]))
print (mean_squared_error(test_labels, predictions))
print (mean_absolute_error(test_labels, predictions))
print (r2_score(test_labels, predictions))

1396748683.1326087
27144.215217391305
0.7357047848803377
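
Fitting a label encoder separately on the training and test frames can give the same category different integer codes whenever the two frames contain different sets of values. The dependency-free sketch below illustrates one alternative, assuming the mapping is built once from the training values and reused for the test values. `fit_codes`, `apply_codes`, and the `unknown=-1` fallback are hypothetical names and choices, not part of the notebook or of scikit-learn.

```python
def fit_codes(train_values):
    """Map each distinct category to an integer, in sorted order."""
    return {cat: code for code, cat in enumerate(sorted(set(train_values)))}


def apply_codes(values, codes, unknown=-1):
    """Encode values with a fixed mapping; unseen categories get `unknown`."""
    return [codes.get(v, unknown) for v in values]


# Made-up quality labels, only for illustration.
train_col = ["Gd", "TA", "Ex", "TA"]
test_col = ["TA", "Gd", "Po"]  # "Ex" is absent here and "Po" was never seen in training

codes = fit_codes(train_col)
print(apply_codes(train_col, codes))
print(apply_codes(test_col, codes))

# Numbering the test column on its own shifts the code assigned to "Gd".
separate = fit_codes(test_col)
print(codes["Gd"], separate["Gd"])
```

Whether a mismatch actually occurs in the housing data depends on which categories appear in each file, which this sketch does not check.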


In [46]:
# 'mae' was renamed 'absolute_error' in scikit-learn 1.0 and the old name was removed in 1.2.
clf = DecisionTreeRegressor(criterion='mae')
clf.fit(train_df, train_res)
predictions = clf.predict(test_df)
test_labels = list()
with open("./Datasets/q3/test_labels.csv") as f:
  for line in f:
    test_labels.append(float(line.split(',')[1]))
print (mean_squared_error(test_labels, predictions))
print (mean_absolute_error(test_labels, predictions))
print (r2_score(test_labels, predictions))

1785757014.0630436
27959.467391304348
0.662095951918339


In [38]:
clf = DecisionTreeRegressor(criterion='friedman_mse')
clf.fit(train_df, train_res)
predictions = clf.predict(test_df)
test_labels = list()
with open("./Datasets/q3/test_labels.csv") as f:
  for line in f:
    test_labels.append(float(line.split(',')[1]))
print (mean_squared_error(test_labels, predictions))
print (mean_absolute_error(test_labels, predictions))
print (r2_score(test_labels, predictions))

1356746873.597826
26593.206521739132
0.7432739968537185


In [36]:
train_df = pd.read_csv('./Datasets/q3/train.csv', index_col="Id")
always_mean = [train_df['SalePrice'].mean()]*len(test_df)
always_median = [train_df['SalePrice'].median()]*len(test_df)
print (mean_squared_error(test_labels, always_mean))
print (mean_absolute_error(test_labels, always_mean))
print (r2_score(test_labels, always_mean))
print (mean_squared_error(test_labels, always_median))
print (mean_absolute_error(test_labels, always_median))
print (r2_score(test_labels, always_median))

5322462690.052036
54656.09460869566
-0.007125647313121597
5447828300.608696
51893.639130434785
-0.030847546184985086
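
The slightly negative R2 scores of the two constant baselines follow from the definition R2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of the test labels. Any constant prediction other than that mean gives SS_res > SS_tot and therefore a negative score. The sketch below illustrates this with made-up numbers rather than the housing data. `r2` is a hypothetical helper written from that definition.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up labels whose mean is 200, and a hypothetical training mean of 210.
toy_labels = [100.0, 200.0, 300.0]
train_mean = 210.0

# A constant slightly off the test mean scores just below zero.
print(r2(toy_labels, [train_mean] * len(toy_labels)))

# A constant equal to the test mean scores exactly zero.
print(r2(toy_labels, [200.0] * len(toy_labels)))
```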
