### CART (Classification and Regression Tree)
#### Regression
The mean and (population) standard deviation of the target values $X = \{x_1, \dots, x_N\}$ are given by,
$$\mu(X) = \frac{1}{N}\sum_{i=1}^{N}x_i$$
$$\sigma(X)=\sqrt{\frac{\sum_{i=1}^{N} (x_i-\mu(X))^2}{N}}$$
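
As a quick check of the two formulas, here is a minimal sketch in plain Python. The `values` list is made up for illustration and is not taken from the golf dataset; `mean` and `population_std` are hypothetical helper names. The point is that the formula divides by $N$, which is what `ddof=0` selects in pandas and what `statistics.pstdev` computes in the standard library.

```python
import math
import statistics


def mean(xs):
    """Arithmetic mean: sum of the values divided by their count N."""
    return sum(xs) / len(xs)


def population_std(xs):
    """Population standard deviation: squared deviations are divided by N."""
    mu = mean(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))


# Hypothetical values, chosen so the result is easy to verify by hand
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(mean(values))               # 5.0
print(population_std(values))     # 2.0
print(statistics.pstdev(values))  # same quantity from the standard library
```

The sample standard deviation (dividing by $N-1$, the pandas default `ddof=1`) would give a larger value, so the `ddof=0` argument matters when comparing against these formulas.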

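For a feature column $c$ with $M$ distinct values, the weighted standard deviation is the standard deviation of each subset $X|c=j$, weighted by the fraction of rows $p(c=j)$ that fall in that subset,
$$\sigma(X|c) = \sum_{j=1}^{M}p(c=j)\,\sigma(X|c=j)$$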
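
The weighted standard deviation can be sketched without pandas as follows. The `wind` and `hours` lists are hypothetical toy data and `weighted_std` is an illustrative helper name, not part of any library. Each group's weight is its share of the rows, which plays the role of $p(c=j)$.

```python
import math
from collections import defaultdict


def population_std(xs):
    """Population standard deviation (divide by N)."""
    mu = sum(xs) / len(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))


def weighted_std(feature, target):
    """Sum of per-group standard deviations, weighted by group size / N."""
    groups = defaultdict(list)
    for category, value in zip(feature, target):
        groups[category].append(value)
    n = len(target)
    return sum(len(g) / n * population_std(g) for g in groups.values())


# Hypothetical toy data: two "Weak" rows and two "Strong" rows
wind = ["Weak", "Weak", "Strong", "Strong"]
hours = [10.0, 14.0, 20.0, 20.0]

# Weak -> std 2.0, Strong -> std 0.0, each with weight 0.5
print(weighted_std(wind, hours))  # 1.0
```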

The standard deviation reduction obtained by splitting on the column is given by,
$$\delta(c) = \sigma(X)-\sigma(X|c) $$

The feature with the highest standard deviation reduction is chosen as the decision (split) node.
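
To make the selection rule concrete, the following sketch scores every feature of a small hypothetical table and picks the one with the largest reduction. The rows, the `Hours` label, and the function names `std_reduction` and `best_feature` are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def population_std(xs):
    """Population standard deviation (divide by N)."""
    mu = sum(xs) / len(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))


def std_reduction(rows, feature, label):
    """delta(c): overall std of the label minus its weighted std after the split."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[label])
    n = len(rows)
    weighted = sum(len(g) / n * population_std(g) for g in groups.values())
    return population_std([row[label] for row in rows]) - weighted


def best_feature(rows, label):
    """Return (feature, reduction) for the feature with the highest reduction."""
    features = [name for name in rows[0] if name != label]
    scores = {f: std_reduction(rows, f, label) for f in features}
    winner = max(scores, key=scores.get)
    return winner, scores[winner]


# Hypothetical toy table: "Wind" separates low and high values, "Humidity" does not
rows = [
    {"Wind": "Weak", "Humidity": "High", "Hours": 10.0},
    {"Wind": "Weak", "Humidity": "Normal", "Hours": 14.0},
    {"Wind": "Strong", "Humidity": "High", "Hours": 20.0},
    {"Wind": "Strong", "Humidity": "Normal", "Hours": 24.0},
]

name, score = best_feature(rows, "Hours")
print(name, round(score, 3))  # Wind 3.385
```

Here the overall standard deviation is $\sqrt{29}\approx 5.385$; splitting on `Wind` leaves a weighted standard deviation of 2, while splitting on `Humidity` leaves 5, so `Wind` gives the larger reduction.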

In [2]:
import requests
from io import StringIO
import pandas as pd
import numpy as np
import math

In [3]:
url = 'https://raw.githubusercontent.com/serengil/decision-trees-for-ml/master/dataset/golf3.txt'
r = requests.get(url)
# Decode the response body as UTF-8 and wrap it in a file-like object for pandas
s = str(r.content, 'utf-8')
data = StringIO(s)
df = pd.read_csv(data)
df

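| | Outlook | Temp. | Humidity | Wind | Decision |
|---|---|---|---|---|---|
| 0 | Sunny | Hot | High | Weak | 25 |
| 1 | Sunny | Hot | High | Strong | 30 |
| 2 | Overcast | Hot | High | Weak | 46 |
| 3 | Rain | Mild | High | Weak | 45 |
| 4 | Rain | Cool | Normal | Weak | 52 |
| 5 | Rain | Cool | Normal | Strong | 23 |
| 6 | Overcast | Cool | Normal | Strong | 43 |
| 7 | Sunny | Mild | High | Weak | 35 |
| 8 | Sunny | Cool | Normal | Weak | 38 |
| 9 | Rain | Mild | Normal | Weak | 46 |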


In [8]:
std = df["Decision"].std(ddof=0)
print(std)

9.321086474291743


In [12]:
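# Weighted standard deviation of "Decision" after splitting on "Temp."
wstd = 0
for cls in df["Temp."].unique():
    subdf = df[df["Temp."] == cls]
    # Weight each subset's standard deviation by its share of the rows
    wstd += subdf.shape[0] / df.shape[0] * subdf["Decision"].std(ddof=0)
print(wstd)        # weighted standard deviation
print(std - wstd)  # standard deviation reduction for "Temp."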

8.841395899528422
0.4796905747633211


In [13]:
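class Regression_Tree:
    """Ranks features by the standard deviation reduction of the label column."""

    def __init__(self, df, label_col="Decision"):
        self.df = df
        self.label_col = label_col
        self.shape = df.shape
        # Population standard deviation (ddof=0) of the label over all rows
        self.std = self.df[label_col].std(ddof=0)

    def rstd(self, col_name="Humidity"):
        """Return the standard deviation reduction obtained by splitting on col_name."""
        wstd = 0
        for cls in self.df[col_name].unique():
            subdf = self.df[self.df[col_name] == cls]
            # Weight each subset's standard deviation by its share of the rows
            wstd += subdf.shape[0] / self.df.shape[0] * subdf[self.label_col].std(ddof=0)
        return self.std - wstd

    def rank_rstd(self):
        """Return feature positions sorted by reduction, best first."""
        # Positions count the feature columns with the label column skipped;
        # they match positions in df.columns when the label column is last, as it is here.
        feature_cols = [c for c in self.df.columns if c != self.label_col]
        rstds = np.array([self.rstd(col_name=c) for c in feature_cols])
        return np.argsort(rstds)[::-1]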

In [14]:
dt = Regression_Tree(df)
ranks = np.array(dt.rank_rstd())
print(ranks, dt.df.columns[ranks])
print('Original Column Names: ',dt.df.columns)

[0 1 3 2] Index(['Outlook', 'Temp.', 'Wind', 'Humidity'], dtype='object')
Original Column Names:  Index(['Outlook', 'Temp.', 'Humidity', 'Wind', 'Decision'], dtype='object')
