## Decision Tree Classifier - Implemented from scratch
Using Gini impurity as the cost function to split the data by feature and to rank the features by importance, we built a Decision Tree classifier from scratch. The model exposes two hyperparameters for tuning: the maximum depth of the tree (`max_depth`) and the minimum number of samples per leaf (`min_samples_leaf`). Below, we test the model on two classification datasets: the breast cancer dataset and the iris dataset.
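
The split search described above can be sketched roughly as follows. This is a minimal illustration, not the code inside `decisionTree`: the function names `gini_impurity` and `best_gini_split` are hypothetical, and the exhaustive scan over every observed feature value as a candidate threshold is an assumption about how candidates are generated.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a label array: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_gini_split(X, y, min_samples_leaf=1):
    """Return (feature_index, threshold, weighted_impurity) for the
    binary split with the lowest weighted Gini impurity, or None if
    no split leaves at least min_samples_leaf samples on each side."""
    n_samples, n_features = X.shape
    best = None
    for j in range(n_features):
        # Assumption: every observed value is tried as a threshold.
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            n_left = int(left.sum())
            n_right = n_samples - n_left
            if n_left < min_samples_leaf or n_right < min_samples_leaf:
                continue
            cost = (n_left * gini_impurity(y[left])
                    + n_right * gini_impurity(y[~left])) / n_samples
            if best is None or cost < best[2]:
                best = (j, t, cost)
    return best
```

A tree builder would call such a function recursively on the left and right partitions, stopping when `max_depth` is reached or when no valid split remains.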

In [3]:
## Testing on the breast cancer dataset, which labels each sample as a "Benign" or "Malignant" case of cancer
import pandas as pd
data = pd.read_csv("bcan.csv")
X = data.drop(["diagnosis", "id"], axis=1).values
y = data["diagnosis"].values
from decisionTree import decisionTree
dt = decisionTree(max_depth=50, min_samples_leaf=1)
dt.train(X, y)
# Predict on the same samples used for training
ypred = dt.predict(X)

In [4]:
from sklearn.metrics import accuracy_score
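# accuracy_score expects (y_true, y_pred); X was also the training data, so this is training accuracy
print("Accuracy on the breast cancer dataset is ", accuracy_score(y, ypred))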

Accuracy on the breast cancer dataset is  0.9876977152899824


In [1]:
## Testing on the iris dataset, which classifies flowers by their physical characteristics into three types: Setosa, Virginica and Versicolor
import pandas as pd
data = pd.read_csv("iris.csv")
X = data.drop(["variety"], axis=1).values
y = data["variety"].values
from decisionTree import decisionTree
dt = decisionTree(type="classification", max_depth=100, min_samples_leaf=1)
dt.train(X, y)
# Predict on the same samples used for training
ypred = dt.predict(X)

In [2]:
from sklearn.metrics import accuracy_score
# accuracy_score expects (y_true, y_pred); this is training accuracy
print("Accuracy on iris dataset is ", accuracy_score(y, ypred))

Accuracy on iris dataset is  0.9866666666666667


### Decision Tree Regressor

The regressor uses variance as the cost function to split the data by feature and to rank the features by importance. Each leaf node predicts the average of the training labels that fall into it. After training, a prediction for a given set of features is made by traversing the resulting binary tree down to a leaf.
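
As a rough illustration of the variance criterion and the leaf average, the sketch below searches one feature column for the threshold that minimises the size-weighted variance of the two children. The names `weighted_variance`, `best_variance_split` and `leaf_prediction` are hypothetical and are not taken from `decisionTree`; trying every observed value as a threshold is also an assumption.

```python
import numpy as np

def weighted_variance(y_left, y_right):
    """Size-weighted average of the label variance in the two children."""
    n = len(y_left) + len(y_right)
    return (len(y_left) * np.var(y_left)
            + len(y_right) * np.var(y_right)) / n

def best_variance_split(x, y, min_samples_leaf=1):
    """Return (threshold, cost) for the best split on one feature
    column x, or None if no split satisfies min_samples_leaf."""
    best = None
    for t in np.unique(x):
        left = x <= t
        if left.sum() < min_samples_leaf or (~left).sum() < min_samples_leaf:
            continue
        cost = weighted_variance(y[left], y[~left])
        if best is None or cost < best[1]:
            best = (t, cost)
    return best

def leaf_prediction(y_leaf):
    """A leaf predicts the mean of the training labels it contains."""
    return float(np.mean(y_leaf))
```

For example, with `x = [1, 2, 3, 4]` and `y = [10, 12, 30, 32]`, the lowest-cost threshold is 2, which separates the two low values from the two high ones.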

In [3]:
import pandas as pd
reg_data = pd.read_csv("regression_housing.csv")[["MSSubClass", "LotFrontage", "LotArea", "OverallQual", "OverallCond", "SalePrice"]]
X = reg_data.drop(["SalePrice"], axis=1).values
y = reg_data["SalePrice"].values
from decisionTree import decisionTree
dt = decisionTree(type = "regression")
dt.train(X, y)
# Predict on the same samples used for training
ypred = dt.predict(X)

In [4]:
from sklearn.metrics import mean_squared_error
import math
# Root mean squared error on the training data, in the units of SalePrice
print(math.sqrt(mean_squared_error(y, ypred)))

50070.42903592454


### Random Forest Classifier
Using bagging, we created an ensemble of Decision Tree classifiers and used a voting mechanism to decide the class of a given set of features.
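
Bagging with majority voting can be sketched as below. This is an illustrative outline under assumptions, not the contents of `randomForest`: the helper names, the `make_model` factory argument, and the default of 10 estimators are all hypothetical. The only interface assumed of a base model is the `train(X, y)` and `predict(X)` pair used elsewhere in this notebook. Whether `randomForest` also subsamples features at each split is not stated here, so the sketch covers bootstrap sampling only.

```python
import numpy as np
from collections import Counter

def bootstrap_sample(X, y, rng):
    """Draw n rows with replacement from (X, y)."""
    n = X.shape[0]
    idx = rng.integers(0, n, size=n)
    return X[idx], y[idx]

def train_bagged(make_model, X, y, n_estimators=10, seed=0):
    """Train n_estimators base models, each on its own bootstrap sample.
    make_model is a hypothetical zero-argument factory returning an
    object with train(X, y) and predict(X) methods."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        Xb, yb = bootstrap_sample(X, y, rng)
        model = make_model()
        model.train(Xb, yb)
        models.append(model)
    return models

def majority_vote(all_predictions):
    """all_predictions has shape (n_estimators, n_samples).
    Returns the most common label for each sample."""
    all_predictions = np.asarray(all_predictions)
    return np.array([Counter(col).most_common(1)[0][0]
                     for col in all_predictions.T])

def predict_vote(models, X):
    """Collect each model's predictions and take the majority vote."""
    return majority_vote([m.predict(X) for m in models])
```

With the classifier from this notebook, the factory could be something like `lambda: decisionTree(type="classification", max_depth=100, min_samples_leaf=1)`.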

In [1]:
import pandas as pd
data = pd.read_csv("iris.csv")
X = data.drop(["variety"], axis=1).values
y = data["variety"].values
from randomForest import randomForest
rf = randomForest(type = "classification", max_depth=100, min_samples_leaf=1)
rf.train(X, y)
ypred = rf.predict(X)

In [2]:
from sklearn.metrics import accuracy_score
# accuracy_score expects (y_true, y_pred); this is training accuracy
print("Accuracy on iris dataset using a random forest model is ", accuracy_score(y, ypred))

Accuracy on iris dataset using a random forest model is  0.9666666666666667


### Random Forest Regressor
The Random Forest Regressor is a bagging ensemble that uses the Decision Tree Regressor as its base model. Instead of the voting mechanism used by the Random Forest Classifier, it takes the average of the predictions made by all the base estimators.
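
The averaging step can be illustrated with a few lines of NumPy. The function name `average_predictions` is hypothetical and the sketch assumes each base estimator returns one numeric prediction per sample; it is not taken from `randomForest`.

```python
import numpy as np

def average_predictions(all_predictions):
    """all_predictions has shape (n_estimators, n_samples).
    Returns the per-sample mean across the base estimators."""
    return np.mean(np.asarray(all_predictions, dtype=float), axis=0)
```

For instance, two estimators predicting `[1, 2]` and `[3, 4]` for two samples give an ensemble prediction of `[2, 3]`.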

In [1]:
import pandas as pd
reg_data = pd.read_csv("regression_housing.csv")[["MSSubClass", "LotFrontage", "LotArea", "OverallQual", "OverallCond", "SalePrice"]]
X = reg_data.drop(["SalePrice"], axis=1).values
y = reg_data["SalePrice"].values
from randomForest import randomForest
rf = randomForest(type = "regression", max_depth=100, min_samples_leaf=1)
rf.train(X, y)
ypred = rf.predict(X)

In [2]:
from sklearn.metrics import mean_squared_error
import math
# Root mean squared error on the training data, in the units of SalePrice
print(math.sqrt(mean_squared_error(y, ypred)))

49903.62236996378
