## Downloading the source code from GitHub

In [None]:
!git clone https://github.com/shivaditya-meduri/ensembleLearning.git

Cloning into 'ensembleLearning'...
remote: Enumerating objects: 105, done.
remote: Counting objects: 100% (105/105), done.
remote: Compressing objects: 100% (70/70), done.
remote: Total 105 (delta 46), reused 82 (delta 32), pack-reused 0
Receiving objects: 100% (105/105), 182.91 KiB | 5.54 MiB/s, done.
Resolving deltas: 100% (46/46), done.


## Decision Tree Classifier - Implemented from scratch
Using Gini impurity as the cost function to split the data by feature and to rank features by importance, we created a Decision Tree classifier from scratch. The model has two hyperparameters for tuning: the maximum depth of the tree (`max_depth`) and the minimum number of samples per leaf (`min_samples_leaf`). Below, we test the model on two classification datasets: the breast cancer dataset and the iris dataset.
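
To make the splitting criterion concrete, here is a minimal sketch of how Gini impurity can score candidate thresholds on a single feature. It is an illustration only: the function names `gini_impurity`, `weighted_gini` and `best_threshold` are hypothetical and are not taken from the `ensembleLearning` repository, whose internals may differ.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(left, right):
    """Size-weighted impurity of the two children of a candidate split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def best_threshold(values, labels):
    """Try every distinct value of one feature as a threshold (x <= t goes left).

    Returns (threshold, weighted impurity) for the lowest-impurity split,
    or (None, inf) if no threshold separates the samples.
    """
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        if not left or not right:
            continue  # skip splits that leave one side empty
        score = weighted_gini(left, right)
        if score < best[1]:
            best = (t, score)
    return best


print(gini_impurity(["a", "a", "b", "b"]))                   # 0.5
print(best_threshold([1, 2, 3, 4], ["a", "a", "b", "b"]))    # (2, 0.0)
```

A full tree would repeat this search over every feature at each node and recurse on the two children until a stopping rule such as the maximum depth or the minimum leaf size is reached.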

#### Testing on Breast Cancer dataset

In [None]:
## Testing on the breast cancer dataset, which predicts whether a sample is a "Benign" or "Malignant" case of cancer
import pandas as pd
from ensembleLearning.src.decisionTree import decisionTree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv("ensembleLearning/data/bcan.csv")
X = data.drop(["diagnosis", "id"], axis=1).values
y = data["diagnosis"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
dt = decisionTree(max_depth=50, min_samples_leaf=1)
dt.train(X_train, y_train)
ypred = dt.predict(X_test)
print("Accuracy on the breast cancer dataset is ", accuracy_score(y_test, ypred))

Accuracy on the breast cancer dataset is  0.9473684210526315


#### Testing on Iris dataset

In [None]:
## Testing on the iris dataset, which classifies flowers by their physical characteristics into one of three varieties: Setosa, Virginica and Versicolor
import pandas as pd
from ensembleLearning.src.decisionTree import decisionTree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv("ensembleLearning/data/iris.csv")
X = data.drop(["variety"], axis=1).values
y = data["variety"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
dt = decisionTree(type="classification", max_depth=100, min_samples_leaf=1)
dt.train(X_train, y_train)
ypred = dt.predict(X_test)
print("Accuracy on iris dataset is ", accuracy_score(y_test, ypred))

Accuracy on iris dataset is  1.0


### Decision Tree Regressor

Using variance as the cost function to split the data by feature and to rank features by importance, we build a binary tree. After training, a prediction for a given set of features is made by traversing the tree to a leaf node and returning the average of all the labels in that leaf.
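
The variance criterion and the leaf prediction described above can be sketched as follows. This is a simplified illustration on a single feature, not the repository's code: the names `weighted_variance`, `best_regression_threshold` and `leaf_prediction` are hypothetical.

```python
from statistics import mean, pvariance


def weighted_variance(left, right):
    """Size-weighted population variance of the targets in the two children."""
    n = len(left) + len(right)
    return (len(left) / n) * pvariance(left) + (len(right) / n) * pvariance(right)


def best_regression_threshold(values, targets):
    """Try every distinct value of one feature as a threshold (x <= t goes left).

    Returns (threshold, weighted variance) for the lowest-variance split,
    or (None, inf) if no threshold separates the samples.
    """
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, targets) if x <= t]
        right = [y for x, y in zip(values, targets) if x > t]
        if not left or not right:
            continue  # skip splits that leave one side empty
        score = weighted_variance(left, right)
        if score < best[1]:
            best = (t, score)
    return best


def leaf_prediction(targets):
    """A leaf predicts the mean of the training targets that reached it."""
    return mean(targets)


print(best_regression_threshold([1, 2, 3, 4], [10.0, 12.0, 30.0, 32.0]))  # (2, 1.0)
print(leaf_prediction([10.0, 12.0]))                                       # 11.0
```

In this toy example the split at 2 separates the low targets from the high ones, so each child has a variance of 1.0, far below the variance of the unsplit data.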

In [None]:
import pandas as pd
from ensembleLearning.src.decisionTree import decisionTree
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import math
reg_data = pd.read_csv("ensembleLearning/data/regression_housing.csv")[["MSSubClass", "LotFrontage", "LotArea", "OverallQual", "OverallCond", "SalePrice"]]
X = reg_data.drop(["SalePrice"], axis=1).values
y = reg_data["SalePrice"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
dt = decisionTree(type = "regression")
dt.train(X_train, y_train)
ypred = dt.predict(X_test)
print("Root Mean Squared Error of the model is : ", math.sqrt(mean_squared_error(y_test, ypred)))

Root Mean Squared Error of the model is :  53117.38009733844


### Random Forest Classifier
Using the bagging method, we created an ensemble of Decision Tree classifiers and used a voting mechanism to decide the class of a given set of features.
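
The two ingredients named above, bootstrap sampling (the "bagging" step) and majority voting, can be sketched in a few lines. This is a hedged illustration under simple assumptions: the helper names `bootstrap_sample` and `majority_vote` are hypothetical, and the repository's `randomForest` class may sample or break ties differently.

```python
import random
from collections import Counter


def bootstrap_sample(X, y, rng):
    """Draw len(X) rows with replacement, keeping each row paired with its label."""
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]
    return [X[i] for i in idx], [y[i] for i in idx]


def majority_vote(predictions_per_tree):
    """Combine per-tree predictions into one label per sample.

    predictions_per_tree holds one list of predicted labels per tree, all of
    the same length. Ties go to the label that appears first among the votes.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions_per_tree)]


rng = random.Random(0)
X_boot, y_boot = bootstrap_sample([[1], [2], [3]], ["a", "b", "c"], rng)
print(len(X_boot))  # 3

# Three trees, two samples: each column is the set of votes for one sample.
print(majority_vote([["a", "b"], ["a", "a"], ["b", "b"]]))  # ['a', 'b']
```

Each tree in the ensemble would be trained on its own bootstrap sample, and `majority_vote` would then be applied to the predictions the trees make on the test set.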

In [None]:
import pandas as pd
from ensembleLearning.src.randomForest import randomForest
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv("ensembleLearning/data/iris.csv")
X = data.drop(["variety"], axis=1).values
y = data["variety"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
rf = randomForest(type = "classification", n_trees=50, max_depth=100, min_samples_leaf=1)
rf.train(X_train, y_train)
ypred = rf.predict(X_test)
print("Accuracy on iris dataset using a random forest model is ", accuracy_score(y_test, ypred))

Accuracy on iris dataset using a random forest model is  1.0


### Random Forest Regressor
Using the Decision Tree Regressor as the base model, we build a bagging ensemble, which is the Random Forest Regressor. Instead of the voting mechanism used by the Random Forest Classifier, we take the average of the predictions of all the base estimators.
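
The averaging step can be sketched as below. The function name `average_predictions` is hypothetical and the numbers are made up for illustration; they are not results from the housing dataset.

```python
from statistics import mean


def average_predictions(predictions_per_tree):
    """Average the per-tree predictions sample by sample.

    predictions_per_tree holds one list of numeric predictions per tree,
    all of the same length.
    """
    return [mean(preds) for preds in zip(*predictions_per_tree)]


# Three trees, two samples: each column is averaged into one prediction.
tree_outputs = [
    [100.0, 200.0],
    [110.0, 190.0],
    [90.0, 210.0],
]
print(average_predictions(tree_outputs))  # [100.0, 200.0]
```

Averaging tends to smooth out the errors of individual trees, which is why the ensemble is usually more stable than a single deep tree.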

In [None]:
import pandas as pd
from ensembleLearning.src.randomForest import randomForest
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import math
reg_data = pd.read_csv("ensembleLearning/data/regression_housing.csv")[["MSSubClass", "LotFrontage", "LotArea", "OverallQual", "OverallCond", "SalePrice"]]
X = reg_data.drop(["SalePrice"], axis=1).values
y = reg_data["SalePrice"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
rf = randomForest(type = "regression", n_trees=50, max_depth=100, min_samples_leaf=1)
rf.train(X_train, y_train)
ypred = rf.predict(X_test)
print("Root Mean Squared Error of the model is : ", math.sqrt(mean_squared_error(y_test, ypred)))

Root Mean Squared Error of the model is :  52611.77811114424
