# **Decision Trees**

The Wisconsin Breast Cancer Dataset(WBCD) can be found here(https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data)

This dataset describes the characteristics of the cell nuclei of various patients with and without breast cancer. The task is to classify a decision tree to predict if a patient has a benign or a malignant tumour based on these features.

Attribute Information:
```
#  Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        (2 for benign, 4 for malignant)
```

In [4]:
import numpy as np 
import pandas as pd 
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score 
from sklearn.model_selection import train_test_split

In [5]:
import pandas as pd
headers = ["ID","CT","UCSize","UCShape","MA","SECSize","BN","BC","NN","Mitoses","Diagnosis"]
data = pd.read_csv('breast-cancer-wisconsin.data', na_values='?',    
         header=None, index_col=['ID'], names = headers) 
data = data.reset_index(drop=True)
data = data.fillna(0)
data.describe()

Unnamed: 0,CT,UCSize,UCShape,MA,SECSize,BN,BC,NN,Mitoses,Diagnosis
count,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0
mean,4.41774,3.134478,3.207439,2.806867,3.216023,3.463519,3.437768,2.866953,1.589413,2.689557
std,2.815741,3.051459,2.971913,2.855379,2.2143,3.640708,2.438364,3.053634,1.715078,0.951273
min,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,2.0
25%,2.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,2.0
50%,4.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0,2.0
75%,6.0,5.0,5.0,4.0,4.0,5.0,5.0,4.0,1.0,4.0
max,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,4.0


1. a) Implement a decision tree (you can use decision tree implementation from existing libraries).

In [9]:
X = data.values[:, :-1] 
Y = data.values[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3) 

1. b) Train a decision tree object of the above class on the WBC dataset using misclassification rate, entropy and Gini as the splitting metrics.

In [11]:
model_entropy = DecisionTreeClassifier(criterion = "entropy")
model_entropy.fit(X_train, y_train)
entropy_pred = model_entropy.predict(X_test)

model_gini = DecisionTreeClassifier(criterion = "gini")
model_gini.fit(X_train, y_train)
gini_pred = model_gini.predict(X_test)

1. c) Report the accuracies in each of the above splitting metrics and give the best result. 

In [17]:
gini_accuracy = accuracy_score(gini_pred,y_test)
entropy_accuracy = accuracy_score(entropy_pred,y_test)
print(f'Accuracies:\n  for splt at entropy: {entropy_accuracy*100}\n  for splt at gini: {gini_accuracy*100}')

Accuracies:
  for splt at entropy: 94.28571428571428
  for splt at gini: 94.28571428571428


1. d) Experiment with different approaches to decide when to terminate the tree (number of layers, purity measure, etc). Report and give explanations for all approaches. 

In [20]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

In [27]:
hyper_params = {
    'criterion': ['gini','entropy'],
    'max_depth': range(5,20),
    'min_samples_leaf': range(5,500,5),
    'random_state': range(0,100,5)
}

In [28]:
gridsearch = GridSearchCV(DecisionTreeClassifier(), hyper_params, cv=5)
gridsearch.fit(X_train,y_train)
print(f'Best parameters : {gridsearch.best_params_}\nScore:{gridsearch.best_score_}')

Best parameters : {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 10, 'random_state': 5}
Score:0.9407532084998949


In [42]:
model = DecisionTreeClassifier(criterion = "entropy",  random_state=5, max_depth=5 )
model.fit(X_train, y_train)
pred = model.predict(X_test)
accuracy = accuracy_score(pred,y_test)
print(f'Accuracy    :   {accuracy*100} %')

Accuracy    :   94.76190476190476 %


In [32]:
randsearch = RandomizedSearchCV(DecisionTreeClassifier(), hyper_params, cv=5)
randsearch.fit(X_train,y_train)
print(f'Best parameters : {randsearch.best_params_}\nScore:{randsearch.best_score_}')

Best parameters : {'random_state': 15, 'min_samples_leaf': 5, 'max_depth': 13, 'criterion': 'entropy'}
Score:0.9386913528297918%


In [41]:
model = DecisionTreeClassifier(criterion = "entropy",  random_state=15, max_depth=13 )
model.fit(X_train, y_train)
pred = model.predict(X_test)
accuracy = accuracy_score(pred,y_test)
print(f'Accuracy    :   {accuracy*100} %')

Accuracy    :   94.28571428571428 %


2. What is boosting, bagging and  stacking?
Which class does random forests belong to and why?

Answer:

3. Implement random forest algorithm using different decision trees . 

In [109]:
class RandomForest:
    def __init__(self,n=3,min_sample_split=2,max_depth=5,criterion="entropy" ):
        self.n = n
        self.min_sample_split = min_sample_split
        self.max_depth = max_depth
        self.criterion = criterion
        self.trees = [DecisionTreeClassifier(max_depth=self.max_depth, min_samples_split=self.min_sample_split, criterion=self.criterion) for _ in range(n)]

    def fit(self,X,y):
        for i in range(self.n):
            samples = np.random.choice(X.shape[0],X.shape[0],replace = True)
            X_sample = X[samples]
            y_sample = y[samples]
            self.trees[i].fit(X_sample,y_sample)
    
    def predict(self,X):
        pred = []
        for i in range(self.n):
            pred.append(self.trees[i].predict(X))
        pred = np.array(pred,dtype=int).T
        y = np.zeros(X.shape[0],dtype=np.int64)
        for i in range(X.shape[0]):
            y[i] = np.argmax(np.bincount(pred[i]))
            print(np.bincount(pred[i]))
        return y

4. Report the accuracies obtained after using the Random forest algorithm and compare it with the best accuracies obtained with the decision trees. 

In [110]:
model = RandomForest(max_depth=3,n=10,criterion='entropy')
model.fit(X_train,y_train)
y_pred = model.predict(X_test)

[  0   0 135   0  75]
[  0   0 128   0  82]
[  0   0 131   0  79]
[  0   0 127   0  83]
[  0   0 132   0  78]
[  0   0 137   0  73]
[  0   0 136   0  74]
[  0   0 135   0  75]
[  0   0 129   0  81]
[  0   0 120   0  90]


In [108]:
y_pred

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)

In [63]:
y_test.shape

(210,)

5. Submit your solution as a separate pdf in the final zip file of your submission


Compute a decision tree with the goal to predict the food review based on its smell, taste and portion size.

(a) Compute the entropy of each rule in the first stage.

(b) Show the final decision tree. Clearly draw it.

Submit a handwritten response. Clearly show all the steps.

