# **Decision Trees**

The Wisconsin Breast Cancer Dataset (WBCD) can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).

This dataset describes the characteristics of the cell nuclei of various patients with and without breast cancer. The task is to train a decision tree classifier that predicts whether a patient has a benign or a malignant tumour based on these features.

Attribute Information:
```
#  Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        (2 for benign, 4 for malignant)
```

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
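# Short column names for the 11 attributes listed above
headers = ["ID","CT","UCSize","UCShape","MA","SECSize","BN","BC","NN","Mitoses","Diagnosis"]
# Missing values are encoded as '?' in the raw file
data = pd.read_csv('breast-cancer-wisconsin.data', na_values='?',
         header=None, index_col=['ID'], names=headers)
# Drop the sample ID so it is not used as a feature
data = data.reset_index(drop=True)
# Replace missing values with 0 (note that 0 lies outside the 1 - 10 attribute range)
data = data.fillna(0)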
data.describe()

| | CT | UCSize | UCShape | MA | SECSize | BN | BC | NN | Mitoses | Diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 4.41774 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | 3.463519 | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
| std | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.2143 | 3.640708 | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |
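
The summary shows a minimum of 0.0 for `BN`, which comes from `fillna(0)` and falls outside the attribute's 1 - 10 domain. As a possible alternative, the sketch below fills the gaps with the column median or drops the incomplete rows. It runs on a small made-up frame (`toy` is a hypothetical stand-in, not the real data), so it only illustrates the pandas calls, not the effect on accuracy.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up for illustration.
toy = pd.DataFrame({"BN": [1.0, 10.0, None, 3.0, None],
                    "Diagnosis": [2, 4, 2, 2, 4]})

# Count the missing entries before touching them.
n_missing = int(toy["BN"].isna().sum())

# Option A: fill with the column median, which stays inside the 1 - 10 domain.
filled = toy.fillna({"BN": toy["BN"].median()})

# Option B: drop the incomplete rows instead of imputing.
dropped = toy.dropna(subset=["BN"])

print(n_missing, filled["BN"].tolist(), len(dropped))
```

Which option is preferable depends on how many rows are affected; neither was evaluated in this notebook.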


1. a) Implement a decision tree (you can use decision tree implementation from existing libraries).

In [3]:
def gini(X_train, y_train, dpt):
    # Decision tree that splits on Gini impurity, limited to depth dpt
    clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100, max_depth=dpt, min_samples_leaf=5)
    clf_gini.fit(X_train, y_train)
    return clf_gini

def entropy(X_train, y_train, dpt):
    # Decision tree that splits on entropy (information gain), limited to depth dpt
    clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100, max_depth=dpt, min_samples_leaf=5)
    clf_entropy.fit(X_train, y_train)
    return clf_entropy

1. b) Train a decision tree object of the above class on the WBC dataset using misclassification rate, entropy and Gini as the splitting metrics.

In [4]:
# The first 9 columns are the features; the last column is the diagnosis (2 or 4)
X = data.values[:, 0:9]
Y = data.values[:, 9]
# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.7, random_state=100)
# Train one tree per criterion at depth 3 and predict on the test set
usedGini = gini(X_train, y_train, 3)
y_gini = usedGini.predict(X_test)
usedEntropy = entropy(X_train, y_train, 3)
y_entropy = usedEntropy.predict(X_test)
print("Accuracy using entropy: ", accuracy_score(y_test, y_entropy)*100)
print("Accuracy using gini: ", accuracy_score(y_test, y_gini)*100)

Accuracy using entropy:  91.9047619047619
Accuracy using gini:  93.33333333333333
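
The question names three splitting metrics, while the cell above trains only with Gini and entropy; scikit-learn's `DecisionTreeClassifier` does not offer misclassification rate as a `criterion`. To make the three measures concrete, here is a minimal sketch that evaluates each of them on the labels of a single node. The function names and the example node are illustrative assumptions, not part of the notebook.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Fraction of samples in each class at this node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def misclassification(labels):
    # Error made by predicting the majority class for every sample.
    return 1.0 - max(class_proportions(labels))

def gini_impurity(labels):
    # Probability that two samples drawn at random have different classes.
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy_impurity(labels):
    # Shannon entropy of the class distribution, in bits.
    return -sum(p * math.log2(p) for p in class_proportions(labels))

node = [2, 2, 2, 4]  # hypothetical node: 3 benign, 1 malignant
print(misclassification(node), gini_impurity(node), entropy_impurity(node))
```

All three are 0 for a pure node and largest for a 50/50 node; they differ in how strongly they reward partial improvements in purity.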


1. c) Report the accuracies in each of the above splitting metrics and give the best result. 

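Accuracy using entropy: 91.9047619047619<br>
Accuracy using gini: 93.33333333333333<br>
At depth 3, Gini gave the best accuracy. Misclassification rate was not evaluated, since `DecisionTreeClassifier` does not offer it as a `criterion`.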

1. d) Experiment with different approaches to decide when to terminate the tree (number of layers, purity measure, etc). Report and give explanations for all approaches. 

In [7]:
layers = 10
for i in range(1, layers+1):
  print("depth = ", i)
  usedGini = gini(X_train, y_train, i)
  y_gini = usedGini.predict(X_test)
  usedEntropy = entropy(X_train, y_train, i)
  y_entropy = usedEntropy.predict(X_test)
  print("Accuracy using entropy: ", accuracy_score(y_test,y_entropy)*100)
  print("Accuracy using gini: ", accuracy_score(y_test,y_gini)*100)

depth =  1
Accuracy using entropy:  89.52380952380953
Accuracy using gini:  88.09523809523809
depth =  2
Accuracy using entropy:  88.57142857142857
Accuracy using gini:  92.85714285714286
depth =  3
Accuracy using entropy:  91.9047619047619
Accuracy using gini:  93.33333333333333
depth =  4
Accuracy using entropy:  93.33333333333333
Accuracy using gini:  93.33333333333333
depth =  5
Accuracy using entropy:  92.85714285714286
Accuracy using gini:  93.33333333333333
depth =  6
Accuracy using entropy:  93.80952380952381
Accuracy using gini:  94.28571428571428
depth =  7
Accuracy using entropy:  93.80952380952381
Accuracy using gini:  94.28571428571428
depth =  8
Accuracy using entropy:  93.80952380952381
Accuracy using gini:  94.28571428571428
depth =  9
Accuracy using entropy:  93.80952380952381
Accuracy using gini:  94.28571428571428
depth =  10
Accuracy using entropy:  93.80952380952381
Accuracy using gini:  94.28571428571428


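Gini is used as a splitting metric because it is more sensitive to node purity than the raw misclassification rate: a split can leave the misclassification rate unchanged while still lowering the Gini impurity. This matters on XOR-like problems, where reaching pure leaves can otherwise require a much deeper tree.<br>
We tested depths 1 to 10 to see which gives the best accuracy. Test accuracy peaks at depth 6 (94.29% for Gini, 93.81% for entropy) and stays constant for larger depths, most likely because `min_samples_leaf=5` stops the tree from growing any further.<br>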
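
The loop above varies only the depth. The other termination rules the question mentions (minimum node size, purity) can be sketched as a single check applied before each split. This is a simplified illustration with hypothetical names and thresholds, not how scikit-learn implements it; scikit-learn exposes related controls as `max_depth`, `min_samples_leaf` and `min_impurity_decrease`.

```python
from collections import Counter

def gini_impurity(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def should_stop(labels, depth, max_depth=6, min_samples_leaf=5, purity_threshold=0.0):
    """Return the reason to stop splitting this node, or None to keep splitting."""
    if depth >= max_depth:
        return "max depth reached"
    if len(labels) < 2 * min_samples_leaf:
        # A split must leave at least min_samples_leaf samples on each side.
        return "too few samples to form two leaves"
    if gini_impurity(labels) <= purity_threshold:
        return "node is pure enough"
    return None

print(should_stop([2, 4] * 10, depth=6))   # depth limit
print(should_stop([2, 4, 2, 4], depth=1))  # node too small
print(should_stop([2] * 20, depth=1))      # pure node
print(should_stop([2, 4] * 10, depth=2))   # keep splitting
```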
2. What are boosting, bagging and stacking?
Which class do random forests belong to, and why?

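Answer:<br>
1) Boosting: Models are trained sequentially, and each new model focuses on the examples the previous models got wrong; the final prediction is a weighted combination of all models.<br>
2) Bagging: Several models are trained independently, each on a bootstrap sample of the training data, and their predictions are averaged or voted.<br>
3) Stacking: The predictions of the individual models are stacked and used as input to a final estimator, which makes the prediction.<br>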

Random forests belong to bagging: multiple decision trees are trained independently (each on a bootstrap sample, with a random subset of features considered at each split), and their predictions are combined by voting (or averaging) to give the final prediction.
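
To see the three ensemble styles side by side, the following sketch fits one scikit-learn estimator of each kind. It uses synthetic data from `make_classification` so that it runs without the WBCD file; the choice of base models, the hyperparameters and the variable names are assumptions made for illustration, and the resulting scores say nothing about the breast cancer task.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 9 features, like the WBCD feature matrix.
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

models = {
    # Bagging: independent trees on bootstrap samples, combined by voting.
    "bagging": BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                                 n_estimators=10, random_state=0),
    # Boosting: shallow trees fitted one after another, each reweighting
    # the samples the previous ones misclassified.
    "boosting": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                   n_estimators=10, random_state=0),
    # Stacking: base-model predictions become the inputs of a final estimator.
    "stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
                    ("nb", GaussianNB())],
        final_estimator=LogisticRegression()),
}

scores = {name: model.fit(Xtr, ytr).score(Xte, yte) for name, model in models.items()}
print(scores)
```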

3. Implement the random forest algorithm using different decision trees.

In [11]:
import numpy as np
from collections import Counter

def trainModel(X_train, y_train, criterion, max_depth, min_samples_leaf, max_features, seed):
    # One tree of the forest; max_features limits the features considered at each split
    clf = DecisionTreeClassifier(criterion=criterion, random_state=seed, max_depth=max_depth, min_samples_leaf=min_samples_leaf, max_features=max_features)
    clf.fit(X_train, y_train)
    return clf

numTrees = 5
max_depth = 3
min_samples_leaf = 5
max_features = 3
criterion = 'gini' #use 'entropy' for entropy criterion

all_predictions = []
# Hold out a test set that no tree sees during training
X_train, real_X_test, y_train, real_y_test = train_test_split(X, Y, test_size=0.3)
rng = np.random.default_rng()
for i in range(0, numTrees):
    # Bootstrap sample: draw training rows with replacement so each tree sees different data
    idx = rng.integers(0, len(X_train), size=len(X_train))
    modelRand = trainModel(X_train[idx], y_train[idx], criterion, max_depth, min_samples_leaf, max_features, seed=i)
    all_predictions.append(modelRand.predict(real_X_test))

# Majority vote across the trees for each test sample (numTrees is odd, so no ties)
ap_transpose = [[row[i] for row in all_predictions] for i in range(len(all_predictions[0]))]
y_rand = [Counter(row).most_common(1)[0][0] for row in ap_transpose]
print("Accuracy using random trees: ", accuracy_score(real_y_test, y_rand)*100)

Accuracy using random trees:  96.19047619047619


4. Report the accuracies obtained after using the Random forest algorithm and compare it with the best accuracies obtained with the decision trees. 

The best accuracy from a single decision tree is 94.28571428571428, at depth 6 using Gini.<br>
With random forests the accuracy varies between roughly 93 and 96 from run to run (the test split is not seeded), which is a bit better than the decision tree alone.
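
Because the forest's accuracy changes from run to run, a comparison on a single split is noisy. One way to make it steadier is to score both models with cross-validation. The sketch below does this with scikit-learn's own `RandomForestClassifier` on synthetic data; the data, the number of trees and the fold count are assumptions chosen for illustration, so its numbers are not results for the WBCD.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; swap in X and Y from the notebook to compare on the WBCD.
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", max_depth=6,
                              min_samples_leaf=5, random_state=0)
forest = RandomForestClassifier(n_estimators=25, max_depth=3, min_samples_leaf=5,
                                max_features=3, random_state=0)

# 5-fold cross-validation gives five accuracy values per model.
tree_scores = cross_val_score(tree, X_demo, y_demo, cv=5)
forest_scores = cross_val_score(forest, X_demo, y_demo, cv=5)

print("tree:   mean %.3f, std %.3f" % (tree_scores.mean(), tree_scores.std()))
print("forest: mean %.3f, std %.3f" % (forest_scores.mean(), forest_scores.std()))
```

Reporting the mean together with the spread makes it easier to judge whether a gap of one or two percentage points is meaningful.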


5. Submit your solution as a separate PDF in the final zip file of your submission.


Compute a decision tree with the goal to predict the food review based on its smell, taste and portion size.

(a) Compute the entropy of each rule in the first stage.

(b) Show the final decision tree. Clearly draw it.

Submit a handwritten response. Clearly show all the steps.
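
The hand calculation for part (a) can be checked with a short script. The exercise's data table is not reproduced here, so the rows below are hypothetical placeholders; the attribute names follow the question, while the values, the `review` column name and the helper functions are assumptions. For each attribute, the sketch computes the weighted entropy of the review labels after splitting on it; the attribute with the lowest value would become the root.

```python
import math
from collections import Counter, defaultdict

def label_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(rows, attribute, target="review"):
    """Weighted average entropy of the target after splitting on one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    n = len(rows)
    return sum(len(g) / n * label_entropy(g) for g in groups.values())

# Hypothetical rows; replace them with the table from the exercise.
rows = [
    {"smell": "good", "taste": "good", "portion": "large", "review": "positive"},
    {"smell": "good", "taste": "bad",  "portion": "small", "review": "negative"},
    {"smell": "bad",  "taste": "good", "portion": "large", "review": "positive"},
    {"smell": "bad",  "taste": "bad",  "portion": "large", "review": "negative"},
]

first_stage = {a: split_entropy(rows, a) for a in ("smell", "taste", "portion")}
best = min(first_stage, key=first_stage.get)
print(first_stage, best)
```

Repeating the same computation on each child node, using only the rows that reach it, yields the remaining levels of the tree for part (b).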

