# Decision Trees

#### Instructions:
- Write modular code with relevant docstrings and comments for you to be able to use
functions you have implemented in future assignments.
- All theory questions and observations must be written in a markdown cell of your jupyter notebook.You can alsoadd necessary images in `imgs/` and then include it in markdown. Any other submission method for theoretical question won't be entertained.
- Start the assignment early, push your code regularly and enjoy learning!

### Question 1 Optimal DT from table
**[20 points]**\
We will use the dataset below to learn a decision tree which predicts if people pass machine
learning (Yes or No), based on their previous GPA (High, Medium, or Low) and whether or
not they studied. 

| GPA | Studied | Passed |
|:---:|:-------:|:------:|
|  L  |    F    |    F   |
|  L  |    T    |    T   |
|  M  |    F    |    F   |
|  M  |    T    |    T   |
|  H  |    F    |    T   |
|  H  |    T    |    T   |
    
 For this problem, you can write your answers using $log_2$
, but it may be helpful to note
that $log_2 3 ≈ 1.6$.

---
1. What is the entropy H(Passed)?
    <br>
    <br>
    $H(Passed) = \sum -p(x) \log(p(x))\\$
    $= -p_{not pass}.\log(p_{not pass}) -p_{pass}.\log(p_{pass})\\$
    $= -\frac{1}{3}.\log(\frac{1}{3}) - \frac{2}{3}.\log(\frac{2}{3})\\$
    $= \log(3) - \frac{2}{3}\\$
    $= 0.918$
    <br>
    <br>
2. What is the entropy H(Passed | GPA)?
    <br>
    <br>
    $H(passed \vert GPA) = \sum_{x \in GPA}p(x).H(passed \vert GPA=x)\\$
    $H(passed \vert GPA) = p(L).H(passed \vert GPA=L) + p(M).H(passed \vert GPA=M) + p(H).H(passed \vert GPA=H)\\$
    $H(passed \vert GPA) =  \frac{2}{6}.(-\frac{1}{2}.\log(\frac{1}{2}) - \frac{1}{2}.\log(\frac{1}{2})) + \frac{2}{6}.(-\frac{1}{2}.\log(\frac{1}{2}) -\frac{1}{2}.\log(\frac{1}{2})) +\frac{2}{6}.(-1.\log(1) - 0.\log(0)))\\$
    $H(passed \vert GPA) =  \frac{1}{3}.(\log(2)) + \frac{1}{3}.(\log(2)) +\frac{1}{3}.(0)\\$
    $H(passed \vert GPA) =  \frac{1}{3}.(1) + \frac{1}{3}.(1) +\frac{1}{3}.(0)\\$
    $H(passed \vert GPA) =  \frac{2}{3}\\$
    $= 0.67\\$
    <br>
    <br>
3. What is the entropy H(Passed | Studied)?
    <br>
    <br>
    $H(passed \vert studied) = \sum_{x \in studied}p(x).H(passed \vert studied=x)\\$
    $H(passed \vert studied) = p(True).H(passed \vert studied=True) + p(False).H(passed \vert studied=False)\\$
    $H(passed \vert studied) = \frac{3}{6}.H(passed \vert studied=True) + \frac{3}{6}.H(passed \vert studied=False)\\$
    $H(passed \vert studied) = \frac{1}{2}.( - 1.\log(1) - 0.\log{0}) + \frac{1}{2}.( - \frac{1}{3}.\log(\frac{1}{3}) - \frac{2}{3}.\log(\frac{2}{3}))\\$
    $H(passed \vert studied) = \frac{1}{2}.(0) + \frac{1}{2}.(0.918)\\$
    $H(passed \vert studied) = \frac{1}{2}.(\log(3) - \frac{2}{3})\\$
    $= 0.459 \\$
    <br>
    <br>

4. Draw the full decision tree that would be learned for this dataset. You do not need to show any calculations.
    <br>
    <br>
    ![decision  tree](./imgs/q1.1.png)
---


### Question 2 DT loss functions
**[10 points]**
1. Explain Gini impurity and Entropy. 
    <br>
    <br>
    Both of them are measures used in building decision trees. A good question, is the one which causes split, which causes the least amount of variance in all the parts. Both the Entropy and Gini Impurity are a measure to evaluate the amount of imuprity, or inhomogenity, in a set of item (part). The formulas of Entropy and Gini Impurity are as follows. 
    <br>
    $H_{entropy} = \sum_i -p_i.\log(p_i)\\$
    $H_{gini} = 1 - \sum_i (p_i)^2\\$
    Both of them are 0, only when the split contains instances (or rather samples) from a single class.
    <br> 
    <br>The range of Entropy - [0, 1] <br> The range of Gini Impurity - [0, 0.5]<br>
2. What are the min and max values for both Gini impurity and Entropy
    | Impurity | Min | Max
    |:--:|:--:|:--:|
    |Entropy| 0 | 1 |
    |Gini | 0 | 0.5 |
    
    <br>
3. Plot the Gini impurity and Entropy for $p\in[0,1]$.
    <br>
    <br>
    ![](./imgs/q2.1.png)
5. Multiply Gini impurity by a factor of 2 and overlay it over entropy.
    <br>
    <br>
    ![](./imgs/q2.2.png)

In [405]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split, KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

### Question 3 Training a Decision Tree  
**[40 points]**

You can download the spam dataset from the link given below. This dataset contains feature vectors and the lables of Spam/Non-Spam mails. 
http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data

**NOTE: The last column in each row represents whether the mail is spam or non spam**\
Although not needed, incase you want to know what the individual columns in the feature vector means, you can read it in the documentation given below.
http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.DOCUMENTATION

**Download the data and load it from the code given below**

In [406]:
#######################
# Your code goes here #
#######################
data = pd.read_csv('./spambase.data', header=None)
labels = np.array((data[len(data.columns)-1].values.tolist()))
del data[len(data.columns)-1]
data = np.array(data.values.tolist())
X, y = shuffle(data, labels, random_state=0)

You can try to normalize each column (feature) separately with wither one of the following ideas. **Do not normalize labels**.
- Shift-and-scale normalization: substract the minimum, then divide by new maximum. Now all values are between 0-1
- Zero mean, unit variance : substract the mean, divide by the appropriate value to get variance=1.

In [407]:
#######################
# Your code goes here #
#######################
def norm1(X):
    temp = X.copy()
    min = np.amin(temp, axis = 0)
    temp -= min
    max = np.max(temp, axis = 0)
    temp /= max
    return temp
def norm2(X):
    temp = X.copy()
    temp = (temp - np.mean(temp, axis = 0))/np.std(temp, axis = 0)
    return temp


norm1_x = norm1(X)
norm2_x = norm2(X)


1. Split your data into train 80% and test dataset 20% 
2. **[BONUS]** Visualize the data using PCA . You can reduce the dimension of the data if you want. Bonus marks if this increases your accuracy.

*NOTE: If you are applying PCA or any other type of dimensionality reduction, do it before splitting the dataset*

In [408]:
#######################
# Your code goes here #
#######################
test_precentage = 20
def split(X, y, test_precentage):
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=test_precentage/100, random_state=0)
    return x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = split(norm1_x, y, test_precentage)

You need to perform a K fold validation on this and report the average training error over all the k validations. 
- For this , you need to split the training data into k splits.
- For each split, train a decision tree model and report the training , validation and test scores.
- Report the scores in a tabular form for each validation

In [409]:
def k_fold_validation(k, x, y):
    '''
    the function each of the k splits, val 
    and train sets data and labels
    '''
    k_folds = KFold(n_splits=k)
    splits = []
    for train_index, val_index in k_folds.split(x, y):
        splits += [{"x_train":x[train_index], "x_val":x[val_index], "y_train":y[train_index], "y_val":y[val_index]}]
    return splits

def k_fold_performance(splits):
    '''
    This function returns the models
    (decision trees) and their corresponding accuracies
    when the dataset is given in the format of the return
    object of the above function
    '''
    performance = dict()
    performance['Training Set Accuracy'] = []
    performance['Validation Set Accuracy'] = []
    performance['Testing Set Accuracy'] = []
    trees = []
    for i in range(len(splits)):
        split = splits[i]
        train_set = split['x_train']
        val_set = split['x_val']
        val_y = split['y_val']
        train_y = split['y_train']
        clf = DecisionTreeClassifier()
        clf.fit(train_set, train_y)
        trees += [clf]
        performance['Training Set Accuracy'] += [clf.score(train_set, train_y)*100]
        performance['Validation Set Accuracy'] += [clf.score(val_set, val_y)*100]
        performance['Testing Set Accuracy'] += [clf.score(x_test, y_test)*100]
    return performance, trees

# Initialize K and split the data

k = 15

# Splitting and evaluating

performance, trees = k_fold_performance(k_fold_validation(k, x_train, y_train))
print(f"Average Training Accuracy, over {k} validations: {sum(performance['Training Set Accuracy'])/len(performance['Training Set Accuracy'])}%")

# This contains the tree models, and their respective performances

df = pd.DataFrame(performance)
df.index.names = ['Validation']
df.columns.names = ['Metrics']
df

#Run the K fold Validation and report the scores

#######################
# Your code goes here #
#######################


Average Training Accuracy, over 15 validations: 99.92236043537565%


Metrics,Training Set Accuracy,Validation Set Accuracy,Testing Set Accuracy
Validation,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,99.941759,90.243902,91.7481
1,99.941759,90.650407,91.639522
2,99.912638,92.682927,90.553746
3,99.912638,90.650407,90.662324
4,99.912638,90.650407,90.119435
5,99.912664,95.102041,90.770901
6,99.912664,93.061224,90.770901
7,99.912664,91.428571,90.662324
8,99.970888,89.387755,91.530945
9,99.912664,91.428571,92.073833


### Question 4 Random Forest Algorithm
**[30 points]**

1. What is boosting, bagging and  stacking?
Which class does random forests belong to and why? **[5 points]**

- Boosting
  <br>
  <br>
  - Boosting is an ensemble technique which is used to boost the performance of weak classfiers into a strong classifier. 
  - Procedure (Adaboost)
    - The models are sequentially learnt. The first model, is fit to the whole data. (All samples are equally likely)
    - Then the misclassified examples' weights are increased.
    - This gives the misclassified examples, a higher chance of occuring in the next training set. 
  <br>
  <br>
- Bagging
  <br>
  <br>
  - Bagging is an ensemble technique, for improving unstable classification models. Usually applied to decision trees, but also can be applied to naive-bayes, KNN, etc.
  - Procedure
    1. Multiple Versions of the training set are created by drawing N random samples, where N is the size of the original training set. 
    2. Each of these sets are used to train different models. 
    3. The outputs of the model (for testing set) are aggregated by majority vote in the case of classification and mean in the case of regression. 
  <br>
  <br>
- Stacking
  <br>
  <br>
  - In boosting and bagging, we use the same kind of models and exclusively those models are trained on the different versions and combined. 
  - Whereas, in stacking the individual models (reffered to as components), are different. (ex. ANNs, Decision Trees etc.)
  - Procedure
    - Lv - 0: **Base Learners**
      - The original dataset is fit to each of the component. (Reffered to as Sub-Models sometimes)
      - The outputs of each of these models are related to their respective components' training data.
    - Lv - 1: **Stacking Model Learner**
      - The outputs of the previous are compiled into a new dataset, and is fit to a new model. (Reffered to Aggregator Model sometimes)
      - During the testing phase, we feed the testing data into Lv-1 and then Lv-2.
      - The output of Lv-2 is going to give us the final prediction value for a training instance. 
  <br>
  <br>
- Random Forests belongs to the class of **BAGGING**.

2. Implement random forest algorithm using different decision trees. **[25 points]** 

In [410]:
def ensemble_components(components, x_train, y_train, n_prime):
    ensemble = []
    for _ in range(components):
        clf = DecisionTreeClassifier(max_features= 5)
        # Selecting features
        rand_idx = np.random.randint(0, x_train.shape[0], n_prime)
        train_labels = y_train[rand_idx]
        train_set = x_train[rand_idx]
        # bagging 
        clf.fit(train_set, train_labels)
        ensemble += [clf]
    return ensemble

def random_forest_algorithm(number_of_trees, x_train, y_train, n_prime): # Pass necessary params as per requirements
    '''
    This function intends to return the indivdual trees,
    which are going to be used for predicting the ensemble
    '''
    ensemble = ensemble_components(number_of_trees, x_train, y_train, n_prime)
    return ensemble


def test(ensemble, test_set, test_labels):
    preds = [i.predict(test_set) for i in ensemble]
    preds = np.array(preds)
    preds = np.mean(preds, axis = 0)
    for i in range(len(preds)):
        if(preds[i] >= 0.5):
            preds[i] = 1
        else:
            preds[i] = 0
    preds = preds.astype(int)
    score = accuracy_score(test_labels, preds)
    return preds, score

ensembles = random_forest_algorithm(100, x_train, y_train,( (len(x_train)*9)//10))
test_preds, score = test(ensembles, x_test, y_test)
print(f"Accuracy on the testing Set (Random Forests): {score*100}%")
print("Confusion Matrix")
pd.DataFrame(confusion_matrix(y_test, test_preds))




#######################
# Your code goes here #
#######################

Accuracy on the testing Set (Random Forests): 95.33116178067318%
Confusion Matrix


Unnamed: 0,0,1
0,530,19
1,24,348


### Bonus Section
- As per the documentation, it is provided that ...
  - The 39th Index is the Word Frequency of ```direct``` in the mail text.
  - The 49th Index is the Word Frequency of ```cs``` in the mail text.
  - The 51st Index is the Char Frequency of ```!``` in the mail text. 
- I hypothesise that they are redundant features, which are also seen in non-spam mails, along with the spam ones. 
- So, removing them may cause, a better classification. 

In [411]:
data = pd.read_csv('./spambase.data', header=None)
labels = np.array((data[len(data.columns)-1].values.tolist()))
del data[len(data.columns)-1]
data = np.array(data.values.tolist())
X, y = shuffle(data, labels, random_state=0)

X = np.delete(X, [39, 40, 51], axis = 1)

norm1_x = norm1(X)
norm2_x = norm2(X)


test_precentage = 20
def split(X, y, test_precentage):
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=test_precentage/100, random_state=0)
    return x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = split(norm1_x, y, test_precentage)


temp_1 = []
for _ in range(50):
    ensembles_exp = random_forest_algorithm(100, x_train, y_train,( (len(x_train)*9)//10))
    test_preds, score_exp = test(ensembles_exp, x_test, y_test)
    print(f"Accuracy on the testing Set (Random Forests): {score_exp*100}%")
    temp_1 += [score_exp]
print(f"The Mean Accuracy over 50 trails: {np.mean(temp_1)}")

Accuracy on the testing Set (Random Forests): 95.76547231270358%
Accuracy on the testing Set (Random Forests): 95.87404994571118%
Accuracy on the testing Set (Random Forests): 96.09120521172639%
Accuracy on the testing Set (Random Forests): 95.87404994571118%
Accuracy on the testing Set (Random Forests): 96.19978284473399%
Accuracy on the testing Set (Random Forests): 95.76547231270358%
Accuracy on the testing Set (Random Forests): 95.76547231270358%
Accuracy on the testing Set (Random Forests): 95.87404994571118%
Accuracy on the testing Set (Random Forests): 95.87404994571118%
Accuracy on the testing Set (Random Forests): 95.33116178067318%
Accuracy on the testing Set (Random Forests): 95.43973941368078%
Accuracy on the testing Set (Random Forests): 95.43973941368078%
Accuracy on the testing Set (Random Forests): 95.65689467969598%
Accuracy on the testing Set (Random Forests): 96.09120521172639%
Accuracy on the testing Set (Random Forests): 95.76547231270358%
Accuracy on the testing S

In [412]:
np.max(temp_1)

0.9619978284473398

From the above experiment, we are able to produce, better accuracy in some cases. Consider the above trail where we got, 96% accuracy in one trail of random experiment, and the normal one produced 95.7% at the best among multiple trails.

Hence we have managed to achieve a ~0.3% bump in the accuracy.