#### Section - B

- In this project you will apply the AdaBoost boosting algorithm to implement an ensemble learning approach for solving a (binary) classification problem. 
- The one-dimensional training data set D is given in Table 4.12 on page 352 of the textbook. The base classifier is a simple one-level decision tree (a decision stump), as explained on p. 303 of the textbook.
- Determine the number of boosting rounds and show the result of each round (the probability distribution p_i at each round, the records chosen at each round, the model (tree) obtained at each round, and the ε and α at each round), as well as the result obtained on D with the final ensemble classifier.
- Note that the textbook uses the notation w (weight) for what we called p (probability) in the derivation we did in the lectures. (The textbook has quite a few typos!) Also, do not forget the stopping condition we discussed.
- What is the result of running your ensemble classifier on the following test data? X = 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0


Submit source code and your output (the round-wise results and the result on the test data) following the styles of Figures 4.46, 4.49, and 4.50 of the textbook.
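
As a rough illustration of the per-round quantities requested above (ε, α, and the updated probabilities), the sketch below works through a single round by hand. The stump used here (predict 1 when x > 3.75) is a hypothetical choice made only for this example, and `one_round` is an illustrative helper name; neither is the model or code trained later in this notebook. The update rule is the one used in the training cell further down: divide by 2ε for misclassified records and by 2(1 − ε) for correctly classified ones.

```python
import numpy as np

# Training data as loaded later in this notebook ('+' -> 1, '-' -> 0).
x = np.array([0.5, 3.0, 4.5, 4.6, 4.9, 5.2, 5.3, 5.5, 7.0, 9.5])
y = np.array([0, 0, 1, 1, 1, 0, 0, 1, 0, 0])

def one_round(p, y_true, y_pred):
    """Return (epsilon, alpha, updated probabilities) for one boosting round."""
    miss = y_true != y_pred                         # misclassified records
    epsilon = p[miss].sum()                         # weighted error on D
    alpha = 0.5 * np.log((1 - epsilon) / epsilon)   # model weight
    # Misclassified records gain probability mass, correct ones lose it.
    p_new = np.where(miss, p / (2 * epsilon), p / (2 * (1 - epsilon)))
    return epsilon, alpha, p_new

p0 = np.full(len(x), 1 / len(x))       # uniform initial probabilities
y_pred = (x > 3.75).astype(int)        # hypothetical stump, for illustration only
epsilon, alpha, p1 = one_round(p0, y, y_pred)

print(f"epsilon={epsilon:.2f}, alpha={alpha:.4f}")
print(np.round(p1, 4), "sum =", round(p1.sum(), 6))
```

With this hypothetical stump, four of the ten records are misclassified, so ε = 0.4 and the updated probabilities still sum to 1. After the update, the misclassified records and the correctly classified records each carry half of the total probability mass.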

In [163]:
import pandas as pd
import numpy as np

#### Section - B : 1. Load datasets

In [774]:
training_dataset_df = pd.read_csv("1d_training_dataset.csv")
training_dataset_df = training_dataset_df.T
training_dataset_df.reset_index(inplace=True)
training_dataset_df.columns = ["x", "y"]
training_dataset_df = training_dataset_df[1:]
training_dataset_df.reset_index(inplace=True)
training_dataset_df.drop(columns=['index'],inplace=True)
training_dataset_df

Unnamed: 0,x,y
0,0.5,-
1,3.0,-
2,4.5,+
3,4.6,+
4,4.9,+
5,5.2,-
6,5.3,-
7,5.5,+
8,7.0,-
9,9.5,-


In [775]:
X_test = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
X_test

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

##### Replace class labels in "y": "+" becomes 1 and "-" becomes 0

In [776]:
training_dataset_df['y'] = training_dataset_df['y'].map({'+': 1, "-": 0})
training_dataset_df

Unnamed: 0,x,y
0,0.5,0
1,3.0,0
2,4.5,1
3,4.6,1
4,4.9,1
5,5.2,0
6,5.3,0
7,5.5,1
8,7.0,0
9,9.5,0


In [777]:
training_dataset_df.dtypes

x    object
y     int64
dtype: object

In [778]:
training_dataset_df['x'] = training_dataset_df['x'].astype(float)
training_dataset_df

Unnamed: 0,x,y
0,0.5,0
1,3.0,0
2,4.5,1
3,4.6,1
4,4.9,1
5,5.2,0
6,5.3,0
7,5.5,1
8,7.0,0
9,9.5,0


In [779]:
training_dataset_df.dtypes

x    float64
y      int64
dtype: object

#### Section - B : 2. Task

- Apply the AdaBoost boosting algorithm.
- Use a simple one-level decision tree (a decision stump) as the base classifier, as explained on p. 303 of the textbook.
- Determine the number of boosting rounds and show the result of each round (the probability distribution p_i at each round, the records chosen at each round, the model (tree) obtained at each round, and the ε and α at each round), as well as the result obtained on D with the final ensemble classifier.
- Note that the textbook uses the notation w (weight) for what we called p (probability) in the derivation we did in the lectures. (The textbook has quite a few typos!) Also, do not forget the stopping condition we discussed.
- What is the result of running your ensemble classifier on the following test data? X = 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0


Submit source code and your output (the round-wise results and the result on the test data) following the styles of Figures 4.46, 4.49, and 4.50 of the textbook.

**Algorithm reference from textbook and professor notes:**
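
The reference itself is not reproduced here. As a hedged summary, reconstructed from the comments and code in the training cell below rather than copied from the textbook or the lecture notes, one boosting round $t$ with model $M^{(t)}$ and probabilities $p_i^{(t)}$ can be written as:

$$
\varepsilon^{(t)} = \sum_{i:\, M^{(t)}(x_i) \neq y_i} p_i^{(t)},
\qquad
\alpha^{(t)} = \frac{1}{2} \ln \frac{1 - \varepsilon^{(t)}}{\varepsilon^{(t)}}
$$

$$
p_i^{(t+1)} =
\begin{cases}
\dfrac{p_i^{(t)}}{2\,\varepsilon^{(t)}} & \text{if } M^{(t)}(x_i) \neq y_i \\[2ex]
\dfrac{p_i^{(t)}}{2\,(1 - \varepsilon^{(t)})} & \text{if } M^{(t)}(x_i) = y_i
\end{cases}
$$

This update is algebraically the same as multiplying by $e^{\pm\alpha^{(t)}}$ and dividing by the normalizer $Z^{(t)} = 2\sqrt{\varepsilon^{(t)}(1 - \varepsilon^{(t)})}$, since $e^{\alpha^{(t)}} = \sqrt{(1 - \varepsilon^{(t)})/\varepsilon^{(t)}}$.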


    

In [780]:
from sklearn.tree import DecisionTreeClassifier

In [1160]:
# 20, 90
np.random.seed(seed=40)
def train_adaboost(XTRAIN, YTRAIN):
    
    SAMPLESIZE = len(XTRAIN)
    # setting initial probabilities
    samples_probabilities = np.repeat(1/SAMPLESIZE, SAMPLESIZE)
    
    selected_round_models = list()
    alphas = list()
    #yhats = np.empty((SAMPLESIZE, NROUNDS))
    
    curr_round = 1
    epsilon_threshold = 0.3
    epsilon_t_current = 1.1 # start above the threshold so the loop runs at least once
    
    columns = ["Round"]
    for x_train in XTRAIN:
        columns.append(f"x={x_train[0]}")
    weights_df = pd.DataFrame(columns=columns)
        
    while epsilon_t_current > epsilon_threshold:
        #print("\n\nWorking on round: ", curr_round)

        # a) Create a training dataset D(t) (of the same size n) by sampling
        #    (with replacement) from the distribution defined by p(t)
        bootstrap_sample = np.random.choice(np.arange(len(XTRAIN)),
                                               size=len(XTRAIN),
                                               replace=True,
                                           p=samples_probabilities)
        
        bootstrap_XTRAIN = XTRAIN[bootstrap_sample]
        bootstrap_YTRAIN = YTRAIN[bootstrap_sample]
        
        # b) Create model M(t) using A on D(t) (e.g., if A is a decision tree induction
        #    algorithm, M(t) is a decision tree).
        #    max_depth=1 gives a decision stump.
        # NOTE: sample_weight is applied by row position, so these weights are not
        # matched to the original indices of the bootstrapped rows.
        
        base_classifier_t = DecisionTreeClassifier(max_depth=1)
        base_classifier_t.fit(bootstrap_XTRAIN, bootstrap_YTRAIN, sample_weight=samples_probabilities)
        
        yhat_t = base_classifier_t.predict(XTRAIN)
        # c) Calculate the error of M(t) on D (note: not on D(t))
        epsilon_t = 0
        for index in range(SAMPLESIZE):
            y_i = YTRAIN[index][0]
            yhat_t_i = yhat_t[index]
            if y_i!= yhat_t_i:
                # misclassification
                epsilon_t += samples_probabilities[index]
        epsilon_t_current = epsilon_t
        # d) Find model M(t)'s weight: α(t) = (1/2) ln((1 − ε(t)) / ε(t))
        alpha_t = 0.5 * np.log((1-epsilon_t)/epsilon_t)
        
        
        # e) If ε(t) ≥ 0.5:
        #    - Restart the current iteration by setting t = t − 1 and re-initializing
        #      each p_i(t) to 1/n for i = 1, 2, ..., n
        #    - Go to step (a)
        
        if epsilon_t > 0.5:
            print(f"Epsilon is greater than 0.5, hence resetting this round")
            # Do not consider this round completion
            # re-initializing each p(t) to 1/n for i = 1,2,...,n
            samples_probabilities = np.repeat(1/SAMPLESIZE, SAMPLESIZE)
        else:
            # consider this round completion
            print(f"\nBoosting Round {curr_round}")
            
            # update the probabilities of the records in D:
            # For i = 1,...,n and (xi,yi) ∈ D:
            print(f"Training records chosen during boosting round {curr_round}. First row is the original index values from the training dataset")
            print(training_dataset_df.loc[bootstrap_sample].T)
            
            
            weights_df.loc[curr_round-1, 'Round'] = curr_round
            for idx, prob in enumerate(samples_probabilities):
                weights_df.loc[curr_round-1, f"{columns[idx+1]}"] = round(prob,2)
                
            
            curr_round += 1
            
            
            new_sample_probabilities = []
            for index in range(SAMPLESIZE):
                y_i = YTRAIN[index][0]
                yhat_t_i = yhat_t[index]
                if y_i!= yhat_t_i:
                    # misclassification
                    new_probability_for_this_index = samples_probabilities[index]/(2*epsilon_t)
                else:
                    # correct classification
                    new_probability_for_this_index = samples_probabilities[index]/(2*(1-epsilon_t))
                new_sample_probabilities.append(new_probability_for_this_index)
            # update with new probabilities
            samples_probabilities = new_sample_probabilities
            # only append alphas and models here
            alphas.append(alpha_t)
            selected_round_models.append(base_classifier_t)
            
    print("\n**Weights (Probabilities) of training records**\n")
    print(weights_df)
        
    return selected_round_models, alphas
        
def predict_adaboost(base_models, alphas, XTEST):
    
    yhats = np.zeros(len(XTEST))
    
    for index, this_model in enumerate(base_models):
        yhat_this_model = this_model.predict(XTEST)
        yhats += yhat_this_model* alphas[index]
    print("Prediction yhats:")
    print(yhats)
    return np.sign(yhats)
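
One caveat about the vote above: `np.sign` only separates the two classes when the base predictions are encoded as −1/+1. With 0/1 predictions and positive alphas, the weighted sum is never negative, so a single model voting 1 is enough to produce a positive label. The following is a minimal sketch of the vote under a ±1 encoding. The helper name `weighted_vote` and the toy prediction and alpha values are hypothetical and are not taken from this notebook's run.

```python
import numpy as np

def weighted_vote(preds_01, alphas):
    """Combine 0/1 base predictions with a sign-based weighted vote.

    preds_01: array of shape (n_models, n_samples) with entries 0 or 1.
    alphas:   array of shape (n_models,) holding the model weights.
    """
    preds_pm = 2 * np.asarray(preds_01) - 1     # map 0 -> -1 and 1 -> +1
    scores = np.asarray(alphas) @ preds_pm      # alpha-weighted sum per sample
    return (scores > 0).astype(int)             # back to 0/1 labels (ties -> 0)

# Toy values for illustration only.
toy_preds = np.array([[1, 1, 0],
                      [1, 0, 0],
                      [0, 1, 1]])
toy_alphas = np.array([0.9, 0.4, 0.3])

print(weighted_vote(toy_preds, toy_alphas))     # +/-1 vote
print(np.sign(toy_alphas @ toy_preds))          # 0/1 sum passed through np.sign
```

For the third toy sample, the two heavier models vote 0 and only the lightest model votes 1. The ±1 vote therefore returns 0, while the 0/1 sum is 0.3 and `np.sign` returns 1.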

In [1161]:
XTRAIN_numpy = training_dataset_df['x'].to_numpy()
YTRAIN_numpy = training_dataset_df['y'].to_numpy()

In [1162]:
XTRAIN_numpy = XTRAIN_numpy.reshape(-1,1)
XTRAIN_numpy.shape

(10, 1)

In [1163]:
XTRAIN_numpy

array([[0.5],
       [3. ],
       [4.5],
       [4.6],
       [4.9],
       [5.2],
       [5.3],
       [5.5],
       [7. ],
       [9.5]])

In [1164]:
YTRAIN_numpy = YTRAIN_numpy.reshape(-1,1)
YTRAIN_numpy.shape

(10, 1)

In [1165]:
YTRAIN_numpy

array([[0],
       [0],
       [1],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0]])

#### ADABOOST TRAINING

In [1166]:
selected_round_models, alphas = train_adaboost(XTRAIN_numpy, YTRAIN_numpy)


Boosting Round 1
Training records chosen during boosting round 1. First row is the original index values from the training dataset
     4    0    7    2    4    3    5    6    7    6
x  4.9  0.5  5.5  4.5  4.9  4.6  5.2  5.3  5.5  5.3
y  1.0  0.0  1.0  1.0  1.0  1.0  0.0  0.0  1.0  0.0

Boosting Round 2
Training records chosen during boosting round 2. First row is the original index values from the training dataset
     4    7    3    1    1    2    9    8    7    3
x  4.9  5.5  4.6  3.0  3.0  4.5  9.5  7.0  5.5  4.6
y  1.0  1.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  1.0
Epsilon is greater than 0.5, hence resetting this round

Boosting Round 3
Training records chosen during boosting round 3. First row is the original index values from the training dataset
     7    9    6    8    9    0    5    2    7    0
x  5.5  9.5  5.3  7.0  9.5  0.5  5.2  4.5  5.5  0.5
y  1.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  1.0  0.0

Boosting Round 4
Training records chosen during boosting round 4. First row is 

#### ADABOOST PREDICTIONS

In [1167]:
y_predictions = predict_adaboost(selected_round_models, alphas, np.array(X_test).reshape(-1,1))

Prediction yhats:
[0.91495562 0.91495562 0.91495562 0.91495562 0.91495562 0.04765509
 0.         0.         0.         0.        ]


In [1168]:
y_predictions

array([1., 1., 1., 1., 1., 1., 0., 0., 0., 0.])

In [1169]:
# Replace predictions with actual class values
np.where(y_predictions > 0, '+' , '-' )

array(['+', '+', '+', '+', '+', '+', '-', '-', '-', '-'], dtype='<U1')