# 1. Load data sets
In this block we load three datasets and check whether they need preprocessing. The datasets are "Website Phishing" (W-P), "Breast Cancer Prediction" (BCP) and "Arrhythmia" (AR). The planned preprocessing step is to fill in any missing values with the mean value of the corresponding feature. Pandas is used both to load the datasets and to do the preprocessing.

In [57]:
#Importing pandas
import pandas as pd

#Importing the datasets
wp = pd.read_csv('website-phishing.csv')
bcp = pd.read_csv('bcp.csv')
ar = pd.read_csv('arrhythmia.csv')

#Checking for missing values
print("Missing values in W-P:", wp.isnull().sum().sum())
print("Missing values in BCP:", bcp.isnull().sum().sum())
print("Missing values in AR:", ar.isnull().sum().sum())

Missing values in W-P: 0
Missing values in BCP: 0
Missing values in AR: 0
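
None of the three datasets needs it, but for reference the mean imputation described above could look like the following minimal sketch. The toy DataFrame and its column names are made up for illustration and are not part of the assignment data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values (not taken from the real datasets)
toy = pd.DataFrame({
    "height": [170.0, np.nan, 180.0],
    "weight": [60.0, 80.0, np.nan],
})

# Replace every missing value with the mean of its own column
toy_filled = toy.fillna(toy.mean(numeric_only=True))

print(toy_filled)
```

Here the missing height becomes 175.0 and the missing weight becomes 70.0, the means of the observed values in each column.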


Since there are no missing values in the datasets, no further preprocessing is needed. I will now print a preview of each dataset to get an overview of the features and samples.

In [58]:
print("Website Phishing")
wp.head()

Website Phishing


Unnamed: 0,having_IP_Address,URL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,Favicon,...,popUpWidnow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Class
0,-1,1,1,1,-1,-1,-1,-1,-1,1,...,1,1,-1,-1,-1,-1,1,1,-1,-1
1,1,1,1,1,1,-1,0,1,-1,1,...,1,1,-1,-1,0,-1,1,1,1,-1
2,1,0,1,1,1,-1,-1,-1,-1,1,...,1,1,1,-1,1,-1,1,0,-1,-1
3,1,0,1,1,1,-1,-1,-1,1,1,...,1,1,-1,-1,1,-1,1,-1,1,-1
4,1,0,-1,1,1,-1,1,1,-1,1,...,-1,1,-1,-1,0,-1,1,1,1,1


In [59]:
print("Breast Cancer Prediction")
bcp.head()

Breast Cancer Prediction


Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


In [60]:
print("Arrhythmia")
ar.head()

Arrhythmia


Unnamed: 0,age,sex,height,weight,QRSduration,PRinterval,Q-Tinterval,Tinterval,Pinterval,QRS,...,chV6_QwaveAmp,chV6_RwaveAmp,chV6_SwaveAmp,chV6_RPwaveAmp,chV6_SPwaveAmp,chV6_PwaveAmp,chV6_TwaveAmp,chV6_QRSA,chV6_QRSTA,class
0,75,0,190,80,91,193,371,174,121,-16,...,0.0,9.0,-0.9,0.0,0.0,0.9,2.9,23.3,49.4,8
1,56,1,165,64,81,174,401,149,39,25,...,0.0,8.5,0.0,0.0,0.0,0.2,2.1,20.4,38.8,6
2,54,0,172,95,138,163,386,185,102,96,...,0.0,9.5,-2.4,0.0,0.0,0.3,3.4,12.3,49.0,10
3,55,0,175,94,100,202,380,179,143,28,...,0.0,12.2,-2.2,0.0,0.0,0.4,2.6,34.6,61.6,1
4,75,0,190,80,88,181,360,177,103,-16,...,0.0,13.1,-3.6,0.0,0.0,-0.1,3.9,25.4,62.8,7


# 2. Implementation of classifiers


### Impurity
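To compare the impurity before and after a split, I need a function for calculating entropy. Entropy measures the disorder (randomness) of the class labels in a set: it is 0 when all samples belong to one class and highest when the classes are equally frequent. With $p_i$ the proportion of samples in class $i$ and $c$ the number of classes, the formula for entropy is: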
$$
E(S) = \sum_{i=1}^{c} - p_i \log_2(p_i)
$$

In [61]:
import numpy as np

def get_entropy(y):
    #Proportion of samples in each class
    p_i = y.value_counts()/y.shape[0]
    entropy = np.sum(-p_i*np.log2(p_i+1e-9))
    return entropy
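
As a quick sanity check of the formula, the sketch below recomputes entropy using only the standard library. The helper name `entropy_of` and the toy label lists are illustrative assumptions, not part of the notebook.

```python
import math
from collections import Counter

def entropy_of(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    # Only classes that actually occur are counted, so log2 never sees 0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy_of([1, 1, -1, -1]))  # two equally frequent classes: 1 bit
print(entropy_of([1, 1, 1, 1]))    # a pure set: no disorder
```

Two equally frequent classes give the maximum of 1 bit for a binary target, while a pure set gives 0.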

### Information gain
To evaluate a split, the entropy is calculated before and after the split. Comparing these entropy values gives the information gained from the split, which is used to find the best split. With $s$ ranging over the child nodes produced by the split, the formula for information gain based on entropy is:
$$
InformationGain= Entropy(y) - \sum_{s} \frac{|s|}{|y|}Entropy(s)
$$

In [62]:
#y is the target variable, while mask is a boolean Series that splits the data
def get_inf_gain(y, mask, func=get_entropy):
    #Count the samples on each side of the split
    left=sum(mask) #samples in left node
    right=mask.shape[0]-left #samples in right node
    
    #Calculate information gain based on entropy (~mask selects the right node)
    inf_gain=func(y)-left/(left+right)*func(y[mask])-right/(left+right)*func(y[~mask])
    return inf_gain
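
To make the formula concrete, here is a small self-contained sketch that evaluates two splits of the same four labels. The function names `entropy_of` and `information_gain` and the toy data are assumptions for illustration only.

```python
import math
from collections import Counter

def entropy_of(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, mask):
    """Entropy before the split minus the size-weighted entropy of both sides."""
    left = [lab for lab, m in zip(labels, mask) if m]
    right = [lab for lab, m in zip(labels, mask) if not m]
    n = len(labels)
    # Empty sides carry zero weight, so they are skipped
    after = sum(len(side) / n * entropy_of(side) for side in (left, right) if side)
    return entropy_of(labels) - after

labels = [1, 1, -1, -1]
print(information_gain(labels, [True, True, False, False]))  # perfect split
print(information_gain(labels, [True, False, True, False]))  # uninformative split
```

The first mask separates the classes completely and recovers the full 1 bit of entropy; the second leaves both sides as mixed as the parent and gains nothing.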

### Choose split with highest information gain
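This section defines a function that lists all candidate splits for a feature. Each candidate is evaluated, and the split that gives the highest information gain is chosen. For a numerical feature the candidates are thresholds at the observed values; for a categorical feature they are the non-trivial subsets of its categories.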

In [68]:
import itertools

#Finds all the possible combinations of category values
def all_combos(pred_val):
    pred_val=pred_val.unique()
    
    combos=[]
    for i in range(0, len(pred_val)+1):
        for combo in itertools.combinations(pred_val, i):
            combo = list(combo)
            combos.append(combo)        
    return combos[1:-1] #Not including the combinations that have all the data in one node

#Takes in a single predictor variable (a Series) and the target variable
#Returns: best information gain, best split, whether the variable is numeric
def max_inf_gain_split(x, y):
    split_value=[]
    inf_gains=[]
    
    #Check that x is numerical
    numeric_variable = x.dtype != 'O'
    
    if numeric_variable:
        options = x.sort_values().unique()[1:]
    else: 
        options = all_combos(x)
        
    #Calculate information gain for each value
    for val in options:
        mask = x < val if numeric_variable else x.isin(val)
        val_inf_gain = get_inf_gain(y, mask)
        inf_gains.append(val_inf_gain)
        split_value.append(val)
        
    #Check if there is at least one candidate split
    if len(inf_gains) == 0:
        return (None, None, None)
    else:
        best_inf_gain = max(inf_gains)
        best_inf_gain_index = inf_gains.index(best_inf_gain)
        best_split = split_value[best_inf_gain_index]
        return(best_inf_gain,best_split,numeric_variable)
        
#Testing the split search on one feature at a time
#(passing the whole DataFrame raises "ValueError: The truth value of a Series is ambiguous")
X1 = wp.iloc[:, :-1]  # Extract all columns except the last one as features
y1 = wp.iloc[:, -1]   # Extract the last column as the target variable
max_inf_gain_split(X1['SSLfinal_State'], y1)
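
For a categorical feature with k distinct values, enumerating every subset except the empty and the full one gives 2^k - 2 candidates, and each two-way split appears twice (once as a subset, once as its complement). The sketch below illustrates this with a hypothetical helper `nontrivial_subsets` and made-up category names.

```python
import itertools

def nontrivial_subsets(values):
    """All subsets of `values` except the empty set and the full set."""
    values = list(values)
    subsets = []
    # Sizes 1 .. k-1 exclude the empty set (size 0) and the full set (size k)
    for size in range(1, len(values)):
        subsets.extend(list(c) for c in itertools.combinations(values, size))
    return subsets

subsets = nontrivial_subsets(["a", "b", "c"])
print(subsets)
print(len(subsets))  # 2**3 - 2 = 6
```

Because the count grows exponentially in k, this exhaustive search is only practical for features with few categories.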

## Decision stump from ChatGPT

In [64]:
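import numpy as np


class DecisionStump:
    def __init__(self):
        self.feature_index = None
        self.threshold = None
        self.left_prediction = None
        self.right_prediction = None

    def fit(self, X, y):
        # Find the single threshold split with the fewest training misclassifications
        X, y = np.asarray(X), np.asarray(y)
        best_error = float('inf')
        for feature_index in range(X.shape[1]):
            column = X[:, feature_index]
            # Skip the largest value: splitting there would leave the right side empty
            for threshold in np.unique(column)[:-1]:
                left = column <= threshold
                # Each side predicts its own majority class
                l_vals, l_counts = np.unique(y[left], return_counts=True)
                r_vals, r_counts = np.unique(y[~left], return_counts=True)
                error = (l_counts.sum() - l_counts.max()) + (r_counts.sum() - r_counts.max())
                if error < best_error:
                    best_error = error
                    self.feature_index = feature_index
                    self.threshold = threshold
                    self.left_prediction = l_vals[l_counts.argmax()]
                    self.right_prediction = r_vals[r_counts.argmax()]

    def predict(self, X):
        # Assumes fit() found at least one valid split
        column = np.asarray(X)[:, self.feature_index]
        return np.where(column <= self.threshold, self.left_prediction, self.right_prediction)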


## Unpruned tree from ChatGPT

In [65]:
class DecisionTree:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth

    def fit(self, X, y):
        # Work on NumPy arrays so that X[:, i] indexing also works for DataFrames
        self.root = self._grow_tree(np.asarray(X), np.asarray(y))

    def _majority_class(self, y):
        # Most frequent label; unlike np.bincount this also handles negative labels
        values, counts = np.unique(y, return_counts=True)
        return values[counts.argmax()]

    def _grow_tree(self, X, y, depth=0):
        n_samples, n_features = X.shape
        n_labels = len(np.unique(y))

        # Stopping criteria
        if (self.max_depth is not None and depth >= self.max_depth) or n_labels == 1 or n_samples < 2:
            return self._majority_class(y)

        # Find the best split
        best_gini = float('inf')
        best_feature = None
        best_threshold = None
        for feature_index in range(n_features):
            # Skip the largest value: splitting there would leave the right node empty
            thresholds = np.unique(X[:, feature_index])[:-1]
            for threshold in thresholds:
                y_left = y[X[:, feature_index] <= threshold]
                y_right = y[X[:, feature_index] > threshold]
                gini = (len(y_left) / n_samples) * self._gini_impurity(y_left) + \
                       (len(y_right) / n_samples) * self._gini_impurity(y_right)
                if gini < best_gini:
                    best_gini = gini
                    best_feature = feature_index
                    best_threshold = threshold

        # No feature can separate the samples, so make a leaf
        if best_feature is None:
            return self._majority_class(y)

        # Create sub-trees
        left_indices = X[:, best_feature] <= best_threshold
        left_tree = self._grow_tree(X[left_indices], y[left_indices], depth + 1)
        right_tree = self._grow_tree(X[~left_indices], y[~left_indices], depth + 1)

        return best_feature, best_threshold, left_tree, right_tree

    def _gini_impurity(self, y):
        _, counts = np.unique(y, return_counts=True)
        probabilities = counts / len(y)
        return 1 - np.sum(probabilities ** 2)

    def predict(self, X):
        return np.array([self._predict(sample, self.root) for sample in np.asarray(X)])

    def _predict(self, sample, node):
        # Internal nodes are tuples; anything else is a leaf label
        if not isinstance(node, tuple):
            return node
        feature_index, threshold, left_tree, right_tree = node
        if sample[feature_index] <= threshold:
            return self._predict(sample, left_tree)
        else:
            return self._predict(sample, right_tree)
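
Unlike the entropy-based functions earlier, the tree above scores each split by its size-weighted Gini impurity, $G = 1 - \sum_i p_i^2$, and keeps the split with the lowest value. The following standard-library sketch shows the two extremes; the names `gini_of` and `weighted_gini` and the toy labels are assumptions for illustration.

```python
from collections import Counter

def gini_of(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(left, right):
    """Size-weighted Gini impurity of the two sides of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini_of(left) + len(right) / n * gini_of(right)

print(gini_of([2, 2, 4, 4]))          # evenly mixed two-class node
print(weighted_gini([2, 2], [4, 4]))  # perfect split of that node
```

An evenly mixed binary node has impurity 0.5, and a split that separates the classes completely brings the weighted impurity down to 0.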


## Pruned decision tree from ChatGPT

In [66]:
class PrunedDecisionTree(DecisionTree):
    def __init__(self, max_depth=None, validation_data=None):
        super().__init__(max_depth=max_depth)
        self.validation_data = validation_data

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.root = self._grow_tree(X, y)
        if self.validation_data is not None:
            X_val, y_val = self.validation_data
            self.root = self._prune_node(self.root, X, y, np.asarray(X_val), np.asarray(y_val))

    def _prune_node(self, node, X, y, X_val, y_val):
        # Reduced-error pruning on the tuple-based tree built by _grow_tree
        # Leaves cannot be pruned; without validation samples there is no evidence either way
        if not isinstance(node, tuple) or len(y_val) == 0:
            return node

        # Prune the children first (bottom-up), routing the data down the tree
        feature_index, threshold, left_tree, right_tree = node
        left = X[:, feature_index] <= threshold
        left_val = X_val[:, feature_index] <= threshold
        left_tree = self._prune_node(left_tree, X[left], y[left], X_val[left_val], y_val[left_val])
        right_tree = self._prune_node(right_tree, X[~left], y[~left], X_val[~left_val], y_val[~left_val])
        node = (feature_index, threshold, left_tree, right_tree)

        # Candidate leaf: majority class of the training samples at this node
        values, counts = np.unique(y, return_counts=True)
        leaf = values[counts.argmax()]

        # Replace the subtree by the leaf if that does not hurt validation accuracy
        subtree_pred = np.array([self._predict(sample, node) for sample in X_val])
        if np.sum(y_val == leaf) >= np.sum(subtree_pred == y_val):
            return leaf
        return node


In [67]:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the datasets into features (X) and target variable (y)
# The class label is the last column in each dataset
X1 = wp.iloc[:, :-1]  # Extract all columns except the last one as features
y1 = wp.iloc[:, -1]   # Extract the last column as the target variable
X2 = bcp.iloc[:, :-1]  # Extract all columns except the last one as features
y2 = bcp.iloc[:, -1]   # Extract the last column as the target variable
X3 = ar.iloc[:, :-1]  # Extract all columns except the last one as features
y3 = ar.iloc[:, -1]   # Extract the last column as the target variable


# Split each dataset into training and testing sets
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2, random_state=42)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=42)
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size=0.2, random_state=42)


# Initialize decision tree models
decision_stump1 = DecisionStump()
decision_stump2 = DecisionStump()
decision_stump3 = DecisionStump()
decision_tree1 = DecisionTree(max_depth=None)
decision_tree2 = DecisionTree(max_depth=None)
decision_tree3 = DecisionTree(max_depth=None)
# Note: the test sets double as validation data for pruning here, so the pruned
# trees' test accuracies are optimistic; a separate validation split would avoid this
pruned_decision_tree1 = PrunedDecisionTree(max_depth=None, validation_data=(X1_test, y1_test))
pruned_decision_tree2 = PrunedDecisionTree(max_depth=None, validation_data=(X2_test, y2_test))
pruned_decision_tree3 = PrunedDecisionTree(max_depth=None, validation_data=(X3_test, y3_test))

# Train the models
decision_stump1.fit(X1_train, y1_train)
decision_stump2.fit(X2_train, y2_train)
decision_stump3.fit(X3_train, y3_train)
decision_tree1.fit(X1_train, y1_train)
decision_tree2.fit(X2_train, y2_train)
decision_tree3.fit(X3_train, y3_train)
pruned_decision_tree1.fit(X1_train, y1_train)
pruned_decision_tree2.fit(X2_train, y2_train)
pruned_decision_tree3.fit(X3_train, y3_train)


# Evaluate the models on the test sets
decision_stump1_accuracy = accuracy_score(y1_test, decision_stump1.predict(X1_test))
decision_stump2_accuracy = accuracy_score(y2_test, decision_stump2.predict(X2_test))
decision_stump3_accuracy = accuracy_score(y3_test, decision_stump3.predict(X3_test))
decision_tree1_accuracy = accuracy_score(y1_test, decision_tree1.predict(X1_test))
decision_tree2_accuracy = accuracy_score(y2_test, decision_tree2.predict(X2_test))
decision_tree3_accuracy = accuracy_score(y3_test, decision_tree3.predict(X3_test))
pruned_decision_tree1_accuracy = accuracy_score(y1_test, pruned_decision_tree1.predict(X1_test))
pruned_decision_tree2_accuracy = accuracy_score(y2_test, pruned_decision_tree2.predict(X2_test))
pruned_decision_tree3_accuracy = accuracy_score(y3_test, pruned_decision_tree3.predict(X3_test))


# Print the results
print("Decision Stump 1 Accuracy:", decision_stump1_accuracy)
print("Decision Stump 2 Accuracy:", decision_stump2_accuracy)
print("Decision Stump 3 Accuracy:", decision_stump3_accuracy)
print("Decision Tree 1 Accuracy:", decision_tree1_accuracy)
print("Decision Tree 2 Accuracy:", decision_tree2_accuracy)
print("Decision Tree 3 Accuracy:", decision_tree3_accuracy)
print("Pruned Decision Tree 1 Accuracy:", pruned_decision_tree1_accuracy)
print("Pruned Decision Tree 2 Accuracy:", pruned_decision_tree2_accuracy)
print("Pruned Decision Tree 3 Accuracy:", pruned_decision_tree3_accuracy)
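
The pruned trees in this notebook receive the test set as `validation_data`, which lets the test data influence the final model. One common remedy, sketched here under the assumption that a 60/20/20 split is acceptable, is to carve a separate validation set out of the training data by calling `train_test_split` twice. The random toy arrays are placeholders, not the assignment data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples, 3 features, binary labels
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# First hold out 20% of the data as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Then take 25% of the remainder (20% of the total) for validation
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```

With such a split, the pruning step would see only `(X_val, y_val)`, and the test set would stay untouched until the final accuracy is computed.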