Datasets Description

The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.

Datasets are available at http://archive.ics.uci.edu/ml/datasets.html. For this homework assignment, download the “glass” and “Tic-Tac-Toe Endgame” datasets from the above link. The “glass” dataset has continuous attributes and the “Tic-Tac-Toe” dataset has categorical attributes.

**Question 1**

Design a C4.5 decision tree classifier to classify each dataset mentioned above. Report the accuracy based on the 10-times-10-fold cross validation approach (20% of training set as the validation set for every experiment). Report the mean accuracy and the variance of the accuracy for each experiment.
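One way to read the 10-times-10-fold protocol is sketched below: reshuffle the sample positions once per repetition, cut them into ten folds, and hold out 20% of each training portion for validation. The names `repeated_k_fold_splits` and `summarize`, and the `seed` parameter, are assumptions made for illustration and are not part of the assignment or the notebook.

```python
import numpy as np


def repeated_k_fold_splits(n_samples, n_folds=10, n_repeats=10,
                           val_fraction=0.2, seed=0):
    """Yield (repeat, fold, train_idx, val_idx, test_idx) position arrays."""
    rng = np.random.default_rng(seed)
    for repeat in range(n_repeats):
        # A fresh permutation per repetition, so folds differ between repeats.
        permutation = rng.permutation(n_samples)
        folds = np.array_split(permutation, n_folds)
        for fold, test_idx in enumerate(folds):
            rest = np.concatenate(
                [f for k, f in enumerate(folds) if k != fold])
            # Hold out a fraction of the remaining rows for validation.
            n_val = int(round(val_fraction * len(rest)))
            val_idx, train_idx = rest[:n_val], rest[n_val:]
            yield repeat, fold, train_idx, val_idx, test_idx


def summarize(accuracies):
    """Return the mean and the population variance of the accuracies."""
    accuracies = np.asarray(accuracies, dtype=float)
    return accuracies.mean(), accuracies.var()
```

The position arrays can be passed to `df.iloc[...]` to build the training, validation, and test frames for each of the 100 runs.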

In [61]:
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
import random
from pprint import pprint

In [62]:
# from sklearn.model_selection import train_test_split # Import train_test_split function
# from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation


# data=np.array(df)
# X = data[:,0:9]
# y = data[:,10]


# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test

# clf = DecisionTreeClassifier()
# clf.fit(X_train,y_train)
# y_pred = clf.predict(X_test)
# print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

In [4]:
def train_test_split(df, test_size):
    
    if isinstance(test_size, float):
        test_size = round(test_size * len(df))

    indices = df.index.tolist()
    test_indices = random.sample(population=indices, k=test_size)

    test_df = df.loc[test_indices]
    train_df = df.drop(test_indices)
    
    return train_df, test_df

In [5]:
def check_purity(data):
    
    label_column = data[:, -1]
    unique_classes = np.unique(label_column)

    # the data is pure when every row carries the same label
    return len(unique_classes) == 1

In [6]:
def classify_data(data):
    
    label_column = data[:, -1]
    unique_classes, counts_unique_classes = np.unique(label_column, return_counts=True)

    index = counts_unique_classes.argmax()
    classification = unique_classes[index]
    
    return classification

In [7]:
def get_potential_splits(data):
    
    potential_splits = {}
    _, n_columns = data.shape
    for column_index in range(n_columns - 1):          # excluding the last column which is the label
        values = data[:, column_index]
        unique_values = np.unique(values)
        
        potential_splits[column_index] = unique_values
    
    return potential_splits

In [8]:
def split_data(data, split_column, split_value):
    
    split_column_values = data[:, split_column]

    type_of_feature = FEATURE_TYPES[split_column]
    if type_of_feature == "continuous":
        data_below = data[split_column_values <= split_value]
        data_above = data[split_column_values >  split_value]
    
    # feature is categorical   
    else:
        data_below = data[split_column_values == split_value]
        data_above = data[split_column_values != split_value]
    
    return data_below, data_above

In [9]:
def calculate_entropy(data):
    
    label_column = data[:, -1]
    _, counts = np.unique(label_column, return_counts=True)

    probabilities = counts / counts.sum()
    entropy = sum(probabilities * -np.log2(probabilities))
     
    return entropy

In [10]:
def calculate_overall_entropy(data_below, data_above):
    
    n = len(data_below) + len(data_above)
    p_data_below = len(data_below) / n
    p_data_above = len(data_above) / n

    overall_entropy =  (p_data_below * calculate_entropy(data_below) 
                      + p_data_above * calculate_entropy(data_above))
    
    return overall_entropy
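The criterion in this notebook picks the split with the lowest weighted child entropy, which ranks splits the same way as information gain. C4.5 as usually described goes one step further and divides the information gain by the split information (the gain ratio). The sketch below shows that normalization for a binary split; the function names `entropy` and `gain_ratio` and the boolean `mask` argument are assumptions for illustration, not code from the notebook.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return float(-(probabilities * np.log2(probabilities)).sum())


def gain_ratio(labels, mask):
    """Gain ratio of a binary split; `mask` marks rows in the first branch."""
    labels = np.asarray(labels)
    mask = np.asarray(mask, dtype=bool)
    n_below = int(mask.sum())
    n_above = len(labels) - n_below
    if n_below == 0 or n_above == 0:
        return 0.0  # a split that leaves one branch empty carries no gain

    weights = np.array([n_below, n_above]) / len(labels)
    children_entropy = (weights[0] * entropy(labels[mask])
                        + weights[1] * entropy(labels[~mask]))
    information_gain = entropy(labels) - children_entropy
    # Split information penalizes very unbalanced partitions.
    split_information = float(-(weights * np.log2(weights)).sum())
    return information_gain / split_information
```

Selecting the split with the highest gain ratio, instead of the lowest weighted entropy, would bring the search closer to the C4.5 criterion.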

In [11]:
def determine_best_split(data, potential_splits):
    
    overall_entropy = 9999
    for column_index in potential_splits:
        for value in potential_splits[column_index]:
            data_below, data_above = split_data(data, split_column=column_index, split_value=value)
            current_overall_entropy = calculate_overall_entropy(data_below, data_above)

            if current_overall_entropy <= overall_entropy:
                overall_entropy = current_overall_entropy
                best_split_column = column_index
                best_split_value = value
    
    return best_split_column, best_split_value

In [14]:
def determine_type_of_feature(df):
    
    feature_types = []
    n_unique_values_threshold = 15
    for feature in df.columns:
        if feature != "label":
            unique_values = df[feature].unique()
            example_value = unique_values[0]

            if (isinstance(example_value, str)) or (len(unique_values) <= n_unique_values_threshold):
                feature_types.append("categorical")
            else:
                feature_types.append("continuous")
    
    return feature_types

In [38]:
def decision_tree_algorithm(df, counter=0, min_samples=2):
    
    # data preparations
    if counter == 0:
        global COLUMN_HEADERS, FEATURE_TYPES
        COLUMN_HEADERS = df.columns
        FEATURE_TYPES = determine_type_of_feature(df)
        data = df.values
    else:
        data = df           
    
    
    # base cases
    if (check_purity(data)) or (len(data) < min_samples):
        classification = classify_data(data)
        return classification    
    # recursive part
    else:    
        counter += 1

        # helper functions 
        potential_splits = get_potential_splits(data)
        split_column, split_value = determine_best_split(data, potential_splits)
        data_below, data_above = split_data(data, split_column, split_value)
        
        # check for empty data
        if len(data_below) == 0 or len(data_above) == 0:
            classification = classify_data(data)
            return classification
        
        # determine question
        feature_name = COLUMN_HEADERS[split_column]
        type_of_feature = FEATURE_TYPES[split_column]
        if type_of_feature == "continuous":
            question = "{} <= {}".format(feature_name, split_value)
            
        # feature is categorical
        else:
            question = "{} = {}".format(feature_name, split_value)
        
        # instantiate sub-tree
        sub_tree = {question: []}
        
        # find answers (recursion)
        yes_answer = decision_tree_algorithm(data_below, counter, min_samples)
        no_answer = decision_tree_algorithm(data_above, counter, min_samples)
        
        # If the answers are the same, then there is no point in asking the question.
        # This could happen when the data is classified even though it is not pure
        # yet (min_samples base case).
        if yes_answer == no_answer:
            sub_tree = yes_answer
        else:
            sub_tree[question].append(yes_answer)
            sub_tree[question].append(no_answer)
        
        return sub_tree

In [16]:
def classify_example(example, tree):
    question = list(tree.keys())[0]
    feature_name, comparison_operator, value = question.split(" ")

    # ask question
    if comparison_operator == "<=":
        if example[feature_name] <= float(value):
            answer = tree[question][0]
        else:
            answer = tree[question][1]
    
    # feature is categorical
    else:
        if str(example[feature_name]) == value:
            answer = tree[question][0]
        else:
            answer = tree[question][1]

    # base case
    if not isinstance(answer, dict):
        return answer
    
    # recursive part
    else:
        residual_tree = answer
        return classify_example(example, residual_tree)

In [17]:
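def calculate_accuracy(df, tree):

    label_column = df.columns[9]          # the label is the tenth column
    df["classification"] = df.apply(classify_example, args=(tree,), axis=1)
    # compare against the label values, not against the column name
    df["classification_correct"] = df["classification"] == df[label_column]
    
    accuracy = df["classification_correct"].mean()
    
    return accuracy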

In [34]:
def train_test_k_fold_split(df, fold):

    indices = df.index.tolist()
    
    low = int((fold/10)*len(indices))
    high = int(((fold+1)/10)*len(indices))
    test_indices=indices[low:high]
    
    test_df = df.loc[test_indices]
    train_df = df.drop(test_indices)
    
    return train_df, test_df

## Glass Dataset

In [None]:
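# glass.data has no header row, so column names are supplied explicitly;
# otherwise pandas would consume the first sample as the header.
glass_columns = ['Id', 'RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'label']
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data',
                 header=None, names=glass_columns)
df = df.drop(columns='Id')  # the Id column is a row number, not a feature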

# df.head()

my_accuracies = []
sk_accuracies = []

for i in range(10):
    train_df, test_df = train_test_k_fold_split(df, i)
    tree = decision_tree_algorithm(train_df)
   
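    # Fit scikit-learn's tree on the nine feature columns only; the last
    # column is the label and must not leak into the features.
    X_train = train_df.iloc[:, :9].values
    y_train = train_df.iloc[:, 9].values
    clf = DecisionTreeClassifier()
    clf.fit(X_train, y_train)
    
    correct = 0
    correct_sk = 0
    for j in range(len(test_df)):
        example = test_df.iloc[j]
        if classify_example(example, tree) == example.iloc[9]: correct += 1
        if clf.predict([example.iloc[:9].values])[0] == example.iloc[9]: correct_sk += 1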
    
    
    my_accuracies.append(correct/len(test_df))
    sk_accuracies.append(correct_sk/len(test_df))

# print(my_accuracies)
print(np.mean(my_accuracies))
print(np.var(my_accuracies))
# print(sk_accuracies)

## Tic-Tac-Toe Dataset

In [64]:
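# tic-tac-toe.data has no header row, so column names are supplied explicitly;
# otherwise pandas would consume the first sample as the header.
ttt_columns = ['{}-{}'.format(row, col)
               for row in ('top', 'middle', 'bottom')
               for col in ('left', 'middle', 'right')] + ['label']
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/tic-tac-toe/tic-tac-toe.data',
                 header=None, names=ttt_columns)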

my_accuracies = []
sk_accuracies = []

for i in range(10):
    train_df, test_df = train_test_k_fold_split(df, i)
    tree = decision_tree_algorithm(train_df)
    
    
    correct = 0
    for j in range(len(test_df)):
        example = test_df.iloc[j]
        # the label is the tenth column; access it by position
        if classify_example(example, tree) == example.iloc[9]: correct += 1
    
    my_accuracies.append(correct/len(test_df))

print(np.mean(my_accuracies))
print(np.var(my_accuracies))


0.5952960526315789
0.059650727747960916


**Question 2**

There are two possible sources for class label noise:

a) Contradictory examples. The same sample appears more than once and is labeled with a different classification.

b) Misclassified examples. A sample is labeled with the wrong class. This type of error is common in situations where different classes of data have similar symptoms.

To evaluate the impact of class label noise, run your experiments on both datasets with various levels of noise added. Then use the C4.5 learning algorithm designed in Question 1 to learn from the noisy datasets and evaluate the impact of both types of class label noise (contradictory examples and misclassified examples).

● Note: when creating the noisy datasets, select L% of training data randomly and change them. (Try 10-times-10-fold cross validation to calculate the accuracy/error for each experiment.)
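A minimal sketch of how the two noise types could be injected into a training frame is given below. It reflects one possible reading of the assignment: misclassified examples are produced by flipping the label of a random L% of the rows, and contradictory examples by appending copies of a random L% of the rows with a different label. The function names, the `label_column` argument, and the `seed` parameter are assumptions for illustration and do not come from the notebook.

```python
import numpy as np
import pandas as pd


def _pick_rows(n_rows, noise_level, rng):
    """Choose round(noise_level * n_rows) distinct row positions."""
    n_noisy = int(round(noise_level * n_rows))
    return rng.choice(n_rows, size=n_noisy, replace=False)


def _flip_labels(df, positions, label_column, classes, rng):
    """Return a copy of df whose labels at `positions` are changed."""
    flipped = df.copy()
    col = flipped.columns.get_loc(label_column)
    for pos in positions:
        current = flipped.iat[pos, col]
        alternatives = [c for c in classes if c != current]
        flipped.iat[pos, col] = alternatives[rng.integers(len(alternatives))]
    return flipped


def add_misclassified_noise(train_df, label_column, noise_level, seed=0):
    """Replace the label of a random fraction of rows with a wrong class."""
    rng = np.random.default_rng(seed)
    classes = train_df[label_column].unique()  # needs at least two classes
    positions = _pick_rows(len(train_df), noise_level, rng)
    return _flip_labels(train_df, positions, label_column, classes, rng)


def add_contradictory_noise(train_df, label_column, noise_level, seed=0):
    """Append copies of a random fraction of rows carrying a different label."""
    rng = np.random.default_rng(seed)
    classes = train_df[label_column].unique()  # needs at least two classes
    positions = _pick_rows(len(train_df), noise_level, rng)
    duplicates = train_df.iloc[positions]
    duplicates = _flip_labels(duplicates, range(len(duplicates)),
                              label_column, classes, rng)
    return pd.concat([train_df, duplicates], ignore_index=True)
```

Noise would be applied to the training fold only, for example `add_misclassified_noise(train_df, train_df.columns[-1], 0.05)`, so that the test fold keeps its clean labels.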

a) Plot one figure for each dataset that shows the noise free classification accuracy along with the classification accuracy for the following noise levels: 5%, 10%, and 15%. Plot the two types of noise on one figure.

b) How do you explain the effect of noise on the C4.5 method?

## Glass Dataset

In [None]:
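# glass.data has no header row, so column names are supplied explicitly;
# otherwise pandas would consume the first sample as the header.
glass_columns = ['Id', 'RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'label']
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data',
                 header=None, names=glass_columns)
df = df.drop(columns='Id')  # the Id column is a row number, not a feature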

my_accuracies = []

for i in range(10):
    train_df, test_df = train_test_k_fold_split(df, i)
    train_df = shuffle_d
    tree = decision_tree_algorithm(train_df)
    
    correct = 0
    for j in range(len(test_df)):
        example = test_df.iloc[j]
        # the label is the tenth column; access it by position
        if classify_example(example, tree) == example.iloc[9]: correct += 1
    
    my_accuracies.append(correct/len(test_df))
    
# print(my_accuracies)
print(np.mean(my_accuracies))
print(np.var(my_accuracies))