# Stage 2: Image Feature Analysis

**Rui Filipe Martins Monteiro (R20170796) | MSc in Data Science and Advanced Analytics**

<br>

This notebook loads the intermediate datasets and classifies them with M4GP. Significance tests are then applied to the results to check whether they are statistically significant.

[ellyn](https://cavalab.org/ellyn/), a genetic programming system for regression and classification, is the central Python library in this notebook: it provides the M4GP implementation.

<br>

This code is heavily inspired by: Jonathan Janke (https://github.com/novajon/classy-conv-features)

Code changed and improved by: Rui Monteiro

In [14]:
# Imports
import glob
import numpy as np
import pandas as pd
import copy

from ellyn import ellyn
from sklearn.metrics import (
    accuracy_score, 
    confusion_matrix, 
    precision_score, 
    recall_score, 
    f1_score, 
    roc_auc_score, 
    cohen_kappa_score
)
from sklearn.model_selection import GridSearchCV, train_test_split, KFold

# import keras
# from keras.utils import to_categorical
# from keras.models import Sequential
# from keras.layers import Dense, Activation, Dropout

# from scipy import stats
# import time
# import datetime
# import os
# from scipy.stats import ttest_ind_from_stats
# import matplotlib.pyplot as plt
# import pickle
# import time

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Check current path
import os

first = os.getcwd()
first

'/home/ruifi/thesis'

In [3]:
# Define the path to the intermediate datasets
path = first + '/outputs/data/cifar10_filtered/intermediate' # For CIFAR-10

In [4]:
# Define the name of the subfolder in path to use
predefined_folder_name = ""

In [5]:
# If no predefined folder name is set, the latest data is used (folder with latest timestamp)
if predefined_folder_name == "":
    folder_list = np.sort(glob.glob(path + "/*"))
    curr_path = folder_list[-1] + "/"
    
else:
    curr_path = path + "/" + predefined_folder_name + "/"

In [6]:
# The current path will then consist of the path and the latest timestamp folder
curr_path

'/home/ruifi/thesis/outputs/data/cifar10_filtered/intermediate/20220224_1845/'
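
The fallback to the latest folder works because zero-padded `YYYYMMDD_HHMM` names sort chronologically when compared as plain strings. A minimal sketch of that idea, using a hypothetical helper `latest_folder` and made-up folder names:

```python
import os


def latest_folder(folders):
    """Return the folder whose name sorts last (hypothetical helper)."""
    # Zero-padded timestamps compare chronologically as plain strings
    return max(folders, key=os.path.basename)


# Made-up folder names, for illustration only
folders = [
    "intermediate/20220224_1845",
    "intermediate/20211130_0912",
    "intermediate/20220101_0000",
]
print(latest_folder(folders))
```

This only holds while every folder name follows the same zero-padded pattern; a name such as `2022224_1845` would break the ordering.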

In [7]:
# Name of the dataset
dataset_name = curr_path.split("/")[-4].split("_")[0]

In [8]:
dataset_name

'cifar10'

In [9]:
# The model names are all subfolders that are in the folder curr_path
model_names = [fn.split("/")[-1] for fn in glob.glob(curr_path + "*")]

In [10]:
model_names

['CNN_Input_1', 'CNN_Input_2']

## 1. Loading the datasets

The datasets are pre-split into train and test data.

In [15]:
train_data = []
test_data = []

for fn in model_names:
    train_data_parts = np.load(glob.glob(curr_path + str(fn) + "/train*")[0], allow_pickle=True)
    for tt in glob.glob(curr_path + str(fn) + "/train*")[1:]:
        # allow_pickle is needed for every part, not only the first one
        train_data_parts = np.vstack((train_data_parts, np.load(tt, allow_pickle=True)))
    train_data.append(copy.copy(train_data_parts))
    
    test_data_parts = np.load(glob.glob(curr_path + str(fn) + "/test*")[0], allow_pickle=True)
    for tt in glob.glob(curr_path + str(fn) + "/test*")[1:]:
        # allow_pickle is needed for every part, not only the first one
        test_data_parts = np.vstack((test_data_parts, np.load(tt, allow_pickle=True)))
    test_data.append(copy.copy(test_data_parts))
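
The loading pattern can also be tried in isolation. The sketch below writes two toy parts to a temporary folder and stacks them again. It assumes each row has the form `[features, [label]]`, which is inferred from how the rows are indexed later in this notebook; `make_part` and `load_parts` are hypothetical helpers.

```python
import glob
import os
import tempfile

import numpy as np


def make_part(n_rows, label):
    """Build a toy part with rows of the assumed form [features, [label]]."""
    part = np.empty((n_rows, 2), dtype=object)
    for i in range(n_rows):
        part[i, 0] = np.zeros(10)
        part[i, 1] = np.array([label])
    return part


def load_parts(pattern):
    """Load every file matching pattern and stack the parts row-wise."""
    files = sorted(glob.glob(pattern))  # sorted for a reproducible row order
    return np.vstack([np.load(f, allow_pickle=True) for f in files])


with tempfile.TemporaryDirectory() as tmp:
    np.save(os.path.join(tmp, "train_0.npy"), make_part(3, label=0))
    np.save(os.path.join(tmp, "train_1.npy"), make_part(2, label=1))
    stacked = load_parts(os.path.join(tmp, "train*"))

print(stacked.shape)  # (5, 2)
```

Sorting the file list is a design choice of this sketch: `glob.glob` makes no promise about ordering, so sorting keeps the row order stable across runs.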

In [54]:
# We can check the dataset dimensions
for ind, name in enumerate(model_names):
    print(name + " dimension on input data: " + str(train_data[ind][0][0].shape))

CNN_Input_1 dimension on input data: (10,)
CNN_Input_2 dimension on input data: (10,)


In [55]:
def iterate_over_datasets(datasets, function):
    """Iterate over datasets and apply function on each one

    Parameters
    ----------
    datasets : list
        list of datasets that function should be applied on
    function : function
        function that should be applied on each dataset
    """
    ret_data = []
    
    for data in datasets:
        n_data = function(data)
        ret_data.append(n_data)
        
    return ret_data

In [56]:
def split_input_target (data):
    """Save the target data separately from the input data

    Parameters
    ----------
    data : list
        list containing input and output data together
    """
    target = []
    inp = []
    
    for d in data:
        target.append(d[1][0])
        inp.append(d[0])
        
    return [np.array(inp), target]

In [57]:
# Iterate over train and test data to split the input data from the target for each
train_data = iterate_over_datasets(train_data, split_input_target)
test_data = iterate_over_datasets(test_data, split_input_target)
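
To make the split concrete, here is a toy example with two records in the assumed `[features, [label]]` layout. The values are made up; the two comprehensions mirror what `split_input_target` does.

```python
import numpy as np

# Two toy records in the assumed layout [features, [label]]
records = [
    [np.array([0.1, 0.9]), [1]],
    [np.array([0.8, 0.2]), [0]],
]

# Features are collected into one array, labels into a plain list
inputs = np.array([r[0] for r in records])
targets = [r[1][0] for r in records]

print(inputs.shape)  # (2, 2)
print(targets)       # [1, 0]
```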

In [66]:
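def reindex_classes(data):
    """Shift the class labels of each dataset so that they start at 0

    Parameters
    ----------
    data : list
        datasets to reindex, each in the form [inputs, targets]
    """
    for ind, ds in enumerate(data):
        # Compute the minimum once instead of once per label
        min_label = min(ds[1])
        ds[1] = [x - min_label for x in ds[1]]
        data[ind] = ds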
        
    return data
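
Note that subtracting the minimum makes the labels start at 0 but does not close gaps between them. The sketch below contrasts the two behaviours on made-up labels; using `np.unique` with `return_inverse=True` is an alternative suggested here, not something the original code does.

```python
import numpy as np

labels = [3, 7, 5]  # made-up labels with gaps

# Subtracting the minimum: labels start at 0, gaps remain
shifted = [x - min(labels) for x in labels]
print(shifted)  # [0, 4, 2]

# return_inverse maps each label to its rank among the distinct labels
_, contiguous = np.unique(labels, return_inverse=True)
print(contiguous.tolist())  # [0, 2, 1]
```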

In [67]:
# Reindex datasets to make their classes continuous

In [69]:
def get_top_n_accuracy(input_target, input_data, n=5):
    """Get top n accuracy from input data (predictions)

    Parameters
    ----------
    input_target : list
        array of correct targets
    input_data : list
        array of predictions
    n : int
        n value for top-n-accuracy
    """
    count = 0
    inp_data = input_data.copy()
    
    for ind, pred in enumerate(inp_data):
        max_classes = []
        
        for i in range (n):
            max_classes.append(np.argmax(pred))
            pred[np.argmax(pred)]=-1
            
        if input_target[ind] in max_classes: 
            count += 1
            
    return count/len(input_target)
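
The same metric can be computed without the `-1` sentinel, which would give wrong results if scores could be negative. The sketch below is a vectorized alternative based on `np.argsort`, not the function used in this notebook. It assumes `scores` is a 2-D array with one row per sample and one column per class; `top_n_accuracy` is a hypothetical name.

```python
import numpy as np


def top_n_accuracy(targets, scores, n=5):
    """Share of samples whose true class is among the n highest scores."""
    scores = np.asarray(scores)
    targets = np.asarray(targets)
    n = min(n, scores.shape[1])  # cannot pick more classes than exist
    # argsort is ascending, so the last n columns hold the top-n classes
    top_n = np.argsort(scores, axis=1)[:, -n:]
    hits = (top_n == targets[:, None]).any(axis=1)
    return float(hits.mean())


# Made-up scores for three samples and three classes
scores = np.array([
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
])
targets = [1, 1, 0]

for n in (1, 2, 3):
    print(n, top_n_accuracy(targets, scores, n=n))
```

Capping `n` at the number of classes means that top-10 and top-20 accuracy on a 10-class problem such as CIFAR-10 are always 1.0.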

In [68]:
def evaluate(model, data):
    """Method to evaluate a model based on several metrics:
        - top-1-accuracy ("accuracy")
        - top-2-accuracy
        - top-5-accuracy
        - top-10-accuracy
        - top-20-accuracy
        - confusion matrix
        - precision
        - recall
        - f1 score
        - Cohen's kappa

    Parameters
    ----------
    model : keras/sklearn model (needs predict function)
        model to evaluate

    data : list
        dataset to evaluate
    """
    y_pred = model.predict(data[0])
    
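    # Fall back to the raw predictions when the model has no predict_proba
    try:
        y_proba = model.predict_proba(data[0])
    except Exception:
        y_proba = model.predict(data[0])
    
    # Class labels are rounded; score vectors are reduced with argmax
    try:
        predictions = [round(value) for value in y_pred]
    except Exception:
        predictions = [np.argmax(value) for value in y_pred]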
    
    # Evaluate predictions
    d = {}
    d["accuracy"] = accuracy_score(data[1], predictions)
    d["top-2-accuracy"] = get_top_n_accuracy(data[1], y_proba, n=2)
    d["top-5-accuracy"] = get_top_n_accuracy(data[1], y_proba, n=5)
    d["top-10-accuracy"] = get_top_n_accuracy(data[1], y_proba, n=10)
    d["top-20-accuracy"] = get_top_n_accuracy(data[1], y_proba, n=20)
    d["confusion matrix"] = confusion_matrix(data[1], predictions)
    d["precision"] = precision_score(data[1], predictions, average='macro')
    d["recall"] = recall_score(data[1], predictions, average='macro')
    d["f1-score"] = f1_score(data[1], predictions, average='macro')
    # d["roc-auc"] = roc_auc_score(data[1], predictions, )
    d["cohen's kappa"] = cohen_kappa_score(data[1], predictions)
    
    return d

In [70]:
def get_measures(measure, data_type="data", measure_name="Accuracy"):
    """Get string representation of measures

    Parameters
    ----------
    measure : float
        value to print
    data_type : string
        usually test or training data
    measure_name : string
        measure that is applied, e.g. 'Accuracy'
    """
    return "%s in %s: %.2f" % (measure_name, data_type, measure)

In [71]:
def print_measures(evaluation, t):
    """Print measures on screen

    Parameters
    ----------
    evaluation : dict
        key value pairs with accuracy measure : value
    t : string
        Data type, e.g. "Test" or "Train"
    """
    for key in evaluation.keys():
        if key != "confusion matrix":
            print(get_measures(evaluation[key], t, key))
            
        else:
            print(key)
            print(evaluation[key])

In [86]:
def write_to_file(train_eval, test_eval, model_stats, model_key, model_name, output_file_name):
    """Save the results to a CSV file

    Parameters
    ----------
    train_eval : dict
        key value pairs of "evaluation metric name : evaluation metric value" on train data
    test_eval : dict
        key value pairs of "evaluation metric name : evaluation metric value" on test data
    model_stats : dict
        key value pair of "train"/"test" : model statistics (e.g., mean, std. dev, ...)
    model_key : string
        name of the model that was applied
    model_name : string
        name of the model that was used to create the intermediate dataset, e.g., VGG-16
    output_file_name : string
        name of the file to write to
    """
    file_stamp = curr_path.split("/")[-2]
    train_size = str(len(train_data[0][0]))
    test_size = str(len(test_data[0][0]))
    dataset_dim = str(np.max(train_data[0][1] + test_data[0][1]) + 1)
    
    with open(output_file_name + '_train.csv', 'a') as f:
        print (output_file_name + '_train.csv')
        f.write("\n" + model_name + ',' + file_stamp + ',' + train_size + ',' + dataset_dim + ',' + model_key + ",")
        f.write(','.join([str(train_eval[x]) for x in train_eval.keys() if x!="confusion matrix"]) + ",")
        f.write(','.join([str(x) for x in model_stats["train"].values()]))
    
    with open(output_file_name + '_test.csv', 'a') as f:
        f.write("\n" + model_name + ',' + file_stamp + ',' + test_size + ',' + dataset_dim + ',' + model_key + ",")
        f.write(','.join([str(test_eval[x]) for x in test_eval.keys() if x!="confusion matrix"]) + ",")    
        f.write(','.join([str(x) for x in model_stats["test"].values()]))

In [87]:
def create_data_files(ind,prefix):
    """Create initial files to write to

    Parameters
    ----------
    ind : int
        index of dataset
    prefix : string
        prefix to give to file name
    """
    path = "/home/ruifi/thesis/outputs/benchmark_results/" + dataset_name + "/"
    output_file_name = path + prefix + "_" + str(ind)
    
    if not os.path.exists(path):
        os.makedirs(path)
        
    with open(output_file_name + '_train.csv', 'w') as f:
        f.write('Dataset,Dataset Stamp,Dataset Size,# Dataset Classes,Model Architecture' 
                + ',' + "accuracy" + ',' + "top-2-accuracy" + ',' + "top-5-accuracy" + ',' 
                + "top-10-accuracy" + ',' + "top-20-accuracy" + ',' + "precision" + ',' 
                + "recall" + ',' + "f1-score" + ',' + "cohen's kappa" + ',' 
                + "Mean Accuracy over cross validation" + ',' + "mean standard deviation over cross validation" 
                + ',' + "number observations over cross validation")
        
    with open(output_file_name + '_test.csv', 'w') as f:
        f.write('Dataset,Dataset Stamp,Dataset Size,# Dataset Classes,Model Architecture' 
                + ',' + "accuracy" + ',' + "top-2-accuracy" + ',' + "top-5-accuracy" + ',' 
                + "top-10-accuracy" + ',' + "top-20-accuracy" + ',' + "precision" 
                + ',' + "recall" + ',' + "f1-score" + ',' + "cohen's kappa" + ',' 
                + "Mean Accuracy over cross validation" + ',' + "mean standard deviation over cross validation" 
                + ',' + "number observations over cross validation")
        
    return output_file_name
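
Joining fields with `','` by hand works as long as no value contains a comma or a line break. As a hedged alternative, the standard-library `csv` module handles quoting automatically. The sketch below uses a shortened header and placeholder values for illustration only; `write_rows` is a hypothetical helper.

```python
import csv
import io

# Shortened, illustrative header (the real files have more columns)
HEADER = ["Dataset", "Model Architecture", "accuracy", "f1-score"]


def write_rows(handle, rows, header=None):
    """Write an optional header and the given rows as CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    if header is not None:
        writer.writerow(header)
    writer.writerows(rows)


# Placeholder values, not real results
buffer = io.StringIO()
write_rows(buffer, [["CNN_Input_1", "M4GP", 0.5, 0.25]], header=HEADER)
print(buffer.getvalue())
```

With a real file, the handle would come from `open(file_name, "a", newline="")`, which is the mode the `csv` documentation recommends for writers.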

## 2. Classification Model Benchmark

### M4GP hyperparameters

Check all parameters here: <br>
https://github.com/cavalab/ellyn/blob/03ebdf07f0bfdca30da1a9a0a0da2989e9b1153c/src/ellyn.py#L42

In [88]:
e = ellyn(
    #g=200, 
    #popsize=50,
    verbosity=1,
    class_m4gp=True,
    op_list=['n','v','+','-','*','/', 'sin', 'cos', 'exp', 'log', 'sqrt'],
    #selection='lexicase',
    #fit_type='F1'
)

In [None]:
# START CHECKING ON "Create other classification models"