# KNN Bootstrapping

__Aim__: We wish to find the best method for computing distances between shapes, judged by the F1 score of a KNN classifier run on the resulting distance matrix.

__Datasets__: We wish to find the best method for each of three datasets:
1. Athenian pots,
1. Shells,
1. Swedish leaves.

__Distance Methods__: We will look at 5 main methods, some of which have several variations. In total, we will look at 51 methods (distance matrices):
1. SRVF (Path-Straightening)
1. Eigenshape, with PCs holding the following percentages of the variance:
    - 75%
    - 80%
    - 85%
    - 90%
    - 92.5%
    - 95%
    - 98%
    - 99% 
    - 99.5%
    - 99.9%
1. LDDMM
    - Slow
    - Moderate
    - Quick
1. Currents:
    - 36 different variations of the Currents algorithm.
1. L2.

__Bootstrapping Method__: The training set has the same size for every species within a dataset: 20 shapes per species in the Pots and Leaves datasets, and 12 per species in the Shells dataset. For each species in turn, we randomly select 20 (or 12, respectively) shapes as training data; the remaining shapes of that species form the test set. We then run the KNN algorithm on each of the 51 distance matrices for K = 3, ..., 12 (the values used in the code), and record the best F1 score over these K values for each distance matrix. We repeat the whole process 100 times, each time with a newly drawn random training and test set.
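The per-species sampling described above can be sketched on a toy table. This is only an illustration, not the notebook's actual code: the helper `split_per_species`, the toy shape names, and the `seed` argument are assumptions made for the example, and it assumes that names are unique. Only the column names `Name` and `Specie` follow the index file used in this notebook.

```python
import random
import pandas as pd

def split_per_species(table, n_train, seed=None):
    """Draw n_train training names per species; the remaining names become the test set."""
    rng = random.Random(seed)  # hypothetical seed argument, for reproducible draws
    train, test = [], []
    for _, group in table.groupby("Specie"):
        names = list(group["Name"])
        chosen = set(rng.sample(names, n_train))
        train.extend(name for name in names if name in chosen)
        test.extend(name for name in names if name not in chosen)
    return train, test

# Toy index table: two species with six shapes each (made-up names).
toy = pd.DataFrame({
    "Name": [f"shape_{i}" for i in range(12)],
    "Specie": [0] * 6 + [1] * 6,
})

train_names, test_names = split_per_species(toy, n_train=4, seed=0)
print(len(train_names), len(test_names))
```

Calling the helper once per bootstrap repetition, with a different seed or none at all, gives a fresh split each time.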

__Final Output__: In the end, we will have 100 different "best F1" scores for each of the 51 methods, for each of the three datasets. We will finalise our test by computing the mean, standard deviation, and confidence intervals over the 100 repetitions, for each of the 51 methods, for each dataset.
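One possible way to compute these summaries is sketched below. The text does not say which kind of confidence interval is intended, so the normal-approximation 95% interval used here is an assumption (a percentile interval over the 100 scores would be another reasonable choice). The function name `summarise_scores` and the toy scores are also made up for illustration.

```python
import numpy as np
import pandas as pd

def summarise_scores(scores, methods, z=1.96):
    """Per-method mean, sample standard deviation and normal-approximation CI.

    scores: array-like of shape (n_methods, n_boots), one row per distance matrix.
    methods: list of n_methods labels.
    z: normal quantile (1.96 corresponds to a 95% interval; an assumption here).
    """
    scores = np.asarray(scores, dtype=float)
    n_boots = scores.shape[1]
    mean = scores.mean(axis=1)
    std = scores.std(axis=1, ddof=1)          # sample standard deviation
    half_width = z * std / np.sqrt(n_boots)   # half-width of the CI for the mean
    return pd.DataFrame({
        "Method": methods,
        "Mean": mean,
        "Std": std,
        "CI_low": mean - half_width,
        "CI_high": mean + half_width,
    })

# Toy stand-in for a (methods x bootstraps) score matrix.
rng = np.random.default_rng(0)
toy_scores = rng.uniform(0.6, 0.9, size=(3, 100))
summary = summarise_scores(toy_scores, ["method_a", "method_b", "method_c"])
print(summary)
```

The input layout, one row per method and one column per bootstrap repetition, matches the `all_scores` array built later in this notebook.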

### Imports

In [1]:
from statistics import mode
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from copy import deepcopy
import os
from tqdm import notebook as tqdm
import re
import seaborn as sns; sns.set()
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
import random
from collections import Counter

### Data

In [2]:
# Index Data
table = pd.read_csv("Final_Vase_Index.csv")

# Unique genus/species
specs = list(np.unique(list(table["Specie"])))

In [129]:
pth = "C:\\Users\\arian\\Documents\\GitHub\\Pots\\Code\\DistMats\\Pots"
files = os.listdir(pth)
    
ntot = len(files)

### KNN Bootstrapping

In [114]:
# This function is used for computing the mode when there is more than one candidate for the mode.
# It outputs all the candidate mode values.

def multi_mode(lst):
    p = Counter(lst).most_common(1)[0][1]
    modes = [val[0] for val in Counter(lst).most_common() if val[1] == p]
    return modes
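A small usage sketch of this helper, showing how a tie between classes can be broken in favour of the nearest neighbour. The function is repeated (in an equivalent form) so that the snippet runs on its own, and the label list is made up for illustration.

```python
from collections import Counter

def multi_mode(lst):
    # Equivalent to the helper above, repeated so this sketch is self-contained.
    counts = Counter(lst).most_common()
    top = counts[0][1]
    return [val for val, count in counts if count == top]

# Hypothetical labels of the k nearest training shapes, ordered nearest first.
top_classes = [2, 0, 0, 2, 1]

modes = multi_mode(top_classes)  # classes 2 and 0 both appear twice
# Tie-break: keep the tied class that occurs earliest in the list, i.e. the nearest one.
shapeclass = next(c for c in top_classes if c in modes)
print(modes, shapeclass)
```

Because the list is ordered by distance, the first tied class in the list is the one with the closest neighbour.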

In [130]:
nboots = 100

all_scores = np.zeros((ntot,nboots))
top_neighs = np.zeros((ntot,nboots))

k_neighbours = [3,4,5,6,7,8,9,10,11,12]


for nb in tqdm.tqdm(range(0,nboots)):
    
    # 1) Create training and test sets:
    train_names = []
    test_names = []

    for n,sp in enumerate(specs):
        inds = list(table[table["Specie"]==sp]["Index"])
        randinds = random.sample(inds, 20)
        testnames = list(table["Name"][list(np.setdiff1d(inds,randinds))])
        trainnames = list(table["Name"][randinds])
        train_names.extend(trainnames)
        test_names.extend(testnames)
        
    # 2) Save training and test sets:
    if nb == 0:
        pd.DataFrame(train_names).to_csv('training_vases.csv',index=False)
        pd.DataFrame(test_names).to_csv('testing_vases.csv',index=False)
    else:
        df = pd.read_csv('training_vases.csv')
        df.insert(len(df.T), len(df.T), train_names, True) 
        df.to_csv('training_vases.csv',index=False)
        df = pd.read_csv('testing_vases.csv')
        df.insert(len(df.T), len(df.T), test_names, True) 
        df.to_csv('testing_vases.csv',index=False)
        
    # 3) (Iterate through) Get distance matrices:
    
    pth = "C:\\Users\\arian\\Documents\\GitHub\\Pots\\Code\\DistMats\\Pots"
    n_f = 0
    for file in os.listdir(pth):
        all_dists = pd.read_csv(pth+"\\"+file)
        all_dists = all_dists.set_index('Name')
        
        # 4) KNN using K = 3,...,12.
        
        scores_sp = []

        for neigh in k_neighbours:

            # This is updated during the KNN algorithm to hold every pot and its predicted species.
            # To start with, it only contains the (known) species of the training sample.
            specie_details = {}
            for shell in train_names:
                specie_details.update({shell:{"Specie":list(table[table["Name"]==shell]["Specie"])[0]}})

            for test in test_names:
                dists = []
                for train in train_names:
                    d = all_dists[test][train]
                    dists.append(d)
                toprnk = np.argsort(dists)[:neigh]
                top_classes = []
                for ind in toprnk:
                    top_classes.append(specie_details[train_names[ind]]["Specie"])
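                try:
                    # statistics.mode raises StatisticsError on a tie before Python 3.8;
                    # from 3.8 on it returns the first mode encountered (the nearest tied class).
                    shapeclass = mode(top_classes)
                except Exception:
                    # If there are multiple choices for the mode, we pick the one with the smallest distance.
                    modes = multi_mode(top_classes)
                    md_k = [n for n,val in enumerate(top_classes) if val in modes][0]
                    shapeclass = top_classes[md_k]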

                specie_details.update({test:{"Specie":shapeclass}})

            act_specie = []
            pred_specie = []

            for testshape in test_names:
                pred_sp = int(specie_details[testshape]["Specie"])
                act_sp = int(list(table[table["Name"]==testshape]["Specie"])[0])

                act_specie.append(act_sp)
                pred_specie.append(pred_sp)

            f1 = f1_score(act_specie, pred_specie, average='weighted')
            scores_sp.append(f1)  

        # 5) Save top F1 score
        all_scores[n_f,nb] = max(scores_sp)
        top_neighs[n_f,nb] = k_neighbours[np.argmax(scores_sp)]
        n_f = n_f+1
        
    

dfscores = pd.DataFrame(all_scores)
dfscores.insert(0, "Method", files, True) 
dfscores.to_csv('All_Scores_Vases.csv',index=False)
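The KNN step (step 4 in the loop above) can also be written more compactly by sorting each row of a test-by-train distance matrix. The sketch below is an alternative formulation under stated assumptions, not the code that produced the saved scores: the function `knn_predict` and the toy distances are made up, and ties are broken in favour of the nearest tied class, as in the loop above.

```python
import numpy as np
from collections import Counter

def knn_predict(dist_test_train, train_labels, k):
    """Predict one label per test shape from precomputed distances.

    dist_test_train[i, j] is the distance from test shape i to training shape j.
    """
    train_labels = np.asarray(train_labels)
    # Indices of the k nearest training shapes for every test shape, nearest first.
    nearest = np.argsort(dist_test_train, axis=1, kind="stable")[:, :k]
    preds = []
    for row in nearest:
        labels = train_labels[row].tolist()
        counts = Counter(labels)
        best = max(counts.values())
        # Majority vote; on a tie, the tied label that appears first (nearest) wins.
        preds.append(next(lab for lab in labels if counts[lab] == best))
    return preds

# Toy example: 2 test shapes, 4 training shapes with labels 0, 0, 1, 1.
toy_dists = np.array([[0.1, 0.2, 0.9, 0.8],
                      [0.9, 0.7, 0.2, 0.1]])
preds = knn_predict(toy_dists, [0, 0, 1, 1], k=3)
print(preds)
```

The predicted labels can then be passed to `f1_score` together with the true labels, exactly as in the loop above.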



