# KNN

- **Categorical features**: conc1_type, exposure_type, control_type, media_type, application_freq_unit, class, tax_order, family, genus, species

- **Non-categorical features**: obs_duration_mean, conc1_mean, atom_number, alone_atom_number, bonds_number, doubleBond, tripleBond, ring_number, Mol, MorganDensity, LogP, oh_count

It turns out that *obs_duration_mean* has to be treated as a categorical feature in order to maximize the metrics.

In [1]:
from helper_knn import *

X_try, X_train, X_test, y_train, y_test, len_X_train = load_data_knn('data/lc_db_processed.csv',
                                                                     encoding = 'binary', seed = 42)

# Best combination
categorical = ['class', 'tax_order', 'family', 'genus', "species", 'control_type', 'media_type',
               'application_freq_unit',"exposure_type", "conc1_type", 'obs_duration_mean']

non_categorical = ['ring_number', 'tripleBond', 'doubleBond', 'alone_atom_number', 'oh_count',
                   'atom_number', 'bonds_number', 'Mol', 'MorganDensity', 'LogP']
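
Judging from the log messages in the cells that follow, the three weights alpha_1, alpha_2 and alpha_3 seem to scale a Hamming distance on the categorical features, a Euclidean distance on the numerical features, and a Hamming distance on the Pubchem2d fingerprint. `helper_knn` is not shown here, so the block below is only a minimal sketch of such a weighted sum. The function names, the toy rows and the choice of a normalised Hamming distance are all assumptions.

```python
import math

def hamming(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def euclidean(a, b):
    """Ordinary Euclidean distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def combined_distance(row_a, row_b, alpha_1, alpha_2=1.0, alpha_3=0.0):
    """Assumed form: alpha_1*Hamming(cat) + alpha_2*Euclid(num) + alpha_3*Hamming(fp)."""
    return (alpha_1 * hamming(row_a["cat"], row_b["cat"])
            + alpha_2 * euclidean(row_a["num"], row_b["num"])
            + alpha_3 * hamming(row_a["fp"], row_b["fp"]))

# Hypothetical rows: the duration (96 vs 48) sits in "cat", so it is only
# compared for equality, which is what treating it as categorical means.
a = {"cat": ["fish", "static", 96], "num": [0.0, 0.0], "fp": [1, 0, 1, 1]}
b = {"cat": ["fish", "renewal", 48], "num": [3.0, 4.0], "fp": [1, 1, 1, 0]}

d = combined_distance(a, b, alpha_1=0.3, alpha_2=1.0, alpha_3=0.5)
print(d)
```

With alpha_2 fixed at 1, only the ratios alpha_1/alpha_2 and alpha_3/alpha_2 matter for the neighbour ranking, which is presumably why the searches below tune only alpha_1 and alpha_3.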

## BINARY -- K = 1
### Finding the best alpha_1 for the problem

START: alpha_1 = 0, alpha_2 = 1, alpha_3 = 0

END: alpha_1 = 0.0069519279617756054, alpha_2 = 1, alpha_3 = 0
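
The END value above is one of the 20 points of `np.logspace(-3, -1, 20)` used in the next cell, so alpha_1 is only resolved up to the spacing of that grid. The short check below rebuilds the grid in plain Python (an equivalent formula, not the notebook's code) and locates the reported optimum on it.

```python
# Plain-Python equivalent of np.logspace(-3, -1, 20): 20 points from 1e-3 to 1e-1.
grid = [10 ** (-3 + 2 * i / 19) for i in range(20)]

best_alpha_1 = 0.0069519279617756054  # END value reported above

# Index of the grid point closest to the reported optimum.
idx = min(range(len(grid)), key=lambda i: abs(grid[i] - best_alpha_1))

# Constant ratio between neighbouring grid points (about 1.27).
step_ratio = grid[1] / grid[0]
print(idx, grid[idx], step_ratio)
```

Neighbouring candidates differ by a factor of roughly 1.27, which motivates the finer `np.linspace` grids used in the later refinement steps.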

In [2]:
c = [0,0]
ham = np.logspace(-3, -1, 20) 

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_ham = ham, choice = c, ks = [1])

Mon Sep 14 11:08:06 2020
START...
Computing Euclidean ...
Adding Hamming 1 (Categorical)... alpha = 0.001
Start CV...
New best params found! alpha:0.001, k:1, leaf:10,
                                                        acc:  0.8976980539183801, st.error:  0.0026286215760152913,
                                                        rmse: 0.31974227013083456, st.error:  0.004087382206260621
New best params found! alpha:0.001, k:1, leaf:40,
                                                        acc:  0.8984766658275387, st.error:  0.0013155183512758776,
                                                        rmse: 0.3186007056220668, st.error:  0.0020569728718740753
New best params found! alpha:0.001, k:1, leaf:80,
                                                        acc:  0.8997000774068564, st.error:  0.001976147553967034,
                                                        rmse: 0.3166398848354725, st.error:  0.003126736498423271
Adding Hamming 1 (Categorical)... alpha =

### Finding the best alpha_3, fixing best_alpha_1
START: alpha_1 = 0.0069519279617756054, alpha_2 = 1, alpha_3 = 0

END: alpha_1 = 0.0069519279617756054, alpha_2 = 1, alpha_3 = 0.0069519279617756054

In [2]:
c = [0,1]
al_ham = 0.0069519279617756054
pub = np.logspace(-3, -1, 20)

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [1])

Mon Sep 14 12:29:06 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 0.001
Start CV...
New best params found! alpha:0.001, k:1, leaf:10,
                                                        acc:  0.8977536249017346, st.error:  0.002539752081748882,
                                                        rmse: 0.3196624298760987, st.error:  0.003946708247816731
New best params found! alpha:0.001, k:1, leaf:40,
                                                        acc:  0.8985878387013575, st.error:  0.0013189399299914412,
                                                        rmse: 0.31842614944471437, st.error:  0.0020584368131552197
New best params found! alpha:0.001, k:1, leaf:80,
                                                        acc:  0.8998112657342301, st.error:  0.0019271779321472723,
                                                        rmse: 0.3164673314241388, st.error:  0.0030480488453350613
Adding Hammin

### Finding again the best alpha_1, fixing best_alpha_3

START: alpha_1 = 0.0069519279617756054, alpha_2 = 1, alpha_3 = 0.0069519279617756054

END: alpha_1 = 0.009473684210526315, alpha_2 = 1, alpha_3 = 0.0069519279617756054

In [2]:
c = [1,0]
al_pub = 0.0069519279617756054

ham = np.linspace(0.005,0.01,20)

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_ham = ham, a_pub = al_pub, choice = c, ks = [1])

Mon Sep 14 16:51:44 2020
START...
Computing Euclidean and Pubchem2d Matrix...
Adding Hamming 1 (Categorical)... alpha = 0.005
Start CV...
New best params found! alpha:0.005, k:1, leaf:10,
                                                        acc:  0.8245314057187116, st.error:  0.005742672577797003,
                                                        rmse: 0.4186730767961923, st.error:  0.006735151210219632
New best params found! alpha:0.005, k:1, leaf:30,
                                                        acc:  0.827031914527006, st.error:  0.0024800733461741212,
                                                        rmse: 0.4158523496638812, st.error:  0.0029541814442851045
Adding Hamming 1 (Categorical)... alpha = 0.005263157894736842
Start CV...
Adding Hamming 1 (Categorical)... alpha = 0.005526315789473685
Start CV...
Adding Hamming 1 (Categorical)... alpha = 0.005789473684210527
Start CV...
New best params found! alpha:0.005789473684210527, k:1, leaf:30,
             

### Finding again the best alpha_3, fixing best_alpha_1

START: alpha_1 = 0.009473684210526315, alpha_2 = 1, alpha_3 = 0.0069519279617756054

END: alpha_1 = 0.009473684210526315, alpha_2 = 1, alpha_3 = 0.007105263157894737

In [2]:
c = [0,1]
al_ham = 0.009473684210526315
pub = np.linspace(0.005,0.01,20)

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [1])

Mon Sep 14 21:28:13 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 0.005
Start CV...
New best params found! alpha:0.005, k:1, leaf:10,
                                                        acc:  0.8977536249017346, st.error:  0.0025060535258592765,
                                                        rmse: 0.3196648161329612, st.error:  0.003898090075453123
New best params found! alpha:0.005, k:1, leaf:40,
                                                        acc:  0.898532267718003, st.error:  0.0014565415219307638,
                                                        rmse: 0.3185074989524941, st.error:  0.002275158951814106
New best params found! alpha:0.005, k:1, leaf:80,
                                                        acc:  0.899588858172373, st.error:  0.0019329533882988246,
                                                        rmse: 0.3168182050385376, st.error:  0.0030564188107250296
Adding Hamming 3
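
The four searches above tune one weight at a time while the other is held at its current best value: alpha_1, then alpha_3, then alpha_1 again, then alpha_3 again. A minimal sketch of that coordinate-wise loop follows. The `toy_score` function is a hypothetical stand-in for the cross-validated accuracy returned by `cv_params_new`, whose internals are not shown here.

```python
def coordinate_search(score, grids, start, n_rounds=2):
    """Tune one parameter at a time, keeping the others at their current best."""
    current = dict(start)
    for _ in range(n_rounds):
        for name, grid in grids.items():
            # Pick the grid value that maximises the score with the rest fixed.
            current[name] = max(grid, key=lambda v: score({**current, name: v}))
    return current

def toy_score(p):
    """Hypothetical score, peaked at alpha_1 = 0.01 and alpha_3 = 0.007."""
    return -((p["alpha_1"] - 0.01) ** 2 + (p["alpha_3"] - 0.007) ** 2)

grids = {
    "alpha_1": [0.001, 0.005, 0.01, 0.05],
    "alpha_3": [0.001, 0.007, 0.05],
}
best = coordinate_search(toy_score, grids, start={"alpha_1": 0.0, "alpha_3": 0.0})
print(best)
```

Coordinate-wise search is much cheaper than a full two-dimensional grid, but it can stop at a point that is optimal only along each axis separately.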

## Final model -- BINARY -- K = 1

In [2]:
y = np.append(y_train,y_test)

del X_train, X_test, y_train, y_test

ham = 0.009473684210526315
pub = 0.007105263157894737
k = 1
leaf = 40

cv_binary_knn(X_try, y, ham, pub, k, leaf)

Basic Matrix... Tue Sep 22 00:23:49 2020
Adding pubchem2d Tue Sep 22 00:26:50 2020
End distance matrix... Tue Sep 22 00:40:05 2020
Accuracy: 	 0.9105265658811724, se: 0.001596093283967252
    RMSE: 		 0.29907364869243597, se: 0.0026639621159533205
    Sensitivity: 	 0.9305930270523051, se: 0.0025662688577057476
    Precision: 	 0.9257884455755866, se: 0.0008814897322470156
    Specificity: 	 0.877670405685049, se: 0.0016028283693755635
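
For reference, the metrics printed above have standard definitions in terms of the binary confusion matrix. The sketch below assumes `cv_binary_knn` uses these definitions (its code is not shown) and uses made-up counts. For 0/1 labels every squared error is either 0 or 1, so within a single fold the RMSE equals sqrt(1 - accuracy). This is consistent with the output above: sqrt(1 - 0.9105) is about 0.299.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),    # true positive rate (recall)
        "precision": tp / (tp + fp),      # positive predictive value
        "specificity": tn / (tn + fp),    # true negative rate
        "rmse": math.sqrt(1 - accuracy),  # valid for 0/1 labels only
    }

# Hypothetical counts, for illustration only.
m = binary_metrics(tp=90, fp=10, tn=80, fn=20)
print(m)
```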


## BINARY -- K = 3
### Finding the best alpha_1 for the problem

START: alpha_1 = 0, alpha_2 = 1, alpha_3 = 0

END: alpha_1 = 0.008858667904100823, alpha_2 = 1, alpha_3 = 0

In [2]:
c = [0,0]
ham = np.logspace(-3, -1, 20) 

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_ham = ham, choice = c, ks = [3])

Tue Oct  6 15:48:57 2020
START...
Computing Euclidean ...
Adding Hamming 1 (Categorical)... alpha = 0.001
Start CV...
New best params found! alpha:0.001, k:3, leaf:10,
                                                        acc:  0.8865784021426044, st.error:  0.001471987870124243,
                                                        rmse: 0.33675277506777046, st.error:  0.002188969023942021
New best params found! alpha:0.001, k:3, leaf:20,
                                                        acc:  0.887690161787902, st.error:  0.00263278495208526,
                                                        rmse: 0.335035588524517, st.error:  0.0039048890550307407
New best params found! alpha:0.001, k:3, leaf:80,
                                                        acc:  0.8887471076740343, st.error:  0.0023078036127372813,
                                                        rmse: 0.3334734907055424, st.error:  0.003475748934621125
Adding Hamming 1 (Categorical)... alpha = 0.0

### Finding the best alpha_3, fixing best_alpha_1
START: alpha_1 = 0.008858667904100823, alpha_2 = 1, alpha_3 = 0

END: alpha_1 = 0.008858667904100823, alpha_2 = 1, alpha_3 = 0.46415888336127786

In [2]:
c = [0,1]
al_ham = 0.008858667904100823
pub = np.logspace(-3, -1, 20)

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [3])

Tue Oct  6 17:22:18 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 0.001
Start CV...
New best params found! alpha:0.001, k:3, leaf:10,
                                                        acc:  0.887356735887775, st.error:  0.001693560172563214,
                                                        rmse: 0.3355857685126894, st.error:  0.0025227011303663215
New best params found! alpha:0.001, k:3, leaf:20,
                                                        acc:  0.8884130172710474, st.error:  0.0028595070449058215,
                                                        rmse: 0.3339387082647631, st.error:  0.004240337587208438
New best params found! alpha:0.001, k:3, leaf:40,
                                                        acc:  0.8884132336208157, st.error:  0.0009112003453521156,
                                                        rmse: 0.33403486224369083, st.error:  0.001367222094492443
New best params

In [2]:
c = [0,1]
al_ham = 0.008858667904100823
pub = np.logspace(-1, 0, 10)

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [3])

Wed Oct  7 01:42:35 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 0.1
Start CV...
New best params found! alpha:0.1, k:3, leaf:10,
                                                        acc:  0.8873567513413299, st.error:  0.0018834664237698242,
                                                        rmse: 0.3355768450901871, st.error:  0.0028031097903327197
New best params found! alpha:0.1, k:3, leaf:20,
                                                        acc:  0.8885241746913113, st.error:  0.0024828809727088235,
                                                        rmse: 0.3337983135707881, st.error:  0.0036915838737065667
New best params found! alpha:0.1, k:3, leaf:80,
                                                        acc:  0.8893030647644578, st.error:  0.0020303205888939084,
                                                        rmse: 0.3326553078897373, st.error:  0.0030570152243005245
Adding Hamming 3 (Pu

### Finding again the best alpha_1, fixing best_alpha_3

START: alpha_1 = 0.008858667904100823, alpha_2 = 1, alpha_3 = 0.46415888336127786

END: alpha_1 = 0.039331167860186894, alpha_2 = 1, alpha_3 = 0.46415888336127786

In [8]:
c = [1,0]
al_pub = 0.46415888336127786

ham = np.logspace(-2.1,-1,20)

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_ham = ham, a_pub = al_pub, choice = c, ks = [3])

Wed Oct  7 03:15:41 2020
START...
Computing Euclidean and Pubchem2d Matrix...
Adding Hamming 1 (Categorical)... alpha = 0.007943282347242814
Start CV...
New best params found! alpha:0.007943282347242814, k:3, leaf:10,
                                                        acc:  0.8397084346893363, st.error:  0.0024519901755913277,
                                                        rmse: 0.40031798425955667, st.error:  0.003044535639013152
New best params found! alpha:0.007943282347242814, k:3, leaf:20,
                                                        acc:  0.8397638047764774, st.error:  0.003775281248056422,
                                                        rmse: 0.40018359131489, st.error:  0.004724628711326613
New best params found! alpha:0.007943282347242814, k:3, leaf:30,
                                                        acc:  0.8418766920676749, st.error:  0.0017307674019033284,
                                                        rmse: 0.39762330230603

### Finding again the best alpha_3, fixing best_alpha_1

START: alpha_1 = 0.039331167860186894, alpha_2 = 1, alpha_3 = 0.46415888336127786

END: alpha_1 = 0.039331167860186894, alpha_2 = 1, alpha_3 = 1.0

In [2]:
c = [0,1]
al_ham = 0.039331167860186894
pub = [i for i in np.logspace(-0.5,0,10)] + [2, 5, 10, 100, 1000, 10000]

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [3])

Wed Oct  7 08:26:54 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 0.31622776601683794
Start CV...
New best params found! alpha:0.31622776601683794, k:3, leaf:10,
                                                        acc:  0.8852439104107972, st.error:  0.002462560687709792,
                                                        rmse: 0.3386803882723388, st.error:  0.0035945858210874268
New best params found! alpha:0.31622776601683794, k:3, leaf:20,
                                                        acc:  0.8883574617412477, st.error:  0.0024829382978401406,
                                                        rmse: 0.33404849530623315, st.error:  0.003679029844730267
New best params found! alpha:0.31622776601683794, k:3, leaf:40,
                                                        acc:  0.8887469685920404, st.error:  0.0011434658687675026,
                                                        rmse: 0.33352857

## Final model -- BINARY -- K = 3

In [3]:
y = np.append(y_train,y_test)

del X_train, X_test, y_train, y_test

ham = 0.039331167860186894
pub = 1.0
k = 3
leaf = 80

cv_binary_knn(X_try, y, ham, pub, k, leaf)

Basic Matrix... Wed Oct  7 11:52:02 2020
Adding pubchem2d Wed Oct  7 11:54:39 2020
End distance matrix... Wed Oct  7 12:10:16 2020
Accuracy: 	 0.9061686077707287, se: 0.0015416791459042548
    RMSE: 		 0.3062780121023428, se: 0.0025085619340753074
    Sensitivity: 	 0.9297288797672791, se: 0.0016272575609664918
    Precision: 	 0.9200600256244315, se: 0.0018468057727389193
    Specificity: 	 0.8675772077090196, se: 0.002544648179924816
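
Each metric above is reported together with a standard error (`se`). How `cv_binary_knn` computes it is not shown. A common choice, assumed here, is the sample standard deviation of the per-fold values divided by the square root of the number of folds. The fold accuracies below are invented for illustration.

```python
import math
import statistics

def mean_and_se(values):
    """Mean and standard error (sample std / sqrt(n)) of per-fold scores."""
    n = len(values)
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)

# Hypothetical accuracies from a 5-fold cross-validation.
fold_acc = [0.90, 0.91, 0.905, 0.915, 0.90]
mean_acc, se_acc = mean_and_se(fold_acc)
print(mean_acc, se_acc)
```

Two configurations whose mean accuracies differ by less than about one standard error, as happens for several of the leaf sizes in the logs, cannot be reliably ranked from this cross-validation alone.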


## BINARY -- K = 5
### Finding the best alpha_1 for the problem

START: alpha_1 = 0, alpha_2 = 1, alpha_3 = 0

END: alpha_1 = 0.004281332398719396, alpha_2 = 1, alpha_3 = 0

In [2]:
c = [0,0]
ham = np.logspace(-3, -1, 20) 

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_ham = ham, choice = c, ks = [5])

Sun Sep 20 20:36:42 2020
START...
Computing Euclidean ...
Adding Hamming 1 (Categorical)... alpha = 0.001
Start CV...
New best params found! alpha:0.001, k:5, leaf:10,
                                                        acc:  0.8761812967788764, st.error:  0.002418811517636534,
                                                        rmse: 0.3518115974103475, st.error:  0.0034388642275861754
New best params found! alpha:0.001, k:5, leaf:20,
                                                        acc:  0.880184540170975, st.error:  0.002459028612155808,
                                                        rmse: 0.34607134161526065, st.error:  0.0035385852271274066
Adding Hamming 1 (Categorical)... alpha = 0.0012742749857031334
Start CV...
Adding Hamming 1 (Categorical)... alpha = 0.001623776739188721
Start CV...
Adding Hamming 1 (Categorical)... alpha = 0.00206913808111479
Start CV...
Adding Hamming 1 (Categorical)... alpha = 0.0026366508987303583
Start CV...
Adding Hamming 1 (Cat

### Finding the best alpha_3, fixing best_alpha_1
START: alpha_1 = 0.004281332398719396, alpha_2 = 1, alpha_3 = 0

END: alpha_1 = 0.004281332398719396, alpha_2 = 1, alpha_3 = 0.0379269019073225

In [2]:
c = [0,1]
al_ham = 0.004281332398719396
pub = np.logspace(-3, -1, 20)

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [5])

Sun Sep 20 22:23:14 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 0.001
Start CV...
New best params found! alpha:0.001, k:5, leaf:10,
                                                        acc:  0.8766817292466099, st.error:  0.001168452234182648,
                                                        rmse: 0.3511512650906884, st.error:  0.0016628122713671133
New best params found! alpha:0.001, k:5, leaf:20,
                                                        acc:  0.8801288301056266, st.error:  0.0026295035397560017,
                                                        rmse: 0.34614127610917766, st.error:  0.0037877060301538105
Adding Hamming 3 (Pubchem2d)... alpha = 0.0012742749857031334
Start CV...
New best params found! alpha:0.0012742749857031334, k:5, leaf:40,
                                                        acc:  0.8804630132299429, st.error:  0.0021034171628869567,
                                     

### Finding again the best alpha_1, fixing best_alpha_3

START: alpha_1 = 0.004281332398719396, alpha_2 = 1, alpha_3 = 0.0379269019073225

END: alpha_1 = 0.014399033208816327, alpha_2 = 1, alpha_3 = 0.0379269019073225

In [2]:
c = [1,0]
al_pub = 0.0379269019073225

ham = np.logspace(-2.8,-1.8,25)

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_ham = ham, a_pub = al_pub, choice = c, ks = [5])

Mon Sep 21 05:45:16 2020
START...
Computing Euclidean and Pubchem2d Matrix...
Adding Hamming 1 (Categorical)... alpha = 0.001584893192461114
Start CV...
New best params found! alpha:0.001584893192461114, k:5, leaf:10,
                                                        acc:  0.8432674347392515, st.error:  0.002852271713141206,
                                                        rmse: 0.395829532552304, st.error:  0.0035897917827189354
Adding Hamming 1 (Categorical)... alpha = 0.0017444826989992542
Start CV...
Adding Hamming 1 (Categorical)... alpha = 0.001920141938638803
Start CV...
New best params found! alpha:0.001920141938638803, k:5, leaf:30,
                                                        acc:  0.8442678824287364, st.error:  0.0014235023780417933,
                                                        rmse: 0.39461252835898064, st.error:  0.001807625057665591
Adding Hamming 1 (Categorical)... alpha = 0.0021134890398366475
Start CV...
Adding Hamming 1 (Categorical)

### Finding again the best alpha_3, fixing best_alpha_1

START: alpha_1 = 0.014399033208816327, alpha_2 = 1, alpha_3 = 0.0379269019073225

END: alpha_1 = 0.014399033208816327, alpha_2 = 1, alpha_3 = ------------

In [2]:
c = [0,1]
al_ham = 0.014399033208816327
pub = [i for i in np.logspace(-1.6, 0, 10)] + [1.5, 2, 10, 100, 1000, 10000]

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [5])

Thu Oct  8 18:20:50 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 0.025118864315095794
Start CV...
New best params found! alpha:0.025118864315095794, k:5, leaf:10,
                                                        acc:  0.8766260346348164, st.error:  0.001960253171965585,
                                                        rmse: 0.35120252648395944, st.error:  0.0027726682305826912
New best params found! alpha:0.025118864315095794, k:5, leaf:20,
                                                        acc:  0.8797953887519446, st.error:  0.0018286467562533342,
                                                        rmse: 0.34666477285905994, st.error:  0.002652664068051391
New best params found! alpha:0.025118864315095794, k:5, leaf:90,
                                                        acc:  0.8804070250006065, st.error:  0.002108237217088025,
                                                        rmse: 0.3457

In [2]:
c = [0,1]
al_ham = 0.014399033208816327
pub = [10000, 15000, 20000, 50000]

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [5])

Thu Oct  8 22:47:07 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 10000
Start CV...
New best params found! alpha:10000, k:5, leaf:10,
                                                        acc:  0.8769600786771387, st.error:  0.0025048288125693085,
                                                        rmse: 0.35069609225411275, st.error:  0.0036115163213070265
New best params found! alpha:10000, k:5, leaf:20,
                                                        acc:  0.877904970836824, st.error:  0.0015625853321730662,
                                                        rmse: 0.3493925767344551, st.error:  0.002228030858963717
New best params found! alpha:10000, k:5, leaf:30,
                                                        acc:  0.8782941377094092, st.error:  0.0027283227168917188,
                                                        rmse: 0.3487764869553651, st.error:  0.003899501175305083
New best param

In [3]:
c = [0,1]
al_ham = 0.014399033208816327
pub = [50000, 1000000, 10000000]

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [5])

Thu Oct  8 23:24:31 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 50000
Start CV...
New best params found! alpha:50000, k:5, leaf:10,
                                                        acc:  0.8769600786771387, st.error:  0.0025048288125693085,
                                                        rmse: 0.35069609225411275, st.error:  0.0036115163213070265
New best params found! alpha:50000, k:5, leaf:20,
                                                        acc:  0.877904970836824, st.error:  0.0015625853321730662,
                                                        rmse: 0.3493925767344551, st.error:  0.002228030858963717
New best params found! alpha:50000, k:5, leaf:30,
                                                        acc:  0.8782941377094092, st.error:  0.0027283227168917188,
                                                        rmse: 0.3487764869553651, st.error:  0.003899501175305083
New best param

## Final model -- BINARY -- K = 5

In [2]:
y = np.append(y_train,y_test)

del X_train, X_test, y_train, y_test

ham = 0.014399033208816327
pub = 10000000
k = 5
leaf = 60

cv_binary_knn(X_try, y, ham, pub, k, leaf)

Basic Matrix... Fri Oct  9 00:05:03 2020
Adding pubchem2d Fri Oct  9 00:07:29 2020
End distance matrix... Fri Oct  9 00:20:12 2020
Accuracy: 	 0.8959994908352071, se: 0.001252572763477651
    RMSE: 		 0.32246776985430203, se: 0.0019394954231536168
    Sensitivity: 	 0.9253063556786623, se: 0.002118546423226795
    Precision: 	 0.908949372669675, se: 0.0028363743015283986
    Specificity: 	 0.8480611908545651, se: 0.004270880015664068


# KNN -- MULTICLASS

In [1]:
from helper_knn import *

X_try, X_train, X_test, y_train, y_test, len_X_train = load_data_knn('data/lc_db_processed.csv',
                                                                     encoding = 'multiclass', seed = 42)

# Best combination
categorical = ['class', 'tax_order', 'family', 'genus', "species", 'control_type', 'media_type',
               'application_freq_unit',"exposure_type", "conc1_type", 'obs_duration_mean']

non_categorical = ['ring_number', 'tripleBond', 'doubleBond', 'alone_atom_number', 'oh_count',
                   'atom_number', 'bonds_number', 'Mol', 'MorganDensity', 'LogP']
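
In the multiclass setting the RMSE is no longer determined by the accuracy. Assuming the classes are encoded as ordered integers (an assumption, though an RMSE above 1 in some of the logs points that way), a prediction that misses by several classes costs more than one that misses by a single class. The toy labels below are hypothetical.

```python
import math

def accuracy(y_true, y_pred):
    """Share of exactly matching labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error on integer-coded class labels."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [0, 1, 2, 3, 4]
near_miss = [0, 1, 2, 3, 3]  # one error, off by one class
far_miss = [0, 1, 2, 3, 0]   # one error, off by four classes

print(accuracy(y_true, near_miss), rmse(y_true, near_miss))
print(accuracy(y_true, far_miss), rmse(y_true, far_miss))
```

Both predictions have the same accuracy but very different RMSE, so the two metrics can favour different hyperparameters in the searches that follow.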

## MULTICLASS -- K = 1
### Finding the best alpha_1 for the problem

START: alpha_1 = 0, alpha_2 = 1, alpha_3 = 0

END: alpha_1 = 0.023357214690901212, alpha_2 = 1, alpha_3 = 0

In [5]:
c = [0,0]
ham = np.logspace(-3, -1, 20) 

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_ham = ham, choice = c, ks = [1])

Tue Sep 22 22:41:52 2020
START...
Computing Euclidean ...
Adding Hamming 1 (Categorical)... alpha = 0.001
Start CV...
New best params found! alpha:0.001, k:1, leaf:10,
                                                        acc:  0.7255086885294288, st.error:  0.003222136760733884,
                                                        rmse: 0.7183435526109299, st.error:  0.007361166388361987
New best params found! alpha:0.001, k:1, leaf:20,
                                                        acc:  0.7255637340919174, st.error:  0.003546667786245106,
                                                        rmse: 0.7304215945475012, st.error:  0.005548520794540555
New best params found! alpha:0.001, k:1, leaf:80,
                                                        acc:  0.7267881501523026, st.error:  0.0033562206883952293,
                                                        rmse: 0.712964330967776, st.error:  0.005733241624523589
Adding Hamming 1 (Categorical)... alpha = 0.0

### Finding the best alpha_3, fixing best_alpha_1
START: alpha_1 = 0.023357214690901212, alpha_2 = 1, alpha_3 = 0

END1: alpha_1 = 0.023357214690901212, alpha_2 = 1, alpha_3 = 0.1

END2: alpha_1 = 0.023357214690901212, alpha_2 = 1, alpha_3 = 0.4832930238571752

In [2]:
c = [0,1]
al_ham = 0.023357214690901212
pub = np.logspace(-3, -1, 20)

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [1])

Wed Sep 23 00:19:31 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 0.001
Start CV...
New best params found! alpha:0.001, k:1, leaf:10,
                                                        acc:  0.7256198304961379, st.error:  0.0029769940217011497,
                                                        rmse: 0.7185721631305062, st.error:  0.007447037493213675
New best params found! alpha:0.001, k:1, leaf:80,
                                                        acc:  0.7263988596512782, st.error:  0.002992036837334821,
                                                        rmse: 0.7144185271959385, st.error:  0.0053160126795970515
Adding Hamming 3 (Pubchem2d)... alpha = 0.0012742749857031334
Start CV...
New best params found! alpha:0.0012742749857031334, k:1, leaf:10,
                                                        acc:  0.7265092289402431, st.error:  0.0037610557038928872,
                                       

This code chunk has to be re-run because the optimal value it returned is the last value of the interval I set, so the search range must be extended.
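
A small check of this kind can be automated. The sketch below uses hypothetical helper names and plain-Python grids instead of `np.logspace`. It flags an optimum that sits on the edge of the grid and builds a follow-up grid that starts at the old upper edge.

```python
def at_grid_edge(grid, best):
    """True if the selected value is the smallest or largest grid point."""
    return best == min(grid) or best == max(grid)

def extend_upwards(grid, n_points, decades=1.0):
    """Log-spaced grid starting at the old upper edge, spanning `decades` more."""
    top = max(grid)
    return [top * 10 ** (decades * i / (n_points - 1)) for i in range(n_points)]

first = [10 ** (-3 + 2 * i / 19) for i in range(20)]  # like np.logspace(-3, -1, 20)
best = first[-1]                                       # optimum on the upper edge

if at_grid_edge(first, best):
    second = extend_upwards(first, n_points=20)        # like np.logspace(-1, 0, 20)
    print(second[0], second[-1])
```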

In [2]:
c = [0,1]
al_ham = 0.023357214690901212
pub = np.logspace(-1, 0, 20)

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [1])

Wed Sep 23 18:47:16 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 0.1
Start CV...
New best params found! alpha:0.1, k:1, leaf:10,
                                                        acc:  0.7261202475103163, st.error:  0.0030563025397978205,
                                                        rmse: 0.7166110159073625, st.error:  0.007088816416174572
New best params found! alpha:0.1, k:1, leaf:80,
                                                        acc:  0.7271772397571133, st.error:  0.0028711383187119827,
                                                        rmse: 0.7123628431239972, st.error:  0.004992303580564422
Adding Hamming 3 (Pubchem2d)... alpha = 0.11288378916846889
Start CV...
New best params found! alpha:0.11288378916846889, k:1, leaf:10,
                                                        acc:  0.7278992143876304, st.error:  0.0035340244286078725,
                                                 

### Finding again the best alpha_1, fixing best_alpha_3

START: alpha_1 = 0.023357214690901212, alpha_2 = 1, alpha_3 = 0.4832930238571752

END1: alpha_1 = 1.0, alpha_2 = 1, alpha_3 = 0.4832930238571752

END2: alpha_1 = 2.7825594022071245, alpha_2 = 1, alpha_3 = 0.4832930238571752

In [2]:
c = [1,0]
al_pub = 0.4832930238571752

ham = np.logspace(-2, 0, 20)

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_ham = ham, a_pub = al_pub, choice = c, ks = [1])

Wed Sep 23 21:58:04 2020
START...
Computing Euclidean and Pubchem2d Matrix...
Adding Hamming 1 (Categorical)... alpha = 0.01
Start CV...
New best params found! alpha:0.01, k:1, leaf:10,
                                                        acc:  0.5677197800711883, st.error:  0.004939900535676926,
                                                        rmse: 0.9412787952512911, st.error:  0.005629515686018252
New best params found! alpha:0.01, k:1, leaf:30,
                                                        acc:  0.5720558930354382, st.error:  0.0034541145096843974,
                                                        rmse: 0.9381895946210109, st.error:  0.009681609567517947
New best params found! alpha:0.01, k:1, leaf:70,
                                                        acc:  0.5722778833513135, st.error:  0.004358228071446643,
                                                        rmse: 0.9318040691222567, st.error:  0.007922290370031204
Adding Hamming 1 (Categorica

In [2]:
c = [1,0]
al_pub = 0.4832930238571752

ham = np.logspace(0, 0.5, 10)

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_ham = ham, a_pub = al_pub, choice = c, ks = [1])

Thu Sep 24 05:47:44 2020
START...
Computing Euclidean and Pubchem2d Matrix...
Adding Hamming 1 (Categorical)... alpha = 1.0
Start CV...
New best params found! alpha:1.0, k:1, leaf:10,
                                                        acc:  0.5684981756305785, st.error:  0.004777699496653579,
                                                        rmse: 0.9419881118866616, st.error:  0.004197384066548152
New best params found! alpha:1.0, k:1, leaf:30,
                                                        acc:  0.572500615437823, st.error:  0.004151861528614078,
                                                        rmse: 0.9365807775990893, st.error:  0.01054366643876818
Adding Hamming 1 (Categorical)... alpha = 1.1364636663857248
Start CV...
Adding Hamming 1 (Categorical)... alpha = 1.2915496650148839
Start CV...
Adding Hamming 1 (Categorical)... alpha = 1.4677992676220695
Start CV...
New best params found! alpha:1.4677992676220695, k:1, leaf:70,

### Finding again the best alpha_3, fixing best_alpha_1

START: alpha_1 = 2.7825594022071245, alpha_2 = 1, alpha_3 = 0.4832930238571752

END: alpha_1 = 2.7825594022071245, alpha_2 = 1, alpha_3 = 1000
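
The procedure in this notebook alternates between the two weights: scan one grid while the other weight is held at its current best value, then swap roles. The sketch below shows that pattern on a toy objective. `alternating_search` and `toy_score` are hypothetical names; in the notebook the scoring role is played by the cross-validation inside `cv_params_new`, whose internals are not shown here.

```python
import numpy as np

def alternating_search(score, grid_1, grid_3, alpha_1, alpha_3, rounds=2):
    """Tune alpha_1 and alpha_3 one at a time, keeping the other fixed.

    `score` is a hypothetical callable (alpha_1, alpha_3) -> value to maximise,
    for example a cross-validated accuracy.
    """
    for _ in range(rounds):
        # Scan alpha_1 with alpha_3 fixed at its current value.
        alpha_1 = max(grid_1, key=lambda a: score(a, alpha_3))
        # Scan alpha_3 with alpha_1 fixed at the value just found.
        alpha_3 = max(grid_3, key=lambda a: score(alpha_1, a))
    return alpha_1, alpha_3, score(alpha_1, alpha_3)

# Toy stand-in for cross-validated accuracy, peaked at (1.0, 10.0).
def toy_score(a1, a3):
    return -(np.log10(a1) ** 2) - (np.log10(a3) - 1.0) ** 2

grid_1 = np.logspace(-2, 1, 7)   # contains 1.0
grid_3 = np.logspace(-1, 3, 9)   # contains 10.0

best_1, best_3, best = alternating_search(toy_score, grid_1, grid_3,
                                          alpha_1=1.0, alpha_3=1.0)
print(best_1, best_3, best)
```

A search of this kind stops being useful once a round returns the same pair it started with, as happens in the step where START and END are identical. It only guarantees a point that cannot be improved along either axis alone, not a joint optimum.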

In [2]:
c = [0,1]
al_ham = 2.7825594022071245
pub = np.logspace(-0.3, 1, 20)

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [1])

Thu Sep 24 10:42:25 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 0.5011872336272722
Start CV...
New best params found! alpha:0.5011872336272722, k:1, leaf:10,
                                                        acc:  0.5185697951306776, st.error:  0.0012606025896634678,
                                                        rmse: 1.134684458293093, st.error:  0.009055758374491591
New best params found! alpha:0.5011872336272722, k:1, leaf:20,
                                                        acc:  0.5198486077042462, st.error:  0.0035048587031914543,
                                                        rmse: 1.128338364279855, st.error:  0.003999009126531808
New best params found! alpha:0.5011872336272722, k:1, leaf:30,
                                                        acc:  0.5207386861047661, st.error:  0.0024258905741088616,
                                                        rmse: 1.129454408357924

Start CV...
New best params found! alpha:1.767536622987673, k:1, leaf:10,
                                                        acc:  0.5667740070588748, st.error:  0.0024814909386639138,
                                                        rmse: 1.0662405790407656, st.error:  0.009337438643036945
New best params found! alpha:1.767536622987673, k:1, leaf:20,
                                                        acc:  0.5688306897709674, st.error:  0.0029063209284844785,
                                                        rmse: 1.048098498192307, st.error:  0.008540549678760436
New best params found! alpha:1.767536622987673, k:1, leaf:40,
                                                        acc:  0.5702763698301485, st.error:  0.0035864981171653708,
                                                        rmse: 1.0427903908696359, st.error:  0.007319886720854343
Adding Hamming 3 (Pubchem2d)... alpha = 2.0691380811147897
Start CV...
New best params found! alpha:2.06913808111

Adding Hamming 3 (Pubchem2d)... alpha = 7.297227644686393
Start CV...
New best params found! alpha:7.297227644686393, k:1, leaf:20,
                                                        acc:  0.6318250509233267, st.error:  0.003144500029685537,
                                                        rmse: 0.9334032957167342, st.error:  0.008981436605725911
New best params found! alpha:7.297227644686393, k:1, leaf:30,
                                                        acc:  0.6328247722957322, st.error:  0.004429609538339608,
                                                        rmse: 0.9301073902476468, st.error:  0.00779031401464186
New best params found! alpha:7.297227644686393, k:1, leaf:40,
                                                        acc:  0.6331036316935721, st.error:  0.0024971361411862568,
                                                        rmse: 0.9383216508263216, st.error:  0.008518692385873845
New best params found! alpha:7.297227644686393, k:1, leaf

In [2]:
c = [0,1]
al_ham = 2.7825594022071245
pub = [10,20,50,100]

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [1])

Thu Sep 24 14:06:04 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 10
Start CV...
New best params found! alpha:10, k:1, leaf:10,
                                                        acc:  0.649060107065319, st.error:  0.0039927701073128275,
                                                        rmse: 0.8970446052192381, st.error:  0.0073117770855754185
New best params found! alpha:10, k:1, leaf:70,
                                                        acc:  0.6510060186960198, st.error:  0.004674900141749234,
                                                        rmse: 0.8892442255135233, st.error:  0.007892436048516993
Adding Hamming 3 (Pubchem2d)... alpha = 20
Start CV...
New best params found! alpha:20, k:1, leaf:10,
                                                        acc:  0.6760811268361334, st.error:  0.004958342546751918,
                                                        rmse: 0.8471052725774729, st.err

In [2]:
c = [0,1]
al_ham = 2.7825594022071245
pub = [100,200,500,1000,10000]

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [1])

Mon Sep 28 10:52:21 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 100
Start CV...
New best params found! alpha:100, k:1, leaf:10,
                                                        acc:  0.7155011363771583, st.error:  0.004299724377315707,
                                                        rmse: 0.7488069810593526, st.error:  0.006479600703800599
New best params found! alpha:100, k:1, leaf:70,
                                                        acc:  0.7164458121870751, st.error:  0.003889738274597874,
                                                        rmse: 0.7470222994556968, st.error:  0.006509207008171904
Adding Hamming 3 (Pubchem2d)... alpha = 200
Start CV...
New best params found! alpha:200, k:1, leaf:10,
                                                        acc:  0.7252861882462425, st.error:  0.00456556607363394,
                                                        rmse: 0.7321440495009776, st.

### Finding again the best alpha_1, fixing best_alpha_3

START: alpha_1 = 2.7825594022071245, alpha_2 = 1, alpha_3 = 1000

END: alpha_1 = 2.7825594022071245, alpha_2 = 1, alpha_3 = 1000

In [2]:
c = [1,0]
al_pub = 1000

ham = [3, 5, 10, 100, 1000]

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_ham = ham, a_pub = al_pub, choice = c, ks = [1])

Mon Sep 28 12:53:59 2020
START...
Computing Euclidean and Pubchem2d Matrix...
Adding Hamming 1 (Categorical)... alpha = 3
Start CV...
New best params found! alpha:3, k:1, leaf:10,
                                                        acc:  0.5694989169376061, st.error:  0.005365420350894578,
                                                        rmse: 0.9437098797254124, st.error:  0.0050949059820642835
New best params found! alpha:3, k:1, leaf:30,
                                                        acc:  0.5725560782462934, st.error:  0.0043481487470328015,
                                                        rmse: 0.9380019223084393, st.error:  0.011840056503234513
New best params found! alpha:3, k:1, leaf:70,
                                                        acc:  0.5731674672380773, st.error:  0.004254024042303233,
                                                        rmse: 0.936920913571455, st.error:  0.007937711609441634
Adding Hamming 1 (Categorical)... alpha 

## Final model -- MULTICLASS -- K = 1
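The log messages ("Basic Matrix", "Adding pubchem2d") suggest that the model works on one precomputed distance matrix built from three terms weighted by alpha_1, alpha_2 and alpha_3. The sketch below is one plausible reading of that construction, assuming a plain weighted sum of a Hamming distance on the categorical columns, a Euclidean distance on the numerical columns, and a Hamming distance on the pubchem2d fingerprint bits. The function names and the toy arrays are illustrative; the actual code in `helper_knn` may differ.

```python
import numpy as np

def hamming_matrix(A, B):
    """Fraction of differing positions between each row of A and each row of B."""
    return (A[:, None, :] != B[None, :, :]).mean(axis=2)

def euclidean_matrix(A, B):
    """Euclidean distance between each row of A and each row of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def combined_distance(cat_a, num_a, fp_a, cat_b, num_b, fp_b,
                      alpha_1, alpha_3, alpha_2=1.0):
    """Assumed form: alpha_1 * Hamming + alpha_2 * Euclidean + alpha_3 * Hamming."""
    return (alpha_1 * hamming_matrix(cat_a, cat_b)
            + alpha_2 * euclidean_matrix(num_a, num_b)
            + alpha_3 * hamming_matrix(fp_a, fp_b))

def predict_1nn(dist_query_to_train, y_train):
    """Label of the closest training row for each query row (K = 1)."""
    return y_train[np.argmin(dist_query_to_train, axis=1)]

# Toy training set: two rows per class.
cat_train = np.array([[0, 1], [0, 1], [2, 3], [2, 3]])
num_train = np.array([[0.0], [0.1], [1.0], [1.1]])
fp_train = np.array([[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0]])
y_train = np.array([0, 0, 1, 1])

# Query: numerically close to class 1, categorically and structurally class 0.
cat_q, num_q, fp_q = np.array([[0, 1]]), np.array([[0.9]]), np.array([[1, 0, 0, 0]])

d_euclid_only = combined_distance(cat_q, num_q, fp_q, cat_train, num_train, fp_train,
                                  alpha_1=0.0, alpha_3=0.0)
d_weighted = combined_distance(cat_q, num_q, fp_q, cat_train, num_train, fp_train,
                               alpha_1=2.7825594022071245, alpha_3=1000)

pred_euclid_only = predict_1nn(d_euclid_only, y_train)
pred_weighted = predict_1nn(d_weighted, y_train)
print(pred_euclid_only, pred_weighted)
```

With both extra weights at zero the query follows its numerical neighbour. With a large alpha_3 the fingerprint term dominates and the prediction flips, which is the kind of effect the weight searches above are measuring.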

In [2]:
y = np.append(y_train,y_test)

del X_train, X_test, y_train, y_test

ham = 2.7825594022071245
pub = 100
k = 1
leaf = 30

cv_multiclass_knn(X_try, y, ham, pub, k, leaf)

Basic Matrix... Thu Sep 24 15:37:29 2020
Adding pubchem2d Thu Sep 24 15:40:59 2020
End distance matrix... Thu Sep 24 15:53:58 2020
Accuracy: 	 0.7397005500575238, se: 0.0018383759458390201
RMSE: 		 0.6947909402973472, se: 0.00219610172761468
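
Accuracy and RMSE answer different questions in the multiclass setting. The sketch below assumes that the class labels are integers encoding ordered classes and that RMSE is computed directly on those integers. The notebook does not show how `cv_multiclass_knn` computes it, so treat this as an illustration of the idea rather than the helper's code.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Share of predictions that hit the exact class."""
    return float(np.mean(y_true == y_pred))

def rmse(y_true, y_pred):
    """Root mean squared difference between integer class labels."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = np.array([0, 1, 2, 3, 4, 2])
near_miss = np.array([0, 1, 2, 3, 4, 3])  # one error, one class away
far_miss = np.array([0, 1, 2, 3, 4, 0])   # one error, two classes away

print(accuracy(y_true, near_miss), rmse(y_true, near_miss))
print(accuracy(y_true, far_miss), rmse(y_true, far_miss))
```

Both prediction vectors have the same accuracy, but the one that misses by two classes has the larger RMSE. Under this assumption a configuration can improve RMSE without improving accuracy, and vice versa.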


In [2]:
y = np.append(y_train,y_test)

del X_train, X_test, y_train, y_test

ham = 2.7825594022071245
pub = 1000
k = 1
leaf = 30

cv_multiclass_knn(X_try, y, ham, pub, k, leaf)

Basic Matrix... Thu Oct  8 17:31:16 2020
Adding pubchem2d Thu Oct  8 17:33:28 2020
End distance matrix... Thu Oct  8 17:45:41 2020
Accuracy: 	 0.7559785276743899, se: 0.0028283453879600776
RMSE: 		 0.6547477665180885, se: 0.005520359369533257
W. Recall: 	 0.7559785276743899, se:0.0028283453879600776
W. Precision: 	 0.7564643020670653, se: 0.0030486937987545958
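
The weighted recall printed above is identical to the accuracy, digit for digit. That is expected if "W. Recall" means recall averaged over classes with weights proportional to class support: each class contributes (n_c / N) * (TP_c / n_c), and the sum collapses to the total number of correct predictions divided by N. A small sketch that checks this numerically on random labels (the function names are illustrative):

```python
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def weighted_recall(y_true, y_pred):
    """Per-class recall, averaged with weights equal to class support."""
    classes, support = np.unique(y_true, return_counts=True)
    recalls = np.array([np.mean(y_pred[y_true == c] == c) for c in classes])
    return float(np.sum(support * recalls) / support.sum())

rng = np.random.default_rng(0)
y_true = rng.integers(0, 5, size=1000)
y_pred = rng.integers(0, 5, size=1000)

acc = accuracy(y_true, y_pred)
w_rec = weighted_recall(y_true, y_pred)
print(acc, w_rec)
```

Weighted precision does not collapse in the same way, because its per-class denominators count predicted rather than true members. That is consistent with the slightly different "W. Precision" values in the outputs.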


## MULTICLASS -- K = 3
### Finding the best alpha_1 for the problem

START: alpha_1 = 0, alpha_2 = 1, alpha_3 = 0

END: alpha_1 = 0.00206913808111479, alpha_2 = 1, alpha_3 = 0

In [2]:
c = [0,0]
ham = np.logspace(-3, -1, 20) 

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_ham = ham, choice = c, ks = [3])

Wed Oct  7 12:46:21 2020
START...
Computing Euclidean ...
Adding Hamming 1 (Categorical)... alpha = 0.001
Start CV...
New best params found! alpha:0.001, k:3, leaf:10,
                                                        acc:  0.68998124402044, st.error:  0.0015979496350606635,
                                                        rmse: 0.8011173418651548, st.error:  0.0072205164247127135
New best params found! alpha:0.001, k:3, leaf:20,
                                                        acc:  0.69420633864642, st.error:  0.003615355802171918,
                                                        rmse: 0.7970066060536131, st.error:  0.0036302345163051492
Adding Hamming 1 (Categorical)... alpha = 0.0012742749857031334
Start CV...
Adding Hamming 1 (Categorical)... alpha = 0.001623776739188721
Start CV...
Adding Hamming 1 (Categorical)... alpha = 0.00206913808111479
Start CV...
New best params found! alpha:0.00206913808111479, k:3, leaf:20,

### Finding the best alpha_3, fixing best_alpha_1

START: alpha_1 = 0.00206913808111479, alpha_2 = 1, alpha_3 = 0

END: alpha_1 = 0.00206913808111479, alpha_2 = 1, alpha_3 = 10

In [2]:
c = [0,1]
al_ham = 0.00206913808111479
pub = np.logspace(-3, 0, 20)

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [3])

Wed Oct  7 16:00:50 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 0.001
Start CV...
New best params found! alpha:0.001, k:3, leaf:10,
                                                        acc:  0.6903704418001351, st.error:  0.0016663242776706667,
                                                        rmse: 0.8010746051527603, st.error:  0.007437915153043073
New best params found! alpha:0.001, k:3, leaf:20,
                                                        acc:  0.6945399345356508, st.error:  0.0038184457234268854,
                                                        rmse: 0.7965828865900018, st.error:  0.0038968597484611922
Adding Hamming 3 (Pubchem2d)... alpha = 0.0014384498882876629
Start CV...
Adding Hamming 3 (Pubchem2d)... alpha = 0.00206913808111479
Start CV...
Adding Hamming 3 (Pubchem2d)... alpha = 0.002976351441631319
Start CV...
New best params found! alpha:0.002976351441631319, k:3, leaf:20,

In [2]:
c = [0,1]
al_ham = 0.00206913808111479
pub = [i for i in np.logspace(0,0.3,5)] + [5, 10, 100, 1000]

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [3])

Wed Oct  7 19:08:08 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 1.0
Start CV...
New best params found! alpha:1.0, k:3, leaf:10,
                                                        acc:  0.6919271556511408, st.error:  0.0021087179731637826,
                                                        rmse: 0.7955391543258468, st.error:  0.005211288449478407
New best params found! alpha:1.0, k:3, leaf:20,
                                                        acc:  0.6972640562830833, st.error:  0.004141454665682889,
                                                        rmse: 0.789355812390519, st.error:  0.004159057944825247
Adding Hamming 3 (Pubchem2d)... alpha = 1.1885022274370185
Start CV...
Adding Hamming 3 (Pubchem2d)... alpha = 1.4125375446227544
Start CV...
New best params found! alpha:1.4125375446227544, k:3, leaf:90,
                                                        acc:  0.6973203226764073, st.error:  0.001

### Finding again the best alpha_1, fixing best_alpha_3

START: alpha_1 = 0.00206913808111479, alpha_2 = 1, alpha_3 = 10

END: alpha_1 = 1, alpha_2 = 1, alpha_3 = 10

In [2]:
c = [1,0]
al_pub = 10

ham = [i for i in np.logspace(-2.7,-2,10)] + [0.05, 0.1, 0.5, 1, 1.5, 2]

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_ham = ham, a_pub = al_pub, choice = c, ks = [3])

Wed Oct  7 20:33:39 2020
START...
Computing Euclidean and Pubchem2d Matrix...
Adding Hamming 1 (Categorical)... alpha = 0.001995262314968879
Start CV...
New best params found! alpha:0.001995262314968879, k:3, leaf:10,
                                                        acc:  0.5846215648486023, st.error:  0.002058225798475452,
                                                        rmse: 0.9455513238579115, st.error:  0.004300381695815527
New best params found! alpha:0.001995262314968879, k:3, leaf:20,
                                                        acc:  0.5901798994684441, st.error:  0.0071514246815722075,
                                                        rmse: 0.9338511759157491, st.error:  0.009760347369253266
New best params found! alpha:0.001995262314968879, k:3, leaf:30,
                                                        acc:  0.5927951045610704, st.error:  0.004157211535198321,
                                                        rmse: 0.94330063421629

### Finding again the best alpha_3, fixing best_alpha_1

START: alpha_1 = 1, alpha_2 = 1, alpha_3 = 10

END: alpha_1 = 1, alpha_2 = 1, alpha_3 = 100

In [8]:
c = [0,1]
al_ham = 1
pub = [i for i in np.logspace(1,1.4,8)] + [50, 100, 1000]

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [3])

Wed Oct  7 23:06:30 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 10.0
Start CV...
New best params found! alpha:10.0, k:3, leaf:10,
                                                        acc:  0.6759152174709238, st.error:  0.0034387538070697046,
                                                        rmse: 0.8446929153566343, st.error:  0.003471013289705158
New best params found! alpha:10.0, k:3, leaf:20,
                                                        acc:  0.6775267450810948, st.error:  0.0031063743209786094,
                                                        rmse: 0.8435730280796532, st.error:  0.003505432628349781
New best params found! alpha:10.0, k:3, leaf:40,
                                                        acc:  0.6779716220190286, st.error:  0.0010743030894290457,
                                                        rmse: 0.8393297366754549, st.error:  0.005698400827825148
New best params fou

## Final model -- MULTICLASS -- K = 3

In [2]:
y = np.append(y_train,y_test)

del X_train, X_test, y_train, y_test

ham = 1
pub = 100
k = 3
leaf = 70

cv_multiclass_knn(X_try, y, ham, pub, k, leaf)

Basic Matrix... Thu Oct  8 10:20:02 2020
Adding pubchem2d Thu Oct  8 10:23:27 2020
End distance matrix... Thu Oct  8 10:36:15 2020
Accuracy: 	 0.7348951998038259, se: 0.0015951052564620268
RMSE: 		 0.7194612001611612, se: 0.006722519007522795
W. Recall: 0.7348951998038259, se:0.0015951052564620268
W. Precision: 0.7361102745239585, se: 0.0017262402569629813
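
Each metric in these outputs is reported with a standard error ("se" or "st.error"). A minimal sketch of how such a value is commonly obtained from per-fold scores, assuming the sample standard deviation divided by the square root of the number of folds. Whether `helper_knn` uses `ddof=1` or `ddof=0` is not visible in this notebook, and the fold scores below are made-up numbers for illustration only.

```python
import numpy as np

def mean_and_se(fold_scores, ddof=1):
    """Mean of the fold scores and the standard error of that mean."""
    scores = np.asarray(fold_scores, dtype=float)
    se = scores.std(ddof=ddof) / np.sqrt(scores.size)
    return float(scores.mean()), float(se)

# Illustrative fold accuracies, NOT taken from the notebook's runs.
fold_acc = [0.733, 0.736, 0.735, 0.737, 0.734]
mean_acc, se_acc = mean_and_se(fold_acc)
print(mean_acc, se_acc)
```

Reading the search logs with the standard error in mind helps: several "New best params found!" lines differ from the previous best by less than one standard error, so those steps may reflect fold-to-fold noise rather than a real gain.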


## MULTICLASS -- K = 5
### Finding the best alpha_1 for the problem

START: alpha_1 = 0, alpha_2 = 1, alpha_3 = 0

END: alpha_1 = 0.11288378916846889, alpha_2 = 1, alpha_3 = 0

In [2]:
c = [0,0]
ham = [i for i in np.logspace(-3, -2, 5)] + [i for i in np.logspace(-2, 0.5, 20) ]

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_ham = ham, choice = c, ks = [5])

Thu Oct  8 00:48:41 2020
START...
Computing Euclidean ...
Adding Hamming 1 (Categorical)... alpha = 0.001
Start CV...
New best params found! alpha:0.001, k:5, leaf:10,
                                                        acc:  0.6646273692038158, st.error:  0.0032536255150229683,
                                                        rmse: 0.8212029138071324, st.error:  0.007196875193561847
New best params found! alpha:0.001, k:5, leaf:20,
                                                        acc:  0.6665737135340534, st.error:  0.004978374400579129,
                                                        rmse: 0.8108664017188154, st.error:  0.006596650614280178
New best params found! alpha:0.001, k:5, leaf:70,
                                                        acc:  0.6672414925476005, st.error:  0.00307232633614994,
                                                        rmse: 0.8207551397825654, st.error:  0.005170232705348263
Adding Hamming 1 (Categorical)... alpha = 0.0

### Finding the best alpha_3, fixing best_alpha_1

START: alpha_1 = 0.11288378916846889, alpha_2 = 1, alpha_3 = 0

END: alpha_1 = 0.11288378916846889, alpha_2 = 1, alpha_3 = 0.08858667904100823

In [3]:
c = [0,1]
al_ham = 0.11288378916846889
pub = [i for i in np.logspace(-2, 0, 20) ] + [1.5, 2, 10, 100, 1000]

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [5])

Thu Oct  8 06:33:18 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 0.01
Start CV...
New best params found! alpha:0.01, k:5, leaf:10,
                                                        acc:  0.6668521556859115, st.error:  0.0009265824063318593,
                                                        rmse: 0.8306759684394383, st.error:  0.004157103390011953
New best params found! alpha:0.01, k:5, leaf:20,
                                                        acc:  0.6689644557420233, st.error:  0.004827741434402463,
                                                        rmse: 0.8190929084809557, st.error:  0.005216369343412883
New best params found! alpha:0.01, k:5, leaf:100,
                                                        acc:  0.67091025919784, st.error:  0.0030450429447324364,
                                                        rmse: 0.8201660875450827, st.error:  0.0020699903863519277
Adding Hamming 3 (Pu

### Finding again the best alpha_1, fixing best_alpha_3

START: alpha_1 = 0.11288378916846889, alpha_2 = 1, alpha_3 = 0.08858667904100823

END: alpha_1 = 0.774263682681127, alpha_2 = 1, alpha_3 = 0.08858667904100823

In [7]:
c = [1,0]
al_pub = 10

ham = [i for i in np.logspace(-1, 0, 10)] + [1.1, 1.5, 2, 10, 100]

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_ham = ham, a_pub = al_pub, choice = c, ks = [5])

Thu Oct  8 11:08:26 2020
START...
Computing Euclidean and Pubchem2d Matrix...
Adding Hamming 1 (Categorical)... alpha = 0.1
Start CV...
New best params found! alpha:0.1, k:5, leaf:10,
                                                        acc:  0.5859558093235315, st.error:  0.004031385251592011,
                                                        rmse: 0.9214804228981427, st.error:  0.007096176122142234
New best params found! alpha:0.1, k:5, leaf:60,
                                                        acc:  0.5892357799865029, st.error:  0.005711865142076941,
                                                        rmse: 0.9099137207211532, st.error:  0.008805354782220519
Adding Hamming 1 (Categorical)... alpha = 0.1291549665014884
Start CV...
Adding Hamming 1 (Categorical)... alpha = 0.16681005372000587
Start CV...
New best params found! alpha:0.16681005372000587, k:5, leaf:80,
                                                        acc:  0.5914595156268665, st.error:  0.0059

### Finding again the best alpha_3, fixing best_alpha_1

START: alpha_1 = 0.774263682681127, alpha_2 = 1, alpha_3 = 0.08858667904100823

END: alpha_1 = 0.774263682681127, alpha_2 = 1, alpha_3 = not recorded (the final K = 5 model uses pub = 100)

In [2]:
c = [0,1]
al_ham = 0.774263682681127
pub = [i for i in np.logspace(-1.1,0,15)] + [1.5, 2, 10, 100, 1000, 10000]

best_acc, best_alpha, best_k, best_leaf = cv_params_new(X_train, y_train, categorical, non_categorical,
                                                    sequence_pub = pub, a_ham = al_ham, choice = c, ks = [5])

Thu Oct  8 13:43:28 2020
START...
Computing Basic Matrix: Hamming 1 and Euclidean 2...
Adding Hamming 3 (Pubchem2d)... alpha = 0.07943282347242814
Start CV...
New best params found! alpha:0.07943282347242814, k:5, leaf:10,
                                                        acc:  0.607528508331707, st.error:  0.0021537710058528287,
                                                        rmse: 0.9752715640899847, st.error:  0.0069680814251431
New best params found! alpha:0.07943282347242814, k:5, leaf:60,
                                                        acc:  0.60797214898525, st.error:  0.0034689652323369,
                                                        rmse: 0.9673461168830514, st.error:  0.007902821981398815
New best params found! alpha:0.07943282347242814, k:5, leaf:90,
                                                        acc:  0.6079722262530244, st.error:  0.0044230020059025646,
                                                        rmse: 0.9643704054871804,

Start CV...
New best params found! alpha:10, k:5, leaf:10,
                                                        acc:  0.6661846857434621, st.error:  0.003064454001194414,
                                                        rmse: 0.8114504943147214, st.error:  0.009921746042263103
New best params found! alpha:10, k:5, leaf:20,
                                                        acc:  0.6674079582407859, st.error:  0.0016260798875357854,
                                                        rmse: 0.8168949415290927, st.error:  0.004720413215430576
New best params found! alpha:10, k:5, leaf:60,
                                                        acc:  0.670299411080477, st.error:  0.0017198016342288072,
                                                        rmse: 0.812455810051992, st.error:  0.004854500801423748
New best params found! alpha:10, k:5, leaf:80,
                                                        acc:  0.6716893501671997, st.error:  0.001652658000193140

## Final model -- MULTICLASS -- K = 5

In [2]:
y = np.append(y_train,y_test)

del X_train, X_test, y_train, y_test

ham = 0.774263682681127
pub = 100
k = 5
leaf = 90

cv_multiclass_knn(X_try, y, ham, pub, k, leaf)

Basic Matrix... Thu Oct  8 16:56:01 2020
Adding pubchem2d Thu Oct  8 16:59:29 2020
End distance matrix... Thu Oct  8 17:12:49 2020
Accuracy: 	 0.7131417999669112, se: 0.0027060852688013727
RMSE: 		 0.7372753493731985, se: 0.006932540723191115
W. Recall: 	 0.7131417999669112, se:0.0027060852688013727
W. Precision: 	 0.7145671758599701, se: 0.0028212544594714978
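
The final runs reported in this notebook can be put side by side. The snippet below only copies the accuracy and RMSE values from the "Final model" outputs above into a list and ranks them. The dictionary keys are illustrative names, not part of `helper_knn`.

```python
# Final cross-validated results copied from the outputs in this notebook.
final_runs = [
    {"k": 1, "alpha_1": 2.7825594022071245, "alpha_3": 100,
     "acc": 0.7397005500575238, "rmse": 0.6947909402973472},
    {"k": 1, "alpha_1": 2.7825594022071245, "alpha_3": 1000,
     "acc": 0.7559785276743899, "rmse": 0.6547477665180885},
    {"k": 3, "alpha_1": 1, "alpha_3": 100,
     "acc": 0.7348951998038259, "rmse": 0.7194612001611612},
    {"k": 5, "alpha_1": 0.774263682681127, "alpha_3": 100,
     "acc": 0.7131417999669112, "rmse": 0.7372753493731985},
]

# Rank by accuracy, highest first.
ranked = sorted(final_runs, key=lambda r: r["acc"], reverse=True)
for r in ranked:
    print(f"k={r['k']}  alpha_1={r['alpha_1']:<8.4g} alpha_3={r['alpha_3']:<5} "
          f"acc={r['acc']:.4f}  rmse={r['rmse']:.4f}")

best = ranked[0]
lowest_rmse = min(final_runs, key=lambda r: r["rmse"])
```

Among the runs recorded here, the same configuration ranks first on both accuracy and RMSE. Note that only the K = 1 model was evaluated with alpha_3 = 1000, so the comparison across K is not fully like for like.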
