## Evaluate Non-Linear Models

Here I will test the following models:

+ KNN
+ Kernel ridge regression
+ Support Vector Machines
+ Gaussian Processes

I will use the standard preprocessing pipeline.

In [1]:
import numpy as np
import pandas as pd

# sklearn imports
from sklearn.neighbors import KNeighborsRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR
# my module imports
from optimalcodon.projects.rnastability.dataprocessing import get_data, general_preprocesing_pipeline
from optimalcodon.projects.rnastability import modelevaluation


In [2]:
(train_x, train_y), (test_x, test_y) = get_data("../19-04-30-EDA/results_data/")
print("{} points for training and {} for testing with {} features".format(
    train_x.shape[0], test_x.shape[0], test_x.shape[1]))

67817 points for training and 7534 for testing with 6 features


*** 

## Data Pre-processing

In [3]:
preprocessing = general_preprocesing_pipeline(train_x)

preprocessing.fit(train_x)
train_x_transformed = preprocessing.transform(train_x)

train_x_transformed.shape

(67817, 80)

***

## KNN

In [4]:
knn_reg = KNeighborsRegressor(weights='distance')  # distance weighting gives the best results

knn_grid = dict(
    n_neighbors = np.arange(5, 10)
)

knn_search = modelevaluation.gridsearch(knn_reg, knn_grid, train_x_transformed, train_y, cores=15)

Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=15)]: Using backend LokyBackend with 15 concurrent workers.
[Parallel(n_jobs=15)]: Done   2 out of  15 | elapsed: 19.2min remaining: 124.7min
[Parallel(n_jobs=15)]: Done   4 out of  15 | elapsed: 19.5min remaining: 53.5min
[Parallel(n_jobs=15)]: Done   6 out of  15 | elapsed: 20.0min remaining: 30.0min
[Parallel(n_jobs=15)]: Done   8 out of  15 | elapsed: 20.1min remaining: 17.6min
[Parallel(n_jobs=15)]: Done  10 out of  15 | elapsed: 20.2min remaining: 10.1min
[Parallel(n_jobs=15)]: Done  12 out of  15 | elapsed: 20.6min remaining:  5.1min
[Parallel(n_jobs=15)]: Done  15 out of  15 | elapsed: 20.9min finished


Best Score R2 =  0.23028382142185025
Best Parameters:  {'n_neighbors': 7}
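
The `modelevaluation.gridsearch` helper lives in my own module and is not shown in this notebook. Below is a minimal sketch of a comparable helper, assuming it wraps scikit-learn's `GridSearchCV` with R2 scoring and 3 folds by default. The name `gridsearch_sketch`, the shuffled `KFold`, and the synthetic data are all hypothetical and only serve to make the example runnable.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor


def gridsearch_sketch(estimator, param_grid, x, y, n_splits=3, cores=1):
    """Hypothetical stand-in for modelevaluation.gridsearch (assumption)."""
    # Shuffled k-fold split; the real helper may split differently
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    search = GridSearchCV(
        estimator, param_grid, scoring="r2", cv=cv, n_jobs=cores, verbose=1
    )
    search.fit(x, y)
    print("Best Score R2 = ", search.best_score_)
    print("Best Parameters: ", search.best_params_)
    return search


# Synthetic data, only to make the sketch runnable
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 4))
y_demo = np.sin(x_demo[:, 0]) + 0.1 * rng.normal(size=200)

demo_search = gridsearch_sketch(
    KNeighborsRegressor(weights="distance"),
    {"n_neighbors": np.arange(5, 10)},
    x_demo,
    y_demo,
)
```

The returned object exposes `best_estimator_`, `best_params_` and `cv_results_`, which is what the later cells rely on.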


***

## Kernel Ridge

In [5]:
# gamma=0.1 is only a starting value; the grid below overrides it
kerRidge_reg = KernelRidge(kernel='rbf', gamma=0.1)
kerRidge_grid = {
    "alpha": [0.9, 0.5, 0.2, 0.1],
    "gamma": np.array([1e-3, 1e-2, 1e-1, 1e0]),
}

kerRidge_search = modelevaluation.gridsearch(kerRidge_reg, kerRidge_grid, train_x_transformed, train_y, cores=10)

Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   5 tasks      | elapsed: 36.2min
[Parallel(n_jobs=10)]: Done  12 tasks      | elapsed: 68.7min
[Parallel(n_jobs=10)]: Done  21 tasks      | elapsed: 100.4min
[Parallel(n_jobs=10)]: Done  34 out of  48 | elapsed: 150.7min remaining: 62.1min
[Parallel(n_jobs=10)]: Done  39 out of  48 | elapsed: 165.3min remaining: 38.2min
[Parallel(n_jobs=10)]: Done  44 out of  48 | elapsed: 196.9min remaining: 17.9min
[Parallel(n_jobs=10)]: Done  48 out of  48 | elapsed: 210.3min finished


Best Score R2 =  0.2886146749972631
Best Parameters:  {'alpha': 0.2, 'gamma': 0.01}
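
Part of the long run time is expected: scikit-learn's `KernelRidge` builds a dense n x n kernel matrix over the rows used in each fit. A rough back-of-envelope estimate of that matrix size, assuming 64-bit floats and about two thirds of the training rows per fit (3 folds); the function name is hypothetical:

```python
def kernel_matrix_gib(n_rows, bytes_per_value=8):
    """Memory in GiB for a dense n_rows x n_rows kernel matrix."""
    return n_rows ** 2 * bytes_per_value / 2 ** 30


n_total = 67817            # training rows reported above
n_fit = n_total * 2 // 3   # approximate rows per fit with 3 folds
print(f"{n_fit} rows -> {kernel_matrix_gib(n_fit):.1f} GiB per fit")
```

Each concurrent worker holds its own kernel matrix, so memory use grows with the number of parallel fits.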


***

## SVM (RBF kernel)

In [7]:
svr_reg = SVR()
svr_grid = {'C': [10, 100, 1000],
            'gamma': [0.01, 0.001],
            'kernel': ['rbf']}


svr_search = modelevaluation.gridsearch(
    svr_reg, svr_grid,
    train_x_transformed,
    train_y, n_splits=2,
    cores=20)

Fitting 2 folds for each of 6 candidates, totalling 12 fits


[Parallel(n_jobs=20)]: Using backend LokyBackend with 20 concurrent workers.
[Parallel(n_jobs=20)]: Done   1 tasks      | elapsed: 10.7min
[Parallel(n_jobs=20)]: Done   3 out of  12 | elapsed: 61.9min remaining: 185.7min
[Parallel(n_jobs=20)]: Done   5 out of  12 | elapsed: 74.8min remaining: 104.8min
[Parallel(n_jobs=20)]: Done   7 out of  12 | elapsed: 224.0min remaining: 160.0min
[Parallel(n_jobs=20)]: Done   9 out of  12 | elapsed: 309.1min remaining: 103.0min
[Parallel(n_jobs=20)]: Done  12 out of  12 | elapsed: 897.5min finished


Best Score R2 =  0.22691931181966343
Best Parameters:  {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}


In [9]:
mymodels = {
    'knn': knn_search.best_estimator_,
    'kernel_ridge': kerRidge_search.best_estimator_,
    'svm': svr_search.best_estimator_
}
modelevaluation.eval_models(mymodels, preprocessing, test_x, test_y).to_csv("results_data/val_non-linearmodels.csv")

generating predictions for model: knn
generating predictions for model: kernel_ridge
generating predictions for model: svm
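
`modelevaluation.eval_models` is also part of my module and not shown here. A minimal sketch of a comparable helper, assuming it transforms the raw test set with the already fitted pipeline and reports R2 per model. The function name, the column names, the `StandardScaler` stand-in for the pipeline, and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler


def eval_models_sketch(models, preprocessing, x_test, y_test):
    """Hypothetical stand-in for modelevaluation.eval_models (assumption)."""
    # Only transform: the pipeline was already fitted on the training set
    x_test_transformed = preprocessing.transform(x_test)
    rows = []
    for name, model in models.items():
        print("generating predictions for model:", name)
        predictions = model.predict(x_test_transformed)
        rows.append({"model": name, "r2": r2_score(y_test, predictions)})
    return pd.DataFrame(rows)


# Synthetic data and a stand-in scaler, only to make the sketch runnable
rng = np.random.default_rng(0)
x_all = rng.normal(size=(300, 3))
y_all = x_all[:, 0] - 2 * x_all[:, 1] + 0.1 * rng.normal(size=300)
x_tr, x_te, y_tr, y_te = x_all[:240], x_all[240:], y_all[:240], y_all[240:]

scaler = StandardScaler().fit(x_tr)
fitted = {"knn": KNeighborsRegressor(n_neighbors=7).fit(scaler.transform(x_tr), y_tr)}
demo_eval = eval_models_sketch(fitted, scaler, x_te, y_te)
```

The key design point is that the test set is never used to fit the preprocessing step, only to transform and score.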


In [13]:
# perform cross validation
modelevaluation.crossvalidation(
    mymodels,
    train_x_transformed,
    train_y,
    n_splits = 5
).to_csv('results_data/cv_non-linearmodels.csv', index=False)

cv for model: knn
cv for model: kernel_ridge
cv for model: svm
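
`modelevaluation.crossvalidation` is not shown in this notebook either. A minimal sketch of a comparable helper, assuming it runs k-fold cross-validation for each model and returns a long-format table of R2 scores. The function name, the column names, and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor


def crossvalidation_sketch(models, x, y, n_splits=5):
    """Hypothetical stand-in for modelevaluation.crossvalidation (assumption)."""
    rows = []
    for name, model in models.items():
        print("cv for model:", name)
        # One R2 score per fold; the estimator is cloned and refit on each fold
        scores = cross_val_score(model, x, y, cv=n_splits, scoring="r2")
        for fold, score in enumerate(scores):
            rows.append({"model": name, "fold": fold, "r2": score})
    return pd.DataFrame(rows)


# Synthetic data, only to make the sketch runnable
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(150, 3))
y_demo = x_demo[:, 0] ** 2 + 0.1 * rng.normal(size=150)

demo_models = {
    "knn": KNeighborsRegressor(n_neighbors=7, weights="distance"),
    "kernel_ridge": KernelRidge(kernel="rbf", alpha=0.2, gamma=0.01),
}
cv_results = crossvalidation_sketch(demo_models, x_demo, y_demo)
```

A long-format table with one row per model and fold is convenient for plotting score distributions later.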
