# Evaluate Linear Models

Here, I evaluate the following linear models:

+ LinearRegression
+ PLSRegression
+ Lasso
+ ElasticNet

## NOTE:

For the linear models, I add second-degree polynomial features.

In [1]:
import numpy as np
import pandas as pd

# sklearn import
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.cross_decomposition import PLSRegression

# my module imports
from optimalcodon.projects.rnastability.dataprocessing import get_data, general_preprocesing_pipeline
from optimalcodon.projects.rnastability import modelevaluation

In [2]:
(train_x, train_y), (test_x, test_y) = get_data("../19-04-30-EDA/results_data/")

In [3]:
print("{} points for training and {} for testing with {} features".format(
    train_x.shape[0], test_x.shape[0], test_x.shape[1]))

67817 points for training and 7534 for testing with 6 features


***

## Data Pre-processing

In [4]:
# pre-process Pipeline

preprocessing = Pipeline([
    ('general', general_preprocesing_pipeline(train_x)), # see the code for general_preprocesing_pipeline
    ('polyfeaturs', PolynomialFeatures(degree=2)),
    ('zerovar', VarianceThreshold(threshold=0.0)),
    ('scaling', StandardScaler()) # scale again, since not all polynomial features may be well scaled
])


preprocessing.fit(train_x)
train_x_transformed = preprocessing.transform(train_x)

In [5]:
train_x_transformed.shape

(67817, 3320)
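
The 3320 columns are consistent with a full degree-2 expansion of 80 input features. `PolynomialFeatures(degree=2)` with its default bias column produces C(n + 2, 2) columns for n inputs, and the constant bias column has zero variance, so the `zerovar` step would drop it. The figure of 80 columns coming out of `general_preprocesing_pipeline` is inferred from this arithmetic and is not shown anywhere in the notebook. A quick check in plain Python:

```python
from math import comb


def n_polynomial_features(n_inputs, degree=2, include_bias=True):
    """Number of columns produced by a full polynomial expansion."""
    total = comb(n_inputs + degree, degree)
    return total if include_bias else total - 1


# Assumption: the general preprocessing step outputs 80 columns.
n_inputs = 80
n_poly = n_polynomial_features(n_inputs)  # includes the constant bias column
n_kept = n_poly - 1                       # the bias column has zero variance
print(n_poly, n_kept)
```

If other columns were also constant, the input count would be different, so treat this only as a plausibility check on the shape.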

***
## Linear Regression

In [6]:
lm_reg = Pipeline([
    ('lm', LinearRegression())
])

lm_grid = dict()

lm_search = modelevaluation.gridsearch(lm_reg, lm_grid, train_x_transformed, train_y)

Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=32)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done   3 out of   3 | elapsed:  2.3min remaining:    0.0s
[Parallel(n_jobs=32)]: Done   3 out of   3 | elapsed:  2.3min finished


Best Score R2 =  0.1706582260853268
Best Parameters:  {}
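
The `modelevaluation.gridsearch` helper is not shown in this notebook. Judging from the log, it runs a 3-fold grid search on 32 workers and reports the best R2 score and parameters. The sketch below is a hypothetical stand-in built on scikit-learn's `GridSearchCV`; the function name `gridsearch_sketch` and its defaults are assumptions, not the module's actual code.

```python
from sklearn.model_selection import GridSearchCV


def gridsearch_sketch(estimator, param_grid, x, y, cv=3, n_jobs=32):
    """Hypothetical stand-in for modelevaluation.gridsearch."""
    search = GridSearchCV(
        estimator,
        param_grid,
        scoring="r2",   # assumed from the "Best Score R2" line in the log
        cv=cv,          # the log reports 3 folds per candidate
        n_jobs=n_jobs,  # the log reports 32 concurrent workers
        verbose=1,
    )
    search.fit(x, y)
    print("Best Score R2 = ", search.best_score_)
    print("Best Parameters: ", search.best_params_)
    return search
```

With an empty grid, as used for `LinearRegression` here, the search has a single candidate, so it reduces to a plain 3-fold cross-validation of the default estimator.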


***
## PLS regression

In [7]:
pls_reg = Pipeline([
    ('pls', PLSRegression())
])

pls_grid = dict(
    pls__n_components = np.arange(6, 15, 1)
)

pls_search = modelevaluation.gridsearch(pls_reg, pls_grid, train_x_transformed, train_y)

Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=32)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done   3 out of  27 | elapsed:  1.2min remaining:  9.6min
[Parallel(n_jobs=32)]: Done   6 out of  27 | elapsed:  1.3min remaining:  4.4min
[Parallel(n_jobs=32)]: Done   9 out of  27 | elapsed:  1.3min remaining:  2.7min
[Parallel(n_jobs=32)]: Done  12 out of  27 | elapsed:  1.4min remaining:  1.8min
[Parallel(n_jobs=32)]: Done  15 out of  27 | elapsed:  1.5min remaining:  1.2min
[Parallel(n_jobs=32)]: Done  18 out of  27 | elapsed:  1.5min remaining:   46.2s
[Parallel(n_jobs=32)]: Done  21 out of  27 | elapsed:  1.6min remaining:   27.3s
[Parallel(n_jobs=32)]: Done  24 out of  27 | elapsed:  1.7min remaining:   12.4s
[Parallel(n_jobs=32)]: Done  27 out of  27 | elapsed:  1.8min remaining:    0.0s
[Parallel(n_jobs=32)]: Done  27 out of  27 | elapsed:  1.8min finished


Best Score R2 =  0.19312240722861246
Best Parameters:  {'pls__n_components': 9}
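
The scores reported by these searches are coefficients of determination, so a value near 0.19 means the model explains roughly 19% of the variance of the target in the held-out folds. Assuming the helper uses the same definition as scikit-learn's `r2_score`, the metric can be sketched in plain Python as follows (the function name `r2` is illustrative):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot


print(r2([1, 2, 3], [1, 2, 3]))  # perfect predictions
print(r2([1, 2, 3], [2, 2, 2]))  # always predicting the mean
```

Perfect predictions give 1.0 and always predicting the mean gives 0.0, which puts the scores in this notebook on a scale.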


***

## Lasso

In [8]:
lasso = Lasso()
alphas = np.logspace(-4, -0.5, 10)
lasso_grid = [{'alpha': alphas}]
lasso_search = modelevaluation.gridsearch(lasso, lasso_grid, train_x_transformed, train_y)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=32)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done   3 out of  30 | elapsed:   45.7s remaining:  6.9min
[Parallel(n_jobs=32)]: Done   7 out of  30 | elapsed:   51.6s remaining:  2.8min
[Parallel(n_jobs=32)]: Done  11 out of  30 | elapsed:   53.5s remaining:  1.5min
[Parallel(n_jobs=32)]: Done  15 out of  30 | elapsed:  1.1min remaining:  1.1min
[Parallel(n_jobs=32)]: Done  19 out of  30 | elapsed:  6.4min remaining:  3.7min
[Parallel(n_jobs=32)]: Done  23 out of  30 | elapsed:  7.2min remaining:  2.2min
[Parallel(n_jobs=32)]: Done  27 out of  30 | elapsed:  8.1min remaining:   53.7s
[Parallel(n_jobs=32)]: Done  30 out of  30 | elapsed:  8.8min finished


Best Score R2 =  0.20392686350784336
Best Parameters:  {'alpha': 0.003593813663804626}


***

## Elastic Net

In [9]:
enet = ElasticNet()
alphas = np.logspace(-4, -0.5, 10)
enet_grid = [{'alpha': alphas, 'l1_ratio' : np.linspace(0, 1, 5)}]
enet_search = modelevaluation.gridsearch(enet, enet_grid, train_x_transformed, train_y)

Fitting 3 folds for each of 50 candidates, totalling 150 fits


[Parallel(n_jobs=32)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done   8 tasks      | elapsed: 10.0min
[Parallel(n_jobs=32)]: Done  21 tasks      | elapsed: 10.6min
[Parallel(n_jobs=32)]: Done  34 tasks      | elapsed: 19.0min
[Parallel(n_jobs=32)]: Done  49 tasks      | elapsed: 20.3min
[Parallel(n_jobs=32)]: Done  64 tasks      | elapsed: 22.1min
[Parallel(n_jobs=32)]: Done  81 tasks      | elapsed: 24.2min
[Parallel(n_jobs=32)]: Done 103 out of 150 | elapsed: 25.4min remaining: 11.6min
[Parallel(n_jobs=32)]: Done 119 out of 150 | elapsed: 26.0min remaining:  6.8min
[Parallel(n_jobs=32)]: Done 135 out of 150 | elapsed: 28.6min remaining:  3.2min
[Parallel(n_jobs=32)]: Done 150 out of 150 | elapsed: 35.2min finished


Best Score R2 =  0.2055916255334355
Best Parameters:  {'alpha': 0.008799225435691074, 'l1_ratio': 0.25}
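
A useful check after a grid search is whether the selected value lies on the edge of the grid, which would suggest the range should be widened. The sketch below rebuilds the `np.logspace(-4, -0.5, 10)` grid in plain Python and locates the two best `alpha` values reported above; the helper names `logspace` and `grid_index` are illustrative.

```python
import math


def logspace(start, stop, num):
    """Plain-Python equivalent of np.logspace(start, stop, num) in base 10."""
    step = (stop - start) / (num - 1)
    return [10.0 ** (start + i * step) for i in range(num)]


def grid_index(grid, value):
    """Position of value in grid, compared with a small relative tolerance."""
    return next(i for i, g in enumerate(grid) if math.isclose(g, value, rel_tol=1e-9))


alphas = logspace(-4, -0.5, 10)
best = {"lasso": 0.003593813663804626, "enet": 0.008799225435691074}

for name, alpha in best.items():
    print(name, "-> grid index", grid_index(alphas, alpha), "of", len(alphas) - 1)
```

Both values fall at interior positions of the grid (indices 4 and 5 of 0 to 9), so the searched `alpha` range brackets the optimum for both models.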


## Validation Data Test

In [11]:
mymodels = {
    'linear_reg': lm_search.best_estimator_,
    'PLS': pls_search.best_estimator_,
    'lasso': lasso_search.best_estimator_,
    'enet': enet_search.best_estimator_
}
modelevaluation.eval_models(mymodels, preprocessing, test_x, test_y).to_csv("results_data/val_linearmodels.csv")

generating predictions for model: linear_reg
generating predictions for model: PLS
generating predictions for model: lasso
generating predictions for model: enet


## 10-FOLD CV

Cross-validate the best estimator of each model to obtain a profile of its scores across folds.

In [25]:
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import KFold, cross_val_score


def crossvalidation(trained_models, n_splits=10):
    """
    Evaluate models on the transformed training data using K-fold CV.
    Args:
        trained_models (dict): Maps a model name/id (str) to an estimator object
        n_splits (int): Number of folds (default 10)
    Returns: pd.DataFrame with the R2 score of each fold for each model
    """
    cross_val = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    r2score = make_scorer(r2_score)
    results = []
    for mdlname, estimator in trained_models.items():
        print('cv for model: {}'.format(mdlname))
        # cross_val_score refits a clone of the estimator on each training fold
        scores = cross_val_score(estimator, train_x_transformed, train_y, cv=cross_val, n_jobs=10, scoring=r2score)
        # put results in pandas data frame
        res = pd.DataFrame({'r2_score': scores, 'kfold' : range(1, n_splits + 1), 'mdlname': mdlname})
        results.append(res)
    return pd.concat(results)


In [29]:
results = modelevaluation.crossvalidation(mymodels)

cv for model: linear_reg
cv for model: PLS
cv for model: lasso
cv for model: enet


In [32]:
results.to_csv('results_data/cv_linearmodels.csv', index=False)
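
The saved table is in long format, with one row per model and fold. A minimal sketch of how the per-model profile could be summarized, assuming the `r2_score`, `kfold`, and `mdlname` columns built in `crossvalidation`. The function name `summarize_cv` is illustrative, and the demo values are placeholders, not results from this notebook.

```python
import pandas as pd


def summarize_cv(results):
    """Mean and standard deviation of the fold scores for each model."""
    summary = results.groupby("mdlname")["r2_score"].agg(["mean", "std"])
    return summary.sort_values("mean", ascending=False)


# Placeholder scores for illustration only; not results from this notebook.
demo = pd.DataFrame({
    "r2_score": [0.10, 0.20, 0.30, 0.40],
    "kfold": [1, 2, 1, 2],
    "mdlname": ["model_a", "model_a", "model_b", "model_b"],
})
print(summarize_cv(demo))
```

On the real output this would be called as `summarize_cv(pd.read_csv('results_data/cv_linearmodels.csv'))`.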