Reading in the data

In [1]:
import git
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import scale
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.metrics import root_mean_squared_error
# Add more imports in this block later. There will need to be several "from sklearn.whatever import something" lines

In [2]:
repo = git.Repo('.', search_parent_directories = True)
root = repo.working_tree_dir

# The sample id and the log-transformed gene expression values.
half_data_1 = pd.read_csv(root + '\\data\\RKNGHStress.csv')
half_data_1 = half_data_1.loc[:, half_data_1.columns.str.startswith(('Sample', 'Log'))]
half_data_1 = half_data_1.rename(columns = {'Sample' : 'sample', 'Log16S' : 'bact', 'Logcbblr' : 'cbblr', 'Log18S' : 'fungi', 'Logphoa' : 'phoa', 'Logurec' : 'urec'})

# The hyperspectral measurements for each sample
half_data_2 = pd.read_csv(root + '\\data\\RKNGHStressPCAPSR.csv')
half_data_2 = half_data_2.rename(columns = {'Unnamed: 0' : 'sample'})

data = half_data_1.join(half_data_2.set_index('sample'), on = 'sample')

TEMP: testing manual construction of models with specific hyperparameters

In [3]:
X = data.drop(['sample', 'bact', 'cbblr', 'fungi', 'phoa', 'urec'], axis = 1)
bact = data[['bact']]

In [4]:
X_train, X_test, bact_train, bact_test = train_test_split(X.to_numpy(), bact.to_numpy(), train_size = 0.8, random_state = 0)
bact_train = scale(bact_train.ravel())
bact_test = scale(bact_test.ravel())

# Note: do NOT scale X and y before splitting, since that is a data leak. Instead, use the pipeline to scale both Xs and the y training, and manually scale the y testing for custom scoring like RMSE.

# For the sake of robustness, maybe should repeat this a few times, with different random states (still manually set for sake of reproducibility) e.g., 0, 1, ... , 4
cv_0 = KFold(n_splits = 5, shuffle = True, random_state = 0)

# n_jobs will need to be adjusted later when running on SCINet (high performance computing clusters). -1 uses all available cores, which might cause a bit of thrashing, but good enough for now.
pipeline = make_pipeline(StandardScaler(), ElasticNetCV(alphas = [0.001389495], l1_ratio = 0.4285714, cv = cv_0, selection = 'random', max_iter = 10000, n_jobs = -1))
pipeline.fit(X_train, bact_train)
pipeline.score(X_test, bact_test)

0.4792609494855833

This looks weird. With the default settings, none of the folds converged (true for 10 folds, 5 folds, or even a plain ElasticNet with no cross-validation). Changing selection to 'random' and max_iter to 10k allowed convergence.

The score is pretty low. But what metric is the score? If it's not RMSE it's not in the ballpark of the R version. (UPDATE: It's R^2, so it's a bunch of pretty bad scores, actually.)
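To make the link between the two metrics concrete: when the test target is standardized to unit variance, R^2 = 1 - MSE, so RMSE = sqrt(1 - R^2). By that identity, an R^2 of 0.479 on a standardized target would correspond to an RMSE of roughly 0.72. A minimal sketch on made-up numbers (`y_true` and `y_pred` are hypothetical, not from this dataset):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)

# Hypothetical standardized target: mean 0, (population) variance 1
y_true = scale(rng.normal(size=50))
# Hypothetical predictions: the target plus some noise
y_pred = y_true + rng.normal(scale=0.5, size=50)

r2 = r2_score(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

# R^2 = 1 - SS_res / SS_tot. With unit variance, SS_tot = n, so R^2 = 1 - MSE.
print('R^2:        ', r2)
print('1 - MSE:    ', 1 - mse)
print('RMSE:       ', rmse)
print('sqrt(1-R^2):', np.sqrt(1 - r2))
```

This only holds when the target being scored has variance 1. On an unscaled target the two metrics are still related, but through the target's variance.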

It looks like ElasticNetCV, and ElasticNet for that matter, don't allow changing the scoring or tuning metric, or at least it's not obvious how. Can the models be evaluated on RMSE just by calling the RMSE function on the preds and targets? Also, does it even make sense to change the tuning metric for this algorithm? (Idk, you probably COULD, but minimizing the sum of squared residuals is good enough.) And what tuning metric does the tidymodels implementation use?

This all might be caused by using the hyperparameter optima found in the R code, which had a different train/test split. So they're not optimal here, but they're still probably pretty good, especially since in the analysis of part 1's hyperparameters, there wasn't much variation among the elastic net models' penalties (all tending to be very close to 0) or mixtures (a bit more spread but similar).

What happens if some of its own tuning were allowed?

In [5]:
# print('Preds on X_test')
# print(pipeline.predict(X_test))
# print()
# print('Scaled bact_test')
# print(scale(bact_test))
# print()

# print('MSE')
# mse = mean_squared_error(scale(bact_test), pipeline.predict(X_test))
# print(mse)

print('RMSE')
rmse = root_mean_squared_error(scale(bact_test), pipeline.predict(X_test))
print(rmse)
print()

# print('MSE, no scaling bact_test')
# mse1 = mean_squared_error(bact_test, pipeline.predict(X_test))
# print(mse1)
# print()

# print('RMSE, no scaling bact_test')
# rmse1 = root_mean_squared_error(bact_test, pipeline.predict(X_test))
# print(rmse1)
# print()

RMSE
0.721622512477553



Uh oh, RMSE of 0.72 is really bad. It doesn't even come close to the R models. What happened?

Looking at the results, it looks like the scaling isn't happening for some reason, or at least not for the predictions. They're all around 9 instead of around 0. Recalculating RMSE after removing scaling on bact_test gave more reasonable results.

But the problem is, we want to be able to compare models among different targets using a common scale, so normalization has to be done with respect to other targets (but NOT the entire dataset for each column since that's data leakage). This should be done before training. But how can this be implemented? (UPDATE: Fixed by manually scaling bact_train and bact_test, separately. But still doesn't solve the issue of getting much higher RMSE than expected.)
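One possible alternative to scaling the target by hand is scikit-learn's `TransformedTargetRegressor`, which fits a target scaler on the training data only and inverts it at prediction time. The sketch below is not what this notebook does: it uses synthetic stand-in data (`X_demo`, `y_demo`) and arbitrary hyperparameters, and it puts the test target on the training-set scale rather than scaling the test set separately.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): target shifted to sit around 9
X_demo, y_demo = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)
y_demo = 9 + y_demo / 100

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)

# X is scaled inside the pipeline; y is scaled by the wrapper, using training data only
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), ElasticNet(alpha=0.001, l1_ratio=0.5, max_iter=10000)),
    transformer=StandardScaler(),
)
model.fit(X_tr, y_tr)

# predict() returns values in the original units of y
pred_raw = model.predict(X_te)

# Map targets and predictions onto the training-set scale before scoring
y_te_scaled = model.transformer_.transform(y_te.reshape(-1, 1)).ravel()
pred_scaled = model.transformer_.transform(pred_raw.reshape(-1, 1)).ravel()
rmse_scaled = np.sqrt(mean_squared_error(y_te_scaled, pred_scaled))
print(rmse_scaled)
```

Because the same training mean and standard deviation are applied to every test target, RMSE values computed this way are in training-set standard deviations, which may be a reasonable common scale for comparing models across targets.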

Decided to go back and change the cross validation size to 5 instead of 10 in light of the relatively small dataset here.

In [6]:
# Testing how RMSE changes (hopefully improves!) with hyperparameter tuning

mix_space = np.linspace(0, 1, 8)
reg_space = np.logspace(-5, 5, 8)

elastcv = ElasticNetCV(alphas = reg_space, l1_ratio = mix_space, cv = cv_0, selection = 'random', max_iter = 10000, n_jobs = -1, random_state = 0, positive = True)
estimators = [('scaler', StandardScaler()), ('elastic_net', elastcv)]
pipeline_hp = Pipeline(estimators, memory = root + '\\cache')
pipeline_hp.fit(X_train, bact_train)
pipeline_hp.score(X_test, bact_test)

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '('scaler', StandardScaler())' (type <class 'tuple'>) doesn't

In [None]:
print('RMSE')
rmse_hp = root_mean_squared_error(scale(bact_test), pipeline_hp.predict(X_test))
print(rmse_hp)
print()

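# The step is named 'elastic_net' in the estimators list above, so index by that name.
fitted_enet = pipeline_hp['elastic_net']
print(fitted_enet.get_params())
# alpha_ and l1_ratio_ hold the values chosen by cross-validation.
print(fitted_enet.alpha_, fitted_enet.l1_ratio_)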

This sucks. The results are even worse than before. I must have done something wrong here, but I haven't figured out what yet. I'm also still trying to figure out which hyperparameters it ended up on (get_params() only returns the constructor arguments; the fitted ElasticNetCV stores the selected values in alpha_ and l1_ratio_). Maybe it's better to just do ElasticNet and the parameter search separately in the pipeline.
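A rough sketch of what separating the model from the search could look like: a plain `ElasticNet` inside a `Pipeline`, wrapped in `GridSearchCV` so the tuning metric can be set to RMSE through `scoring`. The data (`X_demo`, `y_demo`) is synthetic and the grid values are placeholders, not the grids used in this notebook or in the R code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('elastic_net', ElasticNet(max_iter=10000)),
])

# Placeholder grids; parameters are addressed as <step name>__<parameter>
param_grid = {
    'elastic_net__alpha': np.logspace(-3, 1, 5),
    'elastic_net__l1_ratio': np.linspace(0.1, 1, 4),
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring='neg_root_mean_squared_error',  # tune on RMSE (negated, since higher is better)
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print(search.best_params_)   # the hyperparameters the search ended up on
print(-search.best_score_)   # mean cross-validated RMSE for that combination
```

With this layout the scaler is refit inside every fold, and `best_params_` reports the selected penalty and mixture directly.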