Reading in the data

In [27]:
import git
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import scale
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error as mse
# Add more imports in this block later. There will need to be several "from sklearn.whatever import something" lines

In [2]:
repo = git.Repo('.', search_parent_directories = True)
root = repo.working_tree_dir

# The sample id and the log-transformed gene expression values.
half_data_1 = pd.read_csv(root + '/data/RKNGHStress.csv')  # forward slashes work on Windows, macOS, and Linux
half_data_1 = half_data_1.loc[:, half_data_1.columns.str.startswith(('Sample', 'Log'))]
half_data_1 = half_data_1.rename(columns = {'Sample' : 'sample', 'Log16S' : 'bact', 'Logcbblr' : 'cbblr', 'Log18S' : 'fungi', 'Logphoa' : 'phoa', 'Logurec' : 'urec'})

# The hyperspectral measurements for each sample
half_data_2 = pd.read_csv(root + '/data/RKNGHStressPCAPSR.csv')  # forward slashes work on Windows, macOS, and Linux
half_data_2 = half_data_2.rename(columns = {'Unnamed: 0' : 'sample'})

data = half_data_1.join(half_data_2.set_index('sample'), on = 'sample')

TEMP: testing manual construction of models with specific hyperparameters

In [3]:
X = data.drop(['sample', 'bact', 'cbblr', 'fungi', 'phoa', 'urec'], axis = 1)
bact = data[['bact']]

In [12]:
X_train, X_test, bact_train, bact_test = train_test_split(X.to_numpy(), bact.to_numpy(), train_size = 0.8)
bact_train = bact_train.ravel()
bact_test = bact_test.ravel()

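# Note: do NOT scale X and y before splitting, since that leaks test-set statistics into training. The pipeline's StandardScaler scales X only (fit on the training set, then reused on the test set); it never touches y, so any y scaling needed for custom scoring like RMSE has to be handled separately.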

pipeline = make_pipeline(StandardScaler(), ElasticNetCV(alphas = [0.001389495], l1_ratio = 0.4285714, cv = 10, selection = 'random', max_iter = 10000))
pipeline.fit(X_train, bact_train)
pipeline.score(X_test, bact_test)

-471.04533586184095

This looks weird. At first, none of the folds converged (true for 10 folds, 5 folds, or even a single fit with basic ElasticNet), but changing selection to 'random' and raising max_iter to 10,000 allowed convergence.
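
Convergence can also be checked programmatically rather than by reading the console. The sketch below counts the ConvergenceWarning messages raised during a fit and reports n_iter_ for two solver settings. The data come from make_regression as a stand-in for the real spectra (an assumption), and count_convergence_warnings is a hypothetical helper, not part of scikit-learn.

```python
import warnings

from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): more features than samples.
X_demo, y_demo = make_regression(n_samples=100, n_features=300, noise=1.0, random_state=0)


def count_convergence_warnings(model, X, y):
    """Hypothetical helper: fit `model` and return how many ConvergenceWarnings it raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X, y)
    return sum(issubclass(w.category, ConvergenceWarning) for w in caught)


for selection, max_iter in [("cyclic", 1000), ("random", 10000)]:
    model = make_pipeline(
        StandardScaler(),
        ElasticNet(alpha=0.001389495, l1_ratio=0.4285714,
                   selection=selection, max_iter=max_iter, random_state=0),
    )
    n_warn = count_convergence_warnings(model, X_demo, y_demo)
    n_iter = model[-1].n_iter_  # iterations used by the coordinate descent solver
    print(f"selection={selection!r}, max_iter={max_iter}: "
          f"{n_warn} convergence warning(s), {n_iter} iterations")
```

If n_iter_ equals max_iter, the solver most likely stopped on the iteration cap rather than on the tolerance.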

The score is pretty low. But what metric is the score? If it's not RMSE, it's not in the ballpark of the R version. (UPDATE: It's R^2, and a negative R^2 means the model does worse than always predicting the mean of the test targets, so these are pretty bad scores, actually.)
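
To make the R^2 reading concrete, here is a small sketch on made-up data (an assumption, not the notebook's data). It shows that score on a pipeline matches r2_score and the textbook formula 1 - SS_res / SS_tot, and that predictions shifted off the target's scale drive R^2 far below zero.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=60)

model = make_pipeline(StandardScaler(), ElasticNet(alpha=0.01, l1_ratio=0.5))
model.fit(X_demo[:40], y_demo[:40])

X_te, y_te = X_demo[40:], y_demo[40:]
pred = model.predict(X_te)

# Pipeline.score delegates to the final regressor's score, which is R^2.
r2_from_score = model.score(X_te, y_te)
r2_manual = 1 - np.sum((y_te - pred) ** 2) / np.sum((y_te - y_te.mean()) ** 2)
print(r2_from_score, r2_score(y_te, pred), r2_manual)

# Predictions on the wrong scale (here shifted by a constant) push R^2 far below zero.
r2_shifted = r2_score(y_te, pred + 10.0)
print(r2_shifted)
```

A value as extreme as -471 is the kind of number that shows up when predictions and targets sit on different scales or offsets. Whether that is what happened here is not confirmed.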

It looks like ElasticNetCV, and ElasticNet for that matter, doesn't allow changing the scoring or tuning metric, or at least it's not obvious how. Can the models be evaluated on RMSE just by calling the RMSE function on the predictions and targets? Also, does it even make sense to change the tuning metric for this algorithm? (It's probably possible, but minimizing the sum of squared residuals is good enough.) And what tuning metric does the tidymodels implementation use?
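
On the RMSE question: yes, RMSE can be computed directly from predictions and targets. The sketch below does that on synthetic data (an assumption), and wraps the pipeline in TransformedTargetRegressor so that y is standardized with training-set statistics only. It also prints mse_path_, the per-fold mean squared errors that ElasticNetCV uses to pick alpha. This is one possible setup, not necessarily how the R models were scored.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the spectra and one target (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=30, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)

# The inner pipeline standardizes X; TransformedTargetRegressor standardizes y.
# Both scalers are fit on the training data only.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), ElasticNetCV(l1_ratio=0.5, cv=5, max_iter=10000)),
    transformer=StandardScaler(),
)
model.fit(X_tr, y_tr)

# predict() inverts the y scaling, so this RMSE is in the original units of y.
pred = model.predict(X_te)
rmse = np.sqrt(mean_squared_error(y_te, pred))

# For RMSE in standardized units, reuse the y scaler fitted on the training set.
y_scaler = model.transformer_
rmse_scaled = np.sqrt(mean_squared_error(
    y_scaler.transform(y_te.reshape(-1, 1)),
    y_scaler.transform(pred.reshape(-1, 1)),
))
print(rmse, rmse_scaled)

# ElasticNetCV tunes on per-fold mean squared error; mse_path_ holds those values.
enet_cv = model.regressor_[-1]
print(enet_cv.alpha_, enet_cv.mse_path_.shape)
```

Scaling the test targets with the training scaler keeps predictions and targets on the same scale, which calling scale() on the test targets alone does not guarantee.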

This all might be caused by using the hyperparameter optima found in the R code, which were tuned on a different train/test split. They aren't optimal here, but they're probably still pretty good, especially since in the analysis of part 1's hyperparameters there wasn't much variation among the elastic net models' penalties (all very close to 0) or mixtures (a bit more spread, but similar).

What happens if some of its own tuning were allowed?
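
A minimal sketch of what that could look like, on synthetic data (an assumption): leave alphas unset so ElasticNetCV builds its own alpha path, and pass a list of l1_ratio candidates. The grid values are illustrative only, not recommendations for this dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=30, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)

# With alphas unset, ElasticNetCV generates an alpha path for each l1_ratio candidate.
l1_grid = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
tuned = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=l1_grid, cv=10, max_iter=10000),
)
tuned.fit(X_tr, y_tr)

enet_cv = tuned[-1]
print("chosen alpha:", enet_cv.alpha_)
print("chosen l1_ratio:", enet_cv.l1_ratio_)
print("test R^2:", tuned.score(X_te, y_te))
```

Comparing alpha_ and l1_ratio_ against the values carried over from the R code would show how far the transplanted hyperparameters are from what this split prefers.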

In [30]:
print(pipeline.predict(X_test))
print()
print(scale(bact_test))
print()
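# RMSE as the square root of the MSE (the squared = False argument is deprecated since scikit-learn 1.4)
# Caution: scale() standardizes bact_test with its own mean and std, while the pipeline above was fit on unscaled bact_train, so the two may not be on the same scale.
rmse = np.sqrt(mse(scale(bact_test), pipeline.predict(X_test)))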
print(rmse)

[-0.30167882 -0.32435681 -0.16994891  0.39670121  0.25491271 -0.1838957
 -0.35130211  0.18132449  0.11240183  0.21245123 -0.28431435 -0.10937956
  0.02726354  0.34610739 -0.08573751 -0.12437335 -0.44832629  0.43490646
 -0.2331579  -0.03646126  0.29497726  0.08395952  0.14403236 -0.32314219
 -0.13073932  0.3601685   0.3271546   0.05173241 -0.08478912 -0.18042365
  0.28732325 -0.55372755  0.26427954  0.10760681 -0.12250387  0.53005803
 -0.27401899  0.58678425  0.42258126 -0.32610505  0.26361617 -0.03611554
  0.47091699 -0.32362347  0.23242506 -0.20872366 -0.35041646  0.09218657
 -0.26156427  0.36010758 -0.03356407  0.44726391  0.53851241  0.05622336
 -0.27942402  0.15987401 -0.2200759   0.30743585 -0.26980634  0.53244131
 -0.15480568  0.2872557  -0.17786029 -0.00411791 -0.28753649 -0.4676467
 -0.17377226 -0.37650008  0.08047002  0.37637611 -0.21343528  0.29685338
 -0.34337752 -0.16002197 -0.23365526 -0.50311291  0.11201752  0.28377126
  0.19557695 -0.10578414]

[ 0.22532478 -1.15687247  

Uh oh, RMSE of 0.83 is really bad. It doesn't even come close to the R models. What happened?
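
For a reference point on how bad 0.83 is: when the target has been standardized with scale(), always predicting 0 (the mean) gives an RMSE of exactly 1, so 0.83 is only a modest improvement over no model at all. A quick check on made-up target values (an assumption standing in for bact_test):

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
# Made-up target values standing in for bact_test (assumption).
y_demo = rng.normal(loc=9.0, scale=0.5, size=80)

y_std = scale(y_demo)                 # mean 0, standard deviation 1
baseline_pred = np.zeros_like(y_std)  # "always predict the mean"
baseline_rmse = np.sqrt(mean_squared_error(y_std, baseline_pred))
print(baseline_rmse)
```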

Now try to get the distance filter up and running. This is subtly different from the correlation filter, which is probably too heavy-handed on this data. The other aspect to consider is that, unlike the correlation filter, the distance filter can't be applied during preprocessing (before building models).