Reading in the data

In [1]:
import git
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import scale
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.metrics import root_mean_squared_error
# Add more imports in this block later. There will need to be several "from sklearn.whatever import something" lines

In [2]:
repo = git.Repo('.', search_parent_directories = True)
root = repo.working_tree_dir

# The sample id and the log-transformed gene expression values.
half_data_1 = pd.read_csv(root + '\\data\\RKNGHStress.csv')
half_data_1 = half_data_1.loc[:, half_data_1.columns.str.startswith(('Sample', 'Log'))]
half_data_1 = half_data_1.rename(columns = {'Sample' : 'sample', 'Log16S' : 'bact', 'Logcbblr' : 'cbblr', 'Log18S' : 'fungi', 'Logphoa' : 'phoa', 'Logurec' : 'urec'})

# The hyperspectral measurements for each sample
half_data_2 = pd.read_csv(root + '\\data\\RKNGHStressPCAPSR.csv')
half_data_2 = half_data_2.rename(columns = {'Unnamed: 0' : 'sample'})

data = half_data_1.join(half_data_2.set_index('sample'), on = 'sample')

TEMP: testing manual construction of models with specific hyperparameters

In [3]:
X = data.drop(['sample', 'bact', 'cbblr', 'fungi', 'phoa', 'urec'], axis = 1)
bact = data[['bact']]

In [10]:
X_train, X_test, bact_train, bact_test = train_test_split(X.to_numpy(), bact.to_numpy(), train_size = 0.8, random_state = 0)
bact_train = scale(bact_train.ravel())
bact_test = scale(bact_test.ravel())

# Note: do NOT scale X and y before splitting, since that leaks test-set information into training.
# Instead, let the pipeline's StandardScaler scale X, and scale the training and testing y manually
# (the pipeline does not touch y) so that custom scores like RMSE are on a standardized scale.

pipeline = make_pipeline(StandardScaler(), ElasticNetCV(alphas = [0.001389495], l1_ratio = 0.4285714, cv = 10, selection = 'random', max_iter = 10000))
pipeline.fit(X_train, bact_train)
pipeline.score(X_test, bact_test)

0.4792530347735706

This looks weird. With the default settings, none of the folds converged (true for 10 folds, 5 folds, or even just 1 fold with basic ElasticNet), but changing selection to 'random' and max_iter to 10,000 allowed convergence.

The score is pretty low. But what metric is it? If it's not RMSE, it can't be compared directly with the R version. (UPDATE: It's R^2, so these are actually pretty bad scores.)
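A quick way to confirm what `score` returns is to compare it against `r2_score` computed on the same predictions. The sketch below uses synthetic stand-in data (the arrays, coefficients, and hyperparameters are made up for illustration, not taken from this project):

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the project data): 100 samples, 5 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

demo_pipeline = make_pipeline(StandardScaler(), ElasticNet(alpha=0.01, l1_ratio=0.5))
demo_pipeline.fit(X_demo[:80], y_demo[:80])

# Pipeline.score delegates to the final estimator's score method,
# which is the coefficient of determination (R^2) for regressors.
score = demo_pipeline.score(X_demo[80:], y_demo[80:])
r2 = r2_score(y_demo[80:], demo_pipeline.predict(X_demo[80:]))
print(score, r2)
```

The two printed values should agree, which is why the pipeline score above cannot be read as an RMSE.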

It looks like ElasticNetCV, and ElasticNet for that matter, don't allow changing the scoring or tuning metric, or at least it's not obvious how. Can the models be evaluated on RMSE just by calling the RMSE function on the predictions and targets? Also, does it even make sense to change the tuning metric for this algorithm? (It probably COULD be done, but minimizing the sum of squared residuals is good enough.) And what tuning metric does the tidymodels implementation use?

This all might be caused by using the hyperparameter optima found in the R code, which had a different train/test split. So they're not optimal here, but they're still probably pretty good, especially since in the analysis of part 1's hyperparameters, there wasn't much variation among the elastic net models' penalties (all tending to be very close to 0) or mixtures (a bit more spread but similar).

What happens if ElasticNetCV is allowed to do some of its own tuning?
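A minimal sketch of letting ElasticNetCV search for itself: pass a list of candidate `l1_ratio` values and leave `alphas` unset so it builds its own penalty path. The data and the candidate list below are assumptions for illustration only:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the project data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 6))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=120)

# Candidate mixtures are placeholders; alphas is left at its default so
# ElasticNetCV generates an automatic grid of penalties per l1_ratio.
l1_candidates = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
tuned = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=l1_candidates, cv=10, selection='random',
                 max_iter=10000, random_state=0),
)
tuned.fit(X_demo, y_demo)

# The selected hyperparameters live on the fitted final step.
enet = tuned.named_steps['elasticnetcv']
print(enet.alpha_, enet.l1_ratio_)
```

Comparing `alpha_` and `l1_ratio_` with the values carried over from the R code would show how far the two sets of optima actually are from each other on this split.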

In [11]:
print('Preds on X_test')
print(pipeline.predict(X_test))
print()
print('Scaled bact_test')
print(scale(bact_test))
print()

# print('MSE')
# mse = mean_squared_error(scale(bact_test), pipeline.predict(X_test))
# print(mse)

print('RMSE')
rmse = root_mean_squared_error(scale(bact_test), pipeline.predict(X_test))
print(rmse)
print()

# print('MSE, no scaling bact_test')
# mse1 = mean_squared_error(bact_test, pipeline.predict(X_test))
# print(mse1)
# print()

print('RMSE, no scaling bact_test')
rmse1 = root_mean_squared_error(bact_test, pipeline.predict(X_test))
print(rmse1)
print()

Preds on X_test
[ 0.69569189 -0.87535893 -0.25776621  0.10517769 -0.59273765  0.56006148
 -0.71385707 -0.36592773  0.38387765 -0.78304772  0.72022199  0.53128687
  0.44129921 -0.25786282  0.58056445  1.1809667  -0.26491401  0.28681619
 -0.21919771 -0.02986261  0.70315128 -0.90777818  0.84088154 -0.88865849
 -0.07234446 -0.36263674  1.23544489  1.36681226 -0.54173563  0.7090774
  0.06410912 -0.50085058 -0.01977494 -0.24642616  0.48992319  0.64504155
  0.31157215 -0.43514807  0.58307725 -0.89992968 -0.11740088 -0.80549581
  0.1547279   0.63949601 -0.21470968  1.06626484 -0.22108812 -0.40188409
  0.67753521 -0.68300371 -0.78264341  0.13862848 -1.15747476  0.49203234
 -0.76123458 -0.91494976  0.99742168  0.71968777  1.04475761  0.81154076
  0.30913917 -0.90586621  0.90451355 -0.48645285  0.33667892 -0.71946542
  0.51476301  0.77696631  0.23896917  0.43912973  0.00535966 -0.77178784
  1.13765297  0.95179409  0.45859737  0.81837057 -0.30785301 -0.80467148
 -0.70438419 -0.94757898]

Scaled bact_test
[remaining output truncated]

Uh oh, RMSE of 0.83 is really bad. It doesn't even come close to the R models. What happened?

Looking at the results, the scaling doesn't seem to be happening for some reason, or at least not for the predictions: they're all around 9 instead of around 0. Recalculating RMSE without scaling bact_test gave more reasonable results.

The problem is that we want to compare models across different targets on a common scale, so each target has to be normalized (but NOT using statistics from the entire dataset, since that's data leakage). This should be done before training. But how can this be implemented? (UPDATE: Fixed by manually scaling bact_train and bact_test separately. But that still doesn't explain why the RMSE is much higher than expected.)

Now try to get the distance filter up and running. This is subtly different from the correlation filter, which is probably too heavy-handed on this data. The other aspect to consider is that, unlike the correlation filter, the distance filter can't be applied during preprocessing (before building models).