Reading in the data

In [164]:
import git
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.experimental import enable_halving_search_cv # Needed for HalvingGridSearchCV, which is experimental
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import scale
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import ElasticNetCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.metrics import root_mean_squared_error
# Add more imports in this block later. There will need to be several "from sklearn.whatever import something" lines

In [2]:
repo = git.Repo('.', search_parent_directories = True)
root = repo.working_tree_dir

# The sample id and the log-transformed gene expression values.
half_data_1 = pd.read_csv(root + '\\data\\RKNGHStress.csv')
half_data_1 = half_data_1.loc[:, half_data_1.columns.str.startswith(('Sample', 'Log'))]
half_data_1 = half_data_1.rename(columns = {'Sample' : 'sample', 'Log16S' : 'bact', 'Logcbblr' : 'cbblr', 'Log18S' : 'fungi', 'Logphoa' : 'phoa', 'Logurec' : 'urec'})

# The hyperspectral measurements for each sample
half_data_2 = pd.read_csv(root + '\\data\\RKNGHStressPCAPSR.csv')
half_data_2 = half_data_2.rename(columns = {'Unnamed: 0' : 'sample'})

data = half_data_1.join(half_data_2.set_index('sample'), on = 'sample')

TEMP: testing manual construction of models with specific hyperparameters

In [15]:
X = data.drop(['sample', 'bact', 'cbblr', 'fungi', 'phoa', 'urec'], axis = 1)
# NOTE: when doing phoa, there are a couple of samples (3 and 30) that have no data recorded, so we'll need to remove NAs there. But those observations still have data for the other genes.
bact = data[['bact']]

In [73]:
# Note: do NOT scale X and y before splitting, since that is a data leak. Instead, let the pipeline's StandardScaler handle X (it only ever sees X), and scale the y training and y testing sets manually after the split so that custom scores like RMSE are on a common scale.
X_train, X_test, bact_train, bact_test = train_test_split(X.to_numpy(), bact.to_numpy(), train_size = 0.8, random_state = 0)
bact_train_noscale = bact_train # used for debugging below
bact_test_noscale = bact_test
bact_train = scale(bact_train.ravel())
bact_test = scale(bact_test.ravel())

# For the sake of robustness, maybe should repeat this a few times, with different random states (still manually set for sake of reproducibility) e.g., 0, 1, ... , 4
cv_0 = KFold(n_splits = 5, shuffle = True, random_state = 0)

# n_jobs will need to be adjusted later when running on SCINet (high performance computing clusters). -1 uses all available cores, which might cause a bit of thrashing, but good enough for now.
pipeline = make_pipeline(StandardScaler(), ElasticNetCV(alphas = [0.001389495], l1_ratio = 0.4285714, cv = cv_0, selection = 'random', max_iter = 10000, n_jobs = -1))
pipeline.fit(X_train, bact_train)
pipeline.score(X_test, bact_test)

0.4482138651544799

This looks weird. At first, none of the folds converged (true for 10 folds, 5 folds, or even just 1 fold with basic ElasticNet), but changing selection to 'random' and max_iter to 10k allowed convergence.

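The score is pretty low. But what metric is the score? If it's not RMSE, it's not in the ballpark of the R version. (UPDATE: It's R^2, since score() on a pipeline just calls the final estimator's score(), which is R^2 for regressors. So it's a bunch of pretty bad scores, actually.)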

It looks like ElasticNetCV, and ElasticNet for that matter, don't allow changing the scoring or tuning metric, or at least it's not obvious how. Can the models be evaluated on RMSE just by calling the RMSE function on the preds and targets? Also, does it even make sense to change the tuning metric for this algorithm? (Not sure. It's probably possible, but minimizing the sum of squared residuals is good enough.) And what tuning metric does the tidymodels implementation use?
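To answer the RMSE question for myself, here is a minimal sketch on synthetic data (make_regression, NOT the project data; all the *_demo names and the alpha/l1_ratio values are made up for illustration). It checks that score() matches r2_score, and that RMSE can be computed separately from the predictions. np.sqrt(mean_squared_error(...)) is used here as an equivalent of root_mean_squared_error.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the spectra and target (NOT the project data).
X_demo, y_demo = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)

# Hyperparameter values here are arbitrary illustration values.
demo_pipe = make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000))
demo_pipe.fit(Xtr, ytr)
preds = demo_pipe.predict(Xte)

# score() on the pipeline delegates to ElasticNet.score(), which reports R^2.
r2_from_score = demo_pipe.score(Xte, yte)
r2_manual = r2_score(yte, preds)

# RMSE is computed separately, straight from the predictions and targets.
rmse_demo = np.sqrt(mean_squared_error(yte, preds))

print('R^2 via score():   ', r2_from_score)
print('R^2 via r2_score():', r2_manual)
print('RMSE:              ', rmse_demo)
```

So the two numbers answer different questions: R^2 is unitless, while RMSE is in whatever units the target passed to it is in.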

This all might be caused by using the hyperparameter optima found in the R code, which had a different train/test split. So they're not optimal here, but they're still probably pretty good, especially since in the analysis of part 1's hyperparameters, there wasn't much variation among the elastic net models' penalties (all tending to be very close to 0) or mixtures (a bit more spread but similar).

What happens if some of its own tuning were allowed?

In [5]:
# print('Preds on X_test')
# print(pipeline.predict(X_test))
# print()
# print('Scaled bact_test')
# print(scale(bact_test))
# print()

# print('MSE')
# mse = mean_squared_error(scale(bact_test), pipeline.predict(X_test))
# print(mse)

print('RMSE')
rmse = root_mean_squared_error(scale(bact_test), pipeline.predict(X_test))
print(rmse)
print()

# print('MSE, no scaling bact_test')
# mse1 = mean_squared_error(bact_test, pipeline.predict(X_test))
# print(mse1)
# print()

# print('RMSE, no scaling bact_test')
# rmse1 = root_mean_squared_error(bact_test, pipeline.predict(X_test))
# print(rmse1)
# print()

RMSE
0.7428249859632096



Uh oh, RMSE of 0.83 is really bad. It doesn't even come close to the R models. What happened?

Looking at the results, it looks like the scaling isn't happening for some reason, or at least not for the predictions. They're all in the 9 range instead of 0 range. Recalculating RMSE after removing scaling on bact_test gave more reasonable results.

But the problem is, we want to be able to compare models across different targets on a common scale, so each target has to be normalized (but NOT using the entire dataset for each column, since that's data leakage). This should be done before training. But how can this be implemented? (UPDATE: Fixed by manually scaling bact_train and bact_test, separately. But that still doesn't solve the issue of getting a much higher RMSE than expected.)
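One possible way to implement the target scaling, sketched on synthetic data (NOT the project data; every *_demo name and hyperparameter value below is an assumption for illustration). Unlike the manual approach above, which scales bact_test with its own mean and std, this sketch reuses the TRAINING mean and std for the test target. It also shows sklearn's TransformedTargetRegressor, which does the target scaling and un-scaling internally; dividing its RMSE by the training std should land on the same standardized RMSE.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the target is shifted away from 0 on purpose.
X_demo, y_demo = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
y_demo = y_demo / 100.0 + 9.0
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)

# Option 1: fit the target scaler on the training target only, then reuse it.
y_scaler = StandardScaler().fit(ytr.reshape(-1, 1))
ytr_scaled = y_scaler.transform(ytr.reshape(-1, 1)).ravel()
yte_scaled = y_scaler.transform(yte.reshape(-1, 1)).ravel()  # training mean/std, no leak

scaled_model = make_pipeline(StandardScaler(), ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=10000))
scaled_model.fit(Xtr, ytr_scaled)
rmse_scaled = np.sqrt(mean_squared_error(yte_scaled, scaled_model.predict(Xte)))

# Option 2: let the wrapper scale y for fitting and un-scale the predictions.
wrapped = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=10000)),
    transformer=StandardScaler(),
)
wrapped.fit(Xtr, ytr)
rmse_original_units = np.sqrt(mean_squared_error(yte, wrapped.predict(Xte)))

# Dividing by the training std puts option 2 on the standardized scale too.
rmse_rescaled = rmse_original_units / y_scaler.scale_[0]
print('standardized RMSE, option 1:', rmse_scaled)
print('standardized RMSE, option 2:', rmse_rescaled)
```

Either way, the RMSE for each target ends up in units of that target's training standard deviation, which is what makes it comparable across targets.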

Decided to go back and change the cross validation size to 5 instead of 10 in light of the relatively small dataset here.

In [None]:
# Testing how RMSE changes (hopefully improves!) with hyperparameter tuning

# mix_space = np.linspace(0, 1, 8)
# reg_space = np.logspace(-5, 5, 8)

# elastcv = ElasticNetCV(alphas = reg_space, l1_ratio = mix_space, cv = cv_0, selection = 'random', max_iter = 10000, n_jobs = -1, random_state = 0, positive = True)
# estimators_cv = [('scaler', StandardScaler()), ('elastic_net', elastcv)]
# pipeline_hp = Pipeline(estimators_cv, memory = root + '\\cache')
# pipeline_hp.fit(X_train, bact_train)
# pipeline_hp.score(X_test, bact_test)

In [None]:
# print('RMSE')
# # bact_test has already been scaled above
# rmse_hp = root_mean_squared_error(bact_test, pipeline_hp.predict(X_test))
# print(rmse_hp)
# print()

# print(pipeline_hp['elastic_net'].get_params())
# print()


This sucks. The results are even worse than before. I must have done something wrong here, but I haven't figured out what yet. Also I'm trying to figure out what the hyperparameters were that it ended up on. Maybe it's better to just do ElasticNet and the parameter search separately in the pipeline.
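On finding the hyperparameters it ended up on: get_params() only echoes the constructor arguments (so, the whole grid that was passed in), while the values picked by cross-validation are stored on the fitted ElasticNetCV as alpha_ and l1_ratio_. A small sketch on synthetic data (NOT the project data; the grids and *_demo names are illustrative assumptions, and l1_ratio = 0 is left out of the grid here):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the project data).
X_demo, y_demo = make_regression(n_samples=150, n_features=15, noise=5.0, random_state=0)

alphas_demo = np.logspace(-3, 1, 5)
mix_demo = [0.2, 0.5, 0.8]
cv_demo = KFold(n_splits=5, shuffle=True, random_state=0)

pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('elastic_net', ElasticNetCV(alphas=alphas_demo, l1_ratio=mix_demo, cv=cv_demo, max_iter=10000)),
])
pipe_demo.fit(X_demo, y_demo)

fitted_en = pipe_demo['elastic_net']
# get_params() returns what was passed to the constructor, i.e. the search grid.
print('l1_ratio grid passed in:', fitted_en.get_params()['l1_ratio'])
# The cross-validated choices live in the trailing-underscore attributes.
print('chosen alpha:   ', fitted_en.alpha_)
print('chosen l1_ratio:', fitted_en.l1_ratio_)
```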

In [6]:
# Trying to get a sense of the spread of the data
print('X_train, mean:', np.mean(scale(X_train), axis=0))
print('X_train, std:', np.std(scale(X_train), axis=0))
print('X_test, mean:', np.mean(scale(X_test), axis=0))
print('X_test, std:', np.std(scale(X_test), axis=0))

X_train, mean: [ 2.64987194e-16  6.98253475e-18 -1.82942410e-16 ...  4.69994414e-15
  5.29555435e-15  1.12698111e-15]
X_train, std: [1. 1. 1. ... 1. 1. 1.]
X_test, mean: [-8.32667268e-18  0.00000000e+00  3.99680289e-16 ... -3.17627868e-15
 -7.89299182e-16 -4.98863104e-16]
X_test, std: [1. 1. 1. ... 1. 1. 1.]


In [7]:
print('bact_train, mean:', np.mean(bact_train))
print('bact_train, std:', np.std(bact_train))
print('bact_test, mean:', np.mean(bact_test))
print('bact_test, std:', np.std(bact_test))

bact_train, mean: -4.2034859171342405e-16
bact_train, std: 0.9999999999999998
bact_test, mean: -1.2934098236883074e-15
bact_test, std: 1.0


In [8]:
print(type([1,2,3]))
print(type(list((1,2,3))))
print(type(list(np.arange(3))))

<class 'list'>
<class 'list'>
<class 'list'>


In [44]:
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("elastic_net", ElasticNet(warm_start=True, positive=True, random_state=0, selection="random"))
    ],
    memory = root+'\\cache'
)

REGULARIZATION = np.logspace(-5, 5, 8) # If this doesn't work, may have to enclose the RHS in list()
MIXTURE = np.linspace(0, 1, 8)
PARAM_GRID = [
    {
        "elastic_net__alpha": REGULARIZATION,
        "elastic_net__l1_ratio": MIXTURE
    }
]

hgrid = HalvingGridSearchCV(estimator=pipe, param_grid=PARAM_GRID, factor=2, n_jobs=-1, cv=5, verbose=2, error_score='raise')
hgrid.fit(X_train, bact_train)

n_iterations: 5
n_required_iterations: 7
n_possible_iterations: 5
min_resources_: 10
max_resources_: 318
aggressive_elimination: False
factor: 2
----------
iter: 0
n_candidates: 64
n_resources: 10
Fitting 5 folds for each of 64 candidates, totalling 320 fits


 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan]


----------
iter: 1
n_candidates: 32
n_resources: 20
Fitting 5 folds for each of 32 candidates, totalling 160 fits


         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan -5.1357652  -5.3901937
 -5.3901937  -5.3901937  -5.3901937  -5.3901937  -5.3901937  -5.3901937
 -5.29109964 -5.3901937  -5.3901937  -5.3901937  -5.3901937  -5.3901937
 -5.3901937  -5.3901937  -5.39845712 -5.3901937  -5.3901937  -5.3901937
 -5.3901937  -5.3901937  -5.3901937  -5.3901937  -5.390

----------
iter: 2
n_candidates: 16
n_resources: 40
Fitting 5 folds for each of 16 candidates, totalling 80 fits


         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan -5.1357652  -5.3901937
 -5.3901937  -5.3901937  -5.3901937  -5.3901937  -5.3901937  -5.3901937
 -5.29109964 -5.3901937  -5.3901937  -5.3901937  -5.3901937  -5.3901937
 -5.3901937  -5.3901937  -5.39845712 -5.3901937  -5.3901937  -5.3901937
 -5.3901937  -5.3901937  -5.3901937  -5.3901937  -5.390

----------
iter: 3
n_candidates: 8
n_resources: 80
Fitting 5 folds for each of 8 candidates, totalling 40 fits


         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan         nan         nan
         nan         nan         nan         nan -5.1357652  -5.3901937
 -5.3901937  -5.3901937  -5.3901937  -5.3901937  -5.3901937  -5.3901937
 -5.29109964 -5.3901937  -5.3901937  -5.3901937  -5.3901937  -5.3901937
 -5.3901937  -5.3901937  -5.39845712 -5.3901937  -5.3901937  -5.3901937
 -5.3901937  -5.3901937  -5.3901937  -5.3901937  -5.390

----------
iter: 4
n_candidates: 4
n_resources: 160
Fitting 5 folds for each of 4 candidates, totalling 20 fits


In [47]:
print(hgrid.score(X_test, bact_test))
print('RMSE:', root_mean_squared_error(bact_test, hgrid.predict(X_test)))

0.0
RMSE: 1.0


Still need to do a postmortem here, but at least the halving search converged quickly...?

I think the issue might be up top when reading in the data. Maybe I made a mistake reading it in, or scaling it, or something I haven't thought of yet. If the issue is with scaling it and the pipeline, maybe try manually scaling and manually using an estimator to confirm.

In [29]:
# Checklist:
# data doesn't have any obvious problems that I can see, except for the missing phoa data for two samples
# bact looks okay
# X looks okay
# X_train and bact_train have corresponding shapes
# Same for _test
# No obvious problems with X_train, bact_train, X_test, or bact_test

hgrid.cv_results_

{'iter': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4]),
 'n_resources': array([ 10,  10,  10,  10,  10,  10,  10,  10,  10,  10,  10,  10,  10,
         10,  10,  10,  10,  10,  10,  10,  10,  10,  10,  10,  10,  10,
         10,  10,  10,  10,  10,  10,  10,  10,  10,  10,  10,  10,  10,
         10,  10,  10,  10,  10,  10,  10,  10,  10,  10,  10,  10,  10,
         10,  10,  10,  10,  10,  10,  10,  10,  10,  10,  10,  10,  20,
         20,  20,  20,  20,  20,  20,  20,  20,  20,  20,  20,  20,  20,
         20,  20,  20,  20,  20,  20,  20,  20,  20,  20,  20,  20,  20,
         20,  20,  20,  20,  20,  40,  40,  

Now the halving grid search is giving a bunch of nan values during training, and suspiciously clean R^2 = 0.0 and RMSE = 1.0 values. How did this get even worse??? I only manually deleted the two outlier lines from the csv. Next step: try switching to grid search. If that doesn't work, try manually scaling and then manually fitting elastic net. Also check if I gave it bad hyperparameter values.
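For the postmortem, the raw cv_results_ dict is hard to read. A sketch of one way to tabulate it and count non-finite scores per halving iteration, on synthetic data (NOT the project data; the grid, the *_demo names, and the column selection are illustrative assumptions). It only shows where nan values show up, not why.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the project data).
X_demo, y_demo = make_regression(n_samples=160, n_features=10, noise=5.0, random_state=0)

pipe_demo = Pipeline([('scaler', StandardScaler()), ('elastic_net', ElasticNet(max_iter=10000))])
grid_demo = {
    'elastic_net__alpha': np.logspace(-3, 1, 4),
    'elastic_net__l1_ratio': [0.2, 0.5, 0.8],
}

search_demo = HalvingGridSearchCV(pipe_demo, grid_demo, factor=2, cv=5, random_state=0)
search_demo.fit(X_demo, y_demo)

# One row per (iteration, candidate) pair.
results = pd.DataFrame(search_demo.cv_results_)
cols = ['iter', 'n_resources', 'param_elastic_net__alpha',
        'param_elastic_net__l1_ratio', 'mean_test_score']

# Number of candidates with a non-finite mean score in each halving iteration.
nan_per_iter = results.groupby('iter')['mean_test_score'].apply(
    lambda s: int((~np.isfinite(s)).sum())
)
print(nan_per_iter)

# Candidates that survived to the last iteration, with their scores.
print(results.loc[results['iter'] == results['iter'].max(), cols])
```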

In [45]:
grid = GridSearchCV(estimator=pipe, param_grid=PARAM_GRID, scoring='neg_root_mean_squared_error', n_jobs=-1, cv=5, verbose=2, error_score='raise')
grid.fit(X_train, bact_train)

Fitting 5 folds for each of 64 candidates, totalling 320 fits


In [55]:
print('best estimator:', grid.best_estimator_)
print()
print('best RMSE:', grid.best_score_*-1)
print()
print('best params:', grid.best_params_)
print()
print('preds on train:', grid.predict(X_train))
print()
print('preds on test:', grid.predict(X_test))
print()
best_estimator = grid.best_estimator_
print('model coeffs:', best_estimator['elastic_net'].coef_)

best estimator: Pipeline(memory='C:\\Users\\joshua.waldbieser\\OneDrive - '
                'USDA\\root_knot_nematode_greenhouse\\paper_2\\DirtSpectra\\cache',
         steps=[('scaler', StandardScaler()),
                ('elastic_net',
                 ElasticNet(alpha=np.float64(0.19306977288832497),
                            l1_ratio=np.float64(0.5714285714285714),
                            positive=True, random_state=0, selection='random',
                            warm_start=True))])

best RMSE: 1.000145737688182

best params: {'elastic_net__alpha': np.float64(0.19306977288832497), 'elastic_net__l1_ratio': np.float64(0.5714285714285714)}

preds on train: [-4.20348592e-16 -4.20348592e-16 -4.20348592e-16 -4.20348592e-16
 -4.20348592e-16 -4.20348592e-16 -4.20348592e-16 -4.20348592e-16
 -4.20348592e-16 -4.20348592e-16 -4.20348592e-16 -4.20348592e-16
 -4.20348592e-16 -4.20348592e-16 -4.20348592e-16 -4.20348592e-16
 -4.20348592e-16 -4.20348592e-16 -4.20348592e-16 -4.20348592e-16


I think I found the reason all of this is so screwed up. It's predicting (basically) 0 for every sample, which IS the mean of the scaled training target (the predictions match the bact_train mean printed earlier), but it's still a blind guess. All the model coefficients are zero, so only the intercept is left, which is why it's predicting the same value every time. Maybe something went wrong with the scaling, then.
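A sketch of how an elastic net ends up with all-zero coefficients, on synthetic data (NOT the project data; the data-generating setup and *_demo names are assumptions for illustration). With standardized X and y, the all-zero solution is optimal whenever alpha * l1_ratio is at least as large as the biggest absolute feature/target correlation. The ElasticNet defaults (alpha = 1.0, l1_ratio = 0.5) put that threshold at 0.5, so weakly correlated features all get zeroed out and the model predicts the training mean.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
n = 300
X_demo = scale(rng.normal(size=(n, 50)))
# Weak signal spread over five features plus noise, standardized like bact_train.
y_demo = scale(X_demo[:, :5] @ np.full(5, 0.1) + rng.normal(size=n))

alpha, l1_ratio = 1.0, 0.5  # the ElasticNet defaults
# For standardized data this is the largest absolute feature/target correlation.
max_corr = np.max(np.abs(X_demo.T @ y_demo)) / n

en_demo = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X_demo, y_demo)
train_r2 = en_demo.score(X_demo, y_demo)
print('max |corr|:', max_corr, '| threshold alpha * l1_ratio:', alpha * l1_ratio)
print('non-zero coefficients:', np.count_nonzero(en_demo.coef_))
print('train R^2:', train_r2)

# A much smaller penalty drops the threshold below max |corr|.
en_small = ElasticNet(alpha=0.01, l1_ratio=l1_ratio, max_iter=10000).fit(X_demo, y_demo)
print('non-zero coefficients with alpha=0.01:', np.count_nonzero(en_small.coef_))
```

Whether this is what is happening with the real spectra is not confirmed here; it is only a check that could be run on X_train and bact_train.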

UPDATE: I just double checked with the previous R code, and I didn't normalize the target variables there. (But should I have? It would make cross-target RMSE directly comparable...)

In [60]:
# Testing to see if scaling is doing what I think it's doing
# toy = np.arange(12).reshape((4,3))
# print(toy)
# print()
# print(scale(toy))
# print()
# print(np.mean(scale(toy), axis=0))
# print()
# print(np.std(scale(toy), axis=0))

# Yup

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]

[[-1.34164079 -1.34164079 -1.34164079]
 [-0.4472136  -0.4472136  -0.4472136 ]
 [ 0.4472136   0.4472136   0.4472136 ]
 [ 1.34164079  1.34164079  1.34164079]]

[0. 0. 0.]

[1. 1. 1.]


In [75]:
# Testing if maybe there's something going on within the hyperparameter tuning, and that was overfit
test_pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("elastic_net", ElasticNet(warm_start=True, positive=True, random_state=0, selection="random"))
    ],
    memory = root+'\\cache'
)

test_pipe.fit(X_train, bact_train)
test_pipe['elastic_net'].coef_

array([0., 0., 0., ..., 0., 0., 0.])

Yeah, this has the same issue. So the problem probably isn't in the cross validation, but in the preprocessing (scaling or train/test splits).

In [78]:
# Testing if the issue is with scaling bact_train and bact_test
en = ElasticNet(warm_start=True, positive=True, random_state=0, selection='random')
en.fit(X_train, bact_train_noscale)
print(en.score(X_train, bact_train_noscale))
print(en.score(X_test, bact_test_noscale))
print(en.coef_)

0.0
-0.05253706730489749
[0. 0. 0. ... 0. 0. 0.]


Doesn't appear that the issue is with scaling y, although I didn't really expect it to be.

In [79]:
X_train_scaled = scale(X_train)
X_test_scaled = scale(X_test)
en1 = ElasticNet(warm_start=True, positive=True, random_state=0, selection='random')
en1.fit(X_train_scaled, bact_train)
print(en1.score(X_train_scaled, bact_train))
print(en1.score(X_test_scaled, bact_test))
print(en1.coef_)

0.0
0.0
[0. 0. 0. ... 0. 0. 0.]


Similarly, manually scaling X doesn't fix the issue.

In [84]:
en2 = ElasticNet(warm_start=True, positive=True, random_state=0, selection='random')
en2.fit(X_train, bact_train)
print(en2.score(X_train, bact_train))
print(en2.score(X_test, bact_test))
print(en2.coef_)

0.0
0.0
[0. 0. 0. ... 0. 0. 0.]


And not scaling X doesn't either. Now test train_test_split to see if something's going on there.

In [100]:
repo = git.Repo('.', search_parent_directories = True)
root = repo.working_tree_dir

# The sample id and the log-transformed gene expression values.
half_data_1 = pd.read_csv(root + '\\data\\RKNGHStress.csv')
half_data_1 = half_data_1.loc[:, half_data_1.columns.str.startswith(('Sample', 'Log'))]
half_data_1 = half_data_1.rename(columns = {'Sample' : 'sample', 'Log16S' : 'bact', 'Logcbblr' : 'cbblr', 'Log18S' : 'fungi', 'Logphoa' : 'phoa', 'Logurec' : 'urec'})

# The hyperspectral measurements for each sample
half_data_2 = pd.read_csv(root + '\\data\\RKNGHStressPCAPSR.csv')
half_data_2 = half_data_2.rename(columns = {'Unnamed: 0' : 'sample'})

data = half_data_1.join(half_data_2.set_index('sample'), on = 'sample')

# Just went through this again, and didn't see any problems

In [101]:
X = data.drop(['sample', 'bact', 'cbblr', 'fungi', 'phoa', 'urec'], axis = 1)
# NOTE: when doing phoa, there are a couple of samples (3 and 30) that have no data recorded, so we'll need to remove NAs there. But those observations still have data for the other genes.
bact = data[['bact']]

In [115]:
# print(data.head(n=15))
# print('-----------------------------------------')
# print(X.head())
# print('-----------------------------------------')
# print(bact.head(n=15))

Nothing surprising here, it all looks like it should, from what I can tell...

In [170]:
X_train, X_test, bact_train, bact_test = train_test_split(X.to_numpy(), bact.to_numpy(), train_size = 0.8, random_state = 0)
# Testing whether a plain linear regression (no penalty, no hyperparameter tuning) also gives all-zero coefficients
test_pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("lin_reg", LinearRegression(positive=True))
    ],
    memory = root+'\\cache'
)

test_pipe = test_pipe.fit(X_train, bact_train)
print(test_pipe['lin_reg'].coef_)
print(test_pipe.score(X_test, bact_test))

[[0. 0. 0. ... 0. 0. 0.]]
-0.06419526757866234


Just played around with different random_state vals for the split, and coeffs are still 0. Also tried doing just a basic linear regression, and same problem.
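One thing that stands out when comparing the cells: the only elastic net above with a non-trivial R^2 (the first ElasticNetCV pipeline) did not set positive=True, while every all-zero model since then, including this linear regression, does. A sketch on synthetic data (NOT the project data; the setup and names are assumptions for illustration) of what positive=True does when every feature is negatively related to the target: non-negative least squares has nothing it is allowed to use, so all coefficients stay at 0. Whether the spectra here actually behave like that has not been checked, so this is a hypothesis to test, not a diagnosis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
# Synthetic target that DEcreases with every feature (NOT the project data).
y_demo = -X_demo.sum(axis=1) + rng.normal(scale=0.1, size=500)

# Same pipeline twice; the only difference is the sign constraint.
constrained = make_pipeline(StandardScaler(), LinearRegression(positive=True)).fit(X_demo, y_demo)
free = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)

coef_constrained = constrained[-1].coef_
coef_free = free[-1].coef_
r2_constrained = constrained.score(X_demo, y_demo)
r2_free = free.score(X_demo, y_demo)

print('positive=True coefs:', coef_constrained)
print('positive=True R^2:  ', r2_constrained)
print('unconstrained coefs:', coef_free)
print('unconstrained R^2:  ', r2_free)
```

A quick test on the real data would be to refit test_pipe with positive=False and see whether the coefficients are still all zero.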

In [166]:
X_train, X_test, bact_train, bact_test = train_test_split(X, bact, train_size = 0.8, random_state = 0)
test_pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("elastic_net", ElasticNet(warm_start=True, positive=True, random_state=0, selection='random'))
    ],
    memory = root+'\\cache',
    verbose=True
)

REGULARIZATION = np.logspace(-5, 5, 8) # If this doesn't work, may have to enclose the RHS in list()
MIXTURE = np.linspace(0, 1, 8)
PARAM_GRID = [
    {
        "elastic_net__alpha": REGULARIZATION,
        "elastic_net__l1_ratio": MIXTURE
    }
]

grid = GridSearchCV(estimator=test_pipe, param_grid=PARAM_GRID, scoring='neg_root_mean_squared_error', n_jobs=-1, cv=cv_0, verbose=2, error_score='raise')
grid.fit(X_train, bact_train)

Fitting 5 folds for each of 64 candidates, totalling 320 fits
[Pipeline] ....... (step 2 of 2) Processing elastic_net, total=   0.0s


In [163]:
print(grid.score(X_train, bact_train))
print(grid.score(X_test, bact_test))
print(grid)
print(grid.best_estimator_.named_steps['elastic_net'].coef_)
print()
print(grid.best_params_)

-0.44126421540811983
-0.46389124883573146
GridSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=True),
             error_score='raise',
             estimator=Pipeline(memory='C:\\Users\\joshua.waldbieser\\OneDrive '
                                       '- '
                                       'USDA\\root_knot_nematode_greenhouse\\paper_2\\DirtSpectra\\cache',
                                steps=[('scaler', StandardScaler()),
                                       ('elastic_net',
                                        ElasticNet(positive=True,
                                                   random_state=0,
                                                   selection='random',
                                                   warm_start=True))],
                                verbose=True),
             n_jobs=-1,
             param_grid=[{'elastic_net__alpha': array([1.00000000e-05, 2.68269580e-04, 7.19685673e-03, 1.93069773e-01,
       5.17947468e+00, 1.38949549e+02,

No idea why it's all zeros. Does it work for random forests?

In [173]:
rf = RandomForestRegressor()
rf.fit(X_train, bact_train.ravel())

In [178]:
print(rf.score(X_train, bact_train))
print(rf.score(X_test, bact_test))

0.8552177307187465
0.020627383638791574


Well it overfit pretty badly on this, but I didn't put any effort into tuning hyperparameters so that's reasonable enough. At least it didn't give an abysmal score on the test set too... what's going on???