There is an endemic problem today in Bioinformatics.  With Deep Learning$^{\text{TM}}$ becoming the next Hot New Thing, the Bioinformatics community has worked quickly to catch up.  However, I think some important lessons of the Machine Learning community have not filtered through to the Bioinformatics community.  Specifically, the big issue is how to choose a test set.

# What purpose does a test set serve?

The fundamental purpose of a test set is a fair and honest evaluation of the performance of the trained model.  In other words, the test set is used to answer the question: if we train the model and use it on future data, how well can we expect the model to perform?  The honesty part is absolutely critical, because otherwise we're going to overpromise the performance.  This is why the machine learning community hammers on the concept of a proper test set.  It also guides the choice of the test set.  I'll illustrate this with some examples.

## Examples

Suppose we want to build a model to predict stock prices.  If we train the model today, then we want it to predict prices for tomorrow.  In particular, no information from tomorrow will be seen by the model during training.  For example, if stocks A & B are correlated, say they're both in the same industry and will be similarly affected by underlying economic conditions, then we cannot use the fact that if stock A is up tomorrow then so is stock B (except in a Granger sense, meaning we can use the current rise in A to predict a likely rise of B in the future).  Nor can we use an economic event that causes a general shift in stock prices: such a shift would be apparent if you saw 80% of future prices, but not if you saw none of them.  Therefore the proper split to simulate this behavior is a time-based split.  If we don't do this, then we would fool ourselves about the performance of the algorithm, and we might deploy an algorithm that under-performs and loses us money.
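
To make the time-based split concrete, here is a minimal sketch.  The helper `time_based_split` and the toy day indices are my own illustration (not from any particular library); the only idea is that everything before a cutoff is training data and everything from the cutoff onward is test data.

```python
import numpy as np

def time_based_split(timestamps, cutoff):
    """Return (train_idx, test_idx): rows before `cutoff` train, the rest test."""
    timestamps = np.asarray(timestamps)
    train_idx = np.flatnonzero(timestamps < cutoff)
    test_idx = np.flatnonzero(timestamps >= cutoff)
    return train_idx, test_idx

# Hypothetical data: one observation per trading day (a day index stands in for a date).
days = np.arange(10)
train_idx, test_idx = time_based_split(days, cutoff=8)
print(train_idx)  # [0 1 2 3 4 5 6 7]
print(test_idx)   # [8 9]
```

The point of the design is that the test rows are never earlier than any training row, so nothing from "tomorrow" can leak into training.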

Or consider the following example from [r/MachineLearning](https://www.reddit.com/r/MachineLearning/comments/c4ylga/d_misuse_of_deep_learning_in_nature_journals/): a Nature paper proposed a deep neural network to predict the location of aftershocks.  If we train a model today and an earthquake happens tomorrow, then what information is available to predict the location of the subsequent aftershocks?  We can use past earthquakes and aftershocks, as well as information we get from the initial earthquake.  What we don't get is aftershocks from the current earthquake.  Therefore, a proper split would be either by time (as above) or by earthquakes grouped with their aftershocks.  As the post above shows, doing the latter type of split results in a simple regularized regression having better test set performance, which indicates that the deep neural network is over-fitting.  
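
A split "by earthquakes grouped with their aftershocks" can be sketched as follows.  The helper `group_holdout` and the `mainshock_id` labels are hypothetical names for illustration; the idea is simply to pick whole groups for the test set rather than individual rows.

```python
import numpy as np

def group_holdout(groups, test_fraction=0.2, seed=0):
    """Hold out whole groups: every row of a chosen group goes to the test set."""
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)
    rng = np.random.default_rng(seed)
    n_test = max(1, int(round(test_fraction * len(unique_groups))))
    test_groups = rng.choice(unique_groups, size=n_test, replace=False)
    test_mask = np.isin(groups, test_groups)
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical data: each aftershock row is labelled with the id of its mainshock.
mainshock_id = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 4, 4])
train_idx, test_idx = group_holdout(mainshock_id, test_fraction=0.2)

# No mainshock contributes rows to both sides of the split.
assert set(mainshock_id[train_idx]).isdisjoint(mainshock_id[test_idx])
```

scikit-learn ships group-aware splitters such as `GroupKFold` and `GroupShuffleSplit` that do the same job inside cross-validation.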

Now consider examples from bioinformatics.  Suppose we want to build a model to predict which guide RNAs are going to be effective.  If we want to apply the model to help design guides in a new experiment, then we would very typically not have access to a previous experiment in the same cell type and target phenotype (that we are trying to select for).  If we did, then we could just use that experiment to select which guides to use.  In particular, if we use the same experiment to train and evaluate the model, then there will be several confounders, such as batch effects, which will make us over-confident in our predictions.  [One paper](https://www.nature.com/articles/nbt.4061) clearly showed this with an out-of-sample test set (hidden in the supplementary) where simple regularized regression showed better performance than their proposed deep learning model.

Now consider the problem of predicting gene expression from other modalities, such as the promoter genetic sequence plus open chromatin of the particular cell type.  If we want to deploy the model, then we would take the model, the genetic sequence, and open chromatin data to predict the gene expression of a sample for which we have no gene expression data, like a new patient sample.  The key here is that there is a lot of biological (cell type to cell type or person to person) variation and batch effects present.  These effects hugely impact the variation in expression, but they won't be available to the model in production.  Therefore, if the model is able to "see" those batch effects (say using a simple train-test split), we will overestimate the accuracy of the model.
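
The overestimate is easy to reproduce on synthetic data.  The sketch below is a toy setup of my own, not real expression data: each group (think sample or batch) gets its own offset in the response, and the only feature identifies the batch without carrying any signal that transfers to a new batch.  A shuffled split lets the model memorize the offsets; a grouped split does not.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(0)
n_groups, n_per_group = 20, 50
groups = np.repeat(np.arange(n_groups), n_per_group)

# Each group has its own offset in the response (the "batch effect")...
batch_offset = rng.normal(0.0, 1.0, size=n_groups)
y = batch_offset[groups] + rng.normal(0.0, 0.3, size=groups.size)

# ...and a feature that identifies the batch but is unrelated to the offset.
batch_feature = rng.normal(0.0, 1.0, size=n_groups)
X = batch_feature[groups].reshape(-1, 1)

model = RandomForestRegressor(n_estimators=50, random_state=0)
random_cv = KFold(n_splits=5, shuffle=True, random_state=0)
random_r2 = cross_val_score(model, X, y, cv=random_cv).mean()
grouped_r2 = cross_val_score(model, X, y, cv=GroupKFold(n_splits=5), groups=groups).mean()

print(f"shuffled split R^2: {random_r2:.2f}")
print(f"grouped split R^2:  {grouped_r2:.2f}")
```

In this toy setup the shuffled score should look excellent while the grouped score collapses, even though the model and the data are identical; only the split changed.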


# Example: guide RNA design

To clearly illustrate how this issue arises we'll use the third example above.  Let's say we want to build a model to improve on-target effects for CRISPRko (CRISPR knockout) guides.  To train the model we'll use the [Toronto Knockout Library](http://tko.ccbr.utoronto.ca/) dataset, a collection of CRISPRko experiments for gene essentiality on 5 different cell lines.  To remove the bias of biological effect and the bias of using the training data to select positive hit genes, we'll subset the training data to previously known essential genes (from http://www.ncbi.nlm.nih.gov/pubmed/24987113).

First, we have to process the counts to convert them to log fold changes.  We'll do this using all guides.
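
For readers who haven't done this conversion before, here is a minimal sketch of one common convention: normalize each sample to reads per million, add a pseudocount, and take the log2 ratio.  This is an assumption for illustration; it is not necessarily how the file loaded below was produced, and the function name and toy counts are hypothetical.

```python
import numpy as np

def log2_fold_change(control_counts, treated_counts, pseudocount=1.0):
    """log2 ratio of depth-normalized counts (treated over control)."""
    control = np.asarray(control_counts, dtype=float)
    treated = np.asarray(treated_counts, dtype=float)
    # Scale each sample to reads per million so sequencing depth cancels out.
    control_rpm = control / control.sum() * 1e6
    treated_rpm = treated / treated.sum() * 1e6
    # The pseudocount keeps guides with zero reads from producing log2(0).
    return np.log2((treated_rpm + pseudocount) / (control_rpm + pseudocount))

# Hypothetical read counts for four guides at the start (control) and end (treated) of a screen.
lfc = log2_fold_change([100, 200, 300, 400], [100, 200, 300, 50])
print(lfc)  # the last guide is depleted, so its value is negative
```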

## Preprocessing


In [25]:
import pandas as pd

log2fc_df = pd.read_csv("../CombinedLog2FoldChanges.txt", sep = '\t')
print(log2fc_df.shape)
log2fc_df.head()

(456600, 5)


Unnamed: 0,seq,gene,log2FoldChange,lib,essential
0,CACCTTCGAGCTGCTGCGCG,A1BG,-0.198332,DLD1,0
1,AAGAGCGCCTCGGTCCCAGC,A1BG,-0.631673,DLD1,0
2,TGGACTTCCAGCTACGGCGC,A1BG,-1.315708,DLD1,0
3,CACTGGCGCCATCGAGAGCC,A1BG,0.989644,DLD1,0
4,GCTCGGGCTTGTCCACAGGA,A1BG,0.021679,DLD1,0


In [26]:
log2fc_df['gene'].value_counts()

chr10         6900
LacZ           480
luciferase     120
EGFP           110
RAP1GAP         30
              ... 
CEP57L1          5
CALCB            5
KRTAP5-2         5
SRSF3            5
STARD6           5
Name: gene, Length: 17236, dtype: int64

In [27]:
log2fc_df['gene'].value_counts().value_counts()

30      12037
25       1642
20       1087
15        887
5         801
10        778
480         1
6900        1
120         1
110         1
Name: gene, dtype: int64

Note that `chr10`, `LacZ`, and `luciferase` are control guides.  Well, that's what they were designed for, but in fact some of these guides have a non-null effect.  Most genes have 30 rows; with 5 libraries in the table, that corresponds to 6 guides per gene if each guide is measured in every library.

In [29]:
log2fc_df.loc[log2fc_df['log2FoldChange'].isna(), 'log2FoldChange'] = 0
# subset to essential genes and negative controls
essential_data = log2fc_df[(log2fc_df['essential'] == 1) | (log2fc_df['gene'] == 'chr10')].copy()
#essential_data.loc[essential_data['log2FoldChange'].isna(), 'log2FoldChange'] = 0
essential_data = essential_data.reset_index(drop = True)
print(essential_data.shape)

(14440, 5)


In [30]:
# breakdown between libraries
log2fc_df['lib'].value_counts()

DLD1        91320
RPE1        91320
HeLa        91320
HCT116_1    91320
GBM         91320
Name: lib, dtype: int64

In [31]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

def seq_to_one_hot(seq):
    # transform sequence into an array
    seq_array = np.array(list(seq))
    
    #integer encode the sequence
    label_encoder = LabelEncoder()
    integer_encoded_seq = label_encoder.fit_transform(seq_array)
    #one hot the sequence
    onehot_encoder = OneHotEncoder(sparse=False)
    #reshape because that's what OneHotEncoder likes
    integer_encoded_seq = integer_encoded_seq.reshape(len(integer_encoded_seq), 1)
    onehot_encoded_seq = onehot_encoder.fit_transform(integer_encoded_seq)
    return onehot_encoded_seq.flatten()

log2fc_df['sklearn_1hot'] = log2fc_df['seq'].map(lambda s: seq_to_one_hot(s))

In [32]:
log2fc_df.head()

Unnamed: 0,seq,gene,log2FoldChange,lib,essential,sklearn_1hot
0,CACCTTCGAGCTGCTGCGCG,A1BG,-0.198332,DLD1,0,"[0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ..."
1,AAGAGCGCCTCGGTCCCAGC,A1BG,-0.631673,DLD1,0,"[1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ..."
2,TGGACTTCCAGCTACGGCGC,A1BG,-1.315708,DLD1,0,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, ..."
3,CACTGGCGCCATCGAGAGCC,A1BG,0.989644,DLD1,0,"[0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ..."
4,GCTCGGGCTTGTCCACAGGA,A1BG,0.021679,DLD1,0,"[0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ..."


In [33]:
log2fc_df['sklearn_1hot'].map(lambda s: len(s)).value_counts()

80    448985
60      7575
40        40
Name: sklearn_1hot, dtype: int64

In [34]:
log2fc_df['seq'].map(lambda s: len(s)).value_counts()

20    456600
Name: seq, dtype: int64

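So the problem is not variable sequence length.  Rather, `seq_to_one_hot` fits a fresh `LabelEncoder` (and `OneHotEncoder`) on each sequence, so a guide that lacks one nucleotide is encoded with only 3 categories (20 × 3 = 60 features), and a guide that lacks two with only 2 (20 × 2 = 40).  The solution on stackoverflow is to switch to pandas (https://stackoverflow.com/a/33761341).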
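
For comparison, the root cause can also be removed by encoding against a fixed alphabet instead of whatever letters happen to occur in a given sequence.  This is a minimal sketch with a hypothetical helper name (`seq_to_one_hot_fixed`) and a made-up guide; it is not what the rest of this notebook uses.

```python
import numpy as np

NUCLEOTIDES = "ACGT"  # fixed alphabet shared by every sequence
NUCLEOTIDE_INDEX = {base: i for i, base in enumerate(NUCLEOTIDES)}

def seq_to_one_hot_fixed(seq):
    """One-hot encode `seq` position by position against the fixed ACGT alphabet."""
    encoded = np.zeros((len(seq), len(NUCLEOTIDES)))
    for position, base in enumerate(seq):
        encoded[position, NUCLEOTIDE_INDEX[base]] = 1.0
    return encoded.flatten()

# A made-up 20-mer with no T still yields 20 * 4 = 80 features.
print(len(seq_to_one_hot_fixed("CACCAACGAGCAGCAGCGCG")))  # 80
```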

In [35]:
seq_array = pd.DataFrame(np.array([list(x) for x in log2fc_df['seq']]))
seq_array.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,C,A,C,C,T,T,C,G,A,G,C,T,G,C,T,G,C,G,C,G
1,A,A,G,A,G,C,G,C,C,T,C,G,G,T,C,C,C,A,G,C
2,T,G,G,A,C,T,T,C,C,A,G,C,T,A,C,G,G,C,G,C
3,C,A,C,T,G,G,C,G,C,C,A,T,C,G,A,G,A,G,C,C
4,G,C,T,C,G,G,G,C,T,T,G,T,C,C,A,C,A,G,G,A


In [36]:
seq_array_1hot = pd.get_dummies(seq_array)
print(seq_array_1hot.shape)
seq_array_1hot.head()

(456600, 76)


Unnamed: 0,0_A,0_C,0_G,0_T,1_A,1_C,1_G,1_T,2_A,2_C,...,16_G,17_A,17_C,17_G,18_A,18_C,18_G,19_A,19_C,19_G
0,0,1,0,0,1,0,0,0,0,1,...,0,0,0,1,0,1,0,0,0,1
1,1,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,1,0,1,0
2,0,0,0,1,0,0,1,0,0,0,...,1,0,1,0,0,0,1,0,1,0
3,0,1,0,0,1,0,0,0,0,1,...,0,0,0,1,0,1,0,0,1,0
4,0,0,1,0,0,1,0,0,0,0,...,0,0,0,1,0,0,1,1,0,0


In [37]:
seq_array_1hot.columns.values

array(['0_A', '0_C', '0_G', '0_T', '1_A', '1_C', '1_G', '1_T', '2_A',
       '2_C', '2_G', '2_T', '3_A', '3_C', '3_G', '3_T', '4_A', '4_C',
       '4_G', '4_T', '5_A', '5_C', '5_G', '5_T', '6_A', '6_C', '6_G',
       '6_T', '7_A', '7_C', '7_G', '7_T', '8_A', '8_C', '8_G', '8_T',
       '9_A', '9_C', '9_G', '9_T', '10_A', '10_C', '10_G', '10_T', '11_A',
       '11_C', '11_G', '11_T', '12_A', '12_C', '12_G', '12_T', '13_A',
       '13_C', '13_G', '13_T', '14_A', '14_C', '14_G', '14_T', '15_A',
       '15_C', '15_G', '15_T', '16_A', '16_C', '16_G', '17_A', '17_C',
       '17_G', '18_A', '18_C', '18_G', '19_A', '19_C', '19_G'],
      dtype=object)

In [38]:
seq_array[19].value_counts()

C    184450
A    139055
G    133095
Name: 19, dtype: int64

In [39]:
seq_array[18].value_counts()

G    161340
C    155085
A    140175
Name: 18, dtype: int64

OK, so the T columns at the end of the guide (positions 16 to 19) are missing because no guide in the data has a T at those positions.  Let's re-index to add those columns in, filled with zeros.

In [40]:
new_cols = seq_array_1hot.columns.values.tolist() + ['16_T', '17_T', '18_T', '19_T']
print(new_cols)
seq_array_1hot = seq_array_1hot.reindex(columns =  new_cols, fill_value=0)
seq_array_1hot.head()

['0_A', '0_C', '0_G', '0_T', '1_A', '1_C', '1_G', '1_T', '2_A', '2_C', '2_G', '2_T', '3_A', '3_C', '3_G', '3_T', '4_A', '4_C', '4_G', '4_T', '5_A', '5_C', '5_G', '5_T', '6_A', '6_C', '6_G', '6_T', '7_A', '7_C', '7_G', '7_T', '8_A', '8_C', '8_G', '8_T', '9_A', '9_C', '9_G', '9_T', '10_A', '10_C', '10_G', '10_T', '11_A', '11_C', '11_G', '11_T', '12_A', '12_C', '12_G', '12_T', '13_A', '13_C', '13_G', '13_T', '14_A', '14_C', '14_G', '14_T', '15_A', '15_C', '15_G', '15_T', '16_A', '16_C', '16_G', '17_A', '17_C', '17_G', '18_A', '18_C', '18_G', '19_A', '19_C', '19_G', '16_T', '17_T', '18_T', '19_T']


Unnamed: 0,0_A,0_C,0_G,0_T,1_A,1_C,1_G,1_T,2_A,2_C,...,18_A,18_C,18_G,19_A,19_C,19_G,16_T,17_T,18_T,19_T
0,0,1,0,0,1,0,0,0,0,1,...,0,1,0,0,0,1,0,0,0,0
1,1,0,0,0,1,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,0,0,0,1,0,0,1,0,0,0,...,0,0,1,0,1,0,0,0,0,0
3,0,1,0,0,1,0,0,0,0,1,...,0,1,0,0,1,0,0,0,0,0
4,0,0,1,0,0,1,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
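An alternative to patching the columns by hand is to declare the alphabet up front with a categorical dtype, so that `pd.get_dummies` emits a column for every nucleotide whether or not it is observed.  A small sketch on toy 4-mers (the sequences are made up):

```python
import pandas as pd

# Toy 4-mers; none of them has a T in the last two positions.
seqs = pd.Series(["CACC", "AAGA", "GGCA"])
positions = pd.DataFrame([list(s) for s in seqs])

# With the categories declared, unobserved nucleotides still get a (all-zero) column.
nucleotide_dtype = pd.CategoricalDtype(categories=list("ACGT"))
positions = positions.astype(nucleotide_dtype)
one_hot = pd.get_dummies(positions)

print(one_hot.shape)  # (3, 16)
print(one_hot.columns.tolist())
```

This also keeps the columns in position order, rather than appending the missing `T` columns at the end.
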


To control for variable gene effect sizes I'll include a gene indicator.  

In [42]:
gene_one_hot = pd.get_dummies(log2fc_df['gene'], prefix = 'gene')
print(gene_one_hot.shape)
gene_one_hot.head()

(456600, 17236)


Unnamed: 0,gene_A1BG,gene_A1CF,gene_A2M,gene_A2ML1,gene_A4GALT,gene_A4GNT,gene_AAAS,gene_AACS,gene_AADAC,gene_AADACL2,...,gene_ZWINT,gene_ZXDB,gene_ZXDC,gene_ZYG11A,gene_ZYG11B,gene_ZYX,gene_ZZEF1,gene_ZZZ3,gene_chr10,gene_luciferase
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [43]:
X = seq_array_1hot.merge(gene_one_hot, left_index = True, right_index = True)
X.shape

(456600, 17316)
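
A dense matrix of this shape has roughly 7.9 billion cells, almost all of them zero, which is a lot to hold in memory.  One option, sketched here on toy data as an assumption rather than what this notebook does, is to keep the design matrix sparse; `RandomForestRegressor` accepts SciPy sparse input.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for the real table: four guides across three "genes".
toy = pd.DataFrame({"gene": ["A1BG", "A1BG", "A2M", "chr10"]})
seq_features = np.eye(4)  # placeholder for the 80 one-hot sequence columns

# OneHotEncoder returns a SciPy sparse matrix by default, so only non-zeros are stored.
gene_features = OneHotEncoder().fit_transform(toy[["gene"]])
X_sparse = sparse.hstack([sparse.csr_matrix(seq_features), gene_features]).tocsr()

print(X_sparse.shape)  # (4, 7)
print(X_sparse.nnz)    # 8
```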

In [44]:
y = log2fc_df['log2FoldChange']

In [45]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error,explained_variance_score

## Simple train-test split

First, let's look at a simple train-test split.  Since there are 5 libraries/data sources, I'll use a 20% test set size (5-fold cross-validation).  Note that the rows are ordered by library and every library has the same number of rows, so a standard (unshuffled) CV split results in a split by library.  I'll therefore shuffle the data frame before computing the CV scores.

In [None]:
shuffled_df = X.copy()
shuffled_df['y'] = y.copy()
shuffled_df = shuffled_df.sample(frac=1, replace=False).reset_index(drop=True)
shuffled_y = shuffled_df['y']
shuffled_X = shuffled_df.drop(['y'], axis = 1)
rf_model = RandomForestRegressor()
cv_scores = cross_val_score(rf_model, shuffled_X, shuffled_y, cv=5)
cv_scores

In [None]:
cv_scores.mean()
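
The reason shuffling matters here is that passing an integer `cv` to `cross_val_score` for a regressor uses `KFold` without shuffling, i.e. contiguous blocks of rows.  The sketch below shows the difference on ten toy rows; passing a shuffled `KFold` as `cv` is an alternative to shuffling the data frame by hand.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy stand-in: imagine rows 0-4 come from one library and rows 5-9 from another.
n_rows = 10
placeholder_X = np.zeros((n_rows, 1))

unshuffled = KFold(n_splits=2)  # what cv=2 means for a regressor: contiguous blocks
shuffled = KFold(n_splits=2, shuffle=True, random_state=0)

for name, cv in [("unshuffled", unshuffled), ("shuffled", shuffled)]:
    for train_idx, test_idx in cv.split(placeholder_X):
        print(name, "test rows:", test_idx)
```

With the unshuffled splitter, each test fold is exactly one "library"; with the shuffled one, rows from both halves end up in every fold.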

## Split by library

Now let's take a look at what happens when you split by library. Note that since the libraries are in order, and there are an equal number of guides per library, we can do a standard 5-way cross validation.

In [None]:
rf_model = RandomForestRegressor()
lib_cv_scores = cross_val_score(rf_model, X, y, cv=5)
lib_cv_scores

In [None]:
lib_cv_scores.mean()
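
Relying on row order works here, but it breaks silently if the rows are ever re-sorted or the libraries differ in size.  A more explicit version passes the library labels as groups.  The sketch below uses toy data and a `Ridge` model purely to keep it fast (both are my substitutions); on the real data the analogous call would use `groups=log2fc_df['lib']`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
libs = np.repeat(["DLD1", "GBM", "HeLa"], 30)  # toy library labels
X_toy = rng.normal(size=(libs.size, 3))
y_toy = X_toy @ np.array([1.0, -0.5, 0.0]) + rng.normal(scale=0.1, size=libs.size)

# Each fold holds out every row of exactly one library.
scores = cross_val_score(Ridge(), X_toy, y_toy, cv=LeaveOneGroupOut(), groups=libs)

# LeaveOneGroupOut iterates over groups in sorted label order, matching np.unique.
per_library = dict(zip(np.unique(libs), scores))
print(per_library)
```

Keeping the scores keyed by library name also makes it harder to mix up which fold corresponds to which cell line.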

# Interpretation

Note that the average $R^{2}$ (the default score for `RandomForestRegressor`) is lower when split by library.  However, the variance across folds is higher, so the test set performance is still high for some libraries.  The order of the cell types is as follows:
- DLD1, [male colorectal cancer cell line](https://www.atcc.org/products/ccl-221);
- GBM, glioblastoma (don't know the exact cell line);
- HCT116, [male colorectal carcinoma cell line](https://imanislife.com/collections/cell-lines/hct116-cells/);
- HeLa, [female cervical cancer](https://en.wikipedia.org/wiki/HeLa);
- RPE1, [female immortalized retinal pigment epithelium](https://web.expasy.org/cellosaurus/CVCL_4388).

Note that the first two have the highest test set scores.  It seems reasonable that DLD1 and HCT116 would be highly predictive of each other, since they are similar cell types.  And it is reasonable that HeLa is very difficult to predict, since the karyotype of HeLa is completely haywire.  But what we're missing is the metadata, such as the specific experimental design and who prepared the libraries.  In my experience, such details are crucial to evaluating the quality of a sequencing-based experiment.  When researchers outside the organization use a publicly available ML tool, the person preparing the experiment will (with exceedingly high probability) be new to the tool, that is, someone whose experiments the model has never seen.