## Exercise 2: transcription factor binding prediction (with hyperparameter optimization)

In this exercise, we will train multiple models, which gives us a more robust estimate of performance. We will also compare the models in a cross-validation setting to find the best one.

The data is the same as in the previous exercise: a list of DNA sequences, each annotated with 1 if the sequence binds the transcription factor and 0 if it does not.

In [None]:
# run this cell only on Google Colab to fetch all the files from GitHub

!git clone https://github.com/uio-bmi/machine_learning_in_comp_bio_exercises.git
!mv ./machine_learning_in_comp_bio_exercises/{.,}* ./
!rm -r ./machine_learning_in_comp_bio_exercises

In [None]:
# import packages

from sklearn.model_selection import KFold

from scripts.util import load_data_exercise_2, load_test_dataset_ex2, encode_kmer, print_dataset_info, train_logistic_regression, assess_model, make_folder

%load_ext autoreload
%autoreload 2

In [None]:
dataset = load_data_exercise_2()
print_dataset_info(dataset)

## Task 1: Setting up cross-validation 
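Before filling in the skeleton, it can help to see what `KFold` actually returns. The following is a minimal sketch on six toy indices (an assumption for illustration, not the exercise dataset); it only shows how the train and test indices are produced for each split.

```python
from sklearn.model_selection import KFold

# Toy stand-in for the dataset indices: 6 examples instead of the real data.
toy_indices = list(range(6))

# Without shuffle=True, KFold cuts the indices into consecutive blocks.
k_fold = KFold(n_splits=2)

splits = []
# enumerate(..., start=1) numbers the splits without a manual counter.
for split_number, (train_idx, test_idx) in enumerate(k_fold.split(toy_indices), start=1):
    splits.append((train_idx.tolist(), test_idx.tolist()))
    print(f"split {split_number}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Every example ends up in the test set exactly once. If the rows of a dataset happen to be ordered by label, consecutive blocks can be unbalanced; passing `shuffle=True` together with a fixed `random_state` is one way to avoid that.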

In [None]:
# get indices of examples in the dataset

indices = list(range(dataset.shape[0]))
print(indices[:10])

In [None]:
# make a folder to store temporary results

result_path = "./exercise_2_output/"
make_folder(result_path)

# initialize a list to store performances

performances = []

# k-fold cross-validation setup

k_fold = KFold(n_splits=2)
current_split = 1  # tracks which split is currently being processed

for train_indices, test_indices in k_fold.split(indices):
    
    # split the data
    
    train_dataset = dataset.iloc[train_indices, :]
    test_dataset = dataset.iloc[test_indices, :]
    
    print_dataset_info(train_dataset, test_dataset)
    
    # TODO: how to encode the data?
    
    # encoded_train, train_labels, feature_names = ?
    # encoded_test, test_labels, _ = ?
    
    # TODO: how to train and assess a model?
    
    # logistic_regression = ?
    
    current_split += 1 # go to next split
    
print(performances)

# what is the expected performance?

## Task 2: Comparing different hyperparameters

What if we were interested in different subsequence lengths (different values of k)? How would we compare them? Modify the skeleton below to obtain performance estimates for two different values of k and compare them.
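The comparison can be sketched end to end on synthetic data. Everything in this block is an assumption made for illustration: `kmer_frequencies` is a hypothetical hand-written encoder (it is not `encode_kmer` from `scripts.util`, whose signature is not shown here), the sequences are random with a planted `GATA` motif in the positive class, and the chosen values of k are arbitrary.

```python
from itertools import product

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold


def kmer_frequencies(sequences, k):
    """Hypothetical encoder: relative frequency of each possible k-mer per sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    column = {kmer: i for i, kmer in enumerate(kmers)}
    encoded = np.zeros((len(sequences), len(kmers)))
    for row, seq in enumerate(sequences):
        n_windows = len(seq) - k + 1
        for start in range(n_windows):
            encoded[row, column[seq[start:start + k]]] += 1
        encoded[row] /= n_windows  # counts -> frequencies
    return encoded, kmers


# Synthetic data (assumption): random 20-mers, positives contain a planted motif.
rng = np.random.default_rng(0)
sequences, labels = [], []
for i in range(200):
    seq = "".join(rng.choice(list("ACGT"), size=20))
    label = i % 2
    if label == 1:
        pos = rng.integers(0, 17)  # motif start, so that it fits in the sequence
        seq = seq[:pos] + "GATA" + seq[pos + 4:]
    sequences.append(seq)
    labels.append(label)
labels = np.array(labels)

k_values = [2, 4]
performances = {f"k_{k}": [] for k in k_values}

# The encoding is computed per sequence and learns nothing from the data,
# so encoding once before splitting does not leak test information.
encodings = {k: kmer_frequencies(sequences, k)[0] for k in k_values}

k_fold = KFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, test_idx in k_fold.split(sequences):
    for k in k_values:
        model = LogisticRegression(max_iter=1000)
        model.fit(encodings[k][train_idx], labels[train_idx])
        predictions = model.predict(encodings[k][test_idx])
        performances[f"k_{k}"].append(accuracy_score(labels[test_idx], predictions))

# Compare the values of k by their mean accuracy across splits.
for name, accuracies in performances.items():
    print(name, accuracies, "mean:", np.mean(accuracies))
```

Both values of k are evaluated on the same splits, so differences in mean accuracy reflect the encoding rather than a different partition of the data.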


In [None]:
# make a folder to store temporary results

result_path = "./exercise_2_output_hp/"
make_folder(result_path)

# TODO: provide a list of k values to test for
k_values = []

# performance measures could be stored in the format:
# {"k_5": [accuracy_split1, accuracy_split2], "k_6": [accuracy_split1, accuracy_split2]}
performances = {f"k_{k}": [] for k in k_values}

# logistic regression trained models will be stored here:
log_reg_models = []

# k-fold cross validation setup

k_fold = KFold(n_splits=2)

for train_indices, test_indices in k_fold.split(indices):
    
    # split the data
    
    train_dataset = dataset.iloc[train_indices, :]
    test_dataset = dataset.iloc[test_indices, :]
    
    print_dataset_info(train_dataset, test_dataset)
    
    for k in k_values:
    
        # TODO: how to encode the data?

        # encoded_train, train_labels, feature_names = ?
        # encoded_test, test_labels, _ = ?

        # TODO: how to train and test a model?

        # logistic_regression = ?
        
        # TODO: how to check the accuracy? (hint: sklearn has accuracy_score function -- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)
        
        acc = 0
        
        performances[f"k_{k}"].append(acc)
    
print(performances)

# which k is better? (hint: which k has higher accuracy on average?)

### Optional: what if we wanted to compare different k values, but also compare different ML methods?

Try to set up CV with 2 different k values for k-mer frequency encoding and with 2 different ML methods, e.g., logistic regression and random forest with default hyperparameters. Alternatively, try different hyperparameter values for one model, e.g., vary the C parameter of the logistic regression model (in scikit-learn, C is the inverse of the regularization strength, so smaller values mean stronger regularization).
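One possible way to organize such a comparison is to keep the candidate models in a dictionary and score each of them on the same folds. The sketch below uses a synthetic feature matrix from `make_classification` as a stand-in for the k-mer encoded data (an assumption), and the candidate names and C values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for an encoded dataset (assumption, not the exercise data).
X, y = make_classification(n_samples=200, n_features=16, random_state=0)

# Candidate models: two values of C for logistic regression and a random forest.
candidates = {
    "log_reg_C_0.1": LogisticRegression(C=0.1, max_iter=1000),
    "log_reg_C_1.0": LogisticRegression(C=1.0, max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}

# A fixed random_state makes every candidate see the same splits.
k_fold = KFold(n_splits=2, shuffle=True, random_state=0)

mean_accuracy = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=k_fold, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(name, scores, "mean:", scores.mean())

# The candidate with the highest mean accuracy across folds.
best = max(mean_accuracy, key=mean_accuracy.get)
print("best candidate:", best)
```

To also vary k, the same loop could be nested inside a loop over encodings, with the dictionary keys extended to include the value of k.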

## Task 3: How well does the selected best model perform on the new test dataset?

Assess the performance of the chosen best model on a new dataset and compare it with the performances obtained during cross-validation. Is there a difference?
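The general shape of this step can be sketched as follows: estimate performance with cross-validation on the training data, refit the chosen configuration on all of the training data, then score it once on data that played no part in model selection. The data here is synthetic and the held-out portion is carved out with `train_test_split` (both assumptions); in the exercise the new data comes from `load_test_dataset_ex2()` and would need to be encoded with the same k as the training data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for encoded training data and encoded new data (assumption).
X, y = make_classification(n_samples=300, n_features=16, random_state=0)
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Assume this configuration won the comparison during cross-validation.
best_model = LogisticRegression(max_iter=1000)

# Cross-validated estimate on the training data only.
# With an integer cv and a classifier, scikit-learn uses stratified folds.
cv_scores = cross_val_score(best_model, X_train, y_train, cv=2, scoring="accuracy")

# Refit on all training data, then evaluate once on the unseen data.
best_model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_new, best_model.predict(X_new))

print("cross-validation accuracy per fold:", cv_scores)
print("mean cross-validation accuracy:", cv_scores.mean())
print("accuracy on the new data:", test_accuracy)
```

A small gap between the two numbers is expected from sampling noise alone; a large drop on the new data may suggest that the new data differs from the training data or that model selection overfit the cross-validation folds.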

In [None]:
test_dataset = load_test_dataset_ex2()
print_dataset_info(test_dataset)

In [None]:
# TODO: assess the performance of the model

# what are the steps here?

performance = None