# Exercise 1: transcription factor binding prediction

In this exercise, we will train machine learning models that predict whether a DNA sequence contains a transcription factor binding site, and assess their performance.

Transcription factors (TFs) are proteins that bind to DNA and influence gene regulation. Predicting whether they will bind could help us understand the underlying biology and allow us to predict binding for new DNA sequences that have not been analyzed experimentally.

## Data

We will use a dataset for the transcription factor USF1. The data consist of sequences and a label for each sequence (1 = the TF binds to the sequence, 0 = it does not bind).

The data are located in the `data` folder of this repository.

## Import necessary libraries and functions

In [None]:
# run this cell if you are on Google Colab, to get all the files from GitHub
#
# !git clone https://github.com/uio-bmi/machine_learning_in_comp_bio_exercises.git
# !mv ./machine_learning_in_comp_bio_exercises/{.,}* ./
# !rm -r ./machine_learning_in_comp_bio_exercises

In [None]:
from scripts.util import load_data_exercise_1, encode_kmer, print_dataset_info, train_logistic_regression, assess_model, make_folder, in_notebook

%load_ext autoreload
%autoreload 2

In [None]:
# define where to store the result

result_path = "./exercise_1_output_usf1/"
make_folder(result_path)

## Load the data and print basic information

In [None]:
dataset = load_data_exercise_1()

print_dataset_info(dataset)

In [None]:
# split the data into training and test sets:
# the first 400 rows are used for training, the last 100 for testing

train_dataset, test_dataset = dataset.head(400), dataset.tail(100)

print_dataset_info(train_dataset, test_dataset)

## Encode the data as k-mer frequencies

In [None]:
# define the parameters

k = 5

# encode the data

encoded_train, train_labels, feature_names = encode_kmer(k, train_dataset, learn_model=True, path=result_path)

encoded_test, test_labels, _ = encode_kmer(k, test_dataset, learn_model=False, path=result_path)


Look at the plots showing average k-mer frequencies in positive vs. negative examples: are there any differences?
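To make the encoding step more concrete, here is a minimal sketch of what a k-mer frequency encoding computes for a single sequence. It is an illustration only and not the implementation behind `encode_kmer`; the function name `kmer_frequencies`, the example sequence, and the choice to normalise counts by the number of windows are all assumptions.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k, alphabet="ACGT"):
    """Return the relative frequency of every possible k-mer in `sequence`.

    Hypothetical helper for illustration; not part of scripts.util.
    """
    # All len(alphabet)**k possible k-mers in a fixed order, so that every
    # sequence is mapped to a feature vector of the same length.
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]

    # Slide a window of width k over the sequence and count each k-mer.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

    # Normalise counts to frequencies (guard against sequences shorter than k).
    total = max(len(sequence) - k + 1, 1)
    return {kmer: counts[kmer] / total for kmer in all_kmers}


# Made-up toy sequence, not taken from the USF1 dataset.
freqs = kmer_frequencies("CACGTGAC", k=2)
print(len(freqs))   # number of features for k = 2
print(freqs["AC"])  # frequency of the 2-mer "AC"
```

Each sequence becomes one row of such frequencies, and the column names (here the dictionary keys) play the role of `feature_names` in the cell above.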


## Train logistic regression

In this step, run the following code and try to answer the questions:
        
- How good is our classifier? 
- What can we see from the confusion matrix?

In [None]:

logistic_regression = train_logistic_regression(encoded_train, train_labels, C=1)

assess_model(logistic_regression, encoded_train, train_labels, encoded_test, test_labels, feature_names, result_path)
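When reading the confusion matrix, it can help to see how its four cells relate to common summary metrics. The sketch below is a plain-Python illustration with made-up labels; it is not what `assess_model` runs internally, and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(true_labels, predicted_labels):
    """Count true/false positives and negatives for binary labels (1 = binds)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == 1 and pred == 1:
            tp += 1  # binding site, predicted as binding
        elif truth == 0 and pred == 1:
            fp += 1  # no binding site, but predicted as binding
        elif truth == 0 and pred == 0:
            tn += 1  # no binding site, predicted as non-binding
        else:
            fn += 1  # binding site that the model missed
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


# Made-up labels for illustration only, not results from this exercise.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

c = confusion_counts(y_true, y_pred)
accuracy = (c["tp"] + c["tn"]) / len(y_true)
recall = c["tp"] / (c["tp"] + c["fn"])     # fraction of real binding sites found
precision = c["tp"] / (c["tp"] + c["fp"])  # fraction of predicted sites that are real
print(c, accuracy, recall, precision)
```

Accuracy alone can hide which kind of error dominates, so it is worth checking whether the model mostly misses binding sites (false negatives) or mostly reports sites that are not there (false positives).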


Look at the last plot, which shows the largest model coefficients.
    
- What does this mean for our model? 
- What k-mers is it detecting? 
- Can we interpret this somehow?
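As a rough aid for these questions, the sketch below ranks features by the absolute value of their coefficient. In logistic regression, a positive coefficient pushes the prediction towards class 1 (binding) as the feature value grows, and a negative coefficient pushes it towards class 0. The k-mers and coefficient values here are invented for illustration, and `top_features` is a hypothetical helper rather than a function from `scripts.util`.

```python
def top_features(feature_names, coefficients, n=3):
    """Return the n (name, coefficient) pairs with the largest absolute coefficient."""
    paired = zip(feature_names, coefficients)
    return sorted(paired, key=lambda item: abs(item[1]), reverse=True)[:n]


# Invented k-mers and coefficients, not output of the trained model.
names = ["AAAAA", "ACGTA", "GGGCC", "TTAGC", "CCCCC"]
coefs = [0.02, 1.30, -0.75, 1.10, -0.05]

ranked = top_features(names, coefs, n=3)
for kmer, coef in ranked:
    direction = "towards binding" if coef > 0 else "towards non-binding"
    print(f"{kmer}: {coef:+.2f} ({direction})")
```

K-mers near the top of such a ranking that overlap with each other may be fragments of one longer motif, which is one way to start interpreting the plot.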

### Question:

How long should a subsequence be to discover the motifs? Which value of k works best? How could you try different values of k?
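One thing to keep in mind when varying k is how quickly the feature space grows. The sketch below tabulates, for several values of k, the number of possible k-mers over the DNA alphabet (4^k) and the number of k-mer windows in one sequence. The sequence length of 100 is a hypothetical value chosen for illustration, not the length of the sequences in this dataset, and `kmer_feature_stats` is a hypothetical helper.

```python
def kmer_feature_stats(k, sequence_length):
    """Return (number of possible k-mers, number of windows in one sequence)."""
    n_features = 4 ** k                         # A, C, G, T at each of k positions
    n_windows = max(sequence_length - k + 1, 0)  # sliding windows of width k
    return n_features, n_windows


sequence_length = 100  # hypothetical sequence length, for illustration only

stats = {k: kmer_feature_stats(k, sequence_length) for k in range(3, 9)}
for k, (n_features, n_windows) in stats.items():
    print(f"k={k}: {n_features} possible k-mers, {n_windows} windows per sequence")
```

With larger k, each sequence contains far fewer windows than there are possible k-mers, so the encoded data become sparse and the model has more coefficients to fit from the same 400 training examples. In the notebook itself, the most direct way to experiment is to change `k` in the encoding cell and rerun that cell and the ones below it.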