# Speed Machine Learning Programming Challenge

**Dataset**: a set of DNA sequences, each labeled 0 or 1 to indicate whether a protein (transcription factor) will bind the sequence

**Aim**: build a classifier that will predict the label for new DNA sequences

**Time**: 15 minutes!

**Competition**: find the ML algorithm and hyperparameter settings that give the highest accuracy on independent test data

## Importing the data

The data is located in the `data.fa` file. Each sequence is preceded by a line starting with `>` that holds the sequence label (0 or 1, indicating whether the transcription factor will bind the DNA sequence), followed by a line with the DNA sequence itself:
```
> 1
AGCGAGGCAGGTGCGGTCACGTGACCCGGCGGCGCTGCGGGGCAGCGGCCATTTTGCGGGGCGGCC
> 0
AGCGAGGGCGCTCGGAGTGCGACGTTTTGGCACCAGGCGGGGCGCACGGCATTGCCAAGCGGCCGC
```

The `load_data` function imports this data into a pandas DataFrame.
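
The implementation of `load_data` lives in `util` and is not shown here. As a rough illustration of the file format only, a parser might look like the sketch below. The function name `parse_labeled_fasta` is hypothetical, and the keys `sequence` and `label` are assumed to match the DataFrame columns used later in this notebook.

```python
def parse_labeled_fasta(text: str) -> list:
    """Turn '> label' header lines followed by sequence lines into records."""
    records = []
    label = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # header line: everything after '>' is the 0/1 label
            label = int(line[1:].strip())
        else:
            if label is None:
                raise ValueError("sequence line without a preceding '>' header")
            records.append({"sequence": line, "label": label})
            label = None  # each header is assumed to describe exactly one sequence
    return records


example_text = """> 1
AGCGAGGCAGGTGCGG
> 0
AGCGAGGGCGCTCGGA
"""
records = parse_labeled_fasta(example_text)
print(records)
```

Passing `records` to `pd.DataFrame(records)` would give a table with one row per sequence.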

In [None]:
from util import load_data

train_dataset = load_data("data.fa")
test_dataset = load_data("test_data.fa")

train_dataset.head(10)

## Defining ML methods and encodings

You can use any suitable ML method from the scikit-learn library, which is already installed in this environment.

One encoding is already defined and can be used right away. It represents a letter in a sequence as a vector of length 4 (4 possible letters: A, C, G, T) with one 1 and 0s elsewhere to indicate which letter is present at the given position. To represent the full sequence, these vectors are just concatenated one after another. For example, for sequence `AG` this representation looks like this:

```
[1, 0, 0, 0, 0, 0, 1, 0]
```

The first four digits encode the letter A, the next four the letter G.
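
To make the representation concrete, here is a minimal sketch of one-hot encoding a single sequence. It is not the `encode_onehot` function from `util`. The name `one_hot_sequence` and the local `NUCLEOTIDES` list (with the letter order A, C, G, T) are assumptions made for illustration.

```python
NUCLEOTIDES = ["A", "C", "G", "T"]  # assumed letter order, matching the example above


def one_hot_sequence(sequence: str) -> list:
    """Concatenate one length-4 indicator vector per letter of the sequence."""
    encoded = []
    for letter in sequence:
        vector = [0] * len(NUCLEOTIDES)
        vector[NUCLEOTIDES.index(letter)] = 1  # mark which letter is present
        encoded.extend(vector)
    return encoded


print(one_hot_sequence("AG"))  # [1, 0, 0, 0, 0, 0, 1, 0]
```

A sequence of length n therefore becomes a vector of length 4n, so all sequences need the same length to form a rectangular feature matrix.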

In [None]:
from util import encode_onehot

# encodings - each call encodes the data right away and returns a dictionary with encoded_data, labels, and feature_names

one_hot_train_data = encode_onehot(train_dataset)
one_hot_test_data = encode_onehot(test_dataset)

print(one_hot_train_data['encoded_data'])

## Selecting the optimal ML model and encoding

Find the best ML model and encoding combination! Use accuracy as the performance measure, e.g., with scikit-learn's `accuracy_score` function. Its two main parameters are `y_true` (true labels) and `y_pred` (predicted labels).
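
As a quick illustration of what accuracy measures, the sketch below scores a toy set of labels (made up for this example) and compares the result with a manual count of matching predictions.

```python
from sklearn.metrics import accuracy_score

# toy labels, invented for illustration only
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

# accuracy is the fraction of predictions that match the true labels
manual_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = accuracy_score(y_true=y_true, y_pred=y_pred)

print(score, manual_accuracy)  # 0.75 0.75
```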

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# ML methods - defining ML methods to be used

logistic_regression = LogisticRegression(penalty='l1', C=10, solver='saga', max_iter=500)
random_forest = RandomForestClassifier()

# an example of fitting and evaluating a model on the test set,
# using the logistic regression configured above

logistic_regression.fit(X=one_hot_train_data['encoded_data'], y=one_hot_train_data['labels'])

predictions = logistic_regression.predict(one_hot_test_data['encoded_data'])

score = accuracy_score(y_true=one_hot_test_data['labels'], y_pred=predictions)
print(f"Accuracy is {score}.")
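
The example above scores a single model directly on the test set. When comparing several models or hyperparameter settings, one option is to rank them by cross-validated accuracy on the training data and keep the test set for the final evaluation. The sketch below is only an illustration of that idea: the data is a synthetic stand-in for `one_hot_train_data['encoded_data']` and `one_hot_train_data['labels']`, and the candidate settings are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the encoded training data (assumption, not the real dataset)
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(200, 40))
y_train = (X_train[:, 0] + X_train[:, 1] > 1).astype(int)

# arbitrary candidate models and settings, chosen only for illustration
candidates = {
    "logistic_regression_C=0.1": LogisticRegression(C=0.1, max_iter=500),
    "logistic_regression_C=10": LogisticRegression(C=10, max_iter=500),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

cv_scores = {}
for name, model in candidates.items():
    # 5-fold cross-validation on the training data only
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {cv_scores[name]:.3f}")

best_name = max(cv_scores, key=cv_scores.get)
print(f"Best candidate by cross-validated accuracy: {best_name}")
```

The selected candidate can then be refit on the full training set and scored once on the test data, as in the cell above.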

### Defining a new encoding (optional)



Another encoding that can be used in this setting is based on the frequency of subsequences of a fixed length k (k-mers) within a sequence.

For example, if the original sequence is `AGCGAG`, subsequences of length 3 (3-mers) are:
- `AGC`
- `GCG`
- `CGA`
- `GAG`.

The encoding would then represent each sequence by the frequency of each possible 3-mer. For sequence `AGCGAG` it would be:

```
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.25, 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0.25, 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0.25, 0., 0., 0., 0.25, 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0.]
```

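The first number is the frequency of the 3-mer `AAA`, the second the frequency of `AAC`, the third of `AAG`, and so on, up to the last one, which is the frequency of `TTT`. In the vector above, the four nonzero entries (each 0.25, i.e., one occurrence out of four 3-mers) correspond to `AGC`, `CGA`, `GAG`, and `GCG`.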
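
The calculation for a single sequence can be sketched as follows. This is an illustration rather than a reference solution: the name `kmer_frequencies` and the local `NUCLEOTIDES` list (assumed to be in the order A, C, G, T) are hypothetical, and the function handles one sequence, not a whole dataset.

```python
from itertools import product

NUCLEOTIDES = ["A", "C", "G", "T"]  # assumed letter order


def kmer_frequencies(sequence: str, k: int):
    """Return all possible k-mers (in AAA..TTT order) and their frequencies."""
    if len(sequence) < k:
        raise ValueError("sequence is shorter than k")
    # every possible k-mer, ordered AAA, AAC, AAG, ..., TTT for k=3
    feature_names = ["".join(letters) for letters in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(feature_names, 0)
    n_kmers = len(sequence) - k + 1  # number of overlapping windows
    for start in range(n_kmers):
        counts[sequence[start:start + k]] += 1
    frequencies = [counts[name] / n_kmers for name in feature_names]
    return feature_names, frequencies


names, freqs = kmer_frequencies("AGCGAG", k=3)
nonzero = {name: freq for name, freq in zip(names, freqs) if freq > 0}
print(nonzero)  # {'AGC': 0.25, 'CGA': 0.25, 'GAG': 0.25, 'GCG': 0.25}
```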

Here you can implement a function that encodes the dataset by representing each DNA sequence by the frequencies of its subsequences.

The function should take a pandas DataFrame with columns `sequence` and `label` (as shown above) as input. It should output a dictionary with the following keys:

- `encoded_data` (a matrix where each row represents one sequence and each column holds the frequency of one k-mer),
- `labels` (an array with the label of each sequence), and
- `feature_names` (a list of feature names, e.g., `AAA`, `AAC`, `AAG`, etc.)

In [None]:
import pandas as pd
from util import ALPHABET

# all letters (nucleotides) that can be used in a sequence:
print(ALPHABET)

# define a new encoding as frequencies of subsequences of length k:

def encode_kmer_frequencies(dataset: pd.DataFrame, k: int) -> dict:
    
    # add your code here and fill in the returned object
    
    return {
        'encoded_data': None,
        'labels': None,
        'feature_names': None
    }

In [None]:
from util import test_kmer_encoding

test_kmer_encoding(encode_kmer_frequencies_func=encode_kmer_frequencies)
    