# 11-mer
The goal of this notebook is to reproduce the 11-mer model.
While investigating the plotting notebook in the original repository, it was found that the 11-mer model is in fact the best Markov model. The config file for the best Markov model in the results folder shows that it is a bidirectional Markov model of order 5.
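
One way to read the link between an "11-mer model" and a bidirectional Markov model of order 5: the model conditions on 5 bases to the left and 5 bases to the right of a position, so each prediction looks at a window of 5 + 1 + 5 = 11 bases. The following is a minimal sketch of that idea on toy data. The function `fit_center_model`, its arguments, and the toy sequences are hypothetical and are not taken from the repository's `MarkovModel`, whose internals may differ.

```python
from collections import Counter, defaultdict

ORDER = 5                 # flanking bases on each side (from the config)
WINDOW = 2 * ORDER + 1    # 5 + 1 + 5 = 11 bases per window


def fit_center_model(sequences, order=ORDER, pseudocount=0.1, alphabet="ACGT"):
    """Count the center base of every window and return a predictor.

    Hypothetical helper for illustration only.
    """
    width = 2 * order + 1
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - width + 1):
            window = seq[i:i + width]
            left, center, right = window[:order], window[order], window[order + 1:]
            if center in alphabet:
                counts[(left, right)][center] += 1

    def predict(left, right):
        # Smoothed distribution over the center base given both flanks.
        seen = counts.get((left, right), Counter())
        total = sum(seen[b] for b in alphabet) + pseudocount * len(alphabet)
        return {b: (seen[b] + pseudocount) / total for b in alphabet}

    return predict


# Toy data, not the notebook's sequences.
predict = fit_center_model(["ACGTACGTACGTACGT", "ACGTACGTACGAACGT"])
probs = predict("ACGTA", "GTACG")
unseen = predict("AAAAA", "AAAAA")
```

With the pseudocount, a flank pair that never occurred in the training data falls back to a uniform distribution instead of an undefined one.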

In this notebook no splitting of the data will be performed, i.e. the whole dataset will be used for training and testing.

```
CONFIG
├── datamodule
│   └── _target_: src.datamodules.motif_datamodule.MotifDataModule                                                                         
│       _recursive_: false                                                                                                                 
│       dataset:                                                                                                                           
│         _target_: src.datamodules.dna_datasets.CSVDataset                                                                                
│       data:                                                                                                                              
│         train_file: /s/project/semi_supervised_multispecies/all_fungi_reference/fungi/Annotation/Sequences/AAA_Concatenated/Scer_half_lif
│         test_file: /s/project/semi_supervised_multispecies/all_fungi_reference/fungi/Annotation/Sequences/AAA_Concatenated/Scer_half_life
│         seq_position: UTR3_seq                                                                                                           
│       transforms:                                                                                                                        
│         _target_: src.datamodules.sequence_encoders.SequenceDataEncoder                                                                  
│         seq_len: 300                                                                                                                     
│         total_len: 303                                                                                                                   
│         mask_rate: 0.1                                                                                                                   
│       test_transforms:                                                                                                                   
│         _target_: src.datamodules.sequence_encoders.RollingMasker                                                                        
│         mask_stride: 50                                                                                                                  
│         frame: 0                                                                                                                         
│       batched_dataset: true                                                                                                              
│       batch_size: 1                                                                                                                      
│       train_val_test_split:                                                                                                              
│       - 55000                                                                                                                            
│       - 5000                                                                                                                             
│       - 10000                                                                                                                            
│       num_workers: 16                                                                                                                    
│       pin_memory: true                                                                                                                   
│       persistent_workers: true                                                                                                           
│                                                                                                                                          
├── model
│   └── _target_: src.models.baseline.markov_model.MarkovModel                                                                             
│       halflife_df_path: /s/project/semi_supervised_multispecies/all_fungi_reference/fungi/Annotation/Sequences/AAA_Concatenated/Scer_half
│       markov_matrix_path: /s/project/semi_supervised_multispecies/Downstream/NearestNeighbour/markov_bimatrix_all.npy                    
│       order: 5                                                                                                                           
│       bidirectional: true                                                                                                                
│                                                                                                                                          
├── callbacks
│   └── {}                                                                                                                                 
│                                                                                                                                          
├── trainer
│   └── _target_: pytorch_lightning.Trainer                                                                                                
│       gpus: 1                                                                                                                            
│       min_epochs: 1                                                                                                                      
│       max_epochs: 50                                                                                                                     
│       resume_from_checkpoint: null                                                                                                       
│                                                                                                                                          
├── original_work_dir
│   └── /data/nasif12/home_if12/gankin/motif-modeling                                                                                      
├── data_dir
│   └── /s/project/semi_supervised_multispecies/all_fungi_reference/fungi/Annotation/Sequences/AAA_Concatenated/                           
├── print_config
│   └── True                                                                                                                               
├── ignore_warnings
│   └── True                                                                                                                               
├── seed
│   └── None                                                                                                                               
├── name
│   └── default                                                                                                                            
├── ckpt_path
│   └── /s/project/semi_supervised_multispecies/dgbackup/outputs/outputs/2022-11-10/12-29-46/motif-training/3vvsocva/checkpoints/epoch=49-s
├── base_ssm
│   └── /s/project/semi_supervised_multispecies/dgbackup/outputs/outputs/2022-07-29/15-54-24/motif-training/3dvk81nk/checkpoints/epoch=49-s
├── base_ssm_frame
│   └── /s/project/semi_supervised_multispecies/dgbackup/outputs/outputs/2022-07-30/19-35-37/motif-training/1iqkna36/checkpoints/epoch=49-s
├── spec_sacc_schizzo_out
│   └── /s/project/semi_supervised_multispecies/dgbackup/outputs/outputs/2022-10-15/14-51-12/motif-training/1yesuk16/checkpoints/epoch=49-s
├── spec_on_all
│   └── /s/project/semi_supervised_multispecies/dgbackup/outputs/outputs/2022-11-03/22-02-00/motif-training/20p1vu1v/checkpoints/epoch=49-s
└── spec_sacc_out
    └── /s/project/semi_supervised_multispecies/dgbackup/outputs/outputs/2022-11-10/12-29-46/motif-training/3vvsocva/checkpoints/epoch=49-s
```

# Imports

In [1]:
%load_ext autoreload
%autoreload 2

import sys, os
sys.path.insert(0, '../..')

import gc
import pickle  # used below to cache dataframes and k-mer counts
import pysam
import pandas as pd
import re
import torch
from torch.utils.data import DataLoader, Dataset
import matplotlib.pyplot as plt
import numpy as np


import helpers.train_eval as train_eval    #train and evaluation
import helpers.misc as misc                #miscellaneous functions

import encoding_utils.sequence_encoders as sequence_encoders
import encoding_utils.sequence_utils as sequence_utils
from models.spec_dss import DSSResNet, DSSResNetEmb, SpecAdd
from models.baseline.markov_model import *

from Bio import SeqIO

# Data

In [3]:
# load the train data if it exists
file_path = 'train_df.pickle'
if os.path.exists(file_path):
    with open(file_path, 'rb') as f:
        train_df = pickle.load(f)
else:
    # load the fasta file and select the train data
    fasta_file = "../../../test/Homo_sapiens_3prime_UTR.fa"
    sequences = []
    for s in SeqIO.parse(fasta_file, "fasta"):
        sequences.append(str(s.seq).upper())
    # get the train fraction
    val_fraction = 0.1
    N_train = int(len(sequences)*(1-val_fraction))
    train_data = sequences[:N_train]
    # store it as a dataframe
    train_df = pd.DataFrame({'3-UTR':train_data})
    with open(file_path, 'wb') as f:
        pickle.dump(train_df, f)
train_df

| | 3-UTR |
|---|---|
| 0 | ATCTTATATAACTGTGAGATTAATCTCAGATAATGACACAAAATAT... |
| 1 | GGTTGCCGGGGGTAGGGGTGGGGCCACACAAATCTCCAGGAGCCAC... |
| 2 | GGCAGCCCATCTGGGGGGCCTGTAGGGGCTGCCGGGCTGGTGGCCA... |
| 3 | CCCACCTACCACCAGAGGCCTGCAGCCTCCCACATGCCTTAAGGGG... |
| 4 | TGGCCGCGGTGAGGTGGGTTCTCAGGACCACCCTCGCCAAGCTCCA... |
| ... | ... |
| 16315 | CCGTATGAAGATGTCCTGTTAAATTTACAACACTAACGATGTAGAC... |
| 16316 | ACACACCCCCGAAAAACACAAGACCGACCCAAAATCTAGAGGAAAG... |
| 16317 | AGAAGCTAAAAGGAAAGAAAATAAATCTATCAAAATTACCCTAAAC... |
| 16318 | CTTCACTTTTGGGCTCAAGGACTGTGTGAACCAACAAGGGGCCAGT... |


In [4]:
# load the test data if it exists
file_path = 'test_df.pickle'
if os.path.exists(file_path):
    with open(file_path, 'rb') as f:
        test_df = pickle.load(f)
else:
    # load the fasta file and select the test data
    fasta_file = "../../../test/Homo_sapiens_3prime_UTR.fa"
    sequences = []
    for s in SeqIO.parse(fasta_file, "fasta"):
        sequences.append(str(s.seq).upper())
    # the last val_fraction of the sequences is held out as test data
    val_fraction = 0.1
    N_train = int(len(sequences)*(1-val_fraction))
    test_data = sequences[N_train:]
    # store it as a dataframe
    test_df = pd.DataFrame({'3-UTR':test_data})
    with open(file_path, 'wb') as f:
        pickle.dump(test_df, f)
test_df

| | 3-UTR |
|---|---|
| 0 | CCCCCAGAACCAGTGGGACAAACTGCCTCCTGGAGGTTTTTAGAAA... |
| 1 | TATTGAGCCCTCAGAGAGTCCACAGTCCCTCCTCTCAGTTCAGTCT... |
| 2 | TATTCATTCCAACTGCTGCCCCTCTGTCTGCCTGGCTGAGATGCAT... |
| 3 | AACGGTGCGTTTGGCCAAAAAGAATCTGCATTTAGCACAAAAAAAA... |
| 4 | TAGTTTCTAACTGTCGGACCCGTCTGTAAACCAAGGACTATGAATA... |
| ... | ... |
| 1809 | AGCAAGCATTGAAAATAATAGTTATTGCATACCAATCCTTGTTTGC... |
| 1810 | AGCAAGCATTGAAAATAATAGTTATTGCATACCAATCCTTGTTTGC... |
| 1811 | GCCTACTTCATCTCAGGACCCGCCCAAGAGTGGCCGCGGCTTTGGG... |
| 1812 | TTGTCAGTCTGTCTGCTCAGGACACAAGAACTAAGGGGCAACAAAT... |
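
The two data cells above parse the same FASTA file and apply the same cut (`val_fraction = 0.1`) independently. A minimal sketch of doing the split once is shown here, assuming a plain list of sequences. `split_sequences` is a hypothetical helper and not part of the repository.

```python
def split_sequences(sequences, val_fraction=0.1):
    """Return (train, test), holding out the last `val_fraction` of the list.

    Hypothetical helper; mirrors the slicing used in the cells above.
    """
    n_train = int(len(sequences) * (1 - val_fraction))
    return sequences[:n_train], sequences[n_train:]


# Toy data, not the notebook's sequences.
toy = ["SEQ%d" % i for i in range(10)]
train_part, test_part = split_sequences(toy)
```

Because both parts come from one call, they cannot overlap and together they always cover the full list.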


In [8]:
sequences = list(test_df["3-UTR"])
with open("../../data/exclude_motifs.pickle", "rb") as f:
    config = pickle.load(f)
print(config)


{'fixed_length': 5, 'exclude_random': [('A1CF', 'AATTA', 0), ('BOLL', 'TTTTT', 1), ('CELF1', 'TATGT', 2), ('CNOT4', 'ACACA', 3), ('DAZAP1', 'ATATA', 4), ('EIF4G2', 'GTTGC', 5), ('ESRP1', 'GGGGG', 6), ('FUBP3', 'TATAT', 7), ('HNRNPA0', 'TATAG', 8), ('HNRNPD', 'TATTA', 9), ('HNRNPDL', 'TAATT', 10), ('HNRNPK', 'GCCCA', 11), ('KHDRBS2', 'ATAAA', 12), ('KHSRP', 'TGTAT', 13), ('MBNL1', 'CGCTT', 14), ('MSI1', 'TAGTT', 15), ('NOVA1', 'TTCAT', 16), ('NUPL2', 'AAAAA', 17), ('PCBP1', 'GCCCC', 18), ('PCBP2', 'CCCCC', 19), ('PCBP4', 'ATCCC', 20), ('PRR3', 'ATAAG', 21), ('PTBP3', 'TTTCT', 22), ('RBFOX2', 'GCATG', 23), ('RBM22', 'ACCGG', 24), ('RBM24', 'GTGTG', 25), ('RBM4', 'GCGCG', 26), ('RBM41', 'TACTT', 27), ('RBM45', 'ACGCA', 28), ('RBM47', 'AATCA', 29), ('RBM6', 'CGTCC', 30), ('RC3H1', 'ATATT', 31), ('SF1', 'TAACA', 32), ('SFPQ', 'TGTAA', 33), ('SNRPA', 'TGCAC', 34), ('SRSF10', 'AGCAG', 35), ('SRSF11', 'AGGGG', 36), ('SRSF8', 'GCAGC', 37), ('SRSF9', 'AGGAG', 38), ('TARDBP', 'GTATG', 39), ('TRA2

# Model

In [7]:
# training here refers to calculating the 11mer frequencies
file_path = 'kmer_train.pickle'
if os.path.exists(file_path):
    with open(file_path, 'rb') as f:
        kmer_train = pickle.load(f)
else:
    # count all k-mers up to length 11 (pseudocount of 0.1)
    kmer_train = KmerCount(11, pseudocount=0.1)
    kmer_train.compute_counts(train_df['3-UTR'])
    # bare expression kept from the original cell; nothing is displayed inside this branch
    kmer_train.kmer_counts_dict

    # cache the fitted counter as a pickle file
    with open(file_path, 'wb') as f:
        pickle.dump(kmer_train, f)

100%|██████████| 16320/16320 [06:56<00:00, 39.15it/s] 
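
To illustrate what "training" amounts to here, a minimal sketch of k-mer counting with a pseudocount follows. `count_kmers` and `smoothed_frequency` are hypothetical helpers written for this illustration. They are not the `KmerCount` API, and the way `KmerCount` applies its pseudocount is an assumption.

```python
from collections import Counter


def count_kmers(sequences, max_k):
    """Count every k-mer of length 1..max_k with a sliding window."""
    counts = {k: Counter() for k in range(1, max_k + 1)}
    for seq in sequences:
        for k in range(1, max_k + 1):
            for i in range(len(seq) - k + 1):
                counts[k][seq[i:i + k]] += 1
    return counts


def smoothed_frequency(counts, kmer, pseudocount=0.1, alphabet_size=4):
    """Relative frequency of `kmer` among all k-mers of the same length.

    Assumption: the pseudocount is added once per possible k-mer.
    """
    k = len(kmer)
    total = sum(counts[k].values()) + pseudocount * alphabet_size ** k
    return (counts[k][kmer] + pseudocount) / total


# Toy data, not the notebook's sequences.
toy_counts = count_kmers(["ACGTAC", "CGT"], max_k=3)
freq_seen = smoothed_frequency(toy_counts, "CGT")
freq_unseen = smoothed_frequency(toy_counts, "TTT")
```

The pseudocount keeps unseen k-mers at a small non-zero frequency, which matters once the counts are turned into conditional probabilities.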


In [8]:
# initialize a bidirectional markov model of order 5
markov_model = MarkovModel(
    kmer_train,
    markov_matrix_path="markov_model.npy",
    order=5,
    bidirectional=True,
    test_df_path='test_df.pickle'
)

In [9]:
# calculate the markov matrix using the 11mer counts
markov_model.model.compile_from_counts()

  self.markov_matrix[order,:,:] = self.markov_matrix[order,:,:]/np.sum(self.markov_matrix[order,:,:],axis=1)[:,np.newaxis]
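
The line echoed above divides every row of the count matrix by its row sum. With plain division, a row that sums to zero becomes NaN. A minimal sketch of a zero-safe row normalization is given below, assuming a 2-D count matrix. `normalize_rows` is a hypothetical helper and not part of the repository.

```python
import numpy as np


def normalize_rows(counts):
    """Divide each row by its sum; rows that sum to zero stay all-zero."""
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)
    out = np.zeros_like(counts)
    # Only divide where the row sum is positive; other entries keep the 0 from `out`.
    np.divide(counts, row_sums, out=out, where=row_sums > 0)
    return out


# Toy matrix: one populated context and one context without any counts.
toy = np.array([[2.0, 1.0, 1.0, 0.0],
                [0.0, 0.0, 0.0, 0.0]])
probs = normalize_rows(toy)
```

Leaving empty rows at zero is one possible choice. A uniform row (0.25 per base) would be another, and which one fits depends on how the downstream evaluation treats unseen contexts.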


In [10]:
# generate the result files needed for plotting using the test data
markov_model.test()