In [1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from IPython.display import display
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.io as pio
pio.templates.default = 'plotly_white'
import logging
import logzero
logzero.loglevel(logging.INFO)

# Move to the data directory

In [2]:
dir_fname = "work/"

In [3]:
import os
os.chdir(dir_fname)

In [4]:
!(ls -l)

total 244512
drwxr-xr-x 2 yoshihiko_s users        38 Sep 11 17:45 datander
drwxr-xr-x 2 yoshihiko_s users      4096 Sep 11 17:59 datruf
-rw-r--r-- 1 yoshihiko_s users       250 Sep 11 17:34 DMEL.db
-rw-r--r-- 1 yoshihiko_s users  70975912 Sep 11 17:46 TAN.DMEL.las
drwxr-xr-x 2 yoshihiko_s users        40 Sep 11 18:33 tmp
-rw-r--r-- 1 yoshihiko_s users 179390870 Sep 11 18:00 tr_reads.pkl


# Input

* `tr_reads.pkl`: Reads with tandem repeats and units detected by datander and datruf

In [5]:
tr_reads_fname = "tr_reads.pkl"

# Output

* `centromere_reads.pkl`: Centromeric reads, which are almost entirely covered by tandem repeat units whose lengths are very frequent in the data

# How to run

In [6]:
from vca.filter_tr_reads import TRReadFilter

## Docstring

In [7]:
TRReadFilter?

```
Init signature:
TRReadFilter(
    tr_reads_fname: str,
    min_ulen: int = 50,
    max_ulen: int = 500,
    min_cover_rate: float = 0.8,
    band_width: int = 5,
    min_density: float = 0.005,
    deviation: float = 0.1,
) -> None
Docstring:     
Class for filtering List[TRRead] into centromeric TR reads by finding peaks of unit lengths,
which should be a sign of centromeric TR units.

The flow of the filtering is as follows:
  - Given: TR reads (= reads w/ TR(s) of any unit length & any copy number)
        => TR-contained reads (= reads contained within TR(s))
        => Centromeric reads (= reads contained within TR(s) and having units of peak lengths)

Before executing run(), It is recomended to adjust the parameters through looking at histograms
of unit length using hist_all_units() and hist_filtered_units() methods inside Jupyter Notebook.
```

## Interactive execution

In [7]:
f = TRReadFilter(tr_reads_fname)

First, look at the histogram of unit lengths:

In [8]:
f.hist_all_units()

Then adjust the parameters of `TRReadFilter` and plot the histograms of all units and of the units that remain after filtering by read cover rate. The shaded regions in the plot are the intervals from which units will be collected. The KDE curve used for peak detection is also drawn:

In [8]:
f.hist_filtered_units()

[I 190911 19:00:52 filter_tr_reads:59] Peak unit lengths: 121 bp, 371 bp
[I 190911 19:00:52 filter_tr_reads:66] Peak intervals: 109.0-133.0 bp, 334.0-408.0 bp


(Here it is clear that units of ~250 bp occur only in arrays whose copy number is small enough for the whole array to be spanned by a single read.)
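To make the peak detection step more concrete, here is a minimal sketch of finding peak unit lengths with a Gaussian KDE, using only the standard library. It is not the code in `vca.filter_tr_reads`: the helper names are hypothetical, the unit lengths are synthetic, and reading `band_width` as the kernel bandwidth and `min_density` as the minimum density required of a peak is an assumption based only on the parameter names in the docstring.

```python
import math
from typing import List


def gaussian_kde_on_grid(values: List[int], grid: List[int], band_width: float) -> List[float]:
    """Evaluate a Gaussian kernel density estimate of `values` at each grid point."""
    norm = len(values) * band_width * math.sqrt(2.0 * math.pi)
    return [
        sum(math.exp(-0.5 * ((x - v) / band_width) ** 2) for v in values) / norm
        for x in grid
    ]


def find_peak_lengths(
    unit_lens: List[int],
    min_ulen: int = 50,
    max_ulen: int = 500,
    band_width: float = 5,
    min_density: float = 0.005,
) -> List[int]:
    """Return unit lengths that are strict local maxima of the KDE
    and whose density is at least `min_density` (assumed semantics)."""
    lens = [u for u in unit_lens if min_ulen <= u <= max_ulen]
    grid = list(range(min_ulen, max_ulen + 1))
    dens = gaussian_kde_on_grid(lens, grid, band_width)
    return [
        grid[i]
        for i in range(1, len(grid) - 1)
        if dens[i] >= min_density and dens[i - 1] < dens[i] > dens[i + 1]
    ]


# Synthetic unit lengths (NOT taken from the DMEL data): two dense clusters
# around 121 bp and 371 bp, plus a single isolated 250 bp unit.
unit_lens = (
    [119, 120, 121, 121, 122, 123] * 5
    + [369, 370, 371, 371, 372, 373] * 5
    + [250]
)
peaks = find_peak_lengths(unit_lens)
print(peaks)
```

In this toy data the isolated 250 bp unit forms a local maximum of the density, but its density stays below `min_density`, so it is not reported as a peak.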

Finally run the filtering:

In [9]:
f.run()

[I 190911 19:02:28 filter_tr_reads:59] Peak unit lengths: 121 bp, 371 bp
[I 190911 19:02:28 filter_tr_reads:66] Peak intervals: 109.0-133.0 bp, 334.0-408.0 bp
[I 190911 19:02:29 filter_tr_reads:41] 12071 TR reads -> 777 centromere reads
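The peak intervals in the log are consistent with each peak length being widened by the default `deviation` of 0.1 on both sides and rounded to the nearest integer. The sketch below reproduces the logged numbers under that assumption. The rounding rule is inferred from the log output, not from the source code, and `peak_interval` is a hypothetical helper.

```python
from typing import Tuple


def peak_interval(peak_len: int, deviation: float = 0.1) -> Tuple[float, float]:
    """Interval [peak * (1 - deviation), peak * (1 + deviation)], rounded.

    Assumption: the library rounds both ends to the nearest integer;
    this is inferred from the logged intervals only.
    """
    lo = float(round(peak_len * (1 - deviation)))
    hi = float(round(peak_len * (1 + deviation)))
    return lo, hi


for peak in (121, 371):
    lo, hi = peak_interval(peak)
    print(f"{peak} bp -> {lo}-{hi} bp")
```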


Now `centromere_reads.pkl` has been generated:

In [10]:
!(ls -l)

total 256192
-rw-r--r-- 1 yoshihiko_s users  11959223 Sep 11 19:02 centromere_reads.pkl
drwxr-xr-x 2 yoshihiko_s users        38 Sep 11 17:45 datander
drwxr-xr-x 2 yoshihiko_s users      4096 Sep 11 17:59 datruf
-rw-r--r-- 1 yoshihiko_s users       250 Sep 11 17:34 DMEL.db
-rw-r--r-- 1 yoshihiko_s users  70975912 Sep 11 17:46 TAN.DMEL.las
drwxr-xr-x 2 yoshihiko_s users        40 Sep 11 18:33 tmp
-rw-r--r-- 1 yoshihiko_s users 179390870 Sep 11 18:00 tr_reads.pkl


# About the output data: `centromere_reads.pkl`

`centromere_reads.pkl` is a subset of `tr_reads.pkl`, so the data stored in it is a `List[vca.types.TRRead]`:

In [11]:
from BITS.util.io import load_pickle

In [12]:
type(load_pickle("centromere_reads.pkl"))

list

In [13]:
type(load_pickle("centromere_reads.pkl")[0])

vca.types.TRRead

However, every read in `centromere_reads.pkl` consists almost entirely of tandem repeats whose unit lengths fall within the peak intervals shown above.
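As a rough illustration of what this coverage criterion means, the sketch below computes the fraction of a read covered by units whose lengths fall inside the peak intervals and compares it with `min_cover_rate`. It is a simplification under stated assumptions: units are represented as hypothetical, non-overlapping `(start, end)` tuples rather than the real attributes of `TRRead`, the reads are synthetic, and the library may apply the cover-rate and peak-length criteria in separate stages, as its docstring suggests.

```python
from typing import List, Tuple

Interval = Tuple[float, float]
Unit = Tuple[int, int]  # hypothetical (start, end) on the read, end exclusive


def in_any_interval(length: int, intervals: List[Interval]) -> bool:
    """True if `length` lies inside at least one peak interval."""
    return any(lo <= length <= hi for lo, hi in intervals)


def cover_rate(read_len: int, units: List[Unit], intervals: List[Interval]) -> float:
    """Fraction of read bases covered by units of peak length.

    Assumes the units do not overlap each other.
    """
    covered = sum(e - s for s, e in units if in_any_interval(e - s, intervals))
    return covered / read_len


intervals = [(109.0, 133.0), (334.0, 408.0)]  # peak intervals from the log above
min_cover_rate = 0.8                          # default from the docstring

# Synthetic 1000 bp read: seven ~121 bp units plus one 60 bp unit.
units_a = [(0, 121), (121, 242), (242, 362), (362, 484),
           (484, 605), (605, 726), (726, 848), (900, 960)]
# Synthetic 600 bp read: two 250 bp units, outside both peak intervals.
units_b = [(0, 250), (250, 500)]

rate_a = cover_rate(1000, units_a, intervals)
rate_b = cover_rate(600, units_b, intervals)
print(rate_a >= min_cover_rate, rate_b >= min_cover_rate)
```

Only the first synthetic read passes the threshold: 848 of its 1000 bases are covered by units of peak length, while the second read has no such units.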

In [14]:
load_pickle("centromere_reads.pkl")[0]

TRRead(seq='aaagagagagatgaatgtcatagctcatggggctcgtaagaaaatttacaatcaactgtgttcaaacaatgtaaattaaaatttttatgggcctatttggcaagttttgatgacccccctccttaacaaaaaatgttgaaattgataccaaaaaattaatttcgccaaatgccttggcaaaaagtaatagggatcgttcactggtaattagctgctgctcaaaacagttattcttaagcatcctatgtgacatttttagccaaaagttatatacgaaaatttggtttgtaaatatcaaacatgtttggcagaatctgttttttcacaaatttcggtcacaaatgaatcatttattttgccacaacataaaaaataaaattgtctaaaaaattggaatgtcatatgctcactgagctcgtaaataaaaatttccggcaaatcaaagactgtgtgtcaaaatggaaaattaaattttttggcgcatatttggcaaggtttcgatgaccccctcctacaaaaaaatgtgaaattgataccaaaaattaatttgcccaaaaatccttcaaaaagtaaataagggatcgttagcactggtaaattagcttgctcaaaaacagtttattcttacatcatgtgaccatttttagccaagttataacgaaaaatttggtttgtaaatatcaaacgttttggcagaatttgtttttcgcaaaatttcggtcatcaaaaaataatcatttattttgccacaacattaaaaaataattgtcctgaaatatggaatgtcataaccctcaactgagctcgtaataaaatttccaatcaagctgtgttcaaaaataagaaataaattttttgccaatattttgggccaaaaaattttgatgaccccccccttacaaaaaatcgaaaattgatccaaaaattaatttccagctaaatcgcttcaaaaagtaataggtgatcgttaagcactggtattaggctg