In [1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from IPython.display import display
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.io as pio
pio.templates.default = 'plotly_white'
import logging
import logzero
logzero.loglevel(logging.INFO)

# Move to the data directory

In [2]:
dir_fname = "result"

In [3]:
import os
os.chdir(dir_fname)

# Input

* `tr_reads.pkl`: Reads with tandem repeats detected by datander

In [10]:
tr_reads_fname = "tr_reads.pkl"

# Output

* `centromere_reads.pkl`: Centromeric reads, i.e., reads almost entirely covered by tandem repeat units whose lengths are among the most frequent in the data

# How to run

## Extract centromeric TR reads from TR reads

In [5]:
from vca.filter_tr_reads import TRReadFilter

In [6]:
TRReadFilter?

```
Init signature:
TRReadFilter(
    tr_reads_fname: str,
    min_ulen: int = 50,
    max_ulen: int = 500,
    min_cover_rate: float = 0.8,
    band_width: int = 5,
    min_density: float = 0.005,
    deviation: float = 0.1,
    show_plot: bool = False,
) -> None
Docstring:     
Class for filtering List[TRRead] into centromeric TR reads by finding peaks of unit lengths,
which would be components of centromere.

TR reads [reads w/ TR(s) of any unit length & any copy number]
=> TR-contained reads [reads contained in TR(s)]
=> Centromeric reads [reads contained in TR(s) with peak unit length]
```
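The two filtering steps in the docstring can be pictured with a small, self-contained sketch. This is not the code of `TRReadFilter`; the `Unit` and `Read` classes, `cover_rate`, and `filter_reads` are hypothetical stand-ins, and units are assumed not to overlap. Only the idea is taken from the docstring: keep a read if a large enough fraction of it (`min_cover_rate`) is covered by units whose length falls in a peak interval.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Unit:
    """Hypothetical stand-in for one tandem repeat unit on a read."""
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start


@dataclass
class Read:
    """Hypothetical stand-in for a TR read (not the real TRRead class)."""
    length: int
    units: List[Unit]


def cover_rate(read: Read, intvls: List[Tuple[int, int]]) -> float:
    """Fraction of the read covered by units whose length lies in any interval.

    Assumes the units of a read do not overlap each other.
    """
    covered = sum(u.length for u in read.units
                  if any(lo <= u.length <= hi for lo, hi in intvls))
    return covered / read.length


def filter_reads(reads: List[Read],
                 intvls: List[Tuple[int, int]],
                 min_cover_rate: float = 0.8) -> List[Read]:
    """Keep reads that are mostly covered by units of peak length."""
    return [r for r in reads if cover_rate(r, intvls) >= min_cover_rate]


# Toy data: one peak interval around a made-up unit length of 171 bp
peak_intvls = [(153, 189)]
reads = [
    Read(1000, [Unit(i * 171, (i + 1) * 171) for i in range(5)]),  # well covered
    Read(1000, [Unit(0, 171), Unit(171, 342)]),                    # poorly covered
    Read(1000, [Unit(0, 300), Unit(300, 600), Unit(600, 900)]),    # off-peak units
]
kept = filter_reads(reads, peak_intvls)
```

Only the first toy read passes: the second is covered too sparsely, and the third is well covered but by units outside the peak interval.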

In [13]:
f = TRReadFilter(tr_reads_fname)

In [None]:
f.run()

Now `centromere_reads.pkl` is generated.

# Figures

Passing `show_plot=True` to `TRReadFilter` displays histograms of unit length for:

* All units
* Units filtered by unit length and read cover rate

In [None]:
f = TRReadFilter(tr_reads_fname, show_plot=True)

In [None]:
peak_intvls = f.find_peaks()
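To give a feel for what peak finding over unit lengths could look like, here is a rough sketch using a Gaussian kernel density estimate. How `find_peaks` works internally is not shown in this notebook, so everything below is an assumption: the function names are hypothetical, and the way `band_width`, `min_density`, and `deviation` are used is only a guess based on the parameter names in the signature above.

```python
import math
from typing import List, Tuple


def gaussian_kde(values: List[int], grid: List[int], band_width: float) -> List[float]:
    """Evaluate a Gaussian kernel density estimate of `values` at each grid point."""
    norm = len(values) * band_width * math.sqrt(2 * math.pi)
    return [sum(math.exp(-0.5 * ((x - v) / band_width) ** 2) for v in values) / norm
            for x in grid]


def find_peak_intervals(ulens: List[int],
                        min_ulen: int = 50,
                        max_ulen: int = 500,
                        band_width: float = 5,
                        min_density: float = 0.005,
                        deviation: float = 0.1) -> List[Tuple[int, int]]:
    """Return (lower, upper) unit length intervals around each density peak.

    Assumed behaviour: a peak is a local maximum of the density that reaches
    `min_density`, and its interval is the peak length +/- `deviation` (relative).
    """
    in_range = [l for l in ulens if min_ulen <= l <= max_ulen]
    if not in_range:
        return []
    grid = list(range(min_ulen, max_ulen + 1))
    dens = gaussian_kde(in_range, grid, band_width)
    intvls = []
    for i in range(1, len(grid) - 1):
        is_local_max = dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]
        if is_local_max and dens[i] >= min_density:
            peak = grid[i]
            intvls.append((math.floor(peak * (1 - deviation)),
                           math.ceil(peak * (1 + deviation))))
    return intvls


# Toy data: many units near 171 bp plus a few 342 bp units (too rare to form a peak)
toy_ulens = [169, 170, 171, 171, 171, 172, 173] * 20 + [342] * 3
toy_intvls = find_peak_intervals(toy_ulens)
```

With the toy data, only the cluster around 171 bp is dense enough to be reported; the three 342 bp units form a local maximum that stays below `min_density`.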

In [None]:
# TODO: summarize below somehow

# Synchronize centromeric TR units

What we would like to do next is to cluster the centromeric TR units; however, every TR unit detected so far has an arbitrary start position, i.e., the units are not "synchronized". Wrap-around alignment seems to be a solution for this, but the self-circularized sequence used in the wrap-around DP is **artificial**. Therefore, it is better to first synchronize the units and then align them with global alignment.
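As a toy illustration of what "synchronized" means, the sketch below rotates every unit to its lexicographically smallest rotation, so that units that are exact rotations of each other end up with the same start position. This is a deliberately simplified stand-in, not the method used in this project: real units contain sequencing errors and variants, so exact minimal rotation would not be robust there, and the function name `canonical_rotation` is hypothetical.

```python
def canonical_rotation(seq: str) -> str:
    """Return the lexicographically smallest rotation of `seq`.

    Quadratic in len(seq), which is fine for units of a few hundred bases.
    """
    if not seq:
        return seq
    n = len(seq)
    doubled = seq + seq  # every rotation is a length-n window of seq + seq
    return min(doubled[i:i + n] for i in range(n))


# Three copies of the same toy unit, each cut at a different start position
units = ["ACGTT", "GTTAC", "TTACG"]
synced = [canonical_rotation(u) for u in units]
```

After this step the three toy units are identical strings, so an ordinary global alignment (or plain comparison) can be applied without any wrap-around handling.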

In [3]:
from BITS.plot.plotly import make_hist, show_plot

In [6]:
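import pickle

# Load the centromeric reads and plot the distribution of their lengths
with open("centromere_reads.pkl", "rb") as fin:
    centromere_reads = pickle.load(fin)
show_plot([make_hist([read.length for read in centromere_reads], bin_size=200)])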