How do we connect the reads? First, consider partitioning them into subsets of similar reads. How should that be done?

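- Use the similarity of the original sequences
- Use the similarity of `repr_units`
- Use the similarity of the sequences after replacement with `smc.cluster_cons`
- Use the similarity of `count_variants()` spectra (build one spectrum per fixed number of consecutive units?)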
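
For the last option, one simple way to score two variant spectra against each other is a weighted Jaccard index over the variant keys. The following is only a sketch: `spectrum_similarity` is a hypothetical helper, and it assumes each spectrum is a dict mapping a variant key to its relative frequency, computed against the same representative unit in the same phase.

```python
def spectrum_similarity(spec_a, spec_b):
    """Weighted Jaccard index between two variant spectra.

    Each spectrum maps a variant key to its relative frequency (assumed format).
    A variant missing from one spectrum is treated as frequency 0.
    """
    keys = set(spec_a) | set(spec_b)
    if not keys:
        return 1.0   # convention chosen here: two empty spectra are identical
    shared = sum(min(spec_a.get(k, 0.0), spec_b.get(k, 0.0)) for k in keys)
    total = sum(max(spec_a.get(k, 0.0), spec_b.get(k, 0.0)) for k in keys)
    return shared / total


# Toy spectra keyed like count_variants(): (pos, ins_index, type, base)
spec_a = {(6, 0, 'X', 'a'): 0.5, (9, 0, 'X', 't'): 0.2}
spec_b = {(6, 0, 'X', 'a'): 0.3, (12, 0, 'D', '-'): 0.1}
similarity = spectrum_similarity(spec_a, spec_b)
```

The weighted form is used rather than a plain set Jaccard so that a variant present at 45% in one read and 10% in another counts as a partial match only.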

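I would like to try, at least once, how many reads hit each other when a strict identity threshold (e.g., 1%) is applied to the original sequences.
--> Done. See `1.2. AVA centromere read overlap with daligner.ipynb`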

With CCS reads the noise is very low, so the variants can be trusted. `repr_units` and `count_variants()` spectra should therefore be sufficient. The question is what kind of window to use.
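
One possible answer to the window question, as a rough sketch: slide a fixed-size window over the consecutive units of a read and build one spectrum per window. `windowed_spectra`, `window_size`, and `stride` are hypothetical names, and each unit is assumed to be already reduced to the set of variant keys it carries (for example, the keys that `count_variants()` yields for a single unit).

```python
from collections import Counter


def windowed_spectra(unit_variants, window_size=10, stride=5, min_var_frac=0.1):
    """Build one variant spectrum per window of consecutive units.

    unit_variants[i] is the set of variant keys found in the i-th unit of a read.
    Returns a list of (window_start_index, {variant_key: relative_frequency}).
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if not unit_variants:
        return []
    spectra = []
    # A read shorter than one window still yields a single (short) window.
    # Units after the last full stride are not covered by an extra window.
    last_start = max(len(unit_variants) - window_size, 0)
    for start in range(0, last_start + 1, stride):
        window = unit_variants[start:start + window_size]
        counts = Counter(var for variants in window for var in variants)
        spectrum = {var: count / len(window)
                    for var, count in counts.items()
                    if count / len(window) >= min_var_frac}
        spectra.append((start, spectrum))
    return spectra


# Toy input: four units, each described by the variant keys it carries
toy_units = [{"a"}, {"a", "b"}, {"b"}, {"b"}]
toy_spectra = windowed_spectra(toy_units, window_size=2, stride=1, min_var_frac=0.5)
```

A count-based threshold inside each window could replace `min_var_frac` here; the sketch keeps the fraction only to stay close to the existing parameter.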

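Split-merge clustering is mainly meant for building unit models, so I want to avoid committing to hard decisions at the single-read stage.

Consider a representation for the synchronized units contained in a single read.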

In [1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from IPython.display import display
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.io as pio
pio.templates.default = 'plotly_white'
import logging
import logzero
logzero.loglevel(logging.INFO)

In [2]:
dir_fname = 'work'
import os
os.chdir(dir_fname)

In [3]:
from BITS.util.io import load_pickle, save_pickle
import numpy as np
import pandas as pd
from BITS.plot.plotly import make_hist, make_scatter, make_layout, show_plot
from BITS.clustering.seq import ClusteringSeq
import consed
from BITS.seq.align import EdlibRunner
from collections import Counter, defaultdict
from logzero import logger
from dataclasses import dataclass
from typing import List
import random

In [4]:
sync_reads = load_pickle("centromere_reads_sync.pkl")

In [5]:
db_prefix = "DMEL_CSS"
db_fname = f"{db_prefix}.db"
las_fname = f"TAN.{db_prefix}.las"
from vca import ReadViewer
v = ReadViewer(db_fname, las_fname)

In [6]:
read = sync_reads[0]

In [8]:
v.show(read=read)

[I 190917 11:10:39 log:17] Starting distance matrix calculation 
[I 190917 11:10:39 log:19] Finished distance matrix calculation


## Functions for counting variants

In [8]:
# Variants and sequencing errors

class PairwiseAlignment:
    def __init__(self, a_seq, b_seq):
        er = EdlibRunner("global", revcomp=False, cyclic=False)
        self.fcigar = er.align(b_seq.lower(), a_seq.lower()).cigar.flatten().string   # NOTE: b vs a; be careful!
        self.source, self.target = '', ''
        s_pos, t_pos = 0, 0
        for c in self.fcigar:
            if c == '=' or c == 'X':
                self.source += a_seq[s_pos]
                self.target += b_seq[t_pos]
                s_pos += 1
                t_pos += 1
            elif c == 'I':
                self.source += '-'
                self.target += b_seq[t_pos]
                t_pos += 1
            else:
                self.source += a_seq[s_pos]
                self.target += '-'
                s_pos += 1
        
    def show(self, by_cigar=False):
        if by_cigar:   # standard alignment like BLAST
            print(self.source)
            print(self.fcigar)
            print(self.target)
        else:
            print(''.join([' ' if c == '=' else self.source[i] for i, c in enumerate(self.fcigar)]))
            print(''.join([self.source[i] if c == '=' else ' ' for i, c in enumerate(self.fcigar)]))
            print(''.join([' ' if c == '=' else self.target[i] for i, c in enumerate(self.fcigar)]))

def count_variants(cluster_cons_unit, cluster_units):
    """Count the variants of the units in a cluster relative to the cluster's consensus unit.

    Each unit in <cluster_units> is globally aligned to <cluster_cons_unit>, and every
    non-matching alignment column is recorded under the key
    (position_on_cons_unit, insertion_index, variant_type, base_on_unit).
    The return value is a Counter mapping each key to the number of units that carry it.
    Since a cluster should be homogeneous (i.e., mono-source), the relative frequencies are
    expected to be not much larger than the sequencing error rate.
    """
    assert cluster_cons_unit != "", "Empty strings are not allowed"
    # TODO: how to decide "same variant?" especially for multiple variations on same position (but slightly different among units)?
    variants = Counter()
    for unit in cluster_units:
        assert unit != "", "Empty strings are not allowed"
        alignment = PairwiseAlignment(cluster_cons_unit, unit)   # alignment.fcigar(cluster_cons_unit) = unit
        tpos = 0
        var_index = 0   # positive values for continuous insertions
        for i, c in enumerate(alignment.fcigar):
            if c == 'I':
                var_index += 1
            else:
                var_index = 0   # reset at every non-insertion column so that X and D always get index 0
            if c != '=':
                variants[(tpos, var_index, c, alignment.target[i])] += 1   # TODO: multiple D on the same pos are aggregated
            if c != 'I':
                tpos += 1
        assert tpos == len(cluster_cons_unit)
    return variants

def list_variations(template_unit, cluster_cons_unit):
    """Single-vs-single version of count_variants().
    That is, list the differences between the (imaginary) template unit and the consensus unit
    of a cluster (which should be a real instance).
    The return value is
    [(position_on_template_unit, insertion_index, variant_type, base_on_cluster_cons_unit)].
    """
    assert template_unit != "" and cluster_cons_unit != "", "Empty strings are not allowed"
    return list(count_variants(template_unit, [cluster_cons_unit]).keys())
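
To make the variant keys concrete, here is a minimal sketch that applies the same bookkeeping to a hand-written flattened CIGAR instead of calling `EdlibRunner`. `variants_from_fcigar` is a hypothetical helper and the CIGAR string is written by hand for illustration, not produced by an aligner. The sketch assumes the insertion index restarts at every non-insertion column.

```python
from collections import Counter


def variants_from_fcigar(cons, unit, fcigar):
    """Collect (cons_pos, ins_index, op, unit_base) keys from a flattened CIGAR
    that transforms `cons` into `unit` (one of '=', 'X', 'I', 'D' per column)."""
    variants = Counter()
    cons_pos, unit_pos, ins_index = 0, 0, 0
    for op in fcigar:
        # Assumption: ins_index counts consecutive insertions before cons_pos
        ins_index = ins_index + 1 if op == 'I' else 0
        base = '-' if op == 'D' else unit[unit_pos]
        if op != '=':
            variants[(cons_pos, ins_index, op, base)] += 1
        if op != 'I':
            cons_pos += 1   # '=', 'X', 'D' consume a consensus base
        if op != 'D':
            unit_pos += 1   # '=', 'X', 'I' consume a unit base
    assert cons_pos == len(cons) and unit_pos == len(unit)
    return variants


# g->c substitution at 2, 'g' inserted before position 4, 'c' deleted at 5
example_variants = variants_from_fcigar("acgtac", "acctga", "==X=I=D")
```

Here `example_variants` holds the three keys `(2, 0, 'X', 'c')`, `(4, 1, 'I', 'g')`, and `(5, 0, 'D', '-')`, each with count 1, which is the same key layout that appears in the spectrum output of this notebook.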

## Representative units & variant spectra

In [93]:
def characterize_sync_read(read, min_var_frac=0.1):
    """Convert a synchronized read into a set of representative units and variants within them."""
    # TODO: window size and slide   # TODO: min_var_count in a window rather than min_var_frac in a read?
    read_spectrum = set()

    # Call variants with relative frequency >= <min_var_frac>
    for repr_id, repr_unit in read.repr_units.items():
        units = [read.seq[unit.start:unit.end] for unit in read.units if unit.id == repr_id]
        var_counts = sorted(count_variants(repr_unit, units).items())
        var_counts = [(key, round(count / len(units), 3)) for key, count in var_counts]
        
        logger.info(f"#units = {len(units)}, min var freq = {round(len(units) * min_var_frac, 3)}")
        var_counts = tuple(filter(lambda x: x[1] >= min_var_frac, var_counts))
        read_spectrum.add((repr_unit, var_counts))

    return read_spectrum

In [94]:
characterize_sync_read(read)

[I 190918 00:47:11 <ipython-input-93-9472150ac6de>:11] #units = 38, min var freq = 3.8


{('atgacccccctccttacaaaaaatgcgaaaattgatccaaaaattaatttcctaaatccttcaaaaagtaatagggatcgttagcactggtaattagctgctcaaaacagttattgttacatctatgtgaccatttttagccaagttataacgaaaatttcgtttgtaaatatcaacatttttgcagagtctgtttttccaaatttcggtcatcaaataatcatttattttgccacaacataaaaaataattgtctgaatatggaatgtcatacctcactgagctcgtaataaaatttccaatcaaactgtgttcaaaaatggaaattaaattttttggccatattttgcaaattttg',
  (((6, 0, 'X', 'a'), 0.184),
   ((9, 0, 'X', 't'), 0.132),
   ((12, 0, 'D', '-'), 0.158),
   ((31, 0, 'X', 'g'), 0.316),
   ((34, 0, 'X', 'g'), 0.263),
   ((36, 0, 'X', 't'), 0.368),
   ((52, 1, 'I', 'c'), 0.474),
   ((71, 0, 'X', 'c'), 0.184),
   ((74, 0, 'X', 'a'), 0.368),
   ((95, 0, 'X', 't'), 0.184),
   ((102, 0, 'X', 't'), 0.395),
   ((110, 0, 'X', 'a'), 0.447),
   ((115, 0, 'X', 'c'), 0.158),
   ((115, 1, 'I', 'c'), 0.447),
   ((117, 0, 'D', '-'), 0.447),
   ((131, 0, 'X', 'a'), 0.395),
   ((135, 0, 'X', 'g'), 0.237),
   ((136, 0, 'X', 'c'), 0.237),
   ((160, 0, 'X', 'g'), 0.316),
   ((164, 0, 'D', '-'), 0.342),
   ((174

Two problems remain even when comparing these spectra with each other:

- phase synchronization between the representative units is required
- the spectrum changes depending on which region overlaps
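
For the first point, a naive way to synchronize the phase of two representative units is to try every cyclic rotation of one unit and keep the rotation closest to the other. The sketch below uses a plain dynamic-programming edit distance so that it is self-contained. `edit_distance` and `best_rotation` are hypothetical helpers, not part of `BITS`, and the brute-force search is cubic in the unit length, so it is meant only to illustrate the idea.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion from a
                            curr[j - 1] + 1,            # insertion into a
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def best_rotation(ref_unit, unit):
    """Return (offset, distance) such that unit[offset:] + unit[:offset]
    is the cyclic rotation of `unit` closest to `ref_unit`."""
    best = (0, edit_distance(ref_unit, unit))
    for offset in range(1, len(unit)):
        rotated = unit[offset:] + unit[:offset]
        distance = edit_distance(ref_unit, rotated)
        if distance < best[1]:
            best = (offset, distance)
    return best


# Toy example: `shifted` is `ref` cut at a different phase
ref = "acgtacgg"
shifted = "tacggacg"
offset, distance = best_rotation(ref, shifted)
resynced = shifted[offset:] + shifted[:offset]
```

Once two representative units share a phase, their variant positions refer to the same coordinates. The second point still applies, though: spectra should only be compared over the units that fall inside the overlapping region.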