# Verification of fragment completeness analysis by direct cluster analysis

Fragments, which may represent conformers, can be *complete*, which we define as being seen in more than one PDB;
or they can be *singletons*, which we define as being seen in only one PDB.

Completeness can be determined at any precision (an RMSD distance in A). 

The most obvious way is to determine, for each fragment, the closest-fitting fragment from a different PDB,
and then to compute what percentage of those closest fits lie within the precision distance.
This is, however, somewhat computationally intensive.
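
As an illustration of why the closest-fit route is costly, here is a minimal brute-force sketch. The names `closest_other_pdb` and `completeness_pct`, the `distance` callable, and the use of `<=` for "within the precision" are assumptions made for this example; they are not taken from the scripts in this repository.

```python
def closest_other_pdb(fragments, pdb_codes, distance):
    """For each fragment, return the smallest distance to a fragment from a different PDB.

    `distance` is any callable returning an RMSD-like value for two fragments
    (hypothetical; a real run would use a fitted RMSD). The double loop makes
    this quadratic in the number of fragments.
    """
    closest = []
    for i, frag in enumerate(fragments):
        best = float("inf")
        for j, other in enumerate(fragments):
            if pdb_codes[i] == pdb_codes[j]:
                continue  # only fragments from a different PDB count
            best = min(best, distance(frag, other))
        closest.append(best)
    return closest


def completeness_pct(closest, precision):
    """Percentage of fragments whose closest fit lies within `precision`."""
    n_complete = sum(1 for d in closest if d <= precision)
    return 100.0 * n_complete / len(closest)


# Toy data: 1-D "fragments" with absolute difference as the distance.
toy_fragments = [0.0, 0.1, 0.9, 3.0]
toy_pdbs = ["1abc", "1abc", "2xyz", "3def"]
toy_closest = closest_other_pdb(toy_fragments, toy_pdbs, lambda a, b: abs(a - b))
print(completeness_pct(toy_closest, 1.0))
```

The number of distance evaluations grows with the square of the library size, which is what makes this approach expensive for libraries of the size analysed here.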

Here we show that fragment completeness at distance X can be determined directly from analysis of the clustering at X (for X=0.5A or X=1A). This analysis is trivial in terms of computation time. About 99.8 % of the non-trivial fragments can be designated as complete or singleton with certainty. For the remaining ~0.2 %, the designation is putative, but still correct in the large majority of cases (roughly 83 to 90 % in the verification results below).
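
The designation rules, as they are spelled out in the report text generated further down, can be sketched as follows. This is only an illustration: the data layout (`pdb_of`, `clusters`) and the function name `classify` are hypothetical, the trivially complete fragments are ignored, and the actual logic of `./classify-fragments.sh` is not shown in this notebook.

```python
def classify(pdb_of, clusters):
    """Designate each fragment from the clustering alone.

    pdb_of:   dict, fragment id -> PDB code
    clusters: dict, heart id -> set of member ids (the heart itself included)
    Assumes every fragment is either a heart or a member of at least one cluster.
    """
    # A non-heart fragment can belong to several clusters: collect their hearts.
    hearts_of = {}
    for heart, members in clusters.items():
        for member in members:
            if member != heart:
                hearts_of.setdefault(member, []).append(heart)

    labels = {}
    for frag, pdb in pdb_of.items():
        if frag in clusters:
            # Heart: certain either way (a lone heart is a single-PDB cluster).
            single_pdb = all(pdb_of[m] == pdb for m in clusters[frag])
            labels[frag] = "singleton" if single_pdb else "non-singleton"
        else:
            hearts = hearts_of[frag]
            if any(pdb_of[h] != pdb for h in hearts):
                # At least one heart comes from another PDB: certain.
                labels[frag] = "non-singleton"
            elif all(pdb_of[m] == pdb for h in hearts for m in clusters[h]):
                labels[frag] = "putative singleton"
            else:
                labels[frag] = "putative non-singleton"
    return labels


toy_pdb_of = {"a": "1abc", "b": "1abc", "c": "2xyz", "d": "1abc",
              "e": "3def", "f": "3def", "g": "3def", "h": "1abc"}
toy_clusters = {"a": {"a", "b", "c"}, "d": {"d"}, "e": {"e", "f"}, "g": {"g", "h"}}
toy_labels = classify(toy_pdb_of, toy_clusters)
```

In this toy example, `b` shares its PDB with its cluster heart `a` while the cluster also contains `c` from another PDB, so `b` is only a putative non-singleton: `c` is close to the heart, but not necessarily close to `b`.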

# Instructions

The current notebook analyses the verification result files that were previously generated in the `./result/` subfolder.

To generate these result files:

- Run `./classify-fragments.sh`.

- Run one of the `check-all` scripts. (`check-all.sh` is the fastest, but it requires that the closest-fit analysis in `../output` has been completed first.)
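
Before running the analysis, it can help to confirm that both steps produced all the files the notebook reads. The sketch below rebuilds the file names from the patterns used in the analysis code of this notebook; the helper names `expected_result_files` and `missing_result_files` are made up for this example.

```python
from pathlib import Path

MOTIFS = {
    "dinuc": ["AA", "AC", "CA", "CC"],
    "trinuc": ["AAA", "AAC", "ACA", "ACC", "CAA", "CAC", "CCA", "CCC"],
}
PRECISIONS = ["0.5", "1.0"]
# Suffixes of the four check files read per motif and precision.
CHECK_SUFFIXES = [
    "singletons",
    "close-pairs",
    "putative_singletons",
    "putative_non_singletons",
]


def expected_result_files(lib, result_dir="result"):
    """List every result file the analysis opens for one library."""
    files = []
    for motif in MOTIFS[lib]:
        for precision in PRECISIONS:
            stem = f"{lib}-{motif}-{precision}"
            files.append(Path(result_dir) / f"classify-fragments-{stem}.out")
            for suffix in CHECK_SUFFIXES:
                files.append(Path(result_dir) / f"check-{stem}.{suffix}.out")
    return files


def missing_result_files(lib, result_dir="result"):
    """Return the expected files that do not exist yet."""
    return [f for f in expected_result_files(lib, result_dir) if not f.is_file()]


for lib in MOTIFS:
    missing = missing_result_files(lib)
    print(lib, "missing files:", len(missing))
```

An empty list from `missing_result_files` for both libraries means the analysis cell can open every file it needs.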

In [22]:
txt0 = """The **{libname}** library consists of **{nfrag}** non-redundant fragments. 
Of these, **{trivial_pct:.1f}** % are trivially complete: they come from 0.2 A clusters originating
from multiple PDBs.

Here, we will focus on the remaining non-trivial fragments instead.
"""
txt1 = """
Of the non-trivial fragments, **{single_pct:.1f} %** are alone in their cluster.
Another **{homo_heart_pct:.2f} %** are the hearts of clusters that come from a single PDB.
Together, these two categories are the certain singletons. 
"""

txt2 = """
Of the non-trivial fragments, **{member_dif_ori_pct:.1f} %** come from a different PDB than the heart of their cluster.
Another **{hetero_heart_pct:.2f} %** are the hearts of clusters that come from multiple PDBs.
Together, these two categories are the certain non-singletons. 
"""

txt3 = """
All certain fragments together make up **{certain_pct:.2f} %** of the non-trivial fragments.
Verification showed that **{error_certain}** certain fragments were assigned in error.
"""

txt4 = """
The remaining **{putative_pct:.2f} %** of the fragments come from the same PDB as their cluster heart, 
but are not a cluster heart themselves. 
(Note that a non-heart fragment can belong to multiple clusters; 
to be precise, all those cluster hearts come from the same PDB as the fragment).
If all members of the cluster (or all clusters) come from that PDB too, the fragment is designated
as a putative singleton; otherwise, as a putative non-singleton. 

Among the putative fragments, **{error_putative} ({putative_relative_error_pct:.1f} %)** are assigned incorrectly. 
However, since putative fragments are so rare, these errors are only **{putative_absolute_error_pct:.3f} %** of all non-trivial fragments.

"""


def run(lib, libname):
    """Build the Markdown report for one fragment library from the files in ./result/."""
    result = f"# {libname[0].upper() + libname[1:]} library analysis\n\n"
    assert lib in ("dinuc", "trinuc"), lib
    motifs = (
        ["AA", "AC", "CA", "CC"]
        if lib == "dinuc"
        else ["AAA", "AAC", "ACA", "ACC", "CAA", "CAC", "CCA", "CCC"]
    )
    precisions = ["0.5", "1.0"]

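    # Parse the "#"-prefixed summary lines of each classification file:
    # the first token after "#" is the field number, the last token is the count.
    classif = {}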
    for motif in motifs:
        for precision in precisions:
            classifile = f"result/classify-fragments-{lib}-{motif}-{precision}.out"
            with open(classifile) as f:
                for l in f:
                    l = l.strip()
                    if not l:
                        continue
                    if l[0] == "#":
                        ll = l[1:].split()
                        field = int(ll[0])
                        value = int(ll[-1])
                        classif[motif, precision, field] = value
    nfrag = 0
    ntrivial = 0
    p = precisions[0]
    for motif in motifs:
        nfrag += classif[motif, p, 2]
        ntrivial += classif[motif, p, 3]
    trivial_pct = ntrivial / nfrag * 100
    tot = nfrag - ntrivial
    result += txt0.format(**locals())
    for precision in precisions:
        result += f"\n## {precision}A precision analysis\n"

        single = 0
        homo_heart = 0
        error_singletons = 0
        for motif in motifs:
            single += classif[motif, precision, 6]
            homo_heart += classif[motif, precision, 7]
            check_singleton_file = (
                f"result/check-{lib}-{motif}-{precision}.singletons.out"
            )
            with open(check_singleton_file) as f:
                error_singletons += len(f.readlines())
        single_pct = single / tot * 100
        homo_heart_pct = homo_heart / tot * 100
        result += txt1.format(**locals())

        hetero_heart = 0
        member_dif_ori = 0
        error_non_singletons = 0
        for motif in motifs:
            hetero_heart += classif[motif, precision, 8]
            member_dif_ori += classif[motif, precision, 10]
            check_non_singleton_file = (
                f"result/check-{lib}-{motif}-{precision}.close-pairs.out"
            )
            with open(check_non_singleton_file) as f:
                error_non_singletons += len(f.readlines())
        member_dif_ori_pct = member_dif_ori / tot * 100
        hetero_heart_pct = hetero_heart / tot * 100

        result += txt2.format(**locals())

        certain = single + member_dif_ori + homo_heart + hetero_heart
        certain_pct = certain / tot * 100
        error_certain = error_non_singletons + error_singletons
        result += txt3.format(**locals())

        putative = tot - certain
        putative_pct = putative / tot * 100

        error_putative = 0
        for motif in motifs:
            putative_singleton_file = (
                f"result/check-{lib}-{motif}-{precision}.putative_singletons.out"
            )
            with open(putative_singleton_file) as f:
                error_putative += len(f.readlines())
            putative_non_singleton_file = (
                f"result/check-{lib}-{motif}-{precision}.putative_non_singletons.out"
            )
            with open(putative_non_singleton_file) as f:
                error_putative += len(f.readlines())

        putative_relative_error_pct = error_putative / putative * 100
        putative_absolute_error_pct = error_putative / tot * 100

        result += txt4.format(**locals())

    return result


In [20]:
from IPython.display import Markdown
result = run("dinuc", "dinucleotide")
Markdown(result)

# Dinucleotide library analysis

The **dinucleotide** library consists of **166267** non-redundant fragments. 
Of these, **19.0** % are trivially complete: they come from 0.2 A clusters originating
from multiple PDBs.

Here, we will focus on the remaining non-trivial fragments instead.

## 0.5A precision analysis

Of the non-trivial fragments, **21.8 %** are alone in their cluster.
Another **0.08 %** are the hearts of clusters that come from a single PDB.
Together, these two categories are the certain singletons. 

Of the non-trivial fragments, **67.1 %** come from a different PDB than the heart of their cluster.
Another **10.85 %** are the hearts of clusters that come from multiple PDBs.
Together, these two categories are the certain non-singletons. 

All certain fragments together make up **99.82 %** of the non-trivial fragments.
Verification showed that **0** certain fragments were assigned in error.

The remaining **0.18 %** of the fragments come from the same PDB as their cluster heart, 
but are not a cluster heart themselves. 
(Note that a non-heart fragment can belong to multiple clusters; 
to be precise, all those cluster hearts come from the same PDB as the fragment).
If all members of the cluster (or all clusters) come from that PDB too, the fragment is designated
as a putative singleton; otherwise, as a putative non-singleton. 

Among the putative fragments, **39 (15.9 %)** are assigned incorrectly. 
However, since putative fragments are so rare, these errors are only **0.029 %** of all non-trivial fragments.


## 1.0A precision analysis

Of the non-trivial fragments, **3.3 %** are alone in their cluster.
Another **0.02 %** are the hearts of clusters that come from a single PDB.
Together, these two categories are the certain singletons. 

Of the non-trivial fragments, **91.4 %** come from a different PDB than the heart of their cluster.
Another **5.16 %** are the hearts of clusters that come from multiple PDBs.
Together, these two categories are the certain non-singletons. 

All certain fragments together make up **99.90 %** of the non-trivial fragments.
Verification showed that **0** certain fragments were assigned in error.

The remaining **0.10 %** of the fragments come from the same PDB as their cluster heart, 
but are not a cluster heart themselves. 
(Note that a non-heart fragment can belong to multiple clusters; 
to be precise, all those cluster hearts come from the same PDB as the fragment).
If all members of the cluster (or all clusters) come from that PDB too, the fragment is designated
as a putative singleton; otherwise, as a putative non-singleton. 

Among the putative fragments, **19 (13.8 %)** are assigned incorrectly. 
However, since putative fragments are so rare, these errors are only **0.014 %** of all non-trivial fragments.



In [21]:
from IPython.display import Markdown
result = run("trinuc", "trinucleotide")
Markdown(result)

# Trinucleotide library analysis

The **trinucleotide** library consists of **237879** non-redundant fragments. 
Of these, **15.3** % are trivially complete: they come from 0.2 A clusters originating
from multiple PDBs.

Here, we will focus on the remaining non-trivial fragments instead.

## 0.5A precision analysis

Of the non-trivial fragments, **26.7 %** are alone in their cluster.
Another **0.08 %** are the hearts of clusters that come from a single PDB.
Together, these two categories are the certain singletons. 

Of the non-trivial fragments, **61.5 %** come from a different PDB than the heart of their cluster.
Another **11.44 %** are the hearts of clusters that come from multiple PDBs.
Together, these two categories are the certain non-singletons. 

All certain fragments together make up **99.79 %** of the non-trivial fragments.
Verification showed that **0** certain fragments were assigned in error.

The remaining **0.21 %** of the fragments come from the same PDB as their cluster heart, 
but are not a cluster heart themselves. 
(Note that a non-heart fragment can belong to multiple clusters; 
to be precise, all those cluster hearts come from the same PDB as the fragment).
If all members of the cluster (or all clusters) come from that PDB too, the fragment is designated
as a putative singleton; otherwise, as a putative non-singleton. 

Among the putative fragments, **73 (16.9 %)** are assigned incorrectly. 
However, since putative fragments are so rare, these errors are only **0.036 %** of all non-trivial fragments.


## 1.0A precision analysis

Of the non-trivial fragments, **7.9 %** are alone in their cluster.
Another **0.04 %** are the hearts of clusters that come from a single PDB.
Together, these two categories are the certain singletons. 

Of the non-trivial fragments, **86.2 %** come from a different PDB than the heart of their cluster.
Another **5.70 %** are the hearts of clusters that come from multiple PDBs.
Together, these two categories are the certain non-singletons. 

All certain fragments together make up **99.85 %** of the non-trivial fragments.
Verification showed that **0** certain fragments were assigned in error.

The remaining **0.15 %** of the fragments come from the same PDB as their cluster heart, 
but are not a cluster heart themselves. 
(Note that a non-heart fragment can belong to multiple clusters; 
to be precise, all those cluster hearts come from the same PDB as the fragment).
If all members of the cluster (or all clusters) come from that PDB too, the fragment is designated
as a putative singleton; otherwise, as a putative non-singleton. 

Among the putative fragments, **30 (9.6 %)** are assigned incorrectly. 
However, since putative fragments are so rare, these errors are only **0.015 %** of all non-trivial fragments.

