Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ctg_hex_pos and ctg_hex_dist is buggy when the contig has indels #10

Closed
zyxue opened this issue Jul 14, 2018 · 4 comments
Closed

ctg_hex_pos and ctg_hex_dist is buggy when the contig has indels #10

zyxue opened this issue Jul 14, 2018 · 4 comments
Labels
bug Something isn't working high priority

Comments

@zyxue
Copy link
Owner

zyxue commented Jul 14, 2018

This is because the coordinate information is currently lost when the sequence is extracted from contig.

Currently, this search function is used to search for hexamer in the contig, it only takes into account the extracted sequence with cigar information missing.

def search(strand, clv, seq, window=50):

Maybe it's easier to define the searching window (e.g. 50bp) wst. to the reference, the actual sequence length wst. contig could be a few bp more or less than 50bp.

@zyxue zyxue added bug Something isn't working high priority labels Jul 14, 2018
@zyxue
Copy link
Owner Author

zyxue commented Jul 14, 2018

Add an example

index seqname strand clv aclv gene_name gene_id signed_dist_to_aclv evidence_type contig_ids max_contig_len max_contig_mapq any_contig_is_hardclipped num_suffix_contigs num_bridge_contigs num_link_contigs num_blank_contigs num_total_contigs num_suffix_reads max_suffix_contig_tail_len num_bridge_reads max_bridge_read_tail_len num_link_reads ctg_hex ctg_hex_id ctg_hex_pos ctg_hex_dist ref_hex ref_hex_id ref_hex_pos ref_hex_dist
37378 chr1 + 204489245 204501779 MDM4 ENSG00000198625 -12534 bridge A0.R100625 3334 40 True 0 1 0 0 1 0 0 6 4 0 ACTAAA 6 204489235 10 ACTAAA 6 204489238 7
37379 chr1 + 204489253 204501779 MDM4 ENSG00000198625 -12526 bridge A0.R100625 3334 40 True 0 1 0 0 1 0 0 35 6 0 AATACA 9 204489241 12 AATACA 9 204489244 9
46052 chr7 - 109555254 109599274 EIF3IP1 ENSG00000237064 -44020 bridge A0.S61128 68 1 True 0 1 0 0 1 0 0 1 6 0 TTTAAA 3 109555268 14 AATAAA 16 109555282 28

ctg_hex_dist and ref_hex_dist don't match because of 3bp insertion in A0.R100625

zyxue added a commit that referenced this issue Jul 15, 2018
A temporary compromise, ideally use 'ctg_hex_dist', but it's still buggy see #10
zyxue added a commit that referenced this issue Jul 15, 2018
…r + strand #10

The actually seq length wst. contig may be shorter or longer than window,
depending on the indels on the contig
zyxue added a commit that referenced this issue Jul 15, 2018
@zyxue
Copy link
Owner Author

zyxue commented Jul 15, 2018

Actually, ctg_hex_dist could be correct, it's just the ctg_hex_pos is not consistent with genomics coordinates and should be dropped.

@zyxue
Copy link
Owner Author

zyxue commented Jul 15, 2018

Insight: It's that ctg_hex_pos isn't interpretable wst. genome coordinate. But ctg_hex_dist is probably valid.

@zyxue
Copy link
Owner Author

zyxue commented Jul 16, 2018

seqname strand clv aclv gene_name gene_id signed_dist_to_aclv evidence_type contig_ids max_contig_len max_contig_mapq any_contig_is_hardclipped num_suffix_contigs num_bridge_contigs num_link_contigs num_blank_contigs num_total_contigs num_suffix_reads max_suffix_contig_tail_len num_bridge_reads max_bridge_read_tail_len num_link_reads ctg_hex ctg_hex_id ctg_hex_pos ctg_hex_dist ref_hex ref_hex_id ref_hex_pos ref_hex_dist
39789 chr1 + 204489245 204501779 MDM4 ENSG00000198625 -12534 bridge A0.R100625 3334 40 True 0 1 0 0 1 0 0 6 4 0 ACTAAA 6 204489235 10 ACTAAA 6 204489238 7
39790 chr1 + 204489253 204501779 MDM4 ENSG00000198625 -12526 bridge A0.R100625 3334 40 True 0 1 0 0 1 0 0 35 6 0 AATACA 9 204489241 12 AATACA 9 204489244 9
46413 chr7 - 109555254 109599274 EIF3IP1 ENSG00000237064 -44020 bridge A0.S61128 68 1 True 0 1 0 0 1 0 0 1 6 0 AATATA 10 109555301 47 AATAAA 16 109555282 28

Happy to see the PAS hexamers finally match between contig and reference in the third case.

Conclusion:
ctg_hex_pos is not interpretable in terms of reference genome, but decided to leave it here for reference as a sanity check/indication of insertion, so is the difference between ctg_hex_dist and ref_hex_dist.

ctg_hex_pos may become useful when the ref_hex is not found.

@zyxue zyxue closed this as completed Jul 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working high priority
Projects
None yet
Development

No branches or pull requests

1 participant