Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

the indel table #27

Open
4 tasks done
thkuo opened this issue Aug 8, 2018 · 4 comments
Open
4 tasks done

the indel table #27

thkuo opened this issue Aug 8, 2018 · 4 comments
Labels
bug Something isn't working discussion Things potentially concerned or some relevant questions may be raised

Comments

@thkuo
Copy link
Owner

thkuo commented Aug 8, 2018

Problems to solve:

  • core_genes_50.txt
    implemented in makeGroupAln.py
  • currently not able to expand by genes
    solved with "dynamic" function of snakemake
  • vcf2indel.py
    keep Aaron's algorithm but make it communicate with snakemake more smoothly
  • generate_indel_features.py and roary_PA14_abricate.txt,
    Script rewritten and the intermediate file is no longer required, because the users need to provide one more file...

Strategy:

  • Understand the roles of the intermediate files
  • Rewrite the external scripts
  • Try to reduce the intermediate files
@thkuo thkuo added bug Something isn't working discussion Things potentially concerned or some relevant questions may be raised labels Aug 8, 2018
@thkuo
Copy link
Owner Author

thkuo commented Aug 10, 2018

[vcf2indel.py]
The current binary labels are ambiguous and can result in incorrect ML outcomes.
The 1s may mean insertions, deletions, missing values, or mixture of indels. For a certain indel feature, furthermore, is 1 1 1 0 0 equal to 0 0 0 1 1? Are two features that have same values redundant?
Key questions to answer:

  • Should the definition of indels be reference-based?
  • How to deal with the missing values (eg. the dots in ALT and GT fields)?

@thkuo
Copy link
Owner Author

thkuo commented Aug 11, 2018

In the four filters of vcf2indel.py,

case 1

not existing

if ref == ".": ...

case 2

majority-based assignment

elif alt == ".":...

case 3

mixture of majority-based and reference-based methods

elif len(sample_names.loc[samples == "."]) > sample_names.shape[0]/2:...

case 4:

reference-based

else:...

@thkuo
Copy link
Owner Author

thkuo commented Aug 14, 2018

Only update the connection with snakemake main script (for now)

@thkuo thkuo closed this as completed Aug 14, 2018
@thkuo thkuo reopened this Sep 4, 2018
@thkuo
Copy link
Owner Author

thkuo commented Sep 4, 2018

Even with the original methods, the indels were incorrectly detected in this case:

>CH2500
>F1659
>CH2502
ATGAGCCGCTTTGAAATCGCCTTTTCCGGCCAGTTGGTCGCCGGCGCCCGTCCCGAGGTG
GTCAAGGCCAACCTGGCCAAGCTGTTCCAGGCCGACGCGCAGCGTATCGAACTGCTGTTC
TCCGGCCGCCGGGTGGTGATCAAGAACAACCTCGATGCCGCCTCCGCGGAAAAATACCGC
AGCGTGCTGGAGCGAGCGGGAGCGATCGCCGTGGTCGCCGAGATGGAGGTCGAGGAGGTG
GTCATGGCGCCGCCGCCTGCGCAGACGACTCCCGTGGAGGCCCCGCAGACCCGCGCCGCT
ACTGGTACCAGCGCGCCCGCCGGACGCTTGCAGGTAGCGCCGCGGGACGGCTACATGGCG
GCGTTCGCCGAGGTCGATGCGCCGGATTTCGGCCTGGCTCCGGTAGGCGCCGACCTACAG
GACGCCAAGGCCGAAGCCGAGGCGCCGAAACTCGACCTGAGCCGCTTCAGCGTCGCCCCG
GTCGGTAGCGACATGGGCCAGGCACGCTCCGAGCCAGCGGCTCCGGCTCCGGACACCAGC
CACCTGCGCCTGCAGGACTGA
>CH2522
ATGAGCCGCTTTGAAATCGCCTTTTCCGGCCAGTTGGTCGCCGGCGCCCGTCCCGAGGTG
GTCAAGGCCAACCTGGCCAAGCTGTTCCAGGCCGACGCGCAGCGTATCGAACTGCTGTTC
TCCGGCCGCCGGGTGGTGATCAAGAACAACCTCGATGCCGCCTCCGCGGAAAAATACCGC
AGCGTGCTGGAGCGAGCGGGAGCGATCGCCGTGGTCGCCGAGATGGAGGTCGAGGAGGTG
GTCATGGCGCCGCCGCCTGCGCAGACGACTCCCGTGGAGGCCCCGCAGACCCGTGCCGCT
ACTGGTACCAGCGCGCCCGCCGGACGCTTGCAGGTAGCGCCGCGGGACGGCTACATGGCG
GCGTTCGCCGAGGTCGATGCGCCGGATTTCGGCCTGGCTCCGGTAGGCGCCGACCTACAG
GACGCCAAGGCCGAAGCCGAGGCGCCGAAACTCGACCTGAGCCGCTTCAGCGTCGCCCCG
GTCGGTAGCGACATGGGCCAGGCACGCTCCGAGCCAGCGGCTCCGGCTCCGGACACCAGC
CACCTGCGCCTGCAGGACTGA
>ESP088
ATGAGCCGCTTTGAAATCGCCTTTTCCGGCCAGTTGGTCGCCGGCGCCCGTCCCGAGGTG
GTCAAGGCCAACCTGGCCAAGCTGTTCCAGGCCGACGCGCAGCGTATCGAACTGCTGTTC
TCCGGCCGCCGGGTGGTGATCAAGAACAACCTCGATGCCGCCTCCGCGGAAAAATACCGC
AGCGTGCTGGAGCGAGCGGGAGCGATCGCCGTGGTCGCCGAGATGGAGGTCGAGGAGGTG
GTCATGGCGCCGCCGCCTGCGCAGACGACTCCCGTGGAGGCCCCGCAGACCCGCGCCGCT
ACTGGTACCAGCGCGCCCGCCGGACGCTTGCAGGTAGCGCCGCGGGACGGCTACATGGCG
GCGTTCGCCGAGGTCGATGCGCCGGATTTCGGCCTGGCTCCGGTAGGCGCCGACCTACAG
GACGCCAAGGCCGAAGCCGAGGCGCCGAAACTCGACCTGAGCCGCTTCAGCGTCGCCCCG
GTCGGTAGCGACATGGGCCAGGCACGCTCCGAGCCAGCGGCTCCGGCTCCGGACACCAGC
CACCTGCGCCTGCAGGACTGA
>MHH15083
ATGAGCCGCTTTGAAATCGCCTTTTCCGGCCAGTTGGTCGCCGGCGCCCGTCCCGAGGTG
GTCAAGGCCAACCTGGCCAAGCTGTTCCAGGCCGACGCGCAGCGTATCGAACTGCTGTTC
TCCGGCCGCCGGGTGGTGATCAAGAACAACCTCGATGCCGCCTCCGCGGAAAAATACCGC
AGCGTGCTGGAGCGAGCGGGAGCGATCGCCGTGGTCGCCGAGATGGAGGTCGAGGAGGTG
GTCATGGCGCCGCCGCCTGCGCAGACGACTCCCGTGGAGGCCCCGCAGACCCGCGCCGCT
ACTGGTACCAGCGCGCCCGCCGGACGCTTGCAGGTAGCGCCGCGGGACGGCTACATGGCG
GCGTTCGCCGAGGTCGATGCGCCGGATTTCGGCCTGGCTCCGGTAGGCGCCGACCTACAG
GACGCCAAGGCCGAAGCCGAGGCGCCGAAACTCGACCTGAGCCGCTTCAGCGTCGCCCCG
GTCGGTAGCGACATGGGCCAGGCACGCTCCGAGCCAGCGGCTCCGGCTCCGGACACCAGC
CACCTGCGCCTGCAGGACTGA

and the vcf outcome was:

##fileformat=VCFv4.2
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth.">
##contig=<ID=chrUn,length=561>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	CH2500	CH2502	CH2522	ESP088	F1659	MHH15083
chrUn	0	.	ATGAGCCGCTTTGAAATCGCCTTTTCCGGCCAGTTGGTCGCCGGCGCCCGTCCCGAGGTGGTCAAGGCCAACCTGGCCAAGCTGTTCCAGGCCGACGCGCAGCGTATCGAACTGCTGTTCTCCGGCCGCCGGGTGGTGATCAAGAACAACCTCGATGCCGCCTCCGCGGAAAAATACCGCAGCGTGCTGGAGCGAGCGGGAGCGATCGCCGTGGTCGCCGAGATGGAGGTCGAGGAGGTGGTCATGGCGCCGCCGCCTGCGCAGACGACTCCCGTGGAGGCCCCGCAGACCCGCGCCGCTACTGGTACCAGCGCGCCCGCCGGACGCTTGCAGGTAGCGCCGCGGGACGGCTACATGGCGGCGTTCGCCGAGGTCGATGCGCCGGATTTCGGCCTGGCTCCGGTAGGCGCCGACCTACAGGACGCCAAGGCCGAAGCCGAGGCGCCGAAACTCGACCTGAGCCGCTTCAGCGTCGCCCCGGTCGGTAGCGACATGGGCCAGGCACGCTCCGAGCCAGCGGCTCCGGCTCCGGACACCAGCCACCTGCGCCTGCAGGACTGA	ATGAGCCGCTTTGAAATCGCCTTTTCCGGCCAGTTGGTCGCCGGCGCCCGTCCCGAGGTGGTCAAGGCCAACCTGGCCAAGCTGTTCCAGGCCGACGCGCAGCGTATCGAACTGCTGTTCTCCGGCCGCCGGGTGGTGATCAAGAACAACCTCGATGCCGCCTCCGCGGAAAAATACCGCAGCGTGCTGGAGCGAGCGGGAGCGATCGCCGTGGTCGCCGAGATGGAGGTCGAGGAGGTGGTCATGGCGCCGCCGCCTGCGCAGACGACTCCCGTGGAGGCCCCGCAGACCCGTGCCGCTACTGGTACCAGCGCGCCCGCCGGACGCTTGCAGGTAGCGCCGCGGGACGGCTACATGGCGGCGTTCGCCGAGGTCGATGCGCCGGATTTCGGCCTGGCTCCGGTAGGCGCCGACCTACAGGACGCCAAGGCCGAAGCCGAGGCGCCGAAACTCGACCTGAGCCGCTTCAGCGTCGCCCCGGTCGGTAGCGACATGGGCCAGGCACGCTCCGAGCCAGCGGCTCCGGCTCCGGACACCAGCCACCTGCGCCTGCAGGACTGA	.	.	DP=4	GT:DP	./.	0/0:1	1/1:1	0/0:1	./.	0/0:1

The gene was lost in two strains, but it wasn't detected by the old method. It could be listed in the gpa table though...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working discussion Things potentially concerned or some relevant questions may be raised
Projects
None yet
Development

No branches or pull requests

1 participant