[MRG] Update sbt_gather output formats #175

Merged: 18 commits into master from update/gather_out on May 16, 2017

ctb commented Apr 17, 2017

Fixes #169 (align stdout and --output for sbt_gather) and #170 (label columns for sbt_gather).

Removes --csv and replaces it with --output, which now outputs CSV format.

  • Is it mergeable?
  • make test: Did it pass the tests?
  • make coverage: Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

ctb commented Apr 17, 2017

The new output format is:

./sourmash sbt_gather gcf GCF_000783305.1_ASM78330v1_genomic.fna.gz.sig -o xxx 
# running sourmash subcommand: sbt_gather
loaded query: data/GCF_000783305.1_ASM78330v... (k=31, DNA)
query signature has max_hash: 1844674407370955

(column 1: fraction of query / column 2: fraction of discovered genome)
found: 1.00 1.000 data/GCF_000783305.1_ASM78330v1_genomic.fna.gz

found 1 matches total;
the recovered matches hit 100.0% of the query

Final composition (sorted by fraction of original query):

f_orig_query f_found_genome
1.00 1.00 data/GCF_000783305.1_ASM78330v1_genomic.fna.gz

and then 'xxx' is a CSV file containing:

f_orig_query,f_found_genome,name
1.0,1.0,data/GCF_000783305.1_ASM78330v1_genomic.fna.gz

@taylorreiter does this look OK?
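
For anyone scripting against the new --output file, here is a minimal sketch of reading it with the Python standard library. It assumes only the three column names shown above; the helper name read_gather_csv is hypothetical and is not part of sourmash.

```python
import csv
import io


def read_gather_csv(fp):
    """Parse gather CSV rows into dicts, converting the fractions to floats.

    Hypothetical helper for illustration; not a sourmash function.
    """
    rows = []
    for row in csv.DictReader(fp):
        rows.append({
            "f_orig_query": float(row["f_orig_query"]),
            "f_found_genome": float(row["f_found_genome"]),
            "name": row["name"],
        })
    return rows


# Example input copied from the CSV contents shown above.
example = io.StringIO(
    "f_orig_query,f_found_genome,name\n"
    "1.0,1.0,data/GCF_000783305.1_ASM78330v1_genomic.fna.gz\n"
)
rows = read_gather_csv(example)
```

Reading by column name rather than by position should keep such scripts working if columns are later reordered or added.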

@taylorreiter

It's everything I ever hoped and dreamed it would be.

@luizirber

I don't know how to react to @taylorreiter's optimism; this never happens on open source projects...

codecov-io commented Apr 17, 2017

Codecov Report

Merging #175 into master will increase coverage by <.01%.
The diff coverage is 80.43%.

@@            Coverage Diff             @@
##           master     #175      +/-   ##
==========================================
+ Coverage   85.78%   85.78%   +<.01%     
==========================================
  Files          13       13              
  Lines        1934     1942       +8     
  Branches       52       52              
==========================================
+ Hits         1659     1666       +7     
- Misses        265      266       +1     
  Partials       10       10

Impacted Files                   Coverage Δ
sourmash_lib/commands.py         90.56% <80.43%> (-0.19%) ⬇️
sourmash_lib/signature_json.py   95.36% <0%> (+0.66%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e996f7a...dc230ce.

ctb commented Apr 19, 2017

Note to self: I think this just needs some tests, is all.

@ctb changed the title from "Update sbt_gather output formats" to "[WIP] Update sbt_gather output formats" on Apr 19, 2017

ctb commented Apr 23, 2017

Note that 'total' no longer reflects what is printed out, sigh.

ctb commented Apr 24, 2017

@taylorreiter @brooksph @halexand how does this look?

# running sourmash subcommand: sbt_gather
loaded query: -... (k=21, DNA)

overlap    p_query p_genome
-------    ------- --------
9.4 Mbp    4.9%    100.0%      Burkholderia1_xenovorans_LB400_chromosom
6.6 Mbp    3.5%    100.0%      Rhodopirellula_baltica_SH_1_complete_gen
6.4 Mbp    3.3%    100.0%      Herpetosiphon_aurantiacus_ATCC_23779
6.1 Mbp    3.2%    100.0%      Nostoc_sp._PCC_7120_DNA
6.1 Mbp    3.2%    100.0%      Bacteroides_thetaiotaomicron_VPI-5482
5.5 Mbp    2.9%    100.0%      Salinispora_arenicola_CNS-205
5.3 Mbp    2.8%    100.0%      Methanosarcina_acetivorans_str._C2A
5.1 Mbp    2.7%     99.8%      Bordetella_bronchiseptica_strain_RB50
5.0 Mbp    2.6%    100.0%      Gemmatimonas_aurantiaca_T-27_DNA
4.8 Mbp    2.5%    100.0%      Chloroflexus_aurantiacus_J-10-fl
4.8 Mbp    2.5%     99.2%      Leptothrix_cholodnii_SP-6
4.8 Mbp    2.5%    100.0%      Shewanella_baltica_OS223, complete genom
4.5 Mbp    2.4%     98.5%      Bacteroides_vulgatus_ATCC_8482
5.0 Mbp    2.6%     89.9%      Salinispora_tropica_CNB-440
4.1 Mbp    2.1%    100.0%      Acidobacterium_capsulatum_ATCC_51196
4.0 Mbp    2.1%     99.5%      Ruegeria_pomeroyi_DSS-3
3.9 Mbp    2.1%     99.7%      Clostridium_thermocellum_ATCC_27405
3.9 Mbp    2.0%    100.0%      Geobacter_sulfurreducens_PCA
3.4 Mbp    1.8%    100.0%      Desulfovibrio_vulgaris_DP4
3.4 Mbp    1.8%     98.5%      Sulfitobacter_NAS-14.1_scf_1099451320477
3.2 Mbp    1.7%    100.0%      Deinococcus_radiodurans_R1_chromosome_1,
3.0 Mbp    1.6%    100.0%      Sulfolobus_tokodaii str. 7 DNA, complete
3.0 Mbp    1.6%     99.7%      Enterococcus_faecalis_V583
3.0 Mbp    1.6%     99.7%      Treponema_denticola_ATCC_35405
2.9 Mbp    1.5%    100.0%      Haloferax_volcanii_DS2
2.8 Mbp    1.5%    100.0%      Pelodictyon_phaeoclathratiforme_BU-1
2.7 Mbp    1.4%    100.0%      Caldisaccharolyticus_DSM_8903
2.7 Mbp    1.4%     98.5%      Chlorobiumphaeobacteroides_DSM_266
2.7 Mbp    1.4%    100.0%      Akkermansia_muciniphila_ATCC_BAA-835
2.7 Mbp    1.4%     98.5%      Desulfovibrio_piger_ATCC_29098
2.6 Mbp    1.4%     97.7%      Chlorobiumlimicola_DSM_245
2.5 Mbp    1.3%     99.6%      Nitrosomonas_europaea_ATCC_19718
2.5 Mbp    1.3%    100.0%      Archaeoglobus_fulgidus_DSM_4304
2.4 Mbp    1.2%     99.6%      Thermoanaerobacter_pseudethanolicus_ATCC
2.5 Mbp    1.3%     91.3%      Caldibescii_DSM_6725
2.3 Mbp    1.2%    100.0%      Pyrobaculum_arsenaticum_DSM_13514
2.2 Mbp    1.2%     98.7%      Porphyromonas_gingivalis_ATCC_33277_DNA
2.2 Mbp    1.1%     99.5%      Pyrobaculum_aerophilum_str._IM2
2.1 Mbp    1.1%    100.0%      Fusobacterium_nucleatum_subsp._nucleatum
2.1 Mbp    1.1%     98.6%      Chlorobiumtepidum_TLS
2.1 Mbp    1.1%    100.0%      Wolinella_succinogenes_DSM_1740
4.6 Mbp    2.4%     43.8%      Shewanella_baltica_OS185
2.0 Mbp    1.0%    100.0%      Persephonella_marina_EX-H1
1.9 Mbp    1.0%    100.0%      Pyrobaculum_calidifontis_JCM_11548
1.9 Mbp    1.0%    100.0%      Pyrococcus_furiosus_DSM_3638
1.9 Mbp    1.0%    100.0%      Thermotoga_petrophila_RKU-1
1.9 Mbp    1.0%    100.0%      Zymomonas_mobilis_subsp._mobilis_ZM4
1.8 Mbp    0.9%     98.9%      Pyrococcus_horikoshii_OT3_DNA
1.8 Mbp    0.9%    100.0%      Dictyoglomus_turgidum_DSM_6724
1.8 Mbp    0.9%    100.0%      Methanopyrus_kandleri_AV19
1.9 Mbp    1.0%     93.1%      Thermotoga_neapolitana_DSM_4359
1.8 Mbp    0.9%     98.9%      Chlorobiumphaeovibrioides_DSM_265
1.7 Mbp    0.9%     99.4%      Thermus_thermophilus_HB27
1.7 Mbp    0.9%     99.4%      SulfuriYO3AOP1
1.7 Mbp    0.9%    100.0%      Methanococcus_maripaludis_strain_S2,_com
1.6 Mbp    0.9%     98.8%      Hydrogenobaculum_sp._Y04AAS1
1.4 Mbp    0.7%    100.0%      Aciduliprofundum_boonei_T469
1.4 Mbp    0.7%    100.0%      Methanocaldococcus_jannaschii_DSM_2661
1.3 Mbp    0.7%    100.0%      Ignicoccus_hospitalis_KIN4/I
3.0 Mbp    1.6%     41.0%      Sulfitobacter_sp._EE-36_scf_109945131800
1.4 Mbp    0.7%     86.3%      Methanococcus_maripaludis_C5
1.6 Mbp    0.8%     66.9%      Sulfurihydrogenibium_yellowstonense_SS-5
1.8 Mbp    1.0%     33.9%      Thermotoga_sp._RQ2
0.6 Mbp    0.3%    100.0%      Nanoarchaeum_equitans_Kin4-M

found 64 matches total;
the recovered matches hit 100.0% of the query

taylorreiter commented Apr 24, 2017

Very readable.

The Mbp makes the interpretation of the percentage more intuitive.
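
To make the layout above concrete, here is a rough sketch of how one row of that table could be rendered. The function names format_bp and format_row, the unit cut-offs, and the exact column widths are assumptions for illustration, not the code in this branch; the 40-character name truncation matches what the listing above shows.

```python
def format_bp(bp):
    """Render a base-pair count with a human-readable unit (illustrative cut-offs)."""
    bp = float(bp)
    if bp < 500:
        return "{:.0f} bp".format(bp)
    if bp < 500e3:
        return "{:.1f} kbp".format(bp / 1e3)
    if bp < 500e6:
        return "{:.1f} Mbp".format(bp / 1e6)
    return "{:.1f} Gbp".format(bp / 1e9)


def format_row(overlap_bp, f_query, f_genome, name, name_width=40):
    """Format one result row: overlap, fraction of query, fraction of genome, name."""
    return "{:<10} {:>6.1%}  {:>6.1%}      {}".format(
        format_bp(overlap_bp), f_query, f_genome, name[:name_width]
    )


row = format_row(9.4e6, 0.049, 1.0, "Burkholderia1_xenovorans_LB400_chromosome_1")
```

Keeping the unit conversion in its own function means the same formatting can be reused if the CSV output ever needs a human-readable column.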

ctb commented Apr 24, 2017

Also note the new argument --threshold-bp, which sets the search threshold in base pairs.
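
One way to picture a threshold expressed in base pairs is sketched below. It assumes that each shared hash stands in for roughly `scaled` base pairs, so overlap in bp is estimated as shared hashes times `scaled`. Both function names and that estimate are assumptions for illustration; this thread does not show how the branch actually applies the threshold.

```python
def estimated_overlap_bp(n_shared_hashes, scaled):
    """Estimate overlap in base pairs, assuming ~`scaled` bp per retained hash."""
    return n_shared_hashes * scaled


def passes_threshold(n_shared_hashes, scaled, threshold_bp):
    """Return True if the estimated overlap reaches the bp threshold."""
    return estimated_overlap_bp(n_shared_hashes, scaled) >= threshold_bp


# Illustrative numbers only: 5 shared hashes at scaled=10000 is ~50 kbp.
keep = passes_threshold(5, 10000, 50000)
drop = passes_threshold(4, 10000, 50000)
```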

ctb commented Apr 26, 2017

oddity in output:

# running sourmash subcommand: sbt_gather
loaded query: ../data/acido.fa.gz... (k=31, DNA)

overlap    p_query p_match
-------    ------- --------
4.2 Mbp  100.0%    100.0%      1403

found 1 matches total;
the recovered matches hit 10.2% of the query

ctb commented May 14, 2017

The oddity above is fixed in this branch.

@ctb changed the title from "[WIP] Update sbt_gather output formats" to "[MRG] Update sbt_gather output formats" on May 14, 2017

ctb commented May 14, 2017

Ready for review @luizirber @betatim

ctb commented May 14, 2017

Hmm, might be a good idea to sort by bp or something.
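
A sketch of what sorting by bp could look like, using three rows from the 64-match listing earlier in this thread (which appear there in the order 4.5, 5.0, 4.1 Mbp). The tuple layout is an assumption for illustration; the branch may store results differently.

```python
from operator import itemgetter

# (overlap in bp, name), in the order the rows appear in the listing above.
results = [
    (4.5e6, "Bacteroides_vulgatus_ATCC_8482"),
    (5.0e6, "Salinispora_tropica_CNB-440"),
    (4.1e6, "Acidobacterium_capsulatum_ATCC_51196"),
]

# Largest overlap first.
by_overlap = sorted(results, key=itemgetter(0), reverse=True)
names = [name for _, name in by_overlap]
```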

@@ -15,3 +15,11 @@

DEFAULT_SEED = get_minhash_default_seed()
MAX_HASH = get_minhash_max_hash()

def scaled_to_max_hash(scaled):
Contributor

not used in this PR?

Contributor Author

good catch - removed!
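
For context on what a helper like the one in this hunk might compute, here is a minimal sketch. It assumes a 64-bit hash space and a floor division, and the name scaled_to_max_hash_sketch is hypothetical; the body of the removed function is not shown in this thread. With scaled=10000 (an assumed value), the sketch reproduces the max_hash of 1844674407370955 printed in the first example output.

```python
# Assumption: hashes are drawn from a 64-bit space; the real constant may differ.
HASH_SPACE = 2 ** 64


def scaled_to_max_hash_sketch(scaled):
    """Return the largest hash kept when retaining about 1/scaled of the hash space."""
    if scaled <= 0:
        raise ValueError("scaled must be a positive integer")
    return HASH_SPACE // scaled


max_hash = scaled_to_max_hash_sketch(10000)
```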

betatim commented May 16, 2017

👍 maybe a second look from @luizirber

@luizirber left a comment

👍

@luizirber merged commit 41e15fb into master on May 16, 2017
@luizirber deleted the update/gather_out branch on May 16, 2017 21:45
@ctb mentioned this pull request on May 18, 2017
@luizirber added this to Done in sourmash 2.0 on May 19, 2017
@luizirber removed this from Done in sourmash 2.0 on May 19, 2017