sig describe and sig summarize differ on zip files with duplicate manifest entries #2774

Open
ctb opened this issue Sep 22, 2023 · 0 comments
ctb commented Sep 22, 2023

per #2748 (comment),

if you do something simple like running sketch twice with the same output zip file:

sourmash sketch dna fastq/0.fa -o 0.zip
sourmash sketch dna fastq/0.fa -o 0.zip

you will get different results from sig summarize and sig describe: summarize shows 2 signatures,

% sourmash sig summarize 0.zip

== This is sourmash version 4.8.4.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from '0.zip'
path filetype: ZipFileLinearIndex
location: /home/ctbrown/2023-ccbaumler-debug/0.zip
is database? yes
has manifest? yes
num signatures: 2
** examining manifest...
total hashes: 8336
summary of sketches:
   2 sketches with DNA, k=31, scaled=1000             8336 total hashes

but describe only shows 1:

% sourmash sig describe 0.zip

== This is sourmash version 4.8.4.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

---
signature filename: /home/ctbrown/2023-ccbaumler-debug/0.zip
signature: ** no name **
source file: fastq/0.fa
md5: 324074c7287ed934af4fd0a6a459aa30
k=31 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=0
size: 4168
sum hashes: 4168
signature license: CC0

loaded 1 signatures total, from 1 files

There is indeed only one sketch in the zip file,

% unzip -v 0.zip
Archive:  0.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
   34804  Stored    34804   0% 2023-09-22 06:27 e5e9328b  signatures/324074c7287ed934af4fd0a6a459aa30.sig.gz
     386  Stored      386   0% 2023-09-22 06:27 b9dbdae2  SOURMASH-MANIFEST.csv
--------          -------  ---                            -------
   35190            35190   0%                            2 files

but there are two entries in the manifest, both pointing at the same internal_location (they differ only in with_abundance, False vs 0):

% cat SOURMASH-MANIFEST.csv 
# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
signatures/324074c7287ed934af4fd0a6a459aa30.sig.gz,324074c7287ed934af4fd0a6a459aa30,324074c7,31,DNA,0,1000,4168,False,,fastq/0.fa
signatures/324074c7287ed934af4fd0a6a459aa30.sig.gz,324074c7287ed934af4fd0a6a459aa30,324074c7,31,DNA,0,1000,4168,0,,fastq/0.fa
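
A mismatch like this can be checked mechanically by comparing the manifest rows against the files actually stored in the archive. The following is a minimal sketch using only the Python standard library, not sourmash's own API. The function name `check_manifest` is hypothetical, and it assumes the manifest layout shown above: a leading `#` comment line, then a CSV header that includes an `internal_location` column.

```python
import csv
import io
import zipfile
from collections import Counter

MANIFEST_NAME = "SOURMASH-MANIFEST.csv"


def check_manifest(zip_path):
    """Compare manifest rows in a zip against the files stored in it.

    Hypothetical helper for illustration; not part of sourmash.
    """
    with zipfile.ZipFile(zip_path) as zf:
        # Everything except the manifest itself and directory entries.
        members = {
            name
            for name in zf.namelist()
            if name != MANIFEST_NAME and not name.endswith("/")
        }
        text = zf.read(MANIFEST_NAME).decode("utf-8")

    # Skip comment lines such as '# SOURMASH-MANIFEST-VERSION: 1.0'.
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    rows = list(csv.DictReader(io.StringIO("\n".join(lines))))

    # Count how many manifest rows point at each stored path.
    counts = Counter(row["internal_location"] for row in rows)
    return {
        "manifest_rows": len(rows),
        "stored_signatures": len(members),
        "duplicated": sorted(loc for loc, n in counts.items() if n > 1),
        "missing": sorted(set(counts) - members),
    }
```

Calling `check_manifest("0.zip")` on an archive like the one above should report more manifest rows than stored signatures and list the repeated `internal_location` under `duplicated`.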

so this looks like another case of slightly pathological manifest misrepresentation, sigh.

It is weird to me that the zip file saving code is ADDING a manifest entry but NOT adding a second signature file too 😭
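
One way to keep the manifest and the archive contents in step is to treat the stored path as the key and add a manifest row only when a new file is actually written. The sketch below illustrates that idea with the standard library `zipfile` module. It is not sourmash's saving code, the name `add_signature` is hypothetical, and it only covers a single open archive with the manifest rows held in memory (it does not handle reloading an existing manifest on a second run). Whether skipping the duplicate is the right behavior, as opposed to storing both, is a design decision this sketch does not settle.

```python
import io
import zipfile


def add_signature(zf, manifest_rows, internal_location, payload, row):
    """Write a signature file and its manifest row together, or neither.

    Hypothetical helper for illustration; not part of sourmash.
    Returns True if the signature was added, False if it was already stored.
    """
    if internal_location in zf.namelist():
        # Same path already stored: skip the file AND the manifest row.
        return False
    zf.writestr(internal_location, payload)
    manifest_rows.append(row)
    return True


# Usage: adding the same signature twice leaves one file and one row.
buf = io.BytesIO()
rows = []
loc = "signatures/324074c7287ed934af4fd0a6a459aa30.sig.gz"
with zipfile.ZipFile(buf, "w") as zf:
    first = add_signature(zf, rows, loc, b"placeholder", {"internal_location": loc})
    second = add_signature(zf, rows, loc, b"placeholder", {"internal_location": loc})
    stored = zf.namelist()
```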

cc #2749 #1849 #1837
