Investigate/explain size vs compressed size in FORMAT cols #48

Closed
jeromekelleher opened this issue Feb 29, 2024 · 2 comments · Fixed by #76

Comments

@jeromekelleher
Contributor

For 1000 Genomes chr20 we get this:

name                      type       chunks  size       compressed      max_n  min_val    max_val                                                                                             
------------------------  -------  --------  ---------  ------------  -------  ---------  ---------
CHROM                     String        160  1.25 MB    138.06 KB           1  n/a        n/a                                                                                                 
POS                       Integer       160  606.64 KB  4.43 MB             1  6e+04      6.4e+07                                                                                             
QUAL                      Float         160  606.64 KB  11.33 MB            1  30         1.2e+08
ID                        String        160  16 KB      9.6 KB              0  n/a        n/a                                                                                                 
FILTERS                   String        160  1.41 MB    739.29 KB           1  n/a        n/a                                                                                                 
REF                       String        160  1.32 MB    3.57 MB             1  n/a        n/a                                                                                                 
ALT                       String        160  1.91 MB    4.82 MB             6  n/a        n/a                                                                                                 
FORMAT/AB                 Float        1097  178.45 MB  584.71 MB           1  0.04       0.95                                                                                                
FORMAT/AD                 Integer      2496  466.49 MB  7.78 GB             7  0          5.8e+03                                                                                             
FORMAT/DP                 Integer      1224  204.99 MB  6.08 GB             1  0          5.8e+03                                                                                             
FORMAT/GQ                 Integer      1224  204.99 MB  5.25 GB             1  0          99                                                                                                  
FORMAT/GT                 Integer      1788  307.2 MB   417.48 MB           3  -1         6                                                                                                   
FORMAT/MIN_DP             Integer       160  16 KB      9.6 KB              0  n/a        n/a                                                                                                 
FORMAT/MQ0                Integer       160  16 KB      9.6 KB              0  n/a        n/a                                                                                                 
FORMAT/PGT                String        849  46.13 MB   100.24 MB           1  n/a        n/a                                                                                                 
FORMAT/PID                String        849  59.74 MB   129.69 MB           1  n/a        n/a                                                                                                 
FORMAT/PL                 Integer      3992  807.39 MB  18.42 GB           28  0          1.7e+05                                                                                             
FORMAT/RGQ                Integer       160  16 KB      9.6 KB              0  n/a        n/a                                                                                                 
FORMAT/SB                 Integer       160  16 KB      9.6 KB              0  n/a        n/a          

Note that columns such as FORMAT/DP are reported as much larger compressed than uncompressed (e.g. FORMAT/DP: 204.99 MB size vs 6.08 GB compressed). Investigate.
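
A quick way to list the affected columns is to parse the human-readable sizes and compare them. This is only a sketch: `parse_size` and `UNITS` are hypothetical helpers, not part of the tool that produced the table, and decimal units (1 MB = 1000^2 bytes) are an assumption. Only a few rows from the table are copied in.

```python
# Hypothetical helper: not part of the tool that produced the table above.
# Assumption: the table uses decimal units (1 KB = 1000 bytes).
UNITS = {"bytes": 1, "KB": 1000, "MB": 1000**2, "GB": 1000**3}


def parse_size(text):
    """Convert a string such as '204.99 MB' into a number of bytes."""
    value, unit = text.split()
    return float(value) * UNITS[unit]


# (size, compressed) pairs copied from a few rows of the summary table.
columns = {
    "CHROM": ("1.25 MB", "138.06 KB"),
    "FORMAT/DP": ("204.99 MB", "6.08 GB"),
    "FORMAT/GQ": ("204.99 MB", "5.25 GB"),
    "FORMAT/PL": ("807.39 MB", "18.42 GB"),
}

# Keep only the columns whose compressed size exceeds the reported size.
suspicious = {
    name: parse_size(compressed) / parse_size(size)
    for name, (size, compressed) in columns.items()
    if parse_size(compressed) > parse_size(size)
}

for name, ratio in suspicious.items():
    print(f"{name}: compressed is {ratio:.1f}x the reported size")
```

A ratio well above 1 does not by itself say which of the two numbers is wrong; it only marks the rows worth checking.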

@jeromekelleher
Contributor Author

Note that #49 changed the size column to report the size of the raw values rather than the pickled values. Worth looking at this difference too.
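
To make the raw-versus-pickled distinction concrete, here is a minimal sketch using only the standard library. The values are made up, `zlib` is a stand-in for whatever codec is actually used, and nothing here is taken from the project's code. It only shows that the "size" of the same values depends on which representation is measured.

```python
import pickle
import zlib
from array import array

# Illustrative values only; not data from the 1000 Genomes run.
values = [i % 100 for i in range(10_000)]

# "Raw" size: fixed-width C ints (typically 4 bytes each), no serialisation overhead.
raw = array("i", values)
raw_size = raw.itemsize * len(raw)

# "Pickled" size: the same values serialised as a Python list.
pickled = pickle.dumps(values)
pickled_size = len(pickled)

# Compressed size of the pickled bytes (zlib is an assumed stand-in codec).
compressed_size = len(zlib.compress(pickled))

print(f"raw={raw_size} pickled={pickled_size} compressed={compressed_size}")
```

If the size and compressed columns are computed from different representations, or over different subsets of the data, their ratio stops being a meaningful compression ratio.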

@benjeffery
Contributor

Something odd here: for a small subsample of the GeL VCFs I get compressed << uncompressed, but for the full set the opposite is the case.
