Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

whatshap stats - Calculate all length statistics on split blocks #412

Merged
merged 3 commits into from Nov 11, 2022

Conversation

pontushojer
Copy link
Collaborator

fix #400

Fixes inflated stats for block lengths in whatshap stats if there are overlapping blocks, also see #384.

I tested on chromosome SUPER_3 (116801625 bp) in the VCF referenced in issue #400 where there were a lot of overlapping blocks.
Before

      Sum of lengths: 1036875048    bp
 Median block length:        493.00 bp
Average block length:      16018.71 bp
       Longest block:   75095876    bp
      Shortest block:          1    bp
          Block NG50:          0    bp

After

      Sum of lengths: 48207739    bp
 Median block length:      492.00 bp
Average block length:      744.53 bp
       Longest block:    15389    bp
      Shortest block:        1    bp
          Block NG50:        0    bp

As you can see there is a drastic reduction in Sum of lengths, Average block length and Longest block.

This PR also a removes an additional call to PhasingStats.get (now renamed PhasingStats.get_detailed_stats) when using the flag --tsv.

Copy link
Contributor

@marcelm marcelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I think this is a nice improvement! Can you address the one comment I had? Then feel free to merge.

"""Return DetailedStats"""
block_sizes = sorted(len(block) for block in self.blocks)
n_singletons = sum(1 for size in block_sizes if size == 1)
block_sizes = [size for size in block_sizes if size > 1]
block_lengths = sorted(block.span() for block in self.blocks if len(block) > 1)
# Block length stats calculated from split interleaved blocks to avoid inflating values
block_lengths = sorted([block.span() for block in self.split_blocks if len(block) > 1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this remain a generator expression (without the [ and ])?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorted returns a list so this is definitely redundant.

@marcelm
Copy link
Contributor

marcelm commented Nov 11, 2022

I forgot: This also needs a documentation update and an entry in the changelog. In the docs, I think it makes sense to move the sentence about "interleaved blocks are cut" to the paragraph "The numbers in the following columns describe these blocks." https://whatshap.readthedocs.io/en/latest/guide.html#whatshap-stats

@pontushojer pontushojer merged commit f1609f6 into main Nov 11, 2022
@pontushojer pontushojer deleted the update-stats branch November 11, 2022 12:29
@marcelm
Copy link
Contributor

marcelm commented Nov 11, 2022

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

strange values of whatshap stats
2 participants