masked alignment different from global tree in GISAID #21

Open
lpipes opened this issue Oct 7, 2021 · 5 comments

Comments


lpipes commented Oct 7, 2021

Hello, I tried to download the masked alignment from GISAID but it contains >3 million sequences while the global tree they uploaded is only for ~600K sequences. Do you know where I can download the MSA file for the most recent global tree? Thanks.

Owner

roblanf commented Oct 7, 2021

Hi @lpipes, there are two parts to this answer. First, the most recent global tree contains almost 3M sequences, although that's still fewer than in the alignment. The reason for the discrepancy is that the alignment contains all sequences, but the tree is built only with those that are good enough to build a tree from.

The older trees only had 600K sequences, because that's all FastTree could handle. These were subsampled to include all of the most recent sequences, plus something like 100K other sequences for context.

In both cases, the way to get an alignment that has only the sequences contained in the tree is to pull out of the alignment just the sequences you want. To do that, I'd:

  1. Make a text file of all the sequence names in the tree (one per line)
  2. Extract the corresponding sequences from the alignment using faSomeRecords
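For anyone without faSomeRecords to hand, the two steps above can be roughly sketched in Python. This is only an illustration, not the workflow used for the global tree: `leaf_names` and `filter_fasta` are hypothetical helper names, the regex assumes unquoted Newick labels with no parentheses, commas, colons, or semicolons inside them, and records are matched on the full FASTA header text after `>`, which may not match how faSomeRecords compares names.

```python
import re


def leaf_names(newick):
    """Return the set of leaf labels in an unquoted Newick string.

    Assumption: a leaf label directly follows "(" or ",", while internal
    node labels follow ")" and are therefore skipped.
    """
    found = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return {name.strip() for name in found if name.strip()}


def filter_fasta(lines, keep):
    """Yield the FASTA lines of every record whose header is in `keep`.

    Assumption: the whole header after ">" must equal a name in `keep`.
    """
    writing = False
    for line in lines:
        if line.startswith(">"):
            writing = line[1:].strip() in keep
        if writing:
            yield line
```

Because `filter_fasta` streams line by line, it can be fed an open file handle for the alignment and its output written straight to a new file, so the full alignment never has to sit in memory.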

Hope that helps!

Rob

Author

lpipes commented Oct 8, 2021

Hi Rob,

Thanks for your explanation. The tree I recently downloaded (dated 2021-09-26) only had ~600K sequences in it, but I just downloaded the most recent tree (dated 2021-10-05), which has ~3 million. Using faSomeRecords makes sense, but I am actually having a lot of trouble extracting the MSA from the tar file.

tar xf mmsa_2021-10-06.tar.xz
xz: (stdin): Unexpected end of input
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

I also encountered this error with the previous *tar.xz files that were posted. Any idea on what could be the problem?

-Lenore

Owner

roblanf commented Oct 8, 2021 via email

Author

lpipes commented Oct 8, 2021

Hmm seems like that doesn't work either ugh...
xz -d mmsa_2021-10-06.tar.xz
xz: mmsa_2021-10-06.tar.xz: Unexpected end of input

Author

lpipes commented Oct 8, 2021

In fact, I've tried to extract every single MSA file that they have posted and all of them have an Unexpected EOF in archive. I sent them a message though.
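An "Unexpected end of input" message from xz generally means the compressed stream stopped before its end marker, which is what an incomplete download or upload would look like. The thread does not establish which of these applies here. As a minimal sketch for checking a local copy, the hypothetical helper `xz_is_complete` below reads a `.xz` file to the end with Python's standard `lzma` module and reports whether the stream finished cleanly; the function name and chunk size are illustrative choices.

```python
import lzma


def xz_is_complete(path, chunk_size=1 << 20):
    """Return True if the .xz file at `path` decompresses to the end.

    A truncated file raises EOFError while reading; a corrupt one
    raises lzma.LZMAError. Both are reported as False.
    """
    try:
        with lzma.open(path, "rb") as handle:
            # Read and discard the data in chunks to keep memory use low.
            while handle.read(chunk_size):
                pass
    except (EOFError, lzma.LZMAError):
        return False
    return True
```

If this returns False for a freshly re-downloaded file as well, the problem is more likely to be in the posted archive than in the local transfer.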
