Skip to content

Conversation

@saurabhbelsare
Copy link
Contributor

This is the write_ms function discussed in #727 I have created a new ms.py file with the function, included the write_ms function header in trees.py, and created a test_ms.py in the tests directory. In line 118-125 of ms.py, I'm introducing a hard exit if the tree sequence contains anything incompatible with the ms format. Let me know if that is the optimal way to do it. Also, if there should be any other modifications in any of the functions. Thanks!

Copy link
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very nice! You've dealt with a lot of things here. See comments, but my main comments are:

  1. maybe this should be a top-level method, since it can act on many tree seuqence (thus getting rid of num_replicates)?
  2. we need more testing (eg testing for equality of positions, and of haplotypes)
  3. right now it uses haplotypes, and so requires the alleles to be 0/1, but really we just need them to be biallelic; using genotypes or genotype_matrix instead would allow that.

Let me know what you think? Happy to help with something if you're not sure of the best way forward.

@@ -0,0 +1,188 @@
# MIT License
#
# Copyright (c) 2018-2019 Tskit Developers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-2020

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and omit the next line, as it wasn't present in 2016

#
# MIT License
#
# Copyright (c) 2019 Tskit Developers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2020!


def verify_num_haplotypes(self, ts, mutation_rate, num_replicates):
if num_replicates == 1:
with tempfile.TemporaryDirectory() as temp_dir:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about using the file-like io.StringIO(), like for instance here?

quantities["num_sites"] = num_sites
quantities["num_positions"] = num_positions
quantities["num_haplotypes"] = num_haplotypes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also test for equality of the positions themselves, no? And, haplotypes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you have written here is almost an ms file parsing function: might as well go all the way, and just return the whole dict? Then you can test the various aspects of it, like below. This wouldn't be for distribution, just for testing, so no need to worry about making it fast or documented, just simple and obviiously correct.

# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
"""
Convert tree sequences to ms output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that we're not converting the whole tree seuqence: just writing out the genotypes - how about "Write the genotypes in a tree sequence in ms format."

recombination_rate=0,
migration_rate=0,
num_loci=1,
num_replicates=1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm - looks like you don't need most of these options. Do you need any of them? Other than num_replicates, maybe?

variant.position / (tree_sequence.sequence_length)
for variant in tree_sequence.variants()
]
positions.sort()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the positions are guaranteed to be sorted.

file=output,
)
print(file=output)
for h in tree_sequence.haplotypes():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm - so, haplotypes returns the actual alleles, which are thus required to be 0/1. But the genotype_matrix gives the indexes into arrays of alleles, with 0 always the ancestral state. So, more generally, could we do something like this:

genotypes = tree_sequence.genotype_matrix()
for k in range(tree_sequence.num_samples):
  print("".join(genotypes[:, k]), file=output)

(but, what's the - for?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The "-" is for missing data @petrelharp )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's what I guessed... but does ms output missing data?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I wouldn't imagine it does. But, I'm not sure how strict we should be here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to including missing data! Just wanted to be sure what was happening.

>>> tree_sequences = msprime.simulate(<simulation arguments>, num_replicates=num_replicates)
>>> with open('output.ms', 'w') as ms_file:
>>> for tree_sequence in tree_sequences:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation

>>> tree_sequences = msprime.simulate(<simulation arguments>, num_replicates=num_replicates)
>>> with open('output.ms', 'w') as ms_file:
>>> for tree_sequence in tree_sequences:
>>> tree_sequence.write_ms(ms_file, mutation_rate=mutation_rate, num_replicates=num_replicates)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've got a nice solution to this. But really, write_ms is not a property of a TreeSequence, it's a thing that you can do to one, or maybe many, tree sequences. What if instead of ts.write_ms you did tskit.write_ms(ts)? That way if ts is a TreeSequence, you'd write it out, and if ts is a generator (of tree sequences) then you have replicates, and write them out?

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify this down quite a bit @saurabhbelsare, and it would actually be better to not have a class here. My main question really is, "what is this for and who will use it"; once this is clearer, we can see better what the requirements are in terms of what kind of data we try to represent and what options we need.

same tree. Therefore, we must keep track of all breakpoints from the
simulation and write out a tree for each one.
"""
breakpoints = list(self._tree_sequence.breakpoints(True)) + [self._num_loci]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this code makes sense in the msprime implementation but not here. In msprime, we know there are extra breakpoints not present in the tree sequence and we use these to output extra copies of the trees appropriately (otherwise we don't have the same distribution of the number of trees as ms). We can delete all the stuff about breakpoints here.

simulation and write out a tree for each one.
"""
breakpoints = list(self._tree_sequence.breakpoints(True)) + [self._num_loci]
if self._num_loci == 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be if ts.num_trees == 1

print(newick, file=output)
else:
j = 1
for tree in self._tree_sequence.trees():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can just be;

for tree in ts.trees():
    newick = tree.newick(precision=self._precision)
    print("[{}]".format(tree.span), newick, file=output)


def __write_header(self, output):
print(
"ms {} {} # This file is an ms-style output file generated from tskit. The two arguments written are sample size and number of replicates".format(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we should put comments in the output - it's not part of ms's output, is it?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, there are no comments in ms's output.


def write(self, output):

if os.path.getsize(output.fileno()) == 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a good idea I think - it means that you can't write to file-like objects.

file=output,
)
print(file=output)
for h in tree_sequence.haplotypes():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The "-" is for missing data @petrelharp )

# Introducing an error to exit if the sequence is not compatible with the ms format #
#####################################################################################
else:
sys.exit("This tree sequence contains non-biallelic SNPs and is incompatible with the ms format!")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An exception is more appropriate here, .e.g. raise ValueError("not compatible with ms output")

@saurabhbelsare
Copy link
Contributor Author

Thanks for all these suggestions! A lot of the points you've listed are coming from the fact that I used the existing write_vcf method from tksit as a template, and used existing code from msprime/cli.py to create the write function. I'll work on these points. Also, I'm not sure of the exact target application. Should I message on the github issue thread where this has been requested and tag and ask the people who have requested it? Thanks.

@yunusbb
Copy link

yunusbb commented Sep 18, 2020

Thanks @petrelharp for mentioning this. I wasn't aware of #854.
Many thanks @saurabhbelsare and @petrelharp @jeromekelleher for doing this! Me and my colleagues use ms-style output quite often. Kind of old-fashioned, but anyway.
Right now, I am using ms style output generated with discoal as a training and testing dataset for partialSHIC machine-learning (https://github.com/xanderxue/partialSHIC).

PartialSHIC takes a single ms-style output that contains multiple replicates in it, for example I use 1000 replicates.
Basically, ms' style file looks exactly as in the original Hudson's ms format, i.e. like this:
ms 4 2 -t 5.0
27473 36154 10290

//
segsites: 4
positions: 0.0110 0.0765 0.6557 0.7571
0010
0100
0000
1001

//
segsites: 5
positions: 0.0491 0.2443 0.2923 0.5984 0.8312
00001
00000
00010
11110

p.s. I will be away for hiking on weekends, but will be back soon and do testing.

@benjeffery benjeffery changed the base branch from master to main September 28, 2020 12:11
@yunusbb
Copy link

yunusbb commented Oct 3, 2020

Hi @saurabhbelsare

I did some quick testing. Maybe this is not the right way but here what I did.
I temporarily replaced my local version of trees.py copy/pasting your 'pull-request' version code and also added ms.py into my library:

/Users/bayazityunusbayev/Library/Python/3.6/lib/python/site-packages/tskit/trees.py
/Users/bayazityunusbayev/Library/Python/3.6/lib/python/site-packages/tskit/ms.py

and then, I loaded my previously simulated data stored in tree sequence

python3.6
import tskit
ts=tskit.load('slim_neutral_reps_1_Est_Dem_decap_subset_1800.trees')

ts.num_samples
1799

# then tested ms output, like this:

with open('output.ms', 'w') as ms_file:
        ts.write_ms(ms_file, mutation_rate=int(1.25e-8))

# ms output had only some header lines:

cat output.ms 
ms 1799 1 # This file is an ms-style output file generated from tskit. The two arguments written are sample size and number of replicates
999 # Setting random seed to 999

//

Perhaps, I am doing something wrong here. Should learn what is git-hub all about, like what is pull-request and staff

@petrelharp
Copy link
Contributor

Hi, @yunusbb! What you did sounded like it should work, but it depends on what version of tskit you're copying the file over. Here's a quick summary of how to do the git thing:

git clone https://github.com/tskit-dev/tskit.git
cd tskit/python
git fetch origin pull/854/head:ms_output
git checkout ms_output
make

Now, everything you do from this directory only will use the local version of tskit, matching this pull request. If you want to make edits, I'd recommend something different, but this is a quick and easy way to test out what's going on.

@yunusbb
Copy link

yunusbb commented Oct 5, 2020

Hi, @petrelharp! Thank you for your help with this!

@saurabhbelsare
Copy link
Contributor Author

I've updated a new version of the write_ms function, that addresses the points raised above. The write_ms function is now a function of tskit, and not tree_sequence. Hence it no longer needs the two different calls depending on whether num_replicates is used or not. All the breakpoints related code which was relevant to msprime has now been removed. All the other minor points have been fixed. The new tests for positions and genotypes have been added. Let me know how it looks now.

@benjeffery
Copy link
Member

@saurabhbelsare Great stuff! Could you rebase and run pre-commit (see here and here). This will then make CI green. I'll do a proper review tomorrow. Thanks!

@saurabhbelsare
Copy link
Contributor Author

saurabhbelsare commented Oct 9, 2020

Hi @benjeffery, I've run all the pre-commit checks and performed the corresponding modifications, and pushed that version. However, when I tried to do the rebasing, following these instructions, when I run git fetch upstream, I get the following error:

fatal: 'upstream' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

I'm not sure how to fix this. Sorry, I'm still not super on top of working with github.

@benjeffery
Copy link
Member

Sounds like you don't have this fork set as a remote. git remote add upstream git@github.com:tskit-dev/tskit.git should fix this.

@petrelharp
Copy link
Contributor

Or it might be called origin? Doing git remote -v shows you the list of remote repositories, and you should replace upstream with whatever git@github.com:tskit-dev/tskit.git is called (and if you don't have this repo set as a remote, you'll have to do what Ben said).

@saurabhbelsare
Copy link
Contributor Author

@petrelharp, git remote -v did not show me the tskit-dev repo, so I added it as per @benjeffery's instructions. However, when I run git fetch upstream now, I get a new error:

git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

@grahamgower
Copy link
Member

Hi @saurabhbelsare. Try using the https url for the repository. E.g.

git remote set-url upstream https://github.com/tskit-dev/tskit.git

@saurabhbelsare
Copy link
Contributor Author

Hi @grahamgower, That worked, thanks! I got another error when I tried to run git rebase -i upstream/master, but fixed that by changing it to git rebase -i upstream/main since I remember that this nomenclature has changed. However, after I do the next step, where I edit the rebase instructions in the editor and change the pick arguments to squash, and then save the changes, I get the following error:

Auto-merging python/tskit/trees.py
CONFLICT (content): Merge conflict in python/tskit/trees.py
error: could not apply 089274d... Added the write_ms function to write out ms-style output from a tree sequence

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Could not apply 089274d... Added the write_ms function to write out ms-style output from a tree sequence

When I open trees.py, it is showing me the version of trees.py from the first commit, not the one from my latest commit, which is what I would have expected from the squashing. Hence I'm not sure how to resolve this. Sorry that I need step by step instructions for this, I haven't really done this before and I don't want to break anything.

@petrelharp
Copy link
Contributor

In general, what you need to do in this case is go through the files with conflicts and find the places like this:

>>>>
old stuff
====
new stuff
<<<<

and edit them to be the way you want. That'll be a bit annoying here, since you will probably have to do it four times (once for each of your commits). This has maybe got to be difficult because the main branch has moved a good bit since you started this. Want me to do the rebase this time?

@saurabhbelsare
Copy link
Contributor Author

Thanks @petrelharp, I've rebased and pushed it. Let me know if it looks right now.

@petrelharp
Copy link
Contributor

Something went wrong there, @saurabhbelsare - this is not rebased to main. Looking at git log, the most recent two commits are

commit 1b062a81e59113d86cfd0db43c5a698e65bef10f (HEAD -> write_ms)
Author: Saurabh Belsare <smbelsare@gmail.com>
Date:   Wed Sep 16 23:03:44 2020 +0000

    Added the write_ms function to write out ms-style output from a tree sequence

commit ecfc5ad176398d2a04034c15a9f87fe874489b89
Author: Jerome Kelleher <jk@well.ox.ac.uk>
Date:   Tue Apr 14 13:58:47 2020 +0100

Maybe you forgot to push? You'll have to do git push -f after a rebase, since you've messed with history.

@saurabhbelsare
Copy link
Contributor Author

I did a git push -f after I did the rebase. And I tried it again right now, and I get the message Everything up-to-date.

@benjeffery
Copy link
Member

Hi @saurabhbelsare, thanks for persevering with this! Your branch is still not rebased to main. You need to make sure main is up to date:

git checkout main
git pull upstream main

then rebase your work on top:

git checkout master (I think your local branch is called this, usually it is best to choose a unique name when you start but this should still work)
git rebase main
git push -f master

@yunusbb
Copy link

yunusbb commented Oct 10, 2020

Hi, @yunusbb! What you did sounded like it should work, but it depends on what version of tskit you're copying the file over. Here's a quick summary of how to do the git thing:

git clone https://github.com/tskit-dev/tskit.git
cd tskit/python
git fetch origin pull/854/head:ms_output
git checkout ms_output
make

Now, everything you do from this directory only will use the local version of tskit, matching this pull request. If you want to make edits, I'd recommend something different, but this is a quick and easy way to test out what's going on.

Hi @petrelharp,

I've tried to follow this recipe to do testing but was stuck at the compilation step with errors:
Anyway, I guess now I better wait until @saurabhbelsare will do the merging?

#just in case, this was my error during compilation (I've tried to read some gcc documentation and some other things but could not solve this issue)

make
python3 setup.py build_ext --inplace
running build_ext
building '_tskit' extension
creating build
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/lib
creating build/temp.linux-x86_64-3.6/lib/tskit
creating build/temp.linux-x86_64-3.6/lib/subprojects
creating build/temp.linux-x86_64-3.6/lib/subprojects/kastore
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -UNDEBUG -Ilib -Ilib/subprojects/kastore -I/usr/include/python3.6m -I/ebc_data/bayazit1/Estonian_Genomes/SLIM_SIMS/TREES2MS/tskit/python/.eggs/numpy-1.19.2-py3.6-linux-x86_64.egg/numpy/core/include -c _tskitmodule.c -o build/temp.linux-x86_64-3.6/_tskitmodule.o -std=c99
_tskitmodule.c:34:21: fatal error: kastore.h: No such file or directory
#include "kastore.h"
^
compilation terminated.
error: command 'gcc' failed with exit status 1
make: *** [ext3] Error 1

@benjeffery
Copy link
Member

@yunusbb sounds like you need git submodule update --recursive --init

Docs at https://tskit.readthedocs.io/en/latest/development.html can help when building tskit.

@yunusbb
Copy link

yunusbb commented Oct 10, 2020

Thanks @benjeffery ! It worked now.

@yunusbb
Copy link

yunusbb commented Oct 12, 2020

Hi,
I've now tried this pull request #892 version and it worked.
I tested using one simulation repetition stored in a tree sequence format.

To have #892 locally, I followed steps provided by @petrelharp and @benjeffery, i.e.

git clone https://github.com/tskit-dev/tskit.git
cd tskit/python
git fetch origin pull/854/head:ms_output
git checkout ms_output
git submodule update --recursive --init

make

python3.8
import tskit
ts1=tskit.load('slim_neutral_reps_1_Est_Dem_decap_subset_1800.trees')

with open('output_1.ms', 'w') as ms_file:
tskit.write_ms(ts1, ms_file, mutation_rate=float(1.25e-8))

my original tree sequence was generated using SLIM and then it was recapitated & mutated using msprime
Now will looking into how it works with multiple replicates

@petrelharp
Copy link
Contributor

@yunusbb - great! Let us know if the output works for your pipeline?

@benjeffery
Copy link
Member

@saurabhbelsare Would you like me to rebase this to main?

@saurabhbelsare
Copy link
Contributor Author

saurabhbelsare commented Oct 12, 2020

Hi @benjeffery, is it possible for you to quickly do the rebase? I'm still getting merge conflicts when I try to do it and can't figure out what's going wrong. Sorry about that.

Hi @yunusbb, were you able to generate ms-style output with replicates the write_ms function from this pull request now? The same function should work with and without replicates. Let me know if everything is working the way you need it to.

@yunusbb
Copy link

yunusbb commented Oct 13, 2020

Hi, @saurabhbelsare!

So I did some testing this morning.
'write_ms' worked with iterable object returned by msprime, but it did only once. Strange isn't it. I mean, when I repeated msrpime simulations by setting a smaller sample size (changed only one parameter) and tried 'write_ms' again, it gave an empty ms file.

here is the first successful try:

reps = msprime.simulate(
... sample_size=2000,
... Ne=10000,
... length=100000,
... mutation_rate=float(1.25e-8),
... recombination_rate=float(1.1e-8),
... random_seed=9889,
... num_replicates=3)

with open('output_1.ms', 'w') as ms_file:
... tskit.write_ms(ts1, ms_file, mutation_rate=float(1.25e-8), num_replicates=3)

wc -l test_msprime_3_sim_iters.ms
6014 test_msprime_3_sim_iters.ms

now the failed attempt, with one parameter changed (sample_size=100):

reps = msprime.simulate(
sample_size=100,
Ne=10000,
length=100000,
mutation_rate=float(1.25e-8),
recombination_rate=float(1.1e-8),
random_seed=9889,
num_replicates=3)

with open("test_msprime_3_sim_iters.ms", "w") as ms_file:
tskit.write_ms(reps, ms_file, mutation_rate=float(1.25e-8), num_replicates=3)

wc -l test_msprime_3_sim_iters.ms
0 test_msprime_3_sim_iters.ms

As far as I remember, I did not change anything between these attempts.
Anyway, the ms-formatted file from my first attempt looked fine, i.e. one rep with a header line + 2 reps without.
So this gonna work fine with msprime iterable objects (provided that this strange behaviour would be resolved).
It also worked fine with num_replicates=1 by adding header by default.
Here I wonder if you can add an option to do 'header=False' for num_replicates=1? This can be useful for trees imported into msprime.
For example, I generate tree sequences in SLIM and then recapitate and add mutations in msprime.
So I would import tree sequences into msprime and if I had the 'header=False' option I could sequentially write ms output
by adding a header for the first imported tree sequence and then omitting it for the rest.

@saurabhbelsare
Copy link
Contributor Author

saurabhbelsare commented Oct 13, 2020

Hi @yunusbb, I wasn't able to reproduce the problem you are seeing with changing the parameter. Here is my script:

import msprime
import tskit as ts 

reps = msprime.simulate(sample_size=2000, Ne=10000, length=100000, mutation_rate=float(1.25e-8), recombination_rate=float(1.1e-8), random_seed=9889, num_replicates=3)
with open('output_1.ms', 'w') as ms_file:
    ts.write_ms(reps, ms_file, mutation_rate=float(1.25e-8), num_replicates=3)

reps = msprime.simulate(sample_size=100, Ne=10000, length=100000, mutation_rate=float(1.25e-8), recombination_rate=float(1.1e-8), random_seed=9889, num_replicates=3)
with open("test_msprime_3_sim_iters.ms", "w") as ms_file:
    ts.write_ms(reps, ms_file, mutation_rate=float(1.25e-8), num_replicates=3)

Here are my outputs:
wc -l output_1.ms
6014 output_1.ms

wc -l test_msprime_3_sim_iters.ms
314 test_msprime_3_sim_iters.ms

Both the output files have data written out. Is there a mismatch in the tree_sequence object you are giving the write_ms function? In your first example, the output of msprime is to reps but you are giving write_ms the object ts1. The output file names are also different. Might that mismatch have something to do with the error?

I can add the parameter to manually turn off write_header for a single tree_sequence. Could you send me a short example where I can test it when I implement it? I'm not very familiar with SLiM and it'll be better if I can test it for the usecase you're looking for. Thanks!

@yunusbb
Copy link

yunusbb commented Oct 14, 2020

Hi @saurabhbelsare ,
Thank you for your help with this. Now everything works. I started everything from the beginning (quit & launch python, load modules, run msprime & output).
These two lines crept in by mistake, they are from my tests with SLIM tree sequences.

with open('output_1.ms', 'w') as ms_file:
... tskit.write_ms(ts1, ms_file, mutation_rate=float(1.25e-8), num_replicates=3)

Sorry for taking your time on this. Must have been able to spot this myself.

Anyway, I am sending you three SLIM generated tree sequence files.
They were post-processed using msprime (recapitation & mutate).

I have compressed them using right-click 'Compress' menu in MACOSX finder.

SLIM_generated_tree_sequences.zip

Here are my commands to load tree sequences and process with write_ms:

ts1=tskit.load('slim_neutral_reps_1_Est_Dem_decap_subset_1800.trees')
with open('output_1.ms', 'w') as ms_file:
tskit.write_ms(ts1, ms_file, precision=6, mutation_rate=float(1.25e-8), num_replicates=1)

@codecov
Copy link

codecov bot commented Oct 14, 2020

Codecov Report

Merging #854 into main will decrease coverage by 0.04%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #854      +/-   ##
==========================================
- Coverage   93.47%   93.43%   -0.05%     
==========================================
  Files          25       25              
  Lines       20029    20065      +36     
  Branches      796      808      +12     
==========================================
+ Hits        18723    18747      +24     
- Misses       1272     1281       +9     
- Partials       34       37       +3     
Impacted Files Coverage Δ
tskit/trees.py 97.27% <0.00%> (-0.77%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d823801...6e3a671. Read the comment docs.

@benjeffery
Copy link
Member

@saurabhbelsare I've rebased to master - there was only one small conflict. I'd suggest reading https://docs.github.com/en/free-pro-team@latest/github/collaborating-with-issues-and-pull-requests/resolving-a-merge-conflict-using-the-command-line for how to do this next time.

@benjeffery
Copy link
Member

@saurabhbelsare I forgot to say that to fetch the rebased changes, you should not git pull as this will merge the rebased changes here with your local un-rebased changes. You need to first git fetch then git reset --hard origin/master

@AdminBot-tskit
Copy link
Collaborator

📖 Docs for this PR can be previewed here

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up @saurabhbelsare! I think the main method needs some refactoring and simplifying, so I haven't reviewed the tests yet as they will change.

"""


class msWriter:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this deserves a class it has no life-time or identity. I also don't think it deserves its own module as a single function, it be rolled into the write_ms function in trees.py.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the class entirely and moved all the functions to trees.py

tree_sequence,
print_trees=False,
precision=4,
mutation_rate=0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mutation_rate has no purpose here, only made sense in the msprime code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the mutation_rate argument

Comment on lines 63 to 65
"ms {} {}".format(
self._tree_sequence.sample_size, max(self._num_replicates, 1)
),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is right - my understanding is that the first line of an ms file is the command line used to generate it. Should we be putting " ".join(sys.argv) here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two are the first two arguments in the ms file format. Including the rest is a complicated procedure, since the tree_sequence could be generated from different software like msprime or SLiM, and it is not straightforward to recreate all the ms-style command line arguments from that. I discussed this with @petrelharp a while ago and we thought that only having these two basic arguments makes sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is: what do downstream users expect to see here? If no-one is parsing this line, then it would make sense to put something about provenance here, as @benjeffery suggests. But I do worry that people might be parsing it a bit (well, hopefully just looking for a line starting with ms X Y to skip), and they might be pulling the sample size and number of replicates out of it. So, I guess I vote for leaving it as-is?

newick = tree.newick(precision=self._precision)
print(f"[{tree.span}]", newick, file=output)

def __write_header(self, output):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think this needs to be a separate function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged with the main function.

),
file=output,
)
print("{}".format(999), file=output)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the random seed line in the ms file? If it has to be numeric then using 0 would be better, at first I assumed 999 had a special meaning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've replaced 999 with 0.

]
for position in positions:
print(
"{0:.{1}f}".format(position, self._precision),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use an f string here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've replaced the printing with f strings.

if set(tmp_str).issubset({"0", "1", "-"}):
print(tmp_str, file=output)
#################################################
# Introducing an error to exit if the sequence #
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment isn't needed now we have the exception text.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the comment.

Comment on lines 6402 to 6418
writer = ms.msWriter(
tree_seq,
print_trees=print_trees,
precision=precision,
mutation_rate=mutation_rate,
num_replicates=num_replicates,
write_header=True,
)
else:
writer = ms.msWriter(
tree_seq,
print_trees=print_trees,
precision=precision,
mutation_rate=mutation_rate,
num_replicates=num_replicates,
write_header=False,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you can have one call to msWriter with write_header=(i==0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented this change.

@AdminBot-tskit
Copy link
Collaborator

📖 Docs for this PR can be previewed here

@saurabhbelsare
Copy link
Contributor Author

saurabhbelsare commented Oct 16, 2020

Hi @benjeffery, I've implemented all the fixes you've suggested. The tests have also been modified to work with the new structure. Thanks for all the suggestions! Let me know if everything looks good now.

Hi @yunusbb, The function now has a write_header argument. Your example runs as follows:


import tskit as ts

ts1=ts.load('SLIM_generated_tree_sequences/slim_neutral_reps_1_Est_Dem_decap_subset_1800.trees')
ts2=ts.load('SLIM_generated_tree_sequences/slim_neutral_reps_2_Est_Dem_decap_subset_1800.trees')
ts3=ts.load('SLIM_generated_tree_sequences/slim_neutral_reps_3_Est_Dem_decap_subset_1800.trees')
with open('output_1.ms', 'w') as ms_file:
    ts.write_ms(ts1, ms_file, precision=6, num_replicates=1, write_header=True)
    ts.write_ms(ts2, ms_file, precision=6, num_replicates=1, write_header=False)
    ts.write_ms(ts3, ms_file, precision=6, num_replicates=1, write_header=False)

Let me know if this is what you are looking for, and if the output is as you expect.

@yunusbb
Copy link

yunusbb commented Oct 17, 2020

Hi @saurabhbelsare! Fantastic! That is exactly what is needed to process "non-msrpime" tree sequences. Thank you for taking the time to do this!

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heading in the right direction @saurabhbelsare! I still think this could be simpler as just one function though.

)


def print_ms_file_trees(tree_seq, precision, output):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this function is only called in one place it doesn't have a life of it's own. Best to inline it into print_ms_file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inlined the function

print(newick, file=output)


def print_ms_file(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function too can be inlined.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inlined the function

are sample size and number of replicates. The second line has a 0 as a substitute
for the random seed.
"""
if isinstance(tree_sequence, collections.Iterable):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you could do tree_sequence=[tree_sequence] if the argument wasn't an iterable. This would let you inline print_ms_file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done this modification.

Print out the trees in ms-format from the specified tree sequence.
"""
tree = next(tree_seq.trees())
newick = tree.newick(precision=precision)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only printing one tree, is that right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, this was a mistake. I've fixed this now. Also comparing with the ms output, when there is no recombination, and therefore only one tree, ms does not write out the span, while when there are multiple trees, it does. The new print_trees part of the function does that. Also, I looked carefully at the ms manual, and the -T argument which prints trees suppresses the output of genotypes. I have modified the write_ms function to behave accordingly.

…function from print genotypes, and added iterator to print all trees
@AdminBot-tskit
Copy link
Collaborator

📖 Docs for this PR can be previewed here

@saurabhbelsare
Copy link
Contributor Author

Hi @benjeffery, I've done all the latest modifications you suggested and (hopefully correctly) rebased and squashed the new commits. Let me know how things look now.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @saurabhbelsare! I think there's a few things we can improve a little bit, but the basic functionality is here so I think we can merge and file some issues to track the rest.

@benjeffery
Copy link
Member

Thanks @saurabhbelsare!

@mergify mergify bot merged commit a5f9c30 into tskit-dev:main Oct 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants