Added the write_ms function to write out ms-style output from a tree sequence #854

saurabhbelsare · 2020-09-16T23:17:15Z

This is the write_ms function discussed in #727. I have created a new ms.py file with the function, included the write_ms function header in trees.py, and created a test_ms.py in the tests directory. In lines 118-125 of ms.py, I'm introducing a hard exit if the tree sequence contains anything incompatible with the ms format. Let me know if that is the best way to do it, and whether any of the functions need other modifications. Thanks!

petrelharp

This looks very nice! You've dealt with a lot of things here. See comments, but my main comments are:

maybe this should be a top-level method, since it can act on many tree sequences (thus getting rid of num_replicates)?
we need more testing (eg testing for equality of positions, and of haplotypes)
right now it uses haplotypes, and so requires the alleles to be 0/1, but really we just need them to be biallelic; using genotypes or genotype_matrix instead would allow that.

Let me know what you think? Happy to help with something if you're not sure of the best way forward.

petrelharp · 2020-09-17T03:49:42Z

python/tests/test_ms.py

@@ -0,0 +1,188 @@
+# MIT License
+#
+# Copyright (c) 2018-2019 Tskit Developers


... and omit the next line, as it wasn't present in 2016

petrelharp · 2020-09-17T03:53:14Z

python/tskit/ms.py

+#
+# MIT License
+#
+# Copyright (c) 2019 Tskit Developers


petrelharp · 2020-09-17T03:58:07Z

python/tests/test_ms.py

+
+    def verify_num_haplotypes(self, ts, mutation_rate, num_replicates):
+        if num_replicates == 1:
+            with tempfile.TemporaryDirectory() as temp_dir:


how about using the file-like io.StringIO(), like for instance here?
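
As a rough illustration of the io.StringIO() suggestion: any writer that only calls print(..., file=output) can be tested against an in-memory buffer instead of a temporary directory. The function name write_example below is a hypothetical stand-in, not part of this PR.

```python
import io


def write_example(output):
    # Hypothetical stand-in for a writer such as write_ms: it only needs
    # an object with a write() method, which print(file=...) relies on.
    print("segsites: 2", file=output)
    print("positions: 0.2500 0.7500", file=output)


# In a test, an in-memory buffer replaces the temporary directory and file.
buffer = io.StringIO()
write_example(buffer)
lines = buffer.getvalue().splitlines()
```

The test can then assert on lines directly, with no cleanup needed.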

petrelharp · 2020-09-17T03:59:51Z

python/tests/test_ms.py

+    quantities["num_sites"] = num_sites
+    quantities["num_positions"] = num_positions
+    quantities["num_haplotypes"] = num_haplotypes
+


We could also test for equality of the positions themselves, no? And, haplotypes?

What you have written here is almost an ms file parsing function: might as well go all the way, and just return the whole dict? Then you can test the various aspects of it, like below. This wouldn't be for distribution, just for testing, so no need to worry about making it fast or documented, just simple and obviously correct.
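
A minimal sketch of what such a test-only parser could look like, assuming the ms layout quoted later in this thread (a "//" line starts each replicate, followed by "segsites:", "positions:" and one haplotype per line). The name parse_ms and the dict keys are illustrative assumptions, not code from the PR.

```python
def parse_ms(text):
    """Parse ms-style text into a list of replicates (testing helper only)."""
    replicates = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == "//":
            current = {"num_sites": 0, "positions": [], "haplotypes": []}
            replicates.append(current)
        elif current is None or line == "":
            # Header lines before the first "//" and blank lines are skipped.
            continue
        elif line.startswith("segsites:"):
            current["num_sites"] = int(line.split()[1])
        elif line.startswith("positions:"):
            current["positions"] = [float(x) for x in line.split()[1:]]
        else:
            current["haplotypes"].append(line)
    return replicates


example = """ms 4 2 -t 5.0
27473 36154 10290

//
segsites: 4
positions: 0.0110 0.0765 0.6557 0.7571
0010
0100
0000
1001
"""
parsed = parse_ms(example)
```

With the whole structure returned, a test can compare positions and haplotypes against the tree sequence rather than only counting them.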

petrelharp · 2020-09-17T04:02:59Z

python/tskit/ms.py

+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+"""
+Convert tree sequences to ms output


I guess that we're not converting the whole tree sequence: just writing out the genotypes - how about "Write the genotypes in a tree sequence in ms format."

petrelharp · 2020-09-17T04:18:33Z

python/tskit/ms.py

+        recombination_rate=0,
+        migration_rate=0,
+        num_loci=1,
+        num_replicates=1,


Hm - looks like you don't need most of these options. Do you need any of them? Other than num_replicates, maybe?

petrelharp · 2020-09-17T04:20:07Z

python/tskit/ms.py

+                    variant.position / (tree_sequence.sequence_length)
+                    for variant in tree_sequence.variants()
+                ]
+                positions.sort()


the positions are guaranteed to be sorted.

petrelharp · 2020-09-17T04:31:08Z

python/tskit/ms.py

+                        file=output,
+                    )
+                print(file=output)
+                for h in tree_sequence.haplotypes():


Hm - so, haplotypes returns the actual alleles, which are thus required to be 0/1. But the genotype_matrix gives the indexes into arrays of alleles, with 0 always the ancestral state. So, more generally, could we do something like this:

genotypes = tree_sequence.genotype_matrix()
for k in range(tree_sequence.num_samples):
    print("".join(map(str, genotypes[:, k])), file=output)

(but, what's the - for?)

(The "-" is for missing data @petrelharp )

that's what I guessed... but does ms output missing data?

No, I wouldn't imagine it does. But, I'm not sure how strict we should be here?

I'm not opposed to including missing data! Just wanted to be sure what was happening.
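
To make the genotype-index idea concrete, here is a small sketch that uses a nested list as a stand-in for the genotype matrix (rows are sites, columns are samples, 0 is the ancestral state). It assumes missing data is encoded as -1, as tskit does, and maps it to "-". The helper name haplotype_strings is hypothetical.

```python
def haplotype_strings(genotypes):
    # genotypes[j][k] is the allele index of sample k at site j.
    num_samples = len(genotypes[0]) if genotypes else 0
    rows = []
    for k in range(num_samples):
        column = [site[k] for site in genotypes]
        if any(g > 1 for g in column):
            # More than two alleles at a site cannot be written as 0/1.
            raise ValueError("not compatible with ms output")
        # Integers must be converted to str before joining.
        rows.append("".join("-" if g < 0 else str(g) for g in column))
    return rows


genotypes = [
    [0, 1, 0],
    [1, 0, -1],
    [0, 0, 1],
]
strings = haplotype_strings(genotypes)
```

With a NumPy matrix the column would be genotypes[:, k] instead of the list comprehension. Because only allele indexes are used, sites with alleles such as "A"/"T" would also be written as 0/1.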

petrelharp · 2020-09-17T04:32:06Z

python/tskit/trees.py

+
+            >>> tree_sequences = msprime.simulate(<simulation arguments>, num_replicates=num_replicates)
+            >>> with open('output.ms', 'w') as ms_file:
+            >>> for tree_sequence in tree_sequences:


indentation

petrelharp · 2020-09-17T04:35:01Z

python/tskit/trees.py

+            >>> tree_sequences = msprime.simulate(<simulation arguments>, num_replicates=num_replicates)
+            >>> with open('output.ms', 'w') as ms_file:
+            >>> for tree_sequence in tree_sequences:
+            >>>     tree_sequence.write_ms(ms_file, mutation_rate=mutation_rate, num_replicates=num_replicates)


You've got a nice solution to this. But really, write_ms is not a property of a TreeSequence, it's a thing that you can do to one, or maybe many, tree sequences. What if instead of ts.write_ms you did tskit.write_ms(ts)? That way if ts is a TreeSequence, you'd write it out, and if ts is a generator (of tree sequences) then you have replicates, and write them out?

jeromekelleher

I think we can simplify this down quite a bit @saurabhbelsare, and it would actually be better to not have a class here. My main question really is, "what is this for and who will use it"; once this is clearer, we can see better what the requirements are in terms of what kind of data we try to represent and what options we need.

jeromekelleher · 2020-09-17T07:41:56Z

python/tskit/ms.py

+        same tree. Therefore, we must keep track of all breakpoints from the
+        simulation and write out a tree for each one.
+        """
+        breakpoints = list(self._tree_sequence.breakpoints(True)) + [self._num_loci]


Yeah, this code makes sense in the msprime implementation but not here. In msprime, we know there are extra breakpoints not present in the tree sequence and we use these to output extra copies of the trees appropriately (otherwise we don't have the same distribution of the number of trees as ms). We can delete all the stuff about breakpoints here.

jeromekelleher · 2020-09-17T07:42:15Z

python/tskit/ms.py

+        simulation and write out a tree for each one.
+        """
+        breakpoints = list(self._tree_sequence.breakpoints(True)) + [self._num_loci]
+        if self._num_loci == 1:


This would be if ts.num_trees == 1

jeromekelleher · 2020-09-17T07:43:49Z

python/tskit/ms.py

+            print(newick, file=output)
+        else:
+            j = 1
+            for tree in self._tree_sequence.trees():


This can just be:

for tree in ts.trees():
    newick = tree.newick(precision=self._precision)
    print("[{}]".format(tree.span), newick, file=output)

jeromekelleher · 2020-09-17T07:44:43Z

python/tskit/ms.py

+
+    def __write_header(self, output):
+        print(
+            "ms {} {} # This file is an ms-style output file generated from tskit. The two arguments written are sample size and number of replicates".format(


I'm not sure we should put comments in the output - it's not part of ms's output, is it?

No, there are no comments in ms's output.

jeromekelleher · 2020-09-17T07:46:10Z

python/tskit/ms.py

+
+    def write(self, output):
+
+        if os.path.getsize(output.fileno()) == 0:


This isn't a good idea I think - it means that you can't write to file-like objects.
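
A short sketch of why the check fails for file-like objects: an in-memory buffer has no operating-system file descriptor, so fileno() raises before the size can be read. The helper name is_empty_via_descriptor is hypothetical and only mirrors the check in the diff.

```python
import io
import os


def is_empty_via_descriptor(output):
    # Mirrors the check in the diff: it needs a real file descriptor.
    return os.path.getsize(output.fileno()) == 0


buffer = io.StringIO()
try:
    is_empty_via_descriptor(buffer)
    supported = True
except io.UnsupportedOperation:
    # io.StringIO has no descriptor, so the check cannot be evaluated.
    supported = False
```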

jeromekelleher · 2020-09-17T07:47:57Z

python/tskit/ms.py

+                        file=output,
+                    )
+                print(file=output)
+                for h in tree_sequence.haplotypes():


(The "-" is for missing data @petrelharp )

jeromekelleher · 2020-09-17T07:49:42Z

python/tskit/ms.py

+                    # Introducing an error to exit if the sequence is not compatible with the ms format #
+                    #####################################################################################
+                    else:
+                        sys.exit("This tree sequence contains non-biallelic SNPs and is incompatible with the ms format!")


An exception is more appropriate here, e.g. raise ValueError("not compatible with ms output")
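
A minimal sketch of the difference this makes for callers, assuming the same allowed symbols as the check in the diff ("0", "1" and "-"). The function name check_ms_haplotype is hypothetical.

```python
def check_ms_haplotype(haplotype):
    # Allowed symbols follow the check in the diff: "0", "1" and "-".
    if not set(haplotype).issubset({"0", "1", "-"}):
        raise ValueError("not compatible with ms output")
    return haplotype


try:
    check_ms_haplotype("01A0")
    message = None
except ValueError as error:
    # The caller stays in control and can report or recover.
    message = str(error)
```

By contrast, sys.exit raises SystemExit, which an ordinary `except Exception` handler does not catch, so a library call would end the caller's whole program.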

saurabhbelsare · 2020-09-17T17:33:00Z

Thanks for all these suggestions! A lot of the points you've listed are coming from the fact that I used the existing write_vcf method from tskit as a template, and used existing code from msprime/cli.py to create the write function. I'll work on these points. Also, I'm not sure of the exact target application. Should I message on the github issue thread where this has been requested and tag and ask the people who have requested it? Thanks.

yunusbb · 2020-09-18T05:43:53Z

Thanks @petrelharp for mentioning this. I wasn't aware of #854.
Many thanks @saurabhbelsare and @petrelharp @jeromekelleher for doing this! Me and my colleagues use ms-style output quite often. Kind of old-fashioned, but anyway.
Right now, I am using ms style output generated with discoal as a training and testing dataset for partialSHIC machine-learning (https://github.com/xanderxue/partialSHIC).

PartialSHIC takes a single ms-style output that contains multiple replicates in it, for example I use 1000 replicates.
Basically, ms' style file looks exactly as in the original Hudson's ms format, i.e. like this:
ms 4 2 -t 5.0
27473 36154 10290

//
segsites: 4
positions: 0.0110 0.0765 0.6557 0.7571
0010
0100
0000
1001

//
segsites: 5
positions: 0.0491 0.2443 0.2923 0.5984 0.8312
00001
00000
00010
11110
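
For reference, one replicate block in this layout could be produced by a sketch like the following. The function name format_ms_replicate is hypothetical, and exact whitespace details of real ms output may differ from what is shown here.

```python
def format_ms_replicate(positions, haplotypes, precision=4):
    # Positions are assumed to be already scaled to the unit interval.
    position_text = " ".join(f"{p:.{precision}f}" for p in positions)
    lines = ["//", f"segsites: {len(positions)}", f"positions: {position_text}"]
    lines.extend(haplotypes)
    return "\n".join(lines) + "\n"


block = format_ms_replicate(
    [0.011, 0.0765, 0.6557, 0.7571],
    ["0010", "0100", "0000", "1001"],
)
```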

p.s. I will be away for hiking on weekends, but will be back soon and do testing.

yunusbb · 2020-10-03T04:03:22Z

Hi @saurabhbelsare

I did some quick testing. Maybe this is not the right way, but here is what I did.
I temporarily replaced my local version of trees.py copy/pasting your 'pull-request' version code and also added ms.py into my library:

/Users/bayazityunusbayev/Library/Python/3.6/lib/python/site-packages/tskit/trees.py
/Users/bayazityunusbayev/Library/Python/3.6/lib/python/site-packages/tskit/ms.py

and then, I loaded my previously simulated data stored in tree sequence

python3.6
import tskit
ts=tskit.load('slim_neutral_reps_1_Est_Dem_decap_subset_1800.trees')

ts.num_samples
1799

# then tested ms output, like this:

with open('output.ms', 'w') as ms_file:
        ts.write_ms(ms_file, mutation_rate=int(1.25e-8))

# ms output had only some header lines:

cat output.ms 
ms 1799 1 # This file is an ms-style output file generated from tskit. The two arguments written are sample size and number of replicates
999 # Setting random seed to 999

//

Perhaps I am doing something wrong here. I should learn what GitHub is all about, like what a pull request is and stuff.

petrelharp · 2020-10-04T02:08:39Z

Hi, @yunusbb! What you did sounded like it should work, but it depends on what version of tskit you're copying the file over. Here's a quick summary of how to do the git thing:

git clone https://github.com/tskit-dev/tskit.git
cd tskit/python
git fetch origin pull/854/head:ms_output
git checkout ms_output
make

Now, everything you do from this directory only will use the local version of tskit, matching this pull request. If you want to make edits, I'd recommend something different, but this is a quick and easy way to test out what's going on.

yunusbb · 2020-10-05T11:35:28Z

Hi, @petrelharp! Thank you for your help with this!

saurabhbelsare · 2020-10-08T18:33:52Z

I've updated a new version of the write_ms function, that addresses the points raised above. The write_ms function is now a function of tskit, and not tree_sequence. Hence it no longer needs the two different calls depending on whether num_replicates is used or not. All the breakpoints related code which was relevant to msprime has now been removed. All the other minor points have been fixed. The new tests for positions and genotypes have been added. Let me know how it looks now.

benjeffery · 2020-10-08T23:25:53Z

@saurabhbelsare Great stuff! Could you rebase and run pre-commit (see here and here). This will then make CI green. I'll do a proper review tomorrow. Thanks!

saurabhbelsare · 2020-10-09T06:12:43Z

Hi @benjeffery, I've run all the pre-commit checks and performed the corresponding modifications, and pushed that version. However, when I tried to do the rebasing, following these instructions, when I run git fetch upstream, I get the following error:

fatal: 'upstream' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

I'm not sure how to fix this. Sorry, I'm still not super on top of working with github.

benjeffery · 2020-10-09T06:38:50Z

Sounds like you don't have this fork set as a remote. git remote add upstream git@github.com:tskit-dev/tskit.git should fix this.

petrelharp · 2020-10-09T16:54:59Z

Or it might be called origin? Doing git remote -v shows you the list of remote repositories, and you should replace upstream with whatever git@github.com:tskit-dev/tskit.git is called (and if you don't have this repo set as a remote, you'll have to do what Ben said).

saurabhbelsare · 2020-10-09T17:28:57Z

@petrelharp, git remote -v did not show me the tskit-dev repo, so I added it as per @benjeffery's instructions. However, when I run git fetch upstream now, I get a new error:

git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

grahamgower · 2020-10-09T18:02:30Z

Hi @saurabhbelsare. Try using the https url for the repository. E.g.

git remote set-url upstream https://github.com/tskit-dev/tskit.git

saurabhbelsare · 2020-10-09T18:27:32Z

Hi @grahamgower, That worked, thanks! I got another error when I tried to run git rebase -i upstream/master, but fixed that by changing it to git rebase -i upstream/main since I remember that this nomenclature has changed. However, after I do the next step, where I edit the rebase instructions in the editor and change the pick arguments to squash, and then save the changes, I get the following error:

Auto-merging python/tskit/trees.py
CONFLICT (content): Merge conflict in python/tskit/trees.py
error: could not apply 089274d... Added the write_ms function to write out ms-style output from a tree sequence

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Could not apply 089274d... Added the write_ms function to write out ms-style output from a tree sequence

When I open trees.py, it is showing me the version of trees.py from the first commit, not the one from my latest commit, which is what I would have expected from the squashing. Hence I'm not sure how to resolve this. Sorry that I need step by step instructions for this, I haven't really done this before and I don't want to break anything.

petrelharp · 2020-10-09T18:35:11Z

In general, what you need to do in this case is go through the files with conflicts and find the places like this:

<<<<<<<
old stuff
=======
new stuff
>>>>>>>

and edit them to be the way you want. That'll be a bit annoying here, since you will probably have to do it four times (once for each of your commits). This has probably become difficult because the main branch has moved a good bit since you started this. Want me to do the rebase this time?

saurabhbelsare · 2020-10-09T20:55:30Z

Thanks @petrelharp, I've rebased and pushed it. Let me know if it looks right now.

petrelharp · 2020-10-10T03:04:00Z

Something went wrong there, @saurabhbelsare - this is not rebased to main. Looking at git log, the most recent two commits are

commit 1b062a81e59113d86cfd0db43c5a698e65bef10f (HEAD -> write_ms)
Author: Saurabh Belsare <smbelsare@gmail.com>
Date:   Wed Sep 16 23:03:44 2020 +0000

    Added the write_ms function to write out ms-style output from a tree sequence

commit ecfc5ad176398d2a04034c15a9f87fe874489b89
Author: Jerome Kelleher <jk@well.ox.ac.uk>
Date:   Tue Apr 14 13:58:47 2020 +0100

Maybe you forgot to push? You'll have to do git push -f after a rebase, since you've messed with history.

saurabhbelsare · 2020-10-10T04:44:48Z

I did a git push -f after I did the rebase. And I tried it again right now, and I get the message Everything up-to-date.

benjeffery · 2020-10-10T11:21:48Z

Hi @saurabhbelsare, thanks for persevering with this! Your branch is still not rebased to main. You need to make sure main is up to date:

git checkout main
git pull upstream main

then rebase your work on top:

git checkout master  # I think your local branch is called this; usually it is best to choose a unique name when you start, but this should still work
git rebase main
git push -f origin master  # assuming the remote for your fork is named origin

yunusbb · 2020-10-10T15:06:01Z

Hi @petrelharp,

I've tried to follow this recipe to do testing but was stuck at the compilation step with errors:
Anyway, I guess now I better wait until @saurabhbelsare will do the merging?

#just in case, this was my error during compilation (I've tried to read some gcc documentation and some other things but could not solve this issue)

make
python3 setup.py build_ext --inplace
running build_ext
building '_tskit' extension
creating build
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/lib
creating build/temp.linux-x86_64-3.6/lib/tskit
creating build/temp.linux-x86_64-3.6/lib/subprojects
creating build/temp.linux-x86_64-3.6/lib/subprojects/kastore
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -UNDEBUG -Ilib -Ilib/subprojects/kastore -I/usr/include/python3.6m -I/ebc_data/bayazit1/Estonian_Genomes/SLIM_SIMS/TREES2MS/tskit/python/.eggs/numpy-1.19.2-py3.6-linux-x86_64.egg/numpy/core/include -c _tskitmodule.c -o build/temp.linux-x86_64-3.6/_tskitmodule.o -std=c99
_tskitmodule.c:34:21: fatal error: kastore.h: No such file or directory
#include "kastore.h"
^
compilation terminated.
error: command 'gcc' failed with exit status 1
make: *** [ext3] Error 1

benjeffery · 2020-10-10T15:40:59Z

@yunusbb sounds like you need git submodule update --recursive --init

Docs at https://tskit.readthedocs.io/en/latest/development.html can help when building tskit.

yunusbb · 2020-10-10T16:58:10Z

Thanks @benjeffery ! It worked now.

yunusbb · 2020-10-12T12:57:14Z

Hi,
I've now tried this pull request #892 version and it worked.
I tested using one simulation repetition stored in a tree sequence format.

To have #892 locally, I followed steps provided by @petrelharp and @benjeffery, i.e.

git clone https://github.com/tskit-dev/tskit.git
cd tskit/python
git fetch origin pull/854/head:ms_output
git checkout ms_output
git submodule update --recursive --init

make

python3.8
import tskit
ts1=tskit.load('slim_neutral_reps_1_Est_Dem_decap_subset_1800.trees')

with open('output_1.ms', 'w') as ms_file:
    tskit.write_ms(ts1, ms_file, mutation_rate=float(1.25e-8))

My original tree sequence was generated using SLIM and then it was recapitated & mutated using msprime.
Now I will look into how it works with multiple replicates.

petrelharp · 2020-10-12T15:46:09Z

@yunusbb - great! Let us know if the output works for your pipeline?

benjeffery · 2020-10-12T15:47:34Z

@saurabhbelsare Would you like me to rebase this to main?

saurabhbelsare · 2020-10-12T17:22:14Z

Hi @benjeffery, is it possible for you to quickly do the rebase? I'm still getting merge conflicts when I try to do it and can't figure out what's going wrong. Sorry about that.

Hi @yunusbb, were you able to generate ms-style output with replicates using the write_ms function from this pull request now? The same function should work with and without replicates. Let me know if everything is working the way you need it to.

yunusbb · 2020-10-13T07:59:50Z

Hi, @saurabhbelsare!

So I did some testing this morning.
'write_ms' worked with the iterable object returned by msprime, but it did so only once. Strange, isn't it? I mean, when I repeated the msprime simulations with a smaller sample size (changed only one parameter) and tried 'write_ms' again, it gave an empty ms file.

here is the first successful try:

reps = msprime.simulate(
... sample_size=2000,
... Ne=10000,
... length=100000,
... mutation_rate=float(1.25e-8),
... recombination_rate=float(1.1e-8),
... random_seed=9889,
... num_replicates=3)

with open('output_1.ms', 'w') as ms_file:
... tskit.write_ms(ts1, ms_file, mutation_rate=float(1.25e-8), num_replicates=3)

wc -l test_msprime_3_sim_iters.ms
6014 test_msprime_3_sim_iters.ms

now the failed attempt, with one parameter changed (sample_size=100):

reps = msprime.simulate(
sample_size=100,
Ne=10000,
length=100000,
mutation_rate=float(1.25e-8),
recombination_rate=float(1.1e-8),
random_seed=9889,
num_replicates=3)

with open("test_msprime_3_sim_iters.ms", "w") as ms_file:
    tskit.write_ms(reps, ms_file, mutation_rate=float(1.25e-8), num_replicates=3)

wc -l test_msprime_3_sim_iters.ms
0 test_msprime_3_sim_iters.ms

As far as I remember, I did not change anything between these attempts.
Anyway, the ms-formatted file from my first attempt looked fine, i.e. one rep with a header line + 2 reps without.
So this gonna work fine with msprime iterable objects (provided that this strange behaviour would be resolved).
It also worked fine with num_replicates=1 by adding header by default.
Here I wonder if you can add an option to do 'header=False' for num_replicates=1? This can be useful for trees imported into msprime.
For example, I generate tree sequences in SLIM and then recapitate and add mutations in msprime.
So I would import tree sequences into msprime and if I had the 'header=False' option I could sequentially write ms output
by adding a header for the first imported tree sequence and then omitting it for the rest.

saurabhbelsare · 2020-10-13T18:04:24Z

Hi @yunusbb, I wasn't able to reproduce the problem you are seeing with changing the parameter. Here is my script:

import msprime
import tskit as ts 

reps = msprime.simulate(sample_size=2000, Ne=10000, length=100000, mutation_rate=float(1.25e-8), recombination_rate=float(1.1e-8), random_seed=9889, num_replicates=3)
with open('output_1.ms', 'w') as ms_file:
    ts.write_ms(reps, ms_file, mutation_rate=float(1.25e-8), num_replicates=3)

reps = msprime.simulate(sample_size=100, Ne=10000, length=100000, mutation_rate=float(1.25e-8), recombination_rate=float(1.1e-8), random_seed=9889, num_replicates=3)
with open("test_msprime_3_sim_iters.ms", "w") as ms_file:
    ts.write_ms(reps, ms_file, mutation_rate=float(1.25e-8), num_replicates=3)

Here are my outputs:
wc -l output_1.ms
6014 output_1.ms

wc -l test_msprime_3_sim_iters.ms
314 test_msprime_3_sim_iters.ms

Both the output files have data written out. Is there a mismatch in the tree_sequence object you are giving the write_ms function? In your first example, the output of msprime is to reps but you are giving write_ms the object ts1. The output file names are also different. Might that mismatch have something to do with the error?

I can add the parameter to manually turn off write_header for a single tree_sequence. Could you send me a short example where I can test it when I implement it? I'm not very familiar with SLiM and it'll be better if I can test it for the use case you're looking for. Thanks!

yunusbb · 2020-10-14T08:10:17Z

Hi @saurabhbelsare ,
Thank you for your help with this. Now everything works. I started everything from the beginning (quit & launch python, load modules, run msprime & output).
These two lines crept in by mistake, they are from my tests with SLIM tree sequences.

with open('output_1.ms', 'w') as ms_file:
... tskit.write_ms(ts1, ms_file, mutation_rate=float(1.25e-8), num_replicates=3)

Sorry for taking your time on this. I should have been able to spot this myself.

Anyway, I am sending you three SLIM generated tree sequence files.
They were post-processed using msprime (recapitation & mutate).

I have compressed them using right-click 'Compress' menu in MACOSX finder.

SLIM_generated_tree_sequences.zip

Here are my commands to load tree sequences and process with write_ms:

ts1=tskit.load('slim_neutral_reps_1_Est_Dem_decap_subset_1800.trees')
with open('output_1.ms', 'w') as ms_file:
    tskit.write_ms(ts1, ms_file, precision=6, mutation_rate=float(1.25e-8), num_replicates=1)

codecov · 2020-10-14T11:04:11Z

Codecov Report

Merging #854 into main will decrease coverage by 0.04%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main     #854      +/-   ##
==========================================
- Coverage   93.47%   93.43%   -0.05%     
==========================================
  Files          25       25              
  Lines       20029    20065      +36     
  Branches      796      808      +12     
==========================================
+ Hits        18723    18747      +24     
- Misses       1272     1281       +9     
- Partials       34       37       +3

Impacted Files	Coverage Δ
tskit/trees.py	`97.27% <0.00%> (-0.77%)`	⬇️

benjeffery · 2020-10-14T11:04:33Z

@saurabhbelsare I've rebased to master - there was only one small conflict. I'd suggest reading https://docs.github.com/en/free-pro-team@latest/github/collaborating-with-issues-and-pull-requests/resolving-a-merge-conflict-using-the-command-line for how to do this next time.

benjeffery · 2020-10-14T13:02:01Z

@saurabhbelsare I forgot to say that to fetch the rebased changes, you should not git pull as this will merge the rebased changes here with your local un-rebased changes. You need to first git fetch then git reset --hard origin/master

AdminBot-tskit · 2020-10-14T13:19:31Z

📖 Docs for this PR can be previewed here

benjeffery

Thanks for picking this up @saurabhbelsare! I think the main method needs some refactoring and simplifying, so I haven't reviewed the tests yet as they will change.

benjeffery · 2020-10-14T13:48:37Z

python/tskit/ms.py

+"""
+
+
+class msWriter:


I don't think this deserves a class: it has no lifetime or identity. I also don't think it deserves its own module as a single function; it could be rolled into the write_ms function in trees.py.

I've removed the class entirely and moved all the functions to trees.py

benjeffery · 2020-10-14T13:56:59Z

python/tskit/ms.py

+        tree_sequence,
+        print_trees=False,
+        precision=4,
+        mutation_rate=0,


mutation_rate has no purpose here, only made sense in the msprime code.

I've removed the mutation_rate argument

benjeffery · 2020-10-14T13:57:28Z

python/tskit/ms.py

+            "ms {} {}".format(
+                self._tree_sequence.sample_size, max(self._num_replicates, 1)
+            ),


I'm not sure this is right - my understanding is that the first line of an ms file is the command line used to generate it. Should we be putting " ".join(sys.argv) here?

These two are the first two arguments in the ms file format. Including the rest is a complicated procedure, since the tree_sequence could be generated from different software like msprime or SLiM, and it is not straightforward to recreate all the ms-style command line arguments from that. I discussed this with @petrelharp a while ago and we thought that only having these two basic arguments makes sense.

The question is: what do downstream users expect to see here? If no-one is parsing this line, then it would make sense to put something about provenance here, as @benjeffery suggests. But I do worry that people might be parsing it a bit (well, hopefully just looking for a line starting with ms X Y to skip), and they might be pulling the sample size and number of replicates out of it. So, I guess I vote for leaving it as-is?
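
As an illustration of the kind of downstream parsing being worried about here, a consumer might pull the two leading arguments out of the first line roughly like this. The function parse_ms_header is a hypothetical example of such a consumer, not code from any particular tool.

```python
def parse_ms_header(line):
    # Expects a line of the form "ms <sample size> <replicates> [options...]".
    fields = line.split()
    if len(fields) < 3 or fields[0] != "ms":
        raise ValueError("not an ms command line")
    return int(fields[1]), int(fields[2])


sample_size, num_replicates = parse_ms_header("ms 4 2 -t 5.0")
```

A parser written this way would break if the first line were replaced with provenance text or with the arguments of an unrelated Python script.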

benjeffery · 2020-10-14T13:59:45Z

python/tskit/ms.py

+        newick = tree.newick(precision=self._precision)
+        print(f"[{tree.span}]", newick, file=output)
+
+    def __write_header(self, output):


Don't think this needs to be a separate function.

Merged with the main function.

benjeffery · 2020-10-14T16:20:54Z

python/tskit/ms.py

+            ),
+            file=output,
+        )
+        print("{}".format(999), file=output)


Is this the random seed line in the ms file? If it has to be numeric then using 0 would be better, at first I assumed 999 had a special meaning.

I've replaced 999 with 0.

benjeffery · 2020-10-14T16:31:47Z

python/tskit/ms.py

+                ]
+                for position in positions:
+                    print(
+                        "{0:.{1}f}".format(position, self._precision),


You can use an f string here.

I've replaced the printing with f strings.
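
For the record, the two spellings give the same text; the nested braces supply the precision in both cases. The values below are arbitrary examples.

```python
position = 0.0765
precision = 6

# str.format with a nested replacement field for the precision.
old_style = "{0:.{1}f}".format(position, precision)

# Equivalent f-string: the inner braces are evaluated first.
new_style = f"{position:.{precision}f}"
```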

benjeffery · 2020-10-14T16:33:35Z

python/tskit/ms.py

+                    if set(tmp_str).issubset({"0", "1", "-"}):
+                        print(tmp_str, file=output)
+                    #################################################
+                    # Introducing an error to exit if the sequence  #


Comment isn't needed now we have the exception text.

Removed the comment.

benjeffery · 2020-10-14T16:35:59Z

python/tskit/trees.py

+                writer = ms.msWriter(
+                    tree_seq,
+                    print_trees=print_trees,
+                    precision=precision,
+                    mutation_rate=mutation_rate,
+                    num_replicates=num_replicates,
+                    write_header=True,
+                )
+            else:
+                writer = ms.msWriter(
+                    tree_seq,
+                    print_trees=print_trees,
+                    precision=precision,
+                    mutation_rate=mutation_rate,
+                    num_replicates=num_replicates,
+                    write_header=False,
+                )


Here you can have one call to msWriter with write_header=(i==0)

Implemented this change.
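
A small sketch of the single-call pattern, with a hypothetical write_one function standing in for the msWriter call in the diff: the header flag is derived from the loop index, so there is no need for two branches.

```python
import io


def write_one(replicate, output, write_header):
    # Hypothetical stand-in for the per-replicate writer in the diff.
    if write_header:
        print("ms 2 3", file=output)
        print("0", file=output)
    print(file=output)
    print("//", file=output)
    print(replicate, file=output)


output = io.StringIO()
for i, replicate in enumerate(["rep-a", "rep-b", "rep-c"]):
    # Only the first replicate gets the header lines.
    write_one(replicate, output, write_header=(i == 0))
result = output.getvalue()
```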

saurabhbelsare · 2020-10-16T21:48:52Z

Hi @benjeffery, I've implemented all the fixes you've suggested. The tests have also been modified to work with the new structure. Thanks for all the suggestions! Let me know if everything looks good now.

Hi @yunusbb, The function now has a write_header argument. Your example runs as follows:


import tskit as ts

ts1=ts.load('SLIM_generated_tree_sequences/slim_neutral_reps_1_Est_Dem_decap_subset_1800.trees')
ts2=ts.load('SLIM_generated_tree_sequences/slim_neutral_reps_2_Est_Dem_decap_subset_1800.trees')
ts3=ts.load('SLIM_generated_tree_sequences/slim_neutral_reps_3_Est_Dem_decap_subset_1800.trees')
with open('output_1.ms', 'w') as ms_file:
    ts.write_ms(ts1, ms_file, precision=6, num_replicates=1, write_header=True)
    ts.write_ms(ts2, ms_file, precision=6, num_replicates=1, write_header=False)
    ts.write_ms(ts3, ms_file, precision=6, num_replicates=1, write_header=False)

Let me know if this is what you are looking for, and if the output is as you expect.

yunusbb · 2020-10-17T16:35:10Z

Hi @saurabhbelsare! Fantastic! That is exactly what is needed to process "non-msprime" tree sequences. Thank you for taking the time to do this!

benjeffery

Heading in the right direction @saurabhbelsare! I still think this could be simpler as just one function though.

benjeffery · 2020-10-19T10:35:23Z

python/tskit/trees.py

+        )
+
+
+def print_ms_file_trees(tree_seq, precision, output):


As this function is only called in one place, it doesn't have a life of its own. Best to inline it into print_ms_file.

Inlined the function

benjeffery · 2020-10-19T10:36:32Z

python/tskit/trees.py

+    print(newick, file=output)
+
+
+def print_ms_file(


This function too can be inlined.

Inlined the function

benjeffery · 2020-10-19T10:41:24Z

python/tskit/trees.py

+    are sample size and number of replicates. The second line has a 0 as a substitute
+    for the random seed.
+    """
+    if isinstance(tree_sequence, collections.Iterable):


Here you could do tree_sequence=[tree_sequence] if the argument wasn't an iterable. This would let you inline print_ms_file.

Made this modification.
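A minimal sketch of the wrapping idea, assuming the tree sequence object itself is not iterable. `normalise_tree_sequences` and `FakeTreeSequence` are hypothetical names for illustration only. The sketch uses `collections.abc.Iterable`, since the `collections.Iterable` alias seen in the diff was removed in Python 3.10.

```python
from collections.abc import Iterable


def normalise_tree_sequences(tree_sequence):
    # Wrap a single object in a list so the rest of the function can
    # always loop, whether it was given one tree sequence or many.
    if not isinstance(tree_sequence, Iterable):
        tree_sequence = [tree_sequence]
    return list(tree_sequence)


class FakeTreeSequence:
    """Hypothetical non-iterable stand-in for a tree sequence."""


single = FakeTreeSequence()
from_single = normalise_tree_sequences(single)
from_many = normalise_tree_sequences([single, single])
```

With the argument normalised up front, the body can be written once as a loop, which is what allows print_ms_file to be inlined.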

benjeffery · 2020-10-19T10:42:34Z

python/tskit/trees.py

+    Print out the trees in ms-format from the specified tree sequence.
+    """
+    tree = next(tree_seq.trees())
+    newick = tree.newick(precision=precision)


This is only printing one tree, is that right?

You are right, this was a mistake; I've fixed it now. Comparing with the ms output: when there is no recombination, and therefore only one tree, ms does not write out the span, whereas when there are multiple trees, it does. The new print_trees part of the function does the same. I also looked carefully at the ms manual, and the -T argument, which prints trees, suppresses the output of genotypes. I have modified the write_ms function to behave accordingly.
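The single-tree versus multi-tree behaviour described above might look something like the sketch below. `format_ms_trees` is a hypothetical helper that takes plain `(span, newick)` pairs instead of real tskit trees, and the exact bracket layout and float formatting are assumptions that should be checked against the ms manual.

```python
def format_ms_trees(trees, precision=4):
    """Return ms-style tree lines for a list of (span, newick) pairs."""
    if len(trees) == 1:
        # No recombination: write the bare Newick string, without a span.
        return [trees[0][1]]
    # Several trees: prefix each Newick string with the span it covers.
    return [f"[{span:.{precision}f}]{newick}" for span, newick in trees]


# Toy input; the Newick strings are made up for illustration.
one_tree = format_ms_trees([(10.0, "(1:1.0,2:1.0);")])
two_trees = format_ms_trees(
    [(4.0, "(1:1.0,2:1.0);"), (6.0, "(1:2.0,2:2.0);")], precision=2
)
```

Iterating over every pair, rather than taking only the first with next(), is the fix for the "only printing one tree" problem raised in the review.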

AdminBot-tskit · 2020-10-20T22:11:28Z

📖 Docs for this PR can be previewed here

saurabhbelsare · 2020-10-20T22:13:45Z

Hi @benjeffery, I've done all the latest modifications you suggested and (hopefully correctly) rebased and squashed the new commits. Let me know how things look now.

jeromekelleher

LGTM, thanks @saurabhbelsare! I think there's a few things we can improve a little bit, but the basic functionality is here so I think we can merge and file some issues to track the rest.

benjeffery · 2020-10-21T12:30:42Z

Thanks @saurabhbelsare!

petrelharp reviewed Sep 17, 2020

jeromekelleher reviewed Sep 17, 2020

petrelharp mentioned this pull request Sep 18, 2020

Generating ms-output from tree sequence #727

Closed

benjeffery changed the base branch from master to main September 28, 2020 12:11

petrelharp mentioned this pull request Oct 4, 2020

docs on checking out a pr #892

Merged

saurabhbelsare force-pushed the master branch from 4294da3 to 1b062a8 Compare October 9, 2020 20:53

benjeffery force-pushed the master branch from 1b062a8 to 16fac57 Compare October 14, 2020 10:47

benjeffery force-pushed the master branch from 16fac57 to 33bfbb8 Compare October 14, 2020 13:17

benjeffery reviewed Oct 14, 2020

benjeffery reviewed Oct 19, 2020

Folded all of write_ms into a single function. Split the print_trees function from print genotypes, and added iterator to print all trees

6e3a671

saurabhbelsare force-pushed the master branch from 8b48a3f to 6e3a671 Compare October 20, 2020 22:09

jeromekelleher approved these changes Oct 21, 2020

benjeffery added the AUTOMERGE-REQUESTED label Oct 21, 2020

mergify bot merged commit a5f9c30 into tskit-dev:main Oct 21, 2020

mergify bot removed the AUTOMERGE-REQUESTED label Oct 21, 2020

petrelharp mentioned this pull request May 24, 2021

tskit.write_ms is not documented #1464

Open

