# Sweep inference

This notebook contains the code used to generate the underlying data for Figure 2 and Supplementary Figure 1 of the manuscript. It assumes that the Relate (https://myersgroup.github.io/relate/) and SINGER (https://github.com/popgenmethods/SINGER/blob/main/releases/) binaries are installed in the current directory, and that the tsinfer, tsdate, tskit, arg_needle (https://palamaralab.github.io/software/argneedle/manual/) and arg_needle_lib libraries are installed in the Python environment.
Note that ARG-Needle requires Python <= 3.11.

In [None]:
#%pip install tsinfer
#%pip install tsdate
#%pip install tskit
#%pip install arg_needle
#%pip install arg_needle_lib
import subprocess
import numpy as np
import tsinfer
import tsdate
import tskit
import arg_needle
import arg_needle_lib

In [None]:
mut_rate = 1e-8
pop_size = 10_000
recomb_rate = 1e-8
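
As a rough sanity check on these parameters, the sketch below computes the number of segregating sites expected under a neutral, constant-size coalescent (Watterson's formula). It assumes a 5 Mb sequence and 600 sampled haplotypes, matching the simulated tree sequence loaded in the next cell; the function name is illustrative only. Because the simulation includes a selective sweep, the observed number of sites would be expected to fall below this neutral value.

```python
def expected_segregating_sites(pop_size, mut_rate, seq_length, num_haplotypes):
    """Neutral expectation E[S] = theta * a_n, with theta = 4 * Ne * mu * L."""
    theta = 4 * pop_size * mut_rate * seq_length
    # a_n is the harmonic sum 1 + 1/2 + ... + 1/(n - 1)
    a_n = sum(1 / i for i in range(1, num_haplotypes))
    return theta * a_n

# Assumed values: diploid Ne and per-base mutation rate as set above,
# sequence length and haplotype count as in the simulated data
neutral_sites = expected_segregating_sites(
    pop_size=10_000, mut_rate=1e-8, seq_length=5_000_000, num_haplotypes=600
)
print(f"Expected segregating sites under neutrality: {neutral_sites:.0f}")
```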

First we load the simulated tree sequence created in `sweep_simulation.ipynb`. This is our ground truth data for the inferences.

In [64]:
sim_ts = tskit.load("true_topology.trees")
sim_ts

Tree Sequence,Unnamed: 1
Trees,9 231
Sequence Length,5 000 000
Time Units,generations
Sample Nodes,600
Total Size,2.1 MiB
Metadata,No Metadata

Table,Rows,Size,Has Metadata
Edges,33 929,1.0 MiB,
Individuals,300,8.2 KiB,
Migrations,0,8 Bytes,
Mutations,10 681,386.0 KiB,
Nodes,7 360,201.3 KiB,
Populations,1,224 Bytes,✅
Provenances,2,1.9 KiB,
Sites,10 668,260.5 KiB,

Provenance Timestamp,Software Name,Version,Command,Full record
"30 July, 2025 at 02:13:24 PM",msprime,1.3.4,sim_mutations,Details  dict  schema_version: 1.0.0  software:  dict  name: msprime version: 1.3.4  parameters:  dict  command: sim_mutations  tree_sequence:  dict  __constant__: __current_ts__  rate: 1e-08 model: None start_time: None end_time: None discrete_genome: None keep: None random_seed: 4321  environment:  dict  os:  dict  system: Linux node: Savita release: 6.6.87.1-microsoft-standard-WSL2 version: #1 SMP PREEMPT_DYNAMIC Mon Apr 21 17:08:54 UTC 2025 machine: x86_64  python:  dict  implementation: CPython version: 3.12.3  libraries:  dict  kastore:  dict  version: 2.1.1  tskit:  dict  version: 0.6.4  gsl:  dict  version: 2.6
"30 July, 2025 at 02:13:24 PM",msprime,1.3.4,sim_ancestry,Details  dict  schema_version: 1.0.0  software:  dict  name: msprime version: 1.3.4  parameters:  dict  command: sim_ancestry samples: 300 demography: None sequence_length: 5000000 discrete_genome: None recombination_rate: 1e-08 gene_conversion_rate: None gene_conversion_tract_length: None population_size: 10000 ploidy: None  model:  list  dict  duration: None position: 2500000.0 start_frequency: 0.0001 end_frequency: 0.9999 s: 0.25 dt: 1e-06 __class__: msprime.ancestry.SweepGenicSelection  dict  duration: None __class__: msprime.ancestry.StandardCoalescent  initial_state: None start_time: None end_time: None record_migrations: None record_full_arg: None additional_nodes: None coalescing_segments_only: None num_labels: None random_seed: 1234 replicate_index: 0  environment:  dict  os:  dict  system: Linux node: Savita release: 6.6.87.1-microsoft-standard-WSL2 version: #1 SMP PREEMPT_DYNAMIC Mon Apr 21 17:08:54 UTC 2025 machine: x86_64  python:  dict  implementation: CPython version: 3.12.3  libraries:  dict  kastore:  dict  version: 2.1.1  tskit:  dict  version: 0.6.4  gsl:  dict  version: 2.6


To run tsinfer on the simulated data, we convert `sim_ts` to tsinfer's sample data format and infer a tree sequence from it.
Once the topology has been inferred, we simplify the result, use tsdate with mutation rate `mut_rate` to estimate the ages of ancestral nodes, and write the dated tree sequence to file.

In [None]:
#tsinfer + tsdate (one iteration of tsinfer, tsdate v 0.2)
# took just over 1 minute to run this cell on an Intel i7-12700H CPU with 16 GB RAM

i_ts = tsinfer.infer(tsinfer.SampleData.from_tree_sequence(sim_ts))
s_ts = i_ts.simplify()
d_ts = tsdate.date(
    s_ts,
    mutation_rate=mut_rate, max_shape=1000)
d_ts.dump("tsinfer_tsdated.trees")

2025-07-30 14:14:27 INFO     Max encoded genotype matrix size=6.1 MiB
2025-07-30 14:14:27 INFO     Starting addition of 10668 sites
2025-07-30 14:14:29 INFO     Finished adding sites
2025-07-30 14:14:29 INFO     Ancestor builder peak RAM: 4.0 MiB
2025-07-30 14:14:29 INFO     Starting build for 5281 ancestors
2025-07-30 14:14:30 INFO     Finished building ancestors
2025-07-30 14:14:30 INFO     Mismatch prevented by setting constant high recombination and low mismatch probabilities
2025-07-30 14:14:30 INFO     Summary of recombination probabilities between sites: min=0.01; max=0.01; median=0.01; mean=0.01
2025-07-30 14:14:30 INFO     Summary of mismatch probabilities over sites: min=1e-20; max=1e-20; median=1e-20; mean=1e-20
2025-07-30 14:14:30 INFO     Matching using 13 digits of precision in likelihood calcs
2025-07-30 14:14:30 INFO     583 epochs with 4.0 median size.
2025-07-30 14:14:30 INFO     First large (>2000.0) epoch is 583
2025-07-30 14:14:30 INFO     Grouping 5283 ancestors b

For inferences using Relate, SINGER and ARG-Needle, we use helper functions written by Dr Yan Wong (https://github.com/tskit-dev/tsinfer/issues/877) to format the input tree sequence as required by each method and to run the inference from the notebook.

In [67]:
#Convert input data to haps/sample format
def ts_to_haps_sample(ts, haps_output, sample_output, chromosome_number=1, sample_name_field="name"):
    """
    Output the tree sequence in haps / sample format, as required by Relate and ARG-Needle
    (see https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample)

    ``haps_output`` and ``sample_output`` are the filehandles to which the data will be written.
    To obtain either as strings, you can pass an io.StringIO object here.

    ``sample_name_field`` gives the metadata field in which to look up names to use in the
    output sample file. Where possible, names are taken from the associated individual metadata.
    If samples are not associated with individuals (i.e. this is haploid data), then
    names are taken from node metadata. If no ``sample_name_field`` is present in the metadata,
    the names used are "Individual_N" if samples are associated with individuals, or "Sample_N"
    otherwise.

    Returns an array of the site_ids that were written to the haps file (sites
    with 1 allele or > 2 alleles are skipped)

    .. example::

        with open("out.haps", "wt") as haps, open("out.sample", "wt") as sample:
            ts_to_haps_sample(ts, haps, sample)
    """
    used = np.zeros(ts.num_sites, dtype=bool)
    for v in ts.variants():
        if len(v.alleles) == 1:
            continue
        if len(v.alleles) > 2:
            print(f"Multialleic site ({v.alleles}) at position {v.site.position} ignored")
            continue
        used[v.site.id] = True
        print(
            str(chromosome_number),
            f"SNP{v.site.id}",
            int(v.site.position),
            v.alleles[0],
            v.alleles[1],
            " ".join([str(g) for g in v.genotypes]),
            sep=" ",
            file=haps_output,
        )

    print("ID_1 ID_2 missing", file=sample_output)
    print("0    0    0", file=sample_output)
    individuals = ts.nodes_individual[ts.samples()]
    if np.all(individuals == tskit.NULL):
        # No individuals, just use node metadata
        pass
    else:
        if np.any(individuals == tskit.NULL):
            raise ValueError("Some samples have no individuals")
        _, counts = np.unique(individuals, return_counts=True)
        if np.all(counts == 2):
            # Samples 0&1, 2&3, ... must share an individual, so every other diff must be zero
            if np.any(np.diff(individuals)[0::2] != 0):
                raise ValueError("Pairs of adjacent samples must come from the same individual")
        elif np.all(counts == 1):
            pass
        else:
            raise ValueError("Must have all diploid or all haploid samples")
    samples = ts.samples()
    i=0
    while i < len(samples):
        ind1 = ts.node(samples[i]).individual
        ind2 = tskit.NULL
        if ind1 == tskit.NULL:
            try:
                name = ts.node(samples[i]).metadata[sample_name_field].replace(" ", "_")
            except (TypeError, KeyError):
                name = f"Sample_{samples[i]}"
        else:
            try:
                name = ts.individual(ind1).metadata[sample_name_field].replace(" ", "_")
            except (TypeError, KeyError):
                name = f"Individual_{ind1}"
            try:
                ind2 = ts.node(samples[i+1]).individual
            except IndexError:
                pass
        if ind2 == tskit.NULL or ind2 != ind1:
            print(f'{name} NA 0', file=sample_output)
            i += 1
        else:
            print(f'{name} {name} 0', file=sample_output)
            i += 2
    return np.where(used)[0].astype(ts.mutations_site.dtype)
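
To make the output format concrete, here is a minimal, standalone sketch of how one biallelic site and two diploid individuals are laid out in the haps and sample files. It mirrors the `print` calls in `ts_to_haps_sample` but uses hard-coded toy genotypes; `format_haps_line` is a hypothetical helper, not part of any library.

```python
import io

def format_haps_line(chromosome, site_id, position, alleles, genotypes):
    """Return one haps line: chromosome, SNP id, position, two alleles, then 0/1 per haplotype."""
    if len(alleles) != 2:
        raise ValueError("haps format only supports biallelic sites")
    fields = [str(chromosome), f"SNP{site_id}", str(int(position)), alleles[0], alleles[1]]
    fields.extend(str(g) for g in genotypes)
    return " ".join(fields)

# Toy data: two diploid individuals, i.e. four haplotypes at a single site
haps = io.StringIO()
print(format_haps_line(1, 0, 1284.0, ("A", "T"), [0, 1, 1, 0]), file=haps)

sample = io.StringIO()
print("ID_1 ID_2 missing", file=sample)  # first header line
print("0    0    0", file=sample)        # second header line
for name in ("Individual_0", "Individual_1"):
    # Diploid individuals repeat their name in both ID columns
    print(f"{name} {name} 0", file=sample)

print(haps.getvalue(), end="")
print(sample.getvalue(), end="")
```

Each haps row carries one column per haplotype, so adjacent pairs of genotype columns correspond to one row of the sample file (after its two header lines).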

In [None]:
#Run Relate
# took ~1.5 min to run this cell 
def run_relate(ts, population_size, mut_rate, recomb_rate, random_seed=111, 
               path_to_relate="relate_v1.2.2_x86_64_dynamic/"):
    with open("true_topology.haps", "wt") as haps, open("true_topology.sample", "wt") as sample:
        # ts_to_haps_sample routine from https://github.com/tskit-dev/tsconvert/issues/55#issuecomment-1831959994
        ts_to_haps_sample(ts, haps, sample)

    # Write a uniform genetic map: rate per bp per generation * 1e8 gives cM/Mb
    with open("true_topology.map", "wt") as map_file:
        cM_per_MB = recomb_rate * 1e8
        print("pos", "COMBINED_rate", "Genetic_Map", sep=" ", file=map_file)
        print(0, f"{cM_per_MB:.5f}", 0, sep=" ", file=map_file)
        print(
            int(ts.sequence_length),
            f"{cM_per_MB:.5f}",
            ts.sequence_length / 1e6 * cM_per_MB,
            sep=" ",
            file=map_file)
    # The map file is closed (and therefore flushed) before Relate reads it

    params = [
        path_to_relate + "bin/Relate",
        "--haps", "true_topology.haps",
        "--sample", "true_topology.sample",
        "--map", "true_topology.map",
        "-o", "relate",
        "--mode", "All",
        "-m", f"{mut_rate}",
        "-N", f"{population_size}",
        "--seed", f"{random_seed}",
    ]
    print(f"running `{' '.join(params)}`")
    # check=True raises an error if Relate exits with a non-zero status
    subprocess.run(params, check=True)

    # Convert to tree sequence format
    params = [
        path_to_relate + "/bin/RelateFileFormats",
        "--mode", "ConvertToTreeSequence",
        "-i", "relate",
        "-o", "relate",
    ]
    print(f"running `{' '.join(params)}`")
    subprocess.run(params, check=True)
    return tskit.load("relate.trees")

relate_ts = run_relate(sim_ts, pop_size/2, mut_rate, recomb_rate)

Multialleic site (('T', 'C', 'A')) at position 1284745.0 ignored
Multialleic site (('C', 'G', 'T')) at position 1307324.0 ignored
Multialleic site (('C', 'G', 'T')) at position 2859263.0 ignored
Multialleic site (('A', 'G', 'C')) at position 3599479.0 ignored
Multialleic site (('A', 'C', 'G')) at position 3757839.0 ignored
Multialleic site (('C', 'T', 'A')) at position 3892870.0 ignored
Multialleic site (('G', 'A', 'T')) at position 3908018.0 ignored
Multialleic site (('G', 'A', 'C')) at position 4071401.0 ignored
Multialleic site (('C', 'G', 'T')) at position 4214239.0 ignored
running `relate_v1.2.2_x86_64_dynamic/bin/Relate --haps true_topology.haps --sample true_topology.sample --map true_topology.map -o relate --mode All -m 1e-08 -N 5000.0 --seed 111`



*********************************************************
---------------------------------------------------------
Relate
 * Authors: Leo Speidel, Marie Forest, Sinan Shi, Simon Myers.
 * Doc:     https://myersgroup.github.io/relate
---------------------------------------------------------

---------------------------------------------------------
Using:
  true_topology.haps
  true_topology.sample
  true_topology.map
with mu = 1e-08 and 2Ne = 5000.
---------------------------------------------------------

---------------------------------------------------------
Parsing data..
CPU Time spent: 0.122159s; Max Memory usage: 8.3e+02Mb.
---------------------------------------------------------

---------------------------------------------------------
Read 600 haplotypes with 10659 SNPs per haplotype.
Expected minimum memory usage: 2.8Gb.
---------------------------------------------------------

---------------------------------------------------------
Starting chunk 0 of 0.
-----------

running `relate_v1.2.2_x86_64_dynamic//bin/RelateFileFormats --mode ConvertToTreeSequence -i relate -o relate`


CPU Time spent: 1.176646s; Max Memory usage: 830.348Mb.
---------------------------------------------------------
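
The map file written by `run_relate` describes a uniform recombination rate with just two data rows. A small standalone sketch of that conversion (per-base rate to cM/Mb, then cumulative cM) is shown below; `uniform_relate_map` is a hypothetical name, and the column layout simply copies the `print` calls in the function above.

```python
def uniform_relate_map(sequence_length, recomb_rate):
    """Return the lines of a three-column map for a constant recombination rate."""
    # 1 Morgan = 100 cM and 1 Mb = 1e6 bp, so rate per bp * 1e8 gives cM/Mb
    cm_per_mb = recomb_rate * 1e8
    # Cumulative genetic distance at the end of the sequence, in cM
    total_cm = sequence_length / 1e6 * cm_per_mb
    return [
        "pos COMBINED_rate Genetic_Map",
        f"0 {cm_per_mb:.5f} 0",
        f"{int(sequence_length)} {cm_per_mb:.5f} {total_cm}",
    ]

map_lines = uniform_relate_map(sequence_length=5_000_000, recomb_rate=1e-8)
print("\n".join(map_lines))
```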



In [70]:
#SINGER
#download and uncompress .tar.gz from https://github.com/popgenmethods/SINGER/blob/main/releases/ (v 0.1.8 used here)
#took ~52 min to run this cell
with open("true_topology.vcf", "w") as vcf_file:
    sim_ts.write_vcf(vcf_file)
cmd = [
    "./singer-0.1.8-beta-linux-x86_64/singer_master",
    "-m", str(mut_rate),
    "-vcf", "true_topology",
    "-start", "0",
    "-end", "4999999",
    "-output", "singer"
]

subprocess.run(cmd, check=True)

/home/shug7116/tsbrowse_paper/test/singer-0.1.8-beta-linux-x86_64/singer -Ne 6619.98160177412 -m 1e-08 -r 1e-08 -input true_topology -output singer -start 0.0 -end 4999999.0 -polar 0.5 -n 100 -thin 20 -seed 42
valid mutations: 10658
removed mutations: 9
Iteration: 1
[14:27:16.340] : begin BSP
BSP avg num of states: -nan
[14:27:16.352] : begin sampling branches
[14:27:16.353] : begin TSP
[14:27:16.384] : begin sampling points
[14:27:16.387] : begin adding
[14:27:16.389] : begin sampling recombination
[14:27:16.390] : finish
776
Number of incompatibilities: 0
Number of flippings: 0
Iteration: 2
[14:27:16.393] : begin BSP
BSP avg num of states: 9.89128
[14:27:16.408] : begin sampling branches
[14:27:16.409] : begin TSP
[14:27:16.432] : begin sampling points
[14:27:16.435] : begin adding
[14:27:16.438] : begin sampling recombination
[14:27:16.438] : finish
1156
Number of incompatibilities: 0
Number of flippings: 43
Iteration: 3
[14:27:16.444] : begin BSP
BSP avg num of states: 10.9181
[14:

CompletedProcess(args=['./singer-0.1.8-beta-linux-x86_64/singer_master', '-m', '1e-08', '-vcf', 'true_topology', '-start', '0', '-end', '4999999', '-output', 'singer'], returncode=0)

In [71]:
# Convert SINGER output to tskit format
cmd = [
    "./singer-0.1.8-beta-linux-x86_64/convert_to_tskit",
    "-i", "singer",
    "-o", "singer",
    "-start", "0",
    "-end", "99"
]
subprocess.run(cmd, check=True)


CompletedProcess(args=['./singer-0.1.8-beta-linux-x86_64/convert_to_tskit', '-i', 'singer', '-o', 'singer', '-start', '0', '-end', '99'], returncode=0)

In [72]:
# ARG-Needle: write the haps/sample, genetic map and demography input files
with open("true_topology_argn.haps", "wt") as haps, open("true_topology_argn.sample", "wt") as sample:
    sites = ts_to_haps_sample(sim_ts, haps, sample)
with open("true_topology_argn.map", "wt") as map_file, open("true_topology_argn.demo", "wt") as demo:
    # Make the required mapfile (one line per variant)
    # https://palamaralab.github.io/software/argneedle/manual/#genetic-map-mapmapgz
    # chromosome SNP_name genetic_position_cM physical_position_bp
    for s in sites:
        pos = sim_ts.site(s).position
        print("1", f"Site{s}", f"{pos * recomb_rate * 100}", f"{pos}", sep="\t", file=map_file)
    # Demography file with a constant population size
    print("\t".join(["0.0", str(pop_size)]), file=demo)
    print("\t".join(["5000.0", str(pop_size)]), file=demo)
# If you hit errors about the numpy version needing to be downgraded, try running the steps below directly from the terminal
#!arg_needle --hap_gz true_topology_argn.haps --map true_topology_argn.map --mode sequence --normalize_demography true_topology_argn.demo --out arg_needle
#!arg2tskit --arg_path arg_needle.argn --ts_path arg_needle.trees

Multialleic site (('T', 'C', 'A')) at position 1284745.0 ignored
Multialleic site (('C', 'G', 'T')) at position 1307324.0 ignored
Multialleic site (('C', 'G', 'T')) at position 2859263.0 ignored
Multialleic site (('A', 'G', 'C')) at position 3599479.0 ignored
Multialleic site (('A', 'C', 'G')) at position 3757839.0 ignored
Multialleic site (('C', 'T', 'A')) at position 3892870.0 ignored
Multialleic site (('G', 'A', 'T')) at position 3908018.0 ignored
Multialleic site (('G', 'A', 'C')) at position 4071401.0 ignored
Multialleic site (('C', 'G', 'T')) at position 4214239.0 ignored
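
Unlike the two-row map used for Relate, the ARG-Needle map has one row per variant. The sketch below shows, on toy positions, how each row is built from a physical position and a constant recombination rate; `argneedle_map_line` is a hypothetical helper that follows the column order given in the comment in the cell above.

```python
import io

def argneedle_map_line(site_id, position, recomb_rate, chromosome="1"):
    """One tab-separated map row: chromosome, SNP name, position in cM, position in bp."""
    # Rate per bp per generation * position gives Morgans; * 100 converts to cM
    genetic_pos_cm = position * recomb_rate * 100
    return "\t".join([chromosome, f"Site{site_id}", f"{genetic_pos_cm}", f"{position}"])

# Toy positions only; the notebook uses the positions of the retained sites
out = io.StringIO()
for site_id, position in [(0, 1_000_000.0), (1, 2_500_000.0)]:
    print(argneedle_map_line(site_id, position, recomb_rate=1e-8), file=out)

rows = [line.split("\t") for line in out.getvalue().splitlines()]
print(out.getvalue(), end="")
```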


In [None]:
#took ~19 minutes to run this cell
cmd = [
    "arg_needle",
    "--hap_gz", "true_topology_argn.haps",
    "--map", "true_topology_argn.map",
    "--mode", "sequence",
    "--normalize_demography", "true_topology_argn.demo",
    "--out", "arg_needle"
]

subprocess.run(cmd, check=True)

2025-07-30 15:19:45 INFO     Command-line arguments:
2025-07-30 15:19:45 INFO       asmc_clust: 0
2025-07-30 15:19:45 INFO       asmc_clust_chunk_sites: -1
2025-07-30 15:19:45 INFO       asmc_decoding_file: /home/shug7116/tsbrowse_paper/test/env_3.11/lib/python3.11/site-packages/arg_needle/resources/30-100-2000_CEU.decodingQuantities.gz
2025-07-30 15:19:45 INFO       asmc_pad_cm: 2.0
2025-07-30 15:19:45 INFO       asmc_tmp_string: asmc
2025-07-30 15:19:45 INFO       backup_hash_word_size: 8
2025-07-30 15:19:45 INFO       chromosome: 1
2025-07-30 15:19:45 INFO       hap_gz: true_topology_argn.haps
2025-07-30 15:19:45 INFO       hash_tolerance: 1
2025-07-30 15:19:45 INFO       hash_topk: 64
2025-07-30 15:19:45 INFO       hash_word_size: 16
2025-07-30 15:19:45 INFO       map: true_topology_argn.map
2025-07-30 15:19:45 INFO       mode: sequence
2025-07-30 15:19:45 INFO       normalize: 1
2025-07-30 15:19:45 INFO       normalize_demography: true_topology_argn.demo
2025-07-30 15:19:45 INFO  

Read 300 samples.
Read data for 600 haploid samples and 10659 markers, 1 of which are monomorphic. This job will focus on 600 haploid samples.
Using precomputed decoding info from /home/shug7116/tsbrowse_paper/test/env_3.11/lib/python3.11/site-packages/arg_needle/resources/30-100-2000_CEU.decodingQuantities.gz
Will decode using AVX instruction set.

Using expected coalescent times from /home/shug7116/tsbrowse_paper/test/env_3.11/lib/python3.11/site-packages/arg_needle/resources/30-100-2000_CEU.decodingQuantities.gz


2025-07-30 15:19:46 INFO     About to thread 600 samples
2025-07-30 15:19:46 INFO     Hashing parameters:
2025-07-30 15:19:46 INFO       K for top K: 64
2025-07-30 15:19:46 INFO       Word size: 16
2025-07-30 15:19:46 INFO       Window cM size: 0.3
2025-07-30 15:19:46 INFO       Tolerance: 1
2025-07-30 15:19:46 INFO     Threading sample 1
2025-07-30 15:22:47 INFO     Threading sample 100
2025-07-30 15:26:00 INFO     Threading sample 200
2025-07-30 15:29:00 INFO     Threading sample 300
2025-07-30 15:32:02 INFO     Threading sample 400
2025-07-30 15:35:36 INFO     Threading sample 500
2025-07-30 15:38:34 INFO     Threading sample 599
2025-07-30 15:38:36 INFO     Done with ARG building
2025-07-30 15:38:44 INFO     Computing ARG normalization
2025-07-30 15:38:54 INFO     Read in demography assuming haploid
2025-07-30 15:38:54 INFO     Running 1000 msprime simulations for ARG normalization
2025-07-30 15:38:59 INFO     Applying ARG normalization
2025-07-30 15:39:07 INFO     Done with ARG no

CompletedProcess(args=['arg_needle', '--hap_gz', 'true_topology_argn.haps', '--map', 'true_topology_argn.map', '--mode', 'sequence', '--normalize_demography', 'true_topology_argn.demo', '--out', 'arg_needle'], returncode=0)

In [74]:
cmd = [
    "arg2tskit",
    "--arg_path", "arg_needle.argn",
    "--ts_path", "arg_needle.trees"
]

subprocess.run(cmd, check=True)

2025-07-30 15:39:42 INFO     Command-line args:
2025-07-30 15:39:42 INFO     arg_path: arg_needle.argn
2025-07-30 15:39:42 INFO     ts_path: arg_needle.trees


CompletedProcess(args=['arg2tskit', '--arg_path', 'arg_needle.argn', '--ts_path', 'arg_needle.trees'], returncode=0)

From an environment where tsbrowse is installed, we then view the tree sequences in the browser: <br>
`python -m tsbrowse preprocess <in.trees>` <br>
`python -m tsbrowse serve --port 8090 <in.tsbrowse>` <br>
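
The preprocessing step can also be scripted for several inferred tree sequences at once. The sketch below only builds the `preprocess` commands for the `.trees` files named earlier in this notebook and prints them; `tsbrowse_preprocess_commands` is a hypothetical helper, and the SINGER output is left out because its file names are not shown above.

```python
import subprocess

def tsbrowse_preprocess_commands(tree_files):
    """Build one `python -m tsbrowse preprocess` command per tree sequence file."""
    return [["python", "-m", "tsbrowse", "preprocess", path] for path in tree_files]

# File names written earlier in this notebook
tree_files = [
    "true_topology.trees",
    "tsinfer_tsdated.trees",
    "relate.trees",
    "arg_needle.trees",
]
commands = tsbrowse_preprocess_commands(tree_files)
for cmd in commands:
    print(" ".join(cmd))
    # Uncomment to run the step (requires an environment with tsbrowse installed):
    # subprocess.run(cmd, check=True)
```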