Skip to content

Commit a7046aa

Browse files
committed
Improve IBD documentation and IdentitySegments docs
1 parent 5ffcf6f commit a7046aa

File tree

3 files changed

+111
-26
lines changed

3 files changed

+111
-26
lines changed

docs/ibd.md

Lines changed: 98 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -20,16 +20,12 @@ kernelspec:
2020
# Identity by descent
2121

2222
The {meth}`.TreeSequence.ibd_segments` method allows us to compute
23-
segments of identity by descent.
23+
segments of identity by descent along a tree sequence.
2424

2525
:::{note}
2626
This documentation page is preliminary
2727
:::
2828

29-
:::{todo}
30-
Relate the concept of identity by descent to the MRCA spans in the tree sequence.
31-
:::
32-
3329
## Examples
3430

3531
Let's take a simple tree sequence to illustrate the {meth}`.TreeSequence.ibd_segments`
@@ -77,7 +73,20 @@ coordinate ``[left, right)`` and ancestral node ``a`` iff the most
7773
recent common ancestor of the segment ``[left, right)`` in nodes ``u``
7874
and ``v`` is ``a``, and the segment has been inherited along the same
7975
genealogical path (ie. it has not been broken by recombination). The
80-
segments returned are the longest possible ones.
76+
segments returned are the longest possible ones: for a fixed pair
77+
``(u, v)`` we follow the ancestral paths from ``u`` and ``v`` up the
78+
trees and merge together adjacent genomic intervals whenever both the
79+
MRCA ``a`` and the full ancestral paths from ``u`` and ``v`` to ``a``
80+
are identical.
81+
82+
This definition is purely genealogical: it depends only on the tree
83+
sequence topology and node times, and does not inspect allelic
84+
states or mutations. In particular, if we compute the MRCA of ``(u, v)``
85+
in each tree along the sequence, then (up to the additional refinement
86+
by genealogical path) the IBD segments are obtained by merging together
87+
adjacent MRCA intervals that share the same ancestor and paths to that
88+
ancestor. Intervals in which ``u`` and ``v`` lie in different roots
89+
have no MRCA and therefore do not contribute IBD segments.
8190

8291
Consider the IBD segments that we get from our example tree sequence:
8392

@@ -109,7 +118,11 @@ By default this class only stores the high-level summaries of the
109118
IBD segments discovered. As we can see in this example, we have a
110119
total of six segments and
111120
the total span (i.e., the sum lengths of the genomic intervals spanned
112-
by IBD segments) is 30.
121+
by IBD segments) is 30. In this default mode the object does not
122+
store information about individual sample pairs, and methods that
123+
inspect per-pair information (such as indexing with ``[(a, b)]`` or
124+
iterating over the mapping) will raise an
125+
{class}`.IdentityPairsNotStoredError`.
113126

114127
If required, we can get more detailed information about particular
115128
segment pairs and the actual segments using the ``store_pairs``
@@ -148,8 +161,12 @@ segments = ts.ibd_segments(store_pairs=True, store_segments=True)
148161
segments[(0, 1)]
149162
```
150163

151-
The {class}`.IdentitySegmentList` behaves like a Python list,
152-
where each element is an instance of {class}`.IdentitySegment`.
164+
When ``store_segments=True``, the {class}`.IdentitySegmentList` behaves
165+
like a Python list, where each element is an instance of
166+
{class}`.IdentitySegment`. When only ``store_pairs=True`` is specified,
167+
the number of segments and their total span are still available, but
168+
attempting to iterate over the list or access the per-segment arrays
169+
will raise an {class}`.IdentitySegmentsNotStoredError`.
153170

154171
:::{warning}
155172
The order of segments in an {class}`.IdentitySegmentList`
@@ -168,19 +185,26 @@ By default we get the IBD segments between all pairs of
168185
{ref}`sample<sec_data_model_definitions_sample>` nodes.
169186

170187
#### IBD within a sample set
188+
171189
We can reduce this to pairs within a specific set using the
172190
``within`` argument:
173191

174-
175-
```{eval-rst}
176-
.. todo:: More detail and better examples here.
177-
```
178-
179192
```{code-cell}
180193
segments = ts.ibd_segments(within=[0, 2], store_pairs=True)
181194
print(list(segments.keys()))
182195
```
183196

197+
Here we have restricted attention to the samples with node IDs 0 and 2,
198+
so only the pair ``(0, 2)`` appears in the result. In general:
199+
200+
- ``within`` should be a one-dimensional array-like of node IDs
201+
(typically sample nodes). All unordered pairs from this set are
202+
considered.
203+
- If ``within`` is omitted (the default), all nodes flagged as samples
204+
in the node table are used.
205+
- Passing an empty list, e.g. ``within=[]``, is allowed and simply
206+
yields a result with zero pairs and zero segments.
207+
184208
#### IBD between sample sets
185209

186210
We can also compute IBD **between** sample sets:
@@ -190,6 +214,19 @@ segments = ts.ibd_segments(between=[[0,1], [2]], store_pairs=True)
190214
print(list(segments.keys()))
191215
```
192216

217+
In this example we have two sample sets, ``[0, 1]`` and ``[2]``, so the
218+
identity segments are computed only for pairs in which one sample lies
219+
in the first set and the other lies in the second. More generally:
220+
221+
- ``between`` should be a list of non-overlapping lists of node IDs.
222+
- All pairs ``(u, v)`` are considered such that ``u`` and ``v`` belong
223+
to different sample sets.
224+
- Empty sample sets are permitted (e.g., ``between=[[0, 1], []]``) and
225+
simply do not contribute any pairs.
226+
227+
The ``within`` and ``between`` arguments are mutually exclusive: passing
228+
both at the same time raises a :class:`ValueError`.
229+
193230
:::{seealso}
194231
See the {meth}`.TreeSequence.ibd_segments` documentation for
195232
more details.
@@ -200,6 +237,51 @@ more details.
200237
The ``max_time`` and ``min_span`` arguments allow us to constrain the
201238
segments that we consider.
202239

203-
```{eval-rst}
204-
.. todo:: Add examples for these arguments.
240+
The ``max_time`` argument specifies an upper bound on the time of the
241+
common ancestor node: only IBD segments whose MRCA node has a time
242+
no greater than ``max_time`` are returned. The time is measured in
243+
the same units as the node times in the tree sequence (e.g., generations).
244+
245+
The ``min_span`` argument filters by genomic length: only segments with
246+
span strictly greater than ``min_span`` are included. This threshold is
247+
measured in the same units as the ``sequence_length`` (for example,
248+
base pairs).
249+
250+
For example:
251+
252+
```{code-cell}
253+
import io
254+
255+
nodes = io.StringIO(
256+
"""\
257+
id is_sample time
258+
0 1 0
259+
1 1 0
260+
2 0 1
261+
3 0 1.5
262+
"""
263+
)
264+
edges = io.StringIO(
265+
"""\
266+
left right parent child
267+
0.0 0.4 2 0,1
268+
0.4 1.0 3 0,1
269+
"""
270+
)
271+
ts2 = tskit.load_text(nodes=nodes, edges=edges, strict=False)
272+
273+
segments = ts2.ibd_segments(store_segments=True)
274+
print("all segments:", list(segments.values())[0])
275+
276+
segments_recent = ts2.ibd_segments(max_time=1.2, store_segments=True)
277+
print("max_time=1.2:", list(segments_recent.values())[0])
278+
279+
segments_long = ts2.ibd_segments(min_span=0.5, store_segments=True)
280+
print("min_span=0.5:", list(segments_long.values())[0])
205281
```
282+
283+
Here the full result contains two IBD segments for the single sample
284+
pair, one inherited via ancestor 2 over ``[0.0, 0.4)`` and one via
285+
ancestor 3 over ``[0.4, 1.0)``. The ``max_time`` constraint removes the
286+
segment inherited from the older ancestor (time 1.5), while the
287+
``min_span`` constraint keeps only the longer of the two segments.

python/tskit/tables.py

Lines changed: 11 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -2737,8 +2737,8 @@ def assert_equals(self, other, *, ignore_timestamps=False):
27372737
@dataclasses.dataclass(eq=True, order=True)
27382738
class IdentitySegment:
27392739
"""
2740-
A single segment of identity spanning a genomic interval for a
2741-
a specific ancestor node.
2740+
A single segment of identity by descent spanning a genomic interval
2741+
for a specific ancestor node.
27422742
"""
27432743

27442744
left: float
@@ -2758,7 +2758,7 @@ def span(self) -> float:
27582758

27592759
class IdentitySegmentList(collections.abc.Iterable, collections.abc.Sized):
27602760
"""
2761-
A summary of identity segments for some pair of samples in a
2761+
A summary of identity-by-descent segments for some pair of samples in a
27622762
:class:`.IdentitySegments` result. If the ``store_segments`` argument
27632763
has been specified to :meth:`.TreeSequence.ibd_segments`, this class
27642764
can be treated as a sequence of :class:`.IdentitySegment` objects.
@@ -2769,7 +2769,9 @@ class IdentitySegmentList(collections.abc.Iterable, collections.abc.Sized):
27692769
27702770
If ``store_segments`` is False, only the overall summary values
27712771
such as :attr:`.IdentitySegmentList.total_span` and ``len()`` are
2772-
available.
2772+
available. Attempting to iterate over the list or access per-segment
2773+
arrays (``left``, ``right``, or ``node``) in this case will raise an
2774+
:class:`.IdentitySegmentsNotStoredError`.
27732775
27742776
.. warning:: The order of segments within an IdentitySegmentList is
27752777
arbitrary and may change in the future
@@ -2833,7 +2835,7 @@ def node(self):
28332835
class IdentitySegments(collections.abc.Mapping):
28342836
"""
28352837
A class summarising and optionally storing the segments of identity
2836-
by state returned by :meth:`.TreeSequence.ibd_segments`. See the
2838+
by descent returned by :meth:`.TreeSequence.ibd_segments`. See the
28372839
:ref:`sec_identity` for more information and examples.
28382840
28392841
Along with the documented methods and attributes, the class supports
@@ -2845,9 +2847,10 @@ class IdentitySegments(collections.abc.Mapping):
28452847
for a given instance of this class are determined by the
28462848
``store_pairs`` and ``store_segments`` arguments provided to
28472849
:meth:`.TreeSequence.ibd_segments`. For example, attempting
2848-
to access per-sample pair information if ``store_pairs``
2849-
is False will result in a (hopefully informative) error being
2850-
raised.
2850+
to access per-sample pair information (such as indexing with
2851+
``[(a, b)]``, iterating over the mapping, or accessing
2852+
:attr:`.IdentitySegments.pairs`) if ``store_pairs`` is False will
2853+
result in an :class:`.IdentityPairsNotStoredError` being raised.
28512854
28522855
.. warning:: This class should not be instantiated directly.
28532856
"""

python/tskit/trees.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -10526,7 +10526,7 @@ def ibd_segments(
1052610526
represented by the tree sequence.
1052710527
1052810528
:param list within: A list of node IDs defining set of nodes that
10529-
we finding IBD segments for. If not specified, this defaults to
10529+
we find IBD segments for. If not specified, this defaults to
1053010530
all samples in the tree sequence.
1053110531
:param list[list] between: A list of lists of sample node IDs. Given
1053210532
two sample sets A and B, only IBD segments will be returned such
@@ -10541,7 +10541,7 @@ def ibd_segments(
1054110541
segment) is greater than this value will be included. (Default=0)
1054210542
:param bool store_pairs: If True store information separately for each
1054310543
pair of samples ``(a, b)`` that are found to be IBD. Otherwise
10544-
store summary information about all sample apirs. (Default=False)
10544+
store summary information about all sample pairs. (Default=False)
1054510545
:param bool store_segments: If True store each IBD segment
1054610546
``(left, right, c)`` and associate it with the corresponding
1054710547
sample pair ``(a, b)``. If True, implies ``store_pairs``.

0 commit comments

Comments
 (0)