@@ -20,16 +20,12 @@ kernelspec:
 # Identity by descent

 The {meth}`.TreeSequence.ibd_segments` method allows us to compute
-segments of identity by descent.
+segments of identity by descent along a tree sequence.

 :::{note}
 This documentation page is preliminary
 :::

-:::{todo}
-Relate the concept of identity by descent to the MRCA spans in the tree sequence.
-:::
-
 ## Examples

 Let's take a simple tree sequence to illustrate the {meth}`.TreeSequence.ibd_segments`
@@ -77,7 +73,20 @@ coordinate ``[left, right)`` and ancestral node ``a`` iff the most
 recent common ancestor of the segment ``[left, right)`` in nodes ``u``
 and ``v`` is ``a``, and the segment has been inherited along the same
 genealogical path (ie. it has not been broken by recombination). The
-segments returned are the longest possible ones.
+segments returned are the longest possible ones: for a fixed pair
+``(u, v)`` we follow the ancestral paths from ``u`` and ``v`` up the
+trees and merge together adjacent genomic intervals whenever both the
+MRCA ``a`` and the full ancestral paths from ``u`` and ``v`` to ``a``
+are identical.
+
+This definition is purely genealogical: it depends only on the tree
+sequence topology and node times, and does not inspect allelic
+states or mutations. In particular, if we compute the MRCA of ``(u, v)``
+in each tree along the sequence, then (up to the additional refinement
+by genealogical path) the IBD segments are obtained by merging together
+adjacent MRCA intervals that share the same ancestor and paths to that
+ancestor. Intervals in which ``u`` and ``v`` descend from different roots
+have no MRCA and therefore do not contribute IBD segments.

 Consider the IBD segments that we get from our example tree sequence:

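
The merging rule described in this hunk can be illustrated with a small pure-Python sketch. It is not tskit's implementation: ``TreeInterval``, its path encoding, and ``merge_ibd_intervals`` are hypothetical names introduced here, and the input is assumed to be the per-tree MRCA information for one fixed pair ``(u, v)``, listed left to right.

```python
from typing import NamedTuple, Optional


class TreeInterval(NamedTuple):
    """Hypothetical per-tree record for one fixed pair (u, v)."""

    left: float
    right: float
    mrca: Optional[int]  # None when u and v are under different roots
    path_u: tuple  # nodes from u up to the MRCA (assumed encoding)
    path_v: tuple  # nodes from v up to the MRCA (assumed encoding)


def merge_ibd_intervals(intervals):
    """Merge adjacent intervals that share the MRCA and both paths."""
    segments = []  # each entry: [left, right, mrca, path_u, path_v]
    for iv in intervals:
        if iv.mrca is None:
            continue  # no common ancestor here, so no IBD segment
        key = [iv.mrca, iv.path_u, iv.path_v]
        last = segments[-1] if segments else None
        if last is not None and last[1] == iv.left and last[2:] == key:
            last[1] = iv.right  # extend the current segment
        else:
            segments.append([iv.left, iv.right, *key])
    return [(left, right, mrca) for left, right, mrca, _, _ in segments]


intervals = [
    TreeInterval(0, 2, 4, (0, 4), (1, 4)),
    TreeInterval(2, 5, 4, (0, 4), (1, 4)),  # same MRCA and paths: merged
    TreeInterval(5, 7, 4, (0, 3, 4), (1, 4)),  # same MRCA, different path
    TreeInterval(7, 8, None, (), ()),  # different roots: skipped
    TreeInterval(8, 10, 4, (0, 3, 4), (1, 4)),  # not adjacent: new segment
]
result = merge_ibd_intervals(intervals)
print(result)  # [(0, 5, 4), (5, 7, 4), (8, 10, 4)]
```

Note that the third interval starts a new segment even though the MRCA is unchanged, which is the "refinement by genealogical path" mentioned in the added text.
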
@@ -109,7 +118,11 @@ By default this class only stores the high-level summaries of the
 IBD segments discovered. As we can see in this example, we have a
 total of six segments and
 the total span (i.e., the sum lengths of the genomic intervals spanned
-by IBD segments) is 30.
+by IBD segments) is 30. In this default mode the object does not
+store information about individual sample pairs, and methods that
+inspect per-pair information (such as indexing with ``[(a, b)]`` or
+iterating over the mapping) will raise an
+{class}`.IdentityPairsNotStoredError`.

 If required, we can get more detailed information about particular
 segment pairs and the actual segments using the ``store_pairs``
@@ -148,8 +161,12 @@ segments = ts.ibd_segments(store_pairs=True, store_segments=True)
 segments[(0, 1)]
 ```

-The {class}`.IdentitySegmentList` behaves like a Python list,
-where each element is an instance of {class}`.IdentitySegment`.
+When ``store_segments=True``, the {class}`.IdentitySegmentList` behaves
+like a Python list, where each element is an instance of
+{class}`.IdentitySegment`. When only ``store_pairs=True`` is specified,
+the number of segments and their total span are still available, but
+attempting to iterate over the list or access the per-segment arrays
+will raise an {class}`.IdentitySegmentsNotStoredError`.

 :::{warning}
 The order of segments in an {class}`.IdentitySegmentList`
@@ -168,19 +185,26 @@ By default we get the IBD segments between all pairs of
 {ref}`sample<sec_data_model_definitions_sample>` nodes.

 #### IBD within a sample set
+
 We can reduce this to pairs within a specific set using the
 ``within`` argument:

-
-```{eval-rst}
-.. todo:: More detail and better examples here.
-```
-
 ```{code-cell}
 segments = ts.ibd_segments(within=[0, 2], store_pairs=True)
 print(list(segments.keys()))
 ```

+Here we have restricted attention to the samples with node IDs 0 and 2,
+so only the pair ``(0, 2)`` appears in the result. In general:
+
+- ``within`` should be a one-dimensional array-like of node IDs
+  (typically sample nodes). All unordered pairs from this set are
+  considered.
+- If ``within`` is omitted (the default), all nodes flagged as samples
+  in the node table are used.
+- Passing an empty list, e.g. ``within=[]``, is allowed and simply
+  yields a result with zero pairs and zero segments.
+
 #### IBD between sample sets

 We can also compute IBD **between** sample sets:
@@ -190,6 +214,19 @@ segments = ts.ibd_segments(between=[[0,1], [2]], store_pairs=True)
 print(list(segments.keys()))
 ```

+In this example we have two sample sets, ``[0, 1]`` and ``[2]``, so the
+identity segments are computed only for pairs in which one sample lies
+in the first set and the other lies in the second. More generally:
+
+- ``between`` should be a list of non-overlapping lists of node IDs.
+- All pairs ``(u, v)`` are considered such that ``u`` and ``v`` belong
+  to different sample sets.
+- Empty sample sets are permitted (e.g., ``between=[[0, 1], []]``) and
+  simply do not contribute any pairs.
+
+The ``within`` and ``between`` arguments are mutually exclusive: passing
+both at the same time raises a {class}`ValueError`.
+
 :::{seealso}
 See the {meth}`.TreeSequence.ibd_segments` documentation for
 more details.
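
The pair-selection rules stated in this hunk for ``within`` and ``between`` can be sketched in plain Python. This is only an illustration of the rules as written in the added text, not the library's code: ``candidate_pairs`` is a hypothetical helper, and reporting each pair as ``(smaller ID, larger ID)`` is an assumption made here for readability.

```python
import itertools


def candidate_pairs(samples=None, within=None, between=None):
    """Enumerate the node pairs implied by ``within`` / ``between``.

    Hypothetical helper; ``samples`` stands in for the sample nodes of a
    tree sequence when neither argument is given.
    """
    if within is not None and between is not None:
        raise ValueError("within and between are mutually exclusive")
    if between is not None:
        pairs = set()
        # Pair every node with every node from a *different* sample set.
        for set_a, set_b in itertools.combinations(between, 2):
            for u, v in itertools.product(set_a, set_b):
                pairs.add((min(u, v), max(u, v)))
        return sorted(pairs)
    nodes = samples if within is None else within
    # All unordered pairs drawn from a single set of nodes.
    return [tuple(sorted(p)) for p in itertools.combinations(nodes, 2)]


print(candidate_pairs(within=[0, 2]))          # [(0, 2)]
print(candidate_pairs(between=[[0, 1], [2]]))  # [(0, 2), (1, 2)]
print(candidate_pairs(between=[[0, 1], []]))   # []
```

Empty inputs fall out naturally here: an empty ``within`` list or an empty sample set in ``between`` yields no pairs, matching the behaviour the text describes.
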
@@ -200,6 +237,51 @@ more details.
 The ``max_time`` and ``min_span`` arguments allow us to constrain the
 segments that we consider.

-```{eval-rst}
-.. todo:: Add examples for these arguments.
+The ``max_time`` argument specifies an upper bound on the time of the
+common ancestor node: only IBD segments whose MRCA node has a time
+no greater than ``max_time`` are returned. The time is measured in
+the same units as the node times in the tree sequence (e.g., generations).
+
+The ``min_span`` argument filters by genomic length: only segments with
+span strictly greater than ``min_span`` are included. This threshold is
+measured in the same units as the ``sequence_length`` (for example,
+base pairs).
+
+For example:
+
+```{code-cell}
+import io
+
+nodes = io.StringIO(
+"""\
+id is_sample time
+0 1 0
+1 1 0
+2 0 1
+3 0 1.5
+"""
+)
+edges = io.StringIO(
+"""\
+left right parent child
+0.0 0.4 2 0,1
+0.4 1.0 3 0,1
+"""
+)
+ts2 = tskit.load_text(nodes=nodes, edges=edges, strict=False)
+
+segments = ts2.ibd_segments(store_segments=True)
+print("all segments:", list(segments.values())[0])
+
+segments_recent = ts2.ibd_segments(max_time=1.2, store_segments=True)
+print("max_time=1.2:", list(segments_recent.values())[0])
+
+segments_long = ts2.ibd_segments(min_span=0.5, store_segments=True)
+print("min_span=0.5:", list(segments_long.values())[0])
 ```
+
+Here the full result contains two IBD segments for the single sample
+pair, one inherited via ancestor 2 over ``[0.0, 0.4)`` and one via
+ancestor 3 over ``[0.4, 1.0)``. The ``max_time`` constraint removes the
+segment inherited from the older ancestor (time 1.5), while the
+``min_span`` constraint keeps only the longer of the two segments.
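
The two filters described in this hunk can be mimicked without tskit as a sanity check on the prose. The sketch below is a minimal illustration, assuming the boundary behaviour exactly as worded in the added text (time no greater than ``max_time``, span strictly greater than ``min_span``); ``filter_segments`` and ``node_time`` are hypothetical names, and the segment tuples simply restate the two-segment example.

```python
def filter_segments(segments, node_time, max_time=None, min_span=None):
    """Keep (left, right, node) segments that pass both constraints.

    Hypothetical helper: boundary handling follows the wording of the
    documentation text, not a check against the library itself.
    """
    kept = []
    for left, right, node in segments:
        if max_time is not None and node_time[node] > max_time:
            continue  # common ancestor is older than allowed
        if min_span is not None and not (right - left > min_span):
            continue  # segment is not strictly longer than min_span
        kept.append((left, right, node))
    return kept


# The two segments from the example: ancestor 2 (time 1.0) and 3 (time 1.5).
node_time = {2: 1.0, 3: 1.5}
all_segments = [(0.0, 0.4, 2), (0.4, 1.0, 3)]

print(filter_segments(all_segments, node_time, max_time=1.2))
# [(0.0, 0.4, 2)]
print(filter_segments(all_segments, node_time, min_span=0.5))
# [(0.4, 1.0, 3)]
```

Applying both constraints at once to this example leaves no segments, since each filter removes the segment that the other keeps.
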