# Step-by-step tutorial on how to use different strategies for multi-domain sequence analysis

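Multi-domain sequence analysis studies several parallel sequences per individual, one for each life domain (for example, leaving home, having children, and marriage), and analyzes them jointly rather than one domain at a time.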

This tutorial guides you through multi-domain sequence analysis: first assessing the association between domains, then applying four strategies for multi-domain sequence analysis (IDCD, CAT, DAT, and CombT).

Here, we use the biofam dataset from the MedSeq R package to illustrate how these four strategies differ.

Let's get started!

In [1]:
from sequenzo import *

# Load the datasets; each dataset corresponds to one domain
left_df = load_dataset('biofam_left_domain')
children_df = load_dataset('biofam_child_domain')
married_df = load_dataset('biofam_married_domain')

# For example, let's take a look at the dataset about whether individuals left home or not
left_df

Unnamed: 0,id,age_15,age_16,age_17,age_18,age_19,age_20,age_21,age_22,age_23,age_24,age_25,age_26,age_27,age_28,age_29,age_30
0,1,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1
1,2,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
2,3,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1
3,4,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1
4,5,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,1996,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1
1996,1997,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1
1997,1998,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1998,1999,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
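Each row is one individual, and each `age_*` column holds that person's state in this domain at that age. As a rough illustration of how to read this wide format, the sketch below finds the age at which a 0/1 sequence first switches to 1. The helper `age_at_first_event` is hypothetical and not part of Sequenzo; the two sample rows are copied from the table above.

```python
from typing import Optional, Sequence

START_AGE = 15  # the first time column in the table is age_15


def age_at_first_event(states: Sequence[int], start_age: int = START_AGE) -> Optional[int]:
    """Return the age at which the state first equals 1, or None if it never does."""
    for offset, state in enumerate(states):
        if state == 1:
            return start_age + offset
    return None


# Rows for id 1 and id 1998, copied from the left-home table (age_15 .. age_30).
row_id_1 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
row_id_1998 = [0] * 16

print(age_at_first_event(row_id_1))     # 24
print(age_at_first_event(row_id_1998))  # None
```

Individual 1 is first observed as having left home at age 24, while individual 1998 is still at home at age 30.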


## Assessing whether and to what extent domains are associated with each other

To explore how different life domains (e.g. marriage, leaving home, having children) are related across time, we use **sequence association analysis**. This helps us understand **if** and **how strongly** two domains tend to move together across a person's life.

### Step 1: Create Sequence Objects

We first create sequence data objects for each domain (e.g. a sequence showing whether someone was married at each age). These objects are then compared **pairwise** to analyze their associations.


In [2]:
# Extract the age/time columns,
# which are required to build a SequenceData object.
time_cols = [col for col in children_df.columns if col.startswith("age_")]

# Construct a SequenceData object for each domain
print("\n------ seq_left ------")
seq_left = SequenceData(data=left_df, 
                        time_type="age", 
                        time=time_cols, 
                        states=[0, 1],
                        labels=["At home", "Left home"])

print("\n------ seq_child ------")
seq_child = SequenceData(data=children_df, 
                         time_type="age", 
                         time=time_cols, 
                         states=[0, 1],
                         labels=["No child", "Child"])

print("\n------ seq_married ------")
seq_married = SequenceData(data=married_df, 
                           time_type="age", 
                           time=time_cols, 
                           states=[0, 1],
                           labels=["Not married", "Married"])


------ seq_left ------

[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 2000
[>] Min/Max sequence length: 16 / 16
[>] Alphabet: [0, 1]

------ seq_child ------

[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 2000
[>] Min/Max sequence length: 16 / 16
[>] Alphabet: [0, 1]

------ seq_married ------

[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 2000
[>] Min/Max sequence length: 16 / 16
[>] Alphabet: [0, 1]


### Step 2: Measuring Association

We use two complementary statistical measures:

| Measure      | Description                                                                 | What it tells us                                       |
|--------------|-----------------------------------------------------------------------------|---------------------------------------------------------|
| **LRT (Likelihood Ratio Test)** | A test of **whether** two domains are statistically associated | Tells you **if there is any significant link** at all    |
| **Cramer's V (v)**              | A measure of **how strong** the association is (0 to 1)         | Tells you **how strong** the link is if it exists       |

Both of these are calculated **based on cross-tabulations** of aligned sequence positions (e.g., marriage status vs. childbearing at each age).
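To make the cross-tabulation idea concrete, here is a minimal, unweighted sketch in plain Python. It pools all aligned (person, age) positions of two domains into one table, then computes the likelihood-ratio statistic and Cramer's V from it. The function names and the toy data are assumptions made for illustration; Sequenzo's own computation (for example with `weighted=True` or a different `rep_method`) may differ in detail.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple

Table = Dict[Tuple[int, int], int]


def cross_tabulate(dom_a: Sequence[Sequence[int]], dom_b: Sequence[Sequence[int]]) -> Table:
    """Count co-occurring states over all aligned (person, time) positions."""
    counts: Counter = Counter()
    for seq_a, seq_b in zip(dom_a, dom_b):
        for state_a, state_b in zip(seq_a, seq_b):
            counts[(state_a, state_b)] += 1
    return dict(counts)


def lrt_and_cramers_v(counts: Table) -> Tuple[float, float]:
    """Return (likelihood-ratio statistic G2, Cramer's V) for a two-way table."""
    n = sum(counts.values())
    row_tot: Counter = Counter()
    col_tot: Counter = Counter()
    for (a, b), count in counts.items():
        row_tot[a] += count
        col_tot[b] += count
    g2 = 0.0
    chi2 = 0.0
    for a, row_sum in row_tot.items():
        for b, col_sum in col_tot.items():
            observed = counts.get((a, b), 0)
            expected = row_sum * col_sum / n
            chi2 += (observed - expected) ** 2 / expected
            if observed > 0:  # a cell with zero count contributes nothing to G2
                g2 += 2.0 * observed * math.log(observed / expected)
    k = min(len(row_tot), len(col_tot)) - 1
    v = math.sqrt(chi2 / (n * k)) if k > 0 else 0.0
    return g2, v


def p_value_df1(statistic: float) -> float:
    """Upper-tail chi-squared probability, valid for 1 degree of freedom only."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Invented toy data: two people observed at four ages in two binary domains.
married = [[0, 0, 1, 1], [0, 1, 1, 1]]
child = [[0, 0, 0, 1], [0, 0, 1, 1]]

table = cross_tabulate(married, child)
g2, v = lrt_and_cramers_v(table)
print(f"G2 = {g2:.3f}, V = {v:.3f}, p(LRT) = {p_value_df1(g2):.4f}")
```

With two binary domains the table is 2 x 2, so the test has one degree of freedom, which is why the closed-form p-value above is enough for this case.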

### What's the Difference?

- **LRT (`p(LRT)`):**  
  Think of this as a **yes/no test**: *Is there any relationship?*
  - A low p-value (e.g., < 0.05) means "Yes, the association is statistically significant."
  - A high p-value means there's no evidence of association.

- **Cramer's V (`v` and `p(v)`):**  
  This tells you **how strong** the relationship is, even if it's weak.
  - Value ranges from 0 (no association) to 1 (perfect association).
  - We also attach a label:
    - `None` (v < 0.1)
    - `Weak` (0.1 ≤ v < 0.3)
    - `Moderate` (0.3 ≤ v < 0.5)
    - `Strong` (v ≥ 0.5)

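> **Note:** Even when `v = 0`, the domains *might* still be related: Cramer's V is computed from cross-tabulations of aligned positions, so it does not capture dependencies that only appear across different time points (for example, lagged effects).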
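The labelling rule for Cramer's V can be written as a small function. This is a sketch of the thresholds listed above, not Sequenzo's internal code; the name `strength_label` is hypothetical.

```python
def strength_label(v: float) -> str:
    """Map a Cramer's V value in [0, 1] to the qualitative label used in this tutorial."""
    if not 0.0 <= v <= 1.0:
        raise ValueError("Cramer's V must lie between 0 and 1")
    if v < 0.1:
        return "None"
    if v < 0.3:
        return "Weak"
    if v < 0.5:
        return "Moderate"
    return "Strong"


for value in (0.05, 0.25, 0.481817, 0.531414):
    print(value, strength_label(value))
```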

### Output Table

The result is a table that looks like the following. 

Each row shows how two domains relate to each other, how statistically significant that relationship is, and how strong it is.

In [3]:
result = get_association_between_domains(
    [seq_left, seq_child, seq_married],
    assoc=["V", "LRT"],
    rep_method="overall",
    cross_table=True,
    weighted=True,
    # Domain names must follow the same order as the sequence list above
    dnames=["left", "children", "married"],
    explain=True,
)


📜 Full results table:


Unnamed: 0,df,LRT,p(LRT),v,p(v),strength
left vs children,1.0,9144.680641,0.000 ***,0.481817,0.000 ***,Moderate
left vs married,1.0,9561.568952,0.000 ***,0.531414,0.000 ***,Strong
children vs married,1.0,12430.120849,0.000 ***,0.626851,0.000 ***,Strong



📘 Column explanations:
  - df       : Degrees of freedom for the test (typically 1 for binary state sequences).
  - LRT      : Likelihood Ratio Test statistic (higher = stronger dependence).
  - p(LRT)   : p-value for LRT + significance stars: * (p<.05), ** (p<.01), *** (p<.001)
  - v        : Cramer's V statistic (0 to 1, measures association strength).
  - p(v)     : p-value for Cramer's V (based on chi-squared test) + significance stars: * (p<.05), ** (p<.01), *** (p<.001)
  - strength : Qualitative label for association strength based on Cramer's V:
               0.00–0.09 → None, 0.10–0.29 → Weak, 0.30–0.49 → Moderate, ≥0.50 → Strong
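The significance stars in the `p(LRT)` and `p(v)` columns follow the usual cutoffs listed above. A minimal sketch of that formatting, with hypothetical helper names that are not taken from Sequenzo:

```python
def significance_stars(p: float) -> str:
    """Return '***', '**', '*' or '' according to the cutoffs .001, .01 and .05."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


def format_p(p: float) -> str:
    """Format a p-value with three decimals followed by its stars, if any."""
    return f"{p:.3f} {significance_stars(p)}".rstrip()


print(format_p(0.0))    # 0.000 ***
print(format_p(0.004))  # 0.004 **
print(format_p(0.2))    # 0.200
```

Note that a displayed value of `0.000` only means the p-value rounds to zero at three decimals, not that it is exactly zero.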


## The first strategy to conduct multi-domain sequence analysis: IDCD

In [5]:
# Paths to the three domain files (one CSV per domain)
common_path = './datasets'
csvs = [f'{common_path}/biofam_left_domain.csv',
        f'{common_path}/biofam_child_domain.csv',
        f'{common_path}/biofam_married_domain.csv']

# Time columns shared by all domains: age_15 ... age_30
time_cols = [f"age_{i}" for i in range(15, 31)]

seqdata = create_idcd_sequence_from_csvs(
    csv_paths=csvs,
    time_cols=time_cols,
    time_type="age",
    id_col="id",
    domain_state_labels=[
        {0: "At Home", 1: "Left Home"},
        {0: "No Child", 1: "Child"},
        {0: "Single", 1: "Married"}
    ]
)


[IDCD] Observed Combined States Frequency Table:
State                      Label  Frequency  Proportion (%)
0+0+0    At Home+No Child+Single      16378           51.18
1+0+0  Left Home+No Child+Single       5888           18.40
1+1+1    Left Home+Child+Married       4838           15.12
1+0+1 Left Home+No Child+Married       3244           10.14
0+0+1   At Home+No Child+Married       1466            4.58
1+1+0     Left Home+Child+Single        167            0.52
0+1+0       At Home+Child+Single         19            0.06

[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 2000
[>] Min/Max sequence length: 16 / 16
[>] Alphabet: ['0+0+0', '1+0+0', '1+1+1', '1+0+1', '0+0+1', '1+1+0', '0+1+0']


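The result is a regular `SequenceData` object whose alphabet consists of the observed combined states (for example, `0+0+0`), so we can continue with the common workflow of single-domain sequence analysis.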
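The combined states in the frequency table are simply the per-domain states at each age joined with `+`. The following sketch shows that idea on one invented individual; `combine_states` is a hypothetical helper, not the function Sequenzo uses internally.

```python
from collections import Counter
from typing import List, Sequence


def combine_states(*domain_rows: Sequence[int], sep: str = "+") -> List[str]:
    """Join the per-domain states at each time point into one combined state."""
    return [sep.join(str(state) for state in states) for states in zip(*domain_rows)]


# Invented example: one person observed at four ages in three domains.
left = [0, 0, 1, 1]
child = [0, 0, 0, 1]
married = [0, 0, 1, 1]

combined = combine_states(left, child, married)
print(combined)           # ['0+0+0', '0+0+0', '1+0+1', '1+1+1']
print(Counter(combined))  # how often each combined state occurs
```

With three binary domains there are at most 2 x 2 x 2 = 8 combined states; the table above shows that only 7 of them are observed in the data.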

## Other strategies for conducting multi-domain sequence analysis

### CAT

In [8]:
left_df = load_dataset('biofam_left_domain')
children_df = load_dataset('biofam_child_domain')
married_df = load_dataset('biofam_married_domain')

time_cols = [col for col in children_df.columns if col.startswith("age_")]

seq_left = SequenceData(data=left_df, time_type="age", time=time_cols, states=[0, 1],
                        labels=["At home", "Left home"])
seq_child = SequenceData(data=children_df, time_type="age", time=time_cols, states=[0, 1],
                         labels=["No child", "Child"])
seq_marr = SequenceData(data=married_df, time_type="age", time=time_cols, states=[0, 1],
                        labels=["Not married", "Married"])

sequence_data = [seq_left, seq_child, seq_marr]

cat_distance_matrix = compute_cat_distance_matrix(
    sequence_data,
    method="OM",
    sm=["TRATE"],
    indel=[2, 1, 1],
    what="diss",
    link="sum",
)

cat_distance_matrix


[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 2000
[>] Min/Max sequence length: 16 / 16
[>] Alphabet: [0, 1]

[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 2000
[>] Min/Max sequence length: 16 / 16
[>] Alphabet: [0, 1]

[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 2000
[>] Min/Max sequence length: 16 / 16
[>] Alphabet: [0, 1]
[>] 3 domains with 2000 sequences.
[>] Building MD sequences of combined states.
  - OK.
[>] Computing substitution cost matrix for domain 0.
[>] Computing substitution cost matrix for domain 1.
[>] Computing substitution cost matrix for domain 2.
[>] Computing MD substitution and indel costs with additive trick.
  - OK.
[>] Computing MD distances using additive trick.
  - OK.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999
0,0.000000,19.104439,7.684590,11.491206,11.491206,17.526643,13.323158,9.587898,7.738842,22.911056,...,17.526643,11.491206,21.333260,15.569083,17.363887,11.491206,17.201131,17.526643,25.171146,11.653962
1,19.104439,0.000000,17.516541,26.934681,15.352075,32.644606,36.305570,21.079104,19.033083,15.406327,...,32.644606,26.934681,36.451222,30.687045,32.481849,9.516541,13.537227,32.644606,59.573408,9.516541
2,7.684590,17.516541,0.000000,7.684590,7.684590,21.170504,28.763694,5.781281,7.613233,30.687045,...,21.170504,7.684590,24.977120,19.212943,21.007748,19.267195,27.247152,21.170504,46.321606,27.043185
3,11.491206,26.934681,7.684590,0.000000,11.582606,20.936391,48.082202,13.537227,11.419850,36.542622,...,20.936391,0.000000,17.129775,19.033083,32.570310,1.903308,36.665291,20.936391,71.350040,1.903308
4,11.491206,15.352075,7.684590,11.582606,0.000000,21.099147,36.448283,15.440535,11.419850,24.960016,...,21.099147,11.582606,24.905764,19.141587,20.936391,1.903308,25.082685,21.099147,54.006196,1.903308
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,11.491206,9.516541,19.267195,1.903308,1.903308,19.412847,19.033083,25.065581,9.625046,11.419850,...,19.412847,1.903308,23.219464,17.455287,19.250091,0.000000,11.491206,19.412847,28.977763,11.582606
1996,17.201131,13.537227,27.247152,36.665291,25.082685,42.375216,30.524289,23.684590,28.763694,25.028433,...,42.375216,36.665291,46.181833,40.417656,42.212460,11.491206,0.000000,42.375216,46.036181,7.613233
1997,17.526643,32.644606,21.170504,20.936391,21.099147,0.000000,27.145811,30.775506,9.787802,30.832697,...,0.000000,20.936391,3.806617,1.957560,11.633919,19.412847,42.375216,0.000000,50.413648,19.575603
1998,25.171146,59.573408,46.321606,71.350040,54.006196,50.413648,23.267838,46.250250,31.023784,53.720770,...,50.413648,71.350040,54.220265,46.535675,38.779730,28.977763,46.036181,50.413648,0.000000,11.633919


The resulting `cat_distance_matrix` is a 2000 x 2000 dissimilarity matrix, so we can continue with the common workflow of sequence analysis (for example, clustering).
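The log line "Computing MD substitution and indel costs with additive trick" suggests how CAT works: the cost of substituting one combined state for another is built from the per-domain costs. The sketch below assumes that `link="sum"` means the domain costs are summed. The cost matrices are invented for illustration (they are not the TRATE costs computed above), and `cat_substitution_cost` is a hypothetical helper.

```python
from typing import List, Sequence

CostMatrix = List[List[float]]


def cat_substitution_cost(state_a: str, state_b: str,
                          domain_costs: Sequence[CostMatrix], sep: str = "+") -> float:
    """Sum the per-domain substitution costs between two combined states."""
    parts_a = [int(part) for part in state_a.split(sep)]
    parts_b = [int(part) for part in state_b.split(sep)]
    if not len(parts_a) == len(parts_b) == len(domain_costs):
        raise ValueError("states and cost matrices must cover the same domains")
    return sum(costs[a][b] for costs, a, b in zip(domain_costs, parts_a, parts_b))


# Invented 2x2 substitution-cost matrices for the left, child and married domains.
domain_costs = [
    [[0.0, 2.0], [2.0, 0.0]],
    [[0.0, 1.5], [1.5, 0.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]

print(cat_substitution_cost("1+0+0", "0+0+1", domain_costs))  # 3.0

# Under the same assumption, the multi-domain indel cost is the sum of the domain indels.
md_indel = sum([2, 1, 1])
print(md_indel)  # 4
```

Two combined states that differ in a single domain are therefore cheaper to substitute than two states that differ in all three.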

### DAT

In [9]:
left_df = load_dataset('biofam_left_domain')
children_df = load_dataset('biofam_child_domain')
married_df = load_dataset('biofam_married_domain')

time_cols = [col for col in children_df.columns if col.startswith("age_")]

seq_left = SequenceData(data=left_df, time_type="age", time=time_cols, states=[0, 1],
                        labels=["At home", "Left home"])
seq_child = SequenceData(data=children_df, time_type="age", time=time_cols, states=[0, 1],
                         labels=["No child", "Child"])
seq_marr = SequenceData(data=married_df, time_type="age", time=time_cols, states=[0, 1],
                        labels=["Not married", "Married"])

domains_seq_list = [seq_left, seq_child, seq_marr]

domain_params = [
    {"method": "OM", "sm": "TRATE", "indel": "auto"},
    {"method": "OM", "sm": "CONSTANT", "indel": "auto"},
    {"method": "DHD"}   
]

dat_matrix = compute_dat_distance_matrix(domains_seq_list, method_params=domain_params)

dat_matrix


[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 2000
[>] Min/Max sequence length: 16 / 16
[>] Alphabet: [0, 1]

[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 2000
[>] Min/Max sequence length: 16 / 16
[>] Alphabet: [0, 1]

[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 2000
[>] Min/Max sequence length: 16 / 16
[>] Alphabet: [0, 1]
[>] Processing 2000 sequences with 2 unique states.
[>] Transition-based substitution-cost matrix (TRATE) initiated...
  - Computing transition probabilities for: [0, 1]
[>] Indel cost generated.

[>] Identified 64 unique sequences.
[>] Sequence length: min/max = 16 / 16.

[>] Starting Optimal Matching(OM)...
[>] Computing all pairwise distances...
[>] Computed Successfully.
[>] Processing 2000 sequences with 2 unique states.
  - Creating 3x3 substitution-cost matrix using 2 as constant value
[>] Indel cost generated.

[>] Identified 38 unique 

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999
0,0.000000,26.557654,20.813797,45.450287,28.308613,51.160212,25.323158,21.041113,13.137805,19.129775,...,51.160212,45.450287,47.353595,43.539009,40.018538,7.709925,21.081951,51.160212,48.525782,25.101763
1,26.557654,0.000000,17.095842,34.119099,16.977425,55.055490,43.880812,13.516541,21.033083,15.234496,...,55.055490,34.119099,51.248874,47.434287,43.913816,22.847730,12.993290,55.055490,67.083436,40.239568
2,20.813797,17.095842,0.000000,24.636490,7.494817,37.959649,34.136955,7.579300,15.289225,32.330338,...,37.959649,24.636490,34.153032,30.338446,26.817975,20.910488,26.282515,37.959649,57.339578,38.302327
3,45.450287,34.119099,24.636490,0.000000,17.141674,20.936391,46.773445,24.409174,39.925716,49.353595,...,20.936391,0.000000,17.129775,24.750977,32.078065,41.740362,43.305772,20.936391,69.976069,59.132200
4,28.308613,16.977425,7.494817,17.141674,0.000000,38.078065,41.631771,11.267500,22.784042,32.211921,...,38.078065,17.141674,34.271449,30.456862,26.936391,24.598688,26.164098,38.078065,64.834395,41.990527
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,7.709925,22.847730,20.910488,41.740362,24.598688,58.870137,33.033083,17.331188,20.847730,11.419850,...,58.870137,41.740362,55.063520,51.248934,47.728463,0.000000,17.372027,58.870137,56.235707,17.391838
1996,21.081951,12.993290,26.282515,43.305772,26.164098,64.242163,38.405110,22.703215,30.219756,13.565410,...,64.242163,43.305772,60.435547,56.620961,53.100489,17.372027,0.000000,64.242163,54.090146,27.246278
1997,51.160212,55.055490,37.959649,20.936391,38.078065,0.000000,25.837054,41.538949,38.022407,70.289987,...,0.000000,20.936391,3.806617,7.621203,11.141674,58.870137,64.242163,0.000000,49.039678,76.261975
1998,48.525782,67.083436,57.339578,69.976069,64.834395,49.039678,23.202624,53.566895,50.050353,67.655556,...,49.039678,69.976069,52.846294,49.225091,37.898004,56.235707,54.090146,49.039678,0.000000,50.843868


The resulting `dat_matrix` is again a 2000 x 2000 dissimilarity matrix, so we can continue with the common workflow of sequence analysis (for example, clustering).
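In contrast to CAT, DAT lets each domain use its own dissimilarity measure and then combines the resulting distance matrices. The sketch below assumes the matrices are added element by element, and it uses a plain Hamming distance for every domain as a simple stand-in for the OM and DHD measures configured above. The data and helper names are invented for illustration.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]


def hamming(seq_a: Sequence[int], seq_b: Sequence[int]) -> float:
    """Number of positions at which two equal-length sequences differ."""
    return float(sum(a != b for a, b in zip(seq_a, seq_b)))


def pairwise(seqs: Sequence[Sequence[int]],
             dist: Callable[[Sequence[int], Sequence[int]], float]) -> Matrix:
    """Full pairwise distance matrix for one domain."""
    return [[dist(a, b) for b in seqs] for a in seqs]


def add_matrices(matrices: Sequence[Matrix]) -> Matrix:
    """Element-wise sum of equally sized square matrices."""
    n = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) for j in range(n)] for i in range(n)]


# Invented toy data: three people, four ages, three binary domains.
left = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 0]]
child = [[0, 0, 0, 1], [0, 0, 1, 1], [0, 0, 0, 0]]
married = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 0]]

dat = add_matrices([pairwise(domain, hamming) for domain in (left, child, married)])
for row in dat:
    print(row)
```

Because the per-domain matrices are computed independently, each domain can have its own cost settings, as in `domain_params` above.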

### CombT

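CombT builds a typology for each domain separately and then combines the cluster memberships into a joint typology. The implementation is in the `combt.py` file, in the same folder as this Jupyter notebook. The log below comes from running that script, which prompts for the number of clusters in each domain.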



```
[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 2000
[>] Min/Max sequence length: 16 / 16
[>] Alphabet: [0, 1]

[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 2000
[>] Min/Max sequence length: 16 / 16
[>] Alphabet: [0, 1]

[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 2000
[>] Min/Max sequence length: 16 / 16
[>] Alphabet: [0, 1]
[>] Processing 2000 sequences with 2 unique states.
[>] Transition-based substitution-cost matrix (TRATE) initiated...
  - Computing transition probabilities for: [0, 1]
[>] Indel cost generated.

[>] Identified 64 unique sequences.
[>] Sequence length: min/max = 16 / 16.

[>] Starting Optimal Matching(OM)...
[>] Computing all pairwise distances...
[>] Computed Successfully.
[>] Processing 2000 sequences with 2 unique states.
  - Creating 3x3 substitution-cost matrix using 2 as constant value
[>] Indel cost generated.

[>] Identified 38 unique sequences.
[>] Sequence length: min/max = 16 / 16.

[>] Starting Optimal Matching(OM)...
[>] Computing all pairwise distances...
[>] Computed Successfully.
[>] Processing 2000 sequences with 2 unique states.
  - Creating 3x3 substitution-cost matrix using 2 as constant value
[>] Indel cost generated.

[>] Identified 58 unique sequences.
[>] Sequence length: min/max = 16 / 16.

[>] Starting Optimal Matching(OM)...
[>] Computing all pairwise distances...
[>] Computed Successfully.

[>] Processing domain: Left
[>] Converting DataFrame to NumPy array...
/Users/lei/Documents/Sequenzo_all_folders/Sequenzo-main/sequenzo/visualization/utils/utils.py:179: UserWarning: FigureCanvasAgg is non-interactive, and thus cannot be shown
  plt.show()
Cluster Quality - Left.png has been saved. Please check it and then come back.

[?] Enter number of clusters for domain 'Left': 6

[>] Processing domain: Child
[>] Converting DataFrame to NumPy array...
/Users/lei/Documents/Sequenzo_all_folders/Sequenzo-main/sequenzo/visualization/utils/utils.py:179: UserWarning: FigureCanvasAgg is non-interactive, and thus cannot be shown
  plt.show()
Cluster Quality - Child.png has been saved. Please check it and then come back.

[?] Enter number of clusters for domain 'Child': 5

[>] Processing domain: Married
[>] Converting DataFrame to NumPy array...
/Users/lei/Documents/Sequenzo_all_folders/Sequenzo-main/sequenzo/visualization/utils/utils.py:179: UserWarning: FigureCanvasAgg is non-interactive, and thus cannot be shown
  plt.show()
Cluster Quality - Married.png has been saved. Please check it and then come back.

[?] Enter number of clusters for domain 'Married': 4

[>] Combined Typology Membership Table Preview:
   id  Left_Cluster  Child_Cluster  Married_Cluster  CombT
0   1             4              3                2  4+3+2
1   2             1              3                3  1+3+3
2   3             2              2                3  2+2+3
3   4             2              1                4  2+1+4
4   5             2              2                3  2+2+3

[>] combt_membership_table.csv has been saved.

[>] CombT Frequency Table:
    CombT  Frequency  Proportion (%)
0   3+1+4        250           12.50
1   2+1+4        159            7.95
2   2+3+2         97            4.85
3   3+1+2         93            4.65
4   4+3+2         91            4.55
..    ...        ...             ...
63  1+5+3          1            0.05
64  3+2+1          1            0.05
65  3+2+2          1            0.05
66  6+5+3          1            0.05
67  2+5+4          1            0.05

[68 rows x 3 columns]

[>] freq_table.csv has been saved.
/Users/lei/Documents/Sequenzo_all_folders/Sequenzo-main/sequenzo/visualization/utils/utils.py:179: UserWarning: FigureCanvasAgg is non-interactive, and thus cannot be shown
  plt.show()

Frequency of Combined Typologies.png has been saved.
[>] Processing 2000 sequences with 2 unique states.
[>] Transition-based substitution-cost matrix (TRATE) initiated...
  - Computing transition probabilities for: [0, 1]
[>] Indel cost generated.

[>] Identified 64 unique sequences.
[>] Sequence length: min/max = 16 / 16.

[>] Starting Optimal Matching(OM)...
[>] Computing all pairwise distances...
[>] Computed Successfully.
[>] Processing 2000 sequences with 2 unique states.
  - Creating 3x3 substitution-cost matrix using 2 as constant value
[>] Indel cost generated.

[>] Identified 38 unique sequences.
[>] Sequence length: min/max = 16 / 16.

[>] Starting Optimal Matching(OM)...
[>] Computing all pairwise distances...
[>] Computed Successfully.
[>] Processing 2000 sequences with 2 unique states.
  - Creating 3x3 substitution-cost matrix using 2 as constant value
[>] Indel cost generated.

[>] Identified 58 unique sequences.
[>] Sequence length: min/max = 16 / 16.

[>] Starting Optimal Matching(OM)...
[>] Computing all pairwise distances...
[>] Computed Successfully.

[>] CombT clusters before merging: 68
[>] CombT clusters after merging: 68

[>] combt_membership_table.csv has been saved.

```
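The `CombT` column in the membership table is the concatenation of the three per-domain cluster numbers. A minimal sketch of that step and of the frequency table, using the five rows shown in the preview above (the variable names are assumptions, not the ones used in `combt.py`):

```python
from collections import Counter

# id -> (Left_Cluster, Child_Cluster, Married_Cluster), copied from the preview table.
memberships = {
    1: (4, 3, 2),
    2: (1, 3, 3),
    3: (2, 2, 3),
    4: (2, 1, 4),
    5: (2, 2, 3),
}

# Join the per-domain cluster numbers into one combined-typology label per person.
combt = {pid: "+".join(str(c) for c in clusters) for pid, clusters in memberships.items()}

# Frequency and share of each combined type, most frequent first.
freq = Counter(combt.values())
for label, count in freq.most_common():
    print(f"{label}  {count}  {100 * count / len(combt):.2f}%")
```

With 6, 5 and 4 clusters in the three domains there are at most 6 x 5 x 4 = 120 possible combined types; the log shows that 68 of them occur in the data.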