# XL-MoLPC: Structure Assembly Pipeline for Protein Complexes

This notebook demonstrates a complete pipeline for assembling protein complexes
from interaction data, AlphaFold predictions, and crosslinking constraints.

The workflow includes:

1. Protein–protein interaction (PPI) network construction
2. Manual curation of protein identifiers
3. Network and sequence preparation
4. Crosslink preprocessing
5. AlphaFold structure rewriting and scoring
6. Dimer extraction and selection
7. Monte Carlo Tree Search (MCTS)–based complex assembly

This notebook is intended as a **reproducible tutorial and presentation**
of the pipeline.

## Global Configuration and File Paths

This section defines all global paths used throughout the pipeline,
including input data, intermediate files, AlphaFold predictions,
and final outputs.

Adjust `BASE_DIR` to point to your project directory before running the notebook.

In [None]:
# Project root directory
BASE_DIR = r"N:\08_NK_structure_prediction\data\COPI_complex"

# Input files
USEQS_CSV = f"{BASE_DIR}/assembled_complex/useqs.csv"
RESIDUE_PAIR_CSV = (
    f"{BASE_DIR}/heklopit_pl3017_frd1ppi_sc151_fdr1rp_COPI_cleaned.csv"
)
FASTA_PATH = f"{BASE_DIR}/assembled_complex/COPI.fasta"

# Intermediate files
UCROSSLINKS_CSV = f"{BASE_DIR}/assembled_complex/ucrosslinks.csv"
CHAINS_CSV = f"{BASE_DIR}/assembled_complex/chains.csv"

# AlphaFold prediction directory
AF_PRED_DIR = f"{BASE_DIR}/afx_pred"

# Files for assembly
NETWORK_CSV = f"{BASE_DIR}/assembled_complex/network.csv"
PAIRS_DIR = f"{BASE_DIR}/assembled_complex/pairs"

# Output directories
REWRITED_PDB_DIR = f"{BASE_DIR}/assembled_complex/rewrited_pdbs"
OUTPUT_DIR = f"{BASE_DIR}/assembled_complex/output"
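
Before running the later cells, it can help to confirm that the hand-prepared inputs exist and that the output directories are present. The sketch below is one way to do this with the standard library; `missing_inputs` and `ensure_dirs` are hypothetical helper names and are not part of the pipeline modules.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [str(p) for p in map(Path, paths) if not p.exists()]


def ensure_dirs(dirs):
    """Create each directory (including parents) if it is not already present."""
    for d in dirs:
        Path(d).mkdir(parents=True, exist_ok=True)
```

In this notebook one might call `missing_inputs([RESIDUE_PAIR_CSV, FASTA_PATH, CHAINS_CSV, AF_PRED_DIR])` and `ensure_dirs([PAIRS_DIR, REWRITED_PDB_DIR, OUTPUT_DIR])` right after the configuration cell.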

## Import Dependencies

This pipeline relies on custom modules for preprocessing, network analysis,
structure rewriting, and MCTS-based assembly.

Please ensure that all required packages and local modules are available
in your Python environment.

In [None]:
import argparse
import logging
import os                           # used below to build output paths
from itertools import combinations  # used below to count candidate triplets

import pandas as pd

from complex_assembly.rewrite_af_files import *
from complex_assembly.mcts import main
from preprocess.crosslink_prepare import *
from preprocess.network_prepare import build_network_and_useqs
from network.interact_map import *
import complex_assembly.mcts as mcts

## Manual Preparation of `chains.csv`

The `chains.csv` file defines the mapping between proteins and chain IDs
used in structure assembly.

Example format (tab-separated or CSV with consistent delimiter):

| Entry | Gene | Chain |
|-------|-------|-------|
| Q8WUH2 | TGFBRAP1 | A |
| Q9H270 | VPS11 | B |
| ... | ... | ... |

**Requirements:**
- Column names must be exactly: `Entry`, `Gene`, `Chain`
- Each chain ID should be a single uppercase letter (A, B, C, …)
- Each gene should appear only once
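
The requirements above can be checked before the file is handed to the pipeline. The following is a minimal sketch using only the standard library; `validate_chains` is a hypothetical helper, and it assumes a comma-delimited file (pass a different `delimiter` to `csv.DictReader` for tab-separated input).

```python
import csv
import io

REQUIRED_COLUMNS = ["Entry", "Gene", "Chain"]


def validate_chains(rows, fieldnames):
    """Return a list of problems found; an empty list means the table passed."""
    missing = [c for c in REQUIRED_COLUMNS if c not in fieldnames]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    seen_genes = set()
    for line_no, row in enumerate(rows, start=2):  # the header is line 1
        chain, gene = row["Chain"], row["Gene"]
        # Chain IDs must be one uppercase ASCII letter.
        if not (len(chain) == 1 and chain.isascii() and chain.isupper()):
            problems.append(f"line {line_no}: invalid chain ID {chain!r}")
        # Each gene may appear only once.
        if gene in seen_genes:
            problems.append(f"line {line_no}: duplicate gene {gene!r}")
        seen_genes.add(gene)
    return problems


# Example using the two rows shown in the table above.
example = "Entry,Gene,Chain\nQ8WUH2,TGFBRAP1,A\nQ9H270,VPS11,B\n"
reader = csv.DictReader(io.StringIO(example))
rows = list(reader)
print(validate_chains(rows, reader.fieldnames))
```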

## Manual Preparation of FASTA File

The FASTA file must contain protein sequences corresponding to the `Gene`
column in `chains.csv`.

Each FASTA header **must include a `GN=` field**, for example:

>sp|Q8WUH2|TGFA1_HUMAN Transforming growth factor-beta receptor-associated protein 1 OS=Homo sapiens OX=9606 GN=TGFBRAP1 PE=1 SV=1
MMSIKAFTLVSAVERELLMGDKERVNIECVECCGRDLYVGTNDCFVYHFLLEERPVPAGPATFTATKQLQRHLGFKKPVN...

**Important notes:**
- The gene name after `GN=` must exactly match the `Gene` column in `chains.csv`
- Only one sequence per gene is required
- Standard UniProt FASTA format is supported
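
A quick consistency check between the FASTA headers and `chains.csv` might look like the sketch below. It uses a regular expression to read the `GN=` field; `genes_in_fasta` and `check_against_chains` are hypothetical helpers, and the pipeline's own parser may behave differently.

```python
import re

GN_PATTERN = re.compile(r"\bGN=(\S+)")


def genes_in_fasta(lines):
    """Collect the GN= value of every header line; fail on a header without one."""
    genes = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line
        match = GN_PATTERN.search(line)
        if match is None:
            raise ValueError(f"header has no GN= field: {line.strip()}")
        genes.append(match.group(1))
    return genes


def check_against_chains(fasta_genes, chain_genes):
    """Return (genes without a sequence, genes with more than one sequence)."""
    missing = sorted(set(chain_genes) - set(fasta_genes))
    duplicated = sorted({g for g in fasta_genes if fasta_genes.count(g) > 1})
    return missing, duplicated
```

For example, `check_against_chains(genes_in_fasta(open(FASTA_PATH)), chain_genes)` returns two empty lists when every gene in `chains.csv` has exactly one matching sequence.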

## Manual Curation of Protein Identifiers

Some interaction datasets contain merged or ambiguous protein names
(e.g. `COPB2; COPB2`).

We manually define a mapping to clean these node names before downstream analysis.

In [None]:
# Load interaction data and build the PPI network
df = load_interaction_data(RESIDUE_PAIR_CSV)
G = build_ppi_network(df)

# Identify nodes that require manual cleanup
dirty_nodes = [n for n in G.nodes() if ";" in str(n)]
print("Nodes needing cleanup:", dirty_nodes)

In [None]:
# Manually define mappings for cleaning node names
manual_map = {
    'COPB2; COPB2':'COPB2', 
    'ASS1; ARCN1':'ARCN1', 
    'COPB1; COPB1':'COPB1', 
    'COPA; COPA':'COPA', 
    'COPB1; COPB1; COPB1; COPB1':'COPB1', 
    'ARF4; ARF6; ARF1; ARF5':'ARF1', 
    'COPG2; COPG1':'COPG1', 
    'COPZ1; COPZ1':'COPZ1'
}
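
Many entries in such a mapping are trivial (the same name repeated), while others require a decision about which protein to keep. As a rough aid, the sketch below separates the two cases; `propose_mapping` is a hypothetical helper, and the ambiguous entries still need manual review.

```python
def propose_mapping(dirty_nodes):
    """Split merged names on ';' and separate trivial cases from ambiguous ones."""
    auto, ambiguous = {}, {}
    for node in dirty_nodes:
        # Unique, whitespace-trimmed names contained in the merged label.
        candidates = sorted(
            {part.strip() for part in str(node).split(";") if part.strip()}
        )
        if len(candidates) == 1:
            auto[node] = candidates[0]      # e.g. 'COPB2; COPB2' -> 'COPB2'
        else:
            ambiguous[node] = candidates    # needs a manual choice
    return auto, ambiguous
```

The `auto` dictionary can seed `manual_map`, and each key in `ambiguous` lists the names to choose from.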

## Define Target Protein Set

Here we specify the list of proteins that define the complex of interest.
Only interactions among these proteins will be considered for assembly.

In [None]:
# ====== Manually define the complex to analyze ======
protein_list = ["COPB1",
                "COPZ1",
                "COPG1",
                "ARCN1",
                "COPE",
                "COPB2",
                "COPA",
                "ARF1",
                "ARFGAP2",
                "ARFGAP3"
                ]

## Cleaned PPI Network Visualization and Complex Enumeration

After cleaning node names, we:

1. Visualize the PPI network
2. Rewrite the original interaction file using cleaned identifiers
3. Enumerate valid dimers and trimers present in the network
4. Export results for downstream processing

In [None]:
# Apply node name cleanup
G = clean_node_names(G, manual_map)

# Visualize the cleaned PPI network
plot_ppi_network(G, "COPI Complex PPI Network (Cleaned Names)")

# Rewrite the original residue-pair file using cleaned names
clean_residue_pair_file(RESIDUE_PAIR_CSV, manual_map)


# Analyze binary and ternary complexes in the PPI network
dimer_in_ppi, trimer_in_ppi = analyze_complexes(G, protein_list)

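# Save results (create the output directory first if it does not exist)
os.makedirs(OUTPUT_DIR, exist_ok=True)
dimer_path = os.path.join(OUTPUT_DIR, "dimers.csv")
trimer_path = os.path.join(OUTPUT_DIR, "trimers.csv")

# Deduplicate, then sort so that the row order is the same on every run
df_binary_ppi = pd.DataFrame(sorted(set(dimer_in_ppi)), columns=["p1", "p2"])
df_binary_ppi.to_csv(dimer_path, index=False)

df_triplet_ppi = pd.DataFrame(sorted(set(trimer_in_ppi)), columns=["p1", "p2", "p3"])
df_triplet_ppi.to_csv(trimer_path, index=False)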

print(f"Total triplets: {len(list(combinations(protein_list, 3)))}")
print(f"Triplets found in PPI: {len(trimer_in_ppi)}")
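
To illustrate what enumerating dimers and trimers in a network can mean, here is a minimal standard-library sketch. It is not the implementation of `analyze_complexes`; in particular, it assumes that a triplet counts as a trimer when the three proteins are connected, that is, when at least two of the three possible pairs interact. `enumerate_complexes` is a hypothetical name.

```python
from itertools import combinations


def enumerate_complexes(edges, proteins):
    """Return (dimers, trimers) among `proteins`, given undirected `edges`."""
    wanted = set(proteins)
    # Keep edges between two distinct target proteins, ignoring direction.
    edge_set = {
        frozenset(e) for e in edges if set(e) <= wanted and len(set(e)) == 2
    }
    dimers = sorted(tuple(sorted(e)) for e in edge_set)
    trimers = []
    for trio in combinations(sorted(wanted), 3):
        n_edges = sum(
            frozenset(pair) in edge_set for pair in combinations(trio, 2)
        )
        if n_edges >= 2:  # assumption: a connected triplet counts as a trimer
            trimers.append(trio)
    return dimers, trimers
```

With ten target proteins there are 120 possible triplets, so restricting the list to connected ones may reduce the number of trimer predictions that need to be run.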

In [None]:
# Build assembly network and unified sequence table
network_df, useq_df = build_network_and_useqs(
    binary_csv=dimer_path,
    chains_csv=CHAINS_CSV,
    fasta_file=FASTA_PATH,
    network_out=NETWORK_CSV,
    useqs_out=USEQS_CSV
)

## Step 1 — Prepare Crosslink Constraints

Crosslink data are mapped onto unified sequences to generate
a standardized crosslink table (`ucrosslinks.csv`).

This file is later used as a spatial constraint during MCTS assembly.

In [None]:
useq_df = pd.read_csv(USEQS_CSV)
residue_pair_df = pd.read_csv(RESIDUE_PAIR_CSV)

ucrosslinks = crosslink_prepare(useq_df, residue_pair_df)
ucrosslinks.to_csv(UCROSSLINKS_CSV, index=False)

print("✔ ucrosslinks written to:", UCROSSLINKS_CSV)
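
One common way to use crosslinks as a spatial constraint is to count how many crosslinked residue pairs fall within a distance cutoff in a candidate model. The sketch below shows this idea only; it is not taken from the `mcts` module, and the 30 Å default for `cutoff` is a placeholder assumption that should be set according to the crosslinker used.

```python
import math


def satisfied_fraction(coords, crosslinks, cutoff=30.0):
    """Fraction of evaluable crosslinks whose residues lie within `cutoff` angstroms.

    coords:     {(chain, residue_number): (x, y, z)}
    crosslinks: iterable of ((chain1, res1), (chain2, res2))
    """
    # Only crosslinks with both residues present in the model can be evaluated.
    evaluable = [(a, b) for a, b in crosslinks if a in coords and b in coords]
    if not evaluable:
        return 0.0
    hits = sum(math.dist(coords[a], coords[b]) <= cutoff for a, b in evaluable)
    return hits / len(evaluable)
```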

## Step 2 — Rewrite AlphaFold PDB/CIF and Score Files

AlphaFold predictions are rewritten to ensure:

- Consistent chain naming
- Compatibility with downstream assembly steps

Note: If this step terminates early, re-running the cell will automatically
resume from the last completed structure.

In [None]:
rewrite_af_cif_structure(
    af_pred_folder=AF_PRED_DIR,
    chains_df_path=CHAINS_CSV,
    output_folder=REWRITED_PDB_DIR,
)

print("✔ AF PDB rewritten")

In [None]:
rewrite_af_score_file(
    af_pred_folder=AF_PRED_DIR,
    chains_df_path=CHAINS_CSV,
    output_folder=REWRITED_PDB_DIR,
)

print("✔ score rewritten")
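
Resuming after an interruption is typically done by skipping inputs whose output already exists. The sketch below illustrates that pattern under the assumption that each output file keeps the name of its input; how `rewrite_af_cif_structure` actually tracks progress is not shown in this notebook, and `pending_inputs` is a hypothetical helper.

```python
from pathlib import Path


def pending_inputs(input_dir, output_dir, pattern="*.cif"):
    """List input files that have no same-named file in `output_dir` yet."""
    done = {p.name for p in Path(output_dir).glob(pattern)}
    return sorted(
        p for p in Path(input_dir).glob(pattern) if p.name not in done
    )
```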

## Step 3 — Split Trimer Predictions into Dimers

Trimer AlphaFold predictions are decomposed into all possible dimer pairs.

These dimers form the candidate structural building blocks
for complex assembly.

In [None]:
split_trimer_to_dimers(
    REWRITED_PDB_DIR,
    PAIRS_DIR
)

print("✔ Trimer split to dimers")
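
Conceptually, a trimer with chains A, B, and C yields three dimers: A–B, A–C, and B–C. The following sketch shows that decomposition for PDB-format text, where the chain ID is the 22nd character of each `ATOM`/`HETATM` record. It is an illustration only; `split_by_chain_pairs` is a hypothetical name and the pipeline function may handle more cases, such as CIF input.

```python
from itertools import combinations


def split_by_chain_pairs(pdb_lines):
    """Map each chain pair to the ATOM/HETATM lines of those two chains."""
    atom_lines = [
        line for line in pdb_lines
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21
    ]
    chains = sorted({line[21] for line in atom_lines})  # column 22 = chain ID
    return {
        (a, b): [line for line in atom_lines if line[21] in (a, b)]
        for a, b in combinations(chains, 2)
    }
```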

## Step 4 — Select Central Dimer Structures

For each protein pair, the most central or representative dimer
is selected based on structural criteria.

This reduces redundancy and improves assembly efficiency.

In [None]:
select_most_central_pdb(PAIRS_DIR)

print("✔ Central dimers selected")
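
The structural criteria used by `select_most_central_pdb` are not spelled out here. One plausible reading of "most central" is the medoid: the candidate with the smallest summed dissimilarity (for example, RMSD) to all other candidates for the same pair. The sketch below implements that assumption; `most_central` and the `distance` dictionary layout are hypothetical.

```python
def most_central(names, distance):
    """Return the candidate with the smallest summed distance to all others.

    names:    list of candidate identifiers
    distance: {frozenset((name_a, name_b)): dissimilarity}
    """
    if not names:
        raise ValueError("no candidates")
    return min(
        names,
        key=lambda n: sum(distance[frozenset((n, m))] for m in names if m != n),
    )
```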

## Step 5 — Monte Carlo Tree Search (MCTS) Assembly

Finally, we assemble the full protein complex using an MCTS algorithm,
guided by:

- Network topology
- Structural compatibility
- Crosslinking constraints

The final assembled structures and logs are written to the output directory.

In [None]:
args = argparse.Namespace(
    network=NETWORK_CSV,
    pairdir=PAIRS_DIR,
    useqs=USEQS_CSV,
    ucrosslinks=UCROSSLINKS_CSV,
    outdir=OUTPUT_DIR,
)

mcts.main(args)
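
For readers unfamiliar with MCTS, the selection step is often driven by the UCT rule, which balances a child's average reward against how rarely it has been visited. The sketch below shows the standard UCB1 form of that rule; it is a generic illustration, and the scoring and selection used inside `complex_assembly.mcts` may differ.

```python
import math


def uct_score(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits):
    """children: {name: (total_reward, visits)}; return the best-scoring name."""
    return max(children, key=lambda k: uct_score(*children[k], parent_visits))
```

In an assembly setting, each child would correspond to adding one more chain to the partial complex, and the reward would reflect the quality of the resulting model.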

## Summary

This notebook provides an end-to-end demonstration of the
protein complex assembly pipeline, from interaction data
to final structural models.

The modular design allows individual steps to be adapted
or replaced depending on experimental input and biological context.