# NMDesc Protein Feature Parsing & Viewer

This notebook parses **NIHMS1818854-supplement-2.xlsx** to produce a tidy feature table for proteins
(e.g., Domains, PTMs, SLiMs, LCSs, NLSs/NESs, MORFs), and builds an interactive Plotly-based HTML
viewer for quick inspection and variant annotation.

**Inputs**
- `NIHMS1818854-supplement-2.xlsx` (expected sheets: A, B, H)

**Outputs**
- `parsed_features.pkl`, `parsed_features.csv`
- `docs/protein_data/*.json` (per-protein JSONs; **ALL proteins**)
- `docs/protein_features_viewer_with_variants.html` (GitHub Pages-ready)


## Environment & Dependencies

This notebook targets Python 3 with the following packages:

- `pandas`, `numpy`, `openpyxl` (for Excel), `plotly`
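
Before running the notebook, it can help to confirm which of these packages are present. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages listed in the dependency section above
REQUIRED = ["pandas", "numpy", "openpyxl", "plotly"]

def installed_versions(packages=REQUIRED):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` can be added with the `pip` command in this section.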

Install (if needed):

```bash
pip install pandas numpy openpyxl plotly
```


In [1]:
# --- Config paths (edit as needed) ---
EXCEL_PATH = "NIHMS1818854-supplement-2.xlsx"  # place the Excel next to the notebook

# outputs (GitHub Pages convention: publish /docs)
PICKLE_OUT = "parsed_features.pkl"
CSV_OUT    = "parsed_features.csv"
DATA_DIR   = "docs/protein_data"   # JSONs live under docs/
OUT_HTML   = "docs/protein_features_viewer_with_variants.html"

# Writer options
MINIFY_JSON = True         # json minified to shrink size
BUNDLE_BY_LETTER = False   # False = one file per protein; True = 26 files A..Z
ALPHA_SUBFOLDERS = False   # If True, place per-protein files under docs/protein_data/A/, B/, ...


## Parse Excel Sheets → Tidy Feature Table

- Normalizes column names (e.g., **LCSs** synonyms).
- Parses range-like cells (`1-10, 14, 20-25`) and handles **MORFs** special format correctly.
- Produces a long-format table with **Protein, Feature_Type, Start, End, Source_Column, seq_len**.
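
The range parsing described above can be sketched in isolation as follows. This is a simplified, standalone illustration, not the notebook's exact code: the helper names `parse_ranges` and `parse_labeled` are hypothetical, and the sample strings are made up for the example.

```python
import re

# Same pattern shape as the notebook's block_re: "FeatureName (ranges)"
BLOCK_RE = re.compile(r"([^(),]+?)\s*\(([^)]*)\)")

def parse_ranges(text):
    """Turn "1-10, 14, 20-25" into [(1, 10), (14, 14), (20, 25)]."""
    spans = []
    for tok in text.split(","):
        tok = tok.strip()
        if not tok:
            continue
        start, sep, end = tok.partition("-")
        try:
            st = int(start)
            en = int(end) if sep else st  # a single position becomes (n, n)
        except ValueError:
            continue  # skip anything that is not an integer or integer range
        if st > 0 and en > 0:
            spans.append((st, en))
    return spans

def parse_labeled(text):
    """Parse "Name (ranges), Name (ranges)" into (name, start, end) tuples."""
    return [
        (name.strip(), st, en)
        for name, inner in BLOCK_RE.findall(text)
        for st, en in parse_ranges(inner)
    ]

# MORFs-style cell: bare ranges without a feature label
print(parse_ranges("1-10, 14, 20-25"))
# Standard cell: labeled blocks
print(parse_labeled("Helix (5-17, 21-32), Cross-link (51)"))
```

MORFs cells go through the bare-range path because they carry no `Name (...)` wrapper for the block pattern to match.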

In [2]:
import re, math, json, os
import numpy as np, pandas as pd

os.makedirs("docs", exist_ok=True)
os.makedirs(DATA_DIR, exist_ok=True)

# normalize any LCSs-like column name to 'LCSs'
def normalize_lcs_columns(df: pd.DataFrame) -> pd.DataFrame:
    candidates = [
        "LCSs", "LCS", "Low complexity", "Low-complexity",
        "Low complexity regions", "Low-complexity regions",
        "Low_complexity", "Low complexity (LCSs)"
    ]
    for c in candidates:
        if c in df.columns:
            if c != "LCSs":
                df = df.rename(columns={c: "LCSs"})
            break
    if "LCSs" not in df.columns:
        df["LCSs"] = np.nan
    return df

# --- Load sheets ---
sheetA = pd.read_excel(EXCEL_PATH, sheet_name="A")
sheetA = sheetA.rename(columns={
    "UniProt ID":"Uniprot ID", "UniProtID":"Uniprot ID",
    "seq":"Sequence", "sequence":"Sequence"
})
sheetA["Protein"] = sheetA["Protein"].astype(str).str.strip()
sheetA["Protein_norm"] = sheetA["Protein"].str.upper()
sheetA["seq_len"] = sheetA["Sequence"].astype(str).str.len()
seq_len_map = sheetA.groupby("Protein_norm")["seq_len"].max().to_dict()

# B and H contain features (G duplicates LCSs in some versions)
sheet_b = pd.read_excel(EXCEL_PATH, sheet_name="B"); sheet_b["Sheet"] = "B"
sheet_h = pd.read_excel(EXCEL_PATH, sheet_name="H"); sheet_h["Sheet"] = "H"
sheet_b = normalize_lcs_columns(sheet_b)
sheet_h = normalize_lcs_columns(sheet_h)

raw = pd.concat([sheet_b, sheet_h], ignore_index=True)

# columns we use
cols = ["Protein","Protein Features","Domains","SLiMs","MORFs","PTMs","NLSs/NESs","LCSs","Sheet"]
cols = [c for c in cols if c in raw.columns]
raw = raw[cols].copy()
raw["Protein"] = raw["Protein"].astype(str).str.strip()
raw["Protein_norm"] = raw["Protein"].str.upper()

# --- Parse cells ---
block_re = re.compile(r'([^(),]+?)\s*\(([^)]*)\)')  # e.g. "Feature (1-10, 14, 20-25)"

def _parse_tokens(inner, feat):
    """Parse a comma-separated list like "1-10, 14, 20-25" into (feat, start, end) tuples."""
    spans = []
    for tok in inner.split(","):
        tok = tok.strip()
        if not tok:
            continue
        try:
            if "-" in tok:
                a, b = tok.split("-", 1)
                st, en = int(a.strip()), int(b.strip())
            else:
                st = en = int(tok)
        except ValueError:
            continue  # skip tokens that are not integers or integer ranges
        if st > 0 and en > 0:
            spans.append((feat, st, en))
    return spans

def parse_cell(cell_text, column_name=None):
    if pd.isna(cell_text):
        return []
    s = str(cell_text).strip()
    if not s:
        return []

    # MORFs often appear as bare ranges ("1-10, 14-20") without a feature label
    if column_name == "MORFs":
        return _parse_tokens(s, "MORF")

    # standard "FeatureName (1-10, 14, 20-25)" blocks
    out = []
    for name, inner in block_re.findall(s):
        out.extend(_parse_tokens(inner, name.strip().strip(",")))
    return out

tidy_rows = []
for _, r in raw.iterrows():
    prot = r["Protein"]; pn = r["Protein_norm"]
    for col in ["Protein Features","Domains","SLiMs","MORFs","PTMs","NLSs/NESs","LCSs"]:
        if col not in raw.columns:
            continue
        for ft, st, en in parse_cell(r.get(col, None), column_name=col):
            tidy_rows.append([prot, pn, ft, st, en, col])

parsed = pd.DataFrame(
    tidy_rows,
    columns=["Protein","Protein_norm","Feature_Type","Start","End","Source_Column"]
)

# attach sequence length
parsed["seq_len"] = parsed["Protein_norm"].map(seq_len_map)

# cleanups
parsed = parsed.dropna(subset=["Start","End"])
parsed["Start"] = pd.to_numeric(parsed["Start"], errors="coerce").astype("Int64")
parsed["End"]   = pd.to_numeric(parsed["End"], errors="coerce").astype("Int64")
parsed = parsed.dropna(subset=["Start","End"]).astype({"Start":"int","End":"int"})
parsed = parsed[parsed["End"] >= parsed["Start"]]
parsed = parsed.drop_duplicates(
    subset=["Protein_norm","Feature_Type","Start","End","Source_Column"]
)

# save tidy outputs
parsed.to_pickle(PICKLE_OUT)
parsed.to_csv(CSV_OUT, index=False)

print(f"✅ Saved {PICKLE_OUT} and {CSV_OUT}")
print(f"Rows: {len(parsed):,} | Proteins: {parsed['Protein'].nunique():,}")
print("\nBy source column:")
print(parsed["Source_Column"].value_counts())

✅ Saved parsed_features.pkl and parsed_features.csv
Rows: 1,012,180 | Proteins: 20,145

By source column:
Source_Column
PTMs                373588
Protein Features    373429
LCSs                159533
Domains              57956
NLSs/NESs            45822
SLiMs                 1771
MORFs                   81
Name: count, dtype: int64


## Quick Preview of Parsed Table

In [3]:
parsed = pd.read_pickle(PICKLE_OUT)
print("Rows:", len(parsed))
print("Proteins:", parsed["Protein"].nunique())
print(parsed["Source_Column"].value_counts())
parsed.head(10)

Rows: 1012180
Proteins: 20145
Source_Column
PTMs                373588
Protein Features    373429
LCSs                159533
Domains              57956
NLSs/NESs            45822
SLiMs                 1771
MORFs                   81
Name: count, dtype: int64


|   | Protein | Protein_norm | Feature_Type | Start | End | Source_Column | seq_len |
|---|---------|--------------|--------------|-------|-----|------------------|---------|
| 0 | 1433B | 1433B | Cross-link | 51 | 51 | Protein Features | 246 |
| 1 | 1433B | 1433B | Helix | 114 | 134 | Protein Features | 246 |
| 2 | 1433B | 1433B | Helix | 137 | 161 | Protein Features | 246 |
| 3 | 1433B | 1433B | Helix | 167 | 182 | Protein Features | 246 |
| 4 | 1433B | 1433B | Helix | 187 | 203 | Protein Features | 246 |
| 5 | 1433B | 1433B | Helix | 204 | 207 | Protein Features | 246 |
| 6 | 1433B | 1433B | Helix | 21 | 32 | Protein Features | 246 |
| 7 | 1433B | 1433B | Helix | 213 | 231 | Protein Features | 246 |
| 8 | 1433B | 1433B | Helix | 40 | 70 | Protein Features | 246 |
| 9 | 1433B | 1433B | Helix | 5 | 17 | Protein Features | 246 |


## Build Interactive Viewer (Plotly)

- Creates `docs/protein_data/` with per-protein JSONs.
- Writes the viewer page to `docs/protein_features_viewer_with_variants.html`, suitable for GitHub Pages. The page loads Plotly from a CDN and fetches the JSONs on demand.
- Use the select box to choose a protein. To add variant markers, populate the `currentVariants` array in the embedded script; each entry needs a `position` and a `display` label.
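
The embedded viewer script reads `position` and `display` from each entry of its `currentVariants` array. One way to prepare such entries is sketched below, assuming variants are written in a short protein-change form such as `R175H`; both that notation and the helper `to_variant_entries` are assumptions for illustration and do not appear in the notebook.

```python
import json
import re

# Assumed notation: reference residue(s), position, then optional alternate residue(s)
VARIANT_RE = re.compile(r"^[A-Za-z*]+(\d+)[A-Za-z*=]*$")

def to_variant_entries(labels):
    """Convert labels like "R175H" into {"position": 175, "display": "R175H"} dicts."""
    entries = []
    for label in labels:
        text = label.strip()
        match = VARIANT_RE.match(text)
        if match is None:
            continue  # skip labels that do not fit the assumed notation
        entries.append({"position": int(match.group(1)), "display": text})
    return entries

# The JSON printed here could be pasted as the value of currentVariants
print(json.dumps(to_variant_entries(["R175H", "R248Q", "not-a-variant"])))
```

Positions are 1-based residue numbers, matching the `Start` and `End` columns of the parsed table.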

In [5]:
import plotly.graph_objects as go
import plotly.io as pio

# normalize Source_Column drift for rendering
parsed_for_view = parsed.copy()
parsed_for_view["Source_Column"] = (
    parsed_for_view["Source_Column"].astype(str).str.strip()
    .replace({
        "LCS": "LCSs",
        "Low complexity": "LCSs",
        "LOW COMPLEXITY": "LCSs",
        "Low-complexity": "LCSs",
    })
)

track_order = ["PTMs", "Protein Features", "SLiMs", "LCSs", "NLSs/NESs", "MORFs"]
BAND_FILL = "rgba(0,0,0,0.03)"
COLOR_RANGE_FILL_DEFAULT = "rgba(66,135,245,0.35)"
COLOR_RANGE_LINE_DEFAULT = "rgba(0,0,0,0.35)"
COLOR_POINT_DEFAULT      = "#9467bd"
TRACK_RANGE_FILL = {
    "Domains":          "rgba(66,135,245,0.35)",
    "PTMs":             "rgba(148,103,189,0.30)",
    "Protein Features": "rgba(70,70,70,0.08)",
    "SLiMs":            "rgba(255,159,64,0.28)",
    "LCSs":             "rgba(50,205,50,0.45)",
    "NLSs/NESs":        "rgba(23,162,184,0.28)",
    "MORFs":            "rgba(220,53,69,0.22)",
}
TRACK_RANGE_LINE = {"LCSs":  {"color": "rgba(0,0,0,0.60)", "width": 1.6}}
TRACK_POINT_COLOR = {
    "PTMs":             "#8a63c2",
    "Protein Features": "#6c757d",
    "SLiMs":            "#e67e22",
    "LCSs":             "#1e7e34",
    "NLSs/NESs":        "#17a2b8",
    "MORFs":            "#dc3545",
}

VARIANT_COLOR = "#ff0000"
y_backbone = 0.5
y_spacing  = 1.45
y_positions = {name: (len(track_order)-i)*y_spacing for i, name in enumerate(track_order)}

PLOTLY_CONFIG = dict(
    toImageButtonOptions=dict(format="png", filename="protein_features", scale=2),
    displaylogo=False
)

proteins_all = sorted(parsed_for_view["Protein"].unique())
protein_map = {}  # {original_name: safe_filename}

def _range_fill_for(col: str) -> str:
    return TRACK_RANGE_FILL.get(col, COLOR_RANGE_FILL_DEFAULT)

def _range_line_for(col: str) -> dict:
    base = {"color": COLOR_RANGE_LINE_DEFAULT, "width": 1}
    base.update(TRACK_RANGE_LINE.get(col, {}))
    return base

def _point_color_for(col: str) -> str:
    return TRACK_POINT_COLOR.get(col, COLOR_POINT_DEFAULT)

def safe_length(sub):
    seq = sub["seq_len"].dropna()
    if not seq.empty:
        try:
            L = int(seq.iloc[0])
            if L > 0:
                return L
        except (TypeError, ValueError):
            pass
    max_feat = int(pd.to_numeric(sub["End"], errors="coerce").max()) if not sub.empty else 0
    return int(math.ceil(max_feat * 1.03 + 3))

def build_payload_for_protein(prot, include_single_sites=True):
    sub = parsed_for_view[parsed_for_view["Protein"]==prot].copy()
    if sub.empty:
        L = 100
        return dict(data=[], shapes=[], annotations=[], xr=[0,L], yr=[-0.6,5], tv=[], tt=[], title=f"{prot}", protein_length=L)
    L = safe_length(sub)
    shapes, ann, traces = [], [], []
    # backbone
    shapes.append(dict(type="rect", x0=0, x1=L, y0=y_backbone-0.2, y1=y_backbone+0.2, line=dict(width=0), fillcolor="#9ea0a3"))
    ann.append(dict(x=L, y=y_backbone+0.35, text=f"{L} aa", showarrow=False))

    # domains on backbone
    dom = sub[sub["Source_Column"]=="Domains"].copy()
    if not dom.empty:
        dom["Start"] = pd.to_numeric(dom["Start"], errors="coerce")
        dom["End"]   = pd.to_numeric(dom["End"],   errors="coerce")
    for _, r in dom.iterrows():
        st, en = float(r["Start"]), float(r["End"])
        feat = str(r["Feature_Type"]) if pd.notna(r["Feature_Type"]) else ""
        label = feat if len(feat)<=20 else feat[:17]+"..."
        shapes.append(dict(type="rect", x0=st, x1=en, y0=y_backbone-0.28, y1=y_backbone+0.28,
                           line=dict(color=COLOR_RANGE_LINE_DEFAULT, width=1),
                           fillcolor=TRACK_RANGE_FILL.get("Domains", COLOR_RANGE_FILL_DEFAULT)))
        if en-st >= 15:
            ann.append(dict(x=(st+en)/2, y=y_backbone, text=label, showarrow=False,
                            font=dict(size=10, color="black"), xanchor="center", yanchor="middle", align="center"))
        traces.append(go.Scatter(x=[(st+en)/2], y=[y_backbone], mode="markers", marker=dict(size=8, opacity=0),
                                 showlegend=False, hoverinfo="text", hovertext=f"Domain: {feat} ({int(st)}–{int(en)})"))

    # other tracks
    for col in track_order:
        y = y_positions[col]
        shapes.append(dict(type="rect", x0=0, x1=L, y0=y-0.28, y1=y+0.28, line=dict(width=0), fillcolor=BAND_FILL))
        chunk = sub[sub["Source_Column"]==col].copy()
        if chunk.empty: 
            continue
        chunk["Start"] = pd.to_numeric(chunk["Start"], errors="coerce")
        chunk["End"]   = pd.to_numeric(chunk["End"],   errors="coerce")

        rng = chunk[chunk["End"].fillna(chunk["Start"]) > chunk["Start"]]
        for _, r in rng.iterrows():
            st = float(r["Start"]); en = float(r["End"])
            ftype = (r.get("Feature_Type") if pd.notna(r.get("Feature_Type")) else "") or ""
            shapes.append(dict(type="rect", x0=st, x1=en, y0=y-0.28, y1=y+0.28,
                               line=_range_line_for(col), fillcolor=_range_fill_for(col)))
            traces.append(go.Scatter(x=[(st+en)/2], y=[y], mode="markers",
                                     marker=dict(size=8, opacity=0), showlegend=False,
                                     hoverinfo="text", hovertext=f"{col}: {ftype} ({int(st)}–{int(en)})"))

        # single-residue sites are drawn only when include_single_sites is True,
        # so the "noSingles" payload omits them
        if include_single_sites:
            pts = chunk[chunk["End"].fillna(chunk["Start"]) == chunk["Start"]]
            for _, r in pts.iterrows():
                x = float(r["Start"])
                ftype = (r.get("Feature_Type") if pd.notna(r.get("Feature_Type")) else "") or ""
                shapes.append(dict(type="line", x0=x, x1=x, y0=y-0.22, y1=y+0.22, line=dict(color=_point_color_for(col), width=2)))
                traces.append(go.Scatter(x=[x], y=[y], mode="markers",
                                         marker=dict(size=8, color=_point_color_for(col), opacity=0.0),
                                         hoverinfo="text", hovertext=f"{col}: {ftype} ({int(x)})", showlegend=False))

    tickvals = [y_positions[n] for n in track_order] + [y_backbone]
    ticktext = track_order[:] + ["backbone"]
    y_range = [-0.6, max(tickvals)+1.0]
    title = f"{prot}"
    return dict(data=[t.to_plotly_json() for t in traces], shapes=shapes, annotations=ann,
                xr=[0, L], yr=y_range, tv=tickvals, tt=ticktext, title=title, protein_length=L)

# ---------- Write ALL payloads ----------
from collections import defaultdict

def safe_name(s: str) -> str:
    return s.replace("/", "_").replace("\\", "_").replace(":", "_")

if BUNDLE_BY_LETTER:
    # 26 bundle files A..Z
    bundles = defaultdict(dict)  # {"A": {prot: payload, ...}, ...}
    for i, prot in enumerate(proteins_all, 1):
        payload = {
            "withSingles": build_payload_for_protein(prot, include_single_sites=True),
            "noSingles":   build_payload_for_protein(prot, include_single_sites=False),
        }
        bundles[safe_name(prot)[0].upper()][prot] = payload
        if i % 500 == 0:
            print(f"  Bundled {i}/{len(proteins_all)}...")
    for letter, obj in bundles.items():
        out = os.path.join(DATA_DIR, f"{letter}.json")
        with open(out, "w", encoding="utf-8") as f:
            if MINIFY_JSON:
                json.dump(obj, f, separators=(",", ":"))
            else:
                json.dump(obj, f, indent=2)
        print(f"Wrote {out} with {len(obj)} proteins")
else:
    # per-protein JSON, optional A–Z subfolders
    if ALPHA_SUBFOLDERS:
        for L in [chr(c) for c in range(ord("A"), ord("Z")+1)]:
            os.makedirs(os.path.join(DATA_DIR, L), exist_ok=True)

    for i, prot in enumerate(proteins_all, 1):
        safe = safe_name(prot)
        protein_map[prot] = safe
        subdir = os.path.join(DATA_DIR, safe[0].upper()) if ALPHA_SUBFOLDERS else DATA_DIR
        os.makedirs(subdir, exist_ok=True)
        out = os.path.join(subdir, f"{safe}.json")
        payload = {
            "withSingles": build_payload_for_protein(prot, include_single_sites=True),
            "noSingles":   build_payload_for_protein(prot, include_single_sites=False),
        }
        with open(out, "w", encoding="utf-8") as f:
            if MINIFY_JSON:
                json.dump(payload, f, separators=(",", ":"))
            else:
                json.dump(payload, f, indent=2)
        if i % 500 == 0:
            print(f"  Wrote {i}/{len(proteins_all)} proteins")

# ---------- Build the HTML shell ----------
seed = {"data": [], "shapes": [], "annotations": [dict(x=0.5, y=0.5, xref="paper", yref="paper",
         text="Select a protein to begin...", showarrow=False, font=dict(size=16, color="#666"))],
        "xr": [0, 100], "yr": [-0.6, 5], "tv": [], "tt": [], "title": "Protein Features Viewer with Variants"}

fig = go.Figure(seed["data"])
fig.update_layout(title=seed["title"], shapes=seed["shapes"], annotations=seed["annotations"],
                  margin=dict(l=90, r=30, t=110, b=60), showlegend=True, paper_bgcolor="white")
fig.update_xaxes(range=seed["xr"], title="Amino acid position", showgrid=False, zeroline=False)
fig.update_yaxes(range=seed["yr"], tickvals=seed["tv"], ticktext=seed["tt"], showgrid=False)
html_core = pio.to_html(fig, include_plotlyjs="cdn", full_html=True, config=PLOTLY_CONFIG)

# controls/loader (supports both per-protein JSON and A–Z bundles)
if BUNDLE_BY_LETTER:
    controls_html = f"""
<script>
function initViewer() {{
  if (typeof Plotly === 'undefined') {{ setTimeout(initViewer, 100); return; }}
  const PROTEINS = {json.dumps(proteins_all)};
  const DATA_DIR = "protein_data";
  const BUNDLE_CACHE = {{}};

  const gd = document.querySelector('.js-plotly-plot');
  const sel = document.createElement('select');
  sel.style.margin = '8px';
  PROTEINS.forEach(p=>{{ const o=document.createElement('option'); o.value=p; o.textContent=p; sel.appendChild(o); }});
  gd.parentNode.insertBefore(sel, gd);

  async function loadP(p) {{
    const letter = (p[0] || "X").toUpperCase();
    if (!BUNDLE_CACHE[letter]) {{
      const resp = await fetch(`${{DATA_DIR}}/${{letter}}.json`);
      BUNDLE_CACHE[letter] = await resp.json();
    }}
    return BUNDLE_CACHE[letter][p];
  }}

  function varShapes(vs, yTop) {{
    const VARIANT_COLOR = "#ff0000", y_backbone = 0.5;
    const shapes = [], traces = [];
    vs.forEach(v => {{
      const x = v.position;
      shapes.push({{type:'line', x0:x, x1:x, y0:-0.5, y1:yTop, line:{{color:VARIANT_COLOR, width:2}} }});
      traces.push({{x:[x], y:[y_backbone], mode:'markers', marker:{{symbol:'diamond', size:12, color:VARIANT_COLOR, line:{{color:'white', width:1}}}}, hoverinfo:'text', hovertext:`Variant ${{v.display}}` }});
    }});
    return {{shapes, traces}};
  }}

  let currentVariants = [];
  async function apply(p) {{
    const state = (await loadP(p))['noSingles'];
    const extra = varShapes(currentVariants, state.yr[1]);
    Plotly.react(gd, [...state.data, ...extra.traces], {{
      ...gd.layout, title: {{text: state.title}},
      shapes: [...state.shapes, ...extra.shapes], annotations: state.annotations,
      xaxis: {{...gd.layout.xaxis, range: state.xr, title: {{text:'Amino acid position'}}, showgrid:false, zeroline:false}},
      yaxis: {{...gd.layout.yaxis, range: state.yr, tickvals: state.tv, ticktext: state.tt, showgrid:false}}
    }});
  }}

  sel.addEventListener('change', ()=>apply(sel.value));
  sel.value = PROTEINS[0]; apply(sel.value);
}}
document.readyState === 'loading' ? document.addEventListener('DOMContentLoaded', initViewer) : initViewer();
</script>
"""
else:
    # per protein; if ALPHA_SUBFOLDERS=True, we fetch from protein_data/<LETTER>/<file>.json
    fetch_expr = """`${DATA_DIR}/${PROTEIN_MAP[p][0].toUpperCase()}/${PROTEIN_MAP[p]}.json`""" if ALPHA_SUBFOLDERS else """`${DATA_DIR}/${PROTEIN_MAP[p]}.json`"""
    controls_html = f"""
<script>
function initViewer() {{
  if (typeof Plotly === 'undefined') {{ setTimeout(initViewer, 100); return; }}
  const PROTEINS = {json.dumps(proteins_all)};
  const PROTEIN_MAP = {json.dumps({p: safe_name(p) for p in proteins_all})};
  const DATA_DIR = "protein_data";
  const cache = {{}};

  const gd = document.querySelector('.js-plotly-plot');
  const sel = document.createElement('select');
  sel.style.margin = '8px';
  PROTEINS.forEach(p=>{{ const o=document.createElement('option'); o.value=p; o.textContent=p; sel.appendChild(o); }});
  gd.parentNode.insertBefore(sel, gd);

  async function loadP(p) {{
    if (cache[p]) return cache[p];
    const resp = await fetch({fetch_expr});
    const data = await resp.json(); cache[p]=data; return data;
  }}

  function varShapes(vs, yTop) {{
    const VARIANT_COLOR = "#ff0000", y_backbone = 0.5;
    const shapes = [], traces = [];
    vs.forEach(v => {{
      const x = v.position;
      shapes.push({{type:'line', x0:x, x1:x, y0:-0.5, y1:yTop, line:{{color:VARIANT_COLOR, width:2}} }});
      traces.push({{x:[x], y:[y_backbone], mode:'markers', marker:{{symbol:'diamond', size:12, color:VARIANT_COLOR, line:{{color:'white', width:1}}}}, hoverinfo:'text', hovertext:`Variant ${{v.display}}` }});
    }});
    return {{shapes, traces}};
  }}

  let currentVariants = [];
  async function apply(p) {{
    const state = (await loadP(p))['noSingles'];
    const extra = varShapes(currentVariants, state.yr[1]);
    Plotly.react(gd, [...state.data, ...extra.traces], {{
      ...gd.layout, title: {{text: state.title}},
      shapes: [...state.shapes, ...extra.shapes], annotations: state.annotations,
      xaxis: {{...gd.layout.xaxis, range: state.xr, title: {{text:'Amino acid position'}}, showgrid:false, zeroline:false}},
      yaxis: {{...gd.layout.yaxis, range: state.yr, tickvals: state.tv, ticktext: state.tt, showgrid:false}}
    }});
  }}

  sel.addEventListener('change', ()=>apply(sel.value));
  sel.value = PROTEINS[0]; apply(sel.value);
}}
document.readyState === 'loading' ? document.addEventListener('DOMContentLoaded', initViewer) : initViewer();
</script>
"""

html_out = html_core.replace("<body>", "<body>\n" + controls_html)
with open(OUT_HTML, "w", encoding="utf-8") as f:
    f.write(html_out)

print(f"✅ Wrote viewer: {OUT_HTML}")
print(f"Proteins written: {len(proteins_all)}")
print(f"JSON location: {DATA_DIR} (bundle_by_letter={BUNDLE_BY_LETTER}, alpha_subfolders={ALPHA_SUBFOLDERS})")

  Wrote 500/20145 proteins
  Wrote 1000/20145 proteins
  Wrote 1500/20145 proteins
  Wrote 2000/20145 proteins
  Wrote 2500/20145 proteins
  Wrote 3000/20145 proteins
  Wrote 3500/20145 proteins
  Wrote 4000/20145 proteins
  Wrote 4500/20145 proteins
  Wrote 5000/20145 proteins
  Wrote 5500/20145 proteins
  Wrote 6000/20145 proteins
  Wrote 6500/20145 proteins
  Wrote 7000/20145 proteins
  Wrote 7500/20145 proteins
  Wrote 8000/20145 proteins
  Wrote 8500/20145 proteins
  Wrote 9000/20145 proteins
  Wrote 9500/20145 proteins
  Wrote 10000/20145 proteins
  Wrote 10500/20145 proteins
  Wrote 11000/20145 proteins
  Wrote 11500/20145 proteins
  Wrote 12000/20145 proteins
  Wrote 12500/20145 proteins
  Wrote 13000/20145 proteins
  Wrote 13500/20145 proteins
  Wrote 14000/20145 proteins
  Wrote 14500/20145 proteins
  Wrote 15000/20145 proteins
  Wrote 15500/20145 proteins
  Wrote 16000/20145 proteins
  Wrote 16500/20145 proteins
  Wrote 17000/20145 proteins
  Wrote 17500/20145 proteins
  Wro

## Quick Lists: Proteins with MORFs / NLSs-NESs / SLiMs

Exports a CSV and JSON summary to help pick test proteins quickly.
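
The per-track summary amounts to counting feature rows per protein within each track. The sketch below shows that counting on toy data with the standard library only; it stands in for the notebook's pandas `value_counts` call, and both `track_summary` and the toy rows are hypothetical.

```python
from collections import Counter

def track_summary(rows, tracks, top_n=25):
    """Summarize a list of (protein, source_column) pairs per track."""
    summary = {}
    for track in tracks:
        # one count per feature row belonging to this track
        counts = Counter(prot for prot, col in rows if col == track)
        summary[track] = {
            "num_proteins_with_features": len(counts),
            "top_proteins_by_feature_count": dict(counts.most_common(top_n)),
        }
    return summary

# Toy rows for illustration only; not taken from the real table
toy_rows = [
    ("P53", "SLiMs"), ("P53", "SLiMs"), ("JUN", "SLiMs"),
    ("P53", "MORFs"), ("TITIN", "NLSs/NESs"),
]
print(track_summary(toy_rows, ["MORFs", "NLSs/NESs", "SLiMs"], top_n=2))
```

The resulting dictionary has the same keys as the JSON summary written by the cell below.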

In [7]:
# --- Find proteins with MORFs, NLSs/NESs, and SLiMs features ---
import json
import pandas as pd

PICKLE_OUT = "parsed_features.pkl"
OUT_CSV    = "feature_proteins_MORFs_NLSs_SLIMs.csv"
OUT_JSON   = "feature_protein_index.json"

# Load
df = pd.read_pickle(PICKLE_OUT).copy()

# Normalize Source_Column to reduce drift / synonyms
norm_map = {
    # LCS handled elsewhere; here we care about SLiMs / NLSs / MORFs synonyms
    "SLIM": "SLiMs", "SLIMs": "SLiMs", "Slims": "SLiMs", "SLiM": "SLiMs",
    "NLS": "NLSs/NESs", "NES": "NLSs/NESs", "NLS/NES": "NLSs/NESs",
    "NLSs": "NLSs/NESs", "NESs": "NLSs/NESs",
    "Nuclear localization signal": "NLSs/NESs",
    "Nuclear export signal": "NLSs/NESs",
    "MoRF": "MORFs", "MoRFs": "MORFs", "MORF": "MORFs"
}
df["Source_Column"] = (
    df["Source_Column"].astype(str).str.strip()
    .replace(norm_map)
)

target_tracks = ["MORFs", "NLSs/NESs", "SLiMs"]

# Ensure numeric positions (useful if you want to filter/sort later)
for col in ("Start", "End"):
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")

# Subset to only requested tracks
sub = df[df["Source_Column"].isin(target_tracks)].copy()

# Safety: keep useful columns only (add/remove as needed)
cols = [c for c in ["Protein", "Source_Column", "Feature_Type", "Start", "End", "seq_len", "Source"] if c in sub.columns]
sub = sub[cols].sort_values(["Source_Column", "Protein", "Start", "End"], na_position="last")

# Build summaries
summary = {}
for track in target_tracks:
    dtrack = sub[sub["Source_Column"] == track]
    counts = dtrack["Protein"].value_counts()
    summary[track] = {
        "num_proteins_with_features": int(counts.size),
        "top_proteins_by_feature_count": counts.head(25).to_dict(),  # top-25 for quick picking
    }

# Write outputs for inspection
sub.to_csv(OUT_CSV, index=False)
with open(OUT_JSON, "w", encoding="utf-8") as f:
    json.dump(summary, f, indent=2)

# Pretty print a quick console summary
print("\n=== Proteins with requested track features ===")
for track in target_tracks:
    print(f"\n[{track}] proteins: {summary[track]['num_proteins_with_features']}")
    if summary[track]["top_proteins_by_feature_count"]:
        print("Top examples (protein: count):")
        for p, n in list(summary[track]["top_proteins_by_feature_count"].items())[:10]:
            print(f"  - {p}: {n}")
    else:
        print("  (No proteins found for this track.)")

print(f"\nüìÑ Wrote details to: {OUT_CSV}")
print(f"üß≠ Wrote summary index to: {OUT_JSON}")

# Optional: quick ‚Äúpick lists‚Äù you can copy/paste into your viewer
pick_lists = {
    track: list(summary[track]["top_proteins_by_feature_count"].keys())[:10]
    for track in target_tracks
}
print("\nQuick pick-lists (first 10 each):")
for track, plist in pick_lists.items():
    print(f"  {track}: {plist}")


=== Proteins with requested track features ===

[MORFs] proteins: 72
Top examples (protein: count):
  - AF9: 2
  - SNP29: 2
  - P73: 2
  - VIME: 2
  - RLA2: 2
  - GEMI: 2
  - H4: 2
  - P63: 2
  - P53: 2
  - OPTN: 1

[NLSs/NESs] proteins: 5160
Top examples (protein: count):
  - TITIN: 301
  - SRRM2: 106
  - DYST: 98
  - SYNE1: 90
  - ASH1L: 84
  - TTF1: 84
  - ATRX: 83
  - SYNE2: 77
  - ASPM: 75
  - CHD6: 75

[SLiMs] proteins: 1093
Top examples (protein: count):
  - EP15R: 21
  - THYG: 17
  - EPS15: 15
  - P53: 14
  - NRIP1: 10
  - CASR: 9
  - JUN: 9
  - PML: 9
  - NFAC1: 9
  - IL6RB: 8

📄 Wrote details to: feature_proteins_MORFs_NLSs_SLIMs.csv
🧭 Wrote summary index to: feature_protein_index.json

Quick pick-lists (first 10 each):
  MORFs: ['AF9', 'SNP29', 'P73', 'VIME', 'RLA2', 'GEMI', 'H4', 'P63', 'P53', 'OPTN']
  NLSs/NESs: ['TITIN', 'SRRM2', 'DYST', 'SYNE1', 'ASH1L', 'TTF1', 'ATRX', 'SYNE2', 'ASPM', 'CHD6']
  SLiMs: ['EP15R', 'THYG', 'EPS15', 'P53', 'NRIP1', 'CASR', 'JUN', 'P

## Reproducibility Notes

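- The parsed table and the per-protein JSONs are deterministic given the same Excel input. The viewer HTML can still differ between runs in the element id that Plotly auto-generates, unless a fixed `div_id` is passed to `pio.to_html`.
- For large repositories: place the full `docs/protein_data/` behind Git LFS, or commit only a small demo subset of proteins.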
- The viewer HTML is in `docs/` so GitHub Pages can serve it automatically.
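
One way to check that two runs produced identical files is to compare checksums. The following is a minimal sketch; the helpers `file_sha256` and `checksum_outputs` are hypothetical, and the choice of SHA-256 is an assumption rather than something the notebook prescribes.

```python
import hashlib
from pathlib import Path

def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def checksum_outputs(paths):
    """Map each existing file path to its digest; missing files are skipped."""
    return {str(p): file_sha256(p) for p in paths if Path(p).is_file()}

# Prints nothing if the file has not been generated yet
for path, digest in checksum_outputs(["parsed_features.csv"]).items():
    print(f"{digest}  {path}")
```

Comparing the CSV rather than the pickle is the safer choice, since pickle bytes may vary across pandas versions.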