# Local per-residue embedding analysis of TP53 missense mutations

**Goal:**  
To identify residue-level structural regions that differentiate functional from non-functional TP53 missense mutations.

**Key result:**  
The largest embedding differences cluster in residues ~234–276, corresponding to the L3 loop and adjacent β-strands (S9/S10) of the DNA-binding domain.

**Interpretation:**  
Disruption of the loop–sheet–helix motif support region is strongly associated with loss of TP53 function.


### Generating all possible mutations
The wild-type TP53 sequence was taken from the UniProt database. A helper function then generates every possible missense mutation (single amino-acid substitution).

In [1]:
TP53='MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD'

In [2]:
len(TP53)

393

In [3]:
from typing import Dict

# Standard 20 amino acids
AMINO_ACIDS = [
    "A", "C", "D", "E", "F",
    "G", "H", "I", "K", "L",
    "M", "N", "P", "Q", "R",
    "S", "T", "V", "W", "Y"
]


def generate_single_mutations(
    protein_sequence: str
) -> Dict[str, str]:
    """
    Generate all possible single amino-acid substitutions
    for a protein sequence.

    Parameters
    ----------
    protein_sequence : str
        Wild-type protein sequence (e.g. TP53)

    Returns
    -------
    mutations : dict
        Dictionary where:
        key   = mutation code (e.g. 'R175H')
        value = mutated protein sequence
    """

    mutations = {}
    seq_len = len(protein_sequence)

    # Loop over each position in the sequence
    for i in range(seq_len):
        wt_aa = protein_sequence[i]  # wild-type amino acid
        pos = i + 1                  # 1-based indexing (biological convention)

        # Try all possible amino-acid substitutions
        for mut_aa in AMINO_ACIDS:
            # Skip if mutation is same as wild-type
            if mut_aa == wt_aa:
                continue

            # Construct mutation code (e.g. R175H)
            mutation_code = f"{wt_aa}{pos}{mut_aa}"

            # Create mutated sequence
            mutated_sequence = (
                protein_sequence[:i]
                + mut_aa
                + protein_sequence[i + 1:]
            )

            mutations[mutation_code] = mutated_sequence

    return mutations


In [4]:
mutations = generate_single_mutations(TP53)

print(len(mutations))       # 7467
print(mutations["R175H"])   # mutated TP53 sequence

7467
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRHCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
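
As a sanity check on the count: each of the 393 positions can be replaced by 19 other amino acids, so 393 × 19 = 7467 substitutions are expected. The sketch below illustrates how a mutation code maps onto a sequence. It uses a four-residue toy sequence and a helper named `apply_mutation`, which is hypothetical and not part of this notebook; it is meant only as a minimal illustration of the naming convention (wild-type residue, 1-based position, mutant residue).

```python
import re


def apply_mutation(sequence: str, code: str) -> str:
    """Apply a code such as 'K3R' to a sequence (hypothetical helper)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", code)
    if match is None:
        raise ValueError(f"Not a single-substitution code: {code}")

    wt_aa, pos, mut_aa = match.group(1), int(match.group(2)), match.group(3)

    # Positions are 1-based, following the biological convention
    if not 1 <= pos <= len(sequence):
        raise ValueError(f"Position {pos} is outside the sequence")

    # The code must agree with the residue actually present in the sequence
    if sequence[pos - 1] != wt_aa:
        raise ValueError(
            f"Expected {wt_aa} at position {pos}, found {sequence[pos - 1]}"
        )

    return sequence[:pos - 1] + mut_aa + sequence[pos:]


toy = "MEKP"                      # toy sequence, not TP53
mutant = apply_mutation(toy, "K3R")
print(mutant)                     # MERP

expected_count = 393 * 19         # positions x alternative amino acids
print(expected_count)             # 7467
```

The wild-type check is the useful part: a code whose first letter does not match the reference sequence usually signals an off-by-one position or a different reference isoform.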


### Making the input ready
To compute embeddings, the mutated sequences must first be converted into the input format expected by the ESM batch converter.

The output is a list of tuples, each of the form

 (mutation_name, sequence)



In [5]:
from typing import Dict, List, Tuple


def mutation_dict_to_esm_input(
    mutation_dict: Dict[str, str],
    protein_name: str = "TP53"
) -> List[Tuple[str, str]]:
    """
    Convert a mutation dictionary into ESM batch input format.

    Parameters
    ----------
    mutation_dict : dict
        Key   = mutation code (e.g. 'R175H')
        Value = mutated protein sequence

    protein_name : str
        Prefix for sequence names (e.g. 'TP53')

    Returns
    -------
    esm_sequences : list of tuples
        Format: [(name, sequence), ...]
        Example: [('TP53_R175H', 'MEEPQSDPSV...')]
    """

    esm_sequences = []

    for mutation_code, sequence in mutation_dict.items():
        # Create a unique sequence identifier for ESM
        seq_name = f"{protein_name}_{mutation_code}"

        # Append in ESM-required format
        esm_sequences.append((seq_name, sequence))

    return esm_sequences


In [6]:
mutations = generate_single_mutations(TP53)

esm_input = mutation_dict_to_esm_input(
    mutations,
    protein_name="TP53"
)

print(esm_input[0])
# ('TP53_M1A', 'AEEPQSDPSV...')


('TP53_M1A', 'AEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD')


In [7]:
wt_input = [("TP53_WT", TP53)]


## Set Up model and batch converter

In [8]:
!pip install fair-esm

Collecting fair-esm
  Downloading fair_esm-2.0.0-py3-none-any.whl.metadata (37 kB)
Downloading fair_esm-2.0.0-py3-none-any.whl (93 kB)
Installing collected packages: fair-esm
Successfully installed fair-esm-2.0.0


In [9]:
import torch

In [10]:
device="cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [11]:
import torch

def extract_esm_embeddings_batch(
    model,
    batch_converter,
    sequences,
    repr_layer=5,
    batch_size=4,
    return_per_residue=True,
    device='cuda'
):
    """
    sequences: list of (name, sequence) tuples
    returns:
        global_embeddings: torch.Tensor (N, D)
        residue_embeddings: list of torch.Tensor (L_i, D) if return_per_residue
    """

    #device = next(model.parameters()).device
    global_embeddings = []
    residue_embeddings = []

    for i in range(0, len(sequences), batch_size):
        batch = sequences[i:i + batch_size]

        labels, strs, tokens = batch_converter(batch)
        tokens = tokens.to(device)

        with torch.no_grad():
            outputs = model(tokens, repr_layers=[repr_layer])
            reps = outputs["representations"][repr_layer]  # (B, L, D)

            # mask padding tokens
            mask = tokens != model.padding_idx  # (B, L)

            # ----- GLOBAL MEAN POOLING -----
            mask_expanded = mask.unsqueeze(-1)  # (B, L, 1)
            pooled = (reps * mask_expanded).sum(dim=1) / mask_expanded.sum(dim=1)
            global_embeddings.append(pooled.cpu())

            # ----- PER-RESIDUE EMBEDDINGS -----
            if return_per_residue:
                for b in range(reps.size(0)):
                    seq_len = mask[b].sum().item()-2
                    residue_embeddings.append(
                        reps[b, 1:seq_len+1].cpu()
                    )
                    # NOTE: skip CLS token at index 0

    global_embeddings = torch.cat(global_embeddings, dim=0)

    if return_per_residue:
        return global_embeddings, residue_embeddings
    else:
        return global_embeddings


In [12]:
import torch
import esm

model,alphabet=esm.pretrained.esm2_t6_8M_UR50D()
batch_converter=alphabet.get_batch_converter()
model.eval()

Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t6_8M_UR50D.pt" to /root/.cache/torch/hub/checkpoints/esm2_t6_8M_UR50D.pt
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/regression/esm2_t6_8M_UR50D-contact-regression.pt" to /root/.cache/torch/hub/checkpoints/esm2_t6_8M_UR50D-contact-regression.pt


ESM2(
  (embed_tokens): Embedding(33, 320, padding_idx=1)
  (layers): ModuleList(
    (0-5): 6 x TransformerLayer(
      (self_attn): MultiheadAttention(
        (k_proj): Linear(in_features=320, out_features=320, bias=True)
        (v_proj): Linear(in_features=320, out_features=320, bias=True)
        (q_proj): Linear(in_features=320, out_features=320, bias=True)
        (out_proj): Linear(in_features=320, out_features=320, bias=True)
        (rot_emb): RotaryEmbedding()
      )
      (self_attn_layer_norm): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
      (fc1): Linear(in_features=320, out_features=1280, bias=True)
      (fc2): Linear(in_features=1280, out_features=320, bias=True)
      (final_layer_norm): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
    )
  )
  (contact_head): ContactPredictionHead(
    (regression): Linear(in_features=120, out_features=1, bias=True)
    (activation): Sigmoid()
  )
  (emb_layer_norm_after): LayerNorm((320,), eps=1e-05, elementwis

### Computing global and local embedding for wild-type TP53
Here the input is the wild-type TP53 sequence, and the output is its global embedding and its local (per-residue) embedding.

In [13]:
model.to(device)


In [14]:
wt_global_embedding ,wt_local_embedding= extract_esm_embeddings_batch(
    model=model,
    batch_converter=batch_converter,
    sequences=wt_input,
    batch_size=1,
    return_per_residue=True,
    device=device
)  # wt_global_embedding: (1, 320); wt_local_embedding: list holding one (393, 320) tensor


### Compute global and local embedding for each mutant sequence

Here the input is every possible missense mutant sequence, and the output is the global and local embeddings of each one.

In [15]:
mutant_global_embeddings,mutant_local_embeddings = extract_esm_embeddings_batch(
    model=model,
    batch_converter=batch_converter,
    sequences=esm_input,
    batch_size=4,           # adjust based on GPU memory
    return_per_residue=True,
    device=device
)


In [16]:
print(len(mutant_local_embeddings))
print(mutant_global_embeddings.shape)

7467
torch.Size([7467, 320])


In [17]:
len(TP53)

393

### Computing per-residue distance

For every mutant sequence, we compute the distance between the mutant embedding and the wild-type TP53 embedding at each residue position.

For example, the mutation R175H means that the R at position 175 is replaced with H. The local embedding of this mutant sequence is a 393×320 tensor.

We take the Euclidean distance between this tensor and the wild-type TP53 tensor (also 393×320) at each position, which yields a distance vector of length 393.

Repeating this for all 7467 possible mutations gives a final result of shape 7467×393.
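
In symbols (notation introduced here for illustration only), the distance for mutant $m$ at position $i$ is $d_{m,i} = \lVert e^{(m)}_i - e^{\mathrm{WT}}_i \rVert_2$, where $e_i$ is the 320-dimensional embedding of residue $i$. The toy sketch below shows the same computation with NumPy broadcasting on small made-up arrays (2 mutants, 3 residues, 4 embedding dimensions); the values are not real embeddings.

```python
import numpy as np

# Toy stand-ins for the real tensors: (mutants, residues, embedding_dim)
wt = np.zeros((1, 3, 4))          # a single wild-type "embedding"
mutants = np.zeros((2, 3, 4))     # two mutant "embeddings"

# Mutant 0 differs from wild type only at residue 2 (index 1)
mutants[0, 1, :] = [3.0, 4.0, 0.0, 0.0]
# Mutant 1 differs from wild type only at residue 1 (index 0)
mutants[1, 0, :] = [1.0, 0.0, 0.0, 0.0]

# Broadcasting subtracts the wild type from every mutant;
# the norm over the last axis collapses the embedding dimension.
delta = np.linalg.norm(mutants - wt, axis=2)

print(delta.shape)  # (2, 3)
print(delta)
# [[0. 5. 0.]
#  [1. 0. 0.]]
```

Each row is one mutant and each column is one residue position, which is the same layout as the 7467×393 matrix described above.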

In [18]:
import torch
import numpy as np
import matplotlib.pyplot as plt


In [19]:
# Shape: (num_mutations, seq_len, embedding_dim)
mutant_tensor = torch.stack(mutant_local_embeddings)
# (7467, 393, 320)


In [20]:
mutant_tensor.shape

torch.Size([7467, 393, 320])

In [21]:
wt_tensor=wt_local_embedding[0].unsqueeze(0)

In [22]:
wt_tensor.shape

torch.Size([1, 393, 320])

In [23]:
# Δembedding per residue per mutation
# Shape: (7467, 393)
delta_distances = torch.norm(
    mutant_tensor - wt_tensor,
    dim=2  # embedding dimension
)


In [24]:
delta_distances.shape

torch.Size([7467, 393])

In [25]:
import torch

# Save
torch.save(delta_distances, "delta_distances.pt")




In [26]:
# Later, to load:
delta_distances = torch.load("delta_distances.pt")
print(delta_distances.shape)  # should be [7467, 393]

torch.Size([7467, 393])


In [27]:
delta_distances_np = delta_distances.cpu().numpy()
# Shape: (7467, 393)

## Plotting median for each position

`delta_distances` is now a 7467×393 array, where 393 is the number of positions in TP53.
For each position we have 7467 values, one per possible mutation, each giving the embedding change from wild-type TP53 at that position. To keep the visualization simple, we take the median over mutations.
The median indicates the typical deviation of the embedding from the wild type at each position.
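
One reason the median is a reasonable summary here: any given position is itself mutated in only 19 of the 7467 rows, so the large distance at the mutated site barely moves the median, whereas it would pull the mean upward. The toy sketch below, with made-up numbers, shows the difference between the two summaries along `axis=0` (over mutations).

```python
import numpy as np

# Toy matrix: 3 mutations (rows) x 2 positions (columns), made-up values
toy_delta = np.array([
    [1.0, 2.0],
    [1.2, 2.1],
    [9.0, 2.2],   # this mutation sits at position 1, hence the large value
])

# axis=0 aggregates over mutations, leaving one value per position
toy_median = np.median(toy_delta, axis=0)
toy_mean = toy_delta.mean(axis=0)

print(toy_median)  # [1.2 2.1]
print(toy_mean)    # the first entry is pulled up by the 9.0 outlier
```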

In [28]:
import plotly.graph_objects as go
import numpy as np

# delta_distances_np has shape (num_mutations, 393)
num_positions = delta_distances_np.shape[1]

# Compute median for each position
median_per_position = np.median(delta_distances_np, axis=0)  # shape: (393,)

# Create a line plot
fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=np.arange(1, num_positions + 1),  # Residue positions (1-indexed)
        y=median_per_position,
        mode='lines+markers',
        line=dict(color='blue'),
        marker=dict(size=4),
        name='Median ΔEmbedding'
    )
)

# Update layout
fig.update_layout(
    title="Median Per-Residue ESM Embedding Perturbation Across TP53 Mutations",
    xaxis_title="Residue Position",
    yaxis_title="Median ΔEmbedding Norm (Euclidean)",
    width=1200,
    height=500,
    showlegend=True
)

fig.show()


### Getting Data From Real tumor patients

In [29]:
import pandas as pd


In [30]:
tumor_variant=pd.read_csv("TumorVariantDownload_r21.csv")

In [31]:
tumor_variant['StructureFunctionClass'].value_counts()

| StructureFunctionClass | count |
|---|---|
| non-functional | 18393 |
| functional | 2359 |


For each patient record, the `StructureFunctionClass` column states whether the observed mutation leaves the protein functional or makes it non-functional.

In [32]:
tumor_variant.columns

Index(['Mutation_ID', 'MUT_ID', 'hg18_Chr17_coordinates',
       'hg19_Chr17_coordinates', 'hg38_Chr17_coordinates', 'ExonIntron',
       'Codon_number', 'Description', 'c_description', 'g_description',
       'g_description_GRCh38', 'WT_nucleotide', 'Mutant_nucleotide',
       'Splice_site', 'CpG_site', 'Context_coding_3', 'Type', 'Mut_rate',
       'WT_codon', 'Mutant_codon', 'WT_AA', 'Mutant_AA', 'ProtDescription',
       'COSMIClink', 'CLINVARlink', 'TCGA_ICGC_GENIE_count', 'Mut_rateAA',
       'Effect', 'AGVGDClass', 'SIFTClass', 'Polyphen2', 'REVEL', 'BayesDel',
       'TransactivationClass', 'DNE_LOFclass', 'DNEclass',
       'StructureFunctionClass', 'Hotspot', 'Structural_motif', 'Sample_Name',
       'Sample_ID', 'Sample_source', 'Tumor_origin', 'Topography',
       'Short_topo', 'Topo_code', 'Sub_topography', 'Morphology',
       'Morpho_code', 'Grade', 'Stage', 'TNM', 'p53_IHC', 'KRAS_status',
       'Other_mutations', 'Other_associations', 'Add_Info', 'Individual_ID',
    

In [33]:
tumor_variant['ProtDescription'].unique()

array(['p.V143A', 'p.R175H', 'p.K132Q', ..., 'p.D281P', 'p.Q375K',
       'p.L114F'], dtype=object)

The `ProtDescription` column stores mutation names in the form 'p.V143A'. To strip the 'p.' prefix and keep only 'V143A', which matches our mutation codes, we define the `clean_tumor_df` function.

In [34]:
def clean_tumor_df(df_tumor):
    """
    Clean the tumor dataset so that ProtDescription matches
    our mutation codes (e.g. 'p.R175H' -> 'R175H').
    """
    df = df_tumor.copy()

    # Remove whitespace
    df['ProtDescription'] = df['ProtDescription'].str.strip()

    # Remove 'p.' prefix if present
    df['ProtDescription'] = df['ProtDescription'].apply(lambda x: x.split('.')[-1] if isinstance(x, str) else x)

    return df


In [35]:
df_tumor_clean = clean_tumor_df(tumor_variant)


In [36]:
df_functional=df_tumor_clean[df_tumor_clean['StructureFunctionClass']=='functional']

In [37]:
df_functional['ProtDescription'].value_counts()

| ProtDescription | count |
|---|---|
| G266R | 79 |
| M246V | 61 |
| A138V | 56 |
| N239D | 52 |
| M246I | 48 |
| ... | ... |
| S227P | 1 |
| V97I | 1 |
| T170R | 1 |
| D148H | 1 |


The same mutation, such as G266R, occurs multiple times. Every record of a given mutation carries the same label (functional or non-functional), because the label is not patient-specific.
Rather, it is an experimental estimate for the mutation itself.
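
The claim that a mutation never carries two different labels can be checked directly. The sketch below is one way to do it, shown on a handful of illustrative toy records rather than the real dataset; the record values and variable names are assumptions made for the example.

```python
from collections import defaultdict

# Toy (mutation, label) records; illustrative values only
records = [
    ("G266R", "functional"),
    ("G266R", "functional"),
    ("R175H", "non-functional"),
    ("R175H", "non-functional"),
]

# Collect the set of labels seen for each mutation
labels_by_mutation = defaultdict(set)
for mutation, label in records:
    labels_by_mutation[mutation].add(label)

# A mutation with more than one label would contradict the claim
conflicting = sorted(
    mutation
    for mutation, labels in labels_by_mutation.items()
    if len(labels) > 1
)
print(conflicting)  # []
```

On the cleaned DataFrame, an equivalent check would likely be `df_tumor_clean.groupby('ProtDescription')['StructureFunctionClass'].nunique().max()`, which should equal 1 if the labels are consistent.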

In [38]:
df_non_functional=df_tumor_clean[df_tumor_clean['StructureFunctionClass']=='non-functional']

In [39]:
functional_list=df_functional['ProtDescription'].tolist()
non_functional_list=df_non_functional['ProtDescription'].tolist()

Because the same mutation occurs multiple times, we keep only the unique mutations by converting each list to a set.

In [40]:
functional_set=set(functional_list)
non_functional_set=set(non_functional_list)

In [41]:
import re


def get_driver_missense_mutations(driver_list_set):
    # Keep only single amino-acid substitutions (e.g. 'R175H')
    driver_missense_mutations = [
        m for m in driver_list_set if re.match(r'^[A-Z]\d+[A-Z]$', m)
    ]

    return driver_missense_mutations


In [42]:
functional_missense=get_driver_missense_mutations(functional_set)
non_functional_missense=get_driver_missense_mutations(non_functional_set)

In [43]:
len(functional_missense)

419

In [44]:
len(non_functional_missense)

554

In [45]:
mutation_names=list(mutations.keys())

import pandas as pd

# Create residue column names: Res_1, Res_2, ..., Res_393
residue_columns = [f"Res_{i+1}" for i in range(delta_distances_np.shape[1])]


delta_df = pd.DataFrame(
    delta_distances_np,
    columns=residue_columns
)

# Insert mutation column at the front
delta_df.insert(0, "mutation", mutation_names)


In [46]:
delta_df.head()

Unnamed: 0,mutation,Res_1,Res_2,Res_3,Res_4,Res_5,Res_6,Res_7,Res_8,Res_9,...,Res_384,Res_385,Res_386,Res_387,Res_388,Res_389,Res_390,Res_391,Res_392,Res_393
0,M1A,19.450222,11.176102,6.857653,5.498821,3.953537,3.799923,3.247987,2.877016,2.652528,...,1.569873,1.47397,1.569951,1.595149,1.786188,2.020312,2.503189,2.513647,3.095676,3.81792
1,M1C,21.16699,12.245365,8.160588,6.858315,5.766734,4.711988,4.063015,3.580425,3.36053,...,1.63733,1.566124,1.621945,1.698105,1.814542,2.042419,2.474639,2.521356,3.169095,3.798867
2,M1D,20.045994,14.770971,10.192583,7.930643,6.117188,5.103247,4.600428,3.578305,3.497922,...,1.534982,1.446283,1.577546,1.57191,1.74247,1.965291,2.449162,2.495674,3.055547,3.840234
3,M1E,20.297533,14.201246,10.480467,7.959015,6.012251,5.220583,4.405277,3.817424,3.461552,...,1.632452,1.532336,1.630919,1.659174,1.869325,2.023959,2.522747,2.588508,3.155215,4.021658
4,M1F,18.297354,12.221185,7.919462,6.784067,5.840298,4.848624,4.389029,3.543854,3.518028,...,1.470868,1.39045,1.532498,1.521367,1.67132,2.006783,2.420929,2.425156,2.982587,3.609127


`delta_functional` and `delta_non_functional` are the subsets of the full 7467×393 matrix (which covers every possible mutation) that belong to the functional and non-functional classes, respectively.

In [47]:
delta_functional=delta_df[delta_df['mutation'].isin(functional_missense)]
delta_non_functional=delta_df[delta_df['mutation'].isin(non_functional_missense)]

In [48]:
delta_functional.shape

(419, 394)

In [49]:
delta_non_functional.shape

(554, 394)

The functional DataFrame has 419 rows and the non-functional DataFrame has 554, for a total of 419 + 554 = 973 labeled missense mutations.

The total number of possible mutations is 7467. The gap arises because the labels come from the cancer dataset, and not every possible mutation occurs in cancer.
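
To put these numbers in proportion, the short calculation below works out what fraction of all possible substitutions carries a label. It only uses the counts reported in this notebook; the variable names are chosen for the example.

```python
# Counts reported in this notebook
n_functional = 419
n_non_functional = 554
n_possible = 393 * 19  # positions x alternative amino acids

n_labeled = n_functional + n_non_functional
coverage = n_labeled / n_possible

print(n_labeled)          # 973
print(n_possible)         # 7467
print(f"{coverage:.1%}")  # 13.0%
```

Roughly one in eight possible substitutions is labeled, so the functional and non-functional medians are computed from far fewer rows than the all-mutation baseline.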

## 4 group separation
Here we take the median per position for 4 different groups:
1. Every possible mutation. This group serves as a baseline of random mutations for comparison.
2. Mutations labeled as functional in the tumor variant dataset. These mutations may have some effect on the protein but do not stop it from doing its normal job.

3. Mutations labeled as non-functional. These impair the normal job of TP53, such as growth suppression.

4. Every possible mutation in the DNA-binding domain of TP53. TP53 is mainly a transcription factor, meaning it regulates the expression of other genes by binding to DNA. Its DNA-binding domain spans residues 102 to 292.
We include this group because the labeled functional and non-functional mutations are concentrated in the DNA-binding domain.
If the cancer data looked exactly like random DNA-binding domain mutations, our data would carry no signal specific to cancer.
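
The size of group 4 follows from the domain boundaries: residues 102 to 292 inclusive cover 191 positions, and 191 × 19 = 3629 substitutions. The sketch below shows this arithmetic together with a position filter. The helper names `mutation_position` and `in_dbd` are hypothetical and introduced only for this example; the mutation codes are ones that appear elsewhere in this notebook.

```python
DBD_START, DBD_END = 102, 292  # domain boundaries used in this notebook


def mutation_position(code: str) -> int:
    """Return the 1-based position from a code such as 'R175H' (hypothetical helper)."""
    return int(code[1:-1])


def in_dbd(code: str) -> bool:
    """True if the mutated position lies inside the DNA-binding domain."""
    return DBD_START <= mutation_position(code) <= DBD_END


examples = ["V97I", "R175H", "G266R", "Q375K"]
flags = {code: in_dbd(code) for code in examples}
print(flags)
# {'V97I': False, 'R175H': True, 'G266R': True, 'Q375K': False}

n_dbd_positions = DBD_END - DBD_START + 1
print(n_dbd_positions, n_dbd_positions * 19)  # 191 3629
```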


In [52]:
median_per_position = np.median(delta_distances_np, axis=0)  # shape: (393,)

In [53]:
functional_median_per_position = np.median(delta_functional.iloc[:, 1:].values, axis=0)
non_functional_median_per_position = np.median(delta_non_functional.iloc[:, 1:].values, axis=0)

In [54]:
# Keep mutations whose position lies in the DNA-binding domain (residues 102-292)
dbd_mutations=[mutation for mutation in mutation_names if 101<int(mutation[1:-1])<293]

In [56]:
len(dbd_mutations)

3629

In [57]:
all_dbd_df=delta_df[delta_df['mutation'].isin(dbd_mutations)]
all_dbd_df.shape

(3629, 394)

In [58]:
dbd_median_per_position=np.median(all_dbd_df.iloc[:,1:].values,axis=0)

## All groups plotted
All 4 groups are plotted together to see where they differ. The main goal is to check for local differences between functional and non-functional mutations.

In [64]:
import plotly.graph_objects as go
import numpy as np

num_positions = len(median_per_position)
x_positions = np.arange(1, num_positions + 1)  # 1–393

fig = go.Figure()

# ----------------------------
# All mutations (baseline)
# ----------------------------
fig.add_trace(
    go.Scatter(
        x=x_positions,
        y=median_per_position,
        mode='lines',
        line=dict(color='blue', width=2),
        name='All mutations (median)'
    )
)


# All possible DBD mutations
fig.add_trace(
    go.Scatter(
        x=x_positions,
        y=dbd_median_per_position,
        mode='lines',
        line=dict(color='green', width=2),
        name='DBD Mutation'
    )
)



#  Functional Mutations
fig.add_trace(
    go.Scatter(
        x=x_positions,
        y=functional_median_per_position,
        mode='lines',
        line=dict(color='purple', width=2),
        name='Functional Mutations'
    )
)

# Non functional Mutations

fig.add_trace(
    go.Scatter(
        x=x_positions,
        y=non_functional_median_per_position,
        mode='lines',
        line=dict(color='red', width=2),
        name='Non Functional Mutations'
    )
)
# ----------------------------
# Layout
# ----------------------------
fig.update_layout(
    title="Per-Residue Median ESM Embedding Perturbation in TP53",
    xaxis_title="Residue position",
    yaxis_title="Median ΔEmbedding norm (Euclidean)",
    width=1200,
    height=500,
    legend_title_text="Mutation set",
    template="simple_white"
)

fig.show()


### Observation
1. Across all 4 groups, the largest embedding changes occur in the DNA-binding domain.

2. The DNA-binding domain mutation group shows a large embedding change within the DNA-binding domain. This suggests that the domain behaves like a constrained block: a change at any position there significantly changes the embeddings of the other positions in the domain.

3. The functional mutation group shows a lower embedding change at every position in the DNA-binding domain, which supports the idea that these mutations have little effect on the structure.

4. The non-functional mutation group mostly resembles the random DNA-binding domain mutation group. The only difference appears in the local region spanning roughly residues 230–274. This suggests that mutations causing structural changes in this local region are likely responsible for making the protein non-functional in cancer. Identifying the structural elements in this region should therefore point to the structural changes most likely associated with cancer.
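
The region in observation 4 was read off the plot by eye. One possible way to locate such a region numerically is a sliding-window average of the gap between two median profiles. The sketch below is only an illustration on synthetic profiles: the function `largest_gap_window` and the window size are assumptions made for the example, not part of the analysis above.

```python
import numpy as np


def largest_gap_window(profile_a, profile_b, window=5):
    """Return the 1-based (start, end) of the window with the largest mean gap.

    Hypothetical helper: the gap is profile_a - profile_b, averaged over
    `window` consecutive positions.
    """
    gap = np.asarray(profile_a, dtype=float) - np.asarray(profile_b, dtype=float)
    kernel = np.ones(window) / window
    # 'valid' keeps only windows that fit entirely inside the profile
    smoothed = np.convolve(gap, kernel, mode="valid")
    start = int(np.argmax(smoothed))  # 0-based index of the best window
    return start + 1, start + window


# Synthetic profiles over 20 positions, identical except at positions 10-14
profile_functional = np.ones(20)
profile_non_functional = np.ones(20)
profile_non_functional[9:14] += 2.0

best = largest_gap_window(profile_non_functional, profile_functional, window=5)
print(best)  # (10, 14)
```

Applied to the notebook's arrays, the call would take `non_functional_median_per_position` and either `functional_median_per_position` or `dbd_median_per_position`. The result depends on the chosen window size, so it should be read as a rough localization rather than a precise boundary.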


In [66]:
fig.add_vrect(
    x0=102, x1=292,
    fillcolor="gray",
    opacity=0.15,
    layer="below",
    line_width=0,
    annotation_text="DNA-binding domain",
    annotation_position="top left"
)
