# DMS training data creation

Here I compile the CSV dataset used for training with the different methods, starting from deep mutational scanning (DMS) data.

TLDR: Just run the entire notebook to recreate the csv dataset

## DMS studies

Many of the proteins studied are single domains, and I think it is useful to use single domains in the input as well.
Each training protein is identified by its UniProt identifier and the range of positions considered.
I selected the positions to include manually, after reading the original publications.
I report here only the proteins for which I do not use the entire uniprot sequence.

### Beta lactamase - P62593

They used the full protein and selected for survival on antibiotic. The mutated positions extend one residue past the end of the UniProt sequence (the second-to-last mutated position corresponds to the last UniProt position).

- Region: full
- Paper: Firnberg2014
- DMS_id: beta-lactamase

### WW domain (Yap65) - P46937

They used the first WW domain of human Yap65. The positions represented in the DMS experiment are 170-203, while the Interpro WW domain spans positions 171-204. To be sure to include all the relevant positions, I select the Interpro superfamily, which covers 167-207. They selected for binding to the peptide GTPPPPYTVG using phage display. In general, the WW domain binds proline-rich peptides.

- Region: 167-207
- Paper: Fowler2010
- DMS_id: WW_domain

### PSD95 (pdz3 domain) - P31016

They only mutated the third PDZ domain (pdz3), positions 311-393. The Interpro domain is 313-394 and the Interpro superfamily is 302-426.
I use the superfamily to be permissive.
The domain binds the peptide TKNYKQTSV-COOH, from the cysteine-rich interactor of PDZ (CRIPT). They selected for binding to this peptide, measuring ligand binding via GFP and FACS.

- Region: 302-426
- Paper: McLaughlin2012
- DMS_id: PSD95pdz3

### Aminoglycoside kinase - P00552

They used the full protein (I checked the positions, from the first to the last). This protein confers resistance to aminoglycoside antibiotics. It was isolated in K. pneumoniae and has been transferred to many bacteria. It is encoded by the transposon Tn5. They repeated the experiment with different selection strategies (antibiotic type and concentration).
I suppose that 1:2 in the gray2018 dms_id name refers to the concentration of kanamycin used (1/2 of the WT MIC).

- Region: full
- Paper: Melnikov2014
- DMS_id: kka2_1:2

### Ubiquitin yeast - P0CG63

- Region: 1-76
- Paper: Roscoe2013, Roscoe2014
- DMS_id: Ubiquitin, E1_Ubiquitin

Ubiquitin is encoded as a poly-ubiquitin gene, but all the mutations in the paper refer to a single ubiquitin repeat. I use a single repeat as input sequence. In Ubiquitin they did a growth assay with a plasmid in a yeast strain with a conditional ubiquitin knockout. In E1_Ubiquitin they used a yeast display approach and detected with FACS.

### Hsp90 yeast - P02829

Only the ATPase domain is considered in the study. They did a growth assay.

- Region: 2-231
- Paper: Mishra2016
- DMS_id: hsp90

The trimmed sequence generates better alignments for ev couplings since the full sequence tends to recruit sequences that are homologs only for a different, unrelated domain.

### IgG binding protein Streptococcus - P06654

Only the binding domain 1 (GB1) is considered. They selected according to the binding to the IgG Fc portion. They used an mRNA display approach.

- Region: 226-283
- Paper: Olson2014
- DMS_id: gb1

### Pab1 (Poly-A binding protein) - P04147

They mutated the RRM (RNA Recognition Motif) domain, which binds the poly-A tail of mRNA. This specific RRM domain of the protein (domain 2 of 4) also binds to eIF4G at the cap of the mRNA. The cited paper (Melamed2015) is not actually the primary study, but a re-analysis of the data done by the same group (Melamed2013). They did a growth assay. Mutations are at positions 126-200, and the Interpro RRM2 domain is at positions 126-203. I use the Interpro domain.

- Region: 126-203
- Paper: Melamed2013
- DMS_id: Pab1

## Trimming to the mutated domain

Trimming seems to be useful for some proteins where the untrimmed MSA is very wrong (hsp90, ubiquitin), and indifferent or deleterious in other cases. For trrosetta, trimming seems to always improve the contact maps. I will therefore always trim the inputs for trrosetta, and trim only hsp90 and ubiquitin for the other features.
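
To make the region convention concrete, here is a minimal sketch of how a sequence could be trimmed to a 1-indexed range with both extremes included. The helper name `trim_sequence` is hypothetical and not part of this notebook. It assumes, as in the dataset table used further down, that a last position of `-1` stands for the end of the sequence.

```python
def trim_sequence(sequence, first, last):
    """Return the residues from `first` to `last` (1-indexed, both included).

    Assumption: last == -1 stands for the end of the sequence.
    """
    if first < 1:
        raise ValueError("first must be >= 1 (positions are 1-indexed)")
    end = len(sequence) if last == -1 else last
    if end > len(sequence) or first > end:
        raise ValueError("requested region falls outside the sequence")
    # Python slices are 0-indexed and exclude the end, hence first - 1
    return sequence[first - 1 : end]


toy_sequence = "MQIFVKTLTG"  # toy 10-residue string used only for illustration
print(trim_sequence(toy_sequence, 2, 5))   # residues 2 to 5
print(trim_sequence(toy_sequence, 1, -1))  # the whole sequence
```

Raising on an out-of-range region is a design choice of this sketch: a silent empty slice would be harder to notice downstream.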

## Creation of basenames for the features

All the features are extracted from the sequence, so it makes sense to refer to all the features derived from a given sequence with a unique ID. I call this `feature_basename`. More than one DMS experiment can refer to the same basename if they correspond to mutations in the same wt sequence. I report the mapping of basenames and mutations in the dataset as appropriate. The basenames have the format `<uniprotid>_<start>-<end>`; when the full sequence is used (start 1, end -1) the basename is just the UniProt id. This snippet generates the basenames from start, end and UniProt id and places them in a new column in the dataframe.

In [23]:
import pandas as pd


def recreate_basenames(df_path):
    df = pd.read_csv(df_path)
    df["feature_basename"] = (
        df.uniprot_id
        + "_"
        + df.uniprot_first.astype(str)
        + "-"
        + df.uniprot_last.astype(str)
    )
    df.loc[
        ((df.uniprot_last == -1) & (df.uniprot_first == 1)), ["feature_basename"]
    ] = df[((df.uniprot_last == -1) & (df.uniprot_first == 1))].uniprot_id
    df["feature_basename_trrosetta"] = (
        df.uniprot_id
        + "_"
        + df.mutated_domain_uniprot_first.astype(str)
        + "-"
        + df.mutated_domain_uniprot_last.astype(str)
    )
    df.loc[
        (
            (df.mutated_domain_uniprot_last == -1)
            & (df.mutated_domain_uniprot_first == 1)
        ),
        ["feature_basename_trrosetta"],
    ] = df[
        (
            (df.mutated_domain_uniprot_last == -1)
            & (df.mutated_domain_uniprot_first == 1)
        )
    ].uniprot_id
    df.to_csv(df_path, index=False)


recreate_basenames("~/master_thesis_work/dataset/dms_datasets.csv")
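
As a quick illustration of the naming convention, the same logic can be run on a small in-memory table instead of the CSV file. The rows below are toy values chosen to resemble the regions listed above; the real `dms_datasets.csv` may encode them differently.

```python
import pandas as pd

# toy table: one trimmed region and one full-length entry (first == 1, last == -1)
toy = pd.DataFrame(
    {
        "uniprot_id": ["P46937", "P62593"],
        "uniprot_first": [167, 1],
        "uniprot_last": [207, -1],
    }
)

# region-specific basename: <uniprotid>_<start>-<end>
toy["feature_basename"] = (
    toy.uniprot_id
    + "_"
    + toy.uniprot_first.astype(str)
    + "-"
    + toy.uniprot_last.astype(str)
)

# rows covering the full sequence keep the bare UniProt id
full_length = (toy.uniprot_first == 1) & (toy.uniprot_last == -1)
toy.loc[full_length, "feature_basename"] = toy.loc[full_length, "uniprot_id"]

print(toy.feature_basename.tolist())
```

Storing the boolean mask in a variable (`full_length`) avoids writing the same condition twice, which is the main difference from the cell above.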

## Substitution of the kka2_1:2 dataset with data from the original publication

After some exploratory analysis it was observed that the version of the dataset kka2_1:2 included in gray2018 contained some duplicate entries.
To avoid possible biases, it is replaced with the original data from the publication for that dataset, melnikov2014.
Since that dataset includes many different experiments (different selection strategies), I selected only one of them.
This is `KKA2_S3_Kan12_L1` and I already computed it in tidy format using the notebook `kka2_1:2_inspection.Rmd`.
Here I replace the fitness scores for this dataset in my training set.

In [24]:
import pandas as pd

# extraction of the data that I need from the gray2018 summary
gray2018_df = pd.read_csv("../../dataset/gray2018/dmsTraining_2017-02-20.csv")

# the data from the original publication
melnikov2014_df = pd.read_csv("../../dataset/kka2_1:2_scores_from_melnikov2014.csv")

# add the same fields already in gray2018_df
melnikov2014_df["dms_id"] = "kka2_1:2"
melnikov2014_df["protein"] = "Kka2"
melnikov2014_df["uniprot_id"] = "P00552"

# remove the original kka2_1:2 data and replace with the new one
gray2018_df_kka2_replaced = pd.concat(
    [gray2018_df[gray2018_df.dms_id != "kka2_1:2"], melnikov2014_df]
).reset_index(drop=True)
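
The duplicate entries mentioned above can be spotted with a check along these lines. This is only a sketch on toy rows: the choice of `dms_id`, `position`, `aa1` and `aa2` as the key identifying a mutation is my assumption, and the scores are invented.

```python
import pandas as pd

toy_scores = pd.DataFrame(
    {
        "dms_id": ["kka2_1:2"] * 4,
        "position": [10, 10, 10, 11],
        "aa1": ["A", "A", "A", "G"],
        "aa2": ["L", "L", "F", "L"],
        "reported_fitness": [0.1, 0.3, -0.2, 0.5],
    }
)

# columns that together should identify a single mutation in a single study
mutation_key = ["dms_id", "position", "aa1", "aa2"]

# keep=False marks every member of a duplicated group, not only the repeats
duplicated_rows = toy_scores[toy_scores.duplicated(subset=mutation_key, keep=False)]
print(duplicated_rows)
```

Running the same check on the concatenated dataframe would show whether any duplicated mutation is still present after the replacement.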

## Extraction of functional scores

I extract the functional scores from the Gray2018 dataset and merge them with the dataframe of basenames, so that each mutation is paired with a basename.

In [25]:
import numpy as np
import pandas as pd

# correct wrong entries
gray2018_df_kka2_replaced.loc[
    (gray2018_df_kka2_replaced.dms_id == "Ubiquitin"), "uniprot_id"
] = "P0CG63"
cols_gray2018_to_consider = [
    "protein",
    "dms_id",
    "uniprot_id",
    "position",
    "aa1",
    "aa2",
    "reported_fitness",
]
gray2018_df_kka2_replaced = gray2018_df_kka2_replaced[cols_gray2018_to_consider]

# uniprot sequences ranges to consider (both extremes included) and basenames for each study
dms_datasets_df = pd.read_csv("../../dataset/dms_datasets.csv")

df_raw = pd.merge(gray2018_df_kka2_replaced, dms_datasets_df)

# compute the index of each mutation in the feature vectors
# note that this is the 0-indexed value: position and uniprot_first are both 1-indexed,
# so subtracting uniprot_first removes one more than needed for the 1-indexed position
df_raw["feature_index"] = df_raw.position - df_raw.uniprot_first
# this is the position 1-indexed
df_raw["feature_position"] = df_raw.feature_index + 1

# check that my mapping is correct (I am not removing needed positions)
assert np.all((df_raw.position <= df_raw.uniprot_last) | (df_raw.uniprot_last == -1))
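
A small worked example of the index arithmetic, on toy values: with a region starting at UniProt position 167, position 167 itself becomes index 0 (feature position 1) and position 170 becomes index 3 (feature position 4).

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "position": [167, 170, 207],  # 1-indexed positions in the full sequence
        "uniprot_first": [167, 167, 167],  # 1-indexed start of the region
    }
)

# 0-indexed offset inside the trimmed region
toy["feature_index"] = toy.position - toy.uniprot_first
# 1-indexed position inside the trimmed region
toy["feature_position"] = toy.feature_index + 1

print(toy)
```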

## Removal of problematic entries

In [26]:
from Bio import SeqIO

# I remove positions with unknown mutations
df = df_raw[df_raw.aa2 != "X"]

# I remove positions where the aa1 does not coincide with the one from uniprot
uniprot_seq_df_all_proteins = pd.DataFrame()
for basename in set(df.feature_basename):
    uniprot_seq = np.array(
        SeqIO.read(
            "../../processing/uniprot_sequences/" + basename + ".fasta", "fasta"
        ).seq
    )
    uniprot_seq_df = pd.DataFrame(
        {
            "aa1": uniprot_seq,
            "feature_position": range(1, uniprot_seq.shape[0] + 1),
            "feature_basename": basename,
        }
    )
    uniprot_seq_df_all_proteins = pd.concat(
        [uniprot_seq_df_all_proteins, uniprot_seq_df]
    )

# I want the intersection between the 2 df (where aa1 is the same)
df = pd.merge(df, uniprot_seq_df_all_proteins)

for study in set(df.dms_id):
    print(study, len(df_raw[df_raw.dms_id == study]), len(df[df.dms_id == study]))
print("TOTAL", len(df_raw), len(df))
print("Number of mutations removed:", len(df_raw) - len(df))

hsp90 4417 4231
beta-lactamase 5436 5397
WW_domain 377 373
PSD95pdz3 1577 1577
gb1 1045 1026
E1_Ubiquitin 1198 1142
kka2_1:2 5280 5280
Pab1 1276 1220
Ubiquitin 1403 1267
TOTAL 22009 21513
Number of mutations removed: 496
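
The filtering above relies on the default behaviour of `pd.merge`: an inner join on all the columns the two dataframes share, so a mutation survives only if its `aa1` matches the reference residue at the same position. A toy sketch of the same idea, which also uses `indicator=True` to list the rows that would be dropped (all values invented):

```python
import pandas as pd

mutations = pd.DataFrame(
    {
        "feature_basename": ["toy"] * 3,
        "feature_position": [1, 2, 3],
        "aa1": ["M", "K", "T"],  # the second wild-type residue is wrong on purpose
        "aa2": ["A", "A", "A"],
    }
)
reference = pd.DataFrame(
    {
        "feature_basename": ["toy"] * 3,
        "feature_position": [1, 2, 3],
        "aa1": list("MQT"),
    }
)

# inner join on feature_basename, feature_position and aa1
kept = pd.merge(mutations, reference)

# a left join with an indicator column shows which rows found no match
flagged = pd.merge(mutations, reference, how="left", indicator=True)
dropped = flagged[flagged["_merge"] == "left_only"]

print(kept.feature_position.tolist(), dropped.feature_position.tolist())
```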


## Addition of normalised scores

In [27]:
from sklearn.preprocessing import QuantileTransformer

for study in set(df.dms_id):
    y_curr = df[df.dms_id == study].reported_fitness.to_numpy().reshape(-1, 1)
    scaler = QuantileTransformer(n_quantiles=100)
    scaler.fit(y_curr)
    df.loc[(df.dms_id == study), ["reported_fitness_quantile"]] = scaler.transform(
        y_curr
    ) - scaler.transform([[0]])
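
Subtracting the transformed value of 0 shifts each study so that a raw score of 0 maps to 0 in the normalised column. The sketch below shows the same idea with a simplified stand-in: an empirical CDF built with `np.interp` rather than sklearn's `QuantileTransformer`. The function name `centred_quantile` is hypothetical and the numbers are toy values.

```python
import numpy as np


def centred_quantile(scores, reference=0.0):
    """Map scores to their empirical quantile, shifted so `reference` maps to 0."""
    scores = np.asarray(scores, dtype=float)
    sorted_scores = np.sort(scores)
    # quantile level assigned to each sorted score, from 0 to 1
    levels = np.linspace(0.0, 1.0, len(sorted_scores))

    def to_quantile(values):
        # linear interpolation between the observed scores
        return np.interp(values, sorted_scores, levels)

    return to_quantile(scores) - to_quantile(reference)


toy_scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(centred_quantile(toy_scores))
```

After the shift the values lie in an interval of width 1 whose position depends on where 0 falls in the score distribution of each study.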

## Adding features

The features are loaded from a variety of inputs and always contain the basename in the filename. Here I add them to the dataframe.

In [28]:
import glob
import os

import joblib
import networkx as nx
import numpy as np
import pandas as pd
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    QuantileTransformer,
    RobustScaler,
    StandardScaler,
)

# load the ev mutation output
ev_df_all_proteins = pd.DataFrame()
for basename in set(df.feature_basename):
    ev_df = pd.read_csv(
        "../../processing/ev_couplings/" + basename + "_single_mutant_matrix.csv"
    )
    ev_df_simple = pd.DataFrame(
        {
            "feature_position": ev_df.pos,
            "aa1": ev_df.wt,
            "aa2": ev_df.subs,
            "ev_frequency": ev_df.frequency,
            "ev_conservation": ev_df.column_conservation,
            "ev_independent": ev_df.prediction_independent,
            "ev_epistatic": ev_df.prediction_epistatic,
        }
    )
    ev_df_simple["feature_basename"] = basename
    ev_df_all_proteins = pd.concat([ev_df_all_proteins, ev_df_simple])

df = pd.merge(df, ev_df_all_proteins, how="left")

# load the netsurf output
netsurf_df_all_proteins = pd.DataFrame()
for basename in set(df.feature_basename):
    netsurf_df = pd.read_csv("../../processing/netsurfp2/" + basename + "_netsurf.csv")
    netsurf_df_simple = pd.DataFrame(
        {
            "feature_position": netsurf_df.n,
            "aa1": netsurf_df.seq,
        }
    )
    netsurf_feature_names = [
        "rsa",
        "asa",
        "q3",
        "p[q3_H]",
        "p[q3_E]",
        "p[q3_C]",
        "q8",
        "p[q8_G]",
        "p[q8_H]",
        "p[q8_I]",
        "p[q8_B]",
        "p[q8_E]",
        "p[q8_S]",
        "p[q8_T]",
        "p[q8_C]",
        "phi",
        "psi",
        "disorder",
    ]
    for label in netsurf_feature_names:
        new_label = "netsurf_" + label
        netsurf_df_simple[new_label] = netsurf_df[label]
    netsurf_df_simple["feature_basename"] = basename
    netsurf_df_all_proteins = pd.concat([netsurf_df_all_proteins, netsurf_df_simple])

df = pd.merge(df, netsurf_df_all_proteins, how="left")

# load the hmmer pssm
hmm_pssm_df_all_proteins = pd.DataFrame()
for basename in set(df.feature_basename):
    # load the archive once and read both entries from it
    hmm_pssm_data = joblib.load(
        "../../processing/hmmer/" + basename + ".hmm_pssm.joblib.xz"
    )
    hmm_pssm_vec = hmm_pssm_data["pssm"]
    hmm_pssm_colnames = hmm_pssm_data["colnames"]
    hmm_pssm_df_simple = pd.DataFrame(
        {
            # the position is 1-indexed and the range 0-indexed
            # entries in the profile are in the correct order
            "feature_position": np.array(range(len(hmm_pssm_vec)))
            + 1,
        }
    )
    for i, residue in enumerate(hmm_pssm_colnames):
        hmm_pssm_df_simple["hmm_pssm_" + residue] = hmm_pssm_vec[:, i]
    hmm_pssm_df_simple["feature_basename"] = basename
    hmm_pssm_df_all_proteins = pd.concat([hmm_pssm_df_all_proteins, hmm_pssm_df_simple])

df = pd.merge(df, hmm_pssm_df_all_proteins, how="left")

# compute the likelihood of the mutation from the pssm
df["hmm_pssm_aa1_likelyhood"] = [row["hmm_pssm_" + row.aa1] for _, row in df.iterrows()]
df["hmm_pssm_aa2_likelyhood"] = [row["hmm_pssm_" + row.aa2] for _, row in df.iterrows()]
df["hmm_pssm_delta_likelyhood"] = (
    df.hmm_pssm_aa2_likelyhood - df.hmm_pssm_aa1_likelyhood
)

# add graph metrics from trRosetta
contact_treshold = 8
tr_rosetta_graph_df_all_proteins = pd.DataFrame()
for trrosetta_basename in set(df.feature_basename_trrosetta):
    trrosetta_uniprot_firsts = set(
        df[
            df.feature_basename_trrosetta == trrosetta_basename
        ].mutated_domain_uniprot_first
    )
    assert len(trrosetta_uniprot_firsts) == 1
    trrosetta_uniprot_first = trrosetta_uniprot_firsts.pop()
    tr_rosetta_distances = joblib.load(
        "../../processing/tr_rosetta/{}_trRosetta_distance_mat.joblib.xz".format(
            trrosetta_basename
        )
    )
    G = nx.Graph()
    for i, row in enumerate(tr_rosetta_distances):
        G.add_node(i)
        for j, el in enumerate(row):
            if i >= j and el < contact_treshold:
                G.add_edge(i, j)
    assert G.number_of_nodes() == tr_rosetta_distances.shape[0]
    graph_df = pd.DataFrame()
    graph_df["tr_rosetta_feature_index"] = range(tr_rosetta_distances.shape[0])
    graph_df["tr_rosetta_graph_closeness_centrality"] = graph_df[
        "tr_rosetta_feature_index"
    ].map(nx.closeness_centrality(G))
    graph_df["tr_rosetta_graph_betweenness_centrality"] = graph_df[
        "tr_rosetta_feature_index"
    ].map(nx.betweenness_centrality(G, normalized=True))
    graph_df["tr_rosetta_graph_degree_centrality"] = graph_df[
        "tr_rosetta_feature_index"
    ].map(nx.degree_centrality(G))
    graph_df["tr_rosetta_graph_load_centrality"] = graph_df[
        "tr_rosetta_feature_index"
    ].map(nx.load_centrality(G, normalized=True))
    graph_df["tr_rosetta_graph_harmonic_centrality"] = (
        graph_df["tr_rosetta_feature_index"].map(nx.harmonic_centrality(G)) / 100
    )
    graph_df["tr_rosetta_graph_clustering"] = graph_df["tr_rosetta_feature_index"].map(
        nx.clustering(G)
    )
    graph_df["feature_basename_trrosetta"] = trrosetta_basename
    # this is a position and not an index since trrosetta_uniprot_first is 1-indexed
    # It is position and not feature_position since it refers to the full uniprot seq
    graph_df["position"] = (
        trrosetta_uniprot_first + graph_df["tr_rosetta_feature_index"]
    )
    tr_rosetta_graph_df_all_proteins = pd.concat(
        [tr_rosetta_graph_df_all_proteins, graph_df]
    )

df = pd.merge(df, tr_rosetta_graph_df_all_proteins, how="left")

# add dssp output (not a feature, but for comparison)
dssp_df_all_proteins = pd.DataFrame()
for uniprot_id in set(df.uniprot_id):
    file_glob = glob.glob(
        "../../processing/structures/dssp_mapped/{}_mapped_*_*.uniprot_dssp.joblib.xz".format(
            uniprot_id
        )
    )
    assert len(file_glob) == 1
    filename = file_glob[0]
    dssp_df_raw = joblib.load(filename)
    dssp_df_simple = pd.DataFrame()
    dssp_df_simple["position"] = dssp_df_raw["uniprot_res_pos"]
    dssp_df_simple["uniprot_id"] = uniprot_id
    dssp_df_simple["aa1"] = dssp_df_raw["residue_type"]
    dssp_df_simple["dssp_sec_struct"] = dssp_df_raw["secondary_structure"]
    dssp_df_simple["dssp_rsa"] = dssp_df_raw["relative_solvent_accessibility"]
    dssp_df_simple["dssp_asa"] = dssp_df_raw["solvent_accessibility"]
    dssp_df_simple["dssp_phi"] = dssp_df_raw["phi"]
    dssp_df_simple["dssp_psi"] = dssp_df_raw["psi"]
    if uniprot_id == "P06654":
        # the gb1 structure is not available, but a structure is available for the next IgG binding domain
        dssp_df_simple["position"] = dssp_df_raw["uniprot_res_pos"] - 70
    elif uniprot_id == "P0CG63":
        # the structure is about the third ubiquitin repeat, the mutated positions refer to the first
        dssp_df_simple["position"] = dssp_df_raw["uniprot_res_pos"] - 304
    dssp_df_all_proteins = pd.concat([dssp_df_all_proteins, dssp_df_simple])

df = pd.merge(df, dssp_df_all_proteins, how="left")

# sanitize the feature names for xgb (they may not contain [, ] or <; only the brackets occur here)
df = df.rename(
    {col: col.replace("[", "_").replace("]", "") for col in df.columns}, axis="columns"
)
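
The contact-graph step in the cell above can be illustrated without networkx on a toy distance matrix. This is a sketch under stated assumptions, not the code used for the features: the function name `contact_degree_centrality` is hypothetical, the matrix is assumed to be square, and only the strict lower triangle is read, so unlike the `i >= j` loop above it never visits `i == j` and adds no self-contacts.

```python
import numpy as np


def contact_degree_centrality(distances, threshold=8.0):
    """Degree centrality of the residue contact graph from a distance matrix."""
    distances = np.asarray(distances, dtype=float)
    # read only the strict lower triangle, so the diagonal adds no self-contacts
    lower = np.tril(distances < threshold, k=-1)
    # mirror it to obtain a symmetric adjacency matrix
    adjacency = lower | lower.T
    n_residues = adjacency.shape[0]
    # number of contacts divided by the n - 1 possible partners
    return adjacency.sum(axis=1) / (n_residues - 1)


# toy distances for three residues: 0-1 and 1-2 are in contact, 0-2 is not
toy_distances = np.array(
    [
        [0.0, 5.0, 12.0],
        [5.0, 0.0, 6.0],
        [12.0, 6.0, 0.0],
    ]
)
print(contact_degree_centrality(toy_distances))
```

The middle residue touches both neighbours and gets the maximum value, while the two ends get half of it.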

## Export the data

I save the dataframe with all the mappings, features, and labels in a csv file.

In [29]:
df.to_csv("../../dataset/dms_training.csv", index=False)
df

Unnamed: 0,protein,dms_id,uniprot_id,position,aa1,aa2,reported_fitness,pdb_id,pdb_chain,author_year,...,tr_rosetta_graph_betweenness_centrality,tr_rosetta_graph_degree_centrality,tr_rosetta_graph_load_centrality,tr_rosetta_graph_harmonic_centrality,tr_rosetta_graph_clustering,dssp_sec_struct,dssp_rsa,dssp_asa,dssp_phi,dssp_psi
0,TEM-1,beta-lactamase,P62593,20,P,P,0.581033,1btl,A,Firnberg2014,...,0.116044,0.017544,0.111866,0.42778,0.500000,,,,,
1,TEM-1,beta-lactamase,P62593,20,P,Q,0.441480,1btl,A,Firnberg2014,...,0.116044,0.017544,0.111866,0.42778,0.500000,,,,,
2,TEM-1,beta-lactamase,P62593,20,P,D,0.289750,1btl,A,Firnberg2014,...,0.116044,0.017544,0.111866,0.42778,0.500000,,,,,
3,TEM-1,beta-lactamase,P62593,20,P,K,0.196582,1btl,A,Firnberg2014,...,0.116044,0.017544,0.111866,0.42778,0.500000,,,,,
4,TEM-1,beta-lactamase,P62593,20,P,N,0.053725,1btl,A,Firnberg2014,...,0.116044,0.017544,0.111866,0.42778,0.500000,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21508,Kka2,kka2_1:2,P00552,64,A,L,-3.278095,1nd4,A,Melnikov2014,...,0.023451,0.041825,0.025188,0.69352,0.490909,H,0.072595,8.0,-55.2,-52.1
21509,Kka2,kka2_1:2,P00552,64,A,F,0.450056,1nd4,A,Melnikov2014,...,0.023451,0.041825,0.025188,0.69352,0.490909,H,0.072595,8.0,-55.2,-52.1
21510,Kka2,kka2_1:2,P00552,64,A,I,-2.099644,1nd4,A,Melnikov2014,...,0.023451,0.041825,0.025188,0.69352,0.490909,H,0.072595,8.0,-55.2,-52.1
21511,Kka2,kka2_1:2,P00552,64,A,Q,-0.107029,1nd4,A,Melnikov2014,...,0.023451,0.041825,0.025188,0.69352,0.490909,H,0.072595,8.0,-55.2,-52.1



In [35]:
features = [
        "aa1",
        "aa2",
        "ev_frequency",
        "ev_conservation",
        "ev_independent",
        "ev_epistatic",
        "netsurf_rsa",
        "netsurf_asa",
        "netsurf_p_q3_H",
        "netsurf_p_q3_E",
        "netsurf_p_q3_C",
        "netsurf_p_q8_G",
        "netsurf_p_q8_H",
        "netsurf_p_q8_I",
        "netsurf_p_q8_B",
        "netsurf_p_q8_E",
        "netsurf_p_q8_S",
        "netsurf_p_q8_T",
        "netsurf_p_q8_C",
        "netsurf_phi",
        "netsurf_psi",
        "netsurf_disorder",
        "hmm_pssm_A",
        "hmm_pssm_C",
        "hmm_pssm_D",
        "hmm_pssm_E",
        "hmm_pssm_F",
        "hmm_pssm_G",
        "hmm_pssm_H",
        "hmm_pssm_I",
        "hmm_pssm_K",
        "hmm_pssm_L",
        "hmm_pssm_M",
        "hmm_pssm_N",
        "hmm_pssm_P",
        "hmm_pssm_Q",
        "hmm_pssm_R",
        "hmm_pssm_S",
        "hmm_pssm_T",
        "hmm_pssm_V",
        "hmm_pssm_W",
        "hmm_pssm_Y",
        "hmm_pssm_aa1_likelyhood",
        "hmm_pssm_aa2_likelyhood",
        "hmm_pssm_delta_likelyhood",
        "tr_rosetta_graph_closeness_centrality",
        "tr_rosetta_graph_betweenness_centrality",
        "tr_rosetta_graph_degree_centrality",
        "tr_rosetta_graph_load_centrality",
        "tr_rosetta_graph_harmonic_centrality",
        "tr_rosetta_graph_clustering",
    ]



In [65]:
na_count = df[features].isna()
dms_id = df[['dms_id']]
dms_id.join(na_count).groupby('dms_id').sum()

dms_id,aa1,aa2,ev_frequency,ev_conservation,ev_independent,ev_epistatic,netsurf_rsa,netsurf_asa,netsurf_p_q3_H,netsurf_p_q3_E,...,hmm_pssm_Y,hmm_pssm_aa1_likelyhood,hmm_pssm_aa2_likelyhood,hmm_pssm_delta_likelyhood,tr_rosetta_graph_closeness_centrality,tr_rosetta_graph_betweenness_centrality,tr_rosetta_graph_degree_centrality,tr_rosetta_graph_load_centrality,tr_rosetta_graph_harmonic_centrality,tr_rosetta_graph_clustering
E1_Ubiquitin,0,0,57,57,57,57,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PSD95pdz3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Pab1,0,0,49,49,49,49,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Ubiquitin,0,0,110,110,110,110,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
WW_domain,0,0,10,10,10,10,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
beta-lactamase,0,0,569,569,569,569,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
gb1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
hsp90,0,0,464,464,464,464,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
kka2_1:2,0,0,625,625,625,625,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
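
The per-study count of missing values follows a simple pattern: turn the feature columns into a boolean table with `isna()`, group the rows by study, and sum. A toy sketch with invented values and only two feature columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "dms_id": ["a", "a", "b"],
        "ev_epistatic": [0.1, np.nan, np.nan],
        "netsurf_rsa": [0.2, 0.3, 0.4],
    }
)
toy_features = ["ev_epistatic", "netsurf_rsa"]

# True where a value is missing; summing booleans per group counts the missing entries
missing_per_study = toy[toy_features].isna().groupby(toy["dms_id"]).sum()
print(missing_per_study)
```

Grouping by the `dms_id` series directly gives the same table as joining the column first, with one step fewer.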


In [61]:
df[features].isna().sum()

aa1                                           0
aa2                                           0
ev_frequency                               1884
ev_conservation                            1884
ev_independent                             1884
ev_epistatic                               1884
netsurf_rsa                                   0
netsurf_asa                                   0
netsurf_p_q3_H                                0
netsurf_p_q3_E                                0
netsurf_p_q3_C                                0
netsurf_p_q8_G                                0
netsurf_p_q8_H                                0
netsurf_p_q8_I                                0
netsurf_p_q8_B                                0
netsurf_p_q8_E                                0
netsurf_p_q8_S                                0
netsurf_p_q8_T                                0
netsurf_p_q8_C                                0
netsurf_phi                                   0
netsurf_psi                             