# Screening the Human Proteome for Novel Nuclear Export Signals
*Authors: Daniel Levin, Imri Shuval, Shira Gelbstein and Ron Levin*
*Date: June 19, 2025*

---


### Section 1: Setup and Goal

**Goal:**
The goal of this project is to use a deep learning model to screen the entire human proteome for previously unreported Nuclear Export Signal (NES) motifs.

In [1]:

import pandas as pd  # needed for pd.read_csv in the loading cell below
from plotting_utils import *
from IPython.display import display

---
### Section 2: Data Loading and Preparation

**Data Sources:**
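We load the main screening results, which were generated by running our predictor over ~20,000 human proteins. We also load two control sets: a known positive set (original NESdb peptides) and a known negative set (helical peptides from bacterial proteins in the PDB, as the file name below indicates). Note that the control files loaded here contain the input sequences only; the predictor scores for them still have to be generated by running the pipeline on these sets.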

In [2]:
# TODO: change path
# Load the main results from the pipeline
main_results_df = pd.read_csv('DB/dummy2.csv')

# Load results from the control sets
# (These would be generated by running the pipeline on those specific FASTA files)
positive_controls_df = pd.read_csv('input_sequences/NESdb_NESpositive_sequences.csv')
negative_controls_df = pd.read_csv('input_sequences/PDB_Bacteria__Helical_Peptides_NESnegative_sequences.csv')

# Display the first few rows and basic stats to show the data has loaded correctly.
# describe() is wrapped in display() because a bare expression only renders
# when it is the last statement of a cell.
print("Main Results:")
display(main_results_df.head())
main_results_df.info()
display(main_results_df.describe())

print("\nPositive Controls:")
display(positive_controls_df.head())
positive_controls_df.info()
display(positive_controls_df.describe())

print("\nNegative Controls:")
display(negative_controls_df.head())
negative_controls_df.info()
display(negative_controls_df.describe())


Main Results:


| | uniprotID | logits | predictions | labels |
|---|---|---|---|---|
| 0 | P_001 | [-2.5, 2.1] | 1 | 1 |
| 1 | P_002 | [-1.8, 2.9] | 1 | 1 |
| 2 | P_003 | [-0.5, 0.9] | 1 | 1 |
| 3 | P_004 | [0.1, -0.2] | 0 | 1 |
| 4 | P_005 | [-2.2, 3.1] | 1 | 1 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   uniprotID    20 non-null     object
 1   logits       20 non-null     object
 2   predictions  20 non-null     int64 
 3   labels       20 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 772.0+ bytes

Positive Controls:


| | name | species | uniprotID | start# | NES sequence | full sequence | Unnamed: 6 | positive |
|---|---|---|---|---|---|---|---|---|
| 0 | ADAR1 | Homo sapiens | P55265 | 122 | GVDCLSSHFQELSIYQ | MNPRQGYSLSGYYTHPFQGYEHRQLRYQQPGPGSSPSSFLLKQIEF... | NaN | 1 |
| 1 | Cdc7 | Homo sapiens | O00311 | 457 | DLRKLCERLRGMDS | MEASLGIQMDEPMAFSPQRDRFQAEGSLKKNEQNFKLAGVKKDIEK... | NaN | 1 |
| 2 | Cdc7 | Homo sapiens | O00311 | 538 | VPDEAYDLLDKLLDLNP | MEASLGIQMDEPMAFSPQRDRFQAEGSLKKNEQNFKLAGVKKDIEK... | NaN | 1 |
| 3 | CPEB4 | Homo sapiens | Q17RY0 | 379 | RTFDMHSLESSLIDIM | MGDYGFGVLVQSNTGNKSAFPVRFHPHLQPPHHHQNATPSPAAFIN... | NaN | 1 |
| 4 | CPEB4 | Homo sapiens | Q17RY0 | 382 | DMHSLESSLIDIMR | MGDYGFGVLVQSNTGNKSAFPVRFHPHLQPPHHHQNATPSPAAFIN... | NaN | 1 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 462 entries, 0 to 461
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           462 non-null    object 
 1   species        462 non-null    object 
 2   uniprotID      462 non-null    object 
 3   start#         462 non-null    int64  
 4   NES sequence   462 non-null    object 
 5   full sequence  462 non-null    object 
 6   Unnamed: 6     0 non-null      float64
 7   positive       462 non-null    int64  
dtypes: float64(1), int64(2), object(5)
memory usage: 29.0+ KB

Negative Controls:


| | uniprotID | full sequence | NOT NES | label |
|---|---|---|---|---|
| 0 | P0A6W9 | MIPDVSQALAWLEKHPQALKGIQRGLERETLRVNADGTLATTGHPE... | EVRSLDINPFSPIGVDEQQV | 0 |
| 1 | P0A6W9 | MIPDVSQALAWLEKHPQALKGIQRGLERETLRVNADGTLATTGHPE... | TTDFAEALLEFITPVDGDIE | 0 |
| 2 | P0A6W9 | MIPDVSQALAWLEKHPQALKGIQRGLERETLRVNADGTLATTGHPE... | EKHPQALKGIQRGLERETLR | 0 |
| 3 | P0ABE7 | ADLEDNMETLNDNLKVIEKADNAAQVKDALTKMAAAAADAWSATPP... | AAADAWSATPPKLEDKSPDS | 0 |
| 4 | P0A6W9 | MIPDVSQALAWLEKHPQALKGIQRGLERETLRVNADGTLATTGHPE... | RFKTLYREGLKNRYGALMQT | 0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1856 entries, 0 to 1855
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   uniprotID      1856 non-null   object
 1   full sequence  1856 non-null   object
 2   NOT NES        1846 non-null   object
 3   label          1856 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 58.1+ KB


| | label |
|---|---|
| count | 1856.0 |
| mean | 0.0 |
| std | 0.0 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 0.0 |
| max | 0.0 |


---
### Section 3: Performance Analysis - How Good is Our Model?

First, we'll validate our model's performance by generating an ROC curve using the known positive and negative control sets. This tells us how well our model can distinguish between real NES motifs and other sequences.

In [None]:
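# Extract the scores from the control set dataframes.
# NOTE: this assumes both control DataFrames hold pipeline output with a
# 'positive_probability' column; the raw input files loaded in Section 2 do not have one.
known_pos_scores = positive_controls_df['positive_probability'].tolist()
known_neg_scores = negative_controls_df['positive_probability'].tolist()

# Generate the ROC plot using the plotting_utils API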
plot_roc_curve(known_pos_scores, known_neg_scores, output_path="roc_curve_validation.png")
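
To make the AUC figure easier to interpret, here is a minimal sketch of what the number measures: the probability that a randomly chosen positive control scores higher than a randomly chosen negative control, with ties counted as half. The function `roc_auc` and the toy scores are hypothetical and only illustrate the idea; they are not the implementation behind `plot_roc_curve`.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy scores, invented purely for illustration
toy_pos = [0.9, 0.8, 0.4]
toy_neg = [0.7, 0.3, 0.2, 0.1]
auc = roc_auc(toy_pos, toy_neg)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic, which is acceptable for control sets of this size (462 positives and 1856 negatives); a rank-based computation would be preferable for much larger sets.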

In [None]:
# TODO: The resulting ROC curve shows an Area Under the Curve (AUC) of [e.g., 0.89]. An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, so a value in this range would indicate strong discriminative power on the control sets.

---
### Section 4: Analysis of the Full Proteome Screen

Now that we've validated our model, let's analyze the results from the full human proteome screen. We will compare the distribution of scores from the main screen to our negative control set to see if we found anything interesting.


In [None]:
# Prepare the data for the plots
from plotting_utils import preprocess_pipeline_output
# Process raw outputs to add positive probabilities
processed_main = preprocess_pipeline_output(main_results_df)
processed_neg = preprocess_pipeline_output(negative_controls_df)
# Tag each DataFrame with its source
processed_main['source'] = 'Human Proteome Screen'
processed_neg['source'] = 'Negative Controls'
# Create a dict of processed DataFrames for distribution plotting
data_dict = {
    'Human Proteome Screen': processed_main,
    'Negative Controls': processed_neg
}

# Generate the plots using updated API
plot_score_distribution(data_dict, output_path="proteome_vs_control_dist.png")

# Combine for boxplot and ROC
import pandas as pd
combined_df = pd.concat([processed_main, processed_neg], ignore_index=True)

# Boxplot grouped by source
plot_score_boxplot(combined_df, group_column='source', output_path="proteome_vs_control_boxplot.png")

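# ROC curve for combined data.
# NOTE: Section 3 calls plot_roc_curve with two score lists, while this call passes a
# single DataFrame; confirm which signature plotting_utils actually provides.
plot_roc_curve(combined_df, output_path="proteome_vs_control_roc.png")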
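
The `logits` column in the main results is stored as text (for example `"[-2.5, 2.1]"`), so some step has to turn it into a probability before scores can be plotted. The sketch below shows one plausible way to do that with a softmax; it is not the actual body of `preprocess_pipeline_output`, whose implementation is not shown here. The helper name `positive_probability` is hypothetical, and treating index 1 as the NES class is an assumption, though it agrees with the `predictions` column in the rows displayed above.

```python
import ast
import math


def positive_probability(logits_text, positive_index=1):
    """Parse a logits string such as "[-2.5, 2.1]" and return the softmax
    probability of the class at positive_index (assumed to be the NES class)."""
    logits = [float(x) for x in ast.literal_eval(logits_text)]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    return exps[positive_index] / sum(exps)


# Logit strings copied from the first rows of the main results table
for uniprot_id, logits_text in [("P_001", "[-2.5, 2.1]"), ("P_004", "[0.1, -0.2]")]:
    prob = positive_probability(logits_text)
    print(f"{uniprot_id}: positive_probability = {prob:.3f}")
```

With these inputs, P_001 lands above 0.5 and P_004 below it, matching their `predictions` values of 1 and 0.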

In [None]:
# TODO: The plots above show a separation between the two groups. While the scores for the negative control peptides are tightly clustered around zero, the human proteome scores show a distinct tail in the high-probability region. This suggests our model is identifying a set of candidate peptides that score well above the negative baseline.


---
### Section 5: Identifying Novel High-Confidence Candidates

The final step is to isolate the most promising, high-confidence NES candidates from our screening of over [e.g., 2 million] peptides.

In [None]:
# Define a threshold for a high-confidence hit
confidence_threshold = 0.90

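# Filter the processed dataframe from Section 4. The raw main_results_df has no
# 'positive_probability' column; this assumes preprocess_pipeline_output adds it under that name.
high_confidence_hits = processed_main[processed_main['positive_probability'] >= confidence_threshold]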

# Sort by score and display the Top 15 candidates
top_candidates = high_confidence_hits.sort_values(by='positive_probability', ascending=False)
print(f"Found {len(top_candidates)} high-confidence candidates (score >= {confidence_threshold}).")
display(top_candidates.head(15))
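
One way to sanity-check the 0.90 cutoff is to ask how many negative controls would pass it, which gives a rough empirical false positive rate at that threshold. The sketch below is an illustration under stated assumptions: `fraction_at_or_above` is a hypothetical helper, and the scores are invented toy values rather than results from the pipeline.

```python
def fraction_at_or_above(scores, threshold):
    """Fraction of scores that are >= threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Toy negative-control scores, invented purely for illustration
toy_negative_scores = [0.02, 0.05, 0.10, 0.31, 0.93, 0.04, 0.08, 0.12, 0.01, 0.07]
confidence_threshold = 0.90

empirical_fpr = fraction_at_or_above(toy_negative_scores, confidence_threshold)
print(f"Negative controls at or above {confidence_threshold}: {empirical_fpr:.1%}")
```

Applied to the real negative-control scores, the same fraction would indicate roughly what share of the high-confidence hits could be expected to be spurious.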


In [None]:
# TODO: Our project successfully developed and deployed a prediction pipeline that screened the entire human proteome. We validated our model (AUC = [e.g., 0.89]) and identified **[e.g., 75]** novel, high-confidence NES candidates. The top candidate, found in the protein **[e.g., P53_HUMAN]**, has a score of **[e.g., 0.98]**, making it a prime candidate for future experimental validation.
