# Screening the Human Proteome for Novel Nuclear Export Signals
*Authors: Daniel Levin, Imri Shuval, Shira Gelbstein and Ron Levin*
*Date: June 19, 2025*

---


### Section 1: Setup and Goal

**Goal:**
The goal of this project is to use a deep learning model to screen the entire human proteome for novel, undiscovered Nuclear Export Signal (NES) motifs.

In [None]:

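import pandas as pd
from IPython.display import display

# Project plotting helpers used in the cells below
from plotting_utils import plot_roc_curve, plot_score_distribution, plot_score_boxplot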

---
### Section 2: Data Loading and Preparation

**Data Sources:**
We load the main screening results, generated by running our predictor over ~20,000 human proteins, together with the results of running the same predictor on two control sets: a known positive set (original NesDB peptides) and a known negative set (mitochondrial proteins).

In [None]:
# Load the main results from the pipeline
main_results_df = pd.read_csv('screening_results.csv')

# Load results from the control sets
# (These would be generated by running the pipeline on those specific FASTA files)
positive_controls_df = pd.read_csv('positive_controls_results.csv')
negative_controls_df = pd.read_csv('negative_controls_results.csv')

# Display the first few rows and basic stats to show the data has loaded correctly.
# describe() is wrapped in display() because a notebook cell only auto-renders
# its final expression, so bare describe() calls mid-cell would show nothing.
print("Main Results:")
display(main_results_df.head())
main_results_df.info()
display(main_results_df.describe())

print("\nPositive Controls:")
display(positive_controls_df.head())
positive_controls_df.info()
display(positive_controls_df.describe())

print("\nNegative Controls:")
display(negative_controls_df.head())
negative_controls_df.info()
display(negative_controls_df.describe())
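
Before the scores are used downstream, it can help to confirm that they look like probabilities. The following is a minimal sketch, not part of the original pipeline: `validate_scores` is a hypothetical helper, and it assumes that `positive_probability` is meant to hold values in [0, 1]. The demo values are made up.

```python
import math


def validate_scores(scores, name="scores"):
    """Return the scores as a list of floats, raising ValueError on bad input.

    Hypothetical helper: assumes every score is a probability in [0, 1].
    """
    cleaned = []
    for i, s in enumerate(scores):
        value = float(s)
        # NaN fails every comparison, so it is checked explicitly for clarity.
        if math.isnan(value) or not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}[{i}] = {value!r} is not a probability in [0, 1]")
        cleaned.append(value)
    if not cleaned:
        raise ValueError(f"{name} is empty")
    return cleaned


# Made-up demo values. In the notebook this could be called as, for example,
# validate_scores(main_results_df['positive_probability'], "main results").
print(validate_scores([0.1, 0.95, 0.5], "demo"))
```

Running the same check on all three dataframes would catch a missing or mis-scaled column before it silently distorts the ROC curve or the threshold filter.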


---
### Section 3: Performance Analysis - How Good is Our Model?

First, we'll validate our model's performance by generating an ROC curve using the known positive and negative control sets. This tells us how well our model can distinguish between real NES motifs and other sequences.

In [None]:
# Extract the scores from the control set dataframes
known_pos_scores = positive_controls_df['positive_probability'].tolist()
known_neg_scores = negative_controls_df['positive_probability'].tolist()

# Generate the ROC plot with the plot_roc_curve helper from plotting_utils
plot_roc_curve(known_pos_scores, known_neg_scores, output_path="roc_curve_validation.png")
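
To report an AUC value alongside the plot, the number can be computed directly from the two score lists. This is a minimal sketch that makes no assumption about how `plot_roc_curve` works internally: `roc_auc` is a hypothetical helper that uses the pairwise (Mann-Whitney) definition of AUC, and the example scores are made up.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. The loop is O(len(pos) * len(neg)), which is
    acceptable for control sets of modest size.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up scores for illustration only. In the notebook the inputs would be
# known_pos_scores and known_neg_scores.
demo_pos = [0.9, 0.8, 0.4]
demo_neg = [0.5, 0.3, 0.1]
print(f"AUC = {roc_auc(demo_pos, demo_neg):.3f}")  # 8 of 9 pairs ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the value gives a threshold-independent summary of the curve.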

In [None]:
# TODO: The resulting ROC curve shows an Area Under the Curve (AUC) of [e.g., 0.89]. This is a strong result, indicating our model has excellent discriminative power and is not just guessing.

---
### Section 4: Analysis of the Full Proteome Screen

Now that we've validated our model, let's analyze the results from the full human proteome screen. We will compare the distribution of scores from the main screen to our negative control set to see if we found anything interesting.


In [None]:
# Prepare the data for the plots
plot_scores = {
    'Human Proteome Screen': main_results_df['positive_probability'].tolist(),
    'Negative Controls': negative_controls_df['positive_probability'].tolist()
}

# Generate the plots with the plotting_utils helpers
plot_score_distribution(plot_scores, output_path="proteome_vs_control_dist.png")
plot_score_boxplot(plot_scores, output_path="proteome_vs_control_boxplot.png")
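
The visual comparison can be backed by a few numbers per group. The sketch below is an illustration under stated assumptions rather than part of the original analysis: `tail_fraction` and `summarize` are hypothetical helpers, the cutoffs 0.5 and 0.9 are arbitrary choices, and the demo scores are made up.

```python
from statistics import median


def tail_fraction(scores, cutoff):
    """Fraction of scores at or above the cutoff."""
    return sum(s >= cutoff for s in scores) / len(scores)


def summarize(groups, cutoffs=(0.5, 0.9)):
    """Return the median and high-score tail fractions for each named group."""
    summary = {}
    for name, scores in groups.items():
        row = {"median": median(scores)}
        for c in cutoffs:
            row[f"frac>={c}"] = tail_fraction(scores, c)
        summary[name] = row
    return summary


# Made-up scores for illustration only. In the notebook the input would be the
# plot_scores dictionary built above.
demo = {
    "Human Proteome Screen": [0.05, 0.10, 0.20, 0.95],
    "Negative Controls": [0.02, 0.05, 0.08, 0.10],
}
for name, row in summarize(demo).items():
    print(name, row)
```

A larger tail fraction in the proteome screen than in the negative controls at the same cutoff is the numerical counterpart of the high-probability tail seen in the plots.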


In [None]:
# TODO: The plots above show a separation between the two groups. While the scores for the negative control proteins are tightly clustered around zero, the human proteome scores show a distinct tail in the high-probability region. This suggests our model is identifying a set of candidate peptides that score well above the negative baseline.


---
### Section 5: Identifying Novel High-Confidence Candidates

The final step is to isolate the most promising, high-confidence NES candidates from our screening of over [e.g., 2 million] peptides.

In [None]:
# Define a threshold for a high-confidence hit
confidence_threshold = 0.90

# Filter the dataframe
high_confidence_hits = main_results_df[main_results_df['positive_probability'] >= confidence_threshold]

# Sort by score and display the Top 15 candidates
top_candidates = high_confidence_hits.sort_values(by='positive_probability', ascending=False)
print(f"Found {len(top_candidates)} high-confidence candidates (score >= {confidence_threshold}).")
display(top_candidates.head(15))
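
One way to sanity-check the chosen threshold is to ask how often the negative controls would pass it. The sketch below is a rough estimate under a strong assumption, namely that the negative controls are representative of non-NES peptides in the full screen. `empirical_fpr` and `expected_false_hits` are hypothetical helpers, and the demo numbers are made up.

```python
def empirical_fpr(neg_scores, threshold):
    """Fraction of negative-control scores at or above the threshold."""
    if not neg_scores:
        raise ValueError("neg_scores must be non-empty")
    return sum(s >= threshold for s in neg_scores) / len(neg_scores)


def expected_false_hits(neg_scores, n_screened, threshold):
    """Rough count of screened peptides expected to pass by chance.

    Assumes the negative controls behave like non-NES peptides in the screen.
    """
    return empirical_fpr(neg_scores, threshold) * n_screened


# Made-up values for illustration only. In the notebook the inputs would be
# known_neg_scores, len(main_results_df), and confidence_threshold.
demo_neg = [0.01, 0.03, 0.10, 0.92]
fpr = empirical_fpr(demo_neg, 0.90)
print(f"Empirical FPR at 0.90: {fpr:.2f}")
print(f"Expected false hits among 1000 peptides: {expected_false_hits(demo_neg, 1000, 0.90):.0f}")
```

If no negative control passes the threshold, the estimate is zero, but its resolution is limited by the size of the control set, so it should be read as a lower bound on caution rather than proof of zero false positives.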


In [None]:
# TODO: Our project successfully developed and deployed a prediction pipeline that screened the entire human proteome. We validated our model (AUC = [e.g., 0.89]) and identified **[e.g., 75]** novel, high-confidence NES candidates. The top candidate, found in the protein **[e.g., P53_HUMAN]**, has a score of **[e.g., 0.98]**, making it a prime candidate for future experimental validation.
