In [1]:
%load_ext autoreload

In [5]:
%autoreload 2
%aimport AD_predictor_tools
%aimport AD_comparison_tools
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')

Steps:

1. Load in our predictions, Sanborn PADDLE predictions, and PARROT predictions.
2. Find which predictions are made by only one, by each pair (3 pairs), and by all three.
    1. Make one long dataframe with all predictions, keeping track of the predictor that made it.
    2. Aggregate the predictions, keeping track of the predictor(s) it came from.
    3. Separate the dataframe into five based on the categories outlined above.
3. Find what proportions of predictions in each category overlap with the gold standard list.
4. Find what proportions of predictions in each category overlap with the tested dataset.

## 1. Load in our predictions, Sanborn PADDLE predictions, and PARROT predictions.

These are our predictions, made with a line of slope 1, using VP16 and CITED2 for charge and WFYL for composition.

In [43]:
our_preds=pd.read_csv('../output/predictions/LambertTFs_s_001_lcc_VP16_lch_VP16_ucc_CITED2_uch_CITED2_lcs1_000_lcs2_000_lcs1_001_ucs2_inf_comp_WFYL_tl_039_ws_001_ps1_Charge_ps2_AllHydros',index_col=0)
# GeneName looks like "sp|P11473|VDR_HUMAN"; the UniProt accession is the middle field.
our_preds["uniprotID"]=our_preds["GeneName"].str.split('|').str[1]
our_preds

Unnamed: 0,GeneName,Start,End,Length,RegionType,ProteinRegionSeq,uniprotID
0,sp|P11473|VDR_HUMAN,195,236,41,Prediction,MMDSSSFSNLDLSEEDSDDPSVTLELSQLSMLPHLADLVSY,P11473
1,sp|P01106|MYC_HUMAN,11,50,39,Prediction,YDLDYDSVQPYFYCDEEENFYQQQQQSELQPPAPSEDIW,P01106
2,sp|O60304|ZN500_HUMAN,4,53,49,Prediction,PGLQPLPTLEQDLEQEEILIVKVEEDFCLEEEPSVETEDPSPETFRQLF,O60304
3,sp|Q96MW7|TIGD1_HUMAN,410,475,65,Prediction,IDDYEGFKTSVEEVSADVVEIAKELELEVEPEDVTELLQSHDKTLT...,Q96MW7
4,sp|Q9Y4E5|ZN451_HUMAN,861,899,38,Prediction,NDLSYQNIEEEIVELPDLDYLRTMTHIVFVDFDNWSNF,Q9Y4E5
...,...,...,...,...,...,...,...
139,sp|Q8IUX7|AEBP1_HUMAN,1083,1158,75,Prediction,ETYTEVVTEFGTEVEPEFGTKVEPEFETQLEPEFETQLEPEFEEEE...,Q8IUX7
140,sp|Q9UGL1|KDM5B_HUMAN,869,909,40,Prediction,EDFQQHSQKLLSEETPSAAELQDLLDVSFEFDVELPQLAE,Q9UGL1
141,sp|Q7Z7K2|ZN467_HUMAN,99,137,38,Prediction,DEDQEAEEEVEWPQHLSLLPSPFPAPDLGHLAAAYKLE,Q7Z7K2
142,sp|Q12772|SRBP2_HUMAN,1,56,55,Prediction,DDSGELGGLETMETLTELGDELTLGDIDEMLQFVSNQVGEFPDLFS...,Q12772


---

I got the Sanborn PADDLE predictions by going to https://elifesciences.org/articles/68068 then downloading Figure 3—source data 1 at https://cdn.elifesciences.org/articles/68068/elife-68068-fig3-data1-v3.xlsx.

I then made a dataframe with all high and medium-strength predicted ADs made on human TFs using the data on the tab of the spreadsheet titled "Predicted ADs in human TFs."

In [6]:
#Sanborn PADDLE preds of both strengths
PADDLE=pd.read_csv("../../PredictionADs_ToShare/Output/Sanborn_HumanTF_Predictions_BothStrengths",index_col=0)
PADDLE

Unnamed: 0,uniprotID,Start,End,max predicted Z score,Activity_Zscore_mean,protein,description
0,Q6P9G9,155,235,6.80,6.42,ZNF449,Zinc finger protein 449
1,Q04206,418,502,8.22,7.55,RELA,Transcription factor p65
2,Q9Y2G1,0,62,7.64,6.98,MYRF,Myelin regulatory factor
3,P43354,13,77,6.98,6.59,NR4A2,Nuclear receptor subfamily 4 group A member 2
4,Q9ULD5,278,338,6.52,6.24,ZNF777,Zinc finger protein 777
...,...,...,...,...,...,...,...
597,Q03701,943,1030,6.40,4.49,CEBPZ,CCAAT/enhancer-binding protein zeta
598,Q2M1K9,51,136,4.81,4.00,ZNF423,Zinc finger protein 423
599,Q96LX8,0,107,5.62,3.93,ZNF597,Zinc finger protein 597
600,Q9Y2D1,15,104,5.56,4.40,ATF5,Cyclic AMP-dependent transcription factor ATF-5


---

I got the PARROT predictions by tiling the Lambert TFs into 30-aa windows and running them through the PARROT predictor trained on data from Erijman et al., which is best optimized for 30-aa peptides (http://localhost:8888/notebooks/Desktop/Staller_Lab_SU21/PredictionADs_ToShare/PARROT%20Predictions%20vs.%20Our%20Predictions.ipynb). I then aggregated the predictions and saved them.
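
The tiling and aggregation steps can be sketched roughly as follows. This is a minimal illustration, not the code from the PARROT notebook: the helper names `tile_sequence` and `merge_windows`, the step size of 1, and the rule of merging overlapping or touching windows are all assumptions.

```python
from typing import Iterator, List, Tuple


def tile_sequence(seq: str, window: int = 30, step: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (start, peptide) for every full-length window in seq (hypothetical helper)."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]


def merge_windows(starts: List[int], window: int = 30) -> List[Tuple[int, int]]:
    """Merge overlapping or touching windows into (start, end) regions, end exclusive."""
    regions: List[Tuple[int, int]] = []
    for start in sorted(starts):
        end = start + window
        if regions and start <= regions[-1][1]:
            # This window overlaps or touches the previous region, so extend it.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


# Made-up 40-residue sequence, used only to show the tiling.
seq = "ACDEFGHIKLMNPQRSTVWY" * 2
tiles = list(tile_sequence(seq))

# Pretend the windows starting at these positions were scored as positive.
# A window of 5 is used here only to keep the toy example readable.
regions = merge_windows([0, 2, 3, 20], window=5)
print(len(tiles), regions)
```

Whether the real aggregation merges touching windows or requires a minimum number of positive tiles is not stated here, so treat the merging rule as a placeholder.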


In [23]:
#PARROT 30AA Predictions
PARROT=pd.read_csv("../../PredictionADs_ToShare/Output/PARROT_HumanTF_Predictions",index_col=0)
PARROT

Unnamed: 0,GeneName,Start,End,Length,RegionType,ProteinRegionSeq,uniprotID
0,sp|Q5TYW1|ZN658_HUMAN,15,87,72,Prediction,EFTREEWQHLGPVERTLYRDVMLENYSHLISVGYCITKPKVISKLE...,Q5TYW1
1,sp|Q5TYW1|ZN658_HUMAN,203,237,34,Prediction,QHWKFQTLEESFECDGSGQGLYDKTICITPQSFL,Q5TYW1
2,sp|P22736|NR4A1_HUMAN,38,89,51,Prediction,SPEAAPAAPTALPSFSTFMDGYTGEFDTFLYQLPGTVQPCSSASSS...,P22736
3,sp|P22736|NR4A1_HUMAN,380,510,130,Prediction,KLDYSKFQELVLPHFGKEDAGDVQQFYDLLSGSLEVIRKWAEKIPG...,P22736
4,sp|P22736|NR4A1_HUMAN,566,598,32,Prediction,TQGLQRIFYLKLEDLVPPPPIIDKIFMDTLPF,P22736
...,...,...,...,...,...,...,...
2147,sp|Q03933|HSF2_HUMAN,492,536,44,Prediction,PEPTQSKLVRLEPLTEAEASEATLFYLCELAPAPLDSDMPLLDS,Q03933
2148,sp|Q9Y2W7|CSEN_HUMAN,91,256,165,Prediction,ELQSLYRGFKNECPTGLVDEDTFKLIYAQFFPQGDATTYAHFLFNA...,Q9Y2W7
2149,sp|Q9NQX6|ZN331_HUMAN,1,61,60,Prediction,AQGLVTFADVAIDFSQEEWACLNSAQRDLYWDVMLENYSNLVSLDL...,Q9NQX6
2150,sp|P37275|ZEB1_HUMAN,182,211,29,Prediction,TSLKEHIKYRHEKNEDNFSCSLCSYTFAY,P37275


---

## 2. Find which predictions are made by only one, by each pair (3 pairs), and by all three.

#### A. Make one long dataframe with all predictions, keeping track of the predictor that made it.

In [104]:
all_preds=AD_comparison_tools.df_list_to_df(df_list=[our_preds, PADDLE, PARROT], 
                                            note_list=["ours","PADDLE","PARROT"], 
                                            note_list_col_name="predictor")
all_preds

Unnamed: 0,GeneName,Start,End,Length,RegionType,ProteinRegionSeq,uniprotID,predictor,max predicted Z score,Activity_Zscore_mean,protein,description
0,sp|P11473|VDR_HUMAN,195,236,41.0,Prediction,MMDSSSFSNLDLSEEDSDDPSVTLELSQLSMLPHLADLVSY,P11473,ours,,,,
1,sp|P01106|MYC_HUMAN,11,50,39.0,Prediction,YDLDYDSVQPYFYCDEEENFYQQQQQSELQPPAPSEDIW,P01106,ours,,,,
2,sp|O60304|ZN500_HUMAN,4,53,49.0,Prediction,PGLQPLPTLEQDLEQEEILIVKVEEDFCLEEEPSVETEDPSPETFRQLF,O60304,ours,,,,
3,sp|Q96MW7|TIGD1_HUMAN,410,475,65.0,Prediction,IDDYEGFKTSVEEVSADVVEIAKELELEVEPEDVTELLQSHDKTLT...,Q96MW7,ours,,,,
4,sp|Q9Y4E5|ZN451_HUMAN,861,899,38.0,Prediction,NDLSYQNIEEEIVELPDLDYLRTMTHIVFVDFDNWSNF,Q9Y4E5,ours,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
2893,sp|Q03933|HSF2_HUMAN,492,536,44.0,Prediction,PEPTQSKLVRLEPLTEAEASEATLFYLCELAPAPLDSDMPLLDS,Q03933,PARROT,,,,
2894,sp|Q9Y2W7|CSEN_HUMAN,91,256,165.0,Prediction,ELQSLYRGFKNECPTGLVDEDTFKLIYAQFFPQGDATTYAHFLFNA...,Q9Y2W7,PARROT,,,,
2895,sp|Q9NQX6|ZN331_HUMAN,1,61,60.0,Prediction,AQGLVTFADVAIDFSQEEWACLNSAQRDLYWDVMLENYSNLVSLDL...,Q9NQX6,PARROT,,,,
2896,sp|P37275|ZEB1_HUMAN,182,211,29.0,Prediction,TSLKEHIKYRHEKNEDNFSCSLCSYTFAY,P37275,PARROT,,,,
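
For readers without access to `AD_comparison_tools`, the stacking step might look something like the sketch below. The function `stack_with_labels` is a hypothetical stand-in, not the actual implementation of `df_list_to_df`; it only assumes that the helper concatenates the dataframes row-wise and records the source in a new column. The toy coordinates are taken from the tables above.

```python
import pandas as pd


def stack_with_labels(df_list, note_list, note_list_col_name):
    """Stack dataframes row-wise, tagging each row with its source label (hypothetical)."""
    labelled = [
        df.assign(**{note_list_col_name: note})
        for df, note in zip(df_list, note_list)
    ]
    # Columns missing from one dataframe are filled with NaN in the result.
    return pd.concat(labelled, ignore_index=True, sort=False)


# Toy inputs with different columns, mimicking our predictions and PADDLE.
a = pd.DataFrame({"uniprotID": ["P01106"], "Start": [11], "End": [50], "Length": [39]})
b = pd.DataFrame({"uniprotID": ["Q04206"], "Start": [418], "End": [502], "protein": ["RELA"]})

stacked = stack_with_labels([a, b], ["ours", "PADDLE"], "predictor")
print(stacked)
```

This also accounts for `Length` showing as `41.0` rather than `41` in the combined output: PADDLE has no `Length` column, so pandas fills those rows with NaN and upcasts the column to float.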


---
#### B. Identify which predictions are made by which predictors.

In [105]:
# One flag per row of all_preds, for each of the three predictors.
we_predict=[]
PADDLE_predicts=[]
PARROT_predicts=[]

# Iterate over the rows themselves so the loop does not rely on the index being 0..n-1.
for _, row in all_preds.iterrows():
    we_predict.append(AD_comparison_tools.contains_prediction(pred_df_row=row, compare_to_df=our_preds))
    PADDLE_predicts.append(AD_comparison_tools.contains_prediction(pred_df_row=row, compare_to_df=PADDLE))
    PARROT_predicts.append(AD_comparison_tools.contains_prediction(pred_df_row=row, compare_to_df=PARROT))

all_preds["Us?"]=we_predict
all_preds["PADDLE?"]=PADDLE_predicts
all_preds["PARROT?"]=PARROT_predicts

all_preds

Unnamed: 0,GeneName,Start,End,Length,RegionType,ProteinRegionSeq,uniprotID,predictor,max predicted Z score,Activity_Zscore_mean,protein,description,Us?,PADDLE?,PARROT?
0,sp|P11473|VDR_HUMAN,195,236,41.0,Prediction,MMDSSSFSNLDLSEEDSDDPSVTLELSQLSMLPHLADLVSY,P11473,ours,,,,,True,False,True
1,sp|P01106|MYC_HUMAN,11,50,39.0,Prediction,YDLDYDSVQPYFYCDEEENFYQQQQQSELQPPAPSEDIW,P01106,ours,,,,,True,True,True
2,sp|O60304|ZN500_HUMAN,4,53,49.0,Prediction,PGLQPLPTLEQDLEQEEILIVKVEEDFCLEEEPSVETEDPSPETFRQLF,O60304,ours,,,,,True,False,False
3,sp|Q96MW7|TIGD1_HUMAN,410,475,65.0,Prediction,IDDYEGFKTSVEEVSADVVEIAKELELEVEPEDVTELLQSHDKTLT...,Q96MW7,ours,,,,,True,False,True
4,sp|Q9Y4E5|ZN451_HUMAN,861,899,38.0,Prediction,NDLSYQNIEEEIVELPDLDYLRTMTHIVFVDFDNWSNF,Q9Y4E5,ours,,,,,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2893,sp|Q03933|HSF2_HUMAN,492,536,44.0,Prediction,PEPTQSKLVRLEPLTEAEASEATLFYLCELAPAPLDSDMPLLDS,Q03933,PARROT,,,,,False,True,True
2894,sp|Q9Y2W7|CSEN_HUMAN,91,256,165.0,Prediction,ELQSLYRGFKNECPTGLVDEDTFKLIYAQFFPQGDATTYAHFLFNA...,Q9Y2W7,PARROT,,,,,False,False,True
2895,sp|Q9NQX6|ZN331_HUMAN,1,61,60.0,Prediction,AQGLVTFADVAIDFSQEEWACLNSAQRDLYWDVMLENYSNLVSLDL...,Q9NQX6,PARROT,,,,,False,True,True
2896,sp|P37275|ZEB1_HUMAN,182,211,29.0,Prediction,TSLKEHIKYRHEKNEDNFSCSLCSYTFAY,P37275,PARROT,,,,,False,False,True
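
The definition of `contains_prediction` is not shown in this notebook. One plausible reading is an interval-overlap test restricted to the same protein, sketched below. The function name `overlaps_any`, the inclusive overlap rule, and the reference coordinates are assumptions made for illustration.

```python
import pandas as pd


def overlaps_any(row, compare_to_df):
    """True if row's [Start, End] overlaps any interval on the same uniprotID (hypothetical)."""
    same_protein = compare_to_df[compare_to_df["uniprotID"] == row["uniprotID"]]
    hits = (same_protein["Start"] <= row["End"]) & (same_protein["End"] >= row["Start"])
    return bool(hits.any())


# Made-up reference intervals, used only to exercise the function.
reference = pd.DataFrame({
    "uniprotID": ["P01106", "Q04206"],
    "Start": [40, 418],
    "End": [90, 502],
})
queries = pd.DataFrame({
    "uniprotID": ["P01106", "P11473", "Q04206"],
    "Start": [11, 195, 300],
    "End": [50, 236, 417],
})

queries["in_reference"] = [overlaps_any(row, reference) for _, row in queries.iterrows()]
print(queries)
```

One thing to check before relying on any overlap rule: the PADDLE table contains `Start` values of 0, while the other two tables appear to start at 1, so the coordinate conventions may differ by one residue.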


---
#### C. Separate the dataframe into five based on the categories outlined above.

In [113]:
# Rows predicted by both us and PADDLE (rows that PARROT also predicts are included).
Ours_PADDLE=all_preds[all_preds["Us?"] & all_preds["PADDLE?"]]
Ours_PADDLE

Unnamed: 0,GeneName,Start,End,Length,RegionType,ProteinRegionSeq,uniprotID,predictor,max predicted Z score,Activity_Zscore_mean,protein,description,Us?,PADDLE?,PARROT?
1,sp|P01106|MYC_HUMAN,11,50,39.0,Prediction,YDLDYDSVQPYFYCDEEENFYQQQQQSELQPPAPSEDIW,P01106,ours,,,,,True,True,True
4,sp|Q9Y4E5|ZN451_HUMAN,861,899,38.0,Prediction,NDLSYQNIEEEIVELPDLDYLRTMTHIVFVDFDNWSNF,Q9Y4E5,ours,,,,,True,True,True
5,sp|Q8WYA1|BMAL2_HUMAN,587,636,49.0,Prediction,EPLLSDGAQLDFDALCDNDDTAMAAFMNYLEAEGGLGDPGDFSDIQWTL,Q8WYA1,ours,,,,,True,True,True
6,sp|Q04206|TF65_HUMAN,432,476,44.0,Prediction,EGTLSEALLQLQFDDEDLGALLGNSTDPAVFTDLASVDNSEFQQ,Q04206,ours,,,,,True,True,True
7,sp|A8MYZ6|FOXO6_HUMAN,428,476,48.0,Prediction,APDRFPADLDLDMFSGSLECDVESIILNDFMDSDEMDFNFDSALPPPP,A8MYZ6,ours,,,,,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2779,sp|Q8IUM7|NPAS4_HUMAN,674,784,110.0,Prediction,SGAGPPVLSLDLKPWKCQELDFLADPDNMFLEETPVEDIFMDLSTP...,Q8IUM7,PARROT,,,,,True,True,True
2821,sp|P35716|SOX11_HUMAN,338,441,103.0,Prediction,VSTSSSSSSGSSSGSSGEDADDLMFDLSLNFSQSAHSASEQQLGGG...,P35716,PARROT,,,,,True,True,True
2836,sp|P36956|SRBP1_HUMAN,2,58,56.0,Prediction,EPPFSEAALEQALGEPCDLDAALLTDIEDMLQLINNQDSDFPGLFD...,P36956,PARROT,,,,,True,True,True
2866,sp|O43309|ZSC12_HUMAN,64,123,59.0,Prediction,SRLRELCHQWLRPETHTKEQILELLVLEQFLTILPEELQAWVQEQH...,O43309,PARROT,,,,,True,True,True


In [118]:
# Rows predicted by both PARROT and PADDLE (rows that we also predict are included).
PARROT_PADDLE=all_preds[all_preds["PARROT?"] & all_preds["PADDLE?"]]
PARROT_PADDLE

Unnamed: 0,GeneName,Start,End,Length,RegionType,ProteinRegionSeq,uniprotID,predictor,max predicted Z score,Activity_Zscore_mean,protein,description,Us?,PADDLE?,PARROT?
1,sp|P01106|MYC_HUMAN,11,50,39.0,Prediction,YDLDYDSVQPYFYCDEEENFYQQQQQSELQPPAPSEDIW,P01106,ours,,,,,True,True,True
4,sp|Q9Y4E5|ZN451_HUMAN,861,899,38.0,Prediction,NDLSYQNIEEEIVELPDLDYLRTMTHIVFVDFDNWSNF,Q9Y4E5,ours,,,,,True,True,True
5,sp|Q8WYA1|BMAL2_HUMAN,587,636,49.0,Prediction,EPLLSDGAQLDFDALCDNDDTAMAAFMNYLEAEGGLGDPGDFSDIQWTL,Q8WYA1,ours,,,,,True,True,True
6,sp|Q04206|TF65_HUMAN,432,476,44.0,Prediction,EGTLSEALLQLQFDDEDLGALLGNSTDPAVFTDLASVDNSEFQQ,Q04206,ours,,,,,True,True,True
7,sp|A8MYZ6|FOXO6_HUMAN,428,476,48.0,Prediction,APDRFPADLDLDMFSGSLECDVESIILNDFMDSDEMDFNFDSALPPPP,A8MYZ6,ours,,,,,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2884,sp|Q14765|STAT4_HUMAN,177,282,105.0,Prediction,IQTMDQSDKNSAMVNQEVLTLQEMLNSLDFKRKEALSKMTQIIHET...,Q14765,PARROT,,,,,False,True,True
2892,sp|Q03933|HSF2_HUMAN,330,412,82.0,Prediction,GSSSLTSEDPVTMMDSILNDNINLLGKVELLDYLDSIDCSLEDFQA...,Q03933,PARROT,,,,,False,True,True
2893,sp|Q03933|HSF2_HUMAN,492,536,44.0,Prediction,PEPTQSKLVRLEPLTEAEASEATLFYLCELAPAPLDSDMPLLDS,Q03933,PARROT,,,,,False,True,True
2895,sp|Q9NQX6|ZN331_HUMAN,1,61,60.0,Prediction,AQGLVTFADVAIDFSQEEWACLNSAQRDLYWDVMLENYSNLVSLDL...,Q9NQX6,PARROT,,,,,False,True,True


In [119]:
# Rows predicted by both us and PARROT (rows that PADDLE also predicts are included).
Ours_PARROT=all_preds[all_preds["PARROT?"] & all_preds["Us?"]]
Ours_PARROT

Unnamed: 0,GeneName,Start,End,Length,RegionType,ProteinRegionSeq,uniprotID,predictor,max predicted Z score,Activity_Zscore_mean,protein,description,Us?,PADDLE?,PARROT?
0,sp|P11473|VDR_HUMAN,195,236,41.0,Prediction,MMDSSSFSNLDLSEEDSDDPSVTLELSQLSMLPHLADLVSY,P11473,ours,,,,,True,False,True
1,sp|P01106|MYC_HUMAN,11,50,39.0,Prediction,YDLDYDSVQPYFYCDEEENFYQQQQQSELQPPAPSEDIW,P01106,ours,,,,,True,True,True
3,sp|Q96MW7|TIGD1_HUMAN,410,475,65.0,Prediction,IDDYEGFKTSVEEVSADVVEIAKELELEVEPEDVTELLQSHDKTLT...,Q96MW7,ours,,,,,True,False,True
4,sp|Q9Y4E5|ZN451_HUMAN,861,899,38.0,Prediction,NDLSYQNIEEEIVELPDLDYLRTMTHIVFVDFDNWSNF,Q9Y4E5,ours,,,,,True,True,True
5,sp|Q8WYA1|BMAL2_HUMAN,587,636,49.0,Prediction,EPLLSDGAQLDFDALCDNDDTAMAAFMNYLEAEGGLGDPGDFSDIQWTL,Q8WYA1,ours,,,,,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2811,sp|Q9BQA5|HINFP_HUMAN,3,173,170.0,Prediction,PGKVPRKENLWLQCEWGSCSFVCSTMEKFFEHVTQHLQQHLHGSGE...,Q9BQA5,PARROT,,,,,True,False,True
2821,sp|P35716|SOX11_HUMAN,338,441,103.0,Prediction,VSTSSSSSSGSSSGSSGEDADDLMFDLSLNFSQSAHSASEQQLGGG...,P35716,PARROT,,,,,True,True,True
2836,sp|P36956|SRBP1_HUMAN,2,58,56.0,Prediction,EPPFSEAALEQALGEPCDLDAALLTDIEDMLQLINNQDSDFPGLFD...,P36956,PARROT,,,,,True,True,True
2866,sp|O43309|ZSC12_HUMAN,64,123,59.0,Prediction,SRLRELCHQWLRPETHTKEQILELLVLEQFLTILPEELQAWVQEQH...,O43309,PARROT,,,,,True,True,True


In [120]:
# Rows predicted by all three predictors.
Ours_PARROT_PADDLE=all_preds[all_preds["PARROT?"] & all_preds["Us?"] & all_preds["PADDLE?"]]
Ours_PARROT_PADDLE

Unnamed: 0,GeneName,Start,End,Length,RegionType,ProteinRegionSeq,uniprotID,predictor,max predicted Z score,Activity_Zscore_mean,protein,description,Us?,PADDLE?,PARROT?
1,sp|P01106|MYC_HUMAN,11,50,39.0,Prediction,YDLDYDSVQPYFYCDEEENFYQQQQQSELQPPAPSEDIW,P01106,ours,,,,,True,True,True
4,sp|Q9Y4E5|ZN451_HUMAN,861,899,38.0,Prediction,NDLSYQNIEEEIVELPDLDYLRTMTHIVFVDFDNWSNF,Q9Y4E5,ours,,,,,True,True,True
5,sp|Q8WYA1|BMAL2_HUMAN,587,636,49.0,Prediction,EPLLSDGAQLDFDALCDNDDTAMAAFMNYLEAEGGLGDPGDFSDIQWTL,Q8WYA1,ours,,,,,True,True,True
6,sp|Q04206|TF65_HUMAN,432,476,44.0,Prediction,EGTLSEALLQLQFDDEDLGALLGNSTDPAVFTDLASVDNSEFQQ,Q04206,ours,,,,,True,True,True
7,sp|A8MYZ6|FOXO6_HUMAN,428,476,48.0,Prediction,APDRFPADLDLDMFSGSLECDVESIILNDFMDSDEMDFNFDSALPPPP,A8MYZ6,ours,,,,,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2779,sp|Q8IUM7|NPAS4_HUMAN,674,784,110.0,Prediction,SGAGPPVLSLDLKPWKCQELDFLADPDNMFLEETPVEDIFMDLSTP...,Q8IUM7,PARROT,,,,,True,True,True
2821,sp|P35716|SOX11_HUMAN,338,441,103.0,Prediction,VSTSSSSSSGSSSGSSGEDADDLMFDLSLNFSQSAHSASEQQLGGG...,P35716,PARROT,,,,,True,True,True
2836,sp|P36956|SRBP1_HUMAN,2,58,56.0,Prediction,EPPFSEAALEQALGEPCDLDAALLTDIEDMLQLINNQDSDFPGLFD...,P36956,PARROT,,,,,True,True,True
2866,sp|O43309|ZSC12_HUMAN,64,123,59.0,Prediction,SRLRELCHQWLRPETHTKEQILELLVLEQFLTILPEELQAWVQEQH...,O43309,PARROT,,,,,True,True,True
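
The pairwise dataframes above overlap with the all-three dataframe, since a row with all three flags set passes every pairwise filter. If mutually exclusive categories are wanted (only one predictor, exactly one pair, or all three), boolean masks can express them directly. The sketch below uses a made-up flags table and hypothetical category names; only the three flag column names come from this notebook.

```python
import pandas as pd

# Toy flags table with the same three boolean columns as all_preds.
flags = pd.DataFrame({
    "Us?":     [True,  True, False, True,  False],
    "PADDLE?": [False, True, True,  True,  False],
    "PARROT?": [True,  True, True,  False, True],
})
us, paddle, parrot = flags["Us?"], flags["PADDLE?"], flags["PARROT?"]

# Each row satisfies exactly one of these masks.
categories = {
    "ours only":          us & ~paddle & ~parrot,
    "PADDLE only":        ~us & paddle & ~parrot,
    "PARROT only":        ~us & ~paddle & parrot,
    "ours + PADDLE only": us & paddle & ~parrot,
    "ours + PARROT only": us & ~paddle & parrot,
    "PADDLE + PARROT only": ~us & paddle & parrot,
    "all three":          us & paddle & parrot,
}

counts = {name: int(mask.sum()) for name, mask in categories.items()}
print(counts)
```

Because every row in `all_preds` comes from at least one predictor, the category counts should sum to the number of rows, which makes a convenient sanity check.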


---

## 3. Find what proportions of predictions in each category overlap with the gold standard list.


In [127]:
AD_predictor_tools.compare_to_random(outputfilepath=Ours_PARROT_PADDLE)

There are 1608 proteins
                   GeneName  Start  End  Length  RegionType  \
1       sp|P01106|MYC_HUMAN     11   50    39.0  Prediction   
4     sp|Q9Y4E5|ZN451_HUMAN    861  899    38.0  Prediction   
5     sp|Q8WYA1|BMAL2_HUMAN    587  636    49.0  Prediction   
6      sp|Q04206|TF65_HUMAN    432  476    44.0  Prediction   
7     sp|A8MYZ6|FOXO6_HUMAN    428  476    48.0  Prediction   
...                     ...    ...  ...     ...         ...   
2779  sp|Q8IUM7|NPAS4_HUMAN    674  784   110.0  Prediction   
2821  sp|P35716|SOX11_HUMAN    338  441   103.0  Prediction   
2836  sp|P36956|SRBP1_HUMAN      2   58    56.0  Prediction   
2866  sp|O43309|ZSC12_HUMAN     64  123    59.0  Prediction   
2873  sp|Q96MU6|ZN778_HUMAN     18   87    69.0  Prediction   

                                       ProteinRegionSeq uniprotID predictor  \
1               YDLDYDSVQPYFYCDEEENFYQQQQQSELQPPAPSEDIW    P01106      ours   
4                NDLSYQNIEEEIVELPDLDYLRTMTHIVFVDFDNWSNF    Q9

TypeError: object of type 'float' has no len()
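
This message is what Python raises when `len()` is called on a float, and in pandas a missing value (NaN) is a float. Rows that came from PADDLE have NaN in columns such as `ProteinRegionSeq`, so one possible cause is a length calculation on a sequence column that contains NaN. That is an untested guess about `compare_to_random`, whose source is not shown here. The sketch below reproduces the message on a toy dataframe and shows one way to locate the affected rows.

```python
import numpy as np
import pandas as pd

# Toy dataframe: the second row mimics a PADDLE row with no sequence.
toy = pd.DataFrame({
    "uniprotID": ["P01106", "Q04206"],
    "ProteinRegionSeq": ["YDLDYDSVQPYFYCDEEENFYQQQQQSELQPPAPSEDIW", np.nan],
    "predictor": ["ours", "PADDLE"],
})

# Calling len() on NaN raises the same TypeError seen above.
try:
    toy["ProteinRegionSeq"].apply(len)
    error_message = None
except TypeError as err:
    error_message = str(err)

# Rows with a missing sequence are the ones that would trigger the error.
missing_seq = toy[toy["ProteinRegionSeq"].isna()]
print(error_message)
print(missing_seq)
```

If this turns out to be the cause, filling in the sequences for those rows or excluding them before the call would be two ways forward. Note also that the call above passes a dataframe to a parameter named `outputfilepath`, which may be worth checking against the function's expected input.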

## 4. Find what proportions of predictions in each category overlap with the tested dataset.