About Proprecess data #12

Closed
lbwfff opened this issue Dec 6, 2021 · 11 comments

Comments

@lbwfff

lbwfff commented Dec 6, 2021

Hi,
According to Data curation, I need to format the peptide-protein data like "protein sequence, peptide sequence, protein_ss, peptide_ss", but in fact preprocess_features.py needs me to provide the Protein_pssm_dict, Protein_Intrinsic_dict and Peptide_Intrinsic_dict_v3 files, if I understand correctly.
I can get pssm_dict according to step3_generate_features.py, but what are the next two files?
Thanks,
LeeLee

@twopin
Owner

twopin commented Dec 13, 2021

Hi, you need to generate the PSSM features with PSI-BLAST and the intrinsic disorder features with IUPred, using your own data.

@lbwfff
Author

lbwfff commented Dec 13, 2021

> Hi, you need to generate pssm features by PSI-Blast and intrinsic features by IUPred with your own data.

Hi, twopin
Thank you for your reply. Are Peptide_Intrinsic_dict_v3 and Protein_Intrinsic_dict generated by the same process, with peptide and protein sequences as the respective inputs?
Besides that, I also ran into a problem. I'm testing with a small file, and after changing fasta_filename to my input FASTA file, I got the following error. How can I solve it?

(4, 16, 16)
(4, 0)
(4, 16, 16)
(4, 0)
Traceback (most recent call last):
  File "step3_generate_features.py", line 57, in <module>
    Intrinsic = raw_score_dict_long[key]
KeyError: '>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens OX=9606 GN=YWHAB PE=1 SV=3'

Thanks,
LeeLee

@lbwfff
Author

lbwfff commented Dec 13, 2021

My raw_score_dict seems to come out empty. Why is this happening? The following is one of the files output by IUPred. Is there any problem with this file?

# IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding
# Balint Meszaros, Gabor Erdos, Zsuzsanna Dosztanyi
# Nucleic Acids Research 2018;46(W1):W329-W337.
#
# Prediction type: short
# Prediction output
# POS	RES	IUPRED2
1	M	0.9141
2	T	0.8713
3	M	0.8311
4	D	0.7458
5	K	0.6870
6	S	0.6650
7	E	0.6374
8	L	0.5711
9	V	0.5473
10	Q	0.5084
...........
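
For checking a file like the one above by hand, here is a minimal sketch that reads the score column, assuming whitespace-separated columns (position, residue, score) and '#' comment lines as shown. parse_iupred_scores is a hypothetical helper for illustration, not the extract_intrinsic_disorder function from the CAMP scripts. Note that the file shown contains no sequence name line, so any parser that keys its dict on a name found inside the file would have nothing to key on.

```python
import numpy as np

def parse_iupred_scores(text):
    """Return the third column (the IUPRED2 score) as a 1-D float array."""
    scores = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' comment header.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Expect at least: position, residue, score. Skip anything else.
        if len(fields) < 3:
            continue
        scores.append(float(fields[2]))
    return np.array(scores)

# Toy input in the same layout as the file above.
sample = """# Prediction type: short
# POS\tRES\tIUPRED2
1\tM\t0.9141
2\tT\t0.8713
3\tM\t0.8311
"""
scores = parse_iupred_scores(sample)
print(scores.shape)
```

If the printed shape has length zero for a real file, the rows are not in the layout this sketch assumes.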

@twopin
Owner

twopin commented Dec 13, 2021

  1. https://github.com/twopin/CAMP/issues/12#issuecomment-992278927:
    Yes, Peptide_Intrinsic_dict_v3 and Protein_Intrinsic_dict are generated with the same code (with different input files, one for proteins and one for peptides). The key in the KeyError, '>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens OX=9606 GN=YWHAB PE=1 SV=3', should be a FASTA name from your FASTA file. The error message indicates that there is a sequence in your FASTA file whose FASTA name is not among the keys of raw_score_dict. I suspect that something is going wrong in the function 'extract_intrinsic_disorder'. You can run the function line by line and print the two dicts to check.
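
The check suggested here can be sketched as a comparison between the FASTA names and the dict keys. This is only an illustration under assumptions: fasta_names and missing_keys are hypothetical helpers, the toy raw_score_dict stands in for the real one, and the keys are assumed to include the leading '>' as in the KeyError message.

```python
def fasta_names(fasta_text):
    """Collect header lines (including the leading '>') from FASTA text."""
    return [line.strip() for line in fasta_text.splitlines() if line.startswith(">")]

def missing_keys(names, score_dict):
    """Return the FASTA names that are absent from score_dict."""
    return [name for name in names if name not in score_dict]

# Toy data: two sequences in the FASTA file, only one scored.
fasta_text = ">seqA example\nMTMDK\n>seqB example\nMVDRE\n"
raw_score_dict = {">seqA example": [0.91, 0.87, 0.83, 0.75, 0.69]}

names = fasta_names(fasta_text)
print(missing_keys(names, raw_score_dict))
print(len(raw_score_dict), "keys in raw_score_dict")
```

An empty dict makes every name show up as missing, which points at the parsing step rather than at the FASTA file.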

@twopin
Owner

twopin commented Dec 13, 2021

Hi, here are two of the results I got when I wrote this code one year ago (I'm not sure whether the output format has changed recently). The FASTA name for each sequence will change according to the input FASTA file.
The original file names ended with ".result". I changed them just for uploading (GitHub doesn't recognize files ending with ".result").
cm4_pep_long.txt
cm4_pep_short.txt

@lbwfff
Author

lbwfff commented Dec 14, 2021

> hi, here are two of the results I got when I wrote these codes 1 years ago (I'm not sure if the output format changes recently). The fasta names for each sequence can be changed according to the input fasta files. The original file name is ended with ".result". I change them just for file uploading (github don'e recognize files ending with ''.result". cm4_pep_long.txt cm4_pep_short.txt

Hi, thanks for your reply. I have solved this problem. There is still one small question about this piece of code:

Intrinsic_score = {}
for seq in Intrinsic_score_short.keys():
    Intrinsic = Intrinsic_score_long[prot_seq][:,0]
    short_Intrinsic = Intrinsic_score_short[prot_seq]
    concat_Intrinsic = np.column_stack((long_Intrinsic,short_Intrinsic))
    Intrinsic_score[seq] = np.column_stack((long_Intrinsic,short_Intrinsic))

This raises NameError: name 'prot_seq' is not defined. Also, long_Intrinsic does not appear anywhere in the preceding code. I guess it should be Intrinsic?
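
A minimal sketch of the loop with consistent names, assuming that prot_seq is meant to be the loop variable seq and that long_Intrinsic is the first column of the long-mode scores. The function name combine_intrinsic and the toy array shapes are illustrative assumptions, not taken from the CAMP scripts.

```python
import numpy as np

def combine_intrinsic(score_long, score_short):
    """Stack the first long-mode column next to the short-mode scores, per sequence."""
    combined = {}
    for seq in score_short:
        long_intrinsic = score_long[seq][:, 0]   # first column of the long-mode array
        short_intrinsic = score_short[seq]
        combined[seq] = np.column_stack((long_intrinsic, short_intrinsic))
    return combined

# Toy data: one 5-residue sequence with two score columns per mode (assumed shapes).
score_long = {"MTMDK": np.arange(10.0).reshape(5, 2)}
score_short = {"MTMDK": np.ones((5, 2))}
intrinsic_score = combine_intrinsic(score_long, score_short)
print(intrinsic_score["MTMDK"].shape)
```

Iterating over the short-mode dict still raises a KeyError if a sequence is missing from the long-mode dict, so both dicts need the same keys.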

@lbwfff
Author

lbwfff commented Dec 14, 2021

Hi, I ran into another problem in preprocess_features.py. The following is my error:

(camp) leelee@ubuntu-PowerEdge-T440:~/tools/CAMP/testforadjust$ python -u preprocess_features.py test.tsv 
test.tsv
num of peptides 3 pad_pep_len 50
seq_set 4 pad_prot_len 247
num of peptide ss 3 pad_pep_len 50
seq_ss_set 4 pad_prot_len 247
Traceback (most recent call last):
  File "preprocess_features.py", line 137, in <module>
    f = open(datafile)
NameError: name 'datafile' is not defined

I guess datafile here is equivalent to input_file. Is that right? After changing datafile to input_file, I got the following error:

(camp) leelee@ubuntu-PowerEdge-T440:~/tools/CAMP/testforadjust$ python -u preprocess_features.py test.tsv 
test.tsv
num of peptides 3 pad_pep_len 50
seq_set 4 pad_prot_len 247
num of peptide ss 3 pad_pep_len 50
seq_ss_set 4 pad_prot_len 247
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
Traceback (most recent call last):
  File "preprocess_features.py", line 148, in <module>
    feature = label_seq_ss(pep_ss, pad_pep_len, seq_ss_set)
  File "preprocess_features.py", line 49, in label_seq_ss
    X[i] = res_ind[res]
TypeError: 'set' object has no attribute '__getitem__'

How can I solve this problem? I look forward to your reply.
Best wishes,
LeeLee

@twopin
Owner

twopin commented Dec 14, 2021

It seems that the key 'MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN' is not in the dict? Actually I'm not sure about that, you can use the debug function to check.

@twopin
Owner

twopin commented Dec 14, 2021

#12 (comment): Yes

@lbwfff
Author

lbwfff commented Dec 14, 2021

> It seems that the key 'MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN' is not in the dict? Actually I'm not sure about that, you can use the debug function to check.

So datafile refers to the test_filename in Data curation? The amino acid sequence is something extra that I printed. I guess the error comes from this piece of code:

def label_seq_ss(line, pad_prot_len, res_ind):
	line = line.strip().split(',')
	X = np.zeros(pad_prot_len)
	for i ,res in enumerate(line[:pad_prot_len]):
		X[i] = res_ind[res]
	return X

		if pep_ss not in peptide_ss_feature_dict:
			print(pep_ss)
			print(pad_pep_len)
			print(seq_ss_set)
			feature = label_seq_ss(pep_ss, pad_pep_len, seq_ss_set)
			peptide_ss_feature_dict[pep_ss] = feature

The following is my pep_ss, pad_pep_len and seq_ss_set:

"XC,YC,IE,QC,NC,CC,PC,LC,GC"
50
set(['"MC,VC,DC,RH,EH,QH,LH,VH,QH,KH,AH,RH,LH,AH,EH,QH,AC,EC,RC,YH,DH,DH,MH,AH,AH,AH,MH,KH,NH,VH,TH,EC,LC,NC,EC,PC,LC,SC,NH,EH,EH,RH,NH,LH,LH,SH,VH,AH,YH,KH,NH,VH,VH,GH,AH,RH,RH,SH,SH,WH,RH,VH,IH,SH,SH,IH,EH,QH,KH,TC,SC,AC,DC,GC,NC,EH,KH,KH,IH,EH,MH,VH,RH,AH,YH,RH,EH,KH,IH,EH,KH,EH,LH,EH,AH,VH,CH,QH,DH,VH,LH,SH,LH,LH,DH,NH,YH,LH,IH,KH,NH,CC,SC,EC,TC,QC,YH,EH,SH,KH,VH,FH,YH,LH,KH,MH,KH,GH,DH,YH,YH,RH,YH,LH,AH,EH,VH,AC,TC,GC,EH,KH,RH,AH,TH,VH,VH,EH,SH,SH,EH,KH,AH,YH,SH,EH,AH,HH,EH,IH,SH,KH,EH,HH,MC,QC,PC,TC,HC,PH,IH,RH,LH,GH,LH,AH,LH,NH,YH,SH,VH,FH,YH,YH,EH,IH,QC,NC,AC,PH,EH,QH,AH,CH,HH,LH,AH,KH,TH,AH,FH,DH,DH,AH,IH,AH,EC,LH,DH,TH,LC,NC,EC,DC,SC,YH,KH,DH,SH,TH,LH,IH,MH,QH,LH,LH,RH,DH,NH,LH,TH,LH,WH,TC,SC,DC,QC,QC,DC,DC,DC,GC,GC,EC,GC,NC,NC"', '"MC,GC,DC,RH,EH,QH,LH,LH,QH,RH,AH,RH,LH,AH,EH,QH,AC,EC,RC,YH,DH,DH,MH,AH,SH,AH,MH,KH,AH,VH,TH,EH,LC,NC,EC,PC,LC,SC,NH,EH,DH,RH,NH,LH,LH,SH,VH,AH,YH,KH,NH,VH,VH,GH,AH,RH,RH,SH,SH,WH,RH,VH,IH,SH,SH,IH,EH,QH,KH,TC,MC,AC,DC,GC,NC,EH,KH,KH,LH,EH,KH,VH,KH,AH,YH,RH,EH,KH,IH,EH,KH,EH,LH,EH,TH,VH,CH,NH,DH,VH,LH,SH,LH,LH,DH,KH,FH,LH,IH,KC,NC,CC,NC,DC,FC,QC,YH,EH,SH,KH,VH,FH,YH,LH,KH,MH,KH,GH,DH,YH,YH,RH,YH,LH,AH,EH,VH,AC,SC,GC,EH,KH,KH,NH,SH,VH,VH,EH,AH,SH,EH,AH,AH,YH,KH,EH,AH,FH,EH,IH,SH,KH,EH,QH,MC,QC,PC,TC,HC,PH,IH,RH,LH,GH,LH,AH,LH,NH,FH,SH,VH,FH,YH,YH,EH,IH,QC,NC,AC,PH,EH,QH,AH,CH,LH,LH,AH,KH,QH,AH,FH,DH,DH,AH,IH,AH,EC,LH,DH,TH,LC,NC,EC,DC,SC,YH,KH,DH,SH,TH,LH,IH,MH,QH,LH,LH,RH,DH,NH,LH,TH,LH,WH,TC,SC,DC,QC,QC,DC,EC,EC,AC,GC,EC,GC,NC"', 
'"MC,DC,DC,RH,EH,DH,LH,VH,YH,QH,AH,KH,LH,AH,EH,QH,AC,EC,RC,YH,DH,EH,MH,VH,EH,SH,MH,KH,KH,VH,AH,GC,MC,DC,VC,EC,LC,TC,VH,EH,EH,RH,NH,LH,LH,SH,VH,AH,YH,KH,NH,VH,IH,GH,AH,RH,RH,AH,SH,WH,RH,IH,IH,SH,SH,IH,EH,QH,KH,EH,EC,NC,KC,GC,GC,EH,DH,KH,LH,KH,MH,IH,RH,EH,YH,RH,QH,MH,VH,EH,TH,EH,LH,KH,LH,IH,CH,CH,DH,IH,LH,DH,VH,LH,DH,KH,HH,LH,IH,PH,AH,AC,NC,TC,GH,EH,SH,KH,VH,FH,YH,YH,KH,MH,KH,GH,DH,YH,HH,RH,YH,LH,AH,EH,FH,AC,TC,GC,NH,DH,RH,KH,EH,AH,AH,EH,NH,SH,LH,VH,AH,YH,KH,AH,AH,SH,DH,IH,AH,MH,TH,EH,LC,PC,PC,TC,HC,PH,IH,RH,LH,GH,LH,AH,LH,NH,FH,SH,VH,FH,YH,YH,EH,IH,LC,NC,SC,PH,DH,RH,AH,CH,RH,LH,AH,KH,AH,AH,FH,DH,DH,AH,IH,AH,EC,LH,DH,TH,LC,SC,EC,EC,SC,YH,KH,DH,SH,TH,LH,IH,MH,QH,LH,LH,RH,DH,NH,LH,TH,LH,WH,TC,SC,DC,MC,QC,GC,DC,GC,EC,EC,QH,NC,KC,EH,AH,LH,QH,DC,VC,EC,DC,EC,NC,QC"', '"MC,TC,MC,DC,KH,SH,EH,LH,VH,QH,KH,AH,KH,LH,AH,EH,QH,AC,EC,RC,YH,DH,DH,MH,AH,AH,AH,MH,KH,AH,VH,TH,EH,QC,GC,HC,EC,LC,SC,NH,EH,EH,RH,NH,LH,LH,SH,VH,AH,YH,KH,NH,VH,VH,GH,AH,RH,RH,SH,SH,WH,RH,VH,IH,SH,SH,IH,EH,QH,KH,TC,EC,RC,NC,EC,KH,KH,QH,QH,MH,GH,KH,EH,YH,RH,EH,KH,IH,EH,AH,EH,LH,QH,DH,IH,CH,NH,DH,VH,LH,EH,LH,LH,DH,KH,YH,LH,IH,PH,NH,AC,TC,QC,PH,EH,SH,KH,VH,FH,YH,LH,KH,MH,KH,GH,DH,YH,FH,RH,YH,LH,SH,EH,VC,AC,SC,GC,DH,NH,KH,QH,TH,TH,VH,SH,NH,SH,QH,QH,AH,YH,QH,EH,AH,FH,EH,IH,SH,KH,KH,EH,MC,QC,PC,TC,HC,PH,IH,RH,LH,GH,LH,AH,LH,NH,FH,SH,VH,FH,YH,YH,EH,IH,LC,NC,SC,PH,EH,KH,AH,CH,SH,LH,AH,KH,TH,AH,FH,DH,EH,AH,IH,AH,EC,LH,DH,TH,LC,NC,EC,EC,SC,YH,KH,DH,SH,TH,LH,IH,MH,QH,LH,LH,RH,DH,NH,LH,TH,LH,WH,TC,SC,EC,NC,QC,GC,DC,EC,GC,DC,AC,GC,EC,GC,EC,NC"'])

@twopin
Owner

twopin commented Mar 8, 2023

I think this bug was caused by the variable name seq_ss_set being defined twice. I have just fixed the bug and revised the script.
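
The TypeError in the traceback arises because res_ind was a set, and a set cannot be indexed with square brackets; label_seq_ss needs a mapping from each residue/secondary-structure token (such as "XC") to a number. A minimal sketch of that idea follows. The alphabet, the index values, the name seq_ss_dict and the stripping of the surrounding double quotes are all assumptions for illustration, not the revised CAMP script.

```python
import numpy as np

# Assumed alphabets: 20 amino acids plus X, and three secondary-structure states.
residues = "ACDEFGHIKLMNPQRSTVWYX"
ss_states = "CHE"

# Map every residue+state token to a positive index; 0 is left for padding.
pairs = [r + s for r in residues for s in ss_states]
seq_ss_dict = {pair: idx for idx, pair in enumerate(pairs, start=1)}

def label_seq_ss(line, pad_len, res_ind):
    """Encode a comma-separated token string as a zero-padded index vector."""
    # Drop whitespace and the literal double quotes seen in the printed pep_ss.
    tokens = line.strip().strip('"').split(",")
    X = np.zeros(pad_len)
    for i, token in enumerate(tokens[:pad_len]):
        X[i] = res_ind[token]
    return X

feature = label_seq_ss('"XC,YC,IE,QC,NC,CC,PC,LC,GC"', 50, seq_ss_dict)
print(feature.shape)
```

Keeping the set of observed strings and the token-to-index dict under different names avoids one silently overwriting the other.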

twopin closed this as completed on Mar 8, 2023