AbLSTM.py returns scores in different order #4

Closed
prihoda opened this issue Dec 17, 2020 · 4 comments

Comments

@prihoda

prihoda commented Dec 17, 2020

Hi all, I noticed a very important issue - the ablstm.py script returns scores in a different order than the order of the input sequences.

I tried processing a diverse set of sequences in one file (human, humanized and murine therapeutic sequences) and got scores that were not consistent with your published distributions:

image

At first I thought it was an overfitting issue, but then I found that I got a different result when processing just the first few sequences. When I processed the sequences one by one, the scores fell into the expected ranges:

image

@xf3227
Collaborator

xf3227 commented Dec 17, 2020

Hi prihoda. Thank you for the comment! I have encountered similar issues caused by inconsistent random number generation across different environments. Since we were also processing the sequences one by one during the testing stage, we failed to notice this bug. I will try to fix it and get back to you soon.

@xf3227
Collaborator

xf3227 commented Dec 17, 2020

In the eval() function, I accidentally made the dataloader shuffle the sequences. Thank you for pointing this out. I would also greatly appreciate it if you could help us test the code again to see whether the issue has been resolved.
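To illustrate why shuffling at evaluation time scrambles the output, here is a minimal, self-contained sketch. It is not the code from ablstm.py: `toy_score`, `score_all`, and `mismatches` are hypothetical names, and the "model" is a trivial stand-in. The check mirrors the comparison described above, scoring the whole file versus scoring each sequence on its own.

```python
import random


def toy_score(seq):
    # Hypothetical stand-in for the model: fraction of gap characters.
    return seq.count("-") / len(seq)


def score_all(sequences, shuffle=False, seed=0):
    """Score every sequence, optionally visiting them in shuffled order.

    With shuffle=True the returned list follows the shuffled order,
    so it no longer lines up with the input sequences.
    """
    order = list(range(len(sequences)))
    if shuffle:
        random.Random(seed).shuffle(order)
    return [toy_score(sequences[i]) for i in order]


def mismatches(sequences, batch_scores, tol=1e-9):
    """Return indices where a batch score differs from scoring that sequence alone."""
    return [
        i
        for i, (seq, score) in enumerate(zip(sequences, batch_scores))
        if abs(toy_score(seq) - score) > tol
    ]


seqs = ["AAAA", "A-AA", "A--A", "---A"]
print(mismatches(seqs, score_all(seqs)))                # ordered run: no mismatches
print(mismatches(seqs, score_all(seqs, shuffle=True)))  # shuffled run: usually some
```

If the loader is PyTorch's `torch.utils.data.DataLoader` (an assumption; the thread does not say which loader is used), the corresponding setting is `shuffle=False` for evaluation.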

@prihoda
Author

prihoda commented Dec 17, 2020

Hi @xf3227, thanks for the quick fix. I am now getting the same result when running one by one as when running the whole file 👍 You can close this issue.

Btw, a side note: in terms of usability, I think users might find it useful to have some instructions on producing the AHo-aligned input files. You could even include a script, since it takes a few steps (running ANARCI to produce an aligned CSV and then converting that CSV to txt while making sure that the same positions as in your input files are present).

ANARCI will only include positions that exist within your processed set of sequences, so here's what I got from the ANARCI CSV on my set of sequences:

QVQLKES-GPGLVAPSQSLSITCTVSG-FSVTN-----YGVHWVRQPPGKGLEWLGVIWA----GGITNYNSAFMSRLSISKDNSKSQVFLKMNSLQIDDTAMYYCASRGGHY-------------------GYALDYWGQGTSVTVSS

I then needed to insert the gaps at the correct positions:

-QVQLKES-GPGLVAPSQSLSITCTVSG-FSVTN-----YGVHWVRQPPGKGLEWLGVIWA----GGITNYNSAFMSRLSISKDNSKSQVFLKMNSLQIDDTAMYYCASRGGHY-------------------GYALDYWGQGTSVTVSS
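The gap-insertion step could be scripted roughly as follows. This is a sketch under assumptions, not a tested ANARCI workflow: `fill_missing_positions` is a hypothetical helper, and it assumes you already have the list of position labels that appear as columns in the ANARCI CSV plus the full list of positions the model's input files use. The integer labels in the example are placeholders, not real AHo numbers.

```python
def fill_missing_positions(aligned_seq, present_positions, required_positions, gap="-"):
    """Re-lay an aligned sequence onto a fixed list of positions.

    aligned_seq        -- one character per entry in present_positions
    present_positions  -- position labels actually present in the alignment
    required_positions -- every position label the target format expects, in order
    Positions missing from the alignment are filled with the gap character.
    """
    if len(aligned_seq) != len(present_positions):
        raise ValueError("sequence length does not match number of positions")

    residue_at = dict(zip(present_positions, aligned_seq))

    # Refuse to silently drop a real residue that the target format has no slot for.
    required = set(required_positions)
    lost = [p for p in present_positions if p not in required and residue_at[p] != gap]
    if lost:
        raise ValueError(f"residues at positions {lost} have no slot in the target format")

    return "".join(residue_at.get(p, gap) for p in required_positions)


# Placeholder labels: the alignment covers positions 2-4, the target expects 1-5.
print(fill_missing_positions("QVQ", [2, 3, 4], [1, 2, 3, 4, 5]))  # -QVQ-
```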

@xf3227
Collaborator

xf3227 commented Dec 18, 2020

Hi @prihoda, thank you for locating this bug. I just closed this thread.

As for sequence alignment, sorry, I was not the one handling that part, nor am I experienced with sequence alignment tools. Two possible solutions could be:

  1. Simply remove gaps from all sequences. The model can run in two modes, one of which handles unaligned sequences, although performance may be somewhat poorer.

  2. Create your own training dataset, aligned in whatever format you choose.
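For the first option, removing gaps is a one-line string operation. A minimal sketch, where `strip_gaps` is a hypothetical helper name and treating both `-` and `.` as gap characters is an assumption about the input files:

```python
def strip_gaps(seq, gap_chars="-."):
    """Return the sequence with all gap characters removed."""
    return "".join(ch for ch in seq if ch not in gap_chars)


print(strip_gaps("-QVQLKES-GPG"))  # QVQLKESGPG
```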

Of course, thank you for bringing this up! I hope this repo helps with your research and projects!
