Uniprot FASTA database download failed #82

Closed
wardiam opened this issue Aug 11, 2022 · 4 comments

Comments

wardiam commented Aug 11, 2022

Since the latest update of the UniProt interface, it is no longer possible to download FASTA databases as before.

See information here:
https://www.uniprot.org/help/api_queries

Most importantly, the "direct" download only works for proteomes of fewer than 5 million sequences. For larger proteomes (e.g. Firmicutes, taxid = 1239, with 21 million entries) you have to use the new paging system.

Could you please implement the new UniProt FASTA download system in the next release?
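
For reference, a minimal sketch of what following the paging system could look like in R. It assumes, per the linked help page, that each page's response carries a `Link` header containing a `rel="next"` URL until the last page is reached, and that the httr package is installed. `next_page_url()` and `download_paged_fasta()` are hypothetical helper names, not biomartr functions.

```r
# Extract the URL of the next page from an HTTP "Link" header value.
# Returns NA_character_ when there is no rel="next" entry (i.e. last page).
next_page_url <- function(link_header) {
  if (length(link_header) != 1 || is.na(link_header)) {
    return(NA_character_)
  }
  pattern <- '<([^>]+)>;[[:space:]]*rel="next"'
  m <- regmatches(link_header, regexec(pattern, link_header))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Follow the "next" links page by page and append each page to dest_file.
# Requires the httr package; first_url is supplied by the caller.
download_paged_fasta <- function(first_url, dest_file) {
  con <- file(dest_file, open = "w")
  on.exit(close(con))
  url <- first_url
  while (!is.na(url)) {
    resp <- httr::GET(url)
    httr::stop_for_status(resp)
    # Write the page body as-is, without adding an extra newline
    writeLines(httr::content(resp, as = "text", encoding = "UTF-8"),
               con, sep = "")
    url <- next_page_url(httr::headers(resp)[["link"]])
  }
  invisible(dest_file)
}
```

Here `first_url` would be a UniProt search URL requesting FASTA output, built as described on the help page above. Only the header parsing can be checked offline; the download loop needs network access.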

Thank you very much.

Best regards,
Wardiam

Member

HajkD commented Aug 30, 2022

Dear @wardiam

Many thanks for letting me know.

I have put it on my to-do list and will check whether an adaptation to the new retrieval system is maintainable.

Cheers,
Hajk

Member

HajkD commented Sep 29, 2023

Dear @wardiam

I finally found the time to reimplement the download procedure from UniProt.

Would it be possible to confirm that it works for you now (after installing the development version of biomartr)?

biomartr::getProteome(organism = "Homo sapiens", db = "uniprot")
-> Starting proteome retrieval of 'Homo sapiens' from uniprot ...


The proteome of 'Homo sapiens' has been downloaded to '_ncbi_downloads/proteomes' and has been named 'Homo_sapiens_protein_uniprot.faa.gz' .
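
One way to confirm the download beyond the status message is to count the FASTA records in the resulting file. This is a minimal base-R sketch; `count_fasta_records()` is a hypothetical helper, not part of biomartr, and the path in the usage comment is taken from the message above.

```r
# Count FASTA records (header lines starting with ">") in a plain or
# gzip-compressed file. gzfile() reads both transparently.
count_fasta_records <- function(path) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  sum(startsWith(readLines(con, warn = FALSE), ">"))
}

# Usage (assumes the file reported in the message above exists):
# count_fasta_records(file.path("_ncbi_downloads", "proteomes",
#                               "Homo_sapiens_protein_uniprot.faa.gz"))
```

A count of zero, or an error when opening the file, would suggest the retrieval did not complete.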

With many thanks,
Hajk

Member

HajkD commented Sep 29, 2023

Dear @Roleren

Would it be possible to have a look as well? If all is good, I will prepare the release, submit to CRAN, and we can have a chat :)

With many thanks and very best wishes,
Hajk

Member

HajkD commented Oct 2, 2023

Dear All,

I have now fixed all remaining issues and thoroughly tested the new UniProt retrieval functionality.
Everything works smoothly now, and users can bulk-retrieve proteomes from UniProt via:

# download the proteomes of three different species at the same time
#### Database: UniProt
file_paths <- getProteomeSet(db = "uniprot", organisms = c("Homo sapiens", "Mus musculus", "Caenorhabditis elegans"))
# look at file paths
file_paths
Starting proteome retrieval of the following proteomes: Homo sapiens, Mus musculus, Caenorhabditis elegans ...
Generating folder set_proteomes ...




-> Starting proteome retrieval of 'Homo sapiens' from uniprot ...


-> Retrieve UniProt information for organism: Homo sapiens
-> Running download ...                                           
-> Write downloaded *.fasta file to local disk ...
The proteome of 'Homo sapiens' has been downloaded to 'set_proteomes' and has been named 'Homo_sapiens_protein_uniprot.faa.gz' .


-> Starting proteome retrieval of 'Mus musculus' from uniprot ...


-> Retrieve UniProt information for organism: Mus musculus
-> Running download ...                                           
-> Write downloaded *.fasta file to local disk ...
The proteome of 'Mus musculus' has been downloaded to 'set_proteomes' and has been named 'Mus_musculus_protein_uniprot.faa.gz' .


-> Starting proteome retrieval of 'Caenorhabditis elegans' from uniprot ...


-> Retrieve UniProt information for organism: Caenorhabditis elegans
-> Running download ...                                           
-> Write downloaded *.fasta file to local disk ...
The proteome of 'Caenorhabditis elegans' has been downloaded to 'set_proteomes' and has been named 'Caenorhabditis_elegans_protein_uniprot.faa.gz' .


A summary file (which can be used as supplementary information file in publications) containig retrieval information for all species has been stored at 'set_proteomes/documentation/set_proteomes_summary.csv'.

Cleaning file names for more convenient downstream processing ...
Cleaning file names and unzipping files ...
Unzipping file Caenorhabditis_elegans_protein_uniprot.faa.gz' ...
Unzipping file Homo_sapiens_protein_uniprot.faa.gz' ...
Unzipping file Mus_musculus_protein_uniprot.faa.gz' ...
Unzipping file UP000000589_10090.fasta.gz' ...
Unzipping file UP000001940_6239.fasta.gz' ...
Unzipping file UP000005640_9606.fasta.gz' ...
Finished formatting.
> file_paths
[1] "set_proteomes/CaenorhabditisElegans.faa"
[2] "set_proteomes/HomoSapiens.faa"          
[3] "set_proteomes/MusMusculus.faa"  

@HajkD HajkD closed this as completed Oct 2, 2023