Uniprot FASTA database download failed #82

wardiam · 2022-08-11T10:06:34Z

With the latest update of the UniProt interface, it is no longer possible to download FASTA databases as before.

See information here:
https://www.uniprot.org/help/api_queries

Most relevant is that you can only use the "direct" download for proteomes of fewer than 5 million sequences. For larger proteomes (e.g. Firmicutes, taxid = 1239, which has 21 million entries), you have to use the new paging system.
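
For reference, here is a minimal sketch of what following the paging system could look like in R. It assumes, as the help page linked above describes, that each page's response carries an HTTP `Link` header containing a `rel="next"` URL until the last page is reached. The names `next_link`, `fetch_all_pages`, and the `fetch_page` callback are hypothetical and not part of biomartr or any UniProt client; a real `fetch_page` would wrap an HTTP request.

```r
# Extract the URL marked rel="next" from an HTTP Link header value.
# Returns NULL when there is no further page.
next_link <- function(link_header) {
  if (is.null(link_header) || is.na(link_header) || !nzchar(link_header)) {
    return(NULL)
  }
  pattern <- '<([^>]+)>;[[:space:]]*rel="next"'
  m <- regmatches(link_header, regexec(pattern, link_header))[[1]]
  if (length(m) < 2) NULL else m[2]
}

# Follow "next" links until none remain, collecting each page body.
# `fetch_page` is a hypothetical callback: it takes a URL and returns
# list(body = <character>, link = <Link header value or NULL>).
fetch_all_pages <- function(first_url, fetch_page) {
  pages <- character(0)
  url <- first_url
  while (!is.null(url)) {
    res <- fetch_page(url)
    pages <- c(pages, res$body)
    url <- next_link(res$link)
  }
  pages
}

# Offline demonstration with two fake pages instead of real HTTP calls.
fake_pages <- list(
  page1 = list(body = ">sp|A\nMK", link = '<page2>; rel="next"'),
  page2 = list(body = ">sp|B\nMV", link = NULL)
)
fasta <- fetch_all_pages("page1", function(url) fake_pages[[url]])
```

Passing the fetcher in as a callback keeps the paging logic separate from the HTTP layer, so the loop can be checked without network access.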

Could you implement the new UniProt FASTA file download system in the next release, please?

Thank you very much.

Best regards,
Wardiam

HajkD · 2022-08-30T12:25:23Z

Dear @wardiam

Many thanks for letting me know.

I have put it on my to-do list and will check whether an adaptation to the new retrieval system is maintainable.

Cheers,
Hajk

HajkD · 2023-09-29T16:12:43Z

Dear @wardiam

I finally found the time to reimplement the download procedure from UniProt.

Could you confirm that it now works for you (after installing the developer version of biomartr) by running:

biomartr::getProteome(organism = "Homo sapiens", db = "uniprot")

-> Starting proteome retrieval of 'Homo sapiens' from uniprot ...


The proteome of 'Homo sapiens' has been downloaded to '_ncbi_downloads/proteomes' and has been named 'Homo_sapiens_protein_uniprot.faa.gz' .

With many thanks,
Hajk

HajkD · 2023-09-29T17:15:19Z

Dear @Roleren

Would it be possible for you to have a look as well? If all is good, I will prepare the release, submit to CRAN, and we can have a chat :)

With many thanks and very best wishes,
Hajk


HajkD · 2023-10-02T11:21:04Z

Dear All,

I have now fixed all remaining issues and thoroughly tested the new UniProt retrieval functionality.
Everything works smoothly now, and users can bulk-retrieve proteomes from UniProt via:

# download the proteomes of three different species at the same time
# Database: UniProt
library(biomartr)
file_paths <- getProteomeSet(
  db = "uniprot",
  organisms = c("Homo sapiens", "Mus musculus", "Caenorhabditis elegans")
)
# look at file paths
file_paths

Starting proteome retrieval of the following proteomes: Homo sapiens, Mus musculus, Caenorhabditis elegans ...
Generating folder set_proteomes ...




-> Starting proteome retrieval of 'Homo sapiens' from uniprot ...


-> Retrieve UniProt information for organism: Homo sapiens
-> Running download ...                                           
-> Write downloaded *.fasta file to local disk ...
The proteome of 'Homo sapiens' has been downloaded to 'set_proteomes' and has been named 'Homo_sapiens_protein_uniprot.faa.gz' .


-> Starting proteome retrieval of 'Mus musculus' from uniprot ...


-> Retrieve UniProt information for organism: Mus musculus
-> Running download ...                                           
-> Write downloaded *.fasta file to local disk ...
The proteome of 'Mus musculus' has been downloaded to 'set_proteomes' and has been named 'Mus_musculus_protein_uniprot.faa.gz' .


-> Starting proteome retrieval of 'Caenorhabditis elegans' from uniprot ...


-> Retrieve UniProt information for organism: Caenorhabditis elegans
-> Running download ...                                           
-> Write downloaded *.fasta file to local disk ...
The proteome of 'Caenorhabditis elegans' has been downloaded to 'set_proteomes' and has been named 'Caenorhabditis_elegans_protein_uniprot.faa.gz' .


A summary file (which can be used as supplementary information file in publications) containig retrieval information for all species has been stored at 'set_proteomes/documentation/set_proteomes_summary.csv'.

Cleaning file names for more convenient downstream processing ...
Cleaning file names and unzipping files ...
Unzipping file Caenorhabditis_elegans_protein_uniprot.faa.gz' ...
Unzipping file Homo_sapiens_protein_uniprot.faa.gz' ...
Unzipping file Mus_musculus_protein_uniprot.faa.gz' ...
Unzipping file UP000000589_10090.fasta.gz' ...
Unzipping file UP000001940_6239.fasta.gz' ...
Unzipping file UP000005640_9606.fasta.gz' ...
Finished formatting.
> file_paths
[1] "set_proteomes/CaenorhabditisElegans.faa"
[2] "set_proteomes/HomoSapiens.faa"          
[3] "set_proteomes/MusMusculus.faa"
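
The renaming visible in the output above (for example `Homo_sapiens_protein_uniprot.faa.gz` becoming `HomoSapiens.faa`) can be reproduced with a few lines of base R. This is only an illustrative sketch inferred from the printed file names, not biomartr's actual cleaning code; the helper name `clean_name` is hypothetical.

```r
# Hypothetical helper: derive the cleaned file name shown in the output
# from a downloaded file name such as "Homo_sapiens_protein_uniprot.faa.gz".
clean_name <- function(file_name) {
  # Keep the species part before "_protein_", e.g. "Homo_sapiens".
  species <- sub("_protein_.*$", "", basename(file_name))
  parts <- strsplit(species, "_", fixed = TRUE)[[1]]
  # Capitalise the first letter of each part and join them.
  parts <- paste0(toupper(substring(parts, 1, 1)), substring(parts, 2))
  paste0(paste(parts, collapse = ""), ".faa")
}

downloaded <- c(
  "Caenorhabditis_elegans_protein_uniprot.faa.gz",
  "Homo_sapiens_protein_uniprot.faa.gz",
  "Mus_musculus_protein_uniprot.faa.gz"
)
cleaned <- vapply(downloaded, clean_name, character(1), USE.NAMES = FALSE)
```

Such a mapping can be handy when matching the cleaned paths in `file_paths` back to the original download names.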

HajkD pushed a commit that referenced this issue Sep 29, 2023

Fix download link for proteome retrieval from UniProt #82

b94cb71

HajkD pushed a commit that referenced this issue Oct 2, 2023

fixing path issue in getProteome() and getProteomeSet() extend examples

61a1432


HajkD closed this as completed Oct 2, 2023
