
Error using python API for batch SRAweb search #46

Closed
anwarMZ opened this issue Jun 30, 2020 · 19 comments
anwarMZ commented Jun 30, 2020

  • pysradb version: 0.10.4
  • Python version: 3.8.3
  • Operating System: macOS Catalina 10.15.5, using an Anaconda environment and a pip installation of pysradb

Description

I came across pysradb while looking for a way to extract metadata for a batch of SRA runs (~9K). I tried two different approaches, but each failed with a different error, likely because of a missing value on SRAweb. I am not sure how such an error can be ignored so that the batch can move forward.

1st Method

I converted the ~9K SRA run accessions to SRA study IDs using srr_to_srp, and then queried the resulting ~500 unique study accessions against SRAweb:

from pysradb.sraweb import SRAweb

db = SRAweb()
# file.txt has one SRA run accession ID per line.
with open("file.txt") as f:
    lineList = [line.rstrip("\n") for line in f]
srp = db.srr_to_srp(lineList)
unique_srp = srp.study_accession.unique()
studies_list = unique_srp.tolist()
Metadata = db.sra_metadata(studies_list, detailed=True)
Metadata.to_csv("Metadata.tsv", sep="\t", index=False)

Error

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-32-d1fb481fd5e3> in <module>
----> 1 Metadata=db.sra_metadata(studies_list, detailed= True)

~/opt/anaconda3/envs/pysradb/lib/python3.8/site-packages/pysradb/sraweb.py in sra_metadata(self, srp, sample_attribute, detailed, expand_sample_attributes, output_read_lengths, **kwargs)
    457                     # detailed_record[key] = value
    458 
--> 459                 pool_record = record["Pool"]["Member"]
    460                 detailed_record["run_accession"] = run_set["@accession"]
    461                 detailed_record["run_alias"] = run_set["@alias"]

KeyError: 'Pool'
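For context, the KeyError comes from directly indexing a record that lacks the "Pool" key. A minimal illustration of how such access can be guarded with dict.get (illustrative only, not pysradb's actual fix; the record dicts here are made up):

```python
# Illustration of the failure mode: some SRA records lack the "Pool" key,
# so record["Pool"]["Member"] raises KeyError. dict.get with a default
# returns None instead of raising.
record_with_pool = {"Pool": {"Member": {"@accession": "SRS000001"}}}
record_without_pool = {}  # e.g. a study record with no pool information

def pool_member(record):
    """Return the Pool/Member sub-record, or None when it is absent."""
    return record.get("Pool", {}).get("Member")

print(pool_member(record_with_pool))     # {'@accession': 'SRS000001'}
print(pool_member(record_without_pool))  # None
```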

2nd Method

In this case I queried all ~9K SRA run accessions directly against SRAweb:

from pysradb.sraweb import SRAweb

db = SRAweb()
# file.txt has one SRA run accession ID per line.
with open("file.txt") as f:
    lineList = [line.rstrip("\n") for line in f]
Metadata = db.sra_metadata(lineList, detailed=True)
Metadata.to_csv("Metadata.tsv", sep="\t", index=False)

Error

Traceback (most recent call last):
  File "/Users/Zohaib/PycharmProjects/SRA-Metadata/fetchSRAmetadata.py", line 10, in <module>
    Metadata = db.sra_metadata(lineList, detailed=True)
  File "~/opt/anaconda3/envs/pysradb/lib/python3.8/site-packages/pysradb/sraweb.py", line 425, in sra_metadata
    efetch_result = self.get_efetch_response("sra", srp)
  File "~/opt/anaconda3/envs/pysradb/lib/python3.8/site-packages/pysradb/sraweb.py", line 250, in get_efetch_response
    esearch_response = request.json()
  File "~/opt/anaconda3/envs/pysradb/lib/python3.8/site-packages/requests/models.py", line 898, in json
    return complexjson.loads(self.text, **kwargs)
  File "~/opt/anaconda3/envs/pysradb/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "~/opt/anaconda3/envs/pysradb/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "~/opt/anaconda3/envs/pysradb/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Thanks in advance, looking forward to hearing from you.
Zohaib

saketkc commented Jun 30, 2020

Thanks a lot for the bug report @anwarMZ. Would you be able to share an SRP or SRR so that I can reproduce it at my end?

saketkc commented Jun 30, 2020

I just pushed b1fa5d6, which might fix it. Can you try the version on master?

conda create -y -n pysradb_fix  && conda activate pysradb_fix && pip install git+https://github.com/saketkc/pysradb.git

anwarMZ commented Jun 30, 2020

Thanks for the prompt reply; I have attached the file of study accessions here:
SRA_srp.txt

saketkc commented Jun 30, 2020

Thanks for the SRP list. I will update here once I have a proper fix.

saketkc commented Jun 30, 2020

The last fix works. Here is an example with your SRP list: https://colab.research.google.com/drive/1pNeuZJjjHliYFk582kGNRpGJ1Fa2h9cn?usp=sharing

Let me know if you still face any errors. I prefer adding a few seconds of sleep between queries to make sure it doesn't hit NCBI's API limits.

anwarMZ commented Jun 30, 2020

Hi @saketkc, this works well for querying the IDs. However, it creates a separate file for each query, and in my case I would like one combined file for all SRP queries. I am not sure whether the except clause can catch the error if the whole list is passed directly. Any thoughts?

saketkc commented Jun 30, 2020

You should be able to concat the dataframes using pandas:

master_df = pd.concat([df1, df2, df3, ...])

It is possible to query multiple SRPs at once; however, given NCBI's API limits, it might time out if there are many SRRs (hundreds of them, as in this case).
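A minimal sketch of this pattern: loop over the study accessions one at a time, sleep between queries, skip accessions that fail, and concatenate at the end. Here fetch_metadata is a hypothetical stand-in for db.sra_metadata(srp, detailed=True), so the example runs without hitting NCBI:

```python
import time
import pandas as pd

def fetch_metadata(srp):
    """Hypothetical stand-in for db.sra_metadata(srp, detailed=True)."""
    if srp == "SRP_BAD":
        raise KeyError("Pool")  # mimic the missing-tag failure seen above
    return pd.DataFrame({"study_accession": [srp], "run_accession": [srp + "_run1"]})

frames = []
for srp in ["SRP000001", "SRP_BAD", "SRP000002"]:
    try:
        frames.append(fetch_metadata(srp))
    except Exception as exc:      # one bad accession should not abort the batch
        print(f"Skipping {srp}: {exc}")
    time.sleep(0.1)               # use a few seconds in practice for NCBI's limits

master_df = pd.concat(frames, sort=False)
print(len(master_df))  # 2
```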

anwarMZ commented Jun 30, 2020

Sure, I just wanted to confirm that querying hundreds of IDs at once doesn't work with NCBI's API.
Thank you for answering all my queries.
One quick question: for IDs where certain metadata is missing, does pysradb still create the column and leave the cell empty? When concatenating, I need to be sure that two files don't end up with varying columns or column order.

saketkc commented Jun 30, 2020

For IDs where certain metadata is missing, does pysradb still create the column and leave the cell empty? When concatenating, I need to be sure that two files don't end up with varying columns or column order.

That's correct.
The only scenario in which this is not true is when you request detailed metadata via sra_metadata(srp, detailed=True). But you can still concat the dataframes with pd.concat(..., sort=False).
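The column-union behaviour can be seen with plain pandas; df1 and df2 here are made-up stand-ins for per-SRP detailed-metadata frames that came back with different attribute columns:

```python
import pandas as pd

# Two detailed-metadata frames with different attribute columns
df1 = pd.DataFrame({"run_accession": ["SRR1"], "organism": ["Homo sapiens"]})
df2 = pd.DataFrame({"run_accession": ["SRR2"], "strain": ["K-12"]})

combined = pd.concat([df1, df2], sort=False)
print(list(combined.columns))           # ['run_accession', 'organism', 'strain']
print(combined["strain"].isna().sum())  # 1: the missing attribute becomes NaN
```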

Closing this, feel free to reopen if you still encounter issues.

anwarMZ commented Jul 8, 2020

It worked well for me when we last spoke, but as I gradually increase my list to fetch metadata I am facing an issue. When a certain study accession fails to fetch metadata for some reason, it takes a long time to catch the exception and move on to the next one. For example, in the loop we discussed in the Colab notebook, it stalls on the following IDs and takes a significant amount of time to get past them.

I checked and found that, for example, these accession IDs have had issues:
SRP040281
SRP046387
ERP000171

Also, after looking at #47, I updated pysradb to v0.10.5.dev0 after commit 6904315.
Thanks,
Zohaib

saketkc reopened this Jul 8, 2020

saketkc commented Jul 8, 2020

Thanks for reporting @anwarMZ, I will be taking a look at it later tomorrow.

Thanks!
Saket

anwarMZ commented Jul 11, 2020

Hi @saketkc did you get a chance to reproduce the error?

Cheers,
Zohaib

saketkc commented Jul 12, 2020

Sorry about the delay in responding. I am able to obtain results for the first two of these IDs:

  1. SRP040281
  2. SRP046387
    https://colab.research.google.com/drive/1UQpJG32BbjHOf0cV6rxmljf8vhqw22R-?usp=sharing

The problem with the third ID, ERP000171, is a missing organism tag (which ideally should have been Yersinia). I will have a fix for this soon, but this is not really a bug at the pysradb end.

saketkc commented Jul 12, 2020

Also, SRP040281 has 120k+ records, so it takes approximately 7 minutes on Colab to fetch, which I think is reasonable.

anwarMZ commented Jul 12, 2020

Okay, I was trying to get details about the host species, which only come with the detailed flag, e.g. db.sra_metadata(srp, detailed=True). In this case, when I got an error on one of the accession IDs, it just froze for a significant time. But it is good to see that I can now estimate the time for each. Thanks.

saketkc commented Jul 12, 2020

Yes, for a project with a lot of runs, the retrieval time for metadata will increase (though only linearly, as you can see in the last Colab notebook). The detailed mode adds additional overhead; I haven't done any benchmarking, but it should take at least 2x the time of the non-detailed mode.

I have fixed the issue with ERP000171, so I am closing this. Please feel free to reopen this if you face any issues. For projects with a lot of runs, you can expect it to take ~ 0.004 * nrecords seconds if you are on Colab using the non-detailed mode.
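The ~0.004 s/record figure makes for a quick back-of-the-envelope check; estimated_seconds below is purely illustrative arithmetic, not a pysradb function:

```python
# Back-of-the-envelope estimate from the ~0.004 s/record figure
# (non-detailed mode on Colab).
def estimated_seconds(nrecords, per_record=0.004):
    return nrecords * per_record

print(estimated_seconds(120_000))       # 480.0 seconds
print(estimated_seconds(120_000) / 60)  # 8.0 minutes, close to the ~7 min observed
```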

anwarMZ commented Jul 16, 2020

Hi again @saketkc, thank you for the insights; I managed to get this done. I am now trying to download the SRA files for the fetched metadata, using the example mentioned here in the ipynb. I am running this script as a job on a Sun Grid Engine based cluster, and the script ended with this error:

self.retrieve()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 921, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
FileNotFoundError: [Errno 2] No such file or directory: '/projects/NCBI_seqdata/pysradb_downloads/SRP251618/SRX8624823/SRR12100406.sra.part'

With this, the process was killed. Do you have any idea what causes it? I believe it could be because the API timed out and needs a delay between successive downloads. Also, is there a way to skip files that are already downloaded?

Thank you

saketkc commented Jul 16, 2020

The download method first downloads to a temporary location, in this case pysradb_downloads/SRP251618/SRX8624823/SRR12100406.sra.part (notice the .part). Downloads are resumable by default; once a download finishes, the .part extension is removed to mark it complete.

The error you see likely arises because the parallel module gets confused when this particular file has already been downloaded: it thinks it hasn't been, but its download is probably already complete.

You should have SRR12100406.sra. Please feel free to open a new issue otherwise.

Thanks,
Saket

anwarMZ commented Jul 17, 2020

Thanks, I will open a new issue to discuss downloading.
