Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GSoC2020] Basic search feature #43

Closed
wants to merge 63 commits into from

Conversation

bscrow
Copy link
Collaborator

@bscrow bscrow commented Jun 25, 2020

Implemented the search feature for phase 1 of GSoC

@saketkc saketkc self-requested a review June 25, 2020 07:09
@saketkc saketkc self-assigned this Jun 25, 2020
@codecov
Copy link

codecov bot commented Jun 25, 2020

Codecov Report

Merging #43 into master will increase coverage by 12.81%.
The diff coverage is 71.84%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master      #43       +/-   ##
===========================================
+ Coverage   41.93%   54.74%   +12.81%     
===========================================
  Files           5        7        +2     
  Lines        1023     1706      +683     
===========================================
+ Hits          429      934      +505     
- Misses        594      772      +178     
Impacted Files Coverage Δ
pysradb/cli.py 0.00% <0.00%> (ø)
pysradb/download.py 20.25% <12.50%> (-4.75%) ⬇️
pysradb/search.py 81.29% <81.29%> (ø)
pysradb/exceptions.py 100.00% <100.00%> (ø)
pysradb/sraweb.py 83.25% <0.00%> (-1.33%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da62643...25f4c13. Read the comment docs.

Copy link
Collaborator

@mvdbeek mvdbeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very good, I just have some minor comments so far. Good job!

pysradb/cli.py Outdated
@@ -34,15 +38,23 @@ def error(self, message):

def _print_save_df(df, saveto=None):
if saveto:
df.to_csv(saveto, index=False, header=True, sep="\t")
if saveto.split(".")[-1].strip().lower() == "csv":
Copy link
Collaborator

@mvdbeek mvdbeek Jun 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could simplify this to

Suggested change
if saveto.split(".")[-1].strip().lower() == "csv":
if saveto.lower().endswith(".csv"):

pysradb/cli.py Outdated
to_print_split = to_print.split(os.linesep)
to_print = []
# Header formatting seems off when it is added via to_string()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this problem might be difficult to understand and check if this is still a problem in the future. If you have a case that doesn't work it might be worth writing a small unit test for _print_save_df.

r = requests_3_retries().get(
"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
params=OrderedDict(payload),
timeout=20,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move these chosen defaults as uppercase "constants" just after the import ?
This could for instance be SEARCH_REQUEST_TIMEOUT.

try:
r = requests_3_retries().get(
"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
params=OrderedDict(payload),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think OrderedDict adds anything here (dictionaries have stable order since python 3.6), and since _format_request just uses dictionary literals all sorting would be lost there anyway. Also it doesn't look like parameters need to be sorted in the entrez API.

r.raise_for_status()
uids = r.json()["esearchresult"]["idlist"]

# Step 2: retrieves the detailed information for each uid returned, in groups of 500.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is 300 now, right ?

)
return # If no queries found, return nothing

for i in range(0, len(uids), 300):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make 300 a named variable, like PAGINATION_SIZE or GROUP_SIZE ?

r = requests_3_retries().get(
"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
params=OrderedDict(payload2),
timeout=20,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use SEARCH_REQUEST_TIMEOUT here as well.

r.raise_for_status()
self._format_result(r.content)
except requests.exceptions.Timeout:
print(f"Connection to the server has timed out. Please retry.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a design issue that I guess we should discuss (ping @saketkc ), but personally I think Timeout and HTTPError exceptions should not be caught. Say you put this in a script, and the script finishes with exit code 0, then you might think: "Alright, no results for this query", when actually the query failed.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @mvdbeek. This is currently failing silently. A script using this will also fail silently.

pysradb/utils.py Outdated
try:
r.raise_for_status()
except HTTPError:
raise IncorrectFieldException(f"Unknown scientific name: {name}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you be more precise here ? HTTPErrors could have many different cases. Is there a given http status code or even a message that indicates Unknown scientific name ?

@saketkc saketkc changed the title Basic search feature [GSoC2020] Basic search feature Aug 16, 2020
@saketkc
Copy link
Owner

saketkc commented Sep 6, 2020

Hi @bscrow, would you be able to create a new PR (from a new branch) that is similar to this PR but without any writeup sections?
I have reviewed it and it looks good so far, I will fix the small changes at my end.

Planning to merge it in the coming week. Thanks!

@bscrow bscrow mentioned this pull request Sep 7, 2020
@bscrow
Copy link
Collaborator Author

bscrow commented Sep 7, 2020

No problems! I've created the new PR: #57

@saketkc
Copy link
Owner

saketkc commented Sep 7, 2020

Awesome, thanks a lot @bscrow! Closing in favor of #57

@saketkc saketkc closed this Sep 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants