-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GSoC2020] Basic search feature #43
Conversation
Codecov Report
@@ Coverage Diff @@
## master #43 +/- ##
===========================================
+ Coverage 41.93% 54.74% +12.81%
===========================================
Files 5 7 +2
Lines 1023 1706 +683
===========================================
+ Hits 429 934 +505
- Misses 594 772 +178
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very good, I just have some minor comments so far. Good job!
pysradb/cli.py
Outdated
@@ -34,15 +38,23 @@ def error(self, message): | |||
|
|||
def _print_save_df(df, saveto=None): | |||
if saveto: | |||
df.to_csv(saveto, index=False, header=True, sep="\t") | |||
if saveto.split(".")[-1].strip().lower() == "csv": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could simplify this to
if saveto.split(".")[-1].strip().lower() == "csv": | |
if saveto.lower().endswith(".csv"): |
pysradb/cli.py
Outdated
to_print_split = to_print.split(os.linesep) | ||
to_print = [] | ||
# Header formatting seems off when it is added via to_string() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this problem might be difficult to understand and check if this is still a problem in the future. If you have a case that doesn't work it might be worth writing a small unit test for _print_save_df
.
pysradb/search.py
Outdated
r = requests_3_retries().get( | ||
"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi", | ||
params=OrderedDict(payload), | ||
timeout=20, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move these chosen defaults as uppercase "constants" just after the import ?
This could for instance be SEARCH_REQUEST_TIMEOUT
.
pysradb/search.py
Outdated
try: | ||
r = requests_3_retries().get( | ||
"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi", | ||
params=OrderedDict(payload), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think OrderedDict
adds anything here (dictionaries have stable order since python 3.6), and since _format_request
just uses dictionary literals all sorting would be lost there anyway. Also it doesn't look like parameters need to be sorted in the entrez API.
pysradb/search.py
Outdated
r.raise_for_status() | ||
uids = r.json()["esearchresult"]["idlist"] | ||
|
||
# Step 2: retrieves the detailed information for each uid returned, in groups of 500. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is 300 now, right ?
pysradb/search.py
Outdated
) | ||
return # If no queries found, return nothing | ||
|
||
for i in range(0, len(uids), 300): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make 300
a named variable, like PAGINATION_SIZE
or GROUP_SIZE
?
pysradb/search.py
Outdated
r = requests_3_retries().get( | ||
"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi", | ||
params=OrderedDict(payload2), | ||
timeout=20, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use SEARCH_REQUEST_TIMEOUT
here as well.
pysradb/search.py
Outdated
r.raise_for_status() | ||
self._format_result(r.content) | ||
except requests.exceptions.Timeout: | ||
print(f"Connection to the server has timed out. Please retry.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a design issue that I guess we should discuss (ping @saketkc ), but personally I think Timeout and HTTPError exceptions should not be caught. Say you put this in a script, and the script finishes with exit code 0, then you might think: "Alright, no results for this query", when actually the query failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @mvdbeek. This is currently failing silently. A script using this will also fail silently.
pysradb/utils.py
Outdated
try: | ||
r.raise_for_status() | ||
except HTTPError: | ||
raise IncorrectFieldException(f"Unknown scientific name: {name}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you be more precise here ? HTTPError
s could have many different cases. Is there a given http status code or even a message that indicates Unknown scientific name
?
Hi @bscrow, would you be able to create a new PR (from a new branch) that is similar to this PR but without any writeup sections? Planning to merge it in the coming week. Thanks! |
No problems! I've created the new PR: #57 |
Implemented the search feature for phase 1 of GSoC