[GSoC2020] Basic search feature #43

bscrow · 2020-06-25T06:43:24Z

Implemented the search feature for phase 1 of GSoC

codecov · 2020-06-25T07:26:29Z

Codecov Report

Merging #43 into master will increase coverage by 12.81%.
The diff coverage is 71.84%.

@@             Coverage Diff             @@
##           master      #43       +/-   ##
===========================================
+ Coverage   41.93%   54.74%   +12.81%     
===========================================
  Files           5        7        +2     
  Lines        1023     1706      +683     
===========================================
+ Hits          429      934      +505     
- Misses        594      772      +178

Impacted Files          Coverage Δ
pysradb/cli.py          0.00% <0.00%> (ø)
pysradb/download.py     20.25% <12.50%> (-4.75%) ⬇️
pysradb/search.py       81.29% <81.29%> (ø)
pysradb/exceptions.py   100.00% <100.00%> (ø)
pysradb/sraweb.py       83.25% <0.00%> (-1.33%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update da62643...25f4c13.

mvdbeek

This looks very good, I just have some minor comments so far. Good job!

mvdbeek · 2020-06-30T08:15:46Z

pysradb/cli.py

@@ -34,15 +38,23 @@ def error(self, message):

 def _print_save_df(df, saveto=None):
    if saveto:
-        df.to_csv(saveto, index=False, header=True, sep="\t")
+        if saveto.split(".")[-1].strip().lower() == "csv":


You could simplify this to

Suggested change

-        if saveto.split(".")[-1].strip().lower() == "csv":
+        if saveto.lower().endswith(".csv"):
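
For reference, a minimal sketch comparing the two checks. The helper names `is_csv_split` and `is_csv_endswith` are hypothetical and exist only for this comparison. The checks agree on ordinary file names but differ on a bare `csv` name and on trailing whitespace, which may or may not matter for `saveto`.

```python
def is_csv_split(saveto):
    """Original check: compare the text after the last dot."""
    return saveto.split(".")[-1].strip().lower() == "csv"


def is_csv_endswith(saveto):
    """Suggested check: look at the literal suffix."""
    return saveto.lower().endswith(".csv")


# Print both results side by side for a few illustrative names.
for name in ["results.csv", "RESULTS.CSV", "results.tsv", "csv", "results.csv "]:
    print(repr(name), is_csv_split(name), is_csv_endswith(name))
```

The last two names are the only ones where the results differ: the split-based check accepts them, the suffix check does not.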

mvdbeek · 2020-06-30T08:22:52Z

pysradb/cli.py

            to_print_split = to_print.split(os.linesep)
-            to_print = []
+            # Header formatting seems off when it is added via to_string()


I think this problem might be difficult to understand, and it will be hard to check whether it is still a problem in the future. If you have a case that doesn't work, it might be worth writing a small unit test for _print_save_df.

mvdbeek · 2020-06-30T08:26:37Z

pysradb/search.py

+            r = requests_3_retries().get(
+                "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
+                params=OrderedDict(payload),
+                timeout=20,


Can you move these chosen defaults into uppercase "constants" just after the imports?
This could for instance be SEARCH_REQUEST_TIMEOUT.
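
A rough sketch of what this could look like. Only `SEARCH_REQUEST_TIMEOUT` comes from the comment above; `ESEARCH_URL`, `EFETCH_URL`, and the `request_kwargs` helper are hypothetical names used for illustration, and the payload values are made up.

```python
# Module-level defaults, placed just after the imports.
# ESEARCH_URL and EFETCH_URL are hypothetical names; the URLs are from the diff.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
SEARCH_REQUEST_TIMEOUT = 20  # seconds, the value currently hard-coded in the diff


def request_kwargs(payload):
    """Hypothetical helper: keyword arguments shared by both GET requests."""
    return {"params": payload, "timeout": SEARCH_REQUEST_TIMEOUT}


# Illustrative payload only.
print(request_kwargs({"db": "sra", "retmode": "json"}))
```

Both calls in the diff could then be written along the lines of `requests_3_retries().get(ESEARCH_URL, **request_kwargs(payload))`, so the timeout is changed in one place.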

mvdbeek · 2020-06-30T08:29:13Z

pysradb/search.py

+        try:
+            r = requests_3_retries().get(
+                "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
+                params=OrderedDict(payload),


I don't think OrderedDict adds anything here (dictionaries preserve insertion order in CPython 3.6, and as a language guarantee since Python 3.7), and since _format_request just uses dictionary literals, any sorting would be lost there anyway. It also doesn't look like the parameters need to be sorted for the Entrez API.
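
A quick way to check this, as a sketch: `urllib.parse.urlencode` is used here as a stand-in for the query-string encoding that requests performs, and the payload values are illustrative rather than taken from the PR.

```python
from collections import OrderedDict
from urllib.parse import urlencode

# Illustrative payload; the real one is built by _format_request.
payload = {"db": "sra", "term": "ribosome profiling", "retmode": "json"}

plain = urlencode(payload)                  # plain dict, insertion order kept
ordered = urlencode(OrderedDict(payload))   # same keys wrapped in OrderedDict

print(plain)
print(plain == ordered)  # → True
```

Wrapping an already-built dict in `OrderedDict` keeps whatever order the dict had, so the encoded query string is identical.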

mvdbeek · 2020-06-30T08:31:24Z

pysradb/search.py

+            r.raise_for_status()
+            uids = r.json()["esearchresult"]["idlist"]
+
+            # Step 2: retrieves the detailed information for each uid returned, in groups of 500.


This is 300 now, right ?

mvdbeek · 2020-06-30T08:31:49Z

pysradb/search.py

+                )
+                return  # If no queries found, return nothing
+
+            for i in range(0, len(uids), 300):


Can you make 300 a named variable, like PAGINATION_SIZE or GROUP_SIZE?
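
As a sketch of the idea, assuming the name `PAGINATION_SIZE` is chosen: the `paginate` helper below is hypothetical and only shows the batching; the actual loop in search.py issues an efetch request per batch.

```python
PAGINATION_SIZE = 300  # number of uids requested per efetch call


def paginate(uids, size=PAGINATION_SIZE):
    """Yield consecutive slices of at most `size` uids."""
    for start in range(0, len(uids), size):
        yield uids[start:start + size]


# 650 fake uids split into batches of 300, 300 and 50.
uids = [str(n) for n in range(650)]
print([len(batch) for batch in paginate(uids)])  # → [300, 300, 50]
```

With a named constant, the "in groups of 500" comment and the `range(..., 300)` call can no longer drift apart.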

mvdbeek · 2020-06-30T08:32:51Z

pysradb/search.py

+                r = requests_3_retries().get(
+                    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
+                    params=OrderedDict(payload2),
+                    timeout=20,


You could use SEARCH_REQUEST_TIMEOUT here as well.

mvdbeek · 2020-06-30T08:34:37Z

pysradb/search.py

+                r.raise_for_status()
+                self._format_result(r.content)
+        except requests.exceptions.Timeout:
+            print(f"Connection to the server has timed out. Please retry.")


This is a design issue that I guess we should discuss (ping @saketkc ), but personally I think Timeout and HTTPError exceptions should not be caught. Say you put this in a script, and the script finishes with exit code 0, then you might think: "Alright, no results for this query", when actually the query failed.

saketkc · 2020-07-02

I agree with @mvdbeek. This currently fails silently, so a script using it will also fail silently.
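
To make the concern concrete, here is a small self-contained sketch. `SearchTimeout` is a hypothetical stand-in for `requests.exceptions.Timeout`, and both `search_*` functions are simplified illustrations, not the code in this PR.

```python
class SearchTimeout(Exception):
    """Hypothetical stand-in for requests.exceptions.Timeout."""


def failing_request():
    """Simulate a request that always times out."""
    raise SearchTimeout("simulated timeout")


def search_swallowing():
    """Pattern from the diff: print a message and implicitly return None."""
    try:
        return failing_request()
    except SearchTimeout:
        print("Connection to the server has timed out. Please retry.")


def search_propagating():
    """Alternative: let the error reach the caller."""
    return failing_request()


# The caller gets None, which looks the same as "no results".
print(search_swallowing())

# Here the caller can tell that the query failed.
try:
    search_propagating()
except SearchTimeout as exc:
    print(f"caller sees the failure: {exc}")
```

In a script, the second variant would end with a traceback and a non-zero exit code unless the caller handles the exception explicitly.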

mvdbeek · 2020-06-30T08:38:58Z

pysradb/utils.py

+    try:
+        r.raise_for_status()
+    except HTTPError:
+        raise IncorrectFieldException(f"Unknown scientific name: {name}")


Could you be more precise here? An HTTPError can have many different causes. Is there a specific HTTP status code, or even a message, that indicates an unknown scientific name?
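
One possible shape for this, as a sketch under loud assumptions: the `HTTPError` class below is a stand-in that carries only a status code (with requests, the code would be read from `exc.response.status_code`), and the 404 in `UNKNOWN_NAME_STATUS` is a guess. Which status or message the service really returns for an unknown name would need to be checked.

```python
class HTTPError(Exception):
    """Stand-in for requests.exceptions.HTTPError, carrying only a status code."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


class IncorrectFieldException(Exception):
    """Same name as the exception raised in the diff."""


UNKNOWN_NAME_STATUS = 404  # assumption: the real service may signal this differently


def translate_http_error(exc, name):
    """Map only the expected status to IncorrectFieldException; re-raise the rest."""
    if exc.status_code == UNKNOWN_NAME_STATUS:
        raise IncorrectFieldException(f"Unknown scientific name: {name}") from exc
    raise exc


for status in (404, 503):
    try:
        translate_http_error(HTTPError(status), "Homo sapiens")
    except IncorrectFieldException as err:
        print(f"{status}: {err}")
    except HTTPError as err:
        print(f"{status}: re-raised {err}")
```

This way a server outage or rate limit is not reported to the user as a misspelled scientific name.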

saketkc · 2020-09-06T19:13:08Z

Hi @bscrow, would you be able to create a new PR (from a new branch) that is similar to this PR but without any writeup sections?
I have reviewed it and it looks good so far, I will fix the small changes at my end.

Planning to merge it in the coming week. Thanks!

bscrow · 2020-09-07T00:35:22Z

No problems! I've created the new PR: #57

saketkc · 2020-09-07T01:20:01Z

Awesome, thanks a lot @bscrow! Closing in favor of #57

bscrow added 7 commits June 12, 2020 04:43

Add search for sra and ena

7ab714d

fix bug in parsing arguments for search

ee4f42f

Fix bug in search arguments

5baaa22

Testing

8e38a98

Add weekly writeup

7f09c87

Update writeup

732dee6

Finish implementing search feature and tests

e443b12

saketkc self-requested a review June 25, 2020 07:09

saketkc self-assigned this Jun 25, 2020

bscrow added 2 commits June 25, 2020 15:13

Add week 3 writeup

e870cc8

merge upstream repo

93fb561

bscrow and others added 6 commits June 25, 2020 22:34

Add notebook and minor debug in search

1eac005

Add week 4 writeup

eb63384

black formatting

3e48947

More formatting changes

d68da8e

Add retries for requests

9686c30

Simplify QuerySearch input

6bb7c15

mvdbeek reviewed Jun 30, 2020

saketkc and others added 11 commits July 2, 2020 00:01

Formatting

8140afa

Fixes for piping to download | See saketkc#44

cffdd78

Merge branch 'master' into basic-search-feature

61f06ac

Merge branch 'master' into basic-search-feature

b068dd3

Merge branch 'master' into basic-search-feature

81af640

Debug SRA XML parsing

539917d

refactor search feature and add validation to fields

89319c5

format import statement

2e69b63

Implement GeoSearch and sra_geo

d52b47c

Debugging GeoSearch

3ffafc4

Merge branch 'basic-search-feature' of https://github.com/bscrow/pysradb into basic-search-feature

069768a

saketkc changed the title ~~Basic search feature~~ [GSoC2020] Basic search feature Aug 16, 2020

bscrow and others added 24 commits August 17, 2020 19:58

Resolve issue 44; Debug print to console

f937074

Merge branch 'issue-44' into basic-search-feature

4749ec3

Merge upstream master; Fix print to console

8452b05

Fix printing and reading dataframes from console

86f11bd

Update get_file_size to process connection timeouts

edb7a22

Handle get_file_size request Errors

8dc7952

Add compatibility between piping and aspera downloads

cf98c40

Fix graphs; Fix saketkc#50

3330d0a

Merge graphs-and-stats

4e831fc

Update codestyle

bc6697d

Add pmid to sra queries with verbosity >= 2

940617e

Fix library layout for sra queries

b92312d

Debug library layout and fasp

b0b4df3

Add self explanatory flags for verbosity

5cb315b

Remove legacy import

3cef541

Debug --detailed flag

ad95c3f

Update warning messages

76ba71e

Update writeup

43fc146

Update style based on new Black release

d067251

Merge branch 'master' into basic-search-feature

ed9bb93

Merge branch 'master' into basic-search-feature

01f1526

Format filesize

38a2c62

Merge branch 'master' into basic-search-feature

d908bd3

Merge branch 'master' into basic-search-feature

25f4c13

bscrow mentioned this pull request Sep 7, 2020

Search feature only #57

Merged

saketkc closed this Sep 7, 2020
