Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add index creation function and option for store_bikedata #3

Merged
merged 4 commits into from
Mar 18, 2017

Conversation

richardellison
Copy link
Contributor

This PR adds a function to create indexes on the database as well as adds an option to store_bikedata to automatically create some relevant indexes on the trips table to speed up (some) queries.

@codecov-io
Copy link

codecov-io commented Mar 16, 2017

Codecov Report

Merging #3 into master will decrease coverage by 0.25%.
The diff coverage is 87.5%.

@@            Coverage Diff             @@
##           master       #3      +/-   ##
==========================================
- Coverage   89.82%   89.56%   -0.26%     
==========================================
  Files           4        4              
  Lines         334      364      +30     
==========================================
+ Hits          300      326      +26     
- Misses         34       38       +4
Impacted Files Coverage Δ
src/spatialitedb.cpp 90.38% <84.21%> (-0.92%)
R/dl_bikedata.R 93.1% <92.3%> (-0.23%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 69be518...a86ca0b. Read the comment docs.

@mpadge
Copy link
Member

mpadge commented Mar 16, 2017

you've got a brace too many in your createDBindexes function (L#322)

@mpadge
Copy link
Member

mpadge commented Mar 16, 2017

I should also have said that creating indices is important and necessary, but I wasn't sure how they should best be created. One question in regard to this PR (because I really don't know the answer): If database extraction of times and dates relies entirely on SQL date(start_time)-type operations, does indexing the raw start_time and end_time columns actually help there, since they are never referenced in their raw states?

@richardellison
Copy link
Contributor Author

The brace on line 322 seems to match the opening brace in the if statement on line 309, or am I missing something?

You're correct that if you use the time() and date() functions (as is used in tripmat), then the indexes on the raw columns are not used. However, would somebody not want to specify both date and time? In any case, and this is more related to #4, we can add additional indexes that are used when we cast start_time and stop_time to date or time if we use cast(start_time as time) rather than time(start_time).

@richardellison
Copy link
Contributor Author

The latest commit seems to fail on Travis for some reason. I expect this is because Travis uses an older version of SQLite.

@richardellison
Copy link
Contributor Author

Seems indexing expressions was added in version 3.9.0 and 3.6.x is being installed. Will see if I can find a newer version to use.

@richardellison
Copy link
Contributor Author

It seems 3.9.x isn't available for Ubuntu before Xenial, which doesn't seem available for Travis. I could either remove the default creation of indexes using cast, or add a check to see if the SQLite > 3.9.0 is available and only add those indices if it is.

Thoughts?

@mpadge
Copy link
Member

mpadge commented Mar 17, 2017

this extends beyond my realms of expertise, so ill-informed insights warning here, but the short answer would be to implement the most robust and general solution. I really don't understand the difference between date(start_time) and cast(start_time as date), but

  1. travis clearly don't have any plans to implement xenial for some time, and
  2. the less restriction built into travis.yml the better (particularly for ropensci purposes).

Ergo: Revert cast back to direct date and time. Right?

Part of why i hadn't implemented indices was also because i was trying to get my head around exactly how SQL compares date-time classes - time comparison can't be done without extraction (time (start_time)) or cast-ing. Date comparison can of course, but it seems better (to me) to access both in a uniform manner. But then what's the purpose of indexing? ... that's about where my thoughts and further development in that regard stopped.

And re: the brace - i was interpreting the git flag on the last line to suggest a mis-match, but yeah, it does actually seem okay.

@richardellison
Copy link
Contributor Author

When I referred to adding a check for SQLite > 3.9.x I was referring to adding a check in the function itself, rather than in travis.yml so that users who have a newer version can take advantage of the indexes and

In any case, cast(start_time as time) and time(start_time), as well as the equivalent for date(), are not equivalent. The time() and date() functions take the raw datetime format and formats it as a string in the format HH:MM:SS. cast(start_time as time) converts the original datetime to a time value. Although in this case they're likely to give the same result, casting to time will likely be more efficient. I think cast is probably the better option here regardless because we don't really want to do too many string comparisons, even with an index.

I think using indexes is important because we don't want to have to scan the whole table every time, when we may just be selecting a short date range. Since newer versions of SQLite support index on fields that have been cast to another datatype, I think it would be sensible to use them if they are available. However, obviously that is problematic for users of older versions which is why I suggested that those indexes are only added if that feature is available in the SQLite version being used. I also think we need to consider if users may want to run their own (custom) queries on the data, rather than just use tripmat and other defined functions whether that is through dplyr or RSQLite directly, at which point the indexes become useful.

I'm going to try to make some changes to check for the SQLite version and then only create an index on the cast date/time values if it is supported. I expect that would be acceptable to ropensci.

@mpadge
Copy link
Member

mpadge commented Mar 18, 2017

That all sounds great Richard. That's the main challenge as far as I see it with this package - juggling the need to anticipate likely usage and to make that as easy as possible, while also providing ready ability to construct more complex queries that we can't necessarily anticipate. In case you haven't seen it, this blog post generated more widespread media attention that any analysis of bike hire systems I've seen to date, and nicely illustrates the kinds of analyses we ought to consider.

I'll merge this now to give you a fresh start - sorry for my changes to description, and thanks for incorporating that.

@mpadge mpadge merged commit e96aaf9 into ropensci:master Mar 18, 2017
mpadge added a commit that referenced this pull request May 8, 2017
mpadge added a commit that referenced this pull request May 9, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants