Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conditional Database Import / Docos #692

Closed
wants to merge 36 commits into from
Closed

Conversation

jlaura
Copy link
Member

@jlaura jlaura commented Sep 10, 2015

Two parter, so feel free to request that I break apart.

  1. Trivial fixes to ESDA documentation.
  2. I added a database driver to the current FileIO module. This uses a conditional import (in core/IOHandlers/init.py) and depends on both sqlalchemy (ships with anaconda) and geomet.

The main reason for the PR is to continue the conversation re: FileIO / Database access / Conditional imports.

To test (conda install gdal - only to get the test data generated, not necessary for the PR):

#Slurp all the pysal example files into a single sqlite database
import glob
import subprocess
psexamples = glob.glob('/Users/jay/github/pysal/pysal/examples/*/*.shp')
for i, s in enumerate(psexamples):
    cmd = 'ogr2ogr -f SQLite -append /Users/jay/Desktop/psexamples.sqlite {}'.format(s)
    subprocess.check_call(cmd.split())

Then

import pysal as ps
db = ps.open('sqlite:///psexamples.db') # /// is a relative sqlite open and //// is an absolute path
db.tables #property to list available tables

# List of tables printed to screen 

virginia_geojson = ps.read('virginia', attributes=True)  #Defaults to attributes=False, or just the 

Some things to discuss (maybe in a hangout?)

  1. FileIO looks like it would be much more streamlined using ABC. Having said that, it works and adding a new driver is trivial once the learning curve is stumbled up.
  2. FileIO docos claim to require read, seek, and next methods, but a DB is not a file object. How can the interface be either generalized or the differences obfuscated (the current hack feels like a hack, so zero love from here)
  3. Conditional import a la @ljwolf suggestion via gitter. Thoughts? It is throwing a proper error on your side if you do not have Geomet?
  4. What about passing an SQL statement to the reader - totally possible - but do we want to support that?
  5. The user is required to know the input data type with respect to identifying sqlite:/// (or postgresql with username/password/port). Do we want to try and abstract that from the user or just accept that this is the way SQLAlchemy manages interfacing with databases?

@ljwolf
Copy link
Member

ljwolf commented Sep 10, 2015

This looks awesome! Cloning now.

What you've put are great starting points for a discussion. That said, I may be a bit more outspoken about FileIO...

  1. This is a great implementation of a db interface leaning on the current mature standard for database interaction in python. Awesome. IMHO, we should extend this approach: FileIO started before a "good" python wrapper around OGR IO stuff. Moving forward, we should take advantage of new IO packages with soft dependencies while providing ways to access old behavior. Our comparative advantage is analytics, not feature-complete python clones of OGR. Using SQLalchemy here is definitely 👍
  2. True... also applies more broadly to streams of GeoJSON/semistructured data. No clue how to do this. Ties to 4 quite strongly.
  3. freaking awesome, exactly how this should work @ module level! Now, how can we get nose to acknowledge this and avoid running tests on disabled modules?
    2015-09-09-230942_817x266_scrot
  4. So, if the connection is opened as conn, then, a conn.read(<SQL>) would just pass the query string and kwargs down to the sqlalchemy engine? Seems fine in theory.
  5. I've seen some solutions to make it simple to connect to databases (and to avoid using plaintext passwords in demo code), but they involved a persistent local store of hashed credentials, and a set of "accessor" functions to connect to a database using the stored creds. So, like pysal.db.add_database('USCensus', 'username', 'password', 'host', 'port', 'protocol')) would store and pysal.db.connect('USCensus') would attach. Not the best, but definitely should be discussed. I have code to do this laying around if a prototype is wanted.

@jlaura
Copy link
Member Author

jlaura commented Sep 10, 2015

For 3, I think that the solution could be as easy as using a unittest.skipif() decorator.
https://docs.python.org/2/library/unittest.html#skipping-tests-and-expected-failures

So perhaps something like:

test_db.py

import unittest
from .. import db

try:
    import sqlalchemy
    import geomet
    missing = False
except:
    logger.debug('some message')
    missing = True

class TestDB(unittest.testcase):

    @unittest.skipif(missing == True)
    def setUp(self):
        #Fire up the db connection

    @unittest.skipif(missing == True)
    def test_listtables(self):
        #Test table listing

Then it looks like nose has support for printing a 'S' for skipped tests. http://nose.readthedocs.org/en/latest/plugins/skip.html

As an aside - checkout some of the other decorators and the ability to roll your own unittest decorators. Not something I have thought about before, but super nice to see that these exist.

For 4:
That would be the idea. I will extend the class to support that. At the same time, I need to explore SQLAlchemy a bit more to see how much metadata I can get off of a mapped table. For example, it would be a nice bit of functionality to print table metadata via a mechanism similar to the .tables attribute.

@ljwolf
Copy link
Member

ljwolf commented Sep 10, 2015

Okay, cool. @TaylorOshan and I were dealing with some of the same issues in skipping tests based on conditions while trying to make moving contrib/spint into core work correctly, with testing and conditional import.

We should write condtional import stuff into the styleguide, so that it's consistent across the different sets of potential soft dependencies.

@jlaura
Copy link
Member Author

jlaura commented Sep 11, 2015

So I played around a bit more with FileIO and how this might look:

  1. If a class base architecture with strict interface enforcement is the desired approach, I think using the ABC module to define an abstract base class and then a dispatcher similar to the current implementation might work. That should be more explict than the existing implementation, but follow the same model.

This maintains the fragility in guessing input type by extension, but enforces a consistent interface. From a user perspective, the development of drivers that are roughly similar to those that already exist is quite straight forward. Additionally, the user can expect that ps.open will succeed in opening the file. Unforuntaley, this approach introduces significant difficulty when the input format does not conform so well to the interface, e.g. the current example of a file based approach interfacing with a database. I am also not 100% sure how a dispatacher might look without a metaclass. ABC enforces structure, but does not help with dispatching.

  1. Looking to pandas, the burden of input type identification is pushed to the user, e.g. pandas.from_json(). This allows the IO code structure to be significantly simpler. Looking at the excel driver, a class based approach is taken. Moving to the sql driver, the driver is organized as a module of functions. This approach support freedom for the developer and freedom in defining the interface, i.e. sql driver can focus on treating the database as a database. Pandas is, in some cases, defining writers using ABC, but this is inconsistent. This works for pandas because all the drivers create the same set of internal data representations (data frame, series).

Some questions that we need to answer:

  1. Where does the burden of driver selection lie? I see this as a balancing act between fragility and user responsability, but perhaps that is not the best approach?
  2. Given the range of drivers, does it make sense to try and strictly enforce a uniform interface? Consider .shp, .gal, and .sqlite as three potential divergent formats.

Looking at FileIO.py, the documentation at the top suggests that all driers will support read, seek, and next. Take a look at the GAL driver - seek is just a place holder since seeking makes little sense and next does not exist. This suggests that the defined interface is not representative of the interface we need. (Or that ABC should be used to throw an error that we do not conform to our own docos).

  1. Am I missing some other essential issue?

@ljwolf
Copy link
Member

ljwolf commented Jan 14, 2016

While we probably want to finally finish this (& weights construction stuff), I think we might benefit from:

  1. Moving the discussion of improving/restructuring of FileIO around plug-and-play drivers to a "release + ∞" issue. I think it requires quite a bit of sub-issuing to make attainable.
  2. Inaugurating an "OPTIONAL" tag for soft dependency introductions that are intended for core.
  3. Merging this PR as "OPT: Database driver in FileIO" OR redirecting this to contrib

Thoughts?

@coveralls
Copy link

coveralls commented Apr 28, 2016

Coverage Status

Changes Unknown when pulling e1b54ee on jlaura:master into * on pysal:master*.

@ljwolf
Copy link
Member

ljwolf commented Jul 16, 2016

@jlaura would it be alright to cherrypick your commits from this & redirect to dev? I've rebased this in ljwolf/pysal/pr692 and in #841

sjsrey added a commit that referenced this pull request Jul 16, 2016
[REBASE & REDIRECT] Conditional Database Imports & Docos, #692
@jlaura
Copy link
Member Author

jlaura commented Jul 17, 2016

👍

On Saturday, July 16, 2016, Levi John Wolf notifications@github.com wrote:

@jlaura https://github.com/jlaura would it be alright to cherrypick
your commits from this & redirect to dev?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#692 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ACAgfMogfa8jHQokVhJVbrtC4h732yljks5qWWEUgaJpZM4F6va2
.

@ljwolf
Copy link
Member

ljwolf commented Jul 17, 2016

closed by #841

@ljwolf ljwolf closed this Jul 17, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants