
ENH: sql support #4163

Closed
5 of 20 tasks
hayd opened this issue Jul 8, 2013 · 176 comments
Labels
Enhancement IO Data IO issues that don't fit into a more specific label IO SQL to_sql, read_sql, read_sql_query
Comments

@hayd
Contributor

hayd commented Jul 8, 2013

UPDATE: this is continued in #6292 after the refactor based on sqlalchemy.


This is a central issue to track sql support (migrated from the mysql support ticket - #2482 and #1662).

I think we are agreed we will be moving to use SQLAlchemy, in the hope of comprehensive cross-platform support:

Todo:

Better Dtype coercion:

Performance

  • Vbenches! (for various flavors even?)

cc @garaud @mangecoeur @danielballan @changhiskhan

@garaud
Contributor

garaud commented Jul 8, 2013

I think it would be good if the "way to go" were defined before doing the remaining tasks. I think SQLAlchemy is the way to go:

  • less code
  • more maintainable
  • less tied to a given SQL engine

I don't know SQLAlchemy well, so I can't tell whether all the desired features can be implemented with it, or whether it will cause some tricky problems.

Damien G.

@hayd
Contributor Author

hayd commented Jul 8, 2013

@garaud I agree, I suspect it'll create fewer tricky problems than maintaining every flavour. :)

I think I've pulled in all the open issues re sql (as they'll have to be addressed in either case).

@danielballan
Contributor

And while we're addressing the big picture, write_frame ignores the index. HDFStore, by contrast, saves it. It's easy enough to promote the index to a column if you want to write it to a SQL table, but would it be better to write the index column by default? I have no good ideas for how to implement this cleanly; I'm raising the issue in case someone else can think of a way.

@jreback
Contributor

jreback commented Jul 8, 2013

you should write the index for sure (and provide index_col= on reading)

I guess this is a bit tricky because you don't have control of the table format (in HDFStore its opaque to the user)

maybe provide an option on writing, e.g. index_col='in_table', or False for don't write it?

or have the user provide a column name (which, if it's already being written, won't then do anything)

@danielballan
Contributor

I guess the question is what to call it, then, if "index" is taken.

@danielballan
Contributor

...and whether to attempt to use it as the PRIMARY KEY

@hayd
Contributor Author

hayd commented Jul 8, 2013

index.name or index.names (do we have MultiIndex support?), otherwise 'index' (or, if Multi: [index_0, index_1, ...])?

We can check whether it's already a column name and raise; I guess that also falls under multiple columns with the same name... raise (or mangle_dup_columns first)... :s

Not sure we can use it as a PK, since the index may not be unique... does the PK matter?
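
The naming fallback described above could be sketched as follows (hypothetical helper, not pandas API):

```python
def index_column_names(index_names):
    """Map Index/MultiIndex names to SQL column names, as proposed above.

    Unnamed levels fall back to 'index' (single) or 'index_0', 'index_1', ...
    """
    if len(index_names) == 1:
        return [index_names[0] if index_names[0] is not None else "index"]
    return [name if name is not None else "index_%d" % i
            for i, name in enumerate(index_names)]
```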

@hayd
Contributor Author

hayd commented Jul 8, 2013

@danielballan what am I supposed to be adding to get mysql tests to pass? Does travis run them?

SKIP: Cannot connect to database. Create a group of connection parameters under the heading [pandas] in your system's mysql default file, typically located at ~/.my.cnf or /etc/.my.cnf.

@danielballan
Contributor

Travis does not run them. (I wrote those tests before I had any experience with Travis, but I will start a branch to install MySQL on the full_deps build.)

Add a block to your MySQL configuration file that provides valid connection parameters for pandas to use.

[pandas]
host = localhost
user = ...
password = ...
database = any_database_that_exists
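
For reference, a simple key = value block like the one above can be parsed with the standard library configparser (values below are placeholders; real MySQL option files with bare flag options may need a dedicated reader):

```python
import configparser

# A my.cnf-style snippet; the [pandas] heading matches the block above.
MY_CNF = """\
[pandas]
host = localhost
user = testuser
password = secret
database = test_db
"""

parser = configparser.ConfigParser()
parser.read_string(MY_CNF)
params = dict(parser["pandas"])  # connection parameters as a plain dict
```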

@jreback
Contributor

jreback commented Jul 10, 2013

I think #4110 should be in here or closed. Is it a dup?

@danielballan
Contributor

No, not a dup, and still open I believe. @hayd can add it to the checklist.

Half the issues on the list could be closed in an afternoon's work, but maybe we should first decide whether we are moving to SQLAlchemy. @mangecoeur?

@hayd
Contributor Author

hayd commented Jul 10, 2013

I rebased @mangecoeur's branch cleanly (no conflicts!) to master a few days ago, but couldn't get the tests working on my system - haven't got time to play for a couple of days.

Do you mind having a go @danielballan ?

@danielballan
Contributor

Sure... To avoid git surgery down the road, I'll ask a dumb question now. I should add your fork of pandas as a remote and work on a local copy of this branch, correct?

@jreback
Contributor

jreback commented Jul 10, 2013

FYI... need to add SQLAlchemy as a build dep (in the requirements file) for full deps, @cpcloud? (It's already in ci/print_versions.py though.)

@cpcloud
Member

cpcloud commented Jul 10, 2013

FYI you can add to the requirements file "at will" (meaning, of course, when it makes sense) and things will be pulled from PyPI unless I put up a wheel. Wheels should be used only if the package is slow to install.

@mangecoeur
Contributor

@danielballan Sadly, because of workload I haven't had time to look into this much. I know pretty much what needs doing, but I just don't have time to actually implement and test it. At the simplest level, reading a frame using SQLAlchemy as a DB abstraction looks like this:
https://gist.github.com/mangecoeur/5383970

(That's code I actually needed in production, so it works and is neater than what's in my branch.)
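
The read principle in that gist, reduced here to the standard library (plain sqlite3 standing in for an SQLAlchemy engine), is roughly:

```python
import sqlite3

# Set up a throwaway in-memory database to query against.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(1, "x"), (2, "y")])

# Run the query, recover column names from the cursor, pair them with rows.
cur = con.execute("SELECT * FROM t")
columns = [d[0] for d in cur.description]
records = [dict(zip(columns, row)) for row in cur.fetchall()]
```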

The code for writing is in my branch; the principle is simple: map each pandas dtype to an SQLAlchemy-supported type (most Python builtins are covered), create a Table, write the values.
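
That mapping step could be sketched by keying on numpy's dtype.kind codes; the type names here are illustrative stand-ins, not the exact SQLAlchemy classes used in the branch:

```python
# numpy dtype.kind -> SQLAlchemy-style type name (illustrative only)
_KIND_TO_SQL = {
    "i": "Integer",   # signed integer
    "u": "Integer",   # unsigned integer
    "f": "Float",     # floating point
    "b": "Boolean",
    "M": "DateTime",  # datetime64
    "O": "Text",      # python objects / strings
}

def sql_type_for(kind):
    """Return a SQL column type name for a numpy dtype.kind code."""
    return _KIND_TO_SQL.get(kind, "Text")
```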

Testing is the kicker. I have very little experience with TDD, but I do know that running tests that require DBs is generally a pain. On the other hand, SQLAlchemy's cross-DB support is very good, so if it works with one DB it should work for them all, with SQLite as a kind of lowest common denominator.

Finally, I would recommend making SQLAlchemy a soft dependency (like a lot of the other IO), with a fallback to SQLite support only, since that's built into Python; there's no point preventing people from accessing SQLite just because they don't have SQLAlchemy installed. It's a simple try/except import: set an "sqlalchemy mode" flag and choose either the generic or the SQLite functions accordingly.

@hayd
Contributor Author

hayd commented Jul 10, 2013

@danielballan I didn't get the travis build working anyway, so feel free to throw away my commit (I'd have to rebase after we pull in your mysql travis testing anyway).

@jreback
Contributor

jreback commented Jul 10, 2013

I think that read_frame/write_frame should become internal-only methods, and we should instead use the external-facing read_sql/to_sql for consistency with the rest of the I/O (currently read_sql is just aliased to read_frame). I know that you will almost certainly just read/write frames, but this leaves open the possibility of dealing with Panel/Series (Series is sort of trivial); e.g. in HDFStore, a Panel is essentially converted to a multi-index frame (though storing metadata is different in SQL, so not sure that is an option here).

@hayd
Contributor Author

hayd commented Jul 10, 2013

Perhaps get SQLAlchemy working (passing the same tests as MySQL) first and then worry about precisely how to call it from read_sql and to_sql... That said, other operations use an engine argument (maybe that makes sense here too).

@mangecoeur
Contributor

@jreback Or maybe it should be more explicit: read_sqlite and read_sqlalchemy.
SQLAlchemy supports raw SQL as well as its own "expression syntax" (which abstracts over the different SQL flavours used by different DBs), so it would be nice in the future if read_sqlalchemy could also accept a query object as well as an SQL string.
At the moment, if you use SQLAlchemy to read a DB using a raw SQL query, even though it connects to all DBs transparently, it's up to the user to make sure their SQL is formatted for the DB they're using. There's nothing you can really do about that for reading, but it would be nice to make it possible to use SQLAlchemy's abstractions (though this doesn't need to be done just yet).

@jreback
Contributor

jreback commented Jul 10, 2013

added an extra cool SQL label for these; such labels (CSV/HDF5) exist as well

@jreback
Contributor

jreback commented Jul 10, 2013

@mangecoeur I disagree; I believe write_frame has a flavor= arg... so should read_sql.
Then this becomes straightforward, e.g. try this using say flavor='sqlite', or use the default (SQLAlchemy if installed, otherwise sqlite?)

you could also provide a module level set_default_flavor maybe
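
A minimal sketch of such a module-level default (all names hypothetical):

```python
_VALID_FLAVORS = ("sqlite", "sqlalchemy")
_default_flavor = "sqlite"  # sensible fallback: sqlite3 ships with Python

def set_default_flavor(flavor):
    """Set the module-level default SQL flavor, validating the name."""
    global _default_flavor
    if flavor not in _VALID_FLAVORS:
        raise ValueError("unknown flavor %r, expected one of %s"
                         % (flavor, ", ".join(_VALID_FLAVORS)))
    _default_flavor = flavor

def get_default_flavor():
    return _default_flavor
```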

@cpcloud
Member

cpcloud commented Jul 10, 2013

that could go in config

@jreback
Contributor

jreback commented Jul 10, 2013

ahh... yes, that makes more sense; a fallback really doesn't make sense, and people rarely use more than one SQL backend

@cpcloud
Member

cpcloud commented Jul 10, 2013

probably some of the other module level variables floating around could go in config as well

@mangecoeur
Contributor

@jreback thought about it some more and actually I agree with you now.

The SQL string given to read_sql has to be adapted by the user to fit the DB they know they are using; there's no point having different functions for different DBs. The function should accept both DBAPI connections and SQLAlchemy engines. There's no need to specify the flavor for read, since pandas just needs to run the provided query through the given DB connection or engine; it doesn't need to modify it. Therefore read will work so long as you provide valid SQL for your DB and either a DBAPI-compliant connection or an SQLAlchemy engine for that DB.

I think however we need to distinguish between 2 distinct user stories (use cases):

  1. "I want to read an arbitrary dataset from a database using an arbitrarily complex SQL query" plus "I want to store dataframe values in an SQL table"

This case is served by read_sql and to_sql, which just need to be able to save values; a user will write an appropriate SQL query to turn them back into a DataFrame if needed. There is no need for a "flavor" for to_sql, because we can infer the correct column type mapping and let SQLAlchemy generate the correct SQL syntax based on the supplied engine. However, we would like to have SQLite available as an option even if SQLAlchemy is not installed (this is what I meant by fallback, not that it would try to write to one database and fall back to writing to SQLite).

  1. "I want to store the Dataframes I'm working in an SQL DB and retrieve them transparently later."

This is different from 1. because we have to assume a save-format convention, or save additional metadata. You don't want to have to specify an SQL string when loading, because the save/load layer should deal with it for you. This is akin to an ORM mapping, while 1. deals with raw SQL.

I believe the read_sql and to_sql feature under discussion should only attempt to address use case 1. Use case 2 is a different feature and might deserve its own read_ and to_ functions.

I think it's important to keep this difference in mind in order to have a clear understanding of what the IO functions should and should not attempt to do, and what problems they're actually trying to address.

It should also make the API simpler since, as I said above, there's no need for the user to specify SQL flavours (it can all be done automatically), and we don't need any additional machinery for defining save conventions, because we assume they do not apply to this specific feature.

@jreback
Contributor

jreback commented Jul 10, 2013

@mangecoeur

but if sqlalchemy is the flavor, then you don't need to specify DB-specific SQL (which is a good thing), but you DO need to know the backend (I agree the conn already has some of this info, so maybe the flavor can be inferred)

just trying to keep api as simple as possible (but not too simple!)

@mangecoeur
Contributor

@jreback I think you slightly misunderstood how SQLAlchemy works. It provides you with a variety of abstractions in order to generate DB-specific SQL (e.g. you create a Table object, and SQLAlchemy figures out the correct CREATE TABLE syntax for the DB you're using). It also deals with things like transactions and connection pools for you. However, if you tell the engine to execute a plain SQL string, it will not attempt to "translate" that string; how could it, seeing as there is in practice no "strictly correct" SQL syntax and feature set for it to translate from in the first place?
What you can do is use their expression API to write queries in Python, which will be translated to the correct SQL flavour.

So if you supply raw SQL, you do need to know the DB you are using and write SQL specific to that DB. SQLAlchemy also knows the DB you're using, since you have to specify it when you create the engine with the connection URI, and it deals with all the communication business for you. The main strength of using SQLAlchemy in this case is not for reading using a provided SQL string (where it just provides a light abstraction over DBAPI) but for writing to the DB, because we can programmatically generate an SQLAlchemy Table and let it generate the correct SQL based on the DB you provided it through the engine. All the information needed for this to happen is contained in the engine, so no flavour is required.

The only question I see is whether you have flavour={'sqlite', 'sqlalchemy'}, rather than two functions, to_sqlite and to_sqlalchemy (or some other names).
However:

  1. It's easy to detect whether SQLAlchemy support is present through a try import... If it's not present, you can't even try to use it, because you can't supply an SQLAlchemy engine. Setting the flag just raises an error, which you could have done simply during the try import...
  2. If you set sqlite, the function would accept a DBAPI connection and work with the SQLite feature set only, while setting sqlalchemy would accept an SQLAlchemy engine. So the flavour is already implied by the arguments.
  3. That said, I personally don't like the idea of a single to_sql function that could have very different output depending on whether it was using SQLAlchemy (with advanced type mapping) vs the basic built-in mapping.

@mangecoeur
Contributor

Pull request: #5950

jreback pushed a commit to jreback/pandas that referenced this issue Feb 6, 2014
TST Import sqlalchemy on Travis.

DOC add docstrings to read sql

ENH read_sql connects via Connection, Engine, file path, or :memory: string

CLN Separate legacy code into new file, and fallback so that all old tests pass.

TST to use sqlachemy syntax in tests

CLN sql into classes, legacy passes

FIX few engine vs con calls

CLN pep8 cleanup

add postgres support for pandas.io.sql.get_schema

WIP: cleaup of sql io module - imported correct SQLALCHEMY type, delete redundant PandasSQLWithCon

TODO: renamed _engine_read_table, need to think of a better name.
TODO: clean up get_conneciton function

ENH: cleanup of SQL io

TODO: check that legacy mode works
TODO: run tests

correctly enabled coerce_float option

Cleanup and bug-fixing mainly on legacy mode sql.
IMPORTANT - changed legacy to require connection rather than cursor. This is still not yet finalized.
TODO: tests and doc

Added Test coverage for basic functionality using in-memory SQLite database

Simplified API by automatically distinguishing between engine and connection. Added warnings

ENH pandas-dev#4163 Added tests and documentation

Initial draft of doc updates

minor doc updates

Added tests and reduced code repetition. Updated Docs. Added test coverage for legacy names

Documentation updates, more tests

Added depreciation warnings for legacy names.

Updated docs and test doc build

ENH pandas-dev#4163 - finalized tests and docs, ready for wider use…

TST added sqlalchemy to TravisCI build dep for py 2.7 and 3.3

TST Import sqlalchemy on Travis.

DOC add docstrings to read sql

ENH read_sql connects via Connection, Engine, file path, or :memory: string

CLN Separate legacy code into new file, and fallback so that all old tests pass.

ENH pandas-dev#4163 added version added coment

ENH pandas-dev#4163 added depreciation warning for tquery and uquery

ENH pandas-dev#4163 Documentation and tests

ENH pandas-dev#4163 Added more robust type coertion, datetime parsing, and parse date options. Updated optional dependancies

Added columns optional arg to read_table, removed failing legacy tests.

Added columns to doc

ENH pandas-dev#4163 Fixed class renaming, expanded docs

ENH pandas-dev#4163 Fixed tests in legacy mode

ENH pandas-dev#4163 Use SQLAlchemy for DB abstraction

ENH pandas-dev#4163 Tweaks to docs, avoid mutable default args, mysql tests

ENH pandas-dev#4163 Introduce DataFrame Index support. Refactor to introduce PandasSQLTable for cleaner OOP design

ENH pandas-dev#4163 Fix bug in index + parse date interaction, added test case for problem

ENH pandas-dev#4163 Fixed missing basestring import for py3.3 compat

ENH pandas-dev#4163 Fixed missing string_types import for py3.3 compat
@jreback
Contributor

jreback commented Feb 7, 2014

@jorisvandenbossche want to close this and open another issue with the things that are not covered by #6292 (if they are at all relevant), but that we may want to address 'later'?

@jorisvandenbossche
Member

@hayd You opened this issue, what do you think? Do you have a clear view what of this list is still relevant?

@hayd
Contributor Author

hayd commented Feb 8, 2014

Let's close and move parts into #6292; this issue has got very long.
