ENH: sql support #4163

Closed
hayd opened this Issue · 176 comments

14 participants

Andy Hayden Damien Garaud Dan Allan jreback Phillip Cloud mangecoeur meteore Jeff Tratner Gagi Chris Morgan Steve Pike Tony Narlock y-p Joris Van den Bossche
Andy Hayden
Collaborator

UPDATE: this is continued in #6292 after the refactor based on sqlalchemy.


This is a central issue to track sql support (migrated from the mysql support ticket - #2482 and #1662).

I think we are agreed that we will move to SQLAlchemy, in the hope of comprehensive cross-platform support:

Todo:

  • travis should not skip tests (e.g. mysql #4177)
  • add flavor argument to read_sql
  • handling no row returns (#3745)
  • Specify chunksize (and iterator) for read_frame (#2908)
  • read and write the index (and multi-index)
  • mangle_dup_columns for read_frame (#2738)
  • bug in write_frame replace (#4110, #2971, PR #4304, #6164)
  • option in config to specify default engine/connection for SQL (maybe?)
  • refactor SQLAlchemy/sqlite tests (@hayd was thinking different flavors could inherit tests from a base class, so tests are only flavor-specific where they genuinely need to be...)

Better Dtype coercion:

Performance

  • Vbenches! (for various flavors even?)

cc @garaud @mangecoeur @danielballan @changhiskhan

Damien Garaud

I think it would be good if the "way to go" were defined before tackling the remaining tasks. I think SQLAlchemy is a good way:

  • less code
  • more maintainable
  • less specific to a given SQL engine

I don't know SQLAlchemy well. I can't tell whether all the desired features can be implemented with it, or whether it will cause some tricky problems.

Damien G.

Andy Hayden
Collaborator

@garaud I agree, I suspect it'll create fewer tricky problems than maintaining every flavour. :)

I think I've pulled in all the open issues re sql (as they'll have to be addressed in either case).

Dan Allan

And while we're addressing the big picture, write_frame ignores the index. HDFStore, by contrast, saves it. It's easy enough to promote the index to a column if you want to write it to a SQL table, but would it be better to write the index column by default? I have no good ideas for how to implement this cleanly; I'm raising the issue in case someone else can think of a way.
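
For illustration, the workaround in sketch form (df is any DataFrame; the table name is made up):

import pandas as pd

df = pd.DataFrame({'a': [1, 2]}, index=pd.Index([10, 20], name='id'))
df_to_write = df.reset_index()  # the index 'id' becomes an ordinary column
# sql.write_frame(df_to_write, 'my_table', con)  # and is then written like any other column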

jreback
Owner

you should write the index for sure (and provide index_col=) on reading

I guess this is a bit tricky because you don't have control of the table format (in HDFStore it's opaque to the user)

maybe provide an option on writing, maybe index_col='in_table', or False for don't write it?

or have the user provide a column name (which, if it's already being written, won't then do anything)

Dan Allan

I guess the question is what to call it, then, if "index" is taken.

Dan Allan

...and whether to attempt to use it as the PRIMARY KEY

Andy Hayden
Collaborator

index.name or index.names (do we have MultiIndex support?), otherwise 'index' (or if Multi: [index_0, index_1, ...])?

Can check if it's already a column name and raise; I guess that also falls under multiple columns with the same name.. raise (or mangle_dup_columns first)... :s

Not sure we can use it as a PK, as the index may not be unique... does the PK matter?

Andy Hayden
Collaborator

@danielballan what am I supposed to be adding to get mysql tests to pass? Does travis run them?

SKIP: Cannot connect to database. Create a group of connection parameters under the heading [pandas] in your system's mysql default file, typically located at ~/.my.cnf or /etc/.my.cnf.

Dan Allan

Travis does not run them. (I wrote those tests before I had any experience with Travis, but I will start a branch to install MySQL on the full_deps build.)

Add a block to your MySQL configuration file that provides valid connection parameters for pandas to use.

[pandas]
host = localhost
user = ...
password = ...
database = any_database_that_exists

jreback
Owner

I think #4110 should be in here/closed. Is it a dup?

Dan Allan

No, not a dup, and still open I believe. @hayd can add it to the checklist.

Half the issues on the list could be closed in an afternoon's work, but maybe we should first decide whether we are moving to SQLAlchemy. @mangecoeur?

Andy Hayden
Collaborator

I rebased @mangecoeur's branch cleanly (no conflicts!) to master a few days ago, but couldn't get the tests working on my system - haven't got time to play for a couple of days.

Do you mind having a go @danielballan ?

Dan Allan

Sure... To avoid git surgery down the road, I'll ask a dumb question now. I should add your fork of pandas as a remote and work on a local copy of this branch, correct?

jreback
Owner

fyi...need to add SQLAlchemy as a build dep (in the requirements file) for full deps, @cpcloud ? (it's already in ci/print_versions.py though)

Phillip Cloud
Collaborator

fyi you can add to the requirements file "at will" (meaning, of course, when it makes sense) and things will be pulled from PyPI unless I put up a wheel. Wheels should be used only if the package is slow to install.

mangecoeur

@danielballan sadly, because of workload I haven't had time to look into this much - I know pretty much what needs doing but I just don't have time to actually implement+test it. At the simplest level, reading a frame using SQLAlchemy as a db abstraction looks like this:
https://gist.github.com/mangecoeur/5383970

(That's code I actually needed in production, so it works and is neater than what's in my branch)

The code for writing is in my branch; the principle is simple: map each pandas dtype to an SQLAlchemy-supported type (most Python builtins included), create a Table, write the values.
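
In sketch form, that principle looks roughly like this (0.8-era SQLAlchemy API; the mapping and function name are illustrative, not the branch's actual code):

import numpy as np
from sqlalchemy import MetaData, Table, Column
from sqlalchemy.types import Integer, Float, Text

# illustrative dtype -> SQLAlchemy type mapping
_TYPE_MAP = {np.dtype('int64'): Integer,
             np.dtype('float64'): Float,
             np.dtype('object'): Text}

def frame_to_table(df, name, engine):
    meta = MetaData(bind=engine)
    cols = [Column(c, _TYPE_MAP[df[c].dtype]()) for c in df.columns]
    table = Table(name, meta, *cols)
    table.create()  # the engine emits dialect-specific CREATE TABLE
    rows = [dict(zip(df.columns, r)) for r in df.values]
    table.insert().execute(rows)  # write the values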

Testing is the kicker; I have very little experience with TDD, but I do know that running tests that require DBs is generally a pain. On the other hand, SQLAlchemy's cross-DB support is very good, so if it works with one DB it should work for them all, with SQLite as a kind of lowest common denominator.

Finally, I would recommend making SQLAlchemy a soft dependency (like a lot of the other IO) with a fallback to SQLite support only, since that's built into Python - there's no point preventing people accessing SQLite just because they don't have SQLAlchemy installed. It's a simple try/except import: set an "sqlalchemy mode" flag, then choose either generic or SQLite functions accordingly.
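
The import pattern in question is simply (flag name illustrative):

try:
    import sqlalchemy
    _SQLALCHEMY_INSTALLED = True   # "sqlalchemy mode": use the generic functions
except ImportError:
    _SQLALCHEMY_INSTALLED = False  # fall back to the built-in sqlite3 support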

Andy Hayden
Collaborator

@danielballan I didn't get the travis build working anyway, so feel free to throw away my commit (I'd have to rebase after we pull in your mysql travis testing anyway).

jreback
Owner

I think that read_frame/write_frame should become internal-only methods, and instead use the external-facing read_sql/to_sql for consistency with the rest of the i/o (currently read_sql is just aliased to read_frame). I know that you will almost certainly just read/write frames, but this leaves open the possibility of dealing with Panel/Series (Series is sort of trivial); e.g. in HDFStore, Panel is essentially converted to a mi-frame (though storing meta-data is different in SQL, so not sure that is an option here).

Andy Hayden
Collaborator

Perhaps get SQLAlchemy working (passing the same tests as MySQL) first and then worry about precisely how to call it from read_sql and to_sql... That said, other operations use an engine argument (maybe that makes sense here).

mangecoeur

@jreback or maybe it should be more explicit: read_sqlite and read_sqlalchemy
SQLAlchemy supports raw SQL as well as its own "expression syntax" (which abstracts over the different SQL flavours used by different DBs) - so it would be nice in the future if read_sqlalchemy could also accept a query object as well as an SQL string.
At the moment, if you use SQLAlchemy to read a DB using a raw SQL query, even though it connects to all DBs transparently, it's up to the user to make sure their SQL is formatted for the DB they're using - there's nothing you can really do about that for reading, but it would be nice to make it possible to use SQLAlchemy's abstractions (though this doesn't need to be done just yet)

jreback
Owner

added extra cool SQL label for these (CSV/HDF5 labels also exist)

jreback
Owner

@mangecoeur I disagree; I believe write_frame has a flavor= arg....so should read_sql.
Then this becomes straightforward: e.g. try this using, say, flavor=sqlite, or use the default (SQLAlchemy if installed, otherwise sqlite?)

you could also provide a module level set_default_flavor maybe

Phillip Cloud
Collaborator

that could go in config

jreback
Owner

ahh...yes makes more sense, and fallback really doesn't make sense, and people rarely use more than one SQL backend

Phillip Cloud
Collaborator

probably some of the other module level variables floating around could go in config as well

mangecoeur

@jreback thought about it some more and actually I agree with you now.

The SQL string given to read_sql has to be adapted by the user to fit the DB they know they are using, there's no point having different functions for different DBs. The function should accept both DBAPI connections and SQLAlchemy engines. There's no need to specify the flavor for read since pandas just needs to run the provided query through the given DB connection or engine, it doesn't need to modify it - therefore read will work so long as you provide a valid SQL for your DB and either a DBAPI compliant connection or an SQLAlchemy engine for that DB.

I think, however, we need to distinguish between 2 distinct user stories (use cases):

1. "I want to read an arbitrary dataset from a database using an arbitrarily complex SQL query" plus "I want to store dataframe values in an SQL table"

This case is served by read_sql and to_sql, which just need to be able to save values; a user will write an appropriate SQL query to turn them back into a DataFrame if needed. There is no need for a "flavor" for to_sql, because we can infer the correct column type mapping and let SQLAlchemy generate the correct SQL syntax based on the supplied engine. However, we would like to have SQLite available as an option even if SQLAlchemy is not installed (this is what I meant by fallback: not that it would try to write to one database and fall back to writing to sqlite).

2. "I want to store the DataFrames I'm working with in an SQL DB and retrieve them transparently later."

This is different from 1. because we have to assume a save format convention, or save additional metadata. You don't want to have to specify an SQL string when loading, because the save/load layer should deal with it for you. This is akin to an ORM mapping, while 1. is dealing with raw SQL.

I believe the read_sql and to_sql feature under discussion should only attempt to address use case 1. Use case 2. is a different feature and might deserve its own read_ and to_ functions.

I think it's important to keep this difference in mind in order to have a clear understanding of what the IO functions should and should not attempt to do, and what problems they're actually trying to address.

It should also make the api simpler since, as I said above, there's no need for the user to specify SQL flavours as it can all be done automatically, and we don't need any additional machinery for defining save conventions because we assume they do not apply for this specific feature.

jreback
Owner

@mangecoeur

but if sqlalchemy is the flavor, then you don't need to specify DB-specific sql (which is a good thing), but you DO need to know the backend (I agree conn already has some of this info, so maybe the flavor can be inferred)

just trying to keep api as simple as possible (but not too simple!)

mangecoeur

@jreback I think you slightly misunderstood how sqlalchemy works. It provides you with a variety of abstractions in order to generate DB-specific SQL (e.g. you create a Table object, and SQLAlchemy figures out the correct CREATE TABLE syntax for the DB you're using). It also deals with things like transactions and connection pools for you. However, if you tell the engine to execute a plain SQL string, it will not attempt to "translate" that string - how could it, seeing as there is in practice no "strictly correct" SQL syntax and feature set for it to translate from in the first place.
What you can do is use their expression api to write queries in Python, which will be translated to the correct SQL flavour.

So if you supply raw sql, you do need to know the DB you are using and to write SQL specific to that DB. SQLAlchemy also knows the db you're using, since you have to specify it when you create the engine with the connection URI, and it deals with all the communication business for you. The main strength of using SQLAlchemy in this case is not for reading using a provided SQL string (where it just provides a light abstraction over DBAPI) but for writing to the DB, because we can programmatically generate an SQLAlchemy Table and let it generate the correct SQL based on the db you provided it through the engine. All the information needed for this to happen is contained in the engine, so no flavour is required.
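
To make that concrete, a small sketch (0.8-era API, as used elsewhere in this thread): raw SQL is passed through verbatim, while the expression layer generates dialect-specific SQL from the information in the engine.

from sqlalchemy import create_engine, MetaData, Table

engine = create_engine('sqlite:///:memory:')  # the backend is encoded in the URI

# raw SQL: executed as-is, so it must already match this backend's dialect
engine.execute('CREATE TABLE t (x INTEGER, y VARCHAR)')

# expression layer: SQLAlchemy emits the correct syntax for whatever backend
# the engine points at
meta = MetaData(bind=engine)
t = Table('t', meta, autoload=True)  # reflect the table we just created
engine.execute(t.insert(), x=1, y='a')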

The only condition I see is if you have flavour = {'sqlite', 'sqlalchemy'}, rather than having two functions to_sqlite and to_sqlalchemy (or some other name).
However:
1. It's easy to detect whether sqlalchemy support is present through a try import... If it's not present you can't even try to use it, because you can't supply an sqlalchemy engine. Setting the flag just raises an error, which you could have done simply during the try import...
2. If you set sqlite, the function would accept a DBAPI connection and work with the SQLite feature set only, while setting sqlalchemy would accept an sqlalchemy engine. So the flavour is already implied by the arguments.
3. That said, I personally don't like the idea of a single to_sql function that could have very different output depending on whether it was using sqlalchemy (with advanced type mapping) vs basic built-in mapping.

jreback
Owner

@mangecoeur

a user just wants to to_sql (in reality I don't even want to specify the connection!),

that's why flavor is useful (obviously you would also have to pass host/user/pw), and with sqlite3 as the default maybe this would just work

what you are talking about is more advanced for sure, but a user shouldn't have to think too hard about this

so I am actually in favor of not requiring conn (of course if it's there, then use it), and alternatively specifying user/host/pw etc....

Andy Hayden
Collaborator

@mangecoeur What is "engine" in this context?

I think it probably makes sense to copy the sqlalchemy syntax/arguments for passing in a raw_query.

I also think we should keep around at least the sqlite flavour (for one thing, we can use it to test against some of the SQLAlchemy functionality), and this should be an argument in to_sql/read_sql (probably the default); if engine/flavor is reserved (has some other meaning in sqlalchemy) we should think of some other argument name.

Agree with @jreback that as much as possible should just work™; perhaps the default connection could also go in config...

mangecoeur

@jreback ah I see, yes, supplying user/pw etc instead of conn would be another way to do it.

But I don't think that's practical - it would mean hard-coding DB driver support, since you don't give the user control over which db driver to use (there are 5 or 6 for MySQL alone, and different reasons for using each), unless you copied sqlalchemy's connection URI syntax. A bigger problem is that you want more dynamic control over the connections you're using - the system I was working on involved connecting to 6 different databases and pulling and combining data from all of them, and letting another part of my system manage the connections, which I passed to read_frame as needed. Creating a connection or SQLAlchemy engine and passing it to read_sql is not difficult for basic use (1 extra line) and maintains flexibility - and trust me, many people using pandas are dealing with much more complex use-cases than mine and would appreciate that flexibility!

@hayd engine in this context is: http://docs.sqlalchemy.org/en/rel_0_8/core/connections.html#

I also argue against default connections in config - the zen of python says "explicit is better than implicit", and I don't like the idea of a connection that exists "by magic". There should be no more defaults set than already exist in the DB drivers used. I'm also concerned it may be a safety risk, as it would allow someone to modify an existing database unintentionally if they accidentally didn't override the defaults. Again, with a nice documentation page it's very easy for someone to set up a db connection with 1 line, while the alternative introduces headaches for other use cases.

jreback
Owner

@mangecoeur

how about we do this:

if conn is a string, e.g. a standard connection string (which you then pass to sqlalchemy, and which will then open/close the db too):

pd.read_sql("postgresql+psycopg2://user.pw@host/db")

else, conn provides the connection (which is opened/closed by the user)

you could also allow args (flavor, user, pw, host, db) and default flavor=sqlite3

that way a user can pass a connection, a string to their favorite db, or use the connection parms and get sqlite3

I didn't mean default connections, just a default for the flavor

?

mangecoeur

@jreback I think accepting conn or the sqlalchemy string would be fine. I think adding user/pw etc args on top of that gets cumbersome, and is mixing different concerns. I would really, really prefer that the sqlite support not be mixed with the sqlalchemy support, because the results they give could end up being very different - the last thing you want from an API is for it to surprise you! You could easily get a situation where you write to SQLite through SQLAlchemy and write the same data to sqlite through "raw" support, and get different data in the database because of subtle differences in conversions.

I would suggest:
read_sql(sql: string, engine: {string or SqlAlchemy Engine}) => throws error if SqlAlchemy not installed
and
read_sqlite(sql: string, conn: DBAPI (default None), user, pw, host, db) => always works

jreback
Owner

to make this back-compat (and to avoid having 2 functions), I propose this:

read_sql(sql: string,
         conn: if a string, then use ``flavor/conn_parms`` to figure out how to construct the connection,
               or use the connection as-is (e.g. the user created the connection), or pass to ``SQLAlchemy``
               if it's an ``SQLAlchemy Engine``,
         conn_parms: a dict of connection arguments (e.g. user/pw/host/db), could also be separate arguments,
         flavor: string, default ``None``; if specified, will create that type of connection,
                 e.g. ``sqlite3`` or ``sqlalchemy`` directly
         )

@mangecoeur

case 1:

connect to sqlalchemy
read_sql(sql, conn=None, flavor='sqlalchemy', conn_parms=dict(user=...., host=....))

case 2:

connect to sqlite

read_sql(sql, conn=None, flavor='sqlite', conn_parms=dict(user=....., host=....))

Dan Allan

I don't see the need for read_sqlite. To elaborate on @jreback's idea:

  • If conn is formatted like a connection string 'postgresql+psycopg2://user.pw@host/db', use SqlAlchemy and raise if it's not there.

  • If conn is formatted like a local file path, use sqlite, which (!) takes no username or password -- just a path.

  • If conn is a connection object, determine whether it is a sqlite3.Connection, and try SqlAlchemy if it is not.

  • Also accept user/passwd/host/flavor kwargs, which will always need SqlAlchemy.

So the simplest possible use case would be read_sql('data.sql').
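
A sketch of that dispatch (the helper name and return values are purely illustrative):

import sqlite3

def _infer_backend(conn):
    if isinstance(conn, str):
        if '://' in conn:        # e.g. 'postgresql+psycopg2://user.pw@host/db'
            return 'sqlalchemy'  # raise if SQLAlchemy is not installed
        return 'sqlite'          # any other string is treated as a local file path
    if isinstance(conn, sqlite3.Connection):
        return 'sqlite'
    return 'sqlalchemy'          # unknown connection objects: try SQLAlchemy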

Andy Hayden
Collaborator

Could it also pass an engine? (The suggestion from those docs @mangecoeur linked to is that an engine should be created only once, "not per-object or per-function call".)

Does this mean a connection string is frowned upon (but maybe we could cache the created engine? :S)

jreback
Owner

isn't an engine the same as a connection, from a user perspective?

Andy Hayden
Collaborator

@jreback agree it should be in the connection argument. Just wondering, if they pass a string like 'postgresql+psycopg2://user.pw@host/db', whether we should cache, or whether we shouldn't accept it at all. (Docs seem to suggest an engine shouldn't be created on each function call, but should be "global".)

jreback
Owner

I think if you want to cache it, then you should create it externally and pass it in as conn (which the sql engine shouldn't close). But if you provide a connection string (or args/flavor), then pandas should open/close for you (consistent with other i/o parts)

jreback
Owner

my 2c

I think allowing the common connection args is fine, e.g.

read_sql(sql,user='foo',host=..,db=....) makes it pretty natural

and I don't think this will break back-compat, as it's basically making conn optional

Andy Hayden
Collaborator

Perhaps I'm wrong, but the impression I got was that the engine opens and closes the connection on each call. The point being it wants you to pass it an engine (and not create an engine each time); then the engine opens and closes a connection.

+1 on standard connection args.

jreback
Owner

@hayd I think that is right

as an aside....it might make sense to have this in 2 pieces internally: basically the connection manager and then the parser (takes the rows and produces a frame)

then, a power user could use the parser only and manage the connections themselves (say they wanted to do connection pooling or whatever)

Andy Hayden
Collaborator

That is, we'd cache the Engine (but I think we needn't worry about this yet... too early to be that clever), so:

  • If conn is formatted like a connection string 'postgresql+psycopg2://user.pw@host/db', use SqlAlchemy and raise if it's not there.
  • If conn is an Engine use SqlAlchemy (we can shove create_engine,.. in sql.io).
  • If conn is formatted like a local file path, use sqlite, which (!) takes no username or password -- just a path.
  • If conn is a connection object, determine whether it is a sqlite3.Connection, and try SqlAlchemy if it is not.
  • Also accept user/passwd/host/flavor kwargs, which will always need SqlAlchemy.

Not sure if the Engine would necessarily have already been given the kwargs.

jreback
Owner

also, flavor can override your last point (of trying sqlalchemy first)

Dan Allan

I'll take a shot at this, leaving all our existing code but ultimately aiming for SQLAlchemy to take over the current mysql code, once it passes all tests with the original usage.

jreback
Owner

gr8!

Dan Allan

Progress report, requesting some input...

My revamped read_sql accepts:

  • ':memory:' -> Connect to an in-memory sqlite DB. Use sqlalchemy if it is available; fall back on pandas code based on the built-in sqlite3.
  • '/path/to/file.db' -> Connect to a sqlite DB. Ditto above...
  • sqlalchemy.engine.Connection -> Ready to use!
  • sqlalchemy.engine.Engine -> Connect to engine.connect()
  • connection parameters kwargs: dialect (a.k.a., "flavor"), username, password, database, host, optional port and driver -> Use sqlalchemy to create an Engine and connect to it.

But what if it receives one of these?

  • sqlite3.Connection. Suggestion: If sqlalchemy is not available, just proceed without it (as with the first two bullet points above). If it IS available, warn that more robust support is available if the user builds the connection using SQLAlchemy.

  • MySQLdb.Connection. Suggestion: Whether or not sqlalchemy is installed, warn the user that support for connections created from MySQLdb is buggy and will be removed in an upcoming release. Recommend creating one from SQLAlchemy instead, or passing the connection parameters so that read_sql can create one (using SQLAlchemy).

jreback
Owner

@danielballan that looks great...

the only comment I have is that IIRC, @mangecoeur mentioned that you may have a different preference for using sqlite3 even if SQLAlchemy is installed. This could easily be handled by:

arg: flavor=None

if flavor == 'sqlite3':
    ....
else:
    try to process via sqlalchemy

mangecoeur

@danielballan @jreback yes, that would be my preference: if SQLAlchemy is not installed, only support SQLite. The logic is that sqlite is always available since it comes with python, and therefore it makes sense to support it without additional dependencies. If SQLAlchemy is installed, it should be used to support all databases (i.e. no special case for MySQL). @jreback's code snippet is correct.

Andy Hayden
Collaborator

@danielballan Now I look at this (was interested / checking out your branch), and probably you knew this already: I think we'll need a pretty big refactor to use SQLAlchemy notation throughout write_frame, i.e. let SQLAlchemy derive the SQL from the engine (at the moment it doesn't). I had a little go at this, but I don't think my sqlalchemy-fu is up to it (yet?); I may be able to look at it later in the week.

Same for the tests: my idea was that we could have a base TestSQLAlchemy class which had self.engine = sqlite memory, and everything else was platform-independent, i.e. once this works, all you have to do is subclass it and override engine in setUp with a super call (or perhaps a flavor_setup method).

The good thing being we can use sqlalchemy.types.
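
In sketch form (class names and the MySQL URI are illustrative, not the actual test code):

import unittest
from sqlalchemy import create_engine

class TestSQLAlchemy(unittest.TestCase):
    """Flavor-agnostic tests; the base class runs against in-memory SQLite."""
    def setUp(self):
        self.engine = create_engine('sqlite:///:memory:')

    def test_roundtrip(self):
        pass  # write a frame via self.engine, read it back, compare

class TestSQLAlchemyMySQL(TestSQLAlchemy):
    """Same tests, different flavor: only the engine changes."""
    def setUp(self):
        self.engine = create_engine('mysql://pandas@localhost/pandas_nosetest')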

Dan Allan

@hayd Ah, I hadn't pushed any interesting stuff to github yet. See my branch now -- get_connection is the bulk of my progress.

I have no SQLAlchemy-fu whatsoever; I'm just learning as I go here. From what I've seen so far, I think the substantial refactoring will be worth the trouble for types and generality over SQL dialects.

To support old code that passes a MySQLdb.Connection (at least for the next version or two), I'm thinking our new read_sql and write_sql will hand off to legacy copies of these functions as they were before the SQLAlchemy refactor. Maybe we'll stash them in a separate file, sql_legacy_support.py. sqlalchemy.engine.Connection objects are just too different to support both inside the same function.

Can I look at your go at deriving SQL from the engine?

meteore

Hi all,

I just thought this is the right place to remind everyone of the major problem of SQL-to-numpy: performance. This message (2006) from Francesc Alted (with benchmarks) shows that intermediate conversion to Python objects degrades performance. Does anybody know whether there have been any developments in DBAPI since then?

jreback
Owner

@meteore performance is of course important.
I think getting the io correct (e.g. types inferred correctly) and dealing with flavors properly has to be the first goal.

optimization can be done for reading/writing at some point; however, SQL is not optimized for array access and by definition will be slower than, say, HDF5

meteore

Thanks Jeff, that is a good reference for order-of-magnitude comparisons; it seems that PGSQL could recover some of the lost ground, but also that the man-in-the-middle of conversion to Python data structures is still there.

jreback
Owner

@meteore absolutely, there are some optimizations

e.g. I have some cython routines that take a frame and replace NaN with None, for example

Andy Hayden
Collaborator

@danielballan this was the commit, but it was really just hacking around, although the test structure may be worth using (please take a look; the code in the main class obviously doesn't work yet, I haven't even tried to run it!).

I don't understand engine vs meta yet, so I'm not really sure what's going on (mainly with regard to finding all tables; they seem not to stay in sync with each other... not read up on it). Even something as simple as dropping all tables doesn't seem to work as expected on that front :S

But the idea seems to be that you describe your tables with the sqlalchemy syntax and it worries about writing and executing statements. My opinion is we should stay in SQLAlchemy syntax as much as possible:

In [63]: t
Out[63]: Table(u't', MetaData(bind=Engine(sqlite:///:memory:)), Column(u'x', INTEGER(), table=<t>, primary_key=True), Column(u'y', VARCHAR(), table=<t>), Column(u'z', VARCHAR(), table=<t>), schema=None)

In [64]: t.select()
Out[64]: <sqlalchemy.sql.expression.Select at 0x102e598d0; Select object>

In [65]: t.select().execute()  # prefer to use this
Out[65]: <sqlalchemy.engine.result.ResultProxy at 0x102e59ed0>

In [66]: str(t.select())  # and completely avoid this
Out[66]: 'SELECT t.x, t.y, t.z \nFROM t'

So perhaps read_frame and write_frame should accept sql.expressions... I suspect this will end up being so vastly different that we should throw the old sql stuff into a separate sql_legacy file (and not even try to if/else it all), or maybe even put this stuff in an sql_alchemy file, with the sql module just containing the logic for deciding whether to dispatch to sql_alchemy or sql_legacy?

Dan Allan

@hayd I like the test structure, and I'm with you on staying with SQLAlchemy syntax as much as possible for code within the module. I think write_sql can be done entirely within the SQLAlchemy syntax.

I would like to support user code like 'SELECT t.x, t.y, t.z \nFROM t'. Someone who comes to pandas with SQL experience but no real understanding of SQLAlchemy should be able to feed a select statement to read_sql.

Andy Hayden
Collaborator

@danielballan totally agree, excellent point.

Andy Hayden
Collaborator

@danielballan I think there also needs to be a way to do it from a connection (or cursor?) and not an engine (unless you can make an engine from either of those); this is also mentioned in the relevant PEP. Which perhaps makes it trickier.

Another syntax would be that read_sql could actually read a table name (just like write_sql), e.g. read_sql(name='t', engine=engine); wonder if we can do that from a connection... ?

Dan Allan

@hayd Absolutely. It looks like we can do it from a connection, maybe not from a cursor. (But that's probably OK.)

mangecoeur

@hayd since you were wondering about meta vs engine, first read these if you haven't:
http://docs.sqlalchemy.org/en/rel_0_8/core/engines.html
http://docs.sqlalchemy.org/en/latest/core/schema.html#describing-databases-with-metadata

Basically, engine replaces the "connection" used when you are using raw DBAPI access. It adds a level of abstraction, for example by handling connection pooling, timeouts, etc. One engine might be maintaining several connections in a pool. When you access the engine, it may be opening and closing connections to provide the db cursor - you don't need to know or care. It also keeps track of DB-specific information (such as SQL dialect stuff), which allows it to take expressions and turn them into SQL.

Metadata is a fairly different beast. It contains information about the tables themselves: column definitions, indexes, character encoding, table settings and references (in case you're not familiar, most modern DBs can handle columns whose values refer to rows in other tables, and check/enforce consistency on those references).

In short, you deal with Engine to connect to the DB and MetaData to manipulate tables. You could create one MetaData with some table definitions but apply it to several databases, by doing metadata.create_all(engine) with several different engines. The engine deals with turning everything into appropriate SQL.
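
A small sketch of that split (0.8-era API; the table and URIs are illustrative):

from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

meta = MetaData()
Table('t', meta, Column('x', Integer), Column('y', String))

# one set of table definitions, applied to two different databases
for uri in ('sqlite:///:memory:', 'sqlite:////tmp/other.db'):
    engine = create_engine(uri)
    meta.create_all(engine)  # each engine emits its own dialect's DDL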

Andy Hayden
Collaborator

@mangecoeur it seems giving an engine leads to being able to reflect a meta (i.e. get all the table info from the database, and therefore what the dtype would be), as well as generate SQL dialect stuff... so life is much easier if you're given an engine.

Is it possible to do both of those from just a connection (or to create an engine from a connection)?

mangecoeur

@hayd indeed, because the engine knows about the DB it's connected to, it can reflect lots of information from DB tables and generate MetaData (essentially reversing the process of creating metadata and applying it with an engine). You can actually reflect entire ORM object structures this way (have a look at SQLSoup). On a lower level, it makes it easy to get column types, which can be used to turn them into dtypes according to some kind of mapping (note that this isn't trivial for more complicated data such as complex numbers - you need some kind of serialization/deserialization convention/configuration).
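
For example, reflection can recover column types as the raw material for such a mapping (a sketch; the database file is illustrative):

from sqlalchemy import create_engine, MetaData

engine = create_engine('sqlite:///existing.db')
meta = MetaData()
meta.reflect(bind=engine)  # pull table/column definitions out of the DB
for table in meta.tables.values():
    for col in table.columns:
        print(table.name, col.name, col.type)  # basis for a type -> dtype mapping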

To my knowledge you can't create an engine from a connection, only a connection from an engine. Which makes sense, because an engine is an abstraction on top of a connection, and allows you to select between different DB drivers to create that connection.

I would simply follow the recommendation from before and let the function accept an engine or an sqlalchemy-compliant connection URI. While that's different behaviour from currently, it appears the consensus is to change the sql-related function names anyway. The old functions could be kept for legacy access, with deprecation warnings for 1 pandas release before being removed (I don't know if pandas has a deprecation policy yet).

jreback
Owner

@hayd / @danielballan

maybe you want to have a master PR for this, and then various other PRs can be incorporated, e.g. #4304

you might want to create a separate master branch to do this (conceptually easier that way, I think), that you track via a PR

Andy Hayden
Collaborator

I think an upstream/sql branch is probably a good/sensible idea.... I created upstream/sql for the purpose, please PR away. @danielballan

Wondering how to deal with #4304... if it's not going to make it into 0.12, should we stick it in the sql branch?

I think we should get the sql/sql_legacy separation into the codebase asap (after 0.12) to make contributions to legacy easier (there will be some merge conflicts from the above PR...)

jreback
Owner

that's fine, to separate legacy and not (maybe rename sql.py to sql_legacy.py), but I think you may need to have it in the same PR;
it's pretty tightly wrapped (as read_sql needs to figure out the dispatch)

Dan Allan

I segregated the files, and the tests pass, falling back where appropriate. I think I PRed into pandas/master by mistake. Still unclear how this went wrong or what I should do. I will return to this tomorrow.

Andy Hayden
Collaborator

@jreback wow, that's interesting, I guess we shouldn't forget how performance-critical we are (good to have something to compete with). Do you think memory footprint/speed could be improved later by cythonizing (could that speed up writing to ndarrays)? Do you think we can get it down to something similar? (I don't think SQLAlchemy necessarily impacts speed... unless it does a load of slow type conversion :S) Thoughts?

jreback
Owner

actually prob easier to support their version of pyodbc as an optional dep (use if available), with the new methods http://docs.continuum.io/iopro/pyodbc.html (and the deps for this are simple, NOT the whole numba stack)

API looks similar to the current one

hmmm....doesn't look open source.....

Andy Hayden
Collaborator

@jreback I'm confused: is this part of iopro free? (Either way we could support it.)

jreback
Owner

do you have a ref for it being free? looks like maybe it's only in iopro.....they should open source it.....

Andy Hayden
Collaborator

Ah no, I was/am confused; I think this is based off an open source project (which hasn't been updated in a couple of years), so I guess it's not free - the truth is in the name!

Hopefully cythonizing will get a similar speed boost.

Dan Allan

We should try to reproduce his "pandas" performance measurement, at least approximately, and then see about cythonizing to push it down. If I understand correctly, the bottleneck that iopro addresses is buried in MySQLdb and the like. Optimizing those connections, with cython or otherwise, is feasible but ambitious.

Andy Hayden
Collaborator

Related discussion: https://groups.google.com/forum/#!topic/pydata/NaBbrPZW8SE about "ExpressionFile: given SQL Expression Language, provide file-like CSV object", which uses chunksize to sql (think we should probably add chunksize to the TODO list... oop, it's already there :) ) ...

Also, perhaps (way future) could use eval's expression language...

hayd referenced this issue (closed): Sql alchemy tests #4475

Jeff Tratner
Collaborator

@hayd I think the current sql branch is py3-incompatible right now (e.g., the raise statements); okay if I push some changes to fix that? I also suggest that you split these out into different classes, so you don't need to do dict lookups in every function. (I'd put up a PR for that though.)

Andy Hayden
Collaborator

@jtratner I have some big changes coming, to get working (language-independent) tests, and I had to tweak the main code quite a bit to allow that. That said, there'll be scope for a big refactor once it's merged (with working tests)... :)

Andy Hayden
Collaborator

@jtratner it feels like there ought to be a clever OOP refactor, similar to Excel (?) etc., to take out the tangled con/cur/engine jumping.

Jeff Tratner
Collaborator

@hayd okay, I'll wait before working on any changes. And if you put together everything with tests... I'd be happy to refactor it afterwards.

Jeff Tratner
Collaborator

@hayd btw - one good thing would be to change the custom errors to inherit from something other than Exception (e.g., IOError or ValueError)

Andy Hayden
Collaborator

Tests pass with SQLite with SQLAlchemy in a language-agnostic way; there's a small bug in MySQL with NaN. Will push into the branch when this one passes :)

Needs a big ol' refactor though, which I wanted to discuss; one idea was to use an SQL class (similar to others):

class SQLHelper(PandasObject):
    @staticmethod
    def create(cur=None, con=None, engine=None):
        # factory, since an __init__ can't return a different object
        if engine is not None:
            return SQLHelperWithEngine(engine)
        # etc. for con and cur
        raise SomeError('Must be called with either con, cur or engine')

    def execute(self, *args, **kwargs):
        raise NotImplementedError

    # etc.

class SQLHelperWithEngine(SQLHelper):
    def execute(self, *args, **kwargs):
        pass  # has access to self.engine

class SQLHelperWithCur(SQLHelper):
    pass

class SQLHelperWithCon(SQLHelper):
    pass

Keep execute, read_sql etc as globals to be agnostic (i.e. they select which class to use), and work out which helper to use.

Maybe there's a better way to get cleaner code here?

Jeff Tratner
Collaborator

@hayd can you link to your current sql branch?

Andy Hayden
Collaborator

Sorry, here it is: https://github.com/hayd/pandas/tree/sql_tests

And the failing test is the one at the bottom (TestSQLA_MySQLdb .test_write_row_by_row): https://travis-ci.org/hayd/pandas/jobs/11314669

Andy Hayden
Collaborator

Perhaps we should pull the class discussion into a separate issue; atm we have functions. But I just wanted to mention all the functions:

execute
tquery (returns tuples)
uquery (returns rows updated)
read_sql (with a few aliases)
write_frame (with a few aliases)

The rest are basically all helpers, though perhaps execute, tquery and uquery could be considered helpers too... are write_frame and read_sql the only important ones?

Dan Allan

Agreed. I like your proposed class organization. Also, catching up with the discussion, I think those py3-incompatible raise statements are my fault. Did you fix them as part of your changes, or should I?

jreback
Owner

maybe do check_less_precise in queries? (this is obviously a 1-timer, just a heads up)

https://travis-ci.org/pydata/pandas/jobs/11424698

Andy Hayden
Collaborator

That's really weird! Noted.

I'm hacking through the new class structure atm; I've called it PandasSQL/PandasSQLWithEngine/... provisionally. Leads to much nicer code. But I'm sure @jtratner will rip it apart. :)

Andy Hayden
Collaborator

Please check out impl, I pushed to pydata/sql: https://github.com/pydata/pandas/blob/sql/pandas/io/sql.py

I've completely reworked the structure into classes, which I think makes the logic much easier to follow.

I think it still needs some work to clean up the connection magic, but once that's done it should be easy to put into the main functions (initially feed it "something" and return either cur, con or engine), and be much easier to understand.

Not passing all tests yet, so if you have any ideas on them would be awesome... sorry for taking so long!!

Thoughts?

Jeff Tratner
Collaborator

@hayd - what's the life cycle for engine/cursor/connection? Are you saving the wrapper classes or connection objects so you can use them multiple times, or is the idea that the user just passes something that gets generated each time? (e.g. for ExcelWriter, you can reuse it to store multiple frames in the same file). My impression is that there's not a need for that here. Do I have that right?

Andy Hayden
Collaborator

good question.

Engine is reusable. I think the lifecycle of connection should be that it's kept open (at the moment it's not really supported and cursor is used throughout; I really just refactored the code to make it clearer that that was what was happening). Perhaps the solution here is for connections to create a cursor for the PandasSQLWithCur, do stuff, and keep the connection open?

Dan Allan

+1 for "connections to create cursor for the PandasSQLWithCur, do stuff, keeping the connection open."

Andy Hayden
Collaborator

Cool, I think in the new structure this will be pretty straightforward to implement.

Re the API for the other connection stuff: con seems to accept quite a few things atm, so perhaps we should slightly break the API to make input more explicit while still allowing lots of different inputs:

Preferred way (first arg could be convenience to capture all of these...):

  • passing engine as either an engine or a url (as engine=...)
  • passing a connection (as con=...)
  • passing a cursor (as cur=...)
  • passing a sqlite filename (opens as engine=..., if SQLAlchemy installed, otherwise as connection con=...)

I'm not sure about this way, it might just be better to use a corresponding SQLAlchemy function (assuming one exists) instead and not allow this here, this only works with SQLAlchemy:

  • passing (dialect/flavor, driver, username, password, host, port, database) using _build_engine_url (which I accidentally deleted in my cleanup, will put it back in, we're missing some serious code coverage!)

So public functions would just be:

to_sql # perhaps to_sqldb is clearer...
read_sql # as above 
read_sql_table  # only works with engine... at least atm.

or maybe, if the first arg is a table name, read_sql should just work... and we only need two.

Jeff Tratner
Collaborator

I think we're more on the same page now. Seems like it'd make sense to have the public methods dispatch to helpers that return connections + any other data that they need (e.g., if you pass an engine, gives you info on tables, etc.), and then for the actual querying you only need one kind of class (except that I'd suggest separate classes for different dialects as necessary). I also think we should not make the classes public... putting them in the public API isn't that useful and makes it harder to refactor later on.

The upside of that is you could create separate classes for each flavor that define implementation-dependent parts and simplify the code quite a bit. [in particular, the dtype conversions]. Another advantage of classes per flavor is that it makes it easier to add a new engine (or modify the behavior of one of them to get better performance) without needing to alter code for existing engines or tweak their test cases. And if we decide that we mostly want to use SQLAlchemy, that's fine. That said, SQLAlchemy adds a layer of indirection (and, sometimes, slowness), so if it's easy to, say, handle sqlite separately and it performs better, we should do that. Worst thing that happens, we have a separate engine for sqlite and all the other flavors aliased to an SQLAlchemy wrapper.

Jeff Tratner
Collaborator

@hayd there are some issues with the code you linked (I think). For example, in execute, there's a global con referenced, which doesn't make sense. I also think that, instead of having an explicit rollback, we should have a try/except in write_sql and use a rollback there, making it much simpler overall.

Jeff Tratner
Collaborator

@hayd and others - I cleaned up io/sql on my own branch. Clearly it's not all the way there, because you need to dispatch to the various engines, and I guess I removed all the sqlalchemy stuff by accident - probably don't want to use the abc.abstractproperty stuff then, and instead just have BackendBase raise if the properties/methods aren't defined.

What I like about the setup is that it makes the differences (type names, table-creation syntax and the API for writing to a cursor) really apparent, while still letting you override things if necessary. Anyway, hopefully this gives you a sense of how splitting them up into classes and using a cursor for everything could be easier. I don't think it would be hard to use the same setup with an engine instead, but just showing you what I mean.

To get this to work fully, write_frame would need to catch exceptions and then roll back on the connection or engine (or let them bubble up and let the user handle it).

Jeff Tratner
Collaborator

I'll put this up as PR on io/sql so you can see what you think...

Andy Hayden
Collaborator

@jtratner there are lots of issues with my code! :) I like how you pulled the flavors out.

However, I do think it's a nice feature to be able to set up the connection/engine pandas object once, and query against it multiple times (without having to do engine=... or con=... each time). So I think having them public is a good thing (whilst hiding some implementation details)...

Andy Hayden
Collaborator

@jtratner Also, just wow!

Jeff Tratner
Collaborator

@hayd well, it turned out that it was pretty simple to cleanly wrap the engine implementation (and I think there's still more refactoring that can be done). I'm still not 100% clear on what types the engine and cursor can be. Today or tomorrow I'll have a chance to look over the test cases in current master, and that will give me a better sense of how this should be set up.

A few issues I want to keep in mind as we keep working on this:

  1. I also think there are some issues with not correctly escaping things in string substitutions. (i.e., in the create table and has table methods).
  2. the connection generation should probably be moved from a global method to the individual constructors.
  3. Allowing persistent connections. (and considering who has what responsibility for cleaning up)

The nice thing about how you originally set this up is that it's really amenable to the overall DB API.

Jeff Tratner
Collaborator

@mangecoeur @hayd @jreback @danielballan @garaud @changhiskhan - do any of you have test cases for this (or would you be willing to create them)? The ones in the existing test suite didn't really give good coverage of functionality (and many of them are pretty much just smoke tests for the driver). It'd be particularly helpful if we had some data we could use to seed the Travis databases, to use from read_sql.

Andy Hayden
Collaborator

So, I think we should provide an engine constructor function (which creates an engine url/pandas_sql object from dialect, username, password, etc.), but connection and cursor should be user-defined (the user should set them up themselves) - except in the special case of an sqlite file w/o sqlalchemy.

imo these are three distinct concerns, engine / conn / cursor... and tbh I liked my factory for creating these (I think it made the logic for persistence really clear - although we weren't using connections before, it was pretty easy to drop them in)...

  • Engine (methods create and close connections, does so within SQLAlchemy loveliness)
  • Connection (methods create and close cursors), using flavor info
  • Cursor, using flavor info

I think creating the pandas_sql object (or whatever) once and querying against that, without inputting the engine or connection for every method, is super useful (you can also have multiple dbs as different variables to query against), and I think these should definitely be "public" (we could pretend they are all PandasSQL objects in the repr anyway)...

Dan Allan

@hayd, can you clarify with an example how PandasSQL objects fit into the API? (P.S. I'm still game to be involved, with testing or whatever is needed, but as is now apparent, I'm not finding time to keep up with the conversation as fast as Team hayd & jtratner, so I'm content to take the back seat here.)

Andy Hayden
Collaborator

@danielballan In sql atm (which is different from @jtratner's branch!), you create it via a connection, cur or engine:

pandas_sql = PandasSQL(engine=..)  # obviously would use descriptive variable name for your db!
# pandas_sql = PandasSQL(con=..)

and the idea would be that you would query against that:

pandas_sql.read_sql('SELECT * ...')
pandas_sql.read_sql('table_name')  # is possible if it's an engine..
pandas_sql.to_sql(df, 'table_x')

You can still work with the old syntax too (which just calls the above):

sql.read_sql('SELECT * ...', engine=..) # or con=... or cur=...

I would propose that read_sql, to_sql and PandasSQL were the only things fed to io.api.

Dan Allan

I assume we also still support df.to_sql(table_name, con). Why shouldn't pandas_sql.read_sql('table_name') work on connections as well as engines?

Andy Hayden
Collaborator

@danielballan That would be neat, I just didn't know how to do it :)

Definitely, df.to_sql(table_name, con=...) would still be needed. But I guess we could also pass in the pandas_sql class here: df.to_sql(table_name, con=pandas_sql) (maybe using kwarg con for everything is bad..)

Jeff Tratner
Collaborator

@danielballan is it problematic to create an engine or connection separately and then pass it to read_sql every time? It only requires one extra argument...

Jeff Tratner
Collaborator

also, at everyone (I know I asked before) - test cases would be really appreciated!

Gagi

Are there any plans to support MSSQL flavor via pymssql?

Dan Allan

@jtratner I would say it is not problematic, and it resembles the usage familiar to MySQLdb users, FWIW.

The PandasSQL class suggested by @hayd invites a comparison with HDFStore (obviously). I think pandas_sql.to_sql and pandas_sql.read_sql seems a little redundant ("sql"). Could we devise something closer to store? Maybe this is a bad idea, but I want to put it into the discussion before the API congeals....

@dragoljub [EDITED -- I misread that at first] No, but it is reasonable to propose including this in the release. Any interest in helping?

Jeff Tratner
Collaborator

@danielballan where does the handling of the MSSQL flavor via pymssql come into the code?

Dan Allan

@jtratner Yeah, sorry, I was typing my edit when your reply came up. Misread MSSQL as MySQL, which I see we are supporting via pymysql as well as / instead of MySQLdb now.

Jeff Tratner
Collaborator

btw - is there any harm in always going through a cursor? Seems much easier to just do that in the actual querying and reading...

Andy Hayden
Collaborator

@danielballan pandas_sql is just a variable name; you'd call it something descriptive to do with the database's name, and not repeat the sql... but happy if we have something better.

Anything SQLAlchemy supports and travis can handle, we can add to the SQLAlchemy tests. We may be able to support MSSQL (travis-ci/travis-ci#216); certainly SQLAlchemy works with it.

Jeff Tratner
Collaborator

Also, @danielballan please check out this refactor I'm proposing - I think it should be a lot cleaner - https://github.com/jtratner/pandas/tree/sql . I'm just not clear on how you generate a connection.

Dan Allan

I'd say always using a cursor is preferred, and as far as I know there is never harm in it. Specifically, cleanup (closing the cursor) and transactional operations (e.g., rollback in the event of an exception) are more explicit.

I agree that pandas_sql.table_names (or something like that) is much better than using a string like pandas_sql.read_sql('table_names').

@jtratner, I like BackendBase and that structure. I agree some PandasSQL object in the API, associated with an engine and a persistent connection, is needed to avoid repeating connection parameters and, moreover, opening the connection each time, which can take ~ 1 s. A connection should be opened when the Backend class is instantiated. I believe this will be generic (i.e., can be in BackendBase) with a special case for sqlite.

Andy Hayden
Collaborator

@danielballan Ah, so you want the tables as properties like DataFrame columns? That's a really great idea.

@jtratner I think there is some happy middle ground between our two implementations (they are very similar!); can we mix in the flavor and whether it's connection/cursor (and possibly engine)? Essentially a connection wraps each cursor method call by first creating a cursor (perhaps we could make engine work the same way, but I'm not sure whether we'd lose some SQLAlchemy power by doing that).

I like having SQL in the class name to describe what it does (although I wouldn't use it in variable names in real life!)... so BackendBase doesn't really do it for me. HDFStore has HDF in the name; what about SQLStore? Now I think about it, SQLsomething will make it easier to find (PandasSQL is a silly name).

Dan Allan

Yes, to be concrete, pandas_sql.tables would return ['some_table_name', 'another_table_name']. And maybe pandas_sql['some_table_name'] could effectively do SELECT * FROM some_table_name, for consistency with HDFStore. I prefer SQLdb over SQLStore.

Jeff Tratner
Collaborator

@hayd I'd actually written out something similar to what you were discussing above.

Let's get all of this working with cursors first, then. Once we have that part working, it's very simple to wrap the engine part. In fact, I'd imagine an engine class that instantiates a cursor each time and then instantiates the appropriate backend, i.e. some class like the following, which can be used as a context manager that rolls back on any exception (just need to fill in a bit of logic around creating connections):

class PandasSQLConnection(object):
    """Wrapper that maintains a connection and delegates to a particular backend based on flavor"""
    def __init__(self, flavor='sqlite', *args, **kwargs):  # connection parameters
        self.flavor = self.create_connection(flavor, *args, **kwargs)
        if self.flavor not in flavor_mapping:
            raise ValueError("Unknown flavor %s" % flavor)
        else:
            self.backend = flavor_mapping[self.flavor]
    def read_sql(self, table_name, *args, **kwargs):
        reader = self.backend(self.cursor())
        return reader.read_sql(table_name, *args, **kwargs)
    def write_frame(self, df, table_name, *args, **kwargs):
        writer = self.backend(self.cursor())
        writer.write_frame(df, table_name, *args, **kwargs)
    def create_connection(self, flavor, *args, **kwargs):
        # try sqlite
        # try postgres/other special-cased connections
        # try SQLAlchemy (in which case return 'sqla')
        self.con = con  # I'm pretty sure that SQLAEngine and Conn will work similarly
        return flavor
    # separating this out b/c not sure if SQLA needs to work differently,
    # or has to deal with some other non-standard weird thing
    def cursor(self):
        return self.con.cursor()
    def rollback(self):
        return self.con.rollback()  # this should always be available, but may not do anything
    def close(self):
        return self.con.close()
    def commit(self):
        return self.con.commit()
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.commit()
        self.close()  # sends a rollback if we haven't committed
        return False  # bubble up the exception

The nice property of the above class is that the backends only ever have to handle cursors, you can use the class in a with statement, and it has the potential to make it dead simple to connect to sql databases... It also means that BackendBase never has to worry about committing changes.

This does add some complications, and after writing up the class above, I'm thinking we may be able to add a few methods to BackendBase to encapsulate it.

Jeff Tratner
Collaborator

Also, I'm really unconvinced that sql should work like hdf5 here. SQL already has a full query language and a lot of flexibility and you can already accomplish querying by passing a cursor. Why is it helpful to reimplement or wrap things that other libraries (like SQLAlchemy and specific drivers) already provide?

Andy Hayden
Collaborator

@jtratner No, sorry, I wasn't saying it should work with HDF5!

Can live with SQLConnection (drop the Pandas) :)

Jeff Tratner
Collaborator

Not work with, but work like (i.e., I don't want us to wrap querying and such like PyTables does).

Also, @danielballan here's a potential get/setitem; it just means setting whatever args or kwargs need to be set ahead of time.

def __getitem__(self, name):
    return self.read_sql(name)
def __setitem__(self, name, df):
    return self.write_sql(df, name)
Andy Hayden
Collaborator

Also, one of the key things is that read_sql accepts sql! So it's sql_statement_or_table_name (if it's a table name then we write the sql statement for the user).

Would be cool to have tab complete for table names (or am I being far too fussy :) ), so on init we'd run a query to get all the tables, then add them as properties and/or get items.

Jeff Tratner
Collaborator

@hayd you'd do what DataFrame does: add them to the list returned by __dir__ (i.e., the default dir() contents plus list(self.tables)) and then override __getattr__ to first try the normal attribute lookup, then check whether the name is a table.
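
A minimal sketch of that idea (all names hypothetical; read_sql here is whichever reader we settle on):

class SQLStore(object):
    def __init__(self, con):
        self.con = con
        self.tables = []  # placeholder: fill by querying the DB for table names

    def __getitem__(self, name):
        return self.read_sql(name)

    def __dir__(self):
        # expose table names for tab completion alongside normal attributes
        return sorted(set(dir(type(self)) + list(self.__dict__) + list(self.tables)))

    def __getattr__(self, name):
        # only reached when normal attribute lookup fails
        tables = self.__dict__.get('tables', ())
        if name in tables:
            return self[name]
        raise AttributeError(name)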

Andy Hayden
Collaborator

Thinking again about this, perhaps SQLStore is better than SQLConnection since it's not a connection object...

jreback
Owner

how's the sql stuff coming?

jreback
Owner

@hayd what's the SQL status?

Chris Morgan

I too would be keen to hear news on this, with a view to contributing.

Steve Pike

I think there might be a bug in the postgres flavour - it calls con.commit() in read_frame(), which is not always usable (since you can be operating in a transaction-free environment). I guess it should catch the TransactionManagementError and carry on instead?

Dan Allan

@stringfellow Yes, I think that's the way to go for now. We'll be more careful with transactions in this rewrite.

We have two possible designs to reconcile, @hayd's PandasSQL and @jtratner's BackendBase. Once the bones are in place, I think it would be possible to divide tasks among more contributors. @jtratner, were you going to pull together a branch merging these designs?

Jeff Tratner
Collaborator

@stringfellow I think what we're going to do is move all the transactional handling to the user if they pass in a connection, and if we create the connection, we'll start off a transaction (where possible) and always roll back if there's an error.
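
Roughly, that policy would look like this (a sketch of the proposal, not actual pandas code):

from contextlib import contextmanager

@contextmanager
def _sql_transaction(con, created_by_pandas):
    # only manage transactions on connections pandas itself created
    try:
        yield con
        if created_by_pandas:
            con.commit()    # our connection: commit on success
    except Exception:
        if created_by_pandas:
            con.rollback()  # our connection: roll back on any error
        raise               # user connection: transaction state is theirs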

Jeff Tratner
Collaborator

The branch I posted as a PR pretty much already does all the merging of the designs (because it converted what @hayd had already done :smile:), except that it's not hooked into the connection-creating function - if I have time this weekend, I'll try to work on that.

My issue is tests - we don't have enough of them, and that makes me concerned about correctness. Definitely going to keep working on this, but if worst comes to worst, we can add this as an experimental module and explicitly make no promises about it working correctly.

Steve Pike

@jtratner @danielballan Great! Thanks for replying! I don't know if it is useful, but as a use case - I'm using the connection object supplied by Django (and doing some raw SQL with it).

Steve Pike

@jtratner how do you want it? I guess it will be a whole Django project - I can't attach a tar here, can I?

Steve Pike

@jtratner Actually - I think I'm just not doing it right. For anyone who gets here via some search, this only really affects you when you are in Django's shell, and you can get round it like this:

>>> from django.db import transaction
>>> transaction.enter_transaction_management()
>>> transaction.managed(True)
>>> transaction.is_managed()
True

I might stackoverflow this one too, just in case. Thanks folks!

Andy Hayden
Collaborator

@jtratner perhaps one option is to trigger an sql travis build, similar to "travis faster please" (say, if the commit message contains sql), where it installs dbs, sets up seeds, etc.?

Dan Allan

Note to self/selves: I saved a DataFrame to a SQL database, and I wish I could query rows based on location. The SQL database implicitly creates a unique, sequential primary key. Pandas should create it explicitly so that we have access to it and can retrieve chunks by location.

Jeff Tratner
Collaborator

I don't want to hold this up and I don't think I'm going to have time to finish up the changes I proposed any time soon. @danielballan or @hayd do you want to take the lead on this? If you want to incorporate that branch I proposed go ahead, your call.

Dan Allan

Maybe. What's the trajectory in light of this and Wes's interest in a "low-level integration with ODBC"?

Chris Morgan

Why not rely on SQLAlchemy to manage SQL dialects and connectivity? Then ODBC integration is handled with something like the following (the raw ODBC string has to be URL-encoded into an odbc_connect parameter, since create_engine expects a URL):

import urllib  # urllib.parse on Python 3
import sqlalchemy

params = urllib.quote_plus("DRIVER={FreeTDS};Server=my.db.server;Database=mydb;"
                           "UID=myuser;PWD=mypwd;TDS_Version=8.0;Port=1433;")
sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params).connect()

Dan Allan

@cmorgan We were (maybe are) headed in that direction. We drafted some code that introduced a SQLAlchemy dependency, handling engines and so on. The idea is to rely on SQLAlchemy as much as possible, but also accept user-defined connections made through sqlite3, MySQLdb, etc. (At minimum, we need to support those or we'll break old code.) SQLAlchemy handles data types from different flavors, and that would save us a lot of trouble.

But the links above suggest that (1) SQLAlchemy is lacking in performance, and pandas is all about speed; and (2) there is some interest in a more ambitious, low-level solution.

mangecoeur

@danielballan @hayd @jtratner Just wondering what the current status of SQL support is? I'm finally back on this topic and am available to help out. What is the most up-to-date branch? I cloned @hayd's "sql_tests_amazing" since it seems to have been updated recently. Will give it a whirl in real-world use and see how it holds up.

mangecoeur

I forked the sql branch of the main repo: https://github.com/mangecoeur/pandas/tree/mangecoeur_sql

I've cleaned up the current sql.py file and worked with it in SQLAlchemy mode, but not yet in legacy mode. You can go look at my changes.

Andy Hayden
Collaborator

@mangecoeur I'm not really sure where to see your commit(s)?

...I'll rebase the pandas/sql branch. I have done a little more on this, but I don't think I was able to get it passing the MySQL tests, so more eyes would be great. I should really go through my dead branches!

The status is that this was left in the air with two possible ways forward: an SQLStore (a la HDFStore) or BackendBase (@jtratner)...?

mangecoeur

@hayd ah, forgot to push my commits: https://github.com/mangecoeur/pandas/commits/mangecoeur_sql
I mostly just refactored things that were in the pandas/sql branch to be cleaner and removed unused code. Because I actually need this for work, I didn't test much - just made sure the stuff I use actually works (and it does, beautifully!)

I personally would not recommend pushing for an SQLStore OR BackendBase at this point; instead, focus on tests, API, docs, and getting this into a pandas release ASAP rather than blocking on extra features - we have the read/write functions and a nice flexible system for handling SQL backend support.

I would go so far as to make SQLStore/BackendBase a new ENH proposal that extends this one - the nature of SQL stores means that while something like an HDFStore is useful in some situations, it's just one use case on top of the lower-level ability to read/write tabular data from an SQL query.

Also, the SQLAlchemy-specific "read a table by name" function is awesome (as the code comment says), though we need to choose a name - in my version it's just "read_table", but we need something that's clear and concise, and docs to go with it.

Andy Hayden
Collaborator

Cool, will take a look over the next couple of days. Agree we should merge soon after the 0.13 release.

Since read_sql takes a string of SQL, we could just check first whether it's a table name; if it is, then use the awesome function. Then there's no need for a separate global function?
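
A sketch of that dispatch (read_table is the function on the branch; read_query is a hypothetical stand-in for the existing query path; engine.has_table is SQLAlchemy's table-existence check):

def read_sql(sql_or_table_name, engine, **kwargs):
    # if the string names an existing table, use the reflection-based reader
    if engine.has_table(sql_or_table_name):
        return read_table(sql_or_table_name, engine, **kwargs)
    # otherwise treat it as a plain SQL statement
    return read_query(sql_or_table_name, engine, **kwargs)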

Re names, I hate the names read_sql and to_sql (I find them really confusing); I think we should change them in future to read_db and to_db (perhaps it's just me who finds them confusing)...

mangecoeur

Looking again over the code there's still API cleanup to do. This includes:

write_sql should be to_sql to match up with the actual docs! http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#v0-12-0-july-24-2013

The "retry" option is pointless. IMHO this doesn't handle a general use case and it should be up to the user to manage errors in their DB connection - if it failed once, why should it succeed the second time? And if retrying works, why try only twice? It's pretty arbitrary. If a user thinks their DB connection is flaky, they need to handle that appropriately. On top of that, it's only half implemented.

The legacy handler for the "append" option appears to handle a case that was never documented (or working) in the first place; I think we can safely nuke this.

The read_sql function accepts a bunch of username/password/driver info but doesn't do anything with it at the moment. IMHO this is already covered by the DBAPI for creating a connection, is DB-driver specific (e.g. some MySQL drivers accept a "password" arg, others "passwd" - do we really want to deal with that?), and so is redundant for us to include - by just accepting a connection object we let the DBAPI implementation handle it.

The option to supply either a connection or a cursor is causing some confusion in the code - we need to think about the DB connection lifecycle and decide how we want this to work. Do we expect the user to just provide connections while we manage the cursors, or do we let the user manage cursors (e.g. for custom parallel handling)? Does handling both make sense, and does the implementation actually support it?

I don't think the "read table by name" function should be merged with the others, since it depends on SQLAlchemy. I don't see any advantage in merging it either; it means combining too many different bits of functionality into one place (harder to test), making the function more complex for a user to understand rather than easier.

Am working my way through these in my repo, feel free to change, comment, etc

mangecoeur

Update on connection/cursor (non-SQLAlchemy mode only) -
It's good practice to allow users to use both the top-level utility functions and the classes that actually provide the functionality.

With that in mind, it makes sense to think of PandasSQL as a wrapper around either the SQLAlchemy engine or the DBAPI connection.

In the DBAPI case it should be the connection, because cursors get "used up" when you fetch data from them. So if you created a PandasSQL object wrapping a cursor, you could only use it once:

pd_sql = PandasSQL(cursor, ...)
a = pd_sql.read_sql(...)  # fine
b = pd_sql.read_sql(...)  # doesn't work, because the cursor has already executed and fetched

on the other hand, if PandasSQL wraps the connection:

pd_sql = PandasSQL(connection, ...)
a = pd_sql.read_sql(...)  # fine
b = pd_sql.read_sql(...)  # also fine, because we get a new cursor for the new function call

Therefore we need PandasSQLWithCon instead of PandasSQLWithCur. It may be possible to still accept a cursor in the utility function if we can get the connection from it.
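
(PEP 249 lists .connection as an optional cursor extension, so one possibility is a helper like this - name hypothetical, and it returns None for drivers that don't implement the extension:)

def _as_connection(con_or_cur):
    # connections have a cursor() factory; cursors sometimes expose .connection
    if hasattr(con_or_cur, 'cursor'):
        return con_or_cur
    return getattr(con_or_cur, 'connection', None)  # optional DBAPI extension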

Please let me know if I'm wrong about this...

Tony Narlock

Hey. As an update on the SQLAlchemy front, it's on GitHub now at https://github.com/zzzeek/sqlalchemy.

#2717 mentioned something about incorporation of sqlalchemy. Are there any other places where potential sqlalchemy<->pandas development is taking place that aren't mentioned here? @zzzeek created https://bitbucket.org/zzzeek/calchipan/ before. It would be good if people involved with pandas and sqlalchemy knew about this; they may be interested. :8ball:

y-p

calchipan is about using a pandas dataframe as a backend database; this issue is about pandas grabbing data from databases (and writing back?).

Seeing as SQLAlchemy is an ORM, what's the selling point for it here? Most of that abstraction layer doesn't seem to apply.

Also, what's the pain point in supporting multiple databases? Are SQL flavors that different for the limited scope of pandas<->db interaction? Divergent type support across dbs?

I'd appreciate a concise description of what this issue is trying to tackle - is there a consensus?

mangecoeur

@y-p SQLAlchemy is not just an ORM - it consists of a Core and an ORM part. The Core is what we use; it provides "SQL rendering engine, DBAPI integration, transaction integration, and schema description services".

SQL flavors ARE significantly different - different native type support, subtle SQL syntax differences, string escaping rules, text encoding handling, etc. These alone are complicated enough to warrant the use of SQLAlchemy; I speak from experience.

Using SQLAlchemy's engine to manage connections also frees the user from a lot of the subtleties of connection and transaction life-cycles (e.g. dealing with connection pool recycling on MySQL).

SQLAlchemy also provides db and table reflection tools, which are very useful and whose implementation varies a lot between databases. This makes it possible to automatically create SQL tables based on a pandas DataFrame and, likewise, to build DataFrames from named SQL DB tables and get the correct types - this is only realistically possible using SQLAlchemy; see the read_table function in the sql.py on my branch:

https://github.com/mangecoeur/pandas/blob/mangecoeur_sql/pandas/io/sql.py
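
As a rough illustration (not the branch's exact code; the db path and table name are made up), reflection with the SQLAlchemy API of this era looks like:

from sqlalchemy import create_engine, MetaData, Table

engine = create_engine('sqlite:///mydata.db')
meta = MetaData(bind=engine)
# autoload=True introspects column names and types from the live database,
# which is what lets read_table map SQL types onto DataFrame dtypes
table = Table('some_table', meta, autoload=True)
for col in table.columns:
    print("%s: %s" % (col.name, col.type))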

SQLAlchemy also provides a lot of scope for building more advanced SQL-db features in the future, such as mirroring the HDFStore functionality provided via PyTables.

In order to maintain legacy support, limited functionality is provided for sqlite, mysql, and postgres - I suggest placing these in maintenance-only mode as a fallback for people without SQLAlchemy installed.

That said, SQLAlchemy is trivial to install and is included in all the scientific/data-oriented Python distributions that I know of (EPD/Canopy, Anaconda, PythonXY, SAGE) - if you can install Pandas, you can install SQLAlchemy.

y-p

Good. What's ready for 0.14?

mangecoeur

My branch works in practice, but is currently a bit short on automated tests.

It probably also needs a little code cleanup and possibly some deprecation warnings on old function names.

I'm writing some basic tests now, which I will push when I can.

y-p

Ok, and with your branch pandas will be able to...?

mangecoeur

Pandas will be able to do the following with any DB supported by SQLAlchemy (currently in core: Drizzle, Firebird, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, SQLite, Sybase): read the result of a SQL query into a DataFrame (read_sql), write a DataFrame out to a database table (to_sql), and read a named table straight into a DataFrame with the correct column types (read_table).
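
For example (a sketch against the API as described in this thread; the table name is made up):

from pandas import DataFrame
from pandas.io import sql
from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')    # any SQLAlchemy-supported DB
df = DataFrame({'a': [1, 2], 'b': ['x', 'y']})

sql.to_sql(df, 'demo', engine)                  # create the table and write rows
roundtrip = sql.read_sql('SELECT * FROM demo', engine)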

y-p

That's great. Are there docs ready?

mangecoeur

There are api docs but they need checking and cleanup. There are not yet sufficient manual-style docs.

y-p

API docs meaning docstrings? Or is there a new section in the manual that just needs more work?

Are those ~5 commits at the tip of mangecoeur_sql everything, or did you merge other bits not in master (other PRs) into your branch?

mangecoeur

Yes, API docs meaning docstrings. The manual section:

http://pandas.pydata.org/pandas-docs/stable/io.html#sql-queries
also needs updating, since it doesn't reflect the updated API (already introduced in 0.13, I think) which uses read_sql and to_sql for consistency with e.g. the CSV API.

My changes are based on master (last merged 2013-12-04) with no other PRs. I have changes I haven't pushed yet.

mangecoeur

Update: basic tests and docs are now in place on my branch. They still need some review. If we can get this into beta builds of pandas so that more people can try it out, that would be great!

y-p

That's great, good job.
If everything checks out we can perhaps get it in as experimental early in 0.14.

The missing bit for me is how your branch stands in relation to all the rest of the sql discussions and PRs scattered around the tracker; I lack perspective.

jreback
Owner

@mangecoeur it prob makes sense to make this a new io module so we can leave the existing one in place as well

this could involve different function names or an option (io.compat)

mangecoeur

@jreback these updates should be backwards compatible, so there's no need for two modules. If sqlalchemy is not installed, or a DBAPI connection is provided instead of an engine, it automatically falls back to legacy mode and issues a warning.
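
Something like the following dispatch (a sketch of the behaviour described; the builder and class names here are placeholders, not the exact code on the branch):

import warnings

try:
    import sqlalchemy
    _HAS_SQLALCHEMY = True
except ImportError:
    _HAS_SQLALCHEMY = False

def pandas_sql_builder(con, flavor='sqlite'):
    # engines take the full SQLAlchemy path; DBAPI connections (or a missing
    # sqlalchemy install) fall back to the legacy sqlite/mysql code
    if _HAS_SQLALCHEMY and isinstance(con, sqlalchemy.engine.Engine):
        return PandasSQLAlchemy(con)
    warnings.warn("Falling back to legacy mode; pass an SQLAlchemy engine "
                  "for full flavor support")
    return PandasSQLLegacy(con, flavor)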

If you think there may be a functional difference, the best thing to do would be to write a test for the currently documented API in the current version of pandas and run that same test against the new version.

I have already made tests for read_sql, to_sql, and execute in legacy mode, so feel free to have a look and see if you can think of e.g. error conditions that need to be caught.

I based my version on the various work found in this thread (I had some difficulty doing pulls/merges from other forks, so some of it was just copy-pasted), so in theory it should represent some semblance of consensus around design and API.

jreback
Owner

@mangecoeur that is gr8 then! prob want to have an existing ci/requirements build that does NOT use sqlalchemy for compat purposes (maybe the 2.6 and 3.2 builds)

mangecoeur

@jreback ok, so the tests will automatically skip the SQLAlchemy-dependent parts if it is not installed. I've added sqlalchemy as a requirement for python 2.7 and 3.3 only; seems to be passing everywhere.

Docs are now updated as well, and the doc build is tested.
Anyway, I'm pretty content with the state things are in; issuing a PR shortly...

mangecoeur

Pull request: #5950

jreback
Owner

@jorisvandenbossche want to close this and open another issue with the things that are not covered by #6292 (if they are relevant at all), but that we maybe want to address later?

Joris Van den Bossche
Collaborator

@hayd You opened this issue - what do you think? Do you have a clear view of what on this list is still relevant?

Andy Hayden
Collaborator

Let's close and move parts into #6292; this issue has got very long.

Andy Hayden hayd closed this