Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

postgresql / sqlite / mysql /oracle regular expression operators #1390

Closed
sqlalchemy-bot opened this issue Apr 23, 2009 · 59 comments
Closed

Comments

@sqlalchemy-bot
Copy link
Collaborator

sqlalchemy-bot commented Apr 23, 2009

Migrated issue, originally created by Anonymous

match operator not implemented

in sqlalchemy/databases/postgres.py 727
just add
sql_operators.match_op: lambda x, y, escape=None: '%s ~ %s' % (x, y) + (escape and ' ESCAPE '%s'' % escape or ''),

and it works for me
CAVEAT sql might crash if improper regexp syntax entered.

  • missing imatch / notmatch ....

(these matching functions are still used with same syntax)
http://www.postgresql.org/docs/7.4/interactive/functions-matching.html


Attachments: 1390.patch


Edit 2020-07-10

Coping this recap from #5447 (comment)

The call column.regex_match('[a-z]*') would then evaluate to:

  • Postgres: column ~ "[a-z]*"
  • SQLite: column REGEXP "[a-z]*"
  • MySQL, MariaDB: column REGEXP "[a-z]*"
  • Oracle: REGEXP_LIKE(column, "[a-z]*")
  • MSSQL: throw exception
@sqlalchemy-bot
Copy link
Collaborator Author

Michael Trier (@empty) wrote:

Except you'll have to specify something other than match_op since we're using that for the Full Text Searching.

@sqlalchemy-bot
Copy link
Collaborator Author

Anonymous wrote:

Now I see Indeed.

http://www.postgresql.org/docs/8.3/static/textsearch-features.html

You're right. It works better by using fts anyway.

Thanks.

''cherry on top out of initial topic''

  • Is it possible without calling pure sql in a close future to get the search rank / order by rank implicit (don't seem trivial too me) ?

  • is there a doc on the clean way to implement the "basic" regexp search by declaring an operator somewhere ?

@sqlalchemy-bot
Copy link
Collaborator Author

Michael Bayer (@zzzeek) wrote:

demo (using attached patch):

from sqlalchemy import *
from sqlalchemy.dialects import postgresql

metadata = MetaData()

t = Table('t', metadata,
    Column('x', postgresql.VARCHAR(30))
)


print t.c.x.regexp("foo").compile(dialect=postgresql.dialect())
print (~t.c.x.regexp("foo")).compile(dialect=postgresql.dialect())
print (~t.c.x.iregexp("foo")).compile(dialect=postgresql.dialect())

@sqlalchemy-bot
Copy link
Collaborator Author

Changes by Michael Bayer (@zzzeek):

  • removed labels: low priority
  • changed milestone from "0.x.xx" to "0.8.xx"
  • changed title from "match / imatch / notmatch / notimatch for postgres" to "postgresql regular expression operators"

@sqlalchemy-bot
Copy link
Collaborator Author

Changes by Michael Bayer (@zzzeek):

  • changed milestone from "0.8.xx" to "0.9.0"

@sqlalchemy-bot
Copy link
Collaborator Author

Changes by Michael Bayer (@zzzeek):

  • changed milestone from "0.9.0" to "0.9.xx"

@sqlalchemy-bot
Copy link
Collaborator Author

Changes by Michael Bayer (@zzzeek):

  • changed milestone from "0.9.xx" to "1.0.xx"

@sqlalchemy-bot
Copy link
Collaborator Author

Changes by Michael Bayer (@zzzeek):

  • edited description

@sqlalchemy-bot
Copy link
Collaborator Author

Changes by Michael Bayer (@zzzeek):

  • changed milestone from "1.0.xx" to "1.1"

@sqlalchemy-bot
Copy link
Collaborator Author

Michael Bayer (@zzzeek) wrote:

nobody is looking for these much, keep pushing them...

@sqlalchemy-bot
Copy link
Collaborator Author

Changes by Michael Bayer (@zzzeek):

  • changed milestone from "1.1" to "1.2"

@sqlalchemy-bot
Copy link
Collaborator Author

Michael Bayer (@zzzeek) wrote:

the current gerrit for this is at https://gerrit.sqlalchemy.org/#/c/101/ . this is largely ready-ish.

@sqlalchemy-bot
Copy link
Collaborator Author

Michael Bayer (@zzzeek) wrote:

1.3 for the moment, the gerrit needs some work and this is not crtiical

@sqlalchemy-bot
Copy link
Collaborator Author

Changes by Michael Bayer (@zzzeek):

  • edited description

@sqlalchemy-bot
Copy link
Collaborator Author

Changes by Michael Bayer (@zzzeek):

  • changed milestone from "1.2" to "1.3"

@sqlalchemy-bot
Copy link
Collaborator Author

Michael Bayer (@zzzeek) wrote:

still not critical

@sqlalchemy-bot
Copy link
Collaborator Author

Changes by Michael Bayer (@zzzeek):

  • changed milestone from "1.3" to "1.5"

@sqlalchemy-bot sqlalchemy-bot added this to the 1.5 milestone Nov 27, 2018
@zzzeek zzzeek changed the title postgresql regular expression operators postgresql / sqlite / mysql /oracle regular expression operators Jul 10, 2020
@zzzeek
Copy link
Member

zzzeek commented Jul 10, 2020

note from #5447 this is also possible on other backends:

Postgres: column ~ "[a-z]*"
SQLite: column REGEXP "[a-z]*"
MySQL: column REGEXP "[a-z]*"
Oracle: REGEXP_LIKE(column, "[a-z]*")


@multimeric
Copy link

I'm happy to look into this. Can you explain how the Gerrit patch relates to this?

@zzzeek
Copy link
Member

zzzeek commented Jul 10, 2020

I would say you can look at that patch to see if there's anything relevant to use. it proposes new operators for PostgreSQL's version of CHAR, VARCHAR, and TEXT, but if these operators are generally available on many backends, it would instead be made as part of the base String class in sqlalchemy/sql/sqltypes.py .

as far as contributing code we use the normal pull request process. I merge code through gerrit but outside contributors don't need to worry about it, we synchronize to your pull request.

@zzzeek
Copy link
Member

zzzeek commented Jul 10, 2020

also I would note we've written various kinds of background on helping with development at https://www.sqlalchemy.org/develop.html.

This is certainly an important feature that we want to do, so if you arent able to work on it, no worries it will get done eventually.

@multimeric
Copy link

Great! Thanks for the info, I'll have a look. I'd certainly like for the regex operators to be available generally, but I'm worried that tiny differences in the regex implementations will make this tricky. For instance, I also would like regex flags (e.g. case insensitivity) to be implemented, but some backends, like SQLite, don't implement flags at all. How do you think I should handle this?

@multimeric
Copy link

I think they're sufficiently similar that we should try to unify them. For example, i in all backends means case-insensitive, c means case-sensitive and m means multiline. But you need nm in postgres to have equivalent multine behaviour to m in MySQL and Oracle.

@multimeric
Copy link

Since its really only those two/three settings that are commonly controlled by flags, we could just add some boolean arguments to the match function that abstract away the flags themselves, for example def regexp(pattern, case_insensitive=False, dot_newline=False, multiline=False)

@zzzeek
Copy link
Member

zzzeek commented Jul 13, 2020

I don't think we've ever had a new feature with such a great roundtable discussion going on to make sure we get it right (and that I can other things done while it happens! :) ) thanks all !!!

@CaselIT
Copy link
Member

CaselIT commented Jul 13, 2020

I'm 👎 on the boolean argument to specify the flags, since they are hard to change/scale. In case of regexp is not unusual to specify the flag as a string, so I would propose to just use a string that is rendered in the sql.

@multimeric
Copy link

multimeric commented Jul 13, 2020

In theory I would prefer that too, but as discussed above, the multiline behaviour is quite different between DBMS. If you called column.regexp('pattern', flags='m'), it would invoke ".-matches-newline" and "$ matches on linebreaks" for Postgres, but only $ matches on linebreaks" on all other backends.

I can't think of a better way to provide abstraction without using boolean arguments. Python's re module uses bitwise flags for this behaviour, but that's less user-friendly, if a little more scalable.

@CaselIT
Copy link
Member

CaselIT commented Jul 13, 2020

I believe that the same argument could be made for 'pattern'. I don't think we can assume that they are the same in all backends, so I'm not sure if it would be a pro to have the flag that are backend independent but the pattern is not.

Of the two I would prefer something like column.regexp('pattern', flags='m', postgres_flags='nm', somedb_pattern='PATTERN') to have the ability of specifying backend dependent flags.

Also regarding flags, if multiline=True gets translated to 'mn' in pg, expect someone to want only 'm' or only 'n'.

That said, I'm not sure we need to make it backend independent, at leas as a first implementation.

@jvanasco
Copy link
Member

jvanasco commented Jul 13, 2020

what if the regex operator just accepts the core arguments, and there is an ancillary operator/factory for customizing the operations via a dict or callable... so the experience is more inline with the custom-compiles?

for example...

column.regexp('pattern', flags='m')

and

column.regexp_factory({"postgresql": ['pattern', flags='m'],
					  "mysql": ['pattern', flags='m'],
					   })

or

def regexp_factory(backend):
     return pattern, flags
column.regex(callable=regexp_factory)

@multimeric
Copy link

I believe that the same argument could be made for 'pattern'

Not exactly, there is a common subset of all these regex implementations, which I think would cover 95% of cases. The same isn't true for flags, since the multiline flags are quite different

Both of the postgres_flags suggestions make sense, but I worry that we aren't providing enough abstraction if we do that, which I would have thought was one of the main purposes of SQLAlchemy. The average user says "I want to match this column in multiline mode", and doesn't want to have to look up the database manuals for every single DBMS in order to achieve this portably.

@zzzeek
Copy link
Member

zzzeek commented Jul 14, 2020

since we have a lot of flags and things here, the interface might best be done in the style of "variant",a little tricky to get it right, but it looks like:

 col.regexp(pattern, flags).with_variant("mysql", pattern, flags) ...

i haven't looked closely at the discussion here so don't fret if I'm missing the point entriely

@CaselIT
Copy link
Member

CaselIT commented Jul 15, 2020

Is there an helper/mixin for the with_variant functionality? I believe it is used already in some apis

@zzzeek
Copy link
Member

zzzeek commented Jul 15, 2020

it's very specific to TypeEngine. the approach can potentially be emulated.

for the moment we could move forward with this without doing all that, however. most people using regex are probably targeting just one database.

@multimeric
Copy link

As I've mentioned elsewhere, I think it's quite important that we do come up with a portable solution. The whole reason I've investigated this issue is actually because I wanted to do a regex match using the same codebase on Postgres and SQLite. If I wanted to only support Postgres I'd use the op("~").

I don't mind the with_variant(). The only problem, as I've said, is that it still involves actually knowing the differences between pattern/flag behaviour on each DBMS which your average user isn't aware of. Still, doing this and linking to the appropriate documentation would be a good start.

@jvanasco
Copy link
Member

@zzzeek'swith_variant style does the same as my idea, but it more SQLAlchemy-like, so I'm +1 for it.

@TMiguelT writing a regex that works across multiple database backends is really outside the scope of what SQLAlchmey does or promises to do. I usually use SQLite for unit-tests and Postgres for integrated tests and production – my projects are filled with custom @compiles decorators to provide the same functionality across backends.

@zzzeek
Copy link
Member

zzzeek commented Jul 15, 2020

OK arent the differences for the typical regexp between a database like PG and SQLite mostly going to be the case insensitive part? or are basic syntaxes within the regexp strings different between PG / SQLite for example?

@zzzeek
Copy link
Member

zzzeek commented Jul 15, 2020

because we definitely can't be parsing and tokenizing regex strings, if the regex syntaxes are truly differnet and you wanted to make it so the user didn't have to know that, that's what that would entail, that's out of scope for SQLAlchemy (a third party extension could certainly do it however).

@jvanasco
Copy link
Member

Flags- The multiline and newline modes are slightly different (see my chart)

Actual regex-
Postgres - POSIX, Spencer library since 7.2. TCL ARE by default, ERE and BRE supported by extending the string.
Mysql - POSIX ERE, but the backing library was Spencer until 8.0.4 and is now unicode
Sqlite - PCRE via plugin
Oracle - POSIX ERE

TLDR; they don't really support the same syntax. they can get close, but that involves changing the pattern. it's possible there is a session option.

@zzzeek
Copy link
Member

zzzeek commented Jul 15, 2020

I've used regular expressions for literally 25 years and i couldnt give you one difference between PCRE and POSIX, so I found this: https://gist.github.com/CMCDragonkai/6c933f4a7d713ef712145c5eb94a1816

@sqla-tester
Copy link
Collaborator

Federico Caselli has proposed a fix for this issue in the master branch:

Implement regexp operator https://gerrit.sqlalchemy.org/c/sqlalchemy/sqlalchemy/+/2092

@CaselIT
Copy link
Member

CaselIT commented Jul 15, 2020

Actually that gerrit is more the logic of with_variant

@CaselIT
Copy link
Member

CaselIT commented Jul 15, 2020

In the end it's easier to implement the with_variant construct for each boolean expression that making it a regexp only thing

@CaselIT
Copy link
Member

CaselIT commented Jul 16, 2020

Seems that maria db does not have a clear way of setting the flags that I could find, they are embedded in the pattern. https://mariadb.com/kb/en/pcre/#option-setting

Also it does not seem to be the case with mysql, that has https://dev.mysql.com/doc/refman/8.0/en/regexp.html#function_regexp-like and different flags

@sqla-tester
Copy link
Collaborator

Federico Caselli has proposed a fix for this issue in the master branch:

Implement regexp operator https://gerrit.sqlalchemy.org/c/sqlalchemy/sqlalchemy/+/2096

@CaselIT
Copy link
Member

CaselIT commented Jul 16, 2020

I've added a first implementation here https://gerrit.sqlalchemy.org/c/sqlalchemy/sqlalchemy/+/2096

to figure out:

  • how to support mysql/mariadb flags, since they are not compatible with each other and we cannot discriminate them while compiling. mysql uses regexp_like while mariadb is like postgres, end embeds them in the pattern.
  • how to support flags for postgres, since they are embedded in the pattern. (mariadb has the same problem)

@CaselIT
Copy link
Member

CaselIT commented Jul 18, 2020

I've update the changset https://gerrit.sqlalchemy.org/c/sqlalchemy/sqlalchemy/+/2096 to use concat for the flags in pg and issue a warning when flags are used in mysql.

Feedbacks welcome

@CaselIT CaselIT modified the milestones: 1.5, 1.4 Jul 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants