Warn that evaluate() should not be used on user input #442

Closed
corgeman opened this issue Jul 7, 2023 · 60 comments

@corgeman

corgeman commented Jul 7, 2023

The evaluate() function eventually calls eval() on the data provided. eval() is extremely dangerous when supplied with user input, and to my knowledge the documentation doesn't mention that the function does this. I would add a warning in the documentation about this. As a proof of concept, the following code should execute the command 'echo verybad' on your computer when run:

import numexpr
s = """
(lambda fc=(
    lambda n: [
        c for c in 
            ().__class__.__bases__[0].__subclasses__() 
            if c.__name__ == n
        ][0]
    ):
    fc("function")(
        fc("Popen")("echo verybad",shell=True),{}
    )()
)()
"""
numexpr.evaluate(s)
@robbmcleod
Member

Yeah developers seem to be using NumExpr more and more as a parser for handling input from a UI. It would be much safer if we could use ast.parse instead. I think adding a warning to the docs is fine, but it would also make sense to me to add an optional sanitizer. We could search the string for various Python keywords that have no business being in an expression, such as:

  1. import
  2. lambda
  3. eval
  4. __ (dunder)
  5. locals
  6. globals

Sanitization could be enabled by default but give the user the means to turn it off.

@robbmcleod
Member

Alright I implemented a check in 4b2d89c that forbids certain operators used in the expression. Namely: ;, :, [, and __. As far as I can tell this defeats all of the commonly cited attack vectors against eval(). I.e. it bans multiple expressions, lambdas, indexing, and all the dunders. If we want to be very safe, we would want to ban . as a character. However, this is difficult because . is both the reference operator and the decimal operator.

If we want to ban . we could do so if we require it to be followed by a numeral. E.g. 0.1 would pass the check, os.remove would not pass the check. However, this would ban valid Python code such as a * 5. which is a shorthand for casting to double. Perhaps there is someone out there with better regex-fu than myself who can suggest a regex that bans the Python . operator but doesn't cause issues for decimal points?

Regardless I think this is a good step forward and we're at the point where we should do a new release.
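A minimal sketch of one way around the regex question, offered as an assumption rather than anything NumExpr does: Python's standard `tokenize` module already separates a decimal point inside a numeric literal from the attribute-access operator. The helper name `has_attribute_dot` is hypothetical.

```python
import io
import tokenize


def has_attribute_dot(expr: str) -> bool:
    """Return True if expr uses '.' as an operator (attribute access).

    Dots inside numeric literals such as 0.1, 5. or 3.1e-05 belong to a
    NUMBER token, so they are not reported.
    """
    tokens = tokenize.generate_tokens(io.StringIO(expr).readline)
    return any(
        tok.type == tokenize.OP and tok.string == "." for tok in tokens
    )
```

With this helper, `has_attribute_dot("a * 5.")` is false while `has_attribute_dot("os.remove")` is true. It also flags `.real` and `.imag`, so a caller wanting those would have to allow them explicitly.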

@corgeman
Author

Looks good to me!

@robbmcleod
Member

Ok, made some more improvements. I can ban attribute access to everything but .real and .imag, via

_forbidden_re = re.compile('[\;[\:]|__|\.[abcdefghjklmnopqstuvwxyzA-Z_]')

I also strip the string of all whitespace beforehand.
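To see what the pattern quoted above accepts and rejects, here is a small sketch that applies it after stripping whitespace. The pattern is copied from the comment (written as a raw string); the wrapper `is_forbidden` is a hypothetical name and not part of NumExpr.

```python
import re

# Pattern as quoted in the comment above, written as a raw string.
_forbidden_re = re.compile(r'[\;[\:]|__|\.[abcdefghjklmnopqstuvwxyzA-Z_]')


def is_forbidden(expr: str) -> bool:
    """Return True if the whitespace-stripped expression trips the blacklist."""
    stripped = "".join(expr.split())  # remove all whitespace first
    return _forbidden_re.search(stripped) is not None


for candidate in ["a + 0.1", "a * 5.", "z.real + z.imag", "os.system('x')",
                  "a__b / 2", "x[0]", "os.remove('f')"]:
    print(f"{candidate!r}: forbidden={is_forbidden(candidate)}")
```

Decimal literals and `.real`/`.imag` pass, while `os.system`, `x[0]` and any double underscore are rejected. Because the letters `i` and `r` are left out of the character class, attributes that start with those letters (such as `os.remove`) are not caught by this particular pattern.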

@robbmcleod
Member

Closing with release of 2.8.5.

@jan-kubena

jan-kubena commented Aug 7, 2023

Hi @robbmcleod, just wanted to point out that you can still access other attributes because Python translates some utf chars into ascii chars automatically (mainly greek alphabet used in math), thus:

numexpr.evaluate("(3+1).ᵇit_count()")

Results in:

array(1, dtype=int32)

It might be better to whitelist real and imag rather than blacklist chars.
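The behaviour described here is Unicode normalization: the Python language reference states that identifiers are converted to normal form NFKC while parsing. The sketch below reproduces the effect with the standard library only; the helper names are hypothetical, and `bit_length` is used instead of `bit_count` so that it does not depend on the Python version.

```python
import ast
import unicodedata

# U+1D47 (MODIFIER LETTER SMALL B) followed by "it_length"
SNEAKY = "\u1d47it_length"


def normalized(name: str) -> str:
    """Apply the same NFKC normalization Python applies to identifiers."""
    return unicodedata.normalize("NFKC", name)


def attribute_names(expr: str) -> list:
    """Return the attribute names Python actually sees after parsing expr."""
    tree = ast.parse(expr, mode="eval")
    return [n.attr for n in ast.walk(tree) if isinstance(n, ast.Attribute)]


print(normalized(SNEAKY))
print(attribute_names("(3+1)." + SNEAKY + "()"))
```

A character-level blacklist sees the raw `ᵇ`, whereas the parser sees the normalized name, which is why checking names after parsing (or after NFKC normalization) is more robust than matching ASCII letters.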

@robbmcleod
Member

@jan-kubena ok thanks for the heads up. The trouble here is that decimal needs to work, and numbers can appear in variable names, but not at the start. Do you happen to know where the documentation for which character sets are mangled is?

The really proper way to do it would be to back-port the ast parser work I did for my attempt at making 3.0.

@robbmcleod
Member

Apparently the pandas group is using dunders as private variables in some of their NumExpr calls.

pandas-dev/pandas#54449

I'm of two minds about this. I could regex against "[a-zA-Z0-9_]+" instead of just "__". But for the security conscious, it does feel to me that NumExpr shouldn't be able to access private variables in the locals and globals dictionaries.

@nicoddemus

Hi folks,

I have a simpler example which also breaks in the new version:

import pandas as pd

name = "Mass (kg)"
df = pd.DataFrame({name: [200, 300, 400]})
df.query(f"`{name}` < 300")
ValueError: Expression (BACKTICK_QUOTED_STRING_Mass__LPAR_kg_RPAR_) < (300) has forbidden control characters.

I understand Mass (kg) is a reasonable column name, so the validation definitely needs more tuning.

@zorion

zorion commented Aug 8, 2023

Hi folks,
I'm not sure if it is the same issue here, we are using numexpr for calculations in a pandas dataframe and we are using double underscore in some of our calculations (for instance, a__b) but now it fails in 2.8.5 (working in 2.8.4).
I have a minimal example without involving pandas:

import numexpr
numexpr.evaluate('a__b / 2', {'a__b': 4})

Or a oneliner:

python -c "import numexpr; print(numexpr.evaluate('a__b / 2', {'a__b': 4}))"

In numexpr==2.8.4 that one works perfectly.
In numexpr==2.8.5 we have the following:
ValueError: Expression a__b / 2 has forbidden control characters.

Is it an intentional feature or something that can be worked around easily (sending some kwarg to ignore the error)?
We will pin our numexpr requirements to "<2.8.5" in the meantime.

Many thanks!

@robbmcleod
Member

@nicoddemus and @zorion, just waiting to hear back from Pandas' devs on the original report: pandas-dev/pandas#54449 before I do anything. I've tried in the past to establish some sort of line of communication with them and gotten crickets.

@zorion

zorion commented Aug 9, 2023

Hi, thanks for your reply.

I think we are having a legit use of dunder but it is not allowed in a breaking change from 2.8.4 to 2.8.5 so we have to fix our version to 2.8.4 (or lower) and this is an ok workaround for now.
On the other hand, if we had a way to flag "evaluate" that we trust our input we may remove this version restriction. Is it possible to do so?

Many thanks in advance!

@lithomas1

> @nicoddemus and @zorion, just waiting to hear back from Pandas' devs on the original report: pandas-dev/pandas#54449 before I do anything. I've tried in the past to establish some sort of line of communication with them and gotten crickets.

Hi, one of the pandas devs here.

I don't maintain the eval code (I don't think anyone still here does anymore), so not really the best person to comment on this.

Is there a way to gate this stricter checking behind an option?
(For pandas, we have a warning in the docs saying that eval will let users run arbitrary code)

Thanks,
Thomas

@robbmcleod
Member

@lithomas1 and company,

I see a few approaches here.

  1. We can deprecate the implementation of the __ filter for an immediate release and put it back in for a future release.
  2. We can put in an option to disable the security check. However, I do think it should default to be on.

In order to avoid forcing everyone to do an emergency release, we probably need to do both, assuming the option defaults to sanitize.

@lithomas1

> @lithomas1 and company,
>
> I see a few approaches here.
>
>   1. We can deprecate the implementation of the __ filter for an immediate release and put it back in for a future release.
>   2. We can put in an option to disable the security check. However, I do think it should default to be on.
>
> In order to avoid forcing everyone to do an emergency release, we probably need to do both, assuming the option defaults to sanitize.

Thanks, this sounds good to me.
2 is probably good enough for pandas.
(We are going to be releasing soon anyways in a couple weeks. I think users can be fine pinning numexpr for now).

It would be nice to actually fix pandas, however, I'm not too sure what the future of eval/query will be given its current "zombie"-like state.

@rebecca-palmer
Contributor

There's actually two changes in numexpr 2.8.5 that fail pandas tests - this, and a change to integer overflow behaviour (possibly a result of the negative-powers changes, but I'm not sure yet) pandas-dev/pandas#54546

@robbmcleod
Member

@rebecca-palmer You can make a new issue if you want, but NumExpr could never cover 2**100 as that's way in excess of 64-bit integers.
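As a quick check of the arithmetic behind that statement, the sketch below compares 2**100 against the signed 64-bit range using Python's arbitrary-precision integers. The helper `fits_in_int64` is a hypothetical name used only for illustration.

```python
INT64_MIN = -(2**63)     # smallest signed 64-bit integer
INT64_MAX = 2**63 - 1    # largest signed 64-bit integer


def fits_in_int64(value: int) -> bool:
    """Return True if value is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


print(fits_in_int64(2**62))        # within range
print(fits_in_int64(2**100))       # far outside the range
print((2**100).bit_length())       # number of bits needed for the magnitude
```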

@robbmcleod
Member

@zorion, @nicoddemus, @lithomas1 I made a push in 397cc98 that should hopefully fix these issues, if you could please test?

  1. I improved the blacklisting. It should better match the funny unicode coercion that can be done for the attribute access attack.
  2. It only blocks dunders and not a single double underscore.
  3. You can disable it by calling validate(..., sanitize=False) or equivalently evaluate(..., sanitize=False).

If you have a chance please test and provide me with any feedback you may have.
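The difference between "a dunder" and "a double underscore somewhere in a name", together with an opt-out flag, can be sketched as follows. This is a sketch under stated assumptions, not the code in the commit: the pattern `_DUNDER` and the function `check_expression` are hypothetical, and only the error wording is taken from the messages quoted in this thread.

```python
import re

# Hypothetical pattern: a name that both starts and ends with "__" and is
# not embedded inside a longer identifier.
_DUNDER = re.compile(r'(?<!\w)__\w+__(?!\w)')


def check_expression(expr: str, sanitize: bool = True) -> str:
    """Return expr unchanged, or raise ValueError if it contains a dunder.

    Passing sanitize=False skips the check entirely (trusted input only).
    """
    if sanitize and _DUNDER.search(expr):
        raise ValueError(
            f"Expression {expr} has forbidden control characters."
        )
    return expr
```

Under this sketch `a__b / 2` and pandas-style names such as `BACKTICK_QUOTED_STRING_Mass__LPAR_kg_RPAR_` pass, while `().__class__` is rejected unless `sanitize=False` is given.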

@nicoddemus

Hi @robbmcleod,

Thanks for attempting a fix.

My original example now passes, but this one breaks (reduced from the actual code):

import pandas as pd

name = "II (MM)"
df = pd.DataFrame({name: [200, 300, 400]})
df.query(f"`{name}` >= 3.1e-05")
ValueError: Expression (BACKTICK_QUOTED_STRING_II__LPAR_MM_RPAR_) >= (3.1e-05) has forbidden control characters.

The problem in this case seems to be the scientific notation: if I change 3.1e-05 to something that does not format to scientific notation (say 0.31), it no longer errors out. If I use the actual number in decimal notation (0.000031), it still fails, apparently because it is formatted back to scientific notation internally: it generates the exact same message as above (with 3.1e-05):

import pandas as pd

name = "II (MM)"
df = pd.DataFrame({name: [200, 300, 400]})
df.query(f"`{name}` >= 0.000031")
ValueError: Expression (BACKTICK_QUOTED_STRING_II__LPAR_MM_RPAR_) >= (3.1e-05) has forbidden control characters.

sanitize=False would be great for us, however we call DataFrame.query which does not have that argument yet.

@rebecca-palmer
Contributor

@nicoddemus @robbmcleod I think changing _attr_pat to r'\.\b(?!(real|imag|\d+e?)\b)' (i.e. adding 'e?') fixes that, but I haven't actually tested it.

@nicoddemus

nicoddemus commented Aug 16, 2023

Perhaps a more reliable approach would be to use ast.parse, and reject the tree if we find statements, lambdas, dunder import, etc?
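A minimal sketch of what such an `ast.parse`-based check could look like, assuming Python 3.8+ and an allow-list approach. The function name `is_safe_expression`, the set of allowed function names, and the choice of permitted nodes are all assumptions made for illustration; they are not NumExpr's implementation.

```python
import ast

# Node types a plain arithmetic/comparison expression needs (assumed list).
_ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Compare, ast.BoolOp,
    ast.Call, ast.Name, ast.Attribute, ast.Load, ast.Constant,
    ast.operator, ast.unaryop, ast.cmpop, ast.boolop,
)
_ALLOWED_FUNCS = {"sin", "cos", "exp", "log", "sqrt", "abs", "where"}
_ALLOWED_ATTRS = {"real", "imag"}


def is_safe_expression(expr: str) -> bool:
    """Return True only if expr parses to allow-listed node types."""
    try:
        tree = ast.parse(expr, mode="eval")  # statements are a SyntaxError
    except (SyntaxError, ValueError):
        return False
    for node in ast.walk(tree):
        if not isinstance(node, _ALLOWED_NODES):
            return False  # lambdas, subscripts, tuples, keywords, ...
        if isinstance(node, ast.Constant) and not isinstance(
            node.value, (int, float, complex)
        ):
            return False  # numeric literals only
        if isinstance(node, ast.Name) and (
            node.id.startswith("__") and node.id.endswith("__")
        ):
            return False  # dunder names such as __import__
        if isinstance(node, ast.Attribute) and node.attr not in _ALLOWED_ATTRS:
            return False  # attr is already NFKC-normalized by the parser
        if isinstance(node, ast.Call) and not (
            isinstance(node.func, ast.Name) and node.func.id in _ALLOWED_FUNCS
        ):
            return False  # only direct calls to known function names
    return True
```

Because the check runs on the parsed tree, scientific notation, complex literals and names like `a__b` need no special handling, and the Unicode trick shown earlier in the thread is rejected because the parser has already normalized the attribute name.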

@rebecca-palmer
Contributor

@robbmcleod I've added some comments in the commit - do those notify anyone when it's already on the main branch?

@robbmcleod
Member

It needs to match [eE]?[+-]?. I.e. either 'e' or 'E' can denote scientific notation and then it can be '-' or '+' exponents. It's not a big deal, I just haven't had time to sit down and do it yet. Please be patient.

I definitely don't get notifications on commits.

@robbmcleod
Member

@nicoddemus I did use ast.parse for the NumExpr-3.0 branch. Backporting it is not a trivial fix: NumExpr 2 has a home-brew AST.

@robbmcleod
Member

FWIW, NumExpr, being a legacy piece of code, seems to be in fairly widespread use in other legacy systems that aren't well maintained (the Pandas .query method, for example). I used to commonly get requests to consult on code using NumExpr. Hence my desire to implement an effective sanitizer.

There were definitely some growing pains with writing the sanitizer, but to me that was expected. It does seem to work now. It's very hard for me to see any way to bypass it and execute malicious code. I did try and write a whitelist, but that's considerably harder to regex.

Regarding the choice to default to True on sanitization, it goes back to wanting to make the issue loud to end users who don't even know their code is using NumExpr.

I've been thinking that if we did want to default to not sanitizing the input string, we could instead show a warning to the user. This warning could be suppressed by setting an environment variable, such as NUMEXPR_NO_WARN_SANTIZE. We could then state that sanitize=False is deprecated and that in the future it will default to True.

@smorken

smorken commented Sep 6, 2023

I have a workaround that involves running an external parser based on pyparsing to pre-validate expressions before passing them on to numexpr.

This is not fully tested, but I am considering this for my own use. I realize it might not be possible to add an additional package requirement to numexpr, but maybe a similar approach would be practical as opposed to a sanitation approach? This would admittedly cause a performance hit but maybe not huge for the typical numexpr use cases.

I suppose the fact that python eval is even involved might mean there are edge cases and a pre-parser might break some expected functionality of numexpr? I think it's fine for my use case though.

from pyparsing import (
    infix_notation,
    one_of,
    OpAssoc,
    Literal,
    Forward,
    Group,
    Suppress,
    Optional,
    delimited_list,
    ParserElement,
)
from numexpr.necompiler import vml_functions
from pyparsing.common import pyparsing_common


ParserElement.enablePackrat()


LPAREN, RPAREN = map(Suppress, "()")
NUMEXPR_FUNCS = vml_functions + ["where"]


def get_parser():
    integer = pyparsing_common.integer
    real = pyparsing_common.real | pyparsing_common.sci_real
    imaginary = (real | integer) + one_of("j J")
    arith_expr = Forward()
    fn_call = Group(
        one_of(NUMEXPR_FUNCS)
        + LPAREN
        - Group(Optional(delimited_list(arith_expr)))
        + RPAREN
    )
    operand = (
        fn_call | imaginary | real | integer | pyparsing_common.identifier
    )

    bitwise_operators = one_of("& | ~ ^")
    comparison_operators = one_of("< <= == != >= >")
    unary_arithmetic = one_of("-")
    binary_arithmetic = one_of("+ - * / ** % << >>")

    arith_expr << infix_notation(
        operand,
        [
            (bitwise_operators, 2, OpAssoc.LEFT, None),
            (comparison_operators, 2, OpAssoc.LEFT, None),
            (unary_arithmetic, 1, OpAssoc.RIGHT, None),
            (binary_arithmetic, 2, OpAssoc.LEFT, None),
        ],
    )

    return arith_expr

def test_passing_expressions():
    parser = numexpr_expression_parser.get_parser()
    result, parse_results = parser.runTests([
        "where(a) ==  1+2e6j",
        "1 + 2.0 + _abc + sin(o)",
        "1 + 2.0 + __abc",  # __abc is a valid identifier
        "1 + 2.0 + _abc + sin(o)",
    ])
    assert result

def test_failing_expressions():
    parser = numexpr_expression_parser.get_parser()
    result, parse_results = parser.runTests([
        "eval(123)"
    ])
    assert not result

@MichaelTiemannOSC

> If no one can think of adding any more tests to this I'll prepare another release?
>
> I'll try and test locally against Pandas as well.

That would be very welcome. pip-audit is now failing my builds due to PYSEC-2023-163 (this issue).

@robbmcleod
Member

I added the means to turn the sanitize=True default behavior off, by setting an environment variable,

set NUMEXPR_SANITIZE=0

Generally speaking, I think this shouldn't be any more of a security hole than allowing people to pass sanitize=False is.
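The interplay between an explicit `sanitize` argument and the environment variable can be sketched as below. This is an assumption about the intended precedence (explicit argument first, then `NUMEXPR_SANITIZE`, then the safe default), not a copy of the library's code; `resolve_sanitize` is a hypothetical helper. Note that `set NUMEXPR_SANITIZE=0` is Windows `cmd` syntax; POSIX shells use `export NUMEXPR_SANITIZE=0`.

```python
import os
from typing import Optional


def resolve_sanitize(sanitize: Optional[bool] = None) -> bool:
    """Decide whether to sanitize an expression.

    An explicit True/False wins. Otherwise the NUMEXPR_SANITIZE environment
    variable is consulted, and anything other than "0" keeps sanitizing on.
    """
    if sanitize is not None:
        return sanitize
    return os.environ.get("NUMEXPR_SANITIZE", "1") != "0"
```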

I tested with pandas against the tests I found that referenced numexpr or evaluate and they all passed. I wasn't able to run the full pandas test suite as I had some access violation.

Otherwise everything is good to release 2.8.6. I'll give everyone a day to comment.

@robbmcleod
Member

@smorken we could consider adding that as a code snippet to the documentation? Perhaps some section titled "Using NumExpr for evaluating user inputs?"

@smorken

smorken commented Sep 11, 2023

> @smorken we could consider adding that as a code snippet to the documentation? Perhaps some section titled "Using NumExpr for evaluating user inputs?"

Sure, by all means, it's mostly just cobbled together from pyparsing examples and by looking at the supported numexpr syntax in the user guide, so feel free to make changes as needed if you see something that could be improved. I am pretty sure that not all of the numexpr syntax would be supported, but I am guessing it might work as a pre-filter for a useful subset of the syntax as it is.

@robbmcleod
Member

2.8.6 has been released, we'll see if there are any further troubles. ::crosses fingers::

@smorken if you want to write a gist I can link to it?

@MichaelTiemannOSC

The 2.8.6 version was just flagged by the same org that flagged 2.8.5: https://vulners.com/osv/OSV:PYSEC-2023-163

I suspect that the real vulnerability is in the LangChain code and that numexpr is just exposing what Python itself exposes: an eval that can execute arbitrary code. In my view, the library itself should not be tagged unless it can be exploited by means other than using its standard API in normal ways. But any application that exposes eval to random users is vulnerable, whether they go through a library like numexpr or directly to Python. Somebody needs to sort this with the CVE community, however. I don't think I have standing to argue.

@newville

I came across this through a test failure in my X-ray data analysis code, which uses a library (pyFAI) that calls numexpr.NumExpr on an expression like '4.0e-9*x' (see #449).

I am shocked to learn that numexpr uses eval, and somewhat alarmed at the simplistic approaches proposed here to disallow dunder names.

Trying to parse Python expressions yourself is foolhardy, especially since Python exposes its own parser with ast. You might, for example, consider replacing eval() with the asteval module (https://github.com/newville/asteval). It is true that I am the author, but this is far from a shameless plug for my code: I support it so that my other codes can work safely.

With asteval you could certainly throw out many of the "supported nodes" you do not like (loops, conditionals, etc) and use it only for evaluations of expressions. You could join the discussion about what attributes of Python objects are unsafe. For details, see the list of disallowed attributes at https://newville.github.io/asteval/motivation.html#how-safe-is-asteval.

@smorken

smorken commented Sep 13, 2023

I spent a couple of hours packaging and testing that snippet I posted here before. It's a strict infix pre-parser that supports (much of) the numexpr syntax and all of the function names. Not sure if it will be useful but it's now here in case anyone wants to look.

https://github.com/smorken/numexpr_preparser

@newville

@smorken. Well, that would definitely preserve the errant behavior of #449:

>>> numexpr_safe_evaluate('4.0e-9')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in numexpr_safe_evaluate
ValueError: Expected end of text, found 'e'  (at char 3), (line:1, col:4)

numexpr is designed for parsing and evaluating numerical Python expressions. It has been around for years, is widely used, it has a testing mechanism in place and good documentation. Yet numexpr.NumExpr fails to parse a valid literal number. It appears this has been known for several weeks, and the latest code was released knowing about this and not even adding a test for it. I hope I am misreading that.

Look, I am an outsider here, so feel free to cast me as a bad guy; I do not mean any ill will to anyone. But I am just shocked to learn that numexpr is ever using Python's eval. I am equally shocked that I learned of this only because the latest release of numexpr.NumExpr cannot parse a valid literal number in scientific notation.

I am dismayed at the attempts discussed here to try to create a pre-parser so that eval can be used "safely". I do hope that such efforts are not being taken seriously.

Please do not use eval.

@smorken

smorken commented Sep 14, 2023

@newville thanks for taking the time to try it out. Added a fix here, to do with the pyparsing objects. I really don't want to distract from the issue here at hand any further with those scripts though, and to be clear I am not proposing that they be added into numexpr as a robust solution.

As a numexpr user who was also surprised to learn that eval is being used internally, I welcome progress and concrete solutions on the issue as well!

@robbmcleod
Member

Alright, I'm going to try and provide a bit of history of this project because we have so many people coming here without context.

NumExpr (NE) was originally written by David Cooke in the late 2000s for Python 2.5 as a way to accelerate NumPy calculations. He's no longer involved in open source and no one knows where he is or how to contact him. Because NE predates many things in modern Python, the source has a lot of technical debt and it implements its own Abstract Syntax Tree (AST), because the ast module didn't exist back then (being a Python 2.7 innovation). NE turned 15 years old this year.

Incidentally the ast module documentation is still incomplete dogshit in 2023 and one should look at https://greentreesnakes.readthedocs.io/en/latest/ if you want to understand how it actually works.

Francesc Alted took over maintenance in that void because PyTables used NumExpr to do queries. I was using NumExpr in 2015-16 to accelerate some scientific calcs without having to implement custom functions in the C-API all the time. After I quit my Swiss post-doc and moved back to Canada I started on a project to make "NumExpr 3.0" which had the potential to fix all the shortfalls in 2.0. However, it was an (overly) ambitious project, and I got a paying job, and my free time evaporated. The 3.0 branch does use the ast module to parse expressions. However, I completely re-wrote the Python part of NE because it's frankly a mess of mutable arguments going into the NE AST and I added the ability to parse multiple lines with temporary variables (which existed in the virtual machine as just a single 4k block) among numerous other improvements. Francesc asked me if I could take over the maintenance side of things and I agreed, since it was the reasonable thing to do.

My thinking at that time was that people would stop using NE and switch to Numba. I saw NE3 as having a potential niche for "write once; execute on big data" scripts where the user didn't want to unravel their vectorized calculations to use Numba. Little did I know at that time that a lot of people were using NE2 for a purpose it was never (IMO) intended to be used for: parsing user inputs. This is very clear if you look at the myriad of reported issues on this repo where people are using NumExpr to parse singletons (and hence running into issues with the NE AST), whereas in my opinion NE was intended to be a blocked-calculation virtual machine to avoid being calculation limited by memory bandwidth. NE is extremely inefficient for parsing singleton inputs. CPython itself is more efficient. I've tried repeatedly in the Issues tracker to discourage people from using NE in this way, to no avail.

I personally do not have the bandwidth to implement a new AST in the package. In 2017 yes, but not in 2023. Now, if someone wants to take over maintenance of the project, I'd be thrilled to hand it over. It's possible money could be sourced from one of the open-source funding agencies. The adjacent packages: NumPy, Pandas, and PyTables are all funded. Francesc has approached me in the past asking if I wanted to be part of one of the grant applications, and I said no. I've also been asked to consult for companies using NumExpr internally and again I said no (although this has dried up over the past couple of years). For me, it's not about the money, it's about my personal time. 

If I ask someone, "can you please write a unit test for this edge case?" and I get a no code response, I'm not going to be able to fix the problem in 15 minutes on my lunch break. To be clear: I don't use NumExpr in a professional context. I wrote my own virtual machine for professional use. There's no benefit to me in continuing to maintain this project.

@t20100
Contributor

t20100 commented Sep 14, 2023

Thanks @robbmcleod for maintaining this project!

As this issue highlights, numexpr is a very much used piece of code.

Since the effort of refactoring/rewriting numexpr is too huge, a disclaimer in the documentation about security issues as initially proposed sounds more affordable.

Regarding the issue with scientific notation in v2.8.6, PR #451 proposes both a test and a fix for it.
If this fix is suited, a bug fix release would be much appreciated!

@newville

@robbmcleod

> NumExpr (NE) was originally written by David Cooke in the late 2000s for Python 2.5 as a way to accelerate NumPy calculations. He's no longer involved in open source and no one knows where he is or how to contact him. Because NE predates many things in modern Python, the source has a lot of technical debt and it implements its own Abstract Syntax Tree (AST), because the ast module didn't exist back then (being a Python 2.7 innovation). NE turned 15 years old this year.

If memory serves, ast was included with Python 2.6 (and even partially available in 2.5, perhaps as a third-party lib). But with Python 2.7 it became the same source->AST parser used by Python itself.

It is OK to play the "this is a very old codebase" card, but the ast module is hardly new. At many points in the process, the developers apparently decided to stick with "home-built" instead of "standard library". That's OK, if they are able and willing to maintain it.

FWIW, the origins of asteval are from about the same time (2.6 to 2.7 transition). The github repo goes back to 2012, reflecting the transition to using git. Again, not a new project.

> Incidentally the ast module documentation is still incomplete dogshit in 2023 and one should look at https://greentreesnakes.readthedocs.io/en/latest/ if you want to understand how it actually works.

Yes, the ast documentation is incomplete, but the usage within asteval sort of demonstrates that it is not really that hard to work with. For anyone who sort of understands the concepts (surely anyone who would consider using pyparsing), ast.dump(ast.parse(string)) is pretty self-explanatory.
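For readers who have not tried it, a short sketch of the `ast.dump(ast.parse(string))` idiom mentioned above. The wrapper name `dump_expression` is hypothetical, and the exact text of the dump varies between Python versions, so no expected output is shown.

```python
import ast


def dump_expression(expr: str) -> str:
    """Return a textual dump of the syntax tree of a single expression."""
    return ast.dump(ast.parse(expr, mode="eval"))


# Attribute access and lambdas appear as explicit node types in the dump,
# which is what makes allow-listing by node type straightforward.
print(dump_expression("4.0e-9 * x"))
print(dump_expression("os.remove"))
print(dump_expression("(lambda: 1)()"))
```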

But also: all of the well-meaning suggestions and bugfixes here about regular expressions, "dunder" names, and using pyparsing to try to make the input for eval "safe" are missing the entire point of the ast module: You do not ever need to do any lexing or parsing of Python statements. Any lexing or parsing of Python statements that you choose to do will add code that has to be maintained and supported. It will almost certainly depend on fragile regular expressions or parsing modules that are non-trivial to understand. In the end, the lexing parsing will be "correct" if and only if it agrees precisely with the results from the ast module. That is, you can use ast or you can decide to do something worse.

Most of what is dangerous about eval is accessing object attributes. That cannot be avoided by parsing. Many of the worst dangers of eval can be avoided by carefully deciding which attributes can be accessed.

> Francesc Alted took over maintenance in that void because PyTables used NumExpr to do queries. I was using NumExpr in 2015-16 to accelerate some scientific calcs without having to implement custom functions in the C-API all the time. After I quit my Swiss post-doc and moved back to Canada I started on a project to make "NumExpr 3.0" which had the potential to fix all the shortfalls in 2.0. However, it was an (overly) ambitious project, and I got a paying job, and my free time evaporated. The 3.0 branch does use the ast module to parse expressions. However, I completely re-wrote the Python part of NE because it's frankly a mess of mutable arguments going into the NE AST and I added the ability to parse multiple lines with temporary variables (which existed in the virtual machine as just a single 4k block) among numerous other improvements. Francesc asked me if I could take over the maintenance side of things and I agreed, since it was the reasonable thing to do.

Well, definitely thanks to you and Francesc (and David Cooke) for doing that -- it is much appreciated.

> My thinking at that time was that people would stop using NE and switch to Numba. I saw NE3 as having a potential niche for "write once; execute on big data" scripts where the user didn't want to unravel their vectorized calculations to use Numba. Little did I know at that time that a lot of people were using NE2 for a purpose it was never (IMO) intended to be used for: parsing user inputs. This is very clear if you look at the myriad of reported issues on this repo where people are using NumExpr to parse singletons (and hence running into issues with the NE AST), whereas in my opinion NE was intended to be a blocked-calculation virtual machine to avoid being calculation limited by memory bandwidth. NE is extremely inefficient for parsing singleton inputs. CPython itself is more efficient. I've tried repeatedly in the Issues tracker to discourage people from using NE in this way, to no avail.

Yeah, I understand that...

> I personally do not have the bandwidth to implement a new AST in the package. In 2017 yes, but not in 2023. Now, if someone wants to take over maintenance of the project, I'd be thrilled to hand it over. It's possible money could be sourced from one of the open-source funding agencies. The adjacent packages: NumPy, Pandas, and PyTables are all funded. Francesc has approached me in the past asking if I wanted to be part of one of the grant applications, and I said no. I've also been asked to consult for companies using NumExpr internally and again I said no (although this has dried up over the past couple of years). For me, it's not about the money, it's about my personal time.

I understand that too. The numexpr devels might just decide that replacing eval with asteval would circumvent the worst security issues, and avoid all of the discussion here about various band-aids for eval. But I know absolutely nothing about the numexpr code base.

> If I ask someone, "can you please write a unit test for this edge case?" and I get a no code response, I'm not going to be able to fix the problem in 15 minutes on my lunch break. To be clear: I don't use NumExpr in a professional context. I wrote my own virtual machine for professional use. There's no benefit to me in continuing to maintain this project.

Well, I think that many of us will understand trying to maintain software, especially on lunch breaks ;). It looks like you were updating and releasing versions until fairly recently, but maybe I am not understanding some things. Is someone else maintaining this?

@rebecca-palmer
Contributor

As previously noted, dunder_pat is still blocking some things that aren't dunders. Hence, pandas still fails a test.

@rebecca-palmer
Contributor

@robbmcleod what, if anything, blocks the above two fixes from being applied, to at least fix the known unnecessary breakage? (See #452 if you prefer a proper pull request.)

In the longer term, it looks to me like everyone here agrees that moving to ast would be better, but will take work.

I may be interested in becoming a maintainer and/or contributing to that work, but this is not a promise at this point.

@newville

Yes, please merge #452 and #451 (both fix literals using scientific notation, #452 adds better checking for dunder names, while #451 adds a test for numeric literals using scientific notation).

As it stands, downstream packages must pin a specific, not-the-latest version of numexpr in their requirements, such as numexpr<=2.8.4.

@FrancescAlted
Contributor

One can also take the opportunity to produce wheels for the forthcoming Python 3.12. Although 3.12 is not final yet (Oct 2nd is the tentative date), Python folks will not be introducing ABI changes after the existing 3.12rc2, so extensions built on it should work well with the forthcoming 3.12 final. Also, the NumPy team is already producing wheels for 3.12, so this dependency should be ready too.

@robbmcleod I'm willing to help in doing the release in case you don't have lots of time right now. BTW, thanks for all the time that you have put in the project so far; you have done a most excellent job in maintaining the project.

@FrancescAlted
Contributor

I am in the process of releasing 2.8.7, with the suggestions here. If you want to test what the candidate looks like, please go to #453 and give it a try. My plan is to do a release as soon as possible (hopefully by tomorrow).

Also, after talking with @robbmcleod, I have added a notice in the README saying that the project is looking for (much needed) new maintainers. If anyone here is ready to tackle that, please speak up. Thanks!

@avalentino
Contributor

Maybe this issue can be closed now, right?

@chipmuenk

I'm a bit late to the show, but I only noticed yesterday that numexpr also fails with simple complex numbers like 1.0j, which affects my software https://github.com/chipmuenk/pyfda. I think parsing complex numbers should be a legitimate use case for numexpr.

@rebecca-palmer
Contributor

@chipmuenk: yes, that sounds like a bug, sorry.

Untested fix:
-_attr_pat = r'\.\b(?!(real|imag|\d*[eE]?[+-]?\d+)\b)'
+_attr_pat = r'\.\b(?!(real|imag|\d*[eE]?[+-]?\d+j?)\b)'
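The effect of the proposed `j?` can be checked in isolation. The sketch below compares the two patterns from the diff, assuming the leading dot is meant as an escaped literal (`\.`); the names `OLD_ATTR_PAT`, `NEW_ATTR_PAT` and `is_blocked` are hypothetical and exist only for this comparison.

```python
import re

# The two variants from the diff above, with the leading dot escaped.
OLD_ATTR_PAT = re.compile(r'\.\b(?!(real|imag|\d*[eE]?[+-]?\d+)\b)')
NEW_ATTR_PAT = re.compile(r'\.\b(?!(real|imag|\d*[eE]?[+-]?\d+j?)\b)')


def is_blocked(pattern: "re.Pattern", expr: str) -> bool:
    """Return True if the pattern treats some '.' in expr as attribute access."""
    return pattern.search(expr) is not None


for expr in ["1.0j", "3.1e-05 * x", "z.imag", "os.system"]:
    print(expr, is_blocked(OLD_ATTR_PAT, expr), is_blocked(NEW_ATTR_PAT, expr))
```

Under these assumptions the old pattern rejects `1.0j` while the new one accepts it, and both still accept scientific notation and `.imag` while rejecting `os.system`.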

@FrancescAlted
Contributor

@rebecca-palmer could you open a PR adding a test for the new complex case too? thanks in advance!

@Dobatymo

Dobatymo commented Feb 22, 2024

There has been an issue for this since 2018... #323
