Provide Unicode support #150

FrancescAlted · 2014-08-21T09:26:49Z

Unicode support is starting to be requested for numexpr. See:

Although one can always fall back to Python, it is probably worth implementing it in numexpr itself.

FrancescAlted · 2014-08-21T09:41:53Z

Here is a small snippet showing what the problem is:

In [1]: import numpy as np

In [2]: import numexpr as ne

In [3]: a = np.array(['a', 'b', 'c'], dtype='U2')

In [4]: print ne.evaluate('a == "b"')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-8540515c8b80> in <module>()
----> 1 print ne.evaluate('a == "b"')

/home/faltet/anaconda/lib/python2.7/site-packages/numexpr/necompiler.pyc in evaluate(ex, local_dict, global_dict, out, order, casting, **kwargs)
    728 
    729     # Create a signature
--> 730     signature = [(name, getType(arg)) for (name, arg) in zip(names, arguments)]
    731 
    732     # Look up numexpr if possible.

/home/faltet/anaconda/lib/python2.7/site-packages/numexpr/necompiler.pyc in getType(a)
    627     if kind == 'S':
    628         return bytes
--> 629     raise ValueError("unkown type %s" % a.dtype.name)
    630 
    631 

ValueError: unkown type unicode64

In [5]: a = np.array(['a', 'b', 'c'], dtype='U4')

In [6]: print ne.evaluate('a == "b"')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-8540515c8b80> in <module>()
----> 1 print ne.evaluate('a == "b"')

/home/faltet/anaconda/lib/python2.7/site-packages/numexpr/necompiler.pyc in evaluate(ex, local_dict, global_dict, out, order, casting, **kwargs)
    728 
    729     # Create a signature
--> 730     signature = [(name, getType(arg)) for (name, arg) in zip(names, arguments)]
    731 
    732     # Look up numexpr if possible.

/home/faltet/anaconda/lib/python2.7/site-packages/numexpr/necompiler.pyc in getType(a)
    627     if kind == 'S':
    628         return bytes
--> 629     raise ValueError("unkown type %s" % a.dtype.name)
    630 
    631 

ValueError: unkown type unicode128
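The numbers in `unicode64` and `unicode128` are the item size in bits: NumPy stores each code point of a `U` dtype in 4 bytes, so `U2` takes 2 x 4 x 8 = 64 bits and `U4` takes 128. The sketch below only inspects the dtypes; the helper name `unicode_dtype_bits` is made up for illustration and is not part of numexpr or NumPy. It also prints the dtype kind, `'U'`, which is what falls through the `kind == 'S'` check in `getType` shown in the tracebacks.

```python
import numpy as np


def unicode_dtype_bits(n_chars):
    """Bit width of a NumPy 'U<n_chars>' dtype (4 bytes per code point)."""
    return np.dtype("U%d" % n_chars).itemsize * 8


for n in (2, 4):
    dt = np.dtype("U%d" % n)
    # kind is 'U' for unicode arrays and 'S' for bytes arrays
    print(n, dt.kind, unicode_dtype_bits(n))
```

Under Python 3, NumPy reports these dtypes with a `str`-based name rather than `unicode64`, but the kind and the bit width are the same.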

thequackdaddy · 2016-06-29T15:50:56Z

Hello,

What is the current thinking on getting unicode strings supported in numexpr? I think this limitation causes a minor headache for something I've been toying with.

Here's something that fails for me:

In [9]: N = 1000
   ...: ra = np.fromiter(((i, i * 2., u"%.10s" % (i * 3))
   ...:                   for i in range(N)), dtype='i4,f8,U10')
   ...: ct = bcolz.ctable(ra)

In [10]: x = ct.eval("where(f2 == '10', 1, 0)")
Traceback (most recent call last):

  File "<ipython-input-10-ed3023e00b60>", line 1, in <module>
    x = ct.eval("where(f2 == '10', 1, 0)")

  File "C:\Anaconda3\lib\site-packages\bcolz-1.0.1.dev112+dirty-py3.5-win-amd64.egg\bcolz\ctable.py", line 1375, in eval
    return bcolz.eval(expression, user_dict=self._ud(user_dict), **kwargs)

  File "C:\Anaconda3\lib\site-packages\bcolz-1.0.1.dev112+dirty-py3.5-win-amd64.egg\bcolz\chunked_eval.py", line 173, in eval
    **kwargs)

  File "C:\Anaconda3\lib\site-packages\bcolz-1.0.1.dev112+dirty-py3.5-win-amd64.egg\bcolz\chunked_eval.py", line 262, in _eval_blocks
    out_flavor, blen, **kwargs)

  File "C:\Anaconda3\lib\site-packages\bcolz-1.0.1.dev112+dirty-py3.5-win-amd64.egg\bcolz\chunked_eval.py", line 252, in _eval_blocks
    res_block = _eval(expression, vars_)

  File "<string>", line 1, in <module>

NameError: name 'where' is not defined

Interestingly, I can do:

In [7]: ct.eval("f2 == '10'")
Out[7]: 
carray((1000,), bool)
  nbytes := 1000; cbytes := 32.00 KB; ratio: 0.03
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
  chunklen := 32768; chunksize: 32768; blocksize: 0
[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
...

with no problems. Additionally, coercing the strings to bytes is a suitable workaround.
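A minimal sketch of that bytes workaround, assuming the strings are pure ASCII (anything else makes the encode step raise `UnicodeEncodeError`). The helper name `to_bytes_array` is hypothetical; `np.char.encode` is NumPy's element-wise `str.encode`. The comparison here is done with plain NumPy so the snippet does not depend on numexpr being installed.

```python
import numpy as np


def to_bytes_array(arr, encoding="ascii"):
    """Return a bytes ('S') copy of a unicode ('U') array.

    Arrays of any other kind are returned unchanged.
    """
    arr = np.asarray(arr)
    if arr.dtype.kind == "U":
        return np.char.encode(arr, encoding)
    return arr


a = np.array(["a", "b", "c"], dtype="U2")
a_bytes = to_bytes_array(a)

print(a_bytes.dtype.kind)   # the kind is now 'S' instead of 'U'
print(a_bytes == b"b")      # compare against a bytes literal, not a str
```

The resulting array has dtype kind `'S'`, which is the kind that the `getType` check in the tracebacks above maps to `bytes`, so it should no longer hit the "unknown type" error.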

I'd offer to make a PR to fix this, but this code is way above my elementary understanding of how the evaluator/interpreter works. If someone can provide some hints on where to look, I might be able to search through it.

FrancescAlted · 2016-06-29T19:05:03Z

Yeah, there are two separate issues here. The first is that numexpr indeed does not understand unicode:

In [25]: x = ra['f2'][:10]

In [26]: x
Out[26]: 
array([u'0', u'3', u'6', u'9', u'12', u'15', u'18', u'21', u'24', u'27'], 
      dtype='<U10')

In [27]: ne.evaluate("x == '9'")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-282aa443feab> in <module>()
----> 1 ne.evaluate("x == '9'")

/home/faltet/software/numexpr/numexpr/necompiler.pyc in evaluate(ex, local_dict, global_dict, out, order, casting, **kwargs)
    787     # Create a signature
    788     signature = [(name, getType(arg)) for (name, arg) in
--> 789                  zip(names, arguments)]
    790 
    791     # Look up numexpr if possible.

/home/faltet/software/numexpr/numexpr/necompiler.pyc in getType(a)
    684     if kind == 'S':
    685         return bytes
--> 686     raise ValueError("unknown type %s" % a.dtype.name)
    687 
    688 

ValueError: unknown type unicode320

The second is that bcolz catches any exception thrown by numexpr and then tries the regular eval() in Python, which is why it complains about NameError: name 'where' is not defined. This is also why the following idiom works:

In [29]: ct.eval("f2 == '9'")[:10]
Out[29]: array([False, False, False,  True, False, False, False, False, False, False], dtype=bool)

which is equivalent to:

In [28]: ct.eval("f2 == '9'", vm="python")[:10]
Out[28]: array([False, False, False,  True, False, False, False, False, False, False], dtype=bool)

Note that you can use dask for performing operations too:

In [30]: ct.eval("f2 == '9'", vm="dask")[:10]
Out[30]: array([False, False, False,  True, False, False, False, False, False, False], dtype=bool)

Having said that, Unicode support could indeed be added to numexpr, but that would be a long and painful process, so perhaps the above workaround works better for you. If, despite my warnings, you are still interested in implementing unicode support, tell me and I will try to come up with more concrete suggestions.

thequackdaddy · 2016-06-29T21:35:13Z

No, your warning is fair enough. Browsing through the code, it really looks like there's a confluence of C and Python here, and while my Python is passable, my knowledge of C is nonexistent. I hadn't realized that unicode works if you use the python vm, which is why the boolean comparison works, but a vectorized "where" function is not part of Python, so you get the NameError. Clever error handling.

I'll keep this in mind, but I'm probably going to ignore it if it's not easy (solvable in a few days by me).

FrancescAlted · 2016-06-30T07:02:51Z

Well, after seeing your use case, I rather think that catching all the errors that numexpr raises in bcolz and silently falling back to the python vm can be confusing. Perhaps raising an error and suggesting that the user manually try the "python" or "dask" vm would be better. But this is very specific to bcolz, not to numexpr, so I opened an issue in bcolz for further discussion.
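To make the two behaviours concrete, here is a toy sketch in plain Python. It is not bcolz's actual code: `fast_eval` is a hypothetical stand-in for numexpr (it knows `where` but rejects `str` operands), `eval_expression` and its `strict` flag are invented for illustration, and scalars are used instead of arrays.

```python
def fast_eval(expression, variables):
    """Stand-in for numexpr: knows where(), but rejects str operands."""
    for name, value in variables.items():
        if isinstance(value, str):
            raise ValueError("unknown type for %r" % name)
    namespace = {"where": lambda cond, a, b: a if cond else b}
    return eval(expression, namespace, dict(variables))


def eval_expression(expression, variables, strict=False):
    """Try the fast evaluator first; on failure fall back or complain."""
    try:
        return fast_eval(expression, variables)
    except Exception as exc:
        if strict:
            raise RuntimeError(
                "fast evaluator failed (%s); try vm='python' or vm='dask'" % exc
            ) from exc
        # Silent fallback: the original error is hidden, and plain
        # Python has no where(), so a NameError can surface instead.
        return eval(expression, {"__builtins__": {}}, dict(variables))


# A simple comparison survives the silent fallback.
print(eval_expression("f2 == '10'", {"f2": "10"}))

# where() does not: compare the error with and without strict mode.
for strict in (False, True):
    try:
        eval_expression("where(f2 == '10', 1, 0)", {"f2": "10"}, strict=strict)
    except (NameError, RuntimeError) as exc:
        print(type(exc).__name__, exc)
```

With the silent fallback the user sees a NameError about `where`, which says nothing about unicode; in strict mode the message names the real failure and points at the alternative vms.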

robbmcleod · 2017-09-09T00:43:23Z

Please redirect any discussion regarding unicode support to #263.

FrancescAlted added the enhancement label Aug 21, 2014

FrancescElies mentioned this issue Oct 20, 2015

Can't use unicode string as where clause Blosc/bcolz#272

Closed

FrancescAlted mentioned this issue Jun 30, 2016

Silently catching numexpr errors and changing vm can be confusing Blosc/bcolz#312

Closed

robbmcleod closed this as completed Sep 9, 2017
