Provide Unicode support #150

Closed
FrancescAlted opened this issue Aug 21, 2014 · 6 comments

Comments

@FrancescAlted
Contributor

Unicode support is starting to be requested for numexpr. See:

Blosc/bcolz#38

Although one can always fall back to Python, it is probably worth implementing it in numexpr itself.

@FrancescAlted
Contributor Author

Here is a small snippet showing what the problem is:

In [1]: import numpy as np

In [2]: import numexpr as ne

In [3]: a = np.array(['a', 'b', 'c'], dtype='U2')

In [4]: print ne.evaluate('a == "b"')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-8540515c8b80> in <module>()
----> 1 print ne.evaluate('a == "b"')

/home/faltet/anaconda/lib/python2.7/site-packages/numexpr/necompiler.pyc in evaluate(ex, local_dict, global_dict, out, order, casting, **kwargs)
    728 
    729     # Create a signature
--> 730     signature = [(name, getType(arg)) for (name, arg) in zip(names, arguments)]
    731 
    732     # Look up numexpr if possible.

/home/faltet/anaconda/lib/python2.7/site-packages/numexpr/necompiler.pyc in getType(a)
    627     if kind == 'S':
    628         return bytes
--> 629     raise ValueError("unkown type %s" % a.dtype.name)
    630 
    631 

ValueError: unkown type unicode64

In [5]: a = np.array(['a', 'b', 'c'], dtype='U4')

In [6]: print ne.evaluate('a == "b"')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-8540515c8b80> in <module>()
----> 1 print ne.evaluate('a == "b"')

/home/faltet/anaconda/lib/python2.7/site-packages/numexpr/necompiler.pyc in evaluate(ex, local_dict, global_dict, out, order, casting, **kwargs)
    728 
    729     # Create a signature
--> 730     signature = [(name, getType(arg)) for (name, arg) in zip(names, arguments)]
    731 
    732     # Look up numexpr if possible.

/home/faltet/anaconda/lib/python2.7/site-packages/numexpr/necompiler.pyc in getType(a)
    627     if kind == 'S':
    628         return bytes
--> 629     raise ValueError("unkown type %s" % a.dtype.name)
    630 
    631 

ValueError: unkown type unicode128
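The bit widths in the two error messages follow from how NumPy stores unicode: each character of a `U` dtype takes 4 bytes (UCS-4), so the number in the dtype name is 32 times the character count. A minimal sketch of that arithmetic in plain Python; the helper name `unicode_dtype_bits` is made up for illustration and is not part of NumPy or numexpr:

```python
# Hypothetical helper, not part of NumPy or numexpr.
# NumPy stores each character of a 'U<n>' dtype as 4 bytes (UCS-4).
BYTES_PER_CHAR = 4


def unicode_dtype_bits(nchars):
    """Return the bit width that shows up in the name of a 'U<nchars>' dtype."""
    return nchars * BYTES_PER_CHAR * 8


# The two dtypes used in the session above.
for n in (2, 4):
    print("U%d -> unicode%d" % (n, unicode_dtype_bits(n)))
```

So the failure does not depend on the string length; any `U` dtype appears to hit the same `ValueError`, only with a different width in the message.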

@thequackdaddy

Hello,

What is the current thinking on getting unicode/strings supported in numexpr? I realized that (I think) this limitation causes a minor headache for something I've been toying with.

Here's something that fails for me:

In [9]: N = 1000
   ...: ra = np.fromiter(((i, i * 2., u"%.10s" % (i * 3))
   ...:                   for i in range(N)), dtype='i4,f8,U10')
   ...: ct = bcolz.ctable(ra)

In [10]: x = ct.eval("where(f2 == '10', 1, 0)")
Traceback (most recent call last):

  File "<ipython-input-10-ed3023e00b60>", line 1, in <module>
    x = ct.eval("where(f2 == '10', 1, 0)")

  File "C:\Anaconda3\lib\site-packages\bcolz-1.0.1.dev112+dirty-py3.5-win-amd64.egg\bcolz\ctable.py", line 1375, in eval
    return bcolz.eval(expression, user_dict=self._ud(user_dict), **kwargs)

  File "C:\Anaconda3\lib\site-packages\bcolz-1.0.1.dev112+dirty-py3.5-win-amd64.egg\bcolz\chunked_eval.py", line 173, in eval
    **kwargs)

  File "C:\Anaconda3\lib\site-packages\bcolz-1.0.1.dev112+dirty-py3.5-win-amd64.egg\bcolz\chunked_eval.py", line 262, in _eval_blocks
    out_flavor, blen, **kwargs)

  File "C:\Anaconda3\lib\site-packages\bcolz-1.0.1.dev112+dirty-py3.5-win-amd64.egg\bcolz\chunked_eval.py", line 252, in _eval_blocks
    res_block = _eval(expression, vars_)

  File "<string>", line 1, in <module>

NameError: name 'where' is not defined

Interestingly, I can do:

In [7]: ct.eval("f2 == '10'")
Out[7]: 
carray((1000,), bool)
  nbytes := 1000; cbytes := 32.00 KB; ratio: 0.03
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
  chunklen := 32768; chunksize: 32768; blocksize: 0
[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
...

This works with no problems. Additionally, coercing the strings to bytes is a suitable workaround.

I'd offer to make a PR to fix this, but this code is way above my elementary understanding of how the evaluator/interpreter works. If someone can provide some hints on where to look, I might be able to search through it.

@FrancescAlted
Contributor Author

Yeah, there are two separate issues here. The first is that numexpr indeed does not understand unicode:

In [25]: x = ra['f2'][:10]

In [26]: x
Out[26]: 
array([u'0', u'3', u'6', u'9', u'12', u'15', u'18', u'21', u'24', u'27'], 
      dtype='<U10')

In [27]: ne.evaluate("x == '9'")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-282aa443feab> in <module>()
----> 1 ne.evaluate("x == '9'")

/home/faltet/software/numexpr/numexpr/necompiler.pyc in evaluate(ex, local_dict, global_dict, out, order, casting, **kwargs)
    787     # Create a signature
    788     signature = [(name, getType(arg)) for (name, arg) in
--> 789                  zip(names, arguments)]
    790 
    791     # Look up numexpr if possible.

/home/faltet/software/numexpr/numexpr/necompiler.pyc in getType(a)
    684     if kind == 'S':
    685         return bytes
--> 686     raise ValueError("unknown type %s" % a.dtype.name)
    687 
    688 

ValueError: unknown type unicode320

The second is that bcolz catches any exception thrown by numexpr and then tries the regular eval() in Python, which is why it complains with NameError: name 'where' is not defined. This is also why the following idiom works:

In [29]: ct.eval("f2 == '9'")[:10]
Out[29]: array([False, False, False,  True, False, False, False, False, False, False], dtype=bool)

which is equivalent to:

In [28]: ct.eval("f2 == '9'", vm="python")[:10]
Out[28]: array([False, False, False,  True, False, False, False, False, False, False], dtype=bool)

Note that you can use dask for performing operations too:

In [30]: ct.eval("f2 == '9'", vm="dask")[:10]
Out[30]: array([False, False, False,  True, False, False, False, False, False, False], dtype=bool)

Having said that, Unicode could indeed be added to numexpr, but that would be a long and painful process, so perhaps the above workaround works better for you. If, despite my warnings, you are still interested in implementing unicode support, tell me and I will try to come up with more concrete suggestions.
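To make the fallback behaviour described in this comment concrete, here is a rough sketch in plain Python. Everything in it is a stand-in invented for illustration: `Column`, `fast_eval` and `eval_with_silent_fallback` are not bcolz or numexpr code, and the real implementation may well differ. It only mimics the pattern of "try the fast evaluator, swallow any error, retry with Python's eval()".

```python
# Illustrative stand-ins only: none of these names come from bcolz or numexpr.

class Column:
    """Tiny stand-in for an array: == compares element by element."""

    def __init__(self, items):
        self.items = list(items)

    def __eq__(self, other):
        return [item == other for item in self.items]


def fast_eval(expression, variables):
    """Pretend fast evaluator that, like numexpr here, rejects str columns."""
    for name, column in variables.items():
        if any(isinstance(item, str) for item in column.items):
            raise ValueError("unknown type in column %r" % name)
    raise NotImplementedError("numeric fast path is not modelled in this sketch")


def eval_with_silent_fallback(expression, variables):
    """Try the fast evaluator; on any error, retry with Python's eval()."""
    try:
        return fast_eval(expression, variables)
    except Exception:
        # The original error is swallowed, which hides the real cause.
        return eval(expression, {}, dict(variables))


columns = {"f2": Column(["0", "3", "6", "9"])}

# A plain comparison survives the fallback because Python can evaluate it.
print(eval_with_silent_fallback("f2 == '9'", columns))

# 'where' only exists in the fast evaluator, so the fallback fails differently.
try:
    eval_with_silent_fallback("where(f2 == '9', 1, 0)", columns)
except NameError as exc:
    print("NameError:", exc)
```

In this sketch the user never sees the `ValueError` about the unsupported type; only the unrelated `NameError` from the fallback surfaces, which matches the confusion reported above.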

@thequackdaddy

No, your warning is fair enough. Browsing through the code, it really looks like there's a confluence of C and Python here, and while my Python is passable, my knowledge of C is nonexistent. I hadn't realized that unicode works if you use the python vm, which is why the boolean comparison works, but a vectorized "where" function is not part of Python, so you get the NameError. Clever error handling.

I'll keep this in mind, but I'm probably going to ignore it if it's not easy (solvable in a few days by me).

@FrancescAlted
Contributor Author

Well, after seeing your use case, I rather think that catching all the errors that numexpr raises in bcolz and silently falling back to the python vm can be confusing. Perhaps raising an error and suggesting to manually try the "python" or "dask" vm would be better. But this is very specific to bcolz, not to numexpr, so I opened an issue in bcolz for further discussion.
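One possible shape for that alternative, sketched in plain Python under loud assumptions: `EvaluationError`, `failing_fast_eval` and `eval_or_explain` are hypothetical names, not bcolz's API, and the stand-in evaluator simply raises the kind of error seen earlier in this thread. The idea is to re-raise with a hint instead of retrying silently.

```python
# Hypothetical sketch: the names below are illustrative, not bcolz's API.

class EvaluationError(RuntimeError):
    """Raised when the fast evaluator fails; the message names alternatives."""


def failing_fast_eval(expression):
    """Stand-in for a numexpr call that rejects a unicode operand."""
    raise ValueError("unknown type unicode320")


def eval_or_explain(expression, fast_eval=failing_fast_eval):
    """Run the fast evaluator and, on failure, raise an error with a hint."""
    try:
        return fast_eval(expression)
    except Exception as exc:
        # Chain the original exception so the real cause stays visible.
        raise EvaluationError(
            "numexpr could not evaluate %r (%s); "
            "try vm='python' or vm='dask' instead" % (expression, exc)
        ) from exc


try:
    eval_or_explain("where(f2 == '10', 1, 0)")
except EvaluationError as err:
    print(err)
```

Chaining with `from exc` keeps the original `ValueError` in the traceback, so the unsupported dtype is reported alongside the suggestion.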

@robbmcleod
Member

Please redirect any discussion regarding unicode support to #263.
