Provide Unicode support #150
Comments
Here is a small snippet showing what the problem is:
Hello, what is the current thinking on getting unicode/strings supported in numexpr? I think this limitation causes a minor headache for something I've been toying with. Here's something that fails for me:
In [9]: N = 1000
...: ra = np.fromiter(((i, i * 2., u"%.10s" % (i * 3))
...: for i in range(N)), dtype='i4,f8,U10')
...: ct = bcolz.ctable(ra)
In [10]: x = ct.eval("where(f2 == '10', 1, 0)")
Traceback (most recent call last):
File "<ipython-input-10-ed3023e00b60>", line 1, in <module>
x = ct.eval("where(f2 == '10', 1, 0)")
File "C:\Anaconda3\lib\site-packages\bcolz-1.0.1.dev112+dirty-py3.5-win-amd64.egg\bcolz\ctable.py", line 1375, in eval
return bcolz.eval(expression, user_dict=self._ud(user_dict), **kwargs)
File "C:\Anaconda3\lib\site-packages\bcolz-1.0.1.dev112+dirty-py3.5-win-amd64.egg\bcolz\chunked_eval.py", line 173, in eval
**kwargs)
File "C:\Anaconda3\lib\site-packages\bcolz-1.0.1.dev112+dirty-py3.5-win-amd64.egg\bcolz\chunked_eval.py", line 262, in _eval_blocks
out_flavor, blen, **kwargs)
File "C:\Anaconda3\lib\site-packages\bcolz-1.0.1.dev112+dirty-py3.5-win-amd64.egg\bcolz\chunked_eval.py", line 252, in _eval_blocks
res_block = _eval(expression, vars_)
File "<string>", line 1, in <module>
NameError: name 'where' is not defined
Interestingly, I can do:
In [7]: ct.eval("f2 == '10'")
Out[7]:
carray((1000,), bool)
nbytes := 1000; cbytes := 32.00 KB; ratio: 0.03
cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
chunklen := 32768; chunksize: 32768; blocksize: 0
[False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
...
With no problems. Additionally, coercing the strings to bytes is a suitable workaround. I'd offer to make a PR to fix this, but this code is way above my elementary understanding of how the evaluator/interpreter works. If someone can provide some hints on where to look, I might be able to search through it.
Yeah, there are two separate issues here. The first is that numexpr indeed does not understand unicode:
The second is that bcolz catches any possible exception thrown by numexpr and then tries to use the regular Python VM instead,
which is equivalent to:
Note that you can use dask for performing operations too:
Having said that, Unicode could indeed be added to numexpr, but that would be a long and painful process, so perhaps the above workaround works better for you. If, despite my warnings, you are still interested in implementing unicode support, tell me again and I will try to come up with more concrete suggestions.
No, your warning is fair enough. Browsing through the code, it really sounds like there's a confluence of C and Python here, and while my Python is passable, my knowledge of C is nonexistent. I hadn't realized that unicode works if you use the Python VM, which is why the boolean comparison works; a vectorized "where" function is not part of Python, so you get the NameError. Clever error handling. I'll keep this in mind, but I'm probably going to ignore it if it's not easy (solvable in a few days by me).
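The behaviour described in this comment can be reproduced in miniature with the standard library alone. This is only a sketch under stated assumptions, not bcolz's or numexpr's actual code: `fast_eval` is a hypothetical stand-in for a compiled evaluator that rejects text operands, and `eval_with_fallback` is a hypothetical wrapper that catches any exception and re-evaluates the expression with plain Python `eval`.

```python
def fast_eval(expression, variables):
    """Hypothetical stand-in for a compiled evaluator that rejects unicode."""
    for name, value in variables.items():
        if isinstance(value, str):
            raise NotImplementedError("unicode operand %r is not supported" % name)
    # Assumption: only the fast engine provides a `where` helper.
    helpers = {"where": lambda cond, a, b: a if cond else b}
    return eval(expression, {"__builtins__": {}, **helpers}, dict(variables))


def eval_with_fallback(expression, variables):
    """Silently fall back to the plain Python VM when the fast engine fails."""
    try:
        return fast_eval(expression, variables)
    except Exception:
        # The plain Python namespace has no `where`; only operators work here.
        return eval(expression, {"__builtins__": {}}, dict(variables))


row = {"f2": "10"}  # a unicode value, standing in for the 'U10' column

# The comparison is valid Python, so the silent fallback succeeds.
print(eval_with_fallback("f2 == '10'", row))

# `where` exists only in the fast engine, so the fallback raises NameError
# and the original "unicode not supported" error is never shown to the user.
try:
    eval_with_fallback("where(f2 == '10', 1, 0)", row)
except NameError as exc:
    print("fallback failed:", exc)
```

With numeric operands the stand-in engine handles `where` itself, so the NameError only appears once a unicode operand forces the fallback path.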
Well, after seeing your use case, I rather think that catching all the errors that numexpr raises in bcolz and silently falling back to the Python VM can be confusing. Perhaps raising an error and suggesting that the user manually try the "python" or "dask" VM would be better. But this is very specific to bcolz, not to numexpr, so I opened an issue in bcolz for further discussion.
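One possible shape for that alternative is sketched below. Everything here is illustrative: `fast_eval`, `eval_strict`, and `UnsupportedOperandError` are hypothetical names that do not exist in bcolz or numexpr, and the "dask" VM is mentioned only in the error message, not implemented.

```python
class UnsupportedOperandError(TypeError):
    """Hypothetical error type; not part of bcolz or numexpr."""


def fast_eval(expression, variables):
    """Hypothetical stand-in for a compiled evaluator that rejects unicode."""
    for name, value in variables.items():
        if isinstance(value, str):
            raise NotImplementedError("unicode operand %r is not supported" % name)
    return eval(expression, {"__builtins__": {}}, dict(variables))


def eval_strict(expression, variables, vm="fast"):
    """Evaluate with the requested VM and never switch engines silently."""
    if vm == "python":
        # The caller opted in to the plain Python VM explicitly.
        return eval(expression, {"__builtins__": {}}, dict(variables))
    if vm != "fast":
        raise ValueError("unknown vm: %r" % vm)
    try:
        return fast_eval(expression, variables)
    except NotImplementedError as exc:
        # Surface the real cause and tell the user what to try next.
        raise UnsupportedOperandError(
            "%s; try vm='python' or vm='dask' instead" % exc
        ) from exc


row = {"f2": "10"}

try:
    eval_strict("f2 == '10'", row)
except UnsupportedOperandError as exc:
    print(exc)

# Opting in to the Python VM works for plain comparisons.
print(eval_strict("f2 == '10'", row, vm="python"))
```

The design choice in this sketch is that the user sees the original cause (the unsupported unicode operand) rather than a secondary NameError from an engine they did not ask for.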
Please redirect any discussion regarding unicode support to #263.
Unicode support is starting to be requested for numexpr. See:
Blosc/bcolz#38
Although one can always fall back to Python, it is probably worth implementing it in numexpr itself.