When joining two data frames on numeric data where one of the join columns contains a None value, vaex throws a TypeError.
I've tried changing dtype to string or object, but the error persists.
Sample Code:
import vaex
data_1 = {
"serial": [142, None, 133],
"mass": [58, 30, 40],
}
data_2 = {
"id": [142, 74, 133],
"name": ["jerry", None, "carli"],
}
df1 = vaex.from_dict(data_1)
df2 = vaex.from_dict(data_2)
# Since `serial` has a `None` value, it is mapped to `object` dtype
# Let's change them both to `string` dtype so we can join
df1.serial = df1.serial.astype('string')
df2.id = df2.id.astype('string')
# The data types appear to be unchanged, even after running `astype`
print(f'\ndf1 After `astype`:\n{df1.dtypes}')
print(f'\ndf2 After `astype`:\n{df2.dtypes}')
# This line errors with `TypeError`
df3 = df1.join(df2, left_on="serial", right_on="id")
Vaex Version: {'vaex-core': '4.5.1'}
Vaex installed via pip using pyenv virtualenv
OS: AmazonLinux:2
Here is the output of the example code:
python bug/join.py
df1 After `astype`:
serial object
mass int64
dtype: object
df2 After `astype`:
id int64
name string
dtype: object
Traceback (most recent call last):
File "/home/jack.desert/p/cool-project/bug/join.py", line 27, in <module>
df3 = df1.join(df2, left_on="serial", right_on="id")
File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/dataframe.py", line 6311, in join
return vaex.join.join(**kwargs)
File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/join.py", line 207, in join
left.map_reduce(map, reduce, [left_on], delay=False, name='fill looking', info=True, to_numpy=False, ignore_filter=True)
File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/dataframe.py", line 415, in map_reduce
return self._delay(delay, task)
File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/dataframe.py", line 1566, in _delay
self.execute()
File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/dataframe.py", line 398, in execute
just_run(self.execute_async())
File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/asyncio.py", line 35, in just_run
return loop.run_until_complete(coro)
File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/nest_asyncio.py", line 70, in run_until_complete
return f.result()
File "/home/jack.desert/.pyenv/versions/3.9.9/lib/python3.9/asyncio/futures.py", line 201, in result
raise self._exception
File "/home/jack.desert/.pyenv/versions/3.9.9/lib/python3.9/asyncio/tasks.py", line 258, in __step
result = coro.throw(exc)
File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/dataframe.py", line 402, in execute_async
await self.executor.execute_async()
File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/execution.py", line 253, in execute_async
async for element in self.thread_pool.map_async(self.process_part, dataset.chunk_iterator(columns, chunk_size),
File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/multithreading.py", line 95, in map_async
value = await value
File "/home/jack.desert/.pyenv/versions/3.9.9/lib/python3.9/asyncio/futures.py", line 284, in __await__
yield self # This tells Task to wait for completion.
File "/home/jack.desert/.pyenv/versions/3.9.9/lib/python3.9/asyncio/tasks.py", line 328, in __wakeup
future.result()
File "/home/jack.desert/.pyenv/versions/3.9.9/lib/python3.9/asyncio/futures.py", line 201, in result
raise self._exception
File "/home/jack.desert/.pyenv/versions/3.9.9/lib/python3.9/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/multithreading.py", line 90, in <lambda>
iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in cancellable_iter())
File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/multithreading.py", line 76, in wrapped
return callable(self.local.index, *args, **kwargs, **kwargs_extra)
File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/execution.py", line 368, in process_part
task_part.process(thread_index, i1, i2, filter_mask, *blocks)
File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/cpu.py", line 292, in process
self.values.append(self._map(thread_index, i1, i2, *blocks))
File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/join.py", line 200, in map
found_masked = index.map_index(ar, lookup[i1:i2])
TypeError: map_index(): incompatible function arguments. The following argument types are supported:
1. (self: vaex.superutils.index_hash_int64, arg0: numpy.ndarray[int64]) -> numpy.ndarray[int64]
2. (self: vaex.superutils.index_hash_int64, arg0: numpy.ndarray[int64], arg1: numpy.ndarray[int8]) -> bool
3. (self: vaex.superutils.index_hash_int64, arg0: numpy.ndarray[int64], arg1: numpy.ndarray[int16]) -> bool
4. (self: vaex.superutils.index_hash_int64, arg0: numpy.ndarray[int64], arg1: numpy.ndarray[int32]) -> bool
5. (self: vaex.superutils.index_hash_int64, arg0: numpy.ndarray[int64], arg1: numpy.ndarray[int64]) -> bool
Invoked with: <vaex.superutils.index_hash_int64 object at 0x7fdcb1707cb0>, array([142, None, 133], dtype=object), array([127, 127, 127], dtype=int8)
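The `Invoked with` line above shows the lookup array arriving as `dtype=object`, while every supported overload expects `numpy.ndarray[int64]`. A minimal sketch of where that object dtype comes from, using NumPy only (the variable names are illustrative and this does not touch vaex itself):

```python
import numpy as np

# A Python list that mixes ints with None cannot be stored in an
# integer array, so NumPy falls back to a generic object array.
serial = np.array([142, None, 133])

# The same kind of list without None is stored as a native integer array.
ids = np.array([142, 74, 133])

print(serial.dtype)
print(ids.dtype)
```

The first array is the same kind of value that appears in the traceback as `array([142, None, 133], dtype=object)`.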
Vaex fails to identify df1.serial as an int column, and it defaults to dtype=object (do you or @JovanVeljanoski want to open an issue on that, or a PR with a test?)
dtype=object is not well supported in vaex (and never will be, I'm afraid), but we have trouble casting dtype=object to string (do you or @JovanVeljanoski want to create a separate issue for that, or maybe open a PR with a test to demonstrate it?)
Assigning to an attribute (e.g. df.<column_name> = ...) is not supported, use df[<column_name>] = ...
The fix here is to avoid dtype=object. Arrow is pretty good at detecting the right type, so this will work for now: