Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG-REPORT] TypeError when Joining Numeric Columns with Some Missing Values #1725

Open
jackdesert opened this issue Nov 22, 2021 · 1 comment

Comments

@jackdesert
Copy link

jackdesert commented Nov 22, 2021

When joining two data frames on numeric type data (where one of them has a None value, vaex throws TypeError

I've tried changing dtype to string or object, but the error persists.

Sample Code:

import vaex

data_1 = {
    "serial": [142, None, 133],
    "mass": [58, 30, 40],
}

data_2 = {
    "id": [142, 74, 133],
    "name": ["jerry", None, "carli"],
}

df1 = vaex.from_dict(data_1)
df2 = vaex.from_dict(data_2)


# Since `serial` has a `None` value, it is mapped to `object` dtype
# Let's change them both to `string` dtype so we can join
df1.serial = df1.serial.astype('string')
df2.id = df2.id.astype('string')

# The data types appears to be unchanged, even after running `astype`
print(f'\ndf1 After `astype`:\n{df1.dtypes}')
print(f'\ndf2 After `astype`:\n{df2.dtypes}')

# This line errors with `TypeError`
df3 = df1.join(df2, left_on="serial", right_on="id")

Vaex Version: {'vaex-core': '4.5.1'}
Vaex installed via pip using pyenv virtualenv
OS: AmazonLinux:2

Here is the output of the example code:

python bug/join.py

df1 After `astype`:
serial    object
mass       int64
dtype: object

df2 After `astype`:
id       int64
name    string
dtype: object
Traceback (most recent call last):
  File "/home/jack.desert/p/cool-project/bug/join.py", line 27, in <module>
    df3 = df1.join(df2, left_on="serial", right_on="id")
  File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/dataframe.py", line 6311, in join
    return vaex.join.join(**kwargs)
  File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/join.py", line 207, in join
    left.map_reduce(map, reduce, [left_on], delay=False, name='fill looking', info=True, to_numpy=False, ignore_filter=True)
  File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/dataframe.py", line 415, in map_reduce
    return self._delay(delay, task)
  File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/dataframe.py", line 1566, in _delay
    self.execute()
  File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/dataframe.py", line 398, in execute
    just_run(self.execute_async())
  File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/asyncio.py", line 35, in just_run
    return loop.run_until_complete(coro)
  File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/nest_asyncio.py", line 70, in run_until_complete
    return f.result()
  File "/home/jack.desert/.pyenv/versions/3.9.9/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/home/jack.desert/.pyenv/versions/3.9.9/lib/python3.9/asyncio/tasks.py", line 258, in __step
    result = coro.throw(exc)
  File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/dataframe.py", line 402, in execute_async
    await self.executor.execute_async()
  File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/execution.py", line 253, in execute_async
    async for element in self.thread_pool.map_async(self.process_part, dataset.chunk_iterator(columns, chunk_size),
  File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/multithreading.py", line 95, in map_async
    value = await value
  File "/home/jack.desert/.pyenv/versions/3.9.9/lib/python3.9/asyncio/futures.py", line 284, in __await__
    yield self  # This tells Task to wait for completion.
  File "/home/jack.desert/.pyenv/versions/3.9.9/lib/python3.9/asyncio/tasks.py", line 328, in __wakeup
    future.result()
  File "/home/jack.desert/.pyenv/versions/3.9.9/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/home/jack.desert/.pyenv/versions/3.9.9/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/multithreading.py", line 90, in <lambda>
    iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in cancellable_iter())
  File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/multithreading.py", line 76, in wrapped
    return callable(self.local.index, *args, **kwargs, **kwargs_extra)
  File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/execution.py", line 368, in process_part
    task_part.process(thread_index, i1, i2, filter_mask, *blocks)
  File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/cpu.py", line 292, in process
    self.values.append(self._map(thread_index, i1, i2, *blocks))
  File "/home/jack.desert/.pyenv/versions/x/lib/python3.9/site-packages/vaex/join.py", line 200, in map
    found_masked = index.map_index(ar, lookup[i1:i2])
TypeError: map_index(): incompatible function arguments. The following argument types are supported:
    1. (self: vaex.superutils.index_hash_int64, arg0: numpy.ndarray[int64]) -> numpy.ndarray[int64]
    2. (self: vaex.superutils.index_hash_int64, arg0: numpy.ndarray[int64], arg1: numpy.ndarray[int8]) -> bool
    3. (self: vaex.superutils.index_hash_int64, arg0: numpy.ndarray[int64], arg1: numpy.ndarray[int16]) -> bool
    4. (self: vaex.superutils.index_hash_int64, arg0: numpy.ndarray[int64], arg1: numpy.ndarray[int32]) -> bool
    5. (self: vaex.superutils.index_hash_int64, arg0: numpy.ndarray[int64], arg1: numpy.ndarray[int64]) -> bool

Invoked with: <vaex.superutils.index_hash_int64 object at 0x7fdcb1707cb0>, array([142, None, 133], dtype=object), array([127, 127, 127], dtype=int8)
@maartenbreddels
Copy link
Member

Hi Jack,

thanks for the report.

There are several issues here:

  • Vaex fails to identify df1.serial as an int column, and it default to dtype=object (do you or @JovanVeljanoski want to open an issue on that or a PR with a test)
  • dtype=object is not well supported in vaex (and never will be I'm afraid), but we have trouble casting dtype=object to string (Do you or @JovanVeljanoski want to create a separate issue for that, or maybe open a PR with a test to demonstrate it)
  • Assiging to a attribute (e.g. df.<column_name> = ...) is not supported, use df[<column_name>] = ...

The fix here is to avoid dtype=object, Arrow is pretty good at detecting that, so this will work for now:

import pyarrow as pa

data_1 = {
    "serial": pa.array([142, None, 133]),
    "mass": [58, 30, 40],
}

cheers,

Maarten

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants