-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE-REQUEST] Joining on more than one columns. #1948
Comments
Officially not supported. |
Hey @JovanVeljanoski
and its throwing this error:
I'm attaching the part of the error but it was showing error for all the columns. Abnormal behavior is like sometime this code runs perfectly and most of the time it throws these errors. Also first time it throws error but if I re run the code at the same time then it works fine. |
Without a reproducible example, i really can't offer much advice. I would refer you to this response, it is perfectly applicable here also: #1729 (comment) Some things to consider:
|
@JovanVeljanoski Thanks for the immediate reply. Replying to what you have asked:
|
Overview: The LEFT dataset is 25M rows, the RIGHT is 10.5M. Worst case would be nested for loop with 25M*10.5M= 262.5M rows. This would never fit into ram. The typical workload is a LEFT join with 3-5% duplication and 20%-100% match on RIGHT. In attempts to find a method that works for worst case and is fast enough for the common case I've settled with: 1/5. Create LEFT COMPOSITE KEY INDEX using pool.apply(index task, LEFT dataframe) as HDF5 in chunks of 500_000 rows (50 tasks) 2/5. Create RIGHT COMPOSITE KEY INDEX using pool.apply(index task, RIGHT dataframe, limited to entries in LEFT COMPOSITE KEY INDEX) as HDF5 in chunks of 500_000 rows (21 tasks). 3/5. Create a task queue with the 153 join tasks 4/5. Do the join in chunks (with 500_000 + 500_000 rows) for each processor and store each result as HDF5 in the working directory. As the left and right composite key index are stored, all procs can access them in read-only mode. There is no inter-process communication. The tasks can probably be vectorised. The coordination mechanism is the task queue that only declares the search index limits. What about join skewness? Let's argue there are no matches in the RIGHT in index % 3 == 0. Whenever a processor seeks to perform join, it will read the left comp. key index and find no matches in the right comp. key index. The size of the output chunk is now zero. As IO probably will be the bottleneck, the processor can quickly move onto the next task, this means that the queue of tasks will be consumed by the frequency of matches. Thereby the processor load will balance itself. 5/5. Create a virtual dataset (HDF5) from all the chunks. This should take milliseconds. Preview shortcut for vaex: Compute first 5 hits for grid (0,0) and first five hits for grid (-1,-1) in as single task CPU. composite key reference implementations: |
Hey Vaex Team,
Any idea how we can join on more than one column in vaex? I've tried using that merging column hack but it gives error in my case. So how we join on more than one column.
The text was updated successfully, but these errors were encountered: