-
Notifications
You must be signed in to change notification settings - Fork 845
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for concat #40
Conversation
- Add a top-level `concat` function for concatenating dataframes, series, or indices together - Remove old `concat` method, as it didn't match pandas api - Fix bugs in `to_pandas` implementation to properly handle indices
pygdf/categorical.py
Outdated
self._ordered == other._ordered and | ||
self._codes_impl == other._codes_impl) | ||
except Exception: | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why hide the Exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because you can't compare pandas custom dtypes and numpy dtypes. We hide the exception so that you can compare categorical and numeric implementations. Although in this case we can skip the try block because we check if other is CategoricalSeriesImpl
. The try/except for NumericalSeriesImpl
is still needed though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The except
will just hide accidental code error. If other
is a NumericalSeriesImpl
, it will just return None
, which will be interpreted as False
. I think the try-except should be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, except for the "return None" bit - I prefer comparison methods to always return booleans directly. Fixed.
pygdf/series.py
Outdated
if o._impl != head._impl: | ||
raise ValueError("All series must be of same type") | ||
|
||
data = head._impl.concat(objs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to delegate to _impl
to handle concat
. The GPU dataframe will always require plain-old-data types. (We can only handle data that can be copied b/w cpu and gpu memory.) It is always safe to just copy the data for concat just like the one in numerical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For categorical data we need to allocate a buffer of series.cat.codes.dtype
, while for numeric we need to allocate a buffer of series.dtype
. I figured it made more sense to special case inside the implementations than check if the series was categorical and special case in the method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The self._data.dtype
(Buffer.dtype
) will always have the physical dtype. The SeriesImpl
has ~~~a~~~ the logical meaning of things. Here, we will only need physical info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point. I didn't realize _data
also had a dtype
attribute, I was relying on self.dtype
(which could be category
) when I implemented it before. Fixed.
Thanks! |
Make xdf a subpackage of cudf instead of standalone
concat
function for concatenating dataframes,series, or indices together
concat
method, as it didn't match pandas apito_pandas
implementation to properly handle indices