Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataFrame] Fully implement append, concat and join #1932

Merged
merged 7 commits into from Apr 24, 2018

Conversation

devin-petersohn
Copy link
Member

@devin-petersohn devin-petersohn commented Apr 22, 2018

Make some changes to concat and DataFrame.join. Changes are:

  • Fix concat for pandas.Series
  • Implement concat for axis=1 and keys.
  • Implement DataFrame.join
  • Implement DataFrame.append

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5026/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5027/
Test PASSed.

@devin-petersohn devin-petersohn changed the title [DataFrame] Fully implement concat and join [DataFrame] Fully implement append, concat and join Apr 22, 2018
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5028/
Test FAILed.

if keys is not None:
objs = [objs[k] for k in keys]
else:
objs = list(objs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None objects need to be dropped from objs as specified in pandas docs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved below

else:
objs = list(objs)

if len(objs) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to handle case of ValueError: All objects passed were None

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved below

pdf.columns = pd.RangeIndex(len(new_columns))

return pdf
if isinstance(objs, dict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case is not necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this case not necessary?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it is, my comment was wrong. I had understood that you could only pass in a dictionary with keys specified. Turns out you can pass in a dictionary by itself.


# (TODO) Group all the pandas dataframes
# We need this in a list because we use it later.
all_index, all_columns = list(zip(*[(obj.index, obj.columns)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not work for Panel objects which do not have index or columns properties

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I'm going to just drop Panel support.


# Put all of the DataFrames into Ray format
# TODO just partition the DataFrames instead of building a new Ray DF.
objs = [DataFrame(obj) if isinstance(obj, (pandas.DataFrame,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All pandas.Series objects would already be DataFrames by this point. Does it make sense to combine the steps?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's not completely efficient to do it this way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in series_to_df.

other = pd.DataFrame(other.values.reshape((1, len(other))),
index=index,
columns=combined_columns)
other = other._convert(datetime=True, timedelta=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the current DataFrame here need to reindex its columns to the combined_columns?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will happen in concat.

if isinstance(other, pd.Series):
if other.name is None:
raise ValueError("Other Series must have a name")
other = DataFrame({other.name: other})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can pass other directly into DataFrame constructor. It carries the series name over to the column name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is similar to how Pandas does it, so I vote we keep it this way. It's probably for clarity.

raise ValueError("Joining multiple DataFrames only supported"
" for joining on index")

# Joining the empty DataFrames with either index of columns is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'of' -> 'or'

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5037/
Test FAILed.

type_check = next(obj for obj in objs
if not isinstance(obj, (pandas.Series,
pandas.DataFrame, DataFrame,
pandas.Panel)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped support for pandas.Panel?

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5040/
Test PASSed.

@robertnishihara robertnishihara merged commit 1d1df7b into ray-project:master Apr 24, 2018
alok added a commit to alok/ray that referenced this pull request Apr 28, 2018
* master:
  updates (ray-project#1958)
  Pin Cython in autoscaler development example. (ray-project#1951)
  Incorporate C++ Buffer management and Seal global threadpool fix from arrow (ray-project#1950)
  [XRay] Add consistency check for protocol between node_manager and local_scheduler_client (ray-project#1944)
  Remove smart_open install. (ray-project#1943)
  [DataFrame] Fully implement append, concat and join (ray-project#1932)
  [DataFrame] Fix for __getitem__ string indexing (ray-project#1939)
  [DataFrame] Implementing write methods (ray-project#1918)
  [rllib] arr[end] was excluded when end is not None (ray-project#1931)
  [DataFrame] Implementing API correct groupby with aggregation methods (ray-project#1914)
  Handle interrupts correctly for ASIO synchronous reads and writes. (ray-project#1929)
  [DataFrame] Adding read methods and tests (ray-project#1712)
  Allow task_table_update to fail when tasks are finished. (ray-project#1927)
  [rllib] Contribute DDPG to RLlib (ray-project#1877)
  [xray] Workers blocked in a `ray.get` release their resources (ray-project#1920)
  Raylet task dispatch and throttling worker startup (ray-project#1912)
  [DataFrame] Eval fix (ray-project#1903)
royf added a commit to royf/ray that referenced this pull request Jun 22, 2018
* 'master' of https://github.com/ray-project/ray:
  [rllib] Fix broken link in docs (ray-project#1967)
  [DataFrame] Sample implement (ray-project#1954)
  [DataFrame] Implement Inter-DataFrame operations (ray-project#1937)
  remove UniqueIDHasher (ray-project#1957)
  [rllib] Add DDPG documentation, rename DDPG2 <=> DDPG (ray-project#1946)
  updates (ray-project#1958)
  Pin Cython in autoscaler development example. (ray-project#1951)
  Incorporate C++ Buffer management and Seal global threadpool fix from arrow (ray-project#1950)
  [XRay] Add consistency check for protocol between node_manager and local_scheduler_client (ray-project#1944)
  Remove smart_open install. (ray-project#1943)
  [DataFrame] Fully implement append, concat and join (ray-project#1932)
  [DataFrame] Fix for __getitem__ string indexing (ray-project#1939)
  [DataFrame] Implementing write methods (ray-project#1918)
  [rllib] arr[end] was excluded when end is not None (ray-project#1931)
  [DataFrame] Implementing API correct groupby with aggregation methods (ray-project#1914)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants