[DataFrame] Fully implement append, concat and join #1932

devin-petersohn · 2018-04-22T01:57:08Z

This PR changes concat and implements DataFrame.join and DataFrame.append. The changes are:

Fix concat for pandas.Series.
Implement concat for axis=1 and keys.
Implement DataFrame.join.
Implement DataFrame.append.

AmplabJenkins · 2018-04-22T03:01:45Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5026/

AmplabJenkins · 2018-04-22T03:17:41Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5027/

AmplabJenkins · 2018-04-22T03:57:34Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5028/

kunalgosar · 2018-04-22T07:03:50Z

python/ray/dataframe/concat.py

+    if keys is not None:
+        objs = [objs[k] for k in keys]
+    else:
+        objs = list(objs)


None objects need to be dropped from objs as specified in pandas docs.

Resolved below

kunalgosar · 2018-04-22T07:05:16Z

python/ray/dataframe/concat.py

+    else:
+        objs = list(objs)
+
+    if len(objs) == 0:


Need to handle the case where every object passed is None. pandas raises ValueError: All objects passed were None.

Resolved below

kunalgosar · 2018-04-22T07:13:15Z

python/ray/dataframe/concat.py

-                pdf.columns = pd.RangeIndex(len(new_columns))
-
-                return pdf
+    if isinstance(objs, dict):


This case is not necessary.

Why is this case not necessary?

Actually it is, my comment was wrong. I had understood that you could only pass in a dictionary with keys specified. Turns out you can pass in a dictionary by itself.
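The three threads above (dict input with or without keys, dropping None entries, and the all-None error) can be sketched in plain Python. This is a minimal illustration, not the code from this PR: `normalize_objs` is a hypothetical helper name, and falling back to the dict's own key order when `keys` is omitted is an assumption. The two error messages are the ones pandas raises.

```python
def normalize_objs(objs, keys=None):
    """Return the objects to concatenate as a list, with None entries dropped.

    Hypothetical helper for illustration; not taken from the PR.
    """
    if isinstance(objs, dict):
        # A dict may be passed with or without `keys`. Without them, this
        # sketch uses the dict's own key order (an assumption).
        if keys is None:
            keys = list(objs.keys())
        objs = [objs[k] for k in keys]
    else:
        objs = list(objs)

    if len(objs) == 0:
        raise ValueError("No objects to concatenate")

    # None entries are dropped silently ...
    objs = [obj for obj in objs if obj is not None]

    # ... unless every entry was None, which is an error.
    if len(objs) == 0:
        raise ValueError("All objects passed were None")

    return objs
```

For example, `normalize_objs({"x": 1, "y": None, "z": 3})` returns `[1, 3]`, while `normalize_objs([None, None])` raises the second ValueError.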

kunalgosar · 2018-04-22T07:16:40Z

python/ray/dataframe/concat.py


-    # (TODO) Group all the pandas dataframes
+    # We need this in a list because we use it later.
+    all_index, all_columns = list(zip(*[(obj.index, obj.columns)


This will not work for Panel objects, which do not have index or columns properties.

True, I'm going to just drop Panel support.
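For readers unfamiliar with the `zip(*...)` idiom in the diff line above, here is a small sketch of what it does and why an object without `index` or `columns` breaks it. The `SimpleNamespace` objects and the `NoIndex` class are stand-ins invented for this example; they are not pandas or Ray types.

```python
from types import SimpleNamespace

# Stand-ins for DataFrames: only the two attributes the snippet reads.
objs = [
    SimpleNamespace(index=[0, 1], columns=["a", "b"]),
    SimpleNamespace(index=[2, 3], columns=["b", "c"]),
]

# zip(*pairs) transposes a list of (index, columns) pairs into two tuples:
# one holding every index, one holding every columns object.
# list() materializes the zip iterator before unpacking, as in the diff.
all_index, all_columns = list(zip(*[(obj.index, obj.columns) for obj in objs]))


class NoIndex:
    """Stand-in for a type that has neither `index` nor `columns`."""


# Attribute access happens inside the list comprehension, so an object
# lacking these attributes fails before zip is ever called.
try:
    zip(*[(obj.index, obj.columns) for obj in [NoIndex()]])
    failed = False
except AttributeError:
    failed = True
```

Here `all_index` is `([0, 1], [2, 3])` and `all_columns` is `(["a", "b"], ["b", "c"])`.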

kunalgosar · 2018-04-22T07:24:00Z

python/ray/dataframe/concat.py

+
+    # Put all of the DataFrames into Ray format
+    # TODO just partition the DataFrames instead of building a new Ray DF.
+    objs = [DataFrame(obj) if isinstance(obj, (pandas.DataFrame,


All pandas.Series objects would already be DataFrames by this point. Does it make sense to combine the steps?

Yes, it's not completely efficient to do it this way.

Resolved in series_to_df.

kunalgosar · 2018-04-22T07:33:43Z

python/ray/dataframe/dataframe.py

+            other = pd.DataFrame(other.values.reshape((1, len(other))),
+                                 index=index,
+                                 columns=combined_columns)
+            other = other._convert(datetime=True, timedelta=True)


Does the current DataFrame here need to reindex its columns to the combined_columns?

This will happen in concat.

kunalgosar · 2018-04-22T07:37:28Z

python/ray/dataframe/dataframe.py

+        if isinstance(other, pd.Series):
+            if other.name is None:
+                raise ValueError("Other Series must have a name")
+            other = DataFrame({other.name: other})


You can pass other directly into the DataFrame constructor. It carries the Series name over to the column name.

This is similar to how Pandas does it, so I vote we keep it this way. It's probably for clarity.
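The two spellings discussed here can be compared directly in plain pandas. This sketch uses `pandas.DataFrame` rather than the Ray `DataFrame` from the diff, so it only illustrates the pandas behavior the reviewer refers to; the variable names are made up for the example.

```python
import pandas as pd

named = pd.Series([1, 2, 3], name="a")
unnamed = pd.Series([1, 2, 3])

# Both forms produce a single column labelled with the Series name.
via_dict = pd.DataFrame({named.name: named})
via_ctor = pd.DataFrame(named)

# Without a name, the constructor falls back to the integer label 0,
# which is a plausible reason to reject unnamed Series before joining.
fallback = pd.DataFrame(unnamed)
```

Either form gives a one-column frame named "a"; the explicit dict makes the dependence on `other.name` visible at the call site.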

kunalgosar · 2018-04-22T07:47:20Z

python/ray/dataframe/dataframe.py

+                raise ValueError("Joining multiple DataFrames only supported"
+                                 " for joining on index")
+
+            # Joining the empty DataFrames with either index of columns is


'of' -> 'or'

AmplabJenkins · 2018-04-23T06:37:28Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5037/

p-yang · 2018-04-23T06:30:33Z

python/ray/dataframe/concat.py

+        type_check = next(obj for obj in objs
+                          if not isinstance(obj, (pandas.Series,
+                                                  pandas.DataFrame, DataFrame,
+                                                  pandas.Panel)))


Dropped support for pandas.Panel?
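One detail worth keeping in mind with the `next(...)` pattern in this snippet: without a default, `next` raises StopIteration when the generator is empty, which here would mean every object passed the check. The snippet is truncated, so how the PR handles that case is not visible. Below is a minimal sketch of the same check with a default, using hypothetical stand-in classes (`Series`, `Frame`) and a hypothetical helper name in place of the real types.

```python
class Series:
    """Stand-in for a supported Series type (hypothetical)."""


class Frame:
    """Stand-in for a supported DataFrame type (hypothetical)."""


# Assumed set of supported types for this sketch; Panel is left out.
SUPPORTED = (Series, Frame)


def first_unsupported(objs, supported=SUPPORTED):
    """Return the first object whose type is not supported, or None.

    Assumes None entries were already filtered out of `objs`.
    """
    # The default argument keeps next() from raising StopIteration
    # when every object passes the isinstance check.
    return next((obj for obj in objs if not isinstance(obj, supported)), None)
```

A caller could then raise a TypeError naming `type(bad)` when the result is not None.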

AmplabJenkins · 2018-04-23T15:58:40Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5040/


devin-petersohn added 4 commits April 21, 2018 18:48

Fixing some concat issues, adding join

92cb5f1

Fixing minor inplace bug and creating test

05f2130

Fixing concat and join

37d376f

Fixing tests

46e1ab4

devin-petersohn mentioned this pull request Apr 22, 2018

[DataFrame] Allow mixed type concat #1866

Closed

Implementing append

ec6d9dd

devin-petersohn changed the title ~~[DataFrame] Fully implement concat and join~~ [DataFrame] Fully implement append, concat and join Apr 22, 2018

kunalgosar suggested changes Apr 22, 2018


devin-petersohn mentioned this pull request Apr 23, 2018

[DataFrame] Implement Inter-DataFrame operations #1937

Merged

Addressing comments

246abed

devin-petersohn mentioned this pull request Apr 23, 2018

[DataFrame] concat operation does not preserve Nones #1865

Closed

p-yang suggested changes Apr 23, 2018


Dropping Panel support

92fe4bf

robertnishihara approved these changes Apr 24, 2018


robertnishihara merged commit 1d1df7b into ray-project:master Apr 24, 2018
