Implement sort for data panel and columns #237

hannahkim24 · 2022-04-27T20:30:26Z

No description provided.

seyuboglu · 2022-06-02T18:13:25Z

meerkat/datapanel.py

+                self[panda_col_name] = pd.Series(np.array(self[panda_col_name]))
+
+        else:  # Sort with single column
+            curr_col_type = str(type(self[by[0]]))


Lines 805-811 can probably just be replaced with:

sorted_indices = self[by[0]].argsort(ascending=ascending, kind=kind)

Any reason to explicitly check the type?
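To illustrate the suggestion, here is a minimal sketch in plain NumPy. It assumes the column's `argsort` behaves roughly like `np.argsort`; the `argsort_column` helper and the way it handles `ascending` are hypothetical and are not Meerkat's actual implementation.

```python
import numpy as np


def argsort_column(values, ascending=True, kind="quicksort"):
    """Hypothetical helper: return the indices that would sort `values`."""
    indices = np.argsort(np.asarray(values), kind=kind)
    # Reversing an ascending order yields a descending one
    # (tied elements come out in reversed order, acceptable for a sketch).
    return indices if ascending else indices[::-1]


scores = np.array([3, 1, 2])
sorted_indices = argsort_column(scores, ascending=False)
sorted_scores = scores[sorted_indices]
```

Because the helper only needs something array-like, no explicit check of the column type is required before sorting.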

seyuboglu · 2022-06-02T18:28:04Z

meerkat/datapanel.py

+                curr_col_type = str(type(self[col]))
+                print(curr_col_type)
+                # Convert all columns to numpy type
+                if curr_col_type == panda_col_type:


Easiest way to check a column type is:

isinstance(self[col], PandasSeriesColumn)

https://docs.python.org/3/library/functions.html#isinstance

This has the added advantage of also working for subclasses of PandasSeriesColumn

https://stackoverflow.com/questions/1549801/what-are-the-differences-between-type-and-isinstance#:~:text=answers%2C%20isinstance%20caters%20for%20inheritance,of%20subtypes%2C%20AKA%20subclasses).
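A small sketch of the difference, using stand-in classes rather than the real Meerkat columns (the `CategoricalColumn` subclass is hypothetical and exists only for illustration):

```python
class PandasSeriesColumn:
    """Stand-in for the real column class (illustration only)."""


class CategoricalColumn(PandasSeriesColumn):
    """Hypothetical subclass of the column class."""


col = CategoricalColumn()

# Comparing type strings only matches the exact class.
pandas_col_type = str(PandasSeriesColumn)
matches_by_string = str(type(col)) == pandas_col_type

# isinstance also accepts instances of subclasses.
matches_by_isinstance = isinstance(col, PandasSeriesColumn)
```

Here the string comparison fails for the subclass instance, while `isinstance` succeeds.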

seyuboglu · 2022-06-02T18:32:10Z

meerkat/datapanel.py

+                    panda_col_name = col
+
+                if curr_col_type == tensor_col_type:
+                    self[col] = self[col].numpy()


We don't need to convert the column to NumPy in the original dp; we can just add it to a list of keys and then pass a reversed version of that list to lexsort.
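A minimal sketch of that approach, assuming the columns can be viewed as NumPy arrays; the `columns` dict and its contents are made up for illustration and stand in for the DataPanel. `np.lexsort` treats the last key as the primary sort key, which is why the list is reversed.

```python
import numpy as np

# Sort priority: "group" first, then "score" to break ties.
by = ["group", "score"]
columns = {
    "group": np.array([1, 0, 1, 0]),
    "score": np.array([5, 9, 2, 7]),
}

# Build the keys in priority order without modifying the original columns.
keys = [np.asarray(columns[name]) for name in by]

# np.lexsort uses the LAST key as the primary key, so reverse the list.
sorted_indices = np.lexsort(keys[::-1])
```

The original columns keep their types, so nothing has to be converted back after sorting.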

seyuboglu · 2022-06-02T18:33:40Z

meerkat/datapanel.py

+            self = self.lz[sorted_indices]
+
+
+            # Convert columns to original types 


This won't be necessary if we avoid converting the column type in the actual dp

seyuboglu · 2022-06-02T18:33:50Z

meerkat/datapanel.py

+            sorted_indices = np.lexsort(keys = keys)
+
+            # !! This doesn't update self!! 
+            self = self.lz[sorted_indices]


Any reason for overwriting self here? Could we just put it in a new variable sorted_dp or something?
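For context, assigning to `self` inside a method only rebinds the local name; it does not change the object the caller holds. A toy sketch follows, where the `Panel` class is a hypothetical stand-in and not the real DataPanel:

```python
class Panel:
    """Toy stand-in for a DataPanel (illustration only)."""

    def __init__(self, rows):
        self.rows = list(rows)

    def sort_rebinding_self(self):
        # Rebinds the local name `self`; the caller's object is untouched.
        self = Panel(sorted(self.rows))
        return self

    def sort(self):
        # Same behaviour, but the intent is clearer with a new variable.
        sorted_dp = Panel(sorted(self.rows))
        return sorted_dp


dp = Panel([3, 1, 2])
rebound = dp.sort_rebinding_self()
result = dp.sort()
```

Both methods return a new sorted object and leave `dp` unchanged, so using a separate name such as `sorted_dp` makes the non-in-place behaviour explicit.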

seyuboglu · 2022-06-02T18:45:16Z

meerkat/datapanel.py

+        kind: str = "quicksort",
+    ) -> DataPanel:
+        """ 
+        Sort the DataPanel by the values in the specified columns. Similar to 


TODO(Sabri): Add a comment here specifying that the sort will not be in-place.

seyuboglu · 2022-06-23T01:40:45Z

Tests look awesome - great work! Let's merge this in.

seyuboglu

LGTM – once we address the linting and autoformat issues we can merge in!

* delete nn * Add support for loading train and test set in cifar10" (#193) * Fix issue where tensor columns can't be indexed with pandas series (#195) * Update cifar10 to support test set too (#196) * Fix bacckwards compat issue with base_dir and gcs_image_column (#197) * Support backwards compatibility with nn (#198) * Bump version (#199) * Update contributing to support new dev main structure (#203) * Add args, kwargs to ColumnIOMixin._read_data (#204) Co-authored-by: Jesse Vig <45317205+jessevig@users.noreply.github.com> * Fix from_huggingface and add tests (#205) closes #201 * allow_pickle=true when loading numpy block (#206) * Add downloader to ImageColumn (#207) * Remove default addition of index (#208) * Remove default addition of index * Fix provenance tests * Add DEW contrib to registry (#209) * Catch ConnectionResetError (#210) * Add inaturalist to contrib (#211) * Add inaturalist to contrib * Add annotations to intarualist * Fix issue where arraycolumns can't be saved with jsonlines (#214) * Update the docs and add user guide. 
(#215) * Add contrib for enron (#217) * Fix PIL attribute error on list column representation (#218) * mmap path bug fix (#219) * Downgrade pytorch dependency bound (#220) * Fix issue with subclassing datapanel _state_keys (#224) * Use multiple slices instead of pa.Table.take in ArrowBlock (#226) * Fix issue where boolean list can't index (#227) * Add support for AudioColumn (#222) * Add waterbirds (#228) * Add use guide to indexing and stubs for remaining sections (#225) * Docs/build fix (#230) * Bump version (#231) * Audioset DataPanel (#229) * Add the audioset dataset * Add AudioColumn to audioset datapanel * Fix issue where old datapanels didn't have formatter state (#233) * Make audioset datapanels relational (#235) * Add coco, mir, and pascal (#239) * Make write only write columns in datapanel (#240) * Enforce contiguous index in pandas columns (#244) * Fix issue where ray pickle fails on lazy loader (#245) * Add support for groupby operation * Reorganize the implementation of datasets (#246) * Add support for persistent configuration (#247) * Implement sort for data panel and columns (#237) * Add emb module (#249) * clusterby stuff * Add clusterby * clusterby stuff * Add clusterby * Add embed op (#248) * Autoformat Co-authored-by: Sam Randall <1billionmore@gmail.com> * Reorganize ops code (#250) * Update CI to include 3.9 and 3.10 and to drop 3.7 * Add sample (#251) * Update ci.yml * Add several HAPI datasets (#252) * Update styling of docs (#253) * Bump version (#254) * Remove fastbpe Co-authored-by: Karan Goel <kgoel93@gmail.com> Co-authored-by: Karan Goel <kgoel@cs.stanford.edu> Co-authored-by: Jesse Vig <45317205+jessevig@users.noreply.github.com> Co-authored-by: Khaled Saab <36782882+khaledsaab@users.noreply.github.com> Co-authored-by: Priya2698 <52657555+Priya2698@users.noreply.github.com> Co-authored-by: sam-randall <38796503+sam-randall@users.noreply.github.com> Co-authored-by: Hannah Kim <61199762+hannahkim24@users.noreply.github.com> 
Co-authored-by: Sam Randall <1billionmore@gmail.com>

hannahkim24 force-pushed the feature/sort branch 3 times, most recently from c41f14e to d886a9d Compare April 27, 2022 20:38

seyuboglu and others added 7 commits June 1, 2022 14:52

Add function headers for sort

50a3a6a

Add notebook

490dfbd

Initial sort updates

2b86273

More sort column changes

58d8fc8

Updates for datapanel sort

16f4dca

Initial tests

0c8d8d4

Test updates

623f3db

hannahkim24 force-pushed the feature/sort branch from 9bf4e69 to 623f3db Compare June 2, 2022 02:37

hannahkim24 and others added 2 commits June 1, 2022 19:40

testing

3781c00

Fix issue with misaligned indicies in pandas block

12405db

seyuboglu reviewed Jun 2, 2022

hannahkim24 added 5 commits June 2, 2022 11:53

testing updates

484397d

fixed pandas col comparison

b0858d9

Merge branch 'feature/sort' of https://github.com/robustness-gym/meerkat into feature/sort

d3fb051

Testing sort changes and sort implementation fixes

926c236

Final test and sort implementation changes

cd80f01

seyuboglu self-requested a review June 23, 2022 01:42

seyuboglu approved these changes Jun 23, 2022

Address autoformat

ce62892

seyuboglu merged commit 12e2886 into dev Jun 23, 2022

seyuboglu deleted the feature/sort branch June 23, 2022 18:31

seyuboglu mentioned this pull request Jul 12, 2022

[FEATURE] Sort DataPanel by a column #152

Closed
