Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement sort for data panel and columns #237

Merged
merged 15 commits into from
Jun 23, 2022
Merged

Implement sort for data panel and columns #237

merged 15 commits into from
Jun 23, 2022

Conversation

hannahkim24
Copy link
Collaborator

No description provided.

@hannahkim24 hannahkim24 force-pushed the feature/sort branch 3 times, most recently from c41f14e to d886a9d Compare April 27, 2022 20:38
self[panda_col_name] = pd.Series(np.array(self[panda_col_name]))

else: # Sort with single column
curr_col_type = str(type(self[by[0]]))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 805-811 can probably just be replaced with:

sorted_indices = self[by[0]].argsort(ascending=ascending, kind=kind)

Any reason to explicitly check the type?

curr_col_type = str(type(self[col]))
print(curr_col_type)
# Convert all columns to numpy type
if curr_col_type == panda_col_type:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Easiest way to check a column type is:

isinstance(self[col], PandasSeriesColumn)

https://docs.python.org/3/library/functions.html#isinstance

This has the added advantage of also working for subclasses of PandasSeriesColumn

https://stackoverflow.com/questions/1549801/what-are-the-differences-between-type-and-isinstance#:~:text=answers%2C%20isinstance%20caters%20for%20inheritance,of%20subtypes%2C%20AKA%20subclasses).

panda_col_name = col

if curr_col_type == tensor_col_type:
self[col] = self[col].numpy()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to convert the column to NumPy in the original dp , we can just add it to a list of keys and then pass a reversed version of that list to lexsort

self = self.lz[sorted_indices]


# Convert columns to original types
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't be necessary if we avoid converting the column type in the actual dp

sorted_indices = np.lexsort(keys = keys)

# !! This doesn't update self!!
self = self.lz[sorted_indices]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for overwriting self here? Could we just put it in a new variable sorted_dp or something?

kind: str = "quicksort",
) -> DataPanel:
"""
Sort the DataPanel by the values in the specified columns. Similar to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO(Sabri): Add a comment here specifying that the sort will not be in-place.

@seyuboglu
Copy link
Collaborator

Tests look awesome - great work! Let's merge this in.

@seyuboglu seyuboglu self-requested a review June 23, 2022 01:42
Copy link
Collaborator

@seyuboglu seyuboglu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM – once we address the linting and autoformat issues we can merge in!

@seyuboglu seyuboglu merged commit 12e2886 into dev Jun 23, 2022
@seyuboglu seyuboglu deleted the feature/sort branch June 23, 2022 18:31
seyuboglu added a commit that referenced this pull request Jul 22, 2022
* delete nn

* Add support for loading train and test set in cifar10" (#193)

* Fix issue where tensor columns can't be indexed with pandas series (#195)

* Update cifar10 to support test set too (#196)

* Fix bacckwards compat issue with base_dir and gcs_image_column (#197)

* Support backwards compatibility with nn (#198)

* Bump version (#199)

* Update contributing to support new dev main structure (#203)

* Add args, kwargs to ColumnIOMixin._read_data (#204)

Co-authored-by: Jesse Vig <45317205+jessevig@users.noreply.github.com>

* Fix from_huggingface and add tests (#205)

closes #201

* allow_pickle=true when loading numpy block (#206)

* Add downloader to ImageColumn (#207)

* Remove default addition of index (#208)

* Remove default addition of index

* Fix provenance tests

* Add DEW contrib to registry (#209)

* Catch ConnectionResetError (#210)

* Add inaturalist to contrib (#211)

* Add inaturalist to contrib

* Add annotations to intarualist

* Fix issue where arraycolumns can't be saved with jsonlines (#214)

* Update the docs and add user guide. (#215)

* Add contrib for enron (#217)

* Fix PIL attribute error on list column representation (#218)

* mmap path bug fix (#219)

* Downgrade pytorch dependency bound (#220)

* Fix issue with subclassing datapanel _state_keys (#224)

* Use multiple slices instead of pa.Table.take in ArrowBlock (#226)

* Fix issue where boolean list can't index (#227)

* Add support for AudioColumn (#222)

* Add waterbirds (#228)

* Add use guide to indexing and stubs for remaining sections (#225)

* Docs/build fix (#230)

* Bump version (#231)

* Audioset DataPanel (#229)

* Add the audioset dataset

* Add AudioColumn to audioset datapanel

* Fix issue where old datapanels didn't have formatter state (#233)

* Make audioset datapanels relational (#235)

* Add coco, mir, and pascal (#239)

* Make write only write columns in datapanel (#240)

* Enforce contiguous index in pandas columns (#244)

* Fix issue where ray pickle fails on lazy loader (#245)

* Add support for groupby operation

* Reorganize the implementation of datasets (#246)

* Add support for persistent configuration (#247)

* Implement sort for data panel and columns (#237)

* Add emb module (#249)

* clusterby stuff

* Add clusterby

* clusterby stuff

* Add clusterby

* Add embed op (#248)

* Autoformat

Co-authored-by: Sam Randall <1billionmore@gmail.com>

* Reorganize ops code (#250)

* Update CI to include 3.9 and 3.10 and to drop 3.7

* Add sample (#251)

* Update ci.yml

* Add several HAPI datasets  (#252)

* Update styling of docs (#253)

* Bump version (#254)

* Remove fastbpe

Co-authored-by: Karan Goel <kgoel93@gmail.com>
Co-authored-by: Karan Goel <kgoel@cs.stanford.edu>
Co-authored-by: Jesse Vig <45317205+jessevig@users.noreply.github.com>
Co-authored-by: Khaled Saab <36782882+khaledsaab@users.noreply.github.com>
Co-authored-by: Priya2698 <52657555+Priya2698@users.noreply.github.com>
Co-authored-by: sam-randall <38796503+sam-randall@users.noreply.github.com>
Co-authored-by: Hannah Kim <61199762+hannahkim24@users.noreply.github.com>
Co-authored-by: Sam Randall <1billionmore@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants