What is the best way to merge two datasets? #257

Closed
bowbahdoe opened this issue Jul 1, 2021 · 2 comments

Comments

@bowbahdoe

bowbahdoe commented Jul 1, 2021

I am translating the following pandas code:

initial = pd.read_excel("Initial.xlsx")
final = pd.read_excel("Final.xlsx")

categories = final[["Tag","Category"]].merge(initial[["Tag","Category"]], on = ["Tag","Category"], how = "outer")

And I don't know what to do to get equivalent semantics.

My first instinct was this:

  (let [initial    (first (excel/workbook->datasets "Initial.xlsx"))
        final      (first (excel/workbook->datasets "Final.xlsx"))
        ;; Left join on "Tag" only; "Category" is not part of the join key.
        categories (join/left-join "Tag"
                     (dataset/select-columns initial ["Tag" "Category"])
                     (dataset/select-columns final ["Tag" "Category"]))]
    categories)

But I think it misses the mark. There is no "outer" join, and the column names don't line up either: I get "Sheet1.Category" as an extra column.
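
To make the two complaints concrete, here is a rough plain-Python sketch of a left join on "Tag" alone. It is not the tech.v3.dataset implementation: `left_join`, its `right_prefix` parameter, and the sample rows are all made up for illustration, and the "Sheet1." prefix is only chosen to mimic the column name reported above.

```python
def left_join(left, right, key, right_prefix="right."):
    """Left join two lists of dict rows on a single key column (illustrative only)."""
    # Columns of the right table other than the join key.
    right_cols = [c for c in (right[0] if right else {}) if c != key]

    # Index the right rows by key so each left row can look up its matches.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    out = []
    for lrow in left:
        # A left row with no match is still emitted, with empty right-hand values.
        for rrow in index.get(lrow[key], [None]):
            merged = dict(lrow)
            for col in right_cols:
                # A non-key column present on both sides has to be renamed.
                name = right_prefix + col if col in lrow else col
                merged[name] = None if rrow is None else rrow[col]
            out.append(merged)
    return out


# Hypothetical sample data standing in for the two spreadsheets.
initial = [{"Tag": "a", "Category": "x"}, {"Tag": "b", "Category": "y"}]
final = [{"Tag": "a", "Category": "x"}, {"Tag": "c", "Category": "z"}]

joined = left_join(initial, final, "Tag", right_prefix="Sheet1.")
for row in joined:
    print(row)
```

In this sketch the row tagged "c" never appears, because a left join keeps only keys from the left table, and "Category" shows up twice, because it is not part of the join key. The pandas call avoids both by using an outer join keyed on both columns.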

@cnuernber
Collaborator

Tablecloth offers a full join. I will take a look at merge and see what its semantics are. As a workaround, the low-level hash join returns a lot more information, which you may be able to use to construct what you want.
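
The general idea behind that workaround can be sketched in plain Python. This is an assumption-laden illustration, not the tech.v3.dataset API: `hash_join` and `outer_join` are hypothetical names, and the sample rows are invented. A hash join that reports which rows matched and which did not gives enough information to assemble a full outer join.

```python
def hash_join(left, right, keys):
    """Return (matched index pairs, unmatched left indexes, unmatched right indexes)."""
    # Build a hash index over the right table, keyed on the join columns.
    index = {}
    for j, row in enumerate(right):
        index.setdefault(tuple(row[k] for k in keys), []).append(j)

    pairs, left_only, seen_right = [], [], set()
    # Probe the index with every left row.
    for i, row in enumerate(left):
        hits = index.get(tuple(row[k] for k in keys))
        if hits is None:
            left_only.append(i)
        else:
            for j in hits:
                pairs.append((i, j))
                seen_right.add(j)

    right_only = [j for j in range(len(right)) if j not in seen_right]
    return pairs, left_only, right_only


def outer_join(left, right, keys):
    """Full outer join assembled from the three pieces hash_join returns."""
    pairs, left_only, right_only = hash_join(left, right, keys)
    # Every column seen on either side, used to fill gaps with None.
    columns = list(dict.fromkeys(c for row in (*left, *right) for c in row))
    blank = dict.fromkeys(columns)
    # Same-named non-key columns take the left value here; a real library
    # would rename one of them instead.
    rows = [{**blank, **right[j], **left[i]} for i, j in pairs]
    rows += [{**blank, **left[i]} for i in left_only]
    rows += [{**blank, **right[j]} for j in right_only]
    return rows


# Hypothetical sample data; `final` is on the left, as in the pandas call.
final = [{"Tag": "a", "Category": "x"}, {"Tag": "c", "Category": "z"}]
initial = [{"Tag": "a", "Category": "x"}, {"Tag": "b", "Category": "y"}]

categories = outer_join(final, initial, ["Tag", "Category"])
for row in categories:
    print(row)
```

Because both columns are join keys, no column needs renaming and each distinct (Tag, Category) pair appears once. Row order in this sketch is matched rows, then left-only, then right-only, and is not meant to reproduce the ordering pandas uses.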

@cnuernber
Collaborator

The tech.v3.dataset.join namespace now includes pd-merge, which should match the semantics you are looking for almost exactly.

Keep in mind that the metadata on the returned dataset contains maps recording how column names were mangled, whenever they are mangled, so that you can map from the old name to the new name.
