Merge `underfilled_job_title` with `employee_position_title` in `employee_salaries` #581

LilianBoulard · 2023-06-08T15:49:41Z

This PR adds a parameter to fetch_employee_salaries so that the main dirty column is overloaded with another column that, from my understanding, adds some new information.

GaelVaroquaux · 2023-06-16T08:11:47Z

I just had a look at the PR. I do not think that I am in favor of adding the new, cleaner, column.

In terms of philosophy, I would like us to try as hard as we can to improve our tools to work on the data as it is, rather than to change the data. This means that we must stare at our examples and wonder what makes them ugly, and then see if we can provide functionality to make them less ugly.

With regard to using more of the upstream scikit-learn code: yes, I'm a thousand times in favor of doing that.

LilianBoulard · 2023-07-24T12:17:04Z

I agree that as much as we can, we should use appropriate tools, but in this specific instance, I think merging them in advance is the best option. Of course if we have a tool designed for this type of issue down the road, we can re-introduce it, but currently, this merge is something we do a lot in the new examples (#546), and it would simplify them quite a bit.

jovan-stojanovic · 2023-08-01T06:28:04Z

I might have missed something, but why do you need to overwrite the employee_position_title column for simplification? It seems to work well in the first example, and I think what you are doing in #546 might work as well. For instance, it works here without preprocessing for the Gap example.

LilianBoulard · 2023-08-01T12:39:32Z

To me, underfilled job title is a column that gives more specific information about the job title. Let me demonstrate:

>>> from skrub.datasets import fetch_employee_salaries
>>> dataset = fetch_employee_salaries()
>>> X = dataset.X  # alias
>>> # Filter, keep only the jobs that contain "Fire"
>>> X = X[X["employee_position_title"].str.contains("Fire")]
>>> X[["employee_position_title", "underfilled_job_title", "date_first_hired"]].head(10)
        employee_position_title            underfilled_job_title date_first_hired
8       Firefighter/Rescuer III  Firefighter/Rescuer I (Recruit)       12/12/2016
42      Firefighter/Rescuer III                              NaN       10/09/2006
107     Firefighter/Rescuer III                              NaN       05/08/2011
128         Fire/Rescue Captain                              NaN       02/26/1990
132     Firefighter/Rescuer III           Firefighter/Rescuer II       03/10/2014
142     Firefighter/Rescuer III                              NaN       03/17/2008
152  Master Firefighter/Rescuer                              NaN       01/30/2006
157         Fire/Rescue Captain                              NaN       09/11/2000
158     Firefighter/Rescuer III                              NaN       03/17/2008
167     Firefighter/Rescuer III           Firefighter/Rescuer II       03/10/2014

When there is a value, underfilled_job_title seems to give a more specific description of the job.
So my proposal is to overload employee_position_title with the values of the underfilled_job_title column.
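
The overload described above can be sketched with plain pandas. This is only an illustration on a toy frame that reuses a few rows from the output shown earlier; it is not necessarily how the PR implements it, and dropping the underfilled_job_title column afterwards is an assumption made here for tidiness.

```python
import numpy as np
import pandas as pd

# Toy frame reusing a few rows from the output above (not the full dataset).
X = pd.DataFrame(
    {
        "employee_position_title": [
            "Firefighter/Rescuer III",
            "Firefighter/Rescuer III",
            "Fire/Rescue Captain",
            "Firefighter/Rescuer III",
        ],
        "underfilled_job_title": [
            "Firefighter/Rescuer I (Recruit)",
            np.nan,
            np.nan,
            "Firefighter/Rescuer II",
        ],
    },
    index=[8, 42, 128, 132],
)

# Take the underfilled title when there is one, otherwise fall back to the
# position title (fillna aligns the two Series on the index).
merged = X["underfilled_job_title"].fillna(X["employee_position_title"])

# Assumption: the now-redundant column is dropped once it has been merged.
X_overloaded = X.drop(columns="underfilled_job_title").assign(
    employee_position_title=merged
)
print(X_overloaded)
```

Rows that have no underfilled title keep their original position title, so nothing is lost for the majority of employees.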

Also, for reference: https://chat.openai.com/share/d4a00de6-d10b-4c5a-af19-43757fb795cf

jovan-stojanovic

OK, I see the difference, though I don't feel it's crucial. But it's good to have it as an option.
I agree with merging this with the parameter set to False by default (you can easily enable it for the examples). WDYT?

skrub/datasets/_fetching.py

LilianBoulard · 2023-08-01T14:04:19Z

You're right, it's not crucial, but it removes some boilerplate from the examples, which I think is a big benefit.
On the True/False default, I think that realistically, these fetching methods are mainly used in examples, so it makes more sense to me for it to be True by default (since we'll set it to True in pretty much all our examples). Maybe we should get a third opinion to settle this, wdyt @Vincent-Maladiere?

Vincent-Maladiere

Hey! The point of our examples is to showcase our features, not to explain how to use pandas or this dataset specifically, IMHO.

I agree with @LilianBoulard that we should do this quick preprocessing by default to simplify the examples, even though having it in the example is not dramatic or ugly.

skrub/datasets/_fetching.py

GaelVaroquaux

I became convinced :).

Merging, thank you!

…oyee_salaries` (skrub-data#581) * Add `overload_job_titles` parameter to `fetch_employee_salaries` * Add changelog entry * Fix path

Add overload_job_titles parameter to fetch_employee_salaries

2befe87

LilianBoulard added the enhancement New feature or request label Jun 8, 2023

LilianBoulard self-assigned this Jun 8, 2023

LilianBoulard mentioned this pull request Jun 12, 2023

Handle id columns differently #585

Open

LilianBoulard changed the title ~~Improve fetching~~ Merge underfilled_job_title with employee_position_title in employee_salaries Jul 24, 2023

LilianBoulard marked this pull request as ready for review July 24, 2023 12:04

Merge branch 'main' of https://github.com/skrub-data/skrub into impro…

7628387

…ve_fetching # Conflicts: # skrub/datasets/_fetching.py

LilianBoulard added 2 commits July 24, 2023 14:22

Add changelog entry

9c99531

Fix path

a502941

LilianBoulard requested a review from jovan-stojanovic July 31, 2023 14:20

jovan-stojanovic requested changes Aug 1, 2023

Vincent-Maladiere reviewed Aug 2, 2023

LilianBoulard requested a review from jovan-stojanovic August 2, 2023 12:36

GaelVaroquaux approved these changes Aug 3, 2023

GaelVaroquaux merged commit 1a6f50f into skrub-data:main Aug 3, 2023
23 of 24 checks passed

LilianBoulard deleted the improve_fetching branch August 4, 2023 11:30

jovan-stojanovic mentioned this pull request Aug 18, 2023

DOC Rework GapEncoder example #686

Merged
