Issue in building the denormalized table inside the Parent-Child Detection metrics #328

Closed
mohamedgy opened this issue Mar 22, 2023 · 1 comment · Fixed by #365
Labels: bug, feature:metrics

Comments

@mohamedgy

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDMetrics version: 0.9.2
  • Python version: 3.9.7
  • Operating System: WSL

Error Description

Hello,
SDMetrics builds the denormalized table with the following code:

    def _denormalize(data, foreign_key):
        """Denormalize the child table over the parent."""
        parent_table, parent_key, child_table, child_key = foreign_key
        flat = data[parent_table].set_index(parent_key).merge(
            data[child_table].set_index(child_key),
            how='outer',
            left_index=True,
            right_index=True,
        ).reset_index(drop=True)
        return flat

The “how” parameter of the “merge” function is set to “outer”. As a result, every row in the parent table that has no child rows produces an extra row whose child table columns are all NaN. For example, let us compute the denormalized table for the train table (7381 rows) of the Telstra database. The train table has the severity_type table (18552 rows) as its parent. With the above code, we get the following denormalized table:

Index  severity_type    location      fault_severity
0      severity_type 1  location 601  1.0
1      severity_type 2  NaN           NaN
2      severity_type 1  NaN           NaN
3      severity_type 4  NaN           NaN
4      severity_type 2  location 460  0.0
...    ...              ...           ...
18547  severity_type 2  location 278  0.0
18548  severity_type 1  NaN           NaN
18549  severity_type 1  location 12   0.0
18550  severity_type 1  NaN           NaN
18551  severity_type 2  NaN           NaN

The expected denormalized table is:

Index  severity_type    location       fault_severity
0      severity_type 2  location 118   1
1      severity_type 2  location 91    0
2      severity_type 2  location 152   1
3      severity_type 1  location 931   1
4      severity_type 1  location 120   0
...    ...              ...            ...
7376   severity_type 2  location 167   0
7377   severity_type 1  location 106   0
7378   severity_type 2  location 1086  2
7379   severity_type 1  location 7     0
7380   severity_type 1  location 885   0

with only 7381 rows, one per row of the train table. We can get this result by changing the value of the “how” parameter from “outer” to “right”.
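
To make the difference easy to reproduce without the Telstra data, here is a minimal sketch on toy tables. The table names, columns, and values are invented for illustration, and `denormalize` is a hypothetical copy of the function above with the join type exposed as a parameter; it is not the SDMetrics API.

```python
import pandas as pd

# Toy tables invented for illustration (not the Telstra data).
# Parents 2 and 4 have no child rows.
data = {
    'parent': pd.DataFrame({
        'id': [1, 2, 3, 4],
        'severity_type': ['a', 'b', 'a', 'c'],
    }),
    'child': pd.DataFrame({
        'id': [1, 1, 3],
        'fault_severity': [0, 1, 2],
    }),
}
foreign_key = ('parent', 'id', 'child', 'id')


def denormalize(data, foreign_key, how):
    """Hypothetical variant of _denormalize with the join type as an argument."""
    parent_table, parent_key, child_table, child_key = foreign_key
    return data[parent_table].set_index(parent_key).merge(
        data[child_table].set_index(child_key),
        how=how,
        left_index=True,
        right_index=True,
    ).reset_index(drop=True)


outer = denormalize(data, foreign_key, how='outer')
right = denormalize(data, foreign_key, how='right')

# The outer join keeps childless parents as rows with NaN child columns;
# the right join keeps exactly one row per child row.
print(len(outer), int(outer['fault_severity'].isna().sum()))
print(len(right), len(data['child']))
```

On this toy input the outer join returns more rows than the child table, with NaN in `fault_severity` for the childless parents, while the right join returns one row per child row.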
Best regards!

@mohamedgy added the "bug" and "new" labels on Mar 22, 2023
@npatki
Contributor

npatki commented Mar 29, 2023

Hi @mohamedgy thanks for catching this. I agree that "outer" doesn't make sense as the join here -- a "right" makes much more sense since the right rows (child table) should point to exactly 1 left row (parent table).
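
As a rough sketch of the "exactly 1 left row" assumption, pandas can check it during the merge through the `validate` argument of `DataFrame.merge`. The tables below are invented for illustration, and nothing here is claimed to be what SDMetrics does; it only shows that a right join preserves the child row count when parent keys are unique, and that `validate='one_to_many'` raises when they are not.

```python
import pandas as pd

# Toy tables invented for illustration.
parent = pd.DataFrame({'id': [1, 2, 3], 'severity_type': ['a', 'b', 'a']})
child = pd.DataFrame({'id': [1, 1, 3], 'fault_severity': [0, 1, 2]})

# Right join: one output row per child row, provided parent keys are unique.
flat = parent.merge(child, on='id', how='right', validate='one_to_many')
assert len(flat) == len(child)

# A parent table with a duplicated key breaks the assumption.
bad_parent = pd.concat([parent, parent.iloc[[0]]], ignore_index=True)

raised = False
try:
    bad_parent.merge(child, on='id', how='right', validate='one_to_many')
except pd.errors.MergeError:
    # pandas refuses the merge because the left (parent) keys are not unique.
    raised = True

print(raised)
```

Note that a right join still keeps a child row whose key has no matching parent; its parent columns would be NaN, so orphaned child rows need a separate check if they can occur.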

We'll add this issue to our queue!

@npatki added the "under discussion" and "feature:metrics" labels and removed the "new" label on Mar 29, 2023
@npatki removed the "under discussion" label on Apr 10, 2023
@amontanez24 added this to the 0.11.0 milestone on Aug 10, 2023
4 participants