Should ConstraintsNotMetError be a Warning instead? #595

Closed
npatki opened this issue Sep 17, 2021 · 13 comments
Labels
feature:constraints Related to inputting rules or business logic

Comments

@npatki
Contributor

npatki commented Sep 17, 2021

Problem Description

Starting from v0.12.0, the SDV only allows you to fit a model with constraints if all of the input data matches the constraints.

constraint = GreaterThan(low='effective_date', high='due_date')
model = CopulaGAN(constraints=[constraint])

model.fit(my_data)
# Crash (ConstraintsNotMetError) if effective_date > due_date for any single row of my dataset

Expected behavior

It's useful to know whether the input data passes the constraints, but should it really be a hard requirement that every row passes the constraint?

My expectation: Give me a Warning but continue fitting the data.

  • The warning can be descriptive. For example, tell me how many rows aren't passing, or which rows they are
  • It's ok if the SDV drops the offending rows before modeling
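
A minimal sketch of the behavior requested above, written with plain pandas. The helper `drop_constraint_violations` and the sample data are hypothetical and not part of the SDV; it only illustrates warning about the offending rows and dropping them before the data would be passed to `fit`.

```python
import warnings

import pandas as pd


def drop_constraint_violations(data, low, high):
    """Hypothetical helper (not an SDV function).

    Warns about rows where data[low] > data[high], then returns the data
    without those rows. Rows with missing values compare as False and are kept.
    """
    violating = data[low] > data[high]
    if violating.any():
        warnings.warn(
            f"{int(violating.sum())} of {len(data)} rows violate {low} <= {high} "
            f"(index: {list(data.index[violating])}); dropping them before fitting."
        )
    return data[~violating]


# Made-up sample data: the second row has effective_date after due_date.
my_data = pd.DataFrame({
    "effective_date": pd.to_datetime(["2021-01-01", "2021-03-01", "2021-05-01"]),
    "due_date": pd.to_datetime(["2021-02-01", "2021-02-15", "2021-06-01"]),
})

clean_data = drop_constraint_violations(my_data, "effective_date", "due_date")
```

With something like this, `clean_data` rather than `my_data` would be handed to `model.fit`, and the warning text records which rows were left out.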

Additional context

There may be legitimate reasons why a few rows of the input data don't match the constraints: some rows in the dataset were manually overridden exceptions, there was a bug in my application, the rows were generated by some legacy system, etc.

In any case, the only recourse I have now is to manually identify & delete the offending rows.

@npatki npatki added feature request Request for a new feature needs discussion labels Sep 17, 2021
@kveerama
Contributor

@npatki it could be interesting to see if we can provide this information before the fit is done, so the user has the option to make some choices in the fit method. Those options could be to drop or fix the offending rows, or possibly to remove the constraint.

@npatki
Contributor Author

npatki commented Sep 17, 2021

I like that idea. As a default, it might still make sense to crash so I can figure out what to do with it. I think those are the only 2 options available:

  • Drop the offending rows. Or rather, have SDV do it for me. Maybe ignore_violations makes more sense
  • Just remove the constraint. I'd just have to manually get rid of it next time I instantiate the model.

@kveerama kveerama added the feature:constraints Related to inputting rules or business logic label Sep 19, 2021
@kvrameshreddy

Hi @npatki,
The function "_validate_data_on_constraints" in the table.py file validates the constraints over the entire sample data and throws an error if any record fails. Could we instead validate so that the model fits if at least a few rows of the sample satisfy the constraints, and raises an error otherwise?
I am trying to change it as shown here. I am not sure whether this will work for all the constraints. Please review it and suggest a workaround if one is available.
[Attached images: validate_original, validate]

Thank you.

@npatki
Contributor Author

npatki commented Nov 1, 2021

Hi @kvrameshreddy, it would be helpful if you could describe your use case & the constraint you want to add in more detail.

The intended use for constraints is for strict rules that all of the rows must follow. For various reasons, the input data may have a few exceptions to the rule, which is why I filed this issue.

However, if the real data has many exceptions to the rule, I am curious why it's considered a strict rule you want to add? The goal is to emulate the input data. If a majority of the input data isn't following the rule, is this something you want the synthetic data to do?

@kilickursat

I got the following error after using ColumnFormula for UniqueConstraints. I have also attached some parts of the data.

[Attached images: 1, 2, 3]

@npatki
Contributor Author

npatki commented Nov 1, 2021

@kilickursat You may want to try inputting only 1 constraint at a time to see which one is causing the ConstraintsNotMetError. This error indicates that the input data does not follow the rule you specified.

Is it expected that the input has some violations? For now, the easiest workaround will be to delete the offending row(s) in the input data before passing it into the fit function.

@kvrameshreddy

@npatki, my use case is the same: if any single record in the input data does not satisfy the constraints, fitting results in an error. Can we handle this internally by dropping those records and then fitting the model?

@kilickursat

I applied only 1 constraint at a time, but each of them gave the same error. I don't understand why the data doesn't follow the specified rules.

@npatki
Contributor Author

npatki commented Nov 2, 2021

@kilickursat can you check this manually in your input data? It's possible there may be 1-2 rows that are wrong for some reason -- could be due to rounding, misc errors, etc.
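
As a rough sketch of what such a manual check could look like, the snippet below compares a computed column against its formula twice: once exactly and once with a floating-point tolerance. The column names `a`, `b`, `total` and the values are placeholders, not taken from the actual dataset. Rows flagged only by the exact comparison are likely rounding artifacts; rows flagged by both are genuine violations.

```python
import numpy as np
import pandas as pd

# Placeholder data: "total" is supposed to equal a + b in every row.
df = pd.DataFrame({
    "a": [100.0, 150.0, 0.1],
    "b": [200.0, 300.0, 0.2],
    "total": [300.0, 460.0, 0.3],
})

expected = df["a"] + df["b"]

# Exact comparison: also flags rows that differ only by floating-point rounding.
exact_mismatch = df["total"] != expected

# Tolerant comparison (np.isclose defaults): flags only rows that are really off.
real_mismatch = ~np.isclose(df["total"], expected)

rounding_only_rows = df[exact_mismatch & ~real_mismatch]
offending_rows = df[real_mismatch]
```

Here `offending_rows` holds the rows to fix or delete, while `rounding_only_rows` holds rows whose stored value could be rounded or recomputed from the formula before fitting.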

One other thing to check: Make sure that you're using the latest version of SDV (v0.12.1). The earlier version had an issue when constraints were applied to columns that could have NaN values.

@kilickursat

@npatki thanks for all your effort. I have checked the file and converted it from Excel to CSV. But in this case I got a different error, "MissingConstraintColumnError".

import numpy as np
import pandas as pd

df = pd.read_csv('soil.csv')

from sdv.tabular import CopulaGAN
from sdv.constraints import ColumnFormula


# According to Mohr-Coulomb, the following functions were created using UniqueConstraints of SDV GAN
def Compressive_Strength(df):
  return df["Effective_AxialStress"]-df["Effective_LateralStress"]

Compressive_Strength_constraint=ColumnFormula(
    column="Compressive_Strength",
    formula=Compressive_Strength,
    handling_strategy="transform"
)


def Effective_AxialStress(df):
  return df["Effective_LateralStress"]+df["Compressive_Strength"]

Effective_AxialStress_constraint=ColumnFormula(
    column="Effective_AxialStress",
    formula=Effective_AxialStress,
    handling_strategy="transform"
)

constraint = [Compressive_Strength_constraint, 
               Effective_AxialStress_constraint]


model = CopulaGAN(constraints=constraint)
model.fit(df)

MissingConstraintColumnError

@npatki
Contributor Author

npatki commented Nov 4, 2021

@kilickursat It seems like you have a circular dependency in your formulas

  • To compute the Compressive_Strength, you need to know the Effective_AxialStress
  • To compute the Effective_AxialStress, you need to know the Compressive_Strength

The SDV cannot handle circular dependencies. Is there a way to formulate your constraints in a different way?
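
To illustrate why the two formulas cannot both be applied, here is a small sketch using Python's standard `graphlib` module. The `dependencies` mapping is written by hand from the two formulas above, and `computation_order` is a hypothetical helper; this is not how the SDV itself detects the problem, only a way to see that no valid computation order exists.

```python
from graphlib import CycleError, TopologicalSorter

# Hand-written map: computed column -> columns its formula reads.
dependencies = {
    "Compressive_Strength": {"Effective_AxialStress", "Effective_LateralStress"},
    "Effective_AxialStress": {"Effective_LateralStress", "Compressive_Strength"},
}


def computation_order(deps):
    """Return an order in which the columns can be computed, or None on a cycle."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError:
        return None


# Both formulas together: each column needs the other first, so no order exists.
both = computation_order(dependencies)

# Keeping only one formula removes the cycle; the computed column comes last.
single = computation_order(
    {"Compressive_Strength": dependencies["Compressive_Strength"]}
)
```

Since both formulas express the same relationship, keeping just one of them is one possible reformulation. Whether that is sufficient for this dataset is not confirmed in the thread.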

@kilickursat

Thanks again. Effective stress and compressive strength are known. I will try different ways to find those parameters, but they are linked to each other.

@katxiao katxiao added work planned and removed feature request Request for a new feature needs discussion labels Nov 12, 2021
@npatki
Contributor Author

npatki commented Jun 10, 2022

Update: We are defining constraints as business rules that are true of every row in your dataset. Models cannot be expected to properly learn constraints if the real data does not exhibit the same properties.

For a more user-friendly API we will:

  1. Print out the offending rows whenever there's a ConstraintsNotMetError so that users can easily debug the issue -- see Improve error message for invalid constraints #801
  2. Add documentation about what you can do if you encounter this (remove the constraint or remove the offending rows)

@npatki npatki closed this as completed Jun 10, 2022