Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HMA sampling crashes when unknown sdtype detected for numerical column #2064

Closed
amontanez24 opened this issue Jun 13, 2024 · 0 comments · Fixed by #2089
Closed

HMA sampling crashes when unknown sdtype detected for numerical column #2064

amontanez24 opened this issue Jun 13, 2024 · 0 comments · Fixed by #2089
Assignees
Labels
bug Something isn't working
Milestone

Comments

@amontanez24
Copy link
Contributor

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 1.13.1 and on (maybe before that)
  • Python version: Any
  • Operating System: Any

Error Description

If a numerical column is detected as unknown in the metadata, then sampling seems to fail because it can't cast the generated value which is a string with the prefix sdv-pii- to an int.

Steps to reproduce

import pandas as pd
import faker
import numpy as np

fake = faker.Faker()

table1 = pd.DataFrame({
    'name': [fake.name() for i in range(20)],
    'salary': np.random.randint(20_000, 250_000, 20),
    'age': np.random.randint(18, 70, 20),
    'address': [fake.address() for i in range(20)]
})
table2 = pd.DataFrame({
    'company': [fake.company() for i in range(20)],
    'employee_count': np.random.randint(15, 4000, 20),
    'revenue': np.random.randint(100_000, 4_000_000_000)
})

tables_dict = {'people': table1, 'company': table2}

from sdv.metadata import MultiTableMetadata

metadata = MultiTableMetadata()
metadata.detect_from_dataframes(tables_dict)

from sdv.multi_table import HMASynthesizer

synth = HMASynthesizer(metadata)
synth.fit(tables_dict)
synth.sample(1)

The code above produces

ValueError                                Traceback (most recent call last)
[<ipython-input-4-f0bab0e0379c>](https://localhost:8080/#) in <cell line: 5>()
      3 synth = HMASynthesizer(metadata)
      4 synth.fit(tables_dict)
----> 5 synth.sample(1)

9 frames
[/usr/local/lib/python3.10/dist-packages/pandas/core/dtypes/astype.py](https://localhost:8080/#) in _astype_nansafe(arr, dtype, copy, skipna)
    136     if copy or is_object_dtype(arr.dtype) or is_object_dtype(dtype):
    137         # Explicit copy, or required since NumPy can't view from / to object.
--> 138         return arr.astype(dtype, copy=True)
    139 
    140     return arr.astype(dtype, copy=copy)

ValueError: invalid literal for int() with base 10: 'sdv-pii-cs7y0'
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants