
ValueError: Cannot convert float NaN to integer #1730

Closed
ThomasK1018 opened this issue Dec 28, 2023 · 4 comments
Labels
bug (Something isn't working) · data:multi-table (Related to multi-table, relational datasets) · resolution:duplicate (This issue or pull request already exists)

Comments

ThomasK1018 commented Dec 28, 2023

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 1.8.0
  • Python version: 3.11.5
  • Operating System: Mac

Error Description

I used the following code with the attached datasets. Everything worked yesterday, but after I amended some of the datasets' contents, this error started appearing. I reverted everything back to yesterday's version and the error still appears, even though there is not a single missing value in the existing dataset. The datasets are attached for your testing purposes. Thanks.

ads_dataset.csv
feeds_dataset.csv
unique_id.csv

Steps to reproduce

import pandas as pd
from sdv.datasets.local import load_csvs
from sdv.metadata import MultiTableMetadata
from sdv.multi_table import HMASynthesizer

datasets = load_csvs(
    folder_name='/Users/thomaskwok/Downloads/SDV_dataset/New Small Dataset',
    read_csv_parameters={
        'skipinitialspace': True
    })

metadata = MultiTableMetadata()
metadata.detect_from_dataframes(
    data={
        'unique_id': datasets['unique_id'],
        'feeds_dataset': datasets['feeds_dataset'],
        'ads_dataset': datasets['ads_dataset']
    }
)

metadata.update_column(
    table_name='unique_id',
    column_name='user_id',
    sdtype='id')

metadata.update_column(
    table_name='feeds_dataset',
    column_name='user_id',
    sdtype='id')

metadata.update_column(
    table_name='ads_dataset',
    column_name='city',
    sdtype='categorical')

metadata.update_column(
    table_name='ads_dataset',
    column_name='city_rank',
    sdtype='categorical')

metadata.update_column(
    table_name='ads_dataset',
    column_name='user_id',
    sdtype='id')

metadata.set_primary_key(
    table_name='unique_id',
    column_name='user_id'
)

synthesizer = HMASynthesizer(metadata)
synthesizer.fit(datasets)
synthetic_data = synthesizer.sample()
ThomasK1018 added the bug and new labels on Dec 28, 2023
npatki (Contributor) commented Jan 2, 2024

Hi @ThomasK1018, thanks for the detailed description. We were able to replicate your error. For future reference, I'm attaching the stack trace below.

stack_trace.txt

Root Cause

We will investigate more, but in all likelihood this is related to #1691. The known bug is that HMA sometimes creates null values even when there aren't any in the real data. Usually this happens for float64 or string columns in the child table. In this particular case, you have int64 columns (that are categorical). The int64 storage type does not support null values, hence the crash.

As a next step, the team can investigate more to confirm that the root cause is, indeed, #1691. If so, we will dedupe the issue. We can also do a better job of preventing the crash and instead just return the null value (even though this is not quite correct). I've filed RDT issue 747 to keep track of this.
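To see why the int64 storage type makes this crash (rather than silently producing a null), here is a minimal standalone sketch, independent of SDV:

import numpy as np
import pandas as pd

# Python/numpy integers have no representation for NaN, so this cast
# raises the exact error from the title of this issue:
try:
    int(float('nan'))
except ValueError as error:
    print(error)  # cannot convert float NaN to integer

# The same applies when pandas casts a column containing NaN back to
# the int64 storage type:
try:
    pd.Series([1.0, np.nan]).astype('int64')
except ValueError as error:
    print(error)  # Cannot convert non-finite values (NA or inf) to integer

# pandas' nullable Int64 dtype (capital "I") does allow missing values:
print(pd.Series([1, None], dtype='Int64'))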

Workarounds

Changing the default distribution for each table to 'norm' seems to work around the issue because it makes the error much less likely. Unfortunately, this may impact the data quality (but at least it won't crash!).

synthesizer = HMASynthesizer(metadata)

# Use a normal distribution for every numerical column in each table.
for table_name in ['unique_id', 'ads_dataset', 'feeds_dataset']:
    synthesizer.set_table_parameters(
        table_name=table_name,
        table_parameters={
            'enforce_min_max_values': True,
            'default_distribution': 'norm'
        }
    )

synthesizer.fit(datasets)
synthesizer.sample()
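A quick way to confirm the workaround avoided the null issue is to count the missing values in each sampled table (sample() returns a dict mapping table names to DataFrames):

synthetic_data = synthesizer.sample()

# Count null cells per table; each total should be 0 for this data.
for table_name, table in synthetic_data.items():
    print(table_name, table.isna().sum().sum())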

Do note that HMA is only meant for smaller datasets, though. So even with this workaround in place, you'll still see the performance alert:

PerformanceAlert: Using the HMASynthesizer on this metadata schema is not recommended. To model this data, HMA will generate a large number of columns. (1189 columns)


      Table Name  # Columns in Metadata  Est # Columns
0      unique_id                      1           1129
1  feeds_dataset                     26             26
2    ads_dataset                     34             34

The estimate is so large because HMA models each child table by adding learned distribution parameters as extra columns onto the parent table, which is why the single-column unique_id parent is estimated at 1129 columns. We recommend simplifying your metadata schema by dropping columns that are not necessary. If this is not possible, contact us at info@sdv.dev for enterprise solutions.

Ultimately, I'd recommend using the HSASynthesizer, as it's designed to handle larger datasets more robustly. I've confirmed that it works without issue on this dataset (fitting in a few seconds).

npatki added the data:multi-table and under discussion labels and removed the new label on Jan 2, 2024
ThomasK1018 (Author) commented

Hi Neha,

Thanks for your reply. That change to the default distribution has certainly helped. May I also ask what the available choices for this parameter are? I have also tried 'truncnorm', and the output is different from leaving it blank, so I would like to know whether there are more choices of distribution. Thanks.

Best Regards,
Thomas

npatki (Contributor) commented Jan 9, 2024

Hi @ThomasK1018, no problem. For more info, I'd recommend checking our docs here. Note that default_distribution sets the shape for all columns, but you can also override individual columns using the numerical_distributions parameter.
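Per the SDV docs, the available distribution options are 'norm', 'beta' (the default), 'truncnorm', 'uniform', 'gamma', and 'gaussian_kde'. As an illustration (the column name below is a placeholder, and this assumes table_parameters passes numerical_distributions through to the underlying copula model), a per-column override might look like:

synthesizer = HMASynthesizer(metadata)

synthesizer.set_table_parameters(
    table_name='ads_dataset',
    table_parameters={
        # Shape used for every numerical column unless overridden:
        'default_distribution': 'norm',
        # Per-column overrides; 'some_numeric_column' is hypothetical.
        'numerical_distributions': {
            'some_numeric_column': 'truncnorm',
        },
    },
)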

npatki (Contributor) commented Jan 29, 2024

Hi @ThomasK1018, I'm closing off this issue as a duplicate, since we now have more evidence that it is caused by #1691. We are actively looking into this root cause and hope to have a fix out in an upcoming release.

Please feel free to reply directly on #1691 if there is anything more to discuss, and we can always continue the conversation there.

npatki closed this as completed Jan 29, 2024
npatki added the resolution:duplicate label and removed the under discussion label on Jan 29, 2024