Having trouble creating metadata based on a dataframe #1998

Closed
fyy623 opened this issue May 11, 2024 · 2 comments
Labels
question General question about the software resolution:WAI The software is working as intended

Comments


fyy623 commented May 11, 2024

Below is the start of my code:

import pandas as pd
import json
from sdv.single_table import CTGANSynthesizer
from sdv.metadata import SingleTableMetadata
train = pd.read_csv('data/final/published.csv')

According to the documentation, there are two ways to create metadata from a dataframe.

One is to use SDV's built-in API:

metadata = SingleTableMetadata().detect_from_dataframe(train)

The other is to manually create a JSON object as the metadata:

m_json = {}
m_json['METADATA_SPEC_VERSION'] = 'SINGLE_TABLE_V1'
m_json['columns'] = {}
for col in list(train):
    m_json['columns'][col] = {}
    m_json['columns'][col]['sdtype'] = 'numerical'
metadata = json.dumps(m_json)

Either way, the metadata should be passed to CTGANSynthesizer like this:

synthesizer = CTGANSynthesizer(
    metadata,
    epochs=500,
    verbose=True)
synthesizer.fit(train)

The first, auto-detect method gives me this error: AttributeError: 'NoneType' object has no attribute 'validate'

The second, manual method raises this error: AttributeError: 'str' object has no attribute 'validate'

Since the two errors are probably related, I think it makes sense to discuss them together.

My JSON object prints something like this:

{"METADATA_SPEC_VERSION": "SINGLE_TABLE_V1", "columns": {"col1": {"sdtype": "numerical"}, "col2": {"sdtype": "numerical"}, "col3": {"sdtype": "numerical"}, "col4": {"sdtype": "numerical"}}}
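
The second error message points at the type of the value being passed: json.dumps returns a plain str, which has no validate method. The sketch below rebuilds the same dictionary and shows the type change. The helper names build_numerical_metadata_dict and to_metadata_object are hypothetical, and the conversion step assumes the installed SDV version provides SingleTableMetadata.load_from_dict.

```python
import json


def build_numerical_metadata_dict(columns):
    """Return a single-table metadata dict marking every column as numerical."""
    return {
        "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
        "columns": {col: {"sdtype": "numerical"} for col in columns},
    }


def to_metadata_object(metadata_dict):
    """Turn the dict into a metadata object (assumes SDV is installed)."""
    # Imported lazily so the rest of this sketch runs without SDV.
    from sdv.metadata import SingleTableMetadata

    # Assumption: load_from_dict is available in the SDV version in use.
    return SingleTableMetadata.load_from_dict(metadata_dict)


metadata_dict = build_numerical_metadata_dict(["col1", "col2"])
as_text = json.dumps(metadata_dict)

# json.dumps serializes the dict to text; the result is a str, not metadata.
print(type(metadata_dict).__name__, type(as_text).__name__)  # dict str
```

Under these assumptions, to_metadata_object(metadata_dict) would give an object that can be passed to CTGANSynthesizer, whereas as_text would trigger the 'str' error.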
fyy623 added the new and question labels May 11, 2024

fyy623 commented May 11, 2024

I fixed it myself. Try the following:

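import os

# Build the metadata object directly instead of passing a dict or a JSON string.
metadata = SingleTableMetadata()
for col in list(train):
    metadata.add_column(column_name=col, sdtype='numerical', computer_representation='Float')

# Remove any stale copy first so that saving does not fail on an existing file.
if os.path.exists("metadata.json"):
    os.remove("metadata.json")

metadata.save_to_json(filepath='metadata.json')

# Pass the SingleTableMetadata object itself to the synthesizer.
synthesizer = CTGANSynthesizer(
    metadata,
    epochs=500,
    verbose=True)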
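
The fix works because metadata is now a SingleTableMetadata object. The auto-detect route should work the same way as long as the return value of detect_from_dataframe is not assigned back to the variable: the 'NoneType' error suggests that, in the SDV version used here, the method fills in the object in place and returns None. A minimal sketch follows. The helper names require_metadata_object and detect_metadata are hypothetical and not part of SDV.

```python
def require_metadata_object(metadata):
    """Fail early if `metadata` is not something a synthesizer can validate.

    Hypothetical helper: it only checks for a callable `validate` attribute,
    which is what both tracebacks in this thread complain about.
    """
    if not callable(getattr(metadata, "validate", None)):
        raise TypeError(
            f"expected a metadata object, got {type(metadata).__name__}"
        )
    return metadata


def detect_metadata(dataframe):
    """Detect metadata from a dataframe and return the metadata object."""
    # Imported lazily so the guard above can be used without SDV installed.
    from sdv.metadata import SingleTableMetadata

    metadata = SingleTableMetadata()
    # Assumption: detect_from_dataframe updates `metadata` in place and
    # returns None, so its result is deliberately not assigned.
    metadata.detect_from_dataframe(dataframe)
    return require_metadata_object(metadata)
```

With this in place, CTGANSynthesizer(detect_metadata(train), epochs=500, verbose=True) would receive the object rather than None, and passing None or a JSON string to the guard raises a TypeError that names the offending type.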

fyy623 closed this as completed May 11, 2024

npatki commented May 13, 2024

Hi @fyy623, glad you were able to fix it. Feel free to file a new issue if you run into anything else.

BTW I believe the issue was that you were passing the metadata dictionary into CTGANSynthesizer. All SDV synthesizers expect you to pass in a SingleTableMetadata object, not a dictionary. For more info, you can see the resources below:

npatki added the resolution:WAI label and removed the new label May 13, 2024