Boolean data type still not working, improve test #270

Closed
vankesteren opened this issue Mar 4, 2024 · 4 comments · Fixed by #273

Comments

@vankesteren
Member

After #260 we thought boolean data types were fully supported. However, this is not yet the case: something goes wrong with the data type when running pl.Series(values) where values is a list[numpy.bool_]. This should be fixed, and we should build an integration test around a debug dataset that contains many different data types. The debug dataset should be added to the demo datasets already built into our package.

reprex:

import polars as pl
from metasyn import MetaFrame, demo_file

fp = demo_file("spaceship")
df = pl.read_csv(fp, dtypes={"HomePlanet": pl.Categorical})[:,0:3]
df.schema

Results in the following data types:

OrderedDict([('PassengerId', Utf8),
             ('HomePlanet', Categorical),
             ('CryoSleep', Boolean)])

However, when we fit and synthesize a model:

mf = MetaFrame.fit_dataframe(df)
sf = mf.synthesize()
sf.schema

We get the wrong data types:

OrderedDict([('PassengerId', Utf8),
             ('HomePlanet', Categorical),
             ('CryoSleep', Int64)]) <<<<<<<<<< ERROR!!

Even though the model seems to be correct:

# Rows: 8693
# Columns: 3

Column 1: "PassengerId"
- Variable Type: string
- Data Type: Utf8
- Proportion of Missing Values: 0.0000
- Distribution:
        - Type: core.regex
        - Provenance: builtin
        - Parameters:
                - regex: [0-9]{4}_0[0-9]


Column 2: "HomePlanet"
- Variable Type: categorical
- Data Type: Categorical
- Proportion of Missing Values: 0.0231
- Distribution:
        - Type: core.multinoulli
        - Provenance: builtin
        - Parameters:
                - labels: ['Earth' 'Europa' 'Mars']
                - probs: [0.54192181 0.25094206 0.20713613]


Column 3: "CryoSleep"
- Variable Type: categorical
- Data Type: Boolean
- Proportion of Missing Values: 0.0250
- Distribution:
        - Type: core.multinoulli
        - Provenance: builtin
        - Parameters:
                - labels: [False  True]
                - probs: [0.6416942 0.3583058]
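
The mismatch above is the kind of thing the proposed integration test could catch mechanically. Below is a minimal sketch of such a check. The helper name `schema_mismatches` is hypothetical, and the data types are written as plain strings copied from the two schema printouts; in a real test they would presumably come from `df.schema` and `sf.schema`.

```python
from __future__ import annotations


def schema_mismatches(
    expected: dict[str, str], actual: dict[str, str]
) -> dict[str, tuple[str, str | None]]:
    """Return {column: (expected_dtype, actual_dtype)} for each differing column.

    Hypothetical helper for illustration; not part of metasyn or polars.
    """
    return {
        name: (dtype, actual.get(name))
        for name, dtype in expected.items()
        if actual.get(name) != dtype
    }


# Data types copied from the schema printouts in this issue, as plain strings.
original = {"PassengerId": "Utf8", "HomePlanet": "Categorical", "CryoSleep": "Boolean"}
synthesized = {"PassengerId": "Utf8", "HomePlanet": "Categorical", "CryoSleep": "Int64"}

mismatches = schema_mismatches(original, synthesized)
print(mismatches)  # {'CryoSleep': ('Boolean', 'Int64')}
```

An integration test would then assert that the returned dictionary is empty for every column of the debug dataset.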

We should also be careful with serialization to JSON. I'm not sure whether this works correctly now, so it should be part of our integration test too.
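
As a rough illustration of what the JSON part of that test could assert, here is a sketch using only the standard library `json` module. It does not use metasyn's own serialization code, and the `params` dictionary is just the CryoSleep parameters from the printout above, written out by hand with built-in Python booleans. NumPy scalar types such as `numpy.bool_` are not covered by this sketch.

```python
import json

# Parameters copied from the CryoSleep column in the model printout above,
# using built-in Python bools (an assumption; the fitted model may hold
# NumPy values instead).
params = {"labels": [False, True], "probs": [0.6416942, 0.3583058]}

# Round trip: Python object -> JSON text -> Python object.
restored = json.loads(json.dumps(params))

# JSON has native true/false, so the labels should come back as bools
# rather than as the integers 0 and 1.
labels_are_bool = all(isinstance(label, bool) for label in restored["labels"])
print(labels_are_bool, restored == params)  # True True
```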

@qubixes
Member

qubixes commented Mar 5, 2024

@vankesteren I'm trying to reproduce the issue, but I get the correct results. Which version of polars are you on?

@vankesteren
Member Author

Alright, nice, that was it. I was on 0.19.3, I think; moving to >0.20 fixed things. Should we update the minimum version in our dependencies?

@qubixes
Member

qubixes commented Mar 5, 2024

Sure, let's just put the minimum at the current version.
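
For illustration only, the effect of such a minimum can be sketched as a simple version comparison. The function names are hypothetical, and `"0.20.0"` is an assumed stand-in for "the current version", which the thread does not spell out. The parser only handles plain dotted release strings with the same number of components; a real project would more likely rely on the dependency specification itself, or on `packaging.version.Version`.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a plain dotted release string such as '0.19.3' into (0, 19, 3)."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Check whether `installed` is at least `minimum` (hypothetical helper)."""
    return parse_version(installed) >= parse_version(minimum)


# "0.20.0" is an assumed minimum used only for this example.
print(meets_minimum("0.19.3", "0.20.0"))  # False: the version that showed the bug
print(meets_minimum("0.20.13", "0.20.0"))  # True
```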

@vankesteren
Member Author

After we merge #273 this issue will be closed. I have created a new issue for the integration test: #274

qubixes pushed a commit that referenced this issue Mar 6, 2024