AttributeError on UniqueCombinations constraint with non-strings #196

Closed
LihuaXiong2020 opened this issue Sep 17, 2020 · 4 comments · Fixed by #481
LihuaXiong2020 commented Sep 17, 2020

  • SDV version: 0.4.0
  • Python version: 3.6.8
  • Operating System: Windows

Description & What I did

  • Defined certain columns of the dataframe as categorical in the metadata class
  • Specified UniqueCombinations constraint based on those columns
  • Passed in the constraint to SDV with GaussianCopula
  • Called fit() on the model and got "AttributeError: Can only use .str accessor with string values...", which traces back to line 98 (the _valid_separator function) of sdv/constraints/tabular.py

Reproduce

```python
import pandas as pd
from sdv.constraints import UniqueCombinations
from sdv.tabular import GaussianCopula

df = pd.DataFrame({"cat_a": [1, 2, 3], "cat_b": [4, 5, 6], "value": [0.5, 1.0, 1.5]})
unique_comb_segments = UniqueCombinations(
    columns=[
        "cat_a",
        "cat_b",
    ],
    handling_strategy="transform",
)
model = GaussianCopula(constraints=[unique_comb_segments])
model.fit(df)  # raises the AttributeError shown below
```

Error:

```
AttributeError                            Traceback (most recent call last)
in
      8 )
      9 model = GaussianCopula(constraints=[unique_comb_segments])
---> 10 model.fit(df)

~/opt/anaconda3/envs/python3b/lib/python3.7/site-packages/sdv/tabular/base.py in fit(self, data)
    100 """
    101 if not self._metadata_fitted:
--> 102 self._metadata.fit(data)
    103
    104 self._num_rows = len(data)

~/opt/anaconda3/envs/python3b/lib/python3.7/site-packages/sdv/metadata/table.py in fit(self, data)
    446 data = self._anonymize(data)
    447
--> 448 data = self._fit_transform_constraints(data)
    449 self._fit_hyper_transformer(data)
    450 self.fitted = True

~/opt/anaconda3/envs/python3b/lib/python3.7/site-packages/sdv/metadata/table.py in _fit_transform_constraints(self, data)
    330 self._constraints[idx] = constraint
    331
--> 332 data = constraint.fit_transform(data)
    333
    334 return data

~/opt/anaconda3/envs/python3b/lib/python3.7/site-packages/sdv/constraints/base.py in fit_transform(self, table_data)
    124 Transformed data.
    125 """
--> 126 self.fit(table_data)
    127 return self.transform(table_data)
    128

~/opt/anaconda3/envs/python3b/lib/python3.7/site-packages/sdv/constraints/tabular.py in fit(self, table_data)
    119 """
    120 self._separator = '#'
--> 121 while not self._valid_separator(table_data):
    122 self._separator += '#'
    123

~/opt/anaconda3/envs/python3b/lib/python3.7/site-packages/sdv/constraints/tabular.py in _valid_separator(self, table_data)
     96 """
     97 for column in self._columns:
---> 98 if table_data[column].str.contains(self._separator).any():
     99 return False
    100

~/opt/anaconda3/envs/python3b/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5173 or name in self._accessors
   5174 ):
-> 5175 return object.__getattribute__(self, name)
   5176 else:
   5177 if self._info_axis._can_hold_identifiers_and_holds_name(name):

~/opt/anaconda3/envs/python3b/lib/python3.7/site-packages/pandas/core/accessor.py in __get__(self, obj, cls)
    173 # we're accessing the attribute of the class, i.e., Dataset.geo
    174 return self._accessor
--> 175 accessor_obj = self._accessor(obj)
    176 # Replace the property with the accessor object. Inspired by:
    177 # http://www.pydanny.com/cached-property.html

~/opt/anaconda3/envs/python3b/lib/python3.7/site-packages/pandas/core/strings.py in __init__(self, data)
   1915
   1916 def __init__(self, data):
-> 1917 self._inferred_dtype = self._validate(data)
   1918 self._is_categorical = is_categorical_dtype(data)
   1919

~/opt/anaconda3/envs/python3b/lib/python3.7/site-packages/pandas/core/strings.py in _validate(data)
   1965
   1966 if inferred_dtype not in allowed_types:
-> 1967 raise AttributeError("Can only use .str accessor with string " "values!")
   1968 return inferred_dtype
   1969

AttributeError: Can only use .str accessor with string values!
```

csala (Contributor) commented Sep 18, 2020

Thanks for reporting this @LihuaXiong2020

I think that the problem is not really the categorical data in general but just the categorical data made of integer values, so the title might be a bit misleading.

Would you mind editing the title to something like: "AttributeError when using UniqueCombinations constraint with integer values"?

It would also be helpful if you could post a short snippet of code showing how to reproduce the error.

csala added the bug label on Sep 18, 2020
csala changed the milestone from 0.4.1 to 0.4.2 on Sep 18, 2020
LihuaXiong2020 changed the title from "Categorical data incompatible with UniqueCombinations" to "AttributeError when using UniqueCombinations constraint with integer values" on Sep 18, 2020
LihuaXiong2020 (Author) commented

Hi @csala, I think it's not just integers: I converted the integers into categoricals, and it appears UniqueCombinations can only work with strings. Would it be possible to extend it to cover other dtypes?

Sure, I'll try to construct a reproducible example.
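
To make the request above concrete, here is a minimal sketch of a separator check that also accepts non-string columns. It is not the SDV implementation or the fix that eventually closed this issue; `valid_separator` is a hypothetical standalone function that mirrors the logic visible in the traceback and casts each column to `str` before using the `.str` accessor.

```python
import pandas as pd


def valid_separator(table_data, columns, separator):
    """Return True if no value in the given columns contains the separator.

    Hypothetical helper: casts every column to str first, so the .str
    accessor also works on integer, float or categorical columns.
    """
    for column in columns:
        as_text = table_data[column].astype(str)
        # regex=False treats the separator as a literal substring.
        if as_text.str.contains(separator, regex=False).any():
            return False
    return True


df = pd.DataFrame({"cat_a": [1, 2, 3], "cat_b": [4, 5, 6]})
print(valid_separator(df, ["cat_a", "cat_b"], "#"))  # True

df_with_hash = pd.DataFrame({"cat_a": ["x#y", "z"], "cat_b": [4, 5]})
print(valid_separator(df_with_hash, ["cat_a", "cat_b"], "#"))  # False
```

This only removes the AttributeError during the check; it does nothing to restore the original dtypes when the combined column is split again.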

csala changed the title from "AttributeError when using UniqueCombinations constraint with integer values" to "AttributeError on UniqueCombinations constraint with non-strings" on Sep 18, 2020

csala (Contributor) commented Sep 18, 2020

Oh, yes, that is actually what I meant: values that are not strings, regardless of the type they have in the metadata.
I have updated the title again to reflect that.

This will be a tricky one, because even if we convert the values into strings on the fly inside the constraint, if we have mixed types it will be hard to keep track of what the original type was.

For example, if we have a column that contains two categories with different dtypes, like ["a", 1], converting the 1 into a string when we combine this column with the other one will be easy, but knowing that we have to cast the "1" back to a 1 when we split the columns again will be harder. And, going even further, we might have something like ["1", 1] (so, integers and their string representation mixed). We will have to think carefully about how to keep track of this!
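
The round-trip problem described above can be shown in plain Python. This is only an illustrative sketch: `combine`, `split` and `build_lookup` are hypothetical helpers, not SDV functions, and the lookup table is just one possible way to keep track of the original values, not necessarily the approach taken in the eventual fix.

```python
SEPARATOR = "#"


def combine(row):
    """Join the values of one row into a single string."""
    return SEPARATOR.join(str(value) for value in row)


def split(combined):
    """Split a combined string back into its parts (always strings)."""
    return tuple(combined.split(SEPARATOR))


# The integer 1 comes back as the string "1": the original type is lost.
print(split(combine(("a", 1))))  # ('a', '1')

# "1" and 1 produce the same combined string, so they cannot be told apart.
print(combine(("1", "x")) == combine((1, "x")))  # True


def build_lookup(rows):
    """Map each distinct combination to an opaque key and back.

    The original tuples are stored untouched, so no casting is needed
    when reversing the transformation.
    """
    to_key, to_row = {}, {}
    for row in rows:
        if row not in to_key:
            key = f"combo_{len(to_key)}"
            to_key[row] = key
            to_row[key] = row
    return to_key, to_row


to_key, to_row = build_lookup([("1", "x"), (1, "x")])
print(to_row[to_key[(1, "x")]])  # (1, 'x')
```

One caveat with this sketch: Python treats `1`, `1.0` and `True` as equal dictionary keys, so combinations that differ only in those values would still be merged.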

LihuaXiong2020 (Author) commented Sep 18, 2020

Hi @csala, I reproduced it in Python 3.7, but the behavior is the same as in Python 3.6.8. I also left out the specification of the Categorical type through the metadata, since the error is the same with plain integers.

csala changed the milestone from 0.4.2 to 0.4.3 on Sep 19, 2020
csala changed the milestone from 0.4.3 to 0.4.4 on Sep 28, 2020
csala changed the milestone from 0.4.4 to 0.4.5 on Oct 6, 2020
csala changed the milestone from 0.4.5 to 0.4.6 on Oct 16, 2020
csala removed this from the 0.4.6 milestone on Nov 23, 2020
On Jun 25, 2021, katxiao linked a pull request that will close this issue
csala added this to the 0.11.0 milestone on Jun 29, 2021