RegexGenerator gives a confusing message: # of possibilities are shown as an imaginary number #748

Closed
npatki opened this issue Jan 3, 2024 · 0 comments · Fixed by #754

npatki commented Jan 3, 2024

Environment Details

  • RDT version: 1.9.0

Error Description

In some cases, the RegexGenerator may be asked to create more values than the regex has unique possibilities. In this case, it produces a warning indicating that the IDs may repeat. The message also includes the total number of unique possibilities. This total is formatted as a complex number (with an imaginary part of 0j) instead of an integer, which is confusing to users.

In the example below, the RegexGenerator correctly concludes that there are 6 possibilities, yet it represents the number as 6+0j.

/usr/local/lib/python3.10/dist-packages/rdt/transformers/text.py:164: UserWarning:
The data has 10 rows but the regex for 'ID' can only create (6+0j) unique values.
Some values in 'ID' may be repeated.
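
The `(6+0j)` in the message above matches how Python renders a complex value inside a string. Whether RDT stores the count as a complex number internally is an assumption here; the report does not confirm it. A minimal sketch of the symptom and one way to render the value as a plain integer, using a hypothetical helper `format_count` that is not part of RDT:

```python
def format_count(count):
    """Render a possibility count as a plain integer string.

    Hypothetical helper for illustration only; not part of RDT.
    """
    if isinstance(count, complex):
        # Refuse to silently drop a real imaginary component.
        if count.imag != 0:
            raise ValueError(f'count has a non-zero imaginary part: {count!r}')
        count = count.real
    return str(int(count))


raw = complex(6, 0)

# Formatting the complex value directly reproduces the confusing text.
print(f'can only create {raw} unique values')

# Converting first gives the integer form requested in this issue.
print(f'can only create {format_count(raw)} unique values')
```

The helper raises instead of truncating when the imaginary part is non-zero, since that would point to a different problem than formatting.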

Expected Behavior

  1. The message should be logged at the info level instead of being raised as a warning. There is nothing concerning about duplicated regex values when the uniqueness enforcement flag is off (enforce_uniqueness=False, the default).
  2. The number should be represented as an integer instead of a complex number, and it should not be inside parentheses (see below).
Info log:
The data has 10 rows but the regex for 'ID' can only create 6 unique values.
Some values in 'ID' may be repeated.
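
The two requests above could be sketched with the standard `logging` module as follows. This is only an illustration under stated assumptions: the helper `report_capacity` and the logger name are hypothetical, and this is not how RDT implements the fix.

```python
import logging

# Hypothetical logger name, chosen for this sketch only.
LOGGER = logging.getLogger('regex_capacity_demo')


def report_capacity(column, num_rows, num_possible):
    """Log an info message when rows outnumber the unique regex values.

    Hypothetical helper; assumes num_possible is a real (non-complex) number.
    Returns the emitted message, or None when nothing needs reporting.
    """
    if num_rows <= num_possible:
        return None

    message = (
        f"The data has {num_rows} rows but the regex for '{column}' "
        f"can only create {int(num_possible)} unique values.\n"
        f"Some values in '{column}' may be repeated."
    )
    # Info level rather than warnings.warn: repeats are expected here.
    LOGGER.info(message)
    return message


logging.basicConfig(level=logging.INFO)
report_capacity('ID', num_rows=10, num_possible=6)
```

Using a logger instead of `warnings.warn` lets users who do not care about repeated values silence the message through their logging configuration.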

Steps to reproduce

import pandas as pd

from rdt import get_demo
from rdt import HyperTransformer
from rdt.transformers.text import RegexGenerator


# Demo data plus an ID column with one value per row.
data = get_demo()
data['ID'] = ['id_0', 'id_1', 'id_2', 'id_3', 'id_4']

ht = HyperTransformer()
ht.detect_initial_config(data=data)

# The regex can only produce 6 distinct values: id_0 through id_5.
ht.update_sdtypes({'ID': 'text'})
ht.update_transformers({'ID': RegexGenerator(regex_format='id_[0-5]{1}', enforce_uniqueness=False)})

transformed = ht.fit_transform(data)
# Double the rows to 10 so they outnumber the 6 possible IDs.
double_transformed = pd.concat([transformed, transformed], axis=0).reset_index(drop=True)

# Triggers the warning shown above.
ht.reverse_transform(double_transformed)

Related Issue

This was first observed in SDV Issue 1729.

@npatki npatki added the bug Something isn't working label Jan 3, 2024
@npatki npatki changed the title from "Regex formatter's gives a confusing message: # of possibilities are shown as an imaginary number" to "RegexGenerator gives a confusing message: # of possibilities are shown as an imaginary number" Jan 3, 2024
@amontanez24 amontanez24 added this to the 1.9.2 milestone Feb 13, 2024