Allow columns to not have a transformer #473

Closed
npatki opened this issue Apr 6, 2022 · 0 comments
Labels
feature request Request for a new feature

npatki commented Apr 6, 2022

Problem Description

It should be possible to skip the transformation of select columns by specifying that no transformation is needed in the config.

Expected behavior

  • Use None in place of a transformer object to specify that no transformer is needed for the column
  • Add new methods remove_transformers and remove_transformers_by_sdtype for better usability in removing transformers from the config
  • When a column is marked as None, the column is carried over as-is during the transform call; no changes in value or column name
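The pass-through behavior described in the last bullet can be sketched without RDT itself. This is only a minimal illustration, assuming a config is a plain dict mapping each column name to a callable or to None; `apply_config` and `scale_by_ten` are hypothetical names and are not part of the RDT API.

```python
def scale_by_ten(values):
    """Stand-in for a real transformer (hypothetical, for illustration only)."""
    return [value * 10 for value in values]


def apply_config(data, config):
    """Transform each column with its configured callable.

    Columns mapped to None are copied over unchanged, under the same name.
    """
    transformed = {}
    for column, values in data.items():
        transformer = config.get(column)
        if transformer is None:
            # No transformer: carry the column over as-is.
            transformed[column] = list(values)
        else:
            transformed[column] = transformer(values)
    return transformed


data = {'column_A': [1, 2, 3], 'column_B': ['x', 'y', 'z']}
config = {'column_A': scale_by_ten, 'column_B': None}
result = apply_config(data, config)
```

Here `column_B` keeps both its name and its values, while `column_A` goes through its transformer.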

Additional context

This change will touch many methods.

set_config: Allow the use of None in place of a transformer object

from rdt import HyperTransformer
import rdt.transformers as rt

ht = HyperTransformer()
ht.set_config(config={
  'column_A': rt.FloatFormatter(missing_value_replacement=0.00),
  'column_B': None, # don't transform this column
  'column_C': rt.FrequencyEncoder()
})

update_transformers and update_transformers_by_sdtype: Also allow the use of None, though this is not the preferred approach

ht = HyperTransformer()
ht.detect_initial_config(data)

# the following methods will work but will not be mentioned in the docs
ht.update_transformers(column_name_to_transformer={
  'column_B': None
})

ht.update_transformers_by_sdtype(sdtype='categorical', transformer=None)
transformed = ht.fit_transform(data)

remove_transformers and remove_transformers_by_sdtype: The preferred way to remove transformers from the config

ht = HyperTransformer()
ht.detect_initial_config(data)

# remove transformers for the given list of column names
ht.remove_transformers(column_names=['column_B'])

# remove the transformers for everything in the given sdtype
ht.remove_transformers_by_sdtype(sdtype='categorical')

transformed = ht.fit_transform(data)
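One way to picture what the two removal methods do to the config is sketched below. It assumes the config has the same `sdtypes` / `transformers` shape that `get_config` returns in this issue, and it uses strings as placeholders for transformer objects. The standalone functions are hypothetical helpers, not the proposed `HyperTransformer` methods themselves.

```python
def remove_transformers(config, column_names):
    """Set the transformer to None for each named column (hypothetical helper)."""
    for name in column_names:
        config['transformers'][name] = None


def remove_transformers_by_sdtype(config, sdtype):
    """Set the transformer to None for every column of the given sdtype."""
    for name, column_sdtype in config['sdtypes'].items():
        if column_sdtype == sdtype:
            config['transformers'][name] = None


# Placeholder strings stand in for real transformer objects.
config = {
    'sdtypes': {
        'column_A': 'float',
        'column_B': 'datetime',
        'column_C': 'categorical',
    },
    'transformers': {
        'column_A': 'FloatFormatter()',
        'column_B': 'UnixTimestampEncoder()',
        'column_C': 'FrequencyEncoder()',
    },
}

remove_transformers(config, column_names=['column_B'])
remove_transformers_by_sdtype(config, sdtype='categorical')
```

After both calls only `column_A` still has a transformer; the sdtypes are left untouched.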

get_config: Return (and print out) None for the appropriate transformer(s)

{
  'sdtypes': {
    'column_A': 'float',
    'column_B': 'datetime',
    'column_C': 'categorical'
  },
  'transformers': {
    'column_A': FloatFormatter(missing_value_replacement=0.00),
    'column_B': None,
    'column_C': FrequencyEncoder()
  }
}

User Validation

1. Invalid sdtype

Check the sdtype the user passes into remove_transformers_by_sdtype

ht.remove_transformers_by_sdtype(sdtype='unknown_type')
Error: Invalid sdtype '<name>'. If you are trying to use a premium sdtype, contact info@sdv.dev about RDT Add-Ons.

2. Invalid column name

ht.remove_transformers(column_names=['column_A', 'unknown_column'])
Error: Invalid column names: ['unknown_column'...]. These columns do not exist in the config. Use 'get_config()' to see the expected values
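The first two validation checks could look roughly like the sketch below. The error messages follow the wording in this issue, but the exception type (`ValueError`), the `SUPPORTED_SDTYPES` set, and the function names are assumptions made for illustration. The issue does not specify them, and this version lists every unknown column rather than truncating the list.

```python
# Assumed set of sdtypes, taken from the example config in this issue.
SUPPORTED_SDTYPES = {'float', 'datetime', 'categorical'}


def validate_sdtype(sdtype):
    """Raise if the sdtype is not one of the supported values."""
    if sdtype not in SUPPORTED_SDTYPES:
        raise ValueError(
            f"Invalid sdtype '{sdtype}'. If you are trying to use a premium "
            'sdtype, contact info@sdv.dev about RDT Add-Ons.'
        )


def validate_column_names(config, column_names):
    """Raise if any of the given columns is missing from the config."""
    unknown = [name for name in column_names if name not in config['transformers']]
    if unknown:
        raise ValueError(
            f'Invalid column names: {unknown}. These columns do not exist in '
            "the config. Use 'get_config()' to see the expected values."
        )


config = {'transformers': {'column_A': 'FloatFormatter()'}}

try:
    validate_column_names(config, ['column_A', 'unknown_column'])
except ValueError as error:
    column_error = str(error)

try:
    validate_sdtype('unknown_type')
except ValueError as error:
    sdtype_error = str(error)
```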

3. Sdtype & transformer are not compatible

Ignore this case. It does not apply to these functions.

4. User tries to use either of these functions after already calling fit

ht = HyperTransformer()
ht.detect_initial_config(data)
ht.fit(data)

ht.remove_transformers(column_names=['column_A'])
Warning: For this change to take effect, please refit your data using 'fit' or 'fit_transform'.

ht.remove_transformers_by_sdtype(sdtype='categorical')
Warning: For this change to take effect, please refit your data using 'fit' or 'fit_transform'.
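A rough sketch of how the post-fit warning might be raised is shown below. `ConfigHolder` is a hypothetical stand-in for `HyperTransformer`, and tracking the fitted state with a `_fitted` flag is an assumption; only the warning text comes from this issue.

```python
import warnings

REFIT_MESSAGE = (
    "For this change to take effect, please refit your data using "
    "'fit' or 'fit_transform'."
)


class ConfigHolder:
    """Hypothetical stand-in that only tracks the config and the fitted state."""

    def __init__(self, transformers):
        self.transformers = dict(transformers)
        self._fitted = False

    def fit(self, data):
        # A real implementation would fit each transformer on the data here.
        self._fitted = True

    def remove_transformers(self, column_names):
        for name in column_names:
            self.transformers[name] = None
        if self._fitted:
            # The config changed after fitting, so the fitted state is stale.
            warnings.warn(REFIT_MESSAGE)


holder = ConfigHolder({'column_A': 'FloatFormatter()'})
holder.fit(data=None)

# Capture the warning so it can be inspected instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    holder.remove_transformers(['column_A'])
```

Calling `remove_transformers` before `fit` changes the config silently, since there is no fitted state to invalidate yet.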