Problem Description
As of SDV 1.8.0, the metadata can auto-detect semantic (PII) sdtypes such as vin (vehicle identification number) or administrative_unit (a state/province). It does this by checking whether the column name contains certain substrings, such as 'vin' or 'state'.
Unfortunately this can lead to unexpected results:
A column named resolving_loans would be identified as sdtype 'vin' because the word resolving contains the substring 'vin'.
A column named RealEstateLoans would be identified as sdtype 'administrative_unit' because the word Estate contains the substring 'state' (which is a type of administrative region).
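The false positives above can be reproduced with a minimal sketch of substring-based detection. The keyword table and the function name `detect_by_substring` are hypothetical and exist only for illustration; they are not SDV's actual internals.

```python
# Hypothetical keyword table for illustration; not SDV's actual implementation.
KEYWORD_TO_SDTYPE = {"vin": "vin", "state": "administrative_unit"}


def detect_by_substring(column_name):
    """Return the first sdtype whose keyword appears anywhere in the name."""
    lowered = column_name.lower()
    for keyword, sdtype in KEYWORD_TO_SDTYPE.items():
        if keyword in lowered:  # plain substring test, no word boundaries
            return sdtype
    return None


print(detect_by_substring("resolving_loans"))  # vin (false positive)
print(detect_by_substring("RealEstateLoans"))  # administrative_unit (false positive)
```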
Expected behavior
Instead of checking to see if any substring matches the keywords, the metadata auto-detection script should tokenize the column names first. Then, it should check for exact matches within the tokenized words.
We can tokenize names that contain underscores or camel-case letters. Consider the above examples:
Column resolving_loans would be tokenized into ['resolving', 'loans']. None of these words exactly match the keyword vin so the sdtype cannot be vin.
Column RealEstateLoans would be tokenized into ['real', 'estate', 'loans']. None of these words exactly match the keyword state so the sdtype cannot be administrative_unit.
On the other hand:
Column vin_number would be tokenized into ['vin', 'number'], which contains an exact match for 'vin'.
Column StateDepartment would be tokenized into ['state', 'department'], which contains an exact match for 'state'.
Additional context
We should be careful with camel-case:
All-caps words should be kept as a single token rather than split at each capital letter (EXAMPLE)
All-caps words with underscores should be tokenized at the underscores (EXAMPLE_COLUMN --> ['example', 'column'])
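One possible way to implement the proposed tokenization, including the all-caps cases, is sketched below. This is only an illustration under stated assumptions: the names `tokenize` and `detect_by_token`, the keyword table, and the regular expression are hypothetical choices, not SDV's actual code.

```python
import re

# Hypothetical keyword table for illustration; not SDV's actual implementation.
KEYWORD_TO_SDTYPE = {"vin": "vin", "state": "administrative_unit"}

# Alternatives, tried in order:
#   1. a run of capitals not followed by a lowercase letter (EXAMPLE, COLUMN)
#   2. an optionally capitalized lowercase word (Estate, loans)
#   3. a run of digits
# Underscores and other separators match nothing, so they act as delimiters.
_TOKEN_PATTERN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def tokenize(column_name):
    """Split a snake_case, camelCase or ALL_CAPS name into lowercase tokens."""
    return [token.lower() for token in _TOKEN_PATTERN.findall(column_name)]


def detect_by_token(column_name):
    """Return an sdtype only if a keyword exactly matches a whole token."""
    tokens = set(tokenize(column_name))
    for keyword, sdtype in KEYWORD_TO_SDTYPE.items():
        if keyword in tokens:
            return sdtype
    return None


print(tokenize("resolving_loans"))         # ['resolving', 'loans']
print(tokenize("RealEstateLoans"))         # ['real', 'estate', 'loans']
print(tokenize("EXAMPLE"))                 # ['example']
print(tokenize("EXAMPLE_COLUMN"))          # ['example', 'column']
print(detect_by_token("resolving_loans"))  # None
print(detect_by_token("vin_number"))       # vin
print(detect_by_token("StateDepartment"))  # administrative_unit
```

With exact token matching, the two false positives from the problem description no longer trigger, while vin_number and StateDepartment are still detected.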