Bag of words - what is the delimiter? #129

Open
Overload119 opened this issue Dec 20, 2022 · 3 comments

Comments

@Overload119

Consider a table:

| target | words |
| --- | --- |
| 1 | This, That, And The Other |
| 0 | This |
| 1 | And The Other, That |

Am I using the commas to infer the bag of words correctly?

@isabella
Contributor

isabella commented Dec 20, 2022

The tokenizer will tokenize the string in the following way:

| words | tokens |
| --- | --- |
| This, That, And the Other | this , that , and the other |

The tokenizer does not split the text on commas; each comma becomes a token of its own.
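
To make the row above concrete, here is a rough Python approximation of that behavior. It is not ModelFox's actual tokenizer; the function name `approximate_tokens` and the regular expression are assumptions chosen only to reproduce the tokens shown in the table (lowercased words, with punctuation as separate tokens).

```python
import re

def approximate_tokens(text: str) -> list[str]:
    # Illustrative only: lowercase the text, then emit each run of word
    # characters and each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(approximate_tokens("This, That, And the Other"))
# ['this', ',', 'that', ',', 'and', 'the', 'other']
```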

If you want three tokens instead (This, That, And The Other), I suggest preprocessing that column and passing in data that has already been feature engineered.

@Overload119
Author

Do you have an example of how that would work?
How can I pass text in any other way in the column?

@isabella
Contributor

You would need to preprocess your CSV using another tool. Alternatively, you can use an enum column by using a custom config file as described here: https://www.modelfox.dev/docs/guides/train_with_custom_configuration.

In the example linked above, the "chest_pain" column is specified as type "enum" with four variants.

{
  "dataset": {
    "columns": [
      {
        "name": "chest_pain",
        "type": "enum",
        "variants": [
          "asymptomatic",
          "atypical angina",
          "non-angina pain",
          "typical angina"
        ]
      },
      ...
    ]
  }
}

For your dataset, you would specify that the words column is an enum with 3 variants: "This", "That", "And The Other".
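
As a sketch, a config with that shape can be generated in Python. The column name `words` and the three variants come from the table in this issue, and the key layout mirrors the `chest_pain` example; everything else is an assumption. It also assumes each cell holds exactly one of the three values, which is not true of the raw table, so the column would likely need preprocessing first.

```python
import json

# Column name and variants are taken from the example table in this issue;
# the key layout copies the "chest_pain" example from the linked guide.
config = {
    "dataset": {
        "columns": [
            {
                "name": "words",
                "type": "enum",
                "variants": ["This", "That", "And The Other"],
            }
        ]
    }
}

config_text = json.dumps(config, indent=2)
print(config_text)  # save this output as config.json
```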

Then, use the config file by passing --config path/to/config.json on the CLI.
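
For the preprocessing route, one possible approach (an assumption, not something ModelFox prescribes) is to split the `words` column on commas yourself and write one 0/1 column per distinct item. The sketch below uses only the Python standard library; `expand_words` is a hypothetical helper, and it assumes the CSV quotes cells that contain commas.

```python
import csv
import io

def expand_words(csv_text: str, column: str = "words") -> str:
    """Replace a comma-separated text column with one 0/1 column per distinct item."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Split each cell on commas and trim whitespace around each item.
    items_per_row = [
        [item.strip() for item in row[column].split(",") if item.strip()]
        for row in rows
    ]
    # Collect distinct items in first-seen order so output columns are stable.
    distinct = list(dict.fromkeys(i for items in items_per_row for i in items))
    other_fields = [name for name in rows[0] if name != column] if rows else []
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=other_fields + distinct, lineterminator="\n")
    writer.writeheader()
    for row, items in zip(rows, items_per_row):
        record = {name: row[name] for name in other_fields}
        # 1 if the item appears in this row's cell, otherwise 0.
        record.update({item: int(item in items) for item in distinct})
        writer.writerow(record)
    return out.getvalue()

raw = 'target,words\n1,"This, That, And The Other"\n0,This\n1,"And The Other, That"\n'
print(expand_words(raw))
# target,This,That,And The Other
# 1,1,1,1
# 0,1,0,0
# 1,0,1,1
```

The resulting file has no free-text column left, so no tokenization is involved; how the new 0/1 columns are typed depends on the tool's inference or on a custom config.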
