Feature engineering #48

Open
fsx950223 opened this issue Sep 10, 2021 · 4 comments

fsx950223 commented Sep 10, 2021

I have a question about feature engineering.
Why do you use chars as inputs instead of words?
For example,

Hello world!
<tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'H', b'e', b'l', b'l', b'o', b' ', b'w', b'o', b'r', b'l', b'd',
       b'!'], dtype=object)>
ngrams: <tf.Tensor: shape=(11,), dtype=string, numpy=
array([b'H e', b'e l', b'l l', b'l o', b'o  ', b'  w', b'w o', b'o r',
       b'r l', b'l d', b'd !'], dtype=object)>

is better than

<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'Hello', b'world', b'!'], dtype=object)>
ngrams: <tf.Tensor: shape=(2,), dtype=string, numpy=array([b'Hello world', b'world !'], dtype=object)>

?
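
For reference, here is a minimal plain-Python sketch that reproduces the two tokenizations compared above. The tensors shown look like the output of `tf.strings.ngrams` with a space separator, but that is an assumption; `ngrams` and `word_tokens` below are hypothetical helpers written only for illustration.

```python
import re
from typing import List


def ngrams(tokens: List[str], width: int = 2, separator: str = " ") -> List[str]:
    """Join every run of `width` consecutive tokens with `separator`."""
    return [
        separator.join(tokens[i:i + width])
        for i in range(len(tokens) - width + 1)
    ]


def word_tokens(text: str) -> List[str]:
    """Split into words and standalone punctuation marks (hypothetical rule)."""
    return re.findall(r"\w+|[^\w\s]", text)


text = "Hello world!"

# Character-level: every character, including the space, is a token.
char_bigrams = ngrams(list(text))

# Word-level: words and punctuation are tokens.
word_bigrams = ngrams(word_tokens(text))

print(len(char_bigrams), char_bigrams)
print(len(word_bigrams), word_bigrams)
```

The character-level version yields 11 bigrams for this 12-character string, while the word-level version yields only 2, so the same input gives the model far fewer (but far more specific) features at word level.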

Author

fsx950223 commented Sep 25, 2021

In order to use a tflite model, you have to convert strings to token IDs, such as 'He' -> 1332.
@yoeo

Owner

yoeo commented Sep 27, 2021

Hi @fsx950223

> Why do you use chars as inputs instead of words?

In fact, I tested both chars and words with various preprocessing tricks and chose the one that gave the best predictions with the current model & training dataset.
If one day I switch to a new machine learning model or change the way I build the training dataset, I'll have to test the different preprocessing options again and choose the best one, and it could be "words" this time.

By the way, if you know any general rule about when to use chars or words for feature engineering, I'll be happy to learn and test it 🙂

yoeo added the question label Sep 27, 2021
Owner

yoeo commented Sep 27, 2021

> In order to use a tflite model, you have to convert strings to token IDs, such as 'He' -> 1332.

In theory, yes. You could probably use tflite by:

  1. hacking the trained model to take integer inputs instead of string ones
  2. extracting the string -> integer mappings from the model
  3. converting the hacked trained model (without the mappings) to tflite
  4. using the extracted mappings to convert your input strings into integer inputs
  5. sending the integer inputs to the new tflite model to generate predictions

I don't know if it will actually work, but if you find a way to make it work, please share the details here #26
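
Steps 2 and 4 above could look roughly like the following plain-Python sketch. Everything in it is an assumption made for illustration: the mapping is built by hand as a stand-in for whatever the trained model actually stores, and `OOV_ID`, `build_mapping` and `encode` are hypothetical names, not part of this project or of TensorFlow.

```python
from typing import Dict, List

OOV_ID = 0  # hypothetical ID reserved for strings missing from the mapping


def build_mapping(tokens: List[str]) -> Dict[str, int]:
    """Stand-in for step 2: assign IDs from 1 upward in order of first appearance.

    In practice the mapping would be extracted from the trained model instead.
    """
    mapping: Dict[str, int] = {}
    for token in tokens:
        if token not in mapping:
            mapping[token] = len(mapping) + 1
    return mapping


def encode(tokens: List[str], mapping: Dict[str, int]) -> List[int]:
    """Step 4: turn input strings into the integer inputs the model would take."""
    return [mapping.get(token, OOV_ID) for token in tokens]


mapping = build_mapping(["H e", "e l", "l l", "l o"])
ids = encode(["H e", "l l", "x y"], mapping)
print(ids)  # "x y" is not in the mapping, so it falls back to OOV_ID
```

The integer list would then be what gets sent to the converted model in step 5. Whether the real mapping is a lookup table like this one or something else (for example a hash function) depends on how the model was built.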

Author

fsx950223 commented Sep 28, 2021

To improve model performance, I recommend tf.keras.layers.TextVectorization + a FastText model, which is similar to the current model. For more details, take a look at https://www.tensorflow.org/text/guide/word_embeddings
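
To make the suggestion concrete, here is a rough, untrained sketch of a FastText-style forward pass in plain Python: tokens are hashed into buckets, their embeddings are averaged, and a linear layer with softmax produces class probabilities. It deliberately avoids TensorFlow so that it stays self-contained; in Keras the same shape would presumably be built from `TextVectorization`, `Embedding`, `GlobalAveragePooling1D` and `Dense`. All sizes, names and the random weights are assumptions for illustration, not this project's model.

```python
import math
import random
import zlib
from typing import List

NUM_BUCKETS = 1000  # hypothetical hashed-vocabulary size
EMBED_DIM = 8       # hypothetical embedding width
NUM_CLASSES = 3     # hypothetical number of output labels

rng = random.Random(0)
# Randomly initialised, untrained parameters (illustration only).
embeddings = [
    [rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
    for _ in range(NUM_BUCKETS)
]
weights = [
    [rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
    for _ in range(NUM_CLASSES)
]


def bucket(token: str) -> int:
    """Map a token to an embedding row with a deterministic hash."""
    return zlib.crc32(token.encode("utf-8")) % NUM_BUCKETS


def predict(tokens: List[str]) -> List[float]:
    """Average the token embeddings, apply a linear layer, then softmax."""
    if not tokens:
        raise ValueError("at least one token is required")
    ids = [bucket(token) for token in tokens]
    mean = [
        sum(embeddings[i][d] for i in ids) / len(ids)
        for d in range(EMBED_DIM)
    ]
    logits = [sum(w * x for w, x in zip(row, mean)) for row in weights]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Words plus word bigrams as input features.
probs = predict(["Hello", "world", "!", "Hello world", "world !"])
print(probs)
```

Because the weights are random, the probabilities are meaningless until the parameters are trained; the point is only the structure of the computation.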
