
A question of bow_vocabulary in w2_homework_part1 #113

Open
ZequnZ opened this issue Feb 10, 2023 · 0 comments
ZequnZ commented Feb 10, 2023

Greetings,

I am working on the week 2 homework (part 1) notebook and have a question about bow_vocabulary: the length of the bow_vocabulary I created differs from the length of the set of all tokens in the training set.
See the screenshot below:
[screenshot: the two lengths differ]

I created bow_vocabulary as follows: following the idea from week 1, I split each text on " " (space), which recovers the tokens produced by TweetTokenizer. I then count the occurrences of each token and keep only the top k tokens in the vocabulary.
[screenshot: vocabulary-building code]
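The steps above can be sketched as follows. This is a minimal illustration of the described approach, not the notebook's actual code (the function name `build_bow_vocabulary` is my own):

```python
from collections import Counter

def build_bow_vocabulary(texts, k):
    """Split each text on spaces (recovering the TweetTokenizer tokens),
    count token occurrences, and keep the k most frequent tokens."""
    counts = Counter()
    for text in texts:
        counts.update(text.split(" "))
    # most_common(k) returns the k highest-count (token, count) pairs
    return [token for token, _ in counts.most_common(k)]

texts = ["good movie good", "bad movie", "good acting"]
print(build_bow_vocabulary(texts, k=2))  # -> ['good', 'movie']
```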

When all tokens are put into a set, some tokens (str) are treated as identical, so the length decreases. I am wondering:

  1. Is my way of creating bow_vocabulary correct?
  2. My understanding is that we should keep and use the tokens from the tokenizer, so the vocabulary is built as: tokens -> vocabulary.
    However, I also understand that some strings might be meaningless (for example, symbol-only tokens) and could be merged together, like the set used in the notebook, making the pipeline: tokens -> pre-processing -> vocabulary.
    Could you shed some light on this? That would be super helpful!
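To make point 2 concrete, here is a small sketch of how a pre-processing step changes the vocabulary size. The symbol-only filter is my own hypothetical example of pre-processing, not necessarily what the notebook does:

```python
import string

def drop_symbol_only(tokens):
    """Hypothetical pre-processing step: drop tokens that consist of
    punctuation symbols only (e.g. "!!!" or "...")."""
    return [t for t in tokens
            if not all(ch in string.punctuation for ch in t)]

tokens = ["good", "!!!", "movie", "..."]

# tokens -> vocabulary: all 4 distinct tokens survive
print(len(set(tokens)))                    # -> 4

# tokens -> pre-processing -> vocabulary: symbol-only tokens removed
print(len(set(drop_symbol_only(tokens))))  # -> 2
```

This kind of filtering would explain why a vocabulary built directly from raw tokens is larger than one built after pre-processing.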

Thanks for taking the time to look at this; I look forward to your reply!
