Greetings,

I am working on the week 2 homework (part 1) notebook and have a question about `bow_vocabulary`: the length of the `bow_vocabulary` I created is different from the length of the set of all tokens in the training set.
See the screenshot below:

![image](https://user-images.githubusercontent.com/58477055/218123812-14124c51-af75-4dc0-ac42-69249e24347e.png)
The way I created `bow_vocabulary` is as follows:

![image](https://user-images.githubusercontent.com/58477055/218124076-2411af3a-9574-4849-a192-c3f56c338e19.png)
Basically, following the idea from week 1, I split the text on " " (space); those pieces are the tokens generated by `TweetTokenizer`. Then I count the occurrences of each token and keep only the top `k` words in the vocabulary.
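In text form, my procedure is roughly the sketch below (with placeholder names like `train_texts` and `k`, not the exact code from my notebook):

```python
from collections import Counter

def build_bow_vocabulary(train_texts, k):
    """Count space-separated tokens and keep the k most frequent ones."""
    counts = Counter()
    for text in train_texts:
        # the texts are already TweetTokenizer output joined by spaces,
        # so a plain split on " " recovers the tokens
        counts.update(text.split(" "))
    # most_common(k) returns (token, count) pairs in descending frequency order
    return [token for token, _ in counts.most_common(k)]
```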
When putting all tokens into a `set`, some tokens (`str`) end up being treated as the same element, so the length decreases.
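Concretely, the comparison that produces the mismatch looks roughly like this (again with placeholder names such as `train_texts`):

```python
# every token occurrence in the training set, using the same space split
all_tokens = [tok for text in train_texts for tok in text.split(" ")]

print(len(set(all_tokens)))   # number of distinct token strings
print(len(bow_vocabulary))    # size of my top-k vocabulary
```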
I am wondering:
Is my way of creating `bow_vocabulary` correct?
My understanding is that we should keep and use the tokens from the tokenizer as they are, so the creation of the vocabulary goes: tokens -> vocabulary.
However, I also understand that some strings might be meaningless (containing only symbols, for example) and we could merge or drop them, as with the `set` used in the notebook. In that case the creation of the vocabulary goes: tokens -> pre-processing -> vocabulary.
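To make that second option concrete, a rough sketch of such a pipeline could look like the code below; the symbol-only filter is just a hypothetical example of a pre-processing step, not necessarily what the notebook actually does:

```python
import re
from collections import Counter

def build_bow_vocabulary_with_preprocessing(train_texts, k):
    """Top-k vocabulary, but drop tokens that contain no word characters first."""
    counts = Counter()
    for text in train_texts:
        tokens = text.split(" ")
        # hypothetical pre-processing step: keep only tokens with at least
        # one letter, digit, or underscore, discarding symbol-only tokens
        tokens = [t for t in tokens if re.search(r"\w", t)]
        counts.update(tokens)
    return [token for token, _ in counts.most_common(k)]
```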
Could you shed some light on this? That would be super helpful!
Thanks for taking the time to look at this; I am looking forward to your reply!