Greetings,

I am working on the week 2 homework (part 1) notebook and have a question about `bow_vocabulary`: the length of the `bow_vocabulary` I created is different from the length of the set of all tokens in the training set.
See the screenshot below:

![image](https://user-images.githubusercontent.com/58477055/218123812-14124c51-af75-4dc0-ac42-69249e24347e.png)
The way I created `bow_vocabulary` is as follows:

![image](https://user-images.githubusercontent.com/58477055/218124076-2411af3a-9574-4849-a192-c3f56c338e19.png)
Basically, following the idea from week 1, I split the text on " " (space); those pieces are the tokens generated by `TweetTokenizer`. Then I count the occurrences of each token and keep only the top `k` words in the vocabulary.
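In text form, my procedure is roughly the sketch below (with placeholder names like `train_texts` and `k`, not the exact code from my notebook):

```python
from collections import Counter

def build_bow_vocabulary(train_texts, k):
    """Count space-separated tokens and keep the k most frequent ones."""
    counts = Counter()
    for text in train_texts:
        # the texts are already TweetTokenizer output joined by spaces,
        # so a plain split on " " recovers the tokens
        counts.update(text.split(" "))
    # most_common(k) returns (token, count) pairs in descending frequency order
    return [token for token, _ in counts.most_common(k)]
```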
When putting all tokens into a `set`, some tokens (`str`) end up being treated as the same element, so the length decreases.
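Concretely, the comparison that produces the mismatch looks roughly like this (again with placeholder names such as `train_texts`):

```python
# every token occurrence in the training set, using the same space split
all_tokens = [tok for text in train_texts for tok in text.split(" ")]

print(len(set(all_tokens)))   # number of distinct token strings
print(len(bow_vocabulary))    # size of my top-k vocabulary
```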
I am wondering:
Is my way of creating `bow_vocabulary` correct?
My understanding is that we should keep and use the tokens from the tokenizer as they are, so the creation of the vocabulary goes: tokens -> vocabulary.
However, I also understand that some strings might be meaningless (containing only symbols, for example) and we could merge or drop them, as with the `set` used in the notebook. In that case the creation of the vocabulary goes: tokens -> pre-processing -> vocabulary.
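To make that second option concrete, a rough sketch of such a pipeline could look like the code below; the symbol-only filter is just a hypothetical example of a pre-processing step, not necessarily what the notebook actually does:

```python
import re
from collections import Counter

def build_bow_vocabulary_with_preprocessing(train_texts, k):
    """Top-k vocabulary, but drop tokens that contain no word characters first."""
    counts = Counter()
    for text in train_texts:
        tokens = text.split(" ")
        # hypothetical pre-processing step: keep only tokens with at least
        # one letter, digit, or underscore, discarding symbol-only tokens
        tokens = [t for t in tokens if re.search(r"\w", t)]
        counts.update(tokens)
    return [token for token, _ in counts.most_common(k)]
```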
Could you shed some light on this? That would be super helpful!
Thanks for taking the time to look at this; I am looking forward to your reply!