Creating dictionary files #4
I don't think the dictionary quality will improve significantly if you use more data. I even limited the number of lines to process because SentencePiece itself recommends that. The lines taken for processing are sampled randomly, so it doesn't mean it will only take the first million lines of your file. To get the most efficient dictionary you should include the most common phrases in it, not just use a lot of data. Any words that don't make it into the dictionary will be encoded as individual characters. But then again, if a word is so rare it can't be encoded with 2-3 tokens, it's unlikely the model will ever use it anyway.
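For reference, the sampling behaviour described here maps onto SentencePiece's training options roughly as follows. This is a minimal sketch using the Python API; the file name, output prefix, vocab size and model type are placeholders, not taken from createspmodel.sh:

```python
import sentencepiece as spm

# Minimal sketch of training a SentencePiece model as described above:
# only a random sample of lines is used, not the first N lines of the file.
spm.SentencePieceTrainer.train(
    input="corpus_nl.txt",          # placeholder corpus file
    model_prefix="nl_bpe",          # placeholder output prefix
    model_type="bpe",
    vocab_size=10000,               # placeholder; see the vocab_size discussion below
    input_sentence_size=1_000_000,  # cap on how many lines are used for training
    shuffle_input_sentence=True,    # sample those lines randomly
    character_coverage=1.0,         # keep all characters so rare words can still
                                    # be spelled out character by character
)
```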
Thank you for the clear answer. One quick question: do I have to determine the optimal vocab_size by trial and error? I've searched through the entire sentencepiece repository but could not find any straightforward guidelines regarding this.
Probably, yes. I can't find it now, but I remember some paper mentioned around 30-50k tokens for this encoding type. It depends on the language, and this whole field is pretty much intuition-driven (from what I read and saw). You can't calculate the optimal network architecture or the number of tokens; the networks are so big that the only method is trial, error, rinse and repeat. Here are a couple of easy-to-read papers to get you started: https://arxiv.org/pdf/1808.06226.pdf and https://arxiv.org/pdf/1508.07909.pdf They contain no math (I just don't get complex math tbh) and everything else is mostly common sense and logic.

I personally tried vocabularies with 10k and 50k tokens; surprisingly, the 10k model converged faster and the resulting loss was much lower (around 3.5 compared to 4+ for the 50k model). But the output was still not impressive, and maybe the 50k model has more potential for improving over time. It all requires a lot of experimentation.

Also, one thing to remember: your data size (in tokens) must be a lot bigger than your model size. Otherwise it will just memorize the corpus and produce garbage on arbitrary input. I used a huge dump of Russian books; it contains zipped fb2 books and the overall size is more than 400 Gb. Of course, there are many duplicates and not all books are in Russian, so I did some filtering first and in the end produced a corpus of around 10 Gb or so. To fully sample it (the train script selects random lines, not sequential ones) my system would need about 6 days.
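Since there doesn't seem to be a formula for vocab_size, a small comparison harness can make the trial-and-error loop less painful. This is a hypothetical sketch: the corpus and hold-out file names, the candidate vocab sizes, and the "tokens per line" metric are illustrative choices, not something from this thread:

```python
import sentencepiece as spm

# Hypothetical helper for the trial-and-error loop: train models with several
# vocab sizes and compare how many tokens they need for a held-out sample.
def tokens_per_line(model_file: str, sample_file: str) -> float:
    sp = spm.SentencePieceProcessor(model_file=model_file)
    with open(sample_file, encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
    total = sum(len(sp.encode(line)) for line in lines)
    return total / len(lines)

for size in (10_000, 30_000, 50_000):        # candidate vocab sizes to compare
    spm.SentencePieceTrainer.train(
        input="corpus_nl.txt",               # placeholder corpus
        model_prefix=f"nl_bpe_{size}",
        model_type="bpe",
        vocab_size=size,
        input_sentence_size=1_000_000,
        shuffle_input_sentence=True,
    )
    print(size, tokens_per_line(f"nl_bpe_{size}.model", "holdout_nl.txt"))
```

A lower tokens-per-line figure means the vocabulary compresses the language better, but as noted above, the real test is how the language model itself trains with each tokenizer.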
Dude, you're awesome! Thanks for the valuable information; I will definitely study the papers you mentioned. I will try different values for vocab_size and see what happens. However, after you mentioned this:
I just realised the biggest challenge will be finding a sufficient amount of text written in Dutch, as the total size of all Dutch books on Gutenberg.org is less than 100 MB. Anyways, things are starting to become clearer to me now. Many thanks again.
Yeah, that corpus is way too small. You can try translating books with Google for starters, or find other sources (you're presumably not going to buy 400 Gb of compressed books, and I don't think you can find that many in the public domain, so...). The whole point of neural networks is to lossily "compress" the data into their internal structure so they can find patterns in it. That's because you require the model to correctly predict the next token based on the previous tokens, and it should be able to do that on far more text lines than it could ever store. If your data can be stored "as is" because the model size allows it, the model isn't forced to optimize itself and find the patterns, so it doesn't learn at all, it just memorizes.
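To turn the "data must be much larger than the model" point into something checkable, one rough heuristic is to count how many tokens the corpus becomes under the trained tokenizer and compare that with the parameter count. A sketch with placeholder file names and an illustrative parameter figure (not this project's model):

```python
import sentencepiece as spm

# Rough heuristic: count how many tokens the corpus becomes and compare with
# the number of model parameters. If the token count is not several times
# larger, the model is likely to memorize the corpus rather than generalize.
sp = spm.SentencePieceProcessor(model_file="nl_bpe_10000.model")  # placeholder model

token_count = 0
with open("corpus_nl.txt", encoding="utf-8") as f:                # placeholder corpus
    for line in f:
        token_count += len(sp.encode(line))

param_count = 124_000_000  # illustrative parameter count, not taken from this repo
print(f"{token_count:,} tokens vs {param_count:,} parameters "
      f"(ratio {token_count / param_count:.1f}x)")
```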
As this issue is still 'Open', I guess this is a good place to ask the following question: if you replace the newline character with a custom character like
As far as I remember, the script doesn't replace the newlines but inserts that token before them, so the sentences stay short enough. Take a look at
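The idea described here can be sketched roughly as follows; the "<n>" symbol and the file names are placeholders, not necessarily what the repo's scripts actually use:

```python
import sentencepiece as spm

# Keep the real line breaks, but put a dedicated token in front of each one so
# the tokenizer sees short sentences while the line structure is preserved.
NEWLINE_TOKEN = "<n>"  # placeholder symbol

def mark_line_ends(text: str) -> str:
    # Insert the token before each newline instead of replacing the newline.
    return text.replace("\n", f" {NEWLINE_TOKEN}\n")

# Reserving the symbol during training guarantees it stays a single token.
spm.SentencePieceTrainer.train(
    input="corpus_nl_marked.txt",          # placeholder: corpus after mark_line_ends
    model_prefix="nl_bpe",
    vocab_size=10_000,
    user_defined_symbols=[NEWLINE_TOKEN],
)
```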
Right now I'm executing createspmodel.sh with a text file containing all books from Project Gutenberg written in the Dutch language, to generate the dictionary files. Do you think this is sufficient? Or should I also use a Wikipedia scraper, for example, to extend the amount of text for creating the dictionary files? To me it seems 'the more data, the better' when initialising the vocabulary files. @rkfg, could you give your opinion about this?
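For context, a preprocessing step along these lines could be used to build the single input text file from the downloaded Gutenberg books before running the training script; the directory layout and file names here are hypothetical:

```python
from pathlib import Path

# Hypothetical preprocessing step: concatenate all downloaded Dutch Gutenberg
# texts into one plain-text file to feed to the SentencePiece training script.
books = sorted(Path("gutenberg_nl").glob("*.txt"))   # placeholder download folder
with open("corpus_nl.txt", "w", encoding="utf-8") as out:
    for book in books:
        text = book.read_text(encoding="utf-8", errors="ignore")
        out.write(text.strip() + "\n")
```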