Dataset balancing #2
Nonstandard languages
Russian normalized dataset
Normalization decreased the vocab size 4×, from ~90k to ~25k.
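The issue does not show the normalization itself, but a minimal sketch of the kind of text normalization that shrinks a vocabulary this way (lowercasing, collapsing repeated characters and whitespace) could look like the following; the exact rules used in the project are an assumption here:

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, collapse 3+ repeated chars to 2, squeeze whitespace."""
    text = text.lower()
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # "приветтт" -> "приветт"
    return re.sub(r"\s+", " ", text).strip()

def vocab_size(corpus) -> int:
    """Count unique whitespace tokens after normalization."""
    return len(Counter(tok for line in corpus
                       for tok in normalize(line).split()))
```

Case folding and repeat collapsing merge many surface variants of the same word, which is why the unique-token count drops sharply on noisy social-media text.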
Final dataset distribution
Languages: English, Arabic, Spanish, Thai, Korean, French, Turkish, Indonesian, Italian, German, Russian.
Path to the balanced file:
Path to the file without balancing for the languages above:

Emoji distribution
Similar to the Russian merged distribution, but it can differ a little: '😂': 0.25. The smallest class, '😡', is only 3.8k in Indonesian; in Russian it's ~6k.

Vocabs
Preprocessing
Vocabs contain only Latin characters plus the symbols of the particular language. Korean and Thai were processed separately from the rest. Regular expressions for removing extra chars:
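The actual regular expressions are not included in the issue. A hedged sketch of the described approach (keep Latin characters plus the target language's own script, drop everything else) might look like this; the Unicode ranges and language keys below are illustrative assumptions, not the project's real patterns:

```python
import re

# Hypothetical per-language "allowed character" classes: Latin letters
# plus the language's own script block; everything else is stripped.
ALLOWED = {
    "ru": re.compile(r"[^a-zа-яё0-9\s]"),        # Latin + Cyrillic
    "ko": re.compile(r"[^a-z\uac00-\ud7a3\s]"),  # Latin + Hangul syllables
    "th": re.compile(r"[^a-z\u0e00-\u0e7f\s]"),  # Latin + Thai block
}

def clean(text: str, lang: str) -> str:
    """Remove characters outside the language's allowed set."""
    text = text.lower()
    text = ALLOWED[lang].sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Handling Korean and Thai with their own character classes matches the note that they were processed separately: neither uses whitespace or Latin-like tokenization the way the European languages do.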
Balancing
Sizes in the final dataset:
All languages differ. For example, the Thai dataset should be 10 times larger than the Russian one to reach the same ngram vocabulary size, while Korean and Arabic should be 2-4 times smaller. I suppose the only way to keep a real balance is to cut the ngram vocabulary for the difficult languages before model training.
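The proposed fix, capping the ngram vocabulary per language before training, can be sketched as follows; this is a minimal character-ngram version with a hypothetical `max_size`, not the project's actual pipeline:

```python
from collections import Counter

def char_ngrams(text: str, n: int = 3):
    """All overlapping character ngrams of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_capped_vocab(corpus, n: int = 3, max_size: int = 25000):
    """Keep only the max_size most frequent ngrams for this language."""
    counts = Counter(g for line in corpus for g in char_ngrams(line, n))
    return {g for g, _ in counts.most_common(max_size)}

def encode(text, vocab, n: int = 3):
    """Map out-of-vocabulary ngrams to <unk>."""
    return [g if g in vocab else "<unk>" for g in char_ngrams(text, n)]
```

Using a per-language `max_size` lets a "difficult" language (one whose ngram vocabulary grows fast, like Thai) be trimmed to the same effective vocabulary size as the others without collecting 10× more data.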
@Islanna