Dataset balancing #2

Open
Islanna opened this issue May 6, 2019 · 5 comments
Islanna commented May 6, 2019

No description provided.

Islanna changed the title from "Datase" to "Dataset balancing" on May 6, 2019

Islanna commented May 6, 2019

Decided to balance the full dataset according to the emoji and ngram distributions in the Russian subset.

Languages

| Lang | Full 2018 dataset size |
|------|------------------------|
| en | 3918333 |
| ja | 2603697 |
| ar | 1416102 |
| es | 1237730 |
| pt | 869292 |
| th | 620532 |
| ko | 493476 |
| fr | 349677 |
| tr | 302217 |
| tl | 129997 |
| id | 109838 |
| it | 86488 |
| de | 85671 |
| ru | 84824 |

Emoji merging

exclude_emojis = ['🙌','👊','🎶','💁','✋','🎧','🔫','🙅','👀','💯']

merge_dict = {
    '💕':'😍',
    '❤':'😍',
    '💙':'😍',
    '♥':'😍',
    '💜':'😍',
    '💖':'😍',
    '💟':'😍',
    '😘':'😍',
    '😉':'😏',
    '😢':'😭',
    '😁':'😊',
    '😄':'😊',
    '😌':'😊',
    '☺':'😊',
    '👌':'👍',
    '👏':'👍',
    '💪':'👍',
    '✨':'👍',
    '✌':'👍',
    '😋':'😜',
    '😐':'😑',
    '😒':'😑',
    '😕':'😑',
    '😠':'😡',
    '💀':'😡',
    '😤':'😡',
    '😈':'😡',
    '😩':'😔',
    '😞':'😔',
    '😪':'😔',
    '😷':'😔',
    '😴':'😔',
    '🙈':'😅',
    '🙊':'😅',
    '😳':'😅',
    '😫':'😣',  
    '😓':'😣',
    '😖':'😣',
    '😬':'😣',
    '🙏':'😣'
}
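
A minimal sketch of how the exclusion/merging above could be applied, assuming a pandas DataFrame of tweets with an `emoji` label column (the DataFrame and column names are hypothetical):

```python
import pandas as pd

def merge_emoji_classes(df: pd.DataFrame,
                        merge_dict: dict,
                        exclude_emojis: list) -> pd.DataFrame:
    """Drop tweets labelled with excluded emojis and remap the rest onto coarser classes."""
    out = df[~df['emoji'].isin(exclude_emojis)].copy()
    # emojis not listed in merge_dict keep their own class
    out['emoji'] = out['emoji'].map(lambda e: merge_dict.get(e, e))
    return out

# tweets = pd.read_feather('...')  # hypothetical input file
# tweets = merge_emoji_classes(tweets, merge_dict, exclude_emojis)
# print(tweets['emoji'].value_counts())
```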

Emoji distribution

Distribution in the Russian 2018 dataset (~85k tweets), after merging:

{'😂': 21529,
 '😍': 17369,
 '😊': 8777,
 '👍': 6195,
 '😏': 5559,
 '😭': 4556,
 '😅': 4336,
 '😑': 2542,
 '💔': 2481,
 '😣': 2065,
 '😔': 1924,
 '😡': 1884,
 '😎': 1782,
 '😜': 1454}

We could probably merge the classes further: 😜 with 😏, and 😎 with 😊.

Vocabulary distribution

Stratified sample: a random sample from the dataset with the same emoji distribution as in the Russian subset, max size 100k. Word and ngram vocabs are calculated for the stratified sample (a sampling sketch follows the table).
Words are the processed text (no numbers or punctuation) split on spaces.

| Lang | Stratified sample size | Word vocab size | Ngram vocab size |
|------|------------------------|-----------------|------------------|
| en | 96544 | 50779 | 303809 |
| ja | 96036 | 224019 | 5380959 |
| ar | 96544 | 156335 | 751645 |
| es | 96544 | 63004 | 304999 |
| pt | 96544 | 45251 | 240593 |
| th | 94232 | 186816 | 935840 |
| ko | 93859 | 281594 | 2105074 |
| fr | 95571 | 53860 | 286240 |
| tr | 95135 | 122695 | 469954 |
| tl | 80123 | 52671 | 255014 |
| id | 76675 | 58604 | 296513 |
| it | 82705 | 57282 | 269276 |
| de | 79234 | 56615 | 332311 |
| ru | 84824 | 93778 | 492304 |
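
A minimal sketch of the stratified sampling described above, assuming a per-language DataFrame with an `emoji` label column (hypothetical name) and using the merged Russian counts as the target distribution:

```python
import pandas as pd

def stratified_sample(df_lang: pd.DataFrame,
                      ru_emoji_counts: dict,
                      max_size: int = 100_000,
                      seed: int = 42) -> pd.DataFrame:
    """Sample one language's tweets with the same emoji proportions as the Russian subset."""
    ru_total = sum(ru_emoji_counts.values())
    target_total = min(max_size, len(df_lang))
    parts = []
    for emoji, ru_count in ru_emoji_counts.items():
        group = df_lang[df_lang['emoji'] == emoji]
        n = min(len(group), int(target_total * ru_count / ru_total))
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts, ignore_index=True)
```

When a class has fewer tweets than its proportional target, the whole class is taken, which would explain why the tl/id/it/de samples come out well below 100k.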

Coverage

Extract the top N% of chars/ngrams/words and check how much of the full dataset / stratified sample they cover, for N% in [10%, ..., 90%].
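
A minimal sketch of the coverage calculation, assuming `counts` is a collections.Counter of chars/ngrams/words built over the corpus:

```python
from collections import Counter

def coverage_curve(counts: Counter, fractions=tuple(i / 10 for i in range(1, 10))):
    """For each N%, take the top-N% most frequent items and report the share of all
    occurrences they cover."""
    items = counts.most_common()               # sorted by frequency, descending
    total = sum(freq for _, freq in items)
    curve = {}
    for frac in fractions:
        top_n = max(1, int(len(items) * frac))
        covered = sum(freq for _, freq in items[:top_n])
        curve[frac] = covered / total
    return curve

# word_counts = Counter(tok for text in texts for tok in text.split())
# print(coverage_curve(word_counts))
```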

Chars

Most of the unpopular chars come from other languages: English letters in the Russian dataset, for example. These extra characters should probably be removed.

[plot: full_chars, char coverage of the full dataset]
[plot: sample_chars, char coverage of the stratified sample]

Japanese and Korean chars behave much more like ngrams than like single characters.

Ngrams

Computed only for the sample; the calculation for the full dataset is too time-consuming.

[plot: sample_ngrams, ngram coverage of the stratified sample]

Words

[plot: full_words, word coverage of the full dataset]

[plot: sample_words, word coverage of the stratified sample]

Islanna commented May 7, 2019

Nonstandard languages

  • ko - found the jamotools library for splitting Hangul syllables into jamo characters. No dramatic changes: the word vocabulary is the same size (obviously), and the ngram vocabulary is 2x smaller but still >1 mln. The plot of the ngram distribution looks much better, though (see the tokenization sketch after this list).

[plot: koupd, updated ko ngram distribution]

  • ar - removed all short vowels and other diacritics (harakat/tashkeel) that interfere. Only 4% of the whole dataset changed; the word and ngram vocabs are pretty much the same.
  • th - found out that I had accidentally dropped some necessary symbols during preprocessing: r'\W+' also matches some Thai tone marks. Changed the preprocessing to keep only Thai and English chars and added proper tokenization from pythainlp. The final word vocab is ~20k and the ngram vocab ~180k.
  • tr - an agglutinative language, which probably explains why its word vocab is larger than Russian's while its ngram vocab stays comparable. No special tokenization tools.
  • ja - removed from the final data.
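
A short sketch of the language-specific handling mentioned above; the jamotools and pythainlp calls are my best guess at their APIs and may differ slightly between versions:

```python
import jamotools                               # Korean: split Hangul syllables into jamo
from pythainlp.tokenize import word_tokenize   # Thai: dictionary-based word tokenization

def split_korean_to_jamo(text: str) -> str:
    # decompose each syllable into its constituent jamo, which shrinks the ngram vocabulary
    return jamotools.split_syllables(text)

def tokenize_thai(text: str) -> list:
    # pythainlp's default 'newmm' engine handles the lack of spaces between Thai words
    return word_tokenize(text)
```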

Russian normalized dataset

Normalized the Russian data: the word vocab size decreased roughly 4x, from ~90k to ~25k.
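
The comment doesn't say which normalizer was used; purely as an illustration (an assumption, not necessarily what was done here), lemmatization with pymorphy2 is one way to get this kind of vocab reduction:

```python
import pymorphy2

morph = pymorphy2.MorphAnalyzer()

def normalize_russian(tokens):
    # map each surface word form to its lemma (normal form); collapsing inflected
    # forms is what shrinks the ~90k vocab to a much smaller one
    return [morph.parse(tok)[0].normal_form for tok in tokens]
```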

Islanna commented May 8, 2019

Final dataset distribution

Languages: English, Arabic, Spanish, Thai, Korean, French, Turkish, Indonesian, Italian, German, Russian.
Removed Japanese and Tagalog.

Path to the balanced file: nvme/islanna/emoji_sentiment/data/fin_tweets.feather

Path to the file without balancing for languages above: nvme/islanna/emoji_sentiment/data/twitter_proc_full.feather

Emoji distribution

Similar to the merged Russian distribution, but may differ a little:

'😂': 0.25,
'😍': 0.23,
'😊': 0.13,
'😏': 0.08,
'😭': 0.07,
'👍': 0.07,
'😅': 0.05,
'😑': 0.04,
'😔': 0.03,
'😣': 0.03,
'😡': 0.02

The smallest class '😡' in Indonesian is only 3.8k. In Russian it's ~6k.

Vocabs

Preprocessing

Vocabs contain only Latin chars and the symbols of the particular language. Korean and Thai were processed separately from the rest.

Regular expressions for removing extra chars:

lang_unuse = {'en':'[^a-zA-Z]',
             'ar':'[^\u0600-\u06FFa-zA-Z]', #\u0621-\u064A maybe
             'es':'[^a-zA-ZáéíóúüñÁÉÍÓÚÜÑ]',
             'th':'[^\u0E00-\u0E7Fa-zA-Z]',
             'ko':'[^\uAC00-\uD7A3\u1100-\u11FF\u3130-\u318Fa-zA-Z]',
             'fr':'[^a-zA-ZÀ-ÿ]',
             'tr':'[^a-zA-ZğşöçĞŞÖÇıIiİuUüÜ]',
             'id':'[^a-zA-Z]',
             'it':'[^a-zA-Z]',
             'de':'[^a-zA-ZÀ-ÿ]',
             'ru':'[^a-zA-Zа-яА-ЯЁё]'}
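
A minimal sketch of applying these patterns, assuming the raw tweet text lives in a `text` column and the language code in a `lang` column (hypothetical names):

```python
import re

def clean_text(text: str, lang: str) -> str:
    """Replace characters outside the language-specific whitelist with spaces
    and collapse repeated whitespace."""
    pattern = lang_unuse.get(lang)
    if pattern is None:                 # languages without a pattern are left as-is
        return text
    cleaned = re.sub(pattern, ' ', text)
    return re.sub(r'\s+', ' ', cleaned).strip()

# df['text'] = [clean_text(t, l) for t, l in zip(df['text'], df['lang'])]
```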

Balancing

Sizes in the final dataset:

| Lang | Lang size | Word vocab size | Ngram vocab size |
|------|-----------|-----------------|------------------|
| en | 299995 | 95640 | 522879 |
| ar | 199993 | 253023 | 1127338 |
| es | 299995 | 117597 | 498495 |
| th | 349995 | 46542 | 331081 |
| ko | 198561 | 515949 | 1859535 |
| fr | 299995 | 99587 | 475570 |
| tr | 199993 | 201967 | 671532 |
| id | 199357 | 100246 | 457841 |
| it | 210703 | 95578 | 397849 |
| de | 184109 | 99169 | 515266 |
| ru | 241117 | 172594 | 810772 |

All languages are different. For example, the Thai dataset would have to be 10 times larger than the Russian one to reach the same ngram vocabulary size, while Korean and Arabic would have to be 2-4 times smaller. I suppose the only way to keep a real balance is to cut the ngram vocabulary for the difficult languages before model training (a sketch is below).
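
A minimal sketch of such a cut, assuming per-language frequency Counters for the ngrams (the names and the size limit are only illustrative):

```python
from collections import Counter

def cut_vocab(counts: Counter, max_size: int) -> set:
    """Keep only the max_size most frequent items; everything else can be mapped to <UNK>."""
    return {item for item, _ in counts.most_common(max_size)}

# e.g. trim the Korean/Arabic ngram vocabularies to roughly the Russian size
# ngram_vocab['ko'] = cut_vocab(ngram_counts['ko'], max_size=800_000)
```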

Ngram vocab cut

[plot: ngram vocab coverage after the cut]

Word vocab cut

[plot: word vocab coverage after the cut]

snakers4 (Owner) commented

@Islanna
Some formatting ideas for pasting the data into an article, for easier storytelling. Describing what you did with the words would also help.

stats.xlsx

snakers4 (Owner) commented

stats.xlsx

@Islanna
Updated file for the article.
