algorithmic_reverse_nlplike generator #57
Conversation
Update to v1.0.8
… following Zipf's Law
lukaszkaiser left a comment
Great, thanks! This looks almost ready, just 2 small questions. It's great to have this problem done :).
| """ | ||
| u = np.random.random(sample_len) | ||
| return [t+1 for t in np.searchsorted(distr_map, u)] # 0 pad and 1 EOS |
Is 1 enough here for both PAD and EOS? Just asking, if it is, please add a comment making that clear.
It would be enough, but I've kept thinking about this line, which is a little tricky. The numpy docs for numpy.random.random say the returned values are in the range [0.0, 1.0), so an exact zero (0.00000...0) is possible, though almost improbable. So maybe it's better to add a sanity check for the improbable zero, and to comment the line more clearly.
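For reference, a minimal self-contained sketch of the pattern under discussion (function names and layout are illustrative, not the merged code). Assuming the cumulative map starts with an explicit 0.0 entry, searchsorted maps u in (0, 1) to ranks 1..nbr_symbols, so the +1 shift leaves id 0 free for PAD and id 1 for EOS, and the improbable exact-zero draw is clipped away:

    import numpy as np

    def zipf_distribution(nbr_symbols, alpha):
        # Unnormalized Zipf weights for ranks 1..nbr_symbols; the leading
        # 0.0 makes searchsorted return ranks starting at 1 for any u > 0.
        weights = np.power(np.arange(1, nbr_symbols + 1), -alpha)
        cdf = np.r_[0.0, np.cumsum(weights)]
        return cdf / cdf[-1]

    def zipf_random_sample(distr_map, sample_len):
        u = np.random.random(sample_len)  # values in [0.0, 1.0)
        # An exact 0.0 is possible (though almost improbable) and would
        # land on rank 0 below, so nudge it to the smallest positive float.
        u = np.maximum(u, np.finfo(float).tiny)
        # Ranks here are 1..nbr_symbols; +1 reserves 0 (PAD) and 1 (EOS).
        return [t + 1 for t in np.searchsorted(distr_map, u)]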
tensor2tensor/bin/t2t-datagen
Outdated
| "algorithmic_reverse_nlplike_decimal8K": ( | ||
| lambda: algorithmic.reverse_generator_nlplike(8000, 40, 100000, | ||
| 10, 1.250), | ||
| lambda: algorithmic.reverse_generator_nlplike(8000, 400, 10000, |
I think keeping length 40 for both train and dev makes more sense for nlplike tasks. On purely algorithmic tasks we want to see generalization to much higher lengths; it's a nice-to-have in NLP too, but less important. Maybe just a little larger, like 60 or so?
Maybe 70 in train and 700 in dev; would that be better?
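If the 70/700 split were adopted, the registration would look something like the sketch below. This is purely illustrative: the parameter order mirrors the diff above, and the trailing arguments of the dev lambda (truncated in the diff) are assumed to match the train side.

    "algorithmic_reverse_nlplike_decimal8K": (
        # Train: max_length 70, 100000 cases.
        lambda: algorithmic.reverse_generator_nlplike(8000, 70, 100000,
                                                      10, 1.250),
        # Dev: max_length 700, 10000 cases (trailing args assumed).
        lambda: algorithmic.reverse_generator_nlplike(8000, 700, 10000,
                                                      10, 1.250)),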
…ax_length and add __pycache__ entry in .gitignore
lukaszkaiser left a comment
Looks good, thanks!
@lukaszkaiser I've struggled a bit with the Zipf distribution because numpy.random.zipf actually implements a zeta distribution (the two are similar but not equal), and it doesn't allow generating samples from a given range or choosing alpha values less than 1.0. So I followed the advice in these two Stack Overflow posts (first and second) and created one function to generate the distribution and another to generate samples (both with tests). As I said in the closed issue, I found that alpha (for the Zipf distribution) usually lies in the range [1.1-1.6] for modelling natural text, so to generate samples that could plausibly emulate an NLP-like task I chose (and tested) values that produce samples over the whole range while following the Zipf distribution.