
Conversation

@ReDeiPirati (Contributor) commented Jun 27, 2017

@lukaszkaiser I've struggled a bit with the Zipf distribution, because numpy.random.zipf is, first of all, really a Zeta distribution (similar but not equal), and second, it doesn't allow generating samples from a given range or choosing alpha values less than 1.0. So I've followed the advice in these two Stack Overflow posts (first and second) and created one function to generate the distribution and another to generate samples (both with tests).
As I said in the closed issue, I've found that alpha (for the Zipf distribution) is usually in the range [1.1, 1.6] when modelling natural text, so to generate samples that could potentially emulate NLP-like tasks I've chosen (and tested) values covering that whole range while following the Zipf distribution.
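For reference, a minimal sketch of the approach described above: build the cumulative Zipf distribution once, then sample by inverting it with np.searchsorted. The function names here are illustrative, not necessarily the exact ones in this PR:

```python
import numpy as np

def zipf_distribution(nbr_symbols, alpha):
  """Cumulative Zipf(alpha) distribution over symbols 1..nbr_symbols."""
  # Unnormalized Zipf weights: p(k) is proportional to 1 / k^alpha.
  weights = np.power(np.arange(1., nbr_symbols + 1.), -alpha)
  # Prepend 0.0 and normalize, giving a CDF map usable with searchsorted.
  cdf = np.r_[0.0, np.cumsum(weights)]
  return cdf / cdf[-1]

def zipf_random_sample(distr_map, sample_len):
  """Draw sample_len symbol ids by inverting the CDF."""
  u = np.random.random(sample_len)  # uniform draws in [0.0, 1.0)
  # +1 keeps ids 0 (PAD) and 1 (EOS) reserved, as discussed below.
  return [t + 1 for t in np.searchsorted(distr_map, u)]
```

Because distr_map starts at 0.0, searchsorted returns at least 1 for any positive draw, so the +1 shift yields ids of 2 and above.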

@lukaszkaiser (Contributor) left a comment


Great, thanks! This looks almost ready, just two small questions. It's great to have this problem done :).

"""
u = np.random.random(sample_len)
return [t+1 for t in np.searchsorted(distr_map, u)] # 0 pad and 1 EOS
@lukaszkaiser (Contributor):

Is 1 enough here for both PAD and EOS? Just asking; if it is, please add a comment making that clear.

@ReDeiPirati (Contributor, Author):

It would be enough, but I've kept thinking about this line, which is a little tricky. The numpy docs for numpy.random.random say the returned values are in the range [0.0, 1.0), so drawing an exact zero (0.00000...0) is possible but extremely improbable. If u were exactly 0.0, searchsorted would return index 0 and the +1 shift would map it to 1, colliding with EOS. So maybe it's better to add a sanity check for that improbable zero, and to comment the line more clearly.
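For illustration, one possible shape of that sanity check (a sketch, not necessarily the code that was merged): nudge any exact-zero draw up to the smallest positive float before the lookup.

```python
import numpy as np

def zipf_random_sample(distr_map, sample_len):
  """Sample symbol ids, guarding against an exact 0.0 draw."""
  u = np.random.random(sample_len)  # values in [0.0, 1.0)
  # An exact 0.0 would make searchsorted return index 0, which the +1
  # shift would map to 1 and collide with EOS; nudge zeros upward.
  u = np.where(u == 0.0, np.nextafter(0.0, 1.0), u)
  # Resulting ids start at 2, keeping 0 for PAD and 1 for EOS.
  return [t + 1 for t in np.searchsorted(distr_map, u)]
```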

"algorithmic_reverse_nlplike_decimal8K": (
lambda: algorithmic.reverse_generator_nlplike(8000, 40, 100000,
10, 1.250),
lambda: algorithmic.reverse_generator_nlplike(8000, 400, 10000,
@lukaszkaiser (Contributor):

I think keeping length 40 for both train and dev makes more sense for nlplike tasks. On purely algorithmic tasks we want to see generalization to much higher lengths; it's a nice-to-have in NLP too, but less important. Maybe just a little larger, like 60 or so?

@ReDeiPirati (Contributor, Author):

Would 70 in train and 700 in dev be better?
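For concreteness, that suggestion would turn the registry entry quoted above into something like the sketch below. The trailing arguments of the dev lambda are assumed to mirror the train entry, since the quoted snippet cuts off:

```python
# Hypothetical version of the entry above with train length 70 and
# dev length 700; the dev lambda's last two arguments are assumed.
"algorithmic_reverse_nlplike_decimal8K": (
    lambda: algorithmic.reverse_generator_nlplike(8000, 70, 100000,
                                                  10, 1.250),
    lambda: algorithmic.reverse_generator_nlplike(8000, 700, 10000,
                                                  10, 1.250)),
```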

…ax_length and add __pycache__ entry in .gitignore
@lukaszkaiser (Contributor) left a comment

Looks good, thanks!

@lukaszkaiser lukaszkaiser merged commit a2a6178 into tensorflow:master Jun 29, 2017