Here's a question: what fraction of sentences don't use the letters "a", "b", or "c"? In other words, if you randomly sample a sentence, what is the probability that it won't use the letters "a", "b", or "c"? This might seem difficult to answer accurately. One way would be to look at all the books in the world and simply count the number of sentences that meet the criterion. However, if the true probability is very low, this counting estimate could easily come out as exactly 0.
It's possible to use a language model to get a more accurate answer to the above question. With a language model, we can sample a sequence while masking out the disallowed characters, and at each timestep multiply a running weight by 1 - P(forbidden characters | prefix), the probability mass the model puts on the allowed characters. The mean of this weight across samples is an unbiased estimate of the probability that an unconstrained sample would avoid the disallowed characters entirely.
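To make the estimator concrete, here is a minimal sketch. It is not my actual sampling code: `next_char_probs` is a hypothetical stand-in for the trained model, implemented as a uniform toy distribution so the answer can be checked by hand.

```python
import math
import random

ALPHABET = "abcdefgh "  # 9-character toy alphabet; a real model uses the full character set

def next_char_probs(prefix: str) -> dict[str, float]:
    # Stand-in for a trained language model: P(next char | prefix).
    # Uniform here, so the estimate below can be verified analytically.
    p = 1.0 / len(ALPHABET)
    return {c: p for c in ALPHABET}

def estimate_avoid_logprob(forbidden: set[str], length: int, n_samples: int) -> float:
    # Sample each character from the model restricted to allowed characters,
    # multiplying a running weight by 1 - P(forbidden | prefix) at each step.
    # The mean weight over samples is unbiased for the target probability.
    total = 0.0
    for _ in range(n_samples):
        prefix, weight = "", 1.0
        for _ in range(length):
            probs = next_char_probs(prefix)
            weight *= 1.0 - sum(p for c, p in probs.items() if c in forbidden)
            allowed = [(c, p) for c, p in probs.items() if c not in forbidden]
            chars, char_weights = zip(*allowed)
            prefix += random.choices(chars, weights=char_weights)[0]
        total += weight
    return math.log(total / n_samples)

# With the uniform toy model, avoiding {"a", "b"} over 4 steps should give
# log((7/9)**4) ≈ -1.005, which the estimate reproduces.
print(estimate_avoid_logprob({"a", "b"}, length=4, n_samples=1000))
```

Plugging in a real model just means replacing `next_char_probs` with the network's softmax over next characters given the prefix.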
I trained a character-level Transformer on the Amazon Food Reviews corpus for a few days on a single GPU. The resulting per-character loss was 0.84 nats, which is not very good. However, the model still generates fully formed words and often produces coherent sentences.
Using the above language model, I measured some probabilities. Each entry below is the probability that the model produces a 256-character chunk without using the given subset of characters:
| Characters omitted | Log probability (base e) |
|---|---|
| aA | -18.11 |
| jqxzJQXZ | -1.23 |
| abcABC | -28.47 |
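For intuition, these convert to ordinary probabilities as follows (plain arithmetic on the table above, no new measurements):

```python
import math

# Measured log probabilities (base e) from the table above.
for chars, logp in [("aA", -18.11), ("jqxzJQXZ", -1.23), ("abcABC", -28.47)]:
    print(f"P(256-char chunk avoids {chars!r}) = exp({logp}) ≈ {math.exp(logp):.2g}")
# aA       ≈ 1.4e-08
# jqxzJQXZ ≈ 0.29
# abcABC   ≈ 4.3e-13
```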