Compound Passphrase List Safety Checker
This command line tool checks whether a given passphrase word list (such as a diceware word list) has any words that can be combined to make another word on the list. It's written in Rust, which I am new to. This is very much a work in progress, so I'd heavily caution against trusting it for real results. I forked off an earlier version of this project when it was in a simpler state if you want to check that out.
I've now written a blog post about this tool.
Initially I wanted to make sure that no two words in the EFF's long diceware word list could be combined to make another word on the list. I later checked other lists.
Disclosure: I am not a professional researcher or statistician, and frankly I'm pretty fuzzy on some of this math. This code/theory/explanation could be very wrong (but hopefully not harmful?). If you think it could be wrong or harmful, please leave an issue!
Further disclosure: see the "Caveats / Known issues" section below.
What is "compound-safety"?
I made up the term. Here's what I mean by it: A passphrase word list is "compound-safe" if it...
1. does NOT contain any pairs of words that can be combined to make another word on the list. (We'll call this a "compounding".)
2. does NOT contain any pairs of words that can be combined such that they can be guessed in two distinct ways within the same word-length space. (We'll call this a "problematic overlap".)
Brief examples of each of these conditions being violated
An example of condition #1: If a word list included "under", "dog", and "underdog" as three separate words, it would NOT be compound-safe, since "under" and "dog" can be combined to make the word "underdog". A user not using spaces between words might get a passphrase that included the character string "underdog" as two words, but a brute-force attack would guess it as one word. Therefore this word list would NOT be compound-safe. (I refer to this as a "compounding".)
I heard of this potential issue in this YouTube video.
An example of condition #2: Let's say a word list included "paper", "paperboy", "boyhood", and "hood". A user not using spaces between words might get the following two words next to each other in a passphrase: "paperboyhood", which could be brute-force guessed as both [paperboy][hood] and [paper][boyhood]. Therefore this word list would NOT be compound-safe. (I call this a "problematic overlap".)
Another way to think about problematic overlaps: if, for every pair of words, you mash them together, there must be only ONE way to split them apart and make two words on the list.
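The two conditions can be sketched in a few lines of Rust. This is a minimal illustration written for this explanation, not the tool's actual code: the function names `find_compoundings` and `count_two_word_splits` are hypothetical, and the word list is just the example words from above.

```rust
use std::collections::HashSet;

/// Condition #1: return every (left, right) pair of list words whose
/// concatenation is itself a word on the list.
fn find_compoundings<'a>(list: &HashSet<&'a str>) -> Vec<(&'a str, &'a str)> {
    let mut found = Vec::new();
    for &word in list {
        // Try every interior split point of `word` (skip(1) drops index 0).
        for (i, _) in word.char_indices().skip(1) {
            let (left, right) = word.split_at(i);
            if list.contains(left) && list.contains(right) {
                found.push((left, right));
            }
        }
    }
    // HashSet iteration order is arbitrary, so sort for stable output.
    found.sort();
    found
}

/// Condition #2: count the ways `mashed` can be split into exactly two
/// words that are both on the list. A count above 1 for the mash of any
/// two list words is a problematic overlap.
fn count_two_word_splits(mashed: &str, list: &HashSet<&str>) -> usize {
    mashed
        .char_indices()
        .skip(1)
        .filter(|&(i, _)| {
            let (left, right) = mashed.split_at(i);
            list.contains(left) && list.contains(right)
        })
        .count()
}

fn main() {
    let list: HashSet<&str> = [
        "under", "dog", "underdog", "paper", "paperboy", "boyhood", "hood",
    ]
    .iter()
    .copied()
    .collect();

    println!("compoundings: {:?}", find_compoundings(&list));
    println!(
        "two-word splits of paperboyhood: {}",
        count_two_word_splits("paperboyhood", &list)
    );
}
```

A full check would call `count_two_word_splits` on the mash of every ordered pair of list words, which is quadratic in the size of the list.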
Why is the compound-safety of a passphrase word list notable?
Let's say we're using the word list described above, which has "under", "dog" and "underdog" in it. A user might randomly get "under" and "dog" in a row, for example in the six-word passphrase "crueltyfrailunderdogcyclingapostle". The user might assume they had six words' worth of entropy. But really, an attacker brute-forcing their way through five-word passphrases would eventually crack the passphrase. We can call this event "a compounding".
Likewise if we got the 6-word phrase "divingpaperboyhoodemployeepastelgravity", an attacker running through six-word combinations would have two chances of guessing "paperboyhood" rather than one.
It's important to note that if the passphrase has any punctuation (for example, a period, comma, hyphen, or space) between words, both of these issues go away completely. If our passphrase is "cruelty under dog daylight paper boyhood": (1) an attacker who tries "underdog" as the third word does not get a match, and (2) the attacker likewise does not get a match if "paperboy" is guessed in the fifth slot and "hood" is guessed as the sixth.
Are compound-safe passphrases "stronger" or "better" than non-compound-safe passphrases?
Is "crueltyfrailunderdogcyclingapostle" a "weaker" passphrase than a 6-word phrase that does not have a compounding in it? Honestly I'm not sure.
But if an attacker knew your passphrase was 6 words, I'm not sure if a phrase with a compounding is "worse" (i.e. going to be cracked earlier) or as good as one without.
What about a "problematic overlap"?
I think a passphrase with a problematic overlap is a clearer issue. This means that, in the same word-length guess space, this passphrase will appear twice rather than once.
Realistically, what are the odds of either a compounding or a problematic overlap occurring in a randomly generated passphrase?
I don't know! If you think you have a formula for calculating this on a per-list basis, feel free to submit an issue or pull request!
What this tool does
This tool takes a word list (as a text file) as an input. It then searches the given list for both compoundings and problematic overlaps (see above).
Next, it attempts to find the smallest number of words that need to be removed in order to make the given word list "compound-safe". Finally, it prints out this new, shorter, compound-safe list to a new text file. In this way it makes word lists "compound-safe" (or at least safer; see the "Caveats / Known issues" section below).
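To give a feel for the "remove as few words as possible" step, here is a rough sketch of one possible heuristic. It is an assumption for illustration only and not necessarily what this tool does: it greedily drops whichever word takes part in the most unresolved compoundings. The function name `greedy_removals` and the triple layout `[left, right, compound]` are hypothetical, and a greedy pass like this is not guaranteed to find the true minimum.

```rust
use std::collections::{HashMap, HashSet};

/// Each compounding is a triple [left, right, compound]; removing any one
/// of the three words resolves it. Greedily remove the word that appears
/// in the most still-unresolved triples until none are left.
fn greedy_removals<'a>(compoundings: &[[&'a str; 3]]) -> Vec<&'a str> {
    let mut open: Vec<[&'a str; 3]> = compoundings.to_vec();
    let mut removed = Vec::new();

    while !open.is_empty() {
        // Count how many unresolved triples each word takes part in.
        let mut counts: HashMap<&'a str, usize> = HashMap::new();
        for triple in &open {
            // A set avoids double-counting a word repeated within one triple.
            let unique: HashSet<&'a str> = triple.iter().copied().collect();
            for w in unique {
                *counts.entry(w).or_insert(0) += 1;
            }
        }

        // Highest count wins; ties go to the alphabetically first word
        // so the result is deterministic.
        let (&best, _) = counts
            .iter()
            .max_by(|a, b| a.1.cmp(b.1).then_with(|| b.0.cmp(a.0)))
            .expect("open is non-empty, so counts is non-empty");

        removed.push(best);
        open.retain(|t| !t.contains(&best));
    }
    removed
}

fn main() {
    let compoundings = [
        ["under", "dog", "underdog"],
        ["under", "ground", "underground"],
        ["under", "line", "underline"],
    ];
    // Dropping "under" resolves all three, instead of dropping three compounds.
    println!("{:?}", greedy_removals(&compoundings));
}
```

The point of the example is only that removing a shared component word can resolve several compoundings at once, so the number of words removed can be much smaller than the number of compoundings found.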
How to use this tool to check a word list
First you'll need to install Rust. Make sure that running the command cargo --version prints a Cargo version string.
Next, clone down this repo. To run the script, cd into the repo's directory and run:
cargo run --release <wordlist.txt>
This will create a file named wordlist.txt.compound-safe that is the compound-safe version of your word list (it may be shorter than the original).
You can also specify a specific output file location:
cargo run --release <wordlist-to-check.txt> <output.txt>
Some initial findings
I found the EFF long word list to be compound-safe (which is really cool!). EFF notes that, in making the list, "We also ensured that no word is an exact prefix of any other word." I'm not sure if that condition is enough on its own to ensure "compound-safety" as I've defined it here.
In contrast, the 1Password list (labeled word_lists/agile_words.txt in this project, copied from this 1Password challenge) is not compound-safe:
- Compoundings: I found 2,661 compound words (see scrap-lists-of-compound-words-and-components/agile_double_bad_words.txt), made up of 1,511 unique bad single words (see scrap-lists-of-compound-words-and-components/agile_single_bad_words.txt). The tool needed to remove only 498 words to make compoundings impossible.
- Problematic overlaps: the tool found 2,117 problematic overlaps in the 1Password list, and marked 2,117 words for removal.
All told, the tool removed 2,225 unique words from the 1Password list to make a new, compound-safe list. The compound-safe version of the Agile list has 16,103 words, and a copy of the list is located at word_lists/agile_words-compound-safe.txt. With 16,103 words, each word from this list would add about 13.98 bits of entropy to a passphrase, compared to the original 1Password list, which adds about 14.16 bits.
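The bits-per-word figures come from taking the base-2 logarithm of the list length, which assumes each word is drawn uniformly at random. A small sketch to reproduce the two numbers quoted above (the function name `bits_per_word` is made up for this example, and the original list size is derived here as 16,103 + 2,225 = 18,328):

```rust
/// Bits of entropy added by one word drawn uniformly at random
/// from a list containing `len` words.
fn bits_per_word(len: u32) -> f64 {
    f64::from(len).log2()
}

fn main() {
    let compound_safe_len: u32 = 16_103;
    // Derived: the tool removed 2,225 unique words from the original list.
    let original_len: u32 = compound_safe_len + 2_225;

    println!(
        "compound-safe list: {:.2} bits per word",
        bits_per_word(compound_safe_len)
    );
    println!(
        "original list:      {:.2} bits per word",
        bits_per_word(original_len)
    );
}
```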
NOTE: 1Password's software, as far as I know, does NOT allow users to generate random passphrases without punctuation between words. Users must choose to separate words with a period, hyphen, space, comma, or underscore. So these findings do NOT constitute a security issue with 1Password.
Caveats / Known issues
We've explored "two-word compounding", where two words are actually one, but is there a possibility of a three-word compounding -- where three words become two? This tool does NOT currently check for this, so I can't actually guarantee that the lists outputted by the tool are completely compound-safe.
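To make the three-word case concrete, here is a hedged sketch (not part of the tool) that counts how many ways a string can be split into exactly k list words, using dynamic programming over split points. The function name `count_segmentations` and the five-word list are hypothetical. No two words on this small list combine to make a third, and no two-word mash splits two ways, yet "carpentry" can be read as three words (car, pen, try) or as two (carp, entry).

```rust
use std::collections::HashSet;

/// Count the ways `s` can be split into exactly `k` words from `list`.
fn count_segmentations(s: &str, list: &HashSet<&str>, k: usize) -> usize {
    let n = s.len();
    // ways[i][j] = number of ways to split the first i bytes into j words.
    let mut ways = vec![vec![0usize; k + 1]; n + 1];
    ways[0][0] = 1;

    for end in 1..=n {
        for start in 0..end {
            // `get` returns None when a range does not fall on character
            // boundaries, so non-ASCII input cannot cause a panic.
            if let Some(piece) = s.get(start..end) {
                if list.contains(piece) {
                    for j in 1..=k {
                        let prev = ways[start][j - 1];
                        ways[end][j] += prev;
                    }
                }
            }
        }
    }
    ways[n][k]
}

fn main() {
    let list: HashSet<&str> = ["car", "pen", "try", "carp", "entry"]
        .iter()
        .copied()
        .collect();

    let phrase = "carpentry";
    println!("as 3 words: {}", count_segmentations(phrase, &list, 3));
    println!("as 2 words: {}", count_segmentations(phrase, &list, 2));
}
```

A three-word phrase that also has at least one two-word segmentation is a three-word compounding in the sense described above. Checking every triple this way would be far slower than the pairwise checks, so this is only meant to show what the missing check would look for.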
Also, this script currently runs really slowly on lists with a lot of overlaps (problematic or not). Using threads in Rust would help, but I'm sure there's a more efficient way to check for problematic overlaps.
- Use multiple threads to speed up the process.
- Make the command line text output during the process cleaner and more professional-looking.
- Make the Rust code simpler and/or more idiomatic.
1a. Given a word list that is not compound-safe, how would you calculate the probability of a compounding (generating a non-safe pair in a passphrase)?
1b. Given this probability, does it make sense, or is it useful, to calculate a revised bits-per-word measure of the list? (For the record I think this would be harmful, but I pose it here for inspiration.)