Inferring password composition policies from breached user credential databases.
As security researchers, we sometimes need to work out the password composition policy under which a publicly available breached user credential database was created. This tool assists with this, even when the data is "contaminated" with passwords that do not comply with the policy.
This library requires you to have the following software installed:
- Python 3.7.2 or later
- Pandas for loading large CSV files
- Matplotlib for plotting figures
Both Pandas and Matplotlib can be installed using pip:
pip install pandas
pip install matplotlib
Using the utility is a two-step process. The first thing you'll need is a plaintext password database (try SecLists for these), which you'll need to format as a CSV file like so:
password, frequency
"123456", 290729
"12345", 79076
"123456789", 76789
"password", 59462
"iloveyou", 49952
"princess", 33291
"1234567", 21725
"rockyou", 20901
"12345678", 20553
...
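If your raw data is just a list of passwords (one per user), a minimal sketch like the following can produce a file in the layout shown above. The function name `write_frequency_csv` is hypothetical and not part of this repository, and doubling embedded double quotes is an assumption based on the usual CSV convention rather than something extractfeatures.py is confirmed to expect.

```python
import io
from collections import Counter


def write_frequency_csv(passwords, out):
    """Write a 'password, frequency' CSV, most common password first."""
    counts = Counter(passwords)
    out.write("password, frequency\n")
    for password, frequency in counts.most_common():
        # Assumption: embedded double quotes are escaped by doubling them.
        escaped = password.replace('"', '""')
        out.write(f'"{escaped}", {frequency}\n')


# Made-up sample data, purely for illustration.
sample = ["123456", "password", "123456", "iloveyou", "123456", "password"]
buffer = io.StringIO()
write_frequency_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

For a real dump you would pass an open file handle instead of the in-memory buffer.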
Now, you'll be able to pass this file to /src/extractfeatures.py to generate a JSON file containing features of the database. For convenience, I've included some of these files under /features to save you doing this part yourself:
- 000webhost.json is from the 000webhost breach. This service apparently had a password composition policy in place mandating that passwords be at least 6 characters long with at least one letter and at least one number.
- linkedin.json is from the LinkedIn breach. Reported password composition policy is minimum length 6 with no other constraints.
- rockyou.json is from the RockYou breach. Reported password composition policy is minimum length 5 with no other constraints.
- xato.json is from the data dump compiled by Mark Burnett, sampled randomly from several breaches. Because this is a compound dataset, passwords here are likely to have been created under multiple different policies (or no policy at all).
- yahoo.json is from the Yahoo Voice breach. Reported password composition policy is minimum length 6 with no other constraints.
Some feature files created from synthetic datasets are also included. These are:
- linkedin-2class8-errors.json is the LinkedIn dataset (see linkedin.json) filtered according to a 2class8 policy (two character classes from lowercase, uppercase, digits and symbols, length at least 8), then run through introduceerrors.py, which simulates common data formatting errors by splitting passwords along potentially problematic tokens (,).
- linkedin-2word12-padded.json is as above, but filtered according to a 2word12 policy (at least two letter sequences separated by non-letter sequences, length at least 12) and padded with the singles.org, elitehacker, hak5 and faithwriters datasets using combine.py. This is designed to simulate intentional padding of a dataset with smaller ones in order to increase its resale value.
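To make the two synthetic policies concrete, here is a rough sketch of how compliance with 2class8 and 2word12 could be checked. The function names are hypothetical, and treating every non-ASCII-alphanumeric character as a symbol is my own assumption; the filtering scripts in this repository may well do it differently.

```python
import re
import string


def character_classes(password):
    """Count how many of the four classes appear (anything else is a symbol)."""
    has_lower = any(c in string.ascii_lowercase for c in password)
    has_upper = any(c in string.ascii_uppercase for c in password)
    has_digit = any(c in string.digits for c in password)
    has_symbol = any(c not in string.ascii_letters + string.digits for c in password)
    return sum([has_lower, has_upper, has_digit, has_symbol])


def complies_2class8(password):
    """At least two character classes, length at least 8."""
    return len(password) >= 8 and character_classes(password) >= 2


def complies_2word12(password):
    """At least two letter sequences split by non-letters, length at least 12."""
    words = re.findall(r"[A-Za-z]+", password)
    return len(password) >= 12 and len(words) >= 2


print(complies_2class8("password1"))      # lowercase + digit, 9 characters
print(complies_2word12("correct-horse"))  # two words, 13 characters
```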
Here's what these files look like:
{
"lengths": {
"1": 314,
"2": 1042,
"3": 6725,
// ...
},
"lowerCounts": {
"0": 6329765,
"1": 333254,
"2": 449242,
"3": 852241,
// ...
},
"upperCounts": {
"0": 30653712,
"1": 668835,
"2": 162895,
"3": 89374,
// ...
},
// ...
}
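Each top-level key maps a feature value (stored as a string) to the number of passwords with that value. As a small illustration of working with that shape, the sketch below computes the share of passwords that fall below a claimed minimum, which is one way to gauge how "contaminated" a dataset is. The helper `share_below` is hypothetical, and the counts are made up rather than taken from any bundled file.

```python
def share_below(histogram, minimum):
    """Fraction of passwords whose feature value is below `minimum`."""
    total = sum(histogram.values())
    below = sum(count for value, count in histogram.items() if int(value) < minimum)
    return below / total if total else 0.0


# Made-up toy counts in the same shape as a "lengths" entry.
toy_lengths = {"4": 10, "5": 40, "6": 600, "7": 250, "8": 100}
non_compliant = share_below(toy_lengths, 6)
print(f"{non_compliant:.1%} of passwords are shorter than 6 characters")
```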
Here's how you generate one for rockyou.csv, for example (the CSV file is way too big to include here; check out SecLists for the raw data):
python ./src/extractfeatures.py rockyou.csv > rockyou.json
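If you're curious what this kind of feature extraction boils down to, here is a simplified sketch. It is not the code in extractfeatures.py: the function `extract_features` is hypothetical, and weighting each password by its frequency is an assumption on my part. Only the key names (lengths, lowerCounts, upperCounts, digitCounts) are taken from the feature files themselves.

```python
import json
from collections import Counter


def extract_features(rows):
    """Build frequency-weighted histograms from (password, frequency) pairs."""
    features = {
        "lengths": Counter(),
        "lowerCounts": Counter(),
        "upperCounts": Counter(),
        "digitCounts": Counter(),
    }
    for password, frequency in rows:
        # Keys are strings so the result serializes like the files above.
        features["lengths"][str(len(password))] += frequency
        features["lowerCounts"][str(sum(c.islower() for c in password))] += frequency
        features["upperCounts"][str(sum(c.isupper() for c in password))] += frequency
        features["digitCounts"][str(sum(c.isdigit() for c in password))] += frequency
    return {name: dict(histogram) for name, histogram in features.items()}


# Made-up rows, purely for illustration.
rows = [("123456", 3), ("password", 2), ("Passw0rd", 1)]
features = extract_features(rows)
print(json.dumps(features, indent=2))
```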
Now for the interesting bit: using src/polinfer.py to infer password composition policy rules. First, let's determine that most of the passwords in the set described by rockyou.json were created under a policy enforcing a minimum length constraint of 5:
python ./src/polinfer.py -k lengths ./features/rockyou.json
# > Lower constraint on lengths inferred as 5.
Nice. This is backed up by existing literature (see, for example, the work by Golla and Dürmuth).
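To give a feel for why this is possible even with contaminated data, here is a deliberately naive stand-in heuristic. It is not the algorithm polinfer.py uses. It looks for the feature value whose count is largest relative to everything below it, and reports a constraint only if that jump is big enough. The function name, the `min_ratio` threshold of 10 and the toy counts are all assumptions made up for this example.

```python
def infer_lower_constraint(histogram, min_ratio=10.0):
    """Return the value with the sharpest jump in count, or None if no jump qualifies."""
    counts = sorted((int(value), count) for value, count in histogram.items())
    best_value, best_ratio = None, min_ratio
    below = 0  # running total of passwords with a smaller feature value
    for value, count in counts:
        if below > 0:
            ratio = count / below
            if ratio >= best_ratio:
                best_value, best_ratio = value, ratio
        below += count
    return best_value


# Made-up counts: a few short "contaminating" passwords, then a sharp rise at 5.
toy_lengths = {"1": 3, "2": 10, "3": 67, "4": 180, "5": 259000,
               "6": 1940000, "7": 2500000, "8": 2900000}
# Made-up counts: most passwords have no digits, so there is no jump anywhere.
toy_digit_counts = {"0": 7000000, "1": 2400000, "2": 2600000, "3": 900000}

print(infer_lower_constraint(toy_lengths))
print(infer_lower_constraint(toy_digit_counts))
```

Note that this toy version needs at least some non-compliant passwords to detect a jump; on perfectly clean data the smallest observed value is already the answer.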
Now, let's check for a minimum number of digits:
python ./src/polinfer.py -k digitCounts ./features/rockyou.json
# > Lower constraint on digitCounts unlikely to be present in policy.
This gives us the correct answer: RockYou did not mandate a minimum number of digits in passwords.
We are similarly able to infer the policy in place for 000webhost (minimum length 6, at least 1 number):
python ./src/polinfer.py -k lengths ./features/000webhost.json
# > Lower constraint on lengths inferred as 6.
python ./src/polinfer.py -k digitCounts -l 0 ./features/000webhost.json
# > Lower constraint on digitCounts inferred as 1.
You can get a better idea of the command-line arguments you can pass to each utility using the -h help flag:
python ./src/extractfeatures.py -h
# > Help information...
python ./src/polinfer.py -h
# > Help information...
The utility can also generate some interesting figures using Matplotlib (examples are included under /docs/figures). For instance, 000webhost_lengthsAccum.svg was generated like this:
python ./src/polinfer.py -t '000webhost: $mult(l)$ for $l=1$ to $l=20$' -x 'Length ($l$)' -y '$mult(l)$' -o ./docs/figures/000webhost_lengthsAccum.svg -s ./features/000webhost.json
I wish to thank the following parties for their contribution to this project:
- The font used in the logo is Monofur by Tobias Benjamin Köhler.
- The Tango Icon Library (used in the logo) is an excellent free icon pack that I recommend checking out.