<a href="https://colab.research.google.com/github/sgzmd/password-estimator/blob/main/ExistingEstimators.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Existing Estimators Research

## Executive summary

The rule-based password strength estimators evaluated here, including state-of-the-art ones such as `zxcvbn`, suffer from a number of issues, most notably miscategorising strong passwords as weak. The best ones detect weak passwords very reliably, but at the price of more misclassifications at the strong end.

In short, no single estimator stands out as perfect, and most struggle with passwords that are in fact secure.

## Overview

This notebook evaluates a number of existing password estimators, with the following goals:

1. Establish a baseline of how well they work against a pre-defined dataset
2. Understand how well they can handle the corner cases (i.e. passwords that appear to be secure, but aren't, and the other way around)

We start with four reasonably popular libraries, chosen arbitrarily.

In [None]:
# Let's start by installing the required libraries
!pip install zxcvbn password_strength passwordmeter password-validator

# We'll also need the following libraries for the actual evaluation
# (tabulate is required by pandas' DataFrame.to_markdown)
!pip install datasets scikit-learn ipywidgets numpy pandas tqdm tabulate

We will now write a function that checks password strength using any of the four libraries.

In [None]:
def check_password(password, method="zxcvbn"):
    """
    Check the strength of a password using one of several popular Python libraries.

    The following methods are available:
      - "zxcvbn": Uses the zxcvbn-python library. Secure if score >= 3.
      - "password_validator": Uses the password-validator package with a defined schema.
      - "password_strength": Uses the password_strength package.
                               Secure if strength() > 0.66.
      - "passwordmeter": Uses the passwordmeter package. Secure if score > 0.5.

    Args:
        password (str): The password to be evaluated.
        method (str): Which method to use for estimation.

    Returns:
        int: 1 if the password is considered secure, 0 otherwise.
    """

    if password is None:
        return 0

    if method.lower() == "zxcvbn":
        try:
            from zxcvbn import zxcvbn
            result = zxcvbn(password)
            # zxcvbn scores range from 0 (weak) to 4 (strong)
            return 1 if result.get("score", 0) >= 3 else 0
        except ImportError:
            print("Error: zxcvbn library is not installed.")
            return 0

    elif method.lower() == "password_validator":
        try:
            from password_validator import PasswordValidator
            # Define a schema: minimum 8 characters, maximum 100, at least one uppercase, one lowercase, and one digit.
            schema = PasswordValidator()
            schema.min(8).max(100).has().uppercase().has().lowercase().has().digits()
            is_valid = schema.validate(password)
            return 1 if is_valid else 0
        except ImportError:
            print("Error: password-validator module is not installed.")
            return 0
        except Exception as e:
            print(f"password_validator check failed: {e}")
            return 0

    elif method.lower() == "password_strength":
        try:
            from password_strength import PasswordStats
            stats = PasswordStats(password)

            # Documentation suggests 0.66 as a good magic number
            return 1 if stats.strength() > 0.66 else 0
        except ImportError:
            print("Error: password_strength module is not installed.")
            return 0
        except Exception as e:
            print(f"password_strength check failed: {e}")
            return 0

    elif method.lower() == "passwordmeter":
        try:
            import passwordmeter
            score, improvements = passwordmeter.test(password)
            # Assume a score greater than 0.5 indicates a secure password.
            return 1 if score > 0.5 else 0
        except ImportError:
            print("Error: passwordmeter module is not installed.")
            return 0
        except Exception as e:
            print(f"passwordmeter check failed: {e}")
            return 0

    else:
        raise ValueError("Unknown method specified. Choose one of: zxcvbn, password_validator, password_strength, passwordmeter.")


# Example usage:
test_password = "ExamplePass123!"
VALIDATION_METHODS = ["zxcvbn", "password_validator", "password_strength", "passwordmeter"]

for m in VALIDATION_METHODS:
    result = check_password(test_password, method=m)
    print(f"Method '{m}': {'Secure' if result == 1 else 'Not secure'}")

Method 'zxcvbn': Secure
Method 'password_validator': Secure
Method 'password_strength': Secure
Method 'passwordmeter': Secure
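
Each estimator is reduced to a binary verdict by a fixed cut-off (`score >= 3`, `strength() > 0.66`, `score > 0.5`), so the results further down depend on those cut-offs. The sketch below shows how one might check the sensitivity of precision and recall to the threshold. The scores and labels are made-up toy values, and `binarize` and `precision_recall` are hypothetical helpers, not part of any of the libraries.

```python
def binarize(scores, threshold):
    """Map continuous scores to 1 (secure) / 0 (not secure) with a strict cut-off."""
    return [1 if score > threshold else 0 for score in scores]


def precision_recall(labels, preds):
    """Precision and recall for the positive ("secure") class."""
    tp = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(labels, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (assumption): six passwords with invented scores and true labels.
toy_scores = [0.10, 0.35, 0.55, 0.60, 0.70, 0.90]
toy_labels = [0, 0, 0, 1, 1, 1]

for threshold in (0.5, 0.66):
    precision, recall = precision_recall(toy_labels, binarize(toy_scores, threshold))
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, raising the threshold trades recall for precision, which is the same trade-off the real estimators show below.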


We can now load a dataset which we'll be using for evaluation. We'll use `InfinitodeLTD/PWLDS` which is a very large synthetic dataset created for this very purpose.

In [None]:
import random
from datasets import load_dataset

ds = load_dataset("InfinitodeLTD/PWLDS", split="train")

# Let's only take a subset of the data for faster processing
# Value 2 corresponds to neither secure nor insecure passwords, so we'll exclude it
my_filter = lambda x: random.random() < 0.01 and x["Strength_Level"] != 2

ds = ds.filter(my_filter)

Filter:   0%|          | 0/10000192 [00:00<?, ? examples/s]
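
Note that the filter above draws from the global, unseeded random generator, so each run selects a different subset. A minimal sketch of a reproducible variant follows; `make_sampler` is a hypothetical helper, and the toy rows merely stand in for the real dataset.

```python
import random


def make_sampler(rate, seed, excluded_level=2):
    """Return a predicate that keeps roughly `rate` of the eligible rows."""
    rng = random.Random(seed)  # private generator, independent of global state

    def keep(example):
        if example["Strength_Level"] == excluded_level:
            return False
        return rng.random() < rate

    return keep


# Toy rows (assumption) mimicking the dataset's two columns.
toy_rows = [{"Password": f"pw{i}", "Strength_Level": i % 5} for i in range(1000)]

first = list(filter(make_sampler(0.1, seed=42), toy_rows))
second = list(filter(make_sampler(0.1, seed=42), toy_rows))
print(len(first), first == second)
```

The predicate could be passed to `ds.filter(...)` in the same way as `my_filter`; it stays reproducible only as long as rows are visited in the same order in a single process. Another option would be `ds.shuffle(seed=...)` followed by `select(range(n))`.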

Let's jot down some general information about the dataset:

In [None]:
print("General Dataset Information:")
print("----------------------------")
print("Total examples:", ds.num_rows)
print("Columns:", ds.column_names)
print("Features:", ds.features)
print("\nSample entries:")
for i in range(min(5, ds.num_rows)):
    print(ds[i])

General Dataset Information:
----------------------------
Total examples: 80140
Columns: ['Password', 'Strength_Level']
Features: {'Password': Value(dtype='string', id=None), 'Strength_Level': Value(dtype='int64', id=None)}

Sample entries:
{'Password': 'qmpo', 'Strength_Level': 0}
{'Password': 'eyju', 'Strength_Level': 0}
{'Password': 'ssgsh', 'Strength_Level': 0}
{'Password': 'arrisc', 'Strength_Level': 0}
{'Password': '5ukjv', 'Strength_Level': 0}
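
Since accuracy is only meaningful alongside the class balance, it is worth counting how many examples fall into each strength level and each binary class. The sketch below does this on a handful of toy rows (an assumption, not real dataset entries), using the same level-to-label mapping as the evaluation cell further down.

```python
from collections import Counter


def map_label(strength):
    """Levels 3 and 4 count as secure (1); the remaining levels as insecure (0)."""
    return 1 if strength in [3, 4] else 0


# Toy rows (assumption); level 2 is absent because it was filtered out earlier.
toy_rows = [
    {"Password": "aaaa", "Strength_Level": 0},
    {"Password": "bbbbbb", "Strength_Level": 1},
    {"Password": "Cc3$cccc", "Strength_Level": 3},
    {"Password": "Dd4$dddddd", "Strength_Level": 4},
    {"Password": "eeee", "Strength_Level": 0},
]

level_counts = Counter(row["Strength_Level"] for row in toy_rows)
class_counts = Counter(map_label(row["Strength_Level"]) for row in toy_rows)
print("Per level:", dict(level_counts))
print("Per class:", dict(class_counts))
```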


Cool, let's run this dataset through our estimators:

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
from tqdm.notebook import tqdm
import pandas as pd
from IPython.display import display, Markdown

def map_label(strength):
    return 1 if strength in [3, 4] else 0

# Define true_labels based on your dataset's 'Strength_Level'
true_labels = [map_label(sl) for sl in ds["Strength_Level"]]

for method in VALIDATION_METHODS:
    predictions = [
        check_password(password, method=method)
        for password in tqdm(ds["Password"], desc=f"Processing with {method}")
    ]

    # Compute confusion matrix and classification report
    cm = confusion_matrix(true_labels, predictions)
    cr = classification_report(true_labels, predictions, output_dict=True)

    # Format Markdown output
    md_output = f"## Evaluating Method: **{method}**\n\n"

    # Confusion Matrix
    cm_df = pd.DataFrame(
        cm, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"]
    )
    md_output += "**Confusion Matrix:**\n\n"
    md_output += cm_df.to_markdown() + "\n\n"

    # Classification Report
    cr_df = pd.DataFrame(cr).transpose().round(2)
    cr_df.index = cr_df.index.map(lambda x: {"0": "Class 0", "1": "Class 1"}.get(x, x.title()))
    md_output += "**Classification Report:**\n\n"
    md_output += cr_df.to_markdown() + "\n\n"

    # Identify mispredictions (up to 5 FPs and 10 FNs)
    mispred_indices = [i for i, (t, p) in enumerate(zip(true_labels, predictions)) if t != p]
    fp_indices = [i for i in mispred_indices if true_labels[i] == 0 and predictions[i] == 1][:5]
    fn_indices = [i for i in mispred_indices if true_labels[i] == 1 and predictions[i] == 0][:10]

    md_output += "**Mispredictions (up to 5 FPs and 10 FNs):**\n\n"

    if not (fp_indices or fn_indices):
        md_output += "No mispredictions found!\n"
    else:
        mispred_list = []
        for idx in fp_indices:
            mispred_list.append({
                "Type": "False Positive",
                "Password": ds['Password'][idx],
                "True Label": true_labels[idx],
                "Prediction": predictions[idx]
            })
        for idx in fn_indices:
            mispred_list.append({
                "Type": "False Negative",
                "Password": ds['Password'][idx],
                "True Label": true_labels[idx],
                "Prediction": predictions[idx]
            })

        mispred_df = pd.DataFrame(mispred_list)
        md_output += mispred_df.to_markdown(index=False) + "\n"
    # Display formatted markdown
    display(Markdown(md_output))

## Evaluating Method: **zxcvbn**

**Confusion Matrix:**

|          |   Predicted 0 |   Predicted 1 |
|:---------|--------------:|--------------:|
| Actual 0 |         24101 |         15887 |
| Actual 1 |          1732 |         38420 |

**Classification Report:**

|              |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| Class 0      |        0.93 |     0.6  |       0.73 |  39988    |
| Class 1      |        0.71 |     0.96 |       0.81 |  40152    |
| Accuracy     |        0.78 |     0.78 |       0.78 |      0.78 |
| Macro Avg    |        0.82 |     0.78 |       0.77 |  80140    |
| Weighted Avg |        0.82 |     0.78 |       0.77 |  80140    |

**Mispredictions (up to 5 FPs and 10 FNs):**

| Type           | Password     |   True Label |   Prediction |
|:---------------|:-------------|-------------:|-------------:|
| False Positive | 1111acinosE  |            0 |            1 |
| False Positive | adenomacvbn  |            0 |            1 |
| False Positive | acrinyllkjh  |            0 |            1 |
| False Positive | acantha9asdf |            0 |            1 |
| False Positive | rrrradelopod |            0 |            1 |
| False Negative | acceptedM\|  |            1 |            0 |
| False Negative | absolve=BR   |            1 |            0 |
| False Negative | abroadkR>P   |            1 |            0 |
| False Negative | abolish5*T   |            1 |            0 |
| False Negative | absorberaz   |            1 |            0 |
| False Negative | absinthejM   |            1 |            0 |
| False Negative | Academicr-   |            1 |            0 |
| False Negative | Aaronitic*   |            1 |            0 |
| False Negative | ablativek}   |            1 |            0 |
| False Negative | absenter(S   |            1 |            0 |
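
To make the relationship between the two tables explicit, the classification report can be recomputed by hand from the confusion matrix. The following sketch uses the zxcvbn counts shown above; `report_from_confusion` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def report_from_confusion(tn, fp, fn, tp):
    """Derive per-class precision/recall and accuracy from a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "precision_0": tn / (tn + fn),  # of everything predicted insecure, how much is insecure
        "recall_0": tn / (tn + fp),     # of all insecure passwords, how many were caught
        "precision_1": tp / (tp + fp),  # of everything predicted secure, how much is secure
        "recall_1": tp / (tp + fn),     # of all secure passwords, how many were recognised
        "accuracy": (tn + tp) / total,
    }


# Counts taken from the zxcvbn confusion matrix above.
zxcvbn_report = report_from_confusion(tn=24101, fp=15887, fn=1732, tp=38420)
for name, value in zxcvbn_report.items():
    print(f"{name}: {value:.2f}")
```

The rounded values match the report: 0.93 / 0.60 for class 0, 0.71 / 0.96 for class 1, and 0.78 accuracy.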


## Evaluating Method: **password_validator**

**Confusion Matrix:**

|          |   Predicted 0 |   Predicted 1 |
|:---------|--------------:|--------------:|
| Actual 0 |         34724 |          5264 |
| Actual 1 |         14916 |         25236 |

**Classification Report:**

|              |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| Class 0      |        0.7  |     0.87 |       0.77 |  39988    |
| Class 1      |        0.83 |     0.63 |       0.71 |  40152    |
| Accuracy     |        0.75 |     0.75 |       0.75 |      0.75 |
| Macro Avg    |        0.76 |     0.75 |       0.74 |  80140    |
| Weighted Avg |        0.76 |     0.75 |       0.74 |  80140    |

**Mispredictions (up to 5 FPs and 10 FNs):**

| Type           | Password      |   True Label |   Prediction |
|:---------------|:--------------|-------------:|-------------:|
| False Positive | 1111acinosE   |            0 |            1 |
| False Positive | aBlation0000  |            0 |            1 |
| False Positive | acroMial3     |            0 |            1 |
| False Positive | aaaa7aeFauld  |            0 |            1 |
| False Positive | abysmAl6666   |            0 |            1 |
| False Negative | acceptedM\|   |            1 |            0 |
| False Negative | aching_fSS    |            1 |            0 |
| False Negative | aclinal\|=x$  |            1 |            0 |
| False Negative | CR-\|r*qzNS   |            1 |            0 |
| False Negative | absconsaW#    |            1 |            0 |
| False Negative | achingZu)C    |            1 |            0 |
| False Negative | [gEmaccentor  |            1 |            0 |
| False Negative | AcemeticixR   |            1 |            0 |
| False Negative | v\|hdabsconder |            1 |            0 |
| False Negative | Aon<@hok;o    |            1 |            0 |
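
The false negatives above can be explained by checking each password against the individual schema rules. The sketch below re-implements the rules from the schema comment (8 to 100 characters, at least one uppercase letter, one lowercase letter and one digit) in plain Python. It approximates the schema for illustration and is not the `password-validator` library's own code; `RULES` and `failed_rules` are hypothetical names.

```python
# Plain-Python approximation (assumption) of the schema used in check_password.
RULES = {
    "min_length_8": lambda pw: len(pw) >= 8,
    "max_length_100": lambda pw: len(pw) <= 100,
    "uppercase": lambda pw: any(c.isupper() for c in pw),
    "lowercase": lambda pw: any(c.islower() for c in pw),
    "digits": lambda pw: any(c.isdigit() for c in pw),
}


def failed_rules(password):
    """Return the names of the schema rules the password does not satisfy."""
    return [name for name, check in RULES.items() if not check(password)]


# Passwords taken from the misprediction table above.
for pw in ["aching_fSS", "Aon<@hok;o", "v|hdabsconder", "1111acinosE"]:
    print(f"{pw!r}: failed {failed_rules(pw)}")
```

Under this approximation the listed false negatives all fail for lack of a digit, while a false positive such as `1111acinosE` passes every rule despite its repeated digits.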


## Evaluating Method: **password_strength**

**Confusion Matrix:**

|          |   Predicted 0 |   Predicted 1 |
|:---------|--------------:|--------------:|
| Actual 0 |         39929 |            59 |
| Actual 1 |         20037 |         20115 |

**Classification Report:**

|              |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| Class 0      |        0.67 |     1    |       0.8  |  39988    |
| Class 1      |        1    |     0.5  |       0.67 |  40152    |
| Accuracy     |        0.75 |     0.75 |       0.75 |      0.75 |
| Macro Avg    |        0.83 |     0.75 |       0.73 |  80140    |
| Weighted Avg |        0.83 |     0.75 |       0.73 |  80140    |

**Mispredictions (up to 5 FPs and 10 FNs):**

| Type           | Password        |   True Label |   Prediction |
|:---------------|:----------------|-------------:|-------------:|
| False Positive | adenosis9qwerty |            0 |            1 |
| False Positive | qwertyadamsite2 |            0 |            1 |
| False Positive | accolent8qwerty |            0 |            1 |
| False Positive | qwerty4absciSse |            0 |            1 |
| False Positive | qwertyabsolver0 |            0 |            1 |
| False Negative | Abelite/57      |            1 |            0 |
| False Negative | acceptedM\|     |            1 |            0 |
| False Negative | aching_fSS      |            1 |            0 |
| False Negative | Ed5#\|>X9M^(    |            1 |            0 |
| False Negative | aclinal\|=x$    |            1 |            0 |
| False Negative | \|HV??_<efO3    |            1 |            0 |
| False Negative | t;7XE4oPE@xV0   |            1 |            0 |
| False Negative | CR-\|r*qzNS     |            1 |            0 |
| False Negative | absconsaW#      |            1 |            0 |
| False Negative | achingZu)C      |            1 |            0 |


## Evaluating Method: **passwordmeter**

**Confusion Matrix:**

|          |   Predicted 0 |   Predicted 1 |
|:---------|--------------:|--------------:|
| Actual 0 |         39988 |             0 |
| Actual 1 |         16362 |         23790 |

**Classification Report:**

|              |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| Class 0      |        0.71 |     1    |       0.83 |   39988   |
| Class 1      |        1    |     0.59 |       0.74 |   40152   |
| Accuracy     |        0.8  |     0.8  |       0.8  |       0.8 |
| Macro Avg    |        0.85 |     0.8  |       0.79 |   80140   |
| Weighted Avg |        0.86 |     0.8  |       0.79 |   80140   |

**Mispredictions (up to 5 FPs and 10 FNs):**

| Type           | Password      |   True Label |   Prediction |
|:---------------|:--------------|-------------:|-------------:|
| False Negative | acceptedM\|   |            1 |            0 |
| False Negative | aching_fSS    |            1 |            0 |
| False Negative | aclinal\|=x$  |            1 |            0 |
| False Negative | CR-\|r*qzNS   |            1 |            0 |
| False Negative | absconsaW#    |            1 |            0 |
| False Negative | achingZu)C    |            1 |            0 |
| False Negative | [gEmaccentor  |            1 |            0 |
| False Negative | AcemeticixR   |            1 |            0 |
| False Negative | v\|hdabsconder |            1 |            0 |
| False Negative | Aon<@hok;o    |            1 |            0 |


# Password Complexity Estimators Evaluation Summary

## 1. zxcvbn

**Accuracy:** 78%

**Strengths:**
- High recall (96%) for detecting secure passwords.
- Excellent precision (93%) for identifying insecure passwords.

**Weaknesses:**
- Low recall (60%) for insecure passwords: 15,887 weak passwords were rated secure (false positives).
- Tends to misclassify passwords that attach a keyboard walk or repeated characters to a word (e.g., `adenomacvbn`, `acantha9asdf`, `rrrradelopod`) as secure.
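
The false positives listed for zxcvbn share two simple patterns: a run of repeated characters or a short keyboard walk attached to a word. The sketch below flags both with a naive check. It is an illustration of the patterns only, under the assumption of a QWERTY layout and a minimum pattern length of 4; it is not how zxcvbn itself performs matching, and `keyboard_walks` and `repeated_runs` are hypothetical helpers.

```python
import re

# QWERTY rows (assumption); walks are matched forwards and backwards.
KEYBOARD_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm", "1234567890")


def keyboard_walks(password, min_len=4):
    """Return keyboard-row fragments of length `min_len` found in the password."""
    lowered = password.lower()
    found = []
    for row in KEYBOARD_ROWS:
        for line in (row, row[::-1]):
            for start in range(len(line) - min_len + 1):
                chunk = line[start:start + min_len]
                if chunk in lowered and chunk not in found:
                    found.append(chunk)
    return found


def repeated_runs(password, min_len=4):
    """Return runs of the same character repeated at least `min_len` times."""
    pattern = r"(.)\1{%d,}" % (min_len - 1)
    return [match.group(0) for match in re.finditer(pattern, password)]


# False positives from the zxcvbn misprediction table above.
for pw in ["1111acinosE", "adenomacvbn", "acrinyllkjh", "acantha9asdf", "rrrradelopod"]:
    print(f"{pw}: walks={keyboard_walks(pw)}, repeats={repeated_runs(pw)}")
```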

## 2. password_validator

**Accuracy:** 75%

**Strengths:**
- High recall (87%) for insecure passwords.
- Good precision (83%) when identifying strong passwords.

**Weaknesses:**
- Low recall (63%) for secure passwords, leading to high false negatives.
- Rejects any password that lacks one of the character classes required by the schema, regardless of its other properties (e.g., `aching_fSS` and `aclinal|=x$` contain no digit).

## 3. password_strength

**Accuracy:** 75%

**Strengths:**
- Near-perfect recall (99.9%) for insecure passwords: only 59 of 39,988 were rated secure.
- Near-perfect precision (99.7%) for secure passwords.

**Weaknesses:**
- Poor recall (50%) for secure passwords, causing numerous false negatives.
- Often misclassifies complex passwords with special characters (`Abelite/57`, `t;7XE4oPE@xV0`) as weak.

## 4. passwordmeter

**Accuracy:** 80%

**Strengths:**
- Perfect precision (100%) in identifying secure passwords; no false positives.
- Perfect recall (100%) for insecure passwords.

**Weaknesses:**
- Moderate recall (59%) for secure passwords, frequently missing unconventional but secure passwords (`AcemeticixR`, `Aon<@hok;o`).

---

## Overall Analysis and Recommendations

**Common Strengths:**
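- Three of the four estimators (`password_validator`, `password_strength`, `passwordmeter`) identify weak passwords reliably, with recall of 87% to 100% on the insecure class.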

**Common Weaknesses:**
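- The same three estimators struggle to recognise strong passwords containing special characters or unusual patterns, missing 37% to 50% of them (false negatives). `zxcvbn` shows the opposite trade-off: it recognises 96% of strong passwords but rates 40% of weak ones as secure.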
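
The trade-off between the two kinds of error can be put side by side by computing, for each estimator, the share of weak passwords rated secure and the share of secure passwords rated weak. The sketch below uses the confusion-matrix counts reported above; `error_rates` is a hypothetical helper.

```python
# (tn, fp, fn, tp) taken from the confusion matrices above.
CONFUSION = {
    "zxcvbn": (24101, 15887, 1732, 38420),
    "password_validator": (34724, 5264, 14916, 25236),
    "password_strength": (39929, 59, 20037, 20115),
    "passwordmeter": (39988, 0, 16362, 23790),
}


def error_rates(tn, fp, fn, tp):
    """Return (false positive rate, false negative rate)."""
    return fp / (fp + tn), fn / (fn + tp)


for name, cells in CONFUSION.items():
    fpr, fnr = error_rates(*cells)
    print(f"{name:<20} weak rated secure: {fpr:6.1%}   secure rated weak: {fnr:6.1%}")
```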

In short, no single estimator stands out as perfect, and most struggle with passwords that are in fact secure.