# Password Strength - ML

`Author:` [Syed Muhammad Ebad](https://www.kaggle.com/syedmuhammadebad)\
`Date:` 18-Sept-2024\
[Send me an email](mailto:mohammadebad1@hotmail.com)\
[Visit my GitHub profile](https://github.com/smebad)

[Dataset used in this notebook](https://www.kaggle.com/datasets/bhavikbb/password-strength-classifier-dataset)

## Introduction
In this notebook, we explore a machine learning approach to classify the strength of passwords based on their composition. We use a RandomForestClassifier combined with TF-IDF vectorization to predict whether a password is "Weak", "Medium", or "Strong".


## Dataset
The dataset used in this project contains password-strength pairs, with the strength labeled as 0 (Weak), 1 (Medium), or 2 (Strong). Here's a preview of the dataset:

In [1]:
# Importing Libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import getpass
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Load the dataset, skipping malformed lines
df = pd.read_csv('data.csv', on_bad_lines='skip')
print(df.head(10))

           password  strength
0          kzde5577         1
1          kino3434         1
2         visi7k1yr         1
3          megzy123         1
4       lamborghin1         1
5  AVYq1lDE4MgAZfNt         2
6          u6c8vhow         1
7          v1118714         1
8      universe2908         1
9          as326159         1


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 669640 entries, 0 to 669639
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   password  669639 non-null  object
 1   strength  669640 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 10.2+ MB


## Data Exploration
We begin by examining the dataset to ensure it's clean and well-formatted. The `df.info()` output above shows one missing value in the `password` column, so we drop incomplete rows and map the numeric `strength` column to descriptive labels.

In [4]:
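# Drop the row with a missing password
df = df.dropna()
# Replace the numeric strength codes with descriptive labels
df["strength"] = df["strength"].map({0: "Weak Password", 1: "Medium Password", 2: "Strong Password"})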

print(df.head(10))

           password         strength
0          kzde5577  Medium Password
1          kino3434  Medium Password
2         visi7k1yr  Medium Password
3          megzy123  Medium Password
4       lamborghin1  Medium Password
5  AVYq1lDE4MgAZfNt  Strong Password
6          u6c8vhow  Medium Password
7          v1118714  Medium Password
8      universe2908  Medium Password
9          as326159  Medium Password
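
The notebook does not print per-column missing counts or class counts, so here is a minimal sketch of how both could be checked. The four-row frame is a hypothetical stand-in for the real dataset, and the names `demo`, `missing`, and `clean` are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (one password is missing)
demo = pd.DataFrame({
    "password": ["kzde5577", None, "AVYq1lDE4MgAZfNt", "abc"],
    "strength": [1, 1, 2, 0],
})

# Count missing values per column before dropping anything
missing = demo.isna().sum()
print(missing)

# Drop incomplete rows, then map numeric codes to descriptive labels
clean = demo.dropna().copy()
clean["strength"] = clean["strength"].map(
    {0: "Weak Password", 1: "Medium Password", 2: "Strong Password"}
)

# Class counts show how balanced the labels are
print(clean["strength"].value_counts())
```

On the real data, the same `value_counts()` call would show how many passwords fall into each class, which is useful context for the `class_weight` setting used later.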


## Data Preprocessing
### Tokenization:

A custom tokenizer is defined to convert each password into a list of characters, so that the TF-IDF vectorizer treats each character, rather than each whitespace-separated word, as a token.

In [5]:
def word(password):
  # Split a password string into a list of its individual characters
  character = []
  for i in password:
    character.append(i)
  return character
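
As a quick check of what the tokenizer produces, the sketch below repeats the `word` function from the cell above and compares it with Python's built-in `list`, which performs the same split on a string. The sample password is taken from the dataset preview.

```python
def word(password):
    """Split a password into a list of single characters."""
    character = []
    for i in password:
        character.append(i)
    return character

tokens = word("kzde5577")
print(tokens)  # ['k', 'z', 'd', 'e', '5', '5', '7', '7']

# list() performs the same character split on a string
assert tokens == list("kzde5577")
```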


## Vectorization
We use `TfidfVectorizer` with the custom tokenizer to convert each password into a sparse numeric vector of per-character TF-IDF weights that the machine learning model can consume.

In [None]:
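# Passwords are the inputs, strength labels are the targets
x = np.array(df["password"])
y = np.array(df["strength"])

# Build character-level TF-IDF features using the custom tokenizer
cv = TfidfVectorizer(tokenizer=word)
X = cv.fit_transform(x)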
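
To make the vectorization step concrete, the following sketch fits two vectorizers on three passwords from the dataset preview. The arguments `lowercase=False` and `token_pattern=None` are assumptions that the notebook itself does not use. `TfidfVectorizer` lowercases its input by default, so uppercase and lowercase letters share one feature unless `lowercase=False` is passed. Setting `token_pattern=None` only silences the warning about the unused default pattern. The name `char_tokenizer` is illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def char_tokenizer(password):
    """Split a password into single characters."""
    return list(password)

sample = ["kzde5577", "megzy123", "AVYq1lDE4MgAZfNt"]

# Default behavior: input is lowercased before tokenization
default_vec = TfidfVectorizer(tokenizer=char_tokenizer, token_pattern=None)
X_default = default_vec.fit_transform(sample)

# Assumed variant: keep case so 'A' and 'a' become separate features
cased_vec = TfidfVectorizer(
    tokenizer=char_tokenizer, token_pattern=None, lowercase=False
)
X_cased = cased_vec.fit_transform(sample)

# One row per password, one column per distinct character
print(X_default.shape, X_cased.shape)
print("A" in default_vec.vocabulary_, "A" in cased_vec.vocabulary_)
```

Whether preserving case helps on the full dataset is not something the notebook measures, so it would need to be checked against held-out accuracy.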

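## Train-Test Split
The data is split into training and testing sets so the model can be evaluated on passwords it has not seen. Here 5% of the rows are held out for testing, and `random_state=42` makes the split reproducible.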

In [6]:
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.05, random_state=42)
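
With imbalanced labels, a plain random split can leave the small test set with a class mix that differs from the full data. One option, not used in the notebook, is the `stratify` argument of `train_test_split`. The sketch below uses hypothetical labels (80 medium, 10 weak, 10 strong) to show that the test split keeps the same proportions. The names `X_demo`, `y_demo`, and `test_counts` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 medium, 10 weak, 10 strong
y_demo = np.array(["Medium"] * 80 + ["Weak"] * 10 + ["Strong"] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

labels, counts = np.unique(y_te, return_counts=True)
test_counts = dict(zip(labels.tolist(), counts.tolist()))
print(test_counts)
```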

## Model Training
A RandomForestClassifier is trained on the training data. We use 50 estimators and a maximum depth of 10, with `class_weight='balanced'` so that less frequent classes are weighted more heavily during training.

In [7]:
rfc = RandomForestClassifier(n_estimators=50, max_depth=10, n_jobs=-1, class_weight='balanced')
rfc.fit(xtrain, ytrain)
print("Training Accuracy: ", rfc.score(xtrain, ytrain))

Training Accuracy:  0.8332188437759861
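
The cell above reports accuracy on the training data only. A minimal sketch of also scoring the held-out split is shown below. It runs on hypothetical toy data: the random passwords and the length-based labels are invented for illustration and are not the dataset's labeling rule. `random_state=42` on the classifier and the use of `classification_report` are likewise assumptions, not part of the notebook.

```python
import random
import string

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

random.seed(0)
ALPHABET = string.ascii_letters + string.digits

def char_tokenizer(password):
    """Split a password into single characters."""
    return list(password)

# Hypothetical toy data: the length-based labels are invented for this sketch
passwords, labels = [], []
for _ in range(300):
    length = random.randint(4, 20)
    passwords.append("".join(random.choice(ALPHABET) for _ in range(length)))
    labels.append("Weak" if length < 8 else "Medium" if length < 14 else "Strong")

# Same order as the notebook: vectorize first, then split
vec = TfidfVectorizer(tokenizer=char_tokenizer, token_pattern=None)
X = vec.fit_transform(passwords)
xtrain, xtest, ytrain, ytest = train_test_split(
    X, labels, test_size=0.2, random_state=42
)

rfc = RandomForestClassifier(
    n_estimators=50, max_depth=10, class_weight="balanced", random_state=42
)
rfc.fit(xtrain, ytrain)

# Compare accuracy on seen and unseen rows
train_acc = rfc.score(xtrain, ytrain)
test_acc = rfc.score(xtest, ytest)
print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)

# Per-class precision and recall on the held-out rows
report = classification_report(ytest, rfc.predict(xtest), zero_division=0)
print(report)
```

In the notebook itself, the equivalent call would be `rfc.score(xtest, ytest)` on the split created earlier. A large gap between the two numbers would suggest overfitting.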


## Model Prediction
We test the model by predicting the strength of a password entered by the user. The input is read with `getpass`, so it is not echoed to the screen.

In [9]:
user = getpass.getpass("Enter Password: ")
data = cv.transform([user]).toarray()
output = rfc.predict(data)
print("Your Password is: ", output[0])

Enter Password: ··········
Your Password is:  Medium Password
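
For repeated or non-interactive use, the prediction step can be wrapped in a small helper. The sketch below is one way to do that, assuming a fitted vectorizer and model are passed in. The function name `predict_strength` and the six training rows are hypothetical, and the labels on those rows are invented for illustration. The sparse output of `transform` is passed straight to the model, since `RandomForestClassifier` accepts sparse input and the `.toarray()` call is optional.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

def char_tokenizer(password):
    """Split a password into single characters."""
    return list(password)

def predict_strength(password, vectorizer, model):
    """Return the predicted strength label for a single password."""
    features = vectorizer.transform([password])  # sparse 1 x n_features matrix
    return model.predict(features)[0]

# Hypothetical training rows with invented labels
train_passwords = [
    "abc", "12345", "kzde5577", "megzy123",
    "AVYq1lDE4MgAZfNt", "Xk9mPq2vLw8ZrT4b",
]
train_labels = [
    "Weak Password", "Weak Password", "Medium Password", "Medium Password",
    "Strong Password", "Strong Password",
]

vectorizer = TfidfVectorizer(tokenizer=char_tokenizer, token_pattern=None)
model = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)
model.fit(vectorizer.fit_transform(train_passwords), train_labels)

result = predict_strength("universe2908", vectorizer, model)
print("Your Password is:", result)
```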


## Summary
In this notebook, we implemented a password strength classification system using a RandomForestClassifier and TF-IDF vectorization. The model achieved an accuracy of approximately 83% on the training data; accuracy on the held-out test set is not reported here. The system classifies passwords into three categories: Weak, Medium, or Strong.