# Introduction
The motivation behind this notebook is to determine how strong a password is. Rather than looking only at entropy or at how complex a password is, we look at how password strength correlates with, and varies across, a number of different categories that we can use to describe the password string.

# Background Research
We are so often told that we should use passwords that are not common words. But let's say my password is 'klo$0', instead of what I would usually use (not) 'y@tin'.

Both of these passwords are 5 characters each. The idea that a computer program can probably guess them in an insignificant amount of time is worrying. Let's try it ourselves and see how long it would take a computer to find these two passwords:

In [1]:
import time
import string
import random

# all the characters that can be used in a password
all_chars = string.ascii_uppercase + string.ascii_lowercase + string.punctuation + string.digits
slen = len(all_chars)
# calculating time for random 5 character string:
start_rand = time.time()
check = ''.join(random.choice(all_chars) for _ in range(5))
end_rand = time.time()
elapsed = end_rand - start_rand
print("It took {0} seconds to guess a random 5 character string from a range of characters {1} characters long".format(elapsed, slen))

It took 0.00010609626770019531 seconds to guess a random 5 character string from a range of characters 94 characters long
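
Note that the cell above times how long it takes to build one random candidate string; it does not run a full search. As a rough illustration of what an exhaustive search looks like, here is a minimal sketch. The `brute_force` helper, the 3 character target and the lowercase-only alphabet are all assumptions made here to keep the run short; they are not part of the original analysis.

```python
import itertools
import string
import time

def brute_force(target, alphabet):
    """Try every string of len(target) over alphabet, in order.

    Returns the number of guesses made, or None if the target
    contains a character that is not in the alphabet.
    """
    candidates = itertools.product(alphabet, repeat=len(target))
    for attempts, candidate in enumerate(candidates, start=1):
        if ''.join(candidate) == target:
            return attempts
    return None

# assumption: a 26 character alphabet, so at most 26 ** 3 = 17576 guesses
small_alphabet = string.ascii_lowercase

start = time.perf_counter()
attempts = brute_force('klo', small_alphabet)
elapsed_search = time.perf_counter() - start
print("Found after {0} guesses in {1:.4f} seconds".format(attempts, elapsed_search))
```

The same loop over the full 94 character set and 5 positions would need up to 94 ** 5 guesses, which is why the notebook estimates the total time instead of running the search.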


We know that there are 5 positions to fill, and we also know that there are 94 characters to choose from for each position. So there are 94^5 possible 5 character strings. As such, we can multiply the number of combinations by the amount of time it takes to compute one combination, and find out how much time (in years) it would take at most to guess a random password of 5 characters.

In [2]:
pos = 94 ** 5
total = pos * elapsed # total time taken
days = total / 86400 # seconds to days conversion
years = days / 365 # days to years conversion
print("It will take at most {0} years to guess your 5 character password!".format(years))

It will take at most 0.024690663884703364 years to guess your 5 character password!


So if I don't change my password every 9 or so days (0.0247 years), it could potentially be hacked. Now obviously my password is not that short, so let's observe what happens when we change the number of characters in the password from 5 to 12.

In [3]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

# define function to estimate how long it would take to guess a password of n characters
pass_speeds = [] # parallel lists keeping track of speeds and corresponding number of chars
num_chars = []
def guess(n):
    check = ''.join(random.choice(all_chars) for _ in range(n))
    pos = 94 ** n
    total = pos * elapsed # total time taken
    days = total / 86400 # seconds to days conversion
    years = days / 365 # days to years conversion
    num_chars.append(n)
    pass_speeds.append(years)

# call function for lengths 5 to 12 to populate the two lists with values
for i in range(5,13):
    guess(i)

# create data frame with speeds of corresponding password lengths
speeds = pd.DataFrame({"speed (years)": pass_speeds, "length of password (chars)": num_chars})
print(speeds)

   length of password (chars)  speed (years)
0                           5   2.469066e-02
1                           6   2.320922e+00
2                           7   2.181667e+02
3                           8   2.050767e+04
4                           9   1.927721e+06
5                          10   1.812058e+08
6                          11   1.703334e+10
7                          12   1.601134e+12


Look at that: the longer your password is, the longer it will take to crack it using a computer. At this point it is probably right to think that it might be a bit overkill to make a password of anything near 9 characters. But remember that this work can be split across more than one computer working together, and those computers can probably operate at much, much higher speeds than mine.
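
To get a feel for how much faster hardware and several machines change the picture, the estimate can be parameterised by guessing speed. This is only a sketch: the `years_to_exhaust` helper is hypothetical, and the figures of one billion guesses per second and 100 machines are illustrative assumptions, not measurements.

```python
SECONDS_PER_YEAR = 86400 * 365  # same conversion as in the cells above

def years_to_exhaust(length, guesses_per_second, machines=1, alphabet_size=94):
    """Worst-case years to try every password of the given length."""
    total_guesses = alphabet_size ** length
    seconds = total_guesses / (guesses_per_second * machines)
    return seconds / SECONDS_PER_YEAR

# assumption: 1e9 guesses per second per machine, 100 machines sharing the work
for n in (8, 10, 12):
    print(n, years_to_exhaust(n, guesses_per_second=1e9, machines=100))
```

Passing `guesses_per_second=1 / elapsed` with `machines=1` should give the same kind of estimate as the table above, since that table assumes one guess takes `elapsed` seconds.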

*transition to analysis of actual dataset, after stating hypothesis*

In [4]:
# sep=r'\s+' splits on whitespace; it replaces delim_whitespace=True, which is deprecated in recent pandas
p = pd.read_csv('8-more-passwords.txt', sep=r'\s+', header=None)
p.columns = ['Passwords']
p.head()

    Passwords
0    Ainslie1
1   146Dudley
2    Amanda94
3    Ambrose1
4  Yorkshire1


In [5]:
# add strings categorizing the different types of chars
rHand = "6^7&8*9(0)-_=+yYuUiIoOpP[{]}\\|hHjJkKlL;:'nNmM,<.>/?" # chars typed by right hand (backslash escaped explicitly)
lHand = "1!2@3#4$5%qwertasdfgzxcvbQWERTASDFGZXCVB" # chars typed by left hand
punc_in = string.punctuation # special characters (punctuation)
nums_in = "1234567890" # all the numbers
vows_in = "aeiouAEIOU" # all the vowels

After extracting the passwords from the text file, it is easy to define sets of characters that describe different traits of a string. We can then add a column for each of these categories by looping through the passwords and counting how many of each password's characters fall into each category.

In [6]:
# dicts for each category of char
keys = np.arange(len(p)) # one key per password (61682 rows in this dataset)
right = dict(zip(keys, [0] * len(keys)))
left = dict(zip(keys, [0] * len(keys)))
punctuations = dict(zip(keys, [0] * len(keys)))
numbers = dict(zip(keys, [0] * len(keys)))
vowels = dict(zip(keys, [0] * len(keys)))
length = dict(zip(keys, [0] * len(keys)))
upper = dict(zip(keys, [0] * len(keys)))
lower = dict(zip(keys, [0] * len(keys)))
sent = 0 # running index of the current password, used as the key in each dict
for i in p.loc[:, "Passwords"]:
    length[sent] = len(i)
    for char in i:
        if char in rHand:
            right[sent] += 1
        if char in lHand:
            left[sent] += 1
        if char in punc_in:
            punctuations[sent] += 1    
        if char in nums_in:
            numbers[sent] += 1 
        if char in vows_in:
            vowels[sent] += 1 
        if char.islower():
            lower[sent] += 1
        if char.isupper():
            upper[sent] += 1
    sent += 1

In [7]:
p["Right Hand"] = pd.Series(right)
p["Left Hand"] = pd.Series(left)
p["Puncuations"] = pd.Series(punctuations)
p["Numbers"] = pd.Series(numbers)
p["Vowels"] = pd.Series(vowels)
p["Length"] = pd.Series(length)
p["Uppercase"] = pd.Series(upper)
p["Lowercase"] = pd.Series(lower)
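
The counting loop and column assignments above could also be written more compactly. The sketch below is one possible alternative, not the notebook's method: the `CATEGORIES` mapping and the `categorize` helper are hypothetical names, and the character sets are copied from the earlier cell so that the block runs on its own.

```python
import string

# Same category definitions as the earlier cell, stored as sets for fast lookup
CATEGORIES = {
    "Right Hand": set("6^7&8*9(0)-_=+yYuUiIoOpP[{]}\\|hHjJkKlL;:'nNmM,<.>/?"),
    "Left Hand": set("1!2@3#4$5%qwertasdfgzxcvbQWERTASDFGZXCVB"),
    "Punctuations": set(string.punctuation),
    "Numbers": set(string.digits),
    "Vowels": set("aeiouAEIOU"),
}

def categorize(password):
    """Count how many characters of password fall into each category."""
    counts = {name: sum(ch in chars for ch in password)
              for name, chars in CATEGORIES.items()}
    counts["Length"] = len(password)
    counts["Uppercase"] = sum(ch.isupper() for ch in password)
    counts["Lowercase"] = sum(ch.islower() for ch in password)
    return counts

print(categorize("Ainslie1"))
```

Assuming `p` is the data frame loaded earlier, the per-password dicts could then be turned into columns in one step, for example with `pd.DataFrame([categorize(pw) for pw in p["Passwords"]])`, instead of keeping eight parallel dicts and a manual index.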